
Local-VLM visual review + large-scale performance sweep #50

Merged
mvalancy merged 2 commits into develop from feat/vlm-scale-test-pipeline
Jun 14, 2026


@mvalancy


Improves the automated test pipeline to exercise realistic user experiences from multiple perspectives and at scale, and to use local GPU vision models for visual evaluation — without ever putting hostnames in the repo.

What's new

1. Large-scale graph perf sweep — npm run test:perf:scale (perf-scale project)

Seeds real graphs (default 50,200,500,1000,2000 nodes locally; small in CI) through the GraphQL API — grid positions, varied status/type/priority, a connected backbone of canonical Edge nodes, all batched — loads each at one or more quality tiers, and records window.__graphPerf:

  • load ms, settle ms (to alpha ≤ 0.02), avg/p95 tick ms, fps, dropped frames, layout drift (rmsFromSavedPx), graph-scoped query p95.
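
As an illustration of how the avg/p95 tick figures above could be derived from raw samples, here is a minimal sketch. It assumes the per-tick durations are available as a plain array of milliseconds; `TickSummary`, `percentile`, and `summarizeTicks` are hypothetical names, not taken from the repo, and the nearest-rank percentile is one reasonable choice rather than a confirmed one.

```typescript
// Hypothetical summary shape; the real window.__graphPerf fields may differ.
interface TickSummary {
  avgMs: number;
  p95Ms: number;
}

// Nearest-rank percentile: the smallest sample such that at least p% of
// all samples are less than or equal to it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) return NaN;
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(rank, 1) - 1];
}

// Reduce a run's tick durations (ms) to the two figures shown in the report.
function summarizeTicks(tickMs: number[]): TickSummary {
  const total = tickMs.reduce((sum, t) => sum + t, 0);
  return {
    avgMs: tickMs.length > 0 ? total / tickMs.length : NaN,
    p95Ms: percentile(tickMs, 95),
  };
}
```

Reporting p95 alongside the average keeps occasional slow ticks visible, which an average alone would smooth over.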

generate-perf-report.mjs → test-artifacts/scale-sweep/index.html: a metrics table + inline SVG charts of how each metric scales by size × quality, with the @perf budgets drawn for reference. Report-only (only asserts a seeded graph renders); each graph is cleaned up (edges → nodes → graph).
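
The seeding step described above (grid positions, varied status/priority, a connected backbone, batched writes) could look roughly like the sketch below. It only builds the in-memory payload; the GraphQL mutations are omitted because their shape is not shown here. All type names, field names, status values, and the `spacing` default are assumptions for illustration, and the backbone is a simple chain, which may be sparser than what the real harness seeds.

```typescript
// Hypothetical node/edge shapes; the real schema is not shown in this PR.
type Status = "todo" | "in_progress" | "done";
interface SeedNode { id: string; x: number; y: number; status: Status; priority: number; }
interface SeedEdge { source: string; target: string; }

const STATUSES: Status[] = ["todo", "in_progress", "done"];

// Lay nodes out on a roughly square grid and chain them into one backbone.
function buildSeed(count: number, spacing = 120): { nodes: SeedNode[]; edges: SeedEdge[] } {
  const cols = Math.max(1, Math.ceil(Math.sqrt(count)));
  const nodes: SeedNode[] = [];
  for (let i = 0; i < count; i++) {
    nodes.push({
      id: `n${i}`,
      x: (i % cols) * spacing,
      y: Math.floor(i / cols) * spacing,
      status: STATUSES[i % STATUSES.length], // cycle statuses for variety
      priority: i % 5,                       // cycle priorities 0..4
    });
  }
  // Linking each node to its predecessor guarantees a connected graph.
  const edges: SeedEdge[] = [];
  for (let i = 1; i < count; i++) {
    edges.push({ source: `n${i - 1}`, target: `n${i}` });
  }
  return { nodes, edges };
}

// Split a list into fixed-size batches so each API call stays small.
function chunk<T>(items: T[], size: number): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}
```

Cleanup would then run in the reverse dependency order the description gives: edges first, then nodes, then the graph itself.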

2. Local-VLM visual review — npm run test:vlm (vlm project)

A locally-hosted vision model judges captured states (empty graph, populated desktop + mobile, and any scale-sweep frames) from four personas: visual defects, new-user clarity, accessibility, living-graph aliveness.

  • tests/helpers/vlm.ts is protocol-agnostic: auto-detects OpenAI-compatible (/v1/chat/completions) vs Ollama-native (/api/chat) per endpoint, round-robins across all configured GPUs, bounded concurrency, lenient JSON-verdict parsing.
  • generate-vlm-report.mjs → test-artifacts/vlm/index.html: screenshot + per-persona verdict cards.
  • Report-only: asserts the model answered, not its subjective verdict. Skips entirely when VLM_ENDPOINTS is unset, so CI (which can't reach local GPUs) stays green.
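
Two of the client behaviours listed above, round-robin endpoint selection and lenient JSON-verdict parsing, can be sketched as follows. This is not the code in tests/helpers/vlm.ts; the `Verdict` fields (`pass`, `notes`) and both function names are assumptions chosen for illustration.

```typescript
// Hypothetical verdict schema; the real one may carry more fields.
interface Verdict {
  pass: boolean;
  notes: string;
}

// Returns a picker that cycles through the configured endpoints in order.
function makeRoundRobin<T>(items: T[]): () => T {
  if (items.length === 0) throw new Error("no endpoints configured");
  let next = 0;
  return () => items[next++ % items.length];
}

// Models often wrap JSON in prose or code fences, so take the outermost
// {...} span and tolerate missing or loosely typed fields.
function parseVerdict(raw: string): Verdict | null {
  const start = raw.indexOf("{");
  const end = raw.lastIndexOf("}");
  if (start === -1 || end <= start) return null;
  try {
    const obj = JSON.parse(raw.slice(start, end + 1));
    if (typeof obj !== "object" || obj === null) return null;
    return {
      pass: obj.pass === true || obj.pass === "true",
      notes: typeof obj.notes === "string" ? obj.notes : "",
    };
  } catch {
    return null; // not parseable even leniently
  }
}
```

Returning `null` instead of throwing fits the report-only stance: a suite can assert that an answer arrived and record an unparseable one without failing on its content.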

3. Hostname privacy

  • Real endpoints live only in .env.test.local (gitignored), auto-loaded by tests/helpers/testEnv.ts.
  • Committed .env.test.example documents the variable names with placeholder hosts (http://<gpu-host>:<port>). No hostnames / IPs / keys anywhere in the repo or docs.
VLM_ENDPOINTS=http://<host-a>:<port>,http://<host-b>:<port>,http://<host-c>:<port>
VLM_MODEL=<vision-model-tag>
SCALE_SWEEP_SIZES=50,200,500,1000,2000
SCALE_SWEEP_QUALITIES=HIGH,ULTRA
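
For illustration, the variables above might be parsed along these lines. This is a sketch, not the code in tests/helpers/testEnv.ts: `SweepConfig`, `splitList`, `readConfig`, and `vlmEnabled` are hypothetical names. The size fallback mirrors the local default stated earlier; no default is assumed for qualities or the model.

```typescript
// Hypothetical parsed-config shape.
interface SweepConfig {
  endpoints: string[];
  model: string;
  sizes: number[];
  qualities: string[];
}

// Split a comma-separated value, trimming blanks and dropping empty entries.
function splitList(value: string | undefined): string[] {
  return (value ?? "")
    .split(",")
    .map((s) => s.trim())
    .filter((s) => s.length > 0);
}

function readConfig(env: Record<string, string | undefined>): SweepConfig {
  const sizes = splitList(env["SCALE_SWEEP_SIZES"])
    .map(Number)
    .filter((n) => Number.isInteger(n) && n > 0); // ignore malformed sizes
  return {
    endpoints: splitList(env["VLM_ENDPOINTS"]),
    model: env["VLM_MODEL"] ?? "",
    sizes: sizes.length > 0 ? sizes : [50, 200, 500, 1000, 2000],
    qualities: splitList(env["SCALE_SWEEP_QUALITIES"]),
  };
}

// The VLM suite should skip entirely when no endpoint is configured.
function vlmEnabled(cfg: SweepConfig): boolean {
  return cfg.endpoints.length > 0;
}
```

In a Node test process this would be called as `readConfig(process.env)`; taking the environment as a parameter keeps the parser easy to exercise without touching real settings.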

Safety / wiring

  • Dedicated Playwright projects keep both suites out of the fast smoke/perf gates; no CI job invokes them (verified). The default project excludes visual-vlm; the perf budget gate excludes scale-sweep.
  • CI lint/typecheck is per-package (turbo), so root tests/ isn't gated by it; the specs are validated by Playwright at runtime.

Verification (local)

  • Scale harness against the live dev stack: 30n/HIGH → rendered 30n/42e, load=444ms, settle=5.6s, tick=0.88ms, fps=51.7, query p95=66ms; JSON + report generated; graph cleaned up.
  • VLM client against a mock OpenAI-compatible server: protocol auto-detected, 12 persona evaluations across captured states (base64 image upload + JSON parse confirmed), report generated.
  • VLM suite skips cleanly with no config.

See docs/testing/local-vlm-and-scale.md. To gate on these, register a self-hosted runner that can reach the endpoints (instructions in the doc).

🤖 Generated with Claude Code

…line

Two new report-only, opt-in suites that exercise realistic user experiences
from multiple perspectives and at scale, plus the plumbing to run them against
local GPU vision models without leaking hostnames into the repo.

Large-scale perf sweep (tests/perf/scale-sweep.spec.ts, `perf-scale` project,
`npm run test:perf:scale`):
- Seeds real graphs (50→2000+ nodes) through the GraphQL API with grid
  positions, varied status/type/priority, and a connected edge backbone
  (canonical Edge nodes), batched.
- Loads each at one or more quality tiers and records window.__graphPerf:
  load ms, settle ms (alpha<=0.02), avg/p95 tick, fps, dropped frames, layout
  drift, plus graph-scoped query p95.
- generate-perf-report.mjs renders a table + inline SVG charts of how each
  metric scales (budgets drawn for reference). Report-only; only asserts a
  seeded graph renders. Cleans up each graph (edges→nodes→graph).

Local VLM visual review (tests/e2e/visual-vlm.spec.ts, `vlm` project,
`npm run test:vlm`):
- Protocol-agnostic client (tests/helpers/vlm.ts): auto-detects OpenAI-compatible
  vs Ollama-native per endpoint, round-robins across all configured GPUs,
  bounded concurrency, lenient JSON-verdict parsing.
- Judges captured states (empty graph, populated desktop+mobile, and any
  scale-sweep frames) from four personas: visual defects, new-user clarity,
  accessibility, living-graph aliveness.
- generate-vlm-report.mjs renders a screenshot+verdict gallery. Report-only;
  asserts the model answered, not its subjective verdict. Skips entirely when
  VLM_ENDPOINTS is unset, so CI (which can't reach local GPUs) stays green.

Hostname privacy: real endpoints live ONLY in `.env.test.local` (gitignored),
auto-loaded by tests/helpers/testEnv.ts. The committed `.env.test.example`
documents variable NAMES with placeholder hosts only. No hostnames/IPs/keys
anywhere in the repo or docs.

Wiring: dedicated Playwright projects keep these out of the fast smoke/perf
gates; no CI job invokes them. Validated end-to-end locally (scale harness
against the dev stack; VLM client against a mock OpenAI-compatible server).
Docs: docs/testing/local-vlm-and-scale.md + SYSTEMS.md gates table.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@github-actions


🧪 Comprehensive Test Suite

  • Unit suites (Node 18.x & 20.x) — core, web, server, mcp-server: ✅ passed
  • Installer & deploy config: ✅ passed

Full-stack smoke gate runs in the CI workflow.

Iterated after running both suites against the actual GPU boxes
(gb10-02 = Qwen2.5-VL-32B on GPU, rtx4090 = Qwen2.5-VL-7B on CPU).

VLM client:
- Per-endpoint model resolution (VLM_MODEL=auto): boxes serving DIFFERENT
  models now work under one config; each verdict records which
  model/endpoint judged it + latency, shown in the report.
- max_tokens 700->400 to keep calls fast on CPU servers.
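
The 400-token cap lands in a different field depending on the protocol: `max_tokens` for OpenAI-compatible servers and `options.num_predict` for Ollama-native ones. A sketch of the two request bodies follows, assuming PNG screenshots and a single user message; `buildChatRequest` and `ChatRequest` are hypothetical names, and the real client may structure its prompts differently.

```typescript
type Protocol = "openai" | "ollama";

interface ChatRequest {
  path: string;
  body: Record<string, unknown>;
}

// Build the path and JSON body for one image + prompt evaluation.
function buildChatRequest(
  protocol: Protocol,
  model: string,
  prompt: string,
  imageBase64: string,
  maxTokens = 400,
): ChatRequest {
  if (protocol === "openai") {
    // OpenAI-compatible: the image travels as a data URL content part.
    return {
      path: "/v1/chat/completions",
      body: {
        model,
        max_tokens: maxTokens,
        messages: [
          {
            role: "user",
            content: [
              { type: "text", text: prompt },
              { type: "image_url", image_url: { url: `data:image/png;base64,${imageBase64}` } },
            ],
          },
        ],
      },
    };
  }
  // Ollama-native: raw base64 strings go in the message's `images` array.
  return {
    path: "/api/chat",
    body: {
      model,
      stream: false,
      options: { num_predict: maxTokens },
      messages: [{ role: "user", content: prompt, images: [imageBase64] }],
    },
  };
}
```

Because the model is passed per call, each endpoint can be given whatever model it resolved to, which is what lets boxes serving different models share one config.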

Scale sweep — make the metrics trustworthy:
- Add interactionFps: real rendered frames/sec (requestAnimationFrame)
  measured while dragging the graph. Reliable at every size with no app
  instrumentation; it's the headline scaling signal (e.g. 200n=16.6fps,
  500n=10.2fps observed).
- Seed a FRESH graph per (size, quality): the -1/settle=NONE gaps in v1 came
  from the 2nd quality tier loading the 1st run's already-settled, pinned
  positions, so the sim never ticked and window.__graphPerf never published.
- Sustained node drag keeps the sim hot so __graphPerf (best-effort tick/
  drift/settle bonus) publishes when it can; take the worst under-load tick.
- visual-vlm: cap the slow 1920px scale-frame ingestion to the single
  largest; raise timeout to 900s (the all-frames version timed out at 10m).
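
The interactionFps idea above (count rendered frames via requestAnimationFrame while a drag is in progress) can be sketched as below. This is an assumption-laden illustration, not the spec's code: `fpsFromTimestamps`, `sampleFrames`, and the `Raf` type are hypothetical names, and the frame source is injected so the arithmetic can be checked outside a browser.

```typescript
// Frames per second from a series of frame timestamps in milliseconds:
// N timestamps span N - 1 frame intervals.
function fpsFromTimestamps(timestamps: number[]): number {
  if (timestamps.length < 2) return 0;
  const elapsedMs = timestamps[timestamps.length - 1] - timestamps[0];
  return elapsedMs > 0 ? ((timestamps.length - 1) * 1000) / elapsedMs : 0;
}

// Minimal shape of requestAnimationFrame, so a fake can stand in for it.
type Raf = (callback: (now: number) => void) => unknown;

// Collect frame timestamps until at least durationMs has elapsed since
// the first frame, then resolve with everything recorded.
function sampleFrames(raf: Raf, durationMs: number): Promise<number[]> {
  return new Promise((resolve) => {
    const stamps: number[] = [];
    const onFrame = (now: number): void => {
      stamps.push(now);
      if (now - stamps[0] >= durationMs) resolve(stamps);
      else raf(onFrame);
    };
    raf(onFrame);
  });
}
```

In a page context this might run as `fpsFromTimestamps(await sampleFrames(requestAnimationFrame, 2000))` while the test drives the drag. The measurement needs no app instrumentation, which is why it stays reliable at every graph size.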

Reports: perf report leads with Interaction FPS; VLM cards show
model@endpoint·latency. Docs updated (reliable vs best-effort metrics).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@github-actions


🧪 Comprehensive Test Suite

  • Unit suites (Node 18.x & 20.x) — core, web, server, mcp-server: ✅ passed
  • Installer & deploy config: ✅ passed

Full-stack smoke gate runs in the CI workflow.

@mvalancy mvalancy merged commit 0e2c412 into develop Jun 14, 2026
16 checks passed
@mvalancy mvalancy deleted the feat/vlm-scale-test-pipeline branch June 14, 2026 02:16