Local-VLM visual review + large-scale performance sweep #50
Merged
…line

Two new report-only, opt-in suites that exercise realistic user experiences from multiple perspectives and at scale, plus the plumbing to run them against local GPU vision models without leaking hostnames into the repo.

Large-scale perf sweep (tests/perf/scale-sweep.spec.ts, `perf-scale` project, `npm run test:perf:scale`):
- Seeds real graphs (50→2000+ nodes) through the GraphQL API with grid positions, varied status/type/priority, and a connected edge backbone (canonical Edge nodes), batched.
- Loads each at one or more quality tiers and records window.__graphPerf: load ms, settle ms (alpha<=0.02), avg/p95 tick, fps, dropped frames, layout drift, plus graph-scoped query p95.
- generate-perf-report.mjs renders a table + inline SVG charts of how each metric scales (budgets drawn for reference). Report-only; only asserts a seeded graph renders. Cleans up each graph (edges→nodes→graph).

Local VLM visual review (tests/e2e/visual-vlm.spec.ts, `vlm` project, `npm run test:vlm`):
- Protocol-agnostic client (tests/helpers/vlm.ts): auto-detects OpenAI-compatible vs Ollama-native per endpoint, round-robins across all configured GPUs, bounded concurrency, lenient JSON-verdict parsing.
- Judges captured states (empty graph, populated desktop+mobile, and any scale-sweep frames) from four personas: visual defects, new-user clarity, accessibility, living-graph aliveness.
- generate-vlm-report.mjs renders a screenshot+verdict gallery. Report-only; asserts the model answered, not its subjective verdict. Skips entirely when VLM_ENDPOINTS is unset, so CI (which can't reach local GPUs) stays green.

Hostname privacy: real endpoints live ONLY in `.env.test.local` (gitignored), auto-loaded by tests/helpers/testEnv.ts. The committed `.env.test.example` documents variable NAMES with placeholder hosts only. No hostnames/IPs/keys anywhere in the repo or docs.

Wiring: dedicated Playwright projects keep these out of the fast smoke/perf gates; no CI job invokes them.

Validated end-to-end locally (scale harness against the dev stack; VLM client against a mock OpenAI-compatible server).
Docs: docs/testing/local-vlm-and-scale.md + SYSTEMS.md gates table.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
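The seeding step described in this commit (grid positions, varied status/type/priority, a connected edge backbone, batched writes) could be sketched roughly as below. This is an illustration only: the `SeedNode`/`SeedEdge` shapes, the status and type values, the simple chain backbone, and the `buildSeedGraph`/`chunk` helper names are all assumptions, not the code in `tests/perf/scale-sweep.spec.ts`.

```typescript
// Hypothetical shapes; the real GraphQL input types are not shown in this PR.
interface SeedNode {
  id: string;
  x: number;
  y: number;
  status: string;
  type: string;
  priority: number;
}

interface SeedEdge {
  source: string;
  target: string;
}

// Illustrative value sets, not the app's actual enums.
const STATUSES = ["todo", "in_progress", "done"];
const TYPES = ["task", "milestone", "idea"];

// Lay nodes out on a near-square grid and vary their attributes by index.
function buildSeedGraph(
  count: number,
  spacing = 120,
): { nodes: SeedNode[]; edges: SeedEdge[] } {
  const cols = Math.max(1, Math.ceil(Math.sqrt(count)));
  const nodes: SeedNode[] = [];
  for (let i = 0; i < count; i++) {
    nodes.push({
      id: `seed-${i}`,
      x: (i % cols) * spacing,
      y: Math.floor(i / cols) * spacing,
      status: STATUSES[i % STATUSES.length],
      type: TYPES[i % TYPES.length],
      priority: i % 5,
    });
  }
  // Assumed backbone: a chain i-1 -> i, which keeps the graph in one component.
  const edges: SeedEdge[] = [];
  for (let i = 1; i < count; i++) {
    edges.push({ source: nodes[i - 1].id, target: nodes[i].id });
  }
  return { nodes, edges };
}

// Split a list into fixed-size batches so each mutation stays small.
function chunk<T>(items: T[], size: number): T[][] {
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    out.push(items.slice(i, i + size));
  }
  return out;
}
```

Each batch from `chunk(nodes, 100)` would then go out in one GraphQL mutation, with edges sent after the nodes they reference. The real harness evidently adds more than a bare chain (the local run reports 42 edges for 30 nodes), so treat the backbone here as the minimum needed for connectivity.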
🧪 Comprehensive Test Suite
Full-stack smoke gate runs in the CI workflow.
Iterated after running both suites against the actual GPU boxes (gb10-02 = Qwen2.5-VL-32B on GPU, rtx4090 = Qwen2.5-VL-7B on CPU).

VLM client:
- Per-endpoint model resolution (VLM_MODEL=auto): boxes serving DIFFERENT models now work under one config; each verdict records which model/endpoint judged it + latency, shown in the report.
- max_tokens 700->400 to keep calls fast on CPU servers.

Scale sweep (make the metrics trustworthy):
- Add interactionFps: real rendered frames/sec (requestAnimationFrame) measured while dragging the graph. Reliable at every size with no app instrumentation; it's the headline scaling signal (e.g. 200n=16.6fps, 500n=10.2fps observed).
- Seed a FRESH graph per (size, quality): the v1 -1/settle=NONE gaps were the 2nd quality loading the 1st run's already-settled pinned positions, so the sim never ticked and window.__graphPerf never published.
- Sustained node drag keeps the sim hot so __graphPerf (best-effort tick/drift/settle bonus) publishes when it can; take the worst under-load tick.
- visual-vlm: cap the slow 1920px scale-frame ingestion to the single largest; raise timeout to 900s (the all-frames version timed out at 10m).

Reports: perf report leads with Interaction FPS; VLM cards show model@endpoint·latency.
Docs updated (reliable vs best-effort metrics).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
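The interactionFps idea in this commit (counting frames actually delivered by requestAnimationFrame while a drag is in progress) might look something like the sketch below. The `Raf` type and the `collectFrameTimes` and `fpsFromFrameTimes` names are hypothetical; the scheduler is passed in as a parameter only so that the arithmetic can be exercised outside a browser.

```typescript
// A requestAnimationFrame-like scheduler; in a browser, pass
// (cb) => requestAnimationFrame(cb). The callback receives a timestamp in ms.
type Raf = (cb: (timestampMs: number) => void) => unknown;

// Record the timestamp of every rendered frame until durationMs has elapsed
// since the first recorded frame.
function collectFrameTimes(raf: Raf, durationMs: number): Promise<number[]> {
  return new Promise((resolve) => {
    const times: number[] = [];
    const step = (t: number): void => {
      times.push(t);
      if (t - times[0] >= durationMs) {
        resolve(times);
      } else {
        raf(step);
      }
    };
    raf(step);
  });
}

// Frames per second = number of frame intervals / elapsed seconds.
function fpsFromFrameTimes(times: number[]): number {
  if (times.length < 2) return 0;
  const elapsedMs = times[times.length - 1] - times[0];
  return elapsedMs > 0 ? ((times.length - 1) * 1000) / elapsedMs : 0;
}
```

In a Playwright spec, the collector would presumably run inside `page.evaluate` while the test performs the sustained drag, so the count reflects frames rendered under load and does not depend on the simulation publishing `window.__graphPerf`.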
Improves the automated test pipeline to exercise realistic user experiences from multiple perspectives and at scale, and to use local GPU vision models for visual evaluation — without ever putting hostnames in the repo.
What's new
1. Large-scale graph perf sweep: `npm run test:perf:scale` (`perf-scale` project)
Seeds real graphs (default `50,200,500,1000,2000` nodes locally; small in CI) through the GraphQL API, with grid positions, varied status/type/priority, and a connected backbone of canonical Edge nodes, all batched. Loads each at one or more quality tiers and records `window.__graphPerf`:
- load ms, settle ms (`alpha ≤ 0.02`), avg/p95 tick ms, fps, dropped frames, layout drift (`rmsFromSavedPx`), graph-scoped query p95.
- `generate-perf-report.mjs` → `test-artifacts/scale-sweep/index.html`: a metrics table + inline SVG charts of how each metric scales by size × quality, with the `@perf` budgets drawn for reference.
- Report-only (only asserts a seeded graph renders); each graph is cleaned up (edges → nodes → graph).
2. Local-VLM visual review: `npm run test:vlm` (`vlm` project)
A locally hosted vision model judges captured states (empty graph, populated desktop + mobile, and any scale-sweep frames) from four personas: visual defects, new-user clarity, accessibility, living-graph aliveness.
- `tests/helpers/vlm.ts` is protocol-agnostic: it auto-detects OpenAI-compatible (`/v1/chat/completions`) vs Ollama-native (`/api/chat`) per endpoint, round-robins across all configured GPUs, bounds concurrency, and parses JSON verdicts leniently.
- `generate-vlm-report.mjs` → `test-artifacts/vlm/index.html`: screenshot + per-persona verdict cards.
- Skips entirely when `VLM_ENDPOINTS` is unset, so CI (which can't reach local GPUs) stays green.
3. Hostname privacy
- Real endpoints live only in `.env.test.local` (gitignored), auto-loaded by `tests/helpers/testEnv.ts`.
- `.env.test.example` documents the variable names with placeholder hosts (`http://<gpu-host>:<port>`). No hostnames / IPs / keys anywhere in the repo or docs.
Safety / wiring
- … `visual-vlm`; the `perf` budget gate excludes `scale-sweep`.
- … `tests/` isn't gated by it; the specs are validated by Playwright at runtime.
Verification (local)
- … `30n/HIGH` → rendered 30n/42e, load=444ms, settle=5.6s, tick=0.88ms, fps=51.7, query p95=66ms; JSON + report generated; graph cleaned up.
See `docs/testing/local-vlm-and-scale.md`. To gate on these, register a self-hosted runner that can reach the endpoints (instructions in the doc).
🤖 Generated with Claude Code
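For readers who want a feel for the VLM client behaviour described above (endpoint list from `VLM_ENDPOINTS`, the two chat paths, round-robin dispatch, lenient verdict parsing), here is a minimal sketch. It is not the contents of `tests/helpers/vlm.ts`: the `Verdict` shape, the comma-separated format of `VLM_ENDPOINTS`, and every function name are assumptions made for illustration. Only the two URL paths come from the description.

```typescript
// Hypothetical verdict shape; the PR does not show the real schema.
interface Verdict {
  pass: boolean;
  issues: string[];
}

type Protocol = "openai" | "ollama";

// Assumption: VLM_ENDPOINTS is a comma-separated list of base URLs.
// An unset or empty value yields [], which a spec could use to skip itself.
function parseEndpoints(value: string | undefined): string[] {
  return (value ?? "")
    .split(",")
    .map((s) => s.trim())
    .filter((s) => s.length > 0);
}

// Build the chat URL for a detected protocol (paths as named in the PR).
function chatUrl(base: string, protocol: Protocol): string {
  const root = base.replace(/\/+$/, "");
  return protocol === "openai"
    ? `${root}/v1/chat/completions`
    : `${root}/api/chat`;
}

// Hand out endpoints in rotation so the load spreads across all GPUs.
function roundRobin<T>(items: readonly T[]): () => T {
  if (items.length === 0) throw new Error("no endpoints configured");
  let next = 0;
  return () => items[next++ % items.length];
}

// Lenient parsing: models often wrap JSON in prose or code fences, so take
// the outermost {...} span and tolerate missing or oddly typed fields.
function parseVerdict(raw: string): Verdict | null {
  const start = raw.indexOf("{");
  const end = raw.lastIndexOf("}");
  if (start === -1 || end <= start) return null;
  try {
    const obj = JSON.parse(raw.slice(start, end + 1)) as Record<string, unknown>;
    const issues = Array.isArray(obj.issues) ? obj.issues.map(String) : [];
    return { pass: obj.pass === true || obj.pass === "true", issues };
  } catch {
    return null;
  }
}
```

A spec could call `parseEndpoints(process.env.VLM_ENDPOINTS)` and skip when the result is empty, which matches the "stays green in CI" behaviour. Because the suite asserts only that the model answered, a `null` from `parseVerdict` would be the failure case, not a `pass: false` verdict.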