Local-VLM visual review + large-scale performance sweep by mvalancy · Pull Request #50 · GraphDone/GraphDone-Core

mvalancy · 2026-06-14T00:56:57Z

Improves the automated test pipeline to exercise realistic user experiences from multiple perspectives and at scale, and to use local GPU vision models for visual evaluation — without ever putting hostnames in the repo.

What's new

1. Large-scale graph perf sweep — `npm run test:perf:scale` (`perf-scale` project)

Seeds real graphs (default 50,200,500,1000,2000 nodes locally; small in CI) through the GraphQL API — grid positions, varied status/type/priority, a connected backbone of canonical Edge nodes, all batched — loads each at one or more quality tiers, and records window.__graphPerf:

load ms, settle ms (to alpha ≤ 0.02), avg/p95 tick ms, fps, dropped frames, layout drift (rmsFromSavedPx), graph-scoped query p95.
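As an illustration of how the avg and p95 tick figures could be derived from raw per-tick timings, here is a minimal sketch. The `percentile` and `summarizeTicks` helpers are hypothetical names, and nearest-rank is an assumed percentile method; the PR does not state how window.__graphPerf computes these values.

```typescript
// Nearest-rank percentile over a list of samples.
// Assumption: the real harness may use a different percentile definition.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) return NaN;
  const sorted = [...samples].sort((a, b) => a - b);
  // 1-based rank of the smallest sample that covers p percent of the data.
  const rank = Math.ceil((p * sorted.length) / 100);
  const index = Math.min(sorted.length - 1, Math.max(0, rank - 1));
  return sorted[index];
}

// Collapse raw tick durations (in ms) into the two headline numbers.
function summarizeTicks(tickMs: number[]): { avg: number; p95: number } {
  const avg =
    tickMs.length > 0
      ? tickMs.reduce((sum, value) => sum + value, 0) / tickMs.length
      : NaN;
  return { avg, p95: percentile(tickMs, 95) };
}
```

The same `percentile` helper would apply to the graph-scoped query p95, given a list of query durations.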

generate-perf-report.mjs → test-artifacts/scale-sweep/index.html: a metrics table + inline SVG charts of how each metric scales by size × quality, with the @perf budgets drawn for reference. Report-only (only asserts a seeded graph renders); each graph is cleaned up (edges → nodes → graph).
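The seeding step described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the `SeedNode` and `SeedEdge` shapes, the status and type values, and the simple chain backbone are all hypothetical, and the real spec sends its own GraphQL input types (the local run reported 42 edges for 30 nodes, so the actual backbone is denser than a chain).

```typescript
// Hypothetical shapes; the repo's GraphQL input types may differ.
interface SeedNode {
  id: string;
  x: number;
  y: number;
  status: string;
  type: string;
  priority: number;
}
interface SeedEdge {
  source: string;
  target: string;
}

const STATUSES = ["TODO", "IN_PROGRESS", "DONE"]; // assumed values
const TYPES = ["TASK", "MILESTONE", "IDEA"]; // assumed values

// Lay nodes out on a square-ish grid and vary their attributes deterministically.
function buildSeedGraph(size: number, spacing = 120): { nodes: SeedNode[]; edges: SeedEdge[] } {
  const cols = Math.max(1, Math.ceil(Math.sqrt(size)));
  const nodes: SeedNode[] = [];
  const edges: SeedEdge[] = [];
  for (let i = 0; i < size; i++) {
    nodes.push({
      id: `seed-${i}`,
      x: (i % cols) * spacing,
      y: Math.floor(i / cols) * spacing,
      status: STATUSES[i % STATUSES.length],
      type: TYPES[i % TYPES.length],
      priority: (i % 5) / 4, // cycles through 0, 0.25, ..., 1
    });
    // Chain each node to its predecessor so the graph is connected.
    if (i > 0) edges.push({ source: `seed-${i - 1}`, target: `seed-${i}` });
  }
  return { nodes, edges };
}

// Split a list into fixed-size batches, one batch per mutation call.
function chunk<T>(items: T[], batchSize: number): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += batchSize) {
    batches.push(items.slice(i, i + batchSize));
  }
  return batches;
}
```

Cleanup would then walk the same data in the reverse dependency order the PR describes: edges, then nodes, then the graph itself.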

2. Local-VLM visual review — `npm run test:vlm` (`vlm` project)

A locally-hosted vision model judges captured states (empty graph, populated desktop + mobile, and any scale-sweep frames) from four personas: visual defects, new-user clarity, accessibility, living-graph aliveness.

tests/helpers/vlm.ts is protocol-agnostic: auto-detects OpenAI-compatible (/v1/chat/completions) vs Ollama-native (/api/chat) per endpoint, round-robins across all configured GPUs, bounded concurrency, lenient JSON-verdict parsing.
generate-vlm-report.mjs → test-artifacts/vlm/index.html: screenshot + per-persona verdict cards.
Report-only: asserts the model answered, not its subjective verdict. Skips entirely when VLM_ENDPOINTS is unset, so CI (which can't reach local GPUs) stays green.
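To make the two wire formats and the lenient verdict parsing more concrete, here is a hedged sketch. The function names and the `Verdict` shape are hypothetical and are not taken from tests/helpers/vlm.ts; only the two endpoint paths come from the description above, and the request bodies follow the publicly documented OpenAI-style and Ollama chat formats.

```typescript
type Protocol = "openai" | "ollama";

interface ChatRequest {
  path: string;
  body: Record<string, unknown>;
}

// Build the request for one screenshot, in whichever dialect the endpoint speaks.
function buildChatRequest(
  protocol: Protocol,
  model: string,
  prompt: string,
  pngBase64: string,
): ChatRequest {
  if (protocol === "openai") {
    // OpenAI-compatible servers take the image as a data URL content part.
    return {
      path: "/v1/chat/completions",
      body: {
        model,
        messages: [
          {
            role: "user",
            content: [
              { type: "text", text: prompt },
              { type: "image_url", image_url: { url: `data:image/png;base64,${pngBase64}` } },
            ],
          },
        ],
      },
    };
  }
  // Ollama-native chat takes raw base64 strings in an `images` array.
  return {
    path: "/api/chat",
    body: {
      model,
      stream: false,
      messages: [{ role: "user", content: prompt, images: [pngBase64] }],
    },
  };
}

// Hypothetical verdict shape; the real helper may record more fields.
interface Verdict {
  pass: boolean;
  notes: string;
}

// Lenient parsing: ignore any chatter around the first {...} span in the reply.
function parseVerdict(raw: string): Verdict | null {
  const start = raw.indexOf("{");
  const end = raw.lastIndexOf("}");
  if (start === -1 || end <= start) return null;
  try {
    const parsed = JSON.parse(raw.slice(start, end + 1)) as Record<string, unknown>;
    const notes = parsed["notes"];
    return {
      pass: parsed["pass"] === true || parsed["pass"] === "true",
      notes: typeof notes === "string" ? notes : "",
    };
  } catch {
    return null; // unparseable reply
  }
}
```

Because the suite is report-only, a `null` result here would still be distinguishable from "the model answered", which is the only thing the spec asserts.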

3. Hostname privacy

Real endpoints live only in .env.test.local (gitignored), auto-loaded by tests/helpers/testEnv.ts.
Committed .env.test.example documents the variable names with placeholder hosts (http://<gpu-host>:<port>). No hostnames / IPs / keys anywhere in the repo or docs.

VLM_ENDPOINTS=http://<host-a>:<port>,http://<host-b>:<port>,http://<host-c>:<port>
VLM_MODEL=<vision-model-tag>
SCALE_SWEEP_SIZES=50,200,500,1000,2000
SCALE_SWEEP_QUALITIES=HIGH,ULTRA
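A minimal sketch of how these variables might be read and turned into a skip decision and a round-robin endpoint picker. `readSuiteConfig` and `makeRoundRobin` are hypothetical names, not the API of tests/helpers/testEnv.ts, and the fallback to `auto` for the model is an assumption. The environment is passed in as a plain record so the sketch does not depend on Node globals.

```typescript
type Env = Record<string, string | undefined>;

// Split a comma-separated variable, tolerating spaces and trailing commas.
function splitList(value: string | undefined): string[] {
  return (value ?? "")
    .split(",")
    .map((part) => part.trim())
    .filter((part) => part.length > 0);
}

interface SuiteConfig {
  endpoints: string[];
  model: string;
  sizes: number[];
  qualities: string[];
  vlmEnabled: boolean;
}

function readSuiteConfig(env: Env): SuiteConfig {
  const endpoints = splitList(env["VLM_ENDPOINTS"]);
  const sizes = splitList(env["SCALE_SWEEP_SIZES"])
    .map(Number)
    .filter((n) => Number.isInteger(n) && n > 0);
  return {
    endpoints,
    model: env["VLM_MODEL"] ?? "auto", // assumed fallback
    // Local default sizes as listed in the PR description.
    sizes: sizes.length > 0 ? sizes : [50, 200, 500, 1000, 2000],
    // Left empty when unset; the PR does not state a default quality tier.
    qualities: splitList(env["SCALE_SWEEP_QUALITIES"]),
    // With no endpoints configured, the VLM suite should skip entirely.
    vlmEnabled: endpoints.length > 0,
  };
}

// Hand out endpoints in rotation so every configured GPU gets work.
function makeRoundRobin<T>(items: T[]): () => T {
  if (items.length === 0) throw new Error("no items to rotate through");
  let next = 0;
  return () => {
    const item = items[next % items.length];
    next += 1;
    return item;
  };
}
```

In a Playwright spec, a flag like `vlmEnabled` could feed `test.skip(condition, description)`, which is one way to get the "skips cleanly with no config" behaviour.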

Safety / wiring

Dedicated Playwright projects keep both suites out of the fast smoke/perf gates; no CI job invokes them (verified). The default project excludes visual-vlm; the perf budget gate excludes scale-sweep.
CI lint/typecheck is per-package (turbo), so root tests/ isn't gated by it; the specs are validated by Playwright at runtime.
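The project split can be pictured with the sketch below. It is written as a plain object so it runs without @playwright/test installed; in a real playwright.config.ts the `projects` array would be passed to `defineConfig`, where `testMatch` and `testIgnore` are genuine per-project options. The `default` project name and the exact patterns are assumptions, and `projectsFor` is a hypothetical helper that only approximates Playwright's file matching.

```typescript
interface ProjectSketch {
  name: string;
  testMatch?: RegExp;
  testIgnore?: RegExp[];
}

const projects: ProjectSketch[] = [
  // Fast gate: ignores both opt-in suites (assumed patterns).
  { name: "default", testIgnore: [/visual-vlm\.spec\.ts$/, /scale-sweep\.spec\.ts$/] },
  // Opt-in suites, each picked up only by its dedicated project.
  { name: "perf-scale", testMatch: /scale-sweep\.spec\.ts$/ },
  { name: "vlm", testMatch: /visual-vlm\.spec\.ts$/ },
];

// Which projects would pick up a given spec file under this sketch?
function projectsFor(file: string): string[] {
  return projects
    .filter((project) => (project.testMatch ? project.testMatch.test(file) : true))
    .filter((project) => !(project.testIgnore ?? []).some((pattern) => pattern.test(file)))
    .map((project) => project.name);
}
```

With this layout, running the default project never touches the opt-in specs, and each opt-in suite runs only when its project is named explicitly, for example via Playwright's `--project` flag.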

Verification (local)

Scale harness against the live dev stack: 30n/HIGH → rendered 30n/42e, load=444ms, settle=5.6s, tick=0.88ms, fps=51.7, query p95=66ms; JSON + report generated; graph cleaned up.
VLM client against a mock OpenAI-compatible server: protocol auto-detected, 12 persona evaluations across captured states (base64 image upload + JSON parse confirmed), report generated.
VLM suite skips cleanly with no config.

See docs/testing/local-vlm-and-scale.md. To gate on these, register a self-hosted runner that can reach the endpoints (instructions in the doc).

🤖 Generated with Claude Code

…line

Two new report-only, opt-in suites that exercise realistic user experiences from multiple perspectives and at scale, plus the plumbing to run them against local GPU vision models without leaking hostnames into the repo.

Large-scale perf sweep (tests/perf/scale-sweep.spec.ts, `perf-scale` project, `npm run test:perf:scale`):
- Seeds real graphs (50→2000+ nodes) through the GraphQL API with grid positions, varied status/type/priority, and a connected edge backbone (canonical Edge nodes), batched.
- Loads each at one or more quality tiers and records window.__graphPerf: load ms, settle ms (alpha<=0.02), avg/p95 tick, fps, dropped frames, layout drift, plus graph-scoped query p95.
- generate-perf-report.mjs renders a table + inline SVG charts of how each metric scales (budgets drawn for reference). Report-only; only asserts a seeded graph renders. Cleans up each graph (edges→nodes→graph).

Local VLM visual review (tests/e2e/visual-vlm.spec.ts, `vlm` project, `npm run test:vlm`):
- Protocol-agnostic client (tests/helpers/vlm.ts): auto-detects OpenAI-compatible vs Ollama-native per endpoint, round-robins across all configured GPUs, bounded concurrency, lenient JSON-verdict parsing.
- Judges captured states (empty graph, populated desktop+mobile, and any scale-sweep frames) from four personas: visual defects, new-user clarity, accessibility, living-graph aliveness.
- generate-vlm-report.mjs renders a screenshot+verdict gallery. Report-only; asserts the model answered, not its subjective verdict. Skips entirely when VLM_ENDPOINTS is unset, so CI (which can't reach local GPUs) stays green.

Hostname privacy: real endpoints live ONLY in `.env.test.local` (gitignored), auto-loaded by tests/helpers/testEnv.ts. The committed `.env.test.example` documents variable NAMES with placeholder hosts only. No hostnames/IPs/keys anywhere in the repo or docs.

Wiring: dedicated Playwright projects keep these out of the fast smoke/perf gates; no CI job invokes them.

Validated end-to-end locally (scale harness against the dev stack; VLM client against a mock OpenAI-compatible server).

Docs: docs/testing/local-vlm-and-scale.md + SYSTEMS.md gates table.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

github-actions · 2026-06-14T00:59:16Z

🧪 Comprehensive Test Suite

Unit suites (Node 18.x & 20.x) — core, web, server, mcp-server: ✅ passed
Installer & deploy config: ✅ passed

Full-stack smoke gate runs in the CI workflow.

Iterated after running both suites against the actual GPU boxes (gb10-02 = Qwen2.5-VL-32B on GPU, rtx4090 = Qwen2.5-VL-7B on CPU).

VLM client:
- Per-endpoint model resolution (VLM_MODEL=auto): boxes serving DIFFERENT models now work under one config; each verdict records which model/endpoint judged it + latency, shown in the report.
- max_tokens 700->400 to keep calls fast on CPU servers.

Scale sweep (make the metrics trustworthy):
- Add interactionFps: real rendered frames/sec (requestAnimationFrame) measured while dragging the graph. Reliable at every size with no app instrumentation; it's the headline scaling signal (e.g. 200n=16.6fps, 500n=10.2fps observed).
- Seed a FRESH graph per (size, quality): the v1 -1/settle=NONE gaps were the 2nd quality loading the 1st run's already-settled pinned positions, so the sim never ticked and window.__graphPerf never published.
- Sustained node drag keeps the sim hot so __graphPerf (best-effort tick/drift/settle bonus) publishes when it can; take the worst under-load tick.
- visual-vlm: cap the slow 1920px scale-frame ingestion to the single largest; raise timeout to 900s (the all-frames version timed out at 10m).

Reports: perf report leads with Interaction FPS; VLM cards show model@endpoint·latency.

Docs updated (reliable vs best-effort metrics).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
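The interactionFps idea from the commit above reduces to counting rendered frames over elapsed time. Here is a minimal sketch, assuming the frame timestamps have already been collected in the page; `framesPerSecond` is a hypothetical helper name, not the function used in the spec.

```typescript
// Convert requestAnimationFrame timestamps (in ms) into frames per second.
// N timestamps span N - 1 frame intervals.
function framesPerSecond(frameTimesMs: number[]): number {
  if (frameTimesMs.length < 2) return 0;
  const elapsedMs = frameTimesMs[frameTimesMs.length - 1] - frameTimesMs[0];
  if (elapsedMs <= 0) return 0;
  return ((frameTimesMs.length - 1) * 1000) / elapsedMs;
}

// In the browser, timestamps could be gathered while the drag is in progress, e.g.:
//   const times: number[] = [];
//   const loop = (t: number) => {
//     times.push(t);
//     if (t - times[0] < 2000) requestAnimationFrame(loop);
//   };
//   requestAnimationFrame(loop);
```

Because this relies only on the browser's frame callback, it needs no instrumentation inside the app, which matches the commit's reason for treating it as the reliable metric.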

github-actions · 2026-06-14T02:03:09Z

🧪 Comprehensive Test Suite

Unit suites (Node 18.x & 20.x) — core, web, server, mcp-server: ✅ passed
Installer & deploy config: ✅ passed

Full-stack smoke gate runs in the CI workflow.

mvalancy merged commit 0e2c412 into develop Jun 14, 2026
16 checks passed

mvalancy deleted the feat/vlm-scale-test-pipeline branch June 14, 2026 02:16

mvalancy merged 2 commits into develop from feat/vlm-scale-test-pipeline
