GraphDone · mvalancy · Jun 14, 2026
diff --git a/.env.test.example b/.env.test.example
@@ -0,0 +1,42 @@
+# GraphDone test-pipeline configuration — LOCAL ONLY.
+#
+# Copy this file to `.env.test.local` (which is gitignored) and fill in your
+# own values. NEVER put real hostnames, IPs, or keys in this committed example
+# or anywhere else in the repo — the local VLM boxes (GPU workstations) must
+# stay out of version control. The test harness auto-loads `.env.test.local`.
+#
+#   cp .env.test.example .env.test.local   # then edit .env.test.local
+
+# --- Local Vision-Language-Model (VLM) endpoints ---------------------------
+# Comma-separated base URLs of your local VLM server(s). Requests are
+# round-robined across them so visual evaluation is spread over every GPU.
+# Leave blank to skip all VLM-driven suites (they no-op cleanly in CI).
+# Example shape (use your OWN hosts in .env.test.local, never here):
+#   VLM_ENDPOINTS=http://<gpu-host-a>:<port>,http://<gpu-host-b>:<port>
+VLM_ENDPOINTS=
+
+# Model id/tag to request (e.g. a llava / qwen2-vl / llama-3.2-vision build).
+VLM_MODEL=
+
+# Optional bearer key for OpenAI-compatible servers that require one.
+VLM_API_KEY=
+
+# Wire protocol: auto (default) | openai | ollama.
+#   auto   — probe each endpoint: /v1/models => OpenAI-compatible, else Ollama.
+#   openai — POST /v1/chat/completions (vLLM, LM Studio, llama.cpp, Ollama compat)
+#   ollama — POST /api/chat with an images[] array
+VLM_PROTOCOL=auto
+
+# Max concurrent VLM requests across all endpoints (default 3).
+VLM_MAX_CONCURRENCY=3
+
+# Per-request timeout in ms — VLMs can be slow on large images (default 120000).
+VLM_TIMEOUT_MS=120000
+
+# --- Large-scale performance sweep -----------------------------------------
+# Node counts to sweep, comma-separated. Leave blank to use the built-in
+# default (small in CI, large locally). Example: 50,200,500,1000,2000
+SCALE_SWEEP_SIZES=
+
+# Quality tiers to sweep per size (subset of LOW,MEDIUM,HIGH,ULTRA).
+SCALE_SWEEP_QUALITIES=HIGH,ULTRA
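
Reviewer sketch (not part of the patch): the comments above say `VLM_ENDPOINTS` is a comma-separated list and that requests are round-robined across its entries. The TypeScript below is a minimal illustration of that behavior. `parseEndpoints` and `makeRoundRobin` are hypothetical names; the real logic lives in `tests/helpers/vlm.ts`, which is not shown in this section and may differ.

```typescript
// Hypothetical helpers for illustration; not the project's actual implementation.

/** Split a comma-separated VLM_ENDPOINTS value into clean base URLs. */
function parseEndpoints(raw: string | undefined): string[] {
  return (raw ?? '')
    .split(',')
    .map((s) => s.trim().replace(/\/+$/, '')) // drop whitespace and trailing slashes
    .filter((s) => s.length > 0);
}

/** Return a function that hands out endpoints in rotation. */
function makeRoundRobin(endpoints: string[]): () => string {
  if (endpoints.length === 0) {
    throw new Error('No VLM endpoints configured');
  }
  let next = 0;
  return () => {
    const endpoint = endpoints[next];
    next = (next + 1) % endpoints.length;
    return endpoint;
  };
}

// A blank value yields an empty list, which a suite can treat as "skip".
const noEndpoints = parseEndpoints('');

// Two placeholder hosts alternate: a, b, a, ...
const pick = makeRoundRobin(parseEndpoints('http://host-a:8000, http://host-b:8000/'));
const order = [pick(), pick(), pick()];
```

Keeping the rotation counter inside a closure means every caller that shares `pick` also shares the rotation, which is what spreads work across every listed GPU.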
diff --git a/docs/SYSTEMS.md b/docs/SYSTEMS.md
@@ -19,6 +19,8 @@
 | Lint | `npm run lint` | 0 errors (warnings allowed) |
 | Build | `npm run build` | Production build succeeds |
 | Showcase report | `TEST_URL=http://localhost:3127 npm run report:showcase` | Records .webm video + screenshots of every mode at all 5 resolutions → `test-artifacts/showcase/index.html` (also an every-PR CI artifact). |
+| Large-scale perf sweep | `TEST_URL=http://localhost:3127 npm run test:perf:scale` | Seeds graphs of increasing size (50→2000+ nodes) and records `window.__graphPerf` (settle, tick, fps, drift, query p95) across size × quality → `test-artifacts/scale-sweep/index.html`. Report-only; sizes/qualities via `.env.test.local`. See [docs/testing/local-vlm-and-scale.md](./testing/local-vlm-and-scale.md). |
+| Local VLM visual review | `TEST_URL=http://localhost:3127 npm run test:vlm` | A locally-hosted vision model judges captured states from 4 perspectives (visual defects, new-user clarity, accessibility, living-graph aliveness) → `test-artifacts/vlm/index.html`. **Skips unless `VLM_ENDPOINTS` is set in the gitignored `.env.test.local`** (CI can't reach local GPUs). Report-only. |
 
 **Why THE GATE exists:** a real incident — orphaned `Edge` records made the
 edges query 500 and the UI showed "Error" with zero edges, while every unit

diff --git a/docs/testing/local-vlm-and-scale.md b/docs/testing/local-vlm-and-scale.md
@@ -0,0 +1,113 @@
+# Local VLM visual review & large-scale performance sweeps
+
+Two heavier, report-only suites that exercise GraphDone from realistic user
+perspectives and at scale. Both are **opt-in and run locally** (or on a
+self-hosted runner): the sweep is too heavy for the fast gate, and the vision
+models live on your own GPU boxes, whose hostnames must never enter the repo.
+
+## TL;DR
+
+```bash
+cp .env.test.example .env.test.local     # gitignored — put your real values here
+# edit .env.test.local: VLM_ENDPOINTS, VLM_MODEL, (optional) sweep sizes
+
+./start dev                               # or have the stack running on :3127
+
+TEST_URL=http://localhost:3127 npm run test:perf:scale   # → test-artifacts/scale-sweep/index.html
+TEST_URL=http://localhost:3127 npm run test:vlm          # → test-artifacts/vlm/index.html
+```
+
+If `VLM_ENDPOINTS` is unset or no endpoint is reachable, `test:vlm` **skips
+cleanly**, so CI and other developers are never blocked by hardware they lack.
+
+## Keeping hostnames out of the repo
+
+- **Never** commit hostnames, IPs, or keys. The GPU boxes (e.g. an RTX 4090
+  workstation and Grace-Blackwell nodes) are referenced only by env vars.
+- `.env.test.local` is gitignored (see `.gitignore`). It is the *only* place
+  your real endpoints live.
+- `.env.test.example` is committed and documents the variable **names** with
+  placeholder hosts (`http://<gpu-host-a>:<port>`). Copy it to `.env.test.local`
+  and fill in your own values.
+- The harness auto-loads `.env.test.local` via `tests/helpers/testEnv.ts`.
+
+```bash
+# .env.test.local  (NOT committed)
+VLM_ENDPOINTS=http://<host-a>:<port>,http://<host-b>:<port>,http://<host-c>:<port>
+VLM_MODEL=<your-vision-model-tag>
+VLM_PROTOCOL=auto        # auto | openai | ollama
+VLM_MAX_CONCURRENCY=3
+```
+
+Multiple endpoints are **round-robined**, so visual evaluation spreads across
+every GPU you list.
+
+## VLM protocol support
+
+`tests/helpers/vlm.ts` is protocol-agnostic and auto-detects per endpoint:
+
+| Protocol | Detected via | Request |
+|----------|--------------|---------|
+| OpenAI-compatible | `GET /v1/models` | `POST /v1/chat/completions` with an `image_url` data URI (vLLM, LM Studio, llama.cpp server, Ollama's `/v1` shim) |
+| Ollama native | `GET /api/tags` | `POST /api/chat` with a base64 `images[]` array |
+
+Force one with `VLM_PROTOCOL=openai` or `ollama`. Each model call asks for a
+strict JSON verdict `{pass, score, issues[], summary}`, parsed leniently.
+
+### Personas
+
+Each captured screenshot is judged from several perspectives (see `PERSONAS`
+in `tests/helpers/vlm.ts`):
+
+- **Visual defects** — overlapping/cut-off nodes, unreadable labels, broken
+  layout, missing edges, error chrome.
+- **New-user clarity** — is the screen legible and inviting to a newcomer?
+- **Accessibility** — contrast, text size, color-only signals, target size.
+- **Living-graph aliveness** — do glow/breathe/flow status cues read clearly?
+
+Report-only: a **FLAG** is the model's subjective concern, surfaced for a human
+to look at — it never fails the build. The suite *does* assert the model
+answered, so a broken client is still caught.
+
+## Large-scale perf sweep
+
+`tests/perf/scale-sweep.spec.ts` seeds real graphs (via the GraphQL API, the
+same path a human/AI uses) of increasing size, loads each at one or more
+quality tiers, and records the in-app `window.__graphPerf` readings plus load
+time, settle time and query latency.
+
+```bash
+# .env.test.local
+SCALE_SWEEP_SIZES=50,200,500,1000,2000     # blank => small in CI, large locally
+SCALE_SWEEP_QUALITIES=HIGH,ULTRA
+```
+
+Metrics per (size, quality):
+
+- **Reliable (measured directly from the browser, captured at every size):**
+  rendered node/edge counts, initial load ms, graph-scoped query p95, and
+  **interaction FPS** — real rendered frames/sec while a node is dragged
+  (counted via `requestAnimationFrame`, so it needs no app instrumentation and
+  reflects how janky the graph feels under interaction at scale).
+- **Best-effort bonus (from the app's `window.__graphPerf`, which only
+  publishes ~every 2s while the sim ticks):** settle ms (to `alpha ≤ 0.02`),
+  avg/p95 sim tick ms, layout drift (`rmsFromSavedPx`). These can be blank for
+  graphs that settle instantly — `interactionFps` is the headline signal.
+
+A FRESH graph is seeded per (size, quality) so each measurement starts from an
+unsettled layout (otherwise the second quality loads the first run's settled,
+pinned positions and the sim never ticks). Output:
+`test-artifacts/scale-sweep/index.html` — a table plus inline SVG charts of how
+each metric scales, with the `@perf` budgets drawn for reference.
+
+Report-only; the only hard assertion is that a seeded graph actually renders.
+Each seeded graph is deleted afterward (edges first, then nodes, then graph).
+
+## CI
+
+GitHub-hosted runners can't reach your local GPUs, so neither suite runs as a
+merge gate there. To run them in CI, register a **self-hosted runner** on a
+machine that can reach the endpoints, give it the `.env.test.local`, and add a
+workflow job (manual-dispatch or nightly) that runs `npm run test:perf:scale` /
+`npm run test:vlm`. Both stay report-only. The scale sweep alone (no VLM) is
+safe to run on any runner with the dev stack and a small `SCALE_SWEEP_SIZES`.
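
Reviewer sketch (not part of the patch): the protocol section of the doc above says each model call asks for a strict JSON verdict `{pass, score, issues[], summary}` that is then parsed leniently. One way that leniency could look is sketched below. `parseVerdict` and the fallback issue text are hypothetical; the actual parser in `tests/helpers/vlm.ts` is not shown in this section and may behave differently.

```typescript
// Hypothetical sketch; the real parser in tests/helpers/vlm.ts may differ.

interface Verdict {
  pass: boolean;
  score: number;
  issues: string[];
  summary: string;
}

/** Pull the outermost {...} span out of a model reply and coerce it to a Verdict. */
function parseVerdict(reply: string): Verdict {
  const fallback: Verdict = { pass: false, score: 0, issues: ['Unparseable VLM reply'], summary: '' };

  // Models often wrap JSON in prose, so take everything between the first "{" and last "}".
  const start = reply.indexOf('{');
  const end = reply.lastIndexOf('}');
  if (start === -1 || end <= start) return fallback;

  let raw: unknown;
  try {
    raw = JSON.parse(reply.slice(start, end + 1));
  } catch {
    return fallback;
  }
  if (typeof raw !== 'object' || raw === null) return fallback;
  const obj = raw as Record<string, unknown>;

  // Coerce each field defensively instead of trusting the model's types.
  const score = Number(obj.score);
  return {
    pass: obj.pass === true || obj.pass === 'true',
    score: Number.isFinite(score) ? score : 0,
    issues: Array.isArray(obj.issues) ? obj.issues.map((i) => String(i)) : [],
    summary: typeof obj.summary === 'string' ? obj.summary : '',
  };
}

const chatty = 'Sure! {"pass": true, "score": "0.8", "issues": [], "summary": "Looks fine"} Hope that helps.';
const verdict = parseVerdict(chatty);
```

Returning a failing fallback instead of throwing keeps one malformed reply from aborting a whole batch, which suits a report-only suite.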
diff --git a/package.json b/package.json
@@ -37,6 +37,10 @@
     "test:smoke": "playwright test tests/e2e/user-smoke.spec.ts --reporter=line",
     "report:showcase": "playwright test --project=showcase && node tests/generate-showcase-report.mjs",
     "test:perf": "playwright test --project=perf --reporter=line",
+    "test:perf:scale": "playwright test --project=perf-scale --reporter=line && node tests/generate-perf-report.mjs",
+    "report:perf": "node tests/generate-perf-report.mjs",
+    "test:vlm": "playwright test --project=vlm --reporter=line && node tests/generate-vlm-report.mjs",
+    "report:vlm": "node tests/generate-vlm-report.mjs",
     "perf:bundle": "node tests/perf/check-bundle-size.mjs"
   },
   "devDependencies": {

diff --git a/playwright.config.ts b/playwright.config.ts
@@ -38,9 +38,10 @@ export default defineConfig({
   projects: [
     {
       name: 'GraphDone-Core/dev-neo4j/chromium',
-      // The showcase tour runs in its own capture-heavy project below; keep it
-      // out of the default (fast) project so the smoke gate stays quick.
-      testIgnore: /showcase\.spec\.ts/,
+      // The showcase tour and the local-VLM visual eval run in their own
+      // capture-heavy projects below; keep them out of the default (fast)
+      // project so the smoke gate stays quick.
+      testIgnore: [/showcase\.spec\.ts/, /visual-vlm\.spec\.ts/],
       use: { ...devices['Desktop Chrome'] },
     },
 
@@ -65,9 +66,31 @@ export default defineConfig({
     {
       name: 'perf',
       testDir: './tests/perf',
+      // The large-scale sweep is heavy and report-only; it has its own project
+      // so `test:perf` (the budget gate) stays fast.
+      testIgnore: /scale-sweep\.spec\.ts/,
       use: { ...devices['Desktop Chrome'] },
     },
 
+    /* Large-scale graph creation + performance metric sweep. Seeds graphs of
+     * increasing size and records window.__graphPerf across them. Heavy +
+     * report-only; run via `npm run test:perf:scale`. */
+    {
+      name: 'perf-scale',
+      testDir: './tests/perf',
+      testMatch: /scale-sweep\.spec\.ts/,
+      use: { ...devices['Desktop Chrome'] },
+    },
+
+    /* Local-VLM visual evaluation across personas. Skips unless VLM_ENDPOINTS
+     * is set in .env.test.local. Run via `npm run test:vlm`. */
+    {
+      name: 'vlm',
+      testDir: './tests/e2e',
+      testMatch: /visual-vlm\.spec\.ts/,
+      use: { ...devices['Desktop Chrome'], screenshot: 'on' },
+    },
+
     // Commented out until browsers installed with system dependencies
     // {
     //   name: 'GraphDone-Core/dev-neo4j/firefox',
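
Reviewer sketch (not part of the patch): the `perf` and `perf-scale` projects share `testDir: './tests/perf'` and are kept apart by applying the same regex as `testIgnore` in one and `testMatch` in the other. The snippet below only illustrates that partition. `perfProjectFor` and the example paths are hypothetical, and Playwright performs this matching internally.

```typescript
// Illustration only: the same pattern is used as testIgnore (perf) and testMatch (perf-scale).
const scaleSweep = /scale-sweep\.spec\.ts/;

/** Which of the two perf projects would pick up a given spec path. */
function perfProjectFor(specPath: string): 'perf' | 'perf-scale' {
  return scaleSweep.test(specPath) ? 'perf-scale' : 'perf';
}

// Hypothetical spec paths under tests/perf.
const heavy = perfProjectFor('tests/perf/scale-sweep.spec.ts');
const fast = perfProjectFor('tests/perf/some-budget.spec.ts');
```

Because the two projects use complementary rules, every spec in `tests/perf` lands in exactly one of them.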

diff --git a/tests/e2e/visual-vlm.spec.ts b/tests/e2e/visual-vlm.spec.ts
@@ -0,0 +1,118 @@
+import { test, expect, Page } from '@playwright/test';
+import * as fs from 'fs';
+import * as path from 'path';
+import { login, TEST_USERS } from '../helpers/auth';
+import { seedLargeGraph, deleteGraphDeep } from '../helpers/seedGraph';
+import '../helpers/testEnv';
+import { isVlmAvailable, evaluateBatch, PERSONAS, personaByKey } from '../helpers/vlm';
+
+/**
+ * Local-VLM visual evaluation. Captures key user-facing states, then asks a
+ * locally-hosted vision model to judge each one from four perspectives
+ * (visual defects, new-user clarity, accessibility, living-graph aliveness).
+ *
+ * Report-only: it never fails on a model's subjective verdict — it writes
+ * test-artifacts/vlm/results.json for `npm run report:vlm` and prints a
+ * summary. It only asserts the VLM actually answered (so a broken client is
+ * still caught). Skips entirely when no VLM endpoint is configured/reachable
+ * (VLM_ENDPOINTS in .env.test.local), so CI stays green.
+ */
+
+const SHOT_DIR = path.resolve(process.cwd(), 'test-artifacts/vlm/shots');
+const OUT = path.resolve(process.cwd(), 'test-artifacts/vlm/results.json');
+const SCALE_DIR = path.resolve(process.cwd(), 'test-artifacts/scale-sweep');
+
+interface Capture { file: string; context: string; personas: string[] }
+
+async function shot(page: Page, name: string): Promise<string> {
+  fs.mkdirSync(SHOT_DIR, { recursive: true });
+  const file = path.join(SHOT_DIR, `${name}.png`);
+  await page.screenshot({ path: file, fullPage: false }).catch(() => {});
+  return file;
+}
+
+test('VLM visual evaluation across personas @vlm', async ({ page }) => {
+  test.setTimeout(900_000);
+  const available = await isVlmAvailable();
+  test.skip(!available, 'No reachable VLM endpoint (set VLM_ENDPOINTS in .env.test.local)');
+
+  const allPersonas = PERSONAS.map((p) => p.key);
+  const captures: Capture[] = [];
+  const cleanup: string[] = [];
+
+  await login(page, TEST_USERS.ADMIN);
+  await page.waitForTimeout(1500);
+
+  // 1. Empty graph — first-run invitation (new-user + visual defects).
+  const empty = await page.evaluate(async () => {
+    const token = localStorage.getItem('authToken') ?? '';
+    const post = (query: string, variables?: unknown) =>
+      fetch('/api/graphql', { method: 'POST', headers: { 'Content-Type': 'application/json', Authorization: `Bearer ${token}` }, body: JSON.stringify({ query, variables }) }).then((r) => r.json());
+    const me = await post('{ me { id } }');
+    const g = await post(`mutation($i:[GraphCreateInput!]!){createGraphs(input:$i){graphs{id}}}`, { i: [{ name: `VLM Empty ${Date.now()}`, type: 'PROJECT', status: 'ACTIVE', createdBy: me.data.me.id, isShared: true }] });
+    return g.data.createGraphs.graphs[0].id as string;
+  });
+  cleanup.push(empty);
+  await page.setViewportSize({ width: 1440, height: 900 });
+  await page.evaluate((id) => localStorage.setItem('currentGraphId', id), empty);
+  await page.reload();
+  await page.waitForTimeout(5000);
+  captures.push({ file: await shot(page, 'empty-graph-desktop'), context: 'the first-run empty-state of a brand-new project graph in GraphDone, a graph-based task manager', personas: ['visual-defects', 'new-user', 'accessibility'] });
+
+  // 2. Populated graph at ULTRA quality — full living-graph experience.
+  const seeded = await seedLargeGraph(page, { size: 60, namePrefix: 'VLM' });
+  cleanup.push(seeded.graphId);
+  await page.evaluate((id) => { localStorage.setItem('currentGraphId', id); localStorage.setItem('graphdone.quality.override', 'ULTRA'); }, seeded.graphId);
+  await page.reload();
+  await page.waitForTimeout(8000); // let it settle + effects run
+  captures.push({ file: await shot(page, 'populated-desktop'), context: 'a populated project graph (~60 work items) with dependency edges; nodes glow by priority and animate by status (in-progress breathes, blocked aches, complete settles)', personas: allPersonas });
+
+  // 3. Same graph on a phone viewport — accessibility + new-user on mobile.
+  await page.setViewportSize({ width: 393, height: 852 });
+  await page.reload();
+  await page.waitForTimeout(6000);
+  captures.push({ file: await shot(page, 'populated-mobile'), context: 'the same project graph viewed on a phone-sized screen (393x852)', personas: ['visual-defects', 'new-user', 'accessibility'] });
+
+  // 4. Bonus: judge the SINGLE largest scale-sweep frame for density/legibility
+  //    (those frames are 1920px and slow on the model; one is enough signal).
+  if (fs.existsSync(SCALE_DIR)) {
+    const largest = fs.readdirSync(SCALE_DIR)
+      .filter((f) => f.endsWith('.png'))
+      .map((f) => ({ f, size: parseInt(f, 10) || 0 }))
+      .sort((a, b) => b.size - a.size)[0];
+    if (largest) {
+      captures.push({ file: path.join(SCALE_DIR, largest.f), context: `a large graph rendered at scale (${largest.size} nodes) — judge whether it stays legible at this density`, personas: ['visual-defects'] });
+    }
+  }
+
+  // Build and run the persona jobs.
+  const jobs = captures.flatMap((c) =>
+    c.personas
+      .map((pk) => personaByKey(pk))
+      .filter((p): p is NonNullable<typeof p> => Boolean(p))
+      .map((persona) => ({ imagePath: c.file, persona, context: c.context, meta: { capture: path.basename(c.file) } }))
+  );
+
+  let results: Awaited<ReturnType<typeof evaluateBatch>> = [];
+  try {
+    results = await evaluateBatch(jobs);
+  } finally {
+    for (const id of cleanup) await deleteGraphDeep(page, id);
+  }
+
+  fs.mkdirSync(path.dirname(OUT), { recursive: true });
+  fs.writeFileSync(OUT, JSON.stringify({ generatedAt: new Date().toISOString(), results }, null, 2));
+
+  const fails = results.filter((r) => !r.verdict.pass);
+  // eslint-disable-next-line no-console
+  console.log(`[vlm] ${results.length} evaluations, ${results.length - fails.length} pass, ${fails.length} flagged:`);
+  for (const f of fails) {
+    // eslint-disable-next-line no-console
+    console.log(`  ⚠️ [${f.persona}] ${f.meta?.capture}: ${f.verdict.summary || f.verdict.issues.join('; ')}`);
+  }
+
+  // Report-only: we assert the VLM produced answers, not what it concluded.
+  expect(results.length, 'VLM returned evaluations').toBeGreaterThan(0);
+  const answered = results.filter((r) => !r.verdict.issues.some((i) => i.startsWith('VLM request failed') || i.startsWith('No reachable')));
+  expect(answered.length, 'at least some VLM calls succeeded').toBeGreaterThan(0);
+});
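
Reviewer sketch (not part of the patch): step 4 of the spec picks the largest scale-sweep frame by running `parseInt` on each `.png` filename, which only works when the name begins with the node count. The standalone version below shows that selection in isolation. `pickLargestFrame` and the filenames such as `2000-ULTRA.png` are assumptions for illustration; the sweep's real naming scheme is not shown in this section.

```typescript
// Hypothetical standalone version of the "largest frame" selection in the spec.

/** Return the .png whose leading integer is largest, or undefined if none exist. */
function pickLargestFrame(files: string[]): { file: string; size: number } | undefined {
  return files
    .filter((f) => f.endsWith('.png'))
    // parseInt reads leading digits and stops at the first non-digit;
    // names with no leading number give NaN, which falls back to 0.
    .map((f) => ({ file: f, size: parseInt(f, 10) || 0 }))
    .sort((a, b) => b.size - a.size)[0];
}

// Assumed filenames for illustration only.
const picked = pickLargestFrame(['50-HIGH.png', '2000-ULTRA.png', '500-HIGH.png', 'index.html']);
```

A name that does not start with digits sorts as size 0, so it is chosen only when no numbered frame exists, and the prompt would then describe it as a 0-node graph.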