diff --git a/.env.test.example b/.env.test.example
new file mode 100644
index 00000000..a8d29355
--- /dev/null
+++ b/.env.test.example
@@ -0,0 +1,42 @@
+# GraphDone test-pipeline configuration — LOCAL ONLY.
+#
+# Copy this file to `.env.test.local` (which is gitignored) and fill in your
+# own values. NEVER put real hostnames, IPs, or keys in this committed example
+# or anywhere else in the repo — the local VLM boxes (GPU workstations) must
+# stay out of version control. The test harness auto-loads `.env.test.local`.
+#
+#   cp .env.test.example .env.test.local   # then edit .env.test.local
+
+# --- Local Vision-Language-Model (VLM) endpoints ---------------------------
+# Comma-separated base URLs of your local VLM server(s). Requests are
+# round-robined across them so visual evaluation is spread over every GPU.
+# Leave blank to skip all VLM-driven suites (they no-op cleanly in CI).
+# Example shape (use your OWN hosts in .env.test.local, never here):
+#   VLM_ENDPOINTS=http://<gpu-host-a>:<port>,http://<gpu-host-b>:<port>
+VLM_ENDPOINTS=
+
+# Model id/tag to request (e.g. a llava / qwen2-vl / llama-3.2-vision build).
+VLM_MODEL=
+
+# Optional bearer key for OpenAI-compatible servers that require one.
+VLM_API_KEY=
+
+# Wire protocol: auto (default) | openai | ollama.
+#   auto   — probe each endpoint: /v1/models => OpenAI-compatible, else Ollama.
+#   openai — POST /v1/chat/completions (vLLM, LM Studio, llama.cpp, Ollama compat)
+#   ollama — POST /api/chat with an images[] array
+VLM_PROTOCOL=auto
+
+# Max concurrent VLM requests across all endpoints (default 3).
+VLM_MAX_CONCURRENCY=3
+
+# Per-request timeout in ms — VLMs can be slow on large images (default 120000).
+VLM_TIMEOUT_MS=120000
+
+# --- Large-scale performance sweep -----------------------------------------
+# Node counts to sweep, comma-separated. Leave blank to use the built-in
+# default (small in CI, large locally). Example: 50,200,500,1000,2000
+SCALE_SWEEP_SIZES=
+
+# Quality tiers to sweep per size (subset of LOW,MEDIUM,HIGH,ULTRA).
+SCALE_SWEEP_QUALITIES=HIGH,ULTRA
diff --git a/docs/SYSTEMS.md b/docs/SYSTEMS.md
index 42cdc71e..129e5774 100644
--- a/docs/SYSTEMS.md
+++ b/docs/SYSTEMS.md
@@ -19,6 +19,8 @@
 | Lint | `npm run lint` | 0 errors (warnings allowed) |
 | Build | `npm run build` | Production build succeeds |
 | Showcase report | `TEST_URL=http://localhost:3127 npm run report:showcase` | Records .webm video + screenshots of every mode at all 5 resolutions → `test-artifacts/showcase/index.html` (also an every-PR CI artifact). |
+| Large-scale perf sweep | `TEST_URL=http://localhost:3127 npm run test:perf:scale` | Seeds graphs of increasing size (50→2000+ nodes) and records `window.__graphPerf` (settle, tick, fps, drift, query p95) across size × quality → `test-artifacts/scale-sweep/index.html`. Report-only; sizes/qualities via `.env.test.local`. See [docs/testing/local-vlm-and-scale.md](./testing/local-vlm-and-scale.md). |
+| Local VLM visual review | `TEST_URL=http://localhost:3127 npm run test:vlm` | A locally-hosted vision model judges captured states from 4 perspectives (visual defects, new-user clarity, accessibility, living-graph aliveness) → `test-artifacts/vlm/index.html`. **Skips unless `VLM_ENDPOINTS` is set in the gitignored `.env.test.local`** (CI can't reach local GPUs). Report-only. |
 
 **Why THE GATE exists:** a real incident — orphaned `Edge` records made the
 edges query 500 and the UI showed "Error" with zero edges, while every unit
diff --git a/docs/testing/local-vlm-and-scale.md b/docs/testing/local-vlm-and-scale.md
new file mode 100644
index 00000000..320a5c04
--- /dev/null
+++ b/docs/testing/local-vlm-and-scale.md
@@ -0,0 +1,113 @@
+# Local VLM visual review & large-scale performance sweeps
+
+This guide covers two heavier, report-only suites that exercise GraphDone from
+realistic user perspectives and at scale. Both are **opt-in and run locally**
+(or on a self-hosted runner): the sweep is heavy, and the vision models live on
+your own GPU boxes, whose hostnames must never enter the repo.
+
+## TL;DR
+
+```bash
+cp .env.test.example .env.test.local     # gitignored — put your real values here
+# edit .env.test.local: VLM_ENDPOINTS, VLM_MODEL, (optional) sweep sizes
+
+./start dev                               # or have the stack running on :3127
+
+TEST_URL=http://localhost:3127 npm run test:perf:scale   # → test-artifacts/scale-sweep/index.html
+TEST_URL=http://localhost:3127 npm run test:vlm          # → test-artifacts/vlm/index.html
+```
+
+If `VLM_ENDPOINTS` is unset, `test:vlm` **skips cleanly** — so CI and other
+developers are never blocked by hardware they don't have.
+
+## Keeping hostnames out of the repo
+
+- **Never** commit hostnames, IPs, or keys. The GPU boxes (e.g. an RTX 4090
+  workstation and Grace-Blackwell nodes) are referenced only by env vars.
+- `.env.test.local` is gitignored (see `.gitignore`). It is the *only* place
+  your real endpoints live.
+- `.env.test.example` is committed and documents the variable **names** with
+  placeholder hosts (`http://<gpu-host>:<port>`). Copy it to `.env.test.local`
+  and replace the placeholders with your real values.
+- The harness auto-loads `.env.test.local` via `tests/helpers/testEnv.ts`.
+
+```bash
+# .env.test.local  (NOT committed)
+VLM_ENDPOINTS=http://<host-a>:<port>,http://<host-b>:<port>,http://<host-c>:<port>
+VLM_MODEL=<your-vision-model-tag>
+VLM_PROTOCOL=auto        # auto | openai | ollama
+VLM_MAX_CONCURRENCY=3
+```
+
+Multiple endpoints are **round-robined**, so visual evaluation spreads across
+every GPU you list.
+
+## VLM protocol support
+
+`tests/helpers/vlm.ts` is protocol-agnostic and auto-detects per endpoint:
+
+| Protocol | Detected via | Request |
+|----------|--------------|---------|
+| OpenAI-compatible | `GET /v1/models` | `POST /v1/chat/completions` with an `image_url` data URI (vLLM, LM Studio, llama.cpp server, Ollama's `/v1` shim) |
+| Ollama native | `GET /api/tags` | `POST /api/chat` with a base64 `images[]` array |
+
+Force one with `VLM_PROTOCOL=openai` or `ollama`. Each model call asks for a
+strict JSON verdict `{pass, score, issues[], summary}`, parsed leniently.
+
+### Personas
+
+Each captured screenshot is judged from several perspectives (see `PERSONAS`
+in `tests/helpers/vlm.ts`):
+
+- **Visual defects** — overlapping/cut-off nodes, unreadable labels, broken
+  layout, missing edges, error chrome.
+- **New-user clarity** — is the screen legible and inviting to a newcomer?
+- **Accessibility** — contrast, text size, color-only signals, target size.
+- **Living-graph aliveness** — do glow/breathe/flow status cues read clearly?
+
+Report-only: a **FLAG** is the model's subjective concern, surfaced for a human
+to look at — it never fails the build. The suite *does* assert the model
+answered, so a broken client is still caught.
+
+## Large-scale perf sweep
+
+`tests/perf/scale-sweep.spec.ts` seeds real graphs (via the GraphQL API, the
+same path a human/AI uses) of increasing size, loads each at one or more
+quality tiers, and records the in-app `window.__graphPerf` readings plus load
+time, settle time and query latency.
+
+```bash
+# .env.test.local
+SCALE_SWEEP_SIZES=50,200,500,1000,2000     # blank => small in CI, large locally
+SCALE_SWEEP_QUALITIES=HIGH,ULTRA
+```
+
+Metrics per (size, quality):
+
+- **Reliable (measured directly from the browser, captured at every size):**
+  rendered node/edge counts, initial load ms, graph-scoped query p95, and
+  **interaction FPS** — real rendered frames/sec while a node is dragged
+  (counted via `requestAnimationFrame`, so it needs no app instrumentation and
+  reflects how janky the graph feels under interaction at scale).
+- **Best-effort bonus (from the app's `window.__graphPerf`, which only
+  publishes ~every 2s while the sim ticks):** settle ms (to `alpha ≤ 0.02`),
+  avg/p95 sim tick ms, layout drift (`rmsFromSavedPx`). These can be blank for
+  graphs that settle instantly — `interactionFps` is the headline signal.
+
+A FRESH graph is seeded per (size, quality) so each measurement starts from an
+unsettled layout (otherwise the second quality loads the first run's settled,
+pinned positions and the sim never ticks). Output:
+`test-artifacts/scale-sweep/index.html` — a table plus inline SVG charts of how
+each metric scales, with the `@perf` budgets drawn for reference.
+
+Report-only; the only hard assertion is that a seeded graph actually renders.
+Each seeded graph is deleted afterward (edges first, then nodes, then graph).
+
+## CI
+
+GitHub-hosted runners can't reach your local GPUs, so neither suite gates
+merges there. To gate on them, register a **self-hosted runner** on a machine
+that can reach the endpoints, give it the `.env.test.local`, and add a workflow
+job (manual-dispatch or nightly) that runs `npm run test:perf:scale` /
+`npm run test:vlm`. The scale sweep alone (no VLM) is safe to run on any runner
+with the dev stack and a small `SCALE_SWEEP_SIZES`.
diff --git a/package.json b/package.json
index 3c442d84..b3182854 100644
--- a/package.json
+++ b/package.json
@@ -37,6 +37,10 @@
     "test:smoke": "playwright test tests/e2e/user-smoke.spec.ts --reporter=line",
     "report:showcase": "playwright test --project=showcase && node tests/generate-showcase-report.mjs",
     "test:perf": "playwright test --project=perf --reporter=line",
+    "test:perf:scale": "playwright test --project=perf-scale --reporter=line && node tests/generate-perf-report.mjs",
+    "report:perf": "node tests/generate-perf-report.mjs",
+    "test:vlm": "playwright test --project=vlm --reporter=line && node tests/generate-vlm-report.mjs",
+    "report:vlm": "node tests/generate-vlm-report.mjs",
     "perf:bundle": "node tests/perf/check-bundle-size.mjs"
   },
   "devDependencies": {
diff --git a/playwright.config.ts b/playwright.config.ts
index 3a9f10e5..d0d1716b 100644
--- a/playwright.config.ts
+++ b/playwright.config.ts
@@ -38,9 +38,10 @@ export default defineConfig({
   projects: [
     {
       name: 'GraphDone-Core/dev-neo4j/chromium',
-      // The showcase tour runs in its own capture-heavy project below; keep it
-      // out of the default (fast) project so the smoke gate stays quick.
-      testIgnore: /showcase\.spec\.ts/,
+      // The showcase tour and the local-VLM visual eval run in their own
+      // capture-heavy projects below; keep them out of the default (fast)
+      // project so the smoke gate stays quick.
+      testIgnore: [/showcase\.spec\.ts/, /visual-vlm\.spec\.ts/],
       use: { ...devices['Desktop Chrome'] },
     },
 
@@ -65,9 +66,31 @@ export default defineConfig({
     {
       name: 'perf',
       testDir: './tests/perf',
+      // The large-scale sweep is heavy and report-only; it has its own project
+      // so `test:perf` (the budget gate) stays fast.
+      testIgnore: /scale-sweep\.spec\.ts/,
       use: { ...devices['Desktop Chrome'] },
     },
 
+    /* Large-scale graph creation + performance metric sweep. Seeds graphs of
+     * increasing size and records window.__graphPerf across them. Heavy +
+     * report-only; run via `npm run test:perf:scale`. */
+    {
+      name: 'perf-scale',
+      testDir: './tests/perf',
+      testMatch: /scale-sweep\.spec\.ts/,
+      use: { ...devices['Desktop Chrome'] },
+    },
+
+    /* Local-VLM visual evaluation across personas. Skips unless VLM_ENDPOINTS
+     * is set in .env.test.local. Run via `npm run test:vlm`. */
+    {
+      name: 'vlm',
+      testDir: './tests/e2e',
+      testMatch: /visual-vlm\.spec\.ts/,
+      use: { ...devices['Desktop Chrome'], screenshot: 'on' },
+    },
+
     // Commented out until browsers installed with system dependencies
     // {
     //   name: 'GraphDone-Core/dev-neo4j/firefox',
diff --git a/tests/e2e/visual-vlm.spec.ts b/tests/e2e/visual-vlm.spec.ts
new file mode 100644
index 00000000..654e4ea0
--- /dev/null
+++ b/tests/e2e/visual-vlm.spec.ts
@@ -0,0 +1,118 @@
+import { test, expect, Page } from '@playwright/test';
+import * as fs from 'fs';
+import * as path from 'path';
+import { login, TEST_USERS } from '../helpers/auth';
+import { seedLargeGraph, deleteGraphDeep } from '../helpers/seedGraph';
+import '../helpers/testEnv';
+import { isVlmAvailable, evaluateBatch, PERSONAS, personaByKey } from '../helpers/vlm';
+
+/**
+ * Local-VLM visual evaluation. Captures key user-facing states, then asks a
+ * locally-hosted vision model to judge each one from four perspectives
+ * (visual defects, new-user clarity, accessibility, living-graph aliveness).
+ *
+ * Report-only: it never fails on a model's subjective verdict — it writes
+ * test-artifacts/vlm/results.json for `npm run report:vlm` and prints a
+ * summary. It only asserts the VLM actually answered (so a broken client is
+ * still caught). Skips entirely when no VLM endpoint is configured/reachable
+ * (VLM_ENDPOINTS in .env.test.local), so CI stays green.
+ */
+
+const SHOT_DIR = path.resolve(process.cwd(), 'test-artifacts/vlm/shots');
+const OUT = path.resolve(process.cwd(), 'test-artifacts/vlm/results.json');
+const SCALE_DIR = path.resolve(process.cwd(), 'test-artifacts/scale-sweep');
+
+interface Capture { file: string; context: string; personas: string[] }
+
+async function shot(page: Page, name: string): Promise<string> {
+  fs.mkdirSync(SHOT_DIR, { recursive: true });
+  const file = path.join(SHOT_DIR, `${name}.png`);
+  await page.screenshot({ path: file, fullPage: false }).catch(() => {});
+  return file;
+}
+
+test('VLM visual evaluation across personas @vlm', async ({ page }) => {
+  test.setTimeout(900_000);
+  const available = await isVlmAvailable();
+  test.skip(!available, 'No reachable VLM endpoint (set VLM_ENDPOINTS in .env.test.local)');
+
+  const allPersonas = PERSONAS.map((p) => p.key);
+  const captures: Capture[] = [];
+  const cleanup: string[] = [];
+
+  await login(page, TEST_USERS.ADMIN);
+  await page.waitForTimeout(1500);
+
+  // 1. Empty graph — first-run invitation (new-user + visual defects).
+  const empty = await page.evaluate(async () => {
+    const token = localStorage.getItem('authToken') ?? '';
+    const post = (query: string, variables?: unknown) =>
+      fetch('/api/graphql', { method: 'POST', headers: { 'Content-Type': 'application/json', Authorization: `Bearer ${token}` }, body: JSON.stringify({ query, variables }) }).then((r) => r.json());
+    const me = await post('{ me { id } }');
+    const g = await post(`mutation($i:[GraphCreateInput!]!){createGraphs(input:$i){graphs{id}}}`, { i: [{ name: `VLM Empty ${Date.now()}`, type: 'PROJECT', status: 'ACTIVE', createdBy: me.data.me.id, isShared: true }] });
+    return g.data.createGraphs.graphs[0].id as string;
+  });
+  cleanup.push(empty);
+  await page.setViewportSize({ width: 1440, height: 900 });
+  await page.evaluate((id) => localStorage.setItem('currentGraphId', id), empty);
+  await page.reload();
+  await page.waitForTimeout(5000);
+  captures.push({ file: await shot(page, 'empty-graph-desktop'), context: 'the first-run empty-state of a brand-new project graph in GraphDone, a graph-based task manager', personas: ['visual-defects', 'new-user', 'accessibility'] });
+
+  // 2. Populated graph at ULTRA quality — full living-graph experience.
+  const seeded = await seedLargeGraph(page, { size: 60, namePrefix: 'VLM' });
+  cleanup.push(seeded.graphId);
+  await page.evaluate((id) => { localStorage.setItem('currentGraphId', id); localStorage.setItem('graphdone.quality.override', 'ULTRA'); }, seeded.graphId);
+  await page.reload();
+  await page.waitForTimeout(8000); // let it settle + effects run
+  captures.push({ file: await shot(page, 'populated-desktop'), context: 'a populated project graph (~60 work items) with dependency edges; nodes glow by priority and animate by status (in-progress breathes, blocked aches, complete settles)', personas: allPersonas });
+
+  // 3. Same graph on a phone viewport — accessibility + new-user on mobile.
+  await page.setViewportSize({ width: 393, height: 852 });
+  await page.reload();
+  await page.waitForTimeout(6000);
+  captures.push({ file: await shot(page, 'populated-mobile'), context: 'the same project graph viewed on a phone-sized screen (393x852)', personas: ['visual-defects', 'new-user', 'accessibility'] });
+
+  // 4. Bonus: judge the SINGLE largest scale-sweep frame for density/legibility
+  //    (those frames are 1920px and slow on the model; one is enough signal).
+  if (fs.existsSync(SCALE_DIR)) {
+    const largest = fs.readdirSync(SCALE_DIR)
+      .filter((f) => f.endsWith('.png'))
+      .map((f) => ({ f, size: parseInt(f, 10) || 0 }))
+      .sort((a, b) => b.size - a.size)[0];
+    if (largest) {
+      captures.push({ file: path.join(SCALE_DIR, largest.f), context: `a large graph rendered at scale (${largest.size} nodes) — judge whether it stays legible at this density`, personas: ['visual-defects'] });
+    }
+  }
+
+  // Build and run the persona jobs.
+  const jobs = captures.flatMap((c) =>
+    c.personas
+      .map((pk) => personaByKey(pk))
+      .filter((p): p is NonNullable<typeof p> => Boolean(p))
+      .map((persona) => ({ imagePath: c.file, persona, context: c.context, meta: { capture: path.basename(c.file) } }))
+  );
+
+  let results: Awaited<ReturnType<typeof evaluateBatch>> = [];
+  try {
+    results = await evaluateBatch(jobs);
+  } finally {
+    for (const id of cleanup) await deleteGraphDeep(page, id);
+  }
+
+  fs.mkdirSync(path.dirname(OUT), { recursive: true });
+  fs.writeFileSync(OUT, JSON.stringify({ generatedAt: new Date().toISOString(), results }, null, 2));
+
+  const fails = results.filter((r) => !r.verdict.pass);
+  // eslint-disable-next-line no-console
+  console.log(`[vlm] ${results.length} evaluations, ${results.length - fails.length} pass, ${fails.length} flagged:`);
+  for (const f of fails) {
+    // eslint-disable-next-line no-console
+    console.log(`  ⚠️ [${f.persona}] ${f.meta?.capture}: ${f.verdict.summary || f.verdict.issues.join('; ')}`);
+  }
+
+  // Report-only: we assert the VLM produced answers, not what it concluded.
+  expect(results.length, 'VLM returned evaluations').toBeGreaterThan(0);
+  const answered = results.filter((r) => !r.verdict.issues.some((i) => i.startsWith('VLM request failed') || i.startsWith('No reachable')));
+  expect(answered.length, 'at least some VLM calls succeeded').toBeGreaterThan(0);
+});
diff --git a/tests/generate-perf-report.mjs b/tests/generate-perf-report.mjs
new file mode 100644
index 00000000..8cd78d82
--- /dev/null
+++ b/tests/generate-perf-report.mjs
@@ -0,0 +1,111 @@
+#!/usr/bin/env node
+/**
+ * Renders the large-scale perf sweep into a single self-contained page:
+ *   test-artifacts/scale-sweep/index.html
+ *
+ * Input: test-artifacts/scale-sweep/<size>n-<quality>.json (from scale-sweep.spec.ts)
+ * Output: an HTML table of every metric plus inline SVG line charts (no deps,
+ * no external assets) showing how settle time, tick cost, FPS, drift and query
+ * latency scale with graph size, per quality tier.
+ */
+import * as fs from 'fs';
+import * as path from 'path';
+
+const DIR = path.resolve(process.cwd(), 'test-artifacts/scale-sweep');
+const OUT = path.join(DIR, 'index.html');
+
+if (!fs.existsSync(DIR)) {
+  console.error(`No sweep results at ${DIR} — run "npm run test:perf:scale" first.`);
+  process.exit(1);
+}
+
+const rows = fs
+  .readdirSync(DIR)
+  .filter((f) => f.endsWith('.json'))
+  .map((f) => JSON.parse(fs.readFileSync(path.join(DIR, f), 'utf8')))
+  .sort((a, b) => a.size - b.size || String(a.quality).localeCompare(b.quality));
+
+if (rows.length === 0) {
+  console.error('No JSON sweep results found.');
+  process.exit(1);
+}
+
+const qualities = [...new Set(rows.map((r) => r.quality))];
+const sizes = [...new Set(rows.map((r) => r.size))].sort((a, b) => a - b);
+const COLORS = ['#34d399', '#60a5fa', '#f472b6', '#fbbf24', '#a78bfa'];
+
+const num = (v) => (typeof v === 'number' && v >= 0 ? v : null);
+
+function lineChart(title, key, { unit = '', budget = null } = {}) {
+  const W = 560, H = 260, PADL = 56, PADB = 36, PADT = 28, PADR = 16;
+  const series = qualities.map((q) => ({
+    q,
+    pts: sizes.map((s) => {
+      const row = rows.find((r) => r.size === s && r.quality === q);
+      return { x: s, y: row ? num(row[key]) : null };
+    }).filter((p) => p.y !== null),
+  })).filter((s) => s.pts.length);
+  const allY = series.flatMap((s) => s.pts.map((p) => p.y)).concat(budget != null ? [budget] : []);
+  if (allY.length === 0) return '';
+  const maxY = Math.max(...allY) * 1.1 || 1;
+  const maxX = Math.max(...sizes);
+  const minX = Math.min(...sizes);
+  const sx = (x) => PADL + ((x - minX) / (maxX - minX || 1)) * (W - PADL - PADR);
+  const sy = (y) => H - PADB - (y / maxY) * (H - PADT - PADB);
+
+  const grid = [0, 0.25, 0.5, 0.75, 1].map((f) => {
+    const y = sy(maxY * f);
+    return `<line x1="${PADL}" y1="${y}" x2="${W - PADR}" y2="${y}" stroke="#243044"/><text x="${PADL - 8}" y="${y + 4}" fill="#7c8aa0" font-size="11" text-anchor="end">${Math.round(maxY * f)}</text>`;
+  }).join('');
+  const xticks = sizes.map((s) => `<text x="${sx(s)}" y="${H - PADB + 18}" fill="#7c8aa0" font-size="11" text-anchor="middle">${s}</text>`).join('');
+  const budgetLine = budget != null ? `<line x1="${PADL}" y1="${sy(budget)}" x2="${W - PADR}" y2="${sy(budget)}" stroke="#ef4444" stroke-dasharray="5 4"/><text x="${W - PADR}" y="${sy(budget) - 5}" fill="#ef4444" font-size="10" text-anchor="end">budget ${budget}${unit}</text>` : '';
+  const lines = series.map((s) => {
+    const c = COLORS[qualities.indexOf(s.q) % COLORS.length];
+    const d = s.pts.map((p, j) => `${j === 0 ? 'M' : 'L'}${sx(p.x).toFixed(1)},${sy(p.y).toFixed(1)}`).join(' ');
+    const dots = s.pts.map((p) => `<circle cx="${sx(p.x).toFixed(1)}" cy="${sy(p.y).toFixed(1)}" r="3" fill="${c}"><title>${s.q} @ ${p.x}n: ${p.y}${unit}</title></circle>`).join('');
+    return `<path d="${d}" fill="none" stroke="${c}" stroke-width="2"/>${dots}`;
+  }).join('');
+  const legend = series.map((s) => {
+    const c = COLORS[qualities.indexOf(s.q) % COLORS.length];
+    return `<span style="color:${c}">● ${s.q}</span>`;
+  }).join('&nbsp;&nbsp;');
+  return `<div class="chart"><h3>${title}</h3><div class="legend">${legend}</div><svg viewBox="0 0 ${W} ${H}" width="100%">${grid}${xticks}${budgetLine}${lines}<text x="${W / 2}" y="${H - 4}" fill="#7c8aa0" font-size="11" text-anchor="middle">graph size (nodes)</text></svg></div>`;
+}
+
+const HEADERS = [
+  ['size', 'nodes'], ['quality', 'quality'], ['renderedNodes', 'rendered n'], ['renderedEdges', 'rendered e'],
+  ['loadMs', 'load ms'], ['interactionFps', 'drag fps'], ['settleMs', 'settle ms'], ['finalAlpha', 'alpha'],
+  ['avgTickMs', 'tick ms'], ['p95TickMs', 'tick p95'], ['rmsFromSavedPx', 'drift px'],
+  ['queryP95Ms', 'query p95'],
+];
+const tableRows = rows.map((r) => `<tr>${HEADERS.map(([k]) => `<td>${r[k] == null ? '—' : r[k]}</td>`).join('')}</tr>`).join(''); // == null also covers keys missing from the JSON
+
+const html = `<!doctype html><html><head><meta charset="utf-8"><title>GraphDone — Large-Scale Perf Sweep</title>
+<style>
+body{background:#0b1018;color:#e6edf6;font:14px/1.5 system-ui,sans-serif;margin:0;padding:24px}
+h1{font-size:20px}h2{font-size:16px;margin-top:32px;border-bottom:1px solid #243044;padding-bottom:6px}
+h3{font-size:13px;margin:0 0 4px}.muted{color:#7c8aa0}
+table{border-collapse:collapse;width:100%;margin-top:12px;font-variant-numeric:tabular-nums}
+th,td{border:1px solid #243044;padding:6px 8px;text-align:right}th{background:#131c2b;position:sticky;top:0}
+td:nth-child(2){text-align:center}
+.charts{display:grid;grid-template-columns:repeat(auto-fit,minmax(380px,1fr));gap:20px;margin-top:16px}
+.chart{background:#0f1623;border:1px solid #243044;border-radius:10px;padding:12px}
+.legend{font-size:11px;margin-bottom:4px}
+</style></head><body>
+<h1>GraphDone — Large-Scale Graph Performance Sweep</h1>
+<p class="muted">${rows.length} runs · sizes ${sizes.join(', ')} · qualities ${qualities.join(', ')} · generated ${new Date().toISOString()}</p>
+<div class="charts">
+${lineChart('Interaction FPS vs size (drag)', 'interactionFps', { unit: '' })}
+${lineChart('Initial load vs size', 'loadMs', { unit: 'ms' })}
+${lineChart('Avg simulation tick vs size', 'avgTickMs', { unit: 'ms', budget: 8 })}
+${lineChart('Settle time vs size', 'settleMs', { unit: 'ms' })}
+${lineChart('Layout drift vs size', 'rmsFromSavedPx', { unit: 'px', budget: 25 })}
+${lineChart('Query p95 latency vs size', 'queryP95Ms', { unit: 'ms', budget: 800 })}
+</div>
+<h2>All metrics</h2>
+<table><thead><tr>${HEADERS.map(([, h]) => `<th>${h}</th>`).join('')}</tr></thead><tbody>${tableRows}</tbody></table>
+<p class="muted" style="margin-top:24px">Report-only. Budgets shown (red dashed) mirror the @perf gate; this sweep characterises how they scale, it does not enforce them.</p>
+</body></html>`;
+
+fs.writeFileSync(OUT, html);
+console.log(`✅ Perf sweep report: ${OUT} (${rows.length} runs)`);
diff --git a/tests/generate-vlm-report.mjs b/tests/generate-vlm-report.mjs
new file mode 100644
index 00000000..819a63a0
--- /dev/null
+++ b/tests/generate-vlm-report.mjs
@@ -0,0 +1,82 @@
+#!/usr/bin/env node
+/**
+ * Renders local-VLM visual evaluations into one self-contained gallery:
+ *   test-artifacts/vlm/index.html
+ *
+ * Input: test-artifacts/vlm/results.json (from visual-vlm.spec.ts)
+ * Output: each captured screenshot with a card per persona verdict
+ * (pass/flag badge, 0-1 score, summary, issues). No deps, no external assets.
+ */
+import * as fs from 'fs';
+import * as path from 'path';
+
+const VLM_DIR = path.resolve(process.cwd(), 'test-artifacts/vlm');
+const RESULTS = path.join(VLM_DIR, 'results.json');
+const OUT = path.join(VLM_DIR, 'index.html');
+
+if (!fs.existsSync(RESULTS)) {
+  console.warn(`No VLM results at ${RESULTS}; skipping report (test:vlm skips when VLM_ENDPOINTS is unset).`);
+  process.exit(0); // not an error: `npm run test:vlm` chains this script after a cleanly skipped run
+}
+
+const { generatedAt, results } = JSON.parse(fs.readFileSync(RESULTS, 'utf8'));
+const esc = (s) => String(s ?? '').replace(/[&<>"]/g, (c) => ({ '&': '&amp;', '<': '&lt;', '>': '&gt;', '"': '&quot;' }[c]));
+
+// Group verdicts by the screenshot they judged.
+const byCapture = new Map();
+for (const r of results) {
+  const key = r.imagePath;
+  if (!byCapture.has(key)) byCapture.set(key, { imagePath: r.imagePath, context: r.context, verdicts: [] });
+  byCapture.get(key).verdicts.push(r);
+}
+
+const total = results.length;
+const passed = results.filter((r) => r.verdict.pass).length;
+const avgScore = total ? (results.reduce((a, r) => a + (r.verdict.score || 0), 0) / total).toFixed(2) : '—';
+
+const sections = [...byCapture.values()].map((cap) => {
+  const rel = path.relative(VLM_DIR, cap.imagePath).split(path.sep).join('/');
+  const cards = cap.verdicts.map((r) => {
+    const v = r.verdict;
+    const cls = v.pass ? 'pass' : 'flag';
+    const issues = v.issues?.length ? `<ul>${v.issues.map((i) => `<li>${esc(i)}</li>`).join('')}</ul>` : '';
+    const model = (v.model || '').replace(/\.gguf$/, '').slice(0, 28);
+    const host = (v.endpoint || '').replace(/^https?:\/\//, '');
+    const foot = (v.endpoint || v.latencyMs) ? `<div class="foot">${esc(model)} @ ${esc(host)}${v.latencyMs ? ` · ${(v.latencyMs / 1000).toFixed(1)}s` : ''}</div>` : '';
+    return `<div class="card ${cls}">
+      <div class="chead"><span class="badge ${cls}">${v.pass ? 'PASS' : 'FLAG'}</span>
+      <strong>${esc(r.persona)}</strong><span class="score">score ${Number(v.score ?? 0).toFixed(2)}</span></div>
+      <p>${esc(v.summary)}</p>${issues}${foot}</div>`;
+  }).join('');
+  return `<section class="capture">
+    <div class="shot"><img loading="lazy" src="${esc(rel)}" alt="${esc(path.basename(cap.imagePath))}"><div class="cap">${esc(path.basename(cap.imagePath))}</div><p class="ctx">${esc(cap.context)}</p></div>
+    <div class="verdicts">${cards}</div>
+  </section>`;
+}).join('');
+
+const html = `<!doctype html><html><head><meta charset="utf-8"><title>GraphDone — Local VLM Visual Review</title>
+<style>
+body{background:#0b1018;color:#e6edf6;font:14px/1.5 system-ui,sans-serif;margin:0;padding:24px}
+h1{font-size:20px}.muted{color:#7c8aa0}
+.summary{background:#0f1623;border:1px solid #243044;border-radius:10px;padding:12px 16px;display:inline-block;margin-bottom:16px}
+section.capture{display:grid;grid-template-columns:minmax(320px,440px) 1fr;gap:20px;background:#0f1623;border:1px solid #243044;border-radius:12px;padding:16px;margin-bottom:20px}
+.shot img{width:100%;border-radius:8px;border:1px solid #243044}
+.cap{font-weight:600;margin-top:8px}.ctx{color:#7c8aa0;font-size:12px}
+.verdicts{display:grid;grid-template-columns:repeat(auto-fit,minmax(240px,1fr));gap:12px;align-content:start}
+.card{border:1px solid #243044;border-radius:8px;padding:10px;background:#0b1018}
+.card.pass{border-left:3px solid #34d399}.card.flag{border-left:3px solid #fbbf24}
+.chead{display:flex;align-items:center;gap:8px;font-size:13px}.score{margin-left:auto;color:#7c8aa0;font-size:12px}
+.badge{font-size:10px;padding:1px 6px;border-radius:4px;font-weight:700}
+.badge.pass{background:#064e3b;color:#6ee7b7}.badge.flag{background:#78350f;color:#fcd34d}
+.card ul{margin:6px 0 0;padding-left:18px;font-size:12px;color:#cdd6e2}
+.foot{margin-top:8px;font-size:10px;color:#5d6b80;border-top:1px solid #1b2536;padding-top:6px}
+</style></head><body>
+<h1>GraphDone — Local VLM Visual Review</h1>
+<div class="summary"><strong>${passed}/${total}</strong> persona checks passed · avg score <strong>${avgScore}</strong><br>
+<span class="muted">generated ${esc(generatedAt)} · evaluated by a local vision model</span></div>
+${sections}
+<p class="muted">Report-only. "FLAG" is the model's subjective concern from one perspective, not a hard failure — use it to spot real UX/rendering regressions worth a human look.</p>
+</body></html>`;
+
+fs.writeFileSync(OUT, html);
+console.log(`✅ VLM review report: ${OUT} (${total} evaluations, ${passed} pass)`);
diff --git a/tests/helpers/seedGraph.ts b/tests/helpers/seedGraph.ts
new file mode 100644
index 00000000..2635822e
--- /dev/null
+++ b/tests/helpers/seedGraph.ts
@@ -0,0 +1,151 @@
+import { Page } from '@playwright/test';
+
+/**
+ * Seeds realistically-shaped graphs of arbitrary size through the real GraphQL
+ * API (the same path a human or AI uses), so the perf sweep measures the true
+ * stack — Neo4j + Apollo + the web force simulation — not a synthetic shortcut.
+ *
+ * Nodes are spread on a grid (real positions, not all stacked at the origin),
+ * statuses/types/priorities are varied so living-graph effects and priority
+ * glow actually exercise, and edges form a connected backbone plus extra links
+ * to hit a target edge:node ratio. Edges are created as Edge nodes (the
+ * canonical model the web renders). Everything batches to stay within request
+ * limits, and cleanup deletes edges before nodes (orphan edges break the whole
+ * edges query).
+ */
+
+export interface SeededGraph {
+  graphId: string;
+  nodeIds: string[];
+  edgeCount: number;
+}
+
+async function gql<T = any>(page: Page, query: string, variables?: unknown): Promise<T> {
+  return page.evaluate(
+    async ({ query, variables }) => {
+      const token = localStorage.getItem('authToken') ?? '';
+      const res = await fetch('/api/graphql', {
+        method: 'POST',
+        headers: { 'Content-Type': 'application/json', Authorization: `Bearer ${token}` },
+        body: JSON.stringify({ query, variables }),
+      });
+      const body = await res.json();
+      if (body.errors) throw new Error(body.errors[0]?.message ?? 'GraphQL error');
+      return body.data;
+    },
+    { query, variables }
+  );
+}
+
+const STATUSES = ['PROPOSED', 'IN_PROGRESS', 'BLOCKED', 'COMPLETED'] as const;
+const TYPES = ['TASK', 'BUG', 'FEATURE', 'MILESTONE', 'OUTCOME'] as const;
+const EDGE_TYPES = ['DEPENDS_ON', 'BLOCKS', 'RELATES_TO'] as const;
+
+function chunk<T>(arr: T[], n: number): T[][] {
+  const out: T[][] = [];
+  for (let i = 0; i < arr.length; i += n) out.push(arr.slice(i, i + n));
+  return out;
+}
+
+export interface SeedOptions {
+  size: number;
+  /** edges ≈ edgeFactor * size (default 1.4). */
+  edgeFactor?: number;
+  /** grid spacing in px (default 130). */
+  spacing?: number;
+  namePrefix?: string;
+}
+
+export async function seedLargeGraph(page: Page, opts: SeedOptions): Promise<SeededGraph> {
+  const { size, edgeFactor = 1.4, spacing = 130, namePrefix = 'Scale' } = opts;
+  const me = await gql(page, '{ me { id } }');
+  const userId = me.me.id;
+
+  const g = await gql(
+    page,
+    `mutation($input: [GraphCreateInput!]!) { createGraphs(input: $input) { graphs { id } } }`,
+    { input: [{ name: `${namePrefix} ${size}n ${Date.now()}`, type: 'PROJECT', status: 'ACTIVE', createdBy: userId, isShared: true }] }
+  );
+  const graphId = g.createGraphs.graphs[0].id as string;
+
+  // Grid layout centered on the origin so the sim starts from a real arrangement.
+  const cols = Math.ceil(Math.sqrt(size));
+  const half = (cols * spacing) / 2;
+  const nodeInputs = Array.from({ length: size }, (_, i) => {
+    const col = i % cols;
+    const row = Math.floor(i / cols);
+    // Deterministic pseudo-variety without Math.random (kept reproducible).
+    const status = STATUSES[i % STATUSES.length];
+    const type = TYPES[(i * 7) % TYPES.length];
+    const priority = ((i * 37) % 100) / 100;
+    return {
+      type,
+      title: `${type} ${i}`,
+      status,
+      priority,
+      positionX: col * spacing - half,
+      positionY: row * spacing - half,
+      positionZ: 0,
+      owner: { connect: { where: { node: { id: userId } } } },
+      graph: { connect: { where: { node: { id: graphId } } } },
+    };
+  });
+
+  const nodeIds: string[] = [];
+  for (const batch of chunk(nodeInputs, 100)) {
+    const res = await gql(
+      page,
+      `mutation($input: [WorkItemCreateInput!]!) { createWorkItems(input: $input) { workItems { id } } }`,
+      { input: batch }
+    );
+    for (const w of res.createWorkItems.workItems) nodeIds.push(w.id);
+  }
+
+  // Backbone chain guarantees connectivity; extra forward links add realism.
+  const targetEdges = Math.round(size * edgeFactor);
+  const edgeInputs: Array<Record<string, unknown>> = [];
+  const link = (a: string, b: string, t: string) =>
+    edgeInputs.push({
+      type: t,
+      weight: 0.5 + ((edgeInputs.length % 5) / 10),
+      source: { connect: { where: { node: { id: a } } } },
+      target: { connect: { where: { node: { id: b } } } },
+    });
+  for (let i = 0; i + 1 < nodeIds.length; i++) link(nodeIds[i], nodeIds[i + 1], 'DEPENDS_ON');
+  let extra = targetEdges - edgeInputs.length;
+  for (let i = 0; i < nodeIds.length && extra > 0; i++) {
+    const jump = 2 + ((i * 5) % Math.max(2, Math.floor(nodeIds.length / 4)));
+    const j = i + jump;
+    if (j < nodeIds.length) {
+      link(nodeIds[i], nodeIds[j], EDGE_TYPES[i % EDGE_TYPES.length]);
+      extra--;
+    }
+  }
+
+  let edgeCount = 0;
+  for (const batch of chunk(edgeInputs, 100)) {
+    const res = await gql(
+      page,
+      `mutation($input: [EdgeCreateInput!]!) { createEdges(input: $input) { edges { id } } }`,
+      { input: batch }
+    );
+    edgeCount += res.createEdges.edges.length;
+  }
+
+  return { graphId, nodeIds, edgeCount };
+}
+
+export async function deleteGraphDeep(page: Page, graphId: string): Promise<void> {
+  // Edges first (orphan edges break the edges query), then nodes, then graph.
+  await gql(
+    page,
+    `mutation($id: ID!) { deleteEdges(where: { source: { graph: { id: $id } } }) { nodesDeleted } }`,
+    { id: graphId }
+  ).catch(() => {});
+  await gql(
+    page,
+    `mutation($id: ID!) { deleteWorkItems(where: { graph: { id: $id } }) { nodesDeleted } }`,
+    { id: graphId }
+  ).catch(() => {});
+  await gql(page, `mutation($id: ID!) { deleteGraphs(where: { id: $id }) { nodesDeleted } }`, { id: graphId }).catch(() => {});
+}
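Review note on the edge planning in seedLargeGraph: the sketch below reproduces the backbone-plus-forward-links arithmetic over plain indices, so the resulting edge count can be checked without a running API. `planEdges` and `PlannedEdge` are hypothetical names used only for illustration; the jump formula is copied from the helper above.

```typescript
// Hypothetical, index-only version of the edge planning in seedLargeGraph.
type PlannedEdge = { from: number; to: number; kind: 'backbone' | 'extra' };

function planEdges(size: number, edgeFactor = 1.4): PlannedEdge[] {
  const target = Math.round(size * edgeFactor);
  const edges: PlannedEdge[] = [];
  // Backbone chain i -> i + 1 keeps the graph connected.
  for (let i = 0; i + 1 < size; i++) edges.push({ from: i, to: i + 1, kind: 'backbone' });
  // At most one forward link per node, in a single pass, until the target is met.
  let extra = target - edges.length;
  for (let i = 0; i < size && extra > 0; i++) {
    const jump = 2 + ((i * 5) % Math.max(2, Math.floor(size / 4)));
    const j = i + jump;
    if (j < size) {
      edges.push({ from: i, to: j, kind: 'extra' });
      extra--;
    }
  }
  return edges;
}

const plan10 = planEdges(10); // 9 backbone + 5 extra = 14 = round(10 * 1.4)
const plan2 = planEdges(2); // only the backbone edge: no forward link fits
```

For very small graphs the single pass cannot place every extra link, so the seeded edge count can fall short of `edgeFactor * size`; that is presumably why the helper returns the measured `edgeCount` rather than the target.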
diff --git a/tests/helpers/testEnv.ts b/tests/helpers/testEnv.ts
new file mode 100644
index 00000000..f29c8c80
--- /dev/null
+++ b/tests/helpers/testEnv.ts
@@ -0,0 +1,42 @@
+import * as fs from 'fs';
+import * as path from 'path';
+import dotenv from 'dotenv';
+
+/**
+ * Loads local-only test configuration from `.env.test.local` (gitignored) into
+ * process.env, without ever baking secrets or hostnames into the repo. Import
+ * this for its side effect at the top of any spec/generator that needs the VLM
+ * endpoints or sweep config:
+ *
+ *   import '../helpers/testEnv';
+ *
+ * Safe to import everywhere — it's a no-op when the file is absent (e.g. CI),
+ * so VLM-driven suites skip cleanly. Existing process.env values win, so you
+ * can still override per-run on the command line.
+ */
+const localEnvPath = path.resolve(process.cwd(), '.env.test.local');
+if (fs.existsSync(localEnvPath)) {
+  dotenv.config({ path: localEnvPath });
+}
+
+/** Comma-separated env list -> trimmed non-empty string[]. */
+export function envList(name: string): string[] {
+  return (process.env[name] ?? '')
+    .split(',')
+    .map((s) => s.trim())
+    .filter(Boolean);
+}
+
+/** Parse a comma-separated list of positive integers (sweep sizes). */
+export function envIntList(name: string): number[] {
+  return envList(name)
+    .map((s) => Number.parseInt(s, 10))
+    .filter((n) => Number.isFinite(n) && n > 0);
+}
+
+export function envNumber(name: string, fallback: number): number {
+  const raw = process.env[name];
+  if (raw === undefined || raw.trim() === '') return fallback;
+  const n = Number(raw);
+  return Number.isFinite(n) ? n : fallback;
+}
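Review note on the parsing helpers: a minimal sketch of how the three parsers treat messy input. The `parse*` functions are hypothetical pure variants that take the raw string instead of reading `process.env`; the parsing rules are the same as in the helpers above.

```typescript
// Hypothetical pure variants of envList / envIntList / envNumber.
function parseList(raw: string | undefined): string[] {
  return (raw ?? '')
    .split(',')
    .map((s) => s.trim())
    .filter(Boolean);
}

function parseIntList(raw: string | undefined): number[] {
  return parseList(raw)
    .map((s) => Number.parseInt(s, 10))
    .filter((n) => Number.isFinite(n) && n > 0);
}

function parseNumber(raw: string | undefined, fallback: number): number {
  if (raw === undefined || raw.trim() === '') return fallback;
  const n = Number(raw);
  return Number.isFinite(n) ? n : fallback;
}

// Empty entries, non-numeric entries, and non-positive values are dropped.
const sizes = parseIntList(' 50, 200,,abc, -5, 1000 '); // [50, 200, 1000]
// Only commas split the list; inner whitespace is kept.
const hosts = parseList('a b, c'); // ['a b', 'c']
// Number() accepts exponent notation; unparsable input falls back.
const timeoutMs = parseNumber('12e4', 3); // 120000
const badValue = parseNumber('fast', 3); // 3
```

Because `Number.parseInt` stops at the first non-digit, an entry such as `500n` would be accepted as `500`; a stricter parser would need an explicit digits-only check.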
diff --git a/tests/helpers/vlm.ts b/tests/helpers/vlm.ts
new file mode 100644
index 00000000..aeb4db24
--- /dev/null
+++ b/tests/helpers/vlm.ts
@@ -0,0 +1,331 @@
+import * as fs from 'fs';
+import './testEnv';
+import { envList, envNumber } from './testEnv';
+
+/**
+ * Protocol-agnostic client for LOCAL Vision-Language-Model servers.
+ *
+ * The actual endpoints (GPU workstations) live only in `.env.test.local`
+ * (gitignored) as VLM_ENDPOINTS — never in the repo. Requests are round-robined
+ * across every configured endpoint so visual evaluation spreads over all GPUs.
+ *
+ * Two wire protocols are supported and auto-detected per endpoint:
+ *   - OpenAI-compatible: POST /v1/chat/completions with image_url data URIs
+ *     (vLLM, LM Studio, llama.cpp server, Ollama's /v1 compat shim)
+ *   - Ollama native:     POST /api/chat with a base64 images[] array
+ *
+ * Everything degrades gracefully: when VLM_ENDPOINTS is unset or no endpoint is
+ * reachable, isVlmAvailable() is false and suites skip — so CI stays green.
+ */
+
+export type VlmProtocol = 'openai' | 'ollama';
+
+export interface VlmVerdict {
+  pass: boolean;
+  score: number; // 0..1
+  issues: string[];
+  summary: string;
+  raw?: string; // raw model text, for the report when parsing is imperfect
+  endpoint?: string; // which box judged this (honesty in the report)
+  model?: string;
+  latencyMs?: number;
+}
+
+export interface Persona {
+  key: string;
+  label: string;
+  /** Framing for the model — who it is and what it cares about. */
+  system: string;
+  /** What "pass" means, appended to every prompt for this persona. */
+  rubric: string;
+}
+
+/**
+ * The evaluation perspectives. Each judges a rendered screenshot from a
+ * distinct point of view, so one capture yields several independent reads.
+ */
+export const PERSONAS: Persona[] = [
+  {
+    key: 'visual-defects',
+    label: 'Visual defects',
+    system:
+      'You are a meticulous UI rendering QA inspector for a graph-visualization web app. ' +
+      'You judge ONLY what is visible in the screenshot — objective rendering correctness.',
+    rubric:
+      'Fail if you see: nodes overlapping so labels are unreadable, nodes/text cut off at the edges, ' +
+      'a broken or empty layout where content is expected, edges that clearly do not connect nodes, ' +
+      'obvious visual glitches, or any error message / "Error" badge / blank red state. ' +
+      'Pass if the graph (or its empty-state invitation) renders cleanly and legibly.',
+  },
+  {
+    key: 'new-user',
+    label: 'New-user clarity',
+    system:
+      'You are a first-time user who has never seen this product. You are evaluating whether the ' +
+      'screen is clear, inviting, and self-explanatory.',
+    rubric:
+      'Fail if you would feel lost or could not tell what to do next, or the screen looks intimidating ' +
+      'or cluttered to a newcomer. Pass if the purpose is clear and there is an obvious next action.',
+  },
+  {
+    key: 'accessibility',
+    label: 'Accessibility',
+    system:
+      'You are an accessibility reviewer judging a rendered screenshot for visual a11y.',
+    rubric:
+      'Fail if text contrast looks too low to read, text is too small, information is conveyed by color ' +
+      'alone, or interactive targets look too small to tap. Pass if it appears broadly legible and usable.',
+  },
+  {
+    key: 'living-graph',
+    label: 'Living-graph aliveness',
+    system:
+      'You evaluate whether a graph visualization feels "alive" and communicates work status. Nodes may ' +
+      'glow by priority, pulse/breathe when in progress, look settled when complete, or ache when blocked; ' +
+      'edges may show energy flow.',
+    rubric:
+      'Fail if the graph looks completely static/flat with no visual hierarchy or status cues, or if the ' +
+      'effects look chaotic/noisy rather than purposeful. Pass if status and priority read clearly and the ' +
+      'scene feels alive but legible. (Judge the single frame; do not penalize lack of motion in a still.)',
+  },
+];
+
+export const personaByKey = (key: string): Persona | undefined =>
+  PERSONAS.find((p) => p.key === key);
+
+const TIMEOUT_MS = envNumber('VLM_TIMEOUT_MS', 120_000);
+const MAX_CONCURRENCY = Math.max(1, envNumber('VLM_MAX_CONCURRENCY', 3));
+
+let rrCounter = 0;
+const protocolCache = new Map<string, VlmProtocol>();
+const modelCache = new Map<string, string>();
+
+export function vlmEndpoints(): string[] {
+  return envList('VLM_ENDPOINTS').map((e) => e.replace(/\/+$/, ''));
+}
+
+export function vlmModel(): string {
+  return (process.env.VLM_MODEL ?? '').trim();
+}
+
+/**
+ * Resolve the model id for an endpoint. With a single shared model, set
+ * VLM_MODEL; with boxes serving DIFFERENT models, set VLM_MODEL=auto so each
+ * endpoint's own loaded model is used (blank makes isVlmConfigured() false, so
+ * suites skip). llama.cpp serves one model and ignores the field, but sending
+ * the right id keeps logs honest and works with multi-model servers too.
+ */
+async function resolveModel(base: string, protocol: VlmProtocol): Promise<string> {
+  const configured = vlmModel();
+  if (configured && configured.toLowerCase() !== 'auto') return configured;
+  if (modelCache.has(base)) return modelCache.get(base)!;
+  let id = 'default';
+  try {
+    if (protocol === 'openai') {
+      const r = await fetchWithTimeout(`${base}/v1/models`, { headers: authHeaders() }, 5000);
+      const d = await r.json();
+      id = d?.data?.[0]?.id ?? d?.models?.[0]?.name ?? 'default';
+    } else {
+      const r = await fetchWithTimeout(`${base}/api/tags`, {}, 5000);
+      const d = await r.json();
+      id = d?.models?.[0]?.name ?? d?.models?.[0]?.model ?? 'default';
+    }
+  } catch { /* keep default */ }
+  modelCache.set(base, id);
+  return id;
+}
+
+export function isVlmConfigured(): boolean {
+  return vlmEndpoints().length > 0 && vlmModel().length > 0;
+}
+
+function authHeaders(): Record<string, string> {
+  const key = (process.env.VLM_API_KEY ?? '').trim();
+  return key ? { Authorization: `Bearer ${key}` } : {};
+}
+
+async function fetchWithTimeout(url: string, init: RequestInit, timeoutMs = TIMEOUT_MS): Promise<Response> {
+  const controller = new AbortController();
+  const t = setTimeout(() => controller.abort(), timeoutMs);
+  try {
+    return await fetch(url, { ...init, signal: controller.signal });
+  } finally {
+    clearTimeout(t);
+  }
+}
+
+/** Detect (and cache) the wire protocol for a single endpoint. */
+async function detectProtocol(base: string): Promise<VlmProtocol | null> {
+  const forced = (process.env.VLM_PROTOCOL ?? 'auto').trim().toLowerCase();
+  if (forced === 'openai' || forced === 'ollama') return forced;
+  if (protocolCache.has(base)) return protocolCache.get(base)!;
+  // OpenAI-compatible servers expose /v1/models.
+  try {
+    const r = await fetchWithTimeout(`${base}/v1/models`, { headers: authHeaders() }, 5000);
+    if (r.ok) { protocolCache.set(base, 'openai'); return 'openai'; }
+  } catch { /* try next */ }
+  // Ollama exposes /api/tags.
+  try {
+    const r = await fetchWithTimeout(`${base}/api/tags`, {}, 5000);
+    if (r.ok) { protocolCache.set(base, 'ollama'); return 'ollama'; }
+  } catch { /* unreachable */ }
+  return null;
+}
+
+/** Configured endpoints with their protocol; reachability is probed only when VLM_PROTOCOL=auto. */
+export async function reachableEndpoints(): Promise<Array<{ base: string; protocol: VlmProtocol }>> {
+  const out: Array<{ base: string; protocol: VlmProtocol }> = [];
+  await Promise.all(
+    vlmEndpoints().map(async (base) => {
+      const protocol = await detectProtocol(base);
+      if (protocol) out.push({ base, protocol });
+    })
+  );
+  return out;
+}
+
+let availabilityCache: boolean | null = null;
+/** True only if VLM is configured and at least one endpoint responds. */
+export async function isVlmAvailable(): Promise<boolean> {
+  if (!isVlmConfigured()) return false;
+  if (availabilityCache !== null) return availabilityCache;
+  availabilityCache = (await reachableEndpoints()).length > 0;
+  return availabilityCache;
+}
+
+function extractVerdict(text: string): VlmVerdict {
+  // Models wrap JSON in prose or code fences; take the first '{' through the last '}'.
+  let parsed: Record<string, unknown> | null = null;
+  const fence = text.match(/```(?:json)?\s*([\s\S]*?)```/i);
+  const candidate = fence ? fence[1] : text;
+  const start = candidate.indexOf('{');
+  const end = candidate.lastIndexOf('}');
+  if (start !== -1 && end > start) {
+    try { parsed = JSON.parse(candidate.slice(start, end + 1)); } catch { /* fall through */ }
+  }
+  if (!parsed) {
+    return { pass: false, score: 0, issues: ['Could not parse a JSON verdict from the model'], summary: text.slice(0, 300), raw: text };
+  }
+  const issuesRaw = parsed.issues;
+  const issues = Array.isArray(issuesRaw) ? issuesRaw.map((i) => String(i)) : issuesRaw ? [String(issuesRaw)] : [];
+  let score = Number(parsed.score);
+  if (!Number.isFinite(score)) score = parsed.pass ? 1 : 0;
+  if (score > 1) score = score / 100; // tolerate 0-100 scales
+  return {
+    pass: parsed.pass === true || parsed.pass === 'true', // Boolean('false') would be true
+    score: Math.max(0, Math.min(1, score)),
+    issues,
+    summary: String(parsed.summary ?? '').slice(0, 600),
+    raw: text,
+  };
+}
+
+const PROMPT_TAIL =
+  'Respond with ONLY a JSON object, no prose, of exactly this shape: ' +
+  '{"pass": boolean, "score": number between 0 and 1, "issues": string[], "summary": string}. ' +
+  'Keep issues short and specific. Be fair: this is a still frame.';
+
+function buildPrompt(persona: Persona, context: string): string {
+  return `Context: this screenshot shows ${context}.\n\n${persona.rubric}\n\n${PROMPT_TAIL}`;
+}
+
+async function callOpenAI(base: string, model: string, system: string, prompt: string, b64: string): Promise<string> {
+  const r = await fetchWithTimeout(`${base}/v1/chat/completions`, {
+    method: 'POST',
+    headers: { 'Content-Type': 'application/json', ...authHeaders() },
+    body: JSON.stringify({
+      model,
+      temperature: 0,
+      max_tokens: 400,
+      messages: [
+        { role: 'system', content: system },
+        {
+          role: 'user',
+          content: [
+            { type: 'text', text: prompt },
+            { type: 'image_url', image_url: { url: `data:image/png;base64,${b64}` } },
+          ],
+        },
+      ],
+    }),
+  });
+  if (!r.ok) throw new Error(`OpenAI VLM ${base} HTTP ${r.status}: ${(await r.text()).slice(0, 200)}`);
+  const data = await r.json();
+  return data?.choices?.[0]?.message?.content ?? '';
+}
+
+async function callOllama(base: string, model: string, system: string, prompt: string, b64: string): Promise<string> {
+  const r = await fetchWithTimeout(`${base}/api/chat`, {
+    method: 'POST',
+    headers: { 'Content-Type': 'application/json' },
+    body: JSON.stringify({
+      model,
+      stream: false,
+      options: { temperature: 0 },
+      messages: [
+        { role: 'system', content: system },
+        { role: 'user', content: prompt, images: [b64] },
+      ],
+    }),
+  });
+  if (!r.ok) throw new Error(`Ollama VLM ${base} HTTP ${r.status}: ${(await r.text()).slice(0, 200)}`);
+  const data = await r.json();
+  return data?.message?.content ?? '';
+}
+
+/**
+ * Evaluate one screenshot from one persona's perspective. Round-robins across
+ * reachable endpoints. Never throws — failures come back as a non-pass verdict
+ * so the report is always complete.
+ */
+export async function evaluateImage(
+  imagePath: string,
+  persona: Persona,
+  context: string,
+  endpoints?: Array<{ base: string; protocol: VlmProtocol }>
+): Promise<VlmVerdict> {
+  const eps = endpoints ?? (await reachableEndpoints());
+  if (eps.length === 0) {
+    return { pass: false, score: 0, issues: ['No reachable VLM endpoint'], summary: '' };
+  }
+  const { base, protocol } = eps[rrCounter++ % eps.length];
+  const prompt = buildPrompt(persona, context);
+  const started = Date.now();
+  try {
+    const model = await resolveModel(base, protocol);
+    const b64 = fs.readFileSync(imagePath).toString('base64');
+    const text =
+      protocol === 'openai'
+        ? await callOpenAI(base, model, persona.system, prompt, b64)
+        : await callOllama(base, model, persona.system, prompt, b64);
+    const v = extractVerdict(text);
+    return { ...v, endpoint: base, model, latencyMs: Date.now() - started };
+  } catch (err) {
+    return {
+      pass: false,
+      score: 0,
+      issues: [`VLM request failed: ${err instanceof Error ? err.message : String(err)}`],
+      summary: '',
+      endpoint: base,
+      latencyMs: Date.now() - started,
+    };
+  }
+}
+
+/** Run a batch of {imagePath, persona, context} jobs with bounded concurrency. */
+export async function evaluateBatch(
+  jobs: Array<{ imagePath: string; persona: Persona; context: string; meta?: Record<string, unknown> }>
+): Promise<Array<{ persona: string; context: string; imagePath: string; verdict: VlmVerdict; meta?: Record<string, unknown> }>> {
+  const eps = await reachableEndpoints();
+  const results: Array<{ persona: string; context: string; imagePath: string; verdict: VlmVerdict; meta?: Record<string, unknown> }> = [];
+  let idx = 0;
+  async function worker() {
+    while (idx < jobs.length) {
+      const job = jobs[idx++];
+      const verdict = await evaluateImage(job.imagePath, job.persona, job.context, eps);
+      results.push({ persona: job.persona.key, context: job.context, imagePath: job.imagePath, verdict, meta: job.meta });
+    }
+  }
+  await Promise.all(Array.from({ length: Math.min(MAX_CONCURRENCY, jobs.length) }, worker));
+  return results;
+}
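Review note on verdict parsing: the sketch below isolates the JSON-recovery step of extractVerdict (optional code fence, then the span from the first `{` to the last `}`) so its limits are easy to see. `extractJsonObject`, `FENCE`, and the sample replies are hypothetical; the fence marker is built at runtime only so the example does not contain a literal triple backtick.

```typescript
const FENCE = '`'.repeat(3);
const FENCED_BODY = new RegExp(FENCE + '(?:json)?\\s*([\\s\\S]*?)' + FENCE, 'i');

// Hypothetical helper: returns the parsed object, or null when none can be recovered.
function extractJsonObject(text: string): Record<string, unknown> | null {
  // Prefer the body of the first code fence, else search the whole reply.
  const fenced = text.match(FENCED_BODY);
  const candidate = fenced ? fenced[1] : text;
  const start = candidate.indexOf('{');
  const end = candidate.lastIndexOf('}');
  if (start === -1 || end <= start) return null;
  try {
    const parsed: unknown = JSON.parse(candidate.slice(start, end + 1));
    return typeof parsed === 'object' && parsed !== null && !Array.isArray(parsed)
      ? (parsed as Record<string, unknown>)
      : null;
  } catch {
    return null;
  }
}

const fencedReply = 'Sure!\n' + FENCE + 'json\n{"pass": true, "score": 0.9}\n' + FENCE;
const proseReply = 'Verdict: {"pass": false, "score": 20} as requested.';
const twoObjects = 'First {"a": 1} then {"b": 2}';

const fromFence = extractJsonObject(fencedReply); // { pass: true, score: 0.9 }
const fromProse = extractJsonObject(proseReply); // { pass: false, score: 20 }
const fromTwo = extractJsonObject(twoObjects); // null: the span covers both objects
```

The first-to-last-brace rule is simple and tolerates nested objects, but a reply that contains two separate objects yields an unparsable span and therefore a "could not parse" verdict.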
diff --git a/tests/perf/scale-sweep.spec.ts b/tests/perf/scale-sweep.spec.ts
new file mode 100644
index 00000000..2ad753e9
--- /dev/null
+++ b/tests/perf/scale-sweep.spec.ts
@@ -0,0 +1,224 @@
+import { test, expect, Page } from '@playwright/test';
+import * as fs from 'fs';
+import * as path from 'path';
+import { login, TEST_USERS } from '../helpers/auth';
+import { seedLargeGraph, deleteGraphDeep } from '../helpers/seedGraph';
+import '../helpers/testEnv';
+import { envIntList, envList } from '../helpers/testEnv';
+
+/**
+ * Large-scale graph creation + performance metric sweep. Seeds real graphs of
+ * increasing size through the GraphQL API, loads each in the browser at one or
+ * more quality tiers, and records the in-app PerfMeter/DriftMeter readings
+ * (window.__graphPerf) plus settle time, load time, and query latency.
+ *
+ * Report-only: writes one JSON per (size, quality) under
+ * test-artifacts/scale-sweep/, which `npm run report:perf` renders into a table
+ * + charts. It does NOT fail on thresholds — the goal is a metric sweep, not a
+ * gate (the @perf budgets spec is the gate). The only hard assertion is that a
+ * seeded graph actually renders, so a silent breakage still surfaces.
+ *
+ * Sizes/qualities come from env (.env.test.local) so you can push it hard
+ * locally; CI uses a small set just to keep the harness honest.
+ */
+
+const SIZES = (() => {
+  const fromEnv = envIntList('SCALE_SWEEP_SIZES');
+  if (fromEnv.length) return fromEnv;
+  return process.env.CI ? [40, 120] : [50, 200, 500, 1000, 2000];
+})();
+
+const QUALITIES = (() => {
+  const fromEnv = envList('SCALE_SWEEP_QUALITIES').map((q) => q.toUpperCase());
+  const valid = fromEnv.filter((q) => ['LOW', 'MEDIUM', 'HIGH', 'ULTRA'].includes(q));
+  if (valid.length) return valid;
+  return process.env.CI ? ['HIGH'] : ['HIGH', 'ULTRA'];
+})();
+
+const OUT_DIR = path.resolve(process.cwd(), 'test-artifacts/scale-sweep');
+const SETTLE_BUDGET_MS = 30_000;
+const REST_ALPHA = 0.02;
+
+interface SweepResult {
+  size: number;
+  quality: string;
+  seededNodes: number;
+  seededEdges: number;
+  renderedNodes: number;
+  renderedEdges: number;
+  loadMs: number; // time from reload to first node painted
+  interactionFps: number; // RELIABLE: rendered frames/sec while dragging the graph
+  settleMs: number | null; // time to reach REST_ALPHA (null = never settled within budget)
+  finalAlpha: number;
+  avgTickMs: number;
+  p95TickMs: number;
+  fps: number;
+  droppedFrames: number;
+  rmsFromSavedPx: number;
+  maxStepPx: number;
+  queryP95Ms: number;
+  timestampISO: string;
+}
+
+async function measure(page: Page, graphId: string, size: number, quality: string): Promise<SweepResult> {
+  await page.setViewportSize({ width: 1920, height: 1080 });
+  await page.evaluate(
+    ({ gid, q }) => {
+      localStorage.setItem('currentGraphId', gid);
+      localStorage.setItem('graphdone.quality.override', q);
+    },
+    { gid: graphId, q: quality }
+  );
+
+  const t0 = Date.now();
+  await page.reload();
+  // Load time = first node painted.
+  await page.locator('.graph-container svg .node').first().waitFor({ timeout: 60_000 }).catch(() => {});
+  const loadMs = Date.now() - t0;
+
+  // Seeded nodes carry saved grid positions, so the app pins them and the force
+  // sim sits idle — PerfMeter (window.__graphPerf, published only every ~2s
+  // WHILE ticking) then never reports. We hold a node and drag it continuously
+  // for a few seconds: d3 keeps alphaTarget>0 while dragging, so the sim ticks
+  // the whole time and the meter publishes real UNDER-INTERACTION samples (tick
+  // cost / fps at this scale, a realistic "dragging a big graph" metric). We
+  // keep the worst (highest-tick) live sample, then release and time the settle.
+  const box = await page.evaluate(() => {
+    const n = document.querySelector('.graph-container svg .node .node-bg') as Element | null;
+    if (!n) return null;
+    const r = n.getBoundingClientRect();
+    return { x: r.x + r.width / 2, y: r.y + r.height / 2 };
+  });
+
+  // Reliable interaction FPS: count real rendered frames (requestAnimationFrame)
+  // over a fixed wall-clock window while dragging. This needs no app
+  // instrumentation, so it works at every size — when the main thread is busy
+  // ticking a huge sim, rAF visibly drops, which is exactly the scaling signal.
+  let lastNonNull: any = null;
+  const samples: any[] = [];
+  let interactionFps = -1;
+  if (box) {
+    await page.evaluate(() => {
+      (window as any).__fc = 0;
+      const loop = () => { (window as any).__fc++; (window as any).__rafId = requestAnimationFrame(loop); };
+      (window as any).__rafId = requestAnimationFrame(loop);
+    });
+    await page.mouse.move(box.x, box.y).catch(() => {});
+    await page.mouse.down().catch(() => {});
+    const dragStart = Date.now();
+    let a = 0;
+    while (Date.now() - dragStart < 6000) {
+      a += 0.6;
+      await page.mouse.move(box.x + Math.cos(a) * 70, box.y + Math.sin(a) * 55).catch(() => {});
+      await page.waitForTimeout(120);
+      const cur = await page.evaluate(() => (window as any).__graphPerf ?? null);
+      if (cur) { lastNonNull = cur; samples.push(cur); }
+    }
+    await page.mouse.up().catch(() => {});
+    const frames = await page.evaluate(() => { cancelAnimationFrame((window as any).__rafId); return (window as any).__fc || 0; });
+    const secs = (Date.now() - dragStart) / 1000;
+    interactionFps = Math.round((frames / secs) * 10) / 10;
+  }
+
+  // Now measure how long it takes to come to rest after the perturbation.
+  let settleMs: number | null = null;
+  const settleStart = Date.now();
+  while (Date.now() - settleStart < SETTLE_BUDGET_MS) {
+    const cur = await page.evaluate(() => (window as any).__graphPerf ?? null);
+    if (cur) {
+      lastNonNull = cur;
+      if (typeof cur.alpha === 'number' && cur.alpha <= REST_ALPHA) {
+        settleMs = Date.now() - settleStart;
+        break;
+      }
+    }
+    await page.waitForTimeout(300);
+  }
+  // Prefer the worst (max) tick seen under interaction — that's the real cost at
+  // scale; a single settled sample understates it.
+  const underLoad = samples.length
+    ? samples.reduce((w, s) => ((s.avgTickMs ?? 0) > (w.avgTickMs ?? 0) ? s : w))
+    : null;
+  const last = underLoad ?? (await page.evaluate(() => (window as any).__graphPerf ?? null)) ?? lastNonNull ?? {};
+
+  const renderedNodes = await page.locator('.graph-container svg .node').count();
+  const renderedEdges = await page.locator('.graph-container svg .edge').count();
+
+  // Query latency a human/AI would feel: a graph-scoped workItems fetch.
+  const queryP95Ms = await page.evaluate(async (gid) => {
+    const token = localStorage.getItem('authToken') ?? '';
+    const times: number[] = [];
+    for (let i = 0; i < 10; i++) {
+      const s = performance.now();
+      await fetch('/api/graphql', {
+        method: 'POST',
+        headers: { 'Content-Type': 'application/json', Authorization: `Bearer ${token}` },
+        body: JSON.stringify({
+          query: `query($w: WorkItemWhere) { workItems(where: $w, options: { limit: 5000 }) { id status type priority } }`,
+          variables: { w: { graph: { id: gid } } },
+        }),
+      }).then((r) => r.json());
+      times.push(performance.now() - s);
+    }
+    times.sort((a, b) => a - b);
+    return Math.round(times[Math.floor(times.length * 0.95)] ?? times[times.length - 1]);
+  }, graphId);
+
+  const spatial = last?.spatial ?? {};
+  fs.mkdirSync(OUT_DIR, { recursive: true });
+  await page.screenshot({ path: path.join(OUT_DIR, `${size}n-${quality}.png`), fullPage: false }).catch(() => {});
+
+  return {
+    size,
+    quality,
+    seededNodes: size,
+    seededEdges: 0, // filled by caller
+    renderedNodes,
+    renderedEdges,
+    loadMs,
+    interactionFps,
+    settleMs,
+    finalAlpha: typeof lastNonNull?.alpha === 'number' ? lastNonNull.alpha : -1,
+    avgTickMs: last?.avgTickMs ?? -1,
+    p95TickMs: last?.p95TickMs ?? -1,
+    fps: last?.fps ?? -1,
+    droppedFrames: last?.droppedFrames ?? -1,
+    rmsFromSavedPx: spatial.rmsFromSavedPx ?? -1,
+    maxStepPx: spatial.maxStepPx ?? -1,
+    queryP95Ms,
+    timestampISO: new Date(t0).toISOString(),
+  };
+}
+
+test.describe('large-scale graph perf sweep @scale', () => {
+  test.describe.configure({ mode: 'serial', timeout: 600_000 });
+
+  for (const size of SIZES) {
+    test(`sweep ${size} nodes`, async ({ page }) => {
+      await login(page, TEST_USERS.ADMIN);
+      await page.waitForTimeout(1500);
+
+      // A FRESH graph per (size, quality): otherwise the second quality loads
+      // the first run's already-settled positions, the sim never ticks, and
+      // PerfMeter never publishes (the -1 / settle=NONE gaps in v1).
+      for (const quality of QUALITIES) {
+        const seeded = await seedLargeGraph(page, { size });
+        try {
+          const result = await measure(page, seeded.graphId, size, quality);
+          result.seededEdges = seeded.edgeCount;
+          fs.mkdirSync(OUT_DIR, { recursive: true });
+          fs.writeFileSync(path.join(OUT_DIR, `${size}n-${quality}.json`), JSON.stringify(result, null, 2));
+          // eslint-disable-next-line no-console
+          console.log(
+            `[scale] ${size}n/${quality}: rendered ${result.renderedNodes}n/${result.renderedEdges}e ` +
+              `load=${result.loadMs}ms dragFps=${result.interactionFps} settle=${result.settleMs ?? 'NONE'}ms ` +
+              `tick=${result.avgTickMs}ms qP95=${result.queryP95Ms}ms`
+          );
+          expect(result.renderedNodes, `graph of ${size} nodes must render some nodes`).toBeGreaterThan(0);
+        } finally {
+          await deleteGraphDeep(page, seeded.graphId);
+        }
+      }
+    });
+  }
+});
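Review note on queryP95Ms: a small sketch of the index rule used for the percentile (sort ascending, take the element at `floor(length * p)`). `percentileByFloorIndex` and the sample arrays are hypothetical and exist only to show what the rule returns for different sample counts.

```typescript
// Hypothetical helper mirroring the index rule used for queryP95Ms.
function percentileByFloorIndex(samples: number[], p: number): number {
  if (samples.length === 0) return Number.NaN;
  const sorted = [...samples].sort((a, b) => a - b);
  // Falls back to the last element if the computed index is out of range.
  return sorted[Math.floor(sorted.length * p)] ?? sorted[sorted.length - 1];
}

// With 10 samples, floor(10 * 0.95) = 9, so the "p95" is the slowest request.
const tenRuns = [12, 15, 11, 40, 13, 14, 12, 16, 13, 90];
const p95OfTen = percentileByFloorIndex(tenRuns, 0.95); // 90

// With 100 samples the index is 95, i.e. the 96th smallest value.
const hundredRuns = Array.from({ length: 100 }, (_, i) => i + 1);
const p95OfHundred = percentileByFloorIndex(hundredRuns, 0.95); // 96
```

At ten requests per measurement the reported value is effectively the maximum latency, so a single slow request (a cold cache, for instance) dominates the figure; more samples would be needed for it to behave like a true 95th percentile.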