diff --git a/.env.test.example b/.env.test.example new file mode 100644 index 00000000..a8d29355 --- /dev/null +++ b/.env.test.example @@ -0,0 +1,42 @@ +# GraphDone test-pipeline configuration — LOCAL ONLY. +# +# Copy this file to `.env.test.local` (which is gitignored) and fill in your +# own values. NEVER put real hostnames, IPs, or keys in this committed example +# or anywhere else in the repo — the local VLM boxes (GPU workstations) must +# stay out of version control. The test harness auto-loads `.env.test.local`. +# +# cp .env.test.example .env.test.local # then edit .env.test.local + +# --- Local Vision-Language-Model (VLM) endpoints --------------------------- +# Comma-separated base URLs of your local VLM server(s). Requests are +# round-robined across them so visual evaluation is spread over every GPU. +# Leave blank to skip all VLM-driven suites (they no-op cleanly in CI). +# Example shape (use your OWN hosts in .env.test.local, never here): +# VLM_ENDPOINTS=http://:,http://: +VLM_ENDPOINTS= + +# Model id/tag to request (e.g. a llava / qwen2-vl / llama-3.2-vision build). +VLM_MODEL= + +# Optional bearer key for OpenAI-compatible servers that require one. +VLM_API_KEY= + +# Wire protocol: auto (default) | openai | ollama. +# auto — probe each endpoint: /v1/models => OpenAI-compatible, else Ollama. +# openai — POST /v1/chat/completions (vLLM, LM Studio, llama.cpp, Ollama compat) +# ollama — POST /api/chat with an images[] array +VLM_PROTOCOL=auto + +# Max concurrent VLM requests across all endpoints (default 3). +VLM_MAX_CONCURRENCY=3 + +# Per-request timeout in ms — VLMs can be slow on large images (default 120000). +VLM_TIMEOUT_MS=120000 + +# --- Large-scale performance sweep ----------------------------------------- +# Node counts to sweep, comma-separated. Leave blank to use the built-in +# default (small in CI, large locally). Example: 50,200,500,1000,2000 +SCALE_SWEEP_SIZES= + +# Quality tiers to sweep per size (subset of LOW,MEDIUM,HIGH,ULTRA). 
+SCALE_SWEEP_QUALITIES=HIGH,ULTRA diff --git a/docs/SYSTEMS.md b/docs/SYSTEMS.md index 42cdc71e..129e5774 100644 --- a/docs/SYSTEMS.md +++ b/docs/SYSTEMS.md @@ -19,6 +19,8 @@ | Lint | `npm run lint` | 0 errors (warnings allowed) | | Build | `npm run build` | Production build succeeds | | Showcase report | `TEST_URL=http://localhost:3127 npm run report:showcase` | Records .webm video + screenshots of every mode at all 5 resolutions → `test-artifacts/showcase/index.html` (also an every-PR CI artifact). | +| Large-scale perf sweep | `TEST_URL=http://localhost:3127 npm run test:perf:scale` | Seeds graphs of increasing size (50→2000+ nodes) and records `window.__graphPerf` (settle, tick, fps, drift, query p95) across size × quality → `test-artifacts/scale-sweep/index.html`. Report-only; sizes/qualities via `.env.test.local`. See [docs/testing/local-vlm-and-scale.md](./testing/local-vlm-and-scale.md). | +| Local VLM visual review | `TEST_URL=http://localhost:3127 npm run test:vlm` | A locally-hosted vision model judges captured states from 4 perspectives (visual defects, new-user clarity, accessibility, living-graph aliveness) → `test-artifacts/vlm/index.html`. **Skips unless `VLM_ENDPOINTS` is set in the gitignored `.env.test.local`** (CI can't reach local GPUs). Report-only. | **Why THE GATE exists:** a real incident — orphaned `Edge` records made the edges query 500 and the UI showed "Error" with zero edges, while every unit diff --git a/docs/testing/local-vlm-and-scale.md b/docs/testing/local-vlm-and-scale.md new file mode 100644 index 00000000..320a5c04 --- /dev/null +++ b/docs/testing/local-vlm-and-scale.md @@ -0,0 +1,113 @@ +# Local VLM visual review & large-scale performance sweeps + +Two heavier, report-only suites that exercise GraphDone from realistic user +perspectives and at scale. Both are **opt-in and run locally** (or on a +self-hosted runner) because the vision models live on your own GPU boxes — +their hostnames must never enter the repo. 
+ +## TL;DR + +```bash +cp .env.test.example .env.test.local # gitignored — put your real values here +# edit .env.test.local: VLM_ENDPOINTS, VLM_MODEL, (optional) sweep sizes + +./start dev # or have the stack running on :3127 + +TEST_URL=http://localhost:3127 npm run test:perf:scale # → test-artifacts/scale-sweep/index.html +TEST_URL=http://localhost:3127 npm run test:vlm # → test-artifacts/vlm/index.html +``` + +If `VLM_ENDPOINTS` is unset, `test:vlm` **skips cleanly** — so CI and other +developers are never blocked by hardware they don't have. + +## Keeping hostnames out of the repo + +- **Never** commit hostnames, IPs, or keys. The GPU boxes (e.g. an RTX 4090 + workstation and Grace-Blackwell nodes) are referenced only by env vars. +- `.env.test.local` is gitignored (see `.gitignore`). It is the *only* place + your real endpoints live. +- `.env.test.example` is committed and documents the variable **names** with + placeholder hosts (`http://:`). Copy it to `.env.test.local` + and fill in the rest. +- The harness auto-loads `.env.test.local` via `tests/helpers/testEnv.ts`. + +```bash +# .env.test.local (NOT committed) +VLM_ENDPOINTS=http://:,http://:,http://: +VLM_MODEL= +VLM_PROTOCOL=auto # auto | openai | ollama +VLM_MAX_CONCURRENCY=3 +``` + +Multiple endpoints are **round-robined**, so visual evaluation spreads across +every GPU you list. + +## VLM protocol support + +`tests/helpers/vlm.ts` is protocol-agnostic and auto-detects per endpoint: + +| Protocol | Detected via | Request | +|----------|--------------|---------| +| OpenAI-compatible | `GET /v1/models` | `POST /v1/chat/completions` with an `image_url` data URI (vLLM, LM Studio, llama.cpp server, Ollama's `/v1` shim) | +| Ollama native | `GET /api/tags` | `POST /api/chat` with a base64 `images[]` array | + +Force one with `VLM_PROTOCOL=openai` or `ollama`. Each model call asks for a +strict JSON verdict `{pass, score, issues[], summary}`, parsed leniently. 
+ +### Personas + +Each captured screenshot is judged from several perspectives (see `PERSONAS` +in `tests/helpers/vlm.ts`): + +- **Visual defects** — overlapping/cut-off nodes, unreadable labels, broken + layout, missing edges, error chrome. +- **New-user clarity** — is the screen legible and inviting to a newcomer? +- **Accessibility** — contrast, text size, color-only signals, target size. +- **Living-graph aliveness** — do glow/breathe/flow status cues read clearly? + +Report-only: a **FLAG** is the model's subjective concern, surfaced for a human +to look at — it never fails the build. The suite *does* assert the model +answered, so a broken client is still caught. + +## Large-scale perf sweep + +`tests/perf/scale-sweep.spec.ts` seeds real graphs (via the GraphQL API, the +same path a human/AI uses) of increasing size, loads each at one or more +quality tiers, and records the in-app `window.__graphPerf` readings plus load +time, settle time and query latency. + +```bash +# .env.test.local +SCALE_SWEEP_SIZES=50,200,500,1000,2000 # blank => small in CI, large locally +SCALE_SWEEP_QUALITIES=HIGH,ULTRA +``` + +Metrics per (size, quality): + +- **Reliable (measured directly from the browser, captured at every size):** + rendered node/edge counts, initial load ms, graph-scoped query p95, and + **interaction FPS** — real rendered frames/sec while a node is dragged + (counted via `requestAnimationFrame`, so it needs no app instrumentation and + reflects how janky the graph feels under interaction at scale). +- **Best-effort bonus (from the app's `window.__graphPerf`, which only + publishes ~every 2s while the sim ticks):** settle ms (to `alpha ≤ 0.02`), + avg/p95 sim tick ms, layout drift (`rmsFromSavedPx`). These can be blank for + graphs that settle instantly — `interactionFps` is the headline signal. 
+ +A FRESH graph is seeded per (size, quality) so each measurement starts from an +unsettled layout (otherwise the second quality loads the first run's settled, +pinned positions and the sim never ticks). Output: +`test-artifacts/scale-sweep/index.html` — a table plus inline SVG charts of how +each metric scales, with the `@perf` budgets drawn for reference. + +Report-only; the only hard assertion is that a seeded graph actually renders. +Each seeded graph is deleted afterward (edges first, then nodes, then graph). + +## CI + +GitHub-hosted runners can't reach your local GPUs, so neither suite gates +merges there. To gate on them, register a **self-hosted runner** on a machine +that can reach the endpoints, give it the `.env.test.local`, and add a workflow +job (manual-dispatch or nightly) that runs `npm run test:perf:scale` / +`npm run test:vlm`. The scale sweep alone (no VLM) is safe to run on any runner +with the dev stack and a small `SCALE_SWEEP_SIZES`. diff --git a/package.json b/package.json index 3c442d84..b3182854 100644 --- a/package.json +++ b/package.json @@ -37,6 +37,10 @@ "test:smoke": "playwright test tests/e2e/user-smoke.spec.ts --reporter=line", "report:showcase": "playwright test --project=showcase && node tests/generate-showcase-report.mjs", "test:perf": "playwright test --project=perf --reporter=line", + "test:perf:scale": "playwright test --project=perf-scale --reporter=line && node tests/generate-perf-report.mjs", + "report:perf": "node tests/generate-perf-report.mjs", + "test:vlm": "playwright test --project=vlm --reporter=line && node tests/generate-vlm-report.mjs", + "report:vlm": "node tests/generate-vlm-report.mjs", "perf:bundle": "node tests/perf/check-bundle-size.mjs" }, "devDependencies": { diff --git a/playwright.config.ts b/playwright.config.ts index 3a9f10e5..d0d1716b 100644 --- a/playwright.config.ts +++ b/playwright.config.ts @@ -38,9 +38,10 @@ export default defineConfig({ projects: [ { name: 'GraphDone-Core/dev-neo4j/chromium', 
- // The showcase tour runs in its own capture-heavy project below; keep it - // out of the default (fast) project so the smoke gate stays quick. - testIgnore: /showcase\.spec\.ts/, + // The showcase tour and the local-VLM visual eval run in their own + // capture-heavy projects below; keep them out of the default (fast) + // project so the smoke gate stays quick. + testIgnore: [/showcase\.spec\.ts/, /visual-vlm\.spec\.ts/], use: { ...devices['Desktop Chrome'] }, }, @@ -65,9 +66,31 @@ export default defineConfig({ { name: 'perf', testDir: './tests/perf', + // The large-scale sweep is heavy and report-only; it has its own project + // so `test:perf` (the budget gate) stays fast. + testIgnore: /scale-sweep\.spec\.ts/, use: { ...devices['Desktop Chrome'] }, }, + /* Large-scale graph creation + performance metric sweep. Seeds graphs of + * increasing size and records window.__graphPerf across them. Heavy + + * report-only; run via `npm run test:perf:scale`. */ + { + name: 'perf-scale', + testDir: './tests/perf', + testMatch: /scale-sweep\.spec\.ts/, + use: { ...devices['Desktop Chrome'] }, + }, + + /* Local-VLM visual evaluation across personas. Skips unless VLM_ENDPOINTS + * is set in .env.test.local. Run via `npm run test:vlm`. 
*/ + { + name: 'vlm', + testDir: './tests/e2e', + testMatch: /visual-vlm\.spec\.ts/, + use: { ...devices['Desktop Chrome'], screenshot: 'on' }, + }, + // Commented out until browsers installed with system dependencies // { // name: 'GraphDone-Core/dev-neo4j/firefox', diff --git a/tests/e2e/visual-vlm.spec.ts b/tests/e2e/visual-vlm.spec.ts new file mode 100644 index 00000000..654e4ea0 --- /dev/null +++ b/tests/e2e/visual-vlm.spec.ts @@ -0,0 +1,118 @@ +import { test, expect, Page } from '@playwright/test'; +import * as fs from 'fs'; +import * as path from 'path'; +import { login, TEST_USERS } from '../helpers/auth'; +import { seedLargeGraph, deleteGraphDeep } from '../helpers/seedGraph'; +import '../helpers/testEnv'; +import { isVlmAvailable, evaluateBatch, PERSONAS, personaByKey } from '../helpers/vlm'; + +/** + * Local-VLM visual evaluation. Captures key user-facing states, then asks a + * locally-hosted vision model to judge each one from four perspectives + * (visual defects, new-user clarity, accessibility, living-graph aliveness). + * + * Report-only: it never fails on a model's subjective verdict — it writes + * test-artifacts/vlm/results.json for `npm run report:vlm` and prints a + * summary. It only asserts the VLM actually answered (so a broken client is + * still caught). Skips entirely when no VLM endpoint is configured/reachable + * (VLM_ENDPOINTS in .env.test.local), so CI stays green. 
+ */ + +const SHOT_DIR = path.resolve(process.cwd(), 'test-artifacts/vlm/shots'); +const OUT = path.resolve(process.cwd(), 'test-artifacts/vlm/results.json'); +const SCALE_DIR = path.resolve(process.cwd(), 'test-artifacts/scale-sweep'); + +interface Capture { file: string; context: string; personas: string[] } + +async function shot(page: Page, name: string): Promise<string> { + fs.mkdirSync(SHOT_DIR, { recursive: true }); + const file = path.join(SHOT_DIR, `${name}.png`); + await page.screenshot({ path: file, fullPage: false }).catch(() => {}); + return file; +} + +test('VLM visual evaluation across personas @vlm', async ({ page }) => { + test.setTimeout(900_000); + const available = await isVlmAvailable(); + test.skip(!available, 'No reachable VLM endpoint (set VLM_ENDPOINTS in .env.test.local)'); + + const allPersonas = PERSONAS.map((p) => p.key); + const captures: Capture[] = []; + const cleanup: string[] = []; + + await login(page, TEST_USERS.ADMIN); + await page.waitForTimeout(1500); + + // 1. Empty graph — first-run invitation (new-user + visual defects). + const empty = await page.evaluate(async () => { + const token = localStorage.getItem('authToken') ??
''; + const post = (query: string, variables?: unknown) => + fetch('/api/graphql', { method: 'POST', headers: { 'Content-Type': 'application/json', Authorization: `Bearer ${token}` }, body: JSON.stringify({ query, variables }) }).then((r) => r.json()); + const me = await post('{ me { id } }'); + const g = await post(`mutation($i:[GraphCreateInput!]!){createGraphs(input:$i){graphs{id}}}`, { i: [{ name: `VLM Empty ${Date.now()}`, type: 'PROJECT', status: 'ACTIVE', createdBy: me.data.me.id, isShared: true }] }); + return g.data.createGraphs.graphs[0].id as string; + }); + cleanup.push(empty); + await page.setViewportSize({ width: 1440, height: 900 }); + await page.evaluate((id) => localStorage.setItem('currentGraphId', id), empty); + await page.reload(); + await page.waitForTimeout(5000); + captures.push({ file: await shot(page, 'empty-graph-desktop'), context: 'the first-run empty-state of a brand-new project graph in GraphDone, a graph-based task manager', personas: ['visual-defects', 'new-user', 'accessibility'] }); + + // 2. Populated graph at ULTRA quality — full living-graph experience. + const seeded = await seedLargeGraph(page, { size: 60, namePrefix: 'VLM' }); + cleanup.push(seeded.graphId); + await page.evaluate((id) => { localStorage.setItem('currentGraphId', id); localStorage.setItem('graphdone.quality.override', 'ULTRA'); }, seeded.graphId); + await page.reload(); + await page.waitForTimeout(8000); // let it settle + effects run + captures.push({ file: await shot(page, 'populated-desktop'), context: 'a populated project graph (~60 work items) with dependency edges; nodes glow by priority and animate by status (in-progress breathes, blocked aches, complete settles)', personas: allPersonas }); + + // 3. Same graph on a phone viewport — accessibility + new-user on mobile. 
+ await page.setViewportSize({ width: 393, height: 852 }); + await page.reload(); + await page.waitForTimeout(6000); + captures.push({ file: await shot(page, 'populated-mobile'), context: 'the same project graph viewed on a phone-sized screen (393x852)', personas: ['visual-defects', 'new-user', 'accessibility'] }); + + // 4. Bonus: judge the SINGLE largest scale-sweep frame for density/legibility + // (those frames are 1920px and slow on the model; one is enough signal). + if (fs.existsSync(SCALE_DIR)) { + const largest = fs.readdirSync(SCALE_DIR) + .filter((f) => f.endsWith('.png')) + // Strip any non-digit prefix first: parseInt() yields NaN for a name that starts with a letter. + .map((f) => ({ f, size: parseInt(f.replace(/^\D+/, ''), 10) || 0 })) + .sort((a, b) => b.size - a.size)[0]; + if (largest) { + captures.push({ file: path.join(SCALE_DIR, largest.f), context: `a large graph rendered at scale (${largest.size} nodes) — judge whether it stays legible at this density`, personas: ['visual-defects'] }); + } + } + + // Build and run the persona jobs. + const jobs = captures.flatMap((c) => + c.personas + .map((pk) => personaByKey(pk)) + .filter((p): p is NonNullable<typeof p> => Boolean(p)) + .map((persona) => ({ imagePath: c.file, persona, context: c.context, meta: { capture: path.basename(c.file) } })) + ); + + let results: Awaited<ReturnType<typeof evaluateBatch>> = []; + try { + results = await evaluateBatch(jobs); + } finally { + for (const id of cleanup) await deleteGraphDeep(page, id); + } + + fs.mkdirSync(path.dirname(OUT), { recursive: true }); + fs.writeFileSync(OUT, JSON.stringify({ generatedAt: new Date().toISOString(), results }, null, 2)); + + const fails = results.filter((r) => !r.verdict.pass); + // eslint-disable-next-line no-console + console.log(`[vlm] ${results.length} evaluations, ${results.length - fails.length} pass, ${fails.length} flagged:`); + for (const f of fails) { + // eslint-disable-next-line no-console + console.log(` ⚠️ [${f.persona}] ${f.meta?.capture}: ${f.verdict.summary || f.verdict.issues.join('; ')}`); + } + + // Report-only: we assert the VLM produced answers, not what it
concluded. + expect(results.length, 'VLM returned evaluations').toBeGreaterThan(0); + const answered = results.filter((r) => !r.verdict.issues.some((i) => i.startsWith('VLM request failed') || i.startsWith('No reachable'))); + expect(answered.length, 'at least some VLM calls succeeded').toBeGreaterThan(0); +}); diff --git a/tests/generate-perf-report.mjs b/tests/generate-perf-report.mjs new file mode 100644 index 00000000..8cd78d82 --- /dev/null +++ b/tests/generate-perf-report.mjs @@ -0,0 +1,111 @@ +#!/usr/bin/env node +/** + * Renders the large-scale perf sweep into a single self-contained page: + * test-artifacts/scale-sweep/index.html + * + * Input: test-artifacts/scale-sweep/n-.json (from scale-sweep.spec.ts) + * Output: an HTML table of every metric plus inline SVG line charts (no deps, + * no external assets) showing how settle time, tick cost, FPS, drift and query + * latency scale with graph size, per quality tier. + */ +import * as fs from 'fs'; +import * as path from 'path'; + +const DIR = path.resolve(process.cwd(), 'test-artifacts/scale-sweep'); +const OUT = path.join(DIR, 'index.html'); + +if (!fs.existsSync(DIR)) { + console.error(`No sweep results at ${DIR} — run "npm run test:perf:scale" first.`); + process.exit(1); +} + +const rows = fs + .readdirSync(DIR) + .filter((f) => f.endsWith('.json')) + .map((f) => JSON.parse(fs.readFileSync(path.join(DIR, f), 'utf8'))) + .sort((a, b) => a.size - b.size || String(a.quality).localeCompare(b.quality)); + +if (rows.length === 0) { + console.error('No JSON sweep results found.'); + process.exit(1); +} + +const qualities = [...new Set(rows.map((r) => r.quality))]; +const sizes = [...new Set(rows.map((r) => r.size))].sort((a, b) => a - b); +const COLORS = ['#34d399', '#60a5fa', '#f472b6', '#fbbf24', '#a78bfa']; + +const num = (v) => (typeof v === 'number' && v >= 0 ? 
v : null); + +function lineChart(title, key, { unit = '', budget = null } = {}) { + const W = 560, H = 260, PADL = 56, PADB = 36, PADT = 28, PADR = 16; + const series = qualities.map((q) => ({ + q, + pts: sizes.map((s) => { + const row = rows.find((r) => r.size === s && r.quality === q); + return { x: s, y: row ? num(row[key]) : null }; + }).filter((p) => p.y !== null), + })).filter((s) => s.pts.length); + const allY = series.flatMap((s) => s.pts.map((p) => p.y)).concat(budget != null ? [budget] : []); + if (allY.length === 0) return ''; + const maxY = Math.max(...allY) * 1.1 || 1; + const maxX = Math.max(...sizes); + const minX = Math.min(...sizes); + const sx = (x) => PADL + ((x - minX) / (maxX - minX || 1)) * (W - PADL - PADR); + const sy = (y) => H - PADB - (y / maxY) * (H - PADT - PADB); + + const grid = [0, 0.25, 0.5, 0.75, 1].map((f) => { + const y = sy(maxY * f); + return `${Math.round(maxY * f)}`; + }).join(''); + const xticks = sizes.map((s) => `${s}`).join(''); + const budgetLine = budget != null ? `budget ${budget}${unit}` : ''; + const lines = series.map((s, i) => { + const c = COLORS[qualities.indexOf(s.q) % COLORS.length]; + const d = s.pts.map((p, j) => `${j === 0 ? 'M' : 'L'}${sx(p.x).toFixed(1)},${sy(p.y).toFixed(1)}`).join(' '); + const dots = s.pts.map((p) => `${s.q} @ ${p.x}n: ${p.y}${unit}`).join(''); + return `${dots}`; + }).join(''); + const legend = series.map((s, i) => { + const c = COLORS[qualities.indexOf(s.q) % COLORS.length]; + return `● ${s.q}`; + }).join('  '); + return `

${title}

${legend}
${grid}${xticks}${budgetLine}${lines}graph size (nodes)
`; +} + +const HEADERS = [ + ['size', 'nodes'], ['quality', 'quality'], ['renderedNodes', 'rendered n'], ['renderedEdges', 'rendered e'], + ['loadMs', 'load ms'], ['interactionFps', 'drag fps'], ['settleMs', 'settle ms'], ['finalAlpha', 'alpha'], + ['avgTickMs', 'tick ms'], ['p95TickMs', 'tick p95'], ['rmsFromSavedPx', 'drift px'], + ['queryP95Ms', 'query p95'], +]; +const tableRows = rows.map((r) => `${HEADERS.map(([k]) => `${r[k] === null ? '—' : r[k]}`).join('')}`).join(''); + +const html = `GraphDone — Large-Scale Perf Sweep + +

GraphDone — Large-Scale Graph Performance Sweep

+

${rows.length} runs · sizes ${sizes.join(', ')} · qualities ${qualities.join(', ')} · generated ${new Date().toISOString()}

+
+${lineChart('Interaction FPS vs size (drag)', 'interactionFps', { unit: '' })} +${lineChart('Initial load vs size', 'loadMs', { unit: 'ms' })} +${lineChart('Avg simulation tick vs size', 'avgTickMs', { unit: 'ms', budget: 8 })} +${lineChart('Settle time vs size', 'settleMs', { unit: 'ms' })} +${lineChart('Layout drift vs size', 'rmsFromSavedPx', { unit: 'px', budget: 25 })} +${lineChart('Query p95 latency vs size', 'queryP95Ms', { unit: 'ms', budget: 800 })} +
+

All metrics

+${HEADERS.map(([, h]) => ``).join('')}${tableRows}
${h}
+

Report-only. Budgets shown (red dashed) mirror the @perf gate; this sweep characterises how they scale, it does not enforce them.

+`; + +fs.writeFileSync(OUT, html); +console.log(`✅ Perf sweep report: ${OUT} (${rows.length} runs)`); diff --git a/tests/generate-vlm-report.mjs b/tests/generate-vlm-report.mjs new file mode 100644 index 00000000..819a63a0 --- /dev/null +++ b/tests/generate-vlm-report.mjs @@ -0,0 +1,82 @@ +#!/usr/bin/env node +/** + * Renders local-VLM visual evaluations into one self-contained gallery: + * test-artifacts/vlm/index.html + * + * Input: test-artifacts/vlm/results.json (from visual-vlm.spec.ts) + * Output: each captured screenshot with a card per persona verdict + * (pass/flag badge, 0-1 score, summary, issues). No deps, no external assets. + */ +import * as fs from 'fs'; +import * as path from 'path'; + +const VLM_DIR = path.resolve(process.cwd(), 'test-artifacts/vlm'); +const RESULTS = path.join(VLM_DIR, 'results.json'); +const OUT = path.join(VLM_DIR, 'index.html'); + +if (!fs.existsSync(RESULTS)) { + console.error(`No VLM results at ${RESULTS} — run "npm run test:vlm" (with VLM_ENDPOINTS set) first.`); + process.exit(1); +} + +const { generatedAt, results } = JSON.parse(fs.readFileSync(RESULTS, 'utf8')); +const esc = (s) => String(s ?? '').replace(/[&<>]/g, (c) => ({ '&': '&amp;', '<': '&lt;', '>': '&gt;' }[c])); + +// Group verdicts by the screenshot they judged. +const byCapture = new Map(); +for (const r of results) { + const key = r.imagePath; + if (!byCapture.has(key)) byCapture.set(key, { imagePath: r.imagePath, context: r.context, verdicts: [] }); + byCapture.get(key).verdicts.push(r); +} + +const total = results.length; +const passed = results.filter((r) => r.verdict.pass).length; +const avgScore = total ? (results.reduce((a, r) => a + (r.verdict.score || 0), 0) / total).toFixed(2) : '—'; + +const sections = [...byCapture.values()].map((cap) => { + const rel = path.relative(VLM_DIR, cap.imagePath).split(path.sep).join('/'); + const cards = cap.verdicts.map((r) => { + const v = r.verdict; + const cls = v.pass ? 'pass' : 'flag'; + const issues = v.issues?.length ? `
    ${v.issues.map((i) => `
  • ${esc(i)}
  • `).join('')}
` : ''; + const model = (v.model || '').replace(/\.gguf$/, '').slice(0, 28); + const host = (v.endpoint || '').replace(/^https?:\/\//, ''); + const foot = (v.endpoint || v.latencyMs) ? `
${esc(model)} @ ${esc(host)}${v.latencyMs ? ` · ${(v.latencyMs / 1000).toFixed(1)}s` : ''}
` : ''; + return `
+
${v.pass ? 'PASS' : 'FLAG'} + ${esc(r.persona)}score ${Number(v.score ?? 0).toFixed(2)}
+

${esc(v.summary)}

${issues}${foot}
`; + }).join(''); + return `
+
${esc(path.basename(cap.imagePath))}
${esc(path.basename(cap.imagePath))}

${esc(cap.context)}

+
${cards}
+
`; +}).join(''); + +const html = `GraphDone — Local VLM Visual Review + +

GraphDone — Local VLM Visual Review

+
${passed}/${total} persona checks passed · avg score ${avgScore}
+generated ${esc(generatedAt)} · evaluated by a local vision model
+${sections} +

Report-only. "FLAG" is the model's subjective concern from one perspective, not a hard failure — use it to spot real UX/rendering regressions worth a human look.

+`; + +fs.writeFileSync(OUT, html); +console.log(`✅ VLM review report: ${OUT} (${total} evaluations, ${passed} pass)`); diff --git a/tests/helpers/seedGraph.ts b/tests/helpers/seedGraph.ts new file mode 100644 index 00000000..2635822e --- /dev/null +++ b/tests/helpers/seedGraph.ts @@ -0,0 +1,151 @@ +import { Page } from '@playwright/test'; + +/** + * Seeds realistically-shaped graphs of arbitrary size through the real GraphQL + * API (the same path a human or AI uses), so the perf sweep measures the true + * stack — Neo4j + Apollo + the web force simulation — not a synthetic shortcut. + * + * Nodes are spread on a grid (real positions, not all stacked at the origin), + * statuses/types/priorities are varied so living-graph effects and priority + * glow are actually exercised, and edges form a connected backbone plus extra links + * to hit a target edge:node ratio. Edges are created as Edge nodes (the + * canonical model the web renders). Everything batches to stay within request + * limits, and cleanup deletes edges before nodes (orphan edges break the whole + * edges query). + */ + +export interface SeededGraph { + graphId: string; + nodeIds: string[]; + edgeCount: number; +} + +async function gql(page: Page, query: string, variables?: unknown): Promise<any> { + return page.evaluate( + async ({ query, variables }) => { + const token = localStorage.getItem('authToken') ?? ''; + const res = await fetch('/api/graphql', { + method: 'POST', + headers: { 'Content-Type': 'application/json', Authorization: `Bearer ${token}` }, + body: JSON.stringify({ query, variables }), + }); + const body = await res.json(); + if (body.errors) throw new Error(body.errors[0]?.message ??
'GraphQL error'); + return body.data; + }, + { query, variables } + ); +} + +const STATUSES = ['PROPOSED', 'IN_PROGRESS', 'BLOCKED', 'COMPLETED'] as const; +const TYPES = ['TASK', 'BUG', 'FEATURE', 'MILESTONE', 'OUTCOME'] as const; +const EDGE_TYPES = ['DEPENDS_ON', 'BLOCKS', 'RELATES_TO'] as const; + +function chunk<T>(arr: T[], n: number): T[][] { + const out: T[][] = []; + for (let i = 0; i < arr.length; i += n) out.push(arr.slice(i, i + n)); + return out; +} + +export interface SeedOptions { + size: number; + /** edges ≈ edgeFactor * size (default 1.4). */ + edgeFactor?: number; + /** grid spacing in px (default 130). */ + spacing?: number; + namePrefix?: string; +} + +export async function seedLargeGraph(page: Page, opts: SeedOptions): Promise<SeededGraph> { + const { size, edgeFactor = 1.4, spacing = 130, namePrefix = 'Scale' } = opts; + const me = await gql(page, '{ me { id } }'); + const userId = me.me.id; + + const g = await gql( + page, + `mutation($input: [GraphCreateInput!]!) { createGraphs(input: $input) { graphs { id } } }`, + { input: [{ name: `${namePrefix} ${size}n ${Date.now()}`, type: 'PROJECT', status: 'ACTIVE', createdBy: userId, isShared: true }] } + ); + const graphId = g.createGraphs.graphs[0].id as string; + + // Grid layout centered on the origin so the sim starts from a real arrangement. + const cols = Math.ceil(Math.sqrt(size)); + const half = (cols * spacing) / 2; + const nodeInputs = Array.from({ length: size }, (_, i) => { + const col = i % cols; + const row = Math.floor(i / cols); + // Deterministic pseudo-variety without Math.random (kept reproducible).
+ const status = STATUSES[i % STATUSES.length]; + const type = TYPES[(i * 7) % TYPES.length]; + const priority = ((i * 37) % 100) / 100; + return { + type, + title: `${type} ${i}`, + status, + priority, + positionX: col * spacing - half, + positionY: row * spacing - half, + positionZ: 0, + owner: { connect: { where: { node: { id: userId } } } }, + graph: { connect: { where: { node: { id: graphId } } } }, + }; + }); + + const nodeIds: string[] = []; + for (const batch of chunk(nodeInputs, 100)) { + const res = await gql( + page, + `mutation($input: [WorkItemCreateInput!]!) { createWorkItems(input: $input) { workItems { id } } }`, + { input: batch } + ); + for (const w of res.createWorkItems.workItems) nodeIds.push(w.id); + } + + // Backbone chain guarantees connectivity; extra forward links add realism. + const targetEdges = Math.round(size * edgeFactor); + const edgeInputs: Array<Record<string, unknown>> = []; + const link = (a: string, b: string, t: string) => + edgeInputs.push({ + type: t, + weight: 0.5 + ((edgeInputs.length % 5) / 10), + source: { connect: { where: { node: { id: a } } } }, + target: { connect: { where: { node: { id: b } } } }, + }); + for (let i = 0; i + 1 < nodeIds.length; i++) link(nodeIds[i], nodeIds[i + 1], 'DEPENDS_ON'); + let extra = targetEdges - edgeInputs.length; + for (let i = 0; i < nodeIds.length && extra > 0; i++) { + const jump = 2 + ((i * 5) % Math.max(2, Math.floor(nodeIds.length / 4))); + const j = i + jump; + if (j < nodeIds.length) { + link(nodeIds[i], nodeIds[j], EDGE_TYPES[i % EDGE_TYPES.length]); + extra--; + } + } + + let edgeCount = 0; + for (const batch of chunk(edgeInputs, 100)) { + const res = await gql( + page, + `mutation($input: [EdgeCreateInput!]!)
{ createEdges(input: $input) { edges { id } } }`, + { input: batch } + ); + edgeCount += res.createEdges.edges.length; + } + + return { graphId, nodeIds, edgeCount }; +} + +export async function deleteGraphDeep(page: Page, graphId: string): Promise<void> { + // Edges first (orphan edges break the edges query), then nodes, then graph. + await gql( + page, + `mutation($id: ID!) { deleteEdges(where: { source: { graph: { id: $id } } }) { nodesDeleted } }`, + { id: graphId } + ).catch(() => {}); + await gql( + page, + `mutation($id: ID!) { deleteWorkItems(where: { graph: { id: $id } }) { nodesDeleted } }`, + { id: graphId } + ).catch(() => {}); + await gql(page, `mutation($id: ID!) { deleteGraphs(where: { id: $id }) { nodesDeleted } }`, { id: graphId }).catch(() => {}); +} diff --git a/tests/helpers/testEnv.ts b/tests/helpers/testEnv.ts new file mode 100644 index 00000000..f29c8c80 --- /dev/null +++ b/tests/helpers/testEnv.ts @@ -0,0 +1,42 @@ +import * as fs from 'fs'; +import * as path from 'path'; +import dotenv from 'dotenv'; + +/** + * Loads local-only test configuration from `.env.test.local` (gitignored) into + * process.env, without ever baking secrets or hostnames into the repo. Import + * this for its side effect at the top of any spec/generator that needs the VLM + * endpoints or sweep config: + * + * import '../helpers/testEnv'; + * + * Safe to import everywhere — it's a no-op when the file is absent (e.g. CI), + * so VLM-driven suites skip cleanly. Existing process.env values win, so you + * can still override per-run on the command line. + */ +const localEnvPath = path.resolve(process.cwd(), '.env.test.local'); +if (fs.existsSync(localEnvPath)) { + dotenv.config({ path: localEnvPath }); +} + +/** Comma-separated env list -> trimmed non-empty string[] (split on commas only). */ +export function envList(name: string): string[] { + return (process.env[name] ??
'') + .split(',') + .map((s) => s.trim()) + .filter(Boolean); +} + +/** Parse a comma-separated list of positive integers (sweep sizes). */ +export function envIntList(name: string): number[] { + return envList(name) + .map((s) => Number.parseInt(s, 10)) + .filter((n) => Number.isFinite(n) && n > 0); +} + +export function envNumber(name: string, fallback: number): number { + const raw = process.env[name]; + if (raw === undefined || raw.trim() === '') return fallback; + const n = Number(raw); + return Number.isFinite(n) ? n : fallback; +} diff --git a/tests/helpers/vlm.ts b/tests/helpers/vlm.ts new file mode 100644 index 00000000..aeb4db24 --- /dev/null +++ b/tests/helpers/vlm.ts @@ -0,0 +1,331 @@ +import * as fs from 'fs'; +import './testEnv'; +import { envList, envNumber } from './testEnv'; + +/** + * Protocol-agnostic client for LOCAL Vision-Language-Model servers. + * + * The actual endpoints (GPU workstations) live only in `.env.test.local` + * (gitignored) as VLM_ENDPOINTS — never in the repo. Requests are round-robined + * across every configured endpoint so visual evaluation spreads over all GPUs. + * + * Two wire protocols are supported and auto-detected per endpoint: + * - OpenAI-compatible: POST /v1/chat/completions with image_url data URIs + * (vLLM, LM Studio, llama.cpp server, Ollama's /v1 compat shim) + * - Ollama native: POST /api/chat with a base64 images[] array + * + * Everything degrades gracefully: when VLM_ENDPOINTS is unset or no endpoint is + * reachable, isVlmAvailable() is false and suites skip — so CI stays green. 
+ */ + +export type VlmProtocol = 'openai' | 'ollama'; + +export interface VlmVerdict { + pass: boolean; + score: number; // 0..1 + issues: string[]; + summary: string; + raw?: string; // raw model text, for the report when parsing is imperfect + endpoint?: string; // which box judged this (honesty in the report) + model?: string; + latencyMs?: number; +} + +export interface Persona { + key: string; + label: string; + /** Framing for the model — who it is and what it cares about. */ + system: string; + /** What "pass" means, appended to every prompt for this persona. */ + rubric: string; +} + +/** + * The evaluation perspectives. Each judges a rendered screenshot from a + * distinct point of view, so one capture yields several independent reads. + */ +export const PERSONAS: Persona[] = [ + { + key: 'visual-defects', + label: 'Visual defects', + system: + 'You are a meticulous UI rendering QA inspector for a graph-visualization web app. ' + + 'You judge ONLY what is visible in the screenshot — objective rendering correctness.', + rubric: + 'Fail if you see: nodes overlapping so labels are unreadable, nodes/text cut off at the edges, ' + + 'a broken or empty layout where content is expected, edges that clearly do not connect nodes, ' + + 'obvious visual glitches, or any error message / "Error" badge / blank red state. ' + + 'Pass if the graph (or its empty-state invitation) renders cleanly and legibly.', + }, + { + key: 'new-user', + label: 'New-user clarity', + system: + 'You are a first-time user who has never seen this product. You are evaluating whether the ' + + 'screen is clear, inviting, and self-explanatory.', + rubric: + 'Fail if you would feel lost or could not tell what to do next, or the screen looks intimidating ' + + 'or cluttered to a newcomer. 
Pass if the purpose is clear and there is an obvious next action.', + }, + { + key: 'accessibility', + label: 'Accessibility', + system: + 'You are an accessibility reviewer judging a rendered screenshot for visual a11y.', + rubric: + 'Fail if text contrast looks too low to read, text is too small, information is conveyed by color ' + + 'alone, or interactive targets look too small to tap. Pass if it appears broadly legible and usable.', + }, + { + key: 'living-graph', + label: 'Living-graph aliveness', + system: + 'You evaluate whether a graph visualization feels "alive" and communicates work status. Nodes may ' + + 'glow by priority, pulse/breathe when in progress, look settled when complete, or ache when blocked; ' + + 'edges may show energy flow.', + rubric: + 'Fail if the graph looks completely static/flat with no visual hierarchy or status cues, or if the ' + + 'effects look chaotic/noisy rather than purposeful. Pass if status and priority read clearly and the ' + + 'scene feels alive but legible. (Judge the single frame; do not penalize lack of motion in a still.)', + }, +]; + +export const personaByKey = (key: string): Persona | undefined => + PERSONAS.find((p) => p.key === key); + +const TIMEOUT_MS = envNumber('VLM_TIMEOUT_MS', 120_000); +const MAX_CONCURRENCY = Math.max(1, envNumber('VLM_MAX_CONCURRENCY', 3)); + +let rrCounter = 0; +const protocolCache = new Map(); +const modelCache = new Map(); + +export function vlmEndpoints(): string[] { + return envList('VLM_ENDPOINTS').map((e) => e.replace(/\/+$/, '')); +} + +export function vlmModel(): string { + return (process.env.VLM_MODEL ?? '').trim(); +} + +/** + * Resolve the model id for an endpoint. With a single shared model set + * VLM_MODEL; with multiple boxes serving DIFFERENT models, set VLM_MODEL=auto + * (or leave blank) and each endpoint's own loaded model is used. 
llama.cpp + * serves one model and ignores the field, but sending the right id keeps logs + * honest and works with multi-model servers too. + */ +async function resolveModel(base: string, protocol: VlmProtocol): Promise<string> { + const configured = vlmModel(); + if (configured && configured.toLowerCase() !== 'auto') return configured; + if (modelCache.has(base)) return modelCache.get(base)!; + let id = 'default'; + try { + if (protocol === 'openai') { + const r = await fetchWithTimeout(`${base}/v1/models`, { headers: authHeaders() }, 5000); + const d = await r.json(); + id = d?.data?.[0]?.id ?? d?.models?.[0]?.name ?? 'default'; + } else { + const r = await fetchWithTimeout(`${base}/api/tags`, {}, 5000); + const d = await r.json(); + id = d?.models?.[0]?.name ?? d?.models?.[0]?.model ?? 'default'; + } + } catch { /* keep default */ } + modelCache.set(base, id); + return id; +} + +export function isVlmConfigured(): boolean { + return vlmEndpoints().length > 0; // VLM_MODEL may be blank or 'auto'; resolveModel() then asks each endpoint +} + +function authHeaders(): Record<string, string> { + const key = (process.env.VLM_API_KEY ?? '').trim(); + return key ? { Authorization: `Bearer ${key}` } : {}; +} + +async function fetchWithTimeout(url: string, init: RequestInit, timeoutMs = TIMEOUT_MS): Promise<Response> { + const controller = new AbortController(); + const t = setTimeout(() => controller.abort(), timeoutMs); + try { + return await fetch(url, { ...init, signal: controller.signal }); + } finally { + clearTimeout(t); + } +} + +/** Detect (and cache) the wire protocol for a single endpoint. */ +async function detectProtocol(base: string): Promise<VlmProtocol | null> { + const forced = (process.env.VLM_PROTOCOL ?? 'auto').trim().toLowerCase(); + if (forced === 'openai' || forced === 'ollama') return forced; + if (protocolCache.has(base)) return protocolCache.get(base)!; + // OpenAI-compatible servers expose /v1/models. 
+ try { + const r = await fetchWithTimeout(`${base}/v1/models`, { headers: authHeaders() }, 5000); + if (r.ok) { protocolCache.set(base, 'openai'); return 'openai'; } + } catch { /* try next */ } + // Ollama exposes /api/tags. + try { + const r = await fetchWithTimeout(`${base}/api/tags`, {}, 5000); + if (r.ok) { protocolCache.set(base, 'ollama'); return 'ollama'; } + } catch { /* unreachable */ } + return null; +} + +/** Endpoints that are configured AND currently reachable, with their protocol. */ +export async function reachableEndpoints(): Promise<Array<{ base: string; protocol: VlmProtocol }>> { + const out: Array<{ base: string; protocol: VlmProtocol }> = []; + await Promise.all( + vlmEndpoints().map(async (base) => { + const protocol = await detectProtocol(base); + if (protocol) out.push({ base, protocol }); + }) + ); + return out; +} + +let availabilityCache: boolean | null = null; +/** True only if VLM is configured and at least one endpoint responds. */ +export async function isVlmAvailable(): Promise<boolean> { + if (!isVlmConfigured()) return false; + if (availabilityCache !== null) return availabilityCache; + availabilityCache = (await reachableEndpoints()).length > 0; + return availabilityCache; +} + +function extractVerdict(text: string): VlmVerdict { + // Models wrap JSON in prose or code fences; take the span from the first '{' to the last '}'. + let parsed: Record<string, unknown> | null = null; + const fence = text.match(/```(?:json)?\s*([\s\S]*?)```/i); + const candidate = fence ? fence[1] : text; + const start = candidate.indexOf('{'); + const end = candidate.lastIndexOf('}'); + if (start !== -1 && end > start) { + try { parsed = JSON.parse(candidate.slice(start, end + 1)); } catch { /* fall through */ } + } + if (!parsed) { + return { pass: false, score: 0, issues: ['Could not parse a JSON verdict from the model'], summary: text.slice(0, 300), raw: text }; + } + const issuesRaw = parsed.issues; + const issues = Array.isArray(issuesRaw) ? issuesRaw.map((i) => String(i)) : issuesRaw ? 
[String(issuesRaw)] : []; + let score = Number(parsed.score); + if (!Number.isFinite(score)) score = parsed.pass ? 1 : 0; + if (score > 1) score = score / 100; // tolerate 0-100 scales + return { + pass: Boolean(parsed.pass), + score: Math.max(0, Math.min(1, score)), + issues, + summary: String(parsed.summary ?? '').slice(0, 600), + raw: text, + }; +} + +const PROMPT_TAIL = + 'Respond with ONLY a JSON object, no prose, of exactly this shape: ' + + '{"pass": boolean, "score": number between 0 and 1, "issues": string[], "summary": string}. ' + + 'Keep issues short and specific. Be fair: this is a still frame.'; + +function buildPrompt(persona: Persona, context: string): string { + return `Context: this screenshot shows ${context}.\n\n${persona.rubric}\n\n${PROMPT_TAIL}`; +} + +async function callOpenAI(base: string, model: string, system: string, prompt: string, b64: string): Promise { + const r = await fetchWithTimeout(`${base}/v1/chat/completions`, { + method: 'POST', + headers: { 'Content-Type': 'application/json', ...authHeaders() }, + body: JSON.stringify({ + model, + temperature: 0, + max_tokens: 400, + messages: [ + { role: 'system', content: system }, + { + role: 'user', + content: [ + { type: 'text', text: prompt }, + { type: 'image_url', image_url: { url: `data:image/png;base64,${b64}` } }, + ], + }, + ], + }), + }); + if (!r.ok) throw new Error(`OpenAI VLM ${base} HTTP ${r.status}: ${(await r.text()).slice(0, 200)}`); + const data = await r.json(); + return data?.choices?.[0]?.message?.content ?? 
''; +} + +async function callOllama(base: string, model: string, system: string, prompt: string, b64: string): Promise { + const r = await fetchWithTimeout(`${base}/api/chat`, { + method: 'POST', + headers: { 'Content-Type': 'application/json' }, + body: JSON.stringify({ + model, + stream: false, + options: { temperature: 0 }, + messages: [ + { role: 'system', content: system }, + { role: 'user', content: prompt, images: [b64] }, + ], + }), + }); + if (!r.ok) throw new Error(`Ollama VLM ${base} HTTP ${r.status}: ${(await r.text()).slice(0, 200)}`); + const data = await r.json(); + return data?.message?.content ?? ''; +} + +/** + * Evaluate one screenshot from one persona's perspective. Round-robins across + * reachable endpoints. Never throws — failures come back as a non-pass verdict + * so the report is always complete. + */ +export async function evaluateImage( + imagePath: string, + persona: Persona, + context: string, + endpoints?: Array<{ base: string; protocol: VlmProtocol }> +): Promise { + const eps = endpoints ?? (await reachableEndpoints()); + if (eps.length === 0) { + return { pass: false, score: 0, issues: ['No reachable VLM endpoint'], summary: '' }; + } + const { base, protocol } = eps[rrCounter++ % eps.length]; + const prompt = buildPrompt(persona, context); + const started = Date.now(); + try { + const model = await resolveModel(base, protocol); + const b64 = fs.readFileSync(imagePath).toString('base64'); + const text = + protocol === 'openai' + ? await callOpenAI(base, model, persona.system, prompt, b64) + : await callOllama(base, model, persona.system, prompt, b64); + const v = extractVerdict(text); + return { ...v, endpoint: base, model, latencyMs: Date.now() - started }; + } catch (err) { + return { + pass: false, + score: 0, + issues: [`VLM request failed: ${err instanceof Error ? 
err.message : String(err)}`], + summary: '', + endpoint: base, + latencyMs: Date.now() - started, + }; + } +} + +/** Run a batch of {imagePath, persona, context} jobs with bounded concurrency. */ +export async function evaluateBatch( + jobs: Array<{ imagePath: string; persona: Persona; context: string; meta?: Record }> +): Promise }>> { + const eps = await reachableEndpoints(); + const results: Array<{ persona: string; context: string; imagePath: string; verdict: VlmVerdict; meta?: Record }> = []; + let idx = 0; + async function worker() { + while (idx < jobs.length) { + const job = jobs[idx++]; + const verdict = await evaluateImage(job.imagePath, job.persona, job.context, eps); + results.push({ persona: job.persona.key, context: job.context, imagePath: job.imagePath, verdict, meta: job.meta }); + } + } + await Promise.all(Array.from({ length: Math.min(MAX_CONCURRENCY, jobs.length) }, worker)); + return results; +} diff --git a/tests/perf/scale-sweep.spec.ts b/tests/perf/scale-sweep.spec.ts new file mode 100644 index 00000000..2ad753e9 --- /dev/null +++ b/tests/perf/scale-sweep.spec.ts @@ -0,0 +1,224 @@ +import { test, expect, Page } from '@playwright/test'; +import * as fs from 'fs'; +import * as path from 'path'; +import { login, TEST_USERS } from '../helpers/auth'; +import { seedLargeGraph, deleteGraphDeep } from '../helpers/seedGraph'; +import '../helpers/testEnv'; +import { envIntList, envList } from '../helpers/testEnv'; + +/** + * Large-scale graph creation + performance metric sweep. Seeds real graphs of + * increasing size through the GraphQL API, loads each in the browser at one or + * more quality tiers, and records the in-app PerfMeter/DriftMeter readings + * (window.__graphPerf) plus settle time, load time, and query latency. + * + * Report-only: writes one JSON per (size, quality) under + * test-artifacts/scale-sweep/, which `npm run report:perf` renders into a table + * + charts. 
It does NOT fail on thresholds — the goal is a metric sweep, not a + * gate (the @perf budgets spec is the gate). The only hard assertion is that a + * seeded graph actually renders, so a silent breakage still surfaces. + * + * Sizes/qualities come from env (.env.test.local) so you can push it hard + * locally; CI uses a small set just to keep the harness honest. + */ + +const SIZES = (() => { + const fromEnv = envIntList('SCALE_SWEEP_SIZES'); + if (fromEnv.length) return fromEnv; + return process.env.CI ? [40, 120] : [50, 200, 500, 1000, 2000]; +})(); + +const QUALITIES = (() => { + const fromEnv = envList('SCALE_SWEEP_QUALITIES').map((q) => q.toUpperCase()); + const valid = fromEnv.filter((q) => ['LOW', 'MEDIUM', 'HIGH', 'ULTRA'].includes(q)); + if (valid.length) return valid; + return process.env.CI ? ['HIGH'] : ['HIGH', 'ULTRA']; +})(); + +const OUT_DIR = path.resolve(process.cwd(), 'test-artifacts/scale-sweep'); +const SETTLE_BUDGET_MS = 30_000; +const REST_ALPHA = 0.02; + +interface SweepResult { + size: number; + quality: string; + seededNodes: number; + seededEdges: number; + renderedNodes: number; + renderedEdges: number; + loadMs: number; // time from reload to first node painted + interactionFps: number; // RELIABLE: rendered frames/sec while dragging the graph + settleMs: number | null; // time to reach REST_ALPHA (null = never settled within budget) + finalAlpha: number; + avgTickMs: number; + p95TickMs: number; + fps: number; + droppedFrames: number; + rmsFromSavedPx: number; + maxStepPx: number; + queryP95Ms: number; + timestampISO: string; +} + +async function measure(page: Page, graphId: string, size: number, quality: string): Promise { + await page.setViewportSize({ width: 1920, height: 1080 }); + await page.evaluate( + ({ gid, q }) => { + localStorage.setItem('currentGraphId', gid); + localStorage.setItem('graphdone.quality.override', q); + }, + { gid: graphId, q: quality } + ); + + const t0 = Date.now(); + await page.reload(); + // Load time = 
first node painted. + await page.locator('.graph-container svg .node').first().waitFor({ timeout: 60_000 }).catch(() => {}); + const loadMs = Date.now() - t0; + + // Seeded nodes carry saved grid positions, so the app pins them and the force + // sim sits idle — PerfMeter (window.__graphPerf, published only every ~2s + // WHILE ticking) then never reports. We hold a node and drag it continuously + // for a few seconds: d3 keeps alphaTarget>0 while dragging, so the sim ticks + // the whole time and the meter publishes real UNDER-INTERACTION samples (tick + // cost / fps at this scale — a realistic "dragging a big graph" metric). We + // keep the worst (highest-tick) live sample, then release and time the settle. + const box = await page.evaluate(() => { + const n = document.querySelector('.graph-container svg .node .node-bg') as Element | null; + if (!n) return null; + const r = n.getBoundingClientRect(); + return { x: r.x + r.width / 2, y: r.y + r.height / 2 }; + }); + + // Reliable interaction FPS: count real rendered frames (requestAnimationFrame) + // over a fixed wall-clock window while dragging. This needs no app + // instrumentation, so it works at every size — when the main thread is busy + // ticking a huge sim, rAF visibly drops, which is exactly the scaling signal. 
+ let lastNonNull: any = null; + const samples: any[] = []; + let interactionFps = -1; + if (box) { + await page.evaluate(() => { + (window as any).__fc = 0; + const loop = () => { (window as any).__fc++; (window as any).__rafId = requestAnimationFrame(loop); }; + (window as any).__rafId = requestAnimationFrame(loop); + }); + await page.mouse.move(box.x, box.y).catch(() => {}); + await page.mouse.down().catch(() => {}); + const dragStart = Date.now(); + let a = 0; + while (Date.now() - dragStart < 6000) { + a += 0.6; + await page.mouse.move(box.x + Math.cos(a) * 70, box.y + Math.sin(a) * 55).catch(() => {}); + await page.waitForTimeout(120); + const cur = await page.evaluate(() => (window as any).__graphPerf ?? null); + if (cur) { lastNonNull = cur; samples.push(cur); } + } + await page.mouse.up().catch(() => {}); + const frames = await page.evaluate(() => { cancelAnimationFrame((window as any).__rafId); return (window as any).__fc || 0; }); + const secs = (Date.now() - dragStart) / 1000; + interactionFps = Math.round((frames / secs) * 10) / 10; + } + + // Now measure how long it takes to come to rest after the perturbation. + let settleMs: number | null = null; + const settleStart = Date.now(); + while (Date.now() - settleStart < SETTLE_BUDGET_MS) { + const cur = await page.evaluate(() => (window as any).__graphPerf ?? null); + if (cur) { + lastNonNull = cur; + if (typeof cur.alpha === 'number' && cur.alpha <= REST_ALPHA) { + settleMs = Date.now() - settleStart; + break; + } + } + await page.waitForTimeout(300); + } + // Prefer the worst (max) tick seen under interaction — that's the real cost at + // scale; a single settled sample understates it. + const underLoad = samples.length + ? samples.reduce((w, s) => ((s.avgTickMs ?? 0) > (w.avgTickMs ?? 0) ? s : w)) + : null; + const last = underLoad ?? (await page.evaluate(() => (window as any).__graphPerf ?? null)) ?? lastNonNull ?? 
{}; + + const renderedNodes = await page.locator('.graph-container svg .node').count(); + const renderedEdges = await page.locator('.graph-container svg .edge').count(); + + // Query latency a human/AI would feel: a graph-scoped workItems fetch. + const queryP95Ms = await page.evaluate(async (gid) => { + const token = localStorage.getItem('authToken') ?? ''; + const times: number[] = []; + for (let i = 0; i < 10; i++) { + const s = performance.now(); + await fetch('/api/graphql', { + method: 'POST', + headers: { 'Content-Type': 'application/json', Authorization: `Bearer ${token}` }, + body: JSON.stringify({ + query: `query($w: WorkItemWhere) { workItems(where: $w, options: { limit: 5000 }) { id status type priority } }`, + variables: { w: { graph: { id: gid } } }, + }), + }).then((r) => r.json()); + times.push(performance.now() - s); + } + times.sort((a, b) => a - b); + return Math.round(times[Math.floor(times.length * 0.95)] ?? times[times.length - 1]); + }, graphId); + + const spatial = last?.spatial ?? {}; + fs.mkdirSync(OUT_DIR, { recursive: true }); + await page.screenshot({ path: path.join(OUT_DIR, `${size}n-${quality}.png`), fullPage: false }).catch(() => {}); + + return { + size, + quality, + seededNodes: size, + seededEdges: 0, // filled by caller + renderedNodes, + renderedEdges, + loadMs, + interactionFps, + settleMs, + finalAlpha: typeof last?.alpha === 'number' ? last.alpha : -1, + avgTickMs: last?.avgTickMs ?? -1, + p95TickMs: last?.p95TickMs ?? -1, + fps: last?.fps ?? -1, + droppedFrames: last?.droppedFrames ?? -1, + rmsFromSavedPx: spatial.rmsFromSavedPx ?? -1, + maxStepPx: spatial.maxStepPx ?? 
-1, + queryP95Ms, + timestampISO: new Date(t0).toISOString(), + }; +} + +test.describe('large-scale graph perf sweep @scale', () => { + test.describe.configure({ mode: 'serial', timeout: 600_000 }); + + for (const size of SIZES) { + test(`sweep ${size} nodes`, async ({ page }) => { + await login(page, TEST_USERS.ADMIN); + await page.waitForTimeout(1500); + + // A FRESH graph per (size, quality): otherwise the second quality loads + // the first run's already-settled positions, the sim never ticks, and + // PerfMeter never publishes (the -1 / settle=NONE gaps in v1). + for (const quality of QUALITIES) { + const seeded = await seedLargeGraph(page, { size }); + try { + const result = await measure(page, seeded.graphId, size, quality); + result.seededEdges = seeded.edgeCount; + fs.mkdirSync(OUT_DIR, { recursive: true }); + fs.writeFileSync(path.join(OUT_DIR, `${size}n-${quality}.json`), JSON.stringify(result, null, 2)); + // eslint-disable-next-line no-console + console.log( + `[scale] ${size}n/${quality}: rendered ${result.renderedNodes}n/${result.renderedEdges}e ` + + `load=${result.loadMs}ms dragFps=${result.interactionFps} settle=${result.settleMs ?? 'NONE'}ms ` + + `tick=${result.avgTickMs}ms qP95=${result.queryP95Ms}ms` + ); + expect(result.renderedNodes, `graph of ${size} nodes must render some nodes`).toBeGreaterThan(0); + } finally { + await deleteGraphDeep(page, seeded.graphId); + } + } + }); + } +});
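
Reviewer note on `queryP95Ms`: the sweep times 10 fetches, sorts them, and reads index `Math.floor(times.length * 0.95)`. The sketch below isolates that index arithmetic so its behaviour at small sample counts is easy to see. The helper name `p95` and the sample latencies are hypothetical and not part of this PR; the spec additionally rounds the result with `Math.round`.

```typescript
// Hypothetical helper: same index arithmetic as the queryP95Ms calculation
// in measure(), pulled out so it can be checked in isolation.
function p95(samples: number[]): number {
  if (samples.length === 0) return Number.NaN;
  const sorted = [...samples].sort((a, b) => a - b);
  // floor(n * 0.95), clamped to the last valid index.
  const idx = Math.min(sorted.length - 1, Math.floor(sorted.length * 0.95));
  return sorted[idx];
}

// Ten made-up latencies (ms) with one slow outlier.
const tenSamples = [12, 15, 11, 14, 13, 90, 16, 12, 13, 14];
const p95OfTen = p95(tenSamples); // index floor(9.5) = 9, the largest value

// With 100 samples the index is 95, so four larger values lie above the result.
const hundredSamples = Array.from({ length: 100 }, (_, i) => i + 1);
const p95OfHundred = p95(hundredSamples);
```

With 10 samples the reported figure is simply the slowest request, so a single outlier dominates it. If a value distinct from the maximum is wanted, this arithmetic needs more than 20 samples per measurement.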