Merged
42 changes: 42 additions & 0 deletions .env.test.example
@@ -0,0 +1,42 @@
# GraphDone test-pipeline configuration — LOCAL ONLY.
#
# Copy this file to `.env.test.local` (which is gitignored) and fill in your
# own values. NEVER put real hostnames, IPs, or keys in this committed example
# or anywhere else in the repo — the local VLM boxes (GPU workstations) must
# stay out of version control. The test harness auto-loads `.env.test.local`.
#
# cp .env.test.example .env.test.local # then edit .env.test.local

# --- Local Vision-Language-Model (VLM) endpoints ---------------------------
# Comma-separated base URLs of your local VLM server(s). Requests are
# round-robined across them so visual evaluation is spread over every GPU.
# Leave blank to skip all VLM-driven suites (they no-op cleanly in CI).
# Example shape (use your OWN hosts in .env.test.local, never here):
# VLM_ENDPOINTS=http://<gpu-host-a>:<port>,http://<gpu-host-b>:<port>
VLM_ENDPOINTS=

# Model id/tag to request (e.g. a llava / qwen2-vl / llama-3.2-vision build).
VLM_MODEL=

# Optional bearer key for OpenAI-compatible servers that require one.
VLM_API_KEY=

# Wire protocol: auto (default) | openai | ollama.
# auto — probe each endpoint: /v1/models => OpenAI-compatible, else Ollama.
# openai — POST /v1/chat/completions (vLLM, LM Studio, llama.cpp, Ollama compat)
# ollama — POST /api/chat with an images[] array
VLM_PROTOCOL=auto

# Max concurrent VLM requests across all endpoints (default 3).
VLM_MAX_CONCURRENCY=3

# Per-request timeout in ms — VLMs can be slow on large images (default 120000).
VLM_TIMEOUT_MS=120000

# --- Large-scale performance sweep -----------------------------------------
# Node counts to sweep, comma-separated. Leave blank to use the built-in
# default (small in CI, large locally). Example: 50,200,500,1000,2000
SCALE_SWEEP_SIZES=

# Quality tiers to sweep per size (subset of LOW,MEDIUM,HIGH,ULTRA).
SCALE_SWEEP_QUALITIES=HIGH,ULTRA
2 changes: 2 additions & 0 deletions docs/SYSTEMS.md
@@ -19,6 +19,8 @@
| Lint | `npm run lint` | 0 errors (warnings allowed) |
| Build | `npm run build` | Production build succeeds |
| Showcase report | `TEST_URL=http://localhost:3127 npm run report:showcase` | Records .webm video + screenshots of every mode at all 5 resolutions → `test-artifacts/showcase/index.html` (also an every-PR CI artifact). |
| Large-scale perf sweep | `TEST_URL=http://localhost:3127 npm run test:perf:scale` | Seeds graphs of increasing size (50→2000+ nodes) and records `window.__graphPerf` (settle, tick, fps, drift, query p95) across size × quality → `test-artifacts/scale-sweep/index.html`. Report-only; sizes/qualities via `.env.test.local`. See [docs/testing/local-vlm-and-scale.md](./testing/local-vlm-and-scale.md). |
| Local VLM visual review | `TEST_URL=http://localhost:3127 npm run test:vlm` | A locally-hosted vision model judges captured states from 4 perspectives (visual defects, new-user clarity, accessibility, living-graph aliveness) → `test-artifacts/vlm/index.html`. **Skips unless `VLM_ENDPOINTS` is set in the gitignored `.env.test.local`** (CI can't reach local GPUs). Report-only. |

**Why THE GATE exists:** a real incident — orphaned `Edge` records made the
edges query 500 and the UI showed "Error" with zero edges, while every unit
113 changes: 113 additions & 0 deletions docs/testing/local-vlm-and-scale.md
@@ -0,0 +1,113 @@
# Local VLM visual review & large-scale performance sweeps

Two heavier, report-only suites that exercise GraphDone from realistic user
perspectives and at scale. Both are **opt-in and run locally** (or on a
self-hosted runner) because the vision models live on your own GPU boxes —
their hostnames must never enter the repo.

## TL;DR

```bash
cp .env.test.example .env.test.local # gitignored — put your real values here
# edit .env.test.local: VLM_ENDPOINTS, VLM_MODEL, (optional) sweep sizes

./start dev # or have the stack running on :3127

TEST_URL=http://localhost:3127 npm run test:perf:scale # → test-artifacts/scale-sweep/index.html
TEST_URL=http://localhost:3127 npm run test:vlm # → test-artifacts/vlm/index.html
```

If `VLM_ENDPOINTS` is unset, `test:vlm` **skips cleanly** — so CI and other
developers are never blocked by hardware they don't have.

## Keeping hostnames out of the repo

- **Never** commit hostnames, IPs, or keys. The GPU boxes (e.g. an RTX 4090
workstation and Grace-Blackwell nodes) are referenced only by env vars.
- `.env.test.local` is gitignored (see `.gitignore`). It is the *only* place
your real endpoints live.
- `.env.test.example` is committed and documents the variable **names** with
placeholder hosts (`http://<gpu-host>:<port>`). Copy it to `.env.test.local`
and replace the placeholders with your own values.
- The harness auto-loads `.env.test.local` via `tests/helpers/testEnv.ts`.
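
The loader in `tests/helpers/testEnv.ts` is not reproduced in this document, so the following is only a minimal sketch of what "auto-loading" a dotenv-style file could look like. `parseEnvFile` and `applyEnv` are hypothetical names, and the rule that already-set variables win is an assumption, not confirmed behavior.

```typescript
// Hypothetical sketch; not the actual contents of tests/helpers/testEnv.ts.

/** Parse KEY=VALUE lines, skipping blank lines and full-line comments. */
function parseEnvFile(text: string): Record<string, string> {
  const out: Record<string, string> = {};
  for (const line of text.split(/\r?\n/)) {
    const trimmed = line.trim();
    if (trimmed === "" || trimmed.startsWith("#")) continue;
    const eq = trimmed.indexOf("=");
    if (eq <= 0) continue; // no key before "=", ignore the line
    const key = trimmed.slice(0, eq).trim();
    // Drop a trailing " # comment" (whitespace before the hash), then trim.
    const value = trimmed.slice(eq + 1).replace(/\s+#.*$/, "").trim();
    out[key] = value;
  }
  return out;
}

/** Copy parsed values into an env object without overriding existing keys (assumed policy). */
function applyEnv(
  vars: Record<string, string>,
  env: Record<string, string | undefined>,
): void {
  for (const [key, value] of Object.entries(vars)) {
    if (env[key] === undefined) env[key] = value;
  }
}
```

In a real harness the text would come from reading `.env.test.local` if it exists, and `env` would be `process.env`.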

```bash
# .env.test.local (NOT committed)
VLM_ENDPOINTS=http://<host-a>:<port>,http://<host-b>:<port>,http://<host-c>:<port>
VLM_MODEL=<your-vision-model-tag>
VLM_PROTOCOL=auto # auto | openai | ollama
VLM_MAX_CONCURRENCY=3
```

Multiple endpoints are **round-robined**, so visual evaluation spreads across
every GPU you list.
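
As a rough illustration of the round-robin idea, the sketch below splits a `VLM_ENDPOINTS` value and hands out endpoints in rotation. The helper names are made up for this example and the hosts are placeholders; the real scheduling in `tests/helpers/vlm.ts` may differ (for instance in how it interacts with `VLM_MAX_CONCURRENCY`).

```typescript
// Hypothetical helpers; names are illustrative, not taken from tests/helpers/vlm.ts.

/** Split a comma-separated VLM_ENDPOINTS value into clean base URLs. */
function parseEndpoints(raw: string | undefined): string[] {
  return (raw ?? "")
    .split(",")
    .map((s) => s.trim().replace(/\/+$/, "")) // trim whitespace and trailing slashes
    .filter((s) => s.length > 0);
}

/** Return a picker that cycles through the endpoints in order. */
function makeRoundRobin(endpoints: string[]): () => string {
  if (endpoints.length === 0) throw new Error("no VLM endpoints configured");
  let next = 0;
  return () => {
    const endpoint = endpoints[next];
    next = (next + 1) % endpoints.length;
    return endpoint;
  };
}

// Placeholder hosts only; real hosts belong in .env.test.local.
const exampleEndpoints = parseEndpoints("http://gpu-a.example:8000, http://gpu-b.example:11434/");
const pickEndpoint = makeRoundRobin(exampleEndpoints);
const firstThree = [pickEndpoint(), pickEndpoint(), pickEndpoint()];
```

With two endpoints, the third pick wraps back to the first, so consecutive requests alternate between the two hosts.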

## VLM protocol support

`tests/helpers/vlm.ts` is protocol-agnostic and auto-detects per endpoint:

| Protocol | Detected via | Request |
|----------|--------------|---------|
| OpenAI-compatible | `GET /v1/models` | `POST /v1/chat/completions` with an `image_url` data URI (vLLM, LM Studio, llama.cpp server, Ollama's `/v1` shim) |
| Ollama native | `GET /api/tags` | `POST /api/chat` with a base64 `images[]` array |
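
To make the two request shapes in the table concrete, here is a hedged sketch of how a client might build each payload from the same prompt and screenshot. `buildRequest` and its types are hypothetical, and PNG input is assumed; only the endpoint paths and the `image_url` / `images[]` conventions come from the table above.

```typescript
// Hypothetical payload builder; the real client in tests/helpers/vlm.ts may differ.
type Protocol = "openai" | "ollama";

interface VlmRequest {
  path: string;
  body: Record<string, unknown>;
}

function buildRequest(
  protocol: Protocol,
  model: string,
  prompt: string,
  pngBase64: string,
): VlmRequest {
  if (protocol === "openai") {
    // OpenAI-compatible servers take the image as a data URI inside the message content.
    return {
      path: "/v1/chat/completions",
      body: {
        model,
        messages: [
          {
            role: "user",
            content: [
              { type: "text", text: prompt },
              { type: "image_url", image_url: { url: `data:image/png;base64,${pngBase64}` } },
            ],
          },
        ],
      },
    };
  }
  // Ollama's native API takes raw base64 strings in an images[] array on the message.
  return {
    path: "/api/chat",
    body: {
      model,
      stream: false,
      messages: [{ role: "user", content: prompt, images: [pngBase64] }],
    },
  };
}
```

The returned `path` would be appended to whichever base URL was picked for the request, and `body` sent as JSON.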

Force one with `VLM_PROTOCOL=openai` or `ollama`. Each model call asks for a
strict JSON verdict `{pass, score, issues[], summary}`, parsed leniently.
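
"Parsed leniently" is not spelled out here, so the following is one plausible reading rather than the suite's actual parser: pull the outermost JSON object out of whatever text the model returned and coerce each field to the expected type. `parseVerdict` and the fallback values are assumptions.

```typescript
// Hypothetical lenient parser; field names follow the {pass, score, issues[], summary} shape.
interface Verdict {
  pass: boolean;
  score: number;
  issues: string[];
  summary: string;
}

function parseVerdict(raw: string): Verdict | null {
  // Models often wrap JSON in prose or a Markdown fence, so take the outermost {...}.
  const start = raw.indexOf("{");
  const end = raw.lastIndexOf("}");
  if (start === -1 || end <= start) return null;

  let data: unknown;
  try {
    data = JSON.parse(raw.slice(start, end + 1));
  } catch {
    return null; // not valid JSON even after trimming the wrapper text
  }
  if (typeof data !== "object" || data === null) return null;

  const fields = data as Record<string, unknown>;
  const score = Number(fields.score);
  return {
    pass: fields.pass === true || fields.pass === "true",
    score: Number.isFinite(score) ? score : 0,
    issues: Array.isArray(fields.issues) ? fields.issues.map(String) : [],
    summary: typeof fields.summary === "string" ? fields.summary : "",
  };
}
```

Returning `null` for unparseable output lets the caller record a failed call instead of a misleading verdict.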

### Personas

Each captured screenshot is judged from up to four perspectives, chosen per
capture (see `PERSONAS` in `tests/helpers/vlm.ts`):

- **Visual defects** — overlapping/cut-off nodes, unreadable labels, broken
layout, missing edges, error chrome.
- **New-user clarity** — is the screen legible and inviting to a newcomer?
- **Accessibility** — contrast, text size, color-only signals, target size.
- **Living-graph aliveness** — do glow/breathe/flow status cues read clearly?

Report-only: a **FLAG** is the model's subjective concern, surfaced for a human
to look at — it never fails the build. The suite *does* assert the model
answered, so a broken client is still caught.

## Large-scale perf sweep

`tests/perf/scale-sweep.spec.ts` seeds real graphs (via the GraphQL API, the
same path a human/AI uses) of increasing size, loads each at one or more
quality tiers, and records the in-app `window.__graphPerf` readings plus load
time, settle time and query latency.

```bash
# .env.test.local
SCALE_SWEEP_SIZES=50,200,500,1000,2000 # blank => small in CI, large locally
SCALE_SWEEP_QUALITIES=HIGH,ULTRA
```
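
A sketch of how these two variables could expand into the list of (size, quality) runs. The function names and the fallback values passed in are hypothetical; the built-in defaults ("small in CI, large locally") are not specified in this document, so the sketch takes them as parameters.

```typescript
// Hypothetical parsing helpers; the real spec may validate differently.
const QUALITIES = ["LOW", "MEDIUM", "HIGH", "ULTRA"] as const;
type Quality = (typeof QUALITIES)[number];

/** Parse SCALE_SWEEP_SIZES; fall back when blank or when nothing valid remains. */
function parseSizes(raw: string | undefined, fallback: number[]): number[] {
  const sizes = (raw ?? "")
    .split(",")
    .map((s) => Number.parseInt(s.trim(), 10))
    .filter((n) => Number.isInteger(n) && n > 0);
  return sizes.length > 0 ? sizes : fallback;
}

/** Parse SCALE_SWEEP_QUALITIES, keeping only known tiers (case-insensitive). */
function parseQualities(raw: string | undefined, fallback: Quality[]): Quality[] {
  const picked = (raw ?? "")
    .split(",")
    .map((s) => s.trim().toUpperCase())
    .filter((s): s is Quality => (QUALITIES as readonly string[]).includes(s));
  return picked.length > 0 ? picked : fallback;
}

/** Cross product: one run per (size, quality) pair, sizes outermost. */
function sweepMatrix(sizes: number[], qualities: Quality[]): { size: number; quality: Quality }[] {
  return sizes.flatMap((size) => qualities.map((quality) => ({ size, quality })));
}
```

For example, two sizes and two qualities yield four runs, which matters for run time because each pair seeds its own graph.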

Metrics per (size, quality):

- **Reliable (measured directly from the browser, captured at every size):**
rendered node/edge counts, initial load ms, graph-scoped query p95, and
**interaction FPS** — real rendered frames/sec while a node is dragged
(counted via `requestAnimationFrame`, so it needs no app instrumentation and
reflects how janky the graph feels under interaction at scale).
- **Best-effort bonus (from the app's `window.__graphPerf`, which only
publishes ~every 2s while the sim ticks):** settle ms (to `alpha ≤ 0.02`),
avg/p95 sim tick ms, layout drift (`rmsFromSavedPx`). These can be blank for
graphs that settle instantly — `interactionFps` is the headline signal.
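
The two headline numbers reduce to simple arithmetic. The sketch below assumes a nearest-rank percentile and an FPS figure derived from frame timestamps (as would be collected in `requestAnimationFrame` callbacks in the browser); the spec's exact method is not stated here, and both function names are hypothetical.

```typescript
// Hypothetical metric helpers; the sweep may compute these differently.

/** Nearest-rank percentile (p in 0..100); NaN for an empty sample. */
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) return NaN;
  const sorted = [...samples].sort((a, b) => a - b);
  // Multiply before dividing to keep the rank exact for whole-number inputs.
  const rank = Math.ceil((p * sorted.length) / 100);
  const index = Math.min(sorted.length - 1, Math.max(0, rank - 1));
  return sorted[index];
}

/** Frames per second from a list of frame timestamps in milliseconds. */
function fpsFromTimestamps(timestamps: number[]): number {
  if (timestamps.length < 2) return 0;
  const elapsedMs = timestamps[timestamps.length - 1] - timestamps[0];
  // N timestamps bound N - 1 frame intervals.
  return elapsedMs > 0 ? ((timestamps.length - 1) * 1000) / elapsedMs : 0;
}
```

A p95 over only a handful of queries is dominated by the slowest sample, so it is worth reading alongside the sample count.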

A FRESH graph is seeded per (size, quality) so each measurement starts from an
unsettled layout (otherwise the second quality loads the first run's settled,
pinned positions and the sim never ticks). Output:
`test-artifacts/scale-sweep/index.html` — a table plus inline SVG charts of how
each metric scales, with the `@perf` budgets drawn for reference.

Report-only; the only hard assertion is that a seeded graph actually renders.
Each seeded graph is deleted afterward (edges first, then nodes, then graph).

## CI

GitHub-hosted runners can't reach your local GPUs, so `test:vlm` skips there
and neither suite gates merges. To gate on them, register a **self-hosted
runner** on a machine that can reach the endpoints, give it the
`.env.test.local`, and add a workflow job (manual-dispatch or nightly) that runs
`npm run test:perf:scale` / `npm run test:vlm`. The scale sweep needs no VLM, so
it is safe to run on any runner with the dev stack and a small
`SCALE_SWEEP_SIZES`.
4 changes: 4 additions & 0 deletions package.json
@@ -37,6 +37,10 @@
"test:smoke": "playwright test tests/e2e/user-smoke.spec.ts --reporter=line",
"report:showcase": "playwright test --project=showcase && node tests/generate-showcase-report.mjs",
"test:perf": "playwright test --project=perf --reporter=line",
"test:perf:scale": "playwright test --project=perf-scale --reporter=line && node tests/generate-perf-report.mjs",
"report:perf": "node tests/generate-perf-report.mjs",
"test:vlm": "playwright test --project=vlm --reporter=line && node tests/generate-vlm-report.mjs",
"report:vlm": "node tests/generate-vlm-report.mjs",
"perf:bundle": "node tests/perf/check-bundle-size.mjs"
},
"devDependencies": {
29 changes: 26 additions & 3 deletions playwright.config.ts
@@ -38,9 +38,10 @@ export default defineConfig({
projects: [
{
name: 'GraphDone-Core/dev-neo4j/chromium',
// The showcase tour runs in its own capture-heavy project below; keep it
// out of the default (fast) project so the smoke gate stays quick.
testIgnore: /showcase\.spec\.ts/,
// The showcase tour and the local-VLM visual eval run in their own
// capture-heavy projects below; keep them out of the default (fast)
// project so the smoke gate stays quick.
testIgnore: [/showcase\.spec\.ts/, /visual-vlm\.spec\.ts/],
use: { ...devices['Desktop Chrome'] },
},

@@ -65,9 +66,31 @@ export default defineConfig({
{
name: 'perf',
testDir: './tests/perf',
// The large-scale sweep is heavy and report-only; it has its own project
// so `test:perf` (the budget gate) stays fast.
testIgnore: /scale-sweep\.spec\.ts/,
use: { ...devices['Desktop Chrome'] },
},

/* Large-scale graph creation + performance metric sweep. Seeds graphs of
* increasing size and records window.__graphPerf across them. Heavy +
* report-only; run via `npm run test:perf:scale`. */
{
name: 'perf-scale',
testDir: './tests/perf',
testMatch: /scale-sweep\.spec\.ts/,
use: { ...devices['Desktop Chrome'] },
},

/* Local-VLM visual evaluation across personas. Skips unless VLM_ENDPOINTS
* is set in .env.test.local. Run via `npm run test:vlm`. */
{
name: 'vlm',
testDir: './tests/e2e',
testMatch: /visual-vlm\.spec\.ts/,
use: { ...devices['Desktop Chrome'], screenshot: 'on' },
},

// Commented out until browsers installed with system dependencies
// {
// name: 'GraphDone-Core/dev-neo4j/firefox',
118 changes: 118 additions & 0 deletions tests/e2e/visual-vlm.spec.ts
@@ -0,0 +1,118 @@
import { test, expect, Page } from '@playwright/test';
import * as fs from 'fs';
import * as path from 'path';
import { login, TEST_USERS } from '../helpers/auth';
import { seedLargeGraph, deleteGraphDeep } from '../helpers/seedGraph';
import '../helpers/testEnv';
import { isVlmAvailable, evaluateBatch, PERSONAS, personaByKey } from '../helpers/vlm';

/**
* Local-VLM visual evaluation. Captures key user-facing states, then asks a
* locally-hosted vision model to judge each one from four perspectives
* (visual defects, new-user clarity, accessibility, living-graph aliveness).
*
* Report-only: it never fails on a model's subjective verdict — it writes
* test-artifacts/vlm/results.json for `npm run report:vlm` and prints a
* summary. It only asserts the VLM actually answered (so a broken client is
* still caught). Skips entirely when no VLM endpoint is configured/reachable
* (VLM_ENDPOINTS in .env.test.local), so CI stays green.
*/

const SHOT_DIR = path.resolve(process.cwd(), 'test-artifacts/vlm/shots');
const OUT = path.resolve(process.cwd(), 'test-artifacts/vlm/results.json');
const SCALE_DIR = path.resolve(process.cwd(), 'test-artifacts/scale-sweep');

interface Capture { file: string; context: string; personas: string[] }

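async function shot(page: Page, name: string): Promise<string> {
  fs.mkdirSync(SHOT_DIR, { recursive: true });
  const file = path.join(SHOT_DIR, `${name}.png`);
  // A failed screenshot is swallowed so one bad capture doesn't abort the run;
  // the returned path may then point at a file that was never written.
  await page.screenshot({ path: file, fullPage: false }).catch(() => {});
  return file;
}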

test('VLM visual evaluation across personas @vlm', async ({ page }) => {
test.setTimeout(900_000);
const available = await isVlmAvailable();
test.skip(!available, 'No reachable VLM endpoint (set VLM_ENDPOINTS in .env.test.local)');

const allPersonas = PERSONAS.map((p) => p.key);
const captures: Capture[] = [];
const cleanup: string[] = [];

await login(page, TEST_USERS.ADMIN);
await page.waitForTimeout(1500);

// 1. Empty graph — first-run invitation (new-user + visual defects).
const empty = await page.evaluate(async () => {
  const token = localStorage.getItem('authToken') ?? '';
  const post = (query: string, variables?: unknown) =>
    fetch('/api/graphql', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json', Authorization: `Bearer ${token}` },
      body: JSON.stringify({ query, variables }),
    }).then((r) => r.json());
  const me = await post('{ me { id } }');
  const g = await post(
    `mutation($i:[GraphCreateInput!]!){createGraphs(input:$i){graphs{id}}}`,
    { i: [{ name: `VLM Empty ${Date.now()}`, type: 'PROJECT', status: 'ACTIVE', createdBy: me.data.me.id, isShared: true }] },
  );
  return g.data.createGraphs.graphs[0].id as string;
});
cleanup.push(empty);
await page.setViewportSize({ width: 1440, height: 900 });
await page.evaluate((id) => localStorage.setItem('currentGraphId', id), empty);
await page.reload();
await page.waitForTimeout(5000);
captures.push({ file: await shot(page, 'empty-graph-desktop'), context: 'the first-run empty-state of a brand-new project graph in GraphDone, a graph-based task manager', personas: ['visual-defects', 'new-user', 'accessibility'] });

// 2. Populated graph at ULTRA quality — full living-graph experience.
const seeded = await seedLargeGraph(page, { size: 60, namePrefix: 'VLM' });
cleanup.push(seeded.graphId);
await page.evaluate((id) => { localStorage.setItem('currentGraphId', id); localStorage.setItem('graphdone.quality.override', 'ULTRA'); }, seeded.graphId);
await page.reload();
await page.waitForTimeout(8000); // let it settle + effects run
captures.push({ file: await shot(page, 'populated-desktop'), context: 'a populated project graph (~60 work items) with dependency edges; nodes glow by priority and animate by status (in-progress breathes, blocked aches, complete settles)', personas: allPersonas });

// 3. Same graph on a phone viewport — accessibility + new-user on mobile.
await page.setViewportSize({ width: 393, height: 852 });
await page.reload();
await page.waitForTimeout(6000);
captures.push({ file: await shot(page, 'populated-mobile'), context: 'the same project graph viewed on a phone-sized screen (393x852)', personas: ['visual-defects', 'new-user', 'accessibility'] });

// 4. Bonus: judge the SINGLE largest scale-sweep frame for density/legibility
// (those frames are 1920px and slow on the model; one is enough signal).
if (fs.existsSync(SCALE_DIR)) {
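// parseInt reads the leading digits of each filename as the node count;
// names without a numeric prefix count as 0 and sort last.
const largest = fs.readdirSync(SCALE_DIR)
  .filter((f) => f.endsWith('.png'))
  .map((f) => ({ f, size: parseInt(f, 10) || 0 }))
  .sort((a, b) => b.size - a.size)[0];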
if (largest) {
captures.push({ file: path.join(SCALE_DIR, largest.f), context: `a large graph rendered at scale (${largest.size} nodes) — judge whether it stays legible at this density`, personas: ['visual-defects'] });
}
}

// Build and run the persona jobs.
const jobs = captures.flatMap((c) =>
c.personas
.map((pk) => personaByKey(pk))
.filter((p): p is NonNullable<typeof p> => Boolean(p))
.map((persona) => ({ imagePath: c.file, persona, context: c.context, meta: { capture: path.basename(c.file) } }))
);

let results: Awaited<ReturnType<typeof evaluateBatch>> = [];
try {
results = await evaluateBatch(jobs);
} finally {
for (const id of cleanup) await deleteGraphDeep(page, id);
}

fs.mkdirSync(path.dirname(OUT), { recursive: true });
fs.writeFileSync(OUT, JSON.stringify({ generatedAt: new Date().toISOString(), results }, null, 2));

const fails = results.filter((r) => !r.verdict.pass);
// eslint-disable-next-line no-console
console.log(`[vlm] ${results.length} evaluations, ${results.length - fails.length} pass, ${fails.length} flagged:`);
for (const f of fails) {
// eslint-disable-next-line no-console
console.log(` ⚠️ [${f.persona}] ${f.meta?.capture}: ${f.verdict.summary || f.verdict.issues.join('; ')}`);
}

// Report-only: we assert the VLM produced answers, not what it concluded.
expect(results.length, 'VLM returned evaluations').toBeGreaterThan(0);
const answered = results.filter((r) => !r.verdict.issues.some((i) => i.startsWith('VLM request failed') || i.startsWith('No reachable')));
expect(answered.length, 'at least some VLM calls succeeded').toBeGreaterThan(0);
});