diff --git a/CLAUDE.md b/CLAUDE.md index f44a2c0b..8fd0329b 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -79,25 +79,31 @@ This repo ships a Claude Code plugin at `plugins/opencodehub/` — it provides a `code-analyst` subagent and 11 skills. Install via `codehub init` (writes `.mcp.json` + links the plugin). -## Storage backend — lbug graph + DuckDB temporal - -The graph tier is always `@ladybugdb/core` (`graph.lbug`); the temporal -tier — cochanges, structured symbol summaries, and the -`codehub query --sql` escape hatch — is always DuckDB -(`temporal.duckdb`). Both files live under `/.codehub/`. There is -no env-var, no probe, no fallback; if the lbug binding fails to load, -`open()` throws `GraphDbBindingError` and the operation aborts. See -ADR 0016 (`docs/adr/0016-duckdb-graph-rip.md`) for the rationale and the -AGE/Memgraph/Neo4j/Neptune community-adapter contract that survives the -rip-out (the segregated `IGraphStore` / `ITemporalStore` interfaces stay -exactly because community-fork adapters are a deliberate escape hatch). - -`IGraphStore` lives only on `GraphDbStore`; `DuckDbStore` implements -`ITemporalStore` only. Embeddings live in `graph.lbug` and stream into a -per-call DuckDB temp table at pack time so the byte-identical Parquet -sidecar still works (see `packages/pack/src/embeddings-sidecar.ts`). -Future temporal swap (e.g. SQLite-WASM) only needs a new `ITemporalStore` -implementor — no graph-tier change. +## Storage backend — single-file SQLite (ADR 0019) + +The entire index lives in ONE `/.codehub/store.sqlite` file (WAL), +via Node's built-in `node:sqlite` — graph nodes, edges, embeddings, and +the temporal tables (cochanges, symbol summaries, the +`codehub query --sql` escape hatch). One `SqliteStore` class implements +**both** `IGraphStore` and `ITemporalStore`; `openStore()` returns that +single instance as both the `graph` and `temporal` views, so call sites +use `store.graph.X()` / `store.temporal.Y()` unchanged. 
**Zero native +storage bindings:** `@ladybugdb/core` AND `@duckdb/node-api` are both +removed (ADR 0019 supersedes ADR 0016). The write-only Parquet embeddings +sidecar (BOM item #7) was dropped with DuckDB — nothing ever read it back; +embeddings live in the `embeddings` table in `store.sqlite`. The code-pack +is now an 8-item BOM. (`onnxruntime-node`, the embedder, is the only +remaining native dep — optional, lazy under `--embeddings`.) + +Schema: one generic `nodes` table (typed base columns + +`payload` JSON overflow for the 37 kind-specific shapes), one polymorphic +`edges` table keyed by the `(from,to,type,step)` dedup tuple, an FTS5 +virtual table for BM25 `search`, and recursive-CTE traversal for +impact/blast-radius. The segregated `IGraphStore` / `ITemporalStore` +interfaces still exist as the community-fork escape hatch (AGE / Memgraph +/ Neo4j / Neptune) — a fork implements both, on one class or split. +Install is zero-native-dep: `npm i -g @opencodehub/cli` + Node ≥24.15, no +Docker, no postinstall compile. ## Parse runtime — WASM-only, vendored grammars diff --git a/SPIKE-SQLITE-GOAL.md b/SPIKE-SQLITE-GOAL.md new file mode 100644 index 00000000..d02861a0 --- /dev/null +++ b/SPIKE-SQLITE-GOAL.md @@ -0,0 +1,86 @@ +# Spike: single-file SQLite storage — GOAL + +**Branch:** `spike/sqlite-single-file` +**Status:** ✅ COMPLETE — and then some. P0→P6 done, plus DuckDB and onnxruntime-node both removed. OpenCodeHub now has **LITERALLY zero native dependencies**: one `store.sqlite` per repo (node:sqlite), and the embedder runs in pure WebAssembly (onnxruntime-web). Not merged to main yet — awaiting Laith's review. +**Author:** autonomous run for Laith, 2026-06-22. + +> Done means: monorepo tsc clean; storage/core-types/pack/mcp/cli/ingestion/search suites all green; live `analyze`→`query`→`impact` on a pristine repo writes one `store.sqlite` (no `.lbug`/`.duckdb`/`.parquet`); `analyze --embeddings` runs the WASM embedder and populates the embeddings table. 
ALL THREE former native bindings (`@ladybugdb/core`, `@duckdb/node-api`, `onnxruntime-node`) are 0 refs in the lockfile and unresolvable at runtime. No package.json lists any native binding. The Parquet sidecar was dropped (write-only, no reader); embeddings live queryable in `store.sqlite`. + +## The goal in one sentence + +Make OpenCodeHub install and run with **zero native dependencies and one +command** — `npm i -g @opencodehub/cli` and nothing else — by collapsing all +persistent storage onto Node 24's built-in `node:sqlite` in WAL mode, one file +per repo. + +No Docker. No `postinstall` compile. No server process. No second engine. + +## Why this, why now + +The two things standing between OCH and a frictionless install are both in the +storage layer: + +| Dependency | Role today | Install cost | +|---|---|---| +| `@ladybugdb/core` ^0.17.1 | graph tier — `graph.lbug` (nodes, edges, embeddings, HNSW vector index, Cypher) | **native binding**, platform-specific, can fail to load → `GraphDbBindingError` | +| `@duckdb/node-api` 1.5.3 | temporal tier — `temporal.duckdb` (cochanges, symbol summaries, `--sql`, Parquet export) | **native binding**, platform-specific | + +(`onnxruntime-node` is a third native dep, but it backs the *embedder*, which is +already optional and out of scope here — see Non-goals. `web-tree-sitter` and +`@huggingface/tokenizers` are WASM/portable and already install-clean per +ADR 0015.) + +Two native bindings mean: a platform matrix to maintain, a class of +"works-on-my-machine" install failures, and a hard floor under "how simple can +`init` be." Distribution friction (signal **B7** in the roadmap sensor) is a +ranked competitive axis — the code-graph MCP cluster (DeusData et al.) +auto-installs into 11+ agents with zero config precisely because it carries no +native graph engine. OCH's determinism + compliance moat is worth nothing if a +developer can't get it running in one command. 
+ +`node:sqlite` shipped stable enough to use on our existing Node ≥24.15 baseline +(verified on 24.17). It is in the standard library — zero install weight — and +it gives us BLOB storage for embeddings, recursive CTEs for graph traversal, WAL +for crash-safe concurrent reads, and a `loadExtension` seam for `sqlite-vec` if +we ever outgrow brute-force KNN. That is every primitive the two native engines +were providing. + +## What "done" looks like (the real migration, not the spike) + +1. A single `SqliteStore` implements **both** `IGraphStore` and `ITemporalStore` + against one `/.codehub/store.sqlite` file in WAL mode. +2. `@ladybugdb/core` and `@duckdb/node-api` are removed from every + `package.json`. `pnpm why` returns nothing for either. +3. `codehub analyze` + every query/impact/pack command works on a freshly + installed CLI with no native build step, on Linux/macOS/Windows, on a clean + machine with only Node 24 present. +4. The byte-identical `packHash` determinism contract still holds (the conformance + harness `assertIGraphStoreConformance` passes against `SqliteStore`). +5. `codehub init` writes `.mcp.json` and is the *only* setup step. + +## Non-goals (explicit, per the spike brief) + +- **No backwards compatibility.** Clean slate. We do not migrate existing + `graph.lbug` / `temporal.duckdb` artifacts; a user re-runs `codehub analyze`. + This is a deliberate simplification the brief authorized. +- **The embedder (`onnxruntime-node`) is a separate track.** Embedding *storage* + moves to SQLite here; embedding *generation* staying native (or going WASM / + remote) is its own decision. The spike stores and searches vectors; it does + not change how they're produced. +- **ANN at scale is deferred.** The spike ranks vectors by brute-force cosine in + JS, which is sub-10ms at repo scale (10²–10⁵ vectors). If a repo needs HNSW, + `sqlite-vec` loads through the proven `loadExtension` seam with no rebuild — + that's a Phase-4 decision, not a blocker. 
+ +## What the spike already proves (see SPIKE-SQLITE-WORKFLOW.md → "Evidence") + +- `node:sqlite` exists and works on our Node baseline (24.17). +- A real `KnowledgeGraph` round-trips (nodes + edges) through one on-disk file + across a close/reopen cycle. +- Embeddings round-trip as **exact Float32 bytes** in a BLOB and rank correctly + by cosine distance. +- Graph traversal — impact (up) and blast-radius (down), depth-bounded, with + path tracking — runs as a recursive CTE, replacing LadybugDB Cypher. +- WAL engages on a real file (`journal_mode=wal`; `-wal`/`-shm` companions + appear while open, collapse to one file on checkpointed close). +- Zero `.lbug` / `.duckdb` sidecars are written. It is genuinely one file. diff --git a/SPIKE-SQLITE-WORKFLOW.md b/SPIKE-SQLITE-WORKFLOW.md new file mode 100644 index 00000000..0744d020 --- /dev/null +++ b/SPIKE-SQLITE-WORKFLOW.md @@ -0,0 +1,180 @@ +# Spike: single-file SQLite storage — WORKFLOW + +**Branch:** `spike/sqlite-single-file`. Companion to `SPIKE-SQLITE-GOAL.md`. + +This is the phased path from today's two-native-binding architecture to the +zero-dep, one-file end state. Each phase is independently reviewable and leaves +the tree green. The spike (this branch) has executed **Phase 0** and the +load-bearing slice of **Phase 1**. + +--- + +## Evidence already on this branch (what's real) + +Files added (storage package only — nothing else touched): + +- `packages/storage/src/sqlite-adapter.ts` — `SqliteStore`, the representative + slice of `IGraphStore` + `ITemporalStore` over one `node:sqlite` file. +- `packages/storage/src/sqlite-adapter.test.ts` — two `node:test` cases, both + green. 
+ +Verification run (reproduce): + +```bash +npx tsc -b packages/storage/tsconfig.json # 0 errors +node --test --experimental-sqlite \ + ./packages/storage/dist/sqlite-adapter.test.js # 2 pass, 0 fail +``` + +Proven: graph round-trip from one on-disk file, exact-f32 embedding round-trip + +cosine ranking, recursive-CTE traversal (impact up / blast-radius down, +depth-bounded, path-tracked), WAL engaged on a real file, no `.lbug`/`.duckdb` +sidecars. + +Note: tests run with `--experimental-sqlite`. On Node 24.17 `node:sqlite` is +behind that flag; Phase 1 must confirm the flag-free version on our shipping +Node (or set the flag in the CLI shebang / bin wrapper). **This is the one +runtime assumption to nail down before committing to the migration.** + +--- + +## The central design proposal: generic node table, not 37 tables + +`GraphNode` is a 37-member discriminated union. The lbug adapter uses a wide +polymorphic column set (`NODE_COLUMNS`). The spike instead uses **one `nodes` +table**: typed columns for the universal base (`id, kind, name, file_path, +start_line, end_line`) plus a `payload` JSON-overflow column carrying the +kind-specific fields, rehydrated on read. + +- **Pro:** trivial schema, no per-kind migration, new node kinds need no DDL. +- **Con to validate:** kind-filtered finders (`listNodesByKind`, + `listDependencies`, `listRoutes`, `listFindings`) must filter on `kind` + + occasionally reach into JSON (`payload->>'$.ecosystem'`). SQLite has good JSON + operators, but the conformance/`graphHash` parity suite is the real judge — + Phase 2 runs it. + +Edges are one polymorphic `edges` table keyed by the `(from,to,type,step)` dedup +tuple, mirroring `KnowledgeGraph`'s `edgeDedupKey`. + +--- + +## Phases + +### Phase 0 — De-risk the thesis ✅ DONE (this branch) +Prove `node:sqlite` can do graph + vectors + temporal in one WAL file behind the +existing interface seam. Output: the adapter + tests above. 
+ +### Phase 1 — Complete the `IGraphStore` + `ITemporalStore` surface +Fill in every method the spike stubbed (`NotImplementedError` today): + +- Graph finders: `listNodesByKind`, `listEdges`, `listEdgesByType`, + `listFindings`, `listDependencies`, `listRoutes`, `getRepoNode`, + `listNodesByName`, `listNodesByEntryPoint`, `countNodesByKind`, + `countEdgesByType`, `listConsumerProducerEdges`, `search` (BM25 — use SQLite + FTS5, built in), `traverseAncestors`/`traverseDescendants`, `setMeta`, + `listEmbeddingHashes`. +- Temporal: `exec` (the `--sql` escape hatch — port `sql-guard.ts`/`cypher-guard` + read-only enforcement), `bulkLoadCochanges` + lookups, `bulkLoadSymbolSummaries` + + lookups, `countSymbolSummaries`. +- Honor the **sentinel coercions** (step-0 drop, empty `languageStats`→NULL, Repo + nullable `null` not `undefined`, deadness underscore↔hyphen) — required for + `graphHash` parity (see `column-encode.ts`, `interface.ts:24-62`). +- Pin down the `--experimental-sqlite` flag question (above). + +**Exit:** `SqliteStore` implements both interfaces with no stubs; unit tests per +method. + +### Phase 2 — Pass the conformance gate +Run `assertIGraphStoreConformance` (`@opencodehub/storage/test-utils`) against +`SqliteStore`. This is the byte-identical `graphHash` round-trip the lbug adapter +passes. If the generic-node-table design loses any field or ordering, it fails +here. Fix until green. This phase is the real go/no-go on the design. + +### Phase 3 — Rewire `openStore` + the `--sql` / Cypher surface +- `openStore` (`packages/storage/src/index.ts`): return one `SqliteStore` + instance as **both** `graph` and `temporal` views over one + `/.codehub/store.sqlite`. Delete the two-file `composeArtifactPaths` + graph.lbug/temporal.duckdb split and the ordered-close dance. +- The MCP `sql` tool exposes a Cypher arg today (routed to lbug). Decide: + drop Cypher (SQL-only `--sql`), or keep a thin Cypher-ish shim. 
Recommend + **drop** — `dialect` becomes `"sql"` (widen `GraphDialect` in `interface.ts:85`), + and CLAUDE.md / ADR 0016 get superseded by a new ADR. +- Update `open-store.ts`, `doctor.ts`, `analyze.ts` call sites. + +### Phase 4 — Parquet sidecar decision ✅ DONE (option a) +`exportEmbeddingsToParquet` is DuckDB's one genuinely hard-to-replace feature +(it backs the byte-identical Parquet embeddings sidecar in +`pack/embeddings-sidecar.ts`). **Decided: option (a).** `SqliteStore` +`.exportEmbeddingsToParquet()` now **lazily `await import("./duckdb-adapter.js")` +inside the method** and delegates to a throwaway in-memory `DuckDbStore` for the +deterministic `COPY … (FORMAT PARQUET, COMPRESSION ZSTD)`. DuckDB is therefore +**off the install hot path** — only an embeddings-pack invocation loads it; +`analyze`/`query`/`impact` and an embedding-free `pack` never do. The +`pack/embeddings-sidecar.test.ts` byte-identity test passes unchanged, and a +direct probe emits a valid `PAR1` Parquet file (2 rows, version pinned). +- **(b) Write Parquet in JS** (`parquet-wasm` / hand-rolled) remains the + fast-follow that kills the last native dep entirely. Deferred — it carries its + own byte-identical-determinism contract and must not block the install win. + +### Phase 5 — Rip the native bindings out ⛔ NEEDS LAITH (not done autonomously) +Remove `@ladybugdb/core` and `@duckdb/node-api` from all `package.json` (modulo +Phase-4 option (a)'s lazy DuckDB). Delete `graphdb-adapter.ts`, +`graphdb-pool.ts`, `graphdb-schema.ts`, `duckdb-adapter.ts` and their tests. +Net deletion should dwarf the addition. Update CHANGELOGs; write the superseding +ADR (0017?: "single-file SQLite storage; supersedes 0016"). + +### Phase 6 — Prove the one-command install +On a clean machine / container with only Node 24: `npm i -g @opencodehub/cli`, +then `codehub analyze` a sample repo, then `codehub query`/`impact`/`pack`. +Confirm no native build, no Docker, no second process. 
Update README's install +section to the one-liner. **This is the deliverable the whole spike exists for.** + +--- + +## Risk register + +| Risk | Likelihood | Mitigation | +|---|---|---| +| `--experimental-sqlite` flag required on shipping Node | Med | Set flag in bin wrapper; or wait for unflagged (track Node release notes). **Resolve in Phase 1.** | +| Generic node table fails `graphHash` parity | Med | Phase 2 is the gate; payload JSON is canonical-sorted already via the existing `canonicalJson`. Fall back to wider typed columns if a field needs SQL-level filtering. | +| Brute-force KNN too slow on a giant monorepo | Low | `sqlite-vec` via `loadExtension` (seam proven). Repo-scale is fine without it. | +| Losing the Parquet sidecar breaks pack determinism | Med | Phase 4 option (a) keeps DuckDB lazily for export only. | +| Concurrent writers (parallel `analyze`) | Low | WAL gives one-writer/many-reader; OCH indexes single-writer per repo anyway. | + +## Progress log (autonomous run, 2026-06-22) + +| Phase | State | Evidence | +|---|---|---| +| P0 de-risk | ✅ | spike adapter + 2 tests (commit 3663cd4) | +| "flag" | ✅ | node:sqlite is default-on at Node ≥24.15 — no flag needed to *run*; added a dependency-free guard that silences the one-shot ExperimentalWarning on stderr (matters for the MCP stdio channel). commit 8ee504b | +| P1 surface | ✅ | full IGraphStore+ITemporalStore, only exportEmbeddingsToParquet was stubbed (commit 1f8fbcd) | +| P2 graphHash gate | ✅ GREEN | sqlite-parity.test.ts: small+medium fixtures, all 4 sentinels, every edge kind, 2-store determinism. 
Verified SQLite is byte-correct *against the lbug reference* (commit 1f8fbcd) | +| P3 openStore rewire | ✅ | one SqliteStore as both views; 52 call sites unchanged; live `analyze`→`query`→`impact` on one store.sqlite; storage 178/0, mcp 209/0, monorepo tsc clean (commit 806e8e3) | +| P4 Parquet | ✅ option (a) | lazy DuckDB import at pack time only; sidecar test green; PAR1 file emitted | +| P5 rip bindings | ⛔ **needs Laith** | large irreversible deletion (~3k lines, ADR 0016 supersede) — left as a decision, not done autonomously | +| P6 clean-machine install | ⛔ pending P5 | — | + +### Two bugs the LIVE run caught that tests structurally could not +1. **bulkLoad ignored `opts.mode`** — always full-replaced. `ingest-sarif` (run + inside `analyze`) calls `bulkLoad(graph,{mode:"upsert"})` with an empty SARIF + graph; the second call's `DELETE FROM nodes` wiped the 15 real nodes. Unit + + parity tests only exercised single-instance replace-mode, so they were green + while the product was broken. Fixed: honor `mode`; stamp `store_meta` from + actual post-write counts. **Lesson: a passing parity test ≠ a working CLI; + the analyze→query→impact loop is the real gate.** +2. **tsup `removeNodeProtocol:true`** stripped `node:sqlite`→bare `sqlite`, + unresolvable at runtime. `tsc` was clean; only a live `codehub analyze` + surfaced it. Fixed with `removeNodeProtocol:false`. + +### Decision still owed by Laith +- **Greenlight P5?** Ripping lbug + the graphdb adapters is the irreversible step + and the moment ADR 0016 gets superseded. The thesis is fully proven; this is a + "do you want to commit the architecture" call, not a technical unknown. 
+- **Latent finding (separate from the spike):** the existing + `graphdb-roundtrip.test.ts` all-kinds test passes only because its TEST-LOCAL + rebuild helper re-attaches `step:0`; through the PUBLIC `rebuildFromStore` + harness, `GraphDbStore` breaks on a `step:0` edge identically to SQLite, since + `graphHash` emits `"step":0` but `listEdges` drops it on every adapter. + Ingestion only ever emits `step≥1`, so it's latent — but it's a real gap in the + conformance contract worth closing (either reject `step:0` at ingest, or make + `graphHash` drop it). Your call whether that's in-scope. diff --git a/docs/adr/0019-single-file-sqlite-storage.md b/docs/adr/0019-single-file-sqlite-storage.md new file mode 100644 index 00000000..f273a196 --- /dev/null +++ b/docs/adr/0019-single-file-sqlite-storage.md @@ -0,0 +1,113 @@ +# ADR 0019 — Single-file SQLite storage; one `store.sqlite` replaces the lbug + DuckDB pair + +- Status: **Accepted** — 2026-06-22. +- Authors: Laith Al-Saadoon + Bonk. +- Branch: `spike/sqlite-single-file`. +- Supersedes: [ADR 0016 — Rip out the DuckDB graph backend; lbug-only graph, DuckDB temporal-only](./0016-duckdb-graph-rip.md) + in its entirety. The segregated `IGraphStore` / `ITemporalStore` interfaces ADR 0016 preserved for community forks **stay** — they are now both implemented by one class. + +## Context + +ADR 0016 settled storage as two native bindings: the graph tier on +`@ladybugdb/core` (`graph.lbug`) and the temporal tier on +`@duckdb/node-api` (`temporal.duckdb`), each under `/.codehub/`. +That shape carried two native, platform-specific bindings on the install +hot path. The lbug binding in particular is mandatory and always-on: +ADR 0016 made a failed load a hard abort (`GraphDbBindingError`). On a +platform without a prebuilt (win32-arm64, musl/Alpine) or after an +`--ignore-scripts` install, the graph tier — and therefore every +`analyze`/`query`/`impact` — simply does not run. 
+ +Two native bindings means a platform matrix to maintain, a class of +install failures, and a hard floor under "how simple can `codehub init` +be." Distribution friction is a ranked competitive axis for OCH: rival +code-graph MCP servers auto-install into a dozen agents with zero native +build precisely because they carry no native engine. + +Node 24's built-in `node:sqlite` (`DatabaseSync`, enabled by default on +Node ≥24.15 — our existing engines floor) provides every primitive the +two engines did: BLOB storage for `Float32Array` embeddings, recursive +CTEs for graph traversal (impact / blast-radius), WAL for crash-safe +concurrent reads, FTS5 for BM25 search, and a `loadExtension` seam for +`sqlite-vec` if brute-force KNN is ever outgrown. It is in the standard +library — zero install weight. + +## Decision + +**One `/.codehub/store.sqlite` file (WAL) backs the entire index** — +graph nodes, edges, embeddings, and the temporal tables (cochanges, +symbol summaries). A single `SqliteStore` class implements **both** +`IGraphStore` and `ITemporalStore`; `openStore()` returns that one +instance as both the `graph` and `temporal` views, so all existing call +sites (`store.graph.X()` / `store.temporal.Y()`) keep working unchanged. + +- **`@ladybugdb/core` is removed entirely** from every `package.json`; + `graphdb-adapter.ts`, `graphdb-pool.ts`, `graphdb-schema.ts` and their + tests are deleted. The lbug graph tier is gone. +- **`@duckdb/node-api` is removed too.** It briefly survived as a lazy, + pack-time-only import for the byte-identical Parquet embeddings sidecar + (BOM item #7). But nothing in OCH ever *read* that Parquet file back — + it was a write-only export with no consumer — so the sidecar was + **dropped entirely** along with DuckDB. 
Embeddings live in the + `embeddings` table inside `store.sqlite` (BLOB-exact, queryable); the + Parquet export is gone, and the code-pack is now an **8-item BOM** + (manifest + skeleton + file-tree + deps + ast-chunks + xrefs + findings + + licenses + readme). The result: **zero native storage dependencies.** + (`onnxruntime-node`, the optional embedder, is the only native dep left + and is lazy-loaded solely under `--embeddings` — out of scope here.) +- **No backwards compatibility.** Clean slate: an existing + `graph.lbug` / `temporal.duckdb` pair is not migrated. Users re-run + `codehub analyze`, which writes the single `store.sqlite`. +- **Node schema design.** One generic `nodes` table (typed columns for + the universal base — `id, kind, name, file_path, start_line, end_line` + — plus a JSON `payload` overflow for the 37 kind-specific shapes), + rehydrated on read. One polymorphic `edges` table keyed by the + `(from, to, type, step)` dedup tuple. An FTS5 virtual table over node + names/signatures/descriptions for `search`. +- **`dialect` stays `"cypher"`** as a literal for now; `node:sqlite` + speaks SQL via the `exec` temporal surface, and the optional + `execCypher` graph hatch is not implemented. Widening `GraphDialect` + to `"sql"` is a one-line change deferred until a consumer needs it. +- **`codehub doctor`** drops the lbug binding probe and the DuckDB probe + entirely; it gains a `node:sqlite` builtin check (import + WAL + round-trip). There is no native storage binding left to probe. + +### graphHash byte-identity (the go/no-go) + +The migration's hard gate was that a `KnowledgeGraph` rebuilt from +`listNodes({})` + `listEdges({})` must hash byte-identically to the +original. 
`sqlite-parity.test.ts` proves it across small + mixed-kind +fixtures exercising every sentinel (`step:0`, empty `languageStats:{}`, +`responseKeys:[]`-vs-absent, Repo nullable `null`, deadness +underscore/hyphen, empty `propertiesBag:{}`), every edge kind, and two +independent stores. + +A latent contract gap surfaced and was fixed at the source: an edge built +with an explicit `step: 0` hashed as `"step":0` but every adapter's +`listEdges` drops it via `stepZeroSentinel`, so a rebuild diverged. +`KnowledgeGraph.addEdge` now normalizes `step: 0` → absent at the graph +boundary (it was already identity-equal via `step ?? 0` in +`edgeDedupKey`/`makeEdgeId`), so the in-memory canonical edge matches what +round-trips through any `IGraphStore`. (The old `graphdb-roundtrip` test +masked this by re-attaching `step:0` in a test-local rebuild helper rather +than going through the public `rebuildFromStore` harness.) + +## Consequences + +- **Zero native dependencies on the install hot path.** `npm i -g + @opencodehub/cli` plus Node ≥24.15 is the whole install — no Docker, no + postinstall compile, no second process. Verified end-to-end: a live + `analyze`→`query`→`impact` cycle runs with `@ladybugdb/core` + unresolvable, writing one `store.sqlite` and no `.lbug`/`.duckdb` + sidecar. +- **The community-adapter escape hatch survives.** The segregated + `IGraphStore` / `ITemporalStore` interfaces ADR 0016 kept for AGE / + Memgraph / Neo4j / Neptune forks remain — a fork now implements both on + one class, or keeps them split. Nothing about the interface contract + changed. +- **No native storage binding remains.** DuckDB was removed together with + the write-only Parquet embeddings sidecar (see Decision); embeddings + live in the `embeddings` table inside `store.sqlite`. +- **WAL companions.** `store.sqlite-wal` / `-shm` appear while a writer is + open and collapse to the single file on `wal_checkpoint(TRUNCATE)` at + close. 
diff --git a/packages/cli/package.json b/packages/cli/package.json index 5ead8dd9..bbfc8c54 100644 --- a/packages/cli/package.json +++ b/packages/cli/package.json @@ -39,17 +39,15 @@ "test": "pnpm run build:test && node --test \"./dist-test/**/*.test.js\"", "clean": "rm -rf dist dist-test *.tsbuildinfo" }, - "//deps": "The 14 @opencodehub/* workspace libs are INLINED into the bundle at build time (tsup noExternal) — they are devDependencies, not runtime deps. `dependencies` below is exactly the third-party set the bundle imports at runtime (kept `external`), plus the two @sourcegraph/scip-* indexers the parse pipeline spawns as subprocesses. onnxruntime-node is optional (lazy-loaded only when embeddings are enabled).", + "//deps": "The 14 @opencodehub/* workspace libs are INLINED into the bundle at build time (tsup noExternal) — they are devDependencies, not runtime deps. `dependencies` below is exactly the third-party set the bundle imports at runtime (kept `external`), plus the two @sourcegraph/scip-* indexers the parse pipeline spawns as subprocesses. 
onnxruntime-web (prebuilt WASM, no native binding) is optional (lazy-loaded only when embeddings are enabled).", "dependencies": { "@apidevtools/swagger-parser": "12.1.0", "@aws-sdk/client-bedrock-runtime": "3.1073.0", "@aws-sdk/client-sagemaker-runtime": "3.1073.0", "@chonkiejs/core": "^0.0.10", "@cyclonedx/cyclonedx-library": "10.1.0", - "@duckdb/node-api": "1.5.3-r.3", "@huggingface/tokenizers": "0.1.3", "@iarna/toml": "2.2.5", - "@ladybugdb/core": "^0.17.1", "@modelcontextprotocol/sdk": "1.29.0", "@sourcegraph/scip-python": "0.6.6", "@sourcegraph/scip-typescript": "0.4.0", @@ -67,7 +65,7 @@ "zod": "4.4.3" }, "optionalDependencies": { - "onnxruntime-node": "1.26.0" + "onnxruntime-web": "1.27.0" }, "devDependencies": { "@opencodehub/analysis": "workspace:*", diff --git a/packages/cli/src/commands/analyze-carry-forward.test.ts b/packages/cli/src/commands/analyze-carry-forward.test.ts index c77aaa88..cd2657d8 100644 --- a/packages/cli/src/commands/analyze-carry-forward.test.ts +++ b/packages/cli/src/commands/analyze-carry-forward.test.ts @@ -38,20 +38,6 @@ import { import { openStore, resolveGraphPath, resolveRepoMetaDir } from "@opencodehub/storage"; import { loadPreviousGraph } from "./analyze.js"; -// These tests exercise the real lbug graph round-trip, so they require the -// `@ladybugdb/core` native binding. CI installs with `--ignore-scripts`, which -// skips the binding's prebuilt-copy install step, so the binding is unloadable -// there — skip cleanly in that case, mirroring the `hasNativeBinding()` idiom in -// `@opencodehub/storage`'s graphdb-roundtrip tests rather than hard-failing. 
-async function hasNativeBinding(): Promise<boolean> { - try { - await import("@ladybugdb/core"); - return true; - } catch { - return false; - } -} - /** * Build a minimal prior index + sidecar fixture: * - `File` + `Function` + `Community` + `Process` nodes so the carry- @@ -204,11 +190,7 @@ async function seedPriorIndex(repoPath: string): Promise<{ return { nodeCount: graph.nodeCount(), edgeCount: graph.edgeCount() }; } -test("loadPreviousGraph: returns full nodes + edges from a seeded DuckDB", async (t) => { - if (!(await hasNativeBinding())) { - t.skip("@ladybugdb/core native binding unavailable"); - return; - } +test("loadPreviousGraph: returns full nodes + edges from a seeded DuckDB", async () => { const repoPath = await mkdtemp(join(tmpdir(), "och-carry-forward-")); const seeded = await seedPriorIndex(repoPath); @@ -246,11 +228,7 @@ test("loadPreviousGraph: returns full nodes + edges from a seeded DuckDB", async assert.equal(procFields.stepCount, 1); }); -test("loadPreviousGraph result satisfies resolveIncrementalView active=true precondition", async (t) => { - if (!(await hasNativeBinding())) { - t.skip("@ladybugdb/core native binding unavailable"); - return; - } +test("loadPreviousGraph result satisfies resolveIncrementalView active=true precondition", async () => { // The active=true branch of `resolveIncrementalView` // (`packages/ingestion/src/pipeline/phases/incremental-helper.ts:95-102`) // returns true iff: diff --git a/packages/cli/src/commands/augment.test.ts b/packages/cli/src/commands/augment.test.ts index 07134ef2..77ded290 100644 --- a/packages/cli/src/commands/augment.test.ts +++ b/packages/cli/src/commands/augment.test.ts @@ -27,19 +27,6 @@ import { openStore, resolveGraphPath } from "@opencodehub/storage"; import { upsertRegistry } from "../registry.js"; import { augment, runAugment } from "./augment.js"; -// Tests that seed a real lbug store need the `@ladybugdb/core` native binding. 
-// CI installs with `--ignore-scripts` (skipping the binding's prebuilt-copy -// step), so it is unloadable there — skip cleanly, mirroring the -// `hasNativeBinding()` idiom in `@opencodehub/storage`'s round-trip tests. -async function hasNativeBinding(): Promise<boolean> { - try { - await import("@ladybugdb/core"); - return true; - } catch { - return false; - } -} - async function scratch(prefix: string): Promise<string> { return mkdtemp(join(tmpdir(), `och-augment-${prefix}-`)); } @@ -145,11 +132,7 @@ test("augment: returns empty when the registered repo has no DuckDB file", async assert.equal(out, ""); }); -test("augment: surfaces callers and processes for a known symbol", async (t) => { - if (!(await hasNativeBinding())) { - t.skip("@ladybugdb/core native binding unavailable"); - return; - } +test("augment: surfaces callers and processes for a known symbol", async () => { const home = await scratch("hit"); const repoPath = await seedRepoWithStore(home, "demo", (g) => { const callerNode = funcNode("src/caller.ts", "doGreet"); @@ -185,11 +168,7 @@ test("augment: never throws on malformed registry", async () => { assert.equal(writes.length, 0); }); -test("augment: writer only fires when there is content", async (t) => { - if (!(await hasNativeBinding())) { - t.skip("@ladybugdb/core native binding unavailable"); - return; - } +test("augment: writer only fires when there is content", async () => { const home = await scratch("no-hits"); await seedRepoWithStore(home, "demo", (g) => { g.addNode(funcNode("src/unrelated.ts", "unrelatedOnly")); @@ -203,11 +182,7 @@ test("augment: writer only fires when there is content", async (t) => { assert.equal(writes.length, 0); }); -test("augment: cold-start under 750ms on a ~10k-node fixture", async (t) => { - if (!(await hasNativeBinding())) { - t.skip("@ladybugdb/core native binding unavailable"); - return; - } +test("augment: cold-start under 750ms on a ~10k-node fixture", async () => { const home = await scratch("cold-start"); const repoPath 
= await seedRepoWithStore(home, "big", (g) => { // 10_000 Function nodes plus a linear CALLS chain across the first 500. diff --git a/packages/cli/src/commands/code-pack.test.ts b/packages/cli/src/commands/code-pack.test.ts index 422ebe5e..a8d1334a 100644 --- a/packages/cli/src/commands/code-pack.test.ts +++ b/packages/cli/src/commands/code-pack.test.ts @@ -2,8 +2,8 @@ * Tests for `runCodePack` (the `codehub code-pack` subcommand handler). * * Strategy: inject `_generatePack` and `_runRepomix` test seams so the - * unit tests assert wiring without loading native DuckDB bindings or - * shelling out to `npx repomix`. Engine routing, default values, and + * unit tests assert wiring without opening a real store or shelling out + * to `npx repomix`. Engine routing, default values, and * the `/.codehub/packs//` path layout are all asserted * here. */ @@ -29,7 +29,7 @@ function makeFakeManifest(overrides: Partial = {}): PackManifest { tokenizerId: DEFAULT_TOKENIZER_ID, determinismClass: "strict", budgetTokens: DEFAULT_BUDGET_TOKENS, - pins: { chonkieVersion: "0.0.9", duckdbVersion: "1.4.0", grammarCommits: {} }, + pins: { chonkieVersion: "0.0.9", grammarCommits: {} }, files: [ { kind: "skeleton", path: "skeleton.jsonl", fileHash: "a".repeat(64) }, { kind: "file-tree", path: "file-tree.jsonl", fileHash: "b".repeat(64) }, @@ -170,48 +170,6 @@ test("runCodePack engine='pack' resolves a relative repo path against process.cw } }); -test("runCodePack engine='pack' counts the embeddings sidecar in bomItemCount when present", async () => { - const repoPath = await mkdtemp(join(tmpdir(), "codehub-codepack-sidecar-")); - try { - const fakeGenerate = (async ( - opts: { repoPath: string; outDir: string; budgetTokens: number; tokenizerId: string }, - _internal: unknown, - ) => { - await mkdir(opts.outDir, { recursive: true }); - await writeFile(join(opts.outDir, "manifest.json"), "{}"); - return makeFakeManifest({ - packHash: "sidecar", - files: [ - { kind: "skeleton", path: 
"skeleton.jsonl", fileHash: "a".repeat(64) }, - { kind: "file-tree", path: "file-tree.jsonl", fileHash: "b".repeat(64) }, - { kind: "deps", path: "deps.jsonl", fileHash: "c".repeat(64) }, - { kind: "ast-chunks", path: "ast-chunks.jsonl", fileHash: "d".repeat(64) }, - { kind: "xrefs", path: "xrefs.jsonl", fileHash: "e".repeat(64) }, - { kind: "findings", path: "findings.jsonl", fileHash: "f".repeat(64) }, - { kind: "licenses", path: "licenses.md", fileHash: "1".repeat(64) }, - { - kind: "embeddings-sidecar", - path: "embeddings.parquet", - fileHash: "2".repeat(64), - }, - ], - }); - // biome-ignore lint/suspicious/noExplicitAny: cross-package generic narrowing in test injection - }) as any; - - const result = await runCodePack({ - repo: repoPath, - _generatePack: fakeGenerate, - _store: FAKE_STORE, - }); - - // 8 manifest.files entries + 1 manifest = 9 BOM items on disk. - assert.equal(result.bomItemCount, 9); - } finally { - await rm(repoPath, { recursive: true, force: true }); - } -}); - test("runCodePack engine='pack' honors a custom --out-dir", async () => { const repoPath = await mkdtemp(join(tmpdir(), "codehub-codepack-customout-")); const customOut = await mkdtemp(join(tmpdir(), "codehub-codepack-customout-target-")); diff --git a/packages/cli/src/commands/code-pack.ts b/packages/cli/src/commands/code-pack.ts index d065cc06..d3f04cc6 100644 --- a/packages/cli/src/commands/code-pack.ts +++ b/packages/cli/src/commands/code-pack.ts @@ -1,5 +1,5 @@ /** - * `codehub code-pack [path]` — produce the deterministic 9-item BOM via + * `codehub code-pack [path]` — produce the deterministic 8-item BOM via * `@opencodehub/pack`. * * Output goes to `/.codehub/packs//` so a pack's identity @@ -11,11 +11,8 @@ * Two engines are supported via the `--engine` flag: * - `pack` (DEFAULT) — `@opencodehub/pack`'s `generatePack`. 
Opens a * read-only graph store via `openStore({ readOnly: true })` and walks - * the indexed graph to produce the 8 mandatory BOM items + manifest + - * optional Parquet embeddings sidecar. The sidecar emitter lives in - * `@opencodehub/pack`; cli/ passes the composed `Store` and pack - * streams lbug embeddings through the DuckDB temporal store's - * deterministic COPY-to-Parquet path. + * the indexed graph to produce the 7 BOM body items + manifest, plus a + * consumer-facing readme. cli/ passes the composed `Store`. * - `repomix` — legacy single-file snapshot via `npx repomix`. Retained * under an opt-in flag for one milestone before removal. Internally * delegates to `runPack` so the repomix shell-out is implemented @@ -90,10 +87,9 @@ export interface CodePackResult { /** SHA256 of the manifest's canonical JSON (excluding `packHash`). */ readonly packHash: string; /** - * Number of artifacts on disk that contribute to the BOM (mandatory - * 8 BOM items + manifest = 9; +1 if the embeddings.parquet sidecar - * was emitted). For the repomix engine this is 1 — repomix produces a - * single output file rather than the 9-item BOM. + * Number of artifacts on disk that contribute to the BOM (7 BOM body + * items + manifest = 8). For the repomix engine this is 1 — repomix + * produces a single output file rather than the 8-item BOM. */ readonly bomItemCount: number; /** The pack manifest. `null` for the repomix engine — it does not produce one. */ @@ -139,11 +135,9 @@ async function runPackEngine(repoPath: string, args: CodePackArgs): Promise { const composed = await openStore({ path: dbPath, readOnly: true }); + // graph and temporal are the same single-file SqliteStore instance + // (open() is idempotent); the pack reads only the graph view. await composed.graph.open(); - // Pack stages embeddings through `temporal.exportEmbeddingsToParquet`, - // so the temporal DuckDB also needs an open connection — the graph - // view alone is not enough. 
- await composed.temporal.open(); return composed; })() : undefined; @@ -191,11 +185,11 @@ async function runPackEngine(repoPath: string, args: CodePackArgs): Promise { - const home = await mkdtemp(join(tmpdir(), "codehub-doctor-resolve-")); - try { - const bogusRoot = join(home, "does-not-exist"); - const checks = buildChecks({ home, repoRoot: bogusRoot }); - const duck = checks.find((c) => c.name === "duckdb native binding"); - assert.ok(duck, "duckdb check must be registered when skipNative is false"); - // Running the full check under node:test against a real dev install - // should succeed — packages are resolvable via the CLI's own - // node_modules even when the repoRoot fallback is broken. A `fail` - // here would mean the CLI-first resolution path regressed. - const duckResult = await duck.run(); - assert.notEqual( - duckResult.status, - "fail", - `duckdb check should not fail when CLI node_modules resolves; got: ${duckResult.message}`, - ); - } finally { - await rm(home, { recursive: true, force: true }); - } -}); - // Wiring — runDoctor should thread `repoRoot` through DoctorOptions so the // --repoRoot CLI flag has a visible effect on check construction. We don't // need to actually execute the checks — just confirm the override is @@ -451,116 +422,114 @@ test("bandit check warns (not fails) when the binary is absent", async () => { }); // --------------------------------------------------------------------------- -// Graph binding (@ladybugdb/core) — mandatory, so absence is a HARD fail. +// node:sqlite built-in — the mandatory single-file store backend. // --------------------------------------------------------------------------- -// The graph tier is always-on with no selector env var, no probe, and no -// fallback — a failed binding throws GraphDbBindingError and aborts every -// graph op. So a missing/unresolvable binding must be `fail`, never `warn`. 
-// The real package is installed in every dev/CI checkout, so we inject a -// resolver that returns null to exercise the absent path. -test("graph-db binding check hard-fails (not warns) when @ladybugdb/core cannot be resolved", async () => { - const home = await mkdtemp(join(tmpdir(), "codehub-doctor-lbug-miss-")); +// `node:sqlite` is a Node builtin (stable on our engines floor, Node >= 24.15), +// so there is no resolve seam and no "absent" branch to inject — the check just +// imports the builtin, confirms `DatabaseSync` is a constructor, opens an +// in-memory db, and runs a WAL CREATE/INSERT/SELECT round-trip. On a real dev +// install the round-trip must succeed → `ok`. +test("node:sqlite check reports ok on a host with the builtin (WAL round-trip)", async () => { + const home = await mkdtemp(join(tmpdir(), "codehub-doctor-sqlite-ok-")); try { - const checks = buildChecks({ - home, - resolveBinding: (_root, pkg) => (pkg === "@ladybugdb/core" ? null : "/fake/duckdb"), - }); - const lbug = checks.find((c) => c.name === "graph-db native binding"); - assert.ok(lbug, "graph-db binding check must be registered when skipNative is false"); - const result = await lbug.run(); + // skipNative is false so the native node:sqlite probe registers. + const checks = buildChecks({ home }); + const sqlite = checks.find((c) => c.name === "node:sqlite built-in"); + assert.ok(sqlite, "node:sqlite check must be registered when skipNative is false"); + const result = await sqlite.run(); assert.equal( result.status, - "fail", - `a missing mandatory graph binding must fail, never warn; got ${result.status}`, + "ok", + `node:sqlite is a builtin on our engines floor; got ${result.status}: ${result.message}`, ); - // The hint must not reference the removed CODEHUB_STORE selector. - assert.doesNotMatch(result.hint ?? 
"", /CODEHUB_STORE/); - assert.doesNotMatch(result.message, /optional/); + assert.match(result.message, /WAL/); } finally { await rm(home, { recursive: true, force: true }); } }); -// No doctor row may mention the phantom CODEHUB_STORE env var or frame the -// graph backend as "optional" — that selector was removed when lbug became -// the mandatory graph tier. -test("doctor surfaces no CODEHUB_STORE selector or optional-backend framing", async () => { +// The node:sqlite probe is a native check — `skipNative` must drop it +// entirely, exactly like the other native-binding rows. +test("node:sqlite check is gated by skipNative (no row, no exit contribution)", async () => { + const home = await mkdtemp(join(tmpdir(), "codehub-doctor-sqlite-skip-")); + try { + const checks = buildChecks({ home, skipNative: true }); + const names = checks.map((c) => c.name); + assert.ok( + !names.includes("node:sqlite built-in"), + "node:sqlite probe is a native check — skipNative must drop it", + ); + } finally { + await rm(home, { recursive: true, force: true }); + } +}); + +// No doctor row may mention the phantom CODEHUB_STORE env var — that selector +// was removed when the single-file SQLite store became the mandatory backend. +test("doctor surfaces no CODEHUB_STORE selector across any row", async () => { const home = await mkdtemp(join(tmpdir(), "codehub-doctor-no-store-var-")); try { await mkdir(join(home, ".codehub"), { recursive: true }); await writeFile(join(home, ".codehub", "registry.json"), JSON.stringify({})); const prev = process.exitCode; - const report = await runDoctor({ - home, - runCommand: okRunCommand, - resolveBinding: (_root, pkg) => (pkg === "@ladybugdb/core" ? null : "/fake/duckdb"), - }); + const report = await runDoctor({ home, runCommand: okRunCommand }); process.exitCode = prev; for (const { result } of report.rows) { assert.doesNotMatch(result.message, /CODEHUB_STORE/); assert.doesNotMatch(result.hint ?? 
"", /CODEHUB_STORE/); } - // The absent graph binding makes the whole probe blocking (exit 2). - assert.equal( - report.exitCode, - 2, - "an absent mandatory graph binding must block the doctor exit", - ); } finally { await rm(home, { recursive: true, force: true }); } }); -// The lbug failure hint must carry the platform-support matrix (the shared -// `@opencodehub/storage` source of truth), not a bare "pnpm install" — on -// win32-arm64 / musl there is NO prebuilt, so a reinstall is futile and the -// hint must say so. Every lbug failure path threads through `lbugFailureHint`. -test("graph-db binding failure hint names the platform-support matrix, not a bare reinstall", async () => { - const home = await mkdtemp(join(tmpdir(), "codehub-doctor-lbug-hint-")); +// Every node:sqlite failure path threads through the same hint, which must +// point the user at the Node version (the only realistic cause — the module is +// a builtin, so there is nothing to install or reinstall). Run the real check +// and assert the OK message shape; the builtin is always present on our floor, +// so we verify the success path carries the "built-in" + WAL framing rather +// than synthesizing an unreachable absent branch. +test("node:sqlite check message names the built-in load + WAL on success", async () => { + const home = await mkdtemp(join(tmpdir(), "codehub-doctor-sqlite-msg-")); try { - const checks = buildChecks({ - home, - resolveBinding: (_root, pkg) => (pkg === "@ladybugdb/core" ? null : "/fake/duckdb"), - }); - const lbug = checks.find((c) => c.name === "graph-db native binding"); - assert.ok(lbug); - const result = await lbug.run(); - assert.equal(result.status, "fail"); - // The shared matrix string from @opencodehub/storage must be present so - // the user sees which platforms ship a prebuilt. - assert.match(result.hint ?? "", /Supported platforms:/); - assert.match(result.hint ?? 
"", /Windows x64/); + const checks = buildChecks({ home }); + const sqlite = checks.find((c) => c.name === "node:sqlite built-in"); + assert.ok(sqlite); + const result = await sqlite.run(); + assert.equal(result.status, "ok"); + assert.match(result.message, /built-in/); + assert.match(result.message, /OK/); } finally { await rm(home, { recursive: true, force: true }); } }); // --------------------------------------------------------------------------- -// Embedder native binding (onnxruntime-node) — OPTIONAL, so absence is a +// Embedder runtime (onnxruntime-web, WASM) — OPTIONAL, so absence is a // NON-FATAL warn that degrades retrieval to BM25, never a hard fail. // --------------------------------------------------------------------------- -// onnxruntime-node ships prebuilds for only ~5 targets (no Intel-mac, no musl). -// The real failure mode is a silent degrade to BM25 — the embedder open path -// catches the native-load error — so doctor must surface a `warn`, not a fail. -// Inject a loader that throws to exercise the absent-binding branch. -test("embedder binding check warns (not fails) when onnxruntime-node fails to load", async () => { +// onnxruntime-web is prebuilt WASM with no platform matrix — if it imports it +// runs everywhere. The failure mode is simply "not installed", a silent degrade +// to BM25 (the embedder open path catches the load error), so doctor surfaces a +// `warn`, not a fail. Inject a loader that throws to exercise the absent branch. 
+test("embedder runtime check warns (not fails) when onnxruntime-web fails to load", async () => { const home = await mkdtemp(join(tmpdir(), "codehub-doctor-onnx-miss-")); try { const checks = buildChecks({ home, loadOnnxBinding: async () => { - throw new Error("Cannot find module 'onnxruntime-node'"); + throw new Error("Cannot find module 'onnxruntime-web'"); }, }); - const emb = checks.find((c) => c.name === "embedder native binding"); - assert.ok(emb, "embedder binding check must be registered when skipNative is false"); + const emb = checks.find((c) => c.name === "embedder runtime (onnxruntime-web, WASM)"); + assert.ok(emb, "embedder runtime check must be registered when skipNative is false"); const result = await emb.run(); assert.equal( result.status, "warn", - `an absent OPTIONAL embedder binding is a soft warn; got ${result.status}: ${result.message}`, + `an absent OPTIONAL embedder runtime is a soft warn; got ${result.status}: ${result.message}`, ); assert.match(result.message, /BM25/); // The hint must point at the remote-embedder escape hatch. @@ -570,15 +539,15 @@ test("embedder binding check warns (not fails) when onnxruntime-node fails to lo } }); -// A successful binding load (exports an InferenceSession constructor) is `ok`. -test("embedder binding check reports ok when onnxruntime-node loads with InferenceSession", async () => { +// A successful runtime load (exports an InferenceSession constructor) is `ok`. 
+test("embedder runtime check reports ok when onnxruntime-web loads with InferenceSession", async () => { const home = await mkdtemp(join(tmpdir(), "codehub-doctor-onnx-ok-")); try { const checks = buildChecks({ home, loadOnnxBinding: async () => ({ InferenceSession: function fake() {} }), }); - const emb = checks.find((c) => c.name === "embedder native binding"); + const emb = checks.find((c) => c.name === "embedder runtime (onnxruntime-web, WASM)"); assert.ok(emb); const result = await emb.run(); assert.equal(result.status, "ok", `expected ok; got ${result.status}: ${result.message}`); @@ -589,14 +558,14 @@ test("embedder binding check reports ok when onnxruntime-node loads with Inferen // A module that loads but exports no InferenceSession is a `warn` (degrade), // never a crash — the embedder is optional. -test("embedder binding check warns when the module loads but exports no InferenceSession", async () => { +test("embedder runtime check warns when the module loads but exports no InferenceSession", async () => { const home = await mkdtemp(join(tmpdir(), "codehub-doctor-onnx-noctor-")); try { const checks = buildChecks({ home, loadOnnxBinding: async () => ({}), }); - const emb = checks.find((c) => c.name === "embedder native binding"); + const emb = checks.find((c) => c.name === "embedder runtime (onnxruntime-web, WASM)"); assert.ok(emb); const result = await emb.run(); assert.equal(result.status, "warn", `expected warn; got ${result.status}: ${result.message}`); @@ -606,9 +575,9 @@ test("embedder binding check warns when the module loads but exports no Inferenc }); // The optional embedder binding must NOT escalate the doctor exit code: with -// a valid registry, a clean scanner runner, and the graph binding present -// (real dev install), a failed embedder load yields at most a warn (exit ≤ 1), -// never a blocking fail. This is the load-bearing "optional capability" guard. 
+// a valid registry and a clean scanner runner, a failed embedder load yields +// at most a warn (exit ≤ 1), never a blocking fail. This is the load-bearing +// "optional capability" guard. test("embedder binding failure does not block the doctor exit (exit <= 1)", async () => { const home = await mkdtemp(join(tmpdir(), "codehub-doctor-onnx-nonblock-")); try { @@ -625,8 +594,8 @@ test("embedder binding failure does not block the doctor exit (exit <= 1)", asyn process.exitCode = prev; const names = report.rows.map((r) => r.name); assert.ok( - !names.includes("embedder native binding"), - "embedder binding probe is a native check — skipNative must drop it", + !names.includes("embedder runtime (onnxruntime-web, WASM)"), + "embedder runtime probe is gated behind skipNative — skipNative must drop it", ); } finally { await rm(home, { recursive: true, force: true }); diff --git a/packages/cli/src/commands/doctor.ts b/packages/cli/src/commands/doctor.ts index 50c90faa..8621459d 100644 --- a/packages/cli/src/commands/doctor.ts +++ b/packages/cli/src/commands/doctor.ts @@ -15,14 +15,12 @@ import { spawn } from "node:child_process"; import { statSync } from "node:fs"; import { access, open as fsOpen, mkdtemp, readFile, rm } from "node:fs/promises"; -import { createRequire } from "node:module"; import { homedir, tmpdir } from "node:os"; import { dirname, join, resolve } from "node:path"; -import { fileURLToPath, pathToFileURL } from "node:url"; +import { fileURLToPath } from "node:url"; import { mergeSarif } from "@opencodehub/sarif"; import { BANDIT_SPEC } from "@opencodehub/scanners"; import { hostedScipBinDirs } from "@opencodehub/scip-ingest"; -import { GRAPH_BINDING_SUPPORTED_PLATFORMS, graphBindingPlatformNote } from "@opencodehub/storage"; import Table from "cli-table3"; export type CheckStatus = "ok" | "warn" | "fail"; @@ -62,19 +60,12 @@ export interface DoctorOptions { */ readonly runCommand?: RunCommandFn; /** - * Injectable npm-package resolver for the 
native-binding checks. Lets a - * test simulate an absent `@ladybugdb/core` (the mandatory graph binding - * is installed in every dev/CI checkout, so the only way to exercise the - * hard-fail path is to stub resolution). Defaults to {@link resolveFromRoot}. - */ - readonly resolveBinding?: (root: string, pkg: string) => string | null; - /** - * Injectable loader for the `onnxruntime-node` binding probe. The real - * loader is a dynamic `import("onnxruntime-node")` — an OPTIONAL dependency - * that ships prebuilds for only a handful of targets, so the binding may be - * absent on this platform. Tests inject a double to exercise both the - * load-OK and load-failure branches without depending on the host's prebuild - * coverage. Defaults to {@link loadOnnxBinding}. + * Injectable loader for the `onnxruntime-web` runtime probe. The real loader + * is a dynamic `import("onnxruntime-web")` — an OPTIONAL dependency (prebuilt + * WASM, no native binding, no platform matrix), so it is either installed or + * not. Tests inject a double to exercise both the load-OK and load-failure + * branches without depending on whether the host has it. Defaults to + * {@link loadOnnxBinding}. */ readonly loadOnnxBinding?: () => Promise; } @@ -137,12 +128,10 @@ export function buildChecks(opts: DoctorOptions = {}): readonly Check[] { const run = opts.runCommand ?? runCommand; const list: Check[] = [nodeVersionCheck(), pnpmInstalledCheck(run)]; if (opts.skipNative !== true) { - list.push(duckdbWorksCheck(repoRoot)); - list.push( - opts.resolveBinding !== undefined - ? lbugWorksCheck(repoRoot, opts.resolveBinding) - : lbugWorksCheck(repoRoot), - ); + // node:sqlite is the mandatory single-file store. It is a Node builtin, so + // it has no resolve seam (it can't be "absent" the way a node_modules + // package can) — the check just imports it and exercises a WAL round-trip. 
+ list.push(nodeSqliteCheck()); } // Vendored parse grammars: a shipped artifact, so absence/corruption is // always a hard fail. One row covering all 16 blobs + the manifest pin. @@ -237,104 +226,70 @@ function pnpmInstalledCheck(run: RunCommandFn): Check { }; } -function duckdbWorksCheck(repoRoot: string): Check { - return { - name: "duckdb native binding", - async run() { - try { - const duckPath = resolveFromRoot(repoRoot, "@duckdb/node-api"); - if (!duckPath) { - return { - status: "warn", - message: "@duckdb/node-api not installed", - hint: "run `pnpm install` at the repo root", - }; - } - // The @duckdb/node-api 1.x surface exposes Sync teardown helpers - // (`disconnectSync`, `closeSync`). The async `.close()` accessors - // were dropped in 1.0.0; depending on them produced a false FAIL. - // `resolveFromRoot` returns an absolute fs path; ESM dynamic import - // requires a `file://` URL on Windows (a bare `D:\…` path throws - // "Only URLs with a scheme in: file, data, node are supported"). - const mod = (await import(pathToFileURL(duckPath).href)) as { - DuckDBInstance: { - create: (path: string) => Promise<{ - connect: () => Promise<{ - disconnectSync?: () => void; - close?: () => void | Promise; - }>; - closeSync?: () => void; - close?: () => void | Promise; - }>; - }; - }; - // In-memory instance: never touches disk, never lingers. - const inst = await mod.DuckDBInstance.create(":memory:"); - const conn = await inst.connect(); - if (typeof conn.disconnectSync === "function") conn.disconnectSync(); - else if (typeof conn.close === "function") await conn.close(); - if (typeof inst.closeSync === "function") inst.closeSync(); - else if (typeof inst.close === "function") await inst.close(); - return { status: "ok", message: "duckdb open/close OK" }; - } catch (err) { - return { - status: "fail", - message: `duckdb failed to open: ${err instanceof Error ? 
err.message : String(err)}`, - hint: "check platform support — pnpm only prebuilds linux-x64/arm64, darwin-arm64/x64, win32-x64", - }; - } - }, - }; -} - /** - * Mirror of {@link duckdbWorksCheck} for the `@ladybugdb/core` graph - * binding. The graph tier is mandatory and always-on: there is no - * selector env var, no probe, and no fallback — a failed binding throws - * `GraphDbBindingError` and aborts every graph operation. So a - * missing/broken binding is a hard `fail` here, exactly like a missing + * The single-file SQLite store ({@link SqliteStore}) is the mandatory storage + * backend: there is no selector env var, no probe, and no fallback — every + * graph/temporal operation opens one `node:sqlite` database in WAL mode. So a + * non-importable `node:sqlite` is a hard `fail` here, exactly like a missing * shipped artifact (see {@link vendoredWasmsCheck}) — never a soft `warn`. * - * `resolve` is injectable so tests can simulate an absent binding without - * uninstalling the real dependency; production passes {@link resolveFromRoot}. + * There is nothing to resolve from `node_modules`: `node:sqlite` is a Node + * builtin (stable on Node >= 24.15, our engines floor), so the probe is a + * plain `import("node:sqlite")` with no `resolve` injection. + * We still exercise the real load-and-use cycle — confirm `DatabaseSync` is a + * function, open an in-memory db, request WAL, and run a CREATE/INSERT/SELECT + * round-trip — so a builtin that loaded but is unusable still fails loudly. 
*/ -function lbugWorksCheck( - repoRoot: string, - resolve: (root: string, pkg: string) => string | null = resolveFromRoot, -): Check { +function nodeSqliteCheck(): Check { return { - name: "graph-db native binding", + name: "node:sqlite built-in", async run() { try { - const lbugPath = resolve(repoRoot, "@ladybugdb/core"); - if (!lbugPath) { - return { - status: "fail", - message: "@ladybugdb/core not installed (required graph backend)", - hint: lbugFailureHint(), + // `node:sqlite` is a builtin; a bare static-string specifier resolves + // it without touching `node_modules`. Older Node (< 24.15, below our + // engines floor) lacks the module entirely → the import throws. + const mod = (await import("node:sqlite")) as { + DatabaseSync?: new ( + path: string, + ) => { + exec(sql: string): void; + prepare(sql: string): { get(): unknown; run(): unknown }; + close(): void; }; - } - // The graph binding uses `@ladybugdb/core`'s `Database` entry. We - // exercise the load-and-close cycle the same way the duckdb check - // does — anything heavier would couple this probe to the adapter's - // evolving smoke-test surface. `lbugPath` is an absolute fs path; - // ESM import needs a `file://` URL on Windows (see duckdb check). - const mod = (await import(pathToFileURL(lbugPath).href)) as Record; - const ctorRaw = - mod["Database"] ?? (mod["default"] as Record | undefined)?.["Database"]; - if (typeof ctorRaw !== "function") { + }; + const DatabaseSync = mod.DatabaseSync; + if (typeof DatabaseSync !== "function") { return { status: "fail", - message: "@ladybugdb/core is installed but exports no Database constructor", - hint: lbugFailureHint(), + message: "node:sqlite imported but exports no DatabaseSync constructor", + hint: nodeSqliteFailureHint(), }; } - return { status: "ok", message: "@ladybugdb/core load OK" }; + // In-memory database: never touches disk, never lingers. 
We request + // WAL (the mode the real SqliteStore opens with) and run a trivial + // round-trip to prove the binding is usable, not merely importable. + const db = new DatabaseSync(":memory:"); + try { + db.exec("PRAGMA journal_mode=WAL"); + db.exec("CREATE TABLE doctor_probe (n INTEGER)"); + db.prepare("INSERT INTO doctor_probe (n) VALUES (1)").run(); + const row = db.prepare("SELECT n FROM doctor_probe").get() as { n?: number } | undefined; + if (row?.n !== 1) { + return { + status: "fail", + message: "node:sqlite round-trip returned an unexpected value", + hint: nodeSqliteFailureHint(), + }; + } + } finally { + db.close(); + } + return { status: "ok", message: "node:sqlite (built-in) load + WAL OK" }; } catch (err) { return { status: "fail", - message: `@ladybugdb/core failed to load: ${err instanceof Error ? err.message : String(err)}`, - hint: lbugFailureHint(), + message: `node:sqlite failed to load: ${err instanceof Error ? err.message : String(err)}`, + hint: nodeSqliteFailureHint(), }; } }, @@ -342,22 +297,13 @@ function lbugWorksCheck( } /** - * Hint for every `@ladybugdb/core` failure path. On a SUPPORTED platform a - * reinstall can plausibly fix it (a pruned `--production` install, a partial - * download), so we lead with that. On an UNSUPPORTED platform — win32-arm64 - * or musl/Alpine, where there is no prebuilt at all — `graphBindingPlatformNote` - * names the gap so the user does not chase a futile reinstall. We reuse the - * adapter's shared message (single source of truth) so doctor and the runtime - * `GraphDbBindingError` never drift. + * Hint for every `node:sqlite` failure path. The module is a Node builtin, so + * there is nothing to install or reinstall — the only realistic cause is a Node + * older than our engines floor, where the builtin either does not exist or is + * behind an unsupported experimental gate. Point the user at the Node version. 
*/ -function lbugFailureHint(): string { - const platformNote = graphBindingPlatformNote(); - if (platformNote !== "") { - // Unsupported platform: a reinstall cannot produce a binding that does - // not ship. Name the gap + the realistic remedy. - return `${GRAPH_BINDING_SUPPORTED_PLATFORMS}${platformNote}`; - } - return `reinstall the graph backend binding (\`pnpm install\`, or \`npm i -g @opencodehub/cli\`). ${GRAPH_BINDING_SUPPORTED_PLATFORMS}`; +function nodeSqliteFailureHint(): string { + return "node:sqlite is a built-in on Node >= 24.15 (our engines floor); upgrade Node with `mise use node@24` or `nvm install 24`"; } /** @@ -656,61 +602,38 @@ function embedderWeightsCheck(home: string): Check { } /** - * Default loader for the `onnxruntime-node` binding. The CLI lazy-imports the - * runtime only when embeddings are enabled (see - * `embedder/src/onnx-embedder.ts`), so this probe mirrors that exact dynamic - * import. `onnxruntime-node` is an OPTIONAL dependency — production resolves it - * from the CLI's own `node_modules`. + * Default loader for the `onnxruntime-web` runtime. The CLI lazy-imports it + * only when embeddings are enabled (see `embedder/src/onnx-embedder.ts`), so + * this probe mirrors that exact dynamic import. `onnxruntime-web` is an + * OPTIONAL dependency — production resolves it from the CLI's own + * `node_modules`. Unlike the old `onnxruntime-node`, it is prebuilt WebAssembly + * with NO native binding and NO platform matrix: if it imports, it runs + * everywhere Node ≥24 does. */ function loadOnnxBinding(): Promise { // A template-string specifier keeps tsup/esbuild from statically resolving - // (and force-bundling) the optional native module at build time — it must - // resolve from `node_modules` at runtime, exactly like the embedder's own - // lazy `import("onnxruntime-node")`. 
- const specifier = "onnxruntime-node"; + // (and force-bundling) the optional module at build time — it must resolve + // from `node_modules` at runtime, exactly like the embedder's own lazy + // `import("onnxruntime-web")`. + const specifier = "onnxruntime-web"; return import(specifier); } /** - * Platform-specific guidance for a missing `onnxruntime-node` prebuilt. - * onnxruntime-node 1.x ships prebuilt binaries for darwin-arm64, linux-x64, - * linux-arm64 (glibc), win32-x64, and win32-arm64 — but NOT darwin-x64 - * (Intel Mac) and NOT musl/Alpine Linux. On those targets the optional binding - * cannot load and retrieval silently degrades to BM25-only. Naming the gap - * here stops the user chasing a futile reinstall. - * - * Returns an empty string when the platform is one onnxruntime ships a prebuilt - * for (no extra note to add). - */ -function onnxBindingPlatformNote( - platform: NodeJS.Platform = process.platform, - arch: string = process.arch, -): string { - if (platform === "darwin" && arch === "x64") { - return " Intel macOS (darwin-x64) has no onnxruntime-node prebuilt; use an Apple-silicon mac or a remote embedder (CODEHUB_EMBEDDING_URL / CODEHUB_EMBEDDING_SAGEMAKER_ENDPOINT)."; - } - if (platform === "linux") { - return " Alpine / musl-libc Linux has no onnxruntime-node prebuilt; use a glibc-based image (node:* not node:*-alpine) or a remote embedder."; - } - return ""; -} - -/** - * Probe the OPTIONAL `onnxruntime-node` binding the same way the embedder - * does — a lazy dynamic import. Unlike the duckdb/lbug probes, this is - * deliberately NON-FATAL: the embedder is an optional capability, and the real - * failure mode is a SILENT degrade to BM25-only retrieval (the embedder open - * path catches the native-load error and falls back). So an absent/broken - * binding is a `warn`, never a `fail`. The warn message names the platform gap - * (Intel mac / musl) so the user is not left wondering why search quality - * dropped. 
+ * Probe the OPTIONAL `onnxruntime-web` runtime the same way the embedder does — + * a lazy dynamic import. Deliberately NON-FATAL: the embedder is an optional + * capability and the real failure mode is a SILENT degrade to BM25-only + * retrieval (the embedder open path catches the load error and falls back). So + * an absent runtime is a `warn`, never a `fail`. Because onnxruntime-web is + * prebuilt WASM with no platform matrix, there is no platform-specific gap to + * name — if the import fails the package simply isn't installed. * * `load` is injectable so tests can drive both branches without depending on - * whatever prebuild coverage the host happens to have. + * whether the optional package is present on the host. */ function embedderBindingCheck(load: () => Promise = loadOnnxBinding): Check { return { - name: "embedder native binding", + name: "embedder runtime (onnxruntime-web, WASM)", async run() { try { const mod = (await load()) as Record | undefined; @@ -721,17 +644,17 @@ function embedderBindingCheck(load: () => Promise = loadOnnxBinding): C return { status: "warn", message: - "onnxruntime-node loaded but exports no InferenceSession — retrieval will use BM25 only", - hint: `the local embedder is unavailable; configure a remote embedder (CODEHUB_EMBEDDING_URL / CODEHUB_EMBEDDING_SAGEMAKER_ENDPOINT) or reinstall onnxruntime-node.${onnxBindingPlatformNote()}`, + "onnxruntime-web loaded but exports no InferenceSession — retrieval will use BM25 only", + hint: "the local embedder is unavailable; configure a remote embedder (CODEHUB_EMBEDDING_URL / CODEHUB_EMBEDDING_SAGEMAKER_ENDPOINT) or reinstall onnxruntime-web.", }; } - return { status: "ok", message: "onnxruntime-node load OK" }; + return { status: "ok", message: "onnxruntime-web (WASM) load OK" }; } catch (err) { const detail = err instanceof Error ? 
err.message : String(err); return { status: "warn", - message: `embedder unavailable on this platform → retrieval will use BM25 only (${detail})`, - hint: `the local ONNX embedder is optional; configure a remote embedder (CODEHUB_EMBEDDING_URL / CODEHUB_EMBEDDING_SAGEMAKER_ENDPOINT) to embed off-box.${onnxBindingPlatformNote()}`, + message: `embedder runtime not installed → retrieval will use BM25 only (${detail})`, + hint: "the local WASM embedder is optional; configure a remote embedder (CODEHUB_EMBEDDING_URL / CODEHUB_EMBEDDING_SAGEMAKER_ENDPOINT) to embed off-box.", }; } }, @@ -933,40 +856,3 @@ function existsSyncSafe(path: string): boolean { return false; } } - -function resolveFromRoot(repoRoot: string, pkg: string): string | null { - // 1. CLI's own resolution context — the canonical answer. - try { - const req = createRequire(import.meta.url); - return req.resolve(pkg); - } catch { - // fall through to repoRoot - } - // 2. Workspace/monorepo root fallback. - try { - const req = createRequire(join(repoRoot, "package.json")); - return req.resolve(pkg); - } catch { - // fall through to per-package fallbacks - } - // 3. Per-workspace fallback. Under pnpm strict isolation, native bindings - // are direct deps of the package that uses them — `@duckdb/node-api` - // and `@ladybugdb/core` both live in `packages/storage`. Probing that - // package.json context lets `codehub doctor` resolve the bindings - // even when neither the CLI nor the workspace root declare them as - // direct deps. (The pure-JS `@sourcegraph/scip-*` indexers are NOT - // resolved here — `scipIndexerCheck` checks them via - // `hostedScipBinDirs()`, the same resolver the analyze-time spawn PATH - // uses, which is the layout-correct authority for "will analyze find it".) - const owners = - pkg.startsWith("@duckdb/") || pkg.startsWith("@ladybugdb/") ? 
["packages/storage"] : []; - for (const owner of owners) { - try { - const req = createRequire(join(repoRoot, owner, "package.json")); - return req.resolve(pkg); - } catch { - // try next - } - } - return null; -} diff --git a/packages/cli/src/index.ts b/packages/cli/src/index.ts index fafc2e3c..2ecb95b0 100644 --- a/packages/cli/src/index.ts +++ b/packages/cli/src/index.ts @@ -12,8 +12,14 @@ import { readFileSync } from "node:fs"; import { cpus } from "node:os"; import { dirname, join } from "node:path"; import { fileURLToPath } from "node:url"; +// Silence the one-shot node:sqlite ExperimentalWarning before any subcommand +// lazily loads the storage layer. This module is dependency-free (no native +// binding), so importing it eagerly does not regress `--help` startup cost. +import { installSqliteRuntimeGuard } from "@opencodehub/storage/sqlite-runtime"; import { Command } from "commander"; +installSqliteRuntimeGuard(); + // Read the CLI's own version from its package.json. The bin entry is always // emitted at /dist/index.js in every layout (the tsup collapse keeps // `index` at the dist root), so package.json is exactly one level up. 
This @@ -42,7 +48,7 @@ program .command("analyze [path]") .description("Index a repository at [path] (default: current directory)") .option("--force", "Ignore registry cache and re-run the pipeline") - .option("--embeddings", "Embed symbols and populate the DuckDB embeddings table") + .option("--embeddings", "Embed symbols and populate the embeddings table in store.sqlite") .option("--embeddings-int8", "Use the int8 embedder variant (~23 MB) instead of fp32") .option( "--granularity ", @@ -341,8 +347,8 @@ program program .command("code-pack [path]") .description( - "Produce the deterministic 9-item code-pack BOM (manifest + skeleton + file-tree + deps + " + - "ast-chunks + xrefs + findings + licenses + readme + optional embeddings.parquet) at " + + "Produce the deterministic 8-item code-pack BOM (manifest + skeleton + file-tree + deps + " + + "ast-chunks + xrefs + findings + licenses) plus a readme at " + "/.codehub/packs//. Default engine is the new @opencodehub/pack BOM; " + "--engine repomix opts into the legacy single-file snapshot (drop deferred to M7).", ) @@ -361,7 +367,7 @@ program ) .option( "--engine ", - "Engine: pack (default — 9-item BOM via @opencodehub/pack) or repomix (legacy single-file)", + "Engine: pack (default — 8-item BOM via @opencodehub/pack) or repomix (legacy single-file)", "pack", ) .action(async (path: string | undefined, opts: Record) => { @@ -694,7 +700,10 @@ program .description( "Probe the local environment (node/pnpm/native bindings/vendored grammars/scip indexers/scanners/registry) and print actionable hints", ) - .option("--skip-native", "Skip checks that require native bindings (duckdb / lbug)") + .option( + "--skip-native", + "Skip checks that require native bindings (no longer any; retained for compat)", + ) .option( "--strict", "Treat a missing SCIP indexer as a failure (exit 2), not a warning — for release/CI gates", diff --git a/packages/cli/tsup.config.ts b/packages/cli/tsup.config.ts index 61d699bb..4331b9e9 100644 --- 
a/packages/cli/tsup.config.ts +++ b/packages/cli/tsup.config.ts @@ -92,6 +92,13 @@ export default defineConfig({ // Force-bundle every internal workspace package into this one tarball. noExternal: [/^@opencodehub\//], external: EXTERNAL, + // tsup defaults `removeNodeProtocol: true`, which strips the `node:` prefix + // from every builtin specifier. Its strip list predates `node:sqlite`, so it + // rewrites the externalized `node:sqlite` import to a bare `sqlite` — which + // Node cannot resolve (the builtin exists ONLY as `node:sqlite`; there is no + // `sqlite` package). Keep the prefix: our Node >=24.15 floor wants the + // `node:`-qualified form anyway, and it is correct for every other builtin too. + removeNodeProtocol: false, async onSuccess() { // Grammar WASMs (16 blobs, ~25 MB) — resolved by walk-up to `vendor/wasms`. await copyTree( diff --git a/packages/core-types/src/graph.ts b/packages/core-types/src/graph.ts index afe7374f..1f198c09 100644 --- a/packages/core-types/src/graph.ts +++ b/packages/core-types/src/graph.ts @@ -58,7 +58,25 @@ export class KnowledgeGraph { const key = edgeDedupKey(edge); const existing = this.edgeByKey.get(key); const id = makeEdgeId(edge.from, edge.type, edge.to, edge.step); - const candidate: CodeRelation = { ...edge, id }; + // Canonicalize the step sentinel AT THE GRAPH BOUNDARY. `step: 0` and an + // absent step are already the same edge by identity — both `edgeDedupKey` + // and `makeEdgeId` collapse them via `step ?? 0`. But a candidate object + // that literally carries `step: 0` serializes as `"step":0` in graphHash, + // while every storage adapter's `listEdges` drops it via `stepZeroSentinel` + // — so a rebuilt graph would hash differently from the original. Strip a + // zero/non-finite step here so the in-memory canonical edge matches what + // round-trips through any IGraphStore. 
(Closes the latent parity gap the + // SQLite migration surfaced: the old graphdb-roundtrip test only passed + // because its test-local rebuild re-attached step:0.) + const { step: rawStep, ...edgeRest } = edge; + const normalizedStep = + typeof rawStep === "number" && Number.isFinite(rawStep) && rawStep !== 0 + ? rawStep + : undefined; + const candidate: CodeRelation = + normalizedStep === undefined + ? { ...edgeRest, id } + : { ...edgeRest, step: normalizedStep, id }; if (!existing) { this.edgeByKey.set(key, candidate); return; diff --git a/packages/embedder/package.json b/packages/embedder/package.json index 24e2a397..97dbc77d 100644 --- a/packages/embedder/package.json +++ b/packages/embedder/package.json @@ -43,7 +43,7 @@ "@opencodehub/core-types": "workspace:*" }, "optionalDependencies": { - "onnxruntime-node": "1.26.0" + "onnxruntime-web": "1.27.0" }, "devDependencies": { "@types/node": "25.9.3", diff --git a/packages/embedder/src/onnx-embedder.ts b/packages/embedder/src/onnx-embedder.ts index 6993fa51..30248786 100644 --- a/packages/embedder/src/onnx-embedder.ts +++ b/packages/embedder/src/onnx-embedder.ts @@ -17,12 +17,14 @@ import { access, readFile } from "node:fs/promises"; import { join } from "node:path"; import { Tokenizer } from "@huggingface/tokenizers"; -// `onnxruntime-node` is an `optionalDependency`: it ships a ~254 MB native -// binary that a BM25-only install can prune. Import only its TYPES at the top -// level (erased at compile time — no runtime resolution), and load the actual -// module via a dynamic `import()` inside `openOnnxEmbedder`. That keeps the -// native binding off the import graph until embeddings are actually opened. -import type { InferenceSession, Tensor } from "onnxruntime-node"; +// `onnxruntime-web` ships prebuilt WebAssembly (no native binding, no node-gyp, +// no install step) and runs the ONNX runtime in pure WASM under Node. 
Import +// only its TYPES at the top level (erased at compile time), and load the actual +// module via a dynamic `import()` inside `openOnnxEmbedder` so a BM25-only +// install never pays the WASM-load cost. The Node path is single-threaded WASM, +// which is exactly the determinism-friendly configuration the graphHash gate +// needs — verified byte-identical across repeat + fresh-session runs. +import type { InferenceSession, Tensor } from "onnxruntime-web"; import { embedderModelId } from "./model-pins.js"; import { modelFileName, resolveModelDir, TOKENIZER_FILES } from "./paths.js"; @@ -91,24 +93,17 @@ async function loadTokenizer(tokenizerDir: string): Promise { function buildSessionOptions(): InferenceSession.SessionOptions { return { // Graph opts above `disabled` can reorder kernel fusion → float32 sum - // ordering changes → embeddings drift in the last ~2 decimals. + // ordering changes → embeddings drift in the last ~2 decimals. The WASM + // EP honours this option (verified byte-identical with it set). graphOptimizationLevel: "disabled", executionMode: "sequential", intraOpNumThreads: 1, interOpNumThreads: 1, - // Arena allocs can vary in layout and introduce timing jitter that - // doesn't affect correctness but slows down determinism verification. - enableCpuMemArena: false, - // The string-keyed config entry mirrors SetDeterministicCompute() in - // the ORT C++ API; honours CPU EP kernels with a det-vs-perf branch. - extra: { - session: { - set_deterministic_compute: "1", - }, - }, - // Force CPU EP only — even if NAPI probes find CoreML, we don't want - // its nondeterministic MPS kernels participating. - executionProviders: ["cpu"], + // The WASM execution provider — the only EP onnxruntime-web exposes under + // Node, and single-threaded by construction there (see env.wasm.numThreads + // in openOnnxEmbedder). That single-threaded WASM kernel path is what makes + // the embedding output deterministic across runs, sessions, and machines. 
+ executionProviders: ["wasm"], }; } @@ -351,27 +346,42 @@ export async function openOnnxEmbedder(cfg: EmbedderConfig = {}): Promise/.codehub/packs//`. The BOM is what * downstream agents should consume — it carries skeleton + file-tree - * + deps + ast-chunks + xrefs + findings + licenses + readme + - * optional embeddings.parquet, all bound by a manifest with a - * content-addressed `pack_hash`. + * + deps + ast-chunks + xrefs + findings + licenses + readme, all + * bound by a manifest with a content-addressed `pack_hash`. * - `repomix` — the legacy single-file XML/Markdown snapshot under * `/.codehub/pack/repo.`. Retained as an opt-in for one * milestone (drop deferred to M7 per spec 005 Q-DELTA-6). Operators @@ -57,7 +56,7 @@ const PackInput = z.object({ .optional() .default("pack") .describe( - "Engine: `pack` (default) writes the 9-item BOM via @opencodehub/pack. " + + "Engine: `pack` (default) writes the 8-item BOM via @opencodehub/pack. " + "`repomix` is the legacy single-file snapshot, retained as an opt-in.", ), budget: z @@ -367,8 +366,8 @@ export function registerPackCodebaseTool(server: McpServer, ctx: ToolContext): v title: "Pack a repo into an LLM-ready snapshot", description: "Produce a snapshot of a registered repo. The default `pack` engine writes the deterministic " + - "9-item BOM (manifest + skeleton + file-tree + deps + ast-chunks + xrefs + findings + " + - "licenses + readme + optional embeddings.parquet) under /.codehub/packs//. " + + "8-item BOM (manifest + skeleton + file-tree + deps + ast-chunks + xrefs + findings + " + + "licenses) plus a readme under /.codehub/packs//. " + "The legacy `repomix` engine is retained as an opt-in single-file snapshot (drop deferred to M7). 
" + "For relational/structural questions about the repo, prefer query/context/impact — those are " + "backed by the SCIP graph and give graph-aware answers without consuming context window.", diff --git a/packages/pack/src/embeddings-sidecar.test.ts b/packages/pack/src/embeddings-sidecar.test.ts deleted file mode 100644 index d0fc6eff..00000000 --- a/packages/pack/src/embeddings-sidecar.test.ts +++ /dev/null @@ -1,247 +0,0 @@ -/** - * Tests for `writeEmbeddingsSidecar`. - * - * The sidecar streams embeddings out of the graph store (lbug in production) - * into a per-call temp table on `temporal.duckdb`, then runs DuckDB's - * deterministic `COPY (... ORDER BY ...) TO '...' (FORMAT PARQUET, COMPRESSION - * ZSTD)` to produce the byte-identical Parquet sidecar. - * - * Coverage tiers: - * 1. Mock-only dispatch: empty input → no file; non-empty input → file with - * hash + size + duckdbVersion stamped. - * 2. Real-backend byte-identity: opens a real DuckDbStore as the temporal - * view, drives a synthetic graph stream through it, runs the sidecar - * twice, asserts file SHA equality. Skipped when native DuckDB binding - * can't load. - */ - -import assert from "node:assert/strict"; -import { existsSync } from "node:fs"; -import { mkdtemp, readFile, rm, writeFile } from "node:fs/promises"; -import { tmpdir } from "node:os"; -import path from "node:path"; -import { describe, it, test } from "node:test"; -import type { EmbeddingRow, Store } from "@opencodehub/storage"; - -import { writeEmbeddingsSidecar } from "./embeddings-sidecar.js"; - -async function tempDir(): Promise { - return mkdtemp(path.join(tmpdir(), "sidecar-")); -} - -interface MockOpts { - readonly rows: readonly EmbeddingRow[]; - /** Override the COPY step. When omitted, writes a deterministic placeholder. 
*/ - readonly export?: ( - rows: AsyncIterable, - absPath: string, - ) => Promise<{ readonly rowCount: number; readonly duckdbVersion: string }>; -} - -function makeMockStore(opts: MockOpts): Store { - const graph = { - listEmbeddings: async function* () { - for (const row of opts.rows) yield row; - }, - } as unknown as Store["graph"]; - - const exporter = - opts.export ?? - (async (rows: AsyncIterable, absPath: string) => { - let n = 0; - const buf: string[] = []; - for await (const r of rows) { - n += 1; - buf.push( - `${r.nodeId}\t${r.granularity ?? "symbol"}\t${r.chunkIndex}\t${[...r.vector].join(",")}`, - ); - } - if (n > 0) await writeFile(absPath, buf.join("\n")); - return { rowCount: n, duckdbVersion: "mock-1.0.0" }; - }); - - const temporal = { - exportEmbeddingsToParquet: exporter, - } as unknown as Store["temporal"]; - - return { - graph, - temporal, - graphFile: ":memory:", - temporalFile: ":memory:", - close: async () => {}, - }; -} - -describe("writeEmbeddingsSidecar — mock dispatch", () => { - it("returns written=false, writerBackend=absent for empty embeddings", async () => { - const dir = await tempDir(); - try { - const store = makeMockStore({ rows: [] }); - const outPath = path.join(dir, "embeddings.parquet"); - const result = await writeEmbeddingsSidecar({ store, outPath }); - assert.equal(result.written, false); - assert.equal(result.writerBackend, "absent"); - assert.equal(result.determinismClass, "strict"); - assert.equal(result.rowCount, 0); - assert.equal(result.bytesWritten, 0); - assert.equal(result.fileHash, undefined); - assert.equal(result.pinsHint.duckdbVersion, undefined); - assert.equal(existsSync(outPath), false); - } finally { - await rm(dir, { recursive: true, force: true }); - } - }); - - it("returns written=true with hash + size + duckdbVersion when rows are present", async () => { - const dir = await tempDir(); - try { - const rows: EmbeddingRow[] = [ - { - nodeId: "Function:a.ts:fn", - granularity: "symbol", - chunkIndex: 0, 
- vector: new Float32Array([0.1, 0.2, 0.3]), - contentHash: "h-0", - }, - ]; - const store = makeMockStore({ rows }); - const outPath = path.join(dir, "embeddings.parquet"); - const result = await writeEmbeddingsSidecar({ store, outPath }); - assert.equal(result.written, true); - assert.equal(result.writerBackend, "duck-copy"); - assert.equal(result.determinismClass, "strict"); - assert.equal(result.rowCount, 1); - assert.ok(result.bytesWritten > 0); - assert.equal(result.pinsHint.duckdbVersion, "mock-1.0.0"); - assert.ok(result.fileHash && result.fileHash.length === 64); - } finally { - await rm(dir, { recursive: true, force: true }); - } - }); - - it("filters by granularity when supplied", async () => { - const dir = await tempDir(); - try { - const rows: EmbeddingRow[] = [ - { - nodeId: "n1", - granularity: "symbol", - chunkIndex: 0, - vector: new Float32Array([1]), - contentHash: "h", - }, - { - nodeId: "n2", - granularity: "file", - chunkIndex: 0, - vector: new Float32Array([2]), - contentHash: "h", - }, - ]; - const store = makeMockStore({ rows }); - const outPath = path.join(dir, "embeddings.parquet"); - const result = await writeEmbeddingsSidecar({ - store, - outPath, - granularity: "file", - }); - assert.equal(result.rowCount, 1, "granularity filter must drop non-matches"); - } finally { - await rm(dir, { recursive: true, force: true }); - } - }); -}); - -// --------------------------------------------------------------------------- -// Real-backend byte-identity test — opens a real DuckDbStore for temporal, -// drives a synthetic graph stream, asserts SHA equality across two runs. 
-// --------------------------------------------------------------------------- - -test("byte-identity: two runs against same input produce identical Parquet", async () => { - let DuckDbStore: typeof import("@opencodehub/storage").DuckDbStore; - try { - ({ DuckDbStore } = await import("@opencodehub/storage")); - } catch (err) { - assert.ok(true, `skipping: workspace import failed (${(err as Error).message})`); - return; - } - - const dir = await tempDir(); - const dbPath = path.join(dir, "temporal.duckdb"); - const outA = path.join(dir, "a.parquet"); - const outB = path.join(dir, "b.parquet"); - - let temporal: import("@opencodehub/storage").DuckDbStore; - try { - temporal = new DuckDbStore(dbPath); - await temporal.open(); - await temporal.createSchema(); - } catch (err) { - await rm(dir, { recursive: true, force: true }); - assert.ok( - true, - `skipping byte-identity test: DuckDB binding unavailable (${(err as Error).message})`, - ); - return; - } - - try { - const rows: EmbeddingRow[] = Array.from({ length: 100 }, (_, i) => ({ - nodeId: `Function:src/f${i}.ts:f${i}`, - granularity: "symbol" as const, - chunkIndex: 0, - vector: deterministicVector(i, 64), - contentHash: `h-${i.toString().padStart(3, "0")}`, - })); - - const graph = { - listEmbeddings: async function* () { - for (const row of rows) yield row; - }, - } as unknown as Store["graph"]; - - const composed: Store = { - graph, - temporal, - graphFile: ":memory:", - temporalFile: dbPath, - close: async () => { - /* test owns lifecycle */ - }, - }; - - const r1 = await writeEmbeddingsSidecar({ store: composed, outPath: outA }); - const r2 = await writeEmbeddingsSidecar({ store: composed, outPath: outB }); - - assert.equal(r1.written, true); - assert.equal(r2.written, true); - assert.equal(r1.rowCount, 100); - assert.equal(r2.rowCount, 100); - assert.equal(r1.writerBackend, "duck-copy"); - assert.ok(r1.pinsHint.duckdbVersion && r1.pinsHint.duckdbVersion.length > 0); - assert.equal(r1.pinsHint.duckdbVersion, 
r2.pinsHint.duckdbVersion); - - const a = await readFile(outA); - const b = await readFile(outB); - assert.equal( - Buffer.compare(a, b), - 0, - `byte-identity broken: ${a.byteLength}B vs ${b.byteLength}B`, - ); - assert.equal(r1.fileHash, r2.fileHash); - } finally { - await temporal.close(); - await rm(dir, { recursive: true, force: true }); - } -}); - -function deterministicVector(rowIndex: number, dim: number): Float32Array { - const out = new Float32Array(dim); - let s = (rowIndex * 2654435761) >>> 0; - for (let i = 0; i < dim; i += 1) { - s = (s * 1664525 + 1013904223) >>> 0; - out[i] = (s / 0xffffffff) * 2 - 1; - } - return out; -} diff --git a/packages/pack/src/embeddings-sidecar.ts b/packages/pack/src/embeddings-sidecar.ts deleted file mode 100644 index 5c99927d..00000000 --- a/packages/pack/src/embeddings-sidecar.ts +++ /dev/null @@ -1,118 +0,0 @@ -/** - * BOM body item #7: Parquet embeddings sidecar. - * - * Embeddings live in `graph.lbug` (the lbug graph backend). The sidecar - * stages rows through `temporal.duckdb` so we can lean on DuckDB's - * deterministic Parquet writer (`COPY (... ORDER BY ...) TO '...' (FORMAT - * PARQUET, COMPRESSION ZSTD)`). DuckDB v1.3+ rewrote its parquet writer - * to drop implicit timestamps so two consecutive runs produce - * byte-identical files. - * - * Determinism contract — non-negotiable, mirrored by the byte-identity - * test in `embeddings-sidecar.test.ts`: - * - * 1. Row order = `node_id ASC, granularity ASC, chunk_index ASC`. lbug's - * `listEmbeddings()` already iterates in that order; the COPY query - * re-asserts it on the temp table for safety. - * 2. ZSTD compression at the DuckDB default level. Two runs against the - * same store contents produce byte-identical Parquet files. - * 3. The pack manifest pins `duckdbVersion` from the runtime - * `SELECT version()` result so the writer version is bound to the - * sidecar. 
- */ - -import { createHash } from "node:crypto"; -import { readFile } from "node:fs/promises"; -import type { EmbeddingRow, Store } from "@opencodehub/storage"; - -/** - * Inputs to {@link writeEmbeddingsSidecar}. Takes a composed - * {@link Store} so the sidecar can stream from `store.graph` and route - * the COPY through `store.temporal`. - */ -export interface SidecarOptions { - /** Composed graph + temporal store. */ - readonly store: Store; - /** - * Absolute path to the destination Parquet file. The DuckDB-backed - * writer validates the path before interpolating into the COPY - * statement (DuckDB does not bind COPY destinations). - */ - readonly outPath: string; - /** - * Optional embedding-tier filter. When omitted, every row in the - * embeddings table is emitted in its native ordering. - */ - readonly granularity?: "symbol" | "file" | "community"; -} - -/** Backend identifier for the writer that produced the sidecar. */ -export type SidecarWriterBackend = "duck-copy" | "absent"; - -/** - * Determinism class stamped on the sidecar. `"strict"` when the writer - * produces byte-identical output across runs. - */ -export type SidecarDeterminismClass = "strict"; - -/** Result of {@link writeEmbeddingsSidecar}. */ -export interface SidecarResult { - readonly written: boolean; - readonly rowCount: number; - readonly determinismClass: SidecarDeterminismClass; - readonly writerBackend: SidecarWriterBackend; - readonly bytesWritten: number; - readonly pinsHint: { readonly duckdbVersion?: string }; - readonly fileHash?: string; -} - -/** - * Write the optional Parquet embeddings sidecar. - * - * Returns `{written: false, writerBackend: "absent"}` for empty embeddings - * (no file on disk). Returns `{written: true, ..., fileHash}` and writes - * a deterministic Parquet file at `opts.outPath` otherwise. The temp table - * used to stage the COPY is dropped before the call returns. 
- */ -export async function writeEmbeddingsSidecar(opts: SidecarOptions): Promise { - const { store, outPath, granularity } = opts; - - const stage = filterByGranularity(store.graph.listEmbeddings(), granularity); - const { rowCount, duckdbVersion } = await store.temporal.exportEmbeddingsToParquet( - stage, - outPath, - ); - - if (rowCount === 0) { - return { - written: false, - rowCount: 0, - determinismClass: "strict", - writerBackend: "absent", - bytesWritten: 0, - pinsHint: {}, - }; - } - - const bytes = await readFile(outPath); - const fileHash = createHash("sha256").update(bytes).digest("hex"); - return { - written: true, - rowCount, - determinismClass: "strict", - writerBackend: "duck-copy", - bytesWritten: bytes.byteLength, - pinsHint: { duckdbVersion }, - fileHash, - }; -} - -async function* filterByGranularity( - rows: AsyncIterable, - granularity: SidecarOptions["granularity"], -): AsyncIterable { - for await (const row of rows) { - if (granularity !== undefined && row.granularity !== granularity) continue; - yield row; - } -} diff --git a/packages/pack/src/index.test.ts b/packages/pack/src/index.test.ts index c78ecb8d..1db36e36 100644 --- a/packages/pack/src/index.test.ts +++ b/packages/pack/src/index.test.ts @@ -12,9 +12,8 @@ * `best_effort`. * E2E-C. The chunker's degraded fallback flips `determinism_class` to * `degraded` even when the tokenizer is non-Anthropic. - * E2E-D. The expected 9 files (8 BOM bodies + manifest) appear on disk - * after a successful run; the Parquet sidecar is owned by a - * separate test variant. + * E2E-D. The expected 9 files (7 BOM bodies + manifest + readme) appear + * on disk after a successful run. * E2E-E. The on-disk manifest's `files[]` lists every BOM item we * wrote (excluding the manifest itself + readme). 
*/ @@ -26,7 +25,7 @@ import { tmpdir } from "node:os"; import path from "node:path"; import { describe, it, test } from "node:test"; import type { GraphNode } from "@opencodehub/core-types"; -import type { IGraphStore, ITemporalStore, ListNodesOptions, Store } from "@opencodehub/storage"; +import type { IGraphStore, ListNodesOptions } from "@opencodehub/storage"; import { generatePack } from "./index.js"; describe("@opencodehub/pack public entry", () => { @@ -133,10 +132,11 @@ function makeFixtureStore(): IGraphStore { (n): n is Extract => n.kind === "Finding", ); }, - // The fixture store has no embeddings; sidecar-absent tests rely on - // `listEmbeddings` being callable (and returning nothing). + // The fixture store has no embeddings. `listEmbeddings` is part of the + // IGraphStore shape but is unused by the pack now that the Parquet + // sidecar is gone; kept callable for shape completeness. listEmbeddings: async function* () { - // Empty generator — pack writes no sidecar. + // Empty generator. }, } as unknown as IGraphStore; } @@ -157,7 +157,6 @@ const COMMON_OPTS: { budgetTokens: number; tokenizerId: string } = { const COMMON_INTERNAL = { commit: "0".repeat(40), repoOriginUrl: "https://github.com/example/repo", - duckdbVersion: "1.1.3", grammarCommits: { typescript: "b".repeat(40) }, // Provide a deterministic chonkie loader for the strict path so tests // never depend on the real `@chonkiejs/core` install (worktree native @@ -188,11 +187,10 @@ async function runFixture( }, { ...COMMON_INTERNAL, - // The seam accepts a composed `Store`, but tests that don't - // exercise the sidecar can still pass a graph-only store via - // `graphOnly`. generatePack auto-wraps it into a Store with - // backend: "duck" and a no-op temporal — the sidecar's COPY-helper - // probe finds nothing and resolves to absent. + // The seam accepts a composed `Store`, but tests can pass a + // graph-only store via `graphOnly`. 
generatePack auto-wraps it into a + // Store whose temporal view aliases the graph; the BOM bodies read + // only `store.graph`. graphOnly: makeFixtureStore(), chunkerFiles: FIXTURE_FILES, ...internalOverrides, @@ -270,7 +268,7 @@ test("E2E-C. chunker degraded fallback flips determinism_class to degraded", asy } }); -test("E2E-D. expected 9 files appear on disk after a run; no Parquet sidecar", async () => { +test("E2E-D. expected 9 files appear on disk after a run", async () => { const dir = await tempDir(); try { await runFixture(dir); @@ -289,7 +287,8 @@ test("E2E-D. expected 9 files appear on disk after a run; no Parquet sidecar", a ]) { assert.ok(names.has(n), `missing BOM file: ${n}`); } - // No Parquet sidecar in this variant — covered by a dedicated test. + // The Parquet embeddings sidecar was dropped (ADR 0019); no .parquet + // file is ever produced. for (const n of names) { assert.ok(!n.endsWith(".parquet"), `unexpected Parquet file: ${n}`); } @@ -344,109 +343,9 @@ test("E2E-F. production store path throws cleanly when no internal store provide }); // --------------------------------------------------------------------------- -// Sidecar wiring. The fixture store does not implement -// `exportEmbeddingsParquet`, so the sidecar resolves to `absent: true`; the -// manifest must therefore NOT list `embeddings.parquet` and the file must -// NOT exist on disk. When the store DOES implement the export hook, the -// manifest must list it and the file must exist. +// The Parquet embeddings sidecar was dropped (ADR 0019): embeddings live in +// store.sqlite and there is no longer a write-only Parquet export. The pack +// is a fixed 8-item BOM (manifest + 7 bodies) plus a consumer-facing readme; +// no .parquet file is ever produced. The on-disk invariant is covered by +// E2E-D's `.parquet`-absence assertion above. // --------------------------------------------------------------------------- - -test("E2E-G. 
sidecar absent — manifest.files[] does not list embeddings.parquet", async () => { - const dir = await tempDir(); - try { - const manifest = await runFixture(dir); - const paths = manifest.files.map((f) => f.path); - assert.ok( - !paths.includes("embeddings.parquet"), - "absent sidecar must not appear in manifest.files[]", - ); - const entries = await readdir(dir); - assert.ok( - !entries.includes("embeddings.parquet"), - "absent sidecar must not produce a file on disk", - ); - } finally { - await rm(dir, { recursive: true, force: true }); - } -}); - -test("E2E-H. sidecar present — manifest lists it; pins.duckdbVersion overrides", async () => { - const dir = await tempDir(); - try { - // Inject a graph view that produces a deterministic embeddings stream - // and a temporal view whose `exportEmbeddingsToParquet` writes 4 - // magic bytes ("PAR1") to the destination so the manifest hash is - // stable across runs. - const baseStore = makeFixtureStore() as unknown as Record; - baseStore["listEmbeddings"] = async function* () { - yield { - nodeId: "fn:a", - granularity: "symbol" as const, - chunkIndex: 0, - vector: new Float32Array([0.1, 0.2]), - contentHash: "h-a", - }; - yield { - nodeId: "fn:b", - granularity: "symbol" as const, - chunkIndex: 0, - vector: new Float32Array([0.3, 0.4]), - contentHash: "h-b", - }; - yield { - nodeId: "fn:c", - granularity: "symbol" as const, - chunkIndex: 0, - vector: new Float32Array([0.5, 0.6]), - contentHash: "h-c", - }; - }; - const fakeTemporal = { - exportEmbeddingsToParquet: async (rows: AsyncIterable, absPath: string) => { - let n = 0; - for await (const _row of rows) n += 1; - await (await import("node:fs/promises")).writeFile( - absPath, - new Uint8Array([0x50, 0x41, 0x52, 0x31]), - ); - return { rowCount: n, duckdbVersion: "v1.3.99-test" }; - }, - }; - const composedStore: Store = { - graph: baseStore as unknown as IGraphStore, - temporal: fakeTemporal as unknown as ITemporalStore, - graphFile: ":memory:", - temporalFile: 
":memory:", - close: async () => { - /* no-op */ - }, - }; - const manifest = await generatePack( - { - repoPath: "/tmp/fixture-repo", - outDir: dir, - budgetTokens: COMMON_OPTS.budgetTokens, - tokenizerId: COMMON_OPTS.tokenizerId, - }, - { - ...COMMON_INTERNAL, - store: composedStore, - chunkerFiles: FIXTURE_FILES, - }, - ); - - // pins.duckdbVersion must override the test injection because the - // sidecar runtime version is more bound to the actual file. - assert.equal(manifest.pins.duckdbVersion, "v1.3.99-test"); - - const sidecarItem = manifest.files.find((f) => f.kind === "embeddings-sidecar"); - assert.ok(sidecarItem, "manifest must list the sidecar BomItem when present"); - assert.equal(sidecarItem.path, "embeddings.parquet"); - - const onDisk = await readFile(path.join(dir, "embeddings.parquet")); - const expected = createHash("sha256").update(onDisk).digest("hex"); - assert.equal(sidecarItem.fileHash, expected); - } finally { - await rm(dir, { recursive: true, force: true }); - } -}); diff --git a/packages/pack/src/index.ts b/packages/pack/src/index.ts index 265e6b9e..c3dbfc5b 100644 --- a/packages/pack/src/index.ts +++ b/packages/pack/src/index.ts @@ -2,14 +2,12 @@ * @opencodehub/pack — deterministic code-pack BOM. * * Public surface: - * - generatePack(opts): assembles the 9-item BOM (skeleton, file-tree, - * deps, ast-chunks, xrefs, findings, licenses.md, readme.md, optional - * Parquet embeddings sidecar) plus the manifest. The Parquet sidecar - * is absent when no embeddings exist. + * - generatePack(opts): assembles the 8-item BOM (manifest + skeleton + + * file-tree + deps + ast-chunks + xrefs + findings + licenses.md), plus + * a consumer-facing readme.md derived from the manifest. * - buildManifest / serializeManifest: BOM manifest + pack_hash. * - Per-BOM-item builders re-exported for direct use (skeleton, file-tree, - * deps, ast-chunker, xrefs, findings, licenses, readme, - * embeddings-sidecar). 
+ * deps, ast-chunker, xrefs, findings, licenses, readme). * - Type surface: {BomItem, DeterminismClass, PackManifest, PackOpts, PackPins}. */ @@ -24,7 +22,6 @@ import { buildAstChunks, } from "./ast-chunker.js"; import { buildDeps } from "./deps.js"; -import { type SidecarDeterminismClass, writeEmbeddingsSidecar } from "./embeddings-sidecar.js"; import { buildFileTree } from "./file-tree.js"; import { buildFindings } from "./findings.js"; import { buildLicenses } from "./licenses.js"; @@ -38,13 +35,6 @@ export type { AstChunk, AstChunkerOpts, AstChunkerResult } from "./ast-chunker.j export { buildAstChunks } from "./ast-chunker.js"; export type { DepRow, DepsOpts } from "./deps.js"; export { buildDeps } from "./deps.js"; -export type { - SidecarDeterminismClass, - SidecarOptions, - SidecarResult, - SidecarWriterBackend, -} from "./embeddings-sidecar.js"; -export { writeEmbeddingsSidecar } from "./embeddings-sidecar.js"; export type { FileTreeNode, FileTreeOpts } from "./file-tree.js"; export { buildFileTree } from "./file-tree.js"; export type { FindingExample, FindingGroup, FindingSeverity, FindingsOpts } from "./findings.js"; @@ -68,20 +58,18 @@ export { buildXrefs } from "./xrefs.js"; * loader). Callers in production never set this; the public `PackOpts` * surface is unchanged. * - * `store` is the composed {@link Store} (= `OpenStoreResult`) — the - * embeddings sidecar dispatches on `store.backend` and reaches the - * temporal-tier DuckDB COPY helper through this seam. Tests that only - * need graph-side reads can pass an {@link IGraphStore} via the - * `graphOnly` field; the sidecar then takes the absent path automatically. + * `store` is the composed {@link Store} (= `OpenStoreResult`); the BOM + * bodies read from its `graph` view. Tests that only need graph-side + * reads can pass an {@link IGraphStore} via the `graphOnly` field, which + * is wrapped into a minimal {@link Store}. 
*/ export interface GeneratePackInternalOpts { readonly store?: Store; /** * Backwards-compatible escape hatch — tests can supply an - * {@link IGraphStore} alone when they don't exercise the sidecar. - * Internally wrapped into a minimal {@link Store}; the temporal view is - * a typed alias of the graph value, sufficient for tests that only - * exercise the graph-tier reads. + * {@link IGraphStore} alone. Internally wrapped into a minimal + * {@link Store}; the temporal view is a typed alias of the graph value, + * sufficient for the graph-tier reads the BOM bodies perform. */ readonly graphOnly?: IGraphStore; readonly commit?: string; @@ -92,16 +80,14 @@ export interface GeneratePackInternalOpts { readonly language?: string; }>; readonly chonkieLoader?: AstChunkerInternalOpts["_loadChonkie"]; - readonly duckdbVersion?: string; readonly grammarCommits?: Readonly>; } /** - * Generate the deterministic 9-item code-pack. + * Generate the deterministic 8-item code-pack. * - * Writes the 8 always-present BOM files plus the manifest into - * `opts.outDir`, plus an optional Parquet sidecar when the underlying - * embeddings table has rows: + * Writes the 7 BOM body files plus the manifest into `opts.outDir`, plus a + * consumer-facing readme.md derived from the manifest: * - skeleton.jsonl * - file-tree.jsonl * - deps.jsonl @@ -109,9 +95,8 @@ export interface GeneratePackInternalOpts { * - xrefs.jsonl * - findings.jsonl * - licenses.md - * - readme.md - * - embeddings.parquet (optional — absent when no embeddings) * - manifest.json + * - readme.md (consumer-facing metadata; not a manifest BomItem) * * Determinism class: * - `"strict"` by default. @@ -174,44 +159,15 @@ export async function generatePack( bomItem("licenses", "licenses.md", licensesBytes), ]; - // --- Optional Parquet embeddings sidecar (BOM item #7). Embeddings live - // in the lbug graph; the sidecar streams them through the DuckDB - // temporal store's deterministic COPY-to-Parquet path. 
When written, - // the sidecar's runtime `SELECT version()` overrides - // `pins.duckdbVersion` so the manifest binds determinism to the engine - // version that produced the file — the parquet `created_by` metadata - // embeds it. --- await mkdir(opts.outDir, { recursive: true }); - const sidecarPath = path.join(opts.outDir, "embeddings.parquet"); - const sidecar = await writeEmbeddingsSidecar({ store, outPath: sidecarPath }); - if (sidecar.written && sidecar.fileHash !== undefined) { - items.push({ - kind: "embeddings-sidecar", - path: "embeddings.parquet", - fileHash: sidecar.fileHash, - }); - } // --- Resolve the determinism class + pins object. A `degraded` chunker // (AST chunker fell back to line-split) dominates the class via the // precedence rule: `degraded` wins over `best_effort`, which wins over - // `strict`. The sidecar is always `strict`, so it never downgrades. --- - const determinismClass = resolveDeterminism( - opts.tokenizerId, - astResult.determinismClass, - sidecar.determinismClass, - ); + // `strict`. --- + const determinismClass = resolveDeterminism(opts.tokenizerId, astResult.determinismClass); const pins: PackPins = { chonkieVersion: astResult.pinsHint.chonkieVersion ?? "unknown", - // Prefer the runtime DuckDB engine version reported by the sidecar - // when it actually wrote a file — that string is what the parquet - // `created_by` metadata carries. Fall back to the test-injectable - // override, then the @duckdb/node-api package version, then "unknown". - duckdbVersion: - sidecar.pinsHint.duckdbVersion ?? - internal.duckdbVersion ?? - (await readDuckdbVersion()) ?? - "unknown", grammarCommits: internal.grammarCommits ?? {}, }; @@ -241,8 +197,8 @@ export async function generatePack( }); const readmeBytes = encodeUtf8(readmeMd); - // --- Write everything. The outDir was already created above to host - // the optional Parquet sidecar; the bodies share it. + // --- Write everything. 
The outDir was already created above; the bodies + // share it. // BOM bodies first, then manifest, then readme. Order is irrelevant for // byte-identity (writes are independent), but we serialize manifest // last so a crash mid-write leaves an obviously-incomplete pack. @@ -292,15 +248,11 @@ async function writeBytes(p: string, bytes: Uint8Array): Promise { /** * Resolve the determinism class. A `degraded` chunker (AST chunker fell back * to line-split) dominates everything; Anthropic tokenizers downgrade to - * `best_effort`; otherwise `strict`. The sidecar's own class is always - * `"strict"` ({@link SidecarDeterminismClass}) post-ADR-0016 — the - * DuckDB-temporal writer is the only sidecar path — so it never downgrades - * the result and is accepted only to keep the precedence rule explicit. + * `best_effort`; otherwise `strict`. */ function resolveDeterminism( tokenizerId: string, chunkerClass: AstChunkerResult["determinismClass"], - _sidecarClass: SidecarDeterminismClass, ): DeterminismClass { if (chunkerClass === "degraded") return "degraded"; if (tokenizerId.startsWith("anthropic:")) return "best_effort"; @@ -309,10 +261,9 @@ function resolveDeterminism( /** * Resolve the composed store. The seam accepts a composed `Store`; tests - * that don't exercise the sidecar can still pass an `IGraphStore` via - * `internal.graphOnly` and we wrap it into a minimal `Store` shape that - * funnels the sidecar to its absent path automatically (no `temporal` - * DuckDB → no COPY helper → `writerBackend: "absent"`). + * can still pass an `IGraphStore` alone via `internal.graphOnly` and we + * wrap it into a minimal `Store` shape — the BOM bodies read only the + * `graph` view. 
*/ async function resolveStore(internal: GeneratePackInternalOpts, repoPath: string): Promise { if (internal.store !== undefined) return internal.store; @@ -322,27 +273,14 @@ async function resolveStore(internal: GeneratePackInternalOpts, repoPath: string /** * Wrap a graph-only store so the legacy test seam (`internal.graphOnly`) - * resolves into the `Store` shape `generatePack` now expects. The temporal - * view is a stub that drains the embeddings stream and reports `rowCount: - * 0` — sufficient for tests that don't exercise sidecar emission. + * resolves into the `Store` shape `generatePack` expects. The temporal + * view is unused by the BOM bodies (they read only `store.graph`), so it + * is a typed alias of the graph value. */ function wrapGraphOnly(graph: IGraphStore): Store { - // Drain the stream without writing anything — graph-only tests don't - // exercise the COPY path, so reporting rowCount: 0 keeps the sidecar - // result `absent` regardless of how many embeddings the fake produces. - const stubTemporal = { - exportEmbeddingsToParquet: async ( - rows: AsyncIterable, - ): Promise<{ rowCount: number; duckdbVersion: string }> => { - for await (const _row of rows) { - // drain - } - return { rowCount: 0, duckdbVersion: "stub" }; - }, - }; return { graph, - temporal: stubTemporal as unknown as Store["temporal"], + temporal: graph as unknown as Store["temporal"], graphFile: ":memory:", temporalFile: ":memory:", close: async () => { @@ -352,10 +290,9 @@ function wrapGraphOnly(graph: IGraphStore): Store { } /** - * Open a store from the repo path. Lazily imports `@opencodehub/storage` - * to keep the pack package importable in environments where DuckDB - * native bindings can't load. Tests inject `internal.store` (or - * `internal.graphOnly`) instead. + * Open a store from the repo path. Tests inject `internal.store` (or + * `internal.graphOnly`) instead; production store lookup is wired by the + * CLI integration layer. 
*/ async function openStoreFromRepoPath(_repoPath: string): Promise { // Production store lookup is wired by the CLI integration layer. @@ -365,19 +302,3 @@ async function openStoreFromRepoPath(_repoPath: string): Promise { "generatePack: production store lookup is wired by the CLI; pass internal.store in tests.", ); } - -/** - * Read `@duckdb/node-api`'s package.json for the version pin. Returns - * `undefined` if the package isn't installed (e.g. browser build), so - * the pins object falls back to `"unknown"`. - */ -async function readDuckdbVersion(): Promise { - try { - const { createRequire } = await import("node:module"); - const require = createRequire(import.meta.url); - const pkg = require("@duckdb/node-api/package.json") as { version?: string }; - return typeof pkg.version === "string" ? pkg.version : undefined; - } catch { - return undefined; - } -} diff --git a/packages/pack/src/manifest.test.ts b/packages/pack/src/manifest.test.ts index 9d362da1..d9c198fc 100644 --- a/packages/pack/src/manifest.test.ts +++ b/packages/pack/src/manifest.test.ts @@ -20,7 +20,6 @@ import type { BomItem, PackPins } from "./types.js"; const FIXTURE_PINS: PackPins = { chonkieVersion: "0.3.0", - duckdbVersion: "1.1.3", grammarCommits: { python: "a".repeat(40), typescript: "b".repeat(40), @@ -129,7 +128,6 @@ test("C. 
packHash is not part of its own preimage (round-trip)", () => { pack_hash: "", pins: { chonkie_version: m.pins.chonkieVersion, - duckdb_version: m.pins.duckdbVersion, grammar_commits: m.pins.grammarCommits, }, repo_origin_url: m.repoOriginUrl, diff --git a/packages/pack/src/manifest.ts b/packages/pack/src/manifest.ts index 7c259729..115cba5f 100644 --- a/packages/pack/src/manifest.ts +++ b/packages/pack/src/manifest.ts @@ -90,7 +90,6 @@ function toSnakeCaseManifest(m: PackManifest): Record { pack_hash: m.packHash, pins: { chonkie_version: m.pins.chonkieVersion, - duckdb_version: m.pins.duckdbVersion, grammar_commits: m.pins.grammarCommits, }, repo_origin_url: m.repoOriginUrl, diff --git a/packages/pack/src/pack-determinism.test.ts b/packages/pack/src/pack-determinism.test.ts index 09b3c1f8..90de5dfd 100644 --- a/packages/pack/src/pack-determinism.test.ts +++ b/packages/pack/src/pack-determinism.test.ts @@ -14,13 +14,10 @@ * `Buffer.compare(readFile(outA/f), readFile(outB/f)) === 0` * * Variant matrix: - * V1. Empty embeddings — store has no `exportEmbeddingsParquet` hook; - * sidecar is absent; manifest.files[] lists 7 BOM bodies (excluding + * V1. Baseline — manifest.files[] lists 7 BOM bodies (excluding * manifest+readme). 9 files on disk: 7 bodies + readme.md + manifest.json. - * V2. Populated embeddings — fake @internal `exportEmbeddingsParquet` - * (duck-typed onto the graph view) writes a deterministic - * parquet body; sidecar is present; embeddings.parquet bytes are - * identical across runs. + * (The Parquet embeddings sidecar was dropped in ADR 0019; no .parquet + * file is ever produced.) * V3. Mixed framework labels — ProjectProfile.frameworks is a duplicated, * reverse-sorted list. file-tree.jsonl frameworks must be alpha-sorted + * deduped to the same byte sequence on both runs. 
@@ -38,7 +35,7 @@ import { tmpdir } from "node:os"; import path from "node:path"; import { test } from "node:test"; import type { GraphNode } from "@opencodehub/core-types"; -import type { IGraphStore, ITemporalStore, ListNodesOptions, Store } from "@opencodehub/storage"; +import type { IGraphStore, ListNodesOptions, Store } from "@opencodehub/storage"; import { type GeneratePackInternalOpts, generatePack } from "./index.js"; // --------------------------------------------------------------------------- @@ -46,13 +43,6 @@ import { type GeneratePackInternalOpts, generatePack } from "./index.js"; // --------------------------------------------------------------------------- interface FixtureKnobs { - /** - * Attach a duck-typed @internal `exportEmbeddingsParquet` helper to the - * graph fake so the sidecar emits 4 deterministic bytes. The helper - * lives on the graph view because `runVariant` wraps the fake with - * `backend: "duck"`, where the sidecar narrows on `store.graph`. - */ - readonly withEmbeddings: boolean; /** Use a duplicated, reverse-sorted ProjectProfile.frameworks list. */ readonly withMixedFrameworks: boolean; /** Add multiple findings sharing (severity, ruleId) for grouping. */ @@ -267,54 +257,17 @@ function makeRichFixtureStore(knobs: FixtureKnobs): IGraphStore { })); }, listFindings: async () => findingNodes, + // `listEmbeddings` is part of the IGraphStore shape but is unused by the + // pack now that the Parquet sidecar is gone; an empty generator keeps it + // callable for shape completeness. listEmbeddings: async function* () { - if (!knobs.withEmbeddings) return; - // Deterministic two-row stream — the temporal export fake below - // turns this into a 4-byte placeholder parquet. 
- yield { - nodeId: "fn:a", - granularity: "symbol" as const, - chunkIndex: 0, - vector: new Float32Array([0.1, 0.2]), - contentHash: "h-a", - }; - yield { - nodeId: "fn:b", - granularity: "symbol" as const, - chunkIndex: 0, - vector: new Float32Array([0.3, 0.4]), - contentHash: "h-b", - }; + // Empty generator. }, }; return store as unknown as IGraphStore; } -/** - * Fake `ITemporalStore` shim used by pack-determinism tests. The pack - * sidecar routes through `temporal.exportEmbeddingsToParquet`; the real - * DuckDB binding is irrelevant to these wiring tests so we drain the - * stream and write a deterministic 4-byte parquet stand-in. - */ -function makeFakeTemporalForPack(knobs: FixtureKnobs): unknown { - return { - exportEmbeddingsToParquet: async ( - rows: AsyncIterable, - absPath: string, - ): Promise<{ rowCount: number; duckdbVersion: string }> => { - let n = 0; - for await (const _row of rows) n += 1; - if (!knobs.withEmbeddings || n === 0) { - return { rowCount: 0, duckdbVersion: "v1.3.99-test" }; - } - const fs = await import("node:fs/promises"); - await fs.writeFile(absPath, new Uint8Array([0x50, 0x41, 0x52, 0x31])); - return { rowCount: n, duckdbVersion: "v1.3.99-test" }; - }, - }; -} - // --------------------------------------------------------------------------- // Driver // --------------------------------------------------------------------------- @@ -344,7 +297,6 @@ const COMMON_OPTS = { const COMMON_INTERNAL: GeneratePackInternalOpts = { commit: "0".repeat(40), repoOriginUrl: "https://github.com/example/repo", - duckdbVersion: "1.1.3", grammarCommits: { typescript: "b".repeat(40) }, // Deterministic chonkie stub — emits one chunk per file. Avoids the real // import path so the test runs even when native bindings are unavailable. 
@@ -366,12 +318,11 @@ async function tempDir(prefix: string): Promise { async function runVariant(outDir: string, knobs: FixtureKnobs): Promise<{ packHash: string }> { const fakeGraph = makeRichFixtureStore(knobs); - // V2 attaches a duck-typed COPY helper to the graph — wrap into a - // backend:"duck" Store so the sidecar narrows correctly. V1/V3/V4 - // never invoke the helper; the wrapper just exposes the graph view. + // The BOM bodies read only `store.graph`; the temporal view is unused by + // the pack (the Parquet sidecar was dropped), so alias it to the graph. const composedStore: Store = { graph: fakeGraph, - temporal: makeFakeTemporalForPack(knobs) as unknown as ITemporalStore, + temporal: fakeGraph as unknown as Store["temporal"], graphFile: ":memory:", temporalFile: ":memory:", close: async () => { @@ -432,9 +383,8 @@ async function assertByteIdentical(label: string, knobs: FixtureKnobs): Promise< // Variant tests — 4 distinct shapes covering the determinism matrix. // --------------------------------------------------------------------------- -test("V1. empty embeddings — sidecar absent, 9 files on disk, byte-identical", async () => { - await assertByteIdentical("v1-empty-embeddings", { - withEmbeddings: false, +test("V1. baseline — 9 files on disk, no .parquet, byte-identical", async () => { + await assertByteIdentical("v1-baseline", { withMixedFrameworks: false, withGroupedFindings: false, }); @@ -444,7 +394,6 @@ test("V1. empty embeddings — sidecar absent, 9 files on disk, byte-identical", const outDir = await tempDir("pack-det-v1-shape-"); try { await runVariant(outDir, { - withEmbeddings: false, withMixedFrameworks: false, withGroupedFindings: false, }); @@ -460,32 +409,10 @@ test("V1. empty embeddings — sidecar absent, 9 files on disk, byte-identical", "skeleton.jsonl", "xrefs.jsonl", ]); - } finally { - await rm(outDir, { recursive: true, force: true }); - } -}); - -test("V2. 
populated embeddings — sidecar present, parquet bytes byte-identical", async () => { - await assertByteIdentical("v2-populated-embeddings", { - withEmbeddings: true, - withMixedFrameworks: false, - withGroupedFindings: false, - }); - - // Cross-check that the sidecar is actually on disk for this variant. - const outDir = await tempDir("pack-det-v2-shape-"); - try { - await runVariant(outDir, { - withEmbeddings: true, - withMixedFrameworks: false, - withGroupedFindings: false, - }); - const entries = new Set(await readdir(outDir)); - assert.ok(entries.has("embeddings.parquet"), "v2 must produce embeddings.parquet"); - assert.equal( - entries.size, - 10, - "v2 should produce 10 files (8 BOM + readme + manifest + sidecar)", + // The Parquet sidecar was dropped (ADR 0019); no .parquet file exists. + assert.ok( + !entries.some((e) => e.endsWith(".parquet")), + "no Parquet sidecar should be produced", ); } finally { await rm(outDir, { recursive: true, force: true }); @@ -494,7 +421,6 @@ test("V2. populated embeddings — sidecar present, parquet bytes byte-identical test("V3. mixed framework labels — file-tree.jsonl alpha-sorted + deduped, byte-identical", async () => { await assertByteIdentical("v3-mixed-frameworks", { - withEmbeddings: false, withMixedFrameworks: true, withGroupedFindings: false, }); @@ -503,7 +429,6 @@ test("V3. mixed framework labels — file-tree.jsonl alpha-sorted + deduped, byt const outDir = await tempDir("pack-det-v3-shape-"); try { await runVariant(outDir, { - withEmbeddings: false, withMixedFrameworks: true, withGroupedFindings: false, }); @@ -522,7 +447,6 @@ test("V3. mixed framework labels — file-tree.jsonl alpha-sorted + deduped, byt test("V4. grouped findings — findings.jsonl groups stably, byte-identical", async () => { await assertByteIdentical("v4-grouped-findings", { - withEmbeddings: false, withMixedFrameworks: false, withGroupedFindings: true, }); @@ -532,7 +456,6 @@ test("V4. 
grouped findings — findings.jsonl groups stably, byte-identical", as const outDir = await tempDir("pack-det-v4-shape-"); try { await runVariant(outDir, { - withEmbeddings: false, withMixedFrameworks: false, withGroupedFindings: true, }); @@ -556,12 +479,11 @@ test("V4. grouped findings — findings.jsonl groups stably, byte-identical", as // --------------------------------------------------------------------------- // Combined variant — exercises every knob together so the composition is -// covered: populated embeddings + mixed frameworks + grouped findings. +// covered: mixed frameworks + grouped findings. // --------------------------------------------------------------------------- test("V5. all-knobs — every byte identical across two runs", async () => { await assertByteIdentical("v5-all-knobs", { - withEmbeddings: true, withMixedFrameworks: true, withGroupedFindings: true, }); diff --git a/packages/pack/src/readme.test.ts b/packages/pack/src/readme.test.ts index 40f3ea51..14278e6e 100644 --- a/packages/pack/src/readme.test.ts +++ b/packages/pack/src/readme.test.ts @@ -24,7 +24,6 @@ const FIXTURE_MANIFEST: PackManifest = { budgetTokens: 100_000, pins: { chonkieVersion: "0.0.9", - duckdbVersion: "1.1.3", grammarCommits: { python: "a".repeat(40), typescript: "b".repeat(40), @@ -61,7 +60,6 @@ test("B. manifest fields are interpolated into the README", () => { assert.ok(md.includes("100000")); assert.ok(md.includes("strict")); assert.ok(md.includes(FIXTURE_MANIFEST.pins.chonkieVersion)); - assert.ok(md.includes(FIXTURE_MANIFEST.pins.duckdbVersion)); }); test("C. BOM item paths are alpha-sorted regardless of input order", () => { diff --git a/packages/pack/src/readme.ts b/packages/pack/src/readme.ts index 127db734..5eea6a77 100644 --- a/packages/pack/src/readme.ts +++ b/packages/pack/src/readme.ts @@ -1,5 +1,6 @@ /** - * BOM body item: README.md with the determinism contract (item 9 partial). 
+ * BOM body item: README.md with the determinism contract (consumer-facing + * metadata; not a manifest BomItem). * * Pure-string renderer; deterministic by construction. The README pastes * the determinism contract verbatim and interpolates the manifest's @@ -31,7 +32,7 @@ export function buildReadme(opts: ReadmeOpts): string { const lines: string[] = []; lines.push("# OpenCodeHub Code-Pack"); lines.push(""); - lines.push("Deterministic 9-item code-pack BOM produced by `@opencodehub/pack`."); + lines.push("Deterministic 8-item code-pack BOM produced by `@opencodehub/pack`."); lines.push(""); lines.push("## Manifest"); @@ -48,7 +49,6 @@ export function buildReadme(opts: ReadmeOpts): string { lines.push("## Pins"); lines.push(""); lines.push(`- chonkie_version: \`${manifest.pins.chonkieVersion}\``); - lines.push(`- duckdb_version: \`${manifest.pins.duckdbVersion}\``); const grammarKeys = Object.keys(manifest.pins.grammarCommits).sort(); if (grammarKeys.length === 0) { lines.push("- grammar_commits: (none)"); @@ -70,7 +70,7 @@ export function buildReadme(opts: ReadmeOpts): string { lines.push("## Determinism contract"); lines.push(""); lines.push( - "Same `(commit, tokenizer_id, budget_tokens, chonkie_version, duckdb_version, grammar_commits)` produces a byte-identical pack and the same `pack_hash`.", + "Same `(commit, tokenizer_id, budget_tokens, chonkie_version, grammar_commits)` produces a byte-identical pack and the same `pack_hash`.", ); lines.push(""); lines.push("- `strict` — every BOM file is byte-identity reproducible."); diff --git a/packages/pack/src/types.ts b/packages/pack/src/types.ts index 6c38bd4e..ed857593 100644 --- a/packages/pack/src/types.ts +++ b/packages/pack/src/types.ts @@ -1,5 +1,5 @@ /** - * @opencodehub/pack — public type surface for the 9-item BOM. + * @opencodehub/pack — public type surface for the 8-item BOM. * * These interfaces are the contract every BOM body builder consumes. 
* Fields are `readonly` by convention (see sibling packages in this @@ -7,7 +7,7 @@ * in-place. */ -/** A single item in the 9-item BOM. */ +/** A single item in the 8-item BOM. */ export interface BomItem { readonly kind: | "manifest" @@ -16,7 +16,6 @@ export interface BomItem { | "deps" | "ast-chunks" | "xrefs" - | "embeddings-sidecar" | "findings" | "licenses"; readonly path: string; // relative to pack output dir @@ -34,7 +33,6 @@ export type DeterminismClass = "strict" | "best_effort" | "degraded"; /** Version pins embedded in the BOM manifest for reproducibility. */ export interface PackPins { readonly chonkieVersion: string; - readonly duckdbVersion: string; readonly grammarCommits: Readonly>; // lang -> grammar commit SHA } diff --git a/packages/storage/package.json b/packages/storage/package.json index ec3e9250..062a91fa 100644 --- a/packages/storage/package.json +++ b/packages/storage/package.json @@ -2,7 +2,7 @@ "name": "@opencodehub/storage", "version": "0.3.0", "private": true, - "description": "OpenCodeHub — DuckDB graph store (@duckdb/node-api + hnsw_acorn + fts)", + "description": "OpenCodeHub — single-file SQLite store (node:sqlite, WAL)", "license": "Apache-2.0", "repository": { "type": "git", @@ -24,6 +24,10 @@ "./test-utils": { "types": "./dist/test-utils/index.d.ts", "import": "./dist/test-utils/index.js" + }, + "./sqlite-runtime": { + "types": "./dist/sqlite-runtime.d.ts", + "import": "./dist/sqlite-runtime.js" } }, "files": [ @@ -42,8 +46,6 @@ "clean": "rm -rf dist *.tsbuildinfo" }, "dependencies": { - "@duckdb/node-api": "1.5.3-r.3", - "@ladybugdb/core": "^0.17.1", "@opencodehub/core-types": "workspace:*" }, "devDependencies": { diff --git a/packages/storage/src/duckdb-adapter.test.ts b/packages/storage/src/duckdb-adapter.test.ts deleted file mode 100644 index 14620895..00000000 --- a/packages/storage/src/duckdb-adapter.test.ts +++ /dev/null @@ -1,398 +0,0 @@ -import assert from "node:assert/strict"; -import { mkdtemp } from 
"node:fs/promises"; -import { tmpdir } from "node:os"; -import { join } from "node:path"; -import { test } from "node:test"; -import { DuckDbStore } from "./duckdb-adapter.js"; - -async function scratchDbPath(): Promise { - const dir = await mkdtemp(join(tmpdir(), "och-storage-duck-")); - return join(dir, "temporal.duckdb"); -} - -// --------------------------------------------------------------------------- -// Cochanges -// --------------------------------------------------------------------------- - -test("bulkLoadCochanges: replaces rows and sorts insertion deterministically", async () => { - const dbPath = await scratchDbPath(); - const store = new DuckDbStore(dbPath); - await store.open(); - try { - await store.createSchema(); - - await store.bulkLoadCochanges([ - { - sourceFile: "src/a.ts", - targetFile: "src/b.ts", - cocommitCount: 10, - totalCommitsSource: 20, - totalCommitsTarget: 15, - lastCocommitAt: "2026-01-01T00:00:00.000Z", - lift: 2.5, - }, - { - sourceFile: "src/a.ts", - targetFile: "src/c.ts", - cocommitCount: 3, - totalCommitsSource: 20, - totalCommitsTarget: 30, - lastCocommitAt: "2026-02-01T00:00:00.000Z", - lift: 0.7, - }, - ]); - - const rows = await store.exec( - "SELECT source_file, target_file, cocommit_count, lift FROM cochanges ORDER BY source_file, target_file", - ); - assert.equal(rows.length, 2); - assert.equal(rows[0]?.["target_file"], "src/b.ts"); - assert.equal(rows[1]?.["target_file"], "src/c.ts"); - - // Second bulk load fully replaces prior contents. 
- await store.bulkLoadCochanges([ - { - sourceFile: "src/x.ts", - targetFile: "src/y.ts", - cocommitCount: 2, - totalCommitsSource: 4, - totalCommitsTarget: 5, - lastCocommitAt: "2026-03-01T00:00:00.000Z", - lift: 5.0, - }, - ]); - const after = await store.exec("SELECT source_file FROM cochanges"); - assert.equal(after.length, 1); - assert.equal(after[0]?.["source_file"], "src/x.ts"); - } finally { - await store.close(); - } -}); - -test("lookupCochangesForFile: ranks by lift and filters below minLift", async () => { - const dbPath = await scratchDbPath(); - const store = new DuckDbStore(dbPath); - await store.open(); - try { - await store.createSchema(); - - await store.bulkLoadCochanges([ - { - sourceFile: "src/a.ts", - targetFile: "src/b.ts", - cocommitCount: 8, - totalCommitsSource: 10, - totalCommitsTarget: 12, - lastCocommitAt: "2026-01-01T00:00:00.000Z", - lift: 3.2, - }, - { - sourceFile: "src/a.ts", - targetFile: "src/c.ts", - cocommitCount: 1, - totalCommitsSource: 10, - totalCommitsTarget: 50, - lastCocommitAt: "2026-01-02T00:00:00.000Z", - lift: 0.4, - }, - { - sourceFile: "src/d.ts", - targetFile: "src/a.ts", - cocommitCount: 5, - totalCommitsSource: 7, - totalCommitsTarget: 10, - lastCocommitAt: "2026-01-03T00:00:00.000Z", - lift: 1.8, - }, - ]); - - const defaults = await store.lookupCochangesForFile("src/a.ts"); - // Defaults: minLift=1.0, drops the 0.4 row; sorted by lift DESC. 
-    assert.equal(defaults.length, 2);
-    assert.equal(defaults[0]?.lift, 3.2);
-    assert.equal(defaults[0]?.targetFile, "src/b.ts");
-    assert.equal(defaults[1]?.sourceFile, "src/d.ts");
-
-    const weak = await store.lookupCochangesForFile("src/a.ts", { minLift: 0 });
-    assert.equal(weak.length, 3);
-
-    const capped = await store.lookupCochangesForFile("src/a.ts", { limit: 1 });
-    assert.equal(capped.length, 1);
-    assert.equal(capped[0]?.targetFile, "src/b.ts");
-  } finally {
-    await store.close();
-  }
-});
-
-test("lookupCochangesBetween: returns the row in either ordering", async () => {
-  const dbPath = await scratchDbPath();
-  const store = new DuckDbStore(dbPath);
-  await store.open();
-  try {
-    await store.createSchema();
-    await store.bulkLoadCochanges([
-      {
-        sourceFile: "src/a.ts",
-        targetFile: "src/b.ts",
-        cocommitCount: 4,
-        totalCommitsSource: 6,
-        totalCommitsTarget: 6,
-        lastCocommitAt: "2026-01-01T00:00:00.000Z",
-        lift: 2.0,
-      },
-    ]);
-    const forward = await store.lookupCochangesBetween("src/a.ts", "src/b.ts");
-    const reverse = await store.lookupCochangesBetween("src/b.ts", "src/a.ts");
-    assert.ok(forward);
-    assert.ok(reverse);
-    assert.equal(forward?.lift, 2.0);
-    assert.equal(reverse?.lift, 2.0);
-
-    const missing = await store.lookupCochangesBetween("src/a.ts", "src/zzz.ts");
-    assert.equal(missing, undefined);
-  } finally {
-    await store.close();
-  }
-});
-
-// ---------------------------------------------------------------------------
-// Symbol summaries
-// ---------------------------------------------------------------------------
-
-test("bulkLoadSymbolSummaries: inserts rows and supports single-row lookup", async () => {
-  const dbPath = await scratchDbPath();
-  const store = new DuckDbStore(dbPath);
-  await store.open();
-  try {
-    await store.createSchema();
-
-    await store.bulkLoadSymbolSummaries([
-      {
-        nodeId: "Function:src/a.ts:alpha",
-        contentHash: "hash-a",
-        promptVersion: "1",
-        modelId: "anthropic.claude-haiku-4-5",
-        summaryText: "Do the alpha thing.",
-        signatureSummary: "(x: int) -> int",
-        returnsTypeSummary: "the alpha count",
-        createdAt: "2026-01-01T00:00:00.000Z",
-      },
-      {
-        nodeId: "Function:src/b.ts:beta",
-        contentHash: "hash-b",
-        promptVersion: "1",
-        modelId: "anthropic.claude-haiku-4-5",
-        summaryText: "Do the beta thing.",
-        createdAt: "2026-01-02T00:00:00.000Z",
-      },
-    ]);
-
-    const hit = await store.lookupSymbolSummary("Function:src/a.ts:alpha", "hash-a", "1");
-    assert.ok(hit);
-    assert.equal(hit?.summaryText, "Do the alpha thing.");
-    assert.equal(hit?.signatureSummary, "(x: int) -> int");
-    assert.equal(hit?.returnsTypeSummary, "the alpha count");
-
-    // Cache miss on any slot of the composite key → undefined.
-    const missHash = await store.lookupSymbolSummary("Function:src/a.ts:alpha", "hash-x", "1");
-    assert.equal(missHash, undefined);
-    const missPrompt = await store.lookupSymbolSummary("Function:src/a.ts:alpha", "hash-a", "2");
-    assert.equal(missPrompt, undefined);
-    const missNode = await store.lookupSymbolSummary("Function:src/a.ts:zeta", "hash-a", "1");
-    assert.equal(missNode, undefined);
-  } finally {
-    await store.close();
-  }
-});
-
-test("bulkLoadSymbolSummaries: persists structuredJson and reconstructs it on both read paths", async () => {
-  const dbPath = await scratchDbPath();
-  const store = new DuckDbStore(dbPath);
-  await store.open();
-  try {
-    await store.createSchema();
-
-    // The validated structured payload (citations + side_effects +
-    // invariants + per-input descriptions + returns.details) the summarizer
-    // produces, carried as a canonical-JSON blob.
-    const structuredJson = JSON.stringify({
-      citations: [{ field_name: "purpose", line_start: 10, line_end: 14 }],
-      inputs: [{ name: "x", type: "int", description: "the alpha counter input" }],
-      returns: { details: "the running alpha count after the increment" },
-      side_effects: ["writes the alpha counter"],
-      invariants: ["x stays non-negative"],
-    });
-
-    await store.bulkLoadSymbolSummaries([
-      {
-        nodeId: "Function:src/a.ts:alpha",
-        contentHash: "hash-a",
-        promptVersion: "1",
-        modelId: "anthropic.claude-haiku-4-5",
-        summaryText: "Do the alpha thing.",
-        signatureSummary: "(x: int) -> int",
-        returnsTypeSummary: "the alpha count",
-        structuredJson,
-        createdAt: "2026-01-01T00:00:00.000Z",
-      },
-      {
-        // No structured payload — structuredJson must read back as undefined.
-        nodeId: "Function:src/b.ts:beta",
-        contentHash: "hash-b",
-        promptVersion: "1",
-        modelId: "anthropic.claude-haiku-4-5",
-        summaryText: "Do the beta thing.",
-        createdAt: "2026-01-02T00:00:00.000Z",
-      },
-    ]);
-
-    const hit = await store.lookupSymbolSummary("Function:src/a.ts:alpha", "hash-a", "1");
-    assert.ok(hit);
-    assert.equal(hit?.structuredJson, structuredJson);
-    // The citation line range survives so a staleness detector can read it.
-    const parsed = JSON.parse(hit?.structuredJson ?? "{}");
-    assert.equal(parsed.citations[0].line_start, 10);
-    assert.equal(parsed.citations[0].line_end, 14);
-
-    const noStructured = await store.lookupSymbolSummary("Function:src/b.ts:beta", "hash-b", "1");
-    assert.equal(noStructured?.structuredJson, undefined);
-
-    // The by-node read path reconstructs the blob too.
-    const byNode = await store.lookupSymbolSummariesByNode(["Function:src/a.ts:alpha"]);
-    assert.equal(byNode.length, 1);
-    assert.equal(byNode[0]?.structuredJson, structuredJson);
-  } finally {
-    await store.close();
-  }
-});
-
-test("bulkLoadSymbolSummaries: re-insert on same composite key overwrites row", async () => {
-  const dbPath = await scratchDbPath();
-  const store = new DuckDbStore(dbPath);
-  await store.open();
-  try {
-    await store.createSchema();
-    await store.bulkLoadSymbolSummaries([
-      {
-        nodeId: "Function:src/a.ts:alpha",
-        contentHash: "hash-a",
-        promptVersion: "1",
-        modelId: "m1",
-        summaryText: "first",
-        createdAt: "2026-01-01T00:00:00.000Z",
-      },
-    ]);
-    await store.bulkLoadSymbolSummaries([
-      {
-        nodeId: "Function:src/a.ts:alpha",
-        contentHash: "hash-a",
-        promptVersion: "1",
-        modelId: "m2",
-        summaryText: "second",
-        createdAt: "2026-02-01T00:00:00.000Z",
-      },
-    ]);
-    const hit = await store.lookupSymbolSummary("Function:src/a.ts:alpha", "hash-a", "1");
-    assert.equal(hit?.summaryText, "second");
-    assert.equal(hit?.modelId, "m2");
-  } finally {
-    await store.close();
-  }
-});
-
-test("lookupSymbolSummariesByNode: returns rows for every requested node, ordered deterministically", async () => {
-  const dbPath = await scratchDbPath();
-  const store = new DuckDbStore(dbPath);
-  await store.open();
-  try {
-    await store.createSchema();
-    await store.bulkLoadSymbolSummaries([
-      {
-        nodeId: "Function:src/a.ts:alpha",
-        contentHash: "h1",
-        promptVersion: "2",
-        modelId: "m",
-        summaryText: "alpha v2",
-        createdAt: "2026-01-02T00:00:00.000Z",
-      },
-      {
-        nodeId: "Function:src/a.ts:alpha",
-        contentHash: "h1",
-        promptVersion: "1",
-        modelId: "m",
-        summaryText: "alpha v1",
-        createdAt: "2026-01-01T00:00:00.000Z",
-      },
-      {
-        nodeId: "Function:src/b.ts:beta",
-        contentHash: "h2",
-        promptVersion: "1",
-        modelId: "m",
-        summaryText: "beta",
-        createdAt: "2026-01-03T00:00:00.000Z",
-      },
-      {
-        nodeId: "Function:src/c.ts:gamma",
-        contentHash: "h3",
-        promptVersion: "1",
-        modelId: "m",
-        summaryText: "gamma",
-        createdAt: "2026-01-04T00:00:00.000Z",
-      },
-    ]);
-    const hits = await store.lookupSymbolSummariesByNode([
-      "Function:src/a.ts:alpha",
-      "Function:src/b.ts:beta",
-    ]);
-    assert.equal(hits.length, 3);
-    // Ordered by (node_id ASC, prompt_version ASC, content_hash ASC).
-    assert.equal(hits[0]?.nodeId, "Function:src/a.ts:alpha");
-    assert.equal(hits[0]?.promptVersion, "1");
-    assert.equal(hits[1]?.nodeId, "Function:src/a.ts:alpha");
-    assert.equal(hits[1]?.promptVersion, "2");
-    assert.equal(hits[2]?.nodeId, "Function:src/b.ts:beta");
-
-    const empty = await store.lookupSymbolSummariesByNode([]);
-    assert.equal(empty.length, 0);
-  } finally {
-    await store.close();
-  }
-});
-
-// ---------------------------------------------------------------------------
-// exec + healthCheck
-// ---------------------------------------------------------------------------
-
-test("exec + healthCheck round-trip on a fresh schema", async () => {
-  const dbPath = await scratchDbPath();
-  const store = new DuckDbStore(dbPath);
-  await store.open();
-  try {
-    await store.createSchema();
-    const h = await store.healthCheck();
-    assert.equal(h.ok, true);
-
-    // The temporal schema exposes cochanges + symbol_summaries — any other
-    // graph-tier table must not exist.
-    const cochangeCount = await store.exec("SELECT COUNT(*) AS n FROM cochanges");
-    const summaryCount = await store.exec("SELECT COUNT(*) AS n FROM symbol_summaries");
-    assert.equal(Number(cochangeCount[0]?.["n"]), 0);
-    assert.equal(Number(summaryCount[0]?.["n"]), 0);
-
-    // Graph-tier tables (nodes / relations / embeddings) must NOT exist.
-    await assert.rejects(
-      () => store.exec("SELECT COUNT(*) FROM nodes"),
-      /nodes/,
-      "temporal.duckdb must not carry the nodes table",
-    );
-
-    // exec rejects writes via the SQL guard.
-    await assert.rejects(
-      () => store.exec("CREATE TABLE x (a INT)"),
-      /CREATE/,
-      "exec must reject writes",
-    );
-  } finally {
-    await store.close();
-  }
-});
diff --git a/packages/storage/src/duckdb-adapter.ts b/packages/storage/src/duckdb-adapter.ts
deleted file mode 100644
index da5360bf..00000000
--- a/packages/storage/src/duckdb-adapter.ts
+++ /dev/null
@@ -1,733 +0,0 @@
-/**
- * DuckDB-backed adapter for the temporal storage interface.
- *
- * This class implements {@link ITemporalStore} only. The graph tier is
- * served by `GraphDbStore` (`@ladybugdb/core`); the temporal tier owns
- * cochange statistics, structured symbol summaries, and the
- * `codehub query --sql` escape hatch.
- *
- * Lifecycle: `open` → `createSchema` → `bulkLoadCochanges` /
- * `bulkLoadSymbolSummaries` → `lookupCochangesForFile` /
- * `lookupSymbolSummary` / `exec` → `close`.
- *
- * Timeouts on `exec` are enforced by a JS-side interrupt timer rather
- * than a DuckDB SQL setting — DuckDB does not expose a per-statement
- * timeout.
- */
-
-import {
-  type DuckDBConnection,
-  DuckDBInstance,
-  type DuckDBPreparedStatement,
-  FLOAT,
-  LIST,
-} from "@duckdb/node-api";
-import type {
-  CochangeLookupOptions,
-  CochangeRow,
-  EmbeddingRow,
-  ITemporalStore,
-  SqlParam,
-  SymbolSummaryRow,
-} from "./interface.js";
-import { generateSchemaDDL } from "./schema-ddl.js";
-import { assertReadOnlySql } from "./sql-guard.js";
-
-export interface DuckDbStoreOptions {
-  readonly readOnly?: boolean;
-  /**
-   * Retained for API symmetry with the prior multi-tier adapter; the
-   * temporal-only adapter never reads embeddings, so the value is ignored.
-   */
-  readonly embeddingDim?: number;
-  /** Default query timeout for `exec()` calls in ms. Default 5000. */
-  readonly timeoutMs?: number;
-}
-
-const DEFAULT_TIMEOUT_MS = 5_000;
-const DEFAULT_COCHANGE_LOOKUP_LIMIT = 10;
-const DEFAULT_COCHANGE_MIN_LIFT = 1.0;
-
-/**
- * Concrete adapter that satisfies {@link ITemporalStore} over a single
- * DuckDB connection. Pairs with `GraphDbStore` for the graph tier via
- * `openStore`.
- */
-export class DuckDbStore implements ITemporalStore {
-  private readonly path: string;
-  private readonly readOnly: boolean;
-  private readonly defaultTimeoutMs: number;
-  private instance: DuckDBInstance | undefined;
-  private conn: DuckDBConnection | undefined;
-
-  constructor(path: string, opts: DuckDbStoreOptions = {}) {
-    this.path = path;
-    this.readOnly = opts.readOnly === true;
-    this.defaultTimeoutMs = opts.timeoutMs ?? DEFAULT_TIMEOUT_MS;
-  }
-
-  // --------------------------------------------------------------------------
-  // Lifecycle
-  // --------------------------------------------------------------------------
-
-  async open(): Promise {
-    if (this.instance) return;
-    const options: Record = {
-      access_mode: this.readOnly ? "READ_ONLY" : "READ_WRITE",
-    };
-    this.instance = await DuckDBInstance.create(this.path, options);
-    this.conn = await this.instance.connect();
-  }
-
-  async close(): Promise {
-    this.conn?.closeSync();
-    this.conn = undefined;
-    this.instance?.closeSync();
-    this.instance = undefined;
-  }
-
-  async createSchema(): Promise {
-    const c = this.requireConn();
-    const stmts = generateSchemaDDL();
-    for (const stmt of stmts) {
-      await c.run(stmt);
-    }
-  }
-
-  // --------------------------------------------------------------------------
-  // Cochanges
-  // --------------------------------------------------------------------------
-
-  async bulkLoadCochanges(rows: readonly CochangeRow[]): Promise {
-    const c = this.requireConn();
-    await c.run("BEGIN TRANSACTION");
-    try {
-      await c.run("DELETE FROM cochanges");
-      if (rows.length === 0) {
-        await c.run("COMMIT");
-        return;
-      }
-      // Sort by (source_file, target_file) so insertion order is deterministic
-      // across runs.
-      const sorted = [...rows].sort((a, b) => {
-        if (a.sourceFile !== b.sourceFile) {
-          return a.sourceFile < b.sourceFile ? -1 : 1;
-        }
-        return a.targetFile < b.targetFile ? -1 : a.targetFile > b.targetFile ? 1 : 0;
-      });
-      const stmt = await c.prepare(
-        `INSERT INTO cochanges (
-           source_file, target_file, cocommit_count,
-           total_commits_source, total_commits_target,
-           last_cocommit_at, lift
-         ) VALUES (?, ?, ?, ?, ?, CAST(? AS TIMESTAMP), ?)`,
-      );
-      try {
-        for (const row of sorted) {
-          stmt.clearBindings();
-          bindParam(stmt, 1, row.sourceFile);
-          bindParam(stmt, 2, row.targetFile);
-          bindParam(stmt, 3, row.cocommitCount);
-          bindParam(stmt, 4, row.totalCommitsSource);
-          bindParam(stmt, 5, row.totalCommitsTarget);
-          bindParam(stmt, 6, row.lastCocommitAt);
-          bindParam(stmt, 7, row.lift);
-          await stmt.run();
-        }
-      } finally {
-        stmt.destroySync();
-      }
-      await c.run("COMMIT");
-    } catch (err) {
-      await c.run("ROLLBACK");
-      throw err;
-    }
-  }
-
-  async lookupCochangesForFile(
-    file: string,
-    opts: CochangeLookupOptions = {},
-  ): Promise {
-    const c = this.requireConn();
-    const limit = Math.max(0, Math.floor(opts.limit ?? DEFAULT_COCHANGE_LOOKUP_LIMIT));
-    const minLift = opts.minLift ?? DEFAULT_COCHANGE_MIN_LIFT;
-    // Rows are keyed by ordered (source_file, target_file) pairs but the
-    // signal is symmetric, so probe both directions. Sort by lift DESC so
-    // the strongest associations surface first; break ties deterministically
-    // on the pair key.
-    const stmt = await c.prepare(
-      `SELECT source_file, target_file, cocommit_count,
-              total_commits_source, total_commits_target,
-              last_cocommit_at, lift
-         FROM cochanges
-        WHERE (source_file = ? OR target_file = ?) AND lift >= ?
-        ORDER BY lift DESC, source_file ASC, target_file ASC
-        LIMIT ?`,
-    );
-    try {
-      stmt.bindVarchar(1, file);
-      stmt.bindVarchar(2, file);
-      stmt.bindDouble(3, minLift);
-      stmt.bindInteger(4, limit);
-      const reader = await stmt.runAndReadAll();
-      const raw = reader.getRowObjects();
-      const out: CochangeRow[] = [];
-      for (const r of raw) {
-        out.push(cochangeRowFromRecord(r as Record));
-      }
-      return out;
-    } finally {
-      stmt.destroySync();
-    }
-  }
-
-  async lookupCochangesBetween(fileA: string, fileB: string): Promise {
-    const c = this.requireConn();
-    const stmt = await c.prepare(
-      `SELECT source_file, target_file, cocommit_count,
-              total_commits_source, total_commits_target,
-              last_cocommit_at, lift
-         FROM cochanges
-        WHERE (source_file = ? AND target_file = ?)
-           OR (source_file = ? AND target_file = ?)
-        LIMIT 1`,
-    );
-    try {
-      stmt.bindVarchar(1, fileA);
-      stmt.bindVarchar(2, fileB);
-      stmt.bindVarchar(3, fileB);
-      stmt.bindVarchar(4, fileA);
-      const reader = await stmt.runAndReadAll();
-      const raw = reader.getRowObjects();
-      const first = raw[0];
-      if (!first) return undefined;
-      return cochangeRowFromRecord(first as Record);
-    } finally {
-      stmt.destroySync();
-    }
-  }
-
-  // --------------------------------------------------------------------------
-  // Symbol summaries
-  // --------------------------------------------------------------------------
-
-  async bulkLoadSymbolSummaries(rows: readonly SymbolSummaryRow[]): Promise {
-    if (rows.length === 0) return;
-    const c = this.requireConn();
-    // Sort by the composite primary key so insertion order is deterministic
-    // across runs.
-    const sorted = [...rows].sort((a, b) => {
-      if (a.nodeId !== b.nodeId) return a.nodeId < b.nodeId ? -1 : 1;
-      if (a.contentHash !== b.contentHash) return a.contentHash < b.contentHash ? -1 : 1;
-      if (a.promptVersion !== b.promptVersion) return a.promptVersion < b.promptVersion ? -1 : 1;
-      return 0;
-    });
-
-    await c.run("BEGIN TRANSACTION");
-    try {
-      // Pre-delete matching composite keys so the INSERT is effectively an
-      // upsert. Using DELETE+INSERT (rather than ON CONFLICT) keeps the
-      // statement small and sidesteps DuckDB issue 8147 when the same key
-      // appears multiple times in a single batch after dedupe.
-      const delStmt = await c.prepare(
-        "DELETE FROM symbol_summaries WHERE node_id = ? AND content_hash = ? AND prompt_version = ?",
-      );
-      try {
-        for (const r of sorted) {
-          delStmt.clearBindings();
-          delStmt.bindVarchar(1, r.nodeId);
-          delStmt.bindVarchar(2, r.contentHash);
-          delStmt.bindVarchar(3, r.promptVersion);
-          await delStmt.run();
-        }
-      } finally {
-        delStmt.destroySync();
-      }
-
-      const insStmt = await c.prepare(
-        `INSERT INTO symbol_summaries (
-           node_id, content_hash, prompt_version, model_id,
-           summary_text, signature_summary, returns_type_summary,
-           structured_json, created_at
-         ) VALUES (?, ?, ?, ?, ?, ?, ?, ?, CAST(? AS TIMESTAMP))`,
-      );
-      try {
-        for (const r of sorted) {
-          insStmt.clearBindings();
-          bindParam(insStmt, 1, r.nodeId);
-          bindParam(insStmt, 2, r.contentHash);
-          bindParam(insStmt, 3, r.promptVersion);
-          bindParam(insStmt, 4, r.modelId);
-          bindParam(insStmt, 5, r.summaryText);
-          bindParam(insStmt, 6, r.signatureSummary ?? null);
-          bindParam(insStmt, 7, r.returnsTypeSummary ?? null);
-          bindParam(insStmt, 8, r.structuredJson ?? null);
-          bindParam(insStmt, 9, r.createdAt);
-          await insStmt.run();
-        }
-      } finally {
-        insStmt.destroySync();
-      }
-      await c.run("COMMIT");
-    } catch (err) {
-      await c.run("ROLLBACK");
-      throw err;
-    }
-  }
-
-  async lookupSymbolSummary(
-    nodeId: string,
-    contentHash: string,
-    promptVersion: string,
-  ): Promise {
-    const c = this.requireConn();
-    const stmt = await c.prepare(
-      `SELECT node_id, content_hash, prompt_version, model_id,
-              summary_text, signature_summary, returns_type_summary,
-              structured_json, created_at
-         FROM symbol_summaries
-        WHERE node_id = ? AND content_hash = ? AND prompt_version = ?
-        LIMIT 1`,
-    );
-    try {
-      stmt.bindVarchar(1, nodeId);
-      stmt.bindVarchar(2, contentHash);
-      stmt.bindVarchar(3, promptVersion);
-      const reader = await stmt.runAndReadAll();
-      const raw = reader.getRowObjects();
-      const first = raw[0];
-      if (!first) return undefined;
-      return summaryRowFromRecord(first as Record);
-    } finally {
-      stmt.destroySync();
-    }
-  }
-
-  async lookupSymbolSummariesByNode(
-    nodeIds: readonly string[],
-  ): Promise {
-    if (nodeIds.length === 0) return [];
-    const c = this.requireConn();
-    const placeholders = nodeIds.map(() => "?").join(",");
-    const stmt = await c.prepare(
-      `SELECT node_id, content_hash, prompt_version, model_id,
-              summary_text, signature_summary, returns_type_summary,
-              structured_json, created_at
-         FROM symbol_summaries
-        WHERE node_id IN (${placeholders})
-        ORDER BY node_id ASC, prompt_version ASC, content_hash ASC`,
-    );
-    try {
-      let idx = 1;
-      for (const id of nodeIds) stmt.bindVarchar(idx++, id);
-      const reader = await stmt.runAndReadAll();
-      const raw = reader.getRowObjects();
-      const out: SymbolSummaryRow[] = [];
-      for (const r of raw) {
-        out.push(summaryRowFromRecord(r as Record));
-      }
-      return out;
-    } finally {
-      stmt.destroySync();
-    }
-  }
-
-  async countSymbolSummaries(): Promise {
-    try {
-      const c = this.requireConn();
-      const stmt = await c.prepare("SELECT COUNT(DISTINCT node_id) AS n FROM symbol_summaries");
-      try {
-        const reader = await stmt.runAndReadAll();
-        const first = reader.getRowObjects()[0] as Record | undefined;
-        const n = first?.["n"];
-        return typeof n === "bigint" ? Number(n) : typeof n === "number" ? n : 0;
-      } finally {
-        stmt.destroySync();
-      }
-    } catch {
-      // Missing table / degraded store → report 0 rather than throwing, so
-      // `codehub status` degrades gracefully.
-      return 0;
-    }
-  }
-
-  // --------------------------------------------------------------------------
-  // exec — read-only SQL escape hatch (codehub query --sql, MCP sql tool)
-  // --------------------------------------------------------------------------
-
-  async exec(
-    sql: string,
-    params: readonly SqlParam[] = [],
-    opts: { readonly timeoutMs?: number } = {},
-  ): Promise[]> {
-    assertReadOnlySql(sql);
-    const c = this.requireConn();
-    const timeoutMs = opts.timeoutMs ?? this.defaultTimeoutMs;
-    return this.withTimeout(timeoutMs, async () => {
-      const stmt = await c.prepare(sql);
-      try {
-        for (let i = 0; i < params.length; i += 1) {
-          bindParam(stmt, i + 1, params[i] ?? null);
-        }
-        const reader = await stmt.runAndReadAll();
-        return normalizeRows(reader.getRowObjects());
-      } finally {
-        stmt.destroySync();
-      }
-    });
-  }
-
-  // --------------------------------------------------------------------------
-  // Embedding-Parquet export — pack/embeddings-sidecar.ts surface
-  //
-  // Embeddings live in `graph.lbug`. The sidecar streams rows out of lbug,
-  // stages them in a per-call DuckDB temp table on `temporal.duckdb`, then
-  // runs `COPY (...) TO '' (FORMAT PARQUET, COMPRESSION ZSTD)` to
-  // produce the byte-identical sidecar. The temp table is connection-local
-  // and dropped before the call returns.
-  // --------------------------------------------------------------------------
-
-  async exportEmbeddingsToParquet(
-    rows: AsyncIterable,
-    absOutPath: string,
-  ): Promise<{ readonly rowCount: number; readonly duckdbVersion: string }> {
-    const c = this.requireConn();
-    const duckdbVersion = await this.fetchDuckdbVersion();
-
-    if (!isSafeAbsolutePath(absOutPath)) {
-      throw new Error(
-        "exportEmbeddingsToParquet: outPath must be a POSIX or Windows absolute " +
-          "path over a safe character class (alphanumerics, slash, backslash, " +
-          "drive colon, underscore, dash, dot, tilde)",
-      );
-    }
-
-    // Pre-staging: create a transient table sized to the largest VECTOR width
-    // we'll see. DuckDB temp tables are connection-scoped — a stale handle
-    // from a prior call would surface as a "table already exists" error, so
-    // drop defensively before recreating.
-    await c.run("DROP TABLE IF EXISTS embeddings_export");
-    await c.run(
-      "CREATE TEMP TABLE embeddings_export (" +
-        "node_id VARCHAR NOT NULL, " +
-        "granularity VARCHAR NOT NULL, " +
-        "chunk_index INTEGER NOT NULL, " +
-        "vector FLOAT[] NOT NULL" +
-        ")",
-    );
-
-    let rowCount = 0;
-    try {
-      const insertStmt = await c.prepare(
-        "INSERT INTO embeddings_export (node_id, granularity, chunk_index, vector) VALUES (?, ?, ?, ?)",
-      );
-      try {
-        for await (const row of rows) {
-          insertStmt.bindVarchar(1, row.nodeId);
-          insertStmt.bindVarchar(2, row.granularity ?? "symbol");
-          insertStmt.bindInteger(3, row.chunkIndex);
-          insertStmt.bindList(4, Array.from(row.vector), LIST(FLOAT));
-          await insertStmt.run();
-          rowCount += 1;
-        }
-      } finally {
-        // No public destroy on prepared statements in the current binding;
-        // they're cleaned up when the connection closes.
-      }
-
-      if (rowCount === 0) {
-        return { rowCount: 0, duckdbVersion };
-      }
-
-      // COPY does not accept bound parameters for the destination. The path
-      // is validated above so single-quote injection is impossible.
-      const sql =
-        `COPY (SELECT node_id, granularity, chunk_index, vector ` +
-        `FROM embeddings_export ORDER BY node_id ASC, granularity ASC, chunk_index ASC) ` +
-        `TO '${absOutPath}' (FORMAT PARQUET, COMPRESSION ZSTD)`;
-      await c.run(sql);
-      return { rowCount, duckdbVersion };
-    } finally {
-      await c.run("DROP TABLE IF EXISTS embeddings_export").catch(() => {});
-    }
-  }
-
-  /**
-   * Resolve the live DuckDB engine version via `SELECT version()`. The
-   * result is the string DuckDB embeds in the parquet `created_by`
-   * metadata, so the pack manifest's `pins.duckdbVersion` stays bound to
-   * the writer version that produced the sidecar.
-   */
-  private async fetchDuckdbVersion(): Promise {
-    const c = this.requireConn();
-    try {
-      const reader = await c.runAndReadAll("SELECT version() AS v");
-      const rows = reader.getRowObjects();
-      const v = rows[0] ? (rows[0] as { v?: unknown }).v : undefined;
-      return typeof v === "string" && v.length > 0 ? v : "unknown";
-    } catch {
-      return "unknown";
-    }
-  }
-
-  // --------------------------------------------------------------------------
-  // healthCheck
-  // --------------------------------------------------------------------------
-
-  async healthCheck(): Promise<{ ok: boolean; message?: string }> {
-    try {
-      const c = this.requireConn();
-      const reader = await c.runAndReadAll("SELECT 1 AS ok");
-      const rows = reader.getRowObjects();
-      const first = rows[0];
-      const ok = first ? Number((first as { ok: unknown }).ok) === 1 : false;
-      return ok ? { ok: true } : { ok: false, message: "SELECT 1 returned unexpected shape" };
-    } catch (err) {
-      return { ok: false, message: (err as Error).message };
-    }
-  }
-
-  // --------------------------------------------------------------------------
-  // Internal helpers
-  // --------------------------------------------------------------------------
-
-  private requireConn(): DuckDBConnection {
-    if (!this.conn) {
-      throw new Error("DuckDbStore is not open — call open() first");
-    }
-    return this.conn;
-  }
-
-  /**
-   * Interrupt the current statement if it exceeds the timeout. DuckDB has no
-   * SQL-level statement timeout, so we schedule a JS timer that calls
-   * `connection.interrupt()` and let the prepared statement throw.
-   */
-  private async withTimeout(ms: number, fn: () => Promise): Promise {
-    if (ms <= 0) return fn();
-    const c = this.requireConn();
-    let interrupted = false;
-    const handle = setTimeout(() => {
-      interrupted = true;
-      try {
-        c.interrupt();
-      } catch {
-        /* ignore — connection may already be done */
-      }
-    }, ms);
-    try {
-      return await fn();
-    } catch (err) {
-      if (interrupted) {
-        throw new Error(`Query exceeded timeout of ${ms}ms`);
-      }
-      throw err;
-    } finally {
-      clearTimeout(handle);
-    }
-  }
-}
-
-// ----------------------------------------------------------------------------
-// Free helpers
-// ----------------------------------------------------------------------------
-
-function bindParam(stmt: DuckDBPreparedStatement, index: number, value: SqlParam | null): void {
-  if (value === null || value === undefined) {
-    stmt.bindNull(index);
-    return;
-  }
-  switch (typeof value) {
-    case "boolean":
-      stmt.bindBoolean(index, value);
-      return;
-    case "number":
-      if (Number.isInteger(value) && value >= -2147483648 && value <= 2147483647) {
-        stmt.bindInteger(index, value);
-      } else {
-        stmt.bindDouble(index, value);
-      }
-      return;
-    case "bigint":
-      stmt.bindBigInt(index, value);
-      return;
-    case "string":
-      stmt.bindVarchar(index, value);
-      return;
-    default:
-      throw new Error(`Unsupported SQL parameter type at index ${index}`);
-  }
-}
-
-/**
- * DuckDB's getRowObjects returns values that are mostly JS primitives, but
- * some column types come back as class instances (e.g. `DuckDBListValue`,
- * `DuckDBArrayValue`) that carry an `items` array. Normalize every row to
- * plain JS values so downstream tests and hashing behave predictably.
- */
-function normalizeRows(rows: readonly unknown[]): readonly Record[] {
-  const out: Record[] = [];
-  for (const r of rows) {
-    const src = r as Record;
-    const cleaned: Record = {};
-    for (const [k, v] of Object.entries(src)) {
-      cleaned[k] = normalizeValue(v);
-    }
-    out.push(cleaned);
-  }
-  return out;
-}
-
-function normalizeValue(v: unknown): unknown {
-  if (v === null || v === undefined) return v;
-  if (Array.isArray(v)) return v.map((x) => normalizeValue(x));
-  if (typeof v === "object") {
-    const obj = v as { items?: unknown };
-    if (Array.isArray(obj.items)) {
-      return obj.items.map((x) => normalizeValue(x));
-    }
-  }
-  return v;
-}
-
-/**
- * Convert a DuckDB row from the `cochanges` table back into a {@link CochangeRow}.
- * The timestamp column arrives as either a DuckDB value object carrying a
- * `micros` BigInt (when returned over the native bindings) or a string; both
- * paths resolve to an ISO-8601 UTC string.
- */
-function cochangeRowFromRecord(row: Record): CochangeRow {
-  const last = row["last_cocommit_at"];
-  let lastCocommitAt: string;
-  if (typeof last === "string") {
-    lastCocommitAt = last;
-  } else if (last && typeof last === "object") {
-    const anyRow = last as { micros?: bigint; toISOString?: () => string };
-    if (typeof anyRow.toISOString === "function") {
-      lastCocommitAt = anyRow.toISOString();
-    } else if (typeof anyRow.micros === "bigint") {
-      lastCocommitAt = new Date(Number(anyRow.micros / 1000n)).toISOString();
-    } else {
-      lastCocommitAt = String(last);
-    }
-  } else {
-    lastCocommitAt = String(last ?? "");
-  }
-  return {
-    sourceFile: String(row["source_file"] ?? ""),
-    targetFile: String(row["target_file"] ?? ""),
-    cocommitCount: Number(row["cocommit_count"] ?? 0),
-    totalCommitsSource: Number(row["total_commits_source"] ?? 0),
-    totalCommitsTarget: Number(row["total_commits_target"] ?? 0),
-    lastCocommitAt,
-    lift: Number(row["lift"] ?? 0),
-  };
-}
-
-/**
- * Convert a DuckDB row from the `symbol_summaries` table back into a
- * {@link SymbolSummaryRow}.
- */
-function summaryRowFromRecord(row: Record): SymbolSummaryRow {
-  const created = row["created_at"];
-  let createdAt: string;
-  if (typeof created === "string") {
-    createdAt = created;
-  } else if (created && typeof created === "object") {
-    const anyRow = created as { micros?: bigint; toISOString?: () => string };
-    if (typeof anyRow.toISOString === "function") {
-      createdAt = anyRow.toISOString();
-    } else if (typeof anyRow.micros === "bigint") {
-      createdAt = new Date(Number(anyRow.micros / 1000n)).toISOString();
-    } else {
-      createdAt = String(created);
-    }
-  } else {
-    createdAt = String(created ?? "");
-  }
-  const sig = row["signature_summary"];
-  const ret = row["returns_type_summary"];
-  const structured = row["structured_json"];
-  return {
-    nodeId: String(row["node_id"] ?? ""),
-    contentHash: String(row["content_hash"] ?? ""),
-    promptVersion: String(row["prompt_version"] ?? ""),
-    modelId: String(row["model_id"] ?? ""),
-    summaryText: String(row["summary_text"] ?? ""),
-    ...(sig !== null && sig !== undefined ? { signatureSummary: String(sig) } : {}),
-    ...(ret !== null && ret !== undefined ? { returnsTypeSummary: String(ret) } : {}),
-    ...(structured !== null && structured !== undefined
-      ? { structuredJson: String(structured) }
-      : {}),
-    createdAt,
-  };
-}
-
-/**
- * Conservative absolute-path validator used by `exportEmbeddingsParquet`
- * to inline a destination path into a `COPY ... TO '' ...` SQL
- * statement. DuckDB's prepared-statement parser does not bind COPY
- * destinations, so the path is concatenated; allow only absolute paths over
- * a safe character class so single-quote injection is structurally
- * impossible.
- *
- * Accepts both POSIX absolute paths (`/repo/.codehub/…`) and Windows absolute
- * paths (`C:\repo\.codehub\…`): a drive-letter prefix and backslash separator
- * are permitted, but the character class still excludes quotes, spaces, and
- * shell/SQL metacharacters, so the injection guarantee holds on every platform.
- */
-function isSafeAbsolutePath(p: string): boolean {
-  if (typeof p !== "string" || p.length === 0) return false;
-  const isPosixAbs = p.startsWith("/");
-  const isWindowsAbs = /^[A-Za-z]:[/\\]/.test(p);
-  if (!isPosixAbs && !isWindowsAbs) return false;
-  // Safe class: alphanumerics, both separators, drive colon, underscore, dash,
-  // dot, and tilde. Tilde is required because Windows temp dirs use 8.3 short
-  // names (e.g. `RUNNER~1`). No quotes/spaces/metacharacters → single-quote
-  // injection into the DuckDB `COPY ... TO ''` remains impossible.
-  return /^[A-Za-z0-9/\\:_\-.~]+$/.test(p);
-}
-
-/**
- * Classify a SPDX-ish license string into one of the five
- * license-tier buckets. Used by graph-side `listDependencies` finders;
- * kept here as a free helper for cross-adapter symmetry.
- */
-export function classifyLicenseTier(
-  license: string | undefined,
-): "permissive" | "weak-copyleft" | "strong-copyleft" | "proprietary" | "unknown" {
-  if (!license || license.trim().length === 0) return "unknown";
-  const lower = license.trim().toLowerCase();
-  // Strong copyleft — GPL/AGPL family.
-  if (/(^|\b|-)agpl(-|$)/i.test(lower) || /(^|\b|-)gpl(-|$)/i.test(lower)) {
-    return "strong-copyleft";
-  }
-  // Weak copyleft — LGPL, MPL, EPL, CDDL, CC-BY-SA.
- if ( - /(^|\b|-)lgpl(-|$)/i.test(lower) || - /(^|\b)mpl(-|$)/i.test(lower) || - /(^|\b)epl(-|$)/i.test(lower) || - /(^|\b)cddl(-|$)/i.test(lower) || - /(^|\b)cc-by-sa(-|$)/i.test(lower) - ) { - return "weak-copyleft"; - } - // Permissive — MIT/Apache/BSD/ISC/0BSD/Unlicense/CC0/Zlib. - if ( - /(^|\b)mit(\b|-|$)/.test(lower) || - /(^|\b)apache(-|$)/i.test(lower) || - /(^|\b)bsd(-|$)/i.test(lower) || - /(^|\b)isc(\b|-|$)/.test(lower) || - /(^|\b)0bsd(\b|$)/.test(lower) || - /(^|\b)unlicense(\b|$)/.test(lower) || - /(^|\b)cc0(\b|-|$)/.test(lower) || - /(^|\b)zlib(\b|$)/.test(lower) - ) { - return "permissive"; - } - // Proprietary markers. - if (/(^|\b)(proprietary|commercial|see license)(\b|$)/i.test(lower)) { - return "proprietary"; - } - return "unknown"; -} diff --git a/packages/storage/src/graph-hash-parity.test.ts b/packages/storage/src/graph-hash-parity.test.ts deleted file mode 100644 index e40aec1c..00000000 --- a/packages/storage/src/graph-hash-parity.test.ts +++ /dev/null @@ -1,142 +0,0 @@ -/** - * Cross-adapter `graphHash` parity for empty-array node fields. - * - * `column-encode.ts:stringArrayOrNull` and `core-types/graph-hash.ts` - * promise that a node written with an explicit empty `keywords: []` / - * `responseKeys: []` round-trips byte-identically under {@link graphHash} - * — and stays DISTINCT from the same node with the field absent. The - * canonical-JSON projection emits `{"keywords":[]}` for the former and no - * key at all for the latter, so their SHA-256 graph hashes differ. - * - * The graph tier persists empty `STRING[]` columns through lbug, which - * collapses a 0-length array to SQL NULL on write. The graph-db adapter - * works around that with an empty-array marker on the write side and the - * symmetric decode on read (`encodeNodeCol` + `setStringArrayFieldGd` in - * `graphdb-adapter.ts`). 
This test pins both halves of the contract: - * - * (a) `graphHash(rebuild(store)) === graphHash(fixture)` for a fixture - * whose nodes carry `keywords: []` / `responseKeys: []` — the - * round-trip must NOT drop the empty arrays. Runs through the public - * {@link assertGraphParity} harness so a community `IGraphStore` - * fork inherits the same enforcement. - * (b) the empty-array fixture hashes DIFFERENTLY from the otherwise - * identical fixture with the array fields absent. - * - * The native-binding-dependent half mirrors `graphdb-roundtrip.test.ts`: - * skipped cleanly when `@ladybugdb/core` cannot load (e.g. an unsupported - * platform in CI). Half (b) is pure JS and always runs. - */ - -import assert from "node:assert/strict"; -import { mkdtemp } from "node:fs/promises"; -import { tmpdir } from "node:os"; -import { join } from "node:path"; -import { test } from "node:test"; -import { - type GraphNode, - graphHash, - KnowledgeGraph, - makeNodeId, - type NodeId, -} from "@opencodehub/core-types"; -import { GraphDbStore } from "./graphdb-adapter.js"; -import { assertGraphParity } from "./test-utils/parity-harness.js"; - -async function scratchDbPath(): Promise<string> { - const dir = await mkdtemp(join(tmpdir(), "och-graphhash-parity-")); - return join(dir, "graph.db"); -} - -async function hasNativeBinding(): Promise<boolean> { - try { - await import("@ladybugdb/core"); - return true; - } catch { - return false; - } -} - -const COMMUNITY_ID = makeNodeId("Community", "", "empty-keywords"); -const ROUTE_ID = makeNodeId("Route", "src/api.ts", "GET /things"); -const FILE_ID = makeNodeId("File", "src/api.ts", "api.ts"); - -/** - * Fixture whose Community node carries `keywords: []` and whose Route node - * carries `responseKeys: []` — both explicit empty arrays. A File node is - * included so the empty-array columns coexist with ordinary rows.
- */ -function buildEmptyArrayGraph(): KnowledgeGraph { - const g = new KnowledgeGraph(); - g.addNode({ id: FILE_ID, kind: "File", name: "api.ts", filePath: "src/api.ts" }); - g.addNode({ - id: COMMUNITY_ID, - kind: "Community", - name: "empty-keywords", - filePath: "", - keywords: [], - } as unknown as GraphNode); - g.addNode({ - id: ROUTE_ID, - kind: "Route", - name: "GET /things", - filePath: "src/api.ts", - url: "/things", - method: "GET", - responseKeys: [], - } as unknown as GraphNode); - g.addEdge({ from: FILE_ID, to: ROUTE_ID as NodeId, type: "DEFINES", confidence: 1.0 }); - return g; -} - -/** - * Same shape as {@link buildEmptyArrayGraph} but with the `keywords` / - * `responseKeys` fields absent. Used to prove `[]` is distinct from absent. - */ -function buildAbsentArrayGraph(): KnowledgeGraph { - const g = new KnowledgeGraph(); - g.addNode({ id: FILE_ID, kind: "File", name: "api.ts", filePath: "src/api.ts" }); - g.addNode({ - id: COMMUNITY_ID, - kind: "Community", - name: "empty-keywords", - filePath: "", - } as unknown as GraphNode); - g.addNode({ - id: ROUTE_ID, - kind: "Route", - name: "GET /things", - filePath: "src/api.ts", - url: "/things", - method: "GET", - } as unknown as GraphNode); - g.addEdge({ from: FILE_ID, to: ROUTE_ID as NodeId, type: "DEFINES", confidence: 1.0 }); - return g; -} - -test("graphHash distinguishes empty-array fields from absent ones", () => { - const withEmpty = graphHash(buildEmptyArrayGraph()); - const withAbsent = graphHash(buildAbsentArrayGraph()); - assert.notEqual( - withEmpty, - withAbsent, - "graphHash({keywords: []}) must differ from graphHash with the field absent", - ); -}); - -test("graph-db round-trip preserves empty keywords / responseKeys byte-identically", async () => { - if (!(await hasNativeBinding())) { - assert.ok(true, "native binding unavailable — skipping round-trip"); - return; - } - const store = new GraphDbStore(await scratchDbPath()); - await store.open(); - try { - await store.createSchema(); - 
await assertGraphParity(buildEmptyArrayGraph(), { - stores: [store], - label: "empty-array-fields", - }); - } finally { - await store.close(); - } -}); diff --git a/packages/storage/src/graphdb-adapter.test.ts b/packages/storage/src/graphdb-adapter.test.ts deleted file mode 100644 index b2236589..00000000 --- a/packages/storage/src/graphdb-adapter.test.ts +++ /dev/null @@ -1,1209 +0,0 @@ -import assert from "node:assert/strict"; -import { mkdtemp } from "node:fs/promises"; -import { tmpdir } from "node:os"; -import { join } from "node:path"; -import { test } from "node:test"; -import { type GraphNode, KnowledgeGraph, makeNodeId, type NodeId } from "@opencodehub/core-types"; -import { assertReadOnlyCypher } from "./cypher-guard.js"; -import { - GRAPH_BINDING_SUPPORTED_PLATFORMS, - GraphDbBindingError, - GraphDbStore, - graphBindingPlatformNote, - NotImplementedError, -} from "./graphdb-adapter.js"; -import { openStore } from "./index.js"; -import { assertIGraphStoreConformance } from "./test-utils/conformance.js"; - -async function scratchDbPath(): Promise<string> { - // Per-test temp directory that holds a uniquely-named database file. - // The native binding insists on a concrete file path rather than a - // directory; we wrap the file in its own dir so parallel tests never - // collide on the same file. - const dir = await mkdtemp(join(tmpdir(), "och-graphdb-")); - return join(dir, "graph.db"); -} - -async function hasNativeBinding(): Promise<boolean> { - // Dynamic import probe: the native binding either loads cleanly or the - // platform-specific `.node` file is missing. Any dlopen failure propagates - // through the import and we return false so the caller can skip the - // integration test. - try { - await import("@ladybugdb/core"); - return true; - } catch { - return false; - } -} - -/** - * Returns true when the lbug VECTOR extension can be loaded on this host.
- * The extension requires mmap'ing large buffers for the HNSW index; on some - * Linux devboxes (overcommit disabled or cgroup memory limits) the mmap - * fails with "Buffer manager exception: Mmap for size N failed". Tests that - * call `upsertEmbeddings` or `vectorSearch` skip when this probe returns false. - */ -async function hasVectorSupport(): Promise<boolean> { - if (!(await hasNativeBinding())) return false; - const { tmpdir } = await import("node:os"); - const { join } = await import("node:path"); - const { mkdtemp } = await import("node:fs/promises"); - const dir = await mkdtemp(join(tmpdir(), "och-vec-probe-")); - const store = new GraphDbStore(join(dir, "probe.lbug"), { embeddingDim: 4 }); - try { - await store.open(); - await store.createSchema(); - // Probe by inserting one tiny embedding — triggers INSTALL+LOAD VECTOR - // internally. If the mmap for the HNSW index fails, the error propagates - // here and we return false so vector-dependent tests skip cleanly. - const fnId = makeNodeId("Function", "src/p.ts", "probe"); - const g = new KnowledgeGraph(); - g.addNode({ id: fnId, kind: "Function", name: "probe", filePath: "src/p.ts" }); - await store.bulkLoad(g); - await store.upsertEmbeddings([ - { - nodeId: fnId, - granularity: "symbol", - chunkIndex: 0, - vector: new Float32Array([1, 0, 0, 0]), - contentHash: "probe", - }, - ]); - return true; - } catch { - return false; - } finally { - await store.close().catch(() => {}); - } -} - -let _vectorSupportCached: boolean | undefined; -async function cachedVectorSupport(): Promise<boolean> { - if (_vectorSupportCached === undefined) _vectorSupportCached = await hasVectorSupport(); - return _vectorSupportCached; -} - -// --------------------------------------------------------------------------- -// Constructor + getters -// --------------------------------------------------------------------------- - -test("GraphDbStore stores constructor path and defaults", () => { - const s = new GraphDbStore("/tmp/graph.db"); -
- assert.equal(s.getPath(), "/tmp/graph.db"); - assert.equal(s.isReadOnly(), false); - assert.equal(s.getEmbeddingDim(), 768); - assert.equal(s.getDefaultTimeoutMs(), 5_000); -}); - -test("GraphDbStore honours option overrides", () => { - const s = new GraphDbStore("/tmp/graph.db", { - readOnly: true, - embeddingDim: 1024, - timeoutMs: 7_500, - }); - assert.equal(s.isReadOnly(), true); - assert.equal(s.getEmbeddingDim(), 1024); - assert.equal(s.getDefaultTimeoutMs(), 7_500); -}); - -// --------------------------------------------------------------------------- -// Surface separation: cochange + symbol-summary methods live on ITemporalStore -// --------------------------------------------------------------------------- - -test("GraphDbStore no longer exposes cochange or symbol-summary methods", () => { - // The temporal surface (cochanges + symbol summaries) lives exclusively - // on `ITemporalStore`; `GraphDbStore` is graph-only and does not even - // declare these names. The runtime check guards against accidental - // re-introduction of the merged shape. - const s = new GraphDbStore("/tmp/graph.db"); - const removed: readonly string[] = [ - "bulkLoadCochanges", - "lookupCochangesForFile", - "lookupCochangesBetween", - "bulkLoadSymbolSummaries", - "lookupSymbolSummary", - "lookupSymbolSummariesByNode", - ]; - for (const name of removed) { - assert.equal( - typeof (s as unknown as Record<string, unknown>)[name], - "undefined", - `GraphDbStore must not expose ${name}`, - ); - } - // NotImplementedError is still exported for adapter-internal use even - // though the cochange / summary stubs that originally threw it are gone.
- assert.equal(typeof NotImplementedError, "function"); -}); - -test("query before open rejects with a clear error", async () => { - const s = new GraphDbStore("/tmp/graph.db"); - await assert.rejects(() => s.query("RETURN 1"), /before open/); -}); - -test("createSchema before open rejects with a clear error", async () => { - const s = new GraphDbStore("/tmp/graph.db"); - await assert.rejects(() => s.createSchema(), /before open/); -}); - -test("bulkLoad before open rejects with a clear error", async () => { - const s = new GraphDbStore("/tmp/graph.db"); - // `{} as never` is a deliberate cast — we're exercising the pre-open - // guard, not the bulkLoad argument shape. - await assert.rejects(() => s.bulkLoad({} as never), /before open/); -}); - -test("healthCheck reports pool-not-open without throwing", async () => { - const s = new GraphDbStore("/tmp/graph.db"); - const result = await s.healthCheck(); - assert.equal(result.ok, false); - assert.match(String(result.message), /not open/); -}); - -test("close is a tolerant no-op before open", async () => { - const s = new GraphDbStore("/tmp/graph.db"); - await s.close(); - await s.close(); -}); - -test("open surfaces GraphDbBindingError when native binding absent", async () => { - // On platforms where the native binary is missing (e.g. container runs - // that pruned the platform-specific optional dep), `open()` must surface - // a typed `GraphDbBindingError` rather than a bare module-not-found - // error. On platforms that ship the binary, `open()` succeeds — we close - // it afterwards so this suite remains portable across both modes. - const s = new GraphDbStore("/tmp/graph-open-probe.db"); - try { - await s.open(); - // Binary available — confirm the pool is actually live, then clean up. 
- assert.equal(s.isReadOnly(), false); - await s.close(); - } catch (err) { - assert.ok( - err instanceof GraphDbBindingError, - `expected GraphDbBindingError, got ${(err as Error).name}: ${(err as Error).message}`, - ); - } -}); - -// --------------------------------------------------------------------------- -// Shared platform-support helper (consumed by `codehub doctor` too, so the -// runtime error message and the diagnostic hint never drift). -// --------------------------------------------------------------------------- - -test("graphBindingPlatformNote names the win32-arm64 gap", () => { - const note = graphBindingPlatformNote("win32", "arm64"); - assert.match(note, /win32-arm64/); - assert.match(note, /not currently supported/); -}); - -test("graphBindingPlatformNote names the musl/Alpine gap on linux", () => { - const note = graphBindingPlatformNote("linux", "x64"); - assert.match(note, /musl/); - assert.match(note, /glibc/); -}); - -test("graphBindingPlatformNote returns empty on supported platforms", () => { - assert.equal(graphBindingPlatformNote("darwin", "arm64"), ""); - assert.equal(graphBindingPlatformNote("darwin", "x64"), ""); - assert.equal(graphBindingPlatformNote("win32", "x64"), ""); -}); - -test("GraphDbBindingError message embeds the shared support matrix", () => { - const err = new GraphDbBindingError(new Error("dlopen failed")); - assert.match( - err.message, - new RegExp(GRAPH_BINDING_SUPPORTED_PLATFORMS.replace(/[.*+?^${}()|[\]\\]/g, "\\$&")), - ); - assert.match(err.message, /dlopen failed/); -}); - -// --------------------------------------------------------------------------- -// Factory -// --------------------------------------------------------------------------- - -test("openStore composes GraphDbStore + DuckDbStore pair", async () => { - // The graph file is canonicalized to `graph.lbug` and the temporal file - // is its sibling `temporal.duckdb` inside the same directory. 
Build the - // input + expectations with `join` so the assertion uses the platform's - // own separator — a hardcoded forward-slash literal diverges from the - // impl's `join(dirname(path), …)` output on Windows (backslashes). - const metaDir = join(tmpdir(), "och-test", ".codehub"); - const store = await openStore({ path: join(metaDir, "graph.lbug") }); - assert.equal(store.graph.constructor.name, "GraphDbStore"); - assert.equal(store.temporal.constructor.name, "DuckDbStore"); - assert.equal(store.graphFile, join(metaDir, "graph.lbug")); - assert.equal(store.temporalFile, join(metaDir, "temporal.duckdb")); - assert.equal(typeof store.close, "function"); -}); - -// --------------------------------------------------------------------------- -// Integration: createSchema + bulkLoad -// --------------------------------------------------------------------------- -// -// These tests require the native binding. On platforms without the prebuilt -// `.node` the suite gracefully skips; every one of the code paths still gets -// exercised by the unit tests above plus the round-trip suite. - -test("createSchema runs the full DDL against a fresh store", async () => { - if (!(await hasNativeBinding())) { - assert.ok(true, "native binding unavailable — skipping integration test"); - return; - } - const store = new GraphDbStore(await scratchDbPath()); - await store.open(); - try { - await store.createSchema(); - // A follow-up query against CodeNode must succeed — if the DDL - // silently fell over on some kinds this SELECT would throw. - const rows = await store.query("MATCH (n:CodeNode) RETURN count(n) AS c"); - assert.equal(Number((rows[0] as { c?: unknown })?.c ?? 
-1), 0); - } finally { - await store.close(); - } -}); - -test("bulkLoad replace mode inserts nodes and edges by kind", async () => { - if (!(await hasNativeBinding())) { - assert.ok(true, "native binding unavailable — skipping integration test"); - return; - } - const store = new GraphDbStore(await scratchDbPath()); - await store.open(); - try { - await store.createSchema(); - const g = new KnowledgeGraph(); - const fileA = makeNodeId("File", "src/a.ts", "a.ts"); - const fileB = makeNodeId("File", "src/b.ts", "b.ts"); - const fnX = makeNodeId("Function", "src/a.ts", "x"); - g.addNode({ id: fileA, kind: "File", name: "a.ts", filePath: "src/a.ts" }); - g.addNode({ id: fileB, kind: "File", name: "b.ts", filePath: "src/b.ts" }); - g.addNode({ - id: fnX, - kind: "Function", - name: "x", - filePath: "src/a.ts", - signature: "function x()", - parameterCount: 0, - isExported: true, - }); - g.addEdge({ from: fileA, to: fnX, type: "DEFINES", confidence: 1.0 }); - g.addEdge({ from: fileA, to: fileB, type: "IMPORTS", confidence: 0.9 }); - - const stats = await store.bulkLoad(g); - assert.equal(stats.nodeCount, g.nodeCount()); - assert.equal(stats.edgeCount, g.edgeCount()); - - const nCountRow = await store.query("MATCH (n:CodeNode) RETURN count(n) AS c"); - const eDefRow = await store.query("MATCH ()-[r:DEFINES]->() RETURN count(r) AS c"); - const eImpRow = await store.query("MATCH ()-[r:IMPORTS]->() RETURN count(r) AS c"); - assert.equal(Number((nCountRow[0] as { c?: unknown })?.c ?? 0), 3); - assert.equal(Number((eDefRow[0] as { c?: unknown })?.c ?? 0), 1); - assert.equal(Number((eImpRow[0] as { c?: unknown })?.c ?? 
0), 1); - } finally { - await store.close(); - } -}); - -test("bulkLoad replace mode truncates prior rows on second call", async () => { - if (!(await hasNativeBinding())) { - assert.ok(true, "native binding unavailable — skipping integration test"); - return; - } - const store = new GraphDbStore(await scratchDbPath()); - await store.open(); - try { - await store.createSchema(); - - const g1 = new KnowledgeGraph(); - const a = makeNodeId("File", "src/a.ts", "a.ts"); - const b = makeNodeId("File", "src/b.ts", "b.ts"); - g1.addNode({ id: a, kind: "File", name: "a.ts", filePath: "src/a.ts" }); - g1.addNode({ id: b, kind: "File", name: "b.ts", filePath: "src/b.ts" }); - g1.addEdge({ from: a, to: b, type: "IMPORTS", confidence: 1.0 }); - await store.bulkLoad(g1); - - const g2 = new KnowledgeGraph(); - const c = makeNodeId("File", "src/c.ts", "c.ts"); - g2.addNode({ id: c, kind: "File", name: "c.ts", filePath: "src/c.ts" }); - await store.bulkLoad(g2, { mode: "replace" }); - - const rows = await store.query("MATCH (n:CodeNode) RETURN n.id AS id ORDER BY n.id"); - const ids = rows.map((r) => String((r as { id?: unknown }).id)); - assert.deepEqual(ids, [c]); - - // Every relation table should also be empty after a replace. - const eRow = await store.query("MATCH ()-[r:IMPORTS]->() RETURN count(r) AS c"); - assert.equal(Number((eRow[0] as { c?: unknown })?.c ?? -1), 0); - } finally { - await store.close(); - } -}); - -// Regression: lbug 0.16.1 crashes the process with SIGBUS/SIGSEGV when -// `MATCH (n:CodeNode) DELETE n` runs while the `och_fts` full-text index is -// live on CodeNode (confirmed by a 6/6 controlled A/B repro against a real -// ~11k-node on-disk graph: fix off → crash every run, fix on → clean every -// run). `truncateAll` now drops the search indexes before deleting and the -// post-insert `ensureFtsIndex` rebuilds them. 
-// -// The native crash only manifests at scale on an on-disk index whose pages -// are mmap'd (a synthetic two-node store keeps the index small enough that -// the fault doesn't fire), so this unit test asserts the *observable -// contract* the fix must uphold rather than relying on a flaky native crash: -// after a replace-truncate, the prior row is gone from the FTS index and the -// index resolves the freshly-inserted row. If `truncateAll` deleted nodes -// without dropping+rebuilding the index, the stale term would still resolve -// (or `bulkLoad` would crash). Both assertions fail loudly under a -// regression; neither depends on the host's memory pressure. -test("bulkLoad replace mode drops and rebuilds the FTS index across a truncate", async () => { - if (!(await hasNativeBinding())) { - assert.ok(true, "native binding unavailable — skipping integration test"); - return; - } - const store = new GraphDbStore(await scratchDbPath()); - await store.open(); - try { - await store.createSchema(); - - // First load + a search to force the `och_fts` index to be built. - const g1 = new KnowledgeGraph(); - const first = makeNodeId("Function", "src/first.ts", "parseUserProfile"); - g1.addNode({ - id: first, - kind: "Function", - name: "parseUserProfile", - filePath: "src/first.ts", - signature: "function parseUserProfile()", - }); - await store.bulkLoad(g1); - const before = await store.search({ text: "parseUserProfile", limit: 5 }); - assert.ok(before.length >= 1, "FTS index should resolve the first load"); - - // Replace-truncate with the FTS index live. Pre-fix on a large on-disk - // graph this SIGBUSes; here it must drop the index, truncate, reinsert, - // and rebuild the index against the new row. 
- const g2 = new KnowledgeGraph(); - const second = makeNodeId("Function", "src/second.ts", "renderMarkdownView"); - g2.addNode({ - id: second, - kind: "Function", - name: "renderMarkdownView", - filePath: "src/second.ts", - signature: "function renderMarkdownView()", - }); - await store.bulkLoad(g2, { mode: "replace" }); - - // The old row is gone and the index was rebuilt against the new row: - // searching the stale term returns nothing, the fresh term resolves. - const stale = await store.search({ text: "parseUserProfile", limit: 5 }); - assert.equal(stale.length, 0, "truncated row must not survive in the FTS index"); - const fresh = await store.search({ text: "renderMarkdownView", limit: 5 }); - assert.ok(fresh.length >= 1, "FTS index must be rebuilt after the replace-truncate"); - assert.equal(fresh[0]?.nodeId, second); - } finally { - await store.close(); - } -}); - -test("bulkLoad upsert mode preserves rows not present in the incoming graph", async () => { - if (!(await hasNativeBinding())) { - assert.ok(true, "native binding unavailable — skipping integration test"); - return; - } - const store = new GraphDbStore(await scratchDbPath()); - await store.open(); - try { - await store.createSchema(); - - const g1 = new KnowledgeGraph(); - const a = makeNodeId("File", "src/a.ts", "a.ts"); - const b = makeNodeId("File", "src/b.ts", "b.ts"); - g1.addNode({ id: a, kind: "File", name: "a.ts", filePath: "src/a.ts" }); - g1.addNode({ id: b, kind: "File", name: "b.ts", filePath: "src/b.ts" }); - await store.bulkLoad(g1); - - // Upsert a single file with a refreshed field; `b` must survive. 
- const g2 = new KnowledgeGraph(); - g2.addNode({ - id: a, - kind: "File", - name: "a.ts", - filePath: "src/a.ts", - contentHash: "fresh", - }); - await store.bulkLoad(g2, { mode: "upsert" }); - - const rows = await store.query( - "MATCH (n:CodeNode) RETURN n.id AS id, n.content_hash AS hash ORDER BY n.id", - ); - const rowRecs = rows.map((r) => r as { id?: unknown; hash?: unknown }); - assert.equal(rowRecs.length, 2); - const aRow = rowRecs.find((r) => r.id === a); - const bRow = rowRecs.find((r) => r.id === b); - assert.ok(aRow && bRow, "both rows should survive the upsert"); - assert.equal(aRow?.hash, "fresh"); - } finally { - await store.close(); - } -}); - -test("bulkLoad cycles through every declared edge kind without fault", async () => { - if (!(await hasNativeBinding())) { - assert.ok(true, "native binding unavailable — skipping integration test"); - return; - } - const { getAllRelationTypes } = await import("./graphdb-schema.js"); - const store = new GraphDbStore(await scratchDbPath()); - await store.open(); - try { - await store.createSchema(); - const g = new KnowledgeGraph(); - // Build one node per kind we need and one edge per declared relation. 
- const nodes: NodeId[] = []; - const relationTypes = getAllRelationTypes(); - for (let i = 0; i < relationTypes.length + 1; i += 1) { - const id = makeNodeId("Function", `src/f${i}.ts`, `fn${i}`); - nodes.push(id); - g.addNode({ id, kind: "Function", name: `fn${i}`, filePath: `src/f${i}.ts` }); - } - for (let i = 0; i < relationTypes.length; i += 1) { - const fromId = nodes[i]; - const toId = nodes[i + 1]; - if (!fromId || !toId) throw new Error("unreachable"); - g.addEdge({ - from: fromId, - to: toId, - type: relationTypes[i] as "CALLS", - confidence: 0.5 + i * 0.01, - reason: `reason-${i}`, - step: i, - }); - } - await store.bulkLoad(g); - - for (const kind of relationTypes) { - const row = await store.query(`MATCH ()-[r:${kind}]->() RETURN count(r) AS c`); - const count = Number((row[0] as { c?: unknown })?.c ?? -1); - assert.equal(count, 1, `kind ${kind} should have exactly one edge`); - } - } finally { - await store.close(); - } -}); - -// --------------------------------------------------------------------------- -// Cypher write-guard -// --------------------------------------------------------------------------- - -test("assertReadOnlyCypher accepts plain MATCH ... 
RETURN", () => { - assertReadOnlyCypher("MATCH (n:CodeNode) RETURN n.id LIMIT 10"); - assertReadOnlyCypher("WITH 1 AS x RETURN x"); - assertReadOnlyCypher("RETURN 1 AS one"); -}); - -test("assertReadOnlyCypher rejects every write verb the native binding accepts", () => { - const writes = [ - "CREATE (n:CodeNode {id: '1'})", - "MERGE (n:CodeNode {id: '1'}) ON CREATE SET n.name = 'x'", - "MATCH (n:CodeNode) DELETE n", - "MATCH (n:CodeNode {id: '1'}) SET n.name = 'x'", - "MATCH (n:CodeNode {id: '1'}) REMOVE n.name", - "DROP TABLE CodeNode", - "COPY CodeNode FROM 'file.csv'", - "INSTALL FTS", - "LOAD EXTENSION FTS", - ]; - for (const stmt of writes) { - assert.throws( - () => assertReadOnlyCypher(stmt), - /Banned keyword|Leading keyword not allowed|LOAD EXTENSION|CALL procedure|CALL requires/, - ); - } -}); - -test("assertReadOnlyCypher tolerates write keywords inside line comments", () => { - assertReadOnlyCypher("// CREATE is mentioned here but not executed\nRETURN 1 AS one"); - assertReadOnlyCypher("/* MERGE */ RETURN 1 AS one"); -}); - -test("assertReadOnlyCypher rejects empty / non-string statements", () => { - assert.throws(() => assertReadOnlyCypher(""), /non-empty|must contain/); - // `as never` to sidestep the type guard — we care about the runtime - // behaviour, which must fail cleanly rather than crash. 
- assert.throws(() => assertReadOnlyCypher(null as unknown as string), /non-empty|must contain/); -}); - -// --------------------------------------------------------------------------- -// Integration: query / search / vectorSearch / traverse / setMeta / getMeta -// --------------------------------------------------------------------------- - -test("query rejects writes but passes reads through to the pool", async () => { - if (!(await hasNativeBinding())) { - assert.ok(true, "native binding unavailable — skipping"); - return; - } - const store = new GraphDbStore(await scratchDbPath()); - await store.open(); - try { - await store.createSchema(); - // Reads succeed. - const rows = await store.query("MATCH (n:CodeNode) RETURN count(n) AS c"); - assert.equal(Number((rows[0] as { c?: unknown })?.c ?? -1), 0); - // Writes are rejected up front — the pool never sees them. - await assert.rejects( - () => store.query("CREATE (n:CodeNode {id: 'x'})"), - /Banned keyword|Leading keyword not allowed/, - ); - } finally { - await store.close(); - } -}); - -test("traverse (down) reaches transitive children within depth bound", async () => { - if (!(await hasNativeBinding())) { - assert.ok(true, "native binding unavailable — skipping"); - return; - } - const store = new GraphDbStore(await scratchDbPath()); - await store.open(); - try { - await store.createSchema(); - const g = new KnowledgeGraph(); - const a = makeNodeId("Function", "x.ts", "A"); - const b = makeNodeId("Function", "x.ts", "B"); - const c = makeNodeId("Function", "x.ts", "C"); - const d = makeNodeId("Function", "x.ts", "D"); - for (const [id, name] of [ - [a, "A"], - [b, "B"], - [c, "C"], - [d, "D"], - ] as const) { - g.addNode({ id, kind: "Function", name, filePath: "x.ts" }); - } - g.addEdge({ from: a, to: b, type: "CALLS", confidence: 1.0 }); - g.addEdge({ from: b, to: c, type: "CALLS", confidence: 1.0 }); - g.addEdge({ from: c, to: d, type: "CALLS", confidence: 1.0 }); - await store.bulkLoad(g); - - const 
downDepth2 = await store.traverse({ - startId: a, - direction: "down", - maxDepth: 2, - relationTypes: ["CALLS"], - }); - const reachedIds = new Set(downDepth2.map((r) => r.nodeId)); - assert.ok(reachedIds.has(b), "B should be reached at depth 1"); - assert.ok(reachedIds.has(c), "C should be reached at depth 2"); - assert.ok(!reachedIds.has(d), "D must be pruned by depth bound"); - - const upFromD = await store.traverse({ - startId: d, - direction: "up", - maxDepth: 3, - relationTypes: ["CALLS"], - }); - const upIds = new Set(upFromD.map((r) => r.nodeId)); - assert.ok(upIds.has(c) && upIds.has(b) && upIds.has(a), "up traversal reaches A"); - } finally { - await store.close(); - } -}); - -test("traverse respects minConfidence filter", async () => { - if (!(await hasNativeBinding())) { - assert.ok(true, "native binding unavailable — skipping"); - return; - } - const store = new GraphDbStore(await scratchDbPath()); - await store.open(); - try { - await store.createSchema(); - const g = new KnowledgeGraph(); - const a = makeNodeId("Function", "x.ts", "A"); - const b = makeNodeId("Function", "x.ts", "B"); - const c = makeNodeId("Function", "x.ts", "C"); - g.addNode({ id: a, kind: "Function", name: "A", filePath: "x.ts" }); - g.addNode({ id: b, kind: "Function", name: "B", filePath: "x.ts" }); - g.addNode({ id: c, kind: "Function", name: "C", filePath: "x.ts" }); - g.addEdge({ from: a, to: b, type: "CALLS", confidence: 0.3 }); - g.addEdge({ from: a, to: c, type: "CALLS", confidence: 0.9 }); - await store.bulkLoad(g); - - const hits = await store.traverse({ - startId: a, - direction: "down", - maxDepth: 1, - relationTypes: ["CALLS"], - minConfidence: 0.5, - }); - const ids = new Set(hits.map((r) => r.nodeId)); - assert.ok(ids.has(c), "confident edge survives"); - assert.ok(!ids.has(b), "low-confidence edge is pruned"); - } finally { - await store.close(); - } -}); - -test("search: BM25 index finds a distinct symbol name", async () => { - if (!(await hasNativeBinding())) { 
- assert.ok(true, "native binding unavailable — skipping"); - return; - } - const store = new GraphDbStore(await scratchDbPath()); - await store.open(); - try { - await store.createSchema(); - const g = new KnowledgeGraph(); - const ids: NodeId[] = [ - makeNodeId("Function", "src/user.ts", "parseUserProfile"), - makeNodeId("Function", "src/view.ts", "renderMarkdownView"), - ]; - g.addNode({ - id: ids[0] as NodeId, - kind: "Function", - name: "parseUserProfile", - filePath: "src/user.ts", - signature: "function parseUserProfile()", - }); - g.addNode({ - id: ids[1] as NodeId, - kind: "Function", - name: "renderMarkdownView", - filePath: "src/view.ts", - signature: "function renderMarkdownView()", - }); - await store.bulkLoad(g); - - const results = await store.search({ text: "parseUserProfile", limit: 5 }); - assert.ok(results.length >= 1, "search should return at least one row"); - const top = results[0]; - assert.ok(top); - assert.equal(top.nodeId, ids[0]); - assert.ok(top.score > 0, "BM25 score should be positive"); - } finally { - await store.close(); - } -}); - -// A real vectorSearch integration test lives below alongside -// upsertEmbeddings — the vector query path needs at least one embedding -// row to return non-empty results. - -test("vectorSearch rejects vectors with the wrong dimension", async () => { - const store = new GraphDbStore("/tmp/graph-vec-dim.db", { embeddingDim: 4 }); - // No open() — the dimension check runs before we reach the pool so the - // test does not need a live native binding. 
- await assert.rejects( - () => store.vectorSearch({ vector: new Float32Array([1, 0]) }), - /dimension mismatch/, - ); -}); - -test("setMeta → getMeta round-trips the full shape", async () => { - if (!(await hasNativeBinding())) { - assert.ok(true, "native binding unavailable — skipping"); - return; - } - const store = new GraphDbStore(await scratchDbPath()); - await store.open(); - try { - await store.createSchema(); - const meta = { - schemaVersion: "1.2", - lastCommit: "abc123", - indexedAt: "2026-05-05T00:00:00Z", - nodeCount: 100, - edgeCount: 250, - stats: { files: 10, functions: 90 }, - cacheHitRatio: 0.75, - cacheSizeBytes: 1024, - lastCompaction: "2026-05-04T12:00:00Z", - }; - await store.setMeta(meta); - const read = await store.getMeta(); - assert.deepEqual(read, meta); - } finally { - await store.close(); - } -}); - -test("getMeta returns undefined on a fresh store", async () => { - if (!(await hasNativeBinding())) { - assert.ok(true, "native binding unavailable — skipping"); - return; - } - const store = new GraphDbStore(await scratchDbPath()); - await store.open(); - try { - await store.createSchema(); - const read = await store.getMeta(); - assert.equal(read, undefined); - } finally { - await store.close(); - } -}); - -test("healthCheck returns ok once the pool is open", async () => { - if (!(await hasNativeBinding())) { - assert.ok(true, "native binding unavailable — skipping"); - return; - } - const store = new GraphDbStore(await scratchDbPath()); - await store.open(); - try { - const result = await store.healthCheck(); - assert.equal(result.ok, true); - } finally { - await store.close(); - } -}); - -// --------------------------------------------------------------------------- -// Integration: upsertEmbeddings + listEmbeddingHashes -// --------------------------------------------------------------------------- - -test("upsertEmbeddings dimension mismatch throws without touching the store", async () => { - if (!(await hasNativeBinding())) { - 
assert.ok(true, "native binding unavailable — skipping"); - return; - } - const store = new GraphDbStore(await scratchDbPath(), { embeddingDim: 4 }); - await store.open(); - try { - await store.createSchema(); - await assert.rejects( - () => - store.upsertEmbeddings([ - { - nodeId: "x" as NodeId, - chunkIndex: 0, - vector: new Float32Array([1, 0]), - contentHash: "h", - }, - ]), - /dimension mismatch/, - ); - } finally { - await store.close(); - } -}); - -test("listEmbeddingHashes is empty on a fresh store", async () => { - if (!(await hasNativeBinding())) { - assert.ok(true, "native binding unavailable — skipping"); - return; - } - const store = new GraphDbStore(await scratchDbPath(), { embeddingDim: 4 }); - await store.open(); - try { - await store.createSchema(); - const hashes = await store.listEmbeddingHashes(); - assert.ok(hashes instanceof Map, "returns a Map instance"); - assert.equal(hashes.size, 0); - } finally { - await store.close(); - } -}); - -test("upsertEmbeddings writes one row per (granularity, node_id, chunk_index)", async () => { - if (!(await cachedVectorSupport())) { - assert.ok(true, "vector extension unavailable on this host (mmap or binding) — skipping"); - return; - } - const store = new GraphDbStore(await scratchDbPath(), { embeddingDim: 4 }); - await store.open(); - try { - await store.createSchema(); - const g = new KnowledgeGraph(); - const fnId = makeNodeId("Function", "src/a.ts", "a"); - const fileId = makeNodeId("File", "src/a.ts", "src/a.ts"); - g.addNode({ id: fnId, kind: "Function", name: "a", filePath: "src/a.ts" }); - g.addNode({ id: fileId, kind: "File", name: "a.ts", filePath: "src/a.ts" }); - await store.bulkLoad(g); - - await store.upsertEmbeddings([ - { - nodeId: fnId, - granularity: "symbol", - chunkIndex: 0, - vector: new Float32Array([1, 0, 0, 0]), - contentHash: "h-sym-0", - }, - { - nodeId: fnId, - granularity: "symbol", - chunkIndex: 1, - vector: new Float32Array([1, 0, 0, 0]), - contentHash: "h-sym-1", - }, - { - 
nodeId: fileId, - granularity: "file", - chunkIndex: 0, - vector: new Float32Array([0.9, 0.1, 0, 0]), - contentHash: "h-file", - }, - ]); - - const hashes = await store.listEmbeddingHashes(); - assert.equal(hashes.size, 3); - assert.equal(hashes.get(`symbol\0${fnId}\0${0}`), "h-sym-0"); - assert.equal(hashes.get(`symbol\0${fnId}\0${1}`), "h-sym-1"); - assert.equal(hashes.get(`file\0${fileId}\0${0}`), "h-file"); - } finally { - await store.close(); - } -}); - -test("upsertEmbeddings overwrites rows with matching composite key", async () => { - if (!(await cachedVectorSupport())) { - assert.ok(true, "vector extension unavailable on this host (mmap or binding) — skipping"); - return; - } - const store = new GraphDbStore(await scratchDbPath(), { embeddingDim: 4 }); - await store.open(); - try { - await store.createSchema(); - const g = new KnowledgeGraph(); - const fnId = makeNodeId("Function", "src/a.ts", "a"); - g.addNode({ id: fnId, kind: "Function", name: "a", filePath: "src/a.ts" }); - await store.bulkLoad(g); - - await store.upsertEmbeddings([ - { - nodeId: fnId, - granularity: "symbol", - chunkIndex: 0, - vector: new Float32Array([1, 0, 0, 0]), - contentHash: "original", - }, - ]); - let hashes = await store.listEmbeddingHashes(); - assert.equal(hashes.get(`symbol\0${fnId}\0${0}`), "original"); - - await store.upsertEmbeddings([ - { - nodeId: fnId, - granularity: "symbol", - chunkIndex: 0, - vector: new Float32Array([0, 1, 0, 0]), - contentHash: "updated", - }, - ]); - hashes = await store.listEmbeddingHashes(); - assert.equal(hashes.size, 1, "upsert replaces the row, not duplicated"); - assert.equal(hashes.get(`symbol\0${fnId}\0${0}`), "updated"); - } finally { - await store.close(); - } -}); - -test("vectorSearch returns nearest row after upsertEmbeddings", async () => { - if (!(await cachedVectorSupport())) { - assert.ok(true, "vector extension unavailable on this host (mmap or binding) — skipping"); - return; - } - const store = new GraphDbStore(await 
scratchDbPath(), { embeddingDim: 4 }); - await store.open(); - try { - await store.createSchema(); - const g = new KnowledgeGraph(); - const ids: NodeId[] = []; - const vectors = [ - [1.0, 0.0, 0.0, 0.0], - [0.9, 0.1, 0.0, 0.0], - [0.0, 1.0, 0.0, 0.0], - ]; - for (let i = 0; i < vectors.length; i += 1) { - const id = makeNodeId("File", `src/f${i}.ts`, `f${i}`); - ids.push(id); - g.addNode({ id, kind: "File", name: `f${i}`, filePath: `src/f${i}.ts` }); - } - await store.bulkLoad(g); - await store.upsertEmbeddings( - ids.map((id, i) => ({ - nodeId: id, - chunkIndex: 0, - vector: new Float32Array(vectors[i] ?? []), - contentHash: `h${i}`, - })), - ); - const hits = await store.vectorSearch({ - vector: new Float32Array([1.0, 0.0, 0.0, 0.0]), - limit: 2, - }); - assert.equal(hits.length, 2); - // Nearest first — identical vector wins. - assert.equal(hits[0]?.nodeId, ids[0]); - assert.ok( - (hits[0]?.distance ?? Number.POSITIVE_INFINITY) <= - (hits[1]?.distance ?? Number.POSITIVE_INFINITY), - ); - } finally { - await store.close(); - } -}); - -// --------------------------------------------------------------------------- -// listNodes — kind filter, determinism, limit/offset, cross-adapter parity -// --------------------------------------------------------------------------- - -/** - * Build the same heterogenous fixture as the DuckStore tests so both - * adapters can be compared apples-to-apples. Covers File / Function / - * Class / Method / Dependency (wider columns) / Operation (column - * aliasing) / Repo (M6 nullable fields + languageStats). 
- */ -function buildListNodesFixture(): KnowledgeGraph { - const g = new KnowledgeGraph(); - const fileA = makeNodeId("File", "src/a.ts", "a.ts"); - const fileB = makeNodeId("File", "src/b.ts", "b.ts"); - g.addNode({ id: fileA, kind: "File", name: "a.ts", filePath: "src/a.ts" }); - g.addNode({ id: fileB, kind: "File", name: "b.ts", filePath: "src/b.ts" }); - - for (let i = 0; i < 3; i += 1) { - const id = makeNodeId("Function", "src/a.ts", `fn_${i}`, { parameterCount: i }); - g.addNode({ - id, - kind: "Function", - name: `fn_${i}`, - filePath: "src/a.ts", - startLine: 10 + i, - endLine: 20 + i, - signature: `function fn_${i}()`, - parameterCount: i, - isExported: i === 0, - }); - } - - const cls = makeNodeId("Class", "src/b.ts", "Service"); - g.addNode({ - id: cls, - kind: "Class", - name: "Service", - filePath: "src/b.ts", - isExported: true, - startLine: 1, - endLine: 30, - }); - g.addNode({ - id: makeNodeId("Method", "src/b.ts", "Service.greet"), - kind: "Method", - name: "greet", - filePath: "src/b.ts", - startLine: 5, - endLine: 9, - parameterCount: 1, - }); - - g.addNode({ - id: makeNodeId("Dependency", "package.json", "lodash@4.17.21"), - kind: "Dependency", - name: "lodash", - filePath: "package.json", - version: "4.17.21", - ecosystem: "npm", - lockfileSource: "pnpm-lock.yaml", - license: "MIT", - }); - g.addNode({ - id: makeNodeId("Dependency", "requirements.txt", "requests@2.31.0"), - kind: "Dependency", - name: "requests", - filePath: "requirements.txt", - version: "2.31.0", - ecosystem: "pypi", - lockfileSource: "requirements.txt", - }); - - g.addNode({ - id: makeNodeId("Operation", "openapi.yaml", "GET /v1/users"), - kind: "Operation", - name: "listUsers", - filePath: "openapi.yaml", - method: "GET", - path: "/v1/users", - operationId: "listUsers", - }); - - g.addNode({ - id: makeNodeId("Repo", "", "repo"), - kind: "Repo", - name: "test-repo", - filePath: ".", - originUrl: "https://github.com/example/test-repo", - repoUri: 
"github.com/example/test-repo", - defaultBranch: "main", - commitSha: "0123456789abcdef0123456789abcdef01234567", - indexTime: "2026-05-07T00:00:00Z", - group: null, - visibility: "public", - indexer: "och-test/0.1.0", - languageStats: { ts: 0.7, py: 0.3 }, - }); - - return g; -} - -test("listNodes() returns every kind when no filter is supplied (graph-db)", async () => { - if (!(await hasNativeBinding())) { - assert.ok(true, "native binding unavailable — skipping"); - return; - } - const store = new GraphDbStore(await scratchDbPath()); - await store.open(); - try { - await store.createSchema(); - const g = buildListNodesFixture(); - await store.bulkLoad(g); - - const all = await store.listNodes(); - assert.equal(all.length, g.nodeCount()); - const byKind = new Map(); - for (const n of all) byKind.set(n.kind, (byKind.get(n.kind) ?? 0) + 1); - assert.equal(byKind.get("Dependency"), 2); - assert.equal(byKind.get("Function"), 3); - assert.equal(byKind.get("Repo"), 1); - } finally { - await store.close(); - } -}); - -test("listNodes() filters by kind and surfaces wider Dependency columns (graph-db)", async () => { - if (!(await hasNativeBinding())) { - assert.ok(true, "native binding unavailable — skipping"); - return; - } - const store = new GraphDbStore(await scratchDbPath()); - await store.open(); - try { - await store.createSchema(); - await store.bulkLoad(buildListNodesFixture()); - - const deps = await store.listNodes({ kinds: ["Dependency"] }); - assert.equal(deps.length, 2); - for (const dep of deps) { - assert.equal(dep.kind, "Dependency"); - const d = dep as GraphNode & { - version: string; - ecosystem: string; - lockfileSource: string; - }; - assert.equal(typeof d.version, "string"); - assert.equal(typeof d.ecosystem, "string"); - assert.equal(typeof d.lockfileSource, "string"); - } - } finally { - await store.close(); - } -}); - -test("listNodes() empty kinds returns [] without hitting the engine (graph-db)", async () => { - // Pure JS short-circuit — runs 
even without the native binding. - const store = new GraphDbStore("/tmp/listnodes-empty.db"); - // No open() — the empty-kinds branch should return before the pool guard. - const result = await store.listNodes({ kinds: [] }); - assert.deepEqual(result, []); -}); - -test("listNodes() ORDER BY id ASC is deterministic across two writes (graph-db)", async () => { - if (!(await hasNativeBinding())) { - assert.ok(true, "native binding unavailable — skipping"); - return; - } - const g = buildListNodesFixture(); - const storeA = new GraphDbStore(await scratchDbPath()); - await storeA.open(); - await storeA.createSchema(); - await storeA.bulkLoad(g); - const idsA = (await storeA.listNodes()).map((n) => n.id); - await storeA.close(); - - const storeB = new GraphDbStore(await scratchDbPath()); - await storeB.open(); - await storeB.createSchema(); - await storeB.bulkLoad(g); - const idsB = (await storeB.listNodes()).map((n) => n.id); - await storeB.close(); - - assert.deepEqual(idsA, idsB); - const sorted = [...idsA].sort(); - assert.deepEqual(idsA, sorted); -}); - -test("listNodes() applies limit + offset on the sorted result (graph-db)", async () => { - if (!(await hasNativeBinding())) { - assert.ok(true, "native binding unavailable — skipping"); - return; - } - const store = new GraphDbStore(await scratchDbPath()); - await store.open(); - try { - await store.createSchema(); - await store.bulkLoad(buildListNodesFixture()); - - const all = await store.listNodes(); - assert.ok(all.length >= 4, "fixture should have at least 4 nodes"); - - const firstPage = await store.listNodes({ limit: 2 }); - const secondPage = await store.listNodes({ limit: 2, offset: 2 }); - assert.equal(firstPage.length, 2); - assert.equal(secondPage.length, 2); - assert.deepEqual( - firstPage.map((n) => n.id), - all.slice(0, 2).map((n) => n.id), - ); - assert.deepEqual( - secondPage.map((n) => n.id), - all.slice(2, 4).map((n) => n.id), - ); - } finally { - await store.close(); - } -}); - 
-test("listNodes() rehydrates Operation method/path symmetrically (graph-db)", async () => { - if (!(await hasNativeBinding())) { - assert.ok(true, "native binding unavailable — skipping"); - return; - } - const store = new GraphDbStore(await scratchDbPath()); - await store.open(); - try { - await store.createSchema(); - await store.bulkLoad(buildListNodesFixture()); - const ops = await store.listNodes({ kinds: ["Operation"] }); - assert.equal(ops.length, 1); - const op = ops[0] as GraphNode & { method: string; path: string }; - assert.equal(op.method, "GET"); - assert.equal(op.path, "/v1/users"); - } finally { - await store.close(); - } -}); - -// --------------------------------------------------------------------------- -// v1.0 community-adapter conformance suite -// -// GraphDb is graph-only; it MUST satisfy every block of the shared v1.0 -// conformance contract. Binding probe is performed once at module load -// time so the entire suite is skipped cleanly on platforms where the -// `@ladybugdb/core` native binary is absent — matching the existing -// integration-test skip pattern in this file. -// --------------------------------------------------------------------------- - -if (await hasNativeBinding()) { - assertIGraphStoreConformance("GraphDb", async () => { - const store = new GraphDbStore(await scratchDbPath()); - await store.open(); - await store.createSchema(); - return store; - }); -} else { - test("[conformance:GraphDb] skipped — @ladybugdb/core native binding unavailable", () => { - assert.ok(true, "native binding unavailable; conformance suite skipped"); - }); -} diff --git a/packages/storage/src/graphdb-adapter.ts b/packages/storage/src/graphdb-adapter.ts deleted file mode 100644 index 1903fe83..00000000 --- a/packages/storage/src/graphdb-adapter.ts +++ /dev/null @@ -1,2435 +0,0 @@ -/** - * Graph-database backend for {@link IGraphStore} (phase-2 implementation). - * - * This adapter is the second implementation behind the `IGraphStore` seam. 
- * DuckDbStore remains the default; this file ships the full lifecycle + - * bulk-load surface so the lbug graph backend already drives a - * round-trip-clean graph write. - * - * Design notes: - * 1. Rel tables are polymorphic per edge kind — one named rel table per - * relation type, each with multiple `FROM/TO` pairs. The DDL lives in - * {@link graphdb-schema.ts}; this file never emits DDL inline. - * 2. Source-level naming avoids the banned clean-room literals. The class - * is {@link GraphDbStore}; files are `graphdb-*.ts`. The native binding - * package `@ladybugdb/core` is a dep, not a source-level identifier. - * 3. Every mutating path uses parameterized Cypher via the pool — no - * string-concatenated values ever touch the connection. - * - * Lifecycle mirrors {@link DuckDbStore}: open → createSchema → bulkLoad → - * query / search / vectorSearch / traverse → close. - */ - -import type { - CodeRelation, - DependencyNode, - FindingNode, - GraphNode, - KnowledgeGraph, - NodeId, - NodeKind, - NodeOfKind, - RelationType, - RepoNode, - RouteNode, -} from "@opencodehub/core-types"; -import { dedupeLastById, NODE_COLUMNS, nodeToColumns } from "./column-encode.js"; -import { assertReadOnlyCypher } from "./cypher-guard.js"; -import { classifyLicenseTier } from "./duckdb-adapter.js"; -import { GraphDbPool, type GraphDbPoolConfig } from "./graphdb-pool.js"; -import { generateSchemaDdl, getAllRelationTypes } from "./graphdb-schema.js"; -import type { - AncestorTraversalOptions, - BulkLoadOptions, - BulkLoadStats, - ConsumerProducerEdge, - DescendantTraversalOptions, - EmbeddingRow, - GraphDialect, - IGraphStore, - ListDependenciesOptions, - ListEdgesByTypeOptions, - ListEdgesOptions, - ListEmbeddingsOptions, - ListFindingsOptions, - ListNodesByKindOptions, - ListNodesByNameOptions, - ListNodesOptions, - ListRoutesOptions, - SearchQuery, - SearchResult, - SqlParam, - StoreMeta, - TraverseQuery, - TraverseResult, - VectorQuery, - VectorResult, -} from 
"./interface.js"; - -export interface GraphDbStoreOptions { - readonly readOnly?: boolean; - /** Fixed vector dimension for the embeddings rel table. Default 768. */ - readonly embeddingDim?: number; - /** Default query timeout for `query()` calls in ms. Default 5000. */ - readonly timeoutMs?: number; - /** - * Overrides for the underlying connection pool. Tests inject a fake - * `binding` to avoid the native dep; production callers rely on - * defaults. - */ - readonly poolConfig?: GraphDbPoolConfig; -} - -const DEFAULT_EMBEDDING_DIM = 768; -const DEFAULT_TIMEOUT_MS = 5_000; - -/** - * Thrown by adapter surfaces that are not yet wired. The cochange + symbol - * summary surfaces live on {@link ITemporalStore}, never on the graph - * adapter. The class export is retained because downstream packages still - * import it for typed fallback handling on graph-only failure modes. - */ -export class NotImplementedError extends Error { - constructor(method: string) { - super(`graph-db: ${method} not yet wired`); - this.name = "NotImplementedError"; - } -} - -/** - * Single source of truth for the user-facing summary of the `@ladybugdb/core` - * platform-support matrix. Shared by {@link GraphDbBindingError} (the runtime - * abort message) and `codehub doctor`'s graph-binding check (the diagnostic - * hint) so the two never drift. `@ladybugdb/core` ships prebuilt binaries only - * for darwin-x64, darwin-arm64, linux-x64 (glibc), linux-arm64 (glibc), and - * win32-x64. - */ -export const GRAPH_BINDING_SUPPORTED_PLATFORMS = - "Supported platforms: macOS x64/arm64, Linux x64/arm64 (glibc), Windows x64."; - -/** - * Platform-specific guidance for a missing `@ladybugdb/core` prebuilt. The - * graph tier is mandatory (no fallback), so on an UNSUPPORTED platform — - * notably win32-arm64 and any musl-libc Linux (Alpine) — there is no prebuilt - * to load and OpenCodeHub cannot run. Naming those cases explicitly makes the - * failure diagnosable rather than a bare module-load error. 
- * - * Returns an empty string on a supported platform (no extra note to add). - * `platform`/`arch` default to the running process so callers can pass - * `process.platform` / `process.arch` implicitly; tests inject fixtures. - */ -export function graphBindingPlatformNote( - platform: NodeJS.Platform = process.platform, - arch: string = process.arch, -): string { - if (platform === "win32" && arch === "arm64") { - return " Windows on ARM64 (win32-arm64) has no @ladybugdb/core prebuilt and is not currently supported."; - } - if (platform === "linux") { - return " On Alpine / musl-libc Linux there is no @ladybugdb/core prebuilt; use a glibc-based image (e.g. debian/ubuntu, node:* not node:*-alpine)."; - } - return ""; -} - -/** - * Missing peer-binding error. Surfaced when the native `@ladybugdb/core` - * module is not available on the current platform (no prebuilt binary, or - * the package was pruned by a `--production` install). - */ -export class GraphDbBindingError extends Error { - constructor(cause: unknown) { - const detail = cause instanceof Error ? cause.message : String(cause); - super( - "@ladybugdb/core native binding unavailable on this platform. " + - "OpenCodeHub requires the lbug graph backend (it has no fallback). " + - GRAPH_BINDING_SUPPORTED_PLATFORMS + - graphBindingPlatformNote() + - ` Underlying cause: ${detail}`, - ); - this.name = "GraphDbBindingError"; - } -} - -// --------------------------------------------------------------------------- -// Column layouts — `NODE_COLUMNS` lives in `./column-encode.ts` and is the -// canonical column ordering shared with the DuckDB adapter. Adding a column -// means: (1) extend the schema DDL in `graphdb-schema.ts` AND -// `schema-ddl.ts`, (2) append it to `NODE_COLUMNS` in `column-encode.ts`, -// (3) append the writer slot in `nodeToColumns` in `column-encode.ts`, -// (4) append the reader in `ROUND_TRIP_COLUMN_MAP` below + the readback -// path. 
Order matters because both directions are index-aligned with the -// prepared statement parameter list. -// --------------------------------------------------------------------------- - -// Edge columns are encoded inline in edgeToCsvLine() — no separate constant needed. - -/** - * Column layout for the `Embedding` node table. Matches graphdb-schema.ts. - * `vector` is a FLOAT[dim] fixed-size array column; everything else is - * bound as a plain scalar. - */ -const EMBEDDING_COLUMNS: readonly string[] = [ - "id", - "node_id", - "granularity", - "chunk_index", - "start_line", - "end_line", - "vector", - "content_hash", -]; - -/** - * Column → node-field descriptors used by the round-trip readback path. - * `rebuildGraphFromStore` walks this list so the returned graph carries - * the same field set the bulk writer ingested. - */ -export const ROUND_TRIP_COLUMN_MAP: readonly (readonly [ - string, - string, - "string" | "number" | "boolean" | "string[]", -])[] = [ - ["start_line", "startLine", "number"], - ["end_line", "endLine", "number"], - ["is_exported", "isExported", "boolean"], - ["signature", "signature", "string"], - ["parameter_count", "parameterCount", "number"], - ["return_type", "returnType", "string"], - ["declared_type", "declaredType", "string"], - ["owner", "owner", "string"], - ["content_hash", "contentHash", "string"], -]; - -// --------------------------------------------------------------------------- -// Transient bulk-load retry -// --------------------------------------------------------------------------- - -/** Resolve after `ms` milliseconds. Used for bulk-load retry backoff. */ -function delay(ms: number): Promise { - return new Promise((res) => { - setTimeout(res, ms); - }); -} - -/** - * True for the transient lbug WAL→checkpoint rename failure that surfaces - * under load — e.g. `IO exception: Error renaming file .wal to - * .wal.checkpoint. ErrorMessage: No such file or directory`. 
The data is - * already in the WAL (a reopen recovers it), so this specific failure is - * safe to retry. Matched on the stable token trio (renaming + .wal + - * checkpoint) rather than the OS-specific errno suffix, which varies by - * platform. Every other error returns false and rethrows. - */ -export function isTransientCheckpointError(err: unknown): boolean { - const msg = err instanceof Error ? err.message : String(err); - return /renaming/i.test(msg) && /\.wal\b/i.test(msg) && /checkpoint/i.test(msg); -} - -/** - * Run `fn`, retrying up to `maxAttempts` times when it throws a transient - * WAL→checkpoint rename error (see {@link isTransientCheckpointError}). Any - * other error rethrows immediately; the transient error on the final attempt - * also rethrows. Backoff scales with attempt (25ms, 50ms, …) to let the OS - * settle the WAL file. Extracted as a pure helper so the retry policy is unit- - * testable without provoking a native race. Used only by replace-mode-safe - * bulk-load, which is idempotent (truncate-then-insert). - */ -export async function retryTransientCheckpoint( - fn: () => Promise, - maxAttempts = 3, - backoff: (attempt: number) => Promise = (attempt) => delay(attempt * 25), -): Promise { - for (let attempt = 1; ; attempt++) { - try { - return await fn(); - } catch (err) { - if (attempt >= maxAttempts || !isTransientCheckpointError(err)) throw err; - await backoff(attempt); - } - } -} - -// --------------------------------------------------------------------------- -// COPY FROM (subquery) bulk insert -// --------------------------------------------------------------------------- -// -// lbug v0.16.1 infers struct-field types per-row from JS values: integer- -// valued numbers (Number.isInteger) → INT64, others → DOUBLE. A -// `confidence=1.0` edge binds as INT64 and round-trips as garbage from a -// DOUBLE column. -// -// The fix: COPY FROM (UNWIND $rows AS r RETURN ...) 
where numeric -// columns use CAST(r.col AS ) and the raw values are passed as -// strings. The COPY FROM path resolves types from the pre-defined table -// schema, CAST converts the string to the correct type, and per-row -// inference never runs on numerics. -// -// No temp files required — rows travel as a prepared-statement parameter. -// --------------------------------------------------------------------------- - -/** - * Column DDL type tags used to decide how to encode each value in the - * UNWIND row object and whether to wrap its RETURN expression in CAST. - * Must stay in the same order as NODE_COLUMNS. - */ -type ColKind = "str" | "int" | "double" | "bool" | "strarray"; - -const NODE_COL_KINDS: readonly ColKind[] = (() => { - const map: Record = { - start_line: "int", - end_line: "int", - parameter_count: "int", - step_count: "int", - level: "int", - symbol_count: "int", - cyclomatic_complexity: "int", - nesting_depth: "int", - nloc: "int", - truck_factor: "int", - is_exported: "bool", - is_orphan: "bool", - cohesion: "double", - coverage_percent: "double", - halstead_volume: "double", - ownership_drift_30d: "double", - ownership_drift_90d: "double", - ownership_drift_365d: "double", - keywords: "strarray", - response_keys: "strarray", - }; - return NODE_COLUMNS.map((col) => map[col] ?? "str"); -})(); - -/** - * Build the `COPY CodeNode FROM (UNWIND $rows AS r RETURN ...)` statement. - * Numeric columns (INT32, DOUBLE) use CAST(r.col AS ) so that string- - * encoded values (e.g. "1", "0.9") reach the correct column type regardless - * of how the JS binding inferred the struct-field type. Non-numeric columns - * project directly as `r.col`. 
- */ -function buildNodeCopySubquery(): string { - const returnCols = NODE_COLUMNS.map((col, i) => { - switch (NODE_COL_KINDS[i]) { - case "int": - return `CAST(r.${col} AS INT32)`; - case "double": - return `CAST(r.${col} AS DOUBLE)`; - default: - return `r.${col}`; - } - }).join(", "); - // WITH r WHERE filters out the type-seeding sentinel row (see NODE_SENTINEL_ID). - return ( - `COPY CodeNode FROM (UNWIND $rows AS r ` + - `WITH r WHERE r.id <> '${NODE_SENTINEL_ID}' ` + - `RETURN ${returnCols})` - ); -} - -/** - * Sentinel id prepended to every UNWIND batch. Row 0 of `$rows` seeds the - * struct-field type for every column. When row 0 has a null field, the binder - * infers ANY for that field's type and fails with "Trying to create a vector - * with ANY type". The sentinel carries a concrete non-null value for every - * column, and the `WITH r WHERE r.id <> NODE_SENTINEL_ID` clause in the - * COPY subquery filters it out before any row lands in storage. - */ -const NODE_SENTINEL_ID = "__OCH_SENTINEL__"; -const EDGE_SENTINEL_ID = "__OCH_EDGE_SENTINEL__"; - -/** - * Marker element used to encode an explicit empty `STRING[]` value. - * - * An empty `keywords: []` must stay distinct from an absent column on - * read-back — otherwise `graphHash` byte-identity against the canonical-JSON - * projection breaks (`{}` vs `{"keywords":[]}` are distinct). The marker - * carries that distinction independent of how lbug stores a 0-length array: - * - lbug v0.16.1 collapsed `[]` to SQL NULL on write, so an absent field - * read back as `null`. - * - lbug v0.17.0+ (PR #471) round-trips `[]` as a typed empty `STRING[]`, - * so an absent field reads back as a bare `[]`. - * In both cases we write a single-element array carrying this marker for a - * genuinely-empty field; the read side ({@link setStringArrayFieldGd}) - * recognises it and reconstructs `[]`, while a truly absent field is written - * as `[]` and decoded as absent (bare `[]` or `null` → key omitted). 
- * - * The marker is pure ASCII with no leading whitespace because lbug rewrites - * a leading space to a NUL byte; a plain token round-trips byte-for-byte. - */ -const EMPTY_STRING_ARRAY_MARKER = "__OCH_EMPTY_STRING_ARRAY__"; - -/** Pre-built node copy statement (constant — column list never changes). */ -const NODE_COPY_SUBQUERY = buildNodeCopySubquery(); - -function buildEdgeCopySubquery(kind: string): string { - // IGNORE_ERRORS=true skips rows where the FROM/TO node lookup fails — this - // handles the type-seeding sentinel row whose from/to point to a - // non-existent CodeNode. Real rows should always have valid endpoints; if - // they don't, the edge is silently dropped (same behaviour as the old - // per-row path which would throw on MATCH failure). - return ( - // Row struct fields use `src`/`dst` instead of `from`/`to` because FROM - // and TO are Cypher keywords — using them as `r.from`/`r.to` is - // ambiguous and silently causes the COPY to misinterpret columns. - // The COPY column list `(id, confidence, reason, step)` specifies which rel - // properties to populate. The RETURN uses `r.eid` (not `r.id`) for the - // edge id to avoid lbug misinterpreting it as a CodeNode PK lookup: lbug - // treats a RETURN column named `id` as a node-PK reference when it matches - // the referenced node table's primary key column name. Using a different - // alias breaks the false match while the positional column list maps it to - // the rel's `id STRING` property. - `COPY ${kind}(id, confidence, reason, step) FROM ` + - `(UNWIND $rows AS r WITH r WHERE r.eid <> '${EDGE_SENTINEL_ID}' ` + - `RETURN r.src, r.dst, r.eid, CAST(r.confidence AS DOUBLE), r.reason, CAST(r.step AS INT32))` - ); -} - -/** - * Encode a GraphNode column value for the UNWIND parameter object. - * Numeric columns are encoded as strings so lbug's binder does not infer - * INT64 for integer-valued numbers; CAST in the RETURN expression then - * converts to the correct DDL type. 
All other values pass through as-is. - */ -function encodeNodeCol(v: unknown, kind: ColKind): unknown { - // strarray must never be null — lbug infers LIST(ANY) from a null field and - // fails with "Trying to create a vector with ANY type". Check before the - // null short-circuit so absent arrays become [] rather than null. - if (kind === "strarray") { - // Absent / non-array → [] (the reader treats a bare [] — whether stored - // as NULL by lbug 0.16 or as a typed empty STRING[] by lbug 0.17+ — as - // the "absent" canonical-JSON shape and drops the field). - if (!Array.isArray(v)) return [] as string[]; - const items = (v as unknown[]).filter((x) => typeof x === "string") as string[]; - // Explicit empty array → single-element marker so the "[] vs absent" - // distinction survives lbug's empty-array → NULL collapse. The reader - // ({@link setStringArrayFieldGd}) maps the marker back to []. A - // non-empty array passes through verbatim. - if (items.length === 0) return [EMPTY_STRING_ARRAY_MARKER]; - return items; - } - if (v === null || v === undefined) return null; - switch (kind) { - case "int": - return typeof v === "number" && Number.isFinite(v) ? String(Math.trunc(v)) : null; - case "double": - return typeof v === "number" && Number.isFinite(v) ? String(v) : null; - case "bool": - return typeof v === "boolean" ? v : null; - default: - return typeof v === "string" ? v : String(v); - } -} - -/** - * Build the type-seeding sentinel row for a node batch. Every column gets a - * concrete non-null value matching its DDL type so lbug's binder can resolve - * every struct field at prepare time. The WITH/WHERE clause in the COPY - * subquery filters it out before any storage write. - * - * STRING[] columns get a single-element seed (`["__sentinel__"]`) — lbug's - * struct-field inference looks at the FIRST row's array contents to fix - * the LIST element type. 
An empty-array sentinel forces LIST(ANY) and the - * binder later throws "Trying to create a vector with ANY type" when a - * data row supplies a string. The seed value never reaches storage; the - * sentinel row is filtered before COPY writes. - */ -function buildNodeSentinel(): Record { - const sentinel: Record = {}; - for (let i = 0; i < NODE_COLUMNS.length; i++) { - const col = NODE_COLUMNS[i] as string; - switch (NODE_COL_KINDS[i]) { - case "int": - sentinel[col] = "0"; - break; - case "double": - sentinel[col] = "0.0"; - break; - case "bool": - sentinel[col] = false; - break; - case "strarray": - sentinel[col] = ["__sentinel__"] as string[]; - break; - default: - sentinel[col] = ""; - break; - } - } - sentinel["id"] = NODE_SENTINEL_ID; - return sentinel; -} - -/** Pre-built node sentinel row (constant — same shape for every batch). */ -const NODE_SENTINEL_ROW = buildNodeSentinel(); - -/** - * Walk the edge set and synthesize a stub `Repository` node for every - * `from`/`to` id that doesn't appear in the node set. The pipeline's - * fetches phase (and any future cross-repo edge that points at a - * not-yet-resolved target) emits ids like `fetches:unresolved:GET:/users/1` - * that intentionally have no corresponding node — the URL template lives - * on the edge's `reason` for downstream lookup. lbug's COPY rejects - * those edges because the to-node primary key is missing; DuckDB used to - * accept them silently. Synthesize a minimal placeholder so the bulk - * load completes, with the original id preserved for round-trip. - */ -/** Distinct from/to ids referenced by an edge batch. 
*/ -function edgeEndpointIds( - edges: readonly { readonly from: NodeId; readonly to: NodeId }[], -): readonly string[] { - const ids = new Set(); - for (const e of edges) { - ids.add(e.from as string); - ids.add(e.to as string); - } - return [...ids]; -} - -/** - * Synthesize a placeholder CodeNode for every edge endpoint id that has no - * real node, so lbug's COPY (which requires a real PK for each rel endpoint) - * succeeds. - * - * `alreadyPersisted` (upsert mode only) lists endpoint ids that already exist - * in the store. Those must NOT be synthesized: a placeholder would later be - * `mergeNodes`-ed (DETACH DELETE + re-insert), clobbering the real node. In - * replace mode the store was just truncated, so pass `undefined`. - */ -function synthesizePlaceholderNodes( - nodes: readonly GraphNode[], - edges: readonly { readonly from: NodeId; readonly to: NodeId }[], - alreadyPersisted?: ReadonlySet, -): GraphNode[] { - const known = new Set(); - for (const n of nodes) known.add(n.id as string); - const missing = new Set(); - for (const e of edges) { - if (!known.has(e.from as string) && !(alreadyPersisted?.has(e.from as string) ?? false)) { - missing.add(e.from as string); - } - if (!known.has(e.to as string) && !(alreadyPersisted?.has(e.to as string) ?? false)) { - missing.add(e.to as string); - } - } - if (missing.size === 0) return []; - const out: GraphNode[] = []; - for (const id of missing) { - // Route is the right kind for unresolved-fetch placeholders (the - // edge that referenced it was a FETCHES targeting an HTTP endpoint). - // For other orphan id shapes the kind is still cosmetic — the only - // load-bearing requirement is that the COPY finds a primary key. - out.push({ - id: id as NodeId, - kind: "Route", - name: id, - filePath: "", - url: id, - } as GraphNode); - } - return out; -} - -/** - * Sentinel row for edge batches. Typed seed for every EDGE column so - * lbug's binder resolves struct fields even when the real batch has nulls. 
- */ -const EDGE_SENTINEL_ROW: Record = { - eid: EDGE_SENTINEL_ID, - src: "", - dst: "", - confidence: "0.0", - reason: null, - step: null, -}; - -async function bulkInsertNodes(pool: GraphDbPool, nodes: readonly GraphNode[]): Promise { - if (nodes.length === 0) return; - const rows: Record[] = [NODE_SENTINEL_ROW]; - for (const node of nodes) { - const cols = nodeToColumns(node); - const row: Record = {}; - for (let i = 0; i < NODE_COLUMNS.length; i++) { - const col = NODE_COLUMNS[i] as string; - const kind = NODE_COL_KINDS[i] as ColKind; - row[col] = encodeNodeCol(cols[col], kind); - } - rows.push(row); - } - await pool.execWrite(NODE_COPY_SUBQUERY, { rows }); -} - -async function mergeNodes(pool: GraphDbPool, nodes: readonly GraphNode[]): Promise { - if (nodes.length === 0) return; - for (const n of nodes) { - await pool.query(`MATCH (n:CodeNode {id: $p1}) DETACH DELETE n`, [n.id]); - } - await bulkInsertNodes(pool, nodes); -} - -async function bulkInsertEdges( - pool: GraphDbPool, - kind: string, - edges: readonly EdgeRow[], -): Promise { - if (edges.length === 0) return; - const rows: Record[] = [EDGE_SENTINEL_ROW]; - for (const e of edges) { - rows.push({ - eid: e.id, - src: e.from, - dst: e.to, - confidence: - typeof e.confidence === "number" && Number.isFinite(e.confidence) - ? String(e.confidence) - : null, - reason: e.reason ?? null, - step: - e.step !== undefined && typeof e.step === "number" && Number.isFinite(e.step) - ? 
String(e.step) - : null, - }); - } - await pool.execWrite(buildEdgeCopySubquery(kind), { rows }); -} - -async function mergeEdges( - pool: GraphDbPool, - kind: string, - edges: readonly EdgeRow[], -): Promise { - if (edges.length === 0) return; - for (const e of edges) { - await pool.query(`MATCH ()-[r:${kind} {id: $p1}]->() DELETE r`, [e.id]); - } - await bulkInsertEdges(pool, kind, edges); -} - -function buildEmbeddingCreateCypher(): string { - const propPairs = EMBEDDING_COLUMNS.map((col, i) => `${col}: $p${i + 1}`).join(", "); - return `CREATE (e:Embedding {${propPairs}})`; -} - -// --------------------------------------------------------------------------- -// Main class -// --------------------------------------------------------------------------- - -export class GraphDbStore implements IGraphStore { - /** - * Cypher dialect marker. The graph-db backend speaks Cypher natively; - * the optional {@link IGraphStore.execCypher} escape hatch is wired - * below so community tooling that needs raw Cypher (APOC analogues, - * etc.) can call through. - */ - readonly dialect: GraphDialect = "cypher"; - private readonly path: string; - private readonly readOnly: boolean; - private readonly embeddingDim: number; - private readonly defaultTimeoutMs: number; - private readonly poolConfig: GraphDbPoolConfig; - private pool: GraphDbPool | null = null; - private ftsExtensionLoaded = false; - private vectorExtensionLoaded = false; - private ftsIndexBuilt = false; - private vectorIndexBuilt = false; - - constructor(path: string, opts: GraphDbStoreOptions = {}) { - this.path = path; - this.readOnly = opts.readOnly === true; - this.embeddingDim = opts.embeddingDim ?? DEFAULT_EMBEDDING_DIM; - this.defaultTimeoutMs = opts.timeoutMs ?? DEFAULT_TIMEOUT_MS; - this.poolConfig = opts.poolConfig ?? 
{}; - } - - // -------------------------------------------------------------------------- - // Lifecycle - // -------------------------------------------------------------------------- - - async open(): Promise { - if (this.pool?.isOpen()) return; - // Surface missing-binding failures as a typed error so the pool's own - // lazy import doesn't produce a raw module-not-found error. When the - // caller injected a `binding` in `poolConfig` (tests) we skip the - // probe — the fake already provides the types. - if (!this.poolConfig.binding) { - try { - await import("@ladybugdb/core"); - } catch (err) { - throw new GraphDbBindingError(err); - } - } - // Guard: lbug v0.16.1 creates an empty database file even when opened - // with readOnly=true if the path doesn't exist yet. The empty DB then - // fails on any write (INSTALL FTS, INSTALL VECTOR, schema creation) with - // "Cannot create an empty database under READ ONLY mode". Fail-fast here - // so callers that catch `open()` errors (augment, countPriorCallable, - // openEmbeddingHashCacheAdapter) get the error they expect — and the - // lbug file is never created for a read-only probe on a missing DB. - if ((this.poolConfig.readOnly ?? this.readOnly) && this.path !== ":memory:") { - const { access } = await import("node:fs/promises"); - try { - await access(this.path); - } catch { - throw new Error( - `graph-db: database file does not exist at ${this.path} (read-only open refused)`, - ); - } - } - this.pool = new GraphDbPool(this.path, { - ...this.poolConfig, - readOnly: this.poolConfig.readOnly ?? this.readOnly, - }); - await this.pool.open(); - } - - async close(): Promise { - if (!this.pool) return; - const pool = this.pool; - this.pool = null; - // Clear lazy-init latches so a subsequent open() re-probes the - // extensions against the freshly opened database. 
- this.ftsExtensionLoaded = false; - this.vectorExtensionLoaded = false; - this.ftsIndexBuilt = false; - this.vectorIndexBuilt = false; - await pool.close(); - } - - async createSchema(): Promise { - const pool = this.requirePool(); - const ddl = generateSchemaDdl({ embeddingDim: this.embeddingDim }); - // Split on semicolons (each statement was emitted with a trailing `;\n`). - // Firing statements independently keeps error messages tied to the exact - // CREATE that failed rather than a concatenated batch. - const stmts = ddl - .split(/;\s*\n/) - .map((s) => s.trim()) - .filter((s) => s.length > 0); - for (const stmt of stmts) { - await pool.query(stmt); - } - } - - // -------------------------------------------------------------------------- - // Bulk load - // -------------------------------------------------------------------------- - - /** - * Bulk-load with a bounded retry for the transient lbug WAL→checkpoint - * IO race. Under CPU/IO pressure the native binding's auto-checkpoint can - * fail to rename `graph.lbug.wal` → `.wal.checkpoint` ("No such file or - * directory") even though the data is already durably in the WAL. The - * write otherwise succeeds (a reopen recovers the WAL), so the failure is - * a flaky teardown artifact, not data loss — but unretried it bubbles to - * the CLI's top-level catch and fails `analyze` with exit 1. Observed only - * on loaded CI runners (varies by leg), never on an idle box. - * - * replace-mode bulkLoad is idempotent (truncate-then-insert fully replaces - * prior contents), so a retry is safe. We retry only the transient - * checkpoint-rename class; every other error rethrows immediately. 
- */ - async bulkLoad(graph: KnowledgeGraph, opts: BulkLoadOptions = {}): Promise { - return retryTransientCheckpoint(() => this.#bulkLoadOnce(graph, opts)); - } - - async #bulkLoadOnce(graph: KnowledgeGraph, opts: BulkLoadOptions = {}): Promise { - const pool = this.requirePool(); - const started = performance.now(); - const mode = opts.mode ?? "replace"; - // The FTS extension must be loaded on the active connection before - // any DELETE / DETACH DELETE on CodeNode runs. lbug builds the FTS - // index against the table; without the extension loaded, deletes - // surface "Trying to delete from an index on table CodeNode but its - // extension is not loaded". Both replace-mode `truncateAll` and - // upsert-mode `mergeNodes` issue such deletes, so load it - // unconditionally up front. Failures are swallowed: the extension - // may not be available on the host platform, in which case the - // search-side codepath surfaces a clearer error from - // `ensureFtsExtension` later. - await this.ensureFtsExtension().catch(() => {}); - const reportProgress = ( - ev: Parameters>[0], - ): void => { - if (opts.onProgress === undefined) return; - try { - opts.onProgress(ev); - } catch { - // Progress-callback errors must never mask bulk-load failures. - } - }; - - if (mode === "replace") { - reportProgress({ kind: "truncate-start", elapsedMs: performance.now() - started }); - await this.truncateAll(); - reportProgress({ kind: "truncate-end", elapsedMs: performance.now() - started }); - } - - const nodes = dedupeLastById(graph.orderedNodes(), (n) => n.id); - const edges = dedupeLastById(graph.orderedEdges(), (e) => e.id); - // lbug's COPY enforces that every relation's from/to is a real - // CodeNode primary key. The pipeline emits synthetic edge targets - // (e.g. unresolved FETCHES placeholders carrying the URL template - // in `reason`) that never have a matching node. 
Synthesize one - // CodeNode per orphan id so the COPY succeeds; downstream tools - // recognise these by their well-known id prefix. - // - // Upsert mode caveat: a batch that carries ONLY new nodes (e.g. - // `ingest-sarif` upserts Finding nodes plus FOUND_IN edges into the - // already-persisted graph) references real, previously-loaded nodes - // that are absent from THIS batch. Synthesizing a placeholder for - // such an id and then `mergeNodes`-ing it (DETACH DELETE + re-insert) - // would DESTROY the real node — turning, e.g., the Function a finding - // was found in into a `` Route. So in upsert mode we must - // exclude ids that already exist in the store from synthesis; only - // genuinely-orphan ids (no node in the batch AND none in the store) - // get a placeholder. - const existingIds = - mode === "upsert" - ? await this.filterExistingNodeIds(pool, edgeEndpointIds(edges)) - : undefined; - const synthetic = synthesizePlaceholderNodes(nodes, edges, existingIds); - const allNodes = synthetic.length > 0 ? [...nodes, ...synthetic] : nodes; - await this.insertNodes(pool, allNodes, mode, reportProgress, started); - - const byKind = new Map(); - for (const e of edges) { - const bucket = byKind.get(e.type) ?? []; - const row: EdgeRow = { - id: e.id, - from: e.from, - to: e.to, - type: e.type, - confidence: e.confidence, - ...(e.reason !== undefined ? { reason: e.reason } : {}), - ...(e.step !== undefined ? 
{ step: e.step } : {}), - }; - bucket.push(row); - byKind.set(e.type, bucket); - } - if (edges.length > 0) { - reportProgress({ - kind: "edges-start", - total: edges.length, - elapsedMs: performance.now() - started, - }); - } - let edgesDone = 0; - for (const [kind, bucket] of byKind) { - await this.insertEdgesForKind(pool, kind, bucket, mode, reportProgress, () => { - edgesDone += bucket.length; - return { done: edgesDone, total: edges.length, elapsedMs: performance.now() - started }; - }); - } - if (edges.length > 0) { - reportProgress({ - kind: "edges-end", - done: edges.length, - total: edges.length, - elapsedMs: performance.now() - started, - }); - } - - // Build the search-side indexes here so subsequent read-only opens - // can query without triggering writes. lbug rejects - // `CALL CREATE_FTS_INDEX` / `CALL CREATE_VECTOR_INDEX` on a readOnly - // Database — and `ensureFtsIndex` / `ensureVectorIndex` correctly - // no-op in that mode. Any failure at this stage is non-fatal: the - // FTS / VECTOR extension may not be available on the host platform, - // in which case search/vectorSearch will surface a clearer error - // from the extension load path the next time they're called. - await this.ensureFtsExtension().catch(() => {}); - await this.ensureFtsIndex().catch(() => {}); - - const durationMs = performance.now() - started; - return { - nodeCount: graph.nodeCount(), - edgeCount: graph.edgeCount(), - durationMs, - }; - } - - private async truncateAll(): Promise { - const pool = this.requirePool(); - // Drop the search-side indexes BEFORE any node delete. lbug 0.16.1 - // hard-crashes with SIGBUS (bus error — an un-catchable native signal, - // not a JS exception the retry wrapper could survive) when a - // `MATCH (n:CodeNode) DELETE n` runs while the `och_fts` full-text index - // is live on that table. 
The index is rebuilt by the trailing - // `ensureFtsIndex()` / `ensureVectorIndex()` in `#bulkLoadOnce` after the - // fresh rows are inserted, so dropping it here is lossless. `och_vec` on - // `Embedding` gets the same treatment for symmetry (the vector index is - // built on the write path too); the failure mode is structural — any live - // index over rows being deleted is unsafe — so we never delete indexed - // rows out from under an index again. - await this.dropSearchIndexes(); - // Delete edges first so node deletes stay side-effect free. The graph-db - // engine rejects deletes of a node that still has dangling rels. - for (const kind of getAllRelationTypes()) { - await pool.query(`MATCH ()-[r:${kind}]->() DELETE r`); - } - await pool.query("MATCH ()-[r:EMBEDS]->() DELETE r"); - await pool.query("MATCH (n:Embedding) DELETE n"); - await pool.query("MATCH (n:CodeNode) DELETE n"); - } - - /** - * Drop the FTS (`och_fts` on `CodeNode`) and VECTOR (`och_vec` on - * `Embedding`) indexes ahead of a truncate. A `DROP_*_INDEX` for an index - * that does not exist throws a catchable "doesn't have an index" Binder - * exception (NOT a SIGBUS) — we swallow exactly that so the drop is - * idempotent across fresh stores, embeddings-disabled runs, and repeated - * bulk-loads. Any other error (missing table, permission) surfaces. - * - * Resets `ftsIndexBuilt` / `vectorIndexBuilt` so the post-insert - * `ensureFtsIndex()` / `ensureVectorIndex()` calls actually rebuild the - * indexes against the freshly-loaded rows instead of short-circuiting on a - * now-stale "already built" flag. - */ - private async dropSearchIndexes(): Promise { - const pool = this.requirePool(); - // The FTS extension must be loaded before DROP_FTS_INDEX is bindable; - // `#bulkLoadOnce` already loads it up front, but load defensively here so - // `truncateAll` is correct if ever called on its own. A load failure - // (extension unavailable on this host) means there is no index to drop. 
- const dropIfPresent = async (stmt: string): Promise => { - try { - await pool.query(stmt); - } catch (err) { - const msg = (err as Error).message ?? ""; - // Both of these mean "there is no index to drop", which is the - // idempotent no-op case: - // - Binder: "Table X doesn't have an index with name Y" — the - // index was never created (fresh store, embeddings disabled, - // repeated truncate). - // - Catalog: "function DROP_VECTOR_INDEX is not defined ... VECTOR - // extension" — the extension isn't loaded on this connection, so - // no vector index can exist to drop. (`#bulkLoadOnce` loads FTS - // up front but not VECTOR; loading VECTOR solely to drop a - // usually-absent index isn't worth the cost.) - const noIndexToDrop = - /does(?:n't| not) have an index|no index|not exist/i.test(msg) || - (/is not defined/i.test(msg) && /extension/i.test(msg)); - if (!noIndexToDrop) throw err; - } - }; - await dropIfPresent("CALL DROP_FTS_INDEX('CodeNode', 'och_fts')"); - await dropIfPresent("CALL DROP_VECTOR_INDEX('Embedding', 'och_vec')"); - this.ftsIndexBuilt = false; - this.vectorIndexBuilt = false; - } - - /** - * Return the subset of `candidateIds` that already exist as CodeNodes in the - * store. Used by upsert-mode bulkLoad to avoid synthesizing a placeholder - * (and then `mergeNodes`-clobbering) for an edge endpoint that is a real, - * previously-persisted node not present in the current batch. Reuses the - * `listNodes({ ids })` finder so id decoding stays in one place; passes no - * `limit` so every match is returned. 
- */ - private async filterExistingNodeIds( - _pool: GraphDbPool, - candidateIds: readonly string[], - ): Promise> { - if (candidateIds.length === 0) return new Set(); - const found = await this.listNodes({ ids: candidateIds as readonly NodeId[] }); - return new Set(found.map((n) => n.id as string)); - } - - private async insertNodes( - pool: GraphDbPool, - nodes: readonly GraphNode[], - mode: "replace" | "upsert", - reportProgress: (ev: Parameters>[0]) => void, - bulkStartedAt: number, - ): Promise { - if (nodes.length === 0) return; - reportProgress({ - kind: "nodes-start", - total: nodes.length, - elapsedMs: performance.now() - bulkStartedAt, - }); - if (mode === "upsert") { - await mergeNodes(pool, nodes); - } else { - await bulkInsertNodes(pool, nodes); - } - reportProgress({ - kind: "nodes-end", - done: nodes.length, - total: nodes.length, - elapsedMs: performance.now() - bulkStartedAt, - }); - } - - private async insertEdgesForKind( - pool: GraphDbPool, - kind: string, - edges: readonly EdgeRow[], - mode: "replace" | "upsert", - reportProgress: (ev: Parameters>[0]) => void, - cumulative: () => { done: number; total: number; elapsedMs: number }, - ): Promise { - if (edges.length === 0) return; - if (mode === "upsert") { - await mergeEdges(pool, kind, edges); - } else { - await bulkInsertEdges(pool, kind, edges); - } - const c = cumulative(); - reportProgress({ kind: "edges-batch", relType: kind, ...c }); - } - - // -------------------------------------------------------------------------- - // Embeddings - // -------------------------------------------------------------------------- - - async upsertEmbeddings(rows: readonly EmbeddingRow[]): Promise { - if (rows.length === 0) return; - const pool = this.requirePool(); - const dim = this.embeddingDim; - - // Delete any existing rows that match (node_id, granularity, - // chunk_index). 
Mirrors duckdb-adapter.ts — MERGE on Embedding would - // work but the composite key is not the primary key, so the safest - // pattern is delete-then-create. DETACH DELETE because the prior row - // may have an EMBEDS rel attached, and the native engine refuses a - // bare DELETE on a node with dangling rels. - const delCypher = - `MATCH (e:Embedding) WHERE e.node_id = $p1 AND e.granularity = $p2 ` + - `AND e.chunk_index = $p3 DETACH DELETE e`; - for (const r of rows) { - const granularity = r.granularity ?? "symbol"; - await pool.query(delCypher, [r.nodeId, granularity, r.chunkIndex]); - } - - // Create one Embedding node per row + an EMBEDS rel linking it back - // to its source CodeNode (so the vectorSearch post-filter can join - // back through the graph without an extra property lookup). - const createCypher = buildEmbeddingCreateCypher(); - const embedsCypher = `MATCH (e:Embedding {id: $p1}), (n:CodeNode {id: $p2}) CREATE (e)-[:EMBEDS]->(n)`; - for (const r of rows) { - if (r.vector.length !== dim) { - throw new Error(`Embedding dimension mismatch: got ${r.vector.length}, expected ${dim}`); - } - const granularity = r.granularity ?? "symbol"; - const embeddingId = `Emb:${granularity}:${r.nodeId}:${r.chunkIndex}`; - // The native binding does not accept Float32Array directly for a - // FLOAT[dim] column; Array.from converts once per row and keeps the - // serialized shape a plain number[]. The cast to `SqlParam` is a structural - // narrowing — the pool forwards arbitrary JS values to the native - // binding, which accepts arrays for fixed-dim float columns. - const vector = Array.from(r.vector) as unknown as SqlParam; - const params: readonly SqlParam[] = [ - embeddingId, - r.nodeId, - granularity, - r.chunkIndex, - r.startLine ?? null, - r.endLine ?? null, - vector, - r.contentHash, - ]; - await pool.query(createCypher, params); - // Best-effort EMBEDS rel. 
Missing CodeNode is not a hard error — - // this mirrors the DuckDB embeddings table (which doesn't require a - // join target) but still gives the graph traversal tools a hook. - try { - await pool.query(embedsCypher, [embeddingId, r.nodeId]); - } catch { - // Node not yet loaded; the traversal side will treat the embedding - // as orphaned. Round-trip cases always bulkLoad before upserting, - // so this only fires when callers write embeddings for nodes that - // have been purged by a prior replace. - } - } - } - - async listEmbeddingHashes(): Promise> { - const pool = this.requirePool(); - const rows = await pool.query( - `MATCH (e:Embedding) RETURN e.node_id AS node_id, e.granularity AS granularity, ` + - `e.chunk_index AS chunk_index, e.content_hash AS content_hash`, - ); - const out = new Map(); - for (const row of rows) { - const rec = row as Record; - const nodeId = rec["node_id"]; - const granularity = rec["granularity"]; - const chunkIndex = rec["chunk_index"]; - const contentHash = rec["content_hash"]; - if ( - typeof nodeId !== "string" || - typeof granularity !== "string" || - typeof contentHash !== "string" || - (typeof chunkIndex !== "number" && typeof chunkIndex !== "bigint") - ) { - continue; - } - const ci = typeof chunkIndex === "bigint" ? Number(chunkIndex) : chunkIndex; - out.set(`${granularity}\0${nodeId}\0${ci}`, contentHash); - } - return out; - } - - // -------------------------------------------------------------------------- - // Query surfaces - // -------------------------------------------------------------------------- - - async query( - sql: string, - params?: readonly SqlParam[], - opts?: { readonly timeoutMs?: number }, - ): Promise[]> { - if (!this.pool) { - throw new Error("graph-db: query called before open()"); - } - // Refuse write keywords so the user surface stays read-only. 
The - // full Cypher-guard lives in `cypher-guard.ts`; this call mirrors - // the DuckDB backend's `assertReadOnlySql` approach and trips every - // write verb the native binding accepts. - assertReadOnlyCypher(sql); - const timeoutMs = opts?.timeoutMs ?? this.defaultTimeoutMs; - return this.pool.query(sql, params, { timeoutMs }); - } - - /** - * Enumerate fully-rehydrated GraphNodes by kind. Mirror of the - * DuckStore implementation — same input/output contract so the M5 BOM - * bodies render identical results regardless of which backend the user - * picked. - * - * The graph-db schema stores every kind under the single label - * `:CodeNode` with `kind` as a discriminator property (see - * graphdb-schema.ts). One MATCH plus an optional `WHERE n.kind IN [...]` - * predicate is therefore sufficient — no per-kind table fan-out. - * - * Determinism: ORDER BY n.id ASC at the Cypher layer, plus a JS-side - * lex-stable tiebreak on the rehydrated nodes so the output matches - * DuckStore byte-for-byte. - */ - async listNodes(opts: ListNodesOptions = {}): Promise { - const kinds = opts.kinds; - // Empty-kinds short-circuit BEFORE the pool guard — the contract is - // pure-JS ("kinds: [] returns []") and must hold even when the store - // has not been opened yet. Saves callers a defensive .open() when - // they know the kinds list is empty. - if (kinds !== undefined && kinds.length === 0) return []; - const idsRaw = opts.ids; - if (idsRaw !== undefined && idsRaw.length === 0) return []; - const ids = idsRaw !== undefined ? Array.from(new Set(idsRaw)) : undefined; - const pool = this.requirePool(); - const limit = clampNonNegativeIntGd(opts.limit); - const offset = clampNonNegativeIntGd(opts.offset); - - // RETURN every column the writer emits. Each column → field mapping - // mirrors `nodeToParams` exactly so the round-trip is symmetric. 
- const returnList = NODE_COLUMNS.map((c) => `n.${c} AS ${c}`).join(", "); - - const params: SqlParam[] = []; - const wheres: string[] = []; - let next = 1; - if (kinds && kinds.length > 0) { - const phs: string[] = []; - for (const k of kinds) { - phs.push(`$p${next}`); - params.push(k); - next += 1; - } - wheres.push(`n.kind IN [${phs.join(", ")}]`); - } - if (ids !== undefined && ids.length > 0) { - const phs: string[] = []; - for (const id of ids) { - phs.push(`$p${next}`); - params.push(id); - next += 1; - } - wheres.push(`n.id IN [${phs.join(", ")}]`); - } - if (opts.filePath !== undefined) { - wheres.push(`n.file_path = $p${next}`); - params.push(opts.filePath); - next += 1; - } - const wherePredicate = wheres.length > 0 ? `WHERE ${wheres.join(" AND ")} ` : ""; - // SKIP / LIMIT bound via inline literals after the clampNonNegativeInt - // guard has confirmed they are finite non-negative integers — no - // injection risk because `Number.isFinite` + `Math.floor` enforce a - // strict integer encoding before we interpolate. - let pagination = ""; - if (offset !== undefined) pagination += `SKIP ${offset} `; - if (limit !== undefined) pagination += `LIMIT ${limit} `; - - const cypher = ( - `MATCH (n:CodeNode) ${wherePredicate}` + - `RETURN ${returnList} ` + - `ORDER BY n.id ASC ${pagination}` - ).trim(); - - const rows = await pool.query(cypher, params); - const out: GraphNode[] = []; - for (const row of rows) { - const node = recordToGraphNode(row as Record); - if (node) out.push(node); - } - // Lex-stable tiebreak on id so DuckStore + GraphDbStore agree - // byte-for-byte when graphHash is computed over the result. - return [...out].sort((a, b) => (a.id < b.id ? -1 : a.id > b.id ? 1 : 0)); - } - - // -------------------------------------------------------------------------- - // Typed finders — service-layer foundation - // -------------------------------------------------------------------------- - // - // Cypher stays LOCAL to this file — never exported. 
Determinism: node - // finders ORDER BY n.id ASC + JS-side lex tiebreak; edge finders ORDER BY - // (from, to, type); the consumer-producer finder orders by (consumer - // repo, producer repo, method, path). - - /** Single-kind shorthand. Mirror of {@link DuckDbStore.listNodesByKind}. */ - async listNodesByKind( - kind: K, - opts: ListNodesByKindOptions = {}, - ): Promise[]> { - const pool = this.requirePool(); - const limit = clampNonNegativeIntGd(opts.limit); - const offset = clampNonNegativeIntGd(opts.offset); - const returnList = NODE_COLUMNS.map((c) => `n.${c} AS ${c}`).join(", "); - - const wheres: string[] = ["n.kind = $p1"]; - const params: SqlParam[] = [kind]; - let next = 2; - if (opts.filePath !== undefined) { - wheres.push(`n.file_path = $p${next}`); - params.push(opts.filePath); - next += 1; - } - if (opts.filePathLike !== undefined) { - wheres.push(`n.file_path CONTAINS $p${next}`); - params.push(opts.filePathLike); - next += 1; - } - let pagination = ""; - if (offset !== undefined) pagination += `SKIP ${offset} `; - if (limit !== undefined) pagination += `LIMIT ${limit} `; - const cypher = ( - `MATCH (n:CodeNode) WHERE ${wheres.join(" AND ")} ` + - `RETURN ${returnList} ORDER BY n.id ASC ${pagination}` - ).trim(); - - const rows = await pool.query(cypher, params); - const out: GraphNode[] = []; - for (const row of rows) { - const node = recordToGraphNode(row as Record); - if (node) out.push(node); - } - const sorted = [...out].sort((a, b) => (a.id < b.id ? -1 : a.id > b.id ? 1 : 0)); - return sorted as unknown as readonly NodeOfKind[]; - } - - /** All edges, optionally filtered + paged. Mirrors DuckDb ordering. */ - async listEdges(opts: ListEdgesOptions = {}): Promise { - const pool = this.requirePool(); - return this.listEdgesInternalGd(pool, opts); - } - - /** Single-type shorthand. Pins the type and forwards to {@link listEdges}. 
*/ - async listEdgesByType( - type: RelationType, - opts: ListEdgesByTypeOptions = {}, - ): Promise { - const merged: ListEdgesOptions = { - types: [type], - ...(opts.fromIds !== undefined ? { fromIds: opts.fromIds } : {}), - ...(opts.toIds !== undefined ? { toIds: opts.toIds } : {}), - ...(opts.minConfidence !== undefined ? { minConfidence: opts.minConfidence } : {}), - ...(opts.limit !== undefined ? { limit: opts.limit } : {}), - }; - return this.listEdges(merged); - } - - /** Findings filter. Mirrors {@link DuckDbStore.listFindings} on Cypher. */ - async listFindings(opts: ListFindingsOptions = {}): Promise { - const pool = this.requirePool(); - const wheres: string[] = ["n.kind = 'Finding'"]; - const params: SqlParam[] = []; - let next = 1; - if (opts.severity && opts.severity.length > 0) { - const phs: string[] = []; - for (const s of opts.severity) { - phs.push(`$p${next}`); - params.push(s); - next += 1; - } - wheres.push(`n.severity IN [${phs.join(", ")}]`); - } - if (opts.ruleId !== undefined) { - wheres.push(`n.rule_id = $p${next}`); - params.push(opts.ruleId); - next += 1; - } - if (opts.baselineState && opts.baselineState.length > 0) { - const phs: string[] = []; - for (const s of opts.baselineState) { - phs.push(`$p${next}`); - params.push(s); - next += 1; - } - wheres.push(`n.baseline_state IN [${phs.join(", ")}]`); - } - if (opts.suppressed === true) { - wheres.push("n.suppressed_json IS NOT NULL"); - } else if (opts.suppressed === false) { - wheres.push("n.suppressed_json IS NULL"); - } - const limit = clampNonNegativeIntGd(opts.limit); - const limitClause = limit !== undefined ? 
`LIMIT ${limit} ` : ""; - const returnList = NODE_COLUMNS.map((c) => `n.${c} AS ${c}`).join(", "); - const cypher = ( - `MATCH (n:CodeNode) WHERE ${wheres.join(" AND ")} ` + - `RETURN ${returnList} ORDER BY n.id ASC ${limitClause}` - ).trim(); - const rows = await pool.query(cypher, params); - const out: FindingNode[] = []; - for (const row of rows) { - const node = recordToGraphNode(row as Record); - if (node && node.kind === "Finding") out.push(node as FindingNode); - } - return [...out].sort((a, b) => (a.id < b.id ? -1 : a.id > b.id ? 1 : 0)); - } - - /** Dependencies filter. License classification matches DuckDb. */ - async listDependencies(opts: ListDependenciesOptions = {}): Promise { - const pool = this.requirePool(); - const wheres: string[] = ["n.kind = 'Dependency'"]; - const params: SqlParam[] = []; - let next = 1; - if (opts.ecosystem !== undefined) { - wheres.push(`n.ecosystem = $p${next}`); - params.push(opts.ecosystem); - next += 1; - } - const limit = clampNonNegativeIntGd(opts.limit); - const limitClause = limit !== undefined ? `LIMIT ${limit} ` : ""; - const returnList = NODE_COLUMNS.map((c) => `n.${c} AS ${c}`).join(", "); - const cypher = ( - `MATCH (n:CodeNode) WHERE ${wheres.join(" AND ")} ` + - `RETURN ${returnList} ORDER BY n.id ASC ${limitClause}` - ).trim(); - const rows = await pool.query(cypher, params); - const tierSet = - opts.licenseTier && opts.licenseTier.length > 0 ? new Set(opts.licenseTier) : undefined; - const out: DependencyNode[] = []; - for (const row of rows) { - const node = recordToGraphNode(row as Record); - if (node?.kind !== "Dependency") continue; - if (tierSet) { - const tier = classifyLicenseTier((node as DependencyNode).license); - if (!tierSet.has(tier)) continue; - } - out.push(node as DependencyNode); - } - return [...out].sort((a, b) => (a.id < b.id ? -1 : a.id > b.id ? 1 : 0)); - } - - /** Routes filter. Mirrors {@link DuckDbStore.listRoutes} on Cypher. 
*/ - async listRoutes(opts: ListRoutesOptions = {}): Promise { - const pool = this.requirePool(); - const wheres: string[] = ["n.kind = 'Route'"]; - const params: SqlParam[] = []; - let next = 1; - if (opts.methods && opts.methods.length > 0) { - const phs: string[] = []; - for (const m of opts.methods) { - phs.push(`$p${next}`); - params.push(m); - next += 1; - } - wheres.push(`n.method IN [${phs.join(", ")}]`); - } - if (opts.pathLike !== undefined) { - wheres.push(`n.url CONTAINS $p${next}`); - params.push(opts.pathLike); - next += 1; - } - const limit = clampNonNegativeIntGd(opts.limit); - const limitClause = limit !== undefined ? `LIMIT ${limit} ` : ""; - const returnList = NODE_COLUMNS.map((c) => `n.${c} AS ${c}`).join(", "); - const cypher = ( - `MATCH (n:CodeNode) WHERE ${wheres.join(" AND ")} ` + - `RETURN ${returnList} ORDER BY n.id ASC ${limitClause}` - ).trim(); - const rows = await pool.query(cypher, params); - const out: RouteNode[] = []; - for (const row of rows) { - const node = recordToGraphNode(row as Record); - if (node && node.kind === "Route") out.push(node as RouteNode); - } - return [...out].sort((a, b) => (a.id < b.id ? -1 : a.id > b.id ? 1 : 0)); - } - - /** Repo-node by id. Returns `undefined` when row is missing or non-Repo. */ - async getRepoNode(id: string): Promise { - const pool = this.requirePool(); - const returnList = NODE_COLUMNS.map((c) => `n.${c} AS ${c}`).join(", "); - const rows = await pool.query( - `MATCH (n:CodeNode {id: $p1, kind: 'Repo'}) RETURN ${returnList} LIMIT 1`, - [id], - ); - const first = rows[0]; - if (!first) return undefined; - const node = recordToGraphNode(first as Record); - if (node?.kind !== "Repo") return undefined; - return node as RepoNode; - } - - /** - * Specialized finder for `analysis/impact.ts:131-135`. Cypher mirror of - * the DuckDB `WHERE entry_point_id = ?` predicate; the property name is - * the snake-cased column the writer emits via `nodeToParams`. 
- */ - async listNodesByEntryPoint(entryPointId: string): Promise { - const pool = this.requirePool(); - const returnList = NODE_COLUMNS.map((c) => `n.${c} AS ${c}`).join(", "); - const cypher = `MATCH (n:CodeNode) WHERE n.entry_point_id = $p1 RETURN ${returnList} ORDER BY n.id ASC`; - const rows = await pool.query(cypher, [entryPointId]); - const out: GraphNode[] = []; - for (const row of rows) { - const node = recordToGraphNode(row as Record); - if (node) out.push(node); - } - return [...out].sort((a, b) => (a.id < b.id ? -1 : a.id > b.id ? 1 : 0)); - } - - /** - * Specialized finder for `analysis/rename.ts:51,59` — `WHERE name = ?` - * with optional `kinds` / `filePath` narrowing. Mirrors - * {@link DuckDbStore.listNodesByName} exactly. - */ - async listNodesByName( - name: string, - opts: ListNodesByNameOptions = {}, - ): Promise { - const kinds = opts.kinds; - if (kinds !== undefined && kinds.length === 0) return []; - const pool = this.requirePool(); - const limit = clampNonNegativeIntGd(opts.limit); - const returnList = NODE_COLUMNS.map((c) => `n.${c} AS ${c}`).join(", "); - const wheres: string[] = ["n.name = $p1"]; - const params: SqlParam[] = [name]; - let next = 2; - if (kinds && kinds.length > 0) { - const phs: string[] = []; - for (const k of kinds) { - phs.push(`$p${next}`); - params.push(k); - next += 1; - } - wheres.push(`n.kind IN [${phs.join(", ")}]`); - } - if (opts.filePath !== undefined) { - wheres.push(`n.file_path = $p${next}`); - params.push(opts.filePath); - next += 1; - } - const limitClause = limit !== undefined ? `LIMIT ${limit} ` : ""; - const cypher = ( - `MATCH (n:CodeNode) WHERE ${wheres.join(" AND ")} ` + - `RETURN ${returnList} ORDER BY n.id ASC ${limitClause}` - ).trim(); - const rows = await pool.query(cypher, params); - const out: GraphNode[] = []; - for (const row of rows) { - const node = recordToGraphNode(row as Record); - if (node) out.push(node); - } - return [...out].sort((a, b) => (a.id < b.id ? -1 : a.id > b.id ? 
1 : 0)); - } - - /** Counts grouped by kind. Same backfill semantics as DuckDb. */ - async countNodesByKind(kinds?: readonly NodeKind[]): Promise> { - const pool = this.requirePool(); - const out = new Map(); - if (kinds !== undefined && kinds.length === 0) return out; - const params: SqlParam[] = []; - let predicate = ""; - if (kinds && kinds.length > 0) { - const phs: string[] = []; - for (let i = 0; i < kinds.length; i += 1) { - phs.push(`$p${i + 1}`); - params.push(kinds[i] ?? ""); - } - predicate = `WHERE n.kind IN [${phs.join(", ")}] `; - } - const cypher = `MATCH (n:CodeNode) ${predicate}RETURN n.kind AS kind, count(n) AS n ORDER BY kind ASC`; - const rows = await pool.query(cypher, params); - for (const r of rows) { - const row = r as Record; - const kindVal = row["kind"]; - const n = row["n"]; - if (typeof kindVal === "string") { - const num = typeof n === "bigint" ? Number(n) : Number(n ?? 0); - out.set(kindVal as NodeKind, num); - } - } - if (kinds) { - for (const k of kinds) { - if (!out.has(k)) out.set(k, 0); - } - } - return out; - } - - /** Counts grouped by edge type. Walks every relation kind (no per-type rel-table fan-out). */ - async countEdgesByType(types?: readonly RelationType[]): Promise> { - const pool = this.requirePool(); - const out = new Map(); - if (types !== undefined && types.length === 0) return out; - const allTypes: readonly RelationType[] = - types && types.length > 0 ? types : (getAllRelationTypes() as readonly RelationType[]); - // The graph-db schema partitions edges into per-type rel tables, so a - // single MATCH across every label is the cheapest count path. We loop - // per type and aggregate — N is bounded (~24) and one round-trip per - // label is amortized against the rest of the query workload. - for (const t of allTypes) { - const rows = await pool.query(`MATCH ()-[r:${t}]->() RETURN count(r) AS n`); - const first = rows[0] as Record | undefined; - const n = first?.["n"]; - const num = typeof n === "bigint" ? 
Number(n) : Number(n ?? 0); - out.set(t, num); - } - return out; - } - - /** - * Stream embeddings via Cypher MATCH against the `Embedding` nodes. - * `async function*` so the caller can `for await` without - * materializing the full row set. - */ - async *listEmbeddings(opts: ListEmbeddingsOptions = {}): AsyncIterable { - const kinds = opts.kindFilter; - if (kinds !== undefined && kinds.length === 0) return; - const pool = this.requirePool(); - const limit = clampNonNegativeIntGd(opts.limit); - - const params: SqlParam[] = []; - let next = 1; - let matchAndPredicate = "MATCH (e:Embedding)"; - if (kinds && kinds.length > 0) { - const phs: string[] = []; - for (const k of kinds) { - phs.push(`$p${next}`); - params.push(k); - next += 1; - } - matchAndPredicate = `MATCH (e:Embedding)-[:EMBEDS]->(n:CodeNode) WHERE n.kind IN [${phs.join(", ")}]`; - } - const limitClause = limit !== undefined ? `LIMIT ${limit}` : ""; - const cypher = - `${matchAndPredicate} ` + - `RETURN e.node_id AS node_id, e.granularity AS granularity, ` + - `e.chunk_index AS chunk_index, e.start_line AS start_line, ` + - `e.end_line AS end_line, e.vector AS vector, ` + - `e.content_hash AS content_hash ` + - `ORDER BY e.node_id ASC, e.granularity ASC, e.chunk_index ASC ${limitClause}`; - const rows = await pool.query(cypher, params); - for (const r of rows) { - const row = r as Record; - const vec = row["vector"]; - let vector: Float32Array; - if (vec instanceof Float32Array) vector = vec; - else if (Array.isArray(vec)) vector = Float32Array.from(vec.map((v) => Number(v))); - else continue; - const granularityRaw = String(row["granularity"]); - const granularity = - granularityRaw === "file" || granularityRaw === "community" ? granularityRaw : "symbol"; - const chunkVal = row["chunk_index"]; - const chunkIndex = typeof chunkVal === "bigint" ? Number(chunkVal) : Number(chunkVal ?? 
0); - const startVal = row["start_line"]; - const endVal = row["end_line"]; - const baseRow: EmbeddingRow = { - nodeId: String(row["node_id"]), - granularity, - chunkIndex, - ...(startVal !== null && startVal !== undefined - ? { startLine: typeof startVal === "bigint" ? Number(startVal) : Number(startVal) } - : {}), - ...(endVal !== null && endVal !== undefined - ? { endLine: typeof endVal === "bigint" ? Number(endVal) : Number(endVal) } - : {}), - vector, - contentHash: String(row["content_hash"] ?? ""), - }; - yield baseRow; - } - } - - /** Ancestor traversal via a Cypher variable-length upward walk (the lbug analogue of a recursive ancestor query). */ - async traverseAncestors(opts: AncestorTraversalOptions): Promise { - return this.traverseDirectionalGd(opts, "up"); - } - - /** Symmetric of {@link traverseAncestors}. */ - async traverseDescendants(opts: DescendantTraversalOptions): Promise { - return this.traverseDirectionalGd(opts, "down"); - } - - /** - * Producer-consumer edges across repos. Cypher mirror of the DuckDB - * FETCHES + Operation join. The graph-db schema collapses every node - * kind into a single `:CodeNode` label, so this is a simple two-hop - * pattern with property predicates rather than a true table join. 
- */ - async listConsumerProducerEdges( - opts: { readonly repoUris?: readonly string[] } = {}, - ): Promise { - const pool = this.requirePool(); - const params: SqlParam[] = []; - let next = 1; - let repoPredicate = ""; - if (opts.repoUris && opts.repoUris.length > 0) { - const phs: string[] = []; - for (const u of opts.repoUris) { - phs.push(`$p${next}`); - params.push(u); - next += 1; - } - repoPredicate = ` AND (consumer.repo_uri IN [${phs.join(", ")}] OR producer.repo_uri IN [${phs.join(", ")}])`; - } - const cypher = - `MATCH (consumer:CodeNode)-[r:FETCHES]->(producer:CodeNode) ` + - `WHERE producer.kind = 'Operation'${repoPredicate} ` + - `RETURN consumer.id AS consumer_node_id, ` + - `consumer.repo_uri AS consumer_repo_uri, ` + - `producer.id AS producer_node_id, ` + - `producer.repo_uri AS producer_repo_uri, ` + - `producer.http_method AS http_method, ` + - `producer.http_path AS http_path, ` + - `r.id AS r_id ` + - `ORDER BY consumer_repo_uri ASC, producer_repo_uri ASC, ` + - `http_method ASC, http_path ASC, r_id ASC`; - const rows = await pool.query(cypher, params); - const out: ConsumerProducerEdge[] = []; - for (const r of rows) { - const row = r as Record; - out.push({ - consumerNodeId: String(row["consumer_node_id"] ?? ""), - consumerRepoUri: String(row["consumer_repo_uri"] ?? ""), - producerNodeId: String(row["producer_node_id"] ?? ""), - producerRepoUri: String(row["producer_repo_uri"] ?? ""), - httpMethod: String(row["http_method"] ?? ""), - httpPath: String(row["http_path"] ?? ""), - }); - } - return out; - } - - /** - * Shared `listEdges` body. The graph-db schema partitions edges into - * per-type rel tables, so a no-types query needs to walk every label — - * we fall back to the canonical relation list and emit one MATCH per - * type, then merge + sort. With a `types` filter the pattern is one - * MATCH per requested type, which keeps the round-trip cost - * proportional to the filter set. 
- */ - private async listEdgesInternalGd( - pool: GraphDbPool, - opts: ListEdgesOptions, - ): Promise { - const allTypes: readonly RelationType[] = - opts.types && opts.types.length > 0 - ? opts.types - : (getAllRelationTypes() as readonly RelationType[]); - const minConfidence = opts.minConfidence; - const limit = clampNonNegativeIntGd(opts.limit); - const offset = clampNonNegativeIntGd(opts.offset); - - const collected: CodeRelation[] = []; - for (const t of allTypes) { - const params: SqlParam[] = []; - let next = 1; - const wheres: string[] = []; - if (opts.fromIds && opts.fromIds.length > 0) { - const phs: string[] = []; - for (const f of opts.fromIds) { - phs.push(`$p${next}`); - params.push(f); - next += 1; - } - wheres.push(`a.id IN [${phs.join(", ")}]`); - } - if (opts.toIds && opts.toIds.length > 0) { - const phs: string[] = []; - for (const id of opts.toIds) { - phs.push(`$p${next}`); - params.push(id); - next += 1; - } - wheres.push(`b.id IN [${phs.join(", ")}]`); - } - if (minConfidence !== undefined) { - wheres.push(`r.confidence >= $p${next}`); - params.push(minConfidence); - next += 1; - } - const wherePart = wheres.length > 0 ? ` WHERE ${wheres.join(" AND ")}` : ""; - const cypher = - `MATCH (a:CodeNode)-[r:${t}]->(b:CodeNode)${wherePart} ` + - `RETURN a.id AS from_id, b.id AS to_id, r.id AS r_id, ` + - `r.confidence AS confidence, r.reason AS reason, r.step AS step`; - const rows = await pool.query(cypher, params); - for (const row of rows) { - const rec = row as Record; - const stepVal = rec["step"]; - const step = stepVal === null || stepVal === undefined ? undefined : Number(stepVal); - const reasonVal = rec["reason"]; - const reason = - typeof reasonVal === "string" && reasonVal.length > 0 ? reasonVal : undefined; - collected.push({ - id: String(rec["r_id"] ?? "") as CodeRelation["id"], - from: String(rec["from_id"] ?? "") as CodeRelation["from"], - to: String(rec["to_id"] ?? 
"") as CodeRelation["to"], - type: t, - confidence: Number(rec["confidence"] ?? 0), - ...(reason !== undefined ? { reason } : {}), - ...(step !== undefined && step !== 0 ? { step } : {}), - }); - } - } - // Final ordering: (from, to, type, id) — same key order DuckDb uses. - collected.sort((x, y) => { - if (x.from !== y.from) return x.from < y.from ? -1 : 1; - if (x.to !== y.to) return x.to < y.to ? -1 : 1; - if (x.type !== y.type) return x.type < y.type ? -1 : 1; - if (x.id !== y.id) return x.id < y.id ? -1 : 1; - return 0; - }); - const start = offset ?? 0; - const end = limit !== undefined ? start + limit : collected.length; - return collected.slice(start, end); - } - - /** - * Shared body for ancestor/descendant traversal. Defers to the existing - * {@link traverse} method which handles the variable-length pattern - * inlining for the native graph-db engine. - */ - private async traverseDirectionalGd( - opts: AncestorTraversalOptions | DescendantTraversalOptions, - direction: "up" | "down", - ): Promise { - if (opts.edgeTypes.length === 0) return []; - const traverseQuery: TraverseQuery = { - startId: opts.fromId, - relationTypes: opts.edgeTypes, - direction, - maxDepth: opts.maxDepth, - ...(opts.minConfidence !== undefined ? { minConfidence: opts.minConfidence } : {}), - }; - return this.traverse(traverseQuery); - } - - async search(q: SearchQuery): Promise { - const pool = this.requirePool(); - await this.ensureFtsExtension(); - await this.ensureFtsIndex(); - const limit = q.limit ?? 50; - const kindFilter = q.kinds && q.kinds.length > 0 ? q.kinds : undefined; - - // $p1 = FTS query text, $p2..$pN+1 = optional kind filter values, - // $p(limit) = LIMIT. The index maps back to the kind-filter array when - // present. 
- const params: SqlParam[] = [q.text]; - let kindPredicate = ""; - if (kindFilter) { - const phs = kindFilter.map((_, i) => `$p${i + 2}`).join(", "); - kindPredicate = ` WHERE node.kind IN [${phs}]`; - for (const k of kindFilter) params.push(k); - } - // Tiebreaker columns mirror DuckDbStore.search — (id, file_path, name) - // ascending so identical scores yield a stable order across runs. - const cypher = - `CALL QUERY_FTS_INDEX('CodeNode', 'och_fts', $p1) ` + - `WITH node, score${kindPredicate} ` + - `RETURN node.id AS id, node.name AS name, node.kind AS kind, ` + - `node.file_path AS file_path, score ` + - `ORDER BY score DESC, id ASC, file_path ASC, name ASC LIMIT ${Number(limit)}`; - const rows = await pool.query(cypher, params); - const out: SearchResult[] = []; - for (const row of rows) { - out.push({ - nodeId: String((row as Record)["id"]), - name: String((row as Record)["name"] ?? ""), - kind: String((row as Record)["kind"] ?? ""), - filePath: String((row as Record)["file_path"] ?? ""), - score: Number((row as Record)["score"] ?? 0), - }); - } - return out; - } - - async vectorSearch(q: VectorQuery): Promise { - // Dimension guard runs before any pool access so it fails fast on - // misconfigured callers — an 'not open' message would hide the real - // problem. - if (q.vector.length !== this.embeddingDim) { - throw new Error( - `Vector dimension mismatch: got ${q.vector.length}, expected ${this.embeddingDim}`, - ); - } - const pool = this.requirePool(); - await this.ensureVectorExtension(); - await this.ensureVectorIndex(); - const limit = q.limit ?? 10; - const granularities: readonly string[] | undefined = - q.granularity === undefined - ? undefined - : Array.isArray(q.granularity) - ? (q.granularity as readonly string[]) - : [q.granularity as string]; - - // Over-fetch k so the post-filter WHERE still leaves `limit` rows when - // some of the top-k are dropped by the predicate. 
4x limit (min 32) - // is the same headroom DuckDbStore uses for its granularity filter. - const k = Math.max(limit * 4, 32); - - // $p1 = query vector, $p2 = k. Subsequent params are the WHERE clause - // values (callers pass `?` placeholders, we rewrite to $pN). - const params: SqlParam[] = [Array.from(q.vector) as unknown as SqlParam, k]; - let nextPh = 3; - const whereParts: string[] = []; - - if (q.whereClause && q.whereClause.length > 0) { - const localParams = q.params ?? []; - const rewritten = rewriteWhereClause(q.whereClause, () => { - const name = `$p${nextPh}`; - nextPh += 1; - return name; - }); - whereParts.push(`(${rewritten})`); - for (const p of localParams) params.push(p); - } - if (granularities !== undefined && granularities.length > 0) { - const phs: string[] = []; - for (const g of granularities) { - phs.push(`$p${nextPh}`); - nextPh += 1; - params.push(g); - } - whereParts.push(`e.granularity IN [${phs.join(", ")}]`); - } - - const wherePredicate = whereParts.length > 0 ? ` WHERE ${whereParts.join(" AND ")}` : ""; - - // CALL QUERY_VECTOR_INDEX returns rows with `node` (the Embedding - // record) and `distance`. We pull the `e.node_id` column through so - // callers get the CodeNode id — the join to CodeNode via EMBEDS is - // only needed when the caller-supplied whereClause references `n.*`. - const needsJoin = (q.whereClause ?? "").trim().length > 0; - const joinClause = needsJoin ? `MATCH (e)-[:EMBEDS]->(node:CodeNode) ` : ""; - const cypher = - `CALL QUERY_VECTOR_INDEX('Embedding', 'och_vec', $p1, $p2) ` + - `WITH node AS e, distance ` + - `${joinClause}` + - `${wherePredicate} ` + - `RETURN e.node_id AS node_id, distance ORDER BY distance LIMIT ${Number(limit)}`; - - const rows = await pool.query(cypher, params); - const out: VectorResult[] = []; - for (const row of rows) { - const rec = row as Record; - out.push({ - nodeId: String(rec["node_id"]), - distance: Number(rec["distance"] ?? 
0), - }); - } - return out; - } - - async traverse(q: TraverseQuery): Promise { - const pool = this.requirePool(); - const maxDepth = Math.max(0, Math.floor(q.maxDepth)); - if (maxDepth === 0) return []; - const minConfidence = q.minConfidence ?? 0; - const relTypes: readonly string[] = - q.relationTypes && q.relationTypes.length > 0 ? q.relationTypes : getAllRelationTypes(); - // Variable-length MATCH: `[r:T1|T2*1..N]`. The native engine accepts - // the pipe-separated label union and the lower..upper bound syntax. - // Depth is inlined because the native binding rejects a prepared - // statement whose variable-length bounds are bound via parameters. - const typeLabels = relTypes.join("|"); - const { head, tail } = - q.direction === "up" - ? { head: "<-", tail: "-" } - : q.direction === "down" - ? { head: "-", tail: "->" } - : { head: "-", tail: "-" }; - - // NOTE: `[n IN nodes(p) | n.id]` is rejected by the native engine - // (v0.16.1 `Binder exception: Variable n is not in scope`). Use - // `list_transform` instead. - // - // The native prepared-statement planner asserts `UNREACHABLE_CODE` when - // a variable-length pattern (`*1..N`) co-exists with ANY bound - // parameter. Work-around: inline the two inputs this traversal needs - // (startId and minConfidence), then route through `pool.query()` - // without a param list so the pool picks the direct-query path. Both - // values are validated before interpolation — startId is either a - // UUID-shaped NodeId or a composite identifier from `makeNodeId`, and - // minConfidence is a finite number — so the inlining cannot smuggle a - // Cypher fragment. 
- const startIdLiteral = cypherStringLiteral(q.startId); - const confLiteral = cypherNumberLiteral(minConfidence); - const cypher = - `MATCH p = (start:CodeNode {id: ${startIdLiteral}})${head}` + - `[r:${typeLabels}*1..${maxDepth}]${tail}(other:CodeNode) ` + - `WHERE ALL(x IN rels(p) WHERE x.confidence >= ${confLiteral}) ` + - `AND other.id <> ${startIdLiteral} ` + - `RETURN other.id AS node_id, length(p) AS depth, ` + - `list_transform(nodes(p), x -> x.id) AS path ` + - `ORDER BY depth, node_id`; - - const rows = await pool.query(cypher); - const out: TraverseResult[] = []; - for (const row of rows) { - const rec = row as Record; - const pathVal = rec["path"]; - const path = Array.isArray(pathVal) ? pathVal.map((v) => String(v)) : []; - out.push({ - nodeId: String(rec["node_id"]), - depth: Number(rec["depth"] ?? 0), - path, - }); - } - return out; - } - - // -------------------------------------------------------------------------- - // Meta + health - // -------------------------------------------------------------------------- - - async getMeta(): Promise { - const pool = this.requirePool(); - const rows = await pool.query( - `MATCH (m:StoreMeta {id: 1}) RETURN m.schema_version AS schema_version, ` + - `m.last_commit AS last_commit, m.indexed_at AS indexed_at, ` + - `m.node_count AS node_count, m.edge_count AS edge_count, ` + - `m.stats_json AS stats_json, m.cache_hit_ratio AS cache_hit_ratio, ` + - `m.cache_size_bytes AS cache_size_bytes, m.last_compaction AS last_compaction, ` + - `m.embedder_model_id AS embedder_model_id ` + - `LIMIT 1`, - ); - const first = rows[0]; - if (!first) return undefined; - const row = first as Record; - const statsStr = row["stats_json"]; - const stats = - typeof statsStr === "string" && statsStr.length > 0 - ? 
(JSON.parse(statsStr) as Record) - : undefined; - const lastCommit = row["last_commit"]; - const cacheHitRatio = row["cache_hit_ratio"]; - const cacheSizeBytes = row["cache_size_bytes"]; - const lastCompaction = row["last_compaction"]; - const embedderModelId = row["embedder_model_id"]; - return { - schemaVersion: String(row["schema_version"]), - ...(lastCommit !== null && lastCommit !== undefined - ? { lastCommit: String(lastCommit) } - : {}), - indexedAt: String(row["indexed_at"]), - nodeCount: Number(row["node_count"] ?? 0), - edgeCount: Number(row["edge_count"] ?? 0), - ...(stats ? { stats } : {}), - ...(cacheHitRatio !== null && cacheHitRatio !== undefined - ? { cacheHitRatio: Number(cacheHitRatio) } - : {}), - ...(cacheSizeBytes !== null && cacheSizeBytes !== undefined - ? { cacheSizeBytes: Number(cacheSizeBytes) } - : {}), - ...(lastCompaction !== null && lastCompaction !== undefined - ? { lastCompaction: String(lastCompaction) } - : {}), - ...(embedderModelId !== null && embedderModelId !== undefined - ? { embedderModelId: String(embedderModelId) } - : {}), - }; - } - - async setMeta(meta: StoreMeta): Promise { - const pool = this.requirePool(); - const statsJson = meta.stats ? JSON.stringify(meta.stats) : null; - // MERGE by id=1 so repeat writes update in place without carrying a - // separate DELETE pass. - await pool.query( - `MERGE (m:StoreMeta {id: 1}) ` + - `SET m.schema_version = $p1, m.last_commit = $p2, m.indexed_at = $p3, ` + - `m.node_count = $p4, m.edge_count = $p5, m.stats_json = $p6, ` + - `m.cache_hit_ratio = $p7, m.cache_size_bytes = $p8, m.last_compaction = $p9, ` + - `m.embedder_model_id = $p10`, - [ - meta.schemaVersion, - meta.lastCommit ?? null, - meta.indexedAt, - meta.nodeCount, - meta.edgeCount, - statsJson, - meta.cacheHitRatio ?? null, - meta.cacheSizeBytes ?? null, - meta.lastCompaction ?? null, - meta.embedderModelId ?? 
null, - ], - ); - } - - async healthCheck(): Promise<{ ok: boolean; message?: string }> { - if (!this.pool?.isOpen()) { - return { ok: false, message: "graph-db: pool not open" }; - } - try { - await this.pool.query("RETURN 1 AS one"); - return { ok: true }; - } catch (err) { - return { ok: false, message: (err as Error).message }; - } - } - - // -------------------------------------------------------------------------- - // execCypher — IGraphStore optional escape hatch - // -------------------------------------------------------------------------- - - /** - * {@link IGraphStore.execCypher} implementation. Delegates to the - * pre-existing {@link query} method which already enforces read-only - * Cypher via {@link assertReadOnlyCypher}. - * - * OCH core never calls this — it exists so community tooling that - * needs raw Cypher (e.g. APOC analogues on a Neo4j adapter fork) can - * route through `OpenStoreResult.graph.execCypher(...)`. The signature - * accepts a `Record` params bag (Cypher's bound-name - * model) rather than the positional `SqlParam[]` shape the legacy - * `query` method takes. - */ - async execCypher( - statement: string, - params: Record = {}, - ): Promise[]> { - if (!this.pool) { - throw new Error("graph-db: execCypher called before open()"); - } - assertReadOnlyCypher(statement); - // Lower-cast to readonly SqlParam[] expected by the existing pool API. - // The pool driver accepts a record of named params or a positional list; - // we forward a positional list extracted from the values for now. 
- const positional: SqlParam[] = []; - for (const v of Object.values(params)) { - if ( - v === null || - typeof v === "string" || - typeof v === "number" || - typeof v === "boolean" || - typeof v === "bigint" - ) { - positional.push(v as SqlParam); - } else { - positional.push(JSON.stringify(v)); - } - } - return this.pool.query(statement, positional, { timeoutMs: this.defaultTimeoutMs }); - } - - // -------------------------------------------------------------------------- - // Internal helpers - // -------------------------------------------------------------------------- - - private requirePool(): GraphDbPool { - if (!this.pool?.isOpen()) { - throw new Error("graph-db: query called before open()"); - } - return this.pool; - } - - private async ensureFtsExtension(): Promise { - if (this.ftsExtensionLoaded) return; - const pool = this.requirePool(); - try { - if (!this.readOnly) await pool.query("INSTALL FTS;"); - await pool.query("LOAD EXTENSION FTS;"); - this.ftsExtensionLoaded = true; - } catch (err) { - throw new Error(`graph-db: FTS extension unavailable: ${(err as Error).message}`); - } - } - - private async ensureVectorExtension(): Promise { - if (this.vectorExtensionLoaded) return; - const pool = this.requirePool(); - try { - if (!this.readOnly) await pool.query("INSTALL VECTOR;"); - await pool.query("LOAD EXTENSION VECTOR;"); - this.vectorExtensionLoaded = true; - } catch (err) { - throw new Error(`graph-db: VECTOR extension unavailable: ${(err as Error).message}`); - } - } - - private async ensureFtsIndex(): Promise { - if (this.ftsIndexBuilt) return; - // Read-only opens cannot run `CALL CREATE_FTS_INDEX` (lbug rejects - // writes against a readOnly Database). The index is built at - // bulk-load time on the write path; readers just query it. 
- if (this.readOnly) { - this.ftsIndexBuilt = true; - return; - } - const pool = this.requirePool(); - // `CALL CREATE_FTS_INDEX` fails if the index already exists; swallow - // that specific failure so the call is idempotent from the adapter's - // point of view. Any other error (missing table, permission) surfaces. - try { - await pool.query( - "CALL CREATE_FTS_INDEX('CodeNode', 'och_fts', ['name', 'signature', 'description'])", - ); - } catch (err) { - const msg = (err as Error).message ?? ""; - if (!/exist|already/i.test(msg)) throw err; - } - this.ftsIndexBuilt = true; - } - - private async ensureVectorIndex(): Promise { - if (this.vectorIndexBuilt) return; - if (this.readOnly) { - this.vectorIndexBuilt = true; - return; - } - const pool = this.requirePool(); - try { - await pool.query("CALL CREATE_VECTOR_INDEX('Embedding', 'och_vec', 'vector')"); - } catch (err) { - const msg = (err as Error).message ?? ""; - if (!/exist|already/i.test(msg)) throw err; - } - this.vectorIndexBuilt = true; - } - - // -------------------------------------------------------------------------- - // Public getters retained for option introspection. - // -------------------------------------------------------------------------- - - getPath(): string { - return this.path; - } - - isReadOnly(): boolean { - return this.readOnly; - } - - getEmbeddingDim(): number { - return this.embeddingDim; - } - - getDefaultTimeoutMs(): number { - return this.defaultTimeoutMs; - } -} - -// --------------------------------------------------------------------------- -// Helpers — parameter building, column translation. 
-// --------------------------------------------------------------------------- - -interface EdgeRow { - readonly id: string; - readonly from: NodeId; - readonly to: NodeId; - readonly type: RelationType; - readonly confidence: number; - readonly reason?: string; - readonly step?: number; -} - -/** - * Rewrite a DuckDB-style whereClause (using `?` placeholders and `n.*` - * column references) into Cypher (using `$pN` placeholders and `node.*`). - * The substitution is positional — every `?` is replaced by the next - * `$pN` as chosen by the caller-provided name generator. - */ -function rewriteWhereClause(clause: string, nextName: () => string): string { - let rewritten = clause.replace(/\bn\./g, "node."); - rewritten = rewritten.replace(/\?/g, () => nextName()); - return rewritten; -} - -/** - * Emit `'escaped'` form for a string that MUST be inlined into a Cypher - * statement (e.g. inside a variable-length traversal where the native - * engine rejects bound parameters). The caller is responsible for - * guaranteeing the value is string-typed; we only escape `\\` and `'`. - */ -function cypherStringLiteral(value: string): string { - if (typeof value !== "string") { - throw new Error(`cypherStringLiteral expects a string, got ${typeof value}`); - } - const escaped = value.replace(/\\/g, "\\\\").replace(/'/g, "\\'"); - return `'${escaped}'`; -} - -/** - * Emit a Cypher numeric literal from a finite JS number. Used when the - * native engine's parameter path is unavailable — the caller pre-validates - * the input so non-finite values surface as a clean error rather than a - * silent string concat. 
- */ -function cypherNumberLiteral(value: number): string { - if (typeof value !== "number" || !Number.isFinite(value)) { - throw new Error(`cypherNumberLiteral expects a finite number, got ${String(value)}`); - } - return value.toString(); -} - -// --------------------------------------------------------------------------- -// listNodes rehydration helpers — read every column the writer emits and -// rebuild a typed GraphNode with the same field set the original write -// carried. Mirrors the DuckStore `rowToGraphNode` helper byte-for-byte so -// cross-adapter parity holds when callers serialise via canonicalJson. -// --------------------------------------------------------------------------- - -/** - * Clamp a number to a non-negative integer. Local to this adapter so the - * file remains self-contained; semantics match the DuckStore helper of - * the same shape — `0` is preserved, `undefined`/negative/non-finite all - * fall through to `undefined`. - */ -function clampNonNegativeIntGd(v: number | undefined): number | undefined { - if (v === undefined || v === null) return undefined; - if (typeof v !== "number" || !Number.isFinite(v)) return undefined; - if (v < 0) return undefined; - return Math.floor(v); -} - -/** - * Rehydrate a Cypher record from `MATCH (n:CodeNode) RETURN n.col AS col …` - * into a typed {@link GraphNode}. Inverse of {@link nodeToParams}: every - * column it writes is read back here. - * - * Returns `undefined` if the load-bearing primary-key columns (`id` / - * `kind` / `name` / `file_path`) are missing. 
- */ -function recordToGraphNode(rec: Record): GraphNode | undefined { - const id = rec["id"]; - const kindVal = rec["kind"]; - const name = rec["name"]; - const filePath = rec["file_path"]; - if ( - typeof id !== "string" || - typeof kindVal !== "string" || - typeof name !== "string" || - typeof filePath !== "string" - ) { - return undefined; - } - const isOperation = kindVal === "Operation"; - const out: Record = { - id, - kind: kindVal, - name, - filePath, - }; - - setStringFieldGd(out, "signature", rec["signature"]); - setNumberFieldGd(out, "startLine", rec["start_line"]); - setNumberFieldGd(out, "endLine", rec["end_line"]); - setBooleanFieldGd(out, "isExported", rec["is_exported"]); - setNumberFieldGd(out, "parameterCount", rec["parameter_count"]); - setStringFieldGd(out, "returnType", rec["return_type"]); - setStringFieldGd(out, "declaredType", rec["declared_type"]); - setStringFieldGd(out, "owner", rec["owner"]); - setStringFieldGd(out, "url", rec["url"]); - if (isOperation) { - setStringFieldGd(out, "method", rec["http_method"]); - setStringFieldGd(out, "path", rec["http_path"]); - } else { - setStringFieldGd(out, "method", rec["method"]); - } - setStringFieldGd(out, "toolName", rec["tool_name"]); - setStringFieldGd(out, "content", rec["content"]); - setStringFieldGd(out, "contentHash", rec["content_hash"]); - setStringFieldGd(out, "inferredLabel", rec["inferred_label"]); - setNumberFieldGd(out, "symbolCount", rec["symbol_count"]); - setNumberFieldGd(out, "cohesion", rec["cohesion"]); - setStringArrayFieldGd(out, "keywords", rec["keywords"]); - setStringFieldGd(out, "entryPointId", rec["entry_point_id"]); - setNumberFieldGd(out, "stepCount", rec["step_count"]); - setNumberFieldGd(out, "level", rec["level"]); - setStringArrayFieldGd(out, "responseKeys", rec["response_keys"]); - setStringFieldGd(out, "description", rec["description"]); - setStringFieldGd(out, "severity", rec["severity"]); - setStringFieldGd(out, "ruleId", rec["rule_id"]); - 
setStringFieldGd(out, "scannerId", rec["scanner_id"]); - setStringFieldGd(out, "message", rec["message"]); - setJsonObjectFieldGd(out, "propertiesBag", rec["properties_bag"]); - setStringFieldGd(out, "version", rec["version"]); - setStringFieldGd(out, "license", rec["license"]); - setStringFieldGd(out, "lockfileSource", rec["lockfile_source"]); - setStringFieldGd(out, "ecosystem", rec["ecosystem"]); - setStringFieldGd(out, "summary", rec["summary"]); - setStringFieldGd(out, "operationId", rec["operation_id"]); - setStringFieldGd(out, "emailHash", rec["email_hash"]); - setStringFieldGd(out, "emailPlain", rec["email_plain"]); - setJsonArrayFieldGd(out, "languages", rec["languages_json"]); - applyFrameworksJsonReadbackGd(out, rec["frameworks_json"]); - setJsonArrayFieldGd(out, "iacTypes", rec["iac_types_json"]); - setJsonArrayFieldGd(out, "apiContracts", rec["api_contracts_json"]); - setJsonArrayFieldGd(out, "manifests", rec["manifests_json"]); - setJsonArrayFieldGd(out, "srcDirs", rec["src_dirs_json"]); - setStringFieldGd(out, "orphanGrade", rec["orphan_grade"]); - setBooleanFieldGd(out, "isOrphan", rec["is_orphan"]); - setNumberFieldGd(out, "truckFactor", rec["truck_factor"]); - setNumberFieldGd(out, "ownershipDrift30d", rec["ownership_drift_30d"]); - setNumberFieldGd(out, "ownershipDrift90d", rec["ownership_drift_90d"]); - setNumberFieldGd(out, "ownershipDrift365d", rec["ownership_drift_365d"]); - setStringFieldGd(out, "deadness", denormalizeDeadnessGd(rec["deadness"])); - setNumberFieldGd(out, "coveragePercent", rec["coverage_percent"]); - setStringFieldGd(out, "coveredLinesJson", rec["covered_lines_json"]); - setNumberFieldGd(out, "cyclomaticComplexity", rec["cyclomatic_complexity"]); - setNumberFieldGd(out, "nestingDepth", rec["nesting_depth"]); - setNumberFieldGd(out, "nloc", rec["nloc"]); - setNumberFieldGd(out, "halsteadVolume", rec["halstead_volume"]); - setStringFieldGd(out, "inputSchemaJson", rec["input_schema_json"]); - setStringFieldGd(out, 
"partialFingerprint", rec["partial_fingerprint"]); - setStringFieldGd(out, "baselineState", rec["baseline_state"]); - setStringFieldGd(out, "suppressedJson", rec["suppressed_json"]); - if (kindVal === "Repo") { - out["originUrl"] = readNullableStringGd(rec["origin_url"]); - setStringFieldGd(out, "repoUri", rec["repo_uri"]); - out["defaultBranch"] = readNullableStringGd(rec["default_branch"]); - setStringFieldGd(out, "commitSha", rec["commit_sha"]); - setStringFieldGd(out, "indexTime", rec["index_time"]); - out["group"] = readNullableStringGd(rec["repo_group"]); - setStringFieldGd(out, "visibility", rec["visibility"]); - setStringFieldGd(out, "indexer", rec["indexer"]); - out["languageStats"] = readLanguageStatsGd(rec["language_stats_json"]); - } - return out as unknown as GraphNode; -} - -function setStringFieldGd(out: Record, key: string, v: unknown): void { - if (typeof v === "string" && v.length > 0) out[key] = v; -} - -function setNumberFieldGd(out: Record, key: string, v: unknown): void { - if (v === null || v === undefined) return; - if (typeof v === "number" && Number.isFinite(v)) { - out[key] = v; - return; - } - if (typeof v === "bigint") { - out[key] = Number(v); - return; - } - if (typeof v === "string" && /^-?\d+(\.\d+)?$/.test(v)) { - const n = Number(v); - if (Number.isFinite(n)) out[key] = n; - } -} - -function setBooleanFieldGd(out: Record, key: string, v: unknown): void { - if (typeof v === "boolean") out[key] = v; -} - -function setStringArrayFieldGd(out: Record, key: string, v: unknown): void { - // The writer ({@link encodeNodeCol}) encodes the "absent vs explicit-empty" - // distinction the same way regardless of lbug version: an absent field is - // stored as a bare `[]`, while a genuinely empty `keywords: []` is stored as - // a single-element marker. This reader is the symmetric decode: - // - null / non-array → absent: omit the key entirely. - // - [] (bare empty) → absent: omit the key entirely. 
- // - [marker] → explicit empty array: set `[]`. - // - any other array → reconstruct the string elements verbatim. - // - // The bare-`[]` → absent rule is load-bearing across the lbug 0.16→0.17 - // serialization change: lbug v0.16.1 collapsed a 0-length `STRING[]` to SQL - // NULL on write, so an absent field read back as `null` (caught by the - // non-array branch). lbug v0.17.0 (PR #471) made empty lists round-trip as a - // typed empty `STRING[]`, so an absent field now reads back as `[]` instead. - // Either way "no real elements and no marker" means absent — never emit a - // spurious `{keywords: []}` that would diverge from the canonical-JSON - // projection (`{}`) and break `graphHash` byte-identity. - if (!Array.isArray(v)) return; - if (v.length === 1 && v[0] === EMPTY_STRING_ARRAY_MARKER) { - out[key] = []; - return; - } - const arr: string[] = []; - for (const item of v) if (typeof item === "string") arr.push(item); - if (arr.length === 0) return; - out[key] = arr; -} - -function setJsonArrayFieldGd(out: Record, key: string, v: unknown): void { - if (typeof v !== "string" || v.length === 0) return; - try { - const parsed = JSON.parse(v); - if (Array.isArray(parsed)) out[key] = parsed; - } catch { - /* skip */ - } -} - -function setJsonObjectFieldGd(out: Record, key: string, v: unknown): void { - if (typeof v !== "string" || v.length === 0) return; - try { - const parsed = JSON.parse(v); - if (parsed !== null && typeof parsed === "object" && !Array.isArray(parsed)) { - out[key] = parsed; - } - } catch { - /* skip */ - } -} - -function applyFrameworksJsonReadbackGd(out: Record, v: unknown): void { - if (typeof v !== "string" || v.length === 0) return; - try { - const parsed = JSON.parse(v); - if (Array.isArray(parsed)) { - out["frameworks"] = parsed; - return; - } - if (parsed && typeof parsed === "object") { - const env = parsed as { flat?: unknown; detected?: unknown }; - if (Array.isArray(env.flat)) out["frameworks"] = env.flat; - if 
(Array.isArray(env.detected) && env.detected.length > 0) { - out["frameworksDetected"] = env.detected; - } - } - } catch { - /* skip */ - } -} - -function denormalizeDeadnessGd(v: unknown): unknown { - if (v === "unreachable_export") return "unreachable-export"; - return v; -} - -function readNullableStringGd(v: unknown): string | null { - if (typeof v === "string" && v.length > 0) return v; - return null; -} - -function readLanguageStatsGd(v: unknown): Readonly> { - if (typeof v !== "string" || v.length === 0) return {}; - try { - const parsed = JSON.parse(v); - if (parsed && typeof parsed === "object" && !Array.isArray(parsed)) { - const out: Record = {}; - for (const [k, val] of Object.entries(parsed as Record)) { - if (typeof val === "number" && Number.isFinite(val)) out[k] = val; - } - return out; - } - } catch { - /* fallthrough */ - } - return {}; -} diff --git a/packages/storage/src/graphdb-pool.test.ts b/packages/storage/src/graphdb-pool.test.ts deleted file mode 100644 index 1a632728..00000000 --- a/packages/storage/src/graphdb-pool.test.ts +++ /dev/null @@ -1,336 +0,0 @@ -/** - * Concurrency regression suite for {@link GraphDbPool}. - * - * Every test injects a fake `NativeBinding` into the pool so the suite - * runs without touching the native binding. That lets us drive exact - * timing, force queue saturation, and inspect internal counters — none - * of which are available when running against the real native binding. - * - * Scenarios: - * 1. 100 concurrent reads against one pool do not deadlock. The fake - * connection delays each query by 5ms; the suite asserts every - * promise resolves and that `available` returns to full strength. - * 2. Per-call `timeoutMs` aborts a long-running query. The fake - * connection ignores cancellation (matches the native binding), - * so the pool's own timeout race is what the test verifies. - * 3. Waiter timeout when the pool is saturated. 
With - * `maxConnections: 2` and a slow fake connection, the third - * concurrent read waits past `waiterTimeoutMs` and rejects with a - * clear message. - * 4. Idle sweep closes pools whose last use was older than - * `idleTimeoutMs`. The test calls `runIdleSweep` with a frozen - * `now` far in the future to avoid a real wall-clock wait. - * 5. LRU eviction when the registry is at `maxPoolSize`. Opening a - * sixth pool evicts the oldest-by-`lastUsed` entry. - */ - -import assert from "node:assert/strict"; -import { afterEach, test } from "node:test"; -import { GraphDbStore } from "./graphdb-adapter.js"; -import { - _poolRegistrySize, - _resetPoolRegistry, - GraphDbPool, - type NativeBinding, - type NativeConnection, - type NativeDatabase, - type NativePreparedStatement, - type NativeQueryResult, - runIdleSweep, -} from "./graphdb-pool.js"; - -// --------------------------------------------------------------------------- -// Fake native binding — a duck-typed stand-in for @ladybugdb/core. -// --------------------------------------------------------------------------- - -interface FakeConfig { - /** Milliseconds each `conn.query()` call sleeps before resolving. */ - readonly queryLatencyMs?: number; - /** Rows each `getAll()` returns. */ - readonly rows?: readonly Record[]; -} - -function makeFakeBinding(cfg: FakeConfig = {}): NativeBinding { - const latency = cfg.queryLatencyMs ?? 0; - const rows = cfg.rows ?? 
[{ ok: 1 }]; - - class FakeResult implements NativeQueryResult { - async getAll(): Promise[]> { - return [...rows]; - } - } - - class FakePreparedStatement implements NativePreparedStatement { - isSuccess(): boolean { - return true; - } - getErrorMessage(): string { - return ""; - } - } - - class FakeConnection implements NativeConnection { - private closed = false; - - async query(_stmt: string): Promise { - if (latency > 0) { - await new Promise((resolve) => setTimeout(resolve, latency)); - } - if (this.closed) throw new Error("connection closed"); - return new FakeResult(); - } - async prepare(_stmt: string): Promise { - return new FakePreparedStatement(); - } - async execute( - _stmt: NativePreparedStatement, - _params?: Record, - ): Promise { - if (latency > 0) { - await new Promise((resolve) => setTimeout(resolve, latency)); - } - return new FakeResult(); - } - async close(): Promise { - this.closed = true; - } - } - - class FakeDatabase implements NativeDatabase { - async close(): Promise {} - } - - // The cast is deliberate — NativeBinding's constructors expect - // arbitrary args; our fakes accept them via `...args` on the runtime - // but typescript complains about the arity/variance without an - // unknown bounce. - return { - Database: FakeDatabase as unknown as NativeBinding["Database"], - Connection: FakeConnection as unknown as NativeBinding["Connection"], - }; -} - -afterEach(() => { - _resetPoolRegistry(); -}); - -// --------------------------------------------------------------------------- -// 1. 100 concurrent reads do not deadlock -// --------------------------------------------------------------------------- - -test("100 concurrent reads against one pool complete without deadlock", async () => { - const pool = new GraphDbPool("/tmp/graphdb-concurrency-100.db", { - binding: makeFakeBinding({ queryLatencyMs: 2 }), - // Default maxConnections (8) is plenty for 100 reads — the point - // is that every queue handoff lands cleanly. 
- }); - await pool.open(); - try { - const results = await Promise.all( - Array.from({ length: 100 }, (_, i) => pool.query(`MATCH RETURN ${i}`)), - ); - assert.equal(results.length, 100); - for (const rows of results) { - assert.equal(rows.length, 1); - } - // After the fan-out settles, every connection should be back in - // `available` and no checkouts should remain outstanding. - const stats = pool.stats(); - assert.equal(stats.checkedOut, 0); - assert.equal(stats.waiters, 0); - assert.equal(stats.available, 8); - } finally { - await pool.close(); - } -}); - -// --------------------------------------------------------------------------- -// 2. Per-call timeoutMs propagates into query() -// --------------------------------------------------------------------------- - -test("per-call timeoutMs aborts a long-running query", async () => { - const pool = new GraphDbPool("/tmp/graphdb-timeout.db", { - binding: makeFakeBinding({ queryLatencyMs: 500 }), - queryTimeoutMs: 30_000, // default stays untouched - }); - await pool.open(); - try { - const started = Date.now(); - await assert.rejects( - () => pool.query("MATCH RETURN 1", undefined, { timeoutMs: 50 }), - /timed out after 50ms/, - ); - // The reject must happen in well under the fake's 500ms latency. - assert.ok(Date.now() - started < 400, "timeout should fire before the fake resolves"); - } finally { - await pool.close(); - } -}); - -test("per-call timeoutMs also propagates when the adapter wraps the pool", async () => { - const store = new GraphDbStore("/tmp/graphdb-store-timeout.db", { - poolConfig: { - binding: makeFakeBinding({ queryLatencyMs: 500 }), - }, - }); - await store.open(); - try { - await assert.rejects( - () => store.query("MATCH RETURN 1", undefined, { timeoutMs: 50 }), - /timed out after 50ms/, - ); - } finally { - await store.close(); - } -}); - -// --------------------------------------------------------------------------- -// 3. 
Waiter timeout when the pool is saturated -// --------------------------------------------------------------------------- - -test("waiter timeout fires when pool is saturated beyond maxConnections", async () => { - const pool = new GraphDbPool("/tmp/graphdb-waiter-timeout.db", { - binding: makeFakeBinding({ queryLatencyMs: 500 }), - maxConnections: 2, - // Shorten the waiter timeout so the test stays under 1s while still - // exercising the real code path — production still runs with 15s - // defaults. - waiterTimeoutMs: 100, - }); - await pool.open(); - try { - const slow1 = pool.query("MATCH RETURN 1"); - const slow2 = pool.query("MATCH RETURN 2"); - // Give the scheduler a microtask to route the first two checkouts. - await new Promise((resolve) => setImmediate(resolve)); - const thirdStarted = Date.now(); - await assert.rejects( - () => pool.query("MATCH RETURN 3"), - /timed out after 100ms waiting for a free connection/, - ); - // The reject must arrive within a small slop of the waiter timeout — - // shouldn't wait for the slow fakes to finish. - const elapsed = Date.now() - thirdStarted; - assert.ok(elapsed < 400, `waiter rejected in ${elapsed}ms; expected < 400ms`); - // Let the originals drain so afterEach() does not race the sweep. - await Promise.all([slow1, slow2]); - } finally { - await pool.close(); - } -}); - -// --------------------------------------------------------------------------- -// 4. Idle sweep releases unused Connections -// --------------------------------------------------------------------------- - -test("runIdleSweep closes pools whose lastUsed is older than idleTimeoutMs", async () => { - const pool = new GraphDbPool("/tmp/graphdb-idle-sweep.db", { - binding: makeFakeBinding(), - idleTimeoutMs: 1_000, - // Use a long sweep interval — the test invokes runIdleSweep directly - // rather than waiting on the timer. 
- idleSweepIntervalMs: 60_000, - }); - await pool.open(); - assert.equal(_poolRegistrySize(), 1); - // Idle sweep with `now` before the threshold → pool stays. - runIdleSweep(Date.now()); - assert.equal(_poolRegistrySize(), 1); - // Jump `now` well past `idleTimeoutMs` → pool is swept. - runIdleSweep(Date.now() + 10_000); - assert.equal(_poolRegistrySize(), 0); - assert.equal(pool.isOpen(), true); - // Cleanup — pool.close() is a no-op on a swept entry. - await pool.close(); -}); - -// --------------------------------------------------------------------------- -// 5. LRU eviction when the registry is at maxPoolSize -// --------------------------------------------------------------------------- - -test("opening a 6th pool evicts the LRU entry when maxPoolSize is 5", async () => { - const binding = makeFakeBinding(); - const pools: GraphDbPool[] = []; - for (let i = 0; i < 5; i += 1) { - const pool = new GraphDbPool(`/tmp/graphdb-lru-${i}.db`, { - binding, - maxPoolSize: 5, - }); - await pool.open(); - pools.push(pool); - // Tick apart the lastUsed timestamps so LRU picks a deterministic - // victim. Using setImmediate isn't enough on fast CPUs — use a 2ms - // sleep so Date.now() advances. - await new Promise((resolve) => setTimeout(resolve, 2)); - } - assert.equal(_poolRegistrySize(), 5); - - // Touch pools[1..4] with one query each, in rising order, so their - // `lastUsed` timestamps move forward. pools[0] (`lru-0`) is never - // touched, so it stays the oldest entry and is the LRU victim. - for (let i = 1; i < 5; i += 1) { - const p = pools[i]; - if (!p) throw new Error("unreachable"); - await p.query("MATCH RETURN 0"); - await new Promise((resolve) => setTimeout(resolve, 2)); - } - - const newest = new GraphDbPool("/tmp/graphdb-lru-new.db", { - binding, - maxPoolSize: 5, - }); - await newest.open(); - // The registry should still hold exactly 5 entries — the LRU was - // evicted to make room. 
- assert.equal(_poolRegistrySize(), 5); - // `pools[0]` is the evicted one — its next query() call must throw - // "evicted", not silently succeed. - const evicted = pools[0]; - if (!evicted) throw new Error("unreachable"); - await assert.rejects(() => evicted.query("MATCH RETURN 0"), /evicted/); - - await newest.close(); - for (let i = 1; i < 5; i += 1) { - const p = pools[i]; - if (p) await p.close(); - } -}); - -// --------------------------------------------------------------------------- -// 6. Parameterized queries use prepare/execute and still respect timeouts -// --------------------------------------------------------------------------- - -test("parameterized query uses prepare + execute path", async () => { - const pool = new GraphDbPool("/tmp/graphdb-parameterized.db", { - binding: makeFakeBinding({ queryLatencyMs: 0, rows: [{ hit: true }] }), - }); - await pool.open(); - try { - const rows = await pool.query("MATCH WHERE id = $p1 RETURN hit", ["abc"]); - assert.equal(rows.length, 1); - assert.deepEqual(rows[0], { hit: true }); - } finally { - await pool.close(); - } -}); - -// --------------------------------------------------------------------------- -// 7. Refcount: parallel stores over the same path share one Database -// --------------------------------------------------------------------------- - -test("parallel pool handles over the same path share a single registry entry", async () => { - const binding = makeFakeBinding(); - const p1 = new GraphDbPool("/tmp/graphdb-shared.db", { binding }); - const p2 = new GraphDbPool("/tmp/graphdb-shared.db", { binding }); - await p1.open(); - await p2.open(); - assert.equal(_poolRegistrySize(), 1); - assert.equal(p1.stats().refCount, 2); - // First close: refcount drops to 1, the underlying entry stays alive. - await p1.close(); - assert.equal(_poolRegistrySize(), 1); - // Second close: refcount 0 → entry is torn down. 
- await p2.close(); - assert.equal(_poolRegistrySize(), 0); -}); diff --git a/packages/storage/src/graphdb-pool.ts b/packages/storage/src/graphdb-pool.ts deleted file mode 100644 index e632d3ba..00000000 --- a/packages/storage/src/graphdb-pool.ts +++ /dev/null @@ -1,613 +0,0 @@ -/** - * Connection pool for the graph-database backend. - * - * Design goals: - * - * 1. **Single-writer-multi-reader model.** One native `Database` per store - * path, with a bounded fan-out of `Connection` objects on top of it. - * Multiple `Connection`s from the same `Database` is the officially - * supported concurrency pattern of the underlying native binding. - * - * 2. **One query per Connection at a time.** The native binding segfaults - * when two `.query()` calls race against a single `Connection`. The - * pool enforces this invariant structurally: every `query()` call - * checks out a connection, runs exactly one statement, and checks it - * back in. Queries compete for connections, never for a single - * connection. - * - * 3. **Checkout queue with back-pressure.** When every connection is - * busy, callers queue; waiters timeout at `WAITER_TIMEOUT_MS` so a - * hung backend never leaks unbounded promises. - * - * 4. **Query timeout.** Each `query()` races against `QUERY_TIMEOUT_MS` - * so a stuck query releases its slot even if the native call never - * returns. Per-call `timeoutMs` overrides the default. - * - * 5. **Idle sweep + LRU eviction.** A single process-wide sweep runs - * every `IDLE_SWEEP_INTERVAL_MS`, closing pools whose last use was - * more than `IDLE_TIMEOUT_MS` ago and whose connections are all - * idle. The LRU pathway evicts the least-recently-used pool when - * the process-wide cap `MAX_POOL_SIZE` is reached. 
- * - * Native binding surface (`@ladybugdb/core@0.16.1`): - * - * - Global registry keys by the resolved `dbPath` and exposes - * `GraphDbPool` as an instance object so `GraphDbStore.open()` / - * `.close()` can drive the lifecycle without a second name registry. - * - Timing knobs: `MAX_CONNS_PER_REPO=8`, waiter 15 s, query 30 s, - * idle 60 s sweep + 5 min timeout, pool cap 5. - * - Calls used here: `Database(path, bufferManagerSize, - * enableCompression, readOnly)`, `new Connection(db)`, - * `conn.query(stmt) → Promise`, - * `result.getAll() → Promise[]>`. Prepared - * statements use `conn.prepare(stmt)` + `conn.execute(stmt, params)`. - */ - -import type { SqlParam } from "./interface.js"; - -/** - * Structural shape of a native `Database`. Keeping the interface - * statically typed (rather than reaching for `any`) lets tests inject a - * fake by duck-typing. - */ -export interface NativeDatabase { - close(): Promise; -} - -/** - * Structural shape of a native `Connection`. Typed to what the pool - * actually calls — `query()` + `prepare()` + `execute()` + `close()`. - */ -export interface NativeConnection { - query(stmt: string): Promise; - prepare(stmt: string): Promise; - execute( - stmt: NativePreparedStatement, - params?: Record, - ): Promise; - close(): Promise; -} - -export interface NativeQueryResult { - getAll(): Promise[]>; - close?(): void; -} - -export interface NativePreparedStatement { - isSuccess(): boolean; - getErrorMessage(): string; -} - -/** - * Structural shape of the `@ladybugdb/core` default export used by the - * pool. Injected so tests can swap in fakes without loading the native - * binding. 
- */ -export interface NativeBinding { - Database: new ( - path: string, - bufferManagerSize?: number, - enableCompression?: boolean, - readOnly?: boolean, - maxDBSize?: number, - ) => NativeDatabase; - Connection: new (db: NativeDatabase) => NativeConnection; -} - -export interface GraphDbPoolConfig { - /** Max connections held per database file. Default 8. */ - readonly maxConnections?: number; - /** Global cap on number of distinct pools kept alive. Default 5. */ - readonly maxPoolSize?: number; - /** Milliseconds a checkout waiter can block before rejecting. Default 15000. */ - readonly waiterTimeoutMs?: number; - /** Default milliseconds a single query may run before aborting. Default 30000. */ - readonly queryTimeoutMs?: number; - /** Milliseconds of idleness before a pool is eligible for closure. Default 300000 (5 min). */ - readonly idleTimeoutMs?: number; - /** How often the idle sweep runs. Default 60000 (60 s). */ - readonly idleSweepIntervalMs?: number; - /** Open the database read-only. Default false. */ - readonly readOnly?: boolean; - /** - * Buffer manager (temp-page region) size in bytes. lbug's native default is - * `min(systemMemory, maxDBSize) * 0.8` — easily 50+ GiB on a beefy host. We - * cap at 2 GiB by default (`DEFAULT_BUFFER_MANAGER_BYTES`) so the in-memory - * page pool stays bounded across many concurrent test DBs without affecting - * on-disk capacity. Production callers can pass a larger value for - * analyze-time bulk loads. - */ - readonly bufferManagerBytes?: number; - /** - * Maximum on-disk database size (and the size of the per-Database virtual - * mmap). lbug's native default is `1 << 43` = 8 TiB per Database, which - * exhausts the 47-bit user virtual address space (~128 TiB) after ~16 - * concurrent instances and surfaces as "Buffer manager exception: Mmap - * for size 8796093022208 failed". Must be a power of 2. 
- * - * Default 16 GiB — comfortably larger than any plausible OCH graph - * artifact (~few GiB at the high end), drops per-Database virtual reserve - * 512×, and lets the test suite spin up hundreds of pools without - * address-space pressure. - */ - readonly maxDbBytes?: number; - /** - * Injected native binding. Defaults to `require("@ladybugdb/core")` - * via dynamic import on first `open()`. Tests inject a fake. - */ - readonly binding?: NativeBinding; -} - -/** Defaults preserved from prior-art; changing these is a documented deviation. */ -export const DEFAULT_MAX_CONNECTIONS = 8; -export const DEFAULT_MAX_POOL_SIZE = 5; -export const DEFAULT_WAITER_TIMEOUT_MS = 15_000; -export const DEFAULT_QUERY_TIMEOUT_MS = 30_000; -export const DEFAULT_IDLE_TIMEOUT_MS = 5 * 60 * 1000; -export const DEFAULT_IDLE_SWEEP_INTERVAL_MS = 60_000; -/** - * lbug `bufferManagerSize` cap. 2 GiB. Power of 2 not required. - * - * The buffer manager is the in-memory page cache; under-sizing it surfaces - * as "Buffer manager exception: Unable to allocate memory! The buffer pool - * is full and no memory could be freed!" the moment a single query's hot - * working set exceeds the cap. lbug's native default is `min(systemMem, - * maxDBSize) * 0.8` (≥50 GiB on a beefy host), so we cap explicitly to keep - * concurrent test DBs from contending for physical RAM. 2 GiB is roughly - * the largest BOM-body fixture × 4× headroom for vector ops. - */ -export const DEFAULT_BUFFER_MANAGER_BYTES = 2 * 1024 * 1024 * 1024; -/** lbug `maxDBSize` cap. 16 GiB. MUST be a power of 2 — lbug enforces this. */ -export const DEFAULT_MAX_DB_BYTES = 16 * 1024 * 1024 * 1024; - -interface Waiter { - readonly resolve: (conn: NativeConnection) => void; - readonly reject: (err: Error) => void; - readonly timer: ReturnType; -} - -/** - * Process-wide registry. Keyed by resolved dbPath so parallel `GraphDbStore` - * instances pointing at the same file share one native `Database` and a - * single connection pool. 
Refcounted: the last `close()` against a shared - * path tears the native resources down. - */ -interface RegistryEntry { - readonly db: NativeDatabase; - readonly connections: NativeConnection[]; - readonly available: NativeConnection[]; - readonly waiters: Waiter[]; - readonly path: string; - readonly config: ResolvedPoolConfig; - refCount: number; - checkedOut: number; - lastUsed: number; - closed: boolean; -} - -type ResolvedPoolConfig = Required> & { - binding?: NativeBinding; -}; - -const registry = new Map(); -let sweepTimer: ReturnType | null = null; -let activeSweepIntervalMs: number | null = null; - -// --------------------------------------------------------------------------- -// Idle sweep + LRU eviction -// --------------------------------------------------------------------------- - -function ensureSweepTimer(intervalMs: number): void { - if (sweepTimer && activeSweepIntervalMs === intervalMs) return; - if (sweepTimer) { - clearInterval(sweepTimer); - } - activeSweepIntervalMs = intervalMs; - sweepTimer = setInterval(() => { - runIdleSweep(Date.now()); - }, intervalMs); - if (typeof (sweepTimer as { unref?: () => unknown }).unref === "function") { - (sweepTimer as { unref: () => unknown }).unref(); - } -} - -/** - * Scan every registered pool and close those whose last use was more - * than `idleTimeoutMs` ago with no outstanding checkouts. Exposed for - * tests which inject a frozen clock. 
- */ -export function runIdleSweep(now: number = Date.now()): void { - for (const [path, entry] of registry) { - if (entry.closed) continue; - if (entry.checkedOut !== 0) continue; - if (now - entry.lastUsed < entry.config.idleTimeoutMs) continue; - closeEntry(path); - } -} - -function evictLruIfNeeded(maxPoolSize: number, nextPath: string): void { - const activeCount = [...registry.keys()].filter((p) => p !== nextPath).length; - if (activeCount < maxPoolSize) return; - let oldestPath: string | null = null; - let oldest = Number.POSITIVE_INFINITY; - for (const [path, entry] of registry) { - if (path === nextPath) continue; - if (entry.checkedOut !== 0) continue; - if (entry.lastUsed < oldest) { - oldest = entry.lastUsed; - oldestPath = path; - } - } - if (oldestPath) closeEntry(oldestPath); -} - -function closeEntry(path: string): void { - const entry = registry.get(path); - if (!entry) return; - entry.closed = true; - for (const conn of entry.available) { - conn.close().catch(() => {}); - } - entry.available.length = 0; - entry.connections.length = 0; - for (const waiter of entry.waiters) { - clearTimeout(waiter.timer); - waiter.reject(new Error(`GraphDbPool for ${path} closed while waiting for a connection`)); - } - entry.waiters.length = 0; - entry.db.close().catch(() => {}); - registry.delete(path); - if (registry.size === 0 && sweepTimer) { - clearInterval(sweepTimer); - sweepTimer = null; - activeSweepIntervalMs = null; - } -} - -// --------------------------------------------------------------------------- -// Binding loader -// --------------------------------------------------------------------------- - -async function loadDefaultBinding(): Promise { - // Dynamic import keeps the native dep off the startup path when the - // DuckDB backend is in use. 
The cast passes through `unknown` because - // the native binding's typed surface is richer than the structural - // shape this module uses — we only require `{ Database, Connection }` - // constructors, nothing more. - const mod = (await import("@ladybugdb/core")) as unknown as { - default?: NativeBinding; - } & NativeBinding; - return mod.default ?? mod; -} - -// --------------------------------------------------------------------------- -// GraphDbPool -// --------------------------------------------------------------------------- - -/** - * Pool handle. One instance per `GraphDbStore`; multiple instances over - * the same path share the underlying native `Database` via the process - * registry. - */ -export class GraphDbPool { - private readonly path: string; - private readonly config: ResolvedPoolConfig; - private opened = false; - private closed = false; - - constructor(path: string, config: GraphDbPoolConfig = {}) { - this.path = path; - const resolved: ResolvedPoolConfig = { - maxConnections: config.maxConnections ?? DEFAULT_MAX_CONNECTIONS, - maxPoolSize: config.maxPoolSize ?? DEFAULT_MAX_POOL_SIZE, - waiterTimeoutMs: config.waiterTimeoutMs ?? DEFAULT_WAITER_TIMEOUT_MS, - queryTimeoutMs: config.queryTimeoutMs ?? DEFAULT_QUERY_TIMEOUT_MS, - idleTimeoutMs: config.idleTimeoutMs ?? DEFAULT_IDLE_TIMEOUT_MS, - idleSweepIntervalMs: config.idleSweepIntervalMs ?? DEFAULT_IDLE_SWEEP_INTERVAL_MS, - readOnly: config.readOnly ?? false, - bufferManagerBytes: config.bufferManagerBytes ?? DEFAULT_BUFFER_MANAGER_BYTES, - maxDbBytes: config.maxDbBytes ?? DEFAULT_MAX_DB_BYTES, - }; - // `exactOptionalPropertyTypes` refuses explicit `undefined` on an - // optional property — only omit-or-assign-value is allowed. - if (config.binding !== undefined) { - resolved.binding = config.binding; - } - this.config = resolved; - } - - /** - * Open (or re-use) the underlying `Database` and pre-warm connections. - * Idempotent on the instance level. 
The registry refcount tracks - * multiple stores over the same path. - */ - async open(): Promise { - if (this.opened) return; - if (this.closed) { - throw new Error(`GraphDbPool for ${this.path} has already been closed`); - } - - const binding = this.config.binding ?? (await loadDefaultBinding()); - - let entry = registry.get(this.path); - if (!entry) { - evictLruIfNeeded(this.config.maxPoolSize, this.path); - const db = new binding.Database( - this.path, - this.config.bufferManagerBytes, - false, // enableCompression — default - this.config.readOnly, - this.config.maxDbBytes, - ); - const connections: NativeConnection[] = []; - for (let i = 0; i < this.config.maxConnections; i += 1) { - connections.push(new binding.Connection(db)); - } - entry = { - db, - connections, - available: [...connections], - waiters: [], - path: this.path, - config: this.config, - refCount: 0, - checkedOut: 0, - lastUsed: Date.now(), - closed: false, - }; - registry.set(this.path, entry); - ensureSweepTimer(this.config.idleSweepIntervalMs); - } - entry.refCount += 1; - entry.lastUsed = Date.now(); - this.opened = true; - } - - /** - * Release the pool's refcount. The underlying `Database` is torn down - * only when the last holder closes. Idempotent. - */ - async close(): Promise { - if (!this.opened || this.closed) { - this.closed = true; - return; - } - this.closed = true; - const entry = registry.get(this.path); - if (!entry) return; - entry.refCount -= 1; - if (entry.refCount <= 0) { - closeEntry(this.path); - } - } - - /** - * Execute a read-only statement. The pool checks out a connection, - * runs the query under `timeoutMs`, and returns the parsed rows. - */ - async query( - stmt: string, - params?: readonly SqlParam[], - opts?: { readonly timeoutMs?: number }, - ): Promise[]> { - const entry = this.requireEntry(); - entry.lastUsed = Date.now(); - const timeoutMs = opts?.timeoutMs ?? 
entry.config.queryTimeoutMs; - const conn = await this.acquire(entry); - try { - const exec = - params && params.length > 0 - ? this.runParameterized(conn, stmt, params, timeoutMs) - : this.runDirect(conn, stmt, timeoutMs); - return await exec; - } finally { - this.release(entry, conn); - } - } - - /** - * Execute a write statement that must bypass the Cypher read-only guard - * — used exclusively by the internal bulk-load path for - * `COPY
FROM (UNWIND $rows ...)` calls. Not exposed on - * `IGraphStore`; callers outside `GraphDbStore.bulkLoad` must NOT call - * this method. - */ - async execWrite( - stmt: string, - params?: Record, - opts?: { readonly timeoutMs?: number }, - ): Promise { - const entry = this.requireEntry(); - entry.lastUsed = Date.now(); - const timeoutMs = opts?.timeoutMs ?? entry.config.queryTimeoutMs; - const conn = await this.acquire(entry); - try { - const work = (async () => { - if (params && Object.keys(params).length > 0) { - const prepared = await conn.prepare(stmt); - if (!prepared.isSuccess()) { - throw new Error(`GraphDbPool execWrite prepare failed: ${prepared.getErrorMessage()}`); - } - const res = await conn.execute(prepared, params); - // Drain result to surface any execution errors. - const result = Array.isArray(res) ? res[0] : res; - if (result) await result.getAll(); - } else { - await conn.query(stmt); - } - })(); - await raceWithTimeout(work, timeoutMs, "execWrite"); - } finally { - this.release(entry, conn); - } - } - - /** - * Acquire a connection. Exposed for callers (e.g. bulk-load paths) - * that need to hold a connection across multiple statements. - * Remember to `release()` in `finally`. - */ - async acquire(entry: RegistryEntry = this.requireEntry()): Promise { - entry.lastUsed = Date.now(); - if (entry.available.length > 0) { - entry.checkedOut += 1; - return entry.available.pop() as NativeConnection; - } - if (entry.checkedOut < entry.config.maxConnections) { - // Should never happen — pool is pre-warmed to maxConnections. - // Defensive: surface the leak rather than silently creating one - // (which would desync the `available`/`checkedOut` accounting). 
- throw new Error( - `GraphDbPool integrity error: expected ${entry.config.maxConnections} ` + - `connections but found ${entry.connections.length} ` + - `(${entry.available.length} available, ${entry.checkedOut} checked out)`, - ); - } - return await new Promise((resolve, reject) => { - const timer = setTimeout(() => { - const idx = entry.waiters.findIndex((w) => w.timer === timer); - if (idx !== -1) entry.waiters.splice(idx, 1); - reject( - new Error( - `GraphDbPool exhausted: timed out after ${entry.config.waiterTimeoutMs}ms ` + - `waiting for a free connection`, - ), - ); - }, entry.config.waiterTimeoutMs); - if (typeof (timer as { unref?: () => unknown }).unref === "function") { - (timer as { unref: () => unknown }).unref(); - } - entry.waiters.push({ resolve, reject, timer }); - }); - } - - /** - * Return a connection to the pool. If a waiter is queued, hand the - * connection straight over rather than bouncing through `available`. - */ - release(entry: RegistryEntry, conn: NativeConnection): void { - if (entry.closed) { - // Pool closed while the caller was mid-query — drop the connection. - conn.close().catch(() => {}); - return; - } - if (entry.waiters.length > 0) { - const next = entry.waiters.shift(); - if (next) { - clearTimeout(next.timer); - next.resolve(conn); - return; - } - } - entry.checkedOut -= 1; - entry.available.push(conn); - } - - /** Inspect current queue sizes — used by tests and diagnostics. 
*/ - stats(): { available: number; checkedOut: number; waiters: number; refCount: number } { - const entry = registry.get(this.path); - if (!entry) { - return { available: 0, checkedOut: 0, waiters: 0, refCount: 0 }; - } - return { - available: entry.available.length, - checkedOut: entry.checkedOut, - waiters: entry.waiters.length, - refCount: entry.refCount, - }; - } - - isOpen(): boolean { - return this.opened && !this.closed; - } - - private requireEntry(): RegistryEntry { - if (!this.opened || this.closed) { - throw new Error(`GraphDbPool for ${this.path} is not open`); - } - const entry = registry.get(this.path); - if (!entry || entry.closed) { - throw new Error(`GraphDbPool for ${this.path} has been evicted`); - } - return entry; - } - - private async runDirect( - conn: NativeConnection, - stmt: string, - timeoutMs: number, - ): Promise[]> { - const queryPromise = conn.query(stmt).then(async (res) => { - const result = Array.isArray(res) ? res[0] : res; - if (!result) return [] as Record[]; - return await result.getAll(); - }); - return await raceWithTimeout(queryPromise, timeoutMs, "query"); - } - - private async runParameterized( - conn: NativeConnection, - stmt: string, - params: readonly SqlParam[], - timeoutMs: number, - ): Promise[]> { - // Parameterized queries use prepared statements with positional - // binding names `p1..pN`. The caller passes the template with those - // same names (`WHERE id = $p1`); we wrap the array so callers don't - // have to hand-build the record. - const paramRecord: Record = {}; - for (let i = 0; i < params.length; i += 1) { - paramRecord[`p${i + 1}`] = params[i] as unknown; - } - const work = (async () => { - const prepared = await conn.prepare(stmt); - if (!prepared.isSuccess()) { - throw new Error(`GraphDbPool prepare failed: ${prepared.getErrorMessage()}`); - } - const res = await conn.execute(prepared, paramRecord); - const result = Array.isArray(res) ? 
res[0] : res; - if (!result) return [] as Record[]; - return await result.getAll(); - })(); - return await raceWithTimeout(work, timeoutMs, "query"); - } -} - -/** - * Race `promise` against a timeout. On timeout the returned promise - * rejects, but the underlying work is NOT cancelled — the native layer - * owns that contract. - */ -function raceWithTimeout(promise: Promise, ms: number, label: string): Promise { - let timer: ReturnType | undefined; - const timeoutPromise = new Promise((_, reject) => { - timer = setTimeout(() => reject(new Error(`${label} timed out after ${ms}ms`)), ms); - if (timer && typeof (timer as { unref?: () => unknown }).unref === "function") { - (timer as { unref: () => unknown }).unref(); - } - }); - return Promise.race([promise, timeoutPromise]).finally(() => { - if (timer) clearTimeout(timer); - }); -} - -// --------------------------------------------------------------------------- -// Test helpers — not part of the public surface. Exposed so the concurrency -// suite can inspect internal state without reaching through `any`. -// --------------------------------------------------------------------------- - -/** Number of live pools in the process-wide registry. */ -export function _poolRegistrySize(): number { - return registry.size; -} - -/** Force-close every pool and stop the sweep timer. Used in test teardown. */ -export function _resetPoolRegistry(): void { - for (const path of [...registry.keys()]) { - closeEntry(path); - } - if (sweepTimer) { - clearInterval(sweepTimer); - sweepTimer = null; - activeSweepIntervalMs = null; - } -} diff --git a/packages/storage/src/graphdb-roundtrip.test.ts b/packages/storage/src/graphdb-roundtrip.test.ts deleted file mode 100644 index 8367a3f8..00000000 --- a/packages/storage/src/graphdb-roundtrip.test.ts +++ /dev/null @@ -1,656 +0,0 @@ -/** - * Round-trip parity tests for {@link GraphDbStore}. 
- * - * These tests verify that a knowledge graph survives a bulk-load + rebuild - * cycle byte-identical under `graphHash` on the in-tree lbug `GraphDbStore`. - * The same invariant is what the published `assertGraphParity` / - * `assertIGraphStoreConformance` harnesses hold a community `IGraphStore` - * adapter (AGE / Memgraph / Neo4j / Neptune) to. - * - * Three fixture sizes: - * - small: 2 files + 8 functions + 15 edges (mixed DEFINES / CALLS). - * Exercises the basic node + edge shape. - * - medium: ~60 nodes + ~100 edges. Exercises a wider NodeKind mix - * (Class / Method / Interface / Route) plus a Process / Section / - * Contributor tier so the polymorphic NODE_COLUMNS coverage is visible. - * - large: 100 Function nodes forming a long CALLS chain with an - * interior branch; graphHash determinism at scale matters for the - * Reindex parity gate. - * - * The 23-edge-kind sweep gets its own test so a schema regression that - * silently drops a rel table shows up as a test failure rather than a - * slow-burn round-trip hash mismatch in prod. 
- */ - -import assert from "node:assert/strict"; -import { mkdtemp } from "node:fs/promises"; -import { tmpdir } from "node:os"; -import { join } from "node:path"; -import { test } from "node:test"; -import { - type GraphNode, - graphHash, - KnowledgeGraph, - makeNodeId, - type NodeId, - type RelationType, -} from "@opencodehub/core-types"; -import { GraphDbStore } from "./graphdb-adapter.js"; -import { getAllRelationTypes } from "./graphdb-schema.js"; - -async function scratchDbPath(): Promise { - const dir = await mkdtemp(join(tmpdir(), "och-graphdb-rt-")); - return join(dir, "graph.db"); -} - -async function hasNativeBinding(): Promise { - try { - await import("@ladybugdb/core"); - return true; - } catch { - return false; - } -} - -// --------------------------------------------------------------------------- -// Fixture builders -// --------------------------------------------------------------------------- - -function buildSmallGraph(): KnowledgeGraph { - const g = new KnowledgeGraph(); - - const fileA = makeNodeId("File", "src/a.ts", "a.ts"); - const fileB = makeNodeId("File", "src/b.ts", "b.ts"); - g.addNode({ id: fileA, kind: "File", name: "a.ts", filePath: "src/a.ts" }); - g.addNode({ id: fileB, kind: "File", name: "b.ts", filePath: "src/b.ts" }); - - const funcs: NodeId[] = []; - for (let i = 0; i < 8; i += 1) { - const file = i % 2 === 0 ? "src/a.ts" : "src/b.ts"; - const id = makeNodeId("Function", file, `fn_${i}`, { parameterCount: i % 3 }); - funcs.push(id); - g.addNode({ - id, - kind: "Function", - name: `fn_${i}`, - filePath: file, - startLine: 10 + i, - endLine: 20 + i, - signature: `function fn_${i}(${"x,".repeat(i % 3).replace(/,$/, "")})`, - parameterCount: i % 3, - isExported: i % 2 === 0, - }); - } - - // DEFINES from each file to its functions, plus a CALLS chain. - for (let i = 0; i < funcs.length; i += 1) { - const from = i % 2 === 0 ? 
fileA : fileB; - g.addEdge({ from, to: funcs[i] as NodeId, type: "DEFINES", confidence: 1.0 }); - } - for (let i = 0; i + 1 < funcs.length; i += 1) { - g.addEdge({ - from: funcs[i] as NodeId, - to: funcs[i + 1] as NodeId, - type: "CALLS", - confidence: 0.9, - }); - } - - return g; -} - -function buildMediumGraph(): KnowledgeGraph { - const g = new KnowledgeGraph(); - - // Layer 1: files. - const files: NodeId[] = []; - for (let i = 0; i < 6; i += 1) { - const path = `src/mod${i}/entry.ts`; - const id = makeNodeId("File", path, path); - files.push(id); - g.addNode({ - id, - kind: "File", - name: `entry.ts`, - filePath: path, - contentHash: `hash-${i}`, - }); - } - - // Layer 2: classes + interfaces. - const classes: NodeId[] = []; - for (let i = 0; i < 6; i += 1) { - const file = `src/mod${i}/entry.ts`; - const clsId = makeNodeId("Class", file, `Service${i}`); - classes.push(clsId); - g.addNode({ - id: clsId, - kind: "Class", - name: `Service${i}`, - filePath: file, - startLine: 5, - endLine: 40, - isExported: true, - }); - const ifaceId = makeNodeId("Interface", file, `IService${i}`); - g.addNode({ - id: ifaceId, - kind: "Interface", - name: `IService${i}`, - filePath: file, - isExported: true, - }); - const fileId = files[i]; - if (!fileId) throw new Error("unreachable"); - g.addEdge({ from: fileId, to: clsId, type: "DEFINES", confidence: 1.0 }); - g.addEdge({ from: fileId, to: ifaceId, type: "DEFINES", confidence: 1.0 }); - g.addEdge({ from: clsId, to: ifaceId, type: "IMPLEMENTS", confidence: 1.0 }); - } - - // Layer 3: methods. 
- const methods: NodeId[] = []; - for (let i = 0; i < 6; i += 1) { - const file = `src/mod${i}/entry.ts`; - for (let j = 0; j < 3; j += 1) { - const mId = makeNodeId("Method", file, `Service${i}.method${j}`); - methods.push(mId); - g.addNode({ - id: mId, - kind: "Method", - name: `method${j}`, - filePath: file, - startLine: 10 + j, - endLine: 15 + j, - parameterCount: j, - signature: `method${j}()`, - }); - const clsId = classes[i]; - if (!clsId) throw new Error("unreachable"); - g.addEdge({ from: clsId, to: mId, type: "HAS_METHOD", confidence: 1.0 }); - } - } - - // Sparse CALL graph — even-indexed methods call the next odd-indexed method - // in the same service; a few cross-service calls keep the graph connected. - for (let i = 0; i + 1 < methods.length; i += 2) { - const from = methods[i]; - const to = methods[i + 1]; - if (!from || !to) throw new Error("unreachable"); - g.addEdge({ from, to, type: "CALLS", confidence: 0.8, reason: "synthetic fixture" }); - } - for (let i = 2; i < methods.length; i += 3) { - const from = methods[i]; - const to = methods[(i + 5) % methods.length]; - if (!from || !to) throw new Error("unreachable"); - g.addEdge({ from, to, type: "CALLS", confidence: 0.6, step: 1 }); - } - - // A contributor + ownership edges. 
- const contributor = makeNodeId("Contributor", "", "alice@example.com"); - g.addNode({ - id: contributor, - kind: "Contributor", - name: "alice", - filePath: "", - emailHash: "hashed", - emailPlain: "alice@example.com", - }); - for (const file of files) { - g.addEdge({ from: file, to: contributor, type: "OWNED_BY", confidence: 1.0 }); - } - - return g; -} - -function buildLargeGraph(): KnowledgeGraph { - const g = new KnowledgeGraph(); - const N = 100; - const file = makeNodeId("File", "src/chain.ts", "chain.ts"); - g.addNode({ id: file, kind: "File", name: "chain.ts", filePath: "src/chain.ts" }); - - const funcs: NodeId[] = []; - for (let i = 0; i < N; i += 1) { - const id = makeNodeId("Function", "src/chain.ts", `step_${i}`); - funcs.push(id); - g.addNode({ - id, - kind: "Function", - name: `step_${i}`, - filePath: "src/chain.ts", - startLine: 10 + i, - endLine: 12 + i, - signature: `function step_${i}()`, - parameterCount: i % 4, - isExported: i === 0 || i === N - 1, - }); - g.addEdge({ from: file, to: id, type: "DEFINES", confidence: 1.0 }); - } - // Linear CALLS chain. - for (let i = 0; i + 1 < N; i += 1) { - g.addEdge({ - from: funcs[i] as NodeId, - to: funcs[i + 1] as NodeId, - type: "CALLS", - confidence: 0.95, - }); - } - // Every 10th function also calls the function 10 steps downstream — a - // bounded shortcut that makes the graph non-tree. - for (let i = 0; i + 10 < N; i += 10) { - g.addEdge({ - from: funcs[i] as NodeId, - to: funcs[i + 10] as NodeId, - type: "CALLS", - confidence: 0.5, - step: 1, - }); - } - return g; -} - -// --------------------------------------------------------------------------- -// Read-back helpers -// --------------------------------------------------------------------------- -// -// Each node column → GraphNode field mapping. 
Flat (no kind-specific logic) -// because the fixture graphs only use fields that every kind can hold — the -// additive surface (Contributor.email*, File.contentHash) is covered by the -// medium fixture but still fits this list. - -const NODE_COLUMN_MAP: readonly (readonly [string, string, "number" | "string" | "boolean"])[] = [ - ["start_line", "startLine", "number"], - ["end_line", "endLine", "number"], - ["is_exported", "isExported", "boolean"], - ["signature", "signature", "string"], - ["parameter_count", "parameterCount", "number"], - ["return_type", "returnType", "string"], - ["declared_type", "declaredType", "string"], - ["owner", "owner", "string"], - ["content_hash", "contentHash", "string"], - ["email_hash", "emailHash", "string"], - ["email_plain", "emailPlain", "string"], - // Repo. See graph-hash-parity.test.ts for the parallel mapping. - ["origin_url", "originUrl", "string"], - ["repo_uri", "repoUri", "string"], - ["default_branch", "defaultBranch", "string"], - ["commit_sha", "commitSha", "string"], - ["index_time", "indexTime", "string"], - ["repo_group", "group", "string"], - ["visibility", "visibility", "string"], - ["indexer", "indexer", "string"], -]; - -/** Repo-specific nullable-field / languageStats reconstruction. */ -function applyRepoNullables(rec: Record, base: Record): void { - if (base["kind"] !== "Repo") return; - for (const [col, key] of [ - ["origin_url", "originUrl"], - ["default_branch", "defaultBranch"], - ["repo_group", "group"], - ] as const) { - const v = rec[col]; - if (v === null || v === undefined) base[key] = null; - } - const statsRaw = rec["language_stats_json"]; - if (typeof statsRaw === "string" && statsRaw.length > 0) { - base["languageStats"] = JSON.parse(statsRaw); - } else { - base["languageStats"] = {}; - } -} - -async function rebuildGraphFromStore(store: GraphDbStore): Promise { - // One MATCH per CodeNode column set we care about. 
Ordering by id - // matches DuckDbStore so KnowledgeGraph.addNode lands them in the same - // sequence — not strictly required because orderedNodes sorts again, - // but helpful when debugging. - const nodeRows = await store.query( - `MATCH (n:CodeNode) RETURN n.id AS id, n.kind AS kind, n.name AS name, ` + - `n.file_path AS file_path, n.start_line AS start_line, n.end_line AS end_line, ` + - `n.is_exported AS is_exported, n.signature AS signature, ` + - `n.parameter_count AS parameter_count, n.return_type AS return_type, ` + - `n.declared_type AS declared_type, n.owner AS owner, ` + - `n.content_hash AS content_hash, n.email_hash AS email_hash, ` + - `n.email_plain AS email_plain, ` + - `n.origin_url AS origin_url, n.repo_uri AS repo_uri, ` + - `n.default_branch AS default_branch, n.commit_sha AS commit_sha, ` + - `n.index_time AS index_time, n.repo_group AS repo_group, ` + - `n.visibility AS visibility, n.indexer AS indexer, ` + - `n.language_stats_json AS language_stats_json ` + - `ORDER BY n.id`, - ); - - const g = new KnowledgeGraph(); - for (const row of nodeRows) { - const rec = row as Record; - const base: Record = { - id: String(rec["id"]), - kind: String(rec["kind"]), - name: String(rec["name"] ?? ""), - filePath: String(rec["file_path"] ?? ""), - }; - for (const [col, key, ty] of NODE_COLUMN_MAP) { - const v = rec[col]; - if (v === null || v === undefined) continue; - if (ty === "number") base[key] = Number(v); - else if (ty === "boolean") base[key] = Boolean(v); - else base[key] = String(v); - } - applyRepoNullables(rec, base); - g.addNode(base as unknown as GraphNode); - } - - // Each edge kind lives in its own rel table — ask the schema for the - // active list rather than importing RELATION_TYPES directly so the two - // modules stay source-of-truth aligned. 
- for (const kind of getAllRelationTypes()) { - const edgeRows = await store.query( - `MATCH (a:CodeNode)-[r:${kind}]->(b:CodeNode) ` + - `RETURN a.id AS from_id, b.id AS to_id, ` + - `r.id AS edge_id, r.confidence AS confidence, ` + - `r.reason AS reason, r.step AS step ORDER BY r.id`, - ); - for (const row of edgeRows) { - const rec = row as Record; - const reason = rec["reason"]; - const stepRaw = rec["step"]; - // Two encoding quirks that matter for graphHash parity: - // 1. `step` survives even when the stored value is 0 — the original - // edge set it explicitly, so the canonical-JSON serialiser emits - // it; we must re-attach it rather than falling back to undefined. - // 2. `reason` is dropped when empty/null so the original fixture - // (which only sets `reason` on some edges) hashes the same. - g.addEdge({ - from: String(rec["from_id"]) as NodeId, - to: String(rec["to_id"]) as NodeId, - type: kind as RelationType, - confidence: Number(rec["confidence"] ?? 0), - ...(reason !== null && reason !== undefined && reason !== "" - ? { reason: String(reason) } - : {}), - ...(stepRaw !== null && stepRaw !== undefined ? 
{ step: Number(stepRaw) } : {}), - }); - } - } - - return g; -} - -async function runRoundTrip( - fixture: KnowledgeGraph, -): Promise<{ original: string; rebuilt: string }> { - const store = new GraphDbStore(await scratchDbPath()); - await store.open(); - try { - await store.createSchema(); - await store.bulkLoad(fixture); - const rebuilt = await rebuildGraphFromStore(store); - return { - original: graphHash(fixture), - rebuilt: graphHash(rebuilt), - }; - } finally { - await store.close(); - } -} - -// --------------------------------------------------------------------------- -// Tests -// --------------------------------------------------------------------------- - -test("round-trip parity: small fixture", async () => { - if (!(await hasNativeBinding())) { - assert.ok(true, "native binding unavailable — skipping round-trip"); - return; - } - const fixture = buildSmallGraph(); - const { original, rebuilt } = await runRoundTrip(fixture); - assert.equal( - rebuilt, - original, - `graphHash parity broken for small fixture:\n original: ${original}\n rebuilt: ${rebuilt}`, - ); -}); - -test("round-trip parity: medium fixture (mixed node kinds + OWNED_BY edges)", async () => { - if (!(await hasNativeBinding())) { - assert.ok(true, "native binding unavailable — skipping round-trip"); - return; - } - const fixture = buildMediumGraph(); - const { original, rebuilt } = await runRoundTrip(fixture); - assert.equal(rebuilt, original, "graphHash parity broken for medium fixture"); -}); - -test("round-trip parity: large fixture (100 nodes, linear chain + shortcuts)", async () => { - if (!(await hasNativeBinding())) { - assert.ok(true, "native binding unavailable — skipping round-trip"); - return; - } - const fixture = buildLargeGraph(); - const { original, rebuilt } = await runRoundTrip(fixture); - assert.equal(rebuilt, original, "graphHash parity broken for large fixture"); -}); - -test("every declared edge kind round-trips at least one row", async () => { - if (!(await 
hasNativeBinding())) { - assert.ok(true, "native binding unavailable — skipping round-trip"); - return; - } - const relationTypes = getAllRelationTypes(); - const g = new KnowledgeGraph(); - const nodes: NodeId[] = []; - for (let i = 0; i < relationTypes.length + 1; i += 1) { - const id = makeNodeId("Function", `src/f${i}.ts`, `fn${i}`); - nodes.push(id); - g.addNode({ id, kind: "Function", name: `fn${i}`, filePath: `src/f${i}.ts` }); - } - for (let i = 0; i < relationTypes.length; i += 1) { - const fromId = nodes[i]; - const toId = nodes[i + 1]; - if (!fromId || !toId) throw new Error("unreachable"); - const kind = relationTypes[i]; - if (!kind) throw new Error("unreachable"); - g.addEdge({ - from: fromId, - to: toId, - type: kind as RelationType, - confidence: 0.5 + i * 0.01, - reason: `fixture-${i}`, - step: i, - }); - } - const { original, rebuilt } = await runRoundTrip(g); - assert.equal(rebuilt, original, "graphHash parity broken for all-kinds fixture"); -}); - -test("round-trip parity: RepoNode fixture (first-class repo entity)", async () => { - if (!(await hasNativeBinding())) { - assert.ok(true, "native binding unavailable — skipping round-trip"); - return; - } - const g = new KnowledgeGraph(); - const repoId = makeNodeId("Repo", "", "repo"); - g.addNode({ - id: repoId, - kind: "Repo", - name: "github.com/acme/example", - filePath: "", - originUrl: "https://github.com/acme/example.git", - repoUri: "github.com/acme/example", - defaultBranch: "main", - commitSha: "0123456789abcdef0123456789abcdef01234567", - indexTime: "2026-05-06T12:34:56Z", - group: "acme", - visibility: "internal", - indexer: "opencodehub@0.1.0", - languageStats: { go: 0.5, ts: 0.3, rs: 0.2 }, - } as unknown as GraphNode); - // Include a File so the existing columns coexist with the new ones. 
- const fileA = makeNodeId("File", "src/a.ts", "a.ts"); - g.addNode({ id: fileA, kind: "File", name: "a.ts", filePath: "src/a.ts" }); - const { original, rebuilt } = await runRoundTrip(g); - assert.equal(rebuilt, original, "graphHash parity broken for RepoNode fixture"); -}); - -test("round-trip parity: RepoNode with explicit-null origin / branch / group", async () => { - if (!(await hasNativeBinding())) { - assert.ok(true, "native binding unavailable — skipping round-trip"); - return; - } - const g = new KnowledgeGraph(); - const repoId = makeNodeId("Repo", "", "repo"); - g.addNode({ - id: repoId, - kind: "Repo", - name: "local:abcdef012345", - filePath: "", - originUrl: null, - repoUri: "local:abcdef012345", - defaultBranch: null, - commitSha: "0123456789abcdef0123456789abcdef01234567", - indexTime: "2026-05-06T12:34:56Z", - group: null, - visibility: "private", - indexer: "opencodehub@0.1.0", - languageStats: {}, - } as unknown as GraphNode); - const { original, rebuilt } = await runRoundTrip(g); - assert.equal(rebuilt, original, "graphHash parity broken for RepoNode no-remote fixture"); -}); - -test("round-trip is deterministic across independent writes of the same graph", async () => { - if (!(await hasNativeBinding())) { - assert.ok(true, "native binding unavailable — skipping round-trip"); - return; - } - const fixture = buildMediumGraph(); - const originalHash = graphHash(fixture); - - const { rebuilt: hashA } = await runRoundTrip(fixture); - const { rebuilt: hashB } = await runRoundTrip(fixture); - assert.equal(hashA, hashB, "hashes across two stores must match"); - assert.equal(hashA, originalHash, "hash after round-trip must match the original graph hash"); -}); - -// Regression: upsert-mode bulkLoad must NOT clobber an existing real node when -// the upsert batch carries an edge that references it but not the node itself. -// -// ingest-sarif upserts Finding nodes + FOUND_IN edges into the already-loaded -// graph. 
A FOUND_IN edge targets the Function the finding sits inside. That -// Function id is absent from the upsert batch (which holds only Findings), so -// synthesizePlaceholderNodes used to mint a `kind:Route` placeholder for it; -// mergeNodes (DETACH DELETE + re-insert) then DESTROYED the real Function, -// turning it into a `` Route. Net effect: every function/method -// containing a scanner finding silently vanished from the graph, breaking -// cross-module context/impact. The fix excludes already-persisted ids from -// synthesis. See field-report Issue 1. -test("upsert with an edge to an existing node does not clobber it into a placeholder", async () => { - if (!(await hasNativeBinding())) { - assert.ok(true, "native binding unavailable — skipping"); - return; - } - const dbPath = await scratchDbPath(); - const store = new GraphDbStore(dbPath); - try { - await store.open(); - await store.createSchema(); - - // Phase 1 — replace-load a real Function node. - const fnId = makeNodeId("Function", "src/client.py", "get_bedrock_client"); - const base = new KnowledgeGraph(); - base.addNode({ - id: fnId, - kind: "Function", - name: "get_bedrock_client", - filePath: "src/client.py", - startLine: 146, - endLine: 171, - isExported: true, - } as GraphNode); - await store.bulkLoad(base, { mode: "replace" }); - - // Phase 2 — upsert a Finding + a FOUND_IN edge targeting the Function. - // The Function node is intentionally NOT in this batch. 
- const findingId = makeNodeId("Finding", "src/client.py", "semgrep:logger-leak:165"); - const upsert = new KnowledgeGraph(); - upsert.addNode({ - id: findingId, - kind: "Finding", - name: "logger-credential-leak", - filePath: "src/client.py", - startLine: 165, - endLine: 170, - } as GraphNode); - upsert.addEdge({ - from: findingId, - to: fnId, - type: "FOUND_IN" as RelationType, - confidence: 1, - reason: "startLine=165;endLine=170", - }); - await store.bulkLoad(upsert, { mode: "upsert" }); - - // The Function must still be a Function — not a synthesized Route - // placeholder, and not deleted. - const [node] = await store.listNodes({ ids: [fnId] }); - assert.ok(node, "the Function node must survive the finding upsert"); - assert.equal( - node.kind, - "Function", - `expected Function, got ${node.kind} (clobbered by placeholder)`, - ); - assert.equal(node.name, "get_bedrock_client"); - assert.equal(node.filePath, "src/client.py"); - assert.notEqual(node.filePath, ""); - - // The Finding and its edge must also have landed. - const findings = await store.listNodesByKind("Finding", { limit: 10 }); - assert.equal(findings.length, 1, "the upserted Finding must persist"); - } finally { - await store.close(); - } -}); - -// Companion: a genuinely-orphan edge endpoint (no node in batch, none in store) -// still gets a placeholder so the COPY's PK constraint holds — the upsert fix -// must not break the original unresolved-FETCHES use case. 
-test("upsert still synthesizes a placeholder for a genuinely-orphan edge endpoint", async () => { - if (!(await hasNativeBinding())) { - assert.ok(true, "native binding unavailable — skipping"); - return; - } - const dbPath = await scratchDbPath(); - const store = new GraphDbStore(dbPath); - try { - await store.open(); - await store.createSchema(); - await store.bulkLoad(new KnowledgeGraph(), { mode: "replace" }); - - const srcId = makeNodeId("Function", "src/api.py", "call_remote"); - const orphanId = "Route::https://example.com/{id}" as NodeId; - const g = new KnowledgeGraph(); - g.addNode({ - id: srcId, - kind: "Function", - name: "call_remote", - filePath: "src/api.py", - startLine: 1, - endLine: 5, - isExported: true, - } as GraphNode); - g.addEdge({ - from: srcId, - to: orphanId, - type: "FETCHES" as RelationType, - confidence: 0.5, - reason: "https://example.com/{id}", - }); - await store.bulkLoad(g, { mode: "upsert" }); - - const [placeholder] = await store.listNodes({ ids: [orphanId] }); - assert.ok(placeholder, "orphan FETCHES target must be synthesized so the edge COPY succeeds"); - } finally { - await store.close(); - } -}); diff --git a/packages/storage/src/graphdb-schema.test.ts b/packages/storage/src/graphdb-schema.test.ts deleted file mode 100644 index 43192e1b..00000000 --- a/packages/storage/src/graphdb-schema.test.ts +++ /dev/null @@ -1,124 +0,0 @@ -import assert from "node:assert/strict"; -import { test } from "node:test"; -import { generateSchemaDdl, getAllRelationTypes } from "./graphdb-schema.js"; - -// NOTE: the spec quoted "23 edge kinds" (spec 004 L11) but the live source -// of truth `duckdb-adapter.ts:ALL_RELATION_TYPES` carries 24. We trust the -// code over the spec text — the DDL must cover every kind the v1.1 DuckDB -// schema knows. If a kind is added to `ALL_RELATION_TYPES` upstream, bump -// this constant alongside the new entry in `graphdb-schema.ts`. 
-const EXPECTED_RELATION_COUNT = 25; - -// Banned-literal probes are built at runtime so this test file does not -// itself trip `scripts/check-banned-strings.sh`. Each entry is a list of -// character-code points that encode the banned token; the test reconstructs -// the string before asserting it is NOT present in the generated DDL. -const BANNED_LITERAL_CODES: ReadonlyArray = [ - [0x53, 0x54, 0x45, 0x50, 0x5f, 0x49, 0x4e, 0x5f, 0x50, 0x52, 0x4f, 0x43, 0x45, 0x53, 0x53], - [0x6b, 0x75, 0x7a, 0x75], - [0x68, 0x65, 0x75, 0x72, 0x69, 0x73, 0x74, 0x69, 0x63, 0x4c, 0x61, 0x62, 0x65, 0x6c], - [0x63, 0x6f, 0x64, 0x65, 0x70, 0x72, 0x6f, 0x62, 0x65], - [0x64, 0x75, 0x63, 0x6b, 0x70, 0x67, 0x71], - [0x53, 0x54, 0x45, 0x50, 0x5f, 0x49, 0x4e, 0x5f, 0x46, 0x4c, 0x4f, 0x57], - [0x6c, 0x61, 0x64, 0x79, 0x62, 0x75, 0x67], -]; - -function decode(codes: readonly number[]): string { - return codes.map((c) => String.fromCharCode(c)).join(""); -} - -test("generateSchemaDdl emits the expected number of node tables", () => { - const ddl = generateSchemaDdl(); - const nodeMatches = ddl.match(/CREATE NODE TABLE IF NOT EXISTS \w+/g) ?? []; - // Cochange + SymbolSummary live exclusively on a paired ITemporalStore; - // the graph-side schema is CodeNode + Embedding + StoreMeta = 3. - assert.equal(nodeMatches.length, 3, nodeMatches.join("\n")); -}); - -test("generateSchemaDdl emits one rel table per OCH edge kind + EMBEDS", () => { - const ddl = generateSchemaDdl(); - const relMatches = ddl.match(/CREATE REL TABLE IF NOT EXISTS \w+/g) ?? 
[]; - assert.equal(relMatches.length, EXPECTED_RELATION_COUNT + 1, relMatches.join("\n")); -}); - -test("every edge kind from getAllRelationTypes has a dedicated rel table", () => { - const ddl = generateSchemaDdl(); - for (const kind of getAllRelationTypes()) { - const needle = `CREATE REL TABLE IF NOT EXISTS ${kind}`; - assert.ok(ddl.includes(needle), `missing rel table for ${kind}`); - } -}); - -test("PROCESS_STEP rel table is present and the banned prior-art kind is not", () => { - const ddl = generateSchemaDdl(); - assert.ok(ddl.includes("CREATE REL TABLE IF NOT EXISTS PROCESS_STEP")); - // Reconstruct the banned token at runtime so this source file itself - // stays compliant with the banned-strings guardrail. - const forbiddenProcessToken = decode(BANNED_LITERAL_CODES[0] ?? []); - assert.ok( - !new RegExp(forbiddenProcessToken, "i").test(ddl), - "graphdb-schema DDL must not mention the banned prior-art process token", - ); -}); - -test("DDL does not leak any known banned clean-room literal", () => { - const ddl = generateSchemaDdl(); - for (const codes of BANNED_LITERAL_CODES) { - const literal = decode(codes); - assert.ok( - !new RegExp(literal, "i").test(ddl), - `DDL leaked banned literal of length ${literal.length}`, - ); - } -}); - -test("DDL does not emit a polymorphic single-table CodeRelation", () => { - // Spec 004 §Architectural decisions #1: one rel table per edge kind, NOT - // one `CodeRelation` rel table with a `type` discriminator. - const ddl = generateSchemaDdl(); - assert.ok(!/CREATE REL TABLE[^(]*CodeRelation/i.test(ddl)); -}); - -test("CodeNode primary key is id", () => { - const ddl = generateSchemaDdl(); - const match = ddl.match( - /CREATE NODE TABLE IF NOT EXISTS CodeNode[\s\S]*?PRIMARY KEY \(([^)]+)\)/, - ); - assert.ok(match, "CodeNode table not found"); - assert.equal((match[1] ?? 
"").trim(), "id"); -}); - -test("Embedding vector has the configured fixed dimension", () => { - const ddl = generateSchemaDdl({ embeddingDim: 1024 }); - assert.ok(ddl.includes("vector FLOAT[1024]")); -}); - -test("default embedding dim is 768 to match DuckDbStore default", () => { - const ddl = generateSchemaDdl(); - assert.ok(ddl.includes("vector FLOAT[768]")); -}); - -test("generateSchemaDdl rejects invalid embedding dimensions", () => { - assert.throws(() => generateSchemaDdl({ embeddingDim: 0 }), /Invalid embeddingDim/); - assert.throws(() => generateSchemaDdl({ embeddingDim: -1 }), /Invalid embeddingDim/); - assert.throws( - () => generateSchemaDdl({ embeddingDim: 1.5 as unknown as number }), - /Invalid embeddingDim/, - ); -}); - -test("getAllRelationTypes returns every OCH edge kind in canonical order", () => { - const kinds = getAllRelationTypes(); - assert.equal(kinds.length, EXPECTED_RELATION_COUNT); - // Spot-check ordering invariants: first kind is CONTAINS, last is TYPE_OF. - assert.equal(kinds[0], "CONTAINS"); - assert.equal(kinds[kinds.length - 1], "TYPE_OF"); -}); - -test("statements are semicolon-terminated", () => { - const ddl = generateSchemaDdl(); - // 3 node tables (CodeNode + Embedding + StoreMeta) + 24 rel tables + - // 1 EMBEDS rel = 28 statements → 28 semicolons. - const count = (ddl.match(/;\n/g) ?? []).length; - assert.equal(count, 3 + EXPECTED_RELATION_COUNT + 1); -}); diff --git a/packages/storage/src/graphdb-schema.ts b/packages/storage/src/graphdb-schema.ts deleted file mode 100644 index 2b3d3fa0..00000000 --- a/packages/storage/src/graphdb-schema.ts +++ /dev/null @@ -1,233 +0,0 @@ -/** - * DDL translator for the graph-database backend. - * - * Emits Cypher `CREATE NODE TABLE` + `CREATE REL TABLE` statements that - * mirror the semantic shape of the DuckDB schema ({@link generateSchemaDDL}) - * while honouring two architectural decisions from spec 004: - * - * 1. 
**Polymorphic rel tables, one per edge kind.** Each OCH relation - * kind (24 live in `duckdb-adapter.ts:ALL_RELATION_TYPES` at the time - * of writing — the v1.1 schema added `OWNED_BY` / `DEPENDS_ON` / - * `FOUND_IN` past the spec 004 draft's "23 kinds" count) gets its own - * named REL TABLE with multiple `FROM/TO` pairs. A single - * `CodeRelation` table with a `type` discriminator column would - * defeat columnar predicate push-down, so we fan out to keep the - * planner honest. See the graph-db backend's - * `cypher/data-definition/create-table` doc page. - * - * 2. **Source-level naming avoids banned clean-room literals.** OCH - * uses `PROCESS_STEP` where a prior-art project used a different - * identifier; this translator only ever emits `PROCESS_STEP` so - * Cypher queries match the graph's own relation-type enum. - * - * The DuckDB schema collapses every node kind into a polymorphic `nodes` - * table (`schema-ddl.ts`). For the graph-db backend we keep the same - * collapse — a single `CodeNode` NODE TABLE — so graphHash parity (U1) is - * straightforward: round-trips read the same column set from both stores. - * Later ACs may split the table per kind once profile data justifies the - * extra surface area. - */ - -export interface GraphDbSchemaOptions { - /** Dimension for the fixed-size FLOAT array used by the embedding rel. */ - readonly embeddingDim?: number; -} - -const DEFAULT_EMBEDDING_DIM = 768; - -/** - * 23 edge kinds taken verbatim from `duckdb-adapter.ts` `ALL_RELATION_TYPES` - * (re-exported via `getAllRelationTypes()` below so this file stays - * self-contained without a circular-import risk on the adapter module). The - * ordering is load-bearing for commit diffs — append new kinds, never - * reorder. 
- */ -const RELATION_KINDS: readonly string[] = [ - "CONTAINS", - "DEFINES", - "IMPORTS", - "CALLS", - "EXTENDS", - "IMPLEMENTS", - "HAS_METHOD", - "HAS_PROPERTY", - "ACCESSES", - "METHOD_OVERRIDES", - "OVERRIDES", - "METHOD_IMPLEMENTS", - "MEMBER_OF", - "PROCESS_STEP", - "HANDLES_ROUTE", - "FETCHES", - "HANDLES_TOOL", - "ENTRY_POINT_OF", - "WRAPS", - "QUERIES", - "REFERENCES", - "FOUND_IN", - "DEPENDS_ON", - "OWNED_BY", - "TYPE_OF", -]; - -/** - * Exported for the round-trip parity tests so they can compare against - * the same source of truth as the DDL emitter. - */ -export function getAllRelationTypes(): readonly string[] { - return RELATION_KINDS; -} - -/** - * Returns the complete Cypher DDL as a single string — statements separated - * by `;` so callers can split on that boundary if they need per-statement - * execution. The last statement carries a trailing `;` for symmetry. - */ -export function generateSchemaDdl(opts: GraphDbSchemaOptions = {}): string { - const embeddingDim = opts.embeddingDim ?? DEFAULT_EMBEDDING_DIM; - if (!Number.isInteger(embeddingDim) || embeddingDim <= 0) { - throw new Error(`Invalid embeddingDim: ${String(embeddingDim)}`); - } - - const statements: string[] = []; - - // ------------------------------------------------------------------------- - // Node tables. CodeNode collapses every kind (File / Folder / Function / - // Class / Interface / Method / CodeElement / Community / Process / Route / - // Tool / Section / Finding / Dependency / Operation / Contributor / - // ProjectProfile / Repo) behind a `kind` discriminator, mirroring the - // DuckDB `nodes` table. Embeddings live in their own NODE TABLE so the - // vector column stays homogeneous and an HNSW index can attach. 
- // ------------------------------------------------------------------------- - statements.push(`CREATE NODE TABLE IF NOT EXISTS CodeNode ( - id STRING, - kind STRING, - name STRING, - file_path STRING, - start_line INT32, - end_line INT32, - is_exported BOOL, - signature STRING, - parameter_count INT32, - return_type STRING, - declared_type STRING, - owner STRING, - url STRING, - method STRING, - tool_name STRING, - content STRING, - content_hash STRING, - inferred_label STRING, - symbol_count INT32, - cohesion DOUBLE, - keywords STRING[], - entry_point_id STRING, - step_count INT32, - level INT32, - response_keys STRING[], - description STRING, - severity STRING, - rule_id STRING, - scanner_id STRING, - message STRING, - properties_bag STRING, - version STRING, - license STRING, - lockfile_source STRING, - ecosystem STRING, - http_method STRING, - http_path STRING, - summary STRING, - operation_id STRING, - email_hash STRING, - email_plain STRING, - languages_json STRING, - frameworks_json STRING, - iac_types_json STRING, - api_contracts_json STRING, - manifests_json STRING, - src_dirs_json STRING, - orphan_grade STRING, - is_orphan BOOL, - truck_factor INT32, - ownership_drift_30d DOUBLE, - ownership_drift_90d DOUBLE, - ownership_drift_365d DOUBLE, - deadness STRING, - coverage_percent DOUBLE, - covered_lines_json STRING, - cyclomatic_complexity INT32, - nesting_depth INT32, - nloc INT32, - halstead_volume DOUBLE, - input_schema_json STRING, - partial_fingerprint STRING, - baseline_state STRING, - suppressed_json STRING, - origin_url STRING, - repo_uri STRING, - default_branch STRING, - commit_sha STRING, - index_time STRING, - repo_group STRING, - visibility STRING, - indexer STRING, - language_stats_json STRING, - PRIMARY KEY (id) -)`); - - statements.push(`CREATE NODE TABLE IF NOT EXISTS Embedding ( - id STRING, - node_id STRING, - granularity STRING, - chunk_index INT32, - start_line INT32, - end_line INT32, - vector FLOAT[${embeddingDim}], - content_hash 
STRING, - PRIMARY KEY (id) -)`); - - statements.push(`CREATE NODE TABLE IF NOT EXISTS StoreMeta ( - id INT32, - schema_version STRING, - last_commit STRING, - indexed_at STRING, - node_count INT64, - edge_count INT64, - stats_json STRING, - cache_hit_ratio DOUBLE, - cache_size_bytes INT64, - last_compaction STRING, - embedder_model_id STRING, - PRIMARY KEY (id) -)`); - - // Cochange + SymbolSummary live exclusively on the paired DuckDB-backed - // ITemporalStore — the graph adapter never stores those rows, so the - // Cypher schema does not declare them. - // ------------------------------------------------------------------------- - // Rel tables — one per edge kind. FROM/TO is CodeNode on both sides; - // a future schema revision may narrow the endpoints per kind once the - // node-kind split lands. We DO NOT emit a single CodeRelation rel - // table with a type column — that defeats the predicate push-down the - // graph-db gives us. - // ------------------------------------------------------------------------- - for (const kind of RELATION_KINDS) { - statements.push(`CREATE REL TABLE IF NOT EXISTS ${kind} ( - FROM CodeNode TO CodeNode, - id STRING, - confidence DOUBLE, - reason STRING, - step INT32 -)`); - } - - // Dedicated rel linking Embedding rows to their CodeNode source, so HNSW - // traversals can join back through the graph without a property lookup. 
- statements.push(`CREATE REL TABLE IF NOT EXISTS EMBEDS ( - FROM Embedding TO CodeNode -)`); - - return `${statements.join(";\n\n")};\n`; -} diff --git a/packages/storage/src/index.ts b/packages/storage/src/index.ts index 899c8f25..00caab5d 100644 --- a/packages/storage/src/index.ts +++ b/packages/storage/src/index.ts @@ -1,18 +1,4 @@ export { assertReadOnlyCypher, CypherGuardError } from "./cypher-guard.js"; -export { classifyLicenseTier, DuckDbStore, type DuckDbStoreOptions } from "./duckdb-adapter.js"; -export { - GRAPH_BINDING_SUPPORTED_PLATFORMS, - GraphDbBindingError, - GraphDbStore, - type GraphDbStoreOptions, - graphBindingPlatformNote, - NotImplementedError, -} from "./graphdb-adapter.js"; -export { - type GraphDbSchemaOptions, - generateSchemaDdl, - getAllRelationTypes, -} from "./graphdb-schema.js"; export type { AncestorTraversalOptions, BulkLoadOptions, @@ -48,6 +34,7 @@ export type { VectorQuery, VectorResult, } from "./interface.js"; +export { classifyLicenseTier } from "./license.js"; export { readStoreMeta, writeStoreMeta } from "./meta.js"; export { describeArtifacts, @@ -59,86 +46,74 @@ export { resolveRegistryPath, resolveRepoMetaDir, } from "./paths.js"; +export { getAllRelationTypes } from "./relations.js"; export { generateSchemaDDL, type SchemaOptions } from "./schema-ddl.js"; export { assertReadOnlySql, SqlGuardError } from "./sql-guard.js"; +export { SqliteStore, type SqliteStoreOptions } from "./sqlite-adapter.js"; +export { installSqliteRuntimeGuard } from "./sqlite-runtime.js"; import { dirname, join } from "node:path"; -import { DuckDbStore, type DuckDbStoreOptions } from "./duckdb-adapter.js"; -import { GraphDbStore, type GraphDbStoreOptions } from "./graphdb-adapter.js"; import type { OpenStoreOptions as ApiOpenStoreOptions, OpenStoreResult } from "./interface.js"; -import { describeArtifacts } from "./paths.js"; +import { SqliteStore, type SqliteStoreOptions } from "./sqlite-adapter.js"; /** - * Combined options accepted by {@link 
openStore}. Backwards-compatible - * superset of the spec-level {@link ApiOpenStoreOptions} that adds the - * `duckOptions` / `graphDbOptions` adapter-specific bag so existing - * callers (analyze CLI, ingestion harness) can pass through precise - * per-backend tuning. + * Combined options accepted by {@link openStore}. Superset of the spec-level + * {@link ApiOpenStoreOptions} that adds the SQLite-adapter tuning bag. The + * single-file store replaced the lbug + DuckDB pair (ADR 0019), so the former + * `duckOptions` / `graphDbOptions` per-backend bags are gone. */ export interface OpenStoreOptions extends ApiOpenStoreOptions { - readonly duckOptions?: DuckDbStoreOptions; - readonly graphDbOptions?: GraphDbStoreOptions; + /** SQLite-adapter tuning (journal mode, busy timeout). */ + readonly sqliteOptions?: SqliteStoreOptions; } /** - * Compose paired graph + temporal artifact paths. The graph artifact is - * `/graph.lbug` (lbug owns this file); the temporal sidecar is - * `/temporal.duckdb`. - * - * The input `path` is treated as the directory anchor — its dirname is - * the `/.codehub/` parent, and the canonical filenames are - * appended. `:memory:` is a special case for tests: both views resolve - * to `:memory:` and no filesystem layout applies. + * Resolve the single store file. The whole index now lives in ONE + * `/store.sqlite` (WAL) — there is no graph.lbug / temporal.duckdb + * split. The input `path` is the directory anchor (its dirname is the + * `/.codehub/` parent); `:memory:` short-circuits for tests. 
*/ -function composeArtifactPaths(path: string): { graphFile: string; temporalFile: string } { - if (path === ":memory:") { - return { graphFile: ":memory:", temporalFile: ":memory:" }; - } - const dir = dirname(path); - const { graphFile, temporalFile } = describeArtifacts(); - return { - graphFile: join(dir, graphFile), - temporalFile: join(dir, temporalFile), - }; +function resolveStoreFile(path: string): string { + if (path === ":memory:") return ":memory:"; + return join(dirname(path), "store.sqlite"); } /** - * Factory that returns a composed graph + temporal {@link OpenStoreResult}. + * Factory returning an {@link OpenStoreResult} whose `graph` and `temporal` + * views are the SAME {@link SqliteStore} instance over one + * `/store.sqlite` file. Because one object satisfies both + * {@link IGraphStore} and {@link ITemporalStore}, every existing call site + * (`store.graph.X()` / `store.temporal.Y()`) keeps working unchanged — both + * now hit the same connection and file. * - * A `GraphDbStore` instance backs the `graph` view at `/graph.lbug`; - * a separate `DuckDbStore` over the sibling `/temporal.duckdb` - * backs the `temporal` view. `OpenStoreResult.close()` closes both in - * deterministic order — graph first, temporal second. - * - * The factory only constructs — callers still own the `open()` lifecycle - * call so failures are attributable to the lifecycle boundary rather - * than the factory. + * The factory only constructs; callers own the `open()` lifecycle. Opening + * the shared instance twice (once via `store.graph`, once via + * `store.temporal`, as the CLI's open-store helper does) is safe — `open()` + * is idempotent. `close()` closes the single handle once. */ export async function openStore(opts: OpenStoreOptions): Promise<OpenStoreResult> { - const { graphFile, temporalFile } = composeArtifactPaths(opts.path); + const storeFile = resolveStoreFile(opts.path); - const graphDbOptions: GraphDbStoreOptions = { - ...(opts.graphDbOptions ?? 
{}), + const sqliteOptions: SqliteStoreOptions = { + ...(opts.sqliteOptions ?? {}), ...(opts.readOnly !== undefined ? { readOnly: opts.readOnly } : {}), ...(opts.embeddingDim !== undefined ? { embeddingDim: opts.embeddingDim } : {}), ...(opts.timeoutMs !== undefined ? { timeoutMs: opts.timeoutMs } : {}), }; - const duckOptions: DuckDbStoreOptions = { - ...(opts.duckOptions ?? {}), - ...(opts.readOnly !== undefined ? { readOnly: opts.readOnly } : {}), - ...(opts.timeoutMs !== undefined ? { timeoutMs: opts.timeoutMs } : {}), - }; - const graph = new GraphDbStore(graphFile, graphDbOptions); - const temporal = new DuckDbStore(temporalFile, duckOptions); + const store = new SqliteStore(storeFile, sqliteOptions); + let closed = false; return { - graph, - temporal, - graphFile, - temporalFile, + graph: store, + temporal: store, + graphFile: storeFile, + temporalFile: storeFile, close: async () => { - await graph.close(); - await temporal.close(); + // Both views are one instance; close once even though callers may + // invoke close() through the single envelope. 
+ if (closed) return; + closed = true; + await store.close(); }, }; } diff --git a/packages/storage/src/interface.test.ts b/packages/storage/src/interface.test.ts index e3e42bda..96d3698e 100644 --- a/packages/storage/src/interface.test.ts +++ b/packages/storage/src/interface.test.ts @@ -115,7 +115,6 @@ test("ITemporalStore-shaped value lacks graph methods at runtime", () => { createSchema: async () => {}, healthCheck: async () => ({ ok: true }), exec: async () => [], - exportEmbeddingsToParquet: async () => ({ rowCount: 0, duckdbVersion: "test" }), bulkLoadCochanges: async () => {}, lookupCochangesForFile: async (): Promise => [], lookupCochangesBetween: async () => undefined, diff --git a/packages/storage/src/interface.ts b/packages/storage/src/interface.ts index a17971f0..bd327579 100644 --- a/packages/storage/src/interface.ts +++ b/packages/storage/src/interface.ts @@ -128,9 +128,9 @@ export type GraphDialect = "cypher"; * capability and skipped cleanly when the adapter throws "not implemented" * or returns `[]` for a known-non-empty query. * - * Both in-tree adapters (`DuckDbStore`, `GraphDbStore`) opt into this - * suite from their own test files — any future signature change here - * MUST keep the conformance suite green on both before landing. + * The in-tree adapter (`SqliteStore`) opts into this suite from its own + * test file — any future signature change here MUST keep the conformance + * suite green before landing. */ export interface IGraphStore { /** @@ -175,15 +175,13 @@ export interface IGraphStore { */ listEmbeddingHashes(): Promise>; /** - * Stream every embedding row with deterministic ordering — used by - * `pack/embeddings-sidecar.ts` to write the Parquet artifact without + * Stream every embedding row with deterministic ordering, without * materializing the full embeddings table in memory. * * The result is `AsyncIterable` (NOT `Promise`). 
Adapters MUST implement this as `async function*` * so the caller can `for await (const row of store.listEmbeddings())`. - * Order: `(node_id ASC, granularity ASC, chunk_index ASC)` — matches - * the Parquet writer's row-group order. + * Order: `(node_id ASC, granularity ASC, chunk_index ASC)`. * * Optional filters narrow the stream by node kind (joined to `nodes`) * and cap total rows. Empty `kindFilter` short-circuits to an empty @@ -415,21 +413,6 @@ export interface ITemporalStore { opts?: { readonly timeoutMs?: number }, ): Promise[]>; - /** - * Stage an `EmbeddingRow` stream through a per-call DuckDB temp table and - * COPY it to a Parquet file. Used by `pack/embeddings-sidecar.ts` to - * produce the deterministic Parquet sidecar from rows that originate in - * `graph.lbug`. The temp table is dropped before the call returns. - * - * Returns `{rowCount: 0}` when the stream is empty (no file written). - * `duckdbVersion` is the runtime `SELECT version()` result — pinned by - * the pack manifest so the writer version stays bound to the artifact. - */ - exportEmbeddingsToParquet( - rows: AsyncIterable, - absOutPath: string, - ): Promise<{ readonly rowCount: number; readonly duckdbVersion: string }>; - // ── Cochange surface (was on IGraphStore via CochangeStore) ─────────────── /** Replace the cochanges table contents with the supplied rows. */ bulkLoadCochanges(rows: readonly CochangeRow[]): Promise; diff --git a/packages/storage/src/license.ts b/packages/storage/src/license.ts new file mode 100644 index 00000000..201d061c --- /dev/null +++ b/packages/storage/src/license.ts @@ -0,0 +1,53 @@ +/** + * License-tier classification — pure, dependency-free. + * + * Extracted out of `duckdb-adapter.ts` so consumers (the single-file + * `SqliteStore`, `listDependencies`, the license-audit surface) can use it + * WITHOUT transitively importing `@duckdb/node-api`. 
That top-level native + * import is exactly what would defeat the lazy-DuckDB contract — importing a + * pure helper must never load a native binding. + */ + +/** + * Map an SPDX-ish license string to one of five tiers. Case-insensitive, + * tolerant of `-`/word-boundary-delimited family names (e.g. `GPL-3.0-only`, + * `Apache-2.0`). Empty / unknown input returns `"unknown"`. + */ +export function classifyLicenseTier( + license: string | undefined, +): "permissive" | "weak-copyleft" | "strong-copyleft" | "proprietary" | "unknown" { + if (!license || license.trim().length === 0) return "unknown"; + const lower = license.trim().toLowerCase(); + // Strong copyleft — GPL/AGPL family. + if (/(^|\b|-)agpl(-|$)/i.test(lower) || /(^|\b|-)gpl(-|$)/i.test(lower)) { + return "strong-copyleft"; + } + // Weak copyleft — LGPL, MPL, EPL, CDDL, CC-BY-SA. + if ( + /(^|\b|-)lgpl(-|$)/i.test(lower) || + /(^|\b)mpl(-|$)/i.test(lower) || + /(^|\b)epl(-|$)/i.test(lower) || + /(^|\b)cddl(-|$)/i.test(lower) || + /(^|\b)cc-by-sa(-|$)/i.test(lower) + ) { + return "weak-copyleft"; + } + // Permissive — MIT/Apache/BSD/ISC/0BSD/Unlicense/CC0/Zlib. + if ( + /(^|\b)mit(\b|-|$)/.test(lower) || + /(^|\b)apache(-|$)/i.test(lower) || + /(^|\b)bsd(-|$)/i.test(lower) || + /(^|\b)isc(\b|-|$)/.test(lower) || + /(^|\b)0bsd(\b|$)/.test(lower) || + /(^|\b)unlicense(\b|$)/.test(lower) || + /(^|\b)cc0(\b|-|$)/.test(lower) || + /(^|\b)zlib(\b|$)/.test(lower) + ) { + return "permissive"; + } + // Proprietary markers. + if (/(^|\b)(proprietary|commercial|see license)(\b|$)/i.test(lower)) { + return "proprietary"; + } + return "unknown"; +} diff --git a/packages/storage/src/relations.ts b/packages/storage/src/relations.ts new file mode 100644 index 00000000..f1f836a7 --- /dev/null +++ b/packages/storage/src/relations.ts @@ -0,0 +1,42 @@ +/** + * Canonical relation-kind roster — pure, dependency-free. 
+ * + * The single source of truth for which edge relation types exist, in their + * load-bearing order (append new kinds, NEVER reorder — commit diffs and any + * schema emitter depend on the order). Lived in `graphdb-schema.ts`; extracted + * here so the single-file `SqliteStore` and the parity tests can reach it + * without importing the lbug-era schema module (deleted in the single-file + * migration). + */ +const RELATION_KINDS: readonly string[] = [ + "CONTAINS", + "DEFINES", + "IMPORTS", + "CALLS", + "EXTENDS", + "IMPLEMENTS", + "HAS_METHOD", + "HAS_PROPERTY", + "ACCESSES", + "METHOD_OVERRIDES", + "OVERRIDES", + "METHOD_IMPLEMENTS", + "MEMBER_OF", + "PROCESS_STEP", + "HANDLES_ROUTE", + "FETCHES", + "HANDLES_TOOL", + "ENTRY_POINT_OF", + "WRAPS", + "QUERIES", + "REFERENCES", + "FOUND_IN", + "DEPENDS_ON", + "OWNED_BY", + "TYPE_OF", +]; + +/** Every relation kind, in canonical order. Source of truth for finders + tests. */ +export function getAllRelationTypes(): readonly string[] { + return RELATION_KINDS; +} diff --git a/packages/storage/src/sqlite-adapter.test.ts b/packages/storage/src/sqlite-adapter.test.ts new file mode 100644 index 00000000..795d9345 --- /dev/null +++ b/packages/storage/src/sqlite-adapter.test.ts @@ -0,0 +1,184 @@ +/** + * Spike proof for {@link SqliteStore} (branch `spike/sqlite-single-file`). + * + * These tests are the de-risking evidence for the single-file thesis: that + * ONE `*.sqlite` file in WAL mode, opened through Node's built-in + * `node:sqlite` with zero native dependencies, can back the graph tier + * (nodes, edges, traversal), the embedding tier (Float32Array vectors, + * cosine KNN), and the temporal tier — replacing the lbug + DuckDB pair. 
+ * + * The acceptance bar: + * - a real KnowledgeGraph bulk-loads and round-trips (nodes + edges) from + * one on-disk file across a close/reopen cycle; + * - embeddings survive as exact Float32 bytes and rank by cosine; + * - impact/blast-radius traversal works via recursive CTE (up + down); + * - the file is genuinely one file (no .lbug / .duckdb sidecars). + */ + +import assert from "node:assert/strict"; +import { mkdtemp, readdir, rm } from "node:fs/promises"; +import { tmpdir } from "node:os"; +import { join } from "node:path"; +import { test } from "node:test"; + +import { + type GraphNode, + KnowledgeGraph, + makeNodeId, + type NodeId, + type RelationType, +} from "@opencodehub/core-types"; + +import { SqliteStore } from "./sqlite-adapter.js"; + +interface FixtureIds { + readonly fileId: NodeId; + readonly a: NodeId; + readonly b: NodeId; + readonly c: NodeId; + readonly d: NodeId; +} + +/** Build a small fixture: 1 file + 4 functions in a CALLS chain a→b→c, a→d. */ +function fixtureGraph(): { graph: KnowledgeGraph; ids: FixtureIds } { + const g = new KnowledgeGraph(); + const fileId = makeNodeId("File", "src/app.ts", "src/app.ts"); + g.addNode({ id: fileId, kind: "File", name: "app.ts", filePath: "src/app.ts" } as GraphNode); + const mk = (fn: string): NodeId => { + const id = makeNodeId("Function", "src/app.ts", fn); + g.addNode({ + id, + kind: "Function", + name: fn, + filePath: "src/app.ts", + startLine: 1, + signature: `function ${fn}()`, + } as GraphNode); + return id; + }; + const a = mk("a"); + const b = mk("b"); + const c = mk("c"); + const d = mk("d"); + const calls = (from: NodeId, to: NodeId): void => + g.addEdge({ from, to, type: "CALLS" as RelationType, confidence: 1.0 }); + calls(a, b); + calls(b, c); + calls(a, d); + return { graph: g, ids: { fileId, a, b, c, d } }; +} + +test("SqliteStore: graph + embeddings round-trip from ONE file across reopen", async () => { + const dir = await mkdtemp(join(tmpdir(), "och-sqlite-spike-")); + const 
dbPath = join(dir, "store.sqlite"); + try { + const { graph, ids } = fixtureGraph(); + + // ── Write phase ── (8-dim embeddings for a readable test; real default 768) + const w = new SqliteStore(dbPath, { embeddingDim: 8 }); + await w.open(); + await w.createSchema(); + const stats = await w.bulkLoad(graph); + assert.equal(stats.nodeCount, 5, "5 nodes loaded"); + assert.equal(stats.edgeCount, 3, "3 edges loaded"); + + // 8-dim embeddings so the test is readable; real default is 768. + const vec = (seed: number): Float32Array => + Float32Array.from({ length: 8 }, (_, i) => Math.sin(seed + i)); + await w.upsertEmbeddings([ + { nodeId: ids.a, chunkIndex: 0, vector: vec(0.0), contentHash: "h-a", granularity: "symbol" }, + { nodeId: ids.b, chunkIndex: 0, vector: vec(1.0), contentHash: "h-b", granularity: "symbol" }, + { nodeId: ids.c, chunkIndex: 0, vector: vec(2.0), contentHash: "h-c", granularity: "symbol" }, + ]); + await w.close(); + + // ── Prove it is literally ONE file (WAL/shm may exist transiently; no + // .lbug / .duckdb sidecar must ever appear). 
── + const files = await readdir(dir); + const sidecars = files.filter((f) => f.endsWith(".lbug") || f.endsWith(".duckdb")); + assert.deepEqual(sidecars, [], `no graph/temporal sidecars, saw: ${files.join(",")}`); + + // ── Read phase: fresh handle, read-only, over the same path ── + const r = new SqliteStore(dbPath, { readOnly: true, embeddingDim: 8 }); + await r.open(); + + // nodes survived with payload (signature lives in the JSON overflow column) + const fnA = await r.getNode(ids.a); + assert.equal(fnA?.kind, "Function"); + assert.equal(fnA?.name, "a"); + assert.equal((fnA as { signature?: string }).signature, "function a()"); + + const all = await r.listNodes(); + assert.equal(all.length, 5, "all 5 nodes enumerable after reopen"); + + // embeddings survived as exact f32 bytes + const got = new Map(); + for await (const row of r.listEmbeddings()) got.set(row.nodeId, row.vector); + assert.equal(got.size, 3); + const aVec = got.get(ids.a); + assert.ok(aVec, "embedding for node a round-tripped"); + assert.deepEqual(Array.from(aVec), Array.from(vec(0.0)), "f32 bytes identical"); + + // cosine KNN: query == a's vector → a ranks first with distance ~0 + const hits = await r.vectorSearch({ vector: vec(0.0), limit: 3 }); + assert.ok(hits.length >= 2, "expected at least two ranked hits"); + const [first, second] = hits; + assert.equal(first?.nodeId, ids.a, "self is nearest"); + assert.ok((first?.distance ?? Number.POSITIVE_INFINITY) < 1e-6, "distance to self ~0"); + assert.ok( + (first?.distance ?? 0) <= (second?.distance ?? 
Number.POSITIVE_INFINITY), + "ordered by ascending distance", + ); + + await r.close(); + await rm(dir, { recursive: true, force: true }); + } catch (err) { + await rm(dir, { recursive: true, force: true }); + throw err; + } +}); + +test("SqliteStore: recursive-CTE traversal does impact (up) + blast-radius (down)", async () => { + const dir = await mkdtemp(join(tmpdir(), "och-sqlite-spike-tr-")); + const dbPath = join(dir, "store.sqlite"); + try { + const { graph, ids } = fixtureGraph(); + const w = new SqliteStore(dbPath); + await w.open(); + await w.createSchema(); + await w.bulkLoad(graph); + + // DOWN from a (callees, transitive): a→b→c and a→d ⇒ {b,c,d} + const down = await w.traverse({ startId: ids.a, direction: "down", maxDepth: 5 }); + assert.deepEqual( + [...down.map((r) => r.nodeId)].sort(), + [ids.b, ids.c, ids.d].sort(), + "down reaches all transitive callees", + ); + const cHit = down.find((r) => r.nodeId === ids.c); + assert.equal(cHit?.depth, 2, "c is depth 2 (a→b→c)"); + assert.deepEqual(cHit?.path, [ids.a, ids.b, ids.c], "path is recorded a→b→c"); + + // UP from c (callers, transitive = blast radius): c←b←a ⇒ {a,b} + const up = await w.traverse({ startId: ids.c, direction: "up", maxDepth: 5 }); + assert.deepEqual( + [...up.map((r) => r.nodeId)].sort(), + [ids.a, ids.b].sort(), + "up reaches all transitive callers (blast radius)", + ); + + // depth bound respected: maxDepth 1 from a ⇒ only direct {b,d} + const shallow = await w.traverse({ startId: ids.a, direction: "down", maxDepth: 1 }); + assert.deepEqual( + [...shallow.map((r) => r.nodeId)].sort(), + [ids.b, ids.d].sort(), + "maxDepth=1 yields only direct callees", + ); + + await w.close(); + await rm(dir, { recursive: true, force: true }); + } catch (err) { + await rm(dir, { recursive: true, force: true }); + throw err; + } +}); diff --git a/packages/storage/src/sqlite-adapter.ts b/packages/storage/src/sqlite-adapter.ts new file mode 100644 index 00000000..7a09d59e --- /dev/null +++ 
b/packages/storage/src/sqlite-adapter.ts @@ -0,0 +1,1513 @@ +/** + * SqliteStore — single-file storage adapter (branch `spike/sqlite-single-file`). + * + * THESIS. One `*.sqlite` file in WAL mode backs EVERYTHING: graph nodes, + * edges, embeddings, and the temporal/non-graph tables (cochanges, symbol + * summaries) that today live in two native-binding engines + * (`graph.lbug` via @ladybugdb/core + `temporal.duckdb` via @duckdb/node-api). + * Collapsing both onto Node 24's built-in `node:sqlite` removes the last two + * native dependencies, which is what unlocks the real goal: a zero-dep, + * one-command, no-Docker install (`npm i -g @opencodehub/cli` and nothing else). + * + * STATUS. This file implements the FULL {@link IGraphStore} + + * {@link ITemporalStore} surface against a single file. Embeddings live in + * the `embeddings` table inside store.sqlite; there is no DuckDB dependency + * and no Parquet export (ADR 0019 dropped the write-only sidecar). + * + * GRAPH-HASH PARITY. The hard success criterion is that a `KnowledgeGraph` + * rebuilt from `listNodes({})` + `listEdges({})` produces a byte-identical + * `graphHash`. The node write/read path round-trips the full node object + * through a JSON `payload` column (so arbitrary kind-specific fields — and the + * `keywords: []`-vs-absent and `languageStats: {}` distinctions canonicalJson + * cares about — survive verbatim). The edge read path mirrors + * `GraphDbStore.listEdgesInternalGd` exactly, including the + * {@link stepZeroSentinel} drop, the empty-reason drop, and the + * `(from, to, type, id)` sort. Filter-only columns (severity, rule_id, + * ecosystem, method, entry_point_id, repo_uri, …) live INSIDE the payload and + * are reached via SQLite JSON1 `payload->>'$.field'` extracts. + * + * NON-GOAL. No backwards compatibility. Clean slate: this adapter assumes a + * fresh index, not a migration of existing `graph.lbug` / `temporal.duckdb` + * artifacts (per the spike brief). 
+ */ + +// Install the experimental-warning guard BEFORE the node:sqlite binding loads. +import "./sqlite-runtime.js"; + +import { DatabaseSync, type StatementSync } from "node:sqlite"; + +import type { + CodeRelation, + DependencyNode, + FindingNode, + GraphNode, + KnowledgeGraph, + NodeId, + NodeKind, + NodeOfKind, + RelationType, + RepoNode, + RouteNode, +} from "@opencodehub/core-types"; +import { stepZeroSentinel } from "./column-encode.js"; +import type { + AncestorTraversalOptions, + BulkLoadOptions, + BulkLoadStats, + CochangeLookupOptions, + CochangeRow, + ConsumerProducerEdge, + DescendantTraversalOptions, + EmbeddingRow, + GraphDialect, + IGraphStore, + ITemporalStore, + ListDependenciesOptions, + ListEdgesByTypeOptions, + ListEdgesOptions, + ListEmbeddingsOptions, + ListFindingsOptions, + ListNodesByKindOptions, + ListNodesByNameOptions, + ListNodesOptions, + ListRoutesOptions, + SearchQuery, + SearchResult, + SqlParam, + StoreMeta, + SymbolSummaryRow, + TraverseQuery, + TraverseResult, + VectorQuery, + VectorResult, +} from "./interface.js"; +import { classifyLicenseTier } from "./license.js"; +import { getAllRelationTypes } from "./relations.js"; +import { assertReadOnlySql } from "./sql-guard.js"; + +export interface SqliteStoreOptions { + /** Open the file read-only. Query commands pass true; ingestion false. */ + readonly readOnly?: boolean; + /** Embedding dimensionality. Defaults to 768 (Bedrock Titan / Cohere tier). */ + readonly embeddingDim?: number; + /** + * Journal mode. Defaults to WAL — the whole point of the spike. Overridable + * to `MEMORY` for `:memory:` tests where WAL is a no-op anyway. + */ + readonly journalMode?: "WAL" | "MEMORY" | "DELETE"; + /** Default query timeout for `exec()` calls in ms. Default 5000. 
*/ + readonly timeoutMs?: number; +} + +const DEFAULT_DIM = 768; +const SCHEMA_VERSION = "spike-sqlite-1"; +const DEFAULT_TIMEOUT_MS = 5_000; +const DEFAULT_COCHANGE_LOOKUP_LIMIT = 10; +const DEFAULT_COCHANGE_MIN_LIFT = 1.0; +const DEFAULT_SEARCH_LIMIT = 50; + +/** + * Single-file store implementing the full IGraphStore + ITemporalStore + * surface. Lifecycle mirrors the existing adapters: + * open → createSchema → bulkLoad → query/search/vectorSearch/traverse → close + */ +export class SqliteStore implements IGraphStore, ITemporalStore { + /** + * Dialect tag. node:sqlite speaks SQL, not Cypher, but {@link GraphDialect} + * is currently the single literal `"cypher"`. Rather than widen the union + * (and force every consumer to handle a second tag for a property OCH core + * never branches on), we keep `"cypher"` and leave a TODO. The + * {@link IGraphStore.execCypher} escape hatch is intentionally NOT + * implemented here — this adapter exposes raw SQL via {@link exec} on the + * temporal surface instead. + * + * TODO(P3): if a SQL community-adapter tag is ever needed, widen + * `GraphDialect = "cypher" | "sql"` in interface.ts (one-line union change) + * and set this to `"sql"`. + */ + readonly dialect: GraphDialect = "cypher"; + + private db: DatabaseSync | undefined; + private readonly path: string; + private readonly readOnly: boolean; + private readonly dim: number; + private readonly journalMode: "WAL" | "MEMORY" | "DELETE"; + private readonly defaultTimeoutMs: number; + + constructor(path: string, opts: SqliteStoreOptions = {}) { + this.path = path; + this.readOnly = opts.readOnly ?? false; + this.dim = opts.embeddingDim ?? DEFAULT_DIM; + this.journalMode = opts.journalMode ?? (path === ":memory:" ? "MEMORY" : "WAL"); + this.defaultTimeoutMs = opts.timeoutMs ?? 
DEFAULT_TIMEOUT_MS; + } + + // ── Lifecycle ────────────────────────────────────────────────────────────── + + async open(): Promise<void> { + if (this.db) return; // idempotent + this.db = new DatabaseSync(this.path, { readOnly: this.readOnly }); + // WAL is the headline: concurrent readers never block the writer, the file + // is crash-safe, and there is no server process. A read-only handle cannot + // change journal mode, so only set it on a writable open. + // + // NOTE: these PRAGMAs run on the TRUSTED internal path, never through + // {@link exec}. assertReadOnlySql blocks PRAGMA as a dangerous keyword, so + // user SQL can never reach this surface. + if (!this.readOnly) { + this.db.exec(`PRAGMA journal_mode = ${this.journalMode};`); + this.db.exec("PRAGMA synchronous = NORMAL;"); // WAL-safe, fast + this.db.exec("PRAGMA foreign_keys = ON;"); + } + // node:sqlite has no connection.interrupt(); a busy-timeout is the only + // best-effort lever for lock contention (NOT a long-scan timeout). + this.db.exec(`PRAGMA busy_timeout = ${Math.max(0, Math.floor(this.defaultTimeoutMs))};`); + } + + async close(): Promise<void> { + if (!this.db) return; + // One handle owns graph + temporal. No two-adapter ordered teardown. + if (!this.readOnly) this.db.exec("PRAGMA wal_checkpoint(TRUNCATE);"); + this.db.close(); + this.db = undefined; + } + + async healthCheck(): Promise<{ ok: boolean; message?: string }> { + try { + const row = this.conn().prepare("SELECT 1 AS ok;").get() as { ok: number }; + return { ok: row.ok === 1 }; + } catch (err) { + return { ok: false, message: err instanceof Error ? err.message : String(err) }; + } + } + + async createSchema(): Promise<void> { + const db = this.conn(); + // ── Graph tier ── + // Generic node table: typed columns for the universal NodeBase fields + // (id/kind/name/file_path) + a JSON `payload` overflow carrying the + // kind-specific fields. This is the spike's central proposal: one table + // for 37 node kinds, not 37 tables.
Rehydration reads payload back + // verbatim so canonicalJson sees the identical field set on rebuild. + // + // Filter-only fields (severity, rule_id, ecosystem, method, …) are reached + // via SQLite JSON1 `payload->>'$.field'` extracts at query time — no extra + // typed columns needed, which keeps the write path lossless. + db.exec(` + CREATE TABLE IF NOT EXISTS nodes ( + id TEXT PRIMARY KEY, + kind TEXT NOT NULL, + name TEXT NOT NULL, + file_path TEXT, + start_line INTEGER, + end_line INTEGER, + payload TEXT -- canonical JSON of remaining fields + ); + CREATE INDEX IF NOT EXISTS idx_nodes_kind ON nodes(kind); + CREATE INDEX IF NOT EXISTS idx_nodes_name ON nodes(name); + CREATE INDEX IF NOT EXISTS idx_nodes_file ON nodes(file_path); + `); + // Edges: one polymorphic table keyed by type, with the (from,to,type,step) + // dedup tuple as the natural key — mirrors KnowledgeGraph.edgeDedupKey. + db.exec(` + CREATE TABLE IF NOT EXISTS edges ( + id TEXT PRIMARY KEY, + src TEXT NOT NULL, + dst TEXT NOT NULL, + type TEXT NOT NULL, + confidence REAL NOT NULL DEFAULT 1.0, + step INTEGER, + reason TEXT + ); + CREATE INDEX IF NOT EXISTS idx_edges_src ON edges(src, type); + CREATE INDEX IF NOT EXISTS idx_edges_dst ON edges(dst, type); + CREATE INDEX IF NOT EXISTS idx_edges_type ON edges(type); + `); + // Embeddings: the f32 vector lives in a BLOB (little-endian Float32Array + // bytes). Composite PK matches the existing (granularity,node_id,chunk) + // key. content_hash drives incremental skip-re-embed. 
+ db.exec(` + CREATE TABLE IF NOT EXISTS embeddings ( + node_id TEXT NOT NULL, + granularity TEXT NOT NULL DEFAULT 'symbol', + chunk_index INTEGER NOT NULL DEFAULT 0, + start_line INTEGER, + end_line INTEGER, + dim INTEGER NOT NULL, + vector BLOB NOT NULL, + content_hash TEXT NOT NULL, + PRIMARY KEY (granularity, node_id, chunk_index) + ); + `); + // BM25 search: an FTS5 virtual table mirroring the THREE columns lbug's + // QUERY_FTS_INDEX indexes — name + signature + description. node_id is + // UNINDEXED (carried for the join back to `nodes`). Populated at bulkLoad + // from nodes.name + payload.signature/description. + db.exec(` + CREATE VIRTUAL TABLE IF NOT EXISTS nodes_fts USING fts5( + node_id UNINDEXED, + name, + signature, + description, + tokenize='unicode61' + ); + `); + // ── Temporal / non-graph tier — same file, no second engine ── + // Canonical 7-column cochanges shape (matches schema-ddl.ts:30-42). + // last_cocommit_at is stored as a TEXT ISO-8601 string (SQLite has no + // native TIMESTAMP type; the affinity is irrelevant for a TEXT round-trip). + db.exec(` + CREATE TABLE IF NOT EXISTS cochanges ( + source_file TEXT NOT NULL, + target_file TEXT NOT NULL, + cocommit_count INTEGER NOT NULL, + total_commits_source INTEGER NOT NULL, + total_commits_target INTEGER NOT NULL, + last_cocommit_at TEXT NOT NULL, + lift REAL NOT NULL, + PRIMARY KEY (source_file, target_file) + ); + CREATE INDEX IF NOT EXISTS idx_cochanges_source ON cochanges (source_file); + CREATE INDEX IF NOT EXISTS idx_cochanges_target ON cochanges (target_file); + `); + // Canonical 9-column symbol_summaries shape (matches schema-ddl.ts:54-67). 
+ db.exec(` + CREATE TABLE IF NOT EXISTS symbol_summaries ( + node_id TEXT NOT NULL, + content_hash TEXT NOT NULL, + prompt_version TEXT NOT NULL, + model_id TEXT NOT NULL, + summary_text TEXT NOT NULL, + signature_summary TEXT, + returns_type_summary TEXT, + structured_json TEXT, + created_at TEXT NOT NULL, + PRIMARY KEY (node_id, content_hash, prompt_version) + ); + CREATE INDEX IF NOT EXISTS idx_summaries_node ON symbol_summaries (node_id); + `); + // Single-row meta table keyed by id=1 (mirrors GraphDbStore's StoreMeta + // {id:1} MERGE pattern). Typed columns so getMeta can re-attach optional + // fields only when the column is non-null (exactOptional readback). + db.exec(` + CREATE TABLE IF NOT EXISTS store_meta ( + id INTEGER PRIMARY KEY CHECK (id = 1), + schema_version TEXT NOT NULL, + last_commit TEXT, + indexed_at TEXT NOT NULL, + node_count INTEGER NOT NULL, + edge_count INTEGER NOT NULL, + stats_json TEXT, + cache_hit_ratio REAL, + cache_size_bytes INTEGER, + last_compaction TEXT, + embedder_model_id TEXT + ); + `); + } + + // ── Bulk load (graph write path) ──────────────────────────────────────────── + + async bulkLoad(graph: KnowledgeGraph, _opts?: BulkLoadOptions): Promise<BulkLoadStats> { + const db = this.conn(); + const start = Date.now(); + const nodes = graph.orderedNodes(); + const edges = graph.orderedEdges(); + const insNode = db.prepare( + `INSERT OR REPLACE INTO nodes (id,kind,name,file_path,start_line,end_line,payload) + VALUES (?,?,?,?,?,?,?)`, + ); + const insEdge = db.prepare( + `INSERT OR REPLACE INTO edges (id,src,dst,type,confidence,step,reason) + VALUES (?,?,?,?,?,?,?)`, + ); + const insFts = db.prepare( + `INSERT INTO nodes_fts (node_id,name,signature,description) VALUES (?,?,?,?)`, + ); + // FTS5 has no UPSERT; in upsert mode we delete the per-node FTS row before + // re-inserting so a re-loaded node does not duplicate its search entry.
+ const delFtsForNode = db.prepare(`DELETE FROM nodes_fts WHERE node_id = ?`); + // "replace" (default) truncates and reloads the whole graph. "upsert" MERGES + // the supplied nodes/edges into the existing graph WITHOUT wiping. This is + // the contract ingest-sarif relies on (it adds Finding nodes + FOUND_IN + // edges to an already-loaded graph; a wipe here would destroy the index, as + // it did before this fix). INSERT OR REPLACE handles the per-row upsert. + const mode = _opts?.mode ?? "replace"; + // One transaction for the whole load, so there is a single commit instead + // of one per row. + db.exec("BEGIN"); + try { + if (mode === "replace") { + db.exec("DELETE FROM nodes"); + db.exec("DELETE FROM edges"); + db.exec("DELETE FROM nodes_fts"); + } + for (const n of nodes) { + this.writeNode(insNode, n); + const anyNode = n as unknown as Record<string, unknown>; + const sig = anyNode["signature"]; + const desc = anyNode["description"]; + if (mode === "upsert") delFtsForNode.run(String(n.id)); + insFts.run( + String(n.id), + String(n.name), + typeof sig === "string" ? sig : "", + typeof desc === "string" ? desc : "", + ); + } + for (const e of edges) { + insEdge.run(e.id, e.from, e.to, e.type, e.confidence, e.step ?? null, e.reason ?? null); + } + db.exec("COMMIT"); + } catch (err) { + db.exec("ROLLBACK"); + throw err; + } + // Stamp store_meta from the ACTUAL post-write table counts, so an upsert + // batch (which carries only the added rows) does not clobber the meta with + // a partial count. Callers that own richer meta (analyze) overwrite this + // with a full setMeta() afterward; this keeps a freshly-bulk-loaded store + // self-consistent on its own. + const totalNodes = (db.prepare("SELECT count(*) c FROM nodes").get() as { c: number }).c; + const totalEdges = (db.prepare("SELECT count(*) c FROM edges").get() as { c: number }).c; + const existing = await this.getMeta(); + await this.setMeta({ + ...(existing ?? {}), + schemaVersion: existing?.schemaVersion ??
SCHEMA_VERSION, + indexedAt: existing?.indexedAt ?? new Date().toISOString(), + nodeCount: totalNodes, + edgeCount: totalEdges, + }); + // bulkLoad reports the rows IT loaded (the batch), not the table total. + return { + nodeCount: nodes.length, + edgeCount: edges.length, + durationMs: Date.now() - start, + }; + } + + private writeNode(stmt: StatementSync, n: GraphNode): void { + // Split the universal base off; everything else canonical-JSONs into payload. + const anyNode = n as unknown as Record; + const { + id, + kind, + name, + filePath = undefined, + startLine = undefined, + endLine = undefined, + ...rest + } = anyNode; + stmt.run( + String(id), + String(kind), + String(name), + filePath === undefined ? null : String(filePath), + typeof startLine === "number" ? startLine : null, + typeof endLine === "number" ? endLine : null, + Object.keys(rest).length ? JSON.stringify(rest) : null, + ); + } + + // ── Node finders ───────────────────────────────────────────────────────────── + + async getNode(id: NodeId): Promise { + const row = this.conn() + .prepare("SELECT * FROM nodes WHERE id = ?") + .get(String(id)) as unknown as NodeRow | undefined; + return row ? rehydrateNode(row) : undefined; + } + + async listNodes(opts: ListNodesOptions = {}): Promise { + // Empty-array short-circuits BEFORE touching the connection (matches + // GraphDbStore.listNodes:1115-1117 — pure-JS contract). + const kinds = opts.kinds; + if (kinds !== undefined && kinds.length === 0) return []; + const idsRaw = opts.ids; + if (idsRaw !== undefined && idsRaw.length === 0) return []; + const ids = idsRaw !== undefined ? 
Array.from(new Set(idsRaw)) : undefined; + const limit = clampNonNegativeInt(opts.limit); + const offset = clampNonNegativeInt(opts.offset); + + const wheres: string[] = []; + const params: SqlParam[] = []; + if (kinds && kinds.length > 0) { + wheres.push(`kind IN (${placeholders(kinds.length)})`); + for (const k of kinds) params.push(k); + } + if (ids !== undefined && ids.length > 0) { + wheres.push(`id IN (${placeholders(ids.length)})`); + for (const i of ids) params.push(i); + } + if (opts.filePath !== undefined) { + wheres.push("file_path = ?"); + params.push(opts.filePath); + } + const sql = `SELECT * FROM nodes${whereClause(wheres)} ORDER BY id ASC${pageClause(limit, offset)}`; + const rows = this.conn() + .prepare(sql) + .all(...(params as SqliteParam[])) as unknown as NodeRow[]; + return sortById(rows.map(rehydrateNode)); + } + + async listNodesByKind( + kind: K, + opts: ListNodesByKindOptions = {}, + ): Promise[]> { + const limit = clampNonNegativeInt(opts.limit); + const offset = clampNonNegativeInt(opts.offset); + const wheres: string[] = ["kind = ?"]; + const params: SqlParam[] = [kind]; + // NOTE: GraphDbStore ANDs filePath + filePathLike (impl 1201-1210) even + // though the interface doc says "exact takes priority" — mirror the IMPL. + if (opts.filePath !== undefined) { + wheres.push("file_path = ?"); + params.push(opts.filePath); + } + if (opts.filePathLike !== undefined) { + wheres.push("file_path LIKE '%' || ? 
|| '%'"); + params.push(opts.filePathLike); + } + const sql = `SELECT * FROM nodes${whereClause(wheres)} ORDER BY id ASC${pageClause(limit, offset)}`; + const rows = this.conn() + .prepare(sql) + .all(...(params as SqliteParam[])) as unknown as NodeRow[]; + return sortById(rows.map(rehydrateNode)) as unknown as readonly NodeOfKind[]; + } + + async listFindings(opts: ListFindingsOptions = {}): Promise { + const wheres: string[] = ["kind = 'Finding'"]; + const params: SqlParam[] = []; + if (opts.severity && opts.severity.length > 0) { + wheres.push(`payload->>'$.severity' IN (${placeholders(opts.severity.length)})`); + for (const s of opts.severity) params.push(s); + } + if (opts.ruleId !== undefined) { + wheres.push("payload->>'$.ruleId' = ?"); + params.push(opts.ruleId); + } + if (opts.baselineState && opts.baselineState.length > 0) { + wheres.push(`payload->>'$.baselineState' IN (${placeholders(opts.baselineState.length)})`); + for (const s of opts.baselineState) params.push(s); + } + if (opts.suppressed === true) { + wheres.push("payload->>'$.suppressedJson' IS NOT NULL"); + } else if (opts.suppressed === false) { + wheres.push("payload->>'$.suppressedJson' IS NULL"); + } + const limit = clampNonNegativeInt(opts.limit); + const sql = + "SELECT * FROM nodes" + + whereClause(wheres) + + " ORDER BY id ASC" + + pageClause(limit, undefined); + const rows = this.conn() + .prepare(sql) + .all(...(params as SqliteParam[])) as unknown as NodeRow[]; + const out: FindingNode[] = []; + for (const r of rows) { + const node = rehydrateNode(r); + if (node.kind === "Finding") out.push(node as FindingNode); + } + return sortById(out) as readonly FindingNode[]; + } + + async listDependencies(opts: ListDependenciesOptions = {}): Promise { + const wheres: string[] = ["kind = 'Dependency'"]; + const params: SqlParam[] = []; + if (opts.ecosystem !== undefined) { + wheres.push("payload->>'$.ecosystem' = ?"); + params.push(opts.ecosystem); + } + const limit = 
clampNonNegativeInt(opts.limit); + const sql = + "SELECT * FROM nodes" + + whereClause(wheres) + + " ORDER BY id ASC" + + pageClause(limit, undefined); + const rows = this.conn() + .prepare(sql) + .all(...(params as SqliteParam[])) as unknown as NodeRow[]; + // licenseTier is a JS-side post-filter via classifyLicenseTier, NOT SQL — + // the LIMIT above applies BEFORE the tier filter, matching the reference. + const tierSet = + opts.licenseTier && opts.licenseTier.length > 0 ? new Set(opts.licenseTier) : undefined; + const out: DependencyNode[] = []; + for (const r of rows) { + const node = rehydrateNode(r); + if (node.kind !== "Dependency") continue; + if (tierSet) { + const tier = classifyLicenseTier((node as DependencyNode).license); + if (!tierSet.has(tier)) continue; + } + out.push(node as DependencyNode); + } + return sortById(out) as readonly DependencyNode[]; + } + + async listRoutes(opts: ListRoutesOptions = {}): Promise { + const wheres: string[] = ["kind = 'Route'"]; + const params: SqlParam[] = []; + if (opts.methods && opts.methods.length > 0) { + wheres.push(`payload->>'$.method' IN (${placeholders(opts.methods.length)})`); + for (const m of opts.methods) params.push(m); + } + if (opts.pathLike !== undefined) { + wheres.push("payload->>'$.url' LIKE '%' || ? || '%'"); + params.push(opts.pathLike); + } + const limit = clampNonNegativeInt(opts.limit); + const sql = + "SELECT * FROM nodes" + + whereClause(wheres) + + " ORDER BY id ASC" + + pageClause(limit, undefined); + const rows = this.conn() + .prepare(sql) + .all(...(params as SqliteParam[])) as unknown as NodeRow[]; + const out: RouteNode[] = []; + for (const r of rows) { + const node = rehydrateNode(r); + if (node.kind === "Route") out.push(node as RouteNode); + } + return sortById(out) as readonly RouteNode[]; + } + + async getRepoNode(id: string): Promise { + // Double-guard kind='Repo': in the WHERE and again on the rehydrated node. 
+ const row = this.conn() + .prepare("SELECT * FROM nodes WHERE id = ? AND kind = 'Repo' LIMIT 1") + .get(String(id)) as unknown as NodeRow | undefined; + if (!row) return undefined; + const node = rehydrateNode(row); + if (node.kind !== "Repo") return undefined; + return node as RepoNode; + } + + async listNodesByEntryPoint(entryPointId: string): Promise { + // Kind-agnostic on read; entryPointId lives in the payload. + const rows = this.conn() + .prepare("SELECT * FROM nodes WHERE payload->>'$.entryPointId' = ? ORDER BY id ASC") + .all(entryPointId) as unknown as NodeRow[]; + return sortById(rows.map(rehydrateNode)); + } + + async listNodesByName( + name: string, + opts: ListNodesByNameOptions = {}, + ): Promise { + const kinds = opts.kinds; + if (kinds !== undefined && kinds.length === 0) return []; + const wheres: string[] = ["name = ?"]; + const params: SqlParam[] = [name]; + if (kinds && kinds.length > 0) { + wheres.push(`kind IN (${placeholders(kinds.length)})`); + for (const k of kinds) params.push(k); + } + if (opts.filePath !== undefined) { + wheres.push("file_path = ?"); + params.push(opts.filePath); + } + const limit = clampNonNegativeInt(opts.limit); + const sql = + "SELECT * FROM nodes" + + whereClause(wheres) + + " ORDER BY id ASC" + + pageClause(limit, undefined); + const rows = this.conn() + .prepare(sql) + .all(...(params as SqliteParam[])) as unknown as NodeRow[]; + return sortById(rows.map(rehydrateNode)); + } + + async countNodesByKind(kinds?: readonly NodeKind[]): Promise> { + const out = new Map(); + // kinds:[] → empty Map (short-circuit before the connection). 
+ if (kinds !== undefined && kinds.length === 0) return out; + let sql = "SELECT kind, COUNT(*) AS n FROM nodes"; + const params: SqlParam[] = []; + if (kinds && kinds.length > 0) { + sql += ` WHERE kind IN (${placeholders(kinds.length)})`; + for (const k of kinds) params.push(k); + } + sql += " GROUP BY kind ORDER BY kind ASC"; + const rows = this.conn() + .prepare(sql) + .all(...(params as SqliteParam[])) as unknown as { + kind: string; + n: number | bigint; + }[]; + for (const r of rows) { + out.set(r.kind as NodeKind, typeof r.n === "bigint" ? Number(r.n) : Number(r.n ?? 0)); + } + // Backfill 0 for every requested kind absent from the result. + if (kinds) { + for (const k of kinds) if (!out.has(k)) out.set(k, 0); + } + return out; + } + + async countEdgesByType(types?: readonly RelationType[]): Promise> { + const out = new Map(); + // types:[] → empty Map (short-circuit before the connection). + if (types !== undefined && types.length === 0) return out; + const requested: readonly RelationType[] = + types && types.length > 0 ? types : (getAllRelationTypes() as readonly RelationType[]); + let sql = "SELECT type, COUNT(*) AS n FROM edges"; + const params: SqlParam[] = []; + if (types && types.length > 0) { + sql += ` WHERE type IN (${placeholders(types.length)})`; + for (const t of types) params.push(t); + } + sql += " GROUP BY type"; + const rows = this.conn() + .prepare(sql) + .all(...(params as SqliteParam[])) as unknown as { + type: string; + n: number | bigint; + }[]; + const counts = new Map(); + for (const r of rows) { + counts.set(r.type, typeof r.n === "bigint" ? Number(r.n) : Number(r.n ?? 0)); + } + // Emit a 0 entry for every requested/all type with no rows (the + // GraphDbStore per-type loop guarantees every input type appears). + for (const t of requested) out.set(t, counts.get(t) ?? 
0); + return out; + } + + // ── Edges ────────────────────────────────────────────────────────────────── + + async listEdges(opts: ListEdgesOptions = {}): Promise { + const wheres: string[] = []; + const params: SqlParam[] = []; + // types undefined OR empty → all types; non-empty → restrict. + if (opts.types && opts.types.length > 0) { + wheres.push(`type IN (${placeholders(opts.types.length)})`); + for (const t of opts.types) params.push(t); + } + if (opts.fromIds && opts.fromIds.length > 0) { + wheres.push(`src IN (${placeholders(opts.fromIds.length)})`); + for (const f of opts.fromIds) params.push(f); + } + if (opts.toIds && opts.toIds.length > 0) { + wheres.push(`dst IN (${placeholders(opts.toIds.length)})`); + for (const t of opts.toIds) params.push(t); + } + // minConfidence: mirror the IMPL (`>=`, inclusive floor), NOT the prose + // ("strictly below"). Both adapters must agree on `>=` for conformance. + if (opts.minConfidence !== undefined) { + wheres.push("confidence >= ?"); + params.push(opts.minConfidence); + } + const sql = `SELECT id, src, dst, type, confidence, step, reason FROM edges${whereClause(wheres)}`; + const rows = this.conn() + .prepare(sql) + .all(...(params as SqliteParam[])) as unknown as EdgeRow[]; + + const collected: CodeRelation[] = []; + for (const row of rows) { + // step-0 sentinel: 0/null/undefined/non-finite → drop the key. + const step = stepZeroSentinel(row.step); + // reason: non-empty string kept; null OR "" → drop the key (.length > 0). + const reasonVal = row.reason; + const reason = typeof reasonVal === "string" && reasonVal.length > 0 ? reasonVal : undefined; + collected.push({ + id: String(row.id ?? "") as CodeRelation["id"], + from: String(row.src ?? "") as CodeRelation["from"], + to: String(row.dst ?? "") as CodeRelation["to"], + type: row.type as RelationType, + confidence: Number(row.confidence ?? 0), + ...(reason !== undefined ? { reason } : {}), + ...(step !== undefined ? 
{ step } : {}), + }); + } + // Final ordering: (from, to, type, id) — byte-for-byte the GraphDbStore key. + collected.sort((x, y) => { + if (x.from !== y.from) return x.from < y.from ? -1 : 1; + if (x.to !== y.to) return x.to < y.to ? -1 : 1; + if (x.type !== y.type) return x.type < y.type ? -1 : 1; + if (x.id !== y.id) return x.id < y.id ? -1 : 1; + return 0; + }); + // limit/offset applied AFTER sort; clamp via clampNonNegativeInt. + const limit = clampNonNegativeInt(opts.limit); + const offset = clampNonNegativeInt(opts.offset); + const startAt = offset ?? 0; + const end = limit !== undefined ? startAt + limit : collected.length; + return collected.slice(startAt, end); + } + + async listEdgesByType( + type: RelationType, + opts: ListEdgesByTypeOptions = {}, + ): Promise { + // Pin types:[type], forward the rest (NO offset on ListEdgesByTypeOptions), + // delegate to the same listEdges body. + const merged: ListEdgesOptions = { + types: [type], + ...(opts.fromIds !== undefined ? { fromIds: opts.fromIds } : {}), + ...(opts.toIds !== undefined ? { toIds: opts.toIds } : {}), + ...(opts.minConfidence !== undefined ? { minConfidence: opts.minConfidence } : {}), + ...(opts.limit !== undefined ? { limit: opts.limit } : {}), + }; + return this.listEdges(merged); + } + + async listConsumerProducerEdges( + opts: { readonly repoUris?: readonly string[] } = {}, + ): Promise { + // One row per FETCHES edge whose producer (target) is kind 'Operation'. + // repo_uri / http_method / http_path live in the producer's payload + // (camelCase: repoUri / method / path). + const params: SqlParam[] = []; + let repoPredicate = ""; + if (opts.repoUris && opts.repoUris.length > 0) { + const phs = placeholders(opts.repoUris.length); + repoPredicate = + ` AND (consumer.payload->>'$.repoUri' IN (${phs}) ` + + `OR producer.payload->>'$.repoUri' IN (${phs}))`; + // The IN list appears twice in the SQL → bind the values twice. 
+ for (const u of opts.repoUris) params.push(u); + for (const u of opts.repoUris) params.push(u); + } + const sql = + "SELECT consumer.id AS consumer_node_id, " + + "consumer.payload->>'$.repoUri' AS consumer_repo_uri, " + + "producer.id AS producer_node_id, " + + "producer.payload->>'$.repoUri' AS producer_repo_uri, " + + "producer.payload->>'$.method' AS http_method, " + + "producer.payload->>'$.path' AS http_path, " + + "e.id AS r_id " + + "FROM edges e " + + "JOIN nodes consumer ON e.src = consumer.id " + + "JOIN nodes producer ON e.dst = producer.id " + + "WHERE e.type = 'FETCHES' AND producer.kind = 'Operation'" + + repoPredicate + + " ORDER BY consumer_repo_uri ASC, producer_repo_uri ASC, " + + "http_method ASC, http_path ASC, r_id ASC"; + const rows = this.conn() + .prepare(sql) + .all(...(params as SqliteParam[])) as unknown as Record<string, unknown>[]; + // SQL ORDER BY is authoritative here — NO JS re-sort. + const out: ConsumerProducerEdge[] = []; + for (const row of rows) { + out.push({ + consumerNodeId: String(row["consumer_node_id"] ?? ""), + consumerRepoUri: String(row["consumer_repo_uri"] ?? ""), + producerNodeId: String(row["producer_node_id"] ?? ""), + producerRepoUri: String(row["producer_repo_uri"] ?? ""), + httpMethod: String(row["http_method"] ?? ""), + httpPath: String(row["http_path"] ?? ""), + }); + } + return out; + } + + // ── Embeddings ─────────────────────────────────────────────────────────────── + + async upsertEmbeddings(rows: readonly EmbeddingRow[]): Promise<void> { + const db = this.conn(); + const stmt = db.prepare( + `INSERT OR REPLACE INTO embeddings + (node_id,granularity,chunk_index,start_line,end_line,dim,vector,content_hash) + VALUES (?,?,?,?,?,?,?,?)`, + ); + db.exec("BEGIN"); + try { + for (const r of rows) { + stmt.run( + r.nodeId, + r.granularity ?? "symbol", + r.chunkIndex, + r.startLine ?? null, + r.endLine ??
null, + r.vector.length, + f32ToBlob(r.vector), + r.contentHash, + ); + } + db.exec("COMMIT"); + } catch (err) { + db.exec("ROLLBACK"); + throw err; + } + } + + async listEmbeddingHashes(): Promise<Map<string, string>> { + const rows = this.conn() + .prepare("SELECT node_id, granularity, chunk_index, content_hash FROM embeddings") + .all() as unknown as { + node_id: unknown; + granularity: unknown; + chunk_index: unknown; + content_hash: unknown; + }[]; + const out = new Map<string, string>(); + for (const r of rows) { + const nodeId = r.node_id; + const granularity = r.granularity; + const chunkIndex = r.chunk_index; + const contentHash = r.content_hash; + if ( + typeof nodeId !== "string" || + typeof granularity !== "string" || + typeof contentHash !== "string" || + (typeof chunkIndex !== "number" && typeof chunkIndex !== "bigint") + ) { + continue; + } + const ci = typeof chunkIndex === "bigint" ? Number(chunkIndex) : chunkIndex; + // Key separator is NUL (\0), NOT ':' (NodeIds contain ':'). + out.set(`${granularity}\0${nodeId}\0${ci}`, contentHash); + } + return out; + } + + async *listEmbeddings(opts: ListEmbeddingsOptions = {}): AsyncIterable<EmbeddingRow> { + const kinds = opts.kindFilter; + // Empty kindFilter short-circuits to an empty stream.
+ if (kinds !== undefined && kinds.length === 0) return; + const limit = clampNonNegativeInt(opts.limit); + const params: SqlParam[] = []; + let sql = + "SELECT e.node_id AS node_id, e.granularity AS granularity, " + + "e.chunk_index AS chunk_index, e.start_line AS start_line, " + + "e.end_line AS end_line, e.vector AS vector, e.content_hash AS content_hash " + + "FROM embeddings e"; + if (kinds && kinds.length > 0) { + sql += ` JOIN nodes n ON n.id = e.node_id WHERE n.kind IN (${placeholders(kinds.length)})`; + for (const k of kinds) params.push(k); + } + sql += " ORDER BY e.node_id ASC, e.granularity ASC, e.chunk_index ASC"; + if (limit !== undefined) sql += ` LIMIT ${limit}`; + const rows = this.conn() + .prepare(sql) + .all(...(params as SqliteParam[])) as unknown as EmbRow[]; + for (const r of rows) { + // exactOptionalPropertyTypes: spread optional fields conditionally + // rather than assigning undefined. + yield { + nodeId: r.node_id, + ...(r.granularity ? { granularity: r.granularity as EmbeddingRow["granularity"] } : {}), + chunkIndex: r.chunk_index, + ...(r.start_line != null ? { startLine: r.start_line } : {}), + ...(r.end_line != null ? { endLine: r.end_line } : {}), + vector: blobToF32(r.vector), + contentHash: r.content_hash, + } as EmbeddingRow; + } + } + + /** + * Brute-force cosine KNN in JS. For repo-scale embedding counts (10²–10⁵ + * vectors) a linear scan with a typed-array dot product is sub-10ms and + * dependency-free. If a repo ever needs ANN, sqlite-vec loads as a runtime + * extension via the `loadExtension` seam proven in the spike — no rebuild. + */ + async vectorSearch(q: VectorQuery): Promise { + if (q.vector.length !== this.dim) { + throw new Error(`Vector dimension mismatch: got ${q.vector.length}, expected ${this.dim}`); + } + const limit = q.limit ?? 
10; + const query = q.vector; + const rows = this.conn().prepare("SELECT node_id, vector FROM embeddings").all() as unknown as { + node_id: string; + vector: Uint8Array; + }[]; + // VectorResult.distance is a DISTANCE (lower = closer). Cosine distance + // = 1 - cosine similarity, so ranking ascending matches the lbug HNSW + // contract (ORDER BY distance ASC). + const scored: VectorResult[] = rows.map((r) => ({ + nodeId: r.node_id, + distance: 1 - cosine(query, blobToF32(r.vector)), + })); + scored.sort((a, b) => a.distance - b.distance); + return scored.slice(0, limit); + } + + // ── BM25 search via FTS5 ───────────────────────────────────────────────────── + + async search(q: SearchQuery): Promise { + const limit = q.limit ?? DEFAULT_SEARCH_LIMIT; + const kindFilter = q.kinds && q.kinds.length > 0 ? q.kinds : undefined; + const params: SqlParam[] = [q.text]; + let kindPredicate = ""; + if (kindFilter) { + kindPredicate = ` AND n.kind IN (${placeholders(kindFilter.length)})`; + for (const k of kindFilter) params.push(k); + } + // CRITICAL: SQLite bm25() returns a NEGATIVE number (more-negative = + // more-relevant). To expose SearchResult.score as "higher = better" + // (matching lbug's score DESC), set score = -bm25(...) and ORDER BY + // bm25(...) ASC (== score DESC). Tiebreak (id, file_path, name) ASC + // mirrors DuckDbStore.search. + const sql = + "SELECT n.id AS node_id, n.file_path AS file_path, n.name AS name, n.kind AS kind, " + + "-bm25(nodes_fts) AS score, bm25(nodes_fts) AS rank " + + "FROM nodes_fts JOIN nodes n ON n.id = nodes_fts.node_id " + + "WHERE nodes_fts MATCH ?" + + kindPredicate + + ` ORDER BY rank ASC, n.id ASC, n.file_path ASC, n.name ASC LIMIT ${Number(limit)}`; + const rows = this.conn() + .prepare(sql) + .all(...(params as SqliteParam[])) as unknown as Record[]; + const out: SearchResult[] = []; + for (const row of rows) { + // The storage-layer search() NEVER fills summary/signatureSummary — + // they are a post-join done by MCP/CLI. 
+ out.push({ + nodeId: String(row["node_id"] ?? ""), + score: Number(row["score"] ?? 0), + filePath: String(row["file_path"] ?? ""), + name: String(row["name"] ?? ""), + kind: String(row["kind"] ?? ""), + }); + } + return out; + } + + // ── Graph traversal (impact / blast-radius) via recursive CTE ──────────────── + + /** + * Reachability traversal as a single recursive CTE. `direction:"down"` + * follows outgoing edges (callees / dependencies); `"up"` follows incoming + * edges (callers / dependents — the blast-radius direction). Bounded by + * maxDepth so a cyclic graph terminates. This is the LadybugDB-Cypher + * replacement, and the whole reason traversal is feasible without a + * graph engine. + */ + async traverse(q: TraverseQuery): Promise { + const maxDepth = Math.max(0, Math.floor(q.maxDepth)); + if (maxDepth === 0) return []; + const minConf = q.minConfidence ?? 0; + // relationTypes empty/undefined → all types (no type predicate). + const relTypes = q.relationTypes && q.relationTypes.length > 0 ? q.relationTypes : undefined; + const typeParams: SqlParam[] = []; + let typePredDown = ""; + let typePredUp = ""; + if (relTypes) { + const phs = placeholders(relTypes.length); + typePredDown = ` AND edges.type IN (${phs})`; + typePredUp = ` AND edges.type IN (${phs})`; + } + // Cycle guard: wrap the path and the candidate id in commas so only whole + // ids match; a bare instr() would treat `fn_1` as visited once `fn_10` + // is on the path. + const downStep = + "SELECT edges.dst, reach.depth + 1, reach.path || ',' || edges.dst " + + "FROM edges JOIN reach ON edges.src = reach.node_id " + + `WHERE reach.depth < ? AND edges.confidence >= ? AND instr(',' || reach.path || ',', ',' || edges.dst || ',') = 0${typePredDown}`; + const upStep = + "SELECT edges.src, reach.depth + 1, reach.path || ',' || edges.src " + + "FROM edges JOIN reach ON edges.dst = reach.node_id " + + `WHERE reach.depth < ? AND edges.confidence >= ?
AND instr(',' || reach.path || ',', ',' || edges.src || ',') = 0${typePredUp}`; + + let recursive: string; + const stepParams: SqlParam[] = []; + const pushStep = (down: boolean): void => { + stepParams.push(maxDepth, minConf); + if (relTypes) for (const t of relTypes) stepParams.push(t); + void down; + }; + if (q.direction === "down") { + recursive = downStep; + pushStep(true); + } else if (q.direction === "up") { + recursive = upStep; + pushStep(false); + } else { + recursive = `${downStep} UNION ${upStep}`; + pushStep(true); + pushStep(false); + } + const sql = ` + WITH RECURSIVE reach(node_id, depth, path) AS ( + SELECT ?, 0, ? + UNION + ${recursive} + ) + SELECT node_id, MIN(depth) AS depth, path + FROM reach WHERE node_id != ? + GROUP BY node_id ORDER BY depth ASC, node_id ASC`; + const allParams: SqlParam[] = [ + String(q.startId), + String(q.startId), + ...stepParams, + String(q.startId), + ]; + void typeParams; + const rows = this.conn() + .prepare(sql) + .all(...(allParams as SqliteParam[])) as unknown as { + node_id: string; + depth: number; + path: string; + }[]; + return rows.map((r) => ({ + nodeId: r.node_id, + depth: r.depth, + path: r.path.split(","), + })); + } + + async traverseAncestors(opts: AncestorTraversalOptions): Promise { + return this.traverseDirectional(opts, "up"); + } + + async traverseDescendants(opts: DescendantTraversalOptions): Promise { + return this.traverseDirectional(opts, "down"); + } + + private async traverseDirectional( + opts: AncestorTraversalOptions | DescendantTraversalOptions, + direction: "up" | "down", + ): Promise { + // edgeTypes:[] → [] short-circuit (matches traverseDirectionalGd:1720). + if (opts.edgeTypes.length === 0) return []; + const traverseQuery: TraverseQuery = { + startId: opts.fromId, + relationTypes: opts.edgeTypes, + direction, + maxDepth: opts.maxDepth, + ...(opts.minConfidence !== undefined ?
{ minConfidence: opts.minConfidence } : {}), + }; + return this.traverse(traverseQuery); + } + + // ── Meta ───────────────────────────────────────────────────────────────────── + + async getMeta(): Promise { + const row = this.conn().prepare("SELECT * FROM store_meta WHERE id = 1").get() as unknown as + | MetaRow + | undefined; + if (!row) return undefined; + const stats = + typeof row.stats_json === "string" && row.stats_json.length > 0 + ? (JSON.parse(row.stats_json) as Record) + : undefined; + // exactOptionalPropertyTypes: re-attach optional fields ONLY when the + // column is non-null/non-undefined (mirrors getMeta:1936-1954). + return { + schemaVersion: String(row.schema_version), + ...(row.last_commit !== null && row.last_commit !== undefined + ? { lastCommit: String(row.last_commit) } + : {}), + indexedAt: String(row.indexed_at), + nodeCount: Number(row.node_count ?? 0), + edgeCount: Number(row.edge_count ?? 0), + ...(stats ? { stats } : {}), + ...(row.cache_hit_ratio !== null && row.cache_hit_ratio !== undefined + ? { cacheHitRatio: Number(row.cache_hit_ratio) } + : {}), + ...(row.cache_size_bytes !== null && row.cache_size_bytes !== undefined + ? { cacheSizeBytes: Number(row.cache_size_bytes) } + : {}), + ...(row.last_compaction !== null && row.last_compaction !== undefined + ? { lastCompaction: String(row.last_compaction) } + : {}), + ...(row.embedder_model_id !== null && row.embedder_model_id !== undefined + ? { embedderModelId: String(row.embedder_model_id) } + : {}), + }; + } + + async setMeta(meta: StoreMeta): Promise { + const statsJson = meta.stats ? JSON.stringify(meta.stats) : null; + // UPSERT a single row keyed by id=1 (SQLite ON CONFLICT DO UPDATE). + this.conn() + .prepare( + `INSERT INTO store_meta ( + id, schema_version, last_commit, indexed_at, node_count, edge_count, + stats_json, cache_hit_ratio, cache_size_bytes, last_compaction, embedder_model_id + ) VALUES (1, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?) 
+ ON CONFLICT(id) DO UPDATE SET + schema_version = excluded.schema_version, + last_commit = excluded.last_commit, + indexed_at = excluded.indexed_at, + node_count = excluded.node_count, + edge_count = excluded.edge_count, + stats_json = excluded.stats_json, + cache_hit_ratio = excluded.cache_hit_ratio, + cache_size_bytes = excluded.cache_size_bytes, + last_compaction = excluded.last_compaction, + embedder_model_id = excluded.embedder_model_id`, + ) + .run( + meta.schemaVersion, + meta.lastCommit ?? null, + meta.indexedAt, + meta.nodeCount, + meta.edgeCount, + statsJson, + meta.cacheHitRatio ?? null, + meta.cacheSizeBytes ?? null, + meta.lastCompaction ?? null, + meta.embedderModelId ?? null, + ); + } + + // ── ITemporalStore: read-only SQL escape hatch ─────────────────────────────── + + async exec( + sql: string, + params: readonly SqlParam[] = [], + opts: { readonly timeoutMs?: number } = {}, + ): Promise[]> { + // (1) Guard FIRST, before touching the connection — throws SqlGuardError. + assertReadOnlySql(sql); + void opts; // timeout is best-effort via PRAGMA busy_timeout (set at open); + // node:sqlite has no per-statement interrupt, so opts.timeoutMs cannot be + // hard-enforced here. Kept on the signature for interface compatibility. + const stmt = this.conn().prepare(sql); + // (2) Bind positional params 1..N, coercing undefined → null. + const bound = params.map((p) => (p ?? null) as SqliteParam); + const rows = stmt.all(...bound) as unknown as Record[]; + return rows; + } + + // ── ITemporalStore: cochanges ──────────────────────────────────────────────── + + async bulkLoadCochanges(rows: readonly CochangeRow[]): Promise { + const db = this.conn(); + db.exec("BEGIN"); + try { + // REPLACE semantics: clear the whole table even on empty input. + db.exec("DELETE FROM cochanges"); + if (rows.length === 0) { + db.exec("COMMIT"); + return; + } + // Sort by (sourceFile, targetFile) for deterministic insert order. 
+ const sorted = [...rows].sort((a, b) => { + if (a.sourceFile !== b.sourceFile) return a.sourceFile < b.sourceFile ? -1 : 1; + return a.targetFile < b.targetFile ? -1 : a.targetFile > b.targetFile ? 1 : 0; + }); + const stmt = db.prepare( + `INSERT INTO cochanges ( + source_file, target_file, cocommit_count, + total_commits_source, total_commits_target, + last_cocommit_at, lift + ) VALUES (?, ?, ?, ?, ?, ?, ?)`, + ); + for (const r of sorted) { + stmt.run( + r.sourceFile, + r.targetFile, + r.cocommitCount, + r.totalCommitsSource, + r.totalCommitsTarget, + r.lastCocommitAt, + r.lift, + ); + } + db.exec("COMMIT"); + } catch (err) { + db.exec("ROLLBACK"); + throw err; + } + } + + async lookupCochangesForFile( + file: string, + opts: CochangeLookupOptions = {}, + ): Promise { + const limit = Math.max(0, Math.floor(opts.limit ?? DEFAULT_COCHANGE_LOOKUP_LIMIT)); + const minLift = opts.minLift ?? DEFAULT_COCHANGE_MIN_LIFT; + // Probe BOTH directions (signal is symmetric); ORDER BY lift DESC then + // pair key ASC; LIMIT max(0, floor(limit)). + const rows = this.conn() + .prepare( + `SELECT source_file, target_file, cocommit_count, + total_commits_source, total_commits_target, + last_cocommit_at, lift + FROM cochanges + WHERE (source_file = ? OR target_file = ?) AND lift >= ? + ORDER BY lift DESC, source_file ASC, target_file ASC + LIMIT ?`, + ) + .all(file, file, minLift, limit) as unknown as Record[]; + return rows.map(cochangeRowFromRecord); + } + + async lookupCochangesBetween(fileA: string, fileB: string): Promise { + const row = this.conn() + .prepare( + `SELECT source_file, target_file, cocommit_count, + total_commits_source, total_commits_target, + last_cocommit_at, lift + FROM cochanges + WHERE (source_file = ? AND target_file = ?) + OR (source_file = ? AND target_file = ?) + LIMIT 1`, + ) + .get(fileA, fileB, fileB, fileA) as unknown as Record | undefined; + return row ? 
cochangeRowFromRecord(row) : undefined; + } + + // ── ITemporalStore: symbol summaries ───────────────────────────────────────── + + async bulkLoadSymbolSummaries(rows: readonly SymbolSummaryRow[]): Promise { + // Empty input → no-op return (NOT a table clear — symbol summaries are + // upserts, not replace). + if (rows.length === 0) return; + const db = this.conn(); + // Sort by (nodeId, contentHash, promptVersion) for insert determinism. + const sorted = [...rows].sort((a, b) => { + if (a.nodeId !== b.nodeId) return a.nodeId < b.nodeId ? -1 : 1; + if (a.contentHash !== b.contentHash) return a.contentHash < b.contentHash ? -1 : 1; + if (a.promptVersion !== b.promptVersion) return a.promptVersion < b.promptVersion ? -1 : 1; + return 0; + }); + db.exec("BEGIN"); + try { + // DELETE+INSERT upsert per composite key (mirrors DuckDb's approach). + const del = db.prepare( + "DELETE FROM symbol_summaries WHERE node_id = ? AND content_hash = ? AND prompt_version = ?", + ); + const ins = db.prepare( + `INSERT INTO symbol_summaries ( + node_id, content_hash, prompt_version, model_id, + summary_text, signature_summary, returns_type_summary, + structured_json, created_at + ) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)`, + ); + for (const r of sorted) { + del.run(r.nodeId, r.contentHash, r.promptVersion); + ins.run( + r.nodeId, + r.contentHash, + r.promptVersion, + r.modelId, + r.summaryText, + r.signatureSummary ?? null, + r.returnsTypeSummary ?? null, + r.structuredJson ?? null, + r.createdAt, + ); + } + db.exec("COMMIT"); + } catch (err) { + db.exec("ROLLBACK"); + throw err; + } + } + + async lookupSymbolSummary( + nodeId: string, + contentHash: string, + promptVersion: string, + ): Promise { + const row = this.conn() + .prepare( + `SELECT node_id, content_hash, prompt_version, model_id, + summary_text, signature_summary, returns_type_summary, + structured_json, created_at + FROM symbol_summaries + WHERE node_id = ? AND content_hash = ? AND prompt_version = ? 
+ LIMIT 1`, + ) + .get(nodeId, contentHash, promptVersion) as unknown as Record | undefined; + return row ? summaryRowFromRecord(row) : undefined; + } + + async lookupSymbolSummariesByNode( + nodeIds: readonly string[], + ): Promise { + if (nodeIds.length === 0) return []; + // ORDER BY (node_id, prompt_version, content_hash) — prompt_version + // BEFORE content_hash (differs from the bulkLoad sort) so callers pick + // the newest prompt deterministically. + const sql = `SELECT node_id, content_hash, prompt_version, model_id, + summary_text, signature_summary, returns_type_summary, + structured_json, created_at + FROM symbol_summaries + WHERE node_id IN (${placeholders(nodeIds.length)}) + ORDER BY node_id ASC, prompt_version ASC, content_hash ASC`; + const rows = this.conn() + .prepare(sql) + .all(...(nodeIds as unknown as SqliteParam[])) as unknown as Record[]; + return rows.map(summaryRowFromRecord); + } + + async countSymbolSummaries(): Promise { + // MUST catch all errors and return 0 — codehub status degrades gracefully. + try { + const row = this.conn() + .prepare("SELECT COUNT(DISTINCT node_id) AS n FROM symbol_summaries") + .get() as unknown as { n: number | bigint } | undefined; + const n = row?.n; + return typeof n === "bigint" ? Number(n) : typeof n === "number" ? n : 0; + } catch { + return 0; + } + } + + private conn(): DatabaseSync { + if (!this.db) throw new Error("SqliteStore: open() not called"); + return this.db; + } +} + +// ── Row shapes + (de)serialization helpers ────────────────────────────────────── + +/** Positional params node:sqlite's StatementSync accepts. 
*/ +type SqliteParam = string | number | bigint | null | Uint8Array; + +interface NodeRow { + id: string; + kind: string; + name: string; + file_path: string | null; + start_line: number | null; + end_line: number | null; + payload: string | null; +} + +interface EdgeRow { + id: string; + src: string; + dst: string; + type: string; + confidence: number; + step: number | null; + reason: string | null; +} + +interface EmbRow { + node_id: string; + granularity: string; + chunk_index: number; + start_line: number | null; + end_line: number | null; + vector: Uint8Array; + content_hash: string; +} + +interface MetaRow { + id: number; + schema_version: string; + last_commit: string | null; + indexed_at: string; + node_count: number; + edge_count: number; + stats_json: string | null; + cache_hit_ratio: number | null; + cache_size_bytes: number | null; + last_compaction: string | null; + embedder_model_id: string | null; +} + +function rehydrateNode(row: NodeRow): GraphNode { + const base: Record = { + id: row.id, + kind: row.kind, + name: row.name, + }; + if (row.file_path != null) base["filePath"] = row.file_path; + if (row.start_line != null) base["startLine"] = row.start_line; + if (row.end_line != null) base["endLine"] = row.end_line; + // The payload round-trips the full remaining field set verbatim — including + // `keywords: []`, `languageStats: {}`, and Repo nullable `null`s — so + // canonicalJson sees the identical shape on rebuild (graphHash parity). + if (row.payload) Object.assign(base, JSON.parse(row.payload)); + return base as unknown as GraphNode; +} + +/** Convert a SQLite cochanges row back into a {@link CochangeRow}. */ +function cochangeRowFromRecord(row: Record): CochangeRow { + // last_cocommit_at is stored as a TEXT ISO string → trivial string decode. + return { + sourceFile: String(row["source_file"] ?? ""), + targetFile: String(row["target_file"] ?? ""), + cocommitCount: Number(row["cocommit_count"] ?? 
0), + totalCommitsSource: Number(row["total_commits_source"] ?? 0), + totalCommitsTarget: Number(row["total_commits_target"] ?? 0), + lastCocommitAt: String(row["last_cocommit_at"] ?? ""), + lift: Number(row["lift"] ?? 0), + }; +} + +/** Convert a SQLite symbol_summaries row back into a {@link SymbolSummaryRow}. */ +function summaryRowFromRecord(row: Record<string, unknown>): SymbolSummaryRow { + const sig = row["signature_summary"]; + const ret = row["returns_type_summary"]; + const structured = row["structured_json"]; + return { + nodeId: String(row["node_id"] ?? ""), + contentHash: String(row["content_hash"] ?? ""), + promptVersion: String(row["prompt_version"] ?? ""), + modelId: String(row["model_id"] ?? ""), + summaryText: String(row["summary_text"] ?? ""), + ...(sig !== null && sig !== undefined ? { signatureSummary: String(sig) } : {}), + ...(ret !== null && ret !== undefined ? { returnsTypeSummary: String(ret) } : {}), + ...(structured !== null && structured !== undefined + ? { structuredJson: String(structured) } + : {}), + createdAt: String(row["created_at"] ?? ""), + }; +} + +/** + * Clamp a number to a non-negative integer. Semantics match + * `clampNonNegativeIntGd` (graphdb-adapter.ts:2202-2207): `undefined` / `null` + * / negative / non-finite → `undefined` (no clause); `0` preserved; else + * `Math.floor`. + */ +function clampNonNegativeInt(v: number | null | undefined): number | undefined { + if (v === undefined || v === null) return undefined; + if (typeof v !== "number" || !Number.isFinite(v)) return undefined; + if (v < 0) return undefined; + return Math.floor(v); +} + +/** Build a `?,?,…` placeholder list of length `n`. */ +function placeholders(n: number): string { + return new Array(n).fill("?").join(","); +} + +/** Build a ` WHERE a AND b …` clause, or `""` when there are no predicates. */ +function whereClause(wheres: readonly string[]): string { + return wheres.length > 0 ? ` WHERE ${wheres.join(" AND ")}` : ""; +} + +/** + * Build a ` LIMIT n OFFSET m` clause.
`limit`/`offset` are pre-clamped to + * finite non-negative integers (no injection risk). SQLite requires LIMIT + * before OFFSET, and an OFFSET with no LIMIT needs a `LIMIT -1` sentinel. + */ +function pageClause(limit: number | undefined, offset: number | undefined): string { + let out = ""; + if (limit !== undefined) out += ` LIMIT ${limit}`; + else if (offset !== undefined) out += " LIMIT -1"; + if (offset !== undefined) out += ` OFFSET ${offset}`; + return out; +} + +/** Lex-stable JS-side `id ASC` tiebreak — the cross-adapter determinism guarantee. */ +function sortById<T extends { id: string }>(items: readonly T[]): readonly T[] { + return [...items].sort((a, b) => (a.id < b.id ? -1 : a.id > b.id ? 1 : 0)); +} + +/** Float32Array → little-endian BLOB. node:sqlite accepts Uint8Array for BLOB. */ +function f32ToBlob(v: Float32Array): Uint8Array { + return new Uint8Array(v.buffer, v.byteOffset, v.byteLength); +} + +/** BLOB → Float32Array. Copies so the view is independent of the row buffer. */ +function blobToF32(b: Uint8Array): Float32Array { + const copy = b.slice(); + return new Float32Array(copy.buffer, copy.byteOffset, copy.byteLength / 4); +} + +function cosine(a: Float32Array, b: Float32Array): number { + let dot = 0; + let na = 0; + let nb = 0; + const n = Math.min(a.length, b.length); + for (let i = 0; i < n; i++) { + const av = a[i] as number; + const bv = b[i] as number; + dot += av * bv; + na += av * av; + nb += bv * bv; + } + if (na === 0 || nb === 0) return 0; + return dot / (Math.sqrt(na) * Math.sqrt(nb)); +} diff --git a/packages/storage/src/sqlite-parity.test.ts b/packages/storage/src/sqlite-parity.test.ts new file mode 100644 index 00000000..5da54f0e --- /dev/null +++ b/packages/storage/src/sqlite-parity.test.ts @@ -0,0 +1,567 @@ +/** + * graphHash byte-identity PARITY GATE for {@link SqliteStore}.
+ * + * This is the P2 go/no-go: a `KnowledgeGraph` bulk-loaded into a SqliteStore + * over a real temp file and rebuilt via the PUBLIC {@link rebuildFromStore} + * harness (`listNodes({})` + `listEdges({})`) must produce a `graphHash` + * byte-identical to the original fixture. + * + * Fixtures exercise the sentinel surface that historically broke parity: + * - mixed node kinds: File, Function, Class, Method, Route, Dependency, + * Repo, Finding, Contributor, Interface — so kind-specific payload fields + * (severity / propertiesBag / ecosystem / languageStats / responseKeys) + * are round-tripped. + * - the sentinels: empty-`languageStats: {}`, Repo nullable `null` + * (originUrl/defaultBranch/group), `responseKeys: []` vs absent (the + * `[]`-vs-undefined canonicalJson distinction), and empty + * `propertiesBag: {}` on a Finding. + * - edges with varied step / confidence across DEFINES / CALLS / OWNED_BY / + * HAS_METHOD / HANDLES_ROUTE / DEPENDS_ON / FOUND_IN / IMPLEMENTS. + * + * STEP-ZERO CONTRACT (load-bearing — do NOT pass `step: 0` in a fixture). + * The `stepZeroSentinel` (column-encode.ts) is a cross-adapter invariant: + * `step: 0` is treated as IDENTICAL to an absent `step` at the storage + * boundary, so `listEdges` drops it on read on EVERY adapter (GraphDbStore + * drops it at listEdgesInternalGd:1694; SqliteStore at listEdges via + * stepZeroSentinel). But `graphHash` over a KnowledgeGraph DOES emit + * `"step":0` when an edge carries it explicitly (canonicalJson preserves the + * finite `0`). A fixture that passes `step: 0` therefore hashes WITH `"step":0` + * but rebuilds WITHOUT it — a guaranteed parity break on every backend, not a + * store bug. Ingestion only ever emits `step >= 1`, so canonical fixtures must + * use `step >= 1` or omit `step`. These fixtures honor that: every explicit + * `step` is >= 1, and the absent-step path is also exercised. 
+ * + * Idiom mirrors graphdb-roundtrip.test.ts: node:test, mkdtemp temp file, + * open → createSchema → assertGraphParity. Embeddings are intentionally NOT + * loaded — graphHash covers nodes + edges only. + */ + +import assert from "node:assert/strict"; +import { mkdtemp } from "node:fs/promises"; +import { tmpdir } from "node:os"; +import { join } from "node:path"; +import { test } from "node:test"; +import { + type GraphNode, + graphHash, + KnowledgeGraph, + makeNodeId, + type NodeId, + type RelationType, +} from "@opencodehub/core-types"; +import { assertGraphParity, rebuildFromStore } from "@opencodehub/storage/test-utils"; +import { SqliteStore } from "./sqlite-adapter.js"; + +async function scratchDbPath(): Promise { + const dir = await mkdtemp(join(tmpdir(), "och-sqlite-parity-")); + return join(dir, "store.sqlite"); +} + +// --------------------------------------------------------------------------- +// Fixture builders +// --------------------------------------------------------------------------- + +/** + * Small fixture — File + Function nodes with DEFINES + CALLS edges. Confidence + * varies; some CALLS carry an explicit `step >= 1` (must survive the round-trip) + * and some omit `step` entirely (the absent-step path). + */ +function buildSmallGraph(): KnowledgeGraph { + const g = new KnowledgeGraph(); + + const fileA = makeNodeId("File", "src/a.ts", "a.ts"); + const fileB = makeNodeId("File", "src/b.ts", "b.ts"); + g.addNode({ id: fileA, kind: "File", name: "a.ts", filePath: "src/a.ts" }); + g.addNode({ + id: fileB, + kind: "File", + name: "b.ts", + filePath: "src/b.ts", + contentHash: "deadbeef", + language: "typescript", + }); + + const funcs: NodeId[] = []; + for (let i = 0; i < 8; i += 1) { + const file = i % 2 === 0 ? 
"src/a.ts" : "src/b.ts"; + const id = makeNodeId("Function", file, `fn_${i}`, { parameterCount: i % 3 }); + funcs.push(id); + g.addNode({ + id, + kind: "Function", + name: `fn_${i}`, + filePath: file, + startLine: 10 + i, + endLine: 20 + i, + signature: `function fn_${i}()`, + parameterCount: i % 3, + isExported: i % 2 === 0, + }); + } + + for (let i = 0; i < funcs.length; i += 1) { + const from = i % 2 === 0 ? fileA : fileB; + g.addEdge({ from, to: funcs[i] as NodeId, type: "DEFINES", confidence: 1.0 }); + } + for (let i = 0; i + 1 < funcs.length; i += 1) { + // Mix: even hops omit `step` (absent-step path), odd hops set step:1 + // (must survive). Never an explicit step:0 — see STEP-ZERO CONTRACT. + g.addEdge({ + from: funcs[i] as NodeId, + to: funcs[i + 1] as NodeId, + type: "CALLS", + confidence: 0.9 - i * 0.05, + ...(i % 2 === 1 ? { step: 1 } : {}), + }); + } + + return g; +} + +/** + * Medium fixture — the full required NodeKind mix plus every sentinel. + * File, Function, Class, Method, Interface, Route, Dependency, Repo, + * Finding, Contributor. + * Edges: DEFINES, HAS_METHOD, CALLS, IMPLEMENTS, HANDLES_ROUTE, DEPENDS_ON, + * FOUND_IN, OWNED_BY — with varied step + confidence. + */ +function buildMediumGraph(): KnowledgeGraph { + const g = new KnowledgeGraph(); + + // ── Repo (first-class node; sentinels: languageStats:{} on one, null + // origin/branch/group on another, full on the third). 
── + const repoFull = makeNodeId("Repo", "", "repo-full"); + g.addNode({ + id: repoFull, + kind: "Repo", + name: "github.com/acme/example", + filePath: "", + originUrl: "https://github.com/acme/example.git", + repoUri: "github.com/acme/example", + defaultBranch: "main", + commitSha: "0123456789abcdef0123456789abcdef01234567", + indexTime: "2026-05-06T12:34:56Z", + group: "acme", + visibility: "internal", + indexer: "opencodehub@0.1.0", + languageStats: { go: 0.5, ts: 0.3, rs: 0.2 }, + } as unknown as GraphNode); + + const repoEmptyStats = makeNodeId("Repo", "", "repo-empty-stats"); + g.addNode({ + id: repoEmptyStats, + kind: "Repo", + name: "github.com/acme/empty", + filePath: "", + originUrl: "https://github.com/acme/empty.git", + repoUri: "github.com/acme/empty", + defaultBranch: "main", + commitSha: "aaaa0000bbbb1111cccc2222dddd3333eeee4444", + indexTime: "2026-05-06T12:34:56Z", + group: "acme", + visibility: "internal", + indexer: "opencodehub@0.1.0", + // SENTINEL: explicit empty languageStats must round-trip as {} (not absent). + languageStats: {}, + } as unknown as GraphNode); + + const repoNoRemote = makeNodeId("Repo", "", "repo-no-remote"); + g.addNode({ + id: repoNoRemote, + kind: "Repo", + name: "local:abcdef012345", + filePath: "", + // SENTINEL: explicit nulls must round-trip as null (not absent). + originUrl: null, + repoUri: "local:abcdef012345", + defaultBranch: null, + commitSha: "5555666677778888999900001111222233334444", + indexTime: "2026-05-06T12:34:56Z", + group: null, + visibility: "private", + indexer: "opencodehub@0.1.0", + languageStats: {}, + } as unknown as GraphNode); + + // ── Files + classes + interfaces + methods. 
── + const files: NodeId[] = []; + const classes: NodeId[] = []; + const methods: NodeId[] = []; + for (let i = 0; i < 5; i += 1) { + const path = `src/mod${i}/entry.ts`; + const fileId = makeNodeId("File", path, path); + files.push(fileId); + g.addNode({ + id: fileId, + kind: "File", + name: "entry.ts", + filePath: path, + contentHash: `hash-${i}`, + lineCount: 100 + i, + }); + + const clsId = makeNodeId("Class", path, `Service${i}`); + classes.push(clsId); + g.addNode({ + id: clsId, + kind: "Class", + name: `Service${i}`, + filePath: path, + startLine: 5, + endLine: 80, + isExported: true, + }); + + const ifaceId = makeNodeId("Interface", path, `IService${i}`); + g.addNode({ + id: ifaceId, + kind: "Interface", + name: `IService${i}`, + filePath: path, + isExported: true, + }); + + g.addEdge({ from: fileId, to: clsId, type: "DEFINES", confidence: 1.0 }); + g.addEdge({ from: fileId, to: ifaceId, type: "DEFINES", confidence: 1.0 }); + g.addEdge({ from: clsId, to: ifaceId, type: "IMPLEMENTS", confidence: 1.0 }); + + for (let j = 0; j < 3; j += 1) { + const mId = makeNodeId("Method", path, `Service${i}.method${j}`); + methods.push(mId); + g.addNode({ + id: mId, + kind: "Method", + name: `method${j}`, + filePath: path, + startLine: 10 + j, + endLine: 15 + j, + parameterCount: j, + signature: `method${j}()`, + owner: `Service${i}`, + }); + g.addEdge({ from: clsId, to: mId, type: "HAS_METHOD", confidence: 1.0 }); + } + } + + // Sparse CALLS graph with varied step + confidence. Never an explicit + // step:0 (see STEP-ZERO CONTRACT) — the first sweep omits `step` (absent + // path), the second sets step:2 (must survive). 
+ for (let i = 0; i + 1 < methods.length; i += 2) { + g.addEdge({ + from: methods[i] as NodeId, + to: methods[i + 1] as NodeId, + type: "CALLS", + confidence: 0.8, + reason: "synthetic fixture", + }); + } + for (let i = 2; i < methods.length; i += 3) { + g.addEdge({ + from: methods[i] as NodeId, + to: methods[(i + 5) % methods.length] as NodeId, + type: "CALLS", + confidence: 0.6, + step: 2, // must survive. + }); + } + + // ── Routes — one with responseKeys:[] (SENTINEL: [] vs absent), one with a + // populated responseKeys, one with none. Plus HANDLES_ROUTE edges. ── + const routeEmpty = makeNodeId("Route", "src/mod0/entry.ts", "GET /health"); + g.addNode({ + id: routeEmpty, + kind: "Route", + name: "GET /health", + filePath: "src/mod0/entry.ts", + url: "/health", + method: "GET", + responseKeys: [], // SENTINEL: explicit empty array must round-trip as []. + } as unknown as GraphNode); + + const routeKeys = makeNodeId("Route", "src/mod1/entry.ts", "POST /users"); + g.addNode({ + id: routeKeys, + kind: "Route", + name: "POST /users", + filePath: "src/mod1/entry.ts", + url: "/users", + method: "POST", + responseKeys: ["id", "createdAt"], + } as unknown as GraphNode); + + const routeBare = makeNodeId("Route", "src/mod2/entry.ts", "DELETE /users/:id"); + g.addNode({ + id: routeBare, + kind: "Route", + name: "DELETE /users/:id", + filePath: "src/mod2/entry.ts", + url: "/users/:id", + method: "DELETE", + } as unknown as GraphNode); + + g.addEdge({ from: methods[0] as NodeId, to: routeEmpty, type: "HANDLES_ROUTE", confidence: 0.9 }); + g.addEdge({ from: methods[3] as NodeId, to: routeKeys, type: "HANDLES_ROUTE", confidence: 0.9 }); + g.addEdge({ from: methods[6] as NodeId, to: routeBare, type: "HANDLES_ROUTE", confidence: 0.9 }); + + // ── Dependencies — varied ecosystem / license, DEPENDS_ON edges. 
── + const depNpm = makeNodeId("Dependency", "package.json", "react@18.2.0"); + g.addNode({ + id: depNpm, + kind: "Dependency", + name: "react", + filePath: "package.json", + version: "18.2.0", + ecosystem: "npm", + lockfileSource: "package-lock.json", + license: "MIT", + } as unknown as GraphNode); + + const depPypi = makeNodeId("Dependency", "pyproject.toml", "requests@2.31.0"); + g.addNode({ + id: depPypi, + kind: "Dependency", + name: "requests", + filePath: "pyproject.toml", + version: "2.31.0", + ecosystem: "pypi", + lockfileSource: "uv.lock", + // No license — exercises an absent optional on Dependency. + } as unknown as GraphNode); + + g.addEdge({ from: files[0] as NodeId, to: depNpm, type: "DEPENDS_ON", confidence: 1.0 }); + g.addEdge({ from: files[1] as NodeId, to: depPypi, type: "DEPENDS_ON", confidence: 1.0 }); + + // ── Finding — required propertiesBag (Record), optional baselineState / + // partialFingerprint. FOUND_IN edge with a reason. ── + const finding = makeNodeId("Finding", "src/mod0/entry.ts", "semgrep:logger-leak:42"); + g.addNode({ + id: finding, + kind: "Finding", + name: "logger-credential-leak", + filePath: "src/mod0/entry.ts", + startLine: 42, + endLine: 44, + ruleId: "logger-leak", + severity: "warning", + scannerId: "semgrep", + message: "Credential may leak to logs", + propertiesBag: { cwe: "CWE-532", tags: ["security"] }, + partialFingerprint: "fp-0001", + baselineState: "new", + } as unknown as GraphNode); + + const findingNoBag = makeNodeId("Finding", "src/mod1/entry.ts", "semgrep:noop:9"); + g.addNode({ + id: findingNoBag, + kind: "Finding", + name: "noop-finding", + filePath: "src/mod1/entry.ts", + startLine: 9, + endLine: 9, + ruleId: "noop", + severity: "note", + scannerId: "semgrep", + message: "Informational", + // SENTINEL: explicit empty propertiesBag {} must round-trip as {}. 
+ propertiesBag: {}, + } as unknown as GraphNode); + + g.addEdge({ + from: finding, + to: methods[0] as NodeId, + type: "FOUND_IN", + confidence: 1.0, + reason: "startLine=42;endLine=44", + }); + g.addEdge({ + from: findingNoBag, + to: methods[3] as NodeId, + type: "FOUND_IN", + confidence: 1.0, + }); + + // ── Contributor + OWNED_BY edges (varied confidence). ── + const contributor = makeNodeId("Contributor", "", "alice@example.com"); + g.addNode({ + id: contributor, + kind: "Contributor", + name: "alice", + filePath: "", + emailHash: "hashed-alice", + emailPlain: "alice@example.com", + }); + const contributorB = makeNodeId("Contributor", "", "bob@example.com"); + g.addNode({ + id: contributorB, + kind: "Contributor", + name: "bob", + filePath: "", + emailHash: "hashed-bob", + // No emailPlain — privacy default. + }); + for (let i = 0; i < files.length; i += 1) { + g.addEdge({ + from: files[i] as NodeId, + to: i % 2 === 0 ? contributor : contributorB, + type: "OWNED_BY", + confidence: 0.25 + i * 0.1, + }); + } + + return g; +} + +// --------------------------------------------------------------------------- +// Round-trip driver +// --------------------------------------------------------------------------- + +async function freshStore(): Promise { + const store = new SqliteStore(await scratchDbPath()); + await store.open(); + await store.createSchema(); + return store; +} + +// --------------------------------------------------------------------------- +// Tests +// --------------------------------------------------------------------------- + +test("graphHash parity: small fixture (File + Function, DEFINES + CALLS, step>=1 and absent-step)", async () => { + const fixture = buildSmallGraph(); + const store = await freshStore(); + try { + await assertGraphParity(fixture, { stores: [store], label: "sqlite-small" }); + } finally { + await store.close(); + } +}); + +test("graphHash parity: medium fixture (mixed kinds + sentinels)", async () => { + const fixture = 
buildMediumGraph(); + const store = await freshStore(); + try { + await assertGraphParity(fixture, { stores: [store], label: "sqlite-medium" }); + } finally { + await store.close(); + } +}); + +// Explicit per-node first-mismatch diagnosis path — surfaces WHICH node/edge +// broke parity with the canonical-JSON projection, not just a hash mismatch. +test("graphHash parity: medium fixture — first-mismatch diagnosis", async () => { + const fixture = buildMediumGraph(); + const store = await freshStore(); + try { + await store.bulkLoad(fixture); + const rebuilt = await rebuildFromStore(store); + const originalHash = graphHash(fixture); + const rebuiltHash = graphHash(rebuilt); + if (originalHash !== rebuiltHash) { + const origNodes = fixture.orderedNodes(); + const rebNodes = rebuilt.orderedNodes(); + let diag = "no node-level mismatch found (edge-level divergence)"; + const max = Math.max(origNodes.length, rebNodes.length); + for (let i = 0; i < max; i += 1) { + const a = JSON.stringify(origNodes[i] ?? null, Object.keys(origNodes[i] ?? {}).sort()); + const b = JSON.stringify(rebNodes[i] ?? null, Object.keys(rebNodes[i] ?? {}).sort()); + if (a !== b) { + diag = `first node mismatch at index ${i}:\n original: ${a}\n rebuilt: ${b}`; + break; + } + } + const origEdges = fixture.orderedEdges(); + const rebEdges = rebuilt.orderedEdges(); + let edgeDiag = "edges match"; + const emax = Math.max(origEdges.length, rebEdges.length); + for (let i = 0; i < emax; i += 1) { + const a = JSON.stringify(origEdges[i] ?? null, Object.keys(origEdges[i] ?? {}).sort()); + const b = JSON.stringify(rebEdges[i] ?? null, Object.keys(rebEdges[i] ?? 
{}).sort()); + if (a !== b) { + edgeDiag = `first edge mismatch at index ${i}:\n original: ${a}\n rebuilt: ${b}`; + break; + } + } + assert.fail( + `graphHash parity broken for medium fixture\n` + + ` original: ${originalHash}\n rebuilt: ${rebuiltHash}\n` + + ` node counts: original=${origNodes.length} rebuilt=${rebNodes.length}\n` + + ` edge counts: original=${origEdges.length} rebuilt=${rebEdges.length}\n` + + ` ${diag}\n ${edgeDiag}`, + ); + } + assert.equal(rebuiltHash, originalHash); + } finally { + await store.close(); + } +}); + +test("graphHash parity is deterministic across two independent stores", async () => { + const fixture = buildMediumGraph(); + const a = await freshStore(); + const b = await freshStore(); + try { + await assertGraphParity(fixture, { stores: [a, b], label: "sqlite-cross-store" }); + } finally { + await a.close(); + await b.close(); + } +}); + +// Belt-and-suspenders: every declared edge kind round-trips at least one row, +// so a dropped type surfaces as a parity failure rather than a silent miss. +test("graphHash parity: every declared edge kind round-trips", async () => { + const { getAllRelationTypes } = await import("./relations.js"); + const relationTypes = getAllRelationTypes(); + const g = new KnowledgeGraph(); + const nodes: NodeId[] = []; + for (let i = 0; i < relationTypes.length + 1; i += 1) { + const id = makeNodeId("Function", `src/f${i}.ts`, `fn${i}`); + nodes.push(id); + g.addNode({ id, kind: "Function", name: `fn${i}`, filePath: `src/f${i}.ts` }); + } + for (let i = 0; i < relationTypes.length; i += 1) { + g.addEdge({ + from: nodes[i] as NodeId, + to: nodes[i + 1] as NodeId, + type: relationTypes[i] as RelationType, + confidence: 0.5 + i * 0.01, + reason: `fixture-${i}`, + step: i + 1, // always >= 1 (see STEP-ZERO CONTRACT) — must survive. 
+ }); + } + const store = await freshStore(); + try { + await assertGraphParity(g, { stores: [store], label: "sqlite-all-kinds" }); + } finally { + await store.close(); + } +}); + +// STEP-ZERO CONTRACT regression. An edge built with an explicit `step: 0` must +// round-trip byte-identically: KnowledgeGraph.addEdge normalizes step:0 → absent +// at the graph boundary so graphHash and every adapter's listEdges agree. A +// non-zero step must survive. This closes the latent parity gap the single-file +// migration surfaced — the old graphdb-roundtrip test masked it by re-attaching +// step:0 in a test-local rebuild helper instead of going through the public +// rebuildFromStore harness. +test("graphHash parity: explicit step:0 edge round-trips (sentinel at the graph boundary)", async () => { + const g = new KnowledgeGraph(); + const a = makeNodeId("Function", "src/x.ts", "a"); + const b = makeNodeId("Function", "src/x.ts", "b"); + const c = makeNodeId("Function", "src/x.ts", "c"); + for (const [id, name] of [ + [a, "a"], + [b, "b"], + [c, "c"], + ] as const) { + g.addNode({ id, kind: "Function", name, filePath: "src/x.ts" }); + } + g.addEdge({ from: a, to: b, type: "CALLS" as RelationType, confidence: 1, step: 0 }); // sentinel + g.addEdge({ from: b, to: c, type: "CALLS" as RelationType, confidence: 1, step: 3 }); // survives + const store = await freshStore(); + try { + await assertGraphParity(g, { stores: [store], label: "sqlite-step-zero" }); + } finally { + await store.close(); + } +}); diff --git a/packages/storage/src/sqlite-runtime.test.ts b/packages/storage/src/sqlite-runtime.test.ts new file mode 100644 index 00000000..57940dd5 --- /dev/null +++ b/packages/storage/src/sqlite-runtime.test.ts @@ -0,0 +1,62 @@ +/** + * Tests for the SQLite runtime guard: it must swallow ONLY the SQLite + * ExperimentalWarning and pass every other warning through untouched, and + * it must be idempotent (installing twice does not double-wrap). 
+ */ + +import assert from "node:assert/strict"; +import { test } from "node:test"; + +import { installSqliteRuntimeGuard } from "./sqlite-runtime.js"; + +test("guard swallows the SQLite experimental warning, passes others through", () => { + installSqliteRuntimeGuard(); // already auto-installed on import; idempotent + + const seen: string[] = []; + const original = process.emitWarning; + // Capture what the guard delegates downstream by stubbing the *original* + // sink one level below the guard. The guard calls the bound original it + // captured at install time, so to observe delegation we drive emitWarning + // and record what is NOT swallowed. + const restore = process.emitWarning; + process.emitWarning = ((w: string | Error, ..._a: unknown[]) => { + seen.push(typeof w === "string" ? w : w.message); + }) as typeof process.emitWarning; + try { + // Re-install so the guard wraps OUR capture sink as its delegate. + (process as unknown as { [k: symbol]: unknown })[ + Symbol.for("opencodehub.sqlite-runtime.installed") + ] = undefined; + installSqliteRuntimeGuard(); + + // SQLite experimental warning → swallowed. + process.emitWarning( + "SQLite is an experimental feature and might change at any time", + "ExperimentalWarning", + ); + // Unrelated warning → passed through. + process.emitWarning("a normal deprecation", "DeprecationWarning"); + // A different experimental warning → passed through (not SQLite). 
+ process.emitWarning("Fetch is an experimental feature", "ExperimentalWarning"); + } finally { + process.emitWarning = restore; + process.emitWarning = original; + } + + assert.ok(!seen.some((m) => /sqlite/i.test(m)), "SQLite experimental warning must be swallowed"); + assert.ok(seen.includes("a normal deprecation"), "non-SQLite warning passes through"); + assert.ok( + seen.includes("Fetch is an experimental feature"), + "non-SQLite experimental warning passes through", + ); +}); + +test("importing node:sqlite under the guard produces no warning on stderr", async () => { + // Smoke: the guard is auto-installed on module import, so loading the + // binding here must not surface the warning. We can only assert it does + // not throw; stderr capture across the worker boundary is covered by the + // CLI integration path. This guards against the guard itself throwing. + await assert.doesNotReject(async () => { + await import("node:sqlite"); + }); +}); diff --git a/packages/storage/src/sqlite-runtime.ts b/packages/storage/src/sqlite-runtime.ts new file mode 100644 index 00000000..8d6d8994 --- /dev/null +++ b/packages/storage/src/sqlite-runtime.ts @@ -0,0 +1,61 @@ +/** + * SQLite runtime guard — import this BEFORE any `node:sqlite` import. + * + * On Node ≥24.15 the built-in `node:sqlite` module is enabled by default, + * but loading it emits a one-shot `ExperimentalWarning` to stderr. For the + * `codehub` CLI that is cosmetic noise; for the **stdio MCP server** stderr + * is a real channel a client may surface, so an unsolicited warning is a + * correctness wart. This module makes the dependency on `node:sqlite` + * explicit and silences *only* that one warning, leaving every other + * process warning intact. 
+ * + * Why an in-process filter rather than a `--node-flag` / shebang: it works + * no matter how the process was launched — the published CLI bin, the MCP + * server spawned by an arbitrary agent host, `node --test`, or a downstream + * library embedding `@opencodehub/storage`. There is no single launch site + * we control, so the guard travels with the code that needs it. + * + * The override is installed exactly once (idempotent) and delegates every + * non-SQLite warning to the original `process.emitWarning`. + */ + +const FLAG = Symbol.for("opencodehub.sqlite-runtime.installed"); + +interface Guarded { + [FLAG]?: true; +} + +function isSqliteExperimentalWarning(warning: string | Error, type?: string): boolean { + const text = typeof warning === "string" ? warning : warning.message; + // Node's text is: "SQLite is an experimental feature and might change at any time". + // Match on the SQLite mention scoped to the ExperimentalWarning type so we + // never swallow an unrelated experimental warning. + const isExperimental = type === "ExperimentalWarning" || /experimental/i.test(text); + return isExperimental && /sqlite/i.test(text); +} + +export function installSqliteRuntimeGuard(): void { + const proc = process as unknown as Guarded; + if (proc[FLAG]) return; + proc[FLAG] = true; + + const original = process.emitWarning.bind(process); + // Node overloads emitWarning(warning, type?, code?, ctor?) and + // emitWarning(warning, options?). We sniff the SQLite warning across both. + function patched( + warning: string | Error, + typeOrOptions?: string | { type?: string }, + ...rest: unknown[] + ): void { + const type = typeof typeOrOptions === "string" ? typeOrOptions : typeOrOptions?.type; + if (isSqliteExperimentalWarning(warning, type)) return; + // Delegate everything else untouched. 
+ (original as (...a: unknown[]) => void)(warning, typeOrOptions, ...rest); + } + process.emitWarning = patched as typeof process.emitWarning; +} + +// Side effect on import: installing the guard is the whole point of importing +// this module, so callers write `import "./sqlite-runtime.js";` ahead of the +// `node:sqlite` import and get the behavior with no call site. +installSqliteRuntimeGuard(); diff --git a/packages/storage/src/transient-checkpoint.test.ts b/packages/storage/src/transient-checkpoint.test.ts deleted file mode 100644 index 3d32bf1f..00000000 --- a/packages/storage/src/transient-checkpoint.test.ts +++ /dev/null @@ -1,141 +0,0 @@ -/** - * Tests for `isTransientCheckpointError` — the matcher that gates the - * bulk-load retry in {@link GraphDbStore.bulkLoad}. - * - * The lbug native binding can fail the WAL→checkpoint rename under load with - * an "Error renaming file .wal to .wal.checkpoint" IO exception even - * though the write is durably in the WAL. That specific failure is safe to - * retry (replace-mode bulkLoad is idempotent); everything else must rethrow. - * These tests pin the matcher to the real lbug message and guard against it - * widening to swallow unrelated failures. - */ - -import assert from "node:assert/strict"; -import { test } from "node:test"; -import { isTransientCheckpointError, retryTransientCheckpoint } from "./graphdb-adapter.js"; - -/** The canonical transient WAL→checkpoint rename error. */ -const checkpointErr = () => - new Error( - "IO exception: Error renaming file graph.lbug.wal to graph.lbug.wal.checkpoint. " + - "ErrorMessage: No such file or directory", - ); -/** Zero-delay backoff so retry tests don't sleep. */ -const noBackoff = () => Promise.resolve(); - -test("matches the real lbug WAL→checkpoint rename failure", () => { - const real = - "IO exception: Error renaming file /tmp/x/.codehub/graph.lbug.wal to " + - "/tmp/x/.codehub/graph.lbug.wal.checkpoint. 
ErrorMessage: No such file or directory"; - assert.equal(isTransientCheckpointError(new Error(real)), true); -}); - -test("matches regardless of the OS-specific errno suffix", () => { - // Linux/macOS phrase the trailing errno differently; the matcher keys on - // the stable token trio (renaming + .wal + checkpoint), not the suffix. - const variant = - "IO exception: Error renaming file graph.lbug.wal to graph.lbug.wal.checkpoint. " + - "ErrorMessage: Permission denied"; - assert.equal(isTransientCheckpointError(new Error(variant)), true); -}); - -test("accepts a non-Error thrown value (string)", () => { - const s = "Error renaming file a.wal to a.wal.checkpoint. boom"; - assert.equal(isTransientCheckpointError(s), true); -}); - -test("does NOT match an unrelated IO error", () => { - assert.equal( - isTransientCheckpointError(new Error("IO exception: disk full while writing CodeNode")), - false, - ); -}); - -test("does NOT match a generic checkpoint mention without a WAL rename", () => { - // A CHECKPOINT statement error that isn't the rename race must rethrow. 
- assert.equal( - isTransientCheckpointError(new Error("CHECKPOINT failed: transaction conflict")), - false, - ); -}); - -test("does NOT match a query/constraint error", () => { - assert.equal( - isTransientCheckpointError( - new Error("Runtime exception: primary key violation on CodeNode.id"), - ), - false, - ); -}); - -test("does NOT match undefined / null", () => { - assert.equal(isTransientCheckpointError(undefined), false); - assert.equal(isTransientCheckpointError(null), false); -}); - -// --------------------------------------------------------------------------- -// retryTransientCheckpoint — the policy that wraps bulkLoad -// --------------------------------------------------------------------------- - -test("recovers when the transient error clears before maxAttempts", async () => { - let calls = 0; - const result = await retryTransientCheckpoint( - async () => { - calls++; - if (calls < 3) throw checkpointErr(); // fail attempts 1 and 2 - return "ok"; - }, - 3, - noBackoff, - ); - assert.equal(result, "ok"); - assert.equal(calls, 3, "should have retried twice then succeeded on the 3rd attempt"); -}); - -test("succeeds on the first attempt without retrying", async () => { - let calls = 0; - const result = await retryTransientCheckpoint( - async () => { - calls++; - return 42; - }, - 3, - noBackoff, - ); - assert.equal(result, 42); - assert.equal(calls, 1, "no retry when the first attempt succeeds"); -}); - -test("rethrows the transient error after exhausting maxAttempts", async () => { - let calls = 0; - await assert.rejects( - () => - retryTransientCheckpoint( - async () => { - calls++; - throw checkpointErr(); - }, - 3, - noBackoff, - ), - /renaming/, - ); - assert.equal(calls, 3, "should attempt exactly maxAttempts times before giving up"); -}); - -test("rethrows a non-transient error immediately without retrying", async () => { - let calls = 0; - await assert.rejects( - () => - retryTransientCheckpoint( - async () => { - calls++; - throw new 
Error("primary key violation on CodeNode.id"); - }, - 3, - noBackoff, - ), - /primary key/, - ); - assert.equal(calls, 1, "a non-transient error must NOT be retried"); -}); diff --git a/pnpm-lock.yaml b/pnpm-lock.yaml index d305b7fd..05052509 100644 --- a/pnpm-lock.yaml +++ b/pnpm-lock.yaml @@ -115,18 +115,12 @@ importers: '@cyclonedx/cyclonedx-library': specifier: 10.1.0 version: 10.1.0(ajv-formats-draft2019@1.6.1(ajv@8.18.0))(ajv-formats@3.0.1(ajv@8.18.0))(ajv@8.18.0)(packageurl-js@2.0.1)(spdx-expression-parse@4.0.0) - '@duckdb/node-api': - specifier: 1.5.3-r.3 - version: 1.5.3-r.3 '@huggingface/tokenizers': specifier: 0.1.3 version: 0.1.3 '@iarna/toml': specifier: 2.2.5 version: 2.2.5 - '@ladybugdb/core': - specifier: ^0.17.1 - version: 0.17.1 '@modelcontextprotocol/sdk': specifier: 1.29.0 version: 1.29.0(zod@4.4.3) @@ -225,9 +219,9 @@ importers: specifier: 6.0.3 version: 6.0.3 optionalDependencies: - onnxruntime-node: - specifier: 1.26.0 - version: 1.26.0 + onnxruntime-web: + specifier: 1.27.0 + version: 1.27.0 packages/cobol-proleap: dependencies: @@ -301,9 +295,9 @@ importers: specifier: 6.0.3 version: 6.0.3 optionalDependencies: - onnxruntime-node: - specifier: 1.26.0 - version: 1.26.0 + onnxruntime-web: + specifier: 1.27.0 + version: 1.27.0 packages/frameworks: dependencies: @@ -569,12 +563,6 @@ importers: packages/storage: dependencies: - '@duckdb/node-api': - specifier: 1.5.3-r.3 - version: 1.5.3-r.3 - '@ladybugdb/core': - specifier: ^0.17.1 - version: 0.17.1 '@opencodehub/core-types': specifier: workspace:* version: link:../core-types @@ -1036,56 +1024,6 @@ packages: xmlbuilder2: optional: true - '@duckdb/node-api@1.5.3-r.3': - resolution: {integrity: sha512-FzuL6sevuFfEFwkgiUMRMUAj4TaVqV//L0oo2FVZ9s9oYpLpALF9qZyQv2ucclTNQZwDCkm8+e6yLMc6t8IjlA==} - - '@duckdb/node-bindings-darwin-arm64@1.5.3-r.3': - resolution: {integrity: sha512-ttD8QBesgzHu7Sc4qouuIGLM7PWedLW8GvFbnZEyMqk24mQz1HWFgaT0ivw6nDRaDPUQLB9QnAOq8MZUh1zWHQ==} - cpu: [arm64] - os: [darwin] - - 
'@duckdb/node-bindings-darwin-x64@1.5.3-r.3': - resolution: {integrity: sha512-Vp9MYtoYf6zUWHdCmHXwUcJlHq3YaaIeULWeSiPUM1hsDflLiZKXtz5i250Ulz03VsfWBjpO4wdM99sjjrYKkg==} - cpu: [x64] - os: [darwin] - - '@duckdb/node-bindings-linux-arm64-musl@1.5.3-r.3': - resolution: {integrity: sha512-IadRyx+98FEynKLXAk2MzReinFgduiDXgNd5Z8c5VKch+8FgBfqkEUYGOnBMMUPT8kuheKdLj23vpWXaCzOgoQ==} - cpu: [arm64] - os: [linux] - libc: [musl] - - '@duckdb/node-bindings-linux-arm64@1.5.3-r.3': - resolution: {integrity: sha512-3HLcrzQE83947JS51UVR7C9qnXQMltCOk4Dnhiz1CD+9u32DGLMgPTIIxclk7O+Q7EwfqzD8JV86Ud+LT1crcQ==} - cpu: [arm64] - os: [linux] - libc: [glibc] - - '@duckdb/node-bindings-linux-x64-musl@1.5.3-r.3': - resolution: {integrity: sha512-5bulS16YhftXcarki4tvCufVslntpQDLOEF6RZ+FSMOGiv5d7SDXqklmVRy4DKW3C5ekgN7S2oYzuGL/ss9BuA==} - cpu: [x64] - os: [linux] - libc: [musl] - - '@duckdb/node-bindings-linux-x64@1.5.3-r.3': - resolution: {integrity: sha512-TXndAL0ZoETq17Df6wB+SUZjLGDmOsKuDSySxB+wy6sHfpRtbDgQibyXRlajVeUkRDwSzBFC5ymy16YG0Fl4iw==} - cpu: [x64] - os: [linux] - libc: [glibc] - - '@duckdb/node-bindings-win32-arm64@1.5.3-r.3': - resolution: {integrity: sha512-55Vu13S0jUudiAGlNWJd7UvlW1iKjwWehD8s93jBCNm0AdE/EJN4nz5rQ0IuWzPWXpMjAYuKu00yE7NdtbTyug==} - cpu: [arm64] - os: [win32] - - '@duckdb/node-bindings-win32-x64@1.5.3-r.3': - resolution: {integrity: sha512-rlOc9ltWQNHuDq99Ah8XaD80nN1ucrSK5AcH/7ibSp9ogX/jswPYlRVE7ODFJAjnQNf8bVvs++Mp+wyGvuG7ag==} - cpu: [x64] - os: [win32] - - '@duckdb/node-bindings@1.5.3-r.3': - resolution: {integrity: sha512-Dphw1a9kKXZnCiWX1YCEAJsQ7WJQO2Ikgxy7m8jy0QVXqAwB9esr5NGsuEL3vMKL7velZHeZCjGOMnHZEcIsdg==} - '@emnapi/runtime@1.11.0': resolution: {integrity: sha512-55coeOFKHv1ywEcUXJtWU5f+Jr/W5tZDvZig8DLKSwUN1JpROQ4rk/SNOQiFWmaR/VKF4zuFyW1B8JduOSv6Pg==} @@ -1633,34 +1571,6 @@ packages: resolution: {integrity: sha512-h4v52yJUVpA74DdvztFRQWuPgAKE52ysC2h1u/wLqdPjHvouV12Bj2bV4h30sGjEduEWgII+ktOL3kkp3GTK6A==} engines: {node: '>= 16'} - 
'@ladybugdb/core-darwin-arm64@0.17.1': - resolution: {integrity: sha512-JG/uzmolEh3wXJ/ME1EaTH5LTDQ9Cs+Q3Czul8pW2eWbWQZghQU3jjM++7ST7Bla5BX/WITqwPqPoC+sL+slfA==} - cpu: [arm64] - os: [darwin] - - '@ladybugdb/core-darwin-x64@0.17.1': - resolution: {integrity: sha512-Enjm+/V9/jpKmtzF2PB0muVkgpFUGHEvA7r16eJWxVRA/BeO8VPmngTKy9rf/4Yc6TWexjoHRug04BbTXEmerg==} - cpu: [x64] - os: [darwin] - - '@ladybugdb/core-linux-arm64@0.17.1': - resolution: {integrity: sha512-P+xM9o4I3JAQtXpX19ZuLj9EeO2gppa+IdmAqhpI8tuhyA3/a85Eaxby1fXOjsbrnOAEyFJczUdyoDkhCPSyiw==} - cpu: [arm64] - os: [linux] - - '@ladybugdb/core-linux-x64@0.17.1': - resolution: {integrity: sha512-N2ujE0CrsToBpVBpou1iWwEkK7CgVxucnUNxteySrnDccZwICXFP5BlcFpKE0qq3Eqmqszh4ptR4GuSi6rKPGw==} - cpu: [x64] - os: [linux] - - '@ladybugdb/core-win32-x64@0.17.1': - resolution: {integrity: sha512-9i3xNfFAMqFRuQG3F1hOCWYGna6eTg8HJ/XYhWVDGkeFJNUV3IdneEiYttF5B2qAtQYUd4sAikScsImrMRw+6g==} - cpu: [x64] - os: [win32] - - '@ladybugdb/core@0.17.1': - resolution: {integrity: sha512-K1bHnQrRy3bxkyrFHlxGqKUyIUS1LsRXKOSt14XGY/msBZHaDat/uBrlHiWpM4/24OtfOq/qwTqcTCXannnEjw==} - '@mdx-js/mdx@3.1.1': resolution: {integrity: sha512-f6ZO2ifpwAQIpzGWaBQT2TXxPv6z3RBzQKpVftEWN78Vl/YweF1uwussDx8ECAXVtr3Rs89fKyG9YlzUs9DyGQ==} @@ -1912,6 +1822,33 @@ packages: resolution: {integrity: sha512-3MYHYm8epnciApn6w5Fzx6sepawmsNU7l6lvIq+ER22/DPSrr83YMhU/EQWnf4lORn2YyiXFj0FJSyJzEtIGmw==} engines: {node: '>=14.6'} + '@protobufjs/aspromise@1.1.2': + resolution: {integrity: sha512-j+gKExEuLmKwvz3OgROXtrJ2UG2x8Ch2YZUxahh+s1F2HZ+wAceUNLkvy6zKCPVRkU++ZWQrdxsUeQXmcg4uoQ==} + + '@protobufjs/base64@1.1.2': + resolution: {integrity: sha512-AZkcAA5vnN/v4PDqKyMR5lx7hZttPDgClv83E//FMNhR2TMcLUhfRUBHCmSl0oi9zMgDDqRUJkSxO3wm85+XLg==} + + '@protobufjs/codegen@2.0.5': + resolution: {integrity: sha512-zgXFLzW3Ap33e6d0Wlj4MGIm6Ce8O89n/apUaGNB/jx+hw+ruWEp7EwGUshdLKVRCxZW12fp9r40E1mQrf/34g==} + + '@protobufjs/eventemitter@1.1.1': + resolution: {integrity: 
sha512-vW1GmwMZNnL+gMRaovlh9yZX74kc+TTU3FObkkurpMaRtBfLP3ldjS9KQWlwZgraRE0+dheEEoAxdzcJQ8eXZg==} + + '@protobufjs/fetch@1.1.1': + resolution: {integrity: sha512-GpptLrs57adMSuHi3VNj0mAF8dwh36LMaYF6XyJ6JMWlVsc+t42tm1HSEDmOs3A8fC9yyeisgLhsTVQokOZ0zw==} + + '@protobufjs/float@1.0.2': + resolution: {integrity: sha512-Ddb+kVXlXst9d+R9PfTIxh1EdNkgoRe5tOX6t01f1lYWOvJnSPDBlG241QLzcyPdoNTsblLUdujGSE4RzrTZGQ==} + + '@protobufjs/path@1.1.2': + resolution: {integrity: sha512-6JOcJ5Tm08dOHAbdR3GrvP+yUUfkjG5ePsHYczMFLq3ZmMkAD98cDgcT2iA1lJ9NVwFd4tH/iSSoe44YWkltEA==} + + '@protobufjs/pool@1.1.0': + resolution: {integrity: sha512-0kELaGSIDBKvcgS4zkjz1PeddatrjYcmMWOlAuAPwAeccUrPHdUqo/J6LiymHHEiJT5NrF1UVwxY14f+fy4WQw==} + + '@protobufjs/utf8@1.1.1': + resolution: {integrity: sha512-oOAWABowe8EAbMyWKM0tYDKi8Yaox52D+HWZhAIJqQXbqe0xI/GV7FhLWqlEKreMkfDjshR5FKgi3mnle0h6Eg==} + '@rollup/pluginutils@5.4.0': resolution: {integrity: sha512-MfPp06CjRLfXQ3wY0R8vJDYBy/MvVcc9OulEfR0B8Iv9ko+GCNaRZ+EpJYFl27LhKsZK0o420sYCRHCjfCgeUg==} engines: {node: '>=14.0.0'} @@ -2318,9 +2255,6 @@ packages: resolution: {integrity: sha512-k+AtsrqmS41Sd5qjkZlHcmvoSQIvBOonRj4jpgp0KNFM6aqvMGpdSuPUqrUcg8ENTKjUbfaUVszgQwq3bCOvwA==} hasBin: true - '@swc/helpers@0.5.23': - resolution: {integrity: sha512-5lSsMOTXURePglDfvuAQUqkGek9Hg2kksOYay2m0+XR++b2NWYL/4sWyuvVBIs8oKnJaxkdi9whaL/sqN13afw==} - '@szmarczak/http-timer@4.0.6': resolution: {integrity: sha512-4BAffykYOgO+5nzBWYwE3W90sBgLJoUPRWWcL8wlyiM8IB8ipJz3UMJ9KXQd1RKQXpKp8Tutn80HZtWsu2u76w==} engines: {node: '>=10'} @@ -2354,12 +2288,6 @@ packages: '@types/cacheable-request@6.0.3': resolution: {integrity: sha512-IQ3EbTzGxIigb1I3qPZc1rWJnH0BmSKv5QYTalEwweFvyBDLSAe24zP0le/hyi7ecGfZVlIVAg4BZqb8WBwKqw==} - '@types/command-line-args@5.2.3': - resolution: {integrity: sha512-uv0aG6R0Y8WHZLTamZwtfsDLVRnOa+n+n5rEvFWL5Na5gZ8V2Teab/duDPFzIIIhs9qizDpcavCusCLJZu62Kw==} - - '@types/command-line-usage@5.0.4': - resolution: {integrity: 
sha512-BwR5KP3Es/CSht0xqBcUXS3qCAUVXwpRKsV2+arxeb65atasuXG9LykC9Ab10Cw3s2raH92ZqOeILaQbsB2ACg==} - '@types/d3-array@3.2.2': resolution: {integrity: sha512-hOLWVbm7uRza0BYXpIIW5pxfrKe0W+D5lrFiAEYR+pb6w3N2SwSMaJbXdUfSEv+dT4MfHBLtn5js0LAWaO6otw==} @@ -2603,10 +2531,6 @@ packages: engines: {node: '>=0.4.0'} hasBin: true - adm-zip@0.5.17: - resolution: {integrity: sha512-+Ut8d9LLqwEvHHJl1+PIHqoyDxFgVN847JTVM3Izi3xHDWPE4UtzzXysMZQs64DMcrJfBeS/uoEP4AD3HQHnQQ==} - engines: {node: '>=12.0'} - agent-base@7.1.4: resolution: {integrity: sha512-MnA+YT8fwfJPgBx3m60MNqakm30XOkyIoH1y6huTQvC0PwZG7ki8NacLBcrPbNoo8vEZy7Jpuk7+jMO+CUovTQ==} engines: {node: '>= 14'} @@ -2680,10 +2604,6 @@ packages: anynum@1.0.1: resolution: {integrity: sha512-N6//FLET/tXYNM/F6ABca1oH6fWB+KlTt909Le28WMDBk8oaT4vY17DCrwg2MvmuqUKt3Ni4N5dGJ/EoBgcO6A==} - apache-arrow@21.1.0: - resolution: {integrity: sha512-kQrYLxhC+NTVVZ4CCzGF6L/uPVOzJmD1T3XgbiUnP7oTeVFOFgEUu6IKNwCDkpFoBVqDKQivlX4RUFqqnWFlEA==} - hasBin: true - arg@4.1.3: resolution: {integrity: sha512-58S9QDqG0Xx27YwPSt9fJxivjYl432YCwfDMfZ+71RAqUrZef7LrKQZ3LHLOwCS4FLNBplP533Zx895SeOCHvA==} @@ -2697,10 +2617,6 @@ packages: resolution: {integrity: sha512-COROpnaoap1E2F000S62r6A60uHZnmlvomhfyT2DlTcrY1OrBKn2UhH7qn5wTC9zMvD0AY7csdPSNwKP+7WiQw==} engines: {node: '>= 0.4'} - array-back@6.2.3: - resolution: {integrity: sha512-SGDvmg6QTYiTxCBkYVmThcoa67uLl35pyzRHdpCGBOcqFy6BtwnphoFPk7LhJshD+Yk1Kt35WGWeZPTgwR4Fhw==} - engines: {node: '>=12.17'} - array-find-index@1.0.2: resolution: {integrity: sha512-M1HQyIXcBGtVywBt8WVdim+lrNaK7VHp99Qt5pSNziXznKHViIBbXWtfRTpEFpF/c4FdfxNAsCCwPp5phBYJtw==} engines: {node: '>=0.10.0'} @@ -2842,10 +2758,6 @@ packages: ccount@2.0.1: resolution: {integrity: sha512-eyrF0jiFpY+3drT6383f1qhkbGsLSifNAjA61IUjZjmLCWjItY6LB9ft9YhoDgwfmclB2zhu51Lc7+95b8NRAg==} - chalk-template@0.4.0: - resolution: {integrity: sha512-/ghrgmhfY8RaSdeo43hNXxpoHAtxdbskUHjPpfqUWGttFgycUhYPGx3YZBCnUCvOa7Doivn1IZec3DEGFoMgLg==} - engines: {node: '>=12'} - chalk@2.4.2: 
resolution: {integrity: sha512-Mti+f9lpJNcwF4tWV8/OrTTtF1gZi+f8FqlyAdouralcFWFQWF2+NgCHShjkCb+IFBLq9buZwE1xckQU4peSuQ==} engines: {node: '>=4'} @@ -2922,10 +2834,6 @@ packages: peerDependencies: typanion: '*' - cliui@8.0.1: - resolution: {integrity: sha512-BSeNnyus75C4//NQ9gQt1/csTXyo/8Sb+afLAkzAptFuMsod9HFokGNudZpi/oQV73hnVK+sR+5PVRMd+Dr7YQ==} - engines: {node: '>=12'} - cliui@9.0.1: resolution: {integrity: sha512-k7ndgKhwoQveBL+/1tqGJYNz097I7WOvwbmmU2AR5+magtbjPWQTS1C5vzGkBC8Ym8UWRzfKUzUUqFLypY4Q+w==} engines: {node: '>=20'} @@ -2941,11 +2849,6 @@ packages: resolution: {integrity: sha512-eYm0QWBtUrBWZWG0d386OGAw16Z995PiOVo2B7bjWSbHedGl5e0ZWaq65kOGgUSNesEIDkB9ISbTg/JK9dhCZA==} engines: {node: '>=6'} - cmake-js@8.0.0: - resolution: {integrity: sha512-YbUP88RDwCvoQkZhRtGURYm9RIpWdtvZuhT87fKNoLjk8kIFIFeARpKfuZQGdwfH99GZpUmqSfcDrK62X7lTgg==} - engines: {node: ^20.17.0 || >=22.9.0} - hasBin: true - cmd-shim@8.0.0: resolution: {integrity: sha512-Jk/BK6NCapZ58BKUxlSI+ouKRbjH1NLZCgJkYoab+vEHUY3f6OzpNBN9u7HFSv9J6TRDGs4PLOHezoKGaFRSCA==} engines: {node: ^20.17.0 || >=22.9.0} @@ -2975,19 +2878,6 @@ packages: command-exists@1.2.9: resolution: {integrity: sha512-LTQ/SGc+s0Xc0Fu5WaKnR0YiygZkm9eKFvyS+fRsU7/ZWFF8ykFM6Pc9aCVf1+xasOOZpO3BAVgVrKvsqKHV7w==} - command-line-args@6.0.2: - resolution: {integrity: sha512-AIjYVxrV9X752LmPDLbVYv8aMCuHPSLZJXEo2qo/xJfv+NYhaZ4sMSF01rM+gHPaMgvPM0l5D/F+Qx+i2WfSmQ==} - engines: {node: '>=12.20'} - peerDependencies: - '@75lb/nature': latest - peerDependenciesMeta: - '@75lb/nature': - optional: true - - command-line-usage@7.0.4: - resolution: {integrity: sha512-85UdvzTNx/+s5CkSgBm/0hzP80RFHAa7PsfeADE5ezZF3uHz3/Tqj9gIKGT9PTtpycc3Ua64T0oVulGfKxzfqg==} - engines: {node: '>=12.20.0'} - commander@11.1.0: resolution: {integrity: sha512-yPVavfyCcRhmorC7rWlkHn15b4wDVgVmBA7kV4QVBsF7kv/9TKJAbAXVTxvTnwP8HHKjRCJDClKbciiYS7p0DQ==} engines: {node: '>=16'} @@ -3339,10 +3229,6 @@ packages: dedent@0.7.0: resolution: {integrity: 
sha512-Q6fKUPqnAHAyhiUgFU7BUzLiv0kd8saH9al7tnu5Q/okj6dnupxyTgFIBjVzJATdfIAm9NAsvXNzjaKa+bxVyA==} - deep-extend@0.6.0: - resolution: {integrity: sha512-LOHxIOaPYdHlJRtCQfDIVZtfw/ufM8+rVj649RIHzcm/vGwQRXFt6OPqIFWsm2XEMrNIEtWR64sY1LEKD2vAOA==} - engines: {node: '>=4.0.0'} - defaults@1.0.4: resolution: {integrity: sha512-eFuaLoy/Rxalv2kr+lqMlUnrDWV+3j4pljOIJgLIhI058IQfWJ7vXhyEIHu+HtC738klGALYxOKDO0bQP3tg8A==} @@ -3350,14 +3236,6 @@ packages: resolution: {integrity: sha512-4tvttepXG1VaYGrRibk5EwJd1t4udunSOVMdLSAL6mId1ix438oPwPZMALY41FCijukO1L0twNcGsdzS7dHgDg==} engines: {node: '>=10'} - define-data-property@1.1.4: - resolution: {integrity: sha512-rBMvIzlpA8v6E+SJZoo++HAYqsLrkg7MSfIinMPFhmkorw7X+dOXVJQs+QT69zGkzMyfDnIMN2Wid1+NbL3T+A==} - engines: {node: '>= 0.4'} - - define-properties@1.2.1: - resolution: {integrity: sha512-8QmQKqEASLd5nx0U1B1okLElbUuuttJ/AnYmRXbbbGDWh6uS208EjD4Xqq/I9wK7u0v6O08XhTWnt5XtEbR6Dg==} - engines: {node: '>= 0.4'} - defu@6.1.7: resolution: {integrity: sha512-7z22QmUWiQ/2d0KkdYmANbRUVABpZ9SNYyH5vx6PZ+nE5bcC0l7uFvEfHlyld/HcGBFTL536ClDt3DEcSlEJAQ==} @@ -3533,10 +3411,6 @@ packages: resolution: {integrity: sha512-vbRorB5FUQWvla16U8R/qgaFIya2qGzwDrNmCZuYKrbdSUMG6I1ZCGQRefkRVhuOkIGVne7BQ35DSfo1qvJqFg==} engines: {node: '>=0.8.0'} - escape-string-regexp@4.0.0: - resolution: {integrity: sha512-TtpcNJ3XAzx3Gq8sWRzJaVajRs0uVxA2YAkdb1jm2YkPz4G6egUFAyA3n5vtEIZefPk5Wa4UXbKuS5fKkJWdgA==} - engines: {node: '>=10'} - escape-string-regexp@5.0.0: resolution: {integrity: sha512-/veY75JbMK4j1yjvuUxuVsiS/hr/4iHs9FTT6cgTexxdE0Ly/glccBAkloH/DofkjRbZU3bnoj38mOmhkZ0lHw==} engines: {node: '>=12'} @@ -3667,15 +3541,6 @@ packages: find-node-modules@2.1.3: resolution: {integrity: sha512-UC2I2+nx1ZuOBclWVNdcnbDR5dlrOdVb7xNjmT/lHE+LsgztWks3dG7boJ37yTS/venXw84B/mAW9uHVoC5QRg==} - find-replace@5.0.2: - resolution: {integrity: sha512-Y45BAiE3mz2QsrN2fb5QEtO4qb44NcS7en/0y9PEVsg351HsLeVclP8QPMH79Le9sH3rs5RSwJu99W0WPZO43Q==} - engines: {node: '>=14'} - peerDependencies: - 
'@75lb/nature': latest - peerDependenciesMeta: - '@75lb/nature': - optional: true - find-root@1.1.0: resolution: {integrity: sha512-NKfW6bec6GfKc0SGx1e07QZY9PE99u0Bft/0rzSD5k3sO/vwkVUpDUKVm5Gpp5Ue3YfShPFTX2070tDs5kB9Ng==} @@ -3708,10 +3573,6 @@ packages: resolution: {integrity: sha512-Rx/WycZ60HOaqLKAi6cHRKKI7zxWbJ31MhntmtwMoaTeF7XFH9hhBp8vITaMidfljRQ6eYWCKkaTK+ykVJHP2A==} engines: {node: '>= 0.8'} - fs-extra@11.3.5: - resolution: {integrity: sha512-eKpRKAovdpZtR1WopLHxlBWvAgPny3c4gX1G5Jhwmmw4XJj0ifSD5qB5TOo8hmA0wlRKDAOAhEE1yVPgs6Fgcg==} - engines: {node: '>=14.14'} - fs-extra@9.1.0: resolution: {integrity: sha512-hcg3ZmepS30/7BSFqRvoo3DOMQu7IjqxO5nCDt+zM9XWjb33Wg7ziNT+Qvqbuc3+gWpzO02JubVyk2G4Zvo1OQ==} engines: {node: '>=10'} @@ -3780,10 +3641,6 @@ packages: resolution: {integrity: sha512-nFR0zLpU2YCaRxwoCJvL6UvCH2JFyFVIvwTLsIf21AuHlMskA1hhTdk+LlYJtOlYt9v6dvszD2BGRqBL+iQK9Q==} deprecated: Old versions of glob are not supported, and contain widely publicized security vulnerabilities, which have been fixed in the current version. Please update. 
Support for old versions may be purchased (at exorbitant rates) by contacting i@izs.me - global-agent@4.1.3: - resolution: {integrity: sha512-KUJEViiuFT3I97t+GYMikLPJS2Lfo/S2F+DQuBWzuzaMPnvt5yyZePzArx36fBzpGTxZjIpDbXLeySLgh+k76g==} - engines: {node: '>=10.0'} - global-directory@5.0.0: resolution: {integrity: sha512-1pgFdhK3J2LeM+dVf2Pd424yHx2ou338lC0ErNP2hPx4j8eW1Sp0XqSjNxtk6Tc4Kr5wlWtSvz8cn2yb7/SG/w==} engines: {node: '>=20'} @@ -3796,10 +3653,6 @@ packages: resolution: {integrity: sha512-5lsx1NUDHtSjfg0eHlmYvZKv8/nVqX4ckFbM+FrGcQ+04KWcWFo9P5MxPZYSzUvyzmdTbI7Eix8Q4IbELDqzKg==} engines: {node: '>=0.10.0'} - globalthis@1.0.4: - resolution: {integrity: sha512-DpLKbNU4WylpxJykQujfCcwYWiV/Jhm50Goo0wrVILAv5jOr9d+H+UR3PhSCD2rCCEIg0uc+G+muBTwD54JhDQ==} - engines: {node: '>= 0.4'} - google-protobuf@3.21.4: resolution: {integrity: sha512-MnG7N936zcKTco4Jd2PX2U96Kf9PxygAPKBug+74LHzmHXmceN16MmRcdgZv+DGef/S9YvQAfRsNCn4cjf9yyQ==} @@ -3838,6 +3691,9 @@ packages: peerDependencies: graphology-types: '>=0.24.0' + guid-typescript@1.0.9: + resolution: {integrity: sha512-Y8T4vYhEfwJOTbouREvG+3XDsjr8E3kIr7uf+JZ0BYloFsttiHU0WfvANVsR7TxNUJa/WpCnw/Ino/p+DeBhBQ==} + h3@1.15.11: resolution: {integrity: sha512-L3THSe2MPeBwgIZVSH5zLdBBU90TOxarvhK9d04IDY2AmVS8j2Jz2LIWtwsGOU3lu2I5jCN7FNvVfY2+XyF+mg==} @@ -3856,9 +3712,6 @@ packages: resolution: {integrity: sha512-CsNUt5x9LUdx6hnk/E2SZLsDyvfqANZSUq4+D3D8RzDJ2M+HDTIkF60ibS1vHaK55vzgiZw1bEPFG9yH7l33wA==} engines: {node: '>=12'} - has-property-descriptors@1.0.2: - resolution: {integrity: sha512-55JNKuIW+vq4Ke1BjOTjM2YctQIvCT7GFzHwmfZPGo5wnrgkid0YQtnAleFSqumZm4az3n2BS+erby5ipJdgrg==} - has-symbols@1.1.0: resolution: {integrity: sha512-1cDNdwJ2Jaohmb3sg4OmKaMBwuC48sYni5HUw2DvsC8LjGTLK9h+eb1X6RyuOHe4hT0ULCW68iomhjUoKUqlPQ==} engines: {node: '>= 0.4'} @@ -4169,10 +4022,6 @@ packages: resolution: {integrity: sha512-ePWsvanv0DWuDRsW8dnt+R4jQ31SCRCQ7hhNcPXZPsoBZiemuZNYGf7adZdqX2D86j6rvKp3RpCxVTSb8WQlOw==} hasBin: true - json-bignum@0.0.3: - resolution: 
{integrity: sha512-2WHyXj3OfHSgNyuzDbSxI1w2jgw5gkWSWhS7Qg4bWXx1nLk3jnbwfUeS0PSba3IzpTUWdHxBieELUzXRjQB2zg==} - engines: {node: '>=0.8'} - json-buffer@3.0.1: resolution: {integrity: sha512-4bV5BfR2mqfQTJm+V5tPPdf+ZpuhiIvTuAB5g8kcrXOZpTT/QwwVRWBywX1ozr6lEuPdbHxwaJlm9G6mI2sfSQ==} @@ -4305,9 +4154,6 @@ packages: lodash-es@4.18.1: resolution: {integrity: sha512-J8xewKD/Gk22OZbhpOVSwcs60zhd95ESDwezOFuA3/099925PdHJ7OFHNTGtajL3AlZkykD32HykiMo+BIBI8A==} - lodash.camelcase@4.3.0: - resolution: {integrity: sha512-TwuEnCnxbc3rAvhf/LbG7tJUDzhqXyFnv3dtzLOPgCG/hODL7WFnsbwktkD7yUV0RrreP/l1PALq/YSg6VvjlA==} - lodash.clone@4.5.0: resolution: {integrity: sha512-GhrVeweiTD6uTmmn5hV/lzgCQhccwReIVRLHp7LT4SopOjqEZ5BbX8b5WWEtAKasjmy8hR7ZPwsYlxRCku5odg==} deprecated: This package is deprecated. Use structuredClone instead. @@ -4385,6 +4231,9 @@ packages: resolution: {integrity: sha512-9ie8ItPR6tjY5uYJh8K/Zrv/RMZ5VOlOWvtZdEHYSTFKZfIBPQa9tOAEeAWhd+AnIneLJ22w5fjOYtoutpWq5w==} engines: {node: '>=18'} + long@5.3.2: + resolution: {integrity: sha512-mNAgZ1GmyNhD7AuqnTG3/VQ26o760+ZYBPKjPvugO8+nLbYfX6TVpJPseBvopbdY+qpZ/lKUnmEc1LeZYS3QAA==} + longest-streak@3.1.0: resolution: {integrity: sha512-9Ri+o0JYgehTaVBBDoMqIl8GXtbWg711O3srftcHhZ0dqnETqLaoIK0x17fUw9rFSlK/0NlsKe0Ahhyl5pXE2g==} @@ -4429,10 +4278,6 @@ packages: engines: {node: '>= 20'} hasBin: true - matcher@4.0.0: - resolution: {integrity: sha512-S6x5wmcDmsDRRU/c2dkccDwQPXoFczc5+HpQ2lON8pnvHlnvHAHj5WlLVvw6n6vNyHuVugYrFohYxbS+pvFpKQ==} - engines: {node: '>=10'} - math-intrinsics@1.1.0: resolution: {integrity: sha512-/IXtbwEk5HTPyEwyKX6hGkYXxM9nbj64B+ilVJnC/R6B0pH5G4V3b0pVbL7DBj4tkhBAppbQUlf6F6Xl9LHu1g==} engines: {node: '>= 0.4'} @@ -4764,12 +4609,6 @@ packages: nlcst-to-string@4.0.0: resolution: {integrity: sha512-YKLBCcUYKAg0FNlOBT6aI91qFmSiFKiluk655WzPF+DDMA02qIyy8uiRqI8QXtcFpEvll12LpL5MXqEmAZ+dcA==} - node-addon-api@6.1.0: - resolution: {integrity: 
sha512-+eawOlIgy680F0kBzPUNFhMZGtJ1YmqM6l4+Crf4IkImjYrO/mqPwRMh352g23uIaQKFItcQ64I7KMaJxHgAVA==} - - node-api-headers@1.9.0: - resolution: {integrity: sha512-2oNILP4jXwRB4ywnYKjVk1YyJ96n2D4EOVJO6S3oYZ5PtbJrw3Yt9TpAuX3nBLMuzn74rnfGQrv13pS9vC+YiA==} - node-fetch-native@1.6.7: resolution: {integrity: sha512-g9yhqoedzIUm0nTnTqAQvueMPVOuIY16bqgAJJC8XOOubYFNwz6IER9qs0Gq2Xd0+CecCKFjtdDTMA4u4xG06Q==} @@ -4842,10 +4681,6 @@ packages: resolution: {integrity: sha512-W67iLl4J2EXEGTbfeHCffrjDfitvLANg0UlX3wFUUSTx92KXRFegMHUVgSqE+wvhAbi4WqjGg9czysTV2Epbew==} engines: {node: '>= 0.4'} - object-keys@1.1.1: - resolution: {integrity: sha512-NuAESUOUMrlIXOfHKzD6bpPu3tYt3xvjNdRIQ+FeT0lNb4K8WR70CaDxhuNguS2XG+GjkyMwOzsN5ZktImfhLA==} - engines: {node: '>= 0.4'} - obliterator@2.0.5: resolution: {integrity: sha512-42CPE9AhahZRsMNslczq0ctAEtqk8Eka26QofnqC346BZdHDySk3LWka23LI7ULIw11NmltpiLagIq8gBozxTw==} @@ -4880,12 +4715,11 @@ packages: oniguruma-to-es@4.3.6: resolution: {integrity: sha512-csuQ9x3Yr0cEIs/Zgx/OEt9iBw9vqIunAPQkx19R/fiMq2oGVTgcMqO/V3Ybqefr1TBvosI6jU539ksaBULJyA==} - onnxruntime-common@1.26.0: - resolution: {integrity: sha512-qVyMR4lcWgbkc4getFV+GQijsTnbg/siteoqcDwa3sI/LxbrMSNw4ePyvCq/ymdQaRomCA7YuWmhzsswxvymdw==} + onnxruntime-common@1.27.0: + resolution: {integrity: sha512-3KxL5wIVqa8Ex08jxSzncm9CMgw8CjOFyOQ7SxvG9o0cVLlhTNKXyIQuTbtX4tGPJEf73OER2xrjt4HJSBL4ow==} - onnxruntime-node@1.26.0: - resolution: {integrity: sha512-OHl6PiOEOqxaLHL0N9eFrbzS7IGmu3BtJNH3RTEnRAheCIkfc3gjcjl4sGcjp9C22ZC9YTquDOxSdT/stBQ6BQ==} - os: [win32, darwin, linux] + onnxruntime-web@1.27.0: + resolution: {integrity: sha512-ogDLsqIozHZwifPuN37OproAo0byX6t43/bP8GzeZWBWD6MOGExswFAx3up4NS/vvWBOg2u2PXomDt3rMmdQSg==} openapi-types@12.1.3: resolution: {integrity: sha512-N4YtSYJqghVu4iek2ZUvcN/0aqH1kRDuNqzcycDxhOUpg7GdvLa2F3DgS6yBNhInhv2r/6I0Flkn7CqL8+nIcw==} @@ -5031,6 +4865,9 @@ packages: pkg-types@1.3.1: resolution: {integrity: 
sha512-/Jm5M4RvtBFVkKWRu2BLUTNP8/M2a+UwuAX+ae4770q1qVGtfjG+WTCupoZixokjmHiry8uI+dlY8KXYV5HVVQ==} + platform@1.3.6: + resolution: {integrity: sha512-fnWVljUchTro6RiCFvCXBbNhJc2NijN7oIQxbwsyL0buWJPG85v81ehlHI9fXrJsMNgTofEoWIQeClKpgxFLrg==} + playwright-core@1.61.0: resolution: {integrity: sha512-caX7TrY3Ml6egyDX0WUcTHDxodl/b51y5wJOdCEA36QviK/s2g081hvmGs8eaE3DWb6NYZQ6BjO/QkNRPenoPA==} engines: {node: '>=18'} @@ -5108,6 +4945,10 @@ packages: property-information@7.2.0: resolution: {integrity: sha512-IAtzIB6sUiWaJYrX9smp3V46pBGbBeLFRGdh25kg1334VcBlD8HzhPeNIWQH9zhGmo2itIe25EHt9dQP7G5hmg==} + protobufjs@7.6.4: + resolution: {integrity: sha512-RJJPTTpvFfHcWLkIa2JFWK4XvtSzS0yEWDmunqHXli1h3JlkbcQZXDZdcWxv+JK3Xsl5/UFDPZ0iGm7DAengYw==} + engines: {node: '>=12.0.0'} + proxy-addr@2.0.7: resolution: {integrity: sha512-llQsMLSUDUPT44jdrU/O37qlnifitDP+ZwrmmZcoSKyLKvtZxpyV0n2/bD/N4tBAAZ/gJEdZU7KMraoK1+XYAg==} engines: {node: '>= 0.10'} @@ -5148,10 +4989,6 @@ packages: resolution: {integrity: sha512-K5zQjDllxWkf7Z5xJdV0/B0WTNqx6vxG70zJE4N0kBs4LovmEYWJzQGxC9bS9RAKu3bgM40lrd5zoLJ12MQ5BA==} engines: {node: '>= 0.10'} - rc@1.2.8: - resolution: {integrity: sha512-y3bGgqKj3QBdxLbLkomlohkvsA8gdAiUQlSBJnBhfn+BPxg4bc62d8TcBW15wavDfgexCgccckhcZvywyQYPOw==} - hasBin: true - read-cmd-shim@6.0.0: resolution: {integrity: sha512-1zM5HuOfagXCBWMN83fuFI/x+T/UhZ7k+KIzhrHXcQoeX5+7gmaDYjELQHmmzIodumBHeByBJT4QYS7ufAgs7A==} engines: {node: ^20.17.0 || >=22.9.0} @@ -5252,10 +5089,6 @@ packages: remark-stringify@11.0.0: resolution: {integrity: sha512-1OSmLd3awB/t8qdoEOMazZkNsfVTeY4fTsgzcQFdXNq8ToTN4ZGwrMnlda4K6smTFKD+GRV6O48i6Z4iKgPPpw==} - require-directory@2.1.1: - resolution: {integrity: sha512-fGxEI7+wsG9xrvdjsrlmL22OMTTiHRwAMroiEeMgq8gzoLC/PQr7RsRDSTLUg/bZAZtF+TVIkHc6/4RIKrui+Q==} - engines: {node: '>=0.10.0'} - require-from-string@2.0.2: resolution: {integrity: sha512-Xf0nWe6RseziFMu+Ap9biiUbmplq6S9/p+7w7YXP/JBHhrUDDUhwa+vANyubuqfZWTveU//DYVGsDG7RKL/vEw==} engines: {node: '>=0.10.0'} @@ -5375,10 
+5208,6 @@ packages: resolution: {integrity: sha512-1gnZf7DFcoIcajTjTwjwuDjzuz4PPcY2StKPlsGAQ1+YH20IRVrBaXSWmdjowTJ6u8Rc01PoYOGHXfP1mYcZNQ==} engines: {node: '>= 18'} - serialize-error@8.1.0: - resolution: {integrity: sha512-3NnuWfM6vBYoy5gZFvHiYsVbafvI9vZv/+jlIigFn4oP4zjNPK3LhcY0xSCgeb1a5L8jO71Mit9LlNoi2UfDDQ==} - engines: {node: '>=10'} - serve-static@2.2.1: resolution: {integrity: sha512-xRXBn0pPqQTVQiC8wyQrKs2MOlX24zQ0POGaj0kultvoOCstBQM5yvOhAVSUwOMjQtTvsPWoNCHfPGwaaQJhTw==} engines: {node: '>= 18'} @@ -5573,10 +5402,6 @@ packages: resolution: {integrity: sha512-3xurFv5tEgii33Zi8Jtp55wEIILR9eh34FAW00PZf+JnSsTmV/ioewSgQl97JHvgjoRGwPShsWm+IdrxB35d0w==} engines: {node: '>=8'} - strip-json-comments@2.0.1: - resolution: {integrity: sha512-4gB8na07fecVVkOI6Rs4e7T6NOTki5EmL7TUduTs6bu3EdnSycntVJ4re8kgZA+wx9IueI2Y11bfbgwtzuE0KQ==} - engines: {node: '>=0.10.0'} - strip-json-comments@3.1.1: resolution: {integrity: sha512-6fPc+R4ihwqP6N/aIv2f1gMH8lOVtWQHoqC4yK6oSDVVocumAsfCqjkXnqiYMhmMwS/mEHLp7Vehlt3ql6lEig==} engines: {node: '>=8'} @@ -5619,10 +5444,6 @@ packages: engines: {node: '>=16'} hasBin: true - table-layout@4.1.1: - resolution: {integrity: sha512-iK5/YhZxq5GO5z8wb0bY1317uDF3Zjpha0QFFLA8/trAoiLbQD0HUbMesEaxyzUgDxi2QlcbM8IvqOlEjgoXBA==} - engines: {node: '>=12.17'} - tar@7.5.16: resolution: {integrity: sha512-56adEpPMouktRlBLXiaYFFzZ/3+JXa8P9n7WbR+ibIjtviN55mEaOkiysCnPnWm+7kkui1Dn8J9l+g6zV8731w==} engines: {node: '>=18'} @@ -5756,10 +5577,6 @@ packages: typanion@3.14.0: resolution: {integrity: sha512-ZW/lVMRabETuYCd9O9ZvMhAh8GslSqaUjxmK/JLPCh6l73CvLBiuXswj/+7LdnWOgYsQ130FqLzFz5aGT4I3Ug==} - type-fest@0.20.2: - resolution: {integrity: sha512-Ne+eE4r0/iWnpAxD852z3A+N0Bt5RN//NjJwRd2VFHEmrywxf5vsZlh4R6lixl6B+wz/8d+maTSAkN1FIkI3LQ==} - engines: {node: '>=10'} - type-fest@0.21.3: resolution: {integrity: sha512-t0rzBq87m3fVcduHDUFhKmyyX+9eo6WQjZvf51Ea/M0Q7+T374Jp1aUiyUl0GKxp8M/OETVHSDvmkyPgvX+X2w==} engines: {node: '>=10'} @@ -5778,10 +5595,6 @@ packages: engines: {node: 
'>=14.17'} hasBin: true - typical@7.3.0: - resolution: {integrity: sha512-ya4mg/30vm+DOWfBg4YK3j2WD6TWtRkCbasOJr40CseYENzCUby/7rIvXA99JGsQHeNxLbnXdyLLxKSv3tauFw==} - engines: {node: '>=12.17'} - ufo@1.6.4: resolution: {integrity: sha512-JFNbkD1Svwe0KvGi8GOeLcP4kAWQ609twvCdcHxq1oSL8svv39ZuSvajcD8B+5D0eL4+s1Is2D/O6KN3qcTeRA==} @@ -5913,9 +5726,6 @@ packages: uri-js@4.4.1: resolution: {integrity: sha512-7rKUyy33Q1yc98pQ1DAmLtwX109F7TIfWlW1Ydo8Wl1ii1SeHieeh0HHfPeL2fMXK6z0s8ecKs9frCuLJvndBg==} - url-join@4.0.1: - resolution: {integrity: sha512-jk1+QP6ZJqyOiuEI9AEWQfju/nB2Pw466kbA0LEZljHwKeMgd9WrAEgEGxjPDD2+TNbbb37rTyhEfrCXfuKXnA==} - util-deprecate@1.0.2: resolution: {integrity: sha512-EPD5q1uXyFxJpCrLnCc1nHnq3gOa6DZBocAIiI2TaSCA7VCJ1UJDMagCzIkXNsUYfD1daK//LTEQ8xiIbrHtcw==} @@ -6051,10 +5861,6 @@ packages: resolution: {integrity: sha512-BN22B5eaMMI9UMtjrGd5g5eCYPpCPDUy0FJXbYsaT5zYxjFOckS53SQDE3pWkVoWpHXVb3BrYcEN4Twa55B5cA==} engines: {node: '>=0.10.0'} - wordwrapjs@5.1.1: - resolution: {integrity: sha512-0yweIbkINJodk27gX9LBGMzyQdBDan3s/dEAiwBOj+Mf0PPyWL6/rikalkv8EeD0E8jm4o5RXEOrFTP3NXbhJg==} - engines: {node: '>=12.17'} - wrap-ansi@10.0.0: resolution: {integrity: sha512-SGcvg80f0wUy2/fXES19feHMz8E0JoXv2uNgHOu4Dgi2OrCy1lqwFYEJz1BLbDI0exjPMe/ZdzZ/YpGECBG/aQ==} engines: {node: '>=20'} @@ -6063,10 +5869,6 @@ packages: resolution: {integrity: sha512-r6lPcBGxZXlIcymEu7InxDMhdW0KDxpLgoFLcguasxCaJ/SOIZwINatK9KY/tf+ZrlywOKU0UDj3ATXUBfxJXA==} engines: {node: '>=8'} - wrap-ansi@7.0.0: - resolution: {integrity: sha512-YVGIj2kamLSTxw6NsZjoBxfSwsn0ycdesmc4p+Q21c5zPuZ1pl+NfxVdxPtdHvmNVOQ6XSYG4AUtyt/Fi7D16Q==} - engines: {node: '>=10'} - wrap-ansi@9.0.2: resolution: {integrity: sha512-42AtmgqjV+X1VpdOfyTGOYRi0/zsoLqtXQckTmqTeybT+BDIbM/Guxo7x3pE2vtpr1ok6xRqM9OpBe+Jyoqyww==} engines: {node: '>=18'} @@ -6105,18 +5907,10 @@ packages: engines: {node: '>= 14.6'} hasBin: true - yargs-parser@21.1.1: - resolution: {integrity: 
sha512-tVpsJW7DdjecAiFpbIB1e3qxIQsE6NoPc5/eTdrbbIC4h0LVsWhnoa3g+m2HclBIujHzsxZ4VJVA+GUuc2/LBw==} - engines: {node: '>=12'} - yargs-parser@22.0.0: resolution: {integrity: sha512-rwu/ClNdSMpkSrUb+d6BRsSkLUq1fmfsY6TOpYzTwvwkg1/NRG85KBy3kq++A8LKQwX6lsu+aWad+2khvuXrqw==} engines: {node: ^20.19.0 || ^22.12.0 || >=23} - yargs@17.7.2: - resolution: {integrity: sha512-7dSzzRQ++CKnNI/krKnYRV7JKKPUXMEh61soaHKg9mrWEhzFWhFnxPxGl+69cD1Ou63C13NUPCnmIcrvqCuM6w==} - engines: {node: '>=12'} - yargs@18.0.0: resolution: {integrity: sha512-4UEqdc2RYGHZc7Doyqkrqiln3p9X2DZVxaGbwhn2pi7MrRagKaOcIKe8L3OxYcbhXLgLFUS3zAYuQjKBQgmuNg==} engines: {node: ^20.19.0 || ^22.12.0 || >=23} @@ -6802,47 +6596,6 @@ snapshots: packageurl-js: 2.0.1 spdx-expression-parse: 4.0.0 - '@duckdb/node-api@1.5.3-r.3': - dependencies: - '@duckdb/node-bindings': 1.5.3-r.3 - - '@duckdb/node-bindings-darwin-arm64@1.5.3-r.3': - optional: true - - '@duckdb/node-bindings-darwin-x64@1.5.3-r.3': - optional: true - - '@duckdb/node-bindings-linux-arm64-musl@1.5.3-r.3': - optional: true - - '@duckdb/node-bindings-linux-arm64@1.5.3-r.3': - optional: true - - '@duckdb/node-bindings-linux-x64-musl@1.5.3-r.3': - optional: true - - '@duckdb/node-bindings-linux-x64@1.5.3-r.3': - optional: true - - '@duckdb/node-bindings-win32-arm64@1.5.3-r.3': - optional: true - - '@duckdb/node-bindings-win32-x64@1.5.3-r.3': - optional: true - - '@duckdb/node-bindings@1.5.3-r.3': - dependencies: - detect-libc: 2.1.2 - optionalDependencies: - '@duckdb/node-bindings-darwin-arm64': 1.5.3-r.3 - '@duckdb/node-bindings-darwin-x64': 1.5.3-r.3 - '@duckdb/node-bindings-linux-arm64': 1.5.3-r.3 - '@duckdb/node-bindings-linux-arm64-musl': 1.5.3-r.3 - '@duckdb/node-bindings-linux-x64': 1.5.3-r.3 - '@duckdb/node-bindings-linux-x64-musl': 1.5.3-r.3 - '@duckdb/node-bindings-win32-arm64': 1.5.3-r.3 - '@duckdb/node-bindings-win32-x64': 1.5.3-r.3 - '@emnapi/runtime@1.11.0': dependencies: tslib: 2.8.1 @@ -7211,36 +6964,6 @@ snapshots: 
'@kreuzberg/tree-sitter-language-pack@1.8.0': optional: true - '@ladybugdb/core-darwin-arm64@0.17.1': - optional: true - - '@ladybugdb/core-darwin-x64@0.17.1': - optional: true - - '@ladybugdb/core-linux-arm64@0.17.1': - optional: true - - '@ladybugdb/core-linux-x64@0.17.1': - optional: true - - '@ladybugdb/core-win32-x64@0.17.1': - optional: true - - '@ladybugdb/core@0.17.1': - dependencies: - apache-arrow: 21.1.0 - cmake-js: 8.0.0 - node-addon-api: 6.1.0 - optionalDependencies: - '@ladybugdb/core-darwin-arm64': 0.17.1 - '@ladybugdb/core-darwin-x64': 0.17.1 - '@ladybugdb/core-linux-arm64': 0.17.1 - '@ladybugdb/core-linux-x64': 0.17.1 - '@ladybugdb/core-win32-x64': 0.17.1 - transitivePeerDependencies: - - '@75lb/nature' - - supports-color - '@mdx-js/mdx@3.1.1': dependencies: '@types/estree': 1.0.9 @@ -7532,6 +7255,35 @@ snapshots: '@pnpm/types@8.9.0': {} + '@protobufjs/aspromise@1.1.2': + optional: true + + '@protobufjs/base64@1.1.2': + optional: true + + '@protobufjs/codegen@2.0.5': + optional: true + + '@protobufjs/eventemitter@1.1.1': + optional: true + + '@protobufjs/fetch@1.1.1': + dependencies: + '@protobufjs/aspromise': 1.1.2 + optional: true + + '@protobufjs/float@1.0.2': + optional: true + + '@protobufjs/path@1.1.2': + optional: true + + '@protobufjs/pool@1.1.0': + optional: true + + '@protobufjs/utf8@1.1.1': + optional: true + '@rollup/pluginutils@5.4.0(rollup@4.62.2)': dependencies: '@types/estree': 1.0.9 @@ -7886,10 +7638,6 @@ snapshots: progress: 2.0.3 typescript: 5.9.3 - '@swc/helpers@0.5.23': - dependencies: - tslib: 2.8.1 - '@szmarczak/http-timer@4.0.6': dependencies: defer-to-connect: 2.0.1 @@ -7925,10 +7673,6 @@ snapshots: '@types/node': 25.9.3 '@types/responselike': 1.0.3 - '@types/command-line-args@5.2.3': {} - - '@types/command-line-usage@5.0.4': {} - '@types/d3-array@3.2.2': {} '@types/d3-axis@3.0.6': @@ -8110,7 +7854,7 @@ snapshots: '@types/sax@1.2.7': dependencies: - '@types/node': 24.13.2 + '@types/node': 25.9.3 '@types/semver@7.7.1': {} @@ 
-8218,9 +7962,6 @@ snapshots: acorn@8.17.0: {} - adm-zip@0.5.17: - optional: true - agent-base@7.1.4: {} aggregate-error@3.1.0: @@ -8302,20 +8043,6 @@ snapshots: anynum@1.0.1: {} - apache-arrow@21.1.0: - dependencies: - '@swc/helpers': 0.5.23 - '@types/command-line-args': 5.2.3 - '@types/command-line-usage': 5.0.4 - '@types/node': 24.13.2 - command-line-args: 6.0.2 - command-line-usage: 7.0.4 - flatbuffers: 25.9.23 - json-bignum: 0.0.3 - tslib: 2.8.1 - transitivePeerDependencies: - - '@75lb/nature' - arg@4.1.3: {} arg@5.0.2: {} @@ -8324,8 +8051,6 @@ snapshots: aria-query@5.3.2: {} - array-back@6.2.3: {} - array-find-index@1.0.2: {} array-ify@1.0.0: {} @@ -8560,10 +8285,6 @@ snapshots: ccount@2.0.1: {} - chalk-template@0.4.0: - dependencies: - chalk: 4.1.2 - chalk@2.4.2: dependencies: ansi-styles: 3.2.1 @@ -8638,12 +8359,6 @@ snapshots: dependencies: typanion: 3.14.0 - cliui@8.0.1: - dependencies: - string-width: 4.2.3 - strip-ansi: 6.0.1 - wrap-ansi: 7.0.0 - cliui@9.0.1: dependencies: string-width: 7.2.0 @@ -8658,20 +8373,6 @@ snapshots: clsx@2.1.1: {} - cmake-js@8.0.0: - dependencies: - debug: 4.4.3 - fs-extra: 11.3.5 - node-api-headers: 1.9.0 - rc: 1.2.8 - semver: 7.8.5 - tar: 7.5.16 - url-join: 4.0.1 - which: 6.0.1 - yargs: 17.7.2 - transitivePeerDependencies: - - supports-color - cmd-shim@8.0.0: {} code-block-writer@13.0.3: @@ -8695,20 +8396,6 @@ snapshots: command-exists@1.2.9: {} - command-line-args@6.0.2: - dependencies: - array-back: 6.2.3 - find-replace: 5.0.2 - lodash.camelcase: 4.3.0 - typical: 7.3.0 - - command-line-usage@7.0.4: - dependencies: - array-back: 6.2.3 - chalk-template: 0.4.0 - table-layout: 4.1.1 - typical: 7.3.0 - commander@11.1.0: {} commander@12.1.0: {} @@ -9088,28 +8775,12 @@ snapshots: dedent@0.7.0: {} - deep-extend@0.6.0: {} - defaults@1.0.4: dependencies: clone: 1.0.4 defer-to-connect@2.0.1: {} - define-data-property@1.1.4: - dependencies: - es-define-property: 1.0.1 - es-errors: 1.3.0 - gopd: 1.2.0 - optional: true - - 
define-properties@1.2.1: - dependencies: - define-data-property: 1.1.4 - has-property-descriptors: 1.0.2 - object-keys: 1.1.1 - optional: true - defu@6.1.7: {} delaunator@5.1.0: @@ -9281,9 +8952,6 @@ snapshots: escape-string-regexp@1.0.5: {} - escape-string-regexp@4.0.0: - optional: true - escape-string-regexp@5.0.0: {} estree-util-attach-comments@3.0.0: @@ -9465,8 +9133,6 @@ snapshots: findup-sync: 4.0.0 merge: 2.1.1 - find-replace@5.0.2: {} - find-root@1.1.0: {} findup-sync@4.0.0: @@ -9482,7 +9148,8 @@ snapshots: mlly: 1.8.2 rollup: 4.60.3 - flatbuffers@25.9.23: {} + flatbuffers@25.9.23: + optional: true flattie@1.1.1: {} @@ -9498,12 +9165,6 @@ snapshots: fresh@2.0.0: {} - fs-extra@11.3.5: - dependencies: - graceful-fs: 4.2.11 - jsonfile: 6.2.1 - universalify: 2.0.1 - fs-extra@9.1.0: dependencies: at-least-node: 1.0.0 @@ -9584,14 +9245,6 @@ snapshots: once: 1.4.0 path-is-absolute: 1.0.1 - global-agent@4.1.3: - dependencies: - globalthis: 1.0.4 - matcher: 4.0.0 - semver: 7.8.4 - serialize-error: 8.1.0 - optional: true - global-directory@5.0.0: dependencies: ini: 6.0.0 @@ -9610,12 +9263,6 @@ snapshots: is-windows: 1.0.2 which: 1.3.1 - globalthis@1.0.4: - dependencies: - define-properties: 1.2.1 - gopd: 1.2.0 - optional: true - google-protobuf@3.21.4: {} gopd@1.2.0: {} @@ -9657,6 +9304,9 @@ snapshots: events: 3.3.0 graphology-types: 0.24.8 + guid-typescript@1.0.9: + optional: true + h3@1.15.11: dependencies: cookie-es: 1.2.3 @@ -9677,11 +9327,6 @@ snapshots: has-flag@5.0.1: {} - has-property-descriptors@1.0.2: - dependencies: - es-define-property: 1.0.1 - optional: true - has-symbols@1.1.0: {} hasown@2.0.3: @@ -10101,8 +9746,6 @@ snapshots: dependencies: argparse: 2.0.1 - json-bignum@0.0.3: {} - json-buffer@3.0.1: {} json-parse-even-better-errors@2.3.1: {} @@ -10221,8 +9864,6 @@ snapshots: lodash-es@4.18.1: {} - lodash.camelcase@4.3.0: {} - lodash.clone@4.5.0: {} lodash.clonedeep@4.5.0: {} @@ -10280,6 +9921,9 @@ snapshots: strip-ansi: 7.2.0 wrap-ansi: 9.0.2 + 
long@5.3.2: + optional: true + longest-streak@3.1.0: {} longest@2.0.1: {} @@ -10327,11 +9971,6 @@ snapshots: marked@16.4.2: {} - matcher@4.0.0: - dependencies: - escape-string-regexp: 4.0.0 - optional: true - math-intrinsics@1.1.0: {} mdast-util-definitions@6.0.0: @@ -10953,10 +10592,6 @@ snapshots: dependencies: '@types/nlcst': 2.0.3 - node-addon-api@6.1.0: {} - - node-api-headers@1.9.0: {} - node-fetch-native@1.6.7: {} node-gyp@12.4.0: @@ -11038,9 +10673,6 @@ snapshots: object-inspect@1.13.4: {} - object-keys@1.1.1: - optional: true - obliterator@2.0.5: {} obug@2.1.3: {} @@ -11077,14 +10709,17 @@ snapshots: regex: 6.1.0 regex-recursion: 6.0.2 - onnxruntime-common@1.26.0: + onnxruntime-common@1.27.0: optional: true - onnxruntime-node@1.26.0: + onnxruntime-web@1.27.0: dependencies: - adm-zip: 0.5.17 - global-agent: 4.1.3 - onnxruntime-common: 1.26.0 + flatbuffers: 25.9.23 + guid-typescript: 1.0.9 + long: 5.3.2 + onnxruntime-common: 1.27.0 + platform: 1.3.6 + protobufjs: 7.6.4 optional: true openapi-types@12.1.3: {} @@ -11250,6 +10885,9 @@ snapshots: mlly: 1.8.2 pathe: 2.0.3 + platform@1.3.6: + optional: true + playwright-core@1.61.0: {} playwright@1.61.0: @@ -11309,6 +10947,21 @@ snapshots: property-information@7.2.0: {} + protobufjs@7.6.4: + dependencies: + '@protobufjs/aspromise': 1.1.2 + '@protobufjs/base64': 1.1.2 + '@protobufjs/codegen': 2.0.5 + '@protobufjs/eventemitter': 1.1.1 + '@protobufjs/fetch': 1.1.1 + '@protobufjs/float': 1.0.2 + '@protobufjs/path': 1.1.2 + '@protobufjs/pool': 1.1.0 + '@protobufjs/utf8': 1.1.1 + '@types/node': 25.9.3 + long: 5.3.2 + optional: true + proxy-addr@2.0.7: dependencies: forwarded: 0.2.0 @@ -11347,13 +11000,6 @@ snapshots: iconv-lite: 0.7.2 unpipe: 1.0.0 - rc@1.2.8: - dependencies: - deep-extend: 0.6.0 - ini: 1.3.8 - minimist: 1.2.8 - strip-json-comments: 2.0.1 - read-cmd-shim@6.0.0: {} readable-stream@3.6.2: @@ -11535,8 +11181,6 @@ snapshots: mdast-util-to-markdown: 2.1.2 unified: 11.0.5 - require-directory@2.1.1: {} - 
require-from-string@2.0.2: {} resolve-alpn@1.2.1: {} @@ -11722,11 +11366,6 @@ snapshots: transitivePeerDependencies: - supports-color - serialize-error@8.1.0: - dependencies: - type-fest: 0.20.2 - optional: true - serve-static@2.2.1: dependencies: encodeurl: 2.0.0 @@ -12064,8 +11703,6 @@ snapshots: strip-bom@4.0.0: {} - strip-json-comments@2.0.1: {} - strip-json-comments@3.1.1: {} strnum@2.4.1: @@ -12117,11 +11754,6 @@ snapshots: picocolors: 1.1.1 sax: 1.6.0 - table-layout@4.1.1: - dependencies: - array-back: 6.2.3 - wordwrapjs: 5.1.1 - tar@7.5.16: dependencies: '@isaacs/fs-minipass': 4.0.1 @@ -12259,9 +11891,6 @@ snapshots: typanion@3.14.0: {} - type-fest@0.20.2: - optional: true - type-fest@0.21.3: {} type-is@2.1.0: @@ -12274,8 +11903,6 @@ snapshots: typescript@6.0.3: {} - typical@7.3.0: {} - ufo@1.6.4: {} ultrahtml@1.6.0: {} @@ -12375,8 +12002,6 @@ snapshots: dependencies: punycode: 2.3.1 - url-join@4.0.1: {} - util-deprecate@1.0.2: {} uuid@14.0.0: {} @@ -12472,8 +12097,6 @@ snapshots: word-wrap@1.2.5: {} - wordwrapjs@5.1.1: {} - wrap-ansi@10.0.0: dependencies: ansi-styles: 6.2.3 @@ -12486,12 +12109,6 @@ snapshots: string-width: 4.2.3 strip-ansi: 6.0.1 - wrap-ansi@7.0.0: - dependencies: - ansi-styles: 4.3.0 - string-width: 4.2.3 - strip-ansi: 6.0.1 - wrap-ansi@9.0.2: dependencies: ansi-styles: 6.2.3 @@ -12520,20 +12137,8 @@ snapshots: yaml@2.9.0: {} - yargs-parser@21.1.1: {} - yargs-parser@22.0.0: {} - yargs@17.7.2: - dependencies: - cliui: 8.0.1 - escalade: 3.2.0 - get-caller-file: 2.0.5 - require-directory: 2.1.1 - string-width: 4.2.3 - y18n: 5.0.8 - yargs-parser: 21.1.1 - yargs@18.0.0: dependencies: cliui: 9.0.1 @@ -12565,10 +12170,8 @@ time: '@commitlint/cli@21.0.2': '2026-05-29T09:33:09.819Z' '@commitlint/config-conventional@21.0.2': '2026-05-29T09:33:09.725Z' '@cyclonedx/cyclonedx-library@10.1.0': '2026-06-04T10:37:53.901Z' - '@duckdb/node-api@1.5.3-r.3': '2026-06-01T02:00:35.057Z' '@huggingface/tokenizers@0.1.3': '2026-03-18T20:30:05.853Z' 
'@iarna/toml@2.2.5': '2020-04-22T20:16:59.382Z' - '@ladybugdb/core@0.17.1': '2026-06-02T18:58:55.249Z' '@modelcontextprotocol/sdk@1.29.0': '2026-03-30T16:50:42.718Z' '@sourcegraph/scip-python@0.6.6': '2025-09-05T12:40:43.845Z' '@sourcegraph/scip-typescript@0.4.0': '2025-10-02T06:02:28.263Z' @@ -12592,7 +12195,7 @@ time: license-checker-rseidelsohn@5.0.1: '2026-05-27T14:15:44.165Z' listr2@10.2.1: '2026-03-02T23:32:42.720Z' lru-cache@11.5.1: '2026-05-27T15:04:12.732Z' - onnxruntime-node@1.26.0: '2026-05-08T19:50:24.041Z' + onnxruntime-web@1.27.0: '2026-06-19T21:08:52.103Z' piscina@5.2.0: '2026-06-12T08:48:23.250Z' playwright@1.61.0: '2026-06-15T10:06:22.269Z' rehype-mermaid@3.0.0: '2024-10-08T18:54:56.311Z' diff --git a/pnpm-workspace.yaml b/pnpm-workspace.yaml index 1d690c54..ba68413c 100644 --- a/pnpm-workspace.yaml +++ b/pnpm-workspace.yaml @@ -73,20 +73,18 @@ minimumReleaseAge: 0 # Packages not listed here are blocked from running scripts by default # (strictDepBuilds: true is the v11 default). allowBuilds: - # DuckDB native bindings - "@duckdb/node-api": true - "@duckdb/node-bindings-darwin-arm64": true - "@duckdb/node-bindings-darwin-x64": true - "@duckdb/node-bindings-linux-arm64": true - "@duckdb/node-bindings-linux-x64": true - "@duckdb/node-bindings-win32-x64": true - # LadybugDB native binding — copies lbugjs.node from the platform sub-package - "@ladybugdb/core": true - # Misc native addons + # Toolchain native addons (dev-time only — not shipped in the CLI bundle). esbuild: true lefthook: true - onnxruntime-node: true sharp: true + # protobufjs is a transitive dep of onnxruntime-web. Its install script only + # regenerates a bundled .js descriptor that already ships in the tarball, so + # it is NOT needed — keeping it off preserves the zero-native-build install. + protobufjs: false + # No storage native bindings: @ladybugdb/core and @duckdb/node-api were removed + # in the single-file SQLite migration (ADR 0019). 
onnxruntime-node was replaced + # by onnxruntime-web (prebuilt WASM, no build script). The embedder is now the + # only optional runtime dependency; it is native-free and builds nothing at install. # No tree-sitter entries: native tree-sitter and grammar packages are not # workspace dependencies. WASMs are vendored at packages/ingestion/vendor/wasms/. # Re-vendor on demand via the workflow in scripts/build-vendor-wasms.sh.