feat(storage): single-file SQLite + WASM embedder — zero native dependencies#245
Merged
Prove one node:sqlite file in WAL mode can back the graph, embedding, and temporal tiers behind the existing IGraphStore/ITemporalStore seam, replacing the @ladybugdb/core + @duckdb/node-api native-binding pair. Goal: OpenCodeHub installs with zero native deps and one command, no Docker.

- sqlite-adapter.ts: SqliteStore representative slice (bulkLoad, getNode/listNodes, upsert/listEmbeddings (f32 BLOB), brute-force cosine vectorSearch, recursive-CTE traverse (impact up / blast-radius down), meta). Out-of-scope methods throw NotImplementedError with phase pointers.
- sqlite-adapter.test.ts: 2 node:test cases, both green: graph+embedding round-trip from one on-disk file across reopen; CTE traversal up/down with depth bound and path. Asserts no .lbug/.duckdb sidecars are written.
- SPIKE-SQLITE-GOAL.md + SPIKE-SQLITE-WORKFLOW.md: goal, 6-phase migration, risk register, generic-node-table design proposal, recommendation.

No backwards compat (clean slate, per brief). Storage package only; main architecture untouched. tsc -b clean; tests pass under --experimental-sqlite on Node 24.17. WAL verified on a real file.
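The f32 BLOB storage and brute-force cosine vectorSearch named above can be illustrated with a minimal sketch. It is not the adapter's code: `encodeF32`, `decodeF32`, `EmbeddingRow`, and the little-endian layout are assumptions made for illustration, and the rows are plain in-memory objects rather than results read from SQLite.

```typescript
// Hypothetical helpers; the real sqlite-adapter.ts may encode and rank differently.

/** Encode a vector as a little-endian float32 BLOB (4 bytes per dimension). */
function encodeF32(vector: readonly number[]): Uint8Array {
  const bytes = new Uint8Array(vector.length * 4);
  const view = new DataView(bytes.buffer);
  vector.forEach((value, i) => view.setFloat32(i * 4, value, true));
  return bytes;
}

/** Decode a float32 BLOB; copies element by element so byte alignment never matters. */
function decodeF32(blob: Uint8Array): Float32Array {
  const view = new DataView(blob.buffer, blob.byteOffset, blob.byteLength);
  const out = new Float32Array(blob.byteLength / 4);
  for (let i = 0; i < out.length; i++) out[i] = view.getFloat32(i * 4, true);
  return out;
}

/** Cosine similarity; returns 0 when either vector has zero length. */
function cosine(a: Float32Array, b: Float32Array): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  const n = Math.min(a.length, b.length);
  for (let i = 0; i < n; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

interface EmbeddingRow {
  nodeId: string;
  vector: Uint8Array; // the BLOB column
}

interface Hit {
  nodeId: string;
  score: number;
}

/** Brute force: score every row, sort by score (ties broken by id), keep the top k. */
function vectorSearch(rows: readonly EmbeddingRow[], query: readonly number[], k: number): Hit[] {
  const q = Float32Array.from(query);
  return rows
    .map((row) => ({ nodeId: row.nodeId, score: cosine(q, decodeF32(row.vector)) }))
    .sort((x, y) => y.score - x.score || (x.nodeId < y.nodeId ? -1 : x.nodeId > y.nodeId ? 1 : 0))
    .slice(0, k);
}
```

A linear scan like this is O(rows × dimensions) per query. The tie-break on `nodeId` is an added assumption that keeps the ordering deterministic when two scores are equal.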
…arning

node:sqlite is enabled by default on Node >=24.15, but loading it emits a one-shot ExperimentalWarning to stderr. For the stdio MCP server stderr is a real channel, so the warning is a correctness wart. Add a dependency-free in-process guard that swallows only the SQLite ExperimentalWarning and passes every other warning through, installed before the binding loads.

- sqlite-runtime.ts: installSqliteRuntimeGuard(), idempotent emitWarning override; auto-installs on import. Exposed as the @opencodehub/storage/sqlite-runtime subpath so the CLI can import it without pulling the native binding (preserves lazy --help startup).
- sqlite-adapter.ts: import the guard before node:sqlite.
- cli/src/index.ts: install the guard at process start.
- sqlite-runtime.test.ts: swallows SQLite warning only, idempotent.

Chosen over a --node-flag/shebang because no single launch site is under our control (published bin, agent-spawned MCP, node --test, embedding libs); the guard travels with the code. Verified: codehub --version exits 0 with empty stderr; control proves the warning fires without the guard.
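A minimal sketch of the guard idea, under stated assumptions: `installGuard`, `WarningHost`, and the message match on "sqlite" are hypothetical, and the function takes any object with an `emitWarning` method so it can be exercised without touching the real `process`. The actual installSqliteRuntimeGuard() may detect the warning differently.

```typescript
type EmitWarning = (warning: string | Error, ...rest: unknown[]) => void;

interface WarningHost {
  emitWarning: EmitWarning;
}

// Remembers wrappers we created so a second install is a no-op.
const installedGuards = new WeakSet<EmitWarning>();

/** Assumed heuristic: an ExperimentalWarning whose message mentions SQLite. */
function isSqliteExperimentalWarning(warning: string | Error, rest: readonly unknown[]): boolean {
  const message = typeof warning === "string" ? warning : warning.message;
  const first = rest[0];
  let type = "";
  if (typeof warning !== "string") {
    type = warning.name;
  } else if (typeof first === "string") {
    type = first; // emitWarning(message, "ExperimentalWarning")
  } else if (typeof first === "object" && first !== null && "type" in first) {
    type = String((first as { type: unknown }).type); // emitWarning(message, { type })
  }
  return type === "ExperimentalWarning" && /sqlite/i.test(message);
}

/** Idempotently wrap host.emitWarning so only the SQLite warning is dropped. */
function installGuard(host: WarningHost): void {
  if (installedGuards.has(host.emitWarning)) return; // already guarded
  const original = host.emitWarning.bind(host);
  const guarded: EmitWarning = (warning, ...rest) => {
    if (isSqliteExperimentalWarning(warning, rest)) return; // swallow this one only
    original(warning, ...rest); // everything else passes through untouched
  };
  installedGuards.add(guarded);
  host.emitWarning = guarded;
}
```

In a Node program the host would presumably be `process`, installed before the first `node:sqlite` import; whether `process` type-checks against this reduced interface without a cast is not verified here.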
P1+P2 of the single-file SQLite migration. The adapter now implements the full
IGraphStore + ITemporalStore surface against one node:sqlite file — every method
except exportEmbeddingsToParquet (deferred to the Phase-4 Parquet decision).
Graph: listEdges/listEdgesByType (mirrors GraphDbStore edge path incl.
stepZeroSentinel + empty-reason drop + (from,to,type,id) sort), listNodesByKind,
listFindings, listDependencies, listRoutes, getRepoNode, listNodesByEntryPoint,
listNodesByName, countNodesByKind, countEdgesByType, listEmbeddingHashes,
listConsumerProducerEdges, setMeta, traverseAncestors/Descendants, FTS5 search.
Temporal: exec (read-only via assertReadOnlySql), cochanges + symbol-summaries
load/lookup surface. Filter-only fields live in the JSON payload, reached via
SQLite JSON1 payload->>'$.field' extracts.
GATE PASSED — graphHash byte-identity. sqlite-parity.test.ts proves a
KnowledgeGraph rebuilt from listNodes({})+listEdges({}) hashes identically to the
original across small + medium fixtures exercising all four sentinels (step-0,
languageStats {}, deadness underscore/hyphen, Repo string|null), every edge kind,
and two independent stores. The generic-node-table + JSON-payload design holds
byte-identity — the go/no-go for the whole migration.
Full storage suite: 178 pass, 0 fail (no lbug/duck regressions). tsc -b clean.
openStore now returns ONE SqliteStore as both the graph and temporal views over
one <repo>/.codehub/store.sqlite (WAL). Because a single instance satisfies both
IGraphStore and ITemporalStore, all 52 call sites (store.graph.X / store.temporal.Y)
compile and run unchanged — both views hit the same connection and file.
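The single-instance contract described above can be sketched as follows. The interfaces here are drastically reduced stand-ins (one method each) and the in-memory class body is hypothetical; only the shape of `openStore` returning one object under both names reflects the commit text.

```typescript
// Reduced, hypothetical versions of the two seams.
interface IGraphStore {
  getNode(id: string): string | undefined;
}

interface ITemporalStore {
  exec(sql: string): unknown[];
}

// One class satisfies both interfaces, standing in for the real SqliteStore.
class SqliteStore implements IGraphStore, ITemporalStore {
  private readonly nodes = new Map<string, string>();

  constructor(readonly path: string) {}

  getNode(id: string): string | undefined {
    return this.nodes.get(id);
  }

  exec(_sql: string): unknown[] {
    return []; // placeholder: a real store would run read-only SQL here
  }
}

interface OpenedStore {
  graph: IGraphStore;
  temporal: ITemporalStore;
}

/** Both views point at the same instance, so existing call sites keep compiling. */
function openStore(path: string): OpenedStore {
  const store = new SqliteStore(path);
  return { graph: store, temporal: store };
}
```

Callers written as `store.graph.X` or `store.temporal.Y` need no change because the returned object still has both properties; they simply alias one connection.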
Verified end-to-end on a live repo: `codehub analyze` writes one store.sqlite
(no graph.lbug, no temporal.duckdb); `codehub query "add"` returns the FTS5 hit;
`codehub impact double --direction up` finds the caller via recursive-CTE
traversal. Full storage suite 178 pass / 0 fail; mcp suite 209 / 0; monorepo
tsc -b clean.
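As an in-memory analogue of the depth-bounded traversal that the recursive CTE performs, the sketch below walks edges breadth-first and records the path to each visited node. It is not the adapter's SQL: the `Edge` shape, the `traverse` name, and the convention that "up" follows edges from callee to caller are assumptions for illustration.

```typescript
interface Edge {
  from: string; // caller
  to: string; // callee
}

interface Visit {
  id: string;
  depth: number;
  path: string[]; // start node first, visited node last
}

/** "up" finds callers (impact); "down" finds callees (blast radius). */
function traverse(
  edges: readonly Edge[],
  start: string,
  direction: "up" | "down",
  maxDepth: number,
): Visit[] {
  // Adjacency list keyed by the node we are standing on.
  const next = new Map<string, string[]>();
  for (const edge of edges) {
    const key = direction === "up" ? edge.to : edge.from;
    const value = direction === "up" ? edge.from : edge.to;
    const list = next.get(key) ?? [];
    list.push(value);
    next.set(key, list);
  }

  const seen = new Set<string>([start]); // cycle guard
  const out: Visit[] = [];
  let frontier: Visit[] = [{ id: start, depth: 0, path: [start] }];

  while (frontier.length > 0) {
    const nextFrontier: Visit[] = [];
    for (const current of frontier) {
      if (current.depth >= maxDepth) continue; // depth bound
      for (const id of next.get(current.id) ?? []) {
        if (seen.has(id)) continue;
        seen.add(id);
        const visit: Visit = { id, depth: current.depth + 1, path: [...current.path, id] };
        out.push(visit);
        nextFrontier.push(visit);
      }
    }
    frontier = nextFrontier;
  }
  return out;
}
```

With edges `quadruple → double` and `double → add`, an "up" walk from `double` reaches `quadruple`, which corresponds to the caller lookup that `codehub impact double --direction up` is described as performing.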
Two real bugs the live run caught that unit + parity tests structurally could not:
- bulkLoad ignored opts.mode and always full-replaced. ingest-sarif (run inside
analyze) calls bulkLoad(graph, {mode:"upsert"}) with an empty SARIF graph, so
the second call WIPED the 15 nodes the first call wrote. Now honors mode:
"replace" truncates, "upsert" merges (INSERT OR REPLACE; FTS row deleted then
reinserted per node). store_meta is stamped from actual post-write table counts
so an upsert batch can't clobber the count with its partial total.
- tsup 8.5.1 defaults removeNodeProtocol:true, whose strip list predates
node:sqlite — it rewrote the externalized `node:sqlite` import to a bare
`sqlite`, unresolvable at runtime. Set removeNodeProtocol:false in the CLI
tsup config so every builtin keeps its node: prefix (correct for Node>=24.15).
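The bulkLoad mode bug in the first item can be reproduced with a small in-memory model. This is a sketch under assumptions: a `Map` stands in for the nodes table, `InMemoryStore` and its `nodeCount` meta field are hypothetical, and defaulting to "replace" when no mode is given is a guess, not something the commit states.

```typescript
type LoadMode = "replace" | "upsert";

interface GraphNode {
  id: string;
  name: string;
}

class InMemoryStore {
  private readonly nodes = new Map<string, GraphNode>();
  private meta = { nodeCount: 0 };

  bulkLoad(batch: readonly GraphNode[], opts: { mode?: LoadMode } = {}): void {
    const mode = opts.mode ?? "replace"; // assumed default
    if (mode === "replace") this.nodes.clear(); // truncate only on replace
    for (const node of batch) this.nodes.set(node.id, node); // insert-or-replace by id
    // Stamp the count from what is actually stored, not from batch.length,
    // so a partial upsert batch cannot clobber the total.
    this.meta = { nodeCount: this.nodes.size };
  }

  nodeCount(): number {
    return this.meta.nodeCount;
  }
}
```

Under this model, loading 15 nodes and then calling `bulkLoad([], { mode: "upsert" })` leaves the count at 15, which is the behavior the fix restores for the ingest-sarif call.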
Also: openStore drops the graph.lbug/temporal.duckdb path split (now store.sqlite);
duckOptions/graphDbOptions kept as accepted-but-ignored for call-site compat,
removed with the native adapters in P5. Updated the openStore composition test to
assert the single-instance contract. CLI installs the sqlite runtime guard at
process start.
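For the removeNodeProtocol fix described in this commit, a config fragment along these lines would keep the `node:` prefix on builtins. Only the `removeNodeProtocol: false` setting comes from the commit text; the entry path and output format are placeholders, not the CLI's actual tsup config.

```typescript
// tsup.config.ts (sketch; entry and format are assumed values)
import { defineConfig } from "tsup";

export default defineConfig({
  entry: ["src/index.ts"], // placeholder entry point
  format: ["esm"], // placeholder output format
  // Keep `node:sqlite` as written instead of rewriting it to bare `sqlite`,
  // which does not resolve at runtime.
  removeNodeProtocol: false,
});
```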
P4 decision: option (a). SqliteStore.exportEmbeddingsToParquet now lazily
`await import("./duckdb-adapter.js")` INSIDE the method and delegates to a
throwaway in-memory DuckDbStore for the deterministic COPY-to-Parquet. DuckDB is
off the install hot path — only an embeddings-pack invocation loads it;
analyze/query/impact and an embedding-free pack never do. This unblocks removing
the lbug GRAPH binding (the bigger win) while preserving the byte-identical
Parquet sidecar contract.
Verified: pack/embeddings-sidecar.test.ts green; a direct probe emits a valid
PAR1 Parquet file (2 rows, duckdbVersion pinned). Removed the now-unused
NotImplementedError import. WORKFLOW.md progress log updated: P0-P4 done, P5
(rip bindings — irreversible, supersedes ADR 0016) left as a decision for Laith.
Option (b) — pure-JS Parquet to kill the last native dep — remains the documented
fast-follow.
…0019)

Complete the single-file SQLite migration. @ladybugdb/core (the lbug graph binding) is removed from every package.json and the lockfile; graphdb-adapter, graphdb-pool, graphdb-schema and their tests are deleted. Storage is now one node:sqlite store.sqlite (WAL) per repo, zero native deps on the install hot path. @duckdb/node-api survives ONLY as a lazy pack-time import for the Parquet embeddings sidecar.

- Extract the two pure helpers SqliteStore needed out of the binding-coupled files into dependency-free modules: license.ts (classifyLicenseTier, was in duckdb-adapter.ts whose top-level @duckdb import would have eager-loaded the binding, fixing a latent P4 leak) and relations.ts (getAllRelationTypes).
- openStore drops the lbug/duck path split + the duckOptions/graphDbOptions bags.
- doctor.ts: replace the lbug binding probe with a node:sqlite builtin check (import + WAL round-trip); DuckDB check downgraded fail->warn (Parquet only).
- analyze-carry-forward.test.ts + augment.test.ts: drop the hasNativeBinding() lbug skip-guards; these tests now run unconditionally on SQLite (10/10 pass, previously skipped in this env).
- step:0 graphHash fix (latent finding, greenlit): KnowledgeGraph.addEdge normalizes step:0 -> absent at the graph boundary so graphHash matches every adapter's listEdges (stepZeroSentinel). Regression test added to sqlite-parity.test.ts. Closes the gap the old graphdb-roundtrip test masked.
- ADR 0019 supersedes ADR 0016; CLAUDE.md storage section rewritten.

Verified: monorepo tsc clean; storage 89/0, core-types 83/0, pack 105/0, mcp 209/0, cli 345/0 (11 skip). Live analyze->query->impact writes one store.sqlite (no .lbug/.duckdb) with @ladybugdb/core unresolvable (ERR_MODULE_NOT_FOUND).
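The step:0 normalization mentioned in this commit can be sketched as below. The `Edge` fields and this cut-down `KnowledgeGraph` are hypothetical; the point is only that `step: 0` and an absent `step` become the same stored shape, so anything hashed from the edges cannot tell them apart.

```typescript
interface Edge {
  from: string;
  to: string;
  type: string;
  step?: number;
}

class KnowledgeGraph {
  private readonly edges: Edge[] = [];

  /** Normalize at the boundary: step 0 is stored as "no step at all". */
  addEdge(edge: Edge): void {
    const { step, ...rest } = edge;
    this.edges.push(step === undefined || step === 0 ? rest : { ...rest, step });
  }

  listEdges(): readonly Edge[] {
    return this.edges;
  }
}
```

Doing this once in `addEdge` means every adapter that drops a zero step on read reproduces exactly what the in-memory graph already holds.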
GOAL.md status → COMPLETE; WORKFLOW.md progress log reflects P5 (bindings ripped, ADR 0019) + P6 (clean-room one-command install proven). Migration done on the branch; awaiting review before merge to main.
…ro native storage deps)

Nothing in OCH ever read embeddings.parquet back: it was a write-only export (BOM item #7) with no consumer. So the sidecar is dropped entirely and DuckDB, the last native storage binding, goes with it. Embeddings already live in the `embeddings` table in store.sqlite (BLOB-exact, queryable). Result: ZERO native storage dependencies. @ladybugdb/core and @duckdb/node-api are both gone from every package.json and the lockfile.

- pack: delete embeddings-sidecar.ts + test; remove the BOM item, the writeEmbeddingsSidecar wiring, and the duckdb_version pin from PackPins / manifest / readme. Code-pack is now an 8-item BOM (manifest + skeleton + file-tree + deps + ast-chunks + xrefs + findings + licenses + readme).
- storage: delete duckdb-adapter.ts + test; remove exportEmbeddingsToParquet from ITemporalStore + SqliteStore. SqliteStore no longer references DuckDB.
- cli: doctor drops the DuckDB probe (no native storage binding left to check); --embeddings/--skip-native help text de-DuckDB'd; 8-item BOM strings fixed.
- mcp: pack-codebase help text updated.
- @duckdb/node-api removed from cli + storage package.json; lockfile synced (0 refs for both native bindings).
- ADR 0019 + CLAUDE.md updated: zero native storage deps; onnxruntime-node (the embedder) is the only remaining native dep, optional + lazy under --embeddings.

Verified: monorepo tsc clean; storage 81/0, pack 98/0, mcp 209/0, cli 343/0. Live analyze->query->impact on a pristine repo writes one store.sqlite (no .lbug/.duckdb/.parquet) with BOTH @duckdb/node-api and @ladybugdb/core unresolvable (ERR_MODULE_NOT_FOUND).
…native deps

The last native dependency is gone. The embedder now runs the ONNX runtime in pure WebAssembly via onnxruntime-web (prebuilt WASM, no node-gyp, no install script, no platform matrix). OpenCodeHub now has LITERALLY zero native dependencies: npm i -g @opencodehub/cli + Node >=24.15 and nothing builds.

- onnx-embedder.ts: import onnxruntime-web; set env.wasm.numThreads=1 (the deterministic single-threaded path ORT prescribes for Node) + optional env.wasm.wasmPaths (EmbedderConfig.wasmDir escape hatch for the bundled CLI; the default self-resolves correctly since ort is `external` in the tsup bundle and sits next to its sibling .wasm in node_modules). InferenceSession.create takes model BYTES (Uint8Array) in Node, not a path. executionProviders ['wasm']; graphOptimizationLevel 'disabled' retained; WASM honours it.
- Determinism VERIFIED on the real gte-modernbert int8 model: byte-identical across repeat runs AND fresh sessions; live `analyze --embeddings` through the built CLI populated the embeddings table (status: vectors "populated"), L2-norm = 1.0, 768-dim. The graphHash contract holds.
- deps: onnxruntime-node → onnxruntime-web (1.27.0) in embedder + cli optionalDependencies; lockfile synced (0 refs for all three former native bindings). pnpm-workspace allowBuilds pruned to dev-only toolchain (esbuild/lefthook/sharp) + protobufjs:false (transitive of ort-web, its install script is unneeded; keeps the install build-free).
- doctor: the embedder probe now imports onnxruntime-web; dropped the obsolete onnxruntime-node platform-matrix note (WASM is universal: no Intel-mac/musl gap). Tests updated.

Verified: monorepo tsc clean; embedder 79/0, ingestion 585/0, search 27/0, cli 343/0 (11 skip). No package.json or lockfile references any native binding.
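The session setup listed for onnx-embedder.ts could look roughly like the sketch below. To stay self-contained it does not import onnxruntime-web; instead `OrtLike` is a hypothetical, reduced interface covering only the fields the commit names (`env.wasm.numThreads`, `env.wasm.wasmPaths`, `InferenceSession.create`), and `createWasmSession` is an invented helper name.

```typescript
// Reduced stand-in for the parts of the ORT module used here (an assumption,
// not the library's full typings).
interface OrtLike<Session> {
  env: { wasm: { numThreads?: number; wasmPaths?: string } };
  InferenceSession: {
    create(
      model: Uint8Array,
      options: {
        executionProviders: string[];
        graphOptimizationLevel: "disabled" | "basic" | "extended" | "all";
      },
    ): Promise<Session>;
  };
}

/** Configure single-threaded WASM, then create a session from model bytes. */
function createWasmSession<Session>(
  ort: OrtLike<Session>,
  modelBytes: Uint8Array,
  wasmDir?: string,
): Promise<Session> {
  ort.env.wasm.numThreads = 1; // single-threaded path, per the commit text
  if (wasmDir !== undefined) ort.env.wasm.wasmPaths = wasmDir; // optional escape hatch
  return ort.InferenceSession.create(modelBytes, {
    executionProviders: ["wasm"],
    graphOptimizationLevel: "disabled",
  });
}
```

In real use the first argument would presumably be the module obtained from `await import("onnxruntime-web")` and the bytes would come from reading the model file. Whether the real module type-checks against this reduced interface without a cast is not verified here.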
…, no-non-null) CI's `biome ci` is stricter than the pre-push `biome check` and flagged style nits in the single-file-SQLite code: useTemplate (string concat → template literals) and organize-imports in sqlite-adapter.ts / cli+storage index.ts, plus noNonNullAssertion in sqlite-adapter.test.ts. Replaced the test's `!` assertions with explicit `assert.ok(...)` guards (clearer precondition, no logic change). Clears both the `lint` job and `self-scan` (the latter failed only because OCH's own biome scanner surfaced these same findings through its verdict gate). biome ci clean (680 files); storage suite green.
CI `biome ci .` also enforces formatting (the pre-push `biome check` on staged files did not flag it). Ran `biome format --write` — settled multi-line type annotations and comment wrapping in doctor.ts, graph.ts, sqlite-runtime.ts and its test. Also refreshed the stale DoctorOptions doc comment that still named onnxruntime-node (it's onnxruntime-web now). No behavior change; biome ci clean over 680 files; core-types 83/0, storage 81/0.
theagenticguy added a commit that referenced this pull request on Jun 22, 2026
…nt bugs (#247)

## Summary

The single-file SQLite migration (ADR 0019, #245) left stale references to the removed `graph.lbug` + `temporal.duckdb` backends across source comments, docs, agent-facing MCP tool descriptions, and user-facing error strings. This sweeps and corrects them, and fixes **two real bugs** the migration left behind that its own tests did not catch.

## Bugs fixed (behavioral)

**1. `describeArtifacts()` pointed at a file that no longer exists.** `paths.ts` still returned `graphFile="graph.lbug"` / `temporalFile="temporal.duckdb"`, but `openStore()` writes `store.sqlite`. That value is not cosmetic: it feeds the `is-indexed` existence probe (`cli/src/lib/is-indexed.ts`) and the user-facing path in the MCP "store unreadable" error (`mcp/src/tools/shared.ts`). So the error told users to check a `.codehub/graph.lbug` that is gone. Now returns `store.sqlite` for both views. `paths.test.ts` had pinned the wrong values, so it was green while asserting broken behavior. Updated.

**2. `bm25CorpusHasSummaries()` queried `information_schema.tables`.** That is a DuckDB/Postgres catalog `node:sqlite` does not expose. The query threw, was swallowed by the surrounding `try/catch`, and the probe was silently always-false in production. Switched to `sqlite_master`; updated the test mock that pinned the old string.

## Also: the `sql` MCP tool contract

The `sql` tool now advertises `nodes`/`edges`/`embeddings` as directly SQL-queryable (they are real tables in `store.sqlite`) instead of claiming the graph is "NOT SQL-queryable, reachable only through Cypher". The `cypher:` arg returns a clear "use `sql:` instead" envelope against the default SQLite backend (`execCypher` is unimplemented per ADR 0019, reserved for community forks). `sql.test.ts` updated to assert the inverted contract.

## Scope

47 files. Most are comment/doc corrections; the 4 above are behavioral. The DuckDB→SQLite comment pass was context-aware (historical-fact mentions like "replaced the lbug + **DuckDB** pair" intentionally kept).

## Verification

- typecheck clean: core-types, storage, mcp, cli
- storage suite: 81 pass / 0 fail
- mcp suite: 234 pass / 0 fail
- cli suite: 346 pass / 0 fail
- banned-strings + commitlint pre-commit gates pass
- pre-push test hook passed on push
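The second bug fix (probing `sqlite_master` instead of `information_schema.tables`) can be sketched against a tiny query interface. `QueryRunner`, `tableExists`, and the table name used in the usage note are hypothetical; the real `bm25CorpusHasSummaries()` is not shown in the PR text.

```typescript
// Hypothetical minimal seam over a synchronous SQLite connection.
interface QueryRunner {
  all(sql: string, ...params: string[]): Array<Record<string, unknown>>;
}

/**
 * Returns true when a table with the given name exists.
 * Errors are deliberately not caught here: a swallowed exception is what
 * turned the original probe into a silent always-false.
 */
function tableExists(db: QueryRunner, table: string): boolean {
  const rows = db.all(
    "SELECT 1 AS present FROM sqlite_master WHERE type = 'table' AND name = ?",
    table,
  );
  return rows.length > 0;
}
```

A caller might write `tableExists(db, "symbol_summaries")`, where the table name is an assumed example. Letting unexpected errors propagate is a design choice of this sketch rather than something the PR states.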
theagenticguy pushed a commit that referenced this pull request on Jun 25, 2026
🤖 Automated release via release-please

---

<details><summary>root: 0.9.2</summary>

## [0.9.2](root-v0.9.1...root-v0.9.2) (2026-06-24)

### Features

* **analysis:** plumbing sieve + candidate_business tag (deterministic, advisory) ([#248](#248)) ([383b719](383b719))
* **ingestion:** business-logic analyze phase — populate likely_plumbing + candidate_business ([#249](#249)) ([a3d44ad](a3d44ad))
* **storage:** single-file SQLite + WASM embedder — zero native dependencies ([#245](#245)) ([c72c84f](c72c84f))

### Bug Fixes

* **storage:** purge stale lbug/DuckDB refs after ADR 0019; fix 2 latent bugs ([#247](#247)) ([90f40a2](90f40a2))

</details>

<details><summary>cli: 0.9.2</summary>

## [0.9.2](cli-v0.9.1...cli-v0.9.2) (2026-06-24)

### Features

* **storage:** single-file SQLite + WASM embedder — zero native dependencies ([#245](#245)) ([c72c84f](c72c84f))

### Bug Fixes

* **storage:** purge stale lbug/DuckDB refs after ADR 0019; fix 2 latent bugs ([#247](#247)) ([90f40a2](90f40a2))

</details>

---

This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
## Zero native dependencies

OpenCodeHub now installs and runs with literally nothing that compiles: `npm i -g @opencodehub/cli` + Node ≥24.15, no Docker, no node-gyp, no postinstall step. All three former native bindings are gone (0 refs in the lockfile, unresolvable at runtime, absent from every `package.json`):

- `@ladybugdb/core` (`graph.lbug`) → `node:sqlite`
- `@duckdb/node-api` (`temporal.duckdb`) → `node:sqlite` (Parquet sidecar dropped)
- `onnxruntime-node` → `onnxruntime-web` (prebuilt WASM)

The entire index lives in one `<repo>/.codehub/store.sqlite` file (WAL). One `SqliteStore` class implements both `IGraphStore` and `ITemporalStore`; `openStore()` returns it as both views, so all 52 call sites are unchanged.

## How it got here (10 commits)

- `node:sqlite` file: generic `nodes` table + JSON payload (37 kinds), polymorphic `edges`, FTS5 search, recursive-CTE traversal (impact / blast-radius), BLOB f32 embeddings.
- `sqlite-parity.test.ts` proves a `KnowledgeGraph` rebuilt from `listNodes`/`listEdges` hashes byte-identically across all sentinels (step-0, `languageStats:{}`, `responseKeys:[]`-vs-absent, Repo nullables, deadness) and every edge kind.
- `analyze`→`query`→`impact` green.
- `embeddings.parquet` (BOM item #7) had no reader anywhere; embeddings live queryable in `store.sqlite`. Code-pack is now an 8-item BOM. Also removed the dead `duckdb_version` packHash pin.
- `onnxruntime-web` 1.27.0. Determinism verified on the real gte-modernbert model: byte-identical across repeat runs and fresh sessions. ORT runs single-threaded WASM under Node, exactly the deterministic path the graphHash contract needs.

## Latent fix included

`KnowledgeGraph.addEdge` now normalizes `step:0` → absent at the graph boundary, so `graphHash` matches every adapter's `listEdges` (the old lbug roundtrip test masked this with a test-local rebuild helper that re-attached `step:0`).

## Docs

ADR 0019 supersedes ADR 0016; `CLAUDE.md` storage section rewritten; `SPIKE-SQLITE-GOAL.md` / `SPIKE-SQLITE-WORKFLOW.md` capture the full arc.

## Verification

- `tsc -b` clean
- `analyze --embeddings` populates the embeddings table (status: vectors populated, 768-dim, L2-norm 1.0)
- `analyze`→`query`→`impact` writing one `store.sqlite` with all three native bindings unresolvable

The `verify-global-install` matrix (Linux/macOS × Node 24 × mise/nvm/Homebrew/Volta) is the load-bearing CI gate for the zero-dep claim.

🤖 Generated with Claude Code