feat(storage): single-file SQLite + WASM embedder — zero native dependencies #245

Merged
theagenticguy merged 12 commits into main from spike/sqlite-single-file
Jun 22, 2026

@theagenticguy

Owner

Zero native dependencies

OpenCodeHub now installs and runs with literally nothing that compiles: npm i -g @opencodehub/cli + Node ≥24.15, no Docker, no node-gyp, no postinstall step. All three former native bindings are gone (0 refs in the lockfile, unresolvable at runtime, absent from every package.json):

| Binding | Tier | Replacement |
| --- | --- | --- |
| @ladybugdb/core | graph (graph.lbug) | Node built-in node:sqlite |
| @duckdb/node-api | temporal + Parquet (temporal.duckdb) | node:sqlite (Parquet sidecar dropped) |
| onnxruntime-node | embedder | onnxruntime-web (prebuilt WASM) |

The entire index lives in one <repo>/.codehub/store.sqlite file (WAL). One SqliteStore class implements both IGraphStore and ITemporalStore; openStore() returns it as both views, so all 52 call sites are unchanged.

No backwards compatibility (per the original brief). Clean slate — users re-run codehub analyze; no migration of graph.lbug / temporal.duckdb artifacts.

How it got here (10 commits)

  1. SQLite adapter — single node:sqlite file: generic nodes table + JSON payload (37 kinds), polymorphic edges, FTS5 search, recursive-CTE traversal (impact / blast-radius), BLOB f32 embeddings.
  2. graphHash gate (the go/no-go): sqlite-parity.test.ts proves a KnowledgeGraph rebuilt from listNodes/listEdges hashes byte-identically across all sentinels (step-0, languageStats:{}, responseKeys:[]-vs-absent, Repo nullables, deadness) and every edge kind.
  3. openStore rewire — one instance as both views; full suite + live analyze→query→impact green.
  4. Dropped the Parquet sidecar + DuckDB: embeddings.parquet (BOM item #7) had no reader anywhere; embeddings remain queryable in store.sqlite. Code-pack is now an 8-item BOM. Also removed the dead duckdb_version packHash pin.
  5. WASM embedder: onnxruntime-web 1.27.0. Determinism verified on the real gte-modernbert model: byte-identical across repeat runs and fresh sessions. ORT runs single-threaded WASM under Node, exactly the deterministic path the graphHash contract needs.
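
The design in item 1 (a generic nodes table with a JSON payload, plus recursive-CTE traversal) can be sketched roughly as follows. This is a minimal illustration, not the project's schema: the table and column names, the `CALLS` edge type, and the `ancestors` helper are assumptions made for the example. It only relies on `DatabaseSync`, `exec`, `prepare`, `run`, and `all` from `node:sqlite`.

```typescript
import { DatabaseSync } from "node:sqlite";

// Hypothetical schema: names are illustrative, not taken from the PR.
const graphDb = new DatabaseSync(":memory:");
graphDb.exec(`
  CREATE TABLE nodes (
    id      TEXT PRIMARY KEY,
    kind    TEXT NOT NULL,
    payload TEXT NOT NULL  -- JSON document holding kind-specific fields
  );
  CREATE TABLE edges (
    src  TEXT NOT NULL,
    dst  TEXT NOT NULL,
    type TEXT NOT NULL
  );
`);

const insertNode = graphDb.prepare("INSERT INTO nodes (id, kind, payload) VALUES (?, ?, ?)");
for (const name of ["main", "double", "add"]) {
  insertNode.run(`fn:${name}`, "Function", JSON.stringify({ name, file: "math.ts" }));
}
const insertEdge = graphDb.prepare("INSERT INTO edges (src, dst, type) VALUES (?, ?, 'CALLS')");
insertEdge.run("fn:main", "fn:double"); // main calls double
insertEdge.run("fn:double", "fn:add"); // double calls add

// "Impact up": walk edges backwards from a target, bounded by maxDepth.
// Filter-only fields are read from the JSON payload with the ->> operator.
function ancestors(target: string, maxDepth: number): string[] {
  const rows = graphDb
    .prepare(`
      WITH RECURSIVE up(id, depth) AS (
        SELECT src, 1 FROM edges WHERE dst = ?
        UNION
        SELECT e.src, up.depth + 1 FROM edges e JOIN up ON e.dst = up.id
        WHERE up.depth < ?
      )
      SELECT n.payload ->> '$.name' AS name, MIN(up.depth) AS depth
      FROM up JOIN nodes n ON n.id = up.id
      GROUP BY n.id
      ORDER BY depth, name
    `)
    .all(target, maxDepth) as Array<{ name: unknown }>;
  return rows.map((r) => String(r.name));
}
```

Calling `ancestors("fn:add", 5)` walks from `add` up through `double` to `main`; lowering `maxDepth` to 1 stops at the direct caller. A blast-radius (downward) traversal would swap `src` and `dst` in the CTE.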

Latent fix included

KnowledgeGraph.addEdge now normalizes step:0 → absent at the graph boundary, so graphHash matches every adapter's listEdges (the old lbug roundtrip test masked this with a test-local rebuild helper that re-attached step:0).
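
A minimal sketch of why the normalization matters, assuming a simplified `Edge` shape and a key-sorted JSON string as a stand-in for the real hash input (both are assumptions for illustration, not the project's types or hashing code):

```typescript
interface Edge {
  from: string;
  to: string;
  type: string;
  step?: number;
}

// Hypothetical stand-in for the graph boundary: drop `step` when it is 0 so an
// in-memory edge has the same shape as one read back from a store.
function normalizeEdge(edge: Edge): Edge {
  const { step, ...rest } = edge;
  return step === undefined || step === 0 ? rest : { ...rest, step };
}

// Canonical form for this sketch: entries sorted by key, then serialized.
function canonicalEdge(edge: Edge): string {
  const entries = Object.entries(edge).sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0));
  return JSON.stringify(entries);
}
```

Without the normalization, `{ step: 0 }` and an absent `step` serialize differently, so two graphs that are semantically equal would hash differently.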

Docs

ADR 0019 supersedes ADR 0016; CLAUDE.md storage section rewritten; SPIKE-SQLITE-GOAL.md / SPIKE-SQLITE-WORKFLOW.md capture the full arc.

Verification

  • monorepo tsc -b clean
  • storage 81/0 · core-types 83/0 · pack 98/0 · mcp 209/0 · cli 343/0 (11 skip) · ingestion 585/0 · search 27/0
  • live analyze --embeddings populates the embeddings table (status: vectors populated, 768-dim, L2-norm 1.0)
  • clean-room: built CLI runs analyze→query→impact writing one store.sqlite with all three native bindings unresolvable

The verify-global-install matrix (Linux/macOS × Node 24 × mise/nvm/Homebrew/Volta) is the load-bearing CI gate for the zero-dep claim.

🤖 Generated with Claude Code

Prove one node:sqlite file in WAL mode can back the graph, embedding, and
temporal tiers behind the existing IGraphStore/ITemporalStore seam, replacing
the @ladybugdb/core + @duckdb/node-api native-binding pair. Goal: OpenCodeHub
installs with zero native deps and one command, no Docker.

- sqlite-adapter.ts: SqliteStore representative slice — bulkLoad,
  getNode/listNodes, upsert/listEmbeddings (f32 BLOB), brute-force cosine
  vectorSearch, recursive-CTE traverse (impact up / blast-radius down), meta.
  Out-of-scope methods throw NotImplementedError with phase pointers.
- sqlite-adapter.test.ts: 2 node:test cases, both green — graph+embedding
  round-trip from one on-disk file across reopen; CTE traversal up/down with
  depth bound and path. Asserts no .lbug/.duckdb sidecars are written.
- SPIKE-SQLITE-GOAL.md + SPIKE-SQLITE-WORKFLOW.md: goal, 6-phase migration,
  risk register, generic-node-table design proposal, recommendation.
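
The f32 BLOB round-trip and brute-force cosine search in the first bullet could look roughly like this. It is a sketch under stated assumptions: the little-endian byte order, the row shape, and every function name here are hypothetical, and the rows are plain in-memory objects rather than SQLite results.

```typescript
// Encode a vector as little-endian f32 bytes, the shape a BLOB column would hold.
function toBlob(vec: readonly number[]): Uint8Array {
  const view = new DataView(new ArrayBuffer(vec.length * 4));
  vec.forEach((v, i) => view.setFloat32(i * 4, v, true));
  return new Uint8Array(view.buffer);
}

// Decode the bytes back into a Float32Array.
function fromBlob(blob: Uint8Array): Float32Array {
  const view = new DataView(blob.buffer, blob.byteOffset, blob.byteLength);
  const out = new Float32Array(blob.byteLength / 4);
  for (let i = 0; i < out.length; i++) out[i] = view.getFloat32(i * 4, true);
  return out;
}

function cosine(a: Float32Array, b: Float32Array): number {
  let dot = 0;
  let na = 0;
  let nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return na === 0 || nb === 0 ? 0 : dot / Math.sqrt(na * nb);
}

// Brute force: score every stored row, sort by score descending, keep the top k.
// Ties are broken by id so the result order is stable.
function vectorSearch(
  rows: Array<{ id: string; blob: Uint8Array }>,
  query: readonly number[],
  k: number,
): string[] {
  const q = Float32Array.from(query);
  return rows
    .map((r) => ({ id: r.id, score: cosine(q, fromBlob(r.blob)) }))
    .sort((x, y) => y.score - x.score || (x.id < y.id ? -1 : 1))
    .slice(0, k)
    .map((r) => r.id);
}
```

Brute force is linear in the number of stored vectors, which is a reasonable trade for a spike: no index to build and nothing beyond typed arrays to depend on.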

No backwards compat (clean slate, per brief). Storage package only; main
architecture untouched. tsc -b clean; tests pass under --experimental-sqlite
on Node 24.17. WAL verified on a real file.
…arning

node:sqlite is enabled by default on Node >=24.15, but loading it emits a
one-shot ExperimentalWarning to stderr. For the stdio MCP server stderr is a
real channel, so the warning is a correctness wart. Add a dependency-free
in-process guard that swallows only the SQLite ExperimentalWarning and passes
every other warning through, installed before the binding loads.

- sqlite-runtime.ts: installSqliteRuntimeGuard(), idempotent emitWarning
  override; auto-installs on import. Exposed as the
  @opencodehub/storage/sqlite-runtime subpath so the CLI can import it without
  pulling the native binding (preserves lazy --help startup).
- sqlite-adapter.ts: import the guard before node:sqlite.
- cli/src/index.ts: install the guard at process start.
- sqlite-runtime.test.ts: swallows SQLite warning only, idempotent.

Chosen over a --node-flag/shebang because no single launch site is under our
control (published bin, agent-spawned MCP, node --test, embedding libs); the
guard travels with the code. Verified: codehub --version exits 0 with empty
stderr; control proves the warning fires without the guard.
P1+P2 of the single-file SQLite migration. The adapter now implements the full
IGraphStore + ITemporalStore surface against one node:sqlite file — every method
except exportEmbeddingsToParquet (deferred to the Phase-4 Parquet decision).

Graph: listEdges/listEdgesByType (mirrors GraphDbStore edge path incl.
stepZeroSentinel + empty-reason drop + (from,to,type,id) sort), listNodesByKind,
listFindings, listDependencies, listRoutes, getRepoNode, listNodesByEntryPoint,
listNodesByName, countNodesByKind, countEdgesByType, listEmbeddingHashes,
listConsumerProducerEdges, setMeta, traverseAncestors/Descendants, FTS5 search.
Temporal: exec (read-only via assertReadOnlySql), cochanges + symbol-summaries
load/lookup surface. Filter-only fields live in the JSON payload, reached via
SQLite JSON1 payload->>'$.field' extracts.

GATE PASSED — graphHash byte-identity. sqlite-parity.test.ts proves a
KnowledgeGraph rebuilt from listNodes({})+listEdges({}) hashes identically to the
original across small + medium fixtures exercising all four sentinels (step-0,
languageStats {}, deadness underscore/hyphen, Repo string|null), every edge kind,
and two independent stores. The generic-node-table + JSON-payload design holds
byte-identity — the go/no-go for the whole migration.

Full storage suite: 178 pass, 0 fail (no lbug/duck regressions). tsc -b clean.
openStore now returns ONE SqliteStore as both the graph and temporal views over
one <repo>/.codehub/store.sqlite (WAL). Because a single instance satisfies both
IGraphStore and ITemporalStore, all 52 call sites (store.graph.X / store.temporal.Y)
compile and run unchanged — both views hit the same connection and file.
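
The single-instance contract can be pictured with a short sketch. The interface members and the `Store` shape below are simplified placeholders, not the real IGraphStore / ITemporalStore surface:

```typescript
// Placeholder interfaces: the real ones carry many more methods.
interface IGraphStore {
  countNodes(): number;
}
interface ITemporalStore {
  exec(sql: string): unknown[];
}

// One class satisfies both interfaces over one (hypothetical) file path.
class SqliteStore implements IGraphStore, ITemporalStore {
  readonly path: string;
  constructor(path: string) {
    this.path = path;
  }
  countNodes(): number {
    return 0; // stub for illustration
  }
  exec(_sql: string): unknown[] {
    return []; // stub for illustration
  }
}

interface Store {
  graph: IGraphStore;
  temporal: ITemporalStore;
}

// Both views are the same object, so callers keep writing store.graph.X and
// store.temporal.Y while sharing one instance and one file.
function openStore(repoRoot: string): Store {
  const single = new SqliteStore(`${repoRoot}/.codehub/store.sqlite`);
  return { graph: single, temporal: single };
}
```

Because the return type is unchanged, existing call sites type-check as before; only the object behind the two properties differs.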

Verified end-to-end on a live repo: `codehub analyze` writes one store.sqlite
(no graph.lbug, no temporal.duckdb); `codehub query "add"` returns the FTS5 hit;
`codehub impact double --direction up` finds the caller via recursive-CTE
traversal. Full storage suite 178 pass / 0 fail; mcp suite 209 / 0; monorepo
tsc -b clean.

Two real bugs the live run caught that unit + parity tests structurally could not:
- bulkLoad ignored opts.mode and always full-replaced. ingest-sarif (run inside
  analyze) calls bulkLoad(graph, {mode:"upsert"}) with an empty SARIF graph, so
  the second call WIPED the 15 nodes the first call wrote. Now honors mode:
  "replace" truncates, "upsert" merges (INSERT OR REPLACE; FTS row deleted then
  reinserted per node). store_meta is stamped from actual post-write table counts
  so an upsert batch can't clobber the count with its partial total.
- tsup 8.5.1 defaults removeNodeProtocol:true, whose strip list predates
  node:sqlite — it rewrote the externalized `node:sqlite` import to a bare
  `sqlite`, unresolvable at runtime. Set removeNodeProtocol:false in the CLI
  tsup config so every builtin keeps its node: prefix (correct for Node>=24.15).
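
The first bug is easy to reproduce in miniature. The sketch below assumes a two-table schema and a `node_count` meta key, both invented for the example, and omits the FTS row handling mentioned above:

```typescript
import { DatabaseSync } from "node:sqlite";

type LoadMode = "replace" | "upsert";
interface NodeRow {
  id: string;
  payload: string;
}

const loadDb = new DatabaseSync(":memory:");
loadDb.exec(`
  CREATE TABLE nodes (id TEXT PRIMARY KEY, payload TEXT NOT NULL);
  CREATE TABLE store_meta (key TEXT PRIMARY KEY, value TEXT NOT NULL);
`);

// Illustrative bulkLoad: "replace" truncates first, "upsert" merges by primary key.
function bulkLoad(rows: NodeRow[], mode: LoadMode = "replace"): void {
  loadDb.exec("BEGIN");
  try {
    if (mode === "replace") loadDb.exec("DELETE FROM nodes");
    const put = loadDb.prepare("INSERT OR REPLACE INTO nodes (id, payload) VALUES (?, ?)");
    for (const row of rows) put.run(row.id, row.payload);
    // Stamp the count from the table itself, not from rows.length, so a
    // partial upsert batch cannot overwrite the true total.
    loadDb.exec(`
      INSERT OR REPLACE INTO store_meta (key, value)
      SELECT 'node_count', CAST(COUNT(*) AS TEXT) FROM nodes
    `);
    loadDb.exec("COMMIT");
  } catch (err) {
    loadDb.exec("ROLLBACK");
    throw err;
  }
}

function nodeCount(): string {
  const row = loadDb
    .prepare("SELECT value FROM store_meta WHERE key = 'node_count'")
    .get() as { value: string } | undefined;
  return row?.value ?? "0";
}
```

With this shape, an empty `upsert` batch after an initial load leaves the existing rows and the stamped count intact, which is the behaviour the live run showed was missing.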

Also: openStore drops the graph.lbug/temporal.duckdb path split (now store.sqlite);
duckOptions/graphDbOptions kept as accepted-but-ignored for call-site compat,
removed with the native adapters in P5. Updated the openStore composition test to
assert the single-instance contract. CLI installs the sqlite runtime guard at
process start.
P4 decision: option (a). SqliteStore.exportEmbeddingsToParquet now lazily
`await import("./duckdb-adapter.js")` INSIDE the method and delegates to a
throwaway in-memory DuckDbStore for the deterministic COPY-to-Parquet. DuckDB is
off the install hot path — only an embeddings-pack invocation loads it;
analyze/query/impact and an embedding-free pack never do. This unblocks removing
the lbug GRAPH binding (the bigger win) while preserving the byte-identical
Parquet sidecar contract.

Verified: pack/embeddings-sidecar.test.ts green; a direct probe emits a valid
PAR1 Parquet file (2 rows, duckdbVersion pinned). Removed the now-unused
NotImplementedError import. WORKFLOW.md progress log updated: P0-P4 done, P5
(rip bindings — irreversible, supersedes ADR 0016) left as a decision for Laith.

Option (b) — pure-JS Parquet to kill the last native dep — remains the documented
fast-follow.
…0019)

Complete the single-file SQLite migration. @ladybugdb/core (the lbug graph
binding) is removed from every package.json and the lockfile; graphdb-adapter,
graphdb-pool, graphdb-schema and their tests are deleted. Storage is now one
node:sqlite store.sqlite (WAL) per repo, zero native deps on the install hot
path. @duckdb/node-api survives ONLY as a lazy pack-time import for the Parquet
embeddings sidecar.

- Extract the two pure helpers SqliteStore needed out of the binding-coupled
  files into dependency-free modules: license.ts (classifyLicenseTier, was in
  duckdb-adapter.ts whose top-level @duckdb import would have eager-loaded the
  binding — fixing a latent P4 leak) and relations.ts (getAllRelationTypes).
- openStore drops the lbug/duck path split + the duckOptions/graphDbOptions bags.
- doctor.ts: replace the lbug binding probe with a node:sqlite builtin check
  (import + WAL round-trip); DuckDB check downgraded fail->warn (Parquet only).
- analyze-carry-forward.test.ts + augment.test.ts: drop the hasNativeBinding()
  lbug skip-guards — these tests now run unconditionally on SQLite (10/10 pass,
  previously skipped in this env).
- step:0 graphHash fix (latent finding, greenlit): KnowledgeGraph.addEdge
  normalizes step:0 -> absent at the graph boundary so graphHash matches every
  adapter's listEdges (stepZeroSentinel). Regression test added to
  sqlite-parity.test.ts. Closes the gap the old graphdb-roundtrip test masked.
- ADR 0019 supersedes ADR 0016; CLAUDE.md storage section rewritten.

Verified: monorepo tsc clean; storage 89/0, core-types 83/0, pack 105/0, mcp
209/0, cli 345/0 (11 skip). Live analyze->query->impact writes one store.sqlite
(no .lbug/.duckdb) with @ladybugdb/core unresolvable (ERR_MODULE_NOT_FOUND).
GOAL.md status → COMPLETE; WORKFLOW.md progress log reflects P5 (bindings ripped,
ADR 0019) + P6 (clean-room one-command install proven). Migration done on the
branch; awaiting review before merge to main.
…ro native storage deps)

Nothing in OCH ever read embeddings.parquet back — it was a write-only export
(BOM item #7) with no consumer. So the sidecar is dropped entirely and DuckDB,
the last native storage binding, goes with it. Embeddings already live in the
`embeddings` table in store.sqlite (BLOB-exact, queryable). Result: ZERO native
storage dependencies — @ladybugdb/core and @duckdb/node-api are both gone from
every package.json and the lockfile.

- pack: delete embeddings-sidecar.ts + test; remove the BOM item, the
  writeEmbeddingsSidecar wiring, and the duckdb_version pin from PackPins /
  manifest / readme. Code-pack is now an 8-item BOM (manifest + skeleton +
  file-tree + deps + ast-chunks + xrefs + findings + licenses + readme).
- storage: delete duckdb-adapter.ts + test; remove exportEmbeddingsToParquet
  from ITemporalStore + SqliteStore. SqliteStore no longer references DuckDB.
- cli: doctor drops the DuckDB probe (no native storage binding left to check);
  --embeddings/--skip-native help text de-DuckDB'd; 8-item BOM strings fixed.
- mcp: pack-codebase help text updated.
- @duckdb/node-api removed from cli + storage package.json; lockfile synced
  (0 refs for both native bindings).
- ADR 0019 + CLAUDE.md updated: zero native storage deps; onnxruntime-node (the
  embedder) is the only remaining native dep, optional + lazy under --embeddings.

Verified: monorepo tsc clean; storage 81/0, pack 98/0, mcp 209/0, cli 343/0.
Live analyze->query->impact on a pristine repo writes one store.sqlite (no
.lbug/.duckdb/.parquet) with BOTH @duckdb/node-api and @ladybugdb/core
unresolvable (ERR_MODULE_NOT_FOUND).
…native deps

The last native dependency is gone. The embedder now runs the ONNX runtime in
pure WebAssembly via onnxruntime-web (prebuilt WASM, no node-gyp, no install
script, no platform matrix). OpenCodeHub now has LITERALLY zero native
dependencies — npm i -g @opencodehub/cli + Node >=24.15 and nothing builds.

- onnx-embedder.ts: import onnxruntime-web; set env.wasm.numThreads=1 (the
  deterministic single-threaded path ORT prescribes for Node) + optional
  env.wasm.wasmPaths (EmbedderConfig.wasmDir escape hatch for the bundled CLI;
  the default self-resolves correctly since ort is `external` in the tsup
  bundle and sits next to its sibling .wasm in node_modules).
  InferenceSession.create takes model BYTES (Uint8Array) in Node, not a path.
  executionProviders ['wasm']; graphOptimizationLevel 'disabled' retained; WASM
  honours it.
- Determinism VERIFIED on the real gte-modernbert int8 model: byte-identical
  across repeat runs AND fresh sessions; live `analyze --embeddings` through the
  built CLI populated the embeddings table (status: vectors "populated"),
  L2-norm = 1.0, 768-dim. The graphHash contract holds.
- deps: onnxruntime-node → onnxruntime-web (1.27.0) in embedder + cli
  optionalDependencies; lockfile synced (0 refs for all three former native
  bindings). pnpm-workspace allowBuilds pruned to dev-only toolchain
  (esbuild/lefthook/sharp) + protobufjs:false (transitive of ort-web, its
  install script is unneeded — keeps the install build-free).
- doctor: the embedder probe now imports onnxruntime-web; dropped the obsolete
  onnxruntime-node platform-matrix note (WASM is universal — no Intel-mac/musl
  gap). Tests updated.
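
The two properties checked in the determinism bullet (unit L2-norm and byte-identical output) can be expressed as small helpers. This is a sketch of the checks only, with hypothetical function names; it does not load a model or call onnxruntime-web.

```typescript
// Euclidean length of a vector.
function l2Norm(vec: Float32Array): number {
  let sumSq = 0;
  for (let i = 0; i < vec.length; i++) sumSq += vec[i] * vec[i];
  return Math.sqrt(sumSq);
}

// Scale a vector to unit length; a zero vector is returned as a copy.
function l2Normalize(vec: Float32Array): Float32Array {
  const norm = l2Norm(vec);
  return norm === 0 ? vec.slice() : vec.map((v) => v / norm);
}

// Byte-level comparison: stricter than numeric equality within a tolerance,
// which is what "byte-identical across runs" requires.
function bytesEqual(a: Float32Array, b: Float32Array): boolean {
  if (a.byteLength !== b.byteLength) return false;
  const x = new Uint8Array(a.buffer, a.byteOffset, a.byteLength);
  const y = new Uint8Array(b.buffer, b.byteOffset, b.byteLength);
  return x.every((byte, i) => byte === y[i]);
}
```

A determinism test would embed the same input in two separate sessions and assert `bytesEqual` on the results, rather than comparing with an epsilon.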

Verified: monorepo tsc clean; embedder 79/0, ingestion 585/0, search 27/0,
cli 343/0 (11 skip). No package.json or lockfile references any native binding.
…, no-non-null)

CI's `biome ci` is stricter than the pre-push `biome check` and flagged style
nits in the single-file-SQLite code: useTemplate (string concat → template
literals) and organize-imports in sqlite-adapter.ts / cli+storage index.ts, plus
noNonNullAssertion in sqlite-adapter.test.ts. Replaced the test's `!` assertions
with explicit `assert.ok(...)` guards (clearer precondition, no logic change).

Clears both the `lint` job and `self-scan` (the latter failed only because
OCH's own biome scanner surfaced these same findings through its verdict gate).
biome ci clean (680 files); storage suite green.
CI `biome ci .` also enforces formatting (the pre-push `biome check` on staged
files did not flag it). Ran `biome format --write` — settled multi-line type
annotations and comment wrapping in doctor.ts, graph.ts, sqlite-runtime.ts and
its test. Also refreshed the stale DoctorOptions doc comment that still named
onnxruntime-node (it's onnxruntime-web now). No behavior change; biome ci clean
over 680 files; core-types 83/0, storage 81/0.
@theagenticguy theagenticguy merged commit c72c84f into main Jun 22, 2026
38 checks passed
@theagenticguy theagenticguy deleted the spike/sqlite-single-file branch June 22, 2026 13:41
@github-actions github-actions Bot mentioned this pull request Jun 22, 2026
theagenticguy added a commit that referenced this pull request Jun 22, 2026
…nt bugs (#247)

## Summary

The single-file SQLite migration (ADR 0019, #245) left stale references
to the removed `graph.lbug` + `temporal.duckdb` backends across source
comments, docs, agent-facing MCP tool descriptions, and user-facing
error strings. This sweeps and corrects them, and fixes **two real
bugs** the migration left behind that its own tests did not catch.

## Bugs fixed (behavioral)

**1. `describeArtifacts()` pointed at a file that no longer exists.**
`paths.ts` still returned `graphFile="graph.lbug"` /
`temporalFile="temporal.duckdb"`, but `openStore()` writes
`store.sqlite`. That value is not cosmetic: it feeds the `is-indexed`
existence probe (`cli/src/lib/is-indexed.ts`) and the user-facing path
in the MCP "store unreadable" error (`mcp/src/tools/shared.ts`). So the
error told users to check a `.codehub/graph.lbug` that is gone. Now
returns `store.sqlite` for both views. `paths.test.ts` had pinned the
wrong values, so it was green while asserting broken behavior — updated.

**2. `bm25CorpusHasSummaries()` queried `information_schema.tables`.**
That is a DuckDB/Postgres catalog `node:sqlite` does not expose. The
query threw, was swallowed by the surrounding `try/catch`, and the probe
was silently always-false in production. Switched to `sqlite_master`;
updated the test mock that pinned the old string.
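
For reference, a table-existence probe against `sqlite_master` might look like the sketch below. `tableExists` and the `symbol_summaries` table name are hypothetical; the actual body of `bm25CorpusHasSummaries()` is not shown in this PR.

```typescript
import { DatabaseSync } from "node:sqlite";

// Returns true when a table with this name exists in the open database.
function tableExists(db: DatabaseSync, name: string): boolean {
  const row = db
    .prepare("SELECT 1 AS present FROM sqlite_master WHERE type = 'table' AND name = ?")
    .get(name);
  return row !== undefined;
}

// Usage against an in-memory database with one illustrative table.
const probeDb = new DatabaseSync(":memory:");
probeDb.exec("CREATE TABLE symbol_summaries (id TEXT PRIMARY KEY, summary TEXT)");
const hasSummaries = tableExists(probeDb, "symbol_summaries");
const hasMissing = tableExists(probeDb, "no_such_table");
```

Letting a failed catalog query throw, or at least logging it, would have surfaced the original bug; a blanket `try/catch` that returns false hides the difference between "table absent" and "query invalid".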

## Also: the `sql` MCP tool contract

The `sql` tool now advertises `nodes`/`edges`/`embeddings` as directly
SQL-queryable (they are real tables in `store.sqlite`) instead of
claiming the graph is "NOT SQL-queryable, reachable only through
Cypher". The `cypher:` arg returns a clear "use `sql:` instead" envelope
against the default SQLite backend (`execCypher` is unimplemented per
ADR 0019, reserved for community forks). `sql.test.ts` updated to assert
the inverted contract.

## Scope

47 files. Most are comment/doc corrections; the 4 above are behavioral.
The DuckDB→SQLite comment pass was context-aware (historical-fact
mentions like "replaced the lbug + **DuckDB** pair" intentionally kept).

## Verification

- typecheck clean: core-types, storage, mcp, cli
- storage suite: 81 pass / 0 fail
- mcp suite: 234 pass / 0 fail
- cli suite: 346 pass / 0 fail
- banned-strings + commitlint pre-commit gates pass
- pre-push test hook passed on push
theagenticguy pushed a commit that referenced this pull request Jun 25, 2026
🤖 Automated release via release-please
---


<details><summary>root: 0.9.2</summary>

## [0.9.2](root-v0.9.1...root-v0.9.2) (2026-06-24)


### Features

* **analysis:** plumbing sieve + candidate_business tag (deterministic,
advisory)
([#248](#248))
([383b719](383b719))
* **ingestion:** business-logic analyze phase — populate likely_plumbing
+ candidate_business
([#249](#249))
([a3d44ad](a3d44ad))
* **storage:** single-file SQLite + WASM embedder — zero native
dependencies
([#245](#245))
([c72c84f](c72c84f))


### Bug Fixes

* **storage:** purge stale lbug/DuckDB refs after ADR 0019; fix 2 latent
bugs ([#247](#247))
([90f40a2](90f40a2))
</details>

<details><summary>cli: 0.9.2</summary>

## [0.9.2](cli-v0.9.1...cli-v0.9.2) (2026-06-24)


### Features

* **storage:** single-file SQLite + WASM embedder — zero native
dependencies
([#245](#245))
([c72c84f](c72c84f))


### Bug Fixes

* **storage:** purge stale lbug/DuckDB refs after ADR 0019; fix 2 latent
bugs ([#247](#247))
([90f40a2](90f40a2))
</details>

---
This PR was generated with [Release
Please](https://github.com/googleapis/release-please). See
[documentation](https://github.com/googleapis/release-please#release-please).

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>