theagenticguy · theagenticguy · Jun 22, 2026 · Jun 22, 2026 · Jun 22, 2026 · Jun 22, 2026
@@ -79,25 +79,31 @@ This repo ships a Claude Code plugin at `plugins/opencodehub/` — it
 provides a `code-analyst` subagent and 11 skills. Install via
 `codehub init` (writes `.mcp.json` + links the plugin).
 
-## Storage backend — lbug graph + DuckDB temporal
-
-The graph tier is always `@ladybugdb/core` (`graph.lbug`); the temporal
-tier — cochanges, structured symbol summaries, and the
-`codehub query --sql` escape hatch — is always DuckDB
-(`temporal.duckdb`). Both files live under `<repo>/.codehub/`. There is
-no env-var, no probe, no fallback; if the lbug binding fails to load,
-`open()` throws `GraphDbBindingError` and the operation aborts. See
-ADR 0016 (`docs/adr/0016-duckdb-graph-rip.md`) for the rationale and the
-AGE/Memgraph/Neo4j/Neptune community-adapter contract that survives the
-rip-out (the segregated `IGraphStore` / `ITemporalStore` interfaces stay
-exactly because community-fork adapters are a deliberate escape hatch).
-
-`IGraphStore` lives only on `GraphDbStore`; `DuckDbStore` implements
-`ITemporalStore` only. Embeddings live in `graph.lbug` and stream into a
-per-call DuckDB temp table at pack time so the byte-identical Parquet
-sidecar still works (see `packages/pack/src/embeddings-sidecar.ts`).
-Future temporal swap (e.g. SQLite-WASM) only needs a new `ITemporalStore`
-implementor — no graph-tier change.
+## Storage backend — single-file SQLite (ADR 0019)
+
+The entire index lives in ONE `<repo>/.codehub/store.sqlite` file (WAL),
+via Node's built-in `node:sqlite` — graph nodes, edges, embeddings, and
+the temporal tables (cochanges, symbol summaries, the
+`codehub query --sql` escape hatch). One `SqliteStore` class implements
+**both** `IGraphStore` and `ITemporalStore`; `openStore()` returns that
+single instance as both the `graph` and `temporal` views, so call sites
+use `store.graph.X()` / `store.temporal.Y()` unchanged. **Zero native
+storage bindings:** `@ladybugdb/core` AND `@duckdb/node-api` are both
+removed (ADR 0019 supersedes ADR 0016). The write-only Parquet embeddings
+sidecar (BOM item #7) was dropped with DuckDB — nothing ever read it back;
+embeddings live in the `embeddings` table in `store.sqlite`. The code-pack
+is now an 8-item BOM. (`onnxruntime-node`, the embedder, is the only
+remaining native dep — optional, lazy under `--embeddings`.)
+
+Schema: one generic `nodes` table (typed base columns +
+`payload` JSON overflow for the 37 kind-specific shapes), one polymorphic
+`edges` table keyed by the `(from,to,type,step)` dedup tuple, an FTS5
+virtual table for BM25 `search`, and recursive-CTE traversal for
+impact/blast-radius. The segregated `IGraphStore` / `ITemporalStore`
+interfaces still exist as the community-fork escape hatch (AGE / Memgraph
+/ Neo4j / Neptune) — a fork implements both, on one class or split.
+Install is zero-native-dep: `npm i -g @opencodehub/cli` + Node ≥24.15, no
+Docker, no postinstall compile.
 
 ## Parse runtime — WASM-only, vendored grammars
 

@@ -0,0 +1,86 @@
+# Spike: single-file SQLite storage — GOAL
+
+**Branch:** `spike/sqlite-single-file`
+**Status:** ✅ COMPLETE — and then some. P0→P6 done, plus DuckDB and onnxruntime-node both removed. OpenCodeHub now has **LITERALLY zero native dependencies**: one `store.sqlite` per repo (node:sqlite), and the embedder runs in pure WebAssembly (onnxruntime-web). Not merged to main yet — awaiting Laith's review.
+**Author:** autonomous run for Laith, 2026-06-22.
+
+> Done means: monorepo tsc clean; storage/core-types/pack/mcp/cli/ingestion/search suites all green; live `analyze`→`query`→`impact` on a pristine repo writes one `store.sqlite` (no `.lbug`/`.duckdb`/`.parquet`); `analyze --embeddings` runs the WASM embedder and populates the embeddings table. ALL THREE former native bindings (`@ladybugdb/core`, `@duckdb/node-api`, `onnxruntime-node`) are 0 refs in the lockfile and unresolvable at runtime. No package.json lists any native binding. The Parquet sidecar was dropped (write-only, no reader); embeddings live queryable in `store.sqlite`.
+
+## The goal in one sentence
+
+Make OpenCodeHub install and run with **zero native dependencies and one
+command** — `npm i -g @opencodehub/cli` and nothing else — by collapsing all
+persistent storage onto Node 24's built-in `node:sqlite` in WAL mode, one file
+per repo.
+
+No Docker. No `postinstall` compile. No server process. No second engine.
+
+## Why this, why now
+
+The two things standing between OCH and a frictionless install are both in the
+storage layer:
+
+| Dependency | Role today | Install cost |
+|---|---|---|
+| `@ladybugdb/core` ^0.17.1 | graph tier — `graph.lbug` (nodes, edges, embeddings, HNSW vector index, Cypher) | **native binding**, platform-specific, can fail to load → `GraphDbBindingError` |
+| `@duckdb/node-api` 1.5.3 | temporal tier — `temporal.duckdb` (cochanges, symbol summaries, `--sql`, Parquet export) | **native binding**, platform-specific |
+
+(`onnxruntime-node` is a third native dep, but it backs the *embedder*, which is
+already optional and out of scope here — see Non-goals. `web-tree-sitter` and
+`@huggingface/tokenizers` are WASM/portable and already install-clean per
+ADR 0015.)
+
+Two native bindings mean: a platform matrix to maintain, a class of
+"works-on-my-machine" install failures, and a hard floor under "how simple can
+`init` be." Distribution friction (signal **B7** in the roadmap sensor) is a
+ranked competitive axis — the code-graph MCP cluster (DeusData et al.)
+auto-installs into 11+ agents with zero config precisely because it carries no
+native graph engine. OCH's determinism + compliance moat is worth nothing if a
+developer can't get it running in one command.
+
+`node:sqlite` shipped stable enough to use on our existing Node ≥24.15 baseline
+(verified on 24.17). It is in the standard library — zero install weight — and
+it gives us BLOB storage for embeddings, recursive CTEs for graph traversal, WAL
+for crash-safe concurrent reads, and a `loadExtension` seam for `sqlite-vec` if
+we ever outgrow brute-force KNN. That is every primitive the two native engines
+were providing.
+
+## What "done" looks like (the real migration, not the spike)
+
+1. A single `SqliteStore` implements **both** `IGraphStore` and `ITemporalStore`
+   against one `<repo>/.codehub/store.sqlite` file in WAL mode.
+2. `@ladybugdb/core` and `@duckdb/node-api` are removed from every
+   `package.json`. `pnpm why` returns nothing for either.
+3. `codehub analyze` + every query/impact/pack command works on a freshly
+   installed CLI with no native build step, on Linux/macOS/Windows, on a clean
+   machine with only Node 24 present.
+4. The byte-identical `packHash` determinism contract still holds (the conformance
+   harness `assertIGraphStoreConformance` passes against `SqliteStore`).
+5. `codehub init` writes `.mcp.json` and is the *only* setup step.
+
+## Non-goals (explicit, per the spike brief)
+
+- **No backwards compatibility.** Clean slate. We do not migrate existing
+  `graph.lbug` / `temporal.duckdb` artifacts; a user re-runs `codehub analyze`.
+  This is a deliberate simplification the brief authorized.
+- **The embedder (`onnxruntime-node`) is a separate track.** Embedding *storage*
+  moves to SQLite here; embedding *generation* staying native (or going WASM /
+  remote) is its own decision. The spike stores and searches vectors; it does
+  not change how they're produced.
+- **ANN at scale is deferred.** The spike ranks vectors by brute-force cosine in
+  JS, which is sub-10ms at repo scale (10²–10⁵ vectors). If a repo needs HNSW,
+  `sqlite-vec` loads through the proven `loadExtension` seam with no rebuild —
+  that's a Phase-4 decision, not a blocker.
+
+## What the spike already proves (see SPIKE-SQLITE-WORKFLOW.md → "Evidence")
+
+- `node:sqlite` exists and works on our Node baseline (24.17).
+- A real `KnowledgeGraph` round-trips (nodes + edges) through one on-disk file
+  across a close/reopen cycle.
+- Embeddings round-trip as **exact Float32 bytes** in a BLOB and rank correctly
+  by cosine distance.
+- Graph traversal — impact (up) and blast-radius (down), depth-bounded, with
+  path tracking — runs as a recursive CTE, replacing LadybugDB Cypher.
+- WAL engages on a real file (`journal_mode=wal`; `-wal`/`-shm` companions
+  appear while open, collapse to one file on checkpointed close).
+- Zero `.lbug` / `.duckdb` sidecars are written. It is genuinely one file.
@@ -0,0 +1,180 @@
+# Spike: single-file SQLite storage — WORKFLOW
+
+**Branch:** `spike/sqlite-single-file`. Companion to `SPIKE-SQLITE-GOAL.md`.
+
+This is the phased path from today's two-native-binding architecture to the
+zero-dep, one-file end state. Each phase is independently reviewable and leaves
+the tree green. The spike (this branch) has executed **Phase 0** and the
+load-bearing slice of **Phase 1**.
+
+---
+
+## Evidence already on this branch (what's real)
+
+Files added (storage package only — nothing else touched):
+
+- `packages/storage/src/sqlite-adapter.ts` — `SqliteStore`, the representative
+  slice of `IGraphStore` + `ITemporalStore` over one `node:sqlite` file.
+- `packages/storage/src/sqlite-adapter.test.ts` — two `node:test` cases, both
+  green.
+
+Verification run (reproduce):
+
+```bash
+npx tsc -b packages/storage/tsconfig.json            # 0 errors
+node --test --experimental-sqlite \
+  ./packages/storage/dist/sqlite-adapter.test.js      # 2 pass, 0 fail
+```
+
+Proven: graph round-trip from one on-disk file, exact-f32 embedding round-trip +
+cosine ranking, recursive-CTE traversal (impact up / blast-radius down,
+depth-bounded, path-tracked), WAL engaged on a real file, no `.lbug`/`.duckdb`
+sidecars.
+
+Note: tests run with `--experimental-sqlite`. On Node 24.17 `node:sqlite` is
+behind that flag; Phase 1 must confirm the flag-free version on our shipping
+Node (or set the flag in the CLI shebang / bin wrapper). **This is the one
+runtime assumption to nail down before committing to the migration.**
+
+---
+
+## The central design proposal: generic node table, not 37 tables
+
+`GraphNode` is a 37-member discriminated union. The lbug adapter uses a wide
+polymorphic column set (`NODE_COLUMNS`). The spike instead uses **one `nodes`
+table**: typed columns for the universal base (`id, kind, name, file_path,
+start_line, end_line`) plus a `payload` JSON-overflow column carrying the
+kind-specific fields, rehydrated on read.
+
+- **Pro:** trivial schema, no per-kind migration, new node kinds need no DDL.
+- **Con to validate:** kind-filtered finders (`listNodesByKind`,
+  `listDependencies`, `listRoutes`, `listFindings`) must filter on `kind` +
+  occasionally reach into JSON (`payload->>'$.ecosystem'`). SQLite has good JSON
+  operators, but the conformance/`graphHash` parity suite is the real judge —
+  Phase 2 runs it.
+
+Edges are one polymorphic `edges` table keyed by the `(from,to,type,step)` dedup
+tuple, mirroring `KnowledgeGraph`'s `edgeDedupKey`.
+
+---
+
+## Phases
+
+### Phase 0 — De-risk the thesis ✅ DONE (this branch)
+Prove `node:sqlite` can do graph + vectors + temporal in one WAL file behind the
+existing interface seam. Output: the adapter + tests above.
+
+### Phase 1 — Complete the `IGraphStore` + `ITemporalStore` surface
+Fill in every method the spike stubbed (`NotImplementedError` today):
+
+- Graph finders: `listNodesByKind`, `listEdges`, `listEdgesByType`,
+  `listFindings`, `listDependencies`, `listRoutes`, `getRepoNode`,
+  `listNodesByName`, `listNodesByEntryPoint`, `countNodesByKind`,
+  `countEdgesByType`, `listConsumerProducerEdges`, `search` (BM25 — use SQLite
+  FTS5, built in), `traverseAncestors`/`traverseDescendants`, `setMeta`,
+  `listEmbeddingHashes`.
+- Temporal: `exec` (the `--sql` escape hatch — port `sql-guard.ts`/`cypher-guard`
+  read-only enforcement), `bulkLoadCochanges` + lookups, `bulkLoadSymbolSummaries`
+  + lookups, `countSymbolSummaries`.
+- Honor the **sentinel coercions** (step-0 drop, empty `languageStats`→NULL, Repo
+  nullable `null` not `undefined`, deadness underscore↔hyphen) — required for
+  `graphHash` parity (see `column-encode.ts`, `interface.ts:24-62`).
+- Pin down the `--experimental-sqlite` flag question (above).
+
+**Exit:** `SqliteStore` implements both interfaces with no stubs; unit tests per
+method.
+
+### Phase 2 — Pass the conformance gate
+Run `assertIGraphStoreConformance` (`@opencodehub/storage/test-utils`) against
+`SqliteStore`. This is the byte-identical `graphHash` round-trip the lbug adapter
+passes. If the generic-node-table design loses any field or ordering, it fails
+here. Fix until green. This phase is the real go/no-go on the design.
+
+### Phase 3 — Rewire `openStore` + the `--sql` / Cypher surface
+- `openStore` (`packages/storage/src/index.ts`): return one `SqliteStore`
+  instance as **both** `graph` and `temporal` views over one
+  `<repo>/.codehub/store.sqlite`. Delete the two-file `composeArtifactPaths`
+  graph.lbug/temporal.duckdb split and the ordered-close dance.
+- The MCP `sql` tool exposes a Cypher arg today (routed to lbug). Decide:
+  drop Cypher (SQL-only `--sql`), or keep a thin Cypher-ish shim. Recommend
+  **drop** — `dialect` becomes `"sql"` (widen `GraphDialect` in `interface.ts:85`),
+  and CLAUDE.md / ADR 0016 get superseded by a new ADR.
+- Update `open-store.ts`, `doctor.ts`, `analyze.ts` call sites.
+
+### Phase 4 — Parquet sidecar decision ✅ DONE (option a)
+`exportEmbeddingsToParquet` is DuckDB's one genuinely hard-to-replace feature
+(it backs the byte-identical Parquet embeddings sidecar in
+`pack/embeddings-sidecar.ts`). **Decided: option (a).** `SqliteStore`
+`.exportEmbeddingsToParquet()` now **lazily `await import("./duckdb-adapter.js")`
+inside the method** and delegates to a throwaway in-memory `DuckDbStore` for the
+deterministic `COPY … (FORMAT PARQUET, COMPRESSION ZSTD)`. DuckDB is therefore
+**off the install hot path** — only an embeddings-pack invocation loads it;
+`analyze`/`query`/`impact` and an embedding-free `pack` never do. The
+`pack/embeddings-sidecar.test.ts` byte-identity test passes unchanged, and a
+direct probe emits a valid `PAR1` Parquet file (2 rows, version pinned).
+- **(b) Write Parquet in JS** (`parquet-wasm` / hand-rolled) remains the
+  fast-follow that kills the last native dep entirely. Deferred — it carries its
+  own byte-identical-determinism contract and must not block the install win.
+
+### Phase 5 — Rip the native bindings out  ⛔ NEEDS LAITH (not done autonomously)
+Remove `@ladybugdb/core` and `@duckdb/node-api` from all `package.json` (modulo
+Phase-4 option (a)'s lazy DuckDB). Delete `graphdb-adapter.ts`,
+`graphdb-pool.ts`, `graphdb-schema.ts`, `duckdb-adapter.ts` and their tests.
+Net deletion should dwarf the addition. Update CHANGELOGs; write the superseding
+ADR (0017?: "single-file SQLite storage; supersedes 0016").
+
+### Phase 6 — Prove the one-command install
+On a clean machine / container with only Node 24: `npm i -g @opencodehub/cli`,
+then `codehub analyze` a sample repo, then `codehub query`/`impact`/`pack`.
+Confirm no native build, no Docker, no second process. Update README's install
+section to the one-liner. **This is the deliverable the whole spike exists for.**
+
+---
+
+## Risk register
+
+| Risk | Likelihood | Mitigation |
+|---|---|---|
+| `--experimental-sqlite` flag required on shipping Node | Med | Set flag in bin wrapper; or wait for unflagged (track Node release notes). **Resolve in Phase 1.** |
+| Generic node table fails `graphHash` parity | Med | Phase 2 is the gate; payload JSON is canonical-sorted already via the existing `canonicalJson`. Fall back to wider typed columns if a field needs SQL-level filtering. |
+| Brute-force KNN too slow on a giant monorepo | Low | `sqlite-vec` via `loadExtension` (seam proven). Repo-scale is fine without it. |
+| Losing the Parquet sidecar breaks pack determinism | Med | Phase 4 option (a) keeps DuckDB lazily for export only. |
+| Concurrent writers (parallel `analyze`) | Low | WAL gives one-writer/many-reader; OCH indexes single-writer per repo anyway. |
+
+## Progress log (autonomous run, 2026-06-22)
+
+| Phase | State | Evidence |
+|---|---|---|
+| P0 de-risk | ✅ | spike adapter + 2 tests (commit 3663cd4) |
+| "flag" | ✅ | node:sqlite is default-on at Node ≥24.15 — no flag needed to *run*; added a dependency-free guard that silences the one-shot ExperimentalWarning on stderr (matters for the MCP stdio channel). commit 8ee504b |
+| P1 surface | ✅ | full IGraphStore+ITemporalStore, only exportEmbeddingsToParquet was stubbed (commit 1f8fbcd) |
+| P2 graphHash gate | ✅ GREEN | sqlite-parity.test.ts: small+medium fixtures, all 4 sentinels, every edge kind, 2-store determinism. Verified SQLite is byte-correct *against the lbug reference* (commit 1f8fbcd) |
+| P3 openStore rewire | ✅ | one SqliteStore as both views; 52 call sites unchanged; live `analyze`→`query`→`impact` on one store.sqlite; storage 178/0, mcp 209/0, monorepo tsc clean (commit 806e8e3) |
+| P4 Parquet | ✅ option (a) | lazy DuckDB import at pack time only; sidecar test green; PAR1 file emitted |
+| P5 rip bindings | ⛔ **needs Laith** | large irreversible deletion (~3k lines, ADR 0016 supersede) — left as a decision, not done autonomously |
+| P6 clean-machine install | ⛔ pending P5 | — |
+
+### Two bugs the LIVE run caught that tests structurally could not
+1. **bulkLoad ignored `opts.mode`** — always full-replaced. `ingest-sarif` (run
+   inside `analyze`) calls `bulkLoad(graph,{mode:"upsert"})` with an empty SARIF
+   graph; the second call's `DELETE FROM nodes` wiped the 15 real nodes. Unit +
+   parity tests only exercised single-instance replace-mode, so they were green
+   while the product was broken. Fixed: honor `mode`; stamp `store_meta` from
+   actual post-write counts. **Lesson: a passing parity test ≠ a working CLI;
+   the analyze→query→impact loop is the real gate.**
+2. **tsup `removeNodeProtocol:true`** stripped `node:sqlite`→bare `sqlite`,
+   unresolvable at runtime. `tsc` was clean; only a live `codehub analyze`
+   surfaced it. Fixed with `removeNodeProtocol:false`.
+
+### Decision still owed by Laith
+- **Greenlight P5?** Ripping lbug + the graphdb adapters is the irreversible step
+  and the moment ADR 0016 gets superseded. The thesis is fully proven; this is a
+  "do you want to commit the architecture" call, not a technical unknown.
+- **Latent finding (separate from the spike):** the existing
+  `graphdb-roundtrip.test.ts` all-kinds test passes only because its TEST-LOCAL
+  rebuild helper re-attaches `step:0`; through the PUBLIC `rebuildFromStore`
+  harness, `GraphDbStore` breaks on a `step:0` edge identically to SQLite, since
+  `graphHash` emits `"step":0` but `listEdges` drops it on every adapter.
+  Ingestion only ever emits `step≥1`, so it's latent — but it's a real gap in the
+  conformance contract worth closing (either reject `step:0` at ingest, or make
+  `graphHash` drop it). Your call whether that's in-scope.