Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
44 changes: 25 additions & 19 deletions CLAUDE.md
Original file line number Diff line number Diff line change
Expand Up @@ -79,25 +79,31 @@ This repo ships a Claude Code plugin at `plugins/opencodehub/` — it
provides a `code-analyst` subagent and 11 skills. Install via
`codehub init` (writes `.mcp.json` + links the plugin).

## Storage backend — lbug graph + DuckDB temporal

The graph tier is always `@ladybugdb/core` (`graph.lbug`); the temporal
tier — cochanges, structured symbol summaries, and the
`codehub query --sql` escape hatch — is always DuckDB
(`temporal.duckdb`). Both files live under `<repo>/.codehub/`. There is
no env-var, no probe, no fallback; if the lbug binding fails to load,
`open()` throws `GraphDbBindingError` and the operation aborts. See
ADR 0016 (`docs/adr/0016-duckdb-graph-rip.md`) for the rationale and the
AGE/Memgraph/Neo4j/Neptune community-adapter contract that survives the
rip-out (the segregated `IGraphStore` / `ITemporalStore` interfaces stay
exactly because community-fork adapters are a deliberate escape hatch).

`IGraphStore` lives only on `GraphDbStore`; `DuckDbStore` implements
`ITemporalStore` only. Embeddings live in `graph.lbug` and stream into a
per-call DuckDB temp table at pack time so the byte-identical Parquet
sidecar still works (see `packages/pack/src/embeddings-sidecar.ts`).
Future temporal swap (e.g. SQLite-WASM) only needs a new `ITemporalStore`
implementor — no graph-tier change.
## Storage backend — single-file SQLite (ADR 0019)

The entire index lives in ONE `<repo>/.codehub/store.sqlite` file (WAL),
via Node's built-in `node:sqlite` — graph nodes, edges, embeddings, and
the temporal tables (cochanges, symbol summaries, the
`codehub query --sql` escape hatch). One `SqliteStore` class implements
**both** `IGraphStore` and `ITemporalStore`; `openStore()` returns that
single instance as both the `graph` and `temporal` views, so call sites
use `store.graph.X()` / `store.temporal.Y()` unchanged. **Zero native
storage bindings:** `@ladybugdb/core` AND `@duckdb/node-api` are both
removed (ADR 0019 supersedes ADR 0016). The write-only Parquet embeddings
sidecar (BOM item #7) was dropped with DuckDB — nothing ever read it back;
embeddings live in the `embeddings` table in `store.sqlite`. The code-pack
is now an 8-item BOM. (`onnxruntime-node`, the embedder, is the only
remaining native dep — optional, lazy under `--embeddings`.)

Schema: one generic `nodes` table (typed base columns +
`payload` JSON overflow for the 37 kind-specific shapes), one polymorphic
`edges` table keyed by the `(from,to,type,step)` dedup tuple, an FTS5
virtual table for BM25 `search`, and recursive-CTE traversal for
impact/blast-radius. The segregated `IGraphStore` / `ITemporalStore`
interfaces still exist as the community-fork escape hatch (AGE / Memgraph
/ Neo4j / Neptune) — a fork implements both, on one class or split.
Install is zero-native-dep: `npm i -g @opencodehub/cli` + Node ≥24.15, no
Docker, no postinstall compile.

## Parse runtime — WASM-only, vendored grammars

Expand Down
86 changes: 86 additions & 0 deletions SPIKE-SQLITE-GOAL.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,86 @@
# Spike: single-file SQLite storage — GOAL

**Branch:** `spike/sqlite-single-file`
**Status:** ✅ COMPLETE — and then some. P0→P6 done, plus DuckDB and onnxruntime-node both removed. OpenCodeHub now has **LITERALLY zero native dependencies**: one `store.sqlite` per repo (node:sqlite), and the embedder runs in pure WebAssembly (onnxruntime-web). Not merged to main yet — awaiting Laith's review.
**Author:** autonomous run for Laith, 2026-06-22.

> Done means: monorepo tsc clean; storage/core-types/pack/mcp/cli/ingestion/search suites all green; live `analyze`→`query`→`impact` on a pristine repo writes one `store.sqlite` (no `.lbug`/`.duckdb`/`.parquet`); `analyze --embeddings` runs the WASM embedder and populates the embeddings table. ALL THREE former native bindings (`@ladybugdb/core`, `@duckdb/node-api`, `onnxruntime-node`) are 0 refs in the lockfile and unresolvable at runtime. No package.json lists any native binding. The Parquet sidecar was dropped (write-only, no reader); embeddings live queryable in `store.sqlite`.

## The goal in one sentence

Make OpenCodeHub install and run with **zero native dependencies and one
command** — `npm i -g @opencodehub/cli` and nothing else — by collapsing all
persistent storage onto Node 24's built-in `node:sqlite` in WAL mode, one file
per repo.

No Docker. No `postinstall` compile. No server process. No second engine.

## Why this, why now

The two things standing between OCH and a frictionless install are both in the
storage layer:

| Dependency | Role today | Install cost |
|---|---|---|
| `@ladybugdb/core` ^0.17.1 | graph tier — `graph.lbug` (nodes, edges, embeddings, HNSW vector index, Cypher) | **native binding**, platform-specific, can fail to load → `GraphDbBindingError` |
| `@duckdb/node-api` 1.5.3 | temporal tier — `temporal.duckdb` (cochanges, symbol summaries, `--sql`, Parquet export) | **native binding**, platform-specific |

(`onnxruntime-node` is a third native dep, but it backs the *embedder*, which is
already optional and out of scope here — see Non-goals. `web-tree-sitter` and
`@huggingface/tokenizers` are WASM/portable and already install-clean per
ADR 0015.)

Two native bindings mean: a platform matrix to maintain, a class of
"works-on-my-machine" install failures, and a hard floor under "how simple can
`init` be." Distribution friction (signal **B7** in the roadmap sensor) is a
ranked competitive axis — the code-graph MCP cluster (DeusData et al.)
auto-installs into 11+ agents with zero config precisely because it carries no
native graph engine. OCH's determinism + compliance moat is worth nothing if a
developer can't get it running in one command.

`node:sqlite` shipped stable enough to use on our existing Node ≥24.15 baseline
(verified on 24.17). It is in the standard library — zero install weight — and
it gives us BLOB storage for embeddings, recursive CTEs for graph traversal, WAL
for crash-safe concurrent reads, and a `loadExtension` seam for `sqlite-vec` if
we ever outgrow brute-force KNN. That is every primitive the two native engines
were providing.

## What "done" looks like (the real migration, not the spike)

1. A single `SqliteStore` implements **both** `IGraphStore` and `ITemporalStore`
against one `<repo>/.codehub/store.sqlite` file in WAL mode.
2. `@ladybugdb/core` and `@duckdb/node-api` are removed from every
`package.json`. `pnpm why` returns nothing for either.
3. `codehub analyze` + every query/impact/pack command works on a freshly
installed CLI with no native build step, on Linux/macOS/Windows, on a clean
machine with only Node 24 present.
4. The byte-identical `packHash` determinism contract still holds (the conformance
harness `assertIGraphStoreConformance` passes against `SqliteStore`).
5. `codehub init` writes `.mcp.json` and is the *only* setup step.

## Non-goals (explicit, per the spike brief)

- **No backwards compatibility.** Clean slate. We do not migrate existing
`graph.lbug` / `temporal.duckdb` artifacts; a user re-runs `codehub analyze`.
This is a deliberate simplification the brief authorized.
- **The embedder (`onnxruntime-node`) is a separate track.** Embedding *storage*
moves to SQLite here; embedding *generation* staying native (or going WASM /
remote) is its own decision. The spike stores and searches vectors; it does
not change how they're produced.
- **ANN at scale is deferred.** The spike ranks vectors by brute-force cosine in
JS, which is sub-10ms at repo scale (10²–10⁵ vectors). If a repo needs HNSW,
`sqlite-vec` loads through the proven `loadExtension` seam with no rebuild —
that's a Phase-4 decision, not a blocker.

## What the spike already proves (see SPIKE-SQLITE-WORKFLOW.md → "Evidence")

- `node:sqlite` exists and works on our Node baseline (24.17).
- A real `KnowledgeGraph` round-trips (nodes + edges) through one on-disk file
across a close/reopen cycle.
- Embeddings round-trip as **exact Float32 bytes** in a BLOB and rank correctly
by cosine distance.
- Graph traversal — impact (up) and blast-radius (down), depth-bounded, with
path tracking — runs as a recursive CTE, replacing LadybugDB Cypher.
- WAL engages on a real file (`journal_mode=wal`; `-wal`/`-shm` companions
appear while open, collapse to one file on checkpointed close).
- Zero `.lbug` / `.duckdb` sidecars are written. It is genuinely one file.
180 changes: 180 additions & 0 deletions SPIKE-SQLITE-WORKFLOW.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,180 @@
# Spike: single-file SQLite storage — WORKFLOW

**Branch:** `spike/sqlite-single-file`. Companion to `SPIKE-SQLITE-GOAL.md`.

This is the phased path from today's two-native-binding architecture to the
zero-dep, one-file end state. Each phase is independently reviewable and leaves
the tree green. The spike (this branch) has executed **Phase 0** and the
load-bearing slice of **Phase 1**.

---

## Evidence already on this branch (what's real)

Files added (storage package only — nothing else touched):

- `packages/storage/src/sqlite-adapter.ts` — `SqliteStore`, the representative
slice of `IGraphStore` + `ITemporalStore` over one `node:sqlite` file.
- `packages/storage/src/sqlite-adapter.test.ts` — two `node:test` cases, both
green.

Verification run (reproduce):

```bash
npx tsc -b packages/storage/tsconfig.json # 0 errors
node --test --experimental-sqlite \
./packages/storage/dist/sqlite-adapter.test.js # 2 pass, 0 fail
```

Proven: graph round-trip from one on-disk file, exact-f32 embedding round-trip +
cosine ranking, recursive-CTE traversal (impact up / blast-radius down,
depth-bounded, path-tracked), WAL engaged on a real file, no `.lbug`/`.duckdb`
sidecars.

Note: tests run with `--experimental-sqlite`. On Node 24.17 `node:sqlite` is
behind that flag; Phase 1 must confirm the flag-free version on our shipping
Node (or set the flag in the CLI shebang / bin wrapper). **This is the one
runtime assumption to nail down before committing to the migration.**

---

## The central design proposal: generic node table, not 37 tables

`GraphNode` is a 37-member discriminated union. The lbug adapter uses a wide
polymorphic column set (`NODE_COLUMNS`). The spike instead uses **one `nodes`
table**: typed columns for the universal base (`id, kind, name, file_path,
start_line, end_line`) plus a `payload` JSON-overflow column carrying the
kind-specific fields, rehydrated on read.

- **Pro:** trivial schema, no per-kind migration, new node kinds need no DDL.
- **Con to validate:** kind-filtered finders (`listNodesByKind`,
`listDependencies`, `listRoutes`, `listFindings`) must filter on `kind` +
occasionally reach into JSON (`payload->>'$.ecosystem'`). SQLite has good JSON
operators, but the conformance/`graphHash` parity suite is the real judge —
Phase 2 runs it.

Edges are one polymorphic `edges` table keyed by the `(from,to,type,step)` dedup
tuple, mirroring `KnowledgeGraph`'s `edgeDedupKey`.

---

## Phases

### Phase 0 — De-risk the thesis ✅ DONE (this branch)
Prove `node:sqlite` can do graph + vectors + temporal in one WAL file behind the
existing interface seam. Output: the adapter + tests above.

### Phase 1 — Complete the `IGraphStore` + `ITemporalStore` surface
Fill in every method the spike stubbed (`NotImplementedError` today):

- Graph finders: `listNodesByKind`, `listEdges`, `listEdgesByType`,
`listFindings`, `listDependencies`, `listRoutes`, `getRepoNode`,
`listNodesByName`, `listNodesByEntryPoint`, `countNodesByKind`,
`countEdgesByType`, `listConsumerProducerEdges`, `search` (BM25 — use SQLite
FTS5, built in), `traverseAncestors`/`traverseDescendants`, `setMeta`,
`listEmbeddingHashes`.
- Temporal: `exec` (the `--sql` escape hatch — port `sql-guard.ts`/`cypher-guard`
read-only enforcement), `bulkLoadCochanges` + lookups, `bulkLoadSymbolSummaries`
+ lookups, `countSymbolSummaries`.
- Honor the **sentinel coercions** (step-0 drop, empty `languageStats`→NULL, Repo
nullable `null` not `undefined`, deadness underscore↔hyphen) — required for
`graphHash` parity (see `column-encode.ts`, `interface.ts:24-62`).
- Pin down the `--experimental-sqlite` flag question (above).

**Exit:** `SqliteStore` implements both interfaces with no stubs; unit tests per
method.

### Phase 2 — Pass the conformance gate
Run `assertIGraphStoreConformance` (`@opencodehub/storage/test-utils`) against
`SqliteStore`. This is the byte-identical `graphHash` round-trip the lbug adapter
passes. If the generic-node-table design loses any field or ordering, it fails
here. Fix until green. This phase is the real go/no-go on the design.

### Phase 3 — Rewire `openStore` + the `--sql` / Cypher surface
- `openStore` (`packages/storage/src/index.ts`): return one `SqliteStore`
instance as **both** `graph` and `temporal` views over one
`<repo>/.codehub/store.sqlite`. Delete the two-file `composeArtifactPaths`
graph.lbug/temporal.duckdb split and the ordered-close dance.
- The MCP `sql` tool exposes a Cypher arg today (routed to lbug). Decide:
drop Cypher (SQL-only `--sql`), or keep a thin Cypher-ish shim. Recommend
**drop** — `dialect` becomes `"sql"` (widen `GraphDialect` in `interface.ts:85`),
and CLAUDE.md / ADR 0016 get superseded by a new ADR.
- Update `open-store.ts`, `doctor.ts`, `analyze.ts` call sites.

### Phase 4 — Parquet sidecar decision ✅ DONE (option a)
`exportEmbeddingsToParquet` is DuckDB's one genuinely hard-to-replace feature
(it backs the byte-identical Parquet embeddings sidecar in
`pack/embeddings-sidecar.ts`). **Decided: option (a).** `SqliteStore`
`.exportEmbeddingsToParquet()` now **lazily `await import("./duckdb-adapter.js")`
inside the method** and delegates to a throwaway in-memory `DuckDbStore` for the
deterministic `COPY … (FORMAT PARQUET, COMPRESSION ZSTD)`. DuckDB is therefore
**off the install hot path** — only an embeddings-pack invocation loads it;
`analyze`/`query`/`impact` and an embedding-free `pack` never do. The
`pack/embeddings-sidecar.test.ts` byte-identity test passes unchanged, and a
direct probe emits a valid `PAR1` Parquet file (2 rows, version pinned).
- **(b) Write Parquet in JS** (`parquet-wasm` / hand-rolled) remains the
fast-follow that kills the last native dep entirely. Deferred — it carries its
own byte-identical-determinism contract and must not block the install win.

### Phase 5 — Rip the native bindings out ⛔ NEEDS LAITH (not done autonomously)
Remove `@ladybugdb/core` and `@duckdb/node-api` from all `package.json` (modulo
Phase-4 option (a)'s lazy DuckDB). Delete `graphdb-adapter.ts`,
`graphdb-pool.ts`, `graphdb-schema.ts`, `duckdb-adapter.ts` and their tests.
Net deletion should dwarf the addition. Update CHANGELOGs; write the superseding
ADR (0017?: "single-file SQLite storage; supersedes 0016").

### Phase 6 — Prove the one-command install
On a clean machine / container with only Node 24: `npm i -g @opencodehub/cli`,
then `codehub analyze` a sample repo, then `codehub query`/`impact`/`pack`.
Confirm no native build, no Docker, no second process. Update README's install
section to the one-liner. **This is the deliverable the whole spike exists for.**

---

## Risk register

| Risk | Likelihood | Mitigation |
|---|---|---|
| `--experimental-sqlite` flag required on shipping Node | Med | Set flag in bin wrapper; or wait for unflagged (track Node release notes). **Resolve in Phase 1.** |
| Generic node table fails `graphHash` parity | Med | Phase 2 is the gate; payload JSON is canonical-sorted already via the existing `canonicalJson`. Fall back to wider typed columns if a field needs SQL-level filtering. |
| Brute-force KNN too slow on a giant monorepo | Low | `sqlite-vec` via `loadExtension` (seam proven). Repo-scale is fine without it. |
| Losing the Parquet sidecar breaks pack determinism | Med | Phase 4 option (a) keeps DuckDB lazily for export only. |
| Concurrent writers (parallel `analyze`) | Low | WAL gives one-writer/many-reader; OCH indexes single-writer per repo anyway. |

## Progress log (autonomous run, 2026-06-22)

| Phase | State | Evidence |
|---|---|---|
| P0 de-risk | ✅ | spike adapter + 2 tests (commit 3663cd4) |
| "flag" | ✅ | node:sqlite is default-on at Node ≥24.15 — no flag needed to *run*; added a dependency-free guard that silences the one-shot ExperimentalWarning on stderr (matters for the MCP stdio channel). commit 8ee504b |
| P1 surface | ✅ | full IGraphStore+ITemporalStore, only exportEmbeddingsToParquet was stubbed (commit 1f8fbcd) |
| P2 graphHash gate | ✅ GREEN | sqlite-parity.test.ts: small+medium fixtures, all 4 sentinels, every edge kind, 2-store determinism. Verified SQLite is byte-correct *against the lbug reference* (commit 1f8fbcd) |
| P3 openStore rewire | ✅ | one SqliteStore as both views; 52 call sites unchanged; live `analyze`→`query`→`impact` on one store.sqlite; storage 178/0, mcp 209/0, monorepo tsc clean (commit 806e8e3) |
| P4 Parquet | ✅ option (a) | lazy DuckDB import at pack time only; sidecar test green; PAR1 file emitted |
| P5 rip bindings | ⛔ **needs Laith** | large irreversible deletion (~3k lines, ADR 0016 supersede) — left as a decision, not done autonomously |
| P6 clean-machine install | ⛔ pending P5 | — |

### Two bugs the LIVE run caught that tests structurally could not
1. **bulkLoad ignored `opts.mode`** — always full-replaced. `ingest-sarif` (run
inside `analyze`) calls `bulkLoad(graph,{mode:"upsert"})` with an empty SARIF
graph; the second call's `DELETE FROM nodes` wiped the 15 real nodes. Unit +
parity tests only exercised single-instance replace-mode, so they were green
while the product was broken. Fixed: honor `mode`; stamp `store_meta` from
actual post-write counts. **Lesson: a passing parity test ≠ a working CLI;
the analyze→query→impact loop is the real gate.**
2. **tsup `removeNodeProtocol:true`** stripped `node:sqlite`→bare `sqlite`,
unresolvable at runtime. `tsc` was clean; only a live `codehub analyze`
surfaced it. Fixed with `removeNodeProtocol:false`.

### Decision still owed by Laith
- **Greenlight P5?** Ripping lbug + the graphdb adapters is the irreversible step
and the moment ADR 0016 gets superseded. The thesis is fully proven; this is a
"do you want to commit the architecture" call, not a technical unknown.
- **Latent finding (separate from the spike):** the existing
`graphdb-roundtrip.test.ts` all-kinds test passes only because its TEST-LOCAL
rebuild helper re-attaches `step:0`; through the PUBLIC `rebuildFromStore`
harness, `GraphDbStore` breaks on a `step:0` edge identically to SQLite, since
`graphHash` emits `"step":0` but `listEdges` drops it on every adapter.
Ingestion only ever emits `step≥1`, so it's latent — but it's a real gap in the
conformance contract worth closing (either reject `step:0` at ingest, or make
`graphHash` drop it). Your call whether that's in-scope.
Loading
Loading