Merged
2 changes: 1 addition & 1 deletion .claude/skills/opencodehub-guide/SKILL.md
@@ -15,7 +15,7 @@ For any task that touches code understanding, debugging, impact analysis, refact
2. Read `codehub://repo/{name}/context` — codebase stats and a staleness envelope.
3. Match the task to a skill below and follow that skill's checklist.

> If the context envelope reports the index is stale, run `codehub analyze` in the terminal first. If it says weights are missing, run `codehub setup --embeddings` to fetch the 768d gte-modernbert-base ONNX weights.
> If the context envelope reports the index is stale, run `codehub analyze` in the terminal first. If it says weights are missing, run `codehub setup --embeddings` to fetch the 320d F2LLM-v2-80M ONNX weights.

## Skills · analysis

12 changes: 12 additions & 0 deletions .github/workflows/release.yml
@@ -67,6 +67,18 @@ jobs:
resolve:
name: Resolve release tag + SHA
runs-on: ubuntu-latest
# Only run the package-release pipeline for version-shaped tags. The
# `release: published` event also fires for non-package releases that
# merely HOST assets — e.g. the `embed-v1` embedder-weights release
# (see packages/embedder/src/model-pins.ts) — which must NOT build,
# sign, or npm-publish anything. release-please tags are `root-v*`,
# `cli-v*`, or a bare `v*`; `workflow_call`/`workflow_dispatch` pass an
# explicit tag input and are always allowed.
if: >-
github.event_name != 'release'
|| startsWith(github.event.release.tag_name, 'root-v')
|| startsWith(github.event.release.tag_name, 'cli-v')
|| startsWith(github.event.release.tag_name, 'v')
outputs:
tag: ${{ steps.t.outputs.tag }}
sha: ${{ steps.t.outputs.sha }}
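
The tag gate added to the `resolve` job can be read as a small predicate. The sketch below is a hypothetical TypeScript mirror of that `if:` expression, written only to make the accepted and rejected tags concrete; `shouldRunReleasePipeline` and `PACKAGE_TAG_PREFIXES` are illustrative names that do not exist in the repository. It lower-cases both sides because the GitHub Actions `startsWith` function is not case sensitive.

```typescript
// Hypothetical mirror of the workflow's `if:` condition; not repository code.
const PACKAGE_TAG_PREFIXES = ["root-v", "cli-v", "v"];

function shouldRunReleasePipeline(eventName: string, tagName?: string): boolean {
  // workflow_call / workflow_dispatch pass an explicit tag input and always run.
  if (eventName !== "release") return true;
  // A release event with no tag name matches no prefix.
  if (tagName === undefined) return false;
  const tag = tagName.toLowerCase();
  return PACKAGE_TAG_PREFIXES.some((prefix) => tag.startsWith(prefix));
}

console.log(shouldRunReleasePipeline("release", "embed-v1"));
console.log(shouldRunReleasePipeline("release", "cli-v1.2.3"));
```

One consequence worth keeping in mind: the bare `v` prefix accepts any tag that begins with `v`, so an asset-only release would need a tag that starts with some other letter (as `embed-v1` does) to stay out of the pipeline.
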
102 changes: 53 additions & 49 deletions README.md
@@ -78,31 +78,31 @@ flowchart LR
| **Local-first, offline-capable** | `codehub analyze --offline` opens zero sockets. Your code never leaves your machine. No telemetry. |
| **Deterministic indexing** | Identical inputs produce a byte-identical graph hash. Reproducible. Auditable. Cacheable in CI. |
| **MCP-native** | Works out-of-the-box with Claude Code, Cursor, Codex, Windsurf, OpenCode. The MCP server is the primary interface; CLI exists for scripts and CI. |
| **Embedded storage, two-tier** | `@ladybugdb/core` holds the structural store: symbols, edges, embeddings, BM25 + HNSW. A dedicated DuckDB sibling holds the temporal views: cochanges and summaries. Embedded files. No daemon. No database to operate. Both tiers are always present, with no backend knob (ADR 0016). |
| **Single-file embedded storage** | One `store.sqlite` file holds everything — symbols, edges, embeddings, BM25 (FTS5) + HNSW traversal, and the temporal views (cochanges, summaries) — via Node's built-in `node:sqlite`. No daemon, no database to operate, and **zero native storage bindings** (ADR 0019 removed both `@ladybugdb/core` and `@duckdb/node-api`). |
| **15 languages at GA** | TypeScript, JavaScript, Python, Go, Rust, Java, C#, C, C++, Ruby, Kotlin, Swift, PHP, Dart, COBOL — tree-sitter for the first 14 plus a regex provider for fixed-format COBOL. |
| **WASM-only parse runtime** | `web-tree-sitter` WASM is the only parse runtime. The 15 grammar `.wasm` blobs are vendored at `packages/ingestion/vendor/wasms/`, so parsing does **zero grammar/native builds and zero GitHub fetches** at install time — there is no native parser opt-in. Storage and embeddings still load prebuilt native bindings (see Platform support). |
| **WASM-only parse runtime** | `web-tree-sitter` WASM is the only parse runtime. The 15 grammar `.wasm` blobs are vendored at `packages/ingestion/vendor/wasms/`, so parsing does **zero grammar/native builds and zero GitHub fetches** at install time — there is no native parser opt-in. Storage is pure `node:sqlite`; the only optional native dep is the local embedder (see Platform support). |

## Platform support

Parsing is WASM and runs anywhere Node does. The storage and embedding
tiers, however, depend on **prebuilt native bindings** — `@ladybugdb/core`
(graph store), `@duckdb/node-api` (temporal store), and `onnxruntime-node`
(local embeddings) — so OpenCodeHub runs on the platforms those bindings
ship a prebuild for:
Parsing is WASM and storage is pure `node:sqlite`, so the core runs anywhere
Node ≥ 24.15 does: no prebuilt native storage bindings, no Docker, no
postinstall compile (ADR 0019). There is exactly **one** optional
non-JavaScript dependency: `onnxruntime-web`, the WASM ONNX runtime that
powers `--embeddings`. It ships prebuilt WebAssembly (no node-gyp, no native
binding) and runs single-threaded under Node, so it too is platform-agnostic;
a BM25-only install never loads it.

| Platform | Supported |
|---|---|
| `darwin-arm64`, `darwin-x64` | ✅ prebuilt |
| `linux-x64`, `linux-arm64` (glibc) | ✅ prebuilt |
| `win32-x64` | ✅ prebuilt |
| `win32-arm64` | ❌ no prebuild — `codehub analyze` fails at store open |
| Alpine / musl, 32-bit Linux ARM | ❌ no prebuild — needs a source build of `@ladybugdb/core` |

On an unsupported platform the lbug binding fails to load and `open()`
throws `GraphDbBindingError` (there is no DuckDB-graph fallback — see
[ADR 0016](./docs/adr/0016-duckdb-graph-rip.md)). The five-target prebuilt
matrix mirrors `@ladybugdb/core`'s release artifacts; track its upstream
for musl / `win32-arm64` coverage.
| `darwin-arm64`, `darwin-x64` | ✅ |
| `linux-x64`, `linux-arm64` (glibc **and** musl/Alpine) | ✅ |
| `win32-x64`, `win32-arm64` | ✅ |
| anywhere else Node ≥ 24.15 runs | ✅ |

Because storage no longer depends on a platform-specific prebuild, the
earlier `GraphDbBindingError` / unsupported-platform failure mode is gone —
see [ADR 0019](./docs/adr/0019-single-file-sqlite-storage.md) (which
superseded the native-binding storage of [ADR 0016](./docs/adr/0016-duckdb-graph-rip.md)).

## Quick start

@@ -187,7 +187,7 @@ The monorepo is organised as 18 workspace packages under `packages/`:
| `scanners` | Subprocess wrappers for 19 scanners — OSV, Semgrep, hadolint, tflint, betterleaks, and the rest |
| `scip-ingest` | SCIP indexer runners (TS, Python, Go, Rust, Java) — emits CALLS, REFERENCES, IMPLEMENTS, TYPE_OF |
| `search` | Hybrid BM25 + HNSW (ACORN-1 + RaBitQ) query layer |
| `storage` | `IGraphStore` (`@ladybugdb/core`) + `ITemporalStore` (DuckDB) adapters; deterministic `graphHash` |
| `storage` | One `SqliteStore` (`node:sqlite`) implementing both `IGraphStore` + `ITemporalStore` over a single `store.sqlite`; deterministic `graphHash` |
| `summarizer` | Process + cluster summaries for MCP responses |
| `wiki` | LLM-narrated module pages emitted by `codehub wiki --llm` |

Expand All @@ -199,63 +199,67 @@ production package set ships free of test-time dependencies.
## Embedding backends

OpenCodeHub ships with three embedding backends — all serve the same
`gte-modernbert-base` 768-dim space, all use CLS pooling + L2 norm — and
picks one at runtime based on environment variables:
`codefuse-ai/F2LLM-v2-80M` 320-dim space (last-token pooling + L2 norm
baked into the ONNX graph) — and picks one at runtime based on
environment variables:

| Precedence | Env | Backend |
|---|---|---|
| 1 | `CODEHUB_EMBEDDING_SAGEMAKER_ENDPOINT` | **SageMaker** — invokes an AWS SageMaker Runtime endpoint (e.g. a TEI-served `gte-modernbert-embed`). Auth via the default AWS credential chain (profile, env vars, IMDS). No local weights needed. |
| 1 | `CODEHUB_EMBEDDING_SAGEMAKER_ENDPOINT` | **SageMaker** — invokes an AWS SageMaker Runtime endpoint (e.g. a TEI-served `F2LLM-v2-80M`). Auth via the default AWS credential chain (profile, env vars, IMDS). No local weights needed. |
| 2 | `CODEHUB_EMBEDDING_URL` + `CODEHUB_EMBEDDING_MODEL` | **HTTP (OpenAI-compatible)** — POSTs to a `/v1/embeddings` server (Infinity, vLLM, TEI, Ollama, LM Studio, OpenAI). Bearer auth optional via `CODEHUB_EMBEDDING_API_KEY`. |
| 3 | *(nothing set)* | **Local ONNX** — deterministic, offline-safe. Requires `codehub setup --embeddings` to download the weights. |

**SageMaker-specific vars**:

| Var | Default | Purpose |
|---|---|---|
| `CODEHUB_EMBEDDING_SAGEMAKER_ENDPOINT` | *(required to select)* | Endpoint name (e.g. `gte-modernbert-embed`). |
| `CODEHUB_EMBEDDING_SAGEMAKER_ENDPOINT` | *(required to select)* | Endpoint name (e.g. `F2LLM-v2-80M`). |
| `CODEHUB_EMBEDDING_SAGEMAKER_REGION` | `us-east-1` | AWS region. |
| `CODEHUB_EMBEDDING_DIMS` | `768` | Expected vector dimension — asserted on every response to catch model-swap drift. |
| `CODEHUB_EMBEDDING_MODEL` | `gte-modernbert-base/sagemaker:<endpoint-name>` | Stable modelId stamp recorded in index metadata. Override only when bridging a non-gte endpoint. |
| `CODEHUB_EMBEDDING_DIMS` | `320` | Expected vector dimension — asserted on every response to catch model-swap drift. |
| `CODEHUB_EMBEDDING_MODEL` | `F2LLM-v2-80M/sagemaker:<endpoint-name>` | Stable modelId stamp recorded in index metadata. Override only when bridging a non-F2LLM endpoint. |

IAM: the caller needs `sagemaker:InvokeEndpoint` on the endpoint ARN —
e.g. `arn:aws:sagemaker:us-east-1:<account>:endpoint/gte-modernbert-embed`.
e.g. `arn:aws:sagemaker:us-east-1:<account>:endpoint/F2LLM-v2-80M`.

**Do not mix backends against the same index.** Backends are pinned to a
single model identity via the `modelId` stamp in the `embeddings` table;
switching mid-project requires `codehub analyze --rebuild-embeddings`.
`--offline` refuses SageMaker and HTTP backends, so offline mode is
compatible only with the local ONNX path.
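
The precedence table and the `--offline` rule can be sketched as one selection function. This is a minimal illustration under stated assumptions, not the repository's resolver: `selectEmbeddingBackend`, the `Backend` type, and the choice to fall through to local ONNX when only one of the two HTTP variables is set are all hypothetical. The environment variable names and the `us-east-1` default come from the tables above.

```typescript
// Hypothetical sketch of the documented backend precedence; not repository code.
type Env = Record<string, string | undefined>;

type Backend =
  | { kind: "sagemaker"; endpoint: string; region: string }
  | { kind: "http"; url: string; model: string }
  | { kind: "local-onnx" };

function selectEmbeddingBackend(env: Env, offline = false): Backend {
  // Precedence 1: a SageMaker endpoint name selects the SageMaker backend.
  const endpoint = env["CODEHUB_EMBEDDING_SAGEMAKER_ENDPOINT"];
  if (endpoint) {
    if (offline) throw new Error("--offline refuses the SageMaker backend");
    const region = env["CODEHUB_EMBEDDING_SAGEMAKER_REGION"] ?? "us-east-1";
    return { kind: "sagemaker", endpoint, region };
  }
  // Precedence 2: URL and model together select the OpenAI-compatible HTTP backend.
  const url = env["CODEHUB_EMBEDDING_URL"];
  const model = env["CODEHUB_EMBEDDING_MODEL"];
  if (url && model) {
    if (offline) throw new Error("--offline refuses the HTTP backend");
    return { kind: "http", url, model };
  }
  // Precedence 3: nothing set, so use the local ONNX weights.
  // Assumption: a half-configured HTTP backend also lands here.
  return { kind: "local-onnx" };
}

console.log(selectEmbeddingBackend({}).kind);
```
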

## Storage backend — lbug graph + DuckDB temporal

The graph tier is always `@ladybugdb/core` (`<repo>/.codehub/graph.lbug`);
the temporal tier — cochanges, structured symbol summaries, and the
`codehub query --sql` escape hatch — is always DuckDB
(`<repo>/.codehub/temporal.duckdb`). Both files are written on every
`analyze`. There is no `CODEHUB_STORE` env var, no backend probe, no
single-file `graph.duckdb` layout, and no mtime arbitration; if the lbug
binding fails to load, `open()` throws `GraphDbBindingError` and the
operation aborts.

`IGraphStore` lives only on `GraphDbStore`; `DuckDbStore` implements
`ITemporalStore` only. The segregated interfaces stay because they are
the v1.0 contract for community-fork adapters (AGE / Memgraph / Neo4j /
Neptune target `IGraphStore`; DuckDB owns `ITemporalStore`). Embeddings
live in `graph.lbug` and stream into a per-call DuckDB temp table at
pack time so the byte-identical Parquet sidecar still works.

See [`docs/adr/0016-duckdb-graph-rip.md`](./docs/adr/0016-duckdb-graph-rip.md)
for the rationale behind ripping out the DuckDB graph backend; it
supersedes ADR 0013 and the DuckDB-as-graph passages of ADR 0011.
## Storage backend — single-file SQLite

The entire index lives in ONE `<repo>/.codehub/store.sqlite` file (WAL),
via Node's built-in `node:sqlite` — graph nodes, edges, embeddings, the
FTS5 BM25 table, and the temporal tables (cochanges, symbol summaries, the
`codehub query --sql` escape hatch). One `SqliteStore` class implements
**both** `IGraphStore` and `ITemporalStore`; `openStore()` returns that
single instance as both the `graph` and `temporal` views, so call sites use
`store.graph.X()` / `store.temporal.Y()` unchanged. **Zero native storage
bindings** — `@ladybugdb/core` and `@duckdb/node-api` are both gone, so
there is no `GraphDbBindingError`, no backend probe, and no platform-prebuild
matrix.
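
The "one class, two views" shape described above can be illustrated without a database. In the sketch below the interface members (`upsertNode`, `recordCochange`, and so on) and the `SketchStore` class are hypothetical stand-ins, and an in-memory `Map` replaces `node:sqlite`; only the pattern of `openStore()` handing back a single instance as both `graph` and `temporal` is taken from the text.

```typescript
// Hypothetical, in-memory illustration of one class serving both interfaces.
interface IGraphStore {
  upsertNode(id: string, kind: string): void;
  nodeCount(): number;
}

interface ITemporalStore {
  recordCochange(a: string, b: string): void;
  cochangeCount(): number;
}

class SketchStore implements IGraphStore, ITemporalStore {
  private nodes = new Map<string, string>();
  private cochanges: Array<[string, string]> = [];

  upsertNode(id: string, kind: string): void {
    this.nodes.set(id, kind);
  }
  nodeCount(): number {
    return this.nodes.size;
  }
  recordCochange(a: string, b: string): void {
    this.cochanges.push([a, b]);
  }
  cochangeCount(): number {
    return this.cochanges.length;
  }
}

// Both views point at the same object, so call sites keep using
// store.graph.X() and store.temporal.Y() unchanged.
function openStore(): { graph: IGraphStore; temporal: ITemporalStore } {
  const store = new SketchStore();
  return { graph: store, temporal: store };
}

const store = openStore();
store.graph.upsertNode("fn:main", "Function");
store.temporal.recordCochange("a.ts", "b.ts");
console.log(Object.is(store.graph, store.temporal));
```

A fork that prefers a split implementation would return two different objects from its own `openStore()`; callers typed against the interfaces would not notice the difference.
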

The segregated `IGraphStore` / `ITemporalStore` interfaces stay as the
community-fork escape hatch (AGE / Memgraph / Neo4j / Neptune) — a fork
implements both, on one class or split. Install is zero-native-dep:
`npm i -g @opencodehub/cli` + Node ≥ 24.15, no Docker, no postinstall
compile. (`onnxruntime-web`, the optional WASM embedder, is the only
non-JavaScript dependency; it is lazy-loaded under `--embeddings`.)

See [`docs/adr/0019-single-file-sqlite-storage.md`](./docs/adr/0019-single-file-sqlite-storage.md)
for the rationale; it supersedes [ADR 0016](./docs/adr/0016-duckdb-graph-rip.md)
(and, transitively, the native-binding storage of ADRs 0011 / 0013 / 0001).

## Parse runtime — WASM-only, vendored grammars

`@opencodehub/ingestion` runs `web-tree-sitter` (WASM) as the only parse
runtime on the supported Node range (22 and 24). There is no native opt-in:
the native `tree-sitter` N-API addon and all 14 `tree-sitter-<lang>` npm
packages are gone from the install graph, so parsing pulls **zero native
builds and zero GitHub fetches** at install time. (Storage and embeddings
load prebuilt native bindings — see Platform support.)
builds and zero GitHub fetches** at install time. (Storage is pure
`node:sqlite`; the only optional non-JavaScript dependency is the WASM
embedder, described under Platform support.)

All 15 grammar `.wasm` blobs are vendored at
`packages/ingestion/vendor/wasms/`, built from the grammar sources
9 changes: 5 additions & 4 deletions SPECS.md
@@ -17,7 +17,7 @@ first 14 plus a regex provider for fixed-format COBOL, runs SCIP indexers
for TypeScript/JavaScript, Python, Go, Rust, and Java to upgrade tree-sitter
heuristic edges to compiler-grade edges, clusters the graph into
Communities and Processes, and optionally populates embeddings from a
pinned gte-modernbert-base ONNX model (fp32 ~596 MB or int8 ~150 MB) or
pinned F2LLM-v2-80M ONNX model (320-dim; fp32 ~321 MB or int8 ~81 MB) or
an OpenAI-compatible HTTP endpoint.

At query time it exposes an MCP server with 28 tools (`query`, `context`,
@@ -171,7 +171,7 @@ last-analyzed commit) atomically and expose it via `getMeta`.
BM25 + ANN search, fuse results with reciprocal rank fusion (`DEFAULT_RRF_K`),
and return symbols grouped by their participating `Process`.
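
Reciprocal rank fusion, as named in the requirement above, scores each symbol as the sum of `1 / (k + rank)` over every ranked list it appears in. The sketch below is a generic implementation for illustration; the value `60` for `DEFAULT_RRF_K` is an assumption (a commonly used constant), not a value confirmed by this spec, and `reciprocalRankFusion` is a hypothetical name.

```typescript
// Generic reciprocal rank fusion; the constant's value is an assumption.
const DEFAULT_RRF_K = 60;

function reciprocalRankFusion(
  rankings: string[][],
  k: number = DEFAULT_RRF_K,
): Array<{ id: string; score: number }> {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      // Ranks are 1-based, so the top hit contributes 1 / (k + 1).
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + index + 1));
    });
  }
  return Array.from(scores, ([id, score]) => ({ id, score })).sort(
    // Highest score first; ties break on id so the order is deterministic.
    (a, b) => b.score - a.score || (a.id < b.id ? -1 : a.id > b.id ? 1 : 0),
  );
}

const bm25Hits = ["a", "b", "c"];
const annHits = ["b", "d", "a"];
console.log(reciprocalRankFusion([bm25Hits, annHits]).map((hit) => hit.id));
// → [ 'b', 'a', 'd', 'c' ]
```

Symbols that appear in both lists rise above symbols that appear in only one, which is why `b` and `a` outrank `d` and `c` here.
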

4.2 Where gte-modernbert-base weights are absent and no HTTP embedder is
4.2 Where F2LLM-v2-80M weights are absent and no HTTP embedder is
configured, the system shall fall back to BM25-only search and log a
one-shot `[mcp] hybrid:` warning to stderr.

@@ -264,8 +264,9 @@ and `sql`.
claude-code, cursor, codex, windsurf, and opencode; pass `--undo` to
restore the most recent `.bak`.

7.4 The `setup --embeddings` command shall download gte-modernbert-base
weights (fp32 or int8) with SHA256 pins validated against
7.4 The `setup --embeddings` command shall download the F2LLM-v2-80M
ONNX export (fp32 or int8) — a custom-exported artifact hosted as a
GitHub release asset — with SHA256 pins validated against
`model-pins.ts`.
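
Validating a download against a SHA256 pin amounts to hashing the received bytes and comparing the hex digest with the pinned value. The following is a minimal sketch using Node's built-in `node:crypto`; `sha256Hex` and `assertPinned` are hypothetical helpers and do not reflect the actual shape of `model-pins.ts`.

```typescript
import { createHash } from "node:crypto";

// Hypothetical helpers; the real pin table lives in model-pins.ts.
function sha256Hex(bytes: Uint8Array): string {
  return createHash("sha256").update(bytes).digest("hex");
}

function assertPinned(bytes: Uint8Array, expectedSha256: string, label: string): void {
  const actual = sha256Hex(bytes);
  if (actual !== expectedSha256.toLowerCase()) {
    throw new Error(`${label}: SHA256 mismatch (expected ${expectedSha256}, got ${actual})`);
  }
}

// Demonstration with a well-known test vector rather than real model weights.
const demoBytes = new TextEncoder().encode("abc");
assertPinned(
  demoBytes,
  "ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad",
  "demo",
);
console.log("pin verified");
```
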

7.5 The `setup --plugin` command shall copy the bundled plugin into
13 changes: 12 additions & 1 deletion docs/adr/0001-storage-backend.md
@@ -1,6 +1,17 @@
# ADR 0001 — Storage backend selection

Status: **Accepted (superseded prior SQLite recommendation)** — 2026-04-18
Status: **Superseded** — current storage is [ADR 0019 — Single-file SQLite
storage](./0019-single-file-sqlite-storage.md) (2026-06-22). This ADR
selected **DuckDB** as the embedded backend; that decision was unwound over
[ADR 0011](./0011-graph-db-backend.md) → [ADR 0013-m7](./0013-m7-default-flip-and-abstraction.md)
→ [ADR 0016](./0016-duckdb-graph-rip.md) → ADR 0019, which lands on one
`store.sqlite` file (Node built-in `node:sqlite`, **zero** native storage
bindings — DuckDB included). Ironically ADR 0019 returns to the SQLite
recommendation this ADR originally rejected. Read this ADR for the original
license/determinism/binding-availability criteria only; the chosen engine is
obsolete.

> Originally: **Accepted (superseded prior SQLite recommendation)** — 2026-04-18

## Context

13 changes: 8 additions & 5 deletions docs/adr/0011-graph-db-backend.md
@@ -1,10 +1,13 @@
# ADR 0011 — Graph-DB backend (LadybugDB phase-1)

- Status: **Partially superseded** by [ADR 0016](./0016-duckdb-graph-rip.md)
on 2026-05-16. The "DuckDB-default plus LadybugDB opt-in" framing is
obsolete; lbug is the unconditional graph backend after the rip. The
LadybugDB integration shape and `IGraphStore` design introduced here
are unchanged.
- Status: **Superseded** — current storage is [ADR 0019 — Single-file
SQLite storage](./0019-single-file-sqlite-storage.md) (2026-06-22).
Chain: this ADR (LadybugDB phase-1) → [ADR 0016](./0016-duckdb-graph-rip.md)
(lbug-only graph, made the "DuckDB-default + LadybugDB opt-in" framing
obsolete, 2026-05-16) → ADR 0019 (one `store.sqlite`, NO native bindings —
`@ladybugdb/core` itself is now gone). The `IGraphStore` design introduced
here survives ADR 0019 as a community-fork escape hatch; the LadybugDB
binding does not. Read this ADR for historical rationale only.
- Was: **Accepted** on 2026-05-05 and flipped on the M3 merge.
- Authors: Laith Al-Saadoon + Claude.
- Branch: `feat/v1-m3-m4`.
14 changes: 8 additions & 6 deletions docs/adr/0013-m7-default-flip-and-abstraction.md
@@ -5,12 +5,14 @@
> in-tree because they were authored in parallel branches and accepted
> on the same release. The next ADR uses 0014.

- Status: **Superseded** by [ADR 0016](./0016-duckdb-graph-rip.md)
on 2026-05-16. The auto-probe, dual-artifact arbitration, and
`CODEHUB_STORE` resolver introduced here are gone. lbug is the only
graph backend; DuckDB serves the temporal tier. The
IGraphStore/ITemporalStore segregation survives because community
adapters (AGE, Memgraph, Neo4j, Neptune) target it.
- Status: **Superseded** — current storage is [ADR 0019 — Single-file
SQLite storage](./0019-single-file-sqlite-storage.md) (2026-06-22).
Chain: this ADR → [ADR 0016](./0016-duckdb-graph-rip.md) (2026-05-16,
removed the auto-probe / dual-artifact arbitration / `CODEHUB_STORE`
resolver introduced here) → ADR 0019 (one `store.sqlite`, no native
bindings). The IGraphStore/ITemporalStore segregation introduced here
survives all the way to ADR 0019 as the community-fork escape hatch
(AGE, Memgraph, Neo4j, Neptune); everything else here is historical.
- Was: **Accepted** on 2026-05-09 and flipped on the
`feat/v1-finalize-track-a` merge (PR #71).
- Authors: Laith Al-Saadoon + Claude.
13 changes: 12 additions & 1 deletion docs/adr/0014-scip-references-and-embedder-fingerprint.md
@@ -1,10 +1,21 @@
# ADR 0014 — SCIP REFERENCES + TYPE_OF emission and embedder-fingerprint refusal

**Status**: Accepted
**Status**: Accepted (still in force)
**Date**: 2026-05-09
**Supersedes**: none
**Superseded by**: none

> Note (2026-06-26): the embedder-fingerprint mechanism this ADR introduced
> — persist `embedder_model_id`, refuse mismatched queries via
> `assertEmbedderCompatible` — is unchanged and is precisely what guards the
> later embedding-model swap from `gte-modernbert-base` (768-dim) to
> `F2LLM-v2-80M` (320-dim). The `gte-modernbert-base` / `768` references
> below are the contemporaneous examples; the dim/model are now 320 /
> `f2llm-v2-80m/*` but the decision and the comparator are identical. The
> `store_meta` storage substrate referenced here (DuckDB) was later replaced
> per [ADR 0019](./0019-single-file-sqlite-storage.md); the column and
> semantics carried over to `store.sqlite` verbatim.
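
The refusal described in the note can be sketched as a comparison between the fingerprint stored with the index and the fingerprint of the active embedder. `assertEmbedderCompatible` is the name the ADR uses, but the signature, the `EmbedderFingerprint` shape, the decision to let an index without a stored fingerprint pass, and the `/fp32` model-id suffixes below are all assumptions made for illustration.

```typescript
// Hypothetical signature; only the function name comes from the ADR.
interface EmbedderFingerprint {
  modelId: string;
  dims: number;
}

function assertEmbedderCompatible(
  stored: EmbedderFingerprint | undefined,
  active: EmbedderFingerprint,
): void {
  // Assumption: an index with no stored fingerprint has nothing to guard.
  if (stored === undefined) return;
  if (stored.modelId !== active.modelId || stored.dims !== active.dims) {
    throw new Error(
      `embedder mismatch: index built with ${stored.modelId} (${stored.dims}d), ` +
        `active embedder is ${active.modelId} (${active.dims}d); rebuild embeddings`,
    );
  }
}

// An index built before the model swap must be refused by the new embedder.
const oldIndex: EmbedderFingerprint = { modelId: "gte-modernbert-base/fp32", dims: 768 };
const newEmbedder: EmbedderFingerprint = { modelId: "f2llm-v2-80m/fp32", dims: 320 };
try {
  assertEmbedderCompatible(oldIndex, newEmbedder);
} catch (error) {
  console.log((error as Error).message);
}
```
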

## Context

Two unrelated holes in v1.0 finalize, both routing through a shared one-time graphHash content delta. They land in a single ADR per spec.md§Q7 because the fixture-regeneration cost is paid once.
12 changes: 11 additions & 1 deletion docs/adr/0016-duckdb-graph-rip.md
@@ -1,6 +1,16 @@
# ADR 0016 — Rip out the DuckDB graph backend; lbug-only graph, DuckDB temporal-only

- Status: **Accepted** — 2026-05-16.
- Status: **Superseded** by [ADR 0019 — Single-file SQLite storage](./0019-single-file-sqlite-storage.md)
on 2026-06-22, **in its entirety**. ADR 0019 removed BOTH native bindings
this ADR settled on (`@ladybugdb/core` for the graph tier and
`@duckdb/node-api` for the temporal tier) and replaced the pair with one
`store.sqlite` file via Node's built-in `node:sqlite`. The segregated
`IGraphStore` / `ITemporalStore` interfaces this ADR preserved for
community forks survive — both are now implemented by a single
`SqliteStore` class. Read this ADR only for the historical rationale of
the lbug-graph / DuckDB-temporal split; **do not** treat its decision as
current.
- Was: **Accepted** — 2026-05-16.
- Authors: Laith Al-Saadoon + Claude.
- Branch: `feat/duckdb-graph-rip`.
- Supersedes: [ADR 0013 — M7 default flip and storage abstraction](./0013-m7-default-flip-and-abstraction.md)