From ea8d2e2043dc2cf3167d5d5921ebcde1d968eb1d Mon Sep 17 00:00:00 2001 From: Laith Al-Saadoon Date: Fri, 26 Jun 2026 01:40:59 +0000 Subject: [PATCH 1/2] =?UTF-8?q?feat(embedder)!:=20swap=20embedding=20model?= =?UTF-8?q?=20gte-modernbert-base=20=E2=86=92=20F2LLM-v2-80M=20(320-dim)?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit BREAKING CHANGE: the local ONNX embedder is now codefuse-ai/F2LLM-v2-80M (320-dim, was gte-modernbert-base 768-dim). Existing indexes MUST be rebuilt with `codehub analyze --embeddings` — 320-dim query vectors cannot be compared against stored 768-dim vectors. The embedder-fingerprint guard (ADR 0014) refuses queries against a stale store until re-analyze. What changed: - onnx-embedder.ts requests the in-graph `embedding` output (shape [B,320], last-token pooling + L2 norm baked into the graph) instead of pooling / normalizing in JS — clsPool + l2NormalizeInPlace are removed. Qwen2 pad id. - New Embedder.embedQuery() applies F2LLM's query-only `Instruct:` prefix; documents embed raw. Wired at the search/hybrid.ts query seam. New query-prefix.ts holds the instruction string. - Dimension parameterized 768→320 across embedder, search (NullEmbedder), storage (DEFAULT_DIM), ingestion pool, and HTTP/SageMaker defaults. - model-pins.ts: GTE_MODERNBERT_BASE_PINS → F2LLM_V2_80M_PINS; weights are a custom ONNX export hosted as the GitHub release asset `embed-v1`, SHA256-pinned. 3-file manifest (model + tokenizer.json + tokenizer_config.json). - Migration guard: analyze suppresses the content-hash cache on a model-id change so a swap forces a full re-embed (no mixed-dimension store). - Docs, CHANGELOGs, skills swept to F2LLM/320-dim. Collapsed the stale storage ADR chain (0001/0011/0013/0016 → 0019) and fixed the README's pre-ADR-0019 storage narrative. Fixed lefthook verdict guard to check store.sqlite (was the removed graph.lbug). 
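Illustration for reviewers: a minimal sketch of why the rebuild is mandatory, assuming stored and query vectors are L2-normalized and compared by dot product. The helper names `dotSimilarity` and `l2Normalize` are hypothetical and are not part of this patch. The real refusal happens in the embedder-fingerprint guard (ADR 0014), which compares model ids rather than vector lengths.

```typescript
// Illustrative only: NOT the repository's implementation.
// Hypothetical helper: returns an L2-normalized copy of the input vector.
function l2Normalize(v: Float32Array): Float32Array {
  let sumSquares = 0;
  for (let i = 0; i < v.length; i++) {
    sumSquares += v[i]! * v[i]!;
  }
  const norm = Math.sqrt(sumSquares);
  const out = new Float32Array(v.length);
  if (norm === 0) return out; // zero vector stays zero
  for (let i = 0; i < v.length; i++) {
    out[i] = v[i]! / norm;
  }
  return out;
}

// Hypothetical helper: for unit-length vectors, cosine similarity reduces
// to a dot product, which is only defined when both lengths match.
function dotSimilarity(query: Float32Array, stored: Float32Array): number {
  if (query.length !== stored.length) {
    throw new RangeError(
      `dimension mismatch: query ${query.length} vs stored ${stored.length}`,
    );
  }
  let sum = 0;
  for (let i = 0; i < query.length; i++) {
    sum += query[i]! * stored[i]!;
  }
  return sum;
}
```

Under these assumptions a 320-dim query against a 768-dim row has no meaningful score at all, so the only safe outcomes are a refusal at query time or a full re-embed at analyze time.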
Verified: pnpm -r build + test green (~2400 tests); tokenizer parity with Python AutoTokenizer (byte-identical IDs); production embedder reproduces the POC 4/4 top-1 ranking + byte-deterministic output; end-to-end analyze writes 320-dim rows; migration guard confirmed firing on a stale stamp. --- .claude/skills/opencodehub-guide/SKILL.md | 2 +- README.md | 102 +++++------ SPECS.md | 9 +- docs/adr/0001-storage-backend.md | 13 +- docs/adr/0011-graph-db-backend.md | 13 +- .../0013-m7-default-flip-and-abstraction.md | 14 +- ...cip-references-and-embedder-fingerprint.md | 13 +- docs/adr/0016-duckdb-graph-rip.md | 12 +- lefthook.yml | 5 +- packages/cli/src/commands/analyze.test.ts | 4 +- packages/cli/src/commands/analyze.ts | 34 +++- packages/cli/src/commands/doctor.test.ts | 6 +- packages/cli/src/commands/doctor.ts | 4 +- packages/cli/src/commands/query.test.ts | 11 +- packages/cli/src/commands/query.ts | 4 +- .../cli/src/commands/setup-embeddings.test.ts | 8 +- packages/cli/src/commands/setup.ts | 9 +- packages/cli/src/embedder-downloader.test.ts | 8 +- packages/cli/src/embedder-downloader.ts | 10 +- packages/cli/src/index.ts | 6 +- .../content/docs/architecture/embeddings.md | 24 ++- .../content/docs/architecture/monorepo-map.md | 2 +- .../docs/src/content/docs/reference/cli.md | 4 +- .../content/docs/reference/configuration.md | 4 +- packages/embedder/CHANGELOG.md | 14 ++ packages/embedder/README.md | 15 +- packages/embedder/package.json | 6 +- packages/embedder/src/factory.test.ts | 13 +- packages/embedder/src/factory.ts | 2 +- packages/embedder/src/fingerprint.test.ts | 20 +-- packages/embedder/src/fingerprint.ts | 6 +- packages/embedder/src/http-embedder.test.ts | 58 +++---- packages/embedder/src/http-embedder.ts | 11 +- packages/embedder/src/index.ts | 9 +- packages/embedder/src/model-pins.test.ts | 72 +++++--- packages/embedder/src/model-pins.ts | 91 +++++----- packages/embedder/src/onnx-embedder.test.ts | 47 +++-- packages/embedder/src/onnx-embedder.ts | 163 
++++++++---------- packages/embedder/src/paths.test.ts | 13 +- packages/embedder/src/paths.ts | 24 ++- packages/embedder/src/query-prefix.ts | 25 +++ .../sagemaker-embedder.integration.test.ts | 14 +- .../src/sagemaker-embedder.parity.test.ts | 9 +- .../embedder/src/sagemaker-embedder.test.ts | 36 ++-- packages/embedder/src/sagemaker-embedder.ts | 21 ++- packages/embedder/src/types.ts | 36 ++-- packages/ingestion/CHANGELOG.md | 7 + .../src/pipeline/phases/embedder-pool.ts | 9 +- .../src/pipeline/phases/embeddings.test.ts | 2 +- .../src/pipeline/phases/embeddings.ts | 9 +- packages/mcp/CHANGELOG.md | 7 + packages/mcp/src/server.ts | 4 +- packages/mcp/src/tools/query.test.ts | 9 +- packages/mcp/src/tools/query.ts | 7 +- packages/mcp/src/tools/shared.ts | 2 +- packages/search/CHANGELOG.md | 7 + packages/search/src/embedder.ts | 7 +- packages/search/src/hybrid.test.ts | 5 + packages/search/src/hybrid.ts | 4 +- packages/search/src/types.ts | 6 + packages/storage/CHANGELOG.md | 7 + packages/storage/src/sqlite-adapter.test.ts | 4 +- packages/storage/src/sqlite-adapter.ts | 4 +- .../opencodehub/skills/codehub-guide/SKILL.md | 2 +- 64 files changed, 680 insertions(+), 448 deletions(-) create mode 100644 packages/embedder/src/query-prefix.ts diff --git a/.claude/skills/opencodehub-guide/SKILL.md b/.claude/skills/opencodehub-guide/SKILL.md index ddbdc71b..29803824 100644 --- a/.claude/skills/opencodehub-guide/SKILL.md +++ b/.claude/skills/opencodehub-guide/SKILL.md @@ -15,7 +15,7 @@ For any task that touches code understanding, debugging, impact analysis, refact 2. Read `codehub://repo/{name}/context` — codebase stats and a staleness envelope. 3. Match the task to a skill below and follow that skill's checklist. -> If the context envelope reports the index is stale, run `codehub analyze` in the terminal first. If it says weights are missing, run `codehub setup --embeddings` to fetch the 768d gte-modernbert-base ONNX weights. 
+> If the context envelope reports the index is stale, run `codehub analyze` in the terminal first. If it says weights are missing, run `codehub setup --embeddings` to fetch the 320d F2LLM-v2-80M ONNX weights. ## Skills · analysis diff --git a/README.md b/README.md index d14037bd..d3544665 100644 --- a/README.md +++ b/README.md @@ -78,31 +78,31 @@ flowchart LR | **Local-first, offline-capable** | `codehub analyze --offline` opens zero sockets. Your code never leaves your machine. No telemetry. | | **Deterministic indexing** | Identical inputs produce a byte-identical graph hash. Reproducible. Auditable. Cacheable in CI. | | **MCP-native** | Works out-of-the-box with Claude Code, Cursor, Codex, Windsurf, OpenCode. The MCP server is the primary interface; CLI exists for scripts and CI. | -| **Embedded storage, two-tier** | `@ladybugdb/core` holds the structural store: symbols, edges, embeddings, BM25 + HNSW. A dedicated DuckDB sibling holds the temporal views: cochanges and summaries. Embedded files. No daemon. No database to operate. Both tiers are always present, with no backend knob (ADR 0016). | +| **Single-file embedded storage** | One `store.sqlite` file holds everything — symbols, edges, embeddings, BM25 (FTS5) + HNSW traversal, and the temporal views (cochanges, summaries) — via Node's built-in `node:sqlite`. No daemon, no database to operate, and **zero native storage bindings** (ADR 0019 removed both `@ladybugdb/core` and `@duckdb/node-api`). | | **15 languages at GA** | TypeScript, JavaScript, Python, Go, Rust, Java, C#, C, C++, Ruby, Kotlin, Swift, PHP, Dart, COBOL — tree-sitter for the first 14 plus a regex provider for fixed-format COBOL. | -| **WASM-only parse runtime** | `web-tree-sitter` WASM is the only parse runtime. The 15 grammar `.wasm` blobs are vendored at `packages/ingestion/vendor/wasms/`, so parsing does **zero grammar/native builds and zero GitHub fetches** at install time — there is no native parser opt-in. 
Storage and embeddings still load prebuilt native bindings (see Platform support). | +| **WASM-only parse runtime** | `web-tree-sitter` WASM is the only parse runtime. The 15 grammar `.wasm` blobs are vendored at `packages/ingestion/vendor/wasms/`, so parsing does **zero grammar/native builds and zero GitHub fetches** at install time — there is no native parser opt-in. Storage is pure `node:sqlite`; the only optional native dep is the local embedder (see Platform support). | ## Platform support -Parsing is WASM and runs anywhere Node does. The storage and embedding -tiers, however, depend on **prebuilt native bindings** — `@ladybugdb/core` -(graph store), `@duckdb/node-api` (temporal store), and `onnxruntime-node` -(local embeddings) — so OpenCodeHub runs on the platforms those bindings -ship a prebuild for: +Parsing is WASM and storage is pure `node:sqlite`, so the core runs anywhere +Node ≥ 24.15 does — no prebuilt native storage bindings, no Docker, no +postinstall compile (ADR 0019). There is exactly **one** optional native +dependency: `onnxruntime-web`, the WASM ONNX runtime that powers +`--embeddings`. It ships prebuilt WebAssembly (no node-gyp, no native +binding) and runs single-threaded under Node, so it too is platform-agnostic; +a BM25-only install never loads it. | Platform | Supported | |---|---| -| `darwin-arm64`, `darwin-x64` | ✅ prebuilt | -| `linux-x64`, `linux-arm64` (glibc) | ✅ prebuilt | -| `win32-x64` | ✅ prebuilt | -| `win32-arm64` | ❌ no prebuild — `codehub analyze` fails at store open | -| Alpine / musl, 32-bit Linux ARM | ❌ no prebuild — needs a source build of `@ladybugdb/core` | - -On an unsupported platform the lbug binding fails to load and `open()` -throws `GraphDbBindingError` (there is no DuckDB-graph fallback — see -[ADR 0016](./docs/adr/0016-duckdb-graph-rip.md)). The five-target prebuilt -matrix mirrors `@ladybugdb/core`'s release artifacts; track its upstream -for musl / `win32-arm64` coverage. 
+| `darwin-arm64`, `darwin-x64` | ✅ | +| `linux-x64`, `linux-arm64` (glibc **and** musl/Alpine) | ✅ | +| `win32-x64`, `win32-arm64` | ✅ | +| anywhere else Node ≥ 24.15 runs | ✅ | + +Because storage no longer depends on a platform-specific prebuild, the +earlier `GraphDbBindingError` / unsupported-platform failure mode is gone — +see [ADR 0019](./docs/adr/0019-single-file-sqlite-storage.md) (which +superseded the native-binding storage of [ADR 0016](./docs/adr/0016-duckdb-graph-rip.md)). ## Quick start @@ -187,7 +187,7 @@ The monorepo is organised as 18 workspace packages under `packages/`: | `scanners` | Subprocess wrappers for 19 scanners — OSV, Semgrep, hadolint, tflint, betterleaks, and the rest | | `scip-ingest` | SCIP indexer runners (TS, Python, Go, Rust, Java) — emits CALLS, REFERENCES, IMPLEMENTS, TYPE_OF | | `search` | Hybrid BM25 + HNSW (ACORN-1 + RaBitQ) query layer | -| `storage` | `IGraphStore` (`@ladybugdb/core`) + `ITemporalStore` (DuckDB) adapters; deterministic `graphHash` | +| `storage` | One `SqliteStore` (`node:sqlite`) implementing both `IGraphStore` + `ITemporalStore` over a single `store.sqlite`; deterministic `graphHash` | | `summarizer` | Process + cluster summaries for MCP responses | | `wiki` | LLM-narrated module pages emitted by `codehub wiki --llm` | @@ -199,12 +199,13 @@ production package set ships free of test-time dependencies. ## Embedding backends OpenCodeHub ships with three embedding backends — all serve the same -`gte-modernbert-base` 768-dim space, all use CLS pooling + L2 norm — and -picks one at runtime based on environment variables: +`codefuse-ai/F2LLM-v2-80M` 320-dim space (last-token pooling + L2 norm +baked into the ONNX graph) — and picks one at runtime based on +environment variables: | Precedence | Env | Backend | |---|---|---| -| 1 | `CODEHUB_EMBEDDING_SAGEMAKER_ENDPOINT` | **SageMaker** — invokes an AWS SageMaker Runtime endpoint (e.g. a TEI-served `gte-modernbert-embed`). 
Auth via the default AWS credential chain (profile, env vars, IMDS). No local weights needed. | +| 1 | `CODEHUB_EMBEDDING_SAGEMAKER_ENDPOINT` | **SageMaker** — invokes an AWS SageMaker Runtime endpoint (e.g. a TEI-served `F2LLM-v2-80M`). Auth via the default AWS credential chain (profile, env vars, IMDS). No local weights needed. | | 2 | `CODEHUB_EMBEDDING_URL` + `CODEHUB_EMBEDDING_MODEL` | **HTTP (OpenAI-compatible)** — POSTs to a `/v1/embeddings` server (Infinity, vLLM, TEI, Ollama, LM Studio, OpenAI). Bearer auth optional via `CODEHUB_EMBEDDING_API_KEY`. | | 3 | *(nothing set)* | **Local ONNX** — deterministic, offline-safe. Requires `codehub setup --embeddings` to download the weights. | @@ -212,13 +213,13 @@ picks one at runtime based on environment variables: | Var | Default | Purpose | |---|---|---| -| `CODEHUB_EMBEDDING_SAGEMAKER_ENDPOINT` | *(required to select)* | Endpoint name (e.g. `gte-modernbert-embed`). | +| `CODEHUB_EMBEDDING_SAGEMAKER_ENDPOINT` | *(required to select)* | Endpoint name (e.g. `F2LLM-v2-80M`). | | `CODEHUB_EMBEDDING_SAGEMAKER_REGION` | `us-east-1` | AWS region. | -| `CODEHUB_EMBEDDING_DIMS` | `768` | Expected vector dimension — asserted on every response to catch model-swap drift. | -| `CODEHUB_EMBEDDING_MODEL` | `gte-modernbert-base/sagemaker:` | Stable modelId stamp recorded in index metadata. Override only when bridging a non-gte endpoint. | +| `CODEHUB_EMBEDDING_DIMS` | `320` | Expected vector dimension — asserted on every response to catch model-swap drift. | +| `CODEHUB_EMBEDDING_MODEL` | `F2LLM-v2-80M/sagemaker:` | Stable modelId stamp recorded in index metadata. Override only when bridging a non-F2LLM endpoint. | IAM: the caller needs `sagemaker:InvokeEndpoint` on the endpoint ARN — -e.g. `arn:aws:sagemaker:us-east-1::endpoint/gte-modernbert-embed`. +e.g. `arn:aws:sagemaker:us-east-1::endpoint/F2LLM-v2-80M`. 
**Do not mix backends against the same index.** Backends are pinned to a single model identity via the `modelId` stamp in the `embeddings` table; @@ -226,27 +227,29 @@ switching mid-project requires `codehub analyze --rebuild-embeddings`. `--offline` refuses SageMaker and HTTP backends, so offline mode is compatible only with the local ONNX path. -## Storage backend — lbug graph + DuckDB temporal - -The graph tier is always `@ladybugdb/core` (`/.codehub/graph.lbug`); -the temporal tier — cochanges, structured symbol summaries, and the -`codehub query --sql` escape hatch — is always DuckDB -(`/.codehub/temporal.duckdb`). Both files are written on every -`analyze`. There is no `CODEHUB_STORE` env var, no backend probe, no -single-file `graph.duckdb` layout, and no mtime arbitration; if the lbug -binding fails to load, `open()` throws `GraphDbBindingError` and the -operation aborts. - -`IGraphStore` lives only on `GraphDbStore`; `DuckDbStore` implements -`ITemporalStore` only. The segregated interfaces stay because they are -the v1.0 contract for community-fork adapters (AGE / Memgraph / Neo4j / -Neptune target `IGraphStore`; DuckDB owns `ITemporalStore`). Embeddings -live in `graph.lbug` and stream into a per-call DuckDB temp table at -pack time so the byte-identical Parquet sidecar still works. - -See [`docs/adr/0016-duckdb-graph-rip.md`](./docs/adr/0016-duckdb-graph-rip.md) -for the rationale behind ripping out the DuckDB graph backend; it -supersedes ADR 0013 and the DuckDB-as-graph passages of ADR 0011. +## Storage backend — single-file SQLite + +The entire index lives in ONE `/.codehub/store.sqlite` file (WAL), +via Node's built-in `node:sqlite` — graph nodes, edges, embeddings, the +FTS5 BM25 table, and the temporal tables (cochanges, symbol summaries, the +`codehub query --sql` escape hatch). 
One `SqliteStore` class implements +**both** `IGraphStore` and `ITemporalStore`; `openStore()` returns that +single instance as both the `graph` and `temporal` views, so call sites use +`store.graph.X()` / `store.temporal.Y()` unchanged. **Zero native storage +bindings** — `@ladybugdb/core` and `@duckdb/node-api` are both gone, so +there is no `GraphDbBindingError`, no backend probe, and no platform-prebuild +matrix. + +The segregated `IGraphStore` / `ITemporalStore` interfaces stay as the +community-fork escape hatch (AGE / Memgraph / Neo4j / Neptune) — a fork +implements both, on one class or split. Install is zero-native-dep: +`npm i -g @opencodehub/cli` + Node ≥ 24.15, no Docker, no postinstall +compile. (`onnxruntime-web`, the optional WASM embedder, is the only native +dependency — lazy-loaded under `--embeddings`.) + +See [`docs/adr/0019-single-file-sqlite-storage.md`](./docs/adr/0019-single-file-sqlite-storage.md) +for the rationale; it supersedes [ADR 0016](./docs/adr/0016-duckdb-graph-rip.md) +(and, transitively, the native-binding storage of ADRs 0011 / 0013 / 0001). ## Parse runtime — WASM-only, vendored grammars @@ -254,8 +257,9 @@ supersedes ADR 0013 and the DuckDB-as-graph passages of ADR 0011. runtime on the supported Node range (22 and 24). There is no native opt-in: the native `tree-sitter` N-API addon and all 14 `tree-sitter-` npm packages are gone from the install graph, so parsing pulls **zero native -builds and zero GitHub fetches** at install time. (Storage and embeddings -load prebuilt native bindings — see Platform support.) +builds and zero GitHub fetches** at install time. (Storage is pure +`node:sqlite`; the only optional native dep is the WASM embedder — see +Platform support.) 
All 15 grammar `.wasm` blobs are vendored at `packages/ingestion/vendor/wasms/`, built from the grammar sources diff --git a/SPECS.md b/SPECS.md index d26ccf7e..bbba8926 100644 --- a/SPECS.md +++ b/SPECS.md @@ -17,7 +17,7 @@ first 14 plus a regex provider for fixed-format COBOL, runs SCIP indexers for TypeScript/JavaScript, Python, Go, Rust, and Java to upgrade tree-sitter heuristic edges to compiler-grade edges, clusters the graph into Communities and Processes, and optionally populates embeddings from a -pinned gte-modernbert-base ONNX model (fp32 ~596 MB or int8 ~150 MB) or +pinned F2LLM-v2-80M ONNX model (320-dim; fp32 ~321 MB or int8 ~81 MB) or an OpenAI-compatible HTTP endpoint. At query time it exposes an MCP server with 28 tools (`query`, `context`, @@ -171,7 +171,7 @@ last-analyzed commit) atomically and expose it via `getMeta`. BM25 + ANN search, fuse results with reciprocal rank fusion (`DEFAULT_RRF_K`), and return symbols grouped by their participating `Process`. -4.2 Where gte-modernbert-base weights are absent and no HTTP embedder is +4.2 Where F2LLM-v2-80M weights are absent and no HTTP embedder is configured, the system shall fall back to BM25-only search and log a one-shot `[mcp] hybrid:` warning to stderr. @@ -264,8 +264,9 @@ and `sql`. claude-code, cursor, codex, windsurf, and opencode; pass `--undo` to restore the most recent `.bak`. -7.4 The `setup --embeddings` command shall download gte-modernbert-base -weights (fp32 or int8) with SHA256 pins validated against +7.4 The `setup --embeddings` command shall download the F2LLM-v2-80M +ONNX export (fp32 or int8) — a custom-exported artifact hosted as a +GitHub release asset — with SHA256 pins validated against `model-pins.ts`. 
7.5 The `setup --plugin` command shall copy the bundled plugin into diff --git a/docs/adr/0001-storage-backend.md b/docs/adr/0001-storage-backend.md index b456471d..64f10131 100644 --- a/docs/adr/0001-storage-backend.md +++ b/docs/adr/0001-storage-backend.md @@ -1,6 +1,17 @@ # ADR 0001 — Storage backend selection -Status: **Accepted (superseded prior SQLite recommendation)** — 2026-04-18 +Status: **Superseded** — current storage is [ADR 0019 — Single-file SQLite +storage](./0019-single-file-sqlite-storage.md) (2026-06-22). This ADR +selected **DuckDB** as the embedded backend; that decision was unwound over +[ADR 0011](./0011-graph-db-backend.md) → [ADR 0013-m7](./0013-m7-default-flip-and-abstraction.md) +→ [ADR 0016](./0016-duckdb-graph-rip.md) → ADR 0019, which lands on one +`store.sqlite` file (Node built-in `node:sqlite`, **zero** native storage +bindings — DuckDB included). Ironically ADR 0019 returns to the SQLite +recommendation this ADR originally rejected. Read this ADR for the original +license/determinism/binding-availability criteria only; the chosen engine is +obsolete. + +> Originally: **Accepted (superseded prior SQLite recommendation)** — 2026-04-18 ## Context diff --git a/docs/adr/0011-graph-db-backend.md b/docs/adr/0011-graph-db-backend.md index 4d48ade9..e3365085 100644 --- a/docs/adr/0011-graph-db-backend.md +++ b/docs/adr/0011-graph-db-backend.md @@ -1,10 +1,13 @@ # ADR 0011 — Graph-DB backend (LadybugDB phase-1) -- Status: **Partially superseded** by [ADR 0016](./0016-duckdb-graph-rip.md) - on 2026-05-16. The "DuckDB-default plus LadybugDB opt-in" framing is - obsolete; lbug is the unconditional graph backend after the rip. The - LadybugDB integration shape and `IGraphStore` design introduced here - are unchanged. +- Status: **Superseded** — current storage is [ADR 0019 — Single-file + SQLite storage](./0019-single-file-sqlite-storage.md) (2026-06-22). 
+ Chain: this ADR (LadybugDB phase-1) → [ADR 0016](./0016-duckdb-graph-rip.md) + (lbug-only graph, made the "DuckDB-default + LadybugDB opt-in" framing + obsolete, 2026-05-16) → ADR 0019 (one `store.sqlite`, NO native bindings — + `@ladybugdb/core` itself is now gone). The `IGraphStore` design introduced + here survives ADR 0019 as a community-fork escape hatch; the LadybugDB + binding does not. Read this ADR for historical rationale only. - Was: **Accepted** on 2026-05-05 and flipped on the M3 merge. - Authors: Laith Al-Saadoon + Claude. - Branch: `feat/v1-m3-m4`. diff --git a/docs/adr/0013-m7-default-flip-and-abstraction.md b/docs/adr/0013-m7-default-flip-and-abstraction.md index 76230e2e..44825f99 100644 --- a/docs/adr/0013-m7-default-flip-and-abstraction.md +++ b/docs/adr/0013-m7-default-flip-and-abstraction.md @@ -5,12 +5,14 @@ > in-tree because they were authored in parallel branches and accepted > on the same release. The next ADR uses 0014. -- Status: **Superseded** by [ADR 0016](./0016-duckdb-graph-rip.md) - on 2026-05-16. The auto-probe, dual-artifact arbitration, and - `CODEHUB_STORE` resolver introduced here are gone. lbug is the only - graph backend; DuckDB serves the temporal tier. The - IGraphStore/ITemporalStore segregation survives because community - adapters (AGE, Memgraph, Neo4j, Neptune) target it. +- Status: **Superseded** — current storage is [ADR 0019 — Single-file + SQLite storage](./0019-single-file-sqlite-storage.md) (2026-06-22). + Chain: this ADR → [ADR 0016](./0016-duckdb-graph-rip.md) (2026-05-16, + removed the auto-probe / dual-artifact arbitration / `CODEHUB_STORE` + resolver introduced here) → ADR 0019 (one `store.sqlite`, no native + bindings). The IGraphStore/ITemporalStore segregation introduced here + survives all the way to ADR 0019 as the community-fork escape hatch + (AGE, Memgraph, Neo4j, Neptune); everything else here is historical. 
- Was: **Accepted** on 2026-05-09 and flipped on the `feat/v1-finalize-track-a` merge (PR #71). - Authors: Laith Al-Saadoon + Claude. diff --git a/docs/adr/0014-scip-references-and-embedder-fingerprint.md b/docs/adr/0014-scip-references-and-embedder-fingerprint.md index 869e3e3a..df601ff9 100644 --- a/docs/adr/0014-scip-references-and-embedder-fingerprint.md +++ b/docs/adr/0014-scip-references-and-embedder-fingerprint.md @@ -1,10 +1,21 @@ # ADR 0014 — SCIP REFERENCES + TYPE_OF emission and embedder-fingerprint refusal -**Status**: Accepted +**Status**: Accepted (still in force) **Date**: 2026-05-09 **Supersedes**: none **Superseded by**: none +> Note (2026-06-26): the embedder-fingerprint mechanism this ADR introduced +> — persist `embedder_model_id`, refuse mismatched queries via +> `assertEmbedderCompatible` — is unchanged and is precisely what guards the +> later embedding-model swap from `gte-modernbert-base` (768-dim) to +> `F2LLM-v2-80M` (320-dim). The `gte-modernbert-base` / `768` references +> below are the contemporaneous examples; the dim/model are now 320 / +> `f2llm-v2-80m/*` but the decision and the comparator are identical. The +> `store_meta` storage substrate referenced here (DuckDB) was later replaced +> per [ADR 0019](./0019-single-file-sqlite-storage.md); the column and +> semantics carried over to `store.sqlite` verbatim. + ## Context Two unrelated holes in v1.0 finalize, both routing through a shared one-time graphHash content delta. They land in a single ADR per spec.md§Q7 because the fixture-regeneration cost is paid once. diff --git a/docs/adr/0016-duckdb-graph-rip.md b/docs/adr/0016-duckdb-graph-rip.md index f766be6d..5fb3ac21 100644 --- a/docs/adr/0016-duckdb-graph-rip.md +++ b/docs/adr/0016-duckdb-graph-rip.md @@ -1,6 +1,16 @@ # ADR 0016 — Rip out the DuckDB graph backend; lbug-only graph, DuckDB temporal-only -- Status: **Accepted** — 2026-05-16. 
+- Status: **Superseded** by [ADR 0019 — Single-file SQLite storage](./0019-single-file-sqlite-storage.md) + on 2026-06-22, **in its entirety**. ADR 0019 removed BOTH native bindings + this ADR settled on (`@ladybugdb/core` for the graph tier and + `@duckdb/node-api` for the temporal tier) and replaced the pair with one + `store.sqlite` file via Node's built-in `node:sqlite`. The segregated + `IGraphStore` / `ITemporalStore` interfaces this ADR preserved for + community forks survive — both are now implemented by a single + `SqliteStore` class. Read this ADR only for the historical rationale of + the lbug-graph / DuckDB-temporal split; **do not** treat its decision as + current. +- Was: **Accepted** — 2026-05-16. - Authors: Laith Al-Saadoon + Claude. - Branch: `feat/duckdb-graph-rip`. - Supersedes: [ADR 0013 — M7 default flip and storage abstraction](./0013-m7-default-flip-and-abstraction.md) diff --git a/lefthook.yml b/lefthook.yml index 3a12f81f..44e15322 100644 --- a/lefthook.yml +++ b/lefthook.yml @@ -73,6 +73,7 @@ pre-push: # Guard the verdict gate on a present index so the hook degrades # gracefully on dev boxes that haven't run `codehub analyze` yet — # mirrors the SKIP behaviour of scripts/pack-determinism-audit.sh. + # Index path is the single-file `store.sqlite` (ADR 0019). # # The verdict CLI exit ladder is 0=auto_merge, 1=single_review, # 2=dual_review/expert_review, 3=block. Those tiers are review-routing @@ -82,8 +83,8 @@ pre-push: # surface the verdict output and gate solely on exit code 3. - name: verdict run: | - if [ ! -f .codehub/graph.lbug ]; then - echo "verdict skipped: no .codehub/graph.lbug (run 'mise run och:self-analyze' first)" + if [ ! 
-f .codehub/store.sqlite ]; then + echo "verdict skipped: no .codehub/store.sqlite (run 'mise run och:self-analyze' first)" exit 0 fi set +e diff --git a/packages/cli/src/commands/analyze.test.ts b/packages/cli/src/commands/analyze.test.ts index f86aa807..98226567 100644 --- a/packages/cli/src/commands/analyze.test.ts +++ b/packages/cli/src/commands/analyze.test.ts @@ -409,11 +409,11 @@ test("buildStoreMeta: stamps embedderModelId when the embedder ran with a model edgeCount: 200, stats: {}, cacheSizeBytes: 0, - embeddings: { ranEmbedder: true, embeddingsModelId: "gte-modernbert-base/fp32" }, + embeddings: { ranEmbedder: true, embeddingsModelId: "f2llm-v2-80m/fp32" }, }); assert.equal( meta.embedderModelId, - "gte-modernbert-base/fp32", + "f2llm-v2-80m/fp32", "the embedder tag must round-trip into StoreMeta so the fingerprint guard can fire", ); }); diff --git a/packages/cli/src/commands/analyze.ts b/packages/cli/src/commands/analyze.ts index 3e0cd1c9..3d1ec906 100644 --- a/packages/cli/src/commands/analyze.ts +++ b/packages/cli/src/commands/analyze.ts @@ -31,6 +31,7 @@ import { type RelationType, SCHEMA_VERSION, } from "@opencodehub/core-types"; +import { embedderModelId } from "@opencodehub/embedder"; import { pipeline } from "@opencodehub/ingestion"; import { type BulkLoadProgressEvent, @@ -260,9 +261,20 @@ export async function runAnalyze(path: string, opts: AnalyzeOptions = {}): Promi // re-embeds everything, so the adapter would do no useful work. When the // prior DB is absent the adapter returns undefined and the phase // degrades to "every chunk is new". + // + // Migration safety: the content-hash skip keys on TEXT only, so swapping + // the embedder (e.g. gte-modernbert-base/768-dim → f2llm-v2-80m/320-dim) + // would otherwise skip every unchanged node and leave stale-dimension + // vectors mixed with the new ones. 
Gate the cache on a model-id match — + // when the prior store's `embedderModelId` differs from the active + // embedder, the adapter is suppressed (full re-embed; INSERT OR REPLACE + // overwrites every row at the new dim). + const activeEmbedderModelId = embedderModelId( + opts.embeddingsVariant === "int8" ? "int8" : "fp32", + ); const embeddingHashAdapter = opts.embeddings === true && opts.force !== true - ? await openEmbeddingHashCacheAdapter(repoPath) + ? await openEmbeddingHashCacheAdapter(repoPath, activeEmbedderModelId) : undefined; // Resolve `--max-summaries auto` against the prior run's callable count, @@ -936,6 +948,7 @@ async function openSummaryCacheAdapter( */ async function openEmbeddingHashCacheAdapter( repoPath: string, + activeModelId: string, ): Promise< { adapter: pipeline.EmbeddingHashCacheAdapter; close: () => Promise } | undefined > { @@ -948,6 +961,25 @@ async function openEmbeddingHashCacheAdapter( await store.close().catch(() => {}); return undefined; } + // Migration guard: if the prior index was built by a different embedder, + // its content_hashes describe vectors of the wrong model/dimension. + // Suppress the cache so every node is re-embedded (full overwrite) rather + // than skipped — preventing a silent mixed-dimension store. + try { + const meta = await store.graph.getMeta(); + const priorModelId = meta?.embedderModelId; + if (priorModelId !== undefined && priorModelId !== activeModelId) { + log( + `codehub analyze: embedder changed (${priorModelId} → ${activeModelId}); ` + + "re-embedding all symbols (content-hash cache suppressed).", + ); + await store.close().catch(() => {}); + return undefined; + } + } catch { + // Meta unreadable (fresh/legacy store) — fall through; the cache list() + // below already tolerates an empty/erroring store. 
+ } return { adapter: { // listEmbeddingHashes is on the graph-tier interface — embeddings diff --git a/packages/cli/src/commands/doctor.test.ts b/packages/cli/src/commands/doctor.test.ts index 8ad254d8..947c73a5 100644 --- a/packages/cli/src/commands/doctor.test.ts +++ b/packages/cli/src/commands/doctor.test.ts @@ -120,7 +120,7 @@ test("embedder weights check reports warn when no model present", async () => { test("embedder weights check reports ok when fp32 weights present", async () => { const home = await mkdtemp(join(tmpdir(), "codehub-doctor-emb-ok-")); try { - const base = join(home, ".codehub", "models", "gte-modernbert-base", "fp32"); + const base = join(home, ".codehub", "models", "f2llm-v2-80m", "fp32"); await mkdir(base, { recursive: true }); await writeFile(join(base, "model.onnx"), "fake weights"); const checks = buildChecks({ home, skipNative: true }); @@ -141,7 +141,7 @@ test("embedder weights check reports ok when fp32 weights present", async () => test("embedder weights check reports ok when int8 weights present (underscore filename)", async () => { const home = await mkdtemp(join(tmpdir(), "codehub-doctor-emb-int8-")); try { - const base = join(home, ".codehub", "models", "gte-modernbert-base", "int8"); + const base = join(home, ".codehub", "models", "f2llm-v2-80m", "int8"); await mkdir(base, { recursive: true }); // Canonical filename from embedder/src/paths.ts:modelFileName("int8"). 
await writeFile(join(base, "model_int8.onnx"), "fake int8 weights"); @@ -162,7 +162,7 @@ test("embedder weights check reports ok when int8 weights present (underscore fi test("embedder weights check reports warn when only hyphenated int8 file is present", async () => { const home = await mkdtemp(join(tmpdir(), "codehub-doctor-emb-hyphen-")); try { - const base = join(home, ".codehub", "models", "gte-modernbert-base", "int8"); + const base = join(home, ".codehub", "models", "f2llm-v2-80m", "int8"); await mkdir(base, { recursive: true }); await writeFile(join(base, "model-int8.onnx"), "wrong filename"); const checks = buildChecks({ home, skipNative: true }); diff --git a/packages/cli/src/commands/doctor.ts b/packages/cli/src/commands/doctor.ts index 8621459d..b36388d8 100644 --- a/packages/cli/src/commands/doctor.ts +++ b/packages/cli/src/commands/doctor.ts @@ -583,7 +583,9 @@ function embedderWeightsCheck(home: string): Check { // NOT hyphen). A historical hyphenated path name lingered here and // caused false-negative `warn`s for users who had int8 weights on // disk. - const base = join(home, ".codehub", "models", "gte-modernbert-base"); + // Subdir must match `embedder/src/paths.ts:MODEL_SUBDIR` + // (`models/f2llm-v2-80m`); a mismatch silently always-warns. + const base = join(home, ".codehub", "models", "f2llm-v2-80m"); const fp32 = join(base, "fp32", "model.onnx"); const int8 = join(base, "int8", "model_int8.onnx"); const fp32Ok = await fileExists(fp32); diff --git a/packages/cli/src/commands/query.test.ts b/packages/cli/src/commands/query.test.ts index 7adba79a..14ec185b 100644 --- a/packages/cli/src/commands/query.test.ts +++ b/packages/cli/src/commands/query.test.ts @@ -201,6 +201,11 @@ class FakeEmbedder implements Embedder { async embed(_text: string): Promise { return new Float32Array([0.1, 0.2, 0.3, 0.4]); } + // F2LLM gained a query-only `embedQuery` path; the fake aliases it to + // `embed` since the hybrid path only needs a stable Float32Array back. 
+ async embedQuery(text: string): Promise { + return this.embed(text); + } async embedBatch(texts: readonly string[]): Promise { return texts.map(() => new Float32Array([0.1, 0.2, 0.3, 0.4])); } @@ -457,7 +462,7 @@ test("cli query: embeddings populated + embedder fails → warn + BM25 fallback, ...hooksFor(handle, "/tmp/fake"), openEmbedder: async () => { const err = new Error( - "gte-modernbert-base weights not found. Run `codehub setup --embeddings`.", + "F2LLM-v2-80M weights not found. Run `codehub setup --embeddings`.", ); (err as unknown as { code: string }).code = "EMBEDDER_NOT_SETUP"; throw err; @@ -665,7 +670,7 @@ test("cli query: embedder mismatch sets exit code 2 and still closes embedder + ], vectorRows: [{ nodeId: "F:foo", distance: 0.1 }], // Persisted model id differs from the active embedder's "fake-embedder/test". - metaModelId: "gte-modernbert-base/fp32", + metaModelId: "f2llm-v2-80m/fp32", }); const fake = new FakeEmbedder(); const prevExitCode = process.exitCode; @@ -707,7 +712,7 @@ test("cli query: --force-backend-mismatch bypasses the refusal and runs hybrid", ], vectorRows: [{ nodeId: "F:foo", distance: 0.1 }], nodes, - metaModelId: "gte-modernbert-base/fp32", + metaModelId: "f2llm-v2-80m/fp32", }); const fake = new FakeEmbedder(); const prevExitCode = process.exitCode; diff --git a/packages/cli/src/commands/query.ts b/packages/cli/src/commands/query.ts index 356a78bc..38f955eb 100644 --- a/packages/cli/src/commands/query.ts +++ b/packages/cli/src/commands/query.ts @@ -21,7 +21,7 @@ * * Hybrid ranking priority matches the MCP tool: * 1. `CODEHUB_EMBEDDING_URL` + `CODEHUB_EMBEDDING_MODEL` → HTTP embedder. - * 2. Otherwise local ONNX gte-modernbert-base weights. + * 2. Otherwise local ONNX F2LLM-v2-80M weights. * 3. On failure to open (missing weights, unreachable HTTP) → warn + BM25. 
*/ @@ -59,7 +59,7 @@ export interface QueryRuntimeHooks { readonly openStore?: (opts: QueryOptions) => Promise; /** * Embedder factory — production uses the default lazy-import path; tests - * inject a fake so they don't need gte-modernbert-base weights on disk. Any + * inject a fake so they don't need F2LLM-v2-80M weights on disk. Any * throw is caught by {@link tryOpenEmbedder} and collapses to BM25. */ readonly openEmbedder?: () => Promise; diff --git a/packages/cli/src/commands/setup-embeddings.test.ts b/packages/cli/src/commands/setup-embeddings.test.ts index 37232759..4fd34fc1 100644 --- a/packages/cli/src/commands/setup-embeddings.test.ts +++ b/packages/cli/src/commands/setup-embeddings.test.ts @@ -2,7 +2,7 @@ * Happy-path test for `codehub setup --embeddings` wiring. * * Uses the public `runSetupEmbeddings` entry and a stub fetch + an override - * pin manifest so we never hit the real HuggingFace CDN. + * pin manifest so we never hit the real GitHub release-asset CDN. */ import { strict as assert } from "node:assert"; @@ -13,7 +13,7 @@ import { join } from "node:path"; import { ReadableStream } from "node:stream/web"; import { describe, it } from "node:test"; -import { GTE_MODERNBERT_BASE_PINS } from "@opencodehub/embedder"; +import { F2LLM_V2_80M_PINS } from "@opencodehub/embedder"; import { runSetupEmbeddings } from "./setup.js"; @@ -49,7 +49,7 @@ describe("runSetupEmbeddings", { skip: platformSkip }, () => { // Build a tiny per-file body keyed by pin name; substitute our SHAs into // the manifest so the downloader's verification passes. 
const bodies = new Map(); - const originals = GTE_MODERNBERT_BASE_PINS.fp32.files; + const originals = F2LLM_V2_80M_PINS.fp32.files; const replaced = originals.map((f, idx) => { const body = new TextEncoder().encode(`pin-${idx}-${f.name}`); bodies.set(f.url, body); @@ -61,7 +61,7 @@ describe("runSetupEmbeddings", { skip: platformSkip }, () => { }; }); - const mutable = GTE_MODERNBERT_BASE_PINS as unknown as { + const mutable = F2LLM_V2_80M_PINS as unknown as { fp32: { variant: "fp32"; files: readonly (typeof replaced)[number][] }; }; const saved = mutable.fp32; diff --git a/packages/cli/src/commands/setup.ts b/packages/cli/src/commands/setup.ts index c8048f17..2f79d507 100644 --- a/packages/cli/src/commands/setup.ts +++ b/packages/cli/src/commands/setup.ts @@ -246,9 +246,9 @@ async function writeSingle( * allows the CLI `log`/`warn` sinks to be overridden for tests. */ export interface SetupEmbeddingsOptions { - /** Variant to install. Defaults to `fp32` (~596 MB). */ + /** Variant to install. Defaults to `fp32` (~332 MB). */ readonly variant?: "fp32" | "int8"; - /** Custom model directory. Defaults to `~/.codehub/models/gte-modernbert-base//`. */ + /** Custom model directory. Defaults to `~/.codehub/models/f2llm-v2-80m//`. */ readonly modelDir?: string; /** Re-download even if files already match their SHA256 pin. */ readonly force?: boolean; @@ -264,7 +264,8 @@ export interface SetupEmbeddingsOptions { /** * Public entry point for `codehub setup --embeddings`. * - * Downloads the five pinned gte-modernbert-base files into the target dir with + * Downloads the three pinned F2LLM-v2-80M files (ONNX weights + + * tokenizer.json + tokenizer_config.json) into the target dir with * streaming SHA256 verification and atomic rename. Returns the downloader * summary so programmatic callers can assert on byte counts and locations. 
*/ @@ -277,7 +278,7 @@ export async function runSetupEmbeddings( log( `codehub setup --embeddings: starting ${variant} download ` + - `(${variant === "fp32" ? "~90 MB" : "~23 MB"})`, + `(${variant === "fp32" ? "~332 MB" : "~92 MB"})`, ); const downloaderOpts: DownloadEmbedderOptions = { diff --git a/packages/cli/src/embedder-downloader.test.ts b/packages/cli/src/embedder-downloader.test.ts index 515c9dd7..dede18b2 100644 --- a/packages/cli/src/embedder-downloader.test.ts +++ b/packages/cli/src/embedder-downloader.test.ts @@ -17,7 +17,7 @@ import { join } from "node:path"; import { ReadableStream } from "node:stream/web"; import { describe, it } from "node:test"; -import { GTE_MODERNBERT_BASE_PINS } from "@opencodehub/embedder"; +import { F2LLM_V2_80M_PINS } from "@opencodehub/embedder"; import { downloadEmbedderWeights, @@ -98,7 +98,7 @@ function makeFetchWith( } /** - * Monkeypatch GTE_MODERNBERT_BASE_PINS[variant] for a single test. Because the + * Monkeypatch F2LLM_V2_80M_PINS[variant] for a single test. Because the * pins are `readonly`, we rebuild the structure by casting into a mutable * shape. The test restores on completion. */ @@ -107,8 +107,8 @@ function withOverridePins( newFiles: readonly { name: string; url: string; sizeBytes: number; sha256: string }[], fn: () => Promise, ): Promise { - const original = GTE_MODERNBERT_BASE_PINS[variant]; - const mutable = GTE_MODERNBERT_BASE_PINS as unknown as { + const original = F2LLM_V2_80M_PINS[variant]; + const mutable = F2LLM_V2_80M_PINS as unknown as { [k in "fp32" | "int8"]: { variant: "fp32" | "int8"; files: readonly { name: string; url: string; sizeBytes: number; sha256: string }[]; diff --git a/packages/cli/src/embedder-downloader.ts b/packages/cli/src/embedder-downloader.ts index b3078e83..e29a7953 100644 --- a/packages/cli/src/embedder-downloader.ts +++ b/packages/cli/src/embedder-downloader.ts @@ -1,8 +1,8 @@ /** - * SHA256-pinned downloader for gte-modernbert-base weights. 
+ * SHA256-pinned downloader for F2LLM-v2-80M weights. * * Resolves the target directory via {@link resolveModelDir}, then for each - * pinned file in {@link GTE_MODERNBERT_BASE_PINS}: + * pinned file in {@link F2LLM_V2_80M_PINS}: * 1. Skip when the file already exists and its SHA256 matches the pin. * 2. Otherwise stream-download to `.tmp`, hash during write, verify * hash, and atomically rename to the final path. @@ -12,7 +12,7 @@ * error — the `.tmp` file is deleted and the error thrown. We never ship * weights that don't match the pin. * - * All disk access is streaming; we never buffer a 596 MB file in memory. + * All disk access is streaming; we never buffer a 321 MB file in memory. */ import { createHash } from "node:crypto"; @@ -24,7 +24,7 @@ import { pipeline as streamPipeline } from "node:stream/promises"; import type { ReadableStream as NodeReadableStream } from "node:stream/web"; import { setTimeout as delay } from "node:timers/promises"; -import { GTE_MODERNBERT_BASE_PINS, type PinnedFile, resolveModelDir } from "@opencodehub/embedder"; +import { F2LLM_V2_80M_PINS, type PinnedFile, resolveModelDir } from "@opencodehub/embedder"; /** Fetch function signature for dependency injection (tests mock this). 
*/ export type FetchFn = typeof fetch; @@ -311,7 +311,7 @@ export async function downloadEmbedderWeights( const modelDir = resolveModelDir(opts.modelDir, opts.variant); await mkdir(modelDir, { recursive: true }); - const files = GTE_MODERNBERT_BASE_PINS[opts.variant].files; + const files = F2LLM_V2_80M_PINS[opts.variant].files; let downloaded = 0; let skipped = 0; let totalBytes = 0; diff --git a/packages/cli/src/index.ts b/packages/cli/src/index.ts index cb9ddeb5..a205605b 100644 --- a/packages/cli/src/index.ts +++ b/packages/cli/src/index.ts @@ -49,7 +49,7 @@ program .description("Index a repository at [path] (default: current directory)") .option("--force", "Ignore registry cache and re-run the pipeline") .option("--embeddings", "Embed symbols and populate the embeddings table in store.sqlite") - .option("--embeddings-int8", "Use the int8 embedder variant (~23 MB) instead of fp32") + .option("--embeddings-int8", "Use the int8 embedder variant (~81 MB) instead of fp32 (~321 MB)") .option( "--granularity ", "Hierarchical embedding tiers to emit, comma-separated. Values: symbol, file, community. Default: symbol. 
Example: --granularity symbol,file,community", @@ -241,8 +241,8 @@ program ) .option("--force", "Overwrite an existing codehub entry without prompting; re-download weights") .option("--undo", "Restore the most recent .bak next to each config") - .option("--embeddings", "Download gte-modernbert-base ONNX weights (SHA256-pinned)") - .option("--int8", "Use the int8 weight variant (~150 MB) instead of fp32 (~596 MB)") + .option("--embeddings", "Download F2LLM-v2-80M ONNX weights (SHA256-pinned)") + .option("--int8", "Use the int8 weight variant (~92 MB) instead of fp32 (~332 MB)") .option("--model-dir ", "Override the target directory for embedder weights") .option("--plugin", "Install the Claude Code plugin to ~/.claude/plugins/opencodehub/") .option( diff --git a/packages/docs/src/content/docs/architecture/embeddings.md b/packages/docs/src/content/docs/architecture/embeddings.md index e3cf8111..5ec0841e 100644 --- a/packages/docs/src/content/docs/architecture/embeddings.md +++ b/packages/docs/src/content/docs/architecture/embeddings.md @@ -56,11 +56,21 @@ flowchart LR ### ONNX local -The default. Deterministic 768-dim embeddings from -`Alibaba-NLP/gte-modernbert-base`. Weights live in the directory -managed by `@opencodehub/embedder/paths`; missing weights throw +The default. Deterministic 320-dim embeddings from +`codefuse-ai/F2LLM-v2-80M` (a Qwen3-0.6B-Base derivative, 80.1M params). +Last-token pooling and L2 normalization are baked into the ONNX graph, +which emits a single already-unit-length output `embedding` of shape +`[B, 320]`. The custom ONNX export is hosted as a SHA256-pinned GitHub +release asset; weights live in the directory managed by +`@opencodehub/embedder/paths`; missing weights throw `EmbedderNotSetupError`, which `codehub setup --embeddings` fixes. +**Query/document asymmetry.** Documents are embedded raw. 
Queries get an +`Instruct: {instruction}\nQuery: {query}` prefix (instruction: "Given a +code search query, retrieve the most relevant code snippet.") via the +`embedQuery()` method on the Embedder interface, applied only at the +hybrid-search query seam. + A Piscina worker pool (`embedder-pool.ts`) spins up when `embeddingsWorkers >= 2`, running ONNX inference across worker threads. Single-worker mode is the default and is good enough for @@ -73,7 +83,7 @@ wire format: - `CODEHUB_EMBEDDING_URL` — base URL (`/embeddings` is appended). - `CODEHUB_EMBEDDING_MODEL` — model id passed through verbatim. -- `CODEHUB_EMBEDDING_DIMS` — dimensions (default 768). +- `CODEHUB_EMBEDDING_DIMS` — dimensions (default 320). - `CODEHUB_EMBEDDING_API_KEY` — bearer token. 30 s timeout, 2 retries with 1 s backoff. @@ -89,8 +99,8 @@ carries on. ModelId stamping is explicit to prevent silent cross-backend pollution of the `embeddings.model` column: SageMaker rows carry -`gte-modernbert-base/sagemaker:`, ONNX rows carry -`gte-modernbert-base/fp32`, HTTP rows pass the configured model id +`F2LLM-v2-80M/sagemaker:`, ONNX rows carry +`F2LLM-v2-80M/fp32`, HTTP rows pass the configured model id through. See the durable lesson linked below for the full pattern (dynamic import, structural-typing seam, 413 split-retry). @@ -156,7 +166,7 @@ enabling `hnsw_acorn` enables it. `["symbol"]`). - `PipelineOptions.embeddingsWorkers` — Piscina pool size for ONNX. - `PipelineOptions.embeddingsBatchSize` — default 32. -- `DuckDbStoreOptions.embeddingDim` — default 768. +- `SqliteStoreOptions.embeddingDim` — default 320. - Env vars: `CODEHUB_EMBEDDING_SAGEMAKER_ENDPOINT` / `_REGION` / `_MODEL` / `_DIMS`; `CODEHUB_EMBEDDING_URL` / `_MODEL` / `_DIMS` / `_API_KEY`. 
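Reviewer note on the `embeddings.md` hunk above: the query/document asymmetry it documents can be sketched as below. This is a minimal illustration under stated assumptions, not the contents of `query-prefix.ts`. The names `QUERY_INSTRUCTION`, `formatQuery`, and `formatDocument` are hypothetical; only the instruction string and the `Instruct: {instruction}\nQuery: {query}` template are taken from the patch text.

```typescript
// Hypothetical sketch of the query-only prefix described in embeddings.md.
// Names are illustrative; the instruction text and the template are quoted
// from the documentation added by this patch.
const QUERY_INSTRUCTION =
  "Given a code search query, retrieve the most relevant code snippet.";

/** Wrap a search query in the instruction template (query side only). */
function formatQuery(
  query: string,
  instruction: string = QUERY_INSTRUCTION,
): string {
  return `Instruct: ${instruction}\nQuery: ${query}`;
}

/** Documents are embedded raw, so the document side is the identity. */
function formatDocument(text: string): string {
  return text;
}
```

Under this reading, an `embedQuery()` implementation would pass `formatQuery(text)` to the model while `embed()` and `embedBatch()` pass text through unchanged. That matches the patch's note that remote HTTP backends alias `embedQuery` to `embed` and leave any prefixing to the server.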
diff --git a/packages/docs/src/content/docs/architecture/monorepo-map.md b/packages/docs/src/content/docs/architecture/monorepo-map.md index d41330c9..1bce8010 100644 --- a/packages/docs/src/content/docs/architecture/monorepo-map.md +++ b/packages/docs/src/content/docs/architecture/monorepo-map.md @@ -19,7 +19,7 @@ package is a library imported by `cli`, `mcp`, `ingestion`, or | `@opencodehub/cli` | `packages/cli` | The `codehub` binary (analyze, setup, mcp, query, context, impact, sql, group, scan, verdict, code-pack, ...). | | `@opencodehub/cobol-proleap` | `packages/cobol-proleap` | Optional JVM ProLeap deep-parse bridge for COBOL — gated behind `--allow-build-scripts=proleap`. | | `@opencodehub/core-types` | `packages/core-types` | Shared graph schema, `LanguageId`, `RelationType`, determinism primitives. | -| `@opencodehub/embedder` | `packages/embedder` | Deterministic ONNX embedder (`gte-modernbert-base`), modelId fingerprint, three-backend cascade. | +| `@opencodehub/embedder` | `packages/embedder` | Deterministic ONNX embedder (`F2LLM-v2-80M`, 320-dim), modelId fingerprint, three-backend cascade. | | `@opencodehub/frameworks` | `packages/frameworks` | Five-stage framework detector (manifest → lockfile → config-AST → folder → import/SCIP) over a curated registry. | | `@opencodehub/ingestion` | `packages/ingestion` | The indexing pipeline (parse, resolve, scip-index, embeddings, communities, processes, summaries, ...). | | `@opencodehub/mcp` | `packages/mcp` | The stdio MCP server, 28 tool registrations (all read-only with respect to user source), 7 resources, the error envelope, the staleness `_meta` block. | diff --git a/packages/docs/src/content/docs/reference/cli.md b/packages/docs/src/content/docs/reference/cli.md index d89928de..0d267671 100644 --- a/packages/docs/src/content/docs/reference/cli.md +++ b/packages/docs/src/content/docs/reference/cli.md @@ -88,8 +88,8 @@ codehub setup | `--editors ` | all | `claude-code,cursor,codex,windsurf,opencode`. 
| | `--force` | off | Overwrite existing entries; re-download weights. | | `--undo` | off | Restore the most recent `.bak` next to each config. | -| `--embeddings` | off | Download `gte-modernbert-base` ONNX weights (SHA256-pinned). | -| `--int8` | off | Use the int8 weight variant (~150 MB) instead of fp32 (~596 MB). | +| `--embeddings` | off | Download `F2LLM-v2-80M` ONNX weights (SHA256-pinned GitHub release asset). | +| `--int8` | off | Use the int8 weight variant (~81 MB) instead of fp32 (~321 MB). | | `--model-dir ` | — | Override the target directory for embedder weights. | | `--plugin` | off | Install the Claude Code plugin to `~/.claude/plugins/opencodehub/`. | diff --git a/packages/docs/src/content/docs/reference/configuration.md b/packages/docs/src/content/docs/reference/configuration.md index 3537fcea..50fe8e82 100644 --- a/packages/docs/src/content/docs/reference/configuration.md +++ b/packages/docs/src/content/docs/reference/configuration.md @@ -51,11 +51,11 @@ that resolves wins; the others are ignored. | `CODEHUB_EMBEDDING_SAGEMAKER_REGION` | Override the AWS region for the SageMaker call. | | `CODEHUB_EMBEDDING_URL` | Base URL for an OpenAI-compatible HTTP endpoint (Infinity, vLLM, TEI, Ollama, LM Studio, OpenAI). `/embeddings` is appended. | | `CODEHUB_EMBEDDING_MODEL` | Model id passed through to the HTTP endpoint verbatim. | -| `CODEHUB_EMBEDDING_DIMS` | Dimensionality of the embedding model. Default 768. | +| `CODEHUB_EMBEDDING_DIMS` | Dimensionality of the embedding model. Default 320. | | `CODEHUB_EMBEDDING_API_KEY` | Bearer token sent as `Authorization: Bearer ...`. | When none of the above are set, the local ONNX backend -(`gte-modernbert-base`, deterministic, offline-safe) is used. +(`F2LLM-v2-80M`, 320-dim, deterministic, offline-safe) is used. 
### Other toggles diff --git a/packages/embedder/CHANGELOG.md b/packages/embedder/CHANGELOG.md index e4c5c80a..32859387 100644 --- a/packages/embedder/CHANGELOG.md +++ b/packages/embedder/CHANGELOG.md @@ -1,5 +1,19 @@ # Changelog +## [0.2.0](https://github.com/theagenticguy/opencodehub/compare/embedder-v0.1.3...embedder-v0.2.0) (2026-06-26) + + +### ⚠ BREAKING CHANGES + +* **embedder:** swap the local ONNX model from `gte-modernbert-base` (768-dim) to `codefuse-ai/F2LLM-v2-80M` (320-dim). The dimension change is incompatible with existing stores — re-index with `codehub analyze --embeddings`. The fingerprint guard already refuses queries against a stale store on a `modelId` mismatch. + + +### Features + +* **embedder:** replace gte-modernbert-base with `codefuse-ai/F2LLM-v2-80M` (Qwen3-0.6B-Base derivative, 80.1M params, 320-dim). Last-token pooling + L2 normalization are baked into the ONNX graph — the graph emits a single already-unit-length `embedding` output of shape `[B, 320]`. +* **embedder:** add `embedQuery()` to the Embedder interface for query/document asymmetry — queries get an `Instruct: {instruction}\nQuery: {query}` prefix (instruction: "Given a code search query, retrieve the most relevant code snippet."), documents are embedded raw. Applied only at the hybrid-search query seam. +* **embedder:** ship the model as a custom ONNX export hosted as a GitHub release asset (`github.com/theagenticguy/opencodehub/releases/download/embed-v1/...`), SHA256-pinned in `model-pins.ts` (`F2LLM_V2_80M_PINS`, renamed from `GTE_MODERNBERT_BASE_PINS`). fp32 ~321 MB / int8 ~81 MB. Tokenizer is Qwen2 BPE (`tokenizer.json` + `tokenizer_config.json`). Runtime unchanged: `onnxruntime-web` (WASM), single-threaded deterministic. License: Apache-2.0. 
+ ## [0.1.3](https://github.com/theagenticguy/opencodehub/compare/embedder-v0.1.2...embedder-v0.1.3) (2026-06-01) diff --git a/packages/embedder/README.md b/packages/embedder/README.md index a335a5a4..696cfc67 100644 --- a/packages/embedder/README.md +++ b/packages/embedder/README.md @@ -1,8 +1,8 @@ # @opencodehub/embedder Deterministic text embedder for OpenCodeHub. Uses the -`gte-modernbert-base` model via ONNX Runtime (CPU) locally or -Amazon SageMaker for larger deployments. +`codefuse-ai/F2LLM-v2-80M` model (320-dim) via ONNX Runtime (WASM, CPU) +locally or Amazon SageMaker for larger deployments. ## Surface @@ -16,8 +16,9 @@ const vectors = await embed(["function foo(): void {}", "class Bar {}"]); const vectors = await embed(texts, { backend: EmbedderBackend.SageMaker }); ``` -- **Local backend** — runs `gte-modernbert-base` via `onnxruntime-node` - (CPU only; CUDA postinstall is suppressed via `.npmrc`). +- **Local backend** — runs `F2LLM-v2-80M` via `onnxruntime-web` + (WASM, single-threaded, deterministic; no native bindings). Last-token + pooling + L2 normalization are baked into the ONNX graph. - **SageMaker backend** — sends batches to an endpoint via `@aws-sdk/client-sagemaker-runtime`; endpoint URL read from `OCH_SAGEMAKER_ENDPOINT`. @@ -30,7 +31,7 @@ const vectors = await embed(texts, { backend: EmbedderBackend.SageMaker }); |---|---|---| | `OCH_EMBED_BACKEND` | `onnx` | `onnx` or `sagemaker` | | `OCH_SAGEMAKER_ENDPOINT` | — | SageMaker real-time endpoint URL | -| `OCH_EMBED_DIM` | `768` | Expected embedding dimension (validation) | +| `OCH_EMBED_DIM` | `320` | Expected embedding dimension (validation) | ## Design @@ -39,5 +40,5 @@ const vectors = await embed(texts, { backend: EmbedderBackend.SageMaker }); fully offline. - The SageMaker path is the recommended backend for CI and cloud deployments; the ONNX path is the default for local dev. -- `onnxruntime_node_install_cuda=skip` in `.npmrc` prevents the ~400 MB - CUDA EP postinstall download. 
+- `onnxruntime-web` runs the model as WASM with no native postinstall — + the local backend ships zero native bindings. diff --git a/packages/embedder/package.json b/packages/embedder/package.json index 961872c2..6d633256 100644 --- a/packages/embedder/package.json +++ b/packages/embedder/package.json @@ -1,8 +1,8 @@ { "name": "@opencodehub/embedder", - "version": "0.1.3", + "version": "0.2.0", "private": true, - "description": "OpenCodeHub — ONNX-based deterministic text embedder (gte-modernbert-base)", + "description": "OpenCodeHub — ONNX-based deterministic text embedder (F2LLM-v2-80M)", "license": "Apache-2.0", "repository": { "type": "git", @@ -63,7 +63,7 @@ "embeddings", "onnx", "sagemaker", - "gte-modernbert", + "f2llm", "semantic-search" ], "engines": { diff --git a/packages/embedder/src/factory.test.ts b/packages/embedder/src/factory.test.ts index 6147bb86..d2fd03d1 100644 --- a/packages/embedder/src/factory.test.ts +++ b/packages/embedder/src/factory.test.ts @@ -24,10 +24,11 @@ import { type Embedder, EmbedderNotSetupError } from "./types.js"; /** Build a sentinel Embedder whose identity we can assert against. 
*/ function makeSentinelEmbedder(modelId: string): Embedder { return { - dim: 768, + dim: 320, modelId, - embed: async () => new Float32Array(768), - embedBatch: async (texts) => texts.map(() => new Float32Array(768)), + embed: async () => new Float32Array(320), + embedQuery: async () => new Float32Array(320), + embedBatch: async (texts) => texts.map(() => new Float32Array(320)), close: async () => {}, }; } @@ -49,7 +50,7 @@ describe("openDefaultEmbedder", () => { }); it("falls back to ONNX when no HTTP env vars and allowOnnxFallback defaults to true", async () => { - const onnxSentinel = makeSentinelEmbedder("gte-modernbert-base/fp32"); + const onnxSentinel = makeSentinelEmbedder("f2llm-v2-80m/fp32"); const result = await openDefaultEmbedder( {}, { @@ -58,7 +59,7 @@ describe("openDefaultEmbedder", () => { }, ); strictEqual(result, onnxSentinel, "factory should return the ONNX embedder reference"); - equal(result.modelId, "gte-modernbert-base/fp32"); + equal(result.modelId, "f2llm-v2-80m/fp32"); }); it("throws EmbedderNotSetupError when HTTP env vars absent and allowOnnxFallback=false", async () => { @@ -90,7 +91,7 @@ describe("openDefaultEmbedder", () => { it("propagates the underlying error when ONNX setup fails", async () => { const onnxFailure = new EmbedderNotSetupError( - "Run `codehub setup --embeddings` to install gte-modernbert-base", + "Run `codehub setup --embeddings` to install f2llm-v2-80m", ); await rejects( openDefaultEmbedder( diff --git a/packages/embedder/src/factory.ts b/packages/embedder/src/factory.ts index 174f60f8..4bf265f2 100644 --- a/packages/embedder/src/factory.ts +++ b/packages/embedder/src/factory.ts @@ -6,7 +6,7 @@ * 1. {@link tryOpenHttpEmbedder} reads SageMaker / OpenAI-HTTP env vars * first and returns a remote-backed embedder when configured. * 2. 
Otherwise — and only when `allowOnnxFallback === true` (the default) — - * fall back to {@link openOnnxEmbedder}, which loads gte-modernbert-base + * fall back to {@link openOnnxEmbedder}, which loads F2LLM-v2-80m * weights from disk (the lazy-load side effect). * 3. With `allowOnnxFallback: false` and no HTTP/SageMaker env, throw * {@link EmbedderNotSetupError} — the ONNX binding is never loaded. diff --git a/packages/embedder/src/fingerprint.test.ts b/packages/embedder/src/fingerprint.test.ts index b31bceab..88a2baff 100644 --- a/packages/embedder/src/fingerprint.test.ts +++ b/packages/embedder/src/fingerprint.test.ts @@ -8,23 +8,19 @@ import { assertEmbedderCompatible, EMBEDDER_MISMATCH_HINT } from "./fingerprint. describe("assertEmbedderCompatible", () => { test("ok when persisted is undefined (legacy store, never tagged)", () => { - const result = assertEmbedderCompatible(undefined, "gte-modernbert-base/fp32", false); + const result = assertEmbedderCompatible(undefined, "f2llm-v2-80m/fp32", false); ok(result.ok); }); test("ok when persisted equals current", () => { - const result = assertEmbedderCompatible( - "gte-modernbert-base/fp32", - "gte-modernbert-base/fp32", - false, - ); + const result = assertEmbedderCompatible("f2llm-v2-80m/fp32", "f2llm-v2-80m/fp32", false); ok(result.ok); }); test("ok when persisted differs from current but force is true", () => { const result = assertEmbedderCompatible( - "gte-modernbert-base/fp32", - "sagemaker:gte-modernbert-base@my-endpoint", + "f2llm-v2-80m/fp32", + "f2llm-v2-80m/sagemaker:my-endpoint", true, ); ok(result.ok); @@ -32,14 +28,14 @@ describe("assertEmbedderCompatible", () => { test("not ok when persisted differs from current and force is false", () => { const result = assertEmbedderCompatible( - "gte-modernbert-base/fp32", - "sagemaker:gte-modernbert-base@my-endpoint", + "f2llm-v2-80m/fp32", + "f2llm-v2-80m/sagemaker:my-endpoint", false, ); ok(!result.ok); if (!result.ok) { - equal(result.persistedModelId, 
"gte-modernbert-base/fp32"); - equal(result.currentModelId, "sagemaker:gte-modernbert-base@my-endpoint"); + equal(result.persistedModelId, "f2llm-v2-80m/fp32"); + equal(result.currentModelId, "f2llm-v2-80m/sagemaker:my-endpoint"); equal(result.hint, EMBEDDER_MISMATCH_HINT); } }); diff --git a/packages/embedder/src/fingerprint.ts b/packages/embedder/src/fingerprint.ts index 91d932b8..9444b2a1 100644 --- a/packages/embedder/src/fingerprint.ts +++ b/packages/embedder/src/fingerprint.ts @@ -3,10 +3,10 @@ * * The `embeddings` table on disk was populated by ONE specific embedder * — usually identified by its {@link Embedder.modelId} (e.g. - * `gte-modernbert-base/fp32`, `sagemaker:gte-modernbert-base@`). + * `f2llm-v2-80m/fp32`, `f2llm-v2-80m/sagemaker:`). * If the operator switches the active embedder between index runs (ONNX - * → SageMaker, fp32 → int8) the dim might still match by coincidence - * (768 = 768) but the vector subspace is different — hybrid search + * → SageMaker, fp32 → int8, or a different model entirely) the vector + * subspace differs even when the dim coincides — hybrid search * silently corrupts ranking with no error. * * `assertEmbedderCompatible` makes the mismatch loud: diff --git a/packages/embedder/src/http-embedder.test.ts b/packages/embedder/src/http-embedder.test.ts index 31f502aa..473721ae 100644 --- a/packages/embedder/src/http-embedder.test.ts +++ b/packages/embedder/src/http-embedder.test.ts @@ -3,7 +3,7 @@ * {@link openEmbedder} factory. 
* * Coverage: - * - happy path: mock fetch returns a 768-d vector → Float32Array of 768 + * - happy path: mock fetch returns a 320-d vector → Float32Array of 320 * - retry on 5xx × 2, then succeed * - retry on network error × 2, then succeed * - empty endpointUrl → ONNX path chosen (factory falls through; @@ -112,23 +112,23 @@ function makeFetchMockNetErrThenOk( describe("openHttpEmbedder: happy path", () => { it("returns a Float32Array of the expected dim on a 200 response", async () => { - const vec768 = Array.from({ length: 768 }, (_, i) => (i + 1) / 400); - const fetchImpl = makeFetchMockOk(vec768); + const vec320 = Array.from({ length: 320 }, (_, i) => (i + 1) / 400); + const fetchImpl = makeFetchMockOk(vec320); const embedder = openHttpEmbedder({ endpointUrl: "https://embed.example/v1", - modelId: "gte-modernbert-base", + modelId: "f2llm-v2-80m", fetchImpl, }); const out = await embedder.embed("hello world"); - equal(out.length, 768); - equal(embedder.dim, 768); - equal(embedder.modelId, "gte-modernbert-base"); + equal(out.length, 320); + equal(embedder.dim, 320); + equal(embedder.modelId, "f2llm-v2-80m"); // Values round-trip as Float32 (so small precision loss is acceptable). - ok(Math.abs((out[0] ?? 0) - (vec768[0] ?? 0)) < 1e-6); + ok(Math.abs((out[0] ?? 0) - (vec320[0] ?? 0)) < 1e-6); await embedder.close(); }); - it("honours a caller-supplied `dims` value (non-768 remote)", async () => { + it("honours a caller-supplied `dims` value (non-320 remote)", async () => { const vec1024 = new Array(1024).fill(0.125); const fetchImpl = makeFetchMockOk(vec1024); const embedder = openHttpEmbedder({ @@ -147,7 +147,7 @@ describe("openHttpEmbedder: happy path", () => { const fetchImpl: typeof fetch = async (_url, _init) => { call += 1; // Distinct vector per call so we can verify order. 
- const embedding = new Array(768).fill(call / 100); + const embedding = new Array(320).fill(call / 100); return new Response(JSON.stringify({ data: [{ embedding }] }), { status: 200, headers: { "content-type": "application/json" }, @@ -170,7 +170,7 @@ describe("openHttpEmbedder: happy path", () => { const fetchImpl: typeof fetch = async (url, _init) => { seen.push(String(url)); return new Response( - JSON.stringify({ data: [{ embedding: new Array(768).fill(0) }] }), + JSON.stringify({ data: [{ embedding: new Array(320).fill(0) }] }), { status: 200, headers: { "content-type": "application/json" } }, ); }; @@ -189,7 +189,7 @@ describe("openHttpEmbedder: happy path", () => { const fetchImpl: typeof fetch = async (url, _init) => { seen.push(String(url)); return new Response( - JSON.stringify({ data: [{ embedding: new Array(768).fill(0) }] }), + JSON.stringify({ data: [{ embedding: new Array(320).fill(0) }] }), { status: 200, headers: { "content-type": "application/json" } }, ); }; @@ -205,7 +205,7 @@ describe("openHttpEmbedder: happy path", () => { describe("openHttpEmbedder: retries", () => { it("retries on 5xx and succeeds on the third attempt", async () => { - const embedding = new Array(768).fill(0.1); + const embedding = new Array(320).fill(0.1); const seq = makeFetchMockSeq([ { status: 500, body: { error: "bad" } }, { status: 503, body: { error: "busy" } }, @@ -217,12 +217,12 @@ describe("openHttpEmbedder: retries", () => { fetchImpl: seq.fetchImpl, }); const out = await embedder.embed("x"); - equal(out.length, 768); + equal(out.length, 320); equal(seq.calls(), 3, "must have retried twice before succeeding"); }); it("retries on 429 (rate limit) and succeeds on the third attempt", async () => { - const embedding = new Array(768).fill(0.2); + const embedding = new Array(320).fill(0.2); const seq = makeFetchMockSeq([ { status: 429, body: { error: "rate" } }, { status: 429, body: { error: "rate" } }, @@ -234,7 +234,7 @@ describe("openHttpEmbedder: retries", () => { 
fetchImpl: seq.fetchImpl, }); const out = await embedder.embed("x"); - equal(out.length, 768); + equal(out.length, 320); equal(seq.calls(), 3); }); @@ -250,14 +250,14 @@ describe("openHttpEmbedder: retries", () => { }); it("retries on a thrown network error and succeeds on the third attempt", async () => { - const seq = makeFetchMockNetErrThenOk(2, new Array(768).fill(0)); + const seq = makeFetchMockNetErrThenOk(2, new Array(320).fill(0)); const embedder = openHttpEmbedder({ endpointUrl: "https://embed.example/v1", modelId: "m", fetchImpl: seq.fetchImpl, }); const out = await embedder.embed("x"); - equal(out.length, 768); + equal(out.length, 320); equal(seq.calls(), 3); }); @@ -284,27 +284,27 @@ describe("openHttpEmbedder: dim mismatch guard", () => { const embedder = openHttpEmbedder({ endpointUrl: "https://embed.example/v1", modelId: "m", - dims: 768, + dims: 320, fetchImpl: makeFetchMockOk(wrong), }); await rejects(embedder.embed("x"), (err: unknown) => { ok(err instanceof Error); match(err.message, /Embedding dimension mismatch/); match(err.message, /1024d vector/); - match(err.message, /expected 768d/); + match(err.message, /expected 320d/); match(err.message, /CODEHUB_EMBEDDING_DIMS/); return true; }); }); - it("uses 768 as the default expected dim when `dims` is omitted", async () => { + it("uses 320 as the default expected dim when `dims` is omitted", async () => { const wrong = new Array(1024).fill(0); const embedder = openHttpEmbedder({ endpointUrl: "https://embed.example/v1", modelId: "m", fetchImpl: makeFetchMockOk(wrong), }); - await rejects(embedder.embed("x"), /expected 768d/); + await rejects(embedder.embed("x"), /expected 320d/); }); }); @@ -315,7 +315,7 @@ describe("openHttpEmbedder: auth header", () => { const headers = new Headers(init?.headers); seenAuth = headers.get("authorization") ?? 
undefined; return new Response( - JSON.stringify({ data: [{ embedding: new Array(768).fill(0) }] }), + JSON.stringify({ data: [{ embedding: new Array(320).fill(0) }] }), { status: 200, headers: { "content-type": "application/json" } }, ); }; @@ -335,7 +335,7 @@ describe("openHttpEmbedder: auth header", () => { const headers = new Headers(init?.headers); seenAuth = headers.get("authorization") ?? undefined; return new Response( - JSON.stringify({ data: [{ embedding: new Array(768).fill(0) }] }), + JSON.stringify({ data: [{ embedding: new Array(320).fill(0) }] }), { status: 200, headers: { "content-type": "application/json" } }, ); }; @@ -461,10 +461,10 @@ describe("openEmbedder factory", () => { const embedder = await openEmbedder({ endpointUrl: "https://embed.example/v1", modelId: "m", - fetchImpl: makeFetchMockOk(new Array(768).fill(0.5)), + fetchImpl: makeFetchMockOk(new Array(320).fill(0.5)), }); const out = await embedder.embed("x"); - equal(out.length, 768); + equal(out.length, 320); }); it("throws when offline=true AND endpointUrl is set", async () => { @@ -526,11 +526,11 @@ describe("tryOpenHttpEmbedder", () => { it("returns an Embedder when env is configured", async () => { process.env["CODEHUB_EMBEDDING_URL"] = "https://embed.example/v1"; process.env["CODEHUB_EMBEDDING_MODEL"] = "m"; - const fetchImpl = makeFetchMockOk(new Array(768).fill(0)); + const fetchImpl = makeFetchMockOk(new Array(320).fill(0)); const embedder = await tryOpenHttpEmbedder({ fetchImpl }); ok(embedder !== null); const out = await embedder.embed("x"); - equal(out.length, 768); + equal(out.length, 320); }); it("throws when offline AND env is configured", () => { @@ -550,7 +550,7 @@ describe("Embedder contract via HTTP", () => { const embedder = openHttpEmbedder({ endpointUrl: "https://embed.example/v1", modelId: "m", - fetchImpl: makeFetchMockOk(new Array(768).fill(0)), + fetchImpl: makeFetchMockOk(new Array(320).fill(0)), }); await embedder.close(); await embedder.close(); diff --git 
a/packages/embedder/src/http-embedder.ts b/packages/embedder/src/http-embedder.ts index 2fba772b..3a6c278a 100644 --- a/packages/embedder/src/http-embedder.ts +++ b/packages/embedder/src/http-embedder.ts @@ -42,7 +42,7 @@ export interface HttpEmbedderConfig { /** Model id sent in the `model` field of the request body. */ readonly modelId: string; /** - * Expected response-vector dimension. Defaults to 768 (gte-modernbert-base). + * Expected response-vector dimension. Defaults to 320 (F2LLM-v2-80M). * Every response is asserted against this so a remote model swap can * never silently pollute downstream vector indexes. */ @@ -60,8 +60,8 @@ export interface HttpEmbedderConfig { readonly fetchImpl?: typeof fetch; } -/** Default dim for gte-modernbert-base (the fallback when env doesn't set it). */ -const DEFAULT_DIMS = 768; +/** Default dim for F2LLM-v2-80M (the fallback when env doesn't set it). */ +const DEFAULT_DIMS = 320; /** * Read HTTP embedder config from the process environment. Returns `null` @@ -228,6 +228,11 @@ export function openHttpEmbedder(cfg: HttpEmbedderConfig): Embedder { dim: dims, modelId, embed: embedOne, + // Remote backends are text-in/vector-out and own their pooling/prefix + // server-side; the local F2LLM query-prefix asymmetry is not applied + // here (a remote F2LLM endpoint must handle it itself). Alias query to + // document so the interface contract holds. + embedQuery: embedOne, async embedBatch(texts: readonly string[]): Promise<Float32Array[]> { if (texts.length === 0) return []; // One request per text. The HTTP surface supports batched `input`, but diff --git a/packages/embedder/src/index.ts b/packages/embedder/src/index.ts index bc198a61..4bd9c54b 100644 --- a/packages/embedder/src/index.ts +++ b/packages/embedder/src/index.ts @@ -9,7 +9,7 @@ * that POSTs to an OpenAI-compatible `/v1/embeddings` server (Infinity, * vLLM, TEI, Ollama, LM Studio, OpenAI).
* - When neither is set, {@link openEmbedder} falls back to the local - * ONNX gte-modernbert-base path (deterministic embedder). + * ONNX F2LLM-v2-80M path (deterministic embedder). * * Offline invariant: when `offline === true` and any remote option * (SageMaker or `endpointUrl`) is set, {@link openEmbedder} throws. Remote @@ -47,8 +47,8 @@ export { } from "./http-embedder.js"; export { embedderModelId, - GTE_MODERNBERT_BASE_PINS, - GTE_MODERNBERT_BASE_REPO, + F2LLM_V2_80M_PINS, + F2LLM_V2_80M_REPO, type PinnedFile, type VariantPins, } from "./model-pins.js"; @@ -59,6 +59,7 @@ export { resolveModelDir, TOKENIZER_FILES, } from "./paths.js"; +export { buildQueryText, F2LLM_QUERY_INSTRUCTION } from "./query-prefix.js"; export { openSagemakerEmbedder, readSagemakerEmbedderConfigFromEnv, @@ -103,7 +104,7 @@ export interface OpenEmbedderOptions { readonly modelId?: string; /** Bearer token for the HTTP request. Optional; sent as `unused` when absent. */ readonly apiKey?: string; - /** Expected response-vector dimension. Defaults to 768 for HTTP/SageMaker. */ + /** Expected response-vector dimension. Defaults to 320 for HTTP/SageMaker. 
*/ readonly dims?: number; /** * Pass-through options for the ONNX backend when a remote backend is diff --git a/packages/embedder/src/model-pins.test.ts b/packages/embedder/src/model-pins.test.ts index a6f4e44a..dbaddeaf 100644 --- a/packages/embedder/src/model-pins.test.ts +++ b/packages/embedder/src/model-pins.test.ts @@ -7,52 +7,80 @@ import { equal, match, ok } from "node:assert/strict"; import { describe, it } from "node:test"; -import { - embedderModelId, - GTE_MODERNBERT_BASE_PINS, - GTE_MODERNBERT_BASE_REPO, -} from "./model-pins.js"; +import { embedderModelId, F2LLM_V2_80M_PINS, F2LLM_V2_80M_REPO } from "./model-pins.js"; const SHA256_RE = /^[0-9a-f]{64}$/; -const HF_URL_RE = new RegExp( - `^https://huggingface\\.co/Alibaba-NLP/gte-modernbert-base/resolve/${GTE_MODERNBERT_BASE_REPO.commit}/`, +// The exported ONNX + tokenizer artifacts are hosted as GitHub release assets +// on the opencodehub repo (NOT upstream Hugging Face — the export bakes +// pooling + L2 norm into the graph and does not exist upstream). 
+const RELEASE_URL_RE = new RegExp( + `^https://github\\.com/theagenticguy/opencodehub/releases/download/${F2LLM_V2_80M_REPO.release}/`, ); describe("model-pins", () => { - it("repo metadata is Apache-2.0 and pins a commit SHA", () => { - equal(GTE_MODERNBERT_BASE_REPO.license, "Apache-2.0"); - equal(GTE_MODERNBERT_BASE_REPO.hfRepo, "Alibaba-NLP/gte-modernbert-base"); - match(GTE_MODERNBERT_BASE_REPO.commit, /^[0-9a-f]{40}$/); + it("repo metadata is Apache-2.0 and attributes the upstream + release", () => { + equal(F2LLM_V2_80M_REPO.license, "Apache-2.0"); + equal(F2LLM_V2_80M_REPO.upstream, "codefuse-ai/F2LLM-v2-80M"); + equal(F2LLM_V2_80M_REPO.release, "embed-v1"); }); - it("fp32 variant ships one ONNX + four tokenizer files", () => { - const names = GTE_MODERNBERT_BASE_PINS.fp32.files.map((f) => f.name); - equal(GTE_MODERNBERT_BASE_PINS.fp32.files.length, 5); + it("fp32 variant ships one ONNX + two tokenizer files", () => { + const names = F2LLM_V2_80M_PINS.fp32.files.map((f) => f.name); + equal(F2LLM_V2_80M_PINS.fp32.files.length, 3); ok(names.includes("model.onnx")); ok(names.includes("tokenizer.json")); ok(names.includes("tokenizer_config.json")); - ok(names.includes("config.json")); - ok(names.includes("special_tokens_map.json")); + // The export omits config.json / special_tokens_map.json — pooling + norm + // are in-graph, so they are not fetched. 
+ ok(!names.includes("config.json")); + ok(!names.includes("special_tokens_map.json")); }); it("int8 variant swaps the ONNX file and reuses tokenizer pins", () => { - const names = GTE_MODERNBERT_BASE_PINS.int8.files.map((f) => f.name); + const names = F2LLM_V2_80M_PINS.int8.files.map((f) => f.name); + equal(F2LLM_V2_80M_PINS.int8.files.length, 3); ok(names.includes("model_int8.onnx")); ok(!names.includes("model.onnx")); + ok(names.includes("tokenizer.json")); + ok(names.includes("tokenizer_config.json")); }); - it("every pinned file has a 64-char sha256 and HF resolve URL", () => { + it("every pinned file has a 64-char sha256 and GitHub release URL", () => { for (const variant of ["fp32", "int8"] as const) { - for (const f of GTE_MODERNBERT_BASE_PINS[variant].files) { + for (const f of F2LLM_V2_80M_PINS[variant].files) { match(f.sha256, SHA256_RE, `${variant}/${f.name} sha256`); - match(f.url, HF_URL_RE, `${variant}/${f.name} url`); + match(f.url, RELEASE_URL_RE, `${variant}/${f.name} url`); ok(f.sizeBytes > 0, `${variant}/${f.name} sizeBytes`); } } }); + it("pins the exact fp32 model + tokenizer sizes and hashes", () => { + const model = F2LLM_V2_80M_PINS.fp32.files.find((f) => f.name === "model.onnx"); + ok(model !== undefined); + equal(model.sizeBytes, 320708733); + equal(model.sha256, "9347f761e1420e61c477b56616b3b4f2d2ee80d94747fd6cdde9a03b4c9176bc"); + + const tok = F2LLM_V2_80M_PINS.fp32.files.find((f) => f.name === "tokenizer.json"); + ok(tok !== undefined); + equal(tok.sizeBytes, 11423359); + equal(tok.sha256, "7dd49a6a008054ecbf11f1568ea9244e99ca8a44fe47e883d1bb9915c3042705"); + + const tokCfg = F2LLM_V2_80M_PINS.fp32.files.find((f) => f.name === "tokenizer_config.json"); + ok(tokCfg !== undefined); + equal(tokCfg.sizeBytes, 378); + equal(tokCfg.sha256, "3dbc087db36f09c0c359618cbfcebb4b3aed6d8438951c037789b5a0fdc099af"); + }); + + it("pins the exact int8 model size and hash", () => { + const model = F2LLM_V2_80M_PINS.int8.files.find((f) => f.name === 
"model_int8.onnx"); + ok(model !== undefined); + equal(model.sizeBytes, 80699171); + equal(model.sha256, "302845905e9273a1dd0fb4c670dcd12d16ad35e9522f518aa45a74da4d6ec5b8"); + }); + it("embedderModelId produces the string used by the storage layer", () => { - equal(embedderModelId("fp32"), "gte-modernbert-base/fp32"); - equal(embedderModelId("int8"), "gte-modernbert-base/int8"); + equal(embedderModelId("fp32"), "f2llm-v2-80m/fp32"); + equal(embedderModelId("int8"), "f2llm-v2-80m/int8"); }); }); diff --git a/packages/embedder/src/model-pins.ts b/packages/embedder/src/model-pins.ts index fa2474af..56bab6bb 100644 --- a/packages/embedder/src/model-pins.ts +++ b/packages/embedder/src/model-pins.ts @@ -1,18 +1,23 @@ /** - * SHA256 and source-URL pins for every gte-modernbert-base weight file we ship. + * SHA256 and source-URL pins for every F2LLM-v2-80M weight file we ship. * * These pins are the authoritative contract consumed by `codehub setup * --embeddings` and by `codehub doctor` at runtime. SHA256 values were - * computed locally against the Hugging Face model repo at commit - * `e7f32e3c00f91d699e8c43b53106206bcc72bb22` on 2026-04-25. + * computed locally against the ONNX export produced from + * `codefuse-ai/F2LLM-v2-80M` (a Qwen3-0.6B-Base derivative) — the export + * bakes last-token pooling + L2 normalization into the graph, so it is NOT + * the upstream Hugging Face repo's own files. We host the exported + * artifacts as a GitHub release asset and pin them by URL + SHA256. * * This module does NOT download anything on its own. It is pure data. */ -/** HF repo + commit the pins are anchored to. */ -export const GTE_MODERNBERT_BASE_REPO = { - hfRepo: "Alibaba-NLP/gte-modernbert-base", - commit: "e7f32e3c00f91d699e8c43b53106206bcc72bb22", +/** Source repo + release the pins are anchored to. */ +export const F2LLM_V2_80M_REPO = { + /** Upstream model the ONNX export is derived from (attribution). 
*/ + upstream: "codefuse-ai/F2LLM-v2-80M", + /** GitHub release tag hosting the exported ONNX + tokenizer artifacts. */ + release: "embed-v1", license: "Apache-2.0", } as const; @@ -30,46 +35,44 @@ export interface VariantPins { readonly files: readonly PinnedFile[]; } -function hfUrl(path: string): string { - return `https://huggingface.co/${GTE_MODERNBERT_BASE_REPO.hfRepo}/resolve/${GTE_MODERNBERT_BASE_REPO.commit}/${path}`; +/** + * Build the download URL for a release asset. The exported ONNX files do + * not exist upstream on Hugging Face — they are attached to a GitHub + * release on the opencodehub repo. Asset names are flat (no directory), + * so the int8 weights are uploaded as `model_int8.onnx` etc. + */ +function releaseUrl(asset: string): string { + return `https://github.com/theagenticguy/opencodehub/releases/download/${F2LLM_V2_80M_REPO.release}/${asset}`; } // Tokenizer + config files are identical across variants — hashes computed -// once from the model repo. +// once from the exported artifacts. 
const TOKENIZER_JSON: PinnedFile = { name: "tokenizer.json", - url: hfUrl("tokenizer.json"), - sizeBytes: 3583228, - sha256: "6c8aaa9a542084f2457eab775d4eeb51f92a70c0fd9de28d5edb0ddec3c08d30", + url: releaseUrl("tokenizer.json"), + sizeBytes: 11423359, + sha256: "7dd49a6a008054ecbf11f1568ea9244e99ca8a44fe47e883d1bb9915c3042705", }; const TOKENIZER_CONFIG_JSON: PinnedFile = { name: "tokenizer_config.json", - url: hfUrl("tokenizer_config.json"), - sizeBytes: 20867, - sha256: "9654072f7c873161814043cf08cb5ed72f71d0b935abcd4e267935cb34352c21", -}; - -const CONFIG_JSON: PinnedFile = { - name: "config.json", - url: hfUrl("config.json"), - sizeBytes: 1184, - sha256: "8ba54dc3d35d7194f5178a4194b649f146753e02dabd22bdca5c5cbac15069ed", -}; - -const SPECIAL_TOKENS_MAP_JSON: PinnedFile = { - name: "special_tokens_map.json", - url: hfUrl("special_tokens_map.json"), - sizeBytes: 694, - sha256: "ea97ecdbcc73713039d8d64dbb05e3689495c96657fbd9a18f5bed381be81049", + url: releaseUrl("tokenizer_config.json"), + sizeBytes: 378, + sha256: "3dbc087db36f09c0c359618cbfcebb4b3aed6d8438951c037789b5a0fdc099af", }; /** - * Per-variant manifests. The fp32 variant is the default (596 MB, highest - * precision); int8 is 4× smaller (150 MB) with near-identical retrieval - * quality for size-constrained installs. + * Per-variant manifests. The fp32 variant is the default (321 MB, + * cosine-exact 1.0 vs the PyTorch reference, byte-deterministic under the + * single-thread WASM gate); int8 is 4× smaller (81 MB) with 4/4 top-1 + * ranking agreement for size-constrained installs. + * + * F2LLM emits a single graph output named `embedding` of shape + * `[batch, 320]` — pooling + L2 norm are in-graph, so only the ONNX file + * + the two tokenizer files are required (no config.json / + * special_tokens_map.json, which the export omits). 
*/ -export const GTE_MODERNBERT_BASE_PINS: { +export const F2LLM_V2_80M_PINS: { readonly fp32: VariantPins; readonly int8: VariantPins; } = { @@ -78,14 +81,12 @@ export const GTE_MODERNBERT_BASE_PINS: { files: [ { name: "model.onnx", - url: hfUrl("onnx/model.onnx"), - sizeBytes: 596392315, - sha256: "947f31df7effaeec4edb57c50e4ed7e0f2034d9336063f92615b92e3e0d24d78", + url: releaseUrl("model.onnx"), + sizeBytes: 320708733, + sha256: "9347f761e1420e61c477b56616b3b4f2d2ee80d94747fd6cdde9a03b4c9176bc", }, TOKENIZER_JSON, TOKENIZER_CONFIG_JSON, - CONFIG_JSON, - SPECIAL_TOKENS_MAP_JSON, ], }, int8: { @@ -93,19 +94,17 @@ export const GTE_MODERNBERT_BASE_PINS: { files: [ { name: "model_int8.onnx", - url: hfUrl("onnx/model_int8.onnx"), - sizeBytes: 150218016, - sha256: "bae96b276d342bf86eeee07c1bdbc0c75bb82bf4033941aab7fabc1e33ee3b44", + url: releaseUrl("model_int8.onnx"), + sizeBytes: 80699171, + sha256: "302845905e9273a1dd0fb4c670dcd12d16ad35e9522f518aa45a74da4d6ec5b8", }, TOKENIZER_JSON, TOKENIZER_CONFIG_JSON, - CONFIG_JSON, - SPECIAL_TOKENS_MAP_JSON, ], }, } as const; -/** Model id tag written into `embeddings.model` (keeps HNSW indexes separable). */ +/** Model id tag written into `embeddings.model` (keeps vector indexes separable). */ export function embedderModelId(variant: "fp32" | "int8"): string { - return `gte-modernbert-base/${variant}`; + return `f2llm-v2-80m/${variant}`; } diff --git a/packages/embedder/src/onnx-embedder.test.ts b/packages/embedder/src/onnx-embedder.test.ts index d54ca721..d9b3e53f 100644 --- a/packages/embedder/src/onnx-embedder.test.ts +++ b/packages/embedder/src/onnx-embedder.test.ts @@ -6,11 +6,17 @@ * `code` literal. Guarantees the CLI and search layer can pattern-match * the error to degrade to BM25-only. * 2. Real weights present → byte-identical output across three repeat - * calls + L2 norm ≈ 1 + dim === 768. Only runs when the cache dir is - * populated. CI does NOT populate this dir. + * calls + dim === 320. 
Only runs when the cache dir is populated. CI + * does NOT populate this dir. + * + * F2LLM-v2-80M's ONNX graph bakes last-token pooling + L2 normalization in, + * emitting a single 320-dim output named `embedding` already unit-length — + * so the embedder does NO JS-side pooling or normalization (unlike the prior + * gte-modernbert CLS-pool path). The unit-norm assertion below therefore + * checks the model contract, not a JS post-step. * * When weights are absent we also run a mock-based check of the Embedder - * contract (dim=768, embedBatch preserves input order, close() is + * contract (dim=320, embedBatch preserves input order, close() is * idempotent) so the interface is covered unconditionally. */ @@ -74,11 +80,13 @@ describe("openOnnxEmbedder: missing weights", () => { /** * A hand-rolled `Embedder` used when real weights are unavailable. Its * `embed` produces a deterministic fake vector (index-based) so we can still - * exercise the downstream contract: L2 norm ≈ 1, dim=768, embedBatch - * preserves order, close() is idempotent. + * exercise the downstream contract: dim=320, embedBatch preserves order, + * close() is idempotent, repeat calls are byte-identical. The fake happens to + * return unit vectors, but that is a property of the mock — the real embedder + * gets unit length from the in-graph L2 norm, not from any JS step. */ class MockEmbedder implements Embedder { - readonly dim = 768; + readonly dim = 320; readonly modelId = embedderModelId("fp32"); #closed = false; @@ -107,6 +115,12 @@ class MockEmbedder implements Embedder { return vec; } + // F2LLM is asymmetric (query gets an Instruct: prefix) but the mock has no + // real model, so it aliases the query path to the document path. 
+ async embedQuery(text: string): Promise<Float32Array> { + return this.embed(text); + } + async embedBatch(texts: readonly string[]): Promise<Float32Array[]> { return Promise.all(texts.map((t) => this.embed(t))); } @@ -130,17 +144,18 @@ describe("Embedder contract (mocked)", () => { const m = new MockEmbedder(); // Static type check: `m satisfies Embedder` is enforced by the class // declaration. Here we re-check at runtime. - equal(m.dim, 768); - equal(m.modelId, "gte-modernbert-base/fp32"); + equal(m.dim, 320); + equal(m.modelId, "f2llm-v2-80m/fp32"); equal(typeof m.embed, "function"); + equal(typeof m.embedQuery, "function"); equal(typeof m.embedBatch, "function"); equal(typeof m.close, "function"); }); - it("dim === 768", async () => { + it("dim === 320", async () => { const m = new MockEmbedder(); const v = await m.embed("hello world"); - equal(v.length, 768); + equal(v.length, 320); }); it("L2 norm is ~1 (within 1e-6)", async () => { @@ -199,9 +214,9 @@ async function hasRealWeights(): Promise<boolean> { } describe("OnnxEmbedder: real weights (optional)", () => { - it("produces byte-identical vectors across 3 calls and has dim=768", async (t) => { + it("produces byte-identical vectors across 3 calls and has dim=320", async (t) => { if (!(await hasRealWeights())) { - t.skip("gte-modernbert-base weights not installed — run `codehub setup --embeddings`"); + t.skip("f2llm-v2-80m weights not installed — run `codehub setup --embeddings`"); return; } let embedder: Embedder | undefined; @@ -211,12 +226,14 @@ describe("OnnxEmbedder: real weights (optional)", () => { const a = await embedder.embed(text); const b = await embedder.embed(text); const c = await embedder.embed(text); - equal(a.length, 768); - equal(embedder.dim, 768); - equal(embedder.modelId, "gte-modernbert-base/fp32"); + equal(a.length, 320); + equal(embedder.dim, 320); + equal(embedder.modelId, "f2llm-v2-80m/fp32"); deepEqual(new Uint8Array(a.buffer), new Uint8Array(b.buffer)); deepEqual(new Uint8Array(a.buffer), new Uint8Array(c.buffer));
+ // F2LLM's graph L2-normalizes its output, so the vector is unit-length + // straight from `embedding` — no JS normalize step is involved. const n = l2Norm(a); ok(Math.abs(n - 1) < 1e-4, `expected unit norm, got ${n}`); } finally { diff --git a/packages/embedder/src/onnx-embedder.ts b/packages/embedder/src/onnx-embedder.ts index 30248786..04685401 100644 --- a/packages/embedder/src/onnx-embedder.ts +++ b/packages/embedder/src/onnx-embedder.ts @@ -1,12 +1,25 @@ /** - * Deterministic ONNX-based embedder for Alibaba gte-modernbert-base. + * Deterministic ONNX-based embedder for codefuse-ai F2LLM-v2-80M. * * Loads weights from disk (populated by `codehub setup --embeddings`), runs - * inference with every nondeterminism knob disabled, and emits a 768-dim + * inference with every nondeterminism knob disabled, and emits a 320-dim * Float32Array per input. The same input MUST produce byte-identical output * across repeat calls; this is the contract the graphHash CI gate relies * on. * + * F2LLM-v2-80M is a Qwen3-0.6B-Base derivative (8 layers, hidden 320, 16 + * heads / 8 KV heads). The ONNX export bakes last-token pooling + * (`attention_mask.sum()-1`) AND L2 normalization INTO the graph, emitting + * a single output named `embedding` of shape `[batch, 320]` already + * unit-length — so this module does NO JS-side pooling or normalization, + * unlike the previous gte-modernbert (CLS-pool) path. + * + * Query/document asymmetry: F2LLM expects an `Instruct:`-wrapped prefix on + * QUERY text only; documents are embedded raw. {@link OnnxEmbedder.embed} + * /`embedBatch` embed raw text (the document path); {@link + * OnnxEmbedder.embedQuery} applies the prefix (the query path). See + * {@link buildQueryText}. + * * The weights themselves are NOT downloaded here — `codehub setup * --embeddings` owns that code path. 
If the weights are absent we throw * {@link EmbedderNotSetupError} so callers can degrade to BM25-only search @@ -28,13 +41,17 @@ import type { InferenceSession, Tensor } from "onnxruntime-web"; import { embedderModelId } from "./model-pins.js"; import { modelFileName, resolveModelDir, TOKENIZER_FILES } from "./paths.js"; +import { buildQueryText } from "./query-prefix.js"; import { type Embedder, type EmbedderConfig, EmbedderNotSetupError } from "./types.js"; -// gte-modernbert-base is a ModernBERT-base encoder (22 layers, 12 heads, -// 768 hidden). These numbers are part of the model contract, not a config -// knob — do not expose to callers. -const EMBED_DIM = 768; -const MODEL_MAX_POSITION = 8192; // ModernBERT's position embedding table +// F2LLM-v2-80M emits a single graph output named `embedding`, shape +// `[batch, 320]`, already L2-normalized. These numbers are part of the +// model contract, not a config knob — do not expose to callers. +const EMBED_DIM = 320; +// Practical truncation cap in tokens. F2LLM's model_max_length is 131072, +// but code symbols are short and a large cap wastes memory/latency; 8192 is +// the operative ceiling. +const MODEL_MAX_LENGTH = 8192; async function fileExists(path: string): Promise<boolean> { try { @@ -61,7 +78,7 @@ async function assertModelFiles( } if (missing.length > 0) { throw new EmbedderNotSetupError( - `gte-modernbert-base weights not found in ${modelDir}: ` + + `F2LLM-v2-80M weights not found in ${modelDir}: ` + `missing ${missing.join(", ")}. ` + `Run \`codehub setup --embeddings\` while online.`, ); @@ -110,7 +127,10 @@ function buildSessionOptions(): InferenceSession.SessionOptions { /** * Encode `text` using the supplied Tokenizer and produce padded/truncated * input_ids and attention_mask arrays. BigInt64Array matches the model's - * int64 input type. ModernBERT has no token_type_ids input. + * int64 input type. Qwen3/F2LLM has no token_type_ids input.
+ * + * `add_special_tokens: true` is REQUIRED — the tokenizer's TemplateProcessing + * appends the EOS (`<|im_end|>`) that the in-graph last-token pooling reads. */ function encodeForModel( tokenizer: Tokenizer, @@ -124,7 +144,9 @@ function encodeForModel( const enc = tokenizer.encode(text, { add_special_tokens: true, }); - // Truncate to the model's max_position_embeddings. + // Truncate to the practical max length. On truncation the trailing EOS is + // dropped and last-token pooling reads the final retained token — a valid + // (if degraded) representation of the truncated prefix. const ids = enc.ids.slice(0, maxModelLength); const mask = enc.attention_mask.slice(0, maxModelLength); @@ -139,11 +161,12 @@ function encodeForModel( } /** - * Pad two parallel BigInt64Arrays (ids, mask) up to `padTo`. ModernBERT's - * pad_token_id is 50283 (not 0 as in BERT); the attention mask is 0 for - * padding positions so the model ignores them regardless. + * Pad two parallel BigInt64Arrays (ids, mask) up to `padTo`. F2LLM's + * tokenizer pad_token is `<|endoftext|>` (id 151643); the attention mask is + * 0 for padding positions so the in-graph last-token pooling + * (`attention_mask.sum()-1`) skips them regardless of the pad id used. */ -const MODERNBERT_PAD_ID = 50283n; +const F2LLM_PAD_ID = 151643n; function padToLength( ids: BigInt64Array, @@ -156,57 +179,13 @@ function padToLength( if (ids.length === padTo) { return { ids, mask }; } - const outIds = new BigInt64Array(padTo).fill(MODERNBERT_PAD_ID); + const outIds = new BigInt64Array(padTo).fill(F2LLM_PAD_ID); const outMask = new BigInt64Array(padTo); outIds.set(ids); outMask.set(mask); return { ids: outIds, mask: outMask }; } -/** - * Extract the [CLS] vector (index 0 of last_hidden_state) for batch item - * `rowIdx`. gte-modernbert-base ships `1_Pooling/config.json` with - * `pooling_mode_cls_token: true`, so we grab the first-token hidden state and - * L2-normalize it downstream. 
- */ -function clsPool( - lastHidden: Float32Array, - rowIdx: number, - seqLen: number, - hiddenSize: number, -): Float32Array { - const rowStart = rowIdx * seqLen * hiddenSize; - const out = new Float32Array(hiddenSize); - for (let i = 0; i < hiddenSize; i++) { - out[i] = lastHidden[rowStart + i] ?? 0; - } - return out; -} - -/** - * In-place L2 normalization with Kahan-summed squared norm for 2-ULP tighter - * precision than naive sum. Single division by `sqrt(norm)` keeps the op - * deterministic across x86_64 + aarch64 (IEEE-754 round-to-nearest-even). - */ -function l2NormalizeInPlace(vec: Float32Array): void { - let sum = 0; - let comp = 0; // Kahan compensator - for (let i = 0; i < vec.length; i++) { - const v = vec[i] ?? 0; - const term = v * v - comp; - const t = sum + term; - comp = t - sum - term; - sum = t; - } - if (sum <= 0) { - return; - } - const inv = 1 / Math.sqrt(sum); - for (let i = 0; i < vec.length; i++) { - vec[i] = (vec[i] ?? 0) * inv; - } -} - /** Internal implementation — exported only via the {@link Embedder} seam. */ class OnnxEmbedder implements Embedder { readonly dim = EMBED_DIM; @@ -214,10 +193,9 @@ class OnnxEmbedder implements Embedder { readonly #session: InferenceSession; readonly #tokenizer: Tokenizer; - readonly #normalize: boolean; readonly #maxModelLength: number; // Runtime `Tensor` constructor, threaded in from the dynamic - // `import("onnxruntime-node")` so this module never statically loads the + // `import("onnxruntime-web")` so this module never statically loads the // native binding. 
readonly #Tensor: typeof Tensor; #closed = false; @@ -226,18 +204,17 @@ class OnnxEmbedder implements Embedder { readonly session: InferenceSession; readonly tokenizer: Tokenizer; readonly variant: "fp32" | "int8"; - readonly normalize: boolean; readonly maxModelLength: number; readonly Tensor: typeof Tensor; }) { this.#session = params.session; this.#tokenizer = params.tokenizer; this.modelId = embedderModelId(params.variant); - this.#normalize = params.normalize; this.#maxModelLength = params.maxModelLength; this.#Tensor = params.Tensor; } + /** Embed a single DOCUMENT (no query prefix). */ async embed(text: string): Promise<Float32Array> { this.#ensureOpen(); const [vec] = await this.embedBatch([text]); @@ -247,6 +224,17 @@ class OnnxEmbedder implements Embedder { return vec; } + /** + * Embed a QUERY. F2LLM expects the `Instruct:`-wrapped prefix on query + * text only; documents (`embed`/`embedBatch`) get none. Keeping this on + * the embedder localizes the model-specific instruction string and keeps + * the asymmetry explicit + unit-testable. + */ + async embedQuery(text: string): Promise<Float32Array> { + this.#ensureOpen(); + return this.embed(buildQueryText(text)); + } + async embedBatch(texts: readonly string[]): Promise<Float32Array[]> { this.#ensureOpen(); if (texts.length === 0) { @@ -263,13 +251,13 @@ class OnnxEmbedder implements Embedder { } if (batchMax === 0) { // Degenerate case: every input tokenized to zero tokens. Return zero - // vectors (still dim=768) so callers downstream get a stable shape. + // vectors (still dim=320) so callers downstream get a stable shape. return texts.map(() => new Float32Array(EMBED_DIM)); } // Build flat [B, seqLen] buffers.
const batchSize = encoded.length; - const flatIds = new BigInt64Array(batchSize * batchMax).fill(MODERNBERT_PAD_ID); + const flatIds = new BigInt64Array(batchSize * batchMax).fill(F2LLM_PAD_ID); const flatMask = new BigInt64Array(batchSize * batchMax); for (let b = 0; b < batchSize; b++) { const e = encoded[b]; @@ -285,29 +273,30 @@ class OnnxEmbedder implements Embedder { input_ids: new Tensor("int64", flatIds, dims), attention_mask: new Tensor("int64", flatMask, dims), }; - const results = await this.#session.run(feeds, ["last_hidden_state"]); - const hidden = results["last_hidden_state"]; - if (hidden === undefined || hidden.type !== "float32") { + // F2LLM's graph pools (last-token) + L2-normalizes internally and emits a + // single output named `embedding`, shape [B, EMBED_DIM] — already + // unit-length. We do NO JS-side pooling/normalization here. + const results = await this.#session.run(feeds, ["embedding"]); + const embedding = results["embedding"]; + if (embedding === undefined || embedding.type !== "float32") { throw new Error( - `ONNX session did not return float32 last_hidden_state (got ${String(hidden?.type)})`, + `ONNX session did not return a float32 'embedding' tensor (got ${String(embedding?.type)})`, ); } - // Shape is [B, seqLen, hiddenSize]. hiddenSize derived from data length - // so we don't hard-fail if a checkpoint ever surprises us with a - // different dim — we just assert it matches EMBED_DIM at the boundary. - const data = hidden.data as Float32Array; - const hiddenSize = data.length / (batchSize * batchMax); - if (hiddenSize !== EMBED_DIM) { - throw new Error(`Expected hidden size ${EMBED_DIM}, got ${hiddenSize}. Wrong model loaded?`); + // Shape is [B, EMBED_DIM] (NOT [B, seqLen, H]). Derive the per-row width + // from the flat buffer length and assert it matches EMBED_DIM at the + // boundary so a wrong model loaded surfaces loudly. 
+ const data = embedding.data as Float32Array; + const rowDim = data.length / batchSize; + if (rowDim !== EMBED_DIM) { + throw new Error(`Expected embedding dim ${EMBED_DIM}, got ${rowDim}. Wrong model loaded?`); } const out: Float32Array[] = []; for (let b = 0; b < batchSize; b++) { - const vec = clsPool(data, b, batchMax, hiddenSize); - if (this.#normalize) { - l2NormalizeInPlace(vec); - } - out.push(vec); + // Copy each row out of the shared buffer so callers own an independent + // Float32Array (the graph already normalized it). + out.push(data.slice(b * EMBED_DIM, (b + 1) * EMBED_DIM)); } return out; } @@ -328,7 +317,7 @@ class OnnxEmbedder implements Embedder { } /** - * Open a deterministic gte-modernbert-base embedder. + * Open a deterministic F2LLM-v2-80M embedder. * * Throws {@link EmbedderNotSetupError} if the weight files are not present — * callers in the CLI use this to surface `codehub setup --embeddings` @@ -337,12 +326,11 @@ class OnnxEmbedder implements Embedder { export async function openOnnxEmbedder(cfg: EmbedderConfig = {}): Promise<Embedder> { const variant = cfg.variant ?? "fp32"; const modelDir = resolveModelDir(cfg.modelDir, variant); - const normalize = cfg.normalize ?? true; // `maxSequenceLength` is the caller-facing budget in user tokens; the - // actual model input adds 2 slots for [CLS]/[SEP], capped at - // MODEL_MAX_POSITION (8192) to fit the position embedding table. - const userMax = cfg.maxSequenceLength ?? MODEL_MAX_POSITION - 2; - const maxModelLength = Math.min(userMax + 2, MODEL_MAX_POSITION); + // tokenizer appends a single EOS token, so the model input is at most + // userMax + 1, capped at MODEL_MAX_LENGTH. + const userMax = cfg.maxSequenceLength ??
MODEL_MAX_LENGTH - 1; + const maxModelLength = Math.min(userMax + 1, MODEL_MAX_LENGTH); const { modelPath, tokenizerDir } = await assertModelFiles(modelDir, variant); @@ -387,7 +375,6 @@ export async function openOnnxEmbedder(cfg: EmbedderConfig = {}): Promise { it("resolveModelDir builds fp32 path by default", () => { const dir = resolveModelDir(); - equal(dir, join(homedir(), ".codehub", "models", "gte-modernbert-base", "fp32")); + equal(dir, join(homedir(), ".codehub", "models", "f2llm-v2-80m", "fp32")); }); it("resolveModelDir respects int8 variant", () => { const dir = resolveModelDir(undefined, "int8"); - equal(dir, join(homedir(), ".codehub", "models", "gte-modernbert-base", "int8")); + equal(dir, join(homedir(), ".codehub", "models", "f2llm-v2-80m", "int8")); }); it("resolveModelDir returns override unchanged when provided", () => { @@ -59,11 +59,8 @@ describe("paths", () => { equal(modelFileName("int8"), "model_int8.onnx"); }); - it("TOKENIZER_FILES enumerates the four required JSON files", () => { - deepEqual( - [...TOKENIZER_FILES], - ["tokenizer.json", "tokenizer_config.json", "config.json", "special_tokens_map.json"], - ); - ok(TOKENIZER_FILES.length === 4); + it("TOKENIZER_FILES enumerates the two required JSON files", () => { + deepEqual([...TOKENIZER_FILES], ["tokenizer.json", "tokenizer_config.json"]); + ok(TOKENIZER_FILES.length === 2); }); }); diff --git a/packages/embedder/src/paths.ts b/packages/embedder/src/paths.ts index dbd91105..9aebf786 100644 --- a/packages/embedder/src/paths.ts +++ b/packages/embedder/src/paths.ts @@ -1,13 +1,11 @@ /** - * Resolves the on-disk location of gte-modernbert-base weight files. + * Resolves the on-disk location of F2LLM-v2-80M weight files. 
* * Layout convention: - * ${CODEHUB_HOME:-~/.codehub}/models/gte-modernbert-base/${variant}/ + * ${CODEHUB_HOME:-~/.codehub}/models/f2llm-v2-80m/${variant}/ * ├── model.onnx (or model_int8.onnx) * ├── tokenizer.json - * ├── tokenizer_config.json - * ├── config.json - * └── special_tokens_map.json + * └── tokenizer_config.json * * `codehub setup --embeddings` is the code path that populates this * directory; this module just resolves paths and never touches the network. @@ -16,7 +14,7 @@ import { homedir } from "node:os"; import { join, resolve } from "node:path"; -const MODEL_SUBDIR = "models/gte-modernbert-base"; +const MODEL_SUBDIR = "models/f2llm-v2-80m"; /** * Root directory that holds every OpenCodeHub-managed artefact (model weights, @@ -49,10 +47,10 @@ export function modelFileName(variant: "fp32" | "int8"): string { return variant === "fp32" ? "model.onnx" : "model_int8.onnx"; } -/** All tokenizer-related files we require alongside the ONNX weights. */ -export const TOKENIZER_FILES = [ - "tokenizer.json", - "tokenizer_config.json", - "config.json", - "special_tokens_map.json", -] as const; +/** + * All tokenizer-related files we require alongside the ONNX weights. The + * F2LLM ONNX export ships only these two — pooling + normalization are + * baked into the graph, so there is no separate `config.json` / + * `special_tokens_map.json` to fetch. + */ +export const TOKENIZER_FILES = ["tokenizer.json", "tokenizer_config.json"] as const; diff --git a/packages/embedder/src/query-prefix.ts b/packages/embedder/src/query-prefix.ts new file mode 100644 index 00000000..64be412f --- /dev/null +++ b/packages/embedder/src/query-prefix.ts @@ -0,0 +1,25 @@ +/** + * F2LLM query-prefix helper. + * + * F2LLM-v2-80M is an asymmetric retrieval model: QUERY text is wrapped in an + * `Instruct: {instruction}\nQuery: {query}` template, while DOCUMENT text is + * embedded raw. Applying the prefix to documents (or omitting it on queries) + * degrades retrieval. 
This module is the single source of truth for the + * instruction string and the wrapping format so the query path + * (`embedQuery`) and any backend that prefixes caller-side stay in lockstep. + * + * The instruction string is the one validated in the POC ranking parity + * harness (`export/verify_ranking.py`). + */ + +/** The retrieval instruction prepended to every query (F2LLM contract). */ +export const F2LLM_QUERY_INSTRUCTION = + "Given a code search query, retrieve the most relevant code snippet."; + +/** + * Wrap raw query text in the F2LLM `Instruct:`/`Query:` template. Documents + * must NOT be passed through this — embed them raw. + */ +export function buildQueryText(query: string): string { + return `Instruct: ${F2LLM_QUERY_INSTRUCTION}\nQuery: ${query}`; +} diff --git a/packages/embedder/src/sagemaker-embedder.integration.test.ts b/packages/embedder/src/sagemaker-embedder.integration.test.ts index f72ef53d..4dc71090 100644 --- a/packages/embedder/src/sagemaker-embedder.integration.test.ts +++ b/packages/embedder/src/sagemaker-embedder.integration.test.ts @@ -11,7 +11,7 @@ * AWS_PROFILE=lalsaado-handson \ * AWS_REGION=us-east-1 \ * CODEHUB_INTEGRATION=1 \ - * CODEHUB_EMBEDDING_SAGEMAKER_ENDPOINT=gte-modernbert-embed \ + * CODEHUB_EMBEDDING_SAGEMAKER_ENDPOINT=f2llm-embed \ * pnpm --filter @opencodehub/embedder test */ @@ -33,7 +33,7 @@ const skipReason = !INTEGRATION_GATE describe("openSagemakerEmbedder — live SageMaker endpoint", { skip: skipReason ?? 
undefined, }, () => { - it("single text returns a 768-d Float32Array with unit L2 norm (≈1.0)", async () => { + it("single text returns a 320-d Float32Array with unit L2 norm (≈1.0)", async () => { const embedder = await openSagemakerEmbedder({ endpointName: ENDPOINT as string, region: REGION, @@ -42,9 +42,9 @@ describe("openSagemakerEmbedder — live SageMaker endpoint", { const vec = await embedder.embed( "function add(a: number, b: number): number { return a + b; }", ); - equal(vec.length, 768); - // TEI with the gte-modernbert-base bundled Normalize module returns - // L2-normalized vectors; assert norm is close to 1. + equal(vec.length, 320); + // F2LLM bakes last-token pooling + L2 normalization into its graph, so + // the endpoint returns L2-normalized vectors; assert norm is close to 1. let norm = 0; for (let i = 0; i < vec.length; i++) { const v = vec[i] ?? 0; @@ -66,7 +66,7 @@ describe("openSagemakerEmbedder — live SageMaker endpoint", { const texts = Array.from({ length: 64 }, (_, i) => `const value${i} = ${i};`); const out = await embedder.embedBatch(texts); equal(out.length, 64); - for (const v of out) equal(v.length, 768); + for (const v of out) equal(v.length, 320); } finally { await embedder.close(); } @@ -81,7 +81,7 @@ describe("openSagemakerEmbedder — live SageMaker endpoint", { const texts = Array.from({ length: 100 }, (_, i) => `let x${i} = ${i};`); const out = await embedder.embedBatch(texts); equal(out.length, 100); - for (const v of out) equal(v.length, 768); + for (const v of out) equal(v.length, 320); } finally { await embedder.close(); } diff --git a/packages/embedder/src/sagemaker-embedder.parity.test.ts b/packages/embedder/src/sagemaker-embedder.parity.test.ts index 89644e58..85cf7f98 100644 --- a/packages/embedder/src/sagemaker-embedder.parity.test.ts +++ b/packages/embedder/src/sagemaker-embedder.parity.test.ts @@ -9,9 +9,12 @@ * `CODEHUB_HOME`. 
Weight-missing is detected lazily — `openOnnxEmbedder` * throws `EmbedderNotSetupError` and we skip the rest of the suite. * - * Acceptance threshold: per-pair cosine similarity ≥ 0.99. Both backends - * use CLS pooling + L2 normalization, so cosine should be ≳ 0.999 on the - * happy path — the 0.99 floor absorbs fp16-vs-fp32 drift on the GPU side. + * Acceptance threshold: per-pair cosine similarity ≥ 0.99. Both backends are + * F2LLM-v2-80M (last-token pooling + L2 normalization baked into the graph), + * so cosine should be ≳ 0.999 on the happy path — the 0.99 floor absorbs + * fp16-vs-fp32 drift on the GPU side. The SageMaker endpoint pointed at by + * `CODEHUB_EMBEDDING_SAGEMAKER_ENDPOINT` must serve the same F2LLM model for + * the parity assertion to hold. */ import { ok } from "node:assert/strict"; diff --git a/packages/embedder/src/sagemaker-embedder.test.ts b/packages/embedder/src/sagemaker-embedder.test.ts index a5098cc2..3bb0693d 100644 --- a/packages/embedder/src/sagemaker-embedder.test.ts +++ b/packages/embedder/src/sagemaker-embedder.test.ts @@ -2,7 +2,7 @@ * Tests for the SageMaker embedder backend. 
* * Coverage: - * - happy path: single input + small batch returns 768-d Float32Array + * - happy path: single input + small batch returns 320-d Float32Array * - large batch (>64) splits into multiple InvokeEndpointCommand calls * - dim mismatch throws with clear message * - row-count mismatch (endpoint returned fewer rows than inputs) throws @@ -96,10 +96,10 @@ describe("readSagemakerEmbedderConfigFromEnv", () => { }); it("reads the endpoint name when set", () => { - process.env["CODEHUB_EMBEDDING_SAGEMAKER_ENDPOINT"] = "gte-modernbert-embed"; + process.env["CODEHUB_EMBEDDING_SAGEMAKER_ENDPOINT"] = "f2llm-embed"; const cfg = readSagemakerEmbedderConfigFromEnv(); ok(cfg !== null); - equal(cfg.endpointName, "gte-modernbert-embed"); + equal(cfg.endpointName, "f2llm-embed"); equal(cfg.region, undefined); // default applied at factory }); @@ -130,8 +130,8 @@ describe("readSagemakerEmbedderConfigFromEnv", () => { }); describe("openSagemakerEmbedder — happy path", () => { - it("embeds a single text and returns a 768-d Float32Array", async () => { - const row = vec(768, 0.1); + it("embeds a single text and returns a 320-d Float32Array", async () => { + const row = vec(320, 0.1); const { runtime, calls, lastBatch } = makeRuntime(() => [row]); const embedder = await openSagemakerEmbedder({ @@ -140,7 +140,7 @@ describe("openSagemakerEmbedder — happy path", () => { }); const out = await embedder.embed("hello"); - equal(out.length, 768); + equal(out.length, 320); equal(out[0], Math.fround(0.1)); equal(calls(), 1); equal(lastBatch(), 1); @@ -148,18 +148,18 @@ describe("openSagemakerEmbedder — happy path", () => { }); it("reports modelId with endpoint-name stamp by default", async () => { - const { runtime } = makeRuntime(() => [vec(768, 0)]); + const { runtime } = makeRuntime(() => [vec(320, 0)]); const embedder = await openSagemakerEmbedder({ - endpointName: "gte-modernbert-embed", + endpointName: "f2llm-embed", runtime, }); - equal(embedder.dim, 768); - 
match(embedder.modelId, /^gte-modernbert-base\/sagemaker:gte-modernbert-embed$/); + equal(embedder.dim, 320); + match(embedder.modelId, /^f2llm-v2-80m\/sagemaker:f2llm-embed$/); await embedder.close(); }); it("honors an explicit modelId override", async () => { - const { runtime } = makeRuntime(() => [vec(768, 0)]); + const { runtime } = makeRuntime(() => [vec(320, 0)]); const embedder = await openSagemakerEmbedder({ endpointName: "anything", modelId: "custom/model:v1", @@ -171,7 +171,7 @@ describe("openSagemakerEmbedder — happy path", () => { it("batches ≤64 inputs in a single InvokeEndpoint call", async () => { const { runtime, calls, lastBatch } = makeRuntime((n) => - Array.from({ length: n }, (_, i) => vec(768, i * 0.01)), + Array.from({ length: n }, (_, i) => vec(320, i * 0.01)), ); const embedder = await openSagemakerEmbedder({ @@ -196,7 +196,7 @@ describe("openSagemakerEmbedder — happy path", () => { }; sizes.push(parsed.inputs.length); return { - Body: responseBody(parsed.inputs.map((_, i) => vec(768, i * 0.001))), + Body: responseBody(parsed.inputs.map((_, i) => vec(320, i * 0.001))), }; }, }; @@ -215,7 +215,7 @@ describe("openSagemakerEmbedder — happy path", () => { }); it("returns an empty array for an empty batch without calling the endpoint", async () => { - const { runtime, calls } = makeRuntime(() => [vec(768, 0)]); + const { runtime, calls } = makeRuntime(() => [vec(320, 0)]); const embedder = await openSagemakerEmbedder({ endpointName: "test-endpoint", runtime, @@ -234,14 +234,14 @@ describe("openSagemakerEmbedder — error cases", () => { endpointName: "test-endpoint", runtime, }); - await rejects(embedder.embed("hello"), /512d vector at row 0, expected 768d/); + await rejects(embedder.embed("hello"), /512d vector at row 0, expected 320d/); }); it("throws on row-count mismatch (endpoint returned too few rows)", async () => { const runtime: SagemakerRuntimeLike = { async send(_command: SendCmd) { // Return 1 row for any number of inputs. 
- return { Body: responseBody([vec(768, 0)]) }; + return { Body: responseBody([vec(320, 0)]) }; }, }; const embedder = await openSagemakerEmbedder({ @@ -290,7 +290,7 @@ describe("openSagemakerEmbedder — error cases", () => { (err as { name: string }).name = "ValidationException"; throw err; } - return { Body: responseBody([vec(768, parsed.inputs[0] === "a" ? 0.1 : 0.2)]) }; + return { Body: responseBody([vec(320, parsed.inputs[0] === "a" ? 0.1 : 0.2)]) }; }, }; const embedder = await openSagemakerEmbedder({ @@ -348,7 +348,7 @@ describe("tryOpenHttpEmbedder — SageMaker precedence", () => { }); it("throws when offline AND SageMaker env is configured", () => { - process.env["CODEHUB_EMBEDDING_SAGEMAKER_ENDPOINT"] = "gte-modernbert-embed"; + process.env["CODEHUB_EMBEDDING_SAGEMAKER_ENDPOINT"] = "f2llm-embed"; throws( () => tryOpenHttpEmbedder({ offline: true }), /SageMaker embeddings are disabled in offline mode/, diff --git a/packages/embedder/src/sagemaker-embedder.ts b/packages/embedder/src/sagemaker-embedder.ts index 7e3dddfc..1b8c5744 100644 --- a/packages/embedder/src/sagemaker-embedder.ts +++ b/packages/embedder/src/sagemaker-embedder.ts @@ -1,8 +1,9 @@ /** * SageMaker embedder backend. Invokes a TEI (Text Embeddings Inference) - * SageMaker endpoint — e.g. the `embed-serve` stack at - * `/efs/lalsaado/workplace/embed-serve/` which serves - * `Alibaba-NLP/gte-modernbert-base` as `gte-modernbert-embed` in us-east-1. + * SageMaker endpoint. NOTE: the local F2LLM query-prefix asymmetry is NOT + * applied here — a remote endpoint must handle query/document pooling + + * prefixing server-side. The default `dims` (320) matches F2LLM, but the + * caller's endpoint determines the actual model. * * Selection: {@link readSagemakerEmbedderConfigFromEnv} returns a config * when `CODEHUB_EMBEDDING_SAGEMAKER_ENDPOINT` is set; otherwise `null` so @@ -24,14 +25,14 @@ * - SDK retry (`maxAttempts: 5`) handles throttling + 5xx. 
* - Dims asserted on every response so a remote model swap cannot * silently pollute downstream HNSW indexes. - * - `modelId` is stamped as `gte-modernbert-base/sagemaker:` + * - `modelId` is stamped as `f2llm-v2-80m/sagemaker:` * so an index built with this backend is visibly distinct from a * local ONNX index. */ import { type Embedder, EmbedderNotSetupError } from "./types.js"; -const DEFAULT_DIMS = 768; +const DEFAULT_DIMS = 320; const DEFAULT_REGION = "us-east-1"; const MAX_BATCH = 64; const DEFAULT_MAX_ATTEMPTS = 5; @@ -55,17 +56,17 @@ export interface SagemakerRuntimeLike { /** Configuration for {@link openSagemakerEmbedder}. */ export interface SagemakerEmbedderConfig { - /** Name of the SageMaker endpoint (e.g. `gte-modernbert-embed`). */ + /** Name of the SageMaker endpoint (e.g. `f2llm-embed`). */ readonly endpointName: string; /** AWS region of the endpoint. Defaults to `us-east-1`. */ readonly region?: string; /** * Stable model id reported to the index layer. Defaults to - * `gte-modernbert-base/sagemaker:` so index metadata + * `f2llm-v2-80m/sagemaker:` so index metadata * distinguishes this backend from local ONNX. */ readonly modelId?: string; - /** Expected response-vector dimension. Defaults to 768. */ + /** Expected response-vector dimension. Defaults to 320. */ readonly dims?: number; /** SDK `maxAttempts`. Defaults to 5. */ readonly maxAttempts?: number; @@ -162,7 +163,7 @@ export async function openSagemakerEmbedder(cfg: SagemakerEmbedderConfig): Promi const region = cfg.region ?? DEFAULT_REGION; const dims = cfg.dims ?? DEFAULT_DIMS; const endpointName = cfg.endpointName; - const modelId = cfg.modelId ?? `gte-modernbert-base/sagemaker:${endpointName}`; + const modelId = cfg.modelId ?? `f2llm-v2-80m/sagemaker:${endpointName}`; const maxAttempts = cfg.maxAttempts ?? 
DEFAULT_MAX_ATTEMPTS; let runtime: SagemakerRuntimeLike; @@ -312,6 +313,8 @@ export async function openSagemakerEmbedder(cfg: SagemakerEmbedderConfig): Promi dim: dims, modelId, embed: embedOne, + // Remote endpoint owns pooling/prefix server-side; alias query→document. + embedQuery: embedOne, embedBatch, async close(): Promise<void> { // SageMakerRuntimeClient keeps an HTTP agent alive; destroy when diff --git a/packages/embedder/src/types.ts b/packages/embedder/src/types.ts index 6e084ebc..f96cfb4b 100644 --- a/packages/embedder/src/types.ts +++ b/packages/embedder/src/types.ts @@ -1,8 +1,8 @@ /** * Public types for the @opencodehub/embedder package. * - * The embedder turns a piece of text into a deterministic 768-dim Float32Array - * using the gte-modernbert-base ONNX model. Callers in @opencodehub/search and + * The embedder turns a piece of text into a deterministic 320-dim Float32Array + * using the F2LLM-v2-80M ONNX model. Callers in @opencodehub/search and * @opencodehub/mcp consume this via the `Embedder` interface; the concrete * implementation is opened with `openOnnxEmbedder`. */ @@ -13,17 +13,26 @@ * contract the graphHash CI gate relies on. */ export interface Embedder { - /** Output dimension. Always 768 for gte-modernbert-base. */ + /** Output dimension. 320 for F2LLM-v2-80M (the local ONNX backend). */ readonly dim: number; /** - * Stable model identifier, e.g. `gte-modernbert-base/fp32`. Used by - * the storage layer to tag `embeddings.model` so incompatible vectors are - * never mixed in the same HNSW index. + * Stable model identifier, e.g. `f2llm-v2-80m/fp32`. Used by the storage + * layer to tag `embeddings.model` so incompatible vectors are never mixed + * in the same index. */ readonly modelId: string; - /** Embed a single text. */ + /** + * Embed a single DOCUMENT (no query prefix). Use {@link embedQuery} for + * search queries. + */ embed(text: string): Promise<Float32Array>; - /** Embed a batch of texts. Returned array matches the input order 1:1.
*/ + /** + * Embed a single QUERY. For asymmetric models (F2LLM) this applies the + * model's query instruction prefix; documents are embedded raw via + * {@link embed}. Symmetric backends may alias this to {@link embed}. + */ + embedQuery(text: string): Promise; + /** Embed a batch of DOCUMENTS. Returned array matches the input order 1:1. */ embedBatch(texts: readonly string[]): Promise; /** Release native session + tokenizer resources. Idempotent. */ close(): Promise; @@ -36,21 +45,18 @@ export interface Embedder { */ export interface EmbedderConfig { /** - * Directory containing `model.onnx` (or `model_int8.onnx`) and the four + * Directory containing `model.onnx` (or `model_int8.onnx`) and the two * tokenizer JSON files. Defaults to - * `${CODEHUB_HOME:-~/.codehub}/models/gte-modernbert-base/${variant}/`. + * `${CODEHUB_HOME:-~/.codehub}/models/f2llm-v2-80m/${variant}/`. */ readonly modelDir?: string; /** Which ONNX weight file to load. Defaults to `fp32`. */ readonly variant?: "fp32" | "int8"; /** - * Max tokens of the user-supplied text, before `[CLS]`/`[SEP]` are added. - * Defaults to 8190 so the full sequence fits in ModernBERT's 8192-token - * position embedding table. + * Max tokens of the user-supplied text, before the EOS token is appended. + * Defaults to 8191 so the full sequence fits the 8192-token operative cap. */ readonly maxSequenceLength?: number; - /** L2-normalize the output vector. Defaults to `true`. */ - readonly normalize?: boolean; /** * Directory containing the onnxruntime-web `.wasm` artifacts * (`ort-wasm-simd-threaded.*.wasm`). Sets `ort.env.wasm.wasmPaths`. 
When diff --git a/packages/ingestion/CHANGELOG.md b/packages/ingestion/CHANGELOG.md index c1a2dea7..9b9f79f8 100644 --- a/packages/ingestion/CHANGELOG.md +++ b/packages/ingestion/CHANGELOG.md @@ -1,5 +1,12 @@ # Changelog +## [0.6.0](https://github.com/theagenticguy/opencodehub/compare/ingestion-v0.5.0...ingestion-v0.6.0) (2026-06-26) + + +### ⚠ BREAKING CHANGES + +* **embedder:** local embedding model swapped to `codefuse-ai/F2LLM-v2-80M` (320-dim, was gte-modernbert-base 768-dim). The analyze path now suppresses the content-hash cache on a model change so all symbols re-embed (no mixed-dim store); existing stores must be rebuilt with `codehub analyze --embeddings`. + ## [0.5.0](https://github.com/theagenticguy/opencodehub/compare/ingestion-v0.4.5...ingestion-v0.5.0) (2026-06-01) diff --git a/packages/ingestion/src/pipeline/phases/embedder-pool.ts b/packages/ingestion/src/pipeline/phases/embedder-pool.ts index c02ca4ac..8477e6e7 100644 --- a/packages/ingestion/src/pipeline/phases/embedder-pool.ts +++ b/packages/ingestion/src/pipeline/phases/embedder-pool.ts @@ -53,7 +53,7 @@ export function openOnnxEmbedderPool(opts: EmbedderPoolOptions): Embedder { }); let closed = false; - const dim = 768; // gte-modernbert-base — matches OnnxEmbedder's EMBED_DIM. + const dim = 320; // F2LLM-v2-80M — matches OnnxEmbedder's EMBED_DIM. async function embedBatch(texts: readonly string[]): Promise { if (closed) throw new Error("Embedder pool is closed"); @@ -78,6 +78,13 @@ export function openOnnxEmbedderPool(opts: EmbedderPoolOptions): Embedder { if (vec === undefined) throw new Error("embedBatch returned empty result"); return vec; }, + // Ingestion only embeds documents; the pool never embeds queries. Alias + // to embed() to satisfy the Embedder interface (no query prefix applied). 
+ async embedQuery(text: string): Promise { + const [vec] = await embedBatch([text]); + if (vec === undefined) throw new Error("embedBatch returned empty result"); + return vec; + }, embedBatch, async close(): Promise { if (closed) return; diff --git a/packages/ingestion/src/pipeline/phases/embeddings.test.ts b/packages/ingestion/src/pipeline/phases/embeddings.test.ts index 3766dc75..10c90005 100644 --- a/packages/ingestion/src/pipeline/phases/embeddings.test.ts +++ b/packages/ingestion/src/pipeline/phases/embeddings.test.ts @@ -157,7 +157,7 @@ describe("embeddingsPhase", () => { // across runs produce identical embeddings. // --------------------------------------------------------------------------- -const HTTP_DIM = 768; +const HTTP_DIM = 320; /** * Hash-derived deterministic embedding. Stable across runs given the same diff --git a/packages/ingestion/src/pipeline/phases/embeddings.ts b/packages/ingestion/src/pipeline/phases/embeddings.ts index 91a15778..df5cfc72 100644 --- a/packages/ingestion/src/pipeline/phases/embeddings.ts +++ b/packages/ingestion/src/pipeline/phases/embeddings.ts @@ -1,5 +1,5 @@ /** - * Embeddings phase — generates 768-dim vectors across one or more + * Embeddings phase — generates 320-dim vectors across one or more * hierarchical tiers and materialises them into the phase output as an * array of `EmbeddingRow`s the CLI upserts into the SQLite store. * @@ -188,7 +188,7 @@ export interface EmbedderPhaseOutput { readonly chunksTotal: number; /** * Stable id tag for the embedder that produced these rows — e.g. - * `gte-modernbert-base/fp32`. Empty string when the phase was a + * `f2llm-v2-80m/fp32`. Empty string when the phase was a * no-op (flag off or weights missing). */ readonly embeddingsModelId: string; @@ -580,8 +580,9 @@ async function runEmbeddings(ctx: PipelineContext): Promise const priorHashes: Map = forceFlag || hashCache === undefined ? 
new Map() : await hashCache.list(); - // Max tokens includes [CLS]/[SEP]; the embedder caps input at 510 user - // tokens by default. Keep the chunker slightly conservative. + // Per-chunk token budget. F2LLM accepts up to 8192 tokens, but smaller + // chunks keep last-token-pooled vectors focused on a single unit of + // meaning; 500 mirrors the long-standing chunking granularity. const maxUserTokens = 500; // Lookup summaries by nodeId (the newest `createdAt` wins when multiple diff --git a/packages/mcp/CHANGELOG.md b/packages/mcp/CHANGELOG.md index 1825af10..d287b30f 100644 --- a/packages/mcp/CHANGELOG.md +++ b/packages/mcp/CHANGELOG.md @@ -1,5 +1,12 @@ # Changelog +## [0.6.0](https://github.com/theagenticguy/opencodehub/compare/mcp-v0.5.0...mcp-v0.6.0) (2026-06-26) + + +### ⚠ BREAKING CHANGES + +* **embedder:** local embedding model swapped to `codefuse-ai/F2LLM-v2-80M` (320-dim, was gte-modernbert-base 768-dim). Existing stores must be rebuilt with `codehub analyze --embeddings`; queries against a stale-dim store are refused by the fingerprint guard. + ## [0.5.0](https://github.com/theagenticguy/opencodehub/compare/mcp-v0.4.5...mcp-v0.5.0) (2026-06-01) diff --git a/packages/mcp/src/server.ts b/packages/mcp/src/server.ts index 09f59baa..450f79ab 100644 --- a/packages/mcp/src/server.ts +++ b/packages/mcp/src/server.ts @@ -84,7 +84,7 @@ export interface StartServerOptions { } /** - * Probe for gte-modernbert-base weights on disk. Runs once at server startup + * Probe for F2LLM-v2-80M weights on disk. Runs once at server startup * and logs a single structured warning when the weights are absent so * agents see the BM25-only fallback reason. Never throws: a missing or * unreadable model directory is a supported deployment mode. 
@@ -105,7 +105,7 @@ async function probeEmbedderWeights(silent: boolean): Promise { } const root = getDefaultModelRoot(); console.warn( - `[mcp] hybrid: embeddings weights not found at ${root}/models/gte-modernbert-base/; run \`codehub setup --embeddings\`. Falling back to BM25-only.`, + `[mcp] hybrid: embeddings weights not found at ${root}/models/f2llm-v2-80m/; run \`codehub setup --embeddings\`. Falling back to BM25-only.`, ); } catch (err) { // Probe failure is non-fatal; surface the reason but keep going. diff --git a/packages/mcp/src/tools/query.test.ts b/packages/mcp/src/tools/query.test.ts index 139cf6f8..ca8de4e1 100644 --- a/packages/mcp/src/tools/query.test.ts +++ b/packages/mcp/src/tools/query.test.ts @@ -387,6 +387,11 @@ class FakeEmbedder implements Embedder { async embed(_text: string): Promise { return new Float32Array([0.1, 0.2, 0.3, 0.4]); } + // F2LLM gained a query-only `embedQuery` path; the fake aliases it to + // `embed` since the query tool only needs a stable Float32Array back. + async embedQuery(text: string): Promise { + return this.embed(text); + } async embedBatch(texts: readonly string[]): Promise { return texts.map(() => new Float32Array([0.1, 0.2, 0.3, 0.4])); } @@ -576,9 +581,7 @@ test("query: populated embeddings + EMBEDDER_NOT_SETUP → warn + BM25 fallback" }; try { const opener: EmbedderFactory = async () => { - const err = new Error( - "gte-modernbert-base weights not found. Run `codehub setup --embeddings`.", - ); + const err = new Error("F2LLM-v2-80M weights not found. Run `codehub setup --embeddings`."); // Shape matches EmbedderNotSetupError.code. 
(err as unknown as { code: string }).code = "EMBEDDER_NOT_SETUP"; throw err; diff --git a/packages/mcp/src/tools/query.ts b/packages/mcp/src/tools/query.ts index 86b5ee7c..4d9f6cda 100644 --- a/packages/mcp/src/tools/query.ts +++ b/packages/mcp/src/tools/query.ts @@ -7,8 +7,9 @@ * corpus extends transparently (see {@link bm25CorpusHasSummaries}) so * summarized prose participates as soon as the ingestion phase lands. * 2. HNSW vector search over the `embeddings` table. The query text is - * embedded with the same gte-modernbert-base ONNX model the ingestion - * pipeline uses, so the vectors live in the same space. + * embedded with the same F2LLM-v2-80M ONNX model the ingestion + * pipeline uses, so the vectors live in the same space. (Queries get + * the F2LLM `Instruct:` prefix; documents are embedded raw.) * * Graceful fallback: * - If the `embeddings` table is empty, skip the vector leg entirely. @@ -814,7 +815,7 @@ export function registerQueryTool(server: McpServer, ctx: ToolContext): void { description: [ "True hybrid retrieval over the indexed code graph: BM25 keyword search", "(over symbol name + signature + description) fused with HNSW vector", - "search (gte-modernbert-base, 768-dim) via Reciprocal Rank Fusion (k=60).", + "search (F2LLM-v2-80M, 320-dim) via Reciprocal Rank Fusion (k=60).", "Each result carries `rank`, `nodeId`, `name`, `kind`, `filePath`,", "`startLine`/`endLine`, a capped `snippet` (~200 chars), the fused", "`score`, and `sources` indicating which ranker(s) contributed (`bm25`", diff --git a/packages/mcp/src/tools/shared.ts b/packages/mcp/src/tools/shared.ts index d061b603..5554a895 100644 --- a/packages/mcp/src/tools/shared.ts +++ b/packages/mcp/src/tools/shared.ts @@ -21,7 +21,7 @@ import { RepoResolveError, type ResolvedRepo, resolveRepo } from "../repo-resolv /** * Factory for opening an embedder on demand. 
The default factory imports * `@opencodehub/embedder` and calls `openOnnxEmbedder()`; tests inject a - * fake so they don't need gte-modernbert-base weights on disk. The factory + * fake so they don't need F2LLM-v2-80M weights on disk. The factory * must throw on failure — the `query` tool treats any throw as * "embedder unavailable, warn + fall back to BM25". */ diff --git a/packages/search/CHANGELOG.md b/packages/search/CHANGELOG.md index a77407a5..9b850475 100644 --- a/packages/search/CHANGELOG.md +++ b/packages/search/CHANGELOG.md @@ -1,5 +1,12 @@ # Changelog +## [0.4.0](https://github.com/theagenticguy/opencodehub/compare/search-v0.3.0...search-v0.4.0) (2026-06-26) + + +### ⚠ BREAKING CHANGES + +* **embedder:** local embedding model swapped to `codefuse-ai/F2LLM-v2-80M` (320-dim, was gte-modernbert-base 768-dim). Hybrid search now embeds queries through `embedQuery()` (Instruct/Query prefix) while documents stay raw; existing stores must be rebuilt with `codehub analyze --embeddings`. + ## [0.3.0](https://github.com/theagenticguy/opencodehub/compare/search-v0.2.3...search-v0.3.0) (2026-06-01) diff --git a/packages/search/src/embedder.ts b/packages/search/src/embedder.ts index 5c77c3f3..e777e82d 100644 --- a/packages/search/src/embedder.ts +++ b/packages/search/src/embedder.ts @@ -13,7 +13,7 @@ import type { Embedder } from "./types.js"; -export const DEFAULT_EMBEDDER_DIM = 768; +export const DEFAULT_EMBEDDER_DIM = 320; /** Whether the deprecation warning has already fired in this process. */ let warnedOnce = false; @@ -44,4 +44,9 @@ export class NullEmbedder implements Embedder { } return new Float32Array(this.dim); } + + /** Query path mirrors {@link embed} — the stand-in has no model prefix. 
*/ + async embedQuery(text: string): Promise<Float32Array> { + return this.embed(text); + } } diff --git a/packages/search/src/hybrid.test.ts b/packages/search/src/hybrid.test.ts index 409e621a..38dc7377 100644 --- a/packages/search/src/hybrid.test.ts +++ b/packages/search/src/hybrid.test.ts @@ -149,6 +149,11 @@ class FakeEmbedder implements Embedder { async embed(): Promise<Float32Array> { return new Float32Array([0.1, 0.2, 0.3, 0.4]); } + // Symmetric stand-in: the query path mirrors the document path (no prefix). + // The fake ignores its input, so delegate to the no-arg `embed`. + async embedQuery(_text: string): Promise<Float32Array> { + return this.embed(); + } } describe("hybridSearch", () => { diff --git a/packages/search/src/hybrid.ts b/packages/search/src/hybrid.ts index 26911d6c..18c6a5a0 100644 --- a/packages/search/src/hybrid.ts +++ b/packages/search/src/hybrid.ts @@ -79,7 +79,9 @@ export async function hybridSearch( })); } - const vector = await embedder.embed(q.text); + // Query text — embed via embedQuery so asymmetric models (F2LLM) apply + // their `Instruct:` query prefix. Documents were indexed raw via embed(). + const vector = await embedder.embedQuery(q.text); let annHits: readonly { readonly nodeId: string; readonly distance: number }[]; if (q.mode === "zoom") { diff --git a/packages/search/src/types.ts b/packages/search/src/types.ts index 091d78fd..54980ec3 100644 --- a/packages/search/src/types.ts +++ b/packages/search/src/types.ts @@ -66,6 +66,12 @@ export interface FusedHit { * zero-vector in production and throws in tests. */ export interface Embedder { + /** Embed a DOCUMENT (no query prefix). */ embed(text: string): Promise<Float32Array>; + /** + * Embed a QUERY. For asymmetric models (F2LLM) this applies the model's + * query instruction prefix; documents are embedded raw via {@link embed}.
+ */ + embedQuery(text: string): Promise<Float32Array>; readonly dim: number; } diff --git a/packages/storage/CHANGELOG.md b/packages/storage/CHANGELOG.md index 451675ce..33be7fd1 100644 --- a/packages/storage/CHANGELOG.md +++ b/packages/storage/CHANGELOG.md @@ -1,5 +1,12 @@ # Changelog +## [0.4.0](https://github.com/theagenticguy/opencodehub/compare/storage-v0.3.0...storage-v0.4.0) (2026-06-26) + + +### ⚠ BREAKING CHANGES + +* **embedder:** embedding dimension changed to 320 (`codefuse-ai/F2LLM-v2-80M`, was gte-modernbert-base 768-dim). The `embeddingDim` store option defaults to 320; existing stores must be rebuilt with `codehub analyze --embeddings`. + ## [0.3.0](https://github.com/theagenticguy/opencodehub/compare/storage-v0.2.3...storage-v0.3.0) (2026-06-01) diff --git a/packages/storage/src/sqlite-adapter.test.ts b/packages/storage/src/sqlite-adapter.test.ts index 795d9345..9d9d6e0c 100644 --- a/packages/storage/src/sqlite-adapter.test.ts +++ b/packages/storage/src/sqlite-adapter.test.ts @@ -74,7 +74,7 @@ test("SqliteStore: graph + embeddings round-trip from ONE file across reopen", a try { const { graph, ids } = fixtureGraph(); - // ── Write phase ── (8-dim embeddings for a readable test; real default 768) + // ── Write phase ── (8-dim embeddings for a readable test; real default 320) const w = new SqliteStore(dbPath, { embeddingDim: 8 }); await w.open(); await w.createSchema(); @@ -82,7 +82,7 @@ test("SqliteStore: graph + embeddings round-trip from ONE file across reopen", a assert.equal(stats.nodeCount, 5, "5 nodes loaded"); assert.equal(stats.edgeCount, 3, "3 edges loaded"); - // 8-dim embeddings so the test is readable; real default is 768. + // 8-dim embeddings so the test is readable; real default is 320.
const vec = (seed: number): Float32Array => Float32Array.from({ length: 8 }, (_, i) => Math.sin(seed + i)); await w.upsertEmbeddings([ diff --git a/packages/storage/src/sqlite-adapter.ts b/packages/storage/src/sqlite-adapter.ts index c144a814..17f7731b 100644 --- a/packages/storage/src/sqlite-adapter.ts +++ b/packages/storage/src/sqlite-adapter.ts @@ -89,7 +89,7 @@ import { assertReadOnlySql } from "./sql-guard.js"; export interface SqliteStoreOptions { /** Open the file read-only. Query commands pass true; ingestion false. */ readonly readOnly?: boolean; - /** Embedding dimensionality. Defaults to 768 (Bedrock Titan / Cohere tier). */ + /** Embedding dimensionality. Defaults to 320 (F2LLM-v2-80M, the local ONNX tier). */ readonly embeddingDim?: number; /** * Journal mode. Defaults to WAL — the whole point of the spike. Overridable @@ -100,7 +100,7 @@ export interface SqliteStoreOptions { readonly timeoutMs?: number; } -const DEFAULT_DIM = 768; +const DEFAULT_DIM = 320; const SCHEMA_VERSION = "spike-sqlite-1"; const DEFAULT_TIMEOUT_MS = 5_000; const DEFAULT_COCHANGE_LOOKUP_LIMIT = 10; diff --git a/plugins/opencodehub/skills/codehub-guide/SKILL.md b/plugins/opencodehub/skills/codehub-guide/SKILL.md index e29f4cef..e35df607 100644 --- a/plugins/opencodehub/skills/codehub-guide/SKILL.md +++ b/plugins/opencodehub/skills/codehub-guide/SKILL.md @@ -15,7 +15,7 @@ For any task that touches code understanding, debugging, impact analysis, refact 2. Read `codehub://repo/{name}/context` — codebase stats and a staleness envelope. 3. Match the task to a skill below and follow that skill's checklist. -> If the context envelope reports the index is stale, run `codehub analyze` in the terminal first. If it says weights are missing, run `codehub setup --embeddings` to fetch the 768d gte-modernbert-base ONNX weights. +> If the context envelope reports the index is stale, run `codehub analyze` in the terminal first. 
If it says weights are missing, run `codehub setup --embeddings` to fetch the 320d F2LLM-v2-80M ONNX weights. ## Skills · analysis From cacf1bdd4e84702a2884591007db095f7f64fe0e Mon Sep 17 00:00:00 2001 From: Laith Al-Saadoon Date: Fri, 26 Jun 2026 01:48:26 +0000 Subject: [PATCH 2/2] ci(release): gate release pipeline to version-shaped tags only MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The `release: published` event fires for ANY GitHub release, including non-package releases that merely host assets — e.g. the `embed-v1` embedder-weights release this PR introduces (model-pins.ts pins its URLs). Creating that release triggered the full build → sign → npm-publish pipeline, which would have published every package to npm (OCH_NPM_PUBLISH_ENABLED is true). It was cancelled in time and built from main (not this branch), so nothing leaked, but the trigger must be filtered. Gate the `resolve` job (root of the chain; everything else `needs` it under a success() gate) to version-shaped tags only: `root-v*`, `cli-v*`, or bare `v*` (the release-please conventions). `workflow_call` / `workflow_dispatch` pass an explicit tag input and remain unaffected. A weights tag like `embed-v1` now skips the pipeline entirely. --- .github/workflows/release.yml | 12 ++++++++++++ 1 file changed, 12 insertions(+) diff --git a/.github/workflows/release.yml b/.github/workflows/release.yml index 59e2d132..a681da3b 100644 --- a/.github/workflows/release.yml +++ b/.github/workflows/release.yml @@ -67,6 +67,18 @@ jobs: resolve: name: Resolve release tag + SHA runs-on: ubuntu-latest + # Only run the package-release pipeline for version-shaped tags. The + # `release: published` event also fires for non-package releases that + # merely HOST assets — e.g. the `embed-v1` embedder-weights release + # (see packages/embedder/src/model-pins.ts) — which must NOT build, + # sign, or npm-publish anything. 
release-please tags are `root-v*`, + # `cli-v*`, or a bare `v*`; `workflow_call`/`workflow_dispatch` pass an + # explicit tag input and are always allowed. + if: >- + github.event_name != 'release' + || startsWith(github.event.release.tag_name, 'root-v') + || startsWith(github.event.release.tag_name, 'cli-v') + || startsWith(github.event.release.tag_name, 'v') outputs: tag: ${{ steps.t.outputs.tag }} sha: ${{ steps.t.outputs.sha }}