feat(embedder)!: swap embedding model gte-modernbert-base → F2LLM-v2-80M (320-dim) #252

Merged: theagenticguy merged 2 commits into main from feat/f2llm-embedder-swap on Jun 26, 2026

@theagenticguy (Owner)

Summary

Swaps the local ONNX embedder from gte-modernbert-base (768-dim, CLS pooling, JS-side L2 norm) to codefuse-ai/F2LLM-v2-80M (320-dim, last-token pooling + L2 norm baked into the ONNX graph, Apache-2.0). Adds a query/document asymmetry: queries get F2LLM's Instruct: prefix; documents embed raw.
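For readers unfamiliar with the pooling change, here is a minimal sketch of what "last-token pooling + L2 norm" computes for one sequence. It is illustrative only: in this PR the work happens inside the ONNX graph, and the helper names `lastTokenPool` and `l2Normalize` are hypothetical, not part of the package. The sketch also assumes right padding.

```typescript
// Illustrative only: the PR moves this computation into the ONNX graph.
// Helper names are hypothetical and not the project's API.

/** Pick the hidden state of the last non-padding token of one sequence. */
function lastTokenPool(hidden: number[][], attentionMask: number[]): number[] {
  // Assumption: right padding, so the last mask value of 1 marks the last real token.
  const last = attentionMask.lastIndexOf(1);
  if (last < 0) throw new Error("sequence has no unmasked tokens");
  return hidden[last].slice();
}

/** Scale a vector to unit L2 length (a zero vector is returned unchanged). */
function l2Normalize(v: number[]): number[] {
  const norm = Math.sqrt(v.reduce((sum, x) => sum + x * x, 0));
  return norm === 0 ? v.slice() : v.map((x) => x / norm);
}

// Toy input: 3 positions, 2 dimensions; the third position is padding.
const toyHidden = [
  [1, 0],
  [3, 4],
  [9, 9],
];
const toyMask = [1, 1, 0];

const toyEmbedding = l2Normalize(lastTokenPool(toyHidden, toyMask));
console.log(toyEmbedding);
```

Because the output is already unit length, cosine similarity between two such vectors reduces to a dot product.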

Warning

BREAKING — requires re-index. 320-dim query vectors can't be compared against stored 768-dim vectors. Existing indexes must be rebuilt with codehub analyze --embeddings. The embedder-fingerprint guard (ADR 0014) already refuses queries against a stale store, and the analyze path now suppresses the content-hash cache on a model-id change so a swap forces a full re-embed (never a mixed-dimension store).
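The guard behaviour described above could look roughly like the following. This is a sketch under assumptions: the `EmbedderFingerprint` shape and `assertCompatible` name are hypothetical, the stale model-id string is made up for illustration, and the actual ADR 0014 implementation is not shown in this PR.

```typescript
// Hypothetical fingerprint shape; the real ADR 0014 stamp may carry more fields.
interface EmbedderFingerprint {
  modelId: string;
  dim: number;
}

/** Refuse to query a store whose stamp does not match the active embedder. */
function assertCompatible(
  store: EmbedderFingerprint,
  active: EmbedderFingerprint,
): void {
  if (store.modelId !== active.modelId || store.dim !== active.dim) {
    throw new Error(
      `stale embedding store (${store.modelId}, ${store.dim}-dim) does not match ` +
        `active embedder (${active.modelId}, ${active.dim}-dim); ` +
        "rebuild with `codehub analyze --embeddings`",
    );
  }
}

const activeStamp: EmbedderFingerprint = { modelId: "f2llm-v2-80m/fp32", dim: 320 };
// Illustrative stale stamp; the exact legacy id string is an assumption.
const staleStamp: EmbedderFingerprint = { modelId: "gte-modernbert-base/fp32", dim: 768 };

assertCompatible(activeStamp, activeStamp); // passes silently
try {
  assertCompatible(staleStamp, activeStamp);
} catch (err) {
  console.log((err as Error).message);
}
```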

What changed

Embedder core (packages/embedder)

  • onnx-embedder.ts requests the in-graph embedding output (shape [B,320]) instead of last_hidden_state; clsPool + l2NormalizeInPlace are deleted (the graph pools + normalizes). Padding uses the Qwen2 pad id (<|endoftext|> = 151643).
  • New Embedder.embedQuery() + query-prefix.ts apply Instruct: {instruction}\nQuery: {q} to queries only. Wired at the search/hybrid.ts query seam; documents stay raw.
  • model-pins.ts: GTE_MODERNBERT_BASE_PINS → F2LLM_V2_80M_PINS. Weights are a custom ONNX export (not on HF) hosted as GitHub release embed-v1, SHA256-pinned. 3-file manifest (model + tokenizer.json + tokenizer_config.json); config.json/special_tokens_map.json dropped.
  • Package bumped 0.1.3 → 0.2.0.
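The query/document asymmetry in the list above can be sketched as follows. The template `Instruct: {instruction}\nQuery: {q}` comes from this PR; the instruction text and the function names `formatQuery` / `formatDocument` are placeholders, since the actual contents of query-prefix.ts are not shown here.

```typescript
// Placeholder instruction; the real string lives in query-prefix.ts and is not shown in this PR.
const EXAMPLE_INSTRUCTION = "Given a code search query, retrieve relevant code";

/** Queries get the Instruct: prefix before tokenization. */
function formatQuery(q: string, instruction: string = EXAMPLE_INSTRUCTION): string {
  return `Instruct: ${instruction}\nQuery: ${q}`;
}

/** Documents are embedded raw, with no prefix. */
function formatDocument(text: string): string {
  return text;
}

const exampleQuery = formatQuery("where is the embedder fingerprint checked?");
const exampleDoc = formatDocument("export function checkFingerprint() {}");
console.log(exampleQuery);
console.log(exampleDoc);
```

Keeping the prefix in one place means the document side never sees it, so stored vectors do not depend on the instruction wording.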

Dimension 768 → 320 across embedder, search (NullEmbedder/DEFAULT_EMBEDDER_DIM), storage (DEFAULT_DIM), ingestion (embedder-pool), and HTTP/SageMaker defaults.

Migration safety: cli/commands/analyze.ts openEmbeddingHashCacheAdapter(repoPath, activeModelId) suppresses the content-hash cache when the prior store's embedderModelId differs.
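A minimal sketch of the cache-suppression idea, assuming the cache maps a content hash to a previously stored vector. The `HashCache` interface and `selectHashCache` function are hypothetical and are not the signature of openEmbeddingHashCacheAdapter.

```typescript
// Hypothetical cache shape: content hash -> previously stored embedding.
interface HashCache {
  get(contentHash: string): Float32Array | undefined;
}

// A cache that never hits, so every chunk is re-embedded.
const EMPTY_CACHE: HashCache = { get: () => undefined };

/** Reuse the prior cache only when the store was built by the same model. */
function selectHashCache(
  prior: HashCache,
  priorModelId: string | undefined,
  activeModelId: string,
): HashCache {
  return priorModelId === activeModelId ? prior : EMPTY_CACHE;
}

const priorCache: HashCache = {
  get: (h) => (h === "abc123" ? new Float32Array(768) : undefined),
};

// Model id changed: the lookup misses, forcing a full re-embed at the new dimension.
// The legacy id string below is illustrative only.
const afterSwap = selectHashCache(priorCache, "gte-modernbert-base/fp32", "f2llm-v2-80m/fp32");
console.log(afterSwap.get("abc123")); // undefined
```

Returning an always-miss cache rather than skipping the lookup keeps the calling code unchanged while guaranteeing no 768-dim row is carried over.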

Docs / housekeeping — swept README, SPECS, package docs, CHANGELOGs, skills to F2LLM/320-dim. Collapsed the stale storage ADR chain (0001/0011/0013-m7/0016 now all point forward to ADR 0019; 0016 no longer falsely reads "Accepted") and fixed the README's pre-ADR-0019 storage narrative (it still described native lbug+DuckDB bindings and a platform-prebuild matrix that no longer exist). Fixed the lefthook verdict guard to check store.sqlite (was the removed graph.lbug).

Verification

  • pnpm -r build + pnpm -r test green — ~2,400 tests, 0 failures; biome + banned-strings pass; pre-push hooks (typecheck + test + verdict) green.
  • Tokenizer parity: JS @huggingface/tokenizers produces byte-identical token IDs to the Python AutoTokenizer the POC used (the one risk the POC never tested — now closed).
  • Ranking parity: the production OnnxEmbedder reproduces the POC's 4/4 top-1 ranking and byte-deterministic output (satisfies the graphHash gate).
  • End-to-end: analyze --embeddings writes 320-dim rows + f2llm-v2-80m/fp32 stamp; hybrid query ranks correctly; migration guard confirmed firing on a stale gte-modernbert stamp.
  • Release verified: embed-v1 assets uploaded; pinned URLs resolve and checksums match; a real codehub setup --embeddings install from the live release passes streaming SHA256 verification.
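The SHA256 verification mentioned in the last item can be illustrated with Node's built-in crypto module, hashing chunk by chunk so the whole file never has to sit in memory. The `AssetPin` shape and function names are hypothetical; the real pin layout in model-pins.ts and the download code are not shown in this PR.

```typescript
import { createHash } from "node:crypto";

// Hypothetical pin shape; the real model-pins.ts layout may differ.
interface AssetPin {
  name: string;
  sha256: string; // lowercase hex digest
}

/** Hash data incrementally, one chunk at a time. */
function sha256OfChunks(chunks: Iterable<Uint8Array>): string {
  const hash = createHash("sha256");
  for (const chunk of chunks) hash.update(chunk);
  return hash.digest("hex");
}

/** Throw if the streamed content does not match the pinned digest. */
function verifyPin(pin: AssetPin, chunks: Iterable<Uint8Array>): void {
  const actual = sha256OfChunks(chunks);
  if (actual !== pin.sha256.toLowerCase()) {
    throw new Error(`checksum mismatch for ${pin.name}: got ${actual}`);
  }
}

// Demo with the well-known SHA-256 digest of the ASCII string "abc".
const encoder = new TextEncoder();
const demoPin: AssetPin = {
  name: "demo.bin",
  sha256: "ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad",
};
verifyPin(demoPin, [encoder.encode("a"), encoder.encode("bc")]);
console.log("demo pin verified");
```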

Attribution

F2LLM — Zhang, Liao, Yu, Di, Wang (arXiv:2603.19223), Apache-2.0.

…80M (320-dim)

BREAKING CHANGE: the local ONNX embedder is now codefuse-ai/F2LLM-v2-80M
(320-dim, was gte-modernbert-base 768-dim). Existing indexes MUST be
rebuilt with `codehub analyze --embeddings` — 320-dim query vectors cannot
be compared against stored 768-dim vectors. The embedder-fingerprint guard
(ADR 0014) refuses queries against a stale store until re-analyze.

What changed:
- onnx-embedder.ts requests the in-graph `embedding` output (shape [B,320],
  last-token pooling + L2 norm baked into the graph) instead of pooling /
  normalizing in JS — clsPool + l2NormalizeInPlace are removed. Qwen2 pad id.
- New Embedder.embedQuery() applies F2LLM's query-only `Instruct:` prefix;
  documents embed raw. Wired at the search/hybrid.ts query seam. New
  query-prefix.ts holds the instruction string.
- Dimension parameterized 768→320 across embedder, search (NullEmbedder),
  storage (DEFAULT_DIM), ingestion pool, and HTTP/SageMaker defaults.
- model-pins.ts: GTE_MODERNBERT_BASE_PINS → F2LLM_V2_80M_PINS; weights are a
  custom ONNX export hosted as the GitHub release asset `embed-v1`,
  SHA256-pinned. 3-file manifest (model + tokenizer.json + tokenizer_config.json).
- Migration guard: analyze suppresses the content-hash cache on a model-id
  change so a swap forces a full re-embed (no mixed-dimension store).
- Docs, CHANGELOGs, skills swept to F2LLM/320-dim. Collapsed the stale
  storage ADR chain (0001/0011/0013/0016 → 0019) and fixed the README's
  pre-ADR-0019 storage narrative. Fixed lefthook verdict guard to check
  store.sqlite (was the removed graph.lbug).

Verified: pnpm -r build + test green (~2400 tests); tokenizer parity with
Python AutoTokenizer (byte-identical IDs); production embedder reproduces the
POC 4/4 top-1 ranking + byte-deterministic output; end-to-end analyze writes
320-dim rows; migration guard confirmed firing on a stale stamp.

The `release: published` event fires for ANY GitHub release, including
non-package releases that merely host assets — e.g. the `embed-v1`
embedder-weights release this PR introduces (model-pins.ts pins its
URLs). Creating that release triggered the full build → sign →
npm-publish pipeline, which would have published every package to npm
(OCH_NPM_PUBLISH_ENABLED is true). It was cancelled in time and built
from main (not this branch), so nothing leaked, but the trigger must be
filtered.

Gate the `resolve` job (root of the chain; everything else `needs` it
under a success() gate) to version-shaped tags only: `root-v*`, `cli-v*`,
or bare `v*` (the release-please conventions). `workflow_call` /
`workflow_dispatch` pass an explicit tag input and remain unaffected. A
weights tag like `embed-v1` now skips the pipeline entirely.
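The tag filter described above amounts to a prefix match. The actual gate is a condition on the workflow's `resolve` job and is not reproduced here; the function below is only a hypothetical TypeScript rendering of the same three globs (`root-v*`, `cli-v*`, `v*`).

```typescript
/**
 * Hypothetical helper mirroring the globs root-v*, cli-v*, and bare v*.
 * Like the globs, it checks only the prefix, not full semver syntax.
 */
function isVersionShapedTag(tag: string): boolean {
  return /^(root-|cli-)?v/.test(tag);
}

for (const tag of ["root-v0.10.0", "cli-v0.10.0", "v1.2.3", "embed-v1"]) {
  console.log(tag, isVersionShapedTag(tag) ? "runs pipeline" : "skipped");
}
```

An asset-only release such as `embed-v1` fails the prefix check, so the publish chain never starts.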
@theagenticguy theagenticguy force-pushed the feat/f2llm-embedder-swap branch from e520fbb to cacf1bd Compare June 26, 2026 01:54
@theagenticguy theagenticguy merged commit 789d0da into main Jun 26, 2026
38 checks passed
@theagenticguy theagenticguy deleted the feat/f2llm-embedder-swap branch June 26, 2026 02:03
@github-actions github-actions Bot mentioned this pull request Jun 26, 2026
theagenticguy pushed a commit that referenced this pull request Jun 26, 2026
🤖 Automated release via release-please
---


<details><summary>root: 0.10.0</summary>

## [0.10.0](root-v0.9.2...root-v0.10.0) (2026-06-26)


### ⚠ BREAKING CHANGES

* **embedder:** existing indexes must be rebuilt with `codehub analyze
--embeddings`; the embedder-fingerprint guard refuses queries against a
stale-dim store, and the analyze path suppresses the content-hash cache
on a model-id change to prevent a mixed-dimension store.

### Features

* **embedder:** swap embedding model gte-modernbert-base → F2LLM-v2-80M
(320-dim)
([#252](#252))
([789d0da](789d0da))
</details>

<details><summary>cli: 0.10.0</summary>

## [0.10.0](cli-v0.9.2...cli-v0.10.0) (2026-06-26)


### ⚠ BREAKING CHANGES

* **embedder:** existing indexes must be rebuilt with `codehub analyze
--embeddings`; the embedder-fingerprint guard refuses queries against a
stale-dim store, and the analyze path suppresses the content-hash cache
on a model-id change to prevent a mixed-dimension store.

### Features

* **embedder:** swap embedding model gte-modernbert-base → F2LLM-v2-80M
(320-dim)
([#252](#252))
([789d0da](789d0da))
</details>

---
This PR was generated with [Release
Please](https://github.com/googleapis/release-please). See
[documentation](https://github.com/googleapis/release-please#release-please).

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>