feat(embedder)!: swap embedding model gte-modernbert-base → F2LLM-v2-80M (320-dim) by theagenticguy · Pull Request #252 · theagenticguy/opencodehub

theagenticguy · 2026-06-26T01:43:51Z

Summary

Swaps the local ONNX embedder from gte-modernbert-base (768-dim, CLS pooling, JS-side L2 norm) to codefuse-ai/F2LLM-v2-80M (320-dim, last-token pooling + L2 norm baked into the ONNX graph, Apache-2.0). Adds a query/document asymmetry: queries get F2LLM's Instruct: prefix; documents embed raw.
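For readers unfamiliar with the pooling change, the sketch below shows conceptually what "last-token pooling + L2 norm" computes. It is illustrative only: in this PR the work happens inside the ONNX graph, and the function names `lastTokenPool` and `l2Normalize` are hypothetical, not identifiers from the repository.

```typescript
// Illustrative sketch only: the real pooling and normalization run inside the ONNX graph.

/** Pick the hidden state of the last non-padding token (mask value 1). */
function lastTokenPool(hidden: number[][], attentionMask: number[]): number[] {
  const last = attentionMask.lastIndexOf(1);
  const row = last >= 0 ? hidden[last] : undefined;
  if (row === undefined) throw new Error("no unmasked token to pool");
  return row.slice();
}

/** Scale a vector to unit L2 length; a zero vector is returned unchanged. */
function l2Normalize(v: number[]): number[] {
  const norm = Math.sqrt(v.reduce((acc, x) => acc + x * x, 0));
  return norm === 0 ? v.slice() : v.map((x) => x / norm);
}

// Toy input: three token positions, the third is padding.
const hidden = [[1, 0], [3, 4], [9, 9]];
const mask = [1, 1, 0];
const embedding = l2Normalize(lastTokenPool(hidden, mask));
```

CLS pooling, by contrast, would always read position 0, which is why the two pooling strategies are not interchangeable between models.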

Warning

BREAKING — requires re-index. 320-dim query vectors can't be compared against stored 768-dim vectors. Existing indexes must be rebuilt with codehub analyze --embeddings. The embedder-fingerprint guard (ADR 0014) already refuses queries against a stale store, and the analyze path now suppresses the content-hash cache on a model-id change so a swap forces a full re-embed (never a mixed-dimension store).
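A minimal sketch of the kind of check a fingerprint guard performs, assuming the store records a model id and a dimension. The `EmbedderStamp` shape, the function name, and the `gte-modernbert-base/fp32` stamp value are assumptions made for illustration; only the `f2llm-v2-80m/fp32` stamp appears in this PR.

```typescript
// Hypothetical shape: the real guard is defined by ADR 0014, not by this sketch.
interface EmbedderStamp {
  modelId: string;
  dim: number;
}

/** Refuse to query a store that was written by a different embedder. */
function assertStoreMatchesEmbedder(stored: EmbedderStamp, active: EmbedderStamp): void {
  if (stored.modelId !== active.modelId || stored.dim !== active.dim) {
    throw new Error(
      `stale embedding store (${stored.modelId}, ${stored.dim}-dim) does not match ` +
        `active embedder (${active.modelId}, ${active.dim}-dim); ` +
        "rebuild with `codehub analyze --embeddings`",
    );
  }
}

const active: EmbedderStamp = { modelId: "f2llm-v2-80m/fp32", dim: 320 };
// Assumed stamp for the old model, for illustration only.
const stale: EmbedderStamp = { modelId: "gte-modernbert-base/fp32", dim: 768 };
```

Calling `assertStoreMatchesEmbedder(stale, active)` throws, while passing `active` for both arguments returns normally.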

What changed

Embedder core (packages/embedder)

onnx-embedder.ts requests the in-graph embedding output (shape [B,320]) instead of last_hidden_state; clsPool and l2NormalizeInPlace are deleted because the graph now pools and normalizes. Padding uses the Qwen2 pad id (<|endoftext|> = 151643).
New Embedder.embedQuery() + query-prefix.ts apply Instruct: {instruction}\nQuery: {q} to queries only. Wired at the search/hybrid.ts query seam; documents stay raw.
model-pins.ts: GTE_MODERNBERT_BASE_PINS → F2LLM_V2_80M_PINS. Weights are a custom ONNX export (not on HF) hosted as GitHub release embed-v1, SHA256-pinned. 3-file manifest (model + tokenizer.json + tokenizer_config.json); config.json/special_tokens_map.json dropped.
Package bumped 0.1.3 → 0.2.0.
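The query/document asymmetry described above can be sketched as follows. The template string comes from this PR; the function names and the instruction text are placeholders, since the actual instruction lives in query-prefix.ts and is not quoted here.

```typescript
/** Wrap a query in the Instruct/Query template; queries only. */
function formatQuery(instruction: string, query: string): string {
  return `Instruct: ${instruction}\nQuery: ${query}`;
}

/** Documents are embedded raw, with no prefix. */
function formatDocument(text: string): string {
  return text;
}

// Placeholder instruction for illustration; not the string shipped in query-prefix.ts.
const PLACEHOLDER_INSTRUCTION = "Given a code search query, retrieve relevant code";

const queryInput = formatQuery(PLACEHOLDER_INSTRUCTION, "where is the hash cache opened?");
const documentInput = formatDocument("export function openCache() {}");
```

Keeping the prefix in one function at the query seam means indexed documents never need re-embedding if only the instruction wording changes, though ranking quality would still need re-checking.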

Dimension 768 → 320 across embedder, search (NullEmbedder/DEFAULT_EMBEDDER_DIM), storage (DEFAULT_DIM), ingestion (embedder-pool), and HTTP/SageMaker defaults.

Migration safety — cli/commands/analyze.ts openEmbeddingHashCacheAdapter(repoPath, activeModelId) suppresses the content-hash cache when the prior store's embedderModelId differs.
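A hedged sketch of the cache-suppression decision: reuse cached content hashes only when the prior store was written by the same model. The helper names and the `Chunk` shape are hypothetical and do not reproduce openEmbeddingHashCacheAdapter; treating a missing prior model id as "do not reuse" is also an assumption.

```typescript
// Hypothetical types and helpers, for illustration only.
interface Chunk {
  id: string;
  contentHash: string;
}

/** The cache is only trustworthy if the same model produced the stored vectors. */
function shouldReuseHashCache(priorModelId: string | undefined, activeModelId: string): boolean {
  return priorModelId === activeModelId;
}

/** With the cache suppressed, every chunk is re-embedded regardless of its hash. */
function chunksToEmbed(chunks: Chunk[], cachedHashes: Set<string>, reuseCache: boolean): Chunk[] {
  if (!reuseCache) return chunks.slice();
  return chunks.filter((c) => !cachedHashes.has(c.contentHash));
}

const chunks: Chunk[] = [
  { id: "a", contentHash: "h1" },
  { id: "b", contentHash: "h2" },
];
const cached = new Set(["h1"]);

const afterSwap = chunksToEmbed(chunks, cached, shouldReuseHashCache("old-model", "new-model"));
const sameModel = chunksToEmbed(chunks, cached, shouldReuseHashCache("new-model", "new-model"));
```

After a model swap both chunks are re-embedded, which is what keeps 768-dim rows from surviving next to 320-dim ones.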

Docs / housekeeping — swept README, SPECS, package docs, CHANGELOGs, skills to F2LLM/320-dim. Collapsed the stale storage ADR chain (0001/0011/0013-m7/0016 now all point forward to ADR 0019; 0016 no longer falsely reads "Accepted") and fixed the README's pre-ADR-0019 storage narrative (it still described native lbug+DuckDB bindings and a platform-prebuild matrix that no longer exist). Fixed the lefthook verdict guard to check store.sqlite (was the removed graph.lbug).

Verification

pnpm -r build + pnpm -r test green — ~2,400 tests, 0 failures; biome + banned-strings pass; pre-push hooks (typecheck + test + verdict) green.
Tokenizer parity: JS @huggingface/tokenizers produces byte-identical token IDs to the Python AutoTokenizer the POC used (the one risk the POC never tested — now closed).
Ranking parity: the production OnnxEmbedder reproduces the POC's 4/4 top-1 ranking and byte-deterministic output (satisfies the graphHash gate).
End-to-end: analyze --embeddings writes 320-dim rows + f2llm-v2-80m/fp32 stamp; hybrid query ranks correctly; migration guard confirmed firing on a stale gte-modernbert stamp.
Release verified: embed-v1 assets uploaded; pinned URLs resolve and checksums match; a real codehub setup --embeddings install from the live release passes streaming SHA256 verification.
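The streaming SHA256 check mentioned in the last item can be pictured as hashing the download chunk by chunk rather than buffering the whole file. This is a minimal sketch using Node's built-in `node:crypto`; the function names are hypothetical, and a real download would iterate an async stream with `for await` instead of a synchronous iterable.

```typescript
import { createHash } from "node:crypto";

/** Feed chunks into the hash one at a time so the full file never sits in memory. */
function sha256OfChunks(chunks: Iterable<Uint8Array>): string {
  const hash = createHash("sha256");
  for (const chunk of chunks) hash.update(chunk);
  return hash.digest("hex");
}

/** Throw if the computed digest does not match the pinned one. */
function verifyPinned(chunks: Iterable<Uint8Array>, expectedHex: string): void {
  const actual = sha256OfChunks(chunks);
  if (actual !== expectedHex.toLowerCase()) {
    throw new Error(`checksum mismatch: expected ${expectedHex}, got ${actual}`);
  }
}

// Toy "download" split across two chunks; together they spell "abc".
const encoder = new TextEncoder();
const digest = sha256OfChunks([encoder.encode("a"), encoder.encode("bc")]);
```

Because the digest depends only on the concatenated bytes, chunk boundaries chosen by the network layer do not affect the result.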

Attribution

F2LLM — Zhang, Liao, Yu, Di, Wang (arXiv:2603.19223), Apache-2.0.

…80M (320-dim)

BREAKING CHANGE: the local ONNX embedder is now codefuse-ai/F2LLM-v2-80M (320-dim, was gte-modernbert-base 768-dim). Existing indexes MUST be rebuilt with `codehub analyze --embeddings`: 320-dim query vectors cannot be compared against stored 768-dim vectors. The embedder-fingerprint guard (ADR 0014) refuses queries against a stale store until re-analyze.

What changed:

- onnx-embedder.ts requests the in-graph `embedding` output (shape [B,320], last-token pooling + L2 norm baked into the graph) instead of pooling / normalizing in JS; clsPool + l2NormalizeInPlace are removed. Qwen2 pad id.
- New Embedder.embedQuery() applies F2LLM's query-only `Instruct:` prefix; documents embed raw. Wired at the search/hybrid.ts query seam. New query-prefix.ts holds the instruction string.
- Dimension parameterized 768→320 across embedder, search (NullEmbedder), storage (DEFAULT_DIM), ingestion pool, and HTTP/SageMaker defaults.
- model-pins.ts: GTE_MODERNBERT_BASE_PINS → F2LLM_V2_80M_PINS; weights are a custom ONNX export hosted as the GitHub release asset `embed-v1`, SHA256-pinned. 3-file manifest (model + tokenizer.json + tokenizer_config.json).
- Migration guard: analyze suppresses the content-hash cache on a model-id change so a swap forces a full re-embed (no mixed-dimension store).
- Docs, CHANGELOGs, skills swept to F2LLM/320-dim. Collapsed the stale storage ADR chain (0001/0011/0013/0016 → 0019) and fixed the README's pre-ADR-0019 storage narrative. Fixed lefthook verdict guard to check store.sqlite (was the removed graph.lbug).

Verified: pnpm -r build + test green (~2400 tests); tokenizer parity with Python AutoTokenizer (byte-identical IDs); production embedder reproduces the POC 4/4 top-1 ranking + byte-deterministic output; end-to-end analyze writes 320-dim rows; migration guard confirmed firing on a stale stamp.

The `release: published` event fires for ANY GitHub release, including non-package releases that merely host assets, e.g. the `embed-v1` embedder-weights release this PR introduces (model-pins.ts pins its URLs). Creating that release triggered the full build → sign → npm-publish pipeline, which would have published every package to npm (OCH_NPM_PUBLISH_ENABLED is true). It was cancelled in time and built from main (not this branch), so nothing leaked, but the trigger must be filtered.

Gate the `resolve` job (root of the chain; everything else `needs` it under a success() gate) to version-shaped tags only: `root-v*`, `cli-v*`, or bare `v*` (the release-please conventions). `workflow_call` / `workflow_dispatch` pass an explicit tag input and remain unaffected.

A weights tag like `embed-v1` now skips the pipeline entirely.
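The tag filter described above amounts to a prefix check. The sketch below expresses that predicate in TypeScript purely for illustration; the actual gate is a condition on the `resolve` job in the workflow, and the names `RELEASE_TAG_PREFIXES` and `isPackageReleaseTag` are hypothetical.

```typescript
// Prefixes corresponding to the globs root-v*, cli-v*, and bare v*.
const RELEASE_TAG_PREFIXES = ["root-v", "cli-v", "v"] as const;

/** True when a tag looks like a package release and should run the publish pipeline. */
function isPackageReleaseTag(tag: string): boolean {
  return RELEASE_TAG_PREFIXES.some((prefix) => tag.startsWith(prefix));
}

const tags = ["root-v0.10.0", "cli-v0.10.0", "v1.2.3", "embed-v1"];
const publishable = tags.filter(isPackageReleaseTag);
```

Here `embed-v1` is filtered out because it starts with neither `root-v`, `cli-v`, nor `v`, so an asset-only release no longer reaches the build, sign, and publish jobs.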

🤖 Automated release via release-please

---

<details><summary>root: 0.10.0</summary>

## [0.10.0](root-v0.9.2...root-v0.10.0) (2026-06-26)

### ⚠ BREAKING CHANGES

* **embedder:** existing indexes must be rebuilt with `codehub analyze --embeddings`; the embedder-fingerprint guard refuses queries against a stale-dim store, and the analyze path suppresses the content-hash cache on a model-id change to prevent a mixed-dimension store.

### Features

* **embedder:** swap embedding model gte-modernbert-base → F2LLM-v2-80M (320-dim) ([#252](#252)) ([789d0da](789d0da))

</details>

<details><summary>cli: 0.10.0</summary>

## [0.10.0](cli-v0.9.2...cli-v0.10.0) (2026-06-26)

### ⚠ BREAKING CHANGES

* **embedder:** existing indexes must be rebuilt with `codehub analyze --embeddings`; the embedder-fingerprint guard refuses queries against a stale-dim store, and the analyze path suppresses the content-hash cache on a model-id change to prevent a mixed-dimension store.

### Features

* **embedder:** swap embedding model gte-modernbert-base → F2LLM-v2-80M (320-dim) ([#252](#252)) ([789d0da](789d0da))

</details>

---

This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

theagenticguy added 2 commits June 26, 2026 01:53

theagenticguy force-pushed the feat/f2llm-embedder-swap branch from e520fbb to cacf1bd Compare June 26, 2026 01:54

theagenticguy merged commit 789d0da into main Jun 26, 2026
38 checks passed

theagenticguy deleted the feat/f2llm-embedder-swap branch June 26, 2026 02:03

github-actions Bot mentioned this pull request Jun 26, 2026

chore: release main #253

Merged
