feat(embedder)!: swap embedding model gte-modernbert-base → F2LLM-v2-80M (320-dim) #252
Merged
…80M (320-dim)

BREAKING CHANGE: the local ONNX embedder is now codefuse-ai/F2LLM-v2-80M (320-dim, was gte-modernbert-base 768-dim). Existing indexes MUST be rebuilt with `codehub analyze --embeddings`: 320-dim query vectors cannot be compared against stored 768-dim vectors. The embedder-fingerprint guard (ADR 0014) refuses queries against a stale store until re-analyze.

What changed:
- onnx-embedder.ts requests the in-graph `embedding` output (shape [B,320], last-token pooling + L2 norm baked into the graph) instead of pooling / normalizing in JS; clsPool + l2NormalizeInPlace are removed. Qwen2 pad id.
- New Embedder.embedQuery() applies F2LLM's query-only `Instruct:` prefix; documents embed raw. Wired at the search/hybrid.ts query seam. New query-prefix.ts holds the instruction string.
- Dimension parameterized 768→320 across embedder, search (NullEmbedder), storage (DEFAULT_DIM), ingestion pool, and HTTP/SageMaker defaults.
- model-pins.ts: GTE_MODERNBERT_BASE_PINS → F2LLM_V2_80M_PINS; weights are a custom ONNX export hosted as the GitHub release asset `embed-v1`, SHA256-pinned. 3-file manifest (model + tokenizer.json + tokenizer_config.json).
- Migration guard: analyze suppresses the content-hash cache on a model-id change so a swap forces a full re-embed (no mixed-dimension store).
- Docs, CHANGELOGs, skills swept to F2LLM/320-dim. Collapsed the stale storage ADR chain (0001/0011/0013/0016 → 0019) and fixed the README's pre-ADR-0019 storage narrative. Fixed lefthook verdict guard to check store.sqlite (was the removed graph.lbug).

Verified: pnpm -r build + test green (~2400 tests); tokenizer parity with Python AutoTokenizer (byte-identical IDs); production embedder reproduces the POC 4/4 top-1 ranking + byte-deterministic output; end-to-end analyze writes 320-dim rows; migration guard confirmed firing on a stale stamp.
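The query/document asymmetry can be illustrated with a minimal sketch. It assumes the `Instruct: {instruction}\nQuery: {q}` template given in the PR description; `DEFAULT_INSTRUCTION`, `formatQuery`, `SketchEmbedder`, and `withQueryPrefix` are hypothetical names for illustration, and the instruction wording is a placeholder, not the string held in query-prefix.ts.

```typescript
// Hypothetical names throughout; this is not the code in query-prefix.ts.
// The instruction wording below is a placeholder assumption.
const DEFAULT_INSTRUCTION = "Given a search query, retrieve relevant code";

/** Wraps a query in the `Instruct: ...\nQuery: ...` template. */
function formatQuery(query: string, instruction: string = DEFAULT_INSTRUCTION): string {
  return `Instruct: ${instruction}\nQuery: ${query}`;
}

interface SketchEmbedder {
  /** Documents are embedded raw, with no prefix. */
  embed(texts: string[]): Promise<Float32Array[]>;
  /** Queries are prefixed before embedding. */
  embedQuery(query: string): Promise<Float32Array>;
}

/** Builds an embedder whose query path adds the prefix and whose document path does not. */
function withQueryPrefix(
  embed: (texts: string[]) => Promise<Float32Array[]>,
): SketchEmbedder {
  return {
    embed,
    async embedQuery(query: string): Promise<Float32Array> {
      const [vec] = await embed([formatQuery(query)]);
      if (vec === undefined) {
        throw new Error("embed() returned no vector for the query");
      }
      return vec;
    },
  };
}
```

Keeping the prefix on a separate `embedQuery` path means ingestion code that calls `embed` cannot accidentally prefix documents.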
The `release: published` event fires for ANY GitHub release, including non-package releases that merely host assets, e.g. the `embed-v1` embedder-weights release this PR introduces (model-pins.ts pins its URLs). Creating that release triggered the full build → sign → npm-publish pipeline, which would have published every package to npm (OCH_NPM_PUBLISH_ENABLED is true). It was cancelled in time and built from main (not this branch), so nothing leaked, but the trigger must be filtered.

Gate the `resolve` job (root of the chain; everything else `needs` it under a success() gate) to version-shaped tags only: `root-v*`, `cli-v*`, or bare `v*` (the release-please conventions). `workflow_call` / `workflow_dispatch` pass an explicit tag input and remain unaffected. A weights tag like `embed-v1` now skips the pipeline entirely.
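The tag filter itself lives in the workflow definition, which is not shown here. As a rough illustration of the matching rule only, the three glob patterns can be read as prefix checks; `VERSION_TAG_PREFIXES` and `isVersionShapedTag` are hypothetical names, not part of the repository.

```typescript
// Prefixes taken from the commit message above; the names are illustrative.
const VERSION_TAG_PREFIXES = ["root-v", "cli-v", "v"] as const;

/** Returns true when a release tag matches `root-v*`, `cli-v*`, or bare `v*`. */
function isVersionShapedTag(tag: string): boolean {
  return VERSION_TAG_PREFIXES.some((prefix) => tag.startsWith(prefix));
}

// A package release passes the gate; the weights release does not.
console.log(isVersionShapedTag("cli-v0.10.0")); // true
console.log(isVersionShapedTag("embed-v1")); // false
```

Note that a bare `v*` glob also matches any other tag that happens to start with `v`; a stricter variant could additionally require a digit after the `v`.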
theagenticguy pushed a commit that referenced this pull request on Jun 26, 2026
🤖 Automated release via release-please

---

<details><summary>root: 0.10.0</summary>

## [0.10.0](root-v0.9.2...root-v0.10.0) (2026-06-26)

### ⚠ BREAKING CHANGES

* **embedder:** existing indexes must be rebuilt with `codehub analyze --embeddings`; the embedder-fingerprint guard refuses queries against a stale-dim store, and the analyze path suppresses the content-hash cache on a model-id change to prevent a mixed-dimension store.

### Features

* **embedder:** swap embedding model gte-modernbert-base → F2LLM-v2-80M (320-dim) ([#252](#252)) ([789d0da](789d0da))

</details>

<details><summary>cli: 0.10.0</summary>

## [0.10.0](cli-v0.9.2...cli-v0.10.0) (2026-06-26)

### ⚠ BREAKING CHANGES

* **embedder:** existing indexes must be rebuilt with `codehub analyze --embeddings`; the embedder-fingerprint guard refuses queries against a stale-dim store, and the analyze path suppresses the content-hash cache on a model-id change to prevent a mixed-dimension store.

### Features

* **embedder:** swap embedding model gte-modernbert-base → F2LLM-v2-80M (320-dim) ([#252](#252)) ([789d0da](789d0da))

</details>

---

This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
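The two safeguards named in the breaking-change note can be sketched as follows. This is an assumption-laden illustration, not the repository's code: `EmbedderStamp`, `shouldSuppressHashCache`, and `assertStoreMatchesEmbedder` are hypothetical names, and the real guard (ADR 0014) may compare more than a model id and a dimension.

```typescript
// Hypothetical shape of the stamp a store records about its embedder.
interface EmbedderStamp {
  modelId: string;
  dim: number;
}

/**
 * Analyze path: ignore the content-hash cache when the prior store was
 * written by a different model, so every chunk is re-embedded.
 */
function shouldSuppressHashCache(
  priorModelId: string | undefined,
  activeModelId: string,
): boolean {
  return priorModelId !== undefined && priorModelId !== activeModelId;
}

/** Query path: refuse to compare vectors produced by different embedders. */
function assertStoreMatchesEmbedder(stored: EmbedderStamp, active: EmbedderStamp): void {
  if (stored.modelId !== active.modelId || stored.dim !== active.dim) {
    throw new Error(
      `Stale embedding store (${stored.modelId}, ${stored.dim}-dim); ` +
        `active embedder is ${active.modelId}, ${active.dim}-dim. ` +
        "Rebuild with: codehub analyze --embeddings",
    );
  }
}
```

Without the first check, unchanged files would keep their cached 768-dim rows while changed files get 320-dim rows, which is the mixed-dimension store the note warns about.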
Summary
Swaps the local ONNX embedder from `gte-modernbert-base` (768-dim, CLS pooling, JS-side L2 norm) to `codefuse-ai/F2LLM-v2-80M` (320-dim, last-token pooling + L2 norm baked into the ONNX graph, Apache-2.0). Adds a query/document asymmetry: queries get F2LLM's `Instruct:` prefix; documents embed raw.

Warning
BREAKING: requires re-index. 320-dim query vectors can't be compared against stored 768-dim vectors. Existing indexes must be rebuilt with `codehub analyze --embeddings`. The embedder-fingerprint guard (ADR 0014) already refuses queries against a stale store, and the analyze path now suppresses the content-hash cache on a model-id change so a swap forces a full re-embed (never a mixed-dimension store).

What changed
Embedder core (`packages/embedder`)
- `onnx-embedder.ts` requests the in-graph `embedding` output (shape `[B,320]`) instead of `last_hidden_state`; deleted `clsPool` + `l2NormalizeInPlace` (the graph pools + normalizes). Qwen2 pad id (`<|endoftext|>` = 151643).
- `Embedder.embedQuery()` + `query-prefix.ts` apply `Instruct: {instruction}\nQuery: {q}` to queries only. Wired at the `search/hybrid.ts` query seam; documents stay raw.
- `model-pins.ts`: `GTE_MODERNBERT_BASE_PINS` → `F2LLM_V2_80M_PINS`. Weights are a custom ONNX export (not on HF) hosted as GitHub release `embed-v1`, SHA256-pinned. 3-file manifest (model + tokenizer.json + tokenizer_config.json); `config.json` / `special_tokens_map.json` dropped. 0.1.3 → 0.2.0.

Dimension 768 → 320 across `embedder`, `search` (`NullEmbedder` / `DEFAULT_EMBEDDER_DIM`), `storage` (`DEFAULT_DIM`), `ingestion` (`embedder-pool`), and HTTP/SageMaker defaults.

Migration safety:
In `cli/commands/analyze.ts`, `openEmbeddingHashCacheAdapter(repoPath, activeModelId)` suppresses the content-hash cache when the prior store's `embedderModelId` differs.

Docs / housekeeping: swept README, SPECS, package docs, CHANGELOGs, skills to F2LLM/320-dim. Collapsed the stale storage ADR chain (0001/0011/0013-m7/0016 now all point forward to ADR 0019; 0016 no longer falsely reads "Accepted") and fixed the README's pre-ADR-0019 storage narrative (it still described native lbug+DuckDB bindings and a platform-prebuild matrix that no longer exist). Fixed the lefthook verdict guard to check `store.sqlite` (was the removed `graph.lbug`).

Verification
- `pnpm -r build` + `pnpm -r test` green: ~2,400 tests, 0 failures; biome + banned-strings pass; pre-push hooks (typecheck + test + verdict) green.
- `@huggingface/tokenizers` produces byte-identical token IDs to the Python `AutoTokenizer` the POC used (the one risk the POC never tested, now closed).
- `OnnxEmbedder` reproduces the POC's 4/4 top-1 ranking and byte-deterministic output (satisfies the graphHash gate).
- `analyze --embeddings` writes 320-dim rows + `f2llm-v2-80m/fp32` stamp; hybrid query ranks correctly; migration guard confirmed firing on a stale `gte-modernbert` stamp.
- `embed-v1` assets uploaded; pinned URLs resolve and checksums match; a real `codehub setup --embeddings` install from the live release passes streaming SHA256 verification.

Attribution
F2LLM: Zhang, Liao, Yu, Di, Wang (arXiv:2603.19223), Apache-2.0.