fix(linear-att): fix latent prefix-cache ref/buffer leaks#1348
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces auto-derivation for --linear_att_hash_page_size based on max_req_total_len to optimize prefix-cache hit rates for hybrid linear-attention models, along with extensive documentation and warnings against using excessively large page sizes in serving workloads. It also fixes critical bugs in the radix cache, including root reference counter drift during cache misses or trims, and memory leaks of big-page state buffers when requests are paused or aborted mid-prefill. Comprehensive unit and invariant fuzz tests are added to verify these fixes. A review comment suggests adding a defensive fallback check for args.max_req_total_len in api_start.py to prevent a potential TypeError if it is None.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| while derived * max_page_num < args.max_req_total_len: | ||
| derived *= 2 |
There was a problem hiding this comment.
To prevent potential TypeError if args.max_req_total_len is None (as it is typed as Optional[int]), it is safer to add a defensive fallback check before using it in the comparison loop.
| while derived * max_page_num < args.max_req_total_len: | |
| derived *= 2 | |
| max_len = args.max_req_total_len if args.max_req_total_len is not None else 16384 | |
| while derived * max_page_num < max_len: | |
| derived *= 2 |
Four latent defects in LinearAttPagedRadixCache / _linear_att_free_req, found via adversarial audit + property-based fuzzing: - Root in eviction set: _add_node added root to _evict_tree_set when the tree emptied (root is a leaf then), contradicting _evict's own 'assert node is not self.root_node'. Now excluded. - Root ref leak on miss / trim-to-empty: match_prefix takes a root ref on descent; the two 'no match' returns handed None to the caller without releasing it, so root.ref_counter drifted up on every miss. Both returns now release it. - Root ref leak in deref_to_first_big_page_node: the big-page downgrade path leaked the same root ref when it bottomed out at root. Fixed. - Big-page state-buffer leak / assert-crash: big-page state ids accumulated in req.linear_att_len_to_big_page_id during chunked prefill were neither inserted nor freed when a request was paused/aborted mid-prefill (fallback branch of _linear_att_free_req), tripping free_a_req_mem's assert (worker crash) or leaking slots with asserts off. Now released on the non-insert exit paths. New CPU-only tests (no GPU): property-based invariant fuzzers for the small-page and big-page regimes, plus regression tests for the pause/abort big-page release. The three root-ref issues are latent (root carries zero tokens and is never evictable). The big-page leak is a reachable worker crash for long-context serving with --linear_att_page_block_num set.
aa8f40e to
187cbda
Compare
Summary
Four latent correctness defects in the linear-attention (Qwen3.5 / Qwen3-Next) hybrid prefix cache, found via adversarial audit + property-based fuzzing.
Correctness fixes
_add_nodeadded the root node to_evict_tree_setwhen the tree emptied (root is a leaf then), contradicting_evict's ownassert node is not self.root_node. Now excluded.match_prefixtakes a root ref on descent; the two "no match" returns handedNoneto the caller without releasing it, soroot.ref_counterdrifted up on every miss. Both returns now release it.deref_to_first_big_page_node— the big-page downgrade path leaked the same root ref when it bottomed out at root. Fixed.req.linear_att_len_to_big_page_idduring chunked prefill were neither inserted nor freed when a request was paused/aborted mid-prefill (fallback branch of_linear_att_free_req), trippingfree_a_req_mem'sassert len(...) == 0(worker crash) or leaking slots with asserts off. Now released on the non-insert exit paths.The three root-ref issues are latent (root carries zero tokens and is never evictable), so they didn't affect results/memory/accuracy. The big-page leak is a reachable worker crash for long-context serving with
--linear_att_page_block_numset.Tests
New CPU-only tests (no GPU required):
test_linear_att_radix_cache_invariants.py— property-based fuzzer (small-page regime): random insert/match/dec/steal/evict, checking ref accounting, set membership, buffer-id conservation, value integrity, and root-ref balance after every op.test_linear_att_radix_cache_bigpage.py— same for the big-page regime.test_linear_att_free_req_big_page.py— regression tests for the pause/abort big-page release.pytest unit_tests/server/router/dynamic_prompt/ unit_tests/server/router/model_infer/test_linear_att_free_req_big_page.py→ passed; black + flake8 clean.Risk / compatibility
The fixes don't change matching, eviction selection, or returned KV — only ref/buffer accounting on miss and pause/abort paths.