fix(linear-att): fix latent prefix-cache ref/buffer leaks by sufubao · Pull Request #1348 · ModelTC/LightLLM

sufubao · 2026-06-13T14:38:55Z

Summary

Four latent correctness defects in the linear-attention (Qwen3.5 / Qwen3-Next) hybrid prefix cache, found via adversarial audit + property-based fuzzing.

Correctness fixes

Root in eviction set — _add_node added the root node to _evict_tree_set when the tree emptied (root is a leaf then), contradicting _evict's own assert node is not self.root_node. Now excluded.
Root ref leak on miss / trim-to-empty — match_prefix takes a root ref on descent; the two "no match" returns handed None to the caller without releasing it, so root.ref_counter drifted up on every miss. Both returns now release it.
Root ref leak in deref_to_first_big_page_node — the big-page downgrade path leaked the same root ref when it bottomed out at root. Fixed.
Big-page state-buffer leak / assert-crash — big-page state ids accumulated in req.linear_att_len_to_big_page_id during chunked prefill were neither inserted nor freed when a request was paused/aborted mid-prefill (fallback branch of _linear_att_free_req), tripping free_a_req_mem's assert len(...) == 0 (worker crash) or leaking slots with asserts off. Now released on the non-insert exit paths.

The three root-ref issues are latent (root carries zero tokens and is never evictable), so they didn't affect results/memory/accuracy. The big-page leak is a reachable worker crash for long-context serving with --linear_att_page_block_num set.

Tests

New CPU-only tests (no GPU required):

test_linear_att_radix_cache_invariants.py — property-based fuzzer (small-page regime): random insert/match/dec/steal/evict, checking ref accounting, set membership, buffer-id conservation, value integrity, and root-ref balance after every op.
test_linear_att_radix_cache_bigpage.py — same for the big-page regime.
test_linear_att_free_req_big_page.py — regression tests for the pause/abort big-page release.

pytest unit_tests/server/router/dynamic_prompt/ unit_tests/server/router/model_infer/test_linear_att_free_req_big_page.py → passed; black + flake8 clean.

Risk / compatibility

The fixes don't change matching, eviction selection, or returned KV — only ref/buffer accounting on miss and pause/abort paths.

gemini-code-assist

Code Review

This pull request introduces auto-derivation for --linear_att_hash_page_size based on max_req_total_len to optimize prefix-cache hit rates for hybrid linear-attention models, along with extensive documentation and warnings against using excessively large page sizes in serving workloads. It also fixes critical bugs in the radix cache, including root reference counter drift during cache misses or trims, and memory leaks of big-page state buffers when requests are paused or aborted mid-prefill. Comprehensive unit and invariant fuzz tests are added to verify these fixes. A review comment suggests adding a defensive fallback check for args.max_req_total_len in api_start.py to prevent a potential TypeError if it is None.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

gemini-code-assist · 2026-06-13T14:41:14Z

+        while derived * max_page_num < args.max_req_total_len:
+            derived *= 2


To prevent potential TypeError if args.max_req_total_len is None (as it is typed as Optional[int]), it is safer to add a defensive fallback check before using it in the comparison loop.

Suggested change

while derived * max_page_num < args.max_req_total_len:

derived *= 2

max_len = args.max_req_total_len if args.max_req_total_len is not None else 16384

while derived * max_page_num < max_len:

derived *= 2

Four latent defects in LinearAttPagedRadixCache / _linear_att_free_req, found via adversarial audit + property-based fuzzing: - Root in eviction set: _add_node added root to _evict_tree_set when the tree emptied (root is a leaf then), contradicting _evict's own 'assert node is not self.root_node'. Now excluded. - Root ref leak on miss / trim-to-empty: match_prefix takes a root ref on descent; the two 'no match' returns handed None to the caller without releasing it, so root.ref_counter drifted up on every miss. Both returns now release it. - Root ref leak in deref_to_first_big_page_node: the big-page downgrade path leaked the same root ref when it bottomed out at root. Fixed. - Big-page state-buffer leak / assert-crash: big-page state ids accumulated in req.linear_att_len_to_big_page_id during chunked prefill were neither inserted nor freed when a request was paused/aborted mid-prefill (fallback branch of _linear_att_free_req), tripping free_a_req_mem's assert (worker crash) or leaking slots with asserts off. Now released on the non-insert exit paths. New CPU-only tests (no GPU): property-based invariant fuzzers for the small-page and big-page regimes, plus regression tests for the pause/abort big-page release. The three root-ref issues are latent (root carries zero tokens and is never evictable). The big-page leak is a reachable worker crash for long-context serving with --linear_att_page_block_num set.

gemini-code-assist Bot reviewed Jun 13, 2026

View reviewed changes

sufubao force-pushed the fix_cache_hit_rare branch from aa8f40e to 187cbda Compare June 14, 2026 01:03

sufubao changed the title ~~fix(linear-att): auto-derive prefix-cache page size + fix latent cache bugs~~ fix(linear-att): fix latent prefix-cache ref/buffer leaks Jun 14, 2026

hiworldwzj added 3 commits June 14, 2026 05:20

fix

8ec5a2a

fix

87fecd1

remove unit tests added by this PR

c50cf6a

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(linear-att): fix latent prefix-cache ref/buffer leaks#1348

fix(linear-att): fix latent prefix-cache ref/buffer leaks#1348
sufubao wants to merge 4 commits into
ModelTC:mainfrom
sufubao:fix_cache_hit_rare

sufubao commented Jun 13, 2026 •

edited

Loading

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot Jun 13, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

		while derived * max_page_num < args.max_req_total_len:
		derived *= 2

Conversation

sufubao commented Jun 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Correctness fixes

Tests

Risk / compatibility

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Jun 13, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

sufubao commented Jun 13, 2026 •

edited

Loading