refactor(runtime): runAgentic shot loop → canonical routerToolLoop (A1)#299
Conversation
…routerToolLoop runShot hand-rolled the off-box chat→tool_calls→execute loop that routerToolLoop already is — a 3rd copy of the same primitive. Generalize routerToolLoop additively so it can BE the one loop: optional initialMessages (depth continuation — seed with the carried conversation instead of [system,user]), maxTokens (the worker completion cap), and it returns the final messages (for depth carry + the analyst trajectory). runShot becomes a thin adapter: carried messages in, surface.call as execute (ERROR/throw → counted toolError, fed back not thrown), result mapped to ShotOut. External callers (appworld, humaneval-repair-gate) unaffected — the new opts are optional. Depth preserved: tests/loops/strategy-suite.test.ts 34/34 green. Minor: malformed tool-call JSON now feeds an error back to the model inside routerToolLoop (its existing behavior) rather than incrementing toolErrors; surface.call errors still counted.
tangletools
left a comment
There was a problem hiding this comment.
✅ Auto-approved PR — afad91fa
Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.
tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-14T18:59:01Z
tangletools
left a comment
There was a problem hiding this comment.
✅ Auto-approved PR — 53ffa9a7
Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.
tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-14T19:00:54Z
tangletools
left a comment
There was a problem hiding this comment.
🟡 Value Audit — sound-with-nits
| Verdict | sound-with-nits |
| Concerns | 4 (4 weak-concern) |
| Heuristic | 0.0s |
| Duplication | 0.0s |
| Interrogation | 117.2s (2 bridge agents) |
| Total | 117.2s |
💰 Value — sound-with-nits
Replaces the hand-rolled shot loop in runAgentic with the canonical routerToolLoop, adding initialMessages/maxTokens/message carry so depth continuation works through one shared primitive — a coherent deduplication.
- What it does: Generalizes src/runtime/router-client.ts:routerToolLoop with optional initialMessages (seed a carried conversation instead of [system,user]), maxTokens (threaded through routerChatWithTools), and a new returned messages array. Then refactors src/runtime/strategy.ts:runShot into a thin adapter that passes its carried messages into routerToolLoop and maps surface.call results back to ShotOut.
- Goals it achieves: Eliminates a third copy of the off-box chat→tool_calls→execute loop, so runAgentic shares the same canonical loop already used by bench/appworld/humaneval-repair-gate. Preserves depth continuation, the worker completion cap, and existing external callers; reduces future maintenance and drift.
- Assessment: Good change. It is additive/non-breaking, keeps the same semantics for surface.call errors and ERROR:-prefixed results, and follows the codebase's layering (router-client is the declared one router chat client). The test suite and typecheck claims are consistent with the diff.
- Better / existing approach: none for the runShot refactor itself — routerToolLoop is the right primitive. However, there is still a fourth hand-rolled copy of the same loop in src/runtime/supervise/runtime.ts:248-368 (routerToolsInlineExecutor for the createExecutor({backend:'router-tools'}) substrate). It should eventually be folded into routerToolLoop too; until then the codebase has two off-box tool-loop implementations t
🎯 Usefulness — sound-with-nits
Refactors runAgentic's leaf onto the already-exported routerToolLoop substrate, removing a third copy of the off-box chat→tool→execute loop while preserving all existing callers and the depth-continuation messages contract.
- Integration: The changed runShot is reached through shotExecutor (strategy.ts:345-423), which is spawned by every runAgentic path including the built-in depth/breadth drivers and custom strategies authored via defineStrategy (strategy.ts:728-847); runAgentic is exported from src/runtime/index.ts:251 and exercised in tests/loops/strategy-suite.test.ts. The enhanced routerToolLoop is already consumed by bench/sr
- Fit with existing patterns: It follows the codebase's stated substrate layering: router-client.ts:5-6 documents routerToolLoop as the one off-box agentic loop, and docs/canonical-api.md:429-433 positions runAgentic as the canonical strategy driver. runShot was the remaining hand-rolled duplication; folding it into routerToolLoop is aligned rather than competing.
- Real-world viability: The loop copies the carried conversation (router-client.ts:224-225), propagates maxTokens and abort signals, and feeds malformed model JSON back as a tool result instead of crashing (router-client.ts:263-273). Surface.call errors and throws are still counted as toolErrors (strategy.ts:157-166). The only behavioral delta is that malformed JSON no longer increments toolErrors; the PR documents this,
🎯 Usefulness Audit
🟡 routerToolLoop still requires dummy system/user when seeding with initialMessages [ergonomics] ``
runShot passes empty strings for system and user because routerToolLoop ignores them when initialMessages is set (router-client.ts:207-229; strategy.ts:168-182). The interface works, but a messages-first overload or optional system/user would be cleaner and reduce future caller confusion.
🟡 Malformed tool-call JSON no longer increments toolErrors [robustness] ``
The documented behavior change means toolErrors now only counts surface.call failures (ERROR: prefix or throws) and not model-emitted invalid JSON (strategy.ts:157-166 vs router-client.ts:263-273). Since strategy-author.ts:39 exposes toolErrors to strategy authors, dashboards or steering logic that treated it as a total model/tool failure count may now undercount. Acceptable if intentional, but worth a release-note callout rather than a silent semantic shift.
💰 Value Audit
🟡 routerToolsInlineExecutor still hand-rolls the same loop [duplication] ``
src/runtime/supervise/runtime.ts:248-368 implements chat→tool_calls→execute→fold results identically to routerToolLoop, but with no maxTokens, no initialMessages, fixed temperature 0.2, and no costUsd. The PR makes routerToolLoop the canonical off-box loop for runAgentic/bench; the Supervisor 'router-tools' executor should be refactored to call it next, otherwise duplication remains.
🟡 toolCalls count silently excludes malformed-JSON calls [proportion] ``
Old runShot incremented toolCalls for every emitted tool_call before JSON.parse (afad91f diff); routerToolLoop increments toolCalls only after successful parse (src/runtime/router-client.ts:261). The PR documents the toolErrors change but not this metric shift. If downstream comparisons rely on total emitted tool_calls, the numbers will differ.
What this audit checks
It judges the change on its merits — not whether it was tasked out in an issue. Unticketed, fast-moving work is fine; the question is whether the change is good and whether a better or existing approach should be used instead.
| Pass | What it asks |
|---|---|
| Heuristic | Vague title? Whitespace-only or cruft-bearing diff? (content signals only) |
| Duplication | Do added function/class names already exist elsewhere in the repo? |
| Value Audit | What does it do? What goal does it achieve? Is it good? Better architecture or already-exists? |
| Usefulness Audit | Does it integrate and fit? Will it hold up in real use and actually get used? |
Findings are concerns, not blocks — the human reviewer decides what to do with them.
✅ No Blockers —
|
| deepseek | glm | aggregate | |
|---|---|---|---|
| Readiness | 60 | 79 | 60 |
| Confidence | 65 | 65 | 65 |
| Correctness | 60 | 79 | 60 |
| Security | 60 | 79 | 60 |
| Testing | 60 | 79 | 60 |
| Architecture | 60 | 79 | 60 |
Full multi-shot audit completed 1/1 planned shots over 2 changed files. Global verifier still owns final merge decision. | Full multi-shot audit completed 1/1 planned shots over 2 changed files. Global verifier still owns final merge decision.
🟠 MEDIUM Final assistant message dropped from carried conversation when model stops calling tools — src/runtime/router-client.ts
routerToolLoop returns at line 247 BEFORE pushing the assistant turn when toolCalls.length === 0:
if (r.toolCalls.length === 0) return { ... messages }. The assistant message push at line 251-259 only runs when there ARE tool calls. The OLD runShot (strategy.ts pre-refactor) pushed the assistant message FIRST, then checked calls.length and broke — so the model's final text answer was always in the conversation. Impact: (1) depthDriver carriesout.messagesto the next shot — shot N+1 now starts without shot N's concluding as
🟡 LOW Token usage requires both prompt_tokens and completion_tokens to count either — src/runtime/router-client.ts
routerChatWithTools (line 160-163) gates usage on BOTH fields being numbers:
u && typeof u.prompt_tokens === 'number' && typeof u.completion_tokens === 'number'. Old runShot counted them independently (two separate if-checks). If a provider returns only one of the two, the old code would sum that one; the new code sums neither. Edge case — providers typically return both or neither — but a subtle accounting regression for partial-usage providers.
🟡 LOW Malformed-JSON tool args no longer counted as toolErrors — src/runtime/strategy.ts
Old runShot caught JSON.parse failures and incremented
toolErrors += 1, then still called the tool with empty args{}. routerToolLoop (router-client.ts:265-274) now pushes an error message andcontinues — skipping theexecutecallback entirely. Since runShot's toolErrors counter lives inside theexecuteclosure (strategy.ts:158-167), malformed-JSON args are counted in routerToolLoop'stoolCalls(line 261) but never intoolErrors. ShotOut.toolErrors undercounts. The new behavior (feed error back to model instead of calling tool with garbage args) is arguably better, but the metric is now inconsistent with its documented meaning.
tangletools · 2026-06-14T19:08:11Z · trace
A1 of the converge.
runShothand-rolled the off-box chat→tool_calls→execute loop thatrouterToolLoopalready is — the 3rd copy of that primitive (alongsidebenchSolveLeaf, the gym loop).Generalized
routerToolLoopadditively so it can be the one loop:initialMessages— seed with a carried conversation (depth continuation) instead of[system,user]maxTokens— the worker completion cap (threaded throughrouterChatWithToolstoo)messages(depth carry + analyst trajectory)runShotis now a thin adapter (carried messages in,surface.callasexecute, result →ShotOut).Capability-preserving + non-breaking: new opts are optional, so external callers (
appworld,humaneval-repair-gate) are untouched; depth is preserved —tests/loops/strategy-suite.test.ts34/34 green; typecheck clean. One documented minor: malformed tool-call JSON now feeds an error back insiderouterToolLoop(its existing behavior) instead of bumpingtoolErrors;surface.callerrors still counted.Keeps both loop substrates (your call) — they now share the one canonical off-box loop.