Skip to content

refactor(runtime): runAgentic shot loop → canonical routerToolLoop (A1)#299

Merged
drewstone merged 2 commits into
mainfrom
refactor/runagentic-canonical-toolloop
Jun 14, 2026
Merged

refactor(runtime): runAgentic shot loop → canonical routerToolLoop (A1)#299
drewstone merged 2 commits into
mainfrom
refactor/runagentic-canonical-toolloop

Conversation

@drewstone

Copy link
Copy Markdown
Contributor

A1 of the converge. runShot hand-rolled the off-box chat→tool_calls→execute loop that routerToolLoop already is — the 3rd copy of that primitive (alongside benchSolveLeaf, the gym loop).

Generalized routerToolLoop additively so it can be the one loop:

  • initialMessages — seed with a carried conversation (depth continuation) instead of [system,user]
  • maxTokens — the worker completion cap (threaded through routerChatWithTools too)
  • returns the final messages (depth carry + analyst trajectory)

runShot is now a thin adapter (carried messages in, surface.call as execute, result → ShotOut).

Capability-preserving + non-breaking: new opts are optional, so external callers (appworld, humaneval-repair-gate) are untouched; depth is preserved — tests/loops/strategy-suite.test.ts 34/34 green; typecheck clean. One documented minor: malformed tool-call JSON now feeds an error back inside routerToolLoop (its existing behavior) instead of bumping toolErrors; surface.call errors still counted.

Keeps both loop substrates (your call) — they now share the one canonical off-box loop.

…routerToolLoop

runShot hand-rolled the off-box chat→tool_calls→execute loop that routerToolLoop already
is — a 3rd copy of the same primitive. Generalize routerToolLoop additively so it can BE the
one loop: optional initialMessages (depth continuation — seed with the carried conversation
instead of [system,user]), maxTokens (the worker completion cap), and it returns the final
messages (for depth carry + the analyst trajectory). runShot becomes a thin adapter: carried
messages in, surface.call as execute (ERROR/throw → counted toolError, fed back not thrown),
result mapped to ShotOut. External callers (appworld, humaneval-repair-gate) unaffected — the
new opts are optional. Depth preserved: tests/loops/strategy-suite.test.ts 34/34 green.

Minor: malformed tool-call JSON now feeds an error back to the model inside routerToolLoop
(its existing behavior) rather than incrementing toolErrors; surface.call errors still counted.
tangletools
tangletools previously approved these changes Jun 14, 2026

@tangletools tangletools left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Auto-approved PR — afad91fa

Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-14T18:59:01Z

@tangletools tangletools left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Auto-approved PR — 53ffa9a7

Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-14T19:00:54Z

@tangletools tangletools left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Value Audit — sound-with-nits

Verdict sound-with-nits
Concerns 4 (4 weak-concern)
Heuristic 0.0s
Duplication 0.0s
Interrogation 117.2s (2 bridge agents)
Total 117.2s

💰 Value — sound-with-nits

Replaces the hand-rolled shot loop in runAgentic with the canonical routerToolLoop, adding initialMessages/maxTokens/message carry so depth continuation works through one shared primitive — a coherent deduplication.

  • What it does: Generalizes src/runtime/router-client.ts:routerToolLoop with optional initialMessages (seed a carried conversation instead of [system,user]), maxTokens (threaded through routerChatWithTools), and a new returned messages array. Then refactors src/runtime/strategy.ts:runShot into a thin adapter that passes its carried messages into routerToolLoop and maps surface.call results back to ShotOut.
  • Goals it achieves: Eliminates a third copy of the off-box chat→tool_calls→execute loop, so runAgentic shares the same canonical loop already used by bench/appworld/humaneval-repair-gate. Preserves depth continuation, the worker completion cap, and existing external callers; reduces future maintenance and drift.
  • Assessment: Good change. It is additive/non-breaking, keeps the same semantics for surface.call errors and ERROR:-prefixed results, and follows the codebase's layering (router-client is the declared one router chat client). The test suite and typecheck claims are consistent with the diff.
  • Better / existing approach: none for the runShot refactor itself — routerToolLoop is the right primitive. However, there is still a fourth hand-rolled copy of the same loop in src/runtime/supervise/runtime.ts:248-368 (routerToolsInlineExecutor for the createExecutor({backend:'router-tools'}) substrate). It should eventually be folded into routerToolLoop too; until then the codebase has two off-box tool-loop implementations t

🎯 Usefulness — sound-with-nits

Refactors runAgentic's leaf onto the already-exported routerToolLoop substrate, removing a third copy of the off-box chat→tool→execute loop while preserving all existing callers and the depth-continuation messages contract.

  • Integration: The changed runShot is reached through shotExecutor (strategy.ts:345-423), which is spawned by every runAgentic path including the built-in depth/breadth drivers and custom strategies authored via defineStrategy (strategy.ts:728-847); runAgentic is exported from src/runtime/index.ts:251 and exercised in tests/loops/strategy-suite.test.ts. The enhanced routerToolLoop is already consumed by bench/sr
  • Fit with existing patterns: It follows the codebase's stated substrate layering: router-client.ts:5-6 documents routerToolLoop as the one off-box agentic loop, and docs/canonical-api.md:429-433 positions runAgentic as the canonical strategy driver. runShot was the remaining hand-rolled duplication; folding it into routerToolLoop is aligned rather than competing.
  • Real-world viability: The loop copies the carried conversation (router-client.ts:224-225), propagates maxTokens and abort signals, and feeds malformed model JSON back as a tool result instead of crashing (router-client.ts:263-273). Surface.call errors and throws are still counted as toolErrors (strategy.ts:157-166). The only behavioral delta is that malformed JSON no longer increments toolErrors; the PR documents this,

🎯 Usefulness Audit

🟡 routerToolLoop still requires dummy system/user when seeding with initialMessages [ergonomics] ``

runShot passes empty strings for system and user because routerToolLoop ignores them when initialMessages is set (router-client.ts:207-229; strategy.ts:168-182). The interface works, but a messages-first overload or optional system/user would be cleaner and reduce future caller confusion.

🟡 Malformed tool-call JSON no longer increments toolErrors [robustness] ``

The documented behavior change means toolErrors now only counts surface.call failures (ERROR: prefix or throws) and not model-emitted invalid JSON (strategy.ts:157-166 vs router-client.ts:263-273). Since strategy-author.ts:39 exposes toolErrors to strategy authors, dashboards or steering logic that treated it as a total model/tool failure count may now undercount. Acceptable if intentional, but worth a release-note callout rather than a silent semantic shift.

💰 Value Audit

🟡 routerToolsInlineExecutor still hand-rolls the same loop [duplication] ``

src/runtime/supervise/runtime.ts:248-368 implements chat→tool_calls→execute→fold results identically to routerToolLoop, but with no maxTokens, no initialMessages, fixed temperature 0.2, and no costUsd. The PR makes routerToolLoop the canonical off-box loop for runAgentic/bench; the Supervisor 'router-tools' executor should be refactored to call it next, otherwise duplication remains.

🟡 toolCalls count silently excludes malformed-JSON calls [proportion] ``

Old runShot incremented toolCalls for every emitted tool_call before JSON.parse (afad91f diff); routerToolLoop increments toolCalls only after successful parse (src/runtime/router-client.ts:261). The PR documents the toolErrors change but not this metric shift. If downstream comparisons rely on total emitted tool_calls, the numbers will differ.


What this audit checks

It judges the change on its merits — not whether it was tasked out in an issue. Unticketed, fast-moving work is fine; the question is whether the change is good and whether a better or existing approach should be used instead.

Pass What it asks
Heuristic Vague title? Whitespace-only or cruft-bearing diff? (content signals only)
Duplication Do added function/class names already exist elsewhere in the repo?
Value Audit What does it do? What goal does it achieve? Is it good? Better architecture or already-exists?
Usefulness Audit Does it integrate and fit? Will it hold up in real use and actually get used?

Findings are concerns, not blocks — the human reviewer decides what to do with them.

value-audit · 20260614T190433Z

@tangletools

Copy link
Copy Markdown
Contributor

✅ No Blockers — 53ffa9a7

Readiness 60/100 · Confidence 65/100 · 3 findings (1 medium, 2 low)

deepseek glm aggregate
Readiness 60 79 60
Confidence 65 65 65
Correctness 60 79 60
Security 60 79 60
Testing 60 79 60
Architecture 60 79 60

Full multi-shot audit completed 1/1 planned shots over 2 changed files. Global verifier still owns final merge decision. | Full multi-shot audit completed 1/1 planned shots over 2 changed files. Global verifier still owns final merge decision.

🟠 MEDIUM Final assistant message dropped from carried conversation when model stops calling tools — src/runtime/router-client.ts

routerToolLoop returns at line 247 BEFORE pushing the assistant turn when toolCalls.length === 0: if (r.toolCalls.length === 0) return { ... messages }. The assistant message push at line 251-259 only runs when there ARE tool calls. The OLD runShot (strategy.ts pre-refactor) pushed the assistant message FIRST, then checked calls.length and broke — so the model's final text answer was always in the conversation. Impact: (1) depthDriver carries out.messages to the next shot — shot N+1 now starts without shot N's concluding as

🟡 LOW Token usage requires both prompt_tokens and completion_tokens to count either — src/runtime/router-client.ts

routerChatWithTools (line 160-163) gates usage on BOTH fields being numbers: u && typeof u.prompt_tokens === 'number' && typeof u.completion_tokens === 'number'. Old runShot counted them independently (two separate if-checks). If a provider returns only one of the two, the old code would sum that one; the new code sums neither. Edge case — providers typically return both or neither — but a subtle accounting regression for partial-usage providers.

🟡 LOW Malformed-JSON tool args no longer counted as toolErrors — src/runtime/strategy.ts

Old runShot caught JSON.parse failures and incremented toolErrors += 1, then still called the tool with empty args {}. routerToolLoop (router-client.ts:265-274) now pushes an error message and continues — skipping the execute callback entirely. Since runShot's toolErrors counter lives inside the execute closure (strategy.ts:158-167), malformed-JSON args are counted in routerToolLoop's toolCalls (line 261) but never in toolErrors. ShotOut.toolErrors undercounts. The new behavior (feed error back to model instead of calling tool with garbage args) is arguably better, but the metric is now inconsistent with its documented meaning.


tangletools · 2026-06-14T19:08:11Z · trace

@drewstone drewstone merged commit 8a3fdc6 into main Jun 14, 2026
1 check passed
@drewstone drewstone deleted the refactor/runagentic-canonical-toolloop branch June 14, 2026 19:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants