fix(bench): observe-only analyst firewall + off-box web_search worker #290
Merged
…verdict llmAnalyst read last.verdict (incl. the judge's failure detail / themesHit gradient) and injected it into the steer — a non-deployable oracle leak that contradicted AnalystFn's own selector!=judge docstring, and it never saw the task at all. Make it observe-only over the public task + output + trace, so refineAudit is a deployable steerer rather than an oracle-ceiling arm.
The research benches (finsearch) demand 'live web/market sources' but the plain router chat worker has no tools, so it could only recall/hallucinate — every arm scored a tool-less ~7.1% and the analyst correctly diagnosed 'fetch the data' against a worker that physically could not. SEARCH=you|exa now upgrades the off-box router worker to a router-tools agentic loop with a host-side web_search backed by the Tangle router's search providers. Verified live: the worker iteratively refines queries (AAPL close -> Yahoo Finance -> unadjusted close) and a task that scored 0 tool-less now resolves. Off-box, so no sandbox egress allowlist applies.
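A minimal sketch of the observe-only idea from the first commit message above. It assumes hypothetical names (`Iteration`, `Verdict`, `AnalystView`, `toAnalystView`, `buildSteerPrompt`); the real `AnalystFn` signature and iteration types are not shown in this PR text and may differ. The point it illustrates is that the analyst's input type has no verdict field, so judge detail cannot reach the steer.

```typescript
// Hypothetical shapes for illustration; the repository's real types may differ.
interface Verdict {
  score: number;
  notes?: string; // held-out judge detail: must never reach the analyst
}

interface Iteration {
  task: string;
  output: string;
  trace: string[];
  verdict?: Verdict;
}

// The analyst's view is a separate type with no verdict field at all.
interface AnalystView {
  task: string;
  output: string;
  trace: string[];
}

// Copy only the public fields; the verdict is never read.
function toAnalystView(iter: Iteration): AnalystView {
  return { task: iter.task, output: iter.output, trace: [...iter.trace] };
}

// Build the steer prompt from the public view only.
function buildSteerPrompt(view: AnalystView): string {
  return [
    "Task:",
    view.task,
    "Last output:",
    view.output,
    "Trace:",
    ...view.trace.map((line) => `- ${line}`),
    "Suggest one concrete change for the next attempt.",
  ].join("\n");
}
```

Projecting into a narrower type before prompting means a later edit cannot reintroduce the leak without first widening `AnalystView`, which is easy to catch in review.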
tangletools
approved these changes
Jun 14, 2026
✅ Auto-approved PR — 1ceb70ad
Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.
tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-14T11:00:01Z
Why
A senior re-audit of the research-bench runs found that, as wired, finsearch/trata cannot answer the gate ("does non-blind topology beat blind compute under a deployable selector at equal k?"). Two of the four root causes are code-correctness issues this PR fixes; the others (oracle judge, no correctable band) are domain limits documented for follow-up.
What
1. Observe-only analyst (b2c8fd0): selector ≠ judge firewall. llmAnalyst read last.verdict.notes (the held-out judge's failure detail; for trata, the themesHit/themesMissed gradient toward the reference) and injected it into the steer, directly contradicting AnalystFn's own docstring. It also never saw the task at all. Now it observes the public task + output + trace only: a deployable steerer instead of an oracle-fed one.
2. web_search tool for the off-box router-tools worker (67b2204). The research benches' prompts demand "live web/market sources", but the plain router chat worker has no tools, so every arm scored a tool-less ~7.1% while the analyst correctly diagnosed "fetch the data" against a worker that physically could not. SEARCH=you|exa now upgrades the off-box worker to a router-tools agentic loop with a host-side web_search backed by the Tangle router's search providers. Off-box, so no sandbox egress allowlist applies. Fail-loud on transport/HTTP errors (infra-excludes the cell rather than silently degrading to recall).
3. .gitignore corpus/ rollout scratch (1ceb70a).
Verification
- pnpm run typecheck, pnpm run build, pnpm run lint clean.
- web_search proven firing live: the worker does multi-turn query refinement (AAPL close → Yahoo Finance → unadjusted close, 5 hits each) and a finsearch task that scored 0 tool-less now resolves.
- POST router/v1/search {provider, query} → {data:[{title,url,snippet,publishedAt}]}.
Known follow-ups (not in this PR)
- router-tools toolTrace does not yet surface into iter.events, so an analyst arm still can't see tool behavior. Plumb it before trusting an analyst arm on a tool-using worker.
- createScopeAnalyst on a deployable-checker domain.
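The request/response shape listed under Verification and the fail-loud rule from item 2 can be sketched roughly as follows. This is a sketch under stated assumptions, not the PR's code: `webSearch`, `SearchInfraError`, `FetchLike`, the `baseUrl` parameter, the optional `publishedAt`, and the absence of auth headers are all hypothetical. Only the `/v1/search` path, the `{provider, query}` body, and the hit field names come from the description above.

```typescript
type SearchProvider = "you" | "exa";

interface SearchHit {
  title: string;
  url: string;
  snippet: string;
  publishedAt?: string; // assumed optional; the PR text only lists the field name
}

// Hypothetical error type: callers would treat it as an infra failure and
// exclude the cell, rather than letting the worker fall back to recall.
class SearchInfraError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "SearchInfraError";
  }
}

// Minimal fetch-like signature so the transport can be injected.
type FetchLike = (
  url: string,
  init: { method: string; headers: Record<string, string>; body: string },
) => Promise<{ ok: boolean; status: number; json(): Promise<unknown> }>;

async function webSearch(
  baseUrl: string, // assumption: how the router URL is configured is not shown
  provider: SearchProvider,
  query: string,
  fetchFn: FetchLike,
): Promise<SearchHit[]> {
  let res: Awaited<ReturnType<FetchLike>>;
  try {
    res = await fetchFn(`${baseUrl}/v1/search`, {
      method: "POST",
      headers: { "content-type": "application/json" },
      body: JSON.stringify({ provider, query }),
    });
  } catch (err) {
    // Transport failure: surface it loudly instead of returning no hits.
    throw new SearchInfraError(`search transport error: ${String(err)}`);
  }
  if (!res.ok) {
    throw new SearchInfraError(`search HTTP error: ${res.status}`);
  }
  const body = (await res.json()) as { data?: unknown };
  if (!Array.isArray(body.data)) {
    throw new SearchInfraError("search response missing data array");
  }
  return body.data as SearchHit[];
}
```

Throwing a distinct error type, rather than returning an empty list, keeps "the search backend failed" separate from "the search found nothing", which is the distinction the fail-loud rule depends on.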