fix(bench): observe-only analyst firewall + off-box web_search worker by drewstone · Pull Request #290 · tangle-network/agent-runtime

drewstone · 2026-06-14T10:59:54Z

Why

A senior re-audit of the research-bench runs found that, as wired, finsearch/trata cannot answer the gate ("does non-blind topology beat blind compute under a deployable selector at equal k?"). Two of the four root causes are code-correctness issues this PR fixes; the others (oracle judge, no correctable band) are domain limits documented for follow-up.

What

1. Observe-only analyst (b2c8fd0) — selector ≠ judge firewall.
llmAnalyst read last.verdict.notes (the held-out judge's failure detail — for trata, the themesHit/themesMissed gradient toward the reference) and injected it into the steer, directly contradicting AnalystFn's own docstring. It also never saw the task at all. Now it observes the public task + output + trace only — a deployable steerer instead of an oracle-fed one.
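A minimal sketch of the observe-only boundary described above, not the repository's actual code. The `Verdict`, `Iteration`, and `AnalystView` shapes and the `toAnalystView` / `buildSteerPrompt` helpers are hypothetical names introduced for illustration; the real `AnalystFn` types may differ. The idea is that the steer prompt is built from a view that never carries the verdict, so the judge's notes cannot leak into it.

```typescript
// Hypothetical shapes for illustration only; not the repo's real types.
interface Verdict {
  score: number;
  notes?: string; // held-out judge detail, must never reach the analyst
}

interface Iteration {
  output: string;
  trace: string[];
  verdict?: Verdict;
}

// The only fields the analyst is allowed to observe.
interface AnalystView {
  task: string;
  output: string;
  trace: string[];
}

// Copy the public fields explicitly; the verdict is deliberately left out.
function toAnalystView(task: string, last: Iteration): AnalystView {
  return { task, output: last.output, trace: [...last.trace] };
}

// Build the steer prompt from the view alone, so it cannot quote judge notes.
function buildSteerPrompt(view: AnalystView): string {
  return [
    `Task: ${view.task}`,
    `Last output: ${view.output}`,
    "Trace:",
    ...view.trace.map((line) => `- ${line}`),
    "Suggest one concrete change for the next attempt.",
  ].join("\n");
}
```

Typing the prompt builder against `AnalystView` rather than `Iteration` makes the firewall a compile-time property: any future attempt to read `verdict` inside the analyst fails typecheck.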

2. web_search tool for the off-box router-tools worker (67b2204).
The research benches' prompts demand "live web/market sources", but the plain router chat worker has no tools — every arm scored a tool-less ~7.1% while the analyst correctly diagnosed "fetch the data" against a worker that physically could not. SEARCH=you|exa now upgrades the off-box worker to a router-tools agentic loop with a host-side web_search backed by the Tangle router's search providers. Off-box, so no sandbox egress allowlist applies. Fail-loud on transport/HTTP errors (infra-excludes the cell rather than silently degrading to recall).
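The worker selection and fail-loud behavior can be sketched roughly as follows. This is an illustration under stated assumptions: `selectWorker`, `runCell`, `SearchInfraError`, and the `CellOutcome` shape are hypothetical names, the cell runner is synchronous for brevity, and rejecting unknown `SEARCH` values is an assumption rather than confirmed behavior.

```typescript
type Provider = "you" | "exa";

type WorkerKind =
  | { kind: "chat" } // plain router chat worker, no tools
  | { kind: "router-tools"; provider: Provider }; // agentic loop with web_search

// Assumption: an unset SEARCH keeps the tool-less worker, and any value
// other than "you" or "exa" is rejected instead of silently ignored.
function selectWorker(env: Record<string, string | undefined>): WorkerKind {
  const raw = env.SEARCH;
  if (raw === undefined || raw === "") return { kind: "chat" };
  if (raw === "you" || raw === "exa") return { kind: "router-tools", provider: raw };
  throw new Error(`Unsupported SEARCH value: ${raw}`);
}

// Hypothetical error type marking transport/HTTP failures of web_search.
class SearchInfraError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "SearchInfraError";
  }
}

type CellOutcome =
  | { status: "scored"; score: number }
  | { status: "infra-excluded"; reason: string };

// Fail loud: an infra failure excludes the cell from scoring; it is never
// converted into a score from a tool-less (recall-only) answer.
function runCell(run: () => number): CellOutcome {
  try {
    return { status: "scored", score: run() };
  } catch (err) {
    if (err instanceof Error && err.name === "SearchInfraError") {
      return { status: "infra-excluded", reason: err.message };
    }
    throw err; // anything else is a real bug and should surface
  }
}
```

Keeping "infra-excluded" as a distinct outcome means a search outage shrinks the sample instead of dragging an arm's score toward the tool-less baseline.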

3. .gitignore corpus/ rollout scratch (1ceb70a).

Verification

pnpm run typecheck, pnpm run build, and pnpm run lint all pass clean.
web_search confirmed firing live: the worker does multi-turn query refinement (AAPL close → Yahoo Finance → unadjusted close, 5 hits each), and a finsearch task that scored 0 tool-less now resolves.
Endpoint verified: POST router/v1/search {provider, query} → {data:[{title,url,snippet,publishedAt}]}.
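The request and response shape verified above could be handled along these lines. This is a sketch, not the host-side implementation: `buildSearchRequest` and `parseSearchResponse` are hypothetical helpers, and treating `title`, `url`, and `snippet` as required while `publishedAt` is optional is an assumption about the payload, not something the endpoint check establishes.

```typescript
interface SearchHit {
  title: string;
  url: string;
  snippet: string;
  publishedAt?: string; // assumed optional
}

// Build the POST described above: {provider, query} sent to /v1/search.
function buildSearchRequest(provider: "you" | "exa", query: string) {
  return {
    method: "POST" as const,
    path: "/v1/search",
    headers: { "content-type": "application/json" },
    body: JSON.stringify({ provider, query }),
  };
}

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null;
}

// Validate {data:[{title,url,snippet,publishedAt}]} and throw on anything
// else, so a malformed payload is never mistaken for "zero hits".
function parseSearchResponse(payload: unknown): SearchHit[] {
  if (!isRecord(payload) || !Array.isArray(payload.data)) {
    throw new Error("search response is missing data[]");
  }
  return payload.data.map((item: unknown, index: number) => {
    if (
      !isRecord(item) ||
      typeof item.title !== "string" ||
      typeof item.url !== "string" ||
      typeof item.snippet !== "string"
    ) {
      throw new Error(`search hit ${index} is malformed`);
    }
    const hit: SearchHit = { title: item.title, url: item.url, snippet: item.snippet };
    if (typeof item.publishedAt === "string") hit.publishedAt = item.publishedAt;
    return hit;
  });
}
```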

Known follow-ups (not in this PR)

The router-tools toolTrace does not yet surface into iter.events, so an analyst arm still can't see tool behavior — plumb it before trusting an analyst arm on a tool-using worker.
finsearch/trata judges compare to a reference answer (oracle), so they remain valid capability benches but not clean gate tests. The gate itself wants the canonical Supervisor + createScopeAnalyst on a deployable-checker domain.

…verdict llmAnalyst read last.verdict (incl. the judge's failure detail / themesHit gradient) and injected it into the steer — a non-deployable oracle leak that contradicted AnalystFn's own selector!=judge docstring, and it never saw the task at all. Make it observe-only over the public task + output + trace, so refineAudit is a deployable steerer rather than an oracle-ceiling arm.

The research benches (finsearch) demand 'live web/market sources' but the plain router chat worker has no tools, so it could only recall/hallucinate — every arm scored a tool-less ~7.1% and the analyst correctly diagnosed 'fetch the data' against a worker that physically could not. SEARCH=you|exa now upgrades the off-box router worker to a router-tools agentic loop with a host-side web_search backed by the Tangle router's search providers. Verified live: the worker iteratively refines queries (AAPL close -> Yahoo Finance -> unadjusted close) and a task that scored 0 tool-less now resolves. Off-box, so no sandbox egress allowlist applies.

tangletools

✅ Auto-approved PR — `1ceb70ad`

Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

_{tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-14T11:00:01Z}

drewstone added 3 commits June 14, 2026 04:37

chore(bench): gitignore corpus/ rollout scratch

1ceb70a

tangletools approved these changes Jun 14, 2026

View reviewed changes

drewstone merged commit aa58942 into main Jun 14, 2026
1 check passed

drewstone merged 3 commits into main from fix/bench-harness-validity
