fix(bench): observe-only analyst firewall + off-box web_search worker#290

Merged
drewstone merged 3 commits into main from fix/bench-harness-validity on Jun 14, 2026
@drewstone

Contributor

Why

A senior re-audit of the research-bench runs found that, as wired, finsearch/trata cannot answer the gate ("does non-blind topology beat blind compute under a deployable selector at equal k?"). Two of the four root causes are code-correctness issues this PR fixes; the others (oracle judge, no correctable band) are domain limits documented for follow-up.

What

1. Observe-only analyst (b2c8fd0) — selector ≠ judge firewall.
llmAnalyst read last.verdict.notes (the held-out judge's failure detail — for trata, the themesHit/themesMissed gradient toward the reference) and injected it into the steer, directly contradicting AnalystFn's own docstring. It also never saw the task at all. Now it observes the public task + output + trace only — a deployable steerer instead of an oracle-fed one.

2. web_search tool for the off-box router-tools worker (67b2204).
The research benches' prompts demand "live web/market sources", but the plain router chat worker has no tools — every arm scored a tool-less ~7.1% while the analyst correctly diagnosed "fetch the data" against a worker that physically could not. SEARCH=you|exa now upgrades the off-box worker to a router-tools agentic loop with a host-side web_search backed by the Tangle router's search providers. Off-box, so no sandbox egress allowlist applies. Fail-loud on transport/HTTP errors (infra-excludes the cell rather than silently degrading to recall).

3. .gitignore corpus/ rollout scratch (1ceb70a).
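
The selector ≠ judge firewall from item 1 can be sketched as a projection step: the analyst receives only the public task, output, and trace, and the verdict never enters the prompt. This is a minimal sketch under stated assumptions. `Verdict`, `Iteration`, `Observation`, `observe`, and `buildSteerPrompt` are hypothetical names for illustration, not the repo's actual `AnalystFn` types.

```typescript
// Hypothetical shapes for illustration; the real types in the repo may differ.
interface Verdict {
  score: number;
  // Held-out judge detail (e.g. themesHit/themesMissed). Must never reach the analyst.
  notes?: string;
}

interface Iteration {
  output: string;
  trace: string[];
  verdict: Verdict;
}

// The only view the analyst may see: public task, output, and trace.
interface Observation {
  task: string;
  output: string;
  trace: string[];
}

// Project an iteration onto its public fields, dropping the verdict entirely.
function observe(task: string, last: Iteration): Observation {
  return { task, output: last.output, trace: [...last.trace] };
}

// Build the steer prompt from the observation only, so no judge detail can leak in.
function buildSteerPrompt(obs: Observation): string {
  return [
    `Task:\n${obs.task}`,
    `Last output:\n${obs.output}`,
    `Trace:\n${obs.trace.join("\n")}`,
    "Suggest one concrete change for the next attempt.",
  ].join("\n\n");
}
```

Because `buildSteerPrompt` accepts an `Observation` rather than an `Iteration`, passing the verdict along would be a type error instead of a silent leak.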

Verification

  • pnpm run typecheck, pnpm run build, pnpm run lint clean.
  • web_search proven firing live: the worker does multi-turn query refinement (AAPL close → Yahoo Finance → unadjusted close, 5 hits each) and a finsearch task that scored 0 tool-less now resolves.
  • Endpoint verified: POST router/v1/search {provider, query} → {data:[{title,url,snippet,publishedAt}]}.
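
The request/response shape verified above, together with the fail-loud behaviour described for `web_search`, could look roughly like the following. This is a sketch, not the merged implementation. `parseSearchProvider`, `webSearch`, `infraError`, `FetchLike`, and the `searchUrl` parameter are hypothetical, and the fetch function is injected so the example does not depend on a particular runtime.

```typescript
type SearchProvider = "you" | "exa";

interface SearchHit {
  title: string;
  url: string;
  snippet: string;
  publishedAt?: string;
}

// Minimal fetch-like signature (assumption) so a stub can be injected in tests.
type FetchLike = (
  url: string,
  init: { method: string; headers: Record<string, string>; body: string },
) => Promise<{ ok: boolean; status: number; json(): Promise<unknown> }>;

// Tag infrastructure failures so a caller can exclude the cell instead of scoring it.
function infraError(message: string): Error {
  const err = new Error(message);
  err.name = "SearchInfraError";
  return err;
}

// Interpret a SEARCH=you|exa setting; unset means search is disabled.
function parseSearchProvider(value: string | undefined): SearchProvider | undefined {
  if (value === undefined || value === "") return undefined;
  if (value === "you" || value === "exa") return value;
  throw new Error(`Unsupported SEARCH value: ${value}`);
}

// POST {provider, query} and return the hits from {data: [...]}.
// Any transport, HTTP, or shape error throws rather than returning an empty list.
async function webSearch(
  fetchImpl: FetchLike,
  searchUrl: string,
  provider: SearchProvider,
  query: string,
): Promise<SearchHit[]> {
  let res;
  try {
    res = await fetchImpl(searchUrl, {
      method: "POST",
      headers: { "content-type": "application/json" },
      body: JSON.stringify({ provider, query }),
    });
  } catch (err) {
    throw infraError(`web_search transport error: ${String(err)}`);
  }
  if (!res.ok) throw infraError(`web_search HTTP ${res.status}`);
  const body = await res.json();
  const data = (body as { data?: unknown } | null)?.data;
  if (!Array.isArray(data)) throw infraError("web_search: malformed response");
  return data as SearchHit[];
}
```

Throwing on every failure path is the point of the design: an empty result list would let the worker fall back to recall and produce a score that looks like a capability measurement.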

Known follow-ups (not in this PR)

  • The router-tools toolTrace does not yet surface into iter.events, so an analyst arm still can't see tool behavior — plumb it before trusting an analyst arm on a tool-using worker.
  • finsearch/trata judges compare to a reference answer (oracle), so they remain valid capability benches but not clean gate tests. The gate itself wants the canonical Supervisor + createScopeAnalyst on a deployable-checker domain.
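
For the first follow-up, one possible shape for surfacing tool behaviour to the analyst is a pure mapping from tool-trace entries to event records. Everything here is assumed for illustration: the real `toolTrace` and `iter.events` structures are not shown in this PR, and `ToolCall`, `TraceEvent`, and `toolTraceToEvents` are hypothetical names.

```typescript
// Hypothetical record of one tool invocation by the worker.
interface ToolCall {
  tool: string;
  args: Record<string, unknown>;
  resultCount: number;
}

// Hypothetical event shape an analyst could read from the iteration trace.
interface TraceEvent {
  kind: "tool_call";
  summary: string;
}

// Convert each tool call into a compact, ordered event summary.
function toolTraceToEvents(toolTrace: ToolCall[]): TraceEvent[] {
  return toolTrace.map(
    (call, i): TraceEvent => ({
      kind: "tool_call",
      summary: `#${i + 1} ${call.tool}(${JSON.stringify(call.args)}) -> ${call.resultCount} hits`,
    }),
  );
}
```

Only public tool behaviour (name, arguments, hit count) is summarised, which keeps the analyst's view consistent with the observe-only rule.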

…verdict

llmAnalyst read last.verdict (incl. the judge's failure detail / themesHit
gradient) and injected it into the steer — a non-deployable oracle leak that
contradicted AnalystFn's own selector!=judge docstring, and it never saw the
task at all. Make it observe-only over the public task + output + trace, so
refineAudit is a deployable steerer rather than an oracle-ceiling arm.

The research benches (finsearch) demand 'live web/market sources' but the plain
router chat worker has no tools, so it could only recall/hallucinate — every
arm scored a tool-less ~7.1% and the analyst correctly diagnosed 'fetch the
data' against a worker that physically could not. SEARCH=you|exa now upgrades
the off-box router worker to a router-tools agentic loop with a host-side
web_search backed by the Tangle router's search providers. Verified live: the
worker iteratively refines queries (AAPL close -> Yahoo Finance -> unadjusted
close) and a task that scored 0 tool-less now resolves. Off-box, so no sandbox
egress allowlist applies.

@tangletools tangletools left a comment

Contributor

✅ Auto-approved PR — 1ceb70ad

Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-14T11:00:01Z

@drewstone drewstone merged commit aa58942 into main Jun 14, 2026
1 check passed

2 participants