feat(bench): generalized AgentProfile-coordinate optimizer on the sandbox surface #293
…dbox surface
Optimize ANY genome coordinate (skills/hooks/tools/prompt/subagents; mcp same shape) of a real sandboxed harness worker, holding the rest of the profile fixed. Each coordinate is a compose(profile, selected) that injects into its own AgentProfile field; freeze = don't select it, combine = fold a winner into the base.
- profile-coordinates.ts: the coordinate registry (one composer per genome field).
- profile-coord-sandbox.mts: COORDINATE= runner, with-vs-without on the sandboxed worker, deterministic-judge bench. AGENT=worker wired; driver is a marked seam (same compose; a driver/worker/subagent are all AgentProfiles).
- skill-sandbox-smoke.mts: proves a SKILL.md materializes to disk in the box (resources.skills → ~/.claude/skills/<id>/SKILL.md, verified by the in-box agent).
- coding-skills/ + eops-skills/: real agent-under-test skills (not prompt text).
Runs the skills lever on the sandbox surface (EOPS = banded agentic judge; HumanEval = deterministic checker).
tangletools
left a comment
✅ Auto-approved PR — 391cfffd
Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.
tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-14T13:17:34Z
tangletools
left a comment
🟠 Value Audit — better-approach-exists
| | |
|---|---|
| Verdict | better-approach-exists |
| Concerns | 5 (1 medium-concern, 4 weak-concern) |
| Heuristic | 0.0s |
| Duplication | 0.0s |
| Interrogation | 304.5s (2 bridge agents) |
| Total | 304.5s |
💰 Value — better-approach-exists
Adds a generalized sandbox-based AgentProfile-coordinate screener (skills/hooks/tools/prompt/subagents) and a smoke test for real SKILL.md materialization, but ships a redundant skills-sandbox.mts that the new generalized runner already replaces.
- What it does: Introduces `bench/src/profile-coordinates.ts`, a registry where each ‘coordinate’ is a named set of candidates plus a `compose(baseProfile, selected)` function that injects those candidates into one `AgentProfile` field while leaving the rest untouched (`profile-coordinates.ts:20-28`). `bench/src/profile-coord-sandbox.mts` is a one-runner-for-every-coordinate harness: it reads `COORDINATE=` from
- Goals it achieves: It lets the team measure the isolated lift of any one genome coordinate on the real coding-harness sandbox surface, holding every other field fixed, instead of pasting skill text into a system prompt. It also creates the compose/freeze/combine abstraction the PR describes as the basis for future coordinate stacking and recursive driver/worker/subagent optimization (`profile-coordinates.ts:1-14`).
- Assessment: The change is coherent and in the grain of the codebase. It reuses the existing `runExperiment`/`sandboxAgentRun` machinery (`experiment.ts:223-239`), follows the same profile-as-data pattern already used in `bench/src/search-bench/profiles.ts:57-91`, and adds an honest surface-check smoke test. The main limitation is that the runner currently injects the full bundle of coordinate candidates rathe
- Better / existing approach: The codebase does not already have a generalized AgentProfile-coordinate optimizer, so the abstraction itself is new. The materially better design is to drop or collapse `bench/src/skills-sandbox.mts`: it is a special case of the generalized runner (`COORDINATE=skills`) and is unreferenced anywhere in the repo (grep for `skills-sandbox` returns only its own file). It duplicates the same `runExperi
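The registry-plus-compose shape described in the bullets above can be sketched roughly as follows. This is a minimal illustration, not the code in `profile-coordinates.ts`: the `AgentProfile` shape here is a simplified stand-in, and the `ProfileCoordinate` fields, the candidate body, and the `REGISTRY` layout are assumptions made for the example.

```typescript
// Simplified, hypothetical stand-in for the real AgentProfile type.
interface AgentProfile {
  prompt?: string;
  tools?: string[];
  resources?: { skills?: Record<string, string> };
}

// One coordinate = a candidate pool plus a composer for exactly one profile field.
interface ProfileCoordinate {
  name: string;
  candidates: Record<string, string>;
  compose(base: AgentProfile, selected: string[]): AgentProfile;
}

// Hypothetical candidate pool: skill id -> SKILL.md body.
const skillCandidates: Record<string, string> = {
  "reproduce-first": "# Reproduce first\nWrite a failing test before editing.",
};

const skillsCoordinate: ProfileCoordinate = {
  name: "skills",
  candidates: skillCandidates,
  compose(base, selected) {
    // Freeze: nothing selected means the base profile is returned unchanged.
    if (selected.length === 0) return base;
    const picked: Record<string, string> = {};
    for (const id of selected) {
      const body = skillCandidates[id];
      if (body === undefined) throw new Error(`unknown skill: ${id}`);
      picked[id] = body;
    }
    // Only resources.skills is touched; every other field is carried over.
    return {
      ...base,
      resources: {
        ...base.resources,
        skills: { ...base.resources?.skills, ...picked },
      },
    };
  },
};

// Registry keyed by coordinate name (only one entry in this sketch).
const REGISTRY = { skills: skillsCoordinate };

const base: AgentProfile = { prompt: "You are a coding agent.", tools: ["bash"] };
const withSkills = REGISTRY.skills.compose(base, ["reproduce-first"]);
const frozen = REGISTRY.skills.compose(base, []);
```

In this sketch the "with" arm is `withSkills` and the "without" arm is `frozen`, which is the same object as `base`.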
🎯 Usefulness — sound-with-nits
A useful sandbox-surface A/B harness for AgentProfile coordinates that fixes the prior wrong-substrate skills test and composes cleanly into the existing bench runExperiment path; the skills-specific script is redundant with the generic runner and the output isn't wired into the flywheel yet.
- Integration: It integrates through the existing legacy bench harness: imports ADAPTERS from bench/src/adapters.ts:27 and runExperiment/sandboxAgentRun from bench/src/experiment.ts:299,206. It is reachable as a standalone tsx entrypoint (profile-coord-sandbox.mts, skill-sandbox-smoke.mts) — the same pattern as other bench diagnostics (HARNESS.md:124 lists standalone tools not in run.ts). AGENT=driver is intenti
- Fit with existing patterns: It fits the codebase's AgentProfile-as-genome model and the established pattern of composing profile fragments into sandbox runs (cf. bench/src/search-bench/profiles.ts:57 buildArmProfile spreads a partial AgentProfile into sandboxAgentRun). It does not compete with the canonical selfImprove/runImprovementLoop optimizers: those are outer-loop, held-out-gated optimizers (docs/canonical-api.md:38-41
- Real-world viability: It reuses runExperiment's existing concurrency, infra-retries, vacuity guard, and error exclusion (experiment.ts:388,418,406), so it handles realistic run noise. It only tests the whole coordinate (all candidates) versus the frozen base, so it will not tell you which individual candidate matters or detect interactions; combining coordinates requires chaining runs via PROFILE_JSON. It writes only s
💰 Value Audit
🟠 skills-sandbox.mts duplicates the new generalized runner [duplication]
`bench/src/skills-sandbox.mts` runs the exact same with/without skills comparison that `profile-coord-sandbox.mts` already handles via `COORDINATE=skills`. It re-implements loading `coding-skills/*.md` as inline resources (`skills-sandbox.mts:28-33,48-49`) and the same `runExperiment`/`sandboxAgentRun` loop (`skills-sandbox.mts:62-82`). The generalized registry already has a `skills` coordinate (`profile-coordinates.ts:120-126`) and `profile-coord-sandbox.mts:30,52,91-96` runs the paired expe
🟡 mcp coordinate is mentioned but not registered [maintenance]
The coordinate abstraction’s header lists mcp as one of the genome fields (`profile-coordinates.ts:3-5`), but the `REGISTRY` only registers `skills`, `hooks`, `tools`, `prompt`, and `subagents` (`profile-coordinates.ts:120-126`). If mcp is intentionally out of scope for this PR, the comment/registry should be tightened so future readers do not expect it.
🎯 Usefulness Audit
🟡 skills-sandbox.mts duplicates the generic runner's COORDINATE=skills case [integration]
profile-coord-sandbox.mts generalizes everything skills-sandbox.mts does (COORDINATE=skills + SKILLS_DIR=src/coding-skills). Having both in the same PR adds a second surface to maintain; drop skills-sandbox.mts and point the skills example at the generic runner (profile-coord-sandbox.mts:14).
🟡 No per-candidate or subset selection within a coordinate [ergonomics]
Each coordinate's compose() injects every candidate at once (profile-coordinates.ts:38,56,75,94,111). That gives a lever test for the field as a whole, but cannot identify which skill/hook/instruction actually drives lift. For an 'optimizer' this is a screening step only; a follow-up should add env-selected subsets or fold into the selfImprove/gepaDriver loop that already searches string/code surfaces.
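One way the suggested env-selected subsets could look is sketched below. The comma-separated selection string (for example from a hypothetical `SELECT=` variable) and both helper names are assumptions for illustration; nothing here exists in the PR.

```typescript
// Parse a comma-separated selection string (hypothetical SELECT= value).
// Undefined or blank input selects every candidate, matching the current
// all-at-once behavior described in the finding.
function selectCandidates(allIds: string[], raw: string | undefined): string[] {
  if (raw === undefined || raw.trim() === "") return [...allIds];
  const wanted = raw
    .split(",")
    .map((s) => s.trim())
    .filter((s) => s.length > 0);
  const unknown = wanted.filter((id) => !allIds.includes(id));
  if (unknown.length > 0) {
    throw new Error(`unknown candidates: ${unknown.join(", ")}`);
  }
  // Keep registry order and drop duplicates.
  return allIds.filter((id) => wanted.includes(id));
}

// Leave-one-out arms: each arm drops exactly one candidate, so comparing an
// arm against the full bundle hints at how much that candidate contributes.
function leaveOneOut(allIds: string[]): string[][] {
  return allIds.map((dropped) => allIds.filter((id) => id !== dropped));
}
```

Each array returned by `leaveOneOut` could be passed as the `selected` argument of a coordinate's `compose()`, giving one extra experiment arm per candidate.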
🟡 Results are not captured into the flywheel corpus [integration]
The runner prints a delta to stderr but does not pass a corpusPath to runExperiment, so the with/without arms are not persisted as RunRecords. That is fine for a quick manual diagnostic, but the next seam is to emit records so runImprovementLoop/selfImprove can consume them (cf. experiment.ts:354-376 and profiles.ts:9).
What this audit checks
It judges the change on its merits — not whether it was tasked out in an issue. Unticketed, fast-moving work is fine; the question is whether the change is good and whether a better or existing approach should be used instead.
| Pass | What it asks |
|---|---|
| Heuristic | Vague title? Whitespace-only or cruft-bearing diff? (content signals only) |
| Duplication | Do added function/class names already exist elsewhere in the repo? |
| Value Audit | What does it do? What goal does it achieve? Is it good? Better architecture or already-exists? |
| Usefulness Audit | Does it integrate and fit? Will it hold up in real use and actually get used? |
Findings are concerns, not blocks — the human reviewer decides what to do with them.
✅ No Blockers —
| | deepseek | glm | aggregate |
|---|---|---|---|
| Readiness | 83 | 76 | 76 |
| Confidence | 65 | 65 | 65 |
| Correctness | 83 | 76 | 76 |
| Security | 83 | 76 | 76 |
| Testing | 83 | 76 | 76 |
| Architecture | 83 | 76 | 76 |
Full multi-shot audit completed 1/1 planned shots over 14 changed files. Global verifier still owns final merge decision.
🟠 MEDIUM hooks and subagents coordinates use unverified AgentProfile field casts — bench/src/profile-coordinates.ts
Lines 57 and 103 cast `as AgentProfile['hooks']` and `as AgentProfile['subagents']`. The agent-runtime codebase explicitly states 'hooks are never part of AgentProfile' (src/runtime/types.ts:474, src/runtime-hooks.ts:5), though that refers to RuntimeHooks, not sandbox-SDK lifecycle hooks. The word 'subagents' appears nowhere in src/. Since @tangle-network/sandbox (^0.6.0) is not installed in this worktree, these field names cannot be verified against the actual type. If AgentProfile lacks these fields, compose() writes to a non-existent key that the sandbox backend silently ignores, making those two coordinates dead experiments that would show no delta and
🟡 LOW PROFILE_JSON env parsing has no validation — bench/src/profile-coord-sandbox.mts
Line 49: JSON.parse(process.env.PROFILE_JSON) as AgentProfile — a string that parses to a non-conforming object would silently produce a broken base profile, invalidating the experiment. The agentRun path through sandboxAgentRun spreads the profile into sandboxOverrides, so a malformed profile would only fail deep in the sandbox create call (cryptic error). Add a lightweight runtime check or at minimum validate that required AgentProfile fields are present before proceeding.
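A lightweight runtime check along the lines suggested here might look like the sketch below. The `parseBaseProfile` helper, the open-ended `AgentProfile` stand-in, and the empty-object default for an unset variable are all assumptions; which fields are actually required depends on the real sandbox SDK type, so the list is left to the caller.

```typescript
// Hypothetical open-ended stand-in; the real AgentProfile comes from the sandbox SDK.
interface AgentProfile {
  [field: string]: unknown;
}

// Parse the raw PROFILE_JSON string and fail early with a readable message.
// `requiredFields` is caller-supplied because this sketch does not know
// which AgentProfile fields are mandatory.
function parseBaseProfile(
  raw: string | undefined,
  requiredFields: string[] = [],
): AgentProfile {
  // Assumption: an unset or blank variable means "start from an empty base".
  if (raw === undefined || raw.trim() === "") return {};
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch (err) {
    throw new Error(`PROFILE_JSON is not valid JSON: ${(err as Error).message}`);
  }
  if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
    throw new Error("PROFILE_JSON must be a JSON object");
  }
  const profile = parsed as AgentProfile;
  const missing = requiredFields.filter((field) => !(field in profile));
  if (missing.length > 0) {
    throw new Error(`PROFILE_JSON is missing fields: ${missing.join(", ")}`);
  }
  return profile;
}
```

The runner would then call something like `parseBaseProfile(process.env.PROFILE_JSON, [...])` so a malformed profile fails before any sandbox is created.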
🟡 LOW No unit tests for profile-coordinates.ts compose functions — bench/src/profile-coordinates.ts
The ProfileCoordinate interface has 5 implementations with deterministic compose() pure functions. The project tests other bench modules (experiment.test.mts, steering-experiment.test.mts, selector.test.mts, refine-loop.test.mts). The compose functions are trivially testable: verify that empty selection returns base unchanged, that selected candidates are injected into the correct field, and that base fields outside the coordinate are preserved. Without tests, a future refactor could silently break the freeze invariant.
🟡 LOW no-print-debugging hook is a guaranteed no-op — bench/src/profile-coordinates.ts
Line 50: `'no-print-debugging': { event: 'PreToolUse', matcher: 'Edit|Write', command: 'true' }`. The command `true` always exits 0, so a PreToolUse hook never blocks the tool call. This candidate cannot produce any behavioral difference vs the base — any measured delta would be pure noise. Either implement actual content inspection (grep the diff for print/debug statements and exit non-zero on match) or remove the candidate.
🟡 LOW no-print-debugging hook is a no-op placeholder — bench/src/profile-coordinates.ts
Line 50: 'no-print-debugging': { event: 'PreToolUse', matcher: 'Edit|Write', command: 'true' }. The command 'true' always succeeds and does nothing — it does NOT prevent print debugging. If this is a planned but not-yet-implemented candidate, either implement it or remove it from the candidate pool so the optimizer doesn't select a vacuous option. Currently it wastes optimizer budget on a candidate that cannot have any effect.
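If the candidate is kept, the content inspection could start from something like the sketch below. It only covers the detection logic: how the hook receives the edited content, and which exit code the harness treats as "block", are not verified here and are assumptions. The pattern list is also illustrative and would flag legitimate `print(` calls.

```typescript
// Illustrative patterns for print-style debugging; tune per language under test.
const PRINT_DEBUG_PATTERNS: RegExp[] = [
  /\bconsole\.log\(/,
  /\bprint\(/,
  /\bdebugger\b/,
];

// Return every line of the proposed edit that matches a debug pattern.
function findPrintDebugging(content: string): string[] {
  return content
    .split("\n")
    .filter((line) => PRINT_DEBUG_PATTERNS.some((pattern) => pattern.test(line)));
}

// Map the findings to a process exit code: non-zero when anything matched.
// Assumption: the harness interprets a non-zero exit as "reject this edit".
function hookExitCode(content: string): number {
  return findPrintDebugging(content).length > 0 ? 1 : 0;
}
```

Unlike `command: 'true'`, a hook built on this can return a non-zero code, so the with/without arms can actually differ.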
🟡 LOW profile-coordinates.ts compose logic has zero unit tests — bench/src/profile-coordinates.ts
Five coordinate composers (skills, hooks, tools, prompt, subagents) with non-trivial merge/spread logic have no unit test coverage. A silent compose bug (e.g. wrong spread ordering in skillsCoordinate line 41, duplicate hook entries in hooksCoordinate line 62) would produce wrong AgentProfiles and silently invalidate all downstream experiments. Add tests for each coordinate's compose with empty/non-empty/base-merge inputs. The experiment.test.mts mock pattern (mockSandboxClient) shows the project's test conventions.
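The three properties worth testing (empty selection is a no-op, selected candidates land in the right field, other fields survive) can be checked with one generic helper, sketched below. The `Coordinate` interface and the toy `toolsCoordinate` are simplified assumptions, not the types in `profile-coordinates.ts`, and the JSON-based comparison is a shortcut that assumes plain serializable profiles.

```typescript
type Profile = Record<string, unknown>;

// Hypothetical minimal coordinate shape: the field it owns plus its composer.
interface Coordinate {
  field: string;
  compose(base: Profile, selected: string[]): Profile;
}

// Returns a list of violated invariants; an empty list means all checks passed.
function checkComposeInvariants(
  coord: Coordinate,
  base: Profile,
  selected: string[],
): string[] {
  const failures: string[] = [];
  const snapshot = JSON.stringify(base);

  // Freeze invariant: an empty selection leaves the profile unchanged.
  if (JSON.stringify(coord.compose(base, [])) !== snapshot) {
    failures.push("empty selection changed the profile");
  }

  const composed = coord.compose(base, selected);
  // Fields outside the coordinate must be preserved.
  for (const key of Object.keys(base)) {
    if (key === coord.field) continue;
    if (JSON.stringify(composed[key]) !== JSON.stringify(base[key])) {
      failures.push(`field outside coordinate changed: ${key}`);
    }
  }
  // A non-empty selection must change the coordinate's own field.
  if (
    selected.length > 0 &&
    JSON.stringify(composed[coord.field]) === JSON.stringify(base[coord.field])
  ) {
    failures.push("selected candidates were not injected");
  }
  // compose must not mutate its input.
  if (JSON.stringify(base) !== snapshot) {
    failures.push("compose mutated the base profile");
  }
  return failures;
}

// Toy coordinate used to exercise the helper.
const toolsCoordinate: Coordinate = {
  field: "tools",
  compose(base, selected) {
    if (selected.length === 0) return base;
    const existing = Array.isArray(base["tools"]) ? (base["tools"] as string[]) : [];
    return { ...base, tools: [...existing, ...selected] };
  },
};
```

A real test file would run `checkComposeInvariants` once per registered coordinate and assert that the returned list is empty.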
🟡 LOW smoke test regex verification is fragile — bench/src/skill-sandbox-smoke.mts
Line 72: const landed = /reproduce-first/i.test(out) && /SKILL.md/i.test(out). The smoke test declares PASS if the agent output mentions 'reproduce-first' and 'SKILL.md' anywhere in the stream. A false positive is possible if the sandbox base image ships a SKILL.md file or if the agent's own prompt text mentions these strings. The 'find / -name SKILL.md' command (line 52) could match SKILL.md files from the sandbox base image unrelated to the injected skill. Recommend checking for a specific path component or the skill's bo
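A stricter check could require the full materialized path rather than two loose substrings, as in the sketch below. The path layout follows the `~/.claude/skills/<id>/SKILL.md` form given in the PR description; the `skillLanded` helper and the sentinel string embedded in the skill are hypothetical additions for illustration.

```typescript
// Escape a string so it can be embedded literally in a RegExp.
function escapeRegExp(text: string): string {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// PASS only if the agent output contains the skill's own path AND a sentinel
// string that appears in the injected skill but not in the agent's prompt.
function skillLanded(out: string, skillId: string, sentinel: string): boolean {
  const pathPattern = new RegExp(
    `\\.claude/skills/${escapeRegExp(skillId)}/SKILL\\.md`,
  );
  return pathPattern.test(out) && out.includes(sentinel);
}
```

With this shape, an unrelated `SKILL.md` shipped in the base image no longer satisfies the check, because neither the skill-specific path nor the sentinel would appear.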
🟡 LOW Inconsistent module-path resolution between sibling files — bench/src/skills-sandbox.mts
skills-sandbox.mts uses `dirname(fileURLToPath(import.meta.url))` (line 47) while profile-coordinates.ts uses `import.meta.dirname` (line 30). Both resolve to the same directory but the inconsistency is a maintenance smell. The codebase already uses `import.meta.dirname` in corpus-report.mts and commit0-env-run.mts, so prefer that form.
tangletools · 2026-06-14T13:28:33Z · trace
Optimize any coordinate of the agent genome — skills, hooks, tools, prompt, subagents (mcp is the same shape) — on a real sandboxed harness worker (opencode/claude-code in a box), holding the rest of the profile fixed.
Why: the prior skills experiment ran on a router chat-loop and pasted skills into the system prompt: wrong substrate, wrong mechanism. This runs the worker as a sandbox (the rule: workers are sandboxes), with skills materialized to disk as real `SKILL.md` packages via `AgentProfile.resources.skills` (proven on disk by `skill-sandbox-smoke.mts`).
The abstraction: every genome field is a coordinate = `compose(profile, selected)` that injects into that field. Freeze = don't select it; combine = fold a winner into the base; recurse = a subagent is itself a profile. `AGENT=worker` is wired; `driver` is a marked seam (identical compose).
Surfaces: EOPS (`enterpriseops-gym`: blind-plan + deterministic local gym judge, a real middle band) and HumanEval (deterministic `--network=none` checker). The sandboxed worker + skills path is verified end-to-end.
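The combine and recurse steps can be illustrated with a small sketch. Everything in it is an assumption for the example: the simplified `AgentProfile` (including a `subagents` field, which the review notes is unverified against the real type), the two toy composers, and the helper names.

```typescript
// Simplified, hypothetical AgentProfile; the subagents field is an assumption.
interface AgentProfile {
  prompt?: string;
  tools?: string[];
  subagents?: Record<string, AgentProfile>;
}

type Compose = (base: AgentProfile, selected: string[]) => AgentProfile;

// Toy composers: each touches exactly one field and freezes on empty selection.
const composeTools: Compose = (base, selected) =>
  selected.length === 0
    ? base
    : { ...base, tools: [...(base.tools ?? []), ...selected] };

const composePrompt: Compose = (base, selected) =>
  selected.length === 0
    ? base
    : { ...base, prompt: [base.prompt ?? "", ...selected].join("\n").trim() };

// Combine: fold each winning selection into the base in order, so the next
// coordinate is screened against a base that already contains earlier winners.
function combine(
  base: AgentProfile,
  winners: Array<[Compose, string[]]>,
): AgentProfile {
  return winners.reduce<AgentProfile>(
    (profile, [compose, selected]) => compose(profile, selected),
    base,
  );
}

// Recurse: a subagent is itself a profile, so the same composer applies to it.
function composeSubagent(
  base: AgentProfile,
  name: string,
  compose: Compose,
  selected: string[],
): AgentProfile {
  const child = base.subagents?.[name] ?? {};
  return {
    ...base,
    subagents: { ...base.subagents, [name]: compose(child, selected) },
  };
}
```

The result of `combine` could be serialized with `JSON.stringify` and handed to the next run as `PROFILE_JSON`, which is how the review describes chaining coordinates.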