
feat(bench): generalized AgentProfile-coordinate optimizer on the sandbox surface #293

Merged: drewstone merged 1 commit into main from feat/profile-coordinate-sandbox on Jun 14, 2026

@drewstone

Contributor

Optimize any coordinate of the agent genome — skills, hooks, tools, prompt, subagents (mcp is the same shape) — on a real sandboxed harness worker (opencode/claude-code in a box), holding the rest of the profile fixed.

Why: the prior skills experiment ran on a router chat-loop and pasted skills into the system prompt — wrong substrate, wrong mechanism. This runs the worker as a sandbox (the rule: workers are sandboxes), with skills materialized to disk as real SKILL.md packages via AgentProfile.resources.skills (proven on disk by skill-sandbox-smoke.mts).

The abstraction: every genome field is a coordinate = compose(profile, selected) that injects into that field. Freeze = don't select it; combine = fold a winner into the base; recurse = a subagent is itself a profile. AGENT=worker is wired; driver is a marked seam (identical compose).
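The compose/freeze/combine idea can be sketched as follows. This is a minimal illustration, not the code in profile-coordinates.ts: `Profile`, `Coordinate`, `skillsCoordinate`, and `combine` are hypothetical stand-ins, and the real `AgentProfile` type has a different and larger shape.

```typescript
// Hypothetical stand-in for AgentProfile; only the fields needed for the sketch.
interface Profile {
  prompt?: string;
  skills?: string[];
  tools?: string[];
}

// A coordinate owns exactly one profile field and knows how to inject selections into it.
interface Coordinate<K extends keyof Profile> {
  field: K;
  compose(base: Profile, selected: string[]): Profile;
}

const skillsCoordinate: Coordinate<"skills"> = {
  field: "skills",
  compose(base, selected) {
    // Freeze: nothing selected, so the base profile passes through untouched.
    if (selected.length === 0) return base;
    // Inject into the skills field only; every other field is copied as-is.
    return { ...base, skills: [...(base.skills ?? []), ...selected] };
  },
};

// Combine: fold a winning selection into the base so the next coordinate builds on it.
function combine<K extends keyof Profile>(
  base: Profile,
  coord: Coordinate<K>,
  winner: string[],
): Profile {
  return coord.compose(base, winner);
}

const base: Profile = { prompt: "fix the failing test", tools: ["bash"] };
const withSkill = combine(base, skillsCoordinate, ["reproduce-first"]);
console.log(withSkill);
```

Because a subagent is itself a profile, the same `compose` could be applied to a nested profile to recurse; that part is not shown here.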

Surfaces: EOPS (enterpriseops-gym — blind-plan + deterministic local gym judge, a real middle band) and HumanEval (deterministic --network=none checker). The sandboxed worker + skills path is verified end-to-end.

…dbox surface

Optimize ANY genome coordinate (skills/hooks/tools/prompt/subagents — mcp same shape)
of a real sandboxed harness worker, holding the rest of the profile fixed. Each
coordinate is a compose(profile, selected) that injects into its own AgentProfile
field; freeze = don't select it, combine = fold a winner into the base.

- profile-coordinates.ts: the coordinate registry (one composer per genome field).
- profile-coord-sandbox.mts: COORDINATE= runner, with-vs-without on the sandboxed
  worker, deterministic-judge bench. AGENT=worker wired; driver is a marked seam
  (same compose — a driver/worker/subagent are all AgentProfiles).
- skill-sandbox-smoke.mts: proves a SKILL.md materializes to disk in the box
  (resources.skills → ~/.claude/skills/<id>/SKILL.md, verified by the in-box agent).
- coding-skills/ + eops-skills/: real agent-under-test skills (not prompt text).

Runs the skills lever on the sandbox surface (EOPS = banded agentic judge; HumanEval
= deterministic checker).
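A with-vs-without comparison of this kind reduces to a difference between two arms scored by the same judge. The sketch below assumes a simple pass/fail score per task; `ArmResult`, `passRate`, and `lift` are hypothetical names, and the banded EOPS judge may well score on a finer scale than this.

```typescript
// Hypothetical per-task result for one arm of the experiment.
type ArmResult = { taskId: string; passed: boolean };

// Fraction of tasks that passed; an empty arm scores 0 rather than NaN.
function passRate(results: ArmResult[]): number {
  if (results.length === 0) return 0;
  return results.filter((r) => r.passed).length / results.length;
}

// Lift of the coordinate: pass rate with it selected minus pass rate with it frozen.
function lift(withArm: ArmResult[], withoutArm: ArmResult[]): number {
  return passRate(withArm) - passRate(withoutArm);
}

const withoutArm: ArmResult[] = [
  { taskId: "t1", passed: true },
  { taskId: "t2", passed: false },
  { taskId: "t3", passed: false },
  { taskId: "t4", passed: false },
];
const withArm: ArmResult[] = [
  { taskId: "t1", passed: true },
  { taskId: "t2", passed: true },
  { taskId: "t3", passed: false },
  { taskId: "t4", passed: false },
];
console.log(lift(withArm, withoutArm));
```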

@tangletools tangletools left a comment

Contributor


✅ Auto-approved PR — 391cfffd

Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-14T13:17:34Z

@tangletools tangletools left a comment

Contributor


🟠 Value Audit — better-approach-exists

Verdict: better-approach-exists
Concerns: 5 (1 medium-concern, 4 weak-concern)
Heuristic: 0.0s
Duplication: 0.0s
Interrogation: 304.5s (2 bridge agents)
Total: 304.5s

💰 Value — better-approach-exists

Adds a generalized sandbox-based AgentProfile-coordinate screener (skills/hooks/tools/prompt/subagents) and a smoke test for real SKILL.md materialization, but ships a redundant skills-sandbox.mts that the new generalized runner already replaces.

  • What it does: Introduces bench/src/profile-coordinates.ts — a registry where each ‘coordinate’ is a named set of candidates plus a compose(baseProfile, selected) function that injects those candidates into one AgentProfile field while leaving the rest untouched (profile-coordinates.ts:20-28). bench/src/profile-coord-sandbox.mts is a one-runner-for-every-coordinate harness: it reads COORDINATE= from
  • Goals it achieves: It lets the team measure the isolated lift of any one genome coordinate on the real coding-harness sandbox surface, holding every other field fixed, instead of pasting skill text into a system prompt. It also creates the compose/freeze/combine abstraction the PR describes as the basis for future coordinate stacking and recursive driver/worker/subagent optimization (profile-coordinates.ts:1-14).
  • Assessment: The change is coherent and in the grain of the codebase. It reuses the existing runExperiment/sandboxAgentRun machinery (experiment.ts:223-239), follows the same profile-as-data pattern already used in bench/src/search-bench/profiles.ts:57-91, and adds an honest surface-check smoke test. The main limitation is that the runner currently injects the full bundle of coordinate candidates rathe
  • Better / existing approach: The codebase does not already have a generalized AgentProfile-coordinate optimizer, so the abstraction itself is new. The materially better design is to drop or collapse bench/src/skills-sandbox.mts: it is a special case of the generalized runner (COORDINATE=skills) and is unreferenced anywhere in the repo (grep for skills-sandbox returns only its own file). It duplicates the same `runExperi

🎯 Usefulness — sound-with-nits

A useful sandbox-surface A/B harness for AgentProfile coordinates that fixes the prior wrong-substrate skills test and composes cleanly into the existing bench runExperiment path; the skills-specific script is redundant with the generic runner and the output isn't wired into the flywheel yet.

  • Integration: It integrates through the existing legacy bench harness: imports ADAPTERS from bench/src/adapters.ts:27 and runExperiment/sandboxAgentRun from bench/src/experiment.ts:299,206. It is reachable as a standalone tsx entrypoint (profile-coord-sandbox.mts, skill-sandbox-smoke.mts) — the same pattern as other bench diagnostics (HARNESS.md:124 lists standalone tools not in run.ts). AGENT=driver is intenti
  • Fit with existing patterns: It fits the codebase's AgentProfile-as-genome model and the established pattern of composing profile fragments into sandbox runs (cf. bench/src/search-bench/profiles.ts:57 buildArmProfile spreads a partial AgentProfile into sandboxAgentRun). It does not compete with the canonical selfImprove/runImprovementLoop optimizers: those are outer-loop, held-out-gated optimizers (docs/canonical-api.md:38-41
  • Real-world viability: It reuses runExperiment's existing concurrency, infra-retries, vacuity guard, and error exclusion (experiment.ts:388,418,406), so it handles realistic run noise. It only tests the whole coordinate (all candidates) versus the frozen base, so it will not tell you which individual candidate matters or detect interactions; combining coordinates requires chaining runs via PROFILE_JSON. It writes only s

💰 Value Audit

🟠 skills-sandbox.mts duplicates the new generalized runner [duplication]

bench/src/skills-sandbox.mts runs the exact same with/without skills comparison that profile-coord-sandbox.mts already handles via COORDINATE=skills. It re-implements loading coding-skills/*.md as inline resources (skills-sandbox.mts:28-33, 48-49) and the same runExperiment/sandboxAgentRun loop (skills-sandbox.mts:62-82). The generalized registry already has a skills coordinate (profile-coordinates.ts:120-126) and profile-coord-sandbox.mts:30,52,91-96 runs the paired expe

🟡 mcp coordinate is mentioned but not registered [maintenance]

The coordinate abstraction’s header lists mcp as one of the genome fields (profile-coordinates.ts:3-5), but the REGISTRY only registers skills, hooks, tools, prompt, and subagents (profile-coordinates.ts:120-126). If mcp is intentionally out of scope for this PR, the comment/registry should be tightened so future readers do not expect it.
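One way to keep the header comment and the registry from drifting is a small check that lists documented fields with no registered coordinate. This is a sketch under assumptions: the two arrays below are hand-copied from what the finding reports, not read from profile-coordinates.ts, and `unregistered` is a hypothetical helper.

```typescript
// Coordinates the finding says are registered (hand-copied, hypothetical constant).
const REGISTERED = ["skills", "hooks", "tools", "prompt", "subagents"] as const;

// Fields the header comment is reported to list, including mcp.
const DOCUMENTED = ["skills", "hooks", "tools", "prompt", "subagents", "mcp"];

// Return documented fields that have no registered coordinate.
function unregistered(documented: string[], registered: readonly string[]): string[] {
  return documented.filter((field) => !registered.includes(field));
}

const missing = unregistered(DOCUMENTED, REGISTERED);
console.log(missing);
```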

🎯 Usefulness Audit

🟡 skills-sandbox.mts duplicates the generic runner's COORDINATE=skills case [integration]

profile-coord-sandbox.mts generalizes everything skills-sandbox.mts does (COORDINATE=skills + SKILLS_DIR=src/coding-skills). Having both in the same PR adds a second surface to maintain; drop skills-sandbox.mts and point the skills example at the generic runner (profile-coord-sandbox.mts:14).

🟡 No per-candidate or subset selection within a coordinate [ergonomics]

Each coordinate's compose() injects every candidate at once (profile-coordinates.ts:38,56,75,94,111). That gives a lever test for the field as a whole, but cannot identify which skill/hook/instruction actually drives lift. For an 'optimizer' this is a screening step only; a follow-up should add env-selected subsets or fold into the selfImprove/gepaDriver loop that already searches string/code surfaces.
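The suggested follow-up (env-selected subsets) could look roughly like this. It is a sketch, not the runner's behavior: `selectCandidates` and `leaveOneOut` are hypothetical helpers, the comma-separated format is an assumption, and apart from `reproduce-first` the candidate ids are invented for illustration.

```typescript
// Resolve a comma-separated selection against the ids a coordinate offers.
// An undefined or blank selection means "inject every candidate", matching today's behavior.
function selectCandidates(all: string[], raw: string | undefined): string[] {
  if (raw === undefined || raw.trim() === "") return all;
  const wanted = raw
    .split(",")
    .map((s) => s.trim())
    .filter((s) => s.length > 0);
  const unknown = wanted.filter((id) => !all.includes(id));
  if (unknown.length > 0) {
    // Fail loudly: a typo would otherwise silently shrink the arm.
    throw new Error(`unknown candidate(s): ${unknown.join(", ")}`);
  }
  // Keep registry order so the composed profile is stable across runs.
  return all.filter((id) => wanted.includes(id));
}

// Leave-one-out arms: one subset per candidate, each with that candidate removed.
function leaveOneOut(all: string[]): string[][] {
  return all.map((dropped) => all.filter((id) => id !== dropped));
}

// "small-diffs" and "read-tests" are hypothetical ids.
const candidates = ["reproduce-first", "small-diffs", "read-tests"];
console.log(selectCandidates(candidates, "read-tests, reproduce-first"));
console.log(leaveOneOut(candidates));
```

In a runner the `raw` argument would come from an environment variable, for example a hypothetical `SELECT=`. Leave-one-out arms attribute lift to individual candidates at the cost of one extra run per candidate; they still do not detect interactions between candidates.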

🟡 Results are not captured into the flywheel corpus [integration]

The runner prints a delta to stderr but does not pass a corpusPath to runExperiment, so the with/without arms are not persisted as RunRecords. That is fine for a quick manual diagnostic, but the next seam is to emit records so runImprovementLoop/selfImprove can consume them (cf. experiment.ts:354-376 and profiles.ts:9).


What this audit checks

It judges the change on its merits — not whether it was tasked out in an issue. Unticketed, fast-moving work is fine; the question is whether the change is good and whether a better or existing approach should be used instead.

| Pass | What it asks |
| --- | --- |
| Heuristic | Vague title? Whitespace-only or cruft-bearing diff? (content signals only) |
| Duplication | Do added function/class names already exist elsewhere in the repo? |
| Value Audit | What does it do? What goal does it achieve? Is it good? Better architecture or already-exists? |
| Usefulness Audit | Does it integrate and fit? Will it hold up in real use and actually get used? |

Findings are concerns, not blocks — the human reviewer decides what to do with them.

value-audit · 20260614T132554Z

@tangletools

Contributor

✅ No Blockers — 391cfffd

Readiness 76/100 · Confidence 65/100 · 8 findings (1 medium, 7 low)

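| Metric | deepseek | glm | aggregate |
| --- | --- | --- | --- |
| Readiness | 83 | 76 | 76 |
| Confidence | 65 | 65 | 65 |
| Correctness | 83 | 76 | 76 |
| Security | 83 | 76 | 76 |
| Testing | 83 | 76 | 76 |
| Architecture | 83 | 76 | 76 |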

Full multi-shot audit completed 1/1 planned shots over 14 changed files. Global verifier still owns final merge decision.

🟠 MEDIUM hooks and subagents coordinates use unverified AgentProfile field casts — bench/src/profile-coordinates.ts

Lines 57 and 103 cast as AgentProfile['hooks'] and as AgentProfile['subagents']. The agent-runtime codebase explicitly states 'hooks are never part of AgentProfile' (src/runtime/types.ts:474, src/runtime-hooks.ts:5) — though that refers to RuntimeHooks, not sandbox-SDK lifecycle hooks. The word 'subagents' appears nowhere in src/. Since @tangle-network/sandbox (^0.6.0) is not installed in this worktree, these field names cannot be verified against the actual type. If AgentProfile lacks these fields, compose() writes to a non-existent key that the sandbox backend silently ignores, making those two coordinates dead experiments that would show no delta and

🟡 LOW PROFILE_JSON env parsing has no validation — bench/src/profile-coord-sandbox.mts

Line 49: JSON.parse(process.env.PROFILE_JSON) as AgentProfile — a string that parses to a non-conforming object would silently produce a broken base profile, invalidating the experiment. The agentRun path through sandboxAgentRun spreads the profile into sandboxOverrides, so a malformed profile would only fail deep in the sandbox create call (cryptic error). Add a lightweight runtime check or at minimum validate that required AgentProfile fields are present before proceeding.
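A lightweight runtime check along the lines suggested might look like this. It is a sketch only: `parseProfileJson` is a hypothetical helper, and which fields are actually required is an assumption, since the real `AgentProfile` type is not shown in this thread.

```typescript
// Parse a base profile from a JSON string and reject values that cannot be a profile.
function parseProfileJson(raw: string): Record<string, unknown> {
  let value: unknown;
  try {
    value = JSON.parse(raw);
  } catch (err) {
    // Surface the parse failure here instead of deep inside the sandbox create call.
    throw new Error(`PROFILE_JSON is not valid JSON: ${(err as Error).message}`);
  }
  if (typeof value !== "object" || value === null || Array.isArray(value)) {
    throw new Error("PROFILE_JSON must be a JSON object");
  }
  const profile = value as Record<string, unknown>;
  // Assumed constraint for illustration: a prompt, when present, is a string.
  if ("prompt" in profile && typeof profile.prompt !== "string") {
    throw new Error("PROFILE_JSON.prompt must be a string");
  }
  return profile;
}

console.log(parseProfileJson('{"prompt":"fix the failing test"}'));
```

The runner would call this on the value of `process.env.PROFILE_JSON` before composing, so a malformed base profile stops the experiment up front rather than invalidating it silently.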

🟡 LOW No unit tests for profile-coordinates.ts compose functions — bench/src/profile-coordinates.ts

The ProfileCoordinate interface has 5 implementations with deterministic compose() pure functions. The project tests other bench modules (experiment.test.mts, steering-experiment.test.mts, selector.test.mts, refine-loop.test.mts). The compose functions are trivially testable: verify that empty selection returns base unchanged, that selected candidates are injected into the correct field, and that base fields outside the coordinate are preserved. Without tests, a future refactor could silently break the freeze invariant.
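The three cases named above can be sketched against a stand-in composer. `Profile` and `toolsCompose` here are hypothetical; real tests would import the composers from profile-coordinates.ts and use the project's own test runner instead of the inline `check` helper.

```typescript
// Hypothetical stand-ins for AgentProfile and one coordinate's compose().
type Profile = { prompt?: string; tools?: string[] };

const toolsCompose = (base: Profile, selected: string[]): Profile =>
  selected.length === 0
    ? base
    : { ...base, tools: [...(base.tools ?? []), ...selected] };

// Minimal assertion helper so the sketch has no test-framework dependency.
function check(name: string, ok: boolean): void {
  if (!ok) throw new Error(`FAIL: ${name}`);
  console.log(`ok: ${name}`);
}

const base: Profile = { prompt: "p", tools: ["bash"] };
const out = toolsCompose(base, ["grep"]);

// Freeze invariant: an empty selection must return the base unchanged.
check("empty selection returns base", toolsCompose(base, []) === base);
// Injection: selected candidates land in the coordinate's own field.
check(
  "selection lands in tools",
  JSON.stringify(out.tools) === JSON.stringify(["bash", "grep"]),
);
// Isolation: fields outside the coordinate are preserved.
check("prompt is preserved", out.prompt === base.prompt);
// Purity: compose must not mutate its input.
check("base is not mutated", JSON.stringify(base.tools) === JSON.stringify(["bash"]));
```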

🟡 LOW no-print-debugging hook is a guaranteed no-op — bench/src/profile-coordinates.ts

Line 50: 'no-print-debugging': { event: 'PreToolUse', matcher: 'Edit|Write', command: 'true' }. The command true always exits 0, so a PreToolUse hook never blocks the tool call. This candidate cannot produce any behavioral difference vs the base — any measured delta would be pure noise. Either implement actual content inspection (grep the diff for print/debug statements and exit non-zero on match) or remove the candidate.
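The content inspection suggested here could be sketched as a pure function plus an exit-code mapping. Everything below is illustrative: the patterns are not exhaustive, `addsPrintDebugging` and `hookExitCode` are hypothetical names, and both the way a hook receives the edited text and the exit status the harness treats as blocking are assumptions to confirm against its hook contract.

```typescript
// Patterns that suggest print-style debugging is being added (illustrative only).
// None use the global flag, so RegExp.test stays stateless across calls.
const DEBUG_PATTERNS: RegExp[] = [
  /\bconsole\.(log|debug)\s*\(/, // JavaScript / TypeScript
  /^\s*print\s*\(/m,             // a bare print( call at the start of a line
  /\bdebugger\b/,                // debugger statement
];

// True when the text about to be written matches any debugging pattern.
function addsPrintDebugging(newText: string): boolean {
  return DEBUG_PATTERNS.some((re) => re.test(newText));
}

// Map the verdict to a process exit status: 0 allows the tool call,
// non-zero rejects it. The value 2 is an assumption, not a documented contract here.
function hookExitCode(newText: string): number {
  return addsPrintDebugging(newText) ? 2 : 0;
}

console.log(hookExitCode('console.log("here");'));
console.log(hookExitCode("const total = a + b;"));
```

With a check like this behind the hook command, the candidate can change behavior, so a measured delta against the base would no longer be noise by construction.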

🟡 LOW no-print-debugging hook is a no-op placeholder — bench/src/profile-coordinates.ts

Line 50: 'no-print-debugging': { event: 'PreToolUse', matcher: 'Edit|Write', command: 'true' }. The command 'true' always succeeds and does nothing — it does NOT prevent print debugging. If this is a planned but not-yet-implemented candidate, either implement it or remove it from the candidate pool so the optimizer doesn't select a vacuous option. Currently it wastes optimizer budget on a candidate that cannot have any effect.

🟡 LOW profile-coordinates.ts compose logic has zero unit tests — bench/src/profile-coordinates.ts

Five coordinate composers (skills, hooks, tools, prompt, subagents) with non-trivial merge/spread logic have no unit test coverage. A silent compose bug (e.g. wrong spread ordering in skillsCoordinate line 41, duplicate hook entries in hooksCoordinate line 62) would produce wrong AgentProfiles and silently invalidate all downstream experiments. Add tests for each coordinate's compose with empty/non-empty/base-merge inputs. The experiment.test.mts mock pattern (mockSandboxClient) shows the project's test conventions.

🟡 LOW smoke test regex verification is fragile — bench/src/skill-sandbox-smoke.mts

Line 72: const landed = /reproduce-first/i.test(out) && /SKILL.md/i.test(out). The smoke test declares PASS if the agent output mentions 'reproduce-first' and 'SKILL.md' anywhere in the stream. A false positive is possible if the sandbox base image ships a SKILL.md file or if the agent's own prompt text mentions these strings. The 'find / -name SKILL.md' command (line 52) could match SKILL.md files from the sandbox base image unrelated to the injected skill. Recommend checking for a specific path component or the skill's bo
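A stricter check could require the injected skill's own path rather than two loose substrings. The sketch below assumes the layout quoted in the commit message (`~/.claude/skills/<id>/SKILL.md`); `escapeRegExp` and `skillLanded` are hypothetical helpers, not the smoke test's code.

```typescript
// Escape a string for literal use inside a RegExp.
function escapeRegExp(s: string): string {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// True only when the output contains ".claude/skills/<id>/SKILL.md" for this skill,
// so an unrelated SKILL.md in the base image no longer counts as a pass.
function skillLanded(output: string, skillId: string): boolean {
  const path = new RegExp(`\\.claude/skills/${escapeRegExp(skillId)}/SKILL\\.md`);
  return path.test(output);
}

const good = "/root/.claude/skills/reproduce-first/SKILL.md";
const falsePositive = "found /opt/base/SKILL.md; prompt mentions reproduce-first";
console.log(skillLanded(good, "reproduce-first"));
console.log(skillLanded(falsePositive, "reproduce-first"));
```

A path match still relies on the agent's output stream, so a prompt that echoes the full path could pass it; the check narrows the false-positive window rather than closing it.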

🟡 LOW Inconsistent module-path resolution between sibling files — bench/src/skills-sandbox.mts

skills-sandbox.mts uses dirname(fileURLToPath(import.meta.url)) (line 47) while profile-coordinates.ts uses import.meta.dirname (line 30). Both resolve to the same directory but the inconsistency is a maintenance smell. The codebase already uses import.meta.dirname in corpus-report.mts and commit0-env-run.mts, so prefer that form.


tangletools · 2026-06-14T13:28:33Z · trace

@drewstone drewstone merged commit a89db22 into main Jun 14, 2026
1 check passed