feat(compile): add hidden --use-samples flag for deterministic safe-outputs replay #37359
…utputs replay
Adds a hidden compile mode that replaces the agentic 'Execute coding agent'
step with a deterministic driver that replays declarative `samples` entries
through the real safe-outputs MCP server. Makes end-to-end tests deterministic
without invoking any LLM.
Frontmatter:
  safe-outputs:
    create-issue:
      samples:
        - title: "..."
          body: "..."
Each entry conforms to the MCP tool inputSchema; recognized sidecar keys
(`patch` for create-pull-request and push-to-pull-request-branch) are
stripped before validation and consumed by the replay driver for branch +
patch pre-staging.
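The sidecar handling described above can be sketched in Go roughly as follows. This is a minimal illustration, not the PR's code: `partitionSample` and the `sidecarKeys` table are hypothetical names, and the only sidecar key taken from the text is `patch` for the two pull-request tools.

```go
package main

import "fmt"

// sidecarKeys is a hypothetical lookup of recognized sidecar keys per tool.
// The commit message only names `patch` for these two tools.
var sidecarKeys = map[string][]string{
	"create_pull_request":         {"patch"},
	"push_to_pull_request_branch": {"patch"},
}

// partitionSample splits one sample into the arguments that go through schema
// validation and the sidecars kept for the replay driver. Both returned maps
// are fresh, so the caller's input is never mutated or aliased.
func partitionSample(tool string, sample map[string]any) (args, sidecars map[string]any) {
	args = make(map[string]any, len(sample))
	sidecars = map[string]any{}
	for k, v := range sample {
		args[k] = v
	}
	for _, key := range sidecarKeys[tool] {
		if v, ok := args[key]; ok {
			sidecars[key] = v
			delete(args, key)
		}
	}
	return args, sidecars
}

func main() {
	sample := map[string]any{"title": "t", "body": "b", "patch": "diff --git a/x b/x"}
	args, sidecars := partitionSample("create_pull_request", sample)
	// The input still holds all three keys; `patch` moved to the sidecars map.
	fmt.Println(len(args), len(sidecars), len(sample))
}
```

Copying into a fresh map even when no sidecar is present keeps the caller's sample safe from later mutation, at the cost of one small allocation per sample.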
Hidden surface:
- CLI flag `--use-samples` is hidden from `gh aw compile --help`
- JSON schema description marks `samples` as 'Internal hidden feature'
Implementation:
- Static JSON Schema validation against safe_outputs_tools.json at compile time
- Deterministic step ordering (sorted by SafeOutputsConfig struct field name)
- New driver actions/setup/js/apply_samples.cjs spawns the real MCP server
over stdio, sends one tools/call per sample, writes a synthetic
terminal_reason: completed marker so handle_agent_failure recognizes success
- Driver pre-stages git branches + patches for create_pull_request and
push_to_pull_request_branch samples so the real handler can derive a diff
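As a rough Go illustration of the ordering and "one tools/call per sample" points above: the sketch below flattens samples in a stable order and wraps each one in a JSON-RPC 2.0 `tools/call` request. `buildRequests` and the struct names are hypothetical, sorting by tool name is a simplification of the struct-field-name ordering the PR describes, and the real driver is JavaScript, not Go.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// rpcRequest mirrors the JSON-RPC 2.0 envelope of an MCP tools/call request.
type rpcRequest struct {
	JSONRPC string    `json:"jsonrpc"`
	ID      int       `json:"id"`
	Method  string    `json:"method"`
	Params  rpcParams `json:"params"`
}

type rpcParams struct {
	Name      string         `json:"name"`
	Arguments map[string]any `json:"arguments"`
}

// buildRequests (hypothetical) visits tools in sorted order so the emitted
// request sequence is identical on every run, despite Go's random map order.
func buildRequests(samples map[string][]map[string]any) []rpcRequest {
	tools := make([]string, 0, len(samples))
	for tool := range samples {
		tools = append(tools, tool)
	}
	sort.Strings(tools)

	reqs := []rpcRequest{}
	for _, tool := range tools {
		for _, args := range samples[tool] {
			reqs = append(reqs, rpcRequest{
				JSONRPC: "2.0",
				ID:      len(reqs) + 1,
				Method:  "tools/call",
				Params:  rpcParams{Name: tool, Arguments: args},
			})
		}
	}
	return reqs
}

func main() {
	// Tool names and arguments here are illustrative only.
	reqs := buildRequests(map[string][]map[string]any{
		"create_issue": {{"title": "t", "body": "b"}},
		"add_comment":  {{"body": "hello"}},
	})
	for _, r := range reqs {
		line, err := json.Marshal(r)
		if err != nil {
			panic(err)
		}
		fmt.Println(string(line)) // one newline-delimited message per sample
	}
}
```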
Tests:
- 5 unit tests covering validation, sidecar stripping, deterministic ordering,
sidecar partitioning
- 1 integration test verifying the agent step is replaced
- 2 vitest specs driving the real MCP server end-to-end
Pull request overview
This PR adds a hidden gh aw compile --use-samples mode that swaps the agentic execution step for a deterministic “safe-outputs samples replay” driver, enabling end-to-end tests to exercise the real safe-outputs MCP server without invoking an LLM.
Changes:
- Introduces `samples` entries on safe-outputs handlers, compile-time validation against embedded MCP tool schemas, and deterministic ordering/flattening into replay payloads.
- Adds compiler/CLI plumbing (`--use-samples`, `WorkflowData.UseSamples`) to replace the agent execution step with a replay step that runs `apply_samples.cjs`.
- Adds Go tests and Vitest specs to validate schema checking, ordering/sidecar handling, and the end-to-end replay driver behavior.
Show a summary per file
| File | Description |
|---|---|
| pkg/workflow/workflow_builder.go | Plumbs UseSamples into initial workflow data so generation can branch deterministically. |
| pkg/workflow/samples_validation.go | Adds per-tool JSON Schema compilation/cache and validates samples entries (with sidecar stripping). |
| pkg/workflow/samples_validation_test.go | Unit tests for samples schema validation, sidecar stripping, and ordering assumptions. |
| pkg/workflow/samples_replay.go | Flattens samples into replay entries and emits the replacement “Replay safe-outputs samples” workflow step. |
| pkg/workflow/samples_replay_test.go | Integration test ensuring --use-samples replaces the agentic step in the compiled lock file. |
| pkg/workflow/safe_outputs_config.go | Parses hidden samples frontmatter into BaseSafeOutputConfig.Samples (including sidecar-friendly normalization). |
| pkg/workflow/compiler_yaml_ai_execution.go | Switches engine execution generation to replay mode when UseSamples is set. |
| pkg/workflow/compiler_validators.go | Adds compile-time samples validation to the core validator pipeline. |
| pkg/workflow/compiler_types.go | Adds Compiler.useSamples, WorkflowData.UseSamples, and BaseSafeOutputConfig.Samples. |
| pkg/parser/schemas/main_workflow_schema.json | Exposes samples in the schema (documented as internal/hidden) for editor authoring/autocomplete. |
| pkg/cli/compile_config.go | Adds hidden UseSamples compile configuration flag. |
| pkg/cli/compile_compiler_setup.go | Wires UseSamples into compiler configuration (SetUseSamples(true)). |
| cmd/gh-aw/main.go | Adds hidden CLI flag --use-samples and passes it into compile config. |
| actions/setup/js/apply_samples.test.cjs | Vitest smoke coverage for the driver (real MCP server spawn + completed marker + empty-samples case). |
| actions/setup/js/apply_samples.cjs | Implements deterministic replay driver: spawns MCP server, sends JSON-RPC tools/call, stages patch sidecars, writes synthetic agent log. |
Copilot's findings
- Files reviewed: 15/15 changed files
- Comments generated: 6
b, _ := os.ReadFile(lockPath)
lockContent := string(b)
The deterministic samples replay driver emits synthetic safe-outputs purely to exercise downstream handlers in end-to-end tests. Running the LLM-backed threat-detection job against those fabricated payloads defeats determinism, costs tokens, and can spuriously flag the test fixtures.
When --use-samples is set, extractSafeOutputsConfig now nils out SafeOutputsConfig.ThreatDetection unconditionally, overriding both the implicit default and any explicit threat-detection: true. The override is logged.
Tests:
- new TestExtractSafeOutputsConfig_UseSamplesDisablesThreatDetection covers default mode (detection enabled), --use-samples + default (disabled), and --use-samples + explicit true (still disabled)
- TestUseSamplesReplacesAgentStep additionally asserts no detection: job appears in the compiled lock file
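A minimal Go sketch of the override described in this commit, under stated assumptions: the struct shapes and the helper name `disableThreatDetectionForSamples` are hypothetical stand-ins, and only the behavior (nil out `ThreatDetection` whenever use-samples mode is on) comes from the text.

```go
package main

import "fmt"

// Hypothetical, trimmed-down stand-ins for the compiler's config types.
type ThreatDetectionConfig struct {
	Enabled bool
}

type SafeOutputsConfig struct {
	ThreatDetection *ThreatDetectionConfig
}

// disableThreatDetectionForSamples clears threat detection when replay mode
// is active, regardless of whether it came from the default or from explicit
// frontmatter. It reports whether anything was overridden so the caller can
// log the change.
func disableThreatDetectionForSamples(cfg *SafeOutputsConfig, useSamples bool) bool {
	if !useSamples || cfg == nil || cfg.ThreatDetection == nil {
		return false
	}
	cfg.ThreatDetection = nil
	return true
}

func main() {
	cfg := &SafeOutputsConfig{ThreatDetection: &ThreatDetectionConfig{Enabled: true}}
	if disableThreatDetectionForSamples(cfg, true) {
		fmt.Println("threat detection disabled for samples replay")
	}
	fmt.Println(cfg.ThreatDetection == nil)
}
```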
@copilot review all comments and address unresolved review feedback.
Adds three vitest specs that drive the apply_samples driver's preStagePatch path against a real, throwaway git working tree:
1. create_pull_request with a 'patch' sidecar checks out the requested branch, applies the diff, and commits it; the resulting diff is visible via 'git diff main...<branch>', which is precisely what the downstream MCP create_pull_request handler reads when generating its bundle/patch payload.
2. push_to_pull_request_branch without an explicit 'branch' falls back to 'gh-aw-sample-<i+1>' and still applies the patch.
3. preStagePatch is a no-op when called with a tool that has no patch sidecar (defense in depth around the PATCH_SIDECAR_TOOLS gate in main()).
Together with the existing Go unit tests for sidecar partitioning and schema-stripping, this closes the testing gap around the patch-sidecar flow that was previously only covered structurally.
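The gating and branch-fallback rules exercised by these specs can be summarized in a short Go sketch. The driver itself is JavaScript, so this is only a restatement of the decision logic: `preStagePlan` and `patchSidecarTools` are hypothetical names, and the git checkout/apply/commit steps are deliberately left out.

```go
package main

import "fmt"

// patchSidecarTools is a hypothetical Go rendering of the PATCH_SIDECAR_TOOLS gate.
var patchSidecarTools = map[string]bool{
	"create_pull_request":         true,
	"push_to_pull_request_branch": true,
}

// preStagePlan decides whether a sample needs pre-staging and on which branch.
// index is the zero-based position of the sample, so the fallback branch for
// the first sample is gh-aw-sample-1.
func preStagePlan(tool string, index int, args, sidecars map[string]any) (branch string, stage bool) {
	patch, _ := sidecars["patch"].(string)
	if !patchSidecarTools[tool] || patch == "" {
		return "", false // no patch sidecar: nothing to stage
	}
	if b, ok := args["branch"].(string); ok && b != "" {
		return b, true
	}
	return fmt.Sprintf("gh-aw-sample-%d", index+1), true
}

func main() {
	branch, stage := preStagePlan(
		"push_to_pull_request_branch", 0,
		map[string]any{},
		map[string]any{"patch": "diff --git a/x b/x"},
	)
	fmt.Println(branch, stage)
}
```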
Compiles a workflow whose only safe-output is `create-pull-request` with a samples entry carrying a multi-line `patch:` block scalar, then inspects the generated lock.yml. Extracts the GH_AW_SAMPLES JSON literal block out of the compiled YAML and asserts:
- the agentic step is replaced by the replay step
- the entry tool is "create_pull_request"
- the patch is partitioned into sidecars, NOT arguments: the MCP create_pull_request handler must not receive a literal patch argument; it derives the diff from the working tree
- title/body/branch are preserved in arguments
- the patch payload (including the diff header and the added line) survives YAML emission verbatim so the driver can git-apply it
- no detection: job is emitted
This closes the loop from frontmatter -> compiled YAML for the patch-sidecar flow, complementing the existing Go unit tests (sidecar partitioning) and the vitest preStagePatch specs (which exercise the runtime side against a real git repo).
@copilot review all comments and address unresolved review feedback.
@copilot please summarize the remaining blockers and next step.
Observed in CI:
Error: apply_samples: GH_AW_SAMPLES must be a JSON array
at loadSamples (apply_samples.cjs:61:11)
Root cause: when a workflow opts into --use-samples but configures no
`samples:` entries (or only on disabled handlers), collectSampleEntries
returns a nil Go slice. json.Marshal(nil) produces the literal string
"null", which the driver rightly refuses to treat as an array.
Compiler fix (pkg/workflow/samples_replay.go): normalize a nil entries
slice to an empty []SampleEntry{} before marshaling so GH_AW_SAMPLES is
always emitted as a valid JSON array ("[]" in the empty case).
Driver defense (actions/setup/js/apply_samples.cjs): also tolerate a
literal JSON `null` payload and treat it as "no samples to replay",
so an older compiler against a newer driver doesn't crash either.
Tests:
- new Go integration test TestUseSamplesEmitsEmptyArrayWhenNoSamplesConfigured
compiles a workflow that uses --use-samples with safe-outputs but no
samples entries, then asserts GH_AW_SAMPLES is exactly "[]" (and
emphatically not "null")
- new vitest spec verifies the driver exits 0 on GH_AW_SAMPLES="null"
and logs "GH_AW_SAMPLES is null"
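The root cause and the compiler-side fix can be reproduced in a few lines of Go. This is a sketch under assumptions: the `SampleEntry` JSON field names and the helper `marshalSamples` are illustrative, while the nil-slice behavior of `encoding/json` is standard.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// SampleEntry is a trimmed-down stand-in; the JSON field names are assumptions.
type SampleEntry struct {
	Tool      string         `json:"tool"`
	Arguments map[string]any `json:"arguments"`
	Sidecars  map[string]any `json:"sidecars,omitempty"`
}

// marshalSamples normalizes a nil slice to an empty one before marshaling,
// so the result is always a JSON array.
func marshalSamples(entries []SampleEntry) (string, error) {
	if entries == nil {
		entries = []SampleEntry{}
	}
	b, err := json.Marshal(entries)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	var none []SampleEntry // nil slice, as returned when no samples are configured

	raw, err := json.Marshal(none)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(raw)) // null

	fixed, err := marshalSamples(none)
	if err != nil {
		panic(err)
	}
	fmt.Println(fixed) // []
}
```

Normalizing at the marshaling boundary keeps callers free to return nil slices, which is idiomatic Go, while the wire format stays a valid array.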
CI fixes:
- pkg/workflow/samples_replay.go: switch to strings.SplitSeq per the modernize linter (lint-go was failing)
- actions/setup/js/apply_samples.cjs: weaken the JSDoc type on sendJsonRpc's child parameter from ChildProcessWithoutNullStreams to ChildProcess so the value returned by spawn() with stdio: ["pipe", "pipe", "inherit"] (which has a null stderr) type-checks (js-typecheck was failing)
Review feedback (all Copilot inline comments):
- apply_samples.cjs: replace the /** @type {Error} */ casts on catch bindings with the shared getErrorMessage(err) helper so catch-unknown narrowing is actually safe under @ts-check
- samples_replay_test.go: stop swallowing the ReadFile error in the Use-Samples-Mode subtest; t.Fatalf on failure like the default-mode subtest does
- samples_validation.go: stripSidecarFields now always returns a fresh map, matching its doc comment (no more accidental aliasing of the caller's input when sidecars is empty)
- safe_outputs_config.go: drop the YAML-string branch of parseSamplesValue; the JSON schema for samples only allows array/object, so the string form would be rejected upstream before this code runs. Removes the now-unused yaml import.
The Copilot comment about collectSampleEntries emitting null was addressed in the prior commit (5194f4b), which normalizes nil to []SampleEntry{} before json.Marshal.
Please fix the failing lint-js check and summarize any remaining blockers.
…samples-hidden-flag
@copilot review all comments and address unresolved review feedback.
@copilot please refresh the branch, rerun checks, and summarize any remaining blockers after the rebase.
Addressed all review feedback in a7798979d5: @pelikhan: Copilot review items (already in earlier commits on this branch):
…core
When apply_samples.cjs spawns safe_outputs_mcp_server.cjs as a standalone Node child process, handlers like create_pull_request.cjs that reference core.info/warning/debug throw ReferenceError: core is not defined. The shim is idempotent (guarded by 'if (!global.core)'), so loading it unconditionally is safe when the module is required from a parent that already initialized it.
✅ smoke-ci: safeoutputs CLI comment + comment-memory run (27076500417)
feat(compile): add hidden `--use-samples` flag for deterministic safe-outputs replay
Summary
Introduces a hidden `gh aw compile --use-samples` flag that replaces the live agentic-execution step with a fully deterministic replay of pre-recorded MCP tool-call samples. When the flag is set, the compiler reads `samples` entries declared on safe-output handlers, validates them against MCP tool schemas, marshals them as `GH_AW_SAMPLES`, and emits a GitHub Actions step that drives a new Node.js replay driver (`apply_samples.cjs`) instead of spawning the real coding agent. Threat detection is force-disabled in this mode. The feature is intentionally hidden and intended for CI testing and deterministic safe-outputs validation.
What changed and why
CLI flag plumbing
| File | Change |
|---|---|
| `cmd/gh-aw/main.go` | Adds `--use-samples` bool flag to `compileCmd` |
| `pkg/cli/compile_config.go` | Adds `UseSamples bool` field to carry the flag through the pipeline |
| `pkg/cli/compile_compiler_setup.go` | Reads `config.UseSamples` and calls `compiler.SetUseSamples(true)` |
A single hidden boolean flows from the CLI surface down into the compiler without touching any public flags or existing compile paths.
Compiler and workflow data propagation
| File | Change |
|---|---|
| `pkg/workflow/workflow_compiler.go` | Adds `useSamples` field + `SetUseSamples(bool)` setter to `Compiler` |
| `pkg/workflow/workflow_builder.go` | Sets `UseSamples` when constructing `WorkflowData` |
| `pkg/workflow/workflow_data.go` | Adds `UseSamples bool` to `WorkflowData` |
| `pkg/workflow/workflow_yaml.go` | When `UseSamples` is true, replaces "Execute coding agent" with `buildUseSamplesStep()` under the same `agentic_execution` step ID |
The flag propagates without changing any existing code paths for normal compilation; the substitution happens at the final YAML-emission stage.
Samples replay step generation
`pkg/workflow/samples_replay.go` (new) defines:
- `SampleEntry`: carries `Tool`, `Arguments`, and `Sidecars`
- `collectSampleEntries`: uses reflection to walk safe-output handler configs and partition sidecar fields (e.g. `patch`) from normal MCP arguments
- `buildUseSamplesStep`: marshals entries to JSON, emits the GitHub Actions step that invokes `apply_samples.cjs` via Node, injecting `GH_AW_SAMPLES`, `GH_AW_AGENT_STDIO_LOG`, `GH_AW_SAFE_OUTPUTS_CONFIG_PATH`, and `GH_AW_SAFE_OUTPUTS`
Safe-outputs compilation changes
`pkg/workflow/safe_outputs.go` (modified):
- Calls `validateSafeOutputsSamples` to validate declared samples against MCP tool schemas before emitting YAML
- Sets `cfg.ThreatDetection = nil` when `useSamples` is true, so prompt-injection guards do not fire against static replay payloads
Schema-based sample validation
`pkg/workflow/samples_validation.go` (new) provides:
- `validateSafeOutputsSamples`: iterates sorted handler names for deterministic ordering, strips sidecar fields before validation, and validates each sample entry against the per-tool JSON schema
JSON schema additions
`pkg/parser/schemas/main_workflow_schema.json` (modified):
- Adds a `samples` field (`oneOf`: array-of-objects | free-form object, `additionalProperties: true`) to every safe-output handler schema, covering `create-issue`, `create-pull-request`, `push-to-pull-request-branch`, `add-comment`, `update-pull-request`, `close-pull-request`, `merge-pull-request`, `create-branch`, `delete-branch`, `add-label`, `remove-label`, and others
Node.js replay driver
`actions/setup/js/apply_samples.cjs` (new):
- Reads `GH_AW_SAMPLES` (JSON array of `SampleEntry`)
- Spawns `safe_outputs_mcp_server.cjs` as a child process over stdio
- Sends one `tools/call` per entry and awaits each response
- Pre-stages patch sidecars (`preStagePatch`) for `create_pull_request` and `push_to_pull_request_branch` before the MCP call
- Writes a synthetic `agent-stdio.log` so downstream log-parsing steps see a valid log file
`actions/setup/js/safe_outputs_mcp_server.cjs` (modified):
- Adds `require('./shim.cjs')` at the top so `core.*` calls continue to work when the server is spawned as a standalone child process by `apply_samples.cjs`, outside the normal `github-script` runtime
Tests added
| File | Coverage |
|---|---|
| `actions/setup/js/apply_samples.test.cjs` | … `create_issue`; empty/null `GH_AW_SAMPLES`; `preStagePatch` for `create_pull_request` and `push_to_pull_request_branch`; no-op when no patch sidecar |
| `pkg/workflow/samples_replay_test.go` | `SetUseSamples(true)` replaces agentic step; create-PR + patch sidecar flow; nil-slice marshalling guards against `"null"` in `GH_AW_SAMPLES` |
| `pkg/workflow/samples_threat_detection_test.go` | Threat detection is disabled in `use-samples` mode regardless of frontmatter |
| `pkg/workflow/samples_validation_test.go` | … `Sidecars` vs `Arguments` |
Breaking changes
None. The `--use-samples` flag is hidden and off by default; all existing compile paths are unaffected.
Key design decisions
- Reused step ID (`agentic_execution`): the replay step reuses the existing step ID so downstream log-parsing and job-summary logic requires no changes.
- Sidecar fields such as `patch` are not valid MCP tool arguments; they are stripped before schema validation and carried separately as `Sidecars` to `apply_samples.cjs`, which pre-stages them as git patches before the MCP call.
- `shim.cjs` injection: rather than refactoring the MCP server to eliminate `core.*` calls, a one-line `require('./shim.cjs')` at the top of the file restores the `global.core` object when the server runs standalone, keeping the server's own logic unchanged.