feat(compile): add hidden `--use-samples` flag for deterministic safe-outputs replay by dsyme · Pull Request #37359 · github/gh-aw

dsyme · 2026-06-06T16:32:09Z

Summary

Introduces a hidden gh aw compile --use-samples flag that replaces the live agentic-execution step with a fully deterministic replay of pre-recorded MCP tool-call samples. When the flag is set, the compiler reads samples entries declared on safe-output handlers, validates them against MCP tool schemas, marshals them as GH_AW_SAMPLES, and emits a GitHub Actions step that drives a new Node.js replay driver (apply_samples.cjs) instead of spawning the real coding agent. Threat detection is force-disabled in this mode. The feature is intentionally hidden and intended for CI testing and deterministic safe-outputs validation.

What changed and why

CLI flag plumbing

File	Change
`cmd/gh-aw/main.go`	Added hidden `--use-samples` bool flag to `compileCmd`
`pkg/cli/compile_config.go`	Added `UseSamples bool` field to carry the flag through the pipeline
`pkg/cli/compile_compiler_setup.go`	Reads `config.UseSamples` and calls `compiler.SetUseSamples(true)`

A single hidden boolean flows from the CLI surface down into the compiler without touching any public flags or existing compile paths.
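Here is a minimal Go sketch of that plumbing. `CompileConfig`, `Compiler`, `SetUseSamples` and `setupCompiler` echo the names in the table, but their shapes are assumptions made for illustration. The sketch also leaves out registering the hidden flag on the CLI command.

```go
package main

// CompileConfig is a hypothetical stand-in for the compile configuration
// struct in pkg/cli; only the field relevant to this flag is shown.
type CompileConfig struct {
	UseSamples bool // set from the hidden --use-samples CLI flag
}

// Compiler is a hypothetical, trimmed-down compiler type.
type Compiler struct {
	useSamples bool
}

// SetUseSamples toggles samples-replay mode on the compiler.
func (c *Compiler) SetUseSamples(v bool) { c.useSamples = v }

// UseSamples reports whether samples-replay mode is enabled.
func (c *Compiler) UseSamples() bool { return c.useSamples }

// setupCompiler builds a compiler from the config. The setter is only
// called when the flag is on, so the default compile path is untouched.
func setupCompiler(config CompileConfig) *Compiler {
	c := &Compiler{}
	if config.UseSamples {
		c.SetUseSamples(true)
	}
	return c
}
```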

Compiler and workflow data propagation

File	Change
`pkg/workflow/workflow_compiler.go`	Adds `useSamples` field + `SetUseSamples(bool)` setter to `Compiler`
`pkg/workflow/workflow_builder.go`	Assigns `UseSamples` when constructing `WorkflowData`
`pkg/workflow/workflow_data.go`	Adds `UseSamples bool` to `WorkflowData`
`pkg/workflow/workflow_yaml.go`	When `UseSamples` is true, replaces "Execute coding agent" with `buildUseSamplesStep()` under the same `agentic_execution` step ID

The flag propagates without changing any existing code paths for normal compilation; the substitution happens at the final YAML-emission stage.
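The following hypothetical sketch of the emission-time switch shows why reusing the step ID keeps downstream steps working. `Step`, `WorkflowData`, `executionStep` and `render` are illustrative names. The `run:` command for the replay step is a placeholder, not the path the compiler actually emits.

```go
package main

import "fmt"

// Step is a hypothetical, minimal model of a GitHub Actions step.
type Step struct {
	ID   string
	Name string
	Run  string
}

// WorkflowData is a trimmed-down stand-in; only UseSamples matters here.
type WorkflowData struct {
	UseSamples bool
}

const agenticExecutionID = "agentic_execution"

// executionStep picks which step to emit at the agentic-execution slot.
// Both branches share the same step ID, so anything downstream that refers
// to steps.agentic_execution keeps working unchanged.
func executionStep(data WorkflowData, engineCmd string) Step {
	if data.UseSamples {
		return Step{
			ID:   agenticExecutionID,
			Name: "Replay safe-outputs samples",
			Run:  "node apply_samples.cjs", // placeholder invocation, hypothetical
		}
	}
	return Step{ID: agenticExecutionID, Name: "Execute coding agent", Run: engineCmd}
}

// render emits the step as a small YAML fragment.
func render(s Step) string {
	return fmt.Sprintf("- name: %s\n  id: %s\n  run: %s\n", s.Name, s.ID, s.Run)
}
```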

Samples replay step generation

pkg/workflow/samples_replay.go (new) — defines:

SampleEntry — carries Tool, Arguments, and Sidecars
collectSampleEntries — uses reflection to walk safe-output handler configs and partition sidecar fields (e.g. patch) from normal MCP arguments
buildUseSamplesStep — marshals entries to JSON, emits the GitHub Actions step that invokes apply_samples.cjs via Node, injecting GH_AW_SAMPLES, GH_AW_AGENT_STDIO_LOG, GH_AW_SAFE_OUTPUTS_CONFIG_PATH, and GH_AW_SAFE_OUTPUTS
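This is a hedged sketch of the marshaling half of `buildUseSamplesStep`. Only the four environment variable names come from the description above. The JSON tag names on `SampleEntry`, the helper names `replayStepEnv` and `sortedEnvKeys`, and treating the last three variables as caller-supplied strings are all assumptions.

```go
package main

import (
	"encoding/json"
	"sort"
)

// SampleEntry mirrors the three fields named in the PR (Tool, Arguments,
// Sidecars); the JSON tag names are an assumption.
type SampleEntry struct {
	Tool      string         `json:"tool"`
	Arguments map[string]any `json:"arguments"`
	Sidecars  map[string]any `json:"sidecars,omitempty"`
}

// replayStepEnv builds the env block of the replay step: the marshaled
// entries go into GH_AW_SAMPLES; the other values are placeholders
// supplied by the caller.
func replayStepEnv(entries []SampleEntry, stdioLog, configPath, outputs string) (map[string]string, error) {
	if entries == nil {
		entries = []SampleEntry{} // keep GH_AW_SAMPLES a JSON array, never null
	}
	samplesJSON, err := json.Marshal(entries)
	if err != nil {
		return nil, err
	}
	return map[string]string{
		"GH_AW_SAMPLES":                  string(samplesJSON),
		"GH_AW_AGENT_STDIO_LOG":          stdioLog,
		"GH_AW_SAFE_OUTPUTS_CONFIG_PATH": configPath,
		"GH_AW_SAFE_OUTPUTS":             outputs,
	}, nil
}

// sortedEnvKeys returns env names in a stable order so the emitted YAML
// is identical across compiles.
func sortedEnvKeys(env map[string]string) []string {
	keys := make([]string, 0, len(env))
	for k := range env {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	return keys
}
```

Sorting the env keys keeps the emitted step byte-for-byte stable between compiles. `encoding/json` also sorts map keys inside the arguments, which helps when compiled lock files are compared across runs.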

Safe-outputs compilation changes

pkg/workflow/safe_outputs.go (modified):

Calls validateSafeOutputsSamples to validate declared samples against MCP tool schemas before emitting YAML
Force-sets cfg.ThreatDetection = nil when useSamples is true, so prompt-injection guards do not fire against static replay payloads
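The override can be pictured as a small post-processing step on the parsed config. This sketch uses hypothetical trimmed-down types (`SafeOutputsConfig`, `ThreatDetectionConfig`) and a hypothetical helper name. The helper reports whether it changed anything, so a caller could log the override as the later commit describes.

```go
package main

// ThreatDetectionConfig and SafeOutputsConfig are hypothetical,
// trimmed-down stand-ins for the real config types.
type ThreatDetectionConfig struct {
	Enabled bool
}

type SafeOutputsConfig struct {
	ThreatDetection *ThreatDetectionConfig
}

// applyUseSamplesOverride drops threat detection whenever samples mode is
// on. It does this whether detection came from the default or from an
// explicit `threat-detection: true` in frontmatter. It returns true when
// it removed a config, so the caller can log that the override happened.
func applyUseSamplesOverride(cfg *SafeOutputsConfig, useSamples bool) bool {
	if !useSamples || cfg == nil || cfg.ThreatDetection == nil {
		return false
	}
	cfg.ThreatDetection = nil
	return true
}
```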

Schema-based sample validation

pkg/workflow/samples_validation.go (new) — provides:

validateSafeOutputsSamples — iterates sorted handler names for deterministic ordering, strips sidecar fields before validation, and validates each sample entry against the per-tool JSON schema
Lazy-compiled JSON schema cache keyed by MCP tool name to avoid redundant schema compilation
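The ordering and caching pattern can be sketched as follows. Real validation uses compiled JSON Schemas. Here `compiledSchema` is a deliberately toy stand-in that only checks required properties. In this sketch, samples are keyed directly by MCP tool name and sidecar stripping is omitted. All names are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// compiledSchema is a toy stand-in for a compiled JSON Schema: it only
// checks required properties. It exists to show the caching pattern.
type compiledSchema struct {
	required []string
}

func (s *compiledSchema) validate(args map[string]any) error {
	for _, name := range s.required {
		if _, ok := args[name]; !ok {
			return fmt.Errorf("missing required property %q", name)
		}
	}
	return nil
}

// schemaCache compiles each tool's schema on first use and reuses it.
type schemaCache struct {
	raw      map[string][]string // tool name -> required fields (stand-in for the raw schema)
	compiled map[string]*compiledSchema
	compiles int // number of compilations, to make the laziness observable
}

func (c *schemaCache) get(tool string) (*compiledSchema, error) {
	if s, ok := c.compiled[tool]; ok {
		return s, nil
	}
	req, ok := c.raw[tool]
	if !ok {
		return nil, fmt.Errorf("no schema for tool %q", tool)
	}
	if c.compiled == nil {
		c.compiled = map[string]*compiledSchema{}
	}
	c.compiles++
	s := &compiledSchema{required: req}
	c.compiled[tool] = s
	return s, nil
}

// validateSamples walks tools in sorted order so that, when several
// samples are invalid, the first reported error is stable across runs.
func validateSamples(cache *schemaCache, samples map[string][]map[string]any) error {
	tools := make([]string, 0, len(samples))
	for t := range samples {
		tools = append(tools, t)
	}
	sort.Strings(tools)
	for _, t := range tools {
		schema, err := cache.get(t)
		if err != nil {
			return err
		}
		for i, args := range samples[t] {
			if err := schema.validate(args); err != nil {
				return fmt.Errorf("%s samples[%d]: %w", t, i, err)
			}
		}
	}
	return nil
}
```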

JSON schema additions

pkg/parser/schemas/main_workflow_schema.json (modified):

Adds an optional samples field (oneOf: array-of-objects | free-form object, additionalProperties: true) to every safe-output handler schema — covering create-issue, create-pull-request, push-to-pull-request-branch, add-comment, update-pull-request, close-pull-request, merge-pull-request, create-branch, delete-branch, add-label, remove-label, and others
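The schema admits exactly two shapes, so the Go side can normalize them with a type switch. This is a hypothetical helper (`normalizeSamples`), not the actual parser, and it makes three assumptions:

- the YAML decoder yields `map[string]any` and `[]any`;
- a bare object is treated as a single sample;
- scalars are rejected, matching the schema.

```go
package main

import "fmt"

// normalizeSamples turns a decoded `samples` value into a list of sample
// objects. Treating a bare object as one sample is an assumption.
func normalizeSamples(v any) ([]map[string]any, error) {
	switch t := v.(type) {
	case nil:
		return nil, nil // no samples declared
	case map[string]any:
		return []map[string]any{t}, nil
	case []any:
		out := make([]map[string]any, 0, len(t))
		for i, item := range t {
			m, ok := item.(map[string]any)
			if !ok {
				return nil, fmt.Errorf("samples[%d]: expected an object, got %T", i, item)
			}
			out = append(out, m)
		}
		return out, nil
	default:
		// Strings and other scalars are rejected, matching the schema.
		return nil, fmt.Errorf("samples: expected an array or object, got %T", v)
	}
}
```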

Node.js replay driver

actions/setup/js/apply_samples.cjs (new):

Reads GH_AW_SAMPLES (JSON array of SampleEntry)
Spawns safe_outputs_mcp_server.cjs as a child process over stdio
Sends one JSON-RPC tools/call per entry and awaits each response
Pre-stages git patches (preStagePatch) for create_pull_request and push_to_pull_request_branch before the MCP call
Writes a synthetic agent-stdio.log so downstream log-parsing steps see a valid log file
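The driver itself is Node.js. The Go sketch below only illustrates the wire format of one `tools/call` request as defined by JSON-RPC 2.0 and MCP, framed as a single newline-terminated line for the stdio transport. The type and function names are hypothetical. The MCP initialization handshake and response handling are omitted.

```go
package main

import "encoding/json"

// rpcRequest is a JSON-RPC 2.0 request as used by MCP.
type rpcRequest struct {
	JSONRPC string         `json:"jsonrpc"`
	ID      int            `json:"id"`
	Method  string         `json:"method"`
	Params  toolCallParams `json:"params"`
}

// toolCallParams is the params object of an MCP tools/call request.
type toolCallParams struct {
	Name      string         `json:"name"`
	Arguments map[string]any `json:"arguments"`
}

// encodeToolCall frames one tools/call request as a single line of JSON.
// Giving each sample its own id lets the driver match the response before
// sending the next request.
func encodeToolCall(id int, tool string, args map[string]any) ([]byte, error) {
	b, err := json.Marshal(rpcRequest{
		JSONRPC: "2.0",
		ID:      id,
		Method:  "tools/call",
		Params:  toolCallParams{Name: tool, Arguments: args},
	})
	if err != nil {
		return nil, err
	}
	return append(b, '\n'), nil
}
```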

actions/setup/js/safe_outputs_mcp_server.cjs (modified):

Adds require('./shim.cjs') at the top so core.* calls continue to work when the server is spawned as a standalone child process by apply_samples.cjs, outside the normal github-script runtime

Tests added

File	Coverage
`actions/setup/js/apply_samples.test.cjs`	Smoke tests: full MCP replay of `create_issue`; empty/null `GH_AW_SAMPLES`; `preStagePatch` for `create_pull_request` and `push_to_pull_request_branch`; no-op when no patch sidecar
`pkg/workflow/samples_replay_test.go`	Integration: `SetUseSamples(true)` replaces agentic step; create-PR + patch sidecar flow; nil-slice marshalling guards against `"null"` in `GH_AW_SAMPLES`
`pkg/workflow/samples_threat_detection_test.go`	Confirms threat detection is disabled in `use-samples` mode regardless of frontmatter
`pkg/workflow/samples_validation_test.go`	Unit: valid samples; missing required fields; sidecar stripping; deterministic ordering; sidecar partitioning into `Sidecars` vs `Arguments`

Breaking changes

None. The --use-samples flag is hidden and off by default; all existing compile paths are unaffected.

Key design decisions

Same step ID (agentic_execution) — the replay step reuses the existing step ID so downstream log-parsing and job-summary logic requires no changes.
Force-disable threat detection — static replay payloads are known-safe by construction; leaving threat detection enabled would produce false positives and obscure real signal.
Sidecar partitioning — fields like patch are not valid MCP tool arguments; they are stripped before schema validation and carried separately as Sidecars to apply_samples.cjs, which pre-stages them as git patches before the MCP call.
Lazy schema cache — JSON schemas are compiled on first use per tool name and reused for all subsequent entries, keeping validation fast even for workflows with many sample entries.
shim.cjs injection — rather than refactoring the MCP server to eliminate core.* calls, a one-line require('./shim.cjs') at the top of the file restores the global.core object when the server runs standalone, keeping the server's own logic unchanged.
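Here is a minimal sketch of sidecar partitioning for a single sample. The tool list matches the tools this PR names as accepting a `patch` sidecar. `partitionSample` and `patchSidecarTools` are hypothetical names. The real `collectSampleEntries` walks handler configs via reflection rather than taking a plain map.

```go
package main

// patchSidecarTools lists the tools whose samples may carry a `patch`
// sidecar. The map-based lookup is illustrative.
var patchSidecarTools = map[string]bool{
	"create_pull_request":         true,
	"push_to_pull_request_branch": true,
}

// partitionSample splits one raw sample into MCP arguments and sidecars.
// Both result maps are freshly allocated, so the caller's map is never
// mutated or aliased.
func partitionSample(tool string, raw map[string]any) (args, sidecars map[string]any) {
	args = make(map[string]any, len(raw))
	sidecars = map[string]any{}
	for k, v := range raw {
		if k == "patch" && patchSidecarTools[tool] {
			sidecars[k] = v // consumed by the replay driver, never sent to MCP
			continue
		}
		args[k] = v
	}
	return args, sidecars
}
```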

Generated by PR Description Updater for issue #37359 · 349.7 AIC · ⌖ 14.3 AIC · ⊞ 19.5K

…utputs replay

Adds a hidden compile mode that replaces the agentic 'Execute coding agent' step with a deterministic driver. The driver replays declarative `samples` entries through the real safe-outputs MCP server. This makes end-to-end tests deterministic without invoking any LLM.

Frontmatter:

    safe-outputs:
      create-issue:
        samples:
          - title: "..."
            body: "..."

Each entry conforms to the MCP tool inputSchema. Recognized sidecar keys (`patch` for create-pull-request and push-to-pull-request-branch) are stripped before validation. The replay driver consumes them for branch + patch pre-staging.

Hidden surface:
- CLI flag `--use-samples` is hidden from `gh aw compile --help`
- JSON schema description marks `samples` as 'Internal hidden feature'

Implementation:
- Static JSON Schema validation against safe_outputs_tools.json at compile time
- Deterministic step ordering (sorted by SafeOutputsConfig struct field name)
- New driver actions/setup/js/apply_samples.cjs spawns the real MCP server over stdio and sends one tools/call per sample. It writes a synthetic terminal_reason: completed marker so handle_agent_failure recognizes success.
- The driver pre-stages git branches + patches for create_pull_request and push_to_pull_request_branch samples so the real handler can derive a diff

Tests:
- 5 unit tests covering validation, sidecar stripping, deterministic ordering, sidecar partitioning
- 1 integration test verifying the agent step is replaced
- 2 vitest specs driving the real MCP server end-to-end

Copilot

Pull request overview

This PR adds a hidden gh aw compile --use-samples mode that swaps the agentic execution step for a deterministic “safe-outputs samples replay” driver, enabling end-to-end tests to exercise the real safe-outputs MCP server without invoking an LLM.

Changes:

Introduces samples entries on safe-outputs handlers, compile-time validation against embedded MCP tool schemas, and deterministic ordering/flattening into replay payloads.
Adds compiler/CLI plumbing (--use-samples, WorkflowData.UseSamples) to replace the agent execution step with a replay step that runs apply_samples.cjs.
Adds Go tests and Vitest specs to validate schema checking, ordering/sidecar handling, and the end-to-end replay driver behavior.

File	Description
pkg/workflow/workflow_builder.go	Plumbs `UseSamples` into initial workflow data so generation can branch deterministically.
pkg/workflow/samples_validation.go	Adds per-tool JSON Schema compilation/cache and validates samples entries (with sidecar stripping).
pkg/workflow/samples_validation_test.go	Unit tests for samples schema validation, sidecar stripping, and ordering assumptions.
pkg/workflow/samples_replay.go	Flattens samples into replay entries and emits the replacement “Replay safe-outputs samples” workflow step.
pkg/workflow/samples_replay_test.go	Integration test ensuring `--use-samples` replaces the agentic step in the compiled lock file.
pkg/workflow/safe_outputs_config.go	Parses hidden `samples` frontmatter into `BaseSafeOutputConfig.Samples` (including sidecar-friendly normalization).
pkg/workflow/compiler_yaml_ai_execution.go	Switches engine execution generation to replay mode when `UseSamples` is set.
pkg/workflow/compiler_validators.go	Adds compile-time samples validation to the core validator pipeline.
pkg/workflow/compiler_types.go	Adds `Compiler.useSamples`, `WorkflowData.UseSamples`, and `BaseSafeOutputConfig.Samples`.
pkg/parser/schemas/main_workflow_schema.json	Exposes `samples` in the schema (documented as internal/hidden) for editor authoring/autocomplete.
pkg/cli/compile_config.go	Adds hidden `UseSamples` compile configuration flag.
pkg/cli/compile_compiler_setup.go	Wires `UseSamples` into compiler configuration (`SetUseSamples(true)`).
cmd/gh-aw/main.go	Adds hidden CLI flag `--use-samples` and passes it into compile config.
actions/setup/js/apply_samples.test.cjs	Vitest smoke coverage for the driver (real MCP server spawn + completed marker + empty-samples case).
actions/setup/js/apply_samples.cjs	Implements deterministic replay driver: spawns MCP server, sends JSON-RPC `tools/call`, stages patch sidecars, writes synthetic agent log.

Copilot's findings

Files reviewed: 15/15 changed files
Comments generated: 6

+		b, _ := os.ReadFile(lockPath)
+		lockContent := string(b)


The deterministic samples replay driver emits synthetic safe-outputs purely to exercise downstream handlers in end-to-end tests. Running the LLM-backed threat-detection job against those fabricated payloads causes three problems:
- it defeats determinism;
- it costs tokens;
- it can spuriously flag the test fixtures.

When --use-samples is set, extractSafeOutputsConfig now nils out SafeOutputsConfig.ThreatDetection unconditionally. This overrides both the implicit default and any explicit threat-detection: true, and the override is logged.

Tests:
- new TestExtractSafeOutputsConfig_UseSamplesDisablesThreatDetection covers default mode (detection enabled), --use-samples + default (disabled), and --use-samples + explicit true (still disabled)
- TestUseSamplesReplacesAgentStep additionally asserts that no detection: job appears in the compiled lock file

github-actions · 2026-06-06T16:51:22Z

@copilot review all comments and address unresolved review feedback.

Generated by 👨‍🍳 PR Sous Chef · 58.7 AIC · ⌖ 1.04 AIC

Adds three vitest specs that drive the apply_samples driver's preStagePatch path against a real, throwaway git working tree:

1. create_pull_request with a 'patch' sidecar checks out the requested branch, applies the diff, and commits it. The resulting diff is visible via 'git diff main...<branch>', which is precisely what the downstream MCP create_pull_request handler reads when generating its bundle/patch payload.
2. push_to_pull_request_branch without an explicit 'branch' falls back to 'gh-aw-sample-<i+1>' and still applies the patch.
3. preStagePatch is a no-op when called with a tool that has no patch sidecar (defense in depth around the PATCH_SIDECAR_TOOLS gate in main()).

The existing Go unit tests already cover sidecar partitioning and schema stripping. These specs close the remaining testing gap around the patch-sidecar flow, which was previously only covered structurally.
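The branch fallback and staging sequence from spec 2 can be sketched as plain data. `sampleBranch` and `stagingCommands` are hypothetical. The git commands, the patch-file argument and the commit message are assumptions about how "check out, apply, commit" might be realized, not the driver's actual invocations.

```go
package main

import "fmt"

// sampleBranch picks the branch a patch sample is staged on. It uses the
// explicit `branch` argument if present, otherwise gh-aw-sample-<i+1>,
// where i is the zero-based index of the sample.
func sampleBranch(args map[string]any, i int) string {
	if b, ok := args["branch"].(string); ok && b != "" {
		return b
	}
	return fmt.Sprintf("gh-aw-sample-%d", i+1)
}

// stagingCommands lists git invocations that would check out the branch,
// apply the diff and commit it. The exact commands are assumptions.
func stagingCommands(branch, patchFile string) [][]string {
	return [][]string{
		{"git", "checkout", "-b", branch},
		{"git", "apply", patchFile},
		{"git", "add", "-A"},
		{"git", "commit", "-m", "Apply sample patch"}, // hypothetical message
	}
}
```

A caller would run each command with `os/exec` (for example `exec.Command(cmd[0], cmd[1:]...)`) in the working tree, stopping at the first failure.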

Compiles a workflow whose only safe-output is `create-pull-request`. Its samples entry carries a multi-line `patch:` block scalar. The test then inspects the generated lock.yml, extracts the GH_AW_SAMPLES JSON literal block from the compiled YAML, and asserts that:
- the agentic step is replaced by the replay step
- the entry tool is "create_pull_request"
- the patch is partitioned into sidecars, NOT arguments (the MCP create_pull_request handler must not receive a literal patch argument; it derives the diff from the working tree)
- title/body/branch are preserved in arguments
- the patch payload (including the diff header and the added line) survives YAML emission verbatim so the driver can git-apply it
- no detection: job is emitted

This closes the loop from frontmatter -> compiled YAML for the patch-sidecar flow. It complements the existing Go unit tests (sidecar partitioning) and the vitest preStagePatch specs, which exercise the runtime side against a real git repo.

github-actions · 2026-06-06T17:22:33Z

@copilot review all comments and address unresolved review feedback.

Generated by 👨‍🍳 PR Sous Chef · 39.6 AIC · ⌖ 1.06 AIC

github-actions · 2026-06-06T17:22:34Z

@copilot please summarize the remaining blockers and next step.

Generated by 👨‍🍳 PR Sous Chef · 39.6 AIC · ⌖ 1.06 AIC

Observed in CI:

    Error: apply_samples: GH_AW_SAMPLES must be a JSON array
        at loadSamples (apply_samples.cjs:61:11)

Root cause: a workflow can opt into --use-samples but configure no `samples:` entries (or configure them only on disabled handlers). In that case collectSampleEntries returns a nil Go slice. json.Marshal(nil) produces the literal string "null", which the driver rightly refuses to treat as an array.

Compiler fix (pkg/workflow/samples_replay.go): normalize a nil entries slice to an empty []SampleEntry{} before marshaling, so GH_AW_SAMPLES is always emitted as a valid JSON array ("[]" in the empty case).

Driver defense (actions/setup/js/apply_samples.cjs): also tolerate a literal JSON `null` payload and treat it as "no samples to replay". This way an older compiler paired with a newer driver doesn't crash either.

Tests:
- new Go integration test TestUseSamplesEmitsEmptyArrayWhenNoSamplesConfigured compiles a workflow that uses --use-samples with safe-outputs but no samples entries, then asserts GH_AW_SAMPLES is exactly "[]" (and emphatically not "null")
- new vitest spec verifies the driver exits 0 on GH_AW_SAMPLES="null" and logs "GH_AW_SAMPLES is null"

CI fixes:
- pkg/workflow/samples_replay.go: switch to strings.SplitSeq per the modernize linter (lint-go was failing)
- actions/setup/js/apply_samples.cjs: weaken the JSDoc type on sendJsonRpc's child parameter from ChildProcessWithoutNullStreams to ChildProcess (js-typecheck was failing). spawn() with stdio: ["pipe", "pipe", "inherit"] returns a value with a null stderr, which only type-checks against ChildProcess.

Review feedback (all Copilot inline comments):
- apply_samples.cjs: replace the /** @type {Error} */ casts on catch bindings with the shared getErrorMessage(err) helper, so catch-unknown narrowing is actually safe under @ts-check
- samples_replay_test.go: stop swallowing the ReadFile error in the Use-Samples-Mode subtest; t.Fatalf on failure like the default-mode subtest does
- samples_validation.go: stripSidecarFields now always returns a fresh map, matching its doc comment (no more accidental aliasing of the caller's input when sidecars is empty)
- safe_outputs_config.go: drop the YAML-string branch of parseSamplesValue. The JSON schema for samples only allows array/object, so the string form would be rejected upstream before this code runs. This also removes the now-unused yaml import.

The Copilot comment about collectSampleEntries emitting null was addressed in the prior commit (5194f4b), which normalizes nil to []SampleEntry{} before json.Marshal.

github-actions · 2026-06-06T19:41:23Z

Please fix the failing lint-js check and summarize any remaining blockers.

Generated by 👨‍🍳 PR Sous Chef · 44.4 AIC · ⌖ 1 AIC · ⊞ 17K ambient context

…samples-hidden-flag

github-actions · 2026-06-06T22:46:20Z

@copilot review all comments and address unresolved review feedback.

Generated by 👨‍🍳 PR Sous Chef · 37.6 AIC · ⌖ 0.973 AIC · ⊞ 17K

github-actions · 2026-06-06T22:46:24Z

@copilot please refresh the branch, rerun checks, and summarize any remaining blockers after the rebase.

Generated by 👨‍🍳 PR Sous Chef · 37.6 AIC · ⌖ 0.973 AIC · ⊞ 17K

…ples driver

dsyme · 2026-06-06T23:04:55Z

Addressed all review feedback in a7798979d5:

@pelikhan: apply_samples.cjs now require("./shim.cjs") and uses core.warning / core.info / core.setFailed instead of console.error / process.exit. Matches the convention used by the other .cjs modules in the same directory.

Copilot review items (already in earlier commits on this branch):

1. samples_replay.go: generateSamplesReplayStep coerces nil entries to []SampleEntry{} and emits [] instead of null.
2. samples_replay_test.go: both subtests now t.Fatalf on ReadFile error.
3. samples_validation.go: stripSidecarFields always builds a fresh map (no shortcut returning the original).
4. safe_outputs_config.go: parseSamplesValue no longer accepts string scalars (the schema rejects them upstream).
5/6. apply_samples.cjs: replaced /** @type {Error} */ JSDoc casts with the getErrorMessage(err) helper throughout.

…core

When apply_samples.cjs spawns safe_outputs_mcp_server.cjs as a standalone Node child process, handlers like create_pull_request.cjs that reference core.info/warning/debug throw ReferenceError: core is not defined. The shim is idempotent (guarded by 'if (!global.core)'). Loading it unconditionally is therefore safe even when the module is required from a parent that has already initialized core.

github-actions · 2026-06-06T23:13:49Z

✅ smoke-ci: safeoutputs CLI comment + comment-memory run (27076500417)

Generated by 🧪 Smoke CI for issue #37359

Copilot AI review requested due to automatic review settings June 6, 2026 16:32

Copilot started reviewing on behalf of dsyme June 6, 2026 16:32