mayflower · TMogdans · Jun 23, 2026 · Jun 23, 2026
diff --git a/README.md b/README.md
@@ -18,6 +18,7 @@ fail-closed CI-Required-Check und GitHub-Branch-Protection.
 
 - `/devloop:specify`: produces the `spec.md` (EARS + `REQ-` IDs + deterministically derived tier)
 - `/devloop:spec-to-tests`: test skeletons per `REQ-` ID, routed by EARS type
+- `/devloop:spec-to-twin`: *(optional, `.devloop` `twin.enabled`)* independent behavioural oracle (trivial reference model + REQ-tagged invariants + fast-check model-based harness) derived from domain truths; checks correctness, not just fidelity to the spec
 - `/devloop:implement`: consumes spec + tests, opens a PR (does not write spec or tests itself)
 - `/devloop:critic`: adversarial, fresh context, structured verdict
 - `/devloop:loop`: the driver (orchestrates the stations as isolated subagents)

diff --git a/USAGE.md b/USAGE.md
@@ -90,7 +90,12 @@ Was passiert (Spec-PR-zuerst; der Driver gehorcht dem getesteten Kern, trifft ni
 2. **specify** (subagent) → `spec.md` (user story, EARS criteria with `REQ-` IDs, provisional tier).
 3. **spec-to-tests** (its own subagent) → for each `REQ-` ID, **complete but `.skip`'d** tests
    (by EARS type). `main` stays green (the trace gate counts skips, Vitest does not go red).
-4. **Open spec PR** (`OPEN_SPEC_PR`) → spec + skipped tests as a separate PR against `main`.
+   - *(optional, only with `.devloop` `twin.enabled`)* **spec-to-twin** runs as its **own** subagent
+     (it does **not** see the tests) and adds an independent behavioural **oracle**: trivial
+     reference model + REQ-tagged invariants + adapter + fast-check `modelRun`, `.skip`'d, in the
+     protected twin path, derived from domain truths, **not** from the EARS criteria (anti-re-anchor).
+     It travels with the spec PR (checks correctness, not just fidelity to the spec).
+4. **Open spec PR** (`OPEN_SPEC_PR`) → spec + skipped tests (+ twin, if enabled) as a separate PR against `main`.
 5. **▣ STOP: spec review.** The driver ends the turn. **You or a second human** review
    the spec PR (spec *and* tests together) and approve it via **GitHub CODEOWNER review** (§3).
 6. **Merge spec** (`MERGE_SPEC_PR`) → spec PR into `main`, `git pull`. `implement` builds on `main`.
@@ -169,7 +174,7 @@ Stand gebunden.
 ## 4. Individual stations without orchestration
 
 Every station is also available as a standalone skill (without the hard stops), e.g. for practice:
-`/devloop:specify`, `/devloop:spec-to-tests`, `/devloop:implement`, `/devloop:critic`.
+`/devloop:specify`, `/devloop:spec-to-tests`, `/devloop:spec-to-twin` *(optional)*, `/devloop:implement`, `/devloop:critic`.
 For the real, safeguarded run, use `/devloop:loop`.
 
 ---

diff --git a/agents/devloop-spec-to-twin.md b/agents/devloop-spec-to-twin.md
@@ -0,0 +1,44 @@
+---
+name: devloop-spec-to-twin
+description: Derives an INDEPENDENT behavioural oracle from a spec.md (trivial reference model + REQ-tagged invariants + adapter + fast-check model-based harness), .skip'd, in the protected twin path. Derived from DOMAIN TRUTHS, NOT transcribed from the EARS criteria (anti-re-anchor). Runs as its own isolated subagent (NOT spec-to-tests, NOT implement) and does not see the generated tests. Optional (only with .devloop twin.enabled). Part of the spec PR.
+tools: Read, Write, Glob, Grep, Bash
+---
+
+# Station: spec-to-twin
+
+You build the **correctness oracle** for the `spec.md`: a *digital twin* that the later code must run against. Where `spec-to-tests` checks **fidelity to the spec** (hand-picked examples, expected values written from *one* reading), you check **agreement between two independent derivations of the behaviour**: expected values *computed* rather than written, inputs *generated* rather than enumerated. This is the independence from §11, **one level up**: not "who writes the tests" but "where does 'correct' come from". Your output goes `.skip`'d into the **spec PR** and is reviewed by the human along with it (before any code).
+
+## Mission
+
+From the **reviewed** `spec.md` (+ contract), **without reading the tests generated by `spec-to-tests`** (anchor avoidance), you produce in the protected twin path (`<area>/twin/`, from `.devloop` `twin.area`):
+
+1. **Reference model**: the deliberately **trivial** re-implementation of the domain behaviour, recognisable as correct at a glance. Trustworthy *because* it is trivial, not because it is verified. In-memory, no cleverness, no I/O.
+2. **Invariants**: domain truths as properties (sum identity, "never negative", append-only, …), **each with a REQ tag** in the test title for the trace gate.
+3. **Adapter**: `setup`/`reset`/`execute`/`teardown` against the **specified** interface/contract (not against an implementation, which does not exist yet). If `implement` later deviates from the contract, the adapter fails to wire up → a divergence signal.
+4. **Harness**: fast-check `commands` + `modelRun`. It generates random command sequences, applies each one to **both the model and the real system**, and compares after **every** step. Arguments **include boundary values** (≤ 0, non-integer, missing entity, …) so that rejection parity (400/404/409) is checked as well.
+
+## The seam: critical (§4, §11)
+
+- **Derive from domain truths, NOT from the EARS criteria.** Do not transcribe the spec; otherwise the oracle re-anchors on the same reading and the decorrelation (the whole point) disappears. Ask "What is *obviously true* about this domain?", not "What does REQ-x say?". You set the REQ tag for traceability; the **derivation** stays independent.
+- **You do not read the tests from `spec-to-tests`.** Your mutual independence is the point of the separation; you are a separate instance with fresh context.
+- Mark **every** model-based test with **`.skip` + REQ tag in the title**, the sanctioned skip idiom (as in `spec-to-tests`): the trace gate counts it as coverage, Vitest does not go red (red-before-green), the Semgrep escape hatch lets it through. The real system does not exist before `implement`, so the twin **must** be skipped.
+- **You write NO product code.** `implement` may later **only remove the `.skip`**, never change your model, your invariants or assertions (enforced mechanically by `verify-unskip` + the CODEOWNERS twin path). The oracle is **unreachable** for the producer: separation of powers, one level above the gates.
+
+## Spec change (amend mode)
+
+If an existing spec changes, you touch **only the affected invariants**. Compute the delta deterministically:
+```
+node "${CLAUDE_PLUGIN_ROOT}"/dist/cli/req-delta.js <alte-spec> <neue-spec>   # {added, changed, removed}
+```
+(old spec: `git show <base>:<spec.md>`). Then, per case:
+- **added** → new invariant, `.skip`'d.
+- **changed** → change the invariant with the same REQ ID **and set `.skip` again**.
+- **removed** → remove the invariant (otherwise an orphaned REQ reference → red trace gate).
+Do **not** touch unchanged invariants. This runs on the spec PR (`devloop/spec/<slug>`); there you may author, change and re-skip, since `verify-unskip` does not apply there.
+
+## Boundaries
+
+- You are **not** `spec-to-tests` and **not** `implement`; your independence from both is the point.
+- Do not invent domain behaviour that the spec does not support, but do not transcribe the spec either. If the spec is contradictory or incomplete, so that "correct behaviour" cannot be derived, that is a **spec defect** → report it back, do not guess.
+- **The oracle stays project-local.** Never generalise the model; only the runner is (later) reusable. A "generic model" would be a generic spec, i.e. not an independent oracle.
+- **Repo-side assumptions** as for `spec-to-tests`: the trace/coverage gate counts `.skip`'d tests as coverage; API references in not-yet-implemented tests follow the same pattern as there. You only run at all if `.devloop` `twin.enabled` is set (the station is optional, the core stays lean).
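
Review note on the amend mode above: the added/changed/removed selection can be illustrated with a small sketch. It does not reproduce `dist/cli/req-delta.js` (its parsing of spec files is not shown in this diff). It assumes the REQ entries have already been extracted into maps from REQ ID to criterion text, and `reqDelta`, `twinAmendPlan` and the action names are hypothetical.

```typescript
// Hypothetical sketch of the amend-mode selection. Input maps: REQ ID -> criterion text.
interface ReqDelta {
  added: string[];
  changed: string[];
  removed: string[];
}

function reqDelta(oldReqs: Map<string, string>, newReqs: Map<string, string>): ReqDelta {
  const delta: ReqDelta = { added: [], changed: [], removed: [] };
  newReqs.forEach((text, id) => {
    if (!oldReqs.has(id)) delta.added.push(id);
    else if (oldReqs.get(id) !== text) delta.changed.push(id);
  });
  oldReqs.forEach((_text, id) => {
    if (!newReqs.has(id)) delta.removed.push(id);
  });
  // Sort so the result does not depend on insertion order.
  delta.added.sort();
  delta.changed.sort();
  delta.removed.sort();
  return delta;
}

// Illustrative action names for what the station does per REQ ID.
type TwinAction = "add-skipped-invariant" | "amend-and-reskip" | "delete-invariant";

function twinAmendPlan(delta: ReqDelta): Map<string, TwinAction> {
  const plan = new Map<string, TwinAction>();
  delta.added.forEach((id) => plan.set(id, "add-skipped-invariant"));
  delta.changed.forEach((id) => plan.set(id, "amend-and-reskip"));
  delta.removed.forEach((id) => plan.set(id, "delete-invariant"));
  return plan; // REQ IDs that are absent from the plan stay untouched
}
```

Unchanged REQ IDs never enter the plan, which matches the rule that unchanged invariants are not touched.
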
diff --git a/dist/core/driver.js b/dist/core/driver.js
@@ -59,6 +59,13 @@ export function nextAction(state) {
             // reviewer sees spec + its (skipped) tests together. No code yet -> §5.1 preserved.
             return { kind: "SPAWN_STATION", station: "spec-to-tests" };
         case "tests-written":
+            // When the twin is enabled, the independent oracle (reference model + invariants) is
+            // authored by its OWN isolated station BEFORE the spec PR, so it is reviewed together with
+            // the spec + tests. Default off -> straight to the spec PR (chain unchanged).
+            return state.twinEnabled
+                ? { kind: "SPAWN_STATION", station: "spec-to-twin" }
+                : { kind: "OPEN_SPEC_PR" };
+        case "twin-written":
             return { kind: "OPEN_SPEC_PR" };
         case "spec-pr-open":
             // Invariant 2: the spec-review stop is hard for every tier (§5.1 root of trust). It is

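Review note: the two cases touched by this hunk can be exercised in isolation. The following is a typed sketch of only that slice of `nextAction`, not the full driver. The phase names and action kinds are taken from the hunk; `TwinSliceState` and `nextActionTwinSlice` are hypothetical names, and treating a missing `twinEnabled` as "off" is an assumption based on the "default off" comment.

```typescript
// Only the slice of the state machine that this hunk changes.
type Phase = "tests-written" | "twin-written";

type Action =
  | { kind: "SPAWN_STATION"; station: "spec-to-twin" }
  | { kind: "OPEN_SPEC_PR" };

interface TwinSliceState {
  phase: Phase;
  twinEnabled?: boolean; // assumption: absent behaves like false (default off)
}

function nextActionTwinSlice(state: TwinSliceState): Action {
  switch (state.phase) {
    case "tests-written":
      // Twin enabled: author the oracle before the spec PR is opened.
      return state.twinEnabled
        ? { kind: "SPAWN_STATION", station: "spec-to-twin" }
        : { kind: "OPEN_SPEC_PR" };
    case "twin-written":
      return { kind: "OPEN_SPEC_PR" };
  }
}
```

With the flag off, `tests-written` still leads straight to `OPEN_SPEC_PR`, so the existing chain is unchanged.
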
diff --git a/dist/core/init.js b/dist/core/init.js
@@ -57,7 +57,10 @@ export function initRepo(targetRepo, ciTemplate, opts = {}) {
     writeIfAbsent(".devloop/bot-logins.json", BOT_LOGINS_SKELETON + "\n");
     // Anchor (b) is the default: CI is authoritative. Recorded explicitly so the local merge
     // hook defers to CI instead of demanding the (anchor-a) local token.
-    writeIfAbsent(".devloop/config.json", JSON.stringify({ anchor: "b" }, null, 2) + "\n");
+    // anchor: CI is authoritative (b). twin: the optional spec-to-twin station, disabled by default
+    // (discoverable here; core stays lean). Enable via twin.enabled=true + twin.area (the protected
+    // oracle path, e.g. "services/foo/twin"); see the spec-to-twin station.
+    writeIfAbsent(".devloop/config.json", JSON.stringify({ anchor: "b", twin: { enabled: false } }, null, 2) + "\n");
     // Tier-map: NEVER shadow an existing one. resolveTierMapPath prefers .devloop/, so writing a
     // default there would silently override a repo's own tools/tier-map.json -> gate regression.
     const existingTierMap = resolveTierMapPath(targetRepo);

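Review note: a consumer of this config presumably has to tolerate files written before this change, which have no `twin` key at all. The sketch below shows one defensive way to read the flag. `readTwinConfig` and `TwinConfig` are hypothetical names, not functions from this repository, and the exact config shape is still listed as open in the design document.

```typescript
// Hypothetical reader for the twin section of .devloop/config.json.
interface TwinConfig {
  enabled: boolean;
  area?: string;
}

function readTwinConfig(configJson: string): TwinConfig {
  const parsed: unknown = JSON.parse(configJson);
  const twin = (parsed as { twin?: { enabled?: unknown; area?: unknown } } | null)?.twin;
  // Only a literal `true` enables the station; anything else keeps the default (off).
  const enabled = twin?.enabled === true;
  const area = twin?.area;
  return typeof area === "string" ? { enabled, area } : { enabled };
}
```

Treating everything except a literal `true` as disabled keeps the station opt-in even if the file is hand-edited.
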
diff --git a/docs/2026-06-23-spec-to-twin-station-design.md b/docs/2026-06-23-spec-to-twin-station-design.md
@@ -0,0 +1,135 @@
+# spec-to-twin Station — Design
+
+> devloop. Adds an **optional** sibling to `spec-to-tests` that produces a *digital twin*: a
+> spec-independent behavioural oracle (framework pillar 4). Where `spec-to-tests` proves *fidelity
+> to the spec*, the twin proves *agreement of two independent derivations of the behaviour*.
+> Closes the gap the chain itself admits: the loop verifies fidelity-to-spec, not whether the
+> spec is right. Preserves both invariants (spec-review §5.1, test↔code independence §11 #3).
+
+## Guiding idea
+
+`spec-to-tests` writes hand-picked, REQ-tagged example tests whose *expected values are authored*
+from one reading of the spec. A buggy spec, or a misread, is encoded identically by the test
+author and the implementer — different agents, **same root**. The twin removes that shared root:
+a deliberately-trivial **reference model** computes the expected behaviour independently, and a
+**model-based** harness (fast-check `commands` + `modelRun`) runs thousands of generated command
+sequences against model *and* real system, comparing observable outcomes after each step. The
+expected value is *computed, not written*; the input space is *generated, not enumerated*.
+
+Same principle as the existing `spec-to-tests`↔`implement` split (independence / separation of
+powers) — **one level up**: not "who authors the tests" but "where the notion of correct comes
+from."
+
+> **Generalisable is the mechanism, never the oracle.** A model that fits every project is a
+> generic spec — i.e. no independent oracle. The reference model, invariants and adapter stay
+> project-local and in the protected set; only the runner is reusable (→ `@devloop/twin`, later,
+> by rule-of-three — not extracted from a single use).
+
+## Where it sits
+
+Sibling of `spec-to-tests`, on the **Spec-PR**, before any code. Isolated subagent, fresh
+context. Critically, it runs **independent of `spec-to-tests`**: it sees the reviewed `spec.md`
+(and the contract), **not** the generated tests — otherwise it anchors on that station's reading
+and the decorrelation shrinks.
+
+`specify` → { `spec-to-tests` ∥ `spec-to-twin` (if enabled) } → Spec-PR on `devloop/spec/<slug>`
+→ **spec-review stop (human adjudicates intent here)** → merge → `implement` on `devloop/<slug>`
+→ gates (incl. `twin`) → `critic` → **impl-merge stop** → merge.
+
+Two human gates, unchanged. **The driver state machine (`src/core/loop.ts`) is unchanged** —
+`spec-to-twin` is one more station the driver spawns *conditionally*; `nextLoopDecision` does not
+change. (Same minimal-impact stance as the spec-change-loopback design.)
+
+## What it produces
+
+All under a protected twin path (e.g. `<area>/twin/`, added to `CODEOWNERS`), all `.skip`'d in
+the Spec-PR — the real system does not exist yet, so the twin cannot run until `implement`:
+
+1. **Reference model**: the deliberately-trivial, eyeball-correct re-derivation of the domain
+   behaviour (the oracle). Trusted *because* trivial, not because verified.
+2. **Invariants**: domain truths as properties (e.g. "sum identity", "never negative",
+   "append-only"), each **REQ-tagged** for the trace-gate.
+3. **Adapter**: `setup` / `reset` / `execute` / `teardown` against the **specified** contract
+   or interface. If `implement` deviates from it, the adapter fails to wire → a divergence signal.
+4. **Harness wiring**: fast-check `commands` + `modelRun`, generating args **including boundary
+   values** (≤0, non-integer, …) so rejection-parity is checked too.
+
+`implement` may **only remove `.skip`** — never alter the model, invariants or assertions. This
+is the same `verify-unskip` seam as `spec-to-tests`: the producer cannot reach the oracle.
+
+## The distinguishing mandate (vs. spec-to-tests)
+
+`spec-to-tests`: *"map exactly the REQ-IDs, invent nothing."*
+`spec-to-twin`: the **opposite** — *"derive model + invariants from the **domain truths**, **not**
+by transcribing the EARS criteria"* (anti-re-anchor). Still cross-reference REQ-IDs on the
+invariants for the trace-gate, but the derivation must be independent. This reversed instruction
+is exactly why it is a **separate station** and not a flag on `spec-to-tests`: one mandate is
+"reproduce the spec faithfully," the other is "re-derive correctness independently." Merging them
+muddies both.
+
+## EARS routing — the twin as a new gate-sort
+
+Extends the `spec-to-tests` routing table:
+
+| EARS type | Gate sort |
+|---|---|
+| When / If / While / Where | Vitest (+ fast-check for invariants) |
+| **Invariant / property over op-sequences** | **twin: reference-model `modelRun`** |
+| Performance | bench / load |
+| Architecture | ArchUnitTS |
+| Contract | AsyncAPI / PACT |
+
+A criterion like REQ-SPD-11 ("over any sequence of valid ops, the balance is never negative and
+equals the signed sum") routes to the twin, not to a single example test.
+
+## Optionality: core stays lean
+
+The loop spawns `spec-to-twin` **only if the target repo opts in** via a flag in its `.devloop/`
+config (e.g. `twin: { enabled: true, area: "services/wallet-service" }`). Default off: repos that
+don't want it see no new station, no new gate, no extra agent. The `twin` CI job is required
+**only when enabled**. This honours the "twin as a *pluggable* capability, core stays lean"
+decision.
+
+## Gate & protected set
+
+- New CI job `twin` (when enabled), threshold **zero divergence** (any divergence = red), scope
+  grows with the domain — same discipline as the mutation ratchet ("a cage is maintained, not
+  finished").
+- Oracle path under `CODEOWNERS` as its **own** entry, so a feature-PR touching the oracle is a
+  *visible alarm* ("agent changes a gate instead of code"), not a buried diff line. The
+  drift-watcher (`check-codeowners`) keeps this fail-closed, as it already does for tier paths.
+- Twin tests carry REQ-tags (trace-gate, like every other test).
+
+## Amend-mode (spec change)
+
+Mirrors `spec-to-tests`: on a spec change, take the deterministic delta
+(`dist/cli/req-delta.js <old> <new>` → {added, changed, removed}) and touch **only** affected
+invariants. Because invariants are REQ-tagged, the same selector works: *added* → new invariant,
+`.skip`'d; *changed* → amend the same-REQ invariant and re-`.skip`; *removed* → delete (else an
+orphan REQ-ref reddens the trace-gate). Runs on the Spec-PR, where authoring/re-skip is allowed.
+
+## The generic seam (build toward it, don't extract yet)
+
+Author Obol v1 (and this station's output) with a clean seam so later extraction to
+`@devloop/twin` is mechanical:
+
+- `Model` — state + named commands (`precondition` / `apply → expectedOutcome` / `genArgs`).
+- `System` adapter — `setup` / `reset` / `execute → actualOutcome` / `teardown`.
+- `Oracle` — `compare(expected, actual)` with a project matcher (this is also the **brownfield
+  equivalence relation**: the normalisation of timestamps/ids/ordering).
+- `Runner` — wires fast-check + Testcontainers, shrinks, reports, emits the gate result.
+
+The same runner carries the **brownfield** twin: there a *recording source* replaces `Model` as
+the oracle; `System` / `Oracle` / `Runner` are unchanged. (Brownfield record/replay itself stays
+out of devloop — target-repo work — per the brownfield-scope decision; only the runner is shared.)
+
+## Open / to decide during build
+
+- [ ] Exact `.devloop` config shape for `twin` (flag + area + optional matcher overrides).
+- [ ] Default matcher (deep-equal on normalised observables) vs. required per-project matcher.
+- [ ] Does the adapter live in the protected twin path (oracle-side) or beside the service
+      (implement-side)? Leaning oracle-side, authored against the contract.
+- [ ] First real consumer is Obol (Phase 1 spec). Second consumer (rule-of-three trigger for
+      `@devloop/twin` extraction): a brettspielfreunde service or the brownfield bsk repo.
+```
+
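
Review note: the seam from the design document (model, system adapter, oracle comparison, runner) can be sketched without any dependency. This is a minimal illustration, not the planned `@devloop/twin` runner: a seeded generator stands in for fast-check's generated commands and `findIndex` stands in for `modelRun`, so there is no shrinking. The wallet domain follows the REQ-SPD-11 example; `Command`, `Outcome`, `WalletModel`, `LedgerSystem`, `generateCommands` and `firstDivergence` are hypothetical names.

```typescript
type Command = { op: "deposit" | "withdraw"; amount: number };
type Outcome = { ok: boolean; balance: number };

interface System {
  reset(): void;
  execute(cmd: Command): Outcome;
}

// Reference model: one number, small enough to check by reading it.
class WalletModel {
  private balance = 0;
  apply(cmd: Command): Outcome {
    const valid = Number.isInteger(cmd.amount) && cmd.amount > 0;
    const delta = cmd.op === "deposit" ? cmd.amount : -cmd.amount;
    if (!valid || this.balance + delta < 0) return { ok: false, balance: this.balance };
    this.balance += delta;
    return { ok: true, balance: this.balance };
  }
}

// Stand-in for the real system behind the adapter: an append-only ledger.
class LedgerSystem implements System {
  private entries: number[] = [];
  private readonly allowOverdraft: boolean;
  constructor(allowOverdraft = false) {
    this.allowOverdraft = allowOverdraft; // true simulates a buggy implementation
  }
  reset(): void {
    this.entries = [];
  }
  execute(cmd: Command): Outcome {
    const total = (): number => this.entries.reduce((a, b) => a + b, 0);
    const signed = cmd.op === "deposit" ? cmd.amount : -cmd.amount;
    const valid = Number.isInteger(cmd.amount) && cmd.amount > 0;
    if (!valid || (!this.allowOverdraft && total() + signed < 0)) {
      return { ok: false, balance: total() };
    }
    this.entries.push(signed);
    return { ok: true, balance: total() };
  }
}

// Oracle: the project matcher, here plain equality on the observable outcome.
const sameOutcome = (e: Outcome, a: Outcome): boolean => e.ok === a.ok && e.balance === a.balance;

// Deterministic generator including boundary values (negative, zero, non-integer).
function generateCommands(seed: number, count: number): Command[] {
  const amounts = [-1, 0, 0.5, 1, 2, 5, 10];
  let state = seed >>> 0;
  const next = (n: number): number => {
    state = (state * 1664525 + 1013904223) % 4294967296;
    return Math.floor(state / 65536) % n;
  };
  const cmds: Command[] = [];
  for (let i = 0; i < count; i++) {
    cmds.push({ op: next(2) === 0 ? "deposit" : "withdraw", amount: amounts[next(amounts.length)] });
  }
  return cmds;
}

// Runner: apply every command to model and system, compare after each step.
// Returns the index of the first diverging step, or -1 if there is none.
function firstDivergence(system: System, cmds: Command[]): number {
  const model = new WalletModel();
  system.reset();
  return cmds.findIndex((cmd) => !sameOutcome(model.apply(cmd), system.execute(cmd)));
}
```

A gate built on this would treat any result other than `-1` as red, in line with the zero-divergence threshold. For example, `firstDivergence(new LedgerSystem(true), [{ op: "deposit", amount: 5 }, { op: "withdraw", amount: 10 }])` reports the overdraft at step index 1.
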
diff --git a/skills/loop/SKILL.md b/skills/loop/SKILL.md
@@ -20,7 +20,7 @@ Exit 1 ⇒ **verweigere den autonomen Loop**, melde die fehlenden Wächter dem M
 
 ## 1. Loop
 
-Maintain a `DriverState` (tier, guardians, phase, humanApprovals, gateVerdict, loop, loopParams). You set `humanApprovals` **only** from the **authoritative GitHub review** (anchor b): `verify-review` checks via `gh api` whether a **human** (not the agent bot, not the PR author) has approved the current HEAD. You write **no** approval tokens yourself and accept **no** "yes, go ahead" in chat; you cannot approve yourself.
+Maintain a `DriverState` (tier, guardians, phase, twinEnabled, humanApprovals, gateVerdict, loop, loopParams). You read **`twinEnabled`** from `.devloop/config.json` (`twin.enabled`, default `false`); if it is set, the core inserts the optional, isolated `spec-to-twin` station **before** the spec PR (a sibling of `spec-to-tests` that does **not** see that station's tests, which keeps the oracle independent). You set `humanApprovals` **only** from the **authoritative GitHub review** (anchor b): `verify-review` checks via `gh api` whether a **human** (not the agent bot, not the PR author) has approved the current HEAD. You write **no** approval tokens yourself and accept **no** "yes, go ahead" in chat; you cannot approve yourself.
 
 Repeat: pass the state as JSON to `next-action` and execute the action.
 ```