Migrate snapshots to task system, rather than ringbuffer#7947
Migrate snapshots to task system, rather than ringbuffer#7947eddyashton wants to merge 16 commits into
Conversation
…to snapshot_tasks
There was a problem hiding this comment.
Pull request overview
This PR migrates snapshot persistence away from host/enclave ringbuffer messages to the task system, with snapshot writing performed in-process via a new SnapshotWriter. This simplifies the snapshot flow by removing host-side pre-allocation and ringbuffer snapshot message types, and by scheduling serialisation and persistence as ordered task actions.
Changes:
- Introduces
snapshots::SnapshotWriterand wiresSnapshotterto persist committed snapshots via ordered task actions. - Removes snapshot-specific ringbuffer message types and deletes the host-side
SnapshotManager/dispatcher handlers. - Updates unit tests to validate snapshot persistence by inspecting committed snapshot files on disk.
Custom instructions used:
.github/copilot-instructions.md.github/instructions/reviewing.instructions.md
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| src/snapshots/snapshot_writer.h | Adds new snapshot persistence helper that writes snapshot+receipt to disk and renames to .committed. |
| src/snapshots/snapshot_manager.h | Deletes the old host-side snapshot ringbuffer manager and uv-based commit flow. |
| src/node/snapshotter.h | Reworks snapshot flow to use OrderedTasks with serialise/persist actions and calls SnapshotWriter. |
| src/node/node_state.h | Constructs Snapshotter with snapshot directory instead of a ringbuffer writer path. |
| src/enclave/enclave.h | Removes handling of the host->enclave snapshot_allocated message. |
| src/consensus/ledger_enclave_types.h | Removes snapshot-related ringbuffer message definitions and payload declarations. |
| src/host/run.cpp | Removes host-side SnapshotManager construction and message handler registration. |
| src/host/test/ledger.cpp | Updates snapshot tests to use SnapshotWriter and directory-based snapshot discovery helpers. |
| src/node/test/snapshotter.cpp | Updates snapshotter tests to validate on-disk committed snapshot outputs rather than ringbuffer traffic. |
|
@copilot resolve the merge conflicts in this pull request |
This is interesting, and I have mixed feelings about it. So long as we at least tend towards a consistent situation where uv is only used for networking, it's not too objectionable. I am worried that the ledger I/O is inextricably tied to the networking though, and won't be movable out of the UV loop without a wider change. |
| try | ||
| { | ||
| snapshot_fd = | ||
| files::open_fd(full_snapshot_path, O_CREAT | O_EXCL | O_WRONLY); |
There was a problem hiding this comment.
I assume here we are deferring cleanup of broken files to another process? rather than cleaning up?
(the O_EXCL)
| } | ||
| } | ||
|
|
||
| close_fd(snapshot_fd, file_name); |
There was a problem hiding this comment.
Is there any error handling for close_fd?
|
Some comments from an AI review, pruned for relevance / realism:
|
Perhaps not an idiomatic example of this migration, but a relatively isolated one to start with.
ccf::Tasks (technicallyActions, for fiddly implementation details), and the old snapshot-related ringbuffer messages are gone.SnapshotManageris near-stateless, depending only on config-provided directories. So rather than continuing to construct it early and threading a reference to theSnapshotter, we just push it inside and have it owned by theSnapshotter. Its renamed too, and significantly simplified.uvfor the actual file writes! We previously did these on auvworker pool thread, to make sure they didn't block the "main" thread. They're now on a task system thread. Worth exploring if we can find even more async-y APIs for these, to minimise any potentially-slow blocking calls, but that's a separate (and broader) concern.Robot assisted, here's some before-and-after diagrams.
The separate task system actors in the latter diagram aren't particularly helpful, but the big "this is a simplification" evidence should be "all the arrows flow to the right".
sequenceDiagram participant C as Consensus thread<br/>(enclave) participant T as Snapshot task thread<br/>(enclave) participant RB as Ringbuffer participant H as Host main thread<br/>(dispatcher) participant W as Host uv workpool C->>C: Snapshotter.commit(idx) C->>T: add_task(SnapshotTask) T->>T: serialise snapshot + commit evidence tx Note over T: save bytes in pending_snapshots[generation] T->>RB: snapshot_allocate(version, evidence_idx, size, generation) RB->>H: SnapshotManager.allocate pending buffer H->>RB: snapshot_allocated(span, generation) RB->>C: write_snapshot(span, generation) C->>C: memcpy bytes to host buffer, mark stored=true Note over C: later: evidence committed + cose sig + tree available C->>C: update_indices gate passes C->>RB: snapshot_commit(version, receipt) RB->>H: SnapshotManager.commit_snapshot H->>H: open + write(snapshot) + write(receipt) H->>W: uv_queue_work(fsync + close + rename .committed)sequenceDiagram participant C as Consensus thread participant OT as OrderedTasks[generation] participant TS as Task worker<br/>(serialise action) participant TP as Task worker<br/>(persist action) participant F as Filesystem C->>C: Snapshotter.commit(idx) C->>OT: schedule_snapshot(): create OrderedTasks + add Serialise action OT->>TS: run SerialiseSnapshotAction TS->>TS: serialise snapshot + commit evidence tx TS->>TS: store bytes/digests in shared SnapshotSerialisation Note over C: later: evidence_idx recorded, idx>evidence_idx, COSE sig + tree present C->>C: update_indices gate passes C->>OT: add PersistSnapshotAction (queued after serialise) OT->>TP: run PersistSnapshotAction TP->>TP: build receipt from COSE/tree/evidence TP->>TP: SnapshotWriter.persist_snapshot(...) TP->>F: open tmp snapshot file TP->>F: write(snapshot bytes) TP->>F: write(receipt bytes) TP->>F: fsync + close TP->>F: rename to .committed