Skip to content

test(longhaul): add long-haul test driver core#405

Open
WentingWu666666 wants to merge 13 commits into
documentdb:mainfrom
WentingWu666666:developer/wentingwu/longhaul-core
Open

test(longhaul): add long-haul test driver core#405
WentingWu666666 wants to merge 13 commits into
documentdb:mainfrom
WentingWu666666:developer/wentingwu/longhaul-core

Conversation

@WentingWu666666

Copy link
Copy Markdown
Collaborator

Part 2 of 5: long-haul test driver core

Part 2 of the #348 split (design doc landed in #400, shared test module in #401).

What this PR adds

A new test/longhaul Go module — a long-running canary driver that exercises a persistent DocumentDB cluster with continuous workload, scheduled disruptive operations, health monitoring, and reporting. See the design in docs/designs/long-haul-test-design.md (merged via #400) for the why.

Packages:

Path Role
cmd/longhaul Driver entrypoint; wires workload + ops + monitor + reporting
config Env-driven configuration with validation + unit tests
workload Writer / verifier / atomic metrics (acked-write durability oracle)
operations Weighted-random scheduler + scale/upgrade ops
monitor Cluster client (consumes test/shared), health gating, leak detector
journal Thread-safe event log + per-operation outage-policy oracle
report Markdown report, periodic checkpoint to ConfigMap, GitHub Actions alerts
deploy/setup.yaml Bootstrap manifest (namespace + Secret + DocumentDB CR) for canary cluster
Dockerfile, README.md Container image + operator-facing documentation

Code reuse with e2e

Per the design doc (and #401), the driver consumes:

  • test/shared/documentdbGet, PatchSpec, PatchInstances, WaitHealthy, Delete, ReadyStatus (no duplicated CR logic; no hardcoded readiness strings)
  • test/shared/mongoNewClient, BuildURI, Ping, Seed, Count, TLS plumbing (no raw mongo.Connect)

go.mod uses replace for both modules so changes to shared helpers flow into the driver immediately.

Excluded from this PR (follow-ups)

  • PR-3: CI/CD workflows (.github/workflows/longhaul-*.yaml) + deploy/deployment.yaml + deploy/rbac.yaml
  • PR-4: auto-upgrade logic for operator + DocumentDB version drift

Verification

cd test/longhaul
go build ./...   # clean
go test ./...    # 6/6 packages pass: config / journal / monitor / operations / report / workload

Reviewer notes

Adds the test/longhaul Go module: a long-running canary driver that

exercises a persistent DocumentDB cluster with continuous workload,

scheduled disruptive operations, health monitoring, and reporting.

Scope of this PR (Part 2/5 of documentdb#348 split):

  * cmd/longhaul: driver entrypoint

  * config: env-driven configuration

  * workload: writer/verifier/metrics

  * operations: scheduler + scale + upgrade ops

  * monitor: cluster client (consumes test/shared), health, leak detector

  * journal: thread-safe event log + outage policy oracle

  * report: markdown report, periodic checkpoint, GitHub Actions alerts

  * Dockerfile, README, deploy/setup.yaml bootstrap manifest

Reuses test/shared/documentdb (lifecycle) and test/shared/mongo

(data plane) extracted in documentdb#401 — no duplication of CR/Mongo logic.

Excluded (in follow-up PRs): CI/CD workflows + deploy/deployment.yaml

+ deploy/rbac.yaml (PR-3), auto-upgrade logic (PR-4).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings June 16, 2026 19:52

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds the core of the new test/longhaul Go module: a long-running “canary” driver that applies continuous workload, schedules disruptive operations (scale/upgrade), monitors cluster health/resources, journals events/windows, and produces checkpoint + markdown reporting.

Changes:

  • Introduces workload writer/verifier with atomic metrics and uniqueness/indexing for durability checks.
  • Adds the disruption scheduler and operations (scale up/down, DocumentDB version upgrade) with outage-policy tracking via a journal.
  • Adds Kubernetes monitoring client (typed controller-runtime + test/shared), leak detection, and reporting/checkpoint + GitHub Actions annotations.

Reviewed changes

Copilot reviewed 41 out of 42 changed files in this pull request and generated 16 comments.

Show a summary per file
File Description
test/longhaul/cmd/longhaul/main.go Driver wiring: config, mongo, monitors, workload, scheduler, checkpoint reporting
test/longhaul/config/config.go Env-driven configuration defaults + validation
test/longhaul/config/config_test.go Unit tests for env parsing/validation/enabled flag
test/longhaul/config/suite_test.go Ginkgo suite bootstrap for config package
test/longhaul/workload/metrics.go Atomic counters + snapshot helpers for workload metrics
test/longhaul/workload/metrics_test.go Unit tests for metrics snapshot/rates/concurrency safety
test/longhaul/workload/writer.go Majority-write workload writer + index creation
test/longhaul/workload/writer_test.go Unit tests for checksum and writer counter behavior
test/longhaul/workload/verifier.go Majority-read verifier scanning per-writer sequences for gaps/checksum mismatches
test/longhaul/workload/verifier_test.go Constructor/resume-point tests (with integration boundary noted)
test/longhaul/workload/suite_test.go Ginkgo suite bootstrap for workload package
test/longhaul/operations/scheduler.go Weighted operation scheduler with cooldown + steady-state gating
test/longhaul/operations/scheduler_test.go Unit tests for selection, disruption windows, and error journaling
test/longhaul/operations/scale.go Scale up/down operations with budgets + recovery waits
test/longhaul/operations/scale_test.go Unit tests for scale operation bounds and policy values
test/longhaul/operations/upgrade.go Upgrade operation reading desired version from ConfigMap + readiness waits
test/longhaul/operations/upgrade_test.go Unit tests for desired-version reads, preconditions, and outage policy
test/longhaul/operations/suite_test.go Ginkgo suite bootstrap for operations package
test/longhaul/monitor/k8sclient.go Typed K8s/CR client + metrics-server integration; uses test/shared/documentdb
test/longhaul/monitor/k8sclient_test.go Unit tests with fake clientsets/controller-runtime fake client
test/longhaul/monitor/health.go Health monitor with steady-state tracking and wait helper
test/longhaul/monitor/health_test.go Unit tests for steady-state logic and wait behavior
test/longhaul/monitor/leakdetect.go Linear regression leak detector for memory/CPU trends
test/longhaul/monitor/leakdetect_test.go Unit tests for slope/leak threshold logic
test/longhaul/monitor/suite_test.go Ginkgo suite bootstrap for monitor package
test/longhaul/journal/journal.go Thread-safe append-only event journal + disruption window tracking
test/longhaul/journal/journal_test.go Unit tests for events/windows/policy violation detection
test/longhaul/journal/policy.go Outage policy + window semantics (duration, exceeded policy)
test/longhaul/journal/policy_test.go Unit tests for policy/window semantics
test/longhaul/journal/suite_test.go Ginkgo suite bootstrap for journal package
test/longhaul/report/report.go Markdown report generator (metrics, windows, leaks, recent events)
test/longhaul/report/report_test.go Unit tests for report content/sections/truncation
test/longhaul/report/checkpoint.go Periodic checkpoint reporter writing ConfigMap + stdout + annotations
test/longhaul/report/checkpoint_test.go Unit tests using fake clientset for ConfigMap persistence
test/longhaul/report/alert.go GitHub Actions workflow command annotations (error/notice/warning)
test/longhaul/report/alert_test.go Unit tests capturing stdout for annotations
test/longhaul/report/suite_test.go Ginkgo suite bootstrap for report package
test/longhaul/deploy/setup.yaml Bootstrap namespace/secret/DocumentDB CR manifest for canary cluster
test/longhaul/Dockerfile Container image build for the longhaul driver
test/longhaul/README.md Usage/configuration documentation for running + deploying longhaul
test/longhaul/go.mod New longhaul module definition with replace to local operator/shared modules
test/longhaul/go.sum Dependency lockfile for the longhaul module

Comment thread test/longhaul/workload/writer.go Outdated
Comment thread test/longhaul/cmd/longhaul/main.go
Comment thread test/longhaul/cmd/longhaul/main.go Outdated
Comment thread test/longhaul/operations/upgrade.go Outdated
Comment thread test/longhaul/operations/scheduler.go
Comment thread test/longhaul/README.md Outdated
Comment thread test/longhaul/README.md Outdated
Comment thread test/longhaul/Dockerfile
Comment thread test/longhaul/deploy/setup.yaml Outdated
Comment thread test/longhaul/report/checkpoint.go Outdated
@WentingWu666666 WentingWu666666 marked this pull request as draft June 16, 2026 20:23
@documentdb-triage-tool documentdb-triage-tool Bot added dependencies Pull requests that update a dependency file documentation Improvements or additions to documentation enhancement New feature or request test labels Jun 16, 2026
@documentdb-triage-tool

Copy link
Copy Markdown

🤖 Auto-triaged by documentdb-triage-tool.

Applied: test, dependencies, documentation, enhancement
Project fields suggested: Component test · Priority P2 · Effort XL · Status In Progress
Confidence: 0.88 (mixed)

Reasoning

component from path globs (test, dependencies, docs); effort from diff stats (5029+0 LOC, 42 files); LLM: Large multi-package long-haul test driver core spanning several new Go packages, a Dockerfile, manifests, and docs — cross-cutting test infrastructure addition as part of a 5-PR series.

If a label is wrong, remove it manually and ping @patty-chow so the rules can be tuned. The bot will not re-label items that already have component labels.

@WentingWu666666 WentingWu666666 changed the title test(longhaul): add long-haul test driver core (Part 2/5 of #348) test(longhaul): add long-haul test driver core Jun 17, 2026
Copilot inline comments (16):

- writer.go: fix StartWriters doc; switch to math/rand/v2 (auto-seeded)

- main.go: header says Deployment (not Job); drop-collection gated on LONGHAUL_RESET_DATA

- upgrade.go: wrap+return error from GetCurrentDocumentDBImageTag

- checkpoint.go: bounded ctx on final emit; final PASS persists as PASS (not RUNNING)

- alert.go: escape % / CR / LF in GH Actions annotation messages

- deploy/setup.yaml: password is placeholder; plugin field uses spec.plugins.sidecarInjectorName

- Dockerfile + README: build from repo root so go.mod replace paths resolve; Go 1.25; update relationship table for shared module

Local code-review-agent (3 critical):

- C1: writer no longer pre-increments seq; advance only on success or DupKey ack, so non-DupKey failures retry the same seq instead of being mistaken for data loss

- C2: main calls reporter.EmitFinal() synchronously before exit so the authoritative final verdict reaches the longhaul-report ConfigMap

- C3: writers seed seq from FindOne(sort:-1) on startup so a Deployment-driven pod restart resumes past the prior tip without colliding with the unique (writer_id, seq) index

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Copilot <223556219+Copilot@users.noreply.github.com>
@WentingWu666666 WentingWu666666 marked this pull request as ready for review June 17, 2026 17:26

@xgerman xgerman left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review: structure, duplication & terseness

Focused pass on code duplication, structure, and style (the existing Copilot threads already covered correctness/security and are addressed in 3e9ac70). The module is well-documented and go build/go vet/go test ./... all pass locally. 👍

🟠 Must fix — gofmt (CI blocker)

Four files are not gofmt-clean (extra-space alignment), so make fmt/lint will fail:
operations/scale.go, operations/scheduler.go, monitor/health.go, report/checkpoint.go.
Fix: gofmt -w operations/ monitor/ report/.

🟡 Duplication / structure (highest-value dedup)

  1. ScaleUp and ScaleDown are ~95% identical — collapse into one parameterized type (inline).
  2. main.buildRESTConfig re-implements monitor.buildRestConfig and builds a second clientset that K8sClusterClient already owns (inline).
  3. shareddb.Get(...) repeated 4× in k8sclient.go — extract a getCR helper (inline).
  4. The ticker + select{ctx.Done/ticker.C} loop recurs 5× (writer/verifier/scheduler/health/metrics) — optional shared helper (inline).

🟢 Style / terseness / dead code

  • journal.DefaultOutagePolicy() is unused (inline).
  • report.go: b.WriteString(fmt.Sprintf(...))fmt.Fprintf(&b, ...) (inline).
  • checkpoint.go: map[string]interface{}map[string]any (inline).
  • metrics.go: MetricsSnapshot doc comment should start with the type name (inline).

No blocking design issues — just the gofmt fix for CI plus the two consolidations (#1, #2) which are the biggest wins. Nice work on the durability oracle and the bounded verifier scan.

Comment thread test/longhaul/operations/scale.go Outdated
}

// ScaleDown decreases spec.instancesPerNode by 1 (HA scale dimension; range 1-3).
type ScaleDown struct {

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ScaleUp (above) and ScaleDown are ~95% identical: same fields, and Precondition/Execute differ only by +1/-1, >=/<=, the bound, and a couple of constants. Consider collapsing into one parameterized type to halve the file:

type scaleOp struct {
    client    monitor.ClusterClient
    healthMon *monitor.HealthMonitor
    name      string
    weight    int
    delta     int        // +1 up, -1 down
    limit     int        // max (up) or min (down)
    atLimit   func(cur int) bool
    recovery  time.Duration
    policy    journal.OutagePolicy
}

Thin NewScaleUp/NewScaleDown constructors keep call sites unchanged.

(Also: Name()/Weight() on lines 37-38 and 98-99 fail gofmt — extra spaces.)

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in ed498df — collapsed into a shared scaleOp struct parameterized by delta/bound/boundKind/policy. ScaleUp and ScaleDown are now thin embedding wrappers so call sites and the existing NewScaleUp/NewScaleDown constructors are unchanged. Tests use small maxInstances()/minInstances() accessors instead of field access. Net: scale.go went from 136 -> ~130 lines and has zero duplication. Also gofmt-fixed (the extra-space alignment you flagged on lines 37-38 / 98-99 is gone).

Comment thread test/longhaul/cmd/longhaul/main.go Outdated
return client, clientset, nil
}

func buildRESTConfig() (*rest.Config, error) {

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

buildRESTConfig here re-implements monitor.buildRestConfig (k8sclient.go:86) almost verbatim, and initK8sClient then creates a second clientset (line 231) even though K8sClusterClient already built one (k8sclient.go:59). Expose the existing one instead:

func (k *K8sClusterClient) Clientset() kubernetes.Interface { return k.clientset }

Then delete buildRESTConfig and the redundant NewForConfig — one client, one config path.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in ed498df — added K8sClusterClient.Clientset() kubernetes.Interface, deleted main.buildRESTConfig and initK8sClient entirely, and dropped the unused k8s.io/client-go/{kubernetes,rest,tools/clientcmd} imports from main.go. main.go now does the one monitor.NewK8sClusterClient(...) call and reuses clusterClient.Clientset() for the reporter. main.go shrank by ~40 lines.

Comment thread test/longhaul/monitor/k8sclient.go Outdated
// Get the DocumentDB CR status via the shared typed helper. Using
// shareddb.IsHealthy keeps the readiness predicate consistent with
// the e2e suite (single source of truth for ReadyStatus).
dd, err := shareddb.Get(ctx, k.crClient, types.NamespacedName{Namespace: k.namespace, Name: k.clusterName})

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shareddb.Get(ctx, k.crClient, types.NamespacedName{Namespace: k.namespace, Name: k.clusterName}) is repeated verbatim in 4 methods (GetClusterHealth, GetInstancesPerNode, GetCurrentDocumentDBImageTag, UpgradeDocumentDB). Extract a private helper:

func (k *K8sClusterClient) getCR(ctx context.Context) (*previewv1.DocumentDB, error) {
    return shareddb.Get(ctx, k.crClient, types.NamespacedName{Namespace: k.namespace, Name: k.clusterName})
}

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in ed498df — added private (k *K8sClusterClient) getCR(ctx) helper. All four call sites (GetClusterHealth, GetInstancesPerNode, GetCurrentDocumentDBImageTag, UpgradeDocumentDB) now use it. types/shareddb are still imported once at the top for the helper itself.

ticker := time.NewTicker(healthCheckInterval)
defer ticker.Stop()

for {

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This ticker + select { <-ctx.Done() / <-ticker.C } loop is duplicated 5× across the module (here, writer.Run, verifier.Run, scheduler.Run, runMetricsSampling). Optional, since they span packages, but a tiny shared helper would DRY them:

func tick(ctx context.Context, d time.Duration, fn func(context.Context)) {
    t := time.NewTicker(d); defer t.Stop()
    for { select { case <-ctx.Done(): return; case <-t.C: fn(ctx) } }
}

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Skipped intentionally. Agree the pattern recurs 5 times, but a cross-package utility for a 5-line idiom isn't a clear win — it would mean either a new test/longhaul/internal package (overhead) or sticking it in one of the existing packages (awkward import direction). Each call site is also a Run() method with package-specific logging, so the saved lines per site are closer to 2-3. Happy to revisit if more components get added in PR-3/PR-4.

}

// DefaultOutagePolicy returns a conservative policy suitable for most operations.
func DefaultOutagePolicy() OutagePolicy {

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DefaultOutagePolicy() is dead code — it's never referenced anywhere (each operation builds its OutagePolicy inline). Either use it as the base for the scale/upgrade policies or remove it.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Keeping DefaultOutagePolicy() — it's referenced from 3 test files (journal/journal_test.go:70-71, journal/policy_test.go:80, operations/scheduler_test.go:38) where the tests need a "reasonable defaults" stand-in to construct fake operations / open disruption windows. Removing it would push three literal OutagePolicy{...} copies into the tests, which is worse. Open to renaming if you'd prefer something like BaselineOutagePolicy() to signal it's a test/baseline default rather than a production default.

Comment thread test/longhaul/report/report.go Outdated
b.WriteString("# Long Haul Test Report\n\n")

// Header
b.WriteString(fmt.Sprintf("**Result:** %s\n", s.Result))

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

b.WriteString(fmt.Sprintf(...)) appears ~15× in this function. fmt.Fprintf(&b, ...) is the idiomatic, terser form (a strings.Builder implements io.Writer), e.g.:

fmt.Fprintf(&b, "**Result:** %s\n", s.Result)

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in ed498df — all ~15 occurrences switched to fmt.Fprintf(&b, ...). Pure b.WriteString("literal") calls (no formatting needed) kept as-is since Fprintf would be unnecessary indirection there.

Comment thread test/longhaul/report/checkpoint.go Outdated
}

// Also log the summary as JSON for structured log consumers.
summaryJSON, _ := json.Marshal(map[string]interface{}{

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

map[string]interface{}map[string]any (Go 1.18+, and go.mod is 1.25). Minor: fmt.Printf("\n%s\n", "=== CHECKPOINT REPORT ===") on lines 93/95 can just be fmt.Println(...). Note this file also fails gofmt (the JSON map literal alignment).

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in ed498dfmap[string]any, Printf("\n%s\n", "...") collapsed to Println("\n...") / Print("...\n\n"), and gofmt -w cleaned the JSON map alignment. go vet and gofmt -l are both clean now.

Comment thread test/longhaul/workload/metrics.go Outdated
}
}

// Snapshot returns a point-in-time copy of all metric values.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doc comment is attached to type MetricsSnapshot but reads "Snapshot returns a point-in-time copy...". Per Go convention the comment should start with the identifier it documents, e.g. // MetricsSnapshot is a point-in-time copy of all metric values.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in ed498df — comment now reads // MetricsSnapshot is a point-in-time copy of all metric values.

Copilot AI added 5 commits June 17, 2026 18:17
- gofmt -w across the module (4 files were not gofmt-clean: scale.go, scheduler.go, health.go, checkpoint.go) — fixes the CI lint gate

- collapse ScaleUp/ScaleDown into a shared scaleOp struct (parameterized by delta/bound/policy); thin ScaleUp / ScaleDown wrappers keep call sites and tests unchanged

- expose K8sClusterClient.Clientset() and drop the duplicate REST-config + clientset construction in cmd/longhaul/main.go

- extract K8sClusterClient.getCR() helper; the namespaced-name lookup no longer repeats in every method

- report.go: b.WriteString(fmt.Sprintf(...)) -> fmt.Fprintf(&b, ...)

- checkpoint.go: map[string]interface{} -> map[string]any; collapse double Printf wrappers

- metrics.go: MetricsSnapshot doc comment now starts with the type name

Skipped (with rationale in PR replies):

- cross-package tick helper (xgerman marked optional; ~5 lines saved is not worth a new utility package)

- removing DefaultOutagePolicy (still used by journal/operations tests as a reasonable-defaults stand-in)

go build / go vet / go test ./... all clean; gofmt -l reports no remaining issues.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Copilot <223556219+Copilot@users.noreply.github.com>
Verifier:

- Drop NumVerifiers config knob and StartVerifiers helper; always run exactly one verifier.

- Multiple verifiers were a bug, not a feature: each one independently scanned the whole collection and wrote to the SHARED Metrics.VerifyGapsDetected counter, so a single real gap of N missing seqs was reported as N x NumVerifiers in alerts, and read load on the cluster scaled with NumVerifiers (turning the verifier's own load into a confounding signal for the test report).

- One verifier is sufficient: the scan is stateless against the DB and bounded by the per-writer nextSeq resume map, so adding more instances adds noise, not coverage.

- LONGHAUL_NUM_VERIFIERS env var removed (along with the cfg field, default, parse, validate, README row, and config_test references).

Metrics:

- Add per-field godoc on every Metrics field clarifying ack-vs-fail semantics, what triggers FAIL (verifier-side only), DupKey ack behavior, scan-cycle vs document semantics, and StartTime / Elapsed restart behavior.

- The user asked 'what is VerifyGapsDetected' — that the question came up at all is evidence the field comments were insufficient.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Metrics: one-line godoc per field; drop the multi-paragraph prose.

- Verifier: drop the dead 'id' field (only one verifier exists), update test/log sites accordingly.

- Fix incorrect nextSeq comment: once we step past a gap, expectedSeq advances to seq+1, so a late-arriving fill at the missing seq is filtered out by 'seq >= nextSeq' on subsequent cycles — gaps are counted exactly once, not re-checked.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Copilot <223556219+Copilot@users.noreply.github.com>
Without this, the verifier could miss data loss in the tail: if seqs above the verifier's last-observed point were acked then lost, and no later writes arrived to expose the gap via the per-doc check, the loss was invisible. Concretely: writer acks 1..110, DB loses 101..110, writer crashes/idles. Verifier scans seq>=101, sees nothing, and the durability oracle stays clean — false negative.

Fix: snapshot writer.Seq() at the top of each per-writer scan, bound the Find filter to seq <= that snapshot, and after the per-doc loop check whether expectedSeq <= snapshot. Any residual is tail loss; count it under VerifyGapsDetected and log a distinct 'tail loss' message. Advance nextSeq to snapshot+1 so the next cycle scans only newly committed seqs.

Race safety: reading writer.Seq() BEFORE the Find guarantees we never count in-flight writes (which commit after Seq() advances). Verified expected <= actual in steady state, so missing>0 only when seqs are truly gone.

Also drops the aggregation-based writer-id discovery (we now iterate the writers slice the verifier owns) and adds Writer.Seq() public accessor + unit test.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Copilot <223556219+Copilot@users.noreply.github.com>
Verifier: extract auditDocs as a pure function that takes the decoded docs and (expectedSeq, maxSeq) and returns (newExpectedSeq, internalGaps, tailLoss, checksumErrors) plus a list of structured findings the caller logs. This decouples the gap/tail/checksum math from cursor iteration and the journal, so it's unit-testable without a live mongo.

Writer: introduce writeBackend interface (insert + isDuplicate + highestSeq) and wrap *mongo.Collection in mongoBackend. writeOne and Resume now go through the interface, so we can stub it in tests.

Adds 11 audit table cases (no docs, clean run, gap-in-middle, gap-at-start, empty-with-tail, tail-after-docs, gap+tail, late-write-exposes-gap, single checksum, multi-checksum, nextSeq advance invariant) and 9 writer cases (success, DupKey-as-ack, transient error, retry-after-error, monotonic advance; Resume empty / has-data / on-error).

Behavior unchanged. Net: ~+200 lines test, ~+50/-40 lines source.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Copilot <223556219+Copilot@users.noreply.github.com>
@WentingWu666666 WentingWu666666 marked this pull request as draft June 18, 2026 16:29
Copilot AI added 4 commits June 18, 2026 12:29
Merge audit_test.go into verifier_test.go and writer_backend_test.go into writer_test.go. Each source file now has exactly one matching _test.go (Go convention), no test changes.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Add doc.go with the package-level overview explaining why report.go / checkpoint.go / alert.go are separate (data model vs. orchestration vs. CI surface) so the next reader doesn't have to ask.

- Add per-field godoc on Summary clarifying who populates each field, when it's empty, and the PASS/FAIL relationship to LeakAnalysis (warning only, not a verdict flip).

- Drop the narrow one-liner package doc from report.go in favor of the richer one in doc.go.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Copilot <223556219+Copilot@users.noreply.github.com>
AllowedDowntime: set by every operation's OutagePolicy() and DefaultOutagePolicy() but never read by ExceededPolicy (which is the only consumer). Misleading public API surface — readers assume it's enforced. Removed the field, all assignments, and the TODO. The downtime budget can be re-introduced together with the writer-side timestamp plumbing it would actually require.

Event.Metadata: declared, allocated, threaded through Record's signature, set to nil at every Info/Warn/Error call site, and never read. Dropped the field and simplified Record(level, component, message).

No behavior change. All callers were updated; tests still pass.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Copilot <223556219+Copilot@users.noreply.github.com>
On a multi-day run the in-memory events slice grew unbounded — ~200 events/day baseline (op open/close + verifier batch errors) means ~6k after a month, harmless on its own but Events() returns a full slice copy on every checkpoint (~30s), so allocation/GC pressure scales with run length. Cap the ring at 10k with 1k headroom so the trim is amortized to one copy every ~1k appends, not paid on every append.

The rendered report only ever surfaces the last 20 events so the cap is invisible to consumers.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Copilot <223556219+Copilot@users.noreply.github.com>
@WentingWu666666 WentingWu666666 marked this pull request as ready for review June 18, 2026 16:53
The metricsAvail field is read by MetricsAvailable() and written by
GetPodMetrics() which can be called from different goroutines. Replace
plain bool with sync/atomic.Bool for safe concurrent access.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: wenting <wenting@wentings-MacBook-Pro.local>
Comment thread test/longhaul/cmd/longhaul/main.go Outdated
j := journal.New()
metrics := workload.NewMetrics()

// Connect to MongoDB.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DocumentDB

Comment thread test/longhaul/cmd/longhaul/main.go Outdated
metrics := workload.NewMetrics()

// Connect to MongoDB.
if cfg.MongoURI == "" {

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same

Rename user-facing and internal references from MongoDB/mongo to
DocumentDB/documentdb per reviewer feedback. Includes env vars
(LONGHAUL_MONGO_URI → LONGHAUL_DOCUMENTDB_URI), struct fields,
variable names, type names (mongoBackend → docdbBackend), comments,
and documentation.

Go driver import paths and mongodb:// connection strings are unchanged
as they reference the wire protocol.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: wenting <wenting@wentings-MacBook-Pro.local>
copy(trimmed, j.events[len(j.events)-maxEvents:])
j.events = trimmed
}
j.mu.Unlock()

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use the defer pattern used below here as well to handle any panics and errors


Describe("metrics", func() {
It("MetricsAvailable mirrors the metricsAvail flag", func() {
k := &K8sClusterClient{metricsAvail: true}

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might just be an issue with my environment, but this portion gives an error for me: "cannot use true (untyped bool constant) as atomic.Bool value in struct literal - compiler[IncompatibleAssign]"

func (l *LeakDetector) AddSample(s ResourceSample) {
l.mu.Lock()
l.samples = append(l.samples, s)
l.mu.Unlock()

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we do the defer pattern here as well

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

dependencies Pull requests that update a dependency file documentation Improvements or additions to documentation enhancement New feature or request test

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants