test(longhaul): add long-haul test driver core by WentingWu666666 · Pull Request #405 · documentdb/documentdb-kubernetes-operator

WentingWu666666 · 2026-06-16T19:52:06Z

Part 2 of 5: long-haul test driver core

Part 2 of the #348 split (design doc landed in #400, shared test module in #401).

What this PR adds

A new test/longhaul Go module — a long-running canary driver that exercises a persistent DocumentDB cluster with continuous workload, scheduled disruptive operations, health monitoring, and reporting. See the design in docs/designs/long-haul-test-design.md (merged via #400) for the why.

Packages:

Path	Role
`cmd/longhaul`	Driver entrypoint; wires workload + ops + monitor + reporting
`config`	Env-driven configuration with validation + unit tests
`workload`	Writer / verifier / atomic metrics (acked-write durability oracle)
`operations`	Weighted-random scheduler + scale/upgrade ops
`monitor`	Cluster client (consumes `test/shared`), health gating, leak detector
`journal`	Thread-safe event log + per-operation outage-policy oracle
`report`	Markdown report, periodic checkpoint to ConfigMap, GitHub Actions alerts
`deploy/setup.yaml`	Bootstrap manifest (namespace + Secret + DocumentDB CR) for canary cluster
`Dockerfile`, `README.md`	Container image + operator-facing documentation

Code reuse with e2e

Per the design doc (and #401), the driver consumes:

test/shared/documentdb — Get, PatchSpec, PatchInstances, WaitHealthy, Delete, ReadyStatus (no duplicated CR logic; no hardcoded readiness strings)
test/shared/mongo — NewClient, BuildURI, Ping, Seed, Count, TLS plumbing (no raw mongo.Connect)

go.mod uses replace for both modules so changes to shared helpers flow into the driver immediately.

Excluded from this PR (follow-ups)

PR-3: CI/CD workflows (.github/workflows/longhaul-*.yaml) + deploy/deployment.yaml + deploy/rbac.yaml
PR-4: auto-upgrade logic for operator + DocumentDB version drift

Verification

cd test/longhaul
go build ./...   # clean
go test ./...    # 6/6 packages pass: config / journal / monitor / operations / report / workload

Reviewer notes

The whole driver is dormant until feat: long-haul (canary) test driver for the operator (#220) #348's CI workflows land in PR-3 — this PR is "the binary exists and its unit tests pass". No CI runs the driver yet.
monitor/k8sclient.go is the typed controller-runtime wrapper that addresses xgerman's feat: long-haul (canary) test driver for the operator (#220) #348 review feedback (dropped raw dynamic.Interface + hardcoded GVR).
operations/upgrade.go reads the desired DocumentDB version from a longhaul-versions ConfigMap that PR-4's workflow will populate; today it no-ops if the ConfigMap is missing.

Adds the test/longhaul Go module: a long-running canary driver that exercises a persistent DocumentDB cluster with continuous workload, scheduled disruptive operations, health monitoring, and reporting. Scope of this PR (Part 2/5 of documentdb#348 split): * cmd/longhaul: driver entrypoint * config: env-driven configuration * workload: writer/verifier/metrics * operations: scheduler + scale + upgrade ops * monitor: cluster client (consumes test/shared), health, leak detector * journal: thread-safe event log + outage policy oracle * report: markdown report, periodic checkpoint, GitHub Actions alerts * Dockerfile, README, deploy/setup.yaml bootstrap manifest Reuses test/shared/documentdb (lifecycle) and test/shared/mongo (data plane) extracted in documentdb#401 — no duplication of CR/Mongo logic. Excluded (in follow-up PRs): CI/CD workflows + deploy/deployment.yaml + deploy/rbac.yaml (PR-3), auto-upgrade logic (PR-4). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot

Pull request overview

Adds the core of the new test/longhaul Go module: a long-running “canary” driver that applies continuous workload, schedules disruptive operations (scale/upgrade), monitors cluster health/resources, journals events/windows, and produces checkpoint + markdown reporting.

Changes:

Introduces workload writer/verifier with atomic metrics and uniqueness/indexing for durability checks.
Adds the disruption scheduler and operations (scale up/down, DocumentDB version upgrade) with outage-policy tracking via a journal.
Adds Kubernetes monitoring client (typed controller-runtime + test/shared), leak detection, and reporting/checkpoint + GitHub Actions annotations.

Reviewed changes

Copilot reviewed 41 out of 42 changed files in this pull request and generated 16 comments.

Show a summary per file

File	Description
test/longhaul/cmd/longhaul/main.go	Driver wiring: config, mongo, monitors, workload, scheduler, checkpoint reporting
test/longhaul/config/config.go	Env-driven configuration defaults + validation
test/longhaul/config/config_test.go	Unit tests for env parsing/validation/enabled flag
test/longhaul/config/suite_test.go	Ginkgo suite bootstrap for config package
test/longhaul/workload/metrics.go	Atomic counters + snapshot helpers for workload metrics
test/longhaul/workload/metrics_test.go	Unit tests for metrics snapshot/rates/concurrency safety
test/longhaul/workload/writer.go	Majority-write workload writer + index creation
test/longhaul/workload/writer_test.go	Unit tests for checksum and writer counter behavior
test/longhaul/workload/verifier.go	Majority-read verifier scanning per-writer sequences for gaps/checksum mismatches
test/longhaul/workload/verifier_test.go	Constructor/resume-point tests (with integration boundary noted)
test/longhaul/workload/suite_test.go	Ginkgo suite bootstrap for workload package
test/longhaul/operations/scheduler.go	Weighted operation scheduler with cooldown + steady-state gating
test/longhaul/operations/scheduler_test.go	Unit tests for selection, disruption windows, and error journaling
test/longhaul/operations/scale.go	Scale up/down operations with budgets + recovery waits
test/longhaul/operations/scale_test.go	Unit tests for scale operation bounds and policy values
test/longhaul/operations/upgrade.go	Upgrade operation reading desired version from ConfigMap + readiness waits
test/longhaul/operations/upgrade_test.go	Unit tests for desired-version reads, preconditions, and outage policy
test/longhaul/operations/suite_test.go	Ginkgo suite bootstrap for operations package
test/longhaul/monitor/k8sclient.go	Typed K8s/CR client + metrics-server integration; uses `test/shared/documentdb`
test/longhaul/monitor/k8sclient_test.go	Unit tests with fake clientsets/controller-runtime fake client
test/longhaul/monitor/health.go	Health monitor with steady-state tracking and wait helper
test/longhaul/monitor/health_test.go	Unit tests for steady-state logic and wait behavior
test/longhaul/monitor/leakdetect.go	Linear regression leak detector for memory/CPU trends
test/longhaul/monitor/leakdetect_test.go	Unit tests for slope/leak threshold logic
test/longhaul/monitor/suite_test.go	Ginkgo suite bootstrap for monitor package
test/longhaul/journal/journal.go	Thread-safe append-only event journal + disruption window tracking
test/longhaul/journal/journal_test.go	Unit tests for events/windows/policy violation detection
test/longhaul/journal/policy.go	Outage policy + window semantics (duration, exceeded policy)
test/longhaul/journal/policy_test.go	Unit tests for policy/window semantics
test/longhaul/journal/suite_test.go	Ginkgo suite bootstrap for journal package
test/longhaul/report/report.go	Markdown report generator (metrics, windows, leaks, recent events)
test/longhaul/report/report_test.go	Unit tests for report content/sections/truncation
test/longhaul/report/checkpoint.go	Periodic checkpoint reporter writing ConfigMap + stdout + annotations
test/longhaul/report/checkpoint_test.go	Unit tests using fake clientset for ConfigMap persistence
test/longhaul/report/alert.go	GitHub Actions workflow command annotations (error/notice/warning)
test/longhaul/report/alert_test.go	Unit tests capturing stdout for annotations
test/longhaul/report/suite_test.go	Ginkgo suite bootstrap for report package
test/longhaul/deploy/setup.yaml	Bootstrap namespace/secret/DocumentDB CR manifest for canary cluster
test/longhaul/Dockerfile	Container image build for the longhaul driver
test/longhaul/README.md	Usage/configuration documentation for running + deploying longhaul
test/longhaul/go.mod	New longhaul module definition with replace to local operator/shared modules
test/longhaul/go.sum	Dependency lockfile for the longhaul module

documentdb-triage-tool · 2026-06-16T21:47:03Z

🤖 Auto-triaged by documentdb-triage-tool.

Applied: test, dependencies, documentation, enhancement
Project fields suggested: Component test · Priority P2 · Effort XL · Status In Progress
Confidence: 0.88 (mixed)

Reasoning

component from path globs (test, dependencies, docs); effort from diff stats (5029+0 LOC, 42 files); LLM: Large multi-package long-haul test driver core spanning several new Go packages, a Dockerfile, manifests, and docs — cross-cutting test infrastructure addition as part of a 5-PR series.

If a label is wrong, remove it manually and ping @patty-chow so the rules can be tuned. The bot will not re-label items that already have component labels.

Copilot inline comments (16): - writer.go: fix StartWriters doc; switch to math/rand/v2 (auto-seeded) - main.go: header says Deployment (not Job); drop-collection gated on LONGHAUL_RESET_DATA - upgrade.go: wrap+return error from GetCurrentDocumentDBImageTag - checkpoint.go: bounded ctx on final emit; final PASS persists as PASS (not RUNNING) - alert.go: escape % / CR / LF in GH Actions annotation messages - deploy/setup.yaml: password is placeholder; plugin field uses spec.plugins.sidecarInjectorName - Dockerfile + README: build from repo root so go.mod replace paths resolve; Go 1.25; update relationship table for shared module Local code-review-agent (3 critical): - C1: writer no longer pre-increments seq; advance only on success or DupKey ack, so non-DupKey failures retry the same seq instead of being mistaken for data loss - C2: main calls reporter.EmitFinal() synchronously before exit so the authoritative final verdict reaches the longhaul-report ConfigMap - C3: writers seed seq from FindOne(sort:-1) on startup so a Deployment-driven pod restart resumes past the prior tip without colliding with the unique (writer_id, seq) index Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Copilot <223556219+Copilot@users.noreply.github.com>

xgerman

Review: structure, duplication & terseness

Focused pass on code duplication, structure, and style (the existing Copilot threads already covered correctness/security and are addressed in 3e9ac70). The module is well-documented and go build/go vet/go test ./... all pass locally. 👍

🟠 Must fix — gofmt (CI blocker)

Four files are not gofmt-clean (extra-space alignment), so make fmt/lint will fail:
operations/scale.go, operations/scheduler.go, monitor/health.go, report/checkpoint.go.
Fix: gofmt -w operations/ monitor/ report/.

🟡 Duplication / structure (highest-value dedup)

ScaleUp and ScaleDown are ~95% identical — collapse into one parameterized type (inline).
main.buildRESTConfig re-implements monitor.buildRestConfig and builds a second clientset that K8sClusterClient already owns (inline).
shareddb.Get(...) repeated 4× in k8sclient.go — extract a getCR helper (inline).
The ticker + select{ctx.Done/ticker.C} loop recurs 5× (writer/verifier/scheduler/health/metrics) — optional shared helper (inline).

🟢 Style / terseness / dead code

journal.DefaultOutagePolicy() is unused (inline).
report.go: b.WriteString(fmt.Sprintf(...)) → fmt.Fprintf(&b, ...) (inline).
checkpoint.go: map[string]interface{} → map[string]any (inline).
metrics.go: MetricsSnapshot doc comment should start with the type name (inline).

No blocking design issues — just the gofmt fix for CI plus the two consolidations (#1, #2) which are the biggest wins. Nice work on the durability oracle and the bounded verifier scan.

xgerman · 2026-06-17T21:48:33Z

+}
+
+// ScaleDown decreases spec.instancesPerNode by 1 (HA scale dimension; range 1-3).
+type ScaleDown struct {


ScaleUp (above) and ScaleDown are ~95% identical: same fields, and Precondition/Execute differ only by +1/-1, >=/<=, the bound, and a couple of constants. Consider collapsing into one parameterized type to halve the file:

type scaleOp struct { client monitor.ClusterClient healthMon *monitor.HealthMonitor name string weight int delta int // +1 up, -1 down limit int // max (up) or min (down) atLimit func(cur int) bool recovery time.Duration policy journal.OutagePolicy }

Thin NewScaleUp/NewScaleDown constructors keep call sites unchanged.

(Also: Name()/Weight() on lines 37-38 and 98-99 fail gofmt — extra spaces.)

Fixed in ed498df — collapsed into a shared scaleOp struct parameterized by delta/bound/boundKind/policy. ScaleUp and ScaleDown are now thin embedding wrappers so call sites and the existing NewScaleUp/NewScaleDown constructors are unchanged. Tests use small maxInstances()/minInstances() accessors instead of field access. Net: scale.go went from 136 -> ~130 lines and has zero duplication. Also gofmt-fixed (the extra-space alignment you flagged on lines 37-38 / 98-99 is gone).

xgerman · 2026-06-17T21:48:33Z

+	return client, clientset, nil
+}
+
+func buildRESTConfig() (*rest.Config, error) {


buildRESTConfig here re-implements monitor.buildRestConfig (k8sclient.go:86) almost verbatim, and initK8sClient then creates a second clientset (line 231) even though K8sClusterClient already built one (k8sclient.go:59). Expose the existing one instead:

func (k *K8sClusterClient) Clientset() kubernetes.Interface { return k.clientset }

Then delete buildRESTConfig and the redundant NewForConfig — one client, one config path.

Fixed in ed498df — added K8sClusterClient.Clientset() kubernetes.Interface, deleted main.buildRESTConfig and initK8sClient entirely, and dropped the unused k8s.io/client-go/{kubernetes,rest,tools/clientcmd} imports from main.go. main.go now does the one monitor.NewK8sClusterClient(...) call and reuses clusterClient.Clientset() for the reporter. main.go shrank by ~40 lines.

xgerman · 2026-06-17T21:48:33Z

+	// Get the DocumentDB CR status via the shared typed helper. Using
+	// shareddb.IsHealthy keeps the readiness predicate consistent with
+	// the e2e suite (single source of truth for ReadyStatus).
+	dd, err := shareddb.Get(ctx, k.crClient, types.NamespacedName{Namespace: k.namespace, Name: k.clusterName})


shareddb.Get(ctx, k.crClient, types.NamespacedName{Namespace: k.namespace, Name: k.clusterName}) is repeated verbatim in 4 methods (GetClusterHealth, GetInstancesPerNode, GetCurrentDocumentDBImageTag, UpgradeDocumentDB). Extract a private helper:

func (k *K8sClusterClient) getCR(ctx context.Context) (*previewv1.DocumentDB, error) { return shareddb.Get(ctx, k.crClient, types.NamespacedName{Namespace: k.namespace, Name: k.clusterName}) }

Fixed in ed498df — added private (k *K8sClusterClient) getCR(ctx) helper. All four call sites (GetClusterHealth, GetInstancesPerNode, GetCurrentDocumentDBImageTag, UpgradeDocumentDB) now use it. types/shareddb are still imported once at the top for the helper itself.

xgerman · 2026-06-17T21:48:33Z

+	ticker := time.NewTicker(healthCheckInterval)
+	defer ticker.Stop()
+
+	for {


This ticker + select { <-ctx.Done() / <-ticker.C } loop is duplicated 5× across the module (here, writer.Run, verifier.Run, scheduler.Run, runMetricsSampling). Optional, since they span packages, but a tiny shared helper would DRY them:

func tick(ctx context.Context, d time.Duration, fn func(context.Context)) { t := time.NewTicker(d); defer t.Stop() for { select { case <-ctx.Done(): return; case <-t.C: fn(ctx) } } }

Skipped intentionally. Agree the pattern recurs 5 times, but a cross-package utility for a 5-line idiom isn't a clear win — it would mean either a new test/longhaul/internal package (overhead) or sticking it in one of the existing packages (awkward import direction). Each call site is also a Run() method with package-specific logging, so the saved lines per site are closer to 2-3. Happy to revisit if more components get added in PR-3/PR-4.

xgerman · 2026-06-17T21:48:34Z

+}
+
+// DefaultOutagePolicy returns a conservative policy suitable for most operations.
+func DefaultOutagePolicy() OutagePolicy {


DefaultOutagePolicy() is dead code — it's never referenced anywhere (each operation builds its OutagePolicy inline). Either use it as the base for the scale/upgrade policies or remove it.

Keeping DefaultOutagePolicy() — it's referenced from 3 test files (journal/journal_test.go:70-71, journal/policy_test.go:80, operations/scheduler_test.go:38) where the tests need a "reasonable defaults" stand-in to construct fake operations / open disruption windows. Removing it would push three literal OutagePolicy{...} copies into the tests, which is worse. Open to renaming if you'd prefer something like BaselineOutagePolicy() to signal it's a test/baseline default rather than a production default.

xgerman · 2026-06-17T21:48:34Z

+	b.WriteString("# Long Haul Test Report\n\n")
+
+	// Header
+	b.WriteString(fmt.Sprintf("**Result:** %s\n", s.Result))


b.WriteString(fmt.Sprintf(...)) appears ~15× in this function. fmt.Fprintf(&b, ...) is the idiomatic, terser form (a strings.Builder implements io.Writer), e.g.:

fmt.Fprintf(&b, "**Result:** %s\n", s.Result)

Fixed in ed498df — all ~15 occurrences switched to fmt.Fprintf(&b, ...). Pure b.WriteString("literal") calls (no formatting needed) kept as-is since Fprintf would be unnecessary indirection there.

xgerman · 2026-06-17T21:48:34Z

+	}
+
+	// Also log the summary as JSON for structured log consumers.
+	summaryJSON, _ := json.Marshal(map[string]interface{}{


map[string]interface{} → map[string]any (Go 1.18+, and go.mod is 1.25). Minor: fmt.Printf("\n%s\n", "=== CHECKPOINT REPORT ===") on lines 93/95 can just be fmt.Println(...). Note this file also fails gofmt (the JSON map literal alignment).

Fixed in ed498df — map[string]any, Printf("\n%s\n", "...") collapsed to Println("\n...") / Print("...\n\n"), and gofmt -w cleaned the JSON map alignment. go vet and gofmt -l are both clean now.

xgerman · 2026-06-17T21:48:34Z

+	}
+}
+
+// Snapshot returns a point-in-time copy of all metric values.


This doc comment is attached to type MetricsSnapshot but reads "Snapshot returns a point-in-time copy...". Per Go convention the comment should start with the identifier it documents, e.g. // MetricsSnapshot is a point-in-time copy of all metric values.

Fixed in ed498df — comment now reads // MetricsSnapshot is a point-in-time copy of all metric values.

- gofmt -w across the module (4 files were not gofmt-clean: scale.go, scheduler.go, health.go, checkpoint.go) — fixes the CI lint gate - collapse ScaleUp/ScaleDown into a shared scaleOp struct (parameterized by delta/bound/policy); thin ScaleUp / ScaleDown wrappers keep call sites and tests unchanged - expose K8sClusterClient.Clientset() and drop the duplicate REST-config + clientset construction in cmd/longhaul/main.go - extract K8sClusterClient.getCR() helper; the namespaced-name lookup no longer repeats in every method - report.go: b.WriteString(fmt.Sprintf(...)) -> fmt.Fprintf(&b, ...) - checkpoint.go: map[string]interface{} -> map[string]any; collapse double Printf wrappers - metrics.go: MetricsSnapshot doc comment now starts with the type name Skipped (with rationale in PR replies): - cross-package tick helper (xgerman marked optional; ~5 lines saved is not worth a new utility package) - removing DefaultOutagePolicy (still used by journal/operations tests as a reasonable-defaults stand-in) go build / go vet / go test ./... all clean; gofmt -l reports no remaining issues. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Copilot <223556219+Copilot@users.noreply.github.com>

Verifier: - Drop NumVerifiers config knob and StartVerifiers helper; always run exactly one verifier. - Multiple verifiers were a bug, not a feature: each one independently scanned the whole collection and wrote to the SHARED Metrics.VerifyGapsDetected counter, so a single real gap of N missing seqs was reported as N x NumVerifiers in alerts, and read load on the cluster scaled with NumVerifiers (turning the verifier's own load into a confounding signal for the test report). - One verifier is sufficient: the scan is stateless against the DB and bounded by the per-writer nextSeq resume map, so adding more instances adds noise, not coverage. - LONGHAUL_NUM_VERIFIERS env var removed (along with the cfg field, default, parse, validate, README row, and config_test references). Metrics: - Add per-field godoc on every Metrics field clarifying ack-vs-fail semantics, what triggers FAIL (verifier-side only), DupKey ack behavior, scan-cycle vs document semantics, and StartTime / Elapsed restart behavior. - The user asked 'what is VerifyGapsDetected' — that the question came up at all is evidence the field comments were insufficient. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Copilot <223556219+Copilot@users.noreply.github.com>

- Metrics: one-line godoc per field; drop the multi-paragraph prose. - Verifier: drop the dead 'id' field (only one verifier exists), update test/log sites accordingly. - Fix incorrect nextSeq comment: once we step past a gap, expectedSeq advances to seq+1, so a late-arriving fill at the missing seq is filtered out by 'seq >= nextSeq' on subsequent cycles — gaps are counted exactly once, not re-checked. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Copilot <223556219+Copilot@users.noreply.github.com>

Without this, the verifier could miss data loss in the tail: if seqs above the verifier's last-observed point were acked then lost, and no later writes arrived to expose the gap via the per-doc check, the loss was invisible. Concretely: writer acks 1..110, DB loses 101..110, writer crashes/idles. Verifier scans seq>=101, sees nothing, and the durability oracle stays clean — false negative. Fix: snapshot writer.Seq() at the top of each per-writer scan, bound the Find filter to seq <= that snapshot, and after the per-doc loop check whether expectedSeq <= snapshot. Any residual is tail loss; count it under VerifyGapsDetected and log a distinct 'tail loss' message. Advance nextSeq to snapshot+1 so the next cycle scans only newly committed seqs. Race safety: reading writer.Seq() BEFORE the Find guarantees we never count in-flight writes (which commit after Seq() advances). Verified expected <= actual in steady state, so missing>0 only when seqs are truly gone. Also drops the aggregation-based writer-id discovery (we now iterate the writers slice the verifier owns) and adds Writer.Seq() public accessor + unit test. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Copilot <223556219+Copilot@users.noreply.github.com>

Verifier: extract auditDocs as a pure function that takes the decoded docs and (expectedSeq, maxSeq) and returns (newExpectedSeq, internalGaps, tailLoss, checksumErrors) plus a list of structured findings the caller logs. This decouples the gap/tail/checksum math from cursor iteration and the journal, so it's unit-testable without a live mongo. Writer: introduce writeBackend interface (insert + isDuplicate + highestSeq) and wrap *mongo.Collection in mongoBackend. writeOne and Resume now go through the interface, so we can stub it in tests. Adds 11 audit table cases (no docs, clean run, gap-in-middle, gap-at-start, empty-with-tail, tail-after-docs, gap+tail, late-write-exposes-gap, single checksum, multi-checksum, nextSeq advance invariant) and 9 writer cases (success, DupKey-as-ack, transient error, retry-after-error, monotonic advance; Resume empty / has-data / on-error). Behavior unchanged. Net: ~+200 lines test, ~+50/-40 lines source. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Copilot <223556219+Copilot@users.noreply.github.com>

Merge audit_test.go into verifier_test.go and writer_backend_test.go into writer_test.go. Each source file now has exactly one matching _test.go (Go convention), no test changes. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Copilot <223556219+Copilot@users.noreply.github.com>

- Add doc.go with the package-level overview explaining why report.go / checkpoint.go / alert.go are separate (data model vs. orchestration vs. CI surface) so the next reader doesn't have to ask. - Add per-field godoc on Summary clarifying who populates each field, when it's empty, and the PASS/FAIL relationship to LeakAnalysis (warning only, not a verdict flip). - Drop the narrow one-liner package doc from report.go in favor of the richer one in doc.go. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Copilot <223556219+Copilot@users.noreply.github.com>

AllowedDowntime: set by every operation's OutagePolicy() and DefaultOutagePolicy() but never read by ExceededPolicy (which is the only consumer). Misleading public API surface — readers assume it's enforced. Removed the field, all assignments, and the TODO. The downtime budget can be re-introduced together with the writer-side timestamp plumbing it would actually require. Event.Metadata: declared, allocated, threaded through Record's signature, set to nil at every Info/Warn/Error call site, and never read. Dropped the field and simplified Record(level, component, message). No behavior change. All callers were updated; tests still pass. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Copilot <223556219+Copilot@users.noreply.github.com>

On a multi-day run the in-memory events slice grew unbounded — ~200 events/day baseline (op open/close + verifier batch errors) means ~6k after a month, harmless on its own but Events() returns a full slice copy on every checkpoint (~30s), so allocation/GC pressure scales with run length. Cap the ring at 10k with 1k headroom so the trim is amortized to one copy every ~1k appends, not paid on every append. The rendered report only ever surfaces the last 20 events so the cap is invisible to consumers. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Copilot <223556219+Copilot@users.noreply.github.com>

The metricsAvail field is read by MetricsAvailable() and written by GetPodMetrics() which can be called from different goroutines. Replace plain bool with sync/atomic.Bool for safe concurrent access. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: wenting <wenting@wentings-MacBook-Pro.local>

xgerman · 2026-06-22T19:19:37Z

+	j := journal.New()
+	metrics := workload.NewMetrics()
+
+	// Connect to MongoDB.


xgerman · 2026-06-22T19:19:47Z

+	metrics := workload.NewMetrics()
+
+	// Connect to MongoDB.
+	if cfg.MongoURI == "" {


Rename user-facing and internal references from MongoDB/mongo to DocumentDB/documentdb per reviewer feedback. Includes env vars (LONGHAUL_MONGO_URI → LONGHAUL_DOCUMENTDB_URI), struct fields, variable names, type names (mongoBackend → docdbBackend), comments, and documentation. Go driver import paths and mongodb:// connection strings are unchanged as they reference the wire protocol. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: wenting <wenting@wentings-MacBook-Pro.local>

alaye-ms · 2026-06-23T15:26:16Z

+		copy(trimmed, j.events[len(j.events)-maxEvents:])
+		j.events = trimmed
+	}
+	j.mu.Unlock()


Can we use the defer pattern used below here as well to handle any panics and errors

alaye-ms · 2026-06-23T15:52:41Z

+
+	Describe("metrics", func() {
+		It("MetricsAvailable mirrors the metricsAvail flag", func() {
+			k := &K8sClusterClient{metricsAvail: true}


It might just be an issue with my environment, but this portion gives an error for me: "cannot use true (untyped bool constant) as atomic.Bool value in struct literal - compiler[IncompatibleAssign]"

alaye-ms · 2026-06-23T15:56:53Z

+func (l *LeakDetector) AddSample(s ResourceSample) {
+	l.mu.Lock()
+	l.samples = append(l.samples, s)
+	l.mu.Unlock()


Can we do the defer pattern here as well

Copilot AI review requested due to automatic review settings June 16, 2026 19:52

WentingWu666666 requested review from alaye-ms, hossain-rayhan and xgerman as code owners June 16, 2026 19:52

Copilot started reviewing on behalf of WentingWu666666 June 16, 2026 19:52 View session

Copilot AI reviewed Jun 16, 2026

View reviewed changes

WentingWu666666 marked this pull request as draft June 16, 2026 20:23

documentdb-triage-tool Bot added dependencies Pull requests that update a dependency file documentation Improvements or additions to documentation enhancement New feature or request test labels Jun 16, 2026

WentingWu666666 changed the title ~~test(longhaul): add long-haul test driver core (Part 2/5 of #348)~~ test(longhaul): add long-haul test driver core Jun 17, 2026

WentingWu666666 marked this pull request as ready for review June 17, 2026 17:26

xgerman reviewed Jun 17, 2026

View reviewed changes

Copilot AI added 5 commits June 17, 2026 18:17

WentingWu666666 marked this pull request as draft June 18, 2026 16:29

Copilot AI added 4 commits June 18, 2026 12:29

WentingWu666666 marked this pull request as ready for review June 18, 2026 16:53

xgerman requested changes Jun 22, 2026

View reviewed changes

alaye-ms approved these changes Jun 23, 2026

View reviewed changes

hossain-rayhan approved these changes Jun 23, 2026

View reviewed changes

Conversation

WentingWu666666 commented Jun 16, 2026

Part 2 of 5: long-haul test driver core

What this PR adds

Code reuse with e2e

Excluded from this PR (follow-ups)

Verification

Reviewer notes

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

documentdb-triage-tool Bot commented Jun 16, 2026

Uh oh!

xgerman left a comment

Choose a reason for hiding this comment

Review: structure, duplication & terseness

🟠 Must fix — gofmt (CI blocker)

🟡 Duplication / structure (highest-value dedup)

🟢 Style / terseness / dead code

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone