fix(longhorn): stop instance-manager OOM by exempting longhorn-system from generated LimitRange #2180

Open
devantler wants to merge 1 commit into main from claude/fix-longhorn-limitrange-oom

Conversation

@devantler

Contributor

🤖 Generated by Claude Code (live investigation of the prod platform)

Problem (active prod incident, 2026-06-20)

All three CNPG Postgres clusters (coroot-db, umami-db, wedding-db) went degraded with replicas stuck in a timeline-divergence / pg_rewind loop, wedging the infrastructure and apps Flux Kustomizations and taking Coroot offline (its metadata DB crashlooped). Root-caused to the Longhorn storage data plane, not Postgres.

Root cause

The add-ns-quota Kyverno policy (patched in this file) generates a LimitRange into every namespace that stamps a default memory limit of 512Mi onto any container that doesn't set its own. longhorn-system was not exempt, so it applied to the Longhorn instance-manager pods.
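For orientation, the generated object would look roughly like the sketch below. Only the 512Mi default memory limit and the existence of default requests come from this description; the object name and the request value are placeholders, not copied from the policy.

```yaml
# Sketch of the LimitRange that add-ns-quota generates into a namespace.
# Assumptions: metadata.name and the defaultRequest value are hypothetical.
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limitrange      # hypothetical name
  namespace: longhorn-system
spec:
  limits:
    - type: Container
      default:
        memory: 512Mi           # limit stamped onto containers that set none
      defaultRequest:
        memory: 128Mi           # assumed value; stamped onto containers without a request
```

Because the defaults are applied at pod admission, a pod that deliberately omits a memory limit still ends up with one.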

longhorn-manager creates instance-manager pods without a memory limit by design (Longhorn's documented recommendation). The injected 512Mi cap is too low: idle RSS already sits at ~335–355Mi (roughly 65–70% of the cap), so the extra memory a volume rebuild needs pushes the container over the limit and the kernel OOM-kills it. The chain:

LimitRange 512Mi on instance-manager
  → OOMKill during rebuild (evidence destroyed with the pod)
  → longhorn-manager: "Instance manager pod ... is deleted or not running, recreating"
  → every replica engine on that node faults (DetachedUnexpectedly / Faulted)
  → CNPG primary loses its volume → failover
  → a lagging standby is promoted → timeline fork → old replica stuck needing pg_rewind
  → CNPG cluster never reaches quorum → Flux health checks time out

Evidence: instance-manager pods were recreated node by node (worker-1/2 ~4h apart, worker-3 minutes before the investigation); kubectl top shows the instance-managers at ~70% of the 512Mi limit at idle; longhorn-manager logs show the recreate→remount cascade; synchronous replication is already correctly configured, so the divergence is a symptom of the storage faults, not a Postgres misconfiguration.

Fix

Exempt longhorn-system from both generated rules:

  • rule 1 (LimitRange) — removes the OOM-inducing memory limit so instance-managers run unconstrained, per Longhorn guidance.
  • rule 0 (ResourceQuota) — required in tandem: without the LimitRange-supplied default requests, pods would be rejected by the requests.memory quota. Treated like the existing flux-system exemption.
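The two exemptions could be expressed along these lines. This is a fragment under assumptions, not the actual diff: the rule names are hypothetical, the generate blocks are left out, and the real policy may select namespaces differently.

```yaml
# Sketch of the add-ns-quota exclude blocks after the change.
# Assumptions: rule names are hypothetical; each rule's generate block is
# unchanged and omitted, so this fragment is not a complete policy.
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: add-ns-quota
spec:
  rules:
    - name: generate-resourcequota    # rule 0 (hypothetical name)
      match:
        any:
          - resources:
              kinds:
                - Namespace
      exclude:
        any:
          - resources:
              names:
                - flux-system         # existing exemption
                - longhorn-system     # added by this PR
    - name: generate-limitrange       # rule 1 (hypothetical name)
      match:
        any:
          - resources:
              kinds:
                - Namespace
      exclude:
        any:
          - resources:
              names:
                - flux-system
                - longhorn-system
```

Excluding the namespace from only the LimitRange rule would leave the requests.memory quota in force with no default requests to satisfy it, which is why both rules carry the same entry.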

Longhorn's components are sized by their own VPAs (longhorn-manager, csi-*, longhorn-ui) or by the operator, not by the generic tenant quota.

Validation

  • kubectl kustomize k8s/bases/infrastructure/cluster-policies/ builds; longhorn-system renders in both add-ns-quota exclude rules.
  • kubectl kustomize k8s/clusters/{prod,local}/ build.

Operational follow-up (not in this PR)

This PR only prevents recurrence. The already-diverged replicas (e.g. coroot-db-3) won't self-heal from the pg_rewind loop. Once this merges and Longhorn stabilises, an operator should re-clone them (delete the stuck instance's PVC so CNPG rebuilds from the primary). Kyverno should remove the now-orphaned LimitRange/ResourceQuota in longhorn-system via synchronize: true; delete them manually if it doesn't.

🤖 Generated with Claude Code

The add-ns-quota Kyverno policy generates a LimitRange into every
namespace that stamps a default memory limit (512Mi) onto any container
that omits one. longhorn-manager creates instance-manager pods without a
memory limit by design; the injected 512Mi cap OOM-kills them during
volume rebuilds (idle RSS already sits at ~70% of the cap). When an
instance-manager dies, longhorn-manager deletes+recreates the pod, which
faults every replica engine on that node (DetachedUnexpectedly) and
cascades into CNPG primary failover and Postgres timeline divergence
across coroot-db / umami-db / wedding-db, wedging the infrastructure and
apps Flux Kustomizations cluster-wide (observed 2026-06-20).

Exempt longhorn-system from both the LimitRange rule (drops the
OOM-inducing limit) and the ResourceQuota rule (pods that no longer
receive the LimitRange-supplied requests would otherwise be rejected by
the requests.memory quota), treating it as platform infra like
flux-system. Longhorn's components are sized by their own VPAs/operator.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>