fix(kyverno): pin admission-controller memory to stop OOM-induced pod-creation outage#2181

Open
devantler wants to merge 1 commit into main from claude/fix-kyverno-admission-oom

@devantler

Contributor

🤖 Generated by Claude Code (live investigation of the prod platform)

Problem

The ksail-operator HelmRelease was stuck mid-upgrade, with its ReplicaSet emitting:

FailedCreate: Internal error occurred: failed calling webhook
"ivpol.mutate.kyverno.svc-fail": ... context deadline exceeded

This is a cluster-wide pod-creation outage symptom, not a ksail-operator bug.

Root cause

The Kyverno admission controller was running with the chart-default limits.memory: 384Mi and was OOMKilled (exit code 137) under combined image-verification (cosign) and policy-evaluation load. While the admission pods were dead or restarting, the verify-image-signatures webhooks, which are registered with failurePolicy: Fail, could not be served, so the API server rejected every Pod/Deployment/Job CREATE in non-excluded namespaces. That is exactly the FailedCreate seen on ksail-operator, and it is why the upgrade wedged.
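
To make the fail-closed behaviour concrete, here is a minimal sketch of the kind of webhook registration involved. It is illustrative only: Kyverno generates and manages its own webhook configurations, and the metadata name, service name, timeoutSeconds, and rules below are assumptions rather than values taken from this PR. Only the webhook name and failurePolicy: Fail come from the description above.

```yaml
# Illustrative fragment, not copied from the cluster.
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: example-kyverno-image-verification  # hypothetical name
webhooks:
  - name: ivpol.mutate.kyverno.svc-fail     # webhook named in the FailedCreate event
    failurePolicy: Fail                     # a call error or timeout rejects the request
    timeoutSeconds: 10                      # assumed value
    admissionReviewVersions: ["v1"]
    sideEffects: None
    clientConfig:
      service:
        namespace: kyverno
        name: kyverno-svc                   # assumed service name
    rules:
      - apiGroups: [""]
        apiVersions: ["v1"]
        operations: ["CREATE"]
        resources: ["pods"]                 # assumed scope, for illustration
```

With failurePolicy: Fail, a webhook call that errors or times out causes the API server to reject the request, which matches the "context deadline exceeded" rejection seen while the admission pods were down.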

The existing comment claimed that auto-vpa.yaml would right-size this Deployment, so "a hard limit here would fight VPA." That assumption is false: Kyverno deliberately excludes its own namespace from its admission/generate policies (to avoid an eviction deadlock), so auto-vpa never generates a VPA for the Kyverno controllers. This was verified: there are zero VPAs in ns/kyverno. The admission controller was therefore stranded permanently on the too-low chart default, with nothing to raise it.

Fix

Set an explicit admissionController.resources (requests: cpu 100m, memory 256Mi; limits: memory 1Gi) and correct the misleading comment. The 1Gi limit leaves generous headroom (live idle usage is ~133–149Mi; the OOM happened at the 384Mi limit) while staying far under auto-vpa's 6Gi maxAllowed ceiling.
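
As a rough sketch, the values described above would look something like the following. This is not the PR's literal diff. It assumes the resources block sits directly under admissionController, as the description states. Some Kyverno chart versions nest container resources one level deeper (under admissionController.container), so the key path is worth confirming against the values.yaml of the chart version in use.

```yaml
# Sketch of the Helm values described in this PR, not the actual diff.
# Assumption: resources is a direct child of admissionController.
admissionController:
  resources:
    requests:
      cpu: 100m
      memory: 256Mi
    limits:
      # The chart default was 384Mi, which is where the OOMKill occurred.
      # No CPU limit is set here, matching the description.
      memory: 1Gi
```

If the rendered Deployment still shows a 384Mi limit after this change, that would suggest the chart is reading the resources from a different key.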

Security is unchanged: the webhooks stay failurePolicy: Fail. The reliability problem was the OOM that made fail-closed dangerous; removing the OOM is the right lever. Node-layer enforcement (Talos ImageVerificationConfig) remains in place regardless.

Notes / scope

  • The reports/background controllers also run unmanaged by VPA, but their memory growth is already mitigated by skipResourceFilters and the report-kind filters in this file. Only the admission controller caused the outage, so this PR is scoped to it.
  • The cluster had self-recovered by investigation time (2/2 admission replicas Ready), but the memory cap was unchanged, so this change is what prevents recurrence.

Validation

kubectl kustomize k8s/bases/infrastructure/controllers/kyverno/ builds successfully, and the resources block renders correctly nested under admissionController.

🤖 Generated with Claude Code

…-creation outage

The Kyverno admission controller ran on the chart-default limits.memory:
384Mi and was OOMKilled (exit 137) under image-verification + policy
evaluation load. Because the verify-image-signatures webhooks are
failurePolicy: Fail, the API server rejected every Pod/Deployment/Job
CREATE cluster-wide while the admission controller was down (e.g.
ksail-operator FailedCreate and a wedged HelmRelease upgrade).

The previous comment assumed auto-vpa.yaml would size this Deployment,
but Kyverno excludes its own namespace from admission/generate policies
to avoid an eviction deadlock, so no VPA is ever created for kyverno
controllers (verified: zero VPAs in ns/kyverno). The controller was thus
left permanently on the too-low chart default.

Set an explicit admissionController.resources (requests 100m/256Mi,
limits memory 1Gi) -- generous headroom, still far under auto-vpa's 6Gi
maxAllowed ceiling -- and correct the now-inaccurate comment. Keeps the
webhooks failurePolicy: Fail (no weakening of supply-chain enforcement);
the fix removes the OOM that made fail-closed dangerous.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@devantler devantler marked this pull request as ready for review June 20, 2026 09:48
@devantler devantler added this pull request to the merge queue Jun 20, 2026
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks Jun 20, 2026