fix(kyverno): pin admission-controller memory to stop OOM-induced pod-creation outage #2181
Open
devantler wants to merge 1 commit into
…-creation outage

The Kyverno admission controller ran on the chart-default limits.memory: 384Mi and OOMKilled (exit 137) under image-verification + policy evaluation load. Because the verify-image-signatures webhooks are failurePolicy: Fail, the dead admission controller rejected every Pod/Deployment/Job CREATE cluster-wide (e.g. ksail-operator FailedCreate and a wedged HelmRelease upgrade) until it recovered.

The previous comment assumed auto-vpa.yaml would size this Deployment, but Kyverno excludes its own namespace from admission/generate policies to avoid an eviction deadlock, so no VPA is ever created for kyverno controllers (verified: zero VPAs in ns/kyverno). The controller was thus left permanently on the too-low chart default.

Set an explicit admissionController.resources (requests 100m/256Mi, limits memory 1Gi) -- generous headroom, still far under auto-vpa's 6Gi maxAllowed ceiling -- and correct the now-inaccurate comment. Keeps the webhooks failurePolicy: Fail (no weakening of supply-chain enforcement); the fix removes the OOM that made fail-closed dangerous.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Problem
The `ksail-operator` HelmRelease was stuck mid-upgrade, with its ReplicaSet emitting `FailedCreate` events. This is a symptom of a cluster-wide pod-creation outage, not a ksail-operator bug.
Root cause
The Kyverno admission controller was running on the chart-default `limits.memory: 384Mi` and was OOMKilled (exit 137) under image-verification (cosign) and policy-evaluation load. While the admission pods were dead or restarting, the `verify-image-signatures` webhooks, registered with `failurePolicy: Fail`, could not be served, so the API server rejected every `Pod`/`Deployment`/`Job` CREATE in non-excluded namespaces. That is exactly the `FailedCreate` seen on ksail-operator (and why its upgrade wedged).

The existing comment claimed `auto-vpa.yaml` would right-size this Deployment, so "a hard limit here would fight VPA." That assumption is false: Kyverno deliberately excludes its own namespace from its admission/generate policies (to avoid an eviction deadlock), so `auto-vpa` never generates a VPA for Kyverno controllers (verified: zero VPAs in `ns/kyverno`). The admission controller was therefore stranded permanently on the too-low chart default with nothing to raise it.

Fix
Set an explicit `admissionController.resources` (requests `100m`/`256Mi`, limits `memory: 1Gi`). This gives generous headroom (live idle is ~133–149Mi; the OOM happened at 384Mi) while staying far under `auto-vpa`'s `6Gi` `maxAllowed` ceiling. The PR also corrects the misleading comment.

Security is unchanged: the webhooks stay `failurePolicy: Fail`. The reliability problem was the OOM that made fail-closed dangerous; removing the OOM is the right lever. Node-layer enforcement (Talos `ImageVerificationConfig`) remains in place regardless.

Notes / scope
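For orientation, the resource settings described under Fix might look roughly like the fragment below. This is a sketch, not the PR's actual diff: the key path `admissionController.resources` is taken from the description above, the `100m` request is assumed to be CPU, and how the fragment is nested inside the repository's HelmRelease or values file is not shown here.

```yaml
# Hypothetical Helm values fragment; the key path follows the PR description
# and should be checked against the chart's values.yaml for the pinned version.
admissionController:
  resources:
    requests:
      cpu: 100m        # assumed to be the "100m" in the PR text
      memory: 256Mi
    limits:
      # The PR mentions only a memory limit, so no CPU limit is set here.
      memory: 1Gi
```

Leaving the CPU limit unset mirrors the description, which pins only memory; the 1Gi figure sits well above the 384Mi at which the OOM occurred.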
`skipResourceFilters` + the report-kind filters in this file; only the admission controller caused the outage, so this PR is scoped to it.

Validation
`kubectl kustomize k8s/bases/infrastructure/controllers/kyverno/` builds; `resources` renders correctly nested under `admissionController`.

🤖 Generated with Claude Code