fix(monitoring): scope ControllerHighReconcileLatency to sei-k8s-controller's own controllers#423
Conversation
…roller's own controllers The #417 SeiNetwork clean-break deleted SeiNodeDeployment, retiring the sei_controller_seinodedeployment_reconcile_substep_duration_seconds metric the alert was scoped to. The rewrite to the generic controller_runtime_reconcile_time_seconds bucket dropped the implicit scope, so the alert now matches EVERY controller-runtime controller in the cluster — Karpenter, Flux, cert-manager, aws-lbc — not just ours. That fires falsely on Karpenter's `interruption` controller, whose reconcile blocks on a ~20s SQS ReceiveMessage long-poll (WaitTimeSeconds=20) against a permanently-empty queue on on-demand nodes. p50≈mean≈p99≈20s (the whole distribution is the idle long-poll, not a tail), 100% result=requeue_after, workqueue_depth=0 — benign by design, fleet-wide. The generic 10s p99 threshold is meaningless for a long-poller, and `kustomization` (Flux) also trips it intermittently. Restore the original intent: only alert on sei-k8s-controller's own reconcile latency (job="sei-k8s-controller"). Karpenter and Flux own their own alerting. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
PR SummaryLow Risk Overview This restores the alert’s intent after the metric migration: slow reconciles in sei-k8s-controller (e.g. SeiNode/SeiNetwork/SeiNodeTask) still page, while benign long-poll behavior from other jobs (Karpenter Reviewed by Cursor Bugbot for commit db38331. Bugbot is set up for automated code reviews on this repo. Configure here. |
What
Scope the
ControllerHighReconcileLatencyalert to{job="sei-k8s-controller"}so it only watches our controllers, not every controller-runtime controller in the cluster.Why (root-caused via /root-cause)
#417(SeiNetwork clean-break) deletedSeiNodeDeployment, retiring thesei_controller_seinodedeployment_reconcile_substep_duration_secondsmetric this alert used. The rewrite to the genericcontroller_runtime_reconcile_time_secondsbucket dropped the implicit scope — the metric is emitted by Karpenter (job=karpenter), Flux (flux-system), cert-manager, aws-lbc, and us, all under one alert.It now fires falsely on Karpenter's
interruptioncontroller, whose reconcile is a blocking SQSReceiveMessagelong-poll (WaitTimeSeconds=20) against a permanently-empty queue (on-demand nodes, no spot interruptions). Evidence: p50 22.5s ≈ mean 20.0s ≈ p99 24.9s (the whole distribution is the idle wait, not a tail), 100%result=requeue_after,workqueue_depth=0, flat for 4+ days, identical across prod/prod-euw1/harbor, zero throttle/error logs. Benign by design. The 10s p99 threshold is meaningless for a long-poller.Onset (alert firing ~06-18 19:29) aligns with #417's deploy — the latency itself never changed.
Scope / follow-ups
interruption(Karpenter) misfit and the intermittentkustomization(Flux) trip in one root-level change — both are non-sei-k8s-controller and out of this alert's intent.job="sei-k8s-controller", genuine slowness in our own reconcilers (seinode/seinetwork/seinodetask) still alerts.🤖 Generated with Claude Code