fix(monitoring): scope ControllerHighReconcileLatency to sei-k8s-controller's own controllers #423

Merged: bdchatham merged 1 commit into main from brandon/fix-reconcile-latency-alert-scope on Jun 20, 2026

Conversation

@bdchatham

Collaborator

What

Scope the ControllerHighReconcileLatency alert to {job="sei-k8s-controller"} so it only watches our controllers, not every controller-runtime controller in the cluster.

-      rate(controller_runtime_reconcile_time_seconds_bucket[5m])
+      rate(controller_runtime_reconcile_time_seconds_bucket{job="sei-k8s-controller"}[5m])

Why (root-caused via /root-cause)

#417 (SeiNetwork clean-break) deleted SeiNodeDeployment, retiring the sei_controller_seinodedeployment_reconcile_substep_duration_seconds metric this alert used. The rewrite to the generic controller_runtime_reconcile_time_seconds bucket dropped the implicit scope: that metric is emitted by Karpenter (job=karpenter), Flux (flux-system), cert-manager, aws-lbc, and our own controller, so a single alert ended up covering all of them.
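The effect of the missing selector can be sketched in a few lines of Python. This is only an illustration of equality matching on labels, not how Prometheus is implemented; the `SERIES` list and the `select` helper are hypothetical, and the `controller` label values are assumed from the controller names mentioned in this PR.

```python
# Hypothetical label sets for controller_runtime_reconcile_time_seconds_bucket.
# The job values karpenter and sei-k8s-controller come from this PR; the
# controller label values are assumptions based on the names used in the text.
SERIES = [
    {"job": "karpenter", "controller": "interruption"},
    {"job": "sei-k8s-controller", "controller": "seinode"},
    {"job": "sei-k8s-controller", "controller": "seinetwork"},
    {"job": "sei-k8s-controller", "controller": "seinodetask"},
]


def select(series, matchers):
    """Return the series whose labels equal every matcher, like {k="v"} in PromQL."""
    return [s for s in series if all(s.get(k) == v for k, v in matchers.items())]


# No matchers: every job emitting the metric feeds the alert.
unscoped = select(SERIES, {})
# With the job matcher: only our own controllers remain.
scoped = select(SERIES, {"job": "sei-k8s-controller"})

print(sorted({s["job"] for s in unscoped}))
print(sorted({s["job"] for s in scoped}))
```

With an empty matcher set the Karpenter series is included; adding the `job` matcher leaves only the sei-k8s-controller series.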

It now fires falsely on Karpenter's interruption controller, whose reconcile is a blocking SQS ReceiveMessage long-poll (WaitTimeSeconds=20) against a permanently-empty queue (on-demand nodes, no spot interruptions). Evidence: p50 22.5s ≈ mean 20.0s ≈ p99 24.9s (the whole distribution is the idle wait, not a tail), 100% result=requeue_after, workqueue_depth=0, flat for 4+ days, identical across prod/prod-euw1/harbor, zero throttle/error logs. Benign by design. The 10s p99 threshold is meaningless for a long-poller.
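One possible reading of the p50 and p99 figures, sketched below: if the histogram has bucket bounds at 20s and 25s (an assumption here, not something stated in this PR), then reconciles that all take slightly over 20s land in the same (20, 25] bucket, and linear interpolation inside that bucket yields 22.5s for p50 and about 24.95s for p99 even though the real durations barely vary. The `histogram_quantile` function is a simplified stand-in for the PromQL function of the same name, and the bucket counts are made up.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate quantile q (0 < q <= 1) from cumulative (upper_bound, count) pairs.

    Simplified sketch of linear interpolation within the bucket that holds
    the target rank. Expects bounds sorted ascending and ending with +Inf.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # Rank falls in the overflow bucket: report the last finite bound.
                return lower_bound
            in_bucket = count - lower_count
            return lower_bound + (upper_bound - lower_bound) * (rank - lower_count) / in_bucket
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Made-up cumulative counts: all 1000 reconciles took just over 20s,
# so they all fall in the assumed (20, 25] bucket.
BUCKETS = [(10.0, 0), (15.0, 0), (20.0, 0), (25.0, 1000), (30.0, 1000), (math.inf, 1000)]

p50 = histogram_quantile(0.50, BUCKETS)
p99 = histogram_quantile(0.99, BUCKETS)
print(p50, p99)
```

Under these assumptions, both quantiles sit well above the 10s threshold no matter how healthy the controller is, which is consistent with the flat, tail-free distribution described above.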

Onset (alert firing ~06-18 19:29) aligns with #417's deploy — the latency itself never changed.

Scope / follow-ups

  • Fixes both the interruption (Karpenter) misfit and the intermittent kustomization (Flux) trip in one root-level change: both come from jobs other than sei-k8s-controller and fall outside this alert's intent.
  • Not silenced blindly: by scoping to job="sei-k8s-controller", genuine slowness in our own reconcilers (seinode/seinetwork/seinodetask) still alerts.
  • Karpenter/Flux reconcile-latency coverage, if wanted, belongs in their own alert bundles (separate follow-up; co-own threshold with SRE).

🤖 Generated with Claude Code

…roller's own controllers

The #417 SeiNetwork clean-break deleted SeiNodeDeployment, retiring the
sei_controller_seinodedeployment_reconcile_substep_duration_seconds metric the
alert was scoped to. The rewrite to the generic controller_runtime_reconcile_time_seconds
bucket dropped the implicit scope, so the alert now matches EVERY controller-runtime
controller in the cluster — Karpenter, Flux, cert-manager, aws-lbc — not just ours.

That fires falsely on Karpenter's `interruption` controller, whose reconcile blocks on a
~20s SQS ReceiveMessage long-poll (WaitTimeSeconds=20) against a permanently-empty queue
on on-demand nodes. p50≈mean≈p99≈20s (the whole distribution is the idle long-poll, not a
tail), 100% result=requeue_after, workqueue_depth=0 — benign by design, fleet-wide. The
generic 10s p99 threshold is meaningless for a long-poller, and `kustomization` (Flux) also
trips it intermittently.
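A minimal sketch of why a long-polling reconcile always reports roughly the wait time as its duration. It uses Python's standard `queue.Queue` as a stand-in for SQS and a 0.2s wait instead of the 20s `WaitTimeSeconds`; the `reconcile` function and its return values are hypothetical and are not Karpenter's code.

```python
import queue
import time

# Stand-in for the SQS wait time; the real value in this PR is WaitTimeSeconds=20.
WAIT_SECONDS = 0.2


def reconcile(q, wait_seconds):
    """Block on the queue like a long-poll; return (result, elapsed_seconds)."""
    start = time.monotonic()
    try:
        q.get(timeout=wait_seconds)
        result = "handled"
    except queue.Empty:
        # Nothing arrived within the wait window: come back later.
        result = "requeue_after"
    return result, time.monotonic() - start


# A permanently empty queue: every reconcile waits out the full timeout.
empty_queue = queue.Queue()
result, elapsed = reconcile(empty_queue, WAIT_SECONDS)
print(result, round(elapsed, 2))
```

On an empty queue the measured duration is the idle wait itself, so a latency threshold below the wait time will always be exceeded.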

Restore the original intent: only alert on sei-k8s-controller's own reconcile latency
(job="sei-k8s-controller"). Karpenter and Flux own their own alerting.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@cursor

cursor Bot commented Jun 20, 2026

PR Summary

Low Risk
Single-label filter on one alert expression in PrometheusRule; no runtime or auth changes, only reduces false positives while keeping sei-k8s-controller coverage.

Overview
Narrows ControllerHighReconcileLatency so p99 reconcile latency is computed only from controller_runtime_reconcile_time_seconds_bucket series with job="sei-k8s-controller", instead of from every controller-runtime job scraped in the cluster.

This restores the alert’s intent after the metric migration: slow reconciles in sei-k8s-controller (e.g. SeiNode/SeiNetwork/SeiNodeTask) still page, while benign long-poll behavior from other jobs (Karpenter interruption, Flux kustomization, etc.) no longer trips the same rule.

Reviewed by Cursor Bugbot for commit db38331.

@bdchatham bdchatham merged commit 27a76b9 into main Jun 20, 2026
5 checks passed