fix(monitoring): scope ControllerHighReconcileLatency to sei-k8s-controller's own controllers by bdchatham · Pull Request #423 · sei-protocol/sei-k8s-controller

bdchatham · 2026-06-20T17:26:21Z

What

Scope the ControllerHighReconcileLatency alert to {job="sei-k8s-controller"} so it only watches our controllers, not every controller-runtime controller in the cluster.

-      rate(controller_runtime_reconcile_time_seconds_bucket[5m])
+      rate(controller_runtime_reconcile_time_seconds_bucket{job="sei-k8s-controller"}[5m])
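For context, the scoped selector would sit inside a p99 expression roughly like the sketch below. Only the `rate(...)` selector, the p99 quantile, and the 10s threshold come from this PR; the `sum by (controller, le)` grouping and the overall shape of the comparison are assumptions about how the rule is written, not a copy of it.

```promql
# Sketch of the scoped alert expression (the grouping labels are assumed).
histogram_quantile(
  0.99,
  sum by (controller, le) (
    rate(controller_runtime_reconcile_time_seconds_bucket{job="sei-k8s-controller"}[5m])
  )
) > 10
```

Because the matcher is applied to the bucket series before aggregation, series from other jobs never enter the quantile calculation at all.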

Why (root-caused via /root-cause)

#417 (SeiNetwork clean-break) deleted SeiNodeDeployment, retiring the sei_controller_seinodedeployment_reconcile_substep_duration_seconds metric this alert used. The rewrite to the generic controller_runtime_reconcile_time_seconds bucket dropped the implicit scope: that metric is emitted by Karpenter (job=karpenter), Flux (flux-system), cert-manager, aws-lbc, and sei-k8s-controller, so a single alert now covers all of them.

It now fires falsely on Karpenter's interruption controller, whose reconcile is a blocking SQS ReceiveMessage long-poll (WaitTimeSeconds=20) against a permanently-empty queue (on-demand nodes, no spot interruptions). Evidence: p50 22.5s ≈ mean 20.0s ≈ p99 24.9s (the whole distribution is the idle wait, not a tail), 100% result=requeue_after, workqueue_depth=0, flat for 4+ days, identical across prod/prod-euw1/harbor, zero throttle/error logs. Benign by design. The 10s p99 threshold is meaningless for a long-poller.
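The evidence above can be re-checked with queries along these lines. This is a sketch: `job="karpenter"` comes from this PR, while the `controller="interruption"` and `name="interruption"` label values are assumptions to confirm against the actual series. Each query is meant to be run on its own.

```promql
# p50 of reconcile duration for the Karpenter interruption controller.
histogram_quantile(0.50, sum by (le) (
  rate(controller_runtime_reconcile_time_seconds_bucket{job="karpenter", controller="interruption"}[5m])
))

# p99 of the same distribution.
histogram_quantile(0.99, sum by (le) (
  rate(controller_runtime_reconcile_time_seconds_bucket{job="karpenter", controller="interruption"}[5m])
))

# Mean reconcile duration (sum of observed seconds divided by observation count).
sum(rate(controller_runtime_reconcile_time_seconds_sum{job="karpenter", controller="interruption"}[5m]))
/
sum(rate(controller_runtime_reconcile_time_seconds_count{job="karpenter", controller="interruption"}[5m]))

# Reconcile outcomes broken down by result (expecting requeue_after to dominate).
sum by (result) (
  rate(controller_runtime_reconcile_total{job="karpenter", controller="interruption"}[5m])
)

# Work queue backlog for the controller (expecting 0).
workqueue_depth{job="karpenter", name="interruption"}
```

A note on reading the numbers: `histogram_quantile` interpolates linearly inside a bucket. If the histogram has adjacent bucket boundaries at 20s and 25s (an assumption; check the `le` values), and every observation lands in that bucket, the estimates become p50 = 20 + 0.5 × 5 = 22.5s and p99 = 20 + 0.99 × 5 = 24.95s. That would explain why the reported p50 sits above the mean of 20.0s, which is computed from the `_sum` and `_count` series rather than from the buckets.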

Onset (alert firing ~06-18 19:29) aligns with #417's deploy — the latency itself never changed.

Scope / follow-ups

Fixes both the interruption (Karpenter) misfit and the intermittent kustomization (Flux) trip in one root-level change — both are non-sei-k8s-controller and out of this alert's intent.
Not silenced blindly: by scoping to job="sei-k8s-controller", genuine slowness in our own reconcilers (seinode/seinetwork/seinodetask) still alerts.
Karpenter/Flux reconcile-latency coverage, if wanted, belongs in their own alert bundles (separate follow-up; co-own threshold with SRE).
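One way to sanity-check that the scoped selector still covers our own reconcilers is to list which controllers report series under it. This is a sketch; the exact `controller` label values (expected to correspond to seinode/seinetwork/seinodetask) are an assumption.

```promql
# Controllers that still feed the scoped alert; an empty result would mean
# the job label does not match and the alert could never fire.
count by (controller) (
  controller_runtime_reconcile_time_seconds_count{job="sei-k8s-controller"}
)
```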

🤖 Generated with Claude Code

…roller's own controllers

The #417 SeiNetwork clean-break deleted SeiNodeDeployment, retiring the sei_controller_seinodedeployment_reconcile_substep_duration_seconds metric the alert was scoped to. The rewrite to the generic controller_runtime_reconcile_time_seconds bucket dropped the implicit scope, so the alert now matches EVERY controller-runtime controller in the cluster (Karpenter, Flux, cert-manager, aws-lbc), not just ours.

That fires falsely on Karpenter's `interruption` controller, whose reconcile blocks on a ~20s SQS ReceiveMessage long-poll (WaitTimeSeconds=20) against a permanently-empty queue on on-demand nodes. p50≈mean≈p99≈20s (the whole distribution is the idle long-poll, not a tail), 100% result=requeue_after, workqueue_depth=0: benign by design, fleet-wide. The generic 10s p99 threshold is meaningless for a long-poller, and `kustomization` (Flux) also trips it intermittently.

Restore the original intent: only alert on sei-k8s-controller's own reconcile latency (job="sei-k8s-controller"). Karpenter and Flux own their own alerting.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

cursor · 2026-06-20T17:26:26Z

PR Summary

Low Risk
Single-label filter on one alert expression in PrometheusRule; no runtime or auth changes, only reduces false positives while keeping sei-k8s-controller coverage.

Overview
Narrows ControllerHighReconcileLatency so p99 reconcile latency is computed only from controller_runtime_reconcile_time_seconds_bucket series with job="sei-k8s-controller", instead of from every controller-runtime job scraped in the cluster.

This restores the alert’s intent after the metric migration: slow reconciles in sei-k8s-controller (e.g. SeiNode/SeiNetwork/SeiNodeTask) still page, while benign long-poll behavior from other jobs (Karpenter interruption, Flux kustomization, etc.) no longer trips the same rule.

Reviewed by Cursor Bugbot for commit db38331.

bdchatham merged commit 27a76b9 into main Jun 20, 2026
5 checks passed

bdchatham merged 1 commit into main from brandon/fix-reconcile-latency-alert-scope
