Skip to content

feat(jobs): make missing-Redis-state FAILUREs loud and self-diagnosing#1241

Open
mihow wants to merge 5 commits into
mainfrom
fix/1241-redis-state-loud-diagnostics
Open

feat(jobs): make missing-Redis-state FAILUREs loud and self-diagnosing#1241
mihow wants to merge 5 commits into
mainfrom
fix/1241-redis-state-loud-diagnostics

Conversation

@mihow

@mihow mihow commented Apr 16, 2026

Copy link
Copy Markdown
Collaborator

Summary

Complements #1244 — that PR handles pending_ids left in Redis after NATS gives up redelivering; this PR handles the case where the state keys are missing entirely. The two paths are independent and coexist without conflict.

When process_nats_pipeline_result finds the job's total-images key missing from Redis, the FAILURE it produces is close to silent. The reason string is hardcoded — "Job state keys not found in Redis (likely cleaned up concurrently)" — and ends up only in job.logger, not in progress.errors. The Redis target is never logged. Three distinct causes collapse into one misleading line:

  • DB-index drift across hosts (one worker talks to Redis DB 0, another to DB 1 — run_job initializes state in one DB, process_nats_pipeline_result looks for it in the other)
  • Key eviction under memory pressure
  • Never-initialized state (legitimate concurrent-cleanup race)

This PR keeps the failure path but makes it name the actual cause.

What changes

  1. AsyncJobStateManager.diagnose_missing_state() — returns a one-line snapshot: redis=host:port/dbN keys_for_job=<SCAN listing with SCARDs>. keys_for_job=<none> ⇒ wrong DB or never-initialized; SCARDs present ⇒ partial cleanup / eviction. The SCAN runs only on the failure path and the per-job key fanout is at most four keys (pending:process, pending:results, failed, total), so the cost is negligible.

  2. update_state() — emits a WARN with the snapshot immediately before returning None. The trigger is now visible in the worker log even if the caller's FAILURE log is filtered.

  3. process_nats_pipeline_result — passes the live snapshot to _fail_job instead of the hardcoded string. Also annotates the stage (process / results).

  4. _fail_job — appends reason to job.progress.errors before save. The UI now shows FAILUREs with a cause instead of errors=[].

  5. run_job — logs Running job X on redis=HOST:PORT/dbN at start. Cross-host DB drift becomes visible in every job's log without needing to already suspect it.

Why

This is motivated by an observed production incident where three consecutive async_api jobs flipped straight to FAILURE at ~46% with zero errors recorded and no log line explaining why. The per-job log showed STARTED → FAILURE with nothing in between. The only clue was the hardcoded string, which pattern-matched an unrelated recent change and sent triage down the wrong path. The real cause turned out to be a worker host whose REDIS_URL pointed at a different DB index than the rest of the stack. Any of the five items above would have caught it on the first failed job.

The fix for the underlying misconfig is an infra change, not in this PR. This PR is about making the same class of failure instantly debuggable next time.

Tests

Added:

  • TestTaskStateManager.test_diagnose_missing_state_when_never_initialized — asserts redis=, /db, and keys_for_job=<none> in the diagnostic string.
  • TestTaskStateManager.test_diagnose_missing_state_lists_present_keys — after wiping only the total key, asserts the surviving pending-set SCARDs appear.
  • TestFailJob.test_fail_job_appends_reason_to_progress_errors — verifies the reason string lands in job.progress.errors both in memory and after refresh_from_db(). The existing _fail_job call-site tests in TestProcessNatsPipelineResultError mock _fail_job entirely, so a silent regression on the append would not have been caught. Validated via TDD: temporarily skipped the append in production code, ran the test and watched it fail with the empty-list assertion, then restored.
  • TestFailJob.test_fail_job_is_noop_on_already_final_job — regression guard for the early-return branch.

Updated:

  • TestProcessNatsPipelineResultError.test_genuinely_missing_state_acks_and_fails_job — reason-string assertion updated to the new "Job state missing from Redis (stage=process): …" form.

Verified locally

  • docker compose -f docker-compose.ci.yml run --rm django python manage.py test ami.jobs --keepdb78 tests pass, including the new TestFailJob cases.
  • docker compose -f docker-compose.ci.yml run --rm django python manage.py test ami.jobs.tests.test_tasks.TestFailJob ami.jobs.tests.test_tasks.TestProcessNatsPipelineResultError ami.ml.tests.TestTaskStateManager --keepdb24 tests pass (the targeted slices touched by this PR).
  • TDD confirmation of the progress.errors append: with job.progress.errors.append(reason) commented out, test_fail_job_appends_reason_to_progress_errors fails with AssertionError: '…' not found in []. With the line restored, it passes. The test exercises the real code path.
  • Lint/format: pre-commit hooks (black, isort, flake8, pyupgrade, django-upgrade) pass on all touched files.

Manual verification to run in staging

  • In a dev env, start an async_api job, then redis-cli -n <db> DEL job:{id}:pending_images_total while it's running. Confirm the resulting FAILURE shows the diagnostic in both the UI's progress.errors and the worker log, and that run_job's opening log line identifies the Redis target.

Relationship to concurrent / recently-merged work

@netlify

netlify Bot commented Apr 16, 2026

Copy link
Copy Markdown

Deploy Preview for antenna-ssec canceled.

Name Link
🔨 Latest commit ce28963
🔍 Latest deploy log https://app.netlify.com/projects/antenna-ssec/deploys/6a39d194a3f3310008fefb6f

@coderabbitai

coderabbitai Bot commented Apr 16, 2026

Copy link
Copy Markdown
Contributor

Warning

Review limit reached

@mihow, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 8 minutes and 45 seconds. Learn how PR review limits work.

Your organization has used up its prepaid credits, and credit purchases are no longer available. Enable the review add-on in the billing tab to keep reviews running — you're only billed for reviews past your plan's rate limits ($0.25/file).

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based credits.

🚦 How do rate limits work?

CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan refill rate.

For paid Pro and Pro+ PR reviews, CodeRabbit uses adaptive limits for sustained high-volume activity. When a developer's recent PR review activity reaches the 95th percentile or higher among CodeRabbit users, the refill rate gradually slows as usage increases. The highest same-day bursts are limited more strictly.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: b9b144a1-46b0-4c12-930e-4dc2d5bcd241

📥 Commits

Reviewing files that changed from the base of the PR and between 702de1e and ce28963.

📒 Files selected for processing (4)
  • ami/jobs/tasks.py
  • ami/jobs/tests/test_tasks.py
  • ami/ml/orchestration/async_job_state.py
  • ami/ml/tests.py
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/1241-redis-state-loud-diagnostics

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

@netlify

netlify Bot commented Apr 16, 2026

Copy link
Copy Markdown

Deploy Preview for antenna-preview canceled.

Name Link
🔨 Latest commit ce28963
🔍 Latest deploy log https://app.netlify.com/projects/antenna-preview/deploys/6a39d194f8963400095cdde7

@mihow mihow force-pushed the fix/1241-redis-state-loud-diagnostics branch from 6730cfb to 111d5d6 Compare April 17, 2026 07:03
@mihow mihow marked this pull request as ready for review April 17, 2026 07:03
Copilot AI review requested due to automatic review settings April 17, 2026 07:03

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Improves debuggability of async ML job failures when Redis-backed job state is missing by adding Redis-target diagnostics, propagating the failure reason into Job.progress.errors, and updating tests to lock in the new behavior.

Changes:

  • Add AsyncJobStateManager.diagnose_missing_state() and log a diagnostic snapshot on the missing-state path.
  • Include stage-specific missing-state diagnostics in _fail_job reasons and persist those reasons into progress.errors for UI visibility.
  • Add/adjust unit tests covering diagnostics output and _fail_job persistence semantics.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

File Description
ami/ml/orchestration/async_job_state.py Adds missing-state diagnostic helper and warning log in update_state.
ami/jobs/tasks.py Logs Redis target at job start; passes diagnostics into _fail_job; appends reason into progress.errors.
ami/ml/tests.py Adds tests for diagnose_missing_state() output in “never initialized” and “partial keys present” cases.
ami/jobs/tests/test_tasks.py Updates missing-state reason assertions; adds regression tests for _fail_job writing progress.errors.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread ami/ml/orchestration/async_job_state.py Outdated
Comment thread ami/jobs/tasks.py
Comment thread ami/jobs/tasks.py
Comment thread ami/ml/orchestration/async_job_state.py Outdated
@mihow mihow added the PSv2 Async & distributed ML backend (PSv2): job state, NATS dispatch, result handling. Umbrella #515. label Jun 16, 2026
mihow and others added 2 commits June 22, 2026 10:22
Before: when process_nats_pipeline_result found the job's total-images
key missing, it failed the job with the hardcoded reason
"Job state keys not found in Redis (likely cleaned up concurrently)".
The reason string went to job.logger only — not progress.errors — and
collapsed three distinct causes (DB-index drift across hosts, key
eviction, never-initialized state) into one misleading line. The Redis
target (host:port/db) was never logged, so operators couldn't tell a
cache-DB split from a cleanup race without shelling into workers.

This commit makes that path name the actual cause:

1. AsyncJobStateManager.diagnose_missing_state() returns a one-line
   snapshot: masked host:port, DB index, and SCAN output for job:{id}:*
   with SCARDs. "keys_for_job=<none>" ⇒ never initialized or wrong DB;
   SCARDs present ⇒ partial cleanup / eviction.

2. update_state() emits a WARN with that snapshot immediately before
   returning None, so the trigger shows up in the worker log even if
   the caller's FAILURE log is filtered out.

3. process_nats_pipeline_result passes the live snapshot to _fail_job
   instead of the hardcoded string.

4. _fail_job appends the reason to job.progress.errors before save, so
   the UI surfaces FAILUREs with a cause instead of errors=[].

5. run_job logs "Running job X on redis=HOST:PORT/dbN" at start. Cross-
   host DB drift becomes visible in every job's log without needing to
   already suspect it.

Tests added:
- diagnose_missing_state with never-initialized state (keys_for_job=<none>)
- diagnose_missing_state after partial cleanup (SCARDs of surviving sets
  listed)
- Existing "genuinely missing state" test assertion updated to match the
  richer reason string.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add TestFailJob with two TDD-confirmed cases:

- ``test_fail_job_appends_reason_to_progress_errors`` — verifies the
  reason string lands in ``job.progress.errors`` (both in-memory and
  after refresh_from_db) so UI surfaces the cause of the FAILURE. The
  existing ``_fail_job`` call-site tests in
  ``TestProcessNatsPipelineResultError`` mock ``_fail_job`` entirely, so
  a regression that stops appending to ``progress.errors`` would slip
  through undetected.
- ``test_fail_job_is_noop_on_already_final_job`` — regression guard for
  the early-return branch when the job is already in a final state.

Also:
- Comment the bare ``except Exception: pass`` around the
  ``progress.errors.append`` in ``_fail_job`` to explain why we swallow
  diagnostic-write failures.
- Extend the ``AsyncJobStateManager.diagnose_missing_state`` docstring
  with a one-paragraph note about the SCAN cost (failure-path only,
  per-job fanout of at most four keys) to head off the obvious review
  question.

Co-Authored-By: Claude <noreply@anthropic.com>
@mihow mihow force-pushed the fix/1241-redis-state-loud-diagnostics branch from 111d5d6 to 0c1dbc1 Compare June 22, 2026 17:26
Fix three inaccurate comments flagged in review:
- diagnose_missing_state() is called from update_state and the result
  handler (not _fail_job); note it can run at most twice per FAILURE.
- SCAN is O(keyspace) regardless of MATCH; acceptable only because it
  runs on the rare missing-state failure path.
- _describe_redis_target() returns a redis=... prefixed string.
mihow and others added 2 commits June 22, 2026 17:21
… duplicate rationale [skip ci]

- async_job_state.py update_state: "previously all surfaced as..." → states the
  contract ("Distinguishes three causes that map to the same symptom")
- async_job_state.py diagnose_missing_state: remove inline SCAN/MATCH cost
  duplicate that repeated the docstring verbatim; keep the docstring
- tasks.py _fail_job: "Previously the reason lived only in job.logger..." →
  states what the code does ("Operators can see the cause in the job detail view")

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…peline_result

Adds test_genuinely_missing_state_results_stage_acks_and_fails_job to
TestProcessNatsPipelineResultError, mirroring the existing process-branch test
for the results-stage missing-state path (tasks.py lines 378-388). Verifies
that when update_state returns None at stage=results, the task acks NATS (to
stop redelivery) and calls _fail_job with a reason string containing "stage=results".

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Comment thread ami/jobs/tasks.py

redis = get_redis_connection("default")
kwargs = getattr(redis.connection_pool, "connection_kwargs", {}) or {}
return f"redis={kwargs.get('host', '?')}:{kwargs.get('port', '?')}/db{kwargs.get('db', '?')}"

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we really want to log the redis connection details to the public job logs? is there another way we can diagnose the connection in the logs?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

PSv2 Async & distributed ML backend (PSv2): job state, NATS dispatch, result handling. Umbrella #515.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants