Documentation for ResumableJobMixin and resumable tasks #68136

Open
amoghrajesh wants to merge 1 commit into apache:main from astronomer:aip-103-resumablemixin-docs

@amoghrajesh
Contributor


Was generative AI tooling used to co-author this PR?
  • Yes - Claude Sonnet 4.6

closes: #67706


What?

Operators that submit work to external systems such as Spark, BigQuery, EMR, or Kubernetes share a common failure mode: the worker holds its slot for the full polling duration, and if the worker crashes, the task retries from scratch, often submitting a duplicate job. Airflow 3.3 introduces ResumableJobMixin to solve the crash-and-duplicate problem, and the task state store enables the broader checkpoint-and-resume pattern. Neither had documentation, and there was no guidance on when to reach for these over deferrable operators or async tasks.
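
For reviewers who want the crash-and-duplicate problem in concrete terms, here is a minimal plain-Python sketch of the checkpoint-and-resume idea. It does not use any Airflow API: `FakeExternalSystem`, `run_task`, the `external_id` key, and the dict standing in for the task state store are all hypothetical names chosen for illustration.

```python
# Hypothetical stand-ins only: none of these names are Airflow APIs.
class FakeExternalSystem:
    """Pretends to be Spark/BigQuery/etc. and records every submitted job."""

    def __init__(self):
        self.submitted = []

    def submit(self):
        job_id = f"job-{len(self.submitted) + 1}"
        self.submitted.append(job_id)
        return job_id

    def is_done(self, job_id):
        # In this toy system every submitted job is already finished.
        return job_id in self.submitted


def run_task(system, state_store, crash_while_polling=False):
    """One task attempt: reuse a persisted job id if present, else submit."""
    job_id = state_store.get("external_id")
    if job_id is None:
        job_id = system.submit()
        # Persist the id before polling so a retry can reconnect to the job.
        state_store["external_id"] = job_id
    if crash_while_polling:
        raise RuntimeError("worker crashed while polling")
    while not system.is_done(job_id):
        pass  # a real operator would sleep between polls
    return job_id


system = FakeExternalSystem()
state_store = {}  # stands in for a store that survives worker restarts

try:
    run_task(system, state_store, crash_while_polling=True)  # first attempt
except RuntimeError:
    pass  # the worker died mid-poll

result = run_task(system, state_store)  # the retry reconnects to the same job
```

In this sketch the retry finds the persisted id and skips `submit()`, so the external system sees only one job. A crash between `submit()` and the write to `state_store` would still lose the id, so the sketch narrows the duplicate window rather than closing it.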

Current behaviour

The only comparison page (task-sdk/docs/deferred-vs-async-operators.rst) covered deferrable and async operators but made no mention of resumable tasks. ResumableJobMixin had no user-facing documentation at all. Users hitting duplicate job submissions after worker crashes had no clear path to a solution.

Proposed change

  • Adds airflow-core/docs/core-concepts/resumable-tasks.rst - a decision guide covering all three patterns (deferrable, resumable, async) with trade-off descriptions, a three-way comparison table, and a general checkpoint example using task_store directly.
  • Adds task-sdk/docs/resumable-job-mixin.rst - a reference page for ResumableJobMixin covering the interface, method descriptions with inline examples, the retry flow, the pre-submit crash window limitation, and the external_id_key renaming warning.
  • Updates task-sdk/docs/deferred-vs-async-operators.rst with a versionchanged:: 3.3.0 note pointing to both new pages.

  • Read the Pull Request Guidelines for more information. Note: commit author/co-author name and email in commits become permanently public when merged.
  • For fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
  • When adding dependency, check compliance with the ASF 3rd Party License Policy.
  • For significant user-facing changes create newsfragment: {pr_number}.significant.rst, in airflow-core/newsfragments. You can add this file in a follow-up commit after the PR is created so you know the PR number.


.. versionadded:: 3.2.0

.. versionchanged:: 3.3.0
Contributor

This breaks the current structure: the `.. versionadded:: 3.2.0` directive relates to the paragraph below the new addition.

The trade-offs are:

* A Triggerer component must be running. Deployments that do not include a Triggerer cannot use this pattern.
* Writing a custom deferrable operator requires implementing a separate ``Trigger`` class in
Collaborator

Suggested change
* Writing a custom deferrable operator requires implementing a separate ``Trigger`` class in
* Writing a custom deferrable operator requires implementing a ``Trigger`` class in


process_files_dag()

This pattern works without any additional work, just plain old context. The state store is just
Collaborator

Suggested change
This pattern works without any additional work, just plain old context. The state store is just
This pattern works without any additional work, relying only on ``context``. The state store is just


:class:`~airflow.sdk.ResumableJobMixin` is a mixin for operators that submit long-running jobs
to an external system and poll for its completion. It makes the operator crash-safe by persisting
the external job identifier to the task state store before polling begins. If the worker is restarted
Collaborator

Suggested change
the external job identifier to the task state store before polling begins. If the worker is restarted
the external job identifier to task state store before polling begins. If the worker is restarted

The trade-offs are:

* A Triggerer component must be running. Deployments that do not include a Triggerer cannot use this pattern.
* Writing a custom deferrable operator requires implementing a separate ``Trigger`` class in
Contributor

Or: requires implementing a dedicated Trigger class...

Projects

Status: In progress

Development

Successfully merging this pull request may close these issues.

Add guidance on when to use ResumableMixin vs deferrable operators

5 participants