Skip to content

add task to delete zimcheck results from s3#355

Open
elfkuzco wants to merge 2 commits into
detect-zimcheck-errorsfrom
cleanup-zimcheck-s3-bucket
Open

add task to delete zimcheck results from s3#355
elfkuzco wants to merge 2 commits into
detect-zimcheck-errorsfrom
cleanup-zimcheck-s3-bucket

Conversation

@elfkuzco

Copy link
Copy Markdown
Contributor

Changes

  • add background shuttle task to delete zimcheck results from s3. results are deleted only if the zimcheck summaries have been computed and book is not in quarantine

This closes #113

@elfkuzco elfkuzco self-assigned this Jun 23, 2026
@elfkuzco elfkuzco requested a review from benoit74 June 23, 2026 10:37
@elfkuzco elfkuzco linked an issue Jun 23, 2026 that may be closed by this pull request
@codecov

codecov Bot commented Jun 23, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 64.28571% with 25 lines in your changes missing coverage. Please review.
✅ Project coverage is 81.24%. Comparing base (67d872f) to head (63b0210).
⚠️ Report is 1 commits behind head on detect-zimcheck-errors.

Files with missing lines Patch % Lines
.../cms_backend/shuttle/delete_zimcheck_s3_results.py 65.62% 14 Missing and 8 partials ⚠️
...ms_backend/mill/mark_staging_books_for_deletion.py 0.00% 2 Missing ⚠️
backend/src/cms_backend/shuttle/main.py 0.00% 1 Missing ⚠️
Additional details and impacted files
@@                    Coverage Diff                     @@
##           detect-zimcheck-errors     #355      +/-   ##
==========================================================
+ Coverage                   80.99%   81.24%   +0.25%     
==========================================================
  Files                          61       62       +1     
  Lines                        3309     3375      +66     
  Branches                      354      366      +12     
==========================================================
+ Hits                         2680     2742      +62     
+ Misses                        525      520       -5     
- Partials                      104      113       +9     

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

)

local_warehouse_paths: ClassVar[dict[UUID, Path]] = _parse_local_warehouse_paths()
zimcheck_results_upload_uri: str = os.getenv(

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

zimcheck_results_upload_uri => zimcheck_results_s3_bucket_uri

.where(
Book.zimcheck_result_url.is_not(None),
Book.zimcheck_s3_deleted.is_(False),
Book.location_kind.not_in(["quarantine"]),

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We want to remove file from S3 only if book is in production or deleted. We wanna keep it (for further manual analysis if needed) when book is in quarantine or staging.

default_factory=dict, server_default="{}"
)

zimcheck_s3_deleted: Mapped[bool] = mapped_column(

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We probably need a partial index on zimcheck_s3_deleted = False to efficiently retrieve these few books which still have zimcheck results on S3 (there will be very few once this PR is merged)


s3 = get_kiwix_storage_client(upload_uri)

logger.info("Deleting zimcheck resultss from S3")

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

resultss => results


for book in books:
# Determine whether we need to fetch zimcheck results or not
missing_summary_keys = get_missing_keys(

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why should we care about that? I'm concerned we will end with many zimcheck results not been deleted only because we failed to compute zimcheck summary. Once in production, book should have a summary and anyway we don't care much about quality anymore.

@elfkuzco elfkuzco requested a review from benoit74 June 23, 2026 14:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Cleanup zimcheck S3 bucket

2 participants