
Retry mechanism with retry breaking system #797

Open
sosek108 wants to merge 9 commits into Expensify:main from callstack-internal:feat/revised-retry-mechanism

Conversation

sosek108 commented Jun 3, 2026

Contributor

Details

Replace the brittle, string-matching storage-error retry logic with a layered design: classify the error, route it to exactly one recovery owner, and stop session-wide storms with a circuit breaker.

  1. Error classification moved into the providers. Instead of OnyxUtils hardcoding substrings like 'quotaexceedederror' and 'database or disk is full', each storage provider (IDBKeyVal, SQLite, etc.) now owns a classifyError that maps its own error dialect to a shared, engine-agnostic taxonomy (lib/storage/errors.ts): TRANSIENT, CAPACITY, INVALID_DATA, FATAL, UNKNOWN.
  2. Single-owner recovery. Each class is handled by exactly one layer, so a failure isn't retried twice:
  • TRANSIENT / FATAL → connection layer (createStore) reopens/heals the DB; retryOperation skips them quietly.
  • CAPACITY → operation layer evicts the LRU evictable key and retries.
  • INVALID_DATA → thrown immediately without retrying, because the same data would fail again.
  • UNKNOWN → logs the full error shape once so telemetry can later promote recurring cases into a real class, then bounded retry.
  3. Session-level circuit breaker (lib/StorageCircuitBreaker.ts). The per-operation retry budget can't stop a storm, because each eviction triggers an OnyxDerived recompute, which issues a new write that starts with a fresh budget. The breaker trips when capacity failures storm (>50 in 60s) or when evictions repeatedly free nothing (5 consecutive no-progress cycles). Once tripped, it halts eviction/retry and emits exactly one alert per window instead of one log line per failed write.
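
The classification and single-owner routing described above can be sketched roughly as follows. This is only an illustration, not the code in this PR: the five class names come from the description, while `classifyIdbError`, `routeAtOperationLayer`, the `RecoveryAction` labels, and the specific error-name mapping are hypothetical. No FATAL mapping is shown because the description does not say which errors qualify.

```typescript
// Engine-agnostic taxonomy; the five names are taken from the PR description.
type StorageErrorClass = 'TRANSIENT' | 'CAPACITY' | 'INVALID_DATA' | 'FATAL' | 'UNKNOWN';

// Hypothetical labels for what the operation layer does with each class.
type RecoveryAction = 'skip' | 'evict-and-retry' | 'throw' | 'log-and-bounded-retry';

// Hypothetical provider-side classifier for an IndexedDB-style error dialect.
// The name-to-class mapping is an assumption made for illustration only.
function classifyIdbError(error: unknown): StorageErrorClass {
    const name = error instanceof Error ? error.name : '';
    switch (name) {
        case 'QuotaExceededError':
            return 'CAPACITY';
        case 'DataCloneError':
            return 'INVALID_DATA';
        case 'InvalidStateError':
            // Assumption: treated as a connection problem that a reopen can heal.
            return 'TRANSIENT';
        default:
            return 'UNKNOWN';
    }
}

// Each class has exactly one owner, so a failure is never retried by two layers.
function routeAtOperationLayer(errorClass: StorageErrorClass): RecoveryAction {
    switch (errorClass) {
        case 'TRANSIENT':
        case 'FATAL':
            // Owned by the connection layer; the operation layer stays out of the way.
            return 'skip';
        case 'CAPACITY':
            return 'evict-and-retry';
        case 'INVALID_DATA':
            return 'throw';
        case 'UNKNOWN':
            return 'log-and-bounded-retry';
    }
}
```

The point of the split is that the operation layer only ever sees the shared class, so adding a new storage engine means writing one classifier rather than extending a list of substrings in OnyxUtils.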

Related Issues

Expensify/App#94069

Linked E/App PR

Expensify/App#92931

Automated Tests

Manual Tests

Prerequisites: put Chrome into a low-storage state. Open DevTools -> Application -> Storage, enable "Simulate custom storage quota", and set a value lower than what the application currently uses.

  1. Open the console and, optionally, filter logs by "Onyx".
  2. Go to the inbox, switch between several reports, and observe the logs.
  3. Verify that there is no storm of uncaught error logs.
  4. Verify that the circuit breaker logs its message: Storage circuit breaker tripped: 5 consecutive evictions freed no usable space. Halting eviction/retry for 60s to stop a storage failure storm.
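
For reference when reading these logs, the two trip conditions from the Details section (more than 50 capacity failures in 60s, or 5 consecutive evictions that free nothing) can be sketched as below. This is a simplified model under my own assumptions, not the contents of lib/StorageCircuitBreaker.ts: the class name, the options object, the injected clock, and `recordSuccess` are hypothetical; only `recordEviction` and `recordCapacityFailure` are names that appear in this PR.

```typescript
type BreakerOptions = {
    maxFailures?: number; // capacity failures tolerated per window
    windowMs?: number; // window length, also used as the open duration
    maxNoProgress?: number; // consecutive evictions that freed nothing
    now?: () => number; // injectable clock, for tests
};

class StorageCircuitBreakerSketch {
    private failureTimestamps: number[] = [];
    private consecutiveNoProgress = 0;
    private evictionPending = false;
    private openUntil = 0;
    private maxFailures: number;
    private windowMs: number;
    private maxNoProgress: number;
    private now: () => number;

    constructor(options: BreakerOptions = {}) {
        this.maxFailures = options.maxFailures ?? 50;
        this.windowMs = options.windowMs ?? 60_000;
        this.maxNoProgress = options.maxNoProgress ?? 5;
        this.now = options.now ?? Date.now;
    }

    // While open, callers should skip eviction and retry entirely.
    isOpen(): boolean {
        return this.now() < this.openUntil;
    }

    recordEviction(): void {
        this.evictionPending = true;
    }

    // A successful write means the last eviction made progress.
    recordSuccess(): void {
        this.evictionPending = false;
        this.consecutiveNoProgress = 0;
    }

    // Returns true only for the failure that trips the breaker, so the caller
    // can emit a single alert per window.
    recordCapacityFailure(): boolean {
        if (this.isOpen()) {
            return false;
        }
        const timestamp = this.now();
        this.failureTimestamps = this.failureTimestamps.filter((ts) => timestamp - ts < this.windowMs);
        this.failureTimestamps.push(timestamp);
        if (this.evictionPending) {
            // The write failed again right after an eviction: nothing usable was freed.
            this.consecutiveNoProgress += 1;
            this.evictionPending = false;
        }
        const tripped = this.failureTimestamps.length > this.maxFailures || this.consecutiveNoProgress >= this.maxNoProgress;
        if (tripped) {
            this.openUntil = timestamp + this.windowMs;
            this.failureTimestamps = [];
            this.consecutiveNoProgress = 0;
        }
        return tripped;
    }
}
```

With the simulated quota from the prerequisites, every eviction is followed by another capacity failure, so the no-progress counter is the condition expected to trip first and produce the log line in step 4.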

Author Checklist

  • I linked the correct issue in the ### Related Issues section above
  • I linked the corresponding Expensify/App PR in the ### Linked E/App PR section above, and verified this change against it (E/App CI passed and manual testing completed)
  • I wrote clear testing steps that cover the changes made in this PR
    • I added steps for local testing in the Tests section
    • I tested this PR with a High Traffic account against the staging or production API to ensure there are no regressions (e.g. long loading states that impact usability).
  • I included screenshots or videos for tests on all platforms
  • I ran the tests on all platforms & verified they passed on:
    • Android / native
    • Android / Chrome
    • iOS / native
    • iOS / Safari
    • MacOS / Chrome / Safari
  • I verified there are no console errors (if there's a console error not related to the PR, report it or open an issue for it to be fixed)
  • I followed proper code patterns (see Reviewing the code)
    • I verified that any callback methods that were added or modified are named for what the method does and never what callback they handle (i.e. toggleReport and not onIconClick)
    • I verified that the left part of a conditional rendering a React component is a boolean and NOT a string, e.g. myBool && <MyComponent />.
    • I verified that comments were added to code that is not self explanatory
    • I verified that any new or modified comments were clear, correct English, and explained "why" the code was doing something instead of only explaining "what" the code was doing.
    • I verified proper file naming conventions were followed for any new files or renamed files. All non-platform specific files are named after what they export and are not named "index.js". All platform-specific files are named for the platform the code supports as outlined in the README.
    • I verified the JSDocs style guidelines (in STYLE.md) were followed
  • If a new code pattern is added I verified it was agreed to be used by multiple Expensify engineers
  • I followed the guidelines as stated in the Review Guidelines
  • I tested other components that can be impacted by my changes (i.e. if the PR modifies a shared library or component like Avatar, I verified the components using Avatar are working as expected)
  • I verified all code is DRY (the PR doesn't include any logic written more than once, with the exception of tests)
  • I verified any variables that can be defined as constants (ie. in CONST.js or at the top of the file that uses the constant) are defined as such
  • I verified that if a function's arguments changed that all usages have also been updated correctly
  • If a new component is created I verified that:
    • A similar component doesn't exist in the codebase
    • All props are defined accurately and each prop has a /** comment above it */
    • The file is named correctly
    • The component has a clear name that is non-ambiguous and the purpose of the component can be inferred from the name alone
    • The only data being stored in the state is data necessary for rendering and nothing else
    • If we are not using the full Onyx data that we loaded, I've added the proper selector in order to ensure the component only re-renders when the data it is using changes
    • For Class Components, any internal methods passed to components event handlers are bound to this properly so there are no scoping issues (i.e. for onClick={this.submit} the method this.submit should be bound to this in the constructor)
    • Any internal methods bound to this are necessary to be bound (i.e. avoid this.submit = this.submit.bind(this); if this.submit is never passed to a component event handler like onClick)
    • All JSX used for rendering exists in the render method
    • The component has the minimum amount of code necessary for its purpose, and it is broken down into smaller components in order to separate concerns and functions
  • If any new file was added I verified that:
    • The file has a description of what it does and/or why is needed at the top of the file if the code is not self explanatory
  • If the PR modifies a generic component, I tested and verified that those changes do not break usages of that component in the rest of the App (i.e. if a shared library or component like Avatar is modified, I verified that Avatar is working as expected in all cases)
  • If the main branch was merged into this PR after a review, I tested again and verified the outcome was still expected according to the Test steps.
  • I have checked off every checkbox in the PR author checklist, including those that don't apply to this PR.

Screenshots/Videos

Android: Native
Android: mWeb Chrome
iOS: Native
iOS: mWeb Safari
MacOS: Chrome / Safari

before:

Nagranie.z.ekranu.2026-06-22.o.15.05.45.mov

after:

Nagranie.z.ekranu.2026-06-22.o.15.07.34.mov

@sosek108

sosek108 commented Jun 8, 2026

Contributor Author

@codex review

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8bc993cafd

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread lib/OnyxUtils.ts
# Conflicts:
#	lib/OnyxUtils.ts
#	lib/storage/providers/IDBKeyValProvider/createStore.ts
@sosek108

Contributor Author

@codex review

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 23bc519d72

Comment thread lib/OnyxUtils.ts
@sosek108 sosek108 marked this pull request as ready for review June 23, 2026 07:53
@sosek108 sosek108 requested a review from a team as a code owner June 23, 2026 07:53
@melvin-bot melvin-bot Bot requested review from inimaga and removed request for a team June 23, 2026 07:53

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: edc6ec8dbb

Comment thread lib/OnyxUtils.ts
Comment on lines +894 to 897
StorageCircuitBreaker.recordEviction();

// @ts-expect-error No overload matches this call.
return remove(keyForRemoval).then(() => onyxMethod(defaultParams, nextRetryAttempt));

P2: Move eviction marker after deletion completes

Beyond the earlier stale-marker issue, this records an eviction before remove(keyForRemoval) has resolved. When several writes hit quota concurrently, the next write's rejection can run while the first deletion transaction is still pending, so recordCapacityFailure() counts it as a no-progress eviction even though nothing has been freed yet. After five such racing failures the breaker opens and subsequent storage writes are skipped for 60s, even though the evictions may still succeed. Record the eviction only after the deletion completes, or associate the no-progress verdict with the retry that follows that deletion.
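
The first option suggested here (record the eviction only once the deletion has resolved) could look roughly like the sketch below. It is an assumption-laden illustration rather than a patch for lib/OnyxUtils.ts: `evictThenRetry`, `removeKey`, `retry`, and the `EvictionRecorder` type are hypothetical stand-ins for `remove(keyForRemoval)`, the retried `onyxMethod` call, and `StorageCircuitBreaker`.

```typescript
// Hypothetical minimal view of the breaker used by this helper.
type EvictionRecorder = {recordEviction: () => void};

// Marks the eviction only after the deletion promise resolves, so a concurrent
// write that fails while the deletion is still pending is not counted as a
// no-progress eviction.
function evictThenRetry<T>(removeKey: () => Promise<void>, retry: () => Promise<T>, breaker: EvictionRecorder): Promise<T> {
    return removeKey().then(() => {
        breaker.recordEviction();
        return retry();
    });
}
```

If the deletion itself rejects, nothing is recorded and the rejection propagates to the caller, which seems consistent with the intent that only completed evictions feed the no-progress counter.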
