PHOENIX-7566 HAGroupStore state machine: ANISTS non-blocking, DS->STANDBY_TO_ACTIVE#2518
Conversation
…ED_STANDBY -> STANDBY_TO_ACTIVE - Map ACTIVE_NOT_IN_SYNC_TO_STANDBY to ClusterRole ACTIVE (non-blocking); only ACTIVE_IN_SYNC_TO_STANDBY remains mutation-blocking. - Allow DEGRADED_STANDBY -> STANDBY_TO_ACTIVE so a degraded standby can promote on failover. - Tests: HAGroupStoreRecordTest role/mutation-block and transition assertions; HAGroupStoreManagerIT multi-group blocking uses ACTIVE_IN_SYNC_TO_STANDBY. Co-authored-by: Cursor <cursoragent@cursor.com>
e846600 to
eb84f89
Compare
There was a problem hiding this comment.
Pull request overview
This PR adjusts Phoenix HAGroupStore state-machine semantics to enable cleaner failover behavior when the active cluster is not yet in sync, by making the ACTIVE_NOT_IN_SYNC_TO_STANDBY path non-mutation-blocking and allowing a degraded standby to promote directly.
Changes:
- Map
HAGroupState.ACTIVE_NOT_IN_SYNC_TO_STANDBYtoClusterRole.ACTIVE(non-blocking), leaving onlyACTIVE_IN_SYNC_TO_STANDBYasACTIVE_TO_STANDBY(mutation-blocking). - Allow
DEGRADED_STANDBY -> STANDBY_TO_ACTIVEtransition to enable direct promotion during failover. - Update unit/integration tests to reflect the updated mapping and transition behavior.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| phoenix-core-client/src/main/java/org/apache/phoenix/jdbc/HAGroupStoreRecord.java | Updates HA state→role mapping and extends DEGRADED_STANDBY allowed transitions to include STANDBY_TO_ACTIVE. |
| phoenix-core/src/test/java/org/apache/phoenix/jdbc/HAGroupStoreRecordTest.java | Updates role-mapping expectations and adds assertions covering mutation-blocking behavior and degraded-standby transitions. |
| phoenix-core/src/it/java/org/apache/phoenix/jdbc/HAGroupStoreManagerIT.java | Adjusts the integration test to use the mutation-blocking drain state (ACTIVE_IN_SYNC_TO_STANDBY) and clarifies per-group blocking expectations. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
apurtell
left a comment
There was a problem hiding this comment.
Approved.
I did the review myself, but did have the robots help with considering TLA model updates after this goes in. (I'll have some fun there.)
Then I asked for suggestions on improving test coverage because the current coverage seems a bit weak. These were the suggestions, fyi
-
Drive (ANIS, DS) -> (S, AIS) through the real ZK + FailoverManagementListener path in something like FailoverPhoenixConnectionIT or HAGroupStoreManagerIT, and assert both clusters converge to (S, AIS). The current unit test only pins the transition table.
-
Nit in HAGroupStoreRecordTest: Add an assertFalse(DEGRADED_STANDBY.isTransitionAllowed(SOMETHING_NOT_IN_SET)) to lock down the allowed set.
| ImmutableSet.of(ABORT_TO_ACTIVE_IN_SYNC, STANDBY); | ||
| STANDBY_TO_ACTIVE.allowedTransitions = ImmutableSet.of(ABORT_TO_STANDBY, ACTIVE_IN_SYNC); | ||
| DEGRADED_STANDBY.allowedTransitions = ImmutableSet.of(STANDBY); | ||
| DEGRADED_STANDBY.allowedTransitions = ImmutableSet.of(STANDBY, STANDBY_TO_ACTIVE); |
There was a problem hiding this comment.
Looks good. Fixes a DS side deadlock at peer AISTS.
| case ABORT_TO_ACTIVE_NOT_IN_SYNC: | ||
| case ACTIVE_IN_SYNC: | ||
| case ACTIVE_NOT_IN_SYNC: | ||
| case ACTIVE_NOT_IN_SYNC_TO_STANDBY: |
There was a problem hiding this comment.
Looks good. Reduces unavailability during failover. The safety argument shifts a bit, but the transition tables do not allow the peer to promote, so we still have only a single writer in all states.
HAGroupStore state-machine fixes for consistent failover (PHOENIX-7566).
What changes were proposed in this pull request?
HAGroupStoreRecord.getClusterRole(), mapACTIVE_NOT_IN_SYNC_TO_STANDBYto theACTIVErole (non-blocking). OnlyACTIVE_IN_SYNC_TO_STANDBYremainsACTIVE_TO_STANDBY(mutation-blocking).STANDBY_TO_ACTIVEtoDEGRADED_STANDBY.allowedTransitions, so a degraded standby can promote directly on failover (matches the TLA+ model).Why are the changes needed?
On the not-in-sync failover path, an active that is still draining its backlog was treated as mutation-blocked, even though it should keep serving writes until it is in sync and ready to hand off. Separately, a
DEGRADED_STANDBYhad no allowed transition toSTANDBY_TO_ACTIVE, so it could not promote during failover. Together these prevented a clean failover when the active is behind.Does this PR introduce any user-facing change?
No. Internal HA state-machine semantics only; no API, config, or CLI changes.
How was this patch tested?
HAGroupStoreRecordTest(17/17): role mapping + mutation-block coverage (ACTIVE_NOT_IN_SYNC_TO_STANDBYnon-blocking,ACTIVE_IN_SYNC_TO_STANDBYblocking) and theDEGRADED_STANDBY → STANDBY/STANDBY_TO_ACTIVEtransitions.HAGroupStoreManagerIT(17/17): per-group mutation blocking.Was this patch authored or co-authored using generative AI tooling?
Yes — co-authored with Cursor.