PHOENIX-7566 HAGroupStore admin tool: HDFS URLs, URL validation, failover recovery guidance #2519
Open
ritegarg wants to merge 4 commits into
fad8d80 to
fa67679
…nical SYSTEM.HA_GROUP writes, failover recovery guidance

- get/list print HDFS URL / Peer HDFS URL; update accepts --hdfs-url/--peer-hdfs-url (register options, count them in the field guard, show them in proposed changes).
- Validate URL fields on create/update against the registry type the read path uses (ZK quorum vs RPC/master), with a --force bypass; HAGroupStoreClient.getHAGroupNames skips + WARNs a row whose ZK URL will not parse instead of breaking enumeration for all callers; render an unparseable stored cluster URL as <invalid> in get-cluster-role-record instead of crashing.
- create/update now write the SYSTEM.HA_GROUP slot columns (including HDFS_URL_1/2) in a canonical order keyed on the formatted ZK URL, so each slot's ZK/CLUSTER/ROLE/HDFS columns stay paired and both clusters persist identical rows (matching the periodic ZK->SYSTEM.HA_GROUP sync). update previously wrote local-first and never wrote HDFS, which could leave ZK_URL_n unpaired from HDFS_URL_n.
- On initiate-failover/abort-failover timeout, print manual-recovery guidance (inspect both sides, restore connectivity, abort on standby, or force a steady state).

Co-authored-by: Cursor <cursoragent@cursor.com>
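The canonical slot ordering described in this commit can be illustrated with a minimal Java sketch. This is not the patch's code: the SlotValues and CanonicalSlots classes, the plain String.compareTo comparison, and the sample values used with them are assumptions made for illustration, and the PR's actual firstClusterTakesSlot1 logic may differ.

```java
/** Hypothetical holder for one cluster's slot columns; not a Phoenix class. */
final class SlotValues {
    final String formattedZkUrl;
    final String clusterUrl;
    final String role;
    final String hdfsUrl;

    SlotValues(String formattedZkUrl, String clusterUrl, String role, String hdfsUrl) {
        this.formattedZkUrl = formattedZkUrl;
        this.clusterUrl = clusterUrl;
        this.role = role;
        this.hdfsUrl = hdfsUrl;
    }
}

/** Hypothetical helper showing an ordering that does not depend on which side runs it. */
final class CanonicalSlots {
    private CanonicalSlots() { }

    /**
     * Returns {slot1, slot2}. Assumption: "canonical" is modeled here as
     * lexicographic order of the already-formatted ZK URL. Each cluster's
     * ZK/CLUSTER/ROLE/HDFS values travel together in one SlotValues object,
     * so they cannot end up split across slots.
     */
    static SlotValues[] order(SlotValues local, SlotValues peer) {
        boolean localFirst = local.formattedZkUrl.compareTo(peer.formattedZkUrl) <= 0;
        return localFirst
            ? new SlotValues[] {local, peer}
            : new SlotValues[] {peer, local};
    }
}
```

Each cluster passes itself as local, so order(a, b) on one side and order(b, a) on the other produce the same slot assignment. That is the property that lets both clusters persist identical rows.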
fa67679 to
f39678c
- firstClusterTakesSlot1: rename fa/fb to formattedZkUrlA/formattedZkUrlB and restore insertIntoSystemTable's javadoc.
- printFailoverRecoveryGuidance: drop the FAILOVER_RUNBOOK.md reference (not in repo) and merge the overlapping transitional-state recovery steps into one.
- get-cluster-role-record: inline the invalid-URL fallback and remove the redundant printClusterRoleRecordWithInvalidUrls helper.

Co-authored-by: Cursor <cursoragent@cursor.com>
…allback)

- firstClusterTakesSlot1: correct the javadoc. The canonical order is keyed on the formatted ZK URL; ClusterRoleRecord canonicalizes url1/url2 on the cluster URL (not the ZK URL), and the periodic ZK->SYSTEM.HA_GROUP sync now matches this ordering (apache#2521).
- get-cluster-role-record: on the invalid-URL fallback, surface the underlying cause so an unrelated RuntimeException is not silently mislabeled as a bad URL, and label the raw values Cluster URL / Peer Cluster URL (local/peer) instead of the slot-based Cluster 1/2 URL.
- update help: note that --force also stores malformed URLs as-is.

Co-authored-by: Cursor <cursoragent@cursor.com>
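The invalid-URL fallback described in this commit might look roughly like the following Java sketch. Everything here is an assumption for illustration: the ClusterUrlRenderSketch class, the formatter parameter standing in for whatever URL parsing the tool performs, and the exact output layout are not taken from the patch.

```java
import java.util.function.UnaryOperator;

/** Hypothetical rendering helper; not part of PhoenixHAAdminTool. */
final class ClusterUrlRenderSketch {
    private ClusterUrlRenderSketch() { }

    /**
     * Renders one labeled URL line. If the (assumed) formatter throws, the
     * value is shown as <invalid> together with the raw stored value and the
     * exception itself, so an unrelated RuntimeException is visible to the
     * operator rather than being presented as a bad URL.
     */
    static String render(String label, String rawUrl, UnaryOperator<String> formatter) {
        try {
            return label + ": " + formatter.apply(rawUrl);
        } catch (RuntimeException e) {
            return label + ": <invalid> (raw: " + rawUrl + ", cause: " + e + ")";
        }
    }
}
```

The labels passed in would be the local/peer ones named in the commit (Cluster URL / Peer Cluster URL) rather than slot-based names.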
Co-authored-by: Cursor <cursoragent@cursor.com>
HAGroupStore admin tool (PhoenixHAAdminTool) fixes for consistent failover (PHOENIX-7566).

What changes were proposed in this pull request?

- get/list print HDFS URL / Peer HDFS URL; update accepts --hdfs-url/--peer-hdfs-url (registered, counted in the "at least one field" guard, shown in proposed changes).
- SYSTEM.HA_GROUP writes: create/update write the slot columns (including HDFS_URL_1/2) in a canonical order keyed on the formatted ZK URL, so each slot's ZK/CLUSTER/ROLE/HDFS stay paired and both clusters persist identical rows (matching the periodic ZK→SYSTEM.HA_GROUP sync). Previously update wrote local-first and never wrote HDFS, which could leave ZK_URL_n unpaired from HDFS_URL_n.
- create/update validate URL fields against the registry type the read path uses (ZK quorum vs RPC/master). ZK URLs are always validated (they identify the HA pair); --force stores only malformed cluster URLs as-is.
- HAGroupStoreClient.getHAGroupNames skips + WARNs a row whose ZK URL won't parse instead of failing enumeration for all callers.
- get-cluster-role-record renders an unparseable stored cluster URL as <invalid> (surfacing the cause) instead of throwing.
- On initiate-failover/abort-failover timeout, prints manual-recovery steps (inspect both clusters, restore connectivity, abort on the standby, or force a steady state).

Why are the changes needed?

The tool could not display or set HDFS URLs, and update could persist a row whose ZK/CLUSTER/ROLE/HDFS slots were unpaired. A single malformed stored URL could crash list/get-cluster-role-record and break HA-group enumeration for server-side callers. A timed-out failover left the operator with no recovery direction.

Does this PR introduce any user-facing change?

Yes, CLI only: new --hdfs-url/--peer-hdfs-url and HDFS URLs in get/list; canonical, paired SYSTEM.HA_GROUP rows; malformed cluster URLs rejected on create/update (overridable with --force; ZK URLs always validated); clearer get-cluster-role-record and failover-timeout output. No API or wire changes.

How was this patch tested?

PhoenixHAAdminToolIT (18/18) and PhoenixHAAdminIT (9/9).

Was this patch authored or co-authored using generative AI tooling?

Yes, co-authored with Cursor.
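The skip + WARN behavior described for HAGroupStoreClient.getHAGroupNames can be sketched in Java as below. This is only an illustration under stated assumptions: the HaGroupEnumerationSketch class, the map of group name to stored ZK URL, the zkUrlFormatter parameter, and the use of java.util.logging are all hypothetical stand-ins, not the client's real code or logging setup.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.UnaryOperator;
import java.util.logging.Logger;

/** Hypothetical enumeration helper; not the real HAGroupStoreClient. */
final class HaGroupEnumerationSketch {
    private static final Logger LOG =
        Logger.getLogger(HaGroupEnumerationSketch.class.getName());

    private HaGroupEnumerationSketch() { }

    /**
     * Returns the names of groups whose stored ZK URL the (assumed) formatter
     * accepts. A row whose URL makes the formatter throw is logged at WARNING
     * and skipped, so one bad row does not fail the whole enumeration.
     */
    static List<String> namesWithParseableZkUrl(
            Map<String, String> zkUrlByGroup, UnaryOperator<String> zkUrlFormatter) {
        List<String> names = new ArrayList<>();
        for (Map.Entry<String, String> row : zkUrlByGroup.entrySet()) {
            try {
                zkUrlFormatter.apply(row.getValue());
                names.add(row.getKey());
            } catch (RuntimeException e) {
                LOG.warning("Skipping HA group " + row.getKey()
                    + ": stored ZK URL will not parse: " + e.getMessage());
            }
        }
        return names;
    }
}
```

The trade-off in this shape is that a skipped group is only visible in the log, which is why a warning rather than a lower log level seems appropriate.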