From f7d92d5cd421fb51552af542c5e2f5198caf292e Mon Sep 17 00:00:00 2001
From: 1008covingtonlane <42551186+1008covingtonlane@users.noreply.github.com>
Date: Fri, 26 Jun 2026 10:00:42 -0400
Subject: [PATCH 1/5] Add TSG: Secure Boot (AzStackHci_Hardware_Test_Secure_Boot)

New Environment Validator troubleshooting guide for the Hardware Secure
Boot check (Test-SecureBoot, which runs Confirm-SecureBootUEFI). Covers
where the failure appears (portal, on-box validator, and the
AzStackHciEnvironmentChecker event log), how to enable UEFI Secure Boot in
firmware, and a BitLocker precaution: suspend BitLocker before the firmware
change and resume after, since Azure Local enables data-at-rest encryption
by default and a Secure Boot change is measured into TPM PCR 7. Indexed in
the EnvironmentValidator README.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
---
 TSG/EnvironmentValidator/README.md            |   1 +
 ...oubleshooting-Hardware-Test-Secure-Boot.md | 242 ++++++++++++++++++
 2 files changed, 243 insertions(+)
 create mode 100644 TSG/EnvironmentValidator/Troubleshooting-Hardware-Test-Secure-Boot.md

diff --git a/TSG/EnvironmentValidator/README.md b/TSG/EnvironmentValidator/README.md
index 11611c19..f46b592a 100644
--- a/TSG/EnvironmentValidator/README.md
+++ b/TSG/EnvironmentValidator/README.md
@@ -6,6 +6,7 @@ This folder contains the TSG's related to Environment Validators.
 * [Troubleshooting Test NetAdapter API Failure](./Troubleshooting-Test-NetAdapter-API.md)
 * [Troubleshooting Test PhysicalDisk API Failure](./Troubleshooting-Test-PhysicalDisk-API.md)
 * [Troubleshooting Test System Drive Free Space](./Troubleshooting-Test-SystemDrive-Free-Space.md)
+* [Troubleshooting Secure Boot (Hardware Test Secure Boot)](./Troubleshooting-Hardware-Test-Secure-Boot.md)
 * [Troubleshooting TestPowerShell Module Version](./Troubleshooting-Test-PowerShell-Module-Version.md)
 * [Troubleshooting Module Versions](Troubleshooting-Module-Versions.md)
 * [Troubleshooting MSI Does Not Have Access to Subscription](Troubleshooting-MSI-Does-Not-Have-Access-To-Subscription.md)
diff --git a/TSG/EnvironmentValidator/Troubleshooting-Hardware-Test-Secure-Boot.md b/TSG/EnvironmentValidator/Troubleshooting-Hardware-Test-Secure-Boot.md
new file mode 100644
index 00000000..13aa6432
--- /dev/null
+++ b/TSG/EnvironmentValidator/Troubleshooting-Hardware-Test-Secure-Boot.md
@@ -0,0 +1,242 @@
+# AzStackHci_Hardware_Test_Secure_Boot
+
+| Property | Value |
+|---|---|
+| Name | `AzStackHci_Hardware_Test_Secure_Boot` (aggregated as `AzStackHci_Hardware_SecureBoot`) |
+| Display name | Secure Boot |
+| Validator / test | `Test-SecureBoot` (run with `Invoke-AzStackHciHardwareValidation`) |
+| Component | Hardware (Environment Validator / Environment Checker) |
+| Severity | Critical: this validator blocks deployment until UEFI Secure Boot is enabled on the machine. |
+| Requirement | Each machine must have UEFI Secure Boot enabled in firmware before deployment. |
+| Applicable Scenarios | Deployment and Add Node (pre-deployment validation). |
+| Affected Versions | Azure Local, version 23H2 and later. |
+
+## Overview
+
+This validator checks that **UEFI Secure Boot is enabled** on each Azure Local machine.
+Secure Boot is part of the Azure Local hardware security baseline: it ensures the
+machine only loads boot software that is signed by a trusted authority, and it is a
+prerequisite for the platform's secured-core and attestation features. The check fails
+when Secure Boot is disabled, or when the machine is not in a state where Secure Boot
+can be evaluated (for example a machine running in legacy BIOS / CSM mode rather than
+UEFI).
+
+It runs by calling `Confirm-SecureBootUEFI` on each machine. A machine with Secure Boot
+enabled returns `True` and the check is a **SUCCESS**; a machine with Secure Boot
+disabled returns `False` and the check is a **FAILURE**. On a machine that is not in
+UEFI mode at all, `Confirm-SecureBootUEFI` reports that the platform does not support
+the cmdlet, which is treated as the machine not meeting the Secure Boot requirement.
+
+While this check is failing, deployment is blocked at the Hardware validation stage and
+the machine cannot proceed. This is a pre-deployment gate, so it does not change the
+state of a cluster that is already running; it stops a machine from being deployed (or
+added) while Secure Boot is off.
+
+## Where this failure appears
+
+You can see this failure in two places, the Azure portal and the machine itself. Both
+show the same underlying result.
+
+### In the Azure portal
+
+This check runs during the deployment validation step. When you deploy Azure Local from
+the portal (or with a deployment template), the **Validation** phase runs the
+environment checks and lists any that fail:
+
+1. Open the Azure Local deployment for your cluster and go to its **Validation**
+   results (the deployment surfaces these before it proceeds to apply).
+2. In the list of checks, this one appears under its display name, **Secure Boot**, with
+   a **Critical** severity.
+3. Select the failing check to see the per-machine detail, which names the machine whose
+   Secure Boot is off.
+
+### On the machine
+
+Two on-box sources carry the result.
+
+**Run the single validator (fastest).** The Environment Checker module ships on every
+Azure Local machine, so you can run this one Hardware check directly and read the result
+in a few seconds. Use `-Include Test-SecureBoot` to run only this check, so you do not
+have to run the full Hardware validation suite:
+
+```powershell
+$r = Invoke-AzStackHciHardwareValidation -Include Test-SecureBoot -PassThru
+$r | Select-Object Name, Status, Severity
+$r.AdditionalData.Detail
+```
+
+You can also read the underlying value directly:
+
+```powershell
+# True = Secure Boot enabled; False = supported but disabled.
+# An error ("Cmdlet not supported on this platform") means the machine is not in UEFI mode.
+Confirm-SecureBootUEFI
+```
+
+A machine with Secure Boot disabled returns `Status` of `FAILURE` and a detail line of
+the form:
+
+```
+SecureBoot is 'False' on AzL-Node-01. Expected 'True'. Ensure SecureBoot is supported and enabled on AzL-Node-01.
+```
+
+**Event log (per machine).** The Environment Checker writes every check result to the
+**AzStackHciEnvironmentChecker** event log, located at
+`C:\Windows\System32\winevt\Logs\AzStackHciEnvironmentChecker.evtx`. Each result is the
+JSON body of an **Event ID 17205** entry. To read this check's most recent result on a
To read this check's most recent result on a +machine: + +```powershell +Get-WinEvent -LogName AzStackHciEnvironmentChecker -FilterXPath '*[System[(EventID=17205)]]' -MaxEvents 2000 | + Where-Object { $_.Message -match 'AzStackHci_Hardware_Test_Secure_Boot' } | + Select-Object -First 1 -ExpandProperty Message +``` + +In both sources the result for this check looks like this: + +```json +{ + "Name": "AzStackHci_Hardware_Test_Secure_Boot", + "DisplayName": "Secure Boot", + "Title": "Secure Boot", + "Severity": "Critical", + "Status": "FAILURE", + "Description": "Validates that UEFI Secure Boot is enabled on each machine by running Confirm-SecureBootUEFI. Secure Boot must be enabled before Azure Local deployment.", + "TargetResourceName": "AzL-Node-01", + "Remediation": "Enable UEFI Secure Boot in the machine firmware and re-run the check. The machine must be in UEFI boot mode for Secure Boot to apply.", + "AdditionalData": { + "Detail": "SecureBoot is 'False' on AzL-Node-01. Expected 'True'. Ensure SecureBoot is supported and enabled on AzL-Node-01.", + "Status": "FAILURE" + } +} +``` + +## How to fix it + +Secure Boot is a firmware (UEFI/BIOS) setting, so the fix is made in the machine's +firmware setup, not from Windows. The high-level steps are: (optionally) suspend +BitLocker so the firmware change does not trip the drive into recovery, enable Secure +Boot in firmware, then re-run the check. + +### 1. If the machine has BitLocker enabled, suspend it first + +Changing Secure Boot alters the machine's measured-boot state (it is measured into TPM +PCR 7). On a machine where **BitLocker is enabled, the next boot after the change will +stop at the BitLocker recovery screen** and ask for the 48-digit recovery password, +which can strand the machine. Azure Local enables data-at-rest encryption (BitLocker) by +default, so a machine that has already been deployed (or any machine with drive +encryption) is affected. 
A fresh, pre-deployment machine that has never been encrypted is +not. + +If BitLocker is on, suspend it **before** you change firmware, and resume it **after** +the machine is back and Secure Boot is confirmed. Use `-RebootCount 0` so the suspend +holds across the firmware change and reboot until you explicitly resume it: + +```powershell +# Are any volumes protected? +Get-BitLockerVolume | Select-Object MountPoint, ProtectionStatus, VolumeStatus + +# Suspend each protected volume indefinitely (until you resume it). +Suspend-BitLocker -MountPoint "C:" -RebootCount 0 +# Repeat for any data volumes that report ProtectionStatus = On, for example: +# Suspend-BitLocker -MountPoint "C:\ClusterStorage\Volume1" -RebootCount 0 +``` + +You will resume BitLocker in step 4, after Secure Boot is confirmed. Confirm the +recovery key is available (escrowed) before you start, so the machine is recoverable +even if something interrupts the change. + +### 2. Enable Secure Boot in firmware (UEFI/BIOS) + +1. Reboot the machine and enter firmware setup (the key varies by vendor, commonly + `F2`, `F10`, `Del`, or via the BMC / iDRAC / iLO / XClarity remote console). +2. Make sure the machine is in **UEFI boot mode**, not legacy BIOS / CSM. Secure Boot + only applies in UEFI mode, and the OS disk must be GPT. If the machine was installed + in legacy/MBR mode, switching boot mode alone will not boot the existing OS; the + machine must be installed in UEFI mode. +3. Enable **Secure Boot**. If prompted, make sure the standard Microsoft Secure Boot keys + (PK / KEK / db) are provisioned (often shown as "Install default Secure Boot keys" or + "Standard" key configuration). +4. Save and exit, and let the machine boot back into the OS. + +The exact menu names are vendor-specific; consult your hardware vendor's documentation +for the precise location of the Secure Boot and boot-mode settings. + +### 3. 
Confirm Secure Boot is on + +```powershell +Confirm-SecureBootUEFI +``` + +This should now return `True`. + +### 4. Resume BitLocker (only if you suspended it in step 1) + +```powershell +Resume-BitLocker -MountPoint "C:" +# And any data volumes you suspended, for example: +# Resume-BitLocker -MountPoint "C:\ClusterStorage\Volume1" +``` + +Resuming reseals the BitLocker key to the new (Secure Boot enabled) measurements, and the +machine boots normally from then on. + +## Verify the fix + +Re-run the single validator: + +```powershell +$r = Invoke-AzStackHciHardwareValidation -Include Test-SecureBoot -PassThru +$r | Select-Object Name, Status, Severity +$r.AdditionalData.Detail +``` + +A machine with Secure Boot enabled returns `Status` of `SUCCESS`. Once every machine you +are deploying reports success, re-run the deployment validation; the **Secure Boot** +check should now pass and deployment can proceed. + +## When to escalate + +Open a support case if any of the following are true: + +- `Confirm-SecureBootUEFI` returns `True`, but the **Secure Boot** check still fails + during deployment validation. +- The firmware has no Secure Boot setting, or Secure Boot cannot be enabled because the + platform does not support it. Secure Boot is an Azure Local hardware requirement, so + confirm the machine is on the Azure Local supported hardware list. +- `Confirm-SecureBootUEFI` reports that the cmdlet is not supported on the platform even + after you have set the machine to UEFI boot mode. +- The machine stops at the BitLocker recovery screen after the change and the recovery + key is not available. 
+
+## Related
+
+- General Environment Checker remediation link shown in the validator output:
+  https://aka.ms/hci-envch
+- [Azure Local security features and baseline](https://learn.microsoft.com/azure/azure-local/concepts/security-features)
+- [Secure Boot (Windows hardware security)](https://learn.microsoft.com/windows-hardware/design/device-experiences/oem-secure-boot)
+- [Suspend-BitLocker before firmware changes](https://learn.microsoft.com/powershell/module/bitlocker/suspend-bitlocker)

From 280334a143d1d00d84075b5d45370d02683e06b2 Mon Sep 17 00:00:00 2001
From: 1008covingtonlane <42551186+1008covingtonlane@users.noreply.github.com>
Date: Fri, 26 Jun 2026 10:18:37 -0400
Subject: [PATCH 2/5] TSG Secure Boot: add cluster-member drain/quorum safety for the deployed-node path

Address PR review: the realistic BitLocker case is an already-deployed,
encrypted cluster member, so enabling Secure Boot reboots a live node. Add a
drain-first step (Suspend-ClusterNode -Drain with cluster-health and quorum
pre-checks, one node at a time) and a resume-node step (Resume-ClusterNode,
wait for storage resync) around the existing BitLocker suspend/resume flow.
Reconcile the Overview scope line to acknowledge the deployed-node path, and
cite Suspend-ClusterNode and Resume-ClusterNode. The BitLocker
suspend/resume content is unchanged.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
---
 ...oubleshooting-Hardware-Test-Secure-Boot.md | 66 +++++++++++++++----
 1 file changed, 55 insertions(+), 11 deletions(-)

diff --git a/TSG/EnvironmentValidator/Troubleshooting-Hardware-Test-Secure-Boot.md b/TSG/EnvironmentValidator/Troubleshooting-Hardware-Test-Secure-Boot.md
index 13aa6432..9b5e5114 100644
--- a/TSG/EnvironmentValidator/Troubleshooting-Hardware-Test-Secure-Boot.md
+++ b/TSG/EnvironmentValidator/Troubleshooting-Hardware-Test-Secure-Boot.md
@@ -52,9 +52,11 @@ UEFI mode at all, `Confirm-SecureBootUEFI` reports that the platform does not su
 the cmdlet, which is treated as the machine not meeting the Secure Boot requirement.
 
 While this check is failing, deployment is blocked at the Hardware validation stage and
-the machine cannot proceed. This is a pre-deployment gate, so it does not change the
-state of a cluster that is already running; it stops a machine from being deployed (or
-added) while Secure Boot is off.
+the machine cannot proceed. This is primarily a pre-deployment gate that stops a machine
+from being deployed (or added) while Secure Boot is off. If you instead need to enable
+Secure Boot on a machine that is already a deployed, encrypted cluster member, follow the
+cluster-member drain steps below so the firmware reboot does not disrupt workloads or
+storage.
 
 ## Where this failure appears
 
@@ -138,11 +140,37 @@ In both sources the result for this check looks like this:
 ## How to fix it
 
 Secure Boot is a firmware (UEFI/BIOS) setting, so the fix is made in the machine's
-firmware setup, not from Windows. The high-level steps are: (optionally) suspend
-BitLocker so the firmware change does not trip the drive into recovery, enable Secure
-Boot in firmware, then re-run the check.
+firmware setup, not from Windows. The high-level steps are: if the machine is an
+already-deployed cluster member, drain it first; if it has BitLocker on, suspend BitLocker
+so the firmware change does not trip the drive into recovery; enable Secure Boot in
+firmware; resume BitLocker; and resume the node. Then re-run the check.
 
-### 1. If the machine has BitLocker enabled, suspend it first
+### 1. If the machine is an already-deployed cluster member, drain it first
+
+If this machine has BitLocker on, it has almost certainly already been deployed into a
+cluster (Azure Local turns on encryption during deployment). Enabling Secure Boot requires
+a reboot into firmware, which takes this node down, so drain it first and do **one node at
+a time**. This is a [MEDIUM RISK] change: draining live-migrates VMs off the node, and the
+node is unavailable until you resume it.
+
+```powershell
+# Confirm the cluster is healthy and can lose this one node before you start.
+Get-ClusterNode | Select-Object Name, State # every other node should be Up
+Get-VirtualDisk | Select-Object FriendlyName, HealthStatus, OperationalStatus # all Healthy / OK
+Get-StorageJob # should be empty (no active repair/resync)
+```
+
+Only continue when every other node is `Up`, all virtual disks are Healthy, and
+`Get-StorageJob` returns nothing. Then pause and drain this node so its VMs live-migrate
+off:
+
+```powershell
+Suspend-ClusterNode -Name <NodeName> -Drain
+# Confirm the node is Paused and its roles have moved before you reboot it.
+Get-ClusterNode -Name <NodeName> | Select-Object Name, State # State should be Paused
+```
+
+### 2. If the machine has BitLocker enabled, suspend it first
 
 Changing Secure Boot alters the machine's measured-boot state (it is measured into TPM
 PCR 7). On a machine where **BitLocker is enabled, the next boot after the change will
@@ -166,11 +194,11 @@ Suspend-BitLocker -MountPoint "C:" -RebootCount 0
 # Suspend-BitLocker -MountPoint "C:\ClusterStorage\Volume1" -RebootCount 0
 ```
 
-You will resume BitLocker in step 4, after Secure Boot is confirmed. Confirm the
+You will resume BitLocker in step 5, after Secure Boot is confirmed. Confirm the
 recovery key is available (escrowed) before you start, so the machine is recoverable
 even if something interrupts the change.
 
-### 2. Enable Secure Boot in firmware (UEFI/BIOS)
+### 3. Enable Secure Boot in firmware (UEFI/BIOS)
 
 1. Reboot the machine and enter firmware setup (the key varies by vendor, commonly
    `F2`, `F10`, `Del`, or via the BMC / iDRAC / iLO / XClarity remote console).
@@ -186,7 +214,7 @@ even if something interrupts the change.
 The exact menu names are vendor-specific; consult your hardware vendor's documentation
 for the precise location of the Secure Boot and boot-mode settings.
 
-### 3. Confirm Secure Boot is on
+### 4. Confirm Secure Boot is on
 
 ```powershell
 Confirm-SecureBootUEFI
@@ -194,7 +222,7 @@ Confirm-SecureBootUEFI
 
 This should now return `True`.
 
-### 4. Resume BitLocker (only if you suspended it in step 1)
+### 5. Resume BitLocker (only if you suspended it in step 2)
 
 ```powershell
 Resume-BitLocker -MountPoint "C:"
@@ -205,6 +233,20 @@ Resume-BitLocker -MountPoint "C:"
 Resuming reseals the BitLocker key to the new (Secure Boot enabled) measurements, and the
 machine boots normally from then on.
 
+### 6. Resume the cluster node (only if you drained it in step 1)
+
+Bring the node back into the cluster and let storage resync before you touch the next node.
+
+```powershell
+Resume-ClusterNode -Name <NodeName>
+# Wait for resync to finish before moving on; do not drain the next node until this clears.
+Get-StorageJob # wait until empty
+Get-VirtualDisk | Select-Object FriendlyName, HealthStatus # back to Healthy
+```
+
+Repeat steps 1 through 6 for each remaining machine, one node at a time, so the cluster
+always keeps quorum and storage resiliency.
+
 ## Verify the fix
 
 Re-run the single validator:
@@ -240,3 +282,5 @@ Open a support case if any of the following are true:
 - [Azure Local security features and baseline](https://learn.microsoft.com/azure/azure-local/concepts/security-features)
 - [Secure Boot (Windows hardware security)](https://learn.microsoft.com/windows-hardware/design/device-experiences/oem-secure-boot)
 - [Suspend-BitLocker before firmware changes](https://learn.microsoft.com/powershell/module/bitlocker/suspend-bitlocker)
+- [Suspend-ClusterNode (pause and drain a node)](https://learn.microsoft.com/powershell/module/failoverclusters/suspend-clusternode)
+- [Resume-ClusterNode](https://learn.microsoft.com/powershell/module/failoverclusters/resume-clusternode)

From 6f70f29e222d42a90b7d1a01841d967db6249806 Mon Sep 17 00:00:00 2001
From: 1008covingtonlane <42551186+1008covingtonlane@users.noreply.github.com>
Date: Fri, 26 Jun 2026 14:58:00 -0400
Subject: [PATCH 3/5] Secure Boot TSG: reframe for the pre-deployment scenario (BitLocker primary, drain gated)

Apply the same framing the TPM Version TSG (#305) landed and that PR #170
captured into the harness skill. Test-SecureBoot runs in the Hardware
validator (Deployment and Add Node) and the readiness/bootstrap set, with no
upgrade-renamed variant, so the machine it flags is a host being validated
to become a node, not a deployed member.

- BitLocker check/suspend is now the primary step 1, because a host being
  vetted may have been recycled from a prior project with BitLocker already
  enabled; a Secure Boot change (TPM PCR 7) would trip an encrypted volume
  into recovery regardless of deployment state.
- The cluster-drain/quorum steps move into a gated 'If the machine is
  already a deployed, encrypted cluster member' section instead of being
  front-loaded as step 1.
- Steps are now: 1 check/suspend BitLocker, 2 enable Secure Boot, 3 confirm,
  4 resume BitLocker, plus the gated deployed-member drain section.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
---
 ...oubleshooting-Hardware-Test-Secure-Boot.md | 124 +++++++++---------
 1 file changed, 63 insertions(+), 61 deletions(-)

diff --git a/TSG/EnvironmentValidator/Troubleshooting-Hardware-Test-Secure-Boot.md b/TSG/EnvironmentValidator/Troubleshooting-Hardware-Test-Secure-Boot.md
index 9b5e5114..c49a1107 100644
--- a/TSG/EnvironmentValidator/Troubleshooting-Hardware-Test-Secure-Boot.md
+++ b/TSG/EnvironmentValidator/Troubleshooting-Hardware-Test-Secure-Boot.md
@@ -52,11 +52,13 @@ UEFI mode at all, `Confirm-SecureBootUEFI` reports that the platform does not su
 the cmdlet, which is treated as the machine not meeting the Secure Boot requirement.
 
 While this check is failing, deployment is blocked at the Hardware validation stage and
-the machine cannot proceed. This is primarily a pre-deployment gate that stops a machine
-from being deployed (or added) while Secure Boot is off. If you instead need to enable
-Secure Boot on a machine that is already a deployed, encrypted cluster member, follow the
-cluster-member drain steps below so the firmware reboot does not disrupt workloads or
-storage.
+the machine cannot proceed. This is a pre-deployment gate (it runs during Deployment and
+Add Node validation), so the machine it flags is normally a **host being validated to
+become a cluster node**, not a running cluster member. The fix is usually short: check for
+BitLocker, enable Secure Boot in firmware, and re-validate. Two cautions apply, though: a
+host being vetted may have been **recycled from another project and could already have
+BitLocker enabled** (a Secure Boot change trips it into recovery, so check first), and the
+cluster-drain precaution is only needed if the machine is already a live, deployed member.
 
 ## Where this failure appears
 
@@ -139,66 +141,46 @@ In both sources the result for this check looks like this:
 
 ## How to fix it
 
-Secure Boot is a firmware (UEFI/BIOS) setting, so the fix is made in the machine's
-firmware setup, not from Windows. The high-level steps are: if the machine is an
-already-deployed cluster member, drain it first; if it has BitLocker on, suspend BitLocker
-so the firmware change does not trip the drive into recovery; enable Secure Boot in
-firmware; resume BitLocker; and resume the node. Then re-run the check.
+This check runs during **pre-deployment validation**, so the machine it flags is normally a
+**host being prepared to become a cluster node**, not a running cluster member: there is
+usually no cluster to keep in quorum. Do not assume the host is otherwise "clean", though.
+A host being vetted may have been **recycled from another project and could already have
+BitLocker enabled**, and a Secure Boot change is measured into TPM PCR 7, which trips an
+encrypted volume into recovery, so check for BitLocker before you touch firmware (step 1).
+The cluster-drain precaution only applies in the uncommon case that the machine is already
+a live, deployed cluster member.
 
-### 1. If the machine is an already-deployed cluster member, drain it first
+Secure Boot is a firmware (UEFI/BIOS) setting, so the change is made in the machine's
+firmware setup, not from Windows.
 
-If this machine has BitLocker on, it has almost certainly already been deployed into a
-cluster (Azure Local turns on encryption during deployment). Enabling Secure Boot requires
-a reboot into firmware, which takes this node down, so drain it first and do **one node at
-a time**. This is a [MEDIUM RISK] change: draining live-migrates VMs off the node, and the
-node is unavailable until you resume it.
+### 1. Check for BitLocker, and suspend it if present
 
-```powershell
-# Confirm the cluster is healthy and can lose this one node before you start.
-Get-ClusterNode | Select-Object Name, State # every other node should be Up
-Get-VirtualDisk | Select-Object FriendlyName, HealthStatus, OperationalStatus # all Healthy / OK
-Get-StorageJob # should be empty (no active repair/resync)
-```
-
-Only continue when every other node is `Up`, all virtual disks are Healthy, and
-`Get-StorageJob` returns nothing. Then pause and drain this node so its VMs live-migrate
-off:
+Do this even on a fresh pre-deployment host. A host you are vetting may have been **recycled
+from a previous project with BitLocker already enabled**, and changing Secure Boot alters the
+machine's measured-boot state (it is measured into TPM PCR 7). If a protected volume is left
+armed, **the next boot after the change stops at the BitLocker recovery screen** and asks for
+the 48-digit recovery password, which can strand the machine.
 
 ```powershell
-Suspend-ClusterNode -Name <NodeName> -Drain
-# Confirm the node is Paused and its roles have moved before you reboot it.
-Get-ClusterNode -Name <NodeName> | Select-Object Name, State # State should be Paused
+# Are any volumes protected? (On a truly clean, never-encrypted host this is empty.)
+Get-BitLockerVolume | Select-Object MountPoint, ProtectionStatus, VolumeStatus
 ```
 
-### 2. If the machine has BitLocker enabled, suspend it first
-
-Changing Secure Boot alters the machine's measured-boot state (it is measured into TPM
-PCR 7). On a machine where **BitLocker is enabled, the next boot after the change will
-stop at the BitLocker recovery screen** and ask for the 48-digit recovery password,
-which can strand the machine. Azure Local enables data-at-rest encryption (BitLocker) by
-default, so a machine that has already been deployed (or any machine with drive
-encryption) is affected. A fresh, pre-deployment machine that has never been encrypted is
-not.
-
-If BitLocker is on, suspend it **before** you change firmware, and resume it **after**
-the machine is back and Secure Boot is confirmed. Use `-RebootCount 0` so the suspend
-holds across the firmware change and reboot until you explicitly resume it:
+If every volume reports `ProtectionStatus = Off`, there is nothing to suspend; go to step 2.
+If any volume is protected, **confirm its recovery key is escrowed first**, then suspend it
+with `-RebootCount 0` so the suspend holds across the firmware change and reboot until you
+explicitly resume it:
 
 ```powershell
-# Are any volumes protected?
-Get-BitLockerVolume | Select-Object MountPoint, ProtectionStatus, VolumeStatus
-
-# Suspend each protected volume indefinitely (until you resume it).
 Suspend-BitLocker -MountPoint "C:" -RebootCount 0
-# Repeat for any data volumes that report ProtectionStatus = On, for example:
-# Suspend-BitLocker -MountPoint "C:\ClusterStorage\Volume1" -RebootCount 0
+# Repeat for any data volume that reports ProtectionStatus = On, for example:
+# Suspend-BitLocker -MountPoint "D:" -RebootCount 0
 ```
 
-You will resume BitLocker in step 5, after Secure Boot is confirmed. Confirm the
-recovery key is available (escrowed) before you start, so the machine is recoverable
-even if something interrupts the change.
+### 2. Enable Secure Boot in firmware (UEFI/BIOS)
 
-### 3. Enable Secure Boot in firmware (UEFI/BIOS)
+> If this machine is already a deployed, encrypted cluster member, do **not** reboot it into
+> firmware yet. Follow [If the machine is already a deployed cluster member](#if-the-machine-is-already-a-deployed-encrypted-cluster-member) first so you take the node down safely.
 
 1. Reboot the machine and enter firmware setup (the key varies by vendor, commonly
    `F2`, `F10`, `Del`, or via the BMC / iDRAC / iLO / XClarity remote console).
@@ -214,7 +196,7 @@ even if something interrupts the change.
 The exact menu names are vendor-specific; consult your hardware vendor's documentation
 for the precise location of the Secure Boot and boot-mode settings.
 
-### 4. Confirm Secure Boot is on
+### 3. Confirm Secure Boot is on
 
 ```powershell
 Confirm-SecureBootUEFI
@@ -222,30 +204,50 @@ Confirm-SecureBootUEFI
 
 This should now return `True`.
 
-### 5. Resume BitLocker (only if you suspended it in step 2)
+### 4. Resume BitLocker (only if you suspended it in step 1)
 
 ```powershell
 Resume-BitLocker -MountPoint "C:"
-# And any data volumes you suspended, for example:
-# Resume-BitLocker -MountPoint "C:\ClusterStorage\Volume1"
+# And any data volume you suspended, for example:
+# Resume-BitLocker -MountPoint "D:"
 ```
 
 Resuming reseals the BitLocker key to the new (Secure Boot enabled) measurements, and the
 machine boots normally from then on.
 
-### 6. Resume the cluster node (only if you drained it in step 1)
+### If the machine is already a deployed, encrypted cluster member
+
+Because this is a pre-deployment check, it does not normally fire on a machine that is
+already a deployed cluster node. But if you are enabling Secure Boot on a machine that is
+**already a live, encrypted cluster member** for any reason, add one precaution to the steps
+above: the firmware reboot takes a running node down, so **drain it first** and do this
+**one node at a time**.
+
+This is a [MEDIUM RISK] change: draining live-migrates VMs off the node, and the node is
+unavailable until you resume it.
+
+```powershell
+# Confirm the cluster is healthy and can lose this one node before you start.
+Get-ClusterNode | Select-Object Name, State # every other node should be Up
+Get-VirtualDisk | Select-Object FriendlyName, HealthStatus, OperationalStatus # all Healthy / OK
+Get-StorageJob # should be empty (no active repair/resync)
+
+# Only when the cluster is healthy, pause and drain this node so its VMs live-migrate off.
+Suspend-ClusterNode -Name <NodeName> -Drain
+Get-ClusterNode -Name <NodeName> | Select-Object Name, State # State should be Paused
+```
 
-Bring the node back into the cluster and let storage resync before you touch the next node.
+Then run steps 1 through 4 above (suspend BitLocker, enable Secure Boot, confirm, resume
+BitLocker). Finally bring the node back and let storage resync before the next one:
 
 ```powershell
 Resume-ClusterNode -Name <NodeName>
-# Wait for resync to finish before moving on; do not drain the next node until this clears.
 Get-StorageJob # wait until empty
 Get-VirtualDisk | Select-Object FriendlyName, HealthStatus # back to Healthy
 ```
 
-Repeat steps 1 through 6 for each remaining machine, one node at a time, so the cluster
-always keeps quorum and storage resiliency.
+Repeat for each remaining member, one node at a time, so the cluster always keeps quorum and
+storage resiliency.
 
 ## Verify the fix
 

From 3bba286aa9c84a0863e768ecdaad21d9b1300c39 Mon Sep 17 00:00:00 2001
From: 1008covingtonlane <42551186+1008covingtonlane@users.noreply.github.com>
Date: Fri, 26 Jun 2026 15:07:16 -0400
Subject: [PATCH 4/5] Secure Boot TSG: match both check names in the event-log filter

Resolve the reviewer note: the AzStackHciEnvironmentChecker event can carry
either the per-check name (AzStackHci_Hardware_Test_Secure_Boot) or the
aggregated name (AzStackHci_Hardware_SecureBoot), so the Get-WinEvent filter
now matches both, mirroring the SystemDrive Free Space TSG
(AzStackHci_Hardware_(Test_SystemDrive_Free_Space|SystemDriveFreeSpace)).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
---
 .../Troubleshooting-Hardware-Test-Secure-Boot.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/TSG/EnvironmentValidator/Troubleshooting-Hardware-Test-Secure-Boot.md b/TSG/EnvironmentValidator/Troubleshooting-Hardware-Test-Secure-Boot.md
index c49a1107..fbc96880 100644
--- a/TSG/EnvironmentValidator/Troubleshooting-Hardware-Test-Secure-Boot.md
+++ b/TSG/EnvironmentValidator/Troubleshooting-Hardware-Test-Secure-Boot.md
@@ -116,7 +116,7 @@ machine:
 
 ```powershell
 Get-WinEvent -LogName AzStackHciEnvironmentChecker -FilterXPath '*[System[(EventID=17205)]]' -MaxEvents 2000 |
-    Where-Object { $_.Message -match 'AzStackHci_Hardware_Test_Secure_Boot' } |
+    Where-Object { $_.Message -match 'AzStackHci_Hardware_(Test_Secure_Boot|SecureBoot)' } |
     Select-Object -First 1 -ExpandProperty Message
 ```
 

From 3ad11da6357adefb6ec1b2add019b46f96e27e31 Mon Sep 17 00:00:00 2001
From: 1008covingtonlane <42551186+1008covingtonlane@users.noreply.github.com>
Date: Fri, 26 Jun 2026 17:47:22 -0400
Subject: [PATCH 5/5] Secure Boot TSG: add 'Before you start' ownership + pre-firmware proof checklist

From the 10-persona usability read, the single highest-leverage change: a
'Before you start' box ahead of 'How to fix it' that (a) routes ownership
(server/hardware admin with BMC does the firmware change; Windows admin
confirms BitLocker; network provides BMC access only; first-line staff
stop), and (b) consolidates the must-confirm proof points before the
firmware reboot (BitLocker recovery key escrowed, machine is UEFI + GPT not
legacy/MBR, and drain first if this is any deployed cluster member).
Resolves the majority of the personas' 'wants improved' (who-should-do-this,
proof-point checklist, boot-mode/GPT check, stop-and-hand-off,
any-deployed-member, hard stop).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
---
 ...oubleshooting-Hardware-Test-Secure-Boot.md | 27 +++++++++++++++++++
 1 file changed, 27 insertions(+)

diff --git a/TSG/EnvironmentValidator/Troubleshooting-Hardware-Test-Secure-Boot.md b/TSG/EnvironmentValidator/Troubleshooting-Hardware-Test-Secure-Boot.md
index fbc96880..fd989bf4 100644
--- a/TSG/EnvironmentValidator/Troubleshooting-Hardware-Test-Secure-Boot.md
+++ b/TSG/EnvironmentValidator/Troubleshooting-Hardware-Test-Secure-Boot.md
@@ -139,6 +139,33 @@ In both sources the result for this check looks like this:
 }
 ```
 
+## Before you start: who does this, and confirm it is safe
+
+- **Who owns this.** Enabling Secure Boot is a firmware change, so it is done by the **server
+  or hardware administrator** with firmware / BMC (iDRAC / iLO / XClarity) access. The
+  **Windows administrator** confirms and suspends BitLocker; the **network team** can provide
+  BMC access but does not own this change. If you are first-line or temporary staff, do not
+  change firmware without the machine's owner.
+- **Confirm all of these before you reboot into firmware** (skipping any one is how machines
+  get stranded):
+  - The **BitLocker recovery key is escrowed** and you can retrieve it (see step 1). A Secure
+    Boot change is measured into TPM PCR 7 and can trip an encrypted volume into the recovery
+    screen.
+  - The machine boots in **UEFI mode with a GPT system disk**, not legacy BIOS / MBR. Switching
+    boot mode alone will not boot an OS that was installed in legacy/MBR mode.
+
+    ```powershell
+    # A GPT boot disk means the OS was installed in UEFI mode.
+    Get-Disk | Where-Object IsBoot | Select-Object Number, PartitionStyle
+    # If Confirm-SecureBootUEFI errors with "Cmdlet not supported on this platform",
+    # the machine is in legacy BIOS / CSM mode, not UEFI.
+    ```
+
+  - If this machine is **already a deployed cluster member** (encrypted or not), the firmware
+    reboot takes a live node down, so drain it first (see [If the machine is already a deployed,
+    encrypted cluster member](#if-the-machine-is-already-a-deployed-encrypted-cluster-member)).
+    One node at a time.
+
 ## How to fix it
 
 This check runs during **pre-deployment validation**, so the machine it flags is normally a