Skip to content

Add TSG: Secure Boot (AzStackHci_Hardware_Test_Secure_Boot)#304

Open
1008covingtonlane wants to merge 3 commits into
Azure:mainfrom
1008covingtonlane:tsg-hardware-secure-boot
Open

Add TSG: Secure Boot (AzStackHci_Hardware_Test_Secure_Boot)#304
1008covingtonlane wants to merge 3 commits into
Azure:mainfrom
1008covingtonlane:tsg-hardware-secure-boot

Conversation

@1008covingtonlane

Copy link
Copy Markdown
Collaborator

Summary

Adds a public Environment Validator troubleshooting guide for the Hardware Secure Boot check (AzStackHci_Hardware_Test_Secure_Boot, aggregated AzStackHci_Hardware_SecureBoot). The check runs Test-SecureBoot / Confirm-SecureBootUEFI and is Critical (it blocks deployment until UEFI Secure Boot is enabled on the machine).

What the TSG covers

  • Where the failure appears: the Azure portal Validation step, the on-box single validator (Invoke-AzStackHciHardwareValidation -Include Test-SecureBoot), and the AzStackHciEnvironmentChecker event log (Event ID 17205), with the real failure detail.
  • How to fix it: enable UEFI Secure Boot in firmware (including the UEFI-mode / GPT prerequisite and the standard-keys note), then re-validate.
  • BitLocker precaution: a Secure Boot change is measured into TPM PCR 7, so on a machine with BitLocker enabled the next boot stops at the recovery screen. Azure Local enables data-at-rest encryption by default, so the TSG documents Suspend-BitLocker -RebootCount 0 before the change and Resume-BitLocker after.

Validation

The check was validated end to end on a live Azure Local (masonenode) lab cluster: injecting Secure Boot OFF made the validator return FAILURE, and enabling Secure Boot returned it to SUCCESS.

Indexed in the EnvironmentValidator README.

New Environment Validator troubleshooting guide for the Hardware Secure Boot check
(Test-SecureBoot, which runs Confirm-SecureBootUEFI). Covers where the failure appears
(portal, on-box validator, and the AzStackHciEnvironmentChecker event log), how to enable
UEFI Secure Boot in firmware, and a BitLocker precaution: suspend BitLocker before the
firmware change and resume after, since Azure Local enables data-at-rest encryption by
default and a Secure Boot change is measured into TPM PCR 7. Indexed in the
EnvironmentValidator README.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a new Environment Validator troubleshooting guide (TSG) for the Azure Local Secure Boot hardware validation check (AzStackHci_Hardware_Test_Secure_Boot, aggregated as AzStackHci_Hardware_SecureBoot), documenting where the failure surfaces, how to remediate it in firmware, and the BitLocker precaution/workflow around firmware changes.

Changes:

  • Introduces a new Secure Boot TSG covering portal/on-box symptoms, validation commands, remediation steps, and escalation guidance.
  • Adds the new TSG to the EnvironmentValidator index README.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File Description
TSG/EnvironmentValidator/Troubleshooting-Hardware-Test-Secure-Boot.md New TSG detailing diagnosis/remediation for the Secure Boot validator and BitLocker precautions.
TSG/EnvironmentValidator/README.md Adds an index entry linking to the new Secure Boot TSG.

Comment on lines +114 to +116
Get-WinEvent -LogName AzStackHciEnvironmentChecker -FilterXPath '*[System[(EventID=17205)]]' -MaxEvents 2000 |
Where-Object { $_.Message -match 'AzStackHci_Hardware_Test_Secure_Boot' } |
Select-Object -First 1 -ExpandProperty Message
…yed-node path

Address PR review: the realistic BitLocker case is an already-deployed, encrypted cluster
member, so enabling Secure Boot reboots a live node. Add a drain-first step (Suspend-ClusterNode
-Drain with cluster-health and quorum pre-checks, one node at a time) and a resume-node step
(Resume-ClusterNode, wait for storage resync) around the existing BitLocker suspend/resume flow.
Reconcile the Overview scope line to acknowledge the deployed-node path, and cite Suspend-ClusterNode
and Resume-ClusterNode. The BitLocker suspend/resume content is unchanged.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

@1008covingtonlane 1008covingtonlane left a comment

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-reviewed at 280334a.

The deployed-node path is now covered. Previously the TSG was scoped purely pre-deployment, but the BitLocker section addressed already-encrypted machines, which are deployed cluster members; rebooting one into firmware to change Secure Boot needed cluster-safety steps that were missing.

This revision adds them:

  • Step 1 drains the node first (health pre-checks on Get-ClusterNode / Get-VirtualDisk / Get-StorageJob, then Suspend-ClusterNode -Drain), one node at a time.
  • Step 6 resumes the node and waits for storage resync before the next node.
  • The Overview now reconciles the scope (primarily a pre-deployment gate, with the deployed-member drain path called out).

Verified during this pass:

  • Step renumbering and every cross-reference are consistent (BitLocker resume references step 2, node resume references step 1, "repeat steps 1 through 6").
  • The Suspend-ClusterNode / Resume-ClusterNode reference links resolve, and only the TSG file changed.
  • The BitLocker guidance itself is correct: Suspend-BitLocker -RebootCount 0 holds across the firmware change and reboot, and resume reseals to the new measured-boot (PCR 7) state. The UEFI-mode / GPT prerequisite is handled accurately.

No remaining findings. Ready for maintainer review.

…rimary, drain gated)

Apply the same framing the TPM Version TSG (Azure#305) landed and that PR Azure#170 captured
into the harness skill. Test-SecureBoot runs in the Hardware validator (Deployment
and Add Node) and the readiness/bootstrap set, with no upgrade-renamed variant, so
the machine it flags is a host being validated to become a node, not a deployed
member.

- BitLocker check/suspend is now the primary step 1, because a host being vetted may
  have been recycled from a prior project with BitLocker already enabled; a Secure
  Boot change (TPM PCR 7) would trip an encrypted volume into recovery regardless of
  deployment state.
- The cluster-drain/quorum steps move into a gated 'If the machine is already a
  deployed, encrypted cluster member' section instead of being front-loaded as step 1.
- Steps are now: 1 check/suspend BitLocker, 2 enable Secure Boot, 3 confirm, 4 resume
  BitLocker, plus the gated deployed-member drain section.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants