Add Environment Validator TSG: TPM Version (AzStackHci_Hardware_Test_Tpm_Version) #305

Open
1008covingtonlane wants to merge 2 commits into Azure:main from 1008covingtonlane:tsg-hardware-test-tpm-version

Conversation

@1008covingtonlane

Collaborator

What

Adds a customer-facing TSG for the Hardware TPM Version Environment Validator check (AzStackHci_Hardware_Test_Tpm_Version / Test-TpmVersion), which fails when a machine's TPM is present but reports a specification version other than 2.0.

Contents

  • Metadata table + Overview: what the check validates (the first segment of Win32_Tpm.SpecVersion must be 2.0), SUCCESS/FAILURE semantics, and the coverage note that an absent TPM is handled by the companion Test-TpmProperties check, not this one.
  • Where this failure appears: portal Validation results, the single on-box validator (Invoke-AzStackHciHardwareValidation -Include Test-TpmVersion), and the AzStackHciEnvironmentChecker event log (Event ID 17205), with a source-accurate result JSON example.
  • How to fix it: a firmware remediation path that is honest about the platform-specific reality of TPM version changes. A version switch clears the TPM, and depending on the server vendor it can be reversible-with-limits, one-way, or not possible (fixed module). The fix is wrapped with the same safety steps as the Secure Boot TSG: drain a deployed cluster member first, suspend BitLocker (-RebootCount 0) and confirm recovery-key escrow before the measured-boot change, then resume both.
  • Verify / When to escalate / Related: re-run guidance and links to the TPM 2.0 overview, Get-Tpm, Suspend-BitLocker, and the cluster-node maintenance cmdlets.
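
For readers who want to see the pass/fail rule from the Overview bullet in concrete form, here is a minimal sketch. It is not the validator's source: the helper name Test-SpecVersionIs20 is hypothetical, and it assumes SpecVersion is a comma-separated string whose first segment is the specification version (for example "1.2, 2, 3").

```powershell
# Hypothetical helper (not the Environment Checker source): returns $true when
# the first comma-separated segment of a Win32_Tpm.SpecVersion string is "2.0".
function Test-SpecVersionIs20 {
    param(
        [Parameter(Mandatory)]
        [AllowEmptyString()]
        [string]$SpecVersion
    )
    # Take the first segment and drop surrounding whitespace.
    $first = ($SpecVersion -split ',')[0].Trim()
    return $first -eq '2.0'
}

# On the machine being validated (elevated session), usage might look like:
# $tpm = Get-CimInstance -Namespace 'root\cimv2\Security\MicrosoftTpm' -ClassName Win32_Tpm
# if ($null -eq $tpm) { 'No TPM found: covered by Test-TpmProperties, not this check' }
# else { Test-SpecVersionIs20 -SpecVersion $tpm.SpecVersion }
```

The live query is left commented out so the helper can be exercised with sample strings on a machine that has no TPM.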

Validation

  • Strings (check name, Detail format, Description, Severity, Event ID 17205) verified against the Environment Checker source (Test-TpmVersion in AzStackHci.Hardware.Helpers.psm1).
  • All six referenced Microsoft Learn links return 200 and are version-less (none redirect to previous-versions).
  • Added to the EnvironmentValidator README.md index.
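
The BitLocker safety wrapper described under "How to fix it" can be sketched roughly as follows. This is an illustration rather than the TSG's exact text: it assumes the OS volume is C:, an elevated session, and that a RecoveryPassword protector is the one to look for. Finding a protector on the volume does not prove the key is escrowed, so that still needs to be confirmed wherever recovery keys are stored.

```powershell
# Sketch only. Assumptions: OS volume is C:, session is elevated.
$vol = Get-BitLockerVolume -MountPoint 'C:'

if ($vol.ProtectionStatus -eq 'On') {
    # A recovery password protector must exist before the firmware change.
    # This does NOT verify escrow; confirm the key is retrievable separately.
    $recovery = $vol.KeyProtector |
        Where-Object { $_.KeyProtectorType -eq 'RecoveryPassword' }
    if (-not $recovery) {
        throw 'No recovery password protector found on C:. Do not proceed.'
    }

    # -RebootCount 0 keeps protection suspended across reboots until
    # Resume-BitLocker is run explicitly.
    Suspend-BitLocker -MountPoint 'C:' -RebootCount 0
}

# ... change the TPM version in firmware, reboot, confirm with Get-Tpm ...
# Afterwards, resume protection:
# Resume-BitLocker -MountPoint 'C:'
```

Resume is left commented out so that the block cannot re-enable protection before the firmware change has been made and verified.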

…Tpm_Version)

Add a customer-facing TSG for the Hardware TPM Version environment check
(Test-TpmVersion), which fails when a present TPM reports a specification
version other than 2.0. Covers where the failure surfaces (portal validation,
the single on-box validator, and the AzStackHciEnvironmentChecker event log,
Event ID 17205), a source-accurate result example, and a firmware remediation
path that accounts for the platform-specific reality of TPM version changes:
the switch clears the module, can be limited or one-way or impossible by vendor,
and must be paired with suspending BitLocker and draining a deployed cluster
member so the firmware reboot does not strand the node. Also notes the
Test-TpmProperties companion check for TPM presence/enablement, and links the
TPM 2.0, Get-Tpm, BitLocker, and cluster-node-maintenance docs.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot AI left a comment

Contributor

Pull request overview

Adds a new customer-facing troubleshooting guide (TSG) for the Azure Local Environment Validator check TPM Version (AzStackHci_Hardware_Test_Tpm_Version / Test-TpmVersion). This expands the EnvironmentValidator documentation set with guidance on where the failure appears (portal/on-box/event log), the operational risks of TPM version changes, and safe remediation/verification steps.

Changes:

  • Introduces a new TSG covering TPM spec-version validation logic, failure surfaces, and remediation steps (including BitLocker and cluster-drain safety steps).
  • Adds the new TSG to the EnvironmentValidator README.md index.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| TSG/EnvironmentValidator/Troubleshooting-Hardware-Test-Tpm-Version.md | New TSG documenting the TPM Version validator, failure locations, remediation steps, and verification/escalation guidance. |
| TSG/EnvironmentValidator/README.md | Adds a link entry for the new TPM Version troubleshooting guide. |

@1008covingtonlane 1008covingtonlane left a comment

Collaborator Author

Reviewed from the standpoint of a customer (or CSS engineer) who just hit this check and opened this page. The TSG is strong where it counts. It is honest about the platform-specific reality of TPM version changes (reversible-with-limits / one-way / fixed module) instead of glibly saying "enable TPM 2.0". The absent-TPM coverage note correctly redirects to Test-TpmProperties. The TPM-clear / BitLocker-key-invalidation warning is prominent, and the "confirm the recovery key is escrowed first" step is exactly right. All six Learn links resolve (200, version-less), and the step numbering is internally consistent. Three customer-experience improvements:

1. [Sequencing: avoid disruptive prep that may be wasted] Have the customer confirm their model can switch before draining the node or suspending BitLocker.

  • What: The "is a version switch even possible on my hardware?" determination is buried in step 4 (firmware), after step 2 (drain a cluster node, MEDIUM RISK downtime) and step 3 (suspend BitLocker).
  • Why: A customer on a fixed-module or one-way server follows the steps in order, drains a production node and suspends BitLocker, then discovers in firmware that there is no version option. They have taken a node down and risked a recovery-screen lockout for nothing. The scariest, most platform-variable fact gates the work but comes last.
  • How: Add a short "Before you start" ahead of step 2: consult the vendor's TPM documentation to confirm the model supports a 1.2->2.0 switch at all (and whether it is reversible / within any toggle limit). If the model cannot switch, the firmware steps will not help: escalate / engage the vendor instead, and do not drain the node or suspend BitLocker.

2. [Scope: the modal reader is pre-deployment; lead them to the short path] Reconcile the deployed-member vs fresh-machine paths up front, as the Secure Boot TSG does.

  • What: The metadata lists the primary scenario as Deployment / Add Node / Upgrade readiness, so the typical reader is a fresh pre-deployment machine (no cluster, no BitLocker). But "How to fix it" front-loads the deployed-member drain (step 2) and BitLocker suspend (step 3); steps 2, 3, 6, 7 do not apply to that reader.
  • Why: The most common reader wades through a MEDIUM-RISK cluster-drain procedure that does not apply to them. (A deployed member also rarely fails this check: it must have reported TPM 2.0 to deploy, and the version does not change on its own, so the drain path covers an uncommon case.)
  • How: Add an Overview line mirroring the Secure Boot TSG: this is primarily a pre-deployment gate. On a fresh machine the path is short (confirm state -> set TPM 2.0 in firmware -> confirm -> re-validate); the drain + BitLocker steps apply only if the machine is an already-deployed, encrypted cluster member.

3. [String match across surfaces: reinforces the existing Copilot thread] Note the friendly vs verbose display name so a customer matching strings is not thrown.

  • What: The metadata table and portal section call the display name TPM Version, but the result-JSON example shows Title/DisplayName as Test TPM Version / Test TPM Version AzL-Node-01 (the Copilot reviewer flagged the table-vs-JSON inconsistency at line 11).
  • Why: A customer comparing what the portal shows ("TPM Version") against the event-log JSON ("Test TPM Version <node>") may doubt they are on the right check.
  • How: Both forms are real; add a one-line note that the portal shows the aggregated display name TPM Version while the per-node result JSON and event log carry the verbose Test TPM Version <node>, so the two surfaces are expected to differ.

None of these are blocking. Thanks for being straight with customers about the cases where this genuinely cannot be fixed in firmware. That honesty is the most valuable part of the guide.

…play-name finding

Addresses review of PR Azure#305:

- Bot finding (display name): the metadata row said 'TPM Version' while the result
  JSON shows 'Test TPM Version'. Note both forms in the row and add a
  names-across-surfaces callout (portal aggregated name vs the verbose per-machine
  Title in the JSON / event log); the underlying Name is identical.
- Scenario accuracy: this check (AzStackHci_Hardware_Test_Tpm_Version) is emitted only
  by the Hardware validator, whose OperationType is Deployment and Add Node, so the
  machine it flags is a host being validated to become a node, not a deployed member.
  Reframe the Overview and remediation accordingly and correct the Applicable Scenarios
  row (the upgrade-time TPM check is a separately named AzStackHci_Upgrade_* WARNING).
- BitLocker in the primary path: a host being vetted may have been recycled from a prior
  project with BitLocker already enabled, so check for and suspend BitLocker before the
  firmware change regardless of deployment state. The cluster-drain/quorum steps are now
  gated to the uncommon case of a live, deployed cluster member.
- Add a 'Before you start' hardware-capability gate so a customer on a fixed-module or
  one-way platform does not drain a node or suspend BitLocker for a change that turns out
  to be impossible in firmware.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
1008covingtonlane added a commit to 1008covingtonlane/AzureLocal-Supportability that referenced this pull request Jun 26, 2026
…rimary, drain gated)

Apply the same framing the TPM Version TSG (Azure#305) landed and that PR Azure#170 captured
into the harness skill. Test-SecureBoot runs in the Hardware validator (Deployment
and Add Node) and the readiness/bootstrap set, with no upgrade-renamed variant, so
the machine it flags is a host being validated to become a node, not a deployed
member.

- BitLocker check/suspend is now the primary step 1, because a host being vetted may
  have been recycled from a prior project with BitLocker already enabled; a Secure
  Boot change (TPM PCR 7) would trip an encrypted volume into recovery regardless of
  deployment state.
- The cluster-drain/quorum steps move into a gated 'If the machine is already a
  deployed, encrypted cluster member' section instead of being front-loaded as step 1.
- Steps are now: 1 check/suspend BitLocker, 2 enable Secure Boot, 3 confirm, 4 resume
  BitLocker, plus the gated deployed-member drain section.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

2 participants