Add Environment Validator TSG: TPM Version (AzStackHci_Hardware_Test_Tpm_Version) #305
1008covingtonlane wants to merge 2 commits into
…Tpm_Version)
Add a customer-facing TSG for the Hardware TPM Version environment check (Test-TpmVersion), which fails when a present TPM reports a specification version other than 2.0.
Covers where the failure surfaces (portal validation, the single on-box validator, and the AzStackHciEnvironmentChecker event log, Event ID 17205), a source-accurate result example, and a firmware remediation path that accounts for the platform-specific reality of TPM version changes: the switch clears the module, can be limited or one-way or impossible by vendor, and must be paired with suspending BitLocker and draining a deployed cluster member so the firmware reboot does not strand the node.
Also notes the Test-TpmProperties companion check for TPM presence/enablement, and links the TPM 2.0, Get-Tpm, BitLocker, and cluster-node-maintenance docs.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Pull request overview
Adds a new customer-facing troubleshooting guide (TSG) for the Azure Local Environment Validator check TPM Version (AzStackHci_Hardware_Test_Tpm_Version / Test-TpmVersion). This expands the EnvironmentValidator documentation set with guidance on where the failure appears (portal/on-box/event log), the operational risks of TPM version changes, and safe remediation/verification steps.
Changes:
- Introduces a new TSG covering TPM spec-version validation logic, failure surfaces, and remediation steps (including BitLocker and cluster-drain safety steps).
- Adds the new TSG to the EnvironmentValidator README.md index.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| TSG/EnvironmentValidator/Troubleshooting-Hardware-Test-Tpm-Version.md | New TSG documenting the TPM Version validator, failure locations, remediation steps, and verification/escalation guidance. |
| TSG/EnvironmentValidator/README.md | Adds a link entry for the new TPM Version troubleshooting guide. |
1008covingtonlane left a comment
Reviewed from the standpoint of a customer (or CSS engineer) who just hit this check and opened this page. The TSG is strong where it counts: it is honest about the platform-specific reality of TPM version changes (reversible-with-limits / one-way / fixed module) instead of glibly saying "enable TPM 2.0", the absent-TPM coverage note correctly redirects to Test-TpmProperties, the TPM-clear / BitLocker-key-invalidation warning is prominent, the "confirm the recovery key is escrowed first" step is exactly right, all six Learn links resolve (200, version-less), and the step numbering is internally consistent. Three customer-experience improvements:
1. [Sequencing: avoid disruptive prep that may be wasted] Have the customer confirm their model can switch before draining the node or suspending BitLocker.
- What: The "is a version switch even possible on my hardware?" determination is buried in step 4 (firmware), after step 2 (drain a cluster node, MEDIUM RISK downtime) and step 3 (suspend BitLocker).
- Why: A customer on a fixed-module or one-way server follows the steps in order, drains a production node and suspends BitLocker, then discovers in firmware there is no version option, having taken a node down and risked a recovery-screen lockout for nothing. The scariest, most platform-variable fact gates the work but comes last.
- How: Add a short "Before you start" ahead of step 2: consult the vendor's TPM documentation to confirm the model supports a 1.2->2.0 switch at all (and whether it is reversible / within any toggle limit). If it cannot switch, the firmware steps will not help: escalate or engage the vendor instead, and do not drain the node or suspend BitLocker.
2. [Scope: the modal reader is pre-deployment; lead them to the short path] Reconcile the deployed-member vs fresh-machine paths up front, as the Secure Boot TSG does.
- What: The metadata lists the primary scenario as Deployment / Add Node / Upgrade readiness, so the typical reader is a fresh pre-deployment machine (no cluster, no BitLocker). But "How to fix it" front-loads the deployed-member drain (step 2) and BitLocker suspend (step 3); steps 2, 3, 6, 7 do not apply to that reader.
- Why: The most common reader wades through a MEDIUM-RISK cluster-drain procedure that is not theirs. (And a deployed member rarely fails this check: it must have reported TPM 2.0 to deploy, and the version does not change on its own, so the drain path is for an uncommon case.)
- How: Add an Overview line mirroring the Secure Boot TSG: this is primarily a pre-deployment gate; on a fresh machine the path is short (confirm state -> set TPM 2.0 in firmware -> confirm -> re-validate), and the drain + BitLocker steps apply only if the machine is an already-deployed, encrypted cluster member.
3. [String match across surfaces: reinforces the existing Copilot thread] Note the friendly vs verbose display name so a customer matching strings is not thrown.
- What: The metadata table and portal section call the display name TPM Version, but the result-JSON example shows Title/DisplayName as Test TPM Version / Test TPM Version AzL-Node-01 (the Copilot reviewer flagged the table-vs-JSON inconsistency at line 11).
- Why: A customer comparing what the portal shows ("TPM Version") against the event-log JSON ("Test TPM Version <node>") may doubt they are on the right check.
- How: Both forms are real; add a one-line note that the portal shows the aggregated display name TPM Version while the per-node result JSON and event log carry the verbose Test TPM Version <node>, so the two surfaces are expected to differ.
None of these are blocking. Thanks for being straight with customers about the cases where this genuinely cannot be fixed in firmware; that honesty is the most valuable part of the guide.
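The ordering suggested in point 1 (confirm the hardware can switch first, only then touch BitLocker) can be sketched as follows. This is a minimal illustration, not text from the TSG: the function name `Invoke-TpmSwitchPrep` and its `-VendorConfirmsSwitch` parameter are hypothetical, the vendor check itself is a manual step, and recovery-key escrow still has to be verified out of band.

```powershell
# Sketch of the reordered preparation; function and parameter names are hypothetical.
function Invoke-TpmSwitchPrep {
    param(
        # Set only after the vendor documentation confirms a 1.2 -> 2.0 switch is supported.
        [switch]$VendorConfirmsSwitch,
        [string]$MountPoint = $env:SystemDrive
    )

    if (-not $VendorConfirmsSwitch) {
        Write-Warning 'Switch capability not confirmed: do not drain the node or suspend BitLocker. Engage the vendor.'
        return
    }

    $volume = Get-BitLockerVolume -MountPoint $MountPoint
    if ($volume.ProtectionStatus -ne 'On') {
        Write-Output "BitLocker protection is not active on $MountPoint; nothing to suspend."
        return
    }

    # A recovery password protector must exist; that it is escrowed is verified separately.
    $recovery = $volume.KeyProtector | Where-Object { $_.KeyProtectorType -eq 'RecoveryPassword' }
    if (-not $recovery) {
        Write-Warning "No recovery password protector found on $MountPoint; stop and fix this first."
        return
    }

    # -RebootCount 0 keeps protection suspended until Resume-BitLocker is run.
    Suspend-BitLocker -MountPoint $MountPoint -RebootCount 0
}
```

A typical call would be `Invoke-TpmSwitchPrep -VendorConfirmsSwitch`, followed after the firmware change and verification by `Resume-BitLocker -MountPoint $env:SystemDrive`.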
…play-name finding
Addresses review of PR Azure#305:
- Bot finding (display name): the metadata row said 'TPM Version' while the result JSON shows 'Test TPM Version'. Note both forms in the row and add a names-across-surfaces callout (portal aggregated name vs the verbose per-machine Title in the JSON / event log); the underlying Name is identical.
- Scenario accuracy: this check (AzStackHci_Hardware_Test_Tpm_Version) is emitted only by the Hardware validator, whose OperationType is Deployment and Add Node, so the machine it flags is a host being validated to become a node, not a deployed member. Reframe the Overview and remediation accordingly and correct the Applicable Scenarios row (the upgrade-time TPM check is a separately named AzStackHci_Upgrade_* WARNING).
- BitLocker in the primary path: a host being vetted may have been recycled from a prior project with BitLocker already enabled, so check for and suspend BitLocker before the firmware change regardless of deployment state. The cluster-drain/quorum steps are now gated to the uncommon case of a live, deployed cluster member.
- Add a 'Before you start' hardware-capability gate so a customer on a fixed-module or one-way platform does not drain a node or suspend BitLocker for a change that turns out to be impossible in firmware.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…rimary, drain gated)
Apply the same framing the TPM Version TSG (Azure#305) landed and that PR Azure#170 captured into the harness skill. Test-SecureBoot runs in the Hardware validator (Deployment and Add Node) and the readiness/bootstrap set, with no upgrade-renamed variant, so the machine it flags is a host being validated to become a node, not a deployed member.
- BitLocker check/suspend is now the primary step 1, because a host being vetted may have been recycled from a prior project with BitLocker already enabled; a Secure Boot change (TPM PCR 7) would trip an encrypted volume into recovery regardless of deployment state.
- The cluster-drain/quorum steps move into a gated 'If the machine is already a deployed, encrypted cluster member' section instead of being front-loaded as step 1.
- Steps are now: 1 check/suspend BitLocker, 2 enable Secure Boot, 3 confirm, 4 resume BitLocker, plus the gated deployed-member drain section.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
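The gating described in these two commits (drain only when the machine is a live, deployed cluster member) might look roughly like the sketch below. It is an illustration under stated assumptions, not the TSG's text: treating a running `ClusSvc` service as the sign of cluster membership is an assumption made here, and the firmware change itself remains a manual step.

```powershell
# Sketch only: the drain applies solely to a live, deployed cluster member.
# Assumption: a running Cluster service (ClusSvc) indicates cluster membership.
$clusterService = Get-Service -Name 'ClusSvc' -ErrorAction SilentlyContinue

if ($clusterService -and $clusterService.Status -eq 'Running') {
    # Drain roles off this node and wait for the drain to finish
    # before any firmware reboot.
    Suspend-ClusterNode -Name $env:COMPUTERNAME -Drain -Wait

    # ... perform the firmware change and reboot here (manual step) ...

    # Once the node is back and verified, return it to service:
    # Resume-ClusterNode -Name $env:COMPUTERNAME -Failback Immediate
}
else {
    Write-Output 'Cluster service not running: treat this machine as a host being validated and skip the drain.'
}
```

The resume call is left commented out on purpose, since it belongs after the reboot and verification rather than in the same run as the drain.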
What
Adds a customer-facing TSG for the Hardware TPM Version Environment Validator check (AzStackHci_Hardware_Test_Tpm_Version / Test-TpmVersion), which fails when a machine's TPM is present but reports a specification version other than 2.0.
Contents
- … Win32_Tpm.SpecVersion must be 2.0), SUCCESS/FAILURE semantics, and the coverage note that an absent TPM is handled by the companion Test-TpmProperties check, not this one.
- … Invoke-AzStackHciHardwareValidation -Include Test-TpmVersion), and the AzStackHciEnvironmentChecker event log (Event ID 17205), with a source-accurate result JSON example.
- … -RebootCount 0) and confirm recovery-key escrow before the measured-boot change, then resume both.
- … Get-Tpm, Suspend-BitLocker, and the cluster-node maintenance cmdlets.
Validation
- … Test-TpmVersion in AzStackHci.Hardware.Helpers.psm1).
- … previous-versions).
- … README.md index.
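The rule described above (a present TPM must report specification version 2.0) can be approximated by hand on the machine being validated. The following is a minimal sketch, not the validator's actual source. It assumes an elevated PowerShell session, that Win32_Tpm.SpecVersion is a comma-separated string whose first field is the specification version, and that the event log is registered under the name AzStackHciEnvironmentChecker.

```powershell
# Sketch only: approximates the check described in this PR, not the validator's source.
# Assumes an elevated session on the machine being validated.
$tpm = Get-CimInstance -Namespace 'root\cimv2\Security\MicrosoftTpm' `
                       -ClassName 'Win32_Tpm' -ErrorAction SilentlyContinue

if (-not $tpm) {
    # Absent TPM: outside this check's scope; Test-TpmProperties covers it.
    Write-Output 'No TPM instance returned; see the Test-TpmProperties check instead.'
}
else {
    # Assumption: SpecVersion looks like "2.0, 0, 1.38"; the first field is the spec version.
    $specVersion = ($tpm.SpecVersion -split ',')[0].Trim()
    if ($specVersion -eq '2.0') {
        Write-Output "SUCCESS: TPM specification version is $specVersion."
    }
    else {
        Write-Output "FAILURE: TPM specification version is $specVersion, expected 2.0."
    }
}

# Assumption: the log name matches the event log named in this PR.
Get-WinEvent -FilterHashtable @{ LogName = 'AzStackHciEnvironmentChecker'; Id = 17205 } `
             -MaxEvents 5 -ErrorAction SilentlyContinue |
    Select-Object TimeCreated, Id, Message
```

This is only a quick manual cross-check; the supported way to rerun the check is the on-box validator invocation named in the description, Invoke-AzStackHciHardwareValidation -Include Test-TpmVersion.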