From 1a9c94411b674e30dc115d59634f11c52f4589aa Mon Sep 17 00:00:00 2001
From: 1008covingtonlane <42551186+1008covingtonlane@users.noreply.github.com>
Date: Fri, 26 Jun 2026 13:16:31 -0400
Subject: [PATCH 1/3] Add Environment Validator TSG: TPM Version (AzStackHci_Hardware_Test_Tpm_Version)

Add a customer-facing TSG for the Hardware TPM Version environment check
(Test-TpmVersion), which fails when a present TPM reports a specification
version other than 2.0. Covers where the failure surfaces (portal
validation, the single on-box validator, and the
AzStackHciEnvironmentChecker event log, Event ID 17205), a source-accurate
result example, and a firmware remediation path that accounts for the
platform-specific reality of TPM version changes: the switch clears the
module, can be limited or one-way or impossible by vendor, and must be
paired with suspending BitLocker and draining a deployed cluster member so
the firmware reboot does not strand the node. Also notes the
Test-TpmProperties companion check for TPM presence/enablement, and links
the TPM 2.0, Get-Tpm, BitLocker, and cluster-node-maintenance docs.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
---
 TSG/EnvironmentValidator/README.md                 |   1 +
 ...oubleshooting-Hardware-Test-Tpm-Version.md      | 329 ++++++++++++++++++
 2 files changed, 330 insertions(+)
 create mode 100644 TSG/EnvironmentValidator/Troubleshooting-Hardware-Test-Tpm-Version.md

diff --git a/TSG/EnvironmentValidator/README.md b/TSG/EnvironmentValidator/README.md
index 11611c19..ca84bf9c 100644
--- a/TSG/EnvironmentValidator/README.md
+++ b/TSG/EnvironmentValidator/README.md
@@ -6,6 +6,7 @@ This folder contains the TSG's related to Environment Validators.
 * [Troubleshooting Test NetAdapter API Failure](./Troubleshooting-Test-NetAdapter-API.md)
 * [Troubleshooting Test PhysicalDisk API Failure](./Troubleshooting-Test-PhysicalDisk-API.md)
 * [Troubleshooting Test System Drive Free Space](./Troubleshooting-Test-SystemDrive-Free-Space.md)
+* [Troubleshooting TPM Version (Hardware Test TPM Version)](./Troubleshooting-Hardware-Test-Tpm-Version.md)
 * [Troubleshooting TestPowerShell Module Version](./Troubleshooting-Test-PowerShell-Module-Version.md)
 * [Troubleshooting Module Versions](Troubleshooting-Module-Versions.md)
 * [Troubleshooting MSI Does Not Have Access to Subscription](Troubleshooting-MSI-Does-Not-Have-Access-To-Subscription.md)
diff --git a/TSG/EnvironmentValidator/Troubleshooting-Hardware-Test-Tpm-Version.md b/TSG/EnvironmentValidator/Troubleshooting-Hardware-Test-Tpm-Version.md
new file mode 100644
index 00000000..02f62afd
--- /dev/null
+++ b/TSG/EnvironmentValidator/Troubleshooting-Hardware-Test-Tpm-Version.md
@@ -0,0 +1,329 @@
+# AzStackHci_Hardware_Test_Tpm_Version
+
+| Field | Value |
+| --- | --- |
+| Name | `AzStackHci_Hardware_Test_Tpm_Version` (aggregated as `AzStackHci_Hardware_TpmVersion`) |
+| Display name | TPM Version |
+| Validator / test | `Test-TpmVersion` (run with `Invoke-AzStackHciHardwareValidation`) |
+| Component | Hardware (Environment Validator / Environment Checker) |
+| Severity | Critical: this validator blocks deployment until the machine's TPM reports specification version 2.0. |
+| Requirement | Each machine must have a TPM that reports specification version 2.0 (TPM 2.0) before deployment. |
+| Applicable Scenarios | Deployment, Add Node, and Upgrade (pre-deployment / readiness validation). |
+| Affected Versions | Azure Local, version 23H2 and later. |
+
+## Overview
+
+This validator checks that each Azure Local machine has a **Trusted Platform Module (TPM)
+that reports specification version 2.0**. TPM 2.0 is part of the Azure Local hardware
+security baseline: it is the hardware root of trust that backs measured boot, BitLocker
+key protection, and the platform's attestation and secured-core features. The check fails
+when a TPM is present but reports a specification version other than 2.0 (for example a
+module in TPM 1.2 mode).
+
+It runs by reading the `Win32_Tpm` instance from each machine
+(`Get-CimInstance -Namespace root/cimv2/Security/MicrosoftTpm -ClassName Win32_Tpm`) and
+comparing the **first segment of the reported `SpecVersion`** to `2.0`. A machine whose
+TPM reports `2.0` is a **SUCCESS**; a machine whose TPM reports a different version (such
+as `1.2`) is a **FAILURE**.
+
+> **Important coverage note.** This check evaluates the TPM **version** only. If no TPM is
+> present at all, `Win32_Tpm` returns nothing and this specific check does not raise a
+> failure. Whether the TPM is **present and enabled** is covered by the companion check
+> `AzStackHci_Hardware_TpmProperties` (`Test-TpmProperties`), which fails when a TPM is
+> missing or disabled. If you are investigating a TPM problem, check both.
+
+While this check is failing, deployment is blocked at the Hardware validation stage and
+the machine cannot proceed. Unlike a software setting, the fix is a **firmware and
+hardware** change that is specific to your server model, and on some platforms it is
+limited, irreversible, or not possible at all (see [How to fix it](#how-to-fix-it)).
+
+## Where this failure appears
+
+You can see this failure in two places, the Azure portal and the machine itself. Both
+show the same underlying result.
+
+### In the Azure portal
+
+This check runs during the deployment validation step. When you deploy Azure Local from
+the portal (or with a deployment template), the **Validation** phase runs the
+environment checks and lists any that fail:
+
+1. Open the Azure Local deployment for your cluster and go to its **Validation**
+   results (the deployment surfaces these before it proceeds to apply).
+2. In the list of checks, this one appears under its display name, **TPM Version**, with
+   a **Critical** severity.
+3. Select the failing check to see the per-machine detail, which names the machine whose
+   TPM is not reporting version 2.0.
+
+### On the machine
+
+Two on-box sources carry the result.
+
+**Run the single validator (fastest).** The Environment Checker module ships on every
+Azure Local machine, so you can run this one Hardware check directly and read the result
+in a few seconds. Use `-Include Test-TpmVersion` to run only this check, so you do not
+have to run the full Hardware validation suite:
+
+```powershell
+$r = Invoke-AzStackHciHardwareValidation -Include Test-TpmVersion -PassThru
+$r | Select-Object Name, Status, Severity
+$r.AdditionalData.Detail
+```
+
+You can also read the underlying values directly:
+
+```powershell
+# Presence / enabled state (covered by Test-TpmProperties, shown here for context).
+Get-Tpm | Select-Object TpmPresent, TpmReady, TpmEnabled
+
+# The version this check evaluates. It compares the FIRST comma-separated segment
+# of SpecVersion to '2.0' (for example "2.0, 0, 1.59" passes; "1.2, ..." fails).
+(Get-CimInstance -Namespace 'root/cimv2/Security/MicrosoftTpm' -ClassName Win32_Tpm).SpecVersion
+```
+
+A machine whose TPM reports a non-2.0 version returns `Status` of `FAILURE` and a detail
+line of the form:
+
+```
+Machine: AzL-Node-01, Class: Tpm, Manufacturer ID: 1314145024 Tpm version is 1.2. Expected 2.0
+```
+
+**Event log (per machine).** The Environment Checker writes every check result to the
+**AzStackHciEnvironmentChecker** event log, located at
+`C:\Windows\System32\winevt\Logs\AzStackHciEnvironmentChecker.evtx`. Each result is the
+JSON body of an **Event ID 17205** entry. To read this check's most recent result on a
+machine:
+
+```powershell
+Get-WinEvent -LogName AzStackHciEnvironmentChecker -FilterXPath '*[System[(EventID=17205)]]' -MaxEvents 2000 |
+    Where-Object { $_.Message -match 'AzStackHci_Hardware_Test_Tpm_Version' } |
+    Select-Object -First 1 -ExpandProperty Message
+```
+
+In both sources the result for this check looks like this:
+
+```json
+{
+  "Name": "AzStackHci_Hardware_Test_Tpm_Version",
+  "Title": "Test TPM Version",
+  "DisplayName": "Test TPM Version AzL-Node-01",
+  "Severity": "Critical",
+  "Status": "FAILURE",
+  "Description": "Checking TPM for desired version (2.0)",
+  "TargetResourceName": "Machine: AzL-Node-01, Class: Tpm, Manufacturer ID: 1314145024",
+  "Remediation": "https://aka.ms/hci-envch",
+  "AdditionalData": {
+    "Source": "Version",
+    "Resource": "1.2",
+    "Detail": "Machine: AzL-Node-01, Class: Tpm, Manufacturer ID: 1314145024 Tpm version is 1.2. Expected 2.0",
+    "Status": "FAILURE"
+  }
+}
+```
+
+## How to fix it
+
+The TPM specification version is a firmware and hardware property, so the fix is made in
+the machine's firmware setup (or with the vendor's management tooling), not from Windows.
+**Before you change anything, read the warnings in this section.** Unlike most validator
+fixes, changing a TPM's version is platform-specific and has serious side effects:
+
+- **Switching a TPM clears it.** Moving a TPM between specification versions (for example
+  1.2 to 2.0) re-provisions the module and **erases the keys it holds**. Any key sealed to
+  that TPM, including a **BitLocker** key protector, is invalidated by the change.
+- **It is vendor-specific and may be limited or impossible.** Some platforms allow a
+  reversible firmware switch (sometimes with a documented limit on how many times it can be
+  done), some allow only a one-way move, and some ship a **fixed module that cannot be
+  switched at all** and would have to be replaced. Consult your hardware vendor's TPM
+  documentation for your exact model before proceeding.
+
+The high-level order is: if the machine is an already-deployed cluster member, drain it
+first; if it has BitLocker on, suspend BitLocker and confirm the recovery key is escrowed;
+enable the TPM and set it to TPM 2.0 in firmware per your vendor's documentation; confirm;
+resume BitLocker; and resume the node. Then re-run the check.
+
+### 1. Confirm the current TPM state
+
+Establish what the machine actually reports before you touch firmware:
+
+```powershell
+Get-Tpm | Select-Object TpmPresent, TpmReady, TpmEnabled
+(Get-CimInstance -Namespace 'root/cimv2/Security/MicrosoftTpm' -ClassName Win32_Tpm).SpecVersion
+```
+
+- If `TpmPresent` is `False`, the machine has no usable TPM. This version check will not
+  fail (it only evaluates a present TPM), but `Test-TpmProperties` will, and the machine
+  is not deployable without a TPM. This is a hardware action, not a firmware setting.
+- If a TPM is present but `SpecVersion` starts with something other than `2.0`, continue
+  below.
+
+### 2. If the machine is an already-deployed cluster member, drain it first
+
+If this machine has BitLocker on, it has almost certainly already been deployed into a
+cluster (Azure Local turns on encryption during deployment). Changing the TPM requires a
+reboot into firmware, which takes this node down, so drain it first and do **one node at a
+time**. This is a [MEDIUM RISK] change: draining live-migrates VMs off the node, and the
+node is unavailable until you resume it.
+
+```powershell
+# Confirm the cluster is healthy and can lose this one node before you start.
+Get-ClusterNode | Select-Object Name, State  # every other node should be Up
+Get-VirtualDisk | Select-Object FriendlyName, HealthStatus, OperationalStatus  # all Healthy / OK
+Get-StorageJob  # should be empty (no active repair/resync)
+```
+
+Only continue when every other node is `Up`, all virtual disks are Healthy, and
+`Get-StorageJob` returns nothing. Then pause and drain this node so its VMs live-migrate
+off:
+
+```powershell
+Suspend-ClusterNode -Name <NodeName> -Drain
+# Confirm the node is Paused and its roles have moved before you reboot it.
+Get-ClusterNode -Name <NodeName> | Select-Object Name, State  # State should be Paused
+```
+
+### 3. If the machine has BitLocker enabled, suspend it first
+
+Changing the TPM alters (and in the version-switch case **clears**) the hardware root of
+trust that BitLocker seals its key to. On a machine where **BitLocker is enabled, the next
+boot after the change will stop at the BitLocker recovery screen** and ask for the 48-digit
+recovery password, which can strand the machine. Azure Local enables data-at-rest
+encryption (BitLocker) by default, so a machine that has already been deployed (or any
+machine with drive encryption) is affected. A fresh, pre-deployment machine that has never
+been encrypted is not.
+
+If BitLocker is on, suspend it **before** you change firmware, and resume it **after** the
+machine is back and the TPM is confirmed. Use `-RebootCount 0` so the suspend holds across
+the firmware change and reboot until you explicitly resume it:
+
+```powershell
+# Are any volumes protected?
+Get-BitLockerVolume | Select-Object MountPoint, ProtectionStatus, VolumeStatus
+
+# Suspend each protected volume indefinitely (until you resume it).
+Suspend-BitLocker -MountPoint "C:" -RebootCount 0
+# Repeat for any data volumes that report ProtectionStatus = On, for example:
+# Suspend-BitLocker -MountPoint "C:\ClusterStorage\Volume1" -RebootCount 0
+```
+
+You will resume BitLocker in step 6, after the TPM is confirmed. **Confirm the recovery
+key is available (escrowed) before you start**, because a TPM version switch clears the
+module and the machine must be recoverable with the recovery key if anything is
+interrupted.
+
+### 4. Enable the TPM and set it to TPM 2.0 in firmware
+
+1. Reboot the machine and enter firmware setup (the key varies by vendor, commonly `F2`,
+   `F10`, `Del`, or via the BMC / iDRAC / iLO / XClarity remote console).
+2. Locate the **TPM** (sometimes shown as "Security Device", "Trusted Computing", or
+   "PTT/Intel Platform Trust Technology" / "AMD fTPM") settings.
+3. Make sure the TPM is **enabled** and visible to the operating system.
+4. If the platform supports selecting the TPM specification version and the TPM is in 1.2
+   mode, set it to **2.0** (often labelled "TPM Device Version", "TCG Spec Version", or
+   similar), following your vendor's documented procedure. **Heed the warnings in this
+   section first**: the switch clears the TPM and may be limited or one-way on your model.
+5. Save and exit, and let the machine boot back into the OS.
+
+The exact menu names and the availability of a version switch are vendor-specific. If your
+platform's TPM is a fixed module that cannot report 2.0, it cannot be remediated in
+firmware and the module (or machine) must be brought to spec by your hardware vendor;
+confirm the machine is on the Azure Local supported hardware list.
+
+### 5. Confirm the TPM now reports version 2.0
+
+```powershell
+Get-Tpm | Select-Object TpmPresent, TpmReady, TpmEnabled
+(Get-CimInstance -Namespace 'root/cimv2/Security/MicrosoftTpm' -ClassName Win32_Tpm).SpecVersion
+```
+
+The first segment of `SpecVersion` should now be `2.0`.
+
+### 6. Resume BitLocker (only if you suspended it in step 3)
+
+```powershell
+Resume-BitLocker -MountPoint "C:"
+# And any data volumes you suspended, for example:
+# Resume-BitLocker -MountPoint "C:\ClusterStorage\Volume1"
+```
+
+Resuming reseals the BitLocker key to the current TPM. If the TPM was cleared by the
+version switch, make sure the volume re-protects cleanly and a fresh recovery key is
+escrowed.
+
+### 7. Resume the cluster node (only if you drained it in step 2)
+
+Bring the node back into the cluster and let storage resync before you touch the next node.
+
+```powershell
+Resume-ClusterNode -Name <NodeName>
+# Wait for resync to finish before moving on; do not drain the next node until this clears.
+Get-StorageJob  # wait until empty
+Get-VirtualDisk | Select-Object FriendlyName, HealthStatus  # back to Healthy
+```
+
+Repeat steps 1 through 7 for each remaining machine, one node at a time, so the cluster
+always keeps quorum and storage resiliency.
+
+## Verify the fix
+
+Re-run the single validator:
+
+```powershell
+$r = Invoke-AzStackHciHardwareValidation -Include Test-TpmVersion -PassThru
+$r | Select-Object Name, Status, Severity
+$r.AdditionalData.Detail
+```
+
+A machine whose TPM reports version 2.0 returns `Status` of `SUCCESS`. Once every machine
+you are deploying reports success, re-run the deployment validation; the **TPM Version**
+check should now pass and deployment can proceed.
+
+## When to escalate
+
+Open a support case if any of the following are true:
+
+- `Win32_Tpm` reports `SpecVersion` starting with `2.0`, but the **TPM Version** check
+  still fails during deployment validation.
+- The firmware has no option to enable a TPM or to select TPM 2.0, or the platform's TPM
+  is a fixed module that cannot report 2.0. TPM 2.0 is an Azure Local hardware requirement,
+  so confirm the machine is on the Azure Local supported hardware list, and engage your
+  hardware vendor if the module must be replaced.
+- The machine has no TPM at all (`TpmPresent` is `False`); this is a hardware requirement
+  that cannot be satisfied in firmware.
+- The machine stops at the BitLocker recovery screen after the change and the recovery key
+  is not available.
+
+## Related
+
+- General Environment Checker remediation link shown in the validator output:
+  https://aka.ms/hci-envch
+- [Azure Local security features and baseline](https://learn.microsoft.com/azure/azure-local/concepts/security-features)
+- [Trusted Platform Module (TPM 2.0) overview](https://learn.microsoft.com/windows/security/hardware-security/tpm/trusted-platform-module-overview)
+- [Get-Tpm](https://learn.microsoft.com/powershell/module/trustedplatformmodule/get-tpm)
+- [Suspend-BitLocker before firmware changes](https://learn.microsoft.com/powershell/module/bitlocker/suspend-bitlocker)
+- [Suspend-ClusterNode (pause and drain a node)](https://learn.microsoft.com/powershell/module/failoverclusters/suspend-clusternode)
+- [Resume-ClusterNode](https://learn.microsoft.com/powershell/module/failoverclusters/resume-clusternode)

From 08cb5df8aeb33573a0047b5a75333dc801e29ac1 Mon Sep 17 00:00:00 2001
From: 1008covingtonlane <42551186+1008covingtonlane@users.noreply.github.com>
Date: Fri, 26 Jun 2026 13:47:55 -0400
Subject: [PATCH 2/3] TPM Version TSG: reframe for the pre-deployment scenario; fix bot display-name finding

Addresses review of PR #305:

- Bot finding (display name): the metadata row said 'TPM Version' while the
  result JSON shows 'Test TPM Version'. Note both forms in the row and add a
  names-across-surfaces callout (portal aggregated name vs the verbose
  per-machine Title in the JSON / event log); the underlying Name is
  identical.
- Scenario accuracy: this check (AzStackHci_Hardware_Test_Tpm_Version) is
  emitted only by the Hardware validator, whose OperationType is Deployment
  and Add Node, so the machine it flags is a host being validated to become
  a node, not a deployed member. Reframe the Overview and remediation
  accordingly and correct the Applicable Scenarios row (the upgrade-time TPM
  check is a separately named AzStackHci_Upgrade_* WARNING).
- BitLocker in the primary path: a host being vetted may have been recycled
  from a prior project with BitLocker already enabled, so check for and
  suspend BitLocker before the firmware change regardless of deployment
  state. The cluster-drain/quorum steps are now gated to the uncommon case
  of a live, deployed cluster member.
- Add a 'Before you start' hardware-capability gate so a customer on a
  fixed-module or one-way platform does not drain a node or suspend
  BitLocker for a change that turns out to be impossible in firmware.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
---
 ...oubleshooting-Hardware-Test-Tpm-Version.md      | 161 ++++++++++--------
 1 file changed, 91 insertions(+), 70 deletions(-)

diff --git a/TSG/EnvironmentValidator/Troubleshooting-Hardware-Test-Tpm-Version.md b/TSG/EnvironmentValidator/Troubleshooting-Hardware-Test-Tpm-Version.md
index 02f62afd..6e2f83be 100644
--- a/TSG/EnvironmentValidator/Troubleshooting-Hardware-Test-Tpm-Version.md
+++ b/TSG/EnvironmentValidator/Troubleshooting-Hardware-Test-Tpm-Version.md
@@ -7,7 +7,7 @@
 Display name
- TPM Version
+ TPM Version (the aggregated name shown in the portal). The per-machine result JSON and event log carry the verbose form Test TPM Version <machine>; both refer to this same check.
 Validator / test
@@ -27,7 +27,7 @@
 Applicable Scenarios
- Deployment, Add Node, and Upgrade (pre-deployment / readiness validation).
+ Deployment and Add Node (pre-deployment readiness validation).
 Affected Versions
@@ -61,6 +61,13 @@ the machine cannot proceed. Unlike a software setting, the fix is a **firmware a
 hardware** change that is specific to your server model, and on some platforms it is
 limited, irreversible, or not possible at all (see [How to fix it](#how-to-fix-it)).
 
+This check runs during **pre-deployment validation** (the Deployment and Add Node readiness
+checks), so the machine it evaluates is normally a **host being validated to become a cluster
+node**, not an existing cluster member. The remediation is usually short (set the TPM to 2.0
+in firmware and re-validate), but two cautions apply: a host being vetted may have been
+**recycled from another project and could already have BitLocker enabled**, and the
+cluster-drain precaution is needed only if the machine is already a live, deployed member.
+
 ## Where this failure appears
 
 You can see this failure in two places, the Azure portal and the machine itself. Both
@@ -145,12 +152,25 @@ In both sources the result for this check looks like this:
 }
 ```
 
+> **A note on names across surfaces.** The portal shows the aggregated display name
+> **TPM Version**, while the per-machine result JSON and the event log carry the verbose form
+> **Test TPM Version `<machine>`**. The underlying `Name`,
+> `AzStackHci_Hardware_Test_Tpm_Version`, is the same on both, so if you are matching strings
+> between the portal and the on-box output, expect the two forms.
+
 ## How to fix it
 
-The TPM specification version is a firmware and hardware property, so the fix is made in
+This check runs during **pre-deployment validation**, so the machine it flags is normally a
+**host being prepared to become a cluster node**, not a running cluster member: there is
+usually no cluster to keep in quorum. Do not assume the host is otherwise "clean", though.
+A host being vetted may have been **recycled from another project and could already have
+BitLocker enabled**, and a TPM change trips an encrypted volume into recovery, so check for
+BitLocker before you touch firmware (step 2). The cluster-drain precaution only applies in
+the uncommon case that the machine is already a live, deployed cluster member.
+
+The TPM specification version is a firmware and hardware property, so the change is made in
 the machine's firmware setup (or with the vendor's management tooling), not from Windows.
-**Before you change anything, read the warnings in this section.** Unlike most validator
-fixes, changing a TPM's version is platform-specific and has serious side effects:
+**Before you change anything, read these two warnings:**
 
 - **Switching a TPM clears it.** Moving a TPM between specification versions (for example
   1.2 to 2.0) re-provisions the module and **erases the keys it holds**. Any key sealed to
@@ -161,10 +181,18 @@ fixes, changing a TPM's version is platform-specific and has serious side effect
   switched at all** and would have to be replaced. Consult your hardware vendor's TPM
   documentation for your exact model before proceeding.
 
-The high-level order is: if the machine is an already-deployed cluster member, drain it
-first; if it has BitLocker on, suspend BitLocker and confirm the recovery key is escrowed;
-enable the TPM and set it to TPM 2.0 in firmware per your vendor's documentation; confirm;
-resume BitLocker; and resume the node. Then re-run the check.
+### Before you start: confirm a TPM 2.0 switch is possible on your hardware
+
+The single most platform-variable fact is whether your model can switch to TPM 2.0 at all,
+so settle it first. Read the current state (step 1 below, non-disruptive), then consult your
+hardware vendor's TPM documentation for your exact model to confirm whether the TPM can be
+switched from 1.2 to 2.0, and if so whether the switch is reversible or subject to a toggle
+limit.
+
+If the machine has **no TPM**, or its TPM is a **fixed module that cannot report 2.0**, the
+firmware steps below will not help. Confirm the machine is on the Azure Local supported
+hardware list and engage your hardware vendor (the module or machine has to be brought to
+spec). Do not start any disruptive change until you have confirmed the switch is possible.
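Editorial sketch (not part of the patch): the non-disruptive read described above can be reduced to a small classification helper. This is a minimal illustration that assumes the comparison works the way the TSG describes it (first comma-separated segment of `SpecVersion` compared to `2.0`). The function name `Get-TpmVersionVerdict`, the `NoTpmReported` label, the whitespace trimming, and the sample value `'1.2, 2, 3'` are all hypothetical and are not taken from the Environment Checker module.

```powershell
# Hypothetical helper for illustration only; not part of the Environment Checker module.
function Get-TpmVersionVerdict {
    param(
        # Raw SpecVersion string as reported by Win32_Tpm, for example "2.0, 0, 1.59".
        [string]$SpecVersion
    )

    # An empty value stands in for "Win32_Tpm returned nothing". The TSG says the
    # version check does not raise a failure in that case, so it is labelled separately.
    if ([string]::IsNullOrWhiteSpace($SpecVersion)) {
        return 'NoTpmReported'
    }

    # Take the first comma-separated segment and compare it to '2.0'.
    # Trimming whitespace is an assumption made for this sketch.
    $firstSegment = ($SpecVersion -split ',')[0].Trim()
    if ($firstSegment -eq '2.0') { return 'SUCCESS' } else { return 'FAILURE' }
}

Get-TpmVersionVerdict -SpecVersion '2.0, 0, 1.59'   # SUCCESS
Get-TpmVersionVerdict -SpecVersion '1.2, 2, 3'      # FAILURE (made-up sample value)
Get-TpmVersionVerdict -SpecVersion ''               # NoTpmReported
```

On a real machine the input would presumably come from the `Get-CimInstance` call shown in step 1. A `NoTpmReported` result points at the companion `Test-TpmProperties` check rather than at this one.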
 
 ### 1. Confirm the current TPM state
 
@@ -181,61 +209,34 @@ Get-Tpm | Select-Object TpmPresent, TpmReady, TpmEnabled
 - If a TPM is present but `SpecVersion` starts with something other than `2.0`, continue
   below.
 
-### 2. If the machine is an already-deployed cluster member, drain it first
-
-If this machine has BitLocker on, it has almost certainly already been deployed into a
-cluster (Azure Local turns on encryption during deployment). Changing the TPM requires a
-reboot into firmware, which takes this node down, so drain it first and do **one node at a
-time**. This is a [MEDIUM RISK] change: draining live-migrates VMs off the node, and the
-node is unavailable until you resume it.
-
-```powershell
-# Confirm the cluster is healthy and can lose this one node before you start.
-Get-ClusterNode | Select-Object Name, State  # every other node should be Up
-Get-VirtualDisk | Select-Object FriendlyName, HealthStatus, OperationalStatus  # all Healthy / OK
-Get-StorageJob  # should be empty (no active repair/resync)
-```
+### 2. Check for BitLocker, and suspend it if present
 
-Only continue when every other node is `Up`, all virtual disks are Healthy, and
-`Get-StorageJob` returns nothing. Then pause and drain this node so its VMs live-migrate
-off:
+Do this even on a fresh pre-deployment host. A host you are vetting may have been **recycled
+from a previous project with BitLocker already enabled**, and a TPM version change clears the
+module, which invalidates the TPM-sealed BitLocker key. If a protected volume is left armed,
+**the next boot after the change stops at the BitLocker recovery screen** and asks for the
+48-digit recovery password, which can strand the machine.
 
 ```powershell
-Suspend-ClusterNode -Name <NodeName> -Drain
-# Confirm the node is Paused and its roles have moved before you reboot it.
-Get-ClusterNode -Name <NodeName> | Select-Object Name, State  # State should be Paused
+# Are any volumes protected? (On a truly clean, never-encrypted host this is empty.)
+Get-BitLockerVolume | Select-Object MountPoint, ProtectionStatus, VolumeStatus
 ```
 
-### 3. If the machine has BitLocker enabled, suspend it first
-
-Changing the TPM alters (and in the version-switch case **clears**) the hardware root of
-trust that BitLocker seals its key to. On a machine where **BitLocker is enabled, the next
-boot after the change will stop at the BitLocker recovery screen** and ask for the 48-digit
-recovery password, which can strand the machine. Azure Local enables data-at-rest
-encryption (BitLocker) by default, so a machine that has already been deployed (or any
-machine with drive encryption) is affected. A fresh, pre-deployment machine that has never
-been encrypted is not.
-
-If BitLocker is on, suspend it **before** you change firmware, and resume it **after** the
-machine is back and the TPM is confirmed. Use `-RebootCount 0` so the suspend holds across
-the firmware change and reboot until you explicitly resume it:
+If every volume reports `ProtectionStatus = Off`, there is nothing to suspend; go to step 3.
+If any volume is protected, **confirm its recovery key is escrowed first**, then suspend it
+with `-RebootCount 0` so the suspend holds across the firmware change and reboot until you
+explicitly resume it:
 
 ```powershell
-# Are any volumes protected?
-Get-BitLockerVolume | Select-Object MountPoint, ProtectionStatus, VolumeStatus
-
-# Suspend each protected volume indefinitely (until you resume it).
 Suspend-BitLocker -MountPoint "C:" -RebootCount 0
-# Repeat for any data volumes that report ProtectionStatus = On, for example:
-# Suspend-BitLocker -MountPoint "C:\ClusterStorage\Volume1" -RebootCount 0
+# Repeat for any data volume that reports ProtectionStatus = On, for example:
+# Suspend-BitLocker -MountPoint "D:" -RebootCount 0
 ```
 
-You will resume BitLocker in step 6, after the TPM is confirmed. **Confirm the recovery
-key is available (escrowed) before you start**, because a TPM version switch clears the
-module and the machine must be recoverable with the recovery key if anything is
-interrupted.
+### 3. Enable the TPM and set it to TPM 2.0 in firmware
 
-### 4. Enable the TPM and set it to TPM 2.0 in firmware
+> If this machine is already a deployed, encrypted cluster member, do **not** reboot it into
+> firmware yet. Follow [If the machine is already a deployed cluster member](#if-the-machine-is-already-a-deployed-encrypted-cluster-member) first so you take the node down safely.
 
 1. Reboot the machine and enter firmware setup (the key varies by vendor, commonly `F2`,
    `F10`, `Del`, or via the BMC / iDRAC / iLO / XClarity remote console).
@@ -244,16 +245,16 @@ interrupted.
 3. Make sure the TPM is **enabled** and visible to the operating system.
 4. If the platform supports selecting the TPM specification version and the TPM is in 1.2
    mode, set it to **2.0** (often labelled "TPM Device Version", "TCG Spec Version", or
-   similar), following your vendor's documented procedure. **Heed the warnings in this
-   section first**: the switch clears the TPM and may be limited or one-way on your model.
+   similar), following your vendor's documented procedure. **Heed the warnings above**: the
+   switch clears the TPM and may be limited or one-way on your model.
 5. Save and exit, and let the machine boot back into the OS.
 
 The exact menu names and the availability of a version switch are vendor-specific. If your
-platform's TPM is a fixed module that cannot report 2.0, it cannot be remediated in
-firmware and the module (or machine) must be brought to spec by your hardware vendor;
-confirm the machine is on the Azure Local supported hardware list.
+platform's TPM is a fixed module that cannot report 2.0, it cannot be remediated in firmware
+and the module (or machine) must be brought to spec by your hardware vendor; confirm the
+machine is on the Azure Local supported hardware list.
 
-### 5. Confirm the TPM now reports version 2.0
+### 4. Confirm the TPM now reports version 2.0
 
 ```powershell
 Get-Tpm | Select-Object TpmPresent, TpmReady, TpmEnabled
@@ -262,31 +263,51 @@ Get-Tpm | Select-Object TpmPresent, TpmReady, TpmEnabled
 
 The first segment of `SpecVersion` should now be `2.0`.
 
-### 6. Resume BitLocker (only if you suspended it in step 3)
+### 5. Resume BitLocker (only if you suspended it in step 2)
 
 ```powershell
 Resume-BitLocker -MountPoint "C:"
-# And any data volumes you suspended, for example:
-# Resume-BitLocker -MountPoint "C:\ClusterStorage\Volume1"
+# And any data volume you suspended, for example:
+# Resume-BitLocker -MountPoint "D:"
 ```
 
-Resuming reseals the BitLocker key to the current TPM. If the TPM was cleared by the
-version switch, make sure the volume re-protects cleanly and a fresh recovery key is
-escrowed.
+Resuming reseals the BitLocker key to the new TPM. Because the version switch cleared the
+module, make sure each volume re-protects cleanly and a fresh recovery key is escrowed.
+
+### If the machine is already a deployed, encrypted cluster member
+
+Because this is a pre-deployment check, it does not normally fire on a machine that is
+already a deployed cluster node: the machine must have reported TPM 2.0 to deploy, and the
+version does not change on its own. But if you are changing the TPM on a machine that is
+**already a live, encrypted cluster member** for any reason, add one precaution to the steps
+above: the firmware reboot takes a running node down, so **drain it first** and do this
+**one node at a time**.
+
+This is a [MEDIUM RISK] change: draining live-migrates VMs off the node, and the node is
+unavailable until you resume it.
+
+```powershell
+# Confirm the cluster is healthy and can lose this one node before you start.
+Get-ClusterNode | Select-Object Name, State # every other node should be Up
+Get-VirtualDisk | Select-Object FriendlyName, HealthStatus, OperationalStatus # all Healthy / OK
+Get-StorageJob # should be empty (no active repair/resync)
 
-### 7. Resume the cluster node (only if you drained it in step 2)
+# Only when the cluster is healthy, pause and drain this node so its VMs live-migrate off.
+Suspend-ClusterNode -Name "<NodeName>" -Drain
+Get-ClusterNode -Name "<NodeName>" | Select-Object Name, State # State should be Paused
+```
 
-Bring the node back into the cluster and let storage resync before you touch the next node.
+Then run steps 2 through 5 above (suspend BitLocker, change firmware, confirm, resume
+BitLocker). Finally bring the node back and let storage resync before the next one:
 
 ```powershell
 Resume-ClusterNode -Name "<NodeName>"
-# Wait for resync to finish before moving on; do not drain the next node until this clears.
 Get-StorageJob # wait until empty
 Get-VirtualDisk | Select-Object FriendlyName, HealthStatus # back to Healthy
 ```
 
-Repeat steps 1 through 7 for each remaining machine, one node at a time, so the cluster
-always keeps quorum and storage resiliency.
+Repeat for each remaining member, one node at a time, so the cluster always keeps quorum and
+storage resiliency.
 
 ## Verify the fix
 

From 88c6700d5828cc7d0e244a6374c3bac11818ddd6 Mon Sep 17 00:00:00 2001
From: 1008covingtonlane <42551186+1008covingtonlane@users.noreply.github.com>
Date: Fri, 26 Jun 2026 17:48:40 -0400
Subject: [PATCH 3/3] TPM Version TSG: turn 'Before you start' into a capability + ownership decision matrix

From the 10-persona usability read, this was the single highest-leverage change (it resolved nearly every persona's 'wants improved').
Replace the prose 'Before you start' with a decision table keyed on what the hardware/vendor reports (already 2.0 / switchable 1.2 / one-way or limited / fixed module / no TPM), each row naming what it means, WHO owns the action (server or firmware admin, hardware vendor/OEM, procurement), and what to do. Keep the explicit STOP gate (model supported + BitLocker key escrowed + drained if a deployed member) and add an expectation-setting note about hardware lead time for fixed-module/unsupported cases.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
---
 ...oubleshooting-Hardware-Test-Tpm-Version.md | 35 ++++++++++++-------
 1 file changed, 23 insertions(+), 12 deletions(-)

diff --git a/TSG/EnvironmentValidator/Troubleshooting-Hardware-Test-Tpm-Version.md b/TSG/EnvironmentValidator/Troubleshooting-Hardware-Test-Tpm-Version.md
index 6e2f83be..4c419415 100644
--- a/TSG/EnvironmentValidator/Troubleshooting-Hardware-Test-Tpm-Version.md
+++ b/TSG/EnvironmentValidator/Troubleshooting-Hardware-Test-Tpm-Version.md
@@ -181,18 +181,29 @@ the machine's firmware setup (or with the vendor's management tooling), not from
 switched at all** and would have to be replaced. Consult your hardware vendor's TPM
 documentation for your exact model before proceeding.
 
-### Before you start: confirm a TPM 2.0 switch is possible on your hardware
-
-The single most platform-variable fact is whether your model can switch to TPM 2.0 at all,
-so settle it first. Read the current state (step 1 below, non-disruptive), then consult your
-hardware vendor's TPM documentation for your exact model to confirm whether the TPM can be
-switched from 1.2 to 2.0, and if so whether the switch is reversible or subject to a toggle
-limit.
-
-If the machine has **no TPM**, or its TPM is a **fixed module that cannot report 2.0**, the
-firmware steps below will not help. Confirm the machine is on the Azure Local supported
-hardware list and engage your hardware vendor (the module or machine has to be brought to
-spec). Do not start any disruptive change until you have confirmed the switch is possible.
+### Before you start: decide whether and how this can be fixed (and who does it)
+
+The single most platform-variable fact is whether your exact server model can switch to TPM
+2.0 at all, so settle that first. Read the current state (step 1 below, non-disruptive), then
+consult your hardware vendor's TPM documentation for your model and use this table to decide
+the path and the owner **before** any disruptive change:
+
+| What your hardware reports / the vendor says | What it means | Who owns the action | What to do |
+| --- | --- | --- | --- |
+| TPM already reports **2.0** | This check should pass | No change needed | Re-confirm with step 1; if it still fails, see [When to escalate](#when-to-escalate) |
+| TPM present, reports **1.2**, vendor says it is **switchable to 2.0** | A firmware switch is possible (it clears the TPM) | Server / firmware admin; Windows admin confirms BitLocker | Escrow the BitLocker key first, then follow [How to fix it](#how-to-fix-it) |
+| TPM **1.2**, switch is **one-way or limited** (for example a toggle-count cap) | You can switch but cannot easily go back | Server / firmware admin **with hardware-vendor sign-off** | Confirm with the vendor, then treat it as a one-time change |
+| TPM is a **fixed module** that cannot report 2.0 | Cannot be fixed in firmware | Hardware vendor (OEM) | Engage the OEM; the module or machine must be brought to spec. Expect lead time |
+| **No TPM present** | Not deployable (this version check will not fail, but `Test-TpmProperties` will) | Hardware vendor (OEM) plus procurement | Confirm the machine is on the Azure Local supported hardware list; add or replace the TPM |
+
+**Do not start any disruptive change until you have confirmed all three:** the switch is
+supported on your exact model, the **BitLocker recovery key is escrowed**, and (if this machine
+is already a deployed cluster member) it has been **drained** first. A TPM switch clears the
+module and is sometimes irreversible, so if any of the three is unknown, stop and confirm.
+
+> **Setting expectations:** a firmware switch is usually quick, but a fixed-module or
+> unsupported-hardware case means a hardware change or replacement with real lead time and
+> possible procurement. Surface that to the customer early so the deployment schedule reflects it.
 
 ### 1. Confirm the current TPM state