Clear non-fatal errors on AER recovery failure #589
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
paulmenzel
I am not a PCI expert. Can you please send it upstream for review?
Reading the commit message, this sounds like a workaround. Doesn't the actual problem with the stale errors need to be fixed? Or is that behavior by design?
  disable AER error recovery when an uncorrectable
  error is reported.
+ aer_clear_on_recovery_failure
+ [PCIE] If the PCIEAER kernel config parameter is
Start on the line above?
Stale errors are a side effect of AER recovery failures and the lack of AER error cleanup following a failure. This patch aims to resolve the stale error issue.
While we could attempt to fix the AER recovery failure in one particular case, the stale error issue would persist because it is perfectly legal for drivers to reject recovery.
Linux kernel maintainers suggested fixing the AER recovery failure directly. However, the proposed fix causes kernel crashes and requires further investigation.
This patch provides an immediate fix for the stale errors observed on our devices. This allows us time to continue investigating and ultimately fix the AER recovery failure, or alternatively, disable AER recovery entirely for all Arista devices.
Start on the line above?
Fixed
(Hmm, no idea how to mark threads as resolved, if they do not show up in the Files changed view.)
Could you please also amend the commit message with a summary of the maintainers’ response and the URL of the discussion from May? (No idea why I didn’t find it when searching.) It sounds like Lukas’ patches would solve your issue, so wouldn’t it be best to test those instead of adding a new parameter?
  the same bus reports a new AER error, the bus rescan process is
  repeated, and the AER error is reported for the same first device with
  the uncleaned AER error.
That sounds like the wrong behavior. Can this be fixed?
This is a side effect of an AER recovery failure. We should not be left with stale errors. We can either perform error cleanup in the kernel immediately after a failure, or we can leave the devices as-is and let user-space code recover the system/PCIe tree. The current Linux kernel leaves the devices as-is, which leads to visible side effects in the absence of user-space monitoring.
Could you please amend the commit message to make this clearer?
1c8ca80 to 25dee8d
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
pci_aer_clear_nonfatal_status() is not called when AER recovery fails.
A PCIe switch can report an AER error with the 0000:00:00.0 AER error source address (if the multi-error flag is not set). In this case, the AER driver rescans the whole bus and tries to find a device reporting the AER error.
find_source_device() is called. When the AER error source is 0, is_error_source() returns 'true' for any device with a reported AER error. When is_error_source() reports 'true', the AER driver checks the e_info->multi_error_valid flag and stops iterating in find_device_iter(). Therefore, the error is reported only for the first device on the bus.
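The scan behavior described above can be illustrated with a small user-space model. This is only a sketch and not the kernel code: `struct sim_dev`, `sim_is_error_source()` and `sim_find_source_devices()` are hypothetical names, and the matching rule is simplified to "exact ID match, or any device with a logged error when the source ID is 0".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a PCI device; not the kernel's struct pci_dev. */
struct sim_dev {
    uint16_t id;           /* bus/device/function packed as a requester ID */
    uint32_t uncor_status; /* non-zero means an AER error is still logged */
};

/* Simplified model of the matching rule described for is_error_source(). */
static bool sim_is_error_source(const struct sim_dev *dev, uint16_t source_id)
{
    if (source_id != 0)
        return dev->id == source_id;
    /* Source ID 0: any device with a logged error qualifies. */
    return dev->uncor_status != 0;
}

/*
 * Walk the devices in bus order and record every match in out[].
 * Without multi_error_valid the walk stops at the first match,
 * which is why only the first device on the bus gets reported.
 * Returns the number of devices recorded.
 */
static size_t sim_find_source_devices(const struct sim_dev *devs, size_t n,
                                      uint16_t source_id,
                                      bool multi_error_valid,
                                      const struct sim_dev **out,
                                      size_t out_cap)
{
    size_t found = 0;

    assert(devs != NULL && out != NULL);
    for (size_t i = 0; i < n && found < out_cap; i++) {
        if (!sim_is_error_source(&devs[i], source_id))
            continue;
        out[found++] = &devs[i];
        if (!multi_error_valid)
            break;
    }
    return found;
}
```

With two devices holding logged errors and a source ID of 0, the model returns only the first of them unless `multi_error_valid` is set.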
Then, the AER driver initiates AER recovery. If AER recovery fails, AER error cleanup is not performed for the devices. When any device on the same bus later reports a new AER error, the bus rescan is repeated, and the error is again attributed to the same first device, which still holds the uncleared AER error.
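To make the stale-error effect concrete, here is a minimal user-space sketch of two consecutive error events with source ID 0. All names (`struct stale_dev`, `stale_handle_event()`, the `clear_on_failure` argument) are assumptions made for illustration; the real recovery path in the AER driver is considerably more involved.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical device model used only for this illustration. */
struct stale_dev {
    uint32_t status;  /* logged AER status bits, 0 when clean */
    bool can_recover; /* false models a driver that rejects recovery */
};

/* Return the index of the first device with a logged error, or -1. */
static int stale_first_reporter(const struct stale_dev *devs, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (devs[i].status != 0)
            return (int)i;
    return -1;
}

/*
 * Handle one error event whose source ID is 0.
 * The first device with a logged error is blamed. Its status is cleared
 * when recovery succeeds, or when clear_on_failure asks for cleanup even
 * though recovery failed. Returns the blamed index, or -1 if none.
 */
static int stale_handle_event(struct stale_dev *devs, size_t n,
                              bool clear_on_failure)
{
    int idx;

    assert(devs != NULL);
    idx = stale_first_reporter(devs, n);
    if (idx < 0)
        return -1;
    if (devs[idx].can_recover || clear_on_failure)
        devs[idx].status = 0;
    return idx;
}
```

In this model, if device 0 fails recovery and device 1 later logs a new error, the second event is still attributed to device 0 unless `clear_on_failure` is true.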
Add a kernel boot parameter pci=aer_clear_on_recovery_failure to enable AER error cleanup for cases where recovery fails. This prevents stale errors from causing incorrect device identification on subsequent AER error events.
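As a rough illustration of how such an option could be recognized inside a comma-separated `pci=` string, consider the sketch below. It is not taken from the patch: the helper name `pci_opts_has_clear_on_failure()` is hypothetical, and the real parsing in the kernel may be structured differently.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Return true if the comma-separated option string contains the exact
 * token "aer_clear_on_recovery_failure". The default (token absent)
 * is false, matching the "disabled by default" behavior described above.
 */
static bool pci_opts_has_clear_on_failure(const char *opts)
{
    static const char key[] = "aer_clear_on_recovery_failure";
    const size_t key_len = sizeof(key) - 1;

    assert(opts != NULL);
    while (*opts != '\0') {
        /* Length of the current token, up to the next comma or the end. */
        size_t len = strcspn(opts, ",");

        if (len == key_len && strncmp(opts, key, key_len) == 0)
            return true;
        opts += len;
        if (*opts == ',')
            opts++; /* skip the separator */
    }
    return false;
}
```

Matching on the full token length avoids treating a longer option that merely starts with the same text as a match.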
The kernel boot parameter allows enabling this functionality on demand. By default, it is disabled, because there could be software that expects to see the device in an unchanged state when AER recovery fails.
We are planning to enable the new functionality only for Arista devices.