Clear non-fatal errors on AER recovery failure by yurypm · Pull Request #589 · sonic-net/sonic-linux-kernel

yurypm · 2026-06-11T10:50:53Z

pci_aer_clear_nonfatal_status() is not called when AER recovery fails.

A PCIe switch can report an AER error with the 0000:00:00.0 AER error source address (if the multi-error flag is not set). In this case, the AER driver rescans the whole bus and tries to find a device reporting the AER error.

find_source_device() is called. When the AER error source is 0, is_error_source() returns 'true' for any device with a reported AER error. When is_error_source() reports 'true', the AER driver checks the e_info->multi_error_valid flag and stops iterating in find_device_iter(). Therefore, the error is reported only for the first device on the bus.

Then, the AER driver initiates AER recovery. If AER recovery fails, AER error cleanup is not called for the devices, so the error status stays set. When any device on the same bus later reports a new AER error, the bus rescan is repeated, and the error is again attributed to the same first device, which still carries the uncleared AER error.
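The sequence above can be illustrated with a small user-space model. This is only a sketch under stated assumptions, not the kernel code: `struct toy_dev` and every `toy_*` function are hypothetical names invented for the illustration, and the scan is reduced to "pick the first device with a pending error", which is how the description characterizes the source-ID-0 case without the multi-error flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model of one device on the bus; NOT the kernel's struct pci_dev. */
struct toy_dev {
    const char *name;
    bool nonfatal_status;   /* models a pending non-fatal AER error */
    bool recovery_succeeds; /* models whether the driver accepts recovery */
};

/* Models the scan for error source ID 0 with the multi-error flag unset:
 * iteration stops at the first device that has a pending error. */
static struct toy_dev *toy_find_source(struct toy_dev *bus, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (bus[i].nonfatal_status)
            return &bus[i];
    return NULL;
}

/* Handles one AER event and returns the device the error was attributed to.
 * The status is cleared on successful recovery, or on failure only when
 * clear_on_failure (the behavior proposed in this PR) is enabled. */
static struct toy_dev *toy_handle_event(struct toy_dev *bus, size_t n,
                                        bool clear_on_failure)
{
    struct toy_dev *dev = toy_find_source(bus, n);

    if (!dev)
        return NULL;
    if (dev->recovery_succeeds || clear_on_failure)
        dev->nonfatal_status = false;
    return dev;
}

/* Scenario: device A reports an error and its recovery fails, then
 * device B reports a new error. Returns the bus index of the device
 * that the second event is attributed to. */
static int toy_second_event_index(bool clear_on_failure)
{
    struct toy_dev bus[] = {
        { "A", true,  false }, /* pending error, recovery will fail */
        { "B", false, true  }, /* healthy for now */
    };
    size_t n = sizeof(bus) / sizeof(bus[0]);

    struct toy_dev *first = toy_handle_event(bus, n, clear_on_failure);
    assert(first == &bus[0]);

    bus[1].nonfatal_status = true; /* B reports a new error */
    struct toy_dev *second = toy_handle_event(bus, n, clear_on_failure);
    assert(second != NULL);
    return (int)(second - bus);
}
```

In this model, `toy_second_event_index(false)` attributes the second event to device A again (the stale error), while `toy_second_event_index(true)` attributes it to device B.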

Add a kernel boot parameter pci=aer_clear_on_recovery_failure to enable AER error cleanup for cases where recovery fails. This prevents stale errors from causing incorrect device identification on subsequent AER error events.

The kernel boot parameter allows enabling this functionality on demand. By default, it is disabled, because there could be software that expects to see the device in an unchanged state when AER recovery fails.
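As a rough illustration of how an opt-in flag inside the comma-separated `pci=` option list could be recognized, here is a user-space sketch. It is an assumption for illustration only: the patch's actual parsing code is not shown in this thread, and `toy_pci_option_enabled` is a hypothetical helper, not a kernel function.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Returns true if opt appears as a complete item in the comma-separated
 * option list pci_arg (the text that follows "pci=" on the command line).
 * Hypothetical helper for illustration; not part of the kernel. */
static bool toy_pci_option_enabled(const char *pci_arg, const char *opt)
{
    assert(pci_arg != NULL && opt != NULL);

    size_t optlen = strlen(opt);
    const char *p = pci_arg;

    while (*p) {
        const char *comma = strchr(p, ',');
        size_t len = comma ? (size_t)(comma - p) : strlen(p);

        /* Require an exact match of the whole item, not just a prefix. */
        if (len == optlen && strncmp(p, opt, len) == 0)
            return true;
        if (!comma)
            break;
        p = comma + 1;
    }
    return false;
}
```

With this sketch, an option string such as `noaer,aer_clear_on_recovery_failure` enables the flag, while an empty string or a string without the item leaves it disabled, matching the default-off behavior described above.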

We are planning to enable the new functionality only for Arista devices.

mssonicbld · 2026-06-11T10:51:01Z

/azp run

azure-pipelines · 2026-06-11T10:51:11Z

Azure Pipelines successfully started running 1 pipeline(s).

paulmenzel

I am not a PCI expert. Can you please send it upstream for review?

Reading the commit message, this sounds like a workaround. Doesn't the actual problem with the stale errors still need to be fixed? Or is that behavior by design?

paulmenzel · 2026-06-11T12:09:15Z

+ 				disable AER error recovery when an uncorrectable
+ 				error is reported.
+		aer_clear_on_recovery_failure
+				[PCIE] If the PCIEAER kernel config parameter is


Start on the line above?

yurypm · Jun 11, 2026

https://lore.kernel.org/linux-pci/CAPzpGcRCTCZtaX1EVaJNZ103THZKsoszZduY7=gwfYdcrMo-SQ@mail.gmail.com/

Stale errors are a side effect of AER recovery failures and the lack of AER error cleanup following a failure. This patch aims to resolve the stale error issue.

While we could attempt to fix the AER recovery failure in one particular case, the stale error issue would persist because it is perfectly legal for drivers to reject recovery.

Linux kernel maintainers suggested fixing the AER recovery failure directly. However, the proposed fix causes kernel crashes and requires further investigation.

This patch provides an immediate fix for the stale errors observed on our devices. This allows us time to continue investigating and ultimately fix the AER recovery failure, or alternatively, disable AER recovery entirely for all Arista devices.

Start on the line above?

Fixed

paulmenzel · Jun 12, 2026

(Hmm, no idea how to mark threads as resolved if they do not show up in the Files changed view.)

Could you please also amend the commit message with a summary of the maintainers' response and the URL to the discussion from May? (No idea why I didn't find it when searching.) It sounds like Lukas' patches would solve your issue, so it would be best if you tested those instead of adding a new parameter.

paulmenzel · 2026-06-11T12:11:07Z

+the same bus reports a new AER error, the bus rescan process is
+repeated, and the AER error is reported for the same first device with
+the uncleaned AER error.
+


That sounds like the wrong behavior. Can this be fixed?

yurypm · Jun 11, 2026

This is a side effect of an AER recovery failure. We should not be left with stale errors. We can either perform error cleanup in the kernel immediately after a failure, or leave the devices as-is and let user-space code recover the system/PCIe tree. The current Linux kernel leaves the devices as-is, which leads to visible side effects when there is no user-space monitoring.

paulmenzel · Jun 12, 2026

Could you please amend the commit message to make this clearer?


mssonicbld · 2026-06-11T16:52:03Z

/azp run

azure-pipelines · 2026-06-11T16:52:12Z

Azure Pipelines successfully started running 1 pipeline(s).

yurypm requested a review from a team as a code owner June 11, 2026 10:50

paulmenzel reviewed Jun 11, 2026


yurypm force-pushed the port-aer-clear-on-recovery-failure-v2 branch from 1c8ca80 to 25dee8d Compare June 11, 2026 16:51

yurypm wants to merge 1 commit into sonic-net:master from yurypm:port-aer-clear-on-recovery-failure-v2
