Clear non-fatal errors on AER recovery failure #589

Open
yurypm wants to merge 1 commit into sonic-net:master from yurypm:port-aer-clear-on-recovery-failure-v2

@yurypm commented Jun 11, 2026

Contributor

pci_aer_clear_nonfatal_status() is not called when AER recovery fails.

A PCIe switch can report an AER error with an error source address of 0000:00:00.0 (if the multi-error flag is not set). In this case, the AER driver rescans the whole bus to find the device that reported the AER error.

find_source_device() is called. When the AER error source is 0, is_error_source() returns 'true' for any device with a reported AER error. When is_error_source() returns 'true', the AER driver checks the e_info->multi_error_valid flag and, if it is not set, stops iterating in find_device_iter(). Therefore, the error is reported only for the first matching device on the bus.

Then, the AER driver initiates AER recovery. If AER recovery fails, AER error cleanup is not performed for the devices. When any device on the same bus later reports a new AER error, the bus rescan is repeated, and the AER error is again attributed to the same first device, which still holds the uncleared AER error.

Add a kernel boot parameter pci=aer_clear_on_recovery_failure to enable AER error cleanup for cases where recovery fails. This prevents stale errors from causing incorrect device identification on subsequent AER error events.

The kernel boot parameter allows enabling this functionality on demand. By default, it is disabled, because there could be software that expects to see the device in an unchanged state when AER recovery fails.

We are planning to enable the new functionality only for Arista devices.
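
The sequence described above can be modeled with a small simulation. This is a hedged sketch in Python, not kernel code: the `Device` class and the `find_source_devices`, `handle_aer_event`, and `run` functions are hypothetical names, each device is reduced to a single "error pending" flag, and the `clear_on_recovery_failure` argument only stands in for the proposed pci=aer_clear_on_recovery_failure parameter. It assumes the error source ID is 0, so any device with a pending error matches, and that the scan stops at the first match when the multi-error flag is not set.

```python
from dataclasses import dataclass


@dataclass
class Device:
    """Toy stand-in for a PCIe device; not a kernel structure."""
    name: str
    error_pending: bool = False     # models an uncleared AER status bit
    recovery_succeeds: bool = True  # whether recovery works for this device


def find_source_devices(bus, multi_error_valid=False):
    """Model of the bus scan when the error source ID is 0.

    Any device with a pending error matches. Without the multi-error
    flag, the scan stops at the first match.
    """
    found = []
    for dev in bus:
        if dev.error_pending:
            found.append(dev)
            if not multi_error_valid:
                break
    return found


def handle_aer_event(bus, clear_on_recovery_failure=False):
    """Return the names of the devices blamed for one AER event."""
    blamed = find_source_devices(bus)
    for dev in blamed:
        if dev.recovery_succeeds or clear_on_recovery_failure:
            dev.error_pending = False  # status is cleared
        # Otherwise the status stays set and becomes a stale error.
    return [dev.name for dev in blamed]


def run(clear_on_recovery_failure):
    """Two AER events on one bus: first from dev_a, then from dev_b."""
    dev_a = Device("dev_a", recovery_succeeds=False)
    dev_b = Device("dev_b")
    bus = [dev_a, dev_b]

    dev_a.error_pending = True
    first = handle_aer_event(bus, clear_on_recovery_failure)

    dev_b.error_pending = True
    second = handle_aer_event(bus, clear_on_recovery_failure)
    return first, second
```

In this model, `run(False)` returns `(['dev_a'], ['dev_a'])`: the second event, raised by dev_b, is attributed to dev_a because its status was never cleared. `run(True)` returns `(['dev_a'], ['dev_b'])`, which is the behavior the new parameter is meant to enable.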

@yurypm requested a review from a team as a code owner June 11, 2026 10:50

@mssonicbld

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@paulmenzel left a comment

Contributor

I am not a PCI expert. Can you please send it upstream for review?

Reading the commit message, this sounds like a workaround. Doesn't the actual problem, the stale errors, need to be fixed? Or is that behavior by design?

disable AER error recovery when an uncorrectable
error is reported.
+ aer_clear_on_recovery_failure
+ [PCIE] If the PCIEAER kernel config parameter is

Contributor

Start on the line above?

Contributor Author

https://lore.kernel.org/linux-pci/CAPzpGcRCTCZtaX1EVaJNZ103THZKsoszZduY7=gwfYdcrMo-SQ@mail.gmail.com/

Stale errors are a side effect of AER recovery failures and the lack of AER error cleanup following a failure. This patch aims to resolve the stale error issue.

While we could attempt to fix the AER recovery failure in one particular case, the stale error issue would persist because it is perfectly legal for drivers to reject recovery.

Linux kernel maintainers suggested fixing the AER recovery failure directly. However, the proposed fix causes kernel crashes and requires further investigation.

This patch provides an immediate fix for the stale errors observed on our devices. It gives us time to continue investigating and ultimately either fix the AER recovery failure or disable AER recovery entirely for all Arista devices.

Contributor Author

> Start on the line above?

Fixed

Contributor

(Hmm, no idea how to mark threads as resolved, if they do not show up in the Files changed view.)

Could you please also amend the commit message with a summary of the maintainers' response and the URL of the discussion from May? (No idea why I didn’t find it when searching.) It sounds like Lukas’ patches would solve your issue, so it would be best if you tested those instead of adding a new parameter.

the same bus reports a new AER error, the bus rescan process is
repeated, and the AER error is reported for the same first device with
the uncleaned AER error.

Contributor

That sounds like the wrong behavior. Can this be fixed?

Contributor Author

This is a side effect of an AER recovery failure. We should not be left with stale errors. We can either perform error cleanup in the kernel immediately after a failure, or we can leave the devices as-is and let user-space code recover the system/PCIe tree. The current Linux kernel leaves the devices as-is, which leads to visible side effects in the absence of user-space monitoring.

Contributor

Could you please amend the commit message to make this clearer?

@yurypm force-pushed the port-aer-clear-on-recovery-failure-v2 branch from 1c8ca80 to 25dee8d June 11, 2026 16:51

@mssonicbld

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

3 participants