WD BLUE SN570 2TB on Asus ROG Zephyrus G14 (2022) causes PCIe Bus errors

With the above-mentioned hardware I frequently get:

[  338.015858] pcieport 0000:00:02.4: AER: Corrected error received: 0000:06:00.0
[  338.015887] nvme 0000:06:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
[  338.015893] nvme 0000:06:00.0:   device [15b7:5025] error status/mask=00000001/0000e000
[  338.015900] nvme 0000:06:00.0:    [ 0] RxErr                  (First)

which I understand is a non-critical (corrected) error due to the nvme not entering or exiting low-power states correctly.
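For anyone wanting to read the status/mask line, here is a minimal sketch that decodes it. The bit positions follow the PCIe AER Correctable Error Status register; the short names are the ones I believe the kernel prints (RxErr matches the log above), so treat the name table as an assumption rather than a copy of the kernel source.

```python
# Bit positions of the AER Correctable Error Status register.
# The short names are assumed to match what the kernel prints.
CORRECTABLE_BITS = {
    0: "RxErr",
    6: "BadTLP",
    7: "BadDLLP",
    8: "Rollover",
    12: "Timeout",
    13: "NonFatalErr",
    14: "CorrIntErr",
    15: "HeaderOF",
}


def decode_correctable(status_mask):
    """Split a 'status/mask' hex string into (reported, masked) name lists."""
    status_hex, mask_hex = status_mask.split("/")
    status, mask = int(status_hex, 16), int(mask_hex, 16)

    def names(value):
        # Keep every known bit that is set in the given register value.
        return [name for bit, name in CORRECTABLE_BITS.items() if value & (1 << bit)]

    return names(status), names(mask)


reported, masked = decode_correctable("00000001/0000e000")
print(reported)  # ['RxErr']
print(masked)    # ['NonFatalErr', 'CorrIntErr', 'HeaderOF']
```

So in the log above only the receiver error (bit 0) is reported, and the three masked bits are not logged at all.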

Using pcie_aspm=off as a kernel parameter appears to prevent the issue, but is otherwise undesirable on a laptop.

  • Should the issue be reported upstream, so that maybe a quirk can be added about this specific nvme?
  • Is there any other solution to try apart from the pcie_aspm trick? (I have already upgraded the nvme firmware to the latest one from WD with no improvement; the system BIOS is also up to date.)

These AER errors occur if
a) your mainboard doesn’t fully support the ACPI protocol, or
b) the ACPI modules aren’t fully installed.

This is not a big issue nowadays, and the usual way is to ignore these errors by adding the “pci=noaer” parameter to the GRUB kernel parameters (remember to update GRUB).
This is already known and shouldn’t worry you, as for example here:

A continuous stream of errors similar to the following will appear in kernel messages:

 AER: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)
 AER:   device [1969:e0b1] error status/mask=00000040/00002000
 AER:    [ 6] BadTLP

You can fix them with the kernel parameter pci=noaer. 
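To confirm which of the two workarounds mentioned in this thread is actually active after a reboot, a minimal sketch like the following can check a kernel command line. The function name and the sample command line are hypothetical; on a running system the real string would come from /proc/cmdline.

```python
def aer_aspm_overrides(cmdline):
    """Report which of the two workarounds appear on a kernel command line."""
    params = cmdline.split()
    # pci= takes a comma-separated option list, so noaer may share it
    # with other options (e.g. pci=noaer,nomsi).
    noaer = any(
        p.startswith("pci=") and "noaer" in p[len("pci="):].split(",")
        for p in params
    )
    return {"pci=noaer": noaer, "pcie_aspm=off": "pcie_aspm=off" in params}


# Hypothetical command line, for illustration only.
sample = "BOOT_IMAGE=/vmlinuz root=/dev/nvme0n1p2 rw quiet pci=noaer"
print(aer_aspm_overrides(sample))
```

On a live system, pass `open("/proc/cmdline").read()` instead of `sample`.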


@Olli Thanks. A few more questions:

  • So it is an issue of the mainboard firmware, not an issue of the specific NVME or the mainboard/NVME combination?

    • Asking because I upgraded the NVME and installed Linux on the new one, so I do not know if the issue appeared because of the new NVME. So I wonder whether I might have stumbled on a particularly bad NVME.
    • Also asking because the error stream in my case seems to be a little different from the one on the Arch wiki you linked: the device in my case is the NVME (and only it). Furthermore, the error appears to be clearly related to power management.
  • pci=noaer only silences the issue, without doing anything about the actual issue, right?

  • I also noticed that echo performance > /sys/module/pcie_aspm/parameters/policy fixes the issue, but again I understand this is going to hurt the idle power usage, which is already quite high for this laptop. So it is better to avoid it, correct?

  • Can this be expected to be fixed via a firmware upgrade?
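For checking which ASPM policy is in effect before and after the echo above, here is a minimal sketch. It assumes the policy file lists all policies on one line with the active one in square brackets; the sample string is an illustration of that format, not output captured from this laptop.

```python
def parse_aspm_policy(text):
    """Return (active_policy, all_policies) from the sysfs policy string."""
    active, policies = None, []
    for token in text.split():
        # The active policy is assumed to be the one wrapped in brackets.
        if token.startswith("[") and token.endswith("]"):
            token = token[1:-1]
            active = token
        policies.append(token)
    return active, policies


# Illustrative sample of the assumed file format.
sample = "[default] performance powersave powersupersave"
print(parse_aspm_policy(sample))
```

On a live system the input would be `open("/sys/module/pcie_aspm/parameters/policy").read()`.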

PCIe Correctable Errors have already been corrected by hardware and there is no functional impact. Generally they are caused by things like checksum errors that are probably related to signal integrity issues. Clean and reseat the device, etc. All the kernel can do is log these.
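Since these errors only show up as log entries, a quick tally per device can show how frequent they are, for example before and after reseating. This is a minimal sketch that assumes the log format shown in the first post; the function name is hypothetical, and the sample is two of the lines quoted above.

```python
import re
from collections import Counter

# Matches lines such as:
#   nvme 0000:06:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, ...
AER_LINE = re.compile(
    r"(?P<dev>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): PCIe Bus Error: "
    r"severity=(?P<sev>[^,]+), type=(?P<type>[^,]+)"
)


def tally_aer(log_text):
    """Count 'PCIe Bus Error' lines per (device, severity, type)."""
    counts = Counter()
    for line in log_text.splitlines():
        match = AER_LINE.search(line)
        if match:
            counts[(match["dev"], match["sev"], match["type"])] += 1
    return counts


sample = """\
[  338.015858] pcieport 0000:00:02.4: AER: Corrected error received: 0000:06:00.0
[  338.015887] nvme 0000:06:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
"""
print(tally_aer(sample))
```

Feeding it the saved output of the kernel log instead of `sample` gives one count per device and error type.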



This seems to suggest a hardware problem. The fact that disabling power management via pcie_aspm fixes the issue seems to point to power management being mishandled by the mainboard or the NVME. I’ll try a different NVME as soon as possible and report back!

hello @callegar

Yes, a lot of mainboard manufacturers use firmware that isn’t 100% compatible with the ACPI protocol, others use a different approach, and for some the ACPI protocol might not seem modern enough. The list of reasons is long.
For this reason Linux provides the pci=noaer option, and in fact it just ignores any message related to the “advanced error” feature of ACPI.

It’s not an issue. This “advanced error reporting” was invented to enhance the ACPI messaging system, but a lot of manufacturers use their own methods. Nothing to worry about.

NVMEs have been problematic in the past: some weren’t detected, some were difficult (I remember a Samsung series that had real issues). But all these problems should be history with kernel 5.xx and newer.

If there are still problems, check this:


I ran more tests, including with a different NVME.

The issue appears to be with the specific WD Blue SN570 NVME. With a Crucial NVME the issue disappears. No need to silence AER with it. So not a bug with the system firmware, rather with the specific NVME (or maybe system/NVME combination).

Furthermore, the issue really is an issue: after the NVME change, the overall power consumption of the machine at light loads appears to be somewhat reduced. The NVME was probably mismanaging its low-power modes.