These AER errors occur if
a) your mainboard doesn’t fully support the ACPI protocol, or
b) the ACPI modules aren’t fully installed.
This is not a big issue nowadays, and the usual approach is to ignore these errors by adding the “pci=noaer” parameter to the GRUB kernel parameters (remember to update GRUB afterwards).
This is already known and shouldn’t worry you; see for example here:
A continuous stream of errors similar to the following will appear in kernel messages:
AER: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)
AER: device [1969:e0b1] error status/mask=00000040/00002000
AER: [ 6] BadTLP
You can fix them with the kernel parameter pci=noaer.
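For anyone unsure how to add the parameter, here is a minimal sketch. It assumes the GRUB defaults live in /etc/default/grub and that the GRUB_CMDLINE_LINUX_DEFAULT line is double-quoted, which is common but not universal. The helper name add_kernel_param is made up for this example.

```shell
# Hypothetical helper: append a kernel parameter to the
# GRUB_CMDLINE_LINUX_DEFAULT line of a GRUB defaults file.
# Usage: add_kernel_param FILE PARAM
# Assumes a double-quoted value, GNU sed (-i), and a PARAM
# without "/" or "&" characters.
add_kernel_param() {
    file=$1
    param=$2
    # Do nothing if the parameter is already on that line.
    if grep -Eq "^GRUB_CMDLINE_LINUX_DEFAULT=.*[\" ]${param}[\" ]" "$file"; then
        return 0
    fi
    # Insert the parameter just before the closing double quote.
    sed -i "s/^\(GRUB_CMDLINE_LINUX_DEFAULT=\".*\)\"\$/\1 ${param}\"/" "$file"
}
```

Run as root, after making a backup copy of the file, this would be called as add_kernel_param /etc/default/grub pci=noaer. Afterwards regenerate the GRUB configuration with update-grub (or grub-mkconfig -o /boot/grub/grub.cfg, depending on the distribution) and reboot.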
So it is an issue with the mainboard firmware, not with the specific NVMe or the mainboard/NVMe combination?
I am asking because I upgraded the NVMe and installed Linux on the new one, so I do not know whether the issue appeared because of the new NVMe. I wonder whether I might have stumbled on a particularly bad NVMe.
Also, the error stream in my case seems to be a little different from the one on the Arch wiki you linked: the device in my case is the NVMe (and only the NVMe). Furthermore, the error appears to be clearly related to power management.
pci=noaer only silences the messages, without doing anything about the actual issue, right?
I noticed that echo performance > /sys/module/pcie_aspm/parameters/policy also fixes the issue, but as I understand it this will hurt idle power usage, which is already quite high for this laptop. So it is better to avoid it, correct?
Can this be expected to be fixed via a firmware upgrade?
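For the ASPM policy workaround mentioned above, it can help to check which policy is currently active before and after changing it. The kernel lists all available policies in the policy file and wraps the active one in square brackets. A small sketch follows; the function name active_aspm_policy is only illustrative.

```shell
# Hypothetical helper: print the currently active PCIe ASPM policy.
# The policy file lists every policy and marks the active one with
# brackets, e.g. "default performance [powersave] powersupersave".
# An alternative file path can be passed as the first argument.
active_aspm_policy() {
    policy_file=${1:-/sys/module/pcie_aspm/parameters/policy}
    # Print only the text between the square brackets.
    sed -n 's/.*\[\([^]]*\)].*/\1/p' "$policy_file"
}
```

Note that sudo echo performance > /sys/module/pcie_aspm/parameters/policy does not work from a normal user shell, because the redirection is performed without root rights; echo performance | sudo tee /sys/module/pcie_aspm/parameters/policy does. The change lasts only until the next reboot, and the write may be refused if the firmware has not handed ASPM control to the operating system.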
PCIe Correctable Errors have already been corrected by hardware and there is no functional impact. Generally they are caused by things like checksum errors that are probably related to signal integrity issues. Clean and reseat the device, etc. All the kernel can do is log these.
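Since the kernel logs each of these errors, one way to see which device they come from is to count the "device [vendor:device]" lines per ID. This is a rough sketch that assumes the log lines look like the Arch wiki excerpt quoted earlier in the thread; the function name count_aer_by_device is made up here.

```shell
# Hypothetical helper: read kernel log lines on stdin and print how many
# AER "device [vvvv:dddd]" lines were seen per vendor:device ID,
# most frequent first, in the form "COUNT ID".
count_aer_by_device() {
    grep -oE 'device \[[0-9a-fA-F]{4}:[0-9a-fA-F]{4}\]' \
        | tr -d '[]' \
        | awk '{ n[$2]++ } END { for (id in n) print n[id], id }' \
        | sort -rn
}
```

Typical use would be journalctl -k -b | count_aer_by_device, or dmesg | count_aer_by_device (dmesg may need root on some systems). The IDs can then be matched against the output of lspci -nn to find out whether it is the NVMe or some other device.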
This seems to suggest a hardware problem. The fact that disabling power management via pcie_aspm fixes the issue seems to point to power management being mishandled by the mainboard or the NVMe. I’ll try a different NVMe as soon as possible and report back!
Yes, a lot of mainboard manufacturers use firmware that isn’t 100% compatible with the ACPI protocol; others do things their own way, and some may consider the ACPI protocol not modern enough. The list of reasons is long.
For this reason Linux offers the pci=noaer option, which in fact just ignores any message related to the “advanced error” feature of ACPI.
It’s not an issue: “advanced error reporting” was invented to enhance the ACPI messaging system, but a lot of manufacturers use their own methods. Nothing to worry about.
NVMe drives were problematic in the past: some weren’t detected, some were difficult (I remember a Samsung series that had real issues). But all these problems should be history with kernel 5.xx and newer.
The issue appears to be with the specific WD Blue SN570 NVMe. With a Crucial NVMe the issue disappears, and there is no need to silence AER with it. So it is not a bug in the system firmware, but rather in the specific NVMe (or maybe the system/NVMe combination).
Furthermore, the issue is a real issue. After the NVMe change, the overall power consumption of the machine at light loads appears to be somewhat reduced. The old NVMe was probably mismanaging its low power modes.