Which PCIe device is it?

deemon · 30 August 2020 11:39

After the last stable update I started getting these messages, which I have never seen before:

aug   30 07:58:58 Zen kernel: pcieport 0000:00:01.3: AER: Corrected error received: 0000:00:00.0
aug   30 07:58:58 Zen kernel: pcieport 0000:00:01.3: AER: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)
aug   30 07:58:58 Zen kernel: pcieport 0000:00:01.3: AER:   device [1022:1453] error status/mask=00000080/00006000
aug   30 07:58:58 Zen kernel: pcieport 0000:00:01.3: AER:    [ 7] BadDLLP               
aug   30 09:00:38 Zen kernel: pcieport 0000:00:01.3: AER: Corrected error received: 0000:00:00.0
aug   30 09:00:38 Zen kernel: pcieport 0000:00:01.3: AER: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
aug   30 09:00:38 Zen kernel: pcieport 0000:00:01.3: AER:   device [1022:1453] error status/mask=00001000/00006000
aug   30 09:00:38 Zen kernel: pcieport 0000:00:01.3: AER:    [12] Timeout               
aug   30 09:27:28 Zen kernel: pcieport 0000:00:01.3: AER: Corrected error received: 0000:00:00.0
aug   30 09:27:28 Zen kernel: pcieport 0000:00:01.3: AER: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
aug   30 09:27:28 Zen kernel: pcieport 0000:00:01.3: AER:   device [1022:1453] error status/mask=00001000/00006000
aug   30 09:27:28 Zen kernel: pcieport 0000:00:01.3: AER:    [12] Timeout               
aug   30 10:44:33 Zen kernel: pcieport 0000:00:01.3: AER: Corrected error received: 0000:00:00.0
aug   30 10:44:33 Zen kernel: pcieport 0000:00:01.3: AER: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)
aug   30 10:44:33 Zen kernel: pcieport 0000:00:01.3: AER:   device [1022:1453] error status/mask=00000080/00006000
aug   30 10:44:33 Zen kernel: pcieport 0000:00:01.3: AER:    [ 7] BadDLLP               
aug   30 10:59:10 Zen kernel: pcieport 0000:00:01.3: AER: Corrected error received: 0000:00:00.0
aug   30 10:59:10 Zen kernel: pcieport 0000:00:01.3: AER: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)
aug   30 10:59:10 Zen kernel: pcieport 0000:00:01.3: AER:   device [1022:1453] error status/mask=00000080/00006000
aug   30 10:59:10 Zen kernel: pcieport 0000:00:01.3: AER:    [ 7] BadDLLP               
aug   30 11:56:16 Zen kernel: pcieport 0000:00:01.3: AER: Corrected error received: 0000:00:00.0
aug   30 11:56:16 Zen kernel: pcieport 0000:00:01.3: AER: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)
aug   30 11:56:16 Zen kernel: pcieport 0000:00:01.3: AER:   device [1022:1453] error status/mask=00000080/00006000
aug   30 11:56:16 Zen kernel: pcieport 0000:00:01.3: AER:    [ 7] BadDLLP               
aug   30 13:19:16 Zen kernel: pcieport 0000:00:01.3: AER: Corrected error received: 0000:00:00.0
aug   30 13:19:16 Zen kernel: pcieport 0000:00:01.3: AER: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)
aug   30 13:19:16 Zen kernel: pcieport 0000:00:01.3: AER:   device [1022:1453] error status/mask=00000080/00006000
aug   30 13:19:16 Zen kernel: pcieport 0000:00:01.3: AER:    [ 7] BadDLLP               
aug   30 13:46:40 Zen kernel: pcieport 0000:00:01.3: AER: Corrected error received: 0000:00:00.0
aug   30 13:46:40 Zen kernel: pcieport 0000:00:01.3: AER: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
aug   30 13:46:40 Zen kernel: pcieport 0000:00:01.3: AER:   device [1022:1453] error status/mask=00001000/00006000
aug   30 13:46:40 Zen kernel: pcieport 0000:00:01.3: AER:    [12] Timeout
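The `error status/mask` values in these lines are bit fields. Below is a small sketch that decodes them. It assumes the bit layout of the Correctable Error Status/Mask registers from the PCIe specification and the short names the kernel's AER driver prints. The table is my own transcription, so double-check it against the spec. Read this way, 0x80 is bit 7 (Bad DLLP) and 0x1000 is bit 12 (Replay Timer Timeout). The mask 0x6000 means bits 13 and 14 (Advisory Non-Fatal Error, Corrected Internal Error) are masked, i.e. not reported.

```python
from typing import Dict, List

# Bit positions in the PCIe Correctable Error Status/Mask registers, using
# the short names the Linux AER driver prints (e.g. "[ 7] BadDLLP").
CORRECTABLE_BITS = {
    0: "RxErr",         # Receiver Error
    6: "BadTLP",        # Bad TLP
    7: "BadDLLP",       # Bad DLLP
    8: "Rollover",      # REPLAY_NUM Rollover
    12: "Timeout",      # Replay Timer Timeout
    13: "NonFatalErr",  # Advisory Non-Fatal Error
    14: "CorrIntErr",   # Corrected Internal Error
    15: "HeaderOF",     # Header Log Overflow
}


def decode(value: str) -> List[str]:
    """Name every known correctable-error bit set in a hex register value."""
    bits = int(value, 16)
    return [
        f"[{pos:2d}] {name}"
        for pos, name in sorted(CORRECTABLE_BITS.items())
        if bits & (1 << pos)
    ]


def decode_status_mask(field: str) -> Dict[str, List[str]]:
    """Decode an AER 'error status/mask=SSSSSSSS/MMMMMMMM' value."""
    status, mask = field.split("/")
    return {"status": decode(status), "masked": decode(mask)}


if __name__ == "__main__":
    # The two status/mask pairs that appear in the log above.
    for field in ("00000080/00006000", "00001000/00006000"):
        print(field, decode_status_mask(field))
```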

Although they are “corrected”, I would at least like to know which piece of hardware is being corrected.
The pcieport info I got from `lspci -vvv`:

00:01.3 PCI bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe GPP Bridge (prog-if 00 [Normal decode])
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0, Cache Line Size: 64 bytes
	Interrupt: pin ? routed to IRQ 27
	IOMMU group: 2
	Bus: primary=00, secondary=02, subordinate=06, sec-latency=0
	I/O behind bridge: 0000f000-0000ffff [size=4K]
	Memory behind bridge: fcc00000-fcdfffff [size=2M]
	Prefetchable memory behind bridge: 00000000f0300000-00000000f03fffff [size=1M]
	Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort+ <SERR- <PERR-
	BridgeCtl: Parity- SERR+ NoISA- VGA- VGA16+ MAbort- >Reset- FastB2B-
		PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
	Capabilities: <access denied>
	Kernel driver in use: pcieport

That didn’t tell me much…
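The `Bus:` line in that output does say what hangs off this port: everything from secondary bus 02 through subordinate bus 06. `lspci -tv` prints the same topology as a tree, and running `sudo lspci -vvv` fills in the `Capabilities: <access denied>` part. Here is a sketch that turns the `Bus:` line into a bus range and checks addresses against it. The function names are made up for illustration. With the bus IDs from the inxi output later in the thread, only the Realtek NIC at 05:00.0 falls in that range. inxi does not list every PCI function, though, so this is not a complete inventory.

```python
import re


def bus_range(bridge_info: str) -> range:
    """Return the bus numbers behind a bridge, from lspci's 'Bus:' line."""
    m = re.search(
        r"secondary=([0-9a-f]{2}), subordinate=([0-9a-f]{2})", bridge_info, re.IGNORECASE
    )
    if m is None:
        raise ValueError("no 'secondary=.., subordinate=..' field found")
    secondary, subordinate = (int(x, 16) for x in m.groups())
    return range(secondary, subordinate + 1)


def is_behind(address: str, buses: range) -> bool:
    """True if a '[domain:]bus:dev.fn' address sits on one of the given buses."""
    bus = int(address.split(":")[-2], 16)
    return bus in buses


if __name__ == "__main__":
    buses = bus_range("Bus: primary=00, secondary=02, subordinate=06, sec-latency=0")
    # Bus IDs as reported by inxi further down the thread.
    for addr in ("05:00.0", "09:00.0", "0b:00.3"):
        print(addr, "behind 00:01.3" if is_behind(addr, buses) else "elsewhere")
```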

But what is this device [1022:1453], and how do I find out?
And is this yet another actual hardware problem, or is there something wrong with drivers or such (given that the “problem” started with the update)?

megavolt · 30 August 2020 12:13

Hi!

lspci -nnk | grep "1022:1453" would show you which chipset it is.

According to this database: PCI Devices, it should be your Family 17h (Models 00h-0fh) PCIe GPP Bridge device.

Greetz

deemon · 30 August 2020 12:26

However… what does that meeeeean?

$ lspci -nnk | grep "1022:1453"
00:01.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe GPP Bridge [1022:1453]
00:01.3 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe GPP Bridge [1022:1453]
00:03.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe GPP Bridge [1022:1453]

Did the MB itself somehow break now?
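The vendor:device pair [1022:1453] only names the model, and this board has three functions of that model. What singles out the one reporting the errors is the address printed right after `pcieport` in the kernel log (0000:00:01.3). Here is a sketch of that matching. It assumes the default `lspci -nn` format, which omits the 0000: domain prefix. The helper names are hypothetical.

```python
import re
from typing import List, Optional

# "pcieport DDDD:BB:DD.F:" as it appears in the kernel log lines above.
PORT = re.compile(r"pcieport ([0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]):")


def reporting_port(log_line: str) -> str:
    """Extract the 'domain:bus:dev.fn' of the port that logged an AER message."""
    m = PORT.search(log_line)
    if m is None:
        raise ValueError("not a pcieport log line")
    return m.group(1)


def matching_lspci_line(address: str, lspci_lines: List[str]) -> Optional[str]:
    """Find the lspci line for an address; lspci drops the '0000:' domain by default."""
    short = address.split(":", 1)[1]
    for line in lspci_lines:
        if line.startswith(short + " "):
            return line
    return None


if __name__ == "__main__":
    log = "aug   30 07:58:58 Zen kernel: pcieport 0000:00:01.3: AER: Corrected error received: 0000:00:00.0"
    # Shortened versions of the three lspci -nn lines above.
    lspci = [
        "00:01.1 PCI bridge [0604]: PCIe GPP Bridge [1022:1453]",
        "00:01.3 PCI bridge [0604]: PCIe GPP Bridge [1022:1453]",
        "00:03.1 PCI bridge [0604]: PCIe GPP Bridge [1022:1453]",
    ]
    print(matching_lspci_line(reporting_port(log), lspci))
```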

Keruskerfuerst · 30 August 2020 12:32

Can you post the hardware info?
inxi -Fzxxxa

I had the same problem: I had to replace the mainboard, the CPU and RAM.

deemon · 30 August 2020 12:34

$ inxi -Fzxxxa
System:    Kernel: 5.8.3-2-MANJARO x86_64 bits: 64 compiler: N/A 
           parameters: BOOT_IMAGE=/boot/vmlinuz-5.8-x86_64 root=UUID=5d7a533c-7c16-4026-b28c-1bd1f407e3ff rw quiet apparmor=1 
           security=apparmor udev.log_priority=3 nohibernate amdgpu.ppfeaturemask=0xfffd7fff amdgpu.noretry=0 
           amdgpu.lockup_timeout=1000 amdgpu.gpu_recovery=1 amdgpu.audio=0 amdgpu.deep_color=1 amd_iommu=on iommu=pt 
           sysrq_always_enabled=1 
           Desktop: Xfce 4.14.2 tk: Gtk 3.24.20 info: xfce4-panel wm: xfwm4 dm: LightDM 1.30.0 Distro: Manjaro Linux 
Machine:   Type: Desktop Mobo: ASRock model: AB350M Pro4 serial: <filter> UEFI: American Megatrends v: P5.90 date: 07/03/2019 
CPU:       Topology: 8-Core model: AMD Ryzen 7 1700 bits: 64 type: MT MCP arch: Zen family: 17 (23) model-id: 1 stepping: 1 
           microcode: 8001138 L2 cache: 4096 KiB 
           flags: avx avx2 lm nx pae sse sse2 sse3 sse4_1 sse4_2 sse4a ssse3 svm bogomips: 96018 
           Speed: 2758 MHz min/max: N/A Core speeds (MHz): 1: 2758 2: 2709 3: 2583 4: 2828 5: 2649 6: 2626 7: 2690 8: 2617 
           9: 2589 10: 2990 11: 2695 12: 2791 13: 2634 14: 2636 15: 2730 16: 2543 
           Vulnerabilities: Type: itlb_multihit status: Not affected 
           Type: l1tf status: Not affected 
           Type: mds status: Not affected 
           Type: meltdown status: Not affected 
           Type: spec_store_bypass mitigation: Speculative Store Bypass disabled via prctl and seccomp 
           Type: spectre_v1 mitigation: usercopy/swapgs barriers and __user pointer sanitization 
           Type: spectre_v2 mitigation: Full AMD retpoline, IBPB: conditional, STIBP: disabled, RSB filling 
           Type: srbds status: Not affected 
           Type: tsx_async_abort status: Not affected 
Graphics:  Device-1: Advanced Micro Devices [AMD/ATI] Vega 10 XL/XT [Radeon RX Vega 56/64] driver: amdgpu v: kernel 
           bus ID: 09:00.0 chip ID: 1002:687f 
           Display: x11 server: X.Org 1.20.8 driver: amdgpu unloaded: modesetting alternate: ati,fbdev,vesa display ID: :0.0 
           screens: 1 
           Screen-1: 0 s-res: 4480x2520 s-dpi: 96 s-size: 1185x667mm (46.7x26.3") s-diag: 1360mm (53.5") 
           Monitor-1: DisplayPort-0 res: 2560x1440 dpi: 90 size: 725x428mm (28.5x16.9") diag: 842mm (33.1") 
           Monitor-2: DisplayPort-1 res: 1920x1080 hz: 60 dpi: 96 size: 509x286mm (20.0x11.3") diag: 584mm (23") 
           Monitor-3: DisplayPort-2 res: 1920x1200 hz: 60 dpi: 94 size: 518x324mm (20.4x12.8") diag: 611mm (24.1") 
           Monitor-4: HDMI-A-0 res: 1920x1080 hz: 60 dpi: 96 size: 509x286mm (20.0x11.3") diag: 584mm (23") 
           OpenGL: renderer: Radeon RX Vega (VEGA10 DRM 3.38.0 5.8.3-2-MANJARO LLVM 10.0.1) v: 4.6 Mesa 20.1.6 
           direct render: Yes 
Audio:     Device-1: Advanced Micro Devices [AMD/ATI] Vega 10 HDMI Audio [Radeon Vega 56/64] driver: snd_hda_intel v: kernel 
           bus ID: 09:00.1 chip ID: 1002:aaf8 
           Device-2: Advanced Micro Devices [AMD] Family 17h HD Audio vendor: ASRock driver: snd_hda_intel v: kernel 
           bus ID: 0b:00.3 chip ID: 1022:1457 
           Sound Server: ALSA v: k5.8.3-2-MANJARO 
Network:   Device-1: Realtek RTL8111/8168/8411 PCI Express Gigabit Ethernet vendor: ASRock driver: r8168 v: 8.048.03-NAPI 
           modules: r8169 port: f000 bus ID: 05:00.0 chip ID: 10ec:8168 
           IF: enp24s0 state: up speed: 1000 Mbps duplex: full mac: <filter> 
Drives:    Local Storage: total: 3.68 TiB used: 3.46 TiB (94.0%) 
           SMART Message: Unable to run smartctl. Root privileges required. 
           ID-1: /dev/nvme0n1 vendor: Intel model: SSDPEKNW020T8 size: 1.86 TiB block size: physical: 512 B logical: 512 B 
           speed: 31.6 Gb/s lanes: 4 serial: <filter> rev: 002C scheme: GPT 
           ID-2: /dev/sda vendor: Western Digital model: WDS200T2B0B-00YS70 size: 1.82 TiB block size: physical: 512 B 
           logical: 512 B speed: 6.0 Gb/s serial: <filter> rev: 40WD scheme: GPT 
Partition: ID-1: / raw size: 1.86 TiB size: 1.83 TiB (98.38%) used: 1.72 TiB (93.9%) fs: ext4 dev: /dev/nvme0n1p2 
Swap:      Kernel: swappiness: 1 (default 60) cache pressure: 100 (default) 
           ID-1: swap-1 type: file size: 16.00 GiB used: 107.0 MiB (0.7%) priority: -2 file: /swapfile 
Sensors:   System Temperatures: cpu: 44.5 C mobo: 39.0 C gpu: amdgpu temp: 46 C 
           Fan Speeds (RPM): fan-1: 0 fan-2: 1726 fan-3: 0 fan-4: 0 fan-5: 0 gpu: amdgpu fan: 1164 
           Voltages: 12v: N/A 5v: N/A 3.3v: 3.34 vbat: 3.26 
Info:      Processes: 383 Uptime: 1d 3h 29m Memory: 15.56 GiB used: 7.16 GiB (46.0%) Init: systemd v: 246 Compilers: 
           gcc: 10.2.0 clang: 10.0.1 Packages: 1717 pacman: 1712 lib: 489 snap: 5 Shell: Bash v: 5.0.18 
           running in: xfce4-terminal inxi: 3.1.05

megavolt · 30 August 2020 12:44

I guess you must have at least 2 PCIe devices plugged in. If one of them is working at a lower rate while the others are working normally at a higher rate, then the other devices adapt to the lower one. This is somehow reported here as an error, which gets corrected. Due to the UEFI implementation, the kernel handles this more or less nicely.

I think that’s the reason. Maybe I am wrong?
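To test the link-rate idea, you can check sysfs. The kernel exposes each PCIe function's negotiated and maximum link speed there, as `current_link_speed` and `max_link_speed` under /sys/bus/pci/devices/. Here is a sketch that lists devices whose link runs below its maximum. The function name is hypothetical. A slower link is not an error by itself: an endpoint may simply be slower than the port it is plugged into, and some devices change link speed for power management.

```python
from pathlib import Path
from typing import Iterator, Tuple


def slower_links(root: str = "/sys/bus/pci/devices") -> Iterator[Tuple[str, str, str]]:
    """Yield (address, current, maximum) for devices whose link is below its maximum speed."""
    for dev in sorted(Path(root).iterdir()):
        cur, mx = dev / "current_link_speed", dev / "max_link_speed"
        if not (cur.is_file() and mx.is_file()):
            continue  # e.g. not a PCIe function, or a kernel without these files
        try:
            current = cur.read_text().strip()
            maximum = mx.read_text().strip()
        except OSError:
            continue  # some devices may refuse the read
        if current != maximum:
            yield dev.name, current, maximum


if __name__ == "__main__":
    for address, current, maximum in slower_links():
        print(f"{address}: running at {current}, capable of {maximum}")
```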

The kernel option pci=nommconf disables Memory-Mapped PCI Configuration Space, which has been available in Linux since kernel 2.6. Very roughly, every PCI device has an area that describes it (which you see with lspci -vv), and the original method of accessing this area involves going through I/O ports, while PCIe allows this space to be mapped into memory for simpler access.

That means that, in this particular case, something goes wrong when the PCIe controller uses this method to access the configuration space of a particular device. It may be a hardware bug in the device, in the PCIe root controller on the motherboard, in the specific interaction of those two, or something else.

By using pci=nommconf, the configuration space of all devices will be accessed in the original way, and changing the access method works around this problem. So, if you want, it’s both resolving and suppressing it.
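The quoted explanation contrasts two ways of addressing configuration space, as laid out in the PCI and PCIe specifications. The legacy mechanism writes an address to I/O port 0xCF8 and then moves data through port 0xCFC. ECAM (the Enhanced Configuration Access Mechanism, which Linux calls MMCONFIG) instead gives each function a 4 KiB window at a fixed offset from a base address that the firmware reports in the ACPI MCFG table. This sketch only computes the addresses and does not touch hardware. In its standard form the legacy address holds an 8-bit register offset, so it reaches only the first 256 bytes of each function's configuration space.

```python
def _check(bus: int, dev: int, fn: int) -> None:
    if not (0 <= bus < 256 and 0 <= dev < 32 and 0 <= fn < 8):
        raise ValueError("bus 0-255, device 0-31, function 0-7")


def cf8_address(bus: int, dev: int, fn: int, reg: int) -> int:
    """Value written to I/O port 0xCF8 before accessing the dword at port 0xCFC."""
    _check(bus, dev, fn)
    if not 0 <= reg < 256:
        raise ValueError("the standard legacy mechanism reaches offsets 0-255 only")
    # bit 31 = enable, 23:16 = bus, 15:11 = device, 10:8 = function, 7:2 = dword
    return 0x80000000 | (bus << 16) | (dev << 11) | (fn << 8) | (reg & 0xFC)


def ecam_offset(bus: int, dev: int, fn: int, reg: int) -> int:
    """Byte offset from the ECAM/MMCONFIG base; each function gets 4 KiB."""
    _check(bus, dev, fn)
    if not 0 <= reg < 4096:
        raise ValueError("configuration space is 4096 bytes per function")
    return (bus << 20) | (dev << 15) | (fn << 12) | reg


if __name__ == "__main__":
    # The bridge from this thread, 00:01.3, register offset 0 (vendor/device ID).
    print(hex(cf8_address(0x00, 0x01, 3, 0)))  # 0x80000b00
    print(hex(ecam_offset(0x00, 0x01, 3, 0)))  # 0xb000
```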

Source

If possible, try to upgrade your UEFI.
Add pci=nommconf to GRUB_CMDLINE_LINUX= in your /etc/default/grub and run sudo update-grub.
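Editing that line by hand is easy. If you would rather script it, here is a sketch that adds a parameter without duplicating it, which is handy when swapping parameters later. It assumes the variable is written as NAME="..." on a single line. `add_kernel_param` is a hypothetical helper: it only transforms text, so you still write the file back as root and run `sudo update-grub` yourself.

```python
import re


def add_kernel_param(grub_text: str, param: str, var: str = "GRUB_CMDLINE_LINUX") -> str:
    """Return grub_text with param added to var="...", unless it is already there."""
    # Anchored on 'NAME="', so GRUB_CMDLINE_LINUX does not match GRUB_CMDLINE_LINUX_DEFAULT.
    pattern = re.compile(rf'^({re.escape(var)}=")([^"]*)(")', re.MULTILINE)

    def repl(m) -> str:
        params = m.group(2).split()
        if param not in params:
            params.append(param)
        return m.group(1) + " ".join(params) + m.group(3)

    new_text, count = pattern.subn(repl, grub_text, count=1)
    if count == 0:
        # The variable is missing entirely: append a new line for it.
        new_text = grub_text.rstrip("\n") + f'\n{var}="{param}"\n'
    return new_text


if __name__ == "__main__":
    # Print the result for review; writing it back needs root.
    with open("/etc/default/grub") as f:
        print(add_kernel_param(f.read(), "pci=nommconf"))
```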

And yes, you can upgrade your UEFI:

Latest Version: v: P6.60 date: 2020/8/13

deemon · 30 August 2020 13:05

Sadly, it isn’t. I upgraded the BIOS/UEFI to the latest version that the vendor doesn’t warn against using with first-gen Ryzen.

Added it. We’ll see if this fixes anything.

But it’s still puzzling how this started with a stable update… Can the kernel or something else mess this up? Or might the problem have been there all along and now just somehow became “more visible”?

*ASRock do NOT recommend updating this BIOS if Pinnacle, Raven, Summit or Bristol Ridge CPU is being used on your system.

I have a Ryzen 1700 => “Summit Ridge”. Basically, the BIOS updates from 6.00 onward are for the Ryzen 3000 series only.

deemon · 30 August 2020 20:10

Found another thing to try on the Level1Techs forum. I will soon swap pci=nommconf for pcie_aspm=off and see if that also works.

Although it has been error-free for about 7 hours with pci=nommconf, many say it may impact performance in one way or another. And I can always come back to it if pcie_aspm=off doesn’t work.
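To compare the two parameters fairly, it helps to count the errors per boot. Here is a sketch that tallies the AER detail lines (`[ 7] BadDLLP`, `[12] Timeout` and so on) from kernel log text piped in on stdin, for example from `journalctl -k -b -1` for the previous boot. The script name in the usage comment is hypothetical. On the log in the first post it would count 5 BadDLLP and 3 Timeout.

```python
import re
import sys
from collections import Counter
from typing import Iterable

# Detail line of an AER report, e.g. "pcieport 0000:00:01.3: AER:    [ 7] BadDLLP"
DETAIL = re.compile(r"AER:\s+\[\s*(\d+)\]\s+(\w+)")


def count_aer_errors(lines: Iterable[str]) -> Counter:
    """Count AER detail lines by error name."""
    counts: Counter = Counter()
    for line in lines:
        m = DETAIL.search(line)
        if m:
            counts[m.group(2)] += 1
    return counts


if __name__ == "__main__":
    # Usage: journalctl -k -b -1 | python3 count_aer.py
    for name, n in count_aer_errors(sys.stdin).most_common():
        print(f"{n:5d}  {name}")
```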

Edit: and wtf is “you can’t add links to your posts”??? Why am I blocked from adding sources?

megavolt · 30 August 2020 20:16

Yes, I already have Trust Level 2. You can add links and attachments once you have Trust Level 1.

Go to this Badges page and scroll down. Here is more information on how to reach it: Understanding Discourse Trust Levels

deemon · 30 August 2020 21:43

Which is weird, as I was able to post a link a few days ago… Did someone demote my trust level here? What is going on?

system · 2 September 2020 21:43

This topic was automatically closed 3 days after the last reply. New replies are no longer allowed.