System frequently crashing after GPU drivers update

Hans12 · 8 July 2021 07:52

For me, mesa 21.1.4 finally fixes the problem! For everybody who is interested in testing it:

sudo pacman -Syu
sudo pacman -S yay
yay -S downgrade
sudo downgrade --ala-only mesa
Select option 102 for mesa 21.1.4
Reboot to apply changes.

Find the changelog, including “our” bug, here: Mesa 21.1.4 Release Notes (2021-06-30), in the Mesa 3D Graphics Library documentation.
Thanks to the Mesa, firmware, Linux kernel, and Manjaro developers! Maybe we can support them by donating a coffee?

poynting_factor · 8 July 2021 14:54

Glad to hear it fixed your case! I have just broken a ~50-days-without-freezing streak lol, but my case has improved a lot too.

I just have a question here. So your fix is to stick to mesa 21.1.4, regardless of any later versions? I thought that was the latest version, but I might be wrong. I’m confused because you used a downgrade.

Thanks for sharing your experience!

Hans12 · 8 July 2021 15:36

I am just using the downgrade command to actually upgrade to a version that is not yet available in the standard Manjaro repos. Just check your current version of mesa: for you this will be an upgrade. Alternatively, wait a few more days and get the new mesa via pacman -Syu.

So for me: My notebook crashed 4-5 times a day before. Now this is fixed with mesa-21.1.4.
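
For anyone unsure whether this would be an upgrade or a downgrade on their machine, here is a minimal sketch of the version check. It assumes GNU sort (for `sort -V`); `version_ge` is a hypothetical helper name, not a pacman feature, and 21.1.4 is simply the version reported in this thread.

```bash
#!/usr/bin/env bash
# Sketch: compare the installed mesa version with the one discussed here.
# version_ge is a hypothetical helper, not part of pacman.

# Returns 0 if version $1 >= version $2 (GNU sort version ordering).
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

wanted="21.1.4"

# Only query pacman where it exists (Manjaro/Arch).
if command -v pacman >/dev/null 2>&1; then
    # `pacman -Q mesa` prints e.g. "mesa 21.1.4-1".
    # Strip an optional epoch ("1:") and the trailing package release ("-1").
    installed="$(pacman -Q mesa | awk '{print $2}' | sed -e 's/^[0-9]*://' -e 's/-[^-]*$//')"
    if version_ge "$installed" "$wanted"; then
        echo "mesa $installed: already at or above $wanted"
    else
        echo "mesa $installed: older than $wanted"
    fi
fi
```

If it reports an older version, the downgrade step above is effectively an upgrade, as described.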

freggel.doe · 9 July 2021 10:05

Actually it is available - just not in all branches, see https://manjaro.org/branch-compare/?query=mesa

Cencil · 11 July 2021 02:18

Just registered to say thank you!
For me it also seems to have fixed a memory leak: every time I woke the PC from standby, about 1 GB, or maybe half that, more RAM was in use than before (not sure, I’m using standby A LOT).

I can safely say that both problems are gone now.

Cencil · 12 July 2021 21:29

Celebrated too early, it just crashed again. At least I was able to start a new tty and do a normal reboot.

5.12.9-1-MANJARO
linux-firmware 20210518
AMD Ryzen 5 3400G

edit: As recommended in the Arch forums, I have now downgraded the kernel to 5.10 LTS (instead of 5.11 as in the Arch forums) and linux-firmware to 20210315.3568f96-2. Will edit this post if I see the crash again.

Also enabled ssh in case I can’t even reach the tty anymore.

Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu: [gfxhub0] retry page fault (src_id:0 ring:0 vmid:3 pasid:32770, for process chrome pid 178074 thread chrome:cs0 pid 178102)
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu:   in page starting at address 0x800109700000 from client 27
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00341051
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu:          Faulty UTCL2 client ID: TCP (0x8)
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu:          MORE_FAULTS: 0x1
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu:          WALKER_ERROR: 0x0
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu:          PERMISSION_FAULTS: 0x5
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu:          MAPPING_ERROR: 0x0
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu:          RW: 0x1
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu: [gfxhub0] retry page fault (src_id:0 ring:0 vmid:3 pasid:32770, for process chrome pid 178074 thread chrome:cs0 pid 178102)
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu:   in page starting at address 0x800109701000 from client 27
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00341051
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu:          Faulty UTCL2 client ID: TCP (0x8)
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu:          MORE_FAULTS: 0x1
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu:          WALKER_ERROR: 0x0
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu:          PERMISSION_FAULTS: 0x5
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu:          MAPPING_ERROR: 0x0
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu:          RW: 0x1
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu: [gfxhub0] retry page fault (src_id:0 ring:0 vmid:3 pasid:32770, for process chrome pid 178074 thread chrome:cs0 pid 178102)
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu:   in page starting at address 0x800109702000 from client 27
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00341051
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu:          Faulty UTCL2 client ID: TCP (0x8)
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu:          MORE_FAULTS: 0x1
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu:          WALKER_ERROR: 0x0
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu:          PERMISSION_FAULTS: 0x5
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu:          MAPPING_ERROR: 0x0
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu:          RW: 0x1
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu: [gfxhub0] retry page fault (src_id:0 ring:0 vmid:3 pasid:32770, for process chrome pid 178074 thread chrome:cs0 pid 178102)
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu:   in page starting at address 0x800109704000 from client 27
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00341051
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu:          Faulty UTCL2 client ID: TCP (0x8)
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu:          MORE_FAULTS: 0x1
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu:          WALKER_ERROR: 0x0
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu:          PERMISSION_FAULTS: 0x5
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu:          MAPPING_ERROR: 0x0
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu:          RW: 0x1
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu: [gfxhub0] retry page fault (src_id:0 ring:0 vmid:3 pasid:32770, for process chrome pid 178074 thread chrome:cs0 pid 178102)
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu:   in page starting at address 0x800109709000 from client 27
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00341051
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu:          Faulty UTCL2 client ID: TCP (0x8)
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu:          MORE_FAULTS: 0x1
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu:          WALKER_ERROR: 0x0
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu:          PERMISSION_FAULTS: 0x5
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu:          MAPPING_ERROR: 0x0
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu:          RW: 0x1
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu: [gfxhub0] retry page fault (src_id:0 ring:0 vmid:3 pasid:32770, for process chrome pid 178074 thread chrome:cs0 pid 178102)
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu:   in page starting at address 0x800109707000 from client 27
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00341051
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu:          Faulty UTCL2 client ID: TCP (0x8)
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu:          MORE_FAULTS: 0x1
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu:          WALKER_ERROR: 0x0
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu:          PERMISSION_FAULTS: 0x5
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu:          MAPPING_ERROR: 0x0
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu:          RW: 0x1
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu: [gfxhub0] retry page fault (src_id:0 ring:0 vmid:3 pasid:32770, for process chrome pid 178074 thread chrome:cs0 pid 178102)
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu:   in page starting at address 0x800109705000 from client 27
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00341051
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu:          Faulty UTCL2 client ID: TCP (0x8)
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu:          MORE_FAULTS: 0x1
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu:          WALKER_ERROR: 0x0
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu:          PERMISSION_FAULTS: 0x5
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu:          MAPPING_ERROR: 0x0
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu:          RW: 0x1
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu: [gfxhub0] retry page fault (src_id:0 ring:0 vmid:3 pasid:32770, for process chrome pid 178074 thread chrome:cs0 pid 178102)
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu:   in page starting at address 0x800109708000 from client 27
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00341051
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu:          Faulty UTCL2 client ID: TCP (0x8)
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu:          MORE_FAULTS: 0x1
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu:          WALKER_ERROR: 0x0
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu:          PERMISSION_FAULTS: 0x5
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu:          MAPPING_ERROR: 0x0
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu:          RW: 0x1
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu: [gfxhub0] retry page fault (src_id:0 ring:0 vmid:3 pasid:32770, for process chrome pid 178074 thread chrome:cs0 pid 178102)
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu:   in page starting at address 0x800109706000 from client 27
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00341051
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu:          Faulty UTCL2 client ID: TCP (0x8)
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu:          MORE_FAULTS: 0x1
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu:          WALKER_ERROR: 0x0
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu:          PERMISSION_FAULTS: 0x5
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu:          MAPPING_ERROR: 0x0
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu:          RW: 0x1
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu: [gfxhub0] retry page fault (src_id:0 ring:0 vmid:3 pasid:32770, for process chrome pid 178074 thread chrome:cs0 pid 178102)
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu:   in page starting at address 0x80010970a000 from client 27
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00341051
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu:          Faulty UTCL2 client ID: TCP (0x8)
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu:          MORE_FAULTS: 0x1
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu:          WALKER_ERROR: 0x0
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu:          PERMISSION_FAULTS: 0x5
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu:          MAPPING_ERROR: 0x0
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu:          RW: 0x1
Jul 12 23:00:15 x300 kernel: [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out!
Jul 12 23:00:15 x300 kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=6305507, emitted seq=6305509
Jul 12 23:00:15 x300 kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process chrome pid 178074 thread chrome:cs0 pid 178102
Jul 12 23:00:15 x300 kernel: amdgpu 0000:05:00.0: amdgpu: GPU reset begin!
Jul 12 23:00:15 x300 kernel: amdgpu 0000:05:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0000 address=0x11563d7c0 flags=0x0070]
Jul 12 23:00:15 x300 kernel: amdgpu 0000:05:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0000 address=0x11563d7e0 flags=0x0070]
Jul 12 23:00:15 x300 kernel: amdgpu 0000:05:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0000 address=0x11563d800 flags=0x0070]
Jul 12 23:00:15 x300 kernel: amdgpu 0000:05:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0000 address=0x11563d820 flags=0x0070]
Jul 12 23:00:15 x300 kernel: amdgpu 0000:05:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0000 address=0x11563d840 flags=0x0070]
Jul 12 23:00:15 x300 kernel: amdgpu 0000:05:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0000 address=0x11563d860 flags=0x0070]
Jul 12 23:00:15 x300 kernel: amdgpu 0000:05:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0000 address=0x115640000 flags=0x0070]
Jul 12 23:00:15 x300 kernel: amdgpu 0000:05:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0000 address=0x11563d880 flags=0x0070]
Jul 12 23:00:15 x300 kernel: amdgpu 0000:05:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0000 address=0x115640000 flags=0x0070]
Jul 12 23:00:15 x300 kernel: amdgpu 0000:05:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0000 address=0x115640000 flags=0x0070]
Jul 12 23:00:15 x300 kernel: amd_iommu_report_page_fault: 21 callbacks suppressed
Jul 12 23:00:15 x300 kernel: AMD-Vi: Event logged [IO_PAGE_FAULT device=05:00.0 domain=0x0000 address=0x11563d8a0 flags=0x0070]
Jul 12 23:00:15 x300 kernel: AMD-Vi: Event logged [IO_PAGE_FAULT device=05:00.0 domain=0x0000 address=0x11563d8c0 flags=0x0070]
Jul 12 23:00:15 x300 kernel: AMD-Vi: Event logged [IO_PAGE_FAULT device=05:00.0 domain=0x0000 address=0x115640000 flags=0x0070]
Jul 12 23:00:15 x300 kernel: AMD-Vi: Event logged [IO_PAGE_FAULT device=05:00.0 domain=0x0000 address=0x11563d8e0 flags=0x0070]
Jul 12 23:00:15 x300 kernel: AMD-Vi: Event logged [IO_PAGE_FAULT device=05:00.0 domain=0x0000 address=0x11563d900 flags=0x0070]
Jul 12 23:00:15 x300 kernel: AMD-Vi: Event logged [IO_PAGE_FAULT device=05:00.0 domain=0x0000 address=0x115640000 flags=0x0070]
Jul 12 23:00:15 x300 kernel: AMD-Vi: Event logged [IO_PAGE_FAULT device=05:00.0 domain=0x0000 address=0x11563d920 flags=0x0070]
Jul 12 23:00:15 x300 kernel: AMD-Vi: Event logged [IO_PAGE_FAULT device=05:00.0 domain=0x0000 address=0x11563d940 flags=0x0070]
Jul 12 23:00:15 x300 kernel: AMD-Vi: Event logged [IO_PAGE_FAULT device=05:00.0 domain=0x0000 address=0x115640000 flags=0x0070]
Jul 12 23:00:15 x300 kernel: AMD-Vi: Event logged [IO_PAGE_FAULT device=05:00.0 domain=0x0000 address=0x11563d960 flags=0x0070]
Jul 12 23:00:15 x300 kernel: [drm] free PSP TMR buffer
Jul 12 23:00:15 x300 kernel: mce: [Hardware Error]: Machine check events logged
Jul 12 23:00:15 x300 kernel: [Hardware Error]: Deferred error, no action required.
Jul 12 23:00:15 x300 kernel: amdgpu 0000:05:00.0: amdgpu: MODE2 reset
Jul 12 23:00:15 x300 kernel: [Hardware Error]: CPU:0 (17:18:1) MC20_STATUS[-|-|-|AddrV|-|-|SyndV|UECC|Deferred|-|-]: 0x942030000001085b
Jul 12 23:00:15 x300 kernel: [Hardware Error]: Error Addr: 0x00007ffcffffff00
Jul 12 23:00:15 x300 kernel: [Hardware Error]: IPID: 0x0000002e00000000, Syndrome: 0x000000005b240203
Jul 12 23:00:15 x300 kernel: [Hardware Error]: Coherent Slave Ext. Error Code: 1, Address Violation.
Jul 12 23:00:15 x300 kernel: [Hardware Error]: cache level: L3/GEN, mem/io: IO, mem-tx: IRD, part-proc: SRC (no timeout)
Jul 12 23:00:15 x300 kernel: amdgpu 0000:05:00.0: amdgpu: GPU reset succeeded, trying to resume
Jul 12 23:00:15 x300 kernel: [drm] PCIE GART of 1024M enabled (table at 0x000000F400E10000).
Jul 12 23:00:15 x300 kernel: [drm] PSP is resuming...
Jul 12 23:00:15 x300 kernel: [drm] reserve 0x400000 from 0xf47fc00000 for PSP TMR
Jul 12 23:00:15 x300 kernel: amdgpu 0000:05:00.0: amdgpu: RAS: optional ras ta ucode is not available
Jul 12 23:00:15 x300 kernel: amdgpu 0000:05:00.0: amdgpu: RAP: optional rap ta ucode is not available
Jul 12 23:00:15 x300 kernel: [drm] kiq ring mec 2 pipe 1 q 0
Jul 12 23:00:15 x300 kernel: [drm] VCN decode and encode initialized successfully(under SPG Mode).
Jul 12 23:00:15 x300 kernel: amdgpu 0000:05:00.0: amdgpu: ring gfx uses VM inv eng 0 on hub 0
Jul 12 23:00:15 x300 kernel: amdgpu 0000:05:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
Jul 12 23:00:15 x300 kernel: amdgpu 0000:05:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
Jul 12 23:00:15 x300 kernel: amdgpu 0000:05:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 5 on hub 0
Jul 12 23:00:15 x300 kernel: amdgpu 0000:05:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 6 on hub 0
Jul 12 23:00:15 x300 kernel: amdgpu 0000:05:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 7 on hub 0
Jul 12 23:00:15 x300 kernel: amdgpu 0000:05:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 8 on hub 0
Jul 12 23:00:15 x300 kernel: amdgpu 0000:05:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 9 on hub 0
Jul 12 23:00:15 x300 kernel: amdgpu 0000:05:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 10 on hub 0
Jul 12 23:00:15 x300 kernel: amdgpu 0000:05:00.0: amdgpu: ring kiq_2.1.0 uses VM inv eng 11 on hub 0
Jul 12 23:00:15 x300 kernel: amdgpu 0000:05:00.0: amdgpu: ring sdma0 uses VM inv eng 0 on hub 1
Jul 12 23:00:15 x300 kernel: amdgpu 0000:05:00.0: amdgpu: ring vcn_dec uses VM inv eng 1 on hub 1
Jul 12 23:00:15 x300 kernel: amdgpu 0000:05:00.0: amdgpu: ring vcn_enc0 uses VM inv eng 4 on hub 1
Jul 12 23:00:15 x300 kernel: amdgpu 0000:05:00.0: amdgpu: ring vcn_enc1 uses VM inv eng 5 on hub 1
Jul 12 23:00:15 x300 kernel: amdgpu 0000:05:00.0: amdgpu: ring jpeg_dec uses VM inv eng 6 on hub 1
Jul 12 23:00:16 x300 kernel: amdgpu 0000:05:00.0: amdgpu: recover vram bo from shadow start
Jul 12 23:00:16 x300 kernel: amdgpu 0000:05:00.0: amdgpu: recover vram bo from shadow done
Jul 12 23:00:16 x300 kernel: [drm] Skip scheduling IBs!
Jul 12 23:00:16 x300 kernel: amdgpu 0000:05:00.0: amdgpu: GPU reset(2) succeeded!
Jul 12 23:00:26 x300 kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered
Jul 12 23:00:26 x300 systemd[1]: Started Getty on tty2.
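
A log like the one above is easier to compare between crashes when it is reduced to a few counts. The following is only a sketch: `summarize_amdgpu_faults` is a hypothetical helper name, and the patterns are taken from the message texts shown in this post, so other GPU generations may log different wording.

```bash
#!/usr/bin/env bash
# Sketch: reduce amdgpu kernel messages (read from stdin) to three counters.
# summarize_amdgpu_faults is a hypothetical helper; the patterns match the
# message texts in the log above and are not an exhaustive list.
summarize_amdgpu_faults() {
    awk '
        /retry page fault/ { faults++ }
        /in page starting at address/ {
            # The field after the word "address" is the faulting page address.
            for (i = 1; i < NF; i++) if ($i == "address") addrs[$(i + 1)] = 1
        }
        /ring gfx timeout/ { timeouts++ }
        END {
            n = 0
            for (a in addrs) n++
            printf "page_faults=%d distinct_pages=%d gfx_timeouts=%d\n", faults, n, timeouts
        }'
}

# Example (assumes a persistent journal, so the previous boot is still available):
#   journalctl -k -b -1 --no-pager | summarize_amdgpu_faults
```

`journalctl -k` limits the output to kernel messages and `-b -1` selects the previous boot, which is usually the one that crashed.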

lordsansui · 14 July 2021 13:29

Thanks Hans12,

I’ve been using this new Mesa for one week and it looks like it fixes my issue too. I will give it more time before truly confirming.

At first I updated it the way you suggested, and this week the Manjaro Stable branch was updated to Mesa 21.1.4 anyway.

I hope others can benefit from the same fix.

Superblazer · 14 July 2021 13:49

What firmware version are you using? Is everything up to date?

mha-k · 14 July 2021 20:08

I can confirm what Cencil reported. The issue is also not solved on Polaris GPUs. I fully updated my system yesterday, and this evening, after a couple of YouTube videos, it crashed again.

Kernel 5.12
linux-firmware 20210629
AMD RX 480

-- Journal begins at Thu 2021-01-21 08:14:09 CET, ends at Wed 2021-07-14 22:12:10 CEST. --
Jul 14 20:57:07 ManjaroGamingPC kernel: [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out!
Jul 14 20:57:07 ManjaroGamingPC kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered
Jul 14 20:57:07 ManjaroGamingPC kernel: amdgpu 0000:26:00.0: amdgpu: GPU fault detected: 146 0x0048080c for process plasmashell pid 1790 thread plasmashel:cs0 pid 1878
Jul 14 20:57:07 ManjaroGamingPC kernel: amdgpu 0000:26:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000009
Jul 14 20:57:07 ManjaroGamingPC kernel: amdgpu 0000:26:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0400800C
Jul 14 20:57:07 ManjaroGamingPC kernel: amdgpu 0000:26:00.0: amdgpu: VM fault (0x0c, vmid 2, pasid 32772) at page 9, read from 'TC0' (0x54433000) (8)
Jul 14 20:57:07 ManjaroGamingPC kernel: amdgpu 0000:26:00.0: amdgpu: GPU fault detected: 146 0x0068040c for process plasmashell pid 1790 thread plasmashel:cs0 pid 1878
Jul 14 20:57:07 ManjaroGamingPC kernel: amdgpu 0000:26:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x0010084D
Jul 14 20:57:07 ManjaroGamingPC kernel: amdgpu 0000:26:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x04008008
Jul 14 20:57:07 ManjaroGamingPC kernel: amdgpu 0000:26:00.0: amdgpu: VM fault (0x08, vmid 2, pasid 32772) at page 1050701, read from 'TC0' (0x54433000) (8)
Jul 14 20:57:17 ManjaroGamingPC kernel: amdgpu 0000:26:00.0: amdgpu: GPU fault detected: 147 0x0aa02008 for process plasmashell pid 1790 thread plasmashel:cs0 pid 1878
Jul 14 20:57:17 ManjaroGamingPC kernel: amdgpu 0000:26:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00108554
Jul 14 20:57:17 ManjaroGamingPC kernel: amdgpu 0000:26:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x04020008
Jul 14 20:57:17 ManjaroGamingPC kernel: amdgpu 0000:26:00.0: amdgpu: VM fault (0x08, vmid 2, pasid 32772) at page 1082708, read from 'CB2' (0x43423200) (32)
Jul 14 20:57:17 ManjaroGamingPC kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=1258798, emitted seq=1258801
Jul 14 20:57:17 ManjaroGamingPC kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process plasmashell pid 1790 thread plasmashel:cs0 pid 1878
Jul 14 20:57:17 ManjaroGamingPC kernel: amdgpu 0000:26:00.0: amdgpu: GPU reset begin!
Jul 14 20:57:21 ManjaroGamingPC kernel: amdgpu 0000:26:00.0: amdgpu: failed to suspend display audio
Jul 14 20:57:21 ManjaroGamingPC kernel: amdgpu 0000:26:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring kiq_2.1.0 test failed (-110)
Jul 14 20:57:21 ManjaroGamingPC kernel: [drm:gfx_v8_0_hw_fini [amdgpu]] *ERROR* KCQ disable failed
Jul 14 20:57:22 ManjaroGamingPC kernel: amdgpu: cp is busy, skip halt cp
Jul 14 20:57:22 ManjaroGamingPC kernel: amdgpu: rlc is busy, skip halt rlc
Jul 14 20:57:22 ManjaroGamingPC kernel: amdgpu 0000:26:00.0: amdgpu: BACO reset
Jul 14 20:57:22 ManjaroGamingPC kernel: amdgpu 0000:26:00.0: amdgpu: GPU reset succeeded, trying to resume
Jul 14 20:57:22 ManjaroGamingPC kernel: [drm] PCIE GART of 256M enabled (table at 0x000000F400300000).
Jul 14 20:57:22 ManjaroGamingPC kernel: [drm] VRAM is lost due to GPU reset!
Jul 14 20:57:24 ManjaroGamingPC kernel: [drm:uvd_v6_0_start [amdgpu]] *ERROR* UVD not responding, trying to reset the VCPU!!!
Jul 14 20:57:25 ManjaroGamingPC kernel: [drm:uvd_v6_0_start [amdgpu]] *ERROR* UVD not responding, trying to reset the VCPU!!!
Jul 14 20:57:26 ManjaroGamingPC kernel: [drm:uvd_v6_0_start [amdgpu]] *ERROR* UVD not responding, trying to reset the VCPU!!!
Jul 14 20:57:27 ManjaroGamingPC kernel: [drm:uvd_v6_0_start [amdgpu]] *ERROR* UVD not responding, trying to reset the VCPU!!!
Jul 14 20:57:28 ManjaroGamingPC kernel: [drm:uvd_v6_0_start [amdgpu]] *ERROR* UVD not responding, trying to reset the VCPU!!!
Jul 14 20:57:29 ManjaroGamingPC kernel: [drm:uvd_v6_0_start [amdgpu]] *ERROR* UVD not responding, trying to reset the VCPU!!!
Jul 14 20:57:30 ManjaroGamingPC kernel: [drm:uvd_v6_0_start [amdgpu]] *ERROR* UVD not responding, trying to reset the VCPU!!!
Jul 14 20:57:31 ManjaroGamingPC kernel: [drm:uvd_v6_0_start [amdgpu]] *ERROR* UVD not responding, trying to reset the VCPU!!!
Jul 14 20:57:32 ManjaroGamingPC kernel: [drm:uvd_v6_0_start [amdgpu]] *ERROR* UVD not responding, trying to reset the VCPU!!!
Jul 14 20:57:32 ManjaroGamingPC kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
Jul 14 20:57:32 ManjaroGamingPC kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
Jul 14 20:57:33 ManjaroGamingPC kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
Jul 14 20:57:33 ManjaroGamingPC kernel: [drm:uvd_v6_0_start [amdgpu]] *ERROR* UVD not responding, trying to reset the VCPU!!!
Jul 14 20:57:33 ManjaroGamingPC kernel: [drm:uvd_v6_0_start [amdgpu]] *ERROR* UVD not responding, giving up!!!
Jul 14 20:57:33 ManjaroGamingPC kernel: [drm:amdgpu_device_ip_set_powergating_state [amdgpu]] *ERROR* set_powergating_state of IP block <uvd_v6_0> failed -1
Jul 14 20:57:33 ManjaroGamingPC kernel: amdgpu 0000:26:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring uvd test failed (-110)
Jul 14 20:57:33 ManjaroGamingPC kernel: [drm:amdgpu_device_ip_resume_phase2 [amdgpu]] *ERROR* resume of IP block <uvd_v6_0> failed -110
Jul 14 20:57:33 ManjaroGamingPC kernel: amdgpu 0000:26:00.0: amdgpu: GPU reset(3) failed
Jul 14 20:57:33 ManjaroGamingPC kernel: amdgpu 0000:26:00.0: amdgpu: GPU reset end with ret = -110
Jul 14 20:57:33 ManjaroGamingPC kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
Jul 14 20:57:33 ManjaroGamingPC kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
Jul 14 20:57:33 ManjaroGamingPC kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
Jul 14 20:57:43 ManjaroGamingPC kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered
Jul 14 20:57:53 ManjaroGamingPC kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered
Jul 14 20:58:02 ManjaroGamingPC kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
Jul 14 20:58:02 ManjaroGamingPC kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!

Hans12 · 19 July 2021 16:15

Yes. There seems to be another bug. I celebrated too early. I got another crash in the last few days with the latest linux-firmware. However, as described by @Cencil, downgrading linux-firmware seems to solve it for me.

lordsansui · 23 July 2021 18:32

I’ve been using it for two weeks and the improvement is clear, but yes, I can also confirm it’s not fully fixed. I got just 2 issues, and they were different. Going from at least one issue a day to 1 issue a week is a nice improvement.

The 1st one is the same, the famous:
20/07/2021 18:41 kernel [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out!

While the second one was:
22/07/2021 08:13 kernel [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=74945, emitted seq=74947

Cencil · 27 July 2021 12:49

My PC has been running for 13 days without a reboot, under heavy use, with the setup I posted before. However, a final fix would be awesome, so I can update the firmware and kernel someday…

lordsansui · 7 August 2021 15:02

The more I use Linux, the more new things I discover, as expected, and I’m seeing a lot of redundancy in Linux that makes life very hard for new users. It looks like the issue we are facing is related to the Mesa driver. Following the https://www.phoronix.com/ website, I discovered AMDVLK, and after some research it looks like it serves the same function as Mesa’s RADV, so this is a redundant component where the user can choose. In that sense, has anyone here tried using AMDVLK in place of Mesa’s RADV to see if it fixes the issue?
Or is the issue not related to this part of Mesa?

jpegxguy · 11 August 2021 01:35

The truth is, from all the various topics I’ve read, no one has a workaround. People try things and get disappointed because the issue manifests a week later. Essentially, we cannot reliably reproduce this bug, which is always fun.

Grimmzz · 16 August 2021 07:59

I am still on linux-firmware 20210315.3568f96 and have had no problems for weeks.

Will stay on this version as long as possible.

B007C0DE · 18 August 2021 17:06

Same here. After trying many things the only workaround that reliably prevents the crashes is locking linux-firmware to the above mentioned version.

lordsansui · 18 August 2021 17:53

Which linux-firmware version are you referring to?
How can I check which one my system has and lock it too?

I tried a lot of different Linux commands for listing system info and didn’t find linux-firmware in any of the results.

acarasimon96 · 19 August 2021 02:11

I second this, having recently purchased a refurbished Lenovo Thinkpad E595 with a Radeon RX Vega 10 (Picasso architecture) integrated GPU and experienced those random KWin reset moments on that PC. Since I downgraded linux-firmware on the laptop last week, I haven’t run into the graphics reset bug so far.

Meanwhile, my desktop with a Radeon RX 570 (Polaris) has never had that same graphics freeze and reset issue, so I have no incentive to lock linux-firmware on that system.

I downgraded that package to 20210511.7685cf4 according to this reply. So far I haven’t had the freeze-and-reset behavior come back since installing that version.

lupo2010 · 19 August 2021 10:24

Steps from Hans12:

Here are your steps:

sudo pacman -S yay #(or install a similar AUR tool if not already installed)
optional: yay -Syu #(will update your system)
yay -S downgrade #(install downgrade)
sudo downgrade --ala-only linux-firmware
select an option from March (36) or earlier
optional: [Y] put linux-firmware on the ignore list, to prevent future updates
optional: reboot to apply changes immediately

I downgraded to linux-firmware-20210208
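
To check afterwards which linux-firmware is installed and whether it is really held back, something like the following sketch may help. It assumes the ignore list ends up as an `IgnorePkg` line in /etc/pacman.conf; `is_ignored` is a hypothetical helper name, and it only does an exact name match, so glob patterns in `IgnorePkg` are not handled.

```bash
#!/usr/bin/env bash
# Sketch: show the installed linux-firmware and whether pacman will skip it.
# is_ignored is a hypothetical helper, not a pacman command.

# Returns 0 if package $1 is named on an uncommented IgnorePkg line
# of the config file $2 (default: /etc/pacman.conf). Exact match only.
is_ignored() {
    local pkg="$1" conf="${2:-/etc/pacman.conf}"
    awk -v pkg="$pkg" '
        /^[[:space:]]*IgnorePkg[[:space:]]*=/ {
            sub(/^[^=]*=/, "")      # drop "IgnorePkg =", keep the package names
            for (i = 1; i <= NF; i++) if ($i == pkg) found = 1
        }
        END { exit (found ? 0 : 1) }' "$conf"
}

# Only query pacman where it exists (Manjaro/Arch).
if command -v pacman >/dev/null 2>&1; then
    pacman -Q linux-firmware        # prints the installed version
    if is_ignored linux-firmware; then
        echo "linux-firmware is on the ignore list; pacman -Syu will skip it"
    else
        echo "linux-firmware is not on the ignore list; the next pacman -Syu will upgrade it"
    fi
fi
```

Remember to take the package off the ignore list again once a fixed firmware is released, otherwise it stays at the old version indefinitely.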

lordsansui · 19 August 2021 21:10

Sorry for asking more questions, but I’d appreciate it if someone could help me understand this better. Thanks in advance.

In general, when I read a reference to firmware, I think of some kind of embedded software loaded onto a specific piece of hardware, like the firmware for the GPU, the printer, the router, etc.
Considering that Linux has the kernel, which has the AMDGPU driver built in, which in turn shares some driver functionality with Mesa, what is this linux-firmware being referred to? What does it do? How is it related to the kernel, the GPU driver, and Mesa?