System frequently crashing after GPU drivers update

energrizer_9032 · 27 June 2021 13:49

I have been experiencing the same problem. I installed the linux510 kernel and the system hasn’t crashed since.
Please try installing the linux510 kernel.

Hans12 · 28 June 2021 19:15

I also downgraded from kernel 5.13 to 5.10 (and use linux-firmware-20210208) and my problems are gone on my Lenovo E585 with a Ryzen 2500U.

Well, update: I just got another crash… Maybe this is a different issue, related to the USB-C docking station?
kernel: NOHZ tick-stop error: Non-RCU local softirq work is pending, handler #08!!!

HoneyBear52 · 28 June 2021 21:06

Sorry, what are you running to get these outputs? I know I’m running kernel 5.10.42-1

Hans12 · 29 June 2021 16:42

If https://gitlab.freedesktop.org/mesa/mesa/-/issues/4866 is the same error, it may be fixed in mesa-21.1.4 (or by applying their patch to earlier mesa versions).

lordsansui · 30 June 2021 11:41

Hello Hans12, thanks for posting this update. It looks like they fixed the issue with the change described in the link, quoted below.

John Smith @meep · 3 weeks ago

file: src/mesa/main/draw.c
function: validate_draw_arrays

change:
if (count < 0 || numInstances < 0)
into:
if (count <= 0 || numInstances <= 0)

I don’t know how to patch and recompile Mesa. If anyone could show a noob like me how to do it, I could try this patch and report whether it fixes my issue too. Maybe others can try it as well.
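
For anyone who wants to try the quoted one-line change on a local copy of the source, it could be applied with sed roughly as below. This is only a sketch: the scratch file stands in for src/mesa/main/draw.c, the function name apply_draw_check_patch is made up for illustration, and rebuilding Mesa afterwards is not covered. Whether this exact change is what shipped upstream is not confirmed in this thread.

```shell
# Hypothetical helper: rewrite the condition quoted above in the given file.
# $1 is the path to a copy of the source file (an assumption, not a real checkout).
apply_draw_check_patch() {
    src=$1
    out=$(mktemp)
    # Parentheses, '<' and '|' are literal characters in basic regular expressions.
    sed 's/if (count < 0 || numInstances < 0)/if (count <= 0 || numInstances <= 0)/' \
        "$src" > "$out" && mv "$out" "$src"
}

# Demonstration on a scratch file instead of the real Mesa tree.
scratch=$(mktemp)
printf '%s\n' '   if (count < 0 || numInstances < 0)' > "$scratch"
apply_draw_check_patch "$scratch"
cat "$scratch"
rm -f "$scratch"
```

Running it on a scratch file first makes it easy to confirm the substitution matched before touching a real source tree.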

I edited this post to correct the language, and also to say that Mesa 21.1.4 was released yesterday, so we are probably close to getting the fix through updates. At least I hope so (fingers crossed).

https://mesa3d.org

Mesa 21.1.4 is released - June 30, 2021

Hans12 · 8 July 2021 07:52

For me, mesa 21.1.4 finally fixes the problem! For everybody who is interested in testing it:

sudo pacman -Syu
sudo pacman -S yay
yay -S downgrade
sudo downgrade --ala-only mesa
Select mesa 21.1.4 from the list (option 102 for me; the number may differ).
Reboot to apply changes.

Find the changelog including “our” bug here: Mesa 21.1.4 Release Notes / 2021-06-30 (The Mesa 3D Graphics Library documentation)
Thanks to the Mesa, firmware, Linux kernel, and Manjaro developers! Maybe we can support them by donating a coffee?

poynting_factor · 8 July 2021 14:54

Glad to hear it fixed your case! I have just broken a ~50-days-without-freezing streak lol, but my case has improved a lot too.

I just have one question. So your fix is to stick with version 21.1.4 of mesa, regardless of any later versions? I thought that was the latest version, but I might be wrong. I’m confused because of the downgrade you made.

Thanks for sharing your experience!

Hans12 · 8 July 2021 15:36

I am just using the downgrade command to actually upgrade to a version that is not yet available in the standard Manjaro repos. Just check your current version of mesa. This will be an upgrade. Alternatively, wait a few more days and get the new mesa via pacman -Syu.
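
A quick way to check whether the installed mesa is already at 21.1.4 or newer might look like this. It is a sketch under assumptions: version_at_least is a helper name made up here, it relies on GNU sort -V for version ordering, and it assumes pacman -Q mesa prints the package name followed by version-pkgrel.

```shell
# Hypothetical helper: succeeds if version $1 is at least version $2.
version_at_least() {
    # sort -V orders version strings; if $2 sorts first (or they are equal), $1 >= $2.
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

if command -v pacman >/dev/null 2>&1; then
    # Drop an optional epoch prefix and the trailing -pkgrel, e.g. "21.1.4-1" becomes "21.1.4".
    installed=$(pacman -Q mesa 2>/dev/null | awk '{print $2}' | sed -e 's/^[0-9]*://' -e 's/-[^-]*$//')
    if [ -z "$installed" ]; then
        echo "mesa does not appear to be installed"
    elif version_at_least "$installed" 21.1.4; then
        echo "mesa $installed is 21.1.4 or newer"
    else
        echo "mesa $installed is older than 21.1.4"
    fi
fi
```

If the installed version is older, either the downgrade route above or waiting for pacman -Syu applies.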

So for me: My notebook crashed 4-5 times a day before. Now this is fixed with mesa-21.1.4.

freggel.doe · 9 July 2021 10:05

Actually it is available - just not in all branches, see Manjaro - Branch Compare

Cencil · 11 July 2021 02:18

Just registered to say thank you!
For me it also seems to have fixed a memory leak: every time I woke the PC from standby, about 1 GB or half a GB more RAM than before was in use (not sure of the exact amount, I’m using standby A LOT).

I can safely say that both problems are gone now.

Cencil · 12 July 2021 21:29

Celebrated too early, it just crashed again. At least I was able to start a new tty and do a normal reboot.

5.12.9-1-MANJARO
linux-firmware 20210518
AMD Ryzen 5 3400G

edit: As recommended in the Arch forums, I have now downgraded the kernel to 5.10 LTS (instead of 5.11 as in the Arch forums) and linux-firmware to 20210315.3568f96-2. Will edit this post if I see the crash again.

Also enabled ssh in case I can’t even reach the tty anymore.

Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu: [gfxhub0] retry page fault (src_id:0 ring:0 vmid:3 pasid:32770, for process chrome pid 178074 thread chrome:cs0 pid 178102)
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu:   in page starting at address 0x800109700000 from client 27
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00341051
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu:          Faulty UTCL2 client ID: TCP (0x8)
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu:          MORE_FAULTS: 0x1
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu:          WALKER_ERROR: 0x0
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu:          PERMISSION_FAULTS: 0x5
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu:          MAPPING_ERROR: 0x0
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu:          RW: 0x1
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu: [gfxhub0] retry page fault (src_id:0 ring:0 vmid:3 pasid:32770, for process chrome pid 178074 thread chrome:cs0 pid 178102)
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu:   in page starting at address 0x800109701000 from client 27
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00341051
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu:          Faulty UTCL2 client ID: TCP (0x8)
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu:          MORE_FAULTS: 0x1
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu:          WALKER_ERROR: 0x0
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu:          PERMISSION_FAULTS: 0x5
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu:          MAPPING_ERROR: 0x0
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu:          RW: 0x1
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu: [gfxhub0] retry page fault (src_id:0 ring:0 vmid:3 pasid:32770, for process chrome pid 178074 thread chrome:cs0 pid 178102)
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu:   in page starting at address 0x800109702000 from client 27
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00341051
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu:          Faulty UTCL2 client ID: TCP (0x8)
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu:          MORE_FAULTS: 0x1
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu:          WALKER_ERROR: 0x0
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu:          PERMISSION_FAULTS: 0x5
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu:          MAPPING_ERROR: 0x0
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu:          RW: 0x1
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu: [gfxhub0] retry page fault (src_id:0 ring:0 vmid:3 pasid:32770, for process chrome pid 178074 thread chrome:cs0 pid 178102)
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu:   in page starting at address 0x800109704000 from client 27
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00341051
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu:          Faulty UTCL2 client ID: TCP (0x8)
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu:          MORE_FAULTS: 0x1
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu:          WALKER_ERROR: 0x0
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu:          PERMISSION_FAULTS: 0x5
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu:          MAPPING_ERROR: 0x0
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu:          RW: 0x1
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu: [gfxhub0] retry page fault (src_id:0 ring:0 vmid:3 pasid:32770, for process chrome pid 178074 thread chrome:cs0 pid 178102)
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu:   in page starting at address 0x800109709000 from client 27
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00341051
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu:          Faulty UTCL2 client ID: TCP (0x8)
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu:          MORE_FAULTS: 0x1
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu:          WALKER_ERROR: 0x0
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu:          PERMISSION_FAULTS: 0x5
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu:          MAPPING_ERROR: 0x0
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu:          RW: 0x1
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu: [gfxhub0] retry page fault (src_id:0 ring:0 vmid:3 pasid:32770, for process chrome pid 178074 thread chrome:cs0 pid 178102)
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu:   in page starting at address 0x800109707000 from client 27
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00341051
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu:          Faulty UTCL2 client ID: TCP (0x8)
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu:          MORE_FAULTS: 0x1
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu:          WALKER_ERROR: 0x0
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu:          PERMISSION_FAULTS: 0x5
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu:          MAPPING_ERROR: 0x0
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu:          RW: 0x1
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu: [gfxhub0] retry page fault (src_id:0 ring:0 vmid:3 pasid:32770, for process chrome pid 178074 thread chrome:cs0 pid 178102)
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu:   in page starting at address 0x800109705000 from client 27
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00341051
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu:          Faulty UTCL2 client ID: TCP (0x8)
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu:          MORE_FAULTS: 0x1
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu:          WALKER_ERROR: 0x0
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu:          PERMISSION_FAULTS: 0x5
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu:          MAPPING_ERROR: 0x0
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu:          RW: 0x1
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu: [gfxhub0] retry page fault (src_id:0 ring:0 vmid:3 pasid:32770, for process chrome pid 178074 thread chrome:cs0 pid 178102)
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu:   in page starting at address 0x800109708000 from client 27
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00341051
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu:          Faulty UTCL2 client ID: TCP (0x8)
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu:          MORE_FAULTS: 0x1
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu:          WALKER_ERROR: 0x0
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu:          PERMISSION_FAULTS: 0x5
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu:          MAPPING_ERROR: 0x0
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu:          RW: 0x1
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu: [gfxhub0] retry page fault (src_id:0 ring:0 vmid:3 pasid:32770, for process chrome pid 178074 thread chrome:cs0 pid 178102)
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu:   in page starting at address 0x800109706000 from client 27
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00341051
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu:          Faulty UTCL2 client ID: TCP (0x8)
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu:          MORE_FAULTS: 0x1
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu:          WALKER_ERROR: 0x0
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu:          PERMISSION_FAULTS: 0x5
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu:          MAPPING_ERROR: 0x0
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu:          RW: 0x1
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu: [gfxhub0] retry page fault (src_id:0 ring:0 vmid:3 pasid:32770, for process chrome pid 178074 thread chrome:cs0 pid 178102)
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu:   in page starting at address 0x80010970a000 from client 27
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00341051
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu:          Faulty UTCL2 client ID: TCP (0x8)
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu:          MORE_FAULTS: 0x1
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu:          WALKER_ERROR: 0x0
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu:          PERMISSION_FAULTS: 0x5
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu:          MAPPING_ERROR: 0x0
Jul 12 23:00:05 x300 kernel: amdgpu 0000:05:00.0: amdgpu:          RW: 0x1
Jul 12 23:00:15 x300 kernel: [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out!
Jul 12 23:00:15 x300 kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=6305507, emitted seq=6305509
Jul 12 23:00:15 x300 kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process chrome pid 178074 thread chrome:cs0 pid 178102
Jul 12 23:00:15 x300 kernel: amdgpu 0000:05:00.0: amdgpu: GPU reset begin!
Jul 12 23:00:15 x300 kernel: amdgpu 0000:05:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0000 address=0x11563d7c0 flags=0x0070]
Jul 12 23:00:15 x300 kernel: amdgpu 0000:05:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0000 address=0x11563d7e0 flags=0x0070]
Jul 12 23:00:15 x300 kernel: amdgpu 0000:05:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0000 address=0x11563d800 flags=0x0070]
Jul 12 23:00:15 x300 kernel: amdgpu 0000:05:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0000 address=0x11563d820 flags=0x0070]
Jul 12 23:00:15 x300 kernel: amdgpu 0000:05:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0000 address=0x11563d840 flags=0x0070]
Jul 12 23:00:15 x300 kernel: amdgpu 0000:05:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0000 address=0x11563d860 flags=0x0070]
Jul 12 23:00:15 x300 kernel: amdgpu 0000:05:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0000 address=0x115640000 flags=0x0070]
Jul 12 23:00:15 x300 kernel: amdgpu 0000:05:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0000 address=0x11563d880 flags=0x0070]
Jul 12 23:00:15 x300 kernel: amdgpu 0000:05:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0000 address=0x115640000 flags=0x0070]
Jul 12 23:00:15 x300 kernel: amdgpu 0000:05:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0000 address=0x115640000 flags=0x0070]
Jul 12 23:00:15 x300 kernel: amd_iommu_report_page_fault: 21 callbacks suppressed
Jul 12 23:00:15 x300 kernel: AMD-Vi: Event logged [IO_PAGE_FAULT device=05:00.0 domain=0x0000 address=0x11563d8a0 flags=0x0070]
Jul 12 23:00:15 x300 kernel: AMD-Vi: Event logged [IO_PAGE_FAULT device=05:00.0 domain=0x0000 address=0x11563d8c0 flags=0x0070]
Jul 12 23:00:15 x300 kernel: AMD-Vi: Event logged [IO_PAGE_FAULT device=05:00.0 domain=0x0000 address=0x115640000 flags=0x0070]
Jul 12 23:00:15 x300 kernel: AMD-Vi: Event logged [IO_PAGE_FAULT device=05:00.0 domain=0x0000 address=0x11563d8e0 flags=0x0070]
Jul 12 23:00:15 x300 kernel: AMD-Vi: Event logged [IO_PAGE_FAULT device=05:00.0 domain=0x0000 address=0x11563d900 flags=0x0070]
Jul 12 23:00:15 x300 kernel: AMD-Vi: Event logged [IO_PAGE_FAULT device=05:00.0 domain=0x0000 address=0x115640000 flags=0x0070]
Jul 12 23:00:15 x300 kernel: AMD-Vi: Event logged [IO_PAGE_FAULT device=05:00.0 domain=0x0000 address=0x11563d920 flags=0x0070]
Jul 12 23:00:15 x300 kernel: AMD-Vi: Event logged [IO_PAGE_FAULT device=05:00.0 domain=0x0000 address=0x11563d940 flags=0x0070]
Jul 12 23:00:15 x300 kernel: AMD-Vi: Event logged [IO_PAGE_FAULT device=05:00.0 domain=0x0000 address=0x115640000 flags=0x0070]
Jul 12 23:00:15 x300 kernel: AMD-Vi: Event logged [IO_PAGE_FAULT device=05:00.0 domain=0x0000 address=0x11563d960 flags=0x0070]
Jul 12 23:00:15 x300 kernel: [drm] free PSP TMR buffer
Jul 12 23:00:15 x300 kernel: mce: [Hardware Error]: Machine check events logged
Jul 12 23:00:15 x300 kernel: [Hardware Error]: Deferred error, no action required.
Jul 12 23:00:15 x300 kernel: amdgpu 0000:05:00.0: amdgpu: MODE2 reset
Jul 12 23:00:15 x300 kernel: [Hardware Error]: CPU:0 (17:18:1) MC20_STATUS[-|-|-|AddrV|-|-|SyndV|UECC|Deferred|-|-]: 0x942030000001085b
Jul 12 23:00:15 x300 kernel: [Hardware Error]: Error Addr: 0x00007ffcffffff00
Jul 12 23:00:15 x300 kernel: [Hardware Error]: IPID: 0x0000002e00000000, Syndrome: 0x000000005b240203
Jul 12 23:00:15 x300 kernel: [Hardware Error]: Coherent Slave Ext. Error Code: 1, Address Violation.
Jul 12 23:00:15 x300 kernel: [Hardware Error]: cache level: L3/GEN, mem/io: IO, mem-tx: IRD, part-proc: SRC (no timeout)
Jul 12 23:00:15 x300 kernel: amdgpu 0000:05:00.0: amdgpu: GPU reset succeeded, trying to resume
Jul 12 23:00:15 x300 kernel: [drm] PCIE GART of 1024M enabled (table at 0x000000F400E10000).
Jul 12 23:00:15 x300 kernel: [drm] PSP is resuming...
Jul 12 23:00:15 x300 kernel: [drm] reserve 0x400000 from 0xf47fc00000 for PSP TMR
Jul 12 23:00:15 x300 kernel: amdgpu 0000:05:00.0: amdgpu: RAS: optional ras ta ucode is not available
Jul 12 23:00:15 x300 kernel: amdgpu 0000:05:00.0: amdgpu: RAP: optional rap ta ucode is not available
Jul 12 23:00:15 x300 kernel: [drm] kiq ring mec 2 pipe 1 q 0
Jul 12 23:00:15 x300 kernel: [drm] VCN decode and encode initialized successfully(under SPG Mode).
Jul 12 23:00:15 x300 kernel: amdgpu 0000:05:00.0: amdgpu: ring gfx uses VM inv eng 0 on hub 0
Jul 12 23:00:15 x300 kernel: amdgpu 0000:05:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
Jul 12 23:00:15 x300 kernel: amdgpu 0000:05:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
Jul 12 23:00:15 x300 kernel: amdgpu 0000:05:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 5 on hub 0
Jul 12 23:00:15 x300 kernel: amdgpu 0000:05:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 6 on hub 0
Jul 12 23:00:15 x300 kernel: amdgpu 0000:05:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 7 on hub 0
Jul 12 23:00:15 x300 kernel: amdgpu 0000:05:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 8 on hub 0
Jul 12 23:00:15 x300 kernel: amdgpu 0000:05:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 9 on hub 0
Jul 12 23:00:15 x300 kernel: amdgpu 0000:05:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 10 on hub 0
Jul 12 23:00:15 x300 kernel: amdgpu 0000:05:00.0: amdgpu: ring kiq_2.1.0 uses VM inv eng 11 on hub 0
Jul 12 23:00:15 x300 kernel: amdgpu 0000:05:00.0: amdgpu: ring sdma0 uses VM inv eng 0 on hub 1
Jul 12 23:00:15 x300 kernel: amdgpu 0000:05:00.0: amdgpu: ring vcn_dec uses VM inv eng 1 on hub 1
Jul 12 23:00:15 x300 kernel: amdgpu 0000:05:00.0: amdgpu: ring vcn_enc0 uses VM inv eng 4 on hub 1
Jul 12 23:00:15 x300 kernel: amdgpu 0000:05:00.0: amdgpu: ring vcn_enc1 uses VM inv eng 5 on hub 1
Jul 12 23:00:15 x300 kernel: amdgpu 0000:05:00.0: amdgpu: ring jpeg_dec uses VM inv eng 6 on hub 1
Jul 12 23:00:16 x300 kernel: amdgpu 0000:05:00.0: amdgpu: recover vram bo from shadow start
Jul 12 23:00:16 x300 kernel: amdgpu 0000:05:00.0: amdgpu: recover vram bo from shadow done
Jul 12 23:00:16 x300 kernel: [drm] Skip scheduling IBs!
Jul 12 23:00:16 x300 kernel: amdgpu 0000:05:00.0: amdgpu: GPU reset(2) succeeded!
Jul 12 23:00:26 x300 kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered
Jul 12 23:00:26 x300 systemd[1]: Started Getty on tty2.

lordsansui · 14 July 2021 13:29

Thanks Hans12,

I’ve been using this new Mesa for one week and it looks like it fixes my issue too. I will need more time for a true confirmation.

At first I updated it the way you suggested, and this week the Manjaro Stable branch was updated to Mesa 21.1.4 anyway.

I hope others can benefit from the same fix.

Superblazer · 14 July 2021 13:49

What firmware version are you using? Is everything up to date?

mha-k · 14 July 2021 20:08

I can confirm what Cencil reported. The issue is also not solved on Polaris GPUs. I fully updated my system yesterday, and this evening, after a couple of YouTube videos, it crashed again.

Kernel 5.12
linux-firmware 20210629
AMD RX 480

-- Journal begins at Thu 2021-01-21 08:14:09 CET, ends at Wed 2021-07-14 22:12:10 CEST. --
Jul 14 20:57:07 ManjaroGamingPC kernel: [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out!
Jul 14 20:57:07 ManjaroGamingPC kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered
Jul 14 20:57:07 ManjaroGamingPC kernel: amdgpu 0000:26:00.0: amdgpu: GPU fault detected: 146 0x0048080c for process plasmashell pid 1790 thread plasmashel:cs0 pid 1878
Jul 14 20:57:07 ManjaroGamingPC kernel: amdgpu 0000:26:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000009
Jul 14 20:57:07 ManjaroGamingPC kernel: amdgpu 0000:26:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0400800C
Jul 14 20:57:07 ManjaroGamingPC kernel: amdgpu 0000:26:00.0: amdgpu: VM fault (0x0c, vmid 2, pasid 32772) at page 9, read from 'TC0' (0x54433000) (8)
Jul 14 20:57:07 ManjaroGamingPC kernel: amdgpu 0000:26:00.0: amdgpu: GPU fault detected: 146 0x0068040c for process plasmashell pid 1790 thread plasmashel:cs0 pid 1878
Jul 14 20:57:07 ManjaroGamingPC kernel: amdgpu 0000:26:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x0010084D
Jul 14 20:57:07 ManjaroGamingPC kernel: amdgpu 0000:26:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x04008008
Jul 14 20:57:07 ManjaroGamingPC kernel: amdgpu 0000:26:00.0: amdgpu: VM fault (0x08, vmid 2, pasid 32772) at page 1050701, read from 'TC0' (0x54433000) (8)
Jul 14 20:57:17 ManjaroGamingPC kernel: amdgpu 0000:26:00.0: amdgpu: GPU fault detected: 147 0x0aa02008 for process plasmashell pid 1790 thread plasmashel:cs0 pid 1878
Jul 14 20:57:17 ManjaroGamingPC kernel: amdgpu 0000:26:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00108554
Jul 14 20:57:17 ManjaroGamingPC kernel: amdgpu 0000:26:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x04020008
Jul 14 20:57:17 ManjaroGamingPC kernel: amdgpu 0000:26:00.0: amdgpu: VM fault (0x08, vmid 2, pasid 32772) at page 1082708, read from 'CB2' (0x43423200) (32)
Jul 14 20:57:17 ManjaroGamingPC kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=1258798, emitted seq=1258801
Jul 14 20:57:17 ManjaroGamingPC kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process plasmashell pid 1790 thread plasmashel:cs0 pid 1878
Jul 14 20:57:17 ManjaroGamingPC kernel: amdgpu 0000:26:00.0: amdgpu: GPU reset begin!
Jul 14 20:57:21 ManjaroGamingPC kernel: amdgpu 0000:26:00.0: amdgpu: failed to suspend display audio
Jul 14 20:57:21 ManjaroGamingPC kernel: amdgpu 0000:26:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring kiq_2.1.0 test failed (-110)
Jul 14 20:57:21 ManjaroGamingPC kernel: [drm:gfx_v8_0_hw_fini [amdgpu]] *ERROR* KCQ disable failed
Jul 14 20:57:22 ManjaroGamingPC kernel: amdgpu: cp is busy, skip halt cp
Jul 14 20:57:22 ManjaroGamingPC kernel: amdgpu: rlc is busy, skip halt rlc
Jul 14 20:57:22 ManjaroGamingPC kernel: amdgpu 0000:26:00.0: amdgpu: BACO reset
Jul 14 20:57:22 ManjaroGamingPC kernel: amdgpu 0000:26:00.0: amdgpu: GPU reset succeeded, trying to resume
Jul 14 20:57:22 ManjaroGamingPC kernel: [drm] PCIE GART of 256M enabled (table at 0x000000F400300000).
Jul 14 20:57:22 ManjaroGamingPC kernel: [drm] VRAM is lost due to GPU reset!
Jul 14 20:57:24 ManjaroGamingPC kernel: [drm:uvd_v6_0_start [amdgpu]] *ERROR* UVD not responding, trying to reset the VCPU!!!
Jul 14 20:57:25 ManjaroGamingPC kernel: [drm:uvd_v6_0_start [amdgpu]] *ERROR* UVD not responding, trying to reset the VCPU!!!
Jul 14 20:57:26 ManjaroGamingPC kernel: [drm:uvd_v6_0_start [amdgpu]] *ERROR* UVD not responding, trying to reset the VCPU!!!
Jul 14 20:57:27 ManjaroGamingPC kernel: [drm:uvd_v6_0_start [amdgpu]] *ERROR* UVD not responding, trying to reset the VCPU!!!
Jul 14 20:57:28 ManjaroGamingPC kernel: [drm:uvd_v6_0_start [amdgpu]] *ERROR* UVD not responding, trying to reset the VCPU!!!
Jul 14 20:57:29 ManjaroGamingPC kernel: [drm:uvd_v6_0_start [amdgpu]] *ERROR* UVD not responding, trying to reset the VCPU!!!
Jul 14 20:57:30 ManjaroGamingPC kernel: [drm:uvd_v6_0_start [amdgpu]] *ERROR* UVD not responding, trying to reset the VCPU!!!
Jul 14 20:57:31 ManjaroGamingPC kernel: [drm:uvd_v6_0_start [amdgpu]] *ERROR* UVD not responding, trying to reset the VCPU!!!
Jul 14 20:57:32 ManjaroGamingPC kernel: [drm:uvd_v6_0_start [amdgpu]] *ERROR* UVD not responding, trying to reset the VCPU!!!
Jul 14 20:57:32 ManjaroGamingPC kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
Jul 14 20:57:32 ManjaroGamingPC kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
Jul 14 20:57:33 ManjaroGamingPC kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
Jul 14 20:57:33 ManjaroGamingPC kernel: [drm:uvd_v6_0_start [amdgpu]] *ERROR* UVD not responding, trying to reset the VCPU!!!
Jul 14 20:57:33 ManjaroGamingPC kernel: [drm:uvd_v6_0_start [amdgpu]] *ERROR* UVD not responding, giving up!!!
Jul 14 20:57:33 ManjaroGamingPC kernel: [drm:amdgpu_device_ip_set_powergating_state [amdgpu]] *ERROR* set_powergating_state of IP block <uvd_v6_0> failed -1
Jul 14 20:57:33 ManjaroGamingPC kernel: amdgpu 0000:26:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring uvd test failed (-110)
Jul 14 20:57:33 ManjaroGamingPC kernel: [drm:amdgpu_device_ip_resume_phase2 [amdgpu]] *ERROR* resume of IP block <uvd_v6_0> failed -110
Jul 14 20:57:33 ManjaroGamingPC kernel: amdgpu 0000:26:00.0: amdgpu: GPU reset(3) failed
Jul 14 20:57:33 ManjaroGamingPC kernel: amdgpu 0000:26:00.0: amdgpu: GPU reset end with ret = -110
Jul 14 20:57:33 ManjaroGamingPC kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
Jul 14 20:57:33 ManjaroGamingPC kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
Jul 14 20:57:33 ManjaroGamingPC kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
Jul 14 20:57:43 ManjaroGamingPC kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered
Jul 14 20:57:53 ManjaroGamingPC kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered
Jul 14 20:58:02 ManjaroGamingPC kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
Jul 14 20:58:02 ManjaroGamingPC kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
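
Logs like the ones in this thread get long, so a small filter that counts the key amdgpu events can make two crashes easier to compare. This is a sketch: summarize_amdgpu_log is a name made up here, and the patterns are taken only from the messages quoted in this thread, so other kernels may word them differently.

```shell
# Hypothetical helper: read kernel log lines on stdin and count key amdgpu events.
summarize_amdgpu_log() {
    awk '
        /ring gfx timeout/              { timeouts++ }
        /GPU reset begin/               { resets++ }
        /GPU reset\([0-9]+\) succeeded/ { ok++ }
        /GPU reset\([0-9]+\) failed/    { failed++ }
        END {
            # Counters that never matched print as 0.
            printf "timeouts=%d resets=%d succeeded=%d failed=%d\n", timeouts, resets, ok, failed
        }
    '
}
```

With a persistent journal, the kernel messages of the previous boot can be piped in with journalctl -k -b -1 | summarize_amdgpu_log.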

Hans12 · 19 July 2021 16:15

Yes. There seems to be another bug. I celebrated too early. I got another crash these days with the latest linux-firmware. However, as described by @Cencil, downgrading linux-firmware seems to solve it for me.

lordsansui · 23 July 2021 18:32

I’ve been using it for two weeks and the improvement is clear, but yes, I can also confirm it’s not fully fixed. I got just 2 issues and they were different. Going from at least one issue a day to 1 issue a week is a nice improvement.

The 1st one is the same, the famous:
20/07/2021 18:41 kernel [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out!

While the second one was:
22/07/2021 08:13 kernel [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=74945, emitted seq=74947

Cencil · 27 July 2021 12:49

My PC has been running for 13 days without a reboot, under heavy use, with the setup I posted before. However, a final fix would be awesome, so I can update the firmware and kernel someday…

lordsansui · 7 August 2021 15:02

The more I use Linux, the more new stuff I discover, as expected, and I’m seeing a lot of redundancy in Linux that makes a new user’s life very hard. It looks like the issue we are facing is related to the Mesa driver. Following the https://www.phoronix.com/ website I discovered AMDVLK, and after some research it looks like it serves the same function as Mesa’s RADV, so there is redundancy here that the user can choose between. In that sense, has anyone here tried using AMDVLK in place of Mesa’s RADV to see if it fixes the issue?
Or is the issue not related to this part of Mesa?

jpegxguy · 11 August 2021 01:35

The truth is, from all the various topics I’ve read, no one has a workaround. People try things and get disappointed because the issue manifests a week later. Essentially we cannot reliably reproduce this bug, which is always fun.

Grimmzz · 16 August 2021 07:59

I am still on linux-firmware 20210315.3568f96 and have had no problems for weeks.

Will stay on this version as long as possible.