System frequently crashing after GPU drivers update

Just to confirm this: I am troubleshooting the same problem on a laptop.

inxi -FGz | sed -n "1p; 5p; 6,7p"
System:    Kernel: 5.10.30-1-MANJARO x86_64 bits: 64 Console: tty pts/2 Distro: Manjaro Linux 
CPU:       Info: Dual Core model: AMD Athlon Silver 3050U with Radeon Graphics bits: 64 type: MCP cache: L2: 1024 KiB 
           Speed: 1396 MHz min/max: 1400/2300 MHz Core speeds (MHz): 1: 1396 2: 1398 
Graphics:  Device-1: Advanced Micro Devices [AMD/ATI] Picasso driver: amdgpu v: kernel

Good call out there bro! Yeah, after 5 days of believing my problems were totally solved with the 5.11.14 kernel, I experienced a light crash (KDE died and my session got logged out, but I could log back in easily), so I can’t state that was the fix.

I hope you’re right and the experimental kernel brings our fix :pray: Theory is on our side, since that commit looks like it fixes the errors we saw when the crashes happened. I don’t wanna claim victory early, since this issue has been one tough boss to fight, but I’m putting all my hope in that solution. I’ll try this in a couple of days.

Please, keep us updated if you have any further experience on this! And thanks for your contribution :raised_hands:

@elektropepi @B007C0DE So I installed the experimental 5.12-rc7 kernel as you suggested, but this time I wasn’t even able to log in to my system :confused: After I entered my credentials, the screen went totally black and the system would not respond. Here’s what got logged in journalctl:

abr 24 16:04:07 e495 kwin_x11[1148]: Freeze in OpenGL initialization detected
abr 24 16:04:03 e495 kernel: amdgpu 0000:04:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0000 address=0x11d3a6cc0 flags=0x0070]
abr 24 16:04:03 e495 kernel: amdgpu 0000:04:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0000 address=0x11d3c0000 flags=0x0070]
abr 24 16:04:03 e495 kernel: amdgpu 0000:04:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0000 address=0x11d3a6ca0 flags=0x0070]
abr 24 16:04:03 e495 kernel: amdgpu 0000:04:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0000 address=0x11d3c0000 flags=0x0070]

Along with some of the typical page fault errors. Would you please tell me which version of the Mesa drivers you are using? Maybe there’s some incompatibility between the kernel and the driver itself. I’m using the latest version of the driver.

UPDATE: I tried booting into that kernel again, and this time I could do so. Idk what may have happened the first time, but I hope it was some really random and isolated error, and that 5.12 carries the real fix for this :pray:

Well, I’m afraid that was a bit premature.

Kernel 5.12-rc7 definitely reduces the crashes, but they are not completely gone. Just had a freeze with the amdgpu: [gfxhub0] error in the logs.

I just grepped through journalctl and I never had any of these amdgpu errors prior to the April update.

Seems to be more than just the kernel :frowning:
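For anyone who wants to repeat that check on their own logs, here is a rough sketch. The helper name `count_gpu_faults` and the step of saving the journal to a file first are just my assumptions for illustration; the two patterns are taken from the error lines quoted in this thread.

```shell
#!/bin/sh
# Hypothetical helper (not part of any tool): count GPU page-fault lines
# in a saved kernel log. Patterns come from the messages quoted in this thread.
# $1: path to a text file holding kernel log lines
count_gpu_faults() {
    grep -c -E 'retry page fault|IO_PAGE_FAULT' "$1"
}

# Example usage, assuming systemd's journalctl is available:
#   journalctl -k -b -1 --no-pager > /tmp/prev-boot.log   # kernel log, previous boot
#   count_gpu_faults /tmp/prev-boot.log
```

A count of 0 for boots before the update and a non-zero count afterwards would match what is described above, though it would not say which package is to blame.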

Yeah, I would dare to state that each kernel update gives us better results (5.11.14 made a huge improvement for me, and hopefully 5.12 does the same), but still not there yet. I had never experienced them either! All of this began with the April 9th update for me.

I wish a kernel or GPU driver dev would see this so we could get an authoritative opinion… Let’s keep posting our findings here! We’ll get through it :muscle:

Same here since the April 9 update:

5.11.14 is better: no freeze/crash or black screen since Wednesday.

Update, April 25:
first black screen

15:57:42 kernel: AMD-Vi: Event logged [IO_PAGE_FAULT device=05:00.0 domain=0x0000 address=0x111d13280 flags=0x0070]
15:57:42 kernel: amdgpu 0000:05:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0000 address=0x111d40000 flags=0x0070]
15:57:42 kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process  pid 0 thread  pid 0
15:57:32 kernel: [drm:amdgpu_device_ip_resume_phase2 [amdgpu]] *ERROR* resume of IP block <gfx_v9_0> failed -110
15:57:32 kernel: amdgpu 0000:05:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring gfx test failed (-110)
15:57:31 kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process Xwayland pid 1519 thread Xwayland:cs0 pid 1775
15:57:31 kernel: amdgpu 0000:05:00.0: amdgpu: 	 RW: 0x1
15:57:31 kernel: amdgpu 0000:05:00.0: amdgpu: 	 RW: 0x1
15:57:31 kernel: amdgpu 0000:05:00.0: amdgpu: 	 MAPPING_ERROR: 0x0
15:57:31 kernel: amdgpu 0000:05:00.0: amdgpu: 	 PERMISSION_FAULTS: 0x7
15:57:31 kernel: amdgpu 0000:05:00.0: amdgpu: 	 WALKER_ERROR: 0x0
15:57:31 kernel: amdgpu 0000:05:00.0: amdgpu: 	 MORE_FAULTS: 0x1
15:57:31 kernel: amdgpu 0000:05:00.0: amdgpu: 	 Faulty UTCL2 client ID: CB (0x0)
15:57:31 kernel: amdgpu 0000:05:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x004C0071
15:57:31 kernel: amdgpu 0000:05:00.0: amdgpu:   in page starting at address 0x8001030f2000 from client 27
15:57:31 kernel: amdgpu 0000:05:00.0: amdgpu: [gfxhub0] retry page fault (src_id:0 ring:0 vmid:4 pasid:32774, for process Xwayland pid 1519 thread Xwayland:cs0 pid 1775)
15:57:31 kernel: amdgpu 0000:05:00.0: amdgpu: 	 RW: 0x1
15:57:31 kernel: amdgpu 0000:05:00.0: amdgpu: 	 MAPPING_ERROR: 0x0
15:57:31 kernel: amdgpu 0000:05:00.0: amdgpu: 	 PERMISSION_FAULTS: 0x7
15:57:31 kernel: amdgpu 0000:05:00.0: amdgpu: 	 WALKER_ERROR: 0x0
15:57:31 kernel: amdgpu 0000:05:00.0: amdgpu: 	 MORE_FAULTS: 0x1
15:57:31 kernel: amdgpu 0000:05:00.0: amdgpu: 	 Faulty UTCL2 client ID: CB (0x0)
15:57:31 kernel: amdgpu 0000:05:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x004C0071
15:57:31 kernel: amdgpu 0000:05:00.0: amdgpu:   in page starting at address 0x8001030f0000 from client 27
15:57:31 kernel: amdgpu 0000:05:00.0: amdgpu: [gfxhub0] retry page fault (src_id:0 ring:0 vmid:4 pasid:32774, for process Xwayland pid 1519 thread Xwayland:cs0 pid 1775)
15:57:31 kernel: amdgpu 0000:05:00.0: amdgpu: 	 RW: 0x1
15:57:31 kernel: amdgpu 0000:05:00.0: amdgpu: 	 MAPPING_ERROR: 0x0
15:57:31 kernel: amdgpu 0000:05:00.0: amdgpu: 	 PERMISSION_FAULTS: 0x7
15:57:31 kernel: amdgpu 0000:05:00.0: amdgpu: 	 WALKER_ERROR: 0x0
15:57:31 kernel: amdgpu 0000:05:00.0: amdgpu: 	 MORE_FAULTS: 0x1
15:57:31 kernel: amdgpu 0000:05:00.0: amdgpu: 	 Faulty UTCL2 client ID: CB (0x0)
15:57:31 kernel: amdgpu 0000:05:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x004C0071
15:57:31 kernel: amdgpu 0000:05:00.0: amdgpu:   in page starting at address 0x8001030f2000 from client 27
15:57:31 kernel: amdgpu 0000:05:00.0: amdgpu: [gfxhub0] retry page fault (src_id:0 ring:0 vmid:4 pasid:32774, for process Xwayland pid 1519 thread Xwayland:cs0 pid 1775)
15:57:31 kernel: amdgpu 0000:05:00.0: amdgpu: 	 RW: 0x1
15:57:31 kernel: amdgpu 0000:05:00.0: amdgpu: 	 MAPPING_ERROR: 0x0
15:57:31 kernel: amdgpu 0000:05:00.0: amdgpu: 	 PERMISSION_FAULTS: 0x7
15:57:31 kernel: amdgpu 0000:05:00.0: amdgpu: 	 WALKER_ERROR: 0x0
15:57:31 kernel: amdgpu 0000:05:00.0: amdgpu: 	 MORE_FAULTS: 0x1
15:57:31 kernel: amdgpu 0000:05:00.0: amdgpu: 	 Faulty UTCL2 client ID: CB (0x0)
15:57:31 kernel: amdgpu 0000:05:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x004C0071
15:57:31 kernel: amdgpu 0000:05:00.0: amdgpu:   in page starting at address 0x8001030f0000 from client 27
15:57:31 kernel: amdgpu 0000:05:00.0: amdgpu: [gfxhub0] retry page fault (src_id:0 ring:0 vmid:4 pasid:32774, for process Xwayland pid 1519 thread Xwayland:cs0 pid 1775)
15:57:31 kernel: amdgpu 0000:05:00.0: amdgpu: 	 RW: 0x1
15:57:31 kernel: amdgpu 0000:05:00.0: amdgpu: 	 MAPPING_ERROR: 0x0
15:57:31 kernel: amdgpu 0000:05:00.0: amdgpu: 	 PERMISSION_FAULTS: 0x7
15:57:31 kernel: amdgpu 0000:05:00.0: amdgpu: 	 WALKER_ERROR: 0x0
15:57:31 kernel: amdgpu 0000:05:00.0: amdgpu: 	 MORE_FAULTS: 0x1
15:57:31 kernel: amdgpu 0000:05:00.0: amdgpu: 	 Faulty UTCL2 client ID: CB (0x0)
15:57:31 kernel: amdgpu 0000:05:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x004C0071
15:57:31 kernel: amdgpu 0000:05:00.0: amdgpu:   in page starting at address 0x8001030f2000 from client 27
15:57:31 kernel: amdgpu 0000:05:00.0: amdgpu: [gfxhub0] retry page fault (src_id:0 ring:0 vmid:4 pasid:32774, for process Xwayland pid 1519 thread Xwayland:cs0 pid 1775)
15:57:31 kernel: amdgpu 0000:05:00.0: amdgpu: 	 RW: 0x1
15:57:31 kernel: amdgpu 0000:05:00.0: amdgpu: 	 MAPPING_ERROR: 0x0
15:57:31 kernel: amdgpu 0000:05:00.0: amdgpu: 	 PERMISSION_FAULTS: 0x7
15:57:31 kernel: amdgpu 0000:05:00.0: amdgpu: 	 WALKER_ERROR: 0x0
15:57:31 kernel: amdgpu 0000:05:00.0: amdgpu: 	 MORE_FAULTS: 0x1
15:57:31 kernel: amdgpu 0000:05:00.0: amdgpu: 	 Faulty UTCL2 client ID: CB (0x0)
15:57:31 kernel: amdgpu 0000:05:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x004C0071
15:57:31 kernel: amdgpu 0000:05:00.0: amdgpu:   in page starting at address 0x8001030f0000 from client 27
15:57:31 kernel: amdgpu 0000:05:00.0: amdgpu: [gfxhub0] retry page fault (src_id:0 ring:0 vmid:4 pasid:32774, for process Xwayland pid 1519 thread Xwayland:cs0 pid 1775)
15:57:31 kernel: amdgpu 0000:05:00.0: amdgpu: 	 RW: 0x1
15:57:31 kernel: amdgpu 0000:05:00.0: amdgpu: 	 MAPPING_ERROR: 0x0
15:57:31 kernel: amdgpu 0000:05:00.0: amdgpu: 	 PERMISSION_FAULTS: 0x7
15:57:31 kernel: amdgpu 0000:05:00.0: amdgpu: 	 WALKER_ERROR: 0x0
15:57:31 kernel: amdgpu 0000:05:00.0: amdgpu: 	 MORE_FAULTS: 0x1
15:57:31 kernel: amdgpu 0000:05:00.0: amdgpu: 	 Faulty UTCL2 client ID: CB (0x0)
15:57:31 kernel: amdgpu 0000:05:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x004C0071
15:57:31 kernel: amdgpu 0000:05:00.0: amdgpu:   in page starting at address 0x8001030f2000 from client 27
15:57:31 kernel: amdgpu 0000:05:00.0: amdgpu: [gfxhub0] retry page fault (src_id:0 ring:0 vmid:4 pasid:32774, for process Xwayland pid 1519 thread Xwayland:cs0 pid 1775)
15:57:31 kernel: amdgpu 0000:05:00.0: amdgpu: 	 RW: 0x1
15:57:31 kernel: amdgpu 0000:05:00.0: amdgpu: 	 MAPPING_ERROR: 0x0
15:57:31 kernel: amdgpu 0000:05:00.0: amdgpu: 	 PERMISSION_FAULTS: 0x7
15:57:31 kernel: amdgpu 0000:05:00.0: amdgpu: 	 WALKER_ERROR: 0x0
15:57:31 kernel: amdgpu 0000:05:00.0: amdgpu: 	 MORE_FAULTS: 0x1
15:57:31 kernel: amdgpu 0000:05:00.0: amdgpu: 	 Faulty UTCL2 client ID: CB (0x0)
15:57:31 kernel: amdgpu 0000:05:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x004C0071
15:57:31 kernel: amdgpu 0000:05:00.0: amdgpu:   in page starting at address 0x8001030f0000 from client 27
15:57:31 kernel: amdgpu 0000:05:00.0: amdgpu: [gfxhub0] retry page fault (src_id:0 ring:0 vmid:4 pasid:32774, for process Xwayland pid 1519 thread Xwayland:cs0 pid 1775)
15:57:31 kernel: amdgpu 0000:05:00.0: amdgpu: 	 RW: 0x1
15:57:31 kernel: amdgpu 0000:05:00.0: amdgpu: 	 MAPPING_ERROR: 0x0
15:57:31 kernel: amdgpu 0000:05:00.0: amdgpu: 	 PERMISSION_FAULTS: 0x7
15:57:31 kernel: amdgpu 0000:05:00.0: amdgpu: 	 WALKER_ERROR: 0x0
15:57:31 kernel: amdgpu 0000:05:00.0: amdgpu: 	 MORE_FAULTS: 0x1
15:57:31 kernel: amdgpu 0000:05:00.0: amdgpu: 	 Faulty UTCL2 client ID: CB (0x0)
15:57:31 kernel: amdgpu 0000:05:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x004C0071
15:57:31 kernel: amdgpu 0000:05:00.0: amdgpu:   in page starting at address 0x8001030f2000 from client 27
15:57:31 kernel: amdgpu 0000:05:00.0: amdgpu: [gfxhub0] retry page fault (src_id:0 ring:0 vmid:4 pasid:32774, for process Xwayland pid 1519 thread Xwayland:cs0 pid 1775)
15:57:31 kernel: amdgpu 0000:05:00.0: amdgpu: 	 RW: 0x1
15:57:31 kernel: amdgpu 0000:05:00.0: amdgpu: 	 MAPPING_ERROR: 0x0
15:57:31 kernel: amdgpu 0000:05:00.0: amdgpu: 	 PERMISSION_FAULTS: 0x7
15:57:31 kernel: amdgpu 0000:05:00.0: amdgpu: 	 WALKER_ERROR: 0x0
15:57:31 kernel: amdgpu 0000:05:00.0: amdgpu: 	 MORE_FAULTS: 0x1
15:57:31 kernel: amdgpu 0000:05:00.0: amdgpu: 	 Faulty UTCL2 client ID: CB (0x0)
15:57:31 kernel: amdgpu 0000:05:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x004C0071
15:57:31 kernel: amdgpu 0000:05:00.0: amdgpu:   in page starting at address 0x8001030f0000 from client 27
15:57:31 kernel: amdgpu 0000:05:00.0: amdgpu: [gfxhub0] retry page fault (src_id:0 ring:0 vmid:4 pasid:32774, for process Xwayland pid 1519 thread Xwayland:cs0 pid 1775)
15:57:31 kernel: [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out!
15:57:26 kernel: amdgpu 0000:05:00.0: amdgpu: 	 RW: 0x1
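
In case it helps anyone reading these logs, the sub-fields printed under each fault look like plain bit slices of the VM_L2_PROTECTION_FAULT_STATUS value. The sketch below is only that, a sketch: the helper name `decode_fault_status` is made up, and the bit positions are an assumption inferred by matching status values against the sub-fields printed next to them in this thread, not something taken from AMD documentation.

```shell
#!/bin/sh
# Hypothetical helper: split a VM_L2_PROTECTION_FAULT_STATUS value into fields.
# ASSUMPTION: bit positions are inferred from the log samples in this thread,
# not from AMD documentation.
# $1: status value, hex accepted (for example 0x004C0071)
decode_fault_status() {
    s=$(($1))
    printf 'MORE_FAULTS=0x%x WALKER_ERROR=0x%x PERMISSION_FAULTS=0x%x MAPPING_ERROR=0x%x CLIENT_ID=0x%x RW=0x%x\n' \
        "$(( s & 0x1 ))" \
        "$(( (s >> 1) & 0x7 ))" \
        "$(( (s >> 4) & 0xF ))" \
        "$(( (s >> 8) & 0x1 ))" \
        "$(( (s >> 9) & 0x1FF ))" \
        "$(( (s >> 18) & 0x1 ))"
}

decode_fault_status 0x004C0071
```

For 0x004C0071 this gives the same values the kernel printed above (PERMISSION_FAULTS 0x7, client ID 0x0, RW 0x1), so it is mostly useful for comparing faults between machines at a glance.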

Ok that’s shit. Since I switched kernels, I haven’t had the freeze once. Did you choose the correct kernel on startup (maybe your GRUB defaults to an LTS kernel)?

I’ll keep you guys posted if that error occurs again.

I’m on GNOME and was having the same issues. Trying this and crossing my fingers :wink:

Just updated the kernel to 5.12-rc7 (the latest available on Stable, for me).

Rebooted.

The system is finally responding as quickly as it used to, before I started experiencing many of the problems others have described above. For the past month or so it would crash, freeze, or run astonishingly sluggishly when it ran at all.

Let’s hope this newest kernel is the answer :slight_smile:

Not sure if this is related to my problem: recently my system freezes about 20 minutes after boot. For those first 20 minutes man.db hogs resources, at about 21% CPU time, alongside rsync at around 20%. After that, it seems that any app I run freezes the entire system. Initially I thought it was Teams, but then I noticed it also happened with Thunderbird, Firefox, Chrome and VLC. The main journal reports an EXT4 error on my main Linux disk and insists that I run e2fsck -D (or something similar). I am currently in Windows grabbing the latest ISO to burn onto a USB stick so I can check and try to fix that. I can’t recall any graphics issues in the logs, though they may have been present. The freezes started with the last update, and I can’t even drop to a TTY. My setup also has an AMD GPU for graphics. Will update this post later.

No more disk errors in my logs. I ran the e2fsck command and used disks to check the root drive from a live USB; all seems fine two hours in, with no freezing so far.

Same problem with an AMD CPU and GPU; the 5.12 rc kernel did not help.

Ohhh, I’m sorry to hear that. I would have thought it was a disk issue, but if your scan showed no faulty devices then that may not be it. Are you running the experimental kernel?

Same problem here, also using an AMD 3400G. Since an update a few days ago the system is unstable and will page fault and crash. Sometimes if I hit Ctrl-Alt-F1 it resets to the login screen, but sometimes it’s just a total hard freeze.
Journalctl shows a long list of errors in red:

Apr 27 21:27:04 desktop kernel: amdgpu 0000:09:00.0: amdgpu: [gfxhub0] retry page fault (src_id:0 ring:0 vmid:4 pasid:32778, for process brave pid 25334 thread brave:cs0 pid 25361)
Apr 27 21:27:04 desktop kernel: amdgpu 0000:09:00.0: amdgpu:   in page starting at address 0x0000800000579000 from client 27
Apr 27 21:27:04 desktop kernel: amdgpu 0000:09:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00400C31
Apr 27 21:27:04 desktop kernel: amdgpu 0000:09:00.0: amdgpu:          Faulty UTCL2 client ID: CPG (0x6)
Apr 27 21:27:04 desktop kernel: amdgpu 0000:09:00.0: amdgpu:          MORE_FAULTS: 0x1
Apr 27 21:27:04 desktop kernel: amdgpu 0000:09:00.0: amdgpu:          WALKER_ERROR: 0x0
Apr 27 21:27:04 desktop kernel: amdgpu 0000:09:00.0: amdgpu:          PERMISSION_FAULTS: 0x3
Apr 27 21:27:04 desktop kernel: amdgpu 0000:09:00.0: amdgpu:          MAPPING_ERROR: 0x0
Apr 27 21:27:04 desktop kernel: amdgpu 0000:09:00.0: amdgpu:          RW: 0x0

etc

I am not a power user or anything and I don’t know how to downgrade stuff; I just update everything every couple of days.

This may help, maybe, not sure; still normal for almost 48 hours :innocent:

kernel-5.10 with AMD Ryzen 5 3550H with Radeon Vega Mobile Gfx (8)

Still crashing :broken_heart:

I will try Pop!_OS!

My mistake really, as my post is only somewhat related: I had been experiencing random system freezes since the last update and, as I have an AMD GPU, I thought your situation was perhaps connected. I saw no errors in the logs connected to my graphics, only a disk error. Sometimes, though, you fix one error and then another is revealed. However, my system is now stable, with no serious issues visible in any logs. My kernel is the latest but not the experimental one. Thanks for taking the time to reply, and I hope you find a solution soon.

Ohh, sorry to hear that you got affected by this too. It seems there’s a big bunch of us suffering from this already :cry:

Have you tried running the 5.12 experimental kernel? I wouldn’t call it a definitive solution, but it has improved the experience of many users here (mine as well), since it introduced a fix related to GPU memory overflows (to put it in a simple way; a better and more detailed explanation can be found some comments above, with a reference to the commit for that fix).

You can simply do so by opening the Kernel application (if you’re running KDE) or by running sudo pacman -S linux512 at your terminal. Both will install the latest kernel, and then you’ll have to reboot your machine to start it (GRUB should have it set as the default kernel after installing it). You can verify what kernel you’re running after logging in by typing uname -r in the terminal.
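
If you want a quick check after rebooting, something along these lines should do. It is only a sketch: `kernel_in_series` is a helper name I made up, and it just compares the start of the release string printed by uname -r against the series you expect.

```shell
#!/bin/sh
# Hypothetical helper: succeed if a kernel release string belongs to a series.
# $1: release string as printed by `uname -r`
# $2: series to look for, such as 5.12
kernel_in_series() {
    case "$1" in
        "$2".*|"$2"-*) return 0 ;;
        *) return 1 ;;
    esac
}

if kernel_in_series "$(uname -r)" "5.12"; then
    echo "running a 5.12 kernel"
else
    echo "running $(uname -r), so the 5.12 entry was not the one booted"
fi
```

If it reports a different series, GRUB most likely booted another installed kernel, and you can pick the 5.12 entry from the GRUB menu at startup.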

Nice! I hadn’t thought of or looked into that as a possible cause for this. I’ll give it a deeper read later and probably get in touch with that comment’s author to see where our findings overlap :thinking:

Quick update from my system:

I’m still using kernel 5.12-rc7 but added the following boot parameter:

rcu_nocbs=0-7

For a few days now I have had no GPU-related crashes/freezes (just one instance of a USB driver crash… but that’s another problem).

I’ll definitely try that! Did you add it to your GRUB setup or are you adding it manually before booting?

I’ve been experiencing fewer crashes than before with the newest kernel, but they still happen sometimes. I just had a system freeze after a screen-tearing effect I had never seen before (maybe it’s related to the GPU power issue that @happyxhw mentioned), but I was able to shut the system down cleanly by switching to a TTY and running shutdown now.

UPDATE: I hope that kernel parameter is really a solution, but per its documentation I’d bet it’s a different thing. Here it says that it takes the listed CPUs out of RCU (Read-Copy-Update) callback processing and offloads their callbacks to kernel threads; maybe it has some influence on GPU processes :thinking:
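
For anyone who wants the parameter to persist across reboots, here is a rough sketch of adding it to the GRUB defaults. `add_kernel_param` is a hypothetical helper of mine, and I am assuming the usual GRUB_CMDLINE_LINUX_DEFAULT="..." line in /etc/default/grub; the 0-7 range is simply the one from the post above and presumably has to match your own CPU threads. It works on a copy so nothing is touched until you have looked at the result.

```shell
#!/bin/sh
# Hypothetical helper: append a kernel parameter to the
# GRUB_CMDLINE_LINUX_DEFAULT line of the given file, unless already present.
# $1: path to a GRUB defaults file (work on a copy first)
# $2: parameter, such as rcu_nocbs=0-7 (must not contain | or &)
add_kernel_param() {
    # Skip the edit if the parameter is already on that line.
    if grep '^GRUB_CMDLINE_LINUX_DEFAULT=' "$1" | grep -q -F -- "$2"; then
        return 0
    fi
    # Insert the parameter just before the closing quote.
    sed "s|^GRUB_CMDLINE_LINUX_DEFAULT=\"\(.*\)\"|GRUB_CMDLINE_LINUX_DEFAULT=\"\1 $2\"|" "$1" > "$1.new" &&
        mv "$1.new" "$1"
}

# Example usage:
#   cp /etc/default/grub /tmp/grub.copy
#   add_kernel_param /tmp/grub.copy rcu_nocbs=0-7
#   grep GRUB_CMDLINE_LINUX_DEFAULT /tmp/grub.copy   # review the result
#   sudo cp /tmp/grub.copy /etc/default/grub && sudo update-grub
```

After the next reboot, cat /proc/cmdline should show the parameter if GRUB picked it up.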

I’ve just read the announcement post of the latest stable update at [Stable Update] 2021-04-28 - Kernels, Wine, Ruby, JDK, KDE-Dev, Mesa 21.0.3, KDE Apps 21.04, Python, Haskell, Mate 1.24.2, Virtualbox, Thunderbird, and it claims that

Mesa got fixed for issues reported on AMD graphics cards

Hopefully this is the case. I’ll try running a system upgrade.