System frequently crashing after GPU drivers update

It is the GitLab instance for kernel graphics development, specifically the repository for collecting issues regarding the AMD drivers. Quote:

amd (amdgpu, amdkfd, radeon) drm project, currently for issues only.

I suggest you search the existing bug reports to see whether one already matches your problem, and open a new one only if none fits.


When you create a new issue there, you can select a template which gives you a predefined structure with the things that you should provide.

Template

Brief summary of the problem:

<TODO: Briefly describe your issue>

Hardware description:

  • CPU:
  • GPU:
  • System Memory:
  • Display(s):
  • Type of Display Connection: <TODO: DP, HDMI, DVI, etc>

System information:

  • Distro name and Version: <TODO: e.g., Ubuntu 20.04.1>
  • Kernel version: <TODO: e.g., 5.6.11>
  • Custom kernel: <TODO: e.g., Kernel from drm-misc-next, commit: “Message”>
  • AMD package version: <TODO: e.g., “20.02” or “No package”>

How to reproduce the issue:

< TODO: Describe step-by-step how to reproduce the issue >
< NOTE: Add as much detail as possible >

Attached files:

  • Dmesg log
  • Xorg log
  • Any other log
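
A minimal sketch of how the "System information" part of the template could be pre-filled on the affected machine. The function names `read_os_release` and `render_system_information` are hypothetical helpers made up for this example, and it assumes the distro ships a standard `/etc/os-release` file; the remaining fields still have to be filled in by hand.

```python
import platform
from pathlib import Path


def read_os_release(path="/etc/os-release"):
    """Return PRETTY_NAME from os-release, or 'unknown' if it cannot be read."""
    try:
        for line in Path(path).read_text().splitlines():
            if line.startswith("PRETTY_NAME="):
                return line.split("=", 1)[1].strip().strip('"')
    except OSError:
        pass
    return "unknown"


def render_system_information(distro, kernel, custom_kernel="No",
                              amd_package="No package"):
    """Render the 'System information' block of the issue template."""
    return "\n".join([
        "System information:",
        "",
        f"- Distro name and Version: {distro}",
        f"- Kernel version: {kernel}",
        f"- Custom kernel: {custom_kernel}",
        f"- AMD package version: {amd_package}",
    ])


if __name__ == "__main__":
    # platform.release() reports the running kernel version.
    print(render_system_information(read_os_release(), platform.release()))
```

The output can be pasted into the new issue; the dmesg and Xorg logs still need to be attached separately.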

Thank you.

I posted my contribution here:

https://gitlab.freedesktop.org/drm/amd/-/issues/1322


No worries (unless it’s your fault :wink:)! Just passing on information in case it helps :slight_smile:


Thank you! I hope I don’t have to use it in the future, but this template seems great; if I start experiencing these kinds of issues again, I’ll definitely post my issue on their site.

Still happening with the latest mesa driver and the 5.10 kernel:

May 08 11:18:57 self kernel: amdgpu 0000:03:00.0: amdgpu:   in page starting at address 0x000080011ba00000 from client 27
May 08 11:18:57 self kernel: amdgpu 0000:03:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x003C0071
May 08 11:18:57 self kernel: amdgpu 0000:03:00.0: amdgpu: 	 Faulty UTCL2 client ID: CB (0x0)
May 08 11:18:57 self kernel: amdgpu 0000:03:00.0: amdgpu: 	 MORE_FAULTS: 0x1
May 08 11:18:57 self kernel: amdgpu 0000:03:00.0: amdgpu: 	 WALKER_ERROR: 0x0
May 08 11:18:57 self kernel: amdgpu 0000:03:00.0: amdgpu: 	 PERMISSION_FAULTS: 0x7
May 08 11:18:57 self kernel: amdgpu 0000:03:00.0: amdgpu: 	 MAPPING_ERROR: 0x0
May 08 11:18:57 self kernel: amdgpu 0000:03:00.0: amdgpu: 	 RW: 0x1
May 08 11:18:57 self kernel: amdgpu 0000:03:00.0: amdgpu: [gfxhub0] retry page fault (src_id:0 ring:0 vmid:3 pasid:32772, for process Xorg pid 4170 thread Xorg:cs0 pid 4171)
May 08 11:18:57 self kernel: amdgpu 0000:03:00.0: amdgpu:   in page starting at address 0x000080011ba01000 from client 27
May 08 11:18:57 self kernel: amdgpu 0000:03:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x003C0071
May 08 11:18:57 self kernel: amdgpu 0000:03:00.0: amdgpu: 	 Faulty UTCL2 client ID: CB (0x0)
May 08 11:18:57 self kernel: amdgpu 0000:03:00.0: amdgpu: 	 MORE_FAULTS: 0x1
May 08 11:18:57 self kernel: amdgpu 0000:03:00.0: amdgpu: 	 WALKER_ERROR: 0x0
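
For anyone comparing fault reports: the fields the driver prints under VM_L2_PROTECTION_FAULT_STATUS appear to be plain bit fields of that one register value. Below is a rough sketch that decodes them. The bit positions in `FIELDS` are an assumption inferred from the log values in this thread (they reproduce 0x003C0071 and 0x00501031 correctly), not something taken from official documentation, and the function names are made up for this example.

```python
import re

# ASSUMED bit layout as (shift, width in bits), inferred from the logs in
# this thread. It may differ on other GPU generations.
FIELDS = {
    "MORE_FAULTS": (0, 1),
    "WALKER_ERROR": (1, 3),
    "PERMISSION_FAULTS": (4, 4),
    "MAPPING_ERROR": (8, 1),
    "CLIENT_ID": (9, 9),
    "RW": (18, 1),
}


def decode_fault_status(status):
    """Split a fault status value into its named bit fields."""
    return {name: (status >> shift) & ((1 << width) - 1)
            for name, (shift, width) in FIELDS.items()}


def status_from_log_line(line):
    """Extract the hex status value from a kernel log line, or None."""
    match = re.search(r"VM_L2_PROTECTION_FAULT_STATUS:(0x[0-9a-fA-F]+)", line)
    return int(match.group(1), 16) if match else None


if __name__ == "__main__":
    line = ("amdgpu 0000:03:00.0: amdgpu: "
            "VM_L2_PROTECTION_FAULT_STATUS:0x003C0071")
    for name, value in decode_fault_status(status_from_log_line(line)).items():
        print(f"{name}: {value:#x}")
```

This is mainly useful when only the status line survives in a truncated log and the individual field lines are missing.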

Kernel 5.11.14 is better.

Hi, everyone c: This is the second day without any amdgpu crash. What I have done: added amdgpu.noretry, moved to kernel 5.12, and disabled the IOMMU in the BIOS. (Or maybe the last update fixed it?)


Hope it is a fix for you! :crossed_fingers: In my case, when I set the amdgpu.noretry=0 kernel parameter I couldn’t even boot into Manjaro, just getting a black screen after picking the option in GRUB.
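
Since kernel parameters keep coming up, here is a small sketch for checking which ones the running kernel was actually booted with, so it is clear whether an `amdgpu.noretry` setting took effect. `parse_cmdline` and `read_running_cmdline` are hypothetical helper names; the sketch assumes a Linux system where `/proc/cmdline` is readable.

```python
import shlex
from pathlib import Path


def parse_cmdline(cmdline):
    """Map each kernel parameter to its value (None for bare flags)."""
    params = {}
    for token in shlex.split(cmdline):
        key, sep, value = token.partition("=")
        params[key] = value if sep else None
    return params


def read_running_cmdline(path="/proc/cmdline"):
    """Parse the command line of the currently running kernel."""
    return parse_cmdline(Path(path).read_text())


if __name__ == "__main__":
    if Path("/proc/cmdline").exists():
        params = read_running_cmdline()
        if "amdgpu.noretry" in params:
            print("amdgpu.noretry =", params["amdgpu.noretry"])
        else:
            print("amdgpu.noretry is not set on the kernel command line")
```

If the parameter does not show up, the bootloader configuration was probably not regenerated after editing it.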

An update on my end: until yesterday, I had gone slightly more than a week with no GPU-related freezing or crashing. I left my computer on for a while and, when I came back to it, a lot of page faults had occurred and my system was completely frozen. (Note that this time I wasn’t even using it; it was just running some processes in the foreground and had a couple of applications open.) All of this happened on the 5.12.0-1 kernel with the mesa updates that came with the April 28th system upgrade.

So I guess I can call it a personal freeze-free record since I first experienced these issues around a month ago. I’m glad the situation is improving, but I can’t yet say my problem is solved. I’ve seen that a system update came out on May 6th, including some mesa updates, but I haven’t dared to try it :sweat_smile:. I will probably do so in the next few days.


I’ve been getting the freezes less frequently, but still randomly, after installing the mesa drivers. So I just updated to the latest packages and to the 5.10 kernel. We’ll see how long it takes for this combo to go belly-up.

Lol, I will buy a laptop with an Intel Xe GPU.

So, it has crashed three times today…

Perhaps you could skip all that financial trouble by switching distros?

I used Pop!_OS for a short time; everything was fine except for Bluetooth, so in the end I switched back.

Pop!_OS is a good distribution and worth trying.

I found a workaround for my Lenovo E585 with a Ryzen 2500U:

  1. Install Downgrade
    yay -S downgrade
  2. Show available mesa versions:
    sudo downgrade --ala-only mesa
  3. Select the newest version, mesa 21.1.0, which is actually an upgrade.
  4. Don’t add mesa to the ignore list
  5. Reboot
  6. No crashing anymore (since yesterday). Amdgpu still has significant stack traces and errors. But it seems that the newer mesa version handles amdgpu crashes with only slight lagging and without killing the whole system.

Kernel: 5.12
Desktop: XFCE with Xorg
Screen: 4k 30Hz

Another workaround: boot Windows from a second HDD, then use e.g. VirtualBox with raw-disk pass-through to boot Linux inside a VM with direct HDD access. The Windows GPU driver will not crash.

False positive: I could not fix the system crashes by downgrading mesa and/or amdgpu. I just had another crash.

Just for the record: this is no workaround - this is using a whole other operating system to avoid using amdgpu driver.


Yeah, I tried that mesa downgrade at first, without success. I used to think these problems were due to a GPU driver update, but reverting it and going back to a previous package state did not fix them at all. The culprit seems to get farther and farther from us each day.

My suggestion is to keep updating along with the Stable branch and to use the latest 5.12 kernel (because of the possible fix mentioned earlier in this thread). I still get freezes and crashes, but hopefully with decreasing frequency.

The same problem appeared for me after a recent upgrade on Arch Linux (after a while not upgrading). No problems on the same machine in the previous 2 years.

  • OS: Arch Linux x86_64
  • Host: Lenovo E595
  • Kernel: 5.12.3-arch1-1
  • DE: i3/regolith
  • CPU: AMD® Ryzen 7 3700u with radeon vega mobile gfx × 8
  • GPU: AMD® Radeon™ vega 10 graphics

Running with iommu=soft amd_iommu=pt ivrs_ioapic[32]=00:14.0 intel_iommu=igfx_off , using mesa-git, xf86-video-amdgpu-git

Very sporadically, the screen freezes (audio still seems to work), and the DE restarts after ~30 seconds.

mai 12 21:32:15 e595 kernel: [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out!
mai 12 21:32:15 e595 kernel: psmouse serio1: TouchPad at isa0060/serio1/input0 lost synchronization, throwing 5 bytes>
mai 12 21:32:15 e595 kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered
mai 12 21:32:15 e595 kernel: amdgpu 0000:05:00.0: amdgpu: [gfxhub0] retry page fault (src_id:0 ring:0 vmid:5 pasid:32>
mai 12 21:32:15 e595 kernel: amdgpu 0000:05:00.0: amdgpu:   in page starting at address 0x800105600000 from client 27
mai 12 21:32:15 e595 kernel: amdgpu 0000:05:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00501031
mai 12 21:32:15 e595 kernel: amdgpu 0000:05:00.0: amdgpu:          Faulty UTCL2 client ID: TCP (0x8)
mai 12 21:32:15 e595 kernel: amdgpu 0000:05:00.0: amdgpu:          MORE_FAULTS: 0x1
mai 12 21:32:15 e595 kernel: amdgpu 0000:05:00.0: amdgpu:          WALKER_ERROR: 0x0
mai 12 21:32:15 e595 kernel: amdgpu 0000:05:00.0: amdgpu:          PERMISSION_FAULTS: 0x3
mai 12 21:32:15 e595 kernel: amdgpu 0000:05:00.0: amdgpu:          MAPPING_ERROR: 0x0
mai 12 21:32:15 e595 kernel: amdgpu 0000:05:00.0: amdgpu:          RW: 0x0
(several repetitions)
mai 12 21:32:15 e595 regolith.desktop[5429]: [GFX1-]: GFX: RenderThread detected a device reset in PostUpdate
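
To compare how often each kind of event shows up between kernel or mesa versions, the log lines can be tallied with a short script. This is only a sketch: it matches on the message texts seen in the logs in this thread, and `summarize` is a hypothetical helper name.

```python
import re
from collections import Counter

# Matches e.g. "Faulty UTCL2 client ID: TCP (0x8)"
CLIENT_RE = re.compile(r"Faulty UTCL2 client ID: (\w+) \((0x[0-9a-fA-F]+)\)")


def summarize(lines):
    """Count amdgpu events and faulting client IDs in kernel log lines."""
    events = Counter()
    clients = Counter()
    for line in lines:
        if "Waiting for fences timed out" in line:
            events["fence_timeout"] += 1
        if "ring gfx timeout" in line:
            events["ring_timeout"] += 1
        if "retry page fault" in line:
            events["retry_page_fault"] += 1
        match = CLIENT_RE.search(line)
        if match:
            clients[match.group(1)] += 1
    return events, clients


if __name__ == "__main__":
    sample = [
        "kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered",
        "kernel: amdgpu 0000:05:00.0: amdgpu: [gfxhub0] retry page fault (src_id:0 ring:0 vmid:5",
        "kernel: amdgpu 0000:05:00.0: amdgpu:          Faulty UTCL2 client ID: TCP (0x8)",
    ]
    events, clients = summarize(sample)
    print(dict(events))
    print(dict(clients))
```

In practice the lines would come from a saved kernel log file rather than the hard-coded sample.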

Relevant upgrade that might have introduced the issue:

  • mesa-git (1:21.1.0_devel.137471.fbebe365476-1 → 1:21.2.0_devel.139210.922f71b819b-1)
  • vulkan-icd-loader (1.2.172-1 → 1.2.176-1)
  • vulkan-headers (1:1.2.173-1 → 1:1.2.177-1)
  • xf86-video-amdgpu-git (538.6ed4863-1 → 539.aedbf47-1)
  • amdvlk (2021.Q1.6-1 → 2021.Q2.2-1)

Cross-posting to gitlab freedesktop org drm amd issues 934 (sorry, can’t post links yet).


Downgrading linux-firmware seems to have solved the GPU freezes for me. The computer has now been running for almost two days without any problems.

CPU: Quad Core AMD Ryzen 5 PRO 3400G with Radeon Vega Graphics (-MT MCP-) speed/min/max: 1399/1400/3700 MHz
Kernel: 5.10.34-1-MANJARO x86_64 Up: 1d 18h 09m Mem: 5348.1/30077.2 MiB (17.8%) Storage: 1.59 TiB (38.2% used) Procs: 380
Shell: fish inxi: 3.3.04

gnome-shell 1:3.38.4+13+gcf9d73ed5-1
lib32-mesa 21.0.3-3
lib32-mesa-vdpau 21.0.3-3
libva-mesa-driver 21.0.3-3
linux-firmware 20201124.r1786.b362fd4-1
mesa 21.0.3-3
mesa-demos 8.4.0-4
mesa-vdpau 21.0.3-3
