System frequently crashing after GPU drivers update

I found a workaround for my Lenovo E585 with a Ryzen 2500U:

  1. Install downgrade:
    yay -S downgrade
  2. Show the available mesa versions:
    sudo downgrade --ala-only mesa
  3. Select the newest version, mesa 21.1.0, which is actually an upgrade.
  4. Don’t put mesa on the ignore list.
  5. Reboot.
  6. No more crashes (since yesterday). amdgpu still logs significant stack traces and errors, but the newer mesa version seems to handle amdgpu crashes with only slight lag and without killing the whole system.

Kernel: 5.12
Desktop: XFCE with Xorg
Screen: 4k 30Hz

Another workaround: boot Windows from a second HDD, then boot Linux inside a VM (e.g. VirtualBox with raw-disk pass-through) with direct HDD access. The Windows GPU driver will not crash.

False positive: I could not fix the system crashes by downgrading mesa and/or amdgpu. I just had another crash.

Just for the record: this is no workaround - this is using a whole other operating system to avoid using the amdgpu driver.

Yeah, I tried that mesa downgrade at first, without success. I used to think these problems were due to a GPU driver update, but reverting it and going back to a previous package state did not fix my problems at all. The culprit seems to get farther and farther from us each day.

My suggestion is to keep updating along with the Stable branch and to use the latest 5.12 kernel (because of the possible fix mentioned earlier in this thread). I still get freezes and crashes, but hopefully with decreasing frequency.

The same problem appeared for me after a recent upgrade on Arch Linux (after a while not upgrading). No problems on the same machine in the previous 2 years.

  • OS: Arch Linux x86_64
  • Host: Lenovo E595
  • Kernel: 5.12.3-arch1-1
  • DE: i3/regolith
  • CPU: AMD® Ryzen 7 3700u with radeon vega mobile gfx × 8
  • GPU: AMD® Radeon™ vega 10 graphics

Running with iommu=soft amd_iommu=pt ivrs_ioapic[32]=00:14.0 intel_iommu=igfx_off , using mesa-git, xf86-video-amdgpu-git
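
If you want to confirm that parameters like these were actually picked up at boot, a minimal sketch follows. The function name has_kernel_param is made up for illustration; it only does an exact match against the space-separated entries of /proc/cmdline and does not tell you whether the kernel acted on the parameter.

```shell
# Succeed if the exact parameter $1 appears on the kernel command line.
# $2 is an optional file to read instead of /proc/cmdline (handy for testing).
has_kernel_param() {
    param=$1
    cmdline=${2:-/proc/cmdline}
    # One parameter per line, then a fixed-string, whole-line match,
    # so "iommu=pt" does not match "amd_iommu=pt".
    tr ' ' '\n' < "$cmdline" | grep -qxF -- "$param"
}

# Example (run on the affected machine):
#   has_kernel_param amd_iommu=pt && echo "amd_iommu=pt is active"
```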

Very sporadically, the screen freezes (audio still seems to work), and the DE restarts after ~30 seconds.

mai 12 21:32:15 e595 kernel: [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out!
mai 12 21:32:15 e595 kernel: psmouse serio1: TouchPad at isa0060/serio1/input0 lost synchronization, throwing 5 bytes>
mai 12 21:32:15 e595 kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered
mai 12 21:32:15 e595 kernel: amdgpu 0000:05:00.0: amdgpu: [gfxhub0] retry page fault (src_id:0 ring:0 vmid:5 pasid:32>
mai 12 21:32:15 e595 kernel: amdgpu 0000:05:00.0: amdgpu:   in page starting at address 0x800105600000 from client 27
mai 12 21:32:15 e595 kernel: amdgpu 0000:05:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00501031
mai 12 21:32:15 e595 kernel: amdgpu 0000:05:00.0: amdgpu:          Faulty UTCL2 client ID: TCP (0x8)
mai 12 21:32:15 e595 kernel: amdgpu 0000:05:00.0: amdgpu:          MORE_FAULTS: 0x1
mai 12 21:32:15 e595 kernel: amdgpu 0000:05:00.0: amdgpu:          WALKER_ERROR: 0x0
mai 12 21:32:15 e595 kernel: amdgpu 0000:05:00.0: amdgpu:          PERMISSION_FAULTS: 0x3
mai 12 21:32:15 e595 kernel: amdgpu 0000:05:00.0: amdgpu:          MAPPING_ERROR: 0x0
mai 12 21:32:15 e595 kernel: amdgpu 0000:05:00.0: amdgpu:          RW: 0x0
(several repetitions)
mai 12 21:32:15 e595 regolith.desktop[5429]: [GFX1-]: GFX: RenderThread detected a device reset in PostUpdate
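
For anyone who wants to compare how often this happens before and after a package change, here is a rough sketch. The function name count_gpu_errors is hypothetical, and the patterns are taken only from the log lines above, so other amdgpu messages are not counted.

```shell
# Count lines on stdin that look like the amdgpu timeout / page-fault
# messages quoted above. Prints the number of matching lines.
count_gpu_errors() {
    grep -c -E 'ring gfx timeout|Waiting for fences timed out|VM_L2_PROTECTION_FAULT_STATUS'
}

# Example: kernel messages of the previous boot
#   journalctl -k -b -1 | count_gpu_errors
```

Note that grep -c returns a non-zero exit status when the count is 0, which matters if you call this from a script running with set -e.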

Relevant upgrade that might have introduced the issue:

  • mesa-git (1:21.1.0_devel.137471.fbebe365476-1 → 1:21.2.0_devel.139210.922f71b819b-1)
  • vulkan-icd-loader (1.2.172-1 → 1.2.176-1)
  • vulkan-headers (1:1.2.173-1 → 1:1.2.177-1)
  • xf86-video-amdgpu-git (538.6ed4863-1 → 539.aedbf47-1)
  • amdvlk (2021.Q1.6-1 → 2021.Q2.2-1)

Cross-posting to gitlab freedesktop org drm amd issues 934 (sorry, can’t post links yet).

Downgrading linux-firmware seems to have solved the GPU freezes for me. The computer has now been running for almost two days without any problems.

CPU: Quad Core AMD Ryzen 5 PRO 3400G with Radeon Vega Graphics (-MT MCP-) speed/min/max: 1399/1400/3700 MHz
Kernel: 5.10.34-1-MANJARO x86_64 Up: 1d 18h 09m Mem: 5348.1/30077.2 MiB (17.8%) Storage: 1.59 TiB (38.2% used) Procs: 380
Shell: fish inxi: 3.3.04

gnome-shell 1:3.38.4+13+gcf9d73ed5-1
lib32-mesa 21.0.3-3
lib32-mesa-vdpau 21.0.3-3
libva-mesa-driver 21.0.3-3
linux-firmware 20201124.r1786.b362fd4-1
mesa 21.0.3-3
mesa-demos 8.4.0-4
mesa-vdpau 21.0.3-3

Thanks for the hint about linux-firmware!
I used downgrade to upgrade linux-firmware to 20210511.7685cf4 and have not had any crashes since yesterday. The journalctl logs look fine. I will keep you updated in case of any new crashes.

Even after re-enabling XFCE composite effects - no crashes for many hours. :slight_smile:

Update: linux-firmware 20210511.7685cf4 solved the amdgpu-related crashes for me.

However, I am having problems with my Lenovo USB-C docking station: it crashes or hangs from time to time. But this is a separate issue and could be related to other system components.

I’m glad the upgrade solved it both for you and @AkhIL!! I’ll try performing it soon. Still, what evidence are you basing it on when you say it has solved the freezing problem? I don’t mean to sound pessimistic, but I’ve already experienced week-long spans without crashing, hoping that it meant a final solution for the problem, just to have another random crash again :confused:

I’m just asking because you may have noticed some microcode/code change in that update that could be handling the (still unknown) culprit of these issues. If you’re basing it on the growing crash-free spans, then all my hopes are with you to have found a definitive solution :pray:

Worked for me too, on another OS though.
It was hard-crashing every ~3-6 hours; I backed off the firmware to a version from November and am now sitting on a day of uptime under fairly heavy load.

Should probably file a bug with the firmware upstream.

I got a freeze in less than ten minutes with linux-firmware 20210512.r1926.55d9649-1. I had multiple days of uptime with 20201124.r1786.b362fd4-1. Trying 20210211.r1830.f7915a0-1 right now.

linux-firmware-20210315.r1846.3568f96-1 should work. I have this version in a fully functional snapshot.
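
Since the reports here all hinge on the date stamp at the front of the linux-firmware version, a small sketch for comparing two versions by that date might help. The function names fw_date and fw_not_newer_than are made up for illustration, and the sketch assumes the version string starts with YYYYMMDD followed by a dot, as in the versions quoted in this thread.

```shell
# Print the leading date stamp (YYYYMMDD) of a linux-firmware version
# string, e.g. "20210315.r1846.3568f96-1" gives "20210315".
fw_date() {
    printf '%s\n' "${1%%.*}"
}

# Succeed if the version in $1 is dated no later than the version in $2.
fw_not_newer_than() {
    [ "$(fw_date "$1")" -le "$(fw_date "$2")" ]
}

# Example: compare the installed package against the version suggested above
#   installed=$(pacman -Q linux-firmware | awk '{print $2}')
#   fw_not_newer_than "$installed" 20210315.r1846.3568f96-1 && echo "not newer"
```

This only compares dates. It says nothing about whether a given firmware version is actually stable on your hardware.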

@poynting_factor your guess seems correct. I agree with you and @AkhIL: the update just reduced the frequency of crashes for me. Only downgrading fixed the problem for me. I am currently using 20201218.646f159.
I will start testing linux-firmware-20210315.r1846.3568f96-1 as suggested by @AkhIL.

Slightly off-topic: I upgraded to kernel 5.13 rc1 and this is the first time I am actually getting 4k 60 Hz (instead of only 30 Hz) with my Lenovo E585. So far, this seems stable with 20210315.3568f96.

Well, I think the problem is buried somewhere deep inside linux-firmware.
I have now been using 20201218.646f159 for about 4 days and roughly 30 h of uptime. NOT ONCE did my system freeze…

I am seriously considering the upgrade to 5.13. Did you allow linux-firmware / amd-ucode to update with the kernel, or are you still on the above-mentioned version?

Update on my side: I’ve been running smoothly for at least a week, with no crashes or unexpected freezes. I ran the last stable upgrade (from May 19th) a couple of days ago, and as of now nothing seems to have broken. Here are my current package versions:

  • linux-firmware: 20210511.r1922.7685cf4-1.
  • Kernel: 5.12.2-1.
  • mesa and its dependencies: 21.0.3-3.

Not using any special kernel parameters, either. Honestly, I don’t know what may have fixed it. I’m betting that a combination of the firmware and kernel updates solved it (or at least reduced the crash frequency to the point where I can’t get too frustrated about it :sweat_smile:). My guess is still that line casting a uint to a bigger architecture when performing pagination on AMD devices; the commit was linked earlier here: System frequently crashing after GPU drivers update - #30 by fkfd. I’m not marking it as a solution since I can’t think of any way to prove it was what really fixed the issue, but I hope I can do so in the future.

Hello,

I’m using the same packages, and at first it seemed to have solved the issue. Yesterday I had a freeze again with the same logs in journalctl, but unexpectedly it recovered instead of freezing permanently. Before the upgrade, after a freeze I had to restart LightDM via CTRL-ALT-F*, or reboot if I wasn’t fast enough.

I’m using a Lenovo ThinkPad E595 with Cinnamon.

Hope it helps.

Kind regards

I have to correct myself: sadly, it happened again. So the above-mentioned packages didn’t solve it for me…

I am on a Ryzen 5 3400G:

Latest stable upgrades, but with a downgraded linux-firmware version (20210315.3568f96). It seems that this is the way to go.
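
Staying on the latest stable upgrades with an older linux-firmware only works if pacman is told to hold that package back, otherwise the next full upgrade will pull in the newer firmware again. One way is an IgnorePkg line in /etc/pacman.conf. Below is a minimal sketch for checking whether a package is already listed there. The function name is_ignored is made up for illustration, and it only handles literal package names on IgnorePkg lines, not glob patterns or IgnoreGroup.

```shell
# Succeed if package $1 is listed literally on an IgnorePkg line.
# $2 is an optional config path (defaults to /etc/pacman.conf).
is_ignored() {
    pkg=$1
    conf=${2:-/etc/pacman.conf}
    awk -v pkg="$pkg" '
        # Only active (uncommented) IgnorePkg lines are considered.
        /^[[:space:]]*IgnorePkg[[:space:]]*=/ {
            sub(/^[^=]*=/, "")          # drop "IgnorePkg =", keep the names
            for (i = 1; i <= NF; i++)
                if ($i == pkg) found = 1
        }
        END { exit !found }
    ' "$conf"
}

# Example:
#   is_ignored linux-firmware || echo "linux-firmware will be upgraded on the next -Syu"
```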

Hi @Grimmzz ! What kernel and mesa versions are you using?