System frequently crashing after GPU driver update

It’s a desktop, and the RX 570 was made by MSI. It usually crashes in StarCraft 2 while I’m in the lobby; if I remember correctly, it has never crashed while actually “playing”, that is, controlling units. It’s always in the lobby. It also crashes in the Vivaldi browser, just less often.

Another hope is that AMD is hiring more engineers to work on the GPU driver, so let’s see if they can fix the issue.

It’s interesting that you use a desktop; I was expecting your RX 570 to be a laptop GPU instead. I don’t play any games, especially GPU-intensive ones, and I don’t use Chromium-based browsers (even though I use some Electron-based apps like VSCode and Discord every day), or I might just be plain lucky with my specific setup.

Yeah, although the GPU is fairly new (2 years old), my PC is very old, so it should be even more stable, as it was when it was running Windows 7.

My GPU model comes with a factory (OEM) overclock, and I frequently boot into Windows to check for firmware upgrades, but it already has the most recent one.

GPU Radeon RX-570 ARMOR 4G OC MSI
CPU AMD Phenom II X4 965 BE
Motherboard Gigabyte GA-870A-UD3
8GB RAM DDR3

I was planning to buy a new PC last year, but due to COVID’s effect on the economy the plan has been postponed indefinitely.

I have the same issue now.

My AMD Radeon RX 5700 crashed and froze the whole desktop after I watched YouTube for more than 50 minutes in the Vivaldi browser with hardware acceleration enabled.

  • Branch: Manjaro KDE latest version (stable branch)
  • GPU: Radeon RX 5700
  • Kernel version: 5.14.10-1-MANJARO
  • Mesa version: OpenGL 4.6 (Compatibility Profile), Mesa 21.2.3
  • Xserver version: X.Org X Server 1.20.13, X Protocol Version 11, Revision 0
  • Desktop manager and compositor: KDE

Output of journalctl -b -1 -p 3:

Okt 17 20:22:31 zesko kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring vcn_dec timeout, signaled seq=2315893, emitted seq=2315895
Okt 17 20:22:31 zesko kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process vivaldi-bin pid 45233 thread vivaldi-bi:cs0 pid 45253
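
For anyone comparing several of these logs, here is a minimal Python sketch that pulls the ring name and the two sequence numbers out of an amdgpu_job_timedout line like the one above. The function name parse_ring_timeout and the "pending" field are made up for illustration; reading the gap between emitted and signaled as work that was submitted but had not completed is my assumption, not something stated in the log.

```python
import re
from typing import Optional

# Matches the "ring <name> timeout, signaled seq=N, emitted seq=M" part of
# the kernel message shown in the log above.
RING_TIMEOUT = re.compile(
    r"ring (?P<ring>\S+) timeout, "
    r"signaled seq=(?P<signaled>\d+), emitted seq=(?P<emitted>\d+)"
)


def parse_ring_timeout(line: str) -> Optional[dict]:
    """Return ring name and sequence numbers, or None if the line is not a ring timeout."""
    match = RING_TIMEOUT.search(line)
    if match is None:
        return None
    signaled = int(match["signaled"])
    emitted = int(match["emitted"])
    return {
        "ring": match["ring"],
        "signaled": signaled,
        "emitted": emitted,
        # Hypothetical field: difference between the two counters.
        "pending": emitted - signaled,
    }
```

Feeding it every line of `journalctl -b -1 -p 3` output and keeping the non-None results gives a quick list of which rings (vcn_dec, gfx, ...) timed out.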

A similar bug report:

Solved for me by setting power_dpm_force_performance_level to high:

Create the file /etc/udev/rules.d/30-amdgpu-pm.rules

with content

KERNEL=="card0", SUBSYSTEM=="drm", DRIVERS=="amdgpu", ATTR{device/power_dpm_state}="performance"
KERNEL=="card0", SUBSYSTEM=="drm", DRIVERS=="amdgpu", ATTR{device/power_dpm_force_performance_level}="high"

Reboot.

Enjoy with no freezing.
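
To try the setting before committing to a udev rule, the same attribute can be written at runtime through sysfs (as root; it does not survive a reboot). The following is only a sketch under assumptions: the card is card0, as in the rules above, and the helper name set_performance_level and the constants are invented for this example. The list of accepted levels is taken from the amdgpu driver documentation as I remember it, so verify it against your kernel.

```python
from pathlib import Path

# Levels the amdgpu driver is documented to accept for this attribute
# (assumption: check the kernel documentation for your version).
VALID_LEVELS = {
    "auto", "low", "high", "manual",
    "profile_standard", "profile_min_sclk", "profile_min_mclk", "profile_peak",
}

# Assumed location; on multi-GPU systems the card may not be card0.
DEFAULT_DEVICE_DIR = Path("/sys/class/drm/card0/device")


def set_performance_level(level: str, device_dir: Path = DEFAULT_DEVICE_DIR) -> str:
    """Write `level` to power_dpm_force_performance_level and return the value read back."""
    if level not in VALID_LEVELS:
        raise ValueError(f"unknown performance level: {level!r}")
    attr = Path(device_dir) / "power_dpm_force_performance_level"
    attr.write_text(level + "\n")  # needs root on a real system
    return attr.read_text().strip()
```

Calling `set_performance_level("high")` as root should return "high" if the driver accepted the value; `set_performance_level("auto")` goes back to automatic power management.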

I applied these settings.

No slowdowns or freezes so far.

:+1:

Hi veltliner and bowmandm21,

Do these settings have any drawbacks, like increased power consumption even when idle?

Sadly the udev rules didn’t prevent the GPU from resetting for me just now :confused:

[17267.848571] [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out!
[17267.848606] [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out!
[17272.978620] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=2557416, emitted seq=2557418
[17272.978788] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process plasmashell pid 1694 thread plasmashel:cs0 pid 1754
[17272.978928] amdgpu 0000:0c:00.0: amdgpu: GPU reset begin!

For me the problem also still persists with an MSI RX 480. The mobile Vega GPUs in my and my colleagues’ work notebooks are running fine so far.

@Zesko I’ve looked into the issue you posted, as my freezes only occur when I play fullscreen videos in Firefox. (I don’t know about games, as I no longer have time to play them.) But your error log looks quite different from mine. I think it’s plasmashell that’s causing it.

It’s really frustrating that the problem has persisted for so long. :confused:

Is your udev rule actually sticking? I created a help topic:
amdgpu-udev-rule-ignored/89725

Either of these commands will identify if the rule applied successfully:

udevadm info --attribute-walk /sys/class/drm/card0 | grep -Pi 'power_dpm'
udevadm info -a -n '/dev/dri/card0' | grep -Pi 'power_dpm'
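
As an alternative to grepping udevadm output, the two attributes can be read straight from sysfs and compared with what the rule is supposed to set. This is a sketch, assuming the card is card0 and the expected values are the ones from the udev rule posted earlier in this thread; read_dpm_attrs and rule_applied are names invented for the example.

```python
from pathlib import Path
from typing import Dict, Optional

# Values the udev rule from earlier in the thread is meant to set.
EXPECTED = {
    "power_dpm_state": "performance",
    "power_dpm_force_performance_level": "high",
}


def read_dpm_attrs(device_dir: str = "/sys/class/drm/card0/device") -> Dict[str, Optional[str]]:
    """Read each attribute from sysfs; a missing or unreadable file maps to None."""
    values: Dict[str, Optional[str]] = {}
    for name in EXPECTED:
        try:
            values[name] = (Path(device_dir) / name).read_text().strip()
        except OSError:
            values[name] = None
    return values


def rule_applied(values: Dict[str, Optional[str]]) -> bool:
    """True only if every attribute currently holds the value the rule should set."""
    return all(values.get(name) == want for name, want in EXPECTED.items())
```

Running `rule_applied(read_dpm_attrs())` right after boot shows whether the rule stuck; if it returns False, printing the dictionary shows which attribute still has its default.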

Curiously, my problems began in May. I have an R9 380X which suffers from a voltage draw issue. If I don’t force the performance level to high, I get graphical corruption, crackling audio, and an eventual green screen of death (the system appears to be working, audio is playing, but no input is recognized).

The actual failure relates to a sudden surge of power when the fans go directly from low to high speed without transitioning through a middle range (usually during gaming).

I went so far as to build a custom BIOS with DPM state 0 set to 1000 mV and flashed it to the card. As many of these posts point out, a firmware blob is applied during startup which invalidates this. As such, in my case, every iteration of linux-firmware causes me issues.

The same issue affects Windows 10 and I have to use a custom .xml and load it as a performance profile in the Radeon Adrenalin drivers.

I always assumed my card was an anomaly related to power supply and motherboard. There was no reason to change the hardware once I’d figured out the solution.

I just can’t get a udev rule to apply automatically.

Here are some reasons that can cause this issue:

  • Your computer’s spec is too low

  • You overclocked too high

  • The game’s settings are wrong

  • Your graphics card requires too much power

  • You need to upgrade your operating system

  • You need to upgrade device drivers

  • Your network isn’t fast enough

  • Digital Rights Management is causing problems

  • Games are running in the wrong mode

  • Your antivirus is crashing games

  • Using a VPN is slowing online games to the point they crash

  • You have too many browser tabs open

There are many different error logs for the same issue, and they mention plasmashell too. I do not think plasmashell is at fault; I think the browser in combination with the AMD driver is.

See my first error log today:

Nov 15 08:16:52 zesko kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring vcn_dec timeout, signaled seq=6534, emitted seq=6536
Nov 15 08:16:52 zesko kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process vivaldi-bin pid 2019 thread vivaldi-bi:cs0 pid 2069
Nov 15 08:17:00 zesko kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
Nov 15 08:17:00 zesko kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
Nov 15 08:17:00 zesko kernel: amdgpu 0000:09:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:221 vmid:6 pasid:32771, for process plasmashell pid 1228 thread plasmashel:cs0 pid 1293)
Nov 15 08:17:00 zesko kernel: amdgpu 0000:09:00.0: amdgpu:   in page starting at address 0x0000800000000000 from client 0x1b (UTCL2)

My second error log today:

Nov 15 08:27:50 zesko kernel: [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out!
Nov 15 08:27:55 zesko kernel: [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out!
Nov 15 08:27:55 zesko kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=235399, emitted seq=235401
Nov 15 08:27:55 zesko kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process vivaldi-bin pid 1862 thread vivaldi-bi:cs0 pid 1907
Nov 15 08:27:59 zesko kernel: amdgpu 0000:09:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring kiq_2.1.0 test failed (-110)
Nov 15 08:27:59 zesko kernel: [drm:gfx_v10_0_hw_fini [amdgpu]] *ERROR* KGQ disable failed
Nov 15 08:27:59 zesko kernel: amdgpu 0000:09:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring kiq_2.1.0 test failed (-110)
Nov 15 08:27:59 zesko kernel: [drm:gfx_v10_0_hw_fini [amdgpu]] *ERROR* KCQ disable failed
Nov 15 08:28:00 zesko kernel: [drm:gfx_v10_0_hw_fini [amdgpu]] *ERROR* failed to halt cp gfx
Nov 15 08:28:03 zesko kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
Nov 15 08:28:03 zesko kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
Nov 15 08:28:03 zesko kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
Nov 15 08:28:03 zesko kernel: amdgpu 0000:09:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:221 vmid:5 pasid:32771, for process plasmashell pid 1236 thread plasmashel:cs0 pid 1301)
Nov 15 08:28:03 zesko kernel: amdgpu 0000:09:00.0: amdgpu:   in page starting at address 0x0000800000030000 from client 0x1b (UTCL2)

You can see that the two logs are different even though the issue is the same.
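
One way to compare logs like these two is to tally which amdgpu driver functions reported errors and which processes were named. The sketch below does only that; it makes no claim about the root cause, and the function name summarize and both regular expressions are my own, fitted to the log format shown above.

```python
import re
from collections import Counter
from typing import Iterable, Set, Tuple

# "[drm:<function> [amdgpu]]" as it appears in the kernel messages above.
DRM_FUNC = re.compile(r"\[drm:(\w+) \[amdgpu\]\]")
# "process <name> pid <n>" as it appears in timeout and page-fault messages.
PROCESS = re.compile(r"process (\S+) pid \d+")


def summarize(lines: Iterable[str]) -> Tuple[Counter, Set[str]]:
    """Count error lines per driver function and collect the process names mentioned."""
    functions: Counter = Counter()
    processes: Set[str] = set()
    for line in lines:
        func = DRM_FUNC.search(line)
        if func:
            functions[func.group(1)] += 1
        proc = PROCESS.search(line)
        if proc:
            processes.add(proc.group(1))
    return functions, processes
```

Running it on each log separately and comparing the two Counter objects shows which functions appear in both and which appear in only one.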


I am waiting for Linux kernel 5.16 for the fix for this bug.

Hello,

I lost my patience with this issue because update after update failed to fix it, so, as suggested by other members, I decided to downgrade linux-firmware to version 20210818.c46b8c3-1, and so far it is the most stable solution for this issue. I won’t say this version is error-proof, because I had one incident, which happened the day after the downgrade, but I have now gone 30 days without an incident. For people like me who are using older GPU hardware (I’m using an RX 570), using an older linux-firmware will probably not make any difference otherwise.

I can’t confirm it, but I have the impression that this issue might be related to overclocking and data corruption inside the GPU. I see very different GPU cooler behavior when comparing the downgraded firmware with the newest ones; the noise and fan speeds are very different. Do you know that feeling when you are pushing an overclock and your system works but is unstable? It looks the same to me. For some reason they might be pushing too hard for performance with linux-firmware.

I’d like to mention that I have two other random issues. One is the same as the one pointed out in the Linux challenge: for some reason, when my system has the screen locked, it doesn’t accept my password anymore, so I need to press the reset button to fix it. The second one is a random screen freeze: the display just freezes on whatever screen is showing, and nothing is logged.

I just tried the new linux-firmware package 20211027.1d00989-1, available today in the stable branch, and the issue remains.

Mesa was updated to 21.2.5.

Going back to version 20210818.c46b8c3-1.

19/11/2021 16:09	kernel	[drm:gfx_v8_0_priv_reg_irq [amdgpu]] *ERROR* Illegal register access in command stream
19/11/2021 16:09	kernel	[drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=468293, emitted seq=468295
19/11/2021 16:09	kernel	[drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process SC2_x64.exe pid 6806 thread SC2_x64.exe pid 6867
19/11/2021 16:09	kernel	amdgpu: cp is busy, skip halt cp
19/11/2021 16:09	kernel	amdgpu: rlc is busy, skip halt rlc
19/11/2021 16:09	kernel	[drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out!
19/11/2021 16:09	kernel	[drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!

System slowdown and freezing returned again for me as well.

Going back to version 20210818.c46b8c3-1.

I just downgraded to 20210818.c46b8c3-1. System is responsive again. Will have to see if freezing occurs. But the system seems to be running better again after downgrading.

I haven’t noticed the freeze in a while now. Is it the same for everybody?

You’re lucky, I guess. I usually have about one freeze per day :slightly_frowning_face: See “Random freezes, how to troubleshoot”.
Most of the time I don’t know what it is, because there are no logs, but when there are logs, it is the DRM thing.
This happens with and without the downgrade, with an RX 580 and now with a 6800 XT.

I haven’t noticed the freeze in a while, but I haven’t upgraded Mesa from 21.1.4-1, so every time I update I get the warning: mesa: ignoring package upgrade (21.1.4-1 => 21.2.5-1). But I don’t get freezes either.

Sadly only partly. My Raven Ridge APU at work hasn’t frozen since the August update. But my Polaris GPU (RX 480) still freezes when watching videos in Firefox. :neutral_face: