System frequently crashing after GPU drivers update

lordsansui · 13 September 2021 14:24

Since my last post, my system has been updated as indicated below, and so far I’ve had just one crash, when I left the game open and stayed away from the PC for a long time. It produced a different log, as you can see below, but the effect was the same. I’m very happy that it has been fixed or is very close to it.

Kernel 5.14.0-0
MESA 21.2.1
linux-firmware 20210818.c46b8c3-1

04/09/2021 18:26	kernel	[drm:drm_crtc_commit_wait [drm]] *ERROR* flip_done timed out
04/09/2021 18:26	kernel	[drm:drm_atomic_helper_wait_for_dependencies [drm_kms_helper]] *ERROR* [CRTC:53:crtc-0] commit wait timed out
04/09/2021 18:26	kernel	[drm:drm_crtc_commit_wait [drm]] *ERROR* flip_done timed out
04/09/2021 18:26	kernel	[drm:drm_atomic_helper_wait_for_dependencies [drm_kms_helper]] *ERROR* [PLANE:50:plane-5] commit wait timed out

Ernest1337 · 15 September 2021 14:32

I have the same error in my logs, which I included in my post 5 days ago.

Ernest1337 · 22 September 2021 18:43

I created a bug report in the drm/amd issue tracker. It relates to at least my issue and @lordsansui’s: [drm:drm_crtc_commit_wait [drm]] *ERROR* flip_done timed out (#1717) · Issues · drm / amd · GitLab

lordsansui · 27 September 2021 14:21

Yesterday I updated my kernel to 5.15-rc2; linux-firmware is still 20210818.c46b8c3-1. I got one crash, so the issue persists.

Mesa was updated to 21.2.2 by the last Manjaro Stable update, but that had no effect on this issue.

Also, before updating the kernel I noticed that the crash logs were varying more. Below are 4 different log signatures that all result in the same crash.

12/08/2021 19:28	kernel	[drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=3947807, emitted seq=3947809
12/08/2021 19:28	kernel	[drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process plasmashell pid 1169 thread plasmashel:cs0 pid 1270

14/09/2021 20:11	kernel	[drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out!
14/09/2021 20:11	kernel	[drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=6038969, emitted seq=6038971

25/09/2021 09:46	kernel	[drm:gfx_v8_0_priv_reg_irq [amdgpu]] *ERROR* Illegal register access in command stream
25/09/2021 09:46	kernel	[drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=60722, emitted seq=60724

26/09/2021 16:58	kernel	[drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process vivaldi-bin pid 3445 thread vivaldi-bi:cs0 pid 3510
26/09/2021 16:58	kernel	amdgpu 0000:01:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring kiq_2.1.0 test failed (-110)
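
For anyone comparing signatures like the ones above, a small helper can pull the ring name and the two sequence numbers out of each timeout line. This is only a sketch: the function name ring_timeouts is made up for illustration, and the pattern assumes the message format shown in the logs in this thread.

```shell
#!/usr/bin/env bash
# Sketch: extract "<ring> <signaled seq> <emitted seq>" from amdgpu
# ring-timeout lines read on stdin. ring_timeouts is a hypothetical helper
# name; the pattern assumes the message format quoted in this thread.
ring_timeouts() {
    sed -n -E 's/.*\*ERROR\* ring ([^ ]+) timeout, signaled seq=([0-9]+), emitted seq=([0-9]+).*/\1 \2 \3/p'
}

# Example (kernel messages from the previous boot):
#   journalctl -k -b -1 | ring_timeouts
```

In every timeout line quoted in this thread, the emitted number is exactly 2 ahead of the signaled one, whichever ring is involved.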

Ernest1337 · 29 September 2021 16:49

You could attach the info about your system and the logs from one of the crashes (preferably the “flip_done” one) to the issue I created; maybe it will help to better understand this particular problem.

lordsansui · 29 September 2021 17:04

I’ve also been posting about the issue at the link below and in other threads.

https://gitlab.freedesktop.org/drm/amd/-/issues/934#note_1047779

acarasimon96 · 5 October 2021 21:16

The crash-and-reset episodes have finally made a comeback on my laptop, with a vengeance, after applying today’s testing branch update. I can reproduce this about 75% of the time when I plug in my laptop, and sometimes the system just goes blank and reboots on its own! I even tried downgrading the kernels (5.10 and 5.14), linux-firmware from 20210919 to 20210818, and mesa from 21.2.3 to 21.2.2, and it still freaks out almost every time I plug in the charger! I am also getting logs in the system journal similar to the ones @lordsansui posted here.

I had not seen anything like this for a little over a month, between installing linux-firmware 20210818 and today, and I’m really furious that this has struck back at me a lot harder, to the point where Manjaro is unusable on my laptop while it’s plugged in.

Edit: I’ve narrowed down the cause to TLP, which was recently upgraded to version 1.4.0 with this new testing update. Downgrading that package to 1.3.1 stopped those crash-and-reset episodes from happening for now.

Edit 2: This turned out to be a false alarm. I upgraded TLP back to 1.4.0, and after copying /etc/tlp.conf.pacnew to /etc/tlp.conf and reloading tlp.service, the crashes no longer happen when I plug in my laptop. Upon further investigation, the line RADEON_DPM_PERF_LEVEL_ON_AC="high" from my old tlp.conf was the culprit.
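
If you want to check whether your own tlp.conf still carries that setting, something like this lists the active (uncommented) Radeon DPM lines. It is a sketch only: radeon_dpm_lines is a name made up for illustration, and the _ON_BAT variant is included on the assumption that a config may set it too.

```shell
#!/usr/bin/env bash
# Sketch: print active (not commented out) RADEON_DPM_PERF_LEVEL_ON_* lines
# from a TLP config file. radeon_dpm_lines is a hypothetical helper name.
radeon_dpm_lines() {
    grep -E '^[[:space:]]*RADEON_DPM_PERF_LEVEL_ON_(AC|BAT)=' "${1:-/etc/tlp.conf}"
}

# Example:
#   radeon_dpm_lines                        # checks /etc/tlp.conf
#   radeon_dpm_lines /etc/tlp.conf.pacnew   # compare with the packaged default
```

No output means the file does not set either value explicitly.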

lordsansui · 9 October 2021 14:40

From the last Manjaro Stable update:

Kernel 5.15-rc3
MESA 21.2.3
linux-firmware 20210919.d526e04-1

The issue still remains.

I’m not planning to go back to Windows, and my apologies to anyone who might be offended, but Linux is still some way from Windows stability. These never-ending crashes are very annoying, and those who don’t have this issue have others. I just hope the Steam Deck and Valve’s investments help increase Linux market share so we can get a better quality of life. Feel free to think differently.

09/10/2021 11:08	kernel	[drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out!
09/10/2021 11:08	kernel	[drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=509775, emitted seq=509777
09/10/2021 11:08	kernel	[drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process SC2_x64.exe pid 8847 thread SC2_x64.exe pid 8927

acarasimon96 · 9 October 2021 15:52

I don’t understand why you’re still getting those crashes after so long, given that you have an RX 570 according to your profile popout. I’m running a Gigabyte RX 570 on my desktop and I have never encountered that bug on that PC like I did on my laptop. Is yours a desktop or laptop GPU, and if it’s a desktop GPU, what brand?

lordsansui · 9 October 2021 16:18

It’s a desktop and the RX 570 was made by MSI. In general it crashes in StarCraft 2 while in the lobby; if I remember correctly, it has never crashed while actually “playing”, I mean, controlling units. It’s always in the lobby. But it crashes in the Vivaldi browser too, just less often.

Another hope is that AMD is hiring more engineers to work on the GPU driver, so let’s see if they can fix the issue.

acarasimon96 · 9 October 2021 16:23

It’s interesting that you use a desktop; I was expecting your RX 570 to be a laptop GPU instead. I don’t play any games, especially GPU-intensive ones, and I don’t use Chromium-based browsers (even though I use some Electron-based apps like VSCode and Discord every day), or I might just be plain lucky with my specific setup.

lordsansui · 9 October 2021 16:46

Yeah, although the GPU is new (2 years old), my PC is very old, so it should be even more stable, as it was when running Windows 7.

My GPU model comes with an OEM overclock, and I frequently boot into Windows to check for firmware upgrades, but it already has the most recent one.

GPU Radeon RX-570 ARMOR 4G OC MSI
CPU AMD Phenom II X4 965 BE
Motherboard Gigabyte GA-870A-UD3
8GB RAM DDR3

I was planning to buy a new PC last year, but due to COVID’s effect on the economy this plan was postponed indefinitely.

Zesko · 17 October 2021 18:57

I have the same issue now.

My AMD Radeon RX 5700 crashed and froze the whole desktop after I watched YouTube for more than 50 minutes in the Vivaldi browser with hardware acceleration enabled.

Branch: Manjaro KDE latest version (stable branch)
GPU: Radeon RX 5700
Kernel version: 5.14.10-1-MANJARO
Mesa version: 4.6 (Compatibility Profile) Mesa 21.2.3
Xserver version: X.Org X Server 1.20.13, X Protocol Version 11, Revision 0
Desktop manager and compositor: KDE

Output of journalctl -b -1 -p 3:

Okt 17 20:22:31 zesko kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring vcn_dec timeout, signaled seq=2315893, emitted seq=2315895
Okt 17 20:22:31 zesko kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process vivaldi-bin pid 45233 thread vivaldi-bi:cs0 pid 45253
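
To see which processes show up most often in these timeout reports, the “Process information” lines can be tallied. This is a rough sketch that assumes the line format shown above; hung_processes is a hypothetical name, not an existing tool.

```shell
#!/usr/bin/env bash
# Sketch: count how often each process name appears in amdgpu
# "Process information" lines read on stdin, most frequent first.
# hung_processes is a hypothetical helper name.
hung_processes() {
    awk 'match($0, /Process information: process [^ ]+ pid/) {
             name = substr($0, RSTART, RLENGTH)
             sub(/^Process information: process /, "", name)
             sub(/ pid$/, "", name)
             count[name]++
         }
         END { for (n in count) printf "%d %s\n", count[n], n }' | sort -rn
}

# Example (error-level messages from the previous boot):
#   journalctl -b -1 -p 3 | hung_processes
```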

A similar bug report:

veltliner · 28 October 2021 22:02

Solved for me by setting power_dpm_force_performance_level to high:

Create the file /etc/udev/rules.d/30-amdgpu-pm.rules with the following content:

KERNEL=="card0", SUBSYSTEM=="drm", DRIVERS=="amdgpu", ATTR{device/power_dpm_state}="performance"
KERNEL=="card0", SUBSYSTEM=="drm", DRIVERS=="amdgpu", ATTR{device/power_dpm_force_performance_level}="high"

Reboot.

Enjoy with no freezing.
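
To try the performance level before committing to a udev rule, the same sysfs attribute can be written at runtime; a value set this way is lost on reboot. This is a sketch, assuming the card is card0 as in the rule above. set_perf_level is a hypothetical helper and needs a root shell on a real system.

```shell
#!/usr/bin/env bash
# Sketch: write a value to power_dpm_force_performance_level and read it back.
# set_perf_level is a hypothetical helper; the default path assumes card0.
set_perf_level() {
    local level=$1
    local dev=${2:-/sys/class/drm/card0/device}
    local attr="$dev/power_dpm_force_performance_level"
    printf '%s\n' "$level" > "$attr" || return 1
    cat "$attr"
}

# Example (from a root shell):
#   set_perf_level high   # same value the udev rule sets
#   set_perf_level auto   # hand control back to the driver
```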

bowmandm21 · 30 October 2021 12:53

I input these settings.

No slowdowns or freezes so far.

lordsansui · 30 October 2021 18:52

Hi veltliner and bowmandm21,

Do these settings have any drawbacks, like increased power consumption even when idle?

networkException · 30 October 2021 21:09

Sadly, the udev rules didn’t prevent the GPU from resetting for me just now:

[17267.848571] [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out!
[17267.848606] [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out!
[17272.978620] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=2557416, emitted seq=2557418
[17272.978788] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process plasmashell pid 1694 thread plasmashel:cs0 pid 1754
[17272.978928] amdgpu 0000:0c:00.0: amdgpu: GPU reset begin!

mha-k · 8 November 2021 18:25

For me the problem also still persists with an MSI RX 480. The mobile Vegas in my and my colleagues’ work notebooks are running fine so far.

@Zesko I’ve looked into your posted issue, as my freezes only occur when I play fullscreen videos in Firefox. (I don’t know about games, as I don’t have time to play them anymore.) But the error log looks quite different from mine. I think it’s plasmashell that’s causing it.

It’s really frustrating that the problem has persisted for so long.

noabody · 10 November 2021 04:12

Is your udev rule actually sticking? I created a help topic:
amdgpu-udev-rule-ignored/89725

Either of these commands will show whether the rule was applied successfully:

udevadm info --attribute-walk /sys/class/drm/card0 | grep -Pi 'power_dpm'
udevadm info -a -n '/dev/dri/card0' | grep -Pi 'power_dpm'
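
Another option is to read the attributes straight from sysfs, which shows what the driver is using right now regardless of how it was set. This is a sketch only, assuming the card is card0; show_dpm is a hypothetical name.

```shell
#!/usr/bin/env bash
# Sketch: print the current amdgpu DPM attributes from sysfs.
# show_dpm is a hypothetical helper; the default path assumes card0.
show_dpm() {
    local dev=${1:-/sys/class/drm/card0/device}
    local f
    for f in power_dpm_state power_dpm_force_performance_level; do
        if [ -r "$dev/$f" ]; then
            printf '%s: %s\n' "$f" "$(cat "$dev/$f")"
        else
            printf '%s: not available\n' "$f"
        fi
    done
}

# Example:
#   show_dpm
```

If the rule stuck, the second line should read "high" right after boot.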

Curiously, my problems began in May. I have an R9 380X which suffers from a voltage draw issue. If I don’t force the performance level to high I get graphical corruption, crackling audio, and an eventual green screen of death (the system appears to be working, audio is playing, but no input is recognized).

The actual failure relates to a sudden surge of power when the fans go directly from low to high speed without transitioning through a middle range (usually during gaming).

I went so far as to build a custom BIOS with DPM state 0 set to 1000 mV and flash it to the card. As many of these posts point out, a firmware blob is applied during startup which invalidates this. As such, in my case, every iteration of linux-firmware causes me issues.

The same issue affects Windows 10 and I have to use a custom .xml and load it as a performance profile in the Radeon Adrenalin drivers.

I always assumed my card was an anomaly related to power supply and motherboard. There was no reason to change the hardware once I’d figured out the solution.

I just can’t get a udev rule to apply automatically.

bruce_banner · 10 November 2021 10:31

Here are some reasons that can cause this issue:

Your computer’s spec is too low
You overclocked too high
The game’s settings are wrong
Your graphic card requires too much power
You need to upgrade your operating system
You need to upgrade device drivers
Your network isn’t fast enough
Digital Rights Management is causing problems
Games are running in the wrong mode
Your antivirus is crashing games
Using a VPN is slowing online games to the point they crash
You have too many browser tabs open