Graphics glitches / freeze-up with new comp, ERROR: ring gfx timeout

Greetings. I wasn’t very active on the old forum, so I’m not sure anyone remembers me. About a year ago, I made a topic about graphics problems with my old computer. The problems on the new one are not identical, but they are similar: the screen glitches out at random, the affected application becomes laggy, and as often as not it’s impossible to save my data before closing it. A friend of mine suggested downgrading from the 5.10 kernel to 5.9.16-1, and the results weren’t pretty:

That was what it looked like this morning. It was frozen, so I couldn’t take a screencap and had to grab it with my phone camera. I bought this computer in January and set it up near the end of that month. Current configuration:

System:    Kernel: 5.10.15-1-MANJARO x86_64 bits: 64 compiler: gcc v: 10.2.1 Desktop: KDE Plasma 5.20.5 Distro: Manjaro Linux 
Machine:   Type: Desktop Mobo: ASRock model: A520M-HDV serial: <filter> UEFI: American Megatrends v: P1.00 date: 07/21/2020 
CPU:       Info: 8-Core model: AMD Ryzen 7 3700X bits: 64 type: MT MCP arch: Zen 2 rev: 0 L2 cache: 4 MiB 
           flags: avx avx2 lm nx pae sse sse2 sse3 sse4_1 sse4_2 sse4a ssse3 svm bogomips: 115028 
           Speed: 2191 MHz min/max: 2200/3600 MHz boost: enabled Core speeds (MHz): 1: 2191 2: 2194 3: 2194 4: 2187 5: 2196 
           6: 2196 7: 2194 8: 2196 9: 2192 10: 2195 11: 2202 12: 2196 13: 2194 14: 2197 15: 2195 16: 2195 
Graphics:  Device-1: Advanced Micro Devices [AMD/ATI] Ellesmere [Radeon RX 470/480/570/570X/580/580X/590] vendor: Tul 
           driver: amdgpu v: kernel bus ID: 07:00.0 
           Display: x11 server: X.Org 1.20.10 driver: loaded: amdgpu,ati unloaded: modesetting resolution: 1920x1080~60Hz 
           OpenGL: renderer: Radeon RX 570 Series (POLARIS10 DRM 3.40.0 5.10.15-1-MANJARO LLVM 11.0.1) v: 4.6 Mesa 20.3.4 
           direct render: Yes 
Audio:     Device-1: AMD Ellesmere HDMI Audio [Radeon RX 470/480 / 570/580/590] vendor: Tul driver: snd_hda_intel v: kernel 
           bus ID: 07:00.1 
           Device-2: Advanced Micro Devices [AMD] Starship/Matisse HD Audio vendor: ASRock driver: snd_hda_intel v: kernel 
           bus ID: 09:00.4 
           Sound Server: ALSA v: k5.10.15-1-MANJARO 
Network:   Device-1: Realtek RTL8111/8168/8411 PCI Express Gigabit Ethernet vendor: ASRock driver: r8169 v: kernel port: f000 
           bus ID: 06:00.0 
           IF: enp6s0 state: up speed: 1000 Mbps duplex: full mac: <filter> 
Drives:    Local Storage: total: 1.37 TiB used: 960.16 GiB (68.2%) 
           ID-1: /dev/sda vendor: TeamGroup model: T253X2512G size: 476.94 GiB 
           ID-2: /dev/sdb type: USB vendor: Western Digital model: WD10EADS-11M2B3 size: 930.86 GiB 
Partition: ID-1: / size: 468.16 GiB used: 43.91 GiB (9.4%) fs: ext4 dev: /dev/sda2 
           ID-2: /boot/efi size: 299.4 MiB used: 312 KiB (0.1%) fs: vfat dev: /dev/sda1 
Swap:      ID-1: swap-1 type: file size: 512 MiB used: 0 KiB (0.0%) file: /swapfile 
Monitor: Samsung SF350

Much like the screen-blanking on my old computer, the time to failure can be anywhere from a day or so of uptime to several days, totally random. The only thing I’ve found resembling a trigger is playing music: Audacious breaks the interface within half an hour, VLC within an hour or so, and sometimes you can get 2 or 3 hours out of smplayer. I haven’t risked trying any other players, so I’m back to silence, like the past year on my old computer. Any insight would be helpful, as I thought I had left this mess behind with my old computer.

So, even without listening to music, the screen still blows up in 5.10 the same way it does in 5.9.16-1. :worried:

Just an update, a friend of mine showed me how to access logs, and I came across this error string not long before my computer froze up:

Mar 07 07:02:24 spooky kernel: amdgpu 0000:07:00.0: amdgpu: GPU fault detected: 147 0x05788801 for process vivaldi-bin pid 3086 thread vivaldi-bi:>
Mar 07 07:02:24 spooky kernel: amdgpu 0000:07:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x08F006AF
Mar 07 07:02:24 spooky kernel: amdgpu 0000:07:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x08088001
Mar 07 07:02:24 spooky kernel: amdgpu 0000:07:00.0: amdgpu: VM fault (0x01, vmid 4, pasid 32774) at page 149948079, read from 'TC6' (0x54433600) (>
Mar 07 07:02:24 spooky kernel: amdgpu 0000:07:00.0: amdgpu: GPU fault detected: 147 0x05788401 for process vivaldi-bin pid 3086 thread vivaldi-bi:>
Mar 07 07:02:24 spooky kernel: amdgpu 0000:07:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x08F006BF
Mar 07 07:02:24 spooky kernel: amdgpu 0000:07:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x08088001
Mar 07 07:02:24 spooky kernel: amdgpu 0000:07:00.0: amdgpu: VM fault (0x01, vmid 4, pasid 32774) at page 149948095, read from 'TC6' (0x54433600) (>
Mar 07 07:02:34 spooky kernel: [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out!
Mar 07 07:02:34 spooky kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=18086949, emitted seq=18086951
Mar 07 07:02:34 spooky kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process vivaldi-bin pid 3086 thread vivaldi-bi:cs0 >
Mar 07 07:02:34 spooky kernel: amdgpu 0000:07:00.0: amdgpu: GPU reset begin!
Mar 07 07:02:38 spooky kernel: [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out!
Mar 07 07:02:38 spooky kernel: amdgpu 0000:07:00.0: amdgpu: failed to suspend display audio
Mar 07 07:02:39 spooky kernel: amdgpu 0000:07:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring kiq_2.1.0 test failed (-110)
Mar 07 07:02:39 spooky kernel: [drm:gfx_v8_0_hw_fini [amdgpu]] *ERROR* KCQ disable failed
Mar 07 07:02:39 spooky kernel: amdgpu: cp is busy, skip halt cp
Mar 07 07:02:39 spooky kernel: amdgpu: rlc is busy, skip halt rlc
Mar 07 07:02:39 spooky kernel: amdgpu 0000:07:00.0: amdgpu: BACO reset
Mar 07 07:02:40 spooky kernel: amdgpu 0000:07:00.0: amdgpu: GPU reset succeeded, trying to resume
Mar 07 07:02:40 spooky kernel: [drm] PCIE GART of 256M enabled (table at 0x000000F4007E9000).
Mar 07 07:02:40 spooky kernel: [drm] VRAM is lost due to GPU reset!
Mar 07 07:02:41 spooky kernel: [drm:uvd_v6_0_start [amdgpu]] *ERROR* UVD not responding, trying to reset the VCPU!!!
Mar 07 07:02:42 spooky kernel: [drm:uvd_v6_0_start [amdgpu]] *ERROR* UVD not responding, trying to reset the VCPU!!!
Mar 07 07:02:43 spooky kernel: [drm:uvd_v6_0_start [amdgpu]] *ERROR* UVD not responding, trying to reset the VCPU!!!
Mar 07 07:02:44 spooky kernel: [drm:uvd_v6_0_start [amdgpu]] *ERROR* UVD not responding, trying to reset the VCPU!!!
Mar 07 07:02:45 spooky xembedsniproxy[1476]: Container window visible, stack below
Mar 07 07:02:45 spooky xembedsniproxy[1476]: Container window visible, stack below
Mar 07 07:02:45 spooky kernel: [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out!
Mar 07 07:02:45 spooky kernel: [drm:uvd_v6_0_start [amdgpu]] *ERROR* UVD not responding, trying to reset the VCPU!!!
Mar 07 07:02:46 spooky kernel: [drm:uvd_v6_0_start [amdgpu]] *ERROR* UVD not responding, trying to reset the VCPU!!!
Mar 07 07:02:47 spooky kernel: [drm:uvd_v6_0_start [amdgpu]] *ERROR* UVD not responding, trying to reset the VCPU!!!
Mar 07 07:02:48 spooky kernel: [drm:uvd_v6_0_start [amdgpu]] *ERROR* UVD not responding, trying to reset the VCPU!!!
Mar 07 07:02:49 spooky kernel: [drm:uvd_v6_0_start [amdgpu]] *ERROR* UVD not responding, trying to reset the VCPU!!!
Mar 07 07:02:50 spooky kernel: [drm:uvd_v6_0_start [amdgpu]] *ERROR* UVD not responding, trying to reset the VCPU!!!
Mar 07 07:02:50 spooky kernel: [drm:uvd_v6_0_start [amdgpu]] *ERROR* UVD not responding, giving up!!!
Mar 07 07:02:50 spooky kernel: [drm:amdgpu_device_ip_set_powergating_state [amdgpu]] *ERROR* set_powergating_state of IP block <uvd_v6_0> failed -1
Mar 07 07:02:51 spooky kernel: amdgpu 0000:07:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring uvd test failed (-110)
Mar 07 07:02:51 spooky kernel: [drm:amdgpu_device_ip_resume_phase2 [amdgpu]] *ERROR* resume of IP block <uvd_v6_0> failed -110
Mar 07 07:02:51 spooky kernel: amdgpu 0000:07:00.0: amdgpu: GPU reset(2) failed
Mar 07 07:02:51 spooky kernel: amdgpu 0000:07:00.0: amdgpu: GPU reset end with ret = -110
Mar 07 07:02:51 spooky kernel: [drm] Fence fallback timer expired on ring sdma0
Mar 07 07:02:52 spooky kernel: [drm] Fence fallback timer expired on ring sdma0
Mar 07 07:02:52 spooky kernel: [drm] Fence fallback timer expired on ring sdma0
Mar 07 07:02:52 spooky kwin_x11[1383]: kwin_core: XCB error: 10 (BadAccess), sequence: 27864, resource id: 1729, major code: 142 (Composite), mino>
Mar 07 07:02:52 spooky kwin_x11[1383]: BlurConfig::instance called after the first use - ignoring
Mar 07 07:02:53 spooky kernel: [drm] Fence fallback timer expired on ring sdma0
Mar 07 07:02:53 spooky kernel: [drm] Fence fallback timer expired on ring sdma0
Mar 07 07:02:54 spooky kernel: [drm] Fence fallback timer expired on ring sdma0
Mar 07 07:02:54 spooky kernel: [drm] Fence fallback timer expired on ring sdma1
Mar 07 07:02:54 spooky kernel: [drm] Fence fallback timer expired on ring sdma0
Mar 07 07:02:55 spooky kernel: [drm] Fence fallback timer expired on ring sdma1
Mar 07 07:02:55 spooky kernel: [drm] Fence fallback timer expired on ring sdma0
Mar 07 07:02:56 spooky kernel: [drm] Fence fallback timer expired on ring sdma1
Mar 07 07:02:56 spooky kernel: [drm] Fence fallback timer expired on ring sdma0
Mar 07 07:02:56 spooky kernel: [drm] Fence fallback timer expired on ring sdma1
Mar 07 07:02:56 spooky kernel: [drm] Fence fallback timer expired on ring sdma0
Mar 07 07:02:57 spooky kernel: [drm] Fence fallback timer expired on ring sdma1
Mar 07 07:02:57 spooky kernel: [drm] Fence fallback timer expired on ring sdma1
Mar 07 07:02:58 spooky kernel: [drm] Fence fallback timer expired on ring sdma1
Mar 07 07:02:58 spooky kernel: [drm] Fence fallback timer expired on ring sdma0
Mar 07 07:02:58 spooky kernel: [drm] Fence fallback timer expired on ring sdma1
Mar 07 07:02:58 spooky kernel: [drm] Fence fallback timer expired on ring sdma0
Mar 07 07:02:59 spooky kernel: [drm] Fence fallback timer expired on ring sdma1
Mar 07 07:02:59 spooky kernel: [drm] Fence fallback timer expired on ring sdma1
Mar 07 07:03:00 spooky kernel: [drm] Fence fallback timer expired on ring sdma1
Mar 07 07:03:00 spooky kernel: [drm] Fence fallback timer expired on ring sdma0
Mar 07 07:03:00 spooky kernel: [drm] Fence fallback timer expired on ring sdma1
Mar 07 07:03:00 spooky kernel: [drm] Fence fallback timer expired on ring sdma0
Mar 07 07:03:01 spooky kernel: [drm] Fence fallback timer expired on ring sdma1
Mar 07 07:03:01 spooky kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered
Mar 07 07:03:01 spooky kernel: [drm] Fence fallback timer expired on ring sdma1
Mar 07 07:03:02 spooky kernel: [drm] Fence fallback timer expired on ring sdma1
Mar 07 07:03:02 spooky kernel: [drm] Fence fallback timer expired on ring sdma1
Mar 07 07:03:03 spooky kernel: [drm] Fence fallback timer expired on ring sdma1
Mar 07 07:03:03 spooky kernel: [drm] Fence fallback timer expired on ring sdma1
Mar 07 07:03:04 spooky kernel: [drm] Fence fallback timer expired on ring sdma1
Mar 07 07:03:04 spooky kernel: [drm] Fence fallback timer expired on ring sdma0
Mar 07 07:03:04 spooky kernel: [drm] Fence fallback timer expired on ring sdma1
Mar 07 07:03:05 spooky kernel: [drm] Fence fallback timer expired on ring sdma1
Mar 07 07:03:05 spooky kernel: [drm] Fence fallback timer expired on ring sdma1
Mar 07 07:03:06 spooky kernel: [drm] Fence fallback timer expired on ring sdma1
Mar 07 07:03:06 spooky kernel: [drm] Fence fallback timer expired on ring sdma1
Mar 07 07:03:07 spooky kernel: [drm] Fence fallback timer expired on ring sdma1
Mar 07 07:03:07 spooky kernel: [drm] Fence fallback timer expired on ring sdma1
Mar 07 07:03:08 spooky kernel: [drm] Fence fallback timer expired on ring sdma1
Mar 07 07:03:08 spooky kernel: [drm] Fence fallback timer expired on ring sdma1
Mar 07 07:03:09 spooky kernel: [drm] Fence fallback timer expired on ring sdma1
Mar 07 07:03:09 spooky kernel: [drm] Fence fallback timer expired on ring sdma1
Mar 07 07:03:10 spooky kernel: [drm] Fence fallback timer expired on ring sdma1
Mar 07 07:03:10 spooky kernel: [drm] Fence fallback timer expired on ring sdma0
Mar 07 07:03:10 spooky kernel: [drm] Fence fallback timer expired on ring sdma1
Mar 07 07:03:11 spooky kernel: [drm] Fence fallback timer expired on ring sdma1
Mar 07 07:03:11 spooky kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered
Mar 07 07:03:11 spooky kernel: [drm] Fence fallback timer expired on ring sdma1
Mar 07 07:03:12 spooky kernel: [drm] Fence fallback timer expired on ring sdma1
Mar 07 07:03:12 spooky kernel: [drm] Fence fallback timer expired on ring sdma1
Mar 07 07:03:13 spooky kernel: [drm] Fence fallback timer expired on ring sdma1
Mar 07 07:03:13 spooky kernel: [drm] Fence fallback timer expired on ring sdma1
Mar 07 07:03:14 spooky kernel: [drm] Fence fallback timer expired on ring sdma1
Mar 07 07:03:14 spooky kernel: [drm] Fence fallback timer expired on ring sdma1
Mar 07 07:03:15 spooky kernel: [drm] Fence fallback timer expired on ring sdma1
Mar 07 07:03:16 spooky kernel: [drm] Fence fallback timer expired on ring sdma0
Mar 07 07:03:22 spooky kernel: [drm] Fence fallback timer expired on ring sdma0
Mar 07 07:03:28 spooky kernel: [drm] Fence fallback timer expired on ring sdma0
Mar 07 07:03:35 spooky kernel: [drm] Fence fallback timer expired on ring sdma0
Mar 07 07:03:56 spooky kernel: [drm] Fence fallback timer expired on ring sdma0
Mar 07 07:04:14 spooky kernel: [drm] Fence fallback timer expired on ring sdma0
Mar 07 07:04:34 spooky kernel: [drm] Fence fallback timer expired on ring sdma0
Mar 07 07:04:53 spooky kernel: [drm] Fence fallback timer expired on ring sdma0
Mar 07 07:05:13 spooky kernel: [drm] Fence fallback timer expired on ring sdma0
Mar 07 07:05:32 spooky kernel: [drm] Fence fallback timer expired on ring sdma0
Mar 07 07:05:59 spooky kernel: INFO: task vivaldi-:gdrv0:218655 blocked for more than 122 seconds.
Mar 07 07:05:59 spooky kernel:       Not tainted 5.10.18-1-MANJARO #1
Mar 07 07:05:59 spooky kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 07 07:05:59 spooky kernel: task:vivaldi-:gdrv0  state:D stack:    0 pid:218655 ppid:  3062 flags:0x00004082

I don’t know if this helps, but I hope it sheds some light on my situation.
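
For anyone who wants to pull the same kind of lines out of their own journal, here is a minimal sketch. It assumes a systemd journal that is kept across reboots; `gpu_errors` is a made-up helper name, and the patterns are just the message fragments that show up in the log above, so they may well miss other amdgpu failures.

```bash
#!/bin/bash
# gpu_errors: hypothetical helper (not an existing tool). Reads log text on
# stdin and keeps only amdgpu/drm lines that resemble the failures in this
# thread.
gpu_errors() {
    grep -E 'amdgpu|\[drm' \
        | grep -E 'ERROR|GPU fault|VM fault|GPU reset|Fence fallback'
}

# Typical use: kernel messages from the previous boot (the one that froze).
#   journalctl -k -b -1 --no-pager | gpu_errors
```

Here `-k` limits the output to kernel messages and `-b -1` selects the previous boot, which is usually the one you want after a forced restart.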

Hello,

I get the same issue! I’m on a ThinkPad with a Ryzen 5 and Radeon Vega graphics.

This totally blocks my session, and I have to force a logout to regain visual control.
It started happening this week. Before that I didn’t notice anything, and I haven’t really changed how I use the machine.

Error Log
Apr 16 10:33:08 thine495 kernel: [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out!
Apr 16 10:33:08 thine495 kernel: amdgpu 0000:05:00.0: amdgpu: [gfxhub0] retry page fault (src_id:0 ring:0 vmid:3 pasid:32773, for process gnome-shell pid 3661 thread gnome-shel:cs0 pid 3686)
Apr 16 10:33:08 thine495 kernel: amdgpu 0000:05:00.0: amdgpu:   in page starting at address 0x0000800117207000 from client 27
Apr 16 10:33:08 thine495 kernel: amdgpu 0000:05:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00301031
Apr 16 10:33:08 thine495 kernel: amdgpu 0000:05:00.0: amdgpu:          Faulty UTCL2 client ID: 0x8
Apr 16 10:33:08 thine495 kernel: amdgpu 0000:05:00.0: amdgpu:          MORE_FAULTS: 0x1
Apr 16 10:33:08 thine495 kernel: amdgpu 0000:05:00.0: amdgpu:          WALKER_ERROR: 0x0
Apr 16 10:33:08 thine495 kernel: amdgpu 0000:05:00.0: amdgpu:          PERMISSION_FAULTS: 0x3
Apr 16 10:33:08 thine495 kernel: amdgpu 0000:05:00.0: amdgpu:          MAPPING_ERROR: 0x0
Apr 16 10:33:08 thine495 kernel: amdgpu 0000:05:00.0: amdgpu:          RW: 0x0
Apr 16 10:33:08 thine495 kernel: amdgpu 0000:05:00.0: amdgpu: [gfxhub0] retry page fault (src_id:0 ring:0 vmid:3 pasid:32773, for process gnome-shell pid 3661 thread gnome-shel:cs0 pid 3686)
Apr 16 10:33:08 thine495 kernel: amdgpu 0000:05:00.0: amdgpu:   in page starting at address 0x0000800117201000 from client 27
Apr 16 10:33:08 thine495 kernel: amdgpu 0000:05:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00301031
Apr 16 10:33:08 thine495 kernel: amdgpu 0000:05:00.0: amdgpu:          Faulty UTCL2 client ID: 0x8
Apr 16 10:33:08 thine495 kernel: amdgpu 0000:05:00.0: amdgpu:          MORE_FAULTS: 0x1
Apr 16 10:33:08 thine495 kernel: amdgpu 0000:05:00.0: amdgpu:          WALKER_ERROR: 0x0
Apr 16 10:33:08 thine495 kernel: amdgpu 0000:05:00.0: amdgpu:          PERMISSION_FAULTS: 0x3
Apr 16 10:33:08 thine495 kernel: amdgpu 0000:05:00.0: amdgpu:          MAPPING_ERROR: 0x0
Apr 16 10:33:08 thine495 kernel: amdgpu 0000:05:00.0: amdgpu:          RW: 0x0
Apr 16 10:33:08 thine495 kernel: amdgpu 0000:05:00.0: amdgpu: [gfxhub0] retry page fault (src_id:0 ring:0 vmid:3 pasid:32773, for process gnome-shell pid 3661 thread gnome-shel:cs0 pid 3686)
Apr 16 10:33:08 thine495 kernel: amdgpu 0000:05:00.0: amdgpu:   in page starting at address 0x0000800117208000 from client 27
Apr 16 10:33:08 thine495 kernel: amdgpu 0000:05:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00301031
Apr 16 10:33:08 thine495 kernel: amdgpu 0000:05:00.0: amdgpu:          Faulty UTCL2 client ID: 0x8
Apr 16 10:33:08 thine495 kernel: amdgpu 0000:05:00.0: amdgpu:          MORE_FAULTS: 0x1
Apr 16 10:33:08 thine495 kernel: amdgpu 0000:05:00.0: amdgpu:          WALKER_ERROR: 0x0
Apr 16 10:33:08 thine495 kernel: amdgpu 0000:05:00.0: amdgpu:          PERMISSION_FAULTS: 0x3
Apr 16 10:33:08 thine495 kernel: amdgpu 0000:05:00.0: amdgpu:          MAPPING_ERROR: 0x0
Apr 16 10:33:08 thine495 kernel: amdgpu 0000:05:00.0: amdgpu:          RW: 0x0
Apr 16 10:33:08 thine495 kernel: amdgpu 0000:05:00.0: amdgpu: [gfxhub0] retry page fault (src_id:0 ring:0 vmid:3 pasid:32773, for process gnome-shell pid 3661 thread gnome-shel:cs0 pid 3686)
Apr 16 10:33:08 thine495 kernel: amdgpu 0000:05:00.0: amdgpu:   in page starting at address 0x0000800117200000 from client 27
Apr 16 10:33:08 thine495 kernel: amdgpu 0000:05:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00301031
Apr 16 10:33:08 thine495 kernel: amdgpu 0000:05:00.0: amdgpu:          Faulty UTCL2 client ID: 0x8
Apr 16 10:33:08 thine495 kernel: amdgpu 0000:05:00.0: amdgpu:          MORE_FAULTS: 0x1
Apr 16 10:33:08 thine495 kernel: amdgpu 0000:05:00.0: amdgpu:          WALKER_ERROR: 0x0
Apr 16 10:33:08 thine495 kernel: amdgpu 0000:05:00.0: amdgpu:          PERMISSION_FAULTS: 0x3
Apr 16 10:33:08 thine495 kernel: amdgpu 0000:05:00.0: amdgpu:          MAPPING_ERROR: 0x0
Apr 16 10:33:08 thine495 kernel: amdgpu 0000:05:00.0: amdgpu:          RW: 0x0
Apr 16 10:33:08 thine495 kernel: amdgpu 0000:05:00.0: amdgpu: [gfxhub0] retry page fault (src_id:0 ring:0 vmid:3 pasid:32773, for process gnome-shell pid 3661 thread gnome-shel:cs0 pid 3686)
Apr 16 10:33:08 thine495 kernel: amdgpu 0000:05:00.0: amdgpu:   in page starting at address 0x0000800117204000 from client 27
Apr 16 10:33:08 thine495 kernel: amdgpu 0000:05:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00301031
Apr 16 10:33:08 thine495 kernel: amdgpu 0000:05:00.0: amdgpu:          Faulty UTCL2 client ID: 0x8
Apr 16 10:33:08 thine495 kernel: amdgpu 0000:05:00.0: amdgpu:          MORE_FAULTS: 0x1
Apr 16 10:33:08 thine495 kernel: amdgpu 0000:05:00.0: amdgpu:          WALKER_ERROR: 0x0
Apr 16 10:33:08 thine495 kernel: amdgpu 0000:05:00.0: amdgpu:          PERMISSION_FAULTS: 0x3
Apr 16 10:33:08 thine495 kernel: amdgpu 0000:05:00.0: amdgpu:          MAPPING_ERROR: 0x0
Apr 16 10:33:08 thine495 kernel: amdgpu 0000:05:00.0: amdgpu:          RW: 0x0
Apr 16 10:33:08 thine495 kernel: amdgpu 0000:05:00.0: amdgpu: [gfxhub0] retry page fault (src_id:0 ring:0 vmid:3 pasid:32773, for process gnome-shell pid 3661 thread gnome-shel:cs0 pid 3686)
Apr 16 10:33:08 thine495 kernel: amdgpu 0000:05:00.0: amdgpu:   in page starting at address 0x0000800117202000 from client 27
Apr 16 10:33:08 thine495 kernel: amdgpu 0000:05:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00301031
Apr 16 10:33:08 thine495 kernel: amdgpu 0000:05:00.0: amdgpu:          Faulty UTCL2 client ID: 0x8
Apr 16 10:33:08 thine495 kernel: amdgpu 0000:05:00.0: amdgpu:          MORE_FAULTS: 0x1
Apr 16 10:33:08 thine495 kernel: amdgpu 0000:05:00.0: amdgpu:          WALKER_ERROR: 0x0
Apr 16 10:33:08 thine495 kernel: amdgpu 0000:05:00.0: amdgpu:          PERMISSION_FAULTS: 0x3
Apr 16 10:33:08 thine495 kernel: amdgpu 0000:05:00.0: amdgpu:          MAPPING_ERROR: 0x0
Apr 16 10:33:08 thine495 kernel: amdgpu 0000:05:00.0: amdgpu:          RW: 0x0
Apr 16 10:33:08 thine495 kernel: amdgpu 0000:05:00.0: amdgpu: [gfxhub0] retry page fault (src_id:0 ring:0 vmid:3 pasid:32773, for process gnome-shell pid 3661 thread gnome-shel:cs0 pid 3686)
Apr 16 10:33:08 thine495 kernel: amdgpu 0000:05:00.0: amdgpu:   in page starting at address 0x0000800117205000 from client 27
Apr 16 10:33:08 thine495 kernel: amdgpu 0000:05:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00301031
Apr 16 10:33:08 thine495 kernel: amdgpu 0000:05:00.0: amdgpu:          Faulty UTCL2 client ID: 0x8
Apr 16 10:33:08 thine495 kernel: amdgpu 0000:05:00.0: amdgpu:          MORE_FAULTS: 0x1
Apr 16 10:33:08 thine495 kernel: amdgpu 0000:05:00.0: amdgpu:          WALKER_ERROR: 0x0
Apr 16 10:33:08 thine495 kernel: amdgpu 0000:05:00.0: amdgpu:          PERMISSION_FAULTS: 0x3
Apr 16 10:33:08 thine495 kernel: amdgpu 0000:05:00.0: amdgpu:          MAPPING_ERROR: 0x0
Apr 16 10:33:08 thine495 kernel: amdgpu 0000:05:00.0: amdgpu:          RW: 0x0
Apr 16 10:33:08 thine495 kernel: amdgpu 0000:05:00.0: amdgpu: [gfxhub0] retry page fault (src_id:0 ring:0 vmid:3 pasid:32773, for process gnome-shell pid 3661 thread gnome-shel:cs0 pid 3686)
Apr 16 10:33:08 thine495 kernel: amdgpu 0000:05:00.0: amdgpu:   in page starting at address 0x0000800117206000 from client 27
Apr 16 10:33:08 thine495 kernel: amdgpu 0000:05:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00301031
Apr 16 10:33:08 thine495 kernel: amdgpu 0000:05:00.0: amdgpu:          Faulty UTCL2 client ID: 0x8
Apr 16 10:33:08 thine495 kernel: amdgpu 0000:05:00.0: amdgpu:          MORE_FAULTS: 0x1
Apr 16 10:33:08 thine495 kernel: amdgpu 0000:05:00.0: amdgpu:          WALKER_ERROR: 0x0
Apr 16 10:33:08 thine495 kernel: amdgpu 0000:05:00.0: amdgpu:          PERMISSION_FAULTS: 0x3
Apr 16 10:33:08 thine495 kernel: amdgpu 0000:05:00.0: amdgpu:          MAPPING_ERROR: 0x0
Apr 16 10:33:08 thine495 kernel: amdgpu 0000:05:00.0: amdgpu:          RW: 0x0
Apr 16 10:33:08 thine495 kernel: amdgpu 0000:05:00.0: amdgpu: [gfxhub0] retry page fault (src_id:0 ring:0 vmid:3 pasid:32773, for process gnome-shell pid 3661 thread gnome-shel:cs0 pid 3686)
Apr 16 10:33:08 thine495 kernel: amdgpu 0000:05:00.0: amdgpu:   in page starting at address 0x0000800117203000 from client 27
Apr 16 10:33:08 thine495 kernel: amdgpu 0000:05:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00301031
Apr 16 10:33:08 thine495 kernel: amdgpu 0000:05:00.0: amdgpu:          Faulty UTCL2 client ID: 0x8
Apr 16 10:33:08 thine495 kernel: amdgpu 0000:05:00.0: amdgpu:          MORE_FAULTS: 0x1
Apr 16 10:33:08 thine495 kernel: amdgpu 0000:05:00.0: amdgpu:          WALKER_ERROR: 0x0
Apr 16 10:33:08 thine495 kernel: amdgpu 0000:05:00.0: amdgpu:          PERMISSION_FAULTS: 0x3
Apr 16 10:33:08 thine495 kernel: amdgpu 0000:05:00.0: amdgpu:          MAPPING_ERROR: 0x0
Apr 16 10:33:08 thine495 kernel: amdgpu 0000:05:00.0: amdgpu:          RW: 0x0
Apr 16 10:33:08 thine495 kernel: amdgpu 0000:05:00.0: amdgpu: [gfxhub0] retry page fault (src_id:0 ring:0 vmid:3 pasid:32773, for process gnome-shell pid 3661 thread gnome-shel:cs0 pid 3686)
Apr 16 10:33:08 thine495 kernel: amdgpu 0000:05:00.0: amdgpu:   in page starting at address 0x0000800117209000 from client 27
Apr 16 10:33:08 thine495 kernel: amdgpu 0000:05:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00301031
Apr 16 10:33:08 thine495 kernel: amdgpu 0000:05:00.0: amdgpu:          Faulty UTCL2 client ID: 0x8
Apr 16 10:33:08 thine495 kernel: amdgpu 0000:05:00.0: amdgpu:          MORE_FAULTS: 0x1
Apr 16 10:33:08 thine495 kernel: amdgpu 0000:05:00.0: amdgpu:          WALKER_ERROR: 0x0
Apr 16 10:33:08 thine495 kernel: amdgpu 0000:05:00.0: amdgpu:          PERMISSION_FAULTS: 0x3
Apr 16 10:33:08 thine495 kernel: amdgpu 0000:05:00.0: amdgpu:          MAPPING_ERROR: 0x0
Apr 16 10:33:08 thine495 kernel: amdgpu 0000:05:00.0: amdgpu:          RW: 0x0
Apr 16 10:33:09 thine495 kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered

While doing some tests, I read that these issues could be caused by the power scaling features of the graphics card.

I’ve worked with @shadesmaclean on this issue for a bit, and during my searches I found a script on GitHub that helps control the power/performance settings of the card.

We’ve since seen some success by manually switching the card’s power scaling settings with this script. We did try some of the amdgpu GUI tools that can do the same thing, but we couldn’t get them to work on his setup, so we ended up using the following script:

source: https://github.com/superjamie/snippets/blob/master/radcard
https://wiki.archlinux.org/index.php/AMDGPU

#!/bin/bash

# Radcard
# Script to control radeon DPM power saving
# Ref: https://wiki.archlinux.org/index.php/ATI#Powersaving
# Version: 2019-02-12
# License: GPLv3
# Authors: jamie.bainbridge@gmail.com

CARDPATH="/sys/class/drm/card0/device"

do_set() {
    case "$1" in
        bat*)
            sudo sh -c "echo battery > $CARDPATH/power_dpm_state"
            ;;
        bal*)
            sudo sh -c "echo balanced > $CARDPATH/power_dpm_state"
            ;;
        per*)
            sudo sh -c "echo performance > $CARDPATH/power_dpm_state"
            ;;
        a*)
            sudo sh -c "echo auto > $CARDPATH/power_dpm_force_performance_level"
            ;;
        l*)
            sudo sh -c "echo low > $CARDPATH/power_dpm_force_performance_level"
            ;;
        h*)
            sudo sh -c "echo high > $CARDPATH/power_dpm_force_performance_level"
            ;;
        *)
            do_usage
            ;;
    esac
}

do_get() {
    echo -n "power_dpm_state: "; cat "$CARDPATH/power_dpm_state"
    echo -n "power_dpm_force_performance_level: "; cat "$CARDPATH/power_dpm_force_performance_level"
}

do_usage() {
    echo "Usage: $(basename "$0") [get|set [battery|balanced|performance|auto|low|high|bat|bal|per|a|h|l]]"
    exit 1
}

case "$1" in
    "set")
        shift
        for VAR in "$@"; do
          do_set "$VAR"
        done
        do_get
        ;;
    "get")
        do_get
        ;;
    *)
        do_usage
        ;;
esac

exit 0

Basically, save that as a *.sh file and make it executable. Then, from the same directory as the script, we’ve been having shadesmaclean run the following command:

./radcard.sh set performance high
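
Put together, the steps look roughly like the sketch below. One thing worth checking first: the script hard-codes card0, which may not be the amdgpu card on every machine. `find_amdgpu_card` is a hypothetical helper (it is not part of radcard) that looks at each card's driver symlink in sysfs; its optional argument is only there so it can be pointed at a test directory.

```bash
#!/bin/bash
# find_amdgpu_card: hypothetical helper, not part of radcard.
# Prints the sysfs device path of every DRM card bound to the amdgpu driver.
# $1 = DRM class directory (defaults to the real /sys/class/drm).
find_amdgpu_card() {
    local root="${1:-/sys/class/drm}" dev
    for dev in "$root"/card[0-9]*/device; do
        # Entries without a device/driver symlink (connectors, or an
        # unmatched glob) are skipped here.
        [ -L "$dev/driver" ] || continue
        if [ "$(basename "$(readlink "$dev/driver")")" = "amdgpu" ]; then
            echo "$dev"
        fi
    done
}

# Steps from the post, run from the directory holding the script:
#   chmod +x radcard.sh
#   find_amdgpu_card        # compare against CARDPATH in radcard.sh
#   ./radcard.sh set performance high
#   ./radcard.sh get        # confirm both values changed
```

If the path printed is not the card0 one, CARDPATH at the top of the script would need to be adjusted to match.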

So far this SEEMS to have kept his system from crashing from the same errors as before (while we’re still looking into seemingly unrelated things on the side).

I’m sorry, but I didn’t write up my research early enough to track down all the sources I used when I found this solution. The ArchWiki article on ATI power saving and the AMDGPU article both have some references to the power saving features. I’m not an expert, but after fiddling with things, hopefully this leads to someone giving a more elaborate and informed solution/answer. There are ways to apply this automatically at boot, but we haven’t tested that yet, as we’re still evaluating its effectiveness.
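
For the boot-time part, one possible approach is a oneshot systemd unit that writes the same two values the script sets. This is an untested sketch, not something we have run on the affected machine: `print_unit`, the unit name `amdgpu-perf.service`, and the card0 path are all assumptions for illustration.

```bash
#!/bin/bash
# print_unit: hypothetical helper that prints a oneshot systemd unit which
# applies the "performance" / "high" settings at boot. Untested sketch.
# $1 = card device path (assumed to be card0, as in radcard).
print_unit() {
    local card="${1:-/sys/class/drm/card0/device}"
    cat <<EOF
[Unit]
Description=Force amdgpu power/performance settings (workaround)

[Service]
Type=oneshot
ExecStart=/bin/sh -c 'echo performance > $card/power_dpm_state'
ExecStart=/bin/sh -c 'echo high > $card/power_dpm_force_performance_level'

[Install]
WantedBy=multi-user.target
EOF
}

# To install (the unit name is an arbitrary choice):
#   print_unit | sudo tee /etc/systemd/system/amdgpu-perf.service
#   sudo systemctl daemon-reload
#   sudo systemctl enable --now amdgpu-perf.service
```

Running `./radcard.sh get` after a reboot would show whether the values actually stuck.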

Perhaps @shadesmaclean can chime in with any extra info on this from his end?

Anyway, TL;DR: it seems to be a power saving bug with some AMD graphics cards, and disabling power saving or manually forcing the performance settings to high seems to work around the issue.

As a novice myself, any corrections or information you guys can throw at me would be great.

As a side note, we also started having @shadesmaclean put his CPU in the performance profile as well, to rule out any chance of power settings somehow affecting that. It’s probably unrelated, but I should mention it since it was part of our workarounds.
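
The CPU side can be checked and set from a terminal too. The sketch below assumes "performance profile" means the cpufreq scaling governor, which may not be how it was actually set on his machine; `show_governors` is a hypothetical helper, and its optional argument only exists so it can be pointed at a test directory.

```bash
#!/bin/bash
# show_governors: hypothetical helper that counts how many cores use each
# cpufreq governor. $1 = cpu sysfs directory (defaults to the real one).
show_governors() {
    local root="${1:-/sys/devices/system/cpu}" f
    for f in "$root"/cpu[0-9]*/cpufreq/scaling_governor; do
        [ -r "$f" ] && cat "$f"
    done | sort | uniq -c
}

# Check, then switch every core to the performance governor (needs root):
#   show_governors
#   echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
```

Like the GPU setting, this does not survive a reboot unless something reapplies it.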

Otherwise, I stumbled upon this topic on the forum (prepend forum dot manjaro dot org slash t slash; sorry, for some reason I can’t post links):
system-frequently-crashing-after-gpu-drivers-update/62139/60

They suggest that updating the kernel to the latest RC helps. I am trying it out now.

I have the same issue with an AMD RX 570. Unfortunately, neither a Mesa update nor the latest kernel fixed it.

I’m wondering whether replacing Mesa RADV with AMDVLK could fix it!

I’m not sure how to go about that. However, the “fix” that we’ve been using on @shadesmaclean’s machine with the power profiles (CPU and GPU) has seemed pretty stable since he started doing that. We haven’t researched it much since then because the workaround has held up, but if you find that your idea works, please post it here and give us the steps!
