Games cause the GPU to reset/crash after the last update (AMDGPU)

I updated my system and tried to play some games today, only to find that they all freeze after a little while of playing. This doesn’t seem to be closely tied to GPU load, since it crashed twice while I was in a game menu with low GPU usage. I’m unsure whether the update to Mesa 21.0.1 or the kernel update caused this.
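One way to narrow down which update is responsible is to list what the upgrade actually changed. Below is a minimal sketch, assuming the usual pacman log line format (`[timestamp] [ALPM] upgraded <package> (<old> -> <new>)`). The function name and the list of watched package prefixes are illustrative choices, not something pacman provides.

```python
import re

# Matches pacman "upgraded" entries. Made-up example of the expected shape:
# [2021-04-11T19:02:11+0000] [ALPM] upgraded somepkg (1.0-1 -> 1.1-1)
UPGRADE_RE = re.compile(
    r"^\[(?P<when>[^\]]+)\] \[ALPM\] upgraded (?P<pkg>\S+) "
    r"\((?P<old>\S+) -> (?P<new>\S+)\)$"
)

# Package-name prefixes that plausibly affect GPU behaviour.
# This list is an assumption; adjust it for your system.
WATCHED_PREFIXES = ("mesa", "linux", "vulkan-radeon", "llvm")


def gpu_related_upgrades(lines, prefixes=WATCHED_PREFIXES):
    """Return (timestamp, package, old_version, new_version) for watched packages."""
    found = []
    for line in lines:
        match = UPGRADE_RE.match(line.strip())
        if match and match["pkg"].startswith(prefixes):
            found.append((match["when"], match["pkg"], match["old"], match["new"]))
    return found


# Usage against the real log (path assumed to be the pacman default):
#
#     with open("/var/log/pacman.log", errors="replace") as log:
#         for when, pkg, old, new in gpu_related_upgrades(log):
#             print(f"{when}  {pkg}: {old} -> {new}")
```

If both the kernel and Mesa show up for the same date, downgrading one of them at a time is the usual way to tell them apart.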

System:
  Kernel: 5.11.10-1-MANJARO x86_64 bits: 64 compiler: gcc v: 10.2.0 
  parameters: BOOT_IMAGE=/boot/vmlinuz-5.11-x86_64 
  root=UUID=fe9c5e23-5885-4d31-af95-391198cbd234 rw 
  systemd.unified_cgroup_hierarchy=1 intel_pstate=active udev.log_priority=3 
  Desktop: GNOME 3.38.4 tk: GTK 3.24.28 wm: gnome-shell dm: GDM 3.38.2.1 
  Distro: Manjaro Linux base: Arch Linux 
Machine:
  Type: Desktop Mobo: Gigabyte model: B85M-D3H v: x.x serial: <filter> 
  UEFI: American Megatrends v: FB date: 06/19/2014 
CPU:
  Info: Dual Core model: Intel Core i3-4150 bits: 64 type: MT MCP 
  arch: Haswell family: 6 model-id: 3C (60) stepping: 3 microcode: 28 cache: 
  L2: 3 MiB 
  flags: avx avx2 lm nx pae sse sse2 sse3 sse4_1 sse4_2 ssse3 vmx 
  bogomips: 27944 
  Speed: 1018 MHz min/max: 800/3500 MHz Core speeds (MHz): 1: 1018 2: 923 
  3: 959 4: 935 
  Vulnerabilities: Type: itlb_multihit status: KVM: VMX disabled 
  Type: l1tf 
  mitigation: PTE Inversion; VMX: conditional cache flushes, SMT vulnerable 
  Type: mds mitigation: Clear CPU buffers; SMT vulnerable 
  Type: meltdown mitigation: PTI 
  Type: spec_store_bypass 
  mitigation: Speculative Store Bypass disabled via prctl and seccomp 
  Type: spectre_v1 
  mitigation: usercopy/swapgs barriers and __user pointer sanitization 
  Type: spectre_v2 mitigation: Full generic retpoline, IBPB: conditional, 
  IBRS_FW, STIBP: conditional, RSB filling 
  Type: srbds mitigation: Microcode 
  Type: tsx_async_abort status: Not affected 
Graphics:
  Device-1: Intel 4th Generation Core Processor Family Integrated Graphics 
  vendor: Gigabyte driver: i915 v: kernel bus-ID: 00:02.0 chip-ID: 8086:041e 
  class-ID: 0300 
  Device-2: AMD Baffin [Radeon RX 460/560D / Pro 450/455/460/555/555X/560/560X] vendor: ASUSTeK driver: amdgpu v: kernel bus-ID: 01:00.0 chip-ID: 1002:67ef 
  class-ID: 0300 
  Display: wayland server: X.org 1.20.10 compositor: gnome-shell driver: 
  loaded: amdgpu,ati,intel unloaded: modesetting alternate: fbdev,vesa 
  display-ID: 0 resolution: <missing: xdpyinfo> 
  OpenGL: renderer: AMD Radeon RX 460 Graphics (POLARIS11 DRM 3.40.0 
  5.11.10-1-MANJARO LLVM 11.1.0) 
  v: 4.6 Mesa 21.0.1 direct render: Yes 
Audio:
  Device-1: Intel Xeon E3-1200 v3/4th Gen Core Processor HD Audio 
  driver: snd_hda_intel v: kernel bus-ID: 00:03.0 chip-ID: 8086:0c0c 
  class-ID: 0403 
  Device-2: Intel 8 Series/C220 Series High Definition Audio vendor: Gigabyte 
  driver: snd_hda_intel v: kernel bus-ID: 00:1b.0 chip-ID: 8086:8c20 
  class-ID: 0403 
  Device-3: AMD Baffin HDMI/DP Audio [Radeon RX 550 640SP / RX 560/560X] 
  vendor: ASUSTeK driver: snd_hda_intel v: kernel bus-ID: 01:00.1 
  chip-ID: 1002:aae0 class-ID: 0403 
  Sound Server-1: ALSA v: k5.11.10-1-MANJARO running: yes 
  Sound Server-2: JACK v: 0.125.0 running: no 
  Sound Server-3: PulseAudio v: 14.2 running: yes 
  Sound Server-4: PipeWire v: 0.3.24 running: yes 
Network:
  Device-1: Realtek RTL8111/8168/8411 PCI Express Gigabit Ethernet 
  vendor: Gigabyte driver: r8169 v: kernel port: d000 bus-ID: 03:00.0 
  chip-ID: 10ec:8168 class-ID: 0200 
  IF: enp3s0 state: down mac: <filter> 
  Device-2: Realtek RTL8192CU 802.11n WLAN Adapter type: USB driver: rtl8xxxu 
  bus-ID: 3-10:6 chip-ID: 0bda:8178 class-ID: 0000 serial: <filter> 
  IF: wlp0s20u10 state: up mac: <filter> 
Bluetooth:
  Device-1: Cambridge Silicon Radio Bluetooth Dongle (HCI mode) type: USB 
  driver: btusb v: 0.8 bus-ID: 3-12:7 chip-ID: 0a12:0001 class-ID: e001 
Drives:
  Local Storage: total: 1.82 TiB used: 1.54 TiB (84.5%) 
  SMART Message: Unable to run smartctl. Root privileges required. 
  ID-1: /dev/sda maj-min: 8:0 vendor: Western Digital model: WD10EZEX-00WN4A0 
  size: 931.51 GiB block-size: physical: 4096 B logical: 512 B speed: 6.0 Gb/s 
  rotation: 7200 rpm serial: <filter> rev: 1A01 scheme: GPT 
  ID-2: /dev/sdb maj-min: 8:16 vendor: Seagate model: ST1000DM003-1ER162 
  size: 931.51 GiB block-size: physical: 4096 B logical: 512 B speed: 6.0 Gb/s 
  rotation: 7200 rpm serial: <filter> rev: CC43 scheme: MBR 
Partition:
  ID-1: / raw-size: 99.5 GiB size: 97.44 GiB (97.93%) used: 21.59 GiB (22.2%) 
  fs: ext4 dev: /dev/sda2 maj-min: 8:2 
  ID-2: /boot/efi raw-size: 1024 MiB size: 1022 MiB (99.80%) 
  used: 312 KiB (0.0%) fs: vfat dev: /dev/sda1 maj-min: 8:1 
  ID-3: /home raw-size: 354.17 GiB size: 347.62 GiB (98.15%) 
  used: 276.87 GiB (79.6%) fs: ext4 dev: /dev/sda5 maj-min: 8:5 
Swap:
  Kernel: swappiness: 60 (default) cache-pressure: 100 (default) 
  ID-1: swap-1 type: zram size: 490.3 MiB used: 0 KiB (0.0%) priority: 32767 
  dev: /dev/zram0 
  ID-2: swap-2 type: zram size: 490.3 MiB used: 0 KiB (0.0%) priority: 32767 
  dev: /dev/zram1 
  ID-3: swap-3 type: zram size: 490.3 MiB used: 0 KiB (0.0%) priority: 32767 
  dev: /dev/zram2 
  ID-4: swap-4 type: zram size: 490.3 MiB used: 0 KiB (0.0%) priority: 32767 
  dev: /dev/zram3 
Sensors:
  System Temperatures: cpu: 50.0 C mobo: 27.8 C gpu: amdgpu temp: 38.0 C 
  Fan Speeds (RPM): N/A gpu: amdgpu fan: 1818 
Info:
  Processes: 243 Uptime: 27m wakeups: 0 Memory: 7.66 GiB 
  used: 1.98 GiB (25.8%) Init: systemd v: 247 tool: systemctl Compilers: 
  gcc: 10.2.0 clang: 11.1.0 Packages: 1525 pacman: 1508 lib: 432 flatpak: 17 
  Shell: Bash v: 5.1.0 running-in: gnome-terminal inxi: 3.3.03

This is the output of sudo journalctl -b -1 | grep -i amd (the journal of the previous boot, filtered for AMD-related lines):

Apr 11 20:24:43 abrar-desktop kernel:   AMD AuthenticAMD
Apr 11 20:24:43 abrar-desktop kernel: RAMDISK: [mem 0x36875000-0x37431fff]
Apr 11 20:24:43 abrar-desktop kernel: AMD-Vi: AMD IOMMUv2 driver by Joerg Roedel <jroedel@suse.de>
Apr 11 20:24:43 abrar-desktop kernel: AMD-Vi: AMD IOMMUv2 functionality not available on this system
Apr 11 20:24:47 abrar-desktop kernel: [drm] amdgpu kernel modesetting enabled.
Apr 11 20:24:47 abrar-desktop kernel: amdgpu: Topology: Add CPU node
Apr 11 20:24:47 abrar-desktop kernel: fb0: switching to amdgpudrmfb from EFI VGA
Apr 11 20:24:47 abrar-desktop kernel: amdgpu 0000:01:00.0: vgaarb: deactivate vga console
Apr 11 20:24:47 abrar-desktop kernel: amdgpu 0000:01:00.0: amdgpu: Trusted Memory Zone (TMZ) feature not supported
Apr 11 20:24:47 abrar-desktop kernel: amdgpu 0000:01:00.0: No more image in the PCI ROM
Apr 11 20:24:47 abrar-desktop kernel: amdgpu 0000:01:00.0: amdgpu: Fetched VBIOS from ROM BAR
Apr 11 20:24:47 abrar-desktop kernel: amdgpu: ATOM BIOS: 115-C994PI00-100
Apr 11 20:24:48 abrar-desktop kernel: amdgpu 0000:01:00.0: amdgpu: VRAM: 2048M 0x000000F400000000 - 0x000000F47FFFFFFF (2048M used)
Apr 11 20:24:48 abrar-desktop kernel: amdgpu 0000:01:00.0: amdgpu: GART: 256M 0x000000FF00000000 - 0x000000FF0FFFFFFF
Apr 11 20:24:48 abrar-desktop kernel: [drm] amdgpu: 2048M of VRAM memory ready
Apr 11 20:24:48 abrar-desktop kernel: [drm] amdgpu: 3072M of GTT memory ready.
Apr 11 20:24:48 abrar-desktop kernel: amdgpu: hwmgr_sw_init smu backed is polaris10_smu
Apr 11 20:24:48 abrar-desktop kernel: snd_hda_intel 0000:01:00.1: bound 0000:01:00.0 (ops amdgpu_dm_audio_component_bind_ops [amdgpu])
Apr 11 20:24:48 abrar-desktop kernel: amdgpu: Topology: Add dGPU node [0x67ef:0x1002]
Apr 11 20:24:48 abrar-desktop kernel: amdgpu 0000:01:00.0: amdgpu: SE 2, SH per SE 1, CU per SH 8, active_cu_number 14
Apr 11 20:24:48 abrar-desktop kernel: fbcon: amdgpudrmfb (fb0) is primary device
Apr 11 20:24:48 abrar-desktop kernel: amdgpu 0000:01:00.0: [drm] fb0: amdgpudrmfb frame buffer device
Apr 11 20:24:48 abrar-desktop kernel: amdgpu 0000:01:00.0: amdgpu: Using BACO for runtime pm
Apr 11 20:24:48 abrar-desktop kernel: [drm] Initialized amdgpu 3.40.0 20150101 for 0000:01:00.0 on minor 1
Apr 11 20:25:08 abrar-desktop gnome-shell[1091]: Disabling DMA buffer screen sharing for driver 'amdgpu'.
Apr 11 20:25:23 abrar-desktop gnome-shell[1544]: Disabling DMA buffer screen sharing for driver 'amdgpu'.
Apr 11 21:08:16 abrar-desktop kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=452681, emitted seq=452683
Apr 11 21:08:16 abrar-desktop kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process eurotrucks2.exe pid 7405 thread eurotrucks:cs0 pid 7419
Apr 11 21:08:16 abrar-desktop kernel: amdgpu 0000:01:00.0: amdgpu: GPU reset begin!
Apr 11 21:08:16 abrar-desktop kernel: amdgpu 0000:01:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring kiq_2.1.0 test failed (-110)
Apr 11 21:08:16 abrar-desktop kernel: [drm:gfx_v8_0_hw_fini [amdgpu]] *ERROR* KCQ disable failed
Apr 11 21:08:16 abrar-desktop kernel: amdgpu: cp is busy, skip halt cp
Apr 11 21:08:16 abrar-desktop kernel: amdgpu: rlc is busy, skip halt rlc
Apr 11 21:08:16 abrar-desktop kernel: amdgpu 0000:01:00.0: amdgpu: BACO reset
Apr 11 21:08:17 abrar-desktop kernel: amdgpu 0000:01:00.0: amdgpu: GPU reset succeeded, trying to resume
Apr 11 21:08:17 abrar-desktop kernel: amdgpu 0000:01:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring gfx test failed (-110)
Apr 11 21:08:17 abrar-desktop kernel: [drm:amdgpu_device_ip_resume_phase2 [amdgpu]] *ERROR* resume of IP block <gfx_v8_0> failed -110
Apr 11 21:08:17 abrar-desktop kernel: amdgpu 0000:01:00.0: amdgpu: GPU reset(2) failed
Apr 11 21:08:17 abrar-desktop kernel: amdgpu 0000:01:00.0: amdgpu: GPU reset end with ret = -110
Apr 11 21:08:27 abrar-desktop kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered
Apr 11 21:08:37 abrar-desktop kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered

This log was taken after it crashed while playing ETS 2 through Steam Proton 6.3-2.
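Most of the grep output above is normal boot noise; the relevant part is the "ring gfx timeout" line and what follows it. Here is a rough sketch that pulls out just the hang events from journal text. The regular expressions are written against the message wording seen in this log and may need adjusting for other kernel versions; the function and field names are my own. As far as I understand it, "signaled" is the last job the GPU finished and "emitted" is the last one submitted, so the difference is the number of jobs that never completed.

```python
import re

# e.g. "ring gfx timeout, signaled seq=452681, emitted seq=452683"
TIMEOUT_RE = re.compile(
    r"ring (?P<ring>\S+) timeout, signaled seq=(?P<signaled>\d+), "
    r"emitted seq=(?P<emitted>\d+)"
)
# e.g. "Process information: process eurotrucks2.exe pid 7405 thread ..."
PROCESS_RE = re.compile(r"Process information: process (?P<name>\S+) pid \d+")
# e.g. "GPU reset end with ret = -110"
RESET_END_RE = re.compile(r"GPU reset end with ret = (?P<ret>-?\d+)")


def summarize_hangs(lines):
    """Return one dict per ring timeout, with the process and reset result attached."""
    hangs = []
    for line in lines:
        timeout = TIMEOUT_RE.search(line)
        if timeout:
            signaled = int(timeout["signaled"])
            emitted = int(timeout["emitted"])
            hangs.append({
                # First three fields of the default journalctl format: "Apr 11 21:08:16"
                "time": " ".join(line.split()[:3]),
                "ring": timeout["ring"],
                "signaled": signaled,
                "emitted": emitted,
                "unfinished": emitted - signaled,
                "process": None,
                "reset_ret": None,
            })
            continue
        if not hangs:
            continue  # nothing to attach follow-up lines to yet
        process = PROCESS_RE.search(line)
        if process and hangs[-1]["process"] is None:
            hangs[-1]["process"] = process["name"]
        reset = RESET_END_RE.search(line)
        if reset and hangs[-1]["reset_ret"] is None:
            hangs[-1]["reset_ret"] = int(reset["ret"])
    return hangs


# Usage: save the journal to a file first, then pass the open file object:
#     journalctl -k -b -1 > previous-boot.log
#     with open("previous-boot.log") as log: print(summarize_hangs(log))
```

Fed the log above, this reports a single hang on the gfx ring from eurotrucks2.exe with two unfinished jobs and a reset result of -110, which is the kernel's ETIMEDOUT error code: the reset itself timed out.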

EDIT: I fixed the issue by deleting the Mesa shader cache, which is located at ~/.cache/mesa_shader_cache/.
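For anyone who wants to try the same thing without losing the old cache, here is a small sketch that renames the cache directory instead of deleting it. It assumes the default location mentioned above; the function name and the backup naming scheme are made up for this example. Mesa creates a fresh cache directory the next time a game runs.

```python
import time
from pathlib import Path


def set_aside_shader_cache(cache_dir=None):
    """Rename the Mesa shader cache to a timestamped backup.

    Returns the backup path, or None if there was no cache directory.
    """
    if cache_dir is None:
        # Default location from the post; an assumption for other setups.
        cache_dir = Path.home() / ".cache" / "mesa_shader_cache"
    cache_dir = Path(cache_dir)
    if not cache_dir.is_dir():
        return None
    stamp = time.strftime("%Y%m%d-%H%M%S")
    backup = cache_dir.with_name(f"{cache_dir.name}.bak-{stamp}")
    cache_dir.rename(backup)  # same filesystem, so this is a cheap rename
    return backup
```

Calling `set_aside_shader_cache()` with no arguments acts on the default path. If the hang persists afterwards, the backup directory can simply be deleted.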

EDIT 2: Nope, I spoke too soon; it’s happening again. SSH and sound still work, but the picture hangs and then the GPU stops outputting a signal. Dynamic sound effects like gunshots also freeze and glitch, but the background music keeps playing without any glitches.

Apr 12 16:10:39 abrar-desktop kernel:   AMD AuthenticAMD
Apr 12 16:10:39 abrar-desktop kernel: RAMDISK: [mem 0x36875000-0x37431fff]
Apr 12 16:10:39 abrar-desktop kernel: AMD-Vi: AMD IOMMUv2 driver by Joerg Roedel <jroedel@suse.de>
Apr 12 16:10:39 abrar-desktop kernel: AMD-Vi: AMD IOMMUv2 functionality not available on this system
Apr 12 16:10:44 abrar-desktop kernel: [drm] amdgpu kernel modesetting enabled.
Apr 12 16:10:44 abrar-desktop kernel: amdgpu: Topology: Add CPU node
Apr 12 16:10:44 abrar-desktop kernel: fb0: switching to amdgpudrmfb from EFI VGA
Apr 12 16:10:44 abrar-desktop kernel: amdgpu 0000:01:00.0: vgaarb: deactivate vga console
Apr 12 16:10:44 abrar-desktop kernel: amdgpu 0000:01:00.0: amdgpu: Trusted Memory Zone (TMZ) feature not supported
Apr 12 16:10:44 abrar-desktop kernel: amdgpu 0000:01:00.0: No more image in the PCI ROM
Apr 12 16:10:44 abrar-desktop kernel: amdgpu 0000:01:00.0: amdgpu: Fetched VBIOS from ROM BAR
Apr 12 16:10:44 abrar-desktop kernel: amdgpu: ATOM BIOS: 115-C994PI00-100
Apr 12 16:10:44 abrar-desktop kernel: amdgpu 0000:01:00.0: amdgpu: VRAM: 2048M 0x000000F400000000 - 0x000000F47FFFFFFF (2048M used)
Apr 12 16:10:44 abrar-desktop kernel: amdgpu 0000:01:00.0: amdgpu: GART: 256M 0x000000FF00000000 - 0x000000FF0FFFFFFF
Apr 12 16:10:44 abrar-desktop kernel: [drm] amdgpu: 2048M of VRAM memory ready
Apr 12 16:10:44 abrar-desktop kernel: [drm] amdgpu: 3072M of GTT memory ready.
Apr 12 16:10:44 abrar-desktop kernel: amdgpu: hwmgr_sw_init smu backed is polaris10_smu
Apr 12 16:10:44 abrar-desktop kernel: snd_hda_intel 0000:01:00.1: bound 0000:01:00.0 (ops amdgpu_dm_audio_component_bind_ops [amdgpu])
Apr 12 16:10:44 abrar-desktop kernel: amdgpu: Topology: Add dGPU node [0x67ef:0x1002]
Apr 12 16:10:44 abrar-desktop kernel: amdgpu 0000:01:00.0: amdgpu: SE 2, SH per SE 1, CU per SH 8, active_cu_number 14
Apr 12 16:10:44 abrar-desktop kernel: fbcon: amdgpudrmfb (fb0) is primary device
Apr 12 16:10:44 abrar-desktop kernel: amdgpu 0000:01:00.0: [drm] fb0: amdgpudrmfb frame buffer device
Apr 12 16:10:44 abrar-desktop kernel: amdgpu 0000:01:00.0: amdgpu: Using BACO for runtime pm
Apr 12 16:10:44 abrar-desktop kernel: [drm] Initialized amdgpu 3.40.0 20150101 for 0000:01:00.0 on minor 1
Apr 12 16:11:02 abrar-desktop gnome-shell[1091]: Disabling DMA buffer screen sharing for driver 'amdgpu'.
Apr 12 16:11:42 abrar-desktop gnome-shell[1891]: Disabling DMA buffer screen sharing for driver 'amdgpu'.
Apr 12 16:19:54 abrar-desktop kernel: [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out!
Apr 12 16:19:54 abrar-desktop kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=131336, emitted seq=131338
Apr 12 16:19:54 abrar-desktop kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process JustCause3.exe pid 5401 thread JustCause3:cs0 pid 5418
Apr 12 16:19:54 abrar-desktop kernel: amdgpu 0000:01:00.0: amdgpu: GPU reset begin!
Apr 12 16:19:54 abrar-desktop kernel: amdgpu 0000:01:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring kiq_2.1.0 test failed (-110)
Apr 12 16:19:54 abrar-desktop kernel: [drm:gfx_v8_0_hw_fini [amdgpu]] *ERROR* KCQ disable failed
Apr 12 16:19:55 abrar-desktop kernel: amdgpu: cp is busy, skip halt cp
Apr 12 16:19:55 abrar-desktop kernel: amdgpu: rlc is busy, skip halt rlc
Apr 12 16:19:55 abrar-desktop kernel: amdgpu 0000:01:00.0: amdgpu: BACO reset
Apr 12 16:19:55 abrar-desktop kernel: amdgpu 0000:01:00.0: amdgpu: GPU reset succeeded, trying to resume
Apr 12 16:19:56 abrar-desktop kernel: amdgpu 0000:01:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring gfx test failed (-110)
Apr 12 16:19:56 abrar-desktop kernel: [drm:amdgpu_device_ip_resume_phase2 [amdgpu]] *ERROR* resume of IP block <gfx_v8_0> failed -110
Apr 12 16:19:56 abrar-desktop kernel: amdgpu 0000:01:00.0: amdgpu: GPU reset(2) failed
Apr 12 16:19:56 abrar-desktop kernel: amdgpu 0000:01:00.0: amdgpu: GPU reset end with ret = -110
Apr 12 16:20:06 abrar-desktop kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered

Logs taken after it crashed while playing Just Cause 3.

Looks like the latest update fixed it. I also cleaned the GPU, so I’m not sure whether the Mesa update or the cleaning fixed the problem.
