AMD graphical crash while in games

While playing a fairly wide range of games on Steam, at intervals of anywhere from 20 minutes to 3/4 hours, my display locks up, goes black, then “recovers” with rainbow distortion. I am often able to restart the system with Ctrl+Alt+F1 followed by Ctrl+Alt+Del. These crashes seem to be becoming more frequent over time, and I am still very new to Linux, so assistance with figuring out what I need to do would be appreciated. I read some posts suggesting that some kernels (5.4 in particular) might be better than others, and I have tried the AMD Pro drivers and the mesa-git drivers, with no change.

System:
  Kernel: 5.4.143-1-MANJARO x86_64 bits: 64 compiler: gcc v: 11.1.0 
  parameters: BOOT_IMAGE=/boot/vmlinuz-5.4-x86_64 
  root=UUID=c01cddb7-2b57-4aee-93a6-5a69361f34f7 rw udev.log_priority=3 
  amdgpu.gpu_recovery=1 
  Desktop: Xfce 4.16.0 tk: Gtk 3.24.29 info: xfce4-panel wm: xfwm 4.16.1 vt: 7 
  dm: LightDM 1.30.0 Distro: Manjaro Linux base: Arch Linux 
Machine:
  Type: Desktop System: ASUS product: N/A v: N/A serial: <filter> 
  Mobo: ASUSTeK model: ROG STRIX Z390-F GAMING v: Rev 1.xx serial: <filter> 
  UEFI: American Megatrends v: 1802 date: 12/01/2020 
Battery:
  Message: No system battery data found. Is one present? 
Memory:
  RAM: total: 31.28 GiB used: 2.59 GiB (8.3%) 
  RAM Report: permissions: Unable to run dmidecode. Root privileges required. 
CPU:
  Info: 6-Core model: Intel Core i5-9600K bits: 64 type: MCP arch: Kaby Lake 
  note: check family: 6 model-id: 9E (158) stepping: C (12) microcode: EA 
  cache: L2: 9 MiB bogomips: 44412 
  Speed: 800 MHz min/max: 800/4600 MHz Core speeds (MHz): 1: 800 2: 800 3: 801 
  4: 800 5: 800 6: 800 
  Flags: 3dnowprefetch abm acpi adx aes aperfmperf apic arat arch_capabilities 
  arch_perfmon art avx avx2 bmi1 bmi2 bts clflush clflushopt cmov constant_tsc 
  cpuid cpuid_fault cx16 cx8 de ds_cpl dtes64 dtherm dts epb ept ept_ad erms 
  est f16c flexpriority flush_l1d fma fpu fsgsbase fxsr hle ht hwp 
  hwp_act_window hwp_epp hwp_notify ibpb ibrs ida intel_pt invpcid 
  invpcid_single lahf_lm lm mca mce md_clear mmx monitor movbe mpx msr mtrr 
  nonstop_tsc nopl nx pae pat pbe pcid pclmulqdq pdcm pdpe1gb pebs pge pln pni 
  popcnt pse pse36 pts rdrand rdseed rdtscp rep_good rtm sdbg sep smap smep 
  smx ss ssbd sse sse2 sse4_1 sse4_2 ssse3 stibp syscall tm tm2 tpr_shadow tsc 
  tsc_adjust tsc_deadline_timer vme vmx vnmi vpid x2apic xgetbv1 xsave xsavec 
  xsaveopt xsaves xtopology xtpr 
  Vulnerabilities: Type: itlb_multihit status: KVM: Vulnerable 
  Type: l1tf status: Not affected 
  Type: mds mitigation: Clear CPU buffers; SMT disabled 
  Type: meltdown status: Not affected 
  Type: spec_store_bypass 
  mitigation: Speculative Store Bypass disabled via prctl and seccomp 
  Type: spectre_v1 
  mitigation: usercopy/swapgs barriers and __user pointer sanitization 
  Type: spectre_v2 mitigation: Full generic retpoline, IBPB: conditional, 
  IBRS_FW, STIBP: disabled, RSB filling 
  Type: srbds mitigation: Microcode 
  Type: tsx_async_abort mitigation: Clear CPU buffers; SMT disabled 
Graphics:
  Device-1: AMD Navi 10 [Radeon RX 5600 OEM/5600 XT / 5700/5700 XT] 
  vendor: Gigabyte driver: amdgpu v: kernel bus-ID: 03:00.0 chip-ID: 1002:731f 
  class-ID: 0300 
  Display: x11 server: X.Org 1.20.13 compositor: xfwm4 v: 4.16.1 driver: 
  loaded: amdgpu,ati unloaded: modesetting,radeon alternate: fbdev,vesa 
  display-ID: :0.0 screens: 1 
  Screen-1: 0 s-res: 3440x1440 s-dpi: 96 s-size: 910x381mm (35.8x15.0") 
  s-diag: 987mm (38.8") 
  Monitor-1: DisplayPort-1 res: 3440x1440 dpi: 110 
  size: 797x334mm (31.4x13.1") diag: 864mm (34") 
  OpenGL: renderer: AMD Radeon RX 5700 XT (NAVI10 DRM 3.35.0 5.4.143-1-MANJARO 
  LLVM 12.0.1) 
  v: 4.6 Mesa 21.2.1 direct render: Yes 
Audio:
  Device-1: Intel Cannon Lake PCH cAVS vendor: ASUSTeK driver: snd_hda_intel 
  v: kernel alternate: snd_soc_skl,snd_sof_pci bus-ID: 00:1f.3 
  chip-ID: 8086:a348 class-ID: 0403 
  Device-2: AMD Navi 10 HDMI Audio driver: snd_hda_intel v: kernel 
  bus-ID: 03:00.1 chip-ID: 1002:ab38 class-ID: 0403 
  Device-3: C-Media ATGM1-USB type: USB 
  driver: hid-generic,snd-usb-audio,usbhid bus-ID: 1-6.1:4 chip-ID: 0d8c:0089 
  class-ID: 0300 serial: <filter> 
  Sound Server-1: ALSA v: k5.4.143-1-MANJARO running: yes 
  Sound Server-2: JACK v: 1.9.19 running: no 
  Sound Server-3: PulseAudio v: 15.0 running: yes 
  Sound Server-4: PipeWire v: 0.3.34 running: no 
Network:
  Device-1: Intel Ethernet I219-V vendor: ASUSTeK driver: e1000e v: 3.2.6-k 
  port: efa0 bus-ID: 00:1f.6 chip-ID: 8086:15bc class-ID: 0200 
  IF: eno1 state: up speed: 1000 Mbps duplex: full mac: <filter> 
  IP v4: <filter> type: dynamic noprefixroute scope: global 
  broadcast: <filter> 
  IF-ID-1: tun0 state: unknown speed: 10 Mbps duplex: full mac: N/A 
  IP v4: <filter> scope: global 
  WAN IP: <filter> 
Bluetooth:
  Device-1: Broadcom BCM20702A0 Bluetooth 4.0 type: USB driver: btusb v: 0.8 
  bus-ID: 1-6.2:5 chip-ID: 0a5c:21e8 class-ID: fe01 serial: <filter> 
  Report: rfkill ID: hci0 rfk-id: 0 state: down bt-service: enabled,running 
  rfk-block: hardware: no software: yes address: see --recommends 
Logical:
  Message: No logical block device data found. 
RAID:
  Message: No RAID data found. 
Drives:
  Local Storage: total: 1.82 TiB used: 584.62 GiB (31.4%) 
  SMART Message: Required tool smartctl not installed. Check --recommends 
  ID-1: /dev/sda maj-min: 8:0 vendor: Samsung model: SSD 860 EVO 1TB 
  size: 931.51 GiB block-size: physical: 512 B logical: 512 B speed: 6.0 Gb/s 
  type: SSD serial: <filter> rev: 3B6Q scheme: GPT 
  ID-2: /dev/sdb maj-min: 8:16 type: USB vendor: Seagate model: Expansion 
  size: 931.51 GiB block-size: physical: 4096 B logical: 512 B type: N/A 
  serial: <filter> rev: 9300 scheme: MBR 
  Optical-1: /dev/sr0 vendor: PIONEER model: DVR-213NP rev: 1.00 
  dev-links: cdrom 
  Features: speed: 40 multisession: yes audio: yes dvd: yes 
  rw: cd-r,cd-rw,dvd-r,dvd-ram state: running 
Partition:
  ID-1: / raw-size: 931.22 GiB size: 915.53 GiB (98.32%) 
  used: 584.62 GiB (63.9%) fs: ext4 dev: /dev/sda2 maj-min: 8:2 label: N/A 
  uuid: c01cddb7-2b57-4aee-93a6-5a69361f34f7 
  ID-2: /boot/efi raw-size: 300 MiB size: 299.4 MiB (99.80%) 
  used: 296 KiB (0.1%) fs: vfat dev: /dev/sda1 maj-min: 8:1 label: NO_LABEL 
  uuid: 2CF3-1B16 
  ID-3: /home/<filter>/pCloudDrive raw-size: N/A size: 2 TiB 
  used: 6.87 GiB (0.3%) fs: fuse source: ERR-102 
Swap:
  Alert: No swap data was found. 
Unmounted:
  ID-1: /dev/sdb1 maj-min: 8:17 size: 931.51 GiB fs: exfat label: Toolbox 
  uuid: 2CC0-D83E 
USB:
  Hub-1: 1-0:1 info: Full speed (or root) Hub ports: 16 rev: 2.0 
  speed: 480 Mb/s chip-ID: 1d6b:0002 class-ID: 0900 
  Hub-2: 1-6:2 info: Genesys Logic Hub ports: 4 rev: 2.0 speed: 480 Mb/s 
  power: 100mA chip-ID: 05e3:0610 class-ID: 0900 
  Device-1: 1-6.1:4 info: C-Media ATGM1-USB type: Audio,HID 
  driver: hid-generic,snd-usb-audio,usbhid interfaces: 3 rev: 1.1 
  speed: 12 Mb/s power: 100mA chip-ID: 0d8c:0089 class-ID: 0300 
  serial: <filter> 
  Device-2: 1-6.2:5 info: Broadcom BCM20702A0 Bluetooth 4.0 type: Bluetooth 
  driver: btusb interfaces: 4 rev: 2.0 speed: 12 Mb/s chip-ID: 0a5c:21e8 
  class-ID: fe01 serial: <filter> 
  Device-3: 1-6.4:7 info: ASUSTek AURA MOTHERBOARD type: HID 
  driver: hid-generic,usbhid interfaces: 1 rev: 2.0 speed: 12 Mb/s 
  power: 100mA chip-ID: 0b05:18a3 class-ID: 0300 serial: <filter> 
  Hub-3: 1-9:3 info: Realtek RTS5411 Hub ports: 4 rev: 2.0 speed: 480 Mb/s 
  chip-ID: 0bda:5411 class-ID: 0900 
  Device-1: 1-9.2:6 info: Cooler Master CM110 Gaming Mouse type: Mouse,HID 
  driver: hid-generic,usbhid interfaces: 3 rev: 2.0 speed: 12 Mb/s 
  power: 100mA chip-ID: 2516:0119 class-ID: 0300 
  Device-2: 1-9.3:8 info: Razer USA BlackWidow (2019) type: Keyboard,Mouse 
  driver: hid-generic,usbhid interfaces: 3 rev: 2.0 speed: 12 Mb/s 
  power: 500mA chip-ID: 1532:0241 class-ID: 0300 
  Hub-4: 1-9.4:9 info: VIA Labs VL813 Hub ports: 4 rev: 2.1 speed: 480 Mb/s 
  chip-ID: 2109:2813 class-ID: 0900 
  Hub-5: 1-9.4.1:10 info: VIA Labs VL813 Hub ports: 4 rev: 2.1 speed: 480 Mb/s 
  chip-ID: 2109:2813 class-ID: 0900 
  Hub-6: 2-0:1 info: Full speed (or root) Hub ports: 10 rev: 3.1 
  speed: 10 Gb/s chip-ID: 1d6b:0003 class-ID: 0900 
  Hub-7: 2-9:2 info: Realtek Hub ports: 4 rev: 3.0 speed: 5 Gb/s 
  chip-ID: 0bda:0411 class-ID: 0900 
  Hub-8: 2-9.4:3 info: VIA Labs VL813 Hub ports: 4 rev: 3.0 speed: 5 Gb/s 
  chip-ID: 2109:0813 class-ID: 0900 
  Hub-9: 2-9.4.1:4 info: VIA Labs VL813 Hub ports: 4 rev: 3.0 speed: 5 Gb/s 
  chip-ID: 2109:0813 class-ID: 0900 
  Device-1: 2-9.4.4:5 info: Seagate RSS LLC SRD0NF1 Expansion Portable (STEA) 
  type: Mass Storage driver: uas interfaces: 1 rev: 3.0 speed: 5 Gb/s 
  power: 144mA chip-ID: 0bc2:2322 class-ID: 0806 serial: <filter> 
Sensors:
  System Temperatures: cpu: 27.8 C mobo: N/A gpu: amdgpu temp: 51.0 C 
  mem: 62.0 C 
  Fan Speeds (RPM): N/A gpu: amdgpu fan: 1474 
Info:
  Processes: 240 Uptime: 26m wakeups: 0 Init: systemd v: 248 tool: systemctl 
  Compilers: gcc: 11.1.0 clang: 12.0.1 Packages: pacman: 1250 lib: 437 
  Shell: Bash v: 5.1.8 running-in: xfce4-terminal inxi: 3.3.06

Below is a journalctl log of the crash.

Journal begins at Sun 2021-04-25 11:21:37 AEST, ends at Mon 2021-09-13 19:42:17 AEST. --
Sep 13 19:40:29 desktop-mdesk kernel: [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out!
Sep 13 19:40:29 desktop-mdesk kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=3299287, emitted seq=3299289
Sep 13 19:40:29 desktop-mdesk kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process TESV.exe pid 7570 thread TESV.exe:cs0 pid 7592
Sep 13 19:40:29 desktop-mdesk kernel: amdgpu 0000:03:00.0: GPU reset begin!
Sep 13 19:40:33 desktop-mdesk kernel: kfd2kgd: cp queue preemption time out.
Sep 13 19:40:33 desktop-mdesk kernel: [drm:gfx_v10_0_ring_test_ring [amdgpu]] *ERROR* amdgpu: ring 10 test failed (scratch(0xC040)=0xCAFEDEAD)
Sep 13 19:40:33 desktop-mdesk kernel: [drm:gfx_v10_0_hw_fini [amdgpu]] *ERROR* KGQ disable failed
Sep 13 19:40:33 desktop-mdesk kernel: [drm:gfx_v10_0_ring_test_ring [amdgpu]] *ERROR* amdgpu: ring 10 test failed (scratch(0xC040)=0xCAFEDEAD)
Sep 13 19:40:33 desktop-mdesk kernel: [drm:gfx_v10_0_hw_fini [amdgpu]] *ERROR* KCQ disable failed
Sep 13 19:40:33 desktop-mdesk kernel: [drm:gfx_v10_0_hw_fini [amdgpu]] *ERROR* failed to halt cp gfx
Sep 13 19:40:35 desktop-mdesk kernel: amdgpu 0000:03:00.0: GPU reset succeeded, trying to resume
Sep 13 19:40:35 desktop-mdesk kernel: [drm] PCIE GART of 512M enabled (table at 0x00000080012FC000).
Sep 13 19:40:35 desktop-mdesk kernel: [drm] PSP is resuming...
Sep 13 19:40:36 desktop-mdesk kernel: [drm] reserve 0x900000 from 0x81fe400000 for PSP TMR
Sep 13 19:40:36 desktop-mdesk kernel: amdgpu: [powerplay] SMU is resuming...
Sep 13 19:40:36 desktop-mdesk kernel: amdgpu: [powerplay] SMU is resumed successfully!
Sep 13 19:40:36 desktop-mdesk kernel: [drm] kiq ring mec 2 pipe 1 q 0
Sep 13 19:40:36 desktop-mdesk kernel: [drm] ring test on 10 succeeded in 55 usecs
Sep 13 19:40:36 desktop-mdesk kernel: [drm] ring test on 10 succeeded in 9 usecs
Sep 13 19:40:36 desktop-mdesk kernel: [drm] gfx 0 ring me 0 pipe 0 q 0
Sep 13 19:40:36 desktop-mdesk kernel: [drm:gfx_v10_0_ring_test_ring [amdgpu]] *ERROR* amdgpu: ring 0 test failed (scratch(0xC040)=0xCAFEDEAD)
Sep 13 19:40:36 desktop-mdesk kernel: [drm:amdgpu_device_ip_resume_phase2 [amdgpu]] *ERROR* resume of IP block <gfx_v10_0> failed -22
Sep 13 19:40:36 desktop-mdesk kernel: amdgpu 0000:03:00.0: GPU reset(1) failed
Sep 13 19:40:36 desktop-mdesk kernel: amdgpu 0000:03:00.0: GPU reset end with ret = -22
Sep 13 19:40:37 desktop-mdesk kernel: snd_hda_intel 0000:03:00.1: spurious response 0x0:0x0, last cmd=0x770100
Sep 13 19:40:37 desktop-mdesk kernel: snd_hda_intel 0000:03:00.1: spurious response 0x0:0x0, last cmd=0x770100
Sep 13 19:40:37 desktop-mdesk kernel: snd_hda_intel 0000:03:00.1: spurious response 0x0:0x0, last cmd=0x770100
Sep 13 19:40:37 desktop-mdesk kernel: snd_hda_intel 0000:03:00.1: spurious response 0x0:0x0, last cmd=0x770100
Sep 13 19:40:37 desktop-mdesk kernel: snd_hda_intel 0000:03:00.1: spurious response 0x0:0x0, last cmd=0x770100
Sep 13 19:40:37 desktop-mdesk kernel: snd_hda_intel 0000:03:00.1: spurious response 0x0:0x0, last cmd=0x770100
Sep 13 19:40:37 desktop-mdesk kernel: snd_hda_intel 0000:03:00.1: spurious response 0x0:0x0, last cmd=0x770100
Sep 13 19:40:37 desktop-mdesk kernel: snd_hda_intel 0000:03:00.1: spurious response 0x0:0x0, last cmd=0x770100
Sep 13 19:40:37 desktop-mdesk kernel: snd_hda_intel 0000:03:00.1: spurious response 0x0:0x0, last cmd=0x770100
Sep 13 19:40:37 desktop-mdesk kernel: snd_hda_intel 0000:03:00.1: spurious response 0x0:0x0, last cmd=0x770100
Sep 13 19:40:39 desktop-mdesk kernel: snd_hda_intel 0000:03:00.1: azx_get_response timeout, switching to polling mode: last cmd=0x00670d81
Sep 13 19:40:40 desktop-mdesk kernel: snd_hda_intel 0000:03:00.1: No response from codec, disabling MSI: last cmd=0x00670d81
Sep 13 19:40:40 desktop-mdesk fancontrol[5784]: /usr/sbin/fancontrol: line 639: echo: write error: Invalid argument
Sep 13 19:40:40 desktop-mdesk fancontrol[5784]: Error writing PWM value to /sys/class/hwmon/hwmon3/pwm1
Sep 13 19:40:40 desktop-mdesk fancontrol[5784]: Aborting, restoring fans...
Sep 13 19:40:40 desktop-mdesk fancontrol[5784]: Verify fans have returned to full speed
Sep 13 19:40:40 desktop-mdesk systemd[1]: fancontrol.service: Main process exited, code=exited, status=1/FAILURE
Sep 13 19:40:40 desktop-mdesk systemd[1]: fancontrol.service: Failed with result 'exit-code'.
Sep 13 19:40:40 desktop-mdesk audit[1]: SERVICE_STOP pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=fancontrol comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=failed'
Sep 13 19:40:40 desktop-mdesk systemd[1]: fancontrol.service: Consumed 2.703s CPU time.
Sep 13 19:40:40 desktop-mdesk kernel: manual fan speed control should be enabled first
Sep 13 19:40:40 desktop-mdesk kernel: manual fan speed control should be enabled first
Sep 13 19:40:40 desktop-mdesk kernel: audit: type=1131 audit(1631526040.934:159): pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=fancontrol comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=failed'
Sep 13 19:40:41 desktop-mdesk kernel: snd_hda_intel 0000:03:00.1: No response from codec, resetting bus: last cmd=0x00670d81
Sep 13 19:40:42 desktop-mdesk rtkit-daemon[1380]: Supervising 6 threads of 3 processes of 1 users.
Sep 13 19:40:42 desktop-mdesk rtkit-daemon[1380]: Successfully made thread 10500 of process 1377 owned by '1000' RT at priority 5.
Sep 13 19:40:42 desktop-mdesk kernel: snd_hda_intel 0000:03:00.1: azx_get_response timeout, switching to single_cmd mode: last cmd=0x00672400
Sep 13 19:40:42 desktop-mdesk rtkit-daemon[1380]: Supervising 7 threads of 3 processes of 1 users.
Sep 13 19:40:42 desktop-mdesk kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
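
For comparing several crashes, it may help to pull just the ring timeout and the offending process out of the journal. The following is a minimal sketch, not an official tool: the regular expressions are modeled only on the two `amdgpu_job_timedout` messages in the log above, and the function name `find_ring_timeouts` is made up for illustration. Other kernel versions may word these messages differently.

```python
import re

# Patterns modeled on the amdgpu_job_timedout lines in the log above.
# Other kernel versions may format these messages differently (assumption).
TIMEOUT_RE = re.compile(
    r"ring (?P<ring>\S+) timeout, "
    r"signaled seq=(?P<signaled>\d+), emitted seq=(?P<emitted>\d+)"
)
PROCESS_RE = re.compile(
    r"Process information: process (?P<process>\S+) pid (?P<pid>\d+)"
)


def find_ring_timeouts(lines):
    """Return one dict per ring timeout, with the process if it was logged."""
    events = []
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            events.append({
                "ring": match.group("ring"),
                "signaled": int(match.group("signaled")),
                "emitted": int(match.group("emitted")),
                "process": None,
                "pid": None,
            })
            continue
        match = PROCESS_RE.search(line)
        # Attach the process line to the most recent timeout that lacks one.
        if match and events and events[-1]["process"] is None:
            events[-1]["process"] = match.group("process")
            events[-1]["pid"] = int(match.group("pid"))
    return events


# Demo using two lines copied from the log above.
sample = [
    "Sep 13 19:40:29 desktop-mdesk kernel: [drm:amdgpu_job_timedout [amdgpu]] "
    "*ERROR* ring gfx_0.0.0 timeout, signaled seq=3299287, emitted seq=3299289",
    "Sep 13 19:40:29 desktop-mdesk kernel: [drm:amdgpu_job_timedout [amdgpu]] "
    "*ERROR* Process information: process TESV.exe pid 7570 thread TESV.exe:cs0 pid 7592",
]
for event in find_ring_timeouts(sample):
    print(event)
```

To run it against a real journal, one option is to save the kernel messages of the previous boot with `journalctl -k -b -1 > crash.log` and pass `open("crash.log")` to the function instead of `sample`.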

I am reading 62°C for idle GPU memory? That looks a bit hot… I wonder how much that grows when playing.

It usually reaches about 70°C and only rarely reaches 80°C. I don't like how high it sits at idle either, and I watch it carefully.
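
One way to watch those temperatures while a game is running is to read them straight from the kernel's hwmon interface. This is a rough sketch under a few assumptions: the card shows up as an hwmon device whose `name` file contains `amdgpu`, the sensors are exposed as `tempN_input` files in millidegrees Celsius with optional `tempN_label` files, and the function name `read_amdgpu_temps` is made up for illustration. Which labels appear depends on the card and kernel.

```python
from pathlib import Path


def read_amdgpu_temps(hwmon_root="/sys/class/hwmon"):
    """Return {label: degrees Celsius} for the first hwmon device named 'amdgpu'."""
    for dev in sorted(Path(hwmon_root).glob("hwmon*")):
        name_file = dev / "name"
        if not name_file.is_file() or name_file.read_text().strip() != "amdgpu":
            continue
        temps = {}
        for input_file in sorted(dev.glob("temp*_input")):
            label_file = dev / input_file.name.replace("_input", "_label")
            # Fall back to the file name when the driver provides no label.
            if label_file.is_file():
                label = label_file.read_text().strip()
            else:
                label = input_file.name
            # hwmon reports temperatures in millidegrees Celsius.
            temps[label] = int(input_file.read_text().strip()) / 1000.0
        return temps
    return {}  # no amdgpu hwmon device found


if __name__ == "__main__":
    for label, celsius in read_amdgpu_temps().items():
        print(f"{label}: {celsius:.1f} C")
```

Running it in a loop, or under `watch`, while a game is open would show whether the memory temperature climbs shortly before a crash.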

Supposing it’s related to your GPU, I don’t have many ideas…
https://wiki.archlinux.org/title/AMDGPU#Screen_artifacts_and_frequency_problem

I will give this a read - I had not found that segment in the wiki for the screen artifacts. Might not have time to muck around with the computer until the weekend though.

Thanks for the help so far either way. I have been scratching my head with this problem and the solutions in similar threads I read did not really help me.

Well, I tried setting the options mentioned in the wiki to high and low respectively, and it seems to have helped with some games and made others worse… The games where the crashes have become more frequent were released recently, however, so I might wait for a few patches to see if it improves.
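
For anyone repeating this experiment, here is a small sketch of switching the level from a script. It assumes the setting in question is the amdgpu `power_dpm_force_performance_level` sysfs file and that the card is `card0`; both are assumptions, so check the path on your own system. The helper names are made up for illustration, and writing to the real file needs root.

```python
from pathlib import Path

# Levels the amdgpu driver documents for this file.
VALID_LEVELS = {
    "auto", "low", "high", "manual",
    "profile_standard", "profile_min_sclk", "profile_min_mclk", "profile_peak",
}

# Assumption: the GPU is card0. On multi-GPU systems it may be card1, etc.
DEFAULT_PATH = "/sys/class/drm/card0/device/power_dpm_force_performance_level"


def get_performance_level(path=DEFAULT_PATH):
    """Return the currently forced performance level, e.g. 'auto'."""
    return Path(path).read_text().strip()


def set_performance_level(level, path=DEFAULT_PATH):
    """Write a new level; raises ValueError for names the driver does not document."""
    if level not in VALID_LEVELS:
        raise ValueError(f"unknown performance level: {level!r}")
    Path(path).write_text(level + "\n")
```

Calling `set_performance_level("high")` before starting a game and `set_performance_level("auto")` afterwards would make it easy to test one game at a time. The value does not persist across reboots, so nothing is permanently changed by trying it.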

Regardless, I just broke the OS by doing something I should not have, and I am thinking of trying the Manjaro Plasma desktop, so I am kind of back to a blank slate anyway. Thank you for your assistance.
