Intermittent crash, seemingly only when playing Counter-Strike 2 - "amdgpu_job_timedout ring gfx_0.0.0 timeout"

Nikolai5 · 22 December 2023 19:36

So far I’ve only had this issue when playing Counter-Strike 2, and I have sometimes gone whole sessions without a single crash. However, it crashes in roughly 1 in every 5 games.

When it crashes, it’s as if I get logged out: the screen dies and I end up on the login screen. When I log back in, I find that all graphical processes have been killed.

I then start everything back up and it’s fine again.

Here are my specs using inxi:

System:
  Kernel: 6.6.7-4-MANJARO arch: x86_64 bits: 64 compiler: gcc v: 13.2.1
    Desktop: KDE Plasma v: 5.27.10 Distro: Manjaro Linux base: Arch Linux
Machine:
  Type: Desktop Mobo: ASUSTeK model: PRIME X370-PRO v: Rev X.0x
    serial: <superuser required> BIOS: American Megatrends v: 6203
    date: 07/27/2023
CPU:
  Info: 8-core model: AMD Ryzen 7 1800X bits: 64 type: MT MCP arch: Zen rev: 1
    cache: L1: 768 KiB L2: 4 MiB L3: 16 MiB
  Speed (MHz): avg: 2426 high: 3988 min/max: 2200/4000 boost: disabled
    cores: 1: 1962 2: 3879 3: 1855 4: 1987 5: 1917 6: 3946 7: 2200 8: 2200
    9: 1780 10: 3988 11: 1937 12: 1935 13: 1778 14: 3878 15: 1802 16: 1777
    bogomips: 128045
  Flags: avx avx2 ht lm nx pae sse sse2 sse3 sse4_1 sse4_2 sse4a ssse3 svm
Graphics:
  Device-1: AMD Navi 32 [Radeon RX 7700 XT / 7800 XT] vendor: XFX
    driver: amdgpu v: kernel arch: RDNA-3 bus-ID: 0c:00.0
  Display: x11 server: X.Org v: 21.1.10 with: Xwayland v: 23.2.3 driver: X:
    loaded: amdgpu unloaded: modesetting,radeon dri: radeonsi gpu: amdgpu
    resolution: 2560x1440
  API: EGL v: 1.5 drivers: radeonsi,swrast platforms:
    active: x11,surfaceless,device inactive: gbm,wayland
  API: OpenGL v: 4.6 compat-v: 4.5 vendor: amd mesa v: 23.1.9-manjaro1.1
    glx-v: 1.4 direct-render: yes renderer: AMD Radeon RX 7800 XT (gfx1101 LLVM
    16.0.6 DRM 3.54 6.6.7-4-MANJARO)
  API: Vulkan v: 1.3.269 drivers: radv surfaces: xcb,xlib devices: 1
Audio:
  Device-1: AMD Navi 31 HDMI/DP Audio driver: snd_hda_intel v: kernel
    bus-ID: 0c:00.1
  Device-2: AMD Family 17h HD Audio vendor: ASUSTeK driver: snd_hda_intel
    v: kernel bus-ID: 0e:00.3
  Device-3: Logitech G933 Wireless Headset Dongle
    driver: hid-generic,snd-usb-audio,usbhid type: USB bus-ID: 5-4:4
  API: ALSA v: k6.6.7-4-MANJARO status: kernel-api
  Server-1: JACK v: 1.9.22 status: off
  Server-2: PipeWire v: 1.0.0 status: off
  Server-3: PulseAudio v: 16.1 status: active
Network:
  Device-1: Intel I211 Gigabit Network vendor: ASUSTeK driver: igb v: kernel
    port: e000 bus-ID: 08:00.0
  IF: enp8s0 state: up speed: 1000 Mbps duplex: full mac: <filter>
Drives:
  Local Storage: total: 1.82 TiB used: 785.07 GiB (42.1%)
  ID-1: /dev/nvme0n1 vendor: Samsung model: SSD 970 EVO Plus 2TB
    size: 1.82 TiB temp: 49.9 C
Partition:
  ID-1: / size: 1.79 TiB used: 785.07 GiB (42.8%) fs: ext4 dev: /dev/dm-0
    mapped: luks-d98cd8d2-e273-4dda-808f-bc3d6ee962a8
Swap:
  Alert: No swap data was found.
Sensors:
  System Temperatures: cpu: 48.0 C mobo: N/A gpu: amdgpu temp: 50.0 C
  Fan Speeds (rpm): N/A gpu: amdgpu fan: 0
Info:
  Processes: 381 Uptime: 9h 29m Memory: total: 32 GiB available: 31.25 GiB
  used: 7.93 GiB (25.4%) Init: systemd Compilers: gcc: 13.2.1 clang: 16.0.6
  Packages: 1447 Shell: Zsh v: 5.9 inxi: 3.3.31

Here is the output of the related error, which I found in logs I had saved a few days ago:

Dec 16 17:17:01 thomas-pc kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=11563292, emitted seq=11563294
Dec 16 17:17:01 thomas-pc kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process cs2 pid 45713 thread VKRenderThread pid 45744
Dec 16 17:17:01 thomas-pc kernel: amdgpu 0000:0c:00.0: amdgpu: GPU reset begin!
Dec 16 17:17:02 thomas-pc kernel: amdgpu 0000:0c:00.0: amdgpu: IP block:gfx_v11_0 is hung!
Dec 16 17:17:02 thomas-pc kernel: amdgpu 0000:0c:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0012 address=0xfff4a800200 flags=0x0020]
Dec 16 17:17:02 thomas-pc kernel: amdgpu 0000:0c:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0012 address=0xfff4a800224 flags=0x0020]
Dec 16 17:17:02 thomas-pc kernel: amdgpu 0000:0c:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0012 address=0xfff4a800244 flags=0x0020]
Dec 16 17:17:02 thomas-pc kernel: amdgpu 0000:0c:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0012 address=0xfff4a800264 flags=0x0020]
Dec 16 17:17:02 thomas-pc kernel: amdgpu 0000:0c:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0012 address=0xfff4a800284 flags=0x0020]
Dec 16 17:17:02 thomas-pc kernel: amdgpu 0000:0c:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0012 address=0xfff4a8002a0 flags=0x0020]
Dec 16 17:17:02 thomas-pc kernel: amdgpu 0000:0c:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0012 address=0xfff4a8002c0 flags=0x0020]
Dec 16 17:17:02 thomas-pc kernel: amdgpu 0000:0c:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0012 address=0xfff4a8002e0 flags=0x0020]
Dec 16 17:17:02 thomas-pc kernel: amdgpu 0000:0c:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0012 address=0xfff4a800210 flags=0x0020]
Dec 16 17:17:02 thomas-pc kernel: amdgpu 0000:0c:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0012 address=0xfff4a800200 flags=0x0020]
Dec 16 17:17:02 thomas-pc kernel: Failed to wait all pipes clean
Dec 16 17:17:02 thomas-pc kernel: amdgpu 0000:0c:00.0: amdgpu: soft reset failed, will fallback to full reset!
Dec 16 17:17:02 thomas-pc kernel: [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
Dec 16 17:17:02 thomas-pc kernel: [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
Dec 16 17:17:02 thomas-pc kernel: [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
Dec 16 17:17:02 thomas-pc kernel: [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
Dec 16 17:17:03 thomas-pc kernel: [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
Dec 16 17:17:03 thomas-pc kernel: [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
Dec 16 17:17:03 thomas-pc kernel: [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
Dec 16 17:17:03 thomas-pc kernel: [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
Dec 16 17:17:03 thomas-pc kernel: [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
Dec 16 17:17:03 thomas-pc kernel: [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
Dec 16 17:17:03 thomas-pc kernel: [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
Dec 16 17:17:03 thomas-pc kernel: [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
Dec 16 17:17:03 thomas-pc kernel: [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
Dec 16 17:17:03 thomas-pc kernel: [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
Dec 16 17:17:03 thomas-pc kernel: [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
Dec 16 17:17:03 thomas-pc kernel: [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
Dec 16 17:17:03 thomas-pc kernel: [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
Dec 16 17:17:03 thomas-pc kernel: [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
Dec 16 17:17:04 thomas-pc kernel: [drm:gfx_v11_0_hw_fini [amdgpu]] *ERROR* failed to halt cp gfx
Dec 16 17:17:04 thomas-pc kernel: amdgpu 0000:0c:00.0: amdgpu: MODE1 reset
Dec 16 17:17:04 thomas-pc kernel: amdgpu 0000:0c:00.0: amdgpu: GPU mode1 reset
Dec 16 17:17:04 thomas-pc kernel: amdgpu 0000:0c:00.0: amdgpu: GPU smu mode1 reset
Dec 16 17:17:04 thomas-pc kernel: amdgpu 0000:0c:00.0: amdgpu: GPU reset succeeded, trying to resume
Dec 16 17:17:04 thomas-pc kernel: [drm] PCIE GART of 512M enabled (table at 0x0000008000F00000).
Dec 16 17:17:04 thomas-pc kernel: [drm] VRAM is lost due to GPU reset!
Dec 16 17:17:04 thomas-pc kernel: [drm] PSP is resuming...
Dec 16 17:17:04 thomas-pc kernel: [drm] reserve 0xa700000 from 0x83e0000000 for PSP TMR
Dec 16 17:17:04 thomas-pc kernel: amdgpu 0000:0c:00.0: amdgpu: RAP: optional rap ta ucode is not available
Dec 16 17:17:04 thomas-pc kernel: amdgpu 0000:0c:00.0: amdgpu: SECUREDISPLAY: securedisplay ta ucode is not available
Dec 16 17:17:04 thomas-pc kernel: amdgpu 0000:0c:00.0: amdgpu: SMU is resuming...
Dec 16 17:17:04 thomas-pc kernel: amdgpu 0000:0c:00.0: amdgpu: smu driver if version = 0x0000003d, smu fw if version = 0x0000003f, smu fw program = 0, smu fw vers>
Dec 16 17:17:04 thomas-pc kernel: amdgpu 0000:0c:00.0: amdgpu: SMU driver if version not matched
Dec 16 17:17:05 thomas-pc kernel: amdgpu 0000:0c:00.0: amdgpu: SMU is resumed successfully!
Dec 16 17:17:05 thomas-pc kernel: [drm] DMUB hardware initialized: version=0x07002400
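For anyone comparing crashes across sessions, here is a minimal Python sketch that pulls the ring name and the two sequence numbers out of the `amdgpu_job_timedout` line above. The regular expression is written against the exact wording in this log and may need adjusting for other kernel versions. Reading `emitted - signaled` as the number of jobs still outstanding on the ring is my assumption, not something the log states.

```python
import re

# Matches the "ring ... timeout" message as it is worded in the log above.
TIMEOUT_RE = re.compile(
    r"ring (?P<ring>\S+) timeout, "
    r"signaled seq=(?P<signaled>\d+), emitted seq=(?P<emitted>\d+)"
)


def parse_timeout(line):
    """Return ring name and sequence numbers from a timeout line, or None."""
    match = TIMEOUT_RE.search(line)
    if match is None:
        return None
    signaled = int(match.group("signaled"))
    emitted = int(match.group("emitted"))
    return {
        "ring": match.group("ring"),
        "signaled": signaled,
        "emitted": emitted,
        # Assumption: jobs emitted to the ring but not yet signaled as done.
        "pending": emitted - signaled,
    }


if __name__ == "__main__":
    sample = (
        "Dec 16 17:17:01 thomas-pc kernel: [drm:amdgpu_job_timedout [amdgpu]] "
        "*ERROR* ring gfx_0.0.0 timeout, signaled seq=11563292, "
        "emitted seq=11563294"
    )
    print(parse_timeout(sample))
```

Feeding it each line of `journalctl -k` output from several crashes would show whether it is always the same ring that times out.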

I have searched online for this error message and found that other people have had the same type of timeout, where it throws them out of their desktop. However, the causes seem to differ, some of the hardware is different, and many of the suggestions don’t seem relevant, with no confirmation that any of them work.

The only thing of interest I could find was someone suggesting disabling something to do with dynamic power management, which automatically adjusts the clocks and voltages of the GPU. I’ve not tried anything yet, though.

Someone else suggested that it was due to a CPU bottleneck. But it does not seem to affect anything other than CS2 (maybe because it uses Vulkan?), unlike the other games that I’ve played (which may not use Vulkan).

I thought I would raise a topic to see if anyone has any ideas or has come across this before.

Edit:
The Arch Wiki suggests that, while there are fewer issues with the vulkan-radeon package, this error is one that can occur with it.
https://wiki.archlinux.org/title/Vulkan#AMDGPU_-_Hangs_when_playing_DirectX_Vulkan_games

It therefore suggests trying amdvlk instead. I suppose I can only try it and hope that stability for all my other applications and games is just as good with amdvlk.

Nikolai5 · 23 December 2023 00:25

After installing amdvlk, rebooting and playing a game, I had another crash. So it doesn’t appear to have made a difference.

[15435.561856] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=22454309, emitted seq=22454311
[15435.562356] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process cs2 pid 125473 thread VKRenderThread pid 125509
[15435.562838] amdgpu 0000:0c:00.0: amdgpu: GPU reset begin!
[15436.588513] amdgpu 0000:0c:00.0: amdgpu: IP block:gfx_v11_0 is hung!
[15436.939563] Failed to wait all pipes clean
[15436.939569] amdgpu 0000:0c:00.0: amdgpu: soft reset failed, will fallback to full reset!
[15437.275441] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
[15437.275727] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[15437.399470] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
[15437.399729] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[15437.523427] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
[15437.523689] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[15437.647438] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
[15437.647692] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[15437.771435] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
[15437.771695] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[15437.895401] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
[15437.895663] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[15438.019416] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
[15438.019678] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[15438.143380] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
[15438.143641] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[15438.267266] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
[15438.267522] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[15438.510128] [drm:gfx_v11_0_hw_fini [amdgpu]] *ERROR* failed to halt cp gfx
[15438.614634] amdgpu 0000:0c:00.0: amdgpu: MODE1 reset
[15438.614639] amdgpu 0000:0c:00.0: amdgpu: GPU mode1 reset
[15438.614696] amdgpu 0000:0c:00.0: amdgpu: GPU smu mode1 reset
[15439.121953] amdgpu 0000:0c:00.0: amdgpu: GPU reset succeeded, trying to resume
[15439.122141] [drm] PCIE GART of 512M enabled (table at 0x0000008000F00000).
[15439.122250] [drm] VRAM is lost due to GPU reset!
[15439.122252] [drm] PSP is resuming...
[15439.197629] [drm] reserve 0xa700000 from 0x83e0000000 for PSP TMR
[15439.436577] amdgpu 0000:0c:00.0: amdgpu: RAP: optional rap ta ucode is not available
[15439.436584] amdgpu 0000:0c:00.0: amdgpu: SECUREDISPLAY: securedisplay ta ucode is not available
[15439.436589] amdgpu 0000:0c:00.0: amdgpu: SMU is resuming...
[15439.436594] amdgpu 0000:0c:00.0: amdgpu: smu driver if version = 0x0000003d, smu fw if version = 0x0000003f, smu fw program = 0, smu fw version = 0x00503b00 (80.59.0)
[15439.436601] amdgpu 0000:0c:00.0: amdgpu: SMU driver if version not matched
[15439.534282] amdgpu 0000:0c:00.0: amdgpu: SMU is resumed successfully!
[15439.536459] [drm] DMUB hardware initialized: version=0x07002400
[15439.836027] [drm] kiq ring mec 3 pipe 1 q 0
[15439.840881] [drm] VCN decode and encode initialized successfully(under DPG Mode).
[15439.841026] amdgpu 0000:0c:00.0: [drm:jpeg_v4_0_hw_init [amdgpu]] JPEG decode initialized successfully.
[15439.841657] amdgpu 0000:0c:00.0: amdgpu: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
[15439.841661] amdgpu 0000:0c:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
[15439.841664] amdgpu 0000:0c:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
[15439.841667] amdgpu 0000:0c:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 6 on hub 0
[15439.841671] amdgpu 0000:0c:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 7 on hub 0
[15439.841674] amdgpu 0000:0c:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 8 on hub 0
[15439.841677] amdgpu 0000:0c:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 9 on hub 0
[15439.841680] amdgpu 0000:0c:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 10 on hub 0
[15439.841683] amdgpu 0000:0c:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 11 on hub 0
[15439.841686] amdgpu 0000:0c:00.0: amdgpu: ring sdma0 uses VM inv eng 12 on hub 0
[15439.841690] amdgpu 0000:0c:00.0: amdgpu: ring sdma1 uses VM inv eng 13 on hub 0
[15439.841693] amdgpu 0000:0c:00.0: amdgpu: ring vcn_unified_0 uses VM inv eng 0 on hub 8
[15439.841696] amdgpu 0000:0c:00.0: amdgpu: ring vcn_unified_1 uses VM inv eng 1 on hub 8
[15439.841699] amdgpu 0000:0c:00.0: amdgpu: ring jpeg_dec uses VM inv eng 4 on hub 8
[15439.841702] amdgpu 0000:0c:00.0: amdgpu: ring mes_kiq_3.1.0 uses VM inv eng 14 on hub 0
[15439.844861] amdgpu 0000:0c:00.0: amdgpu: recover vram bo from shadow start
[15439.866627] amdgpu 0000:0c:00.0: amdgpu: recover vram bo from shadow done
[15439.866650] amdgpu 0000:0c:00.0: amdgpu: GPU reset(2) succeeded!

openminded · 18 February 2024 15:26

Hi! Have you found a solution to this issue? I’m having the same problem and am still puzzled about what’s causing it. My GPU is different though: it’s a Phoenix 780M.

MrLavender · 18 February 2024 17:02

There is a very long thread about the issue that has been going since April 2022 with no clear resolution. Some people report that setting the boot param amdgpu.ppfeaturemask=0xfffd3fff fixes their problem. Other reports and testing seem to point to hardware issues.
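Since `amdgpu.ppfeaturemask` is a bitmask, it can help to see which bits the value `0xfffd3fff` actually turns off. The sketch below lists the cleared bit positions, assuming a 32-bit mask. As far as I can tell from the kernel's `PP_FEATURE_MASK` enum, the three cleared bits correspond to overdrive, GFXOFF, and stutter mode, but that mapping is from memory and is worth checking against the source of the kernel you run.

```python
def cleared_bits(mask, width=32):
    """List bit positions (0 = least significant) that are 0 in mask."""
    return [bit for bit in range(width) if not (mask >> bit) & 1]


if __name__ == "__main__":
    mask = 0xFFFD3FFF  # value quoted in the post above
    for bit in cleared_bits(mask):
        # Print each disabled bit as a position and as its hex flag value.
        print(f"bit {bit}: {1 << bit:#x}")
```

Comparing the result with the value in `/sys/module/amdgpu/parameters/ppfeaturemask` on a running system shows which features the parameter would change relative to the current default.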

Nikolai5 · 18 February 2024 17:08

I think it doesn’t help that there appear to be multiple causes for that error, which is why the supposed solutions are inconsistent. They aren’t all the same problem.

@openminded I haven’t done anything to resolve the issue, but I also haven’t done any recent testing. I am running Arch Linux now, and there have been a great many kernel updates and some mesa updates since, so I’ll play a bunch of CS in the next few days and let you know how I get on.

Nikolai5 · 19 February 2024 17:11

So far no crash with that same error.

I’m now running:

System:
  Kernel: 6.7.5-zen1-1-zen arch: x86_64 bits: 64 compiler: gcc v: 13.2.1
  Desktop: KDE Plasma v: 5.27.10 Distro: Arch Linux
Machine:
  Type: Desktop Mobo: ASUSTeK model: PRIME X370-PRO v: Rev X.0x
    serial: <superuser required> UEFI: American Megatrends v: 6203
    date: 07/27/2023
CPU:
  Info: 8-core model: AMD Ryzen 7 1800X bits: 64 type: MT MCP arch: Zen rev: 1
    cache: L1: 768 KiB L2: 4 MiB L3: 16 MiB
  Speed (MHz): avg: 2303 high: 3609 min/max: 2200/3600 boost: enabled cores:
    1: 1850 2: 1845 3: 1849 4: 2971 5: 1905 6: 2132 7: 2193 8: 3238 9: 1849
    10: 2200 11: 2114 12: 3137 13: 1886 14: 1883 15: 2200 16: 3609
    bogomips: 115197
  Flags: avx avx2 ht lm nx pae sse sse2 sse3 sse4_1 sse4_2 sse4a ssse3 svm
Graphics:
  Device-1: AMD Navi 32 [Radeon RX 7700 XT / 7800 XT] vendor: XFX
    driver: amdgpu v: kernel arch: RDNA-3 bus-ID: 0c:00.0
  Display: x11 server: X.Org v: 21.1.11 with: Xwayland v: 23.2.4 driver: X:
    loaded: amdgpu unloaded: modesetting,radeon dri: radeonsi gpu: amdgpu
    resolution: 2560x1440
  API: EGL v: 1.5 drivers: radeonsi,swrast platforms:
    active: x11,surfaceless,device inactive: gbm,wayland
  API: OpenGL v: 4.6 compat-v: 4.5 vendor: amd mesa v: 24.0.1-arch1.1
    glx-v: 1.4 direct-render: yes renderer: AMD Radeon RX 7800 XT (radeonsi
    navi32 LLVM 16.0.6 DRM 3.57 6.7.5-zen1-1-zen)
  API: Vulkan v: 1.3.276 drivers: radv surfaces: xcb,xlib devices: 1

dgdg · 19 February 2024 18:05

I had the same type of issue which drove me mad for a long time; in the end, it turned out to be a power supply issue. My PSU was, at least on paper, adequate for my build but I guess transient power spikes were causing problems.

In any case, it may be worth running watch sensors amdgpu-pci-0c00 (or similar) to see whether you’re now hitting the maximum power of your GPU; it’s possible that with the swap to Arch, you no longer are. You can gather some more info on how likely this is by running something like the Unigine Superposition Benchmark. Superposition pretty much tests only the GPU, so if your PSU is marginal for CPU+GPU you may be able to run it. In my case, my Superposition scores were about 20% lower than expected for my hardware, which, combined with my PSU’s built-in voltage monitoring showing voltages that were a bit off, led me to swap the PSU.
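As an alternative to watching `sensors` by eye, a small Python sketch can read the same hwmon values straight from sysfs and compare draw against the cap. It assumes the amdgpu hwmon directory exposes `power1_average` (or `power1_input` on some kernels) and `power1_cap`, with values in microwatts as in the hwmon sysfs convention. The glob path under `/sys/class/drm` is an assumption about a typical layout and may differ on your system.

```python
from pathlib import Path


def read_watts(path):
    """Read a hwmon power file (microwatts) and return watts, or None."""
    path = Path(path)
    if not path.is_file():
        return None
    return int(path.read_text().strip()) / 1_000_000


def gpu_power(hwmon_dir):
    """Return (draw_watts, cap_watts) for one hwmon directory."""
    hwmon_dir = Path(hwmon_dir)
    draw = None
    # Assumption: the draw is in one of these two files, depending on kernel.
    for name in ("power1_average", "power1_input"):
        draw = read_watts(hwmon_dir / name)
        if draw is not None:
            break
    cap = read_watts(hwmon_dir / "power1_cap")
    return draw, cap


if __name__ == "__main__":
    # Assumed location of GPU hwmon directories; adjust if yours differs.
    for hwmon in Path("/sys/class/drm").glob("card*/device/hwmon/hwmon*"):
        draw, cap = gpu_power(hwmon)
        print(f"{hwmon}: draw={draw} W cap={cap} W")
```

Running it in a loop while the game is active would show how close the card gets to its cap just before a crash.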

(As to why swapping distro may have helped: check your game/desktop configuration and make sure it’s the same as before. If the problem is power related, a lower framerate could reduce the power usage enough for the system to be stable. A newer kernel might also have helped, as AMD has been working on the power management code.)

However, do note that this type of error has definitely had multiple causes in the past. If it’s being caused by high load games specifically (i.e. the system is stable when using web browsers and other non-high GPU load scenarios/games), then I’d start investigating the PSU a bit more because that seems different to the bug report that you linked.

Nikolai5 · 19 February 2024 18:21

I appreciate the response and will definitely investigate with that in mind.

The only thing that makes me think it may not be that is that it only seemed to affect that one game; other games and applications that are more demanding on both CPU and GPU were fine. That is what made me think it was an issue with one of the libraries that runs alongside Mesa, like the Vulkan ones.

But you raise a good point with the power.

dgdg · 19 February 2024 20:47

It just seemed worth investigating; in terms of power usage, running a fairly light game such as Counter-Strike 2 with an uncapped framerate will likely use substantially more power than a heavier game with a lower framerate (especially across the entire system, as a high framerate will hit the CPU as well as the GPU).

It seems unlikely to be a driver/kernel issue. Counter-Strike 2 is supported on Linux, and especially on the Steam Deck (which runs an Arch derivative), so I’d assume fairly good testing coverage. Therefore, it seems more likely that this is a hardware issue of some description. I mean, I can run CS2 on Manjaro without problems, and I’m on a fairly similar GPU (7900XT).

Out of interest, could you list some games that you’ve run without any issue? It might help narrow things down a little.

openminded · 14 April 2024 06:15

Since I posted here before, I think I should share what I believe is the culprit in my case. Limiting the CPU to a maximum frequency of 4.45 GHz is enough to make my system stable under heavy combined CPU and GPU load. Because the Ryzen 7840H is an APU, with CPU and GPU packaged as a single unit, I think the power goes first to the CPU to boost its frequency, and at some point the GPU starts “starving” and therefore resets. TDP settings have no effect on this, because they apply to the unit as a whole. That’s what I think is happening.
Cheers everyone, hope you’ll resolve your issues or find some kind of “solution” to them too.
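The post above does not say how the 4.45 GHz limit was applied. One common approach on Linux is the cpufreq sysfs interface, and the sketch below assumes that route: it writes the limit, converted to kHz, into every `policy*/scaling_max_freq` file under a given directory. The function name `cap_max_freq` is hypothetical, the real directory is normally `/sys/devices/system/cpu/cpufreq`, writing there needs root, and the setting does not survive a reboot.

```python
from pathlib import Path


def ghz_to_khz(ghz):
    """Convert GHz to the integer kHz value that cpufreq files expect."""
    return round(ghz * 1_000_000)


def cap_max_freq(cpufreq_dir, ghz):
    """Write the limit to each policy's scaling_max_freq; return the count."""
    khz = ghz_to_khz(ghz)
    written = 0
    for policy in sorted(Path(cpufreq_dir).glob("policy*")):
        target = policy / "scaling_max_freq"
        if target.is_file():
            target.write_text(f"{khz}\n")
            written += 1
    return written


if __name__ == "__main__":
    # Only prints the value; call cap_max_freq() yourself, as root, with
    # "/sys/devices/system/cpu/cpufreq" (assumed path) to apply it.
    print(ghz_to_khz(4.45))
```

Keeping the directory as a parameter means the function can be tried against a scratch directory before touching the real sysfs files.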

Nikolai5 · 15 April 2024 08:23

I’ve achieved a more stable system by disabling C-states in the BIOS, and my CPU is not overclocked. I have not had any crashes for a good while now.