GpuTest/FurMark causes ring timeouts, black window, GPU reset

amdgpu ring timeouts/resets with GpuTest (FurMark) on Manjaro kernel 6.18 (extra)

Summary

Running GpuTest (FurMark) on kernel 6.18 from extra causes amdgpu ring timeouts and device resets. The FurMark window appears but stays black, GPU load stays near idle, and the process is aborted (core dump). Mouse/desktop lag spikes occur during/after the resets. The same FurMark run in the “normal benchmark” path previously worked.

Environment

  • Distro: Manjaro (kernel 6.18 from extra)
  • Kernel: 6.18.x (extra)
  • GPU: AMD (amdgpu driver) — exact model not shown in logs; system hostname sharkoon
  • Mesa/AMDGPU stack: (please fill exact versions: pacman -Q mesa, glxinfo -B)
  • Display server: X11
  • GpuTest: /test=fur (FurMark), windowed
  • Note: Kernel 6.17 had an unrelated BT stack issue; this report is about GPU resets on 6.18.

Steps to Reproduce

  1. Install GpuTest (FurMark) and amdgpu stack (default Manjaro packages).

  2. Run:

    gputest /test=fur /width=1920 /height=1080 /msaa=2 /gpumon_terminal
    (also reproducible at 1280x720; happens both standalone and when invoked via a stress script).

  3. Observe the window: it opens but remains black. After a few seconds the process aborts and the kernel log shows ring timeouts and resets. GPU power stays low (~20–30 W) and the fans stay at 0 RPM (idle).
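
To attach only the kernel messages from one reproduction run, something like the following could be used. This is a minimal sketch: the function name run_and_capture_klog is hypothetical, and it assumes a systemd journal that the current user is allowed to read.

```shell
# Run a command and save the kernel messages logged while it ran.
# Usage: run_and_capture_klog OUTFILE COMMAND [ARGS...]
run_and_capture_klog() {
    local outfile="$1" start status=0
    shift
    # Timestamp in a format that journalctl --since accepts.
    start=$(date '+%Y-%m-%d %H:%M:%S')
    # Keep the exit status even if the command aborts (e.g. core dump).
    "$@" || status=$?
    # -k: kernel messages only; --since: entries from the start of the run.
    journalctl -k --since "$start" --no-pager > "$outfile"
    return "$status"
}

# Example (not executed here):
#   run_and_capture_klog furmark-klog.txt \
#       gputest /test=fur /width=1920 /height=1080 /msaa=2 /gpumon_terminal
```

The function returns the exit status of the benchmark itself, so an aborted run is still visible to a calling stress script.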

Expected Result

FurMark renders and drives the GPU to high load without amdgpu resets.

Actual Result

  • Window stays black, minimal GPU load, then amdgpu ring timeouts and device resets.
  • Process aborts (core dumped).
  • System input lag spikes (mouse stalls briefly).

Kernel Log Excerpts (journalctl -k)

  • … or 6.17 to check for regression.

  • Try without /gpumon_terminal and with lower resolution (1280x720); still seeing resets on my system.
  • Validate with current Mesa/AMDGPU stack versions.
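
The variants mentioned above (with and without /gpumon_terminal, at 1920x1080 and 1280x720) can be listed systematically. This is only a sketch: the helper name gputest_variants is made up for illustration, and it prints the command lines rather than running them.

```shell
# Print one GpuTest command line per combination of resolution and
# monitor flag. Only flags that appear in the report above are used.
gputest_variants() {
    local res width height extra
    for res in 1920x1080 1280x720; do
        width=${res%x*}    # part before the "x"
        height=${res#*x}   # part after the "x"
        for extra in "/gpumon_terminal" ""; do
            # ${extra:+ $extra} appends the flag only when it is non-empty.
            echo "gputest /test=fur /width=$width /height=$height /msaa=2${extra:+ $extra}"
        done
    done
}

# Example (not executed here): review the list, then run each line by hand.
#   gputest_variants
```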

I will switch back to 6.17 (it has a Bluetooth initialisation problem; 6.16 works fine).

Nope, not in official repo

 $ pamac search gputest
gputest  0.7.0-1                                                         AUR
    cross-platform GPU stress test and OpenGL benchmark. Contains
    FurMark, TessMark
 $ inxi -CG
CPU:
  Info: 12-core model: AMD Ryzen Threadripper PRO 5945WX s bits: 64
    type: MT MCP cache: L2: 6 MiB
  Speed (MHz): avg: 1790 min/max: 413/4101 cores: 1: 1790 2: 1790 3: 1790
    4: 1790 5: 1790 6: 1790 7: 1790 8: 1790 9: 1790 10: 1790 11: 1790 12: 1790
    13: 1790 14: 1790 15: 1790 16: 1790 17: 1790 18: 1790 19: 1790 20: 1790
    21: 1790 22: 1790 23: 1790 24: 1790
Graphics:
  Device-1: Advanced Micro Devices [AMD/ATI] Navi 31 [Radeon RX 7900 XT/7900
    XTX/7900 GRE/7900M] driver: amdgpu v: kernel
  Display: wayland server: X.org v: 1.21.1.21 with: Xwayland v: 24.1.9
    compositor: kwin_wayland driver: X: loaded: modesetting unloaded: radeon
    dri: radeonsi gpu: amdgpu resolution: 5120x1440~120Hz
  API: EGL v: 1.5 drivers: kms_swrast,radeonsi,swrast
    platforms: gbm,wayland,x11,surfaceless,device
  API: OpenGL v: 4.6 compat-v: 4.5 vendor: amd mesa v: 25.3.1-arch1.2
    renderer: AMD Radeon RX 7900 XTX (radeonsi navi31 LLVM 21.1.6 DRM 3.64
    6.18.1-1-MANJARO)
  API: Vulkan v: 1.4.335 drivers: radv surfaces: N/A
  Info: Tools: api: clinfo, eglinfo, glxinfo, vulkaninfo
    de: kscreen-console,kscreen-doctor gpu: amdgpu_top,corectrl wl: wayland-info
    x11: xdpyinfo, xprop, xrandr
That is the result I get

No issues running it on my Plasma Wayland system:

kinfo 
Operating System: Manjaro Linux 
KDE Plasma Version: 6.5.4
KDE Frameworks Version: 6.20.0
Qt Version: 6.10.1
Kernel Version: 6.18.1-1-MANJARO (64-bit)
Graphics Platform: Wayland
Processors: 16 × AMD Ryzen 7 5800H with Radeon Graphics
Memory: 32 GiB of RAM (28.3 GiB usable)
Graphics Processor: AMD Radeon Graphics

I switched back to:

[steffen@sharkoon GHULbenchmark]$ uname -a
Linux sharkoon 6.17.11-1-MANJARO #1 SMP PREEMPT_DYNAMIC Sun, 07 Dec 2025 07:13:59 +0000 x86_64 GNU/Linux

Everything worked fine again (apart from my USB BT adapter, but that’s another topic), so I ran my benchmark tests here.
I saw you released a new 6.17.11.2 kernel; I will test that one.

THX

amdgpu / kernel 6.18.x – GpuTest/FurMark instability

Description

I’ve narrowed the issue down and can now reproduce a clear difference between kernel 6.17.x and 6.18.x, specifically in the interaction between amdgpu and GpuTest/FurMark under heavy load (GHUL “Cooler Hellfire” full‑system stress).

  • Hardware:
      • CPU: AMD Ryzen 5 2600X
      • Mainboard: MSI B450M PRO‑VDH MAX (MS‑7A38, Rev. 8.0)
      • GPU: Radeon RX 9060 XT (amdgpu)
      • RAM: 2×8 GiB DDR4 @ 3200 MT/s (dual‑channel)

  • Software:
      • Distro: Manjaro (current)
      • Kernels: 6.18.x (problem), 6.17.11‑1 and 6.17.12‑1 (stable)
      • Driver: amdgpu (Manjaro default stack)
      • Xorg (standard configuration, no exotic tweaks)
      • GpuTest/FurMark from AUR (unchanged)
      • GHULbenchmark (open‑source benchmark/stress suite; just calls GpuTest via CLI)

Reproduction scenario

  1. Boot the system with kernel 6.18.x.

  2. Run GHULbenchmark with Hellfire/Cooler enabled (or run GpuTest/FurMark directly with high duration / full GPU load).

  3. During the “Cooler Hellfire” test (simultaneous heavy load on CPU, RAM, GPU, and storage), a FurMark window is started.

  4. After a short time:
      • the screen goes black,
      • Xorg freezes,
      • the kernel repeatedly tries to reset the GPU,
      • you see “device wedged, but recovered through reset”.

With kernel 6.17.11‑1 and 6.17.12‑1, the exact same GHUL runs including all Hellfire tests (especially the Cooler test with GpuTest) are fully stable – there are no amdgpu errors in the logs.

Log excerpt (kernel 6.18, journalctl -b -1, shortened)


amdgpu 0000:2b:00.0: amdgpu: Dumping IP State

amdgpu 0000:2b:00.0: amdgpu: Dumping IP State Completed

amdgpu 0000:2b:00.0: amdgpu: [drm] AMDGPU device coredump file has been created

amdgpu 0000:2b:00.0: amdgpu: [drm] Check your /sys/class/drm/card1/device/devcoredump/data



amdgpu 0000:2b:00.0: amdgpu: ring comp_1.1.1 timeout, signaled seq=72, emitted seq=73

amdgpu 0000:2b:00.0: amdgpu:  Process Xorg pid 86265 thread Xorg:cs0 pid 86268

amdgpu 0000:2b:00.0: amdgpu: Starting comp_1.1.1 ring reset

amdgpu 0000:2b:00.0: amdgpu: reset compute queue (1:1:1)

amdgpu 0000:2b:00.0: amdgpu: Ring comp_1.1.1 reset succeeded

amdgpu 0000:2b:00.0: [drm] device wedged, but recovered through reset



amdgpu 0000:2b:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:160 vmid:0 pasid:0)

amdgpu 0000:2b:00.0: amdgpu:   in page starting at address 0x0000000000000000 from client 10

amdgpu 0000:2b:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00040B40

amdgpu 0000:2b:00.0: amdgpu:          Faulty UTCL2 client ID: CPC (0x5)

amdgpu 0000:2b:00.0: amdgpu:          MORE_FAULTS: 0x0

amdgpu 0000:2b:00.0: amdgpu:          WALKER_ERROR: 0x0

amdgpu 0000:2b:00.0: amdgpu:          PERMISSION_FAULTS: 0x4

amdgpu 0000:2b:00.0: amdgpu:          MAPPING_ERROR: 0x1

amdgpu 0000:2b:00.0: amdgpu:          RW: 0x1

and shortly afterwards again, this time on the GFX ring:


amdgpu 0000:2b:00.0: amdgpu: ring gfx_0.0.0 timeout, signaled seq=8022560, emitted seq=8022561

amdgpu 0000:2b:00.0: amdgpu:  Process Xorg pid 86265 thread Xorg:cs0 pid 86268

amdgpu 0000:2b:00.0: amdgpu: Starting gfx_0.0.0 ring reset

amdgpu 0000:2b:00.0: amdgpu: Ring gfx_0.0.0 reset succeeded

amdgpu 0000:2b:00.0: [drm] device wedged, but recovered through reset
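
For longer logs, the ring timeout lines can be reduced to the ring name and the two sequence numbers. The filter below is a sketch that assumes the message format shown in the excerpts above; the function name ring_timeouts is hypothetical.

```shell
# Read kernel log text on stdin and print "RING SIGNALED EMITTED"
# for every amdgpu ring timeout line.
ring_timeouts() {
    # -n with the trailing p flag prints only lines where the substitution matched.
    sed -n -E 's/.*ring ([^ ]+) timeout, signaled seq=([0-9]+), emitted seq=([0-9]+).*/\1 \2 \3/p'
}

# Example (not executed here):
#   journalctl -b -1 --no-pager | ring_timeouts
```

In both excerpts above the emitted number is exactly one higher than the signaled number, which would mean a single submitted job had not completed when the timeout fired.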

User‑visible symptoms

  • Black screen while the GpuTest/FurMark window is (or should be) visible during Cooler Hellfire.

  • Xorg becomes unresponsive, desktop cannot be used.

  • Sometimes the system partially recovers after one of the amdgpu resets; at other times a relogin or reboot is required.

Expected behavior

  • Under heavy combined load (compute + graphics), the GPU should not constantly trigger ring timeouts or GCVM_L2 protection faults.

  • GpuTest/FurMark should behave like on kernel 6.17.x: run normally and exit cleanly after the test duration, without amdgpu coredumps or “device wedged” states.

Actual behavior

  • With kernel 6.18.x, running GpuTest/FurMark under full load reliably produces:
      • ring comp_1.1.x timeout and ring gfx_0.0.0 timeout
      • multiple “Ring reset succeeded” attempts
      • [drm] device wedged, but recovered through reset
      • GCVM_L2_PROTECTION_FAULT_STATUS + mapping/permission faults
      • AMDGPU coredumps under /sys/class/drm/card1/device/devcoredump/data

  • With kernels 6.17.11‑1 and 6.17.12‑1 the identical scenario is fully stable (same GpuTest build, same workload, same user configuration).
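
To compare a 6.17 boot with a 6.18 boot in numbers, the signatures listed above can be counted per log. This is a rough sketch: the function name amdgpu_fault_summary is hypothetical, and the fixed strings are taken from the log excerpt in this post, so other kernel versions may word them differently.

```shell
# Read kernel log text on stdin and print how many lines contain
# each amdgpu fault signature seen in the excerpt above.
amdgpu_fault_summary() {
    local log pattern count
    log=$(cat)
    for pattern in \
        'timeout, signaled seq=' \
        'reset succeeded' \
        'device wedged, but recovered through reset' \
        'GCVM_L2_PROTECTION_FAULT_STATUS'
    do
        # grep -c prints the number of matching lines; it exits non-zero
        # when that number is 0, hence the "|| true".
        count=$(printf '%s\n' "$log" | grep -c -F -- "$pattern" || true)
        printf '%s: %s\n' "$pattern" "$count"
    done
}

# Example (not executed here), once per boot to compare:
#   journalctl -k -b -1 --no-pager | amdgpu_fault_summary
```

On a stable kernel every count should be 0, which makes the difference between the two kernels easy to show in a follow-up post.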

Question for maintainers

  • Can you reproduce this on a similar RDNA card and kernel 6.18.x by running GpuTest/FurMark under full load (e.g. 60–180 s at max, ideally as part of a parallel CPU/RAM/IO stress scenario)?

  • If yes, does this look like an upstream amdgpu / drm/amd regression, or something Manjaro‑specific (patch set / config)?

If you need more data (full journalctl logs or the devcoredump blob from /sys/class/drm/...), I can provide that as well.
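
Saving the devcoredump blob for upload could look like the sketch below. The default path is the one named in the kernel log above; the function name save_devcoredump and the output file name are made up for illustration, and reading the file will probably require root.

```shell
# Copy the amdgpu devcoredump blob to a regular file.
# Usage: save_devcoredump [SOURCE] [DEST]
save_devcoredump() {
    # Default source: the path printed in the kernel log of this report.
    local src="${1:-/sys/class/drm/card1/device/devcoredump/data}"
    local dest="${2:-amdgpu-devcoredump-$(date +%Y%m%d-%H%M%S).bin}"
    if [ ! -r "$src" ]; then
        echo "no readable devcoredump at $src" >&2
        return 1
    fi
    cat "$src" > "$dest" && echo "saved $dest"
}

# Example (not executed here):
#   sudo sh -c '. ./save_devcoredump.sh && save_devcoredump'
```

The card index (card1 here) can differ between systems, so the source path is a parameter rather than hard-coded.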


Moderator edit: In the future, please use proper formatting: [HowTo] Post command output and file content as formatted text