AMD GPU system crash when playing certain games

Hi, all!

This started suddenly a few weeks ago with Red Dead Redemption 2. Reinstalling the game mostly fixed it, but now it’s happening with Borderlands 3, right after I bought it.

GPU: RX 5700XT
RDR2 Proton: Proton-experimental
BL3 Proton: Proton-6.1-GE-2

When RDR2, launched through Steam, reaches the Rockstar logo after the gunshot, it freezes and the GPU crashes. The driver attempts to recover and ends up on a screen full of artifacts. At this point, I can switch to another tty and run commands normally, but the system often crashes again about a minute later. Sometimes it doesn’t recover at all and stays on a black screen. Each time, I have to hard power off, because running the shutdown command from the tty makes my computer hang. This happened every single time until I reinstalled; now it’s few and far between, and it has happened during gameplay once.

Here is the journal output from an RDR2 crash, starting from when the game was launched:

May 15 10:37:19 homePC dbus-daemon[1413]: [session uid=1000 pid=1413] Activating via systemd: service name='com.feralinteractive.GameMode' unit='gamemoded.service' requested by ':1.76' (uid=1000 pid=3500 comm="env LD_PRELOAD=libgamemodeauto.so.0::/home/myah/.l")
May 15 10:37:19 homePC systemd[1388]: Starting gamemoded...
May 15 10:37:19 homePC dbus-daemon[1413]: [session uid=1000 pid=1413] Successfully activated service 'com.feralinteractive.GameMode'
May 15 10:37:19 homePC systemd[1388]: Started gamemoded.
May 15 10:37:19 homePC pkexec[3503]: pam_unix(polkit-1:session): session opened for user root(uid=0) by (uid=1000)
May 15 10:37:19 homePC pkexec[3503]: myah: Executing command [USER=root] [TTY=unknown] [CWD=/home/myah] [COMMAND=/usr/lib/gamemode/cpugovctl set performance]
May 15 10:37:19 homePC kwin_x11[1490]: kwin_core: Failed to focus 0x4e00088 (error 8)
May 15 10:37:19 homePC kwin_x11[1490]: kwin_core: Failed to restore focus. Activating 0x4e0002b
May 15 10:37:21 homePC gamemoded[3501]: ERROR: glob failed for RAPL paths: (No such file or directory)
May 15 10:37:21 homePC gamemoded[3501]: ERROR: Skipping ioprio on client [3500,3500]: ioprio was (0) but we expected (4)
May 15 10:37:21 homePC gamemoded[3501]: ERROR: Addition requested for already known client 3500 [/usr/bin/env].
May 15 10:37:21 homePC gamemoded[3501]:     -- This may happen due to using exec or shell wrappers. You may want to
May 15 10:37:21 homePC gamemoded[3501]:     -- blacklist this client so GameMode can see its final name here.
May 15 10:37:21 homePC gamemoded[3501]: ERROR: Addition requested for already known client 3500 [/usr/bin/env].
May 15 10:37:21 homePC gamemoded[3501]: ERROR: Removal requested for unknown process [3512].
May 15 10:37:21 homePC gamemoded[3501]:     -- The parent process probably forked and tries to unregister from the wrong
May 15 10:37:21 homePC gamemoded[3501]:     -- process now. We cannot work around this. This message will likely be paired
May 15 10:37:21 homePC gamemoded[3501]:     -- with a nearby 'Removing expired game' which means we cleaned up properly
May 15 10:37:21 homePC gamemoded[3501]:     -- (we will log this event). This hint will be displayed only once.
May 15 10:37:21 homePC gamemoded[3501]: ERROR: Skipping ioprio on client [3514,3514]: ioprio was (0) but we expected (4)
May 15 10:37:21 homePC gamemoded[3501]: ERROR: Addition requested for already known client 3500 [/usr/bin/env].
May 15 10:37:21 homePC gamemoded[3501]: ERROR: Addition requested for already known client 3500 [/usr/bin/env].
May 15 10:37:21 homePC gamemoded[3501]: ERROR: Addition requested for already known client 3500 [/usr/bin/env].
May 15 10:37:28 homePC kded5[1486]: Registering ":1.91/StatusNotifierItem" to system tray
May 15 10:37:28 homePC xembedsniproxy[1572]: Container window visible, stack below
May 15 10:37:35 homePC kded5[1486]: Registering ":1.92/StatusNotifierItem" to system tray
May 15 10:37:35 homePC kded5[1486]: Service  ":1.92" unregistered
May 15 10:38:02 homePC kwin_x11[1490]: qt.qpa.xcb: QXcbConnection: XCB error: 3 (BadWindow), sequence: 53705, resource id: 14686171, major code: 3 (GetWindowAttributes), minor code: 0
May 15 10:38:02 homePC kwin_x11[1490]: qt.qpa.xcb: QXcbConnection: XCB error: 9 (BadDrawable), sequence: 53706, resource id: 14686171, major code: 14 (GetGeometry), minor code: 0
May 15 10:38:02 homePC kwin_x11[1490]: qt.qpa.xcb: QXcbConnection: XCB error: 3 (BadWindow), sequence: 53709, resource id: 14686172, major code: 3 (GetWindowAttributes), minor code: 0
May 15 10:38:02 homePC kwin_x11[1490]: qt.qpa.xcb: QXcbConnection: XCB error: 9 (BadDrawable), sequence: 53710, resource id: 14686172, major code: 14 (GetGeometry), minor code: 0
May 15 10:39:19 homePC kernel: [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out!
May 15 10:39:19 homePC kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=38005, emitted seq=38007
May 15 10:39:19 homePC kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process CrGpuMain pid 4356 thread dxvk-submit pid 4397
May 15 10:39:19 homePC kernel: amdgpu 0000:0c:00.0: amdgpu: GPU reset begin!
May 15 10:39:23 homePC kernel: amdgpu 0000:0c:00.0: amdgpu: failed to suspend display audio
May 15 10:39:23 homePC kernel: ------------[ cut here ]------------
May 15 10:39:23 homePC kernel: WARNING: CPU: 7 PID: 3098 at drivers/gpu/drm/amd/amdgpu/../display/dc/dcn20/dcn20_resource.c:3241 dcn20_validate_bandwidth_fp+0x8d/0xd0 [amdgpu]
May 15 10:39:23 homePC kernel: Modules linked in: ccm rfcomm cmac algif_hash algif_skcipher af_alg bnep uvcvideo videobuf2_vmalloc videobuf2_memops videobuf2_v4l2 btusb videobuf2_common btrtl btbcm mousedev joydev videodev btintel squashfs iwlmvm mac80211 libarc4 vfat fat iwlwifi igb loop dca cfg80211 snd_usb_audio snd_usbmidi_lib eeepc_wmi asus_wmi snd_rawmidi sparse_keymap snd_seq_device mc usbhid video wmi_bmof mxm_wmi snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio snd_hda_codec_hdmi snd_hda_intel snd_intel_dspcfg soundwire_intel soundwire_generic_allocation soundwire_cadence edac_mce_amd snd_hda_codec kvm_amd ccp snd_hda_core rng_core snd_hwdep soundwire_bus amdgpu kvm snd_soc_core irqbypass snd_compress crct10dif_pclmul ac97_bus crc32_pclmul ghash_clmulni_intel snd_pcm_dmaengine aesni_intel snd_pcm crypto_simd cryptd snd_timer glue_helper rapl gpu_sched snd i2c_algo_bit ttm soundcore sp5100_tco pcspkr i2c_piix4 k10temp wmi gpio_amdpt mac_hid pinctrl_amd gpio_generic acpi_cpufreq uinput
May 15 10:39:23 homePC kernel:  rtbth(OE) bluetooth ecdh_generic rfkill ecc i2c_dev drm_kms_helper cec syscopyarea sysfillrect sysimgblt fb_sys_fops vboxnetflt(OE) vboxnetadp(OE) vboxdrv(OE) drm ledtrig_timer fuse crypto_user agpgart ip_tables x_tables ext4 crc32c_generic crc16 mbcache jbd2 crc32c_intel xhci_pci
May 15 10:39:23 homePC kernel: CPU: 7 PID: 3098 Comm: kworker/7:2 Tainted: G           OE     5.10.34-1-MANJARO #1
May 15 10:39:23 homePC kernel: Hardware name: System manufacturer System Product Name/ROG STRIX X470-F GAMING, BIOS 5406 11/13/2019
May 15 10:39:23 homePC kernel: Workqueue: events drm_sched_job_timedout [gpu_sched]
May 15 10:39:23 homePC kernel: RIP: 0010:dcn20_validate_bandwidth_fp+0x8d/0xd0 [amdgpu]
May 15 10:39:23 homePC kernel: Code: 00 7b 35 22 85 14 1f 00 00 75 2f 31 d2 f2 0f 11 85 58 26 00 00 48 89 ee 4c 89 e7 e8 3d f6 ff ff 89 c2 22 95 14 1f 00 00 75 30 <0f> 0b 48 89 9d 58 26 00 00 5b 5d 41 5c c3 75 c9 48 89 9d 58 26 00
May 15 10:39:23 homePC kernel: RSP: 0018:ffff9d81cacffbf8 EFLAGS: 00010246
May 15 10:39:23 homePC kernel: RAX: 0000000000000001 RBX: 4079400000000000 RCX: 00000000000062da
May 15 10:39:23 homePC kernel: RDX: 0000000000000000 RSI: 285e3a8913d48bb5 RDI: 00000000000301a0
May 15 10:39:23 homePC kernel: RBP: ffff8adda7ce0000 R08: ffff8ade91e26000 R09: ffff8ade9a2c0000
May 15 10:39:23 homePC kernel: R10: ffff8ade91e26000 R11: 0000000100000001 R12: ffff8ade9a2c0000
May 15 10:39:23 homePC kernel: R13: ffff8adea19ac800 R14: ffff8ade8b1ea800 R15: ffff8adda7ce0000
May 15 10:39:23 homePC kernel: FS:  0000000000000000(0000) GS:ffff8ae18e9c0000(0000) knlGS:0000000000000000
May 15 10:39:23 homePC kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
May 15 10:39:23 homePC kernel: CR2: 00007f4584005008 CR3: 00000002efc6e000 CR4: 00000000003506e0
May 15 10:39:23 homePC kernel: Call Trace:
May 15 10:39:23 homePC kernel:  dcn20_validate_bandwidth+0x29/0x40 [amdgpu]
May 15 10:39:23 homePC kernel:  dc_validate_global_state+0x2f2/0x390 [amdgpu]
May 15 10:39:23 homePC kernel:  ? dc_rem_all_planes_for_stream+0xcb/0x110 [amdgpu]
May 15 10:39:23 homePC kernel:  dm_suspend+0x18b/0x1c0 [amdgpu]
May 15 10:39:23 homePC kernel:  amdgpu_device_ip_suspend_phase1+0x73/0xd0 [amdgpu]
May 15 10:39:23 homePC kernel:  ? amdgpu_fence_process+0x4d/0x130 [amdgpu]
May 15 10:39:23 homePC kernel:  amdgpu_device_ip_suspend+0x1c/0x60 [amdgpu]
May 15 10:39:23 homePC kernel:  amdgpu_device_pre_asic_reset+0x185/0x19c [amdgpu]
May 15 10:39:23 homePC kernel:  amdgpu_device_gpu_recover.cold+0x5cf/0x95d [amdgpu]
May 15 10:39:23 homePC kernel:  amdgpu_job_timedout+0x121/0x140 [amdgpu]
May 15 10:39:23 homePC kernel:  drm_sched_job_timedout+0x66/0xf0 [gpu_sched]
May 15 10:39:23 homePC kernel:  process_one_work+0x1df/0x370
May 15 10:39:23 homePC kernel:  worker_thread+0x50/0x400
May 15 10:39:23 homePC kernel:  ? process_one_work+0x370/0x370
May 15 10:39:23 homePC kernel:  kthread+0x11b/0x140
May 15 10:39:23 homePC kernel:  ? __kthread_bind_mask+0x60/0x60
May 15 10:39:23 homePC kernel:  ret_from_fork+0x22/0x30
May 15 10:39:23 homePC kernel: ---[ end trace 1f1c50010c173a48 ]---
May 15 10:39:23 homePC kernel: amdgpu 0000:0c:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring kiq_2.1.0 test failed (-110)
May 15 10:39:23 homePC kernel: [drm:gfx_v10_0_hw_fini [amdgpu]] *ERROR* KGQ disable failed
May 15 10:39:23 homePC kernel: amdgpu 0000:0c:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring kiq_2.1.0 test failed (-110)
May 15 10:39:23 homePC kernel: [drm:gfx_v10_0_hw_fini [amdgpu]] *ERROR* KCQ disable failed
May 15 10:39:23 homePC kernel: [drm:gfx_v10_0_hw_fini [amdgpu]] *ERROR* failed to halt cp gfx
May 15 10:39:23 homePC kernel: [drm] free PSP TMR buffer
May 15 10:39:24 homePC kernel: amdgpu 0000:0c:00.0: amdgpu: BACO reset
May 15 10:39:27 homePC kernel: amdgpu 0000:0c:00.0: amdgpu: GPU reset succeeded, trying to resume
May 15 10:39:27 homePC kernel: [drm] PCIE GART of 512M enabled (table at 0x0000008000300000).
May 15 10:39:27 homePC kernel: [drm] VRAM is lost due to GPU reset!
May 15 10:39:27 homePC kernel: [drm] PSP is resuming...
May 15 10:39:27 homePC kernel: [drm] reserve 0x900000 from 0x81fe400000 for PSP TMR
May 15 10:39:27 homePC kernel: amdgpu 0000:0c:00.0: amdgpu: RAS: optional ras ta ucode is not available
May 15 10:39:27 homePC kernel: amdgpu 0000:0c:00.0: amdgpu: RAP: optional rap ta ucode is not available
May 15 10:39:27 homePC kernel: amdgpu 0000:0c:00.0: amdgpu: SMU is resuming...
May 15 10:39:27 homePC kernel: amdgpu 0000:0c:00.0: amdgpu: smu driver if version = 0x00000036, smu fw if version = 0x00000037, smu fw version = 0x002a3f00 (42.63.0)
May 15 10:39:27 homePC kernel: amdgpu 0000:0c:00.0: amdgpu: SMU driver if version not matched
May 15 10:39:27 homePC kernel: amdgpu 0000:0c:00.0: amdgpu: SMU is resumed successfully!
May 15 10:39:28 homePC kernel: [drm] kiq ring mec 2 pipe 1 q 0
May 15 10:39:28 homePC kernel: [drm] VCN decode and encode initialized successfully(under DPG Mode).
May 15 10:39:28 homePC kernel: [drm] JPEG decode initialized successfully.
May 15 10:39:28 homePC kernel: amdgpu 0000:0c:00.0: amdgpu: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
May 15 10:39:28 homePC kernel: amdgpu 0000:0c:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
May 15 10:39:28 homePC kernel: amdgpu 0000:0c:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
May 15 10:39:28 homePC kernel: amdgpu 0000:0c:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 5 on hub 0
May 15 10:39:28 homePC kernel: amdgpu 0000:0c:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 6 on hub 0
May 15 10:39:28 homePC kernel: amdgpu 0000:0c:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 7 on hub 0
May 15 10:39:28 homePC kernel: amdgpu 0000:0c:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 8 on hub 0
May 15 10:39:28 homePC kernel: amdgpu 0000:0c:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 9 on hub 0
May 15 10:39:28 homePC kernel: amdgpu 0000:0c:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 10 on hub 0
May 15 10:39:28 homePC kernel: amdgpu 0000:0c:00.0: amdgpu: ring kiq_2.1.0 uses VM inv eng 11 on hub 0
May 15 10:39:28 homePC kernel: amdgpu 0000:0c:00.0: amdgpu: ring sdma0 uses VM inv eng 12 on hub 0
May 15 10:39:28 homePC kernel: amdgpu 0000:0c:00.0: amdgpu: ring sdma1 uses VM inv eng 13 on hub 0
May 15 10:39:28 homePC kernel: amdgpu 0000:0c:00.0: amdgpu: ring vcn_dec uses VM inv eng 0 on hub 1
May 15 10:39:28 homePC kernel: amdgpu 0000:0c:00.0: amdgpu: ring vcn_enc0 uses VM inv eng 1 on hub 1
May 15 10:39:28 homePC kernel: amdgpu 0000:0c:00.0: amdgpu: ring vcn_enc1 uses VM inv eng 4 on hub 1
May 15 10:39:28 homePC kernel: amdgpu 0000:0c:00.0: amdgpu: ring jpeg_dec uses VM inv eng 5 on hub 1
May 15 10:39:28 homePC kernel: amdgpu 0000:0c:00.0: amdgpu: recover vram bo from shadow start
May 15 10:39:28 homePC kernel: amdgpu 0000:0c:00.0: amdgpu: recover vram bo from shadow done
May 15 10:39:28 homePC kernel: [drm] Skip scheduling IBs!
May 15 10:39:28 homePC kernel: [drm] Skip scheduling IBs!
May 15 10:39:28 homePC kernel: amdgpu 0000:0c:00.0: amdgpu: GPU reset(2) succeeded!
May 15 10:39:28 homePC kernel: [drm] Skip scheduling IBs!
May 15 10:39:28 homePC kernel: [drm] Skip scheduling IBs!
May 15 10:39:28 homePC kernel: [drm] Skip scheduling IBs!
May 15 10:39:28 homePC kernel: [drm] Skip scheduling IBs!
May 15 10:39:28 homePC kernel: [drm] Skip scheduling IBs!
May 15 10:39:28 homePC kernel: [drm] Skip scheduling IBs!
May 15 10:39:28 homePC kernel: [drm] Skip scheduling IBs!
May 15 10:39:28 homePC kernel: [drm] Skip scheduling IBs!
May 15 10:39:28 homePC kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
May 15 10:39:28 homePC kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
May 15 10:39:28 homePC kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
May 15 10:39:28 homePC kwin_x11[1490]: file:///usr/share/kwin/aurorae/MenuButton.qml:11: TypeError: Cannot read property 'closeOnDoubleClickOnMenu' of null
May 15 10:40:11 homePC kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, but soft recovered
May 15 10:40:21 homePC kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, but soft recovered
May 15 10:40:31 homePC kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, but soft recovered
May 15 10:40:42 homePC kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, but soft recovered
May 15 10:40:52 homePC kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, but soft recovered
May 15 10:41:02 homePC kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, but soft recovered
May 15 10:41:12 homePC kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, but soft recovered
May 15 10:41:23 homePC kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, but soft recovered
May 15 10:41:33 homePC kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, but soft recovered

At the end, it recovered to the glitched screen, but didn’t crash to a black screen like usual. As you can see, I waited about a minute before I powered off.
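For anyone collecting the same information, here is a minimal sketch of pulling only the GPU-related kernel lines out of the journal, like the amdgpu and [drm] entries in the log above. The helper name filter_gpu_lines is made up for illustration, and reading the previous boot with -b -1 assumes the journal is stored persistently.

```shell
# Keep only lines that mention amdgpu or carry a [drm tag; reads journal text on stdin.
# filter_gpu_lines is a hypothetical helper name, not an existing tool.
filter_gpu_lines() {
    grep -E 'amdgpu|\[drm'
}

# Assumed usage after a hard power-off: kernel messages (-k) from the previous boot (-b -1).
#   journalctl -k -b -1 --no-pager | filter_gpu_lines
```

This drops the gamemoded, kwin and kded5 noise, which makes the fence timeout and GPU reset sequence easier to spot.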

What I’ve tried so far is adding the kernel parameters iommu=pt, amd_iommu=off and iommu=soft (one at a time); none did the trick. I’m running kernel 5.10, and I also tried 5.11 and 5.4 with no luck.
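Since these parameters were tried one at a time, it can help to confirm which one the running kernel actually booted with. A minimal sketch, assuming a Linux system where /proc/cmdline is readable; has_kparam is a hypothetical helper name, not an existing tool.

```shell
# Succeed if the exact parameter appears on the kernel command line.
# The optional second argument is a file to read instead of /proc/cmdline,
# which is only there so the check can be tried on a saved copy.
has_kparam() {
    param=$1
    file=${2:-/proc/cmdline}
    # One parameter per line, then an exact, literal, whole-line match.
    tr ' ' '\n' < "$file" | grep -qxF -- "$param"
}

if has_kparam iommu=pt; then
    echo "iommu=pt is active"
else
    echo "iommu=pt is NOT active"
fi
```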

I haven’t tried anything since Borderlands 3 started doing it. I had already pored over tons of forums trying to find a solution for RDR2, so I came straight here. BL3 gets to Claptrap going across the screen for 30 seconds or so, then artifacts and slows down, and eventually freezes on a screen of artifacts and goes black in the same way as RDR2. I have to hard power off in this case, too.

Thanks in advance!

Edit: I also wanted to add the Graphics output of inxi -Fxxxrz:

Graphics:  Device-1: Advanced Micro Devices [AMD/ATI] Navi 10 [Radeon RX 5600 OEM/5600 XT / 5700/5700 XT] 
vendor: Sapphire Limited driver: amdgpu v: kernel bus-ID: 0c:00.0 chip-ID: 1002:731f class-ID: 0300 
Device-2: Microdia Webcam Vitade AF type: USB driver: snd-usb-audio,uvcvideo bus-ID: 3-2:3 
chip-ID: 0c45:6366 class-ID: 0102 serial: <filter> 
Display: x11 server: X.Org 1.20.11 compositor: kwin_x11 driver: loaded: amdgpu,ati 
unloaded: modesetting,radeon alternate: fbdev,vesa resolution: 2560x1440 s-dpi: 96 
OpenGL: renderer: AMD Radeon RX 5700 XT (NAVI10 DRM 3.40.0 5.10.36-2-MANJARO LLVM 11.1.0) 
v: 4.6 Mesa 21.0.3 direct render: Yes

I set my display to 60Hz (from 144) and set DPM manually, neither of which worked on Borderlands 3.

The journal log for BL3 after the crash is the same as RDR2, as far as I can tell. However, this:

May 30 09:28:28 homePC rtkit-daemon[1648]: Warning: Reached maximum concurrent threads limit for user '1000', denying request.
May 30 09:28:28 homePC rtkit-daemon[1648]: Failed to look up client: Device or resource busy

is spammed between other error messages, and the entire end of the journal is filled with these messages.

I can provide the entire log as a file, if needed. It’s too long to put here.

Second search engine hit:
https://www.gitmemory.com/issue/ValveSoftware/Proton/4662/801557425

This doesn’t fix the issue with the amdgpu errors (the origin of the crash), but thank you.

I got help on this Reddit thread and followed a commenter’s advice to create this Mesa issue.

Try adding the following as kernel parameters:

amdgpu.gpu_recovery=1 amdgpu.lockup_timeout=3000
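One way to check that the driver picked these values up after a reboot is to read them back from sysfs. A minimal sketch, assuming the amdgpu module is loaded and exposes its parameters under /sys/module/amdgpu/parameters; show_amdgpu_params is a hypothetical helper name.

```shell
# Print the current values of the two amdgpu module parameters named above.
# The optional argument overrides the sysfs directory (assumed default below).
show_amdgpu_params() {
    dir=${1:-/sys/module/amdgpu/parameters}
    for p in gpu_recovery lockup_timeout; do
        if [ -r "$dir/$p" ]; then
            printf '%s=%s\n' "$p" "$(cat "$dir/$p")"
        else
            printf '%s=<not readable>\n' "$p"
        fi
    done
}

show_amdgpu_params
```

If the values printed here differ from what was put on the kernel command line, the parameters did not reach the driver for this boot.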

This stopped it from crashing my system, and I was able to end the game by hitting “stop” in Steam. The crash did indeed still happen, but I assume preventing it wasn’t the purpose of this. Thank you very much!

Update: I just tried again in order to make another debug log and it’s crashing my system again with no recovery. I made these parameters persistent through /etc/default/grub so I’m unsure why it was different for that one time.
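When parameters set in /etc/default/grub seem to have no effect, one thing worth ruling out is that the GRUB configuration was not regenerated, so the running kernel never received them. A rough sketch of comparing the two, assuming the parameters sit on the GRUB_CMDLINE_LINUX_DEFAULT line in double quotes; grub_default_params is a hypothetical helper name.

```shell
# Print the value of GRUB_CMDLINE_LINUX_DEFAULT from a GRUB defaults file.
# The optional argument is the file to read (assumed default: /etc/default/grub).
grub_default_params() {
    file=${1:-/etc/default/grub}
    sed -n 's/^GRUB_CMDLINE_LINUX_DEFAULT="\(.*\)"$/\1/p' "$file"
}

# Assumed usage: compare what is configured with what the kernel booted with.
#   grub_default_params
#   cat /proc/cmdline
# If they differ, regenerate the config and reboot. On Manjaro this is
# usually done with: sudo update-grub
# (elsewhere, typically: sudo grub-mkconfig -o /boot/grub/grub.cfg)
```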

My system was defaulting to AMDVLK for some reason. Switching to Mesa worked.
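To see which Vulkan drivers are installed side by side, the loader's ICD manifests can be listed. A minimal sketch, assuming the manifests live in /usr/share/vulkan/icd.d (the usual location on Arch-based systems; other distributions may differ); list_vulkan_icds is a hypothetical helper name. On such systems AMDVLK typically installs amd_icd64.json and Mesa's RADV installs radeon_icd.x86_64.json.

```shell
# List the Vulkan ICD manifest files in a directory, one per line.
# The optional argument overrides the directory (assumed default below).
list_vulkan_icds() {
    dir=${1:-/usr/share/vulkan/icd.d}
    for f in "$dir"/*.json; do
        if [ -e "$f" ]; then
            basename "$f"
        fi
    done
    return 0
}

list_vulkan_icds

# To force Mesa for a single game without removing AMDVLK, a Steam launch
# option along these lines may work (the path is an assumption, check the
# listing above first):
#   VK_ICD_FILENAMES=/usr/share/vulkan/icd.d/radeon_icd.x86_64.json %command%
```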
