An AMD RX 6700 XT crashes continuously

Hi all,
I have an all-AMD machine with an MSI motherboard and a Sapphire Nitro (Radeon RX 6700 XT) graphics card.

I can only work for about 15 minutes before the screen goes black.

In dmesg I see

[   63.049915] systemd-journald[423]: /var/log/journal/20b860dfa515404eade47678bbfd1b08/user-1000.journal: Journal file uses a different sequence number ID, rotating.
[  583.747952] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=4934, emitted seq=4936
[  583.748377] amdgpu 0000:2f:00.0: amdgpu: GPU reset begin!
[  583.939798] amdgpu 0000:2f:00.0: amdgpu: MODE1 reset
[  583.939802] amdgpu 0000:2f:00.0: amdgpu: GPU mode1 reset
[  583.939873] amdgpu 0000:2f:00.0: amdgpu: GPU smu mode1 reset
[  595.467444] amdgpu 0000:2f:00.0: amdgpu: GPU reset succeeded, trying to resume
[  595.467743] [drm] PCIE GART of 512M enabled (table at 0x00000082FEB00000).
[  595.467808] [drm] VRAM is lost due to GPU reset!
[  595.467810] amdgpu 0000:2f:00.0: amdgpu: PSP is resuming...
[  603.042342] [drm:psp_v11_0_memory_training [amdgpu]] *ERROR* send training msg failed.
[  603.042513] amdgpu 0000:2f:00.0: amdgpu: Failed to process memory training!
[  603.042515] [drm:amdgpu_device_fw_loading [amdgpu]] *ERROR* resume of IP block <psp> failed -62
[  603.042644] amdgpu 0000:2f:00.0: amdgpu: GPU reset(1) failed
[  603.158851] snd_hda_intel 0000:2f:00.1: CORB reset timeout#2, CORBRP = 65535
[  603.160665] amdgpu 0000:2f:00.0: amdgpu: GPU reset end with ret = -62
[  603.160668] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* GPU Recovery Failed: -62 
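
To pull just these lines out of a longer log, a small filter along the following lines may help. This is only a sketch: the function names `amdgpu_reset_lines` and `amdgpu_timeouts_by_ring` are made up for illustration, and the pattern list is an assumption based on the messages quoted above, not a complete list of amdgpu errors.

```shell
# Keep only amdgpu timeout/reset lines from a kernel log read on stdin.
# The patterns are assumptions taken from the log excerpt above.
amdgpu_reset_lines() {
    grep -E 'amdgpu_job_timedout|GPU reset|GPU Recovery Failed|VRAM is lost|memory training'
}

# Count ring timeouts per ring name (e.g. "sdma0") from a log on stdin.
amdgpu_timeouts_by_ring() {
    sed -n 's/.*\*ERROR\* ring \([^ ]*\) timeout.*/\1/p' | sort | uniq -c
}
```

After a crash and reboot, the previous boot's kernel log can be fed in with `journalctl -k -b -1 --no-pager | amdgpu_reset_lines`. Knowing whether it is always the same ring that times out can narrow things down.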

What can I do, try, or look for?

The system information is:

    ~  LANG=C inxi -Fazi                                                                                                                                 ✔ 
  Kernel: 6.9.2-1-MANJARO arch: x86_64 bits: 64 compiler: gcc v: 14.1.1
    clocksource: tsc avail: hpet,acpi_pm
    parameters: BOOT_IMAGE=/boot/vmlinuz-6.9-x86_64
    root=UUID=cab9616a-eb09-42c4-89dc-6480898b9f00 rw quiet splash
  Desktop: KDE Plasma v: 6.0.5 tk: Qt v: N/A info: frameworks v: 6.2.0
    wm: kwin_x11 vt: 2 dm: SDDM Distro: Manjaro base: Arch Linux
  Type: Desktop Mobo: Micro-Star model: MEG X570 ACE (MS-7C35) v: 1.0
    serial: <superuser required> uuid: <superuser required> UEFI: American
    Megatrends LLC. v: 1.N0 date: 10/23/2023
  Info: model: AMD Ryzen 7 5800X bits: 64 type: MT MCP arch: Zen 3+ gen: 4
    level: v3 note: check built: 2022 process: TSMC n6 (7nm) family: 0x19 (25)
    model-id: 0x21 (33) stepping: 0 microcode: 0xA20102B
  Topology: cpus: 1x cores: 8 tpc: 2 threads: 16 smt: enabled cache:
    L1: 512 KiB desc: d-8x32 KiB; i-8x32 KiB L2: 4 MiB desc: 8x512 KiB
    L3: 32 MiB desc: 1x32 MiB
  Speed (MHz): avg: 2948 high: 3800 min/max: 2200/4850 boost: enabled
    scaling: driver: acpi-cpufreq governor: schedutil cores: 1: 3374 2: 3800
    3: 2200 4: 3800 5: 2200 6: 2200 7: 3800 8: 3600 9: 2200 10: 3598 11: 2200
    12: 2200 13: 3800 14: 2200 15: 2200 16: 3800 bogomips: 121653
  Flags: avx avx2 ht lm nx pae sse sse2 sse3 sse4_1 sse4_2 sse4a ssse3 svm
  Type: gather_data_sampling status: Not affected
  Type: itlb_multihit status: Not affected
  Type: l1tf status: Not affected
  Type: mds status: Not affected
  Type: meltdown status: Not affected
  Type: mmio_stale_data status: Not affected
  Type: reg_file_data_sampling status: Not affected
  Type: retbleed status: Not affected
  Type: spec_rstack_overflow mitigation: Safe RET
  Type: spec_store_bypass mitigation: Speculative Store Bypass disabled via
  Type: spectre_v1 mitigation: usercopy/swapgs barriers and __user pointer
  Type: spectre_v2 mitigation: Retpolines; IBPB: conditional; IBRS_FW;
    STIBP: always-on; RSB filling; PBRSB-eIBRS: Not affected; BHI: Not
  Type: srbds status: Not affected
  Type: tsx_async_abort status: Not affected
  Device-1: AMD Navi 22 [Radeon RX 6700/6700 XT/6750 XT / 6800M/6850M XT]
    vendor: Sapphire driver: amdgpu v: kernel arch: RDNA-2 code: Navi-2x
    process: TSMC n7 (7nm) built: 2020-22 pcie: gen: 4 speed: 16 GT/s
    lanes: 16 ports: active: HDMI-A-1 empty: DP-1, DP-2, DP-3, Writeback-1
    bus-ID: 2f:00.0 chip-ID: 1002:73df class-ID: 0300
  Display: x11 server: X.Org v: 21.1.13 with: Xwayland v: 24.1.0
    compositor: kwin_x11 driver: X: loaded: amdgpu unloaded: modesetting,radeon
    alternate: fbdev,vesa dri: radeonsi gpu: amdgpu display-ID: :0 screens: 1
  Screen-1: 0 s-res: 1920x1080 s-dpi: 96 s-size: 508x285mm (20.00x11.22")
    s-diag: 582mm (22.93")
  Monitor-1: HDMI-A-1 mapped: HDMI-A-0 model: Sony TV serial: <filter>
    built: 2014 res: 1920x1080 hz: 60 dpi: 52 gamma: 1.2
    size: 930x523mm (36.61x20.59") diag: 1067mm (42") ratio: 16:9 modes:
    max: 1920x1080 min: 640x480
  API: EGL v: 1.5 hw: drv: amd radeonsi platforms: device: 0 drv: radeonsi
    device: 1 drv: swrast surfaceless: drv: radeonsi x11: drv: radeonsi
    inactive: gbm,wayland
  API: OpenGL v: 4.6 compat-v: 4.5 vendor: amd mesa v: 24.0.8-manjaro1.1
    glx-v: 1.4 direct-render: yes renderer: AMD Radeon RX 6700 XT (radeonsi
    navi22 LLVM 17.0.6 DRM 3.57 6.9.2-1-MANJARO) device-ID: 1002:73df
    memory: 11.72 GiB unified: no
  API: Vulkan v: 1.3.279 layers: N/A device: 0 type: discrete-gpu name: AMD
    Radeon RX 6700 XT (RADV NAVI22) driver: mesa radv v: 24.0.8-manjaro1.1
    device-ID: 1002:73df surfaces: xcb,xlib
  Device-1: AMD Navi 21/23 HDMI/DP Audio driver: snd_hda_intel v: kernel pcie:
    gen: 4 speed: 16 GT/s lanes: 16 bus-ID: 2f:00.1 chip-ID: 1002:ab28
    class-ID: 0403
  Device-2: AMD Starship/Matisse HD Audio vendor: Micro-Star MSI
    driver: snd_hda_intel v: kernel pcie: gen: 4 speed: 16 GT/s lanes: 16
    bus-ID: 31:00.4 chip-ID: 1022:1487 class-ID: 0403
  API: ALSA v: k6.9.2-1-MANJARO status: kernel-api with: aoss
    type: oss-emulator tools: alsactl,alsamixer,amixer
  Server-1: sndiod v: N/A status: off tools: aucat,midicat,sndioctl
  Server-2: JACK v: 1.9.22 status: off tools: N/A
  Server-3: PipeWire v: 1.0.7 status: active with: 1: pipewire-pulse
    status: active 2: wireplumber status: active 3: pipewire-alsa type: plugin
    tools: pactl,pw-cat,pw-cli,wpctl
  Device-1: Intel I211 Gigabit Network vendor: Micro-Star MSI driver: igb
    v: kernel pcie: gen: 1 speed: 2.5 GT/s lanes: 1 port: e000 bus-ID: 26:00.0
    chip-ID: 8086:1539 class-ID: 0200
  IF: enp38s0 state: up speed: 1000 Mbps duplex: full mac: <filter>
  IP v4: <filter> scope: global broadcast: <filter>
  IP v6: <filter> type: noprefixroute scope: link
  Device-2: Realtek RTL8125 2.5GbE vendor: Micro-Star MSI driver: r8169
    v: kernel pcie: gen: 2 speed: 5 GT/s lanes: 1 port: d000 bus-ID: 27:00.0
    chip-ID: 10ec:8125 class-ID: 0200
  IF: enp39s0 state: up speed: 2500 Mbps duplex: full mac: <filter>
  IF-ID-1: br0 state: up speed: 2500 Mbps duplex: unknown mac: <filter>
  IP v4: <filter> scope: global broadcast: <filter>
  Info: services: NetworkManager, nfsd, nginx, sshd, systemd-timesyncd
  WAN IP: <filter>
  Local Storage: total: 22.74 TiB used: 11.72 TiB (51.5%)
  SMART Message: Unable to run smartctl. Root privileges required.
  ID-1: /dev/nvme0n1 maj-min: 259:0 vendor: Sabrent
    model: Rocket 4 Plus Gaming size: 931.51 GiB block-size: physical: 512 B
    logical: 512 B speed: 63.2 Gb/s lanes: 4 tech: SSD serial: <filter>
    fw-rev: R4P47G.1 temp: 41.9 C scheme: GPT
  ID-2: /dev/sda maj-min: 8:0 vendor: Seagate model: ST8000NE001-2M7101
    size: 7.28 TiB block-size: physical: 4096 B logical: 512 B speed: 6.0 Gb/s
    tech: HDD rpm: 7200 serial: <filter> fw-rev: EN01 scheme: GPT
  ID-3: /dev/sdb maj-min: 8:16 vendor: Seagate model: ST8000VN004-2M2101
    size: 7.28 TiB block-size: physical: 4096 B logical: 512 B speed: 6.0 Gb/s
    tech: HDD rpm: 7200 serial: <filter> fw-rev: SC60 scheme: GPT
  ID-4: /dev/sdc maj-min: 8:32 vendor: Seagate model: ST8000VN004-2M2101
    size: 7.28 TiB block-size: physical: 4096 B logical: 512 B speed: 6.0 Gb/s
    tech: HDD rpm: 7200 serial: <filter> fw-rev: SC60 scheme: GPT
  ID-1: / raw-size: 931.22 GiB size: 915.53 GiB (98.32%)
    used: 29.14 GiB (3.2%) fs: ext4 dev: /dev/nvme0n1p2 maj-min: 259:2
  ID-2: /boot/efi raw-size: 300 MiB size: 299.4 MiB (99.80%)
    used: 296 KiB (0.1%) fs: vfat dev: /dev/nvme0n1p1 maj-min: 259:1
  Alert: No swap data was found.
  System Temperatures: cpu: 31.0 C mobo: 29.0 C gpu: amdgpu temp: 39.0 C
    mem: 32.0 C
  Fan Speeds (rpm): fan-1: 0 fan-2: 1037 fan-3: 971 fan-4: 628 fan-5: 644
    fan-6: 645 fan-7: 0 gpu: amdgpu fan: 0
  Memory: total: 128 GiB note: est. available: 125.72 GiB
    used: 3.35 GiB (2.7%)
  Processes: 342 Power: uptime: 9m states: freeze,mem,disk suspend: deep
    avail: s2idle wakeups: 0 hibernate: platform avail: shutdown, reboot,
    suspend, test_resume image: 50.27 GiB services: org_kde_powerdevil,
    power-profiles-daemon, upowerd Init: systemd v: 255 default: graphical
    tool: systemctl
  Packages: pm: pacman pkgs: 1417 libs: 375 tools: pamac pm: flatpak pkgs: 0
    Compilers: clang: 17.0.6 gcc: 14.1.1 alt: 13 Shell: Zsh v: 5.9 default: Bash
    v: 5.2.26 running-in: konsole inxi: 3.3.34
    ~                        

If it happens after some time, I suggest checking the temperature first.
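
One way to read the GPU sensors without extra tools is through the kernel's hwmon interface in sysfs. The sketch below assumes the usual layout (`/sys/class/hwmon/hwmonN/name`, `tempN_input` in millidegrees Celsius, optional `tempN_label`); `hwmon_temps` is a hypothetical helper name, and which sensors exist depends on the card.

```shell
# Print "<label> <degrees C>" for every temperature sensor of the hwmon
# device whose "name" file equals $1 (e.g. "amdgpu").
# $2 is the hwmon root and defaults to the real sysfs path.
hwmon_temps() {
    want=$1
    root=${2:-/sys/class/hwmon}
    for dev in "$root"/hwmon*; do
        [ -r "$dev/name" ] || continue
        [ "$(cat "$dev/name")" = "$want" ] || continue
        for input in "$dev"/temp*_input; do
            [ -r "$input" ] || continue
            label_file=${input%_input}_label
            if [ -r "$label_file" ]; then
                label=$(cat "$label_file")
            else
                label=$(basename "${input%_input}")
            fi
            # hwmon reports millidegrees Celsius
            echo "$label $(( $(cat "$input") / 1000 ))"
        done
    done
}
```

Running `hwmon_temps amdgpu` in a loop and appending the output to a file means the last reading before a crash is still there after the reboot. If the card exposes a junction or memory sensor, those are worth watching alongside the edge temperature.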

Also check the physical connections: shut the computer down, unplug the power, take the GPU out and reinsert it, and make sure all screws are tightened so the GPU is held firmly in place.
Check your motherboard manual for the BIOS settings, especially those for the PCIe lanes of the slot in use.
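
From the running system, the negotiated PCIe link can be compared with what the card supports. A minimal sketch, assuming the GPU sits at bus ID `0000:2f:00.0` (taken from the log above) and that the kernel exposes the standard `current_link_*` / `max_link_*` attributes; `pcie_link_status` is a hypothetical helper name.

```shell
# Print current vs. maximum PCIe link speed and width for one device.
# $1 is the sysfs device directory; the default bus ID is an assumption
# taken from the dmesg output in this thread.
pcie_link_status() {
    dev=${1:-/sys/bus/pci/devices/0000:2f:00.0}
    for attr in current_link_speed max_link_speed current_link_width max_link_width; do
        [ -r "$dev/$attr" ] && echo "$attr: $(cat "$dev/$attr")"
    done
    return 0
}
```

A width below the maximum would point at seating or slot configuration. A lower current speed on an idle desktop is not necessarily a fault, since the link speed may be reduced to save power, so it is best to check under load.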

Another possible cause of this type of reset error is insufficient power to the GPU. What PSU are you using?

The PSU is a 1000 W BeQuiet Power 12.

Tested with a power meter at 100% CPU load while playing a 4K video in full screen; the meter never exceeded 220 W.

Sorry, it is difficult to reply while gathering information because the machine crashes.

Checked multiple times, including removing and reinstalling the GPU.
The motherboard is an MSI MEG X570 ACE. Do you know what I can check in the BIOS?

Sorry, I had to wait some time to gather the temperatures.

Under heavy load the CPU never exceeded 80 °C and the GPU never exceeded 60 °C.

Update your board firmware - MEG X570 ACE | Motherboard | MSI Global

It may be possible to update your firmware using fwupd

sudo pacman -Syu fwupd
fwupdmgr refresh
fwupdmgr get-updates
fwupdmgr update

Another thought: an accumulated shader cache under ~/.cache may be causing the crash.
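
To see what is there before deleting anything, something like the following sketch could be used. It assumes Mesa keeps its shader cache in `mesa_shader_cache` and/or `mesa_shader_cache_db` under the cache root; those directory names are an assumption and may differ between Mesa versions. `mesa_cache_dirs` is a hypothetical helper name.

```shell
# List the Mesa shader cache directories that exist under a cache root.
# $1 defaults to $XDG_CACHE_HOME or ~/.cache; the directory names are
# assumptions and may vary with the Mesa version.
mesa_cache_dirs() {
    root=${1:-${XDG_CACHE_HOME:-$HOME/.cache}}
    for d in "$root"/mesa_shader_cache "$root"/mesa_shader_cache_db; do
        [ -d "$d" ] && echo "$d"
    done
    return 0
}
```

The sizes can then be checked with `for d in $(mesa_cache_dirs); do du -sh "$d"; done`. Removing those directories while no graphical application is running should be harmless, since the cache is rebuilt on demand, at the cost of some shader recompilation the next time applications start.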

I’ll try ASAP

Tried, but it seems there is nothing to update:

    ~  LANG=C sudo fwupdmgr refresh                                                                                                                      ✔ 
Metadata is up to date; use --force to refresh again.
    ~  LANG=C sudo fwupdmgr get-updates                                                                                                                2 ✘ 
Devices with no available firmware updates: 
 ? ST8000NE001-2M7101
 ? ST8000VN004-2M2101
 ? ST8000VN004-2M2101
 ? Sabrent Rocket 4 Plus Gaming
 ? UEFI Device Firmware
 ? UEFI Device Firmware
Devices with the latest available firmware version:
 ? UEFI dbx
No updates available
    ~  LANG=C sudo fwupdmgr update                                                                                                                     2 ✘ 
Devices with no available firmware updates: 
 ? ST8000NE001-2M7101
 ? ST8000VN004-2M2101
 ? ST8000VN004-2M2101
 ? Sabrent Rocket 4 Plus Gaming
 ? UEFI Device Firmware
 ? UEFI Device Firmware
Devices with the latest available firmware version:
 ? UEFI dbx
    ~             

I’m currently on BIOS version 7C35v1N.
There is a newer version, 7C35v1O.

Should I upgrade the BIOS?

It is impossible to say. The reason for suggesting this is that AMD provides their drivers directly to the upstream kernel repositories.

When AMD-related changes propagate to end-user systems, those updates may assume up-to-date firmware.

Assuming your system has been working, the reason for your problems may be related to now-incompatible firmware.

But that is entirely an observation; I cannot possibly state it as fact.


Have you tried syncing the LTS kernel linux66 to your system and booting it?
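
For reference, Manjaro names its kernel packages after the major and minor version, so 6.6 is `linux66` and the 6.9 kernel from the inxi output is `linux69`. The sketch below only illustrates that naming; `manjaro_kernel_pkg` is a hypothetical helper, not a Manjaro tool.

```shell
# Map a kernel release string (as printed by `uname -r`) to the Manjaro
# package name, e.g. "6.9.2-1-MANJARO" becomes "linux69".
manjaro_kernel_pkg() {
    release=${1:-$(uname -r)}
    echo "$release" | sed -E 's/^([0-9]+)\.([0-9]+).*/linux\1\2/'
}
```

The LTS kernel itself can be installed alongside the current one with `sudo mhwd-kernel -i linux66` and then selected from the boot menu; `mhwd-kernel -li` lists the kernels that are installed.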

I’ll upgrade the BIOS now and report back afterwards.

Hi @linux-aarhus
I updated the BIOS and refreshed the firmware.

Now I’ll test the machine for some time.

If there are problems, I’ll try the 6.6 kernel and report the results here.

OK, definitely not that then. :rofl:

Have a look at this bug report and the Gentoo wiki article that is linked there. It seems that some AMD cards may be getting boosted beyond their specified limits and/or have incorrect power limits. The LACT application mentioned there is available on the AUR.
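
Before installing anything, the power limits the driver currently applies can be read from sysfs. This is a sketch under the assumption that the card's hwmon directory exposes `power1_cap`, `power1_cap_default` and `power1_cap_max` in microwatts; `amdgpu_power_caps` is a hypothetical helper name.

```shell
# Print the amdgpu power cap values in watts.
# $1 is the hwmon root and defaults to the real sysfs path.
amdgpu_power_caps() {
    root=${1:-/sys/class/hwmon}
    for dev in "$root"/hwmon*; do
        [ "$(cat "$dev/name" 2>/dev/null)" = "amdgpu" ] || continue
        for f in power1_cap power1_cap_default power1_cap_max; do
            [ -r "$dev/$f" ] || continue
            # sysfs reports microwatts
            echo "$f $(( $(cat "$dev/$f") / 1000000 )) W"
        done
    done
}
```

If `power1_cap` differs from `power1_cap_default`, or either looks implausible for this card, that would fit the situation described in the bug report.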

I’ll try asap

Just thought I’d ask if you have used another distro or OS?

The Linux kernel is the Linux kernel - that is not changing anything - unless you count the kernel version in - whichever kernel that OS or distro may be using on the ISO.

I spoke too soon: it started to crash again.

(Upgraded the BIOS to 7C35v1O, released on 2024-04-02.

So far it has been an entire day without a crash.

If there are no more crashes in the next 2 days I’ll close the issue.)

Thanks to all

I see a lot of things in .cache.
Is there a way to keep the cache clean?

Here is the cache list:

    ~                                                                        

In the BIOS, the lanes_configuration option is set to “Auto”. Should I change it?