Overheating issue happening only on Linux


Hey, I’ve got an issue with my GPU, a Radeon RX560: running basically anything compute-heavy on that card makes its temperature skyrocket, and the system shuts itself off due to overheating. After the last failure, dmesg from the previous boot shows the following events:

paź 23 09:52:57.870416 pc_name kernel: amdgpu 0000:91:00.0: amdgpu: Disabling VM faults because of PRT request!
paź 23 09:53:43.030412 pc_name kernel: amdgpu 0000:91:00.0: amdgpu: ERROR: GPU over temperature range(SW CTF) detected!
paź 23 09:53:43.030761 pc_name kernel: amdgpu 0000:91:00.0: amdgpu: ERROR: System is going to shutdown due to GPU SW CTF!

and it’s fairly consistent with what happens: when I launch, for example, World of Warcraft, the temperature spikes from 50 degrees Celsius to 89 in a second, and an emergency shutdown follows. I tried reapplying thermal paste and giving the card a thorough clean, but it didn’t seem to help all that much.
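To see how often the shutdown has been triggered, the kernel log can be scanned for the message quoted above. The following is a minimal sketch, not a definitive tool: the function name `find_ctf_events` is made up for illustration, and the pattern is based only on the exact wording of the log lines in this post, so it may need adjusting if another kernel version words the message differently.

```python
import re

# Matches the amdgpu "SW CTF" over-temperature line as it appears in the log above.
# Assumption: other kernel versions may phrase this message differently.
CTF_PATTERN = re.compile(
    r"amdgpu (?P<pci>[0-9a-f:.]+): amdgpu: ERROR: "
    r"GPU over temperature range\(SW CTF\) detected"
)


def find_ctf_events(log_text):
    """Return a list of (line_prefix, pci_address) for every SW CTF line."""
    events = []
    for line in log_text.splitlines():
        match = CTF_PATTERN.search(line)
        if match:
            # Everything before " kernel:" is the timestamp and host name.
            prefix = line.split(" kernel:", 1)[0].strip()
            events.append((prefix, match.group("pci")))
    return events
```

The text to scan could come from saving the output of `journalctl -k -b -1` (kernel messages of the previous boot) to a file and passing its contents to `find_ctf_events`.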

The interesting thing is that under Windows, even more demanding games keep my GPU fairly cool, at around 50-70 degrees. This leads me to believe that there might be some driver shenanigans, or perhaps some mishandling of power in the kernel.

Can someone help me debug this issue further and perhaps somehow fix it?
I really don’t want to switch back to Windows after years of harmonious coexistence with my Linux rig.

For easy access, here’s my setup:
Branch: Stable

DE / WM: KDE

CPU: i9-7900X

GPU: Radeon RX560 & Nvidia Quadro P1000

GPU Driver: video-hybrid-amd-nvidia-prime

Kernel: 6.6.0-1-MANJARO

Bootloader: GRUB

I experienced something similar to what you describe when using an RX580 (especially with games, or transcoding operations); albeit not to the point of overheating and shutting down. This sometimes occurred in both Linux and Windows, depending on the load.

There is no issue with the card; nor is there likely any issue with Manjaro or any other OS, in relation to this.

The solution is simple:

Invest in a more robust cooling system for your machine; multiple fans, if possible; or perhaps a water-cooling solution if that’s an option.

Yes, expensive, but it’s your machine; choose whatever fits your situation.

Cheers.

Thanks for taking a look at my problem, but I’m sadly not sure this is a thing to be solved with a cooling system - especially since there’s nothing particularly wrong with the current setup when doing much more GPU-intensive things on Windows.
I could dump more money into my PC, but I think it might be an expensive overkill in this situation.

Potentially expensive, yes. However, I can only share my experience with the same family of graphics. Upgrading the cooling system solved it for me, on every OS that experienced the same symptoms. In my case, it fell short of overheating; but not by much.

Good luck. :four_leaf_clover:

I have absolutely no experience with AMD GPUs, especially not on Linux… but I’m a tech nerd and I have a lot of experience with hardware.

89° sounds pretty bad for a weak GPU like a Radeon 560…

Here are a few ideas about what could be happening. Do you have any special settings in Windows that you don’t have in Linux? Even if that is not the case, these can still help reduce the temperature issue you are experiencing:

1. Is it possible that you have a custom fan curve, or fans always at max RPM, in Windows?

2. Did you undervolt or underclock your GPU in Windows?

3. Are you using an FPS limiter in Windows?

4. Did you enable vsync together with a 60Hz display in Windows?

Edit:
To investigate the problem, try running the same application with the same detail settings at the same viewpoint (with the same draw calls/polygons) on both systems, and compare your fan RPM, FPS and temperatures.
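On the Linux side, temperature and fan RPM can be logged while the test scene runs. This is a rough sketch that assumes the amdgpu driver exposes its sensors through the standard hwmon files `temp1_input` (millidegrees Celsius) and `fan1_input` (RPM); the function names are made up for illustration, and the hwmon directory number differs between machines.

```python
import glob
import os
import time


def find_amdgpu_hwmon(base="/sys/class/hwmon"):
    """Return the first hwmon directory whose 'name' file reads 'amdgpu', or None."""
    for path in sorted(glob.glob(os.path.join(base, "hwmon*"))):
        try:
            with open(os.path.join(path, "name")) as f:
                if f.read().strip() == "amdgpu":
                    return path
        except OSError:
            continue  # unreadable entry, try the next one
    return None


def read_sample(hwmon_dir):
    """Read one (temperature in C, fan RPM) sample from an hwmon directory."""
    with open(os.path.join(hwmon_dir, "temp1_input")) as f:
        temp_c = int(f.read()) / 1000.0  # sysfs reports millidegrees Celsius
    with open(os.path.join(hwmon_dir, "fan1_input")) as f:
        rpm = int(f.read())
    return temp_c, rpm


def log_samples(hwmon_dir, count=60, interval=1.0):
    """Print `count` CSV rows: wall-clock time, temperature, fan RPM."""
    for _ in range(count):
        temp_c, rpm = read_sample(hwmon_dir)
        print(f"{time.strftime('%H:%M:%S')},{temp_c:.1f},{rpm}")
        time.sleep(interval)
```

Calling `log_samples(find_amdgpu_hwmon())` in one terminal while the game runs gives a per-second trace that can be set against whatever monitoring tool is used on Windows.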

I can’t say for sure, but it occurs to me that your GPU may be hitting some kind of bottleneck in Windows, which would explain why you only get the heat issues in Linux.

Maybe Bill Gates put a little bug in your PC Case and when you boot in Linux, the little bug runs to your GPU and stops your airflow. :joy:

You are ignoring the fact that his card runs flawlessly in Windows… still, a good PC case with nice airflow (I can recommend Fractal silent cases, by the way) is always welcome.

  1. This is possible, since Windows offers some AMD software that seems to handle optimization. The fans do not seem to go to max RPM on Windows: everything is far quieter under load (Battlefield V), versus being fairly loud at around 2500 RPM while just browsing the web on Linux. Could you suggest some resources on setting the fan curve on Linux? It seems like a process that requires some prior knowledge.
  2. Nope, unless the software did that for me on Windows - I don’t really think it did.
  3. Nope.
  4. Turned off in the AMD app.

The app I’m referring to for your reference is AMD Software Pro
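On the fan curve question: with amdgpu, manual fan control is generally done through the hwmon files `pwm1_enable` (1 = manual, 2 = automatic) and `pwm1` (duty cycle, 0-255). The sketch below shows the idea under those assumptions. The curve points are invented placeholders rather than recommended values, the function names are made up for illustration, and writing to these files needs root.

```python
import os

# Hypothetical curve: (temperature in C, PWM duty 0-255). Tune for your own card.
FAN_CURVE = [(40, 60), (60, 120), (75, 200), (85, 255)]


def pwm_for_temp(temp_c, curve=FAN_CURVE):
    """Linearly interpolate a PWM value (0-255) for the given temperature."""
    if temp_c <= curve[0][0]:
        return curve[0][1]   # below the first point: use the minimum duty
    if temp_c >= curve[-1][0]:
        return curve[-1][1]  # above the last point: use the maximum duty
    for (t0, p0), (t1, p1) in zip(curve, curve[1:]):
        if t0 <= temp_c <= t1:
            frac = (temp_c - t0) / (t1 - t0)
            return round(p0 + frac * (p1 - p0))


def apply_pwm(hwmon_dir, pwm):
    """Switch the fan to manual control and set the duty cycle (needs root)."""
    with open(os.path.join(hwmon_dir, "pwm1_enable"), "w") as f:
        f.write("1")  # 1 = manual control, 2 = automatic
    with open(os.path.join(hwmon_dir, "pwm1"), "w") as f:
        f.write(str(int(pwm)))
```

A loop that reads the temperature every few seconds and calls `apply_pwm(hwmon_dir, pwm_for_temp(temp))` is the whole mechanism; writing `2` to `pwm1_enable` should hand control back to the driver. Ready-made tools exist for this, so a hand-rolled script is mostly useful for understanding what they do.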

Unless you have some strange Bitcoin-mining trojan on Linux, I’m pretty sure there is something weird going on. My Nvidia GPU (2080Ti) runs at the same temperatures idle and gaming in Windows as it does under Linux…

And normally this should be the same for AMD.

No. Simply offering my experiences for comparison; with (multiple) Linux and Windows installations, with respect to the same hardware. Cheers.

You’re running a kernel version that hasn’t been released yet. Switch to an LTS kernel and see if the problem persists.

Your solution sounded promising; unfortunately, I get exactly the same results under the 6.1 LTS kernel :frowning:

What’s the output of inxi -Fazy?

Here you go:

System:
  Kernel: 6.1.55-1-MANJARO arch: x86_64 bits: 64 compiler: gcc v: 13.2.1
    clocksource: tsc available: hpet,acpi_pm
    parameters: BOOT_IMAGE=/boot/vmlinuz-6.1-x86_64
    root=UUID=b578080f-5064-4c02-8482-e865429de0e0 rw quiet apparmor=1
    security=apparmor resume=UUID=161bbe9c-f270-46c7-a7c1-54d4ef8221b2
    udev.log_priority=3
  Desktop: KDE Plasma v: 5.27.8 tk: Qt v: 5.15.11 info: latte-dock
    wm: kwin_x11 vt: 2 dm: SDDM Distro: Manjaro Linux base: Arch Linux
Machine:
  Type: Desktop System: Dell product: Precision 5820 Tower X-Series v: N/A
    serial: <superuser required> Chassis: type: 3 serial: <superuser required>
  Mobo: Dell model: 02M8NY v: A00 serial: <superuser required> UEFI: Dell
    v: 2.27.0 date: 03/13/2023
CPU:
  Info: model: Intel Core i9-7900X bits: 64 type: MT MCP arch: Skylake
    gen: core 7 level: v4 note: check process: Intel 14nm family: 6
    model-id: 0x55 (85) stepping: 4 microcode: 0x2006D05
  Topology: cpus: 1x cores: 10 tpc: 2 threads: 20 smt: enabled cache:
    L1: 640 KiB desc: d-10x32 KiB; i-10x32 KiB L2: 10 MiB desc: 10x1024 KiB
    L3: 13.8 MiB desc: 1x13.8 MiB
  Speed (MHz): avg: 1200 min/max: 1200/4300:4500 scaling:
    driver: intel_pstate governor: powersave cores: 1: 1200 2: 1200 3: 1200
    4: 1200 5: 1200 6: 1200 7: 1200 8: 1200 9: 1200 10: 1200 11: 1200 12: 1200
    13: 1200 14: 1200 15: 1200 16: 1200 17: 1200 18: 1200 19: 1200 20: 1200
    bogomips: 132059
  Flags: avx avx2 ht lm nx pae sse sse2 sse3 sse4_1 sse4_2 ssse3 vmx
  Vulnerabilities:
  Type: gather_data_sampling status: Vulnerable: No microcode
  Type: itlb_multihit status: KVM: VMX disabled
  Type: l1tf mitigation: PTE Inversion; VMX: conditional cache flushes, SMT
    vulnerable
  Type: mds mitigation: Clear CPU buffers; SMT vulnerable
  Type: meltdown mitigation: PTI
  Type: mmio_stale_data mitigation: Clear CPU buffers; SMT vulnerable
  Type: retbleed mitigation: IBRS
  Type: spec_rstack_overflow status: Not affected
  Type: spec_store_bypass mitigation: Speculative Store Bypass disabled via
    prctl
  Type: spectre_v1 mitigation: usercopy/swapgs barriers and __user pointer
    sanitization
  Type: spectre_v2 mitigation: IBRS, IBPB: conditional, STIBP: conditional,
    RSB filling, PBRSB-eIBRS: Not affected
  Type: srbds status: Not affected
  Type: tsx_async_abort mitigation: Clear CPU buffers; SMT vulnerable
Graphics:
  Device-1: NVIDIA GP107GL [Quadro P1000] vendor: Dell driver: nvidia
    v: 535.113.01 alternate: nouveau,nvidia_drm non-free: 535.xx+
    status: current (as of 2023-09) arch: Pascal code: GP10x process: TSMC 16nm
    built: 2016-21 pcie: gen: 1 speed: 2.5 GT/s lanes: 4 link-max: gen: 3
    speed: 8 GT/s lanes: 16 bus-ID: 04:00.0 chip-ID: 10de:1cb1 class-ID: 0300
  Device-2: AMD Baffin [Radeon RX 460/560D / Pro
    450/455/460/555/555X/560/560X] vendor: ASUSTeK AREZ driver: amdgpu
    v: kernel arch: GCN-4 code: Arctic Islands process: GF 14nm built: 2016-20
    pcie: gen: 3 speed: 8 GT/s lanes: 8 ports: active: DP-1,HDMI-A-1
    empty: DVI-D-1 bus-ID: 91:00.0 chip-ID: 1002:67ef class-ID: 0300
    temp: 71.0 C
  Display: x11 server: X.Org v: 21.1.8 compositor: kwin_x11 driver: X:
    loaded: modesetting dri: radeonsi gpu: amdgpu display-ID: :0 screens: 1
  Screen-1: 0 s-res: 3840x1080 s-dpi: 96 s-size: 1016x285mm (40.00x11.22")
    s-diag: 1055mm (41.54")
  Monitor-1: DP-1 pos: left model: Dell P2419H serial: <filter> built: 2021
    res: 1920x1080 hz: 60 dpi: 93 gamma: 1.2 size: 527x296mm (20.75x11.65")
    diag: 604mm (23.8") ratio: 16:9 modes: max: 1920x1080 min: 720x400
  Monitor-2: HDMI-A-1 mapped: HDMI-1 pos: primary,right model: BenQ GW2280
    serial: <filter> built: 2019 res: 1920x1080 hz: 60 dpi: 102 gamma: 1.2
    size: 476x268mm (18.74x10.55") diag: 546mm (21.5") ratio: 16:9 modes:
    max: 1920x1080 min: 720x400
  API: EGL v: 1.5 hw: drv: nvidia drv: amd radeonsi platforms: device: 0
    drv: nvidia device: 1 drv: radeonsi device: 2 drv: swrast gbm:
    drv: kms_swrast surfaceless: drv: nvidia x11: drv: radeonsi
    inactive: wayland
  API: OpenGL v: 4.6.0 compat-v: 4.5 vendor: amd mesa v: 23.1.9-manjaro1.1
    glx-v: 1.4 direct-render: yes renderer: AMD Radeon RX 560 Series (polaris11
    LLVM 16.0.6 DRM 3.49 6.1.55-1-MANJARO) device-ID: 1002:67ef
    memory: 1.95 GiB unified: no
  API: Vulkan v: 1.3.264 layers: 5 device: 0 type: discrete-gpu
    name: Quadro P1000 driver: nvidia v: 535.113.01 device-ID: 10de:1cb1
    surfaces: xcb,xlib device: 1 type: discrete-gpu name: AMD Radeon RX 560
    Series (RADV POLARIS11) driver: mesa radv v: 23.1.9-manjaro1.1
    device-ID: 1002:67ef surfaces: xcb,xlib
Audio:
  Device-1: Intel 200 Series PCH HD Audio vendor: Dell driver: snd_hda_intel
    v: kernel bus-ID: 00:1f.3 chip-ID: 8086:a2f0 class-ID: 0403
  Device-2: NVIDIA GP107GL High Definition Audio vendor: Dell
    driver: snd_hda_intel v: kernel pcie: gen: 3 speed: 8 GT/s lanes: 4 link-max:
    lanes: 16 bus-ID: 04:00.1 chip-ID: 10de:0fb9 class-ID: 0403
  Device-3: AMD Baffin HDMI/DP Audio [Radeon RX 550 640SP / 560/560X]
    vendor: ASUSTeK driver: snd_hda_intel v: kernel pcie: gen: 1 speed: 2.5 GT/s
    lanes: 8 link-max: gen: 3 speed: 8 GT/s bus-ID: 91:00.1 chip-ID: 1002:aae0
    class-ID: 0403
  API: ALSA v: k6.1.55-1-MANJARO status: kernel-api with: aoss
    type: oss-emulator tools: alsactl,alsamixer,amixer
  Server-1: JACK v: 1.9.22 status: off tools: N/A
  Server-2: PipeWire v: 0.3.81 status: off with: pipewire-media-session
    status: active tools: pw-cli
  Server-3: PulseAudio v: 16.1 status: active with: 1: pulseaudio-alsa
    type: plugin 2: pulseaudio-jack type: module tools: pacat,pactl
Network:
  Device-1: Intel Ethernet I219-LM vendor: Dell driver: e1000e v: kernel
    port: N/A bus-ID: 00:1f.6 chip-ID: 8086:15e3 class-ID: 0200
  IF: eno1 state: up speed: 1000 Mbps duplex: full mac: <filter>
Drives:
  Local Storage: total: 2.31 TiB used: 596.97 GiB (25.3%)
  SMART Message: Unable to run smartctl. Root privileges required.
  ID-1: /dev/nvme0n1 maj-min: 259:0 vendor: Samsung model: PM981 NVMe 1024GB
    size: 953.87 GiB block-size: physical: 512 B logical: 512 B speed: 31.6 Gb/s
    lanes: 4 tech: SSD serial: <filter> fw-rev: EXA73D1Q temp: 30.9 C
    scheme: GPT
  ID-2: /dev/sda maj-min: 8:0 vendor: Silicon Power
    model: SPCC Solid State Disk size: 476.94 GiB block-size: physical: 512 B
    logical: 512 B speed: 6.0 Gb/s tech: SSD serial: <filter> fw-rev: 61.3
    scheme: GPT
  ID-3: /dev/sdb maj-min: 8:16 vendor: Crucial model: CT1000MX500SSD1
    size: 931.51 GiB block-size: physical: 4096 B logical: 512 B speed: 6.0 Gb/s
    tech: SSD serial: <filter> fw-rev: 046 scheme: GPT
Partition:
  ID-1: / raw-size: 897.8 GiB size: 882.64 GiB (98.31%)
    used: 596.93 GiB (67.6%) fs: ext4 dev: /dev/sdb2 maj-min: 8:18
  ID-2: /boot/efi raw-size: 300 MiB size: 299.4 MiB (99.80%)
    used: 26.9 MiB (9.0%) fs: vfat dev: /dev/sdb1 maj-min: 8:17
Swap:
  Kernel: swappiness: 60 (default) cache-pressure: 100 (default) zswap: yes
    compressor: zstd max-pool: 20%
  ID-1: swap-1 type: partition size: 33.41 GiB used: 15.5 MiB (0.0%)
    priority: -2 dev: /dev/sdb3 maj-min: 8:19
Sensors:
  System Temperatures: cpu: 36.0 C mobo: 35.0 C gpu: amdgpu temp: 77.0 C
  Fan Speeds (rpm): cpu: 702 fan-1: 991 fan-3: 667 gpu: amdgpu fan: 2614
Info:
  Processes: 389 Uptime: 2m wakeups: 0 Memory: total: 32 GiB note: est.
  available: 31.05 GiB used: 2.52 GiB (8.1%) Init: systemd v: 254
  default: graphical tool: systemctl Compilers: gcc: 13.2.1 clang: 16.0.6
  Packages: 1861 pm: dpkg pkgs: 0 pm: pacman pkgs: 1856 libs: 438 tools: pamac
  pm: flatpak pkgs: 5 Shell: Bash v: 5.1.16 running-in: konsole inxi: 3.3.30

Aside: You might consider installing / configuring the microcode package for your system (your CPU is an Intel i9-7900X, so that is intel-ucode):

sudo pacman -Syu intel-ucode
Vulnerabilities:
  Type: gather_data_sampling status: Vulnerable: No microcode
...
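Which microcode package applies depends on the CPU vendor, which `/proc/cpuinfo` reports in its `vendor_id` field. A small sketch of that mapping, assuming the usual Arch/Manjaro package names `intel-ucode` and `amd-ucode` (the function name is made up for illustration):

```python
def ucode_package(cpuinfo_text):
    """Map the vendor_id field of /proc/cpuinfo to a microcode package name."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("vendor_id"):
            vendor = line.split(":", 1)[1].strip()
            if vendor == "GenuineIntel":
                return "intel-ucode"
            if vendor == "AuthenticAMD":
                return "amd-ucode"
            return None  # unknown vendor: no suggestion
    return None  # no vendor_id field found
```

For the i9-7900X in this thread, `vendor_id` reads `GenuineIntel`, so the function returns `intel-ucode`.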

If your card heats up to 77.0 C in desktop mode, no wonder it overheats in games. When was the last time you cleaned it?

On Saturday, and I reapplied the thermal paste as well. These aren’t temperatures that happen when idling under Windows, mind you.

There was also a browser and dual screen in the background.

The plot thickens.

Is the RX560 driving those two screens? (the RX560 has two HDMI outputs; unless memory fails me). If so, the RX560 might be somewhat under-powered for the purpose; and this could be contributing.

Yes. It has HDMI, DP and DVI outputs.
It wasn’t underpowered for that purpose for the roughly 2 years I’ve been a proud owner of two screens.

The issue still persists on a single monitor; however, it’s not an instant crash as it is with both plugged in.

This is potentially useful information.

Are these high resolution screens, for example, 2560x1440?
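One way to check whether the number of connected monitors changes how hard the card is running at idle is to look at which memory clock level is active. amdgpu lists the levels in the sysfs file `pp_dpm_mclk` and marks the current one with `*`. The sketch below parses that format; the file location (typically `/sys/class/drm/cardN/device/pp_dpm_mclk`, with N depending on the machine) and the example clock values in the usage note are assumptions, and the function name is made up for illustration.

```python
import re


def active_level(pp_dpm_text):
    """Return (level index, MHz) of the line marked with '*', or None."""
    for line in pp_dpm_text.splitlines():
        # Lines look like "1: 1750Mhz *"; the asterisk marks the active level.
        m = re.match(r"\s*(\d+):\s*(\d+)\s*mhz\s*\*", line, re.IGNORECASE)
        if m:
            return int(m.group(1)), int(m.group(2))
    return None
```

For example, `active_level("0: 300Mhz\n1: 1750Mhz *\n")` returns `(1, 1750)`. Comparing the result on an idle desktop with one monitor and with two would show whether the memory clock stays at its highest level in the dual-screen case.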

Did the problem start as soon as you installed Manjaro? Or had your video card been working fine under Linux until recently?

It had been working fine until recently. Unfortunately I can’t really pinpoint the moment when it started getting worse, especially since my first thought was either a failing PSU or dirt buildup. I swapped the PSU: same thing. I cleaned the card: ditto.