ASUS laptop discrete AMD card will overheat and crash the system

While using the laptop normally nothing happens, but as soon as I start a video game (or run DRI_PRIME=1 vblank_mode=0 glxgears to reproduce the problem) the dGPU temperature goes above 100 degrees Celsius and keeps rising until it reaches 110 degrees, at which point the laptop forcibly shuts itself down to prevent hardware damage.
This problem does not exist on Windows; it occurs only on Linux.
So far I've tried switching both cards to the amdgpu kernel driver and using the amdgpu.dpm=0 kernel parameter, which stopped the overheating but caused screen flickering in resource-intensive programs. So I removed the parameter and started using acpi_call to just disable the dGPU, which lowered system performance to potato level (update: I no longer take that approach and use corectrl instead).
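
One way to watch the temperature while reproducing the problem is to read the hwmon sensor that amdgpu exposes in sysfs. This is only a minimal sketch: it assumes the dGPU shows up as card1 and that its sensor file is temp1_input (check which card is which on your machine), and the helper name read_temp_c is made up for illustration.

```shell
#!/bin/sh
# read_temp_c FILE: print the temperature in whole degrees Celsius.
# hwmon temp*_input files hold millidegrees Celsius (e.g. 70000 = 70 C).
read_temp_c() {
    [ -r "$1" ] || return 1
    echo $(( $(cat "$1") / 1000 ))
}

# Assumption: the dGPU is card1; adjust the path if it is card0 on your system.
for f in /sys/class/drm/card1/device/hwmon/hwmon*/temp1_input; do
    if [ -r "$f" ]; then
        echo "dGPU: $(read_temp_c "$f") C"
    fi
done
```

Running it repeatedly in a second terminal (for example through watch) while glxgears is running on the dGPU shows how fast the temperature climbs.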

specs are :
ASUS K555D
amd fx-8800p cpu
amd radeon r7 graphics as igpu
amd radeon r8 m350 dx as dgpu

kernel 5.15 lts

configs related (updated according to megavolt recommendation) :

/etc/modprobe.d/gpu.conf
options amdgpu si_support=1
#options amdgpu cik_support=1
#options amdgpu dpm=0
options amdgpu ppfeaturemask=0xffffffff
#options amdgpu dc=1
#options amdgpu runpm=0

blacklist radeon
#options radeon cik_support=0
#options radeon si_support=0
options radeon dpm=0
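
To confirm that the options in /etc/modprobe.d/ actually took effect after a reboot, the values a loaded module is using can be read back from /sys/module/<module>/parameters/. A small sketch, with a hypothetical helper show_param; the optional third argument exists only so the helper can be pointed at a test directory.

```shell
#!/bin/sh
# show_param MODULE PARAM [SYSROOT]: print the value a loaded module is using.
# SYSROOT defaults to /sys/module.
show_param() {
    file="${3:-/sys/module}/$1/parameters/$2"
    if [ -r "$file" ]; then
        echo "$1.$2=$(cat "$file")"
    else
        # Module not loaded (e.g. blacklisted) or parameter not exported.
        echo "$1.$2=<not available>"
    fi
}

show_param amdgpu si_support
show_param amdgpu ppfeaturemask
show_param amdgpu dpm
show_param radeon dpm
```

If radeon is blacklisted and not loaded, its line should report "<not available>", which also shows that the "options radeon" line has nothing to act on.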

result of inxi -Fazy (the parts that seemed important):

System:
  Kernel: 5.15.32-1-MANJARO arch: x86_64 bits: 64 compiler: gcc v: 11.2.0
    parameters: BOOT_IMAGE=/vmlinuz-5.15-x86_64
    root=UUID=d36e83c1-d5d4-47e1-897f-a454f6305bb7 rw rootflags=subvol=@ quiet
    cryptdevice=UUID=cc02c371-7bb0-4b12-91e1-343a8a9bd3dc:luks-cc02c371-7bb0-4b12-91e1-343a8a9bd3dc
    root=/dev/mapper/luks-cc02c371-7bb0-4b12-91e1-343a8a9bd3dc splash
    apparmor=1 security=apparmor udev.log_priority=3
    resume=UUID=d36e83c1-d5d4-47e1-897f-a454f6305bb7 resume_offset=14984448
  Desktop: GNOME v: 41.5 tk: GTK v: 3.24.33 wm: gnome-shell dm: GDM v: 41.3
    Distro: Manjaro Linux base: Arch Linux
Machine:
  Type: Laptop System: ASUSTeK product: X555DG v: 1.0 serial: <filter>
  Mobo: ASUSTeK model: X555DG v: 1.0 serial: <filter>
    UEFI: American Megatrends v: X555DG.605 date: 04/18/2019
CPU:
  Info: model: AMD FX-8800P Radeon R7 12 Compute Cores 4C+8G socket: P0
    bits: 64 type: MT MCP arch: Excavator family: 0x15 (21) model-id: 0x60 (96)
    stepping: 1 microcode: 0x6006118
  Topology: cpus: 1x cores: 4 smt: enabled cache: L1: 320 KiB
    desc: d-4x32 KiB; i-2x96 KiB L2: 2 MiB desc: 2x1024 KiB
  Speed (MHz): avg: 1575 high: 2100 min/max: 1400/2100 boost: enabled
    base/boost: 2100/2100 scaling: driver: acpi-cpufreq governor: ondemand
    volts: 1.0 V ext-clock: 100 MHz cores: 1: 1400 2: 1400 3: 2100 4: 1400
    bogomips: 16775
  Flags: avx avx2 ht lm nx pae sse sse2 sse3 sse4_1 sse4_2 sse4a ssse3 svm
Graphics:
  Device-1: AMD Wani [Radeon R5/R6/R7 Graphics] vendor: ASUSTeK driver: amdgpu
    v: kernel ports: active: eDP-1 empty: DP-1,HDMI-A-1 bus-ID: 00:01.0
    chip-ID: 1002:9874 class-ID: 0300
  Device-2: AMD Sun XT [Radeon HD 8670A/8670M/8690M / R5 M330 M430 Radeon
    520 Mobile]
    vendor: ASUSTeK driver: amdgpu v: kernel alternate: radeon pcie: gen: 3
    speed: 8 GT/s lanes: 8 bus-ID: 03:00.0 chip-ID: 1002:6660 class-ID: 0380
  Device-3: Realtek USB Camera type: USB driver: N/A bus-ID: 1-1.4:5
    chip-ID: 0bda:57b5 class-ID: 0e02 serial: <filter>
  Display: server: X.org v: 1.21.1.3 with: Xwayland v: 22.1.1
    compositor: gnome-shell driver: X: loaded: amdgpu
    unloaded: modesetting,radeon alternate: fbdev,vesa gpu: amdgpu
    display-ID: :0 screens: 1
  Screen-1: 0 s-res: 1920x1080 s-size: <missing: xdpyinfo>
  Monitor-1: eDP-1 mapped: eDP model: AU Optronics 0x38ed built: 2014
    res: 1920x1080 hz: 60 dpi: 142 gamma: 1.2 size: 344x193mm (13.54x7.6")
    diag: 394mm (15.5") ratio: 16:9 modes: max: 1920x1080 min: 640x480
  OpenGL: renderer: AMD Radeon R7 Graphics (CARRIZO DRM 3.42.0
    5.15.32-1-MANJARO LLVM 13.0.1)
    v: 4.6 Mesa 21.3.8 direct render: Yes
Sensors:
  System Temperatures: cpu: 83.0 C mobo: N/A
  Fan Speeds (RPM): cpu: 4800
  GPU: device: amdgpu temp: 70.0 C device: amdgpu temp: 82.0 C

output of mhwd -l:

> 0000:03:00.0 (0380:1002:6660) Display controller ATI Technologies Inc:
--------------------------------------------------------------------------------
                  NAME               VERSION          FREEDRIVER           TYPE
--------------------------------------------------------------------------------
           video-linux            2018.05.04                true            PCI


> 0000:02:00.0 (0200:10ec:8168) Network controller Realtek Semiconductor Co., Ltd.:
--------------------------------------------------------------------------------
                  NAME               VERSION          FREEDRIVER           TYPE
--------------------------------------------------------------------------------
         network-r8168            2016.04.20                true            PCI


> 0000:00:01.0 (0300:1002:9874) Display controller ATI Technologies Inc:
--------------------------------------------------------------------------------
                  NAME               VERSION          FREEDRIVER           TYPE
--------------------------------------------------------------------------------
           video-linux            2018.05.04                true            PCI
     video-modesetting            2020.01.13                true            PCI
            video-vesa            2017.03.12                true            PCI

output of mhwd -li :

> Installed PCI configs:
--------------------------------------------------------------------------------
                  NAME               VERSION          FREEDRIVER           TYPE
--------------------------------------------------------------------------------
           video-linux            2018.05.04                true            PCI


Warning: No installed USB configs!

output of journalctl --boot=0 --priority=5 --no-pager : (warning : LONG) journalctl - Pastebin.com

Hello, edit your post by adding the formatted output of:
inxi -Fazy
mhwd -l
mhwd -li

Disabling power management and not adjusting the fan speed and limits manually is bad.

Maybe this gives you more control: CoreCtrl / CoreCtrl · GitLab

pamac build corectrl

it is done

thank you for recommending this software to me
now, in addition to what you have mentioned, managing the system power is way easier and more tidy
but I have noticed a small issue that might be a hint to our problem:
my laptop runs on one fan, and corectrl shows that the fan belongs to the dGPU (I don't know exactly how this can help, but I think it is useful to know)

also post output of logs:
journalctl --boot=0 --priority=5 --no-pager

updated/edited the post
by the way, thanks a lot for your efforts (I know the problem is still there, but it cannot get fixed without help)

That's a lot of errors/failures/warnings… try the logs from this command:
journalctl --boot=0 --priority=3 --no-pager
also try installing different kernels, and try to boot with them, and check
are you running on wayland or x11?

trying different kernels and testing them with DRI_PRIME=1 vblank_mode=0 glxgears:

kernel version  x11   wayland  notes
5.17            fail  fail     according to corectrl the dGPU was not utilized, but the system still manages to overheat
5.16            fail  fail     at high temps the dGPU clock decreases from 1030 MHz to 800 MHz, but only for a second
5.15            fail  fail     similar to 5.16, but the decrease lasts a tiny bit longer
5.10            fail  fail     similar to 5.15
5.4             fail  fail     GPU clock is locked at 750 MHz, neither decreasing nor increasing
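
The clock behaviour in the notes column can also be read without corectrl, if the driver exposes pp_dpm_sclk for the card. The sketch below assumes that file exists for the dGPU and that the dGPU is card1; neither is confirmed for this hardware. The helper name active_sclk is made up for illustration.

```shell
#!/bin/sh
# active_sclk: read pp_dpm_sclk-style text on stdin and print the clock of the
# level marked with '*' (the currently active one).
active_sclk() {
    awk '/\*/ { print $2; exit }'
}

# Assumption: the dGPU is card1 and the driver exposes pp_dpm_sclk for it.
sclk=/sys/class/drm/card1/device/pp_dpm_sclk
if [ -r "$sclk" ]; then
    echo "$(date +%T) sclk=$(active_sclk < "$sclk")"
else
    echo "pp_dpm_sclk is not available for card1"
fi
```

Sampling this once per second during the glxgears run would give a timestamped record of when the clock drops and for how long.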

The interesting thing I learned from the tests is that the discrete GPU (according to corectrl) appears to have a fan that does not behave the same way as the laptop's only fan. In other words, the dGPU might be relying on a fan that does not exist.
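
A rough way to check this from sysfs is to list every hwmon device and see which ones expose a fan tachometer (fan1_input) or a PWM control (pwm1). This is only a sketch with a hypothetical helper describe_hwmon. A pwm1 file under the amdgpu device with no matching physical fan would fit the theory above, but that is a hypothesis, not something this script can prove.

```shell
#!/bin/sh
# describe_hwmon DIR: print the hwmon device name and whether it exposes a
# fan tachometer (fan1_input) and a PWM control (pwm1).
describe_hwmon() {
    name=$(cat "$1/name" 2>/dev/null || echo unknown)
    fan=no
    pwm=no
    if [ -e "$1/fan1_input" ]; then fan=yes; fi
    if [ -e "$1/pwm1" ]; then pwm=yes; fi
    echo "$name fan=$fan pwm=$pwm"
}

for d in /sys/class/hwmon/hwmon*; do
    if [ -d "$d" ]; then
        describe_hwmon "$d"
    fi
done
```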

output of journalctl --boot=0 --priority=3 --no-pager :

Apr 29 13:11:41 someguy04-x555dg kernel: kfd kfd: amdgpu: HAINAN  not supported in kfd
Apr 29 13:11:43 someguy04-x555dg kernel: tpm_crb MSFT0101:00: [Firmware Bug]: ACPI region does not cover the entire command/response buffer. [mem 0xcd861000-0xcd861fff flags 0x200] vs cd861000 4000
Apr 29 13:11:43 someguy04-x555dg kernel: tpm_crb MSFT0101:00: can't request region for resource [mem 0xcd861000-0xcd861fff]
Apr 29 13:11:43 someguy04-x555dg kernel: sp5100-tco sp5100-tco: Watchdog hardware is disabled
Apr 29 13:12:02 someguy04-x555dg gnome-session-binary[685]: GLib-GIO-CRITICAL: g_bus_get_sync: assertion 'error == NULL || *error == NULL' failed
Apr 29 13:12:02 someguy04-x555dg gnome-session-binary[685]: GLib-GIO-CRITICAL: g_bus_get_sync: assertion 'error == NULL || *error == NULL' failed
Apr 29 13:12:05 someguy04-x555dg kernel: [drm:amdgpu_device_ip_late_init [amdgpu]] *ERROR* late_init of IP block <si_dpm> failed -22
Apr 29 13:12:20 someguy04-x555dg bluetoothd[557]: Failed to set mode: Failed (0x03)
Apr 29 13:13:10 someguy04-x555dg gdm-password][1054]: gkr-pam: unable to locate daemon control file
Apr 29 13:13:36 someguy04-x555dg gdm-launch-environment][664]: GLib-GObject: g_object_unref: assertion 'G_IS_OBJECT (object)' failed

The only thing I found that could probably help was editing the boot options: in /etc/default/grub, add the following parameters to the GRUB_CMDLINE_LINUX_DEFAULT line:

radeon.si_support=0 radeon.cik_support=0 amdgpu.si_support=1 amdgpu.cik_support=1

update grub:
sudo update-grub
reboot
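
After the reboot it is worth checking that the parameters really reached the kernel, since a typo in /etc/default/grub fails silently. A minimal sketch that looks for each one in /proc/cmdline; the helper has_param is hypothetical, and its optional second argument exists only for testing against a sample file.

```shell
#!/bin/sh
# has_param NAME=VALUE [CMDLINE_FILE]: succeed if the exact parameter is on the
# kernel command line. CMDLINE_FILE defaults to /proc/cmdline.
has_param() {
    tr ' ' '\n' < "${2:-/proc/cmdline}" | grep -qxF -- "$1"
}

for p in radeon.si_support=0 radeon.cik_support=0 \
         amdgpu.si_support=1 amdgpu.cik_support=1; do
    if has_param "$p"; then
        echo "ok       $p"
    else
        echo "MISSING  $p"
    fi
done
```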

Another solution was trying different kernels, which you already did; the last one was to add the graphics driver to the initramfs modules so that it loads early.

the laptop now stabilizes at 105 degrees and no longer crashes
is 105 degrees going to damage any part of the laptop, or is it just fine?
update : it crashed

If it is 105 F, then no worry, but if it is 105 C then definitely yes! It should stay below 90 C, with normal usage at about 30 C to 60 C. The more heat the GPU gets, the more its lifetime is reduced.

Maybe related to this.

Also, does this also happen on X11 session?

and did you try the parameters?

yes the parameters are applied

si and cik supports are enabled
and the problem exists on both wayland and x11 sessions

so go to /etc/mkinitcpio.conf and edit the modules section to look like this:
MODULES=(amdgpu)
if there are already some modules, add the amdgpu to them
run this:
sudo mkinitcpio -P
reboot
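
The edit to the MODULES line can be previewed before touching the real file. The sketch below uses a hypothetical helper add_module that prints the modified config to stdout and leaves the file itself unchanged. It assumes MODULES=(...) sits on a single uncommented line, as in the stock mkinitcpio.conf.

```shell
#!/bin/sh
# add_module FILE MODULE: print FILE with MODULE appended to the MODULES=(...)
# line, unless it is already listed. The file itself is not modified.
add_module() {
    if grep -Eq "^MODULES=\((.* )?$2( .*)?\)" "$1"; then
        cat "$1"
    else
        # Append the module, then tidy up the "( amdgpu)" left by an empty list.
        sed -E "s/^MODULES=\((.*)\)/MODULES=(\1 $2)/; s/^MODULES=\( /MODULES=(/" "$1"
    fi
}

conf=/etc/mkinitcpio.conf
if [ -r "$conf" ]; then
    add_module "$conf" amdgpu | grep '^MODULES=' || echo "no MODULES= line found"
fi
```

If the preview looks right, make the same change in /etc/mkinitcpio.conf and then run sudo mkinitcpio -P as described above.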

The only reasonable question then: is the cooling system of the laptop clean?
Just a bit of dust “in the right place” and everything goes … hot.

the laptop board was cleaned of dust and the thermal paste was replaced about 3 months ago
so it's kind of safe to assume that the cooling is working well

Well … then my last question: Does this overheating happen on other OS?