Nvidia PRIME results in GPU overheat on Manjaro

Hi folks! I’ve now got Nvidia Optimus working properly on my Dell Inspiron 14 5401 with Manjaro, but I’ve got a different problem: whenever I attempt to use the GPU as opposed to the integrated graphics, the laptop overheats and shuts itself off automatically.

Specifically, I’m using the Unigine Superposition test as a benchmark for this. I can run it all the way through on integrated graphics (without prime-run when in hybrid mode) just fine. However, when I use optimus-manager to put the laptop into nvidia mode, or when I use prime-run to run Superposition in hybrid mode, it consistently overheats somewhere between scenes 14 and 17, to the point where the laptop just shuts off outright.
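
For reference, in hybrid mode I’m launching the benchmark roughly like this (the exact path to the Superposition launcher will of course differ):

prime-run ./Superposition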

I thought this might be a hardware issue, but then I installed Ubuntu to a USB stick and tried on that: works perfectly. Ubuntu defaults to nvidia-only mode from what I can see, but on there, it runs through the entire test on the GPU totally fine, spins the fans down afterwards, and spits out a score - no shutdown involved.

Any ideas why this might be happening, or how I might start to address it? I’m a bit stuck!

Inxi output:

[curtispf@curtis-laptop ~]$ inxi -Fazy
System:
  Kernel: 5.7.19-2-MANJARO x86_64 bits: 64 compiler: gcc v: 10.2.0 
  parameters: BOOT_IMAGE=/vmlinuz-5.7-x86_64 
  root=UUID=1a7a0fbf-7510-4b18-bb85-67e34e268569 rw mem_sleep_default=deep 
  quiet 
  cryptdevice=UUID=21ff733c-9741-4616-b5d9-d41496f34322:luks-21ff733c-9741-4616-b5d9-d41496f34322 
  root=/dev/mapper/luks-21ff733c-9741-4616-b5d9-d41496f34322 apparmor=1 
  security=apparmor 
  resume=/dev/mapper/luks-cf849d49-e236-4224-a49e-608c47e9387d 
  udev.log_priority=3 
  Desktop: KDE Plasma 5.20.4 tk: Qt 5.15.2 wm: kwin_x11 dm: SDDM 
  Distro: Manjaro Linux 
Machine:
  Type: Laptop System: Dell product: Inspiron 14 5401 v: N/A serial: <filter> 
  Chassis: type: 10 serial: <filter> 
  Mobo: Dell model: 03GNVW v: A00 serial: <filter> UEFI: Dell v: 1.4.4 
  date: 09/15/2020 
Battery:
  ID-1: BAT0 charge: 49.5 Wh condition: 49.5/53.0 Wh (93%) volts: 17.1/15.0 
  model: BYD DELL TXD0307 type: Unknown serial: <filter> status: Full 
CPU:
  Info: Quad Core model: Intel Core i7-1065G7 bits: 64 type: MT MCP 
  arch: Ice Lake family: 6 model-id: 7E (126) stepping: 5 microcode: A0 
  L2 cache: 8192 KiB 
  flags: avx avx2 lm nx pae sse sse2 sse3 sse4_1 sse4_2 ssse3 vmx 
  bogomips: 23968 
  Speed: 2729 MHz min/max: 400/3900 MHz Core speeds (MHz): 1: 2729 2: 1588 
  3: 1691 4: 1494 5: 2695 6: 2714 7: 2161 8: 2495 
  Vulnerabilities: Type: itlb_multihit status: KVM: VMX disabled 
  Type: l1tf status: Not affected 
  Type: mds status: Not affected 
  Type: meltdown status: Not affected 
  Type: spec_store_bypass 
  mitigation: Speculative Store Bypass disabled via prctl and seccomp 
  Type: spectre_v1 
  mitigation: usercopy/swapgs barriers and __user pointer sanitization 
  Type: spectre_v2 mitigation: Enhanced IBRS, IBPB: conditional, RSB filling 
  Type: srbds status: Not affected 
  Type: tsx_async_abort status: Not affected 
Graphics:
  Device-1: Intel Iris Plus Graphics G7 vendor: Dell driver: i915 v: kernel 
  bus ID: 00:02.0 chip ID: 8086:8a52 
  Device-2: NVIDIA GP108M [GeForce MX330] vendor: Dell driver: nvidia 
  v: 455.45.01 alternate: nouveau,nvidia_drm bus ID: 01:00.0 
  chip ID: 10de:1d16 
  Device-3: Realtek Integrated_Webcam_HD type: USB driver: uvcvideo 
  bus ID: 3-6:5 chip ID: 0bda:565a serial: <filter> 
  Display: x11 server: X.Org 1.20.10 compositor: kwin_x11 
  driver: modesetting,nvidia alternate: fbdev,intel,nouveau,nv,vesa 
  display ID: :0 screens: 1 
  Screen-1: 0 s-res: 1920x1080 s-dpi: 96 s-size: 508x285mm (20.0x11.2") 
  s-diag: 582mm (22.9") 
  Monitor-1: eDP-1 res: 1920x1080 hz: 60 dpi: 158 size: 309x174mm (12.2x6.9") 
  diag: 355mm (14") 
  OpenGL: renderer: Mesa Intel Iris Plus Graphics (ICL GT2) v: 4.6 Mesa 20.2.3 
  direct render: Yes 
Audio:
  Device-1: Intel Smart Sound Audio vendor: Dell driver: snd_hda_intel 
  v: kernel alternate: snd_sof_pci bus ID: 00:1f.3 chip ID: 8086:34c8 
  Sound Server: ALSA v: k5.7.19-2-MANJARO 
Network:
  Device-1: Qualcomm Atheros QCA6174 802.11ac Wireless Network Adapter 
  vendor: Dell driver: ath10k_pci v: kernel port: 3000 bus ID: 02:00.0 
  chip ID: 168c:003e 
  IF: wlp2s0 state: up mac: <filter> 
  Device-2: Qualcomm Atheros type: USB driver: btusb bus ID: 3-10:6 
  chip ID: 0cf3:e007 
Drives:
  Local Storage: total: 476.94 GiB used: 141.55 GiB (29.7%) 
  SMART Message: Unable to run smartctl. Root privileges required. 
  ID-1: /dev/nvme0n1 vendor: Toshiba model: KBG40ZNS512G NVMe KIOXIA 512GB 
  size: 476.94 GiB block size: physical: 512 B logical: 512 B speed: 31.6 Gb/s 
  lanes: 4 serial: <filter> rev: 10410104 scheme: GPT 
Partition:
  ID-1: / raw size: 467.84 GiB size: 459.50 GiB (98.22%) 
  used: 140.34 GiB (30.5%) fs: ext4 dev: /dev/dm-0 
  ID-2: /boot raw size: 300.0 MiB size: 299.4 MiB (99.80%) 
  used: 147.9 MiB (49.4%) fs: vfat dev: /dev/nvme0n1p1 
Swap:
  Kernel: swappiness: 60 (default) cache pressure: 100 (default) 
  ID-1: swap-1 type: partition size: 8.80 GiB used: 1.07 GiB (12.2%) 
  priority: -2 dev: /dev/dm-1 
Sensors:
  System Temperatures: cpu: 54.0 C mobo: N/A 
  Fan Speeds (RPM): cpu: 0 
Info:
  Processes: 294 Uptime: 15h 11m Memory: 7.55 GiB used: 4.79 GiB (63.4%) 
  Init: systemd v: 246 Compilers: gcc: 10.2.0 clang: 11.0.0 Packages: 
  pacman: 1703 lib: 450 flatpak: 0 Shell: Bash v: 5.0.18 running in: konsole 
  inxi: 3.1.08 

@curtispf

Maybe have a look at the fan speed and the temperature while benchmarking?

watch -n1 nvidia-smi
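
If you just want the numbers, something along these lines should also work (assuming your nvidia-smi supports these query fields; on a chip without its own fan some of them will report N/A):

nvidia-smi --query-gpu=temperature.gpu,fan.speed,utilization.gpu,power.draw --format=csv -l 2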

Maybe the fan control isn’t working properly with the newest driver?

Which version of the nvidia driver has been installed there?

Without any info about the Ubuntu-on-a-stick setup, this doesn’t tell us anything: what kernel is Ubuntu using, which nvidia driver is it on (as megavolt asked), are you using encryption there as well, and Gnome is not KDE. Also, roughly what frame rates are we talking about? In situations where vsync is off, the GPU is taxed to its limit, so running hot is natural. And why are you still on kernel 5.7, which is EOL?

Went back into Ubuntu to try and replicate and get data… and it’s stopped working there now. It must have been a fluke :frowning: Back to square one then!

I’m normally on 5.4 LTS, but I switched to 5.7 to see if it helped, as I had read that some new thermal management systems were introduced in it. It’s helped a little - it does last longer than on 5.4 - but not a huge amount.

The fans do spin up to max I think, but it might be an issue with fan control - it’s difficult for me to immediately tell. I’ve now tried on 440, 450 and 455 and the same issue comes up on all three. I logged the output of nvidia-smi every 2 seconds until the shutoff on two of the drivers - the results are here and here.

@curtispf normally it would look like this:

$ nvidia-smi
Fri Dec 18 21:41:29 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.45.01    Driver Version: 455.45.01    CUDA Version: 11.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce GTX 105...  Off  | 00000000:01:00.0  On |                  N/A |
| 45%   31C    P0    N/A /  75W |    373MiB /  4037MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A       648      G   /usr/lib/Xorg                     303MiB |
|    0   N/A  N/A      1053      G   xfwm4                               1MiB |
|    0   N/A  N/A      1554      G   ...AAAAAAAAA= --shared-files       23MiB |
|    0   N/A  N/A     28114      G   /usr/bin/alacritty                  9MiB |
|    0   N/A  N/A     66781      G   ...e/Steam/ubuntu12_32/steam       12MiB |
|    0   N/A  N/A     66791      G   ./steamwebhelper                    1MiB |
|    0   N/A  N/A    568793      G   /usr/lib/firefox/firefox            1MiB |
|    0   N/A  N/A    583183      G   mpv                                 5MiB |
|    0   N/A  N/A    583493      G   /usr/bin/alacritty                  8MiB |
+-----------------------------------------------------------------------------+

But the fan has not been detected on your system:

|   0  GeForce MX330       Off  | 00000000:01:00.0 Off |                  N/A |
| N/A   45C    P8    N/A /  N/A |     42MiB /  2002MiB |      0%      Default |

Does sensors show it?

I would really advise using manual control of your fan, for example with:

pamac build nvfancontrol
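
While the benchmark is running you could also keep an eye on the shared fan and the CPU temperatures from another terminal, for example:

watch -n2 sensors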

Holy **** …:

|   0  GeForce MX330       Off  | 00000000:01:00.0 Off |                  N/A |
| N/A   83C    P0    N/A /  N/A |   1565MiB /  2002MiB |    100%      Default |

That’s right on the limit…

After running a sensors-detect with all the defaults, I get this:

[curtispf@curtis-laptop ~]$ sensors
dell_smm-virtual-0
Adapter: Virtual device
fan1:        3474 RPM

nvme-pci-0300
Adapter: PCI adapter
Composite:    +31.9°C  (low  = -273.1°C, high = +81.8°C)
                       (crit = +85.8°C)
Sensor 1:     +31.9°C  (low  = -273.1°C, high = +65261.8°C)

ath10k_hwmon-pci-0200
Adapter: PCI adapter
temp1:        +53.0°C  

BAT0-acpi-0
Adapter: ACPI interface
in0:          17.05 V  
curr1:       1000.00 uA 

coretemp-isa-0000
Adapter: ISA adapter
Package id 0:  +53.0°C  (high = +100.0°C, crit = +100.0°C)
Core 0:        +51.0°C  (high = +100.0°C, crit = +100.0°C)
Core 1:        +50.0°C  (high = +100.0°C, crit = +100.0°C)
Core 2:        +52.0°C  (high = +100.0°C, crit = +100.0°C)
Core 3:        +53.0°C  (high = +100.0°C, crit = +100.0°C)

I’ve just tried setting up nvfancontrol, but putting the below into an X11 config file:

Section "Device"
    Identifier "Device 0"
    Driver     "nvidia"
    VendorName "NVIDIA Corporation"
    BoardName  "NVIDIA Corporation GP108M [GeForce MX330] (rev a1)"
    Option     "Coolbits" "4"
EndSection

caused X to refuse to start. Similarly, running sudo nvidia-xconfig --coolbits=4 and rebooting caused X to refuse to start until I deleted /etc/X11/xorg.conf. This might be me doing something terribly wrong, though; I don’t know that much about how xorg’s config files work, for my sins :slight_smile:

I’d appreciate any help or advice you could offer.

If you use optimus-manager, then the nvidia-specific options should go into /etc/optimus-manager/xorg-nvidia-gpu.conf (or /etc/optimus-manager/xorg-nvidia.conf depending on the version):

just

Option "Coolbits" "4"

in that file should be enough.

Okay, reinstalled optimus-manager and added the Coolbits to that config file. That gives me this:

[curtispf@curtis-laptop ~]$ sudo nvfancontrol 
WARN - No config file found; using default curve
X Error of failed request:  BadMatch (invalid parameter attributes)
  Major opcode of failed request:  157 (NV-CONTROL)
  Minor opcode of failed request:  4 ()
  Serial number of failed request:  14
  Current serial number in output stream:  14

The non-sudo version gives the same error, but I thought I’d check it wasn’t a permissions error.

Can you check if /etc/X11/xorg.conf.d/10-optimus-manager.conf contains the “Coolbits” option?

It does, but actually, thinking about it, I realised something that I should have realised way earlier in this process: the MX330, as a laptop chip, does not have a fan itself. Instead, it’s cooled through the same heatpiping and fan system that cools the CPU. So, it makes complete sense that nvfancontrol wouldn’t work.

That doesn’t solve the overall problem, of course, but it at least clarifies that part of it!

I have reinstalled Windows onto the laptop to check, and sure enough, the Superposition benchmark runs just fine there - without the laptop even breaking a sweat. The GPU temperature doesn’t even reach 70°C.

So, there’s something up somewhere on the kernel/driver side… whether it’s the NVIDIA drivers, something else relating to the cooling system (possibly Intel-related?), or something else entirely, I have no idea. Suggestions very much welcome!

Since it is using the CPU fan, maybe you can throttle the CPU’s power to cool the device? You can use thermald (just search for it in the official repos), then create the file /etc/thermald/thermal-conf.xml and put this in it:

<?xml version="1.0"?>
<ThermalConfiguration>
  <Platform>
    <Name>Override CPU default passive</Name>
    <ProductName>*</ProductName>
    <Preference>QUIET</Preference>
    <ThermalZones>
      <ThermalZone>
        <Type>cpu</Type>
        <TripPoints>
          <TripPoint>
            <Temperature>86000</Temperature>
            <type>passive</type>
          </TripPoint>
        </TripPoints>
      </ThermalZone>
    </ThermalZones>
  </Platform>
</ThermalConfiguration>

In

<Temperature>86000</Temperature>

this means 86°C: when the CPU temperature reaches 86°C, thermald switches from active to passive cooling. You can change the value if you want; I have mine set to 80.

Then enable the service:

sudo systemctl enable --now thermald.service
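
To check that thermald has picked up the config, you can follow its log (the exact messages depend on the thermald version):

journalctl -u thermald.service -f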

Then try to heat the computer. This applies to the CPU, not the GPU, but since the GPU uses the CPU fan I think it can work. There are more options you can read about here; I just use the first example and it works well enough for me. (I have an i5-7300HQ and a 1050 Ti.)
https://manpages.debian.org/testing/thermald/thermal-conf.xml.5.en.html

Also, consider undervolting the CPU with Intel Undervolt; I don’t know how to undervolt the nvidia GPU, though.
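
Assuming that means the intel-undervolt tool, the rough workflow would be: install it, edit /etc/intel-undervolt.conf, then apply and verify. Something along these lines (the -50 mV offset is just a placeholder - start small and test for stability; on some recent firmwares undervolting is locked and will simply have no effect):

pamac install intel-undervolt   # or pamac build, if it is only in the AUR
sudo nano /etc/intel-undervolt.conf   # e.g. change "undervolt 0 'CPU' 0" to "undervolt 0 'CPU' -50"
sudo intel-undervolt apply
sudo intel-undervolt read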

How have you monitored this?