Hi folks! I’ve now got Nvidia Optimus working properly on my Dell Inspiron 14 5401 with Manjaro, but I’ve got a different problem: whenever I attempt to use the GPU as opposed to the integrated graphics, the laptop overheats and shuts itself off automatically.
Specifically, I’m using the Unigine Superposition test as a benchmark for this. I can run it all the way through on integrated graphics (without prime-run when in hybrid mode) just fine. However, when I use optimus-manager to put the laptop into nvidia mode, or when I use prime-run to run Superposition in hybrid mode, it consistently overheats somewhere between scenes 14 and 17, to the point that the laptop just point-blank shuts off.
I thought this might be a hardware issue, but then I installed Ubuntu to a USB stick and tried on that: works perfectly. Ubuntu defaults to nvidia-only mode from what I can see, but on there, it runs through the entire test on the GPU totally fine, spins the fans down afterwards, and spits out a score - no shutdown involved.
Any ideas why this might be happening, or how I might start to address it? I’m a bit stuck!
[curtispf@curtis-laptop ~]$ inxi -Fazy
Kernel: 5.7.19-2-MANJARO x86_64 bits: 64 compiler: gcc v: 10.2.0
root=UUID=1a7a0fbf-7510-4b18-bb85-67e34e268569 rw mem_sleep_default=deep
Desktop: KDE Plasma 5.20.4 tk: Qt 5.15.2 wm: kwin_x11 dm: SDDM
Distro: Manjaro Linux
Type: Laptop System: Dell product: Inspiron 14 5401 v: N/A serial: <filter>
Chassis: type: 10 serial: <filter>
Mobo: Dell model: 03GNVW v: A00 serial: <filter> UEFI: Dell v: 1.4.4
ID-1: BAT0 charge: 49.5 Wh condition: 49.5/53.0 Wh (93%) volts: 17.1/15.0
model: BYD DELL TXD0307 type: Unknown serial: <filter> status: Full
Info: Quad Core model: Intel Core i7-1065G7 bits: 64 type: MT MCP
arch: Ice Lake family: 6 model-id: 7E (126) stepping: 5 microcode: A0
L2 cache: 8192 KiB
flags: avx avx2 lm nx pae sse sse2 sse3 sse4_1 sse4_2 ssse3 vmx
Speed: 2729 MHz min/max: 400/3900 MHz Core speeds (MHz): 1: 2729 2: 1588
3: 1691 4: 1494 5: 2695 6: 2714 7: 2161 8: 2495
Vulnerabilities: Type: itlb_multihit status: KVM: VMX disabled
Type: l1tf status: Not affected
Type: mds status: Not affected
Type: meltdown status: Not affected
Type: spec_store_bypass mitigation: Speculative Store Bypass disabled via prctl and seccomp
Type: spectre_v1 mitigation: usercopy/swapgs barriers and __user pointer sanitization
Type: spectre_v2 mitigation: Enhanced IBRS, IBPB: conditional, RSB filling
Type: srbds status: Not affected
Type: tsx_async_abort status: Not affected
Device-1: Intel Iris Plus Graphics G7 vendor: Dell driver: i915 v: kernel
bus ID: 00:02.0 chip ID: 8086:8a52
Device-2: NVIDIA GP108M [GeForce MX330] vendor: Dell driver: nvidia
v: 455.45.01 alternate: nouveau,nvidia_drm bus ID: 01:00.0
chip ID: 10de:1d16
Device-3: Realtek Integrated_Webcam_HD type: USB driver: uvcvideo
bus ID: 3-6:5 chip ID: 0bda:565a serial: <filter>
Display: x11 server: X.Org 1.20.10 compositor: kwin_x11
driver: modesetting,nvidia alternate: fbdev,intel,nouveau,nv,vesa
display ID: :0 screens: 1
Screen-1: 0 s-res: 1920x1080 s-dpi: 96 s-size: 508x285mm (20.0x11.2")
s-diag: 582mm (22.9")
Monitor-1: eDP-1 res: 1920x1080 hz: 60 dpi: 158 size: 309x174mm (12.2x6.9")
diag: 355mm (14")
OpenGL: renderer: Mesa Intel Iris Plus Graphics (ICL GT2) v: 4.6 Mesa 20.2.3
direct render: Yes
Device-1: Intel Smart Sound Audio vendor: Dell driver: snd_hda_intel
v: kernel alternate: snd_sof_pci bus ID: 00:1f.3 chip ID: 8086:34c8
Sound Server: ALSA v: k5.7.19-2-MANJARO
Device-1: Qualcomm Atheros QCA6174 802.11ac Wireless Network Adapter
vendor: Dell driver: ath10k_pci v: kernel port: 3000 bus ID: 02:00.0
chip ID: 168c:003e
IF: wlp2s0 state: up mac: <filter>
Device-2: Qualcomm Atheros type: USB driver: btusb bus ID: 3-10:6
chip ID: 0cf3:e007
Local Storage: total: 476.94 GiB used: 141.55 GiB (29.7%)
SMART Message: Unable to run smartctl. Root privileges required.
ID-1: /dev/nvme0n1 vendor: Toshiba model: KBG40ZNS512G NVMe KIOXIA 512GB
size: 476.94 GiB block size: physical: 512 B logical: 512 B speed: 31.6 Gb/s
lanes: 4 serial: <filter> rev: 10410104 scheme: GPT
ID-1: / raw size: 467.84 GiB size: 459.50 GiB (98.22%)
used: 140.34 GiB (30.5%) fs: ext4 dev: /dev/dm-0
ID-2: /boot raw size: 300.0 MiB size: 299.4 MiB (99.80%)
used: 147.9 MiB (49.4%) fs: vfat dev: /dev/nvme0n1p1
Kernel: swappiness: 60 (default) cache pressure: 100 (default)
ID-1: swap-1 type: partition size: 8.80 GiB used: 1.07 GiB (12.2%)
priority: -2 dev: /dev/dm-1
System Temperatures: cpu: 54.0 C mobo: N/A
Fan Speeds (RPM): cpu: 0
Processes: 294 Uptime: 15h 11m Memory: 7.55 GiB used: 4.79 GiB (63.4%)
Init: systemd v: 246 Compilers: gcc: 10.2.0 clang: 11.0.0 Packages:
pacman: 1703 lib: 450 flatpak: 0 Shell: Bash v: 5.0.18 running in: konsole
Maybe have a look at the fan speed and the temperature while benchmarking?
watch -n1 nvidia-smi
Maybe the fan control isn’t working properly with the newest driver?
Which version of the nvidia driver has been installed there?
Without any info about the Ubuntu stick this doesn’t tell us anything. What kernel is Ubuntu using, and which nvidia driver (as megavolt asked)? Are you using encryption there as well? Gnome != KDE. Also, roughly what framerates are we talking about? In situations where vsync is off, the GPU is taxed to its limit, so overheating is natural. And why are you still on kernel 5.7, which is EOL?
Went back into Ubuntu to try to replicate it and gather data… and it’s stopped working there now. It must have been a fluke. Back to square one, then!
I’m normally on 5.4 LTS, but I switched to 5.7 to see if it helped, as I had read that some new thermal management systems were introduced in it. It’s helped a little - it does last longer than on 5.4 - but not a huge amount.
The fans do spin up to max, I think, but it might be an issue with fan control - it’s difficult for me to tell immediately. I’ve now tried drivers 440, 450 and 455, and the same issue comes up on all three. I logged the output of nvidia-smi every 2 seconds until the shutoff on two of the drivers - the results are here and here.
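Incidentally, a log like that is easy to mine for the peak temperature with awk. A small sketch, assuming the log was captured in CSV form (e.g. with `nvidia-smi --query-gpu=timestamp,temperature.gpu --format=csv,noheader -l 2`, so each line is `timestamp, temp`); the sample lines below are made up for illustration:

```shell
# Find the peak GPU temperature in a CSV log of `timestamp, temperature` lines.
# Sample data is inlined here; in practice, replace printf with `cat nvidia-smi.log`.
printf '%s\n' \
  '2020/12/18 21:41:29.000, 45' \
  '2020/12/18 21:43:11.000, 83' \
  '2020/12/18 21:44:53.000, 96' |
awk -F', ' '$2 + 0 > max { max = $2 + 0 } END { print "peak: " max "C" }'
# prints "peak: 96C"
```

Watching how fast it ramps toward the final value can hint at whether throttling is kicking in at all before the shutoff.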
@curtispf normally it would look like this:
Fri Dec 18 21:41:29 2020
| NVIDIA-SMI 455.45.01 Driver Version: 455.45.01 CUDA Version: 11.1 |
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
| 0 GeForce GTX 105... Off | 00000000:01:00.0 On | N/A |
| 45% 31C P0 N/A / 75W | 373MiB / 4037MiB | 0% Default |
| | | N/A |
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
| 0 N/A N/A 648 G /usr/lib/Xorg 303MiB |
| 0 N/A N/A 1053 G xfwm4 1MiB |
| 0 N/A N/A 1554 G ...AAAAAAAAA= --shared-files 23MiB |
| 0 N/A N/A 28114 G /usr/bin/alacritty 9MiB |
| 0 N/A N/A 66781 G ...e/Steam/ubuntu12_32/steam 12MiB |
| 0 N/A N/A 66791 G ./steamwebhelper 1MiB |
| 0 N/A N/A 568793 G /usr/lib/firefox/firefox 1MiB |
| 0 N/A N/A 583183 G mpv 5MiB |
| 0 N/A N/A 583493 G /usr/bin/alacritty 8MiB |
But the fan has not been detected on your system:
| 0 GeForce MX330 Off | 00000000:01:00.0 Off | N/A |
| N/A 45C P8 N/A / N/A | 42MiB / 2002MiB | 0% Default |
Does sensors show it?
I would really advise using manual control of your fan with:
pamac build nvfancontrol
Holy **** …:
| 0 GeForce MX330 Off | 00000000:01:00.0 Off | N/A |
| N/A 83C P0 N/A / N/A | 1565MiB / 2002MiB | 100% Default |
That’s hard on the limit…
After running sensors-detect with all the defaults, I get this:
[curtispf@curtis-laptop ~]$ sensors
Adapter: Virtual device
fan1: 3474 RPM
Adapter: PCI adapter
Composite: +31.9°C (low = -273.1°C, high = +81.8°C)
(crit = +85.8°C)
Sensor 1: +31.9°C (low = -273.1°C, high = +65261.8°C)
Adapter: PCI adapter
Adapter: ACPI interface
in0: 17.05 V
curr1: 1000.00 uA
Adapter: ISA adapter
Package id 0: +53.0°C (high = +100.0°C, crit = +100.0°C)
Core 0: +51.0°C (high = +100.0°C, crit = +100.0°C)
Core 1: +50.0°C (high = +100.0°C, crit = +100.0°C)
Core 2: +52.0°C (high = +100.0°C, crit = +100.0°C)
Core 3: +53.0°C (high = +100.0°C, crit = +100.0°C)
I’ve just tried setting up nvfancontrol, but putting the below into an X11 config file:
Identifier "Device 0"
VendorName "NVIDIA Corporation"
BoardName "NVIDIA Corporation GP108M [GeForce MX330] (rev a1)"
Option "Coolbits" "4"
caused X to refuse to start. Similarly, running sudo nvidia-xconfig --coolbits=4 and rebooting caused X to refuse to start until I deleted /etc/X11/xorg.conf. This might be me doing something terribly wrong, though; I don’t know that much about how xorg’s config files work, for my sins.
I’d appreciate any help or advice you could offer.
If you use optimus-manager, then the nvidia-specific options should go into /etc/optimus-manager/xorg-nvidia.conf (depending on the version):
Option "Coolbits" "4"
in that file should be enough.
I’m using optimus-manager, and I’ve added the Coolbits option to that config file. That gives me this:
[curtispf@curtis-laptop ~]$ sudo nvfancontrol
WARN - No config file found; using default curve
X Error of failed request: BadMatch (invalid parameter attributes)
Major opcode of failed request: 157 (NV-CONTROL)
Minor opcode of failed request: 4 ()
Serial number of failed request: 14
Current serial number in output stream: 14
The non-sudo version gives the same error, but I thought I’d check it wasn’t a permissions error.
Can you check if /etc/X11/xorg.conf.d/10-optimus-manager.conf contains the “Coolbits” option?
It does, but actually, thinking about it, I realised something that I should have realised way earlier in this process: the MX330, as a laptop chip, does not have a fan of its own. Instead, it’s cooled by the same heatpipe and fan system that cools the CPU. So it makes complete sense that nvfancontrol wouldn’t work.
That doesn’t solve the overall problem, of course, but it at least clarifies that part of it!
I have reinstalled Windows onto the laptop to check, and sure enough, the Superposition benchmark runs just fine there - without the laptop even breaking a sweat. The GPU temperature doesn’t even reach 70.
So, there’s something up somewhere in the kernel… whether it’s the NVIDIA drivers, something else relating to the cooling system (possibly Intel-related?) or something else entirely, though, I have no idea. Suggestions very much welcome!
Since it’s using the CPU fan, maybe you can throttle the CPU’s power to cool the device? You can use thermald (just search for it in the official repos), then create the file /etc/thermald/thermal-conf.xml and put this in it:
<Name>Override CPU default passive</Name>
That means 86 °C: when the CPU temperature reaches 86 °C, it goes from active to passive cooling. You can change the value if you want; I have mine at 80.
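For reference, the trip point lives in that same XML file. A minimal sketch of /etc/thermald/thermal-conf.xml, loosely following the first example in thermald’s documentation - the 86000 value is in millidegrees Celsius, and the sensor and cooling-device names here are assumptions that may need adjusting for your machine:

```xml
<?xml version="1.0"?>
<ThermalConfiguration>
  <Platform>
    <Name>Override CPU default passive</Name>
    <ProductName>*</ProductName>
    <Preference>QUIET</Preference>
    <ThermalZones>
      <ThermalZone>
        <Type>cpu</Type>
        <TripPoints>
          <TripPoint>
            <SensorType>x86_pkg_temp</SensorType>
            <!-- 86000 = 86 °C; lower this to start throttling earlier -->
            <Temperature>86000</Temperature>
            <type>passive</type>
            <ControlType>SEQUENTIAL</ControlType>
            <CoolingDevice>
              <index>1</index>
              <type>rapl_controller</type>
              <influence>100</influence>
              <SamplingPeriod>12</SamplingPeriod>
            </CoolingDevice>
          </TripPoint>
        </TripPoints>
      </ThermalZone>
    </ThermalZones>
  </Platform>
</ThermalConfiguration>
```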
then enable the service
sudo systemctl enable --now thermald.service
Then try to heat the computer up. This applies to the CPU, not the GPU, but since it’s using the CPU’s fan I think it can work. There are more options you can read about here; I just used the first example and it’s working well enough for me. (I have an i5 7300HQ and a 1050 Ti.)
Also, consider undervolting the CPU with Intel Undervolt; I don’t know how to undervolt the nvidia GPU, though.
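If you go that route, intel-undervolt reads its offsets from /etc/intel-undervolt.conf. A minimal sketch - the -50 mV values are placeholder assumptions, so start small and test stability, and be aware that some BIOSes (particularly on Ice Lake machines) lock voltage offsets entirely:

```
# /etc/intel-undervolt.conf (excerpt)
# Format: undervolt <index> '<name>' <offset in mV>
undervolt 0 'CPU' -50
undervolt 1 'GPU' -50
undervolt 2 'CPU Cache' -50
```

Apply with `sudo intel-undervolt apply` and check the resulting offsets with `intel-undervolt read`.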
How have you monitored this?