GPU Fan Speed Spikes to 100% Unnecessarily, Stays There

I’m not entirely sure when this started or what could have prompted it. I monitor my GPU usage and temperature with MangoHud, and I’ve noticed that when playing any game now (emulators, Steam, Genshin, etc.), at some point during gameplay my GPU fan speed instantly jumps to 100% and stays there, regardless of the actual GPU usage or temperature. It seems to get triggered at very low temperatures: in some of these games I never see the GPU go over 55°C (it typically sits around 40-45°C), and the fan still ramps up and stays there.

Even closing the game or application does not return the fan speed to normal; I have to reboot. I installed GreenWithEnvy (GWE) to check the fan profile, and it appears that GWE isn’t able to see the fan information (it reads 0% fan duty and 0 RPM). Attempting to change the fan profile has no effect. I also tried upgrading my kernel from 5.13 to 5.15, with no change.
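
One way to cross-check what GWE and MangoHud report is to ask the driver directly through `nvidia-smi`. The following is only a minimal sketch, assuming `nvidia-smi` is on the PATH and that the card reports the `temperature.gpu`, `fan.speed` and `utilization.gpu` query fields. The helper names and the thresholds in `fan_looks_stuck` are made up for illustration and are not taken from any tool mentioned in this thread.

```python
import subprocess

# Fields requested from nvidia-smi, in the order they will be printed.
QUERY_FIELDS = ["temperature.gpu", "fan.speed", "utilization.gpu"]


def parse_status_line(line):
    """Parse one CSV line such as '45, 100, 12' into a dict of ints.

    A sensor the driver cannot read (printed as e.g. '[N/A]') becomes None.
    """
    values = [v.strip() for v in line.split(",")]
    if len(values) != len(QUERY_FIELDS):
        raise ValueError(f"expected {len(QUERY_FIELDS)} fields, got {line!r}")
    return {
        name: int(raw) if raw.isdigit() else None
        for name, raw in zip(QUERY_FIELDS, values)
    }


def read_gpu_status():
    """Run nvidia-smi once and return the parsed status of the first GPU."""
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=" + ",".join(QUERY_FIELDS),
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_status_line(result.stdout.splitlines()[0])


def fan_looks_stuck(status, temp_limit_c=60, fan_limit_pct=90):
    """Return True when the fan is near maximum although the GPU is cool.

    Both limits are hypothetical values chosen for this example.
    """
    temp = status["temperature.gpu"]
    fan = status["fan.speed"]
    return (
        temp is not None
        and fan is not None
        and temp < temp_limit_c
        and fan >= fan_limit_pct
    )
```

Calling `fan_looks_stuck(read_gpu_status())` every few seconds while a game is running would show roughly when the fan jumps, and whether the driver itself reports 100% or 0% at that moment.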

The GPU is a GTX 1660 Super, and I’m using the latest 495.44 proprietary Nvidia drivers. Any ideas are greatly appreciated, as this issue makes me pretty uncomfortable: for obvious reasons, I don’t want to yeet a fan on my GPU and have to replace it in the current market.

Please let me know what other information would be useful.

Edit: Here is the inxi output with my system info:

  Kernel: 5.15.7-1-MANJARO x86_64 bits: 64 compiler: gcc v: 11.1.0
    parameters: BOOT_IMAGE=/boot/vmlinuz-5.15-x86_64
    root=UUID=fada4d6d-7bdb-40cc-a80c-ed14fd89d9ac rw quiet apparmor=1
    security=apparmor udev.log_priority=3
  Desktop: KDE Plasma 5.23.4 tk: Qt 5.15.2 wm: kwin_x11 vt: 1 dm: SDDM
    Distro: Manjaro Linux base: Arch Linux
  Type: Desktop System: ASUS product: N/A v: N/A serial: <superuser required>
  Mobo: ASUSTeK model: PRIME B550M-A (WI-FI) v: Rev X.0x
    serial: <superuser required> UEFI: American Megatrends v: 2423
    date: 08/09/2021
  Message: No system battery data found. Is one present?
  RAM: total: 31.32 GiB used: 2.87 GiB (9.2%)
  RAM Report:
    permissions: Unable to run dmidecode. Root privileges required.
  Info: model: AMD Ryzen 7 5800X bits: 64 type: MT MCP arch: Zen 3
    family: 0x19 (25) model-id: 0x21 (33) stepping: 0 microcode: 0xA201016
  Topology: cpus: 1x cores: 8 tpc: 2 threads: 16 smt: enabled cache:
    L1: 512 KiB desc: d-8x32 KiB; i-8x32 KiB L2: 4 MiB desc: 8x512 KiB
    L3: 32 MiB desc: 1x32 MiB
  Speed (MHz): avg: 3161 high: 3807 min/max: 2200/4850 boost: enabled
    scaling: driver: acpi-cpufreq governor: schedutil cores: 1: 2813 2: 2874
    3: 3305 4: 3593 5: 2873 6: 2871 7: 3588 8: 3591 9: 3320 10: 2874 11: 2864
    12: 2871 13: 2876 14: 2872 15: 3587 16: 3807 bogomips: 121425
  Flags: 3dnowprefetch abm adx aes aperfmperf apic arat avic avx avx2 bmi1
    bmi2 bpext cat_l3 cdp_l3 clflush clflushopt clwb clzero cmov cmp_legacy
    constant_tsc cpb cpuid cqm cqm_llc cqm_mbm_local cqm_mbm_total
    cqm_occup_llc cr8_legacy cx16 cx8 de decodeassists erms extapic
    extd_apicid f16c flushbyasid fma fpu fsgsbase fsrm fxsr fxsr_opt ht
    hw_pstate ibpb ibrs ibs invpcid irperf lahf_lm lbrv lm mba mca mce
    misalignsse mmx mmxext monitor movbe msr mtrr mwaitx nonstop_tsc nopl npt
    nrip_save nx ospke osvw overflow_recov pae pat pausefilter pclmulqdq
    pdpe1gb perfctr_core perfctr_llc perfctr_nb pfthreshold pge pku pni popcnt
    pse pse36 rapl rdpid rdpru rdrand rdseed rdt_a rdtscp rep_good sep sha_ni
    skinit smap smca smep ssbd sse sse2 sse4_1 sse4_2 sse4a ssse3 stibp succor
    svm svm_lock syscall tce topoext tsc tsc_scale umip v_spec_ctrl
    v_vmsave_vmload vaes vgif vmcb_clean vme vmmcall vpclmulqdq wbnoinvd wdt
    xgetbv1 xsave xsavec xsaveerptr xsaveopt xsaves
  Type: itlb_multihit status: Not affected
  Type: l1tf status: Not affected
  Type: mds status: Not affected
  Type: meltdown status: Not affected
  Type: spec_store_bypass
    mitigation: Speculative Store Bypass disabled via prctl and seccomp
  Type: spectre_v1
    mitigation: usercopy/swapgs barriers and __user pointer sanitization
  Type: spectre_v2 mitigation: Full AMD retpoline, IBPB: conditional,
    IBRS_FW, STIBP: always-on, RSB filling
  Type: srbds status: Not affected
  Type: tsx_async_abort status: Not affected
  Device-1: NVIDIA TU116 [GeForce GTX 1660 SUPER] vendor: Gigabyte
    driver: nvidia v: 495.44 alternate: nouveau,nvidia_drm bus-ID: 0a:00.0
    chip-ID: 10de:21c4 class-ID: 0300
  Display: x11 server: X.Org compositor: kwin_x11 driver:
    loaded: nvidia display-ID: :0 screens: 1
  Screen-1: 0 s-res: 3840x2160 s-dpi: 102 s-size: 956x543mm (37.6x21.4")
    s-diag: 1099mm (43.3")
  Monitor-1: HDMI-0 res: 3840x2160 hz: 60 dpi: 122
    size: 800x450mm (31.5x17.7") diag: 918mm (36.1")
  OpenGL: renderer: NVIDIA GeForce GTX 1660 SUPER/PCIe/SSE2
    v: 4.6.0 NVIDIA 495.44 direct render: Yes
  Device-1: NVIDIA TU116 High Definition Audio vendor: Gigabyte
    driver: snd_hda_intel v: kernel bus-ID: 0a:00.1 chip-ID: 10de:1aeb
    class-ID: 0403
  Device-2: AMD Starship/Matisse HD Audio vendor: ASUSTeK
    driver: snd_hda_intel v: kernel bus-ID: 0c:00.4 chip-ID: 1022:1487
    class-ID: 0403
  Sound Server-1: ALSA v: k5.15.7-1-MANJARO running: yes
  Sound Server-2: JACK v: 1.9.19 running: no
  Sound Server-3: PulseAudio v: 15.0 running: yes
  Sound Server-4: PipeWire v: 0.3.40 running: yes
  Device-1: Intel Wi-Fi 6 AX200 driver: iwlwifi v: kernel bus-ID: 08:00.0
    chip-ID: 8086:2723 class-ID: 0280
  IF: wlp8s0 state: up mac: <filter>
  IP v4: <filter> type: dynamic noprefixroute scope: global
    broadcast: <filter>
  Device-2: Realtek RTL8111/8168/8411 PCI Express Gigabit Ethernet
    vendor: ASUSTeK PRIME B450M-A driver: r8169 v: kernel port: f000
    bus-ID: 09:00.0 chip-ID: 10ec:8168 class-ID: 0200
  IF: enp9s0 state: down mac: <filter>
  IF-ID-1: wg-mullvad state: unknown speed: N/A duplex: N/A mac: N/A
  IP v4: <filter> scope: global
  WAN IP: <filter>
  Device-1: Intel AX200 Bluetooth type: USB driver: btusb v: 0.8 bus-ID: 1-5:2
    chip-ID: 8087:0029 class-ID: e001
  Report: rfkill ID: hci0 rfk-id: 1 state: up address: see --recommends
  Message: No logical block device data found.
  Message: No RAID data found.
  Local Storage: total: 10.14 TiB used: 8.24 TiB (81.3%)
  SMART Message: Unable to run smartctl. Root privileges required.
  ID-1: /dev/sda maj-min: 8:0 vendor: PNY model: SSD2SC120G1SA754D117-820
    size: 111.79 GiB block-size: physical: 512 B logical: 512 B speed: 6.0 Gb/s
    type: SSD serial: <filter> rev: 0A scheme: GPT
  ID-2: /dev/sdb maj-min: 8:16 vendor: Samsung model: SSD 850 PRO 1TB
    size: 953.87 GiB block-size: physical: 512 B logical: 512 B speed: 6.0 Gb/s
    type: SSD serial: <filter> rev: 2B6Q scheme: GPT
  ID-3: /dev/sdc maj-min: 8:32 vendor: Seagate model: ST10000NM0086-2AA101
    size: 9.1 TiB block-size: physical: 4096 B logical: 512 B speed: 6.0 Gb/s
    type: HDD rpm: 7200 serial: <filter> rev: SN05 scheme: GPT
  Message: No optical or floppy data found.
  ID-1: / raw-size: 111.49 GiB size: 109.18 GiB (97.93%)
    used: 33.5 GiB (30.7%) fs: ext4 dev: /dev/sda2 maj-min: 8:2 label: N/A
    uuid: fada4d6d-7bdb-40cc-a80c-ed14fd89d9ac
  ID-2: /boot/efi raw-size: 300 MiB size: 299.4 MiB (99.80%)
    used: 288 KiB (0.1%) fs: vfat dev: /dev/sda1 maj-min: 8:1 label: NO_LABEL
    uuid: 6B06-7918
  ID-3: /media/viejo/Media raw-size: 9.1 TiB size: 9.02 TiB (99.20%)
    used: 7.92 TiB (87.8%) fs: ext4 dev: /dev/sdc1 maj-min: 8:33 label: Media
    uuid: ad03e1ca-ec94-40e6-b1d7-69899065cbb1
  ID-4: /media/viejo/SSDGames raw-size: 953.87 GiB size: 937.82 GiB (98.32%)
    used: 296.27 GiB (31.6%) fs: ext4 dev: /dev/sdb1 maj-min: 8:17
    label: SSD Games uuid: f8e436ed-6dfe-4dc7-8e9d-8c33d597863a
  Alert: No swap data was found.
  Message: No unmounted partitions found.
  Hub-1: 1-0:1 info: Hi-speed hub with single TT ports: 10 rev: 2.0
    speed: 480 Mb/s chip-ID: 1d6b:0002 class-ID: 0900
  Device-1: 1-5:2 info: Intel AX200 Bluetooth type: Bluetooth driver: btusb
    interfaces: 2 rev: 2.0 speed: 12 Mb/s power: 100mA chip-ID: 8087:0029
    class-ID: e001
  Device-2: 1-6:3 info: ASUSTek AURA LED Controller type: HID
    driver: hid-generic,usbhid interfaces: 2 rev: 2.0 speed: 12 Mb/s power: 16mA
    chip-ID: 0b05:1939 class-ID: 0300 serial: <filter>
  Device-3: 1-7:4 info: SINOWEALTH Game Mouse type: Mouse,Keyboard
    driver: hid-generic,usbhid interfaces: 2 rev: 1.1 speed: 12 Mb/s
    power: 480mA chip-ID: 258a:1007 class-ID: 0301
  Device-4: 1-8:5 info: Logitech Keyboard K120 type: Keyboard,HID
    driver: hid-generic,usbhid interfaces: 2 rev: 1.1 speed: 1.5 Mb/s
    power: 90mA chip-ID: 046d:c31c class-ID: 0300
  Hub-2: 2-0:1 info: Super-speed hub ports: 4 rev: 3.1 speed: 10 Gb/s
    chip-ID: 1d6b:0003 class-ID: 0900
  Hub-3: 3-0:1 info: Hi-speed hub with single TT ports: 2 rev: 2.0
    speed: 480 Mb/s chip-ID: 1d6b:0002 class-ID: 0900
  Hub-4: 4-0:1 info: Super-speed hub ports: 4 rev: 3.1 speed: 10 Gb/s
    chip-ID: 1d6b:0003 class-ID: 0900
  Hub-5: 5-0:1 info: Hi-speed hub with single TT ports: 4 rev: 2.0
    speed: 480 Mb/s chip-ID: 1d6b:0002 class-ID: 0900
  Hub-6: 6-0:1 info: Super-speed hub ports: 4 rev: 3.1 speed: 10 Gb/s
    chip-ID: 1d6b:0003 class-ID: 0900
  System Temperatures: cpu: N/A mobo: N/A gpu: nvidia temp: 33 C
  Fan Speeds (RPM): N/A gpu: nvidia fan: 0%
  Processes: 353 Uptime: 38m wakeups: 0 Init: systemd v: 249 tool: systemctl
  Compilers: gcc: 11.1.0 Packages: 1389 pacman: 1377 lib: 466 flatpak: 12
  Shell: Bash v: 5.1.12 running-in: konsole inxi: 3.3.11

And here is the error output from dmesg. It looks like there could be something here. A driver error?

[    0.566072]   #9 #10 #11 #12 #13 #14 #15
[    1.606232] ata2.00: supports DRM functions and may not be fully accessible
[    1.621594] ata2.00: supports DRM functions and may not be fully accessible
[    3.474209] ata5: failed to resume link (SControl 0)
[    4.491077] ipmi_si: Unable to find any System Interface(s)
[    4.587946] acpi PNP0C14:01: duplicate WMI GUID 05901221-D566-11D1-B2F0-00A0C9062910 (first instance was on PNP0C14:00)
[    4.587993] acpi PNP0C14:02: duplicate WMI GUID 05901221-D566-11D1-B2F0-00A0C9062910 (first instance was on PNP0C14:00)
[    4.588024] acpi PNP0C14:03: duplicate WMI GUID 05901221-D566-11D1-B2F0-00A0C9062910 (first instance was on PNP0C14:00)
[    4.588079] acpi PNP0C14:04: duplicate WMI GUID 05901221-D566-11D1-B2F0-00A0C9062910 (first instance was on PNP0C14:00)
[    4.677095] sp5100-tco sp5100-tco: Watchdog hardware is disabled
[    4.871070] nvidia: loading out-of-tree module taints kernel.
[    4.871084] nvidia: module license 'NVIDIA' taints kernel.
[    4.871086] Disabling lock debugging due to kernel taint
[    4.878847] iwlwifi 0000:08:00.0: Direct firmware load for iwlwifi-cc-a0-66.ucode failed with error -2
[    4.878915] iwlwifi 0000:08:00.0: Direct firmware load for iwlwifi-cc-a0-65.ucode failed with error -2
[    4.878940] iwlwifi 0000:08:00.0: Direct firmware load for iwlwifi-cc-a0-64.ucode failed with error -2
[    4.885490] iwlwifi 0000:08:00.0: api flags index 2 larger than supported by driver
[    4.914832] kvm: disabled by bios

[    4.975121] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  495.44  Fri Oct 22 06:13:12 UTC 2021
[    5.019065] usb 1-6: config 1 has an invalid interface number: 2 but max is 1
[    5.019067] usb 1-6: config 1 has no interface number 1
[    5.043525] kvm: disabled by bios
[    5.192541] thermal thermal_zone0: failed to read out thermal zone (-61)
[    5.250622] nvidia_uvm: module uses symbols from proprietary module nvidia, inheriting taint.
[    5.255084] kvm: disabled by bios
[    5.416390] kvm: disabled by bios
[    5.422039] urandom_read: 4 callbacks suppressed
[    5.565754] kvm: disabled by bios
[    5.590400] kauditd_printk_skb: 30 callbacks suppressed
[    5.677655] kvm: disabled by bios
[    5.819447] kvm: disabled by bios
[    6.003436] kvm: disabled by bios
[    6.039261] nvidia-gpu 0000:0a:00.3: i2c timeout error e0000000
[    6.039264] ucsi_ccg 0-0008: i2c_transfer failed -110
[    6.039266] ucsi_ccg 0-0008: ucsi_ccg_init failed - -110
[    6.039268] ucsi_ccg: probe of 0-0008 failed with error -110
[    6.137696] kvm: disabled by bios
[    6.271360] kvm: disabled by bios
[   10.756920] kauditd_printk_skb: 25 callbacks suppressed
[   16.318268] kauditd_printk_skb: 37 callbacks suppressed
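
Most of the lines above are unrelated to the GPU, so it can help to filter the log down to the entries that mention the NVIDIA driver. This is a small sketch, assuming the log uses the default `[seconds] message` layout shown above. The keyword list (`nvrm`, `nvidia`, `ucsi_ccg`) is only a guess at what is relevant here, and `gpu_lines` is a hypothetical helper name.

```python
import re

# One dmesg line: "[    4.871070] message text"
LINE_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s*(.*)$")

# Keywords assumed to mark GPU-related messages (an illustrative choice).
GPU_RE = re.compile(r"nvrm|nvidia|ucsi_ccg", re.IGNORECASE)


def gpu_lines(dmesg_text):
    """Return (timestamp_seconds, message) pairs for GPU-related lines."""
    hits = []
    for line in dmesg_text.splitlines():
        match = LINE_RE.match(line)
        if match and GPU_RE.search(match.group(2)):
            hits.append((float(match.group(1)), match.group(2)))
    return hits
```

Feeding it the text of `dmesg` (reading the kernel log may require root on some systems) right after the fan gets stuck would show whether any new driver message appears at that moment, as opposed to the boot-time messages listed above.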

Hello @Viejo :wink:

Maybe downgrading to the nvidia 470xx driver could change this behavior. At least the GTX 10xx series seems to have some problems with the newest driver. I’m not sure about that, but for me (GTX 1050 Ti) it solved some problems.

Downgrading the driver has not had an effect, unfortunately.

Really uncomfortable with this issue since it’s putting my hardware at risk. Please let me know if anyone else has any ideas or something I should be looking at.

Thank you everyone!

You need to set Coolbits; see here: Roberto Leinardi / GreenWithEnvy · GitLab

Thank you for the ideas, but the same behavior is persisting. I had previously set Coolbits to 8 when I installed GWE, but neither that nor changing it to “12” has had an impact on the fan behavior.
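
For reference, Coolbits is a bitmask, so 8 and 12 do not enable the same features. The sketch below decodes a value using the bit meanings as I understand them from the NVIDIA driver README (4 for manual fan control, 8 for clock offsets, 16 for overvoltage); other bits are left out, and the function names are hypothetical. It only illustrates the arithmetic and does not explain the stuck fan, since 12 made no difference here either.

```python
# Coolbits bits as described in the NVIDIA driver README (subset only).
COOLBITS_FLAGS = {
    4: "manual GPU fan speed control",
    8: "clock offsets on the PowerMizer page (overclocking)",
    16: "overvoltage control",
}


def decode_coolbits(value):
    """List the features enabled by a Coolbits value, lowest bit first."""
    return [desc for bit, desc in COOLBITS_FLAGS.items() if value & bit]


def allows_fan_control(value):
    """True when the fan-control bit (4) is set in the Coolbits value."""
    return bool(value & 4)
```

With these definitions, `allows_fan_control(8)` is false and `allows_fan_control(12)` is true, because 12 is 8 + 4.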

We can close this thread - I’ll be resetting the PC. Thank you!

I use nfancurve to set my GPU fan curve depending on the temperature. It works great for me, and I don’t need to bother with Coolbits manually; it takes care of that by itself at start.

Before that I tried GWE and nvfancontrol, but they are a little bit crappy in my opinion.
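
To illustrate what a temperature-based fan curve does, here is a rough sketch that maps a temperature to a fan duty by linear interpolation between curve points. It is not taken from nfancurve, which may well work differently. The curve values and the name `fan_speed_for` are invented for this example, and the points are assumed to have distinct temperatures.

```python
# Hypothetical curve: (temperature in °C, fan duty in %).
EXAMPLE_CURVE = [(35, 25), (50, 40), (65, 60), (80, 100)]


def fan_speed_for(temp_c, curve):
    """Return the fan duty (%) for temp_c, interpolating between points."""
    points = sorted(curve)
    # Clamp to the first and last points outside the curve's range.
    if temp_c <= points[0][0]:
        return points[0][1]
    if temp_c >= points[-1][0]:
        return points[-1][1]
    # Find the segment containing temp_c and interpolate linearly.
    for (t0, d0), (t1, d1) in zip(points, points[1:]):
        if t0 <= temp_c <= t1:
            return round(d0 + (d1 - d0) * (temp_c - t0) / (t1 - t0))
```

With this example curve, a GPU sitting at 45°C would be asked for about 35% fan duty, nowhere near 100%. A tool would then have to apply that value, for instance through the `GPUFanControlState` and `GPUTargetFanSpeed` attributes of `nvidia-settings`, which as far as I know require the Coolbits fan-control bit.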