CPU Scheduler Issues

Hi, I’ve tried searching for this problem, but naturally this combination of keywords doesn’t pull up many useful results. Mostly I get threads asking how to adjust the frequency governor, which is not my problem.

I’m using the latest Manjaro with XFCE, kernel 6.9.10-1 on an 8c/16t Intel i7-11800H. I’m not sure if that’s specifically a mobile CPU, but this is in a laptop.

My issue is that the CPU scheduler seems to let threads dwell on a single core for far, far too long. A long-running task can peg one core until it hits 99C before it will shift load to another core. This happens with all sorts of tasks, but my main issue is during C++ compilation and startup of CLion. Installing AUR packages can also produce this if they’re big enough.

What my problem is not:
CPU frequency/governor: i7z and cpupower output seem sane and correct. Governor correctly scales frequency based on load and temperature, boost to 4.6GHz works correctly.
Core congestion: In all cases, all CPU cores are at less than ~20% utilization except the one at 100%.
Power/thermal: Heatsinks are clean and functional. Under normal load, the temperatures are similar to what I get under Windows.
Workload related: I see this behavior with many different workloads; I just happen to run compilation workloads more often than anything else.

The core which gets overloaded is always random, which is what I would expect.

I have a CPU load widget in my taskbar that shows load per core. This agrees with htop and i7z, and both clearly show one core being maxed out for tens of seconds at a time. On other Linux systems I don’t see a core maxed out for more than a couple of seconds before it swaps to another core, even with similar workloads. Those systems are usually Debian based, with the latest stable kernel.
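For anyone who wants numbers rather than a taskbar widget, the observation above can be reproduced with a short Python sketch. It assumes the standard Linux /proc/stat layout (per-core lines named cpu0, cpu1, ... followed by tick counters); the function names and the 95% threshold are made up for illustration and are not taken from htop or i7z.

```python
import time

def parse_proc_stat(text):
    """Return {core_name: (busy_ticks, total_ticks)} for each per-core line."""
    cores = {}
    for line in text.splitlines():
        fields = line.split()
        # Per-core lines look like "cpu0 ..."; skip the aggregate "cpu" line
        # and unrelated lines such as "intr" or "ctxt".
        if not fields or not fields[0].startswith("cpu") or fields[0] == "cpu":
            continue
        # user, nice, system, idle, iowait, irq, softirq, steal.
        # Guest time is already counted inside user/nice, so it is left out.
        values = [int(v) for v in fields[1:9]]
        idle = values[3] + (values[4] if len(values) > 4 else 0)
        total = sum(values)
        cores[fields[0]] = (total - idle, total)
    return cores

def utilization(before, after):
    """Percent busy per core between two parse_proc_stat() snapshots."""
    result = {}
    for name, (busy1, total1) in after.items():
        busy0, total0 = before.get(name, (0, 0))
        elapsed = total1 - total0
        result[name] = 100.0 * (busy1 - busy0) / elapsed if elapsed > 0 else 0.0
    return result

def pegged_cores(util, threshold=95.0):
    """Names of cores at or above the (arbitrary) threshold."""
    return sorted(name for name, pct in util.items() if pct >= threshold)

def sample_live(interval=1.0):
    """Take two snapshots of /proc/stat and report per-core utilization."""
    with open("/proc/stat") as f:
        before = parse_proc_stat(f.read())
    time.sleep(interval)
    with open("/proc/stat") as f:
        after = parse_proc_stat(f.read())
    return utilization(before, after)
```

Calling pegged_cores(sample_live()) in a loop during a compile would show whether the same core name keeps coming back for tens of seconds.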

I’m convinced this is either a bug or some edge case in the CPU scheduler. I’m far from an expert in this area, but it seems to me that the scheduler should not allow one thread to dwell on a single core for this long, or allow a single core to be 40-50C above all others.

I’m concerned about physical damage to the CPU from this, and it also makes the fans cycle rapidly between 0 and 100% because of the temperature swings. This is becoming extremely annoying.

So here’s the question I can’t find an answer to: Is there any sort of adjustment I can make to the scheduler to alleviate this behavior? Perhaps a setting for maximum dwell time, or maybe just change it out for an entirely different algorithm? I know basically nothing about the scheduler in Linux, and it’s proven pretty difficult to find any information.

Can you report the output of the following?

inxi -Fza
sudo mhwd-kernel -li
sudo mhwd -li
cpupower frequency-info
sudo turbostat (first part, including idle; q to quit)
lscpu

inxi:

System:
  Kernel: 6.9.10-1-MANJARO arch: x86_64 bits: 64 compiler: gcc v: 14.1.1
    clocksource: tsc avail: acpi_pm
    parameters: BOOT_IMAGE=/@/boot/vmlinuz-6.9-x86_64
    root=UUID=d2c9d7ca-9e7b-4ec0-bb07-dd7b72d0a866 rw rootflags=subvol=@
    quiet
    cryptdevice=UUID=dcb0e670-f1d8-471b-abe0-0463fd298820:luks-dcb0e670-f1d8-471b-abe0-0463fd298820
    root=/dev/mapper/luks-dcb0e670-f1d8-471b-abe0-0463fd298820 splash
    apparmor=1 security=apparmor
    resume=/dev/mapper/luks-066b36e1-14bc-4078-b12f-2c4423d71eb2
    udev.log_priority=3
  Desktop: Xfce v: 4.18.1 tk: Gtk v: 3.24.43 wm: xfwm4 v: 4.18.0
    with: xfce4-panel tools: xfce4-screensaver vt: 7 dm: LightDM v: 1.32.0
    Distro: Manjaro base: Arch Linux
Machine:
  Type: Laptop System: Razer product: Blade 15 Base Model (Mid 2021) -
    RZ09-0410 v: 7.04 serial: <superuser required> Chassis: type: 10
    serial: <superuser required>
  Mobo: Razer model: DA570 v: 4 serial: <superuser required>
    part-nu: RZ09-0410BE22 uuid: <superuser required> UEFI: Razer v: 1.03
    date: 08/03/2021
Battery:
  ID-1: BAT0 charge: 58.3 Wh (100.0%) condition: 58.3/65.0 Wh (89.7%)
    volts: 17.3 min: 15.4 model: Razer Blade type: Unknown serial: <filter>
    status: full
  Device-1: hidpp_battery_0 model: Logitech ERGO M575 Trackball
    serial: <filter> charge: 30% rechargeable: yes status: discharging
CPU:
  Info: model: 11th Gen Intel Core i7-11800H bits: 64 type: MT MCP
    arch: Tiger Lake gen: core 11 level: v4 note: check built: 2020
    process: Intel 10nm family: 6 model-id: 0x8D (141) stepping: 1
    microcode: 0x50
  Topology: cpus: 1x cores: 8 tpc: 2 threads: 16 smt: enabled cache:
    L1: 640 KiB desc: d-8x48 KiB; i-8x32 KiB L2: 10 MiB desc: 8x1.2 MiB
    L3: 24 MiB desc: 1x24 MiB
  Speed (MHz): avg: 1764 high: 4500 min/max: 800/4600 scaling:
    driver: intel_pstate governor: powersave cores: 1: 1831 2: 3896 3: 4500
    4: 800 5: 1973 6: 800 7: 800 8: 1805 9: 2147 10: 800 11: 800 12: 2837
    13: 2312 14: 800 15: 1334 16: 800 bogomips: 73744
  Flags: avx avx2 ht lm nx pae sse sse2 sse3 sse4_1 sse4_2 ssse3 vmx
  Vulnerabilities:
  Type: gather_data_sampling mitigation: Microcode
  Type: itlb_multihit status: Not affected
  Type: l1tf status: Not affected
  Type: mds status: Not affected
  Type: meltdown status: Not affected
  Type: mmio_stale_data status: Not affected
  Type: reg_file_data_sampling status: Not affected
  Type: retbleed status: Not affected
  Type: spec_rstack_overflow status: Not affected
  Type: spec_store_bypass mitigation: Speculative Store Bypass disabled via
    prctl
  Type: spectre_v1 mitigation: usercopy/swapgs barriers and __user pointer
    sanitization
  Type: spectre_v2 mitigation: Enhanced / Automatic IBRS; IBPB:
    conditional; RSB filling; PBRSB-eIBRS: SW sequence; BHI: SW loop, KVM: SW
    loop
  Type: srbds status: Not affected
  Type: tsx_async_abort status: Not affected
Graphics:
  Device-1: Intel TigerLake-H GT1 [UHD Graphics] vendor: Razer USA
    driver: i915 v: kernel alternate: xe arch: Gen-12.1 process: Intel 10nm
    built: 2020-21 ports: active: eDP-1 empty: DP-1, DP-2, DP-3, DP-4
    bus-ID: 00:02.0 chip-ID: 8086:9a60 class-ID: 0300
  Device-2: NVIDIA GA104M [GeForce RTX 3070 Mobile / Max-Q]
    vendor: Razer USA driver: nvidia v: 550.100 alternate: nouveau,nvidia_drm
    non-free: 550.xx+ status: current (as of 2024-06; EOL~2026-12-xx)
    arch: Ampere code: GAxxx process: TSMC n7 (7nm) built: 2020-2023 pcie:
    gen: 1 speed: 2.5 GT/s lanes: 8 link-max: gen: 4 speed: 16 GT/s lanes: 16
    bus-ID: 01:00.0 chip-ID: 10de:249d class-ID: 0300
  Device-3: 2M UVC CAMERA NexiGo N60 FHD Webcam
    driver: snd-usb-audio,uvcvideo type: USB rev: 2.0 speed: 480 Mb/s lanes: 1
    mode: 2.0 bus-ID: 3-4.1.2:12 chip-ID: 1d6c:0103 class-ID: 0102
    serial: <filter>
  Device-4: IMC Networks USB Camera driver: uvcvideo type: USB rev: 2.0
    speed: 480 Mb/s lanes: 1 mode: 2.0 bus-ID: 3-8:7 chip-ID: 13d3:56bd
    class-ID: 0e02 serial: <filter>
  Display: x11 server: X.org v: 1.21.1.13 compositor: xfwm4 v: 4.18.0
    driver: X: loaded: modesetting,nvidia alternate: fbdev,nouveau,nv,vesa
    dri: iris gpu: i915 display-ID: :0.0 screens: 1
  Screen-1: 0 s-res: 4920x2095 s-size: <missing: xdpyinfo>
  Monitor-1: DP-1-4.2 pos: top-center res: 1080x1920 hz: 60 dpi: 93
    size: 296x527mm (11.65x20.75") diag: 604mm (23.8") modes: N/A
  Monitor-2: DP-1-4.3.1 pos: primary,middle-l res: 1920x1080 hz: 60 dpi: 96
    size: 510x290mm (20.08x11.42") diag: 587mm (23.1") modes: N/A
  Monitor-3: eDP-1 pos: bottom-r res: 1536x864 hz: 144 dpi: 113
    size: 344x194mm (13.54x7.64") diag: 395mm (15.55") modes: N/A
  API: EGL v: 1.5 hw: drv: intel iris drv: nvidia platforms: device: 0
    drv: nvidia device: 1 drv: iris device: 3 drv: swrast gbm: drv: kms_swrast
    surfaceless: drv: nvidia x11: drv: iris inactive: wayland,device-2
  API: OpenGL v: 4.6.0 compat-v: 4.5 vendor: intel mesa v: 24.1.3-manjaro1.1
    glx-v: 1.4 direct-render: yes renderer: Mesa Intel UHD Graphics (TGL GT1)
    device-ID: 8086:9a60 memory: 15.18 GiB unified: yes
Audio:
  Device-1: Intel Tiger Lake-H HD Audio vendor: Razer USA
    driver: snd_hda_intel v: kernel alternate: snd_soc_avs,snd_sof_pci_intel_tgl
    bus-ID: 00:1f.3 chip-ID: 8086:43c8 class-ID: 0403
  Device-2: NVIDIA GA104 High Definition Audio vendor: Razer USA
    driver: snd_hda_intel v: kernel pcie: gen: 4 speed: 16 GT/s lanes: 8
    link-max: lanes: 16 bus-ID: 01:00.1 chip-ID: 10de:228b class-ID: 0403
  Device-3: 2M UVC CAMERA NexiGo N60 FHD Webcam
    driver: snd-usb-audio,uvcvideo type: USB rev: 2.0 speed: 480 Mb/s lanes: 1
    mode: 2.0 bus-ID: 3-4.1.2:12 chip-ID: 1d6c:0103 class-ID: 0102
    serial: <filter>
  Device-4: Generalplus USB Audio Device
    driver: hid-generic,snd-usb-audio,usbhid type: USB rev: 1.1 speed: 12 Mb/s
    lanes: 1 mode: 1.1 bus-ID: 3-4.1.3:14 chip-ID: 1b3f:2008 class-ID: 0300
  API: ALSA v: k6.9.10-1-MANJARO status: kernel-api with: aoss
    type: oss-emulator tools: alsactl,alsamixer,amixer
  Server-1: JACK v: 1.9.22 status: off tools: N/A
  Server-2: PipeWire v: 1.2.1 status: active with: 1: pipewire-pulse
    status: active 2: wireplumber status: active 3: pipewire-alsa type: plugin
    tools: pactl,pw-cat,pw-cli,wpctl
Network:
  Device-1: Intel Tiger Lake PCH CNVi WiFi driver: iwlwifi v: kernel
    bus-ID: 00:14.3 chip-ID: 8086:43f0 class-ID: 0280
  IF: wlo1 state: up mac: <filter>
  Device-2: Realtek RTL8111/8168/8211/8411 PCI Express Gigabit Ethernet
    vendor: Razer USA driver: r8169 v: kernel pcie: gen: 1 speed: 2.5 GT/s
    lanes: 1 port: 3000 bus-ID: 2f:00.0 chip-ID: 10ec:8168 class-ID: 0200
  IF: enp47s0 state: down mac: <filter>
  Device-3: ASIX AX88179 Gigabit Ethernet driver: cdc_ncm type: USB rev: 3.2
    speed: 5 Gb/s lanes: 1 mode: 3.2 gen-1x1 bus-ID: 4-4.1.4:6
    chip-ID: 0b95:1790 class-ID: 0a00 serial: <filter>
  IF: eth0 state: up speed: 1000 Mbps duplex: half mac: <filter>
  Info: services: NetworkManager, systemd-timesyncd, wpa_supplicant
Bluetooth:
  Device-1: Intel AX201 Bluetooth driver: btusb v: 0.8 type: USB rev: 2.0
    speed: 12 Mb/s lanes: 1 mode: 1.1 bus-ID: 3-14:9 chip-ID: 8087:0026
    class-ID: e001
  Report: rfkill ID: hci0 rfk-id: 0 state: up address: see --recommends
Drives:
  Local Storage: total: 719.18 GiB used: 424.48 GiB (59.0%)
  SMART Message: Required tool smartctl not installed. Check --recommends
  ID-1: /dev/nvme0n1 maj-min: 259:0 vendor: Samsung
    model: MZVL2512HCJQ-00B00 size: 476.94 GiB block-size: physical: 512 B
    logical: 512 B speed: 63.2 Gb/s lanes: 4 tech: SSD serial: <filter>
    fw-rev: GXA7601Q temp: 33.9 C scheme: GPT
  ID-2: /dev/sda maj-min: 8:0 model: SATA SSD VLI size: 238.47 GiB
    block-size: physical: 512 B logical: 512 B type: USB rev: 3.1 spd: 10 Gb/s
    lanes: 1 mode: 3.2 gen-2x1 tech: SSD serial: <filter> fw-rev: SBFM
    scheme: GPT
  ID-3: /dev/sdb maj-min: 8:16 vendor: USBest model: Ut165 USB2FlashStorage
    size: 3.76 GiB block-size: physical: 512 B logical: 512 B type: USB rev: 2.0
    spd: 480 Mb/s lanes: 1 mode: 2.0 tech: SSD serial: <filter> fw-rev: 0.00
    scheme: GPT
Partition:
  ID-1: / raw-size: 203.97 GiB size: 203.97 GiB (100.00%)
    used: 101.83 GiB (49.9%) fs: btrfs dev: /dev/dm-0 maj-min: 254:0
    mapped: luks-dcb0e670-f1d8-471b-abe0-0463fd298820
  ID-2: /boot/efi raw-size: 300 MiB size: 299.4 MiB (99.80%)
    used: 728 KiB (0.2%) fs: vfat dev: /dev/sda1 maj-min: 8:1
  ID-3: /home raw-size: 203.97 GiB size: 203.97 GiB (100.00%)
    used: 101.83 GiB (49.9%) fs: btrfs dev: /dev/dm-0 maj-min: 254:0
    mapped: luks-dcb0e670-f1d8-471b-abe0-0463fd298820
  ID-4: /var/log raw-size: 203.97 GiB size: 203.97 GiB (100.00%)
    used: 101.83 GiB (49.9%) fs: btrfs dev: /dev/dm-0 maj-min: 254:0
    mapped: luks-dcb0e670-f1d8-471b-abe0-0463fd298820
Swap:
  Kernel: swappiness: 60 (default) cache-pressure: 100 (default) zswap: yes
    compressor: zstd max-pool: 20%
  ID-1: swap-1 type: partition size: 34.2 GiB used: 10.5 MiB (0.0%)
    priority: -2 dev: /dev/dm-1 maj-min: 254:1
    mapped: luks-066b36e1-14bc-4078-b12f-2c4423d71eb2
Sensors:
  System Temperatures: cpu: 66.0 C mobo: N/A
  Fan Speeds (rpm): N/A
Info:
  Memory: total: 32 GiB note: est. available: 31.09 GiB
    used: 14.71 GiB (47.3%)
  Processes: 436 Power: uptime: 1h 44m states: freeze,mem,disk suspend: deep
    avail: s2idle wakeups: 0 hibernate: platform avail: shutdown, reboot,
    suspend, test_resume image: 12.4 GiB services: upowerd,xfce4-power-manager
    Init: systemd v: 256 default: graphical tool: systemctl
  Packages: pm: pacman pkgs: 1451 libs: 472 tools: octopi,pamac,yay
    pm: flatpak pkgs: 0 Compilers: clang: 18.1.8 gcc: 14.1.1 Shell: Bash
    v: 5.2.26 running-in: xfce4-terminal inxi: 3.3.35

-kernel does not appear to be a valid option for mhwd
mhwd -li:

--------------------------------------------------------------------------------
                  NAME               VERSION          FREEDRIVER           TYPE
--------------------------------------------------------------------------------
     video-modesetting            2020.01.13                true            PCI
video-hybrid-intel-nvidia-prime            2023.03.23               false            PCI


Warning: No installed USB configs!

cpupower:

  driver: intel_pstate
  CPUs which run at the same hardware frequency: 6
  CPUs which need to have their frequency coordinated by software: 6
  maximum transition latency:  Cannot determine or is not supported.
  hardware limits: 800 MHz - 4.60 GHz
  available cpufreq governors: performance powersave
  current policy: frequency should be within 800 MHz and 4.60 GHz.
                  The governor "powersave" may decide which speed to use
                  within this range.
  current CPU frequency: Unable to call hardware
  current CPU frequency: 1.13 GHz (asserted by call to kernel)
  boost state support:
    Supported: yes
    Active: yes

I don’t have turbostat installed

lscpu:

  CPU op-mode(s):         32-bit, 64-bit
  Address sizes:          39 bits physical, 48 bits virtual
  Byte Order:             Little Endian
CPU(s):                   16
  On-line CPU(s) list:    0-15
Vendor ID:                GenuineIntel
  Model name:             11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz
    CPU family:           6
    Model:                141
    Thread(s) per core:   2
    Core(s) per socket:   8
    Socket(s):            1
    Stepping:             1
    CPU(s) scaling MHz:   30%
    CPU max MHz:          4600.0000
    CPU min MHz:          800.0000
    BogoMIPS:             4609.00
    Flags:                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clfl
                          ush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm con
                          stant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpui
                          d aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 
                          ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_dea
                          dline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault 
                          epb cat_l2 cdp_l2 ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow flexpriority
                           ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rdt_a 
                          avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb intel_pt avx512c
                          d sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves split_lock_detect 
                          user_shstk dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp hwp
                          _pkg_req vnmi avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq av
                          x512_vnni avx512_bitalg avx512_vpopcntdq rdpid movdiri movdir64b fsrm avx512
                          _vp2intersect md_clear ibt flush_l1d arch_capabilities
Virtualization features:  
  Virtualization:         VT-x
Caches (sum of all):      
  L1d:                    384 KiB (8 instances)
  L1i:                    256 KiB (8 instances)
  L2:                     10 MiB (8 instances)
  L3:                     24 MiB (1 instance)
NUMA:                     
  NUMA node(s):           1
  NUMA node0 CPU(s):      0-15
Vulnerabilities:          
  Gather data sampling:   Mitigation; Microcode
  Itlb multihit:          Not affected
  L1tf:                   Not affected
  Mds:                    Not affected
  Meltdown:               Not affected
  Mmio stale data:        Not affected
  Reg file data sampling: Not affected
  Retbleed:               Not affected
  Spec rstack overflow:   Not affected
  Spec store bypass:      Mitigation; Speculative Store Bypass disabled via prctl
  Spectre v1:             Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:             Mitigation; Enhanced / Automatic IBRS; IBPB conditional; RSB filling; PBRSB-
                          eIBRS SW sequence; BHI SW loop, KVM SW loop
  Srbds:                  Not affected
  Tsx async abort:        Not affected

That’s a typo.

mhwd-kernel -li

mhwd-kernel:

Currently running: 6.9.10-1-MANJARO (linux69)
The following kernels are installed in your system:
   * linux69

In case some other lost soul is searching the internet for this exact problem and also not finding any help, this appears to be a widespread issue with kernel 6.9 and later, and sometimes 6.6.

Behavior disappears with kernel 6.1.

FYI: From a task scheduling point of view, keeping a thread on a single core is the correct behavior. A thread hopping between cores pays a migration cost each time (its cached data has to be refetched on the new core), which slows it down. Hopping may also move the thread off the preferred core for single-threaded workloads, and honoring that preference may be the behavior that was added in recent kernels. So my suspicion is that kernel 6.1 is the one with the erroneous behavior, not 6.9. Current Linux schedulers don’t take thermals into account; they rely on the system being able to sustain the clocks it advertises.
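To put a number on how long a task actually stays on one core, here is a rough Python sketch. It relies on the "processor" field (field 39 in proc(5)) of /proc/<pid>/stat, which records the CPU the task last ran on; the helper names, the polling interval and the watch() function are hypothetical. Note that for a multi-threaded process the per-thread files live under /proc/<pid>/task/<tid>/stat, which this sketch does not walk.

```python
import time

def last_cpu(stat_line):
    """Extract the CPU a task last ran on from a /proc/<pid>/stat line."""
    # The command name (field 2) is parenthesised and may itself contain
    # spaces or ')', so split on the last ')' before tokenising.
    rest = stat_line.rsplit(")", 1)[1].split()
    # rest[0] is field 3 (state), so field 39 (processor) is rest[36].
    return int(rest[36])

def longest_dwell(samples):
    """Given (timestamp_seconds, cpu) samples in time order, return
    (cpu, seconds) for the longest uninterrupted stay on one CPU."""
    best = (None, 0.0)
    start_time = None
    current_cpu = None
    for timestamp, cpu in samples:
        if cpu != current_cpu:
            # The task moved: start timing a new stay.
            start_time, current_cpu = timestamp, cpu
        dwell = timestamp - start_time
        if dwell > best[1]:
            best = (cpu, dwell)
    return best

def watch(pid, seconds=30.0, interval=0.1):
    """Poll one task for a while and report its longest stay on a CPU.
    The polling itself is coarse, so short migrations can be missed."""
    samples = []
    end = time.monotonic() + seconds
    while time.monotonic() < end:
        with open(f"/proc/{pid}/stat") as f:
            samples.append((time.monotonic(), last_cpu(f.read())))
        time.sleep(interval)
    return longest_dwell(samples)
```

Running watch() against the same compiler process on 6.1 and on 6.9 would give a like-for-like comparison of dwell time, rather than relying on how the load graph looks.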

If your system can’t sustain those clocks, then you should either do some maintenance (e.g. blowing out dust, maybe repasting) or lower the power usage (by lowering the max boost clock, the boost power limit, or similar). That being said, you’re not going to cause damage on an 11th gen core by hitting high temps (13th and 14th gen are potentially different, due to the Intel CPU degradation issue that affects those generations). Current CPUs are typically designed to boost until power or temperature becomes the limit, so seeing higher temps on one core is expected behavior.

I know that people have looked into thermally aware schedulers in the past, but I don’t think there’s any available in the Linux kernel at the moment.
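On the suggestion of lowering the max boost: with the intel_pstate driver shown in the cpupower output, one knob is the max_perf_pct file under /sys/devices/system/cpu/intel_pstate, which limits the P-state as a percentage of the highest supported turbo level. The Python sketch below is only an illustration; the helper names are made up, the percentage-to-frequency mapping is approximate, and 3500 MHz is an arbitrary example cap rather than a recommendation.

```python
PSTATE_DIR = "/sys/devices/system/cpu/intel_pstate"

def perf_pct_for_cap(target_mhz, hw_max_mhz):
    """Approximate max_perf_pct value that keeps clocks at or below target_mhz.

    Assumes the percentage scales linearly with frequency, which is only
    roughly true; rounds down so the cap is not exceeded.
    """
    if not 0 < target_mhz <= hw_max_mhz:
        raise ValueError("target must be positive and no higher than the hardware max")
    return int(100 * target_mhz // hw_max_mhz)

def apply_cap(pct, pstate_dir=PSTATE_DIR):
    """Write the cap to sysfs. Needs root, and is reset on reboot."""
    with open(f"{pstate_dir}/max_perf_pct", "w") as f:
        f.write(str(pct))
```

For the 800 MHz to 4.60 GHz range reported above, perf_pct_for_cap(3500, 4600) gives 76, which apply_cap(76) would then write. The same directory also has a no_turbo file; writing 1 to it disables turbo entirely, which is a blunter version of the same idea.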

