Hi, I’ve tried searching for this problem, but naturally this combination of keywords doesn’t pull up many useful results. Mostly I get threads asking how to adjust the frequency governor, which is not my problem.
I’m using the latest Manjaro with XFCE, kernel 6.9.10-1, on an 8c/16t Intel i7-11800H. That is a mobile (H-series) CPU, and this is in a laptop.
My issue is that the CPU scheduler seems to let threads dwell on a single core for far, far too long. A long-running task can peg one core until it hits 99 °C before the load shifts to another core. This happens with all sorts of tasks, but I notice it most during C++ compilation and CLion startup. Installing AUR packages can also trigger it if they’re big enough.
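In case it helps anyone reproduce this, here is a minimal Python sketch of how the dwell can be watched from user space. It assumes a Linux /proc filesystem and reads field 39 ("processor", the CPU a thread last ran on) from /proc/<pid>/task/<tid>/stat. The names last_cpu and watch are illustrative helpers of my own, not part of any existing tool.

```python
import time
from pathlib import Path


def last_cpu(stat_line: str) -> int:
    """Return field 39 ('processor') from a /proc/<pid>/stat style line."""
    # Field 2 (comm) may contain spaces or parentheses, so split after the
    # last ')' instead of splitting the whole line on whitespace.
    rest = stat_line.rsplit(")", 1)[1].split()
    # rest[0] is field 3 (state), so field 39 is rest[36].
    return int(rest[36])


def watch(pid: int, interval: float = 0.5) -> None:
    """Print a {tid: cpu} map for every thread of `pid` until it exits."""
    task_dir = Path(f"/proc/{pid}/task")
    while task_dir.exists():
        cpus = {}
        try:
            for task in task_dir.iterdir():
                try:
                    cpus[int(task.name)] = last_cpu((task / "stat").read_text())
                except OSError:
                    continue  # thread exited between listing and reading
        except OSError:
            break  # whole process exited
        print(time.strftime("%H:%M:%S"), cpus)
        time.sleep(interval)
```

Calling watch(pid) with the PID of the compiler or CLion process prints, twice a second, which core each thread last ran on. A thread that shows the same core number for tens of seconds is the behavior I’m describing.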
What my problem is not:
CPU frequency/governor: i7z and cpupower output seem sane and correct. The governor scales frequency correctly with load and temperature, and boost to 4.6 GHz works.
Core congestion: In all cases, every core except the one at 100% sits below roughly 20% utilization.
Power/thermal: Heatsinks are clean and functional. Under normal load, the temperatures are similar to what I get under Windows.
Workload related: I see this behavior with many different workloads. I just happen to run compilation workloads more often than anything else.
Which core gets overloaded varies from one occurrence to the next, which is what I would expect.
I have a CPU load widget in my taskbar that shows load per core. It agrees with htop and i7z, and all three clearly show one core being maxed out for tens of seconds at a time. On other Linux systems I don’t see a core maxed out for more than a couple of seconds before the load moves to another core, even with similar workloads. Those systems are usually Debian-based, with the latest stable kernel.
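To put a number on "tens of seconds", the same per-core figures can be sampled directly from /proc/stat. This is only a rough sketch: the function names, the 1-second interval and the 95% threshold are my own assumptions, and it counts idle + iowait as idle time, which is one common convention for this kind of calculation.

```python
import time


def parse_proc_stat(text: str) -> dict[str, tuple[int, int]]:
    """Map 'cpuN' -> (busy_jiffies, total_jiffies) from /proc/stat content."""
    out = {}
    for line in text.splitlines():
        parts = line.split()
        # Keep per-core lines ('cpu0', 'cpu1', ...); skip the aggregate 'cpu'.
        if not parts or not parts[0].startswith("cpu") or parts[0] == "cpu":
            continue
        # user nice system idle iowait irq softirq steal
        vals = [int(v) for v in parts[1:9]]
        idle = vals[3] + vals[4]
        total = sum(vals)
        out[parts[0]] = (total - idle, total)
    return out


def utilization(before, after) -> dict[str, float]:
    """Percent busy per core between two parse_proc_stat() snapshots."""
    util = {}
    for cpu, (busy1, total1) in after.items():
        busy0, total0 = before.get(cpu, (0, 0))
        dt = total1 - total0
        util[cpu] = 100.0 * (busy1 - busy0) / dt if dt > 0 else 0.0
    return util


def read_snapshot():
    with open("/proc/stat") as f:
        return parse_proc_stat(f.read())


def monitor(interval: float = 1.0, threshold: float = 95.0) -> None:
    """Report how long the same core has stayed above `threshold` percent."""
    prev = read_snapshot()
    hot, since = None, time.monotonic()
    while True:
        time.sleep(interval)
        cur = read_snapshot()
        util = utilization(prev, cur)
        prev = cur
        busiest = max(util, key=util.get)
        now = time.monotonic()
        if util[busiest] < threshold:
            hot = None                 # nothing is pegged right now
        elif busiest != hot:
            hot, since = busiest, now  # load moved to a different core
        else:
            print(f"{hot} above {threshold}% for {now - since:.0f}s")
```

Running monitor() during a compile prints a growing counter for as long as the same core stays pegged, which makes it easy to compare this machine against the Debian-based ones.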
I’m convinced this is either a bug or some edge case in the CPU scheduler. I’m far from an expert in this area, but it seems to me that the scheduler should not let one thread dwell on a single core for this long, or let a single core run 40-50 °C hotter than all the others.
I’m concerned about physical damage to the CPU from this. The rapid temperature swings also make the fans cycle between 0 and 100%, which is becoming extremely annoying.
So here’s the question I can’t find an answer to: Is there any sort of adjustment I can make to the scheduler to alleviate this behavior? Perhaps a setting for maximum dwell time, or maybe just change it out for an entirely different algorithm? I know basically nothing about the scheduler in Linux, and it’s proven pretty difficult to find any information.