Random hardlocks, softlocks, freezes

Hello. Despite being extremely inexperienced with Linux, I have used Manjaro Xfce for the past two years as my first operating system without issue across three computers. For the past 4 months, though, my primary desktop has been giving me trouble.

I have had horrible lockups at random, in all applications, in which the mouse, keyboard, power button and screen are all frozen. If video or audio is playing, it loops the last 2 seconds endlessly until I hit the reset button on my PC.

I also encounter boot issues: a grey screen with 3 dots will appear, and after a minute it will say

watchdog: BUG: soft lockup - CPU#11 stuck for 53s!

(the CPU number and the number of seconds vary)

which forces me to reset again, enter the BIOS and boot from there. Everything then works for 2-8 hours until it freezes again. Sometimes it reboots on the first try, other times it fails to boot.

I have read through this site for answers, and I truly do not understand how to resolve this issue based on other users' input so far. I have tried updating everything and I am still experiencing the same issues. I did not update or change the BIOS or system settings during the first 6 months of flawless operation before the freezing began.

System info:
Kernel: 6.1.25-1
Desktop: Xfce
CPU: AMD Ryzen 5 3600X 6-Core Processor
GPU: PowerColor Red Devil AMD Radeon RX 6600 XT
RAM: Corsair Vengeance LPX 32GB (2x16GB) DDR4 3200
Motherboard: ASRock X370 Killer SLI

I looked in QJournalctl and found the following errors, but I do not know what they mean:

kernel: Uhhuh. NMI received for unknown reason 21 on CPU 1.

kernel: Dazed and confused, but trying to continue

kernel: rcu: Possible timer handling issue on cpu=6 timer-softirq=4369

kernel: rcu: rcu_preempt kthread timer wakeup didn't happen for 6004 jiffies! g405 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402

kernel: rcu: 7-…0: (1 GPs behind) idle=f564/1/0x4000000000000000 softirq=1071/1073 fqs=4669

lightdm[1360]: gkr-pam: unable to locate daemon control file

kernel: rcu: Unless rcu_preempt kthread gets sufficient CPU time, OOM is now expected behavior.

kauditd_printk_skb: 55 callbacks suppressed
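For anyone who wants to pull only the stall-related kernel messages out of a long journal, here is a minimal sketch. The function name `filter_stall_lines` is made up for this example, and the patterns are simply taken from the messages quoted above. It only filters text and does not diagnose anything.

```shell
# Keep only log lines that look like the stall messages quoted in this thread:
# soft lockups, RCU stall reports and unknown NMIs.
# Reads log text on stdin, so it works with journalctl output or a saved file.
filter_stall_lines() {
    grep -E 'soft lockup|rcu: |NMI received|Dazed and confused'
}
```

To run it against the boot that froze, something like `journalctl -k -b -1 --no-pager | filter_stall_lines` should work, assuming the journal keeps logs from previous boots. Here `-k` limits the output to kernel messages and `-b -1` selects the previous boot.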

I apologize for my ignorance and hope someone can provide low-skill instructions for fixing this issue.

Everything you’ve described so far suggests hardware failure. One possible culprit may be incorrect voltages being applied across the various circuits, which is usually a BIOS/UEFI issue, but it could also just be a damaged processor, or worse, the motherboard itself. :man_shrugging:

Is this for real? Programmer has a love for cinema I suppose.

yes, unironically according to qjournalctl haha

aragorn, which component is most likely failing? The motherboard, or the CPU having bad silicon?

the motherboard is my oldest part in the build (2017)

That, I’m afraid, I cannot tell. But you can try checking the motherboard for bad capacitors — they will appear bloated — and perhaps reapply thermal paste to the CPU.

It would be worth running a RAM test too, either from a BIOS utility or from boot

Can you report the output of:

inxi -Fza
sudo mhwd-kernel -li

stephane,

Currently running: 6.1.25-1-MANJARO (linux61)

System:
Kernel: 6.1.25-1-MANJARO arch: x86_64 bits: 64 compiler: gcc v: 12.2.1
parameters: BOOT_IMAGE=/boot/vmlinuz-6.1-x86_64
root=UUID=10d2b0ab-9248-41db-99e4-2ecd7c4f1308 rw quiet splash apparmor=1
security=apparmor udev.log_priority=3
Desktop: Xfce v: 4.18.1 tk: Gtk v: 3.24.36 info: xfce4-panel wm: xfwm
v: 4.18.0 vt: 7 dm: LightDM v: 1.32.0 Distro: Manjaro Linux base: Arch Linux
Machine:
Type: Desktop Mobo: ASRock model: X370 Killer SLI/ac
serial: UEFI-[Legacy]: American Megatrends v: P5.50
date: 08/03/2019
CPU:
Info: model: AMD Ryzen 5 3600X bits: 64 type: MT MCP arch: Zen 2 gen: 3
level: v3 note: check built: 2020-22 process: TSMC n7 (7nm)
family: 0x17 (23) model-id: 0x71 (113) stepping: 0 microcode: 0x8701013
Topology: cpus: 1x cores: 6 tpc: 2 threads: 12 smt: enabled cache:
L1: 384 KiB desc: d-6x32 KiB; i-6x32 KiB L2: 3 MiB desc: 6x512 KiB
L3: 32 MiB desc: 2x16 MiB
Speed (MHz): avg: 2333 high: 3800 min/max: 2200/4409 boost: enabled
scaling: driver: acpi-cpufreq governor: schedutil cores: 1: 2200 2: 2200
3: 2200 4: 2200 5: 2200 6: 2200 7: 3800 8: 2200 9: 2200 10: 2200 11: 2200
12: 2200 bogomips: 91059
Flags: avx avx2 ht lm nx pae sse sse2 sse3 sse4_1 sse4_2 sse4a ssse3 svm
Vulnerabilities:
Type: itlb_multihit status: Not affected
Type: l1tf status: Not affected
Type: mds status: Not affected
Type: meltdown status: Not affected
Type: mmio_stale_data status: Not affected
Type: retbleed mitigation: untrained return thunk; SMT enabled with STIBP
protection
Type: spec_store_bypass mitigation: Speculative Store Bypass disabled via
prctl
Type: spectre_v1 mitigation: usercopy/swapgs barriers and __user pointer
sanitization
Type: spectre_v2 mitigation: Retpolines, IBPB: conditional, STIBP:
always-on, RSB filling, PBRSB-eIBRS: Not affected
Type: srbds status: Not affected
Type: tsx_async_abort status: Not affected
Graphics:
Device-1: AMD Navi 23 [Radeon RX 6600/6600 XT/6600M] vendor: Tul /
PowerColor Red Devil driver: amdgpu v: kernel arch: RDNA-2 code: Navi-2x
process: TSMC n7 (7nm) built: 2020-22 pcie: gen: 4 speed: 16 GT/s
lanes: 16 ports: active: HDMI-A-1 empty: DP-1,DP-2,DP-3 bus-ID: 0d:00.0
chip-ID: 1002:73ff class-ID: 0300
Display: x11 server: X Org v: 21.1.8 compositor: xfwm v: 4.18.0 driver: X:
loaded: amdgpu unloaded: modesetting,radeon alternate: fbdev,vesa
dri: radeonsi gpu: amdgpu display-ID: :0.0 screens: 1
Screen-1: 0 s-res: 1280x720 s-dpi: 96 s-size: 338x190mm (13.31x7.48")
s-diag: 388mm (15.27")
Monitor-1: HDMI-A-1 mapped: HDMI-A-0 model: Panasonic Panasonic-TV
serial: built: 2008 res: 1280x720 hz: 60 dpi: 47 gamma: 1.2
size: 698x392mm (27.48x15.43") modes: max: 1280x720 min: 640x480
API: OpenGL Message: Unable to show GL data. Required tool glxinfo
missing.
Audio:
Device-1: AMD Navi 21/23 HDMI/DP Audio driver: snd_hda_intel v: kernel pcie:
gen: 4 speed: 16 GT/s lanes: 16 bus-ID: 0d:00.1 chip-ID: 1002:ab28
class-ID: 0403
Device-2: AMD Starship/Matisse HD Audio vendor: ASRock
driver: snd_hda_intel v: kernel pcie: gen: 4 speed: 16 GT/s lanes: 16
bus-ID: 0f:00.4 chip-ID: 1022:1487 class-ID: 0403
API: ALSA v: k6.1.25-1-MANJARO status: kernel-api with: aoss
type: oss-emulator tools: alsamixer,amixer
Server-1: sndiod v: N/A status: off tools: aucat,midicat,sndioctl
Server-2: JACK v: 1.9.22 status: off tools: N/A
Server-3: PipeWire v: 0.3.70 status: off tools: pw-cli
Server-4: PulseAudio v: 16.1 status: active tools: pacat,pactl,pavucontrol
Network:
Device-1: Intel Dual Band Wireless-AC 3168NGW [Stone Peak] driver: iwlwifi
v: kernel pcie: gen: 1 speed: 2.5 GT/s lanes: 1 bus-ID: 04:00.0
chip-ID: 8086:24fb class-ID: 0280
IF: wlp4s0 state: down mac:
Device-2: Intel I211 Gigabit Network vendor: ASRock driver: igb v: kernel
pcie: gen: 1 speed: 2.5 GT/s lanes: 1 port: f000 bus-ID: 05:00.0
chip-ID: 8086:1539 class-ID: 0200
IF: enp5s0 state: up speed: 100 Mbps duplex: full mac:
Bluetooth:
Device-1: Intel Wireless-AC 3168 Bluetooth type: USB driver: btusb v: 0.8
bus-ID: 1-9:3 chip-ID: 8087:0aa7 class-ID: e001
Report: rfkill ID: hci1 rfk-id: 2 state: up address: see --recommends
Device-2: Realtek Bluetooth Radio type: USB driver: btusb v: 0.8
bus-ID: 3-1:2 chip-ID: 0bda:8771 class-ID: e001 serial:
Report: ID: hci0 rfk-id: 1 state: up address: N/A
Drives:
Local Storage: total: 1.88 TiB used: 94.34 GiB (4.9%)
SMART Message: Required tool smartctl not installed. Check --recommends
ID-1: /dev/nvme0n1 maj-min: 259:0 vendor: Samsung model: SSD 980 500GB
size: 465.76 GiB block-size: physical: 512 B logical: 512 B speed: 31.6 Gb/s
lanes: 4 type: SSD serial: rev: 2B4QFXO7 temp: 28.9 C scheme: MBR
ID-2: /dev/sda maj-min: 8:0 vendor: Fanxiang model: S101 500GB
size: 465.76 GiB block-size: physical: 512 B logical: 512 B speed: 6.0 Gb/s
type: SSD serial: rev: 415A scheme: MBR
ID-3: /dev/sdb maj-min: 8:16 vendor: Samsung model: ST1000LM024 HN-M101MBB
size: 931.51 GiB block-size: physical: 4096 B logical: 512 B speed: 6.0 Gb/s
type: HDD rpm: 5400 serial: rev: 0001 scheme: MBR
ID-4: /dev/sdc maj-min: 8:32 type: USB model: USB DISK 3.0 size: 57.77 GiB
block-size: physical: 512 B logical: 512 B type: N/A serial:
rev: PMAP scheme: MBR
Partition:
ID-1: / raw-size: 465.76 GiB size: 457.38 GiB (98.20%)
used: 94.34 GiB (20.6%) fs: ext4 dev: /dev/nvme0n1p1 maj-min: 259:1
Swap:
Alert: No swap data was found.
Sensors:
System Temperatures: cpu: 55.0 C mobo: N/A gpu: amdgpu temp: 33.0 C
mem: 32.0 C
Fan Speeds (RPM): N/A gpu: amdgpu fan: 0
Info:
Processes: 305 Uptime: 7m wakeups: 0 Memory: 31.28 GiB used: 2.8 GiB (9.0%)
Init: systemd v: 252 default: graphical tool: systemctl Compilers:
gcc: 12.2.1 clang: 15.0.7 Packages: pm: pacman pkgs: 1356 libs: 449
tools: pamac pm: flatpak pkgs: 0 Shell: Bash v: 5.1.16
running-in: xfce4-terminal inxi: 3.3.26

aragorn, my thermals are decent for air cooling. I've never exceeded 71 °C during my heaviest loads, so I believe my paste is OK.

As for the capacitors, they don't look noticeably deformed. They all fit within the ring printed around them on the board. In some areas they are different sizes, but they look normal, with no bulges or tilting.

It may be worth mentioning that I've tried 2 different CMOS batteries, and neither keeps my board's BIOS settings for longer than 5 minutes if I flip the power switch off. That has always been the case, though. I've had this system and configuration for one year, and the freezing began 3-4 months ago.

Lastly, I used to leave my system powered on for multiple days at a time, Friday to Monday, while I was off work. The first time it ever froze was a Monday after it had been left on. It may be irrelevant, but it's another detail to assist the diagnosis.

CPU temp is high.
You can update your motherboard's BIOS/UEFI firmware.
Don't forget to have a Manjaro live USB at hand in case you need to restore GRUB.

Can you install more kernels:

sudo mhwd-kernel -i linux515
sudo mhwd-kernel -i linux62

and add these options to the kernel boot parameters
"iommu=pt nvme_load=YES processor.max_cstate=5 amd_pstate=passive"

on the GRUB_CMDLINE_LINUX_DEFAULT= line:

sudo nano /etc/default/grub 
sudo update-grub
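If editing the file by hand in nano feels risky, the same change can be sketched as a small shell function. This is only an illustration: `add_kernel_params` is a made-up name, and it assumes the GRUB_CMDLINE_LINUX_DEFAULT value is wrapped in double quotes and that the parameters contain no `|` or `&` characters. It keeps a `.bak` copy of the file so the change can be undone.

```shell
# Append kernel parameters to the GRUB_CMDLINE_LINUX_DEFAULT line of a file.
# Usage: add_kernel_params FILE "param1 param2"
# A backup is written to FILE.bak before the file is rewritten.
add_kernel_params() {
    file="$1"
    params="$2"
    cp -- "$file" "$file.bak" || return 1
    # Read from the backup and write the modified text back to the original:
    # the new parameters go just before the closing double quote.
    sed -E "s|^(GRUB_CMDLINE_LINUX_DEFAULT=\"[^\"]*)\"|\1 $params\"|" "$file.bak" > "$file"
}
```

It is safest to try it on a copy first, for example `cp /etc/default/grub /tmp/grub.test` followed by `add_kernel_params /tmp/grub.test "iommu=pt processor.max_cstate=5"`, and to look at the result before touching the real file. On the real file it would need to run as root, and `sudo update-grub` is still required afterwards.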

I'd second that. Also check the Legacy/UEFI options in your BIOS settings. Something's not right with the partitioning scheme of your drives; they are all:
scheme: MBR
Looks like you’re running in Legacy mode.

What’s the output of:

[ -d /sys/firmware/efi ] && echo UEFI || echo BIOS
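The one-liner above only reads information and changes nothing on the system. Spelled out as a longer sketch: `[ -d PATH ]` tests whether a directory exists, and the kernel only creates /sys/firmware/efi when the system was booted through UEFI. The function name `boot_mode` and its optional argument are made up for this example; the argument is only there so the check can be tried against another directory.

```shell
# Print "UEFI" if the EFI directory exists, otherwise "BIOS".
# With no argument it checks /sys/firmware/efi, like the one-liner.
boot_mode() {
    if [ -d "${1:-/sys/firmware/efi}" ]; then
        echo UEFI
    else
        echo BIOS
    fi
}
```

Running `boot_mode` with no argument should print the same answer as the one-liner.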

All the disks are MBR, which is why it boots in Legacy mode.

Yes, but it's a modern board that is likely mostly used in UEFI mode. If there are issues with the BIOS firmware, they would likely go undetected on the Legacy side; otherwise users of this hardware would have flagged them sooner.
That's also why I seconded your tip to check for firmware updates.

What does MBR mean in this context? Master Boot Record?

I'm so sorry, I have no idea what these commands are or what happens if I enter them in a terminal. Could someone explain the outcome of each command so I have an idea of how to navigate? I am very inexperienced.

6x12, what does this command do?

[ -d /sys/firmware/efi ] && echo UEFI || echo BIOS

I also looked up Legacy mode and UEFI. How would I go about fixing it if necessary? I'd like to learn as much as possible.

I have an overall update on the freezing situation. I checked my RAM in the BIOS, looked at the available profiles, and saw that there were preset timing configurations that did not match what "Auto" had chosen (2866). So yesterday I selected 2100 MHz and noticed different behavior: it would stutter and freeze more often, BUT it would resolve itself within a few seconds. That session lasted roughly 12 hours before it actually froze while I was in Chromium (which hasn't been stable since its last update, so it's hard to tell).

Today I booted and selected the RAM's XMP profile (3200 MHz), as it also had a timing preset. I am currently at 30 minutes of uptime and will update if it freezes again.

-----UPDATE------

Since I took the action in my last post, I have seen noticeable differences within a session. The freezing does not occur as frequently, and it happens during spikes of load rather than randomly or while idle. It also sometimes unfreezes itself after 10 seconds or, bizarrely, when I add or remove a display.

This whole situation is bizarre to me, but one thing has remained constant, the error:

watchdog: BUG: soft lockup - CPU#11 stuck for 53s!

So what I have decided to do this weekend is switch the RAM from slots A2/B2 to A1/B1. I am posting from this configuration now, to see whether it freezes before tomorrow and to narrow the diagnosis further.

If anyone can teach me about the error message listed above, I'd appreciate it, as I am very tech illiterate.
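As general background on that message: the kernel's watchdog prints it when a CPU has spent too long in kernel code without letting anything else run. `CPU#11` is the logical CPU that was stuck, and `53s` is how long it had been stuck when the message was printed. The message reports a symptom and does not by itself say whether hardware, firmware or a driver is the cause. A minimal sketch for listing which CPUs were affected and for how long, with `parse_soft_lockups` as a made-up name for this example:

```shell
# Read log text on stdin and print "CPU SECONDS" for every soft-lockup line,
# e.g. "11 53" for "watchdog: BUG: soft lockup - CPU#11 stuck for 53s!".
# Lines that are not soft-lockup messages are ignored.
parse_soft_lockups() {
    sed -n -E 's/.*soft lockup - CPU#([0-9]+) stuck for ([0-9]+)s!.*/\1 \2/p'
}
```

Fed from the kernel log of the previous boot, for example with `journalctl -k -b -1 --no-pager | parse_soft_lockups`, this shows whether the same CPU number keeps appearing or whether it moves around.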
