Kernel 6.8.5: "BUG: workqueue leaked lock or atomic" + lockup

Blizz · 18 April 2024 07:38

I have had Manjaro installed on this server for a very long time, and it ran kernel 6.7.7, and before that 6.6.26, without issues. Since I installed 6.8.5, my dmesg is full of hundreds of these messages (and eventually the server just crashes):

kernel: BUG: workqueue leaked lock or atomic: kworker/11:0/0x7fffffff/55617
             last function: ata_scsi_dev_rescan
kernel: CPU: 11 PID: 55617 Comm: kworker/11:0 Tainted: G        W          6.8.5-1-MANJARO #1 1ce495db2fdecc34943a97e7c09301e836b40d86
kernel: Hardware name: Intel(R) Client Systems NUC10i7FNH/NUC10i7FNB, BIOS FNCML357.0038.2020.0131.1422 01/31/2020
kernel: Workqueue: events ata_scsi_dev_rescan
kernel: Call Trace:
kernel:  <TASK>
kernel:  dump_stack_lvl+0x47/0x60
kernel:  process_one_work+0x33b/0x350
kernel:  worker_thread+0x30f/0x450
kernel:  ? __pfx_worker_thread+0x10/0x10
kernel:  kthread+0xe5/0x120
kernel:  ? __pfx_kthread+0x10/0x10
kernel:  ret_from_fork+0x31/0x50
kernel:  ? __pfx_kthread+0x10/0x10
kernel:  ret_from_fork_asm+0x1b/0x30
kernel:  </TASK>
kernel: BUG: scheduling while atomic: kworker/11:0/55617/0x00000000
kernel: Modules linked in: rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver rfcomm nfs lockd grace sunrpc netfs veth xt_nat xt_tcpudp x>
kernel:  snd_soc_sst_dsp iwlmvm crct10dif_pclmul snd_soc_acpi_intel_match crc32_pclmul polyval_clmulni i915 snd_soc_acpi polyval_gene>
kernel:  sdhci_pci nvme cqhci sdhci nvme_core spi_intel_pci crc32c_intel mmc_core spi_intel nvme_auth xhci_pci xhci_pci_renesas
kernel: CPU: 11 PID: 55617 Comm: kworker/11:0 Tainted: G        W          6.8.5-1-MANJARO #1 1ce495db2fdecc34943a97e7c09301e836b40d86
kernel: Hardware name: Intel(R) Client Systems NUC10i7FNH/NUC10i7FNB, BIOS FNCML357.0038.2020.0131.1422 01/31/2020
kernel: Workqueue:  0x0 (events)
kernel: Call Trace:
kernel:  <TASK>
kernel:  dump_stack_lvl+0x47/0x60
kernel:  __schedule_bug+0x56/0x70
kernel:  __schedule+0x10f0/0x1520
kernel:  ? ret_from_fork_asm+0x1b/0x30
kernel:  ? dump_stack_lvl+0x4c/0x60
kernel:  schedule+0x32/0xd0
kernel:  worker_thread+0x1b6/0x450
kernel:  ? __pfx_worker_thread+0x10/0x10
kernel:  kthread+0xe5/0x120
kernel:  ? __pfx_kthread+0x10/0x10
kernel:  ret_from_fork+0x31/0x50
kernel:  ? __pfx_kthread+0x10/0x10
kernel:  ret_from_fork_asm+0x1b/0x30
kernel:  </TASK>

This starts about a minute after boot.
I went back to 6.7.7, and the system has been stable for the last 20 minutes.
I do get the “MMIO Stale Data CPU bug present and SMT on, data leak possible.” message, but that appeared with the previous kernel as well.
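For anyone comparing logs across machines: as far as I can tell, the kernel prints this BUG header from `process_one_work()` in kernel/workqueue.c using the layout `<task name>/0x<preempt count>/<PID>`, followed by a `last function:` line naming the work item's function. The Python sketch below assumes exactly that layout (it matches the log above) and pulls those fields out of a saved kernel log. The helper names `parse_leak_reports` and `summarize` are hypothetical, not part of any existing tool.

```python
import re
import sys
from collections import Counter

# Assumed layout of the BUG header (matches the log posted above):
#   BUG: workqueue leaked lock or atomic: <comm>/0x<preempt_count>/<pid>
#        last function: <work function>
# <comm> may itself contain '/', e.g. "kworker/11:0", so the greedy (.+)
# backtracks until "/0x<hex>/<digits>" matches at the end of the line.
LEAK_RE = re.compile(
    r"BUG: workqueue leaked lock or atomic: "
    r"(?P<comm>.+)/0x(?P<preempt>[0-9a-fA-F]+)/(?P<pid>\d+)[ \t]*\n"
    r"[^\n]*?last function: (?P<func>\S+)"
)


def parse_leak_reports(text):
    """Return one dict per leaked-lock report found in a kernel log (hypothetical helper)."""
    reports = []
    for m in LEAK_RE.finditer(text):
        reports.append({
            "comm": m.group("comm"),
            "preempt_count": int(m.group("preempt"), 16),
            "pid": int(m.group("pid")),
            "last_function": m.group("func"),
        })
    return reports


def summarize(reports):
    """Count how often each work function is named as the last function."""
    return Counter(r["last_function"] for r in reports)


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python leakscan.py kernel.log
    with open(sys.argv[1], encoding="utf-8", errors="replace") as fh:
        found = parse_leak_reports(fh.read())
    print(f"{len(found)} report(s)")
    for func, count in summarize(found).most_common():
        print(f"{count:6d}  {func}")
```

Assuming a persistent journal, saving the crashed boot's kernel messages with `journalctl -k -b -1 > kernel.log` and running the script on that file should show whether every report names `ata_scsi_dev_rescan`, as in the log above.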

Kobold · 18 April 2024 10:59

Hello, you are not alone… Welcome to the club:

Blizz · 18 April 2024 11:02

Yes, I saw that topic, but because that one is about 6.6.2x, a kernel version that worked for me, I opted to start a new topic specifically for 6.8.

Kobold · 18 April 2024 11:08

workqueue leaked lock or atomic: kworker... last function: ata_scsi_dev_rescan

I had exactly the same error message as you, but it appeared exactly 10 minutes after booting into the desktop, and then again every 10 minutes after each error message.

So you get regular system lockups from these errors now?

It may be worth posting your inxi output so we can compare our systems.
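A fixed 10-minute spacing like the one described above is easy to check from a saved journal. The sketch below assumes lines in the form produced by `journalctl -k -o short-iso`, i.e. an ISO 8601 timestamp with a UTC offset as the first field; the helper names `leak_intervals` and `is_regular` and the 30-second tolerance are hypothetical choices for illustration. It returns the gaps between consecutive leaked-lock reports so a regular interval stands out.

```python
from datetime import datetime, timedelta

MARKER = "BUG: workqueue leaked lock or atomic"


def leak_intervals(lines):
    """Gaps between consecutive leaked-lock reports (hypothetical helper).

    Assumes each line starts with an ISO 8601 timestamp including a UTC
    offset, as in `journalctl -k -o short-iso` output. Lines without the
    marker, or without a parseable timestamp, are skipped.
    """
    stamps = []
    for line in lines:
        if MARKER not in line:
            continue
        first_field = line.split(maxsplit=1)[0]
        try:
            # %z accepts both "+0200" and "+02:00" (Python 3.7+)
            stamps.append(datetime.strptime(first_field, "%Y-%m-%dT%H:%M:%S%z"))
        except ValueError:
            continue  # not a short-iso timestamp; ignore the line
    return [later - earlier for earlier, later in zip(stamps, stamps[1:])]


def is_regular(gaps, expected=timedelta(minutes=10), tolerance=timedelta(seconds=30)):
    """True if there is at least one gap and every gap is within `tolerance` of `expected`."""
    return bool(gaps) and all(abs(g - expected) <= tolerance for g in gaps)
```

For example, after `journalctl -k -b -1 -o short-iso > kernel.log`, reading the file with `open("kernel.log").read().splitlines()` and passing the result to `leak_intervals` gives the list of gaps; `is_regular` then reports whether they all sit near 10 minutes.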

Blizz · 18 April 2024 12:53

The server does boot a desktop (XFCE), but I don’t use it (SSH / CLI only), so I have no idea if the start of the X environment has anything to do with it.
I can only say that I did not experience this behavior with 6.6 and 6.7.
On 6.8 the server crashed twice within half an hour, but since going back to 6.7 it’s been up for 5.5 hours without issues.

Blizz · 19 April 2024 11:43

Just as an update: the server is still up on 6.7, for more than 24 hours now.
I am willing to run tests against 6.8.5 on the machine if needed, but to be honest I don’t know where to go from here.

raguse · 20 April 2024 05:42

I am no longer getting that log entry with linux 6.8.7-1, which is in the testing branch at the time of writing. You may want to try that one.

Kobold · 20 April 2024 12:24

@raguse
Do you see them with the LTS kernels?

raguse · 21 April 2024 06:11

@Kobold
No such logs with 6.6.28-1 either. I’m not sure whether they happened before with linux66, as I was only using 6.8, and they showed up with it.
By the way, I am on the unstable branch with all its recent updates. Let us know if you get rid of the messages.

Kobold · 21 April 2024 15:07

I will report back if/when these messages vanish.

Something tells me this is an Intel problem, since another user who reported this error in my topic also has an Intel system, and so does the OP here.

Are you using Intel or AMD?

raguse · 22 April 2024 05:36

I am using Intel also.

Blizz · 22 April 2024 17:51

Intel CPU here as well. I was running 6.7, and 6.6 LTS before that. I honestly hadn’t checked the dmesg output back then, so I cannot confirm or deny whether the messages appeared, but it definitely never crashed the server.

s4uliu5 · 23 April 2024 04:36

Related bug report

https://bugzilla.kernel.org/show_bug.cgi?id=218740

Sadly, 6.8.7 has the same issue.