Unable to boot any recent kernel (6.5, 6.4 and 6.1)

Hello,
I first reported this problem in Ext4 errors immediately after a system upgrade. It was a bit unclear at first, but it now seems simpler: my system no longer boots with any recent kernel. These are the kernels I tested that won’t boot:

  • 6.5.0-1
  • 6.4.12-1
  • 6.1.49-1
  • 5.15.128-1

These, on the other hand, seem to work properly:

  • 5.10.192-1
  • 6.1.30-1 (this is the one I found in the ISO)
  • 6.2.?

Apparently the system does not boot because it cannot even read the filesystem. I also tried to find the kernel logs of the failed boot with:

journalctl -b -1

but the failed boot is not even recorded, probably because the filesystem could not be mounted.
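
A minimal sketch for confirming that no journal entry exists for the failed boot. It assumes systemd-journald with its default Storage=auto setting, where logs survive a reboot only if /var/log/journal exists. The function names are hypothetical and not part of any tool.

```shell
# Hypothetical helper: succeeds if a persistent journal directory exists.
# The optional argument overrides the default path (an assumption that
# matches journald's Storage=auto behaviour).
has_persistent_journal() {
    [ -d "${1:-/var/log/journal}" ]
}

# Hypothetical helper: list every boot the journal knows about, then show
# error-level messages from the previous boot.
show_previous_boot_errors() {
    journalctl --list-boots --no-pager
    journalctl -b -1 -p err --no-pager
}

# Usage:
#   has_persistent_journal && show_previous_boot_errors
```

If the failed boot never got far enough to mount the root filesystem, it will simply be missing from the list of boots.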
Any advice?
Thank you.

So someone may be able to help :wink:

Hmm :thinking:

I am a coder but the kernel source scares me - not only the size - but because I have no experience with the clang.

Thank you for your hard work testing the issue … I know how much time you put into this … and I recognize that you have a real problem.

With respect to your hardware - I admit it looks like a regression.

When I say with respect to your hardware, it is founded on the thousands of other systems with no issues and my own constant testing of the latest kernel source.

I rebuild my kernel on a biweekly schedule from the upstream next branch and am currently on

07:16:25 ○ [fh@tiger] ~
 $ uname -r
6.5.0-next-20230830-1-next-git-13390-g56585460cc2e

Did you try one of the ISOs from https://nix.dk? The newest was created so you could test a live ISO with the latest kernel.

Could you please provide some system context?

inxi -Fv7c0

This is the required info:

System:
  Host: luca-precision5520 Kernel: 5.10.192-1-MANJARO arch: x86_64 bits: 64
    compiler: gcc v: 13.2.1 Desktop: KDE Plasma v: 5.27.7 Distro: Manjaro Linux
    base: Arch Linux
Machine:
  Type: Laptop System: Dell product: Precision 5520 v: N/A
    serial: <superuser required>
  Mobo: Dell model: 0GDXD5 v: A07 serial: <superuser required> UEFI: Dell
    v: 1.28.0 date: 03/23/2022
Battery:
  ID-1: BAT0 charge: 71.6 Wh (100.0%) condition: 71.6/97.0 Wh (73.8%)
    volts: 12.7 min: 11.4 model: SMP DELL GPM0365 status: full
CPU:
  Info: quad core model: Intel Xeon E3-1505M v6 bits: 64 type: MT MCP
    arch: Kaby Lake rev: 9 cache: L1: 256 KiB L2: 1024 KiB L3: 8 MiB
  Speed (MHz): avg: 2786 high: 3634 min/max: 800/4000 cores: 1: 2903 2: 3634
    3: 3326 4: 3412 5: 2081 6: 2587 7: 2746 8: 1605 bogomips: 48016
  Flags: avx avx2 ht lm nx pae sse sse2 sse3 sse4_1 sse4_2 ssse3 vmx
Graphics:
  Device-1: Intel HD Graphics P630 vendor: Dell driver: i915 v: kernel
    arch: Gen-9.5 bus-ID: 00:02.0
  Device-2: NVIDIA GM107GLM [Quadro M1200 Mobile] vendor: Dell driver: nvidia
    v: 535.104.05 arch: Maxwell bus-ID: 01:00.0
  Device-3: Microdia Integrated_Webcam_HD driver: uvcvideo type: USB
    bus-ID: 1-12:4
  Display: wayland server: X.org v: 1.21.1.8 with: Xwayland v: 23.2.0
    compositor: kwin_wayland driver: X: loaded: modesetting,nvidia dri: iris
    gpu: i915,nvidia resolution: 1: 3072x1728 2: 1920x1080
  API: OpenGL v: 4.6 Mesa 23.1.6-2 renderer: Mesa Intel HD Graphics P630
    (KBL GT2) direct-render: Yes
Audio:
  Device-1: Intel CM238 HD Audio vendor: Dell driver: snd_hda_intel v: kernel
    bus-ID: 00:1f.3
  API: ALSA v: k5.10.192-1-MANJARO status: kernel-api
  Server-1: JACK v: 1.9.22 status: off
  Server-2: PipeWire v: 0.3.78 status: off
  Server-3: PulseAudio v: 16.1 status: active
Network:
  Device-1: Intel Wireless 8265 / 8275 driver: iwlwifi v: kernel
    bus-ID: 02:00.0
  IF: wlp2s0 state: up mac: 60:f6:77:f5:c1:cf
Bluetooth:
  Device-1: Intel Bluetooth wireless interface driver: btusb v: 0.8 type: USB
    bus-ID: 1-4:2
  Report: rfkill ID: hci0 rfk-id: 0 state: up address: see --recommends
Drives:
  Local Storage: total: 476.94 GiB used: 12.18 GiB (2.6%)
  ID-1: /dev/nvme0n1 vendor: Samsung model: PM961 NVMe SED 512GB
    size: 476.94 GiB temp: 32.9 C
Partition:
  ID-1: / size: 442.83 GiB used: 12.18 GiB (2.7%) fs: ext4 dev: /dev/nvme0n1p2
  ID-2: /boot/efi size: 299.4 MiB used: 288 KiB (0.1%) fs: vfat
    dev: /dev/nvme0n1p1
Swap:
  ID-1: swap-1 type: partition size: 25.66 GiB used: 0 KiB (0.0%)
    dev: /dev/nvme0n1p3
Sensors:
  System Temperatures: cpu: 61.0 C pch: 57.0 C mobo: N/A
  Fan Speeds (rpm): cpu: 2520 fan-2: 2525
Info:
  Processes: 244 Uptime: 10m Memory: total: 24 GiB available: 23.34 GiB
  used: 5.53 GiB (23.7%) Init: systemd Compilers: gcc: 13.2.1 clang: 15.0.7
  Packages: 1091 Shell: Zsh v: 5.9 inxi: 3.3.29

As for the images from https://nix.dk, I already tried the one with kernel 6.5, and unfortunately it is impossible to test: when running the installer, the partitioning fails and the installer stops. The reason why the installer fails may (or may not) be the same problem I’m reporting: being unable to write to the NVMe drive.

Is this the right place to report this or should it be reported upstream? I should probably somehow understand if the same problem happens in other distros but I’m not sure where to find the same kernel versions.

Thanks for your assistance.

The exact same versions are likely difficult to obtain. Fedora would be one option; it is at 6.4, or at least it was some time ago.

I can only provide some ideas … and I am blank for now

One suggestion would be to compile your own kernel - it is not as scary as it sounds - and your system has the power to get it done fairly quickly - perhaps an hour or so.

My suggestion is to build the next kernel - just to see what it brings.

Before you do it - remove your excess kernels

sudo mhwd-kernel -li

Then remove the ones not working

sudo mhwd-kernel -r linux64
sudo mhwd-kernel -r linux64

As you are using Nvidia, I suggest using nvidia-dkms to have the driver built for your system.

Ensure your system is up to date and that base-devel is installed together with the kernel headers, dkms and nvidia-dkms. If you must use an earlier driver, there are nvidia-470xx-dkms and nvidia-390xx-dkms.

sudo pacman -Syu base-devel dkms nvidia-dkms $KERNEL-headers

Copy the file /etc/makepkg.conf to your home

cp /etc/makepkg.conf ~/.makepkg.conf

Edit the file and locate the line starting with CFLAGS= and change the following

CFLAGS="-march=x86-64 -mtune=generic ..."

To use the options native for your cpu

CFLAGS="-march=native ..."

Also edit the MAKEFLAGS="-j2" line to read as follows; this makes full use of your CPU to compile the kernel faster

MAKEFLAGS="-j$(nproc)"

Save the file.
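
The two edits above can also be scripted. This is only a sketch: the helper name is hypothetical, and it assumes the CFLAGS line contains exactly the `-march=x86-64 -mtune=generic` text shown above and that the MAKEFLAGS line may still be commented out with a leading `#`.

```shell
# Hypothetical helper: apply the CFLAGS and MAKEFLAGS edits to a copy of
# makepkg.conf (default: ~/.makepkg.conf).
tune_makepkg_conf() {
    conf="${1:-$HOME/.makepkg.conf}"
    tmp="$conf.tmp"
    # 1st expression: on the CFLAGS line, swap the generic flags for native.
    # 2nd expression: replace the (possibly commented) MAKEFLAGS line; the
    # literal $(nproc) is written to the file and evaluated later by makepkg.
    sed -e '/^CFLAGS=/s|-march=x86-64 -mtune=generic|-march=native|' \
        -e 's|^#\{0,1\}MAKEFLAGS=.*|MAKEFLAGS="-j$(nproc)"|' \
        "$conf" > "$tmp" && mv "$tmp" "$conf"
}

# Usage:
#   cp /etc/makepkg.conf ~/.makepkg.conf
#   tune_makepkg_conf
```

If the CFLAGS line in your file differs from the assumed text, the first expression changes nothing, so check the result before building.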

Then clone, build and install

git clone https://aur.archlinux.org/linux-next-git
cd linux-next-git
makepkg -is

Thanks for your help.

I already started to bisect the linux61 repo from Manjaro. The instructions I followed are slightly different, so I’ll check whether I’m doing something wrong.

As 6.1.30 is good and 6.1.46 is bad, I’m trying to find which one is the first bad. Then I guess I’ll have to test a vanilla kernel.

Is there a quick way to build a vanilla kernel and install it with the same procedure? That would be quicker than going through the regular installation.

  1. Install manjaro-downgrade package

  2. If you are in Manjaro stable branch, run:

$ DOWNGRADE_FROM_ALA=1 sudo manjaro-downgrade linux61
  3. Select any version

Wiki:


If you want to check code change between 6.1.46 and 6.1.30:

https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/diff/?id=v6.1.46&id2=v6.1.30&dt=2


Moderator edit: FIFY :wink:

I don’t know where the right place is.

When you know something actionable from bisecting the source - I suggest you start with an issue with Manjaro Gitlab.

We could ping Philip here, but I suggest we wait until you have something solid.

:+1:

Good work - really appreciated :raised_hand:

I built for many hours but I completed the bisection. This is the offending commit (assuming I did everything properly): 6.1.46 (17f57b5a) · Commits · Packages / Core / linux61 · GitLab.

My understanding is that this commit only updates the patch level from upstream. I see that the patch includes changes related to nvme.

Is it possible to report to some maintainer maybe?

Since you left IRC I’ll write it here: if it really is a kernel problem, you need to bisect the -git version of linux (say between v6.1.45 and v6.1.46) if you want to find the actual commit.
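
A rough sketch of what such a bisection could look like, assuming an existing local clone of the stable kernel tree. The helper name and the ./linux path are hypothetical; the tags are the ones mentioned above.

```shell
# Hypothetical helper: start a bisection in an existing clone between a
# known-good and a known-bad revision.
start_kernel_bisect() {
    repo="$1"; good="$2"; bad="$3"
    git -C "$repo" bisect start &&
    git -C "$repo" bisect bad "$bad" &&
    git -C "$repo" bisect good "$good"
}

# Usage (assumes ./linux is a clone of the stable tree):
#   start_kernel_bisect ./linux v6.1.45 v6.1.46
# Build and boot the revision git checks out, then record the result:
#   git -C ./linux bisect good     # or: git -C ./linux bisect bad
# Repeat until git reports the first bad commit, then clean up:
#   git -C ./linux bisect reset
```

Each round needs a full build and a reboot, so a narrow range between two adjacent point releases keeps the number of rounds manageable.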

Thank you, this is exactly what I’ve been doing for the last few hours, and it is the reason why I asked about this procedure on IRC. Unfortunately, I was welcomed by someone answering “it won’t help”, “you screwed up”, “you should learn troubleshooting” etc…

Luckily, you arrived and provided a good link. Thank you. I had to leave to reboot and test the new kernel.

I have to say that the procedure is almost identical to this one that I already followed: PKGBUILD - aur.git - AUR Package Repositories. So that is probably what I can use to bisect the upstream repo right?

I searched the Arch forum for ‘Inspiron 5520’ issues, and the first topic listed reported an nvme error following the update to kernel v6.4.11 on 2023-08-19.

The Dell Precision 5520 is mentioned in post #8 with a suggested workaround

[Solved] Cannot boot after an upgrade, nvme error / Kernel & Hardware / Arch Linux Forums
Blacklisting the Realtek SD module on my Dell Precision 5520 worked for me

Post #7 links to another topic that confirms the nvme issue and clarifies the workaround

kernel 6.4.11 bug prevents boot on some hardware [with SOLUTION] / System Administration / Arch Linux Forums
There is a bug with the rtsx driver in 6.4.11 that prevented one of my machines from booting. In my case it presented as NVME failure and thus preventing machine from booting.
6.4.10 is fine.

A workaround is simply to blacklist the driver (it’s only a card reader).
i.e. Add
blacklist rtsx_pci
blacklist rtsx_pci_sdmmc
to /etc/modprobe.d/blacklist_rtsx.conf

Rebuild the initramfs so that the blacklist is included in it.

More details on this are available on lkml including the git bisect:
LKML: Keith Busch: Re: Possible nvme regression in 6.4.11
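
The quoted workaround could be applied along these lines. This is a sketch only: the helper name is hypothetical, and it assumes the initramfs is generated with mkinitcpio, as on a default Arch or Manjaro install.

```shell
# Hypothetical helper: write the two blacklist entries quoted above.
# The optional argument overrides the target directory.
write_rtsx_blacklist() {
    dir="${1:-/etc/modprobe.d}"
    # printf repeats the format once per argument, giving one line per module.
    printf 'blacklist %s\n' rtsx_pci rtsx_pci_sdmmc > "$dir/blacklist_rtsx.conf"
}

# Usage (as root):
#   write_rtsx_blacklist
#   mkinitcpio -P      # assumption: regenerates the initramfs for all presets
```

Removing the file and rebuilding the initramfs again undoes the workaround once a fixed kernel is installed.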

Another forum search for ‘blacklist rtsx’ located 3 more topics from Aug 2023 confirming the workaround
[SOLVED] dropped to emergency rootfs after 6.4.10->6.4.11 update / Newbie Corner / Arch Linux Forums
Device is not a valid LUKS device. / System Administration / Arch Linux Forums
[SOLVED] Dell XPS 15 9570 suddenly fails to boot. Is the SS failing? / Kernel & Hardware / Arch Linux Forums

The only weird thing then is that @luc4 says 6.1.45 still works.

OP: did you manage to bisect it further?

If not you can test linux-git-v6.1.45.r0.1321ab403b38 and linux-git-v6.1.45.r74.e146162dcf2e: https://easyupload.io/m/92v0oj

EDIT:

Never mind, I read that kernel version wrong (kernel 6.4.11 bug prevents boot).

nvme-pci: add NVME_QUIRK_BOGUS_NID for Samsung PM9B1 256G and 512G was indeed back-ported between 6.1.45 and .46.

Took me some time but I solved my problem.

The new SSD I bought arrived. After I replaced the old one, both issues are 100% gone. By “both issues” I mean the boot issue with kernels >= 6.1.46 AND the flood of warnings from any kernel version, even very old versions:

[ 1731.160845] pcieport 0000:00:1d.0: AER: Corrected error received: 0000:04:00.0
[ 1731.160862] nvme 0000:04:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
[ 1731.160867] nvme 0000:04:00.0:   device [144d:a804] error status/mask=00000001/00006000
[ 1731.160870] nvme 0000:04:00.0:    [ 0] RxErr                  (First)

This means I now have zero problems or suspicious situations (so far at least).
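
To judge how heavy such a flood is on a given kernel, the corrected-error reports can be counted. A small sketch, with a hypothetical helper name; it matches only the `AER: Corrected error received` header line shown in the log above.

```shell
# Hypothetical helper: count corrected PCIe AER reports in kernel log text
# read from standard input. Prints 0 when there are none.
count_aer_corrected() {
    grep -c 'AER: Corrected error received'
}

# Usage, reading the kernel messages of the current boot:
#   journalctl -k -b --no-pager | count_aer_corrected
```

Comparing the count before and after swapping the drive gives a quick check that the warnings really went away with the old SSD.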

This being said, I have to say I still suspect a regression in the kernel somewhere. My old “possibly broken” SSD is still “apparently” perfectly working with any kernel I tested < 6.1.46 (I worked for hours with it), and cannot even be mounted by kernels >= 6.1.46.

I thought I could place back the old SSD and bisect upstream commits, but I have no idea how much the results on a possibly broken SSD would be appreciated.

The links @nikgnomic provided look very interesting, may very well be the same problem. Even the hardware is sometimes the same.

Thanks everyone for the help.

Well I guess you have more money than sense time. You can send broken SSD to me, I’m willing to pay for postage. :stuck_out_tongue:

Actually the flood of logs clearly suggests there is a problem with my older SSD, so I thought that, in any case, that was needed. But the boot issue does look a lot like a regression of some kind.

The patch you referred to in your previous post is one of those I noticed in the list of nvme driver changes between 6.1.45 and 6.1.46. It seems even 5.15 is broken for me, but I do not see that patch included there. 5.10, on the other hand, is OK. If you think it may be interesting for someone, I can put the old SSD back and try to build and test it.

Well, I’d guess that in the first place it would be “interesting” to you, since it’s your disk. :stuck_out_tongue:

Anyway, if you wish you can do one last test with linux-git-v6.1.51.r1.a55bf5bcd959: https://easyupload.io/m/of2xy9

Is this a build of 6.1.51 with the previously mentioned commit reverted?

Yes, hopefully.

The Bugzilla report seems to suggest that the issue affects systems with the i7-7820HQ CPU
217802 – regression NVME failure in 6.4.11 : 6.4.10 works fine.
‘Me too’ comments confirm that the following Dell systems are affected

  • Dell Precision 5520
  • Dell XPS 15 9560

There may be other users on similar systems who need a workaround whilst awaiting a kernel patch