Ext4 errors immediately after a system upgrade

Hello! I’m currently unable to boot my system. What I see is this:

It happened for the first time after a system upgrade. It looks like a hardware issue, but the fact that it happened right after an upgrade is suspicious.

I already tried booting from a USB stick and running fsck on every partition, but no errors at all were reported. I also mounted the partitions and read and wrote files on them, and everything worked properly, but errors like these are flooding the kernel log buffer:

[ 1731.160845] pcieport 0000:00:1d.0: AER: Corrected error received: 0000:04:00.0
[ 1731.160862] nvme 0000:04:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
[ 1731.160867] nvme 0000:04:00.0:   device [144d:a804] error status/mask=00000001/00006000
[ 1731.160870] nvme 0000:04:00.0:    [ 0] RxErr                  (First)

My next attempt will probably be to reinstall and see whether the problem is a hardware issue. Any other ideas?
Thanks!

Main difference in this case between your installed system and the one from USB is the Kernel - try a different one, and not the very newest.

Thank you for your answer. I already tried an older kernel by selecting it in grub. No difference.

I don’t own an NVMe drive and thus have no experience with them, but the pattern of these I/O errors looks remarkably like the one on a spinning drive that has reached the end of its life.

If you can boot from USB and check, mount and use the file system, but can’t do the same using the installed system, then the above is my best guess.

You could try to chroot into the installed system and have a look at the logs while you use it that way (running an update and/or reading/writing some files …).

Other than that? :man_shrugging:


Some system info should be added to aid the troubleshooting process:

inxi -Fxyc0

Please don’t post pictures - use formatted text </> instead.

Is your nvme an Intel VMD?

IF yes

Typing all that mess would take forever, sorry.

No, it is a Samsung drive.

However, I just noticed that those kernel messages:

[ 1731.160845] pcieport 0000:00:1d.0: AER: Corrected error received: 0000:04:00.0
[ 1731.160862] nvme 0000:04:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
[ 1731.160867] nvme 0000:04:00.0:   device [144d:a804] error status/mask=00000001/00006000
[ 1731.160870] nvme 0000:04:00.0:    [ 0] RxErr                  (First)

also appear when the internal drive is not being used at all. This may even point to a motherboard issue, which is plausible: I have already replaced 4 motherboards on this Dell machine.

At this point I’ll try with a fresh install.

Hmm - that is indeed a bad omen.

I have ISO with kernel 6.4 and 6.5 on https://nix.dk - feel free to test if the latest kernel makes a difference.

These can be fixed by adding some kernel parameters; reinstalling will not help, they will be there again… These spam logs fill your hard drive, since there are hundreds of them every second… and that is probably why you get the ext4 errors…
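To check how fast the log is really filling up, the AER lines can be pulled out of a saved kernel log and counted. This is only a minimal Python sketch, assuming the log was saved as plain text (for example with `dmesg > dmesg.txt`). The function name `parse_aer_errors` and the regular expression are my own and are written against the message format quoted in this thread, so other kernels may format the lines differently.

```python
import re

# Matches lines such as:
# [ 1731.160862] nvme 0000:04:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
AER_LINE = re.compile(
    r"^\[\s*(?P<ts>\d+\.\d+)\]\s+\S+\s+"
    r"(?P<dev>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]):\s+"
    r"PCIe Bus Error: severity=(?P<sev>[^,]+), type=(?P<type>[^,]+),"
)

def parse_aer_errors(log_text):
    """Return a list of (timestamp, device, severity, type) tuples,
    one per 'PCIe Bus Error' line found in log_text."""
    errors = []
    for line in log_text.splitlines():
        m = AER_LINE.match(line.strip())
        if m:
            errors.append((float(m["ts"]), m["dev"], m["sev"], m["type"]))
    return errors

# The four lines quoted in the thread; only the second one is a
# 'PCIe Bus Error' line, the others belong to the same event.
SAMPLE = """\
[ 1731.160845] pcieport 0000:00:1d.0: AER: Corrected error received: 0000:04:00.0
[ 1731.160862] nvme 0000:04:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
[ 1731.160867] nvme 0000:04:00.0:   device [144d:a804] error status/mask=00000001/00006000
[ 1731.160870] nvme 0000:04:00.0:    [ 0] RxErr                  (First)
"""

if __name__ == "__main__":
    for ts, dev, sev, kind in parse_aer_errors(SAMPLE):
        print(ts, dev, sev, kind)
```

On a full log, dividing the number of returned entries by the difference between the last and first timestamp gives a rough errors-per-second figure.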

Thanks. I tried first with the regular ISO and it booted properly. Then I updated it, and it stopped booting again. So, either the update is breaking the machine or the hardware is broken. I’m not completely sure how to determine which one it is. Unfortunately, I do not have alternative hardware to test with at the moment.

Yes, I read about it. However, I’m sure my machine never printed that before. And ignoring the fact that the kernel is not happy does not sound right. I’m even using kernel 6.1, which is not new… but the error is still there.

However I’ll try again with the kernel param and see if the system stays up. Apparently it is simple to break it.

This may be pointing to a kernel regression - which is why I suggested trying an ISO with the latest stable (6.4) or the mainline (6.5) kernel as in doing so you can verify if it is hardware or software.

If the issue persists, it is likely hardware; if it does not, it is a regression in 6.1.

If your issue is caused by a runaway log, edit journald.conf (I use micro as terminal editor):

journald.conf(5) — Arch manual pages

sudo micro /etc/systemd/journald.conf
SystemMaxFileSize=50M

The PCI errors are on your NVMe drive, so maybe something was introduced in the latest kernel… You definitely need to add those parameters. You can add them from chroot… or better, add them in the GRUB menu…
Add one at a time, and test:

pci=nomsi
pci=nommconf
pcie_aspm=off
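To keep one of these parameters across reboots, the usual place on a GRUB-based install is, as far as I know, the `GRUB_CMDLINE_LINUX_DEFAULT` line in `/etc/default/grub`, followed by regenerating the GRUB configuration. A rough Python sketch of that edit, working on the file contents as a string; `add_kernel_param` is a hypothetical helper name, and the sketch assumes the line exists and is quoted:

```python
import re

# Assumes a single, quoted GRUB_CMDLINE_LINUX_DEFAULT line.
CMDLINE = re.compile(
    r'^(GRUB_CMDLINE_LINUX_DEFAULT=)(["\'])(.*?)\2[ \t]*$',
    re.MULTILINE,
)

def add_kernel_param(grub_text, param):
    """Return grub_text with param appended to GRUB_CMDLINE_LINUX_DEFAULT.

    The parameter is not added twice if it is already present.
    """
    def repl(m):
        params = m.group(3).split()
        if param not in params:
            params.append(param)
        quote = m.group(2)
        return f"{m.group(1)}{quote}{' '.join(params)}{quote}"

    new_text, n = CMDLINE.subn(repl, grub_text)
    if n == 0:
        raise ValueError("GRUB_CMDLINE_LINUX_DEFAULT line not found")
    return new_text

if __name__ == "__main__":
    example = 'GRUB_TIMEOUT=5\nGRUB_CMDLINE_LINUX_DEFAULT="quiet splash"\n'
    print(add_kernel_param(example, "pci=nommconf"))
```

The helper only returns the new text. Writing it back to the file needs root, and the change only takes effect after the GRUB configuration is regenerated and the machine is rebooted. For a one-off test, editing the kernel line in the GRUB menu at boot is quicker.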

I did another test and I got the same result: no boot after the upgrade. I upgraded with pamac update. It is my understanding though that the kernel is not updated in this case. From the settings, I still had the same kernel. However, after the reboot, I had the same failure.

I tried the Manjaro ISO with kernel 6.5: the ISO booted, but the installer failed to partition the disk. It is probably a bug, because I installed at least 5 distros and all of them installed properly.

My first step when running into a problem like this, with a system that no longer boots, would be to restore my last Timeshift snapshot, which I “always” create before I install a full system update.

Maybe switch between the LTS Kernels and update again… probably something went wrong while updating and it was only a temporary issue.

5 Distros woow, my system would explode from that too :wink:

Is there a good reason to have 5 distros installed?

I mean that I tested the system with 5 different installations to see if the errors from the kernel were still there. The errors are there even in Ubuntu 18. Even though it works fine, the errors are something new for me, so a hardware issue is possible.

I’ve been using this machine for 6 years now, and it was born as a Linux machine. I really do not think those pcie errors were ever present.

I also tried to install Ubuntu 22 and I upgraded to Ubuntu 23. The upgrade was successful. Still the logs are there. With Manjaro I’m a bit stuck. After the upgrade, it won’t boot, whatever kernel I use apparently.

I’ll do more tests with the params provided before, and I’ll also replace the SSD. I’m not confident though.

Thanks for your help.

I did some more tests and I have a theory of what is happening. On one side, the PCIe logs may really be related to a hardware issue, probably some degradation of the communication on the bus, but they are unrelated to the boot issue. The logs may be ignored for the moment.

My theory about the boot issue is instead this: a severe regression was introduced, and is currently present in kernel 6.5. Not only that, but the regression must have been backported to 6.1, which is the reason why I ruled out a kernel issue at first. In my original system, both 6.5 and 6.1 were already broken, so switching did not help. This Manjaro ISO, instead, features an older 6.1 kernel, which works for me. In particular, 6.1.46 and 6.5 are broken. 6.2.16-2 is OK; it probably did not receive the backport. 6.1.30 seems to be OK, and in fact the Manjaro ISO boots properly from USB.
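If the theory holds, the backport has to sit somewhere between the last 6.1.y release that boots and the first one that does not. A small Python sketch of that narrowing, using only the versions reported in this post; the names `parse_version` and `regression_window` are made up for illustration, and a version given without a patch level (such as "6.5") is treated as patch level 0, which is an assumption:

```python
def parse_version(v):
    """Turn '6.1.46' or '6.2.16-2' into a sortable tuple.

    The package revision after '-' is ignored, and a missing patch
    level is assumed to be 0.
    """
    base = v.split("-")[0]
    parts = [int(p) for p in base.split(".")]
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)

def regression_window(results, series):
    """For one stable series, e.g. (6, 1), return (last_good, first_bad).

    results maps a version string to True (boots) or False (does not).
    Either element is None if no such version was observed.
    """
    in_series = sorted(
        (parse_version(v), v, ok)
        for v, ok in results.items()
        if parse_version(v)[:2] == series
    )
    good = [v for _, v, ok in in_series if ok]
    bad = [v for _, v, ok in in_series if not ok]
    return (good[-1] if good else None, bad[0] if bad else None)

# Boot results as reported in this post.
OBSERVED = {
    "6.1.30": True,
    "6.1.46": False,
    "6.2.16-2": True,
    "6.5": False,
}

if __name__ == "__main__":
    print(regression_window(OBSERVED, (6, 1)))
```

For the 6.1 series this leaves the releases after 6.1.30 and up to 6.1.46 as candidates, which is the range one would bisect to find the offending commit.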

Do you think this theory may be reasonable?

I opened a new discussion here in the proper forum section, as now the situation is a bit more clear: Unable to boot every recent kernels 6.5, 6.4 and 6.1.

For those in the same situation: replacing the SSD solved both issues discussed in the thread. One was probably hardware-related, the other is a kernel regression here: kernel/git/stable/linux.git - Linux kernel stable tree. The kernel regression still seems to be there, with no sign of work on it yet.
