It happened for the first time after a system upgrade. It sounds like a hardware issue, but the fact that it happened after an upgrade is suspicious.
I already tried to boot from a USB stick and run fsck on every partition, but no error whatsoever is reported. I also tried to mount the partitions and read/write inside them, and everything worked properly, but errors like these are flooding the kernel log buffer:
I don’t own an NVMe drive and thus have no experience with them, but the pattern of these I/O errors looks remarkably like the one on a spinning drive that has reached the end of its life.
If you can boot from USB, then check, mount and use the file system, but can’t do the same from the installed system, then the above is my best guess.
You could try to chroot into the installed system and have a look at the logs while you use it that way (running an update and/or reading/writing some files …).
They also appear when there is no use whatsoever of the internal memory. It may even mean a motherboard issue, which is likely: I have already replaced 4 motherboards on this Dell machine.
These can be fixed by adding some kernel parameters; reinstalling will not help, they will be there again… These spam logs fill your hard drive, since there are hundreds of them every second… and that is probably why you have the ext errors…
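To get a feel for how heavy the flood is and which device it comes from, the kernel messages can be tallied per PCI address. This is only a minimal sketch: the function name `count_aer_by_device`, the regular expression, and the sample lines are illustrative assumptions, not output from the machine in this thread, and the exact wording of AER messages can differ between kernel versions.

```python
import re
from collections import Counter

# Assumed message shape: "<domain:bus:dev.fn>: AER: ..." or
# "<domain:bus:dev.fn>: PCIe Bus Error: ...". Adjust to your own log.
AER_LINE = re.compile(
    r"(?P<dev>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): (?:AER:|PCIe Bus Error:)",
    re.IGNORECASE,
)


def count_aer_by_device(lines):
    """Count PCIe error messages per PCI address in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = AER_LINE.search(line)
        if match:
            counts[match.group("dev")] += 1
    return counts


# Hypothetical sample lines, for illustration only.
sample = [
    "[   12.345678] pcieport 0000:00:1d.0: AER: Corrected error received: 0000:00:1d.0",
    "[   12.345690] nvme 0000:02:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer",
    "[   12.345701] pcieport 0000:00:1d.0: AER: Corrected error received: 0000:00:1d.0",
    "[   12.400000] EXT4-fs (nvme0n1p2): mounted filesystem",
]

for device, count in count_aer_by_device(sample).most_common():
    print(device, count)
```

Feeding it the lines of a saved `dmesg` dump instead of `sample` shows at a glance whether the messages really come from the NVMe slot or from another device on the bus.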
Thanks. I tried first with the regular ISO and it booted properly. Then I updated it, and it stopped booting again. So, either the update is breaking the machine or the hardware is broken. I’m not completely sure how to determine which one it is. Unfortunately, I do not have alternative hardware to test with at the moment.
Yes, I read about it. However, I’m sure my machine never printed that. And ignoring the fact that the kernel is not happy does not sound right. I’m even using kernel 6.1, which is not new… but still the error is there.
However, I’ll try again with the kernel param and see if the system stays up. Apparently it is easy to break it.
This may point to a kernel regression, which is why I suggested trying an ISO with the latest stable (6.4) or the mainline (6.5) kernel: doing so lets you verify whether it is hardware or software.
If the issue persists, it is likely hardware; if it does not, it is a regression in 6.1.
If your issue is caused by a runaway log, edit journald.conf (I use micro as a terminal editor).
The PCI errors are on your NVMe drive, so maybe something was introduced in the latest kernel… You definitely need to add those parameters; you can add them from chroot… or better, add them in the GRUB menu…
add one at a time, and test:
I did another test and I got the same result: no boot after the upgrade. I upgraded with pamac update. It is my understanding though that the kernel is not updated in this case. From the settings, I still had the same kernel. However, after the reboot, I had the same failure.
I tried with Manjaro 6.5: the ISO booted but the installer fails to partition the disk. Probably a bug because I installed at least 5 distros and all installed properly.
My first step when I run into a problem like this, with a system that no longer boots, would be to restore my last Timeshift snapshot, which I “always” create before installing a full system update.
Maybe switch between the LTS kernels and update again… probably something went wrong while updating and it was only a temporary issue.
5 Distros woow, my system would explode from that too
Is there a good reason to have 5 distros installed?
I mean that I tested the system with 5 different installations to see if the errors from the kernel were still there. The errors are there even in Ubuntu 18. Even though it works fine, the errors are something new for me, so a hardware issue is possible.
I’ve been using this machine for 6 years now, and it was born as a Linux machine. I really do not think those pcie errors were ever present.
I also tried to install Ubuntu 22 and I upgraded to Ubuntu 23. The upgrade was successful. Still the logs are there. With Manjaro I’m a bit stuck. After the upgrade, it won’t boot, whatever kernel I use apparently.
I’ll do more tests with the params provided before, and I’ll also replace the SSD. I’m not confident though.
I did some more tests and I have a theory of what is happening. On one hand, the PCIe logs may really be related to a hardware issue, probably some degradation of the communication on the bus, but they are unrelated to the boot issue. The logs can be ignored for the moment.
My theory about the boot issue is instead this: a severe regression was introduced, and is currently present in kernel 6.5. Not only that, but the regression must have been backported to 6.1, which is the reason why I ruled out a kernel issue at first. In my original system, both 6.5 and 6.1 were already broken, so switching did not help. This Manjaro ISO, instead, features an older 6.1 kernel, which works for me. In particular, 6.1.46 and 6.5 are broken. 6.2.16-2 is OK; it probably did not receive backports. 6.1.30 seems to be OK, and in fact the Manjaro ISO boots properly from the USB.
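The test results above already bracket where the backport must have landed in the 6.1 series. As a rough sketch (the helper names `parse_version` and `regression_window` are made up for illustration, and treating 6.5 as 6.5.0 is an assumption), the window can be derived like this:

```python
def parse_version(version):
    """Turn '6.2.16-2' into (6, 2, 16); the package revision after '-' is ignored."""
    core = version.split("-", 1)[0]
    parts = [int(p) for p in core.split(".")]
    while len(parts) < 3:
        parts.append(0)  # assumption: '6.5' is treated as 6.5.0
    return tuple(parts[:3])


# Boot results as reported in this thread: True = boots, False = broken.
results = {"6.1.30": True, "6.1.46": False, "6.2.16-2": True, "6.5": False}


def regression_window(results, series):
    """Return (last_good, first_bad) within one major.minor series, or None."""
    in_series = sorted(
        (parse_version(v), v, ok)
        for v, ok in results.items()
        if parse_version(v)[:2] == series
    )
    last_good = None
    for _, version, ok in in_series:
        if ok:
            last_good = version
        elif last_good is not None:
            return (last_good, version)
    return None


print(regression_window(results, (6, 1)))
```

With the versions tested so far, the suspect change would sit somewhere between 6.1.30 and 6.1.46; testing a point release in the middle of that range halves the window each time.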
For those in the same situation: replacing the SSD solved both issues discussed in the thread. One was probably hardware-related, the other is a kernel regression here: kernel/git/stable/linux.git - Linux kernel stable tree. The kernel regression still seems to be there, with no signs of work on it yet.