Important EXT4 problem in kernel 6.1.64 (and maybe others)

There has been some discussion on the kernel mailing list
https://lore.kernel.org/stable/20231205122122.dfhhoaswsfscuhc3@quack3/
about the possibility of EXT4 data corruption.

The issue was identified in the Debian 12.3 “Bookworm” release but the same kernel version is currently in use in Manjaro stable.

The bug may exist in other kernels as well.

So it would be a good idea to check whether the Manjaro kernels also suffer from this bug, and possibly to push kernel 6.1.66, which has the problematic commit reverted and is already in the testing branch, to stable.

My primary source of information: https://www.root.cz/zpravicky/debian-ma-chybu-v-jadre-a-rozbiji-ext4-zatim-neaktualizujte/ (in Czech)


And here is an English-language report, including links to the Debian bug: Hold Off Debian Upgrades: Kernel 6.1.64 ext4 Bug Alert


Newer kernels with that commit reverted are now in the stable branch.


Uff, that's horrible news… How can I know which files on my ext4 partitions got corrupted, since I installed the last stable update on 12.01?

Should I maybe restore my Timeshift snapshot before I update now?

Bad news indeed. Just checked my system, and it seems like my (actual) data did not get affected, but I had similar “hidden” data corruption because of a low-level Cryptomator bug a while ago, and it was a nasty situation. However, I wonder how often this bug gets triggered on actual systems, considering there have been no reports of corrupted data (or at least not more than usual)?


A question for the knowledgeable: in which kernel versions is it fixed (I have 6.1.x and 6.6.x)?
And how can we check for corruption, fsck?

P.S. According to Debian, 6.1.66 should be safe, but what about 6.6.x? If I understand the link above correctly, it is not affected?
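A quick way to see which kernel you are actually booted into and which series you have installed (uname is universal; mhwd-kernel is Manjaro's own kernel helper):

```bash
# kernel you are running right now
uname -r

# kernel packages installed on a Manjaro system
mhwd-kernel -li
```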

I wonder if the bug is limited to ext4; I have had some REALLY strange things happening on my btrfs in the last few weeks.
Pamac failed to build because a file was just magically removed. Yesterday the directory for a mountpoint was missing; I switched to root in Dolphin to create the dir again, and suddenly it appeared, but NFS still refused to mount, not really giving me any reason why. I do not use the option x-mount.mkdir.
It also lists the directory as last modified on Sep 16, so… :person_shrugging:

A reboot and everything was like nothing ever happened.

A scrub on the btrfs shows no errors. :open_mouth:

I have 6.1 installed but have not used it, only 6.5 and 6.6, so it seems I should be unaffected?
My ext4 partition shows no signs of corruption, but it is also not involved with my system whatsoever, other than being a game partition.

When I look at Philm’s quote, it looks like ZFS is also affected.

How in the hell can there be issues this big with our filesystems?

This stuff should be triple-checked before changes happen.

This is not a pre-alpha game that I decided to play with :face_with_diagonal_mouth:

Seems like bcachefs will get a running start, considering the ZFS and ext4 issues lately and the general unreliability of btrfs when using RAID levels other than 1.

About the 6.6.3 kernel, and 6.1.64 around 3–4 December (from the stable kernels).

…by a completely different issue.

You answered your own question. :wink: :point_down:

Yes.

If you are on the new broken kernel already: probably reboot into a live image, run fsck, check the live data against your backups, and restore backups of any lost/broken files.

If you are on the old kernel but have the new broken kernel installed: purge the new broken kernel, prevent it from being installed by holding packages back, etc., and wait until the fixed kernel is available; then install it and reboot into it.
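A minimal sketch of that hold-back approach on Manjaro, assuming the affected series is the linux61 package (adjust to your setup):

```bash
# list installed kernel series
mhwd-kernel -li

# remove the affected series while booted into a different one
sudo mhwd-kernel -r linux61

# or keep it installed but hold it back from updates by adding
# the package to IgnorePkg in /etc/pacman.conf:
#   IgnorePkg = linux61
```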

https://lwn.net/Articles/954312/

And how exactly can we check these files?

I have never done anything with fsck. How can we identify which files are corrupt… do we need the newest live image to compare our updated and possibly corrupted hard drive files against the live image?

And what is the command that we need for that?

Any live USB will do; it does not compare anything. What is important is that the filesystem is not mounted, so no chroot. It should be something like

fsck -t ext4 /dev/sdb2

for example.
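For anyone who has never used fsck: a minimal sketch of the whole procedure from a live USB, assuming the partition is /dev/sdb2 as in the example above (find yours with lsblk):

```bash
# identify the ext4 partition to check
lsblk -f

# make sure it is not mounted
sudo umount /dev/sdb2

# read-only pass first: -f forces a check even if the fs looks clean,
# -n answers "no" to every repair prompt, so nothing is changed
sudo fsck -t ext4 -f -n /dev/sdb2

# only if errors were reported, run it again interactively to repair
sudo fsck -t ext4 -f /dev/sdb2
```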


According to this mail (by a SUSE employee concerned with this sort of stuff, I hope) the corruption is a pure data corruption. Does it even make sense to fsck in this case? And comparing backups is also problematic, considering files may simply have changed because they were… well, changed. What would be really useful would be a list of popular applications making use of the O_SYNC|O_DIRECT combo in question, but such a thing is obviously not something readily available.
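Such a list is indeed not readily available, but as a rough heuristic you can scan the file descriptors of currently running processes for that flag combination via /proc. A sketch (the octal flag values are the usual Linux x86-64 constants from <asm-generic/fcntl.h>; this only sees descriptors open right now, so it proves nothing about what happened in the past):

```bash
#!/bin/bash
# List processes currently holding a file open with O_SYNC|O_DIRECT.
# Run with sudo to see processes of other users as well.
O_DIRECT=$(( 8#40000 ))
O_SYNC=$((  8#4010000 ))   # __O_SYNC | O_DSYNC on Linux

for fdinfo in /proc/[0-9]*/fdinfo/*; do
    # the "flags:" line holds the descriptor's open(2) flags in octal
    flags=$(awk '/^flags:/ { print $2 }' "$fdinfo" 2>/dev/null)
    [ -n "$flags" ] || continue
    flags=$(( 8#$flags ))

    # require the full O_SYNC bit pattern, not just O_DSYNC
    if (( (flags & O_DIRECT) && (flags & O_SYNC) == O_SYNC )); then
        pid=${fdinfo#/proc/}; pid=${pid%%/*}
        printf 'PID %s (%s): fd %s\n' \
            "$pid" "$(cat "/proc/$pid/comm" 2>/dev/null)" "${fdinfo##*/}"
    fi
done
```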


I'm actually thinking whether it wouldn't be better to do a Timeshift rollback to the snapshot I created on 11.30, skip the faulty stable update released on 12.01 (of which I also have a Timeshift backup, btw), and just update my system again.

On the other hand, I'm not sure what the result is when I just skip a stable release, and whether that wouldn't lead to a bigger issue at the end of the day.
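For weighing that up, Timeshift does have a CLI for inspecting and restoring snapshots; a small sketch (the snapshot name is hypothetical, and note that restoring a system snapshot does not repair data files written in the meantime):

```bash
# list available snapshots with their creation dates
sudo timeshift --list

# restore a specific snapshot (Timeshift asks for confirmation first)
sudo timeshift --restore --snapshot '2023-11-30_12-00-01'
```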

Please report back, @Teo, what your result is from this fsck.

I was thinking exactly the same; it's pretty hard to find out where the corrupted files actually are and how to compare them manually.

I'm also wondering why almost no one is talking about this problem right now.
:see_no_evil: :hear_no_evil: :speak_no_evil:

Because it’s already fixed.


It's fixed right now, but the damage is done if we got broken/corrupt files while we were using this kernel… the files aren't magically fixed now. I was using this kernel for 9 full days, and maybe many people here were too.

We need a solution for this… everyone with ext4 is in the same boat right now, and the hole was there; at least that is what the developers told us: data loss is/was possible during this time.

I don't even have an idea how exactly these corrupt files were triggered. Was it from writing, or could even reading a file lead to this problem? How big are the chances of getting corrupted files… was it 1%, 10% or 50%?

Nothing is answered… and almost no one cares?

I read later (in the Red Hat forum, I think) that this should not be an issue on 6.5 and above, and since I am currently on 6.6 I do not think there is corruption, so I have not scanned.
It would be nice if someone could confirm that 6.5 and above are trouble-free.

What kind of solution do you want? You were told 10x already that you can run fsck. That’s the only solution you’ll get.

So if I am understanding this correctly, only those running LTS kernels (besides 6.6) had the problem?