Live ISO default LUKS-encrypted configuration leads to root filesystem corruption

I installed using the live-installer with the default proposed LUKS encryption configuration and swap for hibernation:

  • unencrypted ESP : vfat
  • ROOT (easier to read than the UUID) partition : luks encrypted container with ext4 inside
  • SWAP (easier to read than the UUID) partition : luks encrypted container with swap inside

After a few days I noticed that ROOT ext4 fs was getting corrupted systematically after resuming from hibernation.

kernel: EXT4-fs (dm-0): Delayed block allocation failed for inode <inode number> at logical offset <offset value> with max blocks 3 with error 117
kernel: EXT4-fs (dm-0): This should not happen!! Data will be lost

So I started digging and, thanks to the Manjaro and Arch forums, found the following kernel warning:

* BIG FAT WARNING *********************************************************
 *
 * If you touch anything on disk between suspend and resume...
 *				...kiss your data goodbye.
 *
 * If you do resume from initrd after your filesystems are mounted...
 *				...bye bye root partition.
 *			[this is actually same case as above]
 *

This was very interesting, because Manjaro’s live-installer default installation does exactly what the warning says not to do, given the contents of the encrypt, openswap and resume scripts.

So here is what happens on boot:

  1. Grub asks and gets the password, it does its own decryption and loads the kernel and initramfs
    initramfs contains /crypto_keyfile.bin as per the default /etc/mkinitcpio.conf
    FILES="/crypto_keyfile.bin"
    This keyfile can be used to decrypt both ROOT and SWAP per the live installer choices.

  2. encrypt hook decrypts ROOT to luks-ROOT using the default ckeyfile=/crypto_keyfile.bin provided in the initramfs (mounted at /), but then erases it in its last lines [Why do that ???]
    rm -f ${ckeyfile}

  3. openswap hook mounts luks-ROOT to a tmp directory, gets the keyfile and decrypts the SWAP to luks-SWAP using default variables from /etc/openswap.conf

keyfile_device=/dev/mapper/luks-ROOT
keyfile_filename=crypto_keyfile.bin

and then unmounts luks-ROOT.
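
The openswap part of this sequence can be sketched as a dry run that only prints the commands. This is an illustration of the steps as described in this post, not the actual source of the openswap hook; the function name, the /tmp/keydev mount point and the /dev/disk/by-uuid/SWAP path are hypothetical placeholders.

```shell
#!/bin/sh
# Dry-run illustration: prints the commands, never executes them.
# NOT the real openswap hook; mount point and swap device path are placeholders.
print_openswap_steps() {
    keyfile_device=$1      # e.g. /dev/mapper/luks-ROOT (from /etc/openswap.conf)
    keyfile_filename=$2    # e.g. crypto_keyfile.bin
    swap_device=$3         # placeholder for the encrypted SWAP partition
    mnt=/tmp/keydev        # hypothetical temporary mount point

    # This is the step that mounts the root filesystem before resume:
    echo "mount $keyfile_device $mnt"
    # Open the SWAP container with the keyfile found on the mounted root:
    echo "cryptsetup open --key-file $mnt/$keyfile_filename $swap_device luks-SWAP"
    # Root is unmounted again before the resume hook runs:
    echo "umount $mnt"
}

print_openswap_steps /dev/mapper/luks-ROOT crypto_keyfile.bin /dev/disk/by-uuid/SWAP
```

The first printed line is the one that matters here: the root filesystem gets mounted while a hibernation image that still references it is waiting to be resumed.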

So root is getting mounted before resume… just to access a keyfile that is available in the initramfs (but that the encrypt hook erases…).

Anyway, here are the 3 changes I made to get rid of the root filesystem corruption:

  1. comment out the rm -f ${ckeyfile} line in /usr/lib/initcpio/hooks/encrypt
    I don’t understand why that line is there…? (maybe it is useful for erasing keyfiles other than the default “/crypto_keyfile.bin”)
  2. set keyfile_device=/ and keyfile_device_mount_options="--bind" in /etc/openswap.conf

That way I’m using what is in the initramfs and not mounting the luks-ROOT ext4 filesystem before resume.
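
Put together, the relevant lines of /etc/openswap.conf after these changes would look roughly like this. Only the three variables named in this post are shown; any other settings in the file are assumed to stay as the installer wrote them.

```shell
# /etc/openswap.conf (fragment, only the lines discussed in this post)

# Read the keyfile from the initramfs root instead of the decrypted luks-ROOT
keyfile_device=/
# Bind-mount instead of a filesystem mount, so ext4 on luks-ROOT is never mounted
keyfile_device_mount_options="--bind"
# Unchanged from the default
keyfile_filename=crypto_keyfile.bin
```

This only works together with the first change: if the encrypt hook still runs rm -f ${ckeyfile}, the keyfile is already gone from the initramfs by the time openswap looks for it. Since both files end up inside the initramfs, the image presumably has to be regenerated (for example with mkinitcpio -P) before the change takes effect.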

The warning means you can corrupt your filesystem if you write to the partition between suspend and resume.

Yes I understood the warning the same way you do…
But the reality is:

  • that just the mounting seems to cause corruption: my 3 simple changes have eradicated it, and I haven’t had any since I implemented them (last 2 days, with multiple hibernates to test it…).
  • that others have switched to swap files on ROOT to solve the corruption thereby removing openswap and its mounting…

The answer I’m really interested in is: why the `rm -f ${ckeyfile}` in the encrypt hook?

Next dumb question on my part:

When the “default” openswap configuration mounts luks-ROOT, it does so with no mount options, which means that luks-ROOT is actually mounted with “options=defaults”, which includes relatime, whereas Manjaro uses noatime…

Could [in certain conditions] an atime update of the keyfile happen on access, thereby modifying the fs and causing the corruption?
If yes, another single-step fix could be just adding keyfile_device_mount_options="-o noatime" to the default /etc/openswap.conf

I think you are overthinking this.

I have created several encrypted installations using Calamares with hibernate / using swapfile (not partition) - I have not experienced any issues on subsequent use.

edit: just realized that it was using swap partition - nonetheless - no issues yet.

I guess I’ll eventually remove my modifications to see if the corruption comes back (and test my second idea if it does).

There was a kernel bug causing ext4 corruption earlier this month. It has been fixed. That may have been the cause of your disk corruption and not the way LUKS is configured on Manjaro.

OK, so it may have been a freak coincidence… I will definitely return to the default configuration to test, and report back!

Well, I just

  1. switched back to default openswap configuration…
  2. rebooted
  3. hibernated
  4. resumed from hibernate
  5. checked dmesg
[ 2039.499037] EXT4-fs error (device dm-0): ext4_validate_inode_bitmap:105: comm ThreadPoolForeg: Corrupt inode bitmap - block_group = 227, inode_bitmap = 7340051
[ 2040.550706] EXT4-fs error (device dm-0): ext4_validate_inode_bitmap:105: comm BgIOThr~Pool #6: Corrupt inode bitmap - block_group = 231, inode_bitmap = 7340055
[ 2040.550730] EXT4-fs error (device dm-0) in ext4_free_inode:362: Filesystem failed CRC
[ 2065.256011] EXT4-fs error (device dm-0) in ext4_free_inode:362: Corrupt filesystem
[ 2073.601310] EXT4-fs error (device dm-0) in ext4_free_inode:362: Corrupt filesystem
[ 2073.717714] EXT4-fs error (device dm-0) in ext4_free_inode:362: Corrupt filesystem
[ 2075.264874] EXT4-fs error (device dm-0) in ext4_free_inode:362: Corrupt filesystem
[ 2114.248267] EXT4-fs error (device dm-0) in ext4_free_inode:362: Corrupt filesystem
[ 2120.880949] EXT4-fs error (device dm-0) in ext4_free_inode:362: Corrupt filesystem
[ 2146.514891] EXT4-fs error (device dm-0) in ext4_free_inode:362: Corrupt filesystem
[ 2146.515090] EXT4-fs error (device dm-0) in ext4_free_inode:362: Corrupt filesystem
[ 2146.515483] EXT4-fs error (device dm-0): ext4_validate_block_bitmap:421: comm ThreadPoolForeg: bg 98: bad block bitmap checksum
[ 2146.515496] EXT4-fs error (device dm-0) in ext4_mb_clear_bb:6605: Filesystem failed CRC
[ 2146.515507] EXT4-fs error (device dm-0) in ext4_free_inode:362: Corrupt filesystem
[ 2146.515533] EXT4-fs error (device dm-0) in ext4_mb_clear_bb:6605: Corrupt filesystem
[ 2146.515555] EXT4-fs error (device dm-0) in ext4_free_inode:362: Corrupt filesystem
[ 2146.516007] EXT4-fs error (device dm-0) in ext4_free_inode:362: Corrupt filesystem
[ 2146.516078] EXT4-fs error (device dm-0) in ext4_free_inode:362: Corrupt filesystem
[ 2182.894748] EXT4-fs error (device dm-0) in ext4_free_inode:362: Corrupt filesystem
[ 2183.033375] EXT4-fs error (device dm-0) in ext4_free_inode:362: Corrupt filesystem

I hadn’t had any with my custom openswap configuration that does not mount luks-ROOT…
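
To repeat this check after each resume without reading the whole log, a small helper along these lines could be used. It is only a sketch: the function name is made up for illustration, and it simply counts lines containing the "EXT4-fs error (device …)" text seen in the log above.

```shell
#!/bin/sh
# Count "EXT4-fs error" lines for one device in kernel log text read from stdin.
# $1: device name as it appears in the log, e.g. dm-0
count_ext4_errors() {
    # -F: match the text literally, -c: print the number of matching lines.
    # grep exits non-zero when nothing matches, so "|| true" keeps the
    # function usable under "set -e"; the count (0) is still printed.
    grep -c -F "EXT4-fs error (device $1)" || true
}

# Typical use after resuming from hibernation (may need root):
#   dmesg | count_ext4_errors dm-0
```

A result of 0 after several hibernate and resume cycles is what the modified configuration gave here; any other number means the root filesystem is reporting errors again.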

going to test the noatime mount option now…

Adding -o noatime did not help… Indeed, after the first resume from hibernation the luks-ROOT ext4 corruption was so bad I could not write to my home directory!

Switched back to my fix that avoids mounting luks-ROOT before resume hook, ran fsck and then restored my home directory (as my kdeconfig had been borked…)

All is well now… No more testing I rest my case with all the testing done.

No. It’s not the same bug. This problem only occurs with LUKS and a separate swap partition, where ext4 has to be mounted before resume to get the keyfile. I’m affected too. This problem was first mentioned in 2014 on Arch Linux.
@linux-aarhus
You may not have any problems, but this has been happening since at least 2015: [solved] Issue with file system after hibernation / Newbie Corner / Arch Linux Forums
I have the exact same situation, resulting in hibernation not being possible on my laptop.

In that case it would be wise not to use ext4 if you must hibernate an encrypted system.

Due diligence would cause one to not use hibernation in such case.

Are you kidding me? Firstly, this use case will also damage file systems such as xfs and any other journaling file system. And in the case of btrfs it will most likely lead to data loss. Secondly, this is an error in the distribution package, and in fact a fix for the error has now been worked out in this topic. Are you suggesting that I change distribution because I need hibernation? Okay, noted.
And lastly: this is the DEFAULT setup with DEFAULT partitioning.

It seems we should look into this issue more. Since there is a potential solution available, we should check whether it is correct and whether other systems are affected by this and simply don’t know or have just been lucky.

There are many products and software involved, and we should avoid any of them starting a blame game.

We need a fixed ISO, a way to reproduce and a way to check if any changes may break other systems not affected by this.

Ok. I used the Plasma ISO. On a clean install of Manjaro, in the installer at step 3 (Partitions), choose “Erase disk”, then choose “Swap (with hibernate)”; you can leave the default fs, ext4. Check the “Encrypt system” checkbox and enter a password. Then proceed with the install as usual.
After the install we can hibernate, and after resume we get corruption of the root filesystem. In the case of ext4 we can just reboot; it’ll be repaired by fsck during boot.

The problem persists because the encrypt hook (cryptsetup package) removes any keys from the initramfs in its last stage. I believe this is done for security reasons, but I don’t see any benefit. As a result, since we have no key to open the swap partition, openswap (as far as I can see) mounts the root partition read-only to read the key from it and activate swap for resume. BUT it’s a journaling filesystem, and the journal is modified even on a read-only mount. BTRFS and any filesystem with metadata checksumming also gets corrupted, with errors in the dmesg log. Any journaling filesystem without metadata checksums gets corrupted silently. As a workaround, we must change the openswap config and comment out the last line in the encrypt hook.

As far as I can see, there is no security advantage in removing keys from the initramfs (because if an attacker can decrypt the initramfs, the system is fully compromised). I think the bug must be fixed by applying this workaround by default. Such a change will not damage any other setup, on new or existing systems. As things stand, we lose our modifications with any future update of the cryptsetup package.

I won’t report this to Arch Linux, because registering with their GitLab is too complicated (you have to write to them, and I have not yet received a response), and English is not my native language. I believe that a distro that claims to be user friendly can accept a bug report and fix it without making people go through such idiotically complicated ways to report the problem and its solution.