Repaired btrfs system m2 disk, but there is something wrong with the subvolumes or the system itself

After several system hangs within a few days (I’m on stable and haven’t applied the 2023-07-10 update yet), I can’t boot my system anymore. It hangs on the boot screen with the three dots and “manjaro” below them.

I first tried booting the older LTS kernels that were already installed, then tried booting from snapshots. When the snapshots are listed in the standard Manjaro boot menu, browsing them (i.e. moving the cursor up and down) lags badly.

When I do manage to select a snapshot, booting it gives me either a black screen or the message “need to load the kernel first”.

I’m now posting from a live USB while still inspecting the problem and trying to rescue my data. I thought I had enough Timeshift backups, but alas they are on the same disk, which seems to have a broken filesystem.

I have a standard encrypted installation of Manjaro stable. My system disk is a 500 GB NVMe drive in an M.2 socket on my mainboard.
The partition scheme is the installer default:

unallocated 1mb
/dev/nvme0n1p1 300mb FAT32
/dev/nvme0n1p2 400gb luks
/dev/nvme0n1p3 ~70gb luks (the swap partition created in Calamares to hibernate)
unallocated 2.5mb

I will describe below what I have tried (and what I have not tried) and list error messages.

(I had briefly inspected the logs after the system hangs, and they pointed at Xorg most of the time, so I thought the problem would go away if I reinstalled video-nvidia with or after the 2023-07-10 stable update, which I had postponed to the upcoming weekend.)

OK, it seems that some blocks of the btrfs filesystem are broken. I can decrypt the partitions,

sudo cryptsetup open --type luks /dev/nvme0n1p2 cryptroot 
sudo cryptsetup open --type luks /dev/nvme0n1p3 cryptswap                    
     

but I cannot mount them. It seems I cannot even mount them read-only to back up some of the files.

sudo mkdir /run/media/manjaro/rocryptroot
sudo mount -o ro /dev/mapper/cryptroot /run/media/manjaro/rocryptroot                                        

mount: /run/media/manjaro/rocryptroot: can't read superblock on /dev/mapper/cryptroot.
       dmesg(1) may have more information after failed mount system call.
 

So I looked at it with btrfs-progs:

sudo btrfs check /dev/mapper/cryptroot
Opening filesystem to check...
checksum verify failed on 300190990336 wanted 0x029420c1 found 0xe75f09e9
checksum verify failed on 300190990336 wanted 0x4c77b96a found 0xe879fb1b
checksum verify failed on 300190990336 wanted 0x4c77b96a found 0xe879fb1b
bad tree block 300190990336, bytenr mismatch, want=300190990336, have=5082971799686448791
Couldn't setup log root tree
ERROR: cannot open file system

But:

sudo btrfs rescue super-recover -v /dev/mapper/cryptroot
All Devices:
        Device: id = 1, name = /dev/mapper/cryptroot

Before Recovering:
        [All good supers]:
                device name = /dev/mapper/cryptroot
                superblock bytenr = 65536

                device name = /dev/mapper/cryptroot
                superblock bytenr = 67108864

                device name = /dev/mapper/cryptroot
                superblock bytenr = 274877906944

        [All bad supers]:

All supers are valid, no need to recover

Isn’t the gap between “want=300190990336” and “have=5082971799686448791” in the btrfs check output above huge?
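The size of that gap is informative in itself. Every btrfs tree block stores its own address (bytenr) in its header: `want` is the address the filesystem went looking for, and `have` is what was actually read from that spot. A quick arithmetic sketch, assuming the ~400 GB partition size from the layout above, suggests that `have` is not an address at all but simply whatever bytes now sit in that block, which fits the checksum failures:

```shell
# Compare the expected block address with the value found in the header.
want=300190990336
have=5082971799686448791
gib=$((1024 * 1024 * 1024))
part_bytes=$((400 * 1000 * 1000 * 1000))   # assumed: the ~400 GB root partition

echo "want: about $((want / gib)) GiB into the filesystem's address space"
if [ "$have" -gt "$part_bytes" ]; then
    echo "have: not a plausible address on a 400 GB partition"
fi
```

In other words, the block that should hold the log tree root has been overwritten or was never written completely.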

Then I looked at dmesg and journalctl too:

sudo journalctl -p3
[...]
Jul 11 20:50:24 manjaro kernel: BTRFS error (device dm-1): bad tree block start, mirror 1 want 300190990336 have 10804241780060440928
Jul 11 20:50:24 manjaro kernel: BTRFS error (device dm-1): bad tree block start, mirror 2 want 300190990336 have 5082971799686448791
Jul 11 20:50:24 manjaro kernel: BTRFS error (device dm-1): open_ctree failed

sudo dmesg | grep "BTRFS"

[  952.041823] BTRFS: device fsid XXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX devid 1 transid 152840 /dev/dm-1 scanned by (udev-worker) (16552)
[ 1041.079763] BTRFS info (device dm-1): using crc32c (crc32c-intel) checksum algorithm
[ 1041.079767] BTRFS info (device dm-1): using free space tree
[ 1041.120358] BTRFS info (device dm-1): enabling ssd optimizations
[ 1041.120361] BTRFS info (device dm-1): start tree-log replay
[ 1041.120435] BTRFS error (device dm-1): bad tree block start, mirror 1 want 300190990336 have 10804241780060440928
[ 1041.120522] BTRFS error (device dm-1): bad tree block start, mirror 2 want 300190990336 have 5082971799686448791
[ 1041.120525] BTRFS warning (device dm-1): failed to read log tree
[ 1041.121582] BTRFS error (device dm-1): open_ctree failed
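
The dmesg output above fails right after “start tree-log replay”, so a read-only mount that skips the replay may be worth trying before any repair. This is only a sketch: the helper name `try_ro_mounts` is made up for illustration, and it assumes a kernel of 5.9 or newer, where the `rescue=` mount options exist and can be combined with a colon.

```shell
# Hypothetical helper: try read-only mounts that skip the log replay.
# Assumes kernel >= 5.9 (rescue= options); nothing is written to the disk.
try_ro_mounts() {
    dev=$1
    mnt=$2
    for opts in ro,rescue=nologreplay ro,rescue=nologreplay:usebackuproot; do
        echo "trying: mount -o $opts $dev $mnt"
        if sudo mount -o "$opts" "$dev" "$mnt"; then
            echo "mounted with: $opts"
            return 0
        fi
    done
    echo "no read-only mount succeeded, see dmesg for details" >&2
    return 1
}

# Usage (device and mount point as in the commands above):
# try_ro_mounts /dev/mapper/cryptroot /run/media/manjaro/rocryptroot
```

If one of the mounts succeeds, files can be copied off to another disk before anything on the broken filesystem is changed.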

What I’m doing right now is writing an image to an external hard drive, in the hope of rescuing some content:

sudo dd if=/dev/mapper/cryptroot of=/run/media/manjaro/................./btrfsimage bs=1M

I have not luksClose’d the two partitions (cryptroot and cryptswap) beforehand. Should I have done that before making an image like the one above? I’m unsure.
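
On the luksClose question: /dev/mapper/cryptroot only exists while the LUKS container is open, so the mapping has to stay open for dd to read from it, and the resulting image holds the decrypted filesystem. What matters is that the filesystem is not mounted while it is being copied. A sketch of the same dd call with options that keep going on a possibly failing disk follows; the helper name and the destination path are placeholders, not anything from my setup.

```shell
# Hypothetical helper: image a block device without aborting on read errors.
image_device() {
    src=$1
    dst=$2
    # noerror: continue after a read error; sync: pad short reads with zeros
    # so later data keeps its offset; status=progress: show throughput.
    sudo dd if="$src" of="$dst" bs=1M conv=noerror,sync status=progress
}

# Usage (destination path is a placeholder):
# image_device /dev/mapper/cryptroot /path/to/external/btrfsimage
```

If the disk really does throw read errors, a dedicated tool such as GNU ddrescue is probably the better choice, since dd with these options zero-fills the whole 1 MiB block around each unreadable spot.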

What I HAVE NOT tried yet (because I want to look at the image first) is:

btrfs rescue chunk-recover /path/to/partition

Note
Since chunk-recover will scan the whole device, it will be VERY slow especially executed on a large device.

– Fix device size alignment related problems (e.g. unable to mount the filesystem with super total bytes mismatch):

btrfs rescue fix-device-size /path/to/partition

– Recover from an interrupted transactions (fixes log replay problems):

btrfs rescue zero-log /path/to/partition

I also haven’t tried “btrfs check --repair”, because it is recommended against everywhere, nor have I tried any of the other btrfs check options.

Sorry that this list looks very messy, just like I feel right now. Googling helped a bit but eventually confused me, so I returned here :).

Any help will be appreciated. Maybe I’m missing something, or I don’t understand large parts of it. I hope all of the above makes sense. I do realize that the drive might fail on me altogether and that I may have to replace this system disk. It is pretty new though, not even 2 years old.

For now I have repaired the filesystem with the command below, run from a live boot. However, I cannot see the newer snapshots in the boot menu, nothing after the 9th of July (whereas they show up fine in the Timeshift GUI).
But: I can boot again!! :)

btrfs rescue zero-log /path/to/partition
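
This fits the earlier errors, which all pointed at the log tree (“Couldn't setup log root tree”, “failed to read log tree”), and the log tree is exactly what zero-log discards. For anyone repeating this from a live system, the whole sequence can be sketched roughly as below. The helper name is invented, and the device and mapper names in the usage line are the ones from this post.

```shell
# Hypothetical helper: clear the btrfs log tree on a LUKS volume, then re-check.
# zero-log throws away the pending log, so files fsync'ed just before the
# crash may be lost; make an image first if the data matters.
zero_log_and_check() {
    part=$1   # encrypted partition, e.g. /dev/nvme0n1p2
    name=$2   # mapper name, e.g. cryptroot
    sudo cryptsetup open --type luks "$part" "$name" || return 1
    sudo btrfs rescue zero-log "/dev/mapper/$name" || return 1
    sudo btrfs check --readonly "/dev/mapper/$name"
}

# Usage from the live system, with the filesystem unmounted:
# zero_log_and_check /dev/nvme0n1p2 cryptroot
```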

I still have random freezes where I am forced to reboot. I will see if that gets better after the update.
I want to look at my error logs. One entry is a kernel timeout after 82 seconds, for instance, but I don’t know what is causing it. I create snapshots on boot, so I know when these hangs happen.
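
Since a hang ends in a forced reboot, the interesting messages are in the journal of the boot before the current one. A small sketch for pulling them out, assuming the journal is persistent (i.e. /var/log/journal exists); the helper name and the grep pattern are just my guesses at what is relevant here:

```shell
# Hypothetical helper: show errors from an earlier boot (1 = previous boot).
# Assumes a persistent journal, i.e. /var/log/journal exists.
prev_boot_errors() {
    n=${1:-1}
    journalctl --list-boots --no-pager
    journalctl -b "-$n" -p 3 --no-pager
    journalctl -b "-$n" -k --no-pager | grep -iE 'btrfs|nvme|hung|timeout' \
        || echo "no matching kernel messages in boot -$n"
}

# Usage: errors from the boot before the current one
# prev_boot_errors 1
```

Matching the boot times from `--list-boots` against the boot-time snapshots should show which boot ended in a hang.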

Do I have to replace my M.2 system SSD? Or is just the btrfs filesystem on it “broken”? At least something still seems odd with the snapshots’ subvolumes. Also, when I browse the snapshots in the boot menu, everything is more than laggy.
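
One way to tell a failing disk from a damaged filesystem is to look at the drive’s own health data next to the btrfs error counters, and then let a scrub verify every checksum. A rough sketch, assuming smartmontools is installed and the filesystem is mounted (the helper name is made up):

```shell
# Hypothetical helper: gather hardware and filesystem health indicators.
check_ssd_health() {
    disk=$1   # whole device, e.g. /dev/nvme0n1
    mnt=$2    # a mounted btrfs path, e.g. /
    sudo smartctl -a "$disk"           # SMART data (package smartmontools)
    sudo btrfs device stats "$mnt"     # read/write/corruption error counters
    sudo btrfs scrub start -B "$mnt"   # -B: stay in foreground until done
}

# Usage on the running system:
# check_ssd_health /dev/nvme0n1 /
```

In the SMART output for an NVMe drive, “Critical Warning”, “Media and Data Integrity Errors” and “Percentage Used” are the fields I would look at first. If those are clean but the scrub still finds checksum errors, the filesystem is the more likely culprit than the SSD.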

I used Timeshift to roll back to a point where the snapshots were still listed in the boot menu (2023-07-07). I might go further back to test. After the 2023-07-10 update, the latest snapshots are listed in the boot menu again. Meanwhile, I’ve found a baloo issue with btrfs: 402154 – Baloo reindexes everything after every reboot when using BTRFS filesystem

Maybe all btrfs users should turn baloo off (and reboot)?
(Search for “baloo” in the menu; it takes you to the settings. Reboot afterwards; you will be prompted to anyway.)
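
The same can be done from a terminal. A minimal sketch, assuming Plasma 5’s `balooctl` (on Plasma 6 the tool is called `balooctl6`); the wrapper function is only for illustration:

```shell
# Hypothetical helper: turn off the Baloo file indexer from a terminal.
# Assumes Plasma 5's balooctl; on Plasma 6 the tool is named balooctl6.
disable_baloo() {
    ctl=balooctl
    command -v balooctl6 >/dev/null 2>&1 && ctl=balooctl6
    "$ctl" disable
    "$ctl" status
}

# Usage (as the normal desktop user, not root):
# disable_baloo
```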

Maybe this, in combination with my system hangs, caused the filesystem to break.
The issue is slowly resolving. However, in the Manjaro boot menu I get a black screen and no boot when I go back from the snapshot list to select the normal boot entry.
Might there still be a broken snapshot, or errors in the btrfs tree?

I did the zero-log cleaning with help from the chat support in the #linux and btrfs channels, thank you very much!!!
Now I have rolled back to before the last 2 stable updates, redone those in one step, and applied the pacnew recommendations from the 2023-07-10 thread (https://forum.manjaro.org/t/stable-update-2023-07-10-kernels-plasma-gnome-libreoffice-pipewire-mozilla-wine/).
I deleted ALL snapshots after that roll-back date through the Timeshift GUI (I suspected they had borked the filesystem), although I had checked the filesystem from the live system with

btrfs check /dev/mapper/ROOT

and it showed no errors (the swap partition cannot be checked this way).

After this I have not had any system hangs at all. The system looks fine; I had not rolled back far enough last time.
If no hangs occur I will mark this thread solved in a few days. Boot menu seems to list the snapshots fine.
I hope people with a similar issue can use the info provided above to solve it in the future.
Btw, I will keep the baloo file search off.
