Repairing a corrupt btrfs filesystem

Got a weird one here… It started with me noticing that I can't run any system updates anymore. At some point I realized that my btrfs filesystem is mounted read-only. Checking journalctl shows that there have been some disk corruption issues, which may have been caused by a combination of low free space and my system's general instability after entering/waking up from suspend… this is a new PC, and my RAM is running at a very high clock, so I'm suspecting the RAM. That's a separate topic, however. Now I just need to fix this, and there is certainly free space this time.

This is what I get when I run btrfs device stats

[/dev/nvme0n1p6].write_io_errs    0
[/dev/nvme0n1p6].read_io_errs     0
[/dev/nvme0n1p6].flush_io_errs    0
[/dev/nvme0n1p6].corruption_errs  231
[/dev/nvme0n1p6].generation_errs  0

As you can see there are some corruption errors, but I'm hoping these can be resolved. This is a pretty new, high-quality SSD and I'd be very surprised if it really is failing.
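
(Side note: if I understand the tooling correctly, these counters are cumulative and persist across reboots, so once the underlying problem is sorted out they can be zeroed with something like:

sudo btrfs device stats -z /

so that any new errors stand out.)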

I tried to follow the instructions here, but got the same results. The poster does seem to have had the exact same problem I'm having, though.

The thread doesn't seem to show any solution, however, so I was hoping someone here would have more insight.

I have no idea about btrfs, but for many other filesystems the solution is to boot from a live USB and check the filesystem with fsck, which also repairs it.

That's what I'd normally do, but btrfs is tricky; you need to treat it differently from other filesystems, and I'm quite unfamiliar with using it. I doubt an fsck would work here… but I'll give it a try.

EDIT: Well, I'm on a live USB, and I ran btrfs check (that's what fsck told me to do). This is the result. So it definitely looks like there are some corrupt blocks. Now I need to figure out how to get them fixed.

sudo btrfs check /dev/nvme0n1p6                                                      

Opening filesystem to check...
Checking filesystem on /dev/nvme0n1p6
UUID: b479a406-c76c-4949-a843-50b761f8a0a4
[1/7] checking root items
[2/7] checking extents
checksum verify failed on 1126033309696 wanted 0xaff703a3 found 0xd27604d2
checksum verify failed on 1126033309696 wanted 0xaff703a3 found 0xd27604d2
checksum verify failed on 1126033309696 wanted 0xaff703a3 found 0xd27604d2
Csum didn't match
owner ref check failed [1126033309696 16384]
ERROR: errors found in extent allocation tree or chunk allocation
[3/7] checking free space tree
[4/7] checking fs roots
checksum verify failed on 1126033309696 wanted 0xaff703a3 found 0xd27604d2
checksum verify failed on 1126033309696 wanted 0xaff703a3 found 0xd27604d2
checksum verify failed on 1126033309696 wanted 0xaff703a3 found 0xd27604d2
Csum didn't match
checksum verify failed on 1126033309696 wanted 0xaff703a3 found 0xd27604d2
checksum verify failed on 1126033309696 wanted 0xaff703a3 found 0xd27604d2
checksum verify failed on 1126033309696 wanted 0xaff703a3 found 0xd27604d2
Csum didn't match
[5/7] checking only csums items (without verifying data)
checksum verify failed on 1126033309696 wanted 0xaff703a3 found 0xd27604d2
checksum verify failed on 1126033309696 wanted 0xaff703a3 found 0xd27604d2
checksum verify failed on 1126033309696 wanted 0xaff703a3 found 0xd27604d2
Csum didn't match
Error going to next leaf -5
[6/7] checking root refs
[7/7] checking quota groups skipped (not enabled on this FS)
found 2031589666816 bytes used, error(s) found
total csum bytes: 1859467720
total tree bytes: 3310469120
total fs tree bytes: 1165148160
total extent tree bytes: 117555200
btree space waste bytes: 407639923
file data blocks allocated: 3631883931648
 referenced 2598999842816

There's a --repair option, but it's scaring me into not using it. I'm not sure what I should do next. I'd appreciate it if someone could give me some advice.
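
One thing I might try first, if I'm reading the mount options right, is a read-only rescue mount from the live USB before touching --repair, roughly:

sudo mkdir -p /mnt/broken
sudo mount -o ro,rescue=usebackuproot /dev/nvme0n1p6 /mnt/broken

so I can at least copy anything important off while I figure out the rest.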

EDIT2:
I found this link that was very helpful in figuring out what to do next
https://wiki.archlinux.org/title/Identify_damaged_files

I've performed the scrub as per the instructions, then checked journalctl and identified the files that are affected. In my case, losing these files would be of no consequence, and I have a backup of them somewhere else, so I don't mind deleting them if that fixes the problem. I decided to go ahead and repair the drive with --repair, but I can't really tell if that fixed anything.
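
Roughly what I ran, going by that wiki page (from memory, so treat it as a sketch):

sudo btrfs scrub start /      # start a scrub on the mounted filesystem
sudo btrfs scrub status /     # repeat until it reports finished
sudo journalctl -k | grep -i 'checksum error'   # the kernel log lines point at the affected files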

EDIT3: I rebooted after the repair and it is still mounted read-only… so the problem still isn't fixed. Not sure what to do next.

Just chiming in to make sure you attempted those repairs from a chroot environment, after booting the ISO; beyond that I’m no help with BTRFS.

Additionally, these links, though not specific to your issue, may contain useful information:

Cheers.


Edit: I notice you have seemingly ignored the first of the two links given. Trust that it wasn't given randomly; the first link contains the procedure to mount BTRFS partitions for the purposes of chroot, in case this is needed. :eyes:
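
Something along these lines should do it (a sketch only; the exact subvolume name depends on the install, though Manjaro's default root subvolume is usually @):

sudo mount -o subvol=@ /dev/nvme0n1p6 /mnt
sudo manjaro-chroot /mnt

or simply manjaro-chroot -a from the live ISO to let it detect the installed system on its own.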


Good, you have the backup.

Do not use --repair with Btrfs. Some corrupted files cannot be fixed unless you have redundancy, like RAID1 or a full DUP profile (meaning two complete copies), with self-healing that can replace damaged data with a good copy.

No filesystem without redundancy can repair damaged files.

At this point, you should check which hardware component is causing the problem and replace it with a new one.

Btrfs did the right thing by forcing your filesystem into read-only mode to prevent further data corruption. Unlike Ext4, which simply ignores newly corrupted files without you realizing it.
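
For the future, rather than for data that is already damaged: even on a single device you can carry a second copy so that scrub has something to heal from. A sketch, assuming the filesystem is mounted at /:

sudo btrfs balance start -mconvert=dup -dconvert=dup /   # keep two copies of metadata and data

It costs disk space and write bandwidth, and it will not protect against bad RAM corrupting data before it is ever written.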


The checksum feature of btrfs is great. It’s telling you it’s corrupt, and if it didn’t, something much worse could potentially happen. But at least it appears to only be one block.

But there is only one copy of the data and its checksum, so you can't repair the data without another copy. RAID 1 or any other redundant btrfs setup could repair it.

I haven't had to deal with this myself yet, but there are still things you can do. The first is to take that 1126033309696 number and see what it refers to.

I have not done this myself… But I’m mostly certain on the steps. So be careful about blindly entering these.

Get the inode number from the logical address you saw in the logs.

sudo btrfs inspect-internal logical-resolve -P 1126033309696 /
(Assuming subvol=@ or your root volume.)

Then you can see what data is affected by that inode:
sudo btrfs inspect-internal inode-resolve INODE /

You can then copy that file out, corrupt or not, using the path from the last command (preferably to another filesystem).

Then proceed to repair it.
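
A rough sketch of that last part (the paths here are placeholders for whatever the previous commands print):

cp /path/from/inode-resolve /somewhere/else/   # may fail with an I/O error on the bad blocks
rm /path/from/inode-resolve                    # drop the damaged copy, restore it from backup
sudo btrfs scrub start /                       # then scrub again to confirm the errors are gone

If the copy fails because of the checksum errors, a read-only mount with rescue=ignoredatacsums is, as far as I know, the way to read past them.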

Scrub is only used to verify checksums, so you would just see the errors again.

Edit: Zesko’s post sneaked in here after I posted somehow.

I know you’re very knowledgeable about btrfs, but why?


There are some reasons:

  • File corruption is not the same as filesystem corruption:

    • If only specific files are corrupted but the filesystem itself (including metadata) works properly (i.e., it can be mounted and read), this is usually not an error at the Btrfs level.
    • File corruption can happen because of checksum mismatches, often caused by bit errors in hardware; it doesn’t mean the whole filesystem needs to be repaired. This is why btrfs scrub is recommended instead of btrfs check --repair (see the sketch after this list).

  • btrfs check --repair is designed to repair critical filesystem corruption, especially when the filesystem cannot be mounted or is severely damaged. Sometimes it succeeds, but other times it leads to data loss.

  • Running btrfs check --repair on a working filesystem can be risky, as it might inadvertently cause more harm to the metadata or data structures.
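
Whether scrub can actually fix anything depends on the allocation profiles, which is a quick check on a mounted filesystem (a sketch, assuming it is mounted at /):

sudo btrfs filesystem df /

If metadata shows DUP and data shows single, scrub can usually heal damaged metadata from the second copy, but damaged file contents can only be reported, not repaired.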

From the Btrfs repair manpage:

WARNING:
Do not use --repair unless you are advised to do so by a developer or an experienced user, and then only after having accepted that no fsck successfully repair all types of filesystem corruption. E.g. some other software or hardware bugs can fatally damage a volume.


Indeed. The filesystem should be unmounted when running a repair.
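
From the live USB that is easy to verify before running the check (a small sketch, assuming the usual device name):

findmnt --source /dev/nvme0n1p6   # should print nothing if the filesystem is not mounted
sudo umount /dev/nvme0n1p6        # only needed if it is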


It turns out you were right on that one.

I haven't mentioned this, but this is a very new PC build. I bought an AMD Ryzen 7 7800X3D CPU along with some crazy fast 7800 MHz RAM. That's faster than most people would run, and some of my friends were concerned it was “too fast”, but I figured it would be helpful because this CPU likes fast memory. It turns out the RAM was indeed just too fast. I had a suspicion, so I ran memtest86, and sure enough I got a bunch of errors. That must be why the filesystem got corrupted, and why my PC often freezes if I leave it on for a while (before this, I thought it was a suspend issue, not a RAM issue).

Anyway, I have now downclocked the RAM to a more reasonable 6400 MHz. I'll run some hard stress tests alongside a memtest to make sure it's 100% stable, and then I'll reformat my Manjaro partition and reinstall just to be safe… it was a relatively fresh install, so it's no biggie, and my files are backed up in my Nextcloud and other cloud providers.
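
For the stress test, something along these lines is what I have in mind (stress-ng as a sketch; workers and duration are guesses on my part):

stress-ng --vm 4 --vm-bytes 75% --verify --timeout 2h

plus another full pass of memtest86 from the boot stick afterwards.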

Thank you everyone for the help and insight, I really appreciate it.


This topic was automatically closed 3 days after the last reply. New replies are no longer allowed.