Btrfs filesystem readonly

Hardware problems!

This is going to be a longer thread because my file system (main PC) has become read-only.

  • I noticed this when I could no longer save any files.
  • After a reboot, it was fine for a while. Shortly after that, the file system became read-only again. A look at dmesg shows error messages from btrfs related to sda.

This is a btrfs RAID 1 with two SSDs (one NVMe and one older SATA SSD). The error messages in dmesg were related to the SATA SSD. There are also older rotating hard drives installed, but they are not mounted, as well as two further SSDs that are too small.

Don’t panic! I have btrfs, snapshots (hourly of /@ and /@home) and backups (from the day before yesterday and before that), but of course no spare SSD in the drawer.

I’m currently writing this from a laptop (so no inxi :frowning: ) and am still figuring out the best way to proceed.

The cause could be:

  • Power supply overloaded (-> first, remove USB loads, DVD drive) :hammer:
  • Motherboard slowly failing (-> remove dust, disconnect and reconnect cables and connectors)
  • SATA SSD failing (-> boot from USB and check dmesg ???)

sudo dmesg|grep -E BTRFS says (typed as seen):

BTRFS info (device sda2) bdev /dev/sda2 errs: wr 0, rd 0, flush 0, corrupt 5, gen 0
BTRFS info (device sda2) bdev /dev/nvme0n1p3 errs: wr 0, rd 0, flush 0, corrupt 4, gen 0
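
The per-device counters in those two lines can be pulled out with a small filter. This is only a minimal sketch, assuming the log format shown above; the function name `btrfs_err_summary` is made up for illustration, and the sample input is copied from this post rather than read from a live system.

```shell
#!/bin/sh
# Hypothetical helper: read kernel log lines on stdin and print
# "<device> corrupt=<n>" for every BTRFS "errs:" line.
btrfs_err_summary() {
    sed -n 's/.*bdev \([^ ]*\) errs:.*corrupt \([0-9]*\),.*/\1 corrupt=\2/p'
}

# Sample input copied from the post above (not live output).
sample='BTRFS info (device sda2) bdev /dev/sda2 errs: wr 0, rd 0, flush 0, corrupt 5, gen 0
BTRFS info (device sda2) bdev /dev/nvme0n1p3 errs: wr 0, rd 0, flush 0, corrupt 4, gen 0'

printf '%s\n' "$sample" | btrfs_err_summary
```

On the affected machine the same filter could presumably be fed from `sudo dmesg`; `btrfs device stats` on the mount point reports the same counters directly.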

mount /dev/sda2 /mnt
mount -t btrfs

/dev/sda2 on /mnt type btrfs (rw,relatime,ssd,discard=async,subvolid=83648,subvol=/@)

btrfs scrub start -B -d -r -R ends with:

Starting scrub on devid 1
Starting scrub on devid 3
ERROR: scrubbing /mnt failed for device id 1: ret=-1 errno=30 (Read only file system)
ERROR: scrubbing /mnt failed for device id 3: ret=-1 errno=30 (Read only file system)
...

mount -t btrfs says:

/dev/sda2 on /mnt type btrfs (ro,relatime,ssd,discard=async,subvolid=83648,subvol=/@)

So far, so good: it is readable :wink: but can’t be scrubbed :frowning:

:pencil: : scrub works if the device is mounted ‘ro’, with btrfs scrub start -BdrR /dev/sda2

I honestly expected to see a how-to on recovering a read-only file system here!

I’ve had two drives fail (from a mirror), and I don’t think this is anything like what I’ve experienced.

All these tests... Are you doing them from a live boot, or from a regular boot (in read-only mode)?

Also, what’s the SMART info on each drive?

sudo smartctl -x /dev/sda2
sudo smartctl -x /dev/nvme0n1p3

Indeed. In the smartmontools package. :wink:

If you are able to boot from a USB, it might be worth investigating btrfs check and btrfs rescue - but I’ve never needed to use either of them.

You said you had backups, but if there is important data, I would disconnect the drives.

My place is littered with PATA/SATA/M.2 enclosures so I can read almost all drive types via USB. If you had these, you could operate on the drives one at a time on a different system (at least to help figure out where the problem lies).

When this happened to me about a year and a half ago, it turned out that my problem was, of all things, a massively faulty RAM stick. I would suggest running a memory test (memtest) from the GRUB screen.

Smartmontools indicates that sda is not corrupted, while btrfs detects corruption on both(!) SSDs.

This suggests that the error occurred before the data was distributed across the RAID (in RAM ??? But it is not the best time to buy RAM :frowning: ). So, clearly, it’s file system corruption (in metadata).

I have backups (snapshots copied externally via send) of all subvolumes. Just to be safe, I’m currently performing an rsync backup of the readable btrfs root volume (including /@, /@home, /@nosnap, and without snapshots). This will likely run overnight.

Then I can safely attempt a repair. The AI’s suggestions for narrowing down the problem were helpful. The backup, as suggested, would have only saved a portion of btrfs (/@), leaving the rest untouched. The suggested recovery attempts using btrfs rescue (dry run) don’t look promising. All the snapshots would be lost in the process.

(By the way, Chromium is acting weird: translating garbage, slowing down …) :scream:

You can hope. This could also be bad CPU cache. (And possibly more likely.)

Spurious comment:

There has been a significant solar storm again in very recent days (we are currently in a solar maximum year), and thus a significant chance of flipped bits and other electronic disruption. As a matter of fact, many computer systems were affected by the increase in radiation and charged particles emanating from the sun.

ECC-capable hardware may mitigate such hazards, but it’s not a guarantee.

After careful consideration, the cause is likely an overloaded power supply.

The system has been upgraded several times over the years. It now includes:

  • 2 graphics cards
  • 4 RAM modules
  • NVMe
  • 3x SATA SSDs
  • DVD drive (not working, but powered)
  • 3 external USB hubs, camera, microphone, Bluetooth, 4 dongles for keyboard/mouse…

At the time of the failure, I was working on software/hardware for a D1 mini (connected via USB). It’s quite possible that a power surge on the USB port was the final straw.

  • First, I’ll slim down the system: remove 2 RAM sticks, 2x SATA SSDs and the DVD drive. :white_check_mark:
  • :pencil: : scrub shows no(!) errors, and is 3 times faster on nvme than sda
  • Maybe I can persuade btrfs to roll back a few commits (but the file system is older than 2017).
  • :pencil: : Maybe I can btrfs send the snapshots to another partition :wink:
  • Otherwise, I’ll set up the NVMe drive as the boot medium again (it was SDA before) and configure it in the UEFI, then restore the backup.
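
The send idea in the list above might look roughly like the following. It is a dry-run sketch that only prints the commands instead of running them. The helper name `plan_send` and both paths are hypothetical, and it assumes the snapshots are read-only (which `btrfs send` requires) and that the target is a separate, writable btrfs filesystem.

```shell
#!/bin/sh
# Hypothetical helper: print (do not execute) one send/receive pipeline
# per snapshot. First argument: target directory on another btrfs
# filesystem; remaining arguments: read-only snapshot paths.
plan_send() {
    dest=$1
    shift
    for snap in "$@"; do
        printf 'btrfs send %s | btrfs receive %s\n' "$snap" "$dest"
    done
}

# Paths below are placeholders, not taken from the system in this thread.
plan_send /mnt/rescue /mnt/snapshots/home-1 /mnt/snapshots/home-2
```

Printing first makes it easy to review the list before pasting anything into a root shell. Paths containing spaces would need quoting added.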

:footprints:

That is possible, especially if the PSU isn’t exactly new anymore. And graphics cards are definitely the biggest consumers.

It’s also possible that you could stabilize your RAM (if it’s really your RAM) by giving it just a little bit more voltage.

The same goes for a CPU… aged hardware sometimes needs a little voltage boost to run stable again.

Or go the other way and just reduce the MHz for the unstable device; in that case you don’t need to play around with the voltage.

That’s not actually how it works, although your solution does of course remedy the problem.

It’s not the aging hardware that needs more voltage, but the aging power supply that isn’t supplying the individual components with the required voltage anymore. :wink:

I think we all can agree that Hardware is a very complex topic.

I just pointed to a remedy in case his power supply is working as intended. And of course he needs to find the cursed device.

It still doesn’t sound 100% certain that it is indeed the power supply. But I hope he can solve it with that.

Back before GPUs increased the demand for higher power PSUs, they used to fail on me left and right. I ended up hanging a second one outside my case for years.

But since then you have many options. I would always overshoot what I need: take the peak power of everything you have, then add 30-40% if you want it to last.
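
That rule of thumb is simple arithmetic. A minimal sketch, where the helper name `psu_target` is hypothetical and the 400 W peak figure is purely an illustrative assumption, not a measurement from this thread:

```shell
#!/bin/sh
# Hypothetical helper: target PSU wattage = peak sum plus a safety margin,
# rounded up to whole watts.
# Usage: psu_target <peak_watts> <margin_percent>
psu_target() {
    peak=$1
    margin=$2
    echo $(( (peak * (100 + margin) + 99) / 100 ))
}

# Assumed example: 400 W summed peak draw, with 30% and 40% headroom.
psu_target 400 30   # prints 520
psu_target 400 40   # prints 560
```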

My current PSU is an eVGA 1000W. But hey, it’s only one now!

Or the aging user, who adds device after device without recalculating what the power supply provides (in my case 240W) :wink:
For a graphics card the figures are available, but what about RAM, USB hubs and other devices?

Edit:
The PC was quite dusty; the power supply actually has 420 watts, not 240. A cable tie was obstructing the power supply’s fan, preventing it from spinning. Of course, this doesn’t rule out the possibility that the power supply is still failing, even though it’s delivering 12 volts and 4.98 volts.

Considering the hardware you’ve got in your system as described above, I would recommend at least 600 Watt.

Just by high-balling estimates from searching (or using an LLM to help sift) you can sort of get the job done.

Hubs really don’t draw a lot. USB3 without PD is capped usually. I have to have two powered USB hubs even. With a full flight simulator setup, it’s just too much for mine.

The SSDs can peak power up a lot too. Extra GPU.. It’s a lot!

I’ve had one of these for ages though, it did not cost much (measures power and current too):

Makes it really easy then.

Don’t know about RAM, but SATA SSDs should be about 2W and HDDs are meant to be about 10W (from what I remember), although they pull more when spinning up (I’ve got an external 3.5" enclosure which uses a 24W supply).

USB hubs shouldn’t use much themselves, and devices should have maximum ratings on them. You probably want to check and maybe use a powered hub if you exceed the motherboard’s capabilities.

Depends on the 2 GPUs, but considering it’s currently 240W, 600W should be fine.
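
To put the drive figures into perspective, the per-device estimates can simply be added up. This is a rough sketch: the helper name `sum_watts` is hypothetical, the 2 W and 10 W values are the from-memory estimates given above, and the count of two HDDs is an assumption for the example.

```shell
#!/bin/sh
# Hypothetical helper: each argument is COUNTxWATTS,
# e.g. 3x2 means three devices at roughly 2 W each.
sum_watts() {
    total=0
    for item in "$@"; do
        count=${item%x*}   # part before the "x"
        watts=${item#*x}   # part after the "x"
        total=$(( total + count * watts ))
    done
    echo "$total"
}

# 3 SATA SSDs at ~2 W, plus (assumed) 2 HDDs at ~10 W.
sum_watts 3x2 2x10   # prints 26
```

Under these assumptions the drives come to a few tens of watts, so the GPUs remain the figures worth looking up first.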