BTRFS error occurred while copying the zip file

A 15 GB zip file is an edge case, and I don’t think this says much about the filesystem as such.

My experience was a couple of years back, and in my case it was my entire projects partition that went haywire.

Such a project structure is literally tens of thousands of files once you count the git structure with its many tiny fragments.

I don’t know what caused it - luckily it was only a filesystem test - so I had 99% stored on a removable device. I lost tiny pieces but my trust in said filesystem vanished and I have never touched it since.

Even f2fs has done something similar - which is why I now use ext4 only.


Would it be dangerous with full disk encryption (LUKS)? If data is corrupted, could it no longer be decrypted?

Both of my SSDs are unencrypted, as they always stay at home, unlike my laptop.

That is a possibility.

This looks quite bad and you should check your RAM and other system components as advised.
As it stands now, you cannot be certain of the integrity of any data written by your machine.

Files installed via packages can be checked with paccheck from the pacutils package:

$ sudo paccheck --sha256sum --quiet

For any other (important) file you can only try to check against a backup and inspect all changes manually.
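
For example, a minimal sketch of such a comparison, assuming the backup is mounted at /mnt/backup and the files live under ~/documents (both paths are just placeholders):

$ rsync --archive --checksum --dry-run --itemize-changes ~/documents/ /mnt/backup/documents/
# --checksum compares file contents instead of size/date, --dry-run changes nothing,
# --itemize-changes lists every file that differs so you can inspect it manually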

Not only is data recovery from an encrypted block device more difficult as it is, but if there’s any corruption of the LUKS header (and its backup) due to a faulty CPU, RAM, or SSD, you have essentially “securely wiped” your data permanently.

Once your system demonstrates that it consistently writes incorrect bits into a new file (or copy), it is practically compromised. You cannot trust anything else.

You can no longer trust…

  • updates
  • package installations
  • creating new files
  • saving files
  • copying files
  • downloads
  • transferring multimedia to and from the computer

…and so on.

In the end, your only means of protecting yourself against losing data permanently is not redundancy/RAID, not bitrot protection against corruption, and not even a simple backup (which can itself fail in cold storage), but rather multiple backups of different types, used and checked frequently.
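
One simple way to make such backups checkable later is a checksum manifest (a sketch only; the paths and file name are examples):

$ cd /mnt/backup
$ find . -type f -exec sha256sum {} + > ~/backup-manifest.sha256   # record checksums once
$ sha256sum -c --quiet ~/backup-manifest.sha256                    # re-check later; only failures are printed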

But that’s off topic! :grinning_face_with_smiling_eyes:

@winnie

memtest86-efi completed its full test after 7 hours. Everything is OK. :white_check_mark:

But it showed [Note] RAM may be vulnerable to high frequency row hammer bit flips.


I found another problem. Not only copy errors, but also read or CPU calculation errors?

The same file test1.zip produces many different checksums. :x: :open_mouth: I did not change it.

❯ sha1sum test1.zip    
e9bb2bfc90ad058aa73154cce4814fd744cc11f8  test1.zip # Failed
❯ sha1sum test1.zip                                                                                                                                                          
ea1127716ab44a6b74507ec7c503409ae5e84f21  test1.zip # Failed
❯ sha1sum test1.zip                                                                                                                                                          
bdb8b9098a3b0fc7055f20af586457156023be2d  test1.zip # Failed
❯ sha1sum test1.zip                                                                                                                                                           
0d7f28ed306c395baa0ec9436a356672843e9021  test1.zip # Failed

on my old SSD.

❯ sha1sum test1.zip     
b35cd8b5df411a5b20ac456f67b95f089502919f  test1.zip # OK
❯ sha1sum test1.zip                                                                                                                                                                 
b35cd8b5df411a5b20ac456f67b95f089502919f  test1.zip # OK
❯ sha1sum test1.zip                                                                                                                                                                 
f48ea7c128e8f7f892636f8c05623a243ec09c01  test1.zip # Failed
❯ sha1sum test1.zip                                                                                                                                                                  
3f2b211343854f8d2faf70800e76a86ff75106a9  test1.zip # Failed
❯ sha1sum test1.zip                                                                                                                                                                  
f4ec8624a5ae5f61bd3b7e95b898dc841436fd6e  test1.zip # Failed

on my new SSD.

Both SSDs are affected on the same device 1. :x:

I checked the same 15 GB file on my 64 GB USB stick (EXT4) on device 1; it shows different checksums too. :x:

But when I checked this USB stick on my other device 2, there was no issue → the same file always shows the same checksum. :white_check_mark:

I also checked the zip on Windows 10 on the same device 1. There, only data transfers or copies show the problem: most copies are correct, a few are incorrect, and each individual copy always keeps the same checksum. On Linux, however, the same file shows random, different checksums.
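
A small loop makes this kind of nondeterministic read/calculation easy to spot (just a sketch; 20 repetitions is an arbitrary number); a healthy machine prints exactly one distinct checksum:

❯ for i in $(seq 20); do sha1sum test1.zip; done | sort | uniq -c
# one output line = consistent hashes; several lines = the hash changes between runs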

Device 1 is currently running mprime. I see many tests have passed. :white_check_mark:


I checked the zip file using a Manjaro Live USB on device 1: same issue. :x:


I am stopping mprime now. I will try changing mainboard BIOS settings, for example switching AMD-V off. I will also remove 3 of the 4 RAM modules and see if it works.


A fault of the CPU, RAM, or motherboard (of system 1). The system cannot be trusted, and your data is at risk, especially newly created data.


If you can spare the time to run it for as long as possible, even 24 hours straight, you can then check results.txt and stress.txt for any logged errors.
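
A quick way to scan those logs, assuming mprime writes them into the directory it was started from (the error strings are only typical ones to look for):

$ grep -iE 'error|fatal|hardware failure' results.txt stress.txt
# no output means nothing suspicious was logged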


Good process of elimination. :slight_smile:

Fingers crossed! But it’s looking really, really bad at this point.

Whoa,
I have tested long enough today. The result is that 1 of the 4 RAM modules is faulty. The other 3 modules are fine: after 15 copies of the 15 GB zip, all checksums passed, no errors.

Mainboard and CPU are not faulty.

memtest86+ was a joke for claiming that all of the RAM passed.

I have to thank BTRFS for warning me early about why I/O reads failed. :+1:
EXT4 would not have warned me, and I would not have noticed.


Bad RAM strikes again!

:ram:


ZFS and Btrfs have built-in data integrity checks. Ext4, XFS, F2FS do not.
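
On Btrfs, for example, you can ask the filesystem to re-read and verify every checksummed block (a sketch; replace / with the actual mount point):

$ sudo btrfs scrub start -B /   # -B stays in the foreground and prints a summary when done
$ sudo btrfs scrub status /     # shows the read/checksum error counters afterwards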


I’ve read some stories that memtests are most reliable and trustworthy when testing one stick at a time. I should have mentioned this much earlier, and I apologize that I did not. It would have been tedious to run full passes one stick at a time, but definitely more accurate.

However now you can brag about doing your own process of elimination and diagnosis to find the root cause of a dangerous problem. :sunglasses:


How long have you been using this computer? It’s very likely that much of the data you saved, copied, backed up, etc., is corrupt (corrupt in the sense that it does not 100% reflect how the file should exist on storage, whether from a new creation, a modification, or a copy).

:flushed:

That is because Btrfs uses checksums to secure itself against faults in Btrfs :wink:
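
Those detected faults also stay visible afterwards, for instance (the mount point / is only an example):

$ sudo btrfs device stats /      # per-device corruption/read/write error counters
$ journalctl -k | grep -i csum   # kernel messages such as "csum failed" from bad reads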

I bought this RAM for this computer in January 2021 and have been using it since then without noticing that it is damaged. Then I bought the new SSD a month ago and installed a fresh Manjaro on this SSD with full BTRFS.

I have checked the data I have used from January until today. Much of it is not important to me.

Glad this isn’t another cautionary tale about Btrfs. If anything, Btrfs seems to be doing the job ECC memory used to do.

I’m glad too. At least with faulty hardware, you can pull it out or replace it. But with a filesystem? It’s nail-biting levels of unnerving where you’re not quite sure if it’s misbehaving or trying to warn you.

Sadly, ECC RAM is pricey because there’s no real “consumer” market: it’s primarily produced for enterprise. In a better world, it would have been a feasible option for home users and would be much more accessible and affordable.

I only use ECC (and ZFS!) for my NAS server, however, since it’s my bastion for archiving data and holding backups of known good copies.

What exactly did you do to determine that?

Did you determine it by checking the hashes of 15 copies of the same file while using only that single faulty RAM module, with some of the copies hashing differently from the others?

And did you check each of the other 3 RAM modules with the same 15-copy test, and their hashes were all the same?

Yes, and there were many different hashes for the same file too, so the calculation itself looks wrong. I think the CPU loaded some corrupt addresses or values from this RAM.

Yes


If you have time, please give the exact info:

What is the exact app name?

One post says MemTest86, but a later one says memtest86+.

Also, what is the exact and full version with build number?

It was memtest86-efi 1:9.0build2000-2.1 from pacman repos, not AUR.
I did not use memtest86+ because it does not support EFI.


Zesko,

We are the developers of MemTest86 (not the ‘+’ version, which is no longer maintained), but of the original version, which you used. A user directed our attention to this post.

The version of MemTest86 you used was slightly old, but we don’t think it would have changed the outcome. We are wondering if the error was in fact linked to one of the following:
A) Row hammer bit flips. These are in fact memory failures, but the test case for row hammer is somewhat artificial and should almost never occur in real-life applications.
B) CPU load / system temperature. MemTest86 doesn’t load the CPU much; SHA1 calculations would likely heat the system more than simple read/write testing.
C) Booting into the OS changed some settings / timings / fan speeds / EMI environment, which made the RAM error more obvious. This isn’t very common, however.

Anyway, our thought was this: can we buy the faulty pair of RAM sticks from you so that we can do some more in-depth testing on them in a few different systems (assuming they were in a pair), with a view to improving the fault reporting in MemTest86?

David


Thanks for the info.

I already checked 2 days ago. My system with only the 1 faulty RAM module (without the other 3) has fewer calculation and copy errors than the system with all 4 modules together.

There is a big difference in SHA1 calculation errors between the 1 faulty RAM module alone and all 4 modules:

→ 1 faulty RAM module alone: ←
If a SHA1 calculation error appears for the same zip file and I then check again, most likely there is no error, but that is just luck. Probably 1 out of 20 repetitions of the SHA1 check is wrong, and every wrong SHA1 result is different.
Repeated copying (read and write) produces more errors than SHA1 checks (read only)!? (→ I do not yet know whether the writing is also wrong when copying.)

→ 1 faulty RAM module together with the 3 good ones: ←
When a SHA1 calculation error appears for this same zip and I check again, the repetitions keep producing different wrong SHA1 results. This is worse than with the faulty module alone.
Repeated copying is much worse.

That is why I think the one faulty RAM module has spread some faulty values to the other 3 modules, like error propagation.

But the wrong values cached in RAM are not cleared until the PC is powered off.
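
The read-and-write path can be exercised in the same spirit as the read-only SHA1 test, for example with a loop like this (a sketch only; 15 copies and the name test1.zip mirror the earlier test):

❯ ref=$(sha1sum test1.zip | cut -d' ' -f1)                         # hash of the original as reference
❯ for i in $(seq 15); do cp test1.zip copy_$i.zip; sha1sum copy_$i.zip; done | grep -vc "$ref"
# copies the file 15 times and counts how many copies hash differently from the original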

PM to me.


Today I tested the faulty RAM with the latest version of MemTest86, v9.2, from the official website MemTest86 - Download now!. Now the result reflects reality, if you want to know:

