That could be, if the zip tool initially wrote the data to the faulty SSD or faulty file system without any warning; but BTRFS at least stops copying the broken file early when I try to transfer it to the backup disk. Not bad.
I checked smartctl before the copy:
❯ sudo smartctl -A /dev/nvme0n1
=== START OF SMART DATA SECTION ===
SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x00
Temperature: 52 Celsius
Available Spare: 100%
Available Spare Threshold: 5%
Percentage Used: 1%
Data Units Read: 9.541.220 [4,88 TB]
Data Units Written: 5.187.998 [2,65 TB]
Host Read Commands: 22.832.335
Host Write Commands: 34.318.571
Controller Busy Time: 170
Power Cycles: 233
Power On Hours: 272
Unsafe Shutdowns: 3
Media and Data Integrity Errors: 0
Error Information Log Entries: 354
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0
Then I copied the same MEGA zip 4 times on my disk.
❯ cp MEGA.zip ~/Desktop/MEGA_copy.zip # No issue
❯ cp MEGA.zip ~/Desktop/MEGA_copy1.zip # No issue
❯ cp Desktop/MEGA_copy.zip ~/Downloads # No issue
❯ cp Desktop/MEGA_copy1.zip ~/Downloads # Failed
cp: error reading 'MEGA_copy1.zip': Input/output error
1 of 4 copies failed.
That means my SSD is damaged after only 1 month.
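To make this kind of intermittent failure easier to catch, a small loop along these lines (file names and destination are only illustrative) copies the archive repeatedly and reports any read error or checksum mismatch:
# compute the reference checksum once, then copy and verify repeatedly
ref=$(sha1sum MEGA.zip | cut -d' ' -f1)
for i in $(seq 1 10); do
    cp MEGA.zip ~/Desktop/MEGA_copy_$i.zip || { echo "copy $i: cp error"; continue; }
    sum=$(sha1sum ~/Desktop/MEGA_copy_$i.zip | cut -d' ' -f1)
    [ "$sum" = "$ref" ] || echo "copy $i: checksum mismatch ($sum)"
done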
I checked smartctl again after the failed copy:
❯ sudo smartctl -A /dev/nvme0n1
=== START OF SMART DATA SECTION ===
SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x00
...
Media and Data Integrity Errors: 0
Error Information Log Entries: 354 # The same number
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0
The “Error Information Log Entries” value did not change.
I know it counts up after a restart; it has nothing to do with the copy issue.
I checked nvme:
❯ sudo nvme error-log /dev/nvme0n1
Error Log Entries for device:nvme0n1 entries:63
.................
Entry[ 0]
.................
error_count : 354
sqid : 0
cmdid : 0x19
status_field : 0x2002(INVALID_FIELD: A reserved coded value or an unsupported value in a defined field)
phase_tag : 0
parm_err_loc : 0x28
lba : 0
nsid : 0
vs : 0
trtype : The transport type is not indicated or the error is not transport related.
cs : 0
trtype_spec_info: 0
.................
Entry[ 1]
.................
error_count : 0
sqid : 0
cmdid : 0
status_field : 0(SUCCESS: The command completed successfully)
phase_tag : 0
parm_err_loc : 0
lba : 0
nsid : 0
vs : 0
trtype : The transport type is not indicated or the error is not transport related.
cs : 0
trtype_spec_info: 0
.................
...
So nvme and smartctl cannot detect that my SSD’s blocks are damaged?
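If the drive and the installed nvme-cli version support it, a device self-test reads the media more thoroughly than the SMART attributes alone; a sketch (0x2 requests the extended test):
❯ sudo nvme device-self-test /dev/nvme0n1 -s 2   # start the extended self-test
❯ sudo nvme self-test-log /dev/nvme0n1           # check progress and result later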
The return of the damaged SSD was confirmed after I contacted the vendor today. I will get a refund.
I was wrong to blame the SSD; BTRFS is the culprit.
Today I did enough testing of BTRFS vs. EXT4 on the same partition “nvme1n1p5” of my other, older SSD.
- First I created a new 200 GB partition with BTRFS.
- I ran badblocks to determine whether any block is bad:
❯ sudo badblocks -v /dev/nvme1n1p5 > Desktop/badsectors.txt
Checking blocks 0 to 204799999
Checking for bad blocks (read-only test):
done
Pass completed, 0 bad blocks found. (0/0/0 errors)
There is no bad block in this partition on the old SSD.
- I made many copies of the same 15 GB zip file on BTRFS.
/run/media/zesko/Test ❯ cp test1.zip test2.zip # OK
/run/media/zesko/Test ❯ cp test2.zip test3.zip # OK
/run/media/zesko/Test ❯ cp test3.zip test4.zip # Failed
cp: error reading 'test3.zip': Input/output error
Oh, the same error appeared, because test3.zip is the broken copy.
❯ journalctl -b -p 3
-- Journal begins at Tue 2021-08-17 19:59:59 CEST, ends at Tue 2021-09-21 13:55:23 CEST. --
Sep 21 13:55:23 zesko kernel: BTRFS error (device nvme1n1p5): bdev /dev/nvme1n1p5 errs: wr 0, rd 0, flush 0, corrupt 8, gen 0
Sep 21 13:55:23 zesko kernel: BTRFS error (device nvme1n1p5): bdev /dev/nvme1n1p5 errs: wr 0, rd 0, flush 0, corrupt 9, gen 0
Sep 21 13:55:23 zesko kernel: BTRFS error (device nvme1n1p5): bdev /dev/nvme1n1p5 errs: wr 0, rd 0, flush 0, corrupt 10, gen 0
- I switched from BTRFS to EXT4 on the same partition (deleting the partition, then creating a new EXT4 partition at the same location on the SSD).
- I ran badblocks again; the result showed no bad blocks.
- I ran e2fsck, which supports extX only, not BTRFS:
❯ sudo e2fsck -c /dev/nvme1n1p5
e2fsck 1.46.4 (18-Aug-2021)
Checking for bad blocks (read-only test): done
Test: Updating bad block inode.
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Test: ***** FILE SYSTEM WAS MODIFIED *****
Test: 11/12804096 files (0.0% non-contiguous), 1081283/51200000 blocks
- I tested many copies on EXT4 again:
/run/media/zesko/Test ❯ cp test1.zip test2.zip # OK
/run/media/zesko/Test ❯ cp test2.zip test3.zip # OK
/run/media/zesko/Test ❯ cp test3.zip test4.zip # OK
/run/media/zesko/Test ❯ cp test4.zip test5.zip # OK
/run/media/zesko/Test ❯ cp test5.zip test6.zip # OK
/run/media/zesko/Test ❯ cp test6.zip test7.zip # OK
/run/media/zesko/Test ❯ cp test7.zip test8.zip # OK
/run/media/zesko/Test ❯ cp test9.zip test10.zip # OK
/run/media/zesko/Test ❯ cp test10.zip test11.zip # OK
/run/media/zesko/Test ❯ cp test11.zip test12.zip # OK
/run/media/zesko/Test ❯ cp test12.zip test13.zip # OK
/run/media/zesko/Test ❯ cp test13.zip test14.zip # OK
All copies completed without an I/O error on EXT4. But I checked the latest copy, test13.zip, and it has a few bad files (bad CRC) after 13 copies.
I can confirm that BTRFS has a write bug with a single SSD without RAID. EXT4 has no such issue.
I am cancelling the return of my new SSD because it is okay.
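As a side note, the corruption counters that showed up in journalctl during the BTRFS run can also be read directly from the mounted filesystem (mount point as used above; fields like corruption_errs are what the kernel log was incrementing):
❯ sudo btrfs device stats /run/media/zesko/Test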
So to reiterate, these aren’t even newly compressed files, just simple copies of the original .zip file:
Ext4:
Even after 13 copies, every single file has the same checksum and passes its internal integrity test?
Btrfs:
After only a few copies of the same file, there’s an I/O error and the third .zip copy is corrupted?
When you made 13 copies of the same .zip file under Ext4, and they all passed, did you also verify they all had the same SHA1 checksum?
sha1sum *.zip
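And, to exercise each archive's internal CRCs as well, something along these lines (illustrative) would do it:
❯ for f in test*.zip; do unzip -tq "$f" > /dev/null && echo "$f: OK" || echo "$f: CRC errors"; done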
@linux-aarhus is this an example of Btrfs “barfing”?
@Zesko do you have the time to run a full memtest pass and mprime, to rule out the CPU and RAM?
Maybe the compression of btrfs is what produces the fault?
Compression of already compressed files seems to be tricky.
I can’t speak for Btrfs, but with ZFS the default inline compression of LZ4 simply “skips” attempting to compress a record if it determines ahead of time that the data will not be worth compressing.
I assume Btrfs’ inline compression does the same thing?
EDIT: Besides that, it’s no excuse for a write to be corrupted.
The bugs don’t ask if there is an excuse available.
No, every zip file has a different checksum. The latest zip has more bad CRCs than the first zip. The same happens on BTRFS.
It is not always the third .zip that is corrupted. When I overwrote it again (copying test2.zip to test3.zip), there was no error. But sometimes a later, random copy is corrupted.
I checked both of my SSDs; both have the same issue.
Whoa, whoa, whoa! Time out! Something’s way off here.
Definitely run a few memtest passes completely, and run mprime overnight (or as long as you can spare.)
This doesn’t look like a filesystem issue anymore.
It must be one of the following, or a combination:
- Drive (less likely since you said it happens on different drives)
- RAM ← most likely
- CPU ← not as likely, but must rule it out
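If it helps, both can be given a quick workout from the running system; this is only a sketch, assuming the memtester and mprime packages are installed (a full memtest86+ run from a boot stick is still the more thorough RAM test):
❯ sudo memtester 4G 3   # lock and test 4 GiB of RAM for 3 passes
❯ mprime -t             # Prime95 torture test for CPU and memory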
But when you copy, the checksum should always stay the same!
I see, but I thought every zip file would differ because it has a different timestamp.
You’re doing a file copy. Timestamps are saved as metadata in the filesystem; not on the file itself.
If you copy a .zip file a MILLION times and have a MILLION copies, they should all have the same sha1 checksum.
Weren’t you simply using the cp command to make 13 copies of the same .zip file under Ext4?
test1.zip 15 GB Checksum: MD5: b729c897a8259383d0dc979c8d534e55
test2.zip 15 GB Checksum: MD5: b729c897a8259383d0dc979c8d534e55
test3.zip 15 GB Checksum: MD5: 78df6e8e3eba0de4ee41aeeadfd1a905
This is on EXT4; the zip file is big.
I seriously think it’s a RAM or CPU issue (and still haven’t entirely ruled out the drives, but those are less likely to be failing at the same time while reporting no obvious errors).
A 15 GB zip file is an edge case, and I don’t think this is representative of the filesystem as such.
My experience is from a couple of years back, and in my case it was my entire projects partition that went haywire.
Such a project structure is literally tens of thousands of files when you count in the git structure with its tiny fragments.
I don’t know what caused it - luckily it was only a filesystem test - so I had 99% stored on a removable device. I lost tiny pieces but my trust in said filesystem vanished and I have never touched it since.
Even f2fs has done something similar - which is why I now use ext4 only.
Wouldn’t it be dangerous for full disk encryption with LUKS when data is corrupted? Could it then no longer be decrypted?
Both of my SSDs are unencrypted, as they always stay at home, unlike a laptop.
That is a possibility.
This looks quite bad, and you should check your RAM and other system components as advised.
As it stands now, you cannot be certain of the data integrity of any data written by your machine.
Files installed via packages can be checked with paccheck from the pacutils package:
$ sudo paccheck --sha256sum --quiet
For any other (important) file you can only try to check against a backup and inspect all changes manually.
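A dry-run, checksum-based compare against a backup will at least list every file whose content differs (paths are only illustrative):
$ rsync -rcn --itemize-changes /path/to/backup/ /path/to/data/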
Not only is data recovery from an encrypted block device more difficult as it is, but if there’s any corruption of the LUKS header (and its backup) due to a faulty CPU, RAM, or SSD, you have essentially “securely wiped” your data permanently.
Once your system demonstrates it consistently writes the incorrect bits into a new file (or copy), it’s practically compromised. You cannot trust anything else.
You can no longer trust…
- updates
- package installations
- creating new files
- saving files
- copying files
- downloads
- transferring multimedia to and from the computer
…and so on.
In the end, your only means to protect yourself against losing data permanently is not redundancy/RAID, not bitrot protection against corruption, and not even a simple backup (in which it can fail in cold storage), but rather multiple backups of different types, and frequently using them and checking them.
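Checking a backup can be as simple as keeping a checksum manifest for it and re-verifying that manifest periodically (paths are only illustrative):
$ find /mnt/backup -type f -exec sha256sum {} + > ~/backup-manifest.sha256
$ sha256sum -c --quiet ~/backup-manifest.sha256   # prints only the files that fail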
But that’s off topic!
memtest86-efi finished a complete test after 7 hours. Everything is OK.
But it showed [Note] RAM may be vulnerable to high frequency row hammer bit flips.
I found another problem: not only copy errors, but also read or CPU calculation errors?
The same file test1.zip has many different checksums. I did not change it.
❯ sha1sum test1.zip
e9bb2bfc90ad058aa73154cce4814fd744cc11f8 test1.zip # Failed
❯ sha1sum test1.zip
ea1127716ab44a6b74507ec7c503409ae5e84f21 test1.zip # Failed
❯ sha1sum test1.zip
bdb8b9098a3b0fc7055f20af586457156023be2d test1.zip # Failed
❯ sha1sum test1.zip
0d7f28ed306c395baa0ec9436a356672843e9021 test1.zip # Failed
on my old SSD.
❯ sha1sum test1.zip
b35cd8b5df411a5b20ac456f67b95f089502919f test1.zip # OK
❯ sha1sum test1.zip
b35cd8b5df411a5b20ac456f67b95f089502919f test1.zip # OK
❯ sha1sum test1.zip
f48ea7c128e8f7f892636f8c05623a243ec09c01 test1.zip # Failed
❯ sha1sum test1.zip
3f2b211343854f8d2faf70800e76a86ff75106a9 test1.zip # Failed
❯ sha1sum test1.zip
f4ec8624a5ae5f61bd3b7e95b898dc841436fd6e test1.zip # Failed
on my new SSD.
Both SSDs are affected on the same computer (device 1).
I checked the same 15 GB file on my 64 GB USB stick (EXT4) on device 1; it shows different checksums too.
But when I checked this USB stick on my other computer (device 2), there was no issue → the same file always shows the same checksum.
I also checked the zip on Windows 10 on the same device 1: data transfer and copying have the problem there too (many copies are correct, a few are incorrect), but the checksum of each individual zip is always stable, whereas on Linux the same file shows random, different checksums.
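A quick way to see how unstable the read path is: hash the same, unchanged file repeatedly and count the distinct digests; on a healthy machine this prints exactly 1:
❯ for i in $(seq 1 20); do sha1sum test1.zip; done | awk '{print $1}' | sort -u | wc -l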
Device 1 is currently running mprime. I see that many tests have passed.
I also checked the zip file from a Manjaro Live USB on device 1: the same issue.
I am stopping mprime now. I will try changing mainboard BIOS settings, for example switching AMD-V off. I will remove 3 of the 4 RAM sticks and then see if it works.