"Bad message" since last update (March 27)

obones · 4 April 2021 11:55

Hello,

I’ve been using Manjaro to run a 24/7 server which, among other things, does some file backup. Up until last week, it was running just fine but now that I have done the huge update, I get the following error message every few days:

rsync: [sender] readlink_stat("/files/web/backup_20180819/www/mod/comments/img/back_to_top.png") failed: Bad message (74)
rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1330) [sender=v3.2.3]

It’s always a different file when this happens, sometimes even up to 5 files at at time, but always in a subdir of the /files mount point.
Luckily, this is not on the main file system, so I can stop every service accessing /files, unmount it and run this command:

[obones@server ~]$ sudo fsck /dev/sdb1
fsck de util-linux 2.36.2
e2fsck 1.46.2 (28-Feb-2021)
/dev/sdb1 contient un système de fichiers comportant des erreurs, vérification forcée.
Passe 1 : vérification des i-noeuds, des blocs et des tailles
L'arbre d'extent de l'i-noeud 88872381 (au niveau 2) pourrait être plus étroit. Optimiser<o>? oui pour tout
Passe 1E : Optimisation des arbres d'extent
Passe 2 : vérification de la structure des répertoires
Passe 3 : vérification de la connectivité des répertoires
Passe 4 : vérification des compteurs de référence
Passe 5 : vérification de l'information du sommaire de groupe

/dev/sdb1: ***** LE SYSTÈME DE FICHIERS A ÉTÉ MODIFIÉ *****
/dev/sdb1 : 1202591/122101760 fichiers (0.2% non contigus), 241306152/488378368 blocs

Once this has been done, the problematic files are accessible again, undamaged and the next rsync command works just fine.
I thought that maybe the harddrive onto which /dev/sdb1 is based is dying, so I ran this command:

[obones@server ~]$ sudo smartctl -a /dev/sdb
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.4.105-1-MANJARO] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate IronWolf
Device Model:     ST2000VN004-2E4164
Serial Number:    Z527RCM2
LU WWN Device Id: 5 000c50 0b4e5a1b9
Firmware Version: SC60
User Capacity:    2 000 398 934 016 bytes [2,00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5900 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2, ACS-3 T13/2161-D revision 3b
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Sun Apr  4 11:09:51 2021 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (  107) seconds.
Offline data collection
capabilities:                    (0x73) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        No Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 257) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x10bd) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   108   099   006    Pre-fail  Always       -       19463616
  3 Spin_Up_Time            0x0003   096   095   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       44
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   072   060   030    Pre-fail  Always       -       17205909
  9 Power_On_Hours          0x0032   081   081   000    Old_age   Always       -       16867
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       51
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   075   075   000    Old_age   Always       -       25
190 Airflow_Temperature_Cel 0x0022   064   057   045    Old_age   Always       -       36 (Min/Max 25/38)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       16
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       92
194 Temperature_Celsius     0x0022   036   043   000    Old_age   Always       -       36 (0 13 0 0 0)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     16816         -
# 2  Short offline       Completed without error       00%     15643         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

But to me, it’s not showing any specific issues. Running a self test yields no error. Using the serial number at the manufacturer’s website tells me its warranty expires in 2022, so I believe it’s not that old.

I believe the update that I did on March 27th brought something strange, but I cannot figure out what it could be with the huge number of packets that were updated.

The result of uname -a is this:

Linux server 5.4.105-1-MANJARO #1 SMP PREEMPT Thu Mar 11 18:27:58 UTC 2021 x86_64 GNU/Linux

I know that I’m not on the latest 5.10 kernel but I can’t use it because the video card is a bit dated and does not have drivers available for that kernel.

Would you have any suggestion on my filesystem issue?

Thanks for your help

obones · 5 April 2021 12:00

This night it was even worse, the entire server froze minutes after the rsync command started, with this message in journalctl:

avril 05 04:02:06 server kernel: general protection fault: 0000 [#1] PREEMPT SMP NOPTI
avril 05 04:02:06 server kernel: CPU: 0 PID: 101 Comm: kswapd0 Tainted: P           OE     5.4.105-1-MANJARO #1
avril 05 04:02:06 server kernel: Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./870 Extreme3, BIOS P1.60 09/14/2010
avril 05 04:02:06 server kernel: RIP: 0010:rb_next+0x14/0x50
avril 05 04:02:06 server kernel: Code: d8 e9 f2 fe ff ff 48 c7 07 01 00 00 00 c3 0f 1f 84 00 00 00 00 00 48 8b 17 48 39 d7 74 35 48 8b 47 08 48 85 c0 74 1c 49 >
avril 05 04:02:06 server kernel: RSP: 0018:ffffb429801dbb00 EFLAGS: 00010286
avril 05 04:02:06 server kernel: RAX: 8000000000000000 RBX: ffffb429801dbba4 RCX: 0000000000000000
avril 05 04:02:06 server kernel: RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff8e0901a309b0
avril 05 04:02:06 server kernel: RBP: 00000000ffffffff R08: 8000000000000000 R09: 0000000000000001
avril 05 04:02:06 server kernel: R10: ffff8e095432df00 R11: 0000000000000000 R12: ffff8e095b23d1f0
avril 05 04:02:06 server kernel: R13: ffffb429801dbb54 R14: ffff8e0901a309b0 R15: ffff8e095b23d2d8
avril 05 04:02:06 server kernel: FS:  0000000000000000(0000) GS:ffff8e0a27a00000(0000) knlGS:0000000000000000
avril 05 04:02:06 server kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
avril 05 04:02:06 server kernel: CR2: 00007f68fac17ff8 CR3: 00000001e725e000 CR4: 00000000000006f0
avril 05 04:02:06 server kernel: Call Trace:
avril 05 04:02:06 server kernel:  es_do_reclaim_extents+0x82/0x120 [ext4]
avril 05 04:02:06 server kernel:  es_reclaim_extents+0x78/0xf0 [ext4]
avril 05 04:02:06 server kernel:  __es_shrink+0xa8/0x2c0 [ext4]
avril 05 04:02:06 server kernel:  ext4_es_scan+0xc9/0x150 [ext4]
avril 05 04:02:06 server kernel:  do_shrink_slab+0x149/0x2f0
avril 05 04:02:06 server kernel:  shrink_slab+0xd0/0x2c0
avril 05 04:02:06 server kernel:  shrink_node.isra.0+0xda/0x440
avril 05 04:02:06 server kernel:  balance_pgdat+0x2e7/0x530
avril 05 04:02:06 server kernel:  kswapd+0x1fd/0x3e0
avril 05 04:02:06 server kernel:  ? wait_woken+0x80/0x80
avril 05 04:02:06 server kernel:  ? balance_pgdat+0x530/0x530
avril 05 04:02:06 server kernel:  kthread+0x13e/0x160
avril 05 04:02:06 server kernel:  ? __kthread_bind_mask+0x60/0x60
avril 05 04:02:06 server kernel:  ret_from_fork+0x22/0x40
avril 05 04:02:06 server kernel: Modules linked in: ccm nvidia_uvm(POE) tun xt_nat veth xfrm_user xfrm_algo br_netfilter bridge stp llc overlay aufs ip6t_REJEC>
avril 05 04:02:06 server kernel:  joydev mousedev vmnet(OE) w83627ehf hwmon_vid snd_hda_codec_hdmi ppdev wmi_bmof pktcdvd edac_mce_amd kvm_amd ccp rng_core kvm>
avril 05 04:02:06 server kernel: ---[ end trace 67d9dd8a4398a4ed ]---
avril 05 04:02:06 server kernel: RIP: 0010:rb_next+0x14/0x50
avril 05 04:02:06 server kernel: Code: d8 e9 f2 fe ff ff 48 c7 07 01 00 00 00 c3 0f 1f 84 00 00 00 00 00 48 8b 17 48 39 d7 74 35 48 8b 47 08 48 85 c0 74 1c 49 >
avril 05 04:02:06 server kernel: RSP: 0018:ffffb429801dbb00 EFLAGS: 00010286
avril 05 04:02:06 server kernel: RAX: 8000000000000000 RBX: ffffb429801dbba4 RCX: 0000000000000000
avril 05 04:02:06 server kernel: RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff8e0901a309b0
avril 05 04:02:06 server kernel: RBP: 00000000ffffffff R08: 8000000000000000 R09: 0000000000000001
avril 05 04:02:06 server kernel: R10: ffff8e095432df00 R11: 0000000000000000 R12: ffff8e095b23d1f0
avril 05 04:02:06 server kernel: R13: ffffb429801dbb54 R14: ffff8e0901a309b0 R15: ffff8e095b23d2d8
avril 05 04:02:06 server kernel: FS:  0000000000000000(0000) GS:ffff8e0a27a00000(0000) knlGS:0000000000000000
avril 05 04:02:06 server kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
avril 05 04:02:06 server kernel: CR2: 00007f68fac17ff8 CR3: 00000001e725e000 CR4: 00000000000006f0
avril 05 04:02:06 server kernel: note: kswapd0[101] exited with preempt_count 1

It’s as if there is something new in the kernel that makes it panic with my hardware.

megavolt · 5 April 2021 12:59

Hello @obones

There definitely something wrong with your hard drive.

Not sure, but the call trace seems to point to a swapfile which is on a ext4 filesystem and it fails there to reclaim pages.

So rsync has a problem to read data and the swapfile/partition has a problem to read from that disk. I would recommend to replace it. Maybe it is a firmware error… or a mechanical problem.

Maybe try also to check for badblocks:

fsck -vcck /dev/sdb1

https://wiki.archlinux.org/index.php/Badblocks#Read-write_test_(non-destructive)

obones · 5 April 2021 19:49

Thanks, there is definitely something strange here indeed.
The swap partition is on /dev/sda2 which is another hard drive for which I get these smartctl results:

smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.4.105-1-MANJARO] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Samsung based SSDs
Device Model:     Samsung SSD 850 EVO 250GB
Serial Number:    S2R6NXAH217707A
LU WWN Device Id: 5 002538 d40a92818
Firmware Version: EMT02B6Q
User Capacity:    250 059 350 016 bytes [250 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Form Factor:      2.5 inches
TRIM Command:     Available
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2, ATA8-ACS T13/1699-D revision 4c
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Mon Apr  5 21:37:17 2021 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (    0) seconds.
Offline data collection
capabilities:                    (0x53) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        No Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 133) minutes.
SCT capabilities:              (0x003d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   096   096   000    Old_age   Always       -       20002
 12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       791
177 Wear_Leveling_Count     0x0013   099   099   000    Pre-fail  Always       -       13
179 Used_Rsvd_Blk_Cnt_Tot   0x0013   100   100   010    Pre-fail  Always       -       0
181 Program_Fail_Cnt_Total  0x0032   100   100   010    Old_age   Always       -       0
182 Erase_Fail_Count_Total  0x0032   100   100   010    Old_age   Always       -       0
183 Runtime_Bad_Block       0x0013   100   100   010    Pre-fail  Always       -       0
187 Uncorrectable_Error_Cnt 0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0032   065   053   000    Old_age   Always       -       35
195 ECC_Error_Rate          0x001a   200   200   000    Old_age   Always       -       0
199 CRC_Error_Count         0x003e   100   100   000    Old_age   Always       -       0
235 POR_Recovery_Count      0x0012   099   099   000    Old_age   Always       -       36
241 Total_LBAs_Written      0x0032   099   099   000    Old_age   Always       -       9778448190

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     19999         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
  255        0    65535  Read_scanning was never started
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

I don’t see any errors in there, but I’m not that much of an expert in this.

That being said, if the swap partition is dying, I believe it could explain the corruption of another harddrive if file transfers are going through the swap. Is there a command to run to test the swap partition?
I tried fsck -vcck /dev/sda2 but it returned immediately when I expected it to take a bit of time to perform its work. So I did the following:

sudo swapoff /dev/sda2
sudo badblocks -nsv /dev/sda2

And it gave me this:

Vérification des blocs défectueux dans un mode non destructif de lecture-
écriture
Du bloc 0 au bloc 9227468
Vérification des blocs défectueux (test non destructif de lecture-écriture)
Test en cours avec un motif aléatoire : complété
Passe complétée, 0 blocs défectueux repérés. (0/0/0 erreurs)

So definitely no bad blocks detected and I reactivated the swap after that.
We’ll see how it behaves tonight.

obones · 23 April 2021 15:40

Well, there was an update a week ago that, among other things, moved the kernel package from 5.4.105 to 5.4.108 and the issue has not cropped up ever since.
Interestingly, this update also led to a swap between /dev/sda and /dev/sdb, so maybe there was a filesystem related change in all this.
I looked at the kernel change log but could not find anything obvious.

Today, I see that there is another update for the kernel to move to 5.4.112, I hope it will not bring back the issue I was having.

linux-aarhus · 9 February 2022 17:45