Questions upgrading a mdadm RAID1 array

Hmm, not a good initial sign. After connecting the drives, the boot paused for a bit before the boot sequence continued, listed the new SATA drives as connected, and presented the GRUB boot menu.

Once inside Manjaro, I launched DiskMonitor, and it appears the drives’ SMART isn’t happy right out of the gate, with raw-read-error-rate and seek-error-rate highlighted on each drive…


Might this have just been some initial spin-up “jitters” and I should queue up the full extended SMART test… or is this a bad sign and I should take the drives back to the store for replacement?

It depends on the manufacturer of the drive. If it is WD, return it; the value will only rise on real errors. If it is Seagate, it’s totally normal; don’t worry about those two, as they will rise constantly. There are some calculators on the web to see the real error count for Seagate drives.

1 Like

Well, the new drives are Seagate IronWolf Pros… but I must admit that errors listed here as “normal” are a bit concerning.

I started the Extended SMART checks figuring it couldn’t make things worse… and aborted after the seek errors rose from ~14K to 135K in maybe 10-15 minutes.
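For anyone curious, aborting an in-progress self-test can also be done from the command line; smartctl’s -X flag does exactly that:

$ sudo smartctl -X /dev/sdX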

Do you have any links to share that explain these numbers being normal?

2 Likes

A raw-read-error-rate on a Seagate drive with a value of 4511 is an internal code for “PDAA” (“pernicious drive atrophy anomaly”). Everything is in imminent danger, including all your hardware components. It starts with the hard drives, then it moves to your CPU and RAM, and then it comes after your family and loved ones! :astonished:

Actually, @xabbu is right. Those values on Seagate drives are used internally and do not denote actual errors. There’s a weird “calculation” they use. A “short” (or grueling “long”) SMART test will determine if the drives are really failing. Any real failures will be logged as legitimate errors, which you can review with:

sudo smartctl -l error /dev/sdX
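The tests themselves can be kicked off the same way (standard smartctl usage; replace sdX with your actual device):

sudo smartctl -t short /dev/sdX
sudo smartctl -t long /dev/sdX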

That’s probably all it was.

Thank You for the information @xabbu and @winnie!

After doing some internet searching of my own, I found a post at https://www.truenas.com/community/threads/seagate-ironwolf-smart-test-raw_read_error_rate-seek_error_rate.68634/ where someone posted an altered smartctl command that reveals the “true” raw-read-error-rate and seek-error-rate values, which indeed now show 0s…

$ sudo smartctl -a -v 1,raw48:54 -v 7,raw48:54 /dev/sdb

Drive1
$ sudo smartctl -a -v 1,raw48:54 -v 7,raw48:54 /dev/sda
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-5.15.28-1-MANJARO] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     ST18000NE000-2YY101
LU WWN Device Id: 5 000c50 0e36a5e3c
Firmware Version: EN01
User Capacity:    18,000,207,937,536 bytes [18.0 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        Not in smartctl database 7.3/5319
ATA Version is:   ACS-4 (minor revision not indicated)
SATA Version is:  SATA 3.3, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Thu Mar 17 09:49:26 2022 CDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      ( 249) Self-test routine in progress...
                                        90% of test remaining.
Total time to complete Offline 
data collection:                (  559) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine 
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        (1546) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x50bd) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   100   100   044    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0003   099   099   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       1
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   100   253   045    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       0
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       1
 18 Unknown_Attribute       0x000b   100   100   050    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   253   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   061   058   000    Old_age   Always       -       39 (Min/Max 23/39)
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       1
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       4
194 Temperature_Celsius     0x0022   039   040   000    Old_age   Always       -       39 (0 23 0 0 0)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   253   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0023   100   100   001    Pre-fail  Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       0 (200 202 0)
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       0
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       4511

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Self-test routine in progress 90%         0         -
# 2  Extended offline    Aborted by host               90%         0         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
Drive2
$ sudo smartctl -a -v 1,raw48:54 -v 7,raw48:54 /dev/sda
(output identical to Drive1 above)

I also restarted the SMART Extended tests I’d aborted… and I’m feeling much better with the knowledge you’ve both shared and the command I found.


EDIT:
Ah-ha, now I understand why/how raw48:54 works… from one of @xabbu’s links…

The raw value of each SMART attribute occupies 48 bits. Seagate’s Seek Error Rate attribute consists of two parts – a 16-bit count of seek errors in the uppermost 4 nibbles, and a 32-bit count of seeks in the lowermost 8 nibbles.

So raw48 sets the proper data type, and :54 zeroes in on the “seek error” value in bytes 5 and 4 (the uppermost 16 bits), discarding bytes 3 through 0, which hold the 32-bit seek count.

The online calculator is likely doing the same thing, just showing the bytes-5-4 and bytes-3-0 values separately.

Seagate super-packed the data! So unless the reported decimal number is >= 4,294,967,296 (2^32), no error has actually happened yet.
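A quick way to sanity-check that split is plain shell arithmetic, using the ~135K seek figure from earlier as a stand-in raw value:

$ raw=135000
$ echo "errors: $(( raw >> 32 ))  seeks: $(( raw & 0xFFFFFFFF ))"
errors: 0  seeks: 135000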

:white_check_mark: Extended SMART tests

:heavy_plus_sign:

:white_check_mark: 18TB drives

:heavy_equals_sign:

Can’t wait to hear about the results in June 2027. :v:

1 Like

Well, I’d recently replaced all the 4TB drives in my Synology NAS… and an Extended SMART test on those 5400RPM 4TB drives took ~9-10 hours.

I’m not sure how linear it would be to say 4.5 times the disk space should take 4.5 times the duration (40-45 hours)… but I’m hopeful the 7200RPM spindle speed helps cut the time down. But if it takes 2+ days, I’m fine with that.


EDIT:
Passed the 1M seeks mark (passed 2M actually), and the “Worst” value changed from 253 to 60 as outlined in the guide (normalized also changed from 100 to 64). Things are going as expected; in the 20% range now for the extended check :wink:


EDIT2:
After 14 head-flying-hours and almost 6B seeks, the extended check is already somewhere in its 60th-percentile range… I have a feeling it’ll be (near) done by the time I wake up tomorrow… it all depends on how much it slows down as it approaches the end of the disk :+1:


EDIT3:
After 21 head-flying-hours and over 8.78B seeks, the extended check is already somewhere in its 90th-percentile range… it’s the home stretch!


EDIT4:
After 23 head-flying-hours and almost 9.6B seeks, the extended check has finally completed! Not bad for 18TB :wink:

2 Likes

@Daniel-I, do you know where the nearest flying car repair shop is? I could use my teleporter to go to work, but I still like to run a few errands on the way home. Besides, the teleporter draws way too much electricity, and I’m trying to lower my energy bill.

And away we go!

  1. Used KDE Disk manager to create the GPT partition table and “unformatted” partition on each disk
  2. Used parted to set them as raid disks
$ sudo parted /dev/sdx
(parted) set 1 raid on
(parted) print
(parted) quit
  3. Kicked off mdadm (Assuming the mdadm ETA is close… another 23 hours until the next step)
$ sudo -i                                      
# mdadm --create --verbose --level=1 --metadata=1.2 --raid-devices=2 --name=0 /dev/md/RAID1 /dev/sda1 /dev/sdb1
mdadm: size set to 17578189824K
mdadm: automatically enabling write-intent bitmap on large array
mdadm: array /dev/md/RAID1 started.
# cat /proc/mdstat
Personalities : [raid1] 
md0 : active raid1 sdb1[1] sda1[0]
      17578189824 blocks super 1.2 [2/2] [UU]
      [>....................]  resync =  0.0% (2531392/17578189824) finish=1388.6min speed=210949K/sec
      bitmap: 131/131 pages [524KB], 65536KB chunk
  4. Confirmed md0 was created this time
$ ls -la /dev/md*
brw-rw---- 1 root disk 9,   0 Mar 18 09:25 /dev/md0
brw-rw---- 1 root disk 9, 127 Mar 18 09:18 /dev/md127

/dev/md:
total 0
drwxr-xr-x  2 root root   80 Mar 18 09:25 .
drwxr-xr-x 22 root root 4920 Mar 18 09:25 ..
lrwxrwxrwx  1 root root    6 Mar 18 09:25 RAID1 -> ../md0
lrwxrwxrwx  1 root root    8 Mar 18 09:18 RAID1Array -> ../md127
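As an aside, the array’s state and resync progress can also be watched with mdadm itself (device name taken from the output above):

$ sudo mdadm --detail /dev/md0
$ watch -n 60 cat /proc/mdstat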
P.S. I really appreciate DiskMonitor

1 Like

If ZFS were part of mainline kernel development, you’d be cruising with it. But until then, it’s still a second-class citizen in the Linux sphere.

Though I must say, once I got into ZFS, I never looked back.

I think I recall you saying you use ZFS on your NAS and XFS on your PC… I had a moment where I thought I might try XFS with this new array, but I wasn’t sure how many tools (like fsck and e4defrag) would change (if any, beyond the format string) with a new filesystem (or whether I was ready to learn them :wink: )… so I may be in the EXT4 camp a little longer. I guess I have 23 hours to change my mind.

I’d shy away from ZFS, as having 13%+ of the CPU on my main/gaming rig dedicated to it would not be ideal.

XFS shines when you have more CPU cores/threads and are dealing with large files or parallel operations. Otherwise, EXT4 is a solid, “tried-and-true” filesystem. There’s nothing inherently wrong with it, except for the ludicrous default of reserving 5% of capacity for outdated reasons.
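That reserve isn’t set in stone, either; it can be inspected and changed after formatting with tune2fs (a sketch, with a placeholder device path):

$ sudo tune2fs -l /dev/sdXn | grep -i 'reserved block'
$ sudo tune2fs -m 0 /dev/sdXn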

EDIT: I remember reading a while back that some forensics were done on SSDs, and XFS demonstrated that it worked more efficiently with “trims”, but this was some time ago, and I’m not sure how relevant it is today.

As for the “check” and “repair” tools, I almost never use them anymore, nor find myself in situations where I need to rely on them to safeguard my data. (I use multiple backups, and keep an eye on the drives’ health.)
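(For the record, the EXT4 checker would be e2fsck, run against the unmounted device; a sketch using the array device from this thread:)

$ sudo e2fsck -f /dev/md0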


While ZFS is more CPU- and RAM-intensive, it’s not that bad. I believe what’s being experienced lately is a regression in Linux kernel 5.16+ (since 5.15 works smoothly in my tests).

Besides, ZFS handles everything in one go: filesystem operations, metadata, checksums, automatic integrity checks and repairs (with every read/write I/O), compression, redundancy/parity, and encryption. That’s a lot happening in real time, all by ZFS itself.
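To illustrate, a mirrored pool with compression enabled takes a single command (a sketch; the pool name and device paths here are hypothetical):

$ sudo zpool create -O compression=lz4 tank mirror /dev/sda /dev/sdb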


The real issue is legal, and thus it bleeds into the technical. Because the Linux kernel developers have no assurances that Oracle will avoid a litigious route, ZFS development and maintenance in the Linux sphere must follow along with kernel development rather than be incorporated into it. If the Linux kernel developers do something that “breaks” or “regresses” something in ZFS, it’s up to the OpenZFS developers and community to bear the responsibility for making it work. From the Linux kernel team’s perspective, it’s “Aww, shucks, that’s too bad, but we’re not responsible.”

1 Like

At the 49 head-flying-hours mark (an additional 26 hours):

  1. the drives are sync’d
# cat /proc/mdstat
Personalities : [raid1] 
md0 : active raid1 sdb1[1] sda1[0]
      17578189824 blocks super 1.2 [2/2] [UU]
      bitmap: 0/131 pages [0KB], 65536KB chunk
  2. mdadm.conf is updated… # mdadm --detail --scan >> /etc/mdadm.conf (edited out the duplicate entry this created for the pre-existing/old array)
  3. initramfs is updated… # mkinitcpio -P
  4. and the drives are being formatted EXT4… # mkfs.ext4 -v -m 0 -L SeagateR1 -b 4096 -E lazy_itable_init=0,lazy_journal_init=0 /dev/md/RAID1

EDIT:
Head-flying-hours only increased by 1 more hour in the time it took to complete the drive format (with the lazy options off), and now I’ve:
5. created its mountpoint… $ sudo mkdir /data/st18raid
6. gave my user access… $ sudo chown $USER:$USER /data/st18raid (hmm, I think I needed to do this after the mount)
7. added fstab entry (see the UUID note below)… $ sudo nano /etc/fstab
UUID=abeef326-c724-4917-bd05-97ec5facba24 /data/st18raid ext4 defaults,noatime 0 2
8. mounted the drive… $ sudo mount /data/st18raid
9. and started the data transfer… $ rsync -avhH --info=progress2 /data/raid1/* /data/st18raid/
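(A note on step 7: the filesystem UUID for the fstab entry can be read straight off the array device with blkid; the device name is taken from the earlier steps:)

$ sudo blkid /dev/md0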

1 Like

Plot twist: @Daniel-I is going to save this thread and his step-by-step instructions as a .txt file, as the one and only file on the entire array, just to make absolutely sure he has a good copy of these instructions for the next array he plans on building.

I have the instructions written down… unfortunately, my old notes had some lines highlighted yellow (I usually do that when I’m not 100% sure what I’ve written is correct or necessary)… so recapping everything here helped me “talk it all out”, revise my notes, and share them with others who might find them useful.

1 Like

The rsync has finished… but perhaps not 100% successfully?

          7.39T 100%  151.61MB/s   12:54:37 (xfr#328614, to-chk=0/360669)

sent 7.39T bytes  received 6.45M bytes  159.01M bytes/sec
total size is 7.39T  speedup is 1.00
rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1330) [sender=v3.2.3]

I’m not sure where I’d find the “previous errors” if I hadn’t been watching the file transfer the whole time, but it looks like I need to do a compare/diff. Is one way to address this to rerun the rsync command and see what changes it tries (and fails) to make?

Hmm, I think so… It looks like the only error was related to the root-locked lost+found:

$ rsync -avhH --info=progress2 /data/raid1/* /data/st18raid/
sending incremental file list
              0   0%    0.00kB/s    0:00:00 (xfr#0, ir-chk=1018/360606)
rsync: [sender] opendir "/data/raid1/lost+found" failed: Permission denied (13)
              0   0%    0.00kB/s    0:00:00 (xfr#0, to-chk=120/360669)rsync: [generator] failed to set times on "/data/st18raid/lost+found": Operation not permitted (1)
              0   0%    0.00kB/s    0:00:00 (xfr#0, to-chk=5/360669)  
lost+found/
              0   0%    0.00kB/s    0:00:00 (xfr#0, to-chk=0/360669)

sent 12.14M bytes  received 34.51K bytes  8.12M bytes/sec
total size is 7.39T  speedup is 606,762.52
rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1330) [sender=v3.2.3]

… so there’s nothing to worry about, as I’m pretty sure that folder is created on every EXT4 format. I guess I could have run the rsync with sudo or figured out how to add a folder exclusion… but I’m not worried about it for that specific folder (which is empty anyway).

Yeah, I don’t think sudo is always the best answer… it seems pretty easy to add an exclusion…

$ rsync -avhH --exclude 'lost+found' --info=progress2 /data/raid1/* /data/st18raid/
sending incremental file list
              0   0%    0.00kB/s    0:00:00 (xfr#0, to-chk=0/360668)   

sent 12.14M bytes  received 34.51K bytes  8.12M bytes/sec
total size is 7.39T  speedup is 606,765.91

Yay, no errors! :+1:


I thought I’d go for a second opinion, and using FreeFileSync (which probably uses rsync) found all things were “equal”… sometimes it’s nice to have a great interactive visual.


Seeing as all data appears to have been successfully migrated, I see my only remaining tasks as being:

  1. edit /etc/fstab to remove (or comment out) the old array’s mount entry
  2. optionally remove my /data/raid1 mountpoint for it (it certainly doesn’t take much space)
  3. edit /etc/mdadm.conf to remove (or comment out) the old array
  4. run mkinitcpio -P
  5. remove the old drives (or simply detach their power cables); see the sketch below
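If the old array should also be retired in software before the drives come out, mdadm can stop it and wipe the member superblocks so it’s never re-assembled (a sketch; the member partition names here are assumptions):

$ sudo mdadm --stop /dev/md127
$ sudo mdadm --zero-superblock /dev/sdc1 /dev/sdd1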
1 Like

Sometimes rsync errors are not even due to failed transfers, but to incompatible ACLs or xattrs. A second pass will usually clue you in, and you can even opt for a log file as well (specified in the rsync command itself).
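For example, the earlier command with logging added (a sketch; the log path is arbitrary):

$ rsync -avhH --info=progress2 --log-file=$HOME/rsync-migration.log /data/raid1/* /data/st18raid/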

lost+found is part of the EXT4 filesystem; it’s where “recovered” files are placed after an fsck pass, whenever applicable. These recovered files are usually corrupt orphans that were detached from their proper place in the filesystem. I’ve never, ever found a single thing inside lost+found in all the time I’ve used EXT4.


UPDATE:

Looks like it doesn’t use rsync.

1 Like

Yeah, maybe that’s why its symlink-handling options sounded weird/confusing, and I don’t think it has an option for hardlinks, so I rely on rsync for those and just use FreeFileSync for drive/folder compares, which it seems to do a great job at.

This topic was automatically closed 2 days after the last reply. New replies are no longer allowed.