Compiling a Linux Driver for my HBA versus SATA Raid under Manjaro

The data has finished copying from the NAS to the Software RAID (over 7TB, leaving “147GiB” of free space)… and at some point afterwards, once I was no longer focused on something else, I noticed a consistent rhythmic hum/clack from the RAID’s mechanical drives (lasts for 1-2 seconds, subsides for 1-2 seconds, then the cycle repeats) that I didn’t notice during the file transfer (but that doesn’t mean it wasn’t there).

So I ran the only command I know right now to see what (if anything) was going on…

[AM4-x5600-Linux ~]# cat /proc/mdstat
Personalities : [raid1] 
md127 : active raid1 sdc1[1] sdb1[0]
      7813893120 blocks super 1.2 [2/2] [UU]
      bitmap: 2/59 pages [8KB], 65536KB chunk

unused devices: <none>

… and it doesn’t appear to have much to say… other than letting me know about the in-memory bitmap (basically a cache of what’s in the on-disk bitmap; it allows bitmap operations to be more efficient).
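For what it’s worth, the “59 pages” figure can be roughly reproduced from the other two numbers on that line. This is only a back-of-the-envelope sketch: the 2 bytes of in-memory counter per bitmap chunk and the 4096-byte page size are my assumptions, not something stated in the output.

```shell
# Figures copied from the /proc/mdstat output above.
array_kib=7813893120   # "7813893120 blocks" (mdstat counts 1 KiB blocks)
chunk_kib=65536        # "65536KB chunk"

# Assumptions (not from the thread): 2 bytes of in-memory counter per
# bitmap chunk, and 4096-byte memory pages.
bitmap_pages() {
    chunks=$(( ($1 + $2 - 1) / $2 ))          # round up to whole chunks
    echo $(( (chunks * 2 + 4095) / 4096 ))    # round up to whole pages
}

bitmap_pages "$array_kib" "$chunk_kib"   # prints 59
```

So the “2/59” would mean 2 of those 59 pages are currently allocated, which fits an array that is mostly idle.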

Are there any commands or GUI tools that might give me some more insight into the disk/array activity? Perhaps even some that would also be good to check/monitor/scrub the RAID array?

I’d hate to reboot or something when the array is in the middle of something and cause any issues.

Nothing appears wrong. The bitmap (whether internal or external) is akin to a file-system’s journal. Its cousin in ZFS is the ZIL (intent log). After some time of no writes, the bitmap shouldn’t be using any pages for cached writes. You can attempt to flush it by unmounting the ext4 file-system, stopping the array, and then reassembling it.

It will also be interesting to see whether you still hear the rhythmic hums and clacks after reassembling the array and leaving it idle for a while with no data activity. (Ruling out any other mechanical drives in the system.)

You filled it a bit too close for comfort in terms of future fragmentation and performance.

You might be able to squeeze in a bit of extra capacity by removing the reserved superuser blocks from the ext4 file-system. (I believe it defaults to 5%, unless that has changed recently.) Its original purpose was to prevent locking yourself out of the system on the chance that you filled the file-system 100% and could not even write/modify anything for the sake of recovery or emergency. It’s not really necessary for a purely “data storage” purpose, like you’re using.

Make sure you unmount the file-system first, but leave the array assembled, and then remove the reserved superuser blocks:

sudo tune2fs -m 0 /dev/md/RAID1Array

EDIT: This concerns the ext4 file-system, nothing to do with mdadm, per se.

Yes, the data is a bit tight… but it will be shrinking over time. Lots of “windows only” bloat (drivers/installers, etc) in it currently that will be pruned over time while I stay focused on Manjaro.

Didn’t need to re-assemble the array… as the disk activity stopped (I think somewhere between the umount and tune2fs commands completing) as I reclaimed that 5% (now at 520GiB free)…

$ sudo umount /data/raid1
$ sudo tune2fs -m 0 /dev/md/RAID1Array
tune2fs 1.46.2 (28-Feb-2021)
Setting reserved blocks percentage to 0% (0 blocks)
$ sudo mount /data/raid1
$ cat /proc/mdstat
Personalities : [raid1] 
md127 : active raid1 sdc1[1] sdb1[0]
      7813893120 blocks super 1.2 [2/2] [UU]
      bitmap: 2/59 pages [8KB], 65536KB chunk

unused devices: <none>

Once I re-mounted, the drives clacked away merrily for about 5 seconds (I suspect while the bitmap/cache was rebuilt)… and have stayed silent so far… thank you winnie!
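As a rough cross-check of the numbers above: 5% of the array size works out to about 372GiB, which lines up with free space going from 147GiB to 520GiB. This is only an estimate, since it uses the array size from /proc/mdstat and the ext4 file-system itself is slightly smaller than that.

```shell
# Array size in 1 KiB blocks, copied from /proc/mdstat above.
array_kib=7813893120

# Reserved space in whole GiB for a given reserved-blocks percentage.
reserved_gib() {
    echo $(( $1 * $2 / 100 / 1024 / 1024 ))
}

reserved_gib "$array_kib" 5   # prints 372
```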

This is interesting… I caught wind of iotop and installed it through PAMAC. No read/write numbers… but apparently some IO activity for something called ext4lazyinit

Apparently the kernel is tasked with handling some of the final touches of the ext4 format’s initialization… and this thread I found seems to echo my experience. Probably delayed from finishing as I went straight from formatting the array to using rsync to fill it with data.

Now it makes sense why unmounting and remounting stopped the noise… as it only starts/continues after being mounted. And I probably had the kernel working double-time as it was trying to work on that as I was loading up the drive with data… and obviously still had more to do after the data copying was complete.

Also interesting to learn there’s an extra parameter to include to not use “lazy initialization”… and according to man mkfs.ext4, there are actually two lazy features…

lazy_itable_init[= <0 to disable, 1 to enable>]
If enabled and the uninit_bg feature is enabled, the inode table will not be fully initialized by mke2fs. This speeds up filesystem initialization noticeably, but it requires the kernel to finish initializing the filesystem in the background when the filesystem is first mounted. If the option value is omitted, it defaults to 1 to enable lazy inode table zeroing.

lazy_journal_init[= <0 to disable, 1 to enable>]
If enabled, the journal inode will not be fully zeroed out by mke2fs. This speeds up filesystem initialization noticeably, but carries some small risk if the system crashes before the journal has been overwritten entirely one time. If the option value is omitted, it defaults to 1 to enable lazy journal inode zeroing.

I think I’ll be adding these two extra parameters to my mechanical drive formats.
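For reference, a minimal sketch of how those two options would be passed. As far as I know they go through mke2fs’s -E (extended options) flag as a comma-separated list. The helper below only prints the command rather than running it, since mkfs wipes whatever is on the target device; the function name is made up for illustration.

```shell
# Print (do NOT run) an mkfs.ext4 command with both lazy features disabled.
# mkfs_no_lazy_cmd is a hypothetical helper name, not part of e2fsprogs.
mkfs_no_lazy_cmd() {
    printf 'mkfs.ext4 -E lazy_itable_init=0,lazy_journal_init=0 %s\n' "$1"
}

mkfs_no_lazy_cmd /dev/md/RAID1Array
```

The trade-off is a much slower format up front, in exchange for no background ext4lazyinit activity after the first mount.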

Even that is new to me (lazy_itable_init, lazy_journal_init), but like I said, I’ve moved exclusively to XFS (local) and ZFS (NAS). Per your discovery, it seems that it will only have an effect for some time after formatting the file-system, and you shouldn’t have too many issues using it normally.

While you’re at it, you should check if your 8TB drives support TLER/ERC, and if so, if the firmware is set to 7 seconds by default. (It’s very likely, since 8TB and larger are usually white-label enterprise or NAS drives.)

sudo smartctl -l scterc /dev/sdx

It prints the timeout for TLER in “deciseconds”, so a value of 70 = 7.0 seconds.

Linux/mdadm waits 30 seconds before it considers a SATA/SCSI drive “unresponsive” and tries to bring it back up or simply offline it (in which case your RAID array will drop to a “degraded” state.)

If your drives do not support ERC/TLER (or they support it, but are not configured to use it), they will try for an indefinite period of time (internally) to correct their own errors / relocate bad sectors. Problem is, if this time exceeds 30 seconds, even a healthy drive can be kicked out of the array.

Setting TLER to 7.0 seconds (“70 deciseconds”) is recommended. (Don’t try to set it to anything shorter than 7 seconds, as I’ve read that the drive’s firmware might simply ignore it without notifying you that the number is invalid.)

If your drive supports TLER, but it’s not enabled, you can manually enable it (yet this will not persist through system reboots.)

sudo smartctl -l scterc,70,70 /dev/sdx

In order for it to apply after every reboot, you need to make a script or cron job that will do it upon booting up your computer.
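A minimal sketch of what such a boot script could check, assuming smartctl prints the “Read:” value in deciseconds as its second field and prints something non-numeric when ERC is off or unsupported. The function names are made up for illustration.

```shell
# Extract the SCT ERC read timeout (in deciseconds) from
# `smartctl -l scterc` output supplied on stdin.
erc_read_ds() {
    awk '$1 == "Read:" { print $2; exit }'
}

# Succeed only if the value is numeric, non-zero, and shorter than the
# 30-second (300 decisecond) limit described above.
erc_ok() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;   # non-numeric: ERC off or unsupported
    esac
    [ "$1" -gt 0 ] && [ "$1" -lt 300 ]
}

# Demo on sample text in smartctl's output format:
sample='SCT Error Recovery Control:
           Read:     70 (7.0 seconds)
          Write:     70 (7.0 seconds)'
ds=$(printf '%s\n' "$sample" | erc_read_ds)
erc_ok "$ds" && echo "ERC read timeout: $ds deciseconds (ok)"

# On a real system (as root, e.g. from a boot script), something like:
#   for d in /dev/sdb /dev/sdc; do
#       erc_ok "$(smartctl -l scterc "$d" | erc_read_ds)" ||
#           smartctl -l scterc,70,70 "$d"
#   done
```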


Looks like they are already at 7sec?

$ sudo smartctl -l scterc /dev/sdb
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.13.4-1-MANJARO] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke,

SCT Error Recovery Control:
           Read:     70 (7.0 seconds)
          Write:     70 (7.0 seconds)

$ sudo smartctl -l scterc /dev/sdc
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.13.4-1-MANJARO] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke,

SCT Error Recovery Control:
           Read:     70 (7.0 seconds)
          Write:     70 (7.0 seconds)

And this brings up 2 more other HD thoughts:

  1. how would I check for and change write caching on the drives… I’m assuming having write caching disabled would be a better option in case of crash (preserving data integrity at the cost of speed)… or maybe this is an old idea that might not be as relevant in GNU/Linux as it was in Windows?
    EDIT: Well look at that… it’s another hdparm command/parameter…
$ sudo hdparm -W /dev/sdb

 write-caching =  1 (on)

$ sudo hdparm -W /dev/sdc

 write-caching =  1 (on)
  2. In one of my other posts the discussion evolved to adjusting APM settings on the mechanical drives to 254 (or 255 if they support it) but I thought I would reserve looking into that for this discussion (deciding how the drives would be attached first; SATA versus HBA)… would you have a recommendation for APM settings?
$ sudo hdparm -B /dev/sdb

 APM_level      = 164

$ sudo hdparm -B /dev/sdc

 APM_level      = 164

Nothing needs to be done! They support it and are already set by the factory at 7 seconds. :+1: This is one of the selling points of “NAS-ready” drives, among other features (such as longer MTBF and constant operation in a vibration-heavy chassis or server rack).

The drop in write performance from disabling write caching might not be worth it, considering ext4 uses a journal (thus buffers against a dirty state, and will re-check itself if it was previously not unmounted cleanly), and for future projects ZFS (and Btrfs, such as used in your Synology NAS) are copy-on-write file-systems, which means it’s nearly impossible to have corruption due to a crash or power loss. (That’s not to say you should neglect a UPS battery backup in case of sudden power loss.)

I always have APM disabled. It’s healthier for the drive. Acoustic (-M), APM (-B), and auto-suspend (-S) should always be disabled, especially if used in a RAID array or ZFS pool. Drives barely use any power on idle (around 3 to 5 watts, spinning). Depending on where you live, that’s about 30¢ to 50¢ per month on your electric bill if you leave them running 24/7.
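The arithmetic behind that estimate, as a quick sketch. The price of 0.13 per kWh is a made-up example rate; plug in your own.

```shell
# Monthly cost in cents for a drive drawing a constant number of watts,
# assuming a 30-day month. Arguments: watts, price per kWh.
monthly_cents() {
    awk -v w="$1" -v price="$2" \
        'BEGIN { printf "%.0f\n", w * 24 * 30 / 1000 * price * 100 }'
}

monthly_cents 3 0.13   # prints 28
monthly_cents 5 0.13   # prints 47
```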


Many thanks once again for your great advice winnie!

I’m glad to hear I can leave the write caching on for performance with EXT4! I do have a UPS connected and think I have things set up to power down @ 25% battery… although I haven’t tested the settings yet by pulling power :wink:

I’ll work on disabling Acoustic (-M), APM (-B) and auto-suspend (-S) on both drives!

I had to do a double-take on this. I just realized, and correct me if I’m wrong: THIS is your first time jumping into Linux as a legit alternative to Windows?

Well gosh darn it! You’re crazy! A deep dive right into software RAID and rsync’ing from a NAS server and esoteric file-system options!


The first time when I ditched Windows for Linux, I took baby steps:

“Okay… so… the terminal, um, that’s like the CMD.exe thingy in Windows, right? Okay… I can… do stuff in the terminal… okay… so wait… package manager? Like for zip files? Oh, package manager is like for installing software? Whatever. How do I run my .exe files? Is Firefox like IE but with an orange icon?”

Hehe… I thought asking for help in advance was a baby step?! :crazy_face:

I’m an older “Computer Engineering Technologist” who graduated back in the day when 386’s were king. I’ve dabbled with a few “Linux Live” CD/DVD/USB distros on and off over the past [cough] decades, but never put significant effort in to actually try to replace Windows with it after learning early on that I’d have to give up some of my favorite PC activities like winding down in a good RPG or MMO… GNU/Linux just wasn’t going to let me keep playing my favorite titles (until more recently).

But as luck would have it, much of what I play is on Steam, and earlier this year I caught a video (might have been this one or one like it) from Anthony at LTT where he was talking about “Gaming on Linux” (POP_OS and Manjaro) and that planted the seed for me to embrace all the good things Steam has been doing in this area over the past few years… with the help of other technologies like wine; and all the great upstream and downstream support found in the various GNU/Linux distributions of today.

Needless to say I’m glad to be rid of all the MS data-mining/telemetry, and happy to learn more about Manjaro and GNU/Linux as it is supporting my geekiness and Steam game play beautifully. And I’m digging in deep enough to try to support the few people (like my parents) that will likely follow me to Linux; and likely future n00bs like me in this forum and other aspects of the GNU/Linux community that present themselves along the way.

I still have lots to learn, and I am prioritizing my posts and learning based on…

  1. where I am in my migration
  2. what apps/functionality I want/need next
  3. what hardware I want/need to get working next
  4. what presents itself as a learning opportunity along the way

Ok, I ran through disabling Acoustic (-M), APM (-B) and auto-suspend (-S) on both drives… but based on my steps, it looks like Acoustic (-M) is “not supported” for my WD Red’s so I left it as is…

Acoustic (-M) … didn’t try -M0 after seeing it’s “not supported”

$ sudo hdparm -M /dev/sdb
 acoustic      = not supported
$ sudo hdparm -M /dev/sdc
 acoustic      = not supported

APM (-B)

$ sudo hdparm -B /dev/sdb
 APM_level      = 164
$ sudo hdparm -B255 /dev/sdb
 setting Advanced Power Management level to disabled
 APM_level      = off
$ sudo hdparm -B255 /dev/sdc
 setting Advanced Power Management level to disabled
 APM_level      = off

Auto-suspend (-S)

$ sudo hdparm -S /dev/sdc
  -S: bad/missing standby-interval value (0..255)
$ sudo hdparm -S0 /dev/sdc
 setting standby to 0 (off)
$ sudo hdparm -S0 /dev/sdb
 setting standby to 0 (off)

EDIT: Wow… it’s been 7 hours (according to the forum) since I last ran iotop… and it’s still listing ext4lazyinit as working… hopefully it’s settled come the morning.

Don’t forget to create a custom udev rule so that those values are re-applied each reboot. (I name my custom files with “99-” so I can keep track of them if I need to review/edit.)

For example, to apply it to all “spinning” drives in the system:

sudo nano /etc/udev/rules.d/99-hdparm.rules

With the entry:

ACTION=="add|change", KERNEL=="sd[a-z]", ATTRS{queue/rotational}=="1", RUN+="/usr/bin/hdparm -B255 -S0 -M0 /dev/%k"

It doesn’t hurt to specify -M0 out of practice, as it gets ignored if it’s not supported by the drive anyways. Makes the entry good for future use if such a drive supports -M.

If you prefer to use a custom script (that runs with elevated privileges), have it do something like:

hdparm -B255 -S0 -M0 /dev/disk/by-id/{serialnumberdisk1,serialnumberdisk2,etc,etc,etc}

Make sure not to use the ID of the partition, but rather the disk itself.
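One way to avoid picking a partition ID by accident is to filter the by-id listing, since partition links end in -part followed by a number. A small sketch; the disk names in the demo are placeholders, not real serial numbers.

```shell
# Keep only whole-disk names from a /dev/disk/by-id/ listing on stdin,
# dropping partition links (they end in -part<N>).
whole_disks() {
    grep -v -e '-part[0-9][0-9]*$'
}

# Demo with placeholder names:
printf '%s\n' \
    ata-EXAMPLE_DISK_SERIAL1 \
    ata-EXAMPLE_DISK_SERIAL1-part1 \
    ata-EXAMPLE_DISK_SERIAL2 | whole_disks

# On a real system:  ls /dev/disk/by-id/ | whole_disks
```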


Many thanks for keeping the learning curve moving forward winnie! I really appreciate your thoroughness.

Okay, I’m going to try to implement a udev custom rule because I think it fits nicely in with the other customization files I’ve been playing with so far… like /etc/fstab, /etc/mdadm.conf, and /etc/sysctl.d/30-swap_usage.conf (custom file from Fabby to control swappiness and vfs_cache_pressure).

But that syntax is above my head (is that some form of regex or bash scripting?), so more learning to do… although it seems to target just the sdx devices, which would mean my sda Samsung EVO SSD (which likely doesn’t care about these settings) and my sdb & sdc WD Red’s.

Good to know!

$ sudo hdparm -M0 /dev/sdc

 setting acoustic management to 0
SG_IO: bad/missing sense data, sb[]:  70 00 05 00 00 00 00 0a 04 53 00 00 21 04 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 acoustic      = not supported

It’s based off the udev rules syntax.

Based on the rule entry, your SSD will be skipped because of the “spinning” attribute. (That is, "rotational")

ACTION=="add|change", KERNEL=="sd[a-z]", ATTRS{queue/rotational}=="1", RUN+="/usr/bin/hdparm -B255 -S0 -M0 /dev/%k"

This is an old article, but it gives a good idea of the gist of writing a udev rule (which is rare for an end-user to do anyways. I doubt you’ll need more than this very one.)
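If you want to see for yourself which drives the rule would match, the same “rotational” flag can be read straight from sysfs. A minimal sketch; SYSFS_ROOT is just a convenience variable I made up so the function can be pointed at a test directory, and it defaults to /sys.

```shell
# Succeed if the kernel flags the given disk (e.g. sdb) as rotational.
is_rotational() {
    f="${SYSFS_ROOT:-/sys}/block/$1/queue/rotational"
    [ -r "$f" ] && [ "$(cat "$f")" = "1" ]
}

for d in sda sdb sdc; do
    if is_rotational "$d"; then
        echo "$d: rotational, the rule would apply"
    else
        echo "$d: not rotational (or not present), the rule would skip it"
    fi
done
```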


Awesome… copy/paste works for me in this scenario winnie! I’ll make sure to check it works after my next reboot… but considering the time, I should make like a pumpkin and grab some shut-eye :slight_smile:

I did want to ask one question about an earlier teaching though…

I’m thinking that in order to use this command I would need a hard-copy of the UUID created during the mdadm --create part of the process (assuming I lost access to the original system for some reason)… because in another system I would not have access to my /etc/mdadm.conf, nor would I be able to run mdadm --detail --scan or sudo dumpe2fs /dev/md/RAID1Array | grep UUID on a system that hasn’t seen the RAID array before. Are either of these incorrect assumptions, or is there another command that could retrieve the UUID?

I guess I am also assuming the UUID I want is that of the array, and not something I’d find in $ sudo blkid… hmmm, that 1st UUID is the same on both drives! hmmm… :thinking:

$ sudo blkid
/dev/sdb1: UUID="54abcbfa-cc3f-becd-e4aa-d7e57961912d" UUID_SUB="ff7eafae-8394-68bc-ea44-fdd3de85dc40" LABEL="AM4-x5600-Linux:RAID1Array" TYPE="linux_raid_member" PARTUUID="9dc5e95d-dda4-9945-8d8b-911d5977fbb6"
/dev/sdc1: UUID="54abcbfa-cc3f-becd-e4aa-d7e57961912d" UUID_SUB="2bdb44ab-e43d-4e8a-0441-27838fd9b45f" LABEL="AM4-x5600-Linux:RAID1Array" TYPE="linux_raid_member" PARTUUID="e9b3a16d-df91-cb47-8138-5d97a1466d39"

Oh wait… I think I’m gonna go out on a limb and pretend I’m cluing into something you said earlier… although I’m still not sure I understand the differences between the first and third points (device versus “name”)…

I’m thinking the core clue for me is in the initial create… mdadm --create --verbose --level=1 --metadata=1.2 --raid-devices=2 /dev/md/RAID1Array /dev/sdb1 /dev/sdc1… specifically /dev/sdb1 and /dev/sdc1… and so if this line of thought is correct, I could…

  1. lsblk to identify the drives/partitions names on the new system…
  2. Then assemble with… sudo mdadm --assemble /dev/md/RAID1Array /dev/sdb1 /dev/sdc1?

EDIT: Hurray! 10 hours after the first $ sudo iotop, ext4lazyinit is off the list; finally completed!

That was fun! Let’s wipe all your data and do it all over again! :partying_face:

You can use whichever way you want to identify your block devices.

  • Kernel-assigned
    • Located directly under /dev/
    • Familiar naming-scheme
    • Devices: sda, sdb, sdc, nvme0n1, nvme1n1
    • Partitions: sda1, sda2, sda3, sdb1, sdc1, nvme0n1p1, nvme0n1p2, nvme1n1p1
    • While not often, these can change, based on the internal ports the devices are connected to, external ports, what other devices are plugged in, etc
  • Unique ID, based on model and/or serial number
    • Located under /dev/disk/by-id/
    • Symlinks to the actual kernel-assigned names
    • More meaningful, since they usually follow a pattern of recognizable brands and serial numbers
    • Using the device name (without -partX) is the equivalent to using /dev/sda, /dev/sdb, etc…
    • Partitions are appended to the device name with -part1, -part2, -part3, etc.
    • Usually works across different computers
  • Unique ID based on UUID
    • Symlinks to the actual kernel-assigned names
    • Each device, partition, LUKS container, logical volume, assembled array, etc, has its own unique UUID
    • Can be specified with UUID=, or --uuid=, or /dev/disk/by-uuid/, etc, depending on the config or application
    • Works across different computers
    • Individual partition UUIDs found under /dev/disk/by-partuuid/ (not unique globally, only locally on current system)

So any method above works, for example, to assemble an array, specify an fstab entry, check a file-system, unlock and map an encrypted LUKS container, start a volume group (LVM), etc.

The point being, don’t rely on sda, sda1, sdb, sdb1, sdc, sdc1, etc, for long-term use. You can easily use lsblk and /proc/partitions to try to identify your block devices, but it doesn’t hurt to look under /dev/disk/by-id/ (or use the UUIDs).

Most tools are pretty “smart” though. I believe mdadm (and LVM) can simply be told to “scan all block devices, find all md (or PV) devices, and assemble from there” without ever having to specify the exact devices needed, as long as you provide identifiable information (“name” or “uuid” of the array or logical volume group).

Remember, the UUIDs for the block devices are a way to specify which devices are needed to build the array, while the UUID for the array itself is akin to its “name”. Once everything is assembled, there’s a new UUID that only exists when the array is assembled, and it is this device where your ext4 file-system lives.

So yeah, you’ve got three sets of UUIDs going on: (1) the UUIDs of partitions that make up your array, (2) the UUID found in the superblock metadata of the array, (3) the UUID of the assembled array that a file-system is formatted on. All three are different and have nothing to do with each other.
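On a machine that has never seen the array, the array UUID (the one in the superblock metadata) can still be read off the member partitions themselves. Here is a small sketch that pulls it out of blkid output, assuming the field order shown earlier in this thread (UUID first, TYPE later on the line). The second helper converts blkid’s hyphenated form into the colon-separated form that mdadm prints, which as far as I can tell is the same hex digits grouped differently. Both function names are made up.

```shell
# Print the md array UUID(s) found in `blkid` output on stdin, one per
# distinct array. Only lines of TYPE="linux_raid_member" are considered.
raid_array_uuids() {
    sed -n 's/^[^:]*: UUID="\([^"]*\)".*TYPE="linux_raid_member".*/\1/p' | sort -u
}

# Convert blkid's hyphenated UUID to mdadm's colon-separated grouping.
to_mdadm_uuid() {
    printf '%s\n' "$1" |
        sed 's/-//g; s/\(.\{8\}\)\(.\{8\}\)\(.\{8\}\)\(.\{8\}\)/\1:\2:\3:\4/'
}

# Demo with the two member lines posted earlier in the thread:
uuid=$(raid_array_uuids <<'EOF'
/dev/sdb1: UUID="54abcbfa-cc3f-becd-e4aa-d7e57961912d" UUID_SUB="ff7eafae-8394-68bc-ea44-fdd3de85dc40" LABEL="AM4-x5600-Linux:RAID1Array" TYPE="linux_raid_member" PARTUUID="9dc5e95d-dda4-9945-8d8b-911d5977fbb6"
/dev/sdc1: UUID="54abcbfa-cc3f-becd-e4aa-d7e57961912d" UUID_SUB="2bdb44ab-e43d-4e8a-0441-27838fd9b45f" LABEL="AM4-x5600-Linux:RAID1Array" TYPE="linux_raid_member" PARTUUID="e9b3a16d-df91-cb47-8138-5d97a1466d39"
EOF
)
to_mdadm_uuid "$uuid"   # prints 54abcbfa:cc3fbecd:e4aad7e5:7961912d

# On a real system:  sudo blkid | raid_array_uuids
# mdadm can also report this itself:  sudo mdadm --examine --scan
```

That also explains why the first UUID is identical on both drives in the blkid output: it belongs to the array, while UUID_SUB is per member.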

EDIT: I highly recommend people get familiar with /dev/disk/by-id/

You’ll notice it “makes more sense” for your physical block devices, as it has the closest hands on naming scheme. (Some devices will be represented two or three different ways.)

Here’s a listing of mine, for example. (I’m leaving out the redundant entries.)

ls -l /dev/disk/by-id/

ata-hp_HLDS_DVDRW_GUD1N_873D2038077 -> ../../sr0

ata-Samsung_SSD_860_EVO_500GB_S8678NE1M67503F -> ../../sda
ata-Samsung_SSD_860_EVO_500GB_S8678NE1M67503F-part1 -> ../../sda1

nvme-SK_hynix_BC501_TGF342GDJGGH-8324A_NZ87645114133054F2 -> ../../nvme0n1
nvme-SK_hynix_BC501_TGF342GDJGGH-8324A_NZ87645114133054F2-part1 -> ../../nvme0n1p1
nvme-SK_hynix_BC501_TGF342GDJGGH-8324A_NZ87645114133054F2-part2 -> ../../nvme0n1p2
nvme-SK_hynix_BC501_TGF342GDJGGH-8324A_NZ87645114133054F2-part3 -> ../../nvme0n1p3
nvme-SK_hynix_BC501_TGF342GDJGGH-8324A_NZ87645114133054F2-part4 -> ../../nvme0n1p4
nvme-SK_hynix_BC501_TGF342GDJGGH-8324A_NZ87645114133054F2-part5 -> ../../nvme0n1p5

Just by looking at the above output, you can get an idea of what type of devices they are (DVD burner, SATA SSD, NVMe M.2), what brands they are (HP, Samsung, Hynix), and what models they are. (I changed the serial number strings for privacy and warranty-related reasons.)

I can use the above strings instead of /dev/sda, /dev/sda1, /dev/nvme0n1p4, etc, since the symlinks point to the proper kernel-assigned devices, no matter how many times I reboot or change around the order of cables and ports.

However, the UUID is more permanent and is the preferred method, since it’s pretty much a 100% guarantee of never changing, no matter how many reboots, and even after relocating to a new computer.


:crazy_face: :rofl:
Let’s not and say we did :wink:

Many thanks once again for the in-depth responses winnie!

I’m definitely going to have to spend some time getting acquainted with the core GNU/Linux terminologies so I understand how the OS (and its various layers) sees/treats all components/devices and put everything into the right context for myself… I have no doubt that’ll come in time!

/dev/disk/by-id will definitely be on the top of the list!

By the way, there were quite a few Manjaro updates this morning that required a reboot, which made it the perfect time to test the new udev rule. Things went as expected at first…

$ sudo hdparm -B /dev/sdb

 APM_level      = off

But then I scratched my head…

$ sudo hdparm -S /dev/sdb
  -S: bad/missing standby-interval value (0..255)

… until I realized (after some sleep and additional DDG internet searches) that hdparm was telling me that I had not provided a value for the “set” operation. I had been assuming -S (with no value) was a “get”… and that was incorrect.

According to DDG, the typical recommendation is to use $ sudo hdparm -I /dev/sdx (capital “i”) to view the drive settings… although I could not find a “Standby” entry that indicated it was “disabled” (but did for APM)…

        Standby timer values: spec'd by Standard, no device specific minimum
        Advanced power management level: disabled

… so I’m just going to put some faith in witnessing that the $ sudo hdparm -S0 /dev/sdx “set” command provided proof (that it was setting the value anyway, perhaps not that it had succeeded)…

$ sudo hdparm -S0 /dev/sdc
 setting standby to 0 (off)
$ sudo hdparm -S0 /dev/sdb
 setting standby to 0 (off)

… and trust that the udev rule got the same results (as it did for -B255).

Okay, I started putting together my list of commands for “Raid Scrubbing” and put the following list together for myself based on the Arch RAID Wiki

  • manual scrub start … # echo check > /sys/block/md127/md/sync_action
  • check raid activity for scrub status … $ cat /proc/mdstat
  • stop a running scrub … # echo idle > /sys/block/md127/md/sync_action
  • check if any blocks were flagged bad during scrub … # cat /sys/block/md127/md/mismatch_cnt
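The two “read” commands from that list can be wrapped into one status line. A minimal sketch using only the sysfs files named above; SYSFS_ROOT is a made-up convenience variable (defaulting to /sys) so the function can be tried against a test directory.

```shell
# Print the current sync action and mismatch count for an md device.
md_scrub_report() {
    d="${SYSFS_ROOT:-/sys}/block/$1/md"
    [ -r "$d/sync_action" ] || { echo "$1: no md sysfs entry found"; return 1; }
    echo "$1: sync_action=$(cat "$d/sync_action") mismatch_cnt=$(cat "$d/mismatch_cnt")"
}

md_scrub_report md127 || true
```

While a scrub is running, sync_action should read “check”; once it is back to “idle”, mismatch_cnt holds the result of that run.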

This seems like a good start, but I have 2 questions…

  1. What would one do next if bad blocks were found?
  2. Can I automate the scrub start by following the example I found created by TimeShift @ /etc/cron.d/timeshift-hourly by doing the following?
  • $ sudo nano /etc/cron.d/md127-check-monthly
  • paste in the following…

30 21 1-7 * 6 root echo check > /sys/block/md127/md/sync_action
  • save the file

which if I followed the “Crontab format” correctly should execute on the 1st Saturday of each month at 21:30.
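One thing worth double-checking here: in the cron implementations I’m aware of (Vixie cron and derivatives such as cronie), when both the day-of-month and day-of-week fields are restricted, the job fires when either one matches. If that applies, `30 21 1-7 * 6` would run on each of days 1-7 and also on every Saturday. A common workaround is to restrict only one field in cron and test the other in the command itself. The sketch below shows the date test on its own (it relies on GNU date’s -d option and %-d format; the function name is made up).

```shell
# Succeed only when the given date (YYYY-MM-DD, default today) is a
# Saturday within the first seven days of its month. Needs GNU date.
is_first_saturday() {
    day=${1:-$(date +%F)}
    dom=$(date -d "$day" +%-d)     # day of month without zero padding
    dow=$(date -d "$day" +%u)      # 1 = Monday ... 6 = Saturday
    [ "$dow" -eq 6 ] && [ "$dom" -le 7 ]
}

is_first_saturday 2021-08-07 && echo "2021-08-07 qualifies"
is_first_saturday 2021-08-14 || echo "2021-08-14 does not"

# A cron.d line could then restrict only the day-of-week field and leave
# the date test to the shell (% must be escaped inside crontab files):
#   30 21 * * 6 root [ "$(date +\%-d)" -le 7 ] && echo check > /sys/block/md127/md/sync_action
```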

If my automation idea won’t work, I’d love to hear about alternatives… including those with a GUI :wink:

EDIT: And I’d also be interested in learning about what the last timer is in this list… mdadm-last-resort@md127.timer

$ sudo systemctl list-timers --all
NEXT                        LEFT               LAST                        PASSED            UNIT                          ACTIVATES                      
Thu 2021-07-29 00:00:00 CDT 3h 45min left      Wed 2021-07-28 00:00:18 CDT 20h ago           logrotate.timer               logrotate.service
Thu 2021-07-29 00:00:00 CDT 3h 45min left      Wed 2021-07-28 00:00:18 CDT 20h ago           man-db.timer                  man-db.service
Thu 2021-07-29 00:00:00 CDT 3h 45min left      Wed 2021-07-28 00:00:18 CDT 20h ago           pkgfile-update.timer          pkgfile-update.service
Thu 2021-07-29 00:00:00 CDT 3h 45min left      Wed 2021-07-28 00:00:18 CDT 20h ago           shadow.timer                  shadow.service
Thu 2021-07-29 09:56:35 CDT 13h left           Wed 2021-07-28 09:56:35 CDT 10h ago           systemd-tmpfiles-clean.timer  systemd-tmpfiles-clean.service
Thu 2021-07-29 10:49:53 CDT 14h left           Wed 2021-07-28 07:59:24 CDT 12h ago           updatedb.timer                updatedb.service
Thu 2021-07-29 20:48:44 CDT 24h left           Thu 2021-07-22 22:37:40 CDT 5 days ago        pamac-mirrorlist.timer        pamac-mirrorlist.service
Mon 2021-08-02 00:02:30 CDT 4 days left        Mon 2021-07-26 00:31:02 CDT 2 days ago        fstrim.timer                  fstrim.service
Sat 2021-08-07 15:00:00 CDT 1 week 2 days left Tue 2021-07-13 15:47:16 CDT 2 weeks 1 day ago pamac-cleancache.timer        pamac-cleancache.service
n/a                         n/a                n/a                         n/a               mdadm-last-resort@md127.timer mdadm-last-resort@md127.service

If it were ZFS, repairs happen automatically if using any level of redundancy, due to the all-encompassing nature of ZFS itself, checksums on every data record (not just metadata), multiple copies of metadata, and multiple copies of checksums for every record of data dispersed at different areas of the devices. (This is triggered either by a routine scrub or when corruption is detected upon reading data that does not match the checksums.)

From there, you could view the zpool status, see how many checksum errors there were, how many were fixed, and on which physical devices they occurred. It would be up to your discretion on how to proceed:

  • Ignore it and hope it wasn’t too serious?
  • Power down everything then run badblocks and/or SMART tests?
  • To heck with it and just order a replacement drive?
  • Check the status of your backups to decide how to proceed?

mdadm provides redundancy and can tell you if something went wrong, and can even detect corruption. It cannot repair automatically, let alone confidently determine which device is errant. You can infer which device to offline and replace based on other tests (e.g., badblocks, SMART, etc), and you’d usually be correct.

(Now you see why I’m such a ZFS fanboy? Unfortunately, it’s not “mainstream” enough for me to comfortably use on a desktop distro, and thus I use it exclusively on my NAS server and backups.)

Using systemd over cron seems to be the standard method going forward. There exists in the AUR a package called raid-check-systemd. Looking through the PKGBUILD and source files, it seems to extract some files from the CentOS mdadm RPM package, and includes a modified systemd timer and service adapted from the previous cron method (from the much older raid-check package).

After building and installing it, you would edit /etc/conf.d/raid-check to your desired preferences (the conf file itself explains what to change or add). To modify the schedule, you would override /usr/lib/systemd/system/raid-check.timer via the commands:

sudo systemctl --system edit raid-check.timer
sudo systemctl --system daemon-reload
sudo systemctl --system reenable raid-check.timer

It looks like your changes in /etc/conf.d/raid-check will be backed up on future updates to this package, but that doesn’t appear to be the case with raid-check.timer, though I’m not sure about this. Your modifications may well be preserved by the edit itself, since I’m assuming you’re only overriding the [Timer] option, such as:

OnCalendar=Sat *-*-1..7 21:30:00
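For context, `systemctl edit` normally stores its changes as a drop-in file under /etc/systemd/system/<unit>.d/override.conf rather than touching the packaged unit in /usr/lib, which is why they should survive package updates. Here is a sketch of what that drop-in might contain; the helper function is hypothetical and takes a directory argument so it can be tried out without touching /etc.

```shell
# Write a [Timer] override into a drop-in directory. The default path is
# where I'd expect `systemctl edit raid-check.timer` to put its file.
write_timer_override() {
    dir=${1:-/etc/systemd/system/raid-check.timer.d}
    mkdir -p "$dir" &&
    cat > "$dir/override.conf" <<'EOF'
[Timer]
# An empty assignment clears the schedule shipped by the package,
# so only the line after it remains in effect.
OnCalendar=
OnCalendar=Sat *-*-1..7 21:30:00
EOF
}

# Try it in a scratch directory and show the result:
demo=$(mktemp -d)
write_timer_override "$demo" && cat "$demo/override.conf"
```

After a real edit, `sudo systemctl daemon-reload` picks up the change, and `systemd-analyze calendar 'Sat *-*-1..7 21:30:00'` should show when the expression next elapses.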

But if your cron method works for you and you’re happy, hey it works. :wink: Up to you!

If it was my system I’d think about disabling it. By all accounts, it reads like some sort of countdown timer to force an array to assemble in a degraded state if a certain number of devices are available. I don’t understand its purpose or rationale.

UPDATE: I found another explanation of it, and once again my question is… "But, why?"

No, really… why? I don’t know about you, but I would rather have the array never start in a degraded state on its own, as I would prefer to deal with it myself and figure out why it cannot start in a healthy state. (“Did I forget to plug in a drive? Faulty cable? Wrong names? Defective drive?”)

Looks like a couple of GUI options already exist, might want to check them out:

(Please don’t hurt me.)


Thanks again for the reply winnie!

Regarding “automation”… I’ve aborted my cron “hack” for now. I’d like to learn more about using cron/cronie properly… but today isn’t the day :wink: … perhaps when I’m ready I’ll start by installing zeit-git (a cron GUI from AUR) and watch what it does. When I tried following one of the wikis yesterday and ran $ sudo crontab -e… I was immediately confused as to why nano launched with /tmp/crontab.JsvLFQ open… an oddly named file in a temp folder?

And I had found raid-check and raid-check-systemd… but there were some comments I read somewhere that enabling this caused an unusable (or did they say unstable?) system… perhaps it was an old/fixed bug… I didn’t feel like rolling the dice and bypassed them.

I tried to $ sudo systemctl disable mdadm-last-resort@md127.timer… which executed silently, but the timer was still in the list afterwards.

I’m not sure if I’m as worried about this timer with my only array being a mirror… but that could just be a side effect of not really being in my element right now… certain things are still going over my head atm :wink:

Within my old HBA’s monitoring tool I would occasionally see references in the log about a “fixed sector”… but now that I think about it, that likely had nothing to do with scrubbing and more to do with an fsck… oh, and now I am reminded about SMART!

Well I’m glad SMART entered my mind because the Arch SMART wiki let me know KDE has a tool called DisKMonitor that will let me …

  1. monitor/run SMART checks on my disks (minus the nvme that don’t support it)
  2. monitor the health/status of my RAID Array and kick off scrubbing (manually)
  3. receive integrated system notifications (I suspect just of the jobs/checks I run)
  4. and I opted to enable the systray option for it in the “System Tray Settings”

So really the only things it’s missing are fsck and automation… but I like this option!

I think for the time being, I’m going to set up a calendar reminder to…

  1. fsck -r /dev/md/RAID1Array (after umount /data/raid1)
  2. fire up DisKMonitor to keep an eye on SMART and start/monitor RAID scrubbing

And now that I have the beginnings of some context/understanding about fsck, I went back and edited my /etc/fstab entries so that my drives beyond boot/root have the 6th fsck column set to 2 (instead of 0).
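A quick way to review that 6th column across all entries, as a small sketch. The two fstab lines in the demo are placeholders in fstab format, not copied from anyone’s actual file.

```shell
# List mount point and fsck pass number (6th field) from fstab-format
# text on stdin, skipping comments and blank or incomplete lines.
fstab_passno() {
    awk '$1 !~ /^#/ && NF >= 6 { print $2, $6 }'
}

# Demo with placeholder entries:
fstab_passno <<'EOF'
# <device>            <mount>      <type> <options> <dump> <pass>
UUID=EXAMPLE-ROOT     /            ext4   defaults  0      1
/dev/md/RAID1Array    /data/raid1  ext4   defaults  0      2
EOF

# On a real system:  fstab_passno < /etc/fstab
```

A pass number of 1 is meant for the root file-system, 2 for other file-systems to be checked after it, and 0 means no boot-time check.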

I think I’m in a good place… thanks again for all your patience and advice winnie!