"Failed to identify" results in 10 secs time-out

Arikania · 3 January 2024 02:55

I have two hard disks, each with an 18 TB partition (GPT). One is formatted exFAT and shows no problems. The other is formatted ext4; it spins down very quickly after use, and each time it’s accessed I get a 10-second time-out, along with the dmesg entries shown below.

System info:

Cinnamon version	: 6.0.2
Linux Kernel		: 6.6.5-1-rt16-MANJARO
Processor		    : AMD Ryzen 7 5700G with Radeon Graphics x 8
Memory			    : 15.4 GiB
Hard drives		    : 39137.1 GB
Graphics card		: NVIDIA Corporation TU106 [GeForce RTX 2070]
Display server		: X11

Fragment of dmesg:

[ 5834.277610] ata1.00: failed to IDENTIFY (I/O error, err_mask=0x40)
[ 5834.277612] ata1.00: revalidation failed (errno=-5)
[ 5838.597295] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[ 5838.896890] ata1.00: configured for UDMA/133
[ 5838.897002] ata1.00: Entering active power mode
[ 5838.904803] sd 0:0:0:0: [sda] Starting disk
[ 5855.005221] sd 0:0:0:0: [sda] Synchronizing SCSI cache
[ 5855.005931] sd 0:0:0:0: [sda] Stopping disk
[ 5855.689215] ata1.00: Entering standby power mode

megavolt · 3 January 2024 06:34

Might be a faulty cable. Check:

sudo smartctl -A /dev/sda

If UDMA_CRC_Error_Count is not zero, then the cable is (or was) very likely the problem.

Or maybe a bugged or outdated UEFI firmware (BIOS). Update it.

Nachlese · 3 January 2024 08:00

This is just anecdotal:
I have had (I think I still have it …) a laptop hdd which would spin down after a short time.
I remember using hdparm with the -S parameter (capital -S) to alter the time until spin down.
This can be made permanent, I think through the -K option (capital -K, not lower-case -k),
but the setting can also be applied each time the system is booted.

See man hdparm.

Arikania · 3 January 2024 13:35

Entry #199 shows that there are no CRC errors:

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   050    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0005   100   100   050    Pre-fail  Offline      -       0
  3 Spin_Up_Time            0x0027   100   100   001    Pre-fail  Always       -       8679
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       58120
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   100   100   050    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   100   100   050    Pre-fail  Offline      -       0
  9 Power_On_Hours          0x0032   087   087   000    Old_age   Always       -       5287
 10 Spin_Retry_Count        0x0033   100   100   030    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       875
 23 Helium_Condition_Lower  0x0023   100   100   075    Pre-fail  Always       -       0
 24 Helium_Condition_Upper  0x0023   100   100   075    Pre-fail  Always       -       0
 27 MAMR_Health_Monitor     0x0023   100   100   030    Pre-fail  Always       -       919371
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   099   099   000    Old_age   Always       -       839
193 Load_Cycle_Count        0x0032   095   095   000    Old_age   Always       -       58252
194 Temperature_Celsius     0x0022   100   100   000    Old_age   Always       -       30 (Min/Max 20/55)
196 Reallocated_Event_Count 0x0033   100   100   010    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
220 Disk_Shift              0x0002   100   100   000    Old_age   Always       -       17563651
222 Loaded_Hours            0x0032   092   092   000    Old_age   Always       -       3534
223 Load_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
224 Load_Friction           0x0022   100   100   000    Old_age   Always       -       0
226 Load-in_Time            0x0026   100   100   000    Old_age   Always       -       630
240 Head_Flying_Hours       0x0001   100   100   001    Pre-fail  Offline      -       0
241 Total_LBAs_Written      0x0032   100   100   000    Old_age   Always       -       23134669401
242 Total_LBAs_Read         0x0032   100   100   000    Old_age   Always       -       13374160550
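
A rough way to read the table above is to relate the start/stop and load-cycle counters to the power-on hours. The sketch below copies four rows from the output in this post and assumes the raw value of Power_On_Hours really is in hours (raw-value encodings are vendor-specific); the helper name `parse_raw_values` is made up for illustration.

```python
# Sketch: estimate how often the drive stops and restarts per hour of operation.
# The rows are copied from the smartctl -A output above.
SMART_ROWS = """\
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       58120
  9 Power_On_Hours          0x0032   087   087   000    Old_age   Always       -       5287
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       875
193 Load_Cycle_Count        0x0032   095   095   000    Old_age   Always       -       58252
"""


def parse_raw_values(text):
    """Map attribute name -> integer raw value (10th column of the table)."""
    values = {}
    for line in text.splitlines():
        fields = line.split()
        # A data row starts with the numeric attribute ID and has at least 10 columns.
        if len(fields) >= 10 and fields[0].isdigit():
            values[fields[1]] = int(fields[9])
    return values


raw = parse_raw_values(SMART_ROWS)
hours = raw["Power_On_Hours"]  # assumption: the raw value is in hours
starts_per_hour = raw["Start_Stop_Count"] / hours
loads_per_hour = raw["Load_Cycle_Count"] / hours

print(f"start/stop cycles per power-on hour: {starts_per_hour:.1f}")
print(f"load cycles per power-on hour:       {loads_per_hour:.1f}")
```

With these numbers both ratios come out at roughly 11 per hour, which fits a drive that spins down shortly after almost every access.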

Nachlese · 3 January 2024 13:39

… there are disks like this - they will spin down after a very short amount of time
If yours is one of those, hdparm (or perhaps sdparm) can change that.
You should try - IMO.

It’s also not very healthy for a spinning hdd if it has to spin up from zero every few minutes …

megavolt · 3 January 2024 13:56

Alright then. SMART values are in order. So try what @Nachlese mentioned…

sudo hdparm -S 255 /dev/sda

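That should put the drive into standby after about 21 min idle (255 * 5 sec). It is not the longest possible setting, though: according to man hdparm, values 241 to 251 select multiples of 30 minutes.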

Little rant: I had a WD HDD that was sold in an enclosure with a USB connection. Even when connected directly via SATA, it would spin down after 5 min idle. That was a fixed value that even hdparm couldn’t change. Only the Windows software was able to do that.

That being said, the default value might be too short, like:

sudo hdparm -S 2 /dev/sda

which is 10 sec. You need to increase it.

Nachlese · 3 January 2024 14:12

the -B option to hdparm might also be worth checking

again:
man hdparm

Read carefully! - the tool can be a dangerous one, depending on the options you give it.

I think my current drive is the one I used this particular setting on.
Current result:

sudo hdparm -B /dev/sda

/dev/sda:
 APM_level	= 254

(highest I/O performance)

From memory, I think its default value was between 1 and 127 (permitting spin down).
… but that is just from memory - it has been too long since I fiddled with this

Arikania · 3 January 2024 18:41

I set the spin-down time to 1 min now, which makes the issue less tedious, but doesn’t solve it yet ofc. What are typical values used by other Manjaro users, I wonder?

hdparm -B /dev/sda gives:

/dev/sda:
 APM_level	= 128

In the man page of hdparm I found the options --dco-freeze, --dco-identify and --dco-restore. I wonder if those could have something to do with the issue?

Nachlese · 3 January 2024 18:53

perhaps

APM_level	= 128

is not … enough - although it should prevent spin down

That is just what it currently is - you can set a different value.

There is also the -S parameter (253 or 255; 254 is reserved)

refer to:
man hdparm

I don’t think that -d or -c or -o … would do anything useful for you.

Why just 1 minute?
That’s way too low.
10+ minutes would be more appropriate.

Which would actually mean: no spindown at all, because ext4 regularly writes any changes in the meantime to disk … which will force a spin up.

--dco-restore
could work if your attempts to change settings on the drive via -B or -S don’t have any effect (drive features are locked)
… that’s what the manual seems to say anyway

But it does not seem to be the case here.

copy/paste is quick and cheap - so here are the two relevant sections from

man hdparm

-B     Get/set Advanced Power Management feature, if the drive supports it. A low value means aggressive power management and a high value means better performance. Possible settings range from values 1 through 127 (which permit spin-down), and values 128 through 254 (which do not permit spin-down). The highest degree of power management is attained with a setting of 1, and the highest I/O performance with a setting of 254. A value of 255 tells hdparm to disable Advanced Power Management altogether on the drive (not all drives support disabling it, but most do).
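
The -B ranges quoted above can be written down as a small lookup. This is only a sketch of the man page wording, not of what any particular drive actually does; the function name `describe_apm_level` is hypothetical.

```python
def describe_apm_level(level):
    """Classify an hdparm -B value according to the man page text quoted above."""
    if not 1 <= level <= 255:
        raise ValueError("APM level must be between 1 and 255")
    if level == 255:
        return "APM disabled"
    if level <= 127:
        return "spin-down permitted"
    return "spin-down not permitted"


# 128 is the value reported for the drive in this thread, 254 the value proposed.
for level in (1, 127, 128, 254, 255):
    print(level, "->", describe_apm_level(level))
```

By this reading, both 128 and 254 already forbid spin-down through APM, so a drive that still spins down at 128 is presumably being stopped by something other than its APM level.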

-S     Put the drive into idle (low-power) mode, and also set the standby (spindown) timeout for the drive. This timeout value is used by the drive to determine how long to wait (with no disk activity) before turning off the spindle motor to save power. Under such circumstances, the drive may take as long as 30 seconds to respond to a subsequent disk access, though most drives are much quicker. The encoding of the timeout value is somewhat peculiar. A value of zero means "timeouts are disabled": the device will not automatically enter standby mode. Values from 1 to 240 specify multiples of 5 seconds, yielding timeouts from 5 seconds to 20 minutes. Values from 241 to 251 specify from 1 to 11 units of 30 minutes, yielding timeouts from 30 minutes to 5.5 hours. A value of 252 signifies a timeout of 21 minutes. A value of 253 sets a vendor-defined timeout period between 8 and 12 hours, and the value 254 is reserved. 255 is interpreted as 21 minutes plus 15 seconds. Note that some older drives may have very different interpretations of these values.
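
Since the -S encoding is, as the man page says, somewhat peculiar, here is a minimal decoder that follows the quoted text literally. The function name `standby_timeout_seconds` is made up, and real drives (especially older ones) may interpret the values differently.

```python
def standby_timeout_seconds(value):
    """Decode an hdparm -S value into seconds, per the man page text above.

    Returns None where no fixed timeout applies: 0 (timeouts disabled),
    253 (vendor-defined, between 8 and 12 hours) and 254 (reserved).
    """
    if not 0 <= value <= 255:
        raise ValueError("-S value must be between 0 and 255")
    if value == 0:
        return None
    if value <= 240:
        return value * 5                     # multiples of 5 seconds
    if value <= 251:
        return (value - 240) * 30 * 60       # 1 to 11 units of 30 minutes
    if value == 252:
        return 21 * 60                       # 21 minutes
    if value == 255:
        return 21 * 60 + 15                  # 21 minutes plus 15 seconds
    return None                              # 253 and 254


for value in (2, 12, 120, 240, 241, 251, 255):
    print(f"-S {value:3d} -> {standby_timeout_seconds(value)} s")
```

For reference, a 1-minute timeout corresponds to -S 12 and a 10-minute timeout to -S 120.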

Arikania · 3 January 2024 19:39

I set the spin-down to 10 mins now. Let’s see what it does…

The --dco- options that I mentioned are not combinations of separate -d, -c and -o options. As the man page states for --dco-freeze:

DCO stands for Device Configuration Overlay, a way for vendors to selectively disable certain features of a drive. The --dco-freeze option will freeze/lock the current drive configuration, thereby preventing software (or malware) from changing any DCO settings until after the next power-on reset.

I wonder if the provider of my hard disk, i.e. Toshiba, optimized the device overlay for Windows.

I’m also curious if other Manjaro users would have the same issues as I when they set their spin-downs that low; I’m still baffled and worried about the “Identification failed, verification failed” thingie.

And why would my Manjaro specifically be configured in a way that causes this issue? Why doesn’t everybody who uses very large ext4 partitions have it?

Nachlese · 3 January 2024 19:45

In this case, it is not the OS (Manjaro).
It is how the device itself is configured (probably by the manufacturer - or by some previous owner).

… I added to the previous post, btw

Arikania · 3 January 2024 19:53

The drive itself came brand new from the store. According to my computer guy (I’m more of a software person myself), it is an enterprise model, intended for use in business settings.

Reading the addendum to your post, I think that I should set the -B option to 254, and not use the -S option then. Except for increasing the overall power usage of my computer, it should have no consequences, I think?

So I should do something like:

# hdparm -B254 -K /dev/sda

Nachlese · 3 January 2024 19:58

… that was suggested - and that is what I did
to ensure it would not spin down in a long time
There will always be write attempts coming from the file system itself in the mean time.
System logs are always generated and written to disk (vast generalization here - this, too, can be configured and prevented …)
… and (apparently) no one knows how great the difference is in power consumption between the different levels
I personally don’t care - I have a laptop and the disk runs constantly, to avoid having to wait for it to spin up every time …

I would omit (leave out) the -K option unless you are very sure.
the hdparm settings can also be applied via script (udev, I think) on every boot,
without that option.

Arikania · 3 January 2024 20:04

I added a bit to my previous post. I’ll try that then.

Power consumption would be a bit of an issue for me though, since I often have my computer running for days in a row. It’s never off, actually, and as I do a lot of gaming on it, this wouldn’t really help make it any cheaper…

Wish I knew how to diagnose the issue deeper.

Nachlese · 3 January 2024 20:11

Could be checked relatively easily:
hook up your PC through a power meter.
Have the disk spin down (hdparm …)
Then have it spin up …

I’ll guess that there is hardly a difference … except for the few seconds of spin up
especially compared to what a gaming PC with a decent graphics requirement will pull at any given time.

That would be my easy solution to enable you to actually know instead of just guessing

if it is that much - which I doubt …
what does 1 W of continuous extra draw cost?

1 kWh costs ~ € 0,35 (this is Germany 2024 …)

That is 1 W for 1000 hours.
That is less than € 0,35 more per month.
It’s not even relevant in Germany
… with all the cheap green energy from wind and sun - who doesn’t send you a bill
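
The arithmetic above can be made explicit. This is a back-of-the-envelope sketch: the 730 hours per month (8760 / 12) and the function name `monthly_cost_eur` are assumptions for illustration, and the 1 W figure is just the unit used above, not a measured difference between a spinning and a stopped disk.

```python
HOURS_PER_MONTH = 8760 / 12  # average month, about 730 hours (assumption)


def monthly_cost_eur(extra_watts, price_per_kwh):
    """Cost per month of a constant extra power draw, in euros."""
    kwh_per_month = extra_watts * HOURS_PER_MONTH / 1000
    return kwh_per_month * price_per_kwh


# 1 W of continuous draw at the price quoted above (EUR 0.35 per kWh)
print(f"{monthly_cost_eur(1, 0.35):.2f} EUR per month")
```

At € 0,35 per kWh that is about € 0,26 per month for each watt of continuous draw, so the real cost depends on how many watts the power meter actually shows.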

Arikania · 3 January 2024 20:47

Here in the Netherlands it’s over 1 Euro

I can’t find anything on the internet that tells me more about the actual issue. Even though I have adjusted my drive’s parameters now:

# hdparm -B254 -S253 -K1 /dev/sda

I still have those time-outs, and this in my dmesg:

[319837.477620] ata1.00: Entering active power mode
[319847.973504] ata1.00: qc timeout after 10000 msecs (cmd 0x40)
[319847.973517] ata1.00: VERIFY failed (err_mask=0x4)
[319847.973522] ata1.00: failed to IDENTIFY (I/O error, err_mask=0x40)
[319847.973524] ata1.00: revalidation failed (errno=-5)
[319851.877522] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[319852.170625] ata1.00: configured for UDMA/133
[319852.170745] ata1.00: Entering active power mode
[319852.176938] sd 0:0:0:0: [sda] Starting disk
[319870.875908] sd 0:0:0:0: [sda] Synchronizing SCSI cache
[319870.876091] sd 0:0:0:0: [sda] Stopping disk
[319871.492566] ata1.00: Entering standby power mode
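
To see where the time goes, the timestamps in the fragment above can be subtracted from one another. The sketch below only does arithmetic on the lines as posted; the helper names are made up, and it says nothing about why the drive fails to answer.

```python
import re

# Lines copied from the dmesg fragment above.
DMESG = """\
[319837.477620] ata1.00: Entering active power mode
[319847.973504] ata1.00: qc timeout after 10000 msecs (cmd 0x40)
[319847.973522] ata1.00: failed to IDENTIFY (I/O error, err_mask=0x40)
[319852.176938] sd 0:0:0:0: [sda] Starting disk
[319870.876091] sd 0:0:0:0: [sda] Stopping disk
[319871.492566] ata1.00: Entering standby power mode
"""

LINE_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s+(.*)$")


def parse_events(text):
    """Return a list of (timestamp_in_seconds, message) tuples."""
    events = []
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:
            events.append((float(match.group(1)), match.group(2)))
    return events


def first_time(events, needle):
    """Timestamp of the first message containing `needle`."""
    for stamp, message in events:
        if needle in message:
            return stamp
    raise LookupError(needle)


events = parse_events(DMESG)
stall = first_time(events, "qc timeout") - first_time(events, "Entering active power mode")
awake = first_time(events, "Stopping disk") - first_time(events, "Starting disk")

print(f"wake-up to qc timeout:       {stall:.1f} s")
print(f"'Starting' to 'Stopping':    {awake:.1f} s")
```

On these lines the wake-up stalls for about 10.5 s (the 10000 ms command timeout), and the disk is stopped again less than 19 s after it was started.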

Nachlese · 3 January 2024 20:52

What? Really?
more than € 1 per kWh ?

That is the first time I heard of it.
I thought that Germany was pretty much the price champion.

I just changed my provider and they bill me ~ € 0,27 per kWh + a monthly fee of ~ € 12

anyway
… with regards to your actual problem:
I have no other ideas - perhaps it is faulty cables after all?

Arikania · 5 January 2024 10:31

Hmm, I found this site that seemed to have the solution for me, including a detailed explanation. My preliminary experiences imply that the issue is solved now.

I added this to my grub boot options:

libata.force=1.00:nodmalog

with “1.00” being the ATA port mentioned in the dmesg reports.

I’ll wait a day to see if the issue is really solved now. If so, I’ll mark this thread as such.

Wish me luck, guys!

EDIT: This solution didn’t work for me. Neither did restoring a back-up of that partition’s superblock. The dmesg reports and the time-outs persist.

It’s also noteworthy that the disk keeps spinning down instantly after each communication, despite the hdparm command that I issued, and the fact that hdparm reports the settings as I set them.

Nachlese · 5 January 2024 15:59

It was suggested but I didn’t see that you tried to set the spin down time via the -S option (from 240 to 251 - or 255)

I don’t know from the logs if the drive is commanded by the OS to go into standby so quickly or whether this is just the logged behavior of the drive itself.
Check power saving settings?

megavolt · 5 January 2024 16:52

If it is connected via a SATA cable, then try replacing the cable or swapping the cables between the two drives. Or switch the ports. Would be worth a try…

In my view such errors can be the result of a broken cable, a loose contact, or a defective hard drive. Very rarely, it could also be a kernel bug.

However… a closer look at

For whatever reason, the driver cannot identify your HDD. Usually the kernel sends a request, and in your case the HDD sends back garbage; hence the I/O error, because it cannot be read. The reply normally contains data such as device model, firmware version, supported features etc. If the kernel cannot read it, it will assume generic values and proceed by trial and error, which is not very precise.