First of all, my issue is resolved: I ordered a replacement hard disc and copied my data to network-attached storage.
ioping
With regard to my request for troubleshooting advice, the best information I got was from ioping:
sudo ioping -RL /dev/sda
--- /dev/sda (block device 465.8 GiB) ioping statistics ---
1.27 k requests completed in 2.92 s, 316.8 MiB read, 433 iops, 108.3 MiB/s
generated 1.27 k requests in 3.00 s, 317 MiB, 422 iops, 105.6 MiB/s
min/avg/max/mdev = 1.43 ms / 2.31 ms / 35.3 ms / 1.12 ms
sudo ioping -RL /dev/sdb
--- /dev/sdb (block device 465.8 GiB) ioping statistics ---
68 requests completed in 3.42 s, 17 MiB read, 19 iops, 4.97 MiB/s
generated 69 requests in 3.45 s, 17.2 MiB, 20 iops, 5.00 MiB/s
min/avg/max/mdev = 2.06 ms / 50.3 ms / 3.19 s / 383.9 ms
where sda and sdb are my SATA drives.
I'm fortunate because I have two identical discs I can compare. But that is not necessary. Look at the result for /dev/sda. Its average speed was 105.6 MiB/s. The specification for the drive (ST3500320AS) lists "105 Mbytes/sec max". So the performance of sda looks pretty good.
Now look at the result for /dev/sdb, which is the disc having trouble. Its average speed was 5.00 MiB/s, which is pretty bad. Remember, these two discs are the same model, so sdb should also be showing around 105 MiB/s.
Using these options with ioping gives the sustained transfer rate. But the problem I noticed was huge latency while saving documents, so I'd like to try to reproduce that workload with ioping.
ioping has many options, but the basic kinds are:
- read versus write
- sequential access versus random access
- block device versus file system
From among these options, the ones that most closely match my issue are random writes to the file system.
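To keep the flags straight before going further, here is how I map those axes onto ioping options (read and random access are the defaults; the -W and -L flags are covered below):
# write instead of read:         ioping -W <directory>
# sequential instead of random:  ioping -L <target>
# block device vs file system:   sudo ioping /dev/sdb  vs  ioping /some/directory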
Note: the -W option will destroy data if used with a file or device target. In the examples below, I only use -W with directory targets.
By default, ioping performs random access, so I don't need any option to specify that. To specify a write test, we use the -W option. To use the file system, we must provide the path to a directory on sdb. For example, the sdb file system is mounted at /run/media/michael/Data/ on my computer. And as a precaution, I created a directory for testing called ioping: /run/media/michael/Data/ioping/
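Setting that up is a one-liner (the path is just where this disc happens to be mounted on my machine):
# Create a scratch directory for the write tests:
mkdir -p /run/media/michael/Data/ioping/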
By default, ioping mimics ping and performs one operation per second until stopped. I'm looking for more of a performance test, and the -R option encapsulates several other options to create a three-second test, which gives the behavior I want.
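If I'm reading the man page correctly, -R is shorthand for a few flags combined, so this longer command should behave the same:
# -R implies -q -i 0 -w 3:
#   -q    quiet (suppress the per-request lines)
#   -i 0  no interval between requests
#   -w 3  stop after 3 seconds
ioping -q -i 0 -w 3 .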
Putting these together, it looks like ioping -RW . is what I want. I just have to make sure to change into the test directory before I run my tests.
# /dev/sda
cd /home/michael/ioping/
ioping -RW .
--- . (ext4 /dev/sda1 457.4 GiB) ioping statistics ---
416 requests completed in 2.99 s, 1.62 MiB written, 139 iops, 557.3 KiB/s
generated 417 requests in 3.00 s, 1.63 MiB, 138 iops, 555.9 KiB/s
min/avg/max/mdev = 2.76 ms / 7.18 ms / 14.0 ms / 2.38 ms
# /dev/sdb
cd /run/media/michael/Data/ioping/
ioping -RW .
--- . (fuseblk /dev/sdb1 465.8 GiB) ioping statistics ---
211 requests completed in 3.00 s, 844 KiB written, 70 iops, 281.6 KiB/s
generated 212 requests in 3.01 s, 848 KiB, 70 iops, 281.4 KiB/s
min/avg/max/mdev = 7.59 ms / 14.2 ms / 21.9 ms / 3.34 ms
The result here is less impressive than the result from the sustained transfer rate test. But you can see that the failing disc is only about half as fast as the healthy disc (281.6 KiB/s versus 557.3 KiB/s).
Since my hard drive was having performance problems and making bad sounds, I already had enough information to replace it. But these ioping results help to support that conclusion.
In the future, it would be a good habit to benchmark each disc when it is new so that I can make comparisons later without relying on having an identical disc. I'd want to try various combinations of read/write, random/sequential, and block device/file system tests while avoiding destructive writes to my new file system. The option to switch to sequential access is -L.
For example, I might run these tests:
# random file system read
ioping -R /run/media/michael/Data/ioping/
# random block device read
sudo ioping -R /dev/sdb
# sequential file system read
ioping -RL /run/media/michael/Data/ioping/
# sequential block device read
sudo ioping -RL /dev/sdb
# random file system write
ioping -RW /run/media/michael/Data/ioping/
# sequential file system write
ioping -RWL /run/media/michael/Data/ioping/
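To make those baselines useful later, I'd save each run's output with a date on it, something like this (the benchmarks directory is just an example):
# Keep a dated copy of each result for future comparison:
mkdir -p ~/benchmarks
ioping -R /run/media/michael/Data/ioping/ | tee ~/benchmarks/ioping-random-read-$(date +%F).txt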
S.M.A.R.T.
I also investigated S.M.A.R.T. At the start, I reasonably assumed that SMART would be reliable and that its results would correlate with disc failure. The problem is that both of those assumptions are wrong. The more I read, the more obvious it became that there is no standard.
The most visible aspect of SMART is the attributes. These are the detailed statistics output by sudo smartctl -a /dev/sdb under the "SMART Attributes ..." heading.
But the attributes are the least standardized part of SMART. For example, many of the raw values are proprietary, and in those cases all smartctl does is print a meaningless number. Further, the normalized values should range from 1 to 100 (where higher is always better) and thus give end users a way of interpreting the attributes. But even this simple scheme is often ignored. For example, my Seagate drive prints raw values in the normalized columns for temperature. Also, I have an ID # 199 UDMA CRC Error Count value which is better now than it has been in the past (below: VALUE vs WORST columns). That makes no sense. How can an error count improve over time?
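To watch a single attribute like that one, filtering smartctl's attribute table works (the grep pattern is just a substring of the attribute's name):
# Print the header row and only attribute 199:
sudo smartctl -A /dev/sdb | grep -E 'ID#|UDMA_CRC'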
A further issue is that SMART fails to predict failure about a third of the time. A 2007 study from Google observed:
… even when we add all remaining SMART parameters (except temperature) we still find that over 36% of all failed drives had zero counts on all variables.
This means that more than 36% of the drives that failed did so without reporting any unusual SMART statistics. The study included more than one hundred thousand discs.
So even if SMART was standardized, many discs would fail with no warning.
Finally, the SMART results I have can be interpreted to mean that my drive is healthy or that it is in poor condition, depending on what you want the result to be.
Officially, Seagate only endorses the output of its hard drive utility SeaTools. And SeaTools only outputs pass or fail. I ran a SMART Check and a Short Drive Self Test in SeaTools. And both tests passed.
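For what it's worth, smartctl can run the same kind of short self-test from Linux:
# Start a short self-test (the drive runs it in the background):
sudo smartctl -t short /dev/sdb
# View the self-test log once it finishes (usually a couple of minutes):
sudo smartctl -l selftest /dev/sdb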
Further, when I look in smartctl or GNOME Disks, I see that none of my attributes have gone below their threshold values. This is found in smartctl's "WHEN_FAILED" column or Disks' "Assessment" column. (To be fair, the result of SeaTools' SMART Check is probably based on these assessments. So, they should only be counted in my favor once.)
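smartctl also offers a one-line overall verdict, which as far as I know is based on the same threshold logic:
# One-line pass/fail health assessment:
sudo smartctl -H /dev/sdb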
Even if we look at some of the normalized values, I have 90 or higher in many of the important attributes, meaning my hard drive scores better than 90 out of 100 on them.
All these results are reassuring.
My smartctl output:
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 094 081 006 Pre-fail Always - 154328850
3 Spin_Up_Time 0x0003 095 094 000 Pre-fail Always - 0
4 Start_Stop_Count 0x0032 067 067 020 Old_age Always - 33843
5 Reallocated_Sector_Ct 0x0033 090 090 036 Pre-fail Always - 210
7 Seek_Error_Rate 0x000f 074 060 030 Pre-fail Always - 43260493715
9 Power_On_Hours 0x0032 060 060 000 Old_age Always - 35293
10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 45
12 Power_Cycle_Count 0x0032 091 037 020 Old_age Always - 9476
184 End-to-End_Error 0x0032 100 100 099 Old_age Always - 0
187 Reported_Uncorrect 0x0032 001 001 000 Old_age Always - 2478
188 Command_Timeout 0x0032 100 001 000 Old_age Always - 47247590904
189 High_Fly_Writes 0x003a 097 097 000 Old_age Always - 3
190 Airflow_Temperature_Cel 0x0022 071 055 045 Old_age Always - 29 (Min/Max 22/29)
194 Temperature_Celsius 0x0022 029 045 000 Old_age Always - 29 (0 11 0 0 0)
195 Hardware_ECC_Recovered 0x001a 024 019 000 Old_age Always - 154328850
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x003e 200 001 000 Old_age Always - 13688
But if I wanted to argue the disc is in bad shape, then I have plenty to work with. My ID # 5 Reallocated Sector Count raw value is non-zero. The normalized value for ID # 187 Reported Uncorrectable Errors is 1 (VALUE column above). That's the lowest possible value. And my worst value for ID # 188 Command Timeout is also 1.
These are all critical attributes. And these results are correlated with failure.
So, in the end, SMART can give you a warning. But it's not reliable.
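If you do want that warning delivered automatically, smartmontools also ships a monitoring daemon, smartd. A minimal configuration sketch (the device and mail address here are just examples, and mailing requires a working mail agent):
# /etc/smartd.conf -- monitor all attributes on /dev/sdb and mail warnings to root
/dev/sdb -a -m root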
Other Suggestions
Process viewers like htop did not really help me in this case; I had a pretty good feeling it was a disc issue. At this point I've received my replacement disc, and the long wait times while saving are gone.
Stress testing utilities like stress, stress-ng, and sysbench did not seem appropriate because I did not see any options to target a particular disc. There was another benchmarking tool I found, fio, but I did not take a close look at it.
One thing I learned was reading the log files with sudo dmesg | vim - and sudo journalctl -b | vim -. But all I saw in those logs were some pedestrian mount info entries.
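One filter that might have cut through that noise is journalctl's priority option:
# Show only messages at priority "err" or worse from the current boot:
sudo journalctl -b -p err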
I have a link below from OpsDash which lists some other real-time tools. Some of those might be helpful in other situations.
I'm not familiar with how file shares are mounted in Linux today. One trick I found useful was to open a share in Thunar (the Xfce file browser), right-click on some empty space in the folder, and open a Terminal in that location from the context menu. Then I could use pwd in the Terminal to see where that share was mounted. That's different from the address bar in Thunar, which just shows the 'smb://…' address.
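findmnt (listed in the references) answers the same question without the file manager detour:
# List all mounts in a tree; shares show up alongside their mount points:
findmnt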
Follow-up
I started having problems waking my computer from sleep.
The most obvious change was the new hard drive. So I unplugged that and was able to resume normally.
Looking through journalctl -b | grep -i "kernel: ata", the most obvious issue was the kernel spam indicating the new hard drive was dropping to the lowest link rate:
Oct 04 05:12:45 Edward kernel: ata4: SATA max UDMA/133 abar m2048@0xf9efc000 port 0xf9efc280 irq 27
Oct 04 05:12:45 Edward kernel: ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Oct 04 05:12:45 Edward kernel: ata4.00: ATA-11: WDC WDS500G2B0A-00SM50, 415020WD, max UDMA/133
Oct 04 05:12:45 Edward kernel: ata4.00: 976773168 sectors, multi 1: LBA48 NCQ (depth 32), AA
Oct 04 05:12:45 Edward kernel: ata4.00: configured for UDMA/133
Oct 04 05:13:05 Edward kernel: ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Oct 04 05:13:05 Edward kernel: ata4.00: configured for UDMA/133
Oct 04 05:13:06 Edward kernel: ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Oct 04 05:13:06 Edward kernel: ata4.00: configured for UDMA/133
Oct 04 05:13:07 Edward kernel: ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Oct 04 05:13:07 Edward kernel: ata4.00: configured for UDMA/133
Oct 04 05:13:07 Edward kernel: ata4: limiting SATA link speed to 1.5 Gbps
Oct 04 05:13:08 Edward kernel: ata4: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Oct 04 05:13:08 Edward kernel: ata4.00: configured for UDMA/133
Oct 04 05:13:09 Edward kernel: ata4: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Oct 04 05:13:09 Edward kernel: ata4.00: configured for UDMA/133
Oct 04 05:13:10 Edward kernel: ata4: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Oct 04 05:13:10 Edward kernel: ata4.00: configured for UDMA/133
Oct 04 05:13:11 Edward kernel: ata4: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Oct 04 05:13:11 Edward kernel: ata4.00: configured for UDMA/133
Oct 04 05:13:11 Edward kernel: ata4.00: limiting speed to UDMA/100:PIO4
Oct 04 05:13:12 Edward kernel: ata4: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Oct 04 05:13:12 Edward kernel: ata4.00: configured for UDMA/100
Oct 04 05:13:13 Edward kernel: ata4: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Oct 04 05:13:13 Edward kernel: ata4.00: configured for UDMA/100
Oct 04 05:13:13 Edward kernel: ata4.00: limiting speed to UDMA/33:PIO4
Oct 04 05:13:14 Edward kernel: ata4: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Oct 04 05:13:14 Edward kernel: ata4.00: configured for UDMA/33
Oct 04 05:13:15 Edward kernel: ata4: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Oct 04 05:13:15 Edward kernel: ata4.00: configured for UDMA/33
Oct 04 05:13:16 Edward kernel: ata4: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Oct 04 05:13:16 Edward kernel: ata4.00: configured for UDMA/33
Oct 04 05:13:17 Edward kernel: ata4: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Oct 04 05:13:17 Edward kernel: ata4.00: configured for UDMA/33
Oct 04 05:13:18 Edward kernel: ata4: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Oct 04 05:13:18 Edward kernel: ata4.00: configured for UDMA/33
Oct 04 05:13:19 Edward kernel: ata4: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Oct 04 05:13:19 Edward kernel: ata4.00: configured for UDMA/33
Oct 04 05:13:20 Edward kernel: ata4: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
I thought a reasonable first step was to change the SATA cable and port.
Once I did that, my log returned to normal:
Oct 04 07:45:17 Edward kernel: ata6: SATA max UDMA/133 abar m2048@0xf9efc000 port 0xf9efc380 irq 27
Oct 04 07:45:17 Edward kernel: ata6: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Oct 04 07:45:17 Edward kernel: ata6.00: ATA-11: WDC WDS500G2B0A-00SM50, 415020WD, max UDMA/133
Oct 04 07:45:17 Edward kernel: ata6.00: 976773168 sectors, multi 1: LBA48 NCQ (depth 32), AA
Oct 04 07:45:17 Edward kernel: ata6.00: configured for UDMA/133
And I was able to resume from sleep.
This may well have been the underlying issue with my old hard drive, too. I'm not sorry I replaced the drive, but you might check for these log entries before making a decision.
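A quick way to check is to grep the journal for the "limiting" messages shown above:
# Look for SATA link-speed or transfer-mode downgrades in the current boot:
sudo journalctl -b | grep -i limiting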
Regards,
References
Sorry, I don’t have permission to post links.
5 Tools for Monitoring Disk Activity in Linux | OpsDash
ioping | Arch manual pages
SMART | Wikipedia
Disks & Storage | GNOME Help
smartctl | Arch manual pages
FAQ | smartmontools
The 5 SMART stats that actually predict hard drive failure | Computerworld
How do you interpret Seagate’s SMART data? Signs of failure? | Silent PC Review
Failure Trends in a Large Disk Drive Population | Google
Linux Logging Basics | Loggly
Using journalctl | Loggly
journalctl | Arch manual pages
findmnt | Arch manual pages