First of all, my issue is resolved: I ordered a replacement hard disc and copied my data to network-attached storage.
ioping
With regard to my request for troubleshooting advice, the best information I got was from ioping:
sudo ioping -RL /dev/sda
--- /dev/sda (block device 465.8 GiB) ioping statistics ---
1.27 k requests completed in 2.92 s, 316.8 MiB read, 433 iops, 108.3 MiB/s
generated 1.27 k requests in 3.00 s, 317 MiB, 422 iops, 105.6 MiB/s
min/avg/max/mdev = 1.43 ms / 2.31 ms / 35.3 ms / 1.12 ms
sudo ioping -RL /dev/sdb
--- /dev/sdb (block device 465.8 GiB) ioping statistics ---
68 requests completed in 3.42 s, 17 MiB read, 19 iops, 4.97 MiB/s
generated 69 requests in 3.45 s, 17.2 MiB, 20 iops, 5.00 MiB/s
min/avg/max/mdev = 2.06 ms / 50.3 ms / 3.19 s / 383.9 ms
where sda and sdb are my SATA drives.
I'm fortunate because I have two identical discs I can compare. But that is not necessary. Look at the result for /dev/sda. Its average speed was 105.6 MiB/s. The specification for the drive (ST3500320AS) lists "105 Mbytes/sec max". So the performance of sda looks pretty good.
Now look at the result for /dev/sdb, which is the disc having trouble. Its average speed was 5.00 MiB/s, which is pretty bad. Remember, these two discs are the same model, so sdb should also be showing around 105 MiB/s.
Using these options with ioping gives the sustained transfer rate. But the problem I noticed was huge latency while saving documents, so I'd like to try to reproduce that workload with ioping.
ioping has many options, but the basic kinds are:
- read versus write
- sequential access versus random access
- block device versus file system
From among these options, the ones that most closely match my issue are random writes to the file system.
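To keep the flags straight before going further, here is how I map those axes onto ioping options (read and random access are the defaults; the -W and -L flags are covered below):
# write instead of read:         ioping -W <directory>
# sequential instead of random:  ioping -L <target>
# block device vs file system:   sudo ioping /dev/sdb  vs  ioping /some/directory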
Note: the -W option will destroy data if used with a file or device target. In the examples below, I only use -W with directory targets.
By default, ioping performs random access, so I don't need any option to specify that. To specify a write test, we use the -W option. To use the file system, we must provide the path to a directory on sdb. For example, the sdb file system is mounted at /run/media/michael/Data/ on my computer. And as a precaution, I created a directory for testing called ioping: /run/media/michael/Data/ioping/
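Setting that up is a one-liner (the path is just where this disc happens to be mounted on my machine):
# Create a scratch directory for the write tests:
mkdir -p /run/media/michael/Data/ioping/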
By default, ioping mimics ping and performs one operation per second until stopped. I'm looking for more of a performance test, and the -R option encapsulates several other options to create a three-second test, which gives the behavior I want.
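If I'm reading the man page correctly, -R is shorthand for a few flags combined, so this longer command should behave the same:
# -R implies -q -i 0 -w 3:
#   -q    quiet (suppress the per-request lines)
#   -i 0  no interval between requests
#   -w 3  stop after 3 seconds
ioping -q -i 0 -w 3 .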
Putting these together, it looks like ioping -RW . is what I want. I just have to make sure to change into the test directory before I run my tests.
# /dev/sda
cd /home/michael/ioping/
ioping -RW .
--- . (ext4 /dev/sda1 457.4 GiB) ioping statistics ---
416 requests completed in 2.99 s, 1.62 MiB written, 139 iops, 557.3 KiB/s
generated 417 requests in 3.00 s, 1.63 MiB, 138 iops, 555.9 KiB/s
min/avg/max/mdev = 2.76 ms / 7.18 ms / 14.0 ms / 2.38 ms
# /dev/sdb
cd /run/media/michael/Data/ioping/
ioping -RW .
--- . (fuseblk /dev/sdb1 465.8 GiB) ioping statistics ---
211 requests completed in 3.00 s, 844 KiB written, 70 iops, 281.6 KiB/s
generated 212 requests in 3.01 s, 848 KiB, 70 iops, 281.4 KiB/s
min/avg/max/mdev = 7.59 ms / 14.2 ms / 21.9 ms / 3.34 ms
The result here is less impressive than the result from the sustained transfer rate test. But you can see that the failing disc is only about half as fast as the healthy disc (281.6 KiB/s versus 557.3 KiB/s).
Since my hard drive was having performance problems and making bad sounds, I already had enough information to replace it. But these ioping results help to support that conclusion.
In the future, it would be a good habit to benchmark each disc when it is new so that I can make comparisons later without relying on having an identical disc. I'd want to try various combinations of read/write, random/sequential, and block device/file system tests while avoiding destructive writes to my new file system. The option to switch to sequential access is -L.
For example, I might run these tests:
# random file system read
ioping -R /run/media/michael/Data/ioping/
# random block device read
sudo ioping -R /dev/sdb
# sequential file system read
ioping -RL /run/media/michael/Data/ioping/
# sequential block device read
sudo ioping -RL /dev/sdb
# random file system write
ioping -RW /run/media/michael/Data/ioping/
# sequential file system write
ioping -RWL /run/media/michael/Data/ioping/
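To make those baselines useful later, I'd save each run's output with a date on it, something like this (the benchmarks directory is just an example):
# Keep a dated copy of each result for future comparison:
mkdir -p ~/benchmarks
ioping -R /run/media/michael/Data/ioping/ | tee ~/benchmarks/ioping-random-read-$(date +%F).txt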
S.M.A.R.T.
I also investigated S.M.A.R.T. At the start, I reasonably assumed that SMART would be reliable and that its results would correlate with disc failure. The problem is that both of those assumptions are wrong. The more I read, the more obvious it became that there is no standard.
The most visible aspect of SMART is the attributes. These are the detailed statistics output by sudo smartctl -a /dev/sdb under the "SMART Attributes ..." heading.
But the attributes are the least standardized part of SMART. For example, many of the raw values are proprietary, and in those cases all smartctl does is print a meaningless number. Further, the normalized values should range from 1 to 100 (where higher is always better) and thus give end users a way of interpreting the attributes. But even this simple scheme is often ignored. For example, my Seagate drive prints raw values in the normalized columns for temperature. Also, I have an ID # 199 UDMA CRC Error Count value which is better now than it has been in the past (below: VALUE vs WORST columns). That makes no sense. How can an error count improve over time?
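To watch a single attribute like that one, filtering smartctl's attribute table works (the grep pattern is just a substring of the attribute's name):
# Print the header row and only attribute 199:
sudo smartctl -A /dev/sdb | grep -E 'ID#|UDMA_CRC'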
A further issue is that SMART fails to predict failure about a third of the time. A 2007 study from Google observed:
… even when we add all remaining SMART parameters (except temperature) we still find that over 36% of all failed drives had zero counts on all variables.
This means that more than 36% of the drives that failed did so without reporting any unusual SMART statistics. The study included more than one hundred thousand discs.
So even if SMART was standardized, many discs would fail with no warning.
Finally, the SMART results I have can be interpreted to mean that my drive is healthy or that it is in poor condition, depending on what you want the result to be.
Officially, Seagate only endorses the output of its hard drive utility SeaTools. And SeaTools only outputs pass or fail. I ran a SMART Check and a Short Drive Self Test in SeaTools. And both tests passed.
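For what it's worth, smartctl can run the same kind of short self-test from Linux:
# Start a short self-test (the drive runs it in the background):
sudo smartctl -t short /dev/sdb
# View the self-test log once it finishes (usually a couple of minutes):
sudo smartctl -l selftest /dev/sdb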
Further, when I look in smartctl or GNOME Disks, I see that none of my attributes have gone below their threshold values. This is found in smartctl's "WHEN_FAILED" column or Disks' "Assessment" column. (To be fair, the result of SeaTools' SMART Check is probably based on these assessments. So, they should only be counted in my favor once.)
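smartctl also offers a one-line overall verdict, which as far as I know is based on the same threshold logic:
# One-line pass/fail health assessment:
sudo smartctl -H /dev/sdb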
Even if we look at some of the normalized values, I have 90 or higher in many of the important attributes, meaning my hard drive scores better than 90 out of 100 on them.
All these results are reassuring.
My smartctl output:
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 094 081 006 Pre-fail Always - 154328850
3 Spin_Up_Time 0x0003 095 094 000 Pre-fail Always - 0
4 Start_Stop_Count 0x0032 067 067 020 Old_age Always - 33843
5 Reallocated_Sector_Ct 0x0033 090 090 036 Pre-fail Always - 210
7 Seek_Error_Rate 0x000f 074 060 030 Pre-fail Always - 43260493715
9 Power_On_Hours 0x0032 060 060 000 Old_age Always - 35293
10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 45
12 Power_Cycle_Count 0x0032 091 037 020 Old_age Always - 9476
184 End-to-End_Error 0x0032 100 100 099 Old_age Always - 0
187 Reported_Uncorrect 0x0032 001 001 000 Old_age Always - 2478
188 Command_Timeout 0x0032 100 001 000 Old_age Always - 47247590904
189 High_Fly_Writes 0x003a 097 097 000 Old_age Always - 3
190 Airflow_Temperature_Cel 0x0022 071 055 045 Old_age Always - 29 (Min/Max 22/29)
194 Temperature_Celsius 0x0022 029 045 000 Old_age Always - 29 (0 11 0 0 0)
195 Hardware_ECC_Recovered 0x001a 024 019 000 Old_age Always - 154328850
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x003e 200 001 000 Old_age Always - 13688
But if I wanted to argue the disc is in bad shape, then I have plenty to work with. My ID # 5 Reallocated Sector Count raw value is non-zero. The normalized value for ID # 187 Reported Uncorrectable Errors is 1 (VALUE column above). That's the lowest possible value. And my worst value for ID # 188 Command Timeout is also 1.
These are all critical attributes. And these results are correlated with failure.
So, in the end, SMART can give you a warning. But it's not reliable.
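If you do want that warning delivered automatically, smartmontools also ships a monitoring daemon, smartd. A minimal configuration sketch (the device and mail address here are just examples, and mailing requires a working mail agent):
# /etc/smartd.conf -- monitor all attributes on /dev/sdb and mail warnings to root
/dev/sdb -a -m root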
Other Suggestions
Process viewers like htop did not really help me in this case; I had a pretty good feeling it was a disc issue. At this point I've received my replacement disc, and the long wait times while saving are gone.
Stress testing utilities like stress, stress-ng, and sysbench did not seem appropriate because I did not see any options to target a particular disc. There was another benchmarking tool I found, fio, but I did not take a close look at it.
One thing I learned was reading the log files with sudo dmesg | vim - and sudo journalctl -b | vim -. But all I saw in those logs were some pedestrian mount info entries.
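One filter that might have cut through that noise is journalctl's priority option:
# Show only messages at priority "err" or worse from the current boot:
sudo journalctl -b -p err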
I have a link below from OpsDash which lists some other real-time tools. Some of those might be helpful in other situations.
I'm not familiar with how file shares are mounted in Linux today. One trick I found useful was to open a share in Thunar (the Xfce file browser), right-click on some empty space in the folder, and open a Terminal in that location from the context menu. Then I could use pwd in the Terminal to see where that share was mounted. That's different from the address bar in Thunar, which just shows the 'smb://…' address.
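findmnt (listed in the references) answers the same question without the file manager detour:
# List all mounts in a tree; shares show up alongside their mount points:
findmnt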
Follow-up
I started having problems waking my computer from sleep.
The most obvious change was the new hard drive. So I unplugged that and was able to resume normally.
Looking through journalctl -b | grep -i "kernel: ata", the most obvious issue was the kernel spam indicating the new hard drive was dropping to the lowest link rate:
Oct 04 05:12:45 Edward kernel: ata4: SATA max UDMA/133 abar m2048@0xf9efc000 port 0xf9efc280 irq 27
Oct 04 05:12:45 Edward kernel: ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Oct 04 05:12:45 Edward kernel: ata4.00: ATA-11: WDC WDS500G2B0A-00SM50, 415020WD, max UDMA/133
Oct 04 05:12:45 Edward kernel: ata4.00: 976773168 sectors, multi 1: LBA48 NCQ (depth 32), AA
Oct 04 05:12:45 Edward kernel: ata4.00: configured for UDMA/133
Oct 04 05:13:05 Edward kernel: ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Oct 04 05:13:05 Edward kernel: ata4.00: configured for UDMA/133
Oct 04 05:13:06 Edward kernel: ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Oct 04 05:13:06 Edward kernel: ata4.00: configured for UDMA/133
Oct 04 05:13:07 Edward kernel: ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Oct 04 05:13:07 Edward kernel: ata4.00: configured for UDMA/133
Oct 04 05:13:07 Edward kernel: ata4: limiting SATA link speed to 1.5 Gbps
Oct 04 05:13:08 Edward kernel: ata4: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Oct 04 05:13:08 Edward kernel: ata4.00: configured for UDMA/133
Oct 04 05:13:09 Edward kernel: ata4: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Oct 04 05:13:09 Edward kernel: ata4.00: configured for UDMA/133
Oct 04 05:13:10 Edward kernel: ata4: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Oct 04 05:13:10 Edward kernel: ata4.00: configured for UDMA/133
Oct 04 05:13:11 Edward kernel: ata4: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Oct 04 05:13:11 Edward kernel: ata4.00: configured for UDMA/133
Oct 04 05:13:11 Edward kernel: ata4.00: limiting speed to UDMA/100:PIO4
Oct 04 05:13:12 Edward kernel: ata4: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Oct 04 05:13:12 Edward kernel: ata4.00: configured for UDMA/100
Oct 04 05:13:13 Edward kernel: ata4: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Oct 04 05:13:13 Edward kernel: ata4.00: configured for UDMA/100
Oct 04 05:13:13 Edward kernel: ata4.00: limiting speed to UDMA/33:PIO4
Oct 04 05:13:14 Edward kernel: ata4: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Oct 04 05:13:14 Edward kernel: ata4.00: configured for UDMA/33
Oct 04 05:13:15 Edward kernel: ata4: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Oct 04 05:13:15 Edward kernel: ata4.00: configured for UDMA/33
Oct 04 05:13:16 Edward kernel: ata4: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Oct 04 05:13:16 Edward kernel: ata4.00: configured for UDMA/33
Oct 04 05:13:17 Edward kernel: ata4: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Oct 04 05:13:17 Edward kernel: ata4.00: configured for UDMA/33
Oct 04 05:13:18 Edward kernel: ata4: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Oct 04 05:13:18 Edward kernel: ata4.00: configured for UDMA/33
Oct 04 05:13:19 Edward kernel: ata4: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Oct 04 05:13:19 Edward kernel: ata4.00: configured for UDMA/33
Oct 04 05:13:20 Edward kernel: ata4: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
I thought a reasonable first step was to change the SATA cable and port.
Once I did that, my log returned to normal:
Oct 04 07:45:17 Edward kernel: ata6: SATA max UDMA/133 abar m2048@0xf9efc000 port 0xf9efc380 irq 27
Oct 04 07:45:17 Edward kernel: ata6: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Oct 04 07:45:17 Edward kernel: ata6.00: ATA-11: WDC WDS500G2B0A-00SM50, 415020WD, max UDMA/133
Oct 04 07:45:17 Edward kernel: ata6.00: 976773168 sectors, multi 1: LBA48 NCQ (depth 32), AA
Oct 04 07:45:17 Edward kernel: ata6.00: configured for UDMA/133
And I was able to resume from sleep.
This may well have been the underlying issue with my old hard drive, too. I'm not sorry I replaced the drive, but you might check for these log entries before making a decision.
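A quick way to check is to grep the journal for the "limiting" messages shown above:
# Look for SATA link-speed or transfer-mode downgrades in the current boot:
sudo journalctl -b | grep -i limiting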
Regards,
References
Sorry, I don’t have permission to post links.
5 Tools for Monitoring Disk Activity in Linux | OpsDash
ioping | Arch manual pages
SMART | Wikipedia
Disks & Storage | GNOME Help
smartctl | Arch manual pages
FAQ | smartmontools
The 5 SMART stats that actually predict hard drive failure | Computerworld
How do you interpret Seagate’s SMART data? Signs of failure? | Silent PC Review
Failure Trends in a Large Disk Drive Population | Google
Linux Logging Basics | Loggly
Using journalctl | Loggly
journalctl | Arch manual pages
findmnt | Arch manual pages