The pernicious USB-stick stall problem

Jazz · 6 February 2021 23:04

“The main problem is related to matching the rate at which a process creates dirty memory to the rate at which that memory can be written to the underlying storage device. If a process is allowed to dirty a large amount of memory, the kernel will find itself committed to writing a chunk of data that might take minutes to transfer to persistent storage. All that data clogs up the I/O queues, possibly delaying other operations. And, as soon as somebody calls sync(), things stop until that entire queue is written. It’s a storage equivalent to the bufferbloat problem.”
- The pernicious USB-stick stall problem [LWN.net]

This is probably one of the oldest open issues out there with Linux. I just tried to copy an mp4 file (1.8GB) to two USB flash drives, one formatted as exFAT and the other as FAT32, and neither copy matched the original file on my SSD.

1st USB drive) FAT32
It hung at “100% completed” for more than 10 minutes. I could see that 1.8GB had been occupied on the USB stick, but when the cp process finished, the file suddenly disappeared from the stick as if it had never been copied at all. Forcing a sync may have helped with this one…

2nd USB drive) exFAT
The file copied to the USB drive at the speed I would expect for an SSD to USB 3.1 transfer, but the checksum was different. The header of the copied mp4 file was significantly changed, and the copy couldn’t be played on any device, even after I copied it back to the original location.

All this happened without any warning, so I find it an extremely alarming sign to be careful when using USB flash drives. Almost 8 years ago even Linus was addressing this issue, and the discussion is still ongoing today.

I understand it’s due to the settings of the following variables:

$ sysctl -a | grep dirty
vm.dirty_background_bytes = 0
vm.dirty_background_ratio = 10
vm.dirty_bytes = 0
vm.dirty_expire_centisecs = 1500
vm.dirty_ratio = 20
vm.dirty_writeback_centisecs = 1500
vm.dirtytime_expire_seconds = 43200
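
To put the two ratios above into perspective, they can be converted into bytes. On a machine with 16GB of RAM, 10% and 20% work out to roughly 1.6GB and 3.2GB at most, so a 1.8GB file can sit almost entirely in the page cache before anything reaches the stick. The following is only a sketch: `ratio_to_bytes` is a hypothetical helper, and using MemAvailable as a stand-in for the kernel's notion of available memory is an assumption (the kernel computes its own figure).

```shell
#!/usr/bin/env bash
# Sketch: convert a dirty-page ratio (percent) into bytes.
# Assumption: MemAvailable from /proc/meminfo approximates the memory the
# kernel applies the ratio to; the kernel's internal figure will differ.

# ratio_to_bytes MEM_KB PERCENT -> prints PERCENT of MEM_KB, in bytes
ratio_to_bytes() {
    local mem_kb=$1 percent=$2
    echo $(( mem_kb * 1024 * percent / 100 ))
}

# Report the approximate thresholds for this machine, if /proc is readable.
if [ -r /proc/meminfo ] && [ -r /proc/sys/vm/dirty_ratio ]; then
    avail_kb=$(awk '/^MemAvailable:/ {print $2}' /proc/meminfo)
    bg=$(cat /proc/sys/vm/dirty_background_ratio)
    fg=$(cat /proc/sys/vm/dirty_ratio)
    echo "background writeback starts near: $(ratio_to_bytes "$avail_kb" "$bg") bytes"
    echo "limit for writing processes near: $(ratio_to_bytes "$avail_kb" "$fg") bytes"
fi
```

For example, `ratio_to_bytes 16777216 20` (16GiB expressed in kB, 20%) prints 3435973836. If the ratios read as 0, the byte-based settings are in use instead.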

I’ve found somewhat outdated discussions, e.g.: USB hard drive transfer forces system into a crawl / Kernel & Hardware / Arch Linux Forums

Are there any recommended values for those settings (for 16GB of RAM) that would avoid similar issues in the future (when transferring large files as well as many smaller files)? I would assume that with this kernel it may be impossible to satisfy every scenario out there, but maybe I’m missing something… I would like to avoid huge hiccups like the ones described above.

https://unix.stackexchange.com/questions/579708/whats-the-best-dirty-background-ratio-and-dirty-ratio-for-my-usage

Nachlese · 7 February 2021 00:01

I think:
you assume too much
and
the wrong things as well

so:
what is the problem that needs to be solved?

Jazz · 7 February 2021 00:15

Not sure I understand your super rude response at all, unless you’re just trolling.
There was an issue described related to those USB drives and one sentence below with a question mark.

Nachlese · 7 February 2021 02:28

… or perhaps you did a mediocre job at explaining yourself and your issue
and I was, rather unsuccessfully, trying to make you aware of the fact …
(using some wording of yours and playing on it - which is now no longer there, making the format of my response even more unfitting)

offending you certainly wasn’t my intention!

the “fact” seemingly being:
a “copy” isn’t a copy - since the product is not the same as the original
…

But instead of correcting me and providing (perhaps better) feedback - you just feel offended and assume trolling.

mkay

linux-aarhus · 7 February 2021 06:14

Other than patience with the computer system and the operation - I don’t think there is any one-size-fits-all kind’a answer to this.

If the stick in question is somehow flawed and the file copied is close to the limit of the stick - there is a fair chance that something goes wrong.

But let’s assume there is nothing wrong with the stick. Many of the same speculations have gone through my head, but I have never run into a flawed copy, i.e. a checksum that doesn’t match. That only happened to me before I realized that the copy continued after it seemed to be finished - an observation that led me into the dungeons of Linux memory handling and the file cache.

When a copy operation ends is very much related to the exact variables you mention but is highly dependent on the target USB device, the type of device and the quality of the device.

One sure sign of activity is a flashing LED on the USB device - but not all devices have such an LED - which can lead to an early pull while data is still being pushed, causing the inconsistencies you mention.

Some operations allocate the space beforehand - maybe in an attempt to verify that there is adequate space available - but again, not all operations act that way.

E.g. my 512G WD Passport USB-C device is light-years faster than a 64G Kingston DataTraveler USB 3, which in turn is faster than some really old USB 2 2GB and 4GB Transcend sticks, which in turn are slower than a much older LaCie Yama key - and it illustrates very well the differences between age, type of device and quality.

A couple of members have been working on some reasonable values for various settings, and this has resulted in the maxperfwiz script by @cscs and @Kresimir, which works well.

You can find it at cscs / maxperfwiz · GitLab

On a side note - instead of just pasting text from another article, which makes the reader think you are very smart and did the thinking yourself, you should mark what is actually quoted from another article - which in this case is a prominent part - even the title has been copied from that article.

D.Dave · 7 February 2021 13:25

There was a discussion about this on the previous forum:

https://archived.forum.manjaro.org/t/decrease-dirty-bytes-for-more-reliable-usb-transfer/62513

I found the same, and changing the values

dirty_background_bytes
dirty_bytes

has helped.

E.g. it is measurable by timing cp and sync:

time cp '/home/dave/Downloads/4GiB.bin' /run/media/dave/traveler32/; time sync

cp = real 6m15,286s
sync = real 0m0,992s

With the default settings, sync also used to take minutes.

This test was run on a poor-quality USB stick.

But it also depends, obviously, on the target disk: e.g. on a USB 3.0 disk, the whole operation (copy and sync with the changed values) takes about 40 seconds.
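
The measurement above can be wrapped into a small helper so that the copy time and the flush time are always reported separately. This is a minimal sketch: `timed_copy` is a hypothetical name, and the timing is only accurate to whole seconds.

```shell
#!/usr/bin/env bash
# Sketch: time the copy and the subsequent flush separately.
# timed_copy is a hypothetical helper, not an existing tool.

# timed_copy SRC DEST_DIR
timed_copy() {
    local src=$1 dest_dir=$2
    local t0 t1 t2
    t0=$(date +%s)
    cp -- "$src" "$dest_dir"/ || return 1
    t1=$(date +%s)
    sync                       # returns once dirty data has been written out
    t2=$(date +%s)
    echo "cp:   $(( t1 - t0 )) s"
    echo "sync: $(( t2 - t1 )) s"
}
```

Usage would look like `timed_copy ~/Downloads/4GiB.bin /run/media/dave/traveler32`. A long "sync" figure means most of the data was still sitting in memory when cp returned.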

Jazz · 7 February 2021 13:33

I verified those USB drives on multiple machines and they both seem fine - no errors found after full scans from several different applications.

The title and the text below suit the link I posted, but you’ve got a point - I just needed to use a blockquote for that paragraph, thanks.

You could easily track the editing history, and again - I have no clue what you’re talking about.

No, I think you are very mediocre and superficial in being a community member and helping on forums. It’s quite obvious your ego comes out first. @linux-aarhus understood my issue, gave his opinion on it without getting personal, but your approach is not suitable for any open source community and should be discouraged.

linux-aarhus · 7 February 2021 13:39

I expected that - just me thinking out loud

From experience - I just let the system finish. With large copies - e.g. manually copying the content of a Windows ISO to USB - I let the system do its job, and before pulling the stick I issue a sync command - if it returns immediately, I know I can pull the stick without damaging data.
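
For sticks without an LED, one way to watch the remaining work is to read the Dirty and Writeback counters from /proc/meminfo. A rough sketch follows; `pending_kb` and `wait_for_writeback` are hypothetical helper names, and the 1024 kB default threshold is an arbitrary assumption. The counters are system-wide, so they rarely reach zero on a busy machine, and a final sync is still the step that actually guarantees the data is out.

```shell
#!/usr/bin/env bash
# Sketch: watch how much data is still waiting to be written.

# pending_kb [MEMINFO_FILE] -> prints Dirty + Writeback in kB
pending_kb() {
    awk '/^(Dirty|Writeback):/ {sum += $2} END {print sum + 0}' "${1:-/proc/meminfo}"
}

# wait_for_writeback [THRESHOLD_KB] -> polls once per second until the
# pending amount drops to the threshold (default 1024 kB, an assumption)
wait_for_writeback() {
    local threshold=${1:-1024}
    while [ "$(pending_kb)" -gt "$threshold" ]; do
        echo "still pending: $(pending_kb) kB"
        sleep 1
    done
}
```

A typical sequence would be `wait_for_writeback; sync` before unmounting and pulling the stick.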

Jazz · 7 February 2021 13:48

I couldn’t explain it myself until I read that article I posted. This was the exact workflow:

Copy 1.mp4 [1.8GB] from SSD to USB drive 2 [FAIL]
Deleted 1.mp4 from USB 2
Copy 2.mp4 [2.3GB] from SSD to the same USB drive 2 [SUCCESS]
Copy 3.mp4 [0.9GB] from SSD to USB drive 2 [SUCCESS]
Copy 1.mp4 [1.8GB] from SSD to USB drive 2 [FAIL]

FAIL = mp4 header was distorted and you couldn’t play it on multiple machines (I compared the headers, checksum and total size: total size was the same, checksums were different as well as the headers)

SUCCESS = I could normally watch the video on any machine

I could normally watch 1.mp4 at the original location (SSD) and I could normally use Windows 10 to copy all of those files on all USB drives I’ve mentioned with no issues whatsoever.
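
The checksum comparison described above can be made part of the copy itself. The sketch below uses a hypothetical helper, `verified_copy`, and assumes `sha256sum` from coreutils is available. One caveat: right after a copy, the read-back may be served from the page cache rather than from the stick, so a result that reflects what is really on the flash needs the stick to be unmounted and re-plugged before the checksum is run again.

```shell
#!/usr/bin/env bash
# Sketch: copy a file, flush it, and compare checksums.
# verified_copy is a hypothetical helper name.

# verified_copy SRC DEST_DIR
verified_copy() {
    local src=$1 dest_dir=$2
    local dest a b
    dest="$dest_dir/$(basename -- "$src")"
    cp -- "$src" "$dest" || return 1
    sync                                   # push the data out before checking
    a=$(sha256sum < "$src"  | cut -d' ' -f1)
    b=$(sha256sum < "$dest" | cut -d' ' -f1)
    if [ "$a" = "$b" ]; then
        echo "OK $dest"
    else
        echo "MISMATCH $dest" >&2
        return 1
    fi
}
```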

linux-aarhus · 7 February 2021 13:53

I think your issue is due to the nature of eMMC storage and, to some degree, maybe the quality of the storage chips.

I recently - from another thread - reread an academic paper on shredding data stored in eMMC chips. If you are a data vault geek, the results presented in that paper are disturbing when it comes to storage chips.

Only frying the board - then drilling the chips - then smashing them with a hammer and finally petrol and a lighter - will ever secure whatever data you had on those chips.

The real issue can be formulated very briefly: know your system. This implies knowing, either by experience or by other means, how the Linux kernel uses memory as a file cache. I have never come across anyone explaining it better than the author of the text at https://linuxatemyram.com but you probably already know that

Jazz · 7 February 2021 14:45

Previously I hesitated to apply this solution: I’d noticed some users claiming it didn’t resolve their issues, others claiming they had even bricked their system after playing with those variables, and Manjaro still hasn’t made it the default 2 years after that discussion started. After your reply I gave it a try: I rebooted my Manjaro, copied the same 1.mp4 to the same USB drives, and now I can play my mp4 files normally on every machine I tried. The sync was triggered right on time, after each copy had completed. That was an absolute success!

Thanks for your valuable response, I haven’t noticed it on the old forum before!

I wonder why this is not the default already. Cheers!

$ cat /etc/sysctl.d/98-dirty.conf 
vm.dirty_background_bytes=16777216
vm.dirty_bytes=50331648
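
The two integers above are easier to read as MiB: 16777216 bytes is 16MiB and 50331648 bytes is 48MiB. A small sketch for converting in both directions, where `mib_to_bytes` and `bytes_to_mib` are hypothetical helper names:

```shell
#!/usr/bin/env bash
# Sketch: convert between MiB and the byte values used in sysctl.d files.

mib_to_bytes() { echo $(( $1 * 1024 * 1024 )); }
bytes_to_mib() { echo $(( $1 / 1024 / 1024 )); }

echo "vm.dirty_background_bytes=$(mib_to_bytes 16)"   # 16 MiB
echo "vm.dirty_bytes=$(mib_to_bytes 48)"              # 48 MiB

# To load files from /etc/sysctl.d without rebooting (needs root):
#   sudo sysctl --system
```

Note that the kernel treats each *_bytes setting and its *_ratio counterpart as alternatives: once vm.dirty_bytes is written, vm.dirty_ratio reads back as 0, and the same holds for the background pair.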

D.Dave · 7 February 2021 15:06

I’m glad you solved it

Anyway, as linux-aarhus stated:

In fact, I didn’t blindly copy/apply the values suggested in the mentioned discussion on the previous forum; I tested and studied my system (also based on my needs), and I set

vm.dirty_background_bytes=33554432
vm.dirty_bytes=134217728

So I ended up with different values which, in fact, behave better on my system than the defaults, which are implemented through vm.dirty_ratio and vm.dirty_background_ratio instead.

Jazz · 7 February 2021 15:12

Actually, that was the whole point of this thread. The proposed solution offered just two integers. I was hoping for a recommended formula on how to find appropriate defaults for each system individually. Would you mind sharing your approach on that?

I just tested maxperfwiz and unfortunately, their 3% for both dirty_background_ratio and dirty_ratio doesn’t work as well as the example provided by Linus. I will read more and try to find my own magic combo.

  memmeg=$(vmstat -sS M | head -n1 | awk '{print $1;}')
  memperc=10
  if (( memmeg > 6000 )); then
    memperc=7
  fi
  if (( memmeg > 9000 )); then
    memperc=5
  fi
  if (( memmeg > 14000 )); then
    memperc=3
  fi
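
To see what these tiers amount to in absolute terms, the same thresholds can be put into a function and the resulting percentage turned into megabytes. This is only a sketch: `tier_percent` and `percent_of_mb` are hypothetical names, and applying the percentage to total RAM is a simplification, since the kernel applies the ratio to available memory.

```shell
#!/usr/bin/env bash
# Sketch: the tier logic from the excerpt above, as a function.

# tier_percent MEM_MB -> prints the percentage chosen for that much RAM
tier_percent() {
    local mem_mb=$1 perc=10
    if (( mem_mb > 14000 )); then
        perc=3
    elif (( mem_mb > 9000 )); then
        perc=5
    elif (( mem_mb > 6000 )); then
        perc=7
    fi
    echo "$perc"
}

# percent_of_mb MEM_MB PERCENT -> prints PERCENT of MEM_MB, in MB
percent_of_mb() {
    echo $(( $1 * $2 / 100 ))
}
```

For a 16GB machine, `tier_percent 16000` prints 3 and `percent_of_mb 16000 3` prints 480, so the ratio-based limit is up to about 480MB, around ten times the 48MiB of vm.dirty_bytes=50331648. That gap may be why the two setups behave so differently with a slow stick.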

In the end, I will keep most of maxperfwiz’s changes except for dirty_background_ratio and dirty_ratio. I already cut both variables out and replaced them with Linus’s example above, since those values work best for me. I hope the algorithm for setting those 2 variables will be improved. After all, it’s exactly the OS’s job to know the system it is installed on and to adjust/optimize those parameters automatically. For now, I’m satisfied with the following:

$ cat /etc/sysctl.d/99-maxperfwiz.conf          
[...]
vm.dirty_background_bytes=16777216
vm.dirty_bytes=50331648
[...]

D.Dave · 7 February 2021 15:20

By testing for days, in various scenarios depending on the workload, to avoid overhead; by watching iotop and htop; and by running the time command on copy and sync in every situation and averaging the results. Every time I changed these values I rebooted, but sometimes I just dropped the caches instead of rebooting (sync; echo 3 > /proc/sys/vm/drop_caches), and I also tested with the caches full of data. That is how I ended up with the values I reported, which give me balanced settings (not only for these values) between throughput and low latency, for my needs and for what my laptop’s hardware can do.
So I can say for sure that there cannot be universal values.
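
The averaging step of such a test series is easy to script. A minimal sketch, where `average` is a hypothetical helper that takes timings in seconds (with a dot as the decimal separator) and prints their mean:

```shell
#!/usr/bin/env bash
# Sketch: average several timing results (in seconds).

# average N1 N2 ... -> prints the arithmetic mean with two decimals
average() {
    [ "$#" -gt 0 ] || return 1
    printf '%s\n' "$@" | awk '{sum += $1; n++} END {printf "%.2f\n", sum / n}'
}

# Between runs, the caches can be dropped without a reboot (needs root).
# A plain "sudo echo 3 > ..." fails because the redirect runs unprivileged:
#   sync; echo 3 | sudo tee /proc/sys/vm/drop_caches
```

For example, `average 10 20 30` prints 20.00. Comparing the averaged cp and sync times before and after changing vm.dirty_bytes shows whether a given value actually helps on a particular machine.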

I already had a discussion about this:

https://forum.manjaro.org/t/my-system-tweaks-to-achieve-better-performances-based-on-my-needs/43808
