Cp and mount killed XFCE

I was using xfce4-terminal to copy files from one LUKS partition to another, both on a PCIe x4 drive. While waiting, I was reading this forum. After about 7 minutes the DE disappeared and I was shown a text-only screen, just as if I had pressed [ctrl]+[alt]+F{2-6}. The text showed a filesystem error on /dev/nvme1n1p6, which is my /home. This partition is not encrypted and was not part of the copy operation. The error message repeated about every 3 seconds.

Then I pressed [ctrl]+[alt]+F1 - it showed something else. I tried all the other TTYs as well, and all of them showed something else. I could not find the screen showing the errors again. However, if I stayed on any TTY for a few seconds, the error would pop up there as well.

I ended up removing /home from my fstab and rebooting.
Then I ran fsck.ext4 /dev/nvme1n1p6 - it said the filesystem was clean.
So I tried fsck.ext4 -f /dev/nvme1n1p6 - still no errors found.
So I tried fsck.ext4 -fc /dev/nvme1n1p6 - it reported that it had made changes to the filesystem. (How can I find out what it did?)

I put /home back into fstab and rebooted.

Now I wanted to see the status of my source LUKS volume. It looked good.
So I opened the destination LUKS volume. It mapped OK, but when I ran mount, XFCE froze. I could move the mouse around, and I could type into the other terminal window (I had two terminal windows open and ran mount in one of them). And that’s all. There was no response when clicking any panel, the desktop, or the whisker menu, nor to any hotkey. I tried to start galculator from the working terminal window, which made that window stop responding too.

I had a look at all the TTYs; none of them showed any error. Finally I rebooted.

I tried the same mount again, as the first thing after boot. And the same thing happened.

After the next reboot I ran fsck.ext4, which claimed the filesystem was clean. fsck.ext4 -f said the same. But fsck.ext4 -fc made some changes to the filesystem.

After this, I could mount the volume without trouble.
As for the state of the volume, all files written during the copy operation had a size of 4k.

The problem here isn’t really that cp and mount failed, but how xfce4-terminal and the DE handle it. Is xfce4-terminal really so integrated with the DE that a problem there will take down everything else?

It’s not the terminal that was the issue: it was a hardware / file system issue and can be compared to removing all tires while driving a car: that will make it come to a screeching halt as well. :grin:

The terminal was the symptom (fever) but the illness was the NVMe drive (flu).

Therefore, please perform a:

sudo smartctl --all /dev/nvme1

and report back so we can check whether it was just a glitch or if your NVMe is about to fail…

:scream:

Also, take a full system backup ASAP if you haven’t already done so.

If it’s a system drive I agree that faults may cause system instability. But I think a system should not go suicidal if any other drive has issues.

Since yesterday I’ve filled the disk with files and copied them back to network storage. Running a diff on the network storage showed that the files coming from the drive were identical to the originals.

This test was performed with a single, unencrypted partition spanning the full disk.
Next I created the partitions I want, some encrypted, some not.
For each partition created, before putting any data into it, I ran
fsck.ext4 -fc /dev/nvme1n1pX
On every partition the command reported that it had made changes to the filesystem.
Is there a way to see which changes were made?
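
One partial way to see what a run did, sketched here under assumptions: e2fsck reports its result as a bitmask exit status (the values below are the ones listed in the e2fsck man page), so decoding it at least separates "errors corrected" from "errors left uncorrected". The function name explain_fsck_status is made up for this example. As far as I know, -c also runs a badblocks scan and rewrites the bad-block list, which may be why the run reports a change even on an empty filesystem; dumpe2fs -b should list any blocks that ended up marked bad.

```shell
#!/bin/sh
# Hypothetical helper: decode an e2fsck exit status (a bitmask) into text.
# Bit meanings are taken from the EXIT CODE section of the e2fsck man page.
explain_fsck_status() {
    status=$1
    if [ "$status" -eq 0 ]; then
        echo "no errors"
        return 0
    fi
    out=""
    for pair in \
        "1:errors corrected" \
        "2:reboot recommended" \
        "4:errors left uncorrected" \
        "8:operational error" \
        "16:usage or syntax error" \
        "32:canceled by user" \
        "128:shared library error"
    do
        bit=${pair%%:*}   # number before the colon
        msg=${pair#*:}    # text after the colon
        if [ $((status & bit)) -ne 0 ]; then
            out="$out; $msg"
        fi
    done
    if [ -z "$out" ]; then
        echo "unknown status"
    else
        echo "${out#; }"  # drop the leading separator
    fi
}
```

For example, run fsck.ext4 -fc /dev/nvme1n1pX and call explain_fsck_status $? right after it. Redirecting the fsck output to a file keeps the per-pass messages for later reading.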

# smartctl --all /dev/nvme1
smartctl 7.1 2019-12-30 r5022 [x86_64-linux-5.8.6-1-MANJARO] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       KINGSTON SA2000M81000G
Serial Number:                      50026B7683D9B69F
Firmware Version:                   S5Z42105
PCI Vendor/Subsystem ID:            0x2646
IEEE OUI Identifier:                0x0026b7
Controller ID:                      1
Number of Namespaces:               1
Namespace 1 Size/Capacity:          1 000 204 886 016 [1,00 TB]
Namespace 1 Utilization:            873 561 145 344 [873 GB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            0026b7 683d9b69f5
Local Time is:                      Tue Sep 29 10:15:43 2020 CEST
Firmware Updates (0x14):            2 Slots, no Reset required
Optional Admin Commands (0x0017):   Security Format Frmw_DL Self_Test
Optional NVM Commands (0x005f):     Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp
Maximum Data Transfer Size:         32 Pages
Warning  Comp. Temp. Threshold:     75 Celsius
Critical Comp. Temp. Threshold:     80 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     9.00W       -        -    0  0  0  0        0       0
 1 +     4.60W       -        -    1  1  1  1        0       0
 2 +     3.80W       -        -    2  2  2  2        0       0
 3 -   0.0450W       -        -    3  3  3  3     2000    2000
 4 -   0.0040W       -        -    4  4  4  4    15000   15000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        28 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    0%
Data Units Read:                    8 186 182 [4,19 TB]
Data Units Written:                 4 126 374 [2,11 TB]
Host Read Commands:                 85 695 574
Host Write Commands:                71 566 787
Controller Busy Time:               293
Power Cycles:                       173
Power On Hours:                     113
Unsafe Shutdowns:                   50
Media and Data Integrity Errors:    0
Error Information Log Entries:      0
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Thermal Temp. 1 Transition Count:   114
Thermal Temp. 1 Total Time:         248

Error Information (NVMe Log 0x01, max 256 entries)
No Errors Logged
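
The counters in this report that relate most directly to the freezes and forced reboots are Power Cycles and Unsafe Shutdowns. A small sketch for pulling them out of smartctl output follows; smart_unsafe_ratio is a hypothetical helper name, and it assumes the field labels appear exactly as in the output above.

```shell
#!/bin/sh
# Hypothetical helper: reads `smartctl --all` output on stdin and prints
# how many of the recorded power cycles were counted as unsafe shutdowns.
smart_unsafe_ratio() {
    awk -F: '
        /^Power Cycles:/     { gsub(/[^0-9]/, "", $2); cycles = $2 }
        /^Unsafe Shutdowns:/ { gsub(/[^0-9]/, "", $2); unsafe = $2 }
        END {
            # Print nothing if the Power Cycles field was not found.
            if (cycles > 0)
                printf "%d unsafe shutdowns in %d power cycles (%.0f%%)\n",
                       unsafe, cycles, 100 * unsafe / cycles
        }
    '
}
```

Usage: smartctl --all /dev/nvme1 | smart_unsafe_ratio. For the report above this works out to 50 unsafe shutdowns in 173 power cycles, roughly 29%.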

What I also learned from yesterday’s file copy experiment is that if an SMB/CIFS share is mounted from the terminal, the system will halt if you start copying a large file (1+ GB) while another copy of large files is in progress. I reproduced this twice: while a copy of a 14GB file was running, I started a copy of a 1.2GB file.
There is no problem if I copy a bunch of small files while the copy of a large file is in progress.

Yes, if logging is enabled for all disk writes (which is not the case out of the box). But that lowers drive performance and should only be used when needed. For more info:

man audit.rules

smartctl --all /dev/nvme1 looks good. Me crying wolf on impending doom might have been exaggerated. Can I have the smartctl --all /dev/nvme1p6 too?

That depends on the CIFS mount options: I can copy multiple large (>1GB) files simultaneously just fine. If I do 2 files, the overall speed actually increases. For more than 2 files I hit the WiFi limit (802.11ac here).

:innocent:

As your original question has already been answered, I’ve marked it as solved.

However, if you disagree with my choice, please feel free to take any other answer as the solution to your question or even remove the solution altogether: You are in control! (I just want to avoid even more subjective opinions being posted and confusing you even more)

:innocent:
P.S. In the future, please mark a solution yourself, so that the next person who has the exact same problem will benefit from your post and your question will show the “solved” status.

I disagree that blaming the filesystem is an answer to this thread. What were the days of using CD-ROMs like in the Linux world? Did users happily watch their system die every time there was a read error from the CD?
As I said in the initial post, this is not about the filesystem, but about how the system deals with such faults. My Linux experience is 20 years of production server environments. I never really used a DE, except for a few test desktops 10+ years ago. Linux without a DE is stable - I even had a server that kept running for days even though the system drive had failed completely. (I’m not using RAID in my servers. It’s way cheaper and better to fail over to another server, or even to another datacenter, than to another disk in the same server.)

To me, a Linux computer that dies because of a filesystem error in a data partition came as a big surprise. I’m sure it would not have died if the DE had not been running. It would have thrown an error, and maybe killed the process that ran into the error.

Besides, I’m not so sure this is a filesystem issue. Maybe something fails because of the high write speeds. During a copy over SMB the Windows PC reported 98.7% utilization of the 1Gb cable connection. The only reason I run multiple copy processes in parallel is that I’m browsing through the source to pick the files/folders to copy before the ongoing copy has finished. If the one I find is small, I start the copy and keep browsing. If it’s a big one, I wait for the ongoing big one to finish.
I also had the PC freeze once while copying between nvme0 and nvme1. So maybe this all comes down to the drives being busy and not responding to the DE’s read/write requests like an idle drive would? Maybe the DE thinks ‘hey, you’re a fast drive, I’m not gonna sit around waiting for you like an old spinning disk. I’d rather go kill myself if you keep me waiting a nanosecond longer’.

# smartctl --all /dev/nvme1n1p6 
smartctl 7.1 2019-12-30 r5022 [x86_64-linux-5.8.6-1-MANJARO] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       KINGSTON SA2000M81000G
Serial Number:                      50026B7683D9B69F
Firmware Version:                   S5Z42105
PCI Vendor/Subsystem ID:            0x2646
IEEE OUI Identifier:                0x0026b7
Controller ID:                      1
Number of Namespaces:               1
Namespace 1 Size/Capacity:          1 000 204 886 016 [1,00 TB]
Namespace 1 Utilization:            873 816 104 960 [873 GB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            0026b7 683d9b69f5
Local Time is:                      Wed Sep 30 05:41:48 2020 CEST
Firmware Updates (0x14):            2 Slots, no Reset required
Optional Admin Commands (0x0017):   Security Format Frmw_DL Self_Test
Optional NVM Commands (0x005f):     Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp
Maximum Data Transfer Size:         32 Pages
Warning  Comp. Temp. Threshold:     75 Celsius
Critical Comp. Temp. Threshold:     80 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     9.00W       -        -    0  0  0  0        0       0
 1 +     4.60W       -        -    1  1  1  1        0       0
 2 +     3.80W       -        -    2  2  2  2        0       0
 3 -   0.0450W       -        -    3  3  3  3     2000    2000
 4 -   0.0040W       -        -    4  4  4  4    15000   15000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        27 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    0%
Data Units Read:                    8 238 260 [4,21 TB]
Data Units Written:                 4 140 336 [2,11 TB]
Host Read Commands:                 87 525 515
Host Write Commands:                73 217 258
Controller Busy Time:               299
Power Cycles:                       176
Power On Hours:                     114
Unsafe Shutdowns:                   50
Media and Data Integrity Errors:    0
Error Information Log Entries:      0
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Thermal Temp. 1 Transition Count:   114
Thermal Temp. 1 Total Time:         248

Error Information (NVMe Log 0x01, max 256 entries)
No Errors Logged

This problem seems to be related to writing to LUKS volumes. I have the feeling the drive goes into powersave while LUKS is caching data. Then, when LUKS writes the data to disk, the disk is sleeping and fails to respond in time.
It’s not the LUKS volume that gets messed up, but the inode holding the directory where the LUKS volume is mounted. Here’s an example:
The partition /dev/nvme1n1p6 (not encrypted) is mounted as /home and contains the folder /home/myuser/private.
The partition /dev/nvme1n1p7 (encrypted with LUKS) is mapped to /dev/mapper/myprivate and then mounted at /home/myuser/private.
When this issue occurs, the inode holding the folder /home/myuser/private on nvme1n1p6 is corrupted.
After reboot, xfce4-terminal and XFCE die when trying to mount something at this folder.

I installed nvme-cli
and ran nvme get-feature /dev/nvmeX -f 0x0c -H
When comparing the results for nvme0 and nvme1, I see different timings.

These sum to 2100ms total for nvme1 (Kingston) and 4325ms for nvme0 (Micron). So now the question is: how do I change the powersave timing for the Kingston drive?
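
On the question of changing the timing: one knob I am aware of is the Linux kernel parameter nvme_core.default_ps_max_latency_us, which, as I understand it, keeps the driver from autonomously entering power states whose entry plus exit latency exceeds the given number of microseconds. The sketch below only does the arithmetic on the "Supported Power States" table from smartctl; ps_latency_check is a hypothetical helper, the limit of 5500 in the usage note is just an illustrative value, and the column layout is assumed to match the smartctl 7.1 output posted earlier in this thread.

```shell
#!/bin/sh
# Hypothetical helper: reads `smartctl --all` output on stdin.
# For every non-operational power state (Op column "-") it prints the state,
# its entry + exit latency in microseconds, and whether that total fits
# within the limit passed as the first argument.
ps_latency_check() {
    awk -v limit="$1" '
        # Table rows have 11 columns: St Op Max Active Idle RL RT WL WT Ent_Lat Ex_Lat
        NF == 11 && $1 ~ /^[0-9]+$/ && $2 == "-" {
            total = $10 + $11
            printf "PS%d %d %s\n", $1, total, (total <= limit ? "allowed" : "excluded")
        }
    '
}
```

With the Kingston table (PS3: 2000 + 2000, PS4: 15000 + 15000), smartctl --all /dev/nvme1 | ps_latency_check 5500 reports PS3 as allowed and PS4 as excluded. The current limit can be read from /sys/module/nvme_core/parameters/default_ps_max_latency_us, and a different value would go on the kernel command line as nvme_core.default_ps_max_latency_us=5500. Whether that actually cures the freezes here is untested.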

Further experimenting shows that just before XFCE dies, Thunar is using about 500-540% CPU as reported by htop. Each of the 8 cores is at 78-82% usage. If I quit Thunar, the process sometimes keeps running. If I kill it (from htop) soon enough, the filesystem issue does not happen.

So, what makes Thunar use that much CPU? It turns out that if I have one LUKS volume mounted at /var/data, and /var/data/photos bind mounted into /home/myuser/private/photos, Thunar easily uses that much CPU. Keep in mind that /var and /home are unencrypted partitions, while /home/myuser/private is a LUKS volume.

Thunar started to use a lot of CPU when browsing the bind-mounted volume while a large copy process was running - even though the copy process did not involve this volume. And I did not open files in this volume, I was just browsing to make it list files.

So I had a folder in one LUKS volume bind mounted to a folder in another LUKS volume. After a reboot, I didn’t do the bind mount. This made the copy process much more robust. I could browse around without running into issues. Then the PC hibernated for the night.
Next morning I woke the PC up. Once I used Thunar to browse /mnt (where I have the SMB/CIFS volume mounted), it started to use 500% CPU again. I tried to unmount /mnt, but it was busy. Then I closed all Thunar windows (there were 3 of them), but the process kept running. I waited for a while, but umount kept saying the volume was busy, so I ended up killing the process. Then the volume would unmount.

I remounted and started a new Thunar process. And it worked well.

All my filesystems are ext4

So, currently this problem seems to be a chain of events.

  1. Short transition times for the NVMe drive to enter powersaving cause it to go to sleep while a write is going on through a LUKS volume, if the write slows down too much.
  2. Something using a lot of CPU slows down the LUKS write so much that the risk of this happening increases significantly.
  3. Thunar may start using enormous amounts of CPU (I’ve never seen a process using more than 500% CPU before). This seems to happen more frequently when browsing a path that crosses several mounts, say /home (mounted under /), /myuser/private (LUKS), /photos (LUKS bind mount).
  4. A LUKS volume (bind?) mounted into another LUKS volume increases the risk of this situation.
  5. Having a Thunar window open when the PC goes to sleep/hibernates also seems to increase the risk. Possibly the risk is further increased if the window is showing the content of a LUKS or SMB/CIFS volume.
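
To check item 3 on a given machine, it helps to see which mounts a path crosses. A minimal sketch, assuming the usual /proc/self/mounts layout (device, mount point, filesystem type, options); mounts_under is a name invented for this example.

```shell
#!/bin/sh
# Hypothetical helper: reads a mounts table (the format of /proc/self/mounts)
# on stdin and prints mount point and filesystem type for every mount at or
# beneath the directory given as the first argument.
# Note: mount points containing spaces are escaped (\040) in that file.
mounts_under() {
    awk -v base="${1%/}" '
        # Match the directory itself, or anything below it ("base/...").
        $2 == base || index($2, base "/") == 1 { print $2, $3 }
    '
}
```

Usage: mounts_under /home < /proc/self/mounts. Bind mounts show up as ordinary entries in that file, so a bind mounted folder is listed like any other mount point.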

When the problem has happened, the folder at which the written-to filesystem was mounted is destroyed. Say I was writing to a LUKS volume mounted at /home/private: then the inode holding the folder /home/private is messed up. It is fixed by running fsck.ext4 -fc on the partition mounted as /home. If I try to mount anything at this folder without running the fsck, the system freezes.

I ditched LUKS + ext4. I’m now running ZFS with an encrypted filesystem for all encrypted partitions, and still using ext4 for the unencrypted partitions (those that auto-mount at boot).
No issues with ZFS so far. And judging by the amount of data written, LUKS + ext4 would have failed by now.

After several days of using ZFS, I have to wonder why it is not the default filesystem for Manjaro. All my stability issues are gone! And for my VMs, the filesystem’s built-in dedup removes the need for the linked clone functionality in vbox. (Just remember to disable automatic defrag in the guest OS.)

The only thing is that the latest update to Thunar and/or glib made Thunar unable to copy to ZFS and SMB/CIFS. Silly me, I followed the recommendation to do a systemwide update when installing new stuff (ZFS). Hopefully the next update will fix this.
