First experience in GNU/Linux with a failing/failed drive… initial symptom being a locked taskbar?

Yesterday I was playing with building shell scripts to gain visibility into my system's sensors, which included installing radeontop:

radeontop.sh
#!/bin/bash
# open a minimal konsole window (no tab bar or menu) running radeontop
konsole --hide-tabbar --hide-menubar -e radeontop

watchGPU.sh
#!/bin/bash
# refresh the amdgpu power/clock debug info every half second in a konsole window
konsole -e sudo watch -n 0.5 cat /sys/kernel/debug/dri/0/amdgpu_pm_info

watchsensors.sh
#!/bin/bash
# refresh lm_sensors readings every half second in a konsole window
konsole -e watch -n 0.5 sensors

…and then I discovered psensor and settled into using just it and watchGPU.sh.

Potential red herring… somewhere along the line I recall finding a “stuck” terminal session that would not close, and I ended up selecting an option to kill the process. Before finally figuring out that I needed to use konsole in my shell scripts (to launch a terminal window with my command), I had been following examples that said to use xterm, which did not work (command not found).
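
In hindsight, a small sketch of what those scripts could have done to avoid depending on one specific terminal emulator (this fallback logic is just an assumption on my part, not what I actually ran):

#!/bin/bash
# launch "watch sensors" in whichever terminal emulator is actually installed
if command -v konsole >/dev/null 2>&1; then
    konsole -e watch -n 0.5 sensors
elif command -v xterm >/dev/null 2>&1; then
    xterm -e watch -n 0.5 sensors
else
    echo "no supported terminal emulator found" >&2
    exit 1
fi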

Things seemed to be going fine otherwise, until I decided I wanted to move my pinned psensor icon in the taskbar… I could “grab” it (saw a fist cursor as it was “selected”) but it would not move from its position.

I thought that was weird and went to bed. This morning, the issue persisted and I thought maybe a reboot would be a good idea. Selected restart from the Manjaro Menu… and things got interesting…

  1. The initial “shutting down” sequence started with (what I assume to be) a “clean” fsck report of the system drive
  2. then I caught a flash of an error that I think indicated it was (I’m paraphrasing) unable to unmount user 1000… me
  3. seeing this error was very brief as the system rebooted
  4. The initial power-up sequence started with (what I assume to be) a “clean” fsck reporting of the system drive… and then a pause
  5. Then I heard my Software RAID array spin up and thought… oh right, fsck will look at the array every 4 boots… today must be the day
  6. the mechanical disks stopped clacking… more pause
  7. then the timeouts/failures started rolling in as it worked through my fstab mounts (see the journalctl sketch after this list for how I’d dig into those afterwards)
  8. Trying to make sense of the emergency mode directions… I first typed exit to try to resume a normal (soft) boot
  9. after a pause, it re-prompted for another choice (previous choice failed?)… this time I typed systemctl reboot to try a restart… and after more of a pause (still on this screen) I powered down for a hard reboot
  10. On the hard reboot the system came up normally, and I was finally able to reposition my psensor taskbar icon
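
For reference, this is the sort of thing I’d run now to dig into what actually failed during that earlier boot (just a sketch; -b -1 asks journalctl for the boot before the current one):

$ journalctl -b -1 -p 3          # errors (priority "err" and worse) from the previous boot
$ systemctl list-units --failed  # anything systemd still considers failed
$ findmnt --verify               # sanity-check /etc/fstab against what is actually mountable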

All looks fine/normal now, but I can’t help but wonder…

  • what in the world went wrong?
  • And is there a relationship between the “locked taskbar”, the unmountable user, and the initial soft reboot timeouts?
  • Might the locked terminal session I aborted have been a sign?
  • what might have been “lingering” in the soft reboot that the hard reboot cleared or worked around?

A couple of other pieces of information for reference…

  • I think my SN750 is fine…
$ sudo smartctl --all /dev/nvme1n1 
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.13.5-1-MANJARO] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       WDS100T3X0C-00SJG0
Serial Number:                      192725805856
Firmware Version:                   102000WD
PCI Vendor/Subsystem ID:            0x15b7
IEEE OUI Identifier:                0x001b44
Total NVM Capacity:                 1,000,204,886,016 [1.00 TB]
Unallocated NVM Capacity:           0
Controller ID:                      8215
NVMe Version:                       1.3
Number of Namespaces:               1
Namespace 1 Size/Capacity:          1,000,204,886,016 [1.00 TB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            001b44 8b4472fa2a
Local Time is:                      Sun Aug  8 12:30:55 2021 CDT
Firmware Updates (0x14):            2 Slots, no Reset required
Optional Admin Commands (0x0017):   Security Format Frmw_DL Self_Test
Optional NVM Commands (0x001f):     Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat
Log Page Attributes (0x02):         Cmd_Eff_Lg
Maximum Data Transfer Size:         128 Pages
Warning  Comp. Temp. Threshold:     80 Celsius
Critical Comp. Temp. Threshold:     85 Celsius
Namespace 1 Features (0x02):        NA_Fields

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     6.00W       -        -    0  0  0  0        0       0
 1 +     3.50W       -        -    1  1  1  1        0       0
 2 +     3.00W       -        -    2  2  2  2        0       0
 3 -   0.1000W       -        -    3  3  3  3     4000   10000
 4 -   0.0025W       -        -    4  4  4  4     4000   45000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         2
 1 -    4096       0         1

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        38 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    0%
Data Units Read:                    42,643,866 [21.8 TB]
Data Units Written:                 32,669,419 [16.7 TB]
Host Read Commands:                 331,584,942
Host Write Commands:                524,770,081
Controller Busy Time:               1,112
Power Cycles:                       51
Power On Hours:                     14,437
Unsafe Shutdowns:                   11
Media and Data Integrity Errors:    0
Error Information Log Entries:      1,440
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0

Error Information (NVMe Log 0x01, 16 of 256 entries)
No Errors Logged
  • /etc/fstab contents
$ cat /etc/fstab
# /etc/fstab: static file system information.
#
# Use 'blkid' to print the universally unique identifier for a device; this may
# be used with UUID= as a more robust way to name devices that works even if
# disks are added and removed. See fstab(5).
#
# <file system>                           <mount point>  <type>  <options>  <dump>  <pass>
UUID=B4AB-4594                            /boot/efi      vfat    umask=0077 0 2
UUID=5d67a7c6-6cdf-446d-92f6-b7be1f0fb13d /              ext4    defaults,noatime 0 1
UUID=cf6b8b04-b6ae-4b54-a5e9-3dcb0b4595d5 /data/sn750    ext4    defaults,noatime 0 2
UUID=1bbb2871-a304-4482-82e1-b4fda98cfeab /data/evo840   ext4    defaults,noatime 0 2
UUID=6487110f-670a-4bac-b88f-e422fb107071 /data/raid1   ext4    defaults,noatime 0 2
192.168.100.140:/volume1/Daniel /data/syndaniel nfs rsize=8192,wsize=8192,timeo=14,_netdev 0 0
192.168.100.140:/volume1/Shared /data/synshared nfs rsize=8192,wsize=8192,timeo=14,_netdev 0 0
/swapfile none swap defaults 0 0
  • post hard reboot journalctl -xb
$ journalctl -xb
-- Journal begins at Tue 2021-07-13 15:47:15 CDT, ends at Sun 2021-08-08 12:37:43 CDT. --
Aug 08 11:23:01 AM4-x5600-Linux kernel: Linux version 5.13.5-1-MANJARO (builduser@LEGION) (gcc (GCC) 11.1.0, GNU ld (GNU Binutils) 2.36.1) #>
Aug 08 11:23:01 AM4-x5600-Linux kernel: Command line: BOOT_IMAGE=/boot/vmlinuz-5.13-x86_64 root=UUID=5d67a7c6-6cdf-446d-92f6-b7be1f0fb13d rw>
Aug 08 11:23:01 AM4-x5600-Linux kernel: x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
Aug 08 11:23:01 AM4-x5600-Linux kernel: x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
Aug 08 11:23:01 AM4-x5600-Linux kernel: x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
Aug 08 11:23:01 AM4-x5600-Linux kernel: x86/fpu: Supporting XSAVE feature 0x200: 'Protection Keys User registers'
Aug 08 11:23:01 AM4-x5600-Linux kernel: x86/fpu: xstate_offset[2]:  576, xstate_sizes[2]:  256
Aug 08 11:23:01 AM4-x5600-Linux kernel: x86/fpu: xstate_offset[9]:  832, xstate_sizes[9]:    8
Aug 08 11:23:01 AM4-x5600-Linux kernel: x86/fpu: Enabled xstate features 0x207, context size is 840 bytes, using 'compacted' format.
Aug 08 11:23:01 AM4-x5600-Linux kernel: BIOS-provided physical RAM map:
Aug 08 11:23:01 AM4-x5600-Linux kernel: BIOS-e820: [mem 0x0000000000000000-0x000000000009ffff] usable
Aug 08 11:23:01 AM4-x5600-Linux kernel: BIOS-e820: [mem 0x00000000000a0000-0x00000000000fffff] reserved
Aug 08 11:23:01 AM4-x5600-Linux kernel: BIOS-e820: [mem 0x0000000000100000-0x0000000009d81fff] usable
Aug 08 11:23:01 AM4-x5600-Linux kernel: BIOS-e820: [mem 0x0000000009d82000-0x0000000009ffffff] reserved
Aug 08 11:23:01 AM4-x5600-Linux kernel: BIOS-e820: [mem 0x000000000a000000-0x000000000a1fffff] usable
Aug 08 11:23:01 AM4-x5600-Linux kernel: BIOS-e820: [mem 0x000000000a200000-0x000000000a20dfff] ACPI NVS
Aug 08 11:23:01 AM4-x5600-Linux kernel: BIOS-e820: [mem 0x000000000a20e000-0x00000000cb03bfff] usable
Aug 08 11:23:01 AM4-x5600-Linux kernel: BIOS-e820: [mem 0x00000000cb03c000-0x00000000cb03cfff] reserved
Aug 08 11:23:01 AM4-x5600-Linux kernel: BIOS-e820: [mem 0x00000000cb03d000-0x00000000cb0a0fff] usable
Aug 08 11:23:01 AM4-x5600-Linux kernel: BIOS-e820: [mem 0x00000000cb0a1000-0x00000000cb0a1fff] reserved
Aug 08 11:23:01 AM4-x5600-Linux kernel: BIOS-e820: [mem 0x00000000cb0a2000-0x00000000dad0bfff] usable
Aug 08 11:23:01 AM4-x5600-Linux kernel: BIOS-e820: [mem 0x00000000dad0c000-0x00000000db068fff] reserved
Aug 08 11:23:01 AM4-x5600-Linux kernel: BIOS-e820: [mem 0x00000000db069000-0x00000000db0ccfff] ACPI data
Aug 08 11:23:01 AM4-x5600-Linux kernel: BIOS-e820: [mem 0x00000000db0cd000-0x00000000dcbccfff] ACPI NVS
Aug 08 11:23:01 AM4-x5600-Linux kernel: BIOS-e820: [mem 0x00000000dcbcd000-0x00000000ddb56fff] reserved
Aug 08 11:23:01 AM4-x5600-Linux kernel: BIOS-e820: [mem 0x00000000ddb57000-0x00000000ddbfefff] type 20
Aug 08 11:23:01 AM4-x5600-Linux kernel: BIOS-e820: [mem 0x00000000ddbff000-0x00000000deffffff] usable
Aug 08 11:23:01 AM4-x5600-Linux kernel: BIOS-e820: [mem 0x00000000df000000-0x00000000dfffffff] reserved
Aug 08 11:23:01 AM4-x5600-Linux kernel: BIOS-e820: [mem 0x00000000f0000000-0x00000000f7ffffff] reserved
Aug 08 11:23:01 AM4-x5600-Linux kernel: BIOS-e820: [mem 0x00000000fd200000-0x00000000fd2fffff] reserved
Aug 08 11:23:01 AM4-x5600-Linux kernel: BIOS-e820: [mem 0x00000000fd400000-0x00000000fd5fffff] reserved

Hmm, I’m remembering another step I did around installing psensor… I’d followed some instructions suggesting that missing sensor devices could be found by running sensors-detect. When I did that, it guided me through a series of sequential hardware-detection queries, noted warnings on most of them, and I aborted those by answering no.

I’m just noticing now on this link that running sensors-detect isn’t always “safe”… potentially causing SMBus lockups and (in rare worst-case scenarios) hardware damage…

Warning
sensors-detect needs to access the hardware for most of the chip detections. By definition, it doesn’t know which chips are there before it manages to identify them. This means that it can access chips in a way these chips do not like, causing problems ranging from SMBus lockup to permanent hardware damage (a rare case, thankfully.)

I could be wrong, but I don’t think running sensors-detect was my gremlin, as most of its tests were aborted (answering No to the prompts to continue)… but I’ll be sure to avoid running it for the foreseeable future anyway.
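
For peace of mind, a couple of read-only checks (a sketch; the module names are just typical examples, and the conf.d file only exists if sensors-detect was allowed to save its answers):

$ cat /etc/conf.d/lm_sensors            # modules the lm_sensors service would load at boot, if the file exists
$ lsmod | grep -Ei 'k10temp|nct|it87'   # hardware-monitoring modules currently loaded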

Ultimately, I have uninstalled psensor and stopped using my shell scripts… after I stumbled on the ability to create “New Tabs” in KSysGuard, which let me lay out the sensor numbers I cared about as graphs from the list it already tracks.


The only thing I’m missing is GPU load, and that’s okay… as I can always launch my watchGPU.sh script when/if I feel I need it.
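
If I ever want a quick GPU-load number without radeontop, my understanding is that the amdgpu driver exposes it in sysfs (card0 is an assumption; it may be a different card index on other setups):

$ cat /sys/class/drm/card0/device/gpu_busy_percent   # instantaneous GPU utilisation in percent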

And a Software RAID monitoring tab was another nice addition from KSysGuard’s list of tracked sensors…
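
For anyone curious, the same array numbers can also be pulled manually (md0 is a placeholder for whatever the array device actually is):

$ cat /proc/mdstat                 # array state plus resync/scrub progress
$ sudo mdadm --detail /dev/md0     # per-member status (device name is an assumption)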

While the SN750 (/dev/nvme1n1) initially looked like it had survived the hard reboot (per sudo smartctl --all /dev/nvme1n1), it is no longer found by my system now… and there has been no reboot in between…

$ sudo smartctl --all /dev/nvme1n1
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.13.5-1-MANJARO] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

Smartctl open device: /dev/nvme1n1 failed: No such device

So it appears the drive may have been dying and finally packed it in at some point after the hard reboot… making the earlier cell pic more relevant, as it was pointing its finger at that drive. Maybe the reboot temporarily resurrected the drive?

I think it’s very possible that all the odd behavior was just symptoms and side-effects of a dying drive while the system was up and running and trying to compensate.

The SN750 was the drive I dedicated to my Home directory backups, and I found an entry that proves that while I went to bed the other night thinking the only symptom was a taskbar icon that would not move… the drive was likely already gone, as the daily 3am backup also failed…

After the software RAID scrub that’s currently running completes tomorrow, I’ll reboot and make sure even the BIOS no longer detects the drive… and if not, take it to my local store to test and bring home a replacement. I’m pretty sure I bought a 4-year in-store replacement warranty when I picked it up in Dec 2019… wow, not even 2 years old!

Looks like a boot failure after an incomplete update?
Try:

sudo mkinitcpio -P && sudo update-grub   # -P rebuilds the initramfs for all presets

Failed update: if you can reach a TTY with Ctrl+Alt+F2
then login as root and execute:

pacman -Syyu
update-grub #for safety only

If that fails:
Boot from an ISO boot stick and start gparted: it has a repair option.

Thank you for the reply, GaVenga!

The failed disk was definitely just a data disk (used by Back In Time)… but I think it stopped the reboot because it was in my /etc/fstab. It would likely be good to learn how to recover from this at the emergency-mode stage, because I think I got lucky when the hard reboot found the disk again (if only temporarily), which got me booted back up. Perhaps I could have booted my Manjaro live USB stick, accessed the drive, and edited the file?
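
For my own notes, this is roughly the recovery I think I could have attempted from the emergency-mode prompt instead of hard rebooting (a sketch, assuming the root filesystem itself was healthy):

mount -o remount,rw /     # root is often mounted read-only at that point
nano /etc/fstab           # or vi… comment out the line for the missing disk
systemctl daemon-reload   # let systemd re-read the changed fstab
systemctl default         # try to continue into the normal boot target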

I’ve edited /etc/fstab and commented out (#) the SN750 line… so I won’t have to face it not being found anymore… hopefully that was enough. I don’t want the system to complain or refuse to boot again because it can’t find that drive while I’m rebooting and/or installing the replacement NVMe.
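
For future reference, my understanding (not yet tested) is that a data disk can instead be marked so its absence doesn’t drop the boot into emergency mode, something along these lines using the SN750’s entry from my fstab above:

UUID=cf6b8b04-b6ae-4b54-a5e9-3dcb0b4595d5 /data/sn750    ext4    defaults,noatime,nofail,x-systemd.device-timeout=10s 0 2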

Good news.

That is correct. In GNOME I prefer to use gnome-disk-utility; it’s less dangerous than editing /etc/fstab by hand.

My RAID Array has stopped scrubbing, so I took a moment to reboot…

  1. I didn’t catch the whole message again, but I was able to capture that the “unmount failure” (which repeated on this reboot) pointed to /dev/user/1000… and I’m thinking this might be related to another post I found where TimeShift creating boot backups was deemed to be the culprit
  2. Once rebooted, I entered the BIOS and confirmed it also no longer sees the SN750 (nvme1n1)… so it’s looking like drive failure (fingers crossed the motherboard M.2 connector is still happy); for completeness, see the OS-side checks after this list
  3. Boot into Manjaro was nice and clean… so my /etc/fstab edit commenting out nvme1n1 was good.
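
For completeness, the OS-side checks I’d pair with the BIOS check (assuming the nvme-cli package is installed):

$ lsblk -o NAME,MODEL,SIZE   # the SN750 should simply not appear here
$ sudo nvme list             # nvme-cli's view of attached NVMe drives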

I’m too tired (I stayed up way too long last night troubleshooting/investigating) and don’t trust myself to open my case and take out the NVMe today, so I’m saving that for tomorrow when I’m (hopefully) more cognizant and steadier-handed.

I’m also holding off on installing the Manjaro 21.1 update (96 updates in total for me) that’s available today until after I’ve installed the replacement NVMe, taken a Home folder backup, and made a Clonezilla image of my boot NVMe.

installed the replacement nvme, taken a Home folder backup,

Now it is time to separate root and home, if possible on different disks?!
(On a new install, “/home” is untouched.)

I’ve been slowly coming around to that idea… not sure it’ll be a dedicated disk right away, but likely a dedicated partition as a first step.
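
Roughly what I picture that first step looking like, purely as a sketch (the target partition and UUID are placeholders, and this assumes copying from a live session while /home is not in use):

sudo rsync -aAXH /home/ /mnt/newhome/   # copy existing home data to the new partition, temporarily mounted at /mnt/newhome

…followed by an /etc/fstab line along these lines:

UUID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx /home    ext4    defaults,noatime 0 2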

Separate partition is good, separate drive is better (not possible on some laptops).

Okay, this was unexpected… I opened my case to remove the SN750 and discovered a “wet” condensate/film on both the NVMe and the “rubbery” heat-spreader contact material, which I removed. Wanting to confirm the drive was still bad before I took it to the store, I plugged it back in and booted into the BIOS, and sure enough it showed up. Booted into Manjaro, and there it was again.

I think I’m going to have to postpone taking it in until it fails again. Not jumping straight into trusting this drive, but I feel I got lucky in a way, and smartctl found nothing out of sorts…

$ sudo smartctl --all /dev/nvme1n1
[sudo] password for disfeld: 
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.13.5-1-MANJARO] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       WDS100T3X0C-00SJG0
Serial Number:                      192725805856
Firmware Version:                   102000WD
PCI Vendor/Subsystem ID:            0x15b7
IEEE OUI Identifier:                0x001b44
Total NVM Capacity:                 1,000,204,886,016 [1.00 TB]
Unallocated NVM Capacity:           0
Controller ID:                      8215
NVMe Version:                       1.3
Number of Namespaces:               1
Namespace 1 Size/Capacity:          1,000,204,886,016 [1.00 TB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            001b44 8b4472fa2a
Local Time is:                      Tue Aug 10 08:36:17 2021 CDT
Firmware Updates (0x14):            2 Slots, no Reset required
Optional Admin Commands (0x0017):   Security Format Frmw_DL Self_Test
Optional NVM Commands (0x001f):     Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat
Log Page Attributes (0x02):         Cmd_Eff_Lg
Maximum Data Transfer Size:         128 Pages
Warning  Comp. Temp. Threshold:     80 Celsius
Critical Comp. Temp. Threshold:     85 Celsius
Namespace 1 Features (0x02):        NA_Fields

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     6.00W       -        -    0  0  0  0        0       0
 1 +     3.50W       -        -    1  1  1  1        0       0
 2 +     3.00W       -        -    2  2  2  2        0       0
 3 -   0.1000W       -        -    3  3  3  3     4000   10000
 4 -   0.0025W       -        -    4  4  4  4     4000   45000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         2
 1 -    4096       0         1

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        36 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    0%
Data Units Read:                    42,643,876 [21.8 TB]
Data Units Written:                 32,669,418 [16.7 TB]
Host Read Commands:                 331,585,134
Host Write Commands:                524,770,081
Controller Busy Time:               1,112
Power Cycles:                       53
Power On Hours:                     14,444
Unsafe Shutdowns:                   13
Media and Data Integrity Errors:    0
Error Information Log Entries:      1,440
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0

Error Information (NVMe Log 0x01, 16 of 256 entries)
No Errors Logged

What other commands or tests could I use to try to validate that this drive is 100% operational/functional? Other than the smartctl check, I have mounted the drive to my /data/sn750 mount point and executed a Back In Time backup without error.
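
So far these are the candidates I’ve come up with to exercise it further (a sketch; the p1 partition name and the nvme-cli package are assumptions on my part, and the self-test depends on the drive supporting it):

$ sudo smartctl -t long /dev/nvme1n1    # drive's own extended self-test; results show up in smartctl -a afterwards
$ sudo nvme smart-log /dev/nvme1n1      # NVMe-native health counters
$ sudo badblocks -sv /dev/nvme1n1p1     # non-destructive read-only surface scan (slow on 1 TB)
$ sudo fsck.ext4 -fn /dev/nvme1n1p1     # forced, read-only filesystem check, run while unmounted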

I’m also thinking that, if this wetness is a side effect of using them, I should remove all these heat spreaders that came with the motherboard and simply screw the drives in place without them.

Oh, and when I disabled the boot option in TimeShift, the shutdown (before removing the NVMe) threw no more “unmount” errors after the disk check.

Water is the natural enemy of electricity…

Just got off the phone with my local store… they suggested the “wetness” I experienced is the normal oil in/on the thermal pad that helps transfer heat from the drive… and that it should not deter me from using the heatsinks.

In their opinion, it would have been odd for it to stop the drive from functioning, as this oil is non-conductive. They feel it’s more likely there was something not quite 100% with the drive’s connection and that my re-seating it was the fix.

I’ll be keeping my eye on it.

It may be condensation of water. Thermal pads: I don’t like them; they’re ineffective.
Prefer thermal paste if possible.

2-Propanol or ethanol to clean the contacts is helpful too.
The NVMe M.2 connectors are sensitive during installation; the screw is crucial…
