Since around September last year, back when I was still running Kubuntu, I have been experiencing frequent OS crashes that I wasn’t able to pin down. I filled several pages over at the Kubuntu Forum and spent hours with a Dell support guy but never got to the bottom of the issue.
Part of my quest to get this solved was moving to Manjaro, and I stuck with it because I liked what I saw and it seemed a lot more resilient to this issue, with a crash maybe once a week. I only recently stumbled over the likely root cause; this is exactly what I see happening:
- https://askubuntu.com/questions/905710/ext4-fs-error-after-ubuntu-17-04-upgrade#comment1422199_905710
- https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1678184
- https://wiki.archlinux.org/index.php/Talk:Solid_state_drive/NVMe
Once I found this I started to play around with the nvme_core.default_ps_max_latency_us kernel parameter, but if anything things are worse than before, no matter whether I set it to 0, 220, 5500 or leave it out.
Actually, today was the first time I was able to reliably reproduce the error (at least for a while) by running a backup to a second internal drive. The OS reliably (sic) crashed, no matter what I set in GRUB_CMDLINE_LINUX_DEFAULT.
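A minimal sketch of one way to switch the value between test runs, assuming GNU sed and a defaults file that contains a single double-quoted GRUB_CMDLINE_LINUX_DEFAULT line. The function name set_aps_latency is made up for illustration and is not part of any tool:

```shell
#!/bin/sh
# Hypothetical helper: add or replace nvme_core.default_ps_max_latency_us
# inside the GRUB_CMDLINE_LINUX_DEFAULT line of a GRUB defaults file.
# Assumes GNU sed (-i, -E) and a double-quoted value on that line.
set_aps_latency() {
    file="$1"    # path to the defaults file, e.g. /etc/default/grub
    value="$2"   # latency in microseconds, 0 disables APST
    param="nvme_core.default_ps_max_latency_us"
    # First expression drops an existing occurrence of the parameter,
    # second expression appends the new value before the closing quote.
    sed -i -E \
        -e "/^GRUB_CMDLINE_LINUX_DEFAULT=/ s/ ?${param}=[0-9]+//" \
        -e "/^GRUB_CMDLINE_LINUX_DEFAULT=/ s/\"\$/ ${param}=${value}\"/" \
        "$file"
}
```

Calling it as root with /etc/default/grub and 0 as arguments would set the parameter to 0. The GRUB configuration still has to be regenerated afterwards (on Manjaro that should be sudo update-grub), followed by a reboot.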
I can see the changes are picked up by running:
sudo nvme get-feature -f 0x0c -H /dev/nvme0
get-feature:0xc (Autonomous Power State Transition), Current value:0x000001
Autonomous Power State Transition Enable (APSTE): Enabled
Auto PST Entries …
Entry[ 0]
…
Idle Time Prior to Transition (ITPT): 86 ms
Idle Transition Power State (ITPS): 3
…
This reports Enabled when a value >0 or nothing is set. When I set nvme_core.default_ps_max_latency_us=0, it switches to Disabled.
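As a cross-check alongside the nvme get-feature output, the booted value can also be read from the kernel side. The following is a rough sketch: the function name aps_latency_from_cmdline is hypothetical, and it assumes the nvme_core module exposes the parameter under /sys/module, which should be the case when the driver is loaded but may differ per kernel build:

```shell
#!/bin/sh
# Hypothetical helper: print the value of nvme_core.default_ps_max_latency_us
# found in a kernel command line string (prints nothing if it is absent).
aps_latency_from_cmdline() {
    printf '%s\n' "$1" | tr ' ' '\n' \
        | sed -n 's/^nvme_core\.default_ps_max_latency_us=//p' | tail -n 1
}

# What the bootloader actually passed to the running kernel
aps_latency_from_cmdline "$(cat /proc/cmdline 2>/dev/null)"

# What the loaded nvme_core module reports (assumption: parameter is exposed)
cat /sys/module/nvme_core/parameters/default_ps_max_latency_us 2>/dev/null \
    || echo "nvme_core parameter not exposed on this system"
```

If the first command prints nothing while GRUB_CMDLINE_LINUX_DEFAULT does contain the parameter, the GRUB configuration was most likely not regenerated before the reboot.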
The kernel doesn’t seem to matter: I tried 4.20 (my default), 4.19 and 4.14 today and got crashes with all of them.
I had probably 20 crashes today, most of them forced via the backup, but now things have been stable again for 2 hours, including running a backup!?
The unpredictability of this bug is driving me mad, and I have a few presentations coming up where I can’t have a crashing system. I think I will open a case with Dell to get this nasty NVMe drive replaced, but before I do so I wanted to quickly check whether anyone here has any other ideas of what to check.
And because I know you will ask, here’s my inxi:
inxi -D
Drives: Local Storage: total: 2.29 TiB used: 1.04 TiB (45.6%)
ID-1: /dev/nvme0n1 vendor: Samsung model: SM961 NVMe 512GB size: 476.94 GiB
ID-2: /dev/sda vendor: Seagate model: ST2000LM015-2E8174 size: 1.82 TiB
inxi -b
System: Host: hermes Kernel: 4.20.3-1-MANJARO x86_64 bits: 64 Desktop: KDE Plasma 5.14.5 Distro: Manjaro Linux
Machine: Type: Laptop System: Dell product: Precision 7510 v: N/A serial: <root required>
Mobo: Dell model: 0M1YNP v: A00 serial: <root required> UEFI: Dell v: 1.16.3 date: 09/12/2018
Battery: ID-1: BAT0 charge: 41.8 Wh condition: 66.5/91.0 Wh (73%)
CPU: Quad Core: Intel Core i7-6920HQ type: MT MCP speed: 825 MHz min/max: 800/3800 MHz
Graphics: Device-1: Intel HD Graphics 530 driver: i915 v: kernel
Device-2: NVIDIA GM107GLM [Quadro M1000M] driver: nouveau v: kernel
Display: x11 server: X.Org 1.20.3 driver: intel,nouveau unloaded: modesetting resolution: 1920x1080~60Hz
OpenGL: renderer: Mesa DRI Intel HD Graphics 530 (Skylake GT2) v: 4.5 Mesa 18.3.2
Network: Device-1: Intel Ethernet I219-LM driver: e1000e
Device-2: Intel Wireless 8260 driver: iwlwifi
Drives: Local Storage: total: 2.29 TiB used: 1.04 TiB (45.6%)
Info: Processes: 319 Uptime: 1h 32m Memory: 15.56 GiB used: 2.92 GiB (18.8%) Shell: bash inxi: 3.0.30