Once more an NVMe-related issue, I think

nvme

#21

Sorry, I was away and had no time.
Yes, it’s more or less that.
Just grab the Manjaro PKGBUILD and add the patch, then run makepkg.

I’m currently creating the patchfile, however I still need the exact DMI data, more specifically the product name of the laptop (presumably “Precision 7510”) and the product ID of the SSD (see output of lspci).


#22

Cool!
Here’s what I believe is the relevant data for DMI

# dmidecode 3.2
Getting SMBIOS data from sysfs.
SMBIOS 3.0.0 present.
Table at 0x000E0040.

Handle 0x0000, DMI type 0, 24 bytes
BIOS Information
	Vendor: Dell Inc.
	Version: 1.16.3
	Release Date: 09/12/2018
	Address: 0xF0000
	Runtime Size: 64 kB
	ROM Size: 16 MB
	Characteristics:
		PCI is supported
		PNP is supported
		BIOS is upgradeable
		BIOS shadowing is allowed
		Boot from CD is supported
		Selectable boot is supported
		EDD is supported
		5.25"/1.2 MB floppy services are supported (int 13h)
		3.5"/720 kB floppy services are supported (int 13h)
		3.5"/2.88 MB floppy services are supported (int 13h)
		Print screen service is supported (int 5h)
		8042 keyboard services are supported (int 9h)
		Serial services are supported (int 14h)
		Printer services are supported (int 17h)
		ACPI is supported
		USB legacy is supported
		Smart battery is supported
		BIOS boot specification is supported
		Function key-initiated network boot is supported
		Targeted content distribution is supported
		UEFI is supported
	BIOS Revision: 1.16

Handle 0x0001, DMI type 1, 27 bytes
System Information
	Manufacturer: Dell Inc.
	Product Name: Precision 7510
	Version: Not Specified
	Serial Number: D4L7YF2
	UUID: 4c4c4544-0034-4c10-8037-c4c04f594632
	Wake-up Type: Power Switch
	SKU Number: 06D9
	Family: Precision
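The same strings can also be read without dmidecode, because the kernel exposes them under /sys/class/dmi/id. Below is a minimal sketch, assuming a Linux system; the helper name read_dmi_fields and the choice of four fields are illustrative. sys_vendor and product_name mirror the System Information block above, while board_vendor and board_name come from the baseboard block, which is not shown here.

```python
from pathlib import Path

# sysfs file name -> DMI identifier used in kernel code
DMI_FIELDS = {
    "sys_vendor": "DMI_SYS_VENDOR",
    "product_name": "DMI_PRODUCT_NAME",
    "board_vendor": "DMI_BOARD_VENDOR",
    "board_name": "DMI_BOARD_NAME",
}


def read_dmi_fields(base="/sys/class/dmi/id"):
    """Return {kernel identifier: value} for every field that is readable."""
    result = {}
    for filename, identifier in DMI_FIELDS.items():
        try:
            result[identifier] = (Path(base) / filename).read_text().strip()
        except OSError:
            # Field missing or not readable on this machine: skip it.
            continue
    return result


for identifier, value in read_dmi_fields().items():
    print(f"{identifier}: {value}")
```

The base argument is only there so the helper can be pointed at a copy of the directory, for example one taken from another machine.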

And the output from lspci:

thomas@hermes:~$ lspci
00:00.0 Host bridge: Intel Corporation Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor Host Bridge/DRAM Registers (rev 07)
00:01.0 PCI bridge: Intel Corporation Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor PCIe Controller (x16) (rev 07)
00:02.0 VGA compatible controller: Intel Corporation HD Graphics 530 (rev 06)
00:04.0 Signal processing controller: Intel Corporation Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor Thermal Subsystem (rev 07)
00:14.0 USB controller: Intel Corporation 100 Series/C230 Series Chipset Family USB 3.0 xHCI Controller (rev 31)
00:14.2 Signal processing controller: Intel Corporation 100 Series/C230 Series Chipset Family Thermal Subsystem (rev 31)
00:16.0 Communication controller: Intel Corporation 100 Series/C230 Series Chipset Family MEI Controller #1 (rev 31)
00:17.0 SATA controller: Intel Corporation Q170/Q150/B150/H170/H110/Z170/CM236 Chipset SATA Controller [AHCI Mode] (rev 31)
00:1c.0 PCI bridge: Intel Corporation 100 Series/C230 Series Chipset Family PCI Express Root Port #2 (rev f1)
00:1c.2 PCI bridge: Intel Corporation 100 Series/C230 Series Chipset Family PCI Express Root Port #3 (rev f1)
00:1c.4 PCI bridge: Intel Corporation 100 Series/C230 Series Chipset Family PCI Express Root Port #5 (rev f1)
00:1d.0 PCI bridge: Intel Corporation 100 Series/C230 Series Chipset Family PCI Express Root Port #9 (rev f1)
00:1f.0 ISA bridge: Intel Corporation CM236 Chipset LPC/eSPI Controller (rev 31)
00:1f.2 Memory controller: Intel Corporation 100 Series/C230 Series Chipset Family Power Management Controller (rev 31)
00:1f.3 Audio device: Intel Corporation 100 Series/C230 Series Chipset Family HD Audio Controller (rev 31)
00:1f.4 SMBus: Intel Corporation 100 Series/C230 Series Chipset Family SMBus (rev 31)
00:1f.6 Ethernet controller: Intel Corporation Ethernet Connection (2) I219-LM (rev 31)
01:00.0 VGA compatible controller: NVIDIA Corporation GM107GLM [Quadro M1000M] (rev a2)
02:00.0 Network controller: Intel Corporation Wireless 8260 (rev 3a)
03:00.0 Unassigned class [ff00]: Realtek Semiconductor Co., Ltd. RTS525A PCI Express Card Reader (rev 01)
3d:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller SM961/PM961 (rev ff)

I am currently booting Manjaro from a USB-attached SSD, with the troublesome NVMe drive disabled. To get the lspci output I enabled the NVMe drive and was prompted for my LUKS password right after logging in, but when I checked in Dolphin there was no drive. Here’s what dmesg has to say:

[Mo Feb 11 13:45:35 2019] nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff
[Mo Feb 11 13:45:35 2019] nvme 0000:3d:00.0: Refused to change power state, currently in D3
[Mo Feb 11 13:45:35 2019] nvme nvme0: Removing after probe failure status: -19
[Mo Feb 11 13:45:35 2019] print_req_error: I/O error, dev nvme0n1, sector 0
[Mo Feb 11 13:45:35 2019] print_req_error: I/O error, dev nvme0n1, sector 1050626
[Mo Feb 11 13:45:35 2019] EXT4-fs (nvme0n1p2): unable to read superblock
[Mo Feb 11 13:45:35 2019] Buffer I/O error on dev nvme0n1p2, logical block 0, async page read
[Mo Feb 11 13:45:35 2019] Buffer I/O error on dev nvme0n1p2, logical block 1, async page read
[Mo Feb 11 13:45:35 2019] Buffer I/O error on dev nvme0n1p2, logical block 2, async page read
[Mo Feb 11 13:45:35 2019] Buffer I/O error on dev nvme0n1p2, logical block 3, async page read
[Mo Feb 11 13:45:35 2019] Buffer I/O error on dev nvme0n1p2, logical block 0, async page read
[Mo Feb 11 13:45:35 2019] Buffer I/O error on dev nvme0n1p2, logical block 1, async page read
[Mo Feb 11 13:45:35 2019] Buffer I/O error on dev nvme0n1p2, logical block 2, async page read
[Mo Feb 11 13:45:35 2019] Buffer I/O error on dev nvme0n1p2, logical block 3, async page read
[Mo Feb 11 13:45:35 2019] nvme nvme0: failed to set APST feature (-19)

It ran long enough for the LUKS partition to be discovered and for Manjaro to ask for a password, but then it died again. Clearly something is wrong with the power states. Can you read anything from this? D3 seems to be a power-saving state.
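The register values in that log can be read mechanically. A register that reads back as all ones (CSTS=0xffffffff, PCI_STATUS=0xffff) usually means the device did not answer on the bus at all, and status -19 is the negated errno ENODEV. Here is a minimal sketch; the helper names and the regular expression are illustrative and only cover the message format shown above.

```python
import errno
import re

# Matches e.g. "CSTS=0xffffffff, PCI_STATUS=0xffff" in the nvme reset message.
REG_RE = re.compile(r"CSTS=0x([0-9a-fA-F]+), PCI_STATUS=0x([0-9a-fA-F]+)")


def device_unreachable(line):
    """True if both registers read back as all ones (32-bit CSTS, 16-bit PCI_STATUS)."""
    m = REG_RE.search(line)
    if m is None:
        return False
    csts, pci_status = (int(g, 16) for g in m.groups())
    return csts == 0xFFFFFFFF and pci_status == 0xFFFF


def errno_name(status):
    """Map a negative kernel status such as -19 to its errno name."""
    return errno.errorcode.get(-status, "unknown")


line = "nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff"
print(device_unreachable(line))  # True
print(errno_name(-19))           # ENODEV
```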

As far as the kernel compilation is concerned, is the sequence below correct?

  1. git clone https://gitlab.manjaro.org/packages/core/linux419
  2. cd linux419
  3. update the PKGBUILD (add the patch)
  4. makepkg
  5. cp the new kernel over the original one
  6. reboot

Anything else? Update-grub or similar?

#23

Oops sorry, I need lspci -nn | grep NVMe to get the IDs.
I’ll get back to your questions in a few minutes.

Yes, that definitely looks like an issue with APST power states.
It stays stuck in a deep (possibly the deepest) powersaving state and cannot be awoken.
Therefore, using the “no deepest power state” quirk could work (or we disable APST altogether for that device).

As for the kernel:
Yes, that’s more or less the procedure.

Keep in mind that the existing Manjaro kernel is overwritten unless you change the name in the PKGBUILD and all related preset/install files.
You don’t need to copy anything; just install the built package with pacman -U. pacman takes care of the initramfs and GRUB.
And make sure to have your backups ready or at least a second kernel installed.
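The clone and build steps can be written down as a small dry-run script. This is only a sketch: the -s flag (install missing build dependencies) and the optional clean flags are additions not mentioned so far, and adding the patch to the PKGBUILD remains a manual step between the two commands, which is why the script only prints by default.

```python
import shlex
import subprocess

REPO = "https://gitlab.manjaro.org/packages/core/linux419"


def build_steps(clean=False):
    """Return the commands as (argv, working directory) pairs."""
    makepkg = ["makepkg", "-s"]  # -s: install missing build dependencies
    if clean:
        makepkg.append("-Cc")    # -C: remove old src dir first, -c: clean up afterwards
    return [
        (["git", "clone", REPO], "."),
        (makepkg, "linux419"),
    ]


def run(steps, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for argv, cwd in steps:
        print(f"(in {cwd}) $ {shlex.join(argv)}")
        if not dry_run:
            subprocess.run(argv, cwd=cwd, check=True)


run(build_steps())  # dry run: only prints the commands
```

The resulting package files are then installed with pacman -U as described above; their exact names depend on the kernel version being built.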


#24

3d:00.0 Non-Volatile memory controller [0108]: Samsung Electronics Co Ltd NVMe SSD Controller SM961/PM961 [144d:a804]
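In that line, [0108] is the PCI class code and the trailing [144d:a804] is the vendor:device pair a driver would compare against. Here is a minimal sketch of pulling the pair out of such a line; the helper name pci_ids is illustrative.

```python
import re

# "[0108]" is the class code; the last "[vvvv:dddd]" pair is vendor:device.
ID_RE = re.compile(r"\[([0-9a-fA-F]{4}):([0-9a-fA-F]{4})\]")


def pci_ids(lspci_nn_line):
    """Return (vendor, device) as integers, or None if no ID pair is present."""
    matches = ID_RE.findall(lspci_nn_line)
    if not matches:
        return None
    vendor, device = matches[-1]
    return int(vendor, 16), int(device, 16)


line = ("3d:00.0 Non-Volatile memory controller [0108]: Samsung Electronics "
        "Co Ltd NVMe SSD Controller SM961/PM961 [144d:a804]")
vendor, device = pci_ids(line)
print(f"vendor={vendor:#06x} device={device:#06x}")  # vendor=0x144d device=0xa804
```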

I have also just been in contact with Dell. They don’t have more up-to-date firmware but will send me a replacement drive, either Intel or Hynix!
If this works out as planned I should be fine with my default kernel…hoping that I don’t fall into another trap with Intel/Hynix :slight_smile:

Backups are ready! Need them for new drive anyway :slight_smile:
Boy, I hope this is it. A solution after 6 months of troubleshooting, configuration changes, distro hopping and lots of OS and application installs. The good thing is I learned a lot and I ended up with Manjaro :slight_smile:


#25

Here is the patch disabling the deepest powersaving state.
If it doesn’t work, replace NVME_QUIRK_NO_DEEPEST_PS with NVME_QUIRK_NO_APST.

--- a/drivers/nvme/host/pci.c	2019-02-11 09:53:37.540839478 +0100
+++ b/drivers/nvme/host/pci.c	2019-02-11 15:21:18.556192169 +0100
@@ -2440,8 +2440,16 @@
 		if (dmi_match(DMI_BOARD_VENDOR, "ASUSTeK COMPUTER INC.") &&
 		    (dmi_match(DMI_BOARD_NAME, "PRIME B350M-A") ||
 		     dmi_match(DMI_BOARD_NAME, "PRIME Z370-A")))
 			return NVME_QUIRK_NO_APST;
+	} else if (pdev->vendor == 0x144d && pdev->device == 0xa804) {
+		/*
+		 * Samsung SM961 NVMe probably has an APST-related problem
+		 * on Dell Precision 7510, so let's disable deepest ps.
+		 */
+		if (dmi_match(DMI_BOARD_VENDOR, "Dell Inc.") &&
+		     (dmi_match(DMI_BOARD_NAME, "Precision 7510")))
+			return NVME_QUIRK_NO_DEEPEST_PS;
 	}
 
 	return 0;
 }

Patch applies cleanly on 4.19, but I currently cannot build as I’m on a heavily restricted VM.
I will add it to my custom kernel if it works.


Kernel 4.19 LTS with PDS scheduler
#26

One last question if I may?
Given that I am currently running from a USB drive, if I update the kernel I will obviously update the USB image and not the NVMe.
Will this modified kernel setting be applied to all drives or just to the one I booted from?
If it’s all-drives I can happily make the change on my USB image and then try to access the NVMe.

If the kernel parameter is only applied to the boot drive I somehow need to shuffle the updated kernel over to the NVMe and boot from there.


#27

It will apply to all drives that match the if clause, i.e. every Samsung SM961 running in a Dell Precision 7510. Well, at least it should… :slight_smile:
So if you run from the USB drive that is connected to the Dell laptop, it should work (the SSD obviously has to be installed).
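To make that concrete, here is a small Python model of the added branch (not kernel code; the function name and dictionary layout are made up for illustration). The decision depends only on the controller's PCI IDs and the machine's DMI strings, never on which drive was booted. One thing worth checking: the patch as posted matches DMI_BOARD_VENDOR and DMI_BOARD_NAME, whereas the dmidecode output earlier lists "Precision 7510" under System Information, which kernel code refers to as DMI_PRODUCT_NAME. The board strings of this laptop are not shown in the thread, so the model takes the field names as parameters.

```python
def quirk_for(vendor, device, dmi,
              vendor_key="DMI_BOARD_VENDOR", name_key="DMI_BOARD_NAME"):
    """Model of the patched check: exact string matches on two DMI fields,
    evaluated only for the Samsung controller with ID 144d:a804."""
    if vendor == 0x144D and device == 0xA804:
        if dmi.get(vendor_key) == "Dell Inc." and dmi.get(name_key) == "Precision 7510":
            return "NVME_QUIRK_NO_DEEPEST_PS"
    return None


# Values taken from the dmidecode "System Information" block in this thread.
system_info = {"DMI_SYS_VENDOR": "Dell Inc.", "DMI_PRODUCT_NAME": "Precision 7510"}

# Matching on the system fields selects the quirk for the Samsung controller.
print(quirk_for(0x144D, 0xA804, system_info, "DMI_SYS_VENDOR", "DMI_PRODUCT_NAME"))
# Any other controller is unaffected (0x1234:0x5678 is a made-up ID).
print(quirk_for(0x1234, 0x5678, system_info, "DMI_SYS_VENDOR", "DMI_PRODUCT_NAME"))
# With the board fields nothing matches here, because they are unknown.
print(quirk_for(0x144D, 0xA804, system_info))
```

If the board fields on the real machine turn out to hold different strings, switching the two dmi_match() calls to DMI_SYS_VENDOR and DMI_PRODUCT_NAME would be the natural variant to try.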


#28

Compiler stops here

	if (dmi_match(DMI_BOARD_VENDOR, "Dell Inc.") &&
		     (dmi_match(DMI_BOARD_NAME, "Precision 7510"))

I think I need 3 closing brackets, don’t I? I added a 3rd one but am not sure how to start the compile run again. When I rerun makepkg I get lots of these:

The next patch would create the file tools/testing/selftests/netfilter/Makefile,
which already exists!  Assume -R? [n] 
Apply anyway? [n] 

which I just accept, but then it stops again here:

patching file virt/kvm/arm/vgic/vgic-mmio.c
patching file virt/kvm/arm/vgic/vgic.c
==> ERROR: A failure occurred in prepare().
    Aborting...

I had a brief look at man makepkg but didn’t find anything obvious which would continue on where it stopped before…


#29

Thanks. I’m looking into it. You’re probably right with the missing parenthesis.
EDIT: yes you’re right! I added it in the post above.

If you restart the build, you will probably need makepkg -Cc: -C removes the previous temporary build folder before building (so the patches are not applied a second time), and -c cleans up after a successful build.


#30

My new NVMe arrived a lot quicker than expected. One could say that Dell delivered a new drive faster than I could compile a Linux kernel :slight_smile:

I have got a Toshiba drive now so I guess updating the custom kernel is kind of pointless?
I can run through the compilation to confirm that the code compiles but I won’t be able to test if it solves the Samsung NVMe issue any longer.

Sorry if I caused you unnecessary hassle, but at least we have a possible solution in case someone else falls into this trap.


#31

Well the good thing is that you have a working SSD now.
There wasn’t any hassle for me, I used it as a learning experience regarding the kernel code structure.

It does compile; I built it yesterday (with the missing parenthesis added).

It’s a bit unfortunate though that we can’t tell whether that patch actually helps with the problem. But maybe someone else with the same hardware will pop up.