Samsung NVMe randomly goes down during activity

Check nvme --help instead of smartctl.

Example:

$ sudo nvme smart-log /dev/nvme0
$ sudo nvme error-log /dev/nvme0
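
If the full smart-log output is too noisy, the health-related lines can be filtered out. This is only a minimal sketch: the helper name smart_summary is made up for illustration, and the field names are assumed to match what your nvme-cli version prints (they do match the power_on_hours style seen in this thread).

```shell
#!/bin/sh
# Hypothetical helper: keep only a few health-related lines from
# `nvme smart-log` output read on stdin. The field names are an
# assumption; adjust the pattern if your nvme-cli prints them differently.
smart_summary() {
    grep -E '^(critical_warning|temperature|percentage_used|media_errors|num_err_log_entries|power_on_hours)[[:space:]]*:'
}

# Usage on a real system (needs root and nvme-cli installed):
#   sudo nvme smart-log /dev/nvme0 | smart_summary
```

The function only reads stdin, so it can be tried on saved output without touching the drive.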

OK, the Samsung debug logs are needed, and I have no idea how to implement them.

power_on_hours				: 107671072856885234789626228506624

Now… I'm pretty sure I haven't even been alive for that long. Divided by 24, that comes out to 4.4862947e+30 days, at least according to Google's calculator. Right or not, I'm pretty sure I haven't been alive that long.

.................
 Entry[ 0]   
.................
error_count	: 173
sqid		: 0
cmdid		: 0x701d
status_field	: 0x2002(Invalid Field in Command: A reserved coded value or an unsupported value in a defined field)
phase_tag	: 0
parm_err_loc	: 0xffff
lba		: 0
nsid		: 0
vs		: 0
trtype		: The transport type is not indicated or the error is not transport related.
cs		: 0
trtype_spec_info: 0
.................

The rest are as boring as can be

  • smartctl
  • nvme

Maybe nvme has a bug, but I have no problem with it. What NVMe device are you using?

I think you can ignore this error_count. My error_count is higher than yours, but I have no issue with my NVMe device.

.................
 Entry[ 0]   
.................
error_count     : 3896
sqid            : 0
cmdid           : 0xc
status_field    : 0x2002(Invalid Field in Command: A reserved coded value or an unsupported value in a defined field)
phase_tag       : 0
parm_err_loc    : 0xffff
lba             : 0
nsid            : 0
vs              : 0
trtype          : The transport type is not indicated or the error is not transport related.
cs              : 0
trtype_spec_info: 0
.................

I guess this error is the problem, and that it relates to the power-saving feature.

I found this here:

Linux-Kernel Archive: Re: [BUG][5.18rc5] nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10

1. Turn off APST (nvme_core.default_ps_max_latency_us=0)
2. Turn off ASPM (pcie_aspm=off)
3. Turn off both

So add

nvme_core.default_ps_max_latency_us=0 pcie_aspm=off

to the kernel parameters.

One of them or both should solve the issue.
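
After the reboot, it is worth checking that the parameters actually reached the kernel. A rough sketch, assuming a Linux system that exposes /proc/cmdline; the helper name has_param is made up for illustration:

```shell
#!/bin/sh
# Hypothetical helper: succeed if the kernel command line in $1 contains
# the exact parameter in $2 as a whole, space-separated word.
has_param() {
    case " $1 " in
        *" $2 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Compare against the command line of the running kernel, if readable.
if [ -r /proc/cmdline ]; then
    cmdline=$(cat /proc/cmdline)
    for p in nvme_core.default_ps_max_latency_us=0 pcie_aspm=off; do
        if has_param "$cmdline" "$p"; then
            echo "active:  $p"
        else
            echo "missing: $p"
        fi
    done
fi
```

The APST latency limit currently in effect can usually also be read from /sys/module/nvme_core/parameters/default_ps_max_latency_us.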


I'm assuming sudo will do, or is that su root?

Am I going to use sudo,
or, because it's kernel related, will I need to switch user (su) to root?

That’s the same thing.

Okay, I always thought sudo stood for switch user domain,
not user account.

No, @megavolt means you should try adding these kernel parameters:

If you use GRUB

  1. Edit /etc/default/grub to add these kernel parameters in GRUB_CMDLINE_LINUX_DEFAULT=
nvme_core.default_ps_max_latency_us=0 pcie_aspm=off
  2. Run sudo update-grub

  3. Reboot


I see a grub.d, and something is telling me to wait for a reply, because this seems like something that needs VERY clear instructions.

I literally went to /etc/ and I see a grub.d folder. Now… this looks like something that can screw up my system badly unless I follow and understand correctly what you are asking me to do.

If I haven't been clear: please be VERY clear with your instructions. Step by step.

I am sorry… it is not really clear where you are now.

  1. live session with chroot
  2. booted local installation

?


File system, then etc, and then I'm staring at a whole lot of folders that look and are named in ways that make even the dummy in me want to be careful.

ok well… dummy instruction:

  1. Boot a Manjaro installation disk (hope I don’t have to explain that)
  2. Once booted, first repair the filesystem (it must not be mounted):
sudo fsck.ext4 -f -y /dev/nvme1n1p2
  3. Then chroot into it with the helper script:
sudo manjaro-chroot -a
  4. Now do what @Zesko wrote:

Open the file with nano:

nano /etc/default/grub

Edit it accordingly and save it and close it.

Then update grub.

Exit the session.
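
Since fsck must not run on a mounted filesystem, a quick check beforehand can help. This is a sketch under assumptions: it relies on /proc/mounts listing the partition under the same device name, and the helper name is_mounted is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: succeed if device $1 appears as a mount source in
# the mounts table $2 (defaults to /proc/mounts).
is_mounted() {
    awk -v dev="$1" '$1 == dev { found = 1 } END { exit !found }' "${2:-/proc/mounts}"
}

# /dev/nvme1n1p2 is the partition named in this thread; substitute your own.
part=/dev/nvme1n1p2
if [ -r /proc/mounts ]; then
    if is_mounted "$part"; then
        echo "$part is mounted; unmount it before running fsck"
    else
        echo "$part is not mounted"
    fi
fi
```

From a live session the installed system's partitions are normally not mounted until you chroot, which is why the fsck step comes first.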

Whenever I bring out my USB stick to fix my problems, it has always been without a repair, because I could never figure out how to get to the repair, let alone run it, without doing a full system reinstall. So can you elaborate on that?

I am sorry, but what exactly is not understandable? Give me something to deal with. I will not explain pretty basic stuff. Maybe someone else can do that.

You already did a repair, as I see above… so what is the question?

That repair above was not done through a boot stick; it was straight from the terminal, and not from GRUB either… So now you have me more confused. Do I need a boot stick or not?

The drive we are working on, the Samsung, does not have Manjaro or any OS on it; the WD has Manjaro on it.

OK… as I see, you ran that:

That is:

and I really hope you unmounted it before checking it. That works that way… But:

Is the system partition (root directory). That means it can only be checked from outside, i.e. from a live session; otherwise it is only checked at boot time, which solves small problems. Huge problems have to be repaired from outside the local installation.

I mentioned this because of the journal log and to ensure that the filesystem is not broken.

1n1p2 is not the problem, sooo why are we checking its filesystem?

my head hurts… I need a break, you probably do too, before you reach across the internet and wrap your fingers around my neck.

Then skip that step if you know more… It is just a safety measure. That’s all.

Then please do so and take a break. No hurry… Go for a walk and clear your mind. Sometimes things can be overwhelming. It is understandable.

then check the one that needs to be checked …

anyway, once that is done


without the complication of chroot - just use your running system
it is still running, isn't it?

  • open terminal
  • issue command:
    sudo nano /etc/default/grub
    (that will open the file you need to edit in … an editor, nano)
  • find the 5th line down, which should read:
    GRUB_CMDLINE_LINUX_DEFAULT="quiet udev.log_priority=3"

and append to it, inside the quotes:
nvme_core.default_ps_max_latency_us=0 pcie_aspm=off

so that it then looks like this:
GRUB_CMDLINE_LINUX_DEFAULT="quiet udev.log_priority=3 nvme_core.default_ps_max_latency_us=0 pcie_aspm=off"

that is still only one line

  • save the file

  • run:
    sudo update-grub

and then reboot
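
For anyone who would rather not edit the line by hand, the same change can be sketched as a small script. Treat it as an illustration only: the function name add_nvme_params is made up, and it assumes GRUB_CMDLINE_LINUX_DEFAULT sits on a single line with its value in double quotes, as shown above.

```shell
#!/bin/sh
# Hypothetical helper: append the two parameters inside the quotes of the
# GRUB_CMDLINE_LINUX_DEFAULT line in file $1, unless they are already there.
# Assumes a single-line, double-quoted value. Keeps a backup as "$1.bak".
add_nvme_params() {
    params='nvme_core.default_ps_max_latency_us=0 pcie_aspm=off'
    if grep -q '^GRUB_CMDLINE_LINUX_DEFAULT=.*nvme_core\.default_ps_max_latency_us=' "$1"; then
        return 0    # already present; leave the file untouched
    fi
    cp "$1" "$1.bak" &&
    sed "s|^\(GRUB_CMDLINE_LINUX_DEFAULT=\".*\)\"|\1 $params\"|" "$1.bak" > "$1"
}

# Try it on a copy first, then inspect the result:
#   cp /etc/default/grub /tmp/grub.test && add_nvme_params /tmp/grub.test
#   grep GRUB_CMDLINE_LINUX_DEFAULT /tmp/grub.test
```

Running it against /etc/default/grub itself needs root, and sudo update-grub plus a reboot are still required afterwards.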

