Need help investigating an SSD failure

I purchased a Crucial P2 1TB SSD from Amazon (link: https://www.amazon.in/-/en/gp/product/B089DNM8LR?th=1) in February 2022.

I installed it in my ASUS TUF F15 F506HM gaming laptop, where I use it (single partition, Btrfs) to store my data.

I last used the laptop a month ago, and it was fine. I then went on vacation and only opened the laptop again yesterday. Since then, I’ve been seeing massive system slowdowns whenever I open any file from this drive.

I did the usual Btrfs maintenance and ran a balance. Now the system has stopped booting altogether, and the SSD is only being mounted in read-only mode.

That is when I went to the product page to read the reviews, and many people claim that these SSDs stop working after 1-8 months. One review even said that the sticker says 2 TB of storage capacity but Windows only shows 250 GB of space on it.

I need help finding out what exactly went wrong. Although I do have backups, I’d like to investigate this further.

The SSD that came with the laptop is the OS SSD (I installed Manjaro) and this second SSD from Crucial is my home partition.

Although the category is Support, I’m only looking for a direction to explore. Linux troubleshooting has been fun for me over the past year, after all.

Do the basics: remove the SSD and reinsert it on the motherboard, and verify it is correctly seated. If it is read-only, that probably means there are file system errors.

About the ‘reviews’: that seems weird to me, as Crucial is, in my experience, one of the best manufacturers. You have to be careful who you’re buying from too. I only buy products on Amazon when they are sold and shipped BY AMAZON. If you buy from shady distributors, you may be shipped refurbished or bad products (also known as a scam).

Check the SMART of the drive first and see what it reports.

SMART never worked for me, right from the start. The moment I inserted the SSD, I started seeing SMART failures at boot time, and after reading through Crucial’s website, I learned that SMART doesn’t exist for NVMe SSDs, so I disabled it in the BIOS.

However, I also couldn’t find any SMART info (even with smartmontools, not just KDE Partition Manager) for the built-in SSD either. Let me try removing and re-inserting it.

Here is the SMART output of my Crucial P1 NVMe SSD:

smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.1.25-1-MANJARO] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       CT1000P1SSD8
Serial Number:                      20232885A672
Firmware Version:                   P3CR020
PCI Vendor/Subsystem ID:            0xc0a9
IEEE OUI Identifier:                0x00a075
Controller ID:                      0
NVMe Version:                       1.3
Number of Namespaces:               1
Namespace 1 Size/Capacity:          1,000,204,886,016 [1.00 TB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            00a075 012885a672
Local Time is:                      Sun Apr 30 14:31:32 2023 CEST
Firmware Updates (0x14):            2 Slots, no Reset required
Optional Admin Commands (0x0017):   Security Format Frmw_DL Self_Test
Optional NVM Commands (0x005f):     Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp
Log Page Attributes (0x0f):         S/H_per_NS Cmd_Eff_Lg Ext_Get_Lg Telmtry_Lg
Maximum Data Transfer Size:         32 Pages
Warning  Comp. Temp. Threshold:     87 Celsius
Critical Comp. Temp. Threshold:     90 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     9.00W       -        -    0  0  0  0        5       5
 1 +     4.60W       -        -    1  1  1  1       30      30
 2 +     3.80W       -        -    2  2  2  2       30      30
 3 -   0.0300W       -        -    3  3  3  3     1000    1000
 4 -   0.0030W       -        -    4  4  4  4     6000    8000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        31 Celsius
Available Spare:                    100%
Available Spare Threshold:          5%
Percentage Used:                    5%
Data Units Read:                    74,012,422 [37.8 TB]
Data Units Written:                 70,785,320 [36.2 TB]
Host Read Commands:                 596,631,861
Host Write Commands:                780,515,644
Controller Busy Time:               9,292
Power Cycles:                       649
Power On Hours:                     21,471
Unsafe Shutdowns:                   124
Media and Data Integrity Errors:    0
Error Information Log Entries:      1,231
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               31 Celsius
Thermal Temp. 1 Transition Count:   38
Thermal Temp. 1 Total Time:         2364

Error Information (NVMe Log 0x01, 16 of 256 entries)
Num   ErrCount  SQId   CmdId  Status  PELoc          LBA  NSID    VS
  0       1231     0  0x0018  0x4005      -            0     1     -

It is not as detailed as you would expect, but there is still information you can get from it.
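As a rough sketch of how to read output like the above, the few fields that matter most for drive health can be pulled out with a short Python script. This is only an illustration: the function names are made up, the sample text is a trimmed copy of the output above, and it assumes the drive reports a "Warning Comp. Temp. Threshold" line (not every drive does). The size conversion relies on an NVMe data unit being 1000 blocks of 512 bytes.

```python
import re

# One NVMe "data unit" is 1000 blocks of 512 bytes.
BYTES_PER_DATA_UNIT = 512 * 1000

# Trimmed copy of the smartctl -a output posted above.
SAMPLE = """\
Warning  Comp. Temp. Threshold:     87 Celsius
Critical Warning:                   0x00
Temperature:                        31 Celsius
Available Spare:                    100%
Available Spare Threshold:          5%
Percentage Used:                    5%
Data Units Written:                 70,785,320 [36.2 TB]
Unsafe Shutdowns:                   124
Media and Data Integrity Errors:    0
Error Information Log Entries:      1,231
"""


def parse_smart_fields(text):
    """Collect 'Key: value' lines into a dict, collapsing repeated spaces in keys."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and value.strip():
            fields[" ".join(key.split())] = value.strip()
    return fields


def leading_int(value):
    """Return the first integer in a value such as '70,785,320 [36.2 TB]'."""
    match = re.match(r"[\d,]+", value)
    if match is None:
        raise ValueError(f"no leading integer in {value!r}")
    return int(match.group().replace(",", ""))


def summarize(text):
    """Reduce the report to the handful of numbers worth looking at first."""
    f = parse_smart_fields(text)
    spare = leading_int(f["Available Spare"])
    spare_threshold = leading_int(f["Available Spare Threshold"])
    return {
        # Any non-zero bit here means the controller itself is raising a warning.
        "critical_warning": int(f["Critical Warning"], 16),
        "spare_ok": spare > spare_threshold,
        "percentage_used": leading_int(f["Percentage Used"]),
        "tb_written": leading_int(f["Data Units Written"]) * BYTES_PER_DATA_UNIT / 1e12,
        "media_errors": leading_int(f["Media and Data Integrity Errors"]),
        "unsafe_shutdowns": leading_int(f["Unsafe Shutdowns"]),
        # Degrees Celsius left before the drive's own warning threshold.
        "temp_margin_c": leading_int(f["Warning Comp. Temp. Threshold"])
        - leading_int(f["Temperature"]),
    }


if __name__ == "__main__":
    for name, value in summarize(SAMPLE).items():
        print(f"{name}: {value}")
```

In practice, the things to look at first are a non-zero critical warning, spare capacity at or below its threshold, media errors, and a small temperature margin.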

https://wiki.archlinux.org/title/Btrfs#btrfs_check

https://man.archlinux.org/man/btrfs-check.8

Thanks for this. For some reason SMART has now started showing up, which didn’t work before. I’m attaching the output of smartctl -a /dev/nvme1.

Sorry for the images, I don’t know how to copy and paste from tmux when using recovery mode. Do you think I should run btrfs check?

I do have my backups on a separate external hard disk and in the cloud, but I don’t want to risk the SSD if it is actually good. The wiki page shows a lot of warnings. I think I’ll just run it without the --repair switch and see what happens.

Yes, do as the manual says: don’t use --repair, and force --readonly.

If you can work from a live USB environment, you can connect to the forum and post proper logs. Images of text aren’t good for troubleshooting efficiently.
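If you want to script the read-only check from the live environment and save the output for posting, something like this Python sketch would do it. Treat it as an assumption-laden example: the function names are invented, the device path is a placeholder to replace with the real Btrfs partition, the filesystem must be unmounted, and the command needs root. The only btrfs options used are --readonly and the optional --check-data-csum, which also verifies data checksums and takes much longer.

```python
import subprocess


def build_check_command(device, verify_data_csum=False):
    """Build a btrfs check command line that never passes --repair."""
    cmd = ["btrfs", "check", "--readonly"]
    if verify_data_csum:
        # Also read the data blocks and verify their checksums (slow on large drives).
        cmd.append("--check-data-csum")
    cmd.append(device)
    return cmd


def run_check(device, verify_data_csum=False, log_path="btrfs-check.log"):
    """Run the check, write the combined output to log_path, return the exit code."""
    result = subprocess.run(
        build_check_command(device, verify_data_csum),
        capture_output=True,
        text=True,
    )
    with open(log_path, "w", encoding="utf-8") as log:
        log.write(result.stdout)
        log.write(result.stderr)
    return result.returncode


# Example call (hypothetical device path, run as root on an unmounted filesystem):
#   run_check("/dev/nvme1n1p1")
```

The resulting btrfs-check.log can then be pasted into the thread as text instead of a screenshot.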

Thanks a lot! My system is bootable again after running btrfs check. It seems some flag that was set during an earlier error has now been cleared. However, when reading large files, the SSD temperature still jumps to 78C, which causes SMART to report a failure, since the warning temperature rating is 70C for me. Room temperature here is currently 38C, and the CPU temperature is 64C after idling for 10 minutes.

However, something happened to my swapfile. The system can no longer find it, so I had to remove the resume parameter from the kernel boot parameters.

I want to experiment with dynamic swap, so I will open a separate thread for that soon.

Anything relevant to your issue in the command output? btrfs check should not have changed ANYTHING at all; it should have just run its checks and reported, so something is weird here.

If your components are overheating, then that in itself is an issue you need to check and fix. Maybe clean the computer, and maybe force the fan speed higher.

Btrfs check simply said no errors found.

However, I ran it with the data checksum check too.

It took a long time, but this also said there are no issues. I think overheating might be the issue now.

Since it’s been a year anyway, I’ll clean it, and I think I’ll also re-apply the thermal paste. The CPU temperature is also going quite high; it’s touching 90C now.

Maybe reseating actually fixed it. Sometimes a drive can work for a while even if it is not seated correctly, and then suddenly it stops working or generates errors.

But yeah, you need to fix this overheating issue, I think.
