I’ve wasted more than 18 hours on this at this point, so please excuse me if I don’t provide the most complete information here; ask if you have any ideas. I am fed up beyond belief, but my spirit stops me from giving up.
Given:
- Machine: MSI Alpha 15 (2021 model with the AMD Ryzen 5800H / RX 6600M), latest firmware: E158LAMS.108
- Old NVMe SSD: Intel SSD 760p 256GB, latest firmware
- New SSD: ADATA XPG Gammix S11 Pro 2TB (controller: SM2262 or SM2262EN), latest firmware
- Manjaro Live USB (both an older linux515 release and Linux manjaro 6.1.12-1-MANJARO)
Due to the complexity of the installation, I decided to dd from the old to the new drive, then fix up the partitioning, extend the LVM, etc. (sketched right after this list). The setup has:
- GPT scheme with 4 partitions
- p1: ESP aka /boot/, contains both bootloaders
- p2: Windows’ reserved partition (msftres)
- p3: Encrypted Windows partition (Veracrypt)
- p4: LUKS with Manjaro
- LVM inside the LUKS container, with separate logical volumes for swap, root, and home
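For context, the post-clone fixup I had in mind looks roughly like this. It is only a sketch: cryptroot and vg0 are placeholder names, not my actual mapping/VG names, and the lvextend size is just an example.
sgdisk -e /dev/nvme0n1                 # move the backup GPT to the end of the new, larger disk
parted /dev/nvme0n1 resizepart 4 100%  # grow the last partition (the LUKS one)
cryptsetup open /dev/nvme0n1p4 cryptroot
cryptsetup resize cryptroot            # grow the LUKS mapping to fill the partition
pvresize /dev/mapper/cryptroot         # grow the LVM physical volume
lvextend -r -L +500G vg0/home          # example: grow home plus its filesystem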
As of writing this post, I have tracked it down to the following steps:
- Boot from the Live USB.
- Clone with the exact dd command I used:
dd if=/dev/nvme1n1 of=/dev/nvme0n1 bs=8M status=progress oflag=direct
It takes 7 minutes, and afterwards you can read the contents without errors. Two notes to make:
a. The partition UUIDs are going to be duplicated. This is intended, because I will erase the old drive; as a precaution, I remove the old drive after shutting down.
b. The tail (backup) GPT data is null. I did not use (g)parted to fix it up yet.
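As a sanity check right after the clone, something like this works (a sketch; it assumes the old drive is still nvme1n1 and the new one nvme0n1, as above):
cmp -n $((4*1024*1024*1024)) /dev/nvme1n1 /dev/nvme0n1   # spot-check the first 4 GiB byte-for-byte
lsblk -o NAME,SIZE,FSTYPE,UUID,PARTUUID /dev/nvme0n1 /dev/nvme1n1   # shows the duplicated UUIDs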
- Launch GParted and allow it to fix the GPT data (it asks to extend the partition table to the whole disk).
- Mount the migrated /boot/ to look up the bootloader path and change the GRUB settings.
- Mount the migrated root / to comment out entries in fstab (the HDDs are not connected to the laptop).
- Add Manjaro's bootloader on the new SSD to UEFI manually with efibootmgr, because autodetection seems broken on Manjaro's end.
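The call looked roughly like this (a sketch: the partition number matches the ESP being p1, but the label and loader path are assumptions from memory, not copied from my shell history):
efibootmgr --create --disk /dev/nvme0n1 --part 1 --label "Manjaro" --loader '\EFI\Manjaro\grubx64.efi'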
- Unmount the new /boot/ and /.
- shutdown now, then physically remove the old SSD.
- Boot into the firmware and confirm my new Manjaro entry is there. The boot takes a few seconds longer, although the power was not cut.
- Reboot into the Live USB again (not touching the cloned disk).
- Read to verify:
dd if=/dev/nvme0n1 of=/dev/null bs=64K iflag=direct status=progress
It returns:
dd: error reading '/dev/nvme0n1': No data available
6188+0 records in
6188+0 records out
405536768 bytes (406 MB, 387 MiB) copied, 0.28443 s, 1.4 GB/s
This location is part of the boot partition.
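The numbers line up with the dmesg error below, assuming the 512-byte LBAs both drives report:
echo $((6188 * 64 * 1024))        # 405536768 -- the byte offset where dd stopped
echo $((6188 * 64 * 1024 / 512))  # 792064 -- the same offset as a 512-byte LBA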
- Check dmesg:
nvme0n1: Read(0x2) @ LBA 792064, 128 blocks, Unrecovered Read Error (sct 0x2 / sc 0x81)
critical medium error, dev nvme0n1, sector 792064 op 0x0:(READ) flags 0x800 phys_seg 8 prio class 2
Here’s another one from an earlier attempt, where the filesystem superblock was “dead”:
blk_update_request: critical medium error, dev nvme0n1, sector 2048 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 0
Essentially, the drive is soft-bricked. I can write to the SSD all I want (I wasted an hour and 2TB of TBW on badblocks before), but as soon as anything tries to read: error. In fact, badblocks tried so hard it froze the controller and caused the kernel to drop the NVMe device. When I reproduced the issue and tried to write to the failing area manually with dd, it did nothing to alleviate the errors, and a dd read then errored at a different location.
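For reference, the manual write-then-read reproduction was along these lines (a sketch; the LBA is the one from the dmesg output above):
dd if=/dev/zero of=/dev/nvme0n1 bs=512 seek=792064 count=128 oflag=direct   # overwrite the failing 128-block span
dd if=/dev/nvme0n1 of=/dev/null bs=512 skip=792064 count=128 iflag=direct   # read it back -- still errors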
The SSD is unusable in this state. However, if I format/secure-erase it with
nvme format -s1 /dev/nvme0n1
then it begins to function normally again.
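Before the erase, the drive's own view of the damage can be snapshotted with standard nvme-cli commands (how much the SM2262 firmware actually records there is another question):
nvme smart-log /dev/nvme0n1   # media errors, critical warnings, percentage used
nvme error-log /dev/nvme0n1   # controller error log entries, if any are recorded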
I’ve tried fresh installs of Fedora, Ubuntu, Debian, Manjaro, and Manjaro with encryption (via Calamares). They all work. However, cloning my previous setup does not. WTF?
(1) How on earth does the controller soft-brick itself?
(2) What on earth is causing it? I need your help; here is what I’ve ruled out or suspect:
- dd is not at fault: immediately after cloning, I can read the entire drive without errors.
- The backup GPT at the end of the disk not being copied properly? GParted immediately offers to fix that. Tried it; no change.
- The cloned GRUB does not matter: on my last attempt I didn’t even boot it.
- According to nvme-cli and smartctl, both SSDs are set to a 512-byte LBA size, so everything must be at the same locations as on the old SSD (see the check below this list).
- Broken SSD firmware power settings, where deep power states are being misreported? I tried nvme_core.default_ps_max_latency_us=0 as a Linux kernel parameter; no change. And if it were that, why would clean installs and subsequent dd reads work fine?
- My last attempt hints at the UEFI firmware being the culprit. I did not boot into the cloned system, only into the UEFI settings and then my Live USB again.
- Maybe the SSD’s firmware after all? Though I think in that case this would have become a widespread problem.
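On the LBA-size point above, this is how the claim can be double-checked (nvme id-ns -H is standard nvme-cli; the grep pattern is just an assumption about its human-readable output):
nvme id-ns -H /dev/nvme0n1 | grep -i "lba format"   # the "(in use)" line shows the active data size
nvme id-ns -H /dev/nvme1n1 | grep -i "lba format"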
I don’t have another machine with two M.2 slots to test.