Problem with trying to recover from an interrupted update/upgrade

SillyMark · 5 October 2024 12:03

So as the title says, I’m trying to recover from an interrupted update by following this guide:
https://forum.manjaro.org/t/howto-recovering-from-an-interrupted-update-upgrade/132762
However, my PC freezes in the middle of the upgrade, and after a while I figured out it was because of my SSD. I am using an Acer Aspire A315-23 laptop, and whenever I live-boot a distro it freezes after a while, even when using the boot parameters from this guide: https://askubuntu.com/questions/1277405/problems-installing-ubuntu-on-acer-laptop
This usually isn’t a problem, but the recovery process takes some time, so the live session always freezes partway through the upgrade. I then thought about editing and updating GRUB, however it’s impossible to do that on a live USB.
In short, I have two questions:

Is it possible to permanently alter the boot parameters on a liveboot medium? OR
Is it possible to cut down the time required for the recovery process?
Sorry if these questions have been asked before.

Thanks in advance!

Nachlese · 5 October 2024 12:20

Yes - altering them does not even require chroot,
because what you need to alter / edit is the /etc/default/grub file on your system partition,
not the one on the live system.
Just mount the partition and do whatever …

But after altering them you need to run update-grub
or the equivalent
grub-mkconfig -o /boot/grub/grub.cfg

and that only works from within a chroot
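A minimal sketch of the edit step described above, assuming the installed system's root partition is already mounted somewhere (the mount point /mnt and the parameter in the usage comment are assumptions, not taken from this thread). The helper name add_kernel_param is hypothetical; it appends one parameter to GRUB_CMDLINE_LINUX_DEFAULT and leaves the file untouched if the parameter is already there.

```shell
#!/usr/bin/env bash
# Hypothetical helper: append a kernel parameter to the
# GRUB_CMDLINE_LINUX_DEFAULT line of a grub defaults file.
# Usage: add_kernel_param <path-to-grub-file> <parameter>
add_kernel_param() {
    local file="$1" param="$2"

    # Skip the edit if the parameter is already on that line.
    if grep '^GRUB_CMDLINE_LINUX_DEFAULT=' "$file" | grep -qF -- "$param"; then
        return 0
    fi

    # Insert the parameter just before the closing double quote.
    # (-i edits in place; this form assumes GNU sed, as on a Linux live ISO.)
    sed -i "s|^\(GRUB_CMDLINE_LINUX_DEFAULT=\"[^\"]*\)\"|\1 ${param}\"|" "$file"
}

# Example (assumed mount point and example parameter, adjust both):
#   add_kernel_param /mnt/etc/default/grub "some.parameter=value"
```

The change only takes effect once grub.cfg has been regenerated from inside a chroot, as noted above (on a Manjaro live ISO, manjaro-chroot -a is one way to get there).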

What recovery process?
The file system check that runs after you shut down by cutting the power/turning it off?
No - it takes the time it takes - and it is needed.

ps:
I may have misunderstood you:
the only way to edit the grub parameters for the live system is to press ESC to halt at the Grub screen, then press E to edit - there is no way to make this permanent.

BG405 · 5 October 2024 12:38

In this case, I’d suggest (attempting) a clone of this onto a new device, and see if the problem persists. CloneZilla can be used for this.

Kobold · 5 October 2024 14:35

Why do you think it is your SSD? Do you have enough free space?
What do your pacman logs show?
/var/log/pacman.log
How did you update? Maybe try updating from a TTY?
In future I recommend using Timeshift: a system is much easier to fix when you can roll it back. At least you have a safety net and always have a working system.
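One way to see whether the last update was cut off is to look at the transaction markers in the log. A rough sketch, assuming the log contains "[ALPM] transaction started" and "[ALPM] transaction completed" lines (check your own log for the exact wording); the function name last_transaction_state is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report whether the most recent transaction in a
# pacman log was completed. Prints "complete", "incomplete", or "none".
# Assumes the log uses "[ALPM] transaction started/completed" markers.
last_transaction_state() {
    local log="$1"
    awk '
        /\[ALPM\] transaction started/   { state = "incomplete" }
        /\[ALPM\] transaction completed/ { state = "complete" }
        END { print (state == "" ? "none" : state) }
    ' "$log"
}

# Example:
#   last_transaction_state /var/log/pacman.log
```

If this prints "incomplete", the lines after the last "transaction started" entry show which packages were handled before the freeze.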

SillyMark · 5 October 2024 15:31

Sorry for wasting everyone’s time but due to an emergency I have wiped my system in the meantime.

Molski · 5 October 2024 15:43

Hold up on the solution.

Are you sure you are reinstalling the right OS for you this time around?

SillyMark · 5 October 2024 15:55

Yes, I am absolutely certain. Thank you for your patience.

BG405 · 6 October 2024 00:30

A new nuke-and-pave installation is not a “recovery” from anything, nor does it address or in any way “solve” what caused the issue in the first place, so is not helpful at all, and not a useful “solution” for anybody reading this.

Sorry if this sounds a bit harsh, but it’s a statement of truth.

SillyMark · 6 October 2024 06:41

AFAIK the problem is that the Western Digital SSD has a problem with power states which prevents Linux from recognizing it (WD apparently doesn’t test their SSDs with Linux distros), see: https://community.wd.com/t/linux-support-for-wd-black-nvme-2018/225446/17. One way of fixing this is to modify the grub file. However, it is impossible to do such a thing on a liveboot medium. I’m pretty sure that the problem can only be permanently solved by replacing the SSD. Once again, thank you for being so understanding.
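Whichever parameter ends up in the grub file, it is worth confirming after a reboot that the running kernel actually received it. A small sketch, assuming the parameter is matched as an exact word on the kernel command line; the function name cmdline_has_param and the example parameter are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if the given parameter appears as an exact
# word on the kernel command line. The second argument defaults to
# /proc/cmdline and exists only so the check can be run on a saved copy.
cmdline_has_param() {
    local param="$1" cmdline_file="${2:-/proc/cmdline}"
    tr ' ' '\n' < "$cmdline_file" | grep -qxF -- "$param"
}

# Example (parameter name is a placeholder):
#   cmdline_has_param "some.parameter=value" && echo "active" || echo "not set"
```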

BG405 · 6 October 2024 10:10

Interesting re. the SSD issue; I wonder what shows up with sudo smartctl -x /dev/sdXX (change sdXX to whatever your machine reports it as)?

You might need to install smartmontools first. It’s in the “extra” repo.

I wonder if it’s overheating? I’ve heard that can be an issue, although the SSD I have in mine runs practically cold even with big transfers.

SillyMark · 6 October 2024 12:13

Thanks for the suggestion. Here’s what I got:

=== START OF INFORMATION SECTION ===
Model Number:                       WDC PC SN520 SDAPNUW-256G-1014
Serial Number:                      2039D2803732
Firmware Version:                   20110000
PCI Vendor/Subsystem ID:            0x15b7
IEEE OUI Identifier:                0x001b44
Total NVM Capacity:                 256,060,514,304 [256 GB]
Unallocated NVM Capacity:           0
Controller ID:                      1
NVMe Version:                       1.3
Number of Namespaces:               1
Namespace 1 Size/Capacity:          256,060,514,304 [256 GB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            001b44 8b49c12e9d
Local Time is:                      Sun Oct  6 14:07:56 2024 CEST
Firmware Updates (0x14):            2 Slots, no Reset required
Optional Admin Commands (0x0017):   Security Format Frmw_DL Self_Test
Optional NVM Commands (0x001f):     Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat
Log Page Attributes (0x02):         Cmd_Eff_Lg
Maximum Data Transfer Size:         128 Pages
Warning  Comp. Temp. Threshold:     82 Celsius
Critical Comp. Temp. Threshold:     86 Celsius
Namespace 1 Features (0x02):        NA_Fields

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     2.60W       -        -    0  0  0  0        0       0
 1 +     2.60W       -        -    1  1  1  1        0       0
 2 +     1.70W       -        -    2  2  2  2        0       0
 3 -   0.0250W       -        -    3  3  3  3     5000    9000
 4 -   0.0025W       -        -    4  4  4  4     5000   44000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         2
 1 -    4096       0         1

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        47 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    8%
Data Units Read:                    90,082,154 [46,1 TB]
Data Units Written:                 31,887,533 [16,3 TB]
Host Read Commands:                 1,038,561,985
Host Write Commands:                478,749,191
Controller Busy Time:               2,427
Power Cycles:                       7,987
Power On Hours:                     11,004
Unsafe Shutdowns:                   149
Media and Data Integrity Errors:    0
Error Information Log Entries:      124
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0

Error Information (NVMe Log 0x01, 16 of 256 entries)
No Errors Logged

Read Self-test Log failed: Invalid Field in Command (0x4002)
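To compare counters such as Unsafe Shutdowns between runs, single fields can be pulled out of saved output like the above. A small sketch, assuming the output was redirected to a text file first (the file name smart.txt and the helper name smart_field are hypothetical):

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the value of one "Name:   value" field from
# saved smartctl output. Splits on a colon followed by spaces, so times
# such as 14:07:56 inside a value are left intact.
smart_field() {
    local name="$1" file="$2"
    awk -F': +' -v key="$name" '$1 == key { print $2; exit }' "$file"
}

# Example (sdXX as in the post above, smart.txt is a placeholder):
#   sudo smartctl -x /dev/sdXX > smart.txt
#   smart_field "Unsafe Shutdowns" smart.txt
#   smart_field "Error Information Log Entries" smart.txt
```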

BG405 · 6 October 2024 12:20

This looks OK to me but I could be missing something. Also I’m unsure about this:

It doesn’t look like there’s been any heat issue, though, at least with the drive itself. Maybe a controller issue?

Edited to add: what memory and swap do you have available?

free -h

SillyMark · 6 October 2024 12:28

               total        used        free      shared  buff/cache   available
Mem:           5,7Gi       2,5Gi       2,8Gi       146Mi       784Mi       3,2Gi
Swap:          2,0Gi       664Mi       1,4Gi

BG405 · 6 October 2024 12:33

That looks decent enough; I’d have more swap but at least you have some. Unlike what I’m seeing a lot here lately: “No swap defined”.

I had an issue some time back where in my case Haskell modules were taking an age to install, due to a bad sector on the disk. It would get stuck for ages, until I fixed the issue. Not sure how this would relate with an SSD, though.

Watch the logs whilst upgrading; save them and post here. We might be able to spot something.

sudo pacman -Syu --logfile ~/.upgrade-log.txt

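I use a slightly different method: I manually copy the Pacman output and save it in a date-stamped text file for future reference. For that you may need to raise the scrollback limit in your terminal emulator’s settings. Note that the line below raises the number of commands kept in the bash history, which is separate from scrollback: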

echo export HISTSIZE=10000 >> ~/.bashrc
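The manual copy step can also be automated so the result does not depend on scrollback at all. A minimal sketch, assuming tee and date are available (they are on a standard install); the function name run_logged and the file name pattern are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a command, show its output on screen, and save
# a copy (stdout and stderr) in a date-stamped file in the given directory.
# Usage: run_logged <log-directory> <command> [args...]
run_logged() {
    local logdir="$1"; shift
    local logfile="${logdir}/upgrade-$(date +%Y-%m-%d_%H%M%S).txt"
    "$@" 2>&1 | tee "$logfile"
}

# Example:
#   run_logged "$HOME" sudo pacman -Syu
```

Because the output goes through a pipe, some programs format their progress display differently than they do on a plain terminal, so the saved file may not match the screen exactly.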

SillyMark · 6 October 2024 12:41

Will do!