Default boot doesn't work after an upgrade in Read Only mode

Danixu · 30 May 2023 11:42

Hello, thanks for your response.

I have checked the smartctl command, and it says that there are no errors:

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        39 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    1%
Data Units Read:                    2.237.355 [1,14 TB]
Data Units Written:                 4.154.767 [2,12 TB]
Host Read Commands:                 33.959.394
Host Write Commands:                87.876.548
Controller Busy Time:               170
Power Cycles:                       263
Power On Hours:                     3.027
Unsafe Shutdowns:                   80
Media and Data Integrity Errors:    0
Error Information Log Entries:      0
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               39 Celsius

Error Information (NVMe Log 0x01, 16 of 256 entries)
No Errors Logged
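
To pick out the fields that matter most for drive health from output like the above, a small filter can help. This is only a sketch: the function name nvme_health_summary is made up for illustration, and the field list is just one selection of labels taken from the smartctl output shown here.

```shell
# Hypothetical helper: keep only the key NVMe health lines from
# `smartctl -a` output read on stdin.
nvme_health_summary() {
    grep -E '^(Critical Warning|Available Spare|Percentage Used|Unsafe Shutdowns|Media and Data Integrity Errors|Error Information Log Entries):'
}

# Example on a real system (the device name is an assumption):
#   sudo smartctl -a /dev/nvme0 | nvme_health_summary
```

The trailing colon in the pattern keeps "Available Spare Threshold" out of the result, so only the exact labels listed are printed.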

The drive is almost new in terms of written data. Anyway, I think the problem started after a kernel upgrade, because it was working without any problem before that, and if I don’t hibernate, the filesystem never turns read-only. I can even work on the computer for 10 hours without any error; the filesystem only turns read-only just after hibernating.

Best regards

Mirdarthos · 30 May 2023 11:50

That’s not what you said previously:

…which is what I based my answer on.

Nevertheless, I highly doubt it’s because of a kernel or something, since mine is working fine:

$ sudo smartctl -a /dev/nvme0
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.3.0-1-MANJARO] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       Samsung SSD 960 EVO 250GB
Serial Number:                      S3ESNX0K162428H
Firmware Version:                   3B7QCXE7
PCI Vendor/Subsystem ID:            0x144d
IEEE OUI Identifier:                0x002538
Total NVM Capacity:                 250,059,350,016 [250 GB]
Unallocated NVM Capacity:           0
Controller ID:                      2
NVMe Version:                       1.2
Number of Namespaces:               1
Namespace 1 Size/Capacity:          250,059,350,016 [250 GB]
Namespace 1 Utilization:            108,215,193,600 [108 GB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            002538 5181b1e070
Local Time is:                      Tue May 30 13:48:12 2023 SAST
Firmware Updates (0x16):            3 Slots, no Reset required
Optional Admin Commands (0x0007):   Security Format Frmw_DL
Optional NVM Commands (0x001f):     Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat
Log Page Attributes (0x03):         S/H_per_NS Cmd_Eff_Lg
Maximum Data Transfer Size:         512 Pages
Warning  Comp. Temp. Threshold:     77 Celsius
Critical Comp. Temp. Threshold:     79 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
0 +     6.04W       -        -    0  0  0  0        0       0
1 +     5.09W       -        -    1  1  1  1        0       0
2 +     4.08W       -        -    2  2  2  2        0       0
3 -   0.0400W       -        -    3  3  3  3      210    1500
4 -   0.0050W       -        -    4  4  4  4     2200    6000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
0 +     512       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        32 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    4%
Data Units Read:                    49,179,729 [25.1 TB]
Data Units Written:                 36,109,535 [18.4 TB]
Host Read Commands:                 753,868,249
Host Write Commands:                897,308,024
Controller Busy Time:               3,349
Power Cycles:                       2,419
Power On Hours:                     4,260
Unsafe Shutdowns:                   200
Media and Data Integrity Errors:    0
Error Information Log Entries:      1,624
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               32 Celsius
Temperature Sensor 2:               41 Celsius

Error Information (NVMe Log 0x01, 16 of 64 entries)
Num   ErrCount  SQId   CmdId  Status  PELoc          LBA  NSID    VS
0       1624     0  0x3017  0x4004      -            0     0     -
1       1623     0  0xa001  0x4004      -            0     0     -
2       1622     0  0x4011  0x4004      -            0     0     -
3       1621     0  0x9008  0x4004      -            0     0     -
4       1620     0  0x9019  0x4004      -            0     0     -
5       1619     0  0x101b  0x4004      -            0     0     -
6       1618     0  0x4013  0x4004      -            0     0     -
7       1617     0  0x5005  0x4004      -            0     0     -
8       1616     0  0x300f  0x4004      -            0     0     -
9       1615     0  0xc007  0x4004      -            0     0     -
10       1614     0  0x5004  0x4004      -            0     0     -
11       1613     0  0xc010  0x4004      -            0     0     -
12       1612     0  0xc015  0x4004      -            0     0     -
13       1611     0  0xf00b  0x4004      -            0     0     -
14       1610     0  0xc00d  0x4004      -            0     0     -
15       1609     0  0x500d  0x4004      -            0     0     -
... (48 entries not read)

~~There should be a line similar, if not equal to:~~

Nevermind. Just saw it.

SMART overall-health self-assessment test result: PASSED

What file system are you using?

Danixu · 30 May 2023 12:07

I don’t understand what you mean. Both are correct…
My system is almost a fresh install, because I installed it about a month ago. During the installation I enabled hibernation, and it worked perfectly, hibernating the laptop every day, until a system upgrade I did last week. Since that upgrade the problem has occurred twice, which is why it caught me off guard (I have now changed the hibernate button to suspend). With the last upgrade, which I did today, the combination of the read-only filesystem and the upgrade finally broke my boot process. I want to avoid a full reinstallation, and for now the only thing that is failing is the default boot option (I am working on the computer right now).

My filesystems are vfat for the EFI partition, ext3 for the boot partition and ext4 for the root partition (I have not separated the home directory).

Best regards.

Mirdarthos · 30 May 2023 12:18

Because, if it was working fine, then it wouldn’t have mounted as read-only. And you said,

Which means it has been doing it for longer than just once. And

Sounds like you were used to it happening.

Nevertheless, I think you need to check your ext4 filesystem for errors. So from a Live ISO, run:

sudo fsck /dev/<partition>

Where <partition> is the partition you wish to check, /dev/sda1, /dev/sda3, /dev/nvme0n1p1 or similar.

Edit:

And while you’re at it, you can/should check the ext4 partition as well.

Danixu · 30 May 2023 12:44

Hello,

It’s correct that this is not the first time; it has happened twice, and the first time I thought it was maybe a random error. It is also true that by “recently” I mean in the last few days, and that it was working fine for almost a month.

Never mind, English is not my main language and maybe I have not explained it very well.

As for the check, I did it several times. Before entering recovery mode it was checked automatically:

/dev/mapper/luks-1ec0ec86-5306-4e63-b948-3db9727e4b7b contains a file system with errors, check forced.
Deleted inode 8397010 has zero dtime.  FIXED.
/dev/mapper/luks-1ec0ec86-5306-4e63-b948-3db9727e4b7b: Deleted inode 8397089 has zero dtime.  FIXED.
Deleted inode 8952769 has zero dtime.  FIXED.
Inode 19665129 extent tree (at level 1) could be shorter.  IGNORED.
/dev/mapper/luks-1ec0ec86-5306-4e63-b948-3db9727e4b7b: Deleted inode 22020696 has zero dtime.  FIXED.
/dev/mapper/luks-1ec0ec86-5306-4e63-b948-3db9727e4b7b: Deleted inode 22020720 has zero dtime.  FIXED.
/dev/mapper/luks-1ec0ec86-5306-4e63-b948-3db9727e4b7b: Deleted inode 22032249 has zero dtime.  FIXED.
/dev/mapper/luks-1ec0ec86-5306-4e63-b948-3db9727e4b7b: Deleted inode 22036488 has zero dtime.  FIXED.
/dev/mapper/luks-1ec0ec86-5306-4e63-b948-3db9727e4b7b: Deleted inode 22036489 has zero dtime.  FIXED.
/dev/mapper/luks-1ec0ec86-5306-4e63-b948-3db9727e4b7b: Inode 22152386 extent tree (at level 1) could be narrower.  IGNORED.
/dev/mapper/luks-1ec0ec86-5306-4e63-b948-3db9727e4b7b: Deleted inode 22166151 has zero dtime.  FIXED.
/dev/mapper/luks-1ec0ec86-5306-4e63-b948-3db9727e4b7b: Inode 22181869, i_blocks is 24, should be 16.  FIXED.
/dev/mapper/luks-1ec0ec86-5306-4e63-b948-3db9727e4b7b: Inode 22182936, i_blocks is 24, should be 16.  FIXED.
/dev/mapper/luks-1ec0ec86-5306-4e63-b948-3db9727e4b7b: Inode 22183347, i_blocks is 24, should be 16.  FIXED.
/dev/mapper/luks-1ec0ec86-5306-4e63-b948-3db9727e4b7b: Deleted inode 22282554 has zero dtime.  FIXED.
Inode 27161261, i_blocks is 16, should be 8.  FIXED.
Inode 29889750, i_blocks is 84832, should be 68448.  FIXED.
/dev/mapper/luks-1ec0ec86-5306-4e63-b948-3db9727e4b7b: Orphan file (inode 12) block 13 is not clean.
CLEARED.
/dev/mapper/luks-1ec0ec86-5306-4e63-b948-3db9727e4b7b: 1658493/30072832 files (0.5% non-contiguous), 32231556/120282946 blocks

Some inodes were fixed in that boot process.
While fixing the boot I also checked all the partitions from the live disk to be sure, and on every boot I can see that the filesystem seems to be checked again:

/dev/mapper/luks-1ec0ec86-5306-4e63-b948-3db9727e4b7b: clean, 1584875/30072832 files, 32012189/120282946 blocks

Best regards.

Edit:
I have tried to remember when it first happened, and it was just yesterday.
Checking the last upgrade, I see that it was on May 25th. Because of the upgrade, that day I shut the computer down instead of hibernating, so on Friday the 26th it was working fine. It failed for the first time on Monday, after the Friday hibernation, and today for the second time.

The packages upgraded in that process were:

[2023-05-25T13:20:12+0200] [ALPM] upgraded adw-gtk3 (4.6-1 -> 4.7-1)
[2023-05-25T13:20:12+0200] [ALPM] upgraded cryptsetup (2.6.1-3.2 -> 2.6.1-3.3)
[2023-05-25T13:20:12+0200] [ALPM] upgraded firefox (112.0.2-2 -> 113.0-0.1)
[2023-05-25T13:20:13+0200] [ALPM] upgraded gnome-shell-extension-gtk4-desktop-icons-ng (1:38-1 -> 1:40-1)
[2023-05-25T13:20:13+0200] [ALPM] upgraded gnome-layout-switcher (0.8.35-1 -> 0.8.35-2)
[2023-05-25T13:20:13+0200] [ALPM] upgraded gnome-shell-extension-arcmenu (44.1-1 -> 45-1)
[2023-05-25T13:20:13+0200] [ALPM] upgraded gnome-shell-extension-dash-to-dock (80-1 -> 81-1)
[2023-05-25T13:20:13+0200] [ALPM] upgraded inxi (3.3.26.1-1 -> 3.3.27.1-1)
[2023-05-25T13:20:13+0200] [ALPM] upgraded linux61 (6.1.26-1 -> 6.1.29-1)
[2023-05-25T13:20:13+0200] [ALPM] upgraded manjaro-gnome-settings (20230316-1 -> 20230513-1)
[2023-05-25T13:20:13+0200] [ALPM] upgraded manjaro-gnome-extension-settings (20230109-1 -> 20230517-2)
[2023-05-25T13:20:14+0200] [ALPM] upgraded manjaro-release (22.1.1-1 -> 22.1.2-1)
[2023-05-25T13:20:14+0200] [ALPM] installed ttf-meslo-nerd-font-powerlevel10k (20230403-2)
[2023-05-25T13:20:14+0200] [ALPM] upgraded manjaro-zsh-config (0.25-1 -> 0.25-2)
[2023-05-25T13:20:14+0200] [ALPM] upgraded spectre-meltdown-checker (0.45+6+ga284357-1 -> 0.45+8+g6a61df2-1)

I suspected the kernel because it is the prime suspect, but looking at it now, maybe the problem is the cryptsetup upgrade, since cryptsetup is in charge of opening the encrypted root partition, which is the one that is failing…

Mirdarthos · 30 May 2023 13:00

Ah, well, yes. That’s possibly the reason for the miscommunication. Thank you for explaining it differently.

P.S.:

As a friend I’ll advise you to mention that, it tends to let people be more…accommodating.

This isn’t ext4 as mentioned, but looks like BTRFS to me.

…and an encrypted one, to boot.

Not using BTRFS I don’t know if it can be unencrypted. Soooooo…a big, fat

However,

You might be right.

Anyway, I’ll point you to the Manjaro Wiki article about BTRFS:

Maybe @andreas85 knows more. He seems to be a fan of BTRFS…

andreas85 · 30 May 2023 13:06

I am.

But I do value my data, so I never encrypt btrfs. This way btrfs is as safe as a safe (in that it never loses any data). ==> I can’t help.

Danixu · 30 May 2023 13:15

Hello,

No, it is not BTRFS, it is ext4 using luks to encrypt it:

/dev/mapper/luks-1ec0ec86-5306-4e63-b948-3db9727e4b7b /              ext4    defaults,noatime 0 1

Using cryptsetup, the system seems to map the /dev/ partition to that mapper device and then mount it as the root partition:

# blkid /dev/nvme0n1p3 
/dev/nvme0n1p3: UUID="87fb7b2b-9c2e-437f-8120-fb07c65ffd10" TYPE="crypto_LUKS" PARTUUID="e83b2ec9-bc44-ca46-be34-b30428149002"
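
When checking this layout from a live ISO, fsck has to run on the unlocked mapper device rather than on the crypto_LUKS partition itself. A minimal sketch that only prints the commands for review, assuming the partition above and a mapper name cryptroot chosen purely for illustration (luks_fsck_plan is likewise a hypothetical helper, not an existing tool):

```shell
# Hypothetical helper: print (do not run) the commands to unlock a LUKS
# partition, force a check of the ext4 filesystem inside it, and lock it again.
# Arguments: <luks-partition> <mapper-name>
luks_fsck_plan() {
    part=$1
    name=$2
    printf 'sudo cryptsetup open %s %s\n' "$part" "$name"
    printf 'sudo e2fsck -f /dev/mapper/%s\n' "$name"
    printf 'sudo cryptsetup close %s\n' "$name"
}

# Partition taken from the blkid output; "cryptroot" is an assumed name.
luks_fsck_plan /dev/nvme0n1p3 cryptroot
```

Printing the plan first makes it easy to confirm the device names before anything touches the disk; the filesystem should not be mounted while e2fsck runs.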

Best regards.

Mirdarthos · 30 May 2023 13:21

I can still only point you here, seeing as my ext4 is unencrypted:

https://wiki.archlinux.org/title/Dm-crypt

xabbu · 30 May 2023 13:44

Did you check the journal for the time just before the read-only remount happened? There must be errors, otherwise the filesystem would not have been remounted.

Danixu · 30 May 2023 13:57

I don’t think so… The only checks I did were the fsck checks. If there is any other way to check the journal, I don’t know it.

@Mirdarthos Thanks!! I’ll give the link a try to see if I can fix the hibernation, and maybe I’ll downgrade the package to see if the upgrade was the problem.

About fixing the default grub option, can you help me, please? I can live without hibernation for now, but the default grub option is giving me problems because it freezes on every boot. I have to bring up the menu manually, navigate to Advanced and select the kernel to boot. It is not a big problem, but it is annoying.

I have regenerated the grub menu, and I have even installed another kernel version in case the entry pointing to the current kernel was wrong, but it is still failing.

Best regards.

Mirdarthos · 30 May 2023 14:09

Unfortunately I am extremely meticulous when I take care of my system. (Yeah, this is because I had to reinstall because I wasn’t careful, and don’t want to go through it again.) So I don’t really have experience or knowledge of that. But I can point you here:

and possibly here:

Because you have exhausted my knowledge.

Danixu · 30 May 2023 14:16

Thanks anyway

My knowledge in these cases is limited too, for the same reason: I don’t like to break the system I work on and have to reinstall it (the full setup is hell)… In my experience with Ubuntu (the worst of all I have used), Debian and Fedora, this is the first time I have had this kind of problem.

I’ll try the links you sent and I’ll try to continue searching for it once I am out of work.

Best regards.

xabbu · 30 May 2023 14:29

I did not mean the filesystem journal. I mean the normal system logs (which can be accessed via journalctl and are often just called the journal).

journalctl --list-boots

will list all available boots; pick the one where you had the problem (first column). The output might be shown in a pager like less; use the arrow keys to scroll and q to exit.
For example, for boot “-5” and priority level 3 (error); if that shows a lot, try 2:

journalctl -b -5 -p3

or just for kernel messages

journalctl -b -5 -k

Again, use the arrow keys to go through the logs, and replace -5 with the index of the boot in which the problem occurred.

The Arch wiki on the journal
https://wiki.archlinux.org/title/Systemd/Journal
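
To narrow such output down to filesystem trouble, the journal text can be piped through a small filter. The following is a rough sketch under assumptions: the helper names count_ext4_errors and ext4_error_groups are invented here, and the patterns simply match the "EXT4-fs error" and "bg N:" wording that the ext4 driver prints in kernel messages.

```shell
# Hypothetical helpers that read journal text on stdin.

# Print the number of lines that report an ext4 error.
count_ext4_errors() {
    # grep -c exits non-zero when the count is 0; keep the exit status clean.
    grep -c 'EXT4-fs error' || true
}

# Print the distinct block group numbers named in ext4 error lines.
ext4_error_groups() {
    sed -n 's/.*EXT4-fs error.*: bg \([0-9][0-9]*\):.*/\1/p' | sort -n -u
}

# Example on a real system (boot index -5 is only a placeholder):
#   journalctl -b -5 -k --no-pager | count_ext4_errors
#   journalctl -b -5 -k --no-pager | ext4_error_groups
```

Passing --no-pager makes journalctl write plain text to the pipe instead of opening less.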

Danixu · 30 May 2023 14:40

I see, sorry, I did not understand what you meant. Now I see that there were bitmap errors in the ext4 partition:

may 30 09:12:57 kernel: EXT4-fs error (device dm-0): ext4_validate_block_bitmap:406: comm upowerd: bg 891: bad block bitmap checksum
may 30 09:12:57 kernel: EXT4-fs error (device dm-0) in ext4_mb_clear_bb:6081: Filesystem failed CRC
may 30 09:12:57 kernel: EXT4-fs error (device dm-0): ext4_validate_inode_bitmap:105: comm tracker-miner-f: Corrupt inode bitmap - block_group = 2689, inode_bitmap = 88080401
may 29 06:07:38 kernel: EXT4-fs error (device dm-0): ext4_validate_block_bitmap:406: comm ext4lazyinit: bg 211: bad block bitmap checksum
may 29 06:07:38 kernel: EXT4-fs error (device dm-0): ext4_validate_block_bitmap:406: comm ext4lazyinit: bg 379: bad block bitmap checksum
may 29 06:07:38 kernel: EXT4-fs error (device dm-0): ext4_validate_block_bitmap:406: comm ext4lazyinit: bg 637: bad block bitmap checksum
may 29 06:07:38 kernel: EXT4-fs error (device dm-0): ext4_validate_block_bitmap:406: comm ext4lazyinit: bg 895: bad block bitmap checksum
may 29 06:07:38 kernel: EXT4-fs error (device dm-0): ext4_validate_block_bitmap:406: comm ext4lazyinit: bg 1021: bad block bitmap checksum
may 29 06:07:38 kernel: EXT4-fs error (device dm-0): ext4_validate_block_bitmap:406: comm ext4lazyinit: bg 1148: bad block bitmap checksum
may 29 06:07:38 kernel: EXT4-fs error (device dm-0): ext4_validate_block_bitmap:406: comm ext4lazyinit: bg 1534: bad block bitmap checksum
may 29 06:07:38 kernel: EXT4-fs error (device dm-0): ext4_validate_block_bitmap:406: comm ext4lazyinit: bg 2814: bad block bitmap checksum
may 29 06:07:38 kernel: EXT4-fs error (device dm-0): ext4_validate_block_bitmap:406: comm ext4lazyinit: bg 2926: bad block bitmap checksum
may 29 06:07:38 kernel: EXT4-fs error (device dm-0): ext4_validate_block_bitmap:406: comm ext4lazyinit: bg 3071: bad block bitmap checksum

This was the first time I got the RO problem, and it looks like it is happening all the time now.

xabbu · 30 May 2023 16:06

It looks like your NVMe is damaged. The ext4 filesystem is reporting bad block bitmap checksums, which often means the NVMe is returning corrupt data. This results in a read-only remount so that the filesystem is not damaged further.

The SMART data looked good, but that doesn’t always mean the device is undamaged.

Danixu · 31 May 2023 08:12

I’ll keep an eye on it, and if it continues I’ll have to replace it. Thanks!

I did another full check, including a surface check even though an SSD doesn’t really need it, and fsck made two fixes to the filesystem. They did not look related to the problem, but I am not an expert…

Danixu · 31 May 2023 11:58

I have finally fixed the boot problem. For a reason I don’t know, kernel 6.1 is broken on my computer. I tried reinstalling it several times, but it kept failing, so I installed version 6.3.

After changing the default boot kernel using this command:

grub-set-default "1>2"

The normal behaviour of grub has been restored, and now the last selected kernel boots with the default option, so just selecting the 6.3 kernel in the advanced menu has fixed the default boot option.

About the NVMe drive: after checking it again with fsck (forced), I have not seen any other error so far (5 hours of work), so maybe it was fixed.
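
For context on the "1>2" argument: GRUB counts menu entries from 0, and ">" separates a top-level entry from an entry inside its submenu. grub-set-default only takes effect when GRUB_DEFAULT=saved is set in /etc/default/grub. The helper below is a hypothetical sketch (describe_grub_default is not a GRUB tool) that just spells out what such a string selects; it does not touch the boot configuration.

```shell
# Hypothetical helper: explain a grub-set-default style entry such as "1>2".
describe_grub_default() {
    entry=$1
    case $entry in
        *">"*)
            # Text before the first ">" is the top-level entry,
            # the remainder is the position inside its submenu.
            printf 'top-level entry %s, submenu entry %s (counting from 0)\n' \
                "${entry%%">"*}" "${entry#*">"}"
            ;;
        *)
            printf 'top-level entry %s (counting from 0)\n' "$entry"
            ;;
    esac
}

describe_grub_default "1>2"
```

Which kernel sits at a given position depends on the generated menu, so it is worth checking the entries in grub.cfg before relying on a numeric index.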

Best regards.

Mirdarthos · 31 May 2023 12:03

Maybe. We wouldn’t know, unfortunately. However, it might have been due to the kernel, then…I don’t really know…

I’m glad you got it fixed. Does my heart good to see people overcoming a challenge.

system · 3 June 2023 02:03

This topic was automatically closed 2 days after the last reply. New replies are no longer allowed.