Intermittent boot failures

My Manjaro installation fails to boot about 50% of the time. I compared the output of journalctl -b (the latest successful boot) with journalctl -b -1 (the unsuccessful boot before it). The unsuccessful boot’s log shows no explicit errors and ends with the following output:

systemd-journald[297]: Time spent on flushing to /var/log/journal/5fc951759f6d4b6a8ccd4f1fb4460120 is 22.272ms for 800 entries.

The successful boot continues like this:

systemd-journald[299]: Time spent on flushing to /var/log/journal/5fc951759f6d4b6a8ccd4f1fb4460120 is 7.713ms for 802 entries.
systemd-journald[299]: System Journal (/var/log/journal/5fc951759f6d4b6a8ccd4f1fb4460120) is 4.0G, max 4.0G, 0B free.

Does this mean the failure happens in systemd-journald? Or is it just that the log wasn’t properly flushed during the unsuccessful boot?
If the log is cut off short, then what can I do to investigate the boot failure?
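
One way to see what the failed boot is missing is to find its last message in the log of a successful boot and list what came next. The following is only a rough sketch under stated assumptions: the function names are made up for illustration, and both logs are assumed to have been saved as lists of text lines (for example from journalctl -b and journalctl -b -1). PIDs and numbers are masked so that the same message from two different boots compares equal.

```python
import re

def normalize(line):
    """Mask PIDs and numbers so the same message from two boots compares equal."""
    line = re.sub(r"\[\d+\]", "[]", line)          # systemd-journald[297] -> systemd-journald[]
    return re.sub(r"\d+(\.\d+)?", "N", line).strip()

def messages_after_cutoff(failed_boot, good_boot):
    """Return the good-boot lines that follow the failed boot's last message.

    Returns an empty list if the last message of the failed boot
    does not occur in the good boot at all.
    """
    if not failed_boot:
        return list(good_boot)
    last = normalize(failed_boot[-1])
    for i, line in enumerate(good_boot):
        if normalize(line) == last:
            return good_boot[i + 1:]
    return []
```

If the returned list starts with routine messages, as in the two excerpts above, the failed boot most likely got further than its journal shows, or hung without logging anything.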

Welcome to the forum, @alexb

What makes you believe that the boot fails? Describe precisely what’s happening and you might be showered with helpful tips. :grin:

After I select the kernel in the grub menu there’s no visible output on screen. Login screen never appears. Ctrl+Alt+Fn doesn’t work. As I mentioned in my question, the journal seems to be truncated, or else something dies without producing any log output. I can see the point where systemd was started but soon after the log ends.

Disable the default “quiet” option in the GRUB settings to see the output as the system loads.

OK, I removed quiet. I saw a quick flash of targets all with the green OK next to them. Then the screen turned black with the cursor in the top left. And that was it. I could hear the fan working loudly.
journalctl -b -1 now looks different. This is how it ends.

Jan 14 17:06:31 Jaguar systemd-logind[475]: Power key pressed.
Jan 14 17:06:31 Jaguar systemd-logind[475]: Powering Off…
Jan 14 17:06:31 Jaguar systemd-logind[475]: System is powering down.

This means the system was still up when I lost patience and pressed the power button.

It looks like the system fails to switch to graphical mode. I found this in the journal:

Jan 14 17:02:00 Jaguar kernel: nvidia 0000:01:00.0: can’t change power state from D3cold to D0 (config space inaccessible)
Jan 14 17:02:00 Jaguar kernel: NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x23:0x56:574)
Jan 14 17:02:00 Jaguar kernel: NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
Jan 14 17:02:00 Jaguar kernel: kernel read not supported for file pci0000:00/0000:00:01.0/0000:01:00.0/config (pid: 516 comm: Xorg)
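
To check quickly whether a given boot hit the same GPU initialisation failure, the journal lines can be filtered for the messages quoted above. This is a minimal sketch, not an official tool: the function name and the pattern list are assumptions chosen for illustration, and the patterns only cover the messages seen in this thread.

```python
import re

# Patterns taken from the kernel messages quoted above. "can.t" matches both
# the ASCII apostrophe and the typographic one that forum software may insert.
GPU_FAILURE_PATTERNS = [
    re.compile(r"can.t change power state from D3cold to D0"),
    re.compile(r"RmInitAdapter failed"),
    re.compile(r"rm_init_adapter failed"),
]

def find_gpu_failures(journal_lines):
    """Return the lines that match any of the known GPU failure messages."""
    return [
        line.rstrip("\n")
        for line in journal_lines
        if any(pattern.search(line) for pattern in GPU_FAILURE_PATTERNS)
    ]
```

The input can be any iterable of lines, for example the saved output of journalctl -b -1 -k, which restricts the previous boot’s journal to kernel messages.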

That is exactly how you created your issue…

After 8 min of seeing a black screen? Happy to hear a better solution.

Give details of your system (inxi) and the full boot logs. You have some NVIDIA problem there, according to the few log lines you pasted. NVIDIA is notoriously problematic with its closed, proprietary driver. Could have bought Radeon :man_shrugging:

Here you go. I don’t see anything suspicious except for the last line.

(base) [alex@Jaguar ~]$ inxi
CPU: 6-Core Intel Core i7-8750H (-MT MCP-) speed/min/max: 800/800/4100 MHz
Kernel: 5.10.2-2-MANJARO x86_64 Up: 15h 21m Mem: 3215.4/31959.4 MiB (10.1%)
Storage: 1.36 TiB (25.3% used) Procs: 393 Shell: Bash inxi: 3.2.01

(base) [alex@Jaguar ~]$ sudo journalctl -b -1 |grep nvidia
Jan 15 19:07:43 Jaguar kernel: audit: type=1400 audit(1610755663.649:3): apparmor=“STATUS” operation=“profile_load” profile=“unconfined” name=“nvidia_modprobe” pid=292 comm=“apparmor_parser”
Jan 15 19:07:43 Jaguar kernel: audit: type=1400 audit(1610755663.649:4): apparmor=“STATUS” operation=“profile_load” profile=“unconfined” name=“nvidia_modprobe//kmod” pid=292 comm=“apparmor_parser”
Jan 15 19:07:43 Jaguar audit[292]: AVC apparmor=“STATUS” operation=“profile_load” profile=“unconfined” name=“nvidia_modprobe” pid=292 comm=“apparmor_parser”
Jan 15 19:07:43 Jaguar audit[292]: AVC apparmor=“STATUS” operation=“profile_load” profile=“unconfined” name=“nvidia_modprobe//kmod” pid=292 comm=“apparmor_parser”
Jan 15 19:07:43 Jaguar kernel: nvidia: loading out-of-tree module taints kernel.
Jan 15 19:07:43 Jaguar kernel: nvidia: module license ‘NVIDIA’ taints kernel.
Jan 15 19:07:43 Jaguar kernel: nvidia: module verification failed: signature and/or required key missing - tainting kernel
Jan 15 19:07:43 Jaguar kernel: nvidia-nvlink: Nvlink Core is being initialized, major device number 237
Jan 15 19:07:43 Jaguar kernel: nvidia 0000:01:00.0: enabling device (0006 -> 0007)
Jan 15 19:07:44 Jaguar systemd-modules-load[273]: Inserted module ‘nvidia’
Jan 15 19:07:44 Jaguar kernel: nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms 455.45.01 Thu Nov 5 22:55:44 UTC 2020
Jan 15 19:07:44 Jaguar kernel: [drm] [nvidia-drm] [GPU ID 0x00000100] Loading driver
Jan 15 19:07:44 Jaguar kernel: [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:01:00.0 on minor 0
Jan 15 19:07:44 Jaguar systemd-modules-load[273]: Inserted module ‘nvidia_drm’
Jan 15 19:07:45 Jaguar kernel: nvidia 0000:01:00.0: Enabling HDA controller
Jan 15 19:07:47 Jaguar kernel: nvidia 0000:01:00.0: can’t change power state from D3cold to D0 (config space inaccessible)

Sorry, I’m new to this forum… How do I upload the full log?

I wasn’t specific, sorry:
inxi -Fazy

> System:
>   Kernel: 5.10.2-2-MANJARO x86_64 bits: 64 compiler: gcc v: 10.2.0 
>   parameters: BOOT_IMAGE=/boot/vmlinuz-5.10-x86_64 
>   root=UUID=0a7f4276-a286-4bf7-ad22-a5220cfc6658 rw quiet apparmor=1 
>   security=apparmor udev.log_priority=3 
>   Desktop: Xfce 4.14.3 tk: Gtk 3.24.23 info: xfce4-panel wm: xfwm4 
>   dm: LightDM 1.30.0 Distro: Manjaro Linux 
> Machine:
>   Type: Laptop System: Micro-Star product: GF63 8RC v: REV:1.0 
>   serial: <filter> Chassis: type: 10 serial: <filter> 
>   Mobo: Micro-Star model: MS-16R1 v: REV:1.0 serial: <filter> 
>   UEFI: American Megatrends v: E16R1IMS.10D date: 09/24/2019 
> Battery:
>   ID-1: BAT1 charge: 2.5 Wh condition: 48.6/51.3 Wh (95%) volts: 11.8/11.4 
>   model: MSI Corp. MS-16R1 type: Li-ion serial: N/A status: Charging 
> CPU:
>   Info: 6-Core model: Intel Core i7-8750H bits: 64 type: MT MCP 
>   arch: Kaby Lake note: check family: 6 model-id: 9E (158) stepping: A (10) 
>   microcode: DE L2 cache: 9 MiB 
>   flags: avx avx2 lm nx pae sse sse2 sse3 sse4_1 sse4_2 ssse3 vmx 
>   bogomips: 52815 
>   Speed: 800 MHz min/max: 800/4100 MHz Core speeds (MHz): 1: 800 2: 800 3: 800 
>   4: 800 5: 800 6: 800 7: 801 8: 800 9: 800 10: 800 11: 800 12: 800 
>   Vulnerabilities: Type: itlb_multihit status: KVM: VMX disabled 
>   Type: l1tf 
>   mitigation: PTE Inversion; VMX: conditional cache flushes, SMT vulnerable 
>   Type: mds mitigation: Clear CPU buffers; SMT vulnerable 
>   Type: meltdown mitigation: PTI 
>   Type: spec_store_bypass 
>   mitigation: Speculative Store Bypass disabled via prctl and seccomp 
>   Type: spectre_v1 
>   mitigation: usercopy/swapgs barriers and __user pointer sanitization 
>   Type: spectre_v2 mitigation: Full generic retpoline, IBPB: conditional, 
>   IBRS_FW, STIBP: conditional, RSB filling 
>   Type: srbds mitigation: Microcode 
>   Type: tsx_async_abort status: Not affected 
> Graphics:
>   Device-1: Intel UHD Graphics 630 vendor: Micro-Star MSI driver: i915 
>   v: kernel bus ID: 00:02.0 chip ID: 8086:3e9b 
>   Device-2: NVIDIA GP107M [GeForce GTX 1050 Mobile] vendor: Micro-Star MSI 
>   driver: nvidia v: 455.45.01 alternate: nouveau,nvidia_drm bus ID: 01:00.0 
>   chip ID: 10de:1c92 
>   Display: x11 server: X.Org 1.20.10 driver: intel,nvidia 
>   unloaded: modesetting,nouveau alternate: fbdev,nv,vesa display ID: :0.0 
>   screens: 1 
>   Screen-1: 0 s-res: 1920x1080 s-dpi: 96 s-size: 508x285mm (20.0x11.2") 
>   s-diag: 582mm (22.9") 
>   Monitor-1: eDP1 res: 1920x1080 hz: 60 dpi: 143 size: 340x190mm (13.4x7.5") 
>   diag: 389mm (15.3") 
>   OpenGL: renderer: Mesa Intel UHD Graphics 630 (CFL GT2) v: 4.6 Mesa 20.3.1 
>   direct render: Yes 
> Audio:
>   Device-1: Intel Cannon Lake PCH cAVS vendor: Micro-Star MSI 
>   driver: snd_hda_intel v: kernel alternate: snd_soc_skl,snd_sof_pci 
>   bus ID: 00:1f.3 chip ID: 8086:a348 
>   Sound Server: ALSA v: k5.10.2-2-MANJARO 
> Network:
>   Device-1: Intel Wireless-AC 9560 [Jefferson Peak] driver: iwlwifi v: kernel 
>   port: 5000 bus ID: 00:14.3 chip ID: 8086:a370 
>   IF: wlo1 state: up mac: <filter> 
>   Device-2: Qualcomm Atheros QCA8171 Gigabit Ethernet vendor: Micro-Star MSI 
>   driver: alx v: kernel port: 3000 bus ID: 03:00.0 chip ID: 1969:10a1 
>   IF: enp3s0 state: down mac: <filter> 
> Drives:
>   Local Storage: total: 1.36 TiB used: 353.21 GiB (25.3%) 
>   SMART Message: Required tool smartctl not installed. Check --recommends 
>   ID-1: /dev/nvme0n1 maj-min: 259:0 vendor: Samsung model: SSD 970 EVO 500GB 
>   size: 465.76 GiB block size: physical: 512 B logical: 512 B speed: 31.6 Gb/s 
>   lanes: 4 serial: <filter> rev: 2B2QEXE7 temp: 28.9 C 
>   ID-2: /dev/sda maj-min: 8:0 vendor: Seagate model: ST1000LM049-2GH172 
>   size: 931.51 GiB block size: physical: 4096 B logical: 512 B speed: 6.0 Gb/s 
>   serial: <filter> rev: SDM1 
> Partition:
>   ID-1: / raw size: 465.47 GiB size: 457.16 GiB (98.22%) 
>   used: 353.21 GiB (77.3%) fs: ext4 dev: /dev/nvme0n1p2 maj-min: 259:2 
>   ID-2: /boot/efi raw size: 300 MiB size: 299.4 MiB (99.80%) 
>   used: 292 KiB (0.1%) fs: vfat dev: /dev/nvme0n1p1 maj-min: 259:1 
> Swap:
>   Alert: No Swap data was found. 
> Sensors:
>   System Temperatures: cpu: 45.0 C mobo: N/A 
>   Fan Speeds (RPM): N/A 
> Info:
>   Processes: 314 Uptime: 5m wakeups: 1 Memory: 31.21 GiB used: 2.59 GiB (8.3%) 
>   Init: systemd v: 247 Compilers: gcc: 10.2.0 Packages: pacman: 1409 lib: 440 
>   flatpak: 0 Shell: Bash v: 5.1.0 running in: xfce4-terminal inxi: 3.2.01

You should install the BIOS update available for your hardware.

That’s the first thing I did and it didn’t help.

Jan 15 19:07:47 Jaguar kernel: nvidia 0000:01:00.0: can’t change power state from D3cold to D0 (config space inaccessible)

The system is unable to initialize the GPU from the D3cold power state.

Is this system dual-booting with Windows?

Does the Linux boot fail if you choose Reboot from Windows, but not if you use Shutdown?

If the answer to both of those questions is yes, then you probably need to turn off the ‘fast startup’ option in Windows.

How to disable Windows 10 fast startup (and why you'd want to) | Windows Central
Linux users will likely see complications with dual boot and virtualization

Fast startup uses hybrid hibernation instead of completely powering down devices.
Rebooting to Linux can fail because a device cannot be initialised from the hybrid hibernation state.
But if Linux is restarted, devices get powered down correctly and work OK on the second boot,
so it could appear to be an intermittent issue happening about 50% of the time.

Yes

I virtually never use Windows on that laptop. The boot issues have nothing to do with Windows.

I found the same issue on nvidia forum:
https://forums.developer.nvidia.com/t/bug-cant-change-power-state-from-d3cold-to-d0-config-space-inaccessible-stuck-at-boot/112912
It looks like the problem is this: an earlier step in the boot process instructs the GPU to switch to D3cold. While the GPU is switching, it cannot accept another power state change request. If the first switch finishes before the second request arrives, everything works fine. If not, we get that error.
The response from nvidia seems to be that the distro didn’t configure udev rules correctly…
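
To see how the kernel currently manages the GPU’s runtime power state, the standard runtime power-management attributes in sysfs can be read directly. The sketch below assumes the PCI address 0000:01:00.0 from the logs above and the usual sysfs location /sys/bus/pci/devices. The function name is made up for illustration, and the attributes may be missing on some systems, in which case None is reported.

```python
from pathlib import Path

def read_pci_power_state(address, sysfs_root="/sys/bus/pci/devices"):
    """Return the runtime power-management attributes of one PCI device.

    "control" is normally "auto" or "on"; "runtime_status" reports the
    current runtime PM state as seen by the kernel.
    """
    power_dir = Path(sysfs_root) / address / "power"
    state = {}
    for name in ("control", "runtime_status"):
        attr = power_dir / name
        # Attributes may be absent on some kernels or devices; report None then.
        state[name] = attr.read_text().strip() if attr.is_file() else None
    return state
```

Calling read_pci_power_state("0000:01:00.0") after a successful boot shows whether runtime power management is enabled for the GPU, which is the setting that udev rules for the driver would typically change.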