Fresh install: 50% boot success rate (laptops with NVIDIA dGPU)

TLDR Solution & Cause:
Manjaro 21.1+ now defaults to kernel 5.13. On a fresh install, laptops with an NVIDIA dGPU may fail to reach the desktop environment on roughly half of all boots. You can either
a) Switch to the 5.10 LTS kernel, or
b) Install optimus-manager and enable it instead of the default NVIDIA PRIME setup
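As a rough sketch of the two options above: the helper names `run`, `use_lts_kernel` and `use_optimus_manager` and the `DRY_RUN` variable are hypothetical and exist only for illustration. It assumes `linux510` is the Manjaro package name for the 5.10 kernel and that optimus-manager is installable through pacman on your system. optimus-manager may need additional display manager configuration that is not shown here, so check its README before relying on this.

```shell
# Print the command instead of running it when DRY_RUN is set (hypothetical helper).
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Option a: install the 5.10 LTS kernel alongside the current one.
use_lts_kernel() {
    run mhwd-kernel -li                 # list the kernels already installed
    run sudo mhwd-kernel -i linux510    # install 5.10 LTS (assumed package name)
}

# Option b: install optimus-manager and enable its service.
use_optimus_manager() {
    run sudo pacman -S optimus-manager
    run sudo systemctl enable optimus-manager.service
}
```

Running `DRY_RUN=1 use_lts_kernel` only prints the commands, so you can review them first. After installing 5.10 you still have to pick it in the GRUB menu at boot.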

The last update caused a new issue, but only on a fresh install of KDE 21.1.1 that was then upgraded to 21.1.2 via the Software Update utility. On my prior installation there were no issues after the upgrade.

Even though there were no issues, I decided to wipe my Manjaro installation and reinstall from scratch.
After doing this, the boot success rate is about 50%. Half the time it gets stuck before the desktop environment initializes, on the screen showing:
/dev/[somedisk], clean, [numbers]/[numbers] files, [numbers/numbers] blocks

Prior to reinstalling I was using kernel 5.13 and then upgraded to 5.14, with no issues on either across multiple reboots. However, with the fresh install the issue occurs on both 5.13 and 5.14.

I see the Manjaro team has now updated the download links to 21.1.2; previously they had not been updated. I can try a fresh install with 21.1.2 to see if it makes a difference.

Where do I need to look for logs to find out why it gets stuck on half of the boots?
I have not made any system modifications at this point (e.g. using optimus-manager instead of NVIDIA prime-run); this is a purely default Manjaro installation.

It is worth noting that the latest version of Manjaro defaults to the 5.13 kernel on a fresh install.

inxi --admin --verbosity=7 --filter --no-host --width:

inxi output from a successful boot with 5.14
System:
  Kernel: 5.14.0-0-MANJARO x86_64 bits: 64 compiler: gcc v: 11.1.0 
  parameters: BOOT_IMAGE=/boot/vmlinuz-5.14-x86_64 
  root=UUID= rw quiet apparmor=1
  security=apparmor udev.log_priority=3 
  Console: tty pts/3 wm: kwin_x11 DM: SDDM Distro: Manjaro Linux 
  base: Arch Linux 
Machine:
  Type: Laptop System: Micro-Star product: GS65 Stealth Thin 8RF v: REV:1.0 
  serial: <filter> Chassis: type: 10 serial: <filter> 
  Mobo: Micro-Star model: MS-16Q2 v: REV:1.0 serial: <filter> 
  UEFI: American Megatrends v: E16Q2IMS.112 date: 05/21/2019 
Battery:
  ID-1: BAT1 charge: 45.7 Wh (67.1%) condition: 68.1/80.3 Wh (84.9%) 
  volts: 15.5 min: 15.2 model: MSI BIF0_9 type: Li-ion serial: N/A 
  status: Discharging 
Memory:
  RAM: total: 31.2 GiB used: 2.65 GiB (8.5%) 
  Array-1: capacity: 32 GiB slots: 2 EC: None max-module-size: 16 GiB 
  note: est. 
  Device-1: ChannelA-DIMM0 size: 16 GiB speed: 2667 MT/s type: DDR4 
  detail: synchronous bus-width: 64 bits total: 64 bits manufacturer: Samsung 
  part-no: M471A2K43CB1-CTD serial: <filter> 
  Device-2: ChannelB-DIMM0 size: 16 GiB speed: 2667 MT/s type: DDR4 
  detail: synchronous bus-width: 64 bits total: 64 bits manufacturer: Samsung 
  part-no: M471A2K43CB1-CTD serial: <filter> 
CPU:
  Info: 6-Core model: Intel Core i7-8750H socket: U3E1 bits: 64 type: MT MCP 
  arch: Kaby Lake note: check family: 6 model-id: 9E (158) stepping: A (10) 
  microcode: EA cache: L1: 384 KiB L2: 9 MiB L3: 9 MiB bogomips: 52815 
  Speed: 900 MHz min/max: 800/4100 MHz base/boost: 2100/8300 volts: 0.8 V 
  ext-clock: 100 MHz Core speeds (MHz): 1: 900 2: 900 3: 900 4: 900 5: 900 
  6: 900 7: 900 8: 900 9: 900 10: 900 11: 900 12: 900 
  Flags: 3dnowprefetch abm acpi adx aes aperfmperf apic arat arch_perfmon art 
  avx avx2 bmi1 bmi2 bts clflush clflushopt cmov constant_tsc cpuid 
  cpuid_fault cx16 cx8 de ds_cpl dtes64 dtherm dts epb ept ept_ad erms est 
  f16c flexpriority flush_l1d fma fpu fsgsbase fxsr ht hwp hwp_act_window 
  hwp_epp hwp_notify ibpb ibrs ida intel_pt invpcid invpcid_single lahf_lm lm 
  mca mce md_clear mmx monitor movbe mpx msr mtrr nonstop_tsc nopl nx pae pat 
  pbe pcid pclmulqdq pdcm pdpe1gb pebs pge pln pni popcnt pse pse36 pti pts 
  rdrand rdseed rdtscp rep_good sdbg sep smap smep ss ssbd sse sse2 sse4_1 
  sse4_2 ssse3 stibp syscall tm tm2 tpr_shadow tsc tsc_adjust 
  tsc_deadline_timer vme vmx vnmi vpid x2apic xgetbv1 xsave xsavec xsaveopt 
  xsaves xtopology xtpr 
  Vulnerabilities: Type: itlb_multihit status: KVM: VMX disabled 
  Type: l1tf 
  mitigation: PTE Inversion; VMX: conditional cache flushes, SMT vulnerable 
  Type: mds mitigation: Clear CPU buffers; SMT vulnerable 
  Type: meltdown mitigation: PTI 
  Type: spec_store_bypass 
  mitigation: Speculative Store Bypass disabled via prctl and seccomp 
  Type: spectre_v1 
  mitigation: usercopy/swapgs barriers and __user pointer sanitization 
  Type: spectre_v2 mitigation: Full generic retpoline, IBPB: conditional, 
  IBRS_FW, STIBP: conditional, RSB filling 
  Type: srbds mitigation: Microcode 
  Type: tsx_async_abort status: Not affected 
Graphics:
  Device-1: Intel CoffeeLake-H GT2 [UHD Graphics 630] vendor: Micro-Star MSI 
  driver: i915 v: kernel bus-ID: 00:02.0 chip-ID: 8086:3e9b class-ID: 0300 
  Device-2: NVIDIA GP104M [GeForce GTX 1070 Mobile] vendor: Micro-Star MSI 
  driver: nvidia v: 470.63.01 alternate: nouveau,nvidia_drm bus-ID: 01:00.0 
  chip-ID: 10de:1ba1 class-ID: 0300 
  Display: server: X.Org 1.20.13 compositor: kwin_x11 driver: 
  loaded: modesetting,nvidia alternate: fbdev,nouveau,nv,vesa display-ID: :0 
  screens: 1 
  Screen-1: 0 s-res: 1920x1080 s-dpi: 96 s-size: 508x285mm (20.0x11.2") 
  s-diag: 582mm (22.9") 
  Monitor-1: eDP-1 res: 1920x1080 hz: 144 dpi: 142 size: 344x193mm (13.5x7.6") 
  diag: 394mm (15.5") 
  OpenGL: renderer: Mesa Intel UHD Graphics 630 (CFL GT2) v: 4.6 Mesa 21.2.1 
  direct render: Yes 
Audio:
  Device-1: Intel Cannon Lake PCH cAVS vendor: Micro-Star MSI 
  driver: snd_hda_intel v: kernel alternate: snd_soc_skl,snd_sof_pci_intel_cnl 
  bus-ID: 00:1f.3 chip-ID: 8086:a348 class-ID: 0403 
  Device-2: NVIDIA GP104 High Definition Audio driver: snd_hda_intel v: kernel 
  bus-ID: 01:00.1 chip-ID: 10de:10f0 class-ID: 0403 
  Sound Server-1: ALSA v: k5.14.0-0-MANJARO running: yes 
  Sound Server-2: JACK v: 1.9.19 running: no 
  Sound Server-3: PulseAudio v: 15.0 running: yes 
  Sound Server-4: PipeWire v: 0.3.34 running: yes 
Network:
  Device-1: Intel Cannon Lake PCH CNVi WiFi 
  vendor: Rivet Networks Killer Wireless-AC 1550i Wireless driver: iwlwifi 
  v: kernel port: 5000 bus-ID: 00:14.3 chip-ID: 8086:a370 class-ID: 0280 
  IF: wlo1 state: up mac: <filter> 
  IP v4: <filter> type: dynamic noprefixroute scope: global 
  broadcast: <filter> 
  IP v6: <filter> type: noprefixroute scope: link 
  Device-2: Qualcomm Atheros Killer E2500 Gigabit Ethernet 
  vendor: Micro-Star MSI driver: alx v: kernel port: 3000 bus-ID: 3c:00.0 
  chip-ID: 1969:e0b1 class-ID: 0200 
  IF: enp60s0 state: down mac: <filter> 
  WAN IP: <filter> 
Bluetooth:
  Device-1: Intel Bluetooth 9460/9560 Jefferson Peak (JfP) type: USB 
  driver: btusb v: 0.8 bus-ID: 1-14:3 chip-ID: 8087:0aaa class-ID: e001 
  Report: rfkill ID: hci0 rfk-id: 1 state: down bt-service: enabled,running 
  rfk-block: hardware: no software: yes address: see --recommends 
Logical:
  Message: No logical block device data found. 
RAID:
  Message: No RAID data found. 
Drives:
  Local Storage: total: 476.94 GiB used: 54.73 GiB (11.5%) 
  ID-1: /dev/nvme0n1 maj-min: 259:0 vendor: Samsung model: MZVLB512HAJQ-00000 
  size: 476.94 GiB block-size: physical: 512 B logical: 512 B speed: 31.6 Gb/s 
  lanes: 4 type: SSD serial: <filter> rev: EXA7201Q temp: 31.9 C scheme: GPT 
  SMART: yes health: PASSED on: 69d 4h cycles: 2,374 
  read-units: 53,131,503 [27.2 TB] written-units: 54,014,787 [27.6 TB] 
  Message: No optical or floppy data found. 
Partition:
  ID-1: / raw-size: 281.59 GiB size: 276.11 GiB (98.05%) 
  used: 54.7 GiB (19.8%) fs: ext4 block-size: 4096 B dev: /dev/nvme0n1p5 
  maj-min: 259:5 label: N/A uuid: 
  ID-2: /boot/efi raw-size: 100 MiB size: 96 MiB (96.00%) 
  used: 25.2 MiB (26.3%) fs: vfat block-size: 512 B dev: /dev/nvme0n1p1 
  maj-min: 259:1 label: N/A uuid: 
Swap:
  Alert: No swap data was found. 
Unmounted:
  ID-1: /dev/nvme0n1p2 maj-min: 259:2 size: 16 MiB fs: N/A label: N/A 
  uuid: N/A 
  ID-2: /dev/nvme0n1p3 maj-min: 259:3 size: 194.74 GiB fs: ntfs label: N/A 
  uuid: 
  ID-3: /dev/nvme0n1p4 maj-min: 259:4 size: 508 MiB fs: ntfs label: N/A 
  uuid: 
USB:
  Hub-1: 1-0:1 info: Full speed (or root) Hub ports: 16 rev: 2.0 
  speed: 480 Mb/s chip-ID: 1d6b:0002 class-ID: 0900 
  Device-1: 1-7:2 info: SteelSeries ApS SteelSeries KLC type: HID 
  driver: hid-generic,usbhid interfaces: 2 rev: 2.0 speed: 12 Mb/s 
  power: 300mA chip-ID: 1038:1122 class-ID: 0300 
  Device-2: 1-14:3 info: Intel Bluetooth 9460/9560 Jefferson Peak (JfP) 
  type: Bluetooth driver: btusb interfaces: 2 rev: 2.0 speed: 12 Mb/s 
  power: 100mA chip-ID: 8087:0aaa class-ID: e001 
  Hub-2: 2-0:1 info: Full speed (or root) Hub ports: 8 rev: 3.1 speed: 10 Gb/s 
  chip-ID: 1d6b:0003 class-ID: 0900 
Sensors:
  System Temperatures: cpu: 43.0 C mobo: N/A 
  Fan Speeds (RPM): N/A 
Info:
  Processes: 284 Uptime: 25m wakeups: 33 Init: systemd v: 248 tool: systemctl 
  Compilers: gcc: 11.1.0 Packages: pacman: 1230 lib: 348 flatpak: 0 
  Shell: Zsh (sudo) v: 5.8 default: Bash v: 5.1.8 running-in: konsole 
  inxi: 3.3.06 

Notable Hardware Events

[ 2.465211] psmouse serio1: synaptics: queried max coordinates: x […5666], y […4688]
[ 2.495201] psmouse serio1: synaptics: queried min coordinates: x [1274…], y [1166…]
[ 2.495231] psmouse serio1: synaptics: Your touchpad (PNP: SYN150d SYN1500 SYN0002 PNP0f13) says it can support a different bus. If i2c-hid and hid-rmi are not used, you might want to try setting psmouse.synaptics_intertouch to 1 and report this to linux-input@vger.kernel.org.

Please read this:

Especially the section about Linux kernels, then try with 5.10 LTS.

:crossed_fingers:


I do want to reiterate that when I installed a few months ago (and I have upgraded at every release since) there were no boot issues, even when I primarily ran 5.13. I also shut down or reboot every day. I am aware 5.13 is less stable than LTS, but again: same hardware, and no boot issues prior to the fresh install.

I’ll read that link and post relevant logs.

Update: I read the kernel section. Yes, I am aware of how to change kernels; how else would I have tried 5.14 and 5.13 before 5.13 was the default?
The link was a good read, but it did not cover logging or where to find output for a kernel panic etc. In my case it just gets stuck before the DE even initializes.

Here:

:+1:


Thank you. From this I reproduced the error and then used the command:
journalctl --catalog --priority=3 --boot=-1
to get:

-- Journal begins at Fri 2021-09-03 13:25:14 MST, ends at Mon 2021-09-06 13:16:57 MST. --
Sep 06 13:14:42 DESKTOP-L1N6B9P kernel: x86/cpu: SGX disabled by BIOS.
Sep 06 13:14:42 DESKTOP-L1N6B9P kernel: 
Sep 06 13:14:43 DESKTOP-L1N6B9P sddm[478]: Failed to read display number from pipe

So it seems it's failing to identify the display… but only on kernels 5.13 and 5.14 after the fresh install. Very weird. On 5.10 I have yet to reproduce this issue on the current install.

I have updated my original post with the other logs.

Edit: I think this is actually unrelated. If I get the logs of the currently working boot:
journalctl --catalog --priority=3 --boot=0

I see the same message, but it's working:
kernel 5.14

-- Journal begins at Fri 2021-09-03 13:25:14 MST, ends at Mon 2021-09-06 13:26:40 MST. --
Sep 06 13:15:46 DESKTOP-L1N6B9P kernel: x86/cpu: SGX disabled by BIOS.
Sep 06 13:15:46 DESKTOP-L1N6B9P kernel: 
Sep 06 13:15:47 DESKTOP-L1N6B9P sddm[484]: Failed to read display number from pipe
Sep 06 13:21:56 DESKTOP-L1N6B9P kernel: i915 0000:00:02.0: [drm] *ERROR* Atomic update failure on pipe A (start=3 end=4) time 1057 us, min 1063, max 1079, scanline start 906, end 1080
Sep 06 13:22:46 DESKTOP-L1N6B9P kernel: i915 0000:00:02.0: [drm] *ERROR* Atomic update failure on pipe A (start=14 end=15) time 1716 us, min 1063, max 1079, scanline start 1021, end 138
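This comparison of a failed boot against a working one can also be done mechanically. The following is only a sketch: the helper names `normalize_journal` and `only_in_first` and the log file names are hypothetical, and the `sed` patterns assume the default journalctl output format shown in the logs above (month, day, time, host, process, optional PID).

```shell
# Hypothetical helpers; they assume journalctl's default short output format.

# Reduce a saved journal to its unique messages: drop the "-- Journal begins"
# header, the "Mon DD HH:MM:SS host " prefix, the "[PID]" after the process
# name, and entries whose message text is empty.
normalize_journal() {
    sed -E \
        -e '/^-- /d' \
        -e 's/^[A-Z][a-z]{2} [ 0-9][0-9] [0-9:]{8} [^ ]+ //' \
        -e 's/\[[0-9]+\]:/:/' \
        -e '/^[^:]+: *$/d' | LC_ALL=C sort -u
}

# Print the messages that appear in the first log but not in the second.
only_in_first() {
    first=$(mktemp) && second=$(mktemp) || return 1
    normalize_journal < "$1" > "$first"
    normalize_journal < "$2" > "$second"
    LC_ALL=C comm -23 "$first" "$second"
    rm -f "$first" "$second"
}

# Example usage (not run here):
#   journalctl --priority=3 --boot=-1 > failed.log
#   journalctl --priority=3 --boot=0  > working.log
#   only_in_first failed.log working.log   # errors unique to the failed boot
```

For the two 5.14 logs quoted in this post, the failed boot has no priority 3 message that the working boot lacks, which is why the sddm line alone does not explain the hang.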

It could still be related, though. If I boot with kernel 5.10 the output is much shorter:
kernel 5.10
journalctl --catalog --priority=3 --boot=0

-- Journal begins at Fri 2021-09-03 13:25:14 MST, ends at Mon 2021-09-06 13:35:08 MST. --
Sep 06 13:34:30 DESKTOP-L1N6B9P kernel: 

Whereas kernel 5.13 (successful boot) still shows the same message:

-- Journal begins at Fri 2021-09-03 13:25:14 MST, ends at Mon 2021-09-06 13:37:33 MST. --
Sep 06 13:37:12 DESKTOP-L1N6B9P kernel: x86/cpu: SGX disabled by BIOS.
Sep 06 13:37:12 DESKTOP-L1N6B9P kernel: 
Sep 06 13:37:13 DESKTOP-L1N6B9P sddm[491]: Failed to read display number from pipe

I don’t know any more what you’re trying to prove here, but 5.10 works, so that’s the solution to your problem…

:man_shrugging:


I found the solution. Using NVIDIA PRIME without optimus-manager is again the cause. It is stable on 5.10, so for users who want to stay on PRIME, that is a solution.

For those who want to use 5.13/5.14: install optimus-manager and enable it instead of the default PRIME setup. With that in place, 5.13 and 5.14 no longer have this issue.

Otherwise, use 5.10 if you want to stay without optimus-manager.

I think the underlying issue is ultimately the same one seen with optimus-manager, where the NVIDIA GPU fails to respond to queries when Firefox starts or when running nvidia-smi. So there is likely an instability between the NVIDIA PRIME driver and 5.13/5.14 for now. Either way, this fixes the boot problem.

Maybe Manjaro should go back to 5.10 LTS as the default for fresh installs rather than 5.13, as this will affect a lot of laptop users with NVIDIA cards.
