System consistently freezes after each system upgrade when screen lock \ monitor standby kicks in

After each update that includes a kernel update (not necessarily a version upgrade, though that as well), i.e. any update that requires a restart, the system works fine until I go AFK for a while and the screen lock \ monitor standby kicks in.

After that AFK period, when I get back, I find my system frozen with the screen showing the BIOS logo (more recently just a black screen with a very rapidly blinking cursor).

Sometimes I’m able to restart it with rapid use of ctrl+alt+delete, sometimes even this fails. TTY switching doesn’t work either.

After a restart the system works fine.

It seems as if the system going to lock screen \ monitor standby causes the system to try and reload the nvidia driver, which fails.

Since the system works properly before this AFK period, I suspect it’d be easy to disable this reload and keep the system working until restart.

As I’m reporting this, I’ve just updated the system - so this can easily be reproduced right now.

So if there’s any missing information, any additional log files (or higher logging verbosity) needed, now is the time to mention it, so I can set it to log before letting my system freeze.

Here are the logs from last week, when it last happened:

/var/log/Xorg.1.log:

[1383800.788] 
X.Org X Server 1.20.11
X Protocol Version 11, Revision 0
[1383800.788] Build Operating System: Linux Manjaro Linux
[1383800.788] Current Operating System: Linux feature-precision3630tower 5.10.36-2-MANJARO #1 SMP PREEMPT Tue May 11 19:38:44 UTC 2021 x86_64
[1383800.788] Kernel command line: BOOT_IMAGE=/boot/vmlinuz-5.10-x86_64 root=UUID=14840702-a1f4-45f2-ac1d-bf20549068fa rw quiet apparmor=1 security=apparmor udev.log_priority=3 mitigations=off
[1383800.788] Build Date: 13 April 2021  04:11:08PM
[1383800.788]  
[1383800.788] Current version of pixman: 0.40.0
[1383800.788] 	Before reporting problems, check http://wiki.x.org
	to make sure that you have the latest version.
[1383800.788] Markers: (--) probed, (**) from config file, (==) default setting,
	(++) from command line, (!!) notice, (II) informational,
	(WW) warning, (EE) error, (NI) not implemented, (??) unknown.
[1383800.788] (==) Log file: "/var/log/Xorg.1.log", Time: Tue Jun  8 08:43:21 2021
[1383800.789] (==) Using config directory: "/etc/X11/xorg.conf.d"
[1383800.789] (==) Using system config directory "/usr/share/X11/xorg.conf.d"
[1383800.789] (==) ServerLayout "layout"
[1383800.789] (**) |-->Screen "Screen0" (0)
[1383800.789] (**) |   |-->Monitor "Monitor0"
[1383800.789] (**) |   |-->Device "Device0"
[1383800.789] (==) Automatically adding devices
[1383800.789] (==) Automatically enabling devices
[1383800.789] (==) Automatically adding GPU devices
[1383800.789] (==) Automatically binding GPU devices
[1383800.789] (==) Max clients allowed: 256, resource mask: 0x1fffff
[1383800.789] (WW) `fonts.dir' not found (or not valid) in "/usr/share/fonts/misc".
[1383800.789] 	Entry deleted from font path.
[1383800.789] 	(Run 'mkfontdir' on "/usr/share/fonts/misc").
[1383800.789] (WW) `fonts.dir' not found (or not valid) in "/usr/share/fonts/TTF".
[1383800.789] 	Entry deleted from font path.
[1383800.789] 	(Run 'mkfontdir' on "/usr/share/fonts/TTF").
[1383800.789] (WW) The directory "/usr/share/fonts/OTF" does not exist.
[1383800.789] 	Entry deleted from font path.
[1383800.789] (WW) The directory "/usr/share/fonts/Type1" does not exist.
[1383800.789] 	Entry deleted from font path.
[1383800.789] (WW) The directory "/usr/share/fonts/100dpi" does not exist.
[1383800.789] 	Entry deleted from font path.
[1383800.789] (WW) The directory "/usr/share/fonts/75dpi" does not exist.
[1383800.789] 	Entry deleted from font path.
[1383800.789] (==) FontPath set to:
	
[1383800.789] (==) ModulePath set to "/usr/lib/xorg/modules"
[1383800.789] (**) Extension "COMPOSITE" is enabled
[1383800.789] (II) The server relies on udev to provide the list of input devices.
	If no devices become available, reconfigure udev or disable AutoAddDevices.
[1383800.789] (II) Module ABI versions:
[1383800.789] 	X.Org ANSI C Emulation: 0.4
[1383800.789] 	X.Org Video Driver: 24.1
[1383800.789] 	X.Org XInput driver : 24.1
[1383800.789] 	X.Org Server Extension : 10.0
[1383800.792] (++) using VT number 8

[1383800.792] (II) systemd-logind: logind integration requires -keeptty and -keeptty was not provided, disabling logind integration
[1383800.793] (II) xfree86: Adding drm device (/dev/dri/card0)
[1383800.793] (II) xfree86: Adding drm device (/dev/dri/card1)
[1383800.794] (**) OutputClass "nvidia" ModulePath extended to "/usr/lib/nvidia/xorg,/usr/lib/xorg/modules,/usr/lib/xorg/modules"
[1383800.794] (**) OutputClass "nvidia" ModulePath extended to "/usr/lib/nvidia/xorg,/usr/lib/xorg/modules,/usr/lib/nvidia/xorg,/usr/lib/xorg/modules,/usr/lib/xorg/modules"
[1383800.795] (--) PCI: (0@0:2:0) 8086:3e92:1028:0871 rev 0, Mem @ 0xa2000000/16777216, 0x80000000/268435456, I/O @ 0x00004000/64
[1383800.795] (--) PCI:*(1@0:0:0) 10de:1b80:1028:3366 rev 161, Mem @ 0xa3000000/16777216, 0x90000000/268435456, 0xa0000000/33554432, I/O @ 0x00003000/128, BIOS @ 0x????????/524288
[1383800.795] (WW) Open ACPI failed (/var/run/acpid.socket) (No such file or directory)
[1383800.795] (II) LoadModule: "glx"
[1383800.795] (II) Loading /usr/lib/xorg/modules/extensions/libglx.so
[1383800.796] (II) Module glx: vendor="X.Org Foundation"
[1383800.796] 	compiled for 1.20.11, module version = 1.0.0
[1383800.796] 	ABI class: X.Org Server Extension, version 10.0
[1383800.796] (II) LoadModule: "nvidia"
[1383800.797] (II) Loading /usr/lib/xorg/modules/drivers/nvidia_drv.so
[1383800.797] (II) Module nvidia: vendor="NVIDIA Corporation"
[1383800.797] 	compiled for 1.6.99.901, module version = 1.0.0
[1383800.797] 	Module class: X.Org Video Driver
[1383800.797] (II) NVIDIA dlloader X Driver  465.31  Thu May 13 22:19:15 UTC 2021
[1383800.797] (II) NVIDIA Unified Driver for all Supported NVIDIA GPUs
[1383800.797] (II) Loading sub module "fb"
[1383800.797] (II) LoadModule: "fb"
[1383800.797] (II) Loading /usr/lib/xorg/modules/libfb.so
[1383800.797] (II) Module fb: vendor="X.Org Foundation"
[1383800.797] 	compiled for 1.20.11, module version = 1.0.0
[1383800.797] 	ABI class: X.Org ANSI C Emulation, version 0.4
[1383800.797] (II) Loading sub module "wfb"
[1383800.797] (II) LoadModule: "wfb"
[1383800.797] (II) Loading /usr/lib/xorg/modules/libwfb.so
[1383800.797] (II) Module wfb: vendor="X.Org Foundation"
[1383800.797] 	compiled for 1.20.11, module version = 1.0.0
[1383800.797] 	ABI class: X.Org ANSI C Emulation, version 0.4
[1383800.797] (II) Loading sub module "ramdac"
[1383800.797] (II) LoadModule: "ramdac"
[1383800.797] (II) Module "ramdac" already built-in
[1383800.798] (EE) NVIDIA: Failed to initialize the NVIDIA kernel module. Please see the
[1383800.798] (EE) NVIDIA:     system's kernel log for additional error messages and
[1383800.798] (EE) NVIDIA:     consult the NVIDIA README for details.
[1383800.798] (EE) NVIDIA: Failed to initialize the NVIDIA kernel module. Please see the
[1383800.798] (EE) NVIDIA:     system's kernel log for additional error messages and
[1383800.798] (EE) NVIDIA:     consult the NVIDIA README for details.
[1383800.798] (EE) NVIDIA: Failed to initialize the NVIDIA kernel module. Please see the
[1383800.798] (EE) NVIDIA:     system's kernel log for additional error messages and
[1383800.798] (EE) NVIDIA:     consult the NVIDIA README for details.
[1383800.799] (EE) NVIDIA: Failed to initialize the NVIDIA kernel module. Please see the
[1383800.799] (EE) NVIDIA:     system's kernel log for additional error messages and
[1383800.799] (EE) NVIDIA:     consult the NVIDIA README for details.
[1383800.799] (EE) No devices detected.
[1383800.799] (EE) 
Fatal server error:
[1383800.799] (EE) no screens found(EE) 
[1383800.799] (EE) 
Please consult the The X.Org Foundation support 
	 at http://wiki.x.org
 for help. 
[1383800.799] (EE) Please also check the log file at "/var/log/Xorg.1.log" for additional information.
[1383800.799] (EE) 
[1383800.803] (EE) Server terminated with error (1). Closing log file.
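
The (EE) lines above defer to the kernel log, which is not included here. A minimal sketch for pulling the nvidia-related kernel messages out of the journal after the next freeze; it assumes the journal is persistent (so the previous boot is still available), and filter_nvidia_lines is just an illustrative helper name:

```shell
# Keep only kernel-log lines that mention the NVIDIA driver.
# NVRM is the prefix the nvidia kernel module uses for its own messages.
filter_nvidia_lines() {
    grep -iE 'nvidia|nvrm'
}

# Usage after rebooting from a freeze (previous boot = -b -1):
#   journalctl -k -b -1 --no-pager | filter_nvidia_lines > nvidia-kernel.log
```

If the journal is not persistent, the previous boot will not be listed, and the messages would have to be captured some other way before the freeze.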

System details:

System:
  Kernel: 5.10.41-1-MANJARO x86_64 bits: 64 compiler: gcc v: 11.1.0 
  parameters: BOOT_IMAGE=/boot/vmlinuz-5.10-x86_64 
  root=UUID=14840702-a1f4-45f2-ac1d-bf20549068fa rw quiet apparmor=1 
  security=apparmor udev.log_priority=3 mitigations=off 
  Desktop: Xfce 4.16.0 tk: Gtk 3.24.24 info: xfce4-panel wm: xfwm4 vt: 7 
  dm: LightDM 1.30.0 Distro: Manjaro Linux base: Arch Linux 
Machine:
  Type: Desktop System: Dell product: Precision 3630 Tower v: N/A 
  serial: <filter> Chassis: type: 3 serial: <filter> 
  Mobo: Dell model: 0NNNCT v: A01 serial: <filter> UEFI: Dell v: 2.6.1 
  date: 07/01/2020 
Battery:
  Message: No system battery data found. Is one present? 
Memory:
  RAM: total: 31.2 GiB used: 4.11 GiB (13.2%) 
  RAM Report: permissions: Unable to run dmidecode. Root privileges required. 
CPU:
  Info: 6-Core model: Intel Core i7-8700 bits: 64 type: MT MCP arch: Kaby Lake 
  note: check family: 6 model-id: 9E (158) stepping: A (10) microcode: DE 
  cache: L2: 12 MiB bogomips: 76831 
  Speed: 900 MHz min/max: 800/4600 MHz Core speeds (MHz): 1: 900 2: 900 3: 900 
  4: 900 5: 900 6: 900 7: 900 8: 900 9: 900 10: 900 11: 900 12: 900 
  Flags: 3dnowprefetch abm acpi adx aes aperfmperf apic arat arch_perfmon art 
  avx avx2 bmi1 bmi2 bts clflush clflushopt cmov constant_tsc cpuid 
  cpuid_fault cx16 cx8 de ds_cpl dtes64 dtherm dts epb ept ept_ad erms est 
  f16c flexpriority flush_l1d fma fpu fsgsbase fxsr hle ht hwp hwp_act_window 
  hwp_epp hwp_notify ibpb ibrs ida intel_pt invpcid invpcid_single lahf_lm lm 
  mca mce md_clear mmx monitor movbe mpx msr mtrr nonstop_tsc nopl nx pae pat 
  pbe pcid pclmulqdq pdcm pdpe1gb pebs pge pln pni popcnt pse pse36 pts rdrand 
  rdseed rdtscp rep_good rtm sdbg sep smap smep smx ss ssbd sse sse2 sse4_1 
  sse4_2 ssse3 stibp syscall tm tm2 tpr_shadow tsc tsc_adjust 
  tsc_deadline_timer vme vmx vnmi vpid x2apic xgetbv1 xsave xsavec xsaveopt 
  xsaves xtopology xtpr 
  Vulnerabilities: Type: itlb_multihit status: KVM: VMX disabled 
  Type: l1tf mitigation: PTE Inversion; VMX: vulnerable 
  Type: mds status: Vulnerable; SMT vulnerable 
  Type: meltdown status: Vulnerable 
  Type: spec_store_bypass status: Vulnerable 
  Type: spectre_v1 status: Vulnerable: __user pointer sanitization and 
  usercopy barriers only; no swapgs barriers 
  Type: spectre_v2 status: Vulnerable, IBPB: disabled, STIBP: disabled 
  Type: srbds status: Vulnerable 
  Type: tsx_async_abort status: Vulnerable 
Graphics:
  Device-1: Intel UHD Graphics 630 vendor: Dell driver: i915 v: kernel 
  bus-ID: 00:02.0 chip-ID: 8086:3e92 class-ID: 0300 
  Device-2: NVIDIA GP104 [GeForce GTX 1080] vendor: Dell driver: nvidia 
  v: 465.31 alternate: nouveau,nvidia_drm bus-ID: 01:00.0 chip-ID: 10de:1b80 
  class-ID: 0300 
  Display: x11 server: X.Org 1.20.11 driver: loaded: nvidia display-ID: :0.0 
  screens: 1 
  Screen-1: 0 s-res: 1920x1080 s-dpi: 93 s-size: 524x292mm (20.6x11.5") 
  s-diag: 600mm (23.6") 
  Monitor-1: DP-4 res: 1920x1080 hz: 60 dpi: 94 size: 521x293mm (20.5x11.5") 
  diag: 598mm (23.5") 
  OpenGL: renderer: NVIDIA GeForce GTX 1080/PCIe/SSE2 v: 4.6.0 NVIDIA 465.31 
  direct render: Yes 
Audio:
  Device-1: Intel Cannon Lake PCH cAVS vendor: Dell driver: snd_hda_intel 
  v: kernel alternate: snd_soc_skl,snd_sof_pci bus-ID: 00:1f.3 
  chip-ID: 8086:a348 class-ID: 0403 
  Sound Server-1: ALSA v: k5.10.41-1-MANJARO running: yes 
  Sound Server-2: JACK v: 0.125.0 running: no 
  Sound Server-3: PulseAudio v: 14.2 running: yes 
  Sound Server-4: PipeWire v: 0.3.28 running: no 
Network:
  Device-1: Intel Ethernet I219-LM vendor: Dell driver: e1000e v: kernel 
  port: efa0 bus-ID: 00:1f.6 chip-ID: 8086:15bb class-ID: 0200 
  IF: eno1 state: up speed: 100 Mbps duplex: full mac: <filter> 
  IP v4: <filter> type: dynamic noprefixroute scope: global 
  broadcast: <filter> 
  IP v6: <filter> type: noprefixroute scope: link 
  IF-ID-1: docker0 state: down mac: <filter> 
  IP v4: <filter> scope: global broadcast: <filter> 
  WAN IP: <filter> 
Bluetooth:
  Message: No bluetooth data found. 
Logical:
  Message: No logical block device data found. 
RAID:
  Hardware-1: Intel SATA Controller [RAID mode] driver: ahci v: 3.0 port: 4060 
  bus-ID: 00:17.0 chip-ID: 8086.2822 rev: 10 class-ID: 0104 
Drives:
  Local Storage: total: 476.94 GiB used: 128.26 GiB (26.9%) 
  SMART Message: Unable to run smartctl. Root privileges required. 
  ID-1: /dev/sda maj-min: 8:0 vendor: Micron model: 1100 SATA 512GB 
  size: 476.94 GiB block-size: physical: 4096 B logical: 512 B speed: 6.0 Gb/s 
  rotation: SSD serial: <filter> rev: L022 scheme: GPT 
  Optical-1: /dev/sr0 vendor: TS8XDVDS model: TRANSCEND rev: 1.00 
  dev-links: cdrom 
  Features: speed: 24 multisession: yes audio: yes dvd: yes 
  rw: cd-r,cd-rw,dvd-r,dvd-ram state: running 
Partition:
  ID-1: / raw-size: 210.08 GiB size: 205.78 GiB (97.95%) 
  used: 128.2 GiB (62.3%) fs: ext4 dev: /dev/sda5 maj-min: 8:5 label: N/A 
  uuid: 14840702-a1f4-45f2-ac1d-bf20549068fa 
  ID-2: /boot/efi raw-size: 650 MiB size: 646 MiB (99.38%) 
  used: 66.5 MiB (10.3%) fs: vfat dev: /dev/sda1 maj-min: 8:1 label: ESP 
  uuid: B040-598E 
Swap:
  Alert: No swap data was found. 
Unmounted:
  ID-1: /dev/sda2 maj-min: 8:2 size: 128 MiB fs: <superuser required> 
  label: N/A uuid: N/A 
  ID-2: /dev/sda3 maj-min: 8:3 size: 265.12 GiB fs: ntfs label: OS 
  uuid: 426C78126C77FF4B 
  ID-3: /dev/sda4 maj-min: 8:4 size: 990 MiB fs: ntfs label: WINRETOOLS 
  uuid: C26A19566A194895 
USB:
  Hub-1: 1-0:1 info: Full speed (or root) Hub ports: 16 rev: 2.0 
  speed: 480 Mb/s chip-ID: 1d6b:0002 class-ID: 0900 
  Device-1: 1-9:2 info: Transcend Information Portable Super Multi Drive 
  type: Mass Storage driver: usb-storage interfaces: 1 rev: 2.0 
  speed: 480 Mb/s power: 500mA chip-ID: 8564:8000 class-ID: 0802 
  serial: <filter> 
  Hub-2: 2-0:1 info: Full speed (or root) Hub ports: 10 rev: 3.1 
  speed: 10 Gb/s chip-ID: 1d6b:0003 class-ID: 0900 
  Device-1: 2-8:2 info: Realtek USB3.0-CRW type: Mass Storage 
  driver: usb-storage interfaces: 1 rev: 3.0 speed: 5 Gb/s power: 800mA 
  chip-ID: 0bda:0328 class-ID: 0806 serial: <filter> 
Sensors:
  System Temperatures: cpu: 41.0 C mobo: N/A gpu: nvidia temp: 37 C 
  Fan Speeds (RPM): cpu: 1543 fan-2: 821 gpu: nvidia fan: 27% 
Info:
  Processes: 270 Uptime: 17m wakeups: 0 Init: systemd v: 247 tool: systemctl 
  Compilers: gcc: 11.1.0 alt: 10 clang: 11.1.0 Packages: pacman: 1465 lib: 435 
  flatpak: 0 Shell: Bash v: 5.1.8 running-in: xfce4-terminal inxi: 3.3.04

After an nvidia module update you should always restart… but anyway, to avoid reloading the module: disable the screensaver, standby, hibernation, etc. Also don’t start any new app that uses OpenGL or GPU acceleration.
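
As a rough sketch of the "disable screensaver and standby" part for the current X session only: xset can switch off the X blanking timer and DPMS. The function name is made up, and this does not touch xfce4-power-manager or the lock screen settings, which may still need to be disabled separately in Xfce.

```shell
# Turn off X screen blanking and DPMS monitor standby for the current session.
disable_blanking() {
    xset s off   # disable the X screensaver / blanking timer
    xset -dpms   # disable DPMS (monitor standby, suspend, off)
}

# After an update, from a terminal inside the running X session:
#   disable_blanking
```

The settings are lost when the X session ends, which is fine here since the goal is only to bridge the time until the next reboot.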

It never happens on Ubuntu or Mint, so I don’t think it’s an unavoidable fact that the system has to be restarted ASAP after upgrading… Seems like an avoidable bug.

In your case xorg restarted and couldn’t find a module that fits the old (still running) kernel, since the update installed a module built for the new kernel.

If there is no kernel update and you just get an update of the nvidia module for the same kernel, it won’t crash, because kernel and module still match.
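
One way to check whether you are in that mismatch state, as a sketch: on Arch-based systems a kernel package update typically replaces the module directory of the old release, so the still-running kernel can no longer load modules from disk. The helper name and the idea of passing the directory as a parameter are only for illustration.

```shell
# Succeed if the module directory for the given kernel release still exists.
# $1 = modules root (e.g. /usr/lib/modules), $2 = kernel release string
running_kernel_modules_present() {
    [ -d "$1/$2" ]
}

# Usage:
#   running_kernel_modules_present /usr/lib/modules "$(uname -r)" \
#       || echo "modules for the running kernel are gone: reboot before locking the screen"
```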

So I guess the kernel doesn’t change often on Ubuntu and Mint, and therefore I assume they just update the nvidia driver and not the kernel.

Besides that, Manjaro uses precompiled nvidia modules, while Ubuntu, Mint … Debian use DKMS, which recompiles the module on every kernel change.

Maybe change from MHWD solution (precompiled) to DKMS only?

I think the DKMS solution should indeed work if the previous kernel version isn’t removed during the upgrade. I wonder why it isn’t used by default in MHWD if it provides the superior user experience? Just to save the disk space required by the kernel headers, and make the installation faster, or is it because it’s less reliable? I’ve never had any problems with it on Ubuntu or Mint…

By the way, it happens even for kernel updates (i.e. still 5.10, but with minor version bump relating to security updates), which are about as frequent in Ubuntu and Mint as they are in Manjaro.

It speeds up the upgrade process, since you don’t have to compile the module yourself, and since precompiled modules are available, you can boot a live session with the nvidia driver, which would not be possible with DKMS.

However… I experience no problems after upgrading the nvidia driver. Just avoid anything that triggers a restart of xorg, which would also cause a crash on Ubuntu after upgrading kernel and module.

Is there a way to specify a fallback driver, so that if Xorg does restart and finds it can’t load the nvidia driver, it proceeds to load the fallback one?

The idea is not bad. You can name the xorg conf files in /etc/X11/xorg.conf.d/ like this: 10-nvidia.conf and 20-nvdia.conf. It will first load 10-nvidia.conf and then 20-nvdia.conf, so 20-nvdia.conf is your “fallback”. But the upgrade process overwrites the nvidia module… hm…

I am wondering why I never had problems. Maybe Ubuntu puts the nvidia driver into the initramfs by default… On Manjaro the equivalent is mkinitcpio, and there you need to add the driver yourself.

/etc/mkinitcpio.conf

MODULES=(nvidia nvidia_modeset nvidia_uvm nvidia_drm)

So in theory, the kernel loads the driver from a ram disk very early at boot time, instead of loading it from disk only when it is needed, for example when xorg starts.

During the upgrade, only the module on the local disk is overwritten; the module on the ram disk, which is used when it is added to MODULES=(), is not touched.

Therefore xorg should not load the module from the disk, but the one from the ram disk (initramfs) first.

Maybe this is the issue here…
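
To see whether the nvidia modules actually end up in the initramfs after editing MODULES=(), something like the following sketch could be used. lsinitcpio and mkinitcpio -P come with mkinitcpio; the image path is an assumption derived from the vmlinuz name in the kernel command line above, and listing_has_nvidia is a made-up helper.

```shell
# Read an initramfs file listing on stdin and succeed if it contains
# any nvidia kernel module (nvidia.ko, nvidia_drm.ko.xz, ...).
listing_has_nvidia() {
    grep -qE '/nvidia[^/]*\.ko'
}

# Usage (image path is an assumption, adjust to your system):
#   sudo mkinitcpio -P    # rebuild all presets after editing /etc/mkinitcpio.conf
#   lsinitcpio /boot/initramfs-5.10-x86_64.img | listing_has_nvidia \
#       && echo "nvidia is in the initramfs"
```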

EDIT:

This is from /etc/initramfs-tools/initramfs.conf (Ubuntu 20.04):

#
# MODULES: [ most | netboot | dep | list ]
#
# most - Add most filesystem and all harddrive drivers.
#
# dep - Try and guess which modules to load.
#
# netboot - Add the base modules, network modules, but skip block devices.
#
# list - Only include modules from the 'additional modules' list
#

MODULES=most

I guess “most” also includes the nvidia driver, while Manjaro by default uses a mode equivalent to “dep”. No idea if that is true, but it could be a direction to look in.