Inconsistent Suspend Behavior on Manjaro (Kernel 6.12.34-1)

Hi folks,

I’m experiencing inconsistent suspend behavior on my Lenovo Legion 5 running Manjaro (Kernel 6.12.34-1). My laptop uses an AMD Ryzen 7 5800H CPU and an Nvidia RTX 3070 GPU (hybrid graphics disabled, so only Nvidia is active).

Most of the time, suspend via Gnome works perfectly. Occasionally (about 10% of the time), when I try to suspend, the screen goes black (no video output), but the backlight stays on. The system doesn’t suspend and becomes unresponsive, requiring a forced restart.

I looked through the journalctl entries from around the time of the issue and found some errors:

Jul 03 21:12:16 joao-82ju systemd-coredump[52430]: [🡕] Process 2574 (gnome-shell) of user 1000 dumped core.
Stack trace of thread 2574:
#0  0x00007efd5834633e  st_theme_node_lookup_shadow (libst-16.so + 0x4e33e)
#1  0x00007efd5834683b  st_theme_node_get_box_shadow (libst-16.so + 0x4e83b)
#2  0x00007efd5834a7c6  st_theme_node_get_paint_box (libst-16.so + 0x527c6)
...

Jul 03 21:12:38 joao-82ju kernel: Freezing user space processes
Jul 03 21:12:38 joao-82ju kernel: NVRM: nvAssertFailedNoLog: Assertion failed: (pKernelBus->pReadToFlush != NULL || pKernelBus->virtualBar2[GPU_GFID_PF].pCpuMapping != NULL) @ kern_>
Jul 03 21:12:38 joao-82ju kernel: NVRM: nvAssertFailedNoLog: Assertion failed: pEntries != NULL @ gmmu_walk.c:881
Jul 03 21:12:38 joao-82ju kernel: NVRM: nvAssertFailedNoLog: Assertion failed: progress == indexHi_tmp - indexLo_tmp + 1 @ mmu_walk.c:1092
Jul 03 21:12:38 joao-82ju kernel: NVRM: nvAssertFailedNoLog: Assertion failed: NV_OK == status @ mmu_walk.c:488
Jul 03 21:12:38 joao-82ju kernel: NVRM: mmuWalkSparsify: Failed to sparsify VA Range 0xaa0000 to 0xb1ffff. Status = 0x00000040
Jul 03 21:12:38 joao-82ju kernel: NVRM: nvAssertFailedNoLog: Assertion failed: 0 @ mmu_walk_sparse.c:74
Jul 03 21:12:38 joao-82ju kernel: NVRM: nvAssertFailedNoLog: Assertion failed: pEntries != NULL @ gmmu_walk.c:881
Jul 03 21:12:38 joao-82ju kernel: NVRM: nvAssertFailedNoLog: Assertion failed: progress == indexHi_tmp - indexLo_tmp + 1 @ mmu_walk.c:1092
Jul 03 21:12:38 joao-82ju kernel: NVRM: nvAssertFailedNoLog: Assertion failed: NV_OK == status @ mmu_walk.c:488
Jul 03 21:12:38 joao-82ju kernel: NVRM: mmuWalkUnmap: Failed to unmap VA Range 0xaa0000 to 0xb1ffff. Status = 0x00000040
Jul 03 21:12:38 joao-82ju kernel: NVRM: nvAssertFailedNoLog: Assertion failed: 0 @ mmu_walk_unmap.c:65
Jul 03 21:12:38 joao-82ju kernel: NVRM: mmuWalkSparsify: Unmap failed with status = 0x00000040
Jul 03 21:12:38 joao-82ju kernel: NVRM: nvAssertFailedNoLog: Assertion failed: NV_OK == unmapStatus @ mmu_walk_sparse.c:85
Jul 03 21:12:38 joao-82ju kernel: NVRM: nvAssertOkFailedNoLog: Assertion failed: Generic Error: Invalid state [NV_ERR_INVALID_STATE] (0x00000040) returned from mmuWalkSparsify(userC>
Jul 03 21:12:38 joao-82ju kernel: NVRM: nvAssertFailedNoLog: Assertion failed: (pKernelBus->pReadToFlush != NULL || pKernelBus->virtualBar2[GPU_GFID_PF].pCpuMapping != NULL) @ kern_>
Jul 03 21:12:38 joao-82ju kernel: Freezing user space processes failed after 20.002 seconds (1 tasks refusing to freeze, wq_busy=0):
...
Jul 03 21:12:59 joao-82ju systemd-sleep[52519]: Failed to put system to sleep. System resumed again: Device or resource busy
...
Jul 03 21:14:31 joao-82ju gnome-session-binary[52809]: Unrecoverable failure in required component org.gnome.Shell.desktop
...
Jul 03 21:16:56 joao-82ju gdm-launch-environment][52896]: pam_systemd(gdm-launch-environment:session): Failed to create session: Connection timed out
Jul 03 21:17:00 joao-82ju kernel: INFO: task nv_queue:448 blocked for more than 122 seconds.
Jul 03 21:17:00 joao-82ju kernel:       Tainted: G           OE      6.12.34-1-MANJARO #1
Jul 03 21:17:00 joao-82ju kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
...
Jul 03 21:39:54 joao-82ju gdm-launch-environment][54102]: pam_systemd(gdm-launch-environment:session): Failed to create session: Connection timed out
...
Jul 03 21:39:57 joao-82ju kernel: [drm:nv_drm_atomic_commit [nvidia_drm]] _ERROR_ [nvidia-drm] [GPU ID 0x00000100] Flip event timeout on head 0
Jul 03 21:40:00 joao-82ju kernel: [drm:nv_drm_atomic_commit [nvidia_drm]] _ERROR_ [nvidia-drm] [GPU ID 0x00000100] Flip event timeout on head 1
Jul 03 21:40:03 joao-82ju kernel: [drm:nv_drm_atomic_commit [nvidia_drm]] _ERROR_ [nvidia-drm] [GPU ID 0x00000100] Flip event timeout on head 2
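
For reference, this is roughly how I pulled the entries above (a sketch; since the failure ended in a forced restart, the relevant logs are in the previous boot, and the time window is just the one that brackets the failed attempt):

journalctl -b -1 --since "21:10" --until "21:45" --no-pager
journalctl -b -1 -p err --no-pager   # only error-level messages from that boot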

Notes:

  • GNOME Shell: 48.2 (on Wayland)
  • NVIDIA driver: 575.64 (proprietary)

Based on the logs, I think the GPU hit an error during the suspend attempt and the operating system couldn’t recover from it.

My questions are: how can I confirm what is causing the issue, and how can I make suspend more resilient (or is there anything I could do to recover from the error manually)?

Do you have the appropriate NVIDIA services enabled? See Chapter 21. Configuring Power Management Support

Thanks for the answer and for providing the link to the docs. I wasn’t familiar with that page.

I didn’t enable the NVIDIA services individually, and the doc you linked mentions that the services are enabled automatically if systemd is detected (which is the case here):

These files are installed and enabled by nvidia-installer automatically if systemd is detected.

I checked my modprobe configuration and found the following: options nvidia NVreg_PreserveVideoMemoryAllocations=1, which further suggests the required parameters and services are set.
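
For reference, this is roughly how I checked (a sketch; the file name under /etc/modprobe.d/ varies by setup, and /proc/driver/nvidia/params should show the value the loaded driver actually picked up):

grep -r NVreg_PreserveVideoMemoryAllocations /etc/modprobe.d/
grep -i preserve /proc/driver/nvidia/params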

I wonder if I would get such a high success rate of suspend/wake-up if the services were misconfigured?

That’s not related to the status of any of the services.

Well, then they’re probably not enabled. You can check the status of each service via systemctl status; for example, all 4 services are enabled on this machine, but only because I manually enabled them. Notice the preset is disabled:

❯ systemctl status nvidia-suspend nvidia-resume nvidia-hibernate nvidia-suspend-then-hibernate
○ nvidia-suspend.service - NVIDIA system suspend actions
     Loaded: loaded (/usr/lib/systemd/system/nvidia-suspend.service; enabled; preset: disabled)
     Active: inactive (dead)

○ nvidia-resume.service - NVIDIA system resume actions
     Loaded: loaded (/usr/lib/systemd/system/nvidia-resume.service; enabled; preset: disabled)
     Active: inactive (dead)

○ nvidia-hibernate.service - NVIDIA system hibernate actions
     Loaded: loaded (/usr/lib/systemd/system/nvidia-hibernate.service; enabled; preset: disabled)
     Active: inactive (dead)

○ nvidia-suspend-then-hibernate.service - NVIDIA actions for suspend-then-hibernate
     Loaded: loaded (/usr/lib/systemd/system/nvidia-suspend-then-hibernate.service; enabled; preset: disabled)
     Active: inactive (dead)
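
If you just want a compact yes/no per unit, systemctl is-enabled should print one enabled/disabled line for each:

❯ systemctl is-enabled nvidia-suspend nvidia-resume nvidia-hibernate nvidia-suspend-then-hibernate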

Oh, thanks for showing the proper way to check the status of the services.

The services are enabled (except for nvidia-suspend-then-hibernate.service, which I have now enabled manually):

systemctl status nvidia-suspend nvidia-resume nvidia-hibernate nvidia-suspend-then-hibernate

○ nvidia-suspend.service - NVIDIA system suspend actions
     Loaded: loaded (/usr/lib/systemd/system/nvidia-suspend.service; enabled; preset: disabled)
     Active: inactive (dead)

○ nvidia-resume.service - NVIDIA system resume actions
     Loaded: loaded (/usr/lib/systemd/system/nvidia-resume.service; enabled; preset: disabled)
     Active: inactive (dead)

○ nvidia-hibernate.service - NVIDIA system hibernate actions
     Loaded: loaded (/usr/lib/systemd/system/nvidia-hibernate.service; enabled; preset: disabled)
     Active: inactive (dead)

# before I manually enabled it
○ nvidia-suspend-then-hibernate.service - NVIDIA actions for suspend-then-hibernate
     Loaded: loaded (/usr/lib/systemd/system/nvidia-suspend-then-hibernate.service; disabled; preset: disabled)
     Active: inactive (dead)
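
For completeness, this is roughly the command I ran to enable the missing one (just the standard systemctl invocation):

sudo systemctl enable nvidia-suspend-then-hibernate.service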

I’m still seeing this issue occasionally, but I think I now understand more about what causes it.

After checking the logs, it seems the problem happens during the GPU-related work triggered when the OS starts the suspend sequence. I’ve observed that the issue occurs when I perform GPU-intensive tasks (such as playing games in Steam or running LLMs in Ollama) and immediately put the laptop to sleep. However, this doesn’t always trigger the issue; sometimes the laptop sleeps successfully, while other times it fails.

It seems the NVIDIA processes called during sleep preparation are struggling to complete their work within the timeouts, and they don’t appear to recover after the failure. I suspect that offloading GPU memory to disk during the sleep process is failing, particularly when there’s a significant amount of data to write. My GPU has 8GB of VRAM (NVIDIA GeForce RTX 3070 Laptop GPU) and my storage is a 2TB Rocket Q NVMe SSD.
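
If that’s what is happening, the relevant knobs seem to be the nvidia module options that control whether and where video memory is saved (going by the Chapter 21 doc linked above, if I’m reading it correctly). A sketch of what that configuration might look like, with the file name only as an example and NVreg_TemporaryFilePath pointed at disk-backed storage, which the doc recommends when /tmp is a tmpfs:

# /etc/modprobe.d/nvidia-power-management.conf  (file name is just an example)
options nvidia NVreg_PreserveVideoMemoryAllocations=1 NVreg_TemporaryFilePath=/var/tmp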

Increasing the timeouts might help, but I don’t yet know how to reproduce the issue consistently or how to change (and test) the new timeouts; I’ll investigate that next. I wonder if folks with GPUs with a lot of VRAM are seeing similar issues.
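
The first thing I plan to look at is the kernel’s freeze timeout, which appears to be exposed in milliseconds via /sys/power/pm_freeze_timeout (its 20000 default would match the 20-second failure in the log). Just a sketch, and the 60000 value is an arbitrary guess:

cat /sys/power/pm_freeze_timeout                 # value is in milliseconds
echo 60000 | sudo tee /sys/power/pm_freeze_timeout   # does not persist across reboots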

Note: I checked the following log line:

Jul 03 21:12:38 joao-82ju kernel: Freezing user space processes failed after 20.002 seconds (1 tasks refusing to freeze, wq_busy=0):

… and I made a rough calculation based on my SSD write benchmark (about 88.5 MiB/s, though that depends on several factors): at that average throughput, roughly 88.5 MiB/s × 20 s ≈ 1.8 GB could be written within the freeze window, which is far less than the GPU’s 8 GB of VRAM.

This is the command I used to measure the write throughput:

fio --name=write_test --filename=testfile --size=1G --bs=4k --rw=randwrite --ioengine=libaio --direct=1 --numjobs=1 --runtime=60 --group_reporting
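
Since the saved video memory is presumably written out as one large file rather than 4 KiB random blocks, a sequential variant of the same test (only bs and rw changed; still just a rough estimate) might be a closer upper bound:

fio --name=seq_write_test --filename=testfile --size=1G --bs=1M --rw=write --ioengine=libaio --direct=1 --numjobs=1 --runtime=60 --group_reporting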