Freezing/stutters ... "task nvidia-modeset/:399 blocked for more than 122 seconds"

Hi, I've been trying to get to the bottom of random system freezes and stuttering on new hardware for a couple of weeks now. I strongly suspect it's an Nvidia-related issue.

~ journalctl --no-pager --no-hostname -xb-1 -p3
-- Logs begin at Tue 2020-06-23 21:44:55 PDT, end at Mon 2020-06-29 23:29:03 PDT. --
Jun 29 21:59:13 kernel: hub 5-2:1.0: Using single TT (err -22)
Jun 29 21:59:13 kernel: hub 5-2.4:1.0: Using single TT (err -22)
Jun 29 21:59:13 kernel: sp5100-tco sp5100-tco: Watchdog hardware is disabled
Jun 29 21:59:13 kernel: kvm: disabled by bios
Jun 29 21:59:14 kernel: kvm: disabled by bios
Jun 29 21:59:14 kernel: kvm: disabled by bios
Jun 29 21:59:14 kernel: kvm: disabled by bios
Jun 29 21:59:14 kernel: kvm: disabled by bios
Jun 29 21:59:14 kernel: kvm: disabled by bios
Jun 29 21:59:14 kernel: kvm: disabled by bios
Jun 29 21:59:14 kernel: nvidia-gpu 0000:07:00.3: i2c timeout error e0000000
Jun 29 21:59:14 kernel: ucsi_ccg 0-0008: i2c_transfer failed -110
Jun 29 21:59:14 kernel: ucsi_ccg 0-0008: ucsi_ccg_init failed - -110
Jun 29 21:59:14 kernel: kvm: disabled by bios
Jun 29 21:59:15 kernel: kvm: disabled by bios
Jun 29 21:59:15 kernel: kvm: disabled by bios
Jun 29 21:59:15 kernel: kvm: disabled by bios
Jun 29 21:59:15 kernel: kvm: disabled by bios
Jun 29 22:25:26 kernel: NVRM: Xid (PCI:0000:07:00): 61, pid=1672, 0cec(3098) 00000000 00000000
Jun 29 23:23:06 kernel: INFO: task nvidia-modeset/:399 blocked for more than 122 seconds.
Jun 29 23:23:06 kernel:       Tainted: P           OE     5.4.44-1-MANJARO #1
Jun 29 23:23:06 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

The last 4 messages are curious to me. I searched the forum but am still unsure how to proceed. Here is my inxi output:

System:    Host: desktop Kernel: 5.4.44-1-MANJARO x86_64 bits: 64 compiler: gcc v: 10.1.0 
           parameters: BOOT_IMAGE=/boot/vmlinuz-5.4-x86_64 
           root=UUID=b64fad59-c8a3-41dc-a3ac-1a10c124716a rw quiet 
           resume=UUID=6ee062f1-1e1e-4e2b-b7de-f6d90d25eeab udev.log_priority=3 idle=nomwait 
           Desktop: Xfce 4.14.2 tk: Gtk 3.24.20 wm: xfwm4 dm: LightDM Distro: Manjaro Linux 
Machine:   Type: Desktop System: Gigabyte product: B450M DS3H v: N/A serial: <filter> 
           Mobo: Gigabyte model: B450M DS3H-CF v: x.x serial: <filter> 
           UEFI: American Megatrends v: F50 date: 11/27/2019 
CPU:       Topology: 6-Core model: AMD Ryzen 5 3600 bits: 64 type: MT MCP arch: Zen 
           family: 17 (23) model-id: 71 (113) stepping: N/A microcode: 8701013 
           L2 cache: 3072 KiB 
           flags: avx avx2 lm nx pae sse sse2 sse3 sse4_1 sse4_2 sse4a ssse3 svm 
           bogomips: 86279 
           Speed: 3594 MHz min/max: 2200/3600 MHz Core speeds (MHz): 1: 3591 2: 3593 3: 3593 
           4: 3364 5: 3591 6: 3590 7: 3596 8: 3593 9: 3592 10: 3595 11: 3534 12: 3367 
           Vulnerabilities: Type: itlb_multihit status: Not affected 
           Type: l1tf status: Not affected 
           Type: mds status: Not affected 
           Type: meltdown status: Not affected 
           Type: spec_store_bypass 
           mitigation: Speculative Store Bypass disabled via prctl and seccomp 
           Type: spectre_v1 
           mitigation: usercopy/swapgs barriers and __user pointer sanitization 
           Type: spectre_v2 
           mitigation: Full AMD retpoline, IBPB: conditional, STIBP: conditional, RSB filling 
           Type: tsx_async_abort status: Not affected 
Graphics:  Device-1: NVIDIA TU104 [GeForce RTX 2080 SUPER] vendor: eVga.com. driver: nvidia 
           v: 440.82 bus ID: 07:00.0 chip ID: 10de:1e81 
           Display: x11 server: X.Org 1.20.8 driver: nvidia tty: N/A 
           OpenGL: renderer: GeForce RTX 2080 SUPER/PCIe/SSE2 v: 4.6.0 NVIDIA 440.82 
           direct render: Yes 
Audio:     Device-1: NVIDIA TU104 HD Audio vendor: eVga.com. driver: snd_hda_intel v: kernel 
           bus ID: 07:00.1 chip ID: 10de:10f8 
           Device-2: AMD Starship/Matisse HD Audio vendor: Gigabyte driver: snd_hda_intel 
           v: kernel bus ID: 09:00.4 chip ID: 1022:1487 
           Sound Server: ALSA v: k5.4.44-1-MANJARO 
Network:   Device-1: Realtek RTL8111/8168/8411 PCI Express Gigabit Ethernet vendor: Gigabyte 
           driver: r8169 v: kernel port: f000 bus ID: 05:00.0 chip ID: 10ec:8168 
           IF: enp5s0 state: up speed: 100 Mbps duplex: full mac: <filter> 
           Device-2: Realtek RTL8152 Fast Ethernet Adapter type: USB driver: usb-storage 
           bus ID: 6-2.4.4:5 chip ID: 0bda:8152 
Drives:    Local Storage: total: 931.51 GiB used: 17.06 GiB (1.8%) 
           ID-1: /dev/nvme0n1 model: Sabrent Rocket Q size: 931.51 GiB block size: 
           physical: 4096 B logical: 4096 B speed: 31.6 Gb/s lanes: 4 serial: <filter> 
Partition: ID-1: / raw size: 200.00 GiB size: 195.86 GiB (97.93%) used: 17.04 GiB (8.7%) 
           fs: ext4 dev: /dev/nvme0n1p6 
           ID-2: swap-1 size: 16.00 GiB used: 0 KiB (0.0%) fs: swap 
           swappiness: 10 (default 60) cache pressure: 75 (default 100) dev: /dev/nvme0n1p5 
Sensors:   System Temperatures: cpu: 48.5 C mobo: N/A gpu: nvidia temp: 30 C 
           Fan Speeds (RPM): N/A gpu: nvidia fan: 37% 
Info:      Processes: 275 Uptime: 19m Memory: 15.65 GiB used: 1.55 GiB (9.9%) Init: systemd 
           v: 245 Compilers: gcc: 10.1.0 Shell: zsh v: 5.8 running in: xfce4-terminal 
           inxi: 3.0.37 

Does anything in particular stand out? Thank you for the help.
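
To narrow a long journal down to just the entries in question, a filter along these lines may help. This is a minimal sketch: the function name gpu_events is hypothetical, and the patterns are taken only from the two messages quoted in the log above.

```shell
# Hypothetical helper: keep only NVIDIA Xid reports and hung-task warnings
# from kernel log text supplied on stdin.
gpu_events() {
    grep -E 'NVRM: Xid|blocked for more than [0-9]+ seconds'
}

# Example use against the previous boot (kernel messages only):
# journalctl -k -b -1 --no-pager --no-hostname | gpu_events
```

Comparing the timestamps of the Xid lines with the moments the desktop froze is one way to check whether the two are actually related.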

Hello,

Regarding the sp5100-tco watchdog message, you can ignore it or blacklist that module.
Source: https://bbs.archlinux.org/viewtopic.php?id=239075
From a terminal you can do this:
echo "blacklist sp5100_tco" | sudo tee /etc/modprobe.d/sp5100_tco.conf
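
To confirm that the blacklist entry was written as intended, a check like the following could be used. It is only a sketch: is_blacklisted is a hypothetical helper, and it assumes the entry lives under /etc/modprobe.d unless another directory is passed.

```shell
# Hypothetical helper: succeed if MODULE has a "blacklist MODULE" line
# in any file under DIR (default: /etc/modprobe.d).
is_blacklisted() {
    mod=$1
    dir=${2:-/etc/modprobe.d}
    grep -rqsE "^[[:space:]]*blacklist[[:space:]]+${mod}[[:space:]]*\$" "$dir"
}

# Example:
# is_blacklisted sp5100_tco && echo "sp5100_tco is blacklisted"
# After a reboot, `lsmod | grep sp5100_tco` should then print nothing.
```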


Now, this is odd, as it used to be an issue. Or is it still?
https://bugzilla.kernel.org/show_bug.cgi?id=206653
Before anything else, I would suggest trying a newer kernel:
sudo mhwd-kernel -i linux57


Further, in /etc/X11/mhwd.d/nvidia.conf, in Section "Device", I would replace the line Option "NoLogo" "1" with:

    Option  "ConnectToAcpid"    "Off"
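
If you would rather preview the change than edit the file by hand, something like the sketch below prints the edited configuration without touching the original. The helper name swap_nologo_for_acpid is hypothetical, and it assumes the NoLogo option sits on a line of its own.

```shell
# Hypothetical helper: print FILE with the NoLogo option line swapped for
# ConnectToAcpid "Off". Output goes to stdout; FILE itself is left unchanged.
swap_nologo_for_acpid() {
    sed -E 's/^([[:space:]]*)Option[[:space:]]+"NoLogo".*$/\1Option "ConnectToAcpid" "Off"/' "$1"
}

# Example: review the result before copying it into place.
# swap_nologo_for_acpid /etc/X11/mhwd.d/nvidia.conf
```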

I see you have

Somehow I'm again tempted to suggest either a swapfile or systemd-swap rather than a swap partition on NVMe or SSD ... but that is me. One thing I do recommend, though, is something mentioned here:

I didn't know this was still required, but maybe there is something more to it, as presented here:

Hiya. Thanks for your suggestions.

I actually had the lag occur today while "ConnectToAcpid" was off in my nvidia.conf, so sadly that isn't the solution. I had seen that suggestion elsewhere.

I have experimented a bit to figure out whether my new NVMe drive was the culprit. I added the 'maxperfwiz' tweaks to /etc/sysctl.d/ but alas, no luck.

I can try changing my udev rules to your suggestions and blacklisting that module.

Regarding IO, I actually had iotop open while my system was lagging and there wasn't any significant IO activity.

idle=nomwait was something I was trying out as a possible fix. But alas I got the lag again while that was enabled, so I removed it.

The latest suggestion I'm working with is adding 'nvidia-drm.modeset=1' to my grub parameters.
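
A quick way to confirm that a boot parameter actually took effect is to look for it on the running kernel's command line. The sketch below assumes the parameter was added to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub and that the grub configuration was regenerated before rebooting; cmdline_has is a hypothetical helper name.

```shell
# Hypothetical helper: succeed if the kernel command line read from stdin
# contains PARAM as a whole, space-separated word.
cmdline_has() {
    tr ' ' '\n' | grep -qxF -- "$1"
}

# Example, after `sudo update-grub` and a reboot:
# cmdline_has nvidia-drm.modeset=1 < /proc/cmdline && echo "parameter active"
# The driver's own view can be read from
# /sys/module/nvidia_drm/parameters/modeset (Y when modesetting is enabled).
```

The same check works for any other parameter being tested, such as idle=nomwait.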

I saw another suggestion to add 'acpi=ht' to my grub default params, so I will try that as well if the first one doesn't work.

I should add that I did try systemd-swap as a solution a few days ago and still got the lag. I will probably switch back to it so I can free up those 16 GiB once I resolve whatever's going on.

But AFAIK that should happen only when the IO system is slower than the IO demand from the user ... shouldn't it?

Hopefully now that I have ForceFullCompositionPipeline=On properly persisting, this issue will go away.
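
For anyone following along, the setting is usually applied to the current session through nvidia-settings by rewriting the active MetaMode. The sketch below only builds the MetaMode string; ffcp_metamode is a hypothetical helper, the base mode "nvidia-auto-select +0+0" is an assumption that fits a single-monitor setup, and how the setting was made persistent in this thread is not shown.

```shell
# Hypothetical helper: wrap a base MetaMode (e.g. "nvidia-auto-select +0+0")
# with the ForceFullCompositionPipeline attribute.
ffcp_metamode() {
    printf '%s { ForceFullCompositionPipeline = On }\n' "$1"
}

# Example: apply to the running X session (not persistent across restarts).
# nvidia-settings --assign CurrentMetaMode="$(ffcp_metamode 'nvidia-auto-select +0+0')"
```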
