GPU falling off bus and freezing laptop

Hello,

I bought a Lenovo P52 and I am trying to use Manjaro on it. It has Intel integrated graphics and an Nvidia P1000. At random times the system will freeze and I have to hard reset it. I used journalctl to view my last boot and scrolled all the way down to see what was up. I get this error:
NVRM: Xid (PCI:0000:01:00): 79, pid=753, GPU has fallen off the bus.
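
For anyone following along, the Xid lines can be pulled out of the previous boot's kernel log roughly as follows. This is only a sketch: `xid_codes` is a made-up helper name, and the journalctl call in the comment assumes the journal is persistent so that the previous boot is still available.

```shell
#!/bin/sh
# xid_codes: hypothetical helper. Reads kernel log lines on stdin and
# prints the numeric Xid code from every "NVRM: Xid" line it finds.
xid_codes() {
    sed -n 's/.*NVRM: Xid ([^)]*): \([0-9][0-9]*\),.*/\1/p'
}

# Typical use (kernel messages from the previous boot, counted per code):
#   journalctl -k -b -1 --no-pager | xid_codes | sort | uniq -c
```

Counting the codes per boot makes it easier to see whether 79 is the only Xid reported or whether other codes show up before it.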

I have tried a few things to fix it, with no dice:

  1. Updated BIOS to latest version 2.8
  2. Set Nvidia persistence mode to 1
  3. Enabled discrete graphics card only in BIOS
  4. Downloaded TLP
  5. Checked temps and they are good

Using the non-free Nvidia 435xx driver and kernel 4.19.69-1.

Any ideas?

Here is my hardware info

System: Kernel: 4.19.69-1-MANJARO x86_64 bits: 64 compiler: gcc v: 9.1.0
parameters: BOOT_IMAGE=/boot/vmlinuz-4.19-x86_64 root=UUID=a5a25f90-23fa-4bcc-bc65-fd016ecf9078 rw quiet
rcutree.rcu_idle_gp_delay=2
Desktop: Xfce 4.14.1 tk: Gtk 3.24.10 info: xfce4-panel wm: xfwm4 dm: LightDM 1.30.0 Distro: Manjaro Linux
Machine: Type: Laptop System: LENOVO product: 20M9S0AW00 v: ThinkPad P52 serial: Chassis: type: 10 serial:
Mobo: LENOVO model: 20M9S0AW00 serial: UEFI: LENOVO v: N2CET45W (1.28 ) date: 07/22/2019
Battery: ID-1: BAT0 charge: 51.4 Wh condition: 96.0/90.0 Wh (107%) volts: 11.4/11.2 model: SMP 01AV496 type: Li-poly
serial: status: Discharging cycles: 25
CPU: Topology: 6-Core model: Intel Core i7-8750H bits: 64 type: MT MCP arch: Kaby Lake family: 6 model-id: 9E (158)
stepping: A (10) microcode: B4 L2 cache: 9216 KiB
flags: avx avx2 lm nx pae sse sse2 sse3 sse4_1 sse4_2 ssse3 vmx bogomips: 53004
Speed: 799 MHz min/max: 800/4100 MHz Core speeds (MHz): 1: 800 2: 800 3: 800 4: 800 5: 800 6: 801 7: 800 8: 802
9: 800 10: 800 11: 800 12: 800
Vulnerabilities: Type: l1tf mitigation: PTE Inversion; VMX: conditional cache flushes, SMT vulnerable
Type: mds mitigation: Clear CPU buffers; SMT vulnerable
Type: meltdown mitigation: PTI
Type: spec_store_bypass mitigation: Speculative Store Bypass disabled via prctl and seccomp
Type: spectre_v1 mitigation: usercopy/swapgs barriers and __user pointer sanitization
Type: spectre_v2 mitigation: Full generic retpoline, IBPB: conditional, IBRS_FW, STIBP: conditional, RSB filling
Graphics: Device-1: NVIDIA GP107GLM [Quadro P1000 Mobile] vendor: Lenovo driver: nvidia v: 418.88 bus ID: 01:00.0
chip ID: 10de:1cbb
Display: x11 server: X.Org 1.20.5 driver: nvidia resolution: 1920x1080~60Hz
OpenGL: renderer: Quadro P1000/PCIe/SSE2 v: 4.6.0 NVIDIA 418.88 direct render: Yes
Audio: Device-1: Intel Cannon Lake PCH cAVS vendor: Lenovo driver: snd_hda_intel v: kernel bus ID: 00:1f.3
chip ID: 8086:a348
Device-2: NVIDIA GP107GL High Definition Audio driver: snd_hda_intel v: kernel bus ID: 01:00.1 chip ID: 10de:0fb9
Sound Server: ALSA v: k4.19.69-1-MANJARO
Network: Device-1: Intel Wireless-AC 9560 [Jefferson Peak] driver: iwlwifi v: kernel bus ID: 00:14.3 chip ID: 8086:a370
IF: wlp0s20f3 state: up mac:
Device-2: Intel Ethernet I219-V vendor: Lenovo driver: e1000e v: 3.2.6-k port: efa0 bus ID: 00:1f.6
chip ID: 8086:15bc
IF: enp0s31f6 state: down mac:
Drives: Local Storage: total: 476.94 GiB used: 103.49 GiB (21.7%)
ID-1: /dev/nvme0n1 vendor: Samsung model: MZVLB512HAJQ-000L7 size: 476.94 GiB block size: physical: 512 B
logical: 512 B speed: 31.6 Gb/s lanes: 4 serial: rev: 5L2QEXA7 scheme: GPT
Partition: ID-1: / raw size: 476.64 GiB size: 468.16 GiB (98.22%) used: 103.47 GiB (22.1%) fs: ext4 dev: /dev/nvme0n1p2
Sensors: System Temperatures: cpu: 45.0 C mobo: N/A gpu: nvidia temp: 38 C
Fan Speeds (RPM): cpu: 0
Info: Processes: 269 Uptime: 5m Memory: 15.38 GiB used: 2.15 GiB (14.0%) Init: systemd v: 242 Compilers: gcc: 9.1.0
Shell: bash v: 5.0.9 running in: xfce4-terminal inxi: 3.0.36

Nvidia's list of Xid errors shows that 79 has a lot of possible causes, some of which you have addressed:

HW Error
Driver Error
System Memory Corruption
Bus Error
Thermal Issue

Since you've checked your temps, I think the next easiest step would be switching to a previous driver series to see if that makes a difference.

Forgot to mention that I tried different drivers. I tried the non-free 390xx, with no improvement, and the free video-linux one.

418.xx will likely be most stable for a P1000.

OK, I will try 418xx and see how it goes. If it freezes again, what should I try/check next?

We have not yet come to that bridge.

Unfortunately my laptop froze again. However, when checking journalctl again I get another error message along with the GPU dropping off:

Sep 20 18:34:52 austin-pc kernel: nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000927c:0:0:0x0000000f
Sep 20 18:34:53 austin-pc kernel: nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000927c:0:0:0x0000000f
Sep 20 18:34:53 austin-pc kernel: nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000927c:0:0:0x0000000f
Sep 20 18:34:53 austin-pc kernel: nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000927c:0:0:0x0000000f

Remove any trace of nvidia-drm.modeset=1 if it exists as a kernel boot parameter.
Are Secure Boot and the TPM (Trusted Platform Module) disabled in the BIOS?

No nvidia-drm.modeset=1 was found in /etc/default/grub. I already had Secure Boot disabled and have just disabled the TPM. Out of curiosity, why would disabling the TPM help me?
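
Besides /etc/default/grub, the option can also be checked on the running system. A rough sketch of that check: `has_modeset` is a made-up helper name, and the modprobe.d directories are only the usual locations, so they are assumptions about this particular install.

```shell
#!/bin/sh
# has_modeset: hypothetical helper. Succeeds if the given kernel command
# line string contains nvidia-drm.modeset=1 (the kernel also accepts the
# underscore spelling of the module name).
has_modeset() {
    case " $1 " in
        *" nvidia-drm.modeset=1 "*|*" nvidia_drm.modeset=1 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Check the command line the running kernel actually booted with.
if [ -r /proc/cmdline ] && has_modeset "$(cat /proc/cmdline)"; then
    echo "modeset=1 is on the kernel command line"
else
    echo "modeset=1 is not on the kernel command line"
fi

# The option can also be set through a modprobe config file.
grep -rsE 'nvidia[-_]drm.*modeset=1' /etc/modprobe.d /usr/lib/modprobe.d \
    || echo "no modeset=1 option found in modprobe.d"
```

If the nvidia_drm module is loaded, the value actually in effect should also be readable from /sys/module/nvidia_drm/parameters/modeset.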

Still having issues with random freezing. Any other ideas? Secure Boot and TPM are disabled.

You haven't posted any hardware info, or I missed it.

From a similar Nvidia incident I found at random:

Fix/workaround: appended rcutree.rcu_idle_gp_delay=1 acpi_osi=! acpi_osi="Windows 2009" to kernel parameters.

and another one

boot parameter rcutree.rcu_idle_gp_delay=2
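
On a GRUB-based install, either parameter would normally be added to the GRUB_CMDLINE_LINUX_DEFAULT line in /etc/default/grub. A minimal sketch of that edit, run on a scratch copy so nothing on the system is touched: `add_kernel_param` is a made-up helper name, and it only handles simple parameters without spaces or quotes. The acpi_osi="Windows 2009" value would need its quotes escaped by hand.

```shell
#!/bin/sh
# add_kernel_param FILE PARAM: hypothetical helper. Appends PARAM to the
# GRUB_CMDLINE_LINUX_DEFAULT="..." line of FILE unless that line already
# contains it. PARAM must not contain spaces, quotes, '|', '&' or '\'.
add_kernel_param() {
    file=$1
    param=$2
    if grep '^GRUB_CMDLINE_LINUX_DEFAULT=' "$file" | grep -qF -- "$param"; then
        return 0    # already present, nothing to do
    fi
    sed "s|^\(GRUB_CMDLINE_LINUX_DEFAULT=\".*\)\"\$|\1 $param\"|" "$file" \
        > "$file.new" && mv "$file.new" "$file"
}

# Demo on a scratch copy rather than the real /etc/default/grub.
tmp=$(mktemp)
printf '%s\n' 'GRUB_TIMEOUT=5' 'GRUB_CMDLINE_LINUX_DEFAULT="quiet"' > "$tmp"
add_kernel_param "$tmp" "rcutree.rcu_idle_gp_delay=1"
grep '^GRUB_CMDLINE_LINUX_DEFAULT=' "$tmp"
rm -f "$tmp"
```

After editing the real file (keep a backup first), run sudo update-grub, reboot, and confirm with cat /proc/cmdline that the parameter is actually there.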

Testing out the first workaround; I will try the second one if it freezes again.
Laptop graphics: Nvidia P1000 with an integrated Intel GPU. However, I enabled discrete GPU only in the BIOS.
Running the non-free 418xx Nvidia driver.

I hope it works.
Regarding info (in the forum), I suppose you have missed this:
How to provide information about your issues

Tried both kernel parameters and the random freezing still persists.
