Constant crashes with error "GPU fell off the bus"

I am using the latest 6.5 kernel and the latest Nvidia drivers. Every day, around midnight (usually a few hours after I stop using the PC), my system freezes completely and becomes unresponsive. Not even Ctrl+Alt+F3 works. I have seen some suggestions to put the GPU in persistent mode, but that doesn’t work. When I look in journalctl, here is the error:

Nov 29 10:51:41 derp-linux kscreenlocker_greet[14710]: Qt: Session management error: networkIdsList argument is NULL
Nov 29 10:51:41 derp-linux kscreenlocker_greet[14710]: kscreenlocker_greet: Lockscreen QML outdated, falling back to default
Nov 29 10:51:42 derp-linux kscreenlocker_greet[14710]: kf.kirigami: Failed to find a Kirigami platform plugin
Nov 29 10:58:46 derp-linux kernel: NVRM: GPU at PCI:0000:02:00: GPU-c7f83bd9-0ef0-ca3d-c7da-fec78d33c876
Nov 29 10:58:46 derp-linux kernel: NVRM: Xid (PCI:0000:02:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
Nov 29 10:58:46 derp-linux kernel: NVRM: GPU 0000:02:00.0: GPU has fallen off the bus.
Nov 29 10:58:46 derp-linux kernel: NVRM: A GPU crash dump has been created. If possible, please run
                                   NVRM: nvidia-bug-report.sh as root to collect this data before
                                   NVRM: the NVIDIA kernel module is unloaded.
Nov 29 10:58:47 derp-linux plasmashell[996]: ERROR VulkanRender.cpp:395 VkResult is "VK_##str"
Nov 29 10:58:50 derp-linux plasmashell[996]: ERROR VulkanRender.cpp:395 VkResult is "VK_##str"

(Note: pid and name show as <unknown> only because this log is from an old crash. They are normally filled in.)
Here are my kernel parameters:

quiet splash udev.log_priority=3 pci=noaer pcie_aspm=off

This has been going on for months, and I really would like a solution.
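
In case it helps anyone compare notes on when these crashes happen, here is a minimal Python sketch that pulls the Xid lines out of saved journal text. It assumes the short journalctl format shown in the log above; the function name `find_xid_events` and the `year` parameter are made up for illustration (the short format does not include a year, so it has to be supplied), and month names are parsed in the current locale.

```python
import re
from datetime import datetime

# Matches kernel lines in the short journal format, for example:
# Nov 29 10:58:46 derp-linux kernel: NVRM: Xid (PCI:0000:02:00): 79, pid=...
XID_LINE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) \S+ kernel: "
    r"NVRM: Xid \((?P<pci>PCI:[0-9a-fA-F:.]+)\): (?P<xid>\d+),"
)


def find_xid_events(journal_text, year=2023):
    """Return a list of (timestamp, pci_address, xid_code) tuples.

    `year` is an assumption supplied by the caller, because the short
    journal format only records month, day and time.
    """
    events = []
    for line in journal_text.splitlines():
        match = XID_LINE.match(line)
        if match is None:
            continue  # not an NVRM Xid line
        stamp = datetime.strptime(
            f"{year} {match['stamp']}", "%Y %b %d %H:%M:%S"
        )
        events.append((stamp, match["pci"], int(match["xid"])))
    return events
```

One way to feed it would be to save the output of `journalctl -k -o short --no-pager` to a file and pass the file contents in. Looking at the hour of each returned timestamp should show whether the crashes really cluster around idle periods.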

My GPU also falls off the bus randomly for an unknown reason. I’ve spent hours researching. The NVIDIA documentation isn’t really helpful, unfortunately: XID Errors :: GPU Deployment and Management Documentation

Odd thing is, I can play AAA games for hours on end with no issue, yet it will randomly happen when I’m not doing much of anything. :man_shrugging:


Exactly. It seems to happen only when the system isn’t under a lot of load. I’ve seen some people online say that this happens when a GPU is dying, but seeing as this is happening to you too, I doubt that’s the case.

For reference, I have a System76 Gazelle 17 (gaze17-3060-b). System76 sent me a replacement last year not long after I bought it and that made no difference, so at least that rules out a manufacturing defect with the original laptop. Otherwise I’m quite happy with it and like the company.

inxi -Fazy
System:
  Kernel: 6.5.13-1-MANJARO arch: x86_64 bits: 64 compiler: gcc v: 13.2.1
    clocksource: tsc available: acpi_pm
    parameters: root=UUID=cf70171e-27dd-43ec-a0b7-52a1fa96be2a rw
    add_efi_memmap initrd=boot\intel-ucode.img
    initrd=boot\initramfs-6.5-x86_64.img ec_sys.write_support=1 splash quiet
    udev.log_priority=3
  Desktop: GNOME v: 45.1 tk: GTK v: 3.24.38 wm: gnome-shell dm: GDM v: 45.0.1
    Distro: Manjaro Linux base: Arch Linux
Machine:
  Type: Laptop System: System76 product: Gazelle v: gaze17-3060-b
    serial: <superuser required> Chassis: type: 9 serial: <superuser required>
  Mobo: System76 model: Gazelle v: gaze17-3060-b serial: <superuser required>
    UEFI: coreboot v: 2023-09-08_42bf7a6 date: 09/08/2023
Battery:
  ID-1: BAT0 charge: 47.6 Wh (91.2%) condition: 52.2/54.8 Wh (95.3%)
    volts: 17.0 min: 15.4 model: Notebook BAT type: Li-ion serial: <filter>
    status: not charging cycles: 11
CPU:
  Info: model: 12th Gen Intel Core i7-12700H bits: 64 type: MST AMCP
    arch: Alder Lake gen: core 12 level: v3 note: check built: 2021+
    process: Intel 7 (10nm ESF) family: 6 model-id: 0x9A (154) stepping: 3
    microcode: 0x430
  Topology: cpus: 1x cores: 14 mt: 6 tpc: 2 st: 8 threads: 20 smt: enabled
    cache: L1: 1.2 MiB desc: d-8x32 KiB, 6x48 KiB; i-6x32 KiB, 8x64 KiB
    L2: 11.5 MiB desc: 6x1.2 MiB, 2x2 MiB L3: 24 MiB desc: 1x24 MiB
  Speed (MHz): avg: 1378 high: 3497 min/max: 400/4600:4700:3500 scaling:
    driver: intel_pstate governor: powersave cores: 1: 1018 2: 848 3: 400 4: 2331
    5: 400 6: 400 7: 2011 8: 400 9: 3443 10: 400 11: 3296 12: 400 13: 400
    14: 400 15: 400 16: 1260 17: 1690 18: 2609 19: 3497 20: 1965
    bogomips: 107560
  Flags: avx avx2 ht lm nx pae sse sse2 sse3 sse4_1 sse4_2 ssse3 vmx
  Vulnerabilities:
  Type: gather_data_sampling status: Not affected
  Type: itlb_multihit status: Not affected
  Type: l1tf status: Not affected
  Type: mds status: Not affected
  Type: meltdown status: Not affected
  Type: mmio_stale_data status: Not affected
  Type: retbleed status: Not affected
  Type: spec_rstack_overflow status: Not affected
  Type: spec_store_bypass mitigation: Speculative Store Bypass disabled via
    prctl
  Type: spectre_v1 mitigation: usercopy/swapgs barriers and __user pointer
    sanitization
  Type: spectre_v2 mitigation: Enhanced / Automatic IBRS, IBPB: conditional,
    RSB filling, PBRSB-eIBRS: SW sequence
  Type: srbds status: Not affected
  Type: tsx_async_abort status: Not affected
Graphics:
  Device-1: Intel Alder Lake-P GT2 [Iris Xe Graphics] vendor: CLEVO/KAPOK
    driver: i915 v: kernel arch: Gen-12.2 process: Intel 10nm built: 2021-22+
    ports: active: DP-1 off: eDP-1 empty: DP-2,DP-3,DP-4 bus-ID: 00:02.0
    chip-ID: 8086:46a6 class-ID: 0300
  Device-2: NVIDIA GA106M [GeForce RTX 3060 Mobile / Max-Q]
    vendor: CLEVO/KAPOK driver: nvidia v: 545.29.06 alternate: nouveau,nvidia_drm
    non-free: 545.xx+ status: current (as of 2023-11; EOL~2026-12-xx)
    arch: Ampere code: GAxxx process: TSMC n7 (7nm) built: 2020-2023 pcie:
    gen: 4 speed: 16 GT/s lanes: 8 link-max: lanes: 16 ports: active: none
    off: DP-5,HDMI-A-1 empty: eDP-2 bus-ID: 01:00.0 chip-ID: 10de:2520
    class-ID: 0300
  Device-3: Logitech Webcam C270 driver: snd-usb-audio,uvcvideo type: USB
    rev: 2.0 speed: 480 Mb/s lanes: 1 mode: 2.0 bus-ID: 3-1.1:4
    chip-ID: 046d:0825 class-ID: 0102 serial: <filter>
  Device-4: Chicony USB2.0 Camera driver: uvcvideo type: USB rev: 2.0
    speed: 480 Mb/s lanes: 1 mode: 2.0 bus-ID: 3-8:5 chip-ID: 04f2:b729
    class-ID: fe01 serial: <filter>
  Display: x11 server: X.Org v: 21.1.9 with: Xwayland v: 23.2.2
    compositor: gnome-shell driver: X: loaded: modesetting,nvidia
    alternate: fbdev,nouveau,nv,vesa dri: iris gpu: i915,nvidia,nvidia-nvswitch
    display-ID: :1 screens: 1
  Screen-1: 0 s-res: 5760x1080 s-dpi: 96 s-size: 1524x286mm (60.00x11.26")
    s-diag: 1551mm (61.05")
  Monitor-1: DP-1 pos: primary,center model: HP X24ih serial: <filter>
    built: 2021 res: 1920x1080 dpi: 82 gamma: 1.2 size: 598x336mm (23.54x13.23")
    diag: 605mm (23.8") ratio: 16:9 modes: max: 1920x1080 min: 720x400
  Monitor-2: DP-5 mapped: DP-1-1 note: disabled pos: right model: MSI G27C4
    serial: <filter> built: 2020 res: 1920x1080 dpi: 93 gamma: 1.2
    size: 527x297mm (20.75x11.69") diag: 686mm (27") ratio: 16:9 modes:
    max: 1920x1080 min: 640x480
  Monitor-3: HDMI-A-1 mapped: HDMI-0 note: disabled pos: left model: HP X24ih
    serial: <filter> built: 2021 res: 1920x1080 dpi: 93 gamma: 1.2
    size: 527x297mm (20.75x11.69") diag: 605mm (23.8") ratio: 16:9 modes:
    max: 1920x1080 min: 640x480
  Monitor-4: eDP-1 mapped: eDP-1-1 note: disabled model: AU Optronics 0xaf90
    built: 2020 res: 1920x1080 dpi: 142 gamma: 1.2 size: 344x193mm (13.54x7.6")
    diag: 394mm (15.5") ratio: 16:9 modes: 1920x1080
  API: EGL v: 1.5 hw: drv: intel iris drv: nvidia platforms: device: 0
    drv: nvidia device: 1 drv: iris device: 3 drv: swrast gbm: drv: iris
    surfaceless: drv: nvidia x11: drv: nvidia inactive: wayland,device-2
  API: OpenGL v: 4.6.0 compat-v: 4.5 vendor: nvidia mesa v: 545.29.06
    glx-v: 1.4 direct-render: yes renderer: NVIDIA GeForce RTX 3060 Laptop
    GPU/PCIe/SSE2 memory: 5.86 GiB
  API: Vulkan v: 1.3.269 layers: 10 device: 0 type: discrete-gpu name: NVIDIA
    GeForce RTX 3060 Laptop GPU driver: nvidia v: 545.29.06
    device-ID: 10de:2520 surfaces: xcb,xlib device: 1 type: integrated-gpu
    name: Intel Graphics (ADL GT2) driver: mesa intel v: 23.1.9-manjaro1.1
    device-ID: 8086:46a6 surfaces: xcb,xlib
Audio:
  Device-1: Intel Alder Lake PCH-P High Definition Audio vendor: CLEVO/KAPOK
    driver: snd_hda_intel v: kernel alternate: snd_sof_pci_intel_tgl
    bus-ID: 00:1f.3 chip-ID: 8086:51c8 class-ID: 0403
  Device-2: NVIDIA GA106 High Definition Audio vendor: CLEVO/KAPOK
    driver: snd_hda_intel v: kernel pcie: gen: 4 speed: 16 GT/s lanes: 8
    link-max: lanes: 16 bus-ID: 01:00.1 chip-ID: 10de:228e class-ID: 0403
  Device-3: Logitech Webcam C270 driver: snd-usb-audio,uvcvideo type: USB
    rev: 2.0 speed: 480 Mb/s lanes: 1 mode: 2.0 bus-ID: 3-1.1:4
    chip-ID: 046d:0825 class-ID: 0102 serial: <filter>
  Device-4: C-Media CM106 Like Sound Device
    driver: hid-generic,snd-usb-audio,usbhid type: USB rev: 1.1 speed: 12 Mb/s
    lanes: 1 mode: 1.1 bus-ID: 3-1.2:6 chip-ID: 0d8c:0102 class-ID: 0300
  API: ALSA v: k6.5.13-1-MANJARO status: kernel-api
    tools: alsactl,alsamixer,amixer
  Server-1: PipeWire v: 1.0.0 status: active with: 1: pipewire-pulse
    status: active 2: wireplumber status: active 3: pipewire-alsa type: plugin
    4: pw-jack type: plugin tools: pactl,pw-cat,pw-cli,wpctl
Network:
  Device-1: Intel Alder Lake-P PCH CNVi WiFi driver: iwlwifi v: kernel
    bus-ID: 00:14.3 chip-ID: 8086:51f0 class-ID: 0280
  IF: wlp0s20f3 state: down mac: <filter>
  Device-2: Intel Ethernet I219-V driver: e1000e v: kernel port: N/A
    bus-ID: 00:1f.6 chip-ID: 8086:1a1f class-ID: 0200
  IF: eno0 state: up speed: 1000 Mbps duplex: full mac: <filter>
  IF-ID-1: Eddie state: unknown speed: N/A duplex: N/A mac: N/A
Bluetooth:
  Device-1: Intel AX201 Bluetooth driver: btusb v: 0.8 type: USB rev: 2.0
    speed: 12 Mb/s lanes: 1 mode: 1.1 bus-ID: 3-10:7 chip-ID: 8087:0026
    class-ID: e001
  Report: btmgmt ID: hci0 rfk-id: 1 state: up address: <filter> bt-v: 5.2
    lmp-v: 11 status: discoverable: no pairing: no class-ID: 7c010c
Drives:
  Local Storage: total: 6.37 TiB used: 3.31 TiB (52.0%)
  SMART Message: Unable to run smartctl. Root privileges required.
  ID-1: /dev/nvme0n1 maj-min: 259:0 vendor: Samsung model: SSD 980 PRO 2TB
    size: 1.82 TiB block-size: physical: 512 B logical: 512 B speed: 63.2 Gb/s
    lanes: 4 tech: SSD serial: <filter> fw-rev: 5B2QGXA7 temp: 34.9 C
    scheme: GPT
  ID-2: /dev/nvme1n1 maj-min: 259:4 vendor: Samsung
    model: SSD 970 EVO Plus 1TB size: 931.51 GiB block-size: physical: 512 B
    logical: 512 B speed: 31.6 Gb/s lanes: 4 tech: SSD serial: <filter>
    fw-rev: 2B2QEXM7 temp: 34.9 C scheme: GPT
  ID-3: /dev/sda maj-min: 8:0 vendor: Seagate model: Game Drive PS4
    size: 3.64 TiB block-size: physical: 4096 B logical: 512 B type: USB rev: 3.0
    spd: 5 Gb/s lanes: 1 mode: 3.2 gen-1x1 tech: N/A serial: <filter>
    fw-rev: 0304 scheme: GPT
Partition:
  ID-1: / raw-size: 500 GiB size: 491.08 GiB (98.22%) used: 128.05 GiB (26.1%)
    fs: ext4 dev: /dev/nvme0n1p1 maj-min: 259:1
  ID-2: /boot/efi raw-size: 513 MiB size: 512 MiB (99.80%)
    used: 46.9 MiB (9.2%) fs: vfat dev: /dev/nvme0n1p3 maj-min: 259:3
  ID-3: /home raw-size: 1.33 TiB size: 1.31 TiB (98.35%)
    used: 1.06 TiB (80.9%) fs: ext4 dev: /dev/nvme0n1p2 maj-min: 259:2
Swap:
  Kernel: swappiness: 10 (default 60) cache-pressure: 100 (default) zswap: yes
    compressor: zstd max-pool: 20%
  ID-1: swap-1 type: file size: 16 GiB used: 85.8 MiB (0.5%) priority: -2
    file: /swapfile
Sensors:
  System Temperatures: cpu: 50.0 C mobo: N/A gpu: nvidia temp: 40 C
  Fan Speeds (rpm): cpu: 0
Info:
  Processes: 547 Uptime: 1d 3h 17m wakeups: 0 Memory: total: 32 GiB note: est.
  available: 31.19 GiB used: 11.65 GiB (37.3%) Init: systemd v: 254
  default: graphical tool: systemctl Compilers: gcc: 13.2.1 clang: 16.0.6
  Packages: 2563 pm: pacman pkgs: 2518 libs: 545
  tools: gnome-software,octopi,pamac,paru,yay pm: flatpak pkgs: 45 Shell: Zsh
  v: 5.9 running-in: tilix inxi: 3.3.31

A long shot, but since @Yochanan seems to use a bunch of monitors, this looks interesting:

The refresh rate of my MiniDisplay Port 1.2 to 2 HDMI monitors (was) on 59.93 and I would have this problem. If I set both to 60Hz one of the displays wouldn’t show. It was only once I set 1 monitor to 50Hz and one 60Hz that the problem was fixed.
Xid 79, GPU has fallen off the bus. - #15 by wlarsong - CUDA Programming and Performance - NVIDIA Developer Forums

Not related for me. All three external monitors run at 144Hz and so does my laptop screen (lid is closed). I have Mini DisplayPort, HDMI & Thunderbolt 4 ports.

I’ve had this happen on my nearly-6-year-old 1080 Ti for almost two years now. I suspect a hardware issue of some kind, perhaps in memory. When it starts to fall off the bus (it does this at least once per day), I take it out of the machine for 20 minutes, put it back in, and the card behaves for the next 4-10 months :joy:

Maybe it just wanted a hug. Falling off a bus must hurt tremendously!
:smiling_face_with_three_hearts:
