System crashes when running a large ML model on GPU

I recently wanted to run a relatively large ML model on the GPU of my Manjaro machine, but every time I run it, the system freezes and stops responding. Yesterday, some mosaic-like squares even appeared on my screen. My system information is shown below (inxi -b):

System:
  Host: manjaro Kernel: 6.5.13-7-MANJARO arch: x86_64 bits: 64 Desktop: i3
    v: 4.23 Distro: Manjaro Linux
Machine:
  Type: Desktop System: PCSpecialist product: Vortex Elite v: N/A
    serial: <superuser required>
  Mobo: ASUSTeK model: PRIME B760-PLUS D4 v: Rev 1.xx
    serial: <superuser required> UEFI: American Megatrends v: 1402
    date: 09/11/2023
CPU:
  Info: 16-core (8-mt/8-st) 13th Gen Intel Core i7-13700F [MST AMCP]
    speed (MHz): avg: 2233 min/max: 800/5100:5200:4100
Graphics:
  Device-1: NVIDIA AD102 [GeForce RTX 4090] driver: nvidia v: 545.29.06
  Display: x11 server: X.Org v: 21.1.10 with: Xwayland v: 23.2.3 driver: X:
    loaded: nvidia gpu: nvidia resolution: 1920x1080~60Hz
  API: OpenGL v: 4.6.0 compat-v: 4.5 vendor: nvidia mesa v: 545.29.06
    renderer: NVIDIA GeForce RTX 4090/PCIe/SSE2
Network:
  Device-1: Intel Wi-Fi 6 AX210/AX211/AX411 160MHz driver: iwlwifi
  Device-2: Realtek RTL8125 2.5GbE driver: r8169
Drives:
  Local Storage: total: 4.1 TiB used: 122.69 GiB (2.9%)
Info:
  Processes: 442 Uptime: 35m Memory: total: 32 GiB available: 31.16 GiB
  used: 5.44 GiB (17.5%) Shell: fish inxi: 3.3.31

My computer also has a Windows installation, so I tried running the ML model training process there, and it works well. But I still want to use Manjaro for my work, so I want to solve the problem. Could anyone give me some advice or ideas? Thank you for your time and help!

Can you install an LTS kernel?
What do the temperatures look like?

Hi @fate1997, and welcome!

The only thing I can think of to possibly find the problem is checking the logs. So let’s do so. Please provide the output of the following:

journalctl --priority=warning..err --boot=-1 --no-pager

Where:

  • The --priority=warning..err argument limits the output to warnings and errors only;
  • the --boot=-1 argument limits the output to log messages from the previous boot. This can be adjusted to -2 for the boot before that, -3 for the one before that, and so on;
  • --no-pager prints the output directly instead of through a pager, which makes it easy to copy and paste here, on the forum.
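
Since the suspicion here is the GPU, it may also help to narrow the same log down to driver messages. The following is only a sketch: nvidia_log_lines is a made-up helper name, and the search pattern (NVRM, Xid, nvidia) is my assumption about what is worth keeping. The -k option restricts journalctl to kernel messages.

```shell
#!/usr/bin/env bash
# Keep only log lines that mention the NVIDIA driver (NVRM, Xid, nvidia).
# Reads journal text on stdin, so it also works on a saved log file.
# nvidia_log_lines is a hypothetical helper name, not an existing tool.
nvidia_log_lines() {
    grep -i -E 'nvrm|xid|nvidia' || true
}

# Example: kernel messages (-k) from the previous boot, if journalctl exists.
if command -v journalctl >/dev/null 2>&1; then
    journalctl -k --boot=-1 --no-pager 2>/dev/null | nvidia_log_lines
fi
```

The filtered output is usually much shorter than the full log, which makes it easier to post.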

Also, it would help a great deal to know how you run it. I’m no expert, but there must be some software or command or something you use, which would help narrow this down.


:bangbang: Tip: :bangbang:

When posting terminal output, copy the output and paste it here, wrapped in three (3) backticks, before AND after the pasted text. Like this:

```
pasted text
```

Or three (3) tilde signs, like this:

~~~
pasted text
~~~

This will just cause it to be rendered like this:

Sed
sollicitudin dolor
eget nisl elit id
condimentum
arcu erat varius
cursus sem quis eros.

Instead of like this:

Sed sollicitudin dolor eget nisl elit id condimentum arcu erat varius cursus sem quis eros.

Alternatively, paste the text you wish to format as terminal output, select all pasted text, and click the </> button on the taskbar. This will indent the whole pasted section with one TAB, causing it to render the same way as described above.

This improves legibility, making it easier for those trying to provide assistance.


:bangbang::bangbang: Additionally

If your language isn’t English, please prepend any and all terminal commands with LC_ALL=C. For example:

LC_ALL=C bluetoothctl

This will just cause the terminal output to be in English, making it easier to understand and debug.


Also, your kernel (6.5.13-7-MANJARO) has been EOL since 28 November 2023, so it would seem you haven’t updated in quite some time, which means you’ve been a naughty boy.

So install a supported kernel, update your system, and see what it does…

This should be first on the OP’s list, along with ensuring updates are current, attending to important .pacnew files, etc.
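
For reference, a rough sketch of how the kernel check could look. kernel_pkg_name is a hypothetical helper that assumes Manjaro's usual package naming (major and minor version glued onto "linux", so 6.6.x becomes linux66); the mhwd-kernel lines are left as comments because they change the system.

```shell
#!/usr/bin/env bash
# Derive the Manjaro kernel package name (e.g. linux66) from a release
# string such as the output of `uname -r` (e.g. 6.6.8-2-MANJARO).
# kernel_pkg_name is a hypothetical helper, not part of any Manjaro tool.
kernel_pkg_name() {
    printf '%s\n' "$1" | awk -F. '{ print "linux" $1 $2 }'
}

echo "running kernel: $(uname -r) -> package $(kernel_pkg_name "$(uname -r)")"

# On Manjaro, installed kernels can be listed and another one installed with:
#   mhwd-kernel -li
#   sudo mhwd-kernel -i linux66
```

After installing a new kernel, a reboot is needed before uname -r reports it.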

Thank you so much for your quick reply!!! I have updated my Linux kernel to linux66. However, the problem still exists. The output of journalctl --priority=warning..err --boot=-1 --no-pager is shown below:

Jan 05 15:52:26 manjaro kernel: sched: RT throttling activated
Jan 05 15:52:27 manjaro kernel: general protection fault, probably for non-canonical address 0xbad0fba5ffffffff: 0000 [#1] PREEMPT SMP NOPTI
Jan 05 15:52:27 manjaro kernel: CPU: 17 PID: 1052 Comm: irq/183-nvidia Tainted: P           OE      6.6.8-2-MANJARO #1 146dce4c0b8863ad44f28ec2edb37ecbadc944c7
Jan 05 15:52:27 manjaro kernel: Hardware name: PCSpecialist Vortex Elite/PRIME B760-PLUS D4, BIOS 1402 09/11/2023
Jan 05 15:52:27 manjaro kernel: RIP: 0010:_nv008071rm+0x12/0x20 [nvidia]
Jan 05 15:52:27 manjaro kernel: Code: 20 4c 8d 56 f8 89 44 24 20 48 8b 86 d0 09 00 00 4c 89 d6 ff e0 cc 66 90 f3 0f 1e fa 4c 8d 46 f8 48 8b 86 d8 09 00 00 4c 89 c6 <ff> e0 cc 66 90 66 0f 1f 84 00 00 00 00 00 f3 0f 1e fa 4c 8d 46 f8
Jan 05 15:52:27 manjaro kernel: RSP: 0018:ffffc900043a3d80 EFLAGS: 00010202
Jan 05 15:52:27 manjaro kernel: RAX: bad0fba5ffffffff RBX: 0000000000000040 RCX: 00000000ffffffff
Jan 05 15:52:27 manjaro kernel: RDX: ffff88813c5dc010 RSI: ffff888140562008 RDI: ffff88813c310008
Jan 05 15:52:27 manjaro kernel: RBP: ffff88813c302c90 R08: ffff888140562008 R09: 0000000000000020
Jan 05 15:52:27 manjaro kernel: R10: ffff88813c302c5c R11: 0000000000000000 R12: ffff88813c310008
Jan 05 15:52:27 manjaro kernel: R13: ffff88813c5dc010 R14: 00000000ffffffff R15: ffff888140560008
Jan 05 15:52:27 manjaro kernel: FS:  0000000000000000(0000) GS:ffff88885fc40000(0000) knlGS:0000000000000000
Jan 05 15:52:27 manjaro kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jan 05 15:52:27 manjaro kernel: CR2: 00007ffe4d95d2a8 CR3: 00000003dd220000 CR4: 0000000000f50ee0
Jan 05 15:52:27 manjaro kernel: PKRU: 55555554
Jan 05 15:52:27 manjaro kernel: Call Trace:
Jan 05 15:52:27 manjaro kernel:  <TASK>
Jan 05 15:52:27 manjaro kernel:  ? die_addr+0x36/0x90
Jan 05 15:52:27 manjaro kernel:  ? exc_general_protection+0x1c5/0x430
Jan 05 15:52:27 manjaro kernel:  ? os_acquire_spinlock+0x12/0x30 [nvidia 59d87dcd64405c9ddf5d779789533615f2dd9e0f]
Jan 05 15:52:27 manjaro kernel:  ? asm_exc_general_protection+0x26/0x30
Jan 05 15:52:27 manjaro kernel:  ? _nv008071rm+0x12/0x20 [nvidia 59d87dcd64405c9ddf5d779789533615f2dd9e0f]
Jan 05 15:52:27 manjaro kernel:  _nv042651rm+0x17e/0x1f0 [nvidia 59d87dcd64405c9ddf5d779789533615f2dd9e0f]
Jan 05 15:52:27 manjaro kernel:  _nv031162rm+0x60/0xc0 [nvidia 59d87dcd64405c9ddf5d779789533615f2dd9e0f]
Jan 05 15:52:27 manjaro kernel:  _nv011729rm+0x206/0x320 [nvidia 59d87dcd64405c9ddf5d779789533615f2dd9e0f]
Jan 05 15:52:27 manjaro kernel:  _nv031172rm+0x16f/0x1e0 [nvidia 59d87dcd64405c9ddf5d779789533615f2dd9e0f]
Jan 05 15:52:27 manjaro kernel:  _nv000720rm+0x113/0x150 [nvidia 59d87dcd64405c9ddf5d779789533615f2dd9e0f]
Jan 05 15:52:27 manjaro kernel:  ? __pfx_irq_thread_fn+0x10/0x10
Jan 05 15:52:27 manjaro kernel:  rm_isr_bh+0x20/0x5c [nvidia 59d87dcd64405c9ddf5d779789533615f2dd9e0f]
Jan 05 15:52:27 manjaro kernel:  nvidia_isr_kthread_bh+0x1f/0x50 [nvidia 59d87dcd64405c9ddf5d779789533615f2dd9e0f]
Jan 05 15:52:27 manjaro kernel:  irq_thread_fn+0x20/0x60
Jan 05 15:52:27 manjaro kernel:  irq_thread+0xfb/0x1c0
Jan 05 15:52:27 manjaro kernel:  ? __pfx_irq_thread_dtor+0x10/0x10
Jan 05 15:52:27 manjaro kernel:  ? __pfx_irq_thread+0x10/0x10
Jan 05 15:52:27 manjaro kernel:  kthread+0xe5/0x120
Jan 05 15:52:27 manjaro kernel:  ? __pfx_kthread+0x10/0x10
Jan 05 15:52:27 manjaro kernel:  ret_from_fork+0x31/0x50
Jan 05 15:52:27 manjaro kernel:  ? __pfx_kthread+0x10/0x10
Jan 05 15:52:27 manjaro kernel:  ret_from_fork_asm+0x1b/0x30
Jan 05 15:52:27 manjaro kernel:  </TASK>
Jan 05 15:52:27 manjaro kernel: Modules linked in: rfcomm tun ccm qrtr cmac algif_hash algif_skcipher af_alg bnep nvidia_uvm(POE) nvidia_drm(POE) nvidia_modeset(POE) nvidia(POE) snd_sof_pci_intel_tgl snd_sof_intel_hda_common soundwire_intel snd_sof_intel_hda_mlink soundwire_cadence snd_sof_intel_hda snd_sof_pci snd_sof_xtensa_dsp snd_sof snd_sof_utils snd_soc_hdac_hda snd_hda_ext_core snd_soc_acpi_intel_match snd_soc_acpi soundwire_generic_allocation soundwire_bus vfat fat intel_rapl_msr intel_rapl_common intel_uncore_frequency iwlmvm snd_soc_core intel_uncore_frequency_common intel_tcc_cooling x86_pkg_temp_thermal snd_compress ac97_bus snd_hda_codec_hdmi intel_powerclamp snd_hda_codec_realtek mac80211 snd_hda_codec_generic snd_pcm_dmaengine r8169 snd_hda_intel joydev mousedev coretemp snd_intel_dspcfg libarc4 realtek snd_intel_sdw_acpi mdio_devres libphy snd_hda_codec kvm_intel snd_hda_core snd_hwdep btusb btrtl btintel btbcm btmtk snd_pcm kvm snd_timer snd bluetooth iwlwifi irqbypass soundcore crct10dif_pclmul crc32_pclmul
Jan 05 15:52:27 manjaro kernel:  polyval_clmulni polyval_generic ecdh_generic iTCO_wdt cfg80211 intel_pmc_bxt eeepc_wmi iTCO_vendor_support mei_me intel_lpss_pci ee1004 intel_lpss asus_wmi gf128mul spi_nor pmt_telemetry i2c_i801 pmt_class mei ledtrig_audio mtd i2c_smbus intel_vsec idma64 ghash_clmulni_intel sparse_keymap sha512_ssse3 sha256_ssse3 sha1_ssse3 aesni_intel platform_profile i8042 mac_hid serio crypto_simd rfkill cryptd rapl intel_cstate intel_uncore pcspkr wmi_bmof video acpi_tad acpi_pad wmi squashfs fuse dm_mod crypto_user loop bpf_preload ip_tables x_tables ext4 crc32c_generic crc16 mbcache jbd2 usbhid nvme crc32c_intel spi_intel_pci nvme_core xhci_pci spi_intel xhci_pci_renesas nvme_common vmd
Jan 05 15:52:27 manjaro kernel: ---[ end trace 0000000000000000 ]---
Jan 05 15:52:27 manjaro kernel: RIP: 0010:_nv008071rm+0x12/0x20 [nvidia]
Jan 05 15:52:27 manjaro kernel: Code: 20 4c 8d 56 f8 89 44 24 20 48 8b 86 d0 09 00 00 4c 89 d6 ff e0 cc 66 90 f3 0f 1e fa 4c 8d 46 f8 48 8b 86 d8 09 00 00 4c 89 c6 <ff> e0 cc 66 90 66 0f 1f 84 00 00 00 00 00 f3 0f 1e fa 4c 8d 46 f8
Jan 05 15:52:27 manjaro kernel: RSP: 0018:ffffc900043a3d80 EFLAGS: 00010202
Jan 05 15:52:27 manjaro kernel: RAX: bad0fba5ffffffff RBX: 0000000000000040 RCX: 00000000ffffffff
Jan 05 15:52:27 manjaro kernel: RDX: ffff88813c5dc010 RSI: ffff888140562008 RDI: ffff88813c310008
Jan 05 15:52:27 manjaro kernel: RBP: ffff88813c302c90 R08: ffff888140562008 R09: 0000000000000020
Jan 05 15:52:27 manjaro kernel: R10: ffff88813c302c5c R11: 0000000000000000 R12: ffff88813c310008
Jan 05 15:52:27 manjaro kernel: R13: ffff88813c5dc010 R14: 00000000ffffffff R15: ffff888140560008
Jan 05 15:52:27 manjaro kernel: FS:  0000000000000000(0000) GS:ffff88885fc40000(0000) knlGS:0000000000000000
Jan 05 15:52:27 manjaro kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jan 05 15:52:27 manjaro kernel: CR2: 00007ffe4d95d2a8 CR3: 00000003dd220000 CR4: 0000000000f50ee0
Jan 05 15:52:27 manjaro kernel: PKRU: 55555554
Jan 05 15:52:27 manjaro kernel: Oops: 0011 [#2] PREEMPT SMP NOPTI
Jan 05 15:52:27 manjaro kernel: CPU: 17 PID: 1052 Comm: irq/183-nvidia Tainted: P      D    OE      6.6.8-2-MANJARO #1 146dce4c0b8863ad44f28ec2edb37ecbadc944c7
Jan 05 15:52:27 manjaro kernel: Hardware name: PCSpecialist Vortex Elite/PRIME B760-PLUS D4, BIOS 1402 09/11/2023
Jan 05 15:52:27 manjaro kernel: RIP: 0010:0xffff888125d920c0
Jan 05 15:52:27 manjaro kernel: Code: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 <00> 40 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 11 00
Jan 05 15:52:27 manjaro kernel: RSP: 0018:ffffc900043a3eb0 EFLAGS: 00010286
Jan 05 15:52:27 manjaro kernel: RAX: ffff888125d920c0 RBX: ffffffff944e361e RCX: 00000000000001c0
Jan 05 15:52:27 manjaro kernel: RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffffc900043a3eb0
Jan 05 15:52:27 manjaro kernel: RBP: ffff888125d920c0 R08: 0000000000000000 R09: ffffc900043a3ab8
Jan 05 15:52:27 manjaro kernel: R10: 0000000000000003 R11: ffff88887ffaab28 R12: ffff888125d929c4
Jan 05 15:52:27 manjaro kernel: R13: ffff88812344b901 R14: 0000000000000000 R15: 0000000000000000
Jan 05 15:52:27 manjaro kernel: FS:  0000000000000000(0000) GS:ffff88885fc40000(0000) knlGS:0000000000000000
Jan 05 15:52:27 manjaro kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jan 05 15:52:27 manjaro kernel: CR2: ffff888125d920c0 CR3: 00000003dd220000 CR4: 0000000000f50ee0
Jan 05 15:52:27 manjaro kernel: PKRU: 55555554
Jan 05 15:52:27 manjaro kernel: Call Trace:
Jan 05 15:52:27 manjaro kernel:  <TASK>
Jan 05 15:52:27 manjaro kernel:  ? __die+0x23/0x70
Jan 05 15:52:27 manjaro kernel:  ? page_fault_oops+0x171/0x4e0
Jan 05 15:52:27 manjaro kernel:  ? exc_page_fault+0x175/0x180
Jan 05 15:52:27 manjaro kernel:  ? asm_exc_page_fault+0x26/0x30
Jan 05 15:52:27 manjaro kernel:  ? task_work_run+0x4e/0x90
Jan 05 15:52:27 manjaro kernel:  ? task_work_run+0x5a/0x90
Jan 05 15:52:27 manjaro kernel:  ? do_exit+0x377/0xb20
Jan 05 15:52:27 manjaro kernel:  ? __pfx_irq_thread+0x10/0x10
Jan 05 15:52:27 manjaro kernel:  ? make_task_dead+0x81/0x170
Jan 05 15:52:27 manjaro kernel:  ? rewind_stack_and_make_dead+0x17/0x20
Jan 05 15:52:27 manjaro kernel:  </TASK>
Jan 05 15:52:27 manjaro kernel: Modules linked in: rfcomm tun ccm qrtr cmac algif_hash algif_skcipher af_alg bnep nvidia_uvm(POE) nvidia_drm(POE) nvidia_modeset(POE) nvidia(POE) snd_sof_pci_intel_tgl snd_sof_intel_hda_common soundwire_intel snd_sof_intel_hda_mlink soundwire_cadence snd_sof_intel_hda snd_sof_pci snd_sof_xtensa_dsp snd_sof snd_sof_utils snd_soc_hdac_hda snd_hda_ext_core snd_soc_acpi_intel_match snd_soc_acpi soundwire_generic_allocation soundwire_bus vfat fat intel_rapl_msr intel_rapl_common intel_uncore_frequency iwlmvm snd_soc_core intel_uncore_frequency_common intel_tcc_cooling x86_pkg_temp_thermal snd_compress ac97_bus snd_hda_codec_hdmi intel_powerclamp snd_hda_codec_realtek mac80211 snd_hda_codec_generic snd_pcm_dmaengine r8169 snd_hda_intel joydev mousedev coretemp snd_intel_dspcfg libarc4 realtek snd_intel_sdw_acpi mdio_devres libphy snd_hda_codec kvm_intel snd_hda_core snd_hwdep btusb btrtl btintel btbcm btmtk snd_pcm kvm snd_timer snd bluetooth iwlwifi irqbypass soundcore crct10dif_pclmul crc32_pclmul
Jan 05 15:52:27 manjaro kernel:  polyval_clmulni polyval_generic ecdh_generic iTCO_wdt cfg80211 intel_pmc_bxt eeepc_wmi iTCO_vendor_support mei_me intel_lpss_pci ee1004 intel_lpss asus_wmi gf128mul spi_nor pmt_telemetry i2c_i801 pmt_class mei ledtrig_audio mtd i2c_smbus intel_vsec idma64 ghash_clmulni_intel sparse_keymap sha512_ssse3 sha256_ssse3 sha1_ssse3 aesni_intel platform_profile i8042 mac_hid serio crypto_simd rfkill cryptd rapl intel_cstate intel_uncore pcspkr wmi_bmof video acpi_tad acpi_pad wmi squashfs fuse dm_mod crypto_user loop bpf_preload ip_tables x_tables ext4 crc32c_generic crc16 mbcache jbd2 usbhid nvme crc32c_intel spi_intel_pci nvme_core xhci_pci spi_intel xhci_pci_renesas nvme_common vmd
Jan 05 15:52:27 manjaro kernel: CR2: ffff888125d920c0
Jan 05 15:52:27 manjaro kernel: ---[ end trace 0000000000000000 ]---
Jan 05 15:52:27 manjaro kernel: RIP: 0010:_nv008071rm+0x12/0x20 [nvidia]
Jan 05 15:52:27 manjaro kernel: Code: 20 4c 8d 56 f8 89 44 24 20 48 8b 86 d0 09 00 00 4c 89 d6 ff e0 cc 66 90 f3 0f 1e fa 4c 8d 46 f8 48 8b 86 d8 09 00 00 4c 89 c6 <ff> e0 cc 66 90 66 0f 1f 84 00 00 00 00 00 f3 0f 1e fa 4c 8d 46 f8
Jan 05 15:52:27 manjaro kernel: RSP: 0018:ffffc900043a3d80 EFLAGS: 00010202
Jan 05 15:52:27 manjaro kernel: RAX: bad0fba5ffffffff RBX: 0000000000000040 RCX: 00000000ffffffff
Jan 05 15:52:27 manjaro kernel: RDX: ffff88813c5dc010 RSI: ffff888140562008 RDI: ffff88813c310008
Jan 05 15:52:27 manjaro kernel: RBP: ffff88813c302c90 R08: ffff888140562008 R09: 0000000000000020
Jan 05 15:52:27 manjaro kernel: R10: ffff88813c302c5c R11: 0000000000000000 R12: ffff88813c310008
Jan 05 15:52:27 manjaro kernel: R13: ffff88813c5dc010 R14: 00000000ffffffff R15: ffff888140560008
Jan 05 15:52:27 manjaro kernel: FS:  0000000000000000(0000) GS:ffff88885fc40000(0000) knlGS:0000000000000000
Jan 05 15:52:27 manjaro kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jan 05 15:52:27 manjaro kernel: CR2: ffff888125d920c0 CR3: 00000003dd220000 CR4: 0000000000f50ee0
Jan 05 15:52:27 manjaro kernel: PKRU: 55555554
Jan 05 15:52:27 manjaro kernel: BUG: scheduling while atomic: irq/183-nvidia/1052/0x00000000
Jan 05 15:52:27 manjaro kernel: Modules linked in: rfcomm tun ccm qrtr cmac algif_hash algif_skcipher af_alg bnep nvidia_uvm(POE) nvidia_drm(POE) nvidia_modeset(POE) nvidia(POE) snd_sof_pci_intel_tgl snd_sof_intel_hda_common soundwire_intel snd_sof_intel_hda_mlink soundwire_cadence snd_sof_intel_hda snd_sof_pci snd_sof_xtensa_dsp snd_sof snd_sof_utils snd_soc_hdac_hda snd_hda_ext_core snd_soc_acpi_intel_match snd_soc_acpi soundwire_generic_allocation soundwire_bus vfat fat intel_rapl_msr intel_rapl_common intel_uncore_frequency iwlmvm snd_soc_core intel_uncore_frequency_common intel_tcc_cooling x86_pkg_temp_thermal snd_compress ac97_bus snd_hda_codec_hdmi intel_powerclamp snd_hda_codec_realtek mac80211 snd_hda_codec_generic snd_pcm_dmaengine r8169 snd_hda_intel joydev mousedev coretemp snd_intel_dspcfg libarc4 realtek snd_intel_sdw_acpi mdio_devres libphy snd_hda_codec kvm_intel snd_hda_core snd_hwdep btusb btrtl btintel btbcm btmtk snd_pcm kvm snd_timer snd bluetooth iwlwifi irqbypass soundcore crct10dif_pclmul crc32_pclmul
Jan 05 15:52:27 manjaro kernel:  polyval_clmulni polyval_generic ecdh_generic iTCO_wdt cfg80211 intel_pmc_bxt eeepc_wmi iTCO_vendor_support mei_me intel_lpss_pci ee1004 intel_lpss asus_wmi gf128mul spi_nor pmt_telemetry i2c_i801 pmt_class mei ledtrig_audio mtd i2c_smbus intel_vsec idma64 ghash_clmulni_intel sparse_keymap sha512_ssse3 sha256_ssse3 sha1_ssse3 aesni_intel platform_profile i8042 mac_hid serio crypto_simd rfkill cryptd rapl intel_cstate intel_uncore pcspkr wmi_bmof video acpi_tad acpi_pad wmi squashfs fuse dm_mod crypto_user loop bpf_preload ip_tables x_tables ext4 crc32c_generic crc16 mbcache jbd2 usbhid nvme crc32c_intel spi_intel_pci nvme_core xhci_pci spi_intel xhci_pci_renesas nvme_common vmd
Jan 05 15:52:27 manjaro kernel: CPU: 17 PID: 1052 Comm: irq/183-nvidia Tainted: P      D    OE      6.6.8-2-MANJARO #1 146dce4c0b8863ad44f28ec2edb37ecbadc944c7
Jan 05 15:52:27 manjaro kernel: Hardware name: PCSpecialist Vortex Elite/PRIME B760-PLUS D4, BIOS 1402 09/11/2023
Jan 05 15:52:27 manjaro kernel: Call Trace:
Jan 05 15:52:27 manjaro kernel:  <TASK>
Jan 05 15:52:27 manjaro kernel:  dump_stack_lvl+0x47/0x60
Jan 05 15:52:27 manjaro kernel:  __schedule_bug+0x56/0x70
Jan 05 15:52:27 manjaro kernel:  __schedule+0x103c/0x1410
Jan 05 15:52:27 manjaro kernel:  ? __wake_up_klogd.part.0+0x3c/0x60
Jan 05 15:52:27 manjaro kernel:  ? vprintk_emit+0x175/0x2b0
Jan 05 15:52:27 manjaro kernel:  ? _printk+0x64/0x80
Jan 05 15:52:27 manjaro kernel:  do_task_dead+0x43/0x50
Jan 05 15:52:27 manjaro kernel:  make_task_dead+0x14f/0x170
Jan 05 15:52:27 manjaro kernel:  rewind_stack_and_make_dead+0x17/0x20
Jan 05 15:52:27 manjaro kernel: RIP: 0000:0x0
Jan 05 15:52:27 manjaro kernel: Code: Unable to access opcode bytes at 0xffffffffffffffd6.
Jan 05 15:52:27 manjaro kernel: RSP: 0000:0000000000000000 EFLAGS: 00000000 ORIG_RAX: 0000000000000000
Jan 05 15:52:27 manjaro kernel: RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
Jan 05 15:52:27 manjaro kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
Jan 05 15:52:27 manjaro kernel: RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
Jan 05 15:52:27 manjaro kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
Jan 05 15:52:27 manjaro kernel: R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
Jan 05 15:52:27 manjaro kernel:  </TASK>

Can you confirm that your system is fully updated to current?

The Nvidia 545.x drivers have had issues recently. Being fully updated will ensure you have the latest and most stable drivers available.

Additionally, placing your Nvidia card in another physical slot can potentially make a difference.

That’s all I have to add. Cheers.

I also asked for additional information, which is still not there. And as @Keruskerfuerst stated, check the temperatures of both your GPU and CPU, especially since a faulty driver, as mentioned by @soundofthunder, can be problematic.

Thank you for your advice!

I have updated the nvidia drivers by running the command:

sudo mhwd -a pci nonfree 0300 -f

However, the Nvidia driver version has not changed; it is still 545.29.06. I know it may be a stupid question, but how do I fully update the Nvidia driver?
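
As far as I know, on Manjaro the driver packages are upgraded by the normal full system update, and mhwd only selects which driver configuration is installed, so re-running it would not by itself fetch a newer version. A small sketch for comparing the loaded driver version with the one installed on disk follows; extract_driver_version is a hypothetical helper, and it assumes the first dotted number in /proc/driver/nvidia/version is the driver version.

```shell
#!/usr/bin/env bash
# Pull the first dotted version number (e.g. 545.29.06) out of text on stdin,
# such as the contents of /proc/driver/nvidia/version.
# extract_driver_version is a hypothetical helper name.
extract_driver_version() {
    grep -oE '[0-9]+\.[0-9]+(\.[0-9]+)?' | head -n 1
}

# Version of the module that is currently loaded (only if the driver is loaded).
if [ -r /proc/driver/nvidia/version ]; then
    echo "loaded:  $(extract_driver_version < /proc/driver/nvidia/version)"
fi

# Version of the module installed on disk for the running kernel.
if command -v modinfo >/dev/null 2>&1; then
    echo "on disk: $(modinfo -F version nvidia 2>/dev/null)"
fi

# A full system upgrade, which is what also updates the driver packages:
#   sudo pacman -Syu
```

If the two versions differ, the new module is installed but not yet loaded, and a reboot should bring them in line.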

The GPU temperature when I run the program is 70 °C right before the system crashes.
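
A single reading just before a freeze is easy to miss, so logging the temperature to a file while the model runs may give a clearer picture. This is a sketch under assumptions: max_gpu_temp is a hypothetical helper, gpu_log.csv is a made-up file name, and the log is expected to be in the timestamp, temperature, power layout produced by the commented nvidia-smi command.

```shell
#!/usr/bin/env bash
# Print the highest temperature found in CSV lines of the form
#   timestamp, temperature, power
# (the layout nvidia-smi writes with --format=csv,noheader,nounits).
# max_gpu_temp is a hypothetical helper name.
max_gpu_temp() {
    awk -F', *' '$2 + 0 > max { max = $2 + 0 } END { print max + 0 }'
}

# Log one sample per second while the model runs (stop with Ctrl+C).
# Left as a comment because it loops until interrupted:
#   nvidia-smi --query-gpu=timestamp,temperature.gpu,power.draw \
#              --format=csv,noheader,nounits -l 1 >> gpu_log.csv

# After a crash and reboot, inspect the highest value that was recorded.
if [ -r gpu_log.csv ]; then
    echo "max GPU temperature: $(max_gpu_temp < gpu_log.csv)"
fi
```

Including power.draw is a design choice: a crash under load with moderate temperatures can also point at power delivery, so it seems worth recording both.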

Below is some information which I thought may be related to the problem:

Jan 05 17:02:01 manjaro kernel: NVRM: GPU at PCI:0000:01:00: GPU-0df5dbd1-4528-547a-8068-2c2ecbdd12a8
Jan 05 17:02:01 manjaro kernel: NVRM: Xid (PCI:0000:01:00): 62, pid='<unknown>', name=<unknown>, 2024a7c2 2024a9fa 2024a86e 20249ca8 2024d970 202>
Jan 05 17:02:01 manjaro kernel: NVRM: Xid (PCI:0000:01:00): 45, pid=1921, name=chrome, Ch 00000008
Jan 05 17:02:01 manjaro kernel: NVRM: Xid (PCI:0000:01:00): 45, pid=1921, name=chrome, Ch 0000000e
Jan 05 17:02:01 manjaro kernel: NVRM: Xid (PCI:0000:01:00): 45, pid=1921, name=chrome, Ch 0000000f
Jan 05 17:02:01 manjaro kernel: NVRM: Xid (PCI:0000:01:00): 45, pid=1921, name=chrome, Ch 00000010
Jan 05 17:02:01 manjaro kernel: NVRM: Xid (PCI:0000:01:00): 45, pid=1921, name=chrome, Ch 00000011
Jan 05 17:02:01 manjaro kernel: NVRM: Xid (PCI:0000:01:00): 45, pid=1921, name=chrome, Ch 00000012
Jan 05 17:02:01 manjaro kernel: NVRM: Xid (PCI:0000:01:00): 45, pid=1921, name=chrome, Ch 00000013
Jan 05 17:02:01 manjaro kernel: NVRM: Xid (PCI:0000:01:00): 45, pid=50907, name=python, Ch 00000038
Jan 05 17:02:01 manjaro kernel: NVRM: Xid (PCI:0000:01:00): 45, pid=50907, name=python, Ch 00000039
Jan 05 17:02:01 manjaro kernel: NVRM: Xid (PCI:0000:01:00): 45, pid=50907, name=python, Ch 0000003a
Jan 05 17:02:01 manjaro kernel: NVRM: Xid (PCI:0000:01:00): 45, pid=50907, name=python, Ch 0000003b
Jan 05 17:02:01 manjaro kernel: NVRM: Xid (PCI:0000:01:00): 45, pid=50907, name=python, Ch 0000003c
Jan 05 17:02:01 manjaro kernel: NVRM: Xid (PCI:0000:01:00): 45, pid=50907, name=python, Ch 0000003d
Jan 05 17:02:01 manjaro kernel: NVRM: Xid (PCI:0000:01:00): 45, pid=50907, name=python, Ch 0000003e
Jan 05 17:02:01 manjaro kernel: NVRM: Xid (PCI:0000:01:00): 45, pid=50907, name=python, Ch 0000003f
Jan 05 17:02:01 manjaro kernel: sched: RT throttling activated
Jan 05 17:02:46 manjaro NetworkManager[879]: <info>  [1704474166.4913] dhcp4 (wlp4s0): state changed new lease, address=146.169.128.85
Jan 05 17:02:46 manjaro dbus-daemon[839]: [system] Activating via systemd: service name='org.freedesktop.nm_dispatcher' unit='dbus-org.freedeskto>
Jan 05 17:02:46 manjaro systemd[1]: Starting Network Manager Script Dispatcher Service...
Jan 05 17:02:46 manjaro dbus-daemon[839]: [system] Successfully activated service 'org.freedesktop.nm_dispatcher'
Jan 05 17:02:46 manjaro systemd[1]: Started Network Manager Script Dispatcher Service.
Jan 05 17:02:56 manjaro systemd[1]: NetworkManager-dispatcher.service: Deactivated successfully.
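
The NVRM lines above are Xid reports from the driver. As far as I can tell from NVIDIA's Xid documentation, 62 is an internal micro-controller halt and 45 is a preemptive cleanup that follows an earlier error, so the single Xid 62 is probably the one to look at. A sketch for summarising such lines is below; xid_summary is a hypothetical helper and it assumes the line format shown in this log.

```shell
#!/usr/bin/env bash
# Count NVRM Xid events per (Xid code, process name) from journal text on stdin.
# xid_summary is a hypothetical helper; it assumes lines shaped like
#   ... NVRM: Xid (PCI:0000:01:00): 45, pid=1921, name=chrome, Ch 00000008
xid_summary() {
    sed -n -E 's/.*NVRM: Xid \([^)]*\): ([0-9]+),.*name=([^,]+),.*/\1 \2/p' \
        | sort | uniq -c \
        | awk '{ print "Xid " $2 ": " $1 " event(s) from " $3 }'
}

# Example: summarise Xid events from the previous boot, if journalctl exists.
if command -v journalctl >/dev/null 2>&1; then
    journalctl -k --boot=-1 --no-pager 2>/dev/null | xid_summary
fi
```

Grouping by process name shows whether only the training process is affected or, as here, other GPU clients such as the browser as well.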

And also these errors:

Jan 04 13:35:23 manjaro /usr/lib/gdm-x-session[1434]: (EE) NVIDIA(0): The NVIDIA X driver has encountered an error; attempting to
Jan 04 13:35:23 manjaro /usr/lib/gdm-x-session[1434]: (EE) NVIDIA(0):     recover...
Jan 04 13:35:23 manjaro kernel: NVRM: Xid (PCI:0000:01:00): 109, pid=1434, name=Xorg, Ch 00000018, errorString CTX SWITCH TIMEOUT, Info 0x44016
Jan 04 13:35:28 manjaro /usr/lib/gdm-x-session[1434]: (WW) NVIDIA: Wait for channel idle timed out.
Jan 04 13:39:01 manjaro kernel: INFO: task Xorg:1434 blocked for more than 122 seconds.
Jan 04 13:39:01 manjaro kernel:       Tainted: P           OE      6.5.13-4-MANJARO #1
Jan 04 13:39:01 manjaro kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jan 04 13:39:01 manjaro kernel: task:Xorg            state:D stack:0     pid:1434  ppid:1429   flags:0x00000002