Nvidia driver/cuda compatibility

ctamegara · 3 January 2025 04:21

Hello

I wrote a simple CUDA kernel. It compiles without problems (with nvcc), but when I run it I get the following error:

the provided PTX was compiled with an unsupported toolchain.

which seems to indicate that there is an incompatibility between some elements of the computing chain.

I’ve been trying to find which, and it looks like my NVIDIA driver doesn’t match my CUDA version. More precisely:

nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Tue_Oct_29_23:50:19_PDT_2024
Cuda compilation tools, release 12.6, V12.6.85
Build cuda_12.6.r12.6/compiler.35059454_0

comes from the official repositories (so cuda toolkit 12.6 and more precisely 12.6.3.1 as pacman -Q cuda returns). But

nvidia-smi
Fri Jan  3 04:30:21 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.135                Driver Version: 550.135        CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4090        Off |   00000000:01:00.0  On |                  Off |
|  0%   26C    P8             14W /  450W |     479MiB /  24564MiB |      2%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A      1957      G   /usr/lib/Xorg                                 108MiB |
|    0   N/A  N/A      2050      G   /usr/bin/gnome-shell                           73MiB |
|    0   N/A  N/A      2754      G   /usr/lib/firefox/firefox                      228MiB |
|    0   N/A  N/A      4171    C+G   /usr/bin/pamac-manager                         34MiB |
+-----------------------------------------------------------------------------------------+

is what I get after having installed the driver with

mhwd -i pci video-nvidia

On the one hand, it looks like 550.135 is not very old (November 2024) but does not support CUDA newer than 12.4.
On the other hand, it looks like the more recent 550.142 (December 2024) isn’t reachable from mhwd.
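
The comparison made above can be scripted. This is a minimal sketch, assuming both tools print their versions in the same format as the outputs quoted in this post (`release 12.6,` for nvcc and `CUDA Version: 12.4` for nvidia-smi); the function names are made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: compare the CUDA toolkit release with the highest CUDA version
# the installed driver reports. Function names are illustrative only.

# Extract e.g. "12.6" from `nvcc --version` text read on stdin.
toolkit_release() {
    sed -n 's/.*release \([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p' | head -n 1
}

# Extract e.g. "12.4" from `nvidia-smi` text read on stdin.
driver_cuda() {
    sed -n 's/.*CUDA Version: \([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p' | head -n 1
}

# Succeeds when version $1 is strictly newer than version $2 (GNU sort -V).
is_newer() {
    [ "$1" != "$2" ] &&
        [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | tail -n 1)" = "$1" ]
}

# Only query the real tools when both are installed.
if command -v nvcc >/dev/null 2>&1 && command -v nvidia-smi >/dev/null 2>&1; then
    tk=$(nvcc --version | toolkit_release)
    drv=$(nvidia-smi | driver_cuda)
    if is_newer "$tk" "$drv"; then
        echo "toolkit $tk is newer than what the driver reports ($drv)"
    else
        echo "toolkit $tk is within what the driver reports ($drv)"
    fi
fi
```

With the outputs shown in this post, the two values compared would be 12.6 (toolkit) and 12.4 (driver).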

So it seems I should downgrade the CUDA toolkit to 12.4. Alas, CUDA 12.4 is only in the Arch Linux Archive, which is frightening (to me). Before going in this direction, I’d like to be sure I didn’t miss something very simple that would solve the problem in a straightforward manner.

So the question is: is downgrading CUDA the easiest solution?

Thanks

[
Kernel: 6.11.11-1-MANJARO
CPU: Intel i9-12900KF
GPU: NVIDIA GeForce RTX 4090
]
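
One option that may be worth checking before a downgrade: the error message is about PTX, the intermediate code that the driver compiles at load time. When nvcc is not told which GPU to target, the binary can end up relying on that load-time step, and here the PTX comes from a 12.6 toolchain while the driver reports 12.4. Building native code for the card's own architecture with nvcc's `-arch` option may avoid the PTX step altogether. This is a minimal sketch and not a confirmed fix for this system; it assumes a source file named `kernel.cu` (hypothetical) and a driver whose nvidia-smi supports the `compute_cap` query field. The helper name is made up.

```shell
#!/usr/bin/env bash
# Sketch: build for the installed GPU's own architecture.
# `kernel.cu` and `cap_to_arch` are illustrative names, not from the thread.

# Turn a compute capability such as "8.9" into nvcc's "sm_89" notation.
cap_to_arch() {
    printf 'sm_%s\n' "$(printf '%s' "$1" | tr -d '.')"
}

# Only run when the tools and the source file are actually present.
if command -v nvidia-smi >/dev/null 2>&1 &&
   command -v nvcc >/dev/null 2>&1 &&
   [ -f kernel.cu ]; then
    # Ask the driver for the compute capability of the first GPU.
    cap=$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader | head -n 1)
    nvcc -arch="$(cap_to_arch "$cap")" -o kernel kernel.cu
fi
```

An RTX 4090 has compute capability 8.9, so the flag would come out as `-arch=sm_89`.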

cscs · 3 January 2025 04:29

Hullo,

A system info snapshot could be helpful;

inxi -Farz

Furthermore it may be worth mentioning that

( GPGPU - ArchWiki )

As to versions shipped … I don’t know enough about nvidia as I don’t have any, but if it’s true that some versions of cuda should be coupled with certain versions of nvidia* then it would seem that’s not currently happening. For example:

https://manjaristas.org/branch_compare?q=cuda
https://manjaristas.org/branch_compare?q=nvidia-utils

And finally I might mention that of course you always have the option to switch branches if another like ‘Unstable’ would work better for your use case. It does for me.

linux-aarhus · 3 January 2025 06:38

Extending on what @cscs mentioned, and to answer your question about downgrading: you will be best served by switching to the edge, that is, the unstable branch.

sudo pacman-mirrors -aS unstable
sudo pacman -Syu

Then you also get the option of using the opensource drivers - should that have an appeal.
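
To confirm which branch the mirrors point at before and after the switch, the mirror URLs can be inspected. This is a minimal sketch, assuming the mirrorlist lives in /etc/pacman.d/mirrorlist and that its URLs follow the `.../<branch>/$repo/$arch` pattern; the function name is illustrative.

```shell
#!/usr/bin/env bash
# Sketch: report which Manjaro branch the configured mirrors use.
# `mirror_branches` is an illustrative name.

# Print the distinct branch names found in "Server = ..." lines on stdin.
mirror_branches() {
    sed -n -E '/^[[:space:]]*Server/ s#.*/(stable|testing|unstable)/\$repo/\$arch.*#\1#p' | sort -u
}

# Only read the mirrorlist when it exists on this machine.
if [ -r /etc/pacman.d/mirrorlist ]; then
    mirror_branches < /etc/pacman.d/mirrorlist
fi
```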

ctamegara · 3 January 2025 09:20

Hello @cscs and @linux-aarhus, thanks for your help !

Here is the system snapshot:

inxi -Farz
System:
  Kernel: 6.11.11-1-MANJARO arch: x86_64 bits: 64 compiler: gcc v: 14.2.1
    clocksource: tsc avail: hpet,acpi_pm
    parameters: BOOT_IMAGE=/boot/vmlinuz-6.11-x86_64
    root=UUID=3d2fd971-de14-4446-bac9-57b1d103b454 rw quiet splash apparmor=1
    security=apparmor udev.log_priority=3
  Desktop: GNOME v: 47.2 tk: GTK v: 3.24.43 wm: gnome-shell
    tools: gsd-screensaver-proxy dm: GDM v: 47.0 Distro: Manjaro
    base: Arch Linux
Machine:
  Type: Desktop System: CyberPowerPC product: GamingPC v: 2.0
    serial: <superuser required>
  Mobo: Micro-Star model: PRO B760M-P (MS-7E02) v: 2.0
    serial: <superuser required> part-nu: CPPC-SYSTEM-UK
    uuid: <superuser required> UEFI: American Megatrends LLC. v: A.70
    date: 03/24/2024
CPU:
  Info: model: 12th Gen Intel Core i9-12900KF bits: 64 type: MST AMCP
    arch: Alder Lake gen: core 12 level: v3 note: check built: 2021+
    process: Intel 7 (10nm ESF) family: 6 model-id: 0x97 (151) stepping: 2
    microcode: 0x37
  Topology: cpus: 1x dies: 1 clusters: 10 cores: 16 threads: 24 mt: 8 tpc: 2
    st: 8 smt: enabled cache: L1: 1.4 MiB desc: d-8x32 KiB, 8x48 KiB; i-8x32
    KiB, 8x64 KiB L2: 14 MiB desc: 8x1.2 MiB, 2x2 MiB L3: 30 MiB
    desc: 1x30 MiB
  Speed (MHz): avg: 800 min/max: 800/5100:5200:3900 scaling:
    driver: intel_pstate governor: powersave cores: 1: 800 2: 800 3: 800 4: 800
    5: 800 6: 800 7: 800 8: 800 9: 800 10: 800 11: 800 12: 800 13: 800 14: 800
    15: 800 16: 800 17: 800 18: 800 19: 800 20: 800 21: 800 22: 800 23: 800
    24: 800 bogomips: 153024
  Flags: avx avx2 ht lm nx pae sse sse2 sse3 sse4_1 sse4_2 ssse3 vmx
  Vulnerabilities:
  Type: gather_data_sampling status: Not affected
  Type: itlb_multihit status: Not affected
  Type: l1tf status: Not affected
  Type: mds status: Not affected
  Type: meltdown status: Not affected
  Type: mmio_stale_data status: Not affected
  Type: reg_file_data_sampling mitigation: Clear Register File
  Type: retbleed status: Not affected
  Type: spec_rstack_overflow status: Not affected
  Type: spec_store_bypass mitigation: Speculative Store Bypass disabled via
    prctl
  Type: spectre_v1 mitigation: usercopy/swapgs barriers and __user pointer
    sanitization
  Type: spectre_v2 mitigation: Enhanced / Automatic IBRS; IBPB:
    conditional; RSB filling; PBRSB-eIBRS: SW sequence; BHI: BHI_DIS_S
  Type: srbds status: Not affected
  Type: tsx_async_abort status: Not affected
Graphics:
  Device-1: NVIDIA AD102 [GeForce RTX 4090] vendor: Micro-Star MSI
    driver: nvidia v: 550.135 alternate: nouveau,nvidia_drm non-free: 550.xx+
    status: current (as of 2024-09) arch: Lovelace code: AD1xx
    process: TSMC n4 (5nm) built: 2022+ pcie: gen: 1 speed: 2.5 GT/s lanes: 16
    link-max: gen: 4 speed: 16 GT/s ports: active: none off: HDMI-A-1
    empty: DP-1,DP-2,HDMI-A-2 bus-ID: 01:00.0 chip-ID: 10de:2684
    class-ID: 0300
  Display: x11 server: X.org v: 1.21.1.14 with: Xwayland v: 24.1.4
    compositor: gnome-shell driver: X: loaded: N/A failed: nvidia
    gpu: nvidia,nvidia-nvswitch note: X driver n/a, try sudo/root
    display-ID: :1 screens: 1
  Screen-1: 0 s-res: 2560x1440 s-size: <missing: xdpyinfo>
  Monitor-1: HDMI-0 res: 2560x1440 hz: 60 dpi: 109
    size: 597x336mm (23.5x13.23") diag: 685mm (26.97") modes: N/A
  API: EGL v: 1.5 hw: drv: nvidia platforms: device: 0 drv: nvidia device: 2
    drv: swrast surfaceless: drv: nvidia x11: drv: nvidia
    inactive: gbm,wayland,device-1
  API: OpenGL v: 4.6.0 compat-v: 4.5 vendor: nvidia mesa v: 550.135
    glx-v: 1.4 direct-render: yes renderer: NVIDIA GeForce RTX 4090/PCIe/SSE2
    memory: 23.43 GiB
Audio:
  Device-1: Intel Raptor Lake High Definition Audio vendor: Micro-Star MSI
    driver: snd_hda_intel v: kernel alternate: snd_soc_avs,snd_sof_pci_intel_tgl
    bus-ID: 00:1f.3 chip-ID: 8086:7a50 class-ID: 0403
  Device-2: NVIDIA AD102 High Definition Audio vendor: Micro-Star MSI
    driver: snd_hda_intel v: kernel pcie: gen: 4 speed: 16 GT/s lanes: 16
    bus-ID: 01:00.1 chip-ID: 10de:22ba class-ID: 0403
  Device-3: Realtek USB2.0 Microphone
    driver: hid-generic,snd-usb-audio,usbhid type: USB rev: 2.0 speed: 480 Mb/s
    lanes: 1 mode: 2.0 bus-ID: 1-1:2 chip-ID: 0bda:4937 class-ID: 0300
    serial: <filter>
  API: ALSA v: k6.11.11-1-MANJARO status: kernel-api with: aoss
    type: oss-emulator tools: alsactl,alsamixer,amixer
  Server-1: JACK v: 1.9.22 status: off tools: N/A
  Server-2: PipeWire v: 1.2.7 status: active with: 1: pipewire-pulse
    status: active 2: wireplumber status: active 3: pipewire-alsa type: plugin
    tools: pactl,pw-cat,pw-cli,wpctl
Network:
  Device-1: Realtek RTL8111/8168/8211/8411 PCI Express Gigabit Ethernet
    vendor: Micro-Star MSI driver: r8169 v: kernel pcie: gen: 1 speed: 2.5 GT/s
    lanes: 1 port: 3000 bus-ID: 04:00.0 chip-ID: 10ec:8168 class-ID: 0200
  IF: enp4s0 state: down mac: <filter>
  Device-2: Intel Wi-Fi 6 AX200 driver: iwlwifi v: kernel pcie: gen: 2
    speed: 5 GT/s lanes: 1 bus-ID: 05:00.0 chip-ID: 8086:2723 class-ID: 0280
  IF: wlp5s0 state: up mac: <filter>
  IF-ID-1: docker0 state: down mac: <filter>
  Info: services: NetworkManager, systemd-timesyncd, wpa_supplicant
Bluetooth:
  Device-1: Intel AX200 Bluetooth driver: btusb v: 0.8 type: USB rev: 2.0
    speed: 12 Mb/s lanes: 1 mode: 1.1 bus-ID: 1-5:3 chip-ID: 8087:0029
    class-ID: e001
  Report: rfkill ID: hci0 rfk-id: 1 state: up address: see --recommends
Drives:
  Local Storage: total: 1.82 TiB used: 595.4 GiB (32.0%)
  SMART Message: Required tool smartctl not installed. Check --recommends
  ID-1: /dev/nvme0n1 maj-min: 259:0 vendor: Kingston model: SNV2S2000G
    size: 1.82 TiB block-size: physical: 512 B logical: 512 B speed: 63.2 Gb/s
    lanes: 4 tech: SSD serial: <filter> fw-rev: SBM02106 temp: 28.9 C
    scheme: GPT
Partition:
  ID-1: / raw-size: 1.82 TiB size: 1.79 TiB (98.37%) used: 595.4 GiB (32.5%)
    fs: ext4 dev: /dev/nvme0n1p2 maj-min: 259:2
  ID-2: /boot/efi raw-size: 300 MiB size: 299.4 MiB (99.80%)
    used: 288 KiB (0.1%) fs: vfat dev: /dev/nvme0n1p1 maj-min: 259:1
Swap:
  Alert: No swap data was found.
Sensors:
  System Temperatures: cpu: 22.5 C mobo: N/A gpu: nvidia temp: 27 C
  Fan Speeds (rpm): N/A gpu: nvidia fan: 0%
Repos:
  Packages: pm: pacman pkgs: 1609 libs: 454 tools: gnome-software,pamac,yay
    pm: flatpak pkgs: 0
  Active pacman repo servers in: /etc/pacman.d/mirrorlist
    1: https://mirror.easyname.at/manjaro/stable/$repo/$arch
    2: https://fastmirror.pp.ua/manjaro/stable/$repo/$arch
    3: https://mirror.hostiko.network/manjaro/stable/$repo/$arch
    4: https://mnvoip.mm.fcix.net/manjaro/stable/$repo/$arch
    5: http://mirror.fcix.net/manjaro/stable/$repo/$arch
    6: https://mirror.csclub.uwaterloo.ca/manjaro/stable/$repo/$arch
    7: http://mirror.xeonbd.com/manjaro/stable/$repo/$arch
    8: https://manjaro.repo.cure.edu.uy/stable/$repo/$arch
Info:
  Memory: total: 32 GiB available: 31.19 GiB used: 2.3 GiB (7.4%)
  Processes: 416 Power: uptime: 28m states: freeze,mem,disk suspend: deep
    avail: s2idle wakeups: 0 hibernate: platform avail: shutdown, reboot,
    suspend, test_resume image: 12.45 GiB services: gsd-power,
    power-profiles-daemon, upowerd Init: systemd v: 256 default: graphical
    tool: systemctl
  Compilers: clang: 18.1.8 gcc: 14.2.1 alt: 13 Shell: Zsh v: 5.9
    running-in: gnome-terminal inxi: 3.3.36

I’d prefer downgrading cuda because of other programs using pytorch, for which no support for cuda 12.6 seems available (as I noticed in between these posts).

Do you know if there are any caveats?

linux-aarhus · 3 January 2025 11:36

I don’t know about nvidia/cuda. What I do know is that, generally speaking, downgrading a single package may create hard-to-solve issues and is discouraged for this reason.

Manjaro is rolling release so if you rely on packages of specific versions - you should use a fixed release distribution. The most stable in that regard is AlmaLinux or RockyLinux.

If, on the other hand, you want to be closer to the Arch repos, you should definitely use the unstable branch, as it is likely more in line with upstream Nvidia and Cuda.

Also worth noting is the kernel you are using: Linux 6.11 has been tagged EOL, so you should move to Linux 6.12.

mhwd-kernel -i linux612 rmc

Another aspect to consider is that nvidia drivers are provided either independently of the kernel (nvidia-dkms) or with a kernel dependency, e.g. linux612-nvidia.

 $ pamac search linux612 nvidia
linux612-nvidia-open  565.77-9                                                                                                            extra
    NVIDIA open drivers for linux612
linux612-nvidia-470xx  470.256.02-18                                                                                                      extra
    NVIDIA drivers for linux
linux612-nvidia-390xx  390.157-18                                                                                                         extra
    NVIDIA drivers for linux
linux612-nvidia  565.77-9                                                                                                                 extra
    NVIDIA drivers for linux612

And

 $ pamac search nvidia dkms --no-aur
nvidia-open-dkms  565.77-4                                                                                                                extra
    NVIDIA open kernel modules - module sources
nvidia-dkms  565.77-4                                                                                                                     extra
    NVIDIA kernel modules - module sources
nvidia-470xx-dkms  470.256.02-9                                                                                                           extra
    NVIDIA drivers - module sources
nvidia-390xx-dkms  390.157-17                                                                                                             extra
    NVIDIA drivers - module sources
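
The kernel-specific driver packages follow the kernel package name, so the matching name can be derived from the running kernel. This is a minimal sketch, assuming Manjaro's `linuxXY` naming as seen in this thread (`linux611`, `linux612`); the function name is illustrative.

```shell
#!/usr/bin/env bash
# Sketch: derive the kernel package name from a kernel release string.
# `kernel_pkg` is an illustrative name.

# Map a release such as "6.11.11-1-MANJARO" to "linux611".
kernel_pkg() {
    printf '%s\n' "$1" | sed -E 's/^([0-9]+)\.([0-9]+).*/linux\1\2/'
}

pkg=$(kernel_pkg "$(uname -r)")
echo "kernel package: $pkg"
echo "driver package: ${pkg}-nvidia (or ${pkg}-nvidia-open)"
```

With the dkms packages this mapping is not needed, since the module is built against whichever kernels are installed.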

ctamegara · 3 January 2025 21:14

Thanks for your help.

I guess I’ll need to dig deeper into the problem … For now mhwd-kernel -i linux612 rmc doesn’t work, for reasons that seem to be documented on this forum:

:: removing linux611 breaks dependency 'linux611' required by linux-meta
:: removing linux611-nvidia breaks dependency 'linux611-nvidia' required by linux-nvidia-meta

So I’ll have a look at it as soon as I find the time. Once again, thanks for the time you spent.

pobrn · 4 January 2025 02:39

The problem is that linux-meta still wants linux611 on the testing and stable branches. Either you stop using linux-meta et al. or you wait for it to be updated to Linux 6.12.
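
The packages holding the old kernel in place can be read straight from pacman's messages. This is a minimal sketch, assuming the message format quoted earlier in the thread; the function name is illustrative.

```shell
#!/usr/bin/env bash
# Sketch: list the packages that block removal of the old kernel.
# `blocking_pkgs` is an illustrative name.

# Print the package named after "required by" in each
# "breaks dependency" message read on stdin, without duplicates.
blocking_pkgs() {
    sed -n "s/.*breaks dependency '[^']*' required by \(.*\)/\1/p" | sort -u
}

# The two messages quoted in this thread, fed in as sample input.
printf '%s\n' \
    ":: removing linux611 breaks dependency 'linux611' required by linux-meta" \
    ":: removing linux611-nvidia breaks dependency 'linux611-nvidia' required by linux-nvidia-meta" |
    blocking_pkgs
```

For the two messages above this prints linux-meta and linux-nvidia-meta, one per line.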

cscs · 4 January 2025 02:48

And linux-meta is only present because the user was previously running an EOL kernel long enough for it to require a replacement.
linux-meta still has problems and may or may not provide a smooth transition from linux611 without manual intervention.

soundofthunder · 4 January 2025 03:16

ctamegara:

For now mhwd-kernel -i linux612 rmc doesn’t work for reasons that seem to be documented on this forum:
:: removing linux611 breaks dependency 'linux611' required by linux-meta
:: removing linux611-nvidia breaks dependency

I recently encountered a similar issue;

My resolution was to first remove linux-meta;

from that point, installing the new kernel while removing the previous, worked as expected.

mhwd-kernel -i linux612 rmc

I have not since reinstalled linux-meta.

Regards.