Installed CUDA 12.6, nvidia-smi shows version 12.4

Hello,

As per the title:

nvidia-smi

Mon Dec  9 00:04:18 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.135                Driver Version: 550.135        CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 2060        Off |   00000000:01:00.0  On |                  N/A |
|  0%   49C    P8             12W /  170W |     164MiB /   6144MiB |      1%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

pacman -Q cuda        
cuda 12.6.3-1

cuda 12.6 is the only version of CUDA installed, and the driver is up to date (https://www.nvidia.com/en-us/drivers/results/); nonetheless, nvidia-smi displays CUDA Version: 12.4.

I have not been able to find any documentation on what might be going awry here, and I am not familiar enough with nvidia drivers to hazard a guess. I would appreciate suggestions for next steps.

Best regards,
wnabee

[Quote from GPGPU - ArchWiki, emphasis added]

I think that's the difference: nvidia-smi is reporting the CUDA 'driver' version contained in the nvidia packages, which is different from the CUDA 'runtime' obtained via the cuda package.

Think being the operative word here. I don't have nvidia. :wink:
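
If I understand the wiki right, something like this should show where each number comes from (can't verify it myself, no nvidia card here, so treat it as a sketch):

nvidia-smi | grep "CUDA Version"   # the 'driver' CUDA version, shipped with the nvidia driver packages
nvcc --version | grep release      # the 'runtime' CUDA version, shipped with the cuda package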

Thank you for the quick reply!

Interesting. I investigated these other packages and they both seem to be up to date:

pacman -Q nvidia 
linux613-nvidia 550.135-0.2
pacman -Q opencl-nvidia
opencl-nvidia 550.135-1

So it seems to me like everything should be fine; alas, no such luck…

The frustrating thing is that a binary I'm running is demanding CUDA 12.6, so it's the 'driver' version it wants. It couldn't possibly be that nvidia shipped their driver with the wrong version of CUDA, right? That seems preposterous.

Lucky you, I suppose. One thing is certain: I'm not buying their cards in the future if they can't get their act together.

It shows CUDA 12.7 for me with the 565.77 driver with the same cuda package:

āÆ nvidia-smi
Sun Dec  8 19:11:36 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 565.77                 Driver Version: 565.77         CUDA Version: 12.7     |
|-----------------------------------------+------------------------+----------------------+
āÆ pacman -Q cuda
cuda 12.6.3-1

Maybe it's reading from /usr/lib/libcuda.so (from nvidia-utils) to report the version?
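
If someone wants to check that guess, something along these lines should show which file and package are actually involved (the path is my assumption, I haven't traced how nvidia-smi resolves it):

ls -l /usr/lib/libcuda.so*       # the symlink target carries the driver version, e.g. libcuda.so.565.77
pacman -Qo /usr/lib/libcuda.so   # which package owns the library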

:man_shrugging:

FWIW, mine says:

$ nvidia-smi
Mon Dec  9 07:59:23 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.135                Driver Version: 550.135        CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
[...]

Yet cuda doesn't seem to be installed separately:

$ pacman -Q cuda
error: package 'cuda' was not found

Hi all,

I've been looking into this as well, and I think I know what is going on, but I am basing this on what I managed to google yesterday, so I might very well be wrong.

tl;dr: I believe the cause of this issue is that the GPU driver is installed by mhwd and is relatively old, whereas the cuda package (with the CUDA runtime API) comes from Arch and requires a newer driver than the one mhwd supplies.

Summarizing a Stack Overflow answer that I can't link to because I'm new (try the normal URL + a/53504578/12762884):

  • Your system can have two different CUDA versions:
    1. the driver API, which is installed by the GPU driver. (On Manjaro, you're supposed to install this with mhwd.) This is the version you see with nvidia-smi, because that program comes from the GPU driver.
    2. the runtime API, which is part of the CUDA toolkit, and is what you see with, for example, nvcc --version. On Manjaro, that is what the cuda package installs.
  • If the version reported by nvidia-smi is at least the version reported by nvcc, then you're fine; however, if the version reported by nvidia-smi is lower, then (the post says) that is a broken config. See the sketch after this list for a quick way to compare the two.
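
As a quick way to compare the two on your own machine, here is a rough sketch of my own (not from the Stack Overflow answer; it just scrapes the two outputs, so adjust the parsing if the formats differ on your system):

#!/bin/sh
# Compare the CUDA version the driver supports with the installed toolkit version.
driver_cuda=$(nvidia-smi | grep -o 'CUDA Version: [0-9.]*' | awk '{print $3}')
toolkit_cuda=$(nvcc --version | grep -o 'release [0-9.]*' | awk '{print $2}')
echo "driver API : $driver_cuda"
echo "runtime API: $toolkit_cuda"
# driver >= runtime is fine; a runtime newer than the driver is the broken case.
if [ "$(printf '%s\n' "$toolkit_cuda" "$driver_cuda" | sort -V | head -n1)" = "$toolkit_cuda" ]; then
    echo "OK: the driver supports the installed toolkit"
else
    echo "Broken: the toolkit ($toolkit_cuda) is newer than the driver supports ($driver_cuda)"
fi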

I'm on the up-to-date Manjaro NVIDIA driver, have also installed the latest cuda package, and have versions similar to the OP:


niels@niels-manjaro-desktop-2411:~➜ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Tue_Oct_29_23:50:19_PDT_2024
Cuda compilation tools, release 12.6, V12.6.85
Build cuda_12.6.r12.6/compiler.35059454_0
niels@niels-manjaro-desktop-2411:~➜ nvidia-smi | head
Mon Dec  9 15:24:31 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.135                Driver Version: 550.135        CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3070        Off |   00000000:01:00.0  On |                  N/A |
|  0%   40C    P8             11W /  270W |    1674MiB /   8192MiB |      1%      Default |

According to the Stack Overflow answer, this is a broken setup, and I believe it is possibly Manjaro's fault (but I'm happy to be proven wrong here!).

I think the way we Manjaro users are ending up with this broken setup is that:

  • the GPU driver and CUDA driver API are installed using mhwd, which is on a relatively older version of the NVIDIA driver; while at the same time
  • the CUDA runtime API is installed using the cuda package from Arch Linux. This package mentions in line 10 of its PKGBUILD (again, I can't link to it :frowning: ) that a newer driver is required than what mhwd supplies. (@Yochanan somehow you have a newer driver though? How did you install it?) See also issue #7 on the package's GitLab (again, sorry, I can't link); it mentions that they had a similar issue earlier (the cuda package was running ahead of the Arch nvidia drivers). The commands below show how to compare the two sides on your own system.
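
To see how the two sides compare, something like this should do it (the package names are my guess at what mhwd pulls in; adjust for your kernel):

mhwd -li                                      # which driver config mhwd installed
pacman -Qi nvidia-utils | grep Version        # driver userspace version
pacman -Si cuda | grep -E 'Version|Depends'   # what the repo cuda package expects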

As for fixes, it would be great if the cuda package somehow checked for the available driver version. (I personally have no idea how this could be done.) As a short-term fix, I think we could also downgrade our cuda package (I haven't tried it yet; a sketch of the usual way is below).
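
In case anyone wants to try the downgrade, it would be something like this (assuming an older build is still in your pacman cache; the filename below is only an example):

ls /var/cache/pacman/pkg/cuda-*                                         # see which older builds are cached
sudo pacman -U /var/cache/pacman/pkg/cuda-12.4.1-1-x86_64.pkg.tar.zst   # example filename, adjust to what you have
# then add 'IgnorePkg = cuda' to /etc/pacman.conf so it is not upgraded straight back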


Hi,
Thank you for the insightful replies. I did some snooping and concluded something similar to @Nielius: something is out of order with the stable-branch nvidia driver. To resolve my issue I simply switched branches to unstable, installed the unstable-branch Nvidia driver 565.77 (which has CUDA 12.7 support), and then switched back to stable. Be careful when doing this if you don't know what you're doing, since:

Unstable is synced several times a day with Arch package releases. Only a subset of Arch packages are modified to suit Manjaro. Those that use Unstable need to have the skills to get themselves out of trouble when they move their system to this branch.

(tl;dr things might break)

Nevertheless, for posterity: you can follow this link to see how switching branches works, and you can load any new drivers either by simply rebooting the system, or (doing it properly) by killing X/Wayland, unloading the Nvidia modules with rmmod and then loading them back in with modprobe.
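
Roughly, the commands involved look like this (module names are the usual ones, double-check before running, and do the rmmod/modprobe part from a TTY with the display server stopped):

sudo pacman-mirrors --api --set-branch unstable   # switch to the unstable branch
sudo pacman -Syyu                                 # full sync/upgrade against it
# (re)install the newer driver here (I used mhwd), then switch back:
sudo pacman-mirrors --api --set-branch stable

# to load the new driver without a reboot:
sudo rmmod nvidia_drm nvidia_modeset nvidia_uvm nvidia
sudo modprobe nvidia nvidia_uvm nvidia_modeset nvidia_drm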

This doesn't identify the root cause of the problem, of course, but it is at least a workaround and it works well for my purposes.

Thank you all very much for your help.

Best regards,
Wnabee

