Pytorch and Nvidia Problem

shylocks · 18 August 2020 14:13

Hi there, I have updated Manjaro on the current branch using pacman -Syu:
Cuda 10.2 -> 11.0
Cudnn 7.6 -> 8.0
Pytorch 1.5.1 -> 1.6.0

But now I have a problem using python-pytorch-cuda: it says my Nvidia driver version is outdated.
nvidia-smi showed that the current Nvidia driver version was 440xx, so I updated the Nvidia driver to 450xx using yay -S nvidia-beta-all. After that, nvidia-smi shows

Failed to initialize NVML: Driver/library version mismatch

and python -c "import tensorflow as tf; tf.test.is_gpu_available()" shows

Use tf.config.list_physical_devices('GPU') instead.
2020-08-18 22:03:51.528883: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN)to use the following CPU instructions in performance-critical operations: SSE3 SSE4.1 SSE4.2 AVX AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2020-08-18 22:03:51.533552: I tensorflow/core/platform/profile_utils/cpu_utils.cc:104] CPU Frequency: 3601000000 Hz
2020-08-18 22:03:51.533923: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5604d208cf80 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-08-18 22:03:51.533940: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
2020-08-18 22:03:51.535611: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcuda.so.1
2020-08-18 22:03:51.545474: E tensorflow/stream_executor/cuda/cuda_driver.cc:314] failed call to cuInit: CUDA_ERROR_UNKNOWN: unknown error
2020-08-18 22:03:51.545509: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:169] retrieving CUDA diagnostic information for host: manjaro
2020-08-18 22:03:51.545513: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:176] hostname: manjaro
2020-08-18 22:03:51.545591: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:200] libcuda reported version is: 450.57.0
2020-08-18 22:03:51.545612: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:204] kernel reported version is: 440.100.0
2020-08-18 22:03:51.545617: E tensorflow/stream_executor/cuda/cuda_diagnostics.cc:313] kernel version 440.100.0 does not match DSO version 450.57.0 – cannot find working devices in this configuration

How can I fix that? Thank you very much for replying.
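
The two version lines in the log above are the key to the failure: the user-space library (libcuda) is already at 450.57.0 while the loaded kernel module is still 440.100.0. As a rough sketch, those two lines can be pulled out of a saved log and compared. The function names `driver_versions` and `describe` are made up for this illustration and are not part of TensorFlow or any Nvidia tool.

```python
import re

# Patterns for the two diagnostic lines TensorFlow printed in the log above.
LIBCUDA_RE = re.compile(r"libcuda reported version is: (\d+(?:\.\d+)*)")
KERNEL_RE = re.compile(r"kernel reported version is: (\d+(?:\.\d+)*)")


def driver_versions(log_text):
    """Return (libcuda_version, kernel_version); a missing one is None."""
    lib = LIBCUDA_RE.search(log_text)
    ker = KERNEL_RE.search(log_text)
    return (lib.group(1) if lib else None, ker.group(1) if ker else None)


def describe(log_text):
    """Summarise whether the two reported versions agree."""
    lib, ker = driver_versions(log_text)
    if lib is None or ker is None:
        return "could not find both version lines"
    if lib == ker:
        return f"user-space library and kernel module agree on {lib}"
    return f"mismatch: libcuda is {lib}, loaded kernel module is {ker}"


# The two relevant lines, copied from the log in this post.
sample = (
    "2020-08-18 22:03:51.545591: I tensorflow/stream_executor/cuda/"
    "cuda_diagnostics.cc:200] libcuda reported version is: 450.57.0\n"
    "2020-08-18 22:03:51.545612: I tensorflow/stream_executor/cuda/"
    "cuda_diagnostics.cc:204] kernel reported version is: 440.100.0\n"
)
print(describe(sample))
# prints: mismatch: libcuda is 450.57.0, loaded kernel module is 440.100.0
```

A mismatch like this usually means the new user-space packages were installed while the old kernel module is still the one loaded, which is also consistent with the NVML "Driver/library version mismatch" message from nvidia-smi.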

Takei · 18 August 2020 14:27

The Nvidia 450xx drivers landed in testing just today, so you might have better luck switching to the testing branch for a bit, updating, and then installing the 450xx driver package with mhwd. Keep in mind, though, that you might need to temporarily uninstall cuda to switch driver versions.

To switch to the testing branch you can use pacman-mirrors -a -S testing.
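
For anyone who prefers to see the whole sequence in one place, here is a small dry-run sketch of the steps described in this reply. It only prints the commands unless `dry_run=False` is passed. Assumptions: the `pacman -Syyu` step is my reading of "updating", the profile name `video-nvidia-450xx` is the one reported later in this thread, and the helper names `branch_switch_commands` and `run` are invented for this example.

```python
import shlex
import subprocess


def branch_switch_commands(branch="testing", driver="video-nvidia-450xx"):
    """Build the command sequence; nothing is executed here."""
    return [
        # Branch switch, as given in the reply above.
        ["sudo", "pacman-mirrors", "-a", "-S", branch],
        # Assumption: force a database refresh and do a full upgrade.
        ["sudo", "pacman", "-Syyu"],
        # Install the driver profile with mhwd.
        ["sudo", "mhwd", "-i", "pci", driver],
    ]


def run(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


run(branch_switch_commands())  # dry run: prints the three commands only
```

Keeping the dry run as the default means the list can be reviewed first, for example to decide whether cuda has to be removed temporarily before the driver switch.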

shylocks · 18 August 2020 15:05

After switching to the testing branch and installing the Nvidia driver with sudo pacman -S mhwd mhwd-db && sudo mhwd -i pci video-nvidia-450xx, the problem is solved. Thank you very much!!
Besides, I think it may have been a mistake to update cuda from 10.2 to 11.0 on the stable branch. cuda 11.0 relies on nvidia-450xx, but nvidia-450xx is unavailable on the stable branch. This makes all cuda-related Python machine learning packages fail.
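
The dependency described here can be written down as a tiny check. This is only a sketch: it compares the driver's major branch number against the branch each CUDA release needs (450 for CUDA 11.0 as stated in this thread, 440 for CUDA 10.2), and it deliberately ignores the exact minimum point releases, which are listed in NVIDIA's CUDA release notes. The names `REQUIRED_BRANCH` and `driver_ok` are hypothetical.

```python
# Driver branch each CUDA release needs.
# Assumption: only the major branch number is compared; the exact minimum
# point releases are in NVIDIA's CUDA release notes.
REQUIRED_BRANCH = {"10.2": 440, "11.0": 450}


def driver_ok(cuda_version, driver_version):
    """Return True if the driver's major branch is new enough for cuda_version."""
    required = REQUIRED_BRANCH[cuda_version]
    major = int(driver_version.split(".")[0])
    return major >= required


print(driver_ok("11.0", "440.100"))  # False: the situation on stable
print(driver_ok("11.0", "450.57"))   # True: after installing 450xx
print(driver_ok("10.2", "440.100"))  # True: the setup before the update
```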
