PyTorch - NCCL error

Hello, community!

  1. FFHQ training fails with an NCCL error when launched with:

torchrun --nproc_per_node=8 --master_port=4321 basicsr/train.py -opt options/VQGAN_512_ds32_nearest_stage1.yml --launcher pytorch

Version Information:
BasicSR: 1.3.2
PyTorch: 2.4.1+cu121
TorchVision: 0.19.1+cu121

torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/NCCLUtils.hpp:275, invalid usage (run with NCCL_DEBUG=WARN for details), NCCL version 2.20.5
[rank4]: ncclInvalidUsage: This usually reflects invalid usage of NCCL library.
[rank4]: Last error:
[rank4]: Duplicate GPU detected : rank 4 and rank 0 both on CUDA device 1000
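
From what I can tell, "Duplicate GPU detected" means two ranks ended up on the same CUDA device; rank 4 colliding with rank 0 suggests only four GPUs are visible to the eight processes launched. A quick check (a sketch, assuming a standard single-node setup):

# How many GPUs can the job actually see?
nvidia-smi -L
echo "CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-<unset>}"

# If fewer than 8 devices show up, match --nproc_per_node to that count, e.g. with 4 GPUs:
torchrun --nproc_per_node=4 --master_port=4321 basicsr/train.py -opt options/VQGAN_512_ds32_nearest_stage1.yml --launcher pytorch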

  2. tensorrt and python-tensorrt update error (missing CUDA 13?)
Run Build Command(s): /usr/bin/cmake -E env VERBOSE=1 /usr/bin/make -f Makefile cmTC_9d920/fast
/usr/bin/make  -f CMakeFiles/cmTC_9d920.dir/build.make CMakeFiles/cmTC_9d920.dir/build
make[1]: Entering directory '/var/tmp/pamac-build-dankahazi/tensorrt/src/build/CMakeFiles/CMakeScratch/TryCompile-LAzqlM'
Building CUDA object CMakeFiles/cmTC_9d920.dir/main.cu.o
/opt/cuda/bin/nvcc -forward-unknown-to-host-compiler   "--generate-code=arch=compute_75,code=[compute_75,sm_75]" "--generate-code=arch=compute_80,code=[compute_80,sm_80]" "--generate-code=arch=compute_86,code=[compute_86,sm_86]" "--generate-code=arch=compute_87,code=[compute_87,sm_87]" "--generate-code=arch=compute_89,code=[compute_89,sm_89]" "--generate-code=arch=compute_90,code=[compute_90,sm_90]" "--generate-code=arch=compute_100,code=[compute_100,sm_100]" "--generate-code=arch=compute_103,code=[compute_103,sm_103]" "--generate-code=arch=compute_110,code=[compute_110,sm_110]" "--generate-code=arch=compute_120,code=[compute_120,sm_120]" "--generate-code=arch=compute_121,code=[compute_121,sm_121]" -MD -MT CMakeFiles/cmTC_9d920.dir/main.cu.o -MF CMakeFiles/cmTC_9d920.dir/main.cu.o.d -x cu -c /var/tmp/pamac-build-dankahazi/tensorrt/src/build/CMakeFiles/CMakeScratch/TryCompile-LAzqlM/main.cu -o CMakeFiles/cmTC_9d920.dir/main.cu.o
nvcc fatal   : Unsupported gpu architecture 'compute_110'
make[1]: *** [CMakeFiles/cmTC_9d920.dir/build.make:82: CMakeFiles/cmTC_9d920.dir/main.cu.o] Error 1
make[1]: Leaving directory '/var/tmp/pamac-build-dankahazi/tensorrt/src/build/CMakeFiles/CMakeScratch/TryCompile-LAzqlM'
make: *** [Makefile:134: cmTC_9d920/fast] Error 2
...
-- Configuring incomplete, errors occurred!
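
The failing flag is 'compute_110'. One way to check which architectures the installed nvcc actually supports (a quick diagnostic, assuming the toolkit is at /opt/cuda as in the log above):

# Which CUDA release is installed, and which GPU architectures can it target?
/opt/cuda/bin/nvcc --version
/opt/cuda/bin/nvcc --list-gpu-arch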

How can I solve these errors?

cuda v13 is in the Unstable & Testing repos:

mbn info cuda -q | grep -Ev 'Name|Repository|Packager'
Branch         : archlinux
Version        : 13.0.2-1
Build Date     : Fri 10 Oct 2025 04:38:56 
Branch         : unstable
Version        : 13.0.2-1
Build Date     : Fri 10 Oct 2025 04:38:56 
Branch         : testing
Version        : 13.0.2-1
Build Date     : Fri 10 Oct 2025 04:38:56 
Branch         : stable
Version        : 12.9.1-2
Build Date     : Fri 01 Aug 2025 15:19:27 

mbn can be found in the manjaro-check-repos package
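
If it is not installed yet:

sudo pacman -S manjaro-check-repos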

To switch to Testing branch:

sudo pacman-mirrors --api --set-branch testing

or, to switch to Unstable branch:

sudo pacman-mirrors --api --set-branch unstable

After changing the branch, rebuild the mirrorlist and update your packages:

sudo pacman-mirrors --continent && sudo pacman -Syu

If you are going to use the AUR, then you should be on at least the Testing branch, but preferably the Unstable branch, which is closest to Arch.

So the solution to the update error would be to switch branches?

Yes. The AUR tensorrt package update requires CUDA v13, according to its PKGBUILD:

pkgbase=tensorrt
pkgname=(
    'tensorrt'
    'python-tensorrt')
pkgver=10.13.3.9
_cudaver=13.0
_protobuf_ver=3.20.1
_pybind11_ver=2.9.2
_onnx_graphsurgeon_ver=0.5.8
_polygraphy_ver=0.49.24
_tensorflow_quantization_ver=0.2.0
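
To compare against what is currently installed (a quick check, assuming the cuda package comes from the repos):

pacman -Qi cuda | grep -E '^(Name|Version)'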

Or… wait until the package reaches the Stable branch. I’m afraid we are unable to give an accurate estimate of when that will be. Check the respective (branch) Update announcements, or check the Packages link to monitor its status at any time.

Alternatively, install the manjaro-check-repos package and use mbn, which lets you monitor the package’s status in the same way.

Regards.

