Problems setting up Nvidia RTX A4000

I got a Lenovo ThinkStation A4000 GPU for graphics, rendering and machine learning. It is confirmed to be undamaged and runs Blender and benchmarks under Windows, but the vast majority of AI and machine learning packages are not optimized for Windows, so I have tried Fedora LTS and Manjaro. On both, setting up this GPU by following the instructions always fails. Right out of the box it can display the desktop, but the default VGA driver lacks all CUDA and OptiX capabilities, which I verified by starting Blender 3+ and looking at the render acceleration settings. So I tried to install every combination of NVIDIA drivers from Add/Remove Software, but after a reboot the system always gets stuck at the boot text wall and can only be operated in CLI mode from tty2. I tried sudo mhwd -a pci nonfree 0300, but it says it is skipping the installation because an appropriate driver is already installed. I also tried to install from NVIDIA-Linux-x86_64-460.80.run, but it fails with the error “Your kernel headers for kernel 5.19.1-3-MANJARO cannot be found at /usr/lib/modules/5.19.1-3-MANJARO/build or /usr/lib/modules/5.19.1-3-MANJARO/source”, even though I have altered nothing in the system.

uname -r reports:

5.15.60-1-MANJARO
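The installer error quoted above names the two directories it looks in, so the same check can be done by hand before running a .run installer. The following is only a minimal sketch: the function name find_kernel_headers and its optional second argument are made up for illustration, and the two subdirectory names are taken from the error message itself.

```shell
# Print the kernel header directory for a kernel release, if one exists.
# $1 = kernel release (defaults to the running kernel, `uname -r`)
# $2 = modules root (defaults to /usr/lib/modules, as in the error message)
# find_kernel_headers is a hypothetical helper name, not a system tool.
find_kernel_headers() {
    release="${1:-$(uname -r)}"
    root="${2:-/usr/lib/modules}"
    for sub in build source; do
        if [ -d "$root/$release/$sub" ]; then
            printf '%s\n' "$root/$release/$sub"
            return 0
        fi
    done
    return 1    # no headers found for this release
}

# Usage:
#   find_kernel_headers || echo "headers missing for $(uname -r)"
```

If the release the installer complains about differs from what uname -r prints, the installer and the shell are not looking at the same kernel, which is worth sorting out first.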

mhwd -l && mhwd -li reports:


[details="Spoiler"]
> 0000:08:00.0 (0300:10de:24b0) Display controller nVidia Corporation:
--------------------------------------------------------------------------------
                  NAME               VERSION          FREEDRIVER           TYPE
--------------------------------------------------------------------------------
           video-linux            2018.05.04                true            PCI
     video-modesetting            2020.01.13                true            PCI
            video-vesa            2017.03.12                true            PCI

> Installed PCI configs:
--------------------------------------------------------------------------------
                  NAME               VERSION          FREEDRIVER           TYPE
--------------------------------------------------------------------------------
           video-linux            2018.05.04                true            PCI


Warning: No installed USB configs!
[/details]

lspci -vga reports:

08:00.0 VGA compatible controller: NVIDIA Corporation GA104GL [RTX A4000] (rev a1)
So basically the system detects the GPU and displays an image at first, but after installing the driver it no longer finds any displays (xinit refuses to run, saying EE: No displays found), and nvidia-smi also refuses to run, saying that the problem is in the software.

Here is the inxi -Fza output.


[details="Spoiler"]
System:
  Kernel: 5.15.60-1-MANJARO arch: x86_64 bits: 64 compiler: gcc v: 12.1.1
    parameters: BOOT_IMAGE=/boot/vmlinuz-5.15-x86_64 root=UUID=c1bd0cd3-ca24-41b1-93ca-d420de7bca18
    rw quiet udev.log_priority=3
  Console: tty 2 Distro: Manjaro Linux base: Arch Linux
Machine:
  Type: Desktop Mobo: ASUSTeK model: PRIME X570-PRO v: Rev X.0x serial: <filter>
    UEFI: American Megatrends v: 3603 date: 03/20/2021
CPU:
  Info: model: AMD Ryzen 7 PRO 2700 socket: AM4 bits: 64 type: MT MCP arch: Zen+ gen: 2 level: v3
    built: 2018-21 process: GF 12nm family: 0x17 (23) model-id: 8 stepping: 2 microcode: 0x800820D
  Topology: cpus: 1x cores: 8 tpc: 2 threads: 16 smt: enabled cache: L1: 768 KiB desc: d-8x32
    KiB; i-8x64 KiB L2: 4 MiB desc: 8x512 KiB L3: 16 MiB desc: 2x8 MiB
  Speed (MHz): avg: 1653 high: 3200 min/max: 1550/3200 boost: enabled base/boost: 3200/4100
    scaling: driver: acpi-cpufreq governor: schedutil volts: 1.1 V ext-clock: 100 MHz cores:
    1: 3200 2: 1550 3: 1550 4: 1550 5: 1550 6: 1550 7: 1550 8: 1550 9: 1550 10: 1550 11: 1550
    12: 1550 13: 1550 14: 1550 15: 1550 16: 1550 bogomips: 102254
  Flags: avx avx2 ht lm nx pae sse sse2 sse3 sse4_1 sse4_2 sse4a ssse3 svm
  Vulnerabilities:
  Type: itlb_multihit status: Not affected
  Type: l1tf status: Not affected
  Type: mds status: Not affected
  Type: meltdown status: Not affected
  Type: mmio_stale_data status: Not affected
  Type: retbleed mitigation: untrained return thunk; SMT vulnerable
  Type: spec_store_bypass mitigation: Speculative Store Bypass disabled via prctl and seccomp
  Type: spectre_v1 mitigation: usercopy/swapgs barriers and __user pointer sanitization
  Type: spectre_v2 mitigation: Retpolines, IBPB: conditional, STIBP: disabled, RSB filling,
    PBRSB-eIBRS: Not affected
  Type: srbds status: Not affected
  Type: tsx_async_abort status: Not affected
Graphics:
  Device-1: NVIDIA GA104GL [RTX A4000] vendor: Lenovo driver: N/A alternate: nouveau
    non-free: 515.xx+ status: current (as of 2022-08) arch: Ampere code: GAxxx process: TSMC n7
    (7nm) built: 2020-22 pcie: gen: 3 speed: 8 GT/s lanes: 16 link-max: gen: 4 speed: 16 GT/s
    bus-ID: 08:00.0 chip-ID: 10de:24b0 class-ID: 0300
  Display: server: X.org v: 1.21.1.4 driver: X: loaded: nouveau unloaded: modesetting
    alternate: fbdev,nv,vesa gpu: N/A tty: 160x45
  Message: GL data unavailable in console for root.
Audio:
  Device-1: NVIDIA GA104 High Definition Audio vendor: Lenovo driver: snd_hda_intel v: kernel
    pcie: gen: 3 speed: 8 GT/s lanes: 16 link-max: gen: 4 speed: 16 GT/s bus-ID: 08:00.1
    chip-ID: 10de:228b class-ID: 0403
  Device-2: AMD Family 17h HD Audio vendor: ASUSTeK driver: snd_hda_intel v: kernel pcie:
    gen: 3 speed: 8 GT/s lanes: 16 bus-ID: 0a:00.3 chip-ID: 1022:1457 class-ID: 0403
  Sound Server-1: ALSA v: k5.15.60-1-MANJARO running: yes
  Sound Server-2: JACK v: 1.9.21 running: no
  Sound Server-3: PulseAudio v: 16.1 running: no
  Sound Server-4: PipeWire v: 0.3.56 running: no
Network:
  Device-1: Intel I211 Gigabit Network vendor: ASUSTeK driver: igb v: kernel pcie: gen: 1
    speed: 2.5 GT/s lanes: 1 port: f000 bus-ID: 04:00.0 chip-ID: 8086:1539 class-ID: 0200
  IF: enp4s0 state: up speed: 1000 Mbps duplex: full mac: <filter>
  Device-2: Ralink MT7601U Wireless Adapter type: USB driver: mt7601u bus-ID: 5-4:2
    chip-ID: 148f:7601 class-ID: 0000 serial: <filter>
  IF: wlp9s0f3u4 state: down mac: <filter>
Drives:
  Local Storage: total: 2.29 TiB used: 10.1 GiB (0.4%)
  ID-1: /dev/nvme0n1 maj-min: 259:0 vendor: Crucial model: CT500P2SSD8 size: 465.76 GiB
    block-size: physical: 512 B logical: 512 B speed: 31.6 Gb/s lanes: 4 type: SSD serial: <filter>
    rev: P2CR012 temp: 45.9 C scheme: GPT
  SMART: yes health: PASSED on: 7d 18h cycles: 94 read-units: 1,236,647 [633 GB]
    written-units: 1,960,729 [1.00 TB]
  ID-2: /dev/sda maj-min: 8:0 vendor: Western Digital model: WD20EZAZ-00GGJB0
    family: Blue (SMR) size: 1.82 TiB block-size: physical: 4096 B logical: 512 B sata: 3.1
    speed: 6.0 Gb/s type: HDD rpm: 5400 serial: <filter> rev: 0A80 temp: 31 C scheme: MBR
  SMART: yes state: enabled health: PASSED on: 22d 20h cycles: 205
  ID-3: /dev/sdb maj-min: 8:16 type: USB vendor: Transcend model: JetFlash 16GB size: 14.96 GiB
    block-size: physical: 512 B logical: 512 B type: SSD serial: <filter> rev: 8.01 scheme: MBR
  SMART Message: Unknown USB bridge. Flash drive/Unsupported enclosure?
Partition:
  ID-1: / raw-size: 89.76 GiB size: 87.8 GiB (97.81%) used: 10.06 GiB (11.5%) fs: ext4
    block-size: 4096 B dev: /dev/nvme0n1p6 maj-min: 259:6
  ID-2: /boot/efi raw-size: 100 MiB size: 96 MiB (96.00%) used: 38.3 MiB (39.9%) fs: vfat
    block-size: 512 B dev: /dev/nvme0n1p1 maj-min: 259:1
Swap:
  Alert: No swap data was found.
Sensors:
  System Temperatures: cpu: 43.8 C mobo: N/A
  Fan Speeds (RPM): N/A
Info:
  Processes: 269 Uptime: 0m wakeups: 0 Memory: 31.26 GiB used: 758 MiB (2.4%) Init: systemd
  v: 251 default: graphical tool: systemctl Compilers: gcc: 12.1.1 clang: 14.0.6 Packages:
  pm: pacman pkgs: 1184 libs: 317 tools: pamac pm: flatpak pkgs: 0 Shell: Bash (sudo) v: 5.1.16
  running-in: tty 2 inxi: 3.3.21
[/details]

Let’s start with the obvious: have you installed the cuda package?

I tried installing both the CUDA and OptiX packages from the website (the second one required creating an account) on the previous attempt, before reinstalling the system from the USB (which I am currently booted into), but it obviously did not help, since the problem is in the VGA driver.
Also, it is recommended to work entirely from the CLI rather than downloading out-of-tree packages.
From my research, the most likely cause is a conflict between the nouveau package and the installed driver, so nouveau needs to be blacklisted and the Xorg config files need some manual setup. I just don’t know the exact sequence of steps.
Also, can I continue by chrooting into the installed system, instead of repeatedly switching between booting the installed system and the USB image to pass data across?

mhwd doesn’t detect any nvidia drivers available for your system …
post output from:
find /etc/X11/ -name "*.conf"
pacman -Qs nvidia
you can do it from chroot if you wish, and also test in the live usb whether the drivers are detected there: mhwd -l

find returns only the keyboard and touchpad configs (there is no touchpad on this desktop):

/etc/X11/xorg.conf.d/30-touchpad.conf
/etc/X11/xorg.conf.d/00-keyboard.conf

pacman -Qs nvidia lists these packages:

local/egl-wayland 2:1.1.10-1
    EGLStream-based Wayland external platform
local/lib32-libvdpau 1.5-1
    Nvidia VDPAU library
local/libvdpau 1.5-1
    Nvidia VDPAU library
local/libxnvctrl-470xx 470.141.03-1
    NVIDIA NV-CONTROL X extension
local/mhwd-nvidia 515.65.01-2
    MHWD module-ids for nvidia 515.65.01
local/mhwd-nvidia-390xx 390.154-1
    MHWD module-ids for nvidia 390.154
local/mhwd-nvidia-470xx 470.141.03-1
    MHWD module-ids for nvidia 470.141.03
local/nvidia-470xx-dkms 470.141.03-1
    NVIDIA drivers - module sources
local/nvidia-470xx-utils 470.141.03-1
    NVIDIA drivers utilities
local/opencl-nvidia 515.65.01-2
    OpenCL implemention for NVIDIA
local/xf86-video-nouveau 1.0.17-2 (xorg-drivers)
    Open Source 3D acceleration driver for nVidia cards
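A listing like this is easier to judge once the driver branches are pulled out of it. Below is a rough sketch, assuming the input has the two-line pacman -Qs layout shown above. The function name driver_branches and the choice to skip the mhwd-nvidia* entries (which the listing describes as module-id tables, not drivers) are my own assumptions.

```shell
# Read `pacman -Qs`-style output on stdin and print the distinct major
# versions of the installed NVIDIA driver components, one per line.
# driver_branches is a hypothetical helper name.
driver_branches() {
    awk '
        index($0, "local/") == 1 {
            name = substr($1, 7)        # strip the "local/" repository prefix
            ver = $2
            sub(/^[0-9]+:/, "", ver)    # strip an epoch such as "2:"
            split(ver, part, ".")
            # Keep driver components, skip the mhwd-nvidia* ID tables.
            if (name ~ /nvidia|nvctrl/ && name !~ /^mhwd-/) print part[1]
        }
    ' | sort -u
}

# Usage:
#   pacman -Qs nvidia | driver_branches
```

Fed the listing above, it prints 470 and 515. More than one line of output means components from different driver series are installed side by side.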

I also noticed that the downloaded installer asked for kernel 5.19.1-3 while I currently have only 5.15.60-1. Could that be an additional issue, or even the main one?

you didn’t check in the live usb whether the drivers are detected there …
you have installed the 470xx dkms series, for which you need kernel headers - your nvidia card supports the latest 515 series of drivers, according to the inxi output…
we will install the proper drivers and see if they work, but first provide output from:
ls /etc/modprobe.d
to see if there is no blacklist for your nvidia
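Listing the directory only shows file names, so here is a small sketch that searches the files for a blacklist entry instead. The function name find_blacklist is hypothetical. Note that modprobe also reads /usr/lib/modprobe.d and /run/modprobe.d, so an empty /etc/modprobe.d does not by itself mean nothing is blacklisted.

```shell
# Print the *.conf files in a modprobe.d directory that blacklist a module.
# $1 = module name, $2 = directory (defaults to /etc/modprobe.d)
# Returns non-zero when no file matches or the directory has no *.conf files.
# find_blacklist is a hypothetical helper name.
find_blacklist() {
    module="$1"
    dir="${2:-/etc/modprobe.d}"
    grep -lE "^[[:space:]]*blacklist[[:space:]]+${module}[[:space:]]*$" \
        "$dir"/*.conf 2>/dev/null
}

# Usage:
#   find_blacklist nvidia
#   find_blacklist nouveau /usr/lib/modprobe.d
```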

I do not think the device ID 24b0 is yet part of the MHWD … Maybe @Yochanan or @philm can help with that.


I found that the modprobe.d folder is completely empty.
Is it still necessary or recommended to install the 5.19.1-3-MANJARO headers?

i would just uninstall all of the 470xx dkms drivers and install the latest nvidia …
the link above talks about a broken card …
provide logs:
journalctl -b0 -p4 --no-pager
sudo dmesg | grep -E 'Nvidia|nvidia'

do you have fast startup disabled in windows?

edit: how is the nvidia card connected? via riser cable or directly plugged into the motherboard?

The video card is plugged directly into the motherboard and has an additional 6-pin (not 6+2-pin) power connector.
journalctl says:

Sep 04 18:51:02 perkele kernel: nvme nvme0: missing or invalid SUBNQN field.
Sep 04 18:51:02 perkele kernel: usb: port power management may be unreliable
Sep 04 18:51:02 perkele kernel: usb 3-4: config 1 has an invalid interface number: 2 but max is 1
Sep 04 18:51:02 perkele kernel: usb 3-4: config 1 has no interface number 1
Sep 04 18:51:02 perkele systemd-modules-load[383]: Failed to find module 'nvidia-uvm'
Sep 04 18:51:02 perkele kernel: acpi PNP0C14:02: duplicate WMI GUID 05901221-D566-11D1-B2F0-00A0C9062910 (first instance was on PNP0C14:01)
Sep 04 18:51:02 perkele kernel: acpi PNP0C14:03: duplicate WMI GUID 05901221-D566-11D1-B2F0-00A0C9062910 (first instance was on PNP0C14:01)
Sep 04 18:51:02 perkele kernel: acpi PNP0C14:04: duplicate WMI GUID 05901221-D566-11D1-B2F0-00A0C9062910 (first instance was on PNP0C14:01)
Sep 04 18:51:02 perkele kernel: acpi PNP0C14:05: duplicate WMI GUID 05901221-D566-11D1-B2F0-00A0C9062910 (first instance was on PNP0C14:01)
Sep 04 18:51:02 perkele kernel: sd 7:0:0:0: [sdb] No Caching mode page found
Sep 04 18:51:02 perkele kernel: sd 7:0:0:0: [sdb] Assuming drive cache: write through
Sep 04 18:51:02 perkele kernel: kvm: disabled by bios
Sep 04 18:51:03 perkele kernel: kvm: disabled by bios
Sep 04 18:51:03 perkele kernel: kvm: disabled by bios
Sep 04 18:51:03 perkele kernel: sr 7:0:0:1: [sr0] GET_EVENT and TUR disagree continuously, suppress GET_EVENT events
Sep 04 18:51:03 perkele kernel: kvm: disabled by bios
Sep 04 18:51:03 perkele kernel: kvm: disabled by bios
Sep 04 18:51:03 perkele kernel: kvm: disabled by bios
Sep 04 18:51:03 perkele kernel: kvm: disabled by bios
Sep 04 18:51:04 perkele kernel: kvm: disabled by bios
Sep 04 18:51:04 perkele kernel: kvm: disabled by bios
Sep 04 18:51:04 perkele kernel: kvm: disabled by bios
Sep 04 18:51:08 perkele kernel: kauditd_printk_skb: 21 callbacks suppressed

and sudo dmesg | grep -E 'Nvidia|nvidia' does not return anything.

Edit: I’m also not sure whether Windows 10 has fast startup enabled (other than by default), but the NVMe drive still contains leftover EFI partitions from the previous Fedora installation attempt.

this really looked like a riser issue - the card works with windows, doesn’t work with linux (fedora/manjaro) … the only log entry related to nvidia is a missing nvidia module, because of the incomplete drivers …
i think windows enables fast startup by default, so go and disable it …
and output of these dmesg:
sudo dmesg -l err,warn,emerg,alert
you can post a pic of these

I disabled fast startup; it had no effect, and dmesg returns much the same output, just in a different color palette.

Also, the riser should not be a problem, because an RX 6600 XT had no problems. The motherboard also has a second PCIe slot for a GPU, but I haven’t tried moving the card to it.

It might be good to reinstall the system from scratch (since nothing has been stored there so far) to return to the state where it can at least output an image and start the GUI, despite the missing CUDA/OptiX.

well it has no effect because you don’t have the proper drivers installed …
so let’s try to install them:
remove this first:
pamac remove libxnvctrl-470xx nvidia-470xx-dkms nvidia-470xx-utils
and install this:

pamac install nvidia-utils lib32-nvidia-utils linux515-nvidia linux519-nvidia libxnvctrl

reboot
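After the reboot, one way to confirm that the proprietary kernel module actually loaded is to look for it in the lsmod output before trying nvidia-smi. This is a minimal sketch, and module_loaded is a hypothetical helper name. It reads lsmod-style text on stdin so it can also be run against saved output.

```shell
# Succeed if a module with exactly the given name appears in the first
# column of lsmod-style input on stdin.
# module_loaded is a hypothetical helper name.
module_loaded() {
    awk -v m="$1" '$1 == m { found = 1 } END { exit !found }'
}

# Usage:
#   lsmod | module_loaded nvidia && nvidia-smi
#   lsmod | module_loaded nouveau && echo "nouveau is still loaded"
```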

It is fixed now, thanks @brahma!


And a final question: is there any way to start the system in CLI mode, to get the most out of the VRAM for neural networks? On Debian-like systems this is done simply with the init 3 command, but modern systems have no runlevels, and it is done some other way, by configuring either GRUB or some config files.
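On a systemd-based install (the inxi output above reports Init: systemd), the old runlevels map onto targets, and runlevel 3 corresponds to multi-user.target. A small sketch of that mapping follows. The function name runlevel_to_target is made up for illustration, while the systemctl subcommands in the comments are standard ones.

```shell
# Map a SysV runlevel number to the systemd target that replaces it.
# runlevel_to_target is a hypothetical helper name.
runlevel_to_target() {
    case "$1" in
        0) echo poweroff.target ;;
        1) echo rescue.target ;;
        2|3|4) echo multi-user.target ;;
        5) echo graphical.target ;;
        6) echo reboot.target ;;
        *) return 1 ;;
    esac
}

# Usage (needs root):
#   systemctl isolate "$(runlevel_to_target 3)"      # drop to text mode now
#   systemctl set-default "$(runlevel_to_target 3)"  # boot to text mode by default
#   systemctl set-default graphical.target           # go back to the GUI default
```

For a one-off text-mode boot, systemd.unit=multi-user.target can be appended to the kernel command line from the GRUB menu instead of changing the default.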

with this i can’t help, i have no idea… you can check this link and see if it’s what you want:
