Cannot run MPI processes using NVIDIA HPC-SDK after latest updates

Greetings to all people here!

After the latest big batch of updates, I am unfortunately unable to run MPI-enabled processes built with the compilers provided by the NVIDIA HPC-SDK.

I work in Computational Fluid Dynamics solver development.
We have a codebase that can be built using either the Intel oneAPI or the NVIDIA HPC-SDK compilers.
(The codebase is Fortran with some C++ elements, but that has nothing to do with the problem at hand.)

When compiling and running the solvers using Intel oneAPI, everything works fine.
Up until the latest update, everything also worked fine when using the NVIDIA HPC-SDK.

I can confirm that the problem lies somewhere in the updates, because I have not updated Manjaro on my laptop and the solvers there still work and behave as normal.

When I start an MPI-enabled process using the NVIDIA HPC-SDK tools, the command just hangs and nothing happens.
It just stays there, with no output and no error.

At first I thought something was wrong with my solvers, although everything worked fine before the updates.
Nope, it happens with every MPI process.

Even when I compile and run a simple mpihello program, the process hangs.
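
For anyone trying to reproduce a hang like this without blocking a terminal, here is a minimal Python sketch that runs a command under a time limit and reports whether it finished, failed, timed out, or could not be found. The helper name `run_with_timeout` and the status strings are made up for illustration; they are not part of the HPC-SDK or of any MPI implementation.

```python
import subprocess


def run_with_timeout(cmd, seconds):
    """Run cmd (a list of arguments) with a time limit.

    Returns (status, output), where status is one of
    'ok', 'failed', 'timeout' or 'missing' (hypothetical labels).
    """
    try:
        proc = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # merge stderr into stdout
            text=True,
            timeout=seconds,
        )
    except subprocess.TimeoutExpired as exc:
        # Whatever was captured before the limit may be bytes or None.
        out = exc.stdout or ""
        if isinstance(out, bytes):
            out = out.decode(errors="replace")
        return "timeout", out
    except FileNotFoundError:
        # The executable is not on PATH at all.
        return "missing", ""
    status = "ok" if proc.returncode == 0 else "failed"
    return status, proc.stdout
```

Assuming the test binary is called `./mpihello`, a call such as `run_with_timeout(["mpirun", "-np", "2", "./mpihello"], 20)` would return `"timeout"` for the hang described here. Note that only the launcher itself is killed when the limit expires, so leftover ranks may still need to be cleaned up by hand.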

I don’t know where the problem is, and I also don’t know what to try in order to troubleshoot it.
I am on kernel 5.18 now; I was on 5.15 when the problem appeared.
Upgrading to a more recent kernel did not solve the problem.

I have tried uninstalling and reinstalling the HPC-SDK but that also did not help.

Here is my inxi -Fazy output:

System:
  Kernel: 5.18.5-1-MANJARO arch: x86_64 bits: 64 compiler: gcc v: 12.1.0
    parameters: BOOT_IMAGE=/boot/vmlinuz-5.18-x86_64
    root=UUID=d98cfd04-eb09-4555-b7a3-040043e30330 rw quiet splash apparmor=1
    security=apparmor resume=UUID=b0126080-0ce2-48df-b3bc-544d664d6c93
    udev.log_priority=3
  Desktop: GNOME v: 42.2 tk: GTK v: 3.24.34 wm: gnome-shell dm: GDM v: 42.0
    Distro: Manjaro Linux base: Arch Linux
Machine:
  Type: Desktop System: Gigabyte product: X570 AORUS PRO v: -CF
    serial: <superuser required>
  Mobo: Gigabyte model: X570 AORUS PRO v: x.x serial: <superuser required>
    UEFI: American Megatrends v: F20 date: 07/07/2020
CPU:
  Info: model: AMD Ryzen 9 3900X bits: 64 type: MT MCP arch: Zen 2 gen: 3
    built: 2020-22 process: TSMC n7 (7nm) family: 0x17 (23) model-id: 0x71 (113)
    stepping: 0 microcode: 0x8701021
  Topology: cpus: 1x cores: 12 tpc: 2 threads: 24 smt: enabled cache:
    L1: 768 KiB desc: d-12x32 KiB; i-12x32 KiB L2: 6 MiB desc: 12x512 KiB
    L3: 64 MiB desc: 4x16 MiB
  Speed (MHz): avg: 2529 high: 4304 min/max: 2200/4672 boost: enabled
    scaling: driver: acpi-cpufreq governor: schedutil cores: 1: 3209 2: 2053
    3: 2126 4: 3591 5: 2053 6: 2049 7: 2192 8: 2195 9: 2196 10: 2080 11: 2125
    12: 3785 13: 4304 14: 2165 15: 2101 16: 3591 17: 2050 18: 2048 19: 2196
    20: 2196 21: 2196 22: 2076 23: 2080 24: 4054 bogomips: 182142
  Flags: avx avx2 ht lm nx pae sse sse2 sse3 sse4_1 sse4_2 sse4a ssse3 svm
  Vulnerabilities:
  Type: itlb_multihit status: Not affected
  Type: l1tf status: Not affected
  Type: mds status: Not affected
  Type: meltdown status: Not affected
  Type: mmio_stale_data status: Not affected
  Type: spec_store_bypass
    mitigation: Speculative Store Bypass disabled via prctl
  Type: spectre_v1
    mitigation: usercopy/swapgs barriers and __user pointer sanitization
  Type: spectre_v2
    mitigation: Retpolines, IBPB: conditional, STIBP: conditional, RSB filling
  Type: srbds status: Not affected
  Type: tsx_async_abort status: Not affected
Graphics:
  Device-1: NVIDIA GA102 [GeForce RTX 3080] vendor: Gigabyte driver: nvidia
    v: 515.48.07 alternate: nouveau,nvidia_drm non-free: 515.xx+
    status: current (as of 2022-06) arch: Ampere process: TSMC n7 (7nm)
    built: 2020-22 pcie: gen: 2 speed: 5 GT/s lanes: 16 link-max: gen: 4
    speed: 16 GT/s bus-ID: 08:00.0 chip-ID: 10de:2206 class-ID: 0300
  Display: x11 server: X.Org v: 21.1.3 with: Xwayland v: 22.1.2
    compositor: gnome-shell driver: X: loaded: nvidia gpu: nvidia display-ID: :1
    screens: 1
  Screen-1: 0 s-res: 2560x1440 s-dpi: 108 s-size: 602x342mm (23.70x13.46")
    s-diag: 692mm (27.26")
  Monitor-1: DP-0 res: 2560x1440 hz: 144 dpi: 109
    size: 597x336mm (23.5x13.23") diag: 685mm (26.97") modes: N/A
  Message: Unable to show GL data. Required tool glxinfo missing.
Audio:
  Device-1: NVIDIA GA102 High Definition Audio vendor: Gigabyte
    driver: snd_hda_intel bus-ID: 3-6.2:3 v: kernel pcie: chip-ID: 1235:8200
    class-ID: 0103 gen: 2 speed: 5 GT/s lanes: 16 link-max: gen: 4
    speed: 16 GT/s bus-ID: 08:00.1 chip-ID: 10de:1aef class-ID: 0403
  Device-2: AMD Starship/Matisse HD Audio vendor: Gigabyte
    driver: snd_hda_intel v: kernel pcie: gen: 4 speed: 16 GT/s lanes: 16
    bus-ID: 0a:00.4 chip-ID: 1022:1487 class-ID: 0403
  Device-3: Focusrite-Novation Scarlett 2i4 USB type: USB
    driver: snd-usb-audio
  Device-4: Creative Sound Blaster Play! 3 type: USB
    driver: hid-generic,snd-usb-audio,usbhid bus-ID: 3-6.3:4 chip-ID: 041e:324d
    class-ID: 0300 serial: <filter>
  Sound Server-1: ALSA v: k5.18.5-1-MANJARO running: yes
  Sound Server-2: JACK v: 1.9.21 running: no
  Sound Server-3: PulseAudio v: 16.0 running: yes
  Sound Server-4: PipeWire v: 0.3.52 running: yes
Network:
  Device-1: Intel I211 Gigabit Network vendor: Gigabyte driver: igb v: kernel
    pcie: gen: 1 speed: 2.5 GT/s lanes: 1 port: f000 bus-ID: 04:00.0
    chip-ID: 8086:1539 class-ID: 0200
  IF: enp4s0 state: up speed: 1000 Mbps duplex: full mac: <filter>
  IF-ID-1: wlp5s0f1u6u1i2 state: down mac: <filter>
  IF-ID-2: ztrfycdflf state: unknown mac: <filter>
Bluetooth:
  Device-1: Realtek RTL8723BU 802.11b/g/n WLAN Adapter type: USB
    driver: btusb,rtl8xxxu bus-ID: 1-6.1:4 chip-ID: 0bda:b720 class-ID: e001
    serial: <filter>
  Report: rfkill ID: hci0 rfk-id: 0 state: down bt-service: enabled,running
    rfk-block: hardware: no software: yes address: see --recommends
Drives:
  Local Storage: total: 2.79 TiB used: 973.45 GiB (34.0%)
  SMART Message: Required tool smartctl not installed. Check --recommends
  ID-1: /dev/nvme0n1 maj-min: 259:0 vendor: Samsung
    model: SSD 970 EVO Plus 500GB size: 465.76 GiB block-size: physical: 512 B
    logical: 512 B speed: 31.6 Gb/s lanes: 4 type: SSD serial: <filter>
    rev: 2B2QEXM7 temp: 52.9 C scheme: GPT
  ID-2: /dev/sda maj-min: 8:0 vendor: Western Digital
    model: WD7500BPVT-60HXZT3 size: 698.64 GiB block-size: physical: 4096 B
    logical: 512 B speed: 3.0 Gb/s type: HDD rpm: 5400 serial: <filter>
    rev: 1A01 scheme: GPT
  ID-3: /dev/sdb maj-min: 8:16 vendor: Western Digital
    model: WD10SPZX-00Z10T0 size: 931.51 GiB block-size: physical: 4096 B
    logical: 512 B speed: 6.0 Gb/s type: HDD rpm: 5400 serial: <filter>
    rev: 1A01 scheme: GPT
  ID-4: /dev/sdc maj-min: 8:32 vendor: Western Digital
    model: WD5003AZEX-00K1GA0 size: 465.76 GiB block-size: physical: 4096 B
    logical: 512 B speed: 6.0 Gb/s type: N/A serial: <filter> rev: 0A80
    scheme: MBR
  ID-5: /dev/sdd maj-min: 8:48 vendor: Seagate model: ST3320620AS
    size: 298.09 GiB block-size: physical: 512 B logical: 512 B speed: 1.5 Gb/s
    type: N/A serial: <filter> rev: K scheme: MBR
Partition:
  ID-1: / raw-size: 265.93 GiB size: 260.69 GiB (98.03%)
    used: 111.74 GiB (42.9%) fs: ext4 dev: /dev/nvme0n1p7 maj-min: 259:7
  ID-2: /boot/efi raw-size: 500 MiB size: 499 MiB (99.80%)
    used: 312 KiB (0.1%) fs: vfat dev: /dev/nvme0n1p6 maj-min: 259:6
  ID-3: /home raw-size: 342.63 GiB size: 336.19 GiB (98.12%)
    used: 220.1 GiB (65.5%) fs: ext4 dev: /dev/sdb2 maj-min: 8:18
Swap:
  Kernel: swappiness: 60 (default) cache-pressure: 100 (default)
  ID-1: swap-1 type: partition size: 4.04 GiB used: 0 KiB (0.0%)
    priority: -2 dev: /dev/nvme0n1p5 maj-min: 259:5
Sensors:
  System Temperatures: cpu: N/A mobo: N/A gpu: nvidia temp: 47 C
  Fan Speeds (RPM): N/A gpu: nvidia fan: 0%
Info:
  Processes: 465 Uptime: 19m wakeups: 0 Memory: 31.29 GiB
  used: 5.51 GiB (17.6%) Init: systemd v: 251 default: graphical
  tool: systemctl Compilers: gcc: 12.1.0 clang: 13.0.1 Packages: 1830
  pacman: 1823 lib: 471 flatpak: 0 snap: 7 Shell: Zsh v: 5.9 running-in: tilix
  inxi: 3.3.19

Hello,

The most sensible way to deal with it is to install it via the AUR, but for now the nvhpc PKGBUILD is still outdated. AUR (en) - nvhpc
Someone else packaged nvhpc-22.5, which suits the CUDA version and NVIDIA drivers we have in the repositories. AUR (en) - nvhpc-22.5
You should run:
pamac build nvhpc-22.5
Reboot and see if that works.

Hello,

I did:
pamac build nvhpc-22.5
then rebooted.

As expected, the problem persists.
The behavior is the same as when installing from the tarballs.

Is there anything else I can try in order to fix this?
It’s really annoying since I do a lot of development work on my machine and it sucks that I cannot do anything now.

Something seems to be seriously wrong with Open MPI after the updates.
ParaView, which uses Open MPI, also fails to launch and returns MPI-related errors.
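
Since both the repository Open MPI and the MPI bundled with the HPC-SDK can end up on the same PATH, one thing that might be worth checking is which launcher and compiler wrappers are actually being picked up. This is only a sketch of that check, not a confirmed diagnosis: the function name `find_all_on_path` is made up, and the tool names listed are the usual Open MPI wrapper names, which may differ on a given install.

```python
import os


def find_all_on_path(name, path=None):
    """Return every executable called `name` on the search path, in lookup order.

    Entries that resolve to the same real file (for example via symlinks or a
    repeated PATH entry) are reported only once.
    """
    if path is None:
        path = os.environ.get("PATH", "")
    hits = []
    seen = set()
    for directory in path.split(os.pathsep):
        if not directory:
            continue
        candidate = os.path.join(directory, name)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            real = os.path.realpath(candidate)
            if real not in seen:
                seen.add(real)
                hits.append(candidate)
    return hits


if __name__ == "__main__":
    # Assumed tool names; adjust to whatever the installation provides.
    for tool in ("mpirun", "mpicc", "mpifort"):
        found = find_all_on_path(tool)
        print(tool + ":", ", ".join(found) if found else "not found")
```

The first entry printed for each tool is the one the shell would run. If `mpirun` resolves to one MPI installation while the binary was built with the wrappers of another, that mismatch would be a plausible thing to rule out before reinstalling.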

The good thing is that a fresh installation of Manjaro does not seem to have these problems.
(I tried it as a test on my second workstation)
I will probably resort to reinstalling, since I cannot seem to fix the issues…