Greetings, everyone!
After the latest big batch of updates, I am unfortunately no longer able to run MPI-enabled processes built with the compilers provided by the NVIDIA HPC-SDK.
I work in Computational Fluid Dynamics solver development.
We have a codebase that can be built using either the Intel oneAPI or the NVIDIA HPC-SDK compilers.
(The codebase is Fortran with some C++ elements, but that has nothing to do with the problem at hand.)
When compiling and running the solvers using Intel oneAPI, everything works fine.
Up until the latest update, everything also worked fine when using the NVIDIA HPC-SDK.
I can confirm that the problem lies somewhere in the updates, because I have not updated Manjaro on my laptop and the solvers there still work and behave as normal.
When I start an MPI-enabled process using the NVIDIA HPC-SDK tools, the command simply hangs: nothing happens and it never returns.
At first I thought something was wrong with my solvers, even though everything worked fine before the updates.
But no, it happens with every MPI process.
Even a simple mpihello program, once compiled and run, hangs in the same way.
I don’t know where the problem is, and I also don’t know what to try in order to troubleshoot it.
I am now on kernel 5.18; I was on 5.15 when the problem first appeared.
Upgrading to the more recent kernel did not solve the problem.
I have tried uninstalling and reinstalling the HPC-SDK but that also did not help.
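In case it helps anyone reproduce this, here is a minimal sketch of how the hang can be detected with a timeout instead of waiting indefinitely. It is plain Python using only the standard library; the helper name run_with_timeout and the example command in the note below are illustrative assumptions, not the exact setup I use.

```python
import subprocess


def run_with_timeout(cmd, timeout_s=10.0):
    """Run cmd and report whether it exited or had to be killed.

    Returns ("exited", returncode, output) if the command finished in time,
    or ("hung", None, partial_output) if it was still running after timeout_s.
    """
    try:
        proc = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # merge stderr into stdout
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired as exc:
        # subprocess.run kills the child before raising. The captured
        # output may be None or bytes here, so normalise it to str.
        out = exc.output or ""
        if isinstance(out, bytes):
            out = out.decode(errors="replace")
        return ("hung", None, out)
    return ("exited", proc.returncode, proc.stdout)
```

For example, run_with_timeout(["mpirun", "-np", "2", "./mpihello"], 30) would return a "hung" result on the affected desktop if the launcher never comes back (the command line is an assumed example). Note that only the launcher process itself is killed on timeout, so any ranks it spawned may need to be cleaned up by hand.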
Here is my inxi -Fazy output:
System:
Kernel: 5.18.5-1-MANJARO arch: x86_64 bits: 64 compiler: gcc v: 12.1.0
parameters: BOOT_IMAGE=/boot/vmlinuz-5.18-x86_64
root=UUID=d98cfd04-eb09-4555-b7a3-040043e30330 rw quiet splash apparmor=1
security=apparmor resume=UUID=b0126080-0ce2-48df-b3bc-544d664d6c93
udev.log_priority=3
Desktop: GNOME v: 42.2 tk: GTK v: 3.24.34 wm: gnome-shell dm: GDM v: 42.0
Distro: Manjaro Linux base: Arch Linux
Machine:
Type: Desktop System: Gigabyte product: X570 AORUS PRO v: -CF
serial: <superuser required>
Mobo: Gigabyte model: X570 AORUS PRO v: x.x serial: <superuser required>
UEFI: American Megatrends v: F20 date: 07/07/2020
CPU:
Info: model: AMD Ryzen 9 3900X bits: 64 type: MT MCP arch: Zen 2 gen: 3
built: 2020-22 process: TSMC n7 (7nm) family: 0x17 (23) model-id: 0x71 (113)
stepping: 0 microcode: 0x8701021
Topology: cpus: 1x cores: 12 tpc: 2 threads: 24 smt: enabled cache:
L1: 768 KiB desc: d-12x32 KiB; i-12x32 KiB L2: 6 MiB desc: 12x512 KiB
L3: 64 MiB desc: 4x16 MiB
Speed (MHz): avg: 2529 high: 4304 min/max: 2200/4672 boost: enabled
scaling: driver: acpi-cpufreq governor: schedutil cores: 1: 3209 2: 2053
3: 2126 4: 3591 5: 2053 6: 2049 7: 2192 8: 2195 9: 2196 10: 2080 11: 2125
12: 3785 13: 4304 14: 2165 15: 2101 16: 3591 17: 2050 18: 2048 19: 2196
20: 2196 21: 2196 22: 2076 23: 2080 24: 4054 bogomips: 182142
Flags: avx avx2 ht lm nx pae sse sse2 sse3 sse4_1 sse4_2 sse4a ssse3 svm
Vulnerabilities:
Type: itlb_multihit status: Not affected
Type: l1tf status: Not affected
Type: mds status: Not affected
Type: meltdown status: Not affected
Type: mmio_stale_data status: Not affected
Type: spec_store_bypass
mitigation: Speculative Store Bypass disabled via prctl
Type: spectre_v1
mitigation: usercopy/swapgs barriers and __user pointer sanitization
Type: spectre_v2
mitigation: Retpolines, IBPB: conditional, STIBP: conditional, RSB filling
Type: srbds status: Not affected
Type: tsx_async_abort status: Not affected
Graphics:
Device-1: NVIDIA GA102 [GeForce RTX 3080] vendor: Gigabyte driver: nvidia
v: 515.48.07 alternate: nouveau,nvidia_drm non-free: 515.xx+
status: current (as of 2022-06) arch: Ampere process: TSMC n7 (7nm)
built: 2020-22 pcie: gen: 2 speed: 5 GT/s lanes: 16 link-max: gen: 4
speed: 16 GT/s bus-ID: 08:00.0 chip-ID: 10de:2206 class-ID: 0300
Display: x11 server: X.Org v: 21.1.3 with: Xwayland v: 22.1.2
compositor: gnome-shell driver: X: loaded: nvidia gpu: nvidia display-ID: :1
screens: 1
Screen-1: 0 s-res: 2560x1440 s-dpi: 108 s-size: 602x342mm (23.70x13.46")
s-diag: 692mm (27.26")
Monitor-1: DP-0 res: 2560x1440 hz: 144 dpi: 109
size: 597x336mm (23.5x13.23") diag: 685mm (26.97") modes: N/A
Message: Unable to show GL data. Required tool glxinfo missing.
Audio:
Device-1: NVIDIA GA102 High Definition Audio vendor: Gigabyte
driver: snd_hda_intel bus-ID: 3-6.2:3 v: kernel pcie: chip-ID: 1235:8200
class-ID: 0103 gen: 2 speed: 5 GT/s lanes: 16 link-max: gen: 4
speed: 16 GT/s bus-ID: 08:00.1 chip-ID: 10de:1aef class-ID: 0403
Device-2: AMD Starship/Matisse HD Audio vendor: Gigabyte
driver: snd_hda_intel v: kernel pcie: gen: 4 speed: 16 GT/s lanes: 16
bus-ID: 0a:00.4 chip-ID: 1022:1487 class-ID: 0403
Device-3: Focusrite-Novation Scarlett 2i4 USB type: USB
driver: snd-usb-audio
Device-4: Creative Sound Blaster Play! 3 type: USB
driver: hid-generic,snd-usb-audio,usbhid bus-ID: 3-6.3:4 chip-ID: 041e:324d
class-ID: 0300 serial: <filter>
Sound Server-1: ALSA v: k5.18.5-1-MANJARO running: yes
Sound Server-2: JACK v: 1.9.21 running: no
Sound Server-3: PulseAudio v: 16.0 running: yes
Sound Server-4: PipeWire v: 0.3.52 running: yes
Network:
Device-1: Intel I211 Gigabit Network vendor: Gigabyte driver: igb v: kernel
pcie: gen: 1 speed: 2.5 GT/s lanes: 1 port: f000 bus-ID: 04:00.0
chip-ID: 8086:1539 class-ID: 0200
IF: enp4s0 state: up speed: 1000 Mbps duplex: full mac: <filter>
IF-ID-1: wlp5s0f1u6u1i2 state: down mac: <filter>
IF-ID-2: ztrfycdflf state: unknown mac: <filter>
Bluetooth:
Device-1: Realtek RTL8723BU 802.11b/g/n WLAN Adapter type: USB
driver: btusb,rtl8xxxu bus-ID: 1-6.1:4 chip-ID: 0bda:b720 class-ID: e001
serial: <filter>
Report: rfkill ID: hci0 rfk-id: 0 state: down bt-service: enabled,running
rfk-block: hardware: no software: yes address: see --recommends
Drives:
Local Storage: total: 2.79 TiB used: 973.45 GiB (34.0%)
SMART Message: Required tool smartctl not installed. Check --recommends
ID-1: /dev/nvme0n1 maj-min: 259:0 vendor: Samsung
model: SSD 970 EVO Plus 500GB size: 465.76 GiB block-size: physical: 512 B
logical: 512 B speed: 31.6 Gb/s lanes: 4 type: SSD serial: <filter>
rev: 2B2QEXM7 temp: 52.9 C scheme: GPT
ID-2: /dev/sda maj-min: 8:0 vendor: Western Digital
model: WD7500BPVT-60HXZT3 size: 698.64 GiB block-size: physical: 4096 B
logical: 512 B speed: 3.0 Gb/s type: HDD rpm: 5400 serial: <filter>
rev: 1A01 scheme: GPT
ID-3: /dev/sdb maj-min: 8:16 vendor: Western Digital
model: WD10SPZX-00Z10T0 size: 931.51 GiB block-size: physical: 4096 B
logical: 512 B speed: 6.0 Gb/s type: HDD rpm: 5400 serial: <filter>
rev: 1A01 scheme: GPT
ID-4: /dev/sdc maj-min: 8:32 vendor: Western Digital
model: WD5003AZEX-00K1GA0 size: 465.76 GiB block-size: physical: 4096 B
logical: 512 B speed: 6.0 Gb/s type: N/A serial: <filter> rev: 0A80
scheme: MBR
ID-5: /dev/sdd maj-min: 8:48 vendor: Seagate model: ST3320620AS
size: 298.09 GiB block-size: physical: 512 B logical: 512 B speed: 1.5 Gb/s
type: N/A serial: <filter> rev: K scheme: MBR
Partition:
ID-1: / raw-size: 265.93 GiB size: 260.69 GiB (98.03%)
used: 111.74 GiB (42.9%) fs: ext4 dev: /dev/nvme0n1p7 maj-min: 259:7
ID-2: /boot/efi raw-size: 500 MiB size: 499 MiB (99.80%)
used: 312 KiB (0.1%) fs: vfat dev: /dev/nvme0n1p6 maj-min: 259:6
ID-3: /home raw-size: 342.63 GiB size: 336.19 GiB (98.12%)
used: 220.1 GiB (65.5%) fs: ext4 dev: /dev/sdb2 maj-min: 8:18
Swap:
Kernel: swappiness: 60 (default) cache-pressure: 100 (default)
ID-1: swap-1 type: partition size: 4.04 GiB used: 0 KiB (0.0%)
priority: -2 dev: /dev/nvme0n1p5 maj-min: 259:5
Sensors:
System Temperatures: cpu: N/A mobo: N/A gpu: nvidia temp: 47 C
Fan Speeds (RPM): N/A gpu: nvidia fan: 0%
Info:
Processes: 465 Uptime: 19m wakeups: 0 Memory: 31.29 GiB
used: 5.51 GiB (17.6%) Init: systemd v: 251 default: graphical
tool: systemctl Compilers: gcc: 12.1.0 clang: 13.0.1 Packages: 1830
pacman: 1823 lib: 471 flatpak: 0 snap: 7 Shell: Zsh v: 5.9 running-in: tilix
inxi: 3.3.19