2023-08-23 OpenMPI hangs unpredictably when running MPI apps

Dear all,
I am using the latest Manjaro update from 2023-08-23 under a KDE environment. My Linux kernel is version 6.1.44-1, which is the latest LTS version I see in my kernel manager.
This machine uses an Intel i9-12900K processor and has been working perfectly under Manjaro for the last two years. I do a lot of application development using MPI.
I am using GCC 13.2.1 20230801 and Open MPI 4.1.5-2 from today's update (2023-08-23).

Now all the MPI codes I have tried randomly hang when launched with two or more processes. Sometimes they finish completely, sometimes they hang. It seems unpredictable, but the longer the run, the more likely it is to hang. Is there anything I can do to fix this, or is this a known issue? I have tried reinstalling and recompiling all the codes I use, but to no avail.
Thank you for your help.

I think you’ll have more luck taking this issue to Open MPI devs. See MPI 4.0.5 hangs · Issue #11853 · open-mpi/ompi · GitHub for how to provide relevant information.

You could also have a look through these other issues and see if anything there helps.

Thank you for your help. I have managed to diagnose that the issue is related to the MPI-IO functions (i.e. when MPI is used to open a file in parallel). So far I have not been able to find any sort of solution, and I am unsure whether it is a system bug, an MPI bug, or something else. Thank you for your time though :)

UPDATE:

I opened an issue on the OpenMPI GitHub repo, and with the help of some of the developers we found two options that solve my issue:

  1. Use the official Arch Linux OpenMPI package
  2. Configure your OpenMPI build with --with-pmix=external (a rough build sketch is shown below)
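
For anyone who wants a concrete starting point for option 2, here is a minimal sketch of building Open MPI from the upstream release tarball against the system PMIx. The install prefix, directory name, and job count below are my own assumptions, not the exact commands from the GitHub issue:

# Sketch only: build Open MPI 4.1.5 with an external (system) PMIx.
# Assumes the release tarball has been unpacked and a PMIx development
# package is already installed on the system.
cd openmpi-4.1.5
./configure --prefix=$HOME/opt/openmpi-4.1.5 --with-pmix=external
make -j"$(nproc)"
make install
# Afterwards, put $HOME/opt/openmpi-4.1.5/bin at the front of PATH so that
# mpicc and mpiexec resolve to this build.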

Original post:

I have been running into the same issue for weeks now.

In my experience, MPI_File_open hangs every second time it is called. Here is a minimal working example that demonstrates this behaviour:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char* argv[]) {
  MPI_Init(&argc, &argv);

  int my_rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

  MPI_File handle;
  int access_mode = MPI_MODE_CREATE /* Create the file if it does not exist */
                    | MPI_MODE_WRONLY; /* Open the file for writing only */

  printf("[MPI process %02d] About to open file.\n", my_rank);
  if (MPI_File_open(MPI_COMM_WORLD, "file.tmp", access_mode, MPI_INFO_NULL,
                    &handle) != MPI_SUCCESS) {
    printf("[MPI process %02d] Failure in opening the file.\n", my_rank);
    MPI_Abort(MPI_COMM_WORLD, EXIT_FAILURE);
  }

  printf("[MPI process %02d] File opened successfully.\n", my_rank);

  if (MPI_File_close(&handle) != MPI_SUCCESS) {
    printf("[MPI process %02d] Failure in closing the file.\n", my_rank);
    MPI_Abort(MPI_COMM_WORLD, EXIT_FAILURE);
  }

  printf("[MPI process %02d] File closed successfully.\n", my_rank);

  MPI_Finalize();

  return EXIT_SUCCESS;
}

And an example of the behaviour:

mpicc MPIWriteTest.c
mpiexec -n 2 ./a.out 
[MPI process 00] About to open file.
[MPI process 01] About to open file.
[MPI process 00] File opened successfully.
[MPI process 01] File opened successfully.
[MPI process 00] File closed successfully.
[MPI process 01] File closed successfully.

mpiexec -n 2 ./a.out
[MPI process 00] About to open file.
[MPI process 01] About to open file.
^C%

mpiexec -n 2 ./a.out
[MPI process 00] About to open file.
[MPI process 01] About to open file.
[MPI process 01] File opened successfully.
[MPI process 00] File opened successfully.
[MPI process 01] File closed successfully.
[MPI process 00] File closed successfully.

mpiexec -n 2 ./a.out
[MPI process 00] About to open file.
[MPI process 01] About to open file.
^C%

This behaviour occurs whether or not the file being opened already exists.

This behaviour does not coincide with any change in the version of OpenMPI I am running. I first encountered it around 2023-08-21 on v4.1.4, which I had been running without issue for almost a year at that point. I have since upgraded to v4.1.5, but this didn't change anything. This makes me think the issue is due to something that changed in a Manjaro update.

My first suspicion was that this was caused by the upgrade to gcc-13 in the 2023-06-04 stable update. I therefore tried recompiling OpenMPI using gcc-12, but this had no effect.
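
For anyone who wants to try the same thing, overriding the compiler for an Open MPI source build looks roughly like this. This is a sketch under my own assumptions (install prefix, directory name, and the gcc-12 binary name, which depends on how the older compiler is packaged), not the exact commands I used:

# Sketch only: rebuild Open MPI from source with an older compiler.
cd openmpi-4.1.5
./configure CC=gcc-12 CXX=g++-12 --prefix=$HOME/opt/openmpi-gcc12
make -j"$(nproc)"
make install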