Use MPI I/O on IU research supercomputers

MPI I/O is an API standard for parallel I/O that allows multiple processes of a parallel program to access data in a common file simultaneously. MPI I/O maps I/O reads and writes to message-passing sends and receives. Implementing parallel I/O can improve the performance of your parallel application.

Ideally, MPI I/O should be used on a parallel file system; conventional file systems (for example, NFS or ext3) are not designed for concurrent parallel access. However, ROMIO, a widely used MPI I/O implementation, does allow MPI I/O to work on NFS.

The following example uses MPI I/O functions to copy files. Short explanations for each step are provided below:

  #include <stdio.h>
  #include <stdlib.h>
  #include <mpi.h>       /* Include the MPI definitions */

  void ErrorMessage(int error, int rank, char* string)
  {
          fprintf(stderr, "Process %d: Error %d in %s\n", rank, error, string);
          MPI_Finalize();
          exit(-1);
  }

  int main(int argc, char *argv[])
  {
    int start, end;
    int length;
    int error;
    char* buffer;
    int nprocs;
    int myrank;
    MPI_Status    status;
    MPI_File      fh;
    MPI_Offset    filesize;

    if (argc != 3)
    {
          fprintf(stderr, "Usage: %s FileToRead FileToWrite\n", argv[0]);
          exit(-1);
    }

    /* Initialize MPI */
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);

    /* Open file to read */
    error = MPI_File_open(MPI_COMM_WORLD, argv[1],
                  MPI_MODE_RDONLY, MPI_INFO_NULL, &fh);
    if(error != MPI_SUCCESS) ErrorMessage(error, myrank, "MPI_File_open");

    /* Get the size of file */
    error = MPI_File_get_size(fh, &filesize);
    if(error != MPI_SUCCESS) ErrorMessage(error, myrank, "MPI_File_get_size");

    /* calculate the range for each process to read */
    length = filesize / nprocs;
    start = length * myrank;
    if (myrank == nprocs-1)
          end = filesize;
    else
          end = start + length;
    fprintf(stdout, "Proc %d: range = [%d, %d)\n", myrank, start, end);

    /* Allocate space */
    buffer = (char *)malloc((end - start) * sizeof(char));
    if (buffer == NULL) ErrorMessage(-1, myrank, "malloc");

    /* Each process read in data from the file */
    MPI_File_seek(fh, start, MPI_SEEK_SET);
    error = MPI_File_read(fh, buffer, end-start, MPI_BYTE, &status);
    if(error != MPI_SUCCESS) ErrorMessage(error, myrank, "MPI_File_read");

    /* close the file */
    MPI_File_close(&fh);

    /* Open file to write */
    error = MPI_File_open(MPI_COMM_WORLD, argv[2],
                  MPI_MODE_WRONLY | MPI_MODE_CREATE, MPI_INFO_NULL, &fh);
    if(error != MPI_SUCCESS) ErrorMessage(error, myrank, "MPI_File_open");

    error = MPI_File_write_at(fh, start, buffer, end-start, MPI_BYTE, &status);
    if(error != MPI_SUCCESS) ErrorMessage(error, myrank, "MPI_File_write_at");

    /* close the file */
    MPI_File_close(&fh);

    free(buffer);

    /* Finalize MPI */
    MPI_Finalize();
    return 0;
  }

In the above example:

  • The first step is to establish the MPI environment, so MPI_Init (the C version) is required; it must be the first MPI call in every MPI program.
  • The MPI_File_open function opens a file on all processes. Several access modes are supported. The one used in the example (MPI_MODE_RDONLY) is for read only.
  • The MPI_File_get_size function gives the file size, which is used later to determine the offset for each process.
  • The MPI_File_seek function points to the position in the file where each process will start reading data.
  • The MPI_File_read function reads data into the buffer specified in the second parameter. The size to be read is defined in the third parameter.
  • The MPI_File_write_at function writes data from the buffer (the third parameter) to the position in the file given by the second parameter.
  • The MPI_File_close function closes the file opened by MPI_File_open.
  • The MPI environment in every process must be terminated by the MPI_Finalize function. No MPI calls may be made after MPI_Finalize.

Fortran examples

Following are two Fortran examples:

Example 1:

!^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
        program  create_file
  
  !**************************************************************************
  !  This is a Fortran 90 program to write data directly to a file by each
  !  member of an MPI group.  It is suitable for large jobs which will not
  !  fit into core memory (such as "out of core" solvers)  
  !
  !  Copyright by the Trustees of Indiana University 2005
  !***************************************************************************

         USE MPI

        integer, parameter :: kind_val = 4
        integer, parameter  :: filesize = 40
        integer :: realsize = 4
        integer ::  rank, ierr, fh, nprocs, num_reals
        integer ::  i, region
        real (kind = kind_val) :: datum
        integer, dimension (MPI_STATUS_SIZE) :: status
        integer (kind = MPI_OFFSET_KIND) :: offset, empty
  
  !  Set filename to output datafile

        character (len = *), parameter :: filename = "/u/ac/rays/new_data.dat"
        real (kind = kind_val), dimension ( : ), allocatable  :: bucket

  !  Basic MPI set-up

        call MPI_INIT(ierr)
        call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
        call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)

  !  Sanity print

         print*, "myid is ", rank

  !  Carve out a piece of the output file and create a data bucket

         empty = 0
         region = filesize / (nprocs )
         offset = ( region * rank )
         allocate (bucket(region))

  !  There is no guarantee that an old file will be clobbered, so wipe out any previous output file

          if (rank .eq. 0) then
                  call MPI_File_delete(filename, MPI_INFO_NULL, ierr)
          endif

  !  Set the file handle to an initial value (this should not be required)

           fh = 0

  !  Open the output file

           call MPI_FILE_OPEN(MPI_COMM_WORLD, filename, MPI_MODE_CREATE+MPI_MODE_RDWR, MPI_INFO_NULL, fh, ierr)

  !  Wait on everyone to catch up.

           call MPI_BARRIER(MPI_COMM_WORLD, ierr)

  !  Do some work and fill up the data bucket

           call random_seed()
  
           do i = 1, region

               call random_number(datum)
  
               bucket(i) = datum * 1000000. * (rank + 1)

               print *, " bucket  ",i ,"= ", bucket(i)
           enddo

  !  Basic "belt and suspenders insurance that everyone's file pointer is at the beginning of the output file.

            call MPI_FILE_SET_VIEW(fh, empty, MPI_REAL4, MPI_REAL4, 'native', MPI_INFO_NULL, ierr)

  !  Send the data bucket to the output file in the proper place

            call MPI_FILE_WRITE_AT(fh, offset, bucket, region, MPI_REAL4, status, ierr)

  !  Wait on everyone to finish and close up shop

           call MPI_BARRIER(MPI_COMM_WORLD, ierr)
  
           call MPI_FILE_CLOSE(fh, ierr)

           call MPI_FINALIZE(ierr)

           end  program  create_file

Example 2:

!^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

        program  read_file

  !**************************************************************************
  !  This is a Fortran 90 program to read data directly from a file by each
  !  member of an MPI group.  It is suitable for large jobs which will not
  !  fit into core memory (such as "out of core" solvers)  
  !
  !  Copyright by the Trustees of Indiana University 2005
  !***************************************************************************

          USE MPI

        integer, parameter :: kind_val = 4
        integer,  parameter  :: filesize = 40
        integer :: realsize = 4
        integer ::  rank, ierr, fh, nprocs, num_reals
        integer ::  i, region
        integer, dimension (MPI_STATUS_SIZE) :: status
        integer (kind = MPI_OFFSET_KIND) :: offset, empty

  !  Set filename to output datafile

        character (len = *), parameter :: filename = "/u/ac/rays/new_data.dat"
        real (kind = kind_val), dimension ( : ), allocatable  :: bucket

  !  Basic MPI set-up

        call MPI_INIT(ierr)
        call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
        call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)
      
  !  Carve out a piece of the output file and create a data bucket
 
        empty = 0
        region = filesize / (nprocs )
        offset = (region * rank )
        allocate (bucket(region))

  !  Sanity print

        print*, "myid is ", rank

  !  Set the file handle to an initial value (this should not be required)

        fh = 0

  !  Open the output file

        call MPI_FILE_OPEN(MPI_COMM_WORLD, filename, MPI_MODE_RDONLY, MPI_INFO_NULL, fh, ierr)
 
  !  Wait on everyone to catch up.

        call MPI_BARRIER(MPI_COMM_WORLD, ierr)

  !  Basic "belt and suspenders insurance that everyone's file pointer is at the beginning of the output file.

         call MPI_FILE_SET_VIEW(fh, empty, MPI_REAL4, MPI_REAL4, 'native', MPI_INFO_NULL, ierr)

  !  Read only the section of the data file each process needs and put data in the data bucket.

         call MPI_FILE_READ_AT(fh, offset, bucket, region, MPI_REAL4, status, ierr)

  !  We could check the values received in the bucket (debug hint)
  !
  !      do i = 1, region
  !         print *, "my id is ", rank, " and my ", i, "number is ", bucket(i)
  !      enddo

  !  Wait on everyone to finish and close up shop

        call MPI_BARRIER(MPI_COMM_WORLD, ierr) 
  
        call MPI_FILE_CLOSE(fh, ierr)

        call MPI_FINALIZE(ierr)

        end  program  read_file

Research computing support at IU is provided by the Research Technologies division of UITS. To ask a question or get help regarding Research Technologies services, including IU's research supercomputers and research storage systems, and the scientific, statistical, and mathematical applications available on those systems, contact UITS Research Technologies. For service-specific support contact information, see Research computing support at IU.

This is document aqpe in the Knowledge Base.
Last modified on 2023-10-04 14:21:35.