Use MPI I/O on IU research supercomputers

MPI I/O is an API standard for parallel I/O that allows multiple processes of a parallel program to access data in a common file simultaneously. MPI I/O maps I/O reads and writes to message-passing sends and receives. Implementing parallel I/O can improve the performance of your parallel application.

Ideally, MPI I/O should be used on a parallel file system; common file systems (e.g., NFS, ext3) do not support the concurrent-access semantics that MPI I/O needs to perform well. However, ROMIO, a popular MPI I/O implementation, does allow MPI I/O to work on NFS.
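ROMIO also accepts tuning "hints" through an MPI_Info object passed to MPI_File_open. The following C sketch is illustrative only: the keys shown (romio_cb_write, cb_buffer_size) are ROMIO hints that control collective buffering, the file name is hypothetical, and support varies by MPI implementation and file system. Unrecognized hints are simply ignored.

  MPI_Info info;
  MPI_File fh;

  MPI_Info_create(&info);
  /* Ask ROMIO to enable collective buffering for writes and to use a
     16 MB collective buffer; other implementations may ignore these. */
  MPI_Info_set(info, "romio_cb_write", "enable");
  MPI_Info_set(info, "cb_buffer_size", "16777216");

  MPI_File_open(MPI_COMM_WORLD, "data.out",
                MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);
  MPI_Info_free(&info);   /* the file handle keeps its own copy of the hints */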

The following example uses MPI I/O functions to copy a file: each process reads one slice of the input file and writes it to the corresponding position in the output file. Brief explanations of each step follow the code:

  #include <stdio.h>
  #include <stdlib.h>
  #include <mpi.h>       /* Include the MPI definitions */

  void ErrorMessage(int error, int rank, char* string)
  {
          fprintf(stderr, "Process %d: Error %d in %s\n", rank, error, string);
          MPI_Finalize();
          exit(-1);
  }

  int main(int argc, char *argv[])
  {
    int start, end;
    int length;
    int error;
    char* buffer;
    int nprocs;
    int myrank;
    MPI_Status    status;
    MPI_File      fh;
    MPI_Offset    filesize;

    if (argc != 3)
    {
          fprintf(stderr, "Usage: %s FileToRead FileToWrite\n", argv[0]);
          exit(-1);
    }

    /* Initialize MPI */
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);

    /* Open file to read */
    error = MPI_File_open(MPI_COMM_WORLD, argv[1],
                  MPI_MODE_RDONLY, MPI_INFO_NULL, &fh);
    if(error != MPI_SUCCESS) ErrorMessage(error, myrank, "MPI_File_open");

    /* Get the size of file */
    error = MPI_File_get_size(fh, &filesize);
    if(error != MPI_SUCCESS) ErrorMessage(error, myrank, "MPI_File_get_size");

    /* calculate the range for each process to read */
    length = filesize / nprocs;
    start = length * myrank;
    if (myrank == nprocs-1)
          end = filesize;
    else
          end = start + length;
    fprintf(stdout, "Proc %d: range = [%d, %d)\n", myrank, start, end);

    /* Allocate space */
    buffer = (char *)malloc((end - start) * sizeof(char));
    if (buffer == NULL) ErrorMessage(-1, myrank, "malloc");

    /* Each process read in data from the file */
    MPI_File_seek(fh, start, MPI_SEEK_SET);
    error = MPI_File_read(fh, buffer, end-start, MPI_BYTE, &status);
    if(error != MPI_SUCCESS) ErrorMessage(error, myrank, "MPI_File_read");

    /* close the file */
    MPI_File_close(&fh);

    /* Open file to write */
    error = MPI_File_open(MPI_COMM_WORLD, argv[2],
                  MPI_MODE_WRONLY | MPI_MODE_CREATE, MPI_INFO_NULL, &fh);
    if(error != MPI_SUCCESS) ErrorMessage(error, myrank, "MPI_File_open");

    error = MPI_File_write_at(fh, start, buffer, end-start, MPI_BYTE, &status);
    if(error != MPI_SUCCESS) ErrorMessage(error, myrank, "MPI_File_write_at");

    /* close the file */
    MPI_File_close(&fh);

    free(buffer);

    /* Finalize MPI */
    MPI_Finalize();
    return 0;
  }

In the above example:

  • The first step is to establish the MPI environment, so MPI_Init (the C version) is required and must precede all other MPI calls.
  • The MPI_File_open function opens a file on all processes in the communicator. Several access modes are supported; the one used first in the example (MPI_MODE_RDONLY) is read-only.
  • The MPI_File_get_size function returns the size of the file, which is used to compute each process's offset.
  • The MPI_File_seek function moves each process's individual file pointer to the position where that process will start reading.
  • The MPI_File_read function reads data into the buffer given as the second parameter; the count and datatype parameters define how much is read. (A collective variant of the read and write calls is sketched after this list.)
  • The MPI_File_write_at function writes data from the buffer (the third parameter) at the explicit file offset given by the second parameter.
  • The MPI_File_close function closes the file opened by MPI_File_open.
  • The MPI environment in every process must be terminated with MPI_Finalize; no MPI calls may be made after MPI_Finalize.
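Because every process participates in both the read and the write, the independent calls in the example can be replaced with their collective counterparts, MPI_File_read_at_all and MPI_File_write_at_all, which let the MPI library coordinate and merge the processes' requests. A minimal sketch, reusing the variables from the example above (offsets are in bytes under the default file view, and the collective read replaces the seek/read pair):

  /* Collective read: every process in the communicator must call this. */
  error = MPI_File_read_at_all(fh, (MPI_Offset)start, buffer,
                end - start, MPI_BYTE, &status);
  if (error != MPI_SUCCESS) ErrorMessage(error, myrank, "MPI_File_read_at_all");

  /* Collective write at each process's own offset. */
  error = MPI_File_write_at_all(fh, (MPI_Offset)start, buffer,
                end - start, MPI_BYTE, &status);
  if (error != MPI_SUCCESS) ErrorMessage(error, myrank, "MPI_File_write_at_all");

To build and run the example, use your MPI implementation's compiler wrapper and launcher; for example, mpicc -o mpicopy example.c followed by mpirun -np 4 ./mpicopy input.dat output.dat (exact command names vary by implementation and batch system).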

Fortran examples

Following are two Fortran examples:

Example 1:

  !^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
        program  create_file
  
  !**************************************************************************
  !  This is a Fortran 90 program to write data directly to a file by each
  !  member of an MPI group.  It is suitable for large jobs which will not
  !  fit into core memory (such as "out of core" solvers)  
  !
  !  Copyright by the Trustees of Indiana University 2005
  !***************************************************************************

         USE MPI

        integer, parameter :: kind_val = 4
        integer, parameter  :: filesize = 40
        integer :: realsize = 4
        integer ::  rank, ierr, fh, nprocs, num_reals
        integer ::  i, region
        real (kind = kind_val) :: datum
        integer, dimension (MPI_STATUS_SIZE) :: status
        integer (kind = MPI_OFFSET_KIND) :: offset, empty
  
  !  Set filename to output datafile

        character (len = *), parameter :: filename = "/u/ac/rays/new_data.dat"
        real (kind = kind_val), dimension ( : ), allocatable  :: bucket

  !  Basic MPI set-up

        call MPI_INIT(ierr)
        call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
        call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)

  !  Sanity print

         print*, "myid is ", rank

  !  Carve out a piece of the output file and create a data bucket

         empty = 0
         region = filesize / (nprocs )
         offset = ( region * rank )
         allocate (bucket(region))

  !  There is no guarantee that an old file will be clobbered, so wipe out any previous output file.

          if (rank .eq. 0) then
                  call MPI_File_delete(filename, MPI_INFO_NULL, ierr)
          endif

  !  Set the file handle to an initial value (this should not be required)

           fh = 0

  !  Open the output file

           call MPI_FILE_OPEN(MPI_COMM_WORLD, filename, MPI_MODE_CREATE+MPI_MODE_RDWR, MPI_INFO_NULL, fh, ierr)

  !  Wait on everyone to catch up.

           call MPI_BARRIER(MPI_COMM_WORLD, ierr)

  !  Do some work and fill up the data bucket

           call random_seed()
  
           do i = 1, region

               call random_number(datum)
  
               bucket(i) = datum * 1000000. * (rank + 1)

               print *, " bucket  ",i ,"= ", bucket(i)
           enddo

  !  Basic "belt and suspenders insurance that everyone's file pointer is at the beginning of the output file.

            call MPI_FILE_SET_VIEW(fh, empty, MPI_REAL4, MPI_REAL4, 'native', MPI_INFO_NULL, ierr)

  !  Send the data bucket to the output file in the proper place

            call MPI_FILE_WRITE_AT(fh, offset, bucket, region, MPI_REAL4, status, ierr)

  !  Wait on everyone to finish and close up shop

           call MPI_BARRIER(MPI_COMM_WORLD, ierr)
  
           call MPI_FILE_CLOSE(fh, ierr)

           call MPI_FINALIZE(ierr)

           end  program  create_file
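One point worth noting: because MPI_FILE_SET_VIEW above sets the elementary type to MPI_REAL4, the offset passed to MPI_FILE_WRITE_AT is counted in 4-byte reals rather than bytes (in the C copy example, the default file view makes offsets byte counts). A C sketch of the same write pattern, with illustrative variable names not taken from the Fortran source:

  /* Set a file view whose elementary type is a 4-byte float; from now on,
     explicit offsets for this file handle are counted in floats. */
  MPI_Offset disp = 0;                            /* view begins at byte 0 */
  MPI_Offset offset = (MPI_Offset)region * rank;  /* in floats, not bytes  */

  MPI_File_set_view(fh, disp, MPI_FLOAT, MPI_FLOAT,
                    "native", MPI_INFO_NULL);
  MPI_File_write_at(fh, offset, bucket, region, MPI_FLOAT, &status);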

Example 2:

  !^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

        program  read_file

  !**************************************************************************
  !  This is a Fortran 90 program to read data directly from a file by each
  !  member of an MPI group.  It is suitable for large jobs which will not
  !  fit into core memory (such as "out of core" solvers)  
  !
  !  Copyright by the Trustees of Indiana University 2005
  !***************************************************************************

          USE MPI

        integer, parameter :: kind_val = 4
        integer,  parameter  :: filesize = 40
        integer :: realsize = 4
        integer ::  rank, ierr, fh, nprocs, num_reals
        integer ::  i, region
        integer, dimension (MPI_STATUS_SIZE) :: status
        integer (kind = MPI_OFFSET_KIND) :: offset, empty

  !  Set filename to the input datafile

        character (len = *), parameter :: filename = "/u/ac/rays/new_data.dat"
        real (kind = kind_val), dimension ( : ), allocatable  :: bucket

  !  Basic MPI set-up

        call MPI_INIT(ierr)
        call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
        call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)
      
  !  Carve out a piece of the input file and create a data bucket
 
        empty = 0
        region = filesize / (nprocs )
        offset = (region * rank )
        allocate (bucket(region))

  !  Sanity print

        print*, "myid is ", rank

  !  Set the file handle to an initial value (this should not be required)

        fh = 0

  !  Open the input file

        call MPI_FILE_OPEN(MPI_COMM_WORLD, filename, MPI_MODE_RDONLY, MPI_INFO_NULL, fh, ierr)
 
  !  Wait on everyone to catch up.

        call MPI_BARRIER(MPI_COMM_WORLD, ierr)

  !  Basic "belt and suspenders insurance that everyone's file pointer is at the beginning of the output file.

         call MPI_FILE_SET_VIEW(fh, empty, MPI_REAL4, MPI_REAL4, 'native', MPI_INFO_NULL, ierr)

  !  Read only the section of the data file each process needs and put data in the data bucket.

         call MPI_FILE_READ_AT(fh, offset, bucket, region, MPI_REAL4, status, ierr)

  !  We could check the values received in the bucket (debug hint)
  !
  !      do i = 1, region
  !         print *, "my id is ", rank, " and my ", i, "number is ", bucket(i)
  !      enddo

  !  Wait on everyone to finish and close up shop

        call MPI_BARRIER(MPI_COMM_WORLD, ierr) 
  
        call MPI_FILE_CLOSE(fh, ierr)

        call MPI_FINALIZE(ierr)

        end  program  read_file
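Neither Fortran example checks how much data was actually transferred; the status argument records this, and it can be queried with MPI_GET_COUNT in Fortran or MPI_Get_count in C. A short C sketch, reusing the names status, region, and rank from the examples above:

  /* Verify that the read returned the expected number of elements. */
  int count;
  MPI_Get_count(&status, MPI_FLOAT, &count);
  if (count != region)
      fprintf(stderr, "Rank %d: expected %d reals, read %d\n",
              rank, region, count);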

Support for IU research computing systems, software, and services is provided by the Research Technologies division of UITS. To ask a question or get help, contact UITS Research Technologies.
