On IU's research systems, how do I use MPI I/O?
MPI I/O is an important feature of the MPI-2 standard that allows multiple processes of a parallel program to access data in a common file simultaneously. Because the processes read and write in parallel instead of routing all data through a single process, it can deliver much higher I/O performance.
Ideally, MPI I/O should be used on a parallel file system, such as GPFS, because common file systems (e.g., NFS, EXT3FS) do not provide native support for MPI I/O. However, an MPI I/O implementation such as ROMIO allows MPI I/O to work on NFS.
The following example uses MPI I/O functions to copy files. Short explanations for each step follow the example:
/**********************************************************************
 Copyright 2005, The Trustees of Indiana University.
 All rights reserved.

 To compile on IU's Quarry machine, say the file name is mpiio_demo.c, type:
   soft add @openmpi   (if openmpi is not already in your .soft file)
   mpicc -o mpiio_demo mpiio_demo.c
**********************************************************************/
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>   /* Include the MPI definitions */

void ErrorMessage(int error, int rank, const char* string)
{
    fprintf(stderr, "Process %d: Error %d in %s\n", rank, error, string);
    MPI_Finalize();
    exit(-1);
}

int main(int argc, char *argv[])
{
    /* Note: int offsets limit this demo to files smaller than INT_MAX bytes */
    int start, end;
    int length;
    int error;
    char* buffer;
    int nprocs;
    int myrank;
    MPI_Status status;
    MPI_File fh;
    MPI_Offset filesize;

    if (argc != 3) {
        fprintf(stderr, "Usage: %s FileToRead FileToWrite\n", argv[0]);
        exit(-1);
    }

    /* Initialize MPI */
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);

    /* Open file to read */
    error = MPI_File_open(MPI_COMM_WORLD, argv[1],
                          MPI_MODE_RDONLY, MPI_INFO_NULL, &fh);
    if (error != MPI_SUCCESS) ErrorMessage(error, myrank, "MPI_File_open");

    /* Get the size of file */
    error = MPI_File_get_size(fh, &filesize);
    if (error != MPI_SUCCESS) ErrorMessage(error, myrank, "MPI_File_get_size");

    /* Calculate the range for each process to read */
    length = filesize / nprocs;
    start = length * myrank;
    if (myrank == nprocs-1)
        end = filesize;          /* the last process also reads the remainder */
    else
        end = start + length;
    fprintf(stdout, "Proc %d: range = [%d, %d)\n", myrank, start, end);

    /* Allocate space */
    buffer = (char *)malloc((end - start) * sizeof(char));
    if (buffer == NULL) ErrorMessage(-1, myrank, "malloc");

    /* Each process reads its share of data from the file */
    error = MPI_File_seek(fh, start, MPI_SEEK_SET);
    if (error != MPI_SUCCESS) ErrorMessage(error, myrank, "MPI_File_seek");
    error = MPI_File_read(fh, buffer, end-start, MPI_BYTE, &status);
    if (error != MPI_SUCCESS) ErrorMessage(error, myrank, "MPI_File_read");

    /* Close the file */
    MPI_File_close(&fh);

    /* Open file to write */
    error = MPI_File_open(MPI_COMM_WORLD, argv[2],
                          MPI_MODE_WRONLY | MPI_MODE_CREATE, MPI_INFO_NULL, &fh);
    if (error != MPI_SUCCESS) ErrorMessage(error, myrank, "MPI_File_open");

    /* Each process writes its data at the same offset it read from */
    error = MPI_File_write_at(fh, start, buffer, end-start, MPI_BYTE, &status);
    if (error != MPI_SUCCESS) ErrorMessage(error, myrank, "MPI_File_write_at");

    /* Close the file */
    MPI_File_close(&fh);
    free(buffer);

    /* Finalize MPI */
    MPI_Finalize();
    return 0;
}

- The very first step is to establish the MPI environment, so MPI_Init (C version) is required and must be the first MPI call in every MPI program.
- The function MPI_File_open opens a file on all processes. Several access modes are supported. The one used first in the example, MPI_MODE_RDONLY, is for read only.
- The function MPI_File_get_size gives the file size, which is used later to determine the offset for each process.
- The function MPI_File_seek moves the file pointer to the position in the file where each process will start reading data.
- The function MPI_File_read reads data into the buffer specified in the second parameter. The number of elements to read is given in the third parameter.
- The function MPI_File_write_at writes data from the buffer (the third parameter) to the position in the file given by the second parameter.
- The function MPI_File_close closes the file opened by the function MPI_File_open.
- The MPI environment in every process must be terminated by the function MPI_Finalize. No MPI calls may be made after MPI_Finalize.
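To make the offset calculation easier to follow, here is a minimal sketch in plain C (no MPI calls) of how the example divides a file among processes. The names ByteRange, rank_range, and print_ranges are hypothetical helpers introduced only for this illustration; they are not part of MPI or of the example above.

```c
#include <assert.h>
#include <stdio.h>

/* Half-open byte range [start, end) assigned to one process. */
typedef struct {
    long long start;
    long long end;
} ByteRange;

/* Hypothetical helper (not an MPI function): split filesize bytes
 * evenly across nprocs ranks. The last rank also takes the remainder,
 * mirroring the range calculation in the copy example. */
ByteRange rank_range(long long filesize, int nprocs, int rank)
{
    ByteRange r;
    long long length;

    assert(filesize >= 0 && nprocs > 0 && rank >= 0 && rank < nprocs);

    length = filesize / nprocs;
    r.start = length * rank;
    r.end = (rank == nprocs - 1) ? filesize : r.start + length;
    return r;
}

/* Print the range of every rank, in the same format as the example. */
void print_ranges(long long filesize, int nprocs)
{
    int rank;
    for (rank = 0; rank < nprocs; rank++) {
        ByteRange r = rank_range(filesize, nprocs, rank);
        printf("Proc %d: range = [%lld, %lld)\n", rank, r.start, r.end);
    }
}
```

For example, calling print_ranges(10, 4) shows that ranks 0 through 2 each get 2 bytes, while rank 3 gets the remaining 4 bytes. Using long long here avoids the overflow that plain int offsets would hit on very large files.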
Fortran examples
Following are two Fortran examples:
Example 1
!^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
program create_file
!**************************************************************************
! This is a Fortran 90 program to write data directly to a file by each
! member of an MPI group. It is suitable for large jobs which will not
! fit into core memory (such as "out of core" solvers)
!
! Copyright by the Trustees of Indiana University 2005
!***************************************************************************
  USE MPI
  implicit none
  integer, parameter :: kind_val = 4
  integer, parameter :: filesize = 40
  integer :: realsize = 4
  integer :: rank, ierr, fh, nprocs, num_reals
  integer :: i, region
  real (kind = kind_val) :: datum
  integer, dimension (MPI_STATUS_SIZE) :: status
  integer (kind = MPI_OFFSET_KIND) :: offset, empty

  ! Set filename to output datafile
  character (len = *), parameter :: filename = "/u/ac/rays/new_data.dat"
  real (kind = kind_val), dimension ( : ), allocatable :: bucket

  ! Basic MPI set-up
  call MPI_INIT(ierr)
  call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
  call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)

  ! Sanity print
  print*, "myid is ", rank

  ! Carve out a piece of the output file and create a data bucket
  empty = 0
  region = filesize / (nprocs )
  offset = ( region * rank )
  allocate (bucket(region))

  ! There is no guarantee that an old file will be clobbered, so wipe out
  ! any previous output file
  if (rank .eq. 0) then
    call MPI_File_delete(filename, MPI_INFO_NULL, ierr)
  endif

  ! Set the file handle to an initial value (this should not be required)
  fh = 0

  ! Open the output file
  call MPI_FILE_OPEN(MPI_COMM_WORLD, filename, &
                     MPI_MODE_CREATE+MPI_MODE_RDWR, MPI_INFO_NULL, fh, ierr)

  ! Wait on everyone to catch up.
  call MPI_BARRIER(MPI_COMM_WORLD, ierr)

  ! Do some work and fill up the data bucket
  call random_seed()
  do i = 1, region
    call random_number(datum)
    bucket(i) = datum * 1000000. * (rank + 1)
    print *, " bucket ",i ,"= ", bucket(i)
  enddo

  ! Basic "belt and suspenders" insurance that everyone's file pointer is
  ! at the beginning of the output file.
  call MPI_FILE_SET_VIEW(fh, empty, MPI_REAL4, MPI_REAL4, 'native', &
                         MPI_INFO_NULL, ierr)

  ! Send the data bucket to the output file in the proper place
  call MPI_FILE_WRITE_AT(fh, offset, bucket, region, MPI_REAL4, status, ierr)

  ! Wait on everyone to finish and close up shop
  call MPI_BARRIER(MPI_COMM_WORLD, ierr)
  call MPI_FILE_CLOSE(fh, ierr)
  call MPI_FINALIZE(ierr)
end program create_file
!******************************************************
! Ray Sheppard, HPCST, RAC, UITS, Indiana University  *
!******************************************************

Example 2
!^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
program read_file
!**************************************************************************
! This is a Fortran 90 program to read data directly from a file by each
! member of an MPI group. It is suitable for large jobs which will not
! fit into core memory (such as "out of core" solvers)
!
! Copyright by the Trustees of Indiana University 2005
!***************************************************************************
  USE MPI
  implicit none
  integer, parameter :: kind_val = 4
  integer, parameter :: filesize = 40
  integer :: realsize = 4
  integer :: rank, ierr, fh, nprocs, num_reals
  integer :: i, region
  integer, dimension (MPI_STATUS_SIZE) :: status
  integer (kind = MPI_OFFSET_KIND) :: offset, empty

  ! Set filename to the input data file
  character (len = *), parameter :: filename = "/u/ac/rays/new_data.dat"
  real (kind = kind_val), dimension ( : ), allocatable :: bucket

  ! Basic MPI set-up
  call MPI_INIT(ierr)
  call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
  call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)

  ! Carve out a piece of the data file and create a data bucket
  empty = 0
  region = filesize / (nprocs )
  offset = (region * rank )
  allocate (bucket(region))

  ! Sanity print
  print*, "myid is ", rank

  ! Set the file handle to an initial value (this should not be required)
  fh = 0

  ! Open the data file
  call MPI_FILE_OPEN(MPI_COMM_WORLD, filename, MPI_MODE_RDONLY, &
                     MPI_INFO_NULL, fh, ierr)

  ! Wait on everyone to catch up.
  call MPI_BARRIER(MPI_COMM_WORLD, ierr)

  ! Basic "belt and suspenders" insurance that everyone's file pointer is
  ! at the beginning of the data file. The displacement must be of kind
  ! MPI_OFFSET_KIND, so pass the variable empty rather than a literal 0.
  call MPI_FILE_SET_VIEW(fh, empty, MPI_REAL4, MPI_REAL4, 'native', &
                         MPI_INFO_NULL, ierr)

  ! Read only the section of the data file each process needs and put
  ! data in the data bucket.
  call MPI_FILE_READ_AT(fh, offset, bucket, region, MPI_REAL4, status, ierr)

  ! We could check the values received in the bucket (debug hint)
  !
  ! do i = 1, region
  !   print *, "my id is ", rank, " and my ", i, "number is ", bucket(i)
  ! enddo

  ! Wait on everyone to finish and close up shop
  call MPI_BARRIER(MPI_COMM_WORLD, ierr)
  call MPI_FILE_CLOSE(fh, ierr)
  call MPI_FINALIZE(ierr)
end program read_file
!******************************************************
! Ray Sheppard, HPCST, RAC, UITS, Indiana University  *
!******************************************************

You can find a detailed description of MPI I/O in the MPI-2 standard document.
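One detail of the Fortran examples is easy to miss: after MPI_FILE_SET_VIEW sets the etype to MPI_REAL4, the offset passed to MPI_FILE_WRITE_AT or MPI_FILE_READ_AT is counted in etype units (4-byte reals), not in bytes. Likewise, filesize there is an element count (40 reals). The following plain C sketch illustrates the arithmetic under those assumptions; rank_offset and byte_position are hypothetical helper names, not MPI functions, and the byte calculation assumes a contiguous view in which the filetype equals the etype.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: index of the first element owned by a rank,
 * matching region = filesize / nprocs and offset = region * rank
 * in the Fortran examples. */
long long rank_offset(int total_elements, int nprocs, int rank)
{
    assert(total_elements >= 0 && nprocs > 0 && rank >= 0 && rank < nprocs);
    return (long long)(total_elements / nprocs) * rank;
}

/* Hypothetical helper: absolute byte position addressed by an offset
 * given in etype units, for a view with displacement disp (in bytes)
 * and a contiguous filetype equal to the etype. */
long long byte_position(long long disp, long long offset_in_etypes,
                        int etype_size)
{
    assert(disp >= 0 && offset_in_etypes >= 0 && etype_size > 0);
    return disp + offset_in_etypes * etype_size;
}

/* Print where each rank's data begins, in elements and in bytes. */
void print_layout(int total_elements, int nprocs, int etype_size)
{
    int rank;
    for (rank = 0; rank < nprocs; rank++) {
        long long off = rank_offset(total_elements, nprocs, rank);
        printf("rank %d: element offset %lld, byte position %lld\n",
               rank, off, byte_position(0, off, etype_size));
    }
}
```

With 40 reals on 4 processes, each process owns 10 elements, so rank 2 starts at element offset 20, which corresponds to byte 80 of the file.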
This document was developed with support from the National Science Foundation (NSF) under Grant No. 0503697 to the University of Chicago and subcontracted to Indiana University. Additional support was provided by IU through its participation in the TeraGrid, which is supported by the NSF under Grants No. 0833618, SCI451237, SCI535258, and SCI504075. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the NSF.
Last modified on July 15, 2009.







