Indiana University
University Information Technology Services
  
ARCHIVED: On the Research SP, how do I checkpoint a parallel job?

Not every parallel job can be checkpointed; for information about the limitations of parallel checkpointing on the Research SP, see:

https://sp-www.iu.edu/ParEnv/am106mst24.html

You will need to log in with your Indiana University Network ID username and password to access the URL above.

Also see the Knowledge Base document "ARCHIVED: On the Research SP, how do I checkpoint a serial job?" Much of the information there also applies to parallel checkpointing.

Changes to the job command file are mostly the same as in the serial checkpointing case. The following is the job command file that starts the initial execution of a parallel program, xhpl, by user jdoe:

#@ class = pb
#@ job_type = parallel
#@ restart = no
#@ network.MPI = css0,shared,IP
#@ node = 2,2
#@ tasks_per_node = 2
#@ wall_clock_limit = 50:30:00
#@ executable = /bin/poe
#@ arguments = xhpl
#@ environment = COPY_ALL; MP_EUILIB=ip; MP_INFOLEVEL=2
#@ output = hpl.out
#@ error = hpl.error
#@ checkpoint = interval
#@ ckpt_dir = /gpfs/jdoe
#@ ckpt_file = xhpl-ll
#@ ckpt_time_limit = rlim_infinity
#@ restart_from_ckpt = no
#@ queue

For information about the keywords not related to checkpointing, see:

https://sp-www.iu.edu/LoadL/am2ugmst02.html
https://sp-www.iu.edu/SP.jobs.shtml

You will need to enter your IU Network ID username and password to access the URLs above.

As with serial checkpointing, to restart an aborted job from its checkpoint files, change the value of the restart_from_ckpt keyword from no to yes.
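The edit described above can also be scripted. The following is a minimal Python sketch, not part of the original document; the function name enable_restart is hypothetical, and it assumes the job command file writes the keyword on its own line in the "#@ keyword = value" form shown earlier.

```python
import re

# Hypothetical helper: flips restart_from_ckpt to "yes" in the text of a
# job command file. Assumes one "#@ keyword = value" directive per line.
_RESTART_RE = re.compile(
    r"^([ \t]*#[ \t]*@[ \t]*restart_from_ckpt[ \t]*=[ \t]*)\S+",
    re.MULTILINE,
)


def enable_restart(job_text: str) -> str:
    """Return job_text with restart_from_ckpt set to yes."""
    new_text, count = _RESTART_RE.subn(lambda m: m.group(1) + "yes", job_text)
    if count == 0:
        # Fail loudly rather than silently returning an unchanged file.
        raise ValueError("restart_from_ckpt keyword not found")
    return new_text


if __name__ == "__main__":
    sample = "#@ restart = no\n#@ restart_from_ckpt = no\n#@ queue\n"
    print(enable_restart(sample), end="")
```

Only the restart_from_ckpt line is changed; the separate restart keyword is left as it was.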

Note: Checkpoint files for parallel jobs are named differently from those for serial programs. An N-process job produces N+1 checkpoint files: one for the master task and one for each of the N tasks. The checkpoint file for the master task is named basename.[tag], where basename is the value of ckpt_file (xhpl-ll in the example above) and [tag] is the index of the checkpoint file (0, 1, 2, ...). The checkpoint files for the tasks are named basename.taskid.[tag]. The example above runs four tasks (2 nodes with 2 tasks per node), so the first checkpoint produces the following five files: xhpl-ll.0, xhpl-ll.0.0, xhpl-ll.1.0, xhpl-ll.2.0, and xhpl-ll.3.0. Old checkpoint files are overwritten by the new ones.
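To check that a complete set of checkpoint files is present before restarting, the naming scheme in the note can be expressed as a short Python sketch. The function name checkpoint_files is hypothetical, and the sketch assumes task IDs run from 0 to N-1, as in the five-file example above.

```python
from typing import List


def checkpoint_files(basename: str, ntasks: int, tag: int) -> List[str]:
    """List the N+1 checkpoint file names expected for one checkpoint.

    basename: value of the ckpt_file keyword (e.g. "xhpl-ll")
    ntasks:   number of tasks in the job (N)
    tag:      index of the checkpoint file (0, 1, 2, ...)
    """
    # One file for the master task: basename.[tag]
    master = f"{basename}.{tag}"
    # One file per task: basename.taskid.[tag], task IDs assumed 0..N-1
    tasks = [f"{basename}.{taskid}.{tag}" for taskid in range(ntasks)]
    return [master] + tasks


if __name__ == "__main__":
    # 2 nodes x 2 tasks per node = 4 tasks, first checkpoint (tag 0)
    for name in checkpoint_files("xhpl-ll", 4, 0):
        print(name)
```

Joining each name to the ckpt_dir path and testing for existence would give a simple completeness check.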

This is document aqpb in domain all.
Last modified on August 03, 2005.
