ARCHIVED: On the Research SP, how do I checkpoint a parallel job?
Not every parallel job can be checkpointed; for information about the limitations of parallel checkpointing on the Research SP, see:
https://sp-www.iu.edu/ParEnv/am106mst24.html
You will need to log in with your Indiana University Network ID username and password to access the URL above.
Also see the Knowledge Base document ARCHIVED: On the Research SP, how do I checkpoint a serial job? Much of the information there is also valid for parallel checkpointing.
Changes to the job command file are mostly the same as in the serial
checkpointing case. The following is the job command file that starts
the initial execution of a parallel program, xhpl, by
user jdoe:
For information about the keywords not related to checkpointing, see:
https://sp-www.iu.edu/LoadL/am2ugmst02.html
https://sp-www.iu.edu/SP.jobs.shtml
You will need to enter your IU Network ID username and password to access the URLs above.
As in serial checkpointing, to restart the job from a checkpoint
file after it aborts, change the value of the restart_from_ckpt
keyword in the job command file from no to yes, then resubmit the job.
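The edit described above can also be scripted. The following is a minimal sketch, not part of the original instructions: it assumes the keyword appears in the usual LoadLeveler directive form ("# @ keyword = value"), and the function name enable_restart is hypothetical.

```python
import re

# Assumption: the keyword is written as a LoadLeveler directive line,
# e.g. "# @ restart_from_ckpt = no". Only spaces and tabs are allowed
# around the tokens so that a match never spans more than one line.
_RESTART_RE = re.compile(
    r"^([ \t]*#[ \t]*@[ \t]*restart_from_ckpt[ \t]*=[ \t]*)no[ \t]*$",
    re.IGNORECASE | re.MULTILINE,
)


def enable_restart(job_text: str) -> str:
    """Return the job command file text with restart_from_ckpt set to yes.

    Raises ValueError if no "restart_from_ckpt = no" line is present, so a
    file that is already set to yes (or lacks the keyword) is not silently
    passed through.
    """
    new_text, count = _RESTART_RE.subn(r"\g<1>yes", job_text)
    if count == 0:
        raise ValueError("no 'restart_from_ckpt = no' line found")
    return new_text
```

To use it, read the job command file into a string, pass it to enable_restart, write the result back, and resubmit the job. All other lines in the file are left unchanged.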
Note: Checkpoint files in parallel checkpointing are
named differently from those for serial programs. An N-process
job produces N+1 checkpoint files: one for the master task and
one for each of the N tasks. For the example above, the checkpoint
file for the master task is named basename.[tag], where
basename is xhpl-ll, and [tag]
denotes the index of the checkpoint (0,1,2...). The checkpoint
files for the tasks are named basename.taskid.[tag]. For
the example above, a four-task job, the first checkpoint consists
of the following five files: xhpl-ll.0, xhpl-ll.0.0,
xhpl-ll.1.0, xhpl-ll.2.0, and
xhpl-ll.3.0. Old checkpoint files are overwritten by the
new ones.
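The naming scheme in the note can be expressed as a short sketch, which may help when checking that a complete set of checkpoint files exists before a restart. The function name checkpoint_files is hypothetical, and the sketch assumes task IDs run from 0 to N-1, as the file names listed in the note suggest.

```python
from typing import List


def checkpoint_files(basename: str, num_tasks: int, tag: int) -> List[str]:
    """List the N+1 checkpoint file names for one checkpoint.

    basename  -- checkpoint file base name, e.g. "xhpl-ll"
    num_tasks -- number of tasks (N) in the parallel job
    tag       -- index of the checkpoint (0, 1, 2, ...)
    """
    if num_tasks < 1:
        raise ValueError("num_tasks must be at least 1")
    if tag < 0:
        raise ValueError("tag must be non-negative")
    # Master task file: basename.[tag]
    names = [f"{basename}.{tag}"]
    # Per-task files: basename.taskid.[tag]
    # Assumption: task IDs are numbered 0 through N-1.
    names.extend(f"{basename}.{task_id}.{tag}" for task_id in range(num_tasks))
    return names


if __name__ == "__main__":
    # First checkpoint (tag 0) of a four-task job.
    for name in checkpoint_files("xhpl-ll", 4, 0):
        print(name)
```

For a four-task job and tag 0, the function returns the same five names given in the note: xhpl-ll.0, xhpl-ll.0.0, xhpl-ll.1.0, xhpl-ll.2.0, and xhpl-ll.3.0.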
Last modified on August 03, 2005.







