ARCHIVED: On FutureGrid, how do I submit a job to the Cray XT5m (Xray)?
The XT5m is a 2D mesh of nodes. Each node has two sockets, and each socket has four cores.
The batch scheduler interfaces with the Cray resource scheduler, ALPS. When you submit a job, the batch scheduler talks to ALPS to determine what resources are available, and then ALPS makes the reservation.
Currently, ALPS is a "gang scheduler", and only allows one job per node. If a user submits a job in the format aprun -n 1 a.out, ALPS will put that job on one core of one node, and leave the other seven cores empty. When the next job comes in, either from the same user or a different one, it will schedule that job to the next node.
If the user submits a job with aprun -n 10 a.out, the scheduler will put the first eight tasks on the first node, and the next two tasks on the second node, again leaving six empty cores on the second node. The user can modify the placement with -N, -S, and -cc.
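The default placement arithmetic described above can be sketched in a few lines of shell. This is only an illustration, assuming eight cores per node (two sockets of four cores) and no -N, -S, or -cc options; the function names nodes_needed and idle_cores are hypothetical helpers, not part of aprun or ALPS.

```shell
#!/bin/sh
# Cores per XT5m node: two sockets x four cores, as described in the text.
CORES_PER_NODE=8

# Print how many nodes a default "aprun -n N" placement occupies
# (ceiling of N / CORES_PER_NODE).
nodes_needed() {
    echo $(( ($1 + CORES_PER_NODE - 1) / CORES_PER_NODE ))
}

# Print how many cores stay empty on the nodes that the job occupies.
idle_cores() {
    nodes=$(nodes_needed "$1")
    echo $(( nodes * CORES_PER_NODE - $1 ))
}

echo "aprun -n 10: $(nodes_needed 10) nodes, $(idle_cores 10) idle cores"
```

For -n 10 this reports two nodes and six idle cores, matching the example above; for -n 1 it reports one node and seven idle cores.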
A user might also run a single job with multiple threads, as with OpenMP. If a user runs this job with aprun -n 1 -d 8 a.out, the job will be scheduled to one node, and have eight threads running, one on each core.
You can run multiple, different binaries at the same time on the same node, but only from one submission. Submitting a script like this will not work:
OMP_NUM_THREADS=1 aprun -n 1 -d 1 -cc 0 ./my-binary OMP_NUM_THREADS=1 aprun -n 1 -d 1 -cc 1 ./my-binary OMP_NUM_THREADS=1 aprun -n 1 -d 1 -cc 2 ./my-binary OMP_NUM_THREADS=1 aprun -n 1 -d 1 -cc 3 ./my-binary OMP_NUM_THREADS=1 aprun -n 1 -d 1 -cc 4 ./my-binary OMP_NUM_THREADS=1 aprun -n 1 -d 1 -cc 5 ./my-binary OMP_NUM_THREADS=1 aprun -n 1 -d 1 -cc 6 ./my-binary OMP_NUM_THREADS=1 aprun -n 1 -d 1 -cc 7 ./my-binary
This will run a job on each core, but not at the same time. To run all jobs at the same time, you need to first bury all the binaries under one aprun command:
$ more run.sh
./my-binary1
./my-binary2
./my-binary3
./my-binary4
./my-binary5
./my-binary6
./my-binary7
./my-binary8
$ aprun -n 1 run.sh
Alternatively, use the command aprun -n 1 -d 8 run.sh.
To run multiple serial jobs, you must build a batch script to divide the number of jobs into groups of eight, and then submit each group to a different node with an aprun command.
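One way to do the grouping is sketched below. It is a dry-run helper, not a documented FutureGrid tool: the function name make_groups, the jobs-file layout (one command per line), and the group1.sh, group2.sh, ... file names are all hypothetical. The trailing & on each command and the final wait are also an assumption about how to make the commands in a group run concurrently; the run.sh listing above simply lists the binaries one per line. The helper only writes the wrapper scripts and prints the aprun lines, so you can review them before adding them to a batch script.

```shell
#!/bin/sh
# Group size: one serial job per core on an eight-core XT5m node.
GROUP_SIZE=8

# make_groups JOBS_FILE OUT_DIR
#   JOBS_FILE: text file with one serial command per line (hypothetical layout)
#   OUT_DIR:   directory that receives group1.sh, group2.sh, ...
# Prints one aprun line per group; does not run aprun itself.
make_groups() {
    jobs_file=$1
    out_dir=$2
    mkdir -p "$out_dir"
    count=0
    group=0
    while IFS= read -r cmd; do
        # Skip blank lines in the jobs file.
        [ -n "$cmd" ] || continue
        # Start a new wrapper script every GROUP_SIZE commands.
        if [ $(( count % GROUP_SIZE )) -eq 0 ]; then
            group=$(( group + 1 ))
            script="$out_dir/group$group.sh"
            printf '#!/bin/sh\n' > "$script"
            chmod +x "$script"
            echo "aprun -n 1 -d $GROUP_SIZE $script"
        fi
        # Assumption: background each command so the group runs concurrently.
        printf '%s &\n' "$cmd" >> "$script"
        count=$(( count + 1 ))
    done < "$jobs_file"
    # Make each wrapper wait for its background commands before exiting.
    i=1
    while [ "$i" -le "$group" ]; do
        echo "wait" >> "$out_dir/group$i.sh"
        i=$(( i + 1 ))
    done
}

# Example (hypothetical file names): make_groups jobs.txt groups
```

With ten commands in the jobs file, the helper writes two wrapper scripts (eight commands, then two) and prints two aprun lines, one per node.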
This is document azse in the Knowledge Base.
Last modified on 2018-01-18 16:18:00.