ARCHIVED: On FutureGrid, how do I submit a job to the Cray XT5m (Xray)?

This content has been archived, and is no longer maintained by Indiana University. Information here may no longer be accurate, and links may no longer be available or reliable.

The XT5m is a 2D mesh of nodes. Each node has two sockets, and each socket has four cores, for a total of eight cores per node.

The batch scheduler interfaces with the Cray resource scheduler, ALPS. When you submit a job, the batch scheduler asks ALPS which resources are available, and ALPS then makes the reservation.

Currently, ALPS is a "gang scheduler", and only allows one job per node. If a user submits a job in the format aprun -n 1 a.out, ALPS will put that job on one core of one node, and leave the other seven cores empty. When the next job comes in, either from the same user or a different one, it will schedule that job to the next node.

If the user submits a job with aprun -n 10 a.out, the scheduler will put the first eight tasks on the first node, and the remaining two tasks on the second node, leaving six empty cores on the second node. The user can modify the placement with the -N (tasks per node), -S (tasks per socket), and -cc (CPU binding) options.
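The arithmetic behind this default placement can be sketched in a few lines of shell. This is only an illustration: the function names nodes_needed and idle_cores are hypothetical helpers, not part of aprun or ALPS, and the sketch assumes the default behavior described above, in which each node is filled with eight tasks before the next node is used.

```shell
#!/bin/sh
# Cores per node on the XT5m as described above: 2 sockets x 4 cores.
CORES_PER_NODE=8

# Print the number of nodes a job with $1 tasks occupies when each
# node is filled before the next one is used (ceiling division).
nodes_needed() {
  echo $(( ($1 + CORES_PER_NODE - 1) / CORES_PER_NODE ))
}

# Print the number of cores left idle on the nodes allocated to a
# job with $1 tasks.
idle_cores() {
  nodes=$(nodes_needed "$1")
  echo $(( nodes * CORES_PER_NODE - $1 ))
}

echo "aprun -n 10: $(nodes_needed 10) nodes, $(idle_cores 10) idle cores"
echo "aprun -n 1: $(nodes_needed 1) nodes, $(idle_cores 1) idle cores"
```

Because only one job runs per node, the idle cores reported here cannot be used by another job until the first one finishes.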

A user might also run a single job with multiple threads, as with OpenMP. If a user runs such a job with aprun -n 1 -d 8 a.out, the job will be scheduled to one node and have eight threads running, one on each core.
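For a threaded job, the thread count given to OpenMP and the -d value given to aprun normally need to agree. The following sketch builds such a command line without executing it, so it can be tried on any machine. The helper name omp_aprun_cmd is hypothetical, and the sketch assumes the program reads its thread count from OMP_NUM_THREADS.

```shell
#!/bin/sh
# Build (but do not execute) an aprun command line for one task that
# runs $1 OpenMP threads of the binary $2. The same value is used for
# OMP_NUM_THREADS and for -d so that each thread gets its own core.
omp_aprun_cmd() {
  threads=$1
  binary=$2
  echo "OMP_NUM_THREADS=$threads aprun -n 1 -d $threads $binary"
}

omp_aprun_cmd 8 ./a.out
```

On Xray, the printed line could be placed in the batch script in place of a plain aprun -n 1 -d 8 a.out.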

You can run multiple, different binaries at the same time on the same node, but only from one submission. Submitting a script like this will not work:

  OMP_NUM_THREADS=1 aprun -n 1 -d 1 -cc 0 ./my-binary
  OMP_NUM_THREADS=1 aprun -n 1 -d 1 -cc 1 ./my-binary
  OMP_NUM_THREADS=1 aprun -n 1 -d 1 -cc 2 ./my-binary
  OMP_NUM_THREADS=1 aprun -n 1 -d 1 -cc 3 ./my-binary
  OMP_NUM_THREADS=1 aprun -n 1 -d 1 -cc 4 ./my-binary
  OMP_NUM_THREADS=1 aprun -n 1 -d 1 -cc 5 ./my-binary
  OMP_NUM_THREADS=1 aprun -n 1 -d 1 -cc 6 ./my-binary
  OMP_NUM_THREADS=1 aprun -n 1 -d 1 -cc 7 ./my-binary

This will run a job on each core, but one after another rather than at the same time. To run all jobs at the same time, first wrap all the binaries in a single script, and then launch that script with one aprun command:

  $ more run.sh
  ./my-binary1
  ./my-binary2
  ./my-binary3
  ./my-binary4
  ./my-binary5
  ./my-binary6
  ./my-binary7
  ./my-binary8

  $ aprun -n 1 run.sh

Alternatively, use the command aprun -n 1 -d 8 run.sh.
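Note that a shell script runs its lines one after another unless each command is started in the background. A minimal sketch of a wrapper that starts every binary in the background and then waits for all of them is shown below. The function name run_all is hypothetical, and whether the processes end up on separate cores depends on how ALPS binds the task to CPUs, which this sketch does not control.

```shell
#!/bin/sh
# Hypothetical variant of run.sh: start every command given as an
# argument in the background, then wait so that the script (and the
# aprun that launched it) does not exit until all of them finish.
run_all() {
  for cmd in "$@"; do
    "$cmd" &
  done
  wait
}

# On Xray this would be something like:
#   run_all ./my-binary1 ./my-binary2 ... ./my-binary8
# Here two standard utilities stand in for the binaries.
run_all true true
echo "all background commands finished"
```

The wait at the end matters: without it, the script would return as soon as the last binary had been started, and aprun could end the job while the binaries were still running.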

To run multiple serial jobs, you must build a batch script to divide the number of jobs into groups of eight, and then submit each group to a different node with an aprun command.
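As a rough sketch of that grouping, the following shell code works out how many groups of eight are needed and prints one aprun line per group. Everything beyond the group size is an assumption made for illustration: the script names group-1.sh, group-2.sh, and so on are hypothetical, each is assumed to start its eight jobs and wait for them, and the trailing & and final wait are one possible way for a batch script to keep the groups running concurrently.

```shell
#!/bin/sh
# Jobs per group: one group fills the eight cores of one node.
GROUP_SIZE=8

# Print the number of groups needed for $1 jobs (ceiling division).
group_count() {
  echo $(( ($1 + GROUP_SIZE - 1) / GROUP_SIZE ))
}

# Print the 1-based group number that job number $1 (1-based) falls in.
group_of() {
  echo $(( ($1 - 1) / GROUP_SIZE + 1 ))
}

# Print one aprun line per group for $1 jobs, followed by "wait".
# group-N.sh is a hypothetical per-group wrapper script.
emit_aprun_lines() {
  total=$(group_count "$1")
  g=1
  while [ "$g" -le "$total" ]; do
    echo "aprun -n 1 -d $GROUP_SIZE ./group-$g.sh &"
    g=$(( g + 1 ))
  done
  echo "wait"
}

emit_aprun_lines 20
```

For 20 jobs this prints three aprun lines; the third group holds only four jobs, so four cores on that node stay idle.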

This is document azse in the Knowledge Base.
Last modified on 2018-01-18 16:18:00.