Run GPU-accelerated jobs on Carbonate or Big Red 200 at IU
On this page:
- Partitions with GPU-accelerated nodes
- System access
- Deep learning tools and GPU-capable applications
- Run GPU-accelerated applications
- Partition (queue) information
- Get help
Partitions with GPU-accelerated nodes
To facilitate the support of deep learning and GPU-accelerated applications, Carbonate and Big Red 200 provide partitions for running jobs on GPU-accelerated nodes.
- Carbonate GPU partition: Carbonate's GPU partition consists of 24 GPU-accelerated Apollo 6500 nodes. Each node is equipped with two Intel 6248 2.5 GHz 20-core CPUs, four NVIDIA Tesla V100 PCIe 32 GB GPUs, one 1.92 TB solid-state drive, and 768 GB of RAM.
All nodes in the GPU partition are housed in the IU Bloomington Data Center and run Red Hat Enterprise Linux 7.x. The GPU nodes are connected to the IU Science DMZ via 40-gigabit Ethernet.
- Big Red 200: Big Red 200's GPU partition consists of 64 GPU-accelerated nodes, each with 256 GB of memory, a single 64-core, 2.0 GHz, 225-watt AMD EPYC 7713 processor, and four NVIDIA Tesla A100 GPUs.
Big Red 200 is managed with HPE's Performance Cluster Manager (HPCM) and runs SUSE Enterprise Linux Server (SLES) version 15 on the compute, GPU, and login nodes.
| | Carbonate GPU partition | Big Red 200 GPU partition |
| --- | --- | --- |
| Architecture | Apollo 6500 | Cray HPE EX |
| Nodes | 24 | 64 |
| GPUs/node | 4 V100 GPUs | 4 A100 GPUs |
| Cores/node | 40 | 64 |
| Memory/node | 768 GB | 256 GB |
The Indiana University research supercomputers use the Slurm workload manager for resource management and job scheduling; see Use Slurm to submit and manage jobs on IU's research computing systems.
The Slate, Slate-Project, and Slate-Scratch high performance file systems are mounted for persistent storage of research data.
System access
To set up access to run GPU jobs on Carbonate or Big Red 200, IU faculty, staff, and graduate students can use RT Projects to create projects, request allocations, and add users (research collaborators, lab members, and/or students) who should be permitted to use their allocations.
For more about RT Projects, see Use RT Projects to request and manage access to specialized Research Technologies resources.
- For enhanced security, SSH connections that have been idle for 60 minutes will be disconnected. To protect your data from misuse, remember to log off or lock your computer whenever you leave it.
- User processes on the login nodes are limited to 20 minutes of CPU time. Processes on the login nodes that run longer than 20 minutes are terminated automatically (without warning).
- The scheduled monthly maintenance window for IU's research supercomputers is the second Sunday of each month, 7am-7pm.
Deep learning tools and GPU-capable applications
Carbonate uses the Modules module management system; Big Red 200 uses Lmod.
For more, see Use modules to manage your software environment on IU research supercomputers.
Popular deep learning tools (including TensorFlow, scikit-learn, NumPy, SciPy, NLTK, Torch, and MXNet) are bundled together in python/gpu modules available on Carbonate and Big Red 200.
- To list available versions, on the command line, enter:
module avail python/gpu
- To see which packages are included in a version of the python/gpu module, on the command line, enter (replace x.x.x with the module's version number):
module show python/gpu/x.x.x
Alternatively, load the module and then list the installed Python packages with:
pip freeze
- To add a python/gpu module to your user environment, on the command line, enter (replace x.x.x with the module's version number):
module load python/gpu/x.x.x
Individual modules for GPU-capable applications also are available on Carbonate and Big Red 200. Modules for GPU-capable applications include gpu in the module name.
For a list of all packages available on Carbonate or Big Red 200, see HPC Applications.
Carbonate and Big Red 200 users are free to install software in their home directories and may request the installation of software for use by all users on the system. Only faculty or staff can request software. If students require special software packages, their advisors must request them. For details, see Software requests in Policies regarding UITS research systems.
Run GPU-accelerated applications
To run a GPU-accelerated application on the GPU partition on Carbonate or Big Red 200:
- Specify the GPU partition by including the -p gpu flag either as an SBATCH directive in your batch job script or as an option in your srun command.
- Specify how many GPUs per node (up to 4) should be allocated to your job by including the --gpus-per-node flag either as an SBATCH directive in your batch job script or as an option in your srun command.
- Specify your allocation's Slurm Account Name by including the -A (or --account) flag either as an SBATCH directive in your batch job script or as an option in your srun command. Users belonging to projects approved through RT Projects can find their allocation's Slurm Account Name on the "Home" page in RT Projects; look under "Submitting Slurm Jobs with your Project's Account". Alternatively, on the "Home" page, under "Allocations", select an allocation and look in the table under "Allocation Attributes".
Submit an interactive job
To request resources for an interactive job, use the srun command with the --pty option. For example:
- To launch a Bash session that uses one V100 GPU on a node in Carbonate's GPU partition, on the command line, enter (replace slurm-account-name with your allocation's Slurm Account Name):
srun -p gpu -A slurm-account-name --gpus-per-node v100:1 --pty bash
- To launch a Bash session that uses one A100 GPU on a node in Big Red 200's GPU partition, on the command line, enter:
srun -p gpu -A slurm-account-name --gpus-per-node 1 --pty bash
When the requested resources are allocated to your job, you will be placed at the command prompt on one of the nodes in the partition you specified. When you are finished with your interactive session, enter exit on the command line to free the allocated resources.
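Once inside the session, you can check which GPUs were assigned to your job; Slurm typically sets the CUDA_VISIBLE_DEVICES environment variable to the indices of the allocated GPUs. A minimal sketch (the variable's value below is a stand-in for illustration; on a real GPU node Slurm sets it automatically, and nvidia-smi shows full device details):

```shell
# Inside the interactive session, Slurm typically exports
# CUDA_VISIBLE_DEVICES as a comma-separated list of the GPU indices
# allocated to the job. The value below is a stand-in for illustration;
# on a real node it is set by Slurm.
CUDA_VISIBLE_DEVICES="0"

# Count the allocated GPUs by splitting the list on commas.
ngpus=$(echo "$CUDA_VISIBLE_DEVICES" | awk -F',' '{print NF}')
echo "GPUs allocated: $ngpus"

# On a GPU node, nvidia-smi reports model, memory, and utilization:
#   nvidia-smi
```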
For complete documentation about the srun command, see the srun manual page (on the web, see srun; on Carbonate or Big Red 200, enter man srun).
Submit a batch job
To run a batch job, prepare a Slurm job script (for example, job.sh) that includes SBATCH directives specifying the required resources, and then use the sbatch command to submit it. If the command exits successfully, it returns a job ID; for example:
sgerrera@login2:~> sbatch job.sh
sbatch: Submitted batch job 99999999
sgerrera@login2:~>
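If you want to reuse the job ID later (for example, to monitor or cancel the job), you can parse it from sbatch's confirmation message; sbatch also supports a --parsable option that prints only the ID. A sketch, assuming the one-line confirmation shown above (the stand-in string replaces a real sbatch call):

```shell
# On the cluster you would capture real output:
#   submit_msg=$(sbatch job.sh)
# Here a stand-in string mimics the confirmation shown above.
submit_msg="sbatch: Submitted batch job 99999999"

# The job ID is the last whitespace-separated field.
job_id=$(echo "$submit_msg" | awk '{print $NF}')
echo "Job ID: $job_id"

# The ID can then drive follow-up commands:
#   squeue -j "$job_id"    # check job status
#   scancel "$job_id"      # cancel the job
```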
If your job has resource requirements that are different from the defaults (but not exceeding the maximums allowed), specify them with SBATCH directives in your job script. Also, if you need help determining how much memory your job is using, add the following SBATCH directives to your job script (replace username@iu.edu
with your IU email address):
#SBATCH --mail-user=username@iu.edu
#SBATCH --mail-type=ALL
When the job completes, Slurm will email the specified address with a summary of the job's resource utilization.
For example, a job script for running a batch job on Carbonate's GPU partition may look similar to the following:
#!/bin/bash
#SBATCH -J job_name
#SBATCH -p gpu
#SBATCH -A slurm-account-name
#SBATCH -o filename_%j.txt
#SBATCH -e filename_%j.err
#SBATCH --mail-type=ALL
#SBATCH --mail-user=username@iu.edu
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=1
#SBATCH --time=02:00:00
#Load any modules that your program needs
module load modulename
#Run your program
srun ./my_program my_program_arguments
In the example script above:
- The first line indicates that the script should be read using the Bash command interpreter.
- The next lines are #SBATCH directives used to pass options to the sbatch command:
  - -J job_name specifies a name for the job allocation. The specified name will appear along with the job ID number when you query running jobs on the system.
  - -p gpu specifies that the job should run in the GPU partition.
  - -A slurm-account-name indicates the Slurm Account Name to which resources used by this job should be charged.
  - -o filename_%j.txt and -e filename_%j.err instruct Slurm to redirect the job's standard output and standard error, respectively, to the file names specified (Slurm automatically replaces %j with the job ID).
  - --mail-user=username@iu.edu indicates the email address to which Slurm will send job-related mail.
  - --mail-type=<type> directs Slurm to send job-related email when an event of the specified type(s) occurs; valid type values include all, begin, end, and fail.
  - --gpus-per-node=1 requests that one GPU be allocated to this job.
  - --time=02:00:00 sets a maximum wall time of two hours; Slurm terminates the job if it is still running when the limit expires.
- At the bottom are the two executable lines that the job will run. In this case, the module command is used to load a module (modulename), and then srun is used to execute the application with the arguments specified. In your script, replace my_program and my_program_arguments with your program's name and any necessary arguments, respectively.
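Because of the %j placeholder in the -o and -e directives, each run writes to files named after its job ID, so repeated submissions do not overwrite one another's output. The substitution can be sketched as follows (the sed call only simulates what Slurm does internally, using a hypothetical job ID):

```shell
# Slurm replaces %j in the -o/-e file-name patterns with the job ID.
# Simulate that substitution for a hypothetical job ID:
job_id=99999999
out_file=$(echo "filename_%j.txt" | sed "s/%j/$job_id/")
err_file=$(echo "filename_%j.err" | sed "s/%j/$job_id/")
echo "$out_file $err_file"
```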
For more, see Use Slurm to submit and manage jobs on IU's research computing systems.
Partition (queue) information
To view current information about a partition, use the sinfo
command. For example, on Big Red 200, to view information about the GPU partition, on the command line, enter:
sinfo -p gpu
You should see output similar to the following:
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
gpu          up 2-00:00:00      2 drain* nid[0672,0696]
gpu          up 2-00:00:00     12  down* nid[0641,0653,0658,0679,0688-0689,0692-0693,0697-0699,0702]
gpu          up 2-00:00:00      8    mix nid[0642,0651,0673-0674,0676-0677,0700-0701]
gpu          up 2-00:00:00     40  alloc nid[0643-0650,0652,0654-0657,0659-0671,0675,0678,0680-0687,0690-0691,0694-0695]
In the above sample output:
- The PARTITION column shows the partition name.
- The AVAIL column shows the status of the partition.
- The TIMELIMIT column shows the maximum wall time that users can request.
- The NODES column shows the number of nodes in each state.
- The STATE column shows the current state of those nodes (for example, alloc, mix, drain, or down).
- The NODELIST column shows the nodes that are in each state.
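The NODES and STATE columns can also be aggregated programmatically, for example to total how many nodes are in each state. A sketch using awk over the default sinfo output format (the embedded sample stands in for a live sinfo -p gpu call):

```shell
# Sum the NODES column (field 4) grouped by STATE (field 5), skipping
# the header row. On the cluster, pipe live output instead:
#   sinfo -p gpu | awk 'NR > 1 { n[$5] += $4 } END { for (s in n) print s, n[s] }'
sinfo_output='PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
gpu up 2-00:00:00 2 drain* nid[0672,0696]
gpu up 2-00:00:00 12 down* nid[0641,0653]
gpu up 2-00:00:00 8 mix nid[0642,0651]
gpu up 2-00:00:00 40 alloc nid[0643-0650]'

echo "$sinfo_output" | awk 'NR > 1 { n[$5] += $4 } END { for (s in n) print s, n[s] }'
```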
Get help
Support for IU research supercomputers, software, and services is provided by various teams within the Research Technologies division of UITS.
- If you have a technical issue or system-specific question, contact the High Performance Systems (HPS) team.
- If you have a programming question about compilers, scientific/numerical libraries, or debuggers, contact the UITS Research Applications and Deep Learning team.
For general questions about research computing at IU, contact UITS Research Technologies.
For more options, see Research computing support at IU.
This is document avjk in the Knowledge Base.
Last modified on 2023-08-11 16:30:49.