About the deep learning nodes on Carbonate at IU

System overview

To support deep learning applications and research, Indiana University's Carbonate cluster has been expanded to include 12 GPU-accelerated Lenovo ThinkSystem SD530 compute nodes. Each of these deep learning (DL) nodes is equipped with two Intel Xeon Gold 6126 12-core CPUs, two NVIDIA GPU accelerators (eight nodes have Tesla P100s; four have Tesla V100s), four 1.92 TB solid-state drives, and 192 GB of RAM. All DL nodes are housed in the IU Bloomington Data Center, run Red Hat Enterprise Linux 7.x, and are connected to the IU Science DMZ via 10-gigabit Ethernet.

Carbonate's DL nodes use the Slurm Workload Manager to coordinate resource management and job scheduling. The Data Capacitor II (DC2), DC-WAN2, Slate, and Slate-Project high-performance file systems are mounted for temporary storage of research data.

System access

The Carbonate DL nodes are intended for deep learning workloads. IU students, faculty, and staff with such workloads can request accounts on the Carbonate DL nodes by following these steps:

  1. If you don't already have an account on Carbonate, request one using the instructions in Get additional IU computing accounts.
  2. Fill out and submit the Access to Deep Learning Resource on Carbonate request form. Include a description of your deep learning application(s), and your hardware and software requirements.
Notes:
  • For enhanced security, SSH connections that have been idle for 60 minutes will be disconnected. To protect your data from misuse, remember to log off or lock your computer whenever you leave it.
  • The scheduled monthly maintenance window for IU's high-performance computing systems is the second Sunday of each month, 7am-7pm.

Deep learning software

For a list of packages available on Carbonate, see HPC Applications.

Carbonate has two deep learning modules available:

  • deeplearning/1.13.1 (based on Python 3)
  • deeplearning/python2.7/1.13.1 (based on Python 2)

Both modules contain frequently used deep learning tools including scikit-learn, NumPy, SciPy, NLTK, Torch, Caffe2, and MXNet. The deeplearning/1.13.1 module also includes TensorFlow and Keras.
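
For example, to work with the Python 3 stack, you would typically load the module and verify that a package imports. The following is a minimal sketch; it assumes the loaded module places its Python interpreter first on your PATH:

module load deeplearning/1.13.1
python -c "import tensorflow; print(tensorflow.__version__)"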

Note:
Carbonate users are free to install software in their home directories and may request the installation of software for use by all users on the system. Only faculty or staff can request software. If students require software packages on Carbonate, their advisors must request them. For details, see IU policies relative to installing software on Carbonate. To request software, use the HPC Software Request form.
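
If you do install packages yourself, one common approach is a user-level Python installation into your home directory. The following is a sketch only; the package name is a placeholder, and it assumes the loaded deeplearning module provides pip:

module load deeplearning/1.13.1
pip install --user some-package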

Run jobs on Carbonate's DL nodes

Note:
User processes on the login nodes are limited to 20 minutes of CPU time. Processes on the login nodes that run longer than 20 minutes are terminated automatically (without warning).

Carbonate's DL nodes use the Slurm workload manager to manage and schedule interactive sessions and batch jobs. To use the Carbonate DL nodes, you must:

  • Specify the dl or dl-debug partition by including the -p flag either as an SBATCH directive in your batch job script or as an option in your srun command.
  • Indicate the type of GPU (p100 or v100) and the number of GPUs (1 or 2) that should be allocated to your job by including the --gres flag either as an SBATCH directive in your batch job script or as an option in your srun command.

    If you omit the --gres flag when requesting resources in the dl or dl-debug partition, srun and sbatch will return an error:

    Must specify a gpu resource:
    Resubmit with --gres=gpu:type:count, where type is p100 or v100.
    Batch job submission failed: Unspecified error
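
    A submission that satisfies both requirements might look like the following (job.sh is a placeholder for your own batch script):

    sbatch -p dl --gres=gpu:v100:1 job.sh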
    

Submit an interactive job for GPU debugging

To perform GPU debugging, use the srun command to submit an interactive job to the dl-debug or dl partition. For example, to request 12 cores and one P100 GPU for four hours of wall time in the dl-debug partition, on the command line, enter:

srun -p dl-debug -N 1 --ntasks-per-node=12 --gres=gpu:p100:1 --time=4:00:00 --pty bash

For longer debugging sessions, use srun to submit an interactive job to the dl partition; for example, to request 12 cores and one P100 GPU for 12 hours, on the command line, enter:

srun -p dl -N 1 --ntasks-per-node=12 --gres=gpu:p100:1 --time=12:00:00 --pty bash
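
Once the interactive shell starts on a DL node, you can load a deep learning module and confirm that the allocated GPU is visible. This is a minimal sketch; it assumes the NVIDIA driver utilities are on the node's PATH:

module load deeplearning/1.13.1
nvidia-smi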

Submit a batch job

To run a batch job on the DL nodes, first prepare a Slurm job script (for example, job.sh) that includes the #SBATCH -p dl directive (for routing your job to the dl partition), and then use the sbatch command to submit it. If the command exits successfully, it will return a job ID; for example:

[sgerrera@h1]$ sbatch job.sh
sbatch: Submitted batch job 99999999
[sgerrera@h1]$

If your job has resource requirements that are different from the defaults (but not exceeding the maximums allowed), specify them with SBATCH directives in your job script. Also, if you need help determining how much memory your job is using, add the following SBATCH directives to your job script (replace username@iu.edu with your IU email address):

#SBATCH --mail-user=username@iu.edu
#SBATCH --mail-type=ALL

When the job completes, Slurm will email the specified address with a summary of the job's resource utilization.

For example:

#!/bin/bash 
 
#SBATCH -J hybrid_job
#SBATCH -p dl
#SBATCH -o hybrid_%j.txt
#SBATCH -e hybrid_%j.err
#SBATCH --mail-type=ALL
#SBATCH --mail-user=username@iu.edu
#SBATCH --gres=gpu:v100:2
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=12
#SBATCH --time=02:00:00

In the above example:

  • -J hybrid_job specifies a name for the job allocation. The specified name will appear along with the job ID number when you query running jobs on the system.
  • -p dl specifies that the job should run in the dl partition.
  • -o hybrid_%j.txt and -e hybrid_%j.err instruct Slurm to connect the job's standard output and standard error, respectively, to the specified files, where %j is automatically replaced by the job ID.
  • --mail-user=username@iu.edu indicates the email address to which Slurm will send notifications of changes to the job's status.
  • --gres=gpu:v100:2 requests that two V100 GPUs be allocated to the job.
  • --nodes=2 requests that a minimum of two nodes be allocated to this job.
  • --ntasks-per-node=12 specifies that 12 tasks should be launched per node.
  • --time=02:00:00 requests a maximum wall time of two hours for the job.
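
The directives above only request resources; in a complete script they are followed by the commands the job should run. For example, a job might load a deep learning module and launch its program with srun (a sketch only; train.py is a hypothetical placeholder for your own program):

module load deeplearning/1.13.1
srun python train.py

Because srun launches one copy of the command per allocated task, a script requesting two nodes with 12 tasks each would start 24 copies of train.py; adjust the launch line to match how your program distributes work.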

Useful sbatch options include:

  • --begin=YYYY-MM-DDTHH:MM:SS
    Defer allocation of your job until the specified date and time, after which the job is eligible to execute. For example, to defer allocation of your job until 10:30pm on October 31, 2019, use --begin=2019-10-31T22:30:00.
  • --no-requeue
    Specify that the job is not rerunnable. Setting this option prevents the job from being requeued after it has been interrupted, for example, by a scheduled downtime or preemption by a higher-priority job.
  • --export=ALL
    Export all environment variables in the sbatch command's environment to the batch job.
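
For example, a deferred submission that also includes the required partition and GPU flags might look like the following (job.sh is a placeholder for your own script):

sbatch --begin=2019-10-31T22:30:00 -p dl --gres=gpu:p100:1 job.sh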

For more, see the sbatch manual page.

Monitor jobs

To monitor the status of a queued or running job, use the Slurm squeue command. Useful squeue options include:

  • -a
    Display information for all jobs.
  • -j <jobid>
    Display information for the specified job ID.
  • -n <job_name>
    Display information for the specified job name.
  • -j <jobid> -o %all
    Display all information fields (with a vertical bar separating each field) for the specified job ID.
  • -t RUNNING
    Display only jobs that are currently running.
  • -u username1,username2,username3
    Display jobs owned by the specified username(s).
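
For example, to list only your own running jobs in the dl partition, you could combine these options (replace username with your IU username):

squeue -p dl -u username -t RUNNING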

For more, see the squeue manual page.

Delete jobs

To delete a queued or running job, use the scancel command; for example:

  • To cancel a job named my_job, enter:
    scancel -n my_job
    
  • To cancel a job owned by username, enter:
    scancel -u username
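
  • To cancel a single job by its job ID (for example, the ID returned by sbatch when you submitted it), enter:
    scancel 99999999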
    

For more, see the scancel manual page.

Partition (queue) information

To view current information about Carbonate's DL nodes, use the sinfo command. You should see output similar to the following:

PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
dl           up 2-00:00:00     11   idle dl[1-9,11-12]
dl-debug     up    8:00:00      1   idle dl10

In the above sample output:

  • The PARTITION column shows the names of the available partitions.
  • The AVAIL column shows the status of each partition.
  • The TIMELIMIT column shows the maximum wall time that users can request.
  • The NODES column shows the number of nodes in each partition.
  • The STATE column shows the current state of the nodes listed in each row (for example, idle).
  • The NODELIST column shows the actual nodes that are part of each partition.
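
To limit the output to the DL partitions, or to see node-oriented detail, you can pass standard sinfo options; for example:

sinfo -p dl,dl-debug
sinfo -N -l -p dl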

To submit a batch job to the dl partition, add the #SBATCH -p dl directive to your job script. To submit a batch job to the dl-debug partition, add the #SBATCH -p dl-debug directive instead.

Note:
To best meet the needs of all research projects affiliated with Indiana University, UITS Research Technologies administers the batch job queues on IU's research supercomputers using resource management and job scheduling policies that optimize the overall efficiency and performance of workloads on those systems. If the structure or configuration of the batch queues on any of IU's supercomputing systems does not meet the needs of your research project, contact UITS Research Technologies.

Get help

For an overview of Carbonate documentation, see Get started on Carbonate.

Support for IU research computing systems, software, and services is provided by various teams within the Research Technologies division of UITS.

For general questions about research computing at IU, contact UITS Research Technologies.

For more options, see Research computing support at IU.

This is document avjk in the Knowledge Base.
Last modified on 2019-08-15 15:19:35.

Contact us

For help or to comment, email the UITS Support Center.