Run applications on Big Red II's GPU-enabled compute nodes

Overview

Big Red II at Indiana University is equipped with multiple Cray XK7 GPU-enabled compute nodes. Each XK7 node has one AMD Opteron 16-core Interlagos x86_64 CPU, 32 GB of RAM, and one NVIDIA Tesla K20 GPU accelerator.

To run an application on the GPU-enabled compute nodes in either the Extreme Scalability Mode (ESM) or Cluster Compatibility Mode (CCM) execution environment, submit an interactive job or batch job to the gpu queue. The gpu queue allows you to request a maximum of 256 nodes per job and up to seven days of wall-clock time.

Alternatively, to test your application or run jobs that are less time-intensive, use the debug_gpu queue. The debug_gpu queue allows you to request a maximum of four nodes per job and up to one hour of wall-clock time. To specify either queue, use the -q option in your batch job script or in your qsub command.

For more about running batch and interactive jobs on Big Red II, see:

Interactive jobs

Interactive ESM job

To submit an interactive ESM job to the gpu or debug_gpu queue:

  1. On the Big Red II command line, use qsub with the -I (interactive) flag and other options to indicate the resources your job needs. For example, the following command submits an interactive job to the debug_gpu queue that will run for 30 minutes on all 16 processors of one node:
      qsub -I -l nodes=1:ppn=16 -l walltime=00:30:00 -q debug_gpu
    
  2. When qsub executes, your job will be assigned a job ID. When the required resources are free, your job will start, and you will be placed on one of Big Red II's aprun service nodes. You will see something similar to the following displayed in your terminal window (where jobid is the job ID number and username is the IU username associated with your Big Red II account):
      qsub: waiting for job jobid to start 
      qsub: job jobid ready
      Directory: /N/u/username/BigRed2
      Wed Jul 31 11:03:32 EDT 2019
      username@aprun2:~>
    

    The aprun service nodes have limited resources shared between all users on the system and are not intended for computational use. To launch your application (e.g., my_binary) on a CPU/GPU node in the ESM execution environment, you must invoke the aprun command from the aprun node's command line; for example:

      username@aprun2:~> aprun -n 1 -N 1 my_binary
    

Interactive CCM job

To submit an interactive CCM job to the gpu or debug_gpu queue:

  1. On the Big Red II command line, use qsub -I with the -l gres=ccm flag and other options to indicate the resources your job needs. For example, the following command submits an interactive job to the gpu queue that will run in the CCM execution environment for one hour on all 16 processors of one node:
      qsub -I -l gres=ccm -l nodes=1:ppn=16,walltime=01:00:00 -q gpu
    
  2. When qsub executes, your job will be assigned a job ID. When the required resources are free, your job will start, and you will be placed on one of Big Red II's aprun service nodes. You will see something similar to the following displayed in your terminal window (where jobid is the job ID number and username is the IU username associated with your Big Red II account):
      qsub: waiting for job jobid to start 
      qsub: job jobid ready
      Directory: /N/u/username/BigRed2
      Wed Jul 31 11:03:32 EDT 2019
      username@aprun2:~>
    

    The aprun service nodes have limited resources shared between all users on the system and are not intended for computational use. To launch your application (e.g., my_binary) on a CPU/GPU node in the CCM execution environment, you must load the ccm module and then enter the ccmlogin command. This will place you on the command line of a CPU/GPU node, from which you can launch your application.

    For example, the following session shows a user submitting a one-hour interactive job to launch a TensorFlow script (tf-script.py) on all 16 cores of one CPU/GPU node in the CCM execution environment:

      username@login1:~> qsub -I -l gres=ccm -l nodes=1:ppn=16,walltime=01:00:00 -q gpu
      qsub: waiting for job 2024986 to start
      qsub: job 2024986 ready
    
      Directory: /N/u/username/BigRed2
      Wed Jan 24 16:14:55 EST 2018
      username@aprun7:~> module load ccm
      username@aprun7:~> ccmlogin
      username@nid00170:~> module load tensorflow
      TensorFlow 1.1.0 loaded.
      username@nid00170:~> python tf-script.py
    
Note:
Interactive X11 applications must run in the CCM execution environment. To run an interactive X11 application in the CCM execution environment, add the -X option (enable X11 forwarding) to your qsub command; for more, see On my personal computer, how do I use X forwarding to securely run graphical applications installed on IU's research computing systems?

Batch jobs

ESM batch job

To run a batch job that executes a compiled program on one or more CPU/GPU nodes in the ESM execution environment:

  1. Create a job script that includes:
    • TORQUE (PBS) directives that specify the resources (e.g., the number of nodes and cores) needed to run the job and the wall-clock time needed for the job to complete
    • The -q directive to indicate routing to the gpu (-q gpu) or debug_gpu (-q debug_gpu) queue
    • An application execution line that begins with the aprun launching command

    The following example job script routes a serial job named my_gpu_job to the gpu queue; the job will run for a maximum of three hours and launch the my_esm_binary program on all 16 cores of one CPU/GPU node in the ESM execution environment:

      #!/bin/bash
      #PBS -l nodes=1:ppn=16
      #PBS -l walltime=03:00:00
      #PBS -N my_gpu_job
      #PBS -q gpu
      
      aprun -n 1 -N 1 my_esm_binary
    
    Note:
    When invoking aprun on a CPU/GPU node, the -n argument specifies the total number of nodes (not the total number of processing elements), and the -N argument specifies the number of GPUs per node, which on Big Red II is one (e.g., -N 1).
  2. Use the qsub command to submit your job script; for example, to submit job script gpu_script.pbs, on the command line, enter:
      qsub gpu_script.pbs
    

CCM batch job

To run a batch job that executes a compiled program on one or more CPU/GPU nodes in the CCM execution environment:

  1. Create a batch job script that includes:
    • TORQUE (PBS) directives that specify the resources (e.g., the number of nodes and cores) needed to run the job and the wall-clock time needed for the job to complete
    • The -q directive to indicate routing to the gpu (-q gpu) or debug_gpu (-q debug_gpu) queue
    • A line that loads the ccm module (this is not needed if you already have the ccm module added to your Big Red II user environment)
    • An application execution line that begins with the ccmrun launching command

    The following example batch script routes a serial job named my_gpu_job to the gpu queue; the job will run for a maximum of three hours and launch the my_ccm_binary program on all 16 cores of one CPU/GPU node in the CCM execution environment:

      #!/bin/bash
      #PBS -l nodes=1:ppn=16
      #PBS -l walltime=03:00:00
      #PBS -N my_gpu_job
      #PBS -q gpu
      #PBS -l gres=ccm
      
      module load ccm
      
      ccmrun my_ccm_binary
    
  2. Use the qsub command to submit your job script; for example, to submit job script gpu_script.pbs, on the command line, enter:
      qsub gpu_script.pbs
    

Sample CUDA C program

Following is a brief description of CUDA (Compute Unified Device Architecture), including an architectural overview of the NVIDIA GPUs installed in Big Red II, links to reference documentation, and a sample matrix multiplication program that demonstrates some of the concepts of CUDA C programming.

NVIDIA GPU architecture

Each of Big Red II's Cray XK7 GPU-enabled (CPU/GPU) compute nodes is equipped with one AMD Opteron 16-core Interlagos x86-64 CPU and one NVIDIA Tesla K20 GPU accelerator housing a single Kepler GK110 GPU.

Each Kepler GK110 GPU features 13 streaming multiprocessor (SMX) units and 5 GB of device memory. Each SMX unit has:

  • 192 single-precision CUDA cores
  • 64 double-precision units
  • 32 special-function units (SFUs)
  • 32 load/store units
  • 64 KB of shared memory and L1 cache
  • 48 KB of read-only data cache
  • 65,536 32-bit registers

All threads on every SMX on each individual GPU can access the GPU's device memory. Because device memory has high latency (taking from 400 to 600 clock cycles to service requests), codes that run on the cores should be designed to use shared memory and registers as much as possible.

For complete details, see the NVIDIA Kepler GK110 whitepaper.

The CUDA Toolkit

At Indiana University, several versions of the CUDA Toolkit are available on Big Red II.

CUDA (Compute Unified Device Architecture) is a general-purpose parallel computing platform and programming model developed by NVIDIA. CUDA leverages the parallel computing power of NVIDIA GPUs to improve the speed of complex scientific and engineering applications. CUDA extensions for C, C++, Fortran, and Python codes allow programmers to use languages they already know to develop GPU-accelerated applications.

A CUDA program consists of a host program, which runs on the CPU, and the CUDA kernel, which executes across many parallel threads. The host program transfers data from CPU (host) memory to the GPU's (device) memory, launches the CUDA kernel, and then transfers the results back to host memory.

Threads are organized into one-, two-, or three-dimensional thread blocks, which are further organized into one-, two-, or three-dimensional grids of thread blocks. A kernel launch uses a special syntax to define the grid and thread block dimensions. The CUDA runtime system maps each thread block of a grid onto one of the SMX units (one SMX may have several thread blocks mapped to it).

Each thread of the grid executes the same kernel code but operates on different data, based on its thread index. Threads run fastest when they all follow the same execution path, but if the kernel contains branches, threads can take different code paths when necessary.

When the threads are finished, they must write their individual results back to device memory for the host to read.

For more on CUDA and general-purpose GPU computing, see the NVIDIA CUDA Parallel Computing Platform page. For CUDA Toolkit documentation, see NVIDIA's CUDA Toolkit Archive.

Sample matrix multiplication program

The following sample matrix multiplication program (mat_mul.cu) demonstrates several features of CUDA programming and NVIDIA GPU architecture. Elements of the code are discussed below in the Explanation. Instructions also are provided for compiling and running the sample code.

The sample program does the following:

  1. Set up two matrices (A and B) of dimensions (M,P) and (P,N), and then fill them with random numbers between 0 and 1.
  2. Multiply the matrices on the CPU (using the ordinary definition of matrix multiplication), and then save the result in D.
  3. Call a CUDA kernel to do the multiplication on the GPU, and then store the result in array C.
  4. Print several elements of A, B, C, and D (just to show some of the results).

  /**********************    mat_mul.cu    ******************************/
  #include <stdlib.h>
  #include <stdio.h>
  
  #define M  256
  #define P  128
  #define N   64
  #define BLKSIZ 16
  
  __global__ void mat_mul(float *Ad, float *Bd, float *Cd);
  
  int main()
  {
    float  A[M*P], *Ad;
    float  B[P*N], *Bd;
    float  C[M*N], *Cd;
    float  D[M*N];
    dim3   blockDim(BLKSIZ,BLKSIZ);
    dim3   gridDim(M/BLKSIZ,N/BLKSIZ);
    int    i;
    int    j,k;
  
  /* Fill A and B with random numbers */
    for(i=0;i<M*P;i++)
      A[i]= rand()/(float)RAND_MAX;
    for(i=0;i<P*N;i++)
      B[i]= rand()/(float)RAND_MAX;
  
  /* First, compute D=AB on the host CPU. */
    for(i=0;i<M;i++) {
      for(j=0;j<N;j++) {
        D[i*N+j]=0.0;
        for(k=0;k<P;k++) {
          D[i*N+j] += A[i*P+k]*B[k*N+j];
        }
      }
    }
  
  /* Now compute C=AB on the GPU, using a CUDA kernel.
  * First, allocate device memory on the GPU for the matrices */
    cudaMalloc(&Ad,(size_t)(M*P*sizeof(float)));
    cudaMalloc(&Bd,(size_t)(P*N*sizeof(float)));
    cudaMalloc(&Cd,(size_t)(M*N*sizeof(float)));
  
  /* Copy A and B from host memory to device memory */
    cudaMemcpy(Ad,A,M*P*sizeof(float),cudaMemcpyHostToDevice);
    cudaMemcpy(Bd,B,P*N*sizeof(float),cudaMemcpyHostToDevice);
  
  /* Call the CUDA kernel to compute Cd=Ad*Bd. */
    mat_mul<<<gridDim,blockDim>>>(Ad,Bd,Cd);
  
  /* Copy Cd from device memory to C in host memory */
    cudaMemcpy(C,Cd,M*N*sizeof(float),cudaMemcpyDeviceToHost);
  
  /* Then free the allocated arrays in device memory. */
    cudaFree(Ad);
    cudaFree(Bd);
    cudaFree(Cd);
  
  /* Finally, print out a few of the matrix elements of A, B,
  * C and D. */
    printf("                                     GPU         CPU \n");
    printf("  i    j     A(i,j)    B(i,j)      C(i,j)      D(i,j)\n");
    for(i=0;i<10;i++) {
      for(j=25;j<28;j++) {
        printf("%4d  %4d  %9.6f %9.6f %11.6f %11.6f\n",
               i,j,A[i*P+j],B[i*N+j],C[i*N+j],D[i*N+j]);
      }
    }
  }
  
  __global__ void mat_mul(float *Ad, float *Bd, float *Cd) {
    int    m = blockIdx.x;
    int    n = blockIdx.y;
    int    i = threadIdx.x;
    int    j = threadIdx.y;
    int    k,p;
    float  c = 0.0;
  
    __shared__  float As[BLKSIZ][BLKSIZ];
    __shared__  float Bs[BLKSIZ][BLKSIZ];
  
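    for(p=0;p<P/BLKSIZ;p++) {
      /* Each thread loads one element of the current tiles of Ad and Bd */
      As[i][j] = Ad[(m*BLKSIZ+i)*P+(p*BLKSIZ+j)];
      Bs[i][j] = Bd[(p*BLKSIZ+i)*N+(n*BLKSIZ+j)];
      __syncthreads();   /* wait until both tiles are fully loaded */
      for(k=0; k<BLKSIZ; k++) {
        c += As[i][k] * Bs[k][j];
      }
      __syncthreads();   /* wait until all threads are done before the tiles are overwritten */
    }
    Cd[(m*BLKSIZ+i)*N+(n*BLKSIZ+j)] = c;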
  }
  /**********************************************************************/

Code explanation: In the sample code above:

  • The host code has to allocate arrays Ad, Bd, and Cd in device memory, and copy A and B to Ad and Bd.
  • The GPU does not directly access the ordinary CPU memory. (You can do this using "pinned, memory-mapped" memory, but it's not appropriate for this simple example.)
  • The CUDA kernel looks much like an ordinary C procedure, but the call in the main program has extra syntax (the <<< ... >>> between the function name and its arguments) that tells the CUDA runtime the grid and thread block dimensions.
  • The thread blocks have dimensions (BLKSIZ,BLKSIZ), with BLKSIZ set to 16; the grid of thread blocks has dimensions (M/BLKSIZ,N/BLKSIZ). This is a hint about the algorithm the kernel implements: Ad and Bd are partitioned into square blocks of size BLKSIZ, and Cd is calculated by block-matrix multiplication.
  • The __global__ attribute is CUDA C syntax that identifies mat_mul as a CUDA kernel to be executed on the GPU.
  • The mat_mul kernel has definitions of arrays As and Bs with the attribute __shared__. To speed up the calculation, the kernel reads blocks of Ad and Bd into the shared memory arrays As and Bs.
  • Note that there is no for loop over i and j in the kernel. In the ordinary host code, a single thread calculates D(i,j) for all (i,j). In the CUDA kernel, each thread calculates a single element C(i,j). One of the principles of GPU computing is to have a large number of threads, each running on a relatively slow processor (compared to a typical CPU) that has to do only a little of the work. The large amount of concurrency possible with many processing elements enables large speedups on many problems. Matrix multiplication demonstrates this point particularly well.
  • After a thread computes its element, it writes it to the proper location in Cd in device memory and exits. When all threads of all blocks are finished, the host program copies Cd back to host array C.

Compiling and running the sample code: To compile the sample program above on Big Red II, you first must add the CUDA Toolkit (i.e., the cudatoolkit module) to your user environment. To load the default cudatoolkit module, on the command line, enter:

  module load cudatoolkit

To compile the CUDA C sample code above (mat_mul.cu), use the nvcc compiler; on the command line, enter:

  nvcc -o mat_mul mat_mul.cu

To execute the mat_mul program:

  1. Use the following qsub command to submit a short interactive job to Big Red II's debug_gpu queue:
      qsub -I -V -q debug_gpu -l nodes=1:ppn=16,walltime=00:20:00
    
  2. When the job starts and you are placed on an aprun node, use the aprun command to launch the mat_mul program; for example:
      aprun -n 1 -N 1 mat_mul
    
    Note:
    When invoking aprun on a CPU/GPU node, the -n argument specifies the total number of nodes (not the total number of processing elements), and the -N argument specifies the number of GPUs per node, which on Big Red II is one (e.g., -N 1).

Your session should look similar to this:

  user@login1:~> qsub -I -V -q debug_gpu -l nodes=1:ppn=16,walltime=00:20:00
  qsub: waiting for job 2026265 to start
  qsub: job 2026265 ready
  
  user@aprun6:~> aprun -n 1 -N 1 mat_mul
                                       GPU         CPU
    i    j     A(i,j)    B(i,j)      C(i,j)      D(i,j)
     0    25   0.400944  0.255035   31.258524   31.258524
     0    26   0.129790  0.591219   37.661102   37.661102
     0    27   0.108809  0.808992   34.111546   34.111542
     1    25   0.134902  0.094958   29.325089   29.325089
     1    26   0.520210  0.373906   36.368008   36.368008
     1    27   0.078232  0.876699   33.333530   33.333530
     2    25   0.003579  0.218715   30.874189   30.874189
     2    26   0.827391  0.726369   35.464531   35.464531
     2    27   0.331479  0.422776   33.749062   33.749065
     3    25   0.169820  0.111553   28.479095   28.479095
     3    26   0.609729  0.155460   32.144226   32.144226
     3    27   0.525747  0.089033   31.430958   31.430956
     4    25   0.664414  0.675947   33.328094   33.328094
     4    26   0.412483  0.172978   39.126163   39.126160
     4    27   0.611981  0.742407   37.266693   37.266697
     5    25   0.119989  0.888484   27.930496   27.930496
     5    26   0.461848  0.481973   33.629471   33.629463
     5    27   0.648545  0.924893   32.350769   32.350769
     6    25   0.757282  0.964975   28.024395   28.024395
     6    26   0.777505  0.374116   33.458797   33.458794
     6    27   0.006980  0.964702   29.766006   29.766006
     7    25   0.629359  0.305197   27.149197   27.149197
     7    26   0.832555  0.603174   32.963963   32.963963
     7    27   0.812997  0.123878   32.999767   32.999767
     8    25   0.053073  0.227310   28.740133   28.740135
     8    26   0.897883  0.419002   33.611870   33.611874
     8    27   0.899521  0.830260   33.336487   33.336491
     9    25   0.530552  0.093277   30.481806   30.481808
     9    26   0.523745  0.774616   33.510654   33.510654
     9    27   0.246990  0.179531   33.826702   33.826702
  Application 15170790 resources: utime ~0s, stime ~1s, Rss ~87536, inblocks ~374, outblocks ~1058
  user@aprun6:~>

Get help

Support for IU research computing systems, software, and services is provided by the Research Technologies division of UITS. To ask a question or get help, contact UITS Research Technologies.

This is document bdmg in the Knowledge Base.
Last modified on 2018-02-08 13:16:06.
