ARCHIVED: Run applications on Big Red II's GPU-enabled compute nodes
On this page:
- Overview
- Interactive jobs
- Batch jobs
- Sample CUDA C program
- Get help

Overview
Big Red II at Indiana University is equipped with multiple Cray XK7 GPU-enabled compute nodes. Each XK7 node has one AMD Opteron 16-core Interlagos x86_64 CPU, 32 GB of RAM, and one NVIDIA Tesla K20 GPU accelerator.
To run an application on the GPU-enabled compute nodes in either the Extreme Scalability Mode (ESM) or Cluster Compatibility Mode (CCM) execution environment, submit an interactive job or batch job to the gpu queue. The gpu queue allows you to request a maximum of 256 nodes per job and up to seven days of wall-clock time.
Alternatively, to test your application or run less time-intensive jobs, use the debug_gpu queue. The debug_gpu queue allows you to request a maximum of four nodes per job and up to one hour of wall-clock time. To specify either queue, use the -q option in your batch job script or in your qsub command.
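For example, the queue can be named either as a directive in a job script or as an option on the command line (here my_job_script.pbs is an illustrative script name):

#PBS -q gpu

qsub -q debug_gpu my_job_script.pbs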
Interactive jobs
Interactive ESM job
To submit an interactive ESM job to the gpu or debug_gpu queue:
- On the Big Red II command line, use qsub with the -I (interactive) flag and other options to indicate the resources your job needs. For example, the following command submits an interactive job to the debug_gpu queue that will run for 30 minutes on all 16 processors of one node:

qsub -I -l nodes=1:ppn=16 -l walltime=00:30:00 -q debug_gpu
- When qsub executes, your job will be assigned a job ID. When the required resources are free, your job will start, and you will be placed on one of Big Red II's aprun service nodes. You will see something similar to the following displayed in your terminal window (where jobid is the job ID number and username is the IU username associated with your Big Red II account):

qsub: waiting for job jobid to start
qsub: job jobid ready

Directory: /N/u/username/BigRed2
Wed Jul 31 11:03:32 EDT 2019
username@aprun2:~>
The aprun service nodes have limited resources shared between all users on the system and are not intended for computational use. To launch your application (for example, my_binary) on a CPU/GPU node in the ESM execution environment, you must invoke the aprun command from the aprun node's command line; for example:

username@aprun2:~> aprun -n 1 -N 1 my_binary
Interactive CCM job
To submit an interactive CCM job to the gpu or debug_gpu queue:
- On the Big Red II command line, use qsub -I with the -l gres=ccm option and other options to indicate the resources your job needs. For example, the following command submits an interactive job to the gpu queue that will run in the CCM execution environment for one hour on all 16 processors of one node:

qsub -I -l gres=ccm -l nodes=1:ppn=16,walltime=01:00:00 -q gpu
- When qsub executes, your job will be assigned a job ID. When the required resources are free, your job will start, and you will be placed on one of Big Red II's aprun service nodes. You will see something similar to the following displayed in your terminal window (where jobid is the job ID number and username is the IU username associated with your Big Red II account):

qsub: waiting for job jobid to start
qsub: job jobid ready

Directory: /N/u/username/BigRed2
Wed Jul 31 11:03:32 EDT 2019
username@aprun2:~>
The aprun service nodes have limited resources shared between all users on the system and are not intended for computational use. To launch your application (for example, my_binary) on a CPU/GPU node in the CCM execution environment, you must load the ccm module and then enter the ccmlogin command. This will place you on a CPU/GPU node command line, from which you can launch your application.

For example, the following session shows a user submitting a one-hour interactive job to launch a TensorFlow script (tf-script.py) on all 16 cores of one CPU/GPU node in the CCM execution environment:

username@login1:~> qsub -I -l gres=ccm -l nodes=1:ppn=16,walltime=01:00:00 -q gpu
qsub: waiting for job 2024986 to start
qsub: job 2024986 ready

Directory: /N/u/username/BigRed2
Wed Jan 24 16:14:55 EST 2018
username@aprun7:~> module load ccm
username@aprun7:~> ccmlogin
username@nid00170:~> module load tensorflow
TensorFlow 1.1.0 loaded.
username@nid00170:~> python tf-script.py
Note: To run a graphical application from within your interactive job, add the -X option (enable X11 forwarding) to your qsub command; for more, see Use X forwarding on a personal computer to securely run graphical applications installed on IU's research supercomputers.
Batch jobs
ESM batch job
To run a batch job that executes a compiled program on one or more CPU/GPU nodes in the ESM execution environment:
- Create a job script that includes:
- TORQUE (PBS) directives that specify the resources (for example, the number of nodes and cores) needed to run the job and the wall-clock time needed for the job to complete
- The -q directive to indicate routing to the gpu (-q gpu) or debug_gpu (-q debug_gpu) queue
- An application execution line that begins with the aprun launching command
The following example job script routes a serial job called my_gpu_job to the gpu queue; the job will run for a maximum of three hours, launching the my_esm_binary program on all 16 cores of one CPU/GPU node in the ESM execution environment:

#!/bin/bash
#PBS -l nodes=1:ppn=16
#PBS -l walltime=03:00:00
#PBS -N my_gpu_job
#PBS -q gpu

aprun -n 1 -N 1 my_esm_binary
Note: When invoking aprun on a CPU/GPU node, the -n argument specifies the total number of nodes (not the total number of processing elements), and the -N argument specifies the number of GPUs per node, which on Big Red II is one (for example, -N 1).
- Use the qsub command to submit your job script; for example, to submit job script gpu_script.pbs, on the command line, enter:

qsub gpu_script.pbs
CCM batch job
To run a batch job that executes a compiled program on one or more CPU/GPU nodes in the CCM execution environment:
- Create a batch job script that includes:
- TORQUE (PBS) directives that specify the resources (for example, the number of nodes and cores) needed to run the job and the wall-clock time needed for the job to complete
- The -q directive to indicate routing to the gpu (-q gpu) or debug_gpu (-q debug_gpu) queue
- A line that loads the ccm module (this is not needed if you already have the ccm module added to your Big Red II user environment)
- An application execution line that begins with the ccmrun launching command
The following example batch script routes a serial job called my_gpu_job to the gpu queue; the job will run for a maximum of three hours, launching the my_ccm_binary program on all 16 cores of one CPU/GPU node in the CCM execution environment:

#!/bin/bash
#PBS -l nodes=1:ppn=16
#PBS -l walltime=03:00:00
#PBS -N my_gpu_job
#PBS -q gpu
#PBS -l gres=ccm

module load ccm
ccmrun my_ccm_binary
- Use the qsub command to submit your job script; for example, to submit job script gpu_script.pbs, on the command line, enter:

qsub gpu_script.pbs
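Once a batch job is submitted (in either execution environment), you can monitor it with standard TORQUE commands; for example (username and jobid are placeholders):

qstat -u username     # list your queued and running jobs
qdel jobid            # cancel a job you no longer need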
Sample CUDA C program
Following is a brief description of CUDA (Compute Unified Device Architecture), including an architectural overview of the NVIDIA GPUs equipped on Big Red II, links to reference documentation, and a sample matrix multiplication program that demonstrates some of the concepts of CUDA C programming.
NVIDIA GPU architecture
Each of Big Red II's Cray XK7 GPU-enabled (CPU/GPU) compute nodes is equipped with one AMD Opteron 16-core Interlagos x86-64 CPU and one NVIDIA Tesla K20 GPU accelerator housing a single Kepler GK110 GPU.
Each Kepler GK110 GPU features 13 streaming multiprocessor (SMX) units and 5 GB of device memory. Each SMX unit has:
- 192 single-precision CUDA cores
- 64 double-precision units
- 32 special-function units (SFUs)
- 32 load/store units
- 64 KB of shared memory and L1 cache
- 48 KB of read-only data cache
- 65,536 32-bit registers
All threads on every SMX on each individual GPU can access the GPU's device memory. Because device memory has high latency (taking from 400 to 600 clock cycles to service requests), codes that run on the cores should be designed to use shared memory and registers as much as possible.
For complete details, see the NVIDIA Kepler GK110 whitepaper.
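If you want to confirm these characteristics from within a program, the CUDA runtime API exposes them through cudaGetDeviceProperties; the following is a minimal sketch (not part of the original article):

#include <stdio.h>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   /* device 0: the node's only GPU */
    printf("Device: %s\n", prop.name);
    printf("SMX units: %d\n", prop.multiProcessorCount);
    printf("Device memory: %zu MB\n", prop.totalGlobalMem / (1024*1024));
    printf("Shared memory per block: %zu KB\n", prop.sharedMemPerBlock / 1024);
    printf("32-bit registers per block: %d\n", prop.regsPerBlock);
    return 0;
}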
The CUDA Toolkit
At Indiana University, several versions of the CUDA Toolkit are available on Big Red II.
CUDA (Compute Unified Device Architecture) is a general-purpose parallel computing platform and programming model developed by NVIDIA. CUDA leverages the parallel computing power of NVIDIA GPUs to improve the speed of complex scientific and engineering applications. CUDA extensions for C, C++, Fortran, and Python codes allow programmers to use languages they already know to develop GPU-accelerated applications.
A CUDA program consists of a host program, which runs on the CPU, and the CUDA kernel, which executes across many parallel threads. The host program transfers data from CPU (host) memory to the GPU's (device) memory, launches the CUDA kernel, and then transfers the results back to host memory.
Threads are organized into one-, two-, or three-dimensional thread blocks, which are further organized into one- or two-dimensional grids of thread blocks. Kernel launch includes a special syntax for defining the grid and thread block dimensions. The CUDA runtime system maps each thread block of a grid onto one of the SMX units (one SMX may have several thread blocks mapped to it).
Each thread of the grid executes the same kernel code but operates on different data, based on the thread index. Threads run fastest when they all follow the same execution path, but they can take different code paths when necessary (for example, when the kernel contains conditional branches).
When the threads are finished, they must write their individual results back to device memory for the host to read.
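As a minimal illustration of this one-thread-per-element model (a hypothetical vector-addition program, separate from the matrix example below), each thread derives a global index from its block and thread indices, operates on that one element, and writes its result back to device memory:

#include <stdio.h>

/* Each thread computes one element of c, chosen by its global index. */
__global__ void vec_add(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)              /* guard: the grid may cover more than n threads */
        c[i] = a[i] + b[i]; /* write the result back to device memory */
}

int main()
{
    const int n = 1024;
    float a[n], b[n], c[n], *ad, *bd, *cd;
    for (int i = 0; i < n; i++) { a[i] = i; b[i] = 2*i; }

    cudaMalloc(&ad, n*sizeof(float));
    cudaMalloc(&bd, n*sizeof(float));
    cudaMalloc(&cd, n*sizeof(float));
    cudaMemcpy(ad, a, n*sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(bd, b, n*sizeof(float), cudaMemcpyHostToDevice);

    /* One-dimensional grid of one-dimensional 256-thread blocks. */
    vec_add<<<(n + 255)/256, 256>>>(ad, bd, cd, n);

    cudaMemcpy(c, cd, n*sizeof(float), cudaMemcpyDeviceToHost);
    printf("c[10] = %f (expect 30.0)\n", c[10]);

    cudaFree(ad); cudaFree(bd); cudaFree(cd);
    return 0;
}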
For more on CUDA and general-purpose GPU computing, see the NVIDIA CUDA Parallel Computing Platform page. For CUDA Toolkit documentation, see NVIDIA's CUDA Toolkit Archive.
Sample matrix multiplication program
The following sample matrix multiplication program (mat_mul.cu) demonstrates several features of CUDA programming and NVIDIA GPU architecture. Elements of the code are discussed in the explanation below, along with instructions for compiling and running the sample code.

The sample program will do the following:
- Set up two matrices (A and B) of dimensions (M,P) and (P,N), and then fill them with random numbers between 0 and 1.
- Multiply the matrices on the CPU (using the ordinary definition of matrix multiplication), and then save the result in D.
- Call a CUDA kernel to do the multiplication on the GPU, and then store the result in array C.
- Print several elements of A, B, C, and D (just to show some of the results).
/********************** mat_mul.cu ******************************/
#include <stdlib.h>
#include <stdio.h>

#define M 256
#define P 128
#define N 64
#define BLKSIZ 16

__global__ void mat_mul(float *Ad, float *Bd, float *Cd);

int main()
{
    float A[M*P], *Ad;
    float B[P*N], *Bd;
    float C[M*N], *Cd;
    float D[M*N];
    dim3 blockDim(BLKSIZ,BLKSIZ);
    dim3 gridDim(M/BLKSIZ,N/BLKSIZ);
    int i;
    int j,k;

    /* Fill A and B with random numbers */
    for(i=0;i<M*P;i++)
        A[i] = rand()/(float)RAND_MAX;
    for(i=0;i<P*N;i++)
        B[i] = rand()/(float)RAND_MAX;

    /* First, compute D=AB on the host CPU. */
    for(i=0;i<M;i++) {
        for(j=0;j<N;j++) {
            D[i*N+j]=0.0;
            for(k=0;k<P;k++) {
                D[i*N+j] += A[i*P+k]*B[k*N+j];
            }
        }
    }

    /* Now compute C=AB on the GPU, using a CUDA kernel.
     * First, allocate device memory on the GPU for the matrices */
    cudaMalloc(&Ad,(size_t)(M*P*sizeof(float)));
    cudaMalloc(&Bd,(size_t)(P*N*sizeof(float)));
    cudaMalloc(&Cd,(size_t)(M*N*sizeof(float)));

    /* Copy A and B from host memory to device memory */
    cudaMemcpy(Ad,A,M*P*sizeof(float),cudaMemcpyHostToDevice);
    cudaMemcpy(Bd,B,P*N*sizeof(float),cudaMemcpyHostToDevice);

    /* Call the CUDA kernel to compute Cd=Ad*Bd. */
    mat_mul<<<gridDim,blockDim>>>(Ad,Bd,Cd);

    /* Copy Cd from device memory to C in host memory */
    cudaMemcpy(C,Cd,M*N*sizeof(float),cudaMemcpyDeviceToHost);

    /* Then free the allocated arrays in device memory. */
    cudaFree(Ad);
    cudaFree(Bd);
    cudaFree(Cd);

    /* Finally, print out a few of the matrix elements of A, B,
     * C and D. */
    printf("                                    GPU         CPU\n");
    printf("   i    j    A(i,j)    B(i,j)     C(i,j)      D(i,j)\n");
    for(i=0;i<10;i++) {
        for(j=25;j<28;j++) {
            printf("%4d %4d %9.6f %9.6f %11.6f %11.6f\n",
                   i,j,A[i*P+j],B[i*N+j],C[i*N+j],D[i*N+j]);
        }
    }
}

__global__ void mat_mul(float *Ad, float *Bd, float *Cd)
{
    int m = blockIdx.x;
    int n = blockIdx.y;
    int i = threadIdx.x;
    int j = threadIdx.y;
    int k,p;
    float c = 0.0;
    __shared__ float As[BLKSIZ][BLKSIZ];
    __shared__ float Bs[BLKSIZ][BLKSIZ];

    for(p=0;p<P/BLKSIZ;p++) {
        As[i][j] = Ad[(m*BLKSIZ+i)*P+(p*BLKSIZ+j)];
        Bs[i][j] = Bd[(p*BLKSIZ+i)*N+(n*BLKSIZ+j)];
        __syncthreads();   /* wait until the whole block has loaded As and Bs */
        for(k=0; k<BLKSIZ; k++) {
            c += As[i][k] * Bs[k][j];
        }
        __syncthreads();   /* wait until all threads have finished reading As and
                              Bs before the next iteration overwrites them */
    }
    Cd[(m*BLKSIZ+i)*N+(n*BLKSIZ+j)] = c;
}
/**********************************************************************/
Code explanation: In the sample code above:
- The host code has to allocate arrays Ad, Bd, and Cd in device memory, and copy A and B to Ad and Bd.
- The GPU does not directly access ordinary CPU memory. (You can do this using "pinned," memory-mapped memory, but that isn't appropriate for this simple example.)
- The CUDA kernel looks much like an ordinary C procedure, but the call in the main program has extra syntax (the <<< ... >>> between the function name and its arguments) that tells the CUDA runtime the grid and thread block dimensions.
- The thread blocks have dimensions (BLKSIZ,BLKSIZ), with BLKSIZ set to 16; the grid of thread blocks has dimensions (M/BLKSIZ,N/BLKSIZ). This is a hint about the algorithm the kernel implements: Ad and Bd are partitioned into square blocks of size BLKSIZ, and Cd is calculated by block-matrix multiplication.
is calculated by block-matrix multiplication. - The
__global__
attribute is CUDA C syntax that identifiesmat_mul
as a CUDA kernel to be executed on the GPU. - The
mat_mul
kernel has definitions of arraysAs
andBs
with the attribute__shared__
. To speed up the calculation, the kernel reads blocks ofAd
andBd
into the shared memory arraysAs
andBs
- Note that there is no for loop over i and j in the kernel. In the ordinary host code, a single thread calculates D(i,j) for all (i,j). In the CUDA kernel, each thread calculates a single element C(i,j). One of the principles of GPU computing is to have a large number of threads, each running on a relatively slow processor (compared to a typical CPU) that has to do only a little of the work. The large amount of concurrency possible with many processing elements enables large speedups on many problems. Matrix multiplication demonstrates this point particularly well.
- After a thread computes its element, it writes it to the proper location in Cd in device memory and exits. When all threads of all blocks are finished, the host program copies Cd back to host array C.
Compiling and running the sample code: To compile the sample program above on Big Red II, you first must add the CUDA Toolkit (the cudatoolkit module) to your user environment. To load the default cudatoolkit module, on the command line, enter:
module load cudatoolkit
To compile the CUDA C sample code above (mat_mul.cu), use the nvcc compiler; on the command line, enter:
nvcc -o mat_mul mat_mul.cu
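The Tesla K20's Kepler GK110 GPU is compute capability 3.5, so you can also (optionally) tell nvcc to target that architecture explicitly; exact flags may depend on the toolkit version loaded:

nvcc -arch=sm_35 -o mat_mul mat_mul.cu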
To execute the mat_mul program:
- Use the following qsub command to submit a short interactive job to Big Red II's debug_gpu queue:

qsub -I -V -q debug_gpu -l nodes=1:ppn=16,walltime=00:20:00
- When the job starts and you are placed on an aprun node, use the aprun command to launch the mat_mul program; for example:

aprun -n 1 -N 1 mat_mul
Note: When invoking aprun on a CPU/GPU node, the -n argument specifies the total number of nodes (not the total number of processing elements), and the -N argument specifies the number of GPUs per node, which on Big Red II is one (for example, -N 1).
Your session should look similar to this:
user@login1:~> qsub -I -V -q debug_gpu -l nodes=1:ppn=16,walltime=00:20:00
qsub: waiting for job 2026265 to start
qsub: job 2026265 ready

user@aprun6:~> aprun -n 1 -N 1 mat_mul
                                    GPU         CPU
   i    j    A(i,j)    B(i,j)     C(i,j)      D(i,j)
   0   25  0.400944  0.255035   31.258524   31.258524
   0   26  0.129790  0.591219   37.661102   37.661102
   0   27  0.108809  0.808992   34.111546   34.111542
   1   25  0.134902  0.094958   29.325089   29.325089
   1   26  0.520210  0.373906   36.368008   36.368008
   1   27  0.078232  0.876699   33.333530   33.333530
   2   25  0.003579  0.218715   30.874189   30.874189
   2   26  0.827391  0.726369   35.464531   35.464531
   2   27  0.331479  0.422776   33.749062   33.749065
   3   25  0.169820  0.111553   28.479095   28.479095
   3   26  0.609729  0.155460   32.144226   32.144226
   3   27  0.525747  0.089033   31.430958   31.430956
   4   25  0.664414  0.675947   33.328094   33.328094
   4   26  0.412483  0.172978   39.126163   39.126160
   4   27  0.611981  0.742407   37.266693   37.266697
   5   25  0.119989  0.888484   27.930496   27.930496
   5   26  0.461848  0.481973   33.629471   33.629463
   5   27  0.648545  0.924893   32.350769   32.350769
   6   25  0.757282  0.964975   28.024395   28.024395
   6   26  0.777505  0.374116   33.458797   33.458794
   6   27  0.006980  0.964702   29.766006   29.766006
   7   25  0.629359  0.305197   27.149197   27.149197
   7   26  0.832555  0.603174   32.963963   32.963963
   7   27  0.812997  0.123878   32.999767   32.999767
   8   25  0.053073  0.227310   28.740133   28.740135
   8   26  0.897883  0.419002   33.611870   33.611874
   8   27  0.899521  0.830260   33.336487   33.336491
   9   25  0.530552  0.093277   30.481806   30.481808
   9   26  0.523745  0.774616   33.510654   33.510654
   9   27  0.246990  0.179531   33.826702   33.826702
Application 15170790 resources: utime ~0s, stime ~1s, Rss ~87536, inblocks ~374, outblocks ~1058
user@aprun6:~>
Get help
Research computing support at IU is provided by the Research Technologies division of UITS. To ask a question or get help regarding Research Technologies services, including IU's research supercomputers and research storage systems, and the scientific, statistical, and mathematical applications available on those systems, contact UITS Research Technologies. For service-specific support contact information, see Research computing support at IU.
This is document bdmg in the Knowledge Base.
Last modified on 2023-04-21 16:58:16.