On Big Red II, how do I use CUDA?

Following is a brief description of the Compute Unified Device Architecture (CUDA) platform, including links to reference documentation, an overview of the NVIDIA GPU architecture on Indiana University's Big Red II system, and a sample matrix multiplication program that demonstrates some of the concepts of CUDA C programming.

About CUDA

The Compute Unified Device Architecture (CUDA) is a general-purpose parallel computing platform and programming model developed by NVIDIA. CUDA leverages the parallel computing power of NVIDIA graphics processing units (GPUs) to improve the speed of complex scientific and engineering applications. CUDA extensions for C, C++, Fortran, and Python codes allow programmers to use languages they already know to develop GPU-accelerated applications.

At Indiana University, several versions of the CUDA Toolkit are available on Big Red II.

For more on CUDA and general-purpose GPU computing, see the NVIDIA CUDA Parallel Computing Platform page. For CUDA Toolkit documentation, see NVIDIA's CUDA Toolkit Archive.

NVIDIA GPU architecture

Each of Big Red II's Cray XK7 GPU-enabled (CPU/GPU) compute nodes is equipped with one AMD Opteron 16-core Interlagos x86-64 CPU and one NVIDIA Tesla K20 GPU accelerator housing a single Kepler GK110 GPU.

Each Kepler GK110 GPU features 13 streaming multiprocessor (SMX) units and 5 GB of device memory. Each SMX unit has:

  • 192 single-precision CUDA cores
  • 64 double-precision units
  • 32 special-function units (SFUs)
  • 32 load/store units
  • 64 KB of shared memory and L1 cache
  • 48 KB of read-only data cache
  • 65,536 32-bit registers

All threads on every SMX on each individual GPU can access the GPU's device memory. Because device memory has high latency (taking from 400 to 600 clock cycles to service requests), codes that run on the cores should be designed to use shared memory and registers as much as possible.
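These properties can also be queried at run time through the CUDA runtime API. The following is a minimal sketch (the field names come from the cudaDeviceProp structure; compile with nvcc and run on a GPU node):

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   /* properties of device 0 */

    printf("Device name:               %s\n",  prop.name);
    printf("Multiprocessors (SMX):     %d\n",  prop.multiProcessorCount);
    printf("Shared memory per block:   %zu KB\n", prop.sharedMemPerBlock / 1024);
    printf("32-bit registers per block: %d\n", prop.regsPerBlock);
    printf("Device memory:             %zu MB\n", prop.totalGlobalMem / (1024*1024));
    return 0;
}
```

On a Tesla K20, the multiprocessor count and register figures should match the SMX description above.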

A CUDA program consists of a host program, which runs on the CPU, and the CUDA kernel, which executes across many parallel threads. The host program transfers data from CPU memory (host memory) to the GPU's device memory, launches the CUDA kernel, and then transfers the results back to CPU memory.
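As a schematic illustration of that flow, separate from the matrix multiplication sample later on this page, a minimal program that doubles a vector on the GPU might look like this (the kernel and variable names are illustrative):

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

/* Trivial kernel: each thread doubles one element. */
__global__ void scale(float *d, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        d[idx] *= 2.0f;
}

int main(void)
{
    const int n = 1024;
    float h[1024], *d;
    for (int i = 0; i < n; i++) h[i] = (float)i;

    cudaMalloc((void **)&d, n * sizeof(float));                   /* 1. allocate device memory  */
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);  /* 2. host -> device copy     */
    scale<<<n / 256, 256>>>(d, n);                                /* 3. launch the kernel       */
    cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);  /* 4. device -> host copy     */
    cudaFree(d);                                                  /* 5. release device memory   */

    printf("h[3] = %g\n", h[3]);
    return 0;
}
```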

Threads are organized into one-, two-, or three-dimensional thread blocks, which are further organized into one- or two-dimensional grids of thread blocks. Kernel launch includes a special syntax for defining the grid and thread block dimensions. The CUDA runtime system maps each thread block of a grid onto one of the SMX units (one SMX may have several thread blocks mapped to it).

Each thread of the grid executes the same kernel code, but operates on different data, based on the thread index. Threads run fastest when they all follow the same execution path. However, they can take different code paths when necessary if the kernel contains conditional branches; threads within a warp that diverge execute their paths serially, at some cost in performance.

When the threads are finished, they must write their individual results back to device memory for the host to read.

For complete details, see the NVIDIA Kepler GK110 whitepaper.

Sample CUDA C program

The following sample CUDA C matrix multiplication program (mat_mul.cu) demonstrates several features of CUDA programming and the NVIDIA GPU architecture. Elements of the code are discussed in the explanation below, and instructions are provided for compiling and running the sample code.

Sample code

The sample code that follows (mat_mul.cu):

  1. Sets up two matrices (A and B) of dimensions (M,P) and (P,N), and then fills them with random numbers between 0 and 1
  2. Multiplies the matrices on the CPU (using the ordinary definition of matrix multiplication), and then saves the result in D
  3. Calls a CUDA kernel to do the multiplication on the GPU, and then stores the result in array C
  4. Prints several of the elements of A, B, C, and D (just to show some of the results)
  /**********************    mat_mul.cu    ******************************/
  #include <stdlib.h>
  #include <stdio.h>

  #define M  256
  #define P  128
  #define N   64
  #define BLKSIZ 16

  __global__ void mat_mul(float *Ad, float *Bd, float *Cd);

  int main()
  {
    float  A[M*P], *Ad;
    float  B[P*N], *Bd;
    float  C[M*N], *Cd;
    float  D[M*N];
    dim3   blockDim(BLKSIZ,BLKSIZ);
    dim3   gridDim(M/BLKSIZ,N/BLKSIZ);
    int    i,j,k;

  /* Fill A and B with random numbers */
    for(i=0;i<M*P;i++)
      A[i]= rand()/(float)RAND_MAX;
    for(i=0;i<P*N;i++)
      B[i]= rand()/(float)RAND_MAX;

  /* First, compute D=AB on the host CPU. */
    for(i=0;i<M;i++) {
      for(j=0;j<N;j++) {
        D[i*N+j] = 0.0f;
        for(k=0;k<P;k++) {
          D[i*N+j] += A[i*P+k]*B[k*N+j];
        }
      }
    }

  /* Now compute C=AB on the GPU, using a CUDA kernel.
   * First, allocate device memory on the GPU for the matrices. */
    cudaMalloc((void **)&Ad, M*P*sizeof(float));
    cudaMalloc((void **)&Bd, P*N*sizeof(float));
    cudaMalloc((void **)&Cd, M*N*sizeof(float));

  /* Copy A and B from host memory to device memory */
    cudaMemcpy(Ad, A, M*P*sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(Bd, B, P*N*sizeof(float), cudaMemcpyHostToDevice);

  /* Call the CUDA kernel to compute Cd=Ad*Bd. */
    mat_mul<<<gridDim,blockDim>>>(Ad, Bd, Cd);

  /* Copy Cd from device memory to C in host memory */
    cudaMemcpy(C, Cd, M*N*sizeof(float), cudaMemcpyDeviceToHost);

  /* Then free the allocated arrays in device memory. */
    cudaFree(Ad);
    cudaFree(Bd);
    cudaFree(Cd);

  /* Finally, print out a few of the matrix elements of A, B,
   * C and D. */
    printf("                                     GPU         CPU \n");
    printf("  i    j     A(i,j)    B(i,j)      C(i,j)      D(i,j)\n");
    for(i=0;i<10;i++) {
      for(j=25;j<28;j++) {
        printf("%4d  %4d  %9.6f %9.6f %11.6f %11.6f\n",
               i, j, A[i*P+j], B[i*N+j], C[i*N+j], D[i*N+j]);
      }
    }
    return 0;
  }

  __global__ void mat_mul(float *Ad, float *Bd, float *Cd) {
    int    m = blockIdx.x;
    int    n = blockIdx.y;
    int    i = threadIdx.x;
    int    j = threadIdx.y;
    int    k,p;
    float  c = 0.0f;
    __shared__  float As[BLKSIZ][BLKSIZ];
    __shared__  float Bs[BLKSIZ][BLKSIZ];

    for(p=0;p<P/BLKSIZ;p++) {
      /* Each thread loads one element of the A tile and one of the B tile. */
      As[i][j] = Ad[(m*BLKSIZ+i)*P+(p*BLKSIZ+j)];
      Bs[i][j] = Bd[(p*BLKSIZ+i)*N+(n*BLKSIZ+j)];
      __syncthreads();   /* wait until both tiles are fully loaded */
      for(k=0; k<BLKSIZ; k++) {
        c += As[i][k] * Bs[k][j];
      }
      __syncthreads();   /* wait before the tiles are overwritten */
    }
    Cd[(m*BLKSIZ+i)*N+(n*BLKSIZ+j)] = c;
  }


In the sample program above:

  • The host code has to allocate arrays Ad, Bd, and Cd in device memory, and copy A and B to Ad and Bd.
  • The GPU does not directly access ordinary CPU memory. (This is possible using pinned, memory-mapped host memory, but that approach is not appropriate for this simple example.)
  • The CUDA kernel looks much like an ordinary C procedure, but the call in the main program has extra syntax (the <<< ... >>> between the function name and its arguments) that tells the CUDA runtime the grid and thread block dimensions.
  • The thread blocks have dimensions (BLKSIZ,BLKSIZ), with BLKSIZ set to 16; the grid of thread blocks has dimensions (M/BLKSIZ,N/BLKSIZ). This is a hint about the algorithm the kernel implements: Ad and Bd are partitioned into square blocks of size BLKSIZ, and Cd is calculated by block-matrix multiplication.
  • The mat_mul function is declared with the attribute __global__, which is CUDA C syntax identifying mat_mul as a CUDA kernel to be executed on the GPU.
  • The mat_mul kernel has definitions of arrays As and Bs with the attribute __shared__. To speed up the calculation, the kernel reads blocks of Ad and Bd into the shared memory arrays As and Bs.
  • Note that there is no for loop over i and j in the kernel. In the ordinary host code, a single thread calculates D(i,j) for all (i,j). In the CUDA kernel, each thread calculates a single element C(i,j). One of the principles of GPU computing is to have a large number of threads, each running on a relatively slow processor (compared to a typical CPU) that has to do only a little of the work. The large amount of concurrency possible with many processing elements enables large speedups on many problems. Matrix multiplication demonstrates this point particularly well.
  • After a thread computes its element, it writes it to the proper location in Cd in device memory and exits. When all threads of all blocks are finished, the host program copies Cd back to host array C.

Compiling and running the sample code

CUDA C programs are compiled with the nvcc compiler, which expects CUDA C programs to have a .cu extension.

To compile the sample program above on Big Red II, first make sure the cudatoolkit module is loaded:

  module load cudatoolkit

Then, enter the compiler command:

  nvcc -o mat_mul mat_mul.cu

To execute the program, enter the following command to submit an interactive job on one GPU node:

  qsub -I -V -q gpu -l nodes=1:ppn=1,walltime=00:20:00

When TORQUE has acquired a GPU node, it will return a prompt. You will need to change to the directory where the mat_mul executable is located. Then, use aprun to launch it:

  aprun mat_mul

Getting help

Support for IU research computing systems, software, and services is provided by various UITS Research Technologies units. For help, see Research computing support at IU.

This is document bdmg in the Knowledge Base.
Last modified on 2017-09-06 17:12:45.
