Run TensorFlow on Big Red II at IU

About TensorFlow

TensorFlow is a flexible, distributable, portable, open source software library originally developed by researchers and engineers on the Google Brain team to support machine learning and deep neural networks research. Although TensorFlow was developed to create and train machine learning models, its dataflow-based programming model is flexible enough to be useful in other research domains.

TensorFlow computations are described by directed graphs composed of nodes and edges. Nodes represent individual mathematical operations, and edges represent the multidimensional data arrays (tensors) that flow between nodes. Graphs may also contain special edges, called control dependencies, that carry no data but constrain the order in which nodes execute. TensorFlow computations can be executed on a wide variety of platforms, ranging from mobile devices to clusters with multiple CPUs and GPUs. For more, see the TensorFlow website.
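
The graph model described above can be illustrated with a toy evaluator written in plain Python. This is only a sketch of the idea, not TensorFlow's API or implementation: the Node class, the evaluate function, and the control_deps parameter are hypothetical names, and Python lists stand in for tensors.

```python
# Toy dataflow graph in plain Python. This is NOT TensorFlow's API; Node and
# evaluate() are hypothetical names used only to illustrate nodes (operations),
# data edges (inputs), and control edges (control_deps).

class Node(object):
    def __init__(self, name, op, inputs=(), control_deps=()):
        self.name = name                        # unique label for the node
        self.op = op                            # operation this node performs
        self.inputs = list(inputs)              # data edges: values flow in from these nodes
        self.control_deps = list(control_deps)  # control edges: run first, pass no data


def evaluate(node, cache, order):
    """Evaluate `node`, running its control dependencies and data inputs first."""
    if node.name in cache:
        return cache[node.name]
    for dep in node.control_deps:
        evaluate(dep, cache, order)
    args = [evaluate(inp, cache, order) for inp in node.inputs]
    cache[node.name] = node.op(*args)
    order.append(node.name)  # record execution order for inspection
    return cache[node.name]


# Build a small graph: total = sum(a + b), with "init" forced to run first.
a = Node("a", lambda: [1, 2, 3])
b = Node("b", lambda: [4, 5, 6])
init = Node("init", lambda: None)
add = Node("add", lambda x, y: [i + j for i, j in zip(x, y)], inputs=[a, b])
total = Node("total", lambda v: sum(v), inputs=[add], control_deps=[init])

order = []
result = evaluate(total, {}, order)
print(result)  # 21
print(order)   # ['init', 'a', 'b', 'add', 'total']
```

As in TensorFlow, building the graph performs no arithmetic; the values are computed only when a node is evaluated.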

At Indiana University, TensorFlow is installed on Big Red II. For the best performance, UITS recommends running TensorFlow computations on Big Red II's hybrid CPU/GPU nodes. Although you can run TensorFlow on CPU-only nodes, GPU acceleration dramatically improves its performance.

Options for setting up your TensorFlow environment

To run TensorFlow computations on Big Red II, you first must set up your user environment. The cleanest and most straightforward way to set up a TensorFlow environment on Big Red II is to load the tensorflow module. Alternatively, you can use Anaconda or a Singularity container to import a custom TensorFlow environment.

Load the tensorflow module

To load the tensorflow module, on the command line, enter:

  module load tensorflow

Note:

The tensorflow module uses its own Python build. When loaded, the tensorflow module sets the PYTHONPATH environment variable to /N/soft/cle5/tensorflow/1.1.0/lib/python2.7/site-packages. Consequently:

  • The tensorflow module will not load if the anaconda2 module or a python module is already loaded. To check your user environment for a conflicting module, on the command line, enter module list. To remove a conflicting module, use the module unload command (e.g., module unload anaconda2).
  • Some Python packages you use regularly may be missing from the build TensorFlow uses. In such cases, you can install missing Python packages to your home directory. For instructions, see Install Python packages on the research computing systems at IU.
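
As a complement to reading the output of module list, the check for conflicting modules can be sketched in Python. This sketch assumes that the Modules installation exports the colon-separated LOADEDMODULES environment variable, as Environment Modules normally does; the function name and the version strings in the usage example are hypothetical.

```python
import os

# Module names that the text above says conflict with the tensorflow module.
CONFLICTING_MODULES = ("anaconda2", "python")


def find_conflicting_modules(loaded=None):
    """Return loaded module entries that conflict with the tensorflow module.

    `loaded` is a colon-separated list such as "ccm/1.0:python/2.7"; when
    omitted, it is read from LOADEDMODULES (assumed to be set by Modules).
    """
    if loaded is None:
        loaded = os.environ.get("LOADEDMODULES", "")
    conflicts = []
    for entry in loaded.split(":"):
        name = entry.split("/")[0]  # strip the "/version" suffix, if any
        if name in CONFLICTING_MODULES:
            conflicts.append(entry)
    return conflicts


# Hypothetical module list, for illustration only.
print(find_conflicting_modules("ccm/1.0:anaconda2/4.2.0:gcc/4.9"))
```

Any entry the function returns is a candidate for module unload before loading the tensorflow module.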

To make permanent changes to your environment, edit your ~/.modules file. For more, see Use a .modules file in your home directory to save your user environment on an IU research supercomputer.

For more about using Modules to configure your user environment, see Use Modules to manage your software environment on IU's research computing systems.

Use Anaconda to create a custom environment

Anaconda is a high-performance package manager and environment manager that you can use to create a fully customized TensorFlow environment in your Big Red II home directory.

To use Anaconda on Big Red II, first load the anaconda2 module; on the command line, enter:

  module load anaconda2

Note:

The anaconda2 module will not load if a python module is already loaded. To check your user environment for a conflicting Python module, on the command line, enter module list. To remove a conflicting module, use the module unload command (e.g., module unload python).

Once the anaconda2 module is loaded, you can use the following conda create command to create an environment (e.g., named mytensorflow) that has TensorFlow with GPU support, along with its dependencies (including Python) installed:

  conda create -y -n mytensorflow tensorflow-gpu=1.1.0

In the above command, the -n option specifies the name of the environment, and the optional -y option runs the command without asking for confirmation.

As it executes, the conda create command creates the new environment, downloads a specific version of the tensorflow-gpu package from the Anaconda package repository, and installs its contents. By default, the new environment is created in the ~/.conda_envs directory. You should see something similar to the following as the command executes:

  bkyloren@login2:~> conda create -y -n mytensorflow tensorflow-gpu=1.1.0
  Fetching package metadata .........
  Solving package specifications: .
  
  Package plan for installation in environment
  /N/u/bkyloren/BigRed2/.conda_envs/mytensorflow:

  The following NEW packages will be INSTALLED:
  
      blas:           1.0-mkl
      cudatoolkit:    7.5-2
      cudnn:          5.1-0
      funcsigs:       1.0.2-py27_0
      libprotobuf:    3.4.0-0
      mkl:            2017.0.3-0
      mock:           2.0.0-py27_0
      numpy:          1.12.1-py27_0
      openssl:        1.0.2l-0
      pbr:            1.10.0-py27_0
      pip:            9.0.1-py27_1
      protobuf:       3.4.0-py27_0
      python:         2.7.13-0
      readline:       6.2-2
      setuptools:     36.4.0-py27_0
      six:            1.10.0-py27_0
      sqlite:         3.13.0-0
      tensorflow-gpu: 1.1.0-np112py27_0
      tk:             8.5.18-0
      werkzeug:       0.12.2-py27_0
      wheel:          0.29.0-py27_0
      zlib:           1.2.11-0
  
  
  blas-1.0-mkl.t 100% |################################| Time: 0:00:00  72.97 kB/s
  cudatoolkit-7. 100% |################################| Time: 0:00:03  59.97 MB/s
  cudnn-5.1-0.ta 100% |################################| Time: 0:00:01  44.64 MB/s
  mkl-2017.0.3-0 100% |################################| Time: 0:00:03  44.48 MB/s
  openssl-1.0.2l 100% |################################| Time: 0:00:00  31.17 MB/s
  readline-6.2-2 100% |################################| Time: 0:00:00  14.45 MB/s
  sqlite-3.13.0- 100% |################################| Time: 0:00:00  36.93 MB/s
  tk-8.5.18-0.ta 100% |################################| Time: 0:00:00  16.75 MB/s
  zlib-1.2.11-0.t 100% |################################| Time: 0:00:00   8.98 MB/s
  libprotobuf-3. 100% |################################| Time: 0:00:00  41.53 MB/s
  python-2.7.13- 100% |################################| Time: 0:00:00  41.92 MB/s
  funcsigs-1.0.2 100% |################################| Time: 0:00:00 651.96 kB/s
  numpy-1.12.1-p 100% |################################| Time: 0:00:00  38.31 MB/s
  setuptools-36. 100% |################################| Time: 0:00:00   7.35 MB/s
  six-1.10.0-py2 100% |################################| Time: 0:00:00 416.31 kB/s
  werkzeug-0.12. 100% |################################| Time: 0:00:00  19.54 MB/s
  wheel-0.29.0-p 100% |################################| Time: 0:00:00   8.25 MB/s
  protobuf-3.4.0 100% |################################| Time: 0:00:00  23.16 MB/s
  pbr-1.10.0-py2 100% |################################| Time: 0:00:00  10.64 MB/s
  mock-2.0.0-py2 100% |################################| Time: 0:00:00   4.43 MB/s
  tensorflow-gpu 100% |################################| Time: 0:00:01  44.30 MB/s

Your terminal may remain idle for 10 to 15 minutes while the necessary packages are installed. When the installation is complete, you will be returned to the Big Red II command prompt.

To use your environment to run a TensorFlow script on Big Red II's hybrid CPU/GPU nodes:

  1. Use qsub to submit an interactive job to Big Red II's gpu queue. For example, the following command submits a request for an interactive job that runs for one hour on all 16 cores of one CPU/GPU node:
      qsub -I -l nodes=1:ppn=16 -l walltime=01:00:00 -l gres=ccm -q gpu
    

    When your job is ready, you will be placed on an aprun service node.

  2. From the aprun node's command line, load the ccm module, and then use the ccmlogin command to get placed on a compute node:
      bkyloren@aprun1:~> module load ccm
      bkyloren@aprun1:~> ccmlogin
      bkyloren@nid00180:~>
    
  3. From the compute node's command line, load the anaconda2 module, and then use the source activate command to activate your TensorFlow environment (e.g., mytensorflow); for example:
      bkyloren@nid00180:~> module load anaconda2
      Please unload any existing python modules first. (module unload python)
      anaconda2 version 4.2.0 loaded.
    
      bkyloren@nid00180:~> source activate mytensorflow
    
  4. Upon activation, the environment name (e.g., mytensorflow) will be prepended to the compute node's command prompt, and you can launch your TensorFlow script (e.g., tensorflow-app.py) from there; for example:
      (mytensorflow) bkyloren@nid00180:~> python tensorflow-app.py
    

To deactivate your environment, on the command line, enter:

  source deactivate
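
A script can also confirm whether the expected environment is active before it imports TensorFlow. The following is a minimal sketch, assuming that conda sets the CONDA_DEFAULT_ENV environment variable when an environment is activated and clears it on deactivation; the function name is hypothetical.

```python
import os


def is_env_active(name, environ=None):
    """Return True if the conda environment `name` appears to be active.

    Assumes conda exports CONDA_DEFAULT_ENV on activation. The variable may
    hold either the environment name or its full path, so only the final
    path component is compared.
    """
    environ = os.environ if environ is None else environ
    active = environ.get("CONDA_DEFAULT_ENV", "")
    return os.path.basename(active.rstrip("/")) == name


# Reports False when no environment (or a different one) is active.
print(is_env_active("mytensorflow"))
```

Placing such a check at the top of a script gives a clear failure message when the job runs outside the intended environment, rather than an import error later.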

For more about Anaconda and conda commands, see the Anaconda Distribution and Conda Documentation pages on the Anaconda website.

Use a Singularity container

A Singularity container is an encapsulation of an application and its dependencies (i.e., the libraries, packages, and data files it needs for execution), saved as a single image file; for more, see Use Singularity on IU's research computing systems.

A sample Singularity container image for running TensorFlow with GPU support in a virtualized CentOS environment is available on Big Red II at:

  /N/soft/cle5/singularity/images/tensorflow-centos7.img

  • To run the container as an interactive job:
    1. Request an interactive job in Big Red II's debug_gpu queue; on the command line, enter:
        qsub -I -l walltime=00:30:00 -l nodes=1:ppn=16 -l gres=ccm -q debug_gpu
      
    2. When your job starts, load the ccm module; on the aprun command line, enter:
        module load ccm
      
    3. Log into the compute node; on the aprun command line, enter:
        ccmlogin
      
    4. When placed on the compute node (e.g., nid00170), load the craype-accel-nvidia35 and singularity modules; on the compute node command line, enter:
        module load craype-accel-nvidia35 singularity
      
    5. Change to your Data Capacitor II scratch directory; on the compute node command line, enter (replace <username> with your IU username):
        cd /N/dc2/scratch/<username>
      
    6. Spawn a shell in the tensorflow-centos7.img container; on the compute node command line, enter:
        singularity shell /N/soft/cle5/singularity/images/tensorflow-centos7.img
      
  • To run the container as a batch job:
    1. Create a working directory in your Data Capacitor II scratch space (e.g., /N/dc2/scratch/<username>/singularity_work; replace <username> with your IU username).
    2. Prepare a TORQUE job script (e.g., my_job_script.pbs) similar to the following example:
        #!/bin/bash
        # file to submit non interactive jobs to bigred2
      
        #PBS -l nodes=1:ppn=16
        #PBS -l gres=ccm
        #PBS -q debug_gpu
        #PBS -l walltime=00:30:00
        
        module load ccm
        module load singularity
        cd /N/dc2/scratch/<username>/singularity_work
        ccmrun singularity exec /N/soft/cle5/singularity/images/tensorflow-centos7.img python my_tensorflow.py
      

      The execution line in the example script above invokes ccmrun to launch Singularity, which starts the tensorflow-centos7.img container and runs a Python script (my_tensorflow.py) inside it.

    3. To submit the job, on the command line, enter:
        qsub my_job_script.pbs
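
If you submit many similar jobs, the job script above can be generated rather than edited by hand. The following is a sketch only: the render_job_script function and its parameters are hypothetical, the username jdoe is a placeholder, and the directives are copied from the example script above.

```python
# Template mirroring the example TORQUE job script above; only the values in
# braces vary between jobs.
JOB_TEMPLATE = """\
#!/bin/bash
#PBS -l nodes=1:ppn=16
#PBS -l gres=ccm
#PBS -q debug_gpu
#PBS -l walltime={walltime}

module load ccm
module load singularity
cd /N/dc2/scratch/{username}/{workdir}
ccmrun singularity exec {image} python {script}
"""

DEFAULT_IMAGE = "/N/soft/cle5/singularity/images/tensorflow-centos7.img"


def render_job_script(username, script, workdir="singularity_work",
                      walltime="00:30:00", image=DEFAULT_IMAGE):
    """Return the text of a job script for running `script` in the container."""
    return JOB_TEMPLATE.format(username=username, script=script,
                               workdir=workdir, walltime=walltime, image=image)


# "jdoe" is a placeholder username, not a real account.
job_text = render_job_script("jdoe", "my_tensorflow.py")
print(job_text)
```

Writing the returned text to a file (e.g., my_job_script.pbs) yields a script that can be submitted with qsub as shown above.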
      

Run TensorFlow

Regardless of the method you use, once you have your TensorFlow environment set up properly, you can access TensorFlow from the Python interpreter.

The following example demonstrates how to build a one-node "Hello, TensorFlow!" computational graph, launch it in a session, and evaluate the tensor object. Start at the Python primary prompt:

  1. Import TensorFlow:
      >>> import tensorflow as tf
    

    This gives Python access to all TensorFlow classes, methods, and symbols.

    To verify which version of TensorFlow you are running, enter:

      >>> tf.__version__
    
  2. Create the constant tensor hello by using the tf.constant operation to store the value Hello, TensorFlow!:
      >>> hello = tf.constant('Hello, TensorFlow!')
    
  3. Create a Session object using the tf.Session class:
      >>> sess = tf.Session()
    
    Note:

    When you create your Session object, you may see one or more warning messages similar to the following:

    The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.

    You may safely ignore these messages.

  4. Invoke the run method to evaluate the tensor hello, and print the result:
      >>> print(sess.run(hello))
      Hello, TensorFlow!
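
For non-interactive use (for example, in a batch job), the same steps can be collected into a script. The following is a minimal sketch, assuming a TensorFlow 1.x installation such as the 1.1.0 build described on this page; tf.Session is not available at the top level in TensorFlow 2.x. The function name hello_tensorflow is hypothetical.

```python
def hello_tensorflow():
    """Build a one-node graph, run it in a session, and return the result."""
    # Imported inside the function so that the definition itself does not
    # require TensorFlow; assumes the TensorFlow 1.x API (tf.Session).
    import tensorflow as tf

    hello = tf.constant('Hello, TensorFlow!')  # one-node graph
    with tf.Session() as sess:                 # session is closed on exit
        return sess.run(hello)                 # evaluate the tensor
```

To use the function in a script, call print(hello_tensorflow()). Using the session as a context manager releases its resources automatically when the block ends.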
    

Get help

If you need help, or have questions about using TensorFlow, Anaconda, or Singularity on Big Red II, contact the UITS Research Applications and Deep Learning team.

Support for IU research computing systems, software, and services is provided by the Research Technologies division of UITS. To ask a question or get help, contact UITS Research Technologies.

This is document aoqh in the Knowledge Base.
Last modified on 2019-06-14 12:17:55.
