How do I monitor memory and CPU usage on an IU research computing system?

The collectl utility is a system-monitoring tool that records specific operating system data for one or more sets of subsystems. Any set of subsystems (e.g., CPU, disks, memory, or processes) can be included in or excluded from data collection. Data can be stored in compressed or uncompressed files, in either raw format or a space-delimited format that enables plotting with gnuplot or Microsoft Excel.

At Indiana University, you can use collectl to monitor the memory and CPU usage of single-node batch jobs running on Carbonate or Karst, or in the Cluster Compatibility Mode (CCM) execution environment on Big Red II.

Output from collectl can help you determine how many serial applications you can run on a single node. For example, on Big Red II, if collectl output shows that a single serial application consumes about 8 GB of memory and uses a single core, you can stack three of them onto one compute node to make efficient use of that node; see On Big Red II at IU, how do I use PCP to bundle multiple serial jobs to run them in parallel?

Likewise, on Carbonate or Karst, you can use collectl output to help determine the resources you need to request to run your batch job.

Preparing a shell script

The collectl utility runs a lightweight application alongside your binary, capturing its memory and CPU usage as it runs on a compute node.

An easy way to launch collectl alongside your binary is to prepare a short shell script, which you can later submit to TORQUE using a job script.

For example, consider the following shell script for launching the binary ./my_binary:

  #!/bin/bash
  cd /N/dc2/scratch/username/temp
  ./my_binary

To have collectl record subsystem data as your binary runs, add lines to the shell script for loading the collectl module, including data collection instructions, and stopping data collection after your binary has finished running. For example, after applying such changes, the shell script would look similar to this:

  #!/bin/bash
  module load collectl
  cd /N/dc2/scratch/username/temp
  SAMPLEINTERVAL=10
  COLLECTLDIRECTORY=/N/dc2/scratch/username/temp/
  collectl -F1 -i$SAMPLEINTERVAL:$SAMPLEINTERVAL -sZl --procfilt u$UID -f $COLLECTLDIRECTORY &
  ./my_binary

In the above example:

  • Adjust SAMPLEINTERVAL according to the expected runtime for your application. If your job will run for less than a day, UITS recommends setting SAMPLEINTERVAL=10; if your job will run for multiple days, set SAMPLEINTERVAL=30 or SAMPLEINTERVAL=60.
  • To point COLLECTLDIRECTORY to your Data Capacitor II scratch space, replace username with your IU username.
  • The & at the end of the collectl command places the collectl process in the background, allowing your script to issue other commands (e.g., ./my_binary) while collectl runs and collects data.
  • To stop data collection after your binary finishes, add a command that kills the collectl process (e.g., killall collectl) as the final line of the script.

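When choosing SAMPLEINTERVAL, it can help to estimate how many samples a run will record, since longer intervals keep the output file smaller for multi-day jobs. A minimal sketch (the runtime figure is a hypothetical example, not a recommendation):

```shell
# Estimate how many samples collectl will record for a given runtime.
# RUNTIME_SECONDS is a hypothetical one-day job; adjust for your workload.
RUNTIME_SECONDS=$((24 * 3600))
SAMPLEINTERVAL=10   # interval recommended above for jobs under a day
echo "Approximate samples recorded: $((RUNTIME_SECONDS / SAMPLEINTERVAL))"
```

With these values, roughly 8,640 samples would be recorded; at SAMPLEINTERVAL=60, the same job would record about 1,440.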

Following is a summary of commonly used collectl options; for a complete list, load the collectl module, and then refer to the collectl manual page (man collectl):

Option Description

-F: Flushes the output buffers at the specified interval (in seconds); -F0 (zero) causes a flush at each sample interval.

-i: Indicates how frequently a sample occurs (in seconds).

The value preceding the colon (:) is the primary sampling interval.

The value following the colon is the rate at which process (-sZ) and slab (-sY) data are collected, and it must be a multiple of the value before the colon.
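That multiple-of constraint can be checked with shell arithmetic before you build the -i argument; this sketch uses hypothetical interval values:

```shell
PRIMARY=10     # primary sampling interval (seconds)
SECONDARY=30   # interval for process (-sZ) and slab (-sY) data
# The second value must be an integer multiple of the first.
if [ $((SECONDARY % PRIMARY)) -eq 0 ]; then
    echo "valid: -i${PRIMARY}:${SECONDARY}"
else
    echo "invalid: ${SECONDARY} is not a multiple of ${PRIMARY}"
fi
```

With these values, the check prints valid: -i10:30. The example job script above uses the same value on both sides (-i10:10), which trivially satisfies the constraint.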

-s: Determines what subsystem data (summary or detail) is collected. The default is cdn, which stands for CPU (c), disk (d), and network (n) summary data. Lowercase characters indicate summary data collection; uppercase characters indicate detail data collection. To get both, include lowercase and uppercase letters (e.g., -sZl).

Options for summary and detailed subsystem data include:

Summary subsystems:

  b - memory fragmentation
  c - CPU
  d - disk
  j - interrupts
  l - Lustre
  m - memory
  n - networks
  s - sockets
  y - slabs (system object caches)

Detail subsystems:

  D - disk
  J - interrupts
  L - Lustre
  M - memory node data (i.e., NUMA data)
  N - networks
  Y - slabs (system object caches)
  Z - processes

Note: In the above example script, -sZl directs collectl to collect detail data for processes (Z) and summary data for Lustre I/O (l).

--procfilt: Tells collectl to get data only for the processes that are specified by the filter parameters.

In the above example, the --procfilt option indicates that only processes for the $UID user ID (u$UID) should be monitored.

Other filters and options for specifying what data to monitor and how to record them include --diskfilt, --memopts, --lustopts, and --procopts.
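In bash, $UID expands to your numeric user ID (the same value returned by id -u), so you can preview the filter argument before running collectl; a small sketch:

```shell
# In bash, $UID holds your numeric user ID; `id -u` works in any POSIX shell.
# collectl's --procfilt u<uid> restricts monitoring to that user's processes.
uid=$(id -u)
echo "numeric user ID: $uid"
echo "collectl filter argument: --procfilt u$uid"
```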

-f: Sets the directory for collectl output.

Plotting output

After your job has run to completion, collectl will place an output file (e.g., nid00862-20130826-130847.raw.gz) in the directory specified by COLLECTLDIRECTORY. You then can use the collectl_plot script to create a chart depicting the runtime characteristics of your application.
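The raw file's name encodes the compute node's hostname plus the date and time collection started; these fields can be recovered with shell parameter expansion if you need to sort or label output from many jobs:

```shell
# Split a collectl output file name of the form <host>-<YYYYMMDD>-<HHMMSS>.raw.gz
f=nid00862-20130826-130847.raw.gz
base=${f%.raw.gz}    # strip the .raw.gz suffix
host=${base%%-*}     # text before the first hyphen (node hostname)
stamp=${base#*-}     # text after the first hyphen (date-time)
day=${stamp%-*}      # date portion (YYYYMMDD)
hms=${stamp#*-}      # time portion (HHMMSS)
echo "host=$host date=$day time=$hms"
```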

Always launch collectl_plot from a login node. Also, before launching it, load the collectl module (module load collectl) if you haven't already.

For example:

  collectl_plot output_file .

In the above example, replace output_file with the name of the collectl output file (e.g., nid00862-20130826-130847.raw.gz). Remember to include the period (.) at the end of the line.

Running collectl_plot will create a collectl_plot_tmp subdirectory off the current directory containing the files used to create the plots; the plots themselves will appear in your current directory as .eps files.

The default charts produced by collectl_plot are:

  • ram.eps
    Shows the memory usage of each application
  • ram_sum.eps
    Shows the summary of memory usage
  • io.eps
    Shows the I/O usage of each application
  • io_sum.eps
    Shows the summary of I/O usage
  • cpu.eps
    Shows the CPU usage of each application
  • cpu_sum.eps
    Shows the summary of CPU usage

You can view these files using any graphics program that can read .eps files (e.g., Acrobat, Apple Preview, Ghostview, Illustrator, or Photoshop). Alternatively, you can drag them into an open Microsoft Word document, or use Word's Insert Picture from File function.

The plots generated by collectl_plot contain only detail data for processes (i.e., those collected by the -sZ flag). If you specified collection of other subsystem data in your collectl command (e.g., -sl or -sm), those data will be recorded in the collectl output (raw.gz) file, although collectl_plot will not plot them.

If your application does not run long enough to generate multiple data points, collectl_plot may create empty files. In such cases, it also may report errors encountered while parsing the data.

On Big Red II, if your plots for I/O data (e.g., io.eps and io_sum.eps) are empty, most likely the I/O of the process is not being recorded. However, some file system I/O may be recorded in the collectl output (raw.gz) file. To find this I/O data within the collectl output file, search for lines containing read_bytes or write_bytes.
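Because the output file is gzip-compressed, zgrep can search it without unpacking it first. This sketch builds a small stand-in file (the record format shown is illustrative only, not collectl's exact layout) and searches it for the I/O counters described above:

```shell
# Build a tiny gzip-compressed stand-in for a collectl raw.gz file.
# (The line format here is illustrative; real collectl records differ.)
printf 'proc:12345 read_bytes: 4096\nproc:12345 write_bytes: 8192\nproc:12345 rss: 1024\n' > sample.raw
gzip -f sample.raw

# Search the compressed file for I/O byte counters without decompressing it.
zgrep -E 'read_bytes|write_bytes' sample.raw.gz
```

Against a real collectl output file, you would substitute its name (e.g., nid00862-20130826-130847.raw.gz) for sample.raw.gz.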

Getting help

For more, see the collectl manual page (man collectl) or visit the Collectl project page.

Support for IU research computing systems, software, and services is provided by the Research Technologies division of UITS. To ask a question or get help, contact UITS Research Technologies.

If you need help or have questions about using collectl, contact the Scientific Applications and Performance Tuning team.

This is document bedc in the Knowledge Base.
Last modified on 2018-06-22 14:18:15.

Contact us

For help or to comment, email the UITS Support Center.