How do I monitor memory and CPU usage on an IU research computing system?

Note:
Mason, Indiana University's large memory computer cluster, will be retired on January 1, 2018. For more, see About the Mason retirement.

On this page:

  • Overview
  • Preparing a shell script
  • Options
  • Plotting output
  • Getting help

Overview

The collectl utility is a system-monitoring tool that records operating system data for one or more sets of subsystems. Any set of subsystems (e.g., CPU, disks, memory, or processes) can be included in or excluded from data collection. Data can be stored in compressed or uncompressed files, in either raw format or in a space-delimited format suitable for plotting with gnuplot or Microsoft Excel.

At Indiana University, you can use collectl to monitor the memory and CPU usage of single-node batch jobs running on Carbonate, Karst, or Mason, or in the Cluster Compatibility Mode (CCM) execution environment on Big Red II.

Output from collectl can help you determine how many serial applications you can run on a single node. For example, on Big Red II, if collectl output shows that a single serial application consumes about 8 GB of memory and uses a single core, you can stack three such applications onto one compute node to make efficient use of that node; see On Big Red II at IU, how do I use PCP to bundle multiple serial jobs to run them in parallel?

Likewise, on Carbonate, Karst, or Mason, you can use collectl output to help determine the resources you need to request to run your batch job.

Preparing a shell script

The collectl utility is a lightweight application that runs alongside your binary, capturing the binary's memory and CPU usage as it runs on a compute node.

A simple way to launch collectl alongside your binary is to prepare a short shell script, which you can later submit to TORQUE using a job script (a sample job script appears at the end of this section).

For example, consider the following shell script (my_script.sh) for launching the binary ./my_binary:

  #!/bin/bash

  cd /N/dc2/scratch/username/temp
  ./my_binary

To have collectl record subsystem data as your binary runs, add lines to the shell script for loading the collectl module, including data collection instructions, and stopping data collection after your binary has finished running. For example, after applying such changes, the my_script.sh shell script would look similar to this:

  #!/bin/bash

  module load collectl

  cd /N/dc2/scratch/username/temp

  SAMPLEINTERVAL=10
  COLLECTLDIRECTORY=/N/dc2/scratch/username/temp/

  collectl -F1 -i$SAMPLEINTERVAL:$SAMPLEINTERVAL -sZl --procfilt u$UID -f $COLLECTLDIRECTORY &

  ./my_binary

  collectl_stop.sh

In the above example:

  • Adjust SAMPLEINTERVAL according to the expected runtime for your application. If your job will run for less than a day, UITS recommends setting SAMPLEINTERVAL=10; if your job will run for multiple days, set SAMPLEINTERVAL=30 or SAMPLEINTERVAL=60.
  • To point COLLECTLDIRECTORY to your Data Capacitor II scratch space, replace username with your IU username.
  • The & at the end of the collectl command places the collectl process in the background, allowing your script to issue other commands (e.g., ./my_binary) while collectl runs and collects data.
  • The collectl_stop.sh command kills the collectl process, stopping data collection.
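
As noted above, you submit this shell script to TORQUE using a job script. A minimal sketch follows; the job script name (my_job.pbs), resource requests, and walltime are illustrative placeholders that you should adjust for your application and system:

  #!/bin/bash
  #PBS -l nodes=1:ppn=1,vmem=8gb,walltime=04:00:00
  #PBS -N collectl_monitoring_job

  # Run the shell script that launches collectl alongside the binary
  cd /N/dc2/scratch/username/temp
  ./my_script.sh

Assuming the job script is saved as my_job.pbs, submit it with qsub my_job.pbs.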

Options

Following is a summary of commonly used collectl options; for a complete list, load the collectl module, and then refer to the collectl manual page (man collectl):

-F
Flushes the output buffers at the specified interval (in seconds); -F0 (zero) causes a flush at each sample interval.

-i
Indicates how frequently a sample is taken (in seconds).

The value preceding the colon ( : ) is the primary sampling interval.

The value following the colon is the rate at which process ( -sZ ) and slab ( -sY ) data is collected; it must be a multiple of the value preceding the colon.

-s
Determines what subsystem data (summary or detail) is collected. The default is cdn, which stands for CPU (c), disk (d), and network (n) summary data. Lowercase characters indicate summary data collection; uppercase characters indicate detail data collection. To get both, include lowercase and uppercase letters (e.g., -sZl).

Options for summary and detailed subsystem data include:

Summary subsystems                    Detail subsystems
b - memory fragmentation              C - CPU
c - CPU                               D - disk
d - disk                              J - interrupts
j - interrupts                        L - Lustre
l - Lustre                            M - memory node (NUMA) data
m - memory                            N - networks
n - networks                          Y - slabs (system object caches)
s - sockets                           Z - processes
y - slabs (system object caches)

Note: In the above example script, -sZl directs collectl to collect detail data for processes (Z) and summary data for Lustre I/O (l).

--procfilt
Tells collectl to get data only for the processes that are specified by the filter parameters

In the above example, --procfilt u$UID indicates that only processes owned by your user ID ($UID) should be monitored.

Other filters and options for specifying what data to monitor and how to record them include --diskfilt, --memopts, --lustopts, and --procopts.

-f
Sets the directory for collectl output
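
Putting these options together, the collectl command from the example script above breaks down as follows:

  # -F1: flush the output buffers every 1 second
  # -i$SAMPLEINTERVAL:$SAMPLEINTERVAL: take a sample every $SAMPLEINTERVAL seconds,
  #   and collect process/slab data at the same rate
  # -sZl: collect process detail (Z) and Lustre summary (l) data
  # --procfilt u$UID: monitor only processes owned by your user ID
  # -f $COLLECTLDIRECTORY: write the output file to this directory
  # The trailing & runs collectl in the background.
  collectl -F1 -i$SAMPLEINTERVAL:$SAMPLEINTERVAL -sZl --procfilt u$UID -f $COLLECTLDIRECTORY &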

Plotting output

After your job has run to completion, collectl will place an output file (e.g., nid00862-20130826-130847.raw.gz) in the directory specified by COLLECTLDIRECTORY. You then can use the collectl_plot.sh shell script to create a chart depicting the runtime characteristics of your application.

Note:
Always launch collectl_plot.sh from a login node. Also, if you haven't already, load the collectl module (module load collectl) before launching collectl_plot.sh.

For example:

  collectl_plot.sh output_file .

In the above example, replace output_file with the name of the collectl output file (e.g., nid00862-20130826-130847.raw.gz). Remember to include the period (.) at the end of the line.

Running collectl_plot.sh creates a collectl_plot_tmp subdirectory under the current directory containing the files used to create the plots; the plots themselves appear in your current directory as .eps files.
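
For example, a plotting session on a login node might look similar to the following; the directory and output file name are the placeholders used earlier in this document:

  # Load the collectl module if you haven't already
  module load collectl

  # Change to the directory containing the collectl output file
  cd /N/dc2/scratch/username/temp

  # Create the .eps plots in the current directory (note the trailing period)
  collectl_plot.sh nid00862-20130826-130847.raw.gz .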

The default charts produced after running collectl_plot.sh are:

  • ram.eps
    Shows the memory usage of each application
  • ram_sum.eps
    Shows the summary of memory usage
  • io.eps
    Shows the I/O usage of each application
  • io_sum.eps
    Shows the summary of I/O usage
  • cpu.eps
    Shows the CPU usage of each application
  • cpu_sum.eps
    Shows the summary of CPU usage

You can view these files using any graphics program that can read .eps files (e.g., Acrobat, Apple Preview, Ghostview, Illustrator, or Photoshop). Alternatively, you can drag them into an open Microsoft Word document, or use Word's Insert Picture from File function.
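
If you'd rather work with PDFs, and Ghostscript is available on your system (as the mention of Ghostview suggests), you can convert an .eps file to PDF on the command line; for example:

  # Convert a plot to PDF, cropping to the EPS bounding box
  ps2pdf -dEPSCrop ram.eps ram.pdf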

Note:
The plots generated by collectl_plot.sh contain only detail data for processes (i.e., those collected by the -sZ flag). If you specified collection of other subsystem data in your collectl command (e.g., -sl or -sm), those data will be recorded in the collectl output (raw.gz) file, although collectl_plot.sh will not plot them.

If your application does not run long enough to generate multiple data points, collectl_plot.sh may create empty files. In such cases, collectl_plot.sh may generate messages indicating it encountered errors while parsing the data.

On Big Red II, if your plots for I/O data (e.g., io.eps and io_sum.eps) are empty, most likely the I/O of the process is not being recorded. However, some file system I/O may be recorded in the collectl output (raw.gz) file. To find this I/O data within the collectl output file, search for lines containing read_bytes or write_bytes.
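
Because the collectl output file is gzip-compressed, you can search it without decompressing it first; for example, using the sample file name from above:

  # Search the compressed collectl output for file system I/O records
  zgrep -E 'read_bytes|write_bytes' nid00862-20130826-130847.raw.gz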

Getting help

For more, see the collectl manual page (man collectl) or visit the Collectl project page.

Support for IU research computing systems, software, and services is provided by various UITS Research Technologies units. For help, see Research computing support at IU.

If you need help or have questions about using collectl, contact the Scientific Applications and Performance Tuning team.

This is document bedc in the Knowledge Base.
Last modified on 2017-11-21 13:32:19.
