Monitor memory and CPU usage on an IU research computing system
Overview
The collectl utility is a system-monitoring tool that records specific operating system data for one or more sets of subsystems. Any set of subsystems (for example, CPU, disks, memory, or processes) can be included in or excluded from data collection. Data can be stored in compressed or uncompressed data files, which can be in either raw format or a space-delimited format that enables plotting using gnuplot or Microsoft Excel.
At Indiana University, you can use collectl to monitor the memory and CPU usage of single-node batch jobs running on Carbonate or Karst, or in the Cluster Compatibility Mode (CCM) execution environment on Big Red II.
Output from collectl can help you determine how many serial applications you can run on a single node. For example, on Big Red II, if collectl output shows that a single serial application consumes about 8 GB of memory and uses a single core, you can stack three of them onto one compute node to make efficient use of that node; see Use PCP to bundle multiple serial jobs to run in parallel on Big Red II at IU.
Likewise, on Carbonate or Karst, you can use collectl output to help determine the resources you need to request to run your batch job.
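The stacking estimate described above is simple arithmetic: the number of serial applications that fit on a node is limited by whichever runs out first, memory or cores. The following is a minimal sketch of that calculation. The function name jobs_per_node and the node figures used in the example call (32 GB of memory, 4 GB held back for the operating system, 32 cores) are illustrative assumptions, not published specifications for any IU system; substitute the values for the node type you are using.

```shell
#!/bin/bash
# Hypothetical helper (not part of collectl): estimate how many copies of a
# serial application fit on one node.
# Usage: jobs_per_node NODE_MEM_GB RESERVED_GB NODE_CORES APP_MEM_GB APP_CORES
jobs_per_node() {
    local node_mem=$1 reserved=$2 node_cores=$3 app_mem=$4 app_cores=$5
    # Copies that fit in the memory left after the reserved headroom
    local by_mem=$(( (node_mem - reserved) / app_mem ))
    # Copies that fit on the available cores
    local by_cores=$(( node_cores / app_cores ))
    # The tighter of the two limits wins
    if [ "$by_mem" -lt "$by_cores" ]; then
        echo "$by_mem"
    else
        echo "$by_cores"
    fi
}

# Assumed node: 32 GB memory, 4 GB reserved, 32 cores.
# Application (from collectl output): about 8 GB and one core.
jobs_per_node 32 4 32 8 1   # prints 3
```

With these assumed figures, memory is the limiting resource, so three copies fit even though many cores remain idle.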
Preparing a shell script
The collectl utility runs as a lightweight application alongside your binary, capturing its memory and CPU usage as it runs on a compute node.
A simple way to launch collectl alongside your binary is to prepare a shell script, which you can later submit to TORQUE using a job script.
For example, consider the following shell script (my_script.sh) for launching the binary ./my_binary:
    #!/bin/bash
    cd /N/dc2/scratch/username/temp
    ./my_binary
To have collectl record subsystem data as your binary runs, add lines to the shell script for loading the collectl module, including data collection instructions, and stopping data collection after your binary has finished running. For example, after applying such changes, the my_script.sh shell script would look similar to this:
    #!/bin/bash
    module load collectl
    cd /N/dc2/scratch/username/temp
    SAMPLEINTERVAL=10
    COLLECTLDIRECTORY=/N/dc2/scratch/username/temp/
    collectl -F1 -i$SAMPLEINTERVAL:$SAMPLEINTERVAL -sZl --procfilt u$UID -f $COLLECTLDIRECTORY &
    ./my_binary
    collectl_stop.sh
In the above example:
- Adjust SAMPLEINTERVAL according to the expected runtime for your application. If your job will run for less than a day, UITS recommends setting SAMPLEINTERVAL=10; if your job will run for multiple days, set SAMPLEINTERVAL=30 or SAMPLEINTERVAL=60.
- To point COLLECTLDIRECTORY to your Data Capacitor II scratch space, replace username with your IU username.
- The & at the end of the collectl command places the collectl process in the background, allowing your script to issue other commands (for example, ./my_binary) while collectl runs and collects data.
- The collectl_stop.sh command kills the collectl process, stopping data collection.
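If you reuse the same script for jobs of different lengths, the interval recommendation above can be encoded in a small helper. This is only a sketch: the function name pick_sample_interval is hypothetical, and treating any job of 24 hours or more as a multi-day job (and choosing 30 rather than 60 seconds for it) is an assumption layered on top of the UITS guidance.

```shell
#!/bin/bash
# Hypothetical helper: choose a collectl sample interval (in seconds) from the
# expected runtime in whole hours, following the guidance above.
pick_sample_interval() {
    local runtime_hours=$1
    if [ "$runtime_hours" -lt 24 ]; then
        echo 10   # job runs for less than a day
    else
        echo 30   # assumption: treat 24 hours or more as multi-day; 60 is also reasonable
    fi
}

# Example: a job expected to run for about 6 hours
SAMPLEINTERVAL=$(pick_sample_interval 6)
echo "SAMPLEINTERVAL=$SAMPLEINTERVAL"   # prints SAMPLEINTERVAL=10
```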
Options
Following is a summary of commonly used collectl options; for a complete list, load the collectl module, and then refer to the collectl manual page (man collectl):
Option | Description
---|---
-F | Flushes the output buffers at the specified interval (in seconds); -F0 (zero) causes a flush at each sample interval
-i | Indicates how frequently a sample occurs (in seconds). The value preceding the colon is the interval at which subsystem data are sampled; the value following the colon is the interval at which process and slab data are sampled
-s | Determines what subsystem data (summary or detail) is collected; the default is cdn, which stands for CPU (c), disk (d), and network (n) summary data. Lowercase characters indicate summary data collection, and uppercase characters indicate detail data collection; to get both, include lowercase and uppercase letters (for example, -sZl). For the available summary and detail subsystem codes, see the collectl manual page. Note: In the above example script, -sZl collects process detail data (Z) and Lustre summary data (l)
--procfilt | Tells collectl to get data only for the processes that are specified by the filter parameters. In the above example, the u$UID filter limits collection to processes owned by your user ID. For other filters and options for specifying what data to monitor and how to record them, see the collectl manual page
-f | Sets the directory for collectl output
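To see how the options in the table combine, it can help to assemble the command line in one place and print it before running it. The sketch below does only that; it never launches collectl. The function name build_collectl_cmd is hypothetical, and it uses only the flags shown in the example script above. It calls id -u in place of bash's $UID variable so that it also works in shells that do not define $UID.

```shell
#!/bin/bash
# Hypothetical helper: print the collectl command line used in the example
# script for a given sample interval and output directory (dry run only).
build_collectl_cmd() {
    local interval=$1 outdir=$2
    # -F1             flush output buffers every second
    # -iN:N           sample subsystem and process data every N seconds
    # -sZl            process detail data plus the l summary subsystem
    # --procfilt uID  only processes owned by the current user
    # -f DIR          write output files to DIR
    echo "collectl -F1 -i${interval}:${interval} -sZl --procfilt u$(id -u) -f ${outdir}"
}

# Print the command for a 10-second interval; the directory is a placeholder.
build_collectl_cmd 10 /N/dc2/scratch/username/temp/
```

Printing the command first makes it easy to confirm the interval and output directory before submitting a long-running job.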
Plotting output
After your job has run to completion, collectl will place an output file (for example, nid00862-20130826-130847.raw.gz) in the directory specified by COLLECTLDIRECTORY. You can then use the collectl_plot.sh shell script to create a chart depicting the runtime characteristics of your application.
Note: Run collectl_plot.sh from a login node. Also, you'll need to load the collectl module (module load collectl), if you haven't already, before launching collectl_plot.sh.
For example:
    collectl_plot.sh output_file .
In the above example, replace output_file with the name of the collectl output file (for example, nid00862-20130826-130847.raw.gz). Remember to include the period (.) at the end of the line.
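If the output directory holds files from several runs, you may want to pick the most recent one rather than typing the file name by hand. The following sketch assumes that output files end in .raw.gz, as in the example file name above, and that the newest file is the one you want to plot; the function name latest_collectl_output is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: print the most recently modified .raw.gz file in a
# directory, or nothing if the directory contains none.
latest_collectl_output() {
    local dir=$1
    # ls -t sorts by modification time, newest first
    ls -t "$dir"/*.raw.gz 2>/dev/null | head -n 1
}

# Example use (shown as comments because collectl_plot.sh is available only
# after loading the collectl module on a login node):
#   output_file=$(latest_collectl_output "$COLLECTLDIRECTORY")
#   collectl_plot.sh "$output_file" .
```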
Running collectl_plot.sh will create a collectl_plot_tmp subdirectory in the current directory that will contain the files used to create the plots, which will appear in your current directory as .eps files.
By default, collectl_plot.sh produces a set of charts, including plots of I/O data (for example, io.eps and io_sum.eps).
You can view these files using any graphics program that can read .eps files (for example, Acrobat, Apple Preview, Ghostview, Illustrator, and Photoshop, among others). Alternatively, you can drag them into an open Microsoft Word document or insert them as pictures from within Word.
Note: The plots created by collectl_plot.sh contain only detail data for processes (those collected by the -sZ flag). If you specified collection of other subsystem data in your collectl command (for example, -sl or -sm), those data will be recorded in the collectl output (raw.gz) file, although collectl_plot.sh will not plot them.
If your application does not run long enough to generate multiple data points, collectl_plot.sh may create empty files. In such cases, collectl_plot.sh may generate messages indicating it encountered errors while parsing the data.
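A quick way to check for this situation is to look for plot files that contain no data. The sketch below lists zero-byte .eps files in a directory; it assumes that "empty" means a file of size zero, which may not cover every case (a plot file can also exist but contain no data points). The function name list_empty_plots is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: list zero-byte .eps files in a directory
# (defaults to the current directory).
list_empty_plots() {
    local dir=${1:-.}
    # -maxdepth 1 keeps the search out of collectl_plot_tmp and other subdirectories
    find "$dir" -maxdepth 1 -type f -name '*.eps' -size 0
}

# Example: report empty plots in the current directory
list_empty_plots .
```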
On Big Red II, if your plots for I/O data (for example, io.eps and io_sum.eps) are empty, most likely the I/O of the process is not being recorded. However, some file system I/O may be recorded in the collectl output (raw.gz) file. To find this I/O data within the collectl output file, search for lines containing read_bytes or write_bytes.
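Because the output file is gzip-compressed, the search has to decompress it first. A minimal sketch, using only standard gzip and grep (the function name find_io_lines is hypothetical):

```shell
#!/bin/bash
# Hypothetical helper: print the lines of a compressed collectl output file
# that mention read_bytes or write_bytes.
find_io_lines() {
    local raw_file=$1
    # gzip -dc writes the decompressed data to standard output and leaves the file untouched
    gzip -dc "$raw_file" | grep -E 'read_bytes|write_bytes'
}

# Example use (replace output_file with the name of your raw.gz file):
#   find_io_lines output_file | less
```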
Getting help
For more, see the collectl manual page (man collectl) or visit the Collectl project page.
Support for IU research computing systems, software, and services is provided by the Research Technologies division of UITS. To ask a question or get help, contact UITS Research Technologies.
If you need help or have questions about using collectl, contact the Scientific Applications and Performance Tuning team.
This is document bedc in the Knowledge Base.
Last modified on 2019-01-22 19:07:46.