Indiana University
University Information Technology Services
  

At IU, how do I use TORQUE/PBS on Quarry?


Introduction

TORQUE, also known by its historical name, Portable Batch System (PBS), is the resource manager on the Quarry system at Indiana University. Tools for job submission and management are available in /usr/local/bin; most tools have associated man pages. For detailed online information, such as user manuals and administrator guides, see the additional documentation below.

TORQUE manages jobs that users submit to various queues on the system, each queue representing a group of resources with attributes necessary for the queue's jobs. Commonly used TORQUE tools include qsub, for job submission; qstat, for monitoring the status of jobs; and qdel, for terminating jobs prior to completion. More detailed information regarding these commands and others is available below, or in the documentation mentioned above.

Policy

Jobs that are run interactively on the user nodes are limited to 20 minutes of CPU time. Monitoring scripts on the user nodes will kill processes exceeding 20 minutes of wall clock time.

Run jobs that require more than 20 minutes but less than 24 hours of CPU time on the interactive nodes, b005-b008. To access one of these nodes, you must first log into Quarry, and from there use ssh to connect to b005, b006, b007, or b008.

If your job requires more than 24 hours of CPU time, submit a batch job to TORQUE with the qsub command.
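The three tiers in this policy can be summarized as a small decision helper. This is only a sketch: where_to_run is a hypothetical function name, the input is your own estimate of CPU time in minutes, and a job of exactly 24 hours (which the policy does not address) is treated here as a batch job.

```shell
# Hypothetical helper: map an estimated CPU time (minutes) to the place
# the policy says the work should run.
where_to_run() {
    local minutes=$1
    if [ "$minutes" -le 20 ]; then
        echo "user node"
    elif [ "$minutes" -lt 1440 ]; then      # 1440 minutes = 24 hours
        echo "interactive node (b005-b008)"
    else
        # Assumption: exactly 24 hours is also sent to batch.
        echo "batch job (qsub)"
    fi
}

where_to_run 15
where_to_run 600
where_to_run 3000
```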

Queues

The following queues are available on Quarry:

Note: Cluster-wide, the maximum number of tasks is 2,768 (346 compute nodes available [342 in queues, 4 in user-selectable debug] X 8 tasks per node).
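The arithmetic behind this figure can be checked in the shell. The split of the 342 queue nodes into 112 blades (q029-q140) plus 230 PG nodes is an assumption read from the queue listings that follow, not something the note itself states.

```shell
# Assumption: 342 = 112 blades shared by SERIAL/NORMAL/LONG/HIMEM + 230 PG nodes.
queue_nodes=$(( 112 + 230 ))
debug_nodes=4
tasks_per_node=8

total_nodes=$(( queue_nodes + debug_nodes ))
max_tasks=$(( total_nodes * tasks_per_node ))

echo "$total_nodes nodes, $max_tasks tasks"   # prints "346 nodes, 2768 tasks"
```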

SERIAL queue properties

  • Nodes: 5 serial (q029-q033) + 33 normal (q034-q066) + 46 long (q067-q112) + 28 himem (q113-q140) = 112 total
  • Maximum walltime: 12 hours
  • Maximum nodes per job: 1 node
  • Maximum cores per job: 8 cores
  • Maximum number of jobs per queue: 2,000
  • Maximum number of jobs per user: 500
  • Direct submission: No

NORMAL queue properties

  • Nodes: 33 normal (q034-q066) + 46 long (q067-q112) = 79 total
  • Maximum walltime: 7 days
  • Maximum nodes per job: 6 nodes
  • Maximum cores per job: 48 cores
  • Maximum number of jobs per queue: 1,500
  • Maximum number of jobs per user: 500
  • Direct submission: No

LONG queue properties

  • Nodes: 46 long (q067-q112) = 46 total
  • Maximum walltime: 14 days
  • Maximum nodes per job: 42 nodes
  • Maximum cores per job: 336 cores
  • Maximum number of jobs per queue: 596
  • Maximum number of jobs per user: 50
  • Direct submission: No

DEBUG queue properties

  • Nodes: 4 blades dedicated (q025-q028)
  • Maximum walltime: 30 minutes
  • Maximum nodes per job: 4 nodes
  • Maximum cores per job: 32 cores
  • Maximum number of jobs per queue: None
  • Maximum number of jobs per user: 2
  • Direct submission: Yes (qsub -q debug)

Note: The DEBUG queue is intended for short, quick-turnaround test jobs requiring less than 30 minutes of wall clock time.

Once code has been debugged, you may submit it to the other work queues via the normal qsub mechanism. You may allow the batch queue to dispatch your jobs to an appropriate queue, or select one of the user-selectable queues below.

HIMEM queue properties

  • Nodes: 28 himem (q113-q140) = 28 total
  • Maximum walltime: 14 days
  • Maximum nodes per job: 28 (= maximum in queue)
  • Maximum cores per job: 224 (= maximum in queue)
  • Maximum number of jobs per queue: 500
  • Maximum number of jobs per user: 50
  • Direct submission: Yes (qsub -q himem)

PG queue properties

  • Nodes: 230 iDataPlex nodes dedicated (pg1-pg230)
  • Maximum walltime: 14 days
  • Maximum nodes per job: 230 (= maximum in queue)
  • Maximum cores per job: 1,840 (= maximum in queue)
  • Maximum number of jobs per queue: 500
  • Maximum number of jobs per user: 250
  • Direct submission: Yes (qsub -q pg)

Note: The PG queue nodes are running RedHat Enterprise Linux 5.6, in contrast to other Quarry nodes, which are running RedHat Enterprise Linux 4.8. If you elect to submit jobs to these nodes, you should start with a single, small job, if possible, to determine if there will be problems running under a new OS release. Also, although the PG nodes are available for general use, priority is given to jobs from the Polar Grid project.

OSG queue properties (group restricted access)

  • Nodes: 16 blades dedicated (q009-q024)
  • Maximum walltime: 14 days
  • Maximum nodes per job: 16 (= maximum in queue)
  • Maximum cores per job: 128 (= maximum in queue)
  • Maximum number of jobs per queue: None
  • Maximum number of jobs per user: None
  • Direct submission: Not applicable

Note: Access to the OSG queue is restricted, and jobs are routed to this queue via Globus. They are not part of the normal queue system on Quarry.

If you do not specify a queue when you submit a job, TORQUE automatically places it in a queue whose limits it fits, based on the resources (e.g., number of nodes or CPUs) requested.
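To get a feel for which routed queue a request would fit, the limits listed above for SERIAL, NORMAL, and LONG can be encoded in a small function. This is a sketch under stated assumptions: suggest_queue is a hypothetical helper, it considers only node count and walltime (in whole hours), and it picks the smallest queue that fits. The actual routing rules on Quarry may weigh other criteria.

```shell
# Hypothetical helper: suggest_queue NODES WALLTIME_HOURS
# Limits are taken from the queue properties above:
#   serial: 1 node, 12 hours; normal: 6 nodes, 7 days; long: 42 nodes, 14 days
suggest_queue() {
    local nodes=$1 hours=$2
    if [ "$nodes" -le 1 ] && [ "$hours" -le 12 ]; then
        echo serial
    elif [ "$nodes" -le 6 ] && [ "$hours" -le 168 ]; then    # 7 days
        echo normal
    elif [ "$nodes" -le 42 ] && [ "$hours" -le 336 ]; then   # 14 days
        echo long
    else
        echo "none (request exceeds the routed queue limits)"
    fi
}

suggest_queue 1 10     # fits serial
suggest_queue 4 24     # fits normal
suggest_queue 10 200   # fits long
```

Requests that fit none of these may still be eligible for a user-selectable queue such as himem or pg, submitted with qsub -q.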

Jobs

Scripts

TORQUE most commonly handles job scripts, although interactive jobs are also supported. A job script may be as simple as a bash or tcsh shell script, but it may also include a number of TORQUE job directives. By default, TORQUE executes job scripts under your preferred login shell, so always begin a job script with a "shebang" line specifying the command interpreter it should run under, for example:

#!/bin/bash

TORQUE directives, which are lines beginning with the string #PBS, include switches for specifying such useful information as wall clock time required to complete the job, number of nodes and processors necessary, and filenames for job output and errors. These directives must be at the top of the script following the "shebang" line. An example TORQUE job script might look like this:

#!/bin/bash
#PBS -k o
#PBS -l nodes=4:ppn=2,walltime=30:00
#PBS -M username@indiana.edu
#PBS -m abe
#PBS -N JobName
#PBS -j oe
mpiexec -np 8 -machinefile $PBS_NODEFILE ~/bin/binaryname

Line by line, this script says:

  • Use bash as the command interpreter for this script.
  • Keep the job output.
  • This job requires four nodes, two processors per node, and 30 minutes of wall clock time.
  • Send job-related email to username@indiana.edu.
  • Send email if the job is aborted (a), when it begins (b), and when it ends (e).
  • The job name is JobName.
  • Join standard output and standard error.
  • Execute ~/bin/binaryname on eight processors (four nodes with two processors each) from the machines in $PBS_NODEFILE using mpiexec.

For additional details on TORQUE directives, view the man pages by entering man qsub.
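One easy mistake in scripts like the example is letting the -np value drift out of step with the nodes and ppn request. The following sketch generates a job script in which the process count is always computed as nodes times ppn. The function name write_job_script and its argument order are hypothetical; the directives are the ones used in the example script.

```shell
# Hypothetical helper: write_job_script NODES PPN WALLTIME PROGRAM
# Prints a TORQUE job script to standard output.
write_job_script() {
    local nodes=$1 ppn=$2 walltime=$3 program=$4
    local np
    np=$(( nodes * ppn ))    # total MPI processes = nodes x processors per node
    cat <<EOF
#!/bin/bash
#PBS -k o
#PBS -l nodes=${nodes}:ppn=${ppn},walltime=${walltime}
#PBS -N JobName
#PBS -j oe
mpiexec -np ${np} -machinefile \$PBS_NODEFILE ${program}
EOF
}

# Reproduce the request from the example: 4 nodes, 2 processors each, 30 minutes.
write_job_script 4 2 30:00 '~/bin/binaryname'
```

Redirect the output to a file (for example, write_job_script 4 2 30:00 '~/bin/binaryname' > job.script) before passing it to qsub.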

Submission

Submit jobs with the qsub command. If the command exits successfully, a job ID will be returned to standard output, for example:

[jdoe@Quarry]$ qsub job.script
123456.qm2
[jdoe@Quarry]$
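Because qsub writes the job ID to standard output, a wrapper script can capture it for later use with qstat or qdel. The sketch below uses the sample ID from the transcript instead of calling qsub, so it runs anywhere; the variable names are illustrative.

```shell
# In a real session you would capture the ID with:
#   job_id=$(qsub job.script)
job_id="123456.qm2"            # sample value taken from the transcript

job_number=${job_id%%.*}       # text before the first dot
server=${job_id#*.}            # text after the first dot

echo "job number: $job_number, server: $server"
```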

If you require attribute values different from the defaults, but less than the maximum allowed, specify these either in the job script with TORQUE directives, or on the command line with the -l switch. For example, to submit a job that needs more than the default 30 minutes of walltime on Quarry, use:

qsub -l walltime=10:00:00 job.script

Note that command-line arguments override directives in the job script, and that you may specify many attributes on the command line, either as comma-separated options following the -l switch, or each with its own -l switch. The following two commands are equivalent:

qsub -l ncpus=16,mem=1024mb job.script
qsub -l ncpus=16 -l mem=1024mb job.script
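When a wrapper script assembles resource requests, either form can be built from the same list of attributes. The helpers below are hypothetical names and only print the command lines; nothing is submitted.

```shell
# Join attributes with commas for a single -l switch.
join_resources() {
    local IFS=,
    echo "$*"
}

# Give each attribute its own -l switch.
separate_resources() {
    local r out=""
    for r in "$@"; do
        out="$out -l $r"
    done
    echo "${out# }"            # drop the leading space
}

echo "qsub -l $(join_resources ncpus=16 mem=1024mb) job.script"
echo "qsub $(separate_resources ncpus=16 mem=1024mb) job.script"
```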

Useful qsub switches include:

-q queue_name Submit the job to the named user-selectable queue
-r y|n Declare whether the job is rerunnable
-a date_time Execute the job only after date_time
-V Export environment variables in your current environment to the job
-I Run interactively, usually for testing purposes

See the qsub man page for more information.

Monitoring

The qstat command is useful for monitoring the status of a queued or running job. Switches include:

-u user_list Display jobs for users in user_list
-a Display all jobs
-r Display running jobs
-f Display full listing of jobs (excessive detail)
-n Display nodes allocated to jobs

For example, to see all the running jobs in the Quarry long queue, at the Quarry shell prompt, enter:

qstat -r long | less

Another useful command for monitoring jobs is showq, which is provided by the Moab scheduler. To list the queued jobs in dispatch order, enter:

showq -i

For more, see the showq man page.
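Scripts that need a job's current state can extract the job_state field from the full listing produced by qstat -f. The sketch below defines a hypothetical filter, job_state, and runs it on an illustrative excerpt rather than real qstat output, so the sample text is an assumption about the general layout and not a capture from Quarry.

```shell
# Hypothetical filter: read "qstat -f" output on stdin, print the job state.
job_state() {
    awk -F' = ' '$1 ~ /job_state/ { print $2; exit }'
}

# Illustrative excerpt only; a real listing contains many more fields.
sample='Job Id: 123456.qm2
    Job_Name = JobName
    job_state = R
    queue = long'

printf '%s\n' "$sample" | job_state
```

In a real session, the equivalent would be qstat -f 123456.qm2 | job_state, where states such as Q (queued), R (running), and C (completed) are the ones most often seen.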

Deleting

Use the qdel command to delete queued or running jobs. Occasionally, a node will become unresponsive to the point that it cannot respond to the TORQUE server's requests to kill a job. In that case, try adding the -W (uppercase W) force option to qdel. If that doesn't work, email High Performance Systems for help.

Additional documentation

This is document avmy in domain all.
Last modified on December 17, 2011.
