Indiana University
University Information Technology Services
  

Running MULTICLUSTAL on Big Red at IU


General information

MULTICLUSTAL searches for parameters that optimize the alignment of a set of sequences. It does so by maximizing a quality function that rewards identical amino acids and conservative substitutions and penalizes gaps and islands, with small islands penalized more heavily than large ones. In its search, MULTICLUSTAL tries several substitution matrices, a range of gap-open penalties, and a range of gap-extension penalties. Details are available in Yuan et al. (1999, Bioinformatics 15:862-863).
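To illustrate the idea of searching over matrices and gap penalties, here is a minimal Python sketch. It is not MULTICLUSTAL's code: the function name best_parameters is hypothetical, the score argument is a stand-in for "run the aligner and evaluate the quality function", and trying every combination exhaustively is an assumption made for the example.

```python
from itertools import product


def best_parameters(matrices, gap_open_values, gap_ext_values, score):
    """Return the (matrix, gap_open, gap_ext) combination with the highest score.

    Illustration only. `score` is a caller-supplied function that stands in
    for running an alignment with the given parameters and rating the result.
    Trying every combination is an assumption, not documented behavior.
    """
    candidates = product(matrices, gap_open_values, gap_ext_values)
    return max(candidates, key=lambda params: score(*params))
```

Using smaller steps between penalty values means more candidate combinations and therefore more alignment runs to score.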

On Big Red at Indiana University, MULTICLUSTAL uses a parallel version of CLUSTAL W. MULTICLUSTAL is installed at:

/N/soft/whatami/multiclustal-1.1

The script multiclustaljob is available for submitting parallel batch jobs. This page describes how to use MULTICLUSTAL on Big Red.

Note: MULTICLUSTAL is copyrighted by Merck & Co., Inc. It has been modified with permission to run the parallel version of CLUSTAL W in a batch scheduling environment. Neither IU nor users of MULTICLUSTAL on Big Red are at liberty to distribute the modified version.

For more information about the availability of software on the Indiana University shared central systems, see "At IU, what software is available on the research computing systems, and how may I request that software be added?"

Preparation

The data file that contains sequences to be aligned must be alone in a directory. MULTICLUSTAL creates and reads from many files. By placing the data file in its own directory, you reduce clutter and prevent MULTICLUSTAL from reading inappropriate files.

If you have used MULTICLUSTAL to align sequences in a data file and you wish to run it again, delete all files other than the data file before rerunning MULTICLUSTAL. If you would like to keep the old output, copy it to some other directory.
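If you want to keep earlier results, you can script the cleanup. The following is a minimal Python sketch, not part of MULTICLUSTAL. The function name archive_old_output and the choice of archive directory are hypothetical, and the sketch assumes the archive directory is new or empty.

```python
import shutil
from pathlib import Path


def archive_old_output(data_file, archive_dir):
    """Move every entry except the data file out of the data file's directory.

    Returns the sorted names of the entries that were moved.
    """
    data_name = Path(data_file).name
    run_dir = Path(data_file).parent.resolve()
    archive = Path(archive_dir).resolve()

    # The archive must be outside the run directory; otherwise the old
    # output would still sit next to the data file.
    if archive == run_dir or run_dir in archive.parents:
        raise ValueError("archive_dir must be outside the data file's directory")

    archive.mkdir(parents=True, exist_ok=True)
    moved = []
    for entry in sorted(run_dir.iterdir()):
        if entry.name == data_name:
            continue  # leave the data file in place
        shutil.move(str(entry), str(archive / entry.name))
        moved.append(entry.name)
    return moved
```

For example, archive_old_output("run1/aaseqs", "run1-old") would leave only aaseqs in run1 and place everything else in run1-old.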

Note: The name of the data file may not contain an underscore ( _ ).

The data file must contain sequences in FASTA format. Names of sequences in the file must be alphanumeric (i.e., letters and numbers only). Short names are ideal. Long names may cause problems. A sign of trouble with sequence names is that sequences are lost in the analysis (i.e., the result file contains fewer sequences than the data file).
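You can check these requirements before you submit a job. The following is a minimal Python sketch, not a tool supplied with MULTICLUSTAL. The function name check_multiclustal_input is hypothetical, and it assumes that a sequence's name is the first whitespace-delimited token after the ">" on a FASTA header line.

```python
from pathlib import Path


def check_multiclustal_input(data_file):
    """Return a list of problems found with a MULTICLUSTAL data file.

    Assumption (not from the MULTICLUSTAL documentation): a sequence's
    name is the first whitespace-delimited token after '>' on a header line.
    """
    path = Path(data_file)
    problems = []

    # The name of the data file may not contain an underscore.
    if "_" in path.name:
        problems.append(f"file name contains an underscore: {path.name}")

    # The data file must be alone in its directory.
    others = sorted(p.name for p in path.parent.iterdir() if p.name != path.name)
    if others:
        problems.append("directory also contains: " + ", ".join(others))

    # Sequence names must consist of letters and numbers only.
    header_count = 0
    for line in path.read_text().splitlines():
        if line.startswith(">"):
            header_count += 1
            tokens = line[1:].split()
            name = tokens[0] if tokens else ""
            if not (name.isascii() and name.isalnum()):
                problems.append(f"sequence name is not alphanumeric: {name!r}")

    if header_count == 0:
        problems.append("no FASTA header lines (starting with '>') found")
    return problems
```

An empty list means that none of these checks found a problem. The sketch does not test name length, because this page gives no specific limit.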

Running multiclustaljob

Use the multiclustaljob script to submit jobs that run MULTICLUSTAL. The multiclustaljob script should be in your path by default, and its manual page should be in your default path for manual pages. Syntax for multiclustaljob is:

multiclustaljob options_to_multiclustal -CPUS n -wallhours h

Replace options_to_multiclustal with the name of the data file and any command-line options for MULTICLUSTAL, n with the number of processors to use, and h with the maximum number of hours the job should be allowed to run. If you omit the -CPUS option, 4 processors will be used. To request more than 4 processors, specify an integer that is a multiple of 4; a value that is not a multiple of 4 will be increased to the next multiple of 4. If you omit the -wallhours option, your job will be allowed to run for two hours. For example, to use 16 processors to align amino acid sequences from a file aaseqs for up to three hours, run:

multiclustaljob aaseqs -CPUS 16 -wallhours 3
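To see how many processors a given -CPUS value will actually request, the rounding rule can be expressed in a few lines of Python. This is a sketch of the rule as described on this page, not code from multiclustaljob. The function name effective_cpus is hypothetical, and rejecting values below 1 is an assumption, since this page does not say how the script handles them.

```python
def effective_cpus(requested=None):
    """Return the processor count implied by a -CPUS value.

    None models omitting -CPUS, which gives the default of 4.
    Raising on values below 1 is an assumption made for this sketch.
    """
    if requested is None:
        return 4
    if requested < 1:
        raise ValueError("processor count must be at least 1")
    # Round up to the next multiple of 4 (multiples of 4 are unchanged).
    return ((requested + 3) // 4) * 4
```

With this rule, a request for 16 processors stays at 16, while a request for 10 becomes 12.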

When you run multiclustaljob, you'll receive a message when your job is submitted to the queue, and another when the job finishes. To check the status of your job, use the llq command.

The -deep option

MULTICLUSTAL has only one option (-deep). The -deep option reduces the step sizes that MULTICLUSTAL uses to traverse the range of gap-open penalties and the range of gap-extension penalties as it searches for optimum parameters. If you use the -deep option, it must follow the name of the data file. For example:

multiclustaljob proteinseqs -deep -CPUS 32 -wallhours 5
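If you generate submissions from a script, it helps to assemble the arguments in one place so that -deep always follows the name of the data file. The following is a minimal Python sketch. The function name build_multiclustaljob_command is hypothetical, and only the options described on this page are used.

```python
def build_multiclustaljob_command(data_file, deep=False, cpus=None, wallhours=None):
    """Return the argument list for a multiclustaljob submission.

    Options left as None are omitted, so multiclustaljob applies its
    defaults for them.
    """
    command = ["multiclustaljob", data_file]
    if deep:
        # -deep must follow the name of the data file.
        command.append("-deep")
    if cpus is not None:
        command += ["-CPUS", str(cpus)]
    if wallhours is not None:
        command += ["-wallhours", str(wallhours)]
    return command
```

Joining the list returned by build_multiclustaljob_command("proteinseqs", deep=True, cpus=32, wallhours=5) with spaces reproduces the command shown above. On a system where multiclustaljob is in your path, you could pass the list to subprocess.run.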

Output

MULTICLUSTAL produces a large number of output files: during the parameter search, it runs CLUSTAL W many times and keeps all of the output. A file named Final_score contains a running summary of progress and identifies which parameter set (and file) is associated with the highest-scoring alignment. In addition to its output files, MULTICLUSTAL creates a file of system and error messages with a name similar to multiclustaljob.99999.0.err, where 99999 is the number of your job. Check that file first if it is the only file your job produces.

Known bug

MULTICLUSTAL is known to hang at times because of an issue with a program named BOXSHADE, which it uses to parse the output of CLUSTAL W. If it hangs, your job will sit idle until the job scheduler kills it for exceeding its allotted time. To spot a hang early, monitor the accumulation of output files in the directory that contains your data file; if new files stop appearing, the job may have hung.
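One way to monitor the directory is to track how many files it holds and how long ago the most recent one changed. The following is a minimal Python sketch. The function names and the quiet_seconds threshold are hypothetical, and this page gives no guidance on how long a pause is normal, so a quiet directory is a hint of a hang, not proof.

```python
import time
from pathlib import Path


def output_activity(run_dir, now=None):
    """Return (file_count, seconds_since_last_change) for a run directory.

    seconds_since_last_change is None when the directory holds no files.
    """
    now = time.time() if now is None else now
    mtimes = [p.stat().st_mtime for p in Path(run_dir).iterdir() if p.is_file()]
    if not mtimes:
        return 0, None
    return len(mtimes), now - max(mtimes)


def looks_stalled(run_dir, quiet_seconds, now=None):
    """Return True if no file has changed for more than quiet_seconds.

    quiet_seconds is a threshold you choose; it is not a documented value.
    """
    _, idle = output_activity(run_dir, now)
    return idle is not None and idle > quiet_seconds
```

Calling output_activity on the directory every few minutes and comparing the file counts shows whether output is still accumulating.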

This document was developed with support from the National Science Foundation (NSF) under Grant No. 0503697 to the University of Chicago and subcontracted to Indiana University. Additional support was provided by IU through its participation in the TeraGrid, which is supported by the NSF under Grants No. 0833618, SCI451237, SCI535258, and SCI504075. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the NSF.

This is document awvx in domains all and tgrid-all.
Last modified on June 25, 2008.