Indiana University
University Information Technology Services
  

Using mpiBLAST on Quarry

Overview

mpiBLAST can perform parallel searches of National Center for Biotechnology Information (NCBI) BLAST databases. At Indiana University, mpiBLAST-1.5.0-PIO is installed on Quarry. On the one hand, mpiBLAST speeds querying by segmenting both the query file and the database and using multiple CPUs. On the other hand, mpiBLAST does not provide all the features of NCBI BLAST; for example, the output files are available in only three formats. UITS recommends using mpiBLAST only when the number of sequences in the database multiplied by the number of sequences in the query file is equal to or higher than 100,000,000. Otherwise, NCBI BLAST on Big Red is a better choice. For more details, see the mpiBLAST Overview page.
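The rule of thumb above can be sketched as a small shell check. This is only an illustration: the function names are hypothetical, and counting FASTA header lines (lines starting with ">") is assumed to be an acceptable way to count sequences.

```shell
# Hypothetical helper: count sequences in a FASTA file by counting
# header lines, which start with ">".
count_fasta_sequences() {
    grep -c '^>' "$1"
}

# Hypothetical helper: apply the 100,000,000 threshold from the text.
# Usage: recommend_blast DB_SEQUENCE_COUNT QUERY_SEQUENCE_COUNT
recommend_blast() {
    db_count=$1
    query_count=$2
    if [ $((db_count * query_count)) -ge 100000000 ]; then
        echo "mpiBLAST"
    else
        echo "NCBI BLAST"
    fi
}

# Example with made-up counts: 50,000 database sequences and 2,500 queries
# give a product of 125,000,000, so this prints "mpiBLAST".
recommend_blast 50000 2500
```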

For help with Quarry, see Getting started on Quarry and Quarry usage policies.

The mpiformatdbNblastjob script

mpiformatdbNblastjob is a custom script for submitting mpiBLAST jobs that will run for more than 20 minutes on Quarry. With the mpiformatdbNblastjob command, you can submit one batch job that contains two steps: first mpiformatdbjob (a serial job), then mpiblastjob (a parallel job). The mpiformatdbjob script submits a serial job that executes mpiformatdb, which wraps the standard NCBI BLAST formatdb command. The mpiblastjob script submits a parallel job that executes mpiblast, which wraps the standard NCBI BLAST blastall command. The arguments for mpiformatdbNblastjob, mpiformatdbjob, and mpiblastjob are similar to the options you would use with formatdb and blastall, except for a few new options, such as -N to specify the number of database fragments. See the NCBI's Program parameters for blastall to learn about the options you can use in the script.

Using a preformatted large-volume database in mpiblastjob

To save you time, UITS has installed preformatted large databases in a publicly accessible directory (see Preformatted publicly installed databases).

The simplest way to use them is to specify the four required arguments of the mpiformatdbNblastjob script, plus a wallclock limit: the program (-prog), database (-db), query file (-query), output file (-result), and wallclock hours (-WT2). Following is an example:

mpiformatdbNblastjob -prog blastn -db nt -query testnt50 -result testnt50cpus8 -- -WT2 20

Frequently asked questions

Options in mpiformatdbNblastjob script

Only four options (-db, -prog, -query, -result) are required for the mpiformatdbNblastjob script; the others are optional. Following are the available options:

-db filename_to_build_database Required. The name of the input file that will be used to build the database. This is equivalent to the -i option in the formatdb command.
-prog blast_program Required. This is equivalent to the -p option in blastall command. Supported programs are blastn, blastp, blastx, tblastn, and tblastx.
-query query_sequence_filename Required. The name of the query file. This is equivalent to the -i option in the blastall command.
-result output_filename Required. The name of the output file. This is equivalent to the -o option in the blastall command.
-N n Optional. Specify the number of fragments into which formatdb will split the input file to build the database. The default is 4. This option is new in mpiBLAST.
-WT1 n Optional. Specify an integer limit in hours of the mpiformatdb job to build the database. The default is one hour.
-d database_name Optional. The name of the database built by mpiformatdb. The default is the same as the input file. This is equivalent to the -o option in the formatdb command.
-m 8|9|0 Optional. Specify the output format of mpiblast. The default is 8. Supported formats are a tab-delimited table without (8) or with (9) header comment lines, and the normal NCBI BLAST output (0).
-removedb T|F Optional. Clean the fragment files of database and query files after the current job is done. The default is F, because if you submit multiple jobs using the same database, when the first job finishes the database will be removed, which will cause trouble for the other jobs.
-debug T|F Optional. Turn on debug and generate a debug file in the current working directory. The default is T. If you suspect any problems with mpiBLAST, please attach the debug file when you email rtls at iu.edu.
-CPUS n Optional. Specify the number of processes to be used in mpiblastjob. The default is 4. If the number is smaller than 4, it will automatically increase to 4. If it is larger than 128, it will automatically decrease to 128.
-WT2 n Optional. Specify an integer that is the maximum hours the mpiBLAST job will be allowed to run. The default is 1 hour, and the maximum is 336 hours.

Following is an example to submit an mpiBLAST job using the mpiformatdbNblastjob script:

mpiformatdbNblastjob -db mito.nt -prog blastn -query quarry.fasta -result mpiblastjobtest.psl

For complete information about options, at the command prompt on Quarry, enter man mpiformatdbNblastjob .

Preformatted publicly installed databases

All the preformatted databases are located in /N/emboss/ncbi_blastdb. The path to these databases is automatically defined when you use the mpiformatdbNblastjob script, so you don't need to specify the location. The number of fragments for each database is listed in the following table. The number of CPUs to request is defined in mpiformatdbNblastjob.

Reference any of the following databases using the name in the "Database" column (e.g., -d nt).

Database | Type | Description | Source | No. of pieces
nr | protein | All nonredundant protein GenBank CDS (translations+PDB+SwissProt+PIR) | NCBI | 5
pataa | protein | Protein sequences from GenBank Patent division | NCBI | 1
nt | nucleotide | All nonredundant nucleotide sequences (GenBank+EMBL+DDBJ+PDB, but no EST, STS, GSS, or HTGS sequences) | NCBI | 40
patnt | nucleotide | Nucleotide sequences from GenBank Patent division | NCBI | 7
htgs | nucleotide | Unfinished high-throughput genomic sequences (HTGS) | NCBI | 40
swissprot | protein | Protein sequences from the Swiss-Prot database | NCBI | 1

Deciding the number of database partitions in mpiformatdbjob

When the database is larger than core memory, additional disk I/O causes a significant decrease in performance. However, the time required to format the database and output results increases with the number of fragments used, independent of the number of processors requested.

Each node of Quarry has 8 CPUs that share 8 GB of memory; about 7 GB is available at run time. The amount of memory a process needs to store its assigned sequence data and intermediate search results varies with the query, database, and search type. UITS recommends 537 MB for each database fragment, which is 60% of the memory available per CPU (7 GB shared by 8 CPUs is about 896 MB each).
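One reading that reproduces the 537 MB figure is 60% of the per-CPU share of the roughly 7 GB available on a node. The sketch below works through that arithmetic; the function name is hypothetical and the per-CPU interpretation is an assumption, not something the scripts compute for you.

```shell
# Hypothetical helper: per-fragment memory target in MB, using integer math.
# Usage: fragment_target_mb AVAILABLE_GB CPUS_PER_NODE PERCENT
fragment_target_mb() {
    avail_gb=$1
    cpus=$2
    percent=$3
    # Convert GB to MB, take the percentage, then divide among the CPUs.
    echo $(( avail_gb * 1024 * percent / 100 / cpus ))
}

# 7 GB available, 8 CPUs per node, 60% of each CPU's share: prints 537.
fragment_target_mb 7 8 60
```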

The mpiformatdbjob script builds database fragments. The options for mpiformatdbjob are as follows:

Usage: mpiformatdbjob
  -i filename_to_build_database
  -p <T|F>   Format of inputfile. Protein is T, otherwise F
  -o <T|F>   With database output file is T, otherwise F
  -N <n>     Number_of_fragments_of_database
  [ -- -wallclocklimit hh:mm:ss -jobname name_of_job -notify a|b|e ]
             Options passed to TORQUE; refer to serialjob for details

Following is an example:

  1. Check the file's size with the command ls -altr filename . The results will be similar to the following:

     ls -altr nr
     -rw-r--r-- 1 jdoe hpc 3860088383 Oct 6 15:26 nr
  2. Calculate the ideal number of fragments: 3860088383/1024/1024/537 ≈ 6.9, which this example truncates to 6.

Therefore, request to partition the nr database into six fragments:

mpiformatdbjob -N 6 -i nr -o T -p T

Each partition is about 614 MB, slightly above the 537 MB target; requesting seven fragments would keep each one under it.
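The fragment arithmetic above can be sketched in shell. For the nr file the quotient is about 6.86, so truncating gives the 6 used in the example, while rounding up gives 7 and keeps every fragment at or below the target. Both function names are hypothetical, and which rounding is better for a given database is left as a judgment call.

```shell
# Hypothetical helpers: number of fragments for a file of SIZE_BYTES,
# given a per-fragment target in MB (default 537, as recommended above).

# Truncated quotient, as in the worked example.
fragments_truncated() {
    size_bytes=$1
    target_mb=${2:-537}
    echo $(( size_bytes / (target_mb * 1024 * 1024) ))
}

# Quotient rounded up, so no fragment exceeds the target.
fragments_rounded_up() {
    size_bytes=$1
    target_mb=${2:-537}
    target_bytes=$(( target_mb * 1024 * 1024 ))
    echo $(( (size_bytes + target_bytes - 1) / target_bytes ))
}

fragments_truncated 3860088383    # prints 6
fragments_rounded_up 3860088383   # prints 7
```

The size in bytes can be read from the ls output shown above, or with wc -c < filename.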

Please note that the nr file is about 3.6 GB, which is less than 7 GB, the memory available on a single node. If the database you want to format is larger than 7 GB (the core memory), you must use the --skip-reorder option, which is implemented only in the mpiformatdbjob script:

mpiformatdbjob --skip-reorder -N 40 -i nr -o T -p F -- -WT1 3

If you have a database larger than 67 GB, you need to split it into smaller pieces, because 67 GB is the largest database that mpiBLAST runs well on with 128 CPUs, the maximum number of CPUs to which mpiBLAST can properly scale.
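The two size thresholds above (7 GB and 67 GB) can be summarized in a small check. This is a sketch only: the function name and the returned labels are hypothetical, and 1 GB is assumed to mean 1024^3 bytes, consistent with the division by 1024 used earlier.

```shell
# Hypothetical helper: classify a database file by its size in bytes.
#   "default"       up to 7 GB: format as usual
#   "skip-reorder"  over 7 GB: use the --skip-reorder option
#   "split"         over 67 GB: split the database into smaller pieces
db_size_advice() {
    size_bytes=$1
    gb=$(( 1024 * 1024 * 1024 ))
    if [ "$size_bytes" -gt $(( 67 * gb )) ]; then
        echo "split"
    elif [ "$size_bytes" -gt $(( 7 * gb )) ]; then
        echo "skip-reorder"
    else
        echo "default"
    fi
}

# The nr file from the example (3860088383 bytes) prints "default".
db_size_advice 3860088383
```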

Deciding how many CPUs to request in mpiblastjob

Both mpiformatdbNblastjob and mpiblastjob use four CPUs by default to execute mpiBLAST. You can request a different number of CPUs. In general, the only way to optimize performance and resource usage is to experiment. In practice, if you have a big database or a big query file, UITS suggests the following:

  • If the database is big, the number of CPUs to request is two plus the number of database fragments. The minimum number is four, because requesting fewer than four CPUs to run mpiBLAST is just like running NCBI BLAST using one CPU and wasting computing resources.

    Following the above example, the nr database is composed of six partitions. Therefore, you would request eight CPUs:

    mpiformatdbNblastjob -prog blastn -db nr -query test -m 8 -result test.psl -- -CPUS 8 -WT2 24
  • If the query file is big (at least 2,000 sequences), you can request N CPUs, where N=n*(F+1)+1. F is the number of database fragments and n is an integer.

    In most cases, the above formula is unnecessary. However, following is an example using n=3: N=3*(6+1)+1=22. Therefore, you would request 22 CPUs:

    mpiformatdbNblastjob -prog blastn -db nr -query test -m 8 -result test.psl -- -CPUS 22 -WT2 9

In both situations above, this version of mpiBLAST scales properly to at most 128 CPUs. If you request more than 128 CPUs per job, unexpected errors could occur. For query files containing more than 20,000 sequences, see Dealing with large query files.
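Both suggestions can be expressed with the formula N=n*(F+1)+1, since n=1 gives F+2. The sketch below computes N and applies the limits of 4 and 128 CPUs mentioned above; the function name is hypothetical.

```shell
# Hypothetical helper: number of CPUs to request.
# Usage: cpus_for_job FRAGMENTS [n]
#   FRAGMENTS  number of database fragments (F)
#   n          integer multiplier for large query files (default 1)
cpus_for_job() {
    fragments=$1
    n=${2:-1}
    cpus=$(( n * (fragments + 1) + 1 ))
    # Stay within the range the text describes: at least 4, at most 128.
    if [ "$cpus" -lt 4 ]; then
        cpus=4
    fi
    if [ "$cpus" -gt 128 ]; then
        cpus=128
    fi
    echo "$cpus"
}

cpus_for_job 6      # big database, six fragments: prints 8
cpus_for_job 6 3    # big query file, n=3: prints 22
```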

To view usage information for mpiblastjob, enter mpiblastjob at the prompt:

Usage: /N/soft/linux-rhel4-x86_64/local-utils/bin/mpiblastjob
  -p <blastn|blastp|blastx|tblastn|tblastx>   Program_name_of_blast
  -i query_sequence_file
  -d database
  -o output_file
  -m <8|9|0>            Only format 8, 9, and 0 are supported.
  [ -removedb <T|F> ]   Remove database fragments after current job is done. Default is F.
  [ -debug <T|F> ]      Turn on debug to generate a file with information for debugging purposes. Default is F (turned off).
  [ -- -CPUS <n> -wallhours <n> -jobname jobname ]
                        Other options that start with -- and send arguments to TORQUE to schedule the job; refer to serialjob for details.

For details, enter man mpiblastjob .

To make sure all your jobs will be queued, see the Quarry usage policies before submitting multiple mpiBLAST jobs on Quarry.

Deciding the walltime hours

Choosing an appropriate number of walltime hours is difficult. Research shows that the running time for different searches is highly irregular. Even ignoring program optimization and data transfer, different database partitions, numbers of CPUs requested, and query files yield computation times that differ by orders of magnitude. The only suggestion is to run a small-scale trial with NCBI BLAST to get a rough idea. Then multiply the trial time by the size of the query file, divide by the number of CPUs, and add a few extra hours (or even more time) to get the walltime hours. UITS will update this page in the future with more reference times generated from test runs.
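The estimate described above can be sketched as follows. The function name and its parameters are hypothetical, and the sketch assumes the trial was timed in seconds on a small sample of query sequences and that run time scales linearly with the number of sequences, which the text warns is only a rough approximation. The result is capped at 336 hours, the maximum accepted by -WT2.

```shell
# Hypothetical helper: rough walltime estimate in whole hours.
# Usage: estimate_walltime_hours TRIAL_SECONDS TRIAL_SEQS TOTAL_SEQS CPUS [MARGIN_HOURS]
estimate_walltime_hours() {
    trial_seconds=$1
    trial_seqs=$2
    total_seqs=$3
    cpus=$4
    margin_hours=${5:-3}
    numerator=$(( trial_seconds * total_seqs ))
    denominator=$(( trial_seqs * cpus * 3600 ))
    # Round the scaled time up to whole hours, then add the safety margin.
    hours=$(( (numerator + denominator - 1) / denominator + margin_hours ))
    if [ "$hours" -gt 336 ]; then
        hours=336
    fi
    echo "$hours"
}

# Made-up numbers: a 10-sequence trial took 600 seconds; the real query file
# has 20,000 sequences, the job uses 8 CPUs, and the margin is 2 hours.
estimate_walltime_hours 600 10 20000 8 2   # prints 44
```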

Different e-values

NCBI BLAST started offering two executables, blastall and blastall_old, with version 2.2.13. The two BLAST searches generate slightly different results. mpiBLAST is based on the latter; thus, the e-values are different from those generated by blastall. However, NCBI has confirmed that blastall_old is still trustworthy. For more, see this information from NCBI about the difference between two BLAST versions.

Output format of mpiBLAST vs. blastall

Not all blastall options are available in mpiBLAST. In particular, the options for formatting the output file are limited. For example, you can use only -m 8 or -m 9 to produce a tab-delimited table without or with header comment lines, respectively, or -m 0 to produce normal blastall output. The tab-delimited table contains the following fields, in this order:

  • Identity of query sequence
  • Identity of subject sequence (matching sequence in database)
  • Percent identity
  • Alignment length
  • Number of mismatches
  • Number of gaps
  • Start of query sequence
  • End of query sequence
  • Start of subject sequence
  • End of subject sequence
  • E-value
  • Bit-score

For details about the available formats, see the FAQ and User's Guide on the mpiBLAST web site.
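Because the tab-delimited output has a fixed field order, it is easy to filter with awk. The sketch below keeps hits whose percent identity (field 3) meets a minimum and prints the query, subject, percent identity, and e-value (fields 1, 2, 3, and 11). The function name and the 90% default are hypothetical; lines starting with # are skipped so the same filter works on output that includes header comment lines.

```shell
# Hypothetical helper: filter tab-delimited mpiBLAST output by percent identity.
# Usage: strong_hits [MIN_IDENTITY] < result_file
strong_hits() {
    awk -F '\t' -v min="${1:-90}" '
        # Skip comment lines, then test field 3 (percent identity).
        !/^#/ && $3 >= min { print $1 "\t" $2 "\t" $3 "\t" $11 }
    '
}

# Two made-up hits; only the first has at least 90% identity.
printf 'q1\ts1\t98.50\t200\t3\t0\t1\t200\t10\t209\t1e-100\t380\n' > sample_hits.tsv
printf 'q2\ts2\t85.00\t150\t20\t2\t1\t150\t5\t154\t2e-30\t120\n' >> sample_hits.tsv
strong_hits 90 < sample_hits.tsv
rm -f sample_hits.tsv
```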

Dealing with large query files

Although mpiBLAST manages to count and distribute query sequences to each worker processor, it loads all the input queries into memory at startup, resulting in a limitation on efficient use of mpiBLAST. If the input query file is too big, it will significantly slow the loading function. The current version of mpiBLAST can efficiently handle 20,000 to 30,000 nucleotide sequences in one query file. If your query file is larger than this, split it into several files and run mpiBLAST separately on each one.
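One way to split a large FASTA query file is with awk, starting a new output file every fixed number of sequences. This is a minimal sketch, not a UITS-provided tool: the function name, the output naming scheme (PREFIX.001.fasta, PREFIX.002.fasta, and so on), and the file names in the usage comment are all hypothetical.

```shell
# Hypothetical helper: split a FASTA file into pieces of at most PER_FILE sequences.
# Usage: split_fasta INPUT_FASTA PER_FILE OUTPUT_PREFIX
split_fasta() {
    input=$1
    per_file=$2
    prefix=$3
    awk -v n="$per_file" -v prefix="$prefix" '
        /^>/ {
            # Start a new output file every n header lines.
            if (count % n == 0) {
                if (out != "") close(out)
                out = sprintf("%s.%03d.fasta", prefix, ++part)
            }
            count++
        }
        # Write every line that belongs to a sequence record.
        out != "" { print > out }
    ' "$input"
}

# Example (file names are placeholders):
#   split_fasta big_query.fasta 20000 query_part
```

Each resulting file can then be submitted as its own mpiBLAST job.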

Using scratch space

Quotas on disk space exist for home directories. UITS will not increase quotas for individual users to many hundreds of megabytes or to gigabytes. To use large amounts of storage space for short periods of time, use scratch space, /N/dc/scratch/username.

Scratch space is for temporary use and depends on the honor system. Official policy is that any file more than 60 days old will be deleted, following user notification. Make sure to back up your data to a long-term storage system.
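To see which of your scratch files are approaching the age limit, the standard find command can list files by modification time. This is a sketch only: the function name is hypothetical, and it assumes the policy is applied by last-modified time, which the text does not state.

```shell
# Hypothetical helper: list regular files under DIR last modified more than
# DAYS days ago (default 60).
# Usage: list_old_files DIR [DAYS]
list_old_files() {
    find "$1" -type f -mtime +"${2:-60}"
}

# Example (replace username with your own):
#   list_old_files /N/dc/scratch/username
```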

Running mpiBLAST interactively

You can run mpiBLAST interactively from the command line on Quarry's interactive nodes (b005-b009) to test your settings before you submit large jobs. Write a machinefile containing the names of the processors you want to request; you can then copy the whole example directory /N/soft/linux-rhel4-x86_64/mpiblast-1.5.0-pio-openmpi/test/ to run a test.

Following is an example on Quarry. Lines preceded by a hash ( # ) are comments, for your information only. Enter each command at the prompt:

# log in to interactive node b005
ssh b005
mkdir /N/dc/scratch/xinhong/mpiblast/
cd /N/dc/scratch/xinhong/mpiblast/
# copy examples directory to current directory
cp -R /N/soft/linux-rhel4-x86_64/mpiblast-1.5.0-pio-openmpi/test/ .
cd test
# Edit softenv, if you don't use mpiformatdbNblastjob, mpiformatdb, or mpiblast script
echo +Pblast >> $HOME/.soft
echo @openmpi >> $HOME/.soft
resoft
# run formatdb to build a mito.nt database into 10 pieces
mpiformatdb -N 10 -i mito.nt -o T
# run mpiblast with 4 cpus; you may need to change mf4 according to which interactive node you logged in to
mpiexec -n 4 -machinefile mf4 mpiblast -p blastn -i quarry.fasta -d mito.nt -m 9 -o interactive4-openmpi.psl --use-parallel-write

You should see a result file called interactive4-openmpi.psl in about five minutes. If your actual job will take more than 20 minutes of processor time, use the mpiblastjob script as described above.

Manual pages

Unix manual pages provide reference information about the scripts mpiformatdbNblastjob, mpiformatdbjob, and mpiblastjob. To access them, use the man command.

This is document axun in domain all.
Last modified on November 23, 2011.
