Using mpiBLAST on Quarry
On this page:
- Overview
- The mpiformatdbNblastjob script
- Using a preformatted large-volume database in mpiblastjob
- Frequently asked questions
  - Options in mpiformatdbNblastjob
  - Preformatted publicly installed databases
  - Deciding the number of database partitions in mpiformatdbjob
  - Deciding how many CPUs to request in mpiblastjob
  - Deciding the walltime hours
  - Different e-values
  - Output format of mpiBLAST vs. blastall
  - Dealing with large query files
  - Using scratch space
  - Running mpiBLAST interactively
  - Manual pages
Overview
mpiBLAST can perform parallel searches of National Center for Biotechnology Information (NCBI) BLAST databases. At Indiana University, mpiBLAST-1.5.0-PIO is installed on Quarry. On the one hand, mpiBLAST speeds querying by segmenting both the query file and the database and using multiple CPUs. On the other hand, mpiBLAST does not provide all the features of NCBI BLAST; for example, the output files are available in only three formats. UITS recommends using mpiBLAST only when the number of sequences in the database multiplied by the number of sequences in the query file is equal to or higher than 100,000,000. Otherwise, NCBI BLAST on Big Red is a better choice. For more details, see the mpiBLAST Overview page.
For help with Quarry, see Getting started on Quarry and Quarry usage policies.
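The 100,000,000 threshold described above can be checked before choosing a tool. The following is a minimal sketch, assuming both the database and the query are FASTA files in which every record starts with a ">" header line; the function names count_seqs and choose_blast are hypothetical helpers, not part of mpiBLAST or the Quarry scripts.

```shell
# Count FASTA records: each record starts with a '>' header line.
count_seqs() {
    grep -c '^>' "$1"
}

# Print "mpiBLAST" when db_seqs * query_seqs reaches the 100,000,000
# threshold, otherwise print "NCBI BLAST".
# $1 = sequences in the database, $2 = sequences in the query file
choose_blast() {
    db_seqs=$1
    query_seqs=$2
    if [ $((db_seqs * query_seqs)) -ge 100000000 ]; then
        echo "mpiBLAST"
    else
        echo "NCBI BLAST"
    fi
}

choose_blast 50000 2000    # product is exactly 100,000,000
choose_blast 50000 1999    # product is just below the threshold
```

With real files, the two counts could be supplied as choose_blast "$(count_seqs database.fa)" "$(count_seqs query.fa)", where both file names are placeholders.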
The mpiformatdbNblastjob script
mpiformatdbNblastjob is a custom script for submitting
mpiBLAST jobs that will run for more than 20 minutes on Quarry. Using
the mpiformatdbNblastjob command, you can submit one
batch job that contains two steps: first, mpiformatdbjob
(a serial job) and second, mpiblastjob (a parallel
job). The mpiformatdbjob script submits a serial job that
executes mpiformatdb, which wraps the standard
NCBI BLAST formatdb command. The mpiblastjob
script submits a parallel job that executes mpiblast,
which wraps the standard NCBI BLAST blastall
command. Useful arguments for mpiformatdbNblastjob,
mpiformatdbjob, and mpiblastjob are similar
to the options you would use with formatdb and
blastall, except for a few new options, such as
-N to specify the number of database fragments.
See the NCBI's Program
parameters for blastall to learn about the options you can use in
the script.
Using a preformatted large-volume database in mpiblastjob
To save you time, UITS has installed preformatted large databases (see Preformatted publicly installed databases) in a publicly accessible directory.
The simplest way to use them is to specify the four required arguments and the walltime limit for the mpiformatdbNblastjob script: the program
(-prog), database file (-db), query file
(-query), output file (-result), and wallclock
hours (-WT2). Following is an example:
mpiformatdbNblastjob -prog blastn -db nt -query testnt50 -result testnt50cpus8 -- -WT2 20
Frequently asked questions
Options in mpiformatdbNblastjob script
Only four options (-db, -prog,
-query, and -result) are required by the mpiformatdbNblastjob script; the others are optional. Following
are the available options:
| Option | Description |
|---|---|
| -db filename_to_build_database | Required. The name of the input file that will be used to build the database. This is equivalent to the -i option in the formatdb command. |
| -prog blast_program | Required. This is equivalent to the -p option in the blastall command. Supported programs are blastn, blastp, blastx, tblastn, and tblastx. |
| -query query_sequence_filename | Required. The name of the query file. This is equivalent to the -i option in the blastall command. |
| -result output_filename | Required. The name of the output file. This is equivalent to the -o option in the blastall command. |
| -N n | Optional. Specify the number of fragments into which formatdb will split the input file to build the database. The default is 4. This option is new in mpiBLAST. |
| -WT1 n | Optional. Specify an integer walltime limit, in hours, for the mpiformatdb job that builds the database. The default is one hour. |
| -d database_name | Optional. The name of the database built by mpiformatdb. The default is the same as the input file. This is equivalent to the -o option in the formatdb command. |
| -m 8\|9\|0 | Optional. Specify the output format of mpiblast. The default is 8. Currently supports only a tab-delimited table without (8) or with (9) comment-line headers, or the normal NCBI BLAST output (0). |
| -removedb T\|F | Optional. Remove the database fragment files and query files after the current job is done. The default is F: if you submit multiple jobs that use the same database, removing the database when the first job finishes would break the other jobs. |
| -debug T\|F | Optional. Turn on debugging and generate a debug file in the current working directory. The default is T. If you suspect any problems with mpiBLAST, please attach the debug file when you email rtls at iu.edu. |
| -CPUS n | Optional. Specify the number of processes to be used in mpiblastjob. The default is 4. A value smaller than 4 is automatically increased to 4, and a value larger than 128 is automatically decreased to 128. |
| -WT2 n | Optional. Specify an integer that is the maximum number of hours the mpiBLAST job will be allowed to run. The default is 1 hour, and the maximum is 336 hours. |
Following is an example to submit an mpiBLAST job using the
mpiformatdbNblastjob script:
For complete information about options, at the command prompt on
Quarry, enter man mpiformatdbNblastjob .
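The limits described for -CPUS and -WT2 can be checked before a job is submitted. The sketch below only mirrors the documented behavior (values below 4 raised to 4, values above 128 lowered to 128, and a 336-hour walltime ceiling); clamp_cpus and valid_wt2 are hypothetical helper names, and the actual script may enforce these limits differently.

```shell
# Mirror the documented -CPUS adjustment: at least 4, at most 128.
clamp_cpus() {
    n=$1
    if [ "$n" -lt 4 ]; then n=4; fi
    if [ "$n" -gt 128 ]; then n=128; fi
    echo "$n"
}

# Succeed only when the requested -WT2 hours are within 1..336.
valid_wt2() {
    [ "$1" -ge 1 ] && [ "$1" -le 336 ]
}

clamp_cpus 2      # raised to the minimum
clamp_cpus 200    # lowered to the maximum
```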
Preformatted publicly installed databases
All the preformatted databases are located in
/N/emboss/ncbi_blastdb. The path to these databases is
automatically defined when you use the
mpiformatdbNblastjob script, so you don't need to
specify the location. The number of fragments for each database is
listed in the following table. The number of CPUs to request is
defined in mpiformatdbNblastjob.
Reference any of the following databases using the name in the
"Database" column (e.g., -db nt).
| Database | Type | Description | Source | No. of pieces |
|---|---|---|---|---|
| nr | protein | All nonredundant Protein GenBank CDS (translations+PDB+SwissProt+PIR) | NCBI | 5 |
| pataa | protein | Protein sequences from GenBank Patent division | NCBI | 1 |
| nt | nucleotide | All nonredundant nucleotide sequences (GenBank+EMBL+DDBJ+PDB, but no EST, STS, GSS, or HTGS sequences) | NCBI | 40 |
| patnt | nucleotide | Nucleotide sequences from GenBank Patent division | NCBI | 7 |
| htgs | nucleotide | Unfinished high-throughput genomic sequences (HTGS) | NCBI | 40 |
| swissprot | protein | Swiss-Prot protein sequence database | NCBI | 1 |
Deciding the number of database partitions in mpiformatdbjob
When a database fragment is larger than core memory, additional disk I/O causes a significant decrease in performance. However, the time required to format and output results increases with the number of fragments used, and it is independent of the number of processors requested.
Each node of Quarry has 8 CPUs that share 8 GB of memory; about 7 GB is available at run time. The amount of memory a process needs to store its assigned sequence data and intermediate search results varies with the query, database, and search type. UITS recommends 537 MB for each database fragment, which is 60% of the memory available per CPU (7 GB / 8 CPUs = 896 MB).
The mpiformatdbjob script builds database fragments. The
options for mpiformatdbjob are as follows:
Following is an example:
- Check the file's size with the command ls -altr filename. The results will be similar to the following:
  ls -altr nr
  -rw-r--r-- 1 jdoe hpc 3860088383 Oct 6 15:26 nr
- Calculate the ideal number of fragments: 3860088383/1024/1024/537 = 6 (about 6.9, rounded down).
  Therefore, request to partition the nr database into six
  fragments:
Each partition is then roughly 600 MB (3,681 MB / 6), close to the 537 MB target.
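The fragment arithmetic above can be reproduced with shell integer arithmetic. This is a minimal sketch: TARGET_MB, fragments_for_size, and fragments_rounded_up are hypothetical names, and the rounded-up variant is an assumption added for comparison (it keeps every fragment at or under the target size), not a UITS recommendation.

```shell
# Target fragment size in MB (the 537 MB figure recommended above).
TARGET_MB=537

# Fragment count rounded down, matching the worked example.
# $1 = database file size in bytes
fragments_for_size() {
    n=$(( $1 / 1024 / 1024 / TARGET_MB ))
    if [ "$n" -lt 1 ]; then n=1; fi
    echo "$n"
}

# Same calculation rounded up, so that no fragment exceeds the target.
fragments_rounded_up() {
    mb=$(( ($1 + 1048575) / 1048576 ))
    echo $(( (mb + TARGET_MB - 1) / TARGET_MB ))
}

fragments_for_size 3860088383      # size of the nr file in the ls output
fragments_rounded_up 3860088383
```

For the nr file, the first call prints 6 and the second prints 7. The byte count can be taken from the ls output or from wc -c < filename.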
Note that the nr file is about 3.6 GB, which is less
than 7 GB, the memory limit of a single node. If the database you want
to format is larger than 7 GB (the core memory), you have
to use the option --skip-reorder, which is
implemented only in the mpiformatdbjob script:
If you have a database larger than 67 GB, split it into smaller pieces. A 67 GB database is the largest that mpiBLAST can run well on 128 CPUs, which is the maximum number of CPUs that mpiBLAST can properly scale up to.
Deciding how many CPUs to request in mpiblastjob
Both mpiformatdbNblastjob and mpiblastjob
use four CPUs as a default to execute mpiBLAST. You can request a
different number of CPUs. In general, the only way to optimize
performance and resource usage is to experiment. In practice, if you
have a big database or a big query file, UITS suggests the following:
- If the database is big, the number of CPUs to request is two plus
  the number of database fragments. The minimum number is four, because
  requesting fewer than four CPUs to run mpiBLAST is just like running
  NCBI BLAST using one CPU and wasting computing resources.
  Following the above example, the nr database is composed of six partitions. Therefore, you would request eight CPUs:
  mpiformatdbNblastjob -prog blastn -db nr -query test -m 8 -result test.psl -- -CPUS 8 -WT2 24
- If the query file is big (at least 2,000 sequences), you can
  request N CPUs, where N=n*(F+1)+1. F is the number of database
  fragments and n is an integer.
  In most cases, the above formula is unnecessary. However, following is an example using n=3: N=3*(6+1)+1=22. Therefore, you would request 22 CPUs:
  mpiformatdbNblastjob -prog blastn -db nr -query test -m 8 -result test.psl -- -CPUS 22 -WT2 9
In both situations above, the maximum number of CPUs that this version of mpiBLAST can properly scale up to is 128. If you request more than 128 CPUs per job, unexpected errors could occur. For query files that contain more than 20,000 sequences, see Dealing with large query files.
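Both CPU rules above reduce to short integer formulas. The following sketch uses hypothetical helper names (cpus_for_big_db, cpus_for_big_query, and MAX_CPUS); capping the second formula at 128 is a simplification, since a capped value no longer has the exact form n*(F+1)+1.

```shell
MAX_CPUS=128

# Big database: fragments + 2, never fewer than 4 or more than 128.
# $1 = number of database fragments (F)
cpus_for_big_db() {
    n=$(( $1 + 2 ))
    if [ "$n" -lt 4 ]; then n=4; fi
    if [ "$n" -gt "$MAX_CPUS" ]; then n=$MAX_CPUS; fi
    echo "$n"
}

# Big query file: N = n*(F+1)+1.
# $1 = integer multiplier (n), $2 = number of database fragments (F)
cpus_for_big_query() {
    n=$(( $1 * ($2 + 1) + 1 ))
    if [ "$n" -gt "$MAX_CPUS" ]; then n=$MAX_CPUS; fi
    echo "$n"
}

cpus_for_big_db 6         # 6 + 2 = 8
cpus_for_big_query 3 6    # 3*(6+1)+1 = 22
```

The two calls reproduce the values used with -CPUS in the examples above.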
To view usage information for mpiblastjob, enter
mpiblastjob at the prompt:
For details, enter man mpiblastjob .
To make sure all your jobs will be queued, see the Quarry usage policies before submitting multiple mpiBLAST jobs on Quarry.
Deciding the walltime hours
Determining the appropriate number of walltime hours is difficult. Research shows that the running time for different searches is highly irregular. Even ignoring program optimization and data transfer, different database partitions, numbers of CPUs requested, and query files yield computation times that differ by orders of magnitude. The only suggestion is to run a small-scale trial with NCBI BLAST to get a rough idea. Then multiply the test time by the ratio of the full query size to the trial query size, divide by the number of CPUs, and add a few extra hours (or even more time) to get the walltime hours. UITS will update this page in the future with more reference times generated from test runs.
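The rule of thumb above can be written as a small calculation. This sketch rests on assumptions the page does not state: that the trial time scales linearly with the number of query sequences, and that the speedup is linear in the number of CPUs. The function name estimate_walltime_hours and its argument order are hypothetical.

```shell
# $1 = trial run time in minutes (NCBI BLAST on one CPU)
# $2 = number of query sequences in the trial
# $3 = number of query sequences in the full query file
# $4 = number of CPUs requested
# $5 = extra safety margin in hours
estimate_walltime_hours() {
    total_min=$(( $1 * $3 / $2 ))           # scale the trial to the full query
    per_cpu_min=$(( total_min / $4 ))       # idealized linear speedup
    hours=$(( (per_cpu_min + 59) / 60 ))    # round up to whole hours
    echo $(( hours + $5 ))
}

# A 10-minute trial of 100 sequences, scaled to 20,000 sequences on
# 8 CPUs with a 3-hour margin.
estimate_walltime_hours 10 100 20000 8 3
```

The example prints 8. Because running times are highly irregular, the result is only a starting point for -WT2 and must stay within the 336-hour maximum.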
Different e-values
NCBI BLAST started offering two executables, blastall and
blastall_old, with version 2.2.13. The two BLAST searches
generate slightly different results. mpiBLAST is based on the latter;
thus, the e-values are different from those generated by
blastall. However, NCBI has confirmed that
blastall_old is still trustworthy. For more, see this
information from NCBI about the difference between two BLAST
versions.
Output format of mpiBLAST vs. blastall
Not all blastall options are available in mpiBLAST. In
particular, the options for formatting the output file are
limited. You can use only -m 8 or -m 9
to produce a tab-delimited table without or with comment-line headers,
respectively, or -m 0 to produce normal
blastall output. The tab-delimited table contains the
following fields, in this order:
- Identity of query sequence
- Identity of subject sequence (matching sequence in database)
- Percent identity
- Alignment length
- Number of mismatches
- Number of gaps
- Start of query sequence
- End of query sequence
- Start of subject sequence
- End of subject sequence
- E-value
- Bit-score
For details about the available formats, see the FAQ and User's Guide on the mpiBLAST web site.
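Because the tab-delimited output has a fixed field order, it can be filtered with standard tools. The following is a minimal sketch using awk; filter_hits is a hypothetical helper, and the field positions (1 = query, 2 = subject, 3 = percent identity, 11 = e-value) follow the list above.

```shell
# Print query ID, subject ID, percent identity, and e-value for hits whose
# e-value (field 11) is at or below a cutoff. Comment lines that start
# with '#' are skipped.
# $1 = tab-delimited result file, $2 = maximum e-value (for example 1e-5)
filter_hits() {
    awk -F '\t' -v max="$2" '
        /^#/ { next }
        ($11 + 0) <= (max + 0) { print $1 "\t" $2 "\t" $3 "\t" $11 }
    ' "$1"
}
```

For example, filter_hits test.psl 1e-5 would keep only the hits in test.psl with an e-value of 1e-5 or lower.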
Dealing with large query files
Although mpiBLAST manages to count and distribute query sequences to each worker processor, it loads all the input queries into memory at startup, resulting in a limitation on efficient use of mpiBLAST. If the input query file is too big, it will significantly slow the loading function. The current version of mpiBLAST can efficiently handle 20,000 to 30,000 nucleotide sequences in one query file. If your query file is larger than this, split it into several files and run mpiBLAST separately on each one.
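Splitting a large query file can be done with awk. The sketch below assumes a FASTA query file in which every record starts with a ">" header line; split_fasta and the chunk naming scheme (prefix.1, prefix.2, and so on) are hypothetical choices, not something provided by mpiBLAST.

```shell
# Split a FASTA file into chunk files with at most a given number of
# sequences each.
# $1 = input FASTA file, $2 = output prefix, $3 = sequences per chunk
split_fasta() {
    awk -v prefix="$2" -v max="$3" '
        /^>/ {
            # Start a new chunk file every max sequences.
            if (count % max == 0) {
                if (out != "") close(out)
                part++
                out = prefix "." part
            }
            count++
        }
        out != "" { print > out }
    ' "$1"
}
```

For example, split_fasta big_query.fa query_part 20000 (file names are placeholders) would write query_part.1, query_part.2, and so on, each of which can then be submitted as a separate mpiBLAST job.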
Using scratch space
Quotas on disk space exist for home directories. UITS will not
increase quotas for individual users to many hundreds of megabytes or
to gigabytes. To use large amounts of storage space for short periods
of time, use scratch space, /N/dc/scratch/username.
Scratch space is for temporary use and depends on the honor system. Official policy is that any file more than 60 days old will be deleted, following user notification. Make sure to back up your data to a long-term storage system.
Running mpiBLAST interactively
To test your settings before you submit massive jobs, you can run
mpiBLAST interactively from the command line
on Quarry's interactive nodes (b005-b009). Write a
machinefile containing the names of the processors you
want to request, and then copy the whole example directory
/N/soft/linux-rhel4-x86_64/mpiblast-1.5.0-pio-openmpi/test/
to run a test.
Following is an example on Quarry. Lines preceded by a hash
( # ) are comments, for your information
only. Enter each command at the prompt:
You should see a result file called
interactive4-openmpi.psl in about five minutes. If your
actual job will take more than 20 minutes of processor time, use the
mpiblastjob script as described above.
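As an illustration of the machinefile step, the sketch below writes a hostfile in the "host slots=N" form that Open MPI accepts. It is an assumption that the Quarry test setup expects this form; the helper name write_machinefile is hypothetical, and the host name b005 in the usage note is taken only from the node range mentioned above.

```shell
# Write an Open MPI style hostfile: one "host slots=N" line per host.
# $1 = output file, $2 = slots per host, remaining arguments = host names
write_machinefile() {
    out=$1
    slots=$2
    shift 2
    : > "$out"    # create or truncate the output file
    for host in "$@"; do
        printf '%s slots=%s\n' "$host" "$slots" >> "$out"
    done
}
```

For example, write_machinefile machinefile 4 b005 would produce a one-line file requesting four slots on b005.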
Manual pages
Unix manual pages provide reference information about the scripts
mpiformatdbNblastjob, mpiformatdbjob, and
mpiblastjob. To access them, use the man
command.
Last modified on November 23, 2011.







