BLAST sequence searching on Big Red
On this page:
- Publicly installed databases
- Preparing to run BLAST
- Running NCBI blastall interactively
- Running NCBI blastall batch jobs
- NCBI blastall options
- Dealing with large volumes of output
- Manual pages
Publicly installed databases
| Database | Type | Description | Source | Path |
|---|---|---|---|---|
| nr | protein | All non-redundant Protein GenBank CDS (translations+PDB+SwissProt+PIR) | NCBI | /N/soft/blastdb |
| pataa | protein | Protein sequences from GenBank Patent division | NCBI | /N/soft/blastdb |
| nt | nucleotide | All non-redundant Nucleotide Sequences (GenBank+EMBL+DDBJ+PDB, but no EST, STS, GSS or HTGS sequences) | NCBI | /N/soft/blastdb |
| patnt | nucleotide | Nucleotide sequences from GenBank Patent division | NCBI | /N/soft/blastdb |
| swissprot | protein | Protein sequences from the Swiss-Prot database | NCBI | /N/soft/blastdb |
You can reference databases using the names in the "Database" column
(e.g., -d nt).
Preparing to run BLAST
Before you can run BLAST, you need to add it to your path. To
permanently add BLAST to your path, add the +ncbi key to
your .soft environment. To do so, run:
This places blastall in your path, and sets the
environment variables BLASTMAT and
BLASTDB.
Running NCBI blastall interactively
To run NCBI blastall interactively from the command
line, you can use only one processor and your blastall
job must consume less than 20 minutes of processor time. If you need
more than 20 minutes of processor time, use the serialjob
command (see below) to run blastall.
Running NCBI blastall batch jobs
If your blastall job will run for more than 20
minutes, you must use the batch facility.
Use the serialjob command as a one-step method for
submitting blastall jobs to the queue. The
serialjob command is a prefix to a standard
blastall command. Use arguments just as you would with
blastall, with a few optional additions. If your job will
take less than two hours, you can run:
serialjob blastall options_to_blastall
Replace options_to_blastall with your command
options.
If your job will run for more than two hours, specify the
-wallclocklimit option:
serialjob blastall options_to_blastall -- -wallclocklimit hh:mm:ss
Replace hh with hours, mm with minutes,
and ss with seconds.
Use the double dash (--) to separate the options to
blastall from the options to serialjob (such as
-wallclocklimit). For example, to blast the nucleotide sequences in file
sequences.mine against the patent database, store the
output in file hits.patent, and allow the job up to
four hours, run:
serialjob blastall -p blastn -i sequences.mine -d patnt -o hits.patent -- -wallclocklimit 4:00:00
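If you submit many jobs from a script, it can help to assemble the command line programmatically so that the double dash always lands in the right place. The following is a minimal Python sketch of that idea, not part of serialjob itself; the helper name build_serialjob_command is hypothetical, and it assumes Python 3.8 or later (for shlex.join).

```python
import shlex


def build_serialjob_command(blastall_args, wallclock=None):
    """Return the serialjob command as a list of arguments.

    blastall_args: options passed through to blastall.
    wallclock: optional "hh:mm:ss" limit; when given, it is placed
               after the double dash, as described above.
    (Hypothetical helper for illustration only.)
    """
    cmd = ["serialjob", "blastall", *blastall_args]
    if wallclock is not None:
        cmd += ["--", "-wallclocklimit", wallclock]
    return cmd


# Rebuild the four-hour example from the text.
cmd = build_serialjob_command(
    ["-p", "blastn", "-i", "sequences.mine", "-d", "patnt", "-o", "hits.patent"],
    wallclock="4:00:00",
)
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run on a system where serialjob is installed; printing it first is a cheap way to check the command before submitting.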
NCBI blastall options
Options for blastall are available by entering the command with no argument. Options are listed here for your convenience:
blastall 2.2.16 arguments:
- -p: Program name [string]
- -d: Database [string] (default = nr)
- -i: Query file [file in] (default = stdin)
- -e: Expectation value (E) [real] (default = 10.0)
- -m: Alignment view options [integer] (default = 0; range = 0 to 11):
  - 0 = pairwise
  - 1 = query-anchored showing identities
  - 2 = query-anchored no identities
  - 3 = flat query-anchored, show identities
  - 4 = flat query-anchored, no identities
  - 5 = query-anchored no identities and blunt ends
  - 6 = flat query-anchored, no identities and blunt ends
  - 7 = XML Blast output
  - 8 = tabular
  - 9 = tabular with comment lines
  - 10 = ASN, text
  - 11 = ASN, binary
- -o: BLAST report output file [file out] (optional) (default = stdout)
- -F: Filter query sequence (DUST with blastn, SEG with others) [string] (default = T)
- -G: Cost to open a gap [integer] (default = -1)
- -E: Cost to extend a gap [integer] (default = -1)
- -X: X drop-off value for gapped alignment (in bits); blastn 30, megablast 20, tblastx 0, all others 15 [integer] (default = 0)
- -I: Show GIs in deflines [T/F] (default = F)
- -q: Penalty for a nucleotide mismatch (blastn only) [integer] (default = -3)
- -r: Reward for a nucleotide match (blastn only) [integer] (default = 1)
- -v: Number of database sequences to show one-line descriptions for (V) [integer] (default = 500)
- -b: Number of database sequence to show alignments for (B) [integer] (default = 250)
- -f: Threshold for extending hits; blastp 11, blastn 0, blastx 12, tblastn 13, tblastx 13, megablast 0 [real] (default = 0)
- -g: Perform gapped alignment (not available with tblastx) [T/F] (default = T)
- -Q: Query genetic code to use [integer] (default = 1)
- -D: DB genetic code (for tblast[nx] only) [integer] (default = 1)
- -a: Number of processors to use [integer] (default = 1)
- -O: SeqAlign file [file out] (optional)
- -J: Believe the query defline [T/F] (default = F)
- -M: Matrix [string] (default = BLOSUM62)
- -W: Word size (blastn 11, megablast 28, all others 3) [integer] (default = 0)
- -z: Effective length of the database (use zero for the real size) [real] (default = 0)
- -K: Number of best hits from a region to keep (off by default, if used a value of 100 is recommended) [integer] (default = 0)
- -P: 0 for multiple hits, 1 for single hit (does not apply to blastn) [integer] (default = 0)
- -Y: Effective length of the search space (use zero for the real size) [real] (default = 0)
- -S: Query strands to search against database (for blast[nx] and tblastx); 3 is both, 1 is top, 2 is bottom [integer] (default = 3)
- -T: Produce HTML output [T/F] (default = F)
- -l: Restrict search of database to list of GIs [string] (optional)
- -U: Use lower case filtering of FASTA sequence [T/F] (optional)
- -y: X drop-off value for ungapped extensions (in bits); blastn 20, megablast 10, all others 7 [real] (default = 0.0)
- -Z: X drop-off value for final gapped alignment (in bits); blastn/megablast 50, tblastx 0, all others 25 [integer] (default = 0)
- -R: PSI-TBLASTN checkpoint file [file in] (optional)
- -n: MegaBlast search [T/F] (default = F)
- -L: Location on query sequence [string] (optional)
- -A: Multiple hits window size (blastn/megablast 0, all others 40) [integer] (default = 0)
- -w: Frame shift penalty (OOF algorithm for blastx) [integer] (default = 0)
- -t: Length of the largest intron allowed in a translated nucleotide sequence when linking multiple distinct alignments; a negative value disables linking [integer] (default = 0)
- -B: Number of concatenated queries (for blastn and tblastn) [integer] (optional) (default = 0)
- -V: Force use of the legacy BLAST engine [T/F] (optional) (default = F)
- -C: Use composition-based statistics (for blastpgp or tblastn) [string] (default = D). For programs other than tblastn, must either be absent or be D, F or 0.
  - As first character:
    - D or d: default (equivalent to F)
    - 0 or F or f: no composition-based statistics
    - 1 or T or t: Composition-based statistics as in NAR 29:2994-3005, 2001
    - 2: Composition-based score adjustment as in Bioinformatics 21:902-911, 2005, conditioned on sequence properties
    - 3: Composition-based score adjustment as in Bioinformatics 21:902-911, 2005, unconditionally
  - As second character (if first character is 1, 2, or 3):
    - U or u: unified p-value combining alignment p-value and compositional p-value in round 1 only
- -s: Compute locally optimal Smith-Waterman alignments (available only for gapped tblastn) [T/F] (default = F)
Dealing with large volumes of output
BLAST output is verbose. Default output from thousands of searches
can consume hundreds of megabytes or even gigabytes. To handle large
output, you can work in scratch space and use
blastall options that reduce the volume of output:
Using scratch space
Quotas on disk space exist for home directories. UITS will not increase quotas for individual users to many hundreds of megabytes or to gigabytes. To use large amounts of storage space for short periods of time, use scratch space: /N/gpfsbr
To set up a personal scratch directory, use mkdir with
your Big Red username:
Scratch space is for temporary use and depends on the honor system. Under official policy, any file more than three months old can be deleted without notice. Be sure to back up your data to a long-term storage system.
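To see which of your scratch files are approaching the three-month limit, you could list files by modification time. The following is a minimal Python sketch under stated assumptions: the function name files_older_than is hypothetical, "three months" is approximated as 90 days, and the example path (a directory named after your username under /N/gpfsbr) is an assumption about how your scratch directory is laid out.

```python
import os
import time
from pathlib import Path


def files_older_than(root, days=90, now=None):
    """Return files under root whose modification time is more than
    `days` days before `now` (default: the current time).

    The 90-day default is an approximation of "three months".
    """
    reference = time.time() if now is None else now
    cutoff = reference - days * 86400  # seconds per day
    old_files = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            try:
                if path.stat().st_mtime < cutoff:
                    old_files.append(path)
            except OSError:
                # File vanished or is unreadable; skip it.
                continue
    return sorted(old_files)


if __name__ == "__main__":
    # Assumed location; replace with your own scratch directory.
    for path in files_older_than("/N/gpfsbr/your_username"):
        print(path)
```

Files reported here are candidates for copying to long-term storage before they become eligible for deletion.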
Minimizing blastall output
Tab-delimited output: If you are querying hundreds
or thousands of sequences against databases, you will probably use
programs to summarize the output. blastall offers two options that
reduce the output to tab-delimited tables whose fields programs can
easily read. The blastall option -m 9
produces a table with headers and comment information at the top, and
-m 8 produces a table without headers and
comments. Fields are output in the following order:
- Identity of query sequence
- Identity of subject sequence (matching sequence in database)
- Percent identity
- Alignment length
- Number of mismatches
- Number of gaps
- Start of query sequence
- End of query sequence
- Start of subject sequence
- End of subject sequence
- E-value
- Bit-score
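The twelve fields above map naturally onto a small record type. The following is a minimal Python sketch of a reader for this tab-delimited output, assuming one hit per line with exactly twelve tab-separated fields and comment lines (as produced by -m 9) starting with "#". The names Hit and parse_tabular and the field names are illustrative, not part of BLAST, and the sample rows are made-up data for demonstration only.

```python
from typing import Iterable, Iterator, NamedTuple


class Hit(NamedTuple):
    """One row of tabular output, in the field order listed above."""
    query_id: str
    subject_id: str
    percent_identity: float
    alignment_length: int
    mismatches: int
    gaps: int
    query_start: int
    query_end: int
    subject_start: int
    subject_end: int
    e_value: float
    bit_score: float


def parse_tabular(lines: Iterable[str]) -> Iterator[Hit]:
    """Yield a Hit per data line, skipping blank and comment lines."""
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) != 12:
            raise ValueError(
                f"line {number}: expected 12 fields, got {len(fields)}"
            )
        yield Hit(
            fields[0], fields[1], float(fields[2]),
            int(fields[3]), int(fields[4]), int(fields[5]),
            int(fields[6]), int(fields[7]), int(fields[8]), int(fields[9]),
            float(fields[10]), float(fields[11]),
        )


# Made-up sample rows (not real search results).
SAMPLE = [
    "# Fields: Query id, Subject id, % identity, ...",
    "q1\tgi|1|\t98.50\t200\t3\t0\t1\t200\t501\t700\t1e-100\t385",
    "q1\tgi|2|\t85.00\t100\t15\t0\t10\t109\t40\t139\t0.002\t44.1",
]

hits = list(parse_tabular(SAMPLE))
# Keep only hits at or below an E-value threshold chosen for illustration.
significant = [h for h in hits if h.e_value <= 1e-5]
print(len(hits), len(significant))
```

For a real run, pass an open file object instead of SAMPLE, for example parse_tabular(open("hits.patent")) if that file was produced with -m 8 or -m 9.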
Manual pages
Unix manual pages provide reference information about the scripts
serialjob and parblastjob. To access them,
use the man command.
This document was developed with support from the National Science Foundation (NSF) under Grant No. 0503697 to the University of Chicago and subcontracted to Indiana University. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the NSF.
Last modified on April 29, 2008.






