Indiana University
University Information Technology Services
  
BLAST sequence searching on Big Red


Publicly installed databases

Database   Type        Source  Path             Description
nr         protein     NCBI    /N/soft/blastdb  All non-redundant protein GenBank CDS (translations+PDB+SwissProt+PIR)
pataa      protein     NCBI    /N/soft/blastdb  Protein sequences from GenBank Patent division
nt         nucleotide  NCBI    /N/soft/blastdb  All non-redundant nucleotide sequences (GenBank+EMBL+DDBJ+PDB, but no EST, STS, GSS, or HTGS sequences)
patnt      nucleotide  NCBI    /N/soft/blastdb  Nucleotide sequences from GenBank Patent division
swissprot  protein     NCBI    /N/soft/blastdb  Protein sequences from the Swiss-Prot database

You can reference databases using the names in the "Database" column (e.g., -d nt).

Preparing to run BLAST

Before you can run BLAST, you need to add it to your path. To permanently add BLAST to your path, add the +ncbi key to your .soft environment. To do so, run:

echo +ncbi >> $HOME/.soft; resoft

This places blastall in your path, and sets the environment variables BLASTMAT and BLASTDB.
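To confirm that the change took effect in your current shell, a quick check along the following lines may help. This is a minimal sketch: check_blast_env is a hypothetical helper name (not part of the Big Red software), and the only things it assumes are the blastall command and the BLASTDB and BLASTMAT variables named above.

```shell
# Hypothetical helper: report whether blastall is on the PATH and
# whether the BLAST environment variables are set.
check_blast_env() {
    if command -v blastall >/dev/null 2>&1; then
        echo "blastall: $(command -v blastall)"
    else
        echo "blastall: not found (is +ncbi in \$HOME/.soft, and has resoft been run?)"
    fi
    # Print each variable, or "not set" if it is empty or undefined.
    echo "BLASTDB=${BLASTDB:-not set}"
    echo "BLASTMAT=${BLASTMAT:-not set}"
}

check_blast_env
```

If either variable is reported as not set, log out and back in, or run resoft again.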

Running NCBI blastall interactively

When you run NCBI blastall interactively from the command line, you can use only one processor, and your job must consume less than 20 minutes of processor time. If you need more than 20 minutes of processor time, use the serialjob command (see below) to run blastall.

Running NCBI blastall batch jobs

If your blastall job will run for more than 20 minutes, you must use the batch facility.

Use the serialjob command as a one-step method for submitting blastall jobs to the queue. The serialjob command is a prefix to a standard blastall command. Use arguments just as you would with blastall, with a few optional additions. If your job will take less than two hours, you can run:

serialjob blastall options_to_blastall

Replace options_to_blastall with your command options.

If your job will run for more than two hours, specify the -wallclocklimit option:

serialjob blastall options_to_blastall -- -wallclocklimit hh:mm:ss

Replace hh with hours, mm with minutes, and ss with seconds.

Use the double dash (--) to separate the options to blastall from the options to serialjob, such as -wallclocklimit. For example, to blast the nucleotide sequences in the file sequences.mine against the patent database, store the output in the file hits.patent, and allow the job to run for up to four hours, run:

serialjob blastall -p blastn -i sequences.mine -d patnt -o hits.patent -- -wallclocklimit 4:00:00
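If you estimate run times in minutes, a small helper can produce the hh:mm:ss string that -wallclocklimit expects. The following is only a sketch; minutes_to_wallclock is a hypothetical function name, not something provided by serialjob.

```shell
# Hypothetical helper: convert a run-time estimate in whole minutes
# to the h:mm:ss form used with -wallclocklimit.
minutes_to_wallclock() {
    printf '%d:%02d:00\n' $(( $1 / 60 )) $(( $1 % 60 ))
}

# Example use (shown as a comment because it submits a real job):
#   serialjob blastall -p blastn -i sequences.mine -d patnt -o hits.patent \
#       -- -wallclocklimit "$(minutes_to_wallclock 240)"
```

It is usually safer to round your estimate up, since a job that reaches its wall-clock limit may be stopped before it finishes.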

NCBI blastall options

To list the options for blastall, enter the command with no arguments. The options are reproduced here for your convenience:

blastall 2.2.16 arguments:

  -p  Program name [string]
  -d  Database [string] (default = nr)
  -i  Query file [file in] (default = stdin)
  -e  Expectation value (E) [real] (default = 10.0)
  -m  Alignment view options:
        0 = pairwise
        1 = query-anchored showing identities
        2 = query-anchored no identities
        3 = flat query-anchored, show identities
        4 = flat query-anchored, no identities
        5 = query-anchored no identities and blunt ends
        6 = flat query-anchored, no identities and blunt ends
        7 = XML Blast output
        8 = tabular
        9 = tabular with comment lines
        10 = ASN, text
        11 = ASN, binary
      [integer] (default = 0; range = 0 to 11)
  -o  BLAST report output file [file out] (optional) (default = stdout)
  -F  Filter query sequence (DUST with blastn, SEG with others) [string] (default = T)
  -G  Cost to open a gap [integer] (default = -1)
  -E  Cost to extend a gap [integer] (default = -1)
  -X  X drop-off value for gapped alignment (in bits)
      blastn 30, megablast 20, tblastx 0, all others 15 [integer] (default = 0)
  -I  Show GIs in deflines [T/F] (default = F)
  -q  Penalty for a nucleotide mismatch (blastn only) [integer] (default = -3)
  -r  Reward for a nucleotide match (blastn only) [integer] (default = 1)
  -v  Number of database sequences to show one-line descriptions for (V) [integer] (default = 500)
  -b  Number of database sequence to show alignments for (B) [integer] (default = 250)
  -f  Threshold for extending hits; blastp 11, blastn 0, blastx 12, tblastn 13
      tblastx 13, megablast 0 [real] (default = 0)
  -g  Perform gapped alignment (not available with tblastx) [T/F] (default = T)
  -Q  Query genetic code to use [integer] (default = 1)
  -D  DB genetic code (for tblast[nx] only) [integer] (default = 1)
  -a  Number of processors to use [integer] (default = 1)
  -O  SeqAlign file [file out] (optional)
  -J  Believe the query defline [T/F] (default = F)
  -M  Matrix [string] (default = BLOSUM62)
  -W  Word size (blastn 11, megablast 28, all others 3) [integer] (default = 0)
  -z  Effective length of the database (use zero for the real size) [real] (default = 0)
  -K  Number of best hits from a region to keep (off by default, if used a value of 100 is recommended) [integer] (default = 0)
  -P  0 for multiple hits, 1 for single hit (does not apply to blastn) [integer] (default = 0)
  -Y  Effective length of the search space (use zero for the real size) [real] (default = 0)
  -S  Query strands to search against database (for blast[nx] and tblastx)
      3 is both, 1 is top, 2 is bottom [integer] (default = 3)
  -T  Produce HTML output [T/F] (default = F)
  -l  Restrict search of database to list of GIs [string] (optional)
  -U  Use lower case filtering of FASTA sequence [T/F] (optional)
  -y  X drop-off value for ungapped extensions (in bits)
      blastn 20, megablast 10, all others 7 [real] (default = 0.0)
  -Z  X drop-off value for final gapped alignment (in bits)
      blastn/megablast 50, tblastx 0, all others 25 [integer] (default = 0)
  -R  PSI-TBLASTN checkpoint file [file in] (optional)
  -n  MegaBlast search [T/F] (default = F)
  -L  Location on query sequence [string] (optional)
  -A  Multiple hits window size (blastn/megablast 0, all others 40) [integer] (default = 0)
  -w  Frame shift penalty (OOF algorithm for blastx) [integer] (default = 0)
  -t  Length of the largest intron allowed in a translated nucleotide sequence when linking multiple distinct alignments; a negative value disables linking [Integer] (default = 0)
  -B  Number of concatenated queries (for blastn and tblastn) [integer] (optional) (default = 0)
  -V  Force use of the legacy BLAST engine [T/F] (optional) (default = F)
  -C  Use composition-based statistics (for blastpgp or tblastn) [string]
      As first character:
        D or d: default (equivalent to F)
        0 or F or f: no composition-based statistics
        1 or T or t: Composition-based statistics as in NAR 29:2994-3005, 2001
        2: Composition-based score adjustment as in Bioinformatics 21:902-911, 2005, conditioned on sequence properties
        3: Composition-based score adjustment as in Bioinformatics 21:902-911, 2005, unconditionally
      For programs other than tblastn, must either be absent or be D, F or 0.
      As second character (if first character is 1, 2, or 3):
        U or u: unified p-value combining alignment p-value and compositional p-value in round 1 only
      (default = D)
  -s  Compute locally optimal Smith-Waterman alignments (available only for gapped tblastn) [T/F] (default = F)

Dealing with large volumes of output

BLAST output is verbose. Default output from thousands of searches can consume hundreds of megabytes or even gigabytes. To handle large volumes of output, work in scratch space and use blastall options that reduce the volume of output:

Using scratch space

Quotas on disk space exist for home directories. UITS will not increase quotas for individual users to many hundreds of megabytes or to gigabytes. To use large amounts of storage space for short periods of time, use scratch space: /N/gpfsbr

To set up a personal scratch directory, use mkdir with your Big Red username:

mkdir /N/gpfsbr/username

Scratch space is for temporary use and depends on the honor system. Under official policy, any file more than three months old can be deleted without notice. Make sure to back up your data to a long-term storage system.
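To see which of your scratch files are at risk, you can list files by age with the standard find command. The sketch below assumes that "three months" can be approximated as 90 days; list_old_files is a hypothetical helper name, and the directory in the usage comment follows the mkdir example above.

```shell
# Hypothetical helper: list regular files under a directory that were
# last modified more than a given number of days ago.
#   $1 = directory to search
#   $2 = age in days (assumption: 90 days approximates three months)
list_old_files() {
    find "$1" -type f -mtime +"${2:-90}" -print
}

# Example use (replace username with your Big Red username):
#   list_old_files /N/gpfsbr/username
```

Files that appear in this list are candidates to copy to long-term storage before they are removed.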

Minimizing blastall output

Tab-delimited output: If you are querying hundreds or thousands of sequences against databases, you will probably use programs to summarize the output. blastall offers two options that reduce output to tab-delimited tables whose fields programs can easily read. The option -m 9 produces a table with header and comment lines, and -m 8 produces a table without them. Fields are output in the following order:

  • Identity of query sequence
  • Identity of subject sequence (matching sequence in database)
  • Percent identity
  • Alignment length
  • Number of mismatches
  • Number of gap openings
  • Start of alignment in query sequence
  • End of alignment in query sequence
  • Start of alignment in subject sequence
  • End of alignment in subject sequence
  • E-value
  • Bit-score
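Because the table is tab-delimited with the fields in the order listed above, standard tools such as awk can summarize it. The following is a sketch, not a recommended analysis: filter_hits and best_hits are hypothetical helper names, and the default cutoffs are arbitrary illustrations. Both helpers skip lines starting with #, so they should also accept -m 9 output.

```shell
# Hypothetical helper: keep hits whose E-value (field 11) is at most
# a cutoff and whose percent identity (field 3) is at least a minimum.
#   $1 = tabular file, $2 = max E-value, $3 = min percent identity
filter_hits() {
    awk -F'\t' -v emax="${2:-1e-5}" -v pmin="${3:-0}" \
        '!/^#/ && $11 + 0 <= emax + 0 && $3 + 0 >= pmin + 0' "$1"
}

# Hypothetical helper: for each query (field 1), keep the single hit
# with the highest bit score (field 12); output is sorted by line.
best_hits() {
    awk -F'\t' '
        !/^#/ && (!($1 in best) || $12 + 0 > best[$1]) {
            best[$1] = $12 + 0
            line[$1] = $0
        }
        END { for (q in line) print line[q] }
    ' "$1" | sort
}

# Example use on a file written with -m 8 or -m 9:
#   filter_hits hits.patent 1e-10 90
#   best_hits hits.patent
```

Reducing the output this way before copying it out of scratch space keeps only the hits you intend to examine.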

Manual pages

Unix manual pages provide reference information about the scripts serialjob and parblastjob. To access them, use the man command.

This document was developed with support from the National Science Foundation (NSF) under Grant No. 0503697 to the University of Chicago and subcontracted to Indiana University. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the NSF.

This is document awvn in domains all and tgrid-all.
Last modified on December 18, 2008.
