ARCHIVED: Hadoop Blast

BLAST (Basic Local Alignment Search Tool) is one of the most widely used bioinformatics applications; it is written in C++ (the version used here is v2.2.23). Hadoop Blast is a Hadoop program that lets BLAST take advantage of Hadoop's distributed computing capability.

Note: The following procedures use a 241 MB subset of the non-redundant (nr) protein sequence database; the full nr database is 8.5 GB.

Prerequisites

In order to perform this process, the following are required:

Using Hadoop Blast

To proceed using Hadoop Blast, see the following instructions:

Downloading Hadoop Blast under $HADOOP_HOME

Assume you have started SalsaHadoop/Hadoop with $HADOOP_HOME set to ~/hadoop-0.20.203.0, and that the master node is running on i55. Then download the Hadoop Blast source code and the customized Blast program and database archive (BlastProgramAndDB.tar.gz) from the Big Data for Science tutorial into $HADOOP_HOME:

[taklwu@i55 ~]$ cd $HADOOP_HOME
[taklwu@i55 hadoop-0.20.203.0]$ wget http://salsahpc.indiana.edu/tutorial/source_code/Hadoop-Blast.zip
[taklwu@i55 hadoop-0.20.203.0]$ wget http://salsahpc.indiana.edu/tutorial/apps/BlastProgramAndDB.tar.gz
[taklwu@i55 hadoop-0.20.203.0]$ unzip Hadoop-Blast.zip

Preparing Hadoop Blast

Once the program is unpacked into $HADOOP_HOME/Hadoop-Blast, copy the input files and the Blast program and database archive (BlastProgramAndDB.tar.gz) onto HDFS:

[taklwu@i55 hadoop-0.20.203.0]$ bin/hadoop fs -put $HADOOP_HOME/Hadoop-Blast/blast_input HDFS_blast_input
[taklwu@i55 hadoop-0.20.203.0]$ bin/hadoop fs -ls HDFS_blast_input
[taklwu@i55 hadoop-0.20.203.0]$ bin/hadoop fs -copyFromLocal $HADOOP_HOME/BlastProgramAndDB.tar.gz BlastProgramAndDB.tar.gz
[taklwu@i55 hadoop-0.20.203.0]$ bin/hadoop fs -ls BlastProgramAndDB.tar.gz

Notes:

  • The first command pushes the Blast input files (FASTA-formatted queries) from the local disk into the HDFS directory HDFS_blast_input.
  • The second command lists the uploaded files in the HDFS directory HDFS_blast_input.
  • The third command copies the Blast program and database archive (BlastProgramAndDB.tar.gz) from $HADOOP_HOME onto HDFS, where it will later be used as a distributed cache.
  • The fourth command double-checks that BlastProgramAndDB.tar.gz arrived on HDFS.
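If you prefer to script the staging step, the same commands can be built and driven from a short program. The sketch below is illustrative only (it is not part of Hadoop Blast); it assumes you run it from $HADOOP_HOME, and it uses `hadoop fs -test -e`, which exits 0 when a path exists, as the "double-check" step:

```python
def hdfs_cmd(*args):
    """Build an HDFS shell command (bin/hadoop fs ...) as an argv list."""
    return ["bin/hadoop", "fs", *args]

def stage(local, remote):
    """Return the commands to push a local path to HDFS and verify it arrived.

    'hadoop fs -test -e <path>' exits 0 when the path exists, so the second
    command serves as the verification step from the notes above.
    """
    return [hdfs_cmd("-put", local, remote),
            hdfs_cmd("-test", "-e", remote)]

cmds = stage("Hadoop-Blast/blast_input", "HDFS_blast_input")
# To actually run them from $HADOOP_HOME, stopping on the first failure:
#   import subprocess
#   for c in cmds:
#       subprocess.check_call(c)
print(cmds[0])
```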

Executing Hadoop Blast

After deploying these required files onto HDFS, run the Hadoop Blast program with the following commands:

[taklwu@i55 hadoop-0.20.203.0]$ bin/hadoop jar $HADOOP_HOME/Hadoop-Blast/executable/blast-hadoop.jar BlastProgramAndDB.tar.gz \
 bin/blastx /tmp/hadoop-taklwu-test/ db nr HDFS_blast_input HDFS_blast_output '-query #_INPUTFILE_# -outfmt 6 -seg no -out #_OUTPUTFILE_#'

Here is the description of the above command:

bin/hadoop jar Executable BlastProgramAndDB_on_HDFS bin/blastx Local_Work_DIR db nr HDFS_Input_DIR Unique_HDFS_Output_DIR '-query #_INPUTFILE_# -outfmt 6 -seg no -out #_OUTPUTFILE_#'


Parameter Description
Executable
The full path of the Hadoop Blast jar (e.g., $HADOOP_HOME/Hadoop-Blast/executable/blast-hadoop.jar)
BlastProgramAndDB_on_HDFS
The name of the Blast program and database archive on HDFS (e.g., BlastProgramAndDB.tar.gz)
bin/blastx
The path of the Blast binary inside the extracted archive
db nr
The directory and name of the Blast database inside the extracted archive
Local_Work_DIR
The local directory for storing temporary Blast output (e.g., /tmp/hadoop-test/)
HDFS_Input_DIR
The HDFS directory where the input files are stored (e.g., HDFS_blast_input)
Unique_HDFS_Output_DIR
A not-yet-existing HDFS directory for storing the output files (e.g., HDFS_blast_output)
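The placeholder strings #_INPUTFILE_# and #_OUTPUTFILE_# in the final quoted argument are substituted per map task: each task replaces them with the input file it received and the local file it should write before invoking the Blast binary. The following Python sketch illustrates that substitution; the function name and file paths are hypothetical, not the actual Hadoop Blast source:

```python
import shlex

def build_blast_command(template, exe, input_file, output_file):
    """Substitute the per-task placeholders into the quoted argument template.

    #_INPUTFILE_# and #_OUTPUTFILE_# stand for the query file a map task
    receives and the local output file it should produce.
    """
    args = (template.replace("#_INPUTFILE_#", input_file)
                    .replace("#_OUTPUTFILE_#", output_file))
    return [exe] + shlex.split(args)

# Hypothetical per-task file names under the local work directory:
cmd = build_blast_command(
    "-query #_INPUTFILE_# -outfmt 6 -seg no -out #_OUTPUTFILE_#",
    "bin/blastx",
    "/tmp/hadoop-test/in_0.fa",
    "/tmp/hadoop-test/out_0.txt",
)
print(cmd)
```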


If Hadoop is running correctly, the job will print progress messages similar to the following:

11/11/01 19:31:08 INFO input.FileInputFormat: Total input paths to process : 16
11/11/01 19:31:08 INFO mapred.JobClient: Running job: job_201111021738_0002
11/11/01 19:31:09 INFO mapred.JobClient: map 0% reduce 0%
11/11/01 19:31:31 INFO mapred.JobClient: map 18% reduce 0%
11/11/01 19:31:34 INFO mapred.JobClient: map 50% reduce 0%
11/11/01 19:31:53 INFO mapred.JobClient: map 75% reduce 0%
11/11/01 19:32:04 INFO mapred.JobClient: map 100% reduce 0%
...
Job Finished in 191.376 seconds

Monitoring Hadoop

You can also monitor the job status using lynx, a text-mode browser, from the i136 Hadoop monitoring console. Assuming the Hadoop JobTracker is running on i55:9003:

[taklwu@i136 ~]$ lynx i55:9003

In addition, all of the outputs will be stored in the HDFS output directory (e.g., HDFS_blast_output):

[taklwu@i55 hadoop-0.20.203.0]$ bin/hadoop fs -ls HDFS_blast_output
[taklwu@i55 hadoop-0.20.203.0]$ bin/hadoop fs -cat HDFS_blast_output/*
BG3:2_30MNAAAXX:7:1:981:1318/1 gi|298916876|dbj|BAJ09735.1| 100.00 11 0 0 3 35 9 19 7.0 27.7
BG3:2_30MNAAAXX:7:1:981:1318/1 gi|298708397|emb|CBJ48460.1| 100.00 11 0 0 3 35 37 47 7.0 27.7
BG3:2_30MNAAAXX:7:1:981:1318/1 gi|298104210|gb|ADI54942.1| 100.00 11 0 0 3 35 11 21 7.0 27.7
BG3:2_30MNAAAXX:7:1:981:1318/1 gi|297746593|emb|CBM42053.1| 100.00 11 0 0 3 35 11 21 7.0 27.7
BG3:2_30MNAAAXX:7:1:981:1318/1 gi|297746591|emb|CBM42052.1| 100.00 11 0 0 3 35 11 21 7.0 27.7
BG3:2_30MNAAAXX:7:1:981:1318/1 gi|297746589|emb|CBM42051.1| 100.00 11 0 0 3 35 11 21 7.0 27.7
BG3:2_30MNAAAXX:7:1:981:1318/1 gi|297746587|emb|CBM42050.1| 100.00 11 0 0 3 35 11 21 7.0 27.7
BG3:2_30MNAAAXX:7:1:981:1318/1 gi|297746585|emb|CBM42049.1| 100.00 11 0 0 3 35 11 21 7.0 27.7
BG3:2_30MNAAAXX:7:1:981:1318/1 gi|297746583|emb|CBM42048.1| 100.00 11 0 0 3 35 11 21 7.0 27.7
BG3:2_30MNAAAXX:7:1:981:1318/1 gi|297746581|emb|CBM42047.1| 100.00 11 0 0 3 35 11 21 7.0 27.7
...
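Each output line is a standard BLAST tabular record (-outfmt 6) with 12 columns: query ID, subject ID, percent identity, alignment length, mismatches, gap opens, query start/end, subject start/end, E-value, and bit score. A small sketch for parsing these records downstream (the field names follow BLAST's defaults; the parsing code itself is illustrative, not part of Hadoop Blast):

```python
from collections import namedtuple

# The 12 default columns of BLAST tabular output (-outfmt 6).
FIELDS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
          "qstart", "qend", "sstart", "send", "evalue", "bitscore"]
Hit = namedtuple("Hit", FIELDS)

def parse_outfmt6(line):
    """Parse one -outfmt 6 record, converting the numeric fields."""
    cols = line.split()
    return Hit(cols[0], cols[1], float(cols[2]), int(cols[3]), int(cols[4]),
               int(cols[5]), int(cols[6]), int(cols[7]), int(cols[8]),
               int(cols[9]), float(cols[10]), float(cols[11]))

# First record from the sample output above:
hit = parse_outfmt6(
    "BG3:2_30MNAAAXX:7:1:981:1318/1 gi|298916876|dbj|BAJ09735.1| "
    "100.00 11 0 0 3 35 9 19 7.0 27.7")
print(hit.sseqid, hit.pident, hit.evalue)
```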

Finishing the Map-Reduce process

After the job finishes, use the following command to stop the HDFS and Map-Reduce daemons:

[taklwu@i55 hadoop-0.20.203.0]$ bin/stop-all.sh

This is document bcsp in the Knowledge Base.
Last modified on 2014-10-02 00:00:00.
