ARCHIVED: What is Hadoop, and where can I find information about using it on XSEDE?

This content has been archived, and is no longer maintained by Indiana University. Information here may no longer be accurate, and links may no longer be available or reliable.

Apache Hadoop is an open-source software framework that supports distributed processing of large data sets across clusters of computers, and it is well suited to running scientific applications in parallel. The Hadoop framework includes the Hadoop Distributed File System (HDFS) for high-throughput access to application data, and an implementation of MapReduce for parallel processing of large data sets. For more, see the Hadoop project website.
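
To illustrate the MapReduce model mentioned above, the following is a minimal single-process sketch of a word count written in plain Python. It does not use Hadoop or any Hadoop API; the function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical and chosen only for illustration. On a real cluster, the framework would distribute the map and reduce steps across nodes and perform the grouping step itself.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, standing in for the framework's shuffle step."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: sum all counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence on an iterable of lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    sample = ["Hadoop stores data in HDFS", "MapReduce processes data in parallel"]
    print(word_count(sample))
```

Because each call to map_phase depends only on its own line, and each call to reduce_phase depends only on its own key, both steps can run independently on different machines, which is what makes the model suitable for large data sets.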

Hadoop is available to Extreme Science and Engineering Discovery Environment (XSEDE) users with allocations on Gordon (SDSC). MyHadoop is also available for setting up and configuring Hadoop as a batch job (for more on MyHadoop, see the MyHadoop project site).

The Hadoop distribution on Gordon is located at:

  /opt/hadoop

Setup and usage examples, plus search and sorting benchmarks, are located at:

  /opt/hadoop/contrib/myHadoop

For more about configuring and starting a Hadoop cluster on Gordon, see the San Diego Supercomputer Center (SDSC) Hadoop page in the Gordon User Guide.

For more on Gordon, see the SDSC Gordon User Guide in the XSEDE User Portal. If you have questions or need help, contact the XSEDE Help Desk.

This document was developed with support from National Science Foundation (NSF) grants 1053575 and 1548562. Any opinions, findings, conclusions, or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the NSF.