ARCHIVED: On FutureGrid, how can I use SAGA to support distributed applications on grids and clouds?

This content has been archived, and is no longer maintained by Indiana University. Information here may no longer be accurate, and links may no longer be available or reliable.

On this page:

  • Activity
  • Achievements
  • Development of tools and frameworks
  • Data-intensive apps
  • Future plans
  • References

Activity

The Simple API for Grid Applications (SAGA) is an Open Grid Forum (OGF) standard that defines a high-level, application-driven API for developing first-principles distributed applications, as well as distributed application frameworks and tools. The SAGA project provides SAGA API implementations in C++ and Python that interface to a variety of middleware backends, along with higher-level application frameworks such as Master-Worker, MapReduce, AllPairs, and BigJob. For all of these components, FG administrators use FutureGrid and the different software environments available on it for extensive portability and interoperability testing, as well as for scale-up and scale-out experiments. These activities help harden the SAGA components described above and support CS and science experiments based on SAGA.
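
To give a flavor of the API, the minimal Python sketch below submits a single job through SAGA. Class names follow the SAGA Python bindings (saga.job.Service, saga.job.Description), but the exact import path differs between Bliss and later SAGA-Python releases, and the hostname and pbs+ssh:// URL scheme are placeholders for whichever backend adaptor is installed:

    import saga  # SAGA Python bindings; the import path varies between releases

    # The URL scheme selects the middleware adaptor behind the same API,
    # e.g. fork://, pbs+ssh://, or bes:// -- the hostname here is a placeholder.
    js = saga.job.Service("pbs+ssh://india.futuregrid.org")

    # Describe the job in middleware-independent terms.
    jd = saga.job.Description()
    jd.executable = "/bin/echo"
    jd.arguments  = ["Hello from SAGA"]
    jd.output     = "saga_job.out"

    job = js.create_job(jd)  # bind the description to the backend
    job.run()                # submit through the selected adaptor
    job.wait()               # block until the job completes
    print("Job state: %s" % job.state)

The point of the API is that only the URL needs to change when an application moves between middleware backends; the job description and submission logic stay the same.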

Achievements

FG has provided a persistent, production-grade experimental infrastructure with the ability to perform controlled experiments without violating production policies or disrupting production infrastructure priorities. These attributes, coupled with FutureGrid's technical support, have resulted in the following specific advances in under a year:

  • Use of FutureGrid for standards-based development and interoperability tests

    In particular, FG administrators have been able to prepare SAGA for future deployments on XSEDE by testing the SAGA-BES adaptor in a variety of configurations: against UNICORE and Genesis II backends, with UserPass and certificate-based authentication, with POSIX and HPC application types, and with and without file-staging support. While these tests are still ongoing, they provide confidence in the expected XSEDE middleware evolution; in the vast majority of cases, the standards-based approach works without difficulty.

    Administrators also continuously use FG-based job submission endpoints for interoperation tests driven by GIN (Grid Interoperation Now) with a variety of other production Grid infrastructures, including DEISA, PRACE, and EGI.

    To simplify deployment and improve end-user support for SAGA, administrators have used FG hosts to develop, test, and harden deployment procedures by mimicking the Community Software Area (CSA) approach currently used on TeraGrid and XSEDE. This deployment procedure also keeps SAGA and SAGA-based components available and maintained on all FG endpoints.

  • Use of FG for analyzing and comparing programming models and run-time tools for computation- and data-intensive science

Development of tools and frameworks

  • P* experiments: P* is a conceptual model of pilot-based abstractions, particularly relevant to pilot jobs. Administrators' work on P* includes comparisons between different pilot-job frameworks (BigJob, Condor GlideIn, DIANE, Swift), as well as between different coordination models within those frameworks. FutureGrid was used for a number of these experiments, as it allowed a range of characteristics to be compared in a controlled environment; a generic sketch of the pilot-job pattern appears after this list.
  • Advanced dynamic partitioning and distribution of data-intensive distributed applications: FutureGrid resources have been crucial in carrying out a first set of scoping experiments for O.W.'s Ph.D. thesis, Towards a Reasoning Framework and Software System to Aid the Dynamic Partitioning, Distribution and Execution of Data-Intensive Applications. In these scoping experiments, three distinct FutureGrid resources (India, Hotel, Alamo) were used to execute a data-intensive genome-matching workload (HTC) in a coordinated fashion. The partitioning and distribution decisions were made dynamically by an experimental software system based on autonomic computing concepts, which can monitor FutureGrid HPC resources as well as jobs during workload execution.
  • Bliss (SAGA): Bliss is an experimental implementation of SAGA written in pure Python. Bliss does not rely on any distributed Grid middleware; instead, it provides distributed access to all FutureGrid HPC resources through an SFTP plug-in for file transfer and PBS over SSH for SAGA's job submission and resource information capabilities (as in the job sketch shown earlier). Bliss was developed specifically with FutureGrid in mind and has been used in several cross-site experiments as the primary access mechanism for computing and storage resources. While PBS over SSH is unlikely to replace "real" Grid middleware (e.g., Globus), its exposure through the standardized SAGA API presents an attractive, lightweight alternative to traditionally large Grid middleware stacks.
  • High-performance dynamic applications: In extreme-scale computational science, specialized architectures and multi-model simulations are of growing importance. In this emerging environment, different simulation components have different computational requirements. Instead of coarsely assigning resources to all simulation components for their lifetime, methodologies can be investigated whereby simulations are split into their constituent components and distributed computational resources are allocated according to the needs of each individual component. Each simulation component is transferred along with the data and parameters needed to execute it on the target hardware. This approach lets multi-component applications benefit more easily from heterogeneous and distributed computing environments, in which multiple types of processing elements and storage may be available.

    In cases where software is developed with a static execution mode and only one resource in mind, the option to distribute may not be available. By creating a dynamic method of execution and developing software that can package, transmit, and execute sub-applications remotely, existing simulations can be extended to make use of distributed resources. Through specially designed modules compatible with pre-existing Cactus framework applications, administrators demonstrated means of improving task-level parallelism and extended the range of usable computing resources with minimal changes to existing applications. Experiments were conducted on production cyberinfrastructure on FutureGrid and XSEDE, with up to 128 cores.

  • Grid/Cloud interop (with Andre Luckow, finished): Administrators demonstrated, for the first time, the use of pilot jobs concurrently on different types of infrastructure: BigJob was used on FutureGrid HPC and cloud resources as well as on other resources, such as XSEDE and OSG Condor resources.
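
The pilot-job pattern that P* models and BigJob implements can be summarized in a few lines of plain Python. This is a conceptual sketch only, not BigJob's actual API: a pilot is a container job that acquires a resource allocation once, after which application tasks are scheduled into it at the application level, without additional batch-queue waits.

    import queue
    import threading

    def start_pilot(tasks, n_slots):
        """Conceptual pilot: one resource allocation, many small tasks."""
        def worker():
            while True:
                task = tasks.get()
                if task is None:      # shutdown signal: release the slot
                    break
                task()                # execute an application-level task
                tasks.task_done()
        slots = [threading.Thread(target=worker) for _ in range(n_slots)]
        for t in slots:
            t.start()
        return slots

    # Application-level scheduling: tasks enter the pilot's queue,
    # decoupled from the resource manager's batch queue.
    tasks = queue.Queue()
    slots = start_pilot(tasks, n_slots=4)

    for i in range(16):
        tasks.put(lambda i=i: print("task %d done" % i))
    tasks.join()                      # wait for all tasks to finish

    for _ in range(4):
        tasks.put(None)               # shut the pilot down
    for t in slots:
        t.join()

In a real pilot-job framework, the worker slots live inside a job submitted to a resource manager (or inside a cloud instance) and tasks are shipped to them over the network, but the decoupling of resource acquisition from task scheduling is the same.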

Data-intensive apps

  • MapReduce (with Andre Luckow): In ref. [1], various implementations of the word-count application are compared, covering not only multiple heterogeneous infrastructures (Sector versus DFS), but also different programming models (Sphere versus MapReduce); a minimal word-count map/reduce is sketched after this list.
  • Grid/Cloud NGS analysis experiments: Building upon SAGA-based MapReduce, an efficient pipeline for gene sequencing has been constructed. This pipeline is capable of dynamic resource utilization and task/worker placement.
  • Hybrid cloud/grid scientific applications and tools (autonomic schedulers) (with Manish Parashar, finished): A policy-based (objective-driven) autonomic scheduler provides a system-level approach to hybrid grid-cloud usage. FutureGrid has been used to develop and extend this autonomic scheduling and its application requirements, and the distributed, heterogeneous resources of FutureGrid have been integrated as a pool of resources that the policy-based autonomic scheduler (Comet) can allocate. The scheduler dynamically determines and allocates instances to meet specific objectives, such as lowest time to completion or lowest cost. Administrators also used FG to supplement objective-driven pilot jobs on TeraGrid (Ranger).
  • Investigate run-time fluctuations of application kernels: Administrators are exploring and characterizing run-time fluctuations for a given application kernel that is representative of a large number of MPI/parallel workloads and workflows. These fluctuations appear to be independent of system load and instead a consequence of the complex interaction between MPI library specifics, the virtualization layer, and the operating environment. Administrators have therefore been investigating fluctuations in application performance caused by the cloud operational environment, with the explicit aim of correlating these fluctuations with the details of the infrastructure. Because it is difficult to discern or reverse-engineer specific infrastructure details on EC2 or other commercial infrastructure, FutureGrid has provided a controlled and well-understood environment at infrastructure scales that are not attainable at the individual PI/resource level.
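
For reference, the word-count application compared in ref. [1] follows the classic map/reduce pattern, shown here as a minimal single-process Python sketch (the real SAGA-MapReduce implementation distributes the map and reduce phases across workers on multiple resources):

    from collections import defaultdict

    def map_phase(chunk):
        """Map: emit a (word, 1) pair for every word in a text chunk."""
        return [(word, 1) for word in chunk.split()]

    def reduce_phase(pairs):
        """Reduce: sum the counts for each distinct word."""
        counts = defaultdict(int)
        for word, n in pairs:
            counts[word] += n
        return dict(counts)

    chunks = ["the quick brown fox", "the lazy dog", "the fox"]
    pairs = [p for chunk in chunks for p in map_phase(chunk)]
    print(reduce_phase(pairs))  # {'the': 3, 'quick': 1, 'brown': 1, ...}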

Future plans

Administrators will continue to use FG as a resource for SAGA development. Some explicit goals include:

  • To move the testing infrastructure to other SAGA-based components, like the PilotJob and PilotData frameworks
  • To widen the set of middleware systems used for testing (again, keeping XSEDE and other production Grid infrastructures (PGIs) in mind)
  • To enhance the scope and scale of scalability testing
  • To test and harden deployment and packaging procedures

References

  1. [fg-1975] Sehgal, S., M. Erdelyi, A. Merzky, and S. Jha, "Understanding application-level interoperability: Scaling-out MapReduce over high-performance grids and clouds", Future Generation Computer Systems, vol. 27, issue 5, 2011.
  2. [fg-1976] Luckow, A., L. Lacinski, and S. Jha, "SAGA BigJob: An Extensible and Interoperable Pilot-Job Abstraction for Distributed Applications and Systems", 10th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, 2010.
  3. [fg-1977] Luckow, A., and S. Jha, "Abstractions for Loosely-Coupled and Ensemble-Based Simulations on Azure", IEEE International Conference on Cloud Computing Technology and Science, 2010.
  4. [fg-1978] Kim, J., S. Maddineni, and S. Jha, "Building Gateways for Life-Science Applications using the Dynamic Application Runtime Environment (DARE) Framework", The 2011 TeraGrid Conference: Extreme Digital Discovery, 2011.
  5. [fg-1979] Kim, J., S. Maddineni, and S. Jha, "Characterizing Deep Sequencing Analytics using BFAST: Towards a Scalable Distributed Architecture for Next-Generation Sequencing Data", The Second International Workshop on Emerging Computational Methods for the Life Sciences, June 2011.

This is document bcgy in the Knowledge Base.
Last modified on 2018-01-18 17:19:33.