Designing and Modeling High-Performance MapReduce and DAG Execution Framework on Modern HPC Systems

2016, Doctor of Philosophy, Ohio State University, Computer Science and Engineering.
Big Data processing and High-Performance Computing (HPC) are two disruptive technologies that are converging to meet the challenges posed by large-scale data analysis. MapReduce, a popular parallel programming model for data-intensive applications, is being used extensively through different execution frameworks (e.g., batch processing, Directed Acyclic Graph or DAG) on modern HPC systems because of its ease of programming, fault tolerance, and scalability. However, as these applications begin scaling to terabytes of data, the socket-based communication model, which is the default implementation in the open-source MapReduce execution frameworks, becomes a performance bottleneck. Moreover, because of the synchronized nature of stocking the data in various execution phases, the default Hadoop MapReduce framework cannot leverage the full potential of the underlying interconnect. MapReduce frameworks also rely heavily on the availability of local storage media, which can leave insufficient space for applications that generate a large amount of intermediate data. On the other hand, most leadership-class HPC systems follow the traditional Beowulf architecture, with a separate parallel storage system and either no or very limited local storage. The storage architectures in these HPC systems are not natively conducive to default MapReduce. Also, modern high-performance interconnects (e.g., InfiniBand) used to access the parallel storage in these systems can provide extremely low latency and high bandwidth. Additionally, advanced storage architectures, such as Non-Volatile Memories (NVM), can provide byte-addressability as well as data persistence. Efficient utilization of all these resources through enhanced designs of execution frameworks with a tuned parameter space is crucial for MapReduce in terms of performance and scalability. This work addresses several shortcomings of current MapReduce execution frameworks.
It presents an enhanced Big Data execution framework, HOMR (Hybrid Overlapping in MapReduce), which improves the MapReduce job execution pipeline by maximizing overlapping among execution phases. HOMR also introduces an RDMA (Remote Direct Memory Access)-based shuffle engine with advanced shuffle algorithms to leverage the benefits of the high-performance interconnects used in HPC systems. It minimizes the large number of disk accesses in MapReduce execution frameworks through in-memory operations combined with a fast execution pipeline. This work also proposes different deployment architectures that use Lustre as the underlying storage and provides fast shuffle strategies with dynamic adjustments. Priority-based storage selection for intermediate data ensures the best storage usage at any point of job execution. This work also presents a variant of HOMR that can exploit the byte-addressability of NVM to provide fast execution of MapReduce applications. Finally, a generalized advising framework is presented that can provide optimum configuration recommendations for any MapReduce system with profiling and prediction capabilities. Through performance modeling of this MapReduce execution framework, techniques for predicting job execution performance are demonstrated on leadership-class HPC clusters at large scale.
Dhabaleswar Panda (Advisor)
Ponnuswamy Sadayappan (Committee Member)
Radu Teodorescu (Committee Member)
239 p.

Recommended Citations

  • Rahman, M. W.-U.- (2016). Designing and Modeling High-Performance MapReduce and DAG Execution Framework on Modern HPC Systems [Doctoral dissertation, Ohio State University]. OhioLINK Electronic Theses and Dissertations Center. http://rave.ohiolink.edu/etdc/view?acc_num=osu1480475635778714

    APA Style (7th edition)

  • Rahman, Md. Designing and Modeling High-Performance MapReduce and DAG Execution Framework on Modern HPC Systems. 2016. Ohio State University, Doctoral dissertation. OhioLINK Electronic Theses and Dissertations Center, http://rave.ohiolink.edu/etdc/view?acc_num=osu1480475635778714.

    MLA Style (8th edition)

  • Rahman, Md. "Designing and Modeling High-Performance MapReduce and DAG Execution Framework on Modern HPC Systems." Doctoral dissertation, Ohio State University, 2016. http://rave.ohiolink.edu/etdc/view?acc_num=osu1480475635778714

    Chicago Manual of Style (17th edition)