Maximizing Parallelization Opportunities by Automatically Inferring Optimal Container Memory for Asymmetrical Map Tasks

Abstract Details

2016, Master of Science (MS), Bowling Green State University, Computer Science.

The MapReduce programming model provides efficient parallel processing via two user-defined tasks: map and reduce. Input files are split into individual records and each record is processed by a single map task. Implementations of the model, such as Apache Hadoop, typically assume each record is similarly sized and thus allocate the same memory for every map task. However, certain data sets, such as Wikipedia or the project data from SourceForge, do not follow this assumption and instead contain very asymmetrically sized records. To handle such data sets, the current state-of-the-art solution is to determine the maximum memory requirement of the largest record and then configure the system to allocate this much memory for all map tasks. This simple solution, however, decreases the maximum number of tasks running in parallel, leading to lower overall CPU utilization on the cluster. In this work, we propose three solutions to help maintain as much parallelism as possible while processing such asymmetric data sets.
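For context, the uniform allocation Hadoop applies to map tasks is governed by the mapreduce.map.memory.mb (container size) and mapreduce.map.java.opts (JVM heap) properties. The minimal sketch below, with purely illustrative values and a hypothetical class name, shows the state-of-the-art workaround of sizing every map container for the largest record:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class UniformMapMemoryJob {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // State-of-the-art workaround: size every map container for the
            // largest record. The 8192 MB figure is purely illustrative.
            conf.setInt("mapreduce.map.memory.mb", 8192);
            conf.set("mapreduce.map.java.opts", "-Xmx7372m"); // JVM heap kept below the container size
            Job job = Job.getInstance(conf, "uniform-map-memory");
            // ... set mapper, reducer, and input/output paths, then job.waitForCompletion(true) ...
        }
    }

Because every container is sized for the worst-case record, the cluster can run far fewer map tasks concurrently, which is the parallelism loss the thesis addresses.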

As our first solution, we propose a fall-back approach, where initially all map tasks run with the standard allocation of memory. If any task attempt fails due to memory reasons, subsequent attempts of that task are run with twice the memory/heap of the previous attempt. This solution is fully automatic and requires no configuration of Hadoop. Our second solution allows users to manually specify which map tasks should be allocated more memory. Thus, if users know a priori which records are larger, they can avoid letting those map tasks run, fail, and retry as in the first approach. Our third solution automatically infers which map tasks have higher memory requirements, by learning from prior runs on the same input data which tasks failed and had to fall back under our first solution. If multiple jobs are run on the same input, this solution avoids the need for users to manually specify the tasks requiring more memory. We evaluate these approaches by measuring performance on several data sets and comparing to the state-of-the-art solution. Our evaluation shows up to 37-48% performance improvement for our approaches when compared to the state-of-the-art solution, due to increased parallelism in the system, and with minimal overhead.
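As a rough illustration of the first, fall-back solution (a sketch with assumed names and values, not the implementation evaluated in the thesis), the memory granted to a task attempt can be computed by doubling the standard allocation once for each prior memory-related failure, capped at the cluster's per-container maximum:

    public class FallbackMemoryPolicy {
        /**
         * Sketch of the fall-back rule: start every map task at the standard
         * allocation and double the container memory on each retry that
         * follows a memory-related failure, up to the per-container maximum.
         */
        static int memoryForAttemptMb(int baseMemoryMb, int memoryFailures, int clusterMaxMb) {
            long doubled = (long) baseMemoryMb << memoryFailures; // base * 2^failures
            return (int) Math.min(doubled, clusterMaxMb);
        }

        public static void main(String[] args) {
            // Illustrative values only: 1 GB standard allocation, 8 GB container cap.
            for (int failures = 0; failures <= 4; failures++) {
                System.out.printf("attempt after %d memory failures -> %d MB%n",
                        failures, memoryForAttemptMb(1024, failures, 8192));
            }
        }
    }

Under these illustrative numbers the allocations are 1024, 2048, 4096, 8192, and 8192 MB, so only the retried attempts pay for oversized records while every other task keeps the small standard container.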

Dr. Robert Dyer (Advisor)
Dr. Robert Green (Committee Member)
Dr. Jong Kwan Lee (Committee Member)
55 p.

Recommended Citations

  • Shrimal, S. (2016). Maximizing Parallelization Opportunities by Automatically Inferring Optimal Container Memory for Asymmetrical Map Tasks [Master's thesis, Bowling Green State University]. OhioLINK Electronic Theses and Dissertations Center. http://rave.ohiolink.edu/etdc/view?acc_num=bgsu1468011920

    APA Style (7th edition)

  • Shrimal, Shubhendra. Maximizing Parallelization Opportunities by Automatically Inferring Optimal Container Memory for Asymmetrical Map Tasks. 2016. Bowling Green State University, Master's thesis. OhioLINK Electronic Theses and Dissertations Center, http://rave.ohiolink.edu/etdc/view?acc_num=bgsu1468011920.

    MLA Style (8th edition)

  • Shrimal, Shubhendra. "Maximizing Parallelization Opportunities by Automatically Inferring Optimal Container Memory for Asymmetrical Map Tasks." Master's thesis, Bowling Green State University, 2016. http://rave.ohiolink.edu/etdc/view?acc_num=bgsu1468011920

    Chicago Manual of Style (17th edition)