Improving Performance in Data Processing Distributed Systems by Exploiting Data Placement and Partitioning

Huang, Dachuan

thesis.pdf (1.58 MB)

Permalink:

http://rave.ohiolink.edu/etdc/view?acc_num=osu1483312415041341

Year and Degree

2017, Doctor of Philosophy, Ohio State University, Computer Science and Engineering.

Abstract

Our society is experiencing rapid growth in the amount of data because of the widespread use of mobile devices, sensors, and computers. Recent estimates show that 2.5 exabytes of data are generated worldwide every day. Analyzing this amount of data could enable more intelligent business decisions, faster scientific discoveries, and more accurate societal services. Traditional single-machine data processing techniques, such as relational database management systems, quickly showed their limitations when handling large amounts of data. To satisfy the ever-growing demand for large-scale data analysis, various public and commercial distributed data analysis systems have been built, such as High Performance Computing and Cloud Computing systems. These data processing distributed systems, with their excellent concurrency, scalability, and fault tolerance, are gaining more attention in research institutions and industry. People are already enjoying the benefits of collecting and analyzing large amounts of data on maturely deployed data processing distributed systems. Unfortunately, data processing distributed systems have their own performance problems. More specifically, in the device layer, the system suffers from long seek latency in hard disks, which reduces I/O throughput under random-access I/O patterns. In the framework layer, the system experiences the straggler problem in parallel jobs, where the slowest task alone prolongs the job execution time even though all other tasks finished much earlier. In the algorithm layer, the system has difficulty deciding the intermediate cache size, because the following phase's speed-up benefit can be outweighed by the overhead of writing and reading a large intermediate cache file. This thesis aims to solve these problems, and hence to improve distributed system performance, by exploiting data placement and partitioning.
Specifically, we propose the following solutions to address these three problems. First, we propose a hybrid storage system that combines hard disk drives and solid state drives in an HPC environment, where the layout of the input data is reorganized to hide the long seek latency of hard disks. Second, we propose logical data partitioning strategies for input data, so that the distributed system can benefit from the ability of fine-grained tasks to mitigate the straggler problem without paying a prohibitive overhead. Lastly, when intermediate data can be saved to speed up the following job's execution, we propose an online analyzer that decides how much data to place into the cache. We have designed and implemented a prototype for each work and evaluated them with representative workloads and datasets on the widely used distributed system platforms PVFS and Hadoop. Our evaluation shows almost optimal results, which match the theoretically expected performance improvement. In the device layer, we achieve a low-latency storage device at an affordable cost. In the framework layer, we achieve minimal phase execution time in the presence of stragglers. In the algorithm layer, we achieve near-optimal job execution time for MapReduce FIM algorithms. Furthermore, our prototypes have low system overhead, which is necessary for wide application in practice.
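The straggler effect described in the abstract, and the reason fine-grained tasks can reduce it, can be illustrated with a small simulation. This is only a toy sketch and not the thesis's method: the function name `job_time`, the worker speeds, and the task sizes are all assumptions chosen for illustration, and per-task scheduling overhead (the cost the thesis aims to avoid through logical partitioning) is ignored entirely.

```python
import heapq


def job_time(task_work, worker_speeds):
    """Return the job completion time when each task goes to the next free worker.

    task_work:     amount of work per task (hypothetical units)
    worker_speeds: work units each worker completes per time unit (hypothetical)
    """
    # Min-heap of (time the worker becomes free, worker id).
    free = [(0.0, w) for w in range(len(worker_speeds))]
    heapq.heapify(free)
    finish = 0.0
    for work in task_work:
        free_at, w = heapq.heappop(free)          # earliest available worker
        done = free_at + work / worker_speeds[w]  # slow workers take longer
        finish = max(finish, done)                # the job ends with its slowest task
        heapq.heappush(free, (done, w))
    return finish


# Assumed setup: three normal workers and one worker that is four times slower.
speeds = [1.0, 1.0, 1.0, 0.25]

# Coarse-grained: 4 tasks of 10 units each, so the slow worker holds a full task.
coarse = job_time([10.0] * 4, speeds)

# Fine-grained: the same 40 units of work split into 32 tasks of 1.25 units each.
fine = job_time([1.25] * 32, speeds)

print(coarse, fine)
```

With these made-up numbers the coarse-grained job takes 40 time units, because the one task on the slow worker determines the finish time, while the fine-grained job takes 15 time units, because the faster workers absorb most of the small tasks. A real system would also pay a start-up cost per task, which is the overhead the abstract refers to.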

Committee

Feng Qin (Advisor)
Yang Wang (Committee Member)
Ten-Hwang Lai (Committee Member)

Pages

130 p.

Subject Headings

Computer Engineering; Computer Science

Keywords

Performance in Data Processing Distributed Systems, Exploiting Data Placement and Partitioning

Recommended Citations

APA Style (7th edition)

Huang, D. (2017). Improving Performance in Data Processing Distributed Systems by Exploiting Data Placement and Partitioning [Doctoral dissertation, Ohio State University]. OhioLINK Electronic Theses and Dissertations Center. http://rave.ohiolink.edu/etdc/view?acc_num=osu1483312415041341

MLA Style (8th edition)

Huang, Dachuan. Improving Performance in Data Processing Distributed Systems by Exploiting Data Placement and Partitioning. 2017. Ohio State University, Doctoral dissertation. OhioLINK Electronic Theses and Dissertations Center, http://rave.ohiolink.edu/etdc/view?acc_num=osu1483312415041341.

Chicago Manual of Style (17th edition)

Huang, Dachuan. "Improving Performance in Data Processing Distributed Systems by Exploiting Data Placement and Partitioning." Doctoral dissertation, Ohio State University, 2017. http://rave.ohiolink.edu/etdc/view?acc_num=osu1483312415041341

Document number:

osu1483312415041341

Download Count:

301
