Files
thesis.pdf (1.58 MB)
Improving Performance in Data Processing Distributed Systems by Exploiting Data Placement and Partitioning
Huang, Dachuan
Permalink:
http://rave.ohiolink.edu/etdc/view?acc_num=osu1483312415041341
Year and Degree
2017, Doctor of Philosophy, Ohio State University, Computer Science and Engineering.
Abstract
Our society is experiencing rapid growth in the amount of data because of widely used mobile devices, sensors, and computers. Recent estimates show that 2.5 exabytes of data are generated worldwide every day. Analyzing this amount of data could enable more intelligent business decisions, faster scientific discoveries, and more accurate services for society. Traditional single-machine data processing techniques, such as relational database management systems, quickly showed their limitations when handling large amounts of data. To satisfy the ever-growing demand for large-scale data analysis, various public and commercial distributed data analysis systems have been built, such as High Performance Computing and Cloud Computing systems. These data processing distributed systems, with their excellent concurrency, scalability, and fault tolerance, are gaining more attention in research institutions and industry. People are already enjoying the benefits of collecting and analyzing large amounts of data on maturely deployed data processing distributed systems. Unfortunately, data processing distributed systems have their own performance problems. More specifically, in the device layer, the system suffers from long seek latency in hard disks, which reduces I/O throughput under random-access I/O patterns. In the framework layer, the system experiences the straggler problem in parallel jobs, where the slowest task alone prolongs the job execution time even though all other tasks finished much earlier. In the algorithm layer, the system has difficulty deciding the intermediate cache size, because the following phase's speed-up benefit can be outweighed by the overhead incurred by writing and reading a large intermediate cache file. This thesis aims to solve these problems, and hence to improve distributed system performance, by exploiting data placement and partitioning.
Specifically, we propose the following solutions to address the three aforementioned problems. First, we propose using a hybrid storage system with hard disk drives and solid state drives in HPC environments, where the layout of the input data is reorganized to hide the long seek latency of hard disks. Second, we propose logical data partitioning strategies for input data, so that the distributed system can benefit from the ability of fine-grained tasks to address the straggler problem without paying their prohibitive overhead. Lastly, when intermediate data can be saved to speed up the following job's execution, we propose an online analyzer to decide how much data to place in the cache. We have designed and implemented prototypes for each work and evaluated them with representative workloads and datasets on the widely used distributed system platforms PVFS and Hadoop. Our evaluation shows almost optimal results, which match the theoretically expected performance improvement. In the device layer, we achieve a low-latency storage device at affordable cost. In the framework layer, we achieve minimal phase execution time in the presence of stragglers. In the algorithm layer, we achieve near-optimal job execution time for MapReduce FIM algorithms. Furthermore, our prototypes have low system overhead, which is a necessity for wide application in practice.
Committee
Feng Qin (Advisor)
Yang Wang (Committee Member)
Ten-Hwang Lai (Committee Member)
Pages
130 p.
Subject Headings
Computer Engineering; Computer Science
Keywords
Performance in Data Processing Distributed Systems, Exploiting Data Placement and Partitioning
Recommended Citations
APA Style (7th edition)
Huang, D. (2017). Improving Performance in Data Processing Distributed Systems by Exploiting Data Placement and Partitioning [Doctoral dissertation, Ohio State University]. OhioLINK Electronic Theses and Dissertations Center. http://rave.ohiolink.edu/etdc/view?acc_num=osu1483312415041341
MLA Style (8th edition)
Huang, Dachuan. Improving Performance in Data Processing Distributed Systems by Exploiting Data Placement and Partitioning. 2017. Ohio State University, Doctoral dissertation. OhioLINK Electronic Theses and Dissertations Center, http://rave.ohiolink.edu/etdc/view?acc_num=osu1483312415041341.
Chicago Manual of Style (17th edition)
Huang, Dachuan. "Improving Performance in Data Processing Distributed Systems by Exploiting Data Placement and Partitioning." Doctoral dissertation, Ohio State University, 2017. http://rave.ohiolink.edu/etdc/view?acc_num=osu1483312415041341
Document number:
osu1483312415041341
Download Count:
301
Copyright Info
© 2017, all rights reserved.
This open access ETD is published by The Ohio State University and OhioLINK.