Files
Dissertation_Yu_Su.pdf (1.9 MB)
Big Data Management Framework based on Virtualization and Bitmap Data Summarization
Author Info
Su, Yu
Permalink:
http://rave.ohiolink.edu/etdc/view?acc_num=osu1420738636
Year and Degree
2015, Doctor of Philosophy, Ohio State University, Computer Science and Engineering.
Abstract
In recent years, science has become increasingly data driven. Data collected from instruments and simulations is extremely valuable for a variety of scientific endeavors. The key challenge these efforts face is that dataset sizes continue to grow rapidly. As the computational capabilities of parallel machines grow, the temporal and spatial scales of simulations are becoming increasingly fine-grained. However, data transfer bandwidth and disk I/O speed are growing at a much slower pace, making it extremely hard for scientists to transport these rapidly growing datasets. Our overall goal is to provide a virtualization- and bitmap-based data management framework for “big data” applications. The challenges arise from four aspects. First, the “big data” problem creates a strong requirement for efficient but lightweight server-side data subsetting and aggregation, both to decrease the data loading and transfer volume and to help scientists find the subsets of the data that are of interest to them. Second, data sampling, which selects a small set of samples to represent the entire dataset, can greatly decrease the data processing volume and improve efficiency. However, finding a sample accurate enough to preserve scientific data features is difficult, and estimating sampling accuracy is also time-consuming. Third, correlation analysis over multiple variables plays a very important role in scientific discovery, but scanning through multiple variables to compute correlations is extremely time-consuming. Finally, because of the huge gap between computing and storage, a large amount of data analysis time is wasted on I/O. In an in-situ environment, it is very difficult to generate, before the data is written to disk, a smaller profile that represents the original dataset and still supports different analyses.
In our work, we propose a data management framework to support more efficient scientific data analysis. It contains two modules: SQL-based Data Virtualization and Bitmap-based Data Summarization. The SQL-based Data Virtualization module supports high-level SQL-like queries over different kinds of low-level data formats such as NetCDF and HDF5. From the scientists’ perspective, all they need to know is how to use SQL queries to specify their data subsetting, aggregation, sampling, or even correlation analysis requirements. Our module automatically translates the high-level SQL queries into low-level data access languages, fetches the data subsets, performs the required calculations, and returns the final results to the scientists. The Bitmap-based Data Summarization module treats the bitmap index as a summarization of the data and supports different kinds of analysis using only bitmaps. Indexing technology, especially bitmap indexing, has been widely used in the database area to improve data query efficiency. The major contribution of our work is the finding that a bitmap index preserves both the value distribution and the spatial locality of a scientific dataset. Hence, it can be treated as a summarization of the data with a much smaller size. We demonstrate that many different kinds of analyses can be supported using only bitmaps.
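The idea that a bitmap index keeps both the value distribution and the spatial locality of a dataset can be illustrated with a minimal sketch. This is not the dissertation's implementation: the function names, the equal-width bins, and the use of uncompressed Python integers as bitsets are all assumptions made for illustration. Each bin gets one bitmap whose bit i is set when element i falls in that bin, so bit counts give a histogram and bit positions give locations.

```python
from typing import List


def build_bitmap_index(values: List[float], edges: List[float]) -> List[int]:
    """Build one bitmap per bin [edges[k], edges[k+1]).

    Each bitmap is a Python int used as a bitset: bit `pos` is set when
    values[pos] falls in that bin. The last bin also includes edges[-1].
    (Hypothetical helper; real systems use compressed bitmaps.)
    """
    bitmaps = [0] * (len(edges) - 1)
    for pos, v in enumerate(values):
        for k in range(len(bitmaps)):
            is_last = k == len(bitmaps) - 1
            if edges[k] <= v < edges[k + 1] or (is_last and v == edges[-1]):
                bitmaps[k] |= 1 << pos
                break
    return bitmaps


def popcount(bitmap: int) -> int:
    """Number of set bits in a bitmap."""
    return bin(bitmap).count("1")


def histogram(bitmaps: List[int]) -> List[int]:
    """Value distribution recovered from the bitmaps alone."""
    return [popcount(b) for b in bitmaps]


def count_in_region(bitmaps: List[int], bin_lo: int, bin_hi: int,
                    start: int, stop: int) -> int:
    """Count elements in bins bin_lo..bin_hi (inclusive) whose positions
    lie in [start, stop), using only bitwise operations on the bitmaps."""
    region_mask = ((1 << (stop - start)) - 1) << start
    hits = 0
    for b in bitmaps[bin_lo:bin_hi + 1]:
        hits |= b  # OR merges the value range
    return popcount(hits & region_mask)  # AND restricts to the region


def approx_mean(bitmaps: List[int], edges: List[float]) -> float:
    """Approximate mean using bin midpoints weighted by bit counts."""
    total = sum(popcount(b) for b in bitmaps)
    weighted = sum(popcount(b) * (edges[k] + edges[k + 1]) / 2
                   for k, b in enumerate(bitmaps))
    return weighted / total


# Illustrative data (made up for this sketch).
values = [0.5, 1.5, 2.5, 3.5, 0.2, 1.1, 2.9, 3.0]
edges = [0.0, 1.0, 2.0, 3.0, 4.0]
index = build_bitmap_index(values, edges)

print(histogram(index))                    # per-bin counts
print(count_in_region(index, 2, 3, 0, 4))  # values >= 2.0 among the first 4 elements
print(approx_mean(index, edges))           # midpoint-based estimate of the mean
```

In this sketch the answers are exact for counts and approximate for the mean, with the error bounded by the bin width; finer bins trade index size for accuracy.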
Committee
Gagan Agrawal (Advisor)
Pages
261 p.
Subject Headings
Computer Science
Keywords
Big Data; High-Performance Computing; Bitmap Index; Data Virtualization; Sampling; Correlation Analysis; Time Steps Selection; In-Situ Analysis; Distributed Computing; Scientific Data Management; Wide-area Data Transfer
Recommended Citations
Su, Y. (2015). Big Data Management Framework based on Virtualization and Bitmap Data Summarization [Doctoral dissertation, Ohio State University]. OhioLINK Electronic Theses and Dissertations Center. http://rave.ohiolink.edu/etdc/view?acc_num=osu1420738636
APA Style (7th edition)
Su, Yu. Big Data Management Framework based on Virtualization and Bitmap Data Summarization. 2015. Ohio State University, Doctoral dissertation. OhioLINK Electronic Theses and Dissertations Center, http://rave.ohiolink.edu/etdc/view?acc_num=osu1420738636.
MLA Style (8th edition)
Su, Yu. "Big Data Management Framework based on Virtualization and Bitmap Data Summarization." Doctoral dissertation, Ohio State University, 2015. http://rave.ohiolink.edu/etdc/view?acc_num=osu1420738636
Chicago Manual of Style (17th edition)
Document number:
osu1420738636
Download Count:
1,260
Copyright Info
© 2015, all rights reserved.
This open access ETD is published by The Ohio State University and OhioLINK.