Files
Cal FINAL 7 26 2021 with cert.pdf (582.53 KB)
Evaluating Query Estimation Errors Using Bootstrap Sampling
Author Info
Cal, Semih
Permalink: http://rave.ohiolink.edu/etdc/view?acc_num=ysu1627358871966099
Year and Degree
2021, Master of Computing and Information Systems, Youngstown State University, Department of Computer Science and Information Systems.
Abstract
Big data embodies a massive amount of knowledge, and many businesses now rely on mining it to forecast the viability of future business operations. Mining big data can require a large investment of time, and sampling is one of the most widely used methods for reducing that cost. When queries are answered from a sample, evaluating the quality of the query estimates (e.g., the query prediction error) is crucial for meaningful results. The main method used in the past to solve this problem is based on bootstrap sampling, but existing work typically makes strong dataset assumptions that may not apply to real-world datasets. This research aims to evaluate query estimation errors using the bootstrap sampling method. Of the different kinds of bootstrap methods, we use non-parametric bootstrap sampling to calculate the error distribution of the queries that we choose, and then calculate confidence intervals to determine the hit ratio. Although bootstrap sampling is one of the main approaches for finding the error in statistical estimates, it is computationally expensive on large data. To address this, we test both memory and disk as storage for optimizing bootstrap sampling. Furthermore, we test two different total numbers of bootstrap samples (B=2000 and B=200) to check whether bootstrap computation can be reduced while the results remain reliable. In the experiments, we use three data sizes (100MB, 1GB, and 10GB) and three sampling ratios (0.1%, 0.5%, and 1%) to analyze data generated with the TPC-H benchmark in terms of accuracy and performance. The results demonstrate that the hit ratios are very high even with a 0.1% sampling ratio. The optimization strategies that were used reduced the bootstrap sample computation time adequately.
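The procedure the abstract describes (draw a sample without replacement, resample it B times, build a confidence interval, and count how often the interval covers the true answer) can be sketched roughly as follows. This is a minimal illustration, not the thesis's implementation. Several things here are assumptions that the abstract does not state: the query is taken to be a SUM aggregate, the interval is a percentile interval, "hit ratio" is read as the fraction of trials whose interval contains the true query result, and the data is synthetic rather than TPC-H. All function names are hypothetical.

```python
import random
import statistics


def srswor(population, ratio, rng):
    """Simple random sample without replacement of about ratio * N rows."""
    n = max(1, round(len(population) * ratio))
    return rng.sample(population, n)


def estimate_sum(sample, population_size):
    """Scale the sample mean up to an estimate of SUM over the full table."""
    return population_size * statistics.fmean(sample)


def bootstrap_ci(sample, population_size, B, alpha, rng):
    """Percentile interval from B non-parametric bootstrap resamples.

    Each resample draws len(sample) rows from the sample WITH replacement.
    The percentile construction is an assumption made for this sketch.
    """
    n = len(sample)
    estimates = sorted(
        estimate_sum(rng.choices(sample, k=n), population_size)
        for _ in range(B)
    )
    lower = estimates[int((alpha / 2) * B)]
    upper = estimates[min(B - 1, int((1 - alpha / 2) * B))]
    return lower, upper


def hit_ratio(population, ratio, B, trials, alpha=0.05, seed=0):
    """Fraction of trials whose interval contains the true SUM (assumed definition)."""
    rng = random.Random(seed)
    truth = sum(population)
    hits = 0
    for _ in range(trials):
        sample = srswor(population, ratio, rng)
        lower, upper = bootstrap_ci(sample, len(population), B, alpha, rng)
        hits += lower <= truth <= upper
    return hits / trials


if __name__ == "__main__":
    data_rng = random.Random(42)
    # Synthetic column standing in for a numeric attribute of a large table.
    table = [data_rng.uniform(1.0, 100.0) for _ in range(10_000)]
    for B in (200, 2000):
        print(B, hit_ratio(table, ratio=0.01, B=B, trials=20))
```

Running the script compares the two values of B mentioned in the abstract on the same synthetic column. The cost of each trial grows linearly with B, which is why a smaller B is attractive whenever the resulting hit ratio stays comparable.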
Committee
Feng Yu, PhD (Advisor)
John R. Sullins, PhD (Committee Member)
Yong Zhang, PhD (Committee Member)
Pages
39 p.
Subject Headings
Computer Science
Keywords
Bootstrap; SRSWOR; Query error estimation
Citations
APA Style (7th edition)
Cal, S. (2021). Evaluating Query Estimation Errors Using Bootstrap Sampling [Master's thesis, Youngstown State University]. OhioLINK Electronic Theses and Dissertations Center. http://rave.ohiolink.edu/etdc/view?acc_num=ysu1627358871966099
MLA Style (8th edition)
Cal, Semih. Evaluating Query Estimation Errors Using Bootstrap Sampling. 2021. Youngstown State University, Master's thesis. OhioLINK Electronic Theses and Dissertations Center, http://rave.ohiolink.edu/etdc/view?acc_num=ysu1627358871966099.
Chicago Manual of Style (17th edition)
Cal, Semih. "Evaluating Query Estimation Errors Using Bootstrap Sampling." Master's thesis, Youngstown State University, 2021. http://rave.ohiolink.edu/etdc/view?acc_num=ysu1627358871966099
Document number: ysu1627358871966099
Download Count: 164
Copyright Info
© 2021, all rights reserved.
This open access ETD is published by Youngstown State University and OhioLINK.