Supporting Advanced Queries on Scientific Array Data

2018, Doctor of Philosophy, Ohio State University, Computer Science and Engineering.
Distributed scientific array data is becoming more prevalent and is increasing in size, and there is a growing need for performant advanced analytics over these data. In this dissertation, we address issues in data management, efficient declarative querying, and advanced analytics over array data. We formalize the semantics of array data querying and introduce distributed querying abilities over these data. We show how to improve the optimization phase of join querying, while developing efficient methods to execute joins in general. In addition, we introduce a class of operations that is closely related to the traditional joins performed on relational tables, including an operation we refer to as Mutual Range Joins (MRJ), which arises on scientific data that is not only numerical but also has measurement noise. While working closely with our colleagues to provide them with usable analytics over array data, we uncovered a new type of analytical querying: analytics over windows with an inner window ordering (in contrast to the external window ordering available elsewhere). Last, we adjust our join optimization approach for skewed settings, addressing resource skew observed in real environments as well as data skew that arises while data is processed. Several major contributions are introduced throughout this dissertation. First, we formalize querying over scientific array data (basic operators, such as subsetting, as well as complex analytical functions and joins). We focus on distributed data and present a framework to execute queries over variables that are distributed across multiple containers (DSDQuery DSI); this framework is used in production environments. Next, we present an optimization approach for join queries over geo-distributed data. This approach considers networking properties such as throughput and latency to optimize the execution of join queries.
For such complex optimization, we introduce methods and algorithms to efficiently prune candidate execution plans (DistriPlan). Then, once the join has been optimized, we show how to execute distributed joins and optimize the MRJ operator. We demonstrate how bitmap indexes can be used to accelerate the execution of distributed joins by introducing a new bitmap index structure that fits the MRJ goals (BitJoin). Afterwards, we introduce analytical functions (window querying) to the domain of scientific arrays (FDQ). Last, we revisit join optimization for different settings, while addressing data and resource skew (Sckeow). We thoroughly evaluate our systems. We show that DSDQuery DSI produces output of optimal size and produces it efficiently: performance decreases linearly with increasing dataset sizes. DistriPlan finds the optimal plan while considering a reasonable number of plans (out of the exponential number of candidate plans). BitJoin improves the performance of MRJs and equi-joins by 140% and 113% on average, respectively. By using a new processing model with an efficient memory allocation approach, FDQ improves the performance of existing functionality by 538% on average. In addition, FDQ efficiently processes query types that were not available before; its performance improves linearly with scaled resources. Last, Sckeow improves the performance of queries by 396% for heterogeneous settings and 368% for homogeneous ones. For heterogeneous settings, in most cases Sckeow generates an ideal plan directly, while in homogeneous settings it generates about half as many plans as other engines do.
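
The abstract names the Mutual Range Join (MRJ) without defining its predicate. As a rough illustration only, the sketch below assumes that two noisy measurements match when each lies within a relative tolerance of the other. That predicate, the function names, and the `rel_tol` parameter are assumptions made for this example, not the dissertation's definition, and the nested-loop evaluation is a plain baseline rather than the bitmap-based BitJoin method.

```python
from typing import List, Tuple


def in_range(center: float, rel_tol: float, value: float) -> bool:
    """True if `value` lies within +/- rel_tol * |center| of `center`."""
    return abs(value - center) <= rel_tol * abs(center)


def mutual_range_join(
    left: List[float], right: List[float], rel_tol: float
) -> List[Tuple[int, int]]:
    """Return index pairs (i, j) whose values fall in each other's range.

    Assumed predicate (hypothetical, for illustration): left[i] must be
    inside the tolerance range around right[j], and right[j] must be
    inside the tolerance range around left[i].
    """
    matches = []
    for i, a in enumerate(left):
        for j, b in enumerate(right):
            # Both directions must hold, which is what makes the join "mutual".
            if in_range(a, rel_tol, b) and in_range(b, rel_tol, a):
                matches.append((i, j))
    return matches


if __name__ == "__main__":
    temperatures_a = [100.0, 250.0]
    temperatures_b = [109.0, 111.0, 400.0]
    print(mutual_range_join(temperatures_a, temperatures_b, rel_tol=0.1))
```

Under this assumed predicate, 100.0 pairs with 109.0 but not with 111.0: 111.0 falls outside the range around 100.0 even though 100.0 falls inside the range around 111.0, so the one-sided match is rejected.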
Gagan Agrawal (Advisor)
Arnab Nandi (Committee Member)
P Sadayappan (Committee Member)
235 p.

Recommended Citations

  • Ebenstein, R. A. (2018). Supporting Advanced Queries on Scientific Array Data [Doctoral dissertation, Ohio State University]. OhioLINK Electronic Theses and Dissertations Center. http://rave.ohiolink.edu/etdc/view?acc_num=osu1531322027770129

    APA Style (7th edition)

  • Ebenstein, Roee. Supporting Advanced Queries on Scientific Array Data. 2018. Ohio State University, Doctoral dissertation. OhioLINK Electronic Theses and Dissertations Center, http://rave.ohiolink.edu/etdc/view?acc_num=osu1531322027770129.

    MLA Style (8th edition)

  • Ebenstein, Roee. "Supporting Advanced Queries on Scientific Array Data." Doctoral dissertation, Ohio State University, 2018. http://rave.ohiolink.edu/etdc/view?acc_num=osu1531322027770129

    Chicago Manual of Style (17th edition)