Scaling Analytics via Approximate and Distributed Computing

Chakrabarti, Aniket

Abstract Details

2017, Doctor of Philosophy, Ohio State University, Computer Science and Engineering.
The amount of data generated from different sources every day is increasing exponentially, and for businesses to generate actionable insights, it is paramount that analyses complete within a reasonable amount of time. Compute capacity, unfortunately, has not increased at a comparable rate. Researchers have therefore focused in recent years on alternate approaches to scaling analytics to large datasets. Two of the most popular techniques are (i) approximate computing and (ii) distributed computing for speeding up analytics execution.

Approximate computing involves either a reduction in data size (usually through sampling and/or dimensionality reduction) or a reduction in algorithmic complexity. This yields significant gains in performance (in terms of execution time), but at the cost of reduced accuracy for many analytics tasks. The biggest challenge in this space is understanding the tradeoff between performance and accuracy. Most approximate computing techniques do come with some theoretical guarantee: for instance, the extremely popular principal component analysis (PCA) dimensionality reduction technique minimizes the data reconstruction error. However, a user typically interprets quality in terms of a few popular metrics such as recall and precision, and a theoretical guarantee in terms of reconstruction error is not very useful in the context of a particular application of interest.

Distributed computing, on the other hand, uses many servers to process the data instead of a single server. The typical approach is to partition the data across many servers, process the partitions in parallel, and combine the output from each server; this is the extremely popular MapReduce model of execution. At scale, however, it is likely that some machines will perform worse than others because they are slower, power constrained, or dependent on undesirable, dirty energy sources. It is challenging to balance analytics workloads across heterogeneous machines because the algorithms are sensitive to statistical skew in data partitions: a skewed partition can slow down the whole workload or degrade the quality of results.

In the approximate computing space, we first use the popular dimensionality reduction technique called locality sensitive hashing (LSH). In the context of one of the core analytics tasks, all pairs similarity search (APSS), we design an algorithm that, given a recall/precision target, automatically sketches the data using LSH (thereby improving execution time) while guaranteeing that target; the user does not need to explicitly set the sketch size or other LSH parameters. We then address the issue of generality. One key aspect of the APSS task is the similarity measure of interest, which varies by application domain. To generalize APSS across different similarity measures, we design a novel data embedding that supports LSH on arbitrary kernel similarity measures (capable of representing any inner product similarity).

Prior works have mostly designed approximation techniques and distributed computing techniques for scaling in isolation. We next show how insights from the approximate computing space are crucial in improving the execution of distributed analytics. We show that distributed analytics are extremely sensitive to the statistical characteristics of the data in each partition.
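As a concrete illustration of the LSH sketching idea discussed above, the minimal Python example below is a hypothetical MinHash sketch, not the dissertation's algorithm: short per-set signatures stand in for the full sets, the fraction of matching signature positions estimates Jaccard similarity, and the signature length controls the accuracy of that estimate (the performance/accuracy tradeoff). Python's built-in hash stands in for a proper random hash family here.

```python
import random

def minhash_signature(item_set, seeds):
    """One signature entry per seeded hash function: the minimum hash value over the set."""
    return [min(hash((seed, x)) for x in item_set) for seed in seeds]

def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching signature positions estimates Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)

def exact_jaccard(a, b):
    return len(a & b) / len(a | b)

random.seed(0)
seeds = [random.getrandbits(32) for _ in range(256)]  # sketch size: 256 entries per set

doc1 = {"scaling", "analytics", "approximate", "distributed", "computing"}
doc2 = {"scaling", "analytics", "distributed", "computing", "systems"}

sig1 = minhash_signature(doc1, seeds)
sig2 = minhash_signature(doc2, seeds)

print("exact Jaccard:    ", exact_jaccard(doc1, doc2))      # 4/6 ~ 0.67
print("estimated Jaccard:", estimated_jaccard(sig1, sig2))  # typically close; error shrinks as the sketch grows
```

Comparing signatures is cheap regardless of how large the original sets are, which is why an APSS algorithm can trade sketch size against the recall/precision it can guarantee.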
Key analytics tasks such as frequent pattern mining (FPM) degrade significantly in execution time if the data partitions look statistically different from one another (i.e., are not representative of the true underlying distribution). We show the efficacy of LSH in creating a small sketch that captures the global characteristics of the data and hence effectively partitions the data for downstream analytics. Furthermore, we show that data approximation can also control the algorithmic complexity of graph analytics tasks. This is of particular interest because distributed graph analytics follow a different execution model (vertex-centric gather-update-scatter) than traditional tasks such as FPM (which follow the MapReduce model). We show that by approximating the input graph meaningfully, we are able to scale Loopy Belief Propagation (LBP) inference to very large graphs. Finally, in the distributed computing space, we also develop a probabilistic method to manage the latencies of distributed key-value stores (the usual storage medium for large data), which is crucial for interactive analytics.
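Returning to the partitioning idea above, the following toy Python sketch is an illustrative simplification, not the dissertation's method: the helpers lsh_bucket and sketch_guided_partition are hypothetical. It shows the general principle of using a coarse LSH-style bucketing of records to deal work out so that each worker receives a statistically similar mix, rather than a contiguous, potentially skewed chunk.

```python
from collections import defaultdict

def lsh_bucket(record, num_buckets=16):
    """Coarse stand-in for an LSH bucket id: identical records always collide here;
    a real MinHash-band bucketing would also make *similar* records collide."""
    return hash(tuple(sorted(record))) % num_buckets

def sketch_guided_partition(records, num_workers):
    """Group records by sketch bucket, then deal each bucket out round-robin, so every
    worker sees a similar mix of buckets (less statistical skew than naive chunking)."""
    buckets = defaultdict(list)
    for record in records:
        buckets[lsh_bucket(record)].append(record)
    partitions = [[] for _ in range(num_workers)]
    next_slot = 0
    for bucket in buckets.values():
        for record in bucket:
            partitions[next_slot % num_workers].append(record)
            next_slot += 1
    return partitions

# Toy transactions for a frequent pattern mining job.
transactions = [frozenset(t) for t in (
    {"milk", "bread"}, {"milk", "bread"}, {"milk", "eggs"},
    {"beer", "chips"}, {"beer", "chips"}, {"beer", "diapers"},
)]
for worker, part in enumerate(sketch_guided_partition(transactions, num_workers=2)):
    print(f"worker {worker}:", [sorted(t) for t in part])
```

In this simplified setting, each worker ends up with a mix of the "milk" and "beer" transaction families instead of one family each, so locally mined frequent patterns better reflect the global distribution.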
Srinivasan Parthasarathy (Advisor)
Christopher Stewart (Committee Co-Chair)
Huan Sun (Committee Member)
Wei Zhang (Committee Member)
191 p.

Recommended Citations

  • Chakrabarti, A. (2017). Scaling Analytics via Approximate and Distributed Computing [Doctoral dissertation, Ohio State University]. OhioLINK Electronic Theses and Dissertations Center. http://rave.ohiolink.edu/etdc/view?acc_num=osu1500473400586782

    APA Style (7th edition)

  • Chakrabarti, Aniket. Scaling Analytics via Approximate and Distributed Computing. 2017. Ohio State University, Doctoral dissertation. OhioLINK Electronic Theses and Dissertations Center, http://rave.ohiolink.edu/etdc/view?acc_num=osu1500473400586782.

    MLA Style (8th edition)

  • Chakrabarti, Aniket. "Scaling Analytics via Approximate and Distributed Computing." Doctoral dissertation, Ohio State University, 2017. http://rave.ohiolink.edu/etdc/view?acc_num=osu1500473400586782

    Chicago Manual of Style (17th edition)