Skip to Main Content
 

Global Search Box

 
 
 
 

Files

ETD Abstract Container

Abstract Header

Approximate Clustering Algorithms for High Dimensional Streaming and Distributed Data

Abstract Details

2018, PhD, University of Cincinnati, Engineering and Applied Science: Computer Science and Engineering.
Clustering data has gained popularity in recent years due to an expanding opportunity to discover knowledge and collect insights from multiple widely available and diverse data sources. Data clustering offers an intuitive solution to a wide variety of unsupervised classification problems. Clustering solutions to problems arise often in areas in which no ground truth is known, or when the arrival frequency of a data source exceeds the labeling capabilities of a human expert. Due to the continuous, fast moving nature of many common data streams, such as those from IoT (Internet of Things) devices, social network interactions, e-commerce click-streams, scientific monitoring devices, and network traffic, noise robust and distributed clustering algorithms are necessary. Often, current clustering methods suffer from one or more drawbacks when applied to these demanding problems. For this reason, we propose a new method for data clustering that is noise resilient, and distributed with a predictable overall complexity. The principal claim of this research is that while many clustering algorithms rigorously optimize a loss function, their convergence often results in finding a local minima that is indistinguishable from a less computationally rigorous optimization on an approximation of the data. We propose that by removing the rigorous optimization requirement, we can achieve better scalability, and parallelism with comparable performance. In this work we design a clustering algorithm along these lines that uses dimensional reduction and hashing to reduce the problem size while still attaining comparable clustering performance to other clustering algorithms. Our proposed method is more robust to noise with a lower runtime requirement, and greater opportunity for shared and distributed memory parallelism. This work presents a set of methods for clustering high dimensional data for a variety of data source and processing environments. The proposed RPHash algorithms share a commonality in that they all utilize locality sensitive hash (LSH) functions and random projection (RP) to identify density modes in data sources. They differ however in the operation space and cluster region identification method. The algorithms presented are the RPHash algorithm, the streamingRPHash algorithm, and the Tree Walk RPHash (TWRP) algorithm. We analyze the results of our developed algorithms on both streaming and at-rest data input environments. Experiments on real and synthetic data demonstrate the advantages of our proposed clustering methods for streaming and at-rest data against common clustering algorithms. Furthermore our theoretical analysis shows that our runtime and memory complexities are effectively linear and sub-linear respectively, in terms of input. Our principal claim that approximate clustering results are not substantially different than exact clustering methods with guarantee convergence to a local minima, is confirmed by these results. In addition we demonstrate the potential gains in processing speed and parallelism.
Philip Wilsey, Ph.D. (Committee Chair)
Fred Beyette, Ph.D. (Committee Member)
Raj Bhatnagar, Ph.D. (Committee Member)
Anca Ralescu, Ph.D. (Committee Member)
Dan Ralescu, Ph.D. (Committee Member)
144 p.

Recommended Citations

Citations

  • Carraher, L. A. (2018). Approximate Clustering Algorithms for High Dimensional Streaming and Distributed Data [Doctoral dissertation, University of Cincinnati]. OhioLINK Electronic Theses and Dissertations Center. http://rave.ohiolink.edu/etdc/view?acc_num=ucin1511860805777818

    APA Style (7th edition)

  • Carraher, Lee. Approximate Clustering Algorithms for High Dimensional Streaming and Distributed Data. 2018. University of Cincinnati, Doctoral dissertation. OhioLINK Electronic Theses and Dissertations Center, http://rave.ohiolink.edu/etdc/view?acc_num=ucin1511860805777818.

    MLA Style (8th edition)

  • Carraher, Lee. "Approximate Clustering Algorithms for High Dimensional Streaming and Distributed Data." Doctoral dissertation, University of Cincinnati, 2018. http://rave.ohiolink.edu/etdc/view?acc_num=ucin1511860805777818

    Chicago Manual of Style (17th edition)