Skip to Main Content
 

Global Search Box

 
 
 
 

ETD Abstract Container

Abstract Header

Efficient, Parameter-Free Online Clustering

Cunningham, James

Abstract Details

2020, Master of Science, Ohio State University, Computer Science and Engineering.
As the number of data sources and dynamic datasets grows so does the need for efficient online algorithms to process them. Clustering is a very important data exploration tool used in various fields from data science to machine learning. Clustering finds structure unsupervised among indecipherable collections of data. Online clustering is the process of grouping samples in a data stream into meaningful collections as they appear over time. As more data is collected in these streams online clustering becomes more important than ever. Unfortunately, the number of available efficient online clustering algorithms is limited due to the difficulty of their design that often requires significant trade-offs in clustering performance for efficiency. Those that do exist require expert domain knowledge of the data space to set hyperparameters for the algorithm such as the desired number of clusters. This domain knowledge is often unavailable, so resources must be spent tuning hyperparameters to get acceptable performance. In this thesis we present an online modification to FINCH, the recent state-of-the-art parameter-free clustering algorithm by Sarfraz et al. called Stream FINCH (S-FINCH). We examine the stages of the FINCH algorithm and leverage key insights to produce an algorithm which reduces the online update complexity of FINCH. We then compare the performance of S-FINCH and FINCH over several toy and real-world datasets. We show theoretically and empirically that our S-FINCH algorithm is more efficient than the FINCH algorithm in the online domain and has reasonable real-time update performance. We also present several alternative cluster representatives which can be used to build different agglomerative cluster hierarchies using the S-FINCH algorithm. We compare the cluster quality and clustering time performance of these new representatives with the original FINCH algorithm. The S-FINCH algorithm presented in this thesis allows for fast and efficient online clustering of arbitrary datasets while requiring no user-defined hyperparameters.
James Davis, PhD. (Advisor)
Juan Vasquez, PhD. (Committee Member)
Kyle Tarplee, PhD. (Committee Member)
Wei-Lun Chao, PhD. (Committee Member)
103 p.

Recommended Citations

Citations

  • Cunningham, J. (2020). Efficient, Parameter-Free Online Clustering [Master's thesis, Ohio State University]. OhioLINK Electronic Theses and Dissertations Center. http://rave.ohiolink.edu/etdc/view?acc_num=osu1606762403895603

    APA Style (7th edition)

  • Cunningham, James. Efficient, Parameter-Free Online Clustering. 2020. Ohio State University, Master's thesis. OhioLINK Electronic Theses and Dissertations Center, http://rave.ohiolink.edu/etdc/view?acc_num=osu1606762403895603.

    MLA Style (8th edition)

  • Cunningham, James. "Efficient, Parameter-Free Online Clustering." Master's thesis, Ohio State University, 2020. http://rave.ohiolink.edu/etdc/view?acc_num=osu1606762403895603

    Chicago Manual of Style (17th edition)