K-groups: A Generalization of K-means by Energy Distance


2015, Doctor of Philosophy (Ph.D.), Bowling Green State University, Statistics.
We propose two distribution-based clustering algorithms called K-groups. Our algorithms group observations into the same cluster if they come from a common distribution. Energy distance is a non-negative measure of the distance between distributions, based on Euclidean distances between random observations, which is zero if and only if the distributions are identical. We use energy distance to measure the statistical distance between two clusters, and we search for the best partition, which maximizes the total between-cluster energy distance. To implement our algorithms, we apply a version of Hartigan and Wong's idea of moving one point, and we generalize this idea to moving any m points. We also prove that K-groups is a generalization of the K-means algorithm: K-means is a limiting case of K-groups, with the same objective function and updating formula in that case. K-means is one of the best-known clustering algorithms, but previous research has shown that it has several disadvantages. K-means performs poorly when clusters are skewed or overlapping; it cannot handle categorical data, because the mean is not a good estimate of the center; and it cannot be applied when the dimension exceeds the sample size. Our K-groups methods provide a practical and effective solution to these problems. Simulation studies on the performance of the clustering algorithms for univariate and multivariate mixture distributions are presented. Four validation indices (diagonal, Kappa, Rand, and corrected Rand) are reported for each example in the simulation study. Results of the empirical studies show that both K-groups algorithms perform as well as K-means when clusters are well separated and spherically shaped, but perform better than K-means when clusters are skewed or overlapping. K-groups algorithms are also more robust than K-means with respect to outliers. Results are presented for three multivariate data sets: wine cultivars, dermatology diseases, and oncology cases. In our real data examples, both K-groups algorithms perform better than K-means in each case.
Rizzo Maria (Advisor)
Rump Christopher (Other)
Chen Hanfeng (Committee Member)
Wei Ning (Committee Member)
120 p.
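
The abstract describes clustering by maximizing the total between-cluster energy distance using "move one point" updates. The following is a minimal illustrative sketch of that idea, not the dissertation's implementation: the function names, the within-dispersion form, the greedy update, and the stopping rule are assumptions made for illustration.

```python
import numpy as np

def energy_distance(x, y):
    """Two-sample energy distance estimate:
    2*mean||x_i - y_j|| - mean||x_i - x_k|| - mean||y_j - y_l||,
    where x and y are (n, d) and (m, d) arrays of observations."""
    a = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=-1).mean()
    b = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1).mean()
    c = np.linalg.norm(y[:, None, :] - y[None, :, :], axis=-1).mean()
    return 2.0 * a - b - c

def within_dispersion(points):
    """Within-cluster dispersion: (1/n_k) * sum of pairwise distances (i < j)."""
    if len(points) < 2:
        return 0.0
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    return d.sum() / (2.0 * len(points))

def k_groups_one_point(X, k, n_iter=20, seed=0):
    """Greedy one-point moves: reassign each point to whichever cluster gives
    the smallest total within-cluster dispersion (equivalently, the largest
    between-cluster energy distance, since the total dispersion is fixed)."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, k, size=len(X))
    for _ in range(n_iter):
        moved = False
        for i in range(len(X)):
            costs = []
            for g in range(k):
                trial = labels.copy()
                trial[i] = g
                costs.append(sum(within_dispersion(X[trial == c]) for c in range(k)))
            best = int(np.argmin(costs))
            if best != labels[i]:
                labels[i] = best
                moved = True
        if not moved:
            break
    return labels

# Toy example: two well-separated Gaussian clusters in the plane.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, (30, 2)), rng.normal(5.0, 1.0, (30, 2))])
print(k_groups_one_point(X, k=2))
print(energy_distance(X[:30], X[30:]))
```

This brute-force sketch recomputes the full objective for every candidate move, which is only practical for small samples; the dissertation's algorithms use an incremental updating formula in the spirit of Hartigan and Wong.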

Recommended Citations


  • Li, S. (2015). K-groups: A Generalization of K-means by Energy Distance [Doctoral dissertation, Bowling Green State University]. OhioLINK Electronic Theses and Dissertations Center. http://rave.ohiolink.edu/etdc/view?acc_num=bgsu1428583805

    APA Style (7th edition)

  • Li, Songzi. K-groups: A Generalization of K-means by Energy Distance. 2015. Bowling Green State University, Doctoral dissertation. OhioLINK Electronic Theses and Dissertations Center, http://rave.ohiolink.edu/etdc/view?acc_num=bgsu1428583805.

    MLA Style (8th edition)

  • Li, Songzi. "K-groups: A Generalization of K-means by Energy Distance." Doctoral dissertation, Bowling Green State University, 2015. http://rave.ohiolink.edu/etdc/view?acc_num=bgsu1428583805

    Chicago Manual of Style (17th edition)