
A Relational Framework for Clustering and Cluster Validity and the Generalization of the Silhouette Measure

Rawashdeh, Mohammad Y.

2014, PhD, University of Cincinnati, Engineering and Applied Science: Computer Science and Engineering.
Clustering seeks to partition a given set of points into a number of clusters such that points in the same cluster are similar to one another and dissimilar to points in other clusters. By virtue of this goal, relational data are a natural input for clustering. The similarity and dissimilarity relations between the data points are the nuts and bolts of cluster formation; thus, the task is driven by the notion of similarity between the data points. In practice, similarity is usually measured by the pairwise distances between the data points. Indeed, the objective functions of two widely used clustering algorithms, k-means and fuzzy c-means, are expressed in terms of the pairwise distances between the data points. The clustering task is complicated by the choice of the distance measure and by the need to estimate the number of clusters. Fuzzy c-means is convenient when there is uncertainty in allocating points in overlapping areas to clusters. The k-means algorithm allocates points unequivocally to clusters, overlooking the similarities between points in overlapping areas. The fuzzy approach allows a point to be a member of as many clusters as necessary; thus it provides better insight into the relations between the points in overlapping areas. In this thesis we develop a relational framework inspired by the silhouette measure of clustering quality. The framework asserts the relations between the data points by means of logical reasoning with the cluster membership values. The original description of computing silhouettes is limited to crisp partitions. A natural generalization of silhouettes to fuzzy partitions is given within our framework. Moreover, two notions of silhouette emerge within the framework at different levels of granularity, namely, the point-wise silhouette and the center-wise silhouette.
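For orientation, the original crisp silhouette that the framework takes as its starting point can be sketched as follows. This is the classical definition s(i) = (b - a) / max(a, b), where a is the mean distance from point i to the other points in its own cluster and b is the smallest mean distance from point i to the points of any other cluster. It is not the generalized silhouette developed in the thesis, whose formula is not given in this abstract. The function name `silhouette_values`, the use of Euclidean distance, and the convention of assigning 0 to singleton clusters are assumptions made for illustration.

```python
from math import dist  # Euclidean distance, Python 3.8+


def silhouette_values(points, labels):
    """Classical silhouette s(i) = (b - a) / max(a, b) for a crisp partition.

    Illustrative sketch only: Euclidean distance and the singleton
    convention (s = 0) are assumptions, not taken from the thesis.
    """
    # Group point indices by their cluster label.
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouettes need at least two clusters")

    values = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            values.append(0.0)  # singleton cluster: assumed convention
            continue
        # a: mean distance to the other members of the point's own cluster.
        a = sum(dist(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        values.append((b - a) / denom if denom > 0 else 0.0)
    return values


points = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = silhouette_values(points, [0, 0, 1, 1])  # matches the two visible groups
bad = silhouette_values(points, [0, 1, 0, 1])   # splits each group in half
print(sum(good) / len(good), sum(bad) / len(bad))
```

Values near 1 indicate a point that sits well inside its cluster, and negative values indicate a point that is closer on average to another cluster, so the average over all points gives a single score for the partition.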
With this generalization, each silhouette can measure the extent to which a crisp or fuzzy partition has fulfilled the clustering goal at the level of the individual points or of the cluster centers. The partitions are evaluated by the silhouette measure in conjunction with point-to-point or center-to-point distances. The generalization also makes the average silhouette value a reasonable device for selecting between crisp and fuzzy partitions of the same data set. Accordingly, one can find out which partition better represents the relations between the data points, in accordance with their pairwise distances. This powerful feature of the generalized silhouettes has exposed a problem with the partitions generated by fuzzy c-means. We have observed that defuzzifying the fuzzy c-means partitions always improves the overall representation of the relations between the data points. This is due to the inconsistency between some of the membership values and the distances between the data points. This inconsistency has been reported by others on a couple of occasions in real-life applications. Finally, we present an experiment that demonstrates a successful application of the generalized silhouette measure to feature selection for highly imbalanced classification. A significant reduction in the number of features resulted in a significant improvement in classification on a real data set.
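The defuzzification step mentioned above can be illustrated with a minimal sketch. The abstract does not say which defuzzification rule the thesis uses; the code below assumes the common maximum-membership rule, in which each point is assigned entirely to the cluster where its membership is largest. The function name `defuzzify`, the row-per-point layout of the membership matrix, and the tie-breaking toward the lowest cluster index are all assumptions made for illustration.

```python
def defuzzify(memberships):
    """Turn a fuzzy membership matrix into a crisp (one-hot) one.

    memberships[i][k] is the membership of point i in cluster k.
    Assumed rule: maximum membership, ties going to the lowest index.
    """
    crisp = []
    for row in memberships:
        # Index of the largest membership; max() keeps the first maximum.
        winner = max(range(len(row)), key=row.__getitem__)
        crisp.append([1.0 if k == winner else 0.0 for k in range(len(row))])
    return crisp


fuzzy = [
    [0.7, 0.3],  # clearly in cluster 0
    [0.4, 0.6],  # leans toward cluster 1
    [0.5, 0.5],  # tie: assigned to cluster 0 under the assumed rule
]
print(defuzzify(fuzzy))
```

The crisp partition produced this way can then be scored with the same validity measure as the fuzzy one, which is the kind of comparison the abstract describes.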
Anca Ralescu, Ph.D. (Committee Chair)
Anil Jegga, D.V.M., M.Res. (Committee Member)
Traian Marius Truta, Ph.D. (Committee Member)
Fred Annexstein, Ph.D. (Committee Member)
Kenneth Berman, Ph.D. (Committee Member)
Dan Ralescu, Ph.D. (Committee Member)
121 p.

Recommended Citations

  • Rawashdeh, M. Y. (2014). A Relational Framework for Clustering and Cluster Validity and the Generalization of the Silhouette Measure [Doctoral dissertation, University of Cincinnati]. OhioLINK Electronic Theses and Dissertations Center. http://rave.ohiolink.edu/etdc/view?acc_num=ucin1394725536

    APA Style (7th edition)

  • Rawashdeh, Mohammad. A Relational Framework for Clustering and Cluster Validity and the Generalization of the Silhouette Measure. 2014. University of Cincinnati, Doctoral dissertation. OhioLINK Electronic Theses and Dissertations Center, http://rave.ohiolink.edu/etdc/view?acc_num=ucin1394725536.

    MLA Style (8th edition)

  • Rawashdeh, Mohammad. "A Relational Framework for Clustering and Cluster Validity and the Generalization of the Silhouette Measure." Doctoral dissertation, University of Cincinnati, 2014. http://rave.ohiolink.edu/etdc/view?acc_num=ucin1394725536

    Chicago Manual of Style (17th edition)