Multi-Domain Clustering on Real-Valued Datasets

2011, PhD, University of Cincinnati, Engineering and Applied Science: Computer Science and Engineering.
Clustering is an important research problem for knowledge discovery from databases. It focuses on finding hidden structures embedded in datasets. It is non-trivial to arrive at a clustering in which every pair of data points within the same cluster is similar and every pair in different clusters is distinct. The difficulty stems from the many possible meanings of similarity between data points and from the criteria that determine the number, shape, and boundaries of clusters. Despite a large body of published research, new clustering problems keep arising that require novel solutions. Such a situation is evolving in biomedical research, which is generating a large number of interrelated and interdependent datasets, and in many other domains of science and business. We have developed three novel clustering methodologies to meet these newly emerging needs.

The first problem we have solved relates to grouping data points with "similar" density in the data space into distinct clusters, using full-dimensional clustering. Based on the pair-wise similarity matrix among data points, we define a new type of relationship among them, that of point pairs being Mutual K-Nearest Neighbors (MKNN) of each other, and design clustering algorithms based on this notion to capture the data density. Compared with traditional Euclidean-distance-based clustering algorithms on datasets having different densities, our MKNN-based clustering algorithm allows users to form density-based clusters with significantly lower sensitivity to parameters. We have analytically and empirically demonstrated, using both synthetic and real-world datasets, the increased capability, precision, efficiency, and robustness of our algorithm.

The second clustering algorithm we have developed incorporates prior domain knowledge, provided as a pair-wise similarity matrix from one dataset, into the clustering performed on data in another dataset. The data objects in the "prior knowledge" data source and the second data source are the same. By adopting a semi-supervised clustering procedure, our algorithm, called Semi-supervised Gaussian Infinite Mixture Model (SGIMM), balances information from the two data sources and generates clusters enforcing precise pair-wise relationships. SGIMM accommodates many types of prior knowledge, and in empirical studies with both synthetic and real-world data, it generates high-quality clusters regardless of the quality of the prior knowledge.

The third type of problem we have solved relates to the discovery of subspace clusters. Numerous real-world applications focus on selecting subsets of data points and feature subspaces having desirable characteristics, specified in terms of properties such as low variance, high distinction, or low residue value. We use lattice-structured search spaces to identify low-variance subspace clusters from one dataset (bicluster) and from two datasets (3-Cluster), and high-discrepancy subspace clusters from a single dataset (polarized bicluster). Results on both synthetic and genomic datasets are reported for all of these clustering tasks, and they show better performance than most existing algorithms.
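
The mutual k-nearest-neighbor relationship mentioned in the abstract can be illustrated with a minimal sketch. It assumes the commonly used definition (two points are mutual k-nearest neighbors when each appears among the other's k most similar points) and is not the dissertation's clustering algorithm. The function name `mutual_knn_pairs` and the toy similarity matrix `sim` are hypothetical and chosen only for illustration.

```python
def mutual_knn_pairs(similarity, k):
    """Return index pairs (i, j), i < j, that are mutual k-nearest neighbors.

    `similarity` is a square list of lists; larger values mean more similar.
    This follows the common definition of mutual k-NN, assumed here.
    """
    n = len(similarity)
    neighbors = []
    for i in range(n):
        # Rank every other point by decreasing similarity to point i.
        others = [j for j in range(n) if j != i]
        others.sort(key=lambda j: similarity[i][j], reverse=True)
        neighbors.append(set(others[:k]))
    # Keep a pair only when each point is in the other's top-k set.
    return {
        (i, j)
        for i in range(n)
        for j in neighbors[i]
        if i < j and i in neighbors[j]
    }


# Toy pair-wise similarity matrix: points 0 and 1 are close, as are 2 and 3.
sim = [
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.3, 0.2],
    [0.2, 0.3, 1.0, 0.8],
    [0.1, 0.2, 0.8, 1.0],
]

print(sorted(mutual_knn_pairs(sim, k=1)))  # [(0, 1), (2, 3)]
```

Because the relation is symmetric by construction, a point in a sparse region cannot attach itself to a dense group unless that group's points also rank it among their nearest neighbors, which is one intuition for why such a relation can reflect local density.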
Raj Bhatnagar, PhD (Committee Chair)
Yizong Cheng, PhD (Committee Member)
Karen Davis, PhD (Committee Member)
Mario Medvedovic, PhD (Committee Member)
John Schlipf, PhD (Committee Member)
158 p.

Recommended Citations

  • Hu, Z. (2011). Multi-Domain Clustering on Real-Valued Datasets [Doctoral dissertation, University of Cincinnati]. OhioLINK Electronic Theses and Dissertations Center. http://rave.ohiolink.edu/etdc/view?acc_num=ucin1311692725

    APA Style (7th edition)

  • Hu, Zhen. Multi-Domain Clustering on Real-Valued Datasets. 2011. University of Cincinnati, Doctoral dissertation. OhioLINK Electronic Theses and Dissertations Center, http://rave.ohiolink.edu/etdc/view?acc_num=ucin1311692725.

    MLA Style (8th edition)

  • Hu, Zhen. "Multi-Domain Clustering on Real-Valued Datasets." Doctoral dissertation, University of Cincinnati, 2011. http://rave.ohiolink.edu/etdc/view?acc_num=ucin1311692725

    Chicago Manual of Style (17th edition)