FINDING INTERESTING SUBSPACE CLUSTERS FROM HIGH DIMENSIONAL DATASETS

BIAN, HAIYUN

2006, PhD, University of Cincinnati, Engineering : Computer Science.
Data mining focuses on finding previously unknown yet potentially useful hidden patterns in large amounts of data. Clustering is one of the most commonly used unsupervised data mining techniques, and it has been successfully applied to find groups of similar data points in many applications. However, conventional clustering algorithms sometimes fail to find meaningful clusters when the dataset has dozens of attributes, because the high dimensionality makes the data space very noisy. Subspace clustering addresses this problem by finding clusters in subsets of the dimensions. Different subspace clusters may be formed in different subsets of dimensions, and a single data point may belong to multiple subspace clusters. A subspace clustering algorithm not only searches for the clusters, but also finds the subspace in which each individual cluster exists. Allowing clusters to overlap in the object space and in the attribute space increases the complexity of the search algorithms exponentially and also makes the relationships among clusters very difficult to interpret. In this thesis, we propose new subspace clustering algorithms that can find overlapping subspace clusters satisfying certain quantitative and qualitative properties. These properties can be defined by the domain users so that the search focuses only on those clusters that have some significance for the users. Tailoring the search to find only clusters with specific properties has the advantage that the property itself, or its derivatives, can be used to prune away uninteresting hypotheses at an early stage of the search. Various pruning strategies are presented in the thesis for different cluster properties to make the search more efficient. In many situations, the total number of subspace clusters having the desired properties is very large, which not only adds to the burden of the search, but also makes the analysis of the results very difficult.
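The pruning idea described above can be illustrated with a minimal sketch. It is not the thesis's algorithm: it assumes binarized data (each object is the set of attributes that are "on"), and it uses a hypothetical property, a minimum number of objects `min_objects`, chosen because it is anti-monotone (adding an attribute to a subspace can only shrink the set of objects it covers). The names `subspace_clusters`, `rows`, and `min_objects` are illustrative only.

```python
from itertools import combinations


def subspace_clusters(rows, min_objects):
    """Level-wise search for (attribute set -> object set) pairs.

    rows: list of sets; rows[i] holds the attributes that are "on" for object i.
    min_objects: assumed user-defined property (minimum cluster size).
    """
    attrs = sorted(set().union(*rows))

    # Level 1: single attributes that already satisfy the size property.
    current = {}
    for a in attrs:
        objs = frozenset(i for i, r in enumerate(rows) if a in r)
        if len(objs) >= min_objects:
            current[(a,)] = objs

    found = dict(current)
    while current:
        nxt = {}
        # Join two k-attribute subspaces that share their first k-1 attributes.
        for s, t in combinations(sorted(current), 2):
            if s[:-1] == t[:-1]:
                objs = current[s] & current[t]
                # Prune: a subspace that is too small cannot have a
                # superset subspace that is large enough.
                if len(objs) >= min_objects:
                    nxt[s + (t[-1],)] = objs
        found.update(nxt)
        current = nxt
    return found


rows = [{"a", "b", "c"}, {"a", "b"}, {"a", "b", "c"}, {"c"}]
clusters = subspace_clusters(rows, min_objects=2)
```

Here object 0 belongs to several overlapping clusters, for example the one in subspace `("a", "b")` and the one in subspace `("a", "b", "c")`. Raising `min_objects` to 3 prunes `("a", "c")` at level 2, so `("a", "b", "c")` is never generated.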
In this thesis, we present ways to impose a lattice structure on all the found clusters, and we show that the lattice facilitates the discovery of other knowledge embedded in the data. We also propose another solution to this problem by creating a condensed representation of all the clusters; that is, we find only a subset of the clusters from which all other clusters having the desired properties can be inferred. To validate our algorithms, we tested them on both synthetic and real application data. The results suggest that the algorithms are very useful in many application domains, for example on gene expression data and on some standard datasets from the machine learning repository. The emerging infrastructure of distributed databases requires algorithms to be designed for mining meaningful patterns in data located at different sites. Due to security and privacy concerns, it is not always feasible to send all datasets to a centralized site to accomplish the mining task. An alternative solution is to have each site perform some computation locally and exchange a minimal amount of information with the other sites. We focus on finding subspace clusters in horizontally partitioned databases. The global computation is decomposed into localized computations on each participating site. We present the detailed decomposition algorithm as well as the format of the messages exchanged between the sites. Both theoretical and empirical validations of our proposed scheme are provided, showing that our algorithm can find all target patterns from the distributed datasets. Overall, the research presented in this thesis provides many insights into theoretically and empirically characterizing the problem of subspace clustering. The subspace clustering algorithms proposed in this thesis are expected to be useful in solving data mining problems in many applications.
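The decomposition into local computations can be sketched as follows. This is only an assumed illustration, not the decomposition algorithm or message format of the thesis: each site holds some of the rows (horizontal partitioning, same attributes everywhere), computes per-subspace object counts locally, and sends only those counts, never the rows themselves. The functions `local_counts` and `global_frequent` and the threshold `min_objects` are hypothetical names.

```python
from collections import Counter


def local_counts(rows, candidates):
    """Run at one site: count the local objects that cover each candidate subspace."""
    return Counter({c: sum(1 for r in rows if set(c) <= r) for c in candidates})


def global_frequent(site_rows, candidates, min_objects):
    """Run at a coordinator: sum the per-site counts and apply the global threshold."""
    total = Counter()
    for rows in site_rows:
        # In a real deployment only this Counter would cross the network.
        total.update(local_counts(rows, candidates))
    return {c: n for c, n in total.items() if n >= min_objects}


site1 = [{"a", "b"}, {"a"}]
site2 = [{"a", "b"}, {"b"}]
candidates = [("a",), ("b",), ("a", "b")]
result = global_frequent([site1, site2], candidates, min_objects=3)
```

In this toy example no single site has three objects covering `("a",)`, yet the summed count reaches the threshold, which is why the local results have to be combined rather than filtered site by site.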
Dr. Raj Bhatnagar (Advisor)
152 p.

Recommended Citations

  • BIAN, H. (2006). FINDING INTERESTING SUBSPACE CLUSTERS FROM HIGH DIMENSIONAL DATASETS [Doctoral dissertation, University of Cincinnati]. OhioLINK Electronic Theses and Dissertations Center. http://rave.ohiolink.edu/etdc/view?acc_num=ucin1161732284

    APA Style (7th edition)

  • BIAN, HAIYUN. FINDING INTERESTING SUBSPACE CLUSTERS FROM HIGH DIMENSIONAL DATASETS. 2006. University of Cincinnati, Doctoral dissertation. OhioLINK Electronic Theses and Dissertations Center, http://rave.ohiolink.edu/etdc/view?acc_num=ucin1161732284.

    MLA Style (8th edition)

  • BIAN, HAIYUN. "FINDING INTERESTING SUBSPACE CLUSTERS FROM HIGH DIMENSIONAL DATASETS." Doctoral dissertation, University of Cincinnati, 2006. http://rave.ohiolink.edu/etdc/view?acc_num=ucin1161732284

    Chicago Manual of Style (17th edition)