MINING STRUCTURED SETS OF SUBSPACES FROM HIGH DIMENSIONAL DATA

RAJSHIVA, ANSHUMAAN

2004, MS, University of Cincinnati, Engineering : Computer Science.
Data mining is the process of extracting previously unknown and potentially useful information from databases. Data mining algorithms are used in many applications across business, engineering, the sciences, and social databases. Among the many methodologies for data mining, clustering techniques are among the most frequently used. Clustering is the process of forming groups of data points based on their similarity. Finding clusters in a high dimensional dataspace is challenging because such a dataspace has hundreds of attributes and hundreds of data tuples, so the average density of data points is very low. The distance functions used by many conventional algorithms fail in this scenario. Agrawal et al. [2] showed that even if clusters do not exist in the original high dimensional dataspace, clusters may still exist in some of its subspaces. A subspace is formed by a subset of the attributes taken together with a subset of the data tuples, chosen so that clusters of data points exist in the subspace. Subspace clustering identifies such subspace clusters. In this thesis, we discuss a novel approach that identifies subspace clusters by first identifying complete subspaces. A complete subspace is a subspace that contains exactly one cluster, formed by all the data tuples included in that subspace. We develop an algorithm to discover and identify such complete subspaces based on a similarity function, that is, a symmetric mathematical function that measures the similarity between two data values of an attribute. We discuss different similarity functions and apply them to datasets from each of the identified application domains: bioinformatics, graphs, and citation datasets.
Through experiments, we analyze and interpret the nature of the subspace clusters in relation to the applied similarity function and the application domain. Our algorithm is exhaustive and discovers all the complete subspaces in a high dimensional dataspace.
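The abstract does not give the thesis's formal definitions, so the following is only a minimal sketch of the idea of a complete subspace under stated assumptions. It assumes that a chosen set of tuples and attributes forms a single cluster when every pair of tuples is similar on every chosen attribute, and it uses a hypothetical threshold-based similarity function. The names `abs_diff_similar`, `is_complete_subspace`, and the tolerance `tol` are illustrative and do not come from the thesis.

```python
from itertools import combinations


def abs_diff_similar(a, b, tol=0.5):
    """Hypothetical symmetric similarity test on two values of one attribute.

    Two values count as similar when they differ by at most `tol`.
    Symmetric by construction: abs(a - b) == abs(b - a).
    """
    return abs(a - b) <= tol


def is_complete_subspace(data, rows, attrs, similar=abs_diff_similar):
    """Check an assumed completeness condition for a candidate subspace.

    Assumption (not taken from the thesis): the subspace given by `rows`
    and `attrs` holds exactly one cluster if every pair of selected rows
    is similar on every selected attribute.
    """
    return all(
        similar(data[r1][a], data[r2][a])
        for r1, r2 in combinations(rows, 2)
        for a in attrs
    )


# Toy data: 4 tuples (rows) with 3 attributes (columns).
data = [
    [1.0, 9.0, 3.0],
    [1.2, 2.0, 3.1],
    [1.1, 5.0, 2.9],
    [7.0, 5.1, 3.0],
]

# Rows 0-2 agree on attributes 0 and 2, but not on attribute 1.
print(is_complete_subspace(data, [0, 1, 2], [0, 2]))
print(is_complete_subspace(data, [0, 1, 2], [0, 1]))
```

In this toy setting, the full four-attribute, four-tuple dataspace has no single cluster, yet restricting to rows 0 to 2 and attributes 0 and 2 yields one. An exhaustive search, as the abstract describes, would examine candidate subsets of rows and attributes; the pruning strategy used in the thesis is not stated here and is not reproduced.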
Dr. Raj Bhatnagar (Advisor)
122 p.

Recommended Citations

  • RAJSHIVA, A. (2004). MINING STRUCTURED SETS OF SUBSPACES FROM HIGH DIMENSIONAL DATA [Master's thesis, University of Cincinnati]. OhioLINK Electronic Theses and Dissertations Center. http://rave.ohiolink.edu/etdc/view?acc_num=ucin1085667702

    APA Style (7th edition)

  • RAJSHIVA, ANSHUMAAN. MINING STRUCTURED SETS OF SUBSPACES FROM HIGH DIMENSIONAL DATA. 2004. University of Cincinnati, Master's thesis. OhioLINK Electronic Theses and Dissertations Center, http://rave.ohiolink.edu/etdc/view?acc_num=ucin1085667702.

    MLA Style (8th edition)

  • RAJSHIVA, ANSHUMAAN. "MINING STRUCTURED SETS OF SUBSPACES FROM HIGH DIMENSIONAL DATA." Master's thesis, University of Cincinnati, 2004. http://rave.ohiolink.edu/etdc/view?acc_num=ucin1085667702

    Chicago Manual of Style (17th edition)