New Clustering and Feature Selection Procedures with Applications to Gene Microarray Data

2008, Doctor of Philosophy, Case Western Reserve University, Statistics.

Statistical data mining is an active area of research. In this thesis we develop two new data mining procedures and explore their applications to genetic data.

The first procedure, PfCluster (Profile Cluster Analysis), is a clustering method designed for profiled genetic data. PfCluster is efficient and flexible in uncovering clusters determined by a new class of biologically meaningful distance metrics. A new internal measure of cluster quality, the coherence index, is developed to find coherent clusters. An efficient mechanism for choosing the threshold for coherent clusters is also derived and implemented. The threshold is based on first- and second-order approximations to the true threshold under a null distribution for parallel clusters. PfCluster has been applied to simulated data and to two real datasets: a biomarker LOH dataset and a microarray gene expression dataset. PfCluster is competitive with correlation-based clustering procedures.

The second procedure, RPselection (Resampling-based Partitioning Selection), is a feature selection algorithm designed for microarray studies. It selects a subset of genes that maximizes a fitness score. The fitness score measures the association between the partition labels from a clustering result and an external class label derived from clinical outcomes. The score is computed using a resampling procedure. The RPselection algorithm has been applied to simulated data and to a real uveal melanoma gene expression dataset. RPselection outperforms gene-by-gene test-based feature selection procedures.
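The abstract does not give the details of the fitness score or the search strategy, so the following Python sketch only illustrates the general idea of a resampling-based fitness score for a gene subset. Everything specific in it is an assumption made for illustration: the bootstrap resampling, the small 2-means clustering, the Rand index as the agreement measure, the greedy forward search, the function names, and the synthetic toy data. It is not the rpselect package and should not be read as the thesis's algorithm.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def two_means(points, n_iter=20):
    """Partition points into two clusters (assumed clustering step)."""
    # Deterministic start: the first point and the point farthest from it.
    centers = [list(points[0]),
               list(max(points, key=lambda p: sq_dist(p, points[0])))]
    labels = [0] * len(points)
    for _ in range(n_iter):
        labels = [0 if sq_dist(p, centers[0]) <= sq_dist(p, centers[1]) else 1
                  for p in points]
        for k in (0, 1):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # keep the old center if a cluster is empty
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def rand_index(a, b):
    """Fraction of sample pairs on which two labelings agree."""
    n = len(a)
    agree = total = 0
    for i in range(n):
        for j in range(i + 1, n):
            total += 1
            agree += (a[i] == a[j]) == (b[i] == b[j])
    return agree / total

def fitness(data, classes, genes, n_resamples=30, seed=0):
    """Average agreement between cluster partition and class label
    over bootstrap resamples of the samples (assumed resampling scheme)."""
    rng = random.Random(seed)
    n = len(data)
    scores = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        points = [[data[i][g] for g in genes] for i in idx]
        scores.append(rand_index(two_means(points), [classes[i] for i in idx]))
    return sum(scores) / len(scores)

def greedy_select(data, classes, max_genes=2):
    """Greedy forward search for a gene subset with high fitness
    (assumed search strategy; the source does not specify one)."""
    selected, best = [], 0.0
    remaining = list(range(len(data[0])))
    while remaining and len(selected) < max_genes:
        score, gene = max((fitness(data, classes, selected + [g]), g)
                          for g in remaining)
        if score <= best:  # stop when no candidate improves the score
            break
        selected.append(gene)
        best = score
        remaining.remove(gene)
    return selected, best

# Synthetic toy data: 12 samples, 3 "genes". Only gene 0 tracks the class.
classes = [0] * 6 + [1] * 6
data = [[c * 5 + 0.1 * (i % 3), (i % 2) * 3.0, (i % 4) * 1.0]
        for i, c in enumerate(classes)]
selected, score = greedy_select(data, classes)
print(selected, round(score, 3))
```

In this toy setting the informative gene is picked first because clustering on it reproduces the class label in nearly every resample, while the other two genes produce partitions unrelated to the class. A chance-corrected agreement measure could be substituted for the Rand index without changing the structure of the sketch.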

Software development is an integral part of modern statistical research. Two software packages, pfclust and rpselect, are developed in this thesis to implement the PfCluster method and the RPselection algorithm. Both packages are implemented in the R object-oriented programming framework and can be easily customized and extended by users.

The ideas behind our two procedures can be generalized and applied to other data mining tasks. This thesis concludes with a discussion of the connections between the two methods and of related future research.

Jiayang Sun (Advisor)

Recommended Citations

  • Xu, Y. (2008). New Clustering and Feature Selection Procedures with Applications to Gene Microarray Data [Doctoral dissertation, Case Western Reserve University]. OhioLINK Electronic Theses and Dissertations Center. http://rave.ohiolink.edu/etdc/view?acc_num=case1196144281

    APA Style (7th edition)

  • Xu, Yaomin. New Clustering and Feature Selection Procedures with Applications to Gene Microarray Data. 2008. Case Western Reserve University, Doctoral dissertation. OhioLINK Electronic Theses and Dissertations Center, http://rave.ohiolink.edu/etdc/view?acc_num=case1196144281.

    MLA Style (8th edition)

  • Xu, Yaomin. "New Clustering and Feature Selection Procedures with Applications to Gene Microarray Data." Doctoral dissertation, Case Western Reserve University, 2008. http://rave.ohiolink.edu/etdc/view?acc_num=case1196144281

    Chicago Manual of Style (17th edition)