Feature Selection with Missing Data

Sarkar, Saurabh

2013, PhD, University of Cincinnati, Engineering and Applied Science: Industrial Engineering.
In the modern world, information has become the new power. Increasing effort, resources, time, and tooling are being devoted to gathering data. Data collection at scale is now a reality; the challenge that remains is creating value from the enormous volume of data being collected. Data modeling is one of the ways in which data is put to use. When we model a process or a system, it is crucial to have the right features, and feature selection has therefore become an essential part of data modeling. Yet data are often missing, and in the worst case the important features themselves may have considerable data missing. The challenge is to pick out the best features while still accommodating the missing data. To address this problem, this dissertation introduces a cluster-based feature selection process that is robust to missing data. The research extends the Minimum Expected Cost of Misclassification (MECM) based feature selection method to very high-dimensional datasets by using cluster-based sampling methods. Although cluster-based sampling allows the MECM to scale to larger datasets, determining the optimal cluster size remains a challenge. This is the first issue the dissertation aims to solve. The second issue is handling missing data during MECM-based feature selection. This area has not been studied as extensively as feature selection itself, even though missing data is encountered often. The dissertation presents an algorithm that enables the MECM to handle missing data. The approach is probabilistic and is based on the distribution of the most similar instances: the algorithm estimates the probability that an instance with missing data lies in the sampling cluster and then uses that probability as a fractional count when evaluating the MECM.
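The fractional-count idea can be illustrated with a minimal sketch. It is not the dissertation's algorithm, whose details the abstract does not give. It assumes, purely for illustration, that the sampling cluster is a ball with a given center and radius, that similarity is Euclidean distance over the observed features, and that the membership probability is the share of the k most similar complete instances that fall inside the cluster. The names `membership_weights`, `fractional_class_counts`, `center`, `radius`, and `k` are all hypothetical.

```python
import numpy as np


def in_cluster(X, center, radius):
    """Return True for each fully observed row of X inside the assumed ball."""
    return np.linalg.norm(X - center, axis=1) <= radius


def membership_weights(X, center, radius, k=3):
    """Weight each instance by its (estimated) probability of cluster membership.

    Complete instances get weight 0 or 1. An instance with missing values
    (np.nan) gets the fraction of its k most similar complete instances that
    lie in the cluster. This estimator is an assumption made for the sketch.
    """
    X = np.asarray(X, dtype=float)
    has_missing = np.isnan(X).any(axis=1)
    complete = X[~has_missing]
    complete_in = in_cluster(complete, center, radius)

    weights = np.empty(len(X))
    weights[~has_missing] = complete_in.astype(float)
    for i in np.flatnonzero(has_missing):
        observed = ~np.isnan(X[i])
        # Similarity is measured on the observed features only.
        dists = np.linalg.norm(complete[:, observed] - X[i, observed], axis=1)
        nearest = np.argsort(dists, kind="stable")[:k]
        weights[i] = complete_in[nearest].mean()
    return weights


def fractional_class_counts(weights, y, n_classes):
    """Sum the membership weights per class instead of counting whole instances."""
    return np.bincount(y, weights=weights, minlength=n_classes)


# Toy data: the last instance is missing its second feature.
X = np.array([[0.0, 0.0], [0.5, 0.5], [3.0, 3.0], [4.0, 4.0], [0.2, np.nan]])
y = np.array([0, 1, 1, 0, 1])
weights = membership_weights(X, center=np.array([0.0, 0.0]), radius=1.0, k=3)
counts = fractional_class_counts(weights, y, n_classes=2)
```

In this toy example the incomplete instance contributes two thirds of a count to its class, because two of its three most similar complete instances lie in the cluster. The choice of `k` directly changes that weight, which is why the number of similar instances matters to the estimate.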
One of the challenges of this approach is correctly estimating the probability that a point with missing data lies within the sampling cluster. The key lies in choosing the right number of similar instances from which to calculate the probability, and the dissertation also addresses this problem. The last part of the research is a benchmark study to determine the effectiveness of the algorithm. Two methods serve as benchmarks: a wrapper-based feature selection method using a Naive Bayes classifier, and the MECM without the missing data algorithm. The MECM with the missing data algorithm showed a significant improvement over both. Solving these problems is of great practical significance to data modeling. Instances with missing data may carry critical information, and ignoring missing data during feature selection can have a cascading effect downstream when the final model is built. This research will enable us to choose better features, which would in turn improve the accuracy of existing models. It can affect a broad range of applications, including gene-based medicine, fraud detection, engineering, business, and any field that uses feature selection as a component of the model-building process.
Hongdao Huang, Ph.D. (Committee Chair)
Manish Kumar, Ph.D. (Committee Member)
Sundararaman Anand, Ph.D. (Committee Member)
David Thompson, Ph.D. (Committee Member)
106 p.

Recommended Citations

  • Sarkar, S. (2013). Feature Selection with Missing Data [Doctoral dissertation, University of Cincinnati]. OhioLINK Electronic Theses and Dissertations Center. http://rave.ohiolink.edu/etdc/view?acc_num=ucin1378194989

    APA Style (7th edition)

  • Sarkar, Saurabh. Feature Selection with Missing Data. 2013. University of Cincinnati, Doctoral dissertation. OhioLINK Electronic Theses and Dissertations Center, http://rave.ohiolink.edu/etdc/view?acc_num=ucin1378194989.

    MLA Style (8th edition)

  • Sarkar, Saurabh. "Feature Selection with Missing Data." Doctoral dissertation, University of Cincinnati, 2013. http://rave.ohiolink.edu/etdc/view?acc_num=ucin1378194989

    Chicago Manual of Style (17th edition)