Population Affiliation Prediction Based on Rare Variants and Using Lancaster Importance Estimator, Principal Component Analysis, and Random Forest

Wathen, Michael J

2016, MS, University of Cincinnati, Medicine: Biostatistics (Environmental Health).
In this thesis we introduce to population genetics a method of variable selection based on an estimator of the measure of independence, computed from the data (a contingency table) collected on the joint distribution. We call our maximum likelihood estimator the Lancaster Independence Estimate (LIE). We compare this newly proposed method with two other methods of variable selection: Principal Component Analysis (PCA) and Random Forest (RF). We used data from the 1000 Genomes Project, as provided in the GAW17 mini-exome data set, which covers seven populations: Caucasians from the United States (CEPH), Chinese from Denver (Denver), Chinese from Beijing (Han), Japanese from Tokyo (Japanese), Luhya from Kenya (Luhya), Tuscans from Italy (Tuscan), and Yoruba from Nigeria (Yoruba). The data were parsed to explore the 10,455 rare variants with minor allele frequencies below 5%. These SNP values were recorded as categorical (0, 1). The LIE was used to assemble an - collection of SNPs associated with the seven populations. We also assembled same-size collections of SNPs using the variable importance measures of PCA and RF. We found that the LIE method performed better than expected: its predictive models compared favorably with those from PCA, but did not perform as well as those from RF. We also developed a hybrid method (Piggyback) that improved the predictive accuracy of RF conditional on a substantially smaller set of SNPs coming from the LIE method. Additionally, we found that this hybrid method of RF built on the LIE dramatically reduced the computational time normally required for non-hybrid RF.
Marepalli Rao, Ph.D. (Committee Chair)
Tesfaye Baye Mersha, Ph.D. (Committee Member)
38 p.

Recommended Citations

  • Wathen, M. J. (2016). Population Affiliation Prediction Based on Rare Variants and Using Lancaster Importance Estimator, Principal Component Analysis, and Random Forest [Master's thesis, University of Cincinnati]. OhioLINK Electronic Theses and Dissertations Center. http://rave.ohiolink.edu/etdc/view?acc_num=ucin1460730716

    APA Style (7th edition)

  • Wathen, Michael. Population Affiliation Prediction Based on Rare Variants and Using Lancaster Importance Estimator, Principal Component Analysis, and Random Forest. 2016. University of Cincinnati, Master's thesis. OhioLINK Electronic Theses and Dissertations Center, http://rave.ohiolink.edu/etdc/view?acc_num=ucin1460730716.

    MLA Style (8th edition)

  • Wathen, Michael. "Population Affiliation Prediction Based on Rare Variants and Using Lancaster Importance Estimator, Principal Component Analysis, and Random Forest." Master's thesis, University of Cincinnati, 2016. http://rave.ohiolink.edu/etdc/view?acc_num=ucin1460730716

    Chicago Manual of Style (17th edition)