Data Mining/Machine Learning Techniques for Drug Discovery: Computational and Experimental Pipeline Development

Chen, Jonathan Jun Feng

2018, Doctor of Philosophy, University of Akron, Biology.
Medicine is a precious commodity that saves, prolongs, or improves the quality of life. However, discovering medicinal active ingredients is challenging and is one of the major bottlenecks in developing new pharmaceuticals. The continual development of new therapeutic targets and compounds exacerbates the problem, as the scale of the drug discovery endeavor grows to an unmanageable size. For example, the National Institutes of Health houses the National Library of Medicine (NLM), which contains an ever-growing archive of genes, proteins, and therapeutic targets as well as candidate compounds. Manual inspection of all compounds and biological targets cannot match the rate at which new information is created and deposited. New methods of data processing and drug candidate consideration are needed. The work presented here processed data from the NLM to identify new candidates for consideration. The drug discovery pipeline central to this work created models from existing compound-target interaction data that correlated structure to activity. The models were used to identify the next candidates to test. Compound structural information was captured using the Signature molecular descriptor, while models were created using principal component analysis, a genetic algorithm, and support vector machines. The models identified new candidates for activity validation experiments in a virtual high-throughput screen of the 72 million compounds in the PubChem Compound database of the NLM. The models were retrained to determine whether improvement was possible and what might affect the improvement resulting from retraining. After the activity validation experiments, the activity and structure of the candidates and of compounds from the training set were compared to identify structure-activity relationships for additional avenues of inquiry.
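The screen, validate, and retrain cycle described above can be sketched in a few lines. This is an illustrative sketch only: a nearest-centroid scorer stands in for the Signature/PCA/genetic algorithm/SVM models of the dissertation, and every name here (`train`, `screen`, `retrain_round`, the `assay` callback, the toy vectors) is hypothetical rather than taken from the work.

```python
from typing import Callable, Dict, List, Tuple

Vector = Tuple[float, ...]
Labeled = List[Tuple[Vector, bool]]  # (descriptor vector, is_active)


def centroid(rows: List[Vector]) -> Vector:
    """Component-wise mean of a non-empty list of vectors."""
    n = len(rows)
    return tuple(sum(col) / n for col in zip(*rows))


def sq_dist(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def train(labeled: Labeled) -> Callable[[Vector], float]:
    """Stand-in model (NOT an SVM): higher score = closer to the actives.

    Assumes the training set holds at least one active and one inactive.
    """
    actives = [v for v, is_active in labeled if is_active]
    inactives = [v for v, is_active in labeled if not is_active]
    c_act, c_inact = centroid(actives), centroid(inactives)
    return lambda v: sq_dist(v, c_inact) - sq_dist(v, c_act)


def screen(score: Callable[[Vector], float],
           library: Dict[str, Vector], top_n: int) -> List[str]:
    """Virtual screen: rank the library and keep the top_n candidates."""
    ranked = sorted(library, key=lambda name: score(library[name]), reverse=True)
    return ranked[:top_n]


def retrain_round(labeled: Labeled, library: Dict[str, Vector],
                  assay: Callable[[str], bool], top_n: int):
    """One cycle: train, screen, validate picks, fold results into training data."""
    picks = screen(train(labeled), library, top_n)
    validated = [(library[name], assay(name)) for name in picks]
    remaining = {k: v for k, v in library.items() if k not in picks}
    return labeled + validated, remaining, picks


# Toy data: two labeled compounds and a three-compound library.
labeled = [((0.0, 0.0), False), ((1.0, 1.0), True)]
library = {"a": (0.9, 1.0), "b": (0.1, 0.0), "c": (0.8, 0.7)}
labeled, library, picks = retrain_round(
    labeled, library, assay=lambda name: name == "a", top_n=2)
```

Calling `retrain_round` repeatedly mimics retraining after each batch of validation experiments; the `assay` callback is a placeholder for the experimental activity measurement.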
Seven case studies were conducted to test the robustness of the pipeline in response to changing dataset size and active fraction: Cathepsin L, Factor XIIa, Factor XIa, C1s, SENP8, and PK-M2, the last with two different datasets. Across all seven case studies, model retraining was beneficial and the pipeline was more effective at low active fractions. Recommendations for future use are to retrain models when possible, to extrapolate incrementally, and to apply the pipeline to datasets with small active fractions while avoiding large datasets with high active fractions, in order to maximize pipeline effectiveness and utility.
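As a small illustration of the quantity varied across the case studies, the sketch below computes an active fraction, taken here to mean the share of compounds in a dataset labeled active (an assumption, since the abstract does not define the term). It also computes enrichment over that baseline, a common screening measure that is not mentioned in the abstract. The function names and numbers are hypothetical.

```python
from typing import Sequence


def active_fraction(labels: Sequence[bool]) -> float:
    """Share of compounds labeled active in a dataset."""
    if not labels:
        raise ValueError("dataset is empty")
    return sum(1 for is_active in labels if is_active) / len(labels)


def enrichment(hits: int, tested: int, baseline: float) -> float:
    """Hit rate among tested candidates divided by the baseline active fraction.

    A value above 1 means the selected candidates were active more often
    than compounds drawn at random from the training data.
    """
    if tested <= 0 or baseline <= 0:
        raise ValueError("tested and baseline must be positive")
    return (hits / tested) / baseline


# Toy numbers only: 1 active among 8 training compounds,
# then 1 confirmed hit among 4 tested candidates.
baseline = active_fraction([True] + [False] * 7)
ratio = enrichment(hits=1, tested=4, baseline=baseline)
```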
Donald Visco, Jr. (Advisor)
Zhong-Hui Duan (Committee Member)
Nic Leipzig (Committee Member)
Jie Zheng (Committee Member)
Richard Londraville (Committee Member)
263 p.

Recommended Citations

  • Chen, J. J. F. (2018). Data Mining/Machine Learning Techniques for Drug Discovery: Computational and Experimental Pipeline Development [Doctoral dissertation, University of Akron]. OhioLINK Electronic Theses and Dissertations Center. http://rave.ohiolink.edu/etdc/view?acc_num=akron1524661027035591

    APA Style (7th edition)

  • Chen, Jonathan. Data Mining/Machine Learning Techniques for Drug Discovery: Computational and Experimental Pipeline Development. 2018. University of Akron, Doctoral dissertation. OhioLINK Electronic Theses and Dissertations Center, http://rave.ohiolink.edu/etdc/view?acc_num=akron1524661027035591.

    MLA Style (8th edition)

  • Chen, Jonathan. "Data Mining/Machine Learning Techniques for Drug Discovery: Computational and Experimental Pipeline Development." Doctoral dissertation, University of Akron, 2018. http://rave.ohiolink.edu/etdc/view?acc_num=akron1524661027035591

    Chicago Manual of Style (17th edition)