Files
dissertation_v12.pdf (2.55 MB)
Data Mining/Machine Learning Techniques for Drug Discovery: Computational and Experimental Pipeline Development
Author Info
Chen, Jonathan Jun Feng
ORCID® Identifier
http://orcid.org/0000-0003-1148-9764
Permalink:
http://rave.ohiolink.edu/etdc/view?acc_num=akron1524661027035591
Abstract Details
Year and Degree
2018, Doctor of Philosophy, University of Akron, Biology.
Abstract
Medicine is a precious commodity that saves, prolongs, or increases the quality of life. However, medicinal active ingredient discovery is challenging and is one of the major bottlenecks to developing new pharmaceuticals. Progressive development of new therapeutic targets and compounds exacerbates the problem as the scale of the drug discovery endeavor increases to an unmanageable size. For example, the National Institutes of Health houses the National Library of Medicine (NLM), which contains an ever-growing archive of genes, proteins, and therapeutic targets as well as candidate compounds. Manual inspection of all compounds and biological targets cannot match the rate at which new information is created and deposited. New methods of data processing and drug candidate consideration are needed. The work presented used and processed data from the NLM to identify new candidates for consideration. The drug discovery pipeline central to this work created models from existing compound-target interaction data that correlated structure to activity. The models were used to identify the next candidates to test. Compound structural information was captured using the Signature molecular descriptor, while models were created using principal component analysis, a genetic algorithm, and support vector machines. The models identified new candidates for activity validation experiments in a virtual high-throughput screen of the 72 million compounds in the PubChem Compound database of the NLM. The models were retrained to determine whether improvement was possible and what might affect the improvement resulting from retraining. After activity validation experiments, the activity and structure of candidates and compounds from the training set were compared to identify structure-activity relationships for additional avenues of inquiry.
Seven case studies were conducted to test the robustness of the pipeline in response to changing dataset size and active fraction: Cathepsin L, Factor XIIa, Factor XIa, C1s, SENP8, and PK-M2 with two different datasets. Results from all seven case studies showed that model retraining was beneficial and that the pipeline was more effective at low active fractions. Recommendations for future use are to retrain models when possible, to extrapolate incrementally, and to apply the pipeline to datasets with small active fractions while avoiding large datasets with high active fractions, in order to maximize pipeline effectiveness and utility.
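The abstract names the Signature molecular descriptor as the way compound structure was captured. As a rough illustration only, the sketch below counts height-1 atomic signatures (each atom's element together with the sorted elements of its bonded neighbors) for a hand-built toy molecule. It is a simplified assumption for illustration, not the dissertation's implementation: the function name `height1_signatures`, the atom/bond list representation, and the string format are all hypothetical, and bond orders and greater signature heights are ignored.

```python
from collections import Counter


def height1_signatures(atoms, bonds):
    """Count simplified height-1 atomic signatures.

    atoms: list of element symbols, indexed by atom number.
    bonds: list of (i, j) index pairs; bond order is ignored here.
    Each signature is the root element followed by its sorted neighbor
    elements, e.g. "C(CHHH)" for a methyl carbon.
    """
    neighbors = {i: [] for i in range(len(atoms))}
    for a, b in bonds:
        neighbors[a].append(atoms[b])
        neighbors[b].append(atoms[a])
    return Counter(
        f"{atoms[i]}({''.join(sorted(neighbors[i]))})"
        for i in range(len(atoms))
    )


# Toy input (assumption): ethanol, CH3-CH2-OH, with explicit hydrogens.
atoms = ["C", "C", "O", "H", "H", "H", "H", "H", "H"]
bonds = [(0, 1), (1, 2), (0, 3), (0, 4), (0, 5), (1, 6), (1, 7), (2, 8)]

descriptor = height1_signatures(atoms, bonds)
for signature, count in sorted(descriptor.items()):
    print(signature, count)
```

The resulting counts form a sparse vector per compound. Vectors of this general kind could then be passed to dimensionality reduction and a classifier such as those named in the abstract, though how the dissertation combines those steps is not specified in this record.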
Committee
Donald Visco, Jr. (Advisor)
Zhong-Hui Duan (Committee Member)
Nic Leipzig (Committee Member)
Jie Zheng (Committee Member)
Richard Londraville (Committee Member)
Pages
263 p.
Subject Headings
Biochemistry; Bioinformatics; Biology; Chemical Engineering; Computer Science
Keywords
vHTS; virtual; high-throughput; screening; bioinformatics; biological; informatics; machine-learning; QSAR; quantitative structure-activity relationship; cheminformatics; chemical; chemoinformatics; data-mining; pipeline; PubChem; drug; discovery
Recommended Citations
Citations
Chen, J. J. F. (2018). Data Mining/Machine Learning Techniques for Drug Discovery: Computational and Experimental Pipeline Development [Doctoral dissertation, University of Akron]. OhioLINK Electronic Theses and Dissertations Center. http://rave.ohiolink.edu/etdc/view?acc_num=akron1524661027035591
APA Style (7th edition)
Chen, Jonathan. Data Mining/Machine Learning Techniques for Drug Discovery: Computational and Experimental Pipeline Development. 2018. University of Akron, Doctoral dissertation. OhioLINK Electronic Theses and Dissertations Center, http://rave.ohiolink.edu/etdc/view?acc_num=akron1524661027035591.
MLA Style (8th edition)
Chen, Jonathan. "Data Mining/Machine Learning Techniques for Drug Discovery: Computational and Experimental Pipeline Development." Doctoral dissertation, University of Akron, 2018. http://rave.ohiolink.edu/etdc/view?acc_num=akron1524661027035591
Chicago Manual of Style (17th edition)
Document number: akron1524661027035591
Download Count: 492
Copyright Info
© 2018, all rights reserved.
This open access ETD is published by University of Akron and OhioLINK.