Locality-Dependent Training and Descriptor Sets for QSAR Modeling

Hobocienski, Bryan Christopher

2020, Doctor of Philosophy, Ohio State University, Chemical Engineering.
Quantitative Structure-Activity Relationships (QSARs) are empirical or semi-empirical models that correlate the structure of chemical compounds with their biological activities. QSAR analysis is frequently applied in drug development and in environmental and human health protection, where these models are used to predict pharmacological endpoints for candidate drug molecules or to assess the toxicological potential of chemical ingredients found in commercial products, respectively. Fields such as drug design and health regulation share the need to manage a large number of chemicals for which sufficient experimental data on their application-relevant profiles are often lacking; the time and resources required to conduct the in vitro and in vivo tests needed to properly characterize these compounds make a purely experimental approach impossible. QSAR analysis alleviates the problems posed by these data gaps by interpreting the wealth of information already contained in existing databases. This research develops a novel QSAR workflow that uses a local modeling strategy. By far the most common QSAR models reported in the literature are “global” models: they use all available training molecules and a single set of chemical descriptors to learn the relationship between structure and the endpoint of interest. Additionally, accepted QSAR models frequently use linear transformations such as principal component analysis or partial least squares regression to reduce the dimensionality of complex chemical data sets. In contrast to these conventional approaches, the proposed methodology uses a locality-defining radius to identify a subset of training compounds in proximity to a test query and learns an individual model for that query. Furthermore, descriptor selection is used to isolate the subset of available chemical descriptors tailored specifically to explain the activity of each test compound.
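The idea of a locality-defining radius can be illustrated with a minimal sketch. It is not the dissertation's implementation: the Euclidean distance, the majority-vote "local model", the `min_neighbors` threshold, and all function names are assumptions chosen only to keep the example short.

```python
import math

def local_training_set(query, train_X, radius):
    """Return indices of training compounds within `radius` of the query.

    Assumption: Euclidean distance in the descriptor space.
    """
    return [i for i, x in enumerate(train_X) if math.dist(query, x) <= radius]

def predict_local(query, train_X, train_y, radius, min_neighbors=3):
    """Predict a binary label for `query` from its local training set.

    Returns None (no prediction) when fewer than `min_neighbors` training
    compounds lie inside the radius. The majority vote stands in for
    whatever learner is fitted locally; ties are resolved as 1.
    """
    idx = local_training_set(query, train_X, radius)
    if len(idx) < min_neighbors:
        return None
    positives = sum(train_y[i] for i in idx)
    return 1 if 2 * positives >= len(idx) else 0

# Toy descriptor vectors (hypothetical data): two well-separated clusters.
train_X = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
train_y = [1, 1, 1, 0, 0, 0]

near_first = predict_local((0.05, 0.05), train_X, train_y, radius=0.5)
near_second = predict_local((5.0, 5.0), train_X, train_y, radius=0.5)
isolated = predict_local((2.5, 2.5), train_X, train_y, radius=0.5)
```

A query that has too few neighbors inside the radius receives no prediction, which is why a local workflow of this kind reports test set coverage alongside accuracy.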
Finally, this work adapts a non-linear dimensionality reduction technique, t-Distributed Stochastic Neighbor Embedding (t-SNE), to refine global descriptor spaces before local training sets are identified. The resulting ensemble of local models is used to generate predictions for the test set. The proposed local QSAR workflow is evaluated using two data sets from the literature, one concerning Ames mutagenicity and the other blood-brain barrier permeability. Performance statistics are determined by a 5-fold cross-validation strategy. Local model ensembles frequently outperform global models, especially for small to medium-sized local training sets. Illustrating this point, local model ensembles from the proposed methodology outperform global models by as much as 5% to 10% when the two approaches are compared by the dimension of the modeling space. A sizeable portion of this work concerns implementing the t-SNE algorithm to resolve the problems associated with identifying training samples that neighbor test compounds in high-dimensional spaces. t-SNE-based local model ensembles perform competitively with PLS-based local model ensembles; for instance, when the test set coverage is approximately 25%, the accuracy of t-SNE-based local model ensembles is 86.1%, whereas that of the PLS-based local model ensembles is 81.8%. When coverage increases to 93%, so that most of the test molecules are predicted, the accuracy of t-SNE-based local model ensembles is 73.8% versus 71.2% for PLS-based local model ensembles. Furthermore, the novel QSAR workflow offers performance comparable to that of QSAR models reported in the literature. On the Ames mutagenicity data set, AUC values derived from the proposed methodology range from 0.79 to 0.81, whereas those from literature models range from 0.79 to 0.86.
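The accuracy figures above are quoted at a given test set coverage. As a small sketch of how the two quantities relate, assume that coverage is the fraction of test compounds that receive a prediction and that accuracy is computed only over those compounds. This definition, the function name, and the toy labels are assumptions, not taken from the dissertation.

```python
def coverage_and_accuracy(y_true, y_pred):
    """Return (coverage, accuracy) for predictions that may abstain.

    `y_pred` uses None for test compounds that received no prediction.
    Accuracy is computed over predicted compounds only and is NaN when
    nothing was predicted.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one test compound is required")
    predicted = [(t, p) for t, p in zip(y_true, y_pred) if p is not None]
    coverage = len(predicted) / len(y_true)
    if not predicted:
        return coverage, float("nan")
    accuracy = sum(t == p for t, p in predicted) / len(predicted)
    return coverage, accuracy

# Hypothetical labels: one of four test compounds gets no prediction.
y_true = [1, 0, 1, 1]
y_pred = [1, None, 0, 1]
coverage, accuracy = coverage_and_accuracy(y_true, y_pred)
```

Under this definition a smaller locality radius tends to lower coverage, because more queries lack enough neighbors, which is consistent with accuracy being reported at specific coverage levels.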
Likewise, when predicting blood-brain barrier permeability, Matthews correlation coefficients range from 0.321 to 0.645 using popular machine learning methods and from 0.478 to 0.565 using the proposed methodology. Finally, the proposed local QSAR workflow offers several interpretability-based features. An open criticism of local modeling strategies, due to their fragmented nature, is the difficulty of recognizing relationships present throughout the entire training set. This problem is addressed by demonstrating how the frequencies and associations of significant descriptors occurring across local models can be extracted. As a concrete example, this analytic approach successfully identifies acryl halides as a structural alert for positive Ames mutagenicity. Additionally, valuable information is provided at the local level, such as the descriptor spaces and decision boundaries used for predicting individual query compounds.
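One simple way to extract descriptor frequencies and associations across an ensemble of local models is to count how often each descriptor, and each pair of descriptors, is selected. The sketch below assumes each local model exposes the list of descriptors it selected; the function name and the descriptor names are hypothetical, and the dissertation may define "association" differently.

```python
from collections import Counter
from itertools import combinations

def descriptor_statistics(selected_per_model):
    """Count descriptor frequencies and pairwise co-occurrences.

    `selected_per_model` holds one list of selected descriptor names per
    local model. Duplicates inside a single model are counted once.
    Pairs are stored as alphabetically sorted tuples.
    """
    frequency = Counter()
    co_occurrence = Counter()
    for selected in selected_per_model:
        unique = sorted(set(selected))
        frequency.update(unique)
        co_occurrence.update(combinations(unique, 2))
    return frequency, co_occurrence

# Hypothetical descriptor selections from three local models.
models = [
    ["desc_A", "desc_B"],
    ["desc_B", "desc_A", "desc_C"],
    ["desc_C"],
]
frequency, co_occurrence = descriptor_statistics(models)
most_common_pair = co_occurrence.most_common(1)[0]
```

Descriptors or pairs that recur across many local models are candidates for relationships that hold throughout the training set, which is the kind of global signal a purely local view would otherwise hide.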
James Rathman (Advisor)
Bhavik Bakshi (Committee Member)
Jeffrey Chalmers (Committee Member)
285 p.

Recommended Citations

  • Hobocienski, B. C. (2020). Locality-Dependent Training and Descriptor Sets for QSAR Modeling [Doctoral dissertation, Ohio State University]. OhioLINK Electronic Theses and Dissertations Center. http://rave.ohiolink.edu/etdc/view?acc_num=osu1577716259011585

    APA Style (7th edition)

  • Hobocienski, Bryan. Locality-Dependent Training and Descriptor Sets for QSAR Modeling. 2020. Ohio State University, Doctoral dissertation. OhioLINK Electronic Theses and Dissertations Center, http://rave.ohiolink.edu/etdc/view?acc_num=osu1577716259011585.

    MLA Style (8th edition)

  • Hobocienski, Bryan. "Locality-Dependent Training and Descriptor Sets for QSAR Modeling." Doctoral dissertation, Ohio State University, 2020. http://rave.ohiolink.edu/etdc/view?acc_num=osu1577716259011585

    Chicago Manual of Style (17th edition)