Case Influence and Model Complexity in Regression and Classification

2019, Doctor of Philosophy, Ohio State University, Statistics.
Case influence and model complexity play important roles in model diagnostics and model comparison. They have been extensively studied in linear regression and generalized linear models (GLMs). In this dissertation, we focus on how to assess case influence and estimate model complexity for penalized M-estimators with non-smooth loss functions in regression and classification. Cook's distance is commonly used for case influence assessment in least squares regression. It measures the overall change in the fitted model when one case is deleted from the data. Unlike least squares regression, however, the relation between the full-data solution and the leave-one-out (LOO) solution is not explicit for general penalized M-estimators, which makes the computation challenging. We propose a new algorithm to relate the full-data solution to the LOO solution through a case-weight adjusted solution path. We take penalized quantile regression and the support vector machine (SVM) as examples of regression and classification, respectively. Resorting to the homotopy technique in optimization, we introduce a case weight for each individual data point as a continuous embedding parameter and decrease the weight gradually from one to zero to link the estimators based on the full data and those with a case deleted. We show that the case-weight adjusted solution path is piecewise linear in the weight parameter. This allows us to compute all LOO estimators efficiently. Moreover, we can use the solution path to generate case influence graphs and perform LOO cross validation for model selection. Case influence measures for classification methods are understudied in the literature. We propose a variety of overall case influence measures for large margin classifiers and empirically find that using some loss functions is quite effective in assessing case influence.
Moreover, we demonstrate using real-world datasets that the proposed method is able to detect outliers in the feature space and outliers with large negative functional margin. Related to case influence and sensitivity of a model to data perturbation, the notion of degrees of freedom has been developed and used for measuring model complexity. We generalize the leave-one-out lemma used in degrees of freedom estimation by considering a data perturbation scheme based on case weight adjustment. Using the generalized lemma, we propose a refined approach to approximating model degrees of freedom based on the case-weight adjusted solutions. For classification, we link model complexity with the notion of expected optimism, which is defined as the expected difference between the prediction error and the training error. We extend Efron's work by providing a general formula for the expected optimism for large margin classifiers. Based on this generalization, we define and evaluate degrees of freedom. Computing these degrees of freedom requires solving n separate flip-one-label (FOL) problems, where one label is flipped and the rest are unchanged. To lessen the computational cost, taking SVM as an example, we propose a similar solution path algorithm to efficiently solve each FOL problem. We show that our proposed degrees of freedom outperforms an analogue of the GDF estimate for estimating the expected optimism in classification.
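As background for the least squares case that the abstract contrasts with, the following is a minimal sketch, not the dissertation's algorithm. It assumes ordinary least squares with a full-rank design matrix and uses the standard textbook identity for Cook's distance. The function names (cooks_distance_closed_form, case_weighted_fit, cooks_distance_brute_force) are hypothetical. The helper case_weighted_fit refits the model with a weight w on a single case, so that w = 1 gives the full-data fit and w = 0 gives the LOO fit; in least squares both endpoints are available explicitly, which is the property that the abstract says is lost for non-smooth losses.

```python
import numpy as np


def cooks_distance_closed_form(X, y):
    """Cook's distance for OLS from leverages and residuals (no refitting)."""
    n, p = X.shape
    xtx_inv = np.linalg.inv(X.T @ X)
    # Diagonal of the hat matrix H = X (X'X)^{-1} X', i.e. the leverages h_ii.
    leverage = np.einsum("ij,jk,ik->i", X, xtx_inv, X)
    beta = xtx_inv @ X.T @ y
    resid = y - X @ beta
    s2 = resid @ resid / (n - p)  # residual variance estimate
    return resid**2 * leverage / (p * s2 * (1.0 - leverage) ** 2)


def case_weighted_fit(X, y, i, w):
    """Weighted least squares with weight w on case i and weight 1 elsewhere.

    w = 1 reproduces the full-data fit; w = 0 reproduces the LOO fit.
    """
    weights = np.ones(len(y))
    weights[i] = w
    Xw = X * weights[:, None]
    return np.linalg.solve(X.T @ Xw, Xw.T @ y)


def cooks_distance_brute_force(X, y):
    """Cook's distance computed by explicitly refitting with each case removed."""
    n, p = X.shape
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta
    s2 = resid @ resid / (n - p)
    dist = np.empty(n)
    for i in range(n):
        beta_loo = case_weighted_fit(X, y, i, 0.0)
        shift = X @ (beta - beta_loo)  # change in all fitted values
        dist[i] = shift @ shift / (p * s2)
    return dist
```

On any full-rank design the two functions should agree up to floating-point error, which illustrates why LOO quantities are cheap in least squares: no refitting is needed.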
Yoonkyung Lee (Advisor)
Yunzhang Zhu (Advisor)
Steve MacEachern (Committee Member)
Mario Peruggia (Committee Member)
132 p.

Recommended Citations

  • TU, S. (2019). Case Influence and Model Complexity in Regression and Classification [Doctoral dissertation, Ohio State University]. OhioLINK Electronic Theses and Dissertations Center. http://rave.ohiolink.edu/etdc/view?acc_num=osu1563324139376977

    APA Style (7th edition)

  • TU, SHANSHAN. Case Influence and Model Complexity in Regression and Classification. 2019. Ohio State University, Doctoral dissertation. OhioLINK Electronic Theses and Dissertations Center, http://rave.ohiolink.edu/etdc/view?acc_num=osu1563324139376977.

    MLA Style (8th edition)

  • TU, SHANSHAN. "Case Influence and Model Complexity in Regression and Classification." Doctoral dissertation, Ohio State University, 2019. http://rave.ohiolink.edu/etdc/view?acc_num=osu1563324139376977

    Chicago Manual of Style (17th edition)