Cost-Aware Machine Learning and Deep Learning for Extremely Imbalanced Data


2023, Doctor of Philosophy (Ph.D.), Bowling Green State University, Data Science.
Many real-world datasets, such as those used for failure and anomaly detection, are severely imbalanced, with far fewer failed instances than normal instances. This imbalance often biases learning toward the majority class, and mitigating that bias is a serious challenge. To address these issues, this dissertation uses the Backblaze HDD data and makes several contributions to hard drive failure prediction. It begins with an evaluation of current state-of-the-art techniques and an identification of their shortcomings. It then explores multiple facets of machine learning (ML) and deep learning (DL) approaches to these challenges. The synthetic minority over-sampling technique (SMOTE) is investigated by evaluating its performance with different distance metrics and nearest neighbor search algorithms, and a novel approach that integrates SMOTE with Gaussian mixture models (GMM), called GMM SMOTE, is proposed to address various issues. Subsequently, a comprehensive analysis of cost-aware ML techniques applied to disk failure prediction is provided, emphasizing the challenges in current implementations. The research then explores a variety of cost-aware DL models, from 1D convolutional neural networks (CNN) and long short-term memory (LSTM) models to a hybrid model combining 1D CNN and bidirectional LSTM (BLSTM) approaches, to exploit the sequential nature of hard drive sensor data. A modified focal loss function is introduced to address the class imbalance prevalent in the hard drive dataset. The DL models are compared with traditional ML algorithms, such as random forest (RF) and logistic regression (LR), and demonstrate superior results, suggesting the potential effectiveness of the proposed focal loss function.
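For readers unfamiliar with focal loss, the following is a minimal sketch of the standard binary focal loss, FL(p_t) = -alpha_t (1 - p_t)^gamma log(p_t). It is not the modified loss proposed in the dissertation, whose exact form the abstract does not give. The function name `focal_loss` and the defaults `gamma=2.0` and `alpha=0.25` are assumptions chosen for illustration.

```python
import math

def focal_loss(y_true, p_pred, gamma=2.0, alpha=0.25, eps=1e-12):
    """Mean standard binary focal loss (illustrative, not the dissertation's variant).

    y_true: iterable of 0/1 labels, where 1 marks the minority class (e.g. failure).
    p_pred: iterable of predicted probabilities for class 1.
    gamma:  focusing parameter; gamma = 0 reduces to weighted cross-entropy.
    alpha:  weight on the positive class; negatives receive 1 - alpha.
    """
    total = 0.0
    n = 0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)           # clip to avoid log(0)
        p_t = p if y == 1 else 1.0 - p            # probability of the true class
        alpha_t = alpha if y == 1 else 1.0 - alpha
        # (1 - p_t)^gamma shrinks the loss of well-classified examples
        total += -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
        n += 1
    return total / n
```

With `gamma > 0`, confidently correct predictions on the abundant normal class contribute little to the loss, so training signal concentrates on hard and rare examples.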
In addition to these efforts, this dissertation aims to provide a comprehensive understanding of hard drive longevity and the critical factors contributing to eventual drive failure through survival analysis. It employs survival analysis to enhance sampling effectiveness, preferentially including observations associated with higher hazards. Techniques such as permutation feature importance, Shapley values, and Cox regression are used to identify the key factors influencing drive failure. This work also lays the groundwork for future research on efficient strategies for handling imbalanced data and on predictive maintenance in big data frameworks.
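One way to picture sampling that preferentially includes higher-hazard observations is weighted sampling without replacement. The sketch below is an assumption for illustration only: the abstract does not describe the dissertation's actual sampling procedure. It uses Efraimidis-Spirakis keys, where each observation gets the key u^(1/w) for a uniform random u and weight w. The function name `hazard_weighted_sample` and the input `hazards` (positive hazard scores, for example from a fitted survival model) are hypothetical.

```python
import random

def hazard_weighted_sample(hazards, k, seed=0):
    """Pick k distinct indices, favoring observations with larger hazard scores.

    hazards: list of strictly positive scores (hypothetical model output).
    k:       number of observations to keep.
    seed:    seed for reproducibility.
    """
    if k > len(hazards):
        raise ValueError("k cannot exceed the number of observations")
    if any(h <= 0 for h in hazards):
        raise ValueError("hazard scores must be strictly positive")
    rng = random.Random(seed)
    # Larger weights push the key u ** (1 / w) toward 1, so they rank higher.
    keys = [(rng.random() ** (1.0 / h), i) for i, h in enumerate(hazards)]
    keys.sort(reverse=True)
    return [i for _, i in keys[:k]]
```

For example, `hazard_weighted_sample(scores, k=1000)` would keep 1000 majority-class rows while tilting the retained set toward drives that the survival model considers closer to failure.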
Robert C. Green II, Ph.D. (Committee Chair)
Liuling Liu, Ph.D. (Other)
Umar D Islambekov, Ph.D. (Committee Member)
Junfeng Shang, Ph.D. (Committee Member)
160 p.

Recommended Citations


  • Ahmed, J. (2023). Cost-Aware Machine Learning and Deep Learning for Extremely Imbalanced Data [Doctoral dissertation, Bowling Green State University]. OhioLINK Electronic Theses and Dissertations Center. http://rave.ohiolink.edu/etdc/view?acc_num=bgsu1688685109278097

    APA Style (7th edition)

  • Ahmed, Jishan. Cost-Aware Machine Learning and Deep Learning for Extremely Imbalanced Data. 2023. Bowling Green State University, Doctoral dissertation. OhioLINK Electronic Theses and Dissertations Center, http://rave.ohiolink.edu/etdc/view?acc_num=bgsu1688685109278097.

    MLA Style (8th edition)

  • Ahmed, Jishan. "Cost-Aware Machine Learning and Deep Learning for Extremely Imbalanced Data." Doctoral dissertation, Bowling Green State University, 2023. http://rave.ohiolink.edu/etdc/view?acc_num=bgsu1688685109278097

    Chicago Manual of Style (17th edition)