Next Generation Outlier Detection

2014, Doctor of Philosophy, Ohio State University, Computer Science and Engineering.
Outlier detection is a fundamental task used in numerous data analytic applications. It tackles the problem of identifying rare or atypical points that diverge widely from the general behavior or model of the data. How outliers are detected and subsequently used for data analysis depends on the underlying application. For example, outlier detection can be employed as a preprocessing step to clean a data set of erroneous measurements and noisy data points. On the other hand, it can also be used to isolate suspicious or interesting patterns in the data. Examples include fraud detection, customer relationship management, network intrusion detection, clinical diagnosis, and biological data analysis. Although many successful algorithms have been developed for outlier detection, several challenges have haunted researchers and practitioners for decades. The first is limited algorithm scalability. Due to the fast evolution of the World Wide Web, collected data can easily reach terabyte or even petabyte scale. Most existing approaches, ranging from statistical methods to geometric methods, and from density-based approaches to information-theoretic approaches, suffer from limited scalability and do not work well on large-scale data. The second is detecting outliers in irregular, dynamic, semi-structured data such as trees and graphs. There has been some research on finding outliers in graphs, but open questions remain. What are the definitions of meaningful outliers in the graph context? How can we detect them accurately and efficiently? The third challenge is building a unified and modular detection system that provides researchers with a complete toolbox for outlier detection tasks. Our research aims at designing next-generation outlier detection algorithms that tackle the above three challenges. To achieve better scalability, we have done an extensive empirical study of different optimization techniques for distance-based outlier detection.
Also, we proposed a ranking scheme driven by Locality Sensitive Hashing (LSH), which finds all outliers by visiting only a small portion of the data (10%). Finding the points similar to each point, known as all-pairs similarity search, is the key operation for many distance-based, density-based, and cluster-based outlier detection methods. We optimized this fundamental kernel for metric spaces on the MapReduce platform, scaling the algorithm to hundreds of machines and resolving the problem of inadequate memory. For semi-structured outlier detection, we first designed a clustering-based detection algorithm, together with a generic clustering algorithm for sets/multisets, trees, and graphs. We also studied a concrete detection application on a semi-structured knowledge base and found more than one million anomalies. Finally, we integrated our work seamlessly into a detection framework, which accepts different types of data. Users are also free to choose and compare different algorithms.
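
For readers unfamiliar with distance-based outlier detection, the following is a minimal sketch of one common textbook formulation: each point is scored by the distance to its k-th nearest neighbor, and the highest-scoring points are reported as outliers. It is not the dissertation's algorithm. The function names (`knn_distance_scores`, `top_outliers`), the Euclidean metric, and the sample data are all assumptions chosen for illustration.

```python
import math


def knn_distance_scores(points, k):
    """Score each point by the Euclidean distance to its k-th nearest neighbor.

    Hypothetical helper: a brute-force baseline, not the optimized method
    described in the abstract.
    """
    if not 1 <= k < len(points):
        raise ValueError("k must satisfy 1 <= k < len(points)")
    scores = []
    for i, p in enumerate(points):
        # Distances from p to every other point, smallest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores


def top_outliers(points, k, n):
    """Return the indices of the n points with the largest k-NN distance."""
    scores = knn_distance_scores(points, k)
    ranked = sorted(range(len(points)), key=lambda i: scores[i], reverse=True)
    return ranked[:n]


# Four tightly clustered points and one point far away from them.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
print(top_outliers(points, k=2, n=1))  # prints [4]
```

This baseline compares every pair of points, so its cost grows quadratically with the size of the data set. That is the kind of scalability limit the abstract refers to when it mentions pruning optimizations, LSH-driven ranking, and distributed all-pairs similarity search.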
Srinivasan Parthasarathy, Dr. (Advisor)
P. Sadayappan, Dr. (Committee Member)
Gagan Agrawal, Dr. (Committee Member)
Udo Will, Dr. (Committee Member)

Recommended Citations

  • Wang, Y. (2014). Next Generation Outlier Detection [Doctoral dissertation, Ohio State University]. OhioLINK Electronic Theses and Dissertations Center. http://rave.ohiolink.edu/etdc/view?acc_num=osu1397704520

    APA Style (7th edition)

  • Wang, Ye. Next Generation Outlier Detection. 2014. Ohio State University, Doctoral dissertation. OhioLINK Electronic Theses and Dissertations Center, http://rave.ohiolink.edu/etdc/view?acc_num=osu1397704520.

    MLA Style (8th edition)

  • Wang, Ye. "Next Generation Outlier Detection." Doctoral dissertation, Ohio State University, 2014. http://rave.ohiolink.edu/etdc/view?acc_num=osu1397704520

    Chicago Manual of Style (17th edition)