
Discovery of Outlier Points and Dense Regions in Large Data-Sets Using Spark Environment

Nadella, Pravallika

2021, MS, University of Cincinnati, Engineering and Applied Science: Computer Science.
Outlier detection is important because outliers are problematic for many statistical analyses: they can cause tests to miss significant findings or distort real results. Although many outlier detection algorithms exist, finding outliers in a large data-set can be very time-consuming, especially when the underlying probability distribution is unknown. In this thesis, we present an efficient outlier detection algorithm that performs a parallel K-Nearest Neighbor search by building a KD-Tree data structure and searching it for nearest neighbors. We also use the same KD-Tree to find dense nodes and regions, reusing the calculations done while building the tree. A KD-Tree (k-Dimensional Tree) is a multi-dimensional binary tree, a storage structure for efficiently representing training data. We exploit the KD-Tree's efficiency at finding K nearest neighbors and optimize the parallel execution of the algorithm to find outliers and dense regions. Outliers are identified from the K nearest neighbors of each data record by applying a simple formula that classifies the record as an outlier or not. Dense regions are found by storing the centroid, the number of points, and the region for each node while building the KD-Tree from the input data-set. We apply thresholds on the number of points and the density of each node's region so that only regions with a significant number of points and density are output.
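The approach described in the abstract can be illustrated with a minimal single-machine sketch in Python. It is not the thesis implementation: the abstract does not state the outlier formula, the density definition, or how the work is distributed on Spark, so the sketch assumes a mean K-nearest-neighbor distance compared against a fixed threshold, and a density equal to the point count divided by the bounding-box volume. All names (`Node`, `build`, `knn_distances`, `find_outliers`, `dense_regions`) and all threshold values are hypothetical.

```python
import heapq
import math
from dataclasses import dataclass
from typing import List, Optional, Tuple

Point = Tuple[float, ...]

@dataclass
class Node:
    point: Point      # splitting point stored at this node
    axis: int         # dimension used for the split
    count: int        # number of points in this subtree
    centroid: Point   # mean of the points in this subtree
    lo: Point         # minimum corner of the bounding box
    hi: Point         # maximum corner of the bounding box
    left: Optional["Node"] = None
    right: Optional["Node"] = None

def build(points: List[Point], depth: int = 0) -> Optional[Node]:
    """Build a KD-Tree, recording count, centroid and region at every node."""
    if not points:
        return None
    dim, n = len(points[0]), len(points)
    axis = depth % dim
    pts = sorted(points, key=lambda p: p[axis])
    mid = n // 2
    centroid = tuple(sum(p[d] for p in pts) / n for d in range(dim))
    lo = tuple(min(p[d] for p in pts) for d in range(dim))
    hi = tuple(max(p[d] for p in pts) for d in range(dim))
    return Node(pts[mid], axis, n, centroid, lo, hi,
                build(pts[:mid], depth + 1), build(pts[mid + 1:], depth + 1))

def knn_distances(root: Optional[Node], query: Point, k: int) -> List[float]:
    """Return the k smallest distances from query to tree points, ascending."""
    heap: List[float] = []  # max-heap of the best distances, stored negated

    def visit(node: Optional[Node]) -> None:
        if node is None:
            return
        d = math.dist(query, node.point)
        if len(heap) < k:
            heapq.heappush(heap, -d)
        elif d < -heap[0]:
            heapq.heapreplace(heap, -d)
        diff = query[node.axis] - node.point[node.axis]
        near, far = (node.left, node.right) if diff < 0 else (node.right, node.left)
        visit(near)
        # Search the far side only if the splitting plane is closer than the
        # current k-th best distance (or fewer than k candidates were found).
        if len(heap) < k or abs(diff) < -heap[0]:
            visit(far)

    visit(root)
    return sorted(-d for d in heap)

def find_outliers(points: List[Point], k: int, threshold: float) -> List[Point]:
    """Assumed rule: a point is an outlier if its mean k-NN distance > threshold."""
    root = build(points)
    flagged = []
    for p in points:
        dists = knn_distances(root, p, k + 1)[1:]  # drop the point itself
        if dists and sum(dists) / len(dists) > threshold:
            flagged.append(p)
    return flagged

def dense_regions(root: Optional[Node], min_points: int, min_density: float) -> List[Node]:
    """Collect nodes whose point count and density (count / box volume) pass both thresholds."""
    found: List[Node] = []
    stack = [root]
    while stack:
        node = stack.pop()
        if node is None or node.count < min_points:
            continue  # a subtree cannot hold more points than its root
        volume = math.prod(h - l for h, l in zip(node.hi, node.lo))
        if volume > 0 and node.count / volume >= min_density:
            found.append(node)
        stack.extend([node.left, node.right])
    return found

# Toy data: a 5 x 5 grid inside the unit square plus one distant point.
points = [(x / 4, y / 4) for x in range(5) for y in range(5)] + [(10.0, 10.0)]
outliers = find_outliers(points, k=3, threshold=1.0)
regions = dense_regions(build(points), min_points=5, min_density=10.0)
```

In this toy data-set only the distant point is flagged, and every reported region lies inside the unit square. Because each point's neighbor query is independent of the others, the loop in `find_outliers` is the natural candidate for parallel execution; how the thesis partitions that work on Spark is not described in the abstract.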
Raj Bhatnagar, Ph.D. (Committee Chair)
Yizong Cheng, Ph.D. (Committee Member)
Nan Niu, Ph.D. (Committee Member)
162 p.

Recommended Citations

  • Nadella, P. (2021). Discovery of Outlier Points and Dense Regions in Large Data-Sets Using Spark Environment [Master's thesis, University of Cincinnati]. OhioLINK Electronic Theses and Dissertations Center. http://rave.ohiolink.edu/etdc/view?acc_num=ucin1627665840826411

    APA Style (7th edition)

  • Nadella, Pravallika. Discovery of Outlier Points and Dense Regions in Large Data-Sets Using Spark Environment. 2021. University of Cincinnati, Master's thesis. OhioLINK Electronic Theses and Dissertations Center, http://rave.ohiolink.edu/etdc/view?acc_num=ucin1627665840826411.

    MLA Style (8th edition)

  • Nadella, Pravallika. "Discovery of Outlier Points and Dense Regions in Large Data-Sets Using Spark Environment." Master's thesis, University of Cincinnati, 2021. http://rave.ohiolink.edu/etdc/view?acc_num=ucin1627665840826411

    Chicago Manual of Style (17th edition)