Discovery of Outlier Points and Dense Regions in Large Data-Sets Using Spark Environment

Nadella, Pravallika

40438.pdf (3.1 MB)

Permalink:

http://rave.ohiolink.edu/etdc/view?acc_num=ucin1627665840826411

Year and Degree

2021, MS, University of Cincinnati, Engineering and Applied Science: Computer Science.

Abstract

Outlier detection is important because outliers are problematic for many statistical analyses: they can cause tests to miss significant findings or distort real results. Although many outlier detection algorithms exist, finding outliers in a large data-set can be very time-consuming, especially when the underlying probability distribution is unknown. In this thesis, we present an efficient outlier detection algorithm based on a parallel K-Nearest Neighbor search, which builds a KD-Tree data structure and searches it for the nearest neighbors. We reuse the same KD-Tree to find dense nodes and regions from the calculations performed while building the tree. A KD-Tree (k-Dimensional Tree) is a multi-dimensional binary tree, a storage structure for efficiently representing training data. We exploit the efficiency of K-nearest-neighbor search on a KD-Tree and optimize the parallel execution of the algorithm to find outliers and dense regions. Outliers are identified from the K nearest neighbors of each data record by applying a simple formula that classifies the record as an outlier or not. Dense regions are found by storing the centroid, the number of points, and the region for each node while the KD-Tree is built from the input data-set. We then apply thresholds on the number of points and the density of each node's region, and output only the regions with a significant number of points and a significant density.
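The approach described in the abstract can be illustrated with a small single-machine sketch. It is not the thesis implementation: it omits the Spark parallelization entirely, and several details are assumptions made only for illustration. The abstract does not state the "simple formula" for outliers, so the sketch scores each point by its mean distance to its k nearest neighbors and flags points whose score exceeds a hypothetical `factor` times the median score. Region density is assumed to be point count divided by bounding-box volume. All names (`Node`, `build`, `knn_distances`, `outliers`, `dense_regions`) are hypothetical.

```python
import heapq
import math
from dataclasses import dataclass
from typing import List, Optional, Tuple

Point = Tuple[float, ...]


@dataclass
class Node:
    point: Point      # splitting point stored at this node
    axis: int         # dimension used to split at this node
    count: int        # number of points in this subtree
    centroid: Point   # mean of the points in this subtree
    lo: Point         # lower corner of the subtree's bounding box
    hi: Point         # upper corner of the subtree's bounding box
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def build(points: List[Point], depth: int = 0) -> Optional[Node]:
    """Build a KD-tree, recording count, centroid and region per node."""
    if not points:
        return None
    dims, n = len(points[0]), len(points)
    axis = depth % dims
    pts = sorted(points, key=lambda p: p[axis])
    mid = n // 2
    centroid = tuple(sum(p[d] for p in pts) / n for d in range(dims))
    lo = tuple(min(p[d] for p in pts) for d in range(dims))
    hi = tuple(max(p[d] for p in pts) for d in range(dims))
    return Node(pts[mid], axis, n, centroid, lo, hi,
                build(pts[:mid], depth + 1), build(pts[mid + 1:], depth + 1))


def knn_distances(root: Optional[Node], query: Point, k: int) -> List[float]:
    """Return the sorted distances from query to its k nearest tree points."""
    heap: List[float] = []  # max-heap (negated distances) of the k best so far

    def visit(node: Optional[Node]) -> None:
        if node is None:
            return
        d = math.dist(query, node.point)
        if len(heap) < k:
            heapq.heappush(heap, -d)
        elif d < -heap[0]:
            heapq.heapreplace(heap, -d)
        diff = query[node.axis] - node.point[node.axis]
        near, far = (node.left, node.right) if diff < 0 else (node.right, node.left)
        visit(near)
        # Cross the splitting plane only if it is closer than the k-th best.
        if len(heap) < k or abs(diff) < -heap[0]:
            visit(far)

    visit(root)
    return sorted(-d for d in heap)


def outliers(points: List[Point], k: int = 3, factor: float = 3.0) -> List[Point]:
    """Flag points whose mean k-NN distance exceeds factor * median (assumed rule)."""
    root = build(points)
    # Ask for k + 1 neighbors: each point finds itself at distance 0.
    scores = [sum(knn_distances(root, p, k + 1)[1:]) / k for p in points]
    median = sorted(scores)[len(scores) // 2]
    return [p for p, s in zip(points, scores) if s > factor * median]


def dense_regions(node: Optional[Node], min_count: int, min_density: float,
                  out: Optional[list] = None) -> list:
    """Collect (centroid, count, lo, hi) for nodes passing both thresholds."""
    out = [] if out is None else out
    if node is None:
        return out
    volume = math.prod(h - l for l, h in zip(node.lo, node.hi))
    if node.count >= min_count and volume > 0 and node.count / volume >= min_density:
        out.append((node.centroid, node.count, node.lo, node.hi))
    dense_regions(node.left, min_count, min_density, out)
    dense_regions(node.right, min_count, min_density, out)
    return out
```

With a tight grid of points plus one distant point, `outliers(points)` would be expected to return only the distant point, and `dense_regions(build(points), min_count, min_density)` only subtrees whose bounding boxes exclude it. Nested subtrees can both pass the thresholds, so a real pipeline would likely need a rule for merging or pruning overlapping regions.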

Committee

Raj Bhatnagar, Ph.D. (Committee Chair)
Yizong Cheng, Ph.D. (Committee Member)
Nan Niu, Ph.D. (Committee Member)

Pages

162 p.

Subject Headings

Computer Science

Keywords

Outliers; Dense regions; KD-Tree; K-Nearest Neighbors; Spark; MapReduce

Recommended Citations

APA Style (7th edition)
Nadella, P. (2021). Discovery of Outlier Points and Dense Regions in Large Data-Sets Using Spark Environment [Master's thesis, University of Cincinnati]. OhioLINK Electronic Theses and Dissertations Center. http://rave.ohiolink.edu/etdc/view?acc_num=ucin1627665840826411

MLA Style (8th edition)
Nadella, Pravallika. Discovery of Outlier Points and Dense Regions in Large Data-Sets Using Spark Environment. 2021. University of Cincinnati, Master's thesis. OhioLINK Electronic Theses and Dissertations Center, http://rave.ohiolink.edu/etdc/view?acc_num=ucin1627665840826411.

Chicago Manual of Style (17th edition)
Nadella, Pravallika. "Discovery of Outlier Points and Dense Regions in Large Data-Sets Using Spark Environment." Master's thesis, University of Cincinnati, 2021. http://rave.ohiolink.edu/etdc/view?acc_num=ucin1627665840826411

Document number:

ucin1627665840826411

Download Count:

144
