Files
ucin989852516.pdf (499.78 KB)
ENTITY IDENTIFICATION USING DATA MINING TECHNIQUES
JANAKIRAMAN, KRISHNAMOORTHY
Permalink:
http://rave.ohiolink.edu/etdc/view?acc_num=ucin989852516
Year and Degree
2001, MS, University of Cincinnati, Engineering : Computer Engineering.
Abstract
Organizations increasingly experience both the need for and the benefits of integrated access to multiple data sources. Database integration has two aspects: schema integration and data integration. Schema integration arrives at a common schema representing the elements of the source schemas. Data integration involves detecting and merging multiple instances of the same real-world entities from different databases. Entity identification is necessary when there is no common means of identification, such as primary keys, and it is usually performed manually. This thesis focuses on solving the entity identification problem in an automated way using data mining techniques. We use automated learning techniques to identify characteristics or patterns found in entities and apply this knowledge to detect multiple instances of the same entity. The data mining techniques that we use are decision trees and k-nearest neighbors (k-NN). Our approach preprocesses the data before employing the data mining techniques: the preprocessing forms clusters on the data, and entity identification is performed on each cluster. To study the performance of the proposed algorithms, we use a small database of 2500 records and vary parameters such as training set size and number of unique entities in our experiments. Our experiments study the impact of our preprocessing algorithm on both a decision tree implementation and a k-NN implementation as the classification techniques. We examine whether accuracy and processing speed are improved, unaffected, or adversely affected. For our testbed, decision trees on the clustered data sets show significant savings in processing time compared to decision trees on the unclustered data sets, for both small and large training set sizes.
On the other hand, the accuracy when using clustering is always less than that obtained without clustering, but it approaches the accuracy of the non-clustered approach as the number of unique entities increases. Clustering errors do not significantly affect the accuracy of any of the classification techniques for any data set (clustered or unclustered). On clustered data sets, the processing time is always less with the decision tree technique than with k-NN; however, the difference in processing time between the two techniques decreases as the training set size decreases. The decision tree technique gives better accuracy than the k-NN technique in all cases except when applied to data sets with a small number of unique entities and a small training set size.
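The cluster-then-classify idea described in the abstract can be illustrated with a minimal Python sketch. The abstract does not specify the thesis's clustering algorithm, similarity features, or classifier settings, so everything here is an assumption made for illustration: clustering is approximated by grouping records on the first letter of the name, pair features are string similarities from the standard library's `difflib.SequenceMatcher`, the classifier is a small hand-written k-NN, and the records and training pairs are made-up toy data.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a, b):
    """String similarity in [0, 1] (assumed feature, not from the thesis)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def pair_features(r1, r2):
    """One similarity score per field for a pair of records."""
    return tuple(similarity(x, y) for x, y in zip(r1, r2))


def cluster_records(records):
    """Stand-in preprocessing step: group record indices by the first
    letter of the first field. The thesis's own clustering may differ."""
    clusters = defaultdict(list)
    for idx, rec in enumerate(records):
        clusters[rec[0][:1].lower()].append(idx)
    return clusters


def knn_is_match(train, features, k=3):
    """Majority vote among the k labeled pairs closest to `features`."""
    def sq_dist(example):
        return sum((a - b) ** 2 for a, b in zip(example[0], features))
    nearest = sorted(train, key=sq_dist)[:k]
    votes = sum(1 for _, label in nearest if label)
    return votes * 2 > len(nearest)


def find_matches(records, train, k=3):
    """Compare record pairs only within each cluster, not across clusters."""
    matches = []
    for members in cluster_records(records).values():
        for i, j in combinations(members, 2):
            if knn_is_match(train, pair_features(records[i], records[j]), k):
                matches.append((i, j))
    return sorted(matches)


# Hypothetical toy data: (name, city) records and labeled similarity vectors.
RECORDS = [
    ("John Smith", "Cincinnati"),
    ("Jon Smith", "Cincinnati"),
    ("Jane Doe", "Columbus"),
    ("Mary Jones", "Dayton"),
    ("Jayne Doe", "Columbus"),
]
TRAIN = [
    ((1.0, 1.0), True), ((0.9, 1.0), True), ((0.85, 0.9), True),
    ((0.2, 0.3), False), ((0.3, 1.0), False), ((0.1, 0.0), False),
]

print(find_matches(RECORDS, TRAIN))
```

On this toy data the sketch pairs records 0 and 1 and records 2 and 4. Restricting comparisons to pairs inside a cluster is what reduces the work, and it is also a plausible source of lost accuracy: two instances of the same entity that land in different clusters are never compared.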
Committee
Dr. Karen C. Davis (Advisor)
Pages
94 p.
Subject Headings
Computer Science
Keywords
entity identification; data mining
Recommended Citations
APA Style (7th edition)
JANAKIRAMAN, K. (2001). ENTITY IDENTIFICATION USING DATA MINING TECHNIQUES [Master's thesis, University of Cincinnati]. OhioLINK Electronic Theses and Dissertations Center. http://rave.ohiolink.edu/etdc/view?acc_num=ucin989852516
MLA Style (8th edition)
JANAKIRAMAN, KRISHNAMOORTHY. ENTITY IDENTIFICATION USING DATA MINING TECHNIQUES. 2001. University of Cincinnati, Master's thesis. OhioLINK Electronic Theses and Dissertations Center, http://rave.ohiolink.edu/etdc/view?acc_num=ucin989852516.
Chicago Manual of Style (17th edition)
JANAKIRAMAN, KRISHNAMOORTHY. "ENTITY IDENTIFICATION USING DATA MINING TECHNIQUES." Master's thesis, University of Cincinnati, 2001. http://rave.ohiolink.edu/etdc/view?acc_num=ucin989852516
Document number:
ucin989852516
Download Count:
1,010
© 2001, all rights reserved.
This open access ETD is published by University of Cincinnati and OhioLINK.