Skip to Main Content
 

Global Search Box

 
 
 

ETD Abstract Container

Abstract Header

ACE: Agile,Contingent and Efficient Similarity Joins Using MapReduce

Lakshminarayanan, Mahalakshmi

Abstract Details

2013, Master of Science in Engineering, University of Toledo, Engineering (Computer Science).
Similarity Join is an important operation for data mining, with a diverse range of real world applications. Three efficient MapReduce Algorithms for performing Similarity Joins between multisets are proposed in this thesis. Filtering techniques for similarity joins minimize the number of pairs of entities joined and hence, they are vital for improving the efficiency of the algorithm. Multisets represent real world data better by considering the frequency of its elements. Prior serial algorithms incorporate filtering techniques only for sets, but not multisets, while prior MapReduce algorithms do not incorporate any filtering technique or inefficiently incorporate prefix filtering with poor scalability. This work extends the filtering techniques, namely the prefix, size, positional and suffix filters to multisets, and also achieves the challenging task of efficiently incorporating them in the shared-nothing MapReduce model. Adeptly incorporating the filtering techniques in a strategic sequence minimizes the pairs generated and joined, resulting in I/O, network and computational efficiency. In the SSS algorithm, prefix, size and positional filtering are incorporated in the MapReduce Framework. The pairs that thrive filtering are joined suavely in the third Similarity Join Stage, utilizing a Multiset File generated in the second stage. We also developed a rational and creative technique to enhance the scalability of the algorithm as a contingency need. In the ESSJ algorithm, all the filtering techniques, namely, prefix, size, positional as well as suffix filtering are incorporated in the MapReduce Framework. It is designed with a seamless and scalable Similarity Join Stage, where the similarity joins are performed without dependency to a file. In the EASE algorithm, all the filtering techniques, namely, prefix, size, positional and suffix are incorporated in the MapReduce Framework. However it is tailored as a hybrid algorithm to exploit the strategies of both SSS and ESSJ for performing the joins. Some multiset pairs are joined utilizing the Multiset File similar to SSS, and some multisets are joined without utilizing it similar to ESSJ. The algorithm harvests the benefits of both the strategies. SSS and ESSJ algorithms were developed using Hadoop and tested using real-world Twitter data. For both SSS and ESSJ, experimental results demonstrate phenomenal performance gains of over 70% in comparison to the competing state-of-the-art algorithm.
Vijay Devabhaktuni (Committee Chair)
William Acosta (Committee Member)
Robert Green (Committee Member)
Mansoor Alam (Committee Member)
103 p.

Recommended Citations

Citations

  • Lakshminarayanan, M. (2013). ACE: Agile,Contingent and Efficient Similarity Joins Using MapReduce [Master's thesis, University of Toledo]. OhioLINK Electronic Theses and Dissertations Center. http://rave.ohiolink.edu/etdc/view?acc_num=toledo1383931387

    APA Style (7th edition)

  • Lakshminarayanan, Mahalakshmi. ACE: Agile,Contingent and Efficient Similarity Joins Using MapReduce. 2013. University of Toledo, Master's thesis. OhioLINK Electronic Theses and Dissertations Center, http://rave.ohiolink.edu/etdc/view?acc_num=toledo1383931387.

    MLA Style (8th edition)

  • Lakshminarayanan, Mahalakshmi. "ACE: Agile,Contingent and Efficient Similarity Joins Using MapReduce." Master's thesis, University of Toledo, 2013. http://rave.ohiolink.edu/etdc/view?acc_num=toledo1383931387

    Chicago Manual of Style (17th edition)