Information Retrieval by Identification of Signature Terms in Clusters

Muppalla, Sesha Sai Krishna Koundinya

2022, MS, University of Cincinnati, Engineering and Applied Science: Computer Science.
Document clustering and information retrieval are essential topics in text mining with many real-world applications. The primary goal of document clustering is to group documents by similarity, so that the documents within each group are highly similar to one another. Finding clusters of documents can help categorize uncategorized data, ease retrieval, identify topics, and serve as a pre-processing step. Existing clustering techniques typically cluster documents based on the entire collection of terms, while some co-clustering algorithms cluster documents based on a subset of terms. A crucial application after clustering is retrieving a cluster given a query or a document. Many existing retrieval methods measure the similarity between the query and a cluster as a whole; others consider the ranked similarity between the query document and the individual documents in the cluster. In both cases, the cluster is not identified by a specific set of terms. In this research, we have built a clustering procedure that combines the advantages of general clustering and co-clustering algorithms. Our approach differs from existing clustering methods in that cluster formation depends on a purity metric based on the spread of the data. Furthermore, we identify the critical terms in each cluster, called its signature, using three different signature extraction techniques that assess the importance of the association between a given term and a given cluster. Finally, we use the signature to ease the retrieval of a cluster given a new document or a set of terms as the query. We demonstrate the methodology on document collections from different domains and validate our results using standard evaluation metrics. We believe that our methods are beneficial in identifying the essential terms in a cluster.
Raj Bhatnagar, Ph.D. (Committee Member)
Nan Niu, Ph.D. (Committee Member)
Ali Minai, Ph.D. (Committee Member)
Gowtham Atluri, Ph.D. (Committee Member)
281 p.
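
The abstract describes identifying signature terms for each cluster and then using them to retrieve a cluster for a query. The Python sketch below illustrates that general idea only. It is not the thesis's method: the abstract does not specify the purity metric or the three signature extraction techniques, so the scoring rule here (the share of in-cluster documents containing a term minus the share of out-of-cluster documents containing it) is an assumption chosen for simplicity. The function names `signature_terms` and `retrieve_cluster`, the toy documents, and the cluster labels are all hypothetical.

```python
from collections import Counter


def signature_terms(docs, labels, top_k=2):
    """Return {cluster: [(term, score), ...]} with the top_k terms per cluster.

    Assumed scoring rule (not taken from the thesis): fraction of documents
    inside the cluster that contain the term, minus the fraction of documents
    outside the cluster that contain it.
    """
    signatures = {}
    for cluster in sorted(set(labels)):
        inside = [set(d) for d, l in zip(docs, labels) if l == cluster]
        outside = [set(d) for d, l in zip(docs, labels) if l != cluster]
        in_df = Counter(t for d in inside for t in d)    # document frequency inside
        out_df = Counter(t for d in outside for t in d)  # document frequency outside
        scores = {}
        for term, n in in_df.items():
            in_frac = n / len(inside)
            out_frac = out_df[term] / len(outside) if outside else 0.0
            scores[term] = in_frac - out_frac
        # Highest score first; ties broken alphabetically for determinism.
        ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
        signatures[cluster] = ranked[:top_k]
    return signatures


def retrieve_cluster(query_terms, signatures):
    """Return the cluster whose signature terms best overlap the query terms."""
    query = set(query_terms)

    def overlap(cluster):
        return sum(score for term, score in signatures[cluster] if term in query)

    # On a tie, max() keeps the first candidate, i.e. the lowest cluster label.
    return max(sorted(signatures), key=overlap)


# Toy, pre-tokenized documents with hypothetical cluster assignments.
docs = [
    ["stock", "market", "trade"],
    ["market", "price", "stock"],
    ["gene", "protein", "cell"],
    ["cell", "gene", "market"],
]
labels = [0, 0, 1, 1]

signatures = signature_terms(docs, labels, top_k=2)
best = retrieve_cluster(["gene", "expression"], signatures)
```

In this toy example the term "market" appears in both clusters, so it scores lower than "stock" in cluster 0 and is excluded from the signature of cluster 1. The query is matched against the short signature lists rather than against every document in every cluster, which is the retrieval benefit the abstract points to.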

Recommended Citations

  • Muppalla, S. S. K. K. (2022). Information Retrieval by Identification of Signature Terms in Clusters [Master's thesis, University of Cincinnati]. OhioLINK Electronic Theses and Dissertations Center. http://rave.ohiolink.edu/etdc/view?acc_num=ucin1649858985897256

    APA Style (7th edition)

  • Muppalla, Sesha Sai Krishna Koundinya. Information Retrieval by Identification of Signature Terms in Clusters. 2022. University of Cincinnati, Master's thesis. OhioLINK Electronic Theses and Dissertations Center, http://rave.ohiolink.edu/etdc/view?acc_num=ucin1649858985897256.

    MLA Style (8th edition)

  • Muppalla, Sesha Sai Krishna Koundinya. "Information Retrieval by Identification of Signature Terms in Clusters." Master's thesis, University of Cincinnati, 2022. http://rave.ohiolink.edu/etdc/view?acc_num=ucin1649858985897256

    Chicago Manual of Style (17th edition)