Files
InformationAndRepresentationTradeoffsInDocumentClassification.pdf (2.49 MB)
Information and Representation Tradeoffs in Document Classification
Author Info
Jin, Timothy
Permalink: http://rave.ohiolink.edu/etdc/view?acc_num=case1649340330508341
Year and Degree
2022, Master of Sciences, Case Western Reserve University, EECS - Computer and Information Sciences.
Abstract
Significant prior work has proposed using topics as well as words in document classification, and many complex models have been developed to use a mix of different representations of words and topics. But how much do these different representations actually contribute to accuracy in document classification? We categorize existing document classification approaches along two axes: a syntactic/semantic/both axis that considers what kind of information the model uses, and a word/topic/both axis that considers how that information is used. We conduct evaluation experiments using a uniform methodology to determine which classes of models are the most effective for the task of document classification. Surprisingly, our results show that, on average across many datasets, there is little difference in overall classification performance between different classes of models, and few methods outperform or produce sparser models than a basic word-based document classifier.
Committee
Soumya Ray (Committee Chair)
Mehmet Koyuturk (Committee Member)
Michael Lewicki (Committee Member)
Subject Headings
Computer Science
Keywords
natural language processing; text classification; document classification
Recommended Citations
Jin, T. (2022). Information and Representation Tradeoffs in Document Classification [Master's thesis, Case Western Reserve University]. OhioLINK Electronic Theses and Dissertations Center. http://rave.ohiolink.edu/etdc/view?acc_num=case1649340330508341
APA Style (7th edition)
Jin, Timothy. Information and Representation Tradeoffs in Document Classification. 2022. Case Western Reserve University, Master's thesis. OhioLINK Electronic Theses and Dissertations Center, http://rave.ohiolink.edu/etdc/view?acc_num=case1649340330508341.
MLA Style (8th edition)
Jin, Timothy. "Information and Representation Tradeoffs in Document Classification." Master's thesis, Case Western Reserve University, 2022. http://rave.ohiolink.edu/etdc/view?acc_num=case1649340330508341
Chicago Manual of Style (17th edition)
Document number: case1649340330508341
Download Count: 57
Copyright Info
© 2022, all rights reserved.
This open access ETD is published by Case Western Reserve University School of Graduate Studies and OhioLINK.