Topic Modeling and Spam Detection for Short Text Segments in Web Forums

Sun, Yingcheng

PhD_Disertation.pdf (2.48 MB)

Author Info

Sun, Yingcheng

ORCID® Identifier

http://orcid.org/0000-0002-8693-5768

Permalink:

http://rave.ohiolink.edu/etdc/view?acc_num=case1575281495398615

Year and Degree

2020, Doctor of Philosophy, Case Western Reserve University, EECS - Computer and Information Sciences.

Abstract

In the era of the Social Web, user-generated content published on online web forums has grown explosively. Short text segments have become a popular writing format because they are convenient to post and to respond to; examples include comments, tweets, reviews, and questions and answers. Given the large volume of short texts available online, quickly comprehending and filtering them has become a challenging problem. In this dissertation, we explore two questions related to short texts: what are they talking about, and can you trust the source?

To answer the first question, an effective and efficient approach is to discover latent topics from large text datasets. Because text in online discussions is sparse, traditional topic models have had limited success when applied directly to topic mining tasks: short texts do not provide sufficient term co-occurrence information for the reliable discovery of topics. To overcome that limitation, we use (1) the tree structure of the discussion thread, proposing a “popularity” metric that quantifies the number of replies to a given comment and extends the frequency of word occurrences, and (2) the concept of “transitivity” to characterize topic dependency among nodes in a nested discussion thread. We then build a Conversational Structure Aware Topic Model (CSATM) based on popularity and transitivity to infer topics and their assignments to comments.

For the second question, users of business review forums are generally concerned with whether reviews of products or services are genuine, because fake reviews (also called opinion spam) have become a widespread problem in online discussion forums. Existing approaches have had success in detecting opinion spam by utilizing various features. However, spammers are sophisticated and adapt to game the system with fast-evolving content and network patterns, which is challenging for anti-spam systems that rely only on old features. In this dissertation, we propose three novel features based on the photos provided in reviews, the user social network, and the evaluation of reviews, and we discuss a new approach called SkyNet that uses clues extracted from associated heterogeneous data, including metadata (e.g., text and photos within reviews) as well as relational data (e.g., social and review networks), to detect suspicious users and reviews within a unified computational framework.

The proposed CSATM topic model is applied to forum datasets exported from Reddit.com, and the computational experiments demonstrate improved performance for topic extraction based on six different measurements of coherence, as well as impressive accuracy for topic assignments. To evaluate the proposed SkyNet framework, we run computational experiments on business review data from Yelp.com, assuming that “recommended” reviews are genuine and “not recommended” reviews are fake; the results show that SkyNet outperforms several baselines and state-of-the-art opinion spam detection methods.
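
The “popularity” idea mentioned in the abstract, where replies to a comment extend the frequency of its word occurrences, can be illustrated with a minimal sketch. This is not the dissertation's CSATM implementation: the comment record layout (`id`, `parent`, `text`), the function names, the use of direct replies only, and the weighting rule `1 + number of replies` are all assumptions chosen for illustration.

```python
from collections import Counter

def reply_counts(comments):
    """Count the direct replies each comment receives (hypothetical popularity)."""
    counts = Counter()
    for c in comments:
        if c["parent"] is not None:
            counts[c["parent"]] += 1
    return counts

def weighted_term_counts(comments):
    """Scale each comment's word counts by 1 + its number of direct replies.

    The weighting rule is an assumption made for this sketch, not the
    formula used in the dissertation.
    """
    replies = reply_counts(comments)
    weighted = {}
    for c in comments:
        weight = 1 + replies[c["id"]]  # Counter returns 0 for comments with no replies
        terms = Counter(c["text"].lower().split())
        weighted[c["id"]] = {term: n * weight for term, n in terms.items()}
    return weighted

# A small nested thread: "b" and "c" reply to "a", and "d" replies to "b".
thread = [
    {"id": "a", "parent": None, "text": "battery life is poor"},
    {"id": "b", "parent": "a", "text": "battery drains fast"},
    {"id": "c", "parent": "a", "text": "agree battery is poor"},
    {"id": "d", "parent": "b", "text": "mine too"},
]
counts = weighted_term_counts(thread)
```

In this toy thread, the words of the root comment are counted three times because it drew two replies, which gives a sparse short text more weight in the term statistics that a topic model would consume.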

Committee

Kenneth Loparo, Dr. (Advisor)
Xusheng Xiao, Dr. (Committee Chair)
Wang An, Dr. (Committee Member)
Erman Ayday, Dr. (Committee Member)

Pages

75 p.

Subject Headings

Computer Science

Keywords

short text; online discussions; topic model; conversational structure; opinion spam; heterogeneous information network; social network

APA Style (7th edition)
Sun, Y. (2020). Topic Modeling and Spam Detection for Short Text Segments in Web Forums [Doctoral dissertation, Case Western Reserve University]. OhioLINK Electronic Theses and Dissertations Center. http://rave.ohiolink.edu/etdc/view?acc_num=case1575281495398615
MLA Style (8th edition)
Sun, Yingcheng. Topic Modeling and Spam Detection for Short Text Segments in Web Forums. 2020. Case Western Reserve University, Doctoral dissertation. OhioLINK Electronic Theses and Dissertations Center, http://rave.ohiolink.edu/etdc/view?acc_num=case1575281495398615.
Chicago Manual of Style (17th edition)
Sun, Yingcheng. "Topic Modeling and Spam Detection for Short Text Segments in Web Forums." Doctoral dissertation, Case Western Reserve University, 2020. http://rave.ohiolink.edu/etdc/view?acc_num=case1575281495398615

Document number:

case1575281495398615

Download Count:

1,872
