Files
enhua-dissertation.pdf (5.73 MB)
Spam Analysis and Detection for User Generated Content in Online Social Networks
Author Info
Tan, Enhua
Permalink:
http://rave.ohiolink.edu/etdc/view?acc_num=osu1365520334
Abstract Details
Year and Degree
2013, Doctor of Philosophy, Ohio State University, Computer Science and Engineering.
Abstract
Recent years have witnessed the success of a number of online social networks (OSNs) and the explosive growth of social media. These social networking and social media sites have attracted a significant number of participants who contribute various types of content on the Internet, generally referred to as user generated content (UGC). A well-designed UGC network can utilize the wisdom of crowds to collect, organize, and vote on user-contributed content, generating high-quality knowledge at a relatively low cost. However, the open environment of a UGC system also makes it easy for spammers and malicious users to pollute and attack. How users participate in UGC networks, especially how they contribute content and share it with their friends and other users, is fundamental to spam detection and high-quality knowledge discovery. In this dissertation, we investigate two important research issues: (1) discovering user content generation patterns in OSNs, focusing on publicly available content (knowledge sharing), and (2) detecting spam in user generated content based on the patterns we discover. With access to three large OSN user activity logs, from Yahoo! Blogs, Yahoo! Answers, and Yahoo! Del.icio.us, covering up to 4.5 years, we are able to analyze the content generation patterns of social network users in detail. Our analysis consistently shows that users' posting behavior in these networks exhibits strong daily and weekly patterns, but that user active time in these OSNs does not follow the commonly assumed exponential distribution. We also show that user posting behavior in these OSNs follows stretched exponential distributions rather than the widely accepted power law distributions. Our discovery lays a foundation for user behavior analysis in social networks and serves as a ground truth for anomaly detection and anti-spam.
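The contrast between the two candidate distributions can be made concrete with a small sketch. It assumes one common rank-ordered form of the stretched exponential, in which y_i^c is linear in log(i) for the i-th largest contribution y_i, whereas a power law is linear when log(y_i) is plotted against log(i). The exact model form, the synthetic Weibull sample, and the names `r_squared` and `compare_rank_fits` are illustrative assumptions and are not taken from the dissertation.

```python
import numpy as np


def r_squared(x, y):
    """Coefficient of determination of the least-squares line y ~ x."""
    slope, intercept = np.polyfit(x, y, 1)
    residuals = y - (slope * x + intercept)
    return 1.0 - residuals.var() / y.var()


def compare_rank_fits(contributions, c_grid=np.linspace(0.05, 1.0, 20)):
    """Compare power-law and stretched-exponential fits on rank-ordered data.

    Returns (power_law_r2, stretched_exp_r2, best_c), where best_c is the
    stretch exponent from c_grid that gives the straightest y**c vs log(rank).
    """
    # Rank the contributions from largest (rank 1) to smallest.
    y = np.sort(np.asarray(contributions, dtype=float))[::-1]
    log_rank = np.log(np.arange(1, len(y) + 1))

    # Power law: log(y) should be linear in log(rank).
    power_law_r2 = r_squared(log_rank, np.log(y))

    # Stretched exponential (assumed form): y**c should be linear in log(rank).
    se_r2, best_c = max((r_squared(log_rank, y ** c), c) for c in c_grid)
    return power_law_r2, se_r2, best_c


# Synthetic data only: a Weibull sample with shape < 1 has a
# stretched-exponential tail, so it stands in for real activity logs.
rng = np.random.default_rng(0)
sample = 50.0 * rng.weibull(0.4, size=5000)

power_law_r2, se_r2, best_c = compare_rank_fits(sample)
print(f"power law R^2: {power_law_r2:.3f}")
print(f"stretched exponential R^2: {se_r2:.3f} at c = {best_c:.2f}")
```

On data of this kind the stretched exponential fit should be clearly better than the power law fit; on real logs, a formal goodness-of-fit test would be preferable to comparing R^2 values.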
Applying the user posting behavior distribution pattern, we further conducted a comprehensive analysis of spamming activities on a large commercial social blog UGC site over 325 days, covering more than 6 million posts and nearly 400 thousand users. On this site, user contributions follow a power law distribution instead of the stretched exponential distribution we discovered, and we find that this deviation actually indicates serious UGC spam attack activities. Our analysis shows that UGC spammers exhibit unique non-textual patterns in their posting activities, advertised spam link metrics, and spam hosting behaviors. Based on these non-textual features, we show with commonly used classification methods that a high detection rate can be achieved offline. These results further motivate us to develop a runtime scheme, BARS, to detect spam posts based on these spamming patterns. The experimental results demonstrate the effectiveness and robustness of BARS. To detect spam in large social network sites in a timely manner, it is desirable to have self-tuned, unsupervised schemes that avoid the training cost of supervised classification schemes. Having identified limitations of existing unsupervised detection schemes that stem from assumptions about spammer behavior that no longer hold, we design an unsupervised spam detection scheme called UNIK. Instead of picking out spammers directly, UNIK leverages both the connection-based social graph and the content-based user-link graph to remove non-spammers from the network first, and then clusters spammers by the landing pages they are trying to advertise. Based on the highly accurate detection results of UNIK, we further analyze a number of spam campaigns. The results show that different spammer clusters demonstrate distinct characteristics, implying that UNIK can automatically extract spam signatures.
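The "remove non-spammers first, then cluster the rest by advertised links" idea can be illustrated with a toy sketch. This is not the UNIK algorithm: the abstract does not say how non-spammers are identified from the social graph, so the sketch simply takes a set of trusted users as a given input. The function name `cluster_by_shared_links`, the input format, and the example URLs are all hypothetical.

```python
from collections import defaultdict


def cluster_by_shared_links(user_links, trusted_users):
    """Group non-trusted users that posted at least one URL in common.

    user_links: dict mapping user id -> set of URLs the user posted.
    trusted_users: users assumed legitimate (hypothetical input; how such
        users are found is outside the scope of this sketch).
    """
    # URLs posted by trusted users are treated as benign and ignored.
    benign_urls = set()
    for user in trusted_users:
        benign_urls |= user_links.get(user, set())

    # Reduced user-link graph: URL -> remaining users who posted it.
    posters = defaultdict(set)
    for user, urls in user_links.items():
        if user in trusted_users:
            continue
        for url in urls - benign_urls:
            posters[url].add(user)

    # Union-find: users sharing a URL end up in the same cluster.
    parent = {}

    def find(u):
        parent.setdefault(u, u)
        while parent[u] != u:
            parent[u] = parent[parent[u]]  # path halving
            u = parent[u]
        return u

    for users in posters.values():
        ordered = sorted(users)
        for other in ordered[1:]:
            parent[find(other)] = find(ordered[0])

    clusters = defaultdict(set)
    for users in posters.values():
        for u in users:
            clusters[find(u)].add(u)
    return sorted((sorted(c) for c in clusters.values()), key=lambda c: c[0])


# Made-up example data.
user_links = {
    "alice": {"news.example/a"},
    "bob": {"news.example/a", "blog.example/b"},
    "s1": {"pills.example/x"},
    "s2": {"pills.example/x", "casino.example/y"},
    "s3": {"casino.example/y"},
    "s4": {"watches.example/z"},
}
result = cluster_by_shared_links(user_links, trusted_users={"alice", "bob"})
print(result)  # [['s1', 's2', 's3'], ['s4']]
```

Each resulting cluster groups accounts that promote overlapping landing pages, which is the kind of grouping from which per-campaign characteristics could then be examined.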
Committee
Xiaodong Zhang (Advisor)
Feng Qin (Committee Member)
Ten H. (Steve) Lai (Committee Member)
Pages
131 p.
Subject Headings
Computer Engineering; Computer Science
Keywords
user generated content; online social networks; user behavior; stretched exponential distribution; spam filtering; spam detection; spam classification; decision tree; social graph; user-link graph; Sybil attack; community detection; BARS; UNIK
Recommended Citations
APA Style (7th edition)
Tan, E. (2013). Spam Analysis and Detection for User Generated Content in Online Social Networks [Doctoral dissertation, Ohio State University]. OhioLINK Electronic Theses and Dissertations Center. http://rave.ohiolink.edu/etdc/view?acc_num=osu1365520334
MLA Style (8th edition)
Tan, Enhua. Spam Analysis and Detection for User Generated Content in Online Social Networks. 2013. Ohio State University, Doctoral dissertation. OhioLINK Electronic Theses and Dissertations Center, http://rave.ohiolink.edu/etdc/view?acc_num=osu1365520334.
Chicago Manual of Style (17th edition)
Tan, Enhua. "Spam Analysis and Detection for User Generated Content in Online Social Networks." Doctoral dissertation, Ohio State University, 2013. http://rave.ohiolink.edu/etdc/view?acc_num=osu1365520334
Document number:
osu1365520334
Download Count:
1,309
Copyright Info
© 2013, all rights reserved.
This open access ETD is published by The Ohio State University and OhioLINK.