Files
30604.pdf (2.73 MB)
Automatic Detection of Section Title and Prose Text in HTML Documents Using Unsupervised and Supervised Learning
Author Info
Mysore Gopinath, Abhijith Athreya
ORCID® Identifier
http://orcid.org/0000-0003-2580-8965
Permalink:
http://rave.ohiolink.edu/etdc/view?acc_num=ucin1535371714338677
Abstract Details
Year and Degree
2018, MS, University of Cincinnati, Engineering and Applied Science: Computer Science.
Abstract
Web documents are among the most important sources of publicly available information, and researchers in need of textual data often scour the web for it. Most web documents organize their textual content into sections based on the topicality of the text. Each section contains two distinguishable parts: (1) the title, which summarizes the text that follows it, and (2) the text that follows the title, also known as prose text. Apart from its aesthetic appeal, this organization could be helpful in natural language processing (NLP) tasks such as question answering, information extraction, text summarization and text classification. The section title acts as an index or a quick summary of the prose content that follows it. Just like searching for information using a table of contents in a book, these indexes can be used to focus on content relevant to a search. Each section is lexically cohesive and, at the same time, lexically distinct from other sections. Current methods of web text extraction are agnostic to these textual demarcations, as they cannot identify titles and prose text. One reason is the inherent difficulty in determining sections: two documents with the same appearance can be structured in many different ways, so a rule-based method may not work well across different websites. The complex nesting of HTML tags and the copious presence of unrelated data further complicate processing. In this thesis, we address the problem of automatic identification of section titles and prose text. We developed two methods: an unsupervised domain-independent approach and a supervised domain-dependent approach. In the domain-independent approach, we use lexical and morphological features of text to perform k-means clustering to identify title labels. Further techniques are then used to determine the corresponding prose text for the titles.
In the domain-dependent approach, we train a neural network classifier on the dense word embeddings of title and prose text collected from a domain. The system produces a simplified output of the original HTML page which can be machine-read using simple rules. Along with these novel methods, we have also created a corpus of web documents containing privacy policies, terms of service agreements and miscellaneous web documents. This corpus includes both the original version and the simplified output of all HTML documents. To test our assumptions and methods, we used online privacy policies, terms of service agreements and miscellaneous web documents. We evaluated the models on two fronts: (1) the traditional precision, recall and F1 scores for segment identification, and (2) a metric we name coverage, which measures the amount of the original legitimate text reproduced in the final output. The domain-independent approach achieved an overall precision of 0.82, recall of 0.98 and coverage of 0.97. The domain-dependent model achieved an accuracy of 0.99, recall of 0.75 and coverage of 0.93. These results demonstrate that our system is largely accurate and robust.
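The two evaluation fronts named in the abstract can be illustrated with a minimal Python sketch. This is not the thesis's implementation: the abstract does not give the exact definition of coverage, so the token-overlap formulation below is an assumption, and the function names `precision_recall_f1` and `coverage` are hypothetical.

```python
from collections import Counter


def precision_recall_f1(predicted, gold):
    """Score a set of predicted segment labels against a gold set."""
    predicted, gold = set(predicted), set(gold)
    true_pos = len(predicted & gold)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def coverage(original_text, output_text):
    """Fraction of original tokens reproduced in the output.

    ASSUMPTION: coverage is approximated here as case-insensitive,
    whitespace-token multiset overlap; the thesis may define it differently.
    """
    original = Counter(original_text.lower().split())
    output = Counter(output_text.lower().split())
    total = sum(original.values())
    if total == 0:
        return 0.0
    # Counter intersection keeps the minimum count of each shared token.
    kept = sum((original & output).values())
    return kept / total
```

Under these assumptions, `precision_recall_f1` would be called with the sets of titles (or title positions) a model predicted and the annotated ones, and `coverage` with the legitimate text of the original page and the text of the simplified output.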
Committee
Shomir Wilson, Ph.D. (Committee Chair)
Raj Bhatnagar, Ph.D. (Committee Member)
Nan Niu, Ph.D. (Committee Member)
Carla Purdy, Ph.D. (Committee Member)
Pages
161 p.
Subject Headings
Computer Science
Keywords
HTML Structure Analysis; Natural Language Processing; Topicality Detection in HTML; Machine Learning; Privacy Policies
Recommended Citations
Citations
Mysore Gopinath, A. A. (2018). Automatic Detection of Section Title and Prose Text in HTML Documents Using Unsupervised and Supervised Learning [Master's thesis, University of Cincinnati]. OhioLINK Electronic Theses and Dissertations Center. http://rave.ohiolink.edu/etdc/view?acc_num=ucin1535371714338677
APA Style (7th edition)
Mysore Gopinath, Abhijith Athreya. Automatic Detection of Section Title and Prose Text in HTML Documents Using Unsupervised and Supervised Learning. 2018. University of Cincinnati, Master's thesis. OhioLINK Electronic Theses and Dissertations Center, http://rave.ohiolink.edu/etdc/view?acc_num=ucin1535371714338677.
MLA Style (8th edition)
Mysore Gopinath, Abhijith Athreya. "Automatic Detection of Section Title and Prose Text in HTML Documents Using Unsupervised and Supervised Learning." Master's thesis, University of Cincinnati, 2018. http://rave.ohiolink.edu/etdc/view?acc_num=ucin1535371714338677
Chicago Manual of Style (17th edition)
Document number:
ucin1535371714338677
Download Count:
4,428
Copyright Info
© 2018, all rights reserved.
This open access ETD is published by University of Cincinnati and OhioLINK.