Automatic Detection of Section Title and Prose Text in HTML Documents Using Unsupervised and Supervised Learning

Mysore Gopinath, Abhijith Athreya

2018, MS, University of Cincinnati, Engineering and Applied Science: Computer Science.
Web documents are among the most important sources of publicly available information, and researchers in need of textual data often scour the web for it. Most web documents organize their textual content into sections based on topic. Each section contains two distinguishable parts: (1) the title, which summarizes the text that follows it, and (2) the text that follows the title, also known as prose text. Apart from its aesthetic appeal, this organization could be helpful in natural language processing (NLP) tasks such as question answering, information extraction, text summarization and text classification. The section title acts as an index or a quick summary of the prose content that follows it. Just as a table of contents helps a reader find information in a book, these indexes can be used to focus on content relevant to a search. Each section is lexically cohesive and, at the same time, distinct from the other sections. Current methods of web text extraction are agnostic of these textual demarcations, as they cannot identify titles and prose text. One reason is the inherent difficulty of determining sections: two documents with the same appearance can be structured in many different ways, so a rule-based method may not work well across websites. The complex nesting of HTML tags and the copious presence of unrelated data further complicate processing. In this thesis, we solve the problem of automatically identifying section titles and prose text. We developed two methods: an unsupervised, domain-independent approach and a supervised, domain-dependent approach. In the domain-independent approach, we use lexical and morphological features of the text to perform k-means clustering and identify title labels. Further techniques are then used to determine the corresponding prose text for each title.
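To make the domain-independent idea concrete, here is a minimal sketch of two-cluster k-means over simple surface features of text lines. It is an illustration only, not the thesis implementation: the three features (scaled word count, share of capitalized words, terminal punctuation), the deterministic seeding, and the function names `features`, `kmeans_two` and `label_lines` are all assumptions made for this example.

```python
from typing import List, Sequence, Tuple

Vector = Tuple[float, ...]


def features(line: str) -> Vector:
    """Hypothetical surface features; the thesis's actual feature set is not given here."""
    words = line.split()
    if not words:
        return (0.0, 0.0, 0.0)
    length = min(len(words), 30) / 30  # word count, capped and scaled to [0, 1]
    capitalized = sum(w[0].isupper() for w in words) / len(words)
    ends_sentence = 1.0 if line.rstrip().endswith((".", "!", "?")) else 0.0
    return (length, capitalized, ends_sentence)


def sq_dist(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(vectors: Sequence[Vector]) -> Vector:
    n = len(vectors)
    return tuple(sum(column) / n for column in zip(*vectors))


def kmeans_two(points: List[Vector], max_iter: int = 50) -> List[int]:
    """Lloyd's algorithm with k=2, seeded from the shortest and longest lines."""
    centroids = [min(points, key=lambda p: p[0]), max(points, key=lambda p: p[0])]
    assignment: List[int] = []
    for _ in range(max_iter):
        new_assignment = [
            0 if sq_dist(p, centroids[0]) <= sq_dist(p, centroids[1]) else 1
            for p in points
        ]
        if new_assignment == assignment:  # converged
            break
        assignment = new_assignment
        for c in (0, 1):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = mean(members)
    return assignment


def label_lines(lines: List[str]) -> List[str]:
    """Call the cluster with the shorter average length 'title', the other 'prose'."""
    points = [features(line) for line in lines]
    assignment = kmeans_two(points)

    def avg_length(cluster: int) -> float:
        lengths = [p[0] for p, a in zip(points, assignment) if a == cluster]
        return sum(lengths) / len(lengths) if lengths else float("inf")

    title_cluster = 0 if avg_length(0) <= avg_length(1) else 1
    return ["title" if a == title_cluster else "prose" for a in assignment]


sample = [
    "Information We Collect",
    "We collect information that you provide to us when you create an account, "
    "such as your name and email address.",
    "How We Use Information",
    "We use the information to operate and improve our services and to "
    "communicate with you about your account.",
]
labels = label_lines(sample)
```

Deciding which cluster holds the titles by average length is a heuristic chosen for this sketch; a real system would likely need richer features and a more careful rule for naming the clusters.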
In the domain-dependent approach, we train a neural network classifier on dense word embeddings of title and prose text collected from a domain. The system produces a simplified version of the original HTML page that can be machine-read using simple rules. Along with these novel methods, we also created a corpus of web documents containing privacy policies, terms of service agreements and miscellaneous web documents. This corpus includes both the original version and the simplified output of every HTML document. To test our assumptions and methods, we used online privacy policies, terms of service agreements and miscellaneous web documents. We evaluated the models on two fronts: (1) the traditional precision, recall and F1 scores for segment identification, and (2) a metric we name coverage, which measures the amount of the original legitimate text reproduced in the final output. The domain-independent approach achieved an overall precision of 0.82, recall of 0.98 and coverage of 0.97. The domain-dependent model achieved an accuracy of 0.99, recall of 0.75 and coverage of 0.93. These results demonstrate that our system is largely accurate and robust.
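The evaluation described above can be sketched as follows. Precision, recall and F1 follow their standard definitions for a chosen positive label. The `coverage` function is only one plausible reading of "the amount of the original legitimate text reproduced in the final output": it is assumed here to be the share of the original text's tokens that also appear in the output, and the abstract does not confirm that this is the thesis's exact definition. All function names are hypothetical.

```python
from collections import Counter
from typing import Sequence, Tuple


def precision_recall_f1(
    predicted: Sequence[str], gold: Sequence[str], positive: str = "title"
) -> Tuple[float, float, float]:
    """Standard precision, recall and F1 for one positive label."""
    pairs = list(zip(predicted, gold))
    tp = sum(p == positive and g == positive for p, g in pairs)
    fp = sum(p == positive and g != positive for p, g in pairs)
    fn = sum(p != positive and g == positive for p, g in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return precision, recall, f1


def coverage(original_text: str, output_text: str) -> float:
    """Assumed definition: fraction of original tokens (as a multiset) kept in the output."""
    original = Counter(original_text.lower().split())
    output = Counter(output_text.lower().split())
    total = sum(original.values())
    if total == 0:
        return 1.0  # nothing to reproduce
    kept = sum((original & output).values())  # multiset intersection
    return kept / total


predicted = ["title", "prose", "title", "title"]
gold = ["title", "prose", "prose", "title"]
p, r, f1 = precision_recall_f1(predicted, gold)
cov = coverage("a b c d", "a b c")
```

In this toy example one prose line is mislabeled as a title, so recall stays at 1.0 while precision drops to 2/3. A token-level coverage score such as the one assumed here would not penalize reordered text.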
Shomir Wilson, Ph.D. (Committee Chair)
Raj Bhatnagar, Ph.D. (Committee Member)
Nan Niu, Ph.D. (Committee Member)
Carla Purdy, Ph.D. (Committee Member)
161 p.

Recommended Citations

  • Mysore Gopinath, A. A. (2018). Automatic Detection of Section Title and Prose Text in HTML Documents Using Unsupervised and Supervised Learning [Master's thesis, University of Cincinnati]. OhioLINK Electronic Theses and Dissertations Center. http://rave.ohiolink.edu/etdc/view?acc_num=ucin1535371714338677

    APA Style (7th edition)

  • Mysore Gopinath, Abhijith Athreya. Automatic Detection of Section Title and Prose Text in HTML Documents Using Unsupervised and Supervised Learning. 2018. University of Cincinnati, Master's thesis. OhioLINK Electronic Theses and Dissertations Center, http://rave.ohiolink.edu/etdc/view?acc_num=ucin1535371714338677.

    MLA Style (8th edition)

  • Mysore Gopinath, Abhijith Athreya. "Automatic Detection of Section Title and Prose Text in HTML Documents Using Unsupervised and Supervised Learning." Master's thesis, University of Cincinnati, 2018. http://rave.ohiolink.edu/etdc/view?acc_num=ucin1535371714338677

    Chicago Manual of Style (17th edition)