Discovering interpretable topics in free-style text: diagnostics, rare topics, and topic supervision

2008, Doctor of Philosophy, Ohio State University, Industrial and Systems Engineering.

Massive databases with free-style text fields are a common feature of virtually all types of organizations, from hospitals to aviation companies to governmental agencies. Perhaps the most promising approaches for intelligent, automatic text analysis are "topic models". Yet it is likely that every topic model generates at least some topics that do not correspond to anything human analysts can understand and act upon.

In this dissertation, we begin by synthesizing the literature on text modeling and information retrieval. We argue that the research has evolved from focusing on fast search/document retrieval to creating interpretable models of entire corpora, i.e., databases. We also argue that the topic model literature has largely failed to address statistical issues relating to data limitations, rare topics, and the associated effects on topic model accuracy.

Next, we clarify the limitations of the standard measure of topic model accuracy, perplexity, for cases in which topic interpretability and accuracy are important. Then, we propose new measures including the "KL percentage" that provide absolute evaluations of the accuracy or "informativeness" of all topics in the model. Computational experiments show that the proposed measures are more sensitive and give different data requirement estimates than perplexity.
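
For reference, perplexity is conventionally computed as the exponential of the negative mean log-likelihood of held-out tokens. The following is a minimal sketch under that standard definition; the function name `perplexity` and the input format (a list of per-token probabilities assigned by a fitted model) are assumptions made for illustration and are not taken from the dissertation.

```python
import math


def perplexity(token_probs):
    """Perplexity of held-out tokens.

    token_probs: hypothetical input, the probability a fitted model
    assigns to each held-out token (each value must be in (0, 1]).
    """
    if not token_probs:
        raise ValueError("need at least one token probability")
    # Sum of log-probabilities is the held-out log-likelihood.
    log_likelihood = sum(math.log(p) for p in token_probs)
    # Exponentiate the negative per-token average.
    return math.exp(-log_likelihood / len(token_probs))
```

Under this definition, a model that assigns probability 0.25 to every held-out token has a perplexity of 4, i.e., it is as uncertain as a uniform choice among four words. Lower values indicate a better fit, but the number is a single corpus-level average and says nothing about individual topics.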

Then, to improve the interpretability of topics produced by topic models, we propose using human-computer interaction (HCI) and integrating the results directly into the topic models. We introduce "anti-words" to capture negative relationships in which words do not belong to topics. Also, we propose two supervision methods, the probabilistic constraint (PC) method and the topic augmentation (TA) method, and demonstrate their benefits using numerical examples.

Next, we propose the topic model process control (TMPC) approach for control charting systems characterized by free-style text. This approach identifies new trends and assignable causes and is based on chi-squared tests on the empirical topic percentages. Finally, we show the effectiveness of all methods for an Ohio-based company. After supervision, interesting rare and unexpected topics are discovered that represent new classes of customers, and the TMPC method correctly indicates unusual activities and their root causes.
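
The abstract does not give the details of the TMPC procedure, so the following is only a generic illustration of a chi-squared test on topic percentages, not the dissertation's method. It is a minimal sketch of a Pearson goodness-of-fit check that compares the topic counts observed in a new period against baseline topic proportions; the function names, the inputs, and the choice of critical value are all assumptions.

```python
def chi_squared_statistic(observed_counts, baseline_proportions):
    """Pearson chi-squared statistic for observed topic counts
    against baseline topic proportions (hypothetical inputs)."""
    if len(observed_counts) != len(baseline_proportions):
        raise ValueError("one baseline proportion per topic is required")
    total = sum(observed_counts)
    stat = 0.0
    for observed, proportion in zip(observed_counts, baseline_proportions):
        expected = total * proportion  # count expected under the baseline
        stat += (observed - expected) ** 2 / expected
    return stat


def signals_shift(observed_counts, baseline_proportions, critical_value):
    """True when the statistic exceeds the chosen critical value,
    i.e., the period's topic mix differs from the baseline."""
    stat = chi_squared_statistic(observed_counts, baseline_proportions)
    return stat > critical_value


# Illustrative numbers only: three topics, 100 documents in the new period.
baseline = [0.5, 0.3, 0.2]
new_period = [30, 30, 40]
# 5.991 is the 0.05-level critical value for 2 degrees of freedom.
shifted = signals_shift(new_period, baseline, critical_value=5.991)
```

With k topics and known baseline proportions, the statistic is compared against a chi-squared distribution with k - 1 degrees of freedom. In the illustrative numbers above the statistic is 28, well beyond 5.991, so the period would be flagged for investigation.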

Theodore Allen (Advisor)

Recommended Citations

  • Zheng, N. (2008). Discovering interpretable topics in free-style text: diagnostics, rare topics, and topic supervision [Doctoral dissertation, Ohio State University]. OhioLINK Electronic Theses and Dissertations Center. http://rave.ohiolink.edu/etdc/view?acc_num=osu1199237529

    APA Style (7th edition)

  • Zheng, Ning. Discovering interpretable topics in free-style text: diagnostics, rare topics, and topic supervision. 2008. Ohio State University, Doctoral dissertation. OhioLINK Electronic Theses and Dissertations Center, http://rave.ohiolink.edu/etdc/view?acc_num=osu1199237529.

    MLA Style (8th edition)

  • Zheng, Ning. "Discovering interpretable topics in free-style text: diagnostics, rare topics, and topic supervision." Doctoral dissertation, Ohio State University, 2008. http://rave.ohiolink.edu/etdc/view?acc_num=osu1199237529

    Chicago Manual of Style (17th edition)