Massive databases with free-style text fields are a common feature of virtually all types of organizations, from hospitals to aviation companies to governmental agencies. Perhaps the most promising approaches to intelligent, automatic text analysis are "topic models". Yet it is likely that every topic model generates at least some topics that do not correspond to anything human analysts can understand and act upon.
In this dissertation, we begin by synthesizing the literature on text modeling and information retrieval. We argue that the research has evolved from focusing on fast search/document retrieval to creating interpretable models of entire corpora, i.e., databases. We also argue that the topic model literature has largely failed to address statistical issues relating to data limitations, rare topics, and the associated effects on topic model accuracy.
Next, we clarify the limitations of perplexity, the standard measure of topic model accuracy, for cases in which both topic interpretability and accuracy are important. We then propose new measures, including the "KL percentage", that provide absolute evaluations of the accuracy or "informativeness" of all topics in the model. Computational experiments show that the proposed measures are more sensitive than perplexity and yield different estimates of data requirements.
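For reference, perplexity in its standard form is the exponentiated negative mean log-likelihood of held-out words. The sketch below illustrates only that baseline measure, not the proposed "KL percentage", whose definition is not given in this abstract. The function name `perplexity` and the input format (one model-assigned probability per held-out token) are assumptions made for illustration.

```python
import math


def perplexity(token_probs):
    """Standard perplexity: exp of the negative mean log-probability.

    token_probs is a hypothetical input: the probability the fitted
    model assigns to each held-out token.
    """
    if not token_probs:
        raise ValueError("need at least one token probability")
    log_likelihood = sum(math.log(p) for p in token_probs)
    return math.exp(-log_likelihood / len(token_probs))


# A model that spreads probability uniformly over 4 words
# has perplexity 4 on any held-out text drawn from those words.
uniform_score = perplexity([0.25] * 8)
```

Because perplexity averages over all tokens, a model can score well while a rare topic is estimated poorly, which is consistent with the limitation discussed above.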
Then, to improve the interpretability of the topics produced by topic models, we propose using human-computer interaction (HCI) and integrating the results directly into the topic models. We introduce "anti-words" to capture negative relationships, in which words do not belong to topics. We also propose two supervision methods, the probabilistic constraint (PC) method and the topic augmentation (TA) method, and demonstrate their benefits using numerical examples.
Next, we propose the topic model process control (TMPC) approach for control charting of systems characterized by free-style text. This approach identifies new trends and assignable causes and is based on chi-squared tests on the empirical topic percentages. Finally, we show the effectiveness of all methods in an application to an Ohio-based company. After supervision, rare and unexpected topics representing new classes of customers are discovered, and the TMPC method correctly indicates unusual activities and their root causes.
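As a rough illustration of a chi-squared test on topic percentages, the sketch below compares the topic counts observed in one period against baseline proportions using the Pearson statistic. It is a generic goodness-of-fit check under stated assumptions, not the TMPC procedure itself: the function name, the three-topic baseline, the counts, and the 5% significance level are all hypothetical.

```python
def chi_squared_statistic(observed_counts, baseline_proportions):
    """Pearson chi-squared statistic for observed topic counts
    against expected counts implied by baseline proportions."""
    total = sum(observed_counts)
    statistic = 0.0
    for observed, proportion in zip(observed_counts, baseline_proportions):
        expected = total * proportion
        statistic += (observed - expected) ** 2 / expected
    return statistic


# Chi-squared critical value for 2 degrees of freedom at the 5% level
# (3 topics minus 1).
CRITICAL_VALUE = 5.991

# Hypothetical baseline topic shares and two periods of 100 documents each.
baseline = [0.5, 0.3, 0.2]
typical_period = [52, 29, 19]
shifted_period = [30, 30, 40]

typical_signal = chi_squared_statistic(typical_period, baseline) > CRITICAL_VALUE
shifted_signal = chi_squared_statistic(shifted_period, baseline) > CRITICAL_VALUE
```

In this toy setup only the shifted period exceeds the critical value. The per-topic terms of the statistic show which topics contribute most to a signal, which is one way a chart of this kind could point toward an assignable cause.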