Automatic Identification of Interestingness in Biomedical Literature

Anand, Gaurish

2014, Master of Science (MS), Wright State University, Computer Science.
This thesis presents research on automatically identifying interestingness in a graph of semantic predications. Interestingness is a subjective quality of information that reflects its value in meeting a user’s known or unknown retrieval needs. For information to be perceived as interesting, it must have some utility for the user and strike a balance between significant novelty and sufficient familiarity. That perception can also be influenced by additional factors such as unexpectedness or serendipity with recent experiences. The ability to identify interesting information facilitates the development of user-centered retrieval, especially in semantic summarization of information and in iterative, step-wise searching such as in discovery browsing systems. Ultimately, this allows biomedical researchers to more quickly identify the information of greatest potential interest to them, whether expected or, perhaps more importantly, unexpected. Current discovery browsing systems use iterative information retrieval to discover new knowledge, a process that requires finding relevant co-occurring topics and relationships and depends on consistent human involvement to identify interesting concepts. Although interestingness is subjective, this thesis identifies computable quantities in semantic data that correlate with interestingness in user searches. We compare several statistical and rule-based models that correlate graph data extracted from semantic predications with concept interestingness as demonstrated in PubMed queries. Semantic predications represent scientific assertions extracted from all of the biomedical literature contained in the MEDLINE database. They take the form “subject-predicate-object”. Predications can easily be represented as graphs, where subjects and objects are nodes and predicates form edges. A graph of predications represents the assertions made in the citations from which the predications were extracted.
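As an illustration of how subject-predicate-object predications map onto a graph, the following is a minimal sketch and not the thesis implementation. The example triples, the function names `build_graph` and `degree`, and the choice to treat edges as undirected are all assumptions made for this sketch.

```python
from collections import defaultdict

# Hypothetical example predications (subject, predicate, object).
# These are illustrative only and are not taken from the thesis data.
predications = [
    ("Donepezil", "TREATS", "Alzheimer's Disease"),
    ("Amyloid beta-Protein", "ASSOCIATED_WITH", "Alzheimer's Disease"),
    ("Donepezil", "TREATS", "Alzheimer's Disease"),  # same assertion, second citation
    ("Donepezil", "INHIBITS", "Acetylcholinesterase"),
]


def build_graph(triples):
    """Map each unordered node pair to the list of predicate labels joining it.

    Assumption: edges are treated as undirected, and every predication
    instance contributes one labeled edge between its two concepts.
    """
    edges = defaultdict(list)
    for subj, pred, obj in triples:
        edges[frozenset((subj, obj))].append(pred)
    return edges


def degree(edges, node):
    """Count the distinct node pairs that include `node` (its connectedness)."""
    return sum(1 for pair in edges if node in pair)


graph = build_graph(predications)
seed = "Alzheimer's Disease"
print(degree(graph, seed))                          # distinct neighbors of the seed
print(graph[frozenset(("Donepezil", seed))])        # predicate labels on one edge
```

Keeping one label per predication instance means the length of each label list doubles as a simple frequency-of-occurrence count for that edge.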
This thesis uses graph metrics to identify features from the predication graph for model generation. These features are based on the degree centrality (connectedness) of the seed concept node and its surrounding nodes. They are also based on frequency-of-occurrence measures for the edges between the seed concept and its surrounding nodes, as well as for the edges between those surrounding nodes and their own neighbors. A PubMed query log is used for training and testing models of interestingness. This log contains a set of user searches over a 24-hour period, and we assume that when a concept co-occurs with the seed concept in searches, this demonstrates the interestingness of that concept with regard to the seed concept. Graph generation begins by selecting all predications containing the seed concept from the Semantic MEDLINE database (our training dataset uses Alzheimer’s disease as the seed concept). The graph is built with the seed concept as the central node. A node is added for each concept that occurs with the seed concept in the initial predications, and an edge is created for each instance of a predication containing the two concepts. The edges are labeled with the specific predicate in the predication. This graph is then extended to include additional nodes within two hops of the seed concept. The concepts in the PubMed query logs are normalized to UMLS concepts or Entrez Gene symbols using MetaMap. Token-based and user-based counts are collected for each co-occurring term. These measures are combined to create a weighted score, which is used to determine three potential thresholds of interestingness based on deviation from the mean score. The concepts that appear in both the graph and the normalized log data are identified for use in model training and testing. In modeling interestingness, we rely on commonly used data mining algorithms: support vector machines, naive Bayes, and rule induction.
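The scoring and thresholding step can be sketched as follows. This is a minimal sketch under stated assumptions: the abstract does not give the weights used to combine the token-based and user-based counts, nor the deviations that define the three thresholds, so the equal weights, the multipliers (0, 0.5, and 1 standard deviations above the mean), the example counts, and all function names here are hypothetical.

```python
from statistics import mean, pstdev

# Hypothetical (token_count, user_count) pairs for concepts that co-occur
# with the seed concept in a query log. Illustrative values only.
counts = {
    "Donepezil": (40, 25),
    "Dementia": (30, 20),
    "Memantine": (12, 9),
    "Aspirin": (4, 3),
    "Caffeine": (2, 2),
}


def weighted_scores(counts, w_token=0.5, w_user=0.5):
    """Combine the two counts into one score (the weights are an assumption)."""
    return {c: w_token * t + w_user * u for c, (t, u) in counts.items()}


def thresholds(scores, multipliers=(0.0, 0.5, 1.0)):
    """Return candidate cut-offs at mean + k * standard deviation.

    The multipliers are assumed for illustration, not taken from the thesis.
    """
    mu = mean(scores.values())
    sigma = pstdev(scores.values())
    return [mu + k * sigma for k in multipliers]


def label(scores, threshold):
    """Mark a concept as interesting when its score reaches the threshold."""
    return {c: s >= threshold for c, s in scores.items()}


scores = weighted_scores(counts)
cutoffs = thresholds(scores)
for cutoff in cutoffs:
    interesting = sorted(c for c, flag in label(scores, cutoff).items() if flag)
    print(round(cutoff, 2), interesting)
```

Each cut-off yields a different binary labeling of the same concepts, which is what allows models to be trained and compared at several interestingness thresholds.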
To evaluate the models generated by these algorithms, we calculate precision, recall, and F-score for each model at all three interestingness thresholds. The best-performing model is tested on three additional seed topics: schizophrenia, diabetes, and colitis. The results show that the model generated by the rule-induction algorithm performs best. Performance was best on the schizophrenia dataset, which suggests a benefit from training and testing on semantically similar topics, or perhaps on broader seed concepts. Additionally, using more graph metric features, a query log covering a longer period, and log data separated by user class could improve performance. To conclude, this thesis presents a novel approach to identifying interestingness in a graph of semantic predications. A positive correlation is seen between interestingness and graph metrics. The results show the potential for improving the identification of interestingness of retrieved information for searches in discovery browsing and beyond.
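For reference, the evaluation measures named above can be computed as in the following minimal sketch. It assumes the F-score is the balanced F1 measure (the abstract does not say which variant was used), and the concept sets and the function name `precision_recall_f1` are hypothetical.

```python
def precision_recall_f1(predicted, actual):
    """Compute precision, recall, and F1 for two sets of concept names.

    `predicted` holds concepts a model labeled as interesting; `actual`
    holds concepts labeled interesting by the query-log threshold.
    """
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical labels for one model at one interestingness threshold.
predicted = {"Donepezil", "Dementia", "Aspirin", "Caffeine"}
actual = {"Donepezil", "Dementia", "Memantine"}

p, r, f1 = precision_recall_f1(predicted, actual)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

Repeating this calculation for each model at each threshold gives a grid of scores from which a best-performing model can be selected.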
Amit Sheth, Ph.D. (Advisor)
Thomas Rindflesch, Ph.D. (Committee Member)
Michael Cairelli, D.O. (Committee Member)
74 p.

Recommended Citations

  • Anand, G. (2014). Automatic Identification of Interestingness in Biomedical Literature [Master's thesis, Wright State University]. OhioLINK Electronic Theses and Dissertations Center. http://rave.ohiolink.edu/etdc/view?acc_num=wright1410962490

    APA Style (7th edition)

  • Anand, Gaurish. Automatic Identification of Interestingness in Biomedical Literature. 2014. Wright State University, Master's thesis. OhioLINK Electronic Theses and Dissertations Center, http://rave.ohiolink.edu/etdc/view?acc_num=wright1410962490.

    MLA Style (8th edition)

  • Anand, Gaurish. "Automatic Identification of Interestingness in Biomedical Literature." Master's thesis, Wright State University, 2014. http://rave.ohiolink.edu/etdc/view?acc_num=wright1410962490

    Chicago Manual of Style (17th edition)