Exploiting non-redundant local patterns and probabilistic models for analyzing structured and semi-structured data

2008, Doctor of Philosophy, Ohio State University, Computer and Information Science.
This work develops a probabilistic framework for modeling, querying, and analyzing large-scale structured and semi-structured data. The framework has three components: (1) mining non-redundant local patterns from the data; (2) gluing these local patterns together with probabilistic models (e.g., Markov random fields (MRFs) or Bayesian networks); and (3) reasoning over the data to solve various data analysis tasks. Our contributions are as follows:

(a) We present an approach that uses probabilistic models to identify non-redundant itemset patterns in a large collection of frequent itemsets mined from transactional data. The approach effectively eliminates redundancies from such a collection.

(b) We propose a technique that uses local probabilistic models to glue non-redundant itemset patterns together for the link prediction task in co-authorship network analysis. The technique combines topology analysis of network structure data with frequency analysis of network event log data. The main idea is to consider the co-occurrence probability of the two end nodes of a candidate link, and we propose a method that builds MRFs over local data regions to compute this probability. Experimental results demonstrate that the co-occurrence probability inferred from the local probabilistic models is very useful for link prediction.

(c) We explore global models, that is, models over large data regions, as a way to glue non-redundant itemset patterns together. We investigate learning approximate global MRFs on large transactional data and propose a divide-and-conquer style modeling approach. An empirical study shows that the models are effective at modeling the data and at approximately answering queries on it.

(d) We propose a technique for identifying non-redundant tree patterns in a large collection of structural tree patterns mined from semi-structured XML data. The approach effectively eliminates redundancies from such a collection. Furthermore, we present techniques that use these non-redundant tree patterns as summary statistics for the XML data to solve the XML twig selectivity estimation problem. We propose a probabilistic framework under which the selectivity of a twig query can be estimated from information about its subtrees. Empirical results demonstrate the efficacy of our approach on real and synthetic datasets.
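The idea in contribution (b), inferring the co-occurrence probability of two end nodes from a model glued together out of itemset patterns, can be illustrated with a small sketch. The abstract does not spell out the dissertation's learning or inference procedure, so everything below is an assumption made for illustration: the function names (`fit_maxent`, `itemset_probability`), the toy author names and frequencies, and the choice of iterative proportional fitting over an exhaustively enumerated state space, which is only feasible for a handful of variables.

```python
from itertools import product


def fit_maxent(variables, constraints, sweeps=200):
    """Fit a maximum-entropy distribution over binary variables.

    `constraints` maps an itemset (tuple of variable names) to its observed
    frequency, assumed to lie strictly between 0 and 1 and to be mutually
    consistent. Uses iterative proportional fitting on all 2^n states.
    """
    index = {v: i for i, v in enumerate(variables)}
    states = list(product([0, 1], repeat=len(variables)))
    prob = {s: 1.0 / len(states) for s in states}  # start from uniform

    for _ in range(sweeps):
        for itemset, target in constraints.items():
            cols = [index[v] for v in itemset]
            # Current model probability that every item in the itemset occurs.
            current = sum(p for s, p in prob.items() if all(s[c] for c in cols))
            for s in states:
                if all(s[c] for c in cols):
                    prob[s] *= target / current
                else:
                    prob[s] *= (1.0 - target) / (1.0 - current)
    return index, prob


def itemset_probability(index, prob, itemset):
    """Probability, under the fitted model, that all items in `itemset` co-occur."""
    cols = [index[v] for v in itemset]
    return sum(p for s, p in prob.items() if all(s[c] for c in cols))


# Hypothetical local region: authors A and B have never co-authored,
# but each has co-authored with C. Frequencies are made up.
patterns = {
    ("A",): 0.4,
    ("B",): 0.5,
    ("C",): 0.3,
    ("A", "C"): 0.2,
    ("B", "C"): 0.2,
}
index, prob = fit_maxent(["A", "B", "C"], patterns)
score = itemset_probability(index, prob, ("A", "B"))
print(f"estimated P(A and B co-occur) = {score:.3f}")
```

In this toy setup no pattern links A and B directly, so the fitted model treats them as conditionally independent given C. The estimate comes out near 0.219, above the 0.2 that plain independence of A and B would give, which is the kind of signal a link predictor could rank candidate links by.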
Srinivasan Parthasarathy (Advisor)
166 p.

Recommended Citations

  • Wang, C. (2008). Exploiting non-redundant local patterns and probabilistic models for analyzing structured and semi-structured data [Doctoral dissertation, Ohio State University]. OhioLINK Electronic Theses and Dissertations Center. http://rave.ohiolink.edu/etdc/view?acc_num=osu1199284713

    APA Style (7th edition)

  • Wang, Chao. Exploiting non-redundant local patterns and probabilistic models for analyzing structured and semi-structured data. 2008. Ohio State University, Doctoral dissertation. OhioLINK Electronic Theses and Dissertations Center, http://rave.ohiolink.edu/etdc/view?acc_num=osu1199284713.

    MLA Style (8th edition)

  • Wang, Chao. "Exploiting non-redundant local patterns and probabilistic models for analyzing structured and semi-structured data." Doctoral dissertation, Ohio State University, 2008. http://rave.ohiolink.edu/etdc/view?acc_num=osu1199284713

    Chicago Manual of Style (17th edition)