Skip to Main Content
 

Global Search Box

 
 
 
 

Files

ETD Abstract Container

Abstract Header

Towards Efficient Data Analysis and Management of Semi-structured Data

Tatikonda, Shirish

Abstract Details

2010, Doctor of Philosophy, Ohio State University, Computer Science and Engineering.

Over the last decade, there has been an enormous growth in both the amount and the complexity of online content that is collected and processed by humans and machines. Such a growth has spurred interest in flexible and fluid (semi-structured) data models that do not constrain the data to follow a fixed schema. Many applications ranging from bioinformatics to XML repositories, from software engineering to computational linguistics, are now generating and processing large amounts of semi-structured data. For these applications to reach their full potential, we need to build an effective set of tools to index, process, manage, and analyze such data. This dissertation focuses on a specific class of semi-structured data that is denoted using hierarchical tree objects. We specifically address the following questions pertaining to mining and managing tree-structured data: How can we provide quick access mechanisms to large semi-structured data stores? How can we discover hidden structural patterns from such data collections? How can we devise strategies to realize performance that is commensurate with modern computer architectures?

In the context of managing tree-structured data, first, we develop an indexing mechanism that extracts discriminant features from the database and indexes them using a simple tunable inverted structure. Such an index is complemented with an efficient holistic query processing technique that retrieves the matches by operating entirely on space-efficient sequential representation of trees. Second, we propose a framework that enables the development of application-specific hash functions that convert variable-sized graph and tree structured data into fixed-sized hash values. We demonstrate the usability of this framework by developing a hash-based distributed data placement service for semi-structured data. We argue that this service is capable of supporting large scale data management and data mining algorithms.

In the context of mining tree databases, first, we explore the role of succinct sequential data structures for efficiently discovering frequent tree patterns. Second, we propose a "memory-conscious design" wherein the algorithms trade memory for redundant computations to improve the memory system performance. Third, we consider the case of deploying data mining workloads on modern multicore systems. Here, we demonstrate that the bandwidth to main memory becomes a precious shared commodity as one increases the number of cores present in the system. We present mechanisms to alleviate the bandwidth pressure and show their effectiveness. Fourth, we explore an adaptive task and data parallel algorithm design that facilitates effective parallelization in the presence of data and workload skew. This algorithm is integrated into a general purpose scheduling service that supports the development of adaptive and moldable algorithms for database and mining tasks. Finally, we develop a hash-based distributed data placement service that can support the development of large scale distributed data mining and data management applications.

Srinivasan Parthasarathy, PhD (Advisor)
Sadayappan Ponnuswamy, PhD (Committee Member)
Xiaodong Zhang, PhD (Committee Member)
218 p.

Recommended Citations

Citations

  • Tatikonda, S. (2010). Towards Efficient Data Analysis and Management of Semi-structured Data [Doctoral dissertation, Ohio State University]. OhioLINK Electronic Theses and Dissertations Center. http://rave.ohiolink.edu/etdc/view?acc_num=osu1275414859

    APA Style (7th edition)

  • Tatikonda, Shirish. Towards Efficient Data Analysis and Management of Semi-structured Data. 2010. Ohio State University, Doctoral dissertation. OhioLINK Electronic Theses and Dissertations Center, http://rave.ohiolink.edu/etdc/view?acc_num=osu1275414859.

    MLA Style (8th edition)

  • Tatikonda, Shirish. "Towards Efficient Data Analysis and Management of Semi-structured Data." Doctoral dissertation, Ohio State University, 2010. http://rave.ohiolink.edu/etdc/view?acc_num=osu1275414859

    Chicago Manual of Style (17th edition)