Over the last decade, there has been an enormous growth in both the amount
and the complexity of online content that is collected and processed by
humans and machines. Such a growth has spurred interest in flexible and
fluid (semi-structured) data models that do not constrain the data to
follow a fixed schema. Many applications ranging from bioinformatics to
XML repositories, from software engineering to computational linguistics,
are now generating and processing large amounts of semi-structured data.
For these applications to reach their full potential, we need to build an
effective set of tools to index, process, manage, and analyze such data.
This dissertation focuses on a specific class of semi-structured data that is denoted using hierarchical tree objects. We specifically address the following questions pertaining to mining and managing tree-structured data: How can we provide quick access mechanisms to large semi-structured data stores? How can we discover hidden structural patterns from such data collections? How can we devise strategies to realize performance that is commensurate with modern computer architectures?
In the context of managing tree-structured data, first,
we develop an indexing mechanism that extracts discriminant features from
the database and indexes them using a simple tunable inverted structure.
Such an index is complemented with an efficient holistic query processing
technique that retrieves the matches by operating entirely on
space-efficient sequential representation of trees. Second, we propose a
framework that enables the development of application-specific hash
functions that convert variable-sized graph and tree structured data into
fixed-sized hash values. We demonstrate the usability of this framework by
developing a hash-based distributed data placement service for
semi-structured data. We argue that this service is capable of supporting large scale data management and data mining algorithms.
In the context of mining tree databases, first, we explore the role of
succinct sequential data structures for efficiently discovering frequent
tree patterns. Second, we propose a "memory-conscious design" wherein the
algorithms trade memory for redundant computations to improve the memory
system performance. Third, we consider the case of deploying data mining
workloads on modern multicore systems. Here, we demonstrate that the
bandwidth to main memory becomes a precious shared commodity as one
increases the number of cores present in the system. We present mechanisms
to alleviate the bandwidth pressure and show their effectiveness. Fourth,
we explore an adaptive task and data parallel algorithm design that
facilitates effective parallelization in the presence of data and workload
skew. This algorithm is integrated into a general purpose scheduling
service that supports the development of adaptive and moldable algorithms
for database and mining tasks. Finally, we develop a hash-based distributed data placement service that can support the development of large scale distributed data mining and data management applications.