As scientific simulations generate ever larger amounts of data, analyzing
this data to gain insight into scientific phenomena is becoming
increasingly challenging. With the emergence of grid
computing, analysis of large geographically distributed scientific
datasets, also referred to as distributed data-intensive
science, has emerged as an important area in recent years. We believe
that middleware supporting remote data mining would make the
development of remote data analysis applications more efficient and
less time-consuming, allowing the programmer to concentrate on
specifying the processing to be performed on the data, rather than on
the efficiency of data retrieval or scalability.
In this thesis, we present the design and evaluation of a middleware that
targets mining data resident on remote repositories, and supports a
high-level interface for developing data mining and scientific data
processing applications. Our middleware, referred to as FREERIDE-G
(FRamework for Rapid Implementation of Datamining Engines in Grids),
is based on a precursor system, FREERIDE, created to provide run-time
parallelization support for performing generalized reduction
computations on locally stored data.
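
To illustrate the class of computation referred to above, the following is a minimal sketch of a generalized reduction, written in plain Python rather than against the FREERIDE or FREERIDE-G interface. The function names (`local_reduction`, `global_combine`, `generalized_reduction`) and the histogram example are assumptions chosen for illustration only. Each chunk of data is folded into a private reduction object using an associative and commutative update, and the partial objects are then merged.

```python
from functools import reduce


def local_reduction(chunk, num_bins, lo, hi):
    """Fold one chunk of data into a private reduction object (a histogram)."""
    counts = [0] * num_bins
    width = (hi - lo) / num_bins
    for x in chunk:
        # Clamp so that x == hi falls into the last bin.
        idx = min(int((x - lo) / width), num_bins - 1)
        counts[idx] += 1
    return counts


def global_combine(a, b):
    """Merge two reduction objects; element-wise addition is order-independent."""
    return [x + y for x, y in zip(a, b)]


def generalized_reduction(chunks, num_bins, lo, hi):
    """Reduce every chunk independently, then combine the partial results.

    In a parallel setting each chunk would be handled by a different node;
    here the chunks are processed sequentially for simplicity.
    """
    partials = [local_reduction(c, num_bins, lo, hi) for c in chunks]
    return reduce(global_combine, partials, [0] * num_bins)
```

Because the update and the merge are both order-independent, the chunks can be processed in any order and on any node, which is what makes this structure amenable to run-time parallelization.
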
In its final implementation, our middleware mines data resident on
SRB-based servers and uses the Storage Resource Broker (a de facto
standard for remote data access) for both data retrieval and delivery
to the processing site. This implementation was evaluated using five
data processing applications developed for our middleware. We have
also conducted an in-depth study of how the performance of the
SRB-based implementation is affected by the size of the remote I/O
request unit, I/O concurrency, and limited network bandwidth available
for data transfer.
To make our middleware compliant with grid computing standards, we
have also integrated the compute node client component of our
SRB-based implementation with the Globus Toolkit and MPICH-G2. As part
of this work, we evaluated the overhead of using the pre-WS components
of the Globus Toolkit for middleware deployment, and found this
overhead to be quite modest.
To facilitate the selection of dataset replicas and computing
resources, an accurate performance prediction framework was also
developed as part of our middleware. Our performance model breaks
application execution time down into data retrieval, data
communication, and data processing components, and leverages our
familiarity with the structure of computation supported by
FREERIDE-G. In addition, depending on where the data to be processed
was generated or how it is shared, interesting load balancing and
scheduling considerations may arise.
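
As a rough illustration of the three-component breakdown described above, the sketch below predicts execution time for a new configuration from a single profiled run. The scaling rules are assumptions made for this example and are not taken from the thesis: retrieval time is assumed to scale inversely with the number of data server nodes, communication time inversely with bandwidth, and processing time inversely with the number of compute nodes, with all three growing linearly in dataset size. The `Profile` class and `predict` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Profile:
    """Measurements from one profiled run (hypothetical structure)."""
    dataset_size: float      # size of the profiled dataset
    data_nodes: int          # number of data server nodes used
    compute_nodes: int       # number of compute nodes used
    bandwidth: float         # network bandwidth between the two sites
    t_retrieval: float       # measured data retrieval time
    t_communication: float   # measured data communication time
    t_processing: float      # measured data processing time


def predict(profile, dataset_size, data_nodes, compute_nodes, bandwidth):
    """Predict total time as the sum of three independently scaled components."""
    scale = dataset_size / profile.dataset_size
    # Assumption: each component scales linearly with its own resource.
    t_ret = profile.t_retrieval * scale * profile.data_nodes / data_nodes
    t_com = profile.t_communication * scale * profile.bandwidth / bandwidth
    t_proc = profile.t_processing * scale * profile.compute_nodes / compute_nodes
    return t_ret + t_com + t_proc
```

A predictor of this shape can be evaluated for every candidate pairing of replica and computing resource, and the pairing with the lowest predicted time selected.
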
Our middleware supports efficient processing of data from
geographically distributed sources through a load-balancing resource
allocation and scheduling algorithm, which minimizes the total time
spent on processing the data. To solve this scheduling problem, we
consider a weighted sum of two factors: a load balancing factor and a
term that captures the amount of time spent by processing nodes
waiting for the data. The approach also supports data integration in
cases of vertical partitioning.
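
The weighted-sum objective mentioned above can be sketched as follows. This is an illustrative formulation under stated assumptions, not the algorithm developed in the thesis: the load balancing factor is taken to be the gap between the slowest node and the mean processing time, the waiting term is the total time nodes spend idle waiting for data, and the weight `alpha` is a hypothetical tuning parameter. The function names are likewise assumptions.

```python
def schedule_cost(proc_times, wait_times, alpha=0.5):
    """Weighted sum of a load imbalance term and a data-wait term.

    proc_times: per-node processing time under a candidate schedule
    wait_times: per-node time spent waiting for data to arrive
    alpha:      weight on the load balancing factor (assumed, in [0, 1])
    """
    mean_time = sum(proc_times) / len(proc_times)
    imbalance = max(proc_times) - mean_time
    waiting = sum(wait_times)
    return alpha * imbalance + (1 - alpha) * waiting


def best_schedule(candidates, alpha=0.5):
    """Pick the candidate schedule with the lowest weighted cost."""
    return min(
        candidates,
        key=lambda c: schedule_cost(c["proc_times"], c["wait_times"], alpha),
    )
```

With a large `alpha` the selection favors evenly loaded nodes, while a small `alpha` favors schedules in which nodes rarely wait for remote data.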