Recent advances in digital sensor technology and numerical simulations
of real-world phenomena are resulting in the acquisition of unprecedented
amounts of raw digital data. Terms like ‘data explosion’ and ‘data tsunami’
have come to describe the uncontrolled rate at which scientific datasets
are generated by automated sources ranging from digital microscopes and
telescopes to in-silico models simulating the complex dynamics of physical
and biological processes. Scientists in various domains now have secure,
affordable access to petabyte-scale observational data gathered over time,
the analysis of which is crucial to scientific discovery.
The availability of commodity components has fostered the development of
large distributed systems with high-performance computing resources to
support the execution requirements of scientific data analysis applications.
Increased levels of middleware support over the years have aimed to provide
high scalability of application execution on these systems. However, the
high-resolution, multi-dimensional nature of scientific datasets and the
complexity of analysis requirements present challenges to efficient
application execution on such systems. Traditional brute-force analysis
techniques to extract useful information from scientific datasets
may no longer meet desired performance levels at extreme data scales.
This thesis builds on a comprehensive study of multi-dimensional data
analysis applications at large data scales, and identifies a set of factors,
or parameters, of this class of applications that can be customized
in domain-specific ways to obtain substantial improvements in performance.
A useful property of these applications
is their ability to operate at multiple performance levels based on a set of
trade-off parameters, while providing different levels of quality-of-service
(QoS) specific to the application instance. To realize the performance benefits
brought about by such factors, applications must be configured for execution
in ways specific to each target system. Middleware support for such
domain-specific configuration is limited, and there is typically no integration
across middleware layers to this end. Low-level manual configuration of
applications within a large space of solutions is error-prone and tedious.
This thesis proposes an approach for the development and execution of large
scientific multi-dimensional data analysis applications that takes multiple
performance parameters into account and supports the notion of domain-specific
configuration-as-a-service.
My research identifies various aspects that go into the creation
of a framework for user-guided, system-directed performance optimizations
for such applications. The framework seeks to achieve this goal by
integrating software modules that (i) provide a unified, homogeneous model
for the high-level specification of any conceptual knowledge that
may be used to configure applications within a domain, (ii) perform
application configuration in response to user directives, i.e., use the
specifications to translate high-level requirements into low-level execution
plans optimized for a given system, and (iii) carry out the execution plans
on the underlying system in an efficient and scalable manner.
A prototype implementation of the framework that integrates several middleware
layers is used for evaluating our approach. Experimental results gathered for
real-world application scenarios from the domains of astronomy and biomedical
imaging demonstrate the utility of our framework in meeting scientific
performance requirements at very large data scales.