Scientific advancements have ushered in staggering amounts of
available data and processes, which are now scattered across various
locations on the Web, the Grid, and, more recently, the Cloud. These processes and
data sets are often semantically loosely coupled and must be composed
piecemeal to generate scientific workflows. Understanding how to
design, manage, and execute such data-intensive workflows has become
increasingly esoteric, confined to a few scientific experts in the field.
Despite the development of scientific workflow management systems, which have
simplified workflow planning to some extent, a means to reduce the complexity
of user interaction without forfeiting robustness has remained elusive. This
barrier runs counter to the essence of scientific progress, where information should
be accessible to anyone. A high-level querying interface akin to common
search engines that can not only return a relevant set of scientific workflows,
but also facilitate their execution, may be highly beneficial to users.
The development of such a system that can abstract the complex task of
scientific workflow planning and execution from the user is reported herein.
Our system, Auspice (AUtomatic Service Planning In Cloud/Grid
Environments), makes the following key contributions. First, a
two-level metadata management framework is introduced. At the top level, Auspice
captures semantic dependencies among available, shared processes and data sets
with an ontology. Our system furthermore indexes these shared resources to
enable fast planning. This metadata framework supports an automatic
workflow composition algorithm, which exhaustively enumerates relevant
scientific workflow plans given a few key parameters, a marked departure
from requiring users to design and manage workflow plans themselves.
By applying time and accuracy models to processes, time-critical and accuracy-aware
constraints can be enforced in this planning algorithm. During the
planning phase, Auspice projects the time and accuracy costs of candidate workflow
plans and prunes them a priori if they cannot meet the specified
constraints. Conversely, when feasible, Auspice can adapt to certain time
constraints by trading accuracy for time. To simplify user interaction, both
natural language and keyword search interfaces have been developed to invoke
this workflow planning algorithm. Intermediate data caching
strategies have also been implemented to accelerate workflow execution over
emerging Cloud environments. We focus in particular on cache elasticity and, to
this end, have developed methods to scale and relax resource provisioning
for cooperating data caches. Finally, costs of supporting such data caches
over various Cloud storage and compute resources have been evaluated.
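
As an illustration of the constraint-aware planning summarized above, the following is a minimal sketch, not Auspice's actual implementation. It assumes that each process carries a modeled time and error cost, that costs add up along a plan, and that dependencies are acyclic. The names `Service`, `enumerate_plans`, and the sample services and data set are hypothetical.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Service:
    name: str
    inputs: tuple   # concepts the service consumes
    output: str     # concept the service derives
    time: float     # modeled execution time (hypothetical units)
    error: float    # modeled error introduced (hypothetical units)


def enumerate_plans(target, services, datasets, max_time, max_error):
    """Return (plan, time, error) tuples that derive `target` within both bounds.

    Assumes an acyclic dependency structure and additive costs.
    """
    # A data set that already holds the concept is a zero-cost leaf plan.
    plans = [(d, 0.0, 0.0) for d in datasets.get(target, [])]
    for s in services:
        if s.output != target:
            continue
        # Prune a priori: this service alone already violates a constraint,
        # so none of its sub-plans need to be explored.
        if s.time > max_time or s.error > max_error:
            continue
        # Enumerate sub-plans for each input under the remaining budget.
        sub = [
            enumerate_plans(c, services, datasets,
                            max_time - s.time, max_error - s.error)
            for c in s.inputs
        ]
        for combo in product(*sub):
            t = s.time + sum(p[1] for p in combo)
            e = s.error + sum(p[2] for p in combo)
            if t <= max_time and e <= max_error:
                plans.append(((s.name, tuple(p[0] for p in combo)), t, e))
    return plans


# Hypothetical catalogue: two ways to derive "water_level" from "readings".
services = [
    Service("interpolate_fine", ("readings",), "water_level", 8.0, 0.125),
    Service("interpolate_coarse", ("readings",), "water_level", 2.0, 0.5),
]
datasets = {"readings": ["gauge_readings.csv"]}

# A tight deadline leaves only the faster, less accurate plan.
fast_only = enumerate_plans("water_level", services, datasets, 5.0, 1.0)
```

Under a looser deadline both plans are returned, and tightening the error bound instead keeps only the slower, more accurate plan, which mirrors the accuracy-for-time trade-off described above.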