Science is becoming increasingly data-driven. With technological
advancements such as advanced sensing technologies that can rapidly capture
data at high resolutions and Grid technologies that enable increasingly
realistic simulation of complex numerical models, scientific applications
have become highly data-intensive, storing and accessing large volumes of
data. The LHC experiment at CERN is an example of a high-energy physics
initiative whose stored data is measured in petabytes. The end goal of
collecting petabytes of such data is to gain a better understanding of the
problem under study. This requires collaborative analysis of the data by
scientists across the world, which conforms to a distributed data-intensive
computing paradigm in which a set of compute, storage, and network resources
is used collectively to advance science. Effective scheduling and resource
management for such data-intensive applications on distributed resources is
therefore critical to meeting their performance requirements.
Efficient scheduling in this scenario encompasses two key inter-related
problems. The first is the data staging problem: staging data from the
simulation/experimental sites to the computational sites where the data
analysis needs to be performed. The second is the job mapping problem:
mapping data analysis jobs to compute resources so as to maximize the
locality of data usage.
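
To make the job mapping problem concrete, the sketch below shows a greedy,
locality-aware mapper. It is purely illustrative (a hypothetical heuristic
and data layout, not an algorithm from this dissertation): each job is
assigned to the site that already holds the most bytes of its input data,
with ties broken in favor of the less loaded site.

```python
# Illustrative greedy locality-aware job mapper (hypothetical heuristic,
# not the algorithm developed in this dissertation).
def map_jobs(jobs, sites, replicas):
    """jobs:     {job_id: {file_id: size_in_bytes}}, input files per job
    sites:    list of compute site names
    replicas: {file_id: set of sites holding a copy}"""
    load = {s: 0 for s in sites}  # bytes of input data assigned per site
    placement = {}
    for job, files in jobs.items():
        # Bytes of this job's input already resident at a given site.
        def local_bytes(site):
            return sum(size for f, size in files.items()
                       if site in replicas.get(f, set()))
        # Prefer the site with the most local data; break ties by load.
        best = max(sites, key=lambda s: (local_bytes(s), -load[s]))
        placement[job] = best
        load[best] += sum(files.values())
    return placement
```

Under such a heuristic, a job whose input files are all replicated at one
site maps there even if a less loaded site exists, reflecting the bias
toward data locality over load balance.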
Traditional batch job schedulers are designed for compute-intensive jobs
running at supercomputer centers. They take into account CPU-related metrics
(e.g., user-estimated job run times) and system state (e.g., queue wait
times) when making scheduling decisions, but they do not take into account
data-related metrics. There is therefore a need for scheduling mechanisms
for data analysis jobs that account not only for the computation time of the
jobs, but also for the overhead of retrieving the files those jobs request.
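
As a simple illustration of what such a data-aware scheduler might compute
(the notation is introduced here for the sketch, not taken from this
dissertation), the estimated completion time of a job j on a site s could
add a staging term to the usual queue-wait and run-time terms, where F(j)
is the job's input file set, L(s) the set of files already local to s, and
bw(src(f), s) the estimated bandwidth from file f's source site to s:

```latex
% Illustrative data-aware estimated completion time:
% queue wait + compute time + time to fetch the non-local input files.
\begin{equation*}
  \mathrm{ECT}(j,s) \;=\; W(s) \;+\; T_{\mathrm{comp}}(j,s)
  \;+\; \sum_{f \in F(j) \setminus L(s)}
        \frac{\mathrm{size}(f)}{\mathrm{bw}(\mathrm{src}(f),\,s)}
\end{equation*}
```

A scheduler that ranks candidate sites by such an estimate naturally trades
queue delay against staging cost, a trade-off that a purely compute-centric
scheduler never sees.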
In this dissertation, we address the problems of data staging and job
mapping for data-intensive jobs in both homogeneous and heterogeneous
environments, taking into account the effects of data staging, end-point
contention, and data locality. For the mapping problem, we propose
algorithms for mapping data-intensive jobs in both offline and online
settings. We also study the interplay between job mapping and data
replication and propose algorithms that perform the two in a coordinated
manner. For the data staging problem, we propose efficient data staging
mechanisms for data centers consisting of coupled collections of storage
and compute clusters. Furthermore, we extend our data staging work to
heterogeneous distributed systems such as the Grid, employing multi-hop
path splitting and multi-pathing optimizations to improve wide-area file
transfer throughput.
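
To give a flavor of the multi-pathing optimization, the sketch below splits
a file across several network paths in proportion to their estimated
bandwidths and transfers the pieces concurrently. It is a minimal sketch
under stated assumptions: send_chunk is a hypothetical stand-in for a real
transfer primitive, and per-path bandwidth estimates are taken as given.

```python
# Simplified multi-pathing sketch: stripe a file over several paths and
# send the stripes concurrently. `send_chunk` is a hypothetical callable
# standing in for an actual transfer primitive.
from concurrent.futures import ThreadPoolExecutor

def transfer_multipath(data, paths, send_chunk):
    """data:       bytes to transfer
    paths:      [(path_id, estimated_bandwidth), ...]
    send_chunk: callable(path_id, offset, chunk) doing the actual send"""
    total_bw = sum(bw for _, bw in paths)
    chunks, offset = [], 0
    for i, (pid, bw) in enumerate(paths):
        # Size each stripe by relative bandwidth; the last path takes the
        # remainder so that every byte is assigned exactly once.
        n = len(data) - offset if i == len(paths) - 1 \
            else int(len(data) * bw / total_bw)
        chunks.append((pid, offset, data[offset:offset + n]))
        offset += n
    with ThreadPoolExecutor(max_workers=len(paths)) as pool:
        futures = [pool.submit(send_chunk, pid, off, chunk)
                   for pid, off, chunk in chunks]
        return [f.result() for f in futures]
```

Multi-hop path splitting applies the same idea along the route: instead of
one end-to-end connection, a transfer is relayed through intermediate nodes
whose pairwise links offer higher throughput than the direct path.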