Skip to Main Content
 

Global Search Box

 
 
 
 

Files

ETD Abstract Container

Abstract Header

A Data-Locality Aware Mapping and Scheduling Framework for Data-Intensive Computing

Khanna, Gaurav

Abstract Details

2008, Doctor of Philosophy, Ohio State University, Computer Science and Engineering.

Science is increasingly becoming more and more data-driven. With technologicaladvancements such as advanced sensing technologies that can rapidly capture data at high resolutions and Grid technologies that enable increasingly realistic simulation of complex numerical models, scientific applications have become very data-intensive and involve storing and accessing large amounts of data. The LHC experiment at CERN is an example of a high energy physics initiative where the amount of data stored is in petabytes. The end goal in collecting petabytes of simulation data is to gain a better understanding of the problem under study. This essentially involves collaborative analysis of data by scientists across the world which conforms to a distributed data-intensive computing paradigm where a set of compute, storage and network resources are used in a collective fashion to advance science. Effective scheduling and resource management for such data-intensive applications on distributed resources is critical in order to meet their performance requirements.

Efficient scheduling in the aforementioned scenario encompasses two key inter-related problems. The first one is the data staging problem which involves the staging of data from the simulation/experimental sites to the computational sites where the data analysis needs to be performed. The second one is the job mapping problem which involves the mapping of data analysis jobs to compute resources in such a manner so as to maximize the locality of data usage.

Traditional batch job schedulers are designed for compute-intensive jobs running at supercomputer centers. They take into account CPU related metrics (e.g., user estimated job run times) and system state (e.g., queue wait times) to make scheduling decisions, but they do not take into account data related metrics. Therefore, there is a need for designing scheduling mechanisms for data-analysis jobs that take into account not only the computation time of the jobs, but also the overheads of retrieving files requested by those jobs.

In this dissertation, we address the problem of data staging and job mapping for data-intensive jobs in both homogeneous and heterogeneous environments. We achieve this by taking into account the effects of data staging, end-point contention and data locality. For the mapping problem, we propose algorithms for mapping data-intensive jobs in both an offline and online setting. We also study the interplay between job mapping and data replication and propose algorithms which perform job mapping and data replication in a coordinated manner. For the data staging problem, we propose efficient data staging mechanisms for data centers consisting of coupled collections of storage and compute clusters. Furthermore, we extend our data staging work to a heterogeneous distributed system like the Grid. To accomplish that, we employ multi-hop path splitting and multi-pathing optimizations to improve wide-area file transfer throughput.

Ponnuswamy Sadayappan, PhD (Advisor)
Umit Catalyurek, PhD (Committee Member)
Tahsin Kurc, PhD (Committee Member)
Joel Saltz, PhD (Committee Member)
Srinivasan Parthasarathy, PhD (Committee Member)
199 p.

Recommended Citations

Citations

  • Khanna, G. (2008). A Data-Locality Aware Mapping and Scheduling Framework for Data-Intensive Computing [Doctoral dissertation, Ohio State University]. OhioLINK Electronic Theses and Dissertations Center. http://rave.ohiolink.edu/etdc/view?acc_num=osu1218559278

    APA Style (7th edition)

  • Khanna, Gaurav. A Data-Locality Aware Mapping and Scheduling Framework for Data-Intensive Computing. 2008. Ohio State University, Doctoral dissertation. OhioLINK Electronic Theses and Dissertations Center, http://rave.ohiolink.edu/etdc/view?acc_num=osu1218559278.

    MLA Style (8th edition)

  • Khanna, Gaurav. "A Data-Locality Aware Mapping and Scheduling Framework for Data-Intensive Computing." Doctoral dissertation, Ohio State University, 2008. http://rave.ohiolink.edu/etdc/view?acc_num=osu1218559278

    Chicago Manual of Style (17th edition)