Supporting Data-Intensive Scientific Computing on Bandwidth and Space Constrained Environments

Bicer, Tekin

2014, Doctor of Philosophy, Ohio State University, Computer Science and Engineering.
Scientific applications, simulations, and instruments generate massive amounts of data. This data not only contributes to existing scientific areas but also leads to new sciences. However, managing and analyzing data at this scale are both challenging. In this context, we require tools, methods, and technologies such as reduction-based processing structures, cloud computing and storage, and efficient parallel compression methods. In this dissertation, we first focus on parallel and scalable processing of data stored in S3, a cloud storage resource, using compute instances in Amazon Web Services (AWS). We develop MATE-EC2, which allows data processing to be specified using a variant of the Map-Reduce paradigm. We show various optimizations, including data organization, job scheduling, and data retrieval strategies, that can be leveraged based on the performance characteristics of cloud storage resources. Furthermore, we investigate the efficiency of our middleware in both homogeneous and heterogeneous environments. Next, we improve our middleware so that users can perform transparent processing on data that is distributed among local and cloud resources. With this work, we maximize the utilization of geographically distributed resources. We evaluate our system's overhead, scalability, and performance with varying data distributions. The users of data-intensive applications have different requirements in hybrid cloud settings. Two of the most important are the execution time of the application and the resulting cost on the cloud. Our third contribution is a time and cost model for data-intensive applications that run on hybrid cloud environments. The proposed model lets our middleware adapt to performance changes and dynamically allocate the necessary resources from its environments. Applications can therefore meet user-specified constraints.
Fourth, we investigate compression approaches for scientific datasets and build a compression system. The proposed system focuses on the implementation and application of domain-specific compression algorithms. We port our compression system into the aforementioned middleware and implement different compression algorithms. Our framework enables our middleware to maximize the bandwidth utilization of data-intensive applications while minimizing storage requirements. Although compression can help minimize the input and output overhead of data-intensive applications, using compression during parallel operations is not trivial. Specifically, the inability to determine compressed data chunk sizes in advance complicates parallel write operations. In our final work, we develop different methods for enabling compression during parallel input and output operations. We then port our proposed methods into PnetCDF, a widely used scientific data management library, and show how transparent compression can be supported during parallel output operations. The proposed system lets an existing parallel simulation program start outputting and storing data in compressed form. Similarly, data analysis applications can transparently access compressed data using our system.
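The difficulty with parallel writes noted above can be illustrated with a small sketch. It is not the dissertation's method and does not use PnetCDF; it is a serial stand-in, assuming zlib as the compressor and hypothetical helper names (`pack`, `read_chunk`). Each chunk's write offset depends on the compressed sizes of all earlier chunks, so the offsets are only known after compression, here via an exclusive prefix sum over the sizes.

```python
import zlib


def pack(chunks):
    """Compress each chunk, then place the results back to back.

    Illustrative only: in a real parallel setting each process would
    compress its own chunk and the offsets would come from a collective
    exclusive scan over the compressed sizes.
    """
    compressed = [zlib.compress(c) for c in chunks]
    sizes = [len(blob) for blob in compressed]

    # Exclusive prefix sum: the offset of chunk i is the total size of
    # chunks 0..i-1, which is unknown until those chunks are compressed.
    offsets = []
    total = 0
    for size in sizes:
        offsets.append(total)
        total += size

    buf = bytearray(total)
    for off, blob in zip(offsets, compressed):
        buf[off:off + len(blob)] = blob
    return bytes(buf), offsets, sizes


def read_chunk(buf, offsets, sizes, i):
    """Decompress chunk i using the stored offset and size metadata."""
    start = offsets[i]
    return zlib.decompress(buf[start:start + sizes[i]])
```

Because the offsets and sizes must be stored alongside the data for readers to locate each chunk, a scheme like this carries a small metadata overhead in exchange for the reduced output volume.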
Gagan Agrawal (Advisor)
Feng Qin (Committee Member)
Spyros Blanas (Committee Member)
162 p.

Recommended Citations

  • Bicer, T. (2014). Supporting Data-Intensive Scientific Computing on Bandwidth and Space Constrained Environments [Doctoral dissertation, Ohio State University]. OhioLINK Electronic Theses and Dissertations Center. http://rave.ohiolink.edu/etdc/view?acc_num=osu1397749544

    APA Style (7th edition)

  • Bicer, Tekin. Supporting Data-Intensive Scientific Computing on Bandwidth and Space Constrained Environments. 2014. Ohio State University, Doctoral dissertation. OhioLINK Electronic Theses and Dissertations Center, http://rave.ohiolink.edu/etdc/view?acc_num=osu1397749544.

    MLA Style (8th edition)

  • Bicer, Tekin. "Supporting Data-Intensive Scientific Computing on Bandwidth and Space Constrained Environments." Doctoral dissertation, Ohio State University, 2014. http://rave.ohiolink.edu/etdc/view?acc_num=osu1397749544

    Chicago Manual of Style (17th edition)