Experiments in Compressing Wikipedia

Wotschka, Marco

2013, Master of Science (MS), Ohio University, Computer Science (Engineering and Technology).
Wikipedia contains a large amount of information on a wide variety of topics and continues to grow rapidly. As this collection grows, so does the need for improved lossless compression programs. This thesis investigates several lossless, general-purpose compression techniques, including Burrows-Wheeler Compression (BWC) and Prediction by Partial Matching (PPM), and evaluates their performance on two benchmark files containing Wikipedia data. Improvements to BWC are proposed, described, and evaluated, and several preprocessing stages are introduced and tested. The thesis proposes an Improved Burrows-Wheeler Compression (IBWC) scheme that uses a multi-threaded Burrows-Wheeler Transform (BWT), a Move-Fraction Transform (MF), and PPM, and that combines good compression with reasonable space and time requirements. IBWC compresses the first 1 GB of Wikipedia to 1.447 bits per character (44% better than gzip) and the first 100 MB of Wikipedia to 1.78 bits per character (38% better than gzip). Because it relies on the BWT, the approach works particularly well on long inputs that contain frequent repetitions of long strings. The compression performance of IBWC is also compared to gzip, bzip2, and PPM on two additional files: the complete genome of the model organism Caenorhabditis elegans and a collection of books obtained from Project Gutenberg. In both cases, IBWC provides compression performance similar to PPM.
David Juedes (Advisor)
David Chelberg (Committee Member)
Cynthia Marling (Committee Member)
Sergio Lopez (Other)
106 p.
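
The abstract relies on two ideas that a small example may make more concrete: the Burrows-Wheeler Transform and the bits-per-character metric. The Python sketch below is an illustration only and is not the thesis's implementation. It uses a naive, single-threaded BWT built from sorted rotations (the thesis describes a multi-threaded BWT), and it does not attempt the Move-Fraction Transform or PPM stages. The function names, the NUL sentinel, and the assumptions that 1 GB means 10^9 characters and that one character occupies one byte are all hypothetical choices made for this example.

```python
def bwt(text: str, sentinel: str = "\0") -> str:
    """Naive Burrows-Wheeler Transform via sorted rotations.

    The sentinel (an assumption of this sketch) marks the end of the
    input and must not occur in the text. This takes O(n^2 log n) time
    and is only suitable for short strings.
    """
    if sentinel in text:
        raise ValueError("sentinel must not occur in the input")
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    # The transform is the last column of the sorted rotation table.
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(last_column: str, sentinel: str = "\0") -> str:
    """Invert the transform by rebuilding the rotation table column by column."""
    table = [""] * len(last_column)
    for _ in range(len(last_column)):
        table = sorted(last_column[i] + table[i] for i in range(len(last_column)))
    # The original text is the row that ends with the sentinel.
    row = next(r for r in table if r.endswith(sentinel))
    return row[:-1]


def bits_per_character(compressed_bytes: int, original_chars: int) -> float:
    """Compressed size in bits divided by the number of input characters."""
    return 8 * compressed_bytes / original_chars


def compressed_size_bytes(bpc: float, original_chars: int) -> float:
    """Approximate compressed size implied by a bits-per-character figure."""
    return bpc * original_chars / 8


if __name__ == "__main__":
    transformed = bwt("banana")
    print(repr(transformed))
    print(inverse_bwt(transformed))
    # Assuming 1 GB = 10**9 one-byte characters, 1.447 bits per
    # character corresponds to roughly 181 MB of compressed output.
    print(compressed_size_bytes(1.447, 10**9) / 1e6)
```

In the transformed string, characters that precede similar contexts end up next to each other, which is why inputs with many long repeated strings tend to compress well after a BWT. The size helper only converts between the units used in the abstract and makes no claim about the actual file sizes reported in the thesis.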

Recommended Citations

  • Wotschka, M. (2013). Experiments in Compressing Wikipedia [Master's thesis, Ohio University]. OhioLINK Electronic Theses and Dissertations Center. http://rave.ohiolink.edu/etdc/view?acc_num=ohiou1376909207

    APA Style (7th edition)

  • Wotschka, Marco. Experiments in Compressing Wikipedia. 2013. Ohio University, Master's thesis. OhioLINK Electronic Theses and Dissertations Center, http://rave.ohiolink.edu/etdc/view?acc_num=ohiou1376909207.

    MLA Style (8th edition)

  • Wotschka, Marco. "Experiments in Compressing Wikipedia." Master's thesis, Ohio University, 2013. http://rave.ohiolink.edu/etdc/view?acc_num=ohiou1376909207

    Chicago Manual of Style (17th edition)