Genomes are now being sequenced for more organisms than ever before, creating greater demand within the biological community to understand the biological mechanisms encoded in each organism's genomic blueprint. While biologists continue to analyze genomes and identify new functional elements, several genomic regions are often overlooked, such as non-protein-coding regions, introns, and intergenic regions. Several bioinformatics algorithms exist to discover functional elements (referred to herein as words) in these regions.
In this thesis, a functional genomics toolkit for finding the functional words of genomes (vocabularies) is presented and described. Currently available vocabulary-based tools are limited when analyzing large input sequences. To overcome this limitation, a scalable word-searching approach is presented and tested on genomic sequences with file sizes of up to 2 gigabytes (GB). In addition, the toolkit is used to provide a genome-wide characterization of the Arabidopsis thaliana genome in terms of over- and under-represented repeats within specific genome regions, and to search for similarities between putative functional elements in the human and Arabidopsis thaliana genomes, thereby producing a putative vocabulary. The difficulties encountered during the research and suggestions for future work are also discussed.