The Characterization and Utilization of Middle-range Sequence Patterns within the Human Genome

Shepard, Samuel Steven

2010, Doctor of Philosophy in Biomedical Sciences (Ph.D.), University of Toledo, College of Medicine.

Mid-range inhomogeneity (MRI) is the significant enrichment of particular nucleotides in genomic sequences extending from 30 to 10,000 nucleotides. MRI can be observed for all nucleotide pairings (e.g., G+C, A+G, and G+T) as well as for individual bases. Various types of MRI regions are 4 to 20 times enriched in mammalian genomes compared to their occurrences in random models. We first show how different types of mutations change MRI regions. Human, chimpanzee, and Macaca mulatta genomes were aligned to study the projected effects of substitutions and indels on human sequence evolution within both MRI regions and control regions of average nucleotide composition. Over 18.8 million fixed point substitutions, 3.9 million SNPs, and indels spanning 6.9 Mb were procured and evaluated in human, of which 1.8 Mb substitutions and 1.9 Mb indels lie within MRI regions. Ancestral and mutant alleles for substitutions were determined. Substitutions were grouped according to their fixation within human populations: fixed substitutions (from the human-chimp-macaca alignment), major SNPs (> 80% mutant allele frequency within humans), medium SNPs (20%-80%), minor SNPs (3%-20%), and rare SNPs (< 3%). Data on short (< 3 bp) and medium-length (3-50 bp) insertions and deletions within MRI regions and appropriate control regions were analyzed for their effect on the expansion or diminution of such regions as well as on changing nucleotide composition. MRI regions have levels of de novo mutations comparable to those of the control genomic sequences. Newer mutations rapidly erode MRI regions, bringing their nucleotide composition toward genome-average levels. However, substitutions that favor the maintenance of MRI properties have a higher chance of spreading throughout the human population. Indels have a clear tendency to maintain MRI features but have a smaller impact than substitutions. Overall, the observed fixation bias for mutations helps maintain MRI regions during evolution.
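
The SNP frequency classes listed above can be illustrated with a minimal Python sketch. The function name `classify_snp` is hypothetical, and the handling of values that fall exactly on a boundary (20% and 3%) is an assumption, since the abstract gives only the ranges. Fixed substitutions are not covered here because they are identified from the three-species alignment rather than from allele frequency.

```python
def classify_snp(mutant_allele_freq: float) -> str:
    """Assign a SNP to a frequency class by its mutant allele frequency.

    The frequency is a fraction between 0 and 1. Boundary handling
    (exactly 0.20 -> "medium", exactly 0.03 -> "minor") is an assumption;
    the abstract states only the ranges.
    """
    if not 0.0 <= mutant_allele_freq <= 1.0:
        raise ValueError("allele frequency must lie between 0 and 1")
    if mutant_allele_freq > 0.80:
        return "major"
    if mutant_allele_freq >= 0.20:
        return "medium"
    if mutant_allele_freq >= 0.03:
        return "minor"
    return "rare"
```

For example, under these assumptions a SNP whose mutant allele is carried by half of the sampled chromosomes would be classed as medium, and one at 1% as rare.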

Next, we discuss the splicing of large introns (over 50,000 base pairs) in mammals. Large introns must be spliced out of the pre-mRNA in a timely fashion, which involves bringing together distant 5′ and 3′ splice sites. In Drosophila, large introns can be spliced efficiently through a process known as recursive splicing. We computationally demonstrate that vertebrates lack the proper enrichment of RP-sites in their large introns and therefore require some other method to aid splicing. Over 15,000 non-redundant large introns from six mammals, 1,600 from chicken and zebrafish, and 560 from five invertebrates were analyzed. Unlike the studied invertebrates, the studied vertebrate genomes consistently contain abundant direct- and complementary-strand interspersed repetitive elements (mainly SINEs and LINEs) that may form stems with each other within large introns. Indeed, predicted stems were abundant and stable in the large introns of mammals. We hypothesize that stable stems with long loops within large introns allow splice sites to find each other more quickly by folding the intronic RNA upon itself.
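
The idea that a direct-strand element and a complementary-strand copy can pair into a stem can be sketched as a naive search for exact reverse-complement segments separated by a loop. This is only an illustration, not the stem-prediction method used in the dissertation: the function names, the exact-match requirement, and the `arm_len` and `min_loop` parameters are all assumptions, and no folding energy is computed.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(comp[base] for base in reversed(seq))


def find_stem_candidates(seq: str, arm_len: int, min_loop: int):
    """List (i, j) pairs where seq[i:i+arm_len] could pair with seq[j:j+arm_len].

    A pair qualifies when the downstream arm is the exact reverse complement
    of the upstream arm and at least `min_loop` bases separate the two arms.
    Hypothetical helper for illustration only.
    """
    # Index every window of length arm_len by its start positions.
    positions = {}
    for i in range(len(seq) - arm_len + 1):
        positions.setdefault(seq[i:i + arm_len], []).append(i)

    hits = []
    for i in range(len(seq) - arm_len + 1):
        target = reverse_complement(seq[i:i + arm_len])
        for j in positions.get(target, []):
            if j - (i + arm_len) >= min_loop:
                hits.append((i, j))
    return hits
```

With a toy sequence such as `"GGGAAC" + "TTTTTTTT" + "GTTCCC"`, an arm length of 6, and a minimum loop of 8, the two flanking segments are reported as one candidate stem. Real repeat-derived stems are imperfect and far longer, so a practical analysis would rely on alignment and RNA folding tools rather than exact matching.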

Finally, we extend and complement existing Markov model algorithms by developing and testing a novel binary-abstracted Markov model (BAMM) algorithm. BAMM can emphasize selected portions of genomic sequence signals according to specific abstraction rules. We present abstraction rules that generalize genomic sequence patterns at the single nucleotide level up to the level of tetranucleotides, using both in-frame data and data of mixed reading frames. We develop context-dependent abstraction rules that emphasize genomic sequence repetition. Unlike traditional Markov models, BAMM can analyze nucleotide patterns on the short-range (< 20 bp) up to the mid-range (20 to 50 bp) scale. Abstraction rules can be either frame-sensitive or frame-independent. We build classifiers for both coding sequences and introns as well as for 5′ and 3′ UTR data. Using support vector machines, we demonstrate that we can combine multiple BAMM classifiers to achieve even better exon-intron classification accuracy.
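
The general idea of abstracting a sequence to binary symbols before fitting a Markov model can be sketched as follows. This is a minimal illustration under stated assumptions, not the BAMM algorithm itself: the abstract does not specify its abstraction rules, so the single-nucleotide rule used here (purines map to 1, everything else to 0), the add-one smoothing, and all function names are hypothetical.

```python
import math
from collections import Counter


def abstract_sequence(seq: str, ones: str = "AG") -> str:
    """Map each nucleotide to '1' if it is in `ones`, else '0' (assumed rule)."""
    return "".join("1" if base in ones else "0" for base in seq.upper())


def train_markov(bit_strings, order: int):
    """Count contexts and transitions for an order-`order` binary Markov chain."""
    context_counts = Counter()
    transition_counts = Counter()
    for bits in bit_strings:
        for i in range(order, len(bits)):
            context = bits[i - order:i]
            context_counts[context] += 1
            transition_counts[(context, bits[i])] += 1
    return context_counts, transition_counts


def log_likelihood(bits: str, model, order: int) -> float:
    """Score a binary string under a trained model with add-one smoothing."""
    context_counts, transition_counts = model
    total = 0.0
    for i in range(order, len(bits)):
        context = bits[i - order:i]
        # Two possible next symbols, hence +2 in the denominator.
        p = (transition_counts[(context, bits[i])] + 1) / (context_counts[context] + 2)
        total += math.log(p)
    return total
```

A two-class classifier in this style would train one model per class (for example, exons and introns) on abstracted training sequences and assign a query to the class whose model gives the higher log-likelihood. Because the alphabet has only two symbols, the number of contexts grows as 2^order rather than 4^order, which is one plausible reason a binary abstraction can reach longer ranges than a nucleotide-level model.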

Alexei Fedorov, PhD (Committee Chair)
Robert Blumenthal, PhD (Committee Member)
John Gray, PhD (Committee Member)
Sadik Khuder, PhD (Committee Member)
Robert Trumbly, PhD (Committee Member)
192 p.

Recommended Citations

  • Shepard, S. S. (2010). The Characterization and Utilization of Middle-range Sequence Patterns within the Human Genome [Doctoral dissertation, University of Toledo]. OhioLINK Electronic Theses and Dissertations Center. http://rave.ohiolink.edu/etdc/view?acc_num=mco1271271172

    APA Style (7th edition)

  • Shepard, Samuel. The Characterization and Utilization of Middle-range Sequence Patterns within the Human Genome. 2010. University of Toledo, Doctoral dissertation. OhioLINK Electronic Theses and Dissertations Center, http://rave.ohiolink.edu/etdc/view?acc_num=mco1271271172.

    MLA Style (8th edition)

  • Shepard, Samuel. "The Characterization and Utilization of Middle-range Sequence Patterns within the Human Genome." Doctoral dissertation, University of Toledo, 2010. http://rave.ohiolink.edu/etdc/view?acc_num=mco1271271172

    Chicago Manual of Style (17th edition)