Arabic Language Modeling with Stem-Derived Morphemes for Automatic Speech Recognition

Heintz, Ilana

2010, Doctor of Philosophy, Ohio State University, Linguistics.

The goal of this dissertation is to introduce a method for deriving morphemes from Arabic words using stem patterns, a feature of Arabic morphology. The motivations are three-fold: modeling with morphemes rather than words should help address the out-of-vocabulary problem; working with stem patterns should prove to be a cross-dialectally valid method for deriving morphemes using a small amount of linguistic knowledge; and the stem patterns should allow for the prediction of short vowel sequences that are missing from the text. The out-of-vocabulary problem is acute in Modern Standard Arabic due to its rich morphology, including a large inventory of inflectional affixes and clitics that combine in many ways to increase the rate of vocabulary growth. The problem of creating tools that work across dialects is challenging due to the many differences between regional dialects and formal Arabic, and because of the lack of text resources on which to train natural language processing (NLP) tools. The short vowels, while missing from standard orthography, provide information that is crucial to both acoustic modeling and grammatical inference, and therefore must be inserted into the text to train the most predictive NLP models. While other morpheme derivation methods exist that address one or two of the above challenges, none addresses all three with a single solution.

The stem pattern derivation method is tested on the task of automatic speech recognition (ASR) and compared to three other morpheme derivation methods as well as word-based language models. We find that the utility of morphemes in increasing word accuracy scores on the ASR task is highly dependent on the ASR system's ability to accommodate the morphemes in the acoustic and pronunciation models. In experiments involving both Modern Standard Arabic and Levantine Conversational Arabic data, we find that knowledge-light methods of morpheme derivation may work as well as knowledge-rich methods. We also find that morpheme derivation methods that produce a single morpheme hypothesis per word yield stronger models than those that spread probability mass across several hypotheses per word; however, the multi-hypothesis model may be strengthened by applying informed weights to the predicted morpheme sequences. Furthermore, we exploit the flexibility of Finite State Machines, with which the stem pattern derivation method is implemented, to predict short vowels. The result is a comprehensive exploration not only of the stem pattern derivation method, but also of the use of morphemes in Arabic language modeling for automatic speech recognition.

Chris Brew, PhD (Committee Co-Chair)
J. Eric Fosler-Lussier, PhD (Committee Co-Chair)
Michael White, PhD (Committee Member)
202 p.

Recommended Citations

  • Heintz, I. (2010). Arabic Language Modeling with Stem-Derived Morphemes for Automatic Speech Recognition [Doctoral dissertation, Ohio State University]. OhioLINK Electronic Theses and Dissertations Center. http://rave.ohiolink.edu/etdc/view?acc_num=osu1275053334

    APA Style (7th edition)

  • Heintz, Ilana. Arabic Language Modeling with Stem-Derived Morphemes for Automatic Speech Recognition. 2010. Ohio State University, Doctoral dissertation. OhioLINK Electronic Theses and Dissertations Center, http://rave.ohiolink.edu/etdc/view?acc_num=osu1275053334.

    MLA Style (8th edition)

  • Heintz, Ilana. "Arabic Language Modeling with Stem-Derived Morphemes for Automatic Speech Recognition." Doctoral dissertation, Ohio State University, 2010. http://rave.ohiolink.edu/etdc/view?acc_num=osu1275053334

    Chicago Manual of Style (17th edition)