Practical Morphological Modeling: Insights from Dialectal Arabic


2020, Doctor of Philosophy, Ohio State University, Linguistics.
This thesis treats a major challenge for current state-of-the-art natural language processing (NLP) pipelines: morphologically rich languages, where large numbers of inflected forms or weak form-meaning correspondence lead to data sparsity and noise. For example, if the lexeme TEACHER occurs the same number of times in an English text and an Arabic text, those occurrences will be spread over just four forms in English (teacher, teacher's, teachers' and teachers) versus numerous forms in Arabic, leading to more low-frequency and out-of-vocabulary forms at test time. Furthermore, while the +s suffix of teachers is highly predictable, there is significant entropy involved in predicting how pluralization will be realized in Arabic, which can make models noisy. That said, the particular means of realizing pluralization (among other properties) can be informative in Arabic: the +wn in mdrswn, 'teachers', indicates not only plurality but also that the referent is human. To address data sparsity and noise from morphological richness, I propose practical means of inducing morphological information and/or incorporating it into preprocessing steps or model components, depending on the task at hand. The goals of this intervention are twofold. First, I aim to link variant inflections of the same lexeme to reduce sparsity. Second, I aim to mitigate noise by identifying the morphosyntactic properties encoded in complex inflections like mdrswn and leveraging them to help models interpret low-frequency or out-of-vocabulary forms. To be practical, morphological modeling should be maximally language-agnostic, i.e., portable to new languages or domains with minimal human effort, and maximally cheap in terms of the amount and cost of manual supervision required. Thus, I explore morphological modeling strategies and morphological resource creation, progressing toward more language-agnostic solutions requiring less supervision over the course of this thesis.
To start, I look at a low-resource machine translation system from Egyptian Arabic to Levantine Arabic, demonstrating that even a noisy external morphological resource can help. This raises the question: how do we develop such resources for languages and dialects where they do not already exist, and how do we expand them when they do? From there, I discuss strategies for rapidly expanding a pre-existing morphological lexicon to include new lexemes. Then I consider the scenario where morphological resources do not exist but minimal funding is available to generate some. For this setup, I propose building delexicalized grammars describing closed-class affixes and their combinatorics. I propose a language-agnostic framework for developing such a grammar for a new language or dialect and demonstrate that it is an effective alternative to expensive morphological resources that depend on lexical (open-class) information. Finally, this thesis reports early but promising results for the first fully unsupervised paradigmatic morphological analyzer, which receives only unlabeled text as input. I highlight several directions for improvement, with the ultimate goal of incorporating unsupervised, paradigmatic morphological analyses into standard NLP pipelines. This has the potential to greatly reduce sparsity and noise in downstream tasks.
Marie-Catherine de Marneffe (Advisor)
Micha Elsner (Committee Member)
Nizar Habash (Committee Member)
Andrea Sims (Committee Member)
266 p.

Recommended Citations


  • Erdmann, A. (2020). Practical Morphological Modeling: Insights from Dialectal Arabic [Doctoral dissertation, Ohio State University]. OhioLINK Electronic Theses and Dissertations Center. http://rave.ohiolink.edu/etdc/view?acc_num=osu1598006284544079

    APA Style (7th edition)

  • Erdmann, Alexander. Practical Morphological Modeling: Insights from Dialectal Arabic. 2020. Ohio State University, Doctoral dissertation. OhioLINK Electronic Theses and Dissertations Center, http://rave.ohiolink.edu/etdc/view?acc_num=osu1598006284544079.

    MLA Style (8th edition)

  • Erdmann, Alexander. "Practical Morphological Modeling: Insights from Dialectal Arabic." Doctoral dissertation, Ohio State University, 2020. http://rave.ohiolink.edu/etdc/view?acc_num=osu1598006284544079

    Chicago Manual of Style (17th edition)