Practical Morphological Modeling: Insights from Dialectal Arabic

Erdmann, Alexander

Practical_Morphological_Modeling__Insights_from_Dialectal_Arabic.pdf (5.58 MB)

Author Info

Erdmann, Alexander

ORCID® Identifier

http://orcid.org/0000-0001-5529-1659

Permalink:

http://rave.ohiolink.edu/etdc/view?acc_num=osu1598006284544079

Year and Degree

2020, Doctor of Philosophy, Ohio State University, Linguistics.

Abstract

This thesis addresses a major challenge for current state-of-the-art natural language processing (NLP) pipelines: morphologically rich languages, where a large number of inflected forms or weak form-meaning correspondence leads to data sparsity and noise. For example, if the lexeme TEACHER occurs the same number of times in an English text and an Arabic text, those occurrences will be spread over just four forms in English (teacher, teacher's, teachers', and teachers) versus numerous forms in Arabic, leading to more low-frequency and out-of-vocabulary forms at test time. Furthermore, while the +s suffix of teachers is highly predictable, there is significant entropy involved in predicting how pluralization will be realized in Arabic, which can make models noisy. That said, the particular means of realizing pluralization (among other properties) can be informative in Arabic: the +wn in mdrswn, 'teachers', indicates not only plurality but also that the referent is human. To address data sparsity and noise from morphological richness, I propose practical means of inducing morphological information and/or incorporating it into preprocessing steps or model components, depending on the task at hand. The goals of this intervention are twofold. First, I aim to link variant inflections of the same lexeme to reduce sparsity. Second, I aim to mitigate noise by identifying the morphosyntactic properties encoded in complex inflections like mdrswn and leveraging them to help models interpret low-frequency or out-of-vocabulary forms. To be practical, morphological modeling should be maximally language agnostic, i.e., portable to new languages or domains with minimal human effort, and maximally cheap in terms of the amount and cost of manual supervision required. Thus, I explore morphological modeling strategies and morphological resource creation, progressing toward more language-agnostic solutions requiring less supervision over the course of this thesis.
To start, I look at a low-resource machine translation system from Egyptian Arabic to Levantine Arabic, demonstrating that even a noisy external morphological resource can help. This raises the question: how do we develop such resources for languages and dialects where they do not already exist, and how do we expand them when they do? From there, I discuss strategies for rapidly expanding a pre-existing morphological lexicon to include new lexemes. Then I consider the scenario where morphological resources do not exist but minimal funding is available to create some. For this setup, I propose building delexicalized grammars describing closed-class affixes and their combinatorics. I propose a language-agnostic framework for developing such a grammar for a new language or dialect and demonstrate that it is an effective alternative to expensive morphological resources that depend on lexical (open-class) information. Finally, this thesis reports early but promising results for the first fully unsupervised paradigmatic morphological analyzer, which receives only unlabeled text as input. I highlight several directions for improvement, with the ultimate goal of incorporating unsupervised, paradigmatic morphological analyses into standard NLP pipelines. This has the potential to greatly reduce sparsity and noise in downstream tasks.

Committee

Marie-Catherine de Marneffe (Advisor)
Micha Elsner (Committee Member)
Nizar Habash (Committee Member)
Andrea Sims (Committee Member)

Pages

266 p.

Subject Headings

Computer Science; Linguistics

Keywords

Computational Linguistics; Unsupervised Learning; Computational Morphology; Machine Translation; Segmentation; Arabic Dialectology; Language Complexity; Linguistic Typology

Recommended Citations

APA Style (7th edition)
Erdmann, A. (2020). Practical Morphological Modeling: Insights from Dialectal Arabic [Doctoral dissertation, Ohio State University]. OhioLINK Electronic Theses and Dissertations Center. http://rave.ohiolink.edu/etdc/view?acc_num=osu1598006284544079

MLA Style (8th edition)
Erdmann, Alexander. Practical Morphological Modeling: Insights from Dialectal Arabic. 2020. Ohio State University, Doctoral dissertation. OhioLINK Electronic Theses and Dissertations Center, http://rave.ohiolink.edu/etdc/view?acc_num=osu1598006284544079.

Chicago Manual of Style (17th edition)
Erdmann, Alexander. "Practical Morphological Modeling: Insights from Dialectal Arabic." Doctoral dissertation, Ohio State University, 2020. http://rave.ohiolink.edu/etdc/view?acc_num=osu1598006284544079

Document number:

osu1598006284544079

Download Count:

462

Copyright Info

Practical Morphological Modeling: Insights from Dialectal Arabic by Alexander Erdmann is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported License. Based on a work at etd.ohiolink.edu.
This open access ETD is published by The Ohio State University and OhioLINK.
