Files
Practical_Morphological_Modeling__Insights_from_Dialectal_Arabic.pdf (5.58 MB)
Practical Morphological Modeling: Insights from Dialectal Arabic
Author Info
Erdmann, Alexander
ORCID® Identifier
http://orcid.org/0000-0001-5529-1659
Permalink:
http://rave.ohiolink.edu/etdc/view?acc_num=osu1598006284544079
Abstract Details
Year and Degree
2020, Doctor of Philosophy, Ohio State University, Linguistics.
Abstract
This thesis addresses a major challenge for current state-of-the-art natural language processing (NLP) pipelines: morphologically rich languages, where many inflected forms or weak form-meaning correspondence lead to data sparsity and noise. For example, if the lexeme TEACHER occurs the same number of times in an English text and an Arabic text, those occurrences will be spread over just four forms in English (teacher, teacher's, teachers' and teachers) versus numerous forms in Arabic, leading to more low-frequency and out-of-vocabulary forms at test time. Furthermore, while the +s suffix of teachers is highly predictable, there is significant entropy involved in predicting how pluralization will be realized in Arabic, which can cause models to be noisy. That said, the particular means of realizing pluralization (among other properties) can be informative in Arabic, as the +wn in mdrswn, 'teachers', indicates not only plurality but also that the referent is human. To address data sparsity and noise from morphological richness, I propose practical means of inducing morphological information and/or incorporating it in preprocessing steps or model components, depending on the task at hand. The goals of this intervention are twofold. First, I aim to link variant inflections of the same lexeme to reduce sparsity. Second, I aim to mitigate noise by identifying morphosyntactic properties encoded in complex inflections like mdrswn and leveraging them to help models interpret low-frequency or out-of-vocabulary forms. To be practical, morphological modeling should be maximally language-agnostic, i.e., portable to new languages or domains with minimal human effort, and maximally cheap in terms of the amount and cost of required manual supervision. Thus, I explore morphological modeling strategies and morphological resource creation, progressing toward more language-agnostic solutions requiring less supervision over the course of this thesis.
To start, I look at a low-resource machine translation system from Egyptian Arabic to Levantine Arabic, demonstrating that even a noisy external morphological resource can help. This raises the question: how do we develop such resources for languages and dialects where they do not already exist, and how do we expand them where they do exist? From there, I discuss strategies for rapidly expanding a pre-existing morphological lexicon to include new lexemes. Then I consider the scenario where morphological resources do not exist but minimal funding is available to generate some. For this setup, I propose building delexicalized grammars describing closed-class affixes and their combinatorics. I propose a language-agnostic framework for developing such a grammar for a new language or dialect and demonstrate that it is an effective alternative to expensive morphological resources that depend on lexical (open-class) information. Finally, this thesis reports early but promising results for the first fully unsupervised paradigmatic morphological analyzer, which receives only unlabeled text as input. I highlight several directions for improvement, with the ultimate goal of incorporating unsupervised, paradigmatic morphological analyses into standard NLP pipelines. This has the potential to greatly reduce sparsity and noise in downstream tasks.
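The sparsity argument in the abstract can be illustrated with a small sketch. It is not taken from the thesis: the token lists are toy data, the Arabic forms other than mdrswn are illustrative Buckwalter-style transliterations, and the `LEXEME` lookup table is a hypothetical stand-in for whatever morphological analyzer would link inflections to a lexeme. It only shows that spreading the same number of occurrences over more forms raises the out-of-vocabulary rate, and that linking forms to a shared lexeme removes that effect.

```python
def oov_rate(train_tokens, test_tokens):
    """Fraction of test tokens whose exact form never occurs in training."""
    seen = set(train_tokens)
    unseen = sum(1 for tok in test_tokens if tok not in seen)
    return unseen / len(test_tokens)

# Toy data (assumption): four occurrences of TEACHER per split and language.
english_train = ["teacher", "teachers", "teacher", "teacher's"]
english_test = ["teachers", "teacher", "teachers'", "teacher"]
arabic_train = ["mdrs", "mdrswn", "Almdrs", "wmdrs"]
arabic_test = ["mdrsyn", "mdrs", "Almdrswn", "mdrsAt"]

# Hypothetical lookup standing in for a morphological analyzer:
# every surface form above maps to the same lexeme.
all_forms = english_train + english_test + arabic_train + arabic_test
LEXEME = {form: "TEACHER" for form in all_forms}

def link_to_lexeme(tokens):
    """Replace each form with its lexeme when known, else keep the form."""
    return [LEXEME.get(tok, tok) for tok in tokens]

english_oov = oov_rate(english_train, english_test)
arabic_oov = oov_rate(arabic_train, arabic_test)
arabic_linked_oov = oov_rate(link_to_lexeme(arabic_train),
                             link_to_lexeme(arabic_test))

print(f"English OOV rate: {english_oov:.2f}")
print(f"Arabic OOV rate: {arabic_oov:.2f}")
print(f"Arabic OOV rate after lexeme linking: {arabic_linked_oov:.2f}")
```

In this toy setup the Arabic split has more unseen forms than the English one, and the gap disappears once forms are mapped to their lexeme. Real data would of course need an actual analyzer rather than a hand-written table.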
Committee
Marie-Catherine de Marneffe (Advisor)
Micha Elsner (Committee Member)
Nizar Habash (Committee Member)
Andrea Sims (Committee Member)
Pages
266 p.
Subject Headings
Computer Science; Linguistics
Keywords
Computational Linguistics; Unsupervised Learning; Computational Morphology; Machine Translation; Segmentation; Arabic Dialectology; Language Complexity; Linguistic Typology
Recommended Citations
Erdmann, A. (2020). Practical Morphological Modeling: Insights from Dialectal Arabic [Doctoral dissertation, Ohio State University]. OhioLINK Electronic Theses and Dissertations Center. http://rave.ohiolink.edu/etdc/view?acc_num=osu1598006284544079
APA Style (7th edition)
Erdmann, Alexander. Practical Morphological Modeling: Insights from Dialectal Arabic. 2020. Ohio State University, Doctoral dissertation. OhioLINK Electronic Theses and Dissertations Center, http://rave.ohiolink.edu/etdc/view?acc_num=osu1598006284544079.
MLA Style (8th edition)
Erdmann, Alexander. "Practical Morphological Modeling: Insights from Dialectal Arabic." Doctoral dissertation, Ohio State University, 2020. http://rave.ohiolink.edu/etdc/view?acc_num=osu1598006284544079
Chicago Manual of Style (17th edition)
Document number:
osu1598006284544079
Download Count:
388
Copyright Info
© 2020, some rights reserved.
Practical Morphological Modeling: Insights from Dialectal Arabic by Alexander Erdmann is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported License. Based on a work at etd.ohiolink.edu.
This open access ETD is published by The Ohio State University and OhioLINK.