Multilingual Distributional Lexical Similarity


2008, Doctor of Philosophy, Ohio State University, Linguistics.

One of the most fundamental problems in natural language processing involves words that are not in the dictionary, or unknown words. The supply of unknown words is virtually unlimited (proper names, technical jargon, foreign borrowings, newly created words, etc.), meaning that lexical resources like dictionaries and thesauri inevitably miss important vocabulary items. However, manually creating and maintaining broad-coverage dictionaries and ontologies for natural language processing is expensive and difficult. Instead, it is desirable to learn them from distributional lexical information, which can be obtained relatively easily from unlabeled or sparsely labeled text corpora. Rule-based approaches to acquiring or augmenting repositories of lexical information typically offer a high-precision, low-recall methodology that fails to generalize to new domains or scale to very large data sets. Classification-based approaches to organizing lexical material have more promising scaling properties, but require an amount of labeled training data that is usually not available on the necessary scale.

This dissertation addresses the problem of learning an accurate and scalable lexical classifier in the absence of large amounts of hand-labeled training data. One approach to this problem involves using a rule-based system to generate large amounts of data that serve as training examples for a secondary lexical classifier. The viability of this approach is demonstrated for the task of automatically identifying English loanwords in Korean. A set of rules describing changes English words undergo when they are borrowed into Korean is used to generate training data for an etymological classification task. Although the quality of the rule-based output is low, on a sufficient scale it is reliable enough to train a classifier that is robust to the deficiencies of the original rule-based output and reaches a level of performance that has previously been obtained only with access to substantial hand-labeled training data.
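The rule-then-classifier pipeline described above can be sketched in a few lines. Everything here is illustrative: the substitution rules, the example words, and the character-bigram naive-Bayes-style classifier are simplified stand-ins for the dissertation's far richer borrowing rules and models, not a reproduction of them.

```python
from collections import Counter

# Toy English-to-Korean-style substitution rules (hypothetical stand-ins
# for the dissertation's detailed borrowing rules).
RULES = [("ph", "p"), ("f", "p"), ("v", "b"), ("th", "s"), ("z", "j")]

def rule_transliterate(word):
    """Apply the toy rules to produce a noisy loanword-like form."""
    out = word.lower()
    for src, tgt in RULES:
        out = out.replace(src, tgt)
    return out

def char_bigrams(word):
    w = f"#{word}#"  # mark word boundaries
    return [w[i:i + 2] for i in range(len(w) - 1)]

def train(examples):
    """Count character bigrams per class from (word, label) pairs."""
    counts = {"loan": Counter(), "native": Counter()}
    for word, label in examples:
        counts[label].update(char_bigrams(word))
    return counts

def classify(word, counts):
    """Score each class by add-one-smoothed bigram relative frequencies
    and return the class with the higher score."""
    best, best_score = None, -1.0
    for label, c in counts.items():
        total = sum(c.values())
        score = 1.0
        for bg in char_bigrams(word):
            score *= (c[bg] + 1) / (total + len(c) + 1)
        if score > best_score:
            best, best_score = label, score
    return best

# Silver-standard training data: rule output is labeled "loan" without
# any hand-checking; a few invented native-like forms are "native".
english = ["telephone", "coffee", "video", "philosophy", "pizza"]
native = ["saram", "hankuk", "mul", "nala", "kang"]
data = ([(rule_transliterate(w), "loan") for w in english]
        + [(w, "native") for w in native])
model = train(data)
```

At this toy scale the point is only the shape of the pipeline: noisy rule output, generated cheaply in volume, stands in for hand-labeled data when training the downstream classifier, which can then generalize beyond the rules' deficiencies.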

The second approach to the problem of obtaining labeled training data uses the output of a statistical parser to automatically generate lexical-syntactic co-occurrence features. These features are used to partition English verbs into lexical semantic classes, producing results on a substantially larger scale than any previously reported and yielding new insights into the properties of verbs that are responsible for their lexical categorization. The work here is geared towards automatically extending the coverage of verb classification schemes such as Levin's verb classes, VerbNet, and FrameNet to other verbs that occur in a large text corpus.
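A miniature version of this second approach: represent each verb as a vector of counts over parser-derived subcategorization frames, then attach an unclassified verb to the most similar seed verb whose class is already known. The frame inventory, counts, verbs, and class labels below are all invented for illustration; the dissertation works from real parser output at corpus scale.

```python
import math

# Hypothetical frame counts a statistical parser might yield per verb,
# over four frames: transitive NP, NP-plus-PP, sentential complement,
# and intransitive.
verb_vectors = {
    "break":   [120, 30,   1, 60],
    "shatter": [ 90, 10,   0, 70],
    "say":     [ 15,  5, 200,  2],
    "claim":   [ 20,  2, 150,  1],
}

def cosine(u, v):
    """Cosine similarity between two count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Seed verbs with a known (Levin-style) class; every other verb is
# assigned the class of its most similar seed.
seeds = {"break": "change-of-state", "say": "communication"}

def assign(verb):
    best = max(seeds, key=lambda s: cosine(verb_vectors[verb], verb_vectors[s]))
    return seeds[best]
```

Scaled up (real parser output, thousands of verbs, and clustering rather than nearest-seed assignment), this distributional similarity over syntactic frames is the ingredient used to extend resources like VerbNet and FrameNet to unclassified verbs.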

Christopher Brew, PhD (Advisor)
Michael White, PhD (Committee Member)
James Unger, PhD (Committee Member)
243 p.

Recommended Citations


  • Baker, K. (2008). Multilingual Distributional Lexical Similarity [Doctoral dissertation, Ohio State University]. OhioLINK Electronic Theses and Dissertations Center. http://rave.ohiolink.edu/etdc/view?acc_num=osu1221752517

    APA Style (7th edition)

  • Baker, Kirk. Multilingual Distributional Lexical Similarity. 2008. Ohio State University, Doctoral dissertation. OhioLINK Electronic Theses and Dissertations Center, http://rave.ohiolink.edu/etdc/view?acc_num=osu1221752517.

    MLA Style (8th edition)

  • Baker, Kirk. "Multilingual Distributional Lexical Similarity." Doctoral dissertation, Ohio State University, 2008. http://rave.ohiolink.edu/etdc/view?acc_num=osu1221752517

    Chicago Manual of Style (17th edition)