Improved automatic text understanding requires detailed linguistic information about the words that make up the text. Particularly crucial is knowledge about predicates, typically verbs, which communicate both the event being expressed and how participants are related to the event. Although the field of natural language processing (NLP) has yet to reach a clear consensus on guidelines for building a verb lexicon suitable for NLP applications, class-based construction of verb lexicons (e.g. Levin verb classification) has proved beneficial to a wide range of NLP tasks in combating the pervasive problem of data sparsity. Such broad-coverage dictionaries and ontologies are difficult and costly to create and maintain by hand; it is therefore desirable to learn them from distributional data, such as can be obtained from unlabeled text corpora. To this end, this thesis primarily addresses the following three questions:
First, deriving Levin-style verb classifications from text corpora helps avoid the expensive hand-coding of such information, but appropriate features must be identified and shown to be effective. One of our primary goals is to assess which linguistic conditions are crucial for the lexical classification of verbs. In particular, we experiment with different ways of mixing syntactic and lexical information for improved verb classification. The results show that both syntactic and lexical information are useful in automatic verb classification.
Second, Levin verb classification provides a systematic account of verb polysemy. We propose a class-based method for disambiguating Levin verbs using only untagged data. The basic working hypothesis is that verbs in the same Levin class tend to share their subcategorization patterns as well as neighboring words. In practice, information about unambiguous verbs is used to disambiguate ambiguous ones. The results suggest that this class-based method can be used in the absence of hand-tagged data.
Last, automatically created verb classifications are likely to deviate from manually created ones; it is therefore of great importance to understand whether automatically acquired verb classifications can benefit the wider NLP community. We propose to integrate verb class information, automatically learned from text corpora, into a particular parsing task, PP-attachment disambiguation. The results indicate that automatically acquired verb class information improves the performance of PP-attachment disambiguation models by alleviating the problem of data sparsity.