Lexicalized noun phrases are noun phrases that function as words. In English, lexicalized noun phrases are often realized as noun-noun compounds such as theater ticket and garbage man, or as adjective-noun phrases such as black market and high school. In specialized or technical subjects, phrases such as urban planning, air traffic control, highway engineering, and combinatorial mathematics are conventional names for concepts that are just as important as single-word terms such as adsorbents, hydrology, or aerodynamics. Yet although lexicalized noun phrases represent useful vocabulary and are cited in dictionaries, thesauri, and book indexes, the traditional linguistic literature has failed to establish consistent and categorical formal criteria for identifying them.
This study develops and evaluates a linguistically natural computational method for recognizing lexicalized noun phrases in a large corpus of English-language engineering text by synthesizing the insights of studies in traditional linguistics and computational linguistics. From the scholarship in theoretical linguistics, the analysis adopts the perspective that lexicalized noun phrases represent the names of concepts that are important to a community of speakers and have survived beyond a single context of use. Theoretical linguists have also proposed diagnostic tests for identifying lexicalized noun phrases, many of which can be formalized in a computational study. From the scholarship in computational linguistics, the analysis incorporates the view that a linguistic investigation can be extended and verified by processing relevant evidence from a corpus of text, which can be evaluated using mathematical models that do not require categorical input.
In engineering text, a small set of linguistic contexts, including professor of, department of, or studies in, yields long lists of lexicalized noun phrases, including public safety, abstract state machines, complex systems, computer graphics, and mathematical morphology. The study reported here identifies lexical and syntactic contexts that harbor lexicalized noun phrases and submits them to a machine-learning algorithm that classifies the lexical status of noun phrases extracted from the text. Results from several evaluations show that the linguistic evidence extracted from the corpus is relevant to the classification of noun phrases in engineering text. Informal evidence from other subject domains suggests that the results can be generalized.
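
The role of such contexts can be illustrated with a minimal sketch. It is not the method used in the study: it covers only the harvesting of candidate phrases after the three cue contexts named above, and it omits the machine-learning classification entirely. The function name harvest_candidates, the stop-word list, the three-word length limit, and the sample sentence are all assumptions introduced for illustration.

```python
import re
from collections import Counter

# Cue contexts named in the abstract.
CUES = ("professor of", "department of", "studies in")

# Hypothetical stop list: words assumed to end a candidate phrase.
STOP = {"and", "are", "at", "for", "in", "is", "of", "the", "to", "was", "who", "with"}

# A token is either a (possibly hyphenated) lowercase word or any other single character.
TOKEN = re.compile(r"(?P<word>[a-z]+(?:-[a-z]+)*)|\S")


def harvest_candidates(text, max_len=3):
    """Count the word sequences that immediately follow a cue context."""
    counts = Counter()
    lowered = text.lower()
    for cue in CUES:
        for cue_match in re.finditer(r"\b" + re.escape(cue) + r"\b", lowered):
            phrase = []
            # Scan forward from the end of the cue until the phrase is closed.
            for token in TOKEN.finditer(lowered, cue_match.end()):
                word = token.group("word")
                if word is None or word in STOP or len(phrase) == max_len:
                    break  # punctuation, stop word, or length limit
                phrase.append(word)
            if len(phrase) >= 2:  # keep multi-word candidates only
                counts[" ".join(phrase)] += 1
    return counts


sample = (
    "She is a professor of computer graphics at the university. "
    "The department of public safety, which sponsors studies in complex systems, "
    "also employs a professor of computer graphics."
)
print(harvest_candidates(sample).most_common())
```

A heuristic of this kind over-generates on real text, since any words that follow a cue are collected whether or not they form a lexicalized phrase. That is one reason to treat the harvested contexts as evidence for a classifier rather than as a final answer.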