Treebanks, as a quantitative extension of decades of syntactic theorizing, typically use annotation schemes with a small set of well-motivated phrasal categories. For constituency-based treebanks, these phrasal categories are selected to describe distributional regularities. These treebanks are often used as data sets for estimating Probabilistic Context-Free Grammars (PCFGs) for parsing, but the phrasal category sets that are best for constituency description may be suboptimal for constituency parsing. Specifically, phrasal categories may exhibit a probabilistic bias towards different expansions in different parts of the overall tree, and there may be unanticipated but useful correlations between constituency annotation and other levels of linguistic annotation.
In this thesis, the symbol-splitting technique of Johnson (1998) is extended to enrich syntactic categories with information about local syntactic context on the English Penn Treebank and the German Verbmobil II Treebank. The split symbols are then subjected to two different clustering techniques to preserve only relevant category distinctions, forming linguistically motivated generalizations and alleviating data sparsity. The symbol-splitting and clustering techniques are then employed, on the Verbmobil treebank, to enrich syntactic categories with information about implicit prosodic break strength, first alone and then together with information about local context.
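As a rough illustration of what symbol splitting does, the sketch below applies parent annotation, the simplest form of the technique, to a toy tree and then estimates PCFG rule probabilities by relative frequency. It is a minimal sketch under stated assumptions, not the procedure used in the thesis: the nested-tuple tree format, the function names `parent_annotate`, `rule_counts` and `estimate_pcfg`, and the example sentence are all hypothetical, and neither the richer context features, the clustering step, nor the prosodic annotation is shown.

```python
from collections import Counter

# Assumed toy representation: a tree is a nested tuple (label, child, ...),
# and a leaf (word) is a plain string.

def parent_annotate(tree, parent=None):
    """Split each phrasal category by appending its parent's label (e.g. NP^S)."""
    if isinstance(tree, str):
        return tree
    label, children = tree[0], tree[1:]
    # Preterminals (nodes dominating a single word) are left unsplit here.
    is_preterminal = len(children) == 1 and isinstance(children[0], str)
    if parent is None or is_preterminal:
        new_label = label
    else:
        new_label = f"{label}^{parent}"
    # Children see the ORIGINAL label of this node as their parent.
    return (new_label,) + tuple(parent_annotate(c, label) for c in children)

def rule_counts(tree, counts):
    """Accumulate counts of every rule LHS -> RHS used in the tree."""
    if isinstance(tree, str):
        return
    rhs = tuple(c if isinstance(c, str) else c[0] for c in tree[1:])
    counts[(tree[0], rhs)] += 1
    for child in tree[1:]:
        rule_counts(child, counts)

def estimate_pcfg(trees):
    """Relative-frequency estimate: P(LHS -> RHS) = count(rule) / count(LHS)."""
    counts = Counter()
    for t in trees:
        rule_counts(t, counts)
    lhs_totals = Counter()
    for (lhs, _), n in counts.items():
        lhs_totals[lhs] += n
    return {rule: n / lhs_totals[rule[0]] for rule, n in counts.items()}

# Hypothetical example tree for "she saw the dog".
tree = ("S",
        ("NP", ("PRP", "she")),
        ("VP", ("VBD", "saw"),
               ("NP", ("DT", "the"), ("NN", "dog"))))

annotated = parent_annotate(tree)
plain_grammar = estimate_pcfg([tree])
split_grammar = estimate_pcfg([annotated])
```

In the unsplit grammar the single category NP divides its probability mass between the pronoun expansion and the determiner-noun expansion, whereas after splitting, NP^S and NP^VP each receive their own distribution. This is the kind of position-dependent expansion bias that splitting is meant to expose; the price is a larger symbol set with fewer observations per symbol, which is the sparsity that a subsequent clustering step would address.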
Local syntactic context is found to be helpful on both treebanks examined. Experiments on the German Verbmobil II Treebank then show that information about implicit prosodic break strength yields a slightly larger gain than information about local syntactic context, and that combining both sorts of information leads to the largest increase in parse accuracy. This research shows that implicit prosody, as imposed by the annotators of the Verbmobil project, does vary with syntactic structure in a useful way outside of a laboratory setting. It moreover suggests that prosody merits exploration as a cue to grammar learning in children.