Penn Chinese Treebank Project

Chinese Language Processing at the University of Colorado

CU's Chinese Language Processing program is anchored by linguistic corpora annotated with morphological, syntactic, semantic, and discourse structures. The Chinese Treebank, started at the University of Pennsylvania, is a segmented, part-of-speech tagged, and fully bracketed corpus that currently has 780 thousand words (over 1.28 million Chinese characters). The sources of this corpus are mostly Xinhua newswire, Sinorama news magazine, and Hong Kong News. The segmentation, POS-tagging, and syntactic bracketing standards are fully documented.
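To make "segmented, part-of-speech tagged, and fully bracketed" concrete, here is a minimal Python sketch that reads one Penn-style bracketed tree and recovers the segmented words with their POS tags. The example sentence is invented for illustration and is not taken from the corpus; the function names `parse_bracketed` and `tagged_words` are hypothetical, and the released data files may carry additional markup that this sketch does not handle.

```python
def parse_bracketed(text):
    """Parse one Penn-style bracketed tree into nested (label, children) tuples."""
    # Pad parentheses with spaces so a plain whitespace split yields tokens.
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read_node():
        nonlocal pos
        assert tokens[pos] == "(", "a node must start with an opening bracket"
        pos += 1
        label = tokens[pos]          # phrase label or POS tag
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read_node())   # nested constituent
            else:
                children.append(tokens[pos])   # leaf word
                pos += 1
        pos += 1                     # consume the closing bracket
        return (label, children)

    return read_node()


def tagged_words(tree):
    """Yield (word, pos_tag) pairs from the tree, left to right."""
    label, children = tree
    for child in children:
        if isinstance(child, str):
            yield (child, label)     # the parent label of a leaf is its POS tag
        else:
            yield from tagged_words(child)


# Invented example: a three-word sentence, bracketed by hand for illustration.
example = "(IP (NP (NR 张三)) (VP (VV 读) (NP (NN 书))))"
tree = parse_bracketed(example)
pairs = list(tagged_words(tree))
print(pairs)
```

The segmentation is implicit in the leaves: each leaf is one word, so reading the leaves left to right gives the segmented sentence, and the label directly above each leaf gives its part of speech.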

The Chinese Proposition Bank adds a layer of semantic annotation to the Chinese Treebank. This layer mainly captures the predicate-argument structure of Chinese verbs. The task is also called semantic role labeling, in the sense that each verb is expected to take a fixed number of arguments and each argument plays a role with regard to the verb. The first installment of the corpus (250K words of Xinhua newswire) has been released through the LDC, and the second installment, another 250K words of Sinorama data, is near completion. The next release is expected in early 2007.
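As a rough illustration of predicate-argument structure, the Python sketch below represents one verb and its arguments as labeled token spans over a segmented sentence. The `Proposition` class, the `ARG0`/`ARG1` labels (written in the numbered style commonly used for proposition banks), and the example sentence are all assumptions made for illustration; this is not the distributed file format of the corpus.

```python
from dataclasses import dataclass, field


@dataclass
class Proposition:
    """One verb (predicate) with its labeled arguments over a token list."""
    predicate: str
    predicate_index: int
    # Maps a role label to a (start, end) token span, end exclusive.
    arguments: dict = field(default_factory=dict)

    def argument_text(self, tokens, label):
        """Return the surface string covered by the argument with this label."""
        start, end = self.arguments[label]
        return "".join(tokens[start:end])  # Chinese text is joined without spaces


# Invented example sentence, already segmented into words.
tokens = ["张三", "读", "书"]

# Hypothetical annotation: the verb takes two arguments, each with its own role.
prop = Proposition(
    predicate="读",
    predicate_index=1,
    arguments={"ARG0": (0, 1), "ARG1": (2, 3)},
)

for label in sorted(prop.arguments):
    print(label, prop.argument_text(tokens, label))
```

Storing arguments as spans over the token list, rather than as copied strings, keeps the semantic layer tied to the underlying segmentation, which matches the idea of a layer added on top of existing annotation.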

Extending the idea of predicate-argument structure to discourse, we are also in the initial stages of building a Chinese Discourse Treebank in which discourse connectives are treated as predicates that take arguments. A discourse connective can be a subordinating conjunction, a coordinating conjunction, or an anaphoric adverbial expression. Sometimes discourse relations can even be inferred when explicit discourse markers are not available.

Other Chinese annotation projects carried out at the University of Colorado include coreference annotation and sense tagging. Since most of our data have English translations, we are also building parallel Chinese-English treebanks and proposition banks.

In the context of NLP research, building annotated corpora is of course only part of the larger picture, a means to an end. The goal is to train natural language systems. To that end, we have built Chinese segmenters, part-of-speech taggers, parsers, semantic role labelers, and word sense and coreference disambiguators. We have also built machine translation and information extraction systems.

Center for Spoken Language Research, University of Colorado at Boulder.