Penn Chinese Treebank Project

The Penn - CU Chinese Treebank Project

Growing interest in Chinese Language Processing is leading to the development of resources such as annotated corpora and automatic segmenters, part-of-speech taggers and parsers. Currently these are all being developed independently, often with quite different standards for segmentation, part-of-speech tagging and syntactic bracketing. The time is ripe for an open discussion of the methodological issues involved in achieving agreement on annotation standards.

Unlike Western and Middle Eastern writing systems, Chinese writing does not have a natural delimiter between words, so appropriate word segmentation becomes a prerequisite for any other NLP task. This problem has been discussed extensively in the literature. The problem of part-of-speech tagging is closely related. Both are prerequisites to the establishment of a Chinese Treebank that could be of general use.

We have completed building a 780K-word Chinese Treebank. Our aim is to work towards a community consensus on guidelines that will include the input of influential researchers from Taiwan, Singapore, Hong Kong, China and the US. To this end, we held two workshops and a number of meetings between 7/1998 and 10/2000 in the USA and abroad. We are very interested in the community's reaction to our guidelines and Treebank, and we encourage anyone interested in getting involved to look into the guidelines attached below, use the Treebank, which is available via LDC, and get in touch with us with comments.

Descriptions of the project:

Task: Building a segmented, POS-tagged and bracketed Chinese corpus. The data consists of Xinhua newswire, Hong Kong news and articles from Sinorama news magazine. There is an ongoing effort to annotate broadcast news and broadcast conversation data under DARPA GALE funding.
Latest release: The Chinese TreeBank (CTB) version 6.0, which has 780K words, has been officially released via the Linguistic Data Consortium (LDC).
CTB6.0 data composition: Xinhua newswire: [001-325, 400-454, 600-885, 900-931]; Hong Kong news: [500-554]; Sinorama: [590-596, 1001-1151]; Broadcast news: [2000-3145]
See this slide for more information.
Coming soon! CTB6.0 is in the LDC publication pipeline.
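
The file-ID ranges in the CTB6.0 data composition above can be written down as a small lookup table. The following is a minimal sketch in Python, not part of the Treebank distribution: the names `CTB6_RANGES` and `source_of` are hypothetical, and it assumes that each file is identified by the integer in the listed ranges (how that number appears in actual file names is not specified here).

```python
# File-ID ranges copied from the CTB6.0 data composition listed above.
# CTB6_RANGES and source_of are illustrative names, not part of any official tool.
CTB6_RANGES = {
    "Xinhua newswire": [(1, 325), (400, 454), (600, 885), (900, 931)],
    "Hong Kong news": [(500, 554)],
    "Sinorama": [(590, 596), (1001, 1151)],
    "Broadcast news": [(2000, 3145)],
}


def source_of(file_id):
    """Return the data source for a numeric file ID, or None if the ID is not listed."""
    for source, ranges in CTB6_RANGES.items():
        for low, high in ranges:
            # Ranges are inclusive at both ends, as in the listing above.
            if low <= file_id <= high:
                return source
    return None
```

For example, `source_of(520)` returns `"Hong Kong news"`, while an ID that falls between the listed ranges, such as 399, returns `None`.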

Penn guidelines for Chinese Treebank

Segmentation guidelines (final version): [ps-file], [pdf-file]
Guideline for POS tagging (final version): [ps-file], [pdf-file]
Guideline for Bracketing (final version): [ps-file], [pdf-file]
All three guidelines are now IRCS technical reports. The ID numbers are 00-06, 00-07 and 00-08, respectively.

Publications

2005: The Penn Chinese TreeBank: Phrase Structure Annotation of a Large Corpus: Nianwen Xue, Fei Xia, Fu-Dong Chiou, and Martha Palmer; Natural Language Engineering, 11(2):207-238.

2002: Building a Large-Scale Annotated Chinese Corpus: Nianwen Xue, Fu-Dong Chiou, and Martha Palmer; Proceedings of the 19th International Conference on Computational Linguistics (COLING 2002), Taipei, Taiwan, 2002.

2001: Facilitating Treebank Annotation with a Statistical Parser: Fu-Dong Chiou, David Chiang, and Martha Palmer; Proceedings of the Human Language Technology Conference (HLT 2001), San Diego, California, 2001.

2000: Developing Guidelines and Ensuring Consistency for Chinese Text Annotation: Fei Xia, Martha Palmer, Nianwen Xue, Mary Ellen Okurowski, John Kovarik, Fu-Dong Chiou, Shizhe Huang, Tony Kroch, and Mitch Marcus; Proceedings of the Second International Conference on Language Resources and Evaluation (LREC 2000), Athens, Greece, 2000.