Penn Chinese Treebank Project
The Penn - CU Chinese Treebank Project
Growing interest in Chinese Language Processing is leading to the development
of resources such as annotated corpora and automatic segmenters, part-of-speech
taggers and parsers. Currently these are all being developed independently,
often with quite different standards for segmentation, part-of-speech tagging
and syntactic bracketing. The time is ripe for an open discussion of the
methodological issues involved in achieving agreement on annotation
standards.
Unlike Western and Middle Eastern
writing systems, Chinese writing does not have a
natural delimiter between words, with the result that appropriate word
segmentation becomes a prerequisite for any other NLP task. In the literature
this problem has been discussed extensively. The problem of part-of-speech
tagging is closely related. These are both prerequisites to the establishment
of a Chinese Treebank that could be of general use.
We have completed building a 780K-word Chinese Treebank.
Our aim is to work towards a community
consensus on guidelines that will include the input of influential researchers
from Taiwan, Singapore, Hong Kong, China and the US. To this end,
we held two workshops and a number of meetings between 7/1998 and 10/2000
in the USA and abroad.
We are very interested in the community's
reaction to our guidelines and Treebank. We encourage anyone interested in
getting involved to look into the guidelines attached below, use
the Treebank, which is available via LDC, and
get in touch with us with comments.
Descriptions of the project:
- Task: Building a segmented, POS-tagged and bracketed Chinese corpus. The
data consists of Xinhua newswire, Hong Kong news and articles from Sinorama
news magazine. There is an ongoing effort to annotate broadcast news and broadcast
conversation data under DARPA GALE funding.
- Latest release: The Chinese TreeBank (CTB) version 6.0, which has
780K words, has been officially released via the Linguistic Data Consortium.
CTB6.0 data composition:
Xinhua newswire: [001-325, 400-454, 600-885, 900-931]
Hong Kong news: [500-554]
Sinorama: [590-596, 1001-1151]
Broadcast news: [2000-3145]
See this slide for more information.
- Coming soon! CTB6.0 is in the LDC publication pipeline.
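As an illustration of how the CTB6.0 data composition listed above might be used, the following minimal sketch maps a numeric file ID to its data source. It assumes that the bracketed ranges are inclusive ranges of file IDs; the names `CTB6_RANGES` and `source_of` are hypothetical and are not part of the Treebank distribution.

```python
from typing import Optional

# File-ID ranges copied from the CTB6.0 data composition list.
# Assumption: each range is inclusive at both ends.
CTB6_RANGES = {
    "Xinhua newswire": [(1, 325), (400, 454), (600, 885), (900, 931)],
    "Hong Kong news": [(500, 554)],
    "Sinorama": [(590, 596), (1001, 1151)],
    "Broadcast news": [(2000, 3145)],
}


def source_of(file_id: int) -> Optional[str]:
    """Return the data source for a file ID, or None if the ID is in no listed range."""
    for source, ranges in CTB6_RANGES.items():
        if any(low <= file_id <= high for low, high in ranges):
            return source
    return None
```

For example, `source_of(520)` falls in the Hong Kong news range, while an ID such as 350 lies in none of the listed ranges and yields `None`.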
Penn guidelines for Chinese Treebank
Publications
- 2005: The Penn Chinese TreeBank: Phrase Structure Annotation of a Large Corpus.
- Nianwen Xue, Fei Xia, Fu-Dong Chiou, and Martha Palmer
- Natural Language Engineering, 11(2):207-238.
- 2002: Building a Large-Scale Annotated Chinese Corpus
- Nianwen Xue, Fu-Dong Chiou, and Martha Palmer
- Proceedings of the 19th International Conference on Computational
Linguistics (COLING 2002), Taipei, Taiwan, 2002.
- 2001: Facilitating Treebank Annotation with a Statistical Parser
- Fu-Dong Chiou, David Chiang, and Martha Palmer
- Proceedings of the Human Language Technology Conference (HLT 2001), San
Diego, California, 2001.
- 2000: Developing Guidelines and Ensuring Consistency for Chinese Text Annotation
- Fei Xia, Martha Palmer, Nianwen Xue,
Mary Ellen Okurowski, John Kovarik, Fu-Dong Chiou,
Shizhe Huang, Tony Kroch, and Mitch Marcus
- Proceedings of the Second International Conference on Language Resources
and Evaluation (LREC 2000), Athens, Greece, 2000.
Sample Files
Treebank Releases on
Preliminary Release: June 2000, see the announcement
Second Release: Dec 2000, see the announcement
Workshops and meetings
1st CLP Workshop (6-7/98), Philadelphia, USA
meeting during ACL-98, Montreal, Canada (8/98)
meeting during ICCIP-98, Beijing, China (11/98)
meeting during ACL-99, Maryland, USA (6/99)
2nd CLP Workshop (10/00), Hong Kong, China
Links to other sites
Penn English Treebank Project
Penn Korean Treebank Project
Last modified on February 10, 2004.