Corpus listing

Corpus Listing for Babel
The main function of the Babel server is to give students and faculty easy access to corpora in a range of languages. The corpora and specialized programs installed on Babel are listed below; all are accessible by logging in to Babel and navigating to /newcorpora:

Corpora (located in /newcorpora):

bn: The Second Full Release of the 1996 Broadcast News Corpus
brooklyn: The Brooklyn-Geneva-Amsterdam-Helsinki Parsed Corpus of Old English
brown: The Brown Corpus
bushotter: The George Bushotter Lakhota Text collection
callhome: The Callhome Corpora
celex: The CELEX lexical databases (and some locally created scripts to work with them)
chinesegigaword: The Chinese Gigaword Second Edition (LDC2005T14), a comprehensive archive of Chinese newswire text acquired over several years by the LDC
cmudict: The CMU Pronouncing Dictionary
comlex: COMLEX Syntax
czech_voa_trans: Voice of America (VOA) Broadcast News Czech Transcripts
ecipor0: Extracts from the Borba/Ramsey Corpus of Brazilian Portuguese, ECI Version
enronsent: The EnronSent email corpus
etobi: Guidelines for ToBI (Tones and Break Indices) labeling
fnet: FrameNet
helsinki: The Diachronic portion of the Helsinki Corpus of English Texts
hub4m: HUB-4 Mandarin Transcript Data
ivie: Full text corresponding to the IViE 36 corpus of variation in English Intonation
london: The London-Lund Corpus of Spoken English
mandan: Kennard's Mandan Texts
mari: An early release of the prosodically labeled data from Mari Ostendorf
mrcdict: MRC Psycholinguistic Database Machine Usable Dictionary
nant: North American News Text Corpus
oald: The Computer-Usable Oxford Advanced Learner's Dictionary of Current English (with description)
oldswitchboard: An older version of the Switchboard Corpus
rus01: The RUS01 Corpus (Computer Science and Program Manuals in English and Russian)
said: Syntactically Annotated Idiom Dataset
sinica: The Academia Sinica Corpus of Chinese
spanish_s_lenition: Documentation for Syllable-Final /s/ Lenition in the Callhome Spanish Corpus
swbd: Switchboard Manuals, Data, and past analysis
swbdphon: A selection of Switchboard data "chosen for acoustic training"
timit: The DARPA TIMIT Acoustic-Phonetic Continuous Speech Corpus
treebank2: The Penn Treebank Project, Release Two
tw.putonghua: The Taiwanese Putonghua Corpus
uclaerrors: 3162 Errors from the UCLA speech errors corpus
wordnet: WordNet 1.5
wsj0_lng_modl: CSR WSJ0 Language Model
ws00chinesephonetics: Chinese speech files with Chinese transcriptions
wsj_counts: Word counts from the Wall Street Journal Corpus, listed alphabetically
xdi: Xdi Language Data

Programs (located in /newcorpora/Programs):

charniak_parser_v4: The Charniak Parser
imscorpus: The Tools and Documentation of the IMS Corpus Toolbox
link_Gram: The Link Grammar Parsing System for UNIX
festival: The Festival Speech Synthesis System V.1.41
treetagger: The Treetagger (with Docs)

To get an account on babel.colorado.edu, you must first obtain a username and password through the department. Please contact your professor for assistance.
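
Once logged in, you can check which of the corpora and programs listed above are present. The following is a minimal sketch, not an official tool. It assumes that Python 3 is available on the server and that each entry is a subdirectory of /newcorpora or /newcorpora/Programs; neither is stated in this listing. The function name list_entries is hypothetical.

```python
from pathlib import Path

# Locations taken from this listing.
CORPORA_ROOT = Path("/newcorpora")
PROGRAMS_ROOT = CORPORA_ROOT / "Programs"


def list_entries(root, exclude=()):
    """Return the sorted names of the subdirectories directly under root.

    Assumption: each corpus or program lives in its own subdirectory.
    Returns an empty list if root does not exist, so the sketch can be
    run on machines other than Babel without raising an error.
    """
    root = Path(root)
    if not root.is_dir():
        return []
    return sorted(
        entry.name
        for entry in root.iterdir()
        if entry.is_dir() and entry.name not in exclude
    )


if __name__ == "__main__":
    print("Corpora:")
    for name in list_entries(CORPORA_ROOT, exclude={"Programs"}):
        print("  " + name)
    print("Programs:")
    for name in list_entries(PROGRAMS_ROOT):
        print("  " + name)
```

Run on a machine without /newcorpora, the script prints only the two headings, because missing directories are treated as empty.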

Working with Corpora (revisions in progress)