Corpus Listing for Babel
The main function of the Babel server is to give students and
faculty easy access to corpora in a variety of languages.
Below is a list of the corpora and specialized programs
installed on Babel, all accessible by logging in to Babel and
navigating to /newcorpora:
Corpora (located in /newcorpora):
bn:
The Second Full Release of the 1996 Broadcast News Corpus
brooklyn:
The Brooklyn-Geneva-Amsterdam-Helsinki Parsed Corpus of Old
English
brown:
The Brown Corpus
bushotter:
The George Bushotter Lakhota Text collection
callhome:
The Callhome Corpora
celex:
The Celex Lexical databases (and some locally created
scripts to work with them)
chinesegigaword:
The Chinese Gigaword Release Second Edition LDC2005T14, a
comprehensive archive of newswire text data in Chinese that
has been acquired over several years by the LDC.
cmudict:
The CMU Pronouncing Dictionary
comlex:
COMLEX Syntax
czech_voa_trans:
Voice of America (VOA) Broadcast News Czech Transcripts
ecipor0:
Extracts from the Borba/Ramsey Corpus of Brazilian
Portuguese, ECI Version
enronsent:
The EnronSent email corpus
etobi:
Guidelines for ToBI (Tones and Break Indices) labeling
fnet:
FrameNet
helsinki:
The Diachronic portion of the Helsinki Corpus of English
Texts
hub4m:
HUB-4 Mandarin Transcript Data
ivie:
Full text corresponding to the IViE 36 corpus of variation
in English Intonation
london:
The London-Lund Corpus of Spoken English
mandan:
Kennard's Mandan Texts
mari:
An early release of the prosodically labeled data from Mari
Ostendorf
mrcdict:
MRC Psycholinguistic Database Machine Usable Dictionary
nant:
North American News Text Corpus
oald:
The Computer-Usable Oxford Advanced Learner's Dictionary of
Current English (with desc)
oldswitchboard:
An older version of the Switchboard Corpus
rus01:
The RUS01 Corpus (Computer Science and Program Manuals in
English and Russian)
said:
Syntactically Annotated Idiom Dataset
sinica:
The Academia Sinica Corpus of Chinese
spanish_s_lenition:
Documentation for Syllable-Final /s/ Lenition in the
Callhome Spanish Corpus
swbd:
Switchboard manuals, data, and past analyses
swbdphon:
A selection of Switchboard data "chosen for acoustic
training"
timit:
The DARPA TIMIT Acoustic-Phonetic Continuous Speech Corpus
treebank2:
The Penn Treebank Project, Release Two
tw.putonghua:
The Taiwanese Putonghua Corpus
uclaerrors:
3162 Errors from the UCLA speech errors corpus
wordnet:
WordNet 1.5
wsj0_lng_modl:
CSR WSJ0 Language Model
ws00chinesephonetics:
Chinese speech files with Chinese transcriptions
wsj_counts:
Word counts from the Wall Street Journal Corpus, listed
alphabetically
xdi:
Xdi Language Data
Programs (located in /newcorpora/Programs):
charniak_parser_v4:
The Charniak Parser
imscorpus:
The Tools and Documentation of the IMS Corpus Toolbox
link_Gram:
The Link Grammar Parsing System for UNIX
festival:
The Festival Speech Synthesis System V.1.41
treetagger:
The TreeTagger (with documentation)
To get an account on babel.colorado.edu, you must first
obtain a username and password through the department.
Please contact your professor for assistance.
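
Once logged in, it can be handy to check this listing against what is actually installed. The following is only a minimal sketch: it assumes that a Python 3 interpreter is available on Babel and that each corpus lives in its own subdirectory of /newcorpora, neither of which is confirmed by this page. The function name list_entries is made up for illustration.

```python
from pathlib import Path


def list_entries(root):
    """Return the sorted names of the subdirectories directly under root.

    Returns an empty list if root does not exist or is not a directory.
    """
    base = Path(root)
    if not base.is_dir():
        return []
    return sorted(p.name for p in base.iterdir() if p.is_dir())


if __name__ == "__main__":
    # /newcorpora is the location named on this page; the assumption here
    # is that each corpus (bn, brown, ...) is a subdirectory of it.
    for name in list_entries("/newcorpora"):
        print(name)
```

Under the same assumptions, calling list_entries("/newcorpora/Programs") would list the installed programs instead.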