All Corpora

To access the discs in the LDC library, contact Michael Ginn.

You need a verbs account to access the corpora stored on the verbs server.

Corpus Name Language Catalog ID
1996 English Broadcast News Speech (HUB4) English LDC97S44
1996 English Broadcast News Transcripts (HUB4) English LDC97T22
1996-2008 NIST Speaker Recognition Evaluation Data Collection English LDC2009E100
1997 English Broadcast News Transcripts (HUB4) English LDC98T28
1997 HUB4 Broadcast News Evaluation Non-English Test Material Spanish, Mandarin Chinese LDC2001S91
1997 HUB4 English Evaluation Speech and Transcripts English LDC2002S11
1997 HUB5 Arabic Evaluation Egyptian Arabic LDC2002S22
1997 HUB5 Arabic Transcripts Egyptian Arabic LDC2002T39
1997 HUB5 English Evaluation English LDC2002S23
1997 HUB5 German Evaluation German LDC2002S24
1997 HUB5 German Transcripts German LDC2003T03
1997 HUB5 Spanish Evaluation Spanish LDC2002S25
1997 HUB5 Spanish Transcripts Spanish LDC2003T04
1997 Mandarin Broadcast News Speech (HUB4-NE) Mandarin Chinese LDC98S73
1997 Spanish Broadcast News Transcripts (HUB4-NE) Spanish LDC98T29
1998 HUB4 Broadcast News Evaluation English Test Material English LDC2000S86
1998 HUB5 English Evaluation English LDC2002S10
1998 HUB5 English Transcripts English LDC2003T02
1999 HUB4 Broadcast News Evaluation English Test Material English LDC2000S88
2000 Communicator Evaluation English LDC2002S56
2000 HUB5 English Evaluation Speech English LDC2002S09
2000 HUB5 English Evaluation Transcripts English LDC2002T43
2000 NIST Speaker Recognition Evaluation English LDC2001S97
2001 Communicator Evaluation English LDC2003S01
2001 HUB5 English Evaluation English LDC2002S13
2001 HUB5 Mandarin Evaluation Mandarin Chinese LDC2002S12
2001 HUB5 Mandarin Transcripts Mandarin Chinese LDC2003T01
2001 NIST Speaker Recognition Evaluation Corpus English LDC2002S34
2002 Rich Transcription Broadcast News and Conversational Telephone Speech English LDC2004S11
2009 CoNLL Shared Task Part 1 Catalan, Czech, German, Spanish LDC2012T03
2009 CoNLL Shared Task Part 2 English, Mandarin Chinese, Chinese LDC2012T04
8 years worth of summary/article sets collected via Newsblaster LDC2012E80
ACE 2004 Evaluation Corpus English, Chinook jargon, Baharna Arabic, Chinese, Arabic LDC2004E51
ACE 2004 Multilingual Training Corpus English, Standard Arabic, Mandarin Chinese LDC2005T09
ACE 2004 Pilot Corpus V1.3 Baharna Arabic, Chinook jargon, English, Arabic, Chinese LDC2004E03
ACE 2005 Multilingual Training Data V6.0 English, Chinook jargon, Baharna Arabic, Chinese, Arabic LDC2005E18
ACE-2 Version 1.0 English LDC2003T11
ACL Multilingual Corpus 1 1006
AIDA 1.2 : Automatic Identification of Dialectal Arabic Arabic LDC2012E56
AQUAINT CrossLingual QA Arabic Newswire Corpus Baharna Arabic, English, Arabic LDC2004E49
ATIS3 Test Data English LDC95S26
ATIS3 Training Data English LDC94S19
Abstract Meaning Representation (AMR) Annotation Release 1.0 English LDC2014T12
American English Nickname Collection English LDC2012T11
American English Spoken Lexicon English LDC99L23
American National Corpus (ANC) Second Release English LDC2005T35
Annotated English Gigaword English LDC2012T21
Arabic Gigaword Third Edition Standard Arabic LDC2007T40
Arabic Newswire Part 1 Standard Arabic LDC2001T55
Arabic Treebank - Broadcast News v1.0 Standard Arabic, Arabic LDC2012T07
Arabic Treebank ARZ Part 1, V1.0 Egyptian Arabic LDC2012E28
Arabic Treebank Part 20 V1.0 - BOLT Pilot ARZ Email Arabic LDC2012E25
Arabic Treebank: Part 1 - 10K-word English Translation Standard Arabic LDC2003T07
Arabic Treebank: Part 1 v 2.0 Standard Arabic LDC2003T06
Arabic Treebank: Part 3 v 3.2 Standard Arabic, Arabic LDC2010T08
BBN Pronoun Coreference and Entity Type Corpus English LDC2005T33
BBN/LDC WebForum Selections Arabic/English Parallel Corpus Arabic/English (Parallel) LDC2012E75
BBN/LDC WebForum Selections Chinese/English Parallel Corpus Chinese/English (Parallel) LDC2012E76
BBN/LDC/Sakhr Arabic-Dialect/English Parallel Corpus Sakhr Arabic-Dialect/English (Parallel) LDC2012E17
BLLIP 1987-89 WSJ Corpus Release 1 English LDC2000T43
BOLT - Phase 1 Discussion Forums Source Data R1 V2 English, Egyptian Arabic, Chinese LDC2012E04
BOLT - Phase 1 Discussion Forums Source Data R2 Chinese, Egyptian Arabic, English LDC2012E16
BOLT - Phase 1 Discussion Forums Source Data R3 Chinese, Egyptian Arabic, English LDC2012E21
BOLT - Phase 1 Rejected Training Data Thread IDs LDC2012E62
BOLT - Phase 1 Translation Samples V2 LDC2012E11
BOLT LRL Hausa Representative Language Pack V1.2 Hausa LDC2015E70
BOLT LRL Turkish Representative Language Pack V2.2 Turkish LDC2014E115
BOLT LRL Uzbek Representative Language Pack Uzbek LDC2016E29
BOLT Phase 1 - Arabic Treebank ARZ Part 2, V1.0 Egyptian Arabic LDC2012E88
BOLT Phase 1 - Chinese Parallel Word Alignment and Tagging Part 3 Chinese LDC2012E95
BOLT Phase 1 - English Treebank BOLT WB Part 2, V 1.0 English LDC2012E97
BOLT Phase 1 Chinese Parallel Word Alignment and Tagging DF Part 4 Chinese LDC2013E02
BOLT Phase 1 Chinese Parallel Word Alignment and Tagging Part 1 Chinese LDC2012E24
BOLT Phase 1 Chinese Parallel Word Alignment and Tagging Part 2 Chinese LDC2012E72
BOLT Phase 1 Chinese Propbank DF Part 1 Chinese LDC2012E121
BOLT Phase 1 Chinese Propbank DF Part 2 Chinese LDC2012E131
BOLT Phase 1 Chinese Treebank DF Part 1 Chinese LDC2012E109
BOLT Phase 1 Chinese Treebank DF Part 2 Chinese LDC2012E120
BOLT Phase 1 Chinese Treebank DF Part 3 Chinese LDC2012E130
BOLT Phase 1 DevTest Source and Translation V4 Arabic/Chinese/English LDC2012E30
BOLT Phase 1 Egyptian Arabic Parallel Word Alignment DF Egyptian Arabic LDC2013E01
BOLT Phase 1 Egyptian Arabic Parallel Word Alignment DF Part 2 v2 Egyptian Arabic LDC2012E94
BOLT Phase 1 Egyptian Arabic Parallel Word Alignment Part 1 V2 Egyptian Arabic LDC2012E51
BOLT Phase 1 Egyptian Arabic Propbank DF Part 1 Egyptian Arabic LDC2012E122
BOLT Phase 1 Egyptian Arabic Propbank DF Part 2 Egyptian Arabic LDC2012E129
BOLT Phase 1 Egyptian Arabic Treebank DF Part 1 V2.0 Egyptian Arabic LDC2012E93
BOLT Phase 1 Egyptian Arabic Treebank DF Part 2 V2.0 Egyptian Arabic LDC2012E98
BOLT Phase 1 Egyptian Arabic Treebank DF Part 3 V2.0 Egyptian Arabic LDC2012E89
BOLT Phase 1 Egyptian Arabic Treebank DF Part 4 V2.0 Egyptian Arabic LDC2012E99
BOLT Phase 1 Egyptian Arabic Treebank DF Part 5 V2.0 Egyptian Arabic LDC2012E107
BOLT Phase 1 Egyptian Arabic Treebank DF Part 6 V2.0 Egyptian Arabic LDC2012E125
BOLT Phase 1 Egyptian Arabic Treebank DF Part 7 V1.0 Egyptian Arabic LDC2013E12
BOLT Phase 1 English Propbank DF Part 1 English LDC2012E123
BOLT Phase 1 English Propbank DF Part 2 English LDC2012E128
BOLT Phase 1 English Propbank DF Part 3 English LDC2013E05
BOLT Phase 1 English Treebank DF Part 1 V1.0 English LDC2012E92
BOLT Phase 1 English Treebank DF Part 3 V1.0 English LDC2012E114
BOLT Phase 1 English Treebank DF Part 4 V1.0 English LDC2013E17
BOLT Phase 1 HTER Experiment Source and Reference Translation Chinese-English, Arabic-English LDC2012E18
BOLT Phase 1 IR Eval Assessment Results V1.1 LDC2012E118
BOLT Phase 1 IR Eval Source Data Document List LDC2012E82
BOLT Phase 1 Translation Training Data R1 Chinese-English, Arabic-English LDC2012E15
BOLT Phase 1 Translation Training Data R2 Chinese-English, Arabic-English LDC2012E19
BOLT Phase 1 Translation Training Data R3 Chinese-English, Arabic-English LDC2012E55
BOLT Phase 1 Translation Training Data R4 Chinese-English, Arabic-English LDC2012E81
BOLT Phase 1 Translation Training Data R5 Chinese-English, Arabic-English LDC2012E96
BOLT Phase 1 Translation Training Data R6 Chinese-English, Arabic-English LDC2012E124
BOLT Phase 2 English Treebank SMS/Chat Part 1 English LDC2013E127
BOLT Phase 2 IR Source Data Document List and Sample Query English LDC2013E08
BOLT Phase 2 SMS and Chat Sample Source Data Chinese, English, Egyptian Arabic LDC2013E10
Boston University Radio Speech Corpus English LDC96S36
Boulder Coercion Corpus Other_8
British National Corpus Parses and BNC British English 1000
Brown Corpus (treebanked) Standard American English Other_7
Buckwalter Arabic Morphological Analyzer Standard Arabic, English LDC2004L02
CALIMA 0.3: Columbia Arabic Language Morphological Analyzer -- Egyptian Arabic Egyptian Arabic LDC2012E57
CALLFRIEND American English-Non-Southern Dialect English LDC96S46
CALLFRIEND American English-Southern Dialect Southern American English LDC96S47
CALLFRIEND Canadian French Canadian French LDC96S48
CALLFRIEND Farsi Farsi, Persian LDC96S50
CALLFRIEND German German LDC96S51
CALLFRIEND Hindi Hindi LDC96S52
CALLFRIEND Japanese Japanese LDC96S53
CALLFRIEND Korean Korean LDC96S54
CALLFRIEND Mandarin Chinese-Mainland Dialect Mandarin Chinese-Mainland Dialect LDC96S55
CALLFRIEND Mandarin Chinese-Taiwan Dialect Mandarin Chinese-Taiwan Dialect LDC96S56
CALLFRIEND Spanish-Caribbean Dialect Spanish LDC96S57
CALLFRIEND Tamil Tamil LDC96S59
CALLFRIEND Vietnamese Vietnamese LDC96S60
CALLHOME American English Lexicon (PRONLEX) American English LDC97L20
CALLHOME American English Speech American English LDC97S42
CALLHOME American English Transcripts American English LDC97T14
CALLHOME Egyptian Arabic Speech Supplement Egyptian Arabic LDC2002S37
CALLHOME Egyptian Arabic Transcripts Egyptian Arabic LDC97T19
CALLHOME Egyptian Arabic Transcripts Supplement Egyptian Arabic LDC2002T38
CALLHOME German Lexicon German LDC97L18
CALLHOME German Speech German LDC97S43
CALLHOME German Transcripts German LDC97T15
CALLHOME Japanese Lexicon Japanese LDC96L17
CALLHOME Japanese Speech Japanese LDC96S37
CALLHOME Japanese Transcripts Japanese LDC96T18
CALLHOME Mandarin Chinese Lexicon Mandarin Chinese LDC96L15
CALLHOME Mandarin Chinese Speech Mandarin Chinese LDC96S34
CALLHOME Mandarin Chinese Transcripts Mandarin Chinese LDC96T16
CALLHOME Spanish Dialogue Act Annotation Spanish LDC2001T61
CALLHOME Spanish Lexicon Spanish LDC96L16
CALLHOME Spanish Speech Spanish LDC96S35
CALLHOME Spanish Transcripts Spanish LDC96T17
CELEX2 English, German, Dutch LDC96L14
CETEMpublico Portuguese LDC2001T62
CODAFY 0.1: Automatic mapper into the Conventional Orthography of Dialectal Arabic Dialectal Arabic LDC2012E58
COMLEX English Syntax Lexicon English LDC96L6
COMLEX Pronouncing Dictionary English LDC96L7
COMLEX Syntax Text Corpus Version 2.0 English LDC96T11
CSLU: Kids' Speech Version 1.1 English LDC2007S18
CSLU: Spelled and Spoken Words English LDC2006S15
CSLU: Spoltech Brazilian Portuguese Version 1.0 Brazilian Portuguese LDC2006S16
CSLU: Stories v 1.2 English LDC2006S14
CSR-I (WSJ0) Complete English LDC93S6A
CSR-IV HUB4 English LDC96S31
Childes Corpus 1996 1001
Childes Corpus 1998 1002
Chinese <-> English Named Entity Lists v 1.0 Mandarin Chinese-English LDC2005T34
Chinese English News Magazine Parallel Text Chinese-English (Parallel) LDC2005T10
Chinese Gigaword Mandarin Chinese LDC2003T09
Chinese Gigaword Fifth Edition Mandarin Chinese LDC2011T13
Chinese Gigaword Second Edition Mandarin Chinese LDC2005T14
Chinese Proposition Bank 2.0 Mandarin Chinese LDC2008T07
Chinese Treebank 2.0 Mandarin Chinese LDC2001T11
Chinese Treebank 4.0 Mandarin Chinese LDC2004T05
Chinese Treebank 5.0 Mandarin Chinese LDC2005T01
Chinese Treebank 5.1 Mandarin Chinese LDC2005T01U01
Chinese Treebank 6.0 Mandarin Chinese LDC2007T36
Chinese Treebank 7.0 Mandarin Chinese LDC2010T07
Chinese Treebank 8.0 Mandarin Chinese, Chinese LDC2013T21
Chinese Treebank Final Release Mandarin Chinese LDC2000T48
Chinese idiom translation dictionary + word segmenter dictionary - web resources Chinese LDC2012E78
Chinese-English Translation Lexicon Version 3.0 English-Mandarin Chinese LDC2002L27
CoNLL 2008 Shared Task Development Set English LDC2008E33
CoNLL 2008 Shared Task Test Set English LDC2008E34
CoNLL 2008 Shared Task Training Set English LDC2008E32
CoNLL 2008 Shared Task Trial Data Set English LDC2008E31
CoNLL 2009 Shared Task Chinese Test Set Chinese LDC2009E37
CoNLL 2009 Shared Task Chinese Training Set Chinese LDC2009E38
CoNLL 2009 Shared Task Chinese Trial Data Set Chinese LDC2009E36D
The London-Lund Corpus of Spoken English English other_1234
Corpus Search 1003
DEFT ERE Cross-Doc Event Coreference Training Data Annotation LDC2017E24
DEFT ERE English Discussion Forum Annotation V3 English LDC2014E31
DEFT English Belief and Sentiment Annotation English LDC2016E27
DEFT Event Sequencing After-Link And Parent-Child Annotation Training Data English LDC2016E130
DEFT Event Sequencing Pilot Evaluation Source Data English LDC2017E08
DEFT Phase 1 AMR Annotation R4 English LDC2014E41
DEFT Phase 1 ERE Annotation R3 V2 English LDC2013E64
DEFT Phase 1 Narrative Text Source Data R1 English LDC2013E19
DEFT Phase 2 AMR Annotation R1 English LDC2015E86
DEFT Phase 2 AMR Annotation R2 English LDC2016E25
DEFT Phase 2 AMR Exploratory Source Data English LDC2014R46
DEFT Phase 2 AMR Selected Segmented DF Source Data V2.0 English LDC2015R11
DEFT Rich ERE English Training Annotation R2 V2 English LDC2015E68
DSO Corpus of Sense-Tagged English English LDC97T12
ECI Multilingual Text Turkish, Swedish, Slovenian, Russian, Portuguese, Norwegian, Norwegian Bokmål, Norwegian Nynorsk, Lithuanian, Latin, Japanese, Scottish Gaelic, French, Estonian, English, Modern Greek (1453-), German, Danish, Bulgarian, Tosk Albanian, Standard Malay, Spanish, Serbian, Northern Uzbek, Mandarin Chinese, Italian, Dutch, Czech, Croatian, Albanian LDC94T5
Emotional Prosody Speech and Transcripts English LDC2002S28
English Gigaword English LDC2003T05
English Gigaword Fifth Edition English LDC2011T07
English Gigaword Second Edition English LDC2005T12
English News Text Treebank: Penn Treebank Revised English LDC2015T13
English Translation Treebank: An-Nahar Newswire English LDC2012T02
English Web Treebank English LDC2012T13
Entropic Speech Technology 1005
European Language Newspaper Text Portuguese, French, German LDC95T11
FactBank 1.0 English LDC2009T23
Fisher English Training Part 2, Speech English LDC2005S13
Fisher English Training Part 2, Transcripts English LDC2005T19
Fisher English Training Speech Part 1 Speech English LDC2004S13
Fisher English Training Speech Part 1 Transcripts English LDC2004T19
GALE Arabic-English Parallel Aligned Treebank -- Newswire Arabic-English (Parallel) LDC2013T10
GALE Kickoff Release - Arabic Names Extracted from ACE V1.0 Arabic LDC2005E66
GALE Kickoff Release - Arabic Names Extracted from ATB V1.0 Arabic LDC2005E68
GALE Kickoff Release - Broadcast Conversation Audio V1.0 Baharna Arabic, Chinese, Arabic LDC2005E61
GALE Kickoff Release - Broadcast Conversation Transcripts V1.0 Baharna Arabic, Chinook jargon, Chinese, Arabic LDC2005E63
GALE Kickoff Release - Broadcast News Audio V1.0 Arabic, Chinese LDC2005E62
GALE Kickoff Release - English-Arabic Parallel Treebank V1.0 English-Arabic (Parallel) LDC2005E69
GALE Kickoff Release - VOA Arabic Broadcast News Audio Arabic LDC2005E60
GALE Kickoff Release - VOA Arabic Broadcast News Transcripts Arabic LDC2005E71
GALE Kickoff Release 2 - English CTS Treebank with Structural Metadata English LDC2005E79
GALE Kickoff Release 2 -- Levantine Arabic CTS Audio South Levantine Arabic, North Levantine Arabic LDC2005E76
GALE Kickoff Release 2 -- Levantine Arabic CTS Transcripts South Levantine Arabic, North Levantine Arabic LDC2005E77
GALE Kickoff Release 2 -- Levantine Arabic CTS Treebank South Levantine Arabic, North Levantine Arabic LDC2005E78
GALE Phase 1 Chinese Broadcast Conversation Parallel Text - Part 1 English, Mandarin Chinese (Parallel) LDC2009T02
GALE Phase 1 Chinese Broadcast Conversation Parallel Text - Part 2 English, Mandarin Chinese (Parallel) LDC2009T06
GALE Phase 1 Chinese Broadcast News Parallel Text - Part 1 English, Mandarin Chinese (Parallel) LDC2007T23
GALE Phase 1 Chinese Broadcast News Parallel Text - Part 2 English, Mandarin Chinese (Parallel) LDC2008T08
GALE Phase 1 Chinese Broadcast News Parallel Text - Part 3 English, Mandarin Chinese (Parallel) LDC2008T18
GALE Phase 1 Chinese Newsgroup Parallel Text - Part 1 English-Mandarin Chinese (Parallel) LDC2009T15
GALE Phase 1 Chinese Newsgroup Parallel Text - Part 2 English-Mandarin Chinese (Parallel) LDC2010T03
GALE Phase 2 Distillation - Training V5.0 Baharna Arabic, Chinook jargon, English, Arabic, Chinese LDC2007E13
GALE Phase 2 Release 1 - Transcripts Chinook jargon, Baharna Arabic, English, Chinese, Arabic LDC2007E05
GALE Phase 2 Release 1 - Translations English, Chinook jargon, Baharna Arabic, Chinese, Arabic LDC2007E06
GALE Phase 2 Release 1 - Web Text Arabic, Chinese, English LDC2007E04
GALE Phase 2 Release 2 - Transcripts Baharna Arabic, Chinook jargon, English, Chinese, Arabic LDC2007E45
GALE Phase 2 Release 2 - Translations Chinook jargon, Baharna Arabic, Chinese, Arabic LDC2007E46
GALE Phase 2 Release 3 - Transcripts Baharna Arabic, Chinook jargon, Chinese, Arabic LDC2007E86
GALE Phase 2 Release 3 - Translations Chinook jargon, Baharna Arabic, Chinese, Arabic LDC2007E87
GALE Phase 3 - MTPlus Pilot LDC2008E42
GALE Phase 3 Chinese Broadcast Conversation Transcripts Part 1 Mandarin Chinese, Chinese LDC2014T28
GALE Phase 3 Chinese Broadcast Conversation Transcripts Part 2 Mandarin Chinese, Chinese LDC2015T09
GALE Phase 3 DevTest - Broadcast Audio LDC2007E60
GALE Phase 3 Release 1 - Distillation V1.1 English, Chinese, Arabic LDC2007E104
GALE Phase 3 Release 1 - English Translation Treebank English, Baharna Arabic, Arabic LDC2007E105
GALE Phase 3 Release 1 - Found Parallel Text English, Chinese, Arabic (Parallel) LDC2007E103
GALE Phase 3 Release 1 - Transcripts English, Chinese, Arabic LDC2007E100
GALE Phase 3 Release 1 - Translations Arabic, Chinese, English LDC2007E101
GALE Phase 3 Release 1 - Web Text V 1.0 English, Chinook jargon, Baharna Arabic, Arabic, Chinese LDC2007E102
GALE Phase 3 Release 2 - Broadcast Audio English, Chinese, Arabic LDC2008E38
GALE Phase 3 Release 2 - Transcripts LDC2008E39
GALE Phase 3 Release 2 - Translations LDC2008E40
GALE Phase 3 Release 2 - Web Text LDC2008E41
GALE Phase 3 and 4 Eval Superset Arabic, Chinese LDC2011E50
GALE Phase 4 Arabic Parallel Aligned Treebank Part 1 V1.2 Arabic-English (Parallel) LDC2009E82
GALE Phase 4 Chinese Parallel Word Alignment and Tagging Part 1 V1.1 Chinese-English (Parallel) LDC2009E83
GALE Phase 4 Release 1 - Transcripts V1.0 English LDC2008E55
GALE Phase 4 Release 1 - Translations V2.0 Arabic and Chinese - English (Parallel) LDC2008E56
GALE Phase 4 Release 1 - Web Text V1.0 LDC2008E53
GALE Phase 4 Release 2 - Transcripts Arabic, Chinese, English LDC2009E15
GALE Phase 4 Release 2 - Translations Arabic and Chinese - English (Parallel) LDC2009E16
GALE Phase 4 Release 2 - Web Text Arabic, Chinese, English LDC2009E14
GALE Phase 4 Release 3 - Found Parallel Text Arabic-English, Chinese-English (Parallel) LDC2009E105
GALE Phase 4 Release 3 - Transcripts Arabic, Chinese, English LDC2009E94
GALE Phase 4 Release 3 - Translations V1.2 Arabic and Chinese - English (Parallel) LDC2009E95
GALE Phase 4 Release 3 - Web Text Arabic, Chinese, English LDC2009E93
GALE Phase 5 Eval Source Transcripts and Translation Arabic, Chinese LDC2011E21
GALE Phase 5 Eval Superset Source Transcripts and Translation Arabic, Chinese LDC2011E25
GALE Phase 5 Levantine Arabic Dialect Judgments and Translations Levantine Arabic-English (Parallel) LDC2010E79
GALE Y1 - Arabic English Parallel News Text English, Baharna Arabic, Arabic (Parallel) LDC2006E25
GALE Y1 - BBN Iraqi Broadcast Conversation Corpus Iraqi Arabic LDC2006G07
GALE Y1 - Distillation Blind Evaluation Audio Part A English LDC2006E46_A
GALE Y1 - Distillation Blind Evaluation Audio Part B English LDC2006E46_B
GALE Y1 - Distillation Blind Evaluation Audio Part C English LDC2006E46_C
GALE Y1 - Distillation Blind Evaluation Audio Part D English LDC2006E46_D
GALE Y1 - Distillation Blind Evaluation Audio Part E English LDC2006E46_E
GALE Y1 - Distillation Blind Evaluation Newswire English LDC2006E45
GALE Y1 - Distillation Evaluation Audio English LDC2006E21
GALE Y1 - Distillation Evaluation Newswire Baharna Arabic, Chinook jargon, English, Chinese, Arabic LDC2006E22
GALE Y1 - English Chinese Parallel Financial News Chinook jargon, English, Chinese (Parallel) LDC2006E26
GALE Y1 - Interim Release: Transcripts Baharna Arabic, Chinook jargon, English, Arabic, Chinese LDC2006E23
GALE Y1 - Interim Release: Translations Chinook jargon, Baharna Arabic, Chinese, Arabic - English (Parallel) LDC2006E24
GALE Y1 - Web 1T 5-gram Version 1 English LDC2006E88
GALE Y1 Q1 Release - Arabic Treebank v 1.0 Arabic LDC2005E84
GALE Y1 Q1 Release - English Translation Treebank v 1.0 Arabic-English (Parallel) LDC2005E85
GALE Y1 Q1 Release - Transcripts V1.0 Baharna Arabic, Chinook jargon, English, Arabic, Chinese LDC2005E82
GALE Y1 Q1 Release - Translations V1.0 Arabic and Chinese - English (Parallel) LDC2005E83
GALE Y1 Q1 Release - Web Text Collection V1.0 Chinese, Arabic, English LDC2005E81
GALE Y1 Q2 Release - Arabic Treebank v 1.0 Arabic LDC2006E35
GALE Y1 Q2 Release - English Translation Treebank v 1.0 Arabic-English (Parallel) LDC2006E36
GALE Y1 Q2 Release - Transcripts V1.0 Baharna Arabic, Chinook jargon, English, Arabic, Chinese LDC2006E33
GALE Y1 Q2 Release - Translations V2.0 Baharna Arabic, Chinook jargon, Arabic, Chinese; into English LDC2006E34
GALE Y1 Q2 Release - Web Text Collection V1.0 Arabic, Chinese, English LDC2006E32
GALE Y1 Q3 Release - Arabic Treebank Arabic LDC2006E87
GALE Y1 Q3 Release - English Translation Treebank Arabic-English (Parallel) LDC2006E82
GALE Y1 Q3 Release - Transcripts English, Chinook jargon, Baharna Arabic, Arabic, Chinese LDC2006E84
GALE Y1 Q3 Release - Translations Baharna Arabic, Chinook jargon, Arabic, Chinese; into English LDC2006E85
GALE Y1 Q3 Release - Web Text Collection LDC2006E77
GALE Y1 Q3 Release - Word Alignment Baharna Arabic, Chinook jargon, Arabic, Chinese; into English LDC2006E86
GALE Y1 Q4 Release - Arabic Treebank Arabic LDC2006E94
GALE Y1 Q4 Release - English Translation Treebank Arabic-English (Parallel) LDC2006E95
GALE Y1 Q4 Release - Transcripts English, Chinook jargon, Baharna Arabic, Chinese, Arabic LDC2006E91
GALE Y1 Q4 Release - Translations Arabic and Chinese - English (Parallel) LDC2006E92
GALE Y1 Q4 Release - Web Text Collection LDC2006E90
GALE Y1 Q4 Release - Word Alignment Arabic, Chinese, English (Parallel) LDC2006E93
Gigaword English Automatic Parses Other_9
Google Question Bank Update-v1.0 English LDC2012R121
Google Treebank Weblog Subcorpus V2.0 English LDC2011E71
Grassfields Bantu Fieldwork: Ngomba Tone Paradigms Ngomba LDC2001S16
HUB4 Radio Broadcast News 1014
HUB5 Spanish Telephone Speech Corpus Spanish LDC98S70
Hansard French/English English - Canadian French (Parallel) LDC95T20
Hong Kong Hansards Parallel Text English, Chinese (Parallel) LDC2000T50
Hong Kong Laws Parallel Text English, Chinese LDC2000T47
Hong Kong News Parallel Text English, Chinese (Parallel) LDC2000T46
Hong Kong Parallel Text English, Chinese (Parallel) LDC2004T08
ICSI Meeting Speech English LDC2004S02
ICSI Meeting Transcripts English LDC2004T04
ISCA 1 and 3 1007
ISCA Tutorial 1008
ISL Meeting Speech Part 1 English LDC2004S05
ISL Meeting Transcripts Part 1 English LDC2004T10
JURIS English LDC98T32
Japanese Business News Text Japanese LDC95T8
Japanese Business News Text Supplement Japanese LDC99T34
Korean English Treebank Annotations Korean, English (Parallel) LDC2002T26
Korean Newswire Korean LDC2000T45
Korean Propbank Korean LDC2006T03
Korean Telephone Conversations Lexicon Korean LDC2003L02
Korean Telephone Conversations Speech Korean LDC2003S03
Korean Telephone Conversations Transcripts Korean LDC2003T08
Korean Treebank Annotations Version 2.0 Korean LDC2006T09
LCTL Urdu Urdu LDC2006E110
LLHDB English LDC98S68
LORELEI Akan Representative Language Pack Translation Annotation Grammar Lexicon and Tools V1.0 Akan LDC2018E07
LORELEI Amharic Representative Language Pack Translation Annotation Grammar Lexicon and Tools V1.0 Amharic LDC2016E87
LORELEI Arabic Representative Language Pack Translation Annotation Grammar Lexicon and Tools V1.0 Arabic LDC2016E89
LORELEI Bengali Representative Language Pack Translation Annotation Grammar Lexicon and Tools V1.0 Bengali LDC2017E60
LORELEI Farsi Representative Language Pack Translation Annotation Grammar Lexicon and Tools V1.0 Farsi LDC2016E93
LORELEI Hindi Representative Language Pack Translation Annotation Grammar Lexicon and Tools V1.0 Hindi LDC2017E62
LORELEI Hungarian Representative Language Pack Translation Annotation Grammar Lexicon and Tools V1.0 Hungarian LDC2016E98
LORELEI Indonesian Representative Language Pack Translation Annotation Grammar Lexicon and Tools V1 Indonesian LDC2017E66
LORELEI Language Independent NLP Tools LDC2016E53
LORELEI Mandarin Incident Language Pack V2 Mandarin Chinese LDC2016E30
LORELEI Mandarin Representative Language Pack Translation Annotation Grammar Lexicon and Tools V1.0 Mandarin Chinese LDC2016E101
LORELEI Russian Representative Language Pack Translation Annotation Grammar Lexicon and Tools V1.0 Russian LDC2016E95
LORELEI Situation Frame Exercise Annotation English LDC2017E07
LORELEI Somali Representative Language Pack Translation Annotation Grammar Lexicon and Tools V1.0 Somali LDC2016E91
LORELEI Spanish Representative Language Pack Translation, Annotation, Grammar, Lexicon and Tools V1. Spanish LDC2016E97
LORELEI Swahili Representative Language Pack Translation Annotation Grammar Lexicon and Tools V1.0 Swahili LDC2017E64
LORELEI Tagalog Representative Language Pack Translation Annotation Grammar Lexicon and Tools V1.0 Tagalog LDC2017E68
LORELEI Tamil Representative Language Pack Translation Annotation Grammar Lexicon and Tools V1.0 Tamil LDC2017E70
LORELEI Thai Representative Language Pack Translation Annotation Grammar Lexicon and Tools V1.0 Thai LDC2018E03
LORELEI Vietnamese Representative Language Pack Translation Annotation Grammar Lexicon and Tools V1 Vietnamese LDC2016E103
LORELEI Wolof Representative Language Pack Translation Annotation Grammar Lexicon and Tools V1.0 Wolof LDC2018E09
LORELEI Year 1 Dry Run Evaluation IL2 V1.1 English LDC2016E56
LORELEI Yoruba Representative Language Pack Translation Annotation Grammar Lexicon and Tools V1.0 Yoruba LDC2016E105
LORELEI Zulu Representative Language Pack Translation Annotation Grammar Lexicon and Tools V1.0 Zulu LDC2018E05
Levantine Arabic QT Training Data Set 4 (Speech + Transcripts) North Levantine Arabic, South Levantine Arabic LDC2005S14
MADA-ARZ 0.1: Morphological Analysis and Disambiguation for Arabic (Egyptian version) Egyptian Arabic LDC2012E60
MRC Psycholinguistic Database Machine Usable Dictionary other_4
Mandarin Chinese News Text Mandarin Chinese LDC95T13
Matlab 1009
Message Understanding Conference (MUC) 6 English LDC2003T13
Message Understanding Conference (MUC) 7 English LDC2001T02
Multiple-Translation Arabic (MTA) Part 1 English, Standard Arabic (Parallel) LDC2003T18
Multiple-Translation Arabic (MTA) Part 2 English, Standard Arabic (Parallel) LDC2005T05
Multiple-Translation Chinese (MTC) Part 2 English, Mandarin Chinese (Parallel) LDC2003T17
Multiple-Translation Chinese (MTC) Part 3 English, Mandarin Chinese (Parallel) LDC2004T07
Multiple-Translation Chinese (MTC) Part 4 English, Mandarin Chinese (Parallel) LDC2006T04
Multiple-Translation Chinese Corpus English, Mandarin Chinese (Parallel) LDC2002T01
NIST 2009 Open Machine Translation (OpenMT) Evaluation Urdu and Arabic - English (Parallel) LDC2010T23
NIST 2012 Open Machine Translation (OpenMT) Progress Test Five Language Source Dari, Korean, Persian, Farsi, English, Mandarin Chinese, Arabic, Iranian Persian, Chinese (Parallel) LDC2014T02
NIST Meeting Pilot Corpus Speech English LDC2004S09
NIST Meeting Pilot Corpus Transcripts and Metadata English LDC2004T13
NIST Open MT 2008 Evaluation (MT08) Selected References and System Translations Urdu, Mandarin Chinese, Standard Arabic, English, Chinese, Arabic (Parallel) LDC2010T01
NLTK 1010
NTIMIT 1011
NomBank v 1.0 English LDC2008T23
North American News Text Corpus English LDC95T21
OntoNotes Release 5.0 English, Mandarin Chinese, Arabic, Chinese LDC2013T19
OntoNotes V3.0 - GALE Pre-Release English LDC2009E60
Original Penn Treebank release 2 1012
Penn Discourse Treebank Version 2.0 English LDC2008T05
Penn Treebank release 3 1013
Portuguese Newswire Text Portuguese LDC99T40
Prague Dependency Treebank 1.0 Czech, English (Parallel) LDC2001T10
PropBank frameset files (v1.7) other_10
PropBank on the Brown corpus other_11
Proposition Bank I English LDC2004T14
REFLEX Bengali LDC2015E13
REFLEX Hungarian LDC2015E82
REFLEX Tagalog LDC2015E90
REFLEX Tamil LDC2015E83
REFLEX Thai LDC2015E84
REFLEX Urdu LDC2015E14
REFLEX Yoruba LDC2015E91
RST Discourse Treebank English LDC2002T07
Reuters vol. 1 English 1015
Reuters vol. 2 English 1016
SAID English LDC2003T10
SANCL 2012 Shared Task Release 1 English LDC2012E43
SIGHAN Bakeoff LDC2003E16
SUSAS English LDC99S78
SUSAS Transcripts English LDC99T33
Santa Barbara Corpus of Spoken American English Part I American English LDC2000S85
Santa Barbara Corpus of Spoken American English Part II American English LDC2003S06
Santa Barbara Corpus of Spoken American English Part III American English LDC2004S10
Santa Barbara Corpus of Spoken American English Part IV American English LDC2005S25
SemEval-2016 Task 8 - Meaning Representation Parsing - Gold Standard AMRs English LDC2016E33
Spanish Discussion Forum Source Data R1 Spanish LDC2014E14
Spanish Language News Corpus Spanish 1017
Spanish Newswire Text, Volume 2 Spanish LDC99T41
Speech in Noisy Environments (SPINE) Evaluation Audio English LDC2000S96
Speech in Noisy Environments (SPINE) Evaluation Transcripts English LDC2000T54
Speech in Noisy Environments (SPINE) Training Audio English LDC2000S87
Speech in Noisy Environments (SPINE) Training Transcripts English LDC2000T49
Speech in Noisy Environments (SPINE2) Part 1 Audio English LDC2001S04
Speech in Noisy Environments (SPINE2) Part 1 Transcripts English LDC2001T05
Speech in Noisy Environments (SPINE2) Part 2 Audio English LDC2001S06
Speech in Noisy Environments (SPINE2) Part 2 Transcripts English LDC2001T07
Speech in Noisy Environments (SPINE2) Part 3 Audio English LDC2001S08
Speech in Noisy Environments (SPINE2) Part 3 Transcripts English LDC2001T09
Speech in Noisy Environments 1 (SPINE1 CODED) Coded Audio English LDC2001S99
Switchboard Cellular Part 1 Transcription English LDC2001T14
Switchboard-1 Release 2 English LDC97S62
Switchboard-2 Phase I English LDC98S75
Switchboard-2 Phase II English LDC99S79
Switchboard-2 Phase III Audio English LDC2002S06
Syllable-Final /s/ Lenition Spanish LDC2001T60
TAC 2009 KBP Assessment Results English LDC2009E90
TAC 2009 KBP Evaluation Generic Infoboxes V2.0 English LDC2009E56
TAC 2009 KBP Evaluation NIL Link Assessment English LDC2009E110
TAC 2009 KBP Evaluation Reference Knowledge Base English LDC2009E58A
TAC 2009 KBP Evaluation Reference Knowledge Base English LDC2009E58C
TAC 2009 KBP Evaluation Reference Knowledge Base English LDC2009E58B
TAC 2009 KBP Evaluation Slot Filling List English LDC2009E65
TAC 2010 KBP Assessment Results English LDC2010E61
TAC 2010 KBP Entity Linking IAA Study Results English LDC2012E31
TAC 2010 KBP Evaluation Entity Linking Gold Standard V1.0 English LDC2010E82
TAC 2010 KBP Evaluation Slot Filling Annotation English LDC2012E32
TAC 2010 KBP Evaluation Surprise Slot Filling Annotation English LDC2012E33
TAC 2010 KBP Generic Infoboxes English LDC2010E24
TAC 2010 KBP Source Data LDC2010E12
TAC 2010 KBP Training Entity Linking V2.0 English LDC2010E31
TAC 2010 KBP Training Slot Filling Annotation V2.1 English LDC2010E18
TAC 2010 RTE-6 KBP Validation Pilot Development Data English LDC2010E32
TAC 2011 Guided Summarization Test Data English LDC2011E28
TAC 2011 Guided Summarization Test Data V1.1 English LDC2011E62
TAC 2011 KBP English Evaluation Diagnostic Temporal Slot Filling Queries English LDC2011E85
TAC 2011 KBP English Evaluation Entity Linking Annotation English LDC2012E29
TAC 2011 KBP English Evaluation Entity Linking Queries English LDC2012E36
TAC 2011 KBP English Evaluation Regular Slot Filling Annotation V1.2 English LDC2011E89
TAC 2011 KBP English Evaluation Regular Slot Filling Queries English LDC2012E37
TAC 2011 KBP English Evaluation Temporal Slot Filling Annotation English LDC2012E38
TAC 2011 KBP English Evaluation Temporal Slot Filling Queries English LDC2012E39
TAC 2011 KBP English Regular Slot Filling Assessment Results English LDC2011E88
TAC 2011 KBP English Sample Temporal Slot Filling Annotation V1.2 English LDC2011E47
TAC 2011 KBP English Temporal Slot Filling Assessment Results English LDC2013E65
TAC 2011 KBP English Training Regular Slot Filling Annotation English LDC2011E48
TAC 2011 KBP English Training Temporal Slot Filling Annotation English LDC2011E49
TAC 2011 RTE-7 KBP Validation Development Data English LDC2011E29
TAC 2011 RTE-7 KBP Validation Test Data English LDC2011E30
TAC 2012 KBP English Regular Slot Filling Evaluation Annotations English LDC2012E91
TAC 2013 KBP English Entity Linking Evaluation Queries and Knowledge Base Links V1.1 English LDC2013E90
TAC 2013 KBP English Regular Slot Filling Assessment Results English LDC2013E91
TAC 2013 KBP English Regular Slot Filling Evaluation Queries and Annotations V1.1 English LDC2013E77
TAC 2013 KBP English Regular Slot Filling per:title Training Data English LDC2013E60
TAC 2013 KBP English Temporal Slot Filling Assessment Results English LDC2013E99
TAC 2013 KBP English Temporal Slot Filling Evaluation Queries and Annotations V1.1 English LDC2013E86
TAC 2013 KBP English Temporal Slot Filling Training Queries and Annotations English LDC2013E82
TAC 2013 KBP Source Corpus LDC2013E45
TAC 2014 KBP English Entity Linking Training AMR Queries and KB Links V1.1 English LDC2014E15
TAC 2014 KBP English Event Argument Extraction Evaluation Assessment Results V2.0 English LDC2014E88
TAC 2014 KBP English Event Argument Extraction Evaluation Source Corpus V1.1 English LDC2014R43
TAC 2014 KBP English Source Corpus English LDC2014E13
TAC 2014 KBP Event Argument Extraction Pilot Assessment Results V1.1 English LDC2014E40
TAC 2014 KBP Event Argument Extraction Pilot Source Corpus V1.1 English LDC2014E20
TAC KBP 2009 Evaluation Entity Linking List English LDC2009E64
TAC KBP 2016 Belief and Sentiment Evaluation Gold Standard Annotation (Versions 1 and 2) English LDC2016E114
TAC KBP Evaluation Surprise Slot Filling Queries English LDC2010E53
TAC KBP Gold Standard Entity Linking Entity Type List English LDC2009E86
TAC KBP Training Surprise Slot Filling Annotation English LDC2010E52
TDT2 Careful Transcription Audio English LDC2000S92
TDT2 Careful Transcription Text English LDC2000T44
TDT2 English Text English LDC99T35
TDT2 Mandarin Audio Corpus Mandarin Chinese LDC2001S93
TDT2 Multilanguage Text Version 4.0 English, Mandarin Chinese LDC2001T57
TDT2 Text Data and Tables 1019
TDT3 Multilanguage Text Version 2.0 English, Mandarin Chinese LDC2001T58
TERN 2004 Training Data V1.3 LDC2004E23
TI 46-Word English LDC93S9
TI 46-Word 1004
TIDES Extraction ACE 2004 Training Data V1.4 LDC2004E17
TIMIT Acoustic-Phonetic Continuous Speech Corpus English LDC93S1
TIPSTER Complete English LDC93T3A
TREC Mandarin Mandarin Chinese LDC2000T52
TREC Spanish Spanish LDC2000T51
Tactical Speaker Identification Speech Corpus (TSID) English LDC99S83
Taiwanese Putonghua Taiwanese Mandarin LDC98S72
TalkBank Switchboard Corpus 1018
The 2012 IBM Egyptian Arabic Corpus Egyptian Arabic LDC2012E77
The AQUAINT Corpus of English News Text English LDC2002T31
The Brooklyn-Geneva-Amsterdam-Helsinki Parsed Corpus of Old English Old English other_1
The Enron Sent Corpus v1.0 other_2
The George Bushotter Lakhota Text collection Other_5
The IViE corpus other_3
The New York Times Annotated Corpus English LDC2008T19
TimeBank 1.2 English LDC2006T08
Tipster 1020
Translanguage English Database (TED) Speech English LDC2002S04
Translanguage English Database (TED) Transcripts English LDC2002T03
Treebank-2 English LDC95T7
Treebank-3 English LDC99T42
USC Marketplace Broadcast News Speech English LDC99S82
USC Marketplace Broadcast News Transcripts English LDC99T36
Uzbek Incident Language Pack LDC2015E89
VAHA (POLYPHONE II) Spanish LDC96S41
Voice of America (VOA) Czech Broadcast News Audio Czech LDC2000S89
Voice of America (VOA) Czech Broadcast News Transcripts Czech LDC2000T53
Voicemail Corpus Part II English LDC2002S35
WordNet 1.5 Other_6
Zurich BNC web 1000.5
bilingual data extracted from three Creative Commons (CC BY-SA) sources LDC2012E79
530 total corpora