The EnronSent Corpus

The EnronSent corpus is a special preparation of a portion of the Enron Email Dataset, designed specifically for use in corpus linguistics and language analysis. Divided across 45 plain text files, this corpus contains 2,205,910 lines and 13,810,266 words.

This preparation was created by cleaning up a portion of the original Enron Corpus. It contains 96,107 messages from the "Sent Mail" directories of all the users in the corpus. It has been cleaned specifically for use with conventional corpus linguistics tools (such as grep and Python), and an attempt has been made to remove as much non-human-generated text as possible from the raw messages in the original data. For more history on the original dataset, please see the homepage for the Enron Email Dataset and "Introducing the Enron Corpus".
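
As an illustration of working with the plain text files in Python, here is a minimal sketch that counts lines and words and builds a word frequency table. The directory name "enronsent", the UTF-8 encoding, and the tokenization rule are assumptions made for this example, not details taken from the corpus documentation, so the totals it reports may not match the figures above exactly.

```python
from collections import Counter
from pathlib import Path
import re

# Assumption: a token is a run of ASCII letters or apostrophes. The corpus
# documentation may define "word" differently.
TOKEN_RE = re.compile(r"[A-Za-z']+")


def corpus_stats(corpus_dir, pattern="*"):
    """Return (line count, whitespace-separated word count, token frequencies)."""
    lines = 0
    words = 0
    freq = Counter()
    for path in sorted(Path(corpus_dir).glob(pattern)):
        if not path.is_file():
            continue
        # Assumption: the files decode as UTF-8; undecodable bytes are replaced.
        with path.open(encoding="utf-8", errors="replace") as handle:
            for line in handle:
                lines += 1
                words += len(line.split())
                freq.update(token.lower() for token in TOKEN_RE.findall(line))
    return lines, words, freq


if __name__ == "__main__":
    # "enronsent" is a hypothetical path to the unpacked corpus files.
    line_count, word_count, frequencies = corpus_stats("enronsent")
    print(line_count, "lines,", word_count, "words")
    for token, count in frequencies.most_common(10):
        print(token, count)
```

Reading the files line by line keeps memory use low, which matters for a corpus of this size. Only the frequency table grows with the vocabulary.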

Please see the included README file for more information about this data. For a more detailed explanation of the preparation of the corpus, please read University of Colorado Institute of Cognitive Science Technical Report 01-2011.

Citing the EnronSent Corpus

Styler, Will (2011). The EnronSent Corpus. Technical Report 01-2011, University of Colorado at Boulder Institute of Cognitive Science, Boulder, CO.

Download the EnronSent Corpus:

Privacy Concerns

This preparation and all corpus data are in the public domain. All messages in the Enron corpus were placed in the public domain in 2003 by the United States Federal Energy Regulatory Commission during its investigation of Enron. The messages in the source data represent all of the email in the Enron Corporation's database, not just those of the investigated individuals. Although many of the affected individuals have already had their messages removed from the source data, it is important to remember that the vast majority of the people whose messages are in this corpus were likely not directly involved in the investigation. Please keep the privacy of these individuals in mind as you work with this corpus and the data it contains.

Questions, comments or concerns?

Contact Will Styler at the University of Colorado.