The Enron Corpus is a database of over 600,000 emails written by 158 employees[1] of the Enron Corporation in the years leading up to the company's collapse in December 2001. The corpus was generated from Enron email servers by the Federal Energy Regulatory Commission (FERC) during its subsequent investigation.[2] A copy of the email database was subsequently purchased for $10,000 by Andrew McCallum, a computer scientist at the University of Massachusetts Amherst.[3] He released this copy to researchers, providing a trove of data that has been used for studies on social networking and computer-mediated communication.
In the legal investigation into Enron's collapse, the discovery process required collecting and preserving vast amounts of data, for which the FERC hired Aspen Systems (now part of Lockheed Martin). The emails were collected at Enron Corporation headquarters in Houston during two weeks in May 2002 by Joe Bartling,[4] a litigation support and data analysis contractor for Aspen. In addition to the Enron employee emails, all of Enron's enterprise database systems,[5] hosted in Oracle databases on Sun Microsystems servers, were captured and preserved, including its online energy trading platform, EnronOnline.
Once collected, the Enron emails were processed and hosted in proprietary electronic discovery platforms (first Concordance, then iCONECT) for review by investigators from the FERC, the Commodity Futures Trading Commission, and the Department of Justice. At the conclusion of the investigation, and upon the issuance of the FERC staff report,[6] the emails and information collected were deemed to be in the public domain, to be used for historical research and academic purposes. The email archive was made publicly available and searchable via the web using iCONECT 24/7, but the sheer volume of email, over 160 GB, made it impractical to use. Copies of the collected emails and databases were made available on hard drives.
Jitesh Shetty and Jafar Adibi from the University of Southern California processed the data in 2004 and released a MySQL version.[7] In 2010, EDRM.net published a revised and expanded version 2 of the corpus,[8] containing over 1.7 million messages, which has been made available on Amazon S3 for easy access by researchers.

The corpus is valued as one of the few mass collections of real emails that are publicly available for study; such collections are typically bound by numerous privacy and legal restrictions, such as non-disclosure agreements and data sanitization, which render them prohibitively difficult to access.[3] Shetty and Adibi, based on their MySQL version, published some link analysis of which user accounts emailed which.[9] Linguistic comparison with more recent email corpora shows changes in the email register of English. The corpus is also used as test or training data for research in natural language processing and machine learning.[10] It is included in The Pile dataset.
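The kind of link analysis described above, counting which accounts emailed which, can be illustrated with a minimal sketch. This is not Shetty and Adibi's method or their MySQL schema; it assumes the messages are available as raw RFC 822 text, uses only Python's standard library, and the function name `count_links` and the example.com addresses are hypothetical.

```python
from collections import Counter
from email import message_from_string
from email.utils import getaddresses


def count_links(raw_messages):
    """Count (sender, recipient) pairs across raw RFC 822 message strings.

    Hypothetical helper for illustration; recipients are taken from the
    To and Cc headers only.
    """
    links = Counter()
    for raw in raw_messages:
        msg = message_from_string(raw)
        # getaddresses returns (display name, address) tuples
        senders = getaddresses(msg.get_all("From", []))
        recipients = getaddresses(
            msg.get_all("To", []) + msg.get_all("Cc", [])
        )
        for _, sender in senders:
            for _, recipient in recipients:
                if sender and recipient:
                    links[(sender.lower(), recipient.lower())] += 1
    return links


# Made-up messages, not taken from the corpus
sample = [
    "From: alice@example.com\nTo: bob@example.com, carol@example.com\n"
    "Subject: a\n\nbody",
    "From: alice@example.com\nTo: bob@example.com\nSubject: b\n\nbody",
    "From: bob@example.com\nTo: alice@example.com\nCc: carol@example.com\n\nbody",
]
links = count_links(sample)
for (sender, recipient), n in sorted(links.items()):
    print(sender, "->", recipient, n)
```

The resulting counts form a weighted directed graph over accounts, which is the usual starting point for further link analysis.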