Enron Corpus

From Wikipedia, the free encyclopedia

Company database

The Enron Corpus is a database of over 600,000 emails written by 158 employees^[1] of the Enron Corporation in the years leading up to the company's collapse in December 2001. The corpus was generated from Enron email servers by the Federal Energy Regulatory Commission (FERC) during its subsequent investigation.^[2] A copy of the email database was subsequently purchased for $10,000 by Andrew McCallum, a computer scientist at the University of Massachusetts Amherst.^[3] He released this copy to researchers, providing a trove of data that has been used for studies on social networking and computer-mediated communication.

Creation

In the legal investigation into Enron's collapse, the discovery process required collecting and preserving vast amounts of data, for which the FERC hired Aspen Systems (now part of Lockheed Martin). The emails were collected at Enron Corporation headquarters in Houston during two weeks in May 2002 by Joe Bartling,^[4] a litigation support and data analysis contractor for Aspen. In addition to the Enron employee emails, all of Enron's enterprise database systems,^[5] hosted in Oracle databases on Sun Microsystems servers, were captured and preserved, including its online energy trading platform, EnronOnline.

Once collected, the Enron emails were processed and hosted in proprietary electronic discovery platforms (first Concordance, then iCONECT) for review by investigators from the FERC, the Commodity Futures Trading Commission, and the Department of Justice. At the conclusion of the investigation, and upon the issuance of the FERC staff report,^[6] the emails and information collected were deemed to be in the public domain, to be used for historical research and academic purposes. The email archive was made publicly available and searchable via the web using iCONECT 24/7, but the sheer volume of email, over 160 GB, made it impractical to use. Copies of the collected emails and databases were made available on hard drives.

Jitesh Shetty and Jafar Adibi from the University of Southern California processed the data in 2004 and released a MySQL version.^[7] In 2010, EDRM.net published a revised and expanded version 2 of the corpus,^[8] containing over 1.7 million messages, which has been made available on Amazon S3 for easy access by researchers.

Exploitation

A visualization of the email network in the Enron Corpus, with coloring representing eight communities

The corpus is valued as one of the few mass collections of real emails that are publicly available for study; such collections are typically bound by numerous privacy and legal restrictions, such as non-disclosure agreements and data sanitization, which render them prohibitively difficult to access.^[3] Shetty and Adibi, based on their MySQL version, published some link analysis of which user accounts emailed which.^[9] Linguistic comparison with more recent email corpora shows changes in the email register of English. It is also used as test or training data for research in natural language processing and machine learning.^[10] The Pile dataset includes it.
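
As a rough illustration of what "which user accounts emailed which" can mean in practice, the following Python sketch counts sender-recipient pairs across a set of raw email messages using only the standard library. It is a minimal example under stated assumptions, not Shetty and Adibi's method: the function name `count_links` and the sample messages (with made-up `example.com` addresses) are hypothetical, and the corpus's own file layout or database schema is not assumed here.

```python
from collections import Counter
from email import message_from_string
from email.utils import getaddresses


def count_links(raw_messages):
    """Count (sender, recipient) pairs over an iterable of raw message strings.

    Hypothetical helper for illustration; it treats To, Cc and Bcc alike.
    """
    links = Counter()
    for raw in raw_messages:
        msg = message_from_string(raw)
        # getaddresses() returns (display name, address) tuples.
        senders = [
            addr.lower()
            for _, addr in getaddresses(msg.get_all("From", []))
            if addr
        ]
        recipients = [
            addr.lower()
            for header in ("To", "Cc", "Bcc")
            for _, addr in getaddresses(msg.get_all(header, []))
            if addr
        ]
        for sender in senders:
            for recipient in recipients:
                links[(sender, recipient)] += 1
    return links


# Two made-up messages (assumed sample data, not taken from the corpus).
SAMPLE = [
    "From: alice@example.com\n"
    "To: bob@example.com, carol@example.com\n"
    "Subject: Q3\n\nHi",
    "From: alice@example.com\n"
    "To: bob@example.com\n"
    "Subject: Re: Q3\n\nThanks",
]

links = count_links(SAMPLE)
for (sender, recipient), n in sorted(links.items()):
    print(f"{sender} -> {recipient}: {n}")
```

The resulting counter can be read as a weighted directed graph, with accounts as nodes and message counts as edge weights, which is the usual starting point for network analyses of email data.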

References

[1] Klimt, Bryan; Yang, Yiming (2004). "The Enron Corpus: A New Dataset for Email Classification Research". pp. 217–226. CiteSeerX 10.1.1.61.1645.
[2] "The Enron Email Corpus". Archived 2011-03-08 at the Wayback Machine. Retrieved March 5, 2011.
[3] Markoff, John. "Armies of Expensive Lawyers, Replaced by Cheaper Software". New York Times. March 5, 2011. p. A1.
[4] Bartling, Joe (September 3, 2015). "The Enron Data Set - Where Did It Come From?". Bartling Forensic and Advisory. Archived from the original on April 15, 2016. Retrieved September 3, 2015.
[5] "FERC: Industries - Enron's Energy Trading Business Process and Databases". www.ferc.gov. Archived from the original on 2020-01-05. Retrieved 2015-09-02.
[6] FERC Staff Report - Price Manipulation in Western Markets - Findings at a Glance. Archived 2006-02-21 at the Wayback Machine (3-26-2003).
[7] "Enron processed database".
[8] Socha, George. "EDRM Enron Email Data Set v2 Now Available". EDRM.net. Archived from the original on 2011-09-04. Retrieved 2012-09-03.
[9] Shetty, Jitesh; Adibi, Jafar (2005). "Discovering important nodes through graph entropy the case of Enron email database". Proceedings of the 3rd international workshop on Link discovery - LinkKDD '05. pp. 74–81. doi:10.1145/1134271.1134282. ISBN 978-1595932150. S2CID 10122735.
[10] Friginal, Eric; Hardy, Jack (2013). Corpus-Based Sociolinguistics: A Guide for Students. Routledge. p. 167. ISBN 978-1-136-29277-4. Retrieved 29 May 2020.

External links

Tutorial on data modeling with the Enron Corpus
Shetty and Adibi's enron email dataset download on S3 (178 MB)
Nathan Heller: What the Enron E-mails Say About Us, The New Yorker, July 24, 2017
Searchable Enron Email Database (requires registration)

Retrieved from "https://en.wikipedia.org/w/index.php?title=Enron_Corpus&oldid=1338442217"
