Enron Corpus

From Wikipedia, the free encyclopedia
Company database

The Enron Corpus is a database of over 600,000 emails written by 158 employees[1] of the Enron Corporation in the years leading up to the company's collapse in December 2001. The corpus was generated from Enron email servers by the Federal Energy Regulatory Commission (FERC) during its subsequent investigation.[2] A copy of the email database was later purchased for $10,000 by Andrew McCallum, a computer scientist at the University of Massachusetts Amherst.[3] He released this copy to researchers, providing a trove of data that has been used for studies on social networking and computer-mediated communication.

Creation

In the legal investigation into Enron's collapse, the discovery process required collecting and preserving vast amounts of data, for which the FERC hired Aspen Systems (now part of Lockheed Martin). The emails were collected at Enron Corporation headquarters in Houston during two weeks in May 2002 by Joe Bartling,[4] a litigation support and data analysis contractor for Aspen. In addition to the Enron employee emails, all of Enron's enterprise database systems,[5] hosted in Oracle databases on Sun Microsystems servers, were captured and preserved, including its online energy trading platform, EnronOnline.

Once collected, the Enron emails were processed and hosted in proprietary electronic discovery platforms (first Concordance, then iCONECT) for review by investigators from the FERC, the Commodity Futures Trading Commission, and the Department of Justice. At the conclusion of the investigation, and upon the issuance of the FERC staff report,[6] the emails and information collected were deemed to be in the public domain, to be used for historical research and academic purposes. The email archive was made publicly available and searchable via the web using iCONECT 24/7, but the sheer volume of email, over 160 GB, made it impractical to use. Copies of the collected emails and databases were made available on hard drives.

Jitesh Shetty and Jafar Adibi from the University of Southern California processed the data in 2004 and released a MySQL version.[7] In 2010, EDRM.net published a revised and expanded version 2 of the corpus,[8] containing over 1.7 million messages, which has been made available on Amazon S3 for easy access by researchers.

Exploitation

A visualization of the email network in the Enron Corpus, with coloring representing eight communities

The corpus is valued as one of the few mass collections of real emails that are publicly available for study; such collections are typically bound by numerous privacy and legal restrictions which render them prohibitively difficult to access, such as non-disclosure agreements and data sanitization.[3] Shetty and Adibi, based on their MySQL version, published some link analysis of which user accounts emailed which.[9] Linguistic comparison with more recent email corpora shows changes in the email register of English. It is also used as test or training data for research in natural language processing and machine learning.[10] The Pile dataset includes it.
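
As a rough illustration of what "which user accounts emailed which" means in practice, the sketch below counts sender-to-recipient pairs from raw message text using Python's standard `email` module. It is not the method of Shetty and Adibi, whose work used their MySQL version and graph entropy; the function name `count_links`, the `SAMPLE` messages, and the `example.com` addresses are hypothetical and chosen only for this example.

```python
from collections import Counter
from email import message_from_string
from email.utils import getaddresses


def count_links(raw_messages):
    """Count (sender, recipient) pairs across raw email message strings.

    Hypothetical helper for illustration; recipients are taken from the
    To and Cc headers only.
    """
    links = Counter()
    for raw in raw_messages:
        msg = message_from_string(raw)
        # getaddresses() splits header values into (display name, address) pairs.
        senders = getaddresses(msg.get_all("From", []))
        recipients = getaddresses(
            msg.get_all("To", []) + msg.get_all("Cc", [])
        )
        for _, sender in senders:
            for _, recipient in recipients:
                if sender and recipient:
                    links[(sender.lower(), recipient.lower())] += 1
    return links


# Made-up messages standing in for corpus entries (assumption, not real data).
SAMPLE = [
    "From: alice@example.com\nTo: bob@example.com, carol@example.com\n"
    "Subject: first\n\nbody",
    "From: alice@example.com\nTo: bob@example.com\nSubject: second\n\nbody",
]

if __name__ == "__main__":
    for (sender, recipient), n in count_links(SAMPLE).most_common():
        print(f"{sender} -> {recipient}: {n}")
```

The resulting counts form a weighted directed graph, which is the usual starting point for the kind of community visualization described in the figure caption above.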

References

  1. Klimt, Bryan; Yang, Yiming (2004). "The Enron Corpus: A New Dataset for Email Classification Research". pp. 217–226. CiteSeerX 10.1.1.61.1645.
  2. "The Enron Email Corpus". Archived 2011-03-08 at the Wayback Machine. Retrieved March 5, 2011.
  3. Markoff, John. "Armies of Expensive Lawyers, Replaced by Cheaper Software". New York Times. March 5, 2011. p. A1.
  4. Bartling, Joe (September 3, 2015). "The Enron Data Set - Where Did It Come From?". Bartling Forensic and Advisory. Archived from the original on April 15, 2016. Retrieved September 3, 2015.
  5. "FERC: Industries - Enron's Energy Trading Business Process and Databases". www.ferc.gov. Archived from the original on 2020-01-05. Retrieved 2015-09-02.
  6. FERC Staff Report - Price Manipulation in Western Markets - Findings at a Glance (3-26-2003). Archived 2006-02-21 at the Wayback Machine.
  7. "Enron processed database".
  8. Socha, George. "EDRM Enron Email Data Set v2 Now Available". EDRM.net. Archived from the original on 2011-09-04. Retrieved 2012-09-03.
  9. Shetty, Jitesh; Adibi, Jafar (2005). "Discovering important nodes through graph entropy the case of Enron email database". Proceedings of the 3rd international workshop on Link discovery - LinkKDD '05. pp. 74–81. doi:10.1145/1134271.1134282. ISBN 978-1595932150. S2CID 10122735.
  10. Friginal, Eric; Hardy, Jack (2013). Corpus-Based Sociolinguistics: A Guide for Students. Routledge. p. 167. ISBN 978-1-136-29277-4. Retrieved 29 May 2020.

Retrieved from "https://en.wikipedia.org/w/index.php?title=Enron_Corpus&oldid=1338442217"