
Wiktionary:Corpora

From Wiktionary, the free dictionary
Shortcut:
WT:CORPUS

This page is dedicated to listing collections of texts useful for the work of creating a dictionary. These collections are often known as "corpora" or, less commonly, "corpuses". Many of them offer features such as full-text search, term frequency information, and collocation search.

For a more user-friendly introduction to some of the most prominent corpora, as well as other resources like dictionaries, see Wiktionary:Quotations/Resources. Another page, Wiktionary:Searchable external archives, focuses more specifically on resources that can reliably provide citations passing Wiktionary's criteria for inclusion.

Note that corpora containing text in multiple languages, but where English text makes up a significant portion, are listed in the English table below, with the word "Multilingual" included in their "Dialect" entry.

If you know of any other resources that aren't listed here, please add them or suggest them on the talk page.

Glossary

The following is a brief explanation of how various terms are used in describing and categorizing the resources on this page.

  • Access restrictions: Any barriers to accessing the resource's contents, such as registration or paying a subscription. A number of listed resources can be accessed through the Wikipedia Library for free.
  • Active: Whether a corpus is still publicly accessible, either through the internet or another method. See also the synonymous use of strikethroughs described below.
  • Available medium: The format through which the language can be accessed in the resource, such as written text or speech recorded in a video.
  • Library: Collection of texts gathered with a wide net and without linguistic work particularly in mind. It must be possible to search the contents of these texts.
  • Native Name: The name of the resource in its default language. Doesn't necessarily match the language of the resource's contents. A dash is used instead if the default language of the resource is English.
  • Original Medium: The way the language was originally produced, whether it was spoken, written, etc.
  • Social media: A live website or other online center for mass user communication, or the attempt at a near complete archive of such. If the resource is an archive with a particular focus, then it is considered a library or corpus.
  • Re-use restrictions: Unique restrictions on the distribution of the resource's contents beyond general copyright law, in particular restrictions on commercial use or restrictions to academic users only. This is particularly relevant to Wiktionary, where all content must be redistributable commercially per Wiktionary's CC BY-SA 4.0 license.
  • Tagged Corpus: Collection of texts gathered within a specific scope with linguistic work at least partly in mind. The contents of the texts are marked by part of speech, meaning, pragmatics, or any other method.
  • Translated Name: A translation of the resource's name from another language if the default name of the resource is not already in English.
  • Text: A continuous use of language published, released, or spoken as a coherent work. This could be a forum post in a thread, a book, an issue of a magazine, or a speech.
  • Untagged Corpus: Collection of texts gathered within a specific scope with linguistic work at least partly in mind. The contents of the texts are not marked by part of speech, meaning, pragmatics, or any other method.

Symbols

  • Approx.: "Approximately", used to indicate that a date or quantity was estimated or not exactly known when inputted, but a best guess was given.
  • e: "Exponent", used to represent sizes as part of E notation, a type of scientific notation. "5e3", for example, represents 5×10^3, or a 5 with three 0's after it: 5,000.
  • Esp.: "Especially", used to qualify the most common quality of a resource, even if there are notable exceptions.
  • Dash (—): The dash symbol "—" is usually used in tables for information about a resource that cannot be readily determined or approximated. Sometimes it is also used when a particular column is not applicable to a resource.
  • Question mark (?): The symbol "?" is used in tables for information about a resource that has not yet been determined, but probably could be.
  • Strikethrough: Resources with their names crossed out with a strikethrough were nonfunctional or otherwise broken at the time of the entry's last update. See also the synonymous table column "Active" described above.
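
The E notation used for sizes in the tables can be expanded mechanically. The following is a minimal sketch in Python, not part of the page's own tooling; the helper name `expand_size` is hypothetical, and it assumes the cell holds only the number, with any footnote markers such as "[1]" already stripped.

```python
from decimal import Decimal


def expand_size(cell: str) -> int:
    """Convert a size written in E notation (e.g. "5e3") to a plain integer."""
    # Decimal parses strings such as "5e3" exactly, which avoids
    # floating-point rounding for very large sizes like "3e12".
    return int(Decimal(cell))


# Print a few sample sizes in their expanded form.
for cell in ("5e3", "2e10", "8e0"):
    print(cell, "=", f"{expand_size(cell):,}")
```

Because table sizes are rounded to one significant digit, the expanded value should be read as an order-of-magnitude estimate rather than an exact count.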

English corpora

English corpora table
Name | Resource Type | Size in words[1] | Size in texts[1] | Dialect | Start year | End year | Original Medium | Available Medium | Genre | Re-use restrictions | Access restrictions | Active | Date of entry update
News on the Web (NOW)Corpus, Tagged2e103e7(Various)[2][3]2010PresentWritten, Computer, InternetWrittenNonfiction, NewsNoneFree registration requiredYes2022/10/31
iWeb: The Intelligent Web-based CorpusCorpus, Tagged1e102e7(Various)[4][3]20172017Written, Computer, InternetWrittenGeneral, esp. NonfictionNoneFree registration requiredYes2022/10/31
Global Web-Based English (GloWbE)Corpus, Tagged2e92e5(Various)[2][3]20122013Written, Computer, InternetWrittenGeneral, esp. NonfictionNoneFree registration requiredYes2022/10/31
Wikipedia CorpusCorpus, Tagged2e94e6(Various)20142014Written, Computer, InternetWrittenNonfiction, EncyclopediaNoneFree registration requiredYes2022/10/30
Coronavirus Corpus[5]Corpus, Tagged2e92e6(Various)[2][3]20202022Written, Computer, InternetWrittenNonfiction, News, COVID-19NoneFree registration requiredYes2024/05/10
Corpus of Contemporary American English (COCA)Corpus, Tagged1e95e5American19902019MultimediaWrittenGeneral, esp. NonfictionNoneFree registration requiredYes2023/03/27
Early English Books Online (EEBO)Corpus, Tagged8e83e4British1470 (approx.)1690 (approx.)Written, Books, PrintWrittenGeneralNoneFree registration requiredYes2022/10/30
Early English Books Online (EEBO) TCPCorpus, Untagged6e4British14751700Written, Books, PrintWrittenGeneralNoneNoneYes2022/10/31
Early English Books Online (EEBO, V2)Corpus, Untagged6e81e4British1470 (approx.)1690 (approx.)Written, Books, PrintWrittenGeneralNoneFree registration requiredYes2022/11/02
FilmotLibrary5e8(Various, Multilingual)2005 (approx.)PresentSpoken, GeneralAudio-visualGeneral, esp. NonfictionNoneNoneYes2022/10/30
YouGlishLibrary1e8(Various)2005 (approx.)PresentSpoken, Formal[6]Audio-visualNonfictionNoneNoneYes2022/10/30
TED Corpus Search Engine (TCSE)Corpus, Tagged1e75e3(Various)20072023[7]Spoken, Formal, SpeechesAudio-visualNonfictionNoneNoneYes2022/10/30
Archive-It CollectionsLibrary2e6(Various)1996PresentWritten, Computer, InternetWrittenGeneral, esp. NonfictionNoneNoneYes2022/10/30
ACL Anthology Reference Corpus (ARC)Corpus, Tagged6e72e4(Various)19792015Written, Periodicals, JournalsWrittenNonfiction, Academic, NLPNoneNoneYes2022/10/30
COVID-19 Open Research Dataset (CORD-19)Corpus, Tagged3e97e5(Various)1922[8]2020[9]Written, Periodicals, JournalsWrittenNonfiction, AcademicNoneNoneYes2022/10/30
EcoLexicon EnglishCorpus, Tagged2e72e3(Various)19732016WrittenWrittenNonfiction, Academic, EnvironmentNoneNoneYes2022/10/30
Lipstick AlleySocial MediaAmerican, African2000PresentWritten, Computer, Social Media, ForumWrittenGeneral, esp. Nonfiction, Celebrity NewsNoneFree registration required[10]Yes2023/06/23
Corpus of Regional African American Language (CORAAL)Corpus, Untagged1e62e2American, African19682017Spoken, InterviewsAudioGeneral, Sociolinguistic interviews[11]NoneNoneYes2022/10/31
Google TrendsTrends(Various, Multilingual)2004PresentWritten, Computer, Internet SearchesWrittenGeneralNoneNoneYes2022/10/31
Google NgramsTrends2e7[12][13]2e12[12][14][13](Various, Multilingual)[15]1470[13]PresentMultimediaWrittenGeneralNoneNoneYes2022/10/31
Google BooksLibrary4e7(Various, Multilingual)1400 (approx.)PresentMultimediaWrittenGeneralNoneNoneYes2022/10/31
Google ScholarLibrary1e8[16](Various, Multilingual)1700 (approx.)PresentWritten, Periodicals, JournalsWrittenNonfiction, Academic; LawNoneNoneYes2023/01/19
Corpus of Middle English Prose and VerseCorpus, Untagged3e2Middle10001500Written, Books, PrintWrittenGeneral, esp. NonfictionNoneNoneYes2022/10/31
Michigan Corpus of Upper-level Student Papers (MICUSP)Corpus, Untagged3e68e2(Various, ESL[17])20022009Written, College WorkWrittenNonfiction, AcademicRestrictions on commercial use[18]NoneYes2022/12/28
Michigan Corpus of Academic Spoken English (MiCASE)[19][20]Corpus, Untagged2e62e2American (mostly)19982001Spoken, Formal, SpeechesAudio,[19] WrittenNonfiction, AcademicRestrictions on commercial use[21]NoneYes2022/10/31
British Academic Spoken English Corpus (BASE)Corpus, Tagged1e62e2British19982005Spoken, Formal, SpeechesWrittenNonfiction, AcademicNoneNoneYes2022/11/02
British Academic Written English Corpus (BAWE)Corpus, Tagged7e63e3British20002007Written, College WorkWrittenNonfiction, AcademicNoneNoneYes2022/11/02
Public Papers of the Presidents of the United StatesLibrary1e2American19382002MultimediaWrittenNonfiction, PoliticsNoneNoneYes2023/06/17
Google GroupsSocial Media(Various)19812024Written, Computer, Social Media, UsenetWrittenGeneral, esp. NonfictionNoneNoneYes2024/03/20
UsenetArchives.com[22]Social Media7e8(Various)1981[23]Present?Written, Computer, Social Media, UsenetWrittenGeneral, esp. NonfictionNoneNoneYes2024/03/20
NarkiveSocial Media3e8(Various)1990 (approx.)PresentWritten, Computer, Social Media, UsenetWrittenGeneral, esp. NonfictionNoneNoneYes2024/03/20
EuropeanaLibrary2e7(Various, Multilingual)0400 (approx.)PresentMultimediaMultimediaGeneralNoneNoneYes2022/10/31
Internet ArchiveLibrary6e7(Various, Multilingual)PresentMultimediaMultimediaGeneralNoneFree registration requiredYes2022/10/31
Eighteenth Century Collections Online (ECCO) TCPCorpus, Untagged2e3(Various, Multilingual)17011800Written, Books, PrintWrittenGeneralNoneNoneYes2022/10/31
Old Bailey Corpus (OBC) 2.0Corpus, Tagged4e71e6British (various dialects)17201913Spoken, Formal, Court ProceedingsWrittenNonfiction, Law, Courts, CriminalNoneFree registration requiredYes2022/10/31
Old Bailey Proceedings OnlineCorpus, Untagged1e8British (various dialects)16741913Spoken, Formal, Court ProceedingsWrittenNonfiction, Law, Courts, CriminalNoneNoneYes2022/10/31
Royal Society Corpus (RSC) 6.0.1 OpenCorpus, Tagged8e72e4British16651920Written, Periodicals, Journals, PrintWrittenNonfiction, AcademicNoneYes[24]Yes2025/10/12
Royal Society Corpus (RSC) 6.0.4 Open with TopicsCorpus, Tagged3e82e4British16651920Written, Periodicals, Journals, PrintWrittenNonfiction, AcademicNoneFree registration requiredYes2022/10/31
X (formerly Twitter)Social Media3e12[25](Various, Multilingual)2005PresentWritten, Computer, Social Media, TwitterWrittenGeneral, esp. NonfictionNoneNoneYes2022/10/31
SocialGrep (Reddit) Corpora[26]Corpus, Untagged9e7(Various)2005 (approx.)Present?Written, Computer, Social Media, RedditWrittenGeneral, esp. NonfictionNoneNoneNo2025/02/20
Europarl 7 Sample, EnglishCorpus, Tagged2e78e3International/ELF[27]20072011Spoken, Formal, Legislative ProceedingsWrittenNonfiction, Law, LegislaturesNoneNoneYes2022/11/01
Europarl 3, EnglishCorpus, Tagged2e77e2International/ELF[27]19962006Spoken, Formal, Legislative ProceedingsWrittenNonfiction, Law, LegislaturesNoneFree registration requiredYes2022/11/01
TARACorpus, Tagged9e52e4British2006 (approx.)2006 (approx.)Written, Periodicals, Newspapers, PrintWrittenNonfiction, NewsNoneFree registration requiredYes2022/11/01
British National Corpus (BNC)Corpus, Tagged1e84e3British19601993MultimediaWrittenGeneralNoneFree registration requiredYes2022/11/01
British National Corpus (BNC) SamplerCorpus, Tagged2e62e2British19751993MultimediaWrittenGeneralNoneFree registration requiredYes2022/11/01
Phrases in English (BNC)[28][29]Corpus, Tagged1e84e3British19601993MultimediaWrittenGeneralNoneNoneYes2023/02/12
Just The Word (BNC)[28]Corpus, Tagged1e84e3British19601993MultimediaWrittenGeneralNoneNoneYes2023/02/12
British English 2006 (BE06)Corpus, Tagged1e65e2British20032008WrittenWrittenGeneralNoneFree registration requiredYes2022/11/01
American English 2006 (AME06)Corpus, Tagged1e65e2American2006 (approx.)2006 (approx.)WrittenWrittenGeneralNoneFree registration requiredYes2022/11/22
Hansard Corpus (British Parliament)Corpus, Tagged2e98e6British18032005Spoken, Formal, Legislative ProceedingsWrittenNonfiction, Law, LegislaturesNoneFree registration requiredYes2022/11/01
British Parliament HansardLibraryBritish1800PresentSpoken, Formal, Legislative ProceedingsWrittenNonfiction, Law, LegislaturesNoneNoneYes2022/11/01
Australian Parliament HansardLibraryAustralian1901PresentSpoken, Formal, Legislative ProceedingsWrittenNonfiction, Law, LegislaturesNoneNoneYes2022/11/01
Canadian House of Commons HansardLibraryCanadian2002PresentSpoken, Formal, Legislative ProceedingsWrittenNonfiction, Law, LegislaturesNoneNoneYes2022/11/01
New Zealand Parliament HansardLibraryNew Zealand1854PresentSpoken, Formal, Legislative ProceedingsWrittenNonfiction, Law, LegislaturesNoneNoneYes2022/11/01
GovInfo (United States)LibraryAmerican1793PresentMultimediaWrittenNonfiction, LawNoneNoneYes2022/11/01
Transgender Usenet Archive (TUA)Corpus, Untagged4e5(Various)19942013Written, Computer, Social Media, UsenetWrittenGeneral, Transgender TopicsNoneNoneYes2022/11/01
Science Forums[30]Social Media-1e5(Various)19922014Written, Computer, Social Media, BBSWrittenNonfiction, ScienceNoneNoneNo2025/09/16
TextFiles.comLibrary(Various)1980 (approx.)1995 (approx.)MultimediaMultimediaGeneral, esp. Nonfiction, TechnologyNoneNoneYes2022/11/01
LDS General Conference CorpusCorpus, Tagged3e71e4American1851PresentSpoken, Formal, SpeechesWrittenReligious, Latter Day SaintsNoneNoneYes2022/11/01
FidoNet Echomail ArchiveSocial Media(Various)1990 (approx.)2016 (approx.)Written, Computer, Social Media, FidoNetWrittenGeneral, esp. Nonfiction, TechnologyNoneNoneYes2022/11/01
FidoNet HolySmoke Archive[31]Library4e5(Various)19932004Written, Computer, Social Media, FidoNetWrittenNonfiction, ReligionNoneNoneNo2025/10/12
Dúchas ProjectLibrary2e6Irish1900 (approx.)1940 (approx.)MultimediaWrittenFiction, FolkloreNoneNoneYes2022/11/02
Freiburg-Brown Corpus of American English (FROWN)Corpus, Tagged1e65e2American19921992Written, PrintWrittenGeneralNoneFree registration requiredYes2022/11/02
Brown Corpus FamilyCorpus, Tagged1e62e3Written, PrintWrittenGeneralNoneFree registration requiredYes2022/11/02
Brown Family (C8 tags)Corpus, Tagged6e62e3(Various)19311991Written, PrintWrittenGeneralNoneFree registration requiredYes2022/11/02
Brown Corpus[32]Corpus, Tagged1e61e3American19611961Written, PrintWrittenGeneralNoneNoneYes2022/11/02
Corpus of English DialoguesCorpus, Tagged1e62e2British(?)15601760MultimediaWrittenGeneral, DialoguesNoneFree registration requiredYes2022/11/02
Florence Early English Newspapers (FEEN)Corpus, Tagged3e5-[33]British(?)16201649Written, Periodicals, Newspapers, PrintWrittenNonfiction, NewsNoneNoneYes2023/03/27
Transhistorical Corpus of Written EnglishCorpus, Tagged5e58e2(Various)14052019WrittenWrittenGeneralNoneNoneYes2022/11/02
Linguistic Landscape CorpusCorpus, Tagged5e66e2(Various)19972018WrittenWrittenNonfiction, AcademicNoneFree registration requiredYes2022/11/02
ICNALE Online[34]Corpus, Tagged4e62e4(Various, ESL[17])[35]2007 (approx.)2022 (approx.)Multimedia, College WorkMultimediaNonfiction, AcademicNoneNoneYes2022/11/02
European Football Championship Interpreting Corpus (EFCIC)Corpus, Tagged1e41e120202020Spoken, Entertainment, Interpretation, InterviewWrittenNonfiction, SportsNoneNoneYes2022/11/02
UkWac Complete[36]Corpus, Tagged2e93e6British[3]2005 (approx.)2007 (approx.)Written, Computer, InternetWrittenGeneralNoneNoneYes2022/11/02
UkWac Small[36]Corpus, Tagged8e71e5British[3]2005 (approx.)2007 (approx.)Written, Computer, InternetWrittenGeneralNoneNoneYes2022/11/02
Postcard Archive @ Florida State University[37]Library3e3[38](Various)1829 (approx.)2016 (approx.)Written, PostcardsWrittenNonfiction, PostcardsNoneNoneYes2022/11/06
PlayPhrase.meCorpus, Tagged8e6[39](Various)1970 (approx.)Present?Spoken, Entertainment, MoviesAudio-visualFiction, MoviesNoneNoneYes2022/11/07
European Union DGT-UD: EnglishCorpus, Tagged1e85e4International/ELF[27]1948 (approx.)2016Written, Legislative ActsWrittenNonfiction, Law, LegislaturesNoneNoneYes2022/11/16
Opus-MontenegrinSubs 1.0: EnglishCorpus, Tagged5e52e2(Various)20072013Spoken, Entertainment, TelevisionWrittenFiction, TelevisionNoneNoneYes2022/11/16
Archive of Our Own (AO3)Library1e7(Various)2007PresentWritten, Computer, InternetWrittenFiction, Short Stories, Fan Works[40]NoneNoneYes2022/11/22
SCP FoundationLibrary2e3(Various)2007PresentWritten, Computer, InternetWrittenFiction, Short Stories, Sci-Fi[40]NoneNoneYes2022/11/22
NEWS-GB (British newspapers)[41]Corpus, Tagged2e8British2004 (approx.)2004 (approx.)Written, PrintWrittenNonfiction, NewsNoneNoneNo2025/10/12
INTERNET-EN[41]Corpus, Tagged2e85e4(Various)2006 (approx.)2006 (approx.)Written, Computer, InternetWrittenGeneralNoneNoneNo2025/10/12
BLOGS-EN (Political blogs)[41]Corpus, Tagged5e8(Various)2008 (approx.)2008 (approx.)Written, Computer, InternetWrittenNonfiction, PoliticsNoneNoneNo2025/10/12
Manually Annotated Sub-Corpus (MASC)Library[42]5e54e2American1990 (approx.)2010 (approx.)MultimediaWrittenGeneralNoneNoneYes2022/11/23
Open American National Corpus (OANC)Library[43]2e79e3American19902005 (approx.)MultimediaWrittenGeneral, esp. NonfictionNoneNoneYes2025/09/22
Lancaster Newsbooks Corpus (1654 part)Corpus, Tagged9e52e2British16531654Written, Periodicals, Newspaper, PrintWrittenNonfiction, NewsNoneFree registration requiredYes2022/11/23
The Mail ArchiveLibrary2e8(Various)1990PresentWritten, Computer, Mailing ListWrittenNonfiction, esp. Coding and ComputersNoneNoneYes2022/11/26
CataList (LISTSERV catalog)[44]Library-[45](Various)1990 (approx.)PresentWritten, Computer, Mailing ListWrittenNonfictionNoneNoneYes2022/11/28
United Nations Digital LibraryLibrary7e5[46](Various, International/ELF[27])1875[47]PresentMultimediaMultimediaNonfiction, PoliticsNoneNoneYes2022/11/29
Genius.comLibrary(Various, Multilingual)1900 (approx.)PresentSpoken, Entertainment, MusicWrittenGeneral, MusicNoneNoneYes2022/12/06
Chronicling AmericaLibraryAmerican17771963Written, Periodicals, Newspaper, PrintWrittenNonfiction, NewsNoneNoneYes2022/12/06
Library of CongressLibrary3e6[48](Various, Multilingual)1470 (approx.)PresentMultimediaMultimediaGeneralNoneNoneYes2022/12/06
World Radio HistoryLibrary1e5[49](Various, Multilingual)[50]1900 (approx.)PresentWritten, Periodicals, Magazines, PrintWrittenNonfiction, Radio; Television; MusicNoneNoneYes2022/12/06
Google News Newspapers ArchiveLibrary6e6[51][52](Various, Multilingual)[50]1738 (approx.)2009Written, Periodicals, Magazines, PrintWrittenNonfiction, NewsNoneNoneYes2022/12/14
VESPA[53]Corpus2e69e2International/ESL[17]2008 (approx.)2008 (approx.)Written, College WorkWrittenNonfiction, AcademicRestriction to non-profit educational use only[54]Free registration requiredYes2022/12/28
I-EN (Internet English Corpus)[41]Corpus, Tagged2e8(Various)20052005Written, Computer, InternetWrittenNonfiction, News?NoneNoneNo2025/10/12
I-EN-CC (Internet English Creative Commons Corpus)[41]Corpus, Tagged2e8(Various)2005 (approx.)2005 (approx.)Written, Computer, InternetWrittenNonfiction, News?NoneNoneNo2025/10/12
Springfield! Springfield!Library2e5(Various)1910 (approx.)PresentSpoken, Entertainment, Movies and TelevisionWrittenGeneralNoneNoneYes2023/03/27
IssuuLibrary5e7[55](Various, Multilingual)2000 (approx.)[56]PresentWritten, Periodicals, MagazinesWrittenNonfictionNoneFree registration required for full access[57]Yes2023/01/19
Smithsonian Transcription CenterLibrary-[58]American1400 (approx.)[59]PresentWrittenWrittenNonfictionNoneNoneYes2023/01/22
Voices Remembering Slavery: Freed People Tell Their StoriesLibrary7e4[51][60]3e1[61]American, African19321975[62]Spoken, InterviewsAudioGeneral, Anthropological interviewsNoneNoneYes2023/01/28
Born in Slavery: Slave Narratives from the Federal Writers' ProjectLibrary2e3[63]American, African[64]19361938WrittenWrittenNonfiction, Biographies[64][65]NoneNoneYes2023/01/28
Corpus of Historical American English (COHA)[66]Corpus, Tagged5e81e5American18202019MultimediaWrittenGeneralNoneFree registration requiredYes2023/02/14
The TV CorpusCorpus, Tagged3e88e4(Various)[67]19502017Spoken, Entertainment, TelevisionWrittenGeneralNoneFree registration requiredYes2023/03/27
The Movie CorpusCorpus, Tagged2e83e4(Various)[67]19302018Spoken, Entertainment, MoviesWrittenGeneralNoneFree registration requiredYes2023/03/27
Corpus of American Soap Operas (CASO)Corpus, Tagged1e82e4American20012012Spoken, Entertainment, MoviesWrittenFiction, Television, Soap OperasNoneFree registration requiredYes2023/03/27
Corpus of US Supreme Court OpinionsCorpus, Tagged1e83e4American1790 (approx.)2019 (approx.)[68]WrittenWrittenNonfiction, Law, Courts, ConstitutionalNoneFree registration requiredYes2023/02/16
TIME Magazine CorpusCorpus, Tagged1e83e5[69]American19232006Written, Periodicals, Magazines, PrintWrittenNonfiction, NewsNoneFree registration requiredYes2023/02/16
Corpus of Online Registers of English (CORE)Corpus, Tagged5e75e4(Various)[70]2013 (approx.)2016 (approx.)Written, Computer, InternetWrittenGeneralNoneFree registration requiredYes2023/02/16
Strathy Corpus of Canadian EnglishCorpus, Tagged5e71e3Canadian1921[71]2011[71]MultimediaWrittenGeneralNoneFree registration requiredYes2023/02/16
Biodiversity Heritage LibraryLibrary3e5[72](Various, Multilingual)1400 (approx.)PresentWrittenWrittenNonfiction, Academic, BiologyNoneNoneYes2023/02/23
African American Writers, 1892-1912 (AAW)Corpus, Untagged5e58e0American, African18921912WrittenWrittenGeneralNoneNoneYes2023/03/15
Children's Literature (ChiLit)Corpus, Untagged4e67e1(Unclear)[73]??WrittenWrittenFiction, ChildrenNoneNoneYes2023/03/15
The Philadelphia Neighborhood Corpus of LING560 Studies (PNC)[74]Corpus2e63e2American1972Present?[75]Spoken, InterviewsWritten(Unclear)Restrictions on excerpt size[76]Yes[77]Yes2023/03/15
British Pathé[78]Library2e5British18961984Spoken, FormalAudio-visualNonfiction, NewsNone?NoneYes2023/04/06
Newspapers.comLibrary8e5(Various)[79]1690PresentWritten, Periodicals, Newspaper, PrintWrittenNonfiction, NewsNoneWikipedia Library access available. Paid subscription otherwise required. Free trials are available.Yes2023/04/30
NewspaperArchiveLibrary2e7[52][80](Various, Multilingual)[81]1607PresentWritten, Periodicals, Newspaper, PrintWrittenNonfiction, NewsNoneWikipedia Library access available. Paid subscription otherwise required. Free trials are available.Yes2024/06/16
PressReaderLibrary?(Various)?PresentWritten, Periodicals, Newspaper, PrintWrittenNonfiction, NewsNoneSome snippets freely visible, most content requires paid subscription. Free trials are available.Yes2023/05/31
ProQuestLibrary?(Various)?PresentWritten, Periodicals, Newspaper, PrintWrittenNonfiction, NewsNoneWikipedia Library access available. Some snippets freely visible, most content requires paid subscription. Free trials are available.Yes2023/05/31
Welsh NewspapersLibrary?[82]Welsh,[83] Multilingual18041919Written, Periodicals, Newspaper, PrintWrittenNonfiction, NewsNone?NoneYes2023/08/08
Welsh JournalsLibrary?[84]Welsh, Multilingual17352007Written, Periodicals, PrintWrittenGeneralNone?NoneYes2023/08/08
Crime and Punishment DatabaseLibraryEnglish?[85]17301830Written, Formal, Court RecordsWrittenNonfiction, Law, Courts, CriminalNone?NoneYes2023/08/08
American Archive of Public BroadcastingLibrary1e5[86](Various, Multilingual)[50]1931[87]PresentSpoken, esp. FormalAudio-visualGeneral, esp. NonfictionNoneNone, additional content available on-site at GBH or the Library of Congress.Yes2023/11/01
Buckeye Speech CorpusCorpus, Tagged3e64e2American19992000Spoken, InterviewsAudio, WrittenGeneral, Sociolinguistic interviews[88]Restriction to educational and research use onlyFree registration requiredYes2024/02/19
Westminster Detective LibraryLibrary5e7[51][89]2e4[89][90]American18181891Written, Periodicals, Newspapers, Print[91]WrittenFiction, Short Stories, Detective StoriesNoneNoneYes2024/02/26
Usenet Archive (UTZOO Wiseman/Zach Barth)Social Media2e6[92](Various)19811991Written, Computer, Social Media, UsenetWrittenGeneral, esp. NonfictionNoneNoneYes2024/03/20
Searchids.com[93][94]Library[95]7e7[96]2e7[97](Various)20062006Written, Computer, Internet SearchesWrittenGeneralRestriction to non-commercial research use only[98]NoneNo2025/09/22
Freiburg Corpus of English Dialects (FRED) - Interactive Database[99]Corpus, Untagged[100]1e6[101]1e2[101]British (various dialects)1970[101]2000[101][102]Spoken, InterviewsAudio, WrittenNonfiction, History, Oral History[101]NoneNoneYes2024/05/09
MTSamples.com[103][104]Library3e6[105]5e3[106](Various?)2007[107]2023Written, ComputerWrittenNonfiction, MedicineRequires attribution[108]NoneYes2024/12/16
Evans Early American Imprints TCP (Evans-TCP)Corpus, Untagged?5e3American1640[109]1800[110]Written, PrintWrittenGeneralNone[111]NoneYes2025/10/12

Non-English corpora

Non-English corpora table
Translated Name | Native name | Language | Language Code | Resource Type | Size in words[1] | Size in texts[1] | Start year | End year | Original medium | Available medium | Genre | Re-use restrictions | Access restrictions | Date of entry update
Czech National Corpus[112]Český národní korpusCzechcsCorpus, Tagged?????MultimediaGeneralNone?None2024/06/18
Polish National Corpus[112]Narodowy Korpus Języka PolskiegoPolishplCorpus, Tagged2e9????WrittenGeneralNone?None2023/02/12
Russian National Corpus[112]Национальный корпус русского языкаRussianruCorpus, Tagged2e9[113]5e6[113]1100PresentMultimediaWrittenGeneralRestriction to non-commercial linguistic use only[114]None2023/02/12
Turkish National Corpus[115]Türkçe Ulusal DerlemiTurkishtrCorpus, Tagged?5e76e319902009Written[116]WrittenGeneralRestriction to educational use only[117]Free registration required2023/02/12
Bruno Corpus[118]SpanishesCorpus, Untagged1e65e2?2010 (approx.)WrittenWritten?GeneralNoneNone2023/02/12
Braun Corpus[118]GermandeCorpus, Untagged1e65e2?2008 (approx.)WrittenWritten?GeneralNoneNone2023/02/12
Corpus of Spanish: Genre/HistoricalCorpus del Español: Genre/HistoricalSpanishesCorpus, Tagged1e81e41200 (approx.)2000 (approx.)Multimedia, esp. WrittenWrittenGeneralNoneFree registration required2023/03/24
Corpus of Spanish: Web/DialectsCorpus del Español: Web/DialectsSpanish[119][120]esCorpus, Tagged2e92e62010 (approx.)2014Written, Computer, InternetWrittenGeneralNoneFree registration required2023/03/24
Corpus of Spanish: NOWCorpus del Español: NOWSpanish[119][121]esCorpus, Tagged7e91e720122019Written, Computer, InternetWrittenNonfiction, NewsNoneFree registration required2023/03/24
21st Century Corpus of Spanish[122]Corpus del Español del Siglo XXISpanishesCorpus, Tagged4e84e520012022Multimedia, esp. WrittenMultimedia, esp. WrittenGeneralNone?None2023/03/24
Lemko and Karpatska Rus’ Archive[123]Carpathian RusynrueLibrary2e319281989Written, Periodicals, Newspaper, PrintWrittenNonfiction, NewsNone?None2024/06/18
Spauda.org[123]LithuanianltLibrary?18862015Written, Periodicals, Newspaper, PrintWrittenNonfiction, NewsNone?None2023/04/04
GallicaGallicaFrenchfrLibrary1e7??Written, Periodicals, Newspaper, PrintWrittenGeneral, esp. Nonfiction, NewsNoneNone2023/05/31
RetroNewsRetroNewsFrenchfrLibrary3e6 (at least)16311951Written, Periodicals, Newspaper, PrintWrittenNonfiction, NewsNoneNone2023/05/31
The Database of Early Cantonese Bible早期粵語聖經資料庫CantoneseyueCorpus, Untagged?7e018631927Written, Religious TextWrittenReligious, Christianity, Bible PassagesNone?None2023/12/10
The Database of Early Christian Literature早期基督教文學資料庫CantoneseyueCorpus, Untagged?5e01845 (approx.)1906Written, Books, PrintWrittenReligious, ChristianityNone?None2023/12/10
A Comprehensive Edition of Tocharian ManuscriptsTocharian B, Tocharian Atxb, xtoCorpus, Tagged2e5[124]2e3[125]0500 (approx.)[126]700 (approx.)[126]WrittenWrittenGeneral, esp. Religious, BuddhismNone?None2024/05/05
Manx Corpus SearchManxgvCorpus, Untagged2e67e216102012Multimedia, esp. WrittenWrittenGeneral?None?None2025/03/10
Comprehensive Aramaic Lexicon ProjectAramaicarcCorpus, Tagged??BCE 0900 (approx.)1300 (approx.)WrittenWritten?None?None2025/03/24
Corpus of PortugueseCorpus do PortuguêsPortuguese, Old Galician-Portuguesept, roa-optCorpus, Tagged2e94e61214[127]2019Written, esp. Computer, InternetWrittenGeneralNone?Free registration required2025/08/10
Computerized Reference of the Medieval Galician Language (Corpus Xelmírez)Tesouro Medieval Informatizado da Lingua Galega (Corpus Xelmírez)Old Galician-Portugueseroa-optCorpus, Tagged5e7[128]2e5[128]0787[128]1600 (approx.)[128]WrittenWrittenGeneralRestricted to "research, teaching, and general purposes", for-profit use prohibited without obtaining explicit permission[129]None2025/08/10
Electronic Corpus of Pre-Islamic Old Turkic TextsVorislamische Alttürkische Texte: Elektronisches CorpusOld UyghurouiCorpus, Untagged?5e20880 (approx.)[11]1150 (approx.)[15]WrittenWrittenReligious[18]None?None2025/09/10
Turkic InscriptionsТүрік БітікOld TurkicotkCorpus, Untagged?3e3[5]0730 (approx.)[21]0900 (approx.)WrittenWrittenNonfiction, Politics and HistoryNormal copyright restrictions apply[9]None2025/09/10
Diachronic Corpus of SpanishCorpus Diacrónico del EspañolSpanish, Old Spanishes, ospCorpus, Untagged2e10[130]1e5 (approx.)[131]0759[132]1975WrittenWrittenGeneralNone?None2025/10/08
Reference Corpus of Contemporary SpanishCorpus de Referencia del Español ActualSpanishesCorpus, Untagged2e10[130]7e4 (approx.)[131]1975[133]1999[133]Written[134]WrittenGeneral, esp. NonfictionNone?None2025/10/08
Vocabulary of Medieval CommerceVocabulario de Comercio MedievalOld Spanish, Early Modern Spanishosp, es-earCorpus, Untagged2e4[135]6e4[135]9th century[135]16th century[135]WrittenWrittenCommerceCC BY-NC-ND 2.5[136]None2025/10/14
Cantigas UniverseUniverso CantigasOld Galician-Portugueseroa-optCorpus, Tagged?1683?15th centuryWrittenWrittenSongbooksCC BY-NC-SA[137]None2025/10/14
Historical and chronological vocabulary of Medieval PortugueseVocabulário histórico-cronológico do Português MedievalOld Galician-Portugueseroa-optCorpus, Tagged17e3[138]??15th centuryWrittenWrittenGeneralNoneNone2025/10/14
Old Spanish Textual ArchiveOld Spanish, Old Navarro-Aragonese, Old Leoneseosp, roa-ona, roa-oleCorpus, Untagged35e6 (approx.)[139]400 (approx.)[139]11th century[139]17th century[139]WrittenWrittenGeneralNoneNone2025/10/14

Other lists and databases

Other lists and databases table
| Name | Language | Language Code | Size in corpora[1] | Active | Date of entry update |
| --- | --- | --- | --- | --- | --- |
| Corpus Resource Database (CoRD) | Translingual, esp. English | mul, en | 1e2 | Yes | 2023/02/13 |
| Czech National Corpus KonText interface | Translingual | mul | 1e3[140] | Yes | 2023/02/13 |
| English-Corpora.org | English | en | 2e1 | Yes | 2023/02/13 |
| Leipzig Corpora Collection | Translingual | mul | 1e3 | Yes | 2023/02/13 |
| Lextutor Web Concordance English | English | en | 5e1 | Yes | 2023/02/13 |
| Lextutor Web Concordance French | French | fr | 2e1 | Yes | 2023/02/13 |
| LINDAT/CLARIAH-CZ Corpora | Translingual | mul | 7e2 | Yes | 2023/02/13 |
| Linguistic Data Consortium (LDC) | Translingual | mul | 1e3 | Yes | 2023/02/13 |
| Martin Weisser's On-line Corpora of English | Translingual, esp. English | mul, en | 2e1 | Yes | 2023/02/13 |
| SketchEngine | Translingual | mul | 2e1[141] | Yes | 2023/02/13 |
| University of Warwick list of free online corpora | English | en | 2e1 | Yes | 2023/02/13 |
| University of Edinburgh Scots and Scottish English corpora | Scots, English | sco, en | 3e1 | Yes | 2023/02/13 |
| SHACHI Database of Language Resources[142] | Translingual | mul | 2e3 | No | 2024/04/22 |
| CLARIN.SI Online Concordancers | Translingual, esp. Slovene | mul, sl | 2e2 | Yes | 2023/02/26 |
| CLARIN.SI Corpus Repository | Translingual, esp. Slovene | mul, sl | 2e2 | Yes | 2023/02/26 |
| CLARINO Corpuscle | Translingual, esp. Norwegian | mul, no | 6e1 | Yes | 2023/02/26 |
| CLARINO Corpus Repository | Translingual, esp. Norwegian | mul, no | 4e1 | Yes | 2023/02/26 |
| Online Resources for African American Language (ORAAL), external data sources | English | en | 1e1 | Yes | 2023/03/15 |
| Online Resources for African American Language (ORAAL), supplements | English | en | 4e0 | Yes | 2024/05/08 |
| Corpus Linguistics in Context (CLiC) | English | en | 5e0 | Yes | 2023/03/15 |
| The Spanish Corpus | Spanish | es | 4e0 | Yes | 2023/03/24 |
| Pennsylvania State University scripts and transcripts of popular film, TV, and sports | English[143] | en | 2e1 | Yes | 2023/04/02 |
| /r/Screenwriting Guide to Finding Scripts Online | English[143] | en | 2e1 | Yes | 2023/04/02 |
| BBC.com[144] | Translingual | mul | 3e1 | Yes | 2024/04/22 |
| Corpus4U.org[145][146] | English, Chinese | en, zh | 2e2 | Yes | 2023/06/17 |
| Beijing Foreign Studies University CQPweb[147] | Translingual | mul | 2e2 | Yes | 2023/06/17 |
| Lancaster University CQPweb | Translingual, esp. English | mul, en | 1e2 | Yes | 2023/06/17 |
| Hong Kong University of Science and Technology Resources for Chinese Linguistics | Chinese, esp. Cantonese | zh, yue | 3e0 | Yes | 2023/12/10 |
| PolyU Corpus of Spoken Chinese, links to other corpora and databases | Translingual, esp. Chinese | mul, zh | 1e2 | Yes | 2024/01/13 |
| Duke University list of collections of African American oral histories | English | en | 1e1 | Yes | 2024/05/08 |
| OPUS Open Parallel Corpus Collection | Translingual | mul | 1e3 | Yes | 2024/06/18 |
| OPUS Multilingual Search Interface | Translingual | mul | 4e2 | Yes | 2024/06/18 |
| Stanford Large Network Dataset Collection | Translingual | mul | 1e1 | Yes | 2025/03/03 |
| Corpus-Based Language Studies' Corpus Survey | Translingual | mul | 1e2 | Yes | 2025/08/18 |
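
The sizes in the table above are written in E notation (see note 1). As a small illustration, the following Python sketch converts such a value into a plain integer. The function names `parse_e_notation` and `format_size` are hypothetical helpers introduced here, not part of any tool this page describes, and the sketch assumes bracketed footnote markers such as "[140]" have already been stripped from the value.

```python
from decimal import Decimal


def parse_e_notation(size: str) -> int:
    """Convert an E-notation size such as "5e3" into a plain integer.

    Decimal is used instead of float so that large values stay exact.
    """
    return int(Decimal(size))


def format_size(size: str) -> str:
    """Render an E-notation size with thousands separators."""
    return f"{parse_e_notation(size):,}"


if __name__ == "__main__":
    # Values taken from the table above and from note 1.
    for label in ("5e3", "4e0", "1e3"):
        print(label, "=", format_size(label))
```

Under these assumptions, "5e3" is rendered as 5,000, which matches the example given in note 1.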

See also


Notes

  1. Sizes are represented using E notation, a type of scientific notation. "5e3", for example, represents 5×10³, or a 5 with three 0's after it: 5,000.
  2. Specifically Australia, Bangladesh, Canada, Ghana, Great Britain, Hong Kong, India, Ireland, Jamaica, Kenya, Malaysia, New Zealand, Nigeria, Pakistan, Philippines, Singapore, South Africa, Sri Lanka, Tanzania, and the United States.
  3. Note that dialect information in internet-derived corpora tends to be somewhat inaccurate because of accidental inclusion of texts in other dialects.
  4. Specifically Australia, Canada, Ireland, New Zealand, the United Kingdom, and the United States.
  5. Note that this corpus is a sub-corpus of the NOW corpus.
  6. Particularly speeches and interviews.
  7. As of 2024-05-10, the latest change log entry was from 2023/12/26.
  8. Most after 2005.
  9. Most before 2017.
  10. An account is required to use the site's built-in search function. Nonetheless, the forum threads can still be viewed and navigated without hindrance when logged out.
  11. Tyler Kendall, Charlie Farrington (June 2023), “CORAAL User Guide”, in Corpus of Regional African American Language[1], retrieved 9 May 2024:
    The core components of CORAAL focus on AAL in Washington DC, […] CORAAL:DC […] is comprised of over 100 sociolinguistic interviews […] In addition to CORAAL:DC, CORAAL includes several smaller components to provide regional breadth. As of July 2021, there are six supplemental components: CORAAL:ATL, which includes 14 sociolinguistic interviews from speakers living in Atlanta, Georgia; CORAAL:DTA, which includes 40 sociolinguistic interviews from the Detroit Dialect Study collected in 1966; CORAAL:LES, comprised of 10 sociolinguistic interviews of speakers from the Lower East Side of New York City; CORAAL:PRV, which includes 15 sociolinguistic interviews from the town of Princeville, a rural African American community in central North Carolina; CORAAL:ROC, which includes 14 sociolinguistic interviews from Rochester, a city in Western Upstate New York; and CORAAL:VLD, which includes 12 speakers from Valdosta, a small city in South Georgia. […] Interviews are sociolinguistic styled interviews on topics such as life in Valdosta, personal histories, and high school sports.
  12. Note that this number only represents the size of the English portion of the 2020 release of the corpus.
  13. For specific details, see the "Total counts for Dependencies" file hosted in the Dependencies Downloads Index for the English part of the 2020 release, which contains word and book counts for each of the years in the corpus, as described on the main Ngram Viewer Exports page.
  14. Anna L. Shparberg (July 2021), “Google Books Ngram Viewer”, in The Charleston Advisor, volume 23, number 1, Annual Reviews, →DOI, pages 16–19
  15. Note that the "British English" and "American English" sub-corpora of Google Ngram are sometimes very inaccurate or misleading because of the accidental inclusion of texts in other dialects. Consider color vs colour and airplane vs aeroplane in the "British English" corpus. In both cases, Google Ngram shows the forms as being roughly equally common from 2000 to 2019, which is blatantly untrue.
  16. Madian Khabsa; C. Lee Giles (9 May 2014), “The Number of Scholarly Documents on the Public Web”, in PLOS ONE, volume 9, number 5, →DOI, →ISSN
  17. "English as a Second Language"
  18. The corpus' fair use statement says that "if any portion of this material is to be used for commercial purposes, such as for textbooks or tests, permission must be obtained in advance and a license fee may be required", which is incompatible with Wiktionary's license.
  19. Audio files are available separately on TalkBank.org.
  20. The corpora manual can be accessed online.
  21. The corpus' fair use statement says that "if any portion of this material is to be used for commercial purposes, such as for textbooks or tests, permission must be obtained in advance and a license fee may be required", which is incompatible with Wiktionary's license.
  22. As of 2024-05-10, the search function seems to be very slow or entirely broken. The groups and discussion threads are still manually navigable, though. The website can also still be searched using Google.
  23. The website incorporates the UTZOO Wiseman archive.
  24. Specific account permissions required. The corpus used to be publicly available, but this changed some time between late 2022 and late 2025.
  25. Based on back-of-the-napkin extrapolation of data at the Internet Live Stats website.
  26. The corpus appears to have gone down starting around mid-2024, based on archived crawls in the Wayback Machine, and remains down as of 2025-02-20.
  27. "English as a Lingua Franca"
  28. The website is composed of a series of search tools, including n-gram and concordance search, based on the BNC.
  29. Selection of different tools can be done through the "Grams" menu in the top left of the page.
  30. Based on captures of the website stored in the Wayback Machine, it looks like the website went down some time between 2024-03-03 and 2024-06-18.
  31. Apparently became unavailable sometime between late 2022 and late 2025.
  32. Full name "Brown University Standard Corpus of Present-Day American English"
  33. The corpus is made of six "texts", but looking at their descriptions reveals that each one is actually a compilation of multiple texts. For example, "feen4" is described as "7 separate titles". Overall, the exact number of independent texts included is unclear.
  34. Full name "International Corpus Network of Asian Learners of English, Online Version"
  35. Shin Ishikawa (12 April 2022), “The ICNALE […]”, in language.sakura.ne.jp[2], SAKURA Internet, archived from the original on 14 August 2022: The ICNALE includes […] speeches and essays produced by college students […] in ten countries/ regions in Asia (China, Hong Kong, Indonesia, Japan, Korea, Pakistan, the Philippines, Singapore/ Malaysia, Taiwan, and Thailand) as well as English native speakers.
  36. The name is based on an abbreviation of the phrase "UK web as corpus".
  37. To search the collection, select either "User-Added Text (Back)" or "User-Added Text (Front)" under "Narrow by Specific Fields", then select "contains" from the drop-down just to the right, enter the search term next to that, and hit enter. Note that the overall quality and style of the data presented in the collection varies considerably.
  38. As of 2023/03/07, 2,574 cards have the field "Writing on Card (Yes or No)?" marked as "yes". Nonetheless, there are cards that do have handwriting on them and have the field marked "no".
  39. Approximate; the site actually lists its size as "7,600,186 phrases" (emphasis added).
  40. Though not exclusively short stories, the format dominates the library.
  41. Corpuses such as this one, which used the IntelliText interface, went offline around December 2023.
  42. Although MASC is technically a corpus, it is only directly available through a web browser as a library. A complete copy of MASC as a corpus can be downloaded, though, and then processed with another application.
  43. Although OANC is technically a corpus, it is only directly available by being downloaded and then processed with another application.
  44. Many, if not most, of the LISTSERVs in the catalog do not have publicly accessible archives.
  45. The catalog describes itself (as of 28 Nov 2022) as containing 58,100 public lists, each of which contains a number of messages.
  46. Approximate; not all items cataloged in the library are available online. In particular, it seems none of the around 300,000 speeches cataloged are available online.
  47. Most after 1945.
  48. Number of items which are both available online and have their language marked as "English".
  49. Approximate; based on a search of the collection for the basic words "a" and "the".
  50. The number of non-English items is small.
  51. Approximate; based on a statistical calculation.
  52. Note that this number represents the number of newspaper issues in the archive.
  53. Full name "Varieties of English for Specific Purposes dAtabase"
  54. The corpus' end-user license states "Grant of the Product license entitles Licensee to use the Product for non-profit educational and/or linguistic research purposes only. [...] Licensees agree not to lease, sell, or commercially exploit the results of their searches (such as texts, concordances, metadata)." which is incompatible with Wiktionary's license.
  55. Per https://issuu.com/about as of 2023/01/19.
  56. Issuu was founded in 2006 but includes some older publications uploaded since then; most of those are from after 1990, if not 2000.
  57. Registration is required in order to turn "safe mode" off/show explicit search results.
  58. Unclear. The collection is organized by "projects" which sometimes correspond to individual texts (such as diaries or funeral programs) and other times correspond to a collection of short texts (such as notes or letters). There were 11,372 projects on 2023/01/22. The length of projects is reported by the number of pages they contain. Using random sampling, it was estimated that the total length of all projects was around 2 million pages on 2023/01/22.
  59. Most after 1800.
  60. Note that some transcripts were incomplete when this number was calculated.
  61. Each interview in the collection, regardless of the number of parts it has, is considered one text. According to the "Faces and Voices from the Presentation" article, 26 interviews are in the collection.
  62. Most from before 1950.
  63. See Appendix I: Narratives in the Slave Narrative Collection by State for a numerical breakdown by state.
  64. Norman R. Yetman (2001), “The Limitations of the Slave Narrative Collection”, in Library of Congress[3], published c. 2017
  65. The narratives are based on interviews, but because of the lack of ground-truth audio recordings and doubts about the accuracy of the published versions of the narratives, they are categorized here as "Nonfiction, Biographies" rather than "General, Anthropological interviews" or similar.
  66. Note that the COHA was updated in 2021.
  67. Specifically "United States/Canada", "United Kingdom/Ireland", "Australia/New Zealand", and "Miscellaneous".
  68. Note that the corpus is listed as going up to the "present", but as of 2023/02/16 the most recent section is the 2010s, implying that no opinions from later decades are included.
  69. Note that this number reflects the number of articles in the corpus, not the number of issues of TIME Magazine in the corpus.
  70. Specifically Australia, Canada, New Zealand, the United Kingdom, and the United States.
  71. Note that the Queen's University page describing the corpus describes the start year as 1970 and end year as 2010, despite english-corpora.org providing a source spreadsheet which spans the years 1921 to 2011 and its corpus description page showing a time span from the 1920s to 2010s.
  72. On the website, this number is associated with how many "volumes" are available and is listed alongside the number of "titles" (2e5) as well as the number of pages. The exact meaning of the terms "volumes" and "titles" in this context is unclear.
  73. Note that although the corpus does explicitly mention its contents, I have not put in the effort to determine the dialect of each of the included texts.
  74. The website for the corpus is now offline for unclear reasons, but it is presumably still possible to access the corpus by contacting the university.
  75. The corpus' description implies that it is a continually expanding project, but in 2018 the page had not been updated in 5 years (since 2013), which may suggest the project stopped expanding around the same time.
  76. An apparently genuine archived version of the corpus' confidentiality agreement does state "If I need to cite more than one paragraph (300 words) in a publication, I will obtain permission from the Philadelphia Neighborhood Corpus Committee".
  77. An archive of the corpus' home page states that "only members of the research group have access".
  78. Note that searches cover both metadata and transcripts for newsreels simultaneously.
  79. Specifically Australia, Canada, Ireland, New Zealand, Panama, and the United Kingdom.
  80. Derived from the "Newspaper Publication" list. A more direct view of the data as collected on 2024/06/16 can be seen in this published Google Sheet.
  81. About 90% of the publications are based in predominantly Anglophone countries (the United States [12263], Australia [1223], the United Kingdom [811], Canada [525], Ireland [50], New Zealand [19]) while the rest are from a wide variety of countries. Information derived from the "Newspaper Publication" list. A more direct view of the data as collected on 2024/06/16 can be seen in this published Google Sheet.
  82. Issue counts are provided for individual publications, but not for the entire collection. 12.7 million articles in English are available, though, with each issue featuring many articles.
  83. A few publications originate from regions outside Wales, in particular three from London, one from the United States, and one from Argentina. An additional publication has no region listed, though its "issuing body note" states "Published in Caernarfon by Thomas Jones", with Caernarfon being in Wales.
  84. Issue counts are provided for individual publications, but not for the entire collection. 363 thousand pages in English are available, though, with each issue featuring many pages.
  85. The English Wikipedia article on the Court of Great Sessions in Wales stated on 2023-08-08 that "[o]f the 217 judges who sat on its benches [...], only 30 were Welshmen". Those involved in keeping the court's records likely had a similar makeup and so the database's dialect likely reflects England rather than Wales.
  86. This number represents the number of recordings available online.
  87. This date represents the earliest year specified for any recording in the archive, though that recording does not have audio. It is not immediately clear what the earliest recording with audio is. The earliest audio-only recording is from 1938.
  88. “Buckeye Corpus Information”, in Buckeye Corpus[4], c. 2005, retrieved 9 May 2024: After a significant amount of piloting different protocols for eliciting large amounts of unmonitored speech, a modified sociolinguistic interview format was chosen.
  89. Note that this number was calculated to include the about 25% of work listings which were placeholders on 2024/02/24 but should eventually become full entries, and to exclude the about 15% of work listings which were redirects to other listings on the same date.
  90. Based on the fact that the list pages for browsing works display 25 works at a time and there are 78 pages to browse as of 2024/02/26.
  91. Not explicitly stated, but browsing the collection on 2024-02-26 revealed only newspapers being cited as the source of the stories provided.
  92. Samantha Cole (13 October 2020), “2.1 Million of the Oldest Internet Posts Are Now Online for Anyone to Read”, in Vice[5], archived from the original on 13 October 2020: Around 2.1 million posts from between February 1981 and June 1991 from Henry Spencer's UTZOO NetNews Archive are archived at the Usenet Archive for anyone to browse.
  93. There was also a mirror site, Explicit-Id.com.
  94. Based on captures of the website stored in the Wayback Machine, it looks like the website and its mirror went down some time between 2024-08-15 and 2024-09-26.
  95. Though the site does feature a built-in search function, it is significantly limited and prone to errors. For this reason, I've classified it as a "library" rather than a "corpus". A complete copy of the original data can be downloaded (see here for details) and processed with another application, though.
  96. From the number of queries multiplied by the average of 3.5 words per query mentioned in the scientific article that originally accompanied the data: Greg Pass; Abdur Chowdhury; Cayley Torgeson (May 2006), “A Picture of Search”, in Proceedings of the First International Conference on Scalable Information Systems, Hong Kong, →DOI, page 2
  97. Number of queries, per the README included with the data.
  98. This requirement is incompatible with Wiktionary's license.
  99. Although the database indexes and shows results for the entirety of FRED, audio and transcripts are only viewable for the FRED Sampler (FRED-S) portion. For this reason, most of the information presented in this table is based on the FRED-S, not the complete FRED.
  100. Although tagged transcripts can be downloaded from the database, the search function only allows for the plaintext transcripts to be searched.
  101. Benedikt Szmrecsanyi, Nuria Hernández (2007), “Manual of Information to Accompany the Freiburg Corpus of English Dialects Sampler (“FRED-S”)”, in FreiDok Plus[6], archived from the original on 2 April 2013
  102. Most before 1990.
  103. As of 2024-12-16, the search function seems to be broken. It can still be manually searched using Google.
  104. A scrape of the corpus from about 2018 is also available as a CSV with the registration of a free account.
  105. Based on a word count of the 2018 scrape, which has a similar number of transcription samples in it as the live website.
  106. According to the website, as of the last update on 2023-07-07.
  107. Based on Wayback Machine records.
  108. Per the landing page.
  109. In the form of The VVhole Booke of Psalmes Faithfully Translated Into English Metre.
  110. In the form of, for example, A Sermon; Occasioned by the Death of His Excellency George Washington.
  111. The rights and permissions section for the corpus states "These materials are in the public domain. There is no restriction on your use of the transcribed texts."
  112. Note that multiple sub-corpora and related corpora can be searched on the site.
  113. Note that these numbers represent the size of all the corpora on the site tallied together.
  114. The corpus' terms FAQ states "All data published under [this website] are available exclusively for non-commercial use for research and educational purposes [...] they can only be used as sources of examples (citations) illustrating a particular linguistic phenomenon." This requirement is incompatible with Wiktionary's license.
  115. As of 2023-02-12 the query interface was offline.
  116. The corpus' about page states that it is specifically 98% written and 2% spoken.
  117. The corpus' user agreement states "TUD sadece araştırma ve sunum amaçlı kullanıma açıktır ve fikri mülkiyet hakları tümüyle Sağlayıcıya aittir." (roughly, '[the corpus] is available for research and presentation purposes only and the intellectual property rights remain the sole property of the Provider.') This requirement is incompatible with Wiktionary's license.
  118. This corpus was designed to imitate the English-language Brown Corpus.
  119. Specifically Argentina, Bolivia, Chile, Colombia, Costa Rica, Cuba, Dominican Republic, Ecuador, Guatemala, Honduras, Mexico, Nicaragua, Panama, Paraguay, Peru, Puerto Rico, El Salvador, Spain, United States, Uruguay, Venezuela.
  120. Note that dialect information in internet-derived corpora is usually somewhat inaccurate because of the accidental inclusion of texts in other dialects. This issue is addressed on the website with the conclusion that the "categorization is quite good".
  121. Note that dialect information in internet-derived corpora is usually somewhat inaccurate because of the accidental inclusion of texts in other dialects. This issue was addressed for the related Web/Dialects corpus with the conclusion that the "categorization is quite good", so a similar level of quality may exist for this corpus.
  122. Note that the CORPES is currently undergoing continuous revision and so this information may be out of date. To be specific, the information presented is for version 0.99.
  123. Note that the newspapers were published in the United States.
  124. Using the number of "total tokens" listed under "Types of complete words (including unresolved akṣaras)" on 2024-04-22.
  125. Using the number of manuscripts publicly available on 2024-04-22.
  126. George S. Lane; Douglas Q. Adams (16 July 2013), “Tocharian languages”, in Encyclopedia Britannica[7], retrieved 5 May 2024: Documents from AD 500–700
  127. https://www.corpusdoportugues.org/hist-gen/help/cdp.xls
  128. https://ilg.usc.gal/tmilg/corpus.html
  129. https://ilg.usc.gal/tmilg/usar.html
  130. https://apps2.rae.es/nomina/SrvltGUIBusTextos?est=1
  131. Based on the total number of documents listed on the general statistics page.
  132. Specifically, "Constitución del monasterio de San Miguel de Pedroso [Cartulario de San Millán de la Cogolla]"
  133. https://corpus.rae.es/ayuda_c.htm
  134. About 10% is technically originally oral in composition, but this represents a small portion of the overall corpus.
  135. https://www.um.es/lexico-comercio-medieval/index.php/p/v/inicio
  136. https://www.um.es/lexico-comercio-medieval/index.php/p/v/aviso%20legal
  137. https://www.universocantigas.gal/aviso-legal
  138. http://medieval.rb.gov.br/sobre.php
  139. https://www.hispanicseminary.org/osta-en.htm
  140. Approximate; it is difficult to see the full list of corpora in order to get an accurate estimate.
  141. Note that this number reflects the number of corpora freely available. Including the corpora which require a subscription or special permission, the number comes up to 722 as of 2023/0/13.
  142. Note that the database has not been updated since 2016 and has a somewhat buggy search system.
  143. Not confirmed to be English exclusively, but probably almost all English.
  144. The BBC publishes news online in a wide variety of languages which can then be searched manually using a search engine like Google. The languages are specifically Arabic, Azeri, Bangla, Burmese, Chinese, French, Hausa, Hindi, Indonesian, Japanese, Kinyarwanda, Kirundi, Kyrgyz, Marathi, Nepali, Pashto, Persian, Portuguese, Russian, Sinhala, Somali, Spanish, Swahili, Tamil, Turkish, Ukrainian, Urdu, Uzbek, and Vietnamese.
  145. The forum is primarily written in Chinese, though some posts are in English.
  146. The section which primarily hosts links to corpora is labeled "专题研究" (Google Translate translates this as "Special Research").
  147. Both user ID and password are "test" for freely available corpora.
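
Note 90 derives a size estimate from pagination: 25 works per list page across 78 pages. A minimal Python sketch of that arithmetic follows. The function name `paginated_count_bounds` is hypothetical, and the sketch assumes every page except possibly the last one is full, so it yields a range rather than a single figure.

```python
def paginated_count_bounds(pages: int, per_page: int) -> tuple[int, int]:
    """Return (lower, upper) bounds on the item count of a paginated list.

    Assumes all pages but the last are full and the last page holds
    between 1 and per_page items.
    """
    if pages < 1 or per_page < 1:
        raise ValueError("pages and per_page must both be at least 1")
    lower = (pages - 1) * per_page + 1  # last page holds a single item
    upper = pages * per_page            # last page is completely full
    return lower, upper


if __name__ == "__main__":
    # Figures from note 90: 78 pages showing 25 works each.
    low, high = paginated_count_bounds(pages=78, per_page=25)
    print(f"between {low} and {high} works")  # between 1926 and 1950 works
```

Either bound rounds to 2e3 in the E notation used throughout this page.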

Further reading

Retrieved from "https://en.wiktionary.org/w/index.php?title=Wiktionary:Corpora&oldid=87440163"