Statistically improbable phrase

From Wikipedia, the free encyclopedia
Phrase that appears more often in a single work than in a large sample size

A statistically improbable phrase (SIP) is a phrase or set of words that occurs more frequently in a document (or collection of documents) than in some larger corpus.[1][2][3] Amazon.com uses this concept in determining keywords for a given book or chapter, since keywords of a book or chapter are likely to appear disproportionately within that section.[4][5] Christian Rudder has also used this concept with data from online dating profiles and Twitter posts to determine the phrases most characteristic of a given race or gender in his book Dataclysm.[6] SIPs with a linguistic density of two or three words (for example adjective, adjective, noun, or adverb, adverb, verb) will signal the author's attitude, premise or conclusions to the reader or express an important idea.

Another use of SIPs is as a detection tool for plagiarism. Unique or nearly unique combinations of words can be searched for online, and if they have appeared in a published text, the search will identify where. This method only checks texts that have been published and digitized online.

For example, if a student's submission contained the phrase "garden style, praising irregularity in design", searching for that phrase on Google.com would yield the original Wikipedia article about Sir William Temple, English political figure and essayist.
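The search described above can be approximated offline. The following is a minimal sketch, assuming the candidate source texts are already available locally rather than found through a web search engine; the function names `word_ngrams` and `find_shared_phrases`, the five-word phrase length, and the sample texts in the usage note are all illustrative assumptions, not part of any tool named in this article.

```python
import re


def word_ngrams(text, n):
    """Return the set of lowercase n-word phrases in text, ignoring punctuation."""
    words = re.findall(r"[a-z']+", text.lower())
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def find_shared_phrases(submission, sources, n=5):
    """Map each source name to the n-word phrases it shares with the submission.

    `sources` is a dict of {name: text}. Sources with no shared phrase are
    omitted from the result. Exact matching only: paraphrased passages
    will not be detected.
    """
    submitted = word_ngrams(submission, n)
    matches = {}
    for name, text in sources.items():
        shared = submitted & word_ngrams(text, n)
        if shared:
            matches[name] = sorted(shared)
    return matches
```

For instance, with a made-up source text containing "the garden style, praising irregularity in design" and a submission that reuses "garden style, praising irregularity in design", the function would report the overlapping five-word phrases under that source's name. Longer values of `n` reduce chance matches at the cost of missing short borrowed passages.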

Example


While common words such as "the" appear frequently in most texts, a phrase such as "explicit Boolean algorithm" might occur much more often in a document about computers than it does in general English. Therefore, "explicit Boolean algorithm" would be considered a statistically improbable phrase in that context.
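The comparison in this example can be sketched in code. The snippet below is a minimal illustration, assuming a simple scoring rule: the ratio of a phrase's relative frequency in the document to its add-one-smoothed relative frequency in a background corpus. That rule, and the names `ngrams` and `improbable_phrases`, are assumptions made for illustration; the article does not specify how Amazon.com or any other source actually computes SIPs.

```python
import re
from collections import Counter


def ngrams(text, n):
    """Return the list of lowercase n-word phrases in text, ignoring punctuation."""
    words = re.findall(r"[a-z']+", text.lower())
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def improbable_phrases(document, background, n=2, top=5):
    """Rank the document's n-word phrases by how over-represented they are.

    Score = (relative frequency in document) / (smoothed relative frequency
    in background). Add-one smoothing keeps phrases that never occur in the
    background from causing a division by zero. Ties are broken alphabetically.
    """
    doc_counts = Counter(ngrams(document, n))
    bg_counts = Counter(ngrams(background, n))
    doc_total = sum(doc_counts.values())
    bg_total = sum(bg_counts.values())
    vocab = len(set(doc_counts) | set(bg_counts))

    scores = {}
    for phrase, count in doc_counts.items():
        doc_rate = count / doc_total
        bg_rate = (bg_counts[phrase] + 1) / (bg_total + vocab)
        scores[phrase] = doc_rate / bg_rate
    return sorted(scores, key=lambda p: (-scores[p], p))[:top]
```

Under this rule, a phrase that is common everywhere scores close to 1, while a phrase that is frequent in the document but rare in the background scores much higher. A realistic background corpus would need to be far larger than the document for the ranking to be meaningful.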

Some statistically improbable phrases of Darwin's On the Origin of Species could be: genera descended, transitional gradations, unknown progenitor, fossiliferous formations, closely allied forms, profitable variations, transitional grades, very distinct species and mongrel offspring.[7]

See also

  • Collocation – Any series of words that co-occur more often than would be expected by chance
  • Googlewhack – A pair of words occurring on a single webpage, as indexed by Google
  • tf-idf – A statistic used in information retrieval and text mining
  • Complex specified information – a concept used to argue for the "intelligent design" theory

References

  1. "SIPping Wikipedia" (PDF). Courses.cms.caltech.edu. Retrieved 2017-01-01.
  2. Jonathan Bailey (3 July 2012). "How Long Should a Statistically Improbably Phrase Be?". Plagiarism Today.
  3. Errami, Mounir; Sun, Zhaohui; George, Angela C.; Long, Tara C.; Skinner, Michael A.; Wren, Jonathan D.; Garner, Harold R. (1 June 2010). "Identifying duplicate content using statistically improbable phrases". Bioinformatics. 26 (11): 1453–1457. doi:10.1093/bioinformatics/btq146. PMC 2872002. PMID 20472545 – via bioinformatics.oxfordjournals.org.
  4. "What are Statistically Improbable Phrases?". Amazon.com. Retrieved 2007-12-18.
  5. Weeks, Linton (August 30, 2005). "Amazon's Vital Statistics Show How Books Stack Up". The Washington Post. Retrieved September 8, 2015.
  6. Rudder, Christian (2014). Dataclysm: Who We Are When We Think No One's Looking. New York: Crown Publishers. ISBN 978-0-385-34737-2.
  7. "Sociologically Improbable Phrases". Crooked Timber. April 2005.

Retrieved from "https://en.wikipedia.org/w/index.php?title=Statistically_improbable_phrase&oldid=1321903417"