Detect text reuse and document similarity


ropensci/textreuse


Overview

This R package provides a set of functions for measuring similarity among documents and detecting passages which have been reused. It implements shingled n-gram, skip n-gram, and other tokenizers; similarity/dissimilarity functions; pairwise comparisons; minhash and locality sensitive hashing algorithms; and a version of the Smith-Waterman local alignment algorithm suitable for natural language. It is broadly useful for tasks such as detecting duplicate documents in a corpus prior to text analysis, or identifying borrowed passages between texts. The classes provided by this package follow the model of other natural language processing packages for R, especially the NLP and tm packages. (However, this package has no dependency on Java, which should make it easier to install.)

Citation

If you use this package for scholarly research, I would appreciate a citation.

    citation("textreuse")
    #> To cite package 'textreuse' in publications use:
    #>
    #>   Li Y, Mullen L (2024). _textreuse: Detect Text Reuse and Document
    #>   Similarity_. https://docs.ropensci.org/textreuse,
    #>   https://github.com/ropensci/textreuse.
    #>
    #> A BibTeX entry for LaTeX users is
    #>
    #>   @Manual{,
    #>     title = {textreuse: Detect Text Reuse and Document Similarity},
    #>     author = {Yaoxiang Li and Lincoln Mullen},
    #>     year = {2024},
    #>     note = {https://docs.ropensci.org/textreuse,
    #> https://github.com/ropensci/textreuse},
    #>   }

Installation

To install this package from CRAN:

install.packages("textreuse")

To install the development version from GitHub, use devtools.

    # install.packages("devtools")
    devtools::install_github("ropensci/textreuse", build_vignettes = TRUE)

Examples

There are three main approaches that one may take when using this package: pairwise comparisons, minhashing/locality sensitive hashing, and extracting matching passages through text alignment.

See the introductory vignette for a description of the classes provided by this package.

vignette("textreuse-introduction", package = "textreuse")

Pairwise comparisons

In this example we will load a tiny corpus of three documents. These documents are drawn from Kellen Funk’s research into the propagation of legal codes of civil procedure in the nineteenth-century United States.

    library(textreuse)
    dir <- system.file("extdata/legal", package = "textreuse")
    corpus <- TextReuseCorpus(dir = dir, meta = list(title = "Civil procedure"),
                              tokenizer = tokenize_ngrams, n = 7)

We have loaded the three documents into a corpus, which involves tokenizing the text and hashing the tokens. We can inspect the corpus as a whole or the individual documents that make it up.

    corpus
    #> TextReuseCorpus
    #> Number of documents: 3
    #> hash_func : hash_string
    #> title : Civil procedure
    #> tokenizer : tokenize_ngrams
    names(corpus)
    #> [1] "ca1851-match"   "ca1851-nomatch" "ny1850-match"
    corpus[["ca1851-match"]]
    #> TextReuseTextDocument
    #> file : C:/Users/bach/AppData/Local/Temp/RtmpecFDvh/temp_libpath4c4124c4b59/textreuse/extdata/legal/ca1851-match.txt
    #> hash_func : hash_string
    #> id : ca1851-match
    #> minhash_func :
    #> tokenizer : tokenize_ngrams
    #> content : § 4. Every action shall be prosecuted in the name of the real party
    #> in interest, except as otherwise provided in this Act.
    #>
    #> § 5. In the case of an assignment of a thing in action, the action by
    #> the as
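To make the tokenizing step concrete, here is a minimal base R sketch of shingled word n-grams. It is an illustration only, not the package's tokenizer: `shingle_words()` is a hypothetical helper, and its lowercasing and punctuation handling are assumptions.

```r
# Illustrative sketch only; `shingle_words()` is a hypothetical helper,
# not the implementation behind tokenize_ngrams().
shingle_words <- function(text, n = 3) {
  # Lowercase, then split on anything that is not a letter or digit
  words <- strsplit(tolower(text), "[^[:alnum:]]+")[[1]]
  words <- words[nzchar(words)]
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  # Each shingle is n consecutive words and overlaps its neighbor by n - 1 words
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

shingles <- shingle_words("Every action shall be prosecuted in the name", n = 3)
shingles
```

Because consecutive shingles overlap, a document of w words yields w - n + 1 tokens, so a reused passage produces a run of identical tokens even when the surrounding text differs.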

Now we can compare each of the documents to one another. The pairwise_compare() function applies a comparison function (in this case, jaccard_similarity()) to every pair of documents. The result is a matrix of scores. As we would expect, some documents are similar and others are not.

    comparisons <- pairwise_compare(corpus, jaccard_similarity)
    comparisons
    #>                ca1851-match ca1851-nomatch ny1850-match
    #> ca1851-match             NA              0    0.3842549
    #> ca1851-nomatch           NA             NA    0.0000000
    #> ny1850-match             NA             NA           NA
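For readers unfamiliar with the Jaccard measure, the following base R sketch computes it on two small sets of tokens: the size of the intersection divided by the size of the union. The helper name `jaccard()` and the toy token vectors are assumptions made for illustration; the package's own jaccard_similarity() operates on its document classes.

```r
# Illustrative sketch only; `jaccard()` is a hypothetical helper in base R.
jaccard <- function(a, b) {
  a <- unique(a)
  b <- unique(b)
  # Shared tokens divided by all distinct tokens across both sets
  length(intersect(a, b)) / length(union(a, b))
}

x <- c("the real party", "real party in", "party in interest")
y <- c("real party in", "party in interest", "in interest except")
jaccard(x, y)  # 2 shared tokens out of 4 distinct tokens: 0.5
```

A score of 0 means the documents share no tokens, and a score of 1 means their token sets are identical.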

We can convert that matrix to a data frame of pairs and scores if we prefer.

    pairwise_candidates(comparisons)
    #> # A tibble: 3 × 3
    #>   a              b              score
    #> * <chr>          <chr>          <dbl>
    #> 1 ca1851-match   ca1851-nomatch 0
    #> 2 ca1851-match   ny1850-match   0.384
    #> 3 ca1851-nomatch ny1850-match   0

See the pairwise vignette for a fuller description.

vignette("textreuse-pairwise", package = "textreuse")

Minhashing and locality sensitive hashing

Pairwise comparisons can be very time-consuming because the number of comparisons grows quadratically with the size of the corpus. (A corpus with 10 documents would require 45 comparisons; a corpus with 100 documents would require 4,950 comparisons; a corpus with 1,000 documents would require 499,500 comparisons.) That’s why this package implements the minhash and locality sensitive hashing algorithms, which can detect candidate pairs much faster than pairwise comparisons in corpora of any significant size.
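The counts in the parenthesis are the number of unordered pairs, n(n - 1)/2, which base R computes with choose():

```r
n_docs <- c(10, 100, 1000)
n_pairs <- choose(n_docs, 2)  # same as n_docs * (n_docs - 1) / 2
data.frame(documents = n_docs, comparisons = n_pairs)
```

This reproduces the figures above: 45, 4,950, and 499,500.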

For this example we will load a small corpus of ten documents published by the American Tract Society. We will also create a minhash function, which represents an entire document (regardless of length) by a fixed number of integer hashes. When we create the corpus, the documents will each have a minhash signature.

    dir <- system.file("extdata/ats", package = "textreuse")
    minhash <- minhash_generator(200, seed = 235)
    ats <- TextReuseCorpus(dir = dir,
                           tokenizer = tokenize_ngrams, n = 5,
                           minhash_func = minhash)
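The idea behind a minhash signature can be sketched in base R. This is an assumption-laden toy, not what minhash_generator() does internally: `toy_minhash()`, its string hash, and its family of linear hash functions modulo a small prime are all hypothetical choices made for illustration.

```r
# Illustrative sketch only; `toy_minhash()` and its hash family are assumptions,
# not the functions produced by minhash_generator().
toy_minhash <- function(tokens, k = 200, seed = 1) {
  p <- 1000003  # a prime, small enough that a * x + b stays exact in doubles
  # Map each distinct token to an integer in [0, p) with a simple rolling hash
  x <- vapply(unique(tokens), function(tok) {
    h <- 7
    for (cp in utf8ToInt(tok)) h <- (h * 31 + cp) %% p
    h
  }, numeric(1), USE.NAMES = FALSE)
  set.seed(seed)                 # note: this resets the global RNG state
  a <- sample.int(p - 1, k)      # multipliers in 1..(p - 1)
  b <- sample.int(p, k) - 1      # offsets in 0..(p - 1)
  # For each of the k hash functions, keep the minimum hashed value
  vapply(seq_len(k), function(i) min((a[i] * x + b[i]) %% p), numeric(1))
}

doc1 <- c("a b c", "b c d", "c d e", "d e f")
doc2 <- c("b c d", "c d e", "d e f", "e f g")
sig1 <- toy_minhash(doc1, k = 200, seed = 235)
sig2 <- toy_minhash(doc2, k = 200, seed = 235)
length(sig1)        # always k values, regardless of document length
mean(sig1 == sig2)  # rough estimate of the Jaccard similarity (true value 3/5)
```

The fraction of positions at which two signatures agree approximates the Jaccard similarity of the underlying token sets, which is what lets a fixed-length signature stand in for a whole document.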

Now we can calculate potential matches, extract the candidates, and apply a comparison function to just those candidates.

    buckets <- lsh(ats, bands = 50, progress = FALSE)
    #> Warning: `gather_()` was deprecated in tidyr 1.2.0.
    #> ℹ Please use `gather()` instead.
    #> ℹ The deprecated feature was likely used in the textreuse package.
    #>   Please report the issue at <https://github.com/ropensci/textreuse/issues>.
    #> This warning is displayed once every 8 hours.
    #> Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
    #> generated.
    candidates <- lsh_candidates(buckets)
    scores <- lsh_compare(candidates, ats, jaccard_similarity, progress = FALSE)
    scores
    #> # A tibble: 1 × 3
    #>   a              b                      score
    #>   <chr>          <chr>                  <dbl>
    #> 1 remember00palm remembermeorholy00palm 0.701
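The choice of 200 minhashes and 50 bands determines how similar two documents must be before they are likely to become candidates. The sketch below applies the standard banding analysis for locality sensitive hashing in base R, assuming the signature is split evenly into bands; `p_candidate()` is a hypothetical helper name, not a package function.

```r
h <- 200    # minhashes per document, as in minhash_generator(200, ...)
b <- 50     # bands, as in lsh(ats, bands = 50)
r <- h / b  # rows per band (assumes h is divisible by b)

# Probability that two documents with Jaccard similarity s agree on
# every row of at least one band, and so become a candidate pair
p_candidate <- function(s) 1 - (1 - s^r)^b

s <- c(0.1, 0.25, 0.5, 0.7)
round(p_candidate(s), 3)

# Approximate similarity at which the probability rises most steeply
(1 / b)^(1 / r)
```

With these settings the curve turns upward at a similarity of roughly 0.38, so pairs well below that are rarely flagged and pairs well above it are almost always flagged. More bands lower the threshold; fewer bands raise it.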

For details, see the minhash vignette.

vignette("textreuse-minhash", package = "textreuse")

Text alignment

We can also extract the optimal alignment between two documents with a version of the Smith-Waterman algorithm, used for protein sequence alignment, adapted for natural language. The longest matching substring according to scoring values will be extracted, and variations in the alignment will be marked.

    a <- "'How do I know', she asked, 'if this is a good match?'"
    b <- "'This is a match', he replied."
    align_local(a, b)
    #> TextReuse alignment
    #> Alignment score: 7
    #> Document A:
    #> this is a good match
    #>
    #> Document B:
    #> This is a #### match
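To show where an alignment score like this comes from, here is a minimal word-level Smith-Waterman scoring sketch in base R. It computes only the best local score, not the aligned text. `sw_score()` is a hypothetical helper, and the scoring values (match = 2, mismatch = -1, gap = -1) are assumptions chosen for illustration.

```r
# Illustrative sketch only; `sw_score()` and its scoring values are assumptions.
sw_score <- function(a, b, match = 2, mismatch = -1, gap = -1) {
  words <- function(x) {
    w <- strsplit(tolower(x), "[^[:alnum:]]+")[[1]]
    w[nzchar(w)]
  }
  wa <- words(a)
  wb <- words(b)
  # H[i + 1, j + 1] is the best score of a local alignment ending at wa[i], wb[j]
  H <- matrix(0, nrow = length(wa) + 1, ncol = length(wb) + 1)
  for (i in seq_along(wa)) {
    for (j in seq_along(wb)) {
      step <- if (wa[i] == wb[j]) match else mismatch
      H[i + 1, j + 1] <- max(0,                   # start a new alignment
                             H[i, j] + step,      # align wa[i] with wb[j]
                             H[i, j + 1] + gap,   # skip a word in a
                             H[i + 1, j] + gap)   # skip a word in b
    }
  }
  max(H)
}

a <- "'How do I know', she asked, 'if this is a good match?'"
b <- "'This is a match', he replied."
sw_score(a, b)
```

Under these assumed values the four matching words contribute 8 and the skipped word "good" costs 1, giving 7, the same score reported in the output above.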

For details, see the text alignment vignette.

vignette("textreuse-alignment", package = "textreuse")

Parallel processing

Loading the corpus and creating tokens benefit from using multiple cores, if available. (This works only on non-Windows machines.) To use multiple cores, set options("mc.cores" = 4L), where the number is how many cores you wish to use.
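Rather than hard-coding the number, one option is to derive it from the machine. The sketch below uses the parallel package that ships with R. The policy of leaving one core free and capping at four is an assumption, not a recommendation from the package.

```r
library(parallel)  # ships with R

available <- detectCores()  # may be NA on some platforms
if (is.na(available)) available <- 1L

# Leave one core free, cap at four, and never request fewer than one
cores <- max(1L, min(4L, available - 1L))

# Multiple cores are only used on non-Windows machines
if (.Platform$OS.type == "windows") cores <- 1L

options(mc.cores = cores)
getOption("mc.cores")
```

Set the option before creating the corpus so that loading and tokenizing can use it.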

Contributing and acknowledgments

Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.

Thanks to Noam Ross for his thorough peer review of this package for rOpenSci.

