# textreuse

Detect text reuse and document similarity
This R package provides a set of functions for measuring similarity among documents and detecting passages which have been reused. It implements shingled n-gram, skip n-gram, and other tokenizers; similarity/dissimilarity functions; pairwise comparisons; minhash and locality sensitive hashing algorithms; and a version of the Smith-Waterman local alignment algorithm suitable for natural language. It is broadly useful, for example, for detecting duplicate documents in a corpus prior to text analysis, or for identifying borrowed passages between texts. The classes provided by this package follow the model of other natural language processing packages for R, especially the NLP and tm packages. (However, this package has no dependency on Java, which should make it easier to install.)

If you use this package for scholarly research, I would appreciate a citation.
``` r
citation("textreuse")
#> To cite package 'textreuse' in publications use:
#> 
#>   Li Y, Mullen L (2024). _textreuse: Detect Text Reuse and Document
#>   Similarity_. https://docs.ropensci.org/textreuse,
#>   https://github.com/ropensci/textreuse.
#> 
#> A BibTeX entry for LaTeX users is
#> 
#>   @Manual{,
#>     title = {textreuse: Detect Text Reuse and Document Similarity},
#>     author = {Yaoxiang Li and Lincoln Mullen},
#>     year = {2024},
#>     note = {https://docs.ropensci.org/textreuse,
#>       https://github.com/ropensci/textreuse},
#>   }
```

To install this package from CRAN:
``` r
install.packages("textreuse")
```

To install the development version from GitHub, use devtools.
``` r
# install.packages("devtools")
devtools::install_github("ropensci/textreuse", build_vignettes = TRUE)
```

There are three main approaches that one may take when using this package: pairwise comparisons, minhashing/locality sensitive hashing, and extracting matching passages through text alignment.
See the introductory vignette for a description of the classes provided by this package.
``` r
vignette("textreuse-introduction", package = "textreuse")
```

In this example we will load a tiny corpus of three documents. These documents are drawn from Kellen Funk's research into the propagation of legal codes of civil procedure in the nineteenth-century United States.
``` r
library(textreuse)
dir <- system.file("extdata/legal", package = "textreuse")
corpus <- TextReuseCorpus(dir = dir, meta = list(title = "Civil procedure"),
                          tokenizer = tokenize_ngrams, n = 7)
```

We have loaded the three documents into a corpus, which involves tokenizing the text and hashing the tokens. We can inspect the corpus as a whole or the individual documents that make it up.
``` r
corpus
#> TextReuseCorpus
#> Number of documents: 3 
#> hash_func : hash_string 
#> title : Civil procedure 
#> tokenizer : tokenize_ngrams
names(corpus)
#> [1] "ca1851-match"   "ca1851-nomatch" "ny1850-match"
corpus[["ca1851-match"]]
#> TextReuseTextDocument
#> file : C:/Users/bach/AppData/Local/Temp/RtmpecFDvh/temp_libpath4c4124c4b59/textreuse/extdata/legal/ca1851-match.txt 
#> hash_func : hash_string 
#> id : ca1851-match 
#> minhash_func : 
#> tokenizer : tokenize_ngrams 
#> content : § 4. Every action shall be prosecuted in the name of the real party
#> in interest, except as otherwise provided in this Act.
#> 
#> § 5. In the case of an assignment of a thing in action, the action by
#> the as
```

Now we can compare each of the documents to one another. The `pairwise_compare()` function applies a comparison function (in this case, `jaccard_similarity()`) to every pair of documents. The result is a matrix of scores. As we would expect, some documents are similar and others are not.
``` r
comparisons <- pairwise_compare(corpus, jaccard_similarity)
comparisons
#>                ca1851-match ca1851-nomatch ny1850-match
#> ca1851-match             NA              0    0.3842549
#> ca1851-nomatch           NA             NA    0.0000000
#> ny1850-match             NA             NA           NA
```

We can convert that matrix to a data frame of pairs and scores if we prefer.
``` r
pairwise_candidates(comparisons)
#> # A tibble: 3 × 3
#>   a              b              score
#> * <chr>          <chr>          <dbl>
#> 1 ca1851-match   ca1851-nomatch 0    
#> 2 ca1851-match   ny1850-match   0.384
#> 3 ca1851-nomatch ny1850-match   0
```

See the pairwise vignette for a fuller description.
``` r
vignette("textreuse-pairwise", package = "textreuse")
```

Pairwise comparisons can be very time-consuming because they grow quadratically with the size of the corpus. (A corpus with 10 documents would require at least 45 comparisons; a corpus with 100 documents would require 4,950 comparisons; a corpus with 1,000 documents would require 499,500 comparisons.) That's why this package implements the minhash and locality sensitive hashing algorithms, which can detect candidate pairs much faster than pairwise comparisons in corpora of any significant size.
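The quadratic growth described above is easy to verify: the number of unordered pairs among `n` documents is `choose(n, 2)`.

``` r
# Number of pairwise comparisons required for corpora of various sizes
n <- c(10, 100, 1000)
choose(n, 2)
#> [1]     45   4950 499500
```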
For this example we will load a small corpus of ten documents published by the American Tract Society. We will also create a minhash function, which represents an entire document (regardless of length) by a fixed number of integer hashes. When we create the corpus, the documents will each have a minhash signature.
``` r
dir <- system.file("extdata/ats", package = "textreuse")
minhash <- minhash_generator(200, seed = 235)
ats <- TextReuseCorpus(dir = dir, tokenizer = tokenize_ngrams, n = 5,
                       minhash_func = minhash)
```

Now we can calculate potential matches, extract the candidates, and apply a comparison function to just those candidates.
``` r
buckets <- lsh(ats, bands = 50, progress = FALSE)
#> Warning: `gather_()` was deprecated in tidyr 1.2.0.
#> ℹ Please use `gather()` instead.
#> ℹ The deprecated feature was likely used in the textreuse package.
#>   Please report the issue at <https://github.com/ropensci/textreuse/issues>.
#> This warning is displayed once every 8 hours.
#> Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
#> generated.
candidates <- lsh_candidates(buckets)
scores <- lsh_compare(candidates, ats, jaccard_similarity, progress = FALSE)
scores
#> # A tibble: 1 × 3
#>   a              b                      score
#>   <chr>          <chr>                  <dbl>
#> 1 remember00palm remembermeorholy00palm 0.701
```

For details, see the minhash vignette.
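The number of bands controls roughly which similarity scores LSH can detect. As a quick check (a sketch using the 200 minhashes and 50 bands from the example above), the package's `lsh_threshold()` helper estimates the Jaccard similarity above which pairs are likely to surface as candidates:

``` r
library(textreuse)
# Approximate Jaccard similarity threshold for 200 minhashes in 50 bands
lsh_threshold(h = 200, b = 50)
```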
``` r
vignette("textreuse-minhash", package = "textreuse")
```

We can also extract the optimal alignment between two documents with a version of the Smith-Waterman algorithm, used for protein sequence alignment, adapted for natural language. The longest matching substring according to scoring values will be extracted, and variations in the alignment will be marked.
``` r
a <- "'How do I know', she asked, 'if this is a good match?'"
b <- "'This is a match', he replied."
align_local(a, b)
#> TextReuse alignment
#> Alignment score: 7 
#> Document A:
#> this is a good match
#> 
#> Document B:
#> This is a #### match
```

For details, see the text alignment vignette.
``` r
vignette("textreuse-alignment", package = "textreuse")
```

Loading the corpus and creating tokens benefit from using multiple cores, if available. (This works only on non-Windows machines.) To use multiple cores, set `options("mc.cores" = 4L)`, where the number is how many cores you wish to use.
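For instance, to enable four cores before building a corpus (the core count here is illustrative; pick a number appropriate for your machine):

``` r
# Allow corpus loading and tokenizing to run on four cores (non-Windows only)
options("mc.cores" = 4L)
getOption("mc.cores")
#> [1] 4
```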
Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.
Thanks to Noam Ross for his thorough peer review of this package for rOpenSci.