# textreuse

Detect text reuse and document similarity
This R package provides a set of functions for measuring similarity among documents and detecting passages which have been reused. It implements shingled n-gram, skip n-gram, and other tokenizers; similarity/dissimilarity functions; pairwise comparisons; minhash and locality sensitive hashing algorithms; and a version of the Smith-Waterman local alignment algorithm suitable for natural language. It is broadly useful, for example, for detecting duplicate documents in a corpus prior to text analysis, or for identifying borrowed passages between texts. The classes provided by this package follow the model of other natural language processing packages for R, especially the NLP and tm packages. (However, this package has no dependency on Java, which should make it easier to install.)

If you use this package for scholarly research, I would appreciate a citation.
``` r
citation("textreuse")
#> To cite package 'textreuse' in publications use:
#> 
#>   Li Y, Mullen L (2024). _textreuse: Detect Text Reuse and Document
#>   Similarity_. https://docs.ropensci.org/textreuse,
#>   https://github.com/ropensci/textreuse.
#> 
#> A BibTeX entry for LaTeX users is
#> 
#>   @Manual{,
#>     title = {textreuse: Detect Text Reuse and Document Similarity},
#>     author = {Yaoxiang Li and Lincoln Mullen},
#>     year = {2024},
#>     note = {https://docs.ropensci.org/textreuse,
#>       https://github.com/ropensci/textreuse},
#>   }
```

To install this package from CRAN:
``` r
install.packages("textreuse")
```

To install the development version from GitHub, use devtools.
``` r
# install.packages("devtools")
devtools::install_github("ropensci/textreuse", build_vignettes = TRUE)
```

There are three main approaches that one may take when using this package: pairwise comparisons, minhashing/locality sensitive hashing, and extracting matching passages through text alignment.
See the introductory vignette for a description of the classes provided by this package.
``` r
vignette("textreuse-introduction", package = "textreuse")
```

In this example we will load a tiny corpus of three documents. These documents are drawn from Kellen Funk's research into the propagation of legal codes of civil procedure in the nineteenth-century United States.
``` r
library(textreuse)
dir <- system.file("extdata/legal", package = "textreuse")
corpus <- TextReuseCorpus(dir = dir, meta = list(title = "Civil procedure"),
                          tokenizer = tokenize_ngrams, n = 7)
```

We have loaded the three documents into a corpus, which involves tokenizing the text and hashing the tokens. We can inspect the corpus as a whole or the individual documents that make it up.
``` r
corpus
#> TextReuseCorpus
#> Number of documents: 3 
#> hash_func : hash_string 
#> title : Civil procedure 
#> tokenizer : tokenize_ngrams
names(corpus)
#> [1] "ca1851-match"   "ca1851-nomatch" "ny1850-match"
corpus[["ca1851-match"]]
#> TextReuseTextDocument
#> file : C:/Users/bach/AppData/Local/Temp/RtmpecFDvh/temp_libpath4c4124c4b59/textreuse/extdata/legal/ca1851-match.txt 
#> hash_func : hash_string 
#> id : ca1851-match 
#> minhash_func : 
#> tokenizer : tokenize_ngrams 
#> content : § 4. Every action shall be prosecuted in the name of the real party
#> in interest, except as otherwise provided in this Act.
#> 
#> § 5. In the case of an assignment of a thing in action, the action by
#> the as
```

Now we can compare each of the documents to one another. The `pairwise_compare()` function applies a comparison function (in this case, `jaccard_similarity()`) to every pair of documents. The result is a matrix of scores. As we would expect, some documents are similar and others are not.
``` r
comparisons <- pairwise_compare(corpus, jaccard_similarity)
comparisons
#>                ca1851-match ca1851-nomatch ny1850-match
#> ca1851-match             NA              0    0.3842549
#> ca1851-nomatch           NA             NA    0.0000000
#> ny1850-match             NA             NA           NA
```

We can convert that matrix to a data frame of pairs and scores if we prefer.
``` r
pairwise_candidates(comparisons)
#> # A tibble: 3 × 3
#>   a              b              score
#> * <chr>          <chr>          <dbl>
#> 1 ca1851-match   ca1851-nomatch 0    
#> 2 ca1851-match   ny1850-match   0.384
#> 3 ca1851-nomatch ny1850-match   0
```

See the pairwise vignette for a fuller description.
``` r
vignette("textreuse-pairwise", package = "textreuse")
```

Pairwise comparisons can be very time-consuming because they grow quadratically with the size of the corpus. (A corpus with 10 documents would require at least 45 comparisons; a corpus with 100 documents would require 4,950 comparisons; a corpus with 1,000 documents would require 499,500 comparisons.) That's why this package implements the minhash and locality sensitive hashing algorithms, which can detect candidate pairs much faster than pairwise comparisons in corpora of any significant size.
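The quadratic growth described above is easy to verify: the number of unordered pairs among `n` documents is `choose(n, 2)`.

``` r
# Number of pairwise comparisons required for corpora of various sizes
n <- c(10, 100, 1000)
choose(n, 2)
#> [1]     45   4950 499500
```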
For this example we will load a small corpus of ten documents published by the American Tract Society. We will also create a minhash function, which represents an entire document (regardless of length) by a fixed number of integer hashes. When we create the corpus, the documents will each have a minhash signature.
``` r
dir <- system.file("extdata/ats", package = "textreuse")
minhash <- minhash_generator(200, seed = 235)
ats <- TextReuseCorpus(dir = dir, tokenizer = tokenize_ngrams, n = 5,
                       minhash_func = minhash)
```

Now we can calculate potential matches, extract the candidates, and apply a comparison function to just those candidates.
``` r
buckets <- lsh(ats, bands = 50, progress = FALSE)
#> Warning: `gather_()` was deprecated in tidyr 1.2.0.
#> ℹ Please use `gather()` instead.
#> ℹ The deprecated feature was likely used in the textreuse package.
#>   Please report the issue at <https://github.com/ropensci/textreuse/issues>.
#> This warning is displayed once every 8 hours.
#> Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
#> generated.
candidates <- lsh_candidates(buckets)
scores <- lsh_compare(candidates, ats, jaccard_similarity, progress = FALSE)
scores
#> # A tibble: 1 × 3
#>   a              b                      score
#>   <chr>          <chr>                  <dbl>
#> 1 remember00palm remembermeorholy00palm 0.701
```

For details, see the minhash vignette.
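The number of bands controls roughly which similarity scores LSH can detect. As a quick check (a sketch using the 200 minhashes and 50 bands from the example above), the package's `lsh_threshold()` helper estimates the Jaccard similarity above which pairs are likely to surface as candidates:

``` r
library(textreuse)
# Approximate Jaccard similarity threshold for 200 minhashes in 50 bands
lsh_threshold(h = 200, b = 50)
```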
``` r
vignette("textreuse-minhash", package = "textreuse")
```

We can also extract the optimal alignment between two documents with a version of the Smith-Waterman algorithm, used for protein sequence alignment, adapted for natural language. The longest matching substring according to scoring values will be extracted, and variations in the alignment will be marked.
``` r
a <- "'How do I know', she asked, 'if this is a good match?'"
b <- "'This is a match', he replied."
align_local(a, b)
#> TextReuse alignment
#> Alignment score: 7 
#> Document A:
#> this is a good match
#> 
#> Document B:
#> This is a #### match
```

For details, see the text alignment vignette.
``` r
vignette("textreuse-alignment", package = "textreuse")
```

Loading the corpus and creating tokens benefit from using multiple cores, if available. (This works only on non-Windows machines.) To use multiple cores, set `options("mc.cores" = 4L)`, where the number is how many cores you wish to use.
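For instance, to enable four cores before building a corpus (the core count here is illustrative; pick a number appropriate for your machine):

``` r
# Allow corpus loading and tokenizing to run on four cores (non-Windows only)
options("mc.cores" = 4L)
getOption("mc.cores")
#> [1] 4
```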
Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.
Thanks to Noam Ross for his thorough peer review of this package for rOpenSci.