| Type: | Package |
| Title: | Detect Text Reuse and Document Similarity |
| Version: | 0.1.5 |
| Date: | 2020-05-14 |
| Description: | Tools for measuring similarity among documents and detecting passages which have been reused. Implements shingled n-gram, skip n-gram, and other tokenizers; similarity/dissimilarity functions; pairwise comparisons; minhash and locality sensitive hashing algorithms; and a version of the Smith-Waterman local alignment algorithm suitable for natural language. |
| License: | MIT + file LICENSE |
| LazyData: | TRUE |
| URL: | https://docs.ropensci.org/textreuse, https://github.com/ropensci/textreuse |
| BugReports: | https://github.com/ropensci/textreuse/issues |
| VignetteBuilder: | knitr |
| Depends: | R (≥ 3.1.1) |
| Imports: | assertthat (≥ 0.1), digest (≥ 0.6.8), dplyr (≥ 0.8.0), NLP (≥ 0.1.8), Rcpp (≥ 0.12.0), RcppProgress (≥ 0.1), stringr (≥ 1.0.0), tibble (≥ 3.0.1), tidyr (≥ 0.3.1) |
| Suggests: | testthat (≥ 0.11.0), knitr (≥ 1.11), rmarkdown (≥ 0.8), covr |
| LinkingTo: | BH, Rcpp, RcppProgress |
| RoxygenNote: | 7.1.0 |
| Encoding: | UTF-8 |
| NeedsCompilation: | yes |
| Packaged: | 2020-05-15 14:43:54 UTC; lmullen |
| Author: | Lincoln Mullen |
| Maintainer: | Lincoln Mullen <lincoln@lincolnmullen.com> |
| Repository: | CRAN |
| Date/Publication: | 2020-05-15 15:50:02 UTC |
textreuse: Detect Text Reuse and Document Similarity
Description
Tools for measuring similarity among documents and detecting passages which have been reused. Implements shingled n-gram, skip n-gram, and other tokenizers; similarity/dissimilarity functions; pairwise comparisons; minhash and locality sensitive hashing algorithms; and a version of the Smith-Waterman local alignment algorithm suitable for natural language.
Details
The best place to begin with this package is the introductory vignette.
vignette("textreuse-introduction", package = "textreuse")
After reading that vignette, the "pairwise", "minhash", and "alignment" vignettes introduce specific paths for working with the package.
vignette("textreuse-pairwise", package = "textreuse")
vignette("textreuse-minhash", package = "textreuse")
vignette("textreuse-alignment", package = "textreuse")
Another good place to begin with the package is the documentation for loading documents (TextReuseTextDocument and TextReuseCorpus), for tokenizers, similarity functions, and locality-sensitive hashing.
Author(s)
Maintainer: Lincoln Mullen <lincoln@lincolnmullen.com> (ORCID)
References
The sample data provided in the extdata/ats directory is taken from a corpus of American Tract Society publications from the nineteenth century, gathered from the Internet Archive.
The sample data provided in the extdata/legal directory is taken from the following nineteenth-century codes of civil procedure from California and New York.
Final Report of the Commissioners on Practice and Pleadings, in 2 Documents of the Assembly of New York, 73rd Sess., No. 16, (1850): 243-250, sections 597-613. Google Books.
An Act To Regulate Proceedings in Civil Cases, 1851 California Laws 51, 51-53 sections 4-17; 101, sections 313-316. Google Books.
See Also
Useful links:
Report bugs at https://github.com/ropensci/textreuse/issues
TextReuseCorpus
Description
This is the constructor function for a TextReuseCorpus, modeled on the virtual S3 class Corpus from the tm package. The object is a TextReuseCorpus, which is basically a list containing objects of class TextReuseTextDocument. Arguments are passed along to that constructor function. To create the corpus, you can pass either a character vector of paths to text files using the paths = parameter, a directory containing text files (with any extension) using the dir = parameter, or a character vector of documents using the text = parameter, where each element in the character vector is a document. If the character vector passed to text = has names, then those names will be used as the document IDs. Otherwise, IDs will be assigned to the documents. Only one of the paths, dir, or text parameters should be specified.
Usage
TextReuseCorpus(
  paths,
  dir = NULL,
  text = NULL,
  meta = list(),
  progress = interactive(),
  tokenizer = tokenize_ngrams,
  ...,
  hash_func = hash_string,
  minhash_func = NULL,
  keep_tokens = FALSE,
  keep_text = TRUE,
  skip_short = TRUE
)
is.TextReuseCorpus(x)
skipped(x)
Arguments
paths | A character vector of paths to files to be opened. |
dir | The path to a directory of text files. |
text | A character vector (possibly named) of documents. |
meta | A list with named elements for the metadata associated with this corpus. |
progress | Display a progress bar while loading files. |
tokenizer | A function to split the text into tokens. See |
... | Arguments passed on to the |
hash_func | A function to hash the tokens. See |
minhash_func | A function to create minhash signatures of the document. See |
keep_tokens | Should the tokens be saved in the documents that are returned or discarded? |
keep_text | Should the text be saved in the documents that are returned or discarded? |
skip_short | Should short documents be skipped? (See details.) |
x | An R object to check. |
Details
If skip_short = TRUE, this function will skip very short or empty documents. A very short document is one where there are too few words to create at least two n-grams. For example, if five-grams are desired, then a document must be at least six words long. If no value of n is provided, then the function assumes a value of n = 3. A warning will be printed with the document ID of each skipped document. Use skipped() to get the IDs of skipped documents.
This function will use multiple cores on non-Windows machines if the "mc.cores" option is set. For example, to use four cores: options("mc.cores" = 4L).
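The length rule above follows from simple arithmetic: a document of w words yields w - n + 1 shingled n-grams, so at least two n-grams require w >= n + 1. A minimal base R sketch of that check follows; the helper names ngram_count and is_too_short are hypothetical and are not part of the package.

```r
# Hypothetical helpers (not part of textreuse) illustrating the skip_short rule.

# Number of shingled n-grams produced by a document of `words` words.
ngram_count <- function(words, n = 3) max(words - n + 1, 0)

# A document counts as "very short" when it yields fewer than two n-grams.
is_too_short <- function(words, n = 3) ngram_count(words, n) < 2

is_too_short(5, n = 5)  # five words give a single five-gram
is_too_short(6, n = 5)  # six words give two five-grams
```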
See Also
Accessors for TextReuse objects.
Examples
dir <- system.file("extdata/legal", package = "textreuse")
corpus <- TextReuseCorpus(dir = dir, meta = list("description" = "Field Codes"))
# Subset by position or file name
corpus[[1]]
names(corpus)
corpus[["ca1851-match"]]
TextReuseTextDocument
Description
This is the constructor function for TextReuseTextDocument objects. This class is used for comparing documents.
Usage
TextReuseTextDocument(
  text,
  file = NULL,
  meta = list(),
  tokenizer = tokenize_ngrams,
  ...,
  hash_func = hash_string,
  minhash_func = NULL,
  keep_tokens = FALSE,
  keep_text = TRUE,
  skip_short = TRUE
)
is.TextReuseTextDocument(x)
has_content(x)
has_tokens(x)
has_hashes(x)
has_minhashes(x)
Arguments
text | A character vector containing the text of the document. This argument can be skipped if supplying |
file | The path to a text file, if |
meta | A list with named elements for the metadata associated with this document. If a document is created using the |
tokenizer | A function to split the text into tokens. See |
... | Arguments passed on to the |
hash_func | A function to hash the tokens. See |
minhash_func | A function to create minhash signatures of the document. See |
keep_tokens | Should the tokens be saved in the document that is returned or discarded? |
keep_text | Should the text be saved in the document that is returned or discarded? |
skip_short | Should short documents be skipped? (See details.) |
x | An R object to check. |
Details
This constructor function follows a three-step process. It reads in the text, either from a file or from memory. It then tokenizes that text. Then it hashes the tokens. Most of the comparison functions in this package rely only on the hashes to make the comparison. By passing FALSE to keep_tokens and keep_text, you can avoid saving those objects, which can result in significant memory savings for large corpora.
If skip_short = TRUE, this function will return NULL for very short or empty documents. A very short document is one where there are too few words to create at least two n-grams. For example, if five-grams are desired, then a document must be at least six words long. If no value of n is provided, then the function assumes a value of n = 3. A warning will be printed with the document ID of a skipped document.
Value
An object of class TextReuseTextDocument. This object inherits from the virtual S3 class TextDocument in the NLP package. It contains the following elements:
- content
The text of the document.
- tokens
The tokens created from the text.
- hashes
Hashes created from the tokens.
- minhashes
The minhash signature of the document.
- metadata
The document metadata, including the filename (if any) in file.
See Also
Accessors for TextReuse objects.
Examples
file <- system.file("extdata/legal/ny1850-match.txt", package = "textreuse")
doc <- TextReuseTextDocument(file = file, meta = list(id = "ny1850"))
print(doc)
meta(doc)
head(tokens(doc))
head(hashes(doc))
## Not run: 
content(doc)
## End(Not run)
Accessors for TextReuse objects
Description
Accessor functions to read and write components of TextReuseTextDocument and TextReuseCorpus objects.
Usage
tokens(x)
tokens(x) <- value
hashes(x)
hashes(x) <- value
minhashes(x)
minhashes(x) <- value
Arguments
x | The object to access. |
value | The value to assign. |
Value
Either a vector or a named list of vectors.
Local alignment of natural language texts
Description
This function takes two texts, either as strings or as TextReuseTextDocument objects, and finds the optimal local alignment of those texts. A local alignment finds the best matching subset of the two documents. This function adapts the Smith-Waterman algorithm, used for genetic sequencing, for use with natural language. It compares the texts word by word (the comparison is case-insensitive) and scores them according to a set of parameters. These parameters define the score for a match, and the penalties for a mismatch and for opening a gap (i.e., the first mismatch in a potential sequence). The function then reports the optimal local alignment. Only the subset of the documents that is a match is included. Insertions or deletions in the text are reported with the edit_mark character.
Usage
align_local(
  a,
  b,
  match = 2L,
  mismatch = -1L,
  gap = -1L,
  edit_mark = "#",
  progress = interactive()
)
Arguments
a | A character vector of length one, or a |
b | A character vector of length one, or a |
match | The score to assign a matching word. Should be a positive integer. |
mismatch | The score to assign a mismatching word. Should be a negative integer or zero. |
gap | The penalty for opening a gap in the sequence. Should be a negative integer or zero. |
edit_mark | A single character used for displaying insertions/deletions in the documents. |
progress | Display a progress bar and messages while computing the alignment. |
Details
The compute time of this function is proportional to the product of the lengths of the two documents. Thus, longer documents will take considerably more time to compute. This function has been tested with pairs of documents containing about 25 thousand words each.
If the function reports that there were multiple optimal alignments, then it is likely that there is no strong match in the document.
The score reported for the local alignment is dependent on both the size of the documents and on the strength of the match, as well as on the parameters for match, mismatch, and gap penalties, so the scores are not directly comparable.
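To make the cost and the scoring concrete, here is a simplified base R sketch of a word-level Smith-Waterman score. It is an illustration under stated assumptions, not the package's implementation: the function name sw_score is hypothetical, words are split on whitespace only, every gap step costs the same linear penalty, and only the best score is returned (no traceback or edit marks). The nested loops fill one cell per pair of words, which is why compute time grows with the product of the document lengths.

```r
# Hypothetical sketch (not the textreuse implementation): best local alignment
# score between two texts, compared word by word and case-insensitively.
sw_score <- function(a, b, match = 2L, mismatch = -1L, gap = -1L) {
  x <- strsplit(tolower(a), "\\s+")[[1]]
  y <- strsplit(tolower(b), "\\s+")[[1]]
  # Scoring matrix with an extra leading row and column of zeros.
  m <- matrix(0L, nrow = length(x) + 1L, ncol = length(y) + 1L)
  for (i in seq_along(x)) {
    for (j in seq_along(y)) {
      step <- if (x[i] == y[j]) match else mismatch
      m[i + 1L, j + 1L] <- max(
        0L,                    # start a new local alignment here
        m[i, j] + step,        # align x[i] with y[j]
        m[i, j + 1L] + gap,    # x[i] aligned to a gap
        m[i + 1L, j] + gap     # y[j] aligned to a gap
      )
    }
  }
  max(m)
}

sw_score("the answer is blowin in the wind",
         "says the answer is blowing in the wind")
```

With the default parameters, the six matching words around the single mismatch ("blowin" versus "blowing") outweigh the mismatch penalty, so the best alignment spans the whole shared phrase.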
Value
A list with the class textreuse_alignment. This list contains several elements:
- a_edit and b_edit
Character vectors of the sequences with edits marked.
- score
The score of the optimal alignment.
References
For a useful description of the algorithm, see this post. For the application of the Smith-Waterman algorithm to natural language, see David A. Smith, Ryan Cordell, and Elizabeth Maddock Dillon, "Infectious Texts: Modeling Text Reuse in Nineteenth-Century Newspapers." IEEE International Conference on Big Data, 2013, http://hdl.handle.net/2047/d20004858.
Examples
align_local("The answer is blowin' in the wind.",
            "As the Bob Dylan song says, the answer is blowing in the wind.")
# Example of matching documents from a corpus
dir <- system.file("extdata/legal", package = "textreuse")
corpus <- TextReuseCorpus(dir = dir, progress = FALSE)
alignment <- align_local(corpus[["ca1851-match"]], corpus[["ny1850-match"]])
str(alignment)
Convert candidates data frames to other formats
Description
These S3 methods convert a textreuse_candidates object to a matrix.
Usage
## S3 method for class 'textreuse_candidates'
as.matrix(x, ...)
Arguments
x | An object of class |
... | Additional arguments. |
Value
A similarity matrix with row and column names containing document IDs.
Filenames from paths
Description
This function takes a character vector of paths and returns just the file name, by default without the extension. A TextReuseCorpus uses the paths to the files in the corpus as the names of the list. This function is intended to turn those paths into more manageable identifiers.
Usage
filenames(paths, extension = FALSE)
Arguments
paths | A character vector of paths. |
extension | Should the file extension be preserved? |
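The behavior described above can be approximated in base R. This is a sketch under the assumption that stripping the directory and the final extension is all that is needed; filenames_sketch is a hypothetical name, and the package's own function may handle edge cases differently.

```r
# Hypothetical base R approximation of the described behavior.
filenames_sketch <- function(paths, extension = FALSE) {
  f <- basename(paths)                       # drop the directory part
  if (extension) f else tools::file_path_sans_ext(f)
}

paths <- c("corpus/one.txt", "corpus/two.md", "corpus/three.text")
filenames_sketch(paths)
filenames_sketch(paths, extension = TRUE)
```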
See Also
Examples
paths <- c("corpus/one.txt", "corpus/two.md", "corpus/three.text")
filenames(paths)
filenames(paths, extension = TRUE)
Hash a string to an integer
Description
Hash a string to an integer
Usage
hash_string(x)
Arguments
x | A character vector to be hashed. |
Value
A vector of integer hashes.
Examples
s <- c("How", "many", "roads", "must", "a", "man", "walk", "down")
hash_string(s)
Locality sensitive hashing for minhash
Description
Locality sensitive hashing (LSH) discovers potential matches among a corpus of documents quickly, so that only likely pairs can be compared.
Usage
lsh(x, bands, progress = interactive())
Arguments
x | |
bands | The number of bands to use for locality sensitive hashing. The number of hashes in the documents in the corpus must be evenly divisible by the number of bands. See |
progress | Display a progress bar while comparing documents. |
Details
Locality sensitive hashing is a technique for detecting document similarity that does not require pairwise comparisons. When comparing pairs of documents, the number of pairs grows rapidly, so that only the smallest corpora can be compared pairwise in a reasonable amount of computation time. Locality sensitive hashing, on the other hand, takes a document which has been tokenized and hashed using a minhash algorithm. (See minhash_generator.) Each set of minhash signatures is then broken into bands comprised of a certain number of rows. (For example, 200 minhash signatures might be broken down into 20 bands each containing 10 rows.) Each band is then hashed to a bucket. Documents with identical rows in a band will be hashed to the same bucket. The likelihood that a document will be marked as a potential duplicate is proportional to the number of bands and inversely proportional to the number of rows in each band.
This function returns a data frame with the additional class lsh_buckets. The LSH technique only requires that the signatures for each document be calculated once. So it is possible, as long as one uses the same minhash function and the same number of bands, to combine the outputs from this function at different times. The output can thus be treated as a kind of cache of LSH signatures.
To extract pairs of documents from the output of this function, see lsh_candidates.
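The banding step described above can be sketched in base R. This is an illustration only: band_buckets is a hypothetical helper, the signatures are made-up toy values, and each band is turned into a bucket key by pasting its rows together, whereas the package hashes each band. The point is that two documents become candidates as soon as they share one bucket.

```r
# Hypothetical sketch of LSH banding (not the textreuse implementation).
# `signatures` is a named list of equal-length integer minhash signatures.
band_buckets <- function(signatures, bands) {
  h <- length(signatures[[1]])
  stopifnot(h %% bands == 0)            # signatures must divide evenly into bands
  rows <- h / bands
  band_id <- rep(seq_len(bands), each = rows)
  out <- lapply(names(signatures), function(id) {
    # Collapse the rows of each band into one key, prefixed by the band number.
    keys <- tapply(signatures[[id]], band_id, paste, collapse = "-")
    data.frame(doc = id,
               bucket = paste(seq_len(bands), keys, sep = ":"),
               stringsAsFactors = FALSE)
  })
  do.call(rbind, out)
}

# Toy signatures: "a" and "b" agree on the first band only.
sigs <- list(a = c(1L, 2L, 3L, 4L), b = c(1L, 2L, 9L, 9L), c = c(7L, 7L, 8L, 8L))
buckets <- band_buckets(sigs, bands = 2)

# Documents that share at least one bucket are candidate pairs.
shared <- buckets$bucket[duplicated(buckets$bucket)]
candidates <- unique(buckets$doc[buckets$bucket %in% shared])
candidates
```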
Value
A data frame (with the additional class lsh_buckets), containing a column with the document IDs and a column with their LSH signatures, or buckets.
References
Jure Leskovec, Anand Rajaraman, and Jeff Ullman, Mining of Massive Datasets (Cambridge University Press, 2011), ch. 3. See also Matthew Casperson, "Minhash for Dummies" (November 14, 2013).
See Also
minhash_generator, lsh_candidates, lsh_query, lsh_probability, lsh_threshold
Examples
dir <- system.file("extdata/legal", package = "textreuse")
minhash <- minhash_generator(200, seed = 235)
corpus <- TextReuseCorpus(dir = dir, tokenizer = tokenize_ngrams, n = 5,
                          minhash_func = minhash)
buckets <- lsh(corpus, bands = 50)
buckets
Candidate pairs from LSH comparisons
Description
Given a data frame of LSH buckets returned from lsh, this function returns the potential candidates.
Usage
lsh_candidates(buckets)
Arguments
buckets | A data frame returned from |
Value
A data frame of candidate pairs.
Examples
dir <- system.file("extdata/legal", package = "textreuse")
minhash <- minhash_generator(200, seed = 234)
corpus <- TextReuseCorpus(dir = dir, tokenizer = tokenize_ngrams, n = 5,
                          minhash_func = minhash)
buckets <- lsh(corpus, bands = 50)
lsh_candidates(buckets)
Compare candidates identified by LSH
Description
The lsh_candidates function only identifies potential matches, but cannot estimate the actual similarity of the documents. This function takes a data frame returned by lsh_candidates and applies a comparison function to each of the documents in a corpus, thereby calculating the document similarity score. Note that since your corpus will have minhash signatures rather than hashes for the tokens themselves, you will probably wish to use tokenize to calculate new hashes. This can be done for just the potentially similar documents. See the package vignettes for details.
Usage
lsh_compare(candidates, corpus, f, progress = interactive())
Arguments
candidates | A data frame returned by |
corpus | The same |
f | A comparison function such as |
progress | Display a progress bar while comparing documents. |
Value
A data frame with values calculated for score.
Examples
dir <- system.file("extdata/legal", package = "textreuse")
minhash <- minhash_generator(200, seed = 234)
corpus <- TextReuseCorpus(dir = dir, tokenizer = tokenize_ngrams, n = 5,
                          minhash_func = minhash)
buckets <- lsh(corpus, bands = 50)
candidates <- lsh_candidates(buckets)
lsh_compare(candidates, corpus, jaccard_similarity)
Probability that a candidate pair will be detected with LSH
Description
Functions to help choose the correct parameters for the lsh and minhash_generator functions. Use lsh_threshold to determine the minimum Jaccard similarity for two documents for them to likely be considered a match. Use lsh_probability to determine the probability that a pair of documents with a known Jaccard similarity will be detected.
Usage
lsh_probability(h, b, s)
lsh_threshold(h, b)
Arguments
h | The number of minhash signatures. |
b | The number of LSH bands. |
s | The Jaccard similarity. |
Details
Locality sensitive hashing returns a list of possible matches for similar documents. How likely is it that a pair of documents will be detected as a possible match? If h is the number of minhash signatures, b is the number of bands in the LSH function (implying then that the number of rows r = h / b), and s is the actual Jaccard similarity of the two documents, then the probability p that the two documents will be marked as a candidate pair is given by this equation.
p = 1 - (1 - s^{r})^{b}
According to MMDS, that equation approximates an S-curve. This implies that there is a threshold (t) for s approximated by this equation.
t = \left(\frac{1}{b}\right)^{\frac{1}{r}}
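The two equations can be evaluated directly in base R. The sketch below simply transcribes them, with r = h / b; the function names lsh_probability_sketch and lsh_threshold_sketch are hypothetical stand-ins chosen to avoid clashing with the package's own lsh_probability and lsh_threshold.

```r
# Hypothetical transcriptions of the two equations (not the package's code).

# p = 1 - (1 - s^r)^b, with r = h / b rows per band.
lsh_probability_sketch <- function(h, b, s) {
  r <- h / b
  1 - (1 - s^r)^b
}

# t = (1 / b)^(1 / r), the approximate similarity threshold of the S-curve.
lsh_threshold_sketch <- function(h, b) {
  r <- h / b
  (1 / b)^(1 / r)
}

lsh_threshold_sketch(h = 200, b = 40)
# Detection probability rises steeply around the threshold.
lsh_probability_sketch(h = 200, b = 40, s = c(0.25, 0.50, 0.75))
```

Increasing b while holding h fixed lowers the threshold, so more pairs are flagged at the cost of more false positives.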
References
Jure Leskovec, Anand Rajaraman, and Jeff Ullman, Mining of Massive Datasets (Cambridge University Press, 2011), ch. 3.
Examples
# Threshold for default values
lsh_threshold(h = 200, b = 40)
# Probability for varying values of s
lsh_probability(h = 200, b = 40, s = .25)
lsh_probability(h = 200, b = 40, s = .50)
lsh_probability(h = 200, b = 40, s = .75)
Query a LSH cache for matches to a single document
Description
This function retrieves the matches for a single document from an lsh_buckets object created by lsh. See lsh_candidates to retrieve all pairs of matches.
Usage
lsh_query(buckets, id)
Arguments
buckets | An |
id | The document ID to find matches for. |
Value
An lsh_candidates data frame with matches to the document specified.
See Also
Examples
dir <- system.file("extdata/legal", package = "textreuse")
minhash <- minhash_generator(200, seed = 235)
corpus <- TextReuseCorpus(dir = dir, tokenizer = tokenize_ngrams, n = 5,
                          minhash_func = minhash)
buckets <- lsh(corpus, bands = 50)
lsh_query(buckets, "ny1850-match")
List of all candidates in a corpus
Description
List of all candidates in a corpus
Usage
lsh_subset(candidates)
Arguments
candidates | A data frame of candidate pairs from |
Value
A character vector of document IDs from the candidate pairs, to be used to subset the TextReuseCorpus.
Examples
dir <- system.file("extdata/legal", package = "textreuse")
minhash <- minhash_generator(200, seed = 234)
corpus <- TextReuseCorpus(dir = dir, tokenizer = tokenize_ngrams, n = 5,
                          minhash_func = minhash)
buckets <- lsh(corpus, bands = 50)
candidates <- lsh_candidates(buckets)
lsh_subset(candidates)
corpus[lsh_subset(candidates)]
Generate a minhash function
Description
A minhash value is calculated by hashing the strings in a character vector to integers and then selecting the minimum value. Repeated minhash values are generated by using different hash functions: these different hash functions are created by performing a bitwise XOR operation (bitwXor) with a vector of random integers. Since it is vital that the same random integers be used for each document, this function generates another function which will always use the same integers. The returned function is intended to be passed to the hash_func parameter of TextReuseTextDocument.
Usage
minhash_generator(n = 200, seed = NULL)
Arguments
n | The number of minhashes that the returned function should generate. |
seed | An optional parameter to set the seed used in generating the random numbers to ensure that the same minhash function is used on repeated applications. |
Value
A function which will take a character vector and return n minhashes.
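The closure pattern described in this entry can be sketched in base R. This is an illustration under assumptions, not the package's code: minhash_generator_sketch is a hypothetical name, and the toy string hash inside it is a stand-in for a real hash such as hash_string. What it shows is the mechanism: the random integers are drawn once, captured by the returned function, and XORed with the token hashes so that each integer acts as a different hash function whose minimum is kept.

```r
# Hypothetical sketch of a minhash generator (not the textreuse implementation).
minhash_generator_sketch <- function(n = 200, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  # Drawn once; the returned closure reuses the same integers for every document.
  r <- sample.int(.Machine$integer.max, n)
  function(tokens) {
    # Toy string hash (an assumption for illustration only).
    h <- vapply(tokens, function(tok) {
      codes <- utf8ToInt(tok)
      as.integer(sum(codes * seq_along(codes)) %% 1000003L)
    }, integer(1), USE.NAMES = FALSE)
    # Each random integer simulates a different hash function via XOR;
    # the minimum over all tokens is that function's minhash.
    vapply(r, function(ri) min(bitwXor(h, ri)), integer(1))
  }
}

minhash <- minhash_generator_sketch(10, seed = 253)
sig1 <- minhash(c("how many roads", "many roads must", "roads must a"))
sig2 <- minhash(c("roads must a", "how many roads", "many roads must"))
identical(sig1, sig2)  # token order does not affect the signature
```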
References
Jure Leskovec, Anand Rajaraman, and Jeff Ullman, Mining of Massive Datasets (Cambridge University Press, 2011), ch. 3. See also Matthew Casperson, "Minhash for Dummies" (November 14, 2013).
See Also
Examples
set.seed(253)
minhash <- minhash_generator(10)
# Example with a TextReuseTextDocument
file <- system.file("extdata/legal/ny1850-match.txt", package = "textreuse")
doc <- TextReuseTextDocument(file = file, hash_func = minhash, keep_tokens = TRUE)
hashes(doc)
# Example with a character vector
is.character(tokens(doc))
minhash(tokens(doc))
Candidate pairs from pairwise comparisons
Description
Converts a comparison matrix generated by pairwise_compare into a data frame of candidates for matches.
Usage
pairwise_candidates(m, directional = FALSE)
Arguments
m | A matrix from |
directional | Should be set to the same value as in |
Value
A data frame containing all the non-NA values from m. Columns a and b are the IDs from the original corpus as passed to the comparison function. Column score is the score returned by the comparison function.
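The conversion from a comparison matrix to a data frame of pairs can be sketched in base R. The helper candidates_sketch is hypothetical, and the matrix below holds made-up scores; the package's function may order or filter rows differently.

```r
# Hypothetical sketch: keep every non-NA cell as one (a, b, score) row.
candidates_sketch <- function(m) {
  idx <- which(!is.na(m), arr.ind = TRUE)   # row/column positions of scores
  data.frame(a = rownames(m)[idx[, "row"]],
             b = colnames(m)[idx[, "col"]],
             score = m[idx],
             stringsAsFactors = FALSE)
}

# Toy non-directional matrix: only the upper triangle is filled in.
ids <- c("x", "y", "z")
m <- matrix(NA_real_, 3, 3, dimnames = list(ids, ids))
m["x", "y"] <- 0.5
m["x", "z"] <- 0
m["y", "z"] <- 0.25
cands <- candidates_sketch(m)
cands
```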
Examples
dir <- system.file("extdata/legal", package = "textreuse")
corpus <- TextReuseCorpus(dir = dir)
m1 <- pairwise_compare(corpus, ratio_of_matches, directional = TRUE)
pairwise_candidates(m1, directional = TRUE)
m2 <- pairwise_compare(corpus, jaccard_similarity)
pairwise_candidates(m2)
Pairwise comparisons among documents in a corpus
Description
Given a TextReuseCorpus containing documents of class TextReuseTextDocument, this function applies a comparison function to every pairing of documents, and returns a matrix with the comparison scores.
Usage
pairwise_compare(corpus, f, ..., directional = FALSE, progress = interactive())
Arguments
corpus | |
f | The function to apply to |
... | Additional arguments passed to |
directional | Some comparison functions are commutative, so that |
progress | Display a progress bar while comparing documents. |
Value
A square matrix with dimensions equal to the length of the corpus, and row and column names set by the names of the documents in the corpus. A value of NA in the matrix indicates that a comparison was not made. In cases of directional comparisons, then the comparison reported is f(row, column).
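The shape of the returned matrix can be illustrated with a base R sketch. It assumes that a non-directional comparison fills only one triangle and leaves the rest NA, which is consistent with the description above but is not confirmed here as the package's exact layout; pairwise_sketch and jaccard are hypothetical helpers, and the documents are plain character vectors of tokens.

```r
# Hypothetical sketch of pairwise comparison over a named list of token vectors.
pairwise_sketch <- function(docs, f, directional = FALSE) {
  n <- length(docs)
  m <- matrix(NA_real_, n, n, dimnames = list(names(docs), names(docs)))
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      if (i == j) next                      # never compare a document to itself
      if (!directional && i > j) next       # f(a, b) == f(b, a): compute once
      m[i, j] <- f(docs[[i]], docs[[j]])    # reported value is f(row, column)
    }
  }
  m
}

jaccard <- function(a, b) length(intersect(a, b)) / length(union(a, b))
docs <- list(x = c("a", "b"), y = c("b", "c"), z = c("c", "d"))
m_sym <- pairwise_sketch(docs, jaccard)
m_dir <- pairwise_sketch(docs, jaccard, directional = TRUE)
m_sym
```

For n documents the non-directional case makes n(n - 1) / 2 comparisons and the directional case makes n(n - 1), which is why pairwise comparison becomes expensive for large corpora.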
See Also
See these document comparison functions, jaccard_similarity, ratio_of_matches.
Examples
dir <- system.file("extdata/legal", package = "textreuse")
corpus <- TextReuseCorpus(dir = dir)
names(corpus) <- filenames(names(corpus))
# A non-directional comparison
pairwise_compare(corpus, jaccard_similarity)
# A directional comparison
pairwise_compare(corpus, ratio_of_matches, directional = TRUE)
Objects exported from other packages
Description
These objects are imported from other packages. Follow the links below to see their documentation.
Recompute the hashes for a document or corpus
Description
Given a TextReuseTextDocument or a TextReuseCorpus, this function recomputes either the hashes or the minhashes with the function specified. This implies that you have retained the tokens with the keep_tokens = TRUE parameter.
Usage
rehash(x, func, type = c("hashes", "minhashes"))
Arguments
x | |
func | A function to either hash the tokens or to generate the minhash signature. See |
type | Recompute the |
Value
The modified TextReuseTextDocument or TextReuseCorpus.
Examples
dir <- system.file("extdata/legal", package = "textreuse")
minhash1 <- minhash_generator(seed = 1)
corpus <- TextReuseCorpus(dir = dir, minhash_func = minhash1, keep_tokens = TRUE)
head(minhashes(corpus[[1]]))
minhash2 <- minhash_generator(seed = 2)
corpus <- rehash(corpus, minhash2, type = "minhashes")
head(minhashes(corpus[[2]]))
Measure similarity/dissimilarity in documents
Description
A set of functions which take two sets or bags of words and measure their similarity or dissimilarity.
Usage
jaccard_similarity(a, b)
jaccard_dissimilarity(a, b)
jaccard_bag_similarity(a, b)
ratio_of_matches(a, b)
Arguments
a | The first set (or bag) to be compared. The origin bag for directional comparisons. |
b | The second set (or bag) to be compared. The destination bag for directional comparisons. |
Details
The functions jaccard_similarity and jaccard_dissimilarity provide the Jaccard measures of similarity or dissimilarity for two sets. The coefficients will be numbers between 0 and 1. For the similarity coefficient, the higher the number the more similar the two sets are. When applied to two documents of class TextReuseTextDocument, the hashes in those documents are compared. But this function can be passed objects of any class accepted by the set functions in base R. So it is possible, for instance, to pass this function two character vectors comprised of word, line, sentence, or paragraph tokens, or those character vectors hashed as integers.
The Jaccard similarity coefficient is defined as follows:
J(A, B) = \frac{ | A \cap B | }{ | A \cup B | }
The Jaccard dissimilarity is simply
1 - J(A, B)
The function jaccard_bag_similarity treats a and b as bags rather than sets, so that the result is a fraction where the numerator is the sum of each matching element counted the minimum number of times it appears in each bag, and the denominator is the sum of the lengths of both bags. The maximum value for the Jaccard bag similarity is 0.5.
The function ratio_of_matches finds the ratio between the number of items in b that are also in a and the total number of items in b. Note that this similarity measure is directional: it measures how much b borrows from a, but says nothing about how much of a borrows from b.
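The definitions above translate almost directly into base R set operations. The sketch below is for illustration on plain vectors only; the function names ending in _sketch are hypothetical, and the package's own functions additionally handle TextReuseTextDocument objects.

```r
# Hypothetical base R transcriptions of the definitions (plain vectors only).

# |A intersect B| / |A union B|, treating the inputs as sets.
jaccard_sketch <- function(a, b) length(intersect(a, b)) / length(union(a, b))

# Each shared element counts the minimum number of times it occurs in either
# bag; the denominator is the combined length of both bags.
jaccard_bag_sketch <- function(a, b) {
  common <- intersect(a, b)
  shared <- sum(vapply(common, function(el) min(sum(a == el), sum(b == el)),
                       integer(1)))
  shared / (length(a) + length(b))
}

# Share of the items in b that also occur in a (directional).
ratio_of_matches_sketch <- function(a, b) sum(b %in% a) / length(b)

a <- c("a", "a", "a", "b")
b <- c("a", "a", "b", "b", "c")
jaccard_sketch(a, b)            # 2 shared of 3 distinct elements
jaccard_bag_sketch(a, b)        # (2 + 1) / (4 + 5)
ratio_of_matches_sketch(a, b)   # 4 of the 5 items in b occur in a
ratio_of_matches_sketch(b, a)   # all 4 items in a occur in b
```

Comparing a bag with itself gives the stated maximum of 0.5, since the numerator equals one bag's length and the denominator is twice that.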
References
Jure Leskovec, Anand Rajaraman, and Jeff Ullman, Mining of Massive Datasets (Cambridge University Press, 2011).
Examples
jaccard_similarity(1:6, 3:10)
jaccard_dissimilarity(1:6, 3:10)
a <- c("a", "a", "a", "b")
b <- c("a", "a", "b", "b", "c")
jaccard_similarity(a, b)
jaccard_bag_similarity(a, b)
ratio_of_matches(a, b)
ratio_of_matches(b, a)
ny <- system.file("extdata/legal/ny1850-match.txt", package = "textreuse")
ca_match <- system.file("extdata/legal/ca1851-match.txt", package = "textreuse")
ca_nomatch <- system.file("extdata/legal/ca1851-nomatch.txt", package = "textreuse")
ny <- TextReuseTextDocument(file = ny, meta = list(id = "ny"))
ca_match <- TextReuseTextDocument(file = ca_match, meta = list(id = "ca_match"))
ca_nomatch <- TextReuseTextDocument(file = ca_nomatch, meta = list(id = "ca_nomatch"))
# These two should have higher similarity scores
jaccard_similarity(ny, ca_match)
ratio_of_matches(ny, ca_match)
# These two should have lower similarity scores
jaccard_similarity(ny, ca_nomatch)
ratio_of_matches(ny, ca_nomatch)
Recompute the tokens for a document or corpus
Description
Given a TextReuseTextDocument or a TextReuseCorpus, this function recomputes the tokens and hashes with the functions specified. Optionally, it can also recompute the minhash signatures.
Usage
tokenize(
  x,
  tokenizer,
  ...,
  hash_func = hash_string,
  minhash_func = NULL,
  keep_tokens = FALSE,
  keep_text = TRUE
)
Arguments
x | |
tokenizer | A function to split the text into tokens. See |
... | Arguments passed on to the |
hash_func | A function to hash the tokens. See |
minhash_func | A function to create minhash signatures. See |
keep_tokens | Should the tokens be saved in the document that is returned or discarded? |
keep_text | Should the text be saved in the document that is returned or discarded? |
Value
The modified TextReuseTextDocument or TextReuseCorpus.
Examples
dir <- system.file("extdata/legal", package = "textreuse")
corpus <- TextReuseCorpus(dir = dir, tokenizer = NULL)
corpus <- tokenize(corpus, tokenize_ngrams)
head(tokens(corpus[[1]]))
Split texts into tokens
Description
These functions each turn a text into tokens. The tokenize_ngrams function returns shingled n-grams.
Usage
tokenize_words(string, lowercase = TRUE)
tokenize_sentences(string, lowercase = TRUE)
tokenize_ngrams(string, lowercase = TRUE, n = 3)
tokenize_skip_ngrams(string, lowercase = TRUE, n = 3, k = 1)
Arguments
string | A character vector of length 1 to be tokenized. |
lowercase | Should the tokens be made lower case? |
n | For n-gram tokenizers, the number of words in each n-gram. |
k | For the skip n-gram tokenizer, the maximum skip distance between words. The function will compute all skip n-grams between |
Details
These functions will strip all punctuation.
Value
A character vector containing the tokens.
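Shingling means that consecutive n-grams overlap by n - 1 words. A base R sketch of the idea follows; ngrams_sketch is a hypothetical helper, and its punctuation and whitespace handling is an assumption that may differ in detail from the package's tokenizers.

```r
# Hypothetical sketch of a shingled n-gram tokenizer (not the package's code).
ngrams_sketch <- function(string, n = 3, lowercase = TRUE) {
  if (lowercase) string <- tolower(string)
  # Strip punctuation, then split on runs of whitespace.
  words <- strsplit(gsub("[[:punct:]]", "", string), "\\s+")[[1]]
  words <- words[nzchar(words)]
  if (length(words) < n) return(character(0))
  # Slide a window of n words across the text, one word at a time.
  vapply(seq_len(length(words) - n + 1), function(i) {
    paste(words[i:(i + n - 1)], collapse = " ")
  }, character(1))
}

bigrams <- ngrams_sketch("How many roads must a man walk down?", n = 2)
bigrams
```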
Examples
dylan <- "How many roads must a man walk down? The answer is blowin' in the wind."
tokenize_words(dylan)
tokenize_sentences(dylan)
tokenize_ngrams(dylan, n = 2)
tokenize_skip_ngrams(dylan, n = 3, k = 2)
Count words
Description
This function counts words in a text, for example, a character vector, a TextReuseTextDocument, some other object that inherits from TextDocument, or all the documents in a TextReuseCorpus.
Usage
wordcount(x)
Arguments
x | The object containing a text. |
Value
An integer vector for the word count.