Package: textreuse
Type: Package
Title: Detect Text Reuse and Document Similarity
Version: 0.1.5
Date: 2020-05-14
Description: Tools for measuring similarity among documents and detecting passages which have been reused. Implements shingled n-gram, skip n-gram, and other tokenizers; similarity/dissimilarity functions; pairwise comparisons; minhash and locality sensitive hashing algorithms; and a version of the Smith-Waterman local alignment algorithm suitable for natural language.
License: MIT + file LICENSE
LazyData: TRUE
URL: https://docs.ropensci.org/textreuse, https://github.com/ropensci/textreuse
BugReports: https://github.com/ropensci/textreuse/issues
VignetteBuilder: knitr
Depends: R (≥ 3.1.1)
Imports: assertthat (≥ 0.1), digest (≥ 0.6.8), dplyr (≥ 0.8.0), NLP (≥ 0.1.8), Rcpp (≥ 0.12.0), RcppProgress (≥ 0.1), stringr (≥ 1.0.0), tibble (≥ 3.0.1), tidyr (≥ 0.3.1)
Suggests: testthat (≥ 0.11.0), knitr (≥ 1.11), rmarkdown (≥ 0.8), covr
LinkingTo: BH, Rcpp, RcppProgress
RoxygenNote: 7.1.0
Encoding: UTF-8
NeedsCompilation: yes
Packaged: 2020-05-15 14:43:54 UTC; lmullen
Author: Lincoln Mullen [aut, cre] (ORCID iD)
Maintainer: Lincoln Mullen <lincoln@lincolnmullen.com>
Repository: CRAN
Date/Publication: 2020-05-15 15:50:02 UTC

textreuse: Detect Text Reuse and Document Similarity

Description

Tools for measuring similarity among documents and detecting passages which have been reused. Implements shingled n-gram, skip n-gram, and other tokenizers; similarity/dissimilarity functions; pairwise comparisons; minhash and locality sensitive hashing algorithms; and a version of the Smith-Waterman local alignment algorithm suitable for natural language.

Details

The best place to begin with this package is the introductory vignette.

vignette("textreuse-introduction", package = "textreuse")

After reading that vignette, the "pairwise" and "minhash" vignettes introduce specific paths for working with the package.

vignette("textreuse-pairwise", package = "textreuse")

vignette("textreuse-minhash", package = "textreuse")

vignette("textreuse-alignment", package = "textreuse")

Another good place to begin with the package is the documentation for loading documents (TextReuseTextDocument and TextReuseCorpus), for tokenizers, similarity functions, and locality-sensitive hashing.

Author(s)

Maintainer: Lincoln Mullen <lincoln@lincolnmullen.com> (ORCID)

References

The sample data provided in the extdata/ats directory is taken from a corpus of American Tract Society publications from the nineteenth century, gathered from the Internet Archive.

The sample data provided in the extdata/legal directory are taken from the following nineteenth-century codes of civil procedure from California and New York.

Final Report of the Commissioners on Practice and Pleadings, in 2 Documents of the Assembly of New York, 73rd Sess., No. 16, (1850): 243-250, sections 597-613. Google Books.

An Act To Regulate Proceedings in Civil Cases, 1851 California Laws 51, 51-53, sections 4-17; 101, sections 313-316. Google Books.

See Also

Useful links:

https://docs.ropensci.org/textreuse

https://github.com/ropensci/textreuse

Report bugs at https://github.com/ropensci/textreuse/issues


TextReuseCorpus

Description

This is the constructor function for a TextReuseCorpus, modeled on the virtual S3 class Corpus from the tm package. The object is a TextReuseCorpus, which is basically a list containing objects of class TextReuseTextDocument. Arguments are passed along to that constructor function. To create the corpus, you can pass either a character vector of paths to text files using the paths = parameter, a directory containing text files (with any extension) using the dir = parameter, or a character vector of documents using the text = parameter, where each element in the character vector is a document. If the character vector passed to text = has names, then those names will be used as the document IDs. Otherwise, IDs will be assigned to the documents. Only one of the paths, dir, or text parameters should be specified.

Usage

TextReuseCorpus(
  paths,
  dir = NULL,
  text = NULL,
  meta = list(),
  progress = interactive(),
  tokenizer = tokenize_ngrams,
  ...,
  hash_func = hash_string,
  minhash_func = NULL,
  keep_tokens = FALSE,
  keep_text = TRUE,
  skip_short = TRUE
)

is.TextReuseCorpus(x)

skipped(x)

Arguments

paths

A character vector of paths to files to be opened.

dir

The path to a directory of text files.

text

A character vector (possibly named) of documents.

meta

A list with named elements for the metadata associated with this corpus.

progress

Display a progress bar while loading files.

tokenizer

A function to split the text into tokens. See tokenizers. If value is NULL, then tokenizing and hashing will be skipped.

...

Arguments passed on to the tokenizer.

hash_func

A function to hash the tokens. See hash_string.

minhash_func

A function to create minhash signatures of the document. See minhash_generator.

keep_tokens

Should the tokens be saved in the documents that are returned or discarded?

keep_text

Should the text be saved in the documents that are returned or discarded?

skip_short

Should short documents be skipped? (See details.)

x

An R object to check.

Details

If skip_short = TRUE, this function will skip very short or empty documents. A very short document is one with too few words to create at least two n-grams. For example, if five-grams are desired, then a document must be at least six words long. If no value of n is provided, then the function assumes a value of n = 3. A warning will be printed with the document ID of each skipped document. Use skipped() to get the IDs of skipped documents.

This function will use multiple cores on non-Windows machines if the "mc.cores" option is set. For example, to use four cores: options("mc.cores" = 4L).
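
As a brief sketch of the skipping behavior, consider a corpus built from in-memory text (the document IDs "long" and "short" below are hypothetical). With the default n = 3, a document needs at least four words to yield two n-grams, so the second document is skipped with a warning:

library(textreuse)

docs <- c(long  = "How many roads must a man walk down before you call him a man",
          short = "Too short")
corpus <- TextReuseCorpus(text = docs, tokenizer = tokenize_ngrams, n = 3)
skipped(corpus)  # "short": the ID of the document that was skipped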

See Also

Accessors for TextReuse objects.

Examples

dir <- system.file("extdata/legal", package = "textreuse")
corpus <- TextReuseCorpus(dir = dir, meta = list("description" = "Field Codes"))

# Subset by position or file name
corpus[[1]]
names(corpus)
corpus[["ca1851-match"]]

TextReuseTextDocument

Description

This is the constructor function for TextReuseTextDocument objects. This class is used for comparing documents.

Usage

TextReuseTextDocument(
  text,
  file = NULL,
  meta = list(),
  tokenizer = tokenize_ngrams,
  ...,
  hash_func = hash_string,
  minhash_func = NULL,
  keep_tokens = FALSE,
  keep_text = TRUE,
  skip_short = TRUE
)

is.TextReuseTextDocument(x)

has_content(x)

has_tokens(x)

has_hashes(x)

has_minhashes(x)

Arguments

text

A character vector containing the text of the document. This argument can be skipped if supplying file.

file

The path to a text file, if text is not provided.

meta

A list with named elements for the metadata associated with this document. If a document is created using the text parameter, then you must provide an id field, e.g., meta = list(id = "my_id"). If the document is created using file, then the ID will be created from the file name.

tokenizer

A function to split the text into tokens. See tokenizers. If value is NULL, then tokenizing and hashing will be skipped.

...

Arguments passed on to the tokenizer.

hash_func

A function to hash the tokens. See hash_string.

minhash_func

A function to create minhash signatures of the document. See minhash_generator.

keep_tokens

Should the tokens be saved in the document that is returned or discarded?

keep_text

Should the text be saved in the document that is returned or discarded?

skip_short

Should short documents be skipped? (See details.)

x

An R object to check.

Details

This constructor function follows a three-step process. It reads in the text, either from a file or from memory. It then tokenizes that text. Then it hashes the tokens. Most of the comparison functions in this package rely only on the hashes to make the comparison. By passing FALSE to keep_tokens and keep_text, you can avoid saving those objects, which can result in significant memory savings for large corpora.

If skip_short = TRUE, this function will return NULL for very short or empty documents. A very short document is one with too few words to create at least two n-grams. For example, if five-grams are desired, then a document must be at least six words long. If no value of n is provided, then the function assumes a value of n = 3. A warning will be printed with the document ID of a skipped document.
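
A minimal sketch of the memory-saving pattern described above, using an in-memory string (the ID "dylan" is hypothetical). Discarding the tokens and text leaves only the hashes, which is all that most comparison functions need:

library(textreuse)

doc <- TextReuseTextDocument(
  text = "How many roads must a man walk down before you call him a man",
  meta = list(id = "dylan"),
  tokenizer = tokenize_ngrams, n = 3,
  keep_tokens = FALSE, keep_text = FALSE
)
has_hashes(doc)  # TRUE: comparisons can still be made from the hashes
has_tokens(doc)  # FALSE: the tokens were discarded to save memory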

Value

An object of class TextReuseTextDocument. This object inherits from the virtual S3 class TextDocument in the NLP package. It contains the following elements:

content

The text of the document.

tokens

The tokens created from the text.

hashes

Hashes created from the tokens.

minhashes

The minhash signature of the document.

metadata

The document metadata, including the filename (if any) in file.

See Also

Accessors for TextReuse objects.

Examples

file <- system.file("extdata/legal/ny1850-match.txt", package = "textreuse")
doc  <- TextReuseTextDocument(file = file, meta = list(id = "ny1850"))
print(doc)
meta(doc)
head(tokens(doc))
head(hashes(doc))
## Not run: content(doc)
## End(Not run)

Accessors for TextReuse objects

Description

Accessor functions to read and write components of TextReuseTextDocument and TextReuseCorpus objects.

Usage

tokens(x)

tokens(x) <- value

hashes(x)

hashes(x) <- value

minhashes(x)

minhashes(x) <- value

Arguments

x

The object to access.

value

The value to assign.

Value

Either a vector or a named list of vectors.
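
A short usage sketch for these accessors, reusing the bundled legal corpus: applied to a single document they return a vector; applied to a corpus, a named list of vectors.

library(textreuse)

dir <- system.file("extdata/legal", package = "textreuse")
corpus <- TextReuseCorpus(dir = dir, keep_tokens = TRUE)
head(tokens(corpus[[1]]))          # tokens of one document
head(hashes(corpus[[1]]))          # integer hashes of those tokens
str(hashes(corpus), list.len = 2)  # named list across the whole corpus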


Local alignment of natural language texts

Description

This function takes two texts, either as strings or as TextReuseTextDocument objects, and finds the optimal local alignment of those texts. A local alignment finds the best matching subset of the two documents. This function adapts the Smith-Waterman algorithm, used for genetic sequencing, for use with natural language. It compares the texts word by word (the comparison is case-insensitive) and scores them according to a set of parameters. These parameters define the score for a match, and the penalties for a mismatch and for opening a gap (i.e., the first mismatch in a potential sequence). The function then reports the optimal local alignment. Only the subset of the documents that is a match is included. Insertions or deletions in the text are reported with the edit_mark character.

Usage

align_local(
  a,
  b,
  match = 2L,
  mismatch = -1L,
  gap = -1L,
  edit_mark = "#",
  progress = interactive()
)

Arguments

a

A character vector of length one, or a TextReuseTextDocument.

b

A character vector of length one, or a TextReuseTextDocument.

match

The score to assign a matching word. Should be a positive integer.

mismatch

The score to assign a mismatching word. Should be a negative integer or zero.

gap

The penalty for opening a gap in the sequence. Should be a negative integer or zero.

edit_mark

A single character used for displaying insertions/deletions in the documents.

progress

Display a progress bar and messages while computing the alignment.

Details

The compute time of this function is proportional to the product of the lengths of the two documents. Thus, longer documents will take considerably more time to compute. This function has been tested with pairs of documents containing about 25 thousand words each.

If the function reports that there were multiple optimal alignments, then it is likely that there is no strong match in the document.

The score reported for the local alignment is dependent on both the size of the documents and on the strength of the match, as well as on the parameters for match, mismatch, and gap penalties, so the scores are not directly comparable.

Value

A list with the class textreuse_alignment. This list contains several elements, including the score of the alignment and versions of each text with insertions and deletions marked.

References

For a useful description of the algorithm, see this post. For the application of the Smith-Waterman algorithm to natural language, see David A. Smith, Ryan Cordell, and Elizabeth Maddock Dillon, "Infectious Texts: Modeling Text Reuse in Nineteenth-Century Newspapers," IEEE International Conference on Big Data, 2013, http://hdl.handle.net/2047/d20004858.

Examples

align_local("The answer is blowin' in the wind.",            "As the Bob Dylan song says, the answer is blowing in the wind.")# Example of matching documents from a corpusdir <- system.file("extdata/legal", package = "textreuse")corpus <- TextReuseCorpus(dir = dir, progress = FALSE)alignment <- align_local(corpus[["ca1851-match"]], corpus[["ny1850-match"]])str(alignment)

Convert candidates data frames to other formats

Description

These S3 methods convert a textreuse_candidates object to a matrix.

Usage

## S3 method for class 'textreuse_candidates'
as.matrix(x, ...)

Arguments

x

An object of class textreuse_candidates.

...

Additional arguments.

Value

A similarity matrix with row and column names containing document IDs.
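
A brief usage sketch, reusing the pairwise_compare() workflow documented later on this page to produce a candidates object and convert it back to a matrix:

library(textreuse)

dir <- system.file("extdata/legal", package = "textreuse")
corpus <- TextReuseCorpus(dir = dir)
scores <- pairwise_compare(corpus, jaccard_similarity)
candidates <- pairwise_candidates(scores)
as.matrix(candidates)  # square similarity matrix with document IDs as dimnames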


Filenames from paths

Description

This function takes a character vector of paths and returns just the file name, by default without the extension. A TextReuseCorpus uses the paths to the files in the corpus as the names of the list. This function is intended to turn those paths into more manageable identifiers.

Usage

filenames(paths, extension = FALSE)

Arguments

paths

A character vector of paths.

extension

Should the file extension be preserved?

See Also

basename

Examples

paths <- c("corpus/one.txt", "corpus/two.md", "corpus/three.text")
filenames(paths)
filenames(paths, extension = TRUE)

Hash a string to an integer

Description

Hash a string to an integer

Usage

hash_string(x)

Arguments

x

A character vector to be hashed.

Value

A vector of integer hashes.

Examples

s <- c("How", "many", "roads", "must", "a", "man", "walk", "down")
hash_string(s)

Locality sensitive hashing for minhash

Description

Locality sensitive hashing (LSH) discovers potential matches among a corpus of documents quickly, so that only likely pairs can be compared.

Usage

lsh(x, bands, progress = interactive())

Arguments

x

A TextReuseCorpus or TextReuseTextDocument.

bands

The number of bands to use for locality sensitive hashing. The number of hashes in the documents in the corpus must be evenly divisible by the number of bands. See lsh_threshold and lsh_probability for guidance in selecting the number of bands and hashes.

progress

Display a progress bar while comparing documents.

Details

Locality sensitive hashing is a technique for detecting document similarity that does not require pairwise comparisons. When comparing pairs of documents, the number of pairs grows rapidly, so that only the smallest corpora can be compared pairwise in a reasonable amount of computation time. Locality sensitive hashing, on the other hand, takes a document which has been tokenized and hashed using a minhash algorithm. (See minhash_generator.) Each set of minhash signatures is then broken into bands comprised of a certain number of rows. (For example, 200 minhash signatures might be broken down into 20 bands each containing 10 rows.) Each band is then hashed to a bucket. Documents with identical rows in a band will be hashed to the same bucket. The likelihood that a document will be marked as a potential duplicate is proportional to the number of bands and inversely proportional to the number of rows in each band.

This function returns a data frame with the additional class lsh_buckets. The LSH technique only requires that the signatures for each document be calculated once. So it is possible, as long as one uses the same minhash function and the same number of bands, to combine the outputs from this function at different times. The output can thus be treated as a kind of cache of LSH signatures.

To extract pairs of documents from the output of this function, see lsh_candidates.

Value

A data frame (with the additional class lsh_buckets), containing a column with the document IDs and a column with their LSH signatures, or buckets.

References

Jure Leskovec, Anand Rajaraman, and Jeff Ullman, Mining of Massive Datasets (Cambridge University Press, 2011), ch. 3. See also Matthew Casperson, "Minhash for Dummies" (November 14, 2013).

See Also

minhash_generator, lsh_candidates, lsh_query, lsh_probability, lsh_threshold

Examples

dir <- system.file("extdata/legal", package = "textreuse")
minhash <- minhash_generator(200, seed = 235)
corpus <- TextReuseCorpus(dir = dir,
                          tokenizer = tokenize_ngrams, n = 5,
                          minhash_func = minhash)
buckets <- lsh(corpus, bands = 50)
buckets

Candidate pairs from LSH comparisons

Description

Given a data frame of LSH buckets returned from lsh, this function returns the potential candidates.

Usage

lsh_candidates(buckets)

Arguments

buckets

A data frame returned from lsh.

Value

A data frame of candidate pairs.

Examples

dir <- system.file("extdata/legal", package = "textreuse")
minhash <- minhash_generator(200, seed = 234)
corpus <- TextReuseCorpus(dir = dir,
                          tokenizer = tokenize_ngrams, n = 5,
                          minhash_func = minhash)
buckets <- lsh(corpus, bands = 50)
lsh_candidates(buckets)

Compare candidates identified by LSH

Description

The lsh_candidates function only identifies potential matches, but cannot estimate the actual similarity of the documents. This function takes a data frame returned by lsh_candidates and applies a comparison function to each of the documents in a corpus, thereby calculating the document similarity score. Note that since your corpus will have minhash signatures rather than hashes for the tokens themselves, you will probably wish to use tokenize to calculate new hashes. This can be done for just the potentially similar documents. See the package vignettes for details.

Usage

lsh_compare(candidates, corpus, f, progress = interactive())

Arguments

candidates

A data frame returned by lsh_candidates.

corpus

The same TextReuseCorpus corpus which was used to generate the candidates.

f

A comparison function such as jaccard_similarity.

progress

Display a progress bar while comparing documents.

Value

A data frame with values calculated for score.

Examples

dir <- system.file("extdata/legal", package = "textreuse")
minhash <- minhash_generator(200, seed = 234)
corpus <- TextReuseCorpus(dir = dir,
                          tokenizer = tokenize_ngrams, n = 5,
                          minhash_func = minhash)
buckets <- lsh(corpus, bands = 50)
candidates <- lsh_candidates(buckets)
lsh_compare(candidates, corpus, jaccard_similarity)
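
A sketch of the re-hashing workflow mentioned in the description, continuing from the example above: subset the corpus to the candidate documents, recompute ordinary token hashes with tokenize(), and then score the candidates.

candidate_docs <- corpus[lsh_subset(candidates)]
candidate_docs <- tokenize(candidate_docs, tokenize_ngrams, n = 5)
lsh_compare(candidates, candidate_docs, jaccard_similarity)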

Probability that a candidate pair will be detected with LSH

Description

Functions to help choose the correct parameters for the lsh and minhash_generator functions. Use lsh_threshold to determine the minimum Jaccard similarity for two documents for them to likely be considered a match. Use lsh_probability to determine the probability that a pair of documents with a known Jaccard similarity will be detected.

Usage

lsh_probability(h, b, s)

lsh_threshold(h, b)

Arguments

h

The number of minhash signatures.

b

The number of LSH bands.

s

The Jaccard similarity.

Details

Locality sensitive hashing returns a list of possible matches for similar documents. How likely is it that a pair of documents will be detected as a possible match? If h is the number of minhash signatures, b is the number of bands in the LSH function (implying then that the number of rows r = h / b), and s is the actual Jaccard similarity of the two documents, then the probability p that the two documents will be marked as a candidate pair is given by this equation.

p = 1 - (1 - s^{r})^{b}

According to MMDS, that equation approximates an S-curve. This implies that there is a threshold (t) for s approximated by this equation.

t = (1/b)^{1/r}
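
A worked check of these formulas in plain R, using the values h = 200 and b = 40 from the examples below and a hypothetical similarity s = 0.5; the direct computations should agree with lsh_probability() and lsh_threshold().

h <- 200
b <- 40
r <- h / b             # 5 rows per band
s <- 0.5               # a hypothetical Jaccard similarity

1 - (1 - s^r)^b        # about 0.72; compare lsh_probability(h, b, s)
(1 / b)^(1 / r)        # about 0.48; compare lsh_threshold(h, b)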

References

Jure Leskovec, Anand Rajaraman, and Jeff Ullman, Mining of Massive Datasets (Cambridge University Press, 2011), ch. 3.

Examples

# Threshold for default values
lsh_threshold(h = 200, b = 40)

# Probability for varying values of s
lsh_probability(h = 200, b = 40, s = .25)
lsh_probability(h = 200, b = 40, s = .50)
lsh_probability(h = 200, b = 40, s = .75)

Query a LSH cache for matches to a single document

Description

This function retrieves the matches for a single document from an lsh_buckets object created by lsh. See lsh_candidates to retrieve all pairs of matches.

Usage

lsh_query(buckets, id)

Arguments

buckets

An lsh_buckets object created by lsh.

id

The document ID to find matches for.

Value

An lsh_candidates data frame with matches to the document specified.

See Also

lsh, lsh_candidates

Examples

dir <- system.file("extdata/legal", package = "textreuse")
minhash <- minhash_generator(200, seed = 235)
corpus <- TextReuseCorpus(dir = dir,
                          tokenizer = tokenize_ngrams, n = 5,
                          minhash_func = minhash)
buckets <- lsh(corpus, bands = 50)
lsh_query(buckets, "ny1850-match")

List of all candidates in a corpus

Description

List of all candidates in a corpus

Usage

lsh_subset(candidates)

Arguments

candidates

A data frame of candidate pairs from lsh_candidates.

Value

A character vector of document IDs from the candidate pairs, to be used to subset the TextReuseCorpus.

Examples

dir <- system.file("extdata/legal", package = "textreuse")
minhash <- minhash_generator(200, seed = 234)
corpus <- TextReuseCorpus(dir = dir,
                          tokenizer = tokenize_ngrams, n = 5,
                          minhash_func = minhash)
buckets <- lsh(corpus, bands = 50)
candidates <- lsh_candidates(buckets)
lsh_subset(candidates)
corpus[lsh_subset(candidates)]

Generate a minhash function

Description

A minhash value is calculated by hashing the strings in a character vector to integers and then selecting the minimum value. Repeated minhash values are generated by using different hash functions: these different hash functions are created by performing a bitwise XOR operation (bitwXor) with a vector of random integers. Since it is vital that the same random integers be used for each document, this function generates another function which will always use the same integers. The returned function is intended to be passed to the hash_func parameter of TextReuseTextDocument.
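
Conceptually, the construction can be sketched as follows (this is an illustration of the idea, not the package's internal code):

library(textreuse)

set.seed(253)
random_ints <- sample.int(.Machine$integer.max, 3)  # one integer per minhash
tokens <- c("how many", "many roads", "roads must")
base_hashes <- hash_string(tokens)                  # one integer hash per token
sapply(random_ints, function(r) min(bitwXor(base_hashes, r)))  # three minhashes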

Usage

minhash_generator(n = 200, seed = NULL)

Arguments

n

The number of minhashes that the returned function should generate.

seed

An optional parameter to set the seed used in generating the random numbers, to ensure that the same minhash function is used on repeated applications.

Value

A function which will take a character vector and return n minhashes.

References

Jure Leskovec, Anand Rajaraman, and Jeff Ullman, Mining of Massive Datasets (Cambridge University Press, 2011), ch. 3. See also Matthew Casperson, "Minhash for Dummies" (November 14, 2013).

See Also

lsh

Examples

set.seed(253)
minhash <- minhash_generator(10)

# Example with a TextReuseTextDocument
file <- system.file("extdata/legal/ny1850-match.txt", package = "textreuse")
doc <- TextReuseTextDocument(file = file, hash_func = minhash,
                             keep_tokens = TRUE)
hashes(doc)

# Example with a character vector
is.character(tokens(doc))
minhash(tokens(doc))

Candidate pairs from pairwise comparisons

Description

Converts a comparison matrix generated by pairwise_compare into a data frame of candidates for matches.

Usage

pairwise_candidates(m, directional = FALSE)

Arguments

m

A matrix from pairwise_compare.

directional

Should be set to the same value as in pairwise_compare.

Value

A data frame containing all the non-NA values from m. Columns a and b are the IDs from the original corpus as passed to the comparison function. Column score is the score returned by the comparison function.

Examples

dir <- system.file("extdata/legal", package = "textreuse")
corpus <- TextReuseCorpus(dir = dir)

m1 <- pairwise_compare(corpus, ratio_of_matches, directional = TRUE)
pairwise_candidates(m1, directional = TRUE)

m2 <- pairwise_compare(corpus, jaccard_similarity)
pairwise_candidates(m2)

Pairwise comparisons among documents in a corpus

Description

Given a TextReuseCorpus containing documents of class TextReuseTextDocument, this function applies a comparison function to every pairing of documents, and returns a matrix with the comparison scores.

Usage

pairwise_compare(corpus, f, ..., directional = FALSE, progress = interactive())

Arguments

corpus

A TextReuseCorpus.

f

The function to apply to x and y.

...

Additional arguments passed to f.

directional

Some comparison functions are commutative, so that f(a, b) == f(b, a) (e.g., jaccard_similarity). Other functions are directional, so that f(a, b) measures a's borrowing from b, which may not be the same as f(b, a) (e.g., ratio_of_matches). If directional is FALSE, then only the minimum number of comparisons will be made, i.e., the upper triangle of the matrix. If directional is TRUE, then both directional comparisons will be measured. In no case, however, will documents be compared to themselves, i.e., the diagonal of the matrix.

progress

Display a progress bar while comparing documents.

Value

A square matrix with dimensions equal to the length of the corpus, and row and column names set by the names of the documents in the corpus. A value of NA in the matrix indicates that a comparison was not made. In cases of directional comparisons, the comparison reported is f(row, column).

See Also

See these document comparison functions: jaccard_similarity, ratio_of_matches.

Examples

dir <- system.file("extdata/legal", package = "textreuse")
corpus <- TextReuseCorpus(dir = dir)
names(corpus) <- filenames(names(corpus))

# A non-directional comparison
pairwise_compare(corpus, jaccard_similarity)

# A directional comparison
pairwise_compare(corpus, ratio_of_matches, directional = TRUE)

Objects exported from other packages

Description

These objects are imported from other packages. Follow the links below to see their documentation.

NLP

content, content<-, meta, meta<-
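
A short sketch of these re-exported accessors applied to a textreuse document:

library(textreuse)

file <- system.file("extdata/legal/ny1850-match.txt", package = "textreuse")
doc <- TextReuseTextDocument(file = file, meta = list(id = "ny1850"))
meta(doc, "id")            # read a single metadata field
meta(doc, "date") <- 1850  # write a new metadata field
content(doc)               # the text of the document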


Recompute the hashes for a document or corpus

Description

Given a TextReuseTextDocument or a TextReuseCorpus, this function recomputes either the hashes or the minhashes with the function specified. This implies that you have retained the tokens with the keep_tokens = TRUE parameter.

Usage

rehash(x, func, type = c("hashes", "minhashes"))

Arguments

x

A TextReuseTextDocument or TextReuseCorpus.

func

A function to either hash the tokens or to generate the minhash signature. See hash_string, minhash_generator.

type

Recompute the hashes or minhashes?

Value

The modified TextReuseTextDocument or TextReuseCorpus.

Examples

dir <- system.file("extdata/legal", package = "textreuse")
minhash1 <- minhash_generator(seed = 1)
corpus <- TextReuseCorpus(dir = dir, minhash_func = minhash1, keep_tokens = TRUE)
head(minhashes(corpus[[1]]))

minhash2 <- minhash_generator(seed = 2)
corpus <- rehash(corpus, minhash2, type = "minhashes")
head(minhashes(corpus[[2]]))

Measure similarity/dissimilarity in documents

Description

A set of functions which take two sets or bags of words and measure their similarity or dissimilarity.

Usage

jaccard_similarity(a, b)

jaccard_dissimilarity(a, b)

jaccard_bag_similarity(a, b)

ratio_of_matches(a, b)

Arguments

a

The first set (or bag) to be compared. The origin bag for directional comparisons.

b

The second set (or bag) to be compared. The destination bag for directional comparisons.

Details

The functions jaccard_similarity and jaccard_dissimilarity provide the Jaccard measures of similarity or dissimilarity for two sets. The coefficients will be numbers between 0 and 1. For the similarity coefficient, the higher the number the more similar the two sets are. When applied to two documents of class TextReuseTextDocument, the hashes in those documents are compared. But this function can be passed objects of any class accepted by the set functions in base R. So it is possible, for instance, to pass this function two character vectors comprised of word, line, sentence, or paragraph tokens, or those character vectors hashed as integers.

The Jaccard similarity coefficient is defined as follows:

J(A, B) = \frac{ | A \cap B | }{ | A \cup B | }

The Jaccard dissimilarity is simply

1 - J(A, B)

The function jaccard_bag_similarity treats a and b as bags rather than sets, so that the result is a fraction where the numerator is the sum of each matching element counted the minimum number of times it appears in each bag, and the denominator is the sum of the lengths of both bags. The maximum value for the Jaccard bag similarity is 0.5.

The function ratio_of_matches finds the ratio between the number of items in b that are also in a and the total number of items in b. Note that this similarity measure is directional: it measures how much b borrows from a, but says nothing about how much of a borrows from b.

References

Jure Leskovec, Anand Rajaraman, and Jeff Ullman, Mining of Massive Datasets (Cambridge University Press, 2011).

Examples

jaccard_similarity(1:6, 3:10)
jaccard_dissimilarity(1:6, 3:10)

a <- c("a", "a", "a", "b")
b <- c("a", "a", "b", "b", "c")
jaccard_similarity(a, b)
jaccard_bag_similarity(a, b)
ratio_of_matches(a, b)
ratio_of_matches(b, a)

ny         <- system.file("extdata/legal/ny1850-match.txt", package = "textreuse")
ca_match   <- system.file("extdata/legal/ca1851-match.txt", package = "textreuse")
ca_nomatch <- system.file("extdata/legal/ca1851-nomatch.txt", package = "textreuse")
ny         <- TextReuseTextDocument(file = ny,
                                    meta = list(id = "ny"))
ca_match   <- TextReuseTextDocument(file = ca_match,
                                    meta = list(id = "ca_match"))
ca_nomatch <- TextReuseTextDocument(file = ca_nomatch,
                                    meta = list(id = "ca_nomatch"))

# These two should have higher similarity scores
jaccard_similarity(ny, ca_match)
ratio_of_matches(ny, ca_match)

# These two should have lower similarity scores
jaccard_similarity(ny, ca_nomatch)
ratio_of_matches(ny, ca_nomatch)

Recompute the tokens for a document or corpus

Description

Given a TextReuseTextDocument or a TextReuseCorpus, this function recomputes the tokens and hashes with the functions specified. Optionally, it can also recompute the minhash signatures.

Usage

tokenize(
  x,
  tokenizer,
  ...,
  hash_func = hash_string,
  minhash_func = NULL,
  keep_tokens = FALSE,
  keep_text = TRUE
)

Arguments

x

A TextReuseTextDocument or TextReuseCorpus.

tokenizer

A function to split the text into tokens. See tokenizers.

...

Arguments passed on to the tokenizer.

hash_func

A function to hash the tokens. See hash_string.

minhash_func

A function to create minhash signatures. See minhash_generator.

keep_tokens

Should the tokens be saved in the document that is returned or discarded?

keep_text

Should the text be saved in the document that is returned or discarded?

Value

The modified TextReuseTextDocument or TextReuseCorpus.

Examples

dir <- system.file("extdata/legal", package = "textreuse")
corpus <- TextReuseCorpus(dir = dir, tokenizer = NULL)
corpus <- tokenize(corpus, tokenize_ngrams)
head(tokens(corpus[[1]]))

Split texts into tokens

Description

These functions each turn a text into tokens. The tokenize_ngrams function returns shingled n-grams.

Usage

tokenize_words(string, lowercase = TRUE)

tokenize_sentences(string, lowercase = TRUE)

tokenize_ngrams(string, lowercase = TRUE, n = 3)

tokenize_skip_ngrams(string, lowercase = TRUE, n = 3, k = 1)

Arguments

string

A character vector of length 1 to be tokenized.

lowercase

Should the tokens be made lower case?

n

For n-gram tokenizers, the number of words in each n-gram.

k

For the skip n-gram tokenizer, the maximum skip distance between words. The function will compute all skip n-grams between 0 and k.

Details

These functions will strip all punctuation.

Value

A character vector containing the tokens.

Examples

dylan <- "How many roads must a man walk down? The answer is blowin' in the wind."tokenize_words(dylan)tokenize_sentences(dylan)tokenize_ngrams(dylan, n = 2)tokenize_skip_ngrams(dylan, n = 3, k = 2)

Count words

Description

This function counts words in a text: for example, a character vector, a TextReuseTextDocument, some other object that inherits from TextDocument, or all the documents in a TextReuseCorpus.

Usage

wordcount(x)

Arguments

x

The object containing a text.

Value

An integer vector for the word count.
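
A short usage sketch for wordcount():

library(textreuse)

wordcount("How many roads must a man walk down?")  # 8

dir <- system.file("extdata/legal", package = "textreuse")
corpus <- TextReuseCorpus(dir = dir)
wordcount(corpus)  # an integer vector with one count per document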

