| Title: | Text Mining Package |
| Version: | 0.7-17 |
| Date: | 2025-12-10 |
| Depends: | R (≥ 3.4.0), NLP (≥ 0.2-0) |
| Imports: | Rcpp, parallel, slam (≥ 0.1-37), stats, tools, utils, graphics, xml2 |
| LinkingTo: | BH, Rcpp |
| Suggests: | antiword, filehash, methods, pdftools, Rcampdf, Rgraphviz, Rpoppler, SnowballC, testthat, tm.lexicon.GeneralInquirer |
| Description: | A framework for text mining applications within R. |
| License: | GPL-3 |
| URL: | https://tm.r-forge.r-project.org/ |
| Additional_repositories: | https://datacube.wu.ac.at |
| NeedsCompilation: | yes |
| Packaged: | 2025-12-10 12:27:27 UTC; hornik |
| Author: | Ingo Feinerer |
| Maintainer: | Kurt Hornik <Kurt.Hornik@R-project.org> |
| Repository: | CRAN |
| Date/Publication: | 2025-12-10 13:41:05 UTC |
Corpora
Description
Representing and computing on corpora.
Details
Corpora are collections of documents containing (natural language) text. In packages which employ the infrastructure provided by package tm, such corpora are represented via the virtual S3 class Corpus: such packages then provide S3 corpus classes extending the virtual base class (such as VCorpus provided by package tm itself).
All extension classes must provide accessors to extract subsets ([), individual documents ([[), and metadata (meta). The function length must return the number of documents, and as.list must construct a list holding the documents.
A corpus can have two types of metadata (accessible via meta). Corpus metadata contains corpus specific metadata in form of tag-value pairs. Document level metadata contains document specific metadata but is stored in the corpus as a data frame. Document level metadata is typically used for semantic reasons (e.g., classifications of documents form an entity of their own due to some high-level information like the range of possible values) or for performance reasons (single access instead of extracting metadata of each document).
The function Corpus is a convenience alias to SimpleCorpus or VCorpus, depending on the arguments provided.
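For illustration, a minimal sketch of this dispatch (which class Corpus picks depends on the arguments, as noted above):
docs <- c("First document.", "Second document.")
## with a plain VectorSource, Corpus() can take the SimpleCorpus shortcut
class(Corpus(VectorSource(docs)))
## VCorpus() always constructs a volatile corpus
class(VCorpus(VectorSource(docs)))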
See Also
SimpleCorpus, VCorpus, and PCorpus for the corpora classes provided by package tm.
DCorpus for a distributed corpus class provided by package tm.plugin.dc.
Data Frame Source
Description
Create a data frame source.
Usage
DataframeSource(x)
Arguments
x | A data frame giving the texts and metadata. |
Details
A data frame source interprets each row of the data frame x as a document. The first column must be named "doc_id" and contain a unique string identifier for each document. The second column must be named "text" and contain a UTF-8 encoded string representing the document's content. Optional additional columns are used as document level metadata.
Value
An object inheriting from DataframeSource, SimpleSource, and Source.
See Also
Source for basic information on the source infrastructure employed by package tm, and meta for types of metadata.
readtext for reading in a text in multiple formats suitable to be processed by DataframeSource.
Examples
docs <- data.frame(doc_id = c("doc_1", "doc_2"),
                   text = c("This is a text.", "This another one."),
                   dmeta1 = 1:2, dmeta2 = letters[1:2],
                   stringsAsFactors = FALSE)
(ds <- DataframeSource(docs))
x <- Corpus(ds)
inspect(x)
meta(x)
Directory Source
Description
Create a directory source.
Usage
DirSource(directory = ".", encoding = "", pattern = NULL, recursive = FALSE,
          ignore.case = FALSE, mode = "text")
Arguments
directory | A character vector of full path names; the default corresponds to the working directory. |
encoding | a character string describing the current encoding. It is passed to iconv to convert the input to UTF-8. |
pattern | an optional regular expression. Only file names which matchthe regular expression will be returned. |
recursive | logical. Should the listing recurse into directories? |
ignore.case | logical. Should pattern-matching be case-insensitive? |
mode | a character string specifying if and how files should be read in. Available modes are: "" (no read operation is performed), "binary" (files are read in binary raw mode), and "text" (files are read as text). |
Details
A directory source acquires a list of files via dir and interprets each file as a document.
Value
An object inheriting from DirSource, SimpleSource, and Source.
See Also
Source for basic information on the source infrastructure employed by package tm.
Encoding and iconv on encodings.
Examples
DirSource(system.file("texts", "txt", package = "tm"))
Access Document IDs and Terms
Description
Access the document IDs and terms of a term-document or document-term matrix, and their counts.
Usage
Docs(x)
nDocs(x)
nTerms(x)
Terms(x)
Arguments
x | Either a TermDocumentMatrix or a DocumentTermMatrix. |
Value
For Docs and Terms, a character vector with document IDs and terms, respectively.
For nDocs and nTerms, an integer with the number of document IDs and terms, respectively.
Examples
data("crude")
tdm <- TermDocumentMatrix(crude)[1:10, 1:20]
Docs(tdm)
nDocs(tdm)
nTerms(tdm)
Terms(tdm)
Permanent Corpora
Description
Create permanent corpora.
Usage
PCorpus(x,
        readerControl = list(reader = reader(x), language = "en"),
        dbControl = list(dbName = "", dbType = "DB1"))
Arguments
x | A Source object. |
readerControl | a named list of control parameters for reading in content from x: reader (a function capable of reading in and processing the format delivered by x) and language (a character giving the text's language, preferably as an IETF language tag; defaults to "en"). |
dbControl | a named list of control parameters for the underlying database storage provided by package filehash: dbName (a character giving the filename of the database) and dbType (a character giving a valid database format). |
Details
A permanent corpus stores documents outside of R in a database. Since multiple PCorpus R objects with the same underlying database can exist simultaneously in memory, changes in one get propagated to all corresponding objects (in contrast to the default R semantics).
Value
An object inheriting from PCorpus and Corpus.
See Also
Corpus for basic information on the corpus infrastructure employed by package tm.
VCorpus provides an implementation with volatile storage semantics.
Examples
txt <- system.file("texts", "txt", package = "tm")
## Not run: 
PCorpus(DirSource(txt),
        dbControl = list(dbName = "pcorpus.db", dbType = "DB1"))
## End(Not run)
Plain Text Documents
Description
Create plain text documents.
Usage
PlainTextDocument(x = character(0), author = character(0),
                  datetimestamp = as.POSIXlt(Sys.time(), tz = "GMT"),
                  description = character(0), heading = character(0),
                  id = character(0), language = character(0),
                  origin = character(0), ..., meta = NULL, class = NULL)
Arguments
x | A character string giving the plain text content. |
author | a character string or an object of class person giving the author names. |
datetimestamp | an object of class POSIXt or a character string giving the creation date/time information. |
description | a character string giving a description. |
heading | a character string giving the title or a short heading. |
id | a character string giving a unique identifier. |
language | a character string giving the language (preferably as IETF language tags, see language in package NLP). |
origin | a character string giving information on the source and origin. |
... | user-defined document metadata tag-value pairs. |
meta | a named list or NULL (default) giving all metadata. If set all other metadata arguments are ignored. |
class | a character vector or NULL (default) giving additional classes to be used for the created plain text document. |
Value
An object inheriting from class, PlainTextDocument and TextDocument.
See Also
TextDocument for basic information on the text document infrastructure employed by package tm.
Examples
(ptd <- PlainTextDocument("A simple plain text document",
                          heading = "Plain text document",
                          id = basename(tempfile()),
                          language = "en"))
meta(ptd)
Readers
Description
Creating readers.
Usage
getReaders()
Details
Readers are functions for extracting textual content and metadata out of elements delivered by a Source, and for constructing a TextDocument. A reader must accept the following arguments in its signature:
elem
a named list with the components content and uri (as delivered by a Source via getElem or pGetElem).
language
a character string giving the language.
id
a character giving a unique identifier for the created text document.
The element elem is typically provided by a source whereas the language and the identifier are normally provided by a corpus constructor (for the case that elem$content does not give information on these two essential items).
In case a reader expects configuration arguments we can use a function generator. A function generator is indicated by inheriting from class FunctionGenerator and function. It allows us to process additional arguments, store them in an environment, return a reader function with the well-defined signature described above, and still be able to access the additional arguments via lexical scoping. All corpus constructors in package tm check the reader function for being a function generator and if so apply it to yield the reader with the expected signature.
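For illustration, a minimal sketch of a custom reader built as a function generator (the reader and its prefix argument are hypothetical, used only for demonstration):
myReader <- function(prefix = "DOC: ") {
  ## the returned reader has the well-defined signature and can still
  ## access 'prefix' via lexical scoping
  function(elem, language, id)
    PlainTextDocument(paste0(prefix, elem$content),
                      id = id, language = language)
}
class(myReader) <- c("FunctionGenerator", "function")
## corpus constructors apply the generator to obtain the actual reader
corp <- VCorpus(VectorSource("Some text."),
                readerControl = list(reader = myReader, language = "en"))
content(corp[[1]])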
Value
For getReaders(), a character vector with readers provided by package tm.
See Also
readDOC, readPDF, readPlain, readRCV1, readRCV1asPlain, readReut21578XML, readReut21578XMLasPlain, and readXML.
Simple Corpora
Description
Create simple corpora.
Usage
SimpleCorpus(x, control = list(language = "en"))
Arguments
x | A DataframeSource, DirSource, or VectorSource. |
control | a named list of control parameters: currently, language (a character giving the language, preferably as an IETF language tag; defaults to "en"). |
Details
A simple corpus is fully kept in memory. Compared to a VCorpus, it is optimized for the most common usage scenario: importing plain texts from files in a directory or directly from a vector in R, preprocessing and transforming the texts, and finally exporting them to a term-document matrix. It adheres to the Corpus API. However, it internally takes various shortcuts to boost performance and minimize memory pressure; consequently it operates only under the following constraints:
only DataframeSource, DirSource and VectorSource are supported,
no custom readers, i.e., each document is read in and stored as plain text (as a string, i.e., a character vector of length one),
transformations applied via tm_map must be able to process character vectors and return character vectors (of the same length),
no lazy transformations in tm_map,
no metadata for individual documents (i.e., no "local" in meta).
Value
An object inheriting from SimpleCorpus and Corpus.
See Also
Corpus for basic information on the corpus infrastructure employed by package tm.
VCorpus provides an implementation with volatile storage semantics, and PCorpus provides an implementation with permanent storage semantics.
Examples
txt <- system.file("texts", "txt", package = "tm")
(ovid <- SimpleCorpus(DirSource(txt, encoding = "UTF-8"),
                      control = list(language = "lat")))
Sources
Description
Creating and accessing sources.
Usage
SimpleSource(encoding = "", length = 0, position = 0, reader = readPlain,
             ..., class)
getSources()
## S3 method for class 'SimpleSource'
close(con, ...)
## S3 method for class 'SimpleSource'
eoi(x)
## S3 method for class 'DataframeSource'
getMeta(x)
## S3 method for class 'DataframeSource'
getElem(x)
## S3 method for class 'DirSource'
getElem(x)
## S3 method for class 'URISource'
getElem(x)
## S3 method for class 'VectorSource'
getElem(x)
## S3 method for class 'XMLSource'
getElem(x)
## S3 method for class 'SimpleSource'
length(x)
## S3 method for class 'SimpleSource'
open(con, ...)
## S3 method for class 'DataframeSource'
pGetElem(x)
## S3 method for class 'DirSource'
pGetElem(x)
## S3 method for class 'URISource'
pGetElem(x)
## S3 method for class 'VectorSource'
pGetElem(x)
## S3 method for class 'SimpleSource'
reader(x)
## S3 method for class 'SimpleSource'
stepNext(x)
Arguments
x | A Source. |
con | A Source. |
encoding | a character giving the encoding of the elements delivered bythe source. |
length | a non-negative integer denoting the number of elements delivered by the source. If the length is unknown in advance set it to 0. |
position | a numeric indicating the current position in the source. |
reader | a reader function (generator). |
... | For SimpleSource, tag-value pairs for storing additional information; not used otherwise. |
class | a character vector giving additional classes to be used for thecreated source. |
Details
Sources abstract input locations, like a directory, a connection, or simply an R vector, in order to acquire content in a uniform way. In packages which employ the infrastructure provided by package tm, such sources are represented via the virtual S3 class Source: such packages then provide S3 source classes extending the virtual base class (such as DirSource provided by package tm itself).
All extension classes must provide implementations for the functions close, eoi, getElem, length, open, reader, and stepNext. For parallel element access the (optional) function pGetElem must be provided as well. If document level metadata is available, the (optional) function getMeta must be implemented.
The functions open and close open and close the source, respectively. eoi indicates end of input. getElem fetches the element at the current position, whereas pGetElem retrieves all elements in parallel at once. The function length gives the number of elements. reader returns a default reader for processing elements. stepNext increases the position in the source to acquire the next element.
The function SimpleSource provides a simple reference implementation and can be used when creating custom sources.
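As an illustration, a minimal sketch of a custom source built on top of SimpleSource (the ListSource class and its getElem method are hypothetical):
ListSource <- function(x)
  SimpleSource(length = length(x), content = x, class = "ListSource")
## only getElem needs a dedicated method; the SimpleSource defaults
## cover open, close, eoi, length, reader, and stepNext
getElem.ListSource <- function(x)
  list(content = x$content[[x$position]], uri = NULL)
corp <- VCorpus(ListSource(list("First text.", "Second text.")))
inspect(corp)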
Value
For SimpleSource, an object inheriting from class, SimpleSource, and Source.
For getSources, a character vector with sources provided by package tm.
open and close return the opened and closed source, respectively.
For eoi, a logical indicating if the end of input of the source is reached.
For getElem a named list with the components content holding the document and uri giving a uniform resource identifier (e.g., a file path or URL; NULL if not applicable or unavailable). For pGetElem a list of such named lists.
For length, an integer for the number of elements.
For reader, a function for the default reader.
See Also
DataframeSource, DirSource, URISource, VectorSource, and XMLSource.
Term-Document Matrix
Description
Constructs or coerces to a term-document matrix or a document-term matrix.
Usage
TermDocumentMatrix(x, control = list())
DocumentTermMatrix(x, control = list())
as.TermDocumentMatrix(x, ...)
as.DocumentTermMatrix(x, ...)
Arguments
x | for the constructors, a corpus or an R object from which a corpus can be generated via Corpus(VectorSource(x)); for the coercing functions, other R objects. |
control | a named list of control options. There are local options which are evaluated for each document and global options which are evaluated once for the constructed matrix. Available local options are documented in termFreq and are internally delegated to a termFreq call. This is different for a SimpleCorpus, where all options are processed in a fixed order in one pass to improve performance. Available global options are: bounds (a list with a tag global whose value must be an integer vector of length 2; terms that appear in fewer documents than the lower bound or in more documents than the upper bound are discarded) and weighting (a weighting function capable of handling a TermDocumentMatrix; defaults to weightTf for term frequency weighting). |
... | the additional argument weighting (a WeightFunction) is allowed when coercing a simple triplet matrix to a term-document or document-term matrix. |
Value
An object of class TermDocumentMatrix or class DocumentTermMatrix (both inheriting from a simple triplet matrix in package slam) containing a sparse term-document matrix or document-term matrix. The attribute weighting contains the weighting applied to the matrix.
See Also
termFreq for available local control options.
Examples
data("crude")
tdm <- TermDocumentMatrix(crude,
                          control = list(removePunctuation = TRUE,
                                         stopwords = TRUE))
dtm <- DocumentTermMatrix(crude,
                          control = list(weighting = function(x)
                                           weightTfIdf(x, normalize = FALSE),
                                         stopwords = TRUE))
inspect(tdm[202:205, 1:5])
inspect(tdm[c("price", "prices", "texas"), c("127", "144", "191", "194")])
inspect(dtm[1:5, 273:276])
if(requireNamespace("SnowballC")) {
  s <- SimpleCorpus(VectorSource(unlist(lapply(crude, as.character))))
  m <- TermDocumentMatrix(s,
                          control = list(removeNumbers = TRUE,
                                         stopwords = TRUE,
                                         stemming = TRUE))
  inspect(m[c("price", "texa"), c("127", "144", "191", "194")])
}
Text Documents
Description
Representing and computing on text documents.
Details
Text documents are documents containing (natural language) text. The tm package employs the infrastructure provided by package NLP and represents text documents via the virtual S3 class TextDocument. Actual S3 text document classes then extend the virtual base class (such as PlainTextDocument).
All extension classes must provide an as.character method which extracts the natural language text in documents of the respective classes in a “suitable” (not necessarily structured) form, as well as content and meta methods for accessing the (possibly raw) document content and metadata.
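For example, using these accessors on a plain text document:
doc <- PlainTextDocument("Some text.", id = "doc1", language = "en")
as.character(doc)  # the natural language text
content(doc)       # the (possibly raw) content
meta(doc)          # the document metadata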
See Also
PlainTextDocument, and XMLTextDocument for the text document classes provided by package tm.
TextDocument for text documents in package NLP.
Uniform Resource Identifier Source
Description
Create a uniform resource identifier source.
Usage
URISource(x, encoding = "", mode = "text")
Arguments
x | A character vector of uniform resource identifiers (URIs). |
encoding | A character string describing the current encoding. It is passed to iconv to convert the input to UTF-8. |
mode | a character string specifying if and how URIs should be read in. Available modes are: "" (no read operation is performed), "binary" (URIs are read in binary raw mode), and "text" (URIs are read as text). |
Details
A uniform resource identifier source interprets each URI as a document.
Value
An object inheriting from URISource, SimpleSource, and Source.
See Also
Source for basic information on the source infrastructure employed by package tm.
Encoding and iconv on encodings.
Examples
loremipsum <- system.file("texts", "loremipsum.txt", package = "tm")
ovid <- system.file("texts", "txt", "ovid_1.txt", package = "tm")
us <- URISource(sprintf("file://%s", c(loremipsum, ovid)))
inspect(VCorpus(us))
Volatile Corpora
Description
Create volatile corpora.
Usage
VCorpus(x, readerControl = list(reader = reader(x), language = "en"))
as.VCorpus(x)
Arguments
x | For VCorpus, a Source object; for as.VCorpus, an R object. |
readerControl | a named list of control parameters for reading in content from x: reader (a function capable of reading in and processing the format delivered by x) and language (a character giving the text's language, preferably as an IETF language tag; defaults to "en"). |
Details
A volatile corpus is fully kept in memory and thus all changes only affect the corresponding R object.
Value
An object inheriting from VCorpus and Corpus.
See Also
Corpus for basic information on the corpus infrastructure employed by package tm.
PCorpus provides an implementation with permanent storage semantics.
Examples
reut21578 <- system.file("texts", "crude", package = "tm")
VCorpus(DirSource(reut21578, mode = "binary"),
        list(reader = readReut21578XMLasPlain))
Vector Source
Description
Create a vector source.
Usage
VectorSource(x)
Arguments
x | A vector giving the texts. |
Details
A vector source interprets each element of the vector x as a document.
Value
An object inheriting from VectorSource, SimpleSource, and Source.
See Also
Source for basic information on the source infrastructure employed by package tm.
Examples
docs <- c("This is a text.", "This another one.")
(vs <- VectorSource(docs))
inspect(VCorpus(vs))
Weighting Function
Description
Construct a weighting function for term-document matrices.
Usage
WeightFunction(x, name, acronym)
Arguments
x | A function which takes a TermDocumentMatrix with term frequencies as input, weights the elements, and returns the weighted matrix. |
name | A character naming the weighting function. |
acronym | A character giving an acronym for the name of theweighting function. |
Value
An object of class WeightFunction which extends the class function representing a weighting function.
Examples
weightCutBin <- WeightFunction(function(m, cutoff) m > cutoff,
                               "binary with cutoff", "bincut")
XML Source
Description
Create anXML source.
Usage
XMLSource(x, parser = xml_contents, reader)
Arguments
x | a character giving a uniform resource identifier. |
parser | a function accepting an XML document (as delivered by read_xml in package xml2) and returning a list of XML elements/nodes. |
reader | a function capable of turning XML elements/nodes as returned by parser into a TextDocument. |
Value
An object inheriting from XMLSource, SimpleSource, and Source.
See Also
Source for basic information on the source infrastructure employed by package tm.
Vignette 'Extensions: How to Handle Custom File Formats', and readXML.
XML Text Documents
Description
CreateXML text documents.
Usage
XMLTextDocument(x = xml_missing(), author = character(0),
                datetimestamp = as.POSIXlt(Sys.time(), tz = "GMT"),
                description = character(0), heading = character(0),
                id = character(0), language = character(0),
                origin = character(0), ..., meta = NULL)
Arguments
x | An XML document (as delivered by read_xml in package xml2). |
author | a character or an object of class person giving the author names. |
datetimestamp | an object of class POSIXt or a character string giving the creation date/time information. |
description | a character giving a description. |
heading | a character giving the title or a short heading. |
id | a character giving a unique identifier. |
language | a character giving the language (preferably as IETF language tags, see language in package NLP). |
origin | a character giving information on the source and origin. |
... | user-defined document metadata tag-value pairs. |
meta | a named list or NULL (default) giving all metadata. If set all other metadata arguments are ignored. |
Value
An object inheriting from XMLTextDocument and TextDocument.
See Also
TextDocument for basic information on the text document infrastructure employed by package tm.
Examples
xml <- system.file("extdata", "order-doc.xml", package = "xml2")
(xtd <- XMLTextDocument(xml2::read_xml(xml),
                        heading = "XML text document",
                        id = xml,
                        language = "en"))
content(xtd)
meta(xtd)
ZIP File Source
Description
Create a ZIP file source.
Usage
ZipSource(zipfile, pattern = NULL, recursive = FALSE, ignore.case = FALSE,
          mode = "text")
Arguments
zipfile | A character string with the full path name of a ZIP file. |
pattern | an optional regular expression. Only file names in the ZIP file which match the regular expression will be returned. |
recursive | logical. Should the listing recurse into directories? |
ignore.case | logical. Should pattern-matching be case-insensitive? |
mode | a character string specifying if and how files should be read in. Available modes are: "" (no read operation is performed), "binary" (files are read in binary raw mode), and "text" (files are read as text). |
Details
A ZIP file source extracts a compressed ZIP file via unzip and interprets each file as a document.
Value
An object inheriting from ZipSource, SimpleSource, and Source.
See Also
Source for basic information on the source infrastructure employed by package tm.
Examples
zipfile <- tempfile()
files <- Sys.glob(file.path(system.file("texts", "txt", package = "tm"), "*"))
zip(zipfile, files)
zipfile <- paste0(zipfile, ".zip")
Corpus(ZipSource(zipfile, recursive = TRUE))[[1]]
file.remove(zipfile)
Explore Corpus Term Frequency Characteristics
Description
Explore Zipf's law and Heaps' law, two empirical laws in linguisticsdescribing commonly observed characteristics of term frequencydistributions in corpora.
Usage
Zipf_plot(x, type = "l", ...)
Heaps_plot(x, type = "l", ...)
Arguments
x | a document-term matrix or term-document matrix with unweighted term frequencies. |
type | a character string indicating the type of plot to be drawn, see plot. |
... | further graphical parameters to be used for plotting. |
Details
Zipf's law (e.g., https://en.wikipedia.org/wiki/Zipf%27s_law) states that given some corpus of natural language utterances, the frequency of any word is inversely proportional to its rank in the frequency table, or, more generally, that the pmf of the term frequencies is of the form c k^{-\beta}, where k is the rank of the term (taken from the most to the least frequent one). We can conveniently explore the degree to which the law holds by plotting the logarithm of the frequency against the logarithm of the rank, and inspecting the goodness of fit of a linear model.
Heaps' law (e.g., https://en.wikipedia.org/wiki/Heaps%27_law) states that the vocabulary size V (i.e., the number of different terms employed) grows polynomially with the text size T (the total number of terms in the texts), so that V = c T^\beta. We can conveniently explore the degree to which the law holds by plotting \log(V) against \log(T), and inspecting the goodness of fit of a linear model.
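For illustration, a minimal sketch of such a linear fit for Zipf's law, using col_sums from package slam (on which tm builds):
data("acq")
m <- DocumentTermMatrix(acq)
freq <- sort(slam::col_sums(m), decreasing = TRUE)  # total frequency per term
rank <- seq_along(freq)                             # frequency rank
fit <- lm(log(freq) ~ log(rank))                    # slope estimates -beta
coef(fit)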
Value
The coefficients of the fitted linear model. As a side effect, thecorresponding plot is produced.
Examples
data("acq")
m <- DocumentTermMatrix(acq)
Zipf_plot(m)
Heaps_plot(m)
50 Exemplary News Articles from the Reuters-21578 Data Set of Topic acq
Description
This dataset holds 50 news articles with additional meta information from the Reuters-21578 data set. All documents belong to the topic acq dealing with corporate acquisitions.
Usage
data("acq")
Format
A VCorpus of 50 text documents.
Source
Reuters-21578 Text Categorization Collection Distribution 1.0(XML format).
References
Lewis, David (1997). Reuters-21578 Text Categorization Collection Distribution. UCI Machine Learning Repository. doi:10.24432/C52G6M.
Examples
data("acq")
acq
Content Transformers
Description
Create content transformers, i.e., functions which modify the content of an R object.
Usage
content_transformer(FUN)
Arguments
FUN | a function. |
Value
A function with two arguments:
x
an R object with implemented content getter (content) and setter (content<-) functions.
...
arguments passed over to FUN.
See Also
tm_map for an interface to apply transformations to corpora.
Examples
data("crude")
crude[[1]]
(f <- content_transformer(function(x, pattern) gsub(pattern, "", x)))
tm_map(crude, f, "[[:digit:]]+")[[1]]
20 Exemplary News Articles from the Reuters-21578 Data Set of Topic crude
Description
This data set holds 20 news articles with additional meta information from the Reuters-21578 data set. All documents belong to the topic crude dealing with crude oil.
Usage
data("crude")
Format
A VCorpus of 20 text documents.
Source
Reuters-21578 Text Categorization Collection Distribution 1.0(XML format).
References
Lewis, David (1997). Reuters-21578 Text Categorization Collection Distribution. UCI Machine Learning Repository. doi:10.24432/C52G6M.
Examples
data("crude")
crude
Find Associations in a Term-Document Matrix
Description
Find associations in a document-term or term-document matrix.
Usage
## S3 method for class 'DocumentTermMatrix'
findAssocs(x, terms, corlimit)
## S3 method for class 'TermDocumentMatrix'
findAssocs(x, terms, corlimit)
Arguments
x | A DocumentTermMatrix or a TermDocumentMatrix. |
terms | a character vector holding terms. |
corlimit | a numeric vector (of the same length as terms; recycled otherwise) for the (inclusive) lower correlation limits of each term in the range from zero to one. |
Value
A named list. Each list component is named after a term in terms and contains a named numeric vector. Each vector holds matching terms from x and their rounded correlations satisfying the inclusive lower correlation limit of corlimit.
Examples
data("crude")
tdm <- TermDocumentMatrix(crude)
findAssocs(tdm, c("oil", "opec", "xyz"), c(0.7, 0.75, 0.1))
Find Frequent Terms
Description
Find frequent terms in a document-term or term-document matrix.
Usage
findFreqTerms(x, lowfreq = 0, highfreq = Inf)
Arguments
x | A DocumentTermMatrix or a TermDocumentMatrix. |
lowfreq | A numeric for the lower frequency bound. |
highfreq | A numeric for the upper frequency bound. |
Details
This method works for all numeric weightings but is probably most meaningful for the standard term frequency (tf) weighting of x.
Value
A character vector of terms in x which occur at least lowfreq times and at most highfreq times.
Examples
data("crude")
tdm <- TermDocumentMatrix(crude)
findFreqTerms(tdm, 2, 3)
Find Most Frequent Terms
Description
Find most frequent terms in a document-term or term-document matrix,or a vector of term frequencies.
Usage
findMostFreqTerms(x, n = 6L, ...)
## S3 method for class 'DocumentTermMatrix'
findMostFreqTerms(x, n = 6L, INDEX = NULL, ...)
## S3 method for class 'TermDocumentMatrix'
findMostFreqTerms(x, n = 6L, INDEX = NULL, ...)
Arguments
x | A DocumentTermMatrix or TermDocumentMatrix, or a vector of term frequencies as obtained by termFreq. |
n | A single integer giving the maximal number of terms. |
INDEX | an object specifying a grouping of documents for rollup, or NULL (default) in which case each document is considered individually. |
... | arguments to be passed to or from methods. |
Details
Only terms with positive frequencies are included in the results.
Value
For the document-term or term-document matrix methods, a list with the named frequencies of the up to n most frequent terms occurring in each document (group). Otherwise, a single such vector of most frequent terms.
Examples
data("crude")
## Term frequencies:
tf <- termFreq(crude[[14L]])
findMostFreqTerms(tf)
## Document-term matrices:
dtm <- DocumentTermMatrix(crude)
## Most frequent terms for each document:
findMostFreqTerms(dtm)
## Most frequent terms for the first 10 and the second 10 documents,
## respectively:
findMostFreqTerms(dtm, INDEX = rep(1 : 2, each = 10L))
Read Document-Term Matrices
Description
Read document-term matrices stored in special file formats.
Usage
read_dtm_Blei_et_al(file, vocab = NULL)
read_dtm_MC(file, scalingtype = NULL)
Arguments
file | a character string with the name of the file to read. |
vocab | a character string with the name of a vocabulary file (giving the terms, one per line), or NULL. |
scalingtype | a character string specifying the type of scaling to be used, or NULL (default), in which case the scaling will be inferred from the names of the files with non-zero entries found (see Details). |
Details
read_dtm_Blei_et_al reads the (List of Lists type sparse matrix) format employed by the Latent Dirichlet Allocation and Correlated Topic Model C codes by Blei et al. (http://www.cs.columbia.edu/~blei/).
MC is a toolkit for creating vector models from text documents (seehttps://www.cs.utexas.edu/~dml/software/mc/). It employs avariant of Compressed Column Storage (CCS) sparse matrix format,writing data into several files with suitable names: e.g., a file with‘_dim’ appended to the base file name stores the matrixdimensions. The non-zero entries are stored in a file the name ofwhich indicates the scaling type used: e.g., ‘_tfx_nz’ indicatesscaling by term frequency (‘t’), inverse document frequency(‘f’) and no normalization (‘x’). See ‘README’ in theMC sources for more information.
read_dtm_MC reads such sparse matrix information with argument file giving the path with the base file name.
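For illustration, a hedged sketch reading a small hand-written file in the Blei et al. format (the file contents and vocabulary below are made up):
f <- tempfile()
writeLines(c("2 0:1 3:2", "1 1:5"), f)  # two documents, 0-based term ids
v <- tempfile()
writeLines(c("alpha", "bravo", "charlie", "delta"), v)
dtm <- read_dtm_Blei_et_al(f, vocab = v)
inspect(dtm)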
Value
A document-term matrix.
See Also
read_stm_MC in package slam.
Tokenizers
Description
Predefined tokenizers.
Usage
getTokenizers()
Value
A character vector with tokenizers provided by package tm.
See Also
Boost_tokenizer, MC_tokenizer and scan_tokenizer.
Examples
getTokenizers()
Transformations
Description
Predefined transformations (mappings) which can be used with tm_map.
Usage
getTransformations()
Value
A character vector with transformations provided by package tm.
See Also
removeNumbers, removePunctuation, removeWords, stemDocument, and stripWhitespace.
content_transformer to create custom transformations.
Examples
getTransformations()
Parallelized ‘lapply’
Description
Parallelize applying a function over a list or vector according to theregistered parallelization engine.
Usage
tm_parLapply(X, FUN, ...)
tm_parLapply_engine(new)
Arguments
X | A vector (atomic or list), or other objects suitable for theengine in use. |
FUN | the function to be applied to each element of X. |
... | optional arguments to FUN. |
new | an object inheriting from class cluster as created by makeCluster(), a function with formals X, FUN and ..., or NULL corresponding to the default of no parallelization. |
Details
Parallelization can be employed to speed up some of the embarrassingly parallel computations performed in package tm, specifically tm_index(), tm_map() on a non-lazy-mapped VCorpus, and TermDocumentMatrix() on a VCorpus or PCorpus.
Functions tm_parLapply() and tm_parLapply_engine() can be used to customize parallelization according to the available resources.
tm_parLapply_engine() is used for getting (with no arguments) or setting (with argument new) the parallelization engine employed (see below for examples).
If an engine is set to an object inheriting from class cluster, tm_parLapply() calls parLapply() with this cluster and the given arguments. If set to a function, tm_parLapply() calls the function with the given arguments. Otherwise, it simply calls lapply().
Hence, parallelization via parLapply() and a default cluster registered via setDefaultCluster() can be achieved via
tm_parLapply_engine(function(X, FUN, ...) parallel::parLapply(NULL, X, FUN, ...))
or re-registering the cluster, say cl, using
tm_parLapply_engine(cl)
(note that since R version 3.5.0, one can use getDefaultCluster() to get the registered default cluster). Using
tm_parLapply_engine(function(X, FUN, ...) parallel::parLapplyLB(NULL, X, FUN, ...))
or
tm_parLapply_engine(function(X, FUN, ...) parallel::parLapplyLB(cl, X, FUN, ...))
gives load-balancing parallelization with the registered default or given cluster, respectively. To achieve parallelization via forking (on Unix-alike platforms), one can use the above with clusters created by makeForkCluster(), or use
tm_parLapply_engine(parallel::mclapply)
or
tm_parLapply_engine(function(X, FUN, ...) parallel::mclapply(X, FUN, ..., mc.cores = n))
to use mclapply() with the default or given number n of cores.
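Putting it together, a minimal sketch of registering a cluster for building a term-document matrix (cluster size chosen arbitrarily):
## Not run: 
cl <- parallel::makeCluster(2)
tm_parLapply_engine(cl)
data("crude")
tdm <- TermDocumentMatrix(crude)  # computed via parLapply() on cl
tm_parLapply_engine(NULL)         # back to plain lapply()
parallel::stopCluster(cl)
## End(Not run)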
Value
A list the length of X, with the result of applying FUN together with the ... arguments to each element of X.
See Also
makeCluster(), parLapply(), parLapplyLB(), and mclapply().
Inspect Objects
Description
Inspect, i.e., display detailed information on a corpus, aterm-document matrix, or a text document.
Usage
## S3 method for class 'PCorpus'
inspect(x)
## S3 method for class 'VCorpus'
inspect(x)
## S3 method for class 'TermDocumentMatrix'
inspect(x)
## S3 method for class 'TextDocument'
inspect(x)
Arguments
x | Either a corpus, a term-document matrix, or a text document. |
Examples
data("crude")
inspect(crude[1:3])
inspect(crude[[1]])
tdm <- TermDocumentMatrix(crude)[1:10, 1:10]
inspect(tdm)
Metadata Management
Description
Accessing and modifying metadata of text documents and corpora.
Usage
## S3 method for class 'PCorpus'
meta(x, tag = NULL, type = c("indexed", "corpus", "local"), ...)
## S3 replacement method for class 'PCorpus'
meta(x, tag, type = c("indexed", "corpus", "local"), ...) <- value
## S3 method for class 'SimpleCorpus'
meta(x, tag = NULL, type = c("indexed", "corpus"), ...)
## S3 replacement method for class 'SimpleCorpus'
meta(x, tag, type = c("indexed", "corpus"), ...) <- value
## S3 method for class 'VCorpus'
meta(x, tag = NULL, type = c("indexed", "corpus", "local"), ...)
## S3 replacement method for class 'VCorpus'
meta(x, tag, type = c("indexed", "corpus", "local"), ...) <- value
## S3 method for class 'PlainTextDocument'
meta(x, tag = NULL, ...)
## S3 replacement method for class 'PlainTextDocument'
meta(x, tag = NULL, ...) <- value
## S3 method for class 'XMLTextDocument'
meta(x, tag = NULL, ...)
## S3 replacement method for class 'XMLTextDocument'
meta(x, tag = NULL, ...) <- value
DublinCore(x, tag = NULL)
DublinCore(x, tag) <- value
Arguments
x | For DublinCore, a TextDocument; for meta, a TextDocument or a corpus. |
tag | a character giving the name of a metadatum. No tag corresponds to all available metadata. |
type | a character specifying the kind of corpus metadata (see Details). |
... | Not used. |
value | replacement value. |
Details
A corpus has two types of metadata. Corpus metadata ("corpus") contains corpus specific metadata in form of tag-value pairs. Document level metadata ("indexed") contains document specific metadata but is stored in the corpus as a data frame. Document level metadata is typically used for semantic reasons (e.g., classifications of documents form an entity of their own due to some high-level information like the range of possible values) or for performance reasons (single access instead of extracting metadata of each document). The latter can be seen as a form of indexing, hence the name "indexed". Document metadata ("local") are tag-value pairs directly stored locally at the individual documents.
DublinCore is a convenience wrapper to access and modify the metadata of a text document using the Simple Dublin Core schema (supporting the 15 metadata elements from the Dublin Core Metadata Element Set, https://dublincore.org/documents/dces/).
References
Dublin Core Metadata Initiative. https://dublincore.org/
See Also
meta for metadata in package NLP.
Examples
data("crude")
meta(crude[[1]])
DublinCore(crude[[1]])
meta(crude[[1]], tag = "topics")
meta(crude[[1]], tag = "comment") <- "A short comment."
meta(crude[[1]], tag = "topics") <- NULL
DublinCore(crude[[1]], tag = "creator") <- "Ano Nymous"
DublinCore(crude[[1]], tag = "format") <- "XML"
DublinCore(crude[[1]])
meta(crude[[1]])
meta(crude)
meta(crude, type = "corpus")
meta(crude, "labels") <- 21:40
meta(crude)
Visualize a Term-Document Matrix
Description
Visualize correlations between terms of a term-document matrix.
Usage
## S3 method for class 'TermDocumentMatrix'
plot(x, terms = sample(Terms(x), 20), corThreshold = 0.7, weighting = FALSE,
     attrs = list(graph = list(rankdir = "BT"),
                  node = list(shape = "rectangle", fixedsize = FALSE)),
     ...)
Arguments
x | A term-document matrix. |
terms | Terms to be plotted. Defaults to 20 randomly chosen terms of the term-document matrix. |
corThreshold | Do not plot correlations below this threshold. Defaults to 0.7. |
weighting | Define whether the line width corresponds to the correlation. |
attrs | Argument passed to the plot method for class graphNEL. |
... | Other arguments passed to the graphNEL plot method. |
Details
Visualization requires that Bioconductor software package Rgraphviz is installed.
Examples
## Not run: 
data(crude)
tdm <- TermDocumentMatrix(crude,
                          control = list(removePunctuation = TRUE,
                                         removeNumbers = TRUE,
                                         stopwords = TRUE))
plot(tdm, corThreshold = 0.2, weighting = TRUE)
## End(Not run)
Read In a MS Word Document
Description
Return a function which reads in a Microsoft Word document extractingits text.
Usage
readDOC(engine = c("antiword", "executable"), AntiwordOptions = "")
Arguments
engine | a character string for the preferred DOC extraction engine (see Details). |
AntiwordOptions | Options passed over to the antiword executable. |
Details
Formally this function is a function generator, i.e., it returns a function (which reads in a text document) with a well-defined signature, but can access passed over arguments (e.g., options to antiword) via lexical scoping.
Available DOC extraction engines are as follows.
"antiword"
(default) Antiword utility as provided by the function antiword in package antiword.
"executable"
command line antiword executable which must be installed and accessible on your system. This can convert documents from Microsoft Word version 2, 6, 7, 97, 2000, 2002 and 2003 to plain text. The character vector AntiwordOptions is passed over to the executable.
Value
A function with the following formals:
elem
a list with the named component uri which must hold a valid file name.
language
a string giving the language.
id
Not used.
The function returns a PlainTextDocument representing the text and metadata extracted from elem$uri.
See Also
Reader for basic information on the reader infrastructure employed by package tm.
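Examples
A hedged sketch (the file name ‘mydoc.doc’ is hypothetical and the antiword engine must be available):
## Not run: 
reader <- readDOC()
doc <- reader(elem = list(uri = "mydoc.doc"), language = "en", id = "id1")
content(doc)
## End(Not run)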
Read In a Text Document from a Data Frame
Description
Read in a text document from a row in a data frame.
Usage
readDataframe(elem, language, id)
Arguments
elem | a named list with the component content which must hold the document to be read in. |
language | a string giving the language. |
id | Not used. |
Value
A PlainTextDocument representing elem$content.
See Also
Reader for basic information on the reader infrastructure employed by package tm.
Examples
docs <- data.frame(doc_id = c("doc_1", "doc_2"),
                   text = c("This is a text.", "This another one."),
                   stringsAsFactors = FALSE)
ds <- DataframeSource(docs)
elem <- getElem(stepNext(ds))
result <- readDataframe(elem, "en", NULL)
inspect(result)
meta(result)
Read In a PDF Document
Description
Return a function which reads in a portable document format (PDF)document extracting both its text and its metadata.
Usage
readPDF(engine = c("pdftools", "xpdf", "Rpoppler", "ghostscript", "Rcampdf",
                   "custom"),
        control = list(info = NULL, text = NULL))
Arguments
engine | a character string for the preferred PDF extraction engine (see Details). |
control | a list of control options for the engine with the named components info and text (see Details). |
Details
Formally this function is a function generator, i.e., it returns a function (which reads in a text document) with a well-defined signature, but can access passed over arguments (e.g., the preferred PDF extraction engine and control options) via lexical scoping.
Available PDF extraction engines are as follows.
"pdftools"
(default) Poppler PDF rendering library as provided by the functions pdf_info and pdf_text in package pdftools.
"xpdf"
command line pdfinfo and pdftotext executables which must be installed and accessible on your system. Suitable utilities are provided by the Xpdf (http://www.xpdfreader.com/) PDF viewer or by the Poppler (https://poppler.freedesktop.org/) PDF rendering library.
"Rpoppler"
Poppler PDF rendering library as provided by the functions PDF_info and PDF_text in package Rpoppler.
"ghostscript"
Ghostscript using ‘pdf_info.ps’ and ‘ps2ascii.ps’.
"Rcampdf"
Perl CAM::PDF PDF manipulation library as provided by the functions pdf_info and pdf_text in package Rcampdf, available from the repository at http://datacube.wu.ac.at.
"custom"
custom user-provided extraction engine.
Control parameters for engine "xpdf" are as follows.
info
a character vector specifying options passed over to the pdfinfo executable.
text
a character vector specifying options passed over to the pdftotext executable.
Control parameters for engine "custom" are as follows.
info
a function extracting metadata from a PDF. The function must accept a file path as first argument and must return a named list with the components Author (as character string), CreationDate (of class POSIXlt), Subject (as character string), Title (as character string), and Creator (as character string).
text
a function extracting content from a PDF. The function must accept a file path as first argument and must return a character vector.
Value
A function with the following formals:
elem
a named list with the component uri which must hold a valid file name.
language
a string giving the language.
id
Not used.
The function returns a PlainTextDocument representing the text and metadata extracted from elem$uri.
See Also
Reader for basic information on the reader infrastructure employed by package tm.
Examples
uri <- paste0("file://",
              system.file(file.path("doc", "tm.pdf"), package = "tm"))
engine <- if(nzchar(system.file(package = "pdftools"))) {
  "pdftools"
} else {
  "ghostscript"
}
reader <- readPDF(engine)
pdf <- reader(elem = list(uri = uri), language = "en", id = "id1")
cat(content(pdf)[1])
VCorpus(URISource(uri, mode = ""),
        readerControl = list(reader = readPDF(engine = "ghostscript")))
Read In a Text Document
Description
Read in a text document without knowledge about its internal structure andpossible available metadata.
Usage
readPlain(elem, language, id)
Arguments
elem | a named list with the component content which must hold the document to be read in. |
language | a string giving the language. |
id | a character giving a unique identifier for the created textdocument. |
Value
A PlainTextDocument representing elem$content. The argument id is used as fallback if elem$uri is null.
See Also
Reader for basic information on the reader infrastructure employed by package tm.
Examples
docs <- c("This is a text.", "This another one.")
vs <- VectorSource(docs)
elem <- getElem(stepNext(vs))
(result <- readPlain(elem, "en", "id1"))
meta(result)
Read In a Reuters Corpus Volume 1 Document
Description
Read in a Reuters Corpus Volume 1 XML document.
Usage
readRCV1(elem, language, id)
readRCV1asPlain(elem, language, id)
Arguments
elem | a named list with the component content which must hold the document to be read in. |
language | a string giving the language. |
id | Not used. |
Value
An XMLTextDocument for readRCV1, or a PlainTextDocument for readRCV1asPlain, representing the text and metadata extracted from elem$content.
References
Lewis, D. D.; Yang, Y.; Rose, T.; and Li, F. (2004). RCV1: A New Benchmark Collection for Text Categorization Research. Journal of Machine Learning Research, 5, 361–397. https://www.jmlr.org/papers/volume5/lewis04a/lewis04a.pdf
See Also
Reader for basic information on the reader infrastructure employed by package tm.
Examples
f <- system.file("texts", "rcv1_2330.xml", package = "tm")
f_bin <- readBin(f, raw(), file.size(f))
rcv1 <- readRCV1(elem = list(content = f_bin), language = "en", id = "id1")
content(rcv1)
meta(rcv1)
Read In a Reuters-21578 XML Document
Description
Read in a Reuters-21578 XML document.
Usage
readReut21578XML(elem, language, id)
readReut21578XMLasPlain(elem, language, id)
Arguments
elem | a named list with the component content which must hold the document to be read in. |
language | a string giving the language. |
id | Not used. |
Value
An XMLTextDocument for readReut21578XML, or a PlainTextDocument for readReut21578XMLasPlain, representing the text and metadata extracted from elem$content.
References
Lewis, David (1997). Reuters-21578 Text Categorization Collection Distribution. UCI Machine Learning Repository. doi:10.24432/C52G6M.
See Also
Reader for basic information on the reader infrastructure employed by package tm.
Read In a POS-Tagged Word Text Document
Description
Return a function which reads in a text document containing POS-tagged words.
Usage
readTagged(...)
Arguments
... | Arguments passed to TaggedTextDocument. |
Details
Formally this function is a function generator, i.e., it returns a function (which reads in a text document) with a well-defined signature, but can access passed over arguments (...) via lexical scoping.
Value
A function with the following formals:
elem
a named list with the component content which must hold the document to be read in or the component uri holding a connection object or a character string.
language
a string giving the language.
id
a character giving a unique identifier for the created text document.
The function returns a TaggedTextDocument representing the text and metadata extracted from elem$content or elem$uri. The argument id is used as fallback if elem$uri is null.
See Also
Reader for basic information on the reader infrastructure employed by package tm.
Examples
# See http://www.nltk.org/book/ch05.html or file ca01 in the Brown corpus
x <- paste("The/at grand/jj jury/nn commented/vbd on/in a/at number/nn of/in",
           "other/ap topics/nns ,/, among/in them/ppo the/at Atlanta/np and/cc",
           "Fulton/np-tl County/nn-tl purchasing/vbg departments/nns which/wdt",
           "it/pps said/vbd ``/`` are/ber well/ql operated/vbn and/cc follow/vb",
           "generally/rb accepted/vbn practices/nns which/wdt inure/vb to/in the/at",
           "best/jjt interest/nn of/in both/abx governments/nns ''/'' ./.")
vs <- VectorSource(x)
elem <- getElem(stepNext(vs))
(doc <- readTagged()(elem, language = "en", id = "id1"))
tagged_words(doc)
Read In an XML Document
Description
Return a function which reads in an XML document. The structure of the XML document is described with a specification.
Usage
readXML(spec, doc)
Arguments
spec | A named list of lists each containing two components. The constructed reader will map each list entry to the content or metadatum of the text document as specified by the named list entry. Valid names include content to access the document's content, and character strings which are mapped to metadata entries. Each list entry must consist of two components: the first must be a string describing the type of the second argument, and the second is the specification entry. Valid combinations are: "node" with an XPath expression extracting information from an XML node, "attribute" with an XPath expression extracting information from an XML attribute, "function" with a function which is called with the read in XML document as first argument, and "unevaluated" with a character vector which is returned without modification. |
doc | An (empty) document of some subclass of TextDocument. |
Details
Formally this function is a function generator, i.e., it returns a function (which reads in a text document) with a well-defined signature, but can access passed over arguments (e.g., the specification) via lexical scoping.
Value
A function with the following formals:
elem
a named list with the component content which must hold the document to be read in.
language
a string giving the language.
id
a character giving a unique identifier for the created text document.
The function returns doc augmented by the parsed information as described by spec out of the XML file in elem$content. The arguments language and id are used as fallback: language if no corresponding metadata entry is found in elem$content, and id if no corresponding metadata entry is found in elem$content and if elem$uri is null.
See Also
Reader for basic information on the reader infrastructure employed by package tm.
Vignette 'Extensions: How to Handle Custom File Formats', and XMLSource.
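Examples
A hedged sketch of a custom XML reader (the file name, element names, and XPath expressions are hypothetical; see the vignette for a worked treatment):
## Not run: 
myXMLReader <- readXML(
    spec = list(author  = list("node", "/doc/author"),
                content = list("node", "/doc/text"),
                heading = list("unevaluated", "Hypothetical XML document")),
    doc = PlainTextDocument())
corp <- VCorpus(XMLSource("mydoc.xml", reader = myXMLReader))
## End(Not run)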
Remove Numbers from a Text Document
Description
Remove numbers from a text document.
Usage
## S3 method for class 'character'
removeNumbers(x, ucp = FALSE, ...)
## S3 method for class 'PlainTextDocument'
removeNumbers(x, ...)
Arguments
x | a character vector or text document. |
ucp | a logical specifying whether to use Unicode character properties for determining digit characters. If FALSE (default), characters in the ASCII [:digit:] class (i.e., the decimal digits from 0 to 9) are used; if TRUE, the characters with Unicode general category Nd (Decimal_Number). |
... | arguments to be passed to or from methods; in particular, from the PlainTextDocument method passed over to the character method. |
Value
The text document without numbers.
See Also
getTransformations to list available transformation (mapping) functions.
https://unicode.org/reports/tr44/#General_Category_Values.
Examples
data("crude")
crude[[1]]
removeNumbers(crude[[1]])
Remove Punctuation Marks from a Text Document
Description
Remove punctuation marks from a text document.
Usage
## S3 method for class 'character'
removePunctuation(x, preserve_intra_word_contractions = FALSE,
                  preserve_intra_word_dashes = FALSE, ucp = FALSE, ...)
## S3 method for class 'PlainTextDocument'
removePunctuation(x, ...)
Arguments
x | a character vector or text document. |
preserve_intra_word_contractions | a logical specifying whetherintra-word contractions should be kept. |
preserve_intra_word_dashes | a logical specifying whetherintra-word dashes should be kept. |
ucp | a logical specifying whether to use Unicode character properties for determining punctuation characters. If FALSE (default), characters in the ASCII [:punct:] class are used; if TRUE, the characters with Unicode general category P (Punctuation). |
... | arguments to be passed to or from methods; in particular, from the PlainTextDocument method passed over to the character method. |
Value
The character or text document x without punctuation marks (besides intra-word contractions (‘'’) and intra-word dashes (‘-’) if preserve_intra_word_contractions and preserve_intra_word_dashes are set, respectively).
See Also
getTransformations to list available transformation (mapping) functions.
regex shows the class [:punct:] of punctuation characters.
https://unicode.org/reports/tr44/#General_Category_Values.
Examples
data("crude")
inspect(crude[[14]])
inspect(removePunctuation(crude[[14]]))
inspect(removePunctuation(crude[[14]],
                          preserve_intra_word_contractions = TRUE,
                          preserve_intra_word_dashes = TRUE))
Remove Sparse Terms from a Term-Document Matrix
Description
Remove sparse terms from a document-term or term-document matrix.
Usage
removeSparseTerms(x, sparse)
Arguments
x | A DocumentTermMatrix or a TermDocumentMatrix. |
sparse | A numeric for the maximal allowed sparsity, in the open interval from zero to one. |
Value
A term-document matrix where those terms from x are removed which have at least a sparse percentage of empty elements (i.e., documents in which the term occurs 0 times). That is, the resulting matrix contains only terms with a sparse factor smaller than sparse.
Examples
data("crude")
tdm <- TermDocumentMatrix(crude)
removeSparseTerms(tdm, 0.2)
Remove Words from a Text Document
Description
Remove words from a text document.
Usage
## S3 method for class 'character'
removeWords(x, words)
## S3 method for class 'PlainTextDocument'
removeWords(x, ...)
Arguments
x | A character or text document. |
words | A character vector giving the words to be removed. |
... | passed over argument words. |
Value
The character or text document without the specified words.
See Also
getTransformations to list available transformation (mapping) functions.
remove_stopwords provided by package tau.
Examples
data("crude")
crude[[1]]
removeWords(crude[[1]], stopwords("english"))
Complete Stems
Description
Heuristically complete stemmed words.
Usage
stemCompletion(x, dictionary,
               type = c("prevalent", "first", "longest", "none", "random",
                        "shortest"))
Arguments
x | A character vector of stems to be completed. |
dictionary | A Corpus or character vector to be searched for possible completions. |
type | A character naming the completion heuristics. Available heuristics are: prevalent (default; takes the most frequent match as completion), first (takes the first found completion), longest (takes the longest completion in terms of characters), none (is the identity), random (takes some completion), and shortest (takes the shortest completion in terms of characters). |
Value
A character vector with completed words.
References
Ingo Feinerer (2010). Analysis and Algorithms for Stemming Inversion. Information Retrieval Technology — 6th Asia Information Retrieval Societies Conference, AIRS 2010, Taipei, Taiwan, December 1–3, 2010. Proceedings, volume 6458 of Lecture Notes in Computer Science, pages 290–299. Springer-Verlag, December 2010.
Examples
data("crude")
stemCompletion(c("compan", "entit", "suppl"), crude)
Stem Words
Description
Stem words in a text document using Porter's stemming algorithm.
Usage
## S3 method for class 'character'
stemDocument(x, language = "english")
## S3 method for class 'PlainTextDocument'
stemDocument(x, language = meta(x, "language"))
Arguments
x | A character vector or text document. |
language | A string giving the language for stemming. |
Details
Stemming requires that package SnowballC is installed. The argument language is passed over to wordStem as the name of the Snowball stemmer.
Examples
data("crude")
inspect(crude[[1]])
if(requireNamespace("SnowballC")) {
  inspect(stemDocument(crude[[1]]))
}
Stopwords
Description
Return various kinds of stopwords with support for differentlanguages.
Usage
stopwords(kind = "en")
Arguments
kind | A character string identifying the desired stopword list. |
Details
Available stopword lists are:
catalan
Catalan stopwords (obtained from http://latel.upf.edu/morgana/altres/pub/ca_stop.htm),
romanian
Romanian stopwords (extracted from http://snowball.tartarus.org/otherapps/romanian/romanian1.tgz),
SMART
English stopwords from the SMART information retrieval system (as documented in Appendix 11 of https://jmlr.csail.mit.edu/papers/volume5/lewis04a/) (which coincides with the stopword list used by the MC toolkit (https://www.cs.utexas.edu/~dml/software/mc/)),
and a set of stopword lists from the Snowball stemmer project in different languages (obtained from ‘http://svn.tartarus.org/snowball/trunk/website/algorithms/*/stop.txt’). Supported languages are danish, dutch, english, finnish, french, german, hungarian, italian, norwegian, portuguese, russian, spanish, and swedish. Language names are case sensitive. Alternatively, their IETF language tags may be used.
Value
A character vector containing the requested stopwords. An error is raised if no stopwords are available for the requested kind.
Examples
stopwords("en")
stopwords("SMART")
stopwords("german")
Strip Whitespace from a Text Document
Description
Strip extra whitespace from a text document. Multiple whitespacecharacters are collapsed to a single blank.
Usage
## S3 method for class 'PlainTextDocument'
stripWhitespace(x, ...)
Arguments
x | A text document. |
... | Not used. |
Value
The text document with multiple whitespace characters collapsed to asingle blank.
See Also
getTransformations to list available transformation (mapping)functions.
Examples
data("crude")
crude[[1]]
stripWhitespace(crude[[1]])
Term Frequency Vector
Description
Generate a term frequency vector from a text document.
Usage
termFreq(doc, control = list())
Arguments
doc | An object inheriting from TextDocument or a character vector. |
control | A list of control options which override default settings. First, the following two options are processed: tokenize (a function tokenizing a TextDocument into single tokens, or a string naming a predefined tokenizer) and tolower (either a logical value indicating whether characters should be translated to lower case or a custom translation function). Next, a set of options which are sensitive to the order of occurrence in the control list: language, removePunctuation, removeNumbers, and stopwords. Finally, the following options are processed in the given order: stemming, dictionary, bounds, and wordLengths. |
Value
A table of class c("term_frequency", "integer") with term frequencies as values and tokens as names.
See Also
Examples
data("crude")
termFreq(crude[[14]])
if(requireNamespace("SnowballC")) {
  strsplit_space_tokenizer <- function(x)
    unlist(strsplit(as.character(x), "[[:space:]]+"))
  ctrl <- list(tokenize = strsplit_space_tokenizer,
               removePunctuation = list(preserve_intra_word_dashes = TRUE),
               stopwords = c("reuter", "that"),
               stemming = TRUE,
               wordLengths = c(4, Inf))
  termFreq(crude[[14]], control = ctrl)
}
Combine Corpora, Documents, Term-Document Matrices, and Term Frequency Vectors
Description
Combine several corpora into a single one, combine multipledocuments into a corpus, combine multiple term-document matricesinto a single one, or combine multiple term frequency vectors into asingle term-document matrix.
Usage
## S3 method for class 'VCorpus'
c(..., recursive = FALSE)
## S3 method for class 'TextDocument'
c(..., recursive = FALSE)
## S3 method for class 'TermDocumentMatrix'
c(..., recursive = FALSE)
## S3 method for class 'term_frequency'
c(..., recursive = FALSE)
Arguments
... | Corpora, text documents, term-document matrices, or termfrequency vectors. |
recursive | Not used. |
See Also
VCorpus, TextDocument, TermDocumentMatrix, and termFreq.
Examples
data("acq")
data("crude")
meta(acq, "comment", type = "corpus") <- "Acquisitions"
meta(crude, "comment", type = "corpus") <- "Crude oil"
meta(acq, "acqLabels") <- 1:50
meta(acq, "jointLabels") <- 1:50
meta(crude, "crudeLabels") <- letters[1:20]
meta(crude, "jointLabels") <- 1:20
c(acq, crude)
meta(c(acq, crude), type = "corpus")
meta(c(acq, crude))
c(acq[[30]], crude[[10]])
c(TermDocumentMatrix(acq), TermDocumentMatrix(crude))
Filter and Index Functions on Corpora
Description
Interface to apply filter and index functions to corpora.
Usage
## S3 method for class 'PCorpus'
tm_filter(x, FUN, ...)
## S3 method for class 'SimpleCorpus'
tm_filter(x, FUN, ...)
## S3 method for class 'VCorpus'
tm_filter(x, FUN, ...)
## S3 method for class 'PCorpus'
tm_index(x, FUN, ...)
## S3 method for class 'SimpleCorpus'
tm_index(x, FUN, ...)
## S3 method for class 'VCorpus'
tm_index(x, FUN, ...)
Arguments
x | A corpus. |
FUN | a filter function taking a text document or a string (if x is a SimpleCorpus) as input and returning a logical value. |
... | arguments to FUN. |
Value
tm_filter returns a corpus containing documents where FUN matches, whereas tm_index only returns the corresponding indices.
Examples
data("crude")
# Full-text search
tm_filter(crude, FUN = function(x) any(grep("co[m]?pany", content(x))))
Transformations on Corpora
Description
Interface to apply transformation functions (also denoted as mappings)to corpora.
Usage
## S3 method for class 'PCorpus'
tm_map(x, FUN, ...)
## S3 method for class 'SimpleCorpus'
tm_map(x, FUN, ...)
## S3 method for class 'VCorpus'
tm_map(x, FUN, ..., lazy = FALSE)
Arguments
x | A corpus. |
FUN | a transformation function taking a text document (a character vector when x is a SimpleCorpus) as input and returning a text document (a character vector of the same length for a SimpleCorpus). |
... | arguments to FUN. |
lazy | a logical. Lazy mappings are mappings which are delayed until the content is accessed. It is useful for large corpora if only a few documents will be accessed. In such a case it avoids the computationally expensive application of the mapping to all elements in the corpus. |
Value
A corpus with FUN applied to each document in x. In case of lazy mappings only internal flags are set. Access of individual documents triggers the execution of the corresponding transformation function.
Note
Lazy transformations change R's standard evaluation semantics.
See Also
getTransformations for available transformations.
Examples
data("crude")
## Document access triggers the stemming function
## (i.e., all other documents are not stemmed yet)
if(requireNamespace("SnowballC")) {
  tm_map(crude, stemDocument, lazy = TRUE)[[1]]
}
## Use wrapper to apply character processing function
tm_map(crude, content_transformer(tolower))
## Generate a custom transformation function which takes the heading as new content
headings <- function(x)
  PlainTextDocument(meta(x, "heading"),
                    id = meta(x, "id"),
                    language = meta(x, "language"))
inspect(tm_map(crude, headings))
Combine Transformations
Description
Fold multiple transformations (mappings) into a single one.
Usage
tm_reduce(x, tmFuns, ...)
Arguments
x | A corpus. |
tmFuns | A list of tm transformations. |
... | Arguments to the individual transformations. |
Value
A single tm transformation function obtained by folding tmFuns from right to left (via Reduce(..., right = TRUE)).
See Also
Reduce for R's internal folding/accumulation mechanism, and getTransformations to list available transformation (mapping) functions.
Examples
data(crude)
crude[[1]]
skipWords <- function(x) removeWords(x, c("it", "the"))
funs <- list(stripWhitespace, skipWords, removePunctuation, content_transformer(tolower))
tm_map(crude, FUN = tm_reduce, tmFuns = funs)[[1]]

Compute Score for Matching Terms
Description
Compute a score based on the number of matching terms.
Usage
## S3 method for class 'DocumentTermMatrix'
tm_term_score(x, terms, FUN = row_sums)

## S3 method for class 'PlainTextDocument'
tm_term_score(x, terms, FUN = function(x) sum(x, na.rm = TRUE))

## S3 method for class 'term_frequency'
tm_term_score(x, terms, FUN = function(x) sum(x, na.rm = TRUE))

## S3 method for class 'TermDocumentMatrix'
tm_term_score(x, terms, FUN = col_sums)

Arguments
x | Either a DocumentTermMatrix, a PlainTextDocument, a term frequency vector (as returned by termFreq), or a TermDocumentMatrix. |
terms | A character vector of terms to be matched. |
FUN | A function computing a score from the number of terms matching in x. |
Value
A score as computed by FUN from the number of matching terms in x.
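As a small usage sketch (with tm attached), scoring a document-term matrix, where FUN defaults to row_sums and thus yields one score per document:

data("acq")
dtm <- DocumentTermMatrix(acq)
## Per-document counts of the given terms.
head(tm_term_score(dtm, c("company", "shares")))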
Examples
data("acq")tm_term_score(acq[[1]], c("company", "change"))## Not run: ## Test for positive and negative sentiments## install.packages("tm.lexicon.GeneralInquirer", repos="http://datacube.wu.ac.at", type="source")require("tm.lexicon.GeneralInquirer")sapply(acq[1:10], tm_term_score, terms_in_General_Inquirer_categories("Positiv"))sapply(acq[1:10], tm_term_score, terms_in_General_Inquirer_categories("Negativ"))tm_term_score(TermDocumentMatrix(acq[1:10], control = list(removePunctuation = TRUE)), terms_in_General_Inquirer_categories("Positiv"))## End(Not run)Tokenizers
Description
Tokenize a document or character vector.
Usage
Boost_tokenizer(x)
MC_tokenizer(x)
scan_tokenizer(x)

Arguments
x | A character vector, or an object that can be coerced to character by as.character. |
Details
The quality and correctness of a tokenization algorithm highly depend on the context and application scenario. Relevant factors are the language of the underlying text and the notions of whitespace (which can vary with the used encoding and the language) and punctuation marks. Consequently, for superior results you probably need a custom tokenization function; see the sketch after the list of built-in tokenizers below.
- Boost_tokenizer
Uses the Boost (https://www.boost.org) Tokenizer (via Rcpp).
- MC_tokenizer
Implements the functionality of the tokenizer in the MC toolkit (https://www.cs.utexas.edu/~dml/software/mc/).
- scan_tokenizer
Simulates scan(..., what = "character").
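As hinted at above, a custom tokenizer is often the most robust choice. The following sketch (a hypothetical helper, not part of tm) splits on whitespace and strips surrounding punctuation, so that tokens like "can't" or "crude-oil" survive intact:

custom_tokenizer <- function(x)
    gsub("^[[:punct:]]+|[[:punct:]]+$", "",
         unlist(strsplit(as.character(x), "[[:space:]]+")))
custom_tokenizer("Crude-oil prices, they said, can't fall.")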
Value
A character vector consisting of tokens obtained by tokenization of x.
See Also
getTokenizers to list tokenizers provided by package tm.
Regexp_Tokenizer for tokenizers using regular expressions provided by package NLP.
tokenize for a simple regular expression based tokenizer provided by package tau.
tokenizers for a collection of tokenizers provided by package tokenizers.
Examples
data("crude")Boost_tokenizer(crude[[1]])MC_tokenizer(crude[[1]])scan_tokenizer(crude[[1]])strsplit_space_tokenizer <- function(x) unlist(strsplit(as.character(x), "[[:space:]]+"))strsplit_space_tokenizer(crude[[1]])Weight Binary
Description
Binary weight a term-document matrix.
Usage
weightBin(m)

Arguments
m | A TermDocumentMatrix in term frequency format. |
Details
Formally this function is of class WeightingFunction with the additional attributes name and acronym.
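A short sketch of the effect (with tm attached; this assumes the slam triplet representation of term-document matrices, where v holds the nonzero entries):

data("crude")
b <- weightBin(TermDocumentMatrix(crude))
## After binary weighting every nonzero entry equals 1,
## i.e., the matrix records mere term presence.
all(b$v == 1)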
Value
The weighted matrix.
SMART Weightings
Description
Weight a term-document matrix according to a combination of weights specified in SMART notation.
Usage
weightSMART(m, spec = "nnn", control = list())

Arguments
m | A TermDocumentMatrix in term frequency format. |
spec | a character string consisting of three characters. The first letter specifies a term frequency schema, the second a document frequency schema, and the third a normalization schema. See Details for available built-in schemata. |
control | a list of control parameters. See Details. |
Details
Formally this function is of class WeightingFunction with the additional attributes name and acronym.
The first letter of spec specifies a weighting schema for term frequencies of m:
- "n"
(natural)
\mathit{tf}_{i,j}counts the number of occurrencesn_{i,j}of a termt_iin a documentd_j. Theinput term-document matrixmis assumed to be in thisstandard term frequency format already.- "l"
(logarithm) is defined as
1 + \log_2(\mathit{tf}_{i,j}).- "a"
(augmented) is defined as
0.5 + \frac{0.5 * \mathit{tf}_{i,j}}{\max_i(\mathit{tf}_{i,j})}.- "b"
(boolean) is defined as 1 if
\mathit{tf}_{i,j} > 0and 0 otherwise.- "L"
(log average) is defined as
\frac{1 + \log_2(\mathit{tf}_{i,j})}{1+\log_2(\mathrm{ave}_{i\in j}(\mathit{tf}_{i,j}))}.
The second letter of spec specifies a weighting schema of document frequencies for m:
- "n"
(no) is defined as 1.
- "t"
(idf) is defined as
\log_2 \frac{N}{\mathit{df}_t}where\mathit{df}_tdenotes how often termtoccurs in alldocuments.- "p"
(prob idf) is defined as
\max(0, \log_2(\frac{N - \mathit{df}_t}{\mathit{df}_t})).
The third letter of spec specifies a schema for normalization of m:
- "n"
(none) is defined as 1.
- "c"
(cosine) is defined as
\sqrt{\mathrm{col\_sums}(m ^ 2)}.- "u"
(pivoted unique) is defined as
\mathit{slope} * \sqrt{\mathrm{col\_sums}(m ^ 2)} + (1 - \mathit{slope}) * \mathit{pivot}where bothslopeandpivotmust be setvia named tags in thecontrollist.- "b"
(byte size) is defined as
\frac{1}{\mathit{CharLength}^\alpha}. The parameter\alphamust be set via the named tagalphain thecontrollist.
The final result is obtained by multiplying the chosen term frequency component, the chosen document frequency component, and the chosen normalization component.
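For instance, the pivoted unique normalization requires the slope and pivot tags in the control list; a minimal sketch (with tm attached; the parameter values are arbitrary illustration choices):

data("crude")
tdm <- TermDocumentMatrix(crude)
## "ntu": natural term frequency, idf, pivoted unique normalization.
weightSMART(tdm, spec = "ntu", control = list(pivot = 10, slope = 0.2))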
Value
The weighted matrix.
References
Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze (2008). Introduction to Information Retrieval. Cambridge University Press. ISBN 0521865719.
Examples
data("crude")TermDocumentMatrix(crude, control = list(removePunctuation = TRUE, stopwords = TRUE, weighting = function(x) weightSMART(x, spec = "ntc")))Weight by Term Frequency
Description
Weight a term-document matrix by term frequency.
Usage
weightTf(m)

Arguments
m | A TermDocumentMatrix in term frequency format. |
Details
Formally this function is of class WeightingFunction with the additional attributes name and acronym.
This function acts as the identity function since the input matrix is already in term frequency format.
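A tiny sketch of this identity behavior (with tm attached; assuming the slam triplet representation, where v holds the nonzero entries):

data("crude")
tdm <- TermDocumentMatrix(crude)
## The counts are unchanged by weightTf.
all(weightTf(tdm)$v == tdm$v)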
Value
The weighted matrix.
Weight by Term Frequency - Inverse Document Frequency
Description
Weight a term-document matrix by term frequency - inverse document frequency.
Usage
weightTfIdf(m, normalize = TRUE)

Arguments
m | A TermDocumentMatrix in term frequency format. |
normalize | A Boolean value indicating whether the term frequencies should be normalized. |
Details
Formally this function is of class WeightingFunction with the additional attributes name and acronym.
Term frequency \mathit{tf}_{i,j} counts the number of occurrences n_{i,j} of a term t_i in a document d_j. In the case of normalization, the term frequency \mathit{tf}_{i,j} is divided by \sum_k n_{k,j}.
Inverse document frequency for a termt_i is defined as
\mathit{idf}_i = \log_2 \frac{|D|}{|\{d \mid t_i \in d\}|}
where |D| denotes the total number of documents and where |\{d \mid t_i \in d\}| is the number of documents where the term t_i appears.
Term frequency - inverse document frequency is now defined as \mathit{tf}_{i,j} \cdot \mathit{idf}_i.
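The idf component can be recomputed directly from the counts; a sketch (with tm attached; note this is not the full weightTfIdf computation, which by default also normalizes the term frequencies):

library("slam")
data("crude")
tdm <- TermDocumentMatrix(crude)
## Document frequency of each term, via binarized counts.
df <- row_sums(weightBin(tdm))
## idf_i = log2(|D| / df_i), with |D| = nDocs(tdm) documents.
idf <- log2(nDocs(tdm) / df)
head(idf)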
Value
The weighted matrix.
References
Gerard Salton and Christopher Buckley (1988). Term-weighting approaches in automatic text retrieval. Information Processing and Management, 24/5, 513–523.
Write a Corpus to Disk
Description
Write a plain text representation of a corpus to multiple files on disk corresponding to the individual documents in the corpus.
Usage
writeCorpus(x, path = ".", filenames = NULL)

Arguments
x | A corpus. |
path | A character string naming the directory to be written into. |
filenames | Either NULL or a character vector. If no filenames are provided, they are automatically generated from the documents' identifiers in x. |
Details
The plain text representation of the corpus is obtained by calling as.character on each document.
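A runnable sketch (with tm attached) writing into a temporary directory instead of the working directory:

data("crude")
d <- tempdir()
## One plain text file per document, named 1.txt, ..., 20.txt.
writeCorpus(crude, path = d, filenames = paste0(seq_along(crude), ".txt"))
head(list.files(d, pattern = "\\.txt$"))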
Examples
data("crude")## Not run: writeCorpus(crude, path = ".", filenames = paste(seq_along(crude), ".txt", sep = ""))## End(Not run)