
Title: Text Mining Package
Version: 0.7-17
Date: 2025-12-10
Depends: R (≥ 3.4.0), NLP (≥ 0.2-0)
Imports: Rcpp, parallel, slam (≥ 0.1-37), stats, tools, utils, graphics, xml2
LinkingTo: BH, Rcpp
Suggests: antiword, filehash, methods, pdftools, Rcampdf, Rgraphviz, Rpoppler, SnowballC, testthat, tm.lexicon.GeneralInquirer
Description: A framework for text mining applications within R.
License: GPL-3
URL: https://tm.r-forge.r-project.org/
Additional_repositories: https://datacube.wu.ac.at
NeedsCompilation: yes
Packaged: 2025-12-10 12:27:27 UTC; hornik
Author: Ingo Feinerer [aut], Kurt Hornik [aut, cre], Artifex Software, Inc. [ctb, cph] (pdf_info.ps taken from GPL Ghostscript)
Maintainer: Kurt Hornik <Kurt.Hornik@R-project.org>
Repository: CRAN
Date/Publication: 2025-12-10 13:41:05 UTC

Corpora

Description

Representing and computing on corpora.

Details

Corpora are collections of documents containing (natural language) text. In packages which employ the infrastructure provided by package tm, such corpora are represented via the virtual S3 class Corpus: such packages then provide S3 corpus classes extending the virtual base class (such as VCorpus provided by package tm itself).

All extension classes must provide accessors to extract subsets ([), individual documents ([[), and metadata (meta). The function length must return the number of documents, and as.list must construct a list holding the documents.

A corpus can have two types of metadata (accessible via meta). Corpus metadata contains corpus-specific metadata in the form of tag-value pairs. Document-level metadata contains document-specific metadata but is stored in the corpus as a data frame. Document-level metadata is typically used for semantic reasons (e.g., classifications of documents form an entity of their own due to some high-level information like the range of possible values) or for performance reasons (single access instead of extracting metadata of each document).

The function Corpus is a convenience alias to SimpleCorpus or VCorpus, depending on the arguments provided.
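As an illustration of the accessors and the two metadata types described above, the following minimal sketch builds a small volatile corpus and exercises each of them. The texts and the "label" tag are made up for this example.

```r
library(tm)

# Two tiny documents held in a volatile (in-memory) corpus.
corp <- VCorpus(VectorSource(c("This is a text.", "This is another one.")))

n_docs <- length(corp)    # number of documents
sub    <- corp[1]         # `[` subsets and returns a corpus
doc    <- corp[[2]]       # `[[` extracts a single document
docs   <- as.list(corp)   # plain list holding the documents

# Document-level metadata is stored as a data frame with one row per
# document. "label" is a hypothetical tag chosen for this example.
meta(corp, "label") <- c("first", "second")
doc_meta <- meta(corp)

# Corpus-level metadata is a set of tag-value pairs.
corpus_meta <- meta(corp, type = "corpus")
```

Keeping classifications such as "label" at the document level means they can be read in a single access instead of being collected from each document.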

See Also

SimpleCorpus, VCorpus, and PCorpus for the corpora classes provided by package tm.

DCorpus for a distributed corpus class provided by package tm.plugin.dc.


Data Frame Source

Description

Create a data frame source.

Usage

DataframeSource(x)

Arguments

x

A data frame giving the texts and metadata.

Details

A data frame source interprets each row of the data frame x as a document. The first column must be named "doc_id" and contain a unique string identifier for each document. The second column must be named "text" and contain a UTF-8 encoded string representing the document's content. Optional additional columns are used as document level metadata.

Value

An object inheriting from DataframeSource, SimpleSource, and Source.

See Also

Source for basic information on the source infrastructure employed by package tm, and meta for types of metadata.

readtext for reading in a text in multiple formats suitable to be processed by DataframeSource.

Examples

docs <- data.frame(doc_id = c("doc_1", "doc_2"),
                   text = c("This is a text.", "This another one."),
                   dmeta1 = 1:2, dmeta2 = letters[1:2],
                   stringsAsFactors = FALSE)
(ds <- DataframeSource(docs))
x <- Corpus(ds)
inspect(x)
meta(x)

Directory Source

Description

Create a directory source.

Usage

DirSource(directory = ".",
          encoding = "",
          pattern = NULL,
          recursive = FALSE,
          ignore.case = FALSE,
          mode = "text")

Arguments

directory

A character vector of full path names; the default corresponds to the working directory getwd().

encoding

a character string describing the current encoding. It is passed to iconv to convert the input to UTF-8.

pattern

an optional regular expression. Only file names which match the regular expression will be returned.

recursive

logical. Should the listing recurse into directories?

ignore.case

logical. Should pattern-matching be case-insensitive?

mode

a character string specifying if and how files should be read in. Available modes are:

""

No read. In this case getElem and pGetElem only deliver URIs.

"binary"

Files are read in binary raw mode (via readBin).

"text"

Files are read as text (via readLines).

Details

A directory source acquires a list of files via dir and interprets each file as a document.
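The effect of pattern can be tried out on a throwaway directory. The sketch below is only an illustration: the directory and file names are invented, and the files are created in a temporary location so that nothing outside the session is touched.

```r
library(tm)

# Hypothetical demo directory with two text files and one other file.
demo_dir <- tempfile("tm-dirsource-")
dir.create(demo_dir)
writeLines("alpha beta", file.path(demo_dir, "a.txt"))
writeLines("gamma",      file.path(demo_dir, "b.txt"))
writeLines("ignored",    file.path(demo_dir, "notes.md"))

# Only file names matching the regular expression become documents.
src  <- DirSource(demo_dir, encoding = "UTF-8", pattern = "\\.txt$")
corp <- VCorpus(src)

n_docs     <- length(corp)
first_text <- as.character(corp[[1]])

unlink(demo_dir, recursive = TRUE)
```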

Value

An object inheriting from DirSource, SimpleSource, and Source.

See Also

Source for basic information on the source infrastructure employed by package tm.

Encoding and iconv on encodings.

Examples

DirSource(system.file("texts", "txt", package = "tm"))

Access Document IDs and Terms

Description

Access the document IDs and terms of a term-document matrix or document-term matrix, along with their counts.

Usage

Docs(x)
nDocs(x)
nTerms(x)
Terms(x)

Arguments

x

Either a TermDocumentMatrix or DocumentTermMatrix.

Value

For Docs and Terms, a character vector with document IDs and terms, respectively.

For nDocs and nTerms, an integer with the number of document IDs and terms, respectively.

Examples

data("crude")
tdm <- TermDocumentMatrix(crude)[1:10,1:20]
Docs(tdm)
nDocs(tdm)
nTerms(tdm)
Terms(tdm)

Permanent Corpora

Description

Create permanent corpora.

Usage

PCorpus(x,
        readerControl = list(reader = reader(x), language = "en"),
        dbControl = list(dbName = "", dbType = "DB1"))

Arguments

x

A Source object.

readerControl

a named list of control parameters for reading in content from x.

reader

a function capable of reading in and processing the format delivered by x.

language

a character giving the language (preferably as IETF language tags, see language in package NLP). The default language is assumed to be English ("en").

dbControl

a named list of control parameters for the underlying database storage provided by package filehash.

dbName

a character giving the filename for the database.

dbType

a character giving the database format (see filehashOption for possible database formats).

Details

A permanent corpus stores documents outside of R in a database. Since multiple PCorpus R objects with the same underlying database can exist simultaneously in memory, changes in one get propagated to all corresponding objects (in contrast to the default R semantics).

Value

An object inheriting from PCorpus and Corpus.

See Also

Corpus for basic information on the corpus infrastructure employed by package tm.

VCorpus provides an implementation with volatile storage semantics.

Examples

txt <- system.file("texts", "txt", package = "tm")
## Not run:
PCorpus(DirSource(txt),
        dbControl = list(dbName = "pcorpus.db", dbType = "DB1"))
## End(Not run)

Plain Text Documents

Description

Create plain text documents.

Usage

PlainTextDocument(x = character(0),
                  author = character(0),
                  datetimestamp = as.POSIXlt(Sys.time(), tz = "GMT"),
                  description = character(0),
                  heading = character(0),
                  id = character(0),
                  language = character(0),
                  origin = character(0),
                  ...,
                  meta = NULL,
                  class = NULL)

Arguments

x

A character string giving the plain text content.

author

a character string or an object of class person giving the author names.

datetimestamp

an object of class POSIXt or a character string giving the creation date/time information. If a character string, exactly one of the ISO 8601 formats defined by https://www.w3.org/TR/NOTE-datetime should be used. See parse_ISO_8601_datetime in package NLP for processing such date/time information.

description

a character string giving a description.

heading

a character string giving the title or a short heading.

id

a character string giving a unique identifier.

language

a character string giving the language (preferably as IETF language tags, see language in package NLP).

origin

a character string giving information on the source and origin.

...

user-defined document metadata tag-value pairs.

meta

a named list or NULL (default) giving all metadata. If set, all other metadata arguments are ignored.

class

a character vector or NULL (default) giving additional classes to be used for the created plain text document.

Value

An object inheriting from class, PlainTextDocument and TextDocument.

See Also

TextDocument for basic information on the text document infrastructure employed by package tm.

Examples

(ptd <- PlainTextDocument("A simple plain text document",
                          heading = "Plain text document",
                          id = basename(tempfile()),
                          language = "en"))
meta(ptd)

Readers

Description

Creating readers.

Usage

getReaders()

Details

Readers are functions for extracting textual content and metadata out of elements delivered by a Source, and for constructing a TextDocument. A reader must accept the following arguments in its signature:

elem

a named list with the components content and uri (as delivered by a Source via getElem or pGetElem).

language

a character string giving the language.

id

a character giving a unique identifier for the created text document.

The element elem is typically provided by a source whereas the language and the identifier are normally provided by a corpus constructor (for the case that elem$content does not give information on these two essential items).

If a reader expects configuration arguments, we can use a function generator. A function generator is indicated by inheriting from class FunctionGenerator and function. It allows us to process additional arguments, store them in an environment, return a reader function with the well-defined signature described above, and still be able to access the additional arguments via lexical scoping. All corpus constructors in package tm check the reader function for being a function generator and if so apply it to yield the reader with the expected signature.
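The reader signature can be illustrated with a small hand-written reader. The sketch below is an assumption-laden example rather than anything shipped with tm: readUpper and makePrefixReader are hypothetical names, and the configurable variant is an ordinary closure that is called explicitly, not an object of class FunctionGenerator.

```r
library(tm)

# A plain reader with the required signature (elem, language, id).
# readUpper is a hypothetical name used only in this example.
readUpper <- function(elem, language, id)
  PlainTextDocument(toupper(elem$content), id = id, language = language)

# A configurable variant, written as an ordinary closure: the argument
# `prefix` stays accessible inside the reader via lexical scoping.
makePrefixReader <- function(prefix) {
  function(elem, language, id)
    PlainTextDocument(paste0(prefix, elem$content),
                      id = id, language = language)
}

src <- VectorSource(c("one", "two"))

upper  <- VCorpus(src, readerControl = list(reader = readUpper,
                                            language = "en"))
tagged <- VCorpus(src, readerControl = list(reader = makePrefixReader("> "),
                                            language = "en"))

upper_text  <- as.character(upper[[1]])
tagged_text <- as.character(tagged[[2]])
```

Because makePrefixReader("> ") already returns a function with the expected signature, the corpus constructor can use it directly; a function generator achieves the same effect while letting the constructor perform the call.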

Value

For getReaders(), a character vector with readers provided by package tm.

See Also

readDOC, readPDF, readPlain, readRCV1, readRCV1asPlain, readReut21578XML, readReut21578XMLasPlain, and readXML.


Simple Corpora

Description

Create simple corpora.

Usage

SimpleCorpus(x, control = list(language = "en"))

Arguments

x

a DataframeSource, DirSource or VectorSource.

control

a named list of control parameters.

language

a character giving the language (preferably as IETF language tags, see language in package NLP). The default language is assumed to be English ("en").

Details

A simple corpus is fully kept in memory. Compared to a VCorpus, it is optimized for the most common usage scenario: importing plain texts from files in a directory or directly from a vector in R, preprocessing and transforming the texts, and finally exporting them to a term-document matrix. It adheres to the Corpus API. However, it internally takes various shortcuts to boost performance and minimize memory pressure; consequently it operates only under the following constraints:

Value

An object inheriting from SimpleCorpus and Corpus.

See Also

Corpus for basic information on the corpus infrastructure employed by package tm.

VCorpus provides an implementation with volatile storage semantics, and PCorpus provides an implementation with permanent storage semantics.

Examples

txt <- system.file("texts", "txt", package = "tm")
(ovid <- SimpleCorpus(DirSource(txt, encoding = "UTF-8"),
                      control = list(language = "lat")))

Sources

Description

Creating and accessing sources.

Usage

SimpleSource(encoding = "",
             length = 0,
             position = 0,
             reader = readPlain,
             ...,
             class)
getSources()
## S3 method for class 'SimpleSource'
close(con, ...)
## S3 method for class 'SimpleSource'
eoi(x)
## S3 method for class 'DataframeSource'
getMeta(x)
## S3 method for class 'DataframeSource'
getElem(x)
## S3 method for class 'DirSource'
getElem(x)
## S3 method for class 'URISource'
getElem(x)
## S3 method for class 'VectorSource'
getElem(x)
## S3 method for class 'XMLSource'
getElem(x)
## S3 method for class 'SimpleSource'
length(x)
## S3 method for class 'SimpleSource'
open(con, ...)
## S3 method for class 'DataframeSource'
pGetElem(x)
## S3 method for class 'DirSource'
pGetElem(x)
## S3 method for class 'URISource'
pGetElem(x)
## S3 method for class 'VectorSource'
pGetElem(x)
## S3 method for class 'SimpleSource'
reader(x)
## S3 method for class 'SimpleSource'
stepNext(x)

Arguments

x

A Source.

con

A Source.

encoding

a character giving the encoding of the elements delivered by the source.

length

a non-negative integer denoting the number of elements delivered by the source. If the length is unknown in advance set it to 0.

position

a numeric indicating the current position in the source.

reader

a reader function (generator).

...

For SimpleSource tag-value pairs for storing additional information; not used otherwise.

class

a character vector giving additional classes to be used for the created source.

Details

Sources abstract input locations, like a directory, a connection, or simply an R vector, in order to acquire content in a uniform way. In packages which employ the infrastructure provided by package tm, such sources are represented via the virtual S3 class Source: such packages then provide S3 source classes extending the virtual base class (such as DirSource provided by package tm itself).

All extension classes must provide implementations for the functions close, eoi, getElem, length, open, reader, and stepNext. For parallel element access the (optional) function pGetElem must be provided as well. If document level metadata is available, the (optional) function getMeta must be implemented.

The functions open and close open and close the source, respectively. eoi indicates end of input. getElem fetches the element at the current position, whereas pGetElem retrieves all elements in parallel at once. The function length gives the number of elements. reader returns a default reader for processing elements. stepNext increases the position in the source to acquire the next element.

The function SimpleSource provides a simple reference implementation and can be used when creating custom sources.
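The interplay of open, eoi, stepNext, getElem, and close can be seen by stepping through a source by hand. The following is a minimal sketch using a VectorSource with made-up texts; corpus constructors normally perform this iteration for you.

```r
library(tm)

src <- VectorSource(c("first", "second", "third"))
n_elems <- length(src)         # number of elements the source delivers

src <- open(src)               # open the source
texts <- character()
while (!eoi(src)) {            # loop until the end of input is reached
  src  <- stepNext(src)        # advance the position by one element
  elem <- getElem(src)         # named list with components content and uri
  texts <- c(texts, elem$content)
}
src <- close(src)              # close the source again
```

Note that stepNext returns the advanced source rather than modifying it in place, so its result has to be assigned back.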

Value

For SimpleSource, an object inheriting from class, SimpleSource, and Source.

For getSources, a character vector with sources provided by package tm.

open and close return the opened and closed source, respectively.

For eoi, a logical indicating if the end of input of the source is reached.

For getElem, a named list with the components content holding the document and uri giving a uniform resource identifier (e.g., a file path or URL; NULL if not applicable or unavailable). For pGetElem, a list of such named lists.

For length, an integer for the number of elements.

For reader, a function for the default reader.

See Also

DataframeSource, DirSource, URISource, VectorSource, and XMLSource.


Term-Document Matrix

Description

Constructs or coerces to a term-document matrix or a document-term matrix.

Usage

TermDocumentMatrix(x, control = list())
DocumentTermMatrix(x, control = list())
as.TermDocumentMatrix(x, ...)
as.DocumentTermMatrix(x, ...)

Arguments

x

for the constructors, a corpus or an R object from which a corpus can be generated via Corpus(VectorSource(x)); for the coercing functions, either a term-document matrix or a document-term matrix or a simple triplet matrix (package slam) or a term frequency vector.

control

a named list of control options. There are local options which are evaluated for each document and global options which are evaluated once for the constructed matrix. Available local options are documented in termFreq and are internally delegated to a termFreq call.

This is different for a SimpleCorpus. In this case all options are processed in a fixed order in one pass to improve performance. It always uses the Boost (https://www.boost.org) Tokenizer (via Rcpp) and takes no custom functions as option arguments.

Available global options are:

bounds

A list with a tag global whose value must be an integer vector of length 2. Terms that appear in fewer documents than the lower bound bounds$global[1] or in more documents than the upper bound bounds$global[2] are discarded. Defaults to list(global = c(1, Inf)) (i.e., every term will be used).

weighting

A weighting function capable of handling a TermDocumentMatrix. It defaults to weightTf for term frequency weighting. Available weighting functions shipped with the tm package are weightTf, weightTfIdf, weightBin, and weightSMART.

...

the additional argument weighting (typically a WeightFunction) is allowed when coercing a simple triplet matrix to a term-document or document-term matrix.
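To make the global bounds option concrete, here is a small sketch on an invented three-document corpus. With a lower bound of 2, only terms that occur in at least two documents should survive; the texts are chosen for this example only.

```r
library(tm)

# Hypothetical toy corpus: "apple" occurs in three documents,
# "banana" in two, "cherry" and "date" in one each.
corp <- VCorpus(VectorSource(c("apple banana",
                               "apple cherry",
                               "apple banana date")))

# Keep only terms appearing in at least 2 documents (no upper limit).
tdm <- TermDocumentMatrix(corp,
                          control = list(bounds = list(global = c(2, Inf))))

kept_terms <- Terms(tdm)
n_docs     <- nDocs(tdm)
```

The documents themselves are all kept; only the term dimension shrinks.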

Value

An object of class TermDocumentMatrix or class DocumentTermMatrix (both inheriting from a simple triplet matrix in package slam) containing a sparse term-document matrix or document-term matrix. The attribute weighting contains the weighting applied to the matrix.

See Also

termFreq for available local control options.

Examples

data("crude")
tdm <- TermDocumentMatrix(crude,
                          control = list(removePunctuation = TRUE,
                                         stopwords = TRUE))
dtm <- DocumentTermMatrix(crude,
                          control = list(weighting =
                                         function(x)
                                         weightTfIdf(x, normalize =
                                                     FALSE),
                                         stopwords = TRUE))
inspect(tdm[202:205, 1:5])
inspect(tdm[c("price", "prices", "texas"), c("127", "144", "191", "194")])
inspect(dtm[1:5, 273:276])
if(requireNamespace("SnowballC")) {
    s <- SimpleCorpus(VectorSource(unlist(lapply(crude, as.character))))
    m <- TermDocumentMatrix(s,
                            control = list(removeNumbers = TRUE,
                                           stopwords = TRUE,
                                           stemming = TRUE))
    inspect(m[c("price", "texa"), c("127", "144", "191", "194")])
}

Text Documents

Description

Representing and computing on text documents.

Details

Text documents are documents containing (natural language) text. The tm package employs the infrastructure provided by package NLP and represents text documents via the virtual S3 class TextDocument. Actual S3 text document classes then extend the virtual base class (such as PlainTextDocument).

All extension classes must provide an as.character method which extracts the natural language text in documents of the respective classes in a “suitable” (not necessarily structured) form, as well as content and meta methods for accessing the (possibly raw) document content and metadata.
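The three required methods can be tried on a PlainTextDocument, the text document class shipped with tm. This is a minimal sketch; the content and the identifier "d1" are made up.

```r
library(tm)

doc <- PlainTextDocument("Some plain text.", id = "d1", language = "en")

txt    <- as.character(doc)   # natural language text of the document
raw    <- content(doc)        # (possibly raw) document content
doc_id <- meta(doc, "id")     # a single metadata tag
```

For plain text documents the text and the raw content coincide; other document classes may store a richer representation behind content.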

See Also

PlainTextDocument, and XMLTextDocument for the text document classes provided by package tm.

TextDocument for text documents in package NLP.


Uniform Resource Identifier Source

Description

Create a uniform resource identifier source.

Usage

URISource(x, encoding = "", mode = "text")

Arguments

x

A character vector of uniform resource identifiers (URIs).

encoding

A character string describing the current encoding. It is passed to iconv to convert the input to UTF-8.

mode

a character string specifying if and how URIs should be read in. Available modes are:

""

No read. In this case getElem and pGetElem only deliver URIs.

"binary"

URIs are read in binary raw mode (via readBin).

"text"

URIs are read as text (via readLines).

Details

A uniform resource identifier source interprets each URI as a document.

Value

An object inheriting from URISource, SimpleSource, and Source.

See Also

Source for basic information on the source infrastructure employed by package tm.

Encoding and iconv on encodings.

Examples

loremipsum <- system.file("texts", "loremipsum.txt", package = "tm")
ovid <- system.file("texts", "txt", "ovid_1.txt", package = "tm")
us <- URISource(sprintf("file://%s", c(loremipsum, ovid)))
inspect(VCorpus(us))

Volatile Corpora

Description

Create volatile corpora.

Usage

VCorpus(x, readerControl = list(reader = reader(x), language = "en"))
as.VCorpus(x)

Arguments

x

For VCorpus a Source object, and for as.VCorpus an R object.

readerControl

a named list of control parameters for reading in content from x.

reader

a function capable of reading in and processing the format delivered by x.

language

a character giving the language (preferably as IETF language tags, see language in package NLP). The default language is assumed to be English ("en").

Details

A volatile corpus is fully kept in memory and thus all changes only affect the corresponding R object.

Value

An object inheriting from VCorpus and Corpus.

See Also

Corpus for basic information on the corpus infrastructure employed by package tm.

PCorpus provides an implementation with permanent storage semantics.

Examples

reut21578 <- system.file("texts", "crude", package = "tm")
VCorpus(DirSource(reut21578, mode = "binary"),
        list(reader = readReut21578XMLasPlain))

Vector Source

Description

Create a vector source.

Usage

VectorSource(x)

Arguments

x

A vector giving the texts.

Details

A vector source interprets each element of the vector x as a document.

Value

An object inheriting from VectorSource, SimpleSource, and Source.

See Also

Source for basic information on the source infrastructure employed by package tm.

Examples

docs <- c("This is a text.", "This another one.")
(vs <- VectorSource(docs))
inspect(VCorpus(vs))

Weighting Function

Description

Construct a weighting function for term-document matrices.

Usage

WeightFunction(x, name, acronym)

Arguments

x

A function which takes a TermDocumentMatrix with term frequencies as input, weights the elements, and returns the weighted matrix.

name

A character naming the weighting function.

acronym

A character giving an acronym for the name of the weighting function.

Value

An object of class WeightFunction which extends the class function representing a weighting function.

Examples

weightCutBin <- WeightFunction(function(m, cutoff) m > cutoff,
                               "binary with cutoff", "bincut")

XML Source

Description

Create an XML source.

Usage

XMLSource(x, parser = xml_contents, reader)

Arguments

x

a character giving a uniform resource identifier.

parser

a function accepting an XML document (as delivered by read_xml in package xml2) as input and returning XML elements/nodes.

reader

a function capable of turning XML elements/nodes as returned by parser into a subclass of TextDocument.

Value

An object inheriting from XMLSource, SimpleSource, and Source.

See Also

Source for basic information on the source infrastructure employed by package tm.

Vignette 'Extensions: How to Handle Custom File Formats', and readXML.


XML Text Documents

Description

Create XML text documents.

Usage

XMLTextDocument(x = xml_missing(),
                author = character(0),
                datetimestamp = as.POSIXlt(Sys.time(), tz = "GMT"),
                description = character(0),
                heading = character(0),
                id = character(0),
                language = character(0),
                origin = character(0),
                ...,
                meta = NULL)

Arguments

x

An XMLDocument.

author

a character or an object of class person giving the author names.

datetimestamp

an object of class POSIXt or a character string giving the creation date/time information. If a character string, exactly one of the ISO 8601 formats defined by https://www.w3.org/TR/NOTE-datetime should be used. See parse_ISO_8601_datetime in package NLP for processing such date/time information.

description

a character giving a description.

heading

a character giving the title or a short heading.

id

a character giving a unique identifier.

language

a character giving the language (preferably as IETF language tags, see language in package NLP).

origin

a character giving information on the source and origin.

...

user-defined document metadata tag-value pairs.

meta

a named list or NULL (default) giving all metadata. If set, all other metadata arguments are ignored.

Value

An object inheriting from XMLTextDocument and TextDocument.

See Also

TextDocument for basic information on the text document infrastructure employed by package tm.

Examples

xml <- system.file("extdata", "order-doc.xml", package = "xml2")
(xtd <- XMLTextDocument(xml2::read_xml(xml),
                        heading = "XML text document",
                        id = xml,
                        language = "en"))
content(xtd)
meta(xtd)

ZIP File Source

Description

Create a ZIP file source.

Usage

ZipSource(zipfile,
          pattern = NULL,
          recursive = FALSE,
          ignore.case = FALSE,
          mode = "text")

Arguments

zipfile

A character string with the full path name of a ZIP file.

pattern

an optional regular expression. Only file names in the ZIP file which match the regular expression will be returned.

recursive

logical. Should the listing recurse into directories?

ignore.case

logical. Should pattern-matching be case-insensitive?

mode

a character string specifying if and how files should be read in. Available modes are:

""

No read. In this case getElem and pGetElem only deliver URIs.

"binary"

Files are read in binary raw mode (via readBin).

"text"

Files are read as text (via readLines).

Details

A ZIP file source extracts a compressed ZIP file via unzip and interprets each file as a document.

Value

An object inheriting from ZipSource, SimpleSource, and Source.

See Also

Source for basic information on the source infrastructure employed by package tm.

Examples

zipfile <- tempfile()
files <- Sys.glob(file.path(system.file("texts", "txt", package = "tm"), "*"))
zip(zipfile, files)
zipfile <- paste0(zipfile, ".zip")
Corpus(ZipSource(zipfile, recursive = TRUE))[[1]]
file.remove(zipfile)

Explore Corpus Term Frequency Characteristics

Description

Explore Zipf's law and Heaps' law, two empirical laws in linguistics describing commonly observed characteristics of term frequency distributions in corpora.

Usage

Zipf_plot(x, type = "l", ...)
Heaps_plot(x, type = "l", ...)

Arguments

x

a document-term matrix or term-document matrix with unweighted term frequencies.

type

a character string indicating the type of plot to be drawn, see plot.

...

further graphical parameters to be used for plotting.

Details

Zipf's law (e.g., https://en.wikipedia.org/wiki/Zipf%27s_law) states that given some corpus of natural language utterances, the frequency of any word is inversely proportional to its rank in the frequency table, or, more generally, that the pmf of the term frequencies is of the form c k^{-\beta}, where k is the rank of the term (taken from the most to the least frequent one). We can conveniently explore the degree to which the law holds by plotting the logarithm of the frequency against the logarithm of the rank, and inspecting the goodness of fit of a linear model.

Heaps' law (e.g., https://en.wikipedia.org/wiki/Heaps%27_law) states that the vocabulary size V (i.e., the number of different terms employed) grows polynomially with the text size T (the total number of terms in the texts), so that V = c T^\beta. We can conveniently explore the degree to which the law holds by plotting \log(V) against \log(T), and inspecting the goodness of fit of a linear model.
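The log-log fit described for Zipf's law can be sketched by hand. The code below is an illustration under stated assumptions, not a copy of what Zipf_plot does internally: it takes total term frequencies from a document-term matrix via slam::col_sums, ranks them, and fits log frequency against log rank, so that the slope estimates -\beta and the intercept estimates \log(c).

```r
library(tm)

data("acq")
m <- DocumentTermMatrix(acq)

# Total frequency of each term, ordered from most to least frequent.
freq <- sort(slam::col_sums(m), decreasing = TRUE)
rank <- seq_along(freq)

# Taking logs turns c * k^(-beta) into a straight line:
# log(freq) = log(c) - beta * log(rank).
fit   <- lm(log(freq) ~ log(rank))
slope <- unname(coef(fit)[2])
```

The same device applies to Heaps' law, where taking logs of V = c T^\beta gives a line with slope \beta.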

Value

The coefficients of the fitted linear model. As a side effect, the corresponding plot is produced.

Examples

data("acq")
m <- DocumentTermMatrix(acq)
Zipf_plot(m)
Heaps_plot(m)

50 Exemplary News Articles from the Reuters-21578 Data Set of Topic acq

Description

This data set holds 50 news articles with additional meta information from the Reuters-21578 data set. All documents belong to the topic acq dealing with corporate acquisitions.

Usage

data("acq")

Format

A VCorpus of 50 text documents.

Source

Reuters-21578 Text Categorization Collection Distribution 1.0 (XML format).

References

Lewis, David (1997). Reuters-21578 Text Categorization Collection Distribution. UCI Machine Learning Repository. doi:10.24432/C52G6M.

Examples

data("acq")
acq

Content Transformers

Description

Create content transformers, i.e., functions which modify the content of an R object.

Usage

content_transformer(FUN)

Arguments

FUN

a function.

Value

A function with two arguments:

x

an R object with implemented content getter (content) and setter (content<-) functions.

...

arguments passed over to FUN.

See Also

tm_map for an interface to apply transformations to corpora.

Examples

data("crude")
crude[[1]]
(f <- content_transformer(function(x, pattern) gsub(pattern, "", x)))
tm_map(crude, f, "[[:digit:]]+")[[1]]

20 Exemplary News Articles from the Reuters-21578 Data Set of Topic crude

Description

This data set holds 20 news articles with additional meta information from the Reuters-21578 data set. All documents belong to the topic crude dealing with crude oil.

Usage

data("crude")

Format

A VCorpus of 20 text documents.

Source

Reuters-21578 Text Categorization Collection Distribution 1.0 (XML format).

References

Lewis, David (1997). Reuters-21578 Text Categorization Collection Distribution. UCI Machine Learning Repository. doi:10.24432/C52G6M.

Examples

data("crude")
crude

Find Associations in a Term-Document Matrix

Description

Find associations in a document-term or term-document matrix.

Usage

## S3 method for class 'DocumentTermMatrix'
findAssocs(x, terms, corlimit)
## S3 method for class 'TermDocumentMatrix'
findAssocs(x, terms, corlimit)

Arguments

x

A DocumentTermMatrix or a TermDocumentMatrix.

terms

a character vector holding terms.

corlimit

a numeric vector (of the same length as terms; recycled otherwise) for the (inclusive) lower correlation limits of each term in the range from zero to one.

Value

A named list. Each list component is named after a term in terms and contains a named numeric vector. Each vector holds matching terms from x and their rounded correlations satisfying the inclusive lower correlation limit of corlimit.
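The shape of the returned list is easiest to see on a tiny, invented corpus in which the correlations can be checked by eye. In the sketch below "oil" and "price" occur in exactly the same documents, so their correlation should be 1, while "rise" and "gold" correlate negatively with "oil" and fall below the limit.

```r
library(tm)

# Hypothetical toy corpus chosen so the correlations are easy to verify.
corp <- VCorpus(VectorSource(c("oil price",
                               "oil price rise",
                               "gold rise")))
dtm <- DocumentTermMatrix(corp)

# Terms whose correlation with "oil" is at least 0.5.
assoc <- findAssocs(dtm, "oil", 0.5)
oil_assoc <- assoc$oil   # named numeric vector of rounded correlations
```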

Examples

data("crude")
tdm <- TermDocumentMatrix(crude)
findAssocs(tdm, c("oil", "opec", "xyz"), c(0.7, 0.75, 0.1))

Find Frequent Terms

Description

Find frequent terms in a document-term or term-document matrix.

Usage

findFreqTerms(x, lowfreq = 0, highfreq = Inf)

Arguments

x

A DocumentTermMatrix or TermDocumentMatrix.

lowfreq

A numeric for the lower frequency bound.

highfreq

A numeric for the upper frequency bound.

Details

This method works for all numeric weightings but is probably most meaningful for the standard term frequency (tf) weighting of x.

Value

A character vector of terms in x which occur at least lowfreq times and at most highfreq times.

Examples

data("crude")
tdm <- TermDocumentMatrix(crude)
findFreqTerms(tdm, 2, 3)

Find Most Frequent Terms

Description

Find most frequent terms in a document-term or term-document matrix, or a vector of term frequencies.

Usage

findMostFreqTerms(x, n = 6L, ...)
## S3 method for class 'DocumentTermMatrix'
findMostFreqTerms(x, n = 6L, INDEX = NULL, ...)
## S3 method for class 'TermDocumentMatrix'
findMostFreqTerms(x, n = 6L, INDEX = NULL, ...)

Arguments

x

A DocumentTermMatrix or TermDocumentMatrix, or a vector of term frequencies as obtained by termFreq().

n

A single integer giving the maximal number of terms.

INDEX

an object specifying a grouping of documents for rollup, or NULL (default) in which case each document is considered individually.

...

arguments to be passed to or from methods.

Details

Only terms with positive frequencies are included in the results.

Value

For the document-term or term-document matrix methods, a list with the named frequencies of the up to n most frequent terms occurring in each document (group). Otherwise, a single such vector of most frequent terms.

Examples

data("crude")
## Term frequencies:
tf <- termFreq(crude[[14L]])
findMostFreqTerms(tf)
## Document-term matrices:
dtm <- DocumentTermMatrix(crude)
## Most frequent terms for each document:
findMostFreqTerms(dtm)
## Most frequent terms for the first 10 and the second 10 documents,
## respectively:
findMostFreqTerms(dtm, INDEX = rep(1 : 2, each = 10L))

Read Document-Term Matrices

Description

Read document-term matrices stored in special file formats.

Usage

read_dtm_Blei_et_al(file, vocab = NULL)
read_dtm_MC(file, scalingtype = NULL)

Arguments

file

a character string with the name of the file to read.

vocab

a character string with the name of a vocabulary file (giving the terms, one per line), or NULL.

scalingtype

a character string specifying the type of scaling to be used, or NULL (default), in which case the scaling will be inferred from the names of the files with non-zero entries found (see Details).

Details

read_dtm_Blei_et_al reads the (List of Lists type sparse matrix) format employed by the Latent Dirichlet Allocation and Correlated Topic Model C codes by Blei et al. (http://www.cs.columbia.edu/~blei/).

MC is a toolkit for creating vector models from text documents (see https://www.cs.utexas.edu/~dml/software/mc/). It employs a variant of Compressed Column Storage (CCS) sparse matrix format, writing data into several files with suitable names: e.g., a file with ‘_dim’ appended to the base file name stores the matrix dimensions. The non-zero entries are stored in a file the name of which indicates the scaling type used: e.g., ‘_tfx_nz’ indicates scaling by term frequency (‘t’), inverse document frequency (‘f’) and no normalization (‘x’). See ‘README’ in the MC sources for more information.

read_dtm_MC reads such sparse matrix information with argument file giving the path with the base file name.

Value

A document-term matrix.

See Also

read_stm_MC in package slam.


Tokenizers

Description

Predefined tokenizers.

Usage

getTokenizers()

Value

A character vector with tokenizers provided by package tm.

See Also

Boost_tokenizer, MC_tokenizer and scan_tokenizer.

Examples

getTokenizers()

Transformations

Description

Predefined transformations (mappings) which can be used with tm_map.

Usage

getTransformations()

Value

A character vector with transformations provided by package tm.

See Also

removeNumbers, removePunctuation, removeWords, stemDocument, and stripWhitespace.

content_transformer to create custom transformations.

Examples

getTransformations()

Parallelized ‘lapply’

Description

Parallelize applying a function over a list or vector according to the registered parallelization engine.

Usage

tm_parLapply(X, FUN, ...)
tm_parLapply_engine(new)

Arguments

X

A vector (atomic or list), or other objects suitable for the engine in use.

FUN

the function to be applied to each element of X.

...

optional arguments to FUN.

new

an object inheriting from class cluster as created by makeCluster() from package parallel, or a function with formals X, FUN and ..., or NULL corresponding to the default of using no parallelization engine.

Details

Parallelization can be employed to speed up some of the embarrassingly parallel computations performed in package tm, specifically tm_index(), tm_map() on a non-lazy-mapped VCorpus, and TermDocumentMatrix() on a VCorpus or PCorpus.

Functions tm_parLapply() and tm_parLapply_engine() can be used to customize parallelization according to the available resources.

tm_parLapply_engine() is used for getting (with no arguments) or setting (with argument new) the parallelization engine employed (see below for examples).

If an engine is set to an object inheriting from class cluster, tm_parLapply() calls parLapply() with this cluster and the given arguments. If set to a function, tm_parLapply() calls the function with the given arguments. Otherwise, it simply calls lapply().

Hence, parallelization via parLapply() and a default cluster registered via setDefaultCluster() can be achieved via

  tm_parLapply_engine(function(X, FUN, ...)
      parallel::parLapply(NULL, X, FUN, ...))

or re-registering the cluster, say cl, using

  tm_parLapply_engine(cl)

(note that since R version 3.5.0, one can use getDefaultCluster() to get the registered default cluster). Using

  tm_parLapply_engine(function(X, FUN, ...)
      parallel::parLapplyLB(NULL, X, FUN, ...))

or

  tm_parLapply_engine(function(X, FUN, ...)
      parallel::parLapplyLB(cl, X, FUN, ...))

gives load-balancing parallelization with the registered default or given cluster, respectively. To achieve parallelization via forking (on Unix-alike platforms), one can use the above with clusters created by makeForkCluster(), or use

  tm_parLapply_engine(parallel::mclapply)

or

  tm_parLapply_engine(function(X, FUN, ...)
      parallel::mclapply(X, FUN, ..., mc.cores = n))

to use mclapply() with the default or given number n of cores.
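The dispatch rule described in these Details (cluster object, function, or neither) can be sketched with base R and package parallel alone. The functions apply_with_engine and verbose_engine below are hypothetical stand-ins written for this illustration; they are not the package's own code, and the engine is passed explicitly rather than registered.

```r
library(parallel)

# Hypothetical stand-in for the dispatch rule; not the code used by tm.
apply_with_engine <- function(engine, X, FUN, ...) {
    if (inherits(engine, "cluster")) {
        parLapply(engine, X, FUN, ...)   # cluster object: use parLapply()
    } else if (is.function(engine)) {
        engine(X, FUN, ...)              # custom function: call it directly
    } else {
        lapply(X, FUN, ...)              # NULL: no parallelization
    }
}

# A custom engine needs the formals X, FUN and ...; this one only adds
# a message and then runs sequentially.
verbose_engine <- function(X, FUN, ...) {
    message("processing ", length(X), " elements")
    lapply(X, FUN, ...)
}

times_k <- function(i, k) i * k
res_default <- apply_with_engine(NULL, 1:3, times_k, k = 2)
res_custom <- apply_with_engine(verbose_engine, 1:3, times_k, k = 2)
```

Any function with the same formals, such as the mclapply() wrappers shown above, could be passed in place of verbose_engine.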

Value

A list the length of X, with the result of applying FUN together with the ... arguments to each element of X.

See Also

makeCluster(), parLapply(), parLapplyLB(), and mclapply().


Inspect Objects

Description

Inspect, i.e., display detailed information on a corpus, a term-document matrix, or a text document.

Usage

## S3 method for class 'PCorpus'
inspect(x)
## S3 method for class 'VCorpus'
inspect(x)
## S3 method for class 'TermDocumentMatrix'
inspect(x)
## S3 method for class 'TextDocument'
inspect(x)

Arguments

x

Either a corpus, a term-document matrix, or a text document.

Examples

data("crude")
inspect(crude[1:3])
inspect(crude[[1]])
tdm <- TermDocumentMatrix(crude)[1:10, 1:10]
inspect(tdm)

Metadata Management

Description

Accessing and modifying metadata of text documents and corpora.

Usage

## S3 method for class 'PCorpus'
meta(x, tag = NULL, type = c("indexed", "corpus", "local"), ...)
## S3 replacement method for class 'PCorpus'
meta(x, tag, type = c("indexed", "corpus", "local"), ...) <- value
## S3 method for class 'SimpleCorpus'
meta(x, tag = NULL, type = c("indexed", "corpus"), ...)
## S3 replacement method for class 'SimpleCorpus'
meta(x, tag, type = c("indexed", "corpus"), ...) <- value
## S3 method for class 'VCorpus'
meta(x, tag = NULL, type = c("indexed", "corpus", "local"), ...)
## S3 replacement method for class 'VCorpus'
meta(x, tag, type = c("indexed", "corpus", "local"), ...) <- value
## S3 method for class 'PlainTextDocument'
meta(x, tag = NULL, ...)
## S3 replacement method for class 'PlainTextDocument'
meta(x, tag = NULL, ...) <- value
## S3 method for class 'XMLTextDocument'
meta(x, tag = NULL, ...)
## S3 replacement method for class 'XMLTextDocument'
meta(x, tag = NULL, ...) <- value
DublinCore(x, tag = NULL)
DublinCore(x, tag) <- value

Arguments

x

For DublinCore a TextDocument, and for meta a TextDocument or a Corpus.

tag

a character giving the name of a metadatum. No tag corresponds to all available metadata.

type

a character specifying the kind of corpus metadata (see Details).

...

Not used.

value

replacement value.

Details

A corpus has two types of metadata. Corpus metadata ("corpus") contains corpus specific metadata in form of tag-value pairs. Document level metadata ("indexed") contains document specific metadata but is stored in the corpus as a data frame. Document level metadata is typically used for semantic reasons (e.g., classifications of documents form an own entity due to some high-level information like the range of possible values) or for performance reasons (single access instead of extracting metadata of each document). The latter can be seen as a form of indexing, hence the name "indexed". Document metadata ("local") are tag-value pairs directly stored locally at the individual documents.

DublinCore is a convenience wrapper to access and modify the metadata of a text document using the Simple Dublin Core schema (supporting the 15 metadata elements from the Dublin Core Metadata Element Set https://dublincore.org/documents/dces/).

References

Dublin Core Metadata Initiative. https://dublincore.org/

See Also

meta for metadata in package NLP.

Examples

data("crude")
meta(crude[[1]])
DublinCore(crude[[1]])
meta(crude[[1]], tag = "topics")
meta(crude[[1]], tag = "comment") <- "A short comment."
meta(crude[[1]], tag = "topics") <- NULL
DublinCore(crude[[1]], tag = "creator") <- "Ano Nymous"
DublinCore(crude[[1]], tag = "format") <- "XML"
DublinCore(crude[[1]])
meta(crude[[1]])
meta(crude)
meta(crude, type = "corpus")
meta(crude, "labels") <- 21:40
meta(crude)

Visualize a Term-Document Matrix

Description

Visualize correlations between terms of a term-document matrix.

Usage

## S3 method for class 'TermDocumentMatrix'
plot(x,
     terms = sample(Terms(x), 20),
     corThreshold = 0.7,
     weighting = FALSE,
     attrs = list(graph = list(rankdir = "BT"),
                  node = list(shape = "rectangle",
                              fixedsize = FALSE)),
     ...)

Arguments

x

A term-document matrix.

terms

Terms to be plotted. Defaults to 20 randomly chosen terms of the term-document matrix.

corThreshold

Do not plot correlations below this threshold. Defaults to 0.7.

weighting

Define whether the line width corresponds to the correlation.

attrs

Argument passed to the plot method for class graphNEL.

...

Other arguments passed to the graphNEL plot method.

Details

Visualization requires that Bioconductor software package Rgraphviz is installed.

Examples

## Not run: 
data(crude)
tdm <- TermDocumentMatrix(crude,
                          control = list(removePunctuation = TRUE,
                                         removeNumbers = TRUE,
                                         stopwords = TRUE))
plot(tdm, corThreshold = 0.2, weighting = TRUE)
## End(Not run)

Read In a MS Word Document

Description

Return a function which reads in a Microsoft Word document extracting its text.

Usage

readDOC(engine = c("antiword", "executable"), AntiwordOptions = "")

Arguments

engine

a character string for the preferred DOC extraction engine (see Details).

AntiwordOptions

Options passed over to the antiword executable.

Details

Formally this function is a function generator, i.e., it returns a function (which reads in a text document) with a well-defined signature, but can access passed over arguments (e.g., options to antiword) via lexical scoping.

Available DOC extraction engines are as follows.

"antiword"

(default) Antiword utility as provided by the function antiword in package antiword.

"executable"

command line antiword executable which must be installed and accessible on your system. This can convert documents from Microsoft Word version 2, 6, 7, 97, 2000, 2002 and 2003 to plain text. The character vector AntiwordOptions is passed over to the executable.

Value

A function with the following formals:

elem

a list with the named component uri which must hold a valid file name.

language

a string giving the language.

id

Not used.

The function returns a PlainTextDocument representing the text and metadata extracted from elem$uri.

See Also

Reader for basic information on the reader infrastructure employed by package tm.


Read In a Text Document from a Data Frame

Description

Read in a text document from a row in a data frame.

Usage

readDataframe(elem, language, id)

Arguments

elem

a named list with the component content which must hold a data frame with rows as the documents to be read in. The names of the columns holding the text content and the document identifier must be "text" and "doc_id", respectively.

language

a string giving the language.

id

Not used.

Value

A PlainTextDocument representing elem$content.

See Also

Reader for basic information on the reader infrastructure employed by package tm.

Examples

docs <- data.frame(doc_id = c("doc_1", "doc_2"),
                   text = c("This is a text.", "This another one."),
                   stringsAsFactors = FALSE)
ds <- DataframeSource(docs)
elem <- getElem(stepNext(ds))
result <- readDataframe(elem, "en", NULL)
inspect(result)
meta(result)

Read In a PDF Document

Description

Return a function which reads in a portable document format (PDF) document extracting both its text and its metadata.

Usage

readPDF(engine = c("pdftools", "xpdf", "Rpoppler",
                   "ghostscript", "Rcampdf", "custom"),
        control = list(info = NULL, text = NULL))

Arguments

engine

a character string for the preferred PDF extraction engine (see Details).

control

a list of control options for the engine with the named components info and text (see Details).

Details

Formally this function is a function generator, i.e., it returns a function (which reads in a text document) with a well-defined signature, but can access passed over arguments (e.g., the preferred PDF extraction engine and control options) via lexical scoping.

Available PDF extraction engines are as follows.

"pdftools"

(default) Poppler PDF rendering library as provided by the functions pdf_info and pdf_text in package pdftools.

"xpdf"

command line pdfinfo and pdftotext executables which must be installed and accessible on your system. Suitable utilities are provided by the Xpdf (http://www.xpdfreader.com/) PDF viewer or by the Poppler (https://poppler.freedesktop.org/) PDF rendering library.

"Rpoppler"

Poppler PDF rendering library as provided by the functions PDF_info and PDF_text in package Rpoppler.

"ghostscript"

Ghostscript using ‘pdf_info.ps’ and ‘ps2ascii.ps’.

"Rcampdf"

Perl CAM::PDF PDF manipulation library as provided by the functions pdf_info and pdf_text in package Rcampdf, available from the repository at http://datacube.wu.ac.at.

"custom"

custom user-provided extraction engine.

Control parameters for engine "xpdf" are as follows.

info

a character vector specifying options passed over to the pdfinfo executable.

text

a character vector specifying options passed over to the pdftotext executable.

Control parameters for engine "custom" are as follows.

info

a function extracting metadata from a PDF. The function must accept a file path as first argument and must return a named list with the components Author (as character string), CreationDate (of class POSIXlt), Subject (as character string), Title (as character string), and Creator (as character string).

text

a function extracting content from a PDF. The function must accept a file path as first argument and must return a character vector.
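A minimal sketch of what the two control functions for the "custom" engine might look like is given below. The names my_pdf_info and my_pdf_text are hypothetical, and their bodies are placeholders that only satisfy the documented return types; a real engine would call an actual PDF library instead.

```r
# Hypothetical metadata extractor: takes a file path and returns the
# named list that the "custom" engine expects. The values are
# placeholders; a real extractor would read them from the PDF.
my_pdf_info <- function(file, ...) {
    list(Author = "Unknown",
         CreationDate = as.POSIXlt(file.mtime(file)),
         Subject = "",
         Title = basename(file),
         Creator = "my_pdf_info")
}

# Hypothetical content extractor: takes a file path and returns a
# character vector. Reading raw lines is only a placeholder for a
# real PDF text extraction step.
my_pdf_text <- function(file, ...) {
    readLines(file, warn = FALSE)
}

# Build the reader only if package tm is available.
if (requireNamespace("tm", quietly = TRUE)) {
    reader <- tm::readPDF(engine = "custom",
                          control = list(info = my_pdf_info,
                                         text = my_pdf_text))
}
```

Both functions receive the file path as first argument, as required above; the trailing ... is only there so that additional arguments would be ignored.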

Value

A function with the following formals:

elem

a named list with the component uri which must hold a valid file name.

language

a string giving the language.

id

Not used.

The function returns a PlainTextDocument representing the text and metadata extracted from elem$uri.

See Also

Reader for basic information on the reader infrastructure employed by package tm.

Examples

uri <- paste0("file://",
              system.file(file.path("doc", "tm.pdf"), package = "tm"))
engine <- if(nzchar(system.file(package = "pdftools"))) {
    "pdftools"
} else {
    "ghostscript"
}
reader <- readPDF(engine)
pdf <- reader(elem = list(uri = uri), language = "en", id = "id1")
cat(content(pdf)[1])
VCorpus(URISource(uri, mode = ""),
        readerControl = list(reader = readPDF(engine = "ghostscript")))

Read In a Text Document

Description

Read in a text document without knowledge about its internal structure and possible available metadata.

Usage

readPlain(elem, language, id)

Arguments

elem

a named list with the component content which must hold the document to be read in.

language

a string giving the language.

id

a character giving a unique identifier for the created text document.

Value

A PlainTextDocument representing elem$content. The argument id is used as fallback if elem$uri is null.

See Also

Reader for basic information on the reader infrastructure employed by package tm.

Examples

docs <- c("This is a text.", "This another one.")
vs <- VectorSource(docs)
elem <- getElem(stepNext(vs))
(result <- readPlain(elem, "en", "id1"))
meta(result)

Read In a Reuters Corpus Volume 1 Document

Description

Read in a Reuters Corpus Volume 1 XML document.

Usage

readRCV1(elem, language, id)
readRCV1asPlain(elem, language, id)

Arguments

elem

a named list with the component content which must hold the document to be read in.

language

a string giving the language.

id

Not used.

Value

An XMLTextDocument for readRCV1, or a PlainTextDocument for readRCV1asPlain, representing the text and metadata extracted from elem$content.

References

Lewis, D. D.; Yang, Y.; Rose, T.; and Li, F. (2004). RCV1: A New Benchmark Collection for Text Categorization Research. Journal of Machine Learning Research, 5, 361–397. https://www.jmlr.org/papers/volume5/lewis04a/lewis04a.pdf

See Also

Reader for basic information on the reader infrastructure employed by package tm.

Examples

f <- system.file("texts", "rcv1_2330.xml", package = "tm")
f_bin <- readBin(f, raw(), file.size(f))
rcv1 <- readRCV1(elem = list(content = f_bin), language = "en", id = "id1")
content(rcv1)
meta(rcv1)

Read In a Reuters-21578 XML Document

Description

Read in a Reuters-21578 XML document.

Usage

readReut21578XML(elem, language, id)
readReut21578XMLasPlain(elem, language, id)

Arguments

elem

a named list with the component content which must hold the document to be read in.

language

a string giving the language.

id

Not used.

Value

An XMLTextDocument for readReut21578XML, or a PlainTextDocument for readReut21578XMLasPlain, representing the text and metadata extracted from elem$content.

References

Lewis, David (1997). Reuters-21578 Text Categorization Collection Distribution. UCI Machine Learning Repository. doi:10.24432/C52G6M.

See Also

Reader for basic information on the reader infrastructure employed by package tm.


Read In a POS-Tagged Word Text Document

Description

Return a function which reads in a text document containing POS-tagged words.

Usage

readTagged(...)

Arguments

...

Arguments passed to TaggedTextDocument.

Details

Formally this function is a function generator, i.e., it returns a function (which reads in a text document) with a well-defined signature, but can access passed over arguments (...) via lexical scoping.

Value

A function with the following formals:

elem

a named list with the component content which must hold the document to be read in or the component uri holding a connection object or a character string.

language

a string giving the language.

id

a character giving a unique identifier for the created text document.

The function returns a TaggedTextDocument representing the text and metadata extracted from elem$content or elem$uri. The argument id is used as fallback if elem$uri is null.

See Also

Reader for basic information on the reader infrastructure employed by package tm.

Examples

# See http://www.nltk.org/book/ch05.html or file ca01 in the Brown corpus
x <- paste("The/at grand/jj jury/nn commented/vbd on/in a/at number/nn of/in",
           "other/ap topics/nns ,/, among/in them/ppo the/at Atlanta/np and/cc",
           "Fulton/np-tl County/nn-tl purchasing/vbg departments/nns which/wdt",
           "it/pps said/vbd ``/`` are/ber well/ql operated/vbn and/cc follow/vb",
           "generally/rb accepted/vbn practices/nns which/wdt inure/vb to/in the/at",
           "best/jjt interest/nn of/in both/abx governments/nns ''/'' ./.")
vs <- VectorSource(x)
elem <- getElem(stepNext(vs))
(doc <- readTagged()(elem, language = "en", id = "id1"))
tagged_words(doc)

Read In an XML Document

Description

Return a function which reads in an XML document. The structure of the XML document is described with a specification.

Usage

readXML(spec, doc)

Arguments

spec

A named list of lists each containing two components. The constructed reader will map each list entry to the content or metadatum of the text document as specified by the named list entry. Valid names include content to access the document's content, and character strings which are mapped to metadata entries.

Each list entry must consist of two components: the first must be a string describing the type of the second argument, and the second is the specification entry. Valid combinations are:

type = "node", spec = "XPathExpression"

The XPath (1.0) expression spec extracts information from an XML node.

type = "function", spec = function(doc) ...

The function spec is called, passing over the XML document (as delivered by read_xml from package xml2) as first argument.

type = "unevaluated", spec = "String"

The character vector spec is returned without modification.

doc

An (empty) document of some subclass of TextDocument.
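To make the shape of spec concrete, here is a sketch of a specification for a made-up XML format. The element names (Poem, Author, Title, Line), the metadata values, and the object names poem_spec and read_poem are assumptions chosen for illustration; only the three type strings and the use of content come from the description above.

```r
# Hypothetical specification for an XML format with <Author>, <Title>
# and <Line> elements below a <Poem> root; the element names are
# assumptions made for this example.
poem_spec <- list(
    # "node": XPath 1.0 expression evaluated on the XML document
    author = list("node", "/Poem/Author"),
    heading = list("node", "/Poem/Title"),
    # "function": called with the parsed XML document as first argument
    datetimestamp = list("function",
                         function(doc) as.POSIXlt(Sys.time(), tz = "GMT")),
    # "unevaluated": the string is stored without modification
    origin = list("unevaluated", "poem collection"),
    # content: the text of the document itself
    content = list("node", "//Line")
)

# Build the reader only if package tm is available.
if (requireNamespace("tm", quietly = TRUE)) {
    read_poem <- tm::readXML(spec = poem_spec,
                             doc = tm::PlainTextDocument())
}
```

Each entry has exactly two components, the type string first and the specification second.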

Details

Formally this function is a function generator, i.e., it returns a function (which reads in a text document) with a well-defined signature, but can access passed over arguments (e.g., the specification) via lexical scoping.

Value

A function with the following formals:

elem

a named list with the component content which must hold the document to be read in.

language

a string giving the language.

id

a character giving a unique identifier for the created text document.

The function returns doc augmented by the parsed information as described by spec out of the XML file in elem$content. The arguments language and id are used as fallback: language if no corresponding metadata entry is found in elem$content, and id if no corresponding metadata entry is found in elem$content and if elem$uri is null.

See Also

Reader for basic information on the reader infrastructure employed by package tm.

Vignette 'Extensions: How to Handle Custom File Formats', and XMLSource.


Remove Numbers from a Text Document

Description

Remove numbers from a text document.

Usage

## S3 method for class 'character'
removeNumbers(x, ucp = FALSE, ...)
## S3 method for class 'PlainTextDocument'
removeNumbers(x, ...)

Arguments

x

a character vector or text document.

ucp

a logical specifying whether to use Unicode character properties for determining digit characters. If FALSE (default), characters in the ASCII [:digit:] class (i.e., the decimal digits from 0 to 9) are taken; if TRUE, the characters with Unicode general category Nd (Decimal_Number).

...

arguments to be passed to or from methods; in particular, from the PlainTextDocument method to the character method.

Value

The text document without numbers.

See Also

getTransformations to list available transformation (mapping) functions.

https://unicode.org/reports/tr44/#General_Category_Values.

Examples

data("crude")
crude[[1]]
removeNumbers(crude[[1]])
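The difference between the two digit classes selected by ucp can be illustrated with base R regular expressions. The helper strip_digits below is hypothetical and only mirrors the documented distinction; it is not the function used by package tm.

```r
# Sketch of the two digit classes using base R regular expressions;
# strip_digits is a hypothetical helper, not a tm function.
strip_digits <- function(x, ucp = FALSE) {
    if (ucp) {
        # Unicode general category Nd (uses the PCRE engine).
        gsub("\\p{Nd}+", "", x, perl = TRUE)
    } else {
        # ASCII decimal digits 0 to 9 only.
        gsub("[0-9]+", "", x)
    }
}

ascii_result <- strip_digits("Room 101, floor 3")
# "\u0663" is ARABIC-INDIC DIGIT THREE, a non-ASCII decimal digit.
ucp_result <- strip_digits("a\u0663b1", ucp = TRUE)
```

Only the digits are removed; surrounding whitespace and punctuation stay in place.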

Remove Punctuation Marks from a Text Document

Description

Remove punctuation marks from a text document.

Usage

## S3 method for class 'character'
removePunctuation(x,
                  preserve_intra_word_contractions = FALSE,
                  preserve_intra_word_dashes = FALSE,
                  ucp = FALSE, ...)
## S3 method for class 'PlainTextDocument'
removePunctuation(x, ...)

Arguments

x

a character vector or text document.

preserve_intra_word_contractions

a logical specifying whether intra-word contractions should be kept.

preserve_intra_word_dashes

a logical specifying whether intra-word dashes should be kept.

ucp

a logical specifying whether to use Unicode character properties for determining punctuation characters. If FALSE (default), characters in the ASCII [:punct:] class are taken; if TRUE, the characters with Unicode general category P (Punctuation).

...

arguments to be passed to or from methods; in particular, from the PlainTextDocument method to the character method.

Value

The character or text document x without punctuation marks (besides intra-word contractions (‘'’) and intra-word dashes (‘-’) if preserve_intra_word_contractions and preserve_intra_word_dashes are set, respectively).

See Also

getTransformations to list available transformation (mapping) functions.

regex shows the class [:punct:] of punctuation characters.

https://unicode.org/reports/tr44/#General_Category_Values.

Examples

data("crude")
inspect(crude[[14]])
inspect(removePunctuation(crude[[14]]))
inspect(removePunctuation(crude[[14]],
                          preserve_intra_word_contractions = TRUE,
                          preserve_intra_word_dashes = TRUE))
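One way to keep intra-word dashes while removing other punctuation is to protect them with a placeholder first. The base R sketch below shows this idea on a plain string; strip_punct and its keep_dashes argument are hypothetical names, and the approach is an assumption about how such an option could work rather than a description of the package internals.

```r
# Hypothetical helper illustrating the effect of preserving intra-word
# dashes; not a tm function.
strip_punct <- function(x, keep_dashes = FALSE) {
    if (keep_dashes) {
        # Protect dashes between word characters with a placeholder.
        x <- gsub("(\\w)-(\\w)", "\\1\001\\2", x, perl = TRUE)
    }
    # Remove everything in the [:punct:] class.
    x <- gsub("[[:punct:]]+", "", x)
    if (keep_dashes) {
        # Turn the placeholders back into dashes.
        x <- gsub("\001", "-", x, fixed = TRUE)
    }
    x
}

plain <- strip_punct("well-known, right?")
kept <- strip_punct("well-known, right?", keep_dashes = TRUE)
```

The same placeholder idea would extend to apostrophes in contractions.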

Remove Sparse Terms from a Term-Document Matrix

Description

Remove sparse terms from a document-term or term-document matrix.

Usage

removeSparseTerms(x, sparse)

Arguments

x

A DocumentTermMatrix or a TermDocumentMatrix.

sparse

A numeric for the maximal allowed sparsity, strictly between 0 and 1.

Value

A term-document matrix from which those terms of x are removed whose proportion of empty elements (i.e., documents in which the term occurs 0 times) is at least sparse. That is, the resulting matrix contains only terms with a sparse factor of less than sparse.

Examples

data("crude")
tdm <- TermDocumentMatrix(crude)
removeSparseTerms(tdm, 0.2)
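The sparsity criterion can be worked through by hand on a small dense matrix. The sketch below uses base R only and made-up counts; it assumes that a term's sparse factor is the fraction of documents in which the term does not occur.

```r
# Toy term-document count matrix (hypothetical data): rows are terms,
# columns are documents.
m <- matrix(c(2, 1, 1, 3,
              0, 4, 0, 1,
              0, 0, 5, 0),
            nrow = 3, byrow = TRUE,
            dimnames = list(Terms = c("oil", "price", "opec"),
                            Docs = paste0("d", 1:4)))

# Fraction of documents in which each term is absent.
sparsity <- rowMeans(m == 0)

# Keep only the terms whose sparsity is strictly below the threshold.
sparse <- 0.6
kept <- m[sparsity < sparse, , drop = FALSE]
rownames(kept)
```

Here "opec" is absent from three of four documents (sparsity 0.75) and is dropped, while "price" (0.5) and "oil" (0) are kept.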

Remove Words from a Text Document

Description

Remove words from a text document.

Usage

## S3 method for class 'character'
removeWords(x, words)
## S3 method for class 'PlainTextDocument'
removeWords(x, ...)

Arguments

x

A character or text document.

words

A character vector giving the words to be removed.

...

passed over argument words.

Value

The character or text document without the specified words.

See Also

getTransformations to list available transformation (mapping) functions.

remove_stopwords provided by package tau.

Examples

data("crude")
crude[[1]]
removeWords(crude[[1]], stopwords("english"))

Complete Stems

Description

Heuristically complete stemmed words.

Usage

stemCompletion(x,
               dictionary,
               type = c("prevalent", "first", "longest",
                        "none", "random", "shortest"))

Arguments

x

A character vector of stems to be completed.

dictionary

A Corpus or character vector to be searched for possible completions.

type

A character naming the heuristics to be used:

prevalent

Default. Takes the most frequent match as completion.

first

Takes the first found completion.

longest

Takes the longest completion in terms of characters.

none

Is the identity.

random

Takes some completion.

shortest

Takes the shortest completion in terms of characters.

Value

A character vector with completed words.

References

Ingo Feinerer (2010). Analysis and Algorithms for Stemming Inversion. Information Retrieval Technology: 6th Asia Information Retrieval Societies Conference, AIRS 2010, Taipei, Taiwan, December 1–3, 2010. Proceedings, volume 6458 of Lecture Notes in Computer Science, pages 290–299. Springer-Verlag, December 2010.

Examples

data("crude")
stemCompletion(c("compan", "entit", "suppl"), crude)
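The heuristics listed under type can be illustrated with a simplified base R sketch that completes a single stem against a character vector. The function complete_stem and the sample dictionary are hypothetical; the sketch assumes that candidate completions are the dictionary words starting with the stem, and it covers only four of the six heuristics.

```r
# Hypothetical dictionary of full words, with repetitions.
dictionary <- c("company", "companies", "company", "compete", "supply")

# Simplified sketch of stem completion for one stem; not the
# implementation used by tm.
complete_stem <- function(stem, dictionary,
                          type = c("prevalent", "first",
                                   "longest", "shortest")) {
    type <- match.arg(type)
    # Candidate completions: dictionary words that start with the stem.
    candidates <- dictionary[startsWith(dictionary, stem)]
    if (length(candidates) == 0L) return(NA_character_)
    switch(type,
           prevalent = names(which.max(table(candidates))),
           first = candidates[1L],
           longest = candidates[which.max(nchar(candidates))],
           shortest = candidates[which.min(nchar(candidates))])
}

prevalent <- complete_stem("compan", dictionary)
longest <- complete_stem("compan", dictionary, type = "longest")
shortest <- complete_stem("compan", dictionary, type = "shortest")
```

With this dictionary, the stem "compan" has the candidates "company" (twice) and "companies" (once), so the heuristics can disagree.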

Stem Words

Description

Stem words in a text document using Porter's stemming algorithm.

Usage

## S3 method for class 'character'
stemDocument(x, language = "english")
## S3 method for class 'PlainTextDocument'
stemDocument(x, language = meta(x, "language"))

Arguments

x

A character vector or text document.

language

A string giving the language for stemming.

Details

Stemming requires that package SnowballC is installed. The argument language is passed over to wordStem as the name of the Snowball stemmer.

Examples

data("crude")
inspect(crude[[1]])
if(requireNamespace("SnowballC")) {
    inspect(stemDocument(crude[[1]]))
}

Stopwords

Description

Return various kinds of stopwords with support for different languages.

Usage

stopwords(kind = "en")

Arguments

kind

A character string identifying the desired stopword list.

Details

Available stopword lists are:

catalan

Catalan stopwords (obtained from http://latel.upf.edu/morgana/altres/pub/ca_stop.htm),

romanian

Romanian stopwords (extracted from http://snowball.tartarus.org/otherapps/romanian/romanian1.tgz),

SMART

English stopwords from the SMART information retrieval system (as documented in Appendix 11 of https://jmlr.csail.mit.edu/papers/volume5/lewis04a/) (which coincides with the stopword list used by the MC toolkit (https://www.cs.utexas.edu/~dml/software/mc/)),

and a set of stopword lists from the Snowball stemmer project in different languages (obtained from ‘http://svn.tartarus.org/snowball/trunk/website/algorithms/*/stop.txt’). Supported languages are danish, dutch, english, finnish, french, german, hungarian, italian, norwegian, portuguese, russian, spanish, and swedish. Language names are case sensitive. Alternatively, their IETF language tags may be used.

Value

A character vector containing the requested stopwords. An error is raised if no stopwords are available for the requested kind.

Examples

stopwords("en")
stopwords("SMART")
stopwords("german")

Strip Whitespace from a Text Document

Description

Strip extra whitespace from a text document. Multiple whitespace characters are collapsed to a single blank.

Usage

## S3 method for class 'PlainTextDocument'
stripWhitespace(x, ...)

Arguments

x

A text document.

...

Not used.

Value

The text document with multiple whitespace characters collapsed to a single blank.

See Also

getTransformations to list available transformation (mapping) functions.

Examples

data("crude")
crude[[1]]
stripWhitespace(crude[[1]])

Term Frequency Vector

Description

Generate a term frequency vector from a text document.

Usage

termFreq(doc, control = list())

Arguments

doc

An object inheriting from TextDocument or a character vector.

control

A list of control options which override default settings.

First, the following two options are processed.

tokenize

A function tokenizing a TextDocument into single tokens, a Span_Tokenizer, Token_Tokenizer, or a string matching one of the predefined tokenization functions:

"Boost"

for Boost_tokenizer, or

"MC"

for MC_tokenizer, or

"scan"

for scan_tokenizer, or

"words"

for words.

Defaults to words.

tolower

Either a logical value indicating whether characters should be translated to lower case or a custom function converting characters to lower case. Defaults to tolower.

Next, a set of options which are sensitive to the order of occurrence in the control list. Options are processed in the same order as specified. User-specified options have precedence over the default ordering so that first all user-specified options and then all remaining options (with the default settings and in the order as listed below) are processed.

language

A character giving the language (preferably as IETF language tags, see language in package NLP) to be used for stopwords and stemming if not provided by doc.

removePunctuation

A logical value indicating whether punctuation characters should be removed from doc, a custom function which performs punctuation removal, or a list of arguments for removePunctuation. Defaults to FALSE.

removeNumbers

A logical value indicating whether numbers should be removed from doc or a custom function for number removal. Defaults to FALSE.

stopwords

Either a Boolean value indicating stopword removal using default language specific stopword lists shipped with this package, a character vector holding custom stopwords, or a custom function for stopword removal. Defaults to FALSE.

stemming

Either a Boolean value indicating whether tokens should be stemmed or a custom stemming function. Defaults to FALSE.

Finally, the following options are processed in the given order.

dictionary

A character vector to be tabulated against. No other terms will be listed in the result. Defaults to NULL which means that all terms in doc are listed.

bounds

A list with a tag local whose value must be an integer vector of length 2. Terms that appear less often in doc than the lower bound bounds$local[1] or more often than the upper bound bounds$local[2] are discarded. Defaults to list(local = c(1, Inf)) (i.e., every token will be used).

wordLengths

An integer vector of length 2. Words shorter than the minimum word length wordLengths[1] or longer than the maximum word length wordLengths[2] are discarded. Defaults to c(3, Inf), i.e., a minimum word length of 3 characters.

Value

A table of class c("term_frequency", "integer") with term frequencies as values and tokens as names.

See Also

getTokenizers

Examples

data("crude")
termFreq(crude[[14]])
if(requireNamespace("SnowballC")) {
    strsplit_space_tokenizer <- function(x)
        unlist(strsplit(as.character(x), "[[:space:]]+"))
    ctrl <- list(tokenize = strsplit_space_tokenizer,
                 removePunctuation =
                     list(preserve_intra_word_dashes = TRUE),
                 stopwords = c("reuter", "that"),
                 stemming = TRUE,
                 wordLengths = c(4, Inf))
    termFreq(crude[[14]], control = ctrl)
}
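The overall flow (tokenize, convert to lower case, filter by word length, tabulate) can be sketched in base R on a plain string. The function simple_term_freq is hypothetical and deliberately leaves out punctuation removal, stopwords, stemming, dictionary and bounds; it only illustrates the first and last steps under the documented default wordLengths = c(3, Inf).

```r
# Hypothetical, heavily simplified stand-in for the processing order
# described above; not the implementation used by tm.
simple_term_freq <- function(txt, wordLengths = c(3, Inf)) {
    # Tokenize on whitespace, then translate to lower case.
    tokens <- unlist(strsplit(txt, "[[:space:]]+"))
    tokens <- tolower(tokens)
    # Discard tokens outside the allowed word length range.
    len <- nchar(tokens)
    tokens <- tokens[len >= wordLengths[1] & len <= wordLengths[2]]
    # Tabulate the remaining tokens.
    table(tokens)
}

tf <- simple_term_freq("Oil prices rise as oil supply falls")
tf
```

The two-character token "as" is dropped by the minimum word length, and "Oil" and "oil" are counted together after lower-casing.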

Combine Corpora, Documents, Term-Document Matrices, and Term Frequency Vectors

Description

Combine several corpora into a single one, combine multiple documents into a corpus, combine multiple term-document matrices into a single one, or combine multiple term frequency vectors into a single term-document matrix.

Usage

## S3 method for class 'VCorpus'
c(..., recursive = FALSE)
## S3 method for class 'TextDocument'
c(..., recursive = FALSE)
## S3 method for class 'TermDocumentMatrix'
c(..., recursive = FALSE)
## S3 method for class 'term_frequency'
c(..., recursive = FALSE)

Arguments

...

Corpora, text documents, term-document matrices, or term frequency vectors.

recursive

Not used.

See Also

VCorpus, TextDocument, TermDocumentMatrix, and termFreq.

Examples

data("acq")
data("crude")
meta(acq, "comment", type = "corpus") <- "Acquisitions"
meta(crude, "comment", type = "corpus") <- "Crude oil"
meta(acq, "acqLabels") <- 1:50
meta(acq, "jointLabels") <- 1:50
meta(crude, "crudeLabels") <- letters[1:20]
meta(crude, "jointLabels") <- 1:20
c(acq, crude)
meta(c(acq, crude), type = "corpus")
meta(c(acq, crude))
c(acq[[30]], crude[[10]])
c(TermDocumentMatrix(acq), TermDocumentMatrix(crude))

Filter and Index Functions on Corpora

Description

Interface to apply filter and index functions to corpora.

Usage

## S3 method for class 'PCorpus'
tm_filter(x, FUN, ...)
## S3 method for class 'SimpleCorpus'
tm_filter(x, FUN, ...)
## S3 method for class 'VCorpus'
tm_filter(x, FUN, ...)
## S3 method for class 'PCorpus'
tm_index(x, FUN, ...)
## S3 method for class 'SimpleCorpus'
tm_index(x, FUN, ...)
## S3 method for class 'VCorpus'
tm_index(x, FUN, ...)

Arguments

x

A corpus.

FUN

a filter function taking a text document or a string (if x is a SimpleCorpus) as input and returning the logical value TRUE or FALSE.

...

arguments to FUN.

Value

tm_filter returns a corpus containing documents where FUN matches, whereas tm_index only returns the corresponding indices.
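The relationship between the two functions can be sketched in base R on a plain list of strings. The helpers index_docs and filter_docs are hypothetical stand-ins, not part of tm, and the sketch assumes that the index is a logical vector with one entry per document.

```r
# Hypothetical stand-ins for tm_index() and tm_filter() on a plain list.
index_docs <- function(docs, FUN, ...) {
    # One TRUE/FALSE per document, as returned by the filter function
    vapply(docs, FUN, logical(1), ...)
}
filter_docs <- function(docs, FUN, ...) {
    # Filtering is indexing followed by subsetting
    docs[index_docs(docs, FUN, ...)]
}

docs <- list(a = "The company raised prices.",
             b = "Crude oil output fell.",
             c = "The copany (sic) declined to comment.")
has_company <- function(x) grepl("co[m]?pany", x)

index_docs(docs, has_company)          # a TRUE, b FALSE, c TRUE
names(filter_docs(docs, has_company))  # "a" "c"
```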

Examples

data("crude")
# Full-text search
tm_filter(crude, FUN = function(x) any(grep("co[m]?pany", content(x))))

Transformations on Corpora

Description

Interface to apply transformation functions (also denoted as mappings) to corpora.

Usage

## S3 method for class 'PCorpus'
tm_map(x, FUN, ...)
## S3 method for class 'SimpleCorpus'
tm_map(x, FUN, ...)
## S3 method for class 'VCorpus'
tm_map(x, FUN, ..., lazy = FALSE)

Arguments

x

A corpus.

FUN

a transformation function taking a text document (a character vector when x is a SimpleCorpus) as input and returning a text document (a character vector of the same length as the input vector for SimpleCorpus). The function content_transformer can be used to create a wrapper to get and set the content of text documents.

...

arguments to FUN.

lazy

a logical. Lazy mappings are mappings which are delayed until the content is accessed. This is useful for large corpora if only a few documents will be accessed, since it avoids the computationally expensive application of the mapping to all elements in the corpus.

Value

A corpus with FUN applied to each document in x. In case of lazy mappings only internal flags are set. Access of individual documents triggers the execution of the corresponding transformation function.

Note

Lazy transformations change R's standard evaluation semantics.
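The idea behind lazy mappings can be illustrated with a toy container in base R. This is a minimal sketch of deferred evaluation in general, not tm's actual implementation: lazy_map, get_doc, and the list layout with docs and pending are hypothetical names chosen for illustration.

```r
# Hypothetical lazy container: stores documents plus pending mappings.
lazy_map <- function(corpus, FUN) {
    corpus$pending <- c(corpus$pending, list(FUN))  # only record the mapping
    corpus
}
get_doc <- function(corpus, i) {
    doc <- corpus$docs[[i]]
    for (f in corpus$pending) doc <- f(doc)  # apply mappings on access
    doc
}

# Count how often the (potentially expensive) transformation really runs
calls <- 0
counting_toupper <- function(x) { calls <<- calls + 1; toupper(x) }

corpus <- list(docs = list("crude oil", "opec meeting", "gas prices"),
               pending = list())
corpus <- lazy_map(corpus, counting_toupper)
calls                # still 0: nothing has been transformed yet
get_doc(corpus, 1)   # "CRUDE OIL"
calls                # 1: only the accessed document was processed
```

The sketch also shows why lazy mappings differ from ordinary evaluation: the result of the mapping does not exist until a document is accessed.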

See Also

getTransformations for available transformations.

Examples

data("crude")
## Document access triggers the stemming function
## (i.e., all other documents are not stemmed yet)
if(requireNamespace("SnowballC")) {
    tm_map(crude, stemDocument, lazy = TRUE)[[1]]
}
## Use wrapper to apply character processing function
tm_map(crude, content_transformer(tolower))
## Generate a custom transformation function which takes the heading as new content
headings <- function(x)
    PlainTextDocument(meta(x, "heading"),
                      id = meta(x, "id"),
                      language = meta(x, "language"))
inspect(tm_map(crude, headings))

Combine Transformations

Description

Fold multiple transformations (mappings) into a single one.

Usage

tm_reduce(x, tmFuns, ...)

Arguments

x

A corpus.

tmFuns

A list of tm transformations.

...

Arguments to the individual transformations.

Value

A single tm transformation function obtained by folding tmFuns from right to left (via Reduce(..., right = TRUE)).
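Folding from right to left means that the last function in the list is applied first. The following base R sketch shows this with three toy string functions; apply_right is a hypothetical helper for illustration and is not claimed to match tm_reduce's source.

```r
# Three toy "transformations" on a single string
funs <- list(toupper,
             function(s) paste0(s, "!"),
             trimws)

# Right fold: the last function in the list is applied first,
# i.e. toupper(paste0(trimws(x), "!")) for the list above
apply_right <- function(x, fs) {
    Reduce(function(f, acc) f(acc), fs, x, right = TRUE)
}

apply_right("  hello ", funs)        # "HELLO!"
apply_right("  hello ", rev(funs))   # "HELLO !"
```

Reversing the list changes the result because trimws then runs last, after the exclamation mark has already been appended to the untrimmed string.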

See Also

Reduce for R's internal folding/accumulation mechanism, and getTransformations to list available transformation (mapping) functions.

Examples

data(crude)
crude[[1]]
skipWords <- function(x) removeWords(x, c("it", "the"))
funs <- list(stripWhitespace,
             skipWords,
             removePunctuation,
             content_transformer(tolower))
tm_map(crude, FUN = tm_reduce, tmFuns = funs)[[1]]

Compute Score for Matching Terms

Description

Compute a score based on the number of matching terms.

Usage

## S3 method for class 'DocumentTermMatrix'
tm_term_score(x, terms, FUN = row_sums)
## S3 method for class 'PlainTextDocument'
tm_term_score(x, terms, FUN = function(x) sum(x, na.rm = TRUE))
## S3 method for class 'term_frequency'
tm_term_score(x, terms, FUN = function(x) sum(x, na.rm = TRUE))
## S3 method for class 'TermDocumentMatrix'
tm_term_score(x, terms, FUN = col_sums)

Arguments

x

Either a PlainTextDocument, a term frequency as returned by termFreq, a DocumentTermMatrix, or a TermDocumentMatrix.

terms

A character vector of terms to be matched.

FUN

A function computing a score from the number of terms matching in x.

Value

A score as computed by FUN from the number of matching terms in x.
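For a term frequency vector, the score can be sketched in base R as follows. The helper term_score is hypothetical and not part of tm; it assumes the input is a named integer vector with tokens as names, which is the shape documented for termFreq.

```r
# Term frequencies as a named integer vector
tf <- c(company = 3L, shares = 5L, change = 1L, said = 7L)

# Hypothetical helper: keep the counts of the requested terms, then score them
term_score <- function(tf, terms, FUN = function(x) sum(x, na.rm = TRUE)) {
    FUN(tf[names(tf) %in% terms])
}

term_score(tf, c("company", "change"))             # 4
term_score(tf, c("company", "merger"))             # 3 (absent terms add nothing)
term_score(tf, c("company", "change"), FUN = max)  # 3
```

Passing a different FUN, such as max, changes how the matching counts are aggregated into a single score.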

Examples

data("acq")
tm_term_score(acq[[1]], c("company", "change"))
## Not run: 
## Test for positive and negative sentiments
## install.packages("tm.lexicon.GeneralInquirer", repos="http://datacube.wu.ac.at", type="source")
require("tm.lexicon.GeneralInquirer")
sapply(acq[1:10], tm_term_score, terms_in_General_Inquirer_categories("Positiv"))
sapply(acq[1:10], tm_term_score, terms_in_General_Inquirer_categories("Negativ"))
tm_term_score(TermDocumentMatrix(acq[1:10],
                                 control = list(removePunctuation = TRUE)),
              terms_in_General_Inquirer_categories("Positiv"))
## End(Not run)

Tokenizers

Description

Tokenize a document or character vector.

Usage

Boost_tokenizer(x)
MC_tokenizer(x)
scan_tokenizer(x)

Arguments

x

A character vector, or an object that can be coerced to character by as.character.

Details

The quality and correctness of a tokenization algorithm depend heavily on the context and application scenario. Relevant factors are the language of the underlying text and the notions of whitespace (which can vary with the encoding and the language) and punctuation marks. Consequently, for superior results you probably need a custom tokenization function.

Boost_tokenizer

Uses the Boost (https://www.boost.org) Tokenizer (via Rcpp).

MC_tokenizer

Implements the functionality of the tokenizer in the MC toolkit (https://www.cs.utexas.edu/~dml/software/mc/).

scan_tokenizer

Simulates scan(..., what = "character").
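What a scan-style tokenization produces can be tried directly in base R. The arguments quote = "" and quiet = TRUE are choices made for this sketch and are not taken from the description above.

```r
x <- "Crude oil prices\n  rose   sharply."

# quote = "" disables quote handling, so apostrophes do not swallow text
scan(text = x, what = "character", quote = "", quiet = TRUE)

# For this input, splitting on runs of whitespace gives the same tokens
unlist(strsplit(x, "[[:space:]]+"))
```

Both calls return the five tokens "Crude", "oil", "prices", "rose", and "sharply."; note that punctuation stays attached to the neighbouring word.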

Value

A character vector consisting of tokens obtained by tokenization of x.

See Also

getTokenizers to list tokenizers provided by package tm.

Regexp_Tokenizer for tokenizers using regular expressions provided by package NLP.

tokenize for a simple regular expression based tokenizer provided by package tau.

tokenizers for a collection of tokenizers provided by package tokenizers.

Examples

data("crude")
Boost_tokenizer(crude[[1]])
MC_tokenizer(crude[[1]])
scan_tokenizer(crude[[1]])
strsplit_space_tokenizer <- function(x)
    unlist(strsplit(as.character(x), "[[:space:]]+"))
strsplit_space_tokenizer(crude[[1]])

Weight Binary

Description

Binary weight a term-document matrix.

Usage

weightBin(m)

Arguments

m

A TermDocumentMatrix in term frequency format.

Details

Formally this function is of class WeightingFunction with the additional attributes name and acronym.

Value

The weighted matrix.


SMART Weightings

Description

Weight a term-document matrix according to a combination of weights specified in SMART notation.

Usage

weightSMART(m, spec = "nnn", control = list())

Arguments

m

A TermDocumentMatrix in term frequency format.

spec

a character string consisting of three characters. The first letter specifies a term frequency schema, the second a document frequency schema, and the third a normalization schema. See Details for available built-in schemata.

control

a list of control parameters. See Details.

Details

Formally this function is of class WeightingFunction with the additional attributes name and acronym.

The first letter of spec specifies a weighting schema for term frequencies of m:

"n"

(natural) $\mathit{tf}_{i,j}$ counts the number of occurrences $n_{i,j}$ of a term $t_i$ in a document $d_j$. The input term-document matrix m is assumed to be in this standard term frequency format already.

"l"

(logarithm) is defined as $1 + \log_2(\mathit{tf}_{i,j})$.

"a"

(augmented) is defined as $0.5 + \frac{0.5 * \mathit{tf}_{i,j}}{\max_i(\mathit{tf}_{i,j})}$.

"b"

(boolean) is defined as 1 if $\mathit{tf}_{i,j} > 0$ and 0 otherwise.

"L"

(log average) is defined as $\frac{1 + \log_2(\mathit{tf}_{i,j})}{1+\log_2(\mathrm{ave}_{i\in j}(\mathit{tf}_{i,j}))}$.

The second letter of spec specifies a weighting schema of document frequencies for m:

"n"

(no) is defined as 1.

"t"

(idf) is defined as $\log_2 \frac{N}{\mathit{df}_t}$ where $N$ is the number of documents and $\mathit{df}_t$ denotes the number of documents in which term $t$ occurs.

"p"

(prob idf) is defined as $\max(0, \log_2(\frac{N - \mathit{df}_t}{\mathit{df}_t}))$.

The third letter of spec specifies a schema for normalization of m:

"n"

(none) is defined as 1.

"c"

(cosine) is defined as $\sqrt{\mathrm{col\_sums}(m ^ 2)}$.

"u"

(pivoted unique) is defined as $\mathit{slope} * \sqrt{\mathrm{col\_sums}(m ^ 2)} + (1 - \mathit{slope}) * \mathit{pivot}$ where both slope and pivot must be set via named tags in the control list.

"b"

(byte size) is defined as $\frac{1}{\mathit{CharLength}^\alpha}$. The parameter $\alpha$ must be set via the named tag alpha in the control list.

The final result is defined by multiplication of the chosen term frequency component with the chosen document frequency component with the chosen normalization component.
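As a worked illustration, the combination "ltn" (logarithmic term frequency, idf, no normalization) can be computed by hand in base R on a small dense matrix. This is a sketch under stated assumptions rather than the package's code: it assumes that zero counts stay zero under the logarithmic schema and that $\mathit{df}_t$ is the number of documents containing the term.

```r
# Toy term-document matrix: rows are terms, columns are documents
m <- matrix(c(2, 0, 4,
              1, 1, 1,
              0, 8, 0),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("oil", "price", "opec"), c("d1", "d2", "d3")))

# First letter "l": 1 + log2(tf) for non-zero counts (assumption: zeros stay zero)
tf <- ifelse(m > 0, 1 + log2(m), 0)

# Second letter "t": log2(N / df), df = number of documents containing the term
N   <- ncol(m)
df  <- rowSums(m > 0)
idf <- log2(N / df)

# Third letter "n": component is 1, so the result is the product of the first two
w <- tf * idf   # idf is recycled down the rows, one value per term
round(w, 3)
```

The term "price" occurs in every document, so its idf is $\log_2(3/3) = 0$ and its whole row is weighted to zero, while "opec" in d2 receives $(1 + \log_2 8) \cdot \log_2 3 = 4 \log_2 3$.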

Value

The weighted matrix.

References

Christopher D. Manning and Prabhakar Raghavan and Hinrich Schütze (2008). Introduction to Information Retrieval. Cambridge University Press, ISBN 0521865719.

Examples

data("crude")
TermDocumentMatrix(crude,
                   control = list(removePunctuation = TRUE,
                                  stopwords = TRUE,
                                  weighting = function(x)
                                      weightSMART(x, spec = "ntc")))

Weight by Term Frequency

Description

Weight a term-document matrix by term frequency.

Usage

weightTf(m)

Arguments

m

A TermDocumentMatrix in term frequency format.

Details

Formally this function is of class WeightingFunction with the additional attributes name and acronym.

This function acts as the identity function since the input matrix is already in term frequency format.

Value

The weighted matrix.


Weight by Term Frequency - Inverse Document Frequency

Description

Weight a term-document matrix by term frequency - inverse document frequency.

Usage

weightTfIdf(m, normalize = TRUE)

Arguments

m

A TermDocumentMatrix in term frequency format.

normalize

A Boolean value indicating whether the term frequencies should be normalized.

Details

Formally this function is of class WeightingFunction with the additional attributes name and acronym.

Term frequency $\mathit{tf}_{i,j}$ counts the number of occurrences $n_{i,j}$ of a term $t_i$ in a document $d_j$. In the case of normalization, the term frequency $\mathit{tf}_{i,j}$ is divided by $\sum_k n_{k,j}$.

Inverse document frequency for a term $t_i$ is defined as

$$\mathit{idf}_i = \log_2 \frac{|D|}{|\{d \mid t_i \in d\}|}$$

where $|D|$ denotes the total number of documents and where $|\{d \mid t_i \in d\}|$ is the number of documents where the term $t_i$ appears.

Term frequency - inverse document frequency is now defined as $\mathit{tf}_{i,j} \cdot \mathit{idf}_i$.
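The definitions above can be checked by hand in base R on a small dense matrix. This is a minimal sketch of the formulas with normalization switched on, not the package's implementation; the matrix and its term and document names are made up for illustration.

```r
# Toy term-document matrix: rows are terms t_i, columns are documents d_j
m <- matrix(c(2, 0, 4,
              1, 1, 1,
              0, 8, 0),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("oil", "price", "opec"), c("d1", "d2", "d3")))

# Normalized term frequency: divide each column by the document's total count
tf <- sweep(m, 2, colSums(m), "/")

# Inverse document frequency: log2(|D| / number of documents containing t_i)
idf <- log2(ncol(m) / rowSums(m > 0))

# tf-idf: idf is recycled down the rows, one value per term
tfidf <- tf * idf
round(tfidf, 3)
```

For example, "oil" accounts for 2 of the 3 term occurrences in d1 and appears in 2 of 3 documents, so its weight there is $\frac{2}{3} \cdot \log_2 \frac{3}{2}$.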

Value

The weighted matrix.

References

Gerard Salton and Christopher Buckley (1988). Term-weighting approaches in automatic text retrieval. Information Processing and Management, 24/5, 513–523.


Write a Corpus to Disk

Description

Write a plain text representation of a corpus to multiple files on disk corresponding to the individual documents in the corpus.

Usage

writeCorpus(x, path = ".", filenames = NULL)

Arguments

x

A corpus.

path

A character string naming the directory to be written into.

filenames

Either NULL or a character vector. In case no filenames are provided, filenames are automatically generated by using the documents' identifiers in x.

Details

The plain text representation of the corpus is obtained by calling as.character on each document.
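The one-file-per-document layout can be sketched in base R with writeLines on a named list of character vectors. The directory name corpus_sketch, the document names, and the ".txt" extension are assumptions made for this illustration; the sketch writes into a temporary directory and does not call writeCorpus.

```r
# Each list element stands in for one document's plain text representation
docs <- list(doc1 = c("First line.", "Second line."),
             doc2 = "Only line.")

out_dir <- file.path(tempdir(), "corpus_sketch")
dir.create(out_dir, showWarnings = FALSE)

# One file per document, named after the document identifier
files <- file.path(out_dir, paste0(names(docs), ".txt"))
for (i in seq_along(docs)) writeLines(docs[[i]], files[i])

list.files(out_dir)
```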

Examples

data("crude")
## Not run: 
writeCorpus(crude, path = ".",
            filenames = paste(seq_along(crude), ".txt", sep = ""))
## End(Not run)
