NEWS	R Documentation

News for Package 'tm'

Changes in tm version 0.7-16

BUG FIXES

Improvements for Rd cross-references.

Changes in tm version 0.7-15

BUG FIXES

Improvements for Rd cross-references.

Changes in tm version 0.7-14

BUG FIXES

Use R_Calloc/R_Free instead of the long-deprecated Calloc/Free.

Changes in tm version 0.7-13

BUG FIXES

Improvements for Rd cross-references.

Changes in tm version 0.7-12

BUG FIXES

Add missing S3 method registration.

Changes in tm version 0.7-11

BUG FIXES

Use the default C++ standard instead of C++11.

Changes in tm version 0.7-10

NEW FEATURES

All built-in pGetElem() methods now use tm_parLapply().

Changes in tm version 0.7-9

BUG FIXES

Compilation fixes.

Changes in tm version 0.7-8

BUG FIXES

Fix invalid counting in prevalent stemCompletion(). Reported by Bernard Chang.
tm_index() now interprets all non-TRUE logical values returned by the filter function as FALSE. This fixes corner cases where filter functions return logical(0) or NA. Reported by Tom Nicholls.
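
The corner cases in the tm_index() entry can be illustrated in base R. This is a minimal sketch, not tm's internal code; the helper name as_strict_flag() is hypothetical and only shows one way to treat every non-TRUE result as FALSE.

```r
# Hypothetical helper (not part of tm): only a single TRUE counts as a match;
# FALSE, NA, and zero-length results all collapse to FALSE.
as_strict_flag <- function(x) isTRUE(x)

# Possible return values of a user-supplied filter function.
filter_results <- list(TRUE, FALSE, NA, logical(0))

flags <- vapply(filter_results, as_strict_flag, logical(1))
print(flags)
```

Using vapply() with logical(1) also guarantees one flag per document, which a plain unlist() would not do for logical(0).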

Changes in tm version 0.7-6

NEW FEATURES

TermDocumentMatrix.SimpleCorpus() now also honors a logical removePunctuation control option (default: false).

BUG FIXES

Sync encoding fixes in TermDocumentMatrix.SimpleCorpus() with Boost_tokenizer().

Changes in tm version 0.7-5

BUG FIXES

Handle NAs consistently in tokenizers.

Changes in tm version 0.7-4

BUG FIXES

Keep document names in tm_map.SimpleCorpus().
Fix encoding problems in scan_tokenizer() and Boost_tokenizer().

Changes in tm version 0.7-3

BUG FIXES

scan_tokenizer() now works with character vectors and character strings.
removePunctuation() now works again in latin1 locales.
Handle empty term-document matrices gracefully.

Changes in tm version 0.7-2

SIGNIFICANT USER-VISIBLE CHANGES

DataframeSource now only processes data frames with the two mandatory columns "doc_id" and "text". Additional columns are used as document level metadata. This implements compatibility with Text Interchange Formats corpora (https://github.com/ropenscilabs/tif).
readTabular() has been removed. Use DataframeSource instead.
removeNumbers() and removePunctuation() now have an argument ucp to check for Unicode general categories Nd (decimal digits) and P (punctuation), respectively. Contributed by Kurt Hornik.
The package xml2 is now imported for XML functionality instead of the (CRAN maintainer orphaned) package XML.
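
The data frame layout and the ucp argument described above can be sketched as follows. The sample data and the "author" column are made up for illustration. The tm calls only run if the package is installed, and the base R regular expressions are an approximation of the Unicode categories, not a copy of tm's code.

```r
# Two mandatory columns plus one extra column that becomes metadata.
docs <- data.frame(
  doc_id = c("doc_1", "doc_2"),
  text   = c("Room 42 is free.", "Call at 9, please!"),
  author = c("Ann", "Bob"),            # illustrative metadata column
  stringsAsFactors = FALSE
)

if (requireNamespace("tm", quietly = TRUE)) {
  corpus <- tm::VCorpus(tm::DataframeSource(docs))
  print(tm::meta(corpus))                          # the extra "author" column
  print(tm::removeNumbers(docs$text, ucp = TRUE))  # Unicode-aware digit removal
}

# Base R approximation: Nd = decimal digits, P = punctuation.
no_digits <- gsub("\\p{Nd}+", "", docs$text, perl = TRUE)
no_punct  <- gsub("\\p{P}+", "", docs$text, perl = TRUE)
print(no_digits)
print(no_punct)
```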

NEW FEATURES

Boost_tokenizer provides a tokenizer based on the Boost (https://www.boost.org) Tokenizer.

BUG FIXES

Correctly handle the dictionary argument when constructing a term-document matrix from a SimpleCorpus (reported by Joe Corrigan) or from a VCorpus (reported by Mark Rosenstein).

Changes in tm version 0.7-1

BUG FIXES

Compilation fixes for Clang's libc++.

Changes in tm version 0.7

SIGNIFICANT USER-VISIBLE CHANGES

inspect.TermDocumentMatrix() now displays a sample instead of the full matrix. The full dense representation is available via as.matrix().

NEW FEATURES

SimpleCorpus provides a corpus which is optimized for the most common usage scenario: importing plain texts from files in a directory or directly from a vector in R, preprocessing and transforming the texts, and finally exporting them to a term-document matrix. The aim is to boost performance and minimize memory pressure. It loads all documents into memory, and is designed for medium-sized to large data sets.
inspect() on text documents as a shorthand for writeLines(as.character()).
findMostFreqTerms() finds most frequent terms in a document-term or term-document matrix, or a vector of term frequencies.
tm_parLapply() is now internally used for the parallelization of transformations, filters, and term-document matrix construction. The preferred parallelization engine can be registered via tm_parLapply_engine(). The default is to use no parallelization (instead of mclapply (package parallel) in previous versions).
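
Registering an engine as described in the last item might look like the sketch below. It assumes that a function-valued engine takes the formals X, FUN and ..., and that passing NULL restores the default of no parallelization. The name serial_engine and the core count are illustrative only.

```r
# A serial stand-in with the assumed engine signature (X, FUN, ...).
serial_engine <- function(X, FUN, ...) lapply(X, FUN, ...)
res <- serial_engine(c("a", "b"), toupper)
print(res)

if (requireNamespace("tm", quietly = TRUE) &&
    requireNamespace("parallel", quietly = TRUE)) {
  # Register a fork-based engine; mc.cores > 1 is not supported on Windows.
  tm::tm_parLapply_engine(function(X, FUN, ...)
    parallel::mclapply(X, FUN, ..., mc.cores = 2L))

  # ... transformations, filters, and matrix construction would go here ...

  # Restore the default: no parallelization.
  tm::tm_parLapply_engine(NULL)
}
```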

Changes in tm version 0.6-2

BUG FIXES

format.PlainTextDocument() now reports only one character count for a whole document.

Changes in tm version 0.6-1

SIGNIFICANT USER-VISIBLE CHANGES

format.PlainTextDocument() now displays a compact representation instead of the content. Use as.character() to obtain the character content (which in turn can be applied to a corpus via lapply()).

NEW FEATURES

ZipSource() for processing ZIP files.
Sources now provide open() and close().
termFreq() now accepts Span_Tokenizer and Token_Tokenizer (both from package NLP) objects as tokenizers.
readTagged(), a reader for text documents containing POS-tagged words.

BUG FIXES

The function removeWords() now correctly processes words being truncations of others. Reported by Александр Труфанов.

Changes in tm version 0.6

SIGNIFICANT USER-VISIBLE CHANGES

DirSource() and URISource() now use the argument encoding for conversion via iconv() to "UTF-8".
termFreq() now uses words() as the default tokenizer.
Text documents now provide the functions content() and as.character() to access the (possibly raw) document content and the natural language text in a suitable (not necessarily structured) form.
The internal representation of corpora, sources, and text documents changed. Saved objects created with older tm versions are incompatible and need to be rebuilt.
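
The conversion step named in the first item above can be tried in isolation with base R. This is a minimal sketch of what an iconv() call to "UTF-8" does for latin1 input; the sample bytes are made up, and it does not reproduce how DirSource() or URISource() call iconv() internally.

```r
# "café" encoded as latin1: the last byte 0xe9 is not valid UTF-8 on its own.
latin1_bytes <- rawToChar(as.raw(c(0x63, 0x61, 0x66, 0xe9)))

# Convert from the declared source encoding to UTF-8.
utf8_text <- iconv(latin1_bytes, from = "latin1", to = "UTF-8")

print(Encoding(utf8_text))
print(nchar(utf8_text, type = "chars"))
```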

NEW FEATURES

DirSource() and URISource() now have a mode argument specifying how elements should be read (no read, binary, text).
Improved high-level documentation on corpora (?Corpus), text documents (?TextDocument), sources (?Source), and readers (?Reader).
Integration with package NLP.
Romanian stopwords. Suggested by Cristian Chirita.
words.PlainTextDocument() delivers word tokens in the document.

BUG FIXES

The function stemCompletion() now avoids spurious duplicate results. Reported by Seong-Hyeon Kim.

DEPRECATED & DEFUNCT

Following functions have been removed:
- Author(), DateTimeStamp(), CMetaData(), content_meta(), DMetaData(), Description(), Heading(), ID(), Language(), LocalMetaData(), Origin(), prescindMeta(), sFilter() (use meta() instead).
- dissimilarity() (use proxy::dist() instead).
- makeChunks() (use [ and [[ manually).
- summary.Corpus() and summary.TextRepository() (print() now gives a more informative but succinct overview).
- TextRepository() and RepoMetaData() (use e.g. a list to store multiple corpora instead).

Changes in tm version 0.5-10

SIGNIFICANT USER-VISIBLE CHANGES

License changed to GPL-3 (from GPL-2 | GPL-3).
Following functions have been renamed:
- tm_tag_score() to tm_term_score().

DEPRECATED & DEFUNCT

Following functions have been removed:
- Dictionary() (use a character vector instead; use Terms() to extract terms from a document-term or term-document matrix),
- GmaneSource() (but still available via an example in XMLSource()),
- preprocessReut21578XML() (moved to package tm.corpus.Reuters21578),
- readGmane() (but still available via an example in readXML()),
- searchFullText() and tm_intersect() (use grep() instead).
Following S3 classes are no longer registered as S4 classes:
- VCorpus and PlainTextDocument.

Changes in tm version 0.5-9

SIGNIFICANT USER-VISIBLE CHANGES

Stemming functionality is now provided by the package SnowballC replacing packages Snowball and RWeka.
All stopword lists (besides Catalan and SMART) available via stopwords() now come from the Snowball stemmer project.
Transformations, filters, and term-document matrix construction now use mclapply (package parallel). Packages snow and Rmpi are no longer used.

DEPRECATED & DEFUNCT

Following functions have been removed:
- tm_startCluster() and tm_stopCluster().

Changes in tm version 0.5-8

SIGNIFICANT USER-VISIBLE CHANGES

The function termFreq() now processes the tolower and tokenize options first.

NEW FEATURES

Catalan stopwords. Requested by Xavier Fernández i Marín.

BUG FIXES

The function termFreq() now correctly accepts user-provided stopwords. Reported by Bettina Grün.
The function termFreq() now correctly handles the lower bound of the option wordLength. Reported by Steven C. Bagley.

Changes in tm version 0.5-7

SIGNIFICANT USER-VISIBLE CHANGES

The function termFreq() provides two new arguments for generalized bounds checking of term frequencies and word lengths. This replaces the arguments minDocFreq and minWordLength.
The function termFreq() is now sensitive to the order of control options.

NEW FEATURES

Weighting schemata for term-document matrices in SMART notation.
Local and global options for term-document matrix construction.
SMART stopword list was added.

Changes in tm version 0.5-5

NEW FEATURES

Access documents in a corpus by names (fallback to IDs if names are not set), i.e., allow a string for the corpus operator '[['.

BUG FIXES

The function findFreqTerms() now checks bounds on a global level (to comply with the manual page) instead of per document. Reported and fixed by Thomas Zapf-Schramm.

Changes in tm version 0.5-4

SIGNIFICANT USER-VISIBLE CHANGES

Use IETF language tags for language codes (instead of ISO 639-2).

NEW FEATURES

The function tm_tag_score() provides functionality to score documents based on the number of tags found. This is useful for sentiment analysis.
The weighting function for term frequency-inverse document frequency weightTfIdf() now has an option for term normalization.
Plotting functions to test for Zipf's and Heaps' law on a term-document matrix were added: Zipf_plot() and Heaps_plot(). Contributed by Kurt Hornik.
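
The term normalization option mentioned for weightTfIdf() can be illustrated with a small base R computation. This is a sketch under stated assumptions, not tm's implementation: it assumes a base-2 logarithm for the inverse document frequency and normalizes each term count by the total number of terms in its document. The function name tf_idf() and the sample matrix are hypothetical.

```r
# Toy term-document matrix: rows are terms, columns are documents.
tf <- matrix(c(2, 0, 1,
               1, 1, 0),
             nrow = 2, byrow = TRUE,
             dimnames = list(Terms = c("apple", "pear"),
                             Docs  = c("d1", "d2", "d3")))

# Hypothetical helper: tf-idf with optional per-document normalization.
tf_idf <- function(tf, normalize = TRUE) {
  if (normalize) {
    # Divide each column by the number of terms in that document.
    tf <- sweep(tf, 2, colSums(tf), "/")
  }
  # Inverse document frequency per term (assumed log base 2).
  idf <- log2(ncol(tf) / rowSums(tf > 0))
  tf * idf   # recycles idf down the rows, one factor per term
}

w <- tf_idf(tf)
print(round(w, 3))
```

With normalization, long documents no longer dominate simply because they contain more tokens.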

Changes in tm version 0.5-3

NEW FEATURES

The reader function readRCV1asPlain() was added and combines the functionality of readRCV1() and as.PlainTextDocument().
The function stemCompletion() has a set of new heuristics.

Changes in tm version 0.5-2

SIGNIFICANT USER-VISIBLE CHANGES

The function termFreq() which is used for building a term-document matrix now uses a whitespace oriented tokenizer as default.

NEW FEATURES

A combine method for merging multiple term-document matrices was added (c.TermDocumentMatrix()).
The function termFreq() now has an option to remove punctuation characters.

DEPRECATED & DEFUNCT

Following functions have been removed:
- CSVSource() (use DataframeSource(read.csv(..., stringsAsFactors = FALSE)) instead), and
- TermDocMatrix() (use DocumentTermMatrix() instead).

BUG FIXES

removeWords() no longer skips words at the beginning or the end of a line. Reported by Mark Kimpel.

Changes in tm version 0.5-1

BUG FIXES

preprocessReut21578XML() no longer generates invalid file names.

Changes in tm version 0.5

SIGNIFICANT USER-VISIBLE CHANGES

All classes, functions, and generics are reimplemented using the S3 class system.
Following functions have been renamed:
- activateCluster() to tm_startCluster(),
- asPlain() to as.PlainTextDocument(),
- deactivateCluster() to tm_stopCluster(),
- tmFilter() to tm_filter(),
- tmIndex() to tm_index(),
- tmIntersect() to tm_intersect(), and
- tmMap() to tm_map().
Mail handling functionality is factored out to the tm.plugin.mail package.

DEPRECATED & DEFUNCT

Following functions have been removed:
- tmTolower() (use tolower() instead), and
- replacePatterns() (use gsub() instead).

Changes in tm version 0.4

SIGNIFICANT USER-VISIBLE CHANGES

The Corpus class is now virtual providing an abstract interface.
VCorpus, the default implementation of the abstract corpus interface (by subclassing), provides a corpus with volatile (= standard R object) semantics. It loads all documents into memory, and is designed for small to medium-sized data sets.
PCorpus, an implementation of the abstract corpus interface (by subclassing), provides a corpus with permanent storage semantics. The actual data is stored in an external database (file) object (as supported by the filehash package), with automatic (un-)loading into memory. It is designed for systems with small memory.
Language codes are now in ISO 639-2 (instead of ISO 639-1).
Reader functions no longer have a load argument for lazy loading.

NEW FEATURES

The reader function readReut21578XMLasPlain() was added and combines the functionality of readReut21578XML() and asPlain().

BUG FIXES

weightTfIdf() no longer applies a binary weighting to an input matrix in term frequency format (which happened only in 0.3-4).

Changes in tm version 0.3-4

SIGNIFICANT USER-VISIBLE CHANGES

.onLoad() no longer tries to start a MPI cluster (which often failed in misconfigured environments). Use activateCluster() and deactivateCluster() instead.
DocumentTermMatrix (the improved reimplementation of defunct TermDocMatrix) does not use the Matrix package anymore.

NEW FEATURES

The DirSource() constructor now accepts the two new (optional) arguments pattern and ignore.case. With pattern one can define a regular expression for selecting only matching files, and ignore.case specifies whether pattern-matching is case-sensitive.
The readNewsgroup() reader function can now be configured for custom date formats (via the DateFormat argument).
The readPDF() reader function can now be configured (via the PdfinfoOptions and PdftotextOptions arguments).
The readDOC() reader function can now be configured (via the AntiwordOptions argument).
Sources now can be vectorized. This allows faster corpus construction.
New XMLSource class for arbitrary XML files.
The new readTabular() reader function allows to create a custom tailor-made reader configured via mappings from a tabular data structure.
The new readXML() reader function allows to read in arbitrary XML files which are described with a specification.
The new tmReduce() transformation allows to combine multiple maps into one transformation.

DEPRECATED & DEFUNCT

CSVSource is defunct (use DataframeSource instead).
weightLogical is defunct.
TermDocMatrix is defunct (use DocumentTermMatrix or TermDocumentMatrix instead).

Changes in tm version 0.3-3

NEW FEATURES

The abstract Source class gets a default implementation for the stepNext() method. It increments the position counter by one, a reasonable value for most sources. For special purposes custom methods can be created via overloading stepNext() of the subclass.
New URISource class for a single document identified by a Uniform Resource Identifier.
New DataframeSource for documents stored in a data frame. Each row is interpreted as a single document.

BUG FIXES

Fix off-by-one error in convertMboxEml() function. Reported by Angela Bohn.
Sort row indices in sparse term-document matrices. Kudos to Martin Mächler for his suggestions.
Sources and readers no longer evaluate calls in a non-standard way.

Changes in tm version 0.3-2

NEW FEATURES

Weighting functions now have an Acronym slot containing abbreviations of the weighting functions' names. This is highly useful when generating tables with indications which weighting scheme was actually used for your experiments.
The functions tmFilter(), tmIndex(), tmMap() and TermDocMatrix() now can use a MPI cluster (via the snow and Rmpi packages) if available. Use (de)activateCluster() to manually override cluster usage settings. Special thanks to Stefan Theussl for his constructive comments.
The Source class receives a new Length slot. It contains the number of elements provided by the source (although there might be rare cases where the number cannot be determined in advance; then it should be set to zero).