| NEWS | R Documentation |
News for Package 'tm'
Changes in tm version 0.7-16
BUG FIXES
Improvements for Rd cross-references.
Changes in tm version 0.7-15
BUG FIXES
Improvements for Rd cross-references.
Changes in tm version 0.7-14
BUG FIXES
Use R_Calloc/R_Free instead of the long-deprecated Calloc/Free.
Changes in tm version 0.7-13
BUG FIXES
Improvements for Rd cross-references.
Changes in tm version 0.7-12
BUG FIXES
Add missing S3 method registration.
Changes in tm version 0.7-11
BUG FIXES
Use the default C++ standard instead of C++11.
Changes in tm version 0.7-10
NEW FEATURES
All built-in pGetElem() methods now use tm_parLapply().
Changes in tm version 0.7-9
BUG FIXES
Compilation fixes.
Changes in tm version 0.7-8
BUG FIXES
Fix invalid counting in prevalent stemCompletion(). Reported by Bernard Chang.
tm_index() now interprets all non-TRUE logical values returned by the filter function as FALSE. This fixes corner cases where filter functions return logical(0) or NA. Reported by Tom Nicholls.
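The tm_index() rule above can be illustrated in base R without loading tm. This is a minimal sketch of the described semantics, not the package's actual implementation; flaky_filter() and the sample inputs are hypothetical.

```r
# Hypothetical filter that may return TRUE, FALSE, NA, or logical(0).
flaky_filter <- function(x) {
  if (length(x) == 0L) return(logical(0))
  if (is.na(x)) return(NA)
  nchar(x) > 3
}

docs <- list("text", "ab", NA_character_, character(0))
raw <- lapply(docs, flaky_filter)

# Treat everything that is not exactly TRUE as FALSE.
keep <- vapply(raw, isTRUE, logical(1))
print(keep)
# [1]  TRUE FALSE FALSE FALSE
```

Using isTRUE() keeps the index a plain logical vector with one entry per document, so subsetting with it cannot fail on NA or zero-length results.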
Changes in tm version 0.7-6
NEW FEATURES
TermDocumentMatrix.SimpleCorpus() now also honors a logical removePunctuation control option (default: false).
BUG FIXES
Sync encoding fixes in TermDocumentMatrix.SimpleCorpus() with Boost_tokenizer().
Changes in tm version 0.7-5
BUG FIXES
Handle NAs consistently in tokenizers.
Changes in tm version 0.7-4
BUG FIXES
Keep document names in tm_map.SimpleCorpus().
Fix encoding problems in scan_tokenizer() and Boost_tokenizer().
Changes in tm version 0.7-3
BUG FIXES
scan_tokenizer() now works with character vectors and character strings.
removePunctuation() now works again in latin1 locales.
Handle empty term-document matrices gracefully.
Changes in tm version 0.7-2
SIGNIFICANT USER-VISIBLE CHANGES
DataframeSource now only processes data frames with the two mandatory columns "doc_id" and "text". Additional columns are used as document level metadata. This implements compatibility with Text Interchange Formats corpora (https://github.com/ropenscilabs/tif).
readTabular() has been removed. Use DataframeSource instead.
removeNumbers() and removePunctuation() now have an argument ucp to check for Unicode general categories Nd (decimal digits) and P (punctuation), respectively. Contributed by Kurt Hornik.
The package xml2 is now imported for XML functionality instead of the (CRAN maintainer orphaned) package XML.
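As an illustration of the DataframeSource change above, the following sketch builds a corpus from a data frame with the two mandatory columns plus one extra column. It assumes tm 0.7-2 or later is installed; the data frame contents and the author column are made up for the example.

```r
library(tm)

# "doc_id" and "text" are mandatory; "author" is a hypothetical
# extra column that should end up as document level metadata.
docs <- data.frame(
  doc_id = c("d1", "d2"),
  text   = c("First text.", "Second text."),
  author = c("A", "B"),
  stringsAsFactors = FALSE
)

corpus <- VCorpus(DataframeSource(docs))

# The document content comes from the "text" column.
print(as.character(corpus[[1]]))

# Corpus-level view of the additional columns.
print(meta(corpus))
```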
NEW FEATURES
Boost_tokenizer provides a tokenizer based on the Boost (https://www.boost.org) Tokenizer.
BUG FIXES
Correctly handle the dictionary argument when constructing a term-document matrix from a SimpleCorpus (reported by Joe Corrigan) or from a VCorpus (reported by Mark Rosenstein).
Changes in tm version 0.7-1
BUG FIXES
Compilation fixes for Clang's libc++.
Changes in tm version 0.7
SIGNIFICANT USER-VISIBLE CHANGES
inspect.TermDocumentMatrix() now displays a sample instead of the full matrix. The full dense representation is available via as.matrix().
NEW FEATURES
SimpleCorpus provides a corpus which is optimized for the most common usage scenario: importing plain texts from files in a directory or directly from a vector in R, preprocessing and transforming the texts, and finally exporting them to a term-document matrix. The aim is to boost performance and minimize memory pressure. It loads all documents into memory, and is designed for medium-sized to large data sets.
inspect() on text documents as a shorthand for writeLines(as.character()).
findMostFreqTerms() finds most frequent terms in a document-term or term-document matrix, or a vector of term frequencies.
tm_parLapply() is now internally used for the parallelization of transformations, filters, and term-document matrix construction. The preferred parallelization engine can be registered via tm_parLapply_engine(). The default is to use no parallelization (instead of mclapply (package parallel) in previous versions).
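Registering a parallelization engine, as described for tm_parLapply_engine() above, might look like the following sketch. It assumes tm 0.7 or later and that a two-worker local cluster can be started on the machine; the sample texts are invented.

```r
library(tm)

# Start a small local cluster and register it as the engine.
cl <- parallel::makeCluster(2L)
tm_parLapply_engine(cl)

corpus <- VCorpus(VectorSource(c("Some Text", "More TEXT")))

# Transformations on a VCorpus now run through the registered engine.
corpus <- tm_map(corpus, content_transformer(tolower))

# Reset to the default (no parallelization) and release the workers.
tm_parLapply_engine(NULL)
parallel::stopCluster(cl)

print(as.character(corpus[[1]]))
```

For corpora this small the cluster start-up cost outweighs any gain; an engine is only worth registering for larger collections.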
Changes in tm version 0.6-2
BUG FIXES
format.PlainTextDocument() now reports only one character count for a whole document.
Changes in tm version 0.6-1
SIGNIFICANT USER-VISIBLE CHANGES
format.PlainTextDocument() now displays a compact representation instead of the content. Use as.character() to obtain the character content (which in turn can be applied to a corpus via lapply()).
NEW FEATURES
ZipSource() for processing ZIP files.
Sources now provide open() and close().
termFreq() now accepts Span_Tokenizer and Token_Tokenizer (both from package NLP) objects as tokenizers.
readTagged(), a reader for text documents containing POS-tagged words.
BUG FIXES
The function removeWords() now correctly processes words being truncations of others. Reported by Александр Труфанов.
Changes in tm version 0.6
SIGNIFICANT USER-VISIBLE CHANGES
DirSource() and URISource() now use the argument encoding for conversion via iconv() to "UTF-8".
termFreq() now uses words() as the default tokenizer.
Text documents now provide the functions content() and as.character() to access the (possibly raw) document content and the natural language text in a suitable (not necessarily structured) form.
The internal representation of corpora, sources, and text documents changed. Saved objects created with older tm versions are incompatible and need to be rebuilt.
NEW FEATURES
DirSource() and URISource() now have a mode argument specifying how elements should be read (no read, binary, text).
Improved high-level documentation on corpora (?Corpus), text documents (?TextDocument), sources (?Source), and readers (?Reader).
Integration with package NLP.
Romanian stopwords. Suggested by Cristian Chirita.
words.PlainTextDocument() delivers word tokens in the document.
BUG FIXES
The function stemCompletion() now avoids spurious duplicate results. Reported by Seong-Hyeon Kim.
DEPRECATED & DEFUNCT
Following functions have been removed:
Author(), DateTimeStamp(), CMetaData(), content_meta(), DMetaData(), Description(), Heading(), ID(), Language(), LocalMetaData(), Origin(), prescindMeta(), sFilter() (use meta() instead).
dissimilarity() (use proxy::dist() instead).
makeChunks() (use [ and [[ manually).
summary.Corpus() and summary.TextRepository() (print() now gives a more informative but succinct overview).
TextRepository() and RepoMetaData() (use e.g. a list to store multiple corpora instead).
Changes in tm version 0.5-10
SIGNIFICANT USER-VISIBLE CHANGES
License changed to GPL-3 (from GPL-2 | GPL-3).
Following functions have been renamed:
tm_tag_score() to tm_term_score().
DEPRECATED & DEFUNCT
Following functions have been removed:
Dictionary() (use a character vector instead; use Terms() to extract terms from a document-term or term-document matrix), GmaneSource() (but still available via an example in XMLSource()), preprocessReut21578XML() (moved to package tm.corpus.Reuters21578), readGmane() (but still available via an example in readXML()), searchFullText() and tm_intersect() (use grep() instead).
Following S3 classes are no longer registered as S4 classes:
VCorpus and PlainTextDocument.
Changes in tm version 0.5-9
SIGNIFICANT USER-VISIBLE CHANGES
Stemming functionality is now provided by the package SnowballC replacing packages Snowball and RWeka.
All stopword lists (besides Catalan and SMART) available via stopwords() now come from the Snowball stemmer project.
Transformations, filters, and term-document matrix construction now use mclapply (package parallel). Packages snow and Rmpi are no longer used.
DEPRECATED & DEFUNCT
Following functions have been removed:
tm_startCluster() and tm_stopCluster().
Changes in tm version 0.5-8
SIGNIFICANT USER-VISIBLE CHANGES
The function termFreq() now processes the tolower and tokenize options first.
NEW FEATURES
Catalan stopwords. Requested by Xavier Fernández i Marín.
BUG FIXES
The function termFreq() now correctly accepts user-provided stopwords. Reported by Bettina Grün.
The function termFreq() now correctly handles the lower bound of the option wordLength. Reported by Steven C. Bagley.
Changes in tm version 0.5-7
SIGNIFICANT USER-VISIBLE CHANGES
The function termFreq() provides two new arguments for generalized bounds checking of term frequencies and word lengths. This replaces the arguments minDocFreq and minWordLength.
The function termFreq() is now sensitive to the order of control options.
NEW FEATURES
Weighting schemata for term-document matrices in SMART notation.
Local and global options for term-document matrix construction.
SMART stopword list was added.
Changes in tm version 0.5-5
NEW FEATURES
Access documents in a corpus by names (fallback to IDs if names are not set), i.e., allow a string for the corpus operator '[['.
BUG FIXES
The function findFreqTerms() now checks bounds on a global level (to comply with the manual page) instead of per document. Reported and fixed by Thomas Zapf-Schramm.
Changes in tm version 0.5-4
SIGNIFICANT USER-VISIBLE CHANGES
Use IETF language tags for language codes (instead of ISO 639-2).
NEW FEATURES
The function tm_tag_score() provides functionality to score documents based on the number of tags found. This is useful for sentiment analysis.
The weighting function for term frequency-inverse document frequency weightTfIdf() now has an option for term normalization.
Plotting functions to test for Zipf's and Heaps' law on a term-document matrix were added: Zipf_plot() and Heaps_plot(). Contributed by Kurt Hornik.
Changes in tm version 0.5-3
NEW FEATURES
The reader function readRCV1asPlain() was added and combines the functionality of readRCV1() and as.PlainTextDocument().
The function stemCompletion() has a set of new heuristics.
Changes in tm version 0.5-2
SIGNIFICANT USER-VISIBLE CHANGES
The function termFreq(), which is used for building a term-document matrix, now uses a whitespace-oriented tokenizer as default.
NEW FEATURES
A combine method for merging multiple term-document matrices was added (c.TermDocumentMatrix()).
The function termFreq() now has an option to remove punctuation characters.
DEPRECATED & DEFUNCT
Following functions have been removed:
CSVSource() (use DataframeSource(read.csv(..., stringsAsFactors = FALSE)) instead), and TermDocMatrix() (use DocumentTermMatrix() instead).
BUG FIXES
removeWords() no longer skips words at the beginning or the end of a line. Reported by Mark Kimpel.
Changes in tm version 0.5-1
BUG FIXES
preprocessReut21578XML() no longer generates invalid file names.
Changes in tm version 0.5
SIGNIFICANT USER-VISIBLE CHANGES
All classes, functions, and generics are reimplemented using the S3 class system.
Following functions have been renamed:
activateCluster() to tm_startCluster(), asPlain() to as.PlainTextDocument(), deactivateCluster() to tm_stopCluster(), tmFilter() to tm_filter(), tmIndex() to tm_index(), tmIntersect() to tm_intersect(), and tmMap() to tm_map().
Mail handling functionality is factored out to the tm.plugin.mail package.
DEPRECATED & DEFUNCT
Following functions have been removed:
tmTolower() (use tolower() instead), and replacePatterns() (use gsub() instead).
Changes in tm version 0.4
SIGNIFICANT USER-VISIBLE CHANGES
The Corpus class is now virtual providing an abstract interface.
VCorpus, the default implementation of the abstract corpus interface (by subclassing), provides a corpus with volatile (= standard R object) semantics. It loads all documents into memory, and is designed for small to medium-sized data sets.
PCorpus, an implementation of the abstract corpus interface (by subclassing), provides a corpus with permanent storage semantics. The actual data is stored in an external database (file) object (as supported by the filehash package), with automatic (un-)loading into memory. It is designed for systems with small memory.
Language codes are now in ISO 639-2 (instead of ISO 639-1).
Reader functions no longer have a load argument for lazy loading.
NEW FEATURES
The reader function readReut21578XMLasPlain() was added and combines the functionality of readReut21578XML() and asPlain().
BUG FIXES
weightTfIdf() no longer applies a binary weighting to an input matrix in term frequency format (which happened only in 0.3-4).
Changes in tm version 0.3-4
SIGNIFICANT USER-VISIBLE CHANGES
.onLoad() no longer tries to start an MPI cluster (which often failed in misconfigured environments). Use activateCluster() and deactivateCluster() instead.
DocumentTermMatrix (the improved reimplementation of defunct TermDocMatrix) does not use the Matrix package anymore.
NEW FEATURES
The DirSource() constructor now accepts the two new (optional) arguments pattern and ignore.case. With pattern one can define a regular expression for selecting only matching files, and ignore.case specifies whether pattern-matching is case-sensitive.
The readNewsgroup() reader function can now be configured for custom date formats (via the DateFormat argument).
The readPDF() reader function can now be configured (via the PdfinfoOptions and PdftotextOptions arguments).
The readDOC() reader function can now be configured (via the AntiwordOptions argument).
Sources now can be vectorized. This allows faster corpus construction.
New XMLSource class for arbitrary XML files.
The new readTabular() reader function makes it possible to create a custom tailor-made reader configured via mappings from a tabular data structure.
The new readXML() reader function makes it possible to read in arbitrary XML files which are described with a specification.
The new tmReduce() transformation makes it possible to combine multiple maps into one transformation.
DEPRECATED & DEFUNCT
CSVSource is defunct (use DataframeSource instead).
weightLogical is defunct.
TermDocMatrix is defunct (use DocumentTermMatrix or TermDocumentMatrix instead).
Changes in tm version 0.3-3
NEW FEATURES
The abstract Source class gets a default implementation for the stepNext() method. It increments the position counter by one, a reasonable value for most sources. For special purposes custom methods can be created via overloading stepNext() of the subclass.
New URISource class for a single document identified by a Uniform Resource Identifier.
New DataframeSource for documents stored in a data frame. Eachrow is interpreted as a single document.
BUG FIXES
Fix off-by-one error in convertMboxEml() function. Reported by Angela Bohn.
Sort row indices in sparse term-document matrices. Kudos to Martin Mächler for his suggestions.
Sources and readers no longer evaluate calls in a non-standard way.
Changes in tm version 0.3-2
NEW FEATURES
Weighting functions now have an Acronym slot containing abbreviations of the weighting functions' names. This is highly useful when generating tables indicating which weighting scheme was actually used for your experiments.
The functions tmFilter(), tmIndex(), tmMap() and TermDocMatrix() now can use an MPI cluster (via the snow and Rmpi packages) if available. Use (de)activateCluster() to manually override cluster usage settings. Special thanks to Stefan Theussl for his constructive comments.
The Source class receives a new Length slot. It contains the number of elements provided by the source (although there might be rare cases where the number cannot be determined in advance; in that case it should be set to zero).