tokenizers: Fast, Consistent Tokenization of Natural Language Text
Convert natural language text into tokens. Includes tokenizers for shingled n-grams, skip n-grams, words, word stems, sentences, paragraphs, characters, shingled characters, lines, Penn Treebank, regular expressions, as well as functions for counting characters, words, and sentences, and a function for splitting longer texts into separate documents, each with the same number of words. The tokenizers have a consistent interface, and the package is built on the 'stringi' and 'Rcpp' packages for fast yet correct tokenization in 'UTF-8'.
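For readers new to the package, here is a minimal sketch of the consistent interface described above. The sample strings are invented for illustration; the calls use the exported `tokenize_*`, `count_*`, and `chunk_text()` functions from the package's documented API.

```r
library(tokenizers)

# Each tokenizer takes a character vector and returns a list of
# character vectors, one element per input document.
docs <- c(doc1 = "The quick brown fox jumps over the lazy dog.",
          doc2 = "A second, much shorter document.")

tokenize_words(docs)                      # word tokens, lowercased by default
tokenize_sentences(docs)                  # sentence tokens
tokenize_ngrams(docs, n = 2)              # shingled bigrams
tokenize_character_shingles(docs, n = 3)  # character shingles

count_words(docs)                         # word count per document
chunk_text(docs[1], chunk_size = 5)       # split a text into equal-sized chunks
```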
| Version: | 0.3.0 |
| Depends: | R (≥ 3.1.3) |
| Imports: | stringi (≥ 1.0.1), Rcpp (≥ 0.12.3), SnowballC (≥ 0.5.1) |
| LinkingTo: | Rcpp |
| Suggests: | covr, knitr, rmarkdown, stopwords (≥ 0.9.0), testthat |
| Published: | 2022-12-22 |
| DOI: | 10.32614/CRAN.package.tokenizers |
| Author: | Lincoln Mullen [aut, cre], Os Keyes [ctb], Dmitriy Selivanov [ctb], Jeffrey Arnold [ctb], Kenneth Benoit [ctb] |
| Maintainer: | Lincoln Mullen <lincoln at lincolnmullen.com> |
| BugReports: | https://github.com/ropensci/tokenizers/issues |
| License: | MIT + file LICENSE |
| URL: | https://docs.ropensci.org/tokenizers/, https://github.com/ropensci/tokenizers |
| NeedsCompilation: | yes |
| Citation: | tokenizers citation info |
| Materials: | README, NEWS |
| In views: | NaturalLanguageProcessing |
| CRAN checks: | tokenizers results |
Reverse dependencies:
| Reverse imports: | blocking, covfefe, deeplr, DeepPINCS, DramaAnalysis, pdfsearch, proustr, rslp, textrecipes, tidypmc, tidytext, ttgsea, wactor, WhatsR |
| Reverse suggests: | edgarWebR, sumup, torchdatasets |
| Reverse enhances: | quanteda |
Linking:
Please use the canonical form https://CRAN.R-project.org/package=tokenizers to link to this page.