library(textrecipes)
#> Loading required package: recipes
#> Loading required package: dplyr
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#>     filter, lag
#> The following objects are masked from 'package:base':
#>
#>     intersect, setdiff, setequal, union
#>
#> Attaching package: 'recipes'
#> The following object is masked from 'package:stats':
#>
#>     step
textrecipes has been using lists of character vectors to carry around the tokens. A simple S3 vector class has been implemented with the vctrs package to handle that list of tokens, henceforth to be known as a tokenlist.
If you are only using this package for preprocessing then you most likely won’t even notice that this change has happened. However, if you are thinking of contributing to textrecipes, then knowing about tokenlists will be essential.
A tokenlist is based around a simple list of character vectors, and has 3 attributes: lemma, pos and tokens.
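To make that layout concrete, here is a minimal base R sketch of a list of character vectors carrying the three attributes. The helper toy_tokenlist() is hypothetical and only mimics the structure described above; it is not the vctrs-based class that textrecipes implements.

```r
# Hypothetical stand-in for illustration only; NOT the textrecipes class.
toy_tokenlist <- function(tokens, lemma = NULL, pos = NULL) {
  stopifnot(
    is.list(tokens),
    all(vapply(tokens, is.character, logical(1)))
  )
  structure(
    tokens,
    lemma = lemma,                   # NULL unless supplied
    pos = pos,                       # NULL unless supplied
    tokens = unique(unlist(tokens))  # unique tokens across all elements
  )
}

# Two "documents", each already split into tokens.
x <- toy_tokenlist(list(c("the", "cat", "sat"), c("the", "dog")))

length(x)          # one element per document
attr(x, "tokens")  # unique tokens found in the data
attr(x, "lemma")   # NULL, since no lemmas were supplied
```

The data itself stays an ordinary list, so existing list tools such as lapply() and lengths() keep working on it.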
tokens attribute
The tokens attribute is a vector of the unique tokens contained in the data list. This attribute is calculated automatically when using tokenlist(). If a function is applied to the tokenlist where the resulting unique tokens can be derived, then new_tokenlist() can be used to create a tokenlist with a known tokens attribute.
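As an illustration of when the unique tokens can be derived, consider lowercasing. The sketch below uses plain base R lists rather than the package's class, and the variable names are made up for the example. It shows that the new unique tokens can be computed from the old unique tokens alone, without rescanning every element.

```r
# Plain lists stand in for a tokenlist here; this is an assumption-laden
# sketch, not textrecipes internals.
tokens <- list(c("The", "cat", "the"), c("Cat", "dog"))
old_unique <- unique(unlist(tokens))

# Apply the transformation to the data ...
lowered <- lapply(tokens, tolower)

# ... and derive the new unique tokens from the old ones only.
derived_unique <- unique(tolower(old_unique))

# A full recomputation over the transformed data gives the same result.
recomputed_unique <- unique(unlist(lowered))
identical(derived_unique, recomputed_unique)
```

When the vocabulary is much smaller than the total number of tokens, deriving the attribute this way avoids a second pass over the data, which is the case where passing a known value to the constructor pays off.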
lemma and pos attributes
Both the lemma and pos attributes are used in the same way. They default to NULL but can be filled depending on which engine is being used in step_tokenize(). Each attribute is a list of character vectors with exactly the same shape and size as the tokenlist, and its elements should have a one-to-one relationship with the tokens.
If a specific element is removed from the tokenlist then the corresponding element in lemma and pos should be removed.
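One way to keep the three lists aligned is to compute a single logical mask per element and apply it to tokens, lemma and pos alike. The following base R sketch assumes made-up example data and an illustrative stop word list; it does not use textrecipes functions.

```r
# Illustrative data: the lemma and pos values are invented for the example.
tokens <- list(c("the", "cats", "sat"), c("a", "dog"))
lemma  <- list(c("the", "cat", "sit"), c("a", "dog"))
pos    <- list(c("DET", "NOUN", "VERB"), c("DET", "NOUN"))

stopwords <- c("the", "a")

# One mask per element, TRUE where the token is kept.
keep <- lapply(tokens, function(tok) !tok %in% stopwords)

# Apply the same mask to all three lists so they stay one-to-one.
subset_by <- function(values, mask) Map(function(v, k) v[k], values, mask)
tokens_kept <- subset_by(tokens, keep)
lemma_kept  <- subset_by(lemma, keep)
pos_kept    <- subset_by(pos, keep)

# The shapes still match after filtering.
stopifnot(
  identical(lengths(tokens_kept), lengths(lemma_kept)),
  identical(lengths(tokens_kept), lengths(pos_kept))
)
```

Computing the mask once from the tokens and reusing it avoids the mistake of filtering lemma or pos by their own values, which would break the one-to-one relationship.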