Under the hood - tokenlist

Source: vignettes/tokenlist.Rmd

library(textrecipes)
#> Loading required package: recipes
#> Loading required package: dplyr
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#>     filter, lag
#> The following objects are masked from 'package:base':
#>
#>     intersect, setdiff, setequal, union
#>
#> Attaching package: 'recipes'
#> The following object is masked from 'package:stats':
#>
#>     step

textrecipes has been using lists of character vectors to carry around the tokens. A simple S3 vector class has been implemented with the vctrs package to handle that list of tokens, henceforth known as a `tokenlist`.

If you are only using this package for preprocessing, you most likely won't even notice that this change has happened. However, if you are thinking of contributing to textrecipes, knowing about `tokenlist`s will be essential.

A `tokenlist` is based around a simple list of character vectors and has 3 attributes: `lemma`, `pos`, and `tokens`.

`tokens` attribute

The `tokens` attribute is a vector of the unique tokens contained in the data list. This attribute is calculated automatically when using `tokenlist()`. If a function is applied to the tokenlist where the resulting unique tokens can be derived, then `new_tokenlist()` can be used to create a tokenlist with a known `tokens` attribute.
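The relationship between the data list and the unique tokens can be sketched in base R. This is only an illustration under stated assumptions: `docs`, `stop_words`, and the other variable names are hypothetical, and the code deliberately does not call `tokenlist()` or `new_tokenlist()`. It shows why a derived set of unique tokens can stand in for a full recomputation.

```r
# Hypothetical data: one character vector of tokens per document.
docs <- list(c("the", "cat", "sat"), c("the", "dog", "sat"))

# What the `tokens` attribute describes: the unique tokens in the data list.
unique_tokens <- unique(unlist(docs))

# Apply a function whose effect on the unique tokens is known in advance:
# here, dropping a hypothetical set of stop words.
stop_words <- c("the")
filtered <- lapply(docs, function(x) x[!x %in% stop_words])

# The new unique tokens can be derived from the old ones ...
derived <- setdiff(unique_tokens, stop_words)

# ... and agree with a full recomputation over the filtered data.
recomputed <- unique(unlist(filtered))
stopifnot(setequal(derived, recomputed))
```

Deriving the unique tokens touches only the short vector of unique values, whereas recomputing them requires scanning every document again.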

`lemma` and `pos` attributes

Both the `lemma` and `pos` attributes are used in the same way. They default to `NULL` but can be filled depending on which engine is being used in `step_tokenize()`. Each attribute is a list of character vectors with exactly the same shape and size as the tokenlist, so every token has a one-to-one relationship with its lemma and its part-of-speech tag.

If a specific element is removed from the tokenlist, then the corresponding elements in `lemma` and `pos` should be removed as well.
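One way to keep the three lists aligned is to compute a single logical mask per document and apply it to all of them. The sketch below uses plain base R lists; the data, the `subset_by()` helper, and the tag values are hypothetical and are not part of textrecipes.

```r
# Hypothetical parallel lists with the same shape and size.
tokens <- list(c("the", "cats", "sat"), c("dogs", "ran"))
lemma  <- list(c("the", "cat", "sit"), c("dog", "run"))
pos    <- list(c("DET", "NOUN", "VERB"), c("NOUN", "VERB"))

# One logical mask per document, marking the tokens to keep.
keep <- lapply(tokens, function(x) x != "the")

# Hypothetical helper: subset each document's vector by its mask.
subset_by <- function(values, mask) {
  Map(function(v, m) v[m], values, mask)
}

# Applying the same mask to every list preserves the one-to-one relationship.
tokens <- subset_by(tokens, keep)
lemma  <- subset_by(lemma, keep)
pos    <- subset_by(pos, keep)

stopifnot(
  identical(lengths(tokens), lengths(lemma)),
  identical(lengths(tokens), lengths(pos))
)
```

Because the mask is computed once from the tokens and reused, the lists cannot drift out of step with each other.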
