Extra recipes for Text Processing
License
Unknown, MIT licenses found
tidymodels/textrecipes
textrecipes contains extra steps for the recipes package for preprocessing text data.
You can install the released version of textrecipes from CRAN with:

    install.packages("textrecipes")

Install the development version from GitHub with:

    # install.packages("pak")
    pak::pak("tidymodels/textrecipes")
In the following example we go through the steps needed to convert a character variable to the TF-IDF of its tokenized words, after removing stop words and limiting ourselves to the 10 most used words. The preprocessing is applied to the variables medium and artist.
    library(recipes)
    library(textrecipes)
    library(modeldata)

    data("tate_text")

    okc_rec <- recipe(~ medium + artist, data = tate_text) |>
      step_tokenize(medium, artist) |>
      step_stopwords(medium, artist) |>
      step_tokenfilter(medium, artist, max_tokens = 10) |>
      step_tfidf(medium, artist)

    okc_obj <- okc_rec |>
      prep()

    str(bake(okc_obj, tate_text))
    #> tibble [4,284 × 20] (S3: tbl_df/tbl/data.frame)
    #>  $ tfidf_medium_colour     : num [1:4284] 2.31 0 0 0 0 ...
    #>  $ tfidf_medium_etching    : num [1:4284] 0 0.86 0.86 0.86 0 ...
    #>  $ tfidf_medium_gelatin    : num [1:4284] 0 0 0 0 0 0 0 0 0 0 ...
    #>  $ tfidf_medium_lithograph : num [1:4284] 0 0 0 0 0 0 0 0 0 0 ...
    #>  $ tfidf_medium_paint      : num [1:4284] 0 0 0 0 2.35 ...
    #>  $ tfidf_medium_paper      : num [1:4284] 0 0.422 0.422 0.422 0 ...
    #>  $ tfidf_medium_photograph : num [1:4284] 0 0 0 0 0 0 0 0 0 0 ...
    #>  $ tfidf_medium_print      : num [1:4284] 0 0 0 0 0 ...
    #>  $ tfidf_medium_screenprint: num [1:4284] 0 0 0 0 0 0 0 0 0 0 ...
    #>  $ tfidf_medium_silver     : num [1:4284] 0 0 0 0 0 0 0 0 0 0 ...
    #>  $ tfidf_artist_akram      : num [1:4284] 0 0 0 0 0 0 0 0 0 0 ...
    #>  $ tfidf_artist_beuys      : num [1:4284] 0 0 0 0 0 ...
    #>  $ tfidf_artist_ferrari    : num [1:4284] 0 0 0 0 0 0 0 0 0 0 ...
    #>  $ tfidf_artist_john       : num [1:4284] 0 0 0 0 0 0 0 0 0 0 ...
    #>  $ tfidf_artist_joseph     : num [1:4284] 0 0 0 0 0 ...
    #>  $ tfidf_artist_león       : num [1:4284] 0 0 0 0 0 0 0 0 0 0 ...
    #>  $ tfidf_artist_richard    : num [1:4284] 0 0 0 0 0 0 0 0 0 0 ...
    #>  $ tfidf_artist_schütte    : num [1:4284] 0 0 0 0 0 0 0 0 0 0 ...
    #>  $ tfidf_artist_thomas     : num [1:4284] 0 0 0 0 0 0 0 0 0 0 ...
    #>  $ tfidf_artist_zaatari    : num [1:4284] 0 0 0 0 0 0 0 0 0 0 ...
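To give a rough sense of what the four steps compute, here is a minimal base R sketch on toy data. It uses a textbook TF-IDF weighting (term frequency times log(N / document frequency)), which may differ from the weighting and normalization that step_tfidf() applies by default. The documents, the stop word list, and all variable names are made up for illustration and are not taken from tate_text.

```r
# Toy inputs (hypothetical, for illustration only).
docs <- c("oil paint on canvas", "etching on paper", "oil paint on paper")
stop_words <- c("on")
max_tokens <- 3

# 1. Tokenize: lowercase and split on whitespace.
tokens <- strsplit(tolower(docs), "\\s+")

# 2. Remove stop words.
tokens <- lapply(tokens, function(x) x[!x %in% stop_words])

# 3. Keep only the `max_tokens` most frequent tokens across all documents.
counts <- sort(table(unlist(tokens)), decreasing = TRUE)
vocab <- sort(names(counts)[seq_len(min(max_tokens, length(counts)))])
tokens <- lapply(tokens, function(x) x[x %in% vocab])

# 4a. Term frequency: share of each vocabulary term within a document.
tf <- t(vapply(
  tokens,
  function(x) as.numeric(table(factor(x, levels = vocab))) / max(length(x), 1),
  numeric(length(vocab))
))
colnames(tf) <- paste0("tfidf_", vocab)

# 4b. Inverse document frequency, then the TF-IDF matrix (documents x terms).
idf <- log(length(docs) / colSums(tf > 0))
tfidf <- sweep(tf, 2, idf, `*`)

print(tfidf)
```

Each row corresponds to one input document and each column to one retained token, which mirrors the shape of the baked output above: one tfidf_* column per variable and token.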
As of version 0.4.0, step_lda() no longer accepts character variables and instead takes tokenlist variables.

The following recipe

    recipe(~ text_var, data = data) |>
      step_lda(text_var)

can be replaced with the following recipe to achieve the same results:

    lda_tokenizer <- function(x) text2vec::word_tokenizer(tolower(x))

    recipe(~ text_var, data = data) |>
      step_tokenize(text_var, custom_token = lda_tokenizer) |>
      step_lda(text_var)
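A function passed as custom_token takes a character vector and returns a list of character vectors, one element per input string. The sketch below shows that shape using base R only. simple_tokenizer is a hypothetical stand-in, not part of textrecipes or text2vec, and it is not guaranteed to split text exactly the way text2vec::word_tokenizer() does.

```r
# Hypothetical base R tokenizer with the shape a custom tokenizer needs:
# character vector in, list of character vectors out.
simple_tokenizer <- function(x) {
  # Lowercase, then split on runs of non-alphanumeric characters.
  tokens <- strsplit(tolower(x), "[^[:alnum:]]+")
  # Drop empty strings left behind by leading punctuation.
  lapply(tokens, function(tok) tok[nzchar(tok)])
}

example_tokens <- simple_tokenizer(c("Oil paint, on canvas", "Etching on paper!"))
str(example_tokens)
```

Under these assumptions, it could be supplied in the same way as lda_tokenizer, for example step_tokenize(text_var, custom_token = simple_tokenizer).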
This project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.
For questions and discussions about tidymodels packages, modeling, and machine learning, please post on RStudio Community.
If you think you have encountered a bug, please submit an issue.
Either way, learn how to create and share a reprex (a minimal, reproducible example), to clearly communicate about your code.
Check out further details on contributing guidelines for tidymodels packages and how to get help.
