
An R-package for analyzing natural language with transformers-based large language models. The text package is part of the R Language Analysis Suite, which includes talk, text, and topics.
talk transforms voice recordings into text, audio features, or embeddings. text provides many language tasks, such as converting digital text into word embeddings. talk and text offer access to Large Language Models from Hugging Face. topics visualizes language patterns as topics to generate psychological insights.
The R Language Analysis Suite is created through a collaboration between psychology and computer science to address research needs and ensure state-of-the-art techniques. The suite is continuously tested on Ubuntu, macOS, and Windows using the latest stable R version.
The text package has two main objectives:
* First, to serve R users as a point solution for transforming text to state-of-the-art word embeddings that are ready to be used for downstream tasks. The package provides a user-friendly link to language models based on transformers from Hugging Face.
* Second, to serve as an end-to-end solution that provides state-of-the-art AI techniques tailored for social and behavioral scientists.
Please reference our tutorial article when using the text package: The text-package: An R-package for Analyzing and Visualizing Human Language Using Natural Language Processing and Deep Learning.
Most users only need to run the installation code below. If you experience problems or want more alternatives, please see the Extended Installation Guide.
For the text package to work, you first have to install the text package in R and then make it work with the Python packages that text requires.
GitHub development version:
``` r
# install.packages("devtools")
devtools::install_github("oscarkjell/text")
```

CRAN version:
``` r
install.packages("text")
library(text)

# Install the Python packages that text requires, in a conda environment (with defaults).
textrpp_install()

# Initialize the installed conda environment.
# save_profile = TRUE saves the settings so that you don't have to run
# textrpp_initialize() after restarting R.
textrpp_initialize(save_profile = TRUE)
```

Recent significant advances in NLP research have resulted in improved representations of human language (i.e., language models). These language models have produced large performance gains in tasks related to understanding human language. text makes these state-of-the-art models easily accessible through an interface to Hugging Face in Python.
text provides many of the contemporary state-of-the-art language models that are based on deep learning to model word order and context. Multilingual language models can also represent several languages; for example, multilingual BERT comprises 104 different languages.
Table 1. Some of the available language models
| Models | References | Layers | Dimensions | Language |
|---|---|---|---|---|
| ‘bert-base-uncased’ | Devlin et al. 2019 | 12 | 768 | English |
| ‘roberta-base’ | Liu et al. 2019 | 12 | 768 | English |
| ‘distilbert-base-cased’ | Sanh et al. 2019 | 6 | 768 | English |
| ‘bert-base-multilingual-cased’ | Devlin et al. 2019 | 12 | 768 | 104 top languages at Wikipedia |
| ‘xlm-roberta-large’ | Liu et al. | 24 | 1024 | 100 languages |
See Hugging Face for a more comprehensive list of models.
The textEmbed() function is the main embedding function in text; it can output contextualized embeddings for tokens (i.e., the embeddings for each single word instance of each text) and for texts (i.e., single embeddings per text, aggregated from all token embeddings of the text).
``` r
library(text)

# Transform the text data to BERT word embeddings

# Example text
texts <- c("I feel great!")

# Defaults
embeddings <- textEmbed(texts)
embeddings
```

See Get Started for more information.
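Beyond the defaults, textEmbed() takes a model argument for selecting any of the models listed in Table 1. A minimal sketch using multilingual BERT (note that the Swedish example sentence and the inspection of `$texts` are illustrative; multilingual models also download several hundred megabytes of weights on first use):

``` r
library(text)

# Texts in different languages can share one embedding space
# when a multilingual model is used.
texts <- c("I feel great!", "Jag mår bra!")

# Select multilingual BERT instead of the default model.
embeddings <- textEmbed(texts,
                        model = "bert-base-multilingual-cased")

# One aggregated embedding per text, comparable across languages.
embeddings$texts
```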
It is also possible to access many language analysis tasks, such as textClassify(), textGeneration(), and textTranslate().
``` r
library(text)

# Generate text from the prompt "I am happy to"
generated_text <- textGeneration("I am happy to",
                                 model = "gpt2")
generated_text
```

For a full list of language analysis tasks supported in text, see the References.
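The other tasks follow the same pattern: pass the text and a Hugging Face model name. A hedged sketch of textTranslate() under assumed argument names (source_lang, target_lang, and the t5-base model choice are illustrative here; check the function reference for the exact signature and defaults):

``` r
library(text)

# Translate a Swedish sentence to English.
# Argument names and the model choice are assumptions -- see the reference.
translated_text <- textTranslate("Jag mår bra!",
                                 source_lang = "sv",
                                 target_lang = "en",
                                 model = "t5-base")
translated_text
```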
text also provides functions to analyze the word embeddings with well-tested machine learning algorithms and statistics. The focus is to analyze and visualize texts and their relation to other texts or numerical variables. For example, the textTrain() function is used to examine how well the word embeddings from a text can predict a numeric or categorical variable. Another example is the functions that plot statistically significant words in the word embedding space.
``` r
library(text)

# Use data (DP_projections_HILS_SWLS_100) that have been pre-processed with the
# textProjectionData function; the preprocessed test data included in the
# package are called DP_projections_HILS_SWLS_100.
plot_projection <- textProjectionPlot(
  word_data = DP_projections_HILS_SWLS_100,
  y_axes = TRUE,
  title_top = " Supervised Bicentroid Projection of Harmony in life words",
  x_axes_label = "Low vs. High HILS score",
  y_axes_label = "Low vs. High SWLS score",
  position_jitter_hight = 0.5,
  position_jitter_width = 0.8
)
plot_projection$final_plot
```
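The textTrain() workflow mentioned above can be sketched as follows. The example assumes the demo datasets shipped with the package (the names word_embeddings_4 and Language_based_assessment_data_8, and their elements, are assumptions here; verify them in your installed version's data documentation):

``` r
library(text)

# Pre-computed example embeddings and ratings shipped with the package
# (dataset and element names assumed; see the package data documentation).
harmony_embeddings <- word_embeddings_4$texts$harmonytexts
harmony_ratings <- Language_based_assessment_data_8$hilstotal

# Train a model predicting the ratings from the embeddings
# (cross-validated by default).
trained_model <- textTrain(x = harmony_embeddings,
                           y = harmony_ratings)

# Correlation between predicted and observed ratings.
trained_model$results
```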
> Version 1.3 of the text package is now available from CRAN. This new version makes it easier to apply pre-trained language assessments from the LBAM library (r-text.org/articles/LBA…). — Oscar Kjell (@oscarkjell.bsky.social), Dec 22, 2024 at 9:48