
An R-package for analyzing natural language with transformers-based large language models. The text package is part of the R Language Analysis Suite, which includes talk, text, and topics.
talk transforms voice recordings into text, audio features, or embeddings. text provides many language tasks, such as converting digital text into word embeddings. talk and text offer access to Large Language Models from Hugging Face. topics visualizes language patterns as topics to generate psychological insights.
The R Language Analysis Suite is created through a collaboration between psychology and computer science to address research needs and ensure state-of-the-art techniques. The suite is continuously tested on Ubuntu, macOS, and Windows using the latest stable R version.
The text package has two main objectives:
* First, to serve R users as a point solution for transforming text to state-of-the-art word embeddings that are ready to be used for downstream tasks. The package provides a user-friendly link to language models based on transformers from Hugging Face.
* Second, to serve as an end-to-end solution that provides state-of-the-art AI techniques tailored for social and behavioral scientists.
Please reference our tutorial article when using the text package: The text-package: An R-package for Analyzing and Visualizing Human Language Using Natural Language Processing and Deep Learning.
Most users only need to run the installation code below. If you experience problems or want more alternatives, please see the Extended Installation Guide.
For the text package to work, you first have to install the text package in R and then make it work with the Python packages that text requires.
GitHub development version:
``` r
# install.packages("devtools")
devtools::install_github("oscarkjell/text")
```

CRAN version:
``` r
install.packages("text")
library(text)

# Install the Python packages that text requires, in a conda environment (with defaults).
textrpp_install()

# Initialize the installed conda environment.
# save_profile = TRUE saves the settings so that you don't have to run
# textrpp_initialize() after restarting R.
textrpp_initialize(save_profile = TRUE)
```

Recent significant advances in NLP research have resulted in improved representations of human language (i.e., language models). These language models have produced large performance gains in tasks related to understanding human language. text makes these state-of-the-art models easily accessible through an interface to Hugging Face in Python.
text provides many of the contemporary state-of-the-art language models that are based on deep learning to model word order and context. Multilingual language models can also represent several languages; for example, multilingual BERT comprises 104 different languages.
Table 1. Some of the available language models
| Models | References | Layers | Dimensions | Language |
|---|---|---|---|---|
| ‘bert-base-uncased’ | Devlin et al. 2019 | 12 | 768 | English |
| ‘roberta-base’ | Liu et al. 2019 | 12 | 768 | English |
| ‘distilbert-base-cased’ | Sanh et al. 2019 | 6 | 768 | English |
| ‘bert-base-multilingual-cased’ | Devlin et al. 2019 | 12 | 768 | 104 top languages at Wikipedia |
| ‘xlm-roberta-large’ | Liu et al. | 24 | 1024 | 100 languages |
See Hugging Face for a more comprehensive list of models.
The textEmbed() function is the main embedding function in text; it can output contextualized embeddings for tokens (i.e., the embeddings for each single word instance of each text) and for texts (i.e., single embeddings per text, aggregated from all token embeddings of the text).
``` r
library(text)

# Transform the text data to BERT word embeddings

# Example text
texts <- c("I feel great!")

# Defaults
embeddings <- textEmbed(texts)
embeddings
```

See Get Started for more information.
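Beyond the defaults, textEmbed() takes a model argument for selecting any of the models listed in Table 1. A minimal sketch using multilingual BERT (note that the Swedish example sentence and the inspection of `$texts` are illustrative; multilingual models also download several hundred megabytes of weights on first use):

``` r
library(text)

# Texts in different languages can share one embedding space
# when a multilingual model is used.
texts <- c("I feel great!", "Jag mår bra!")

# Select multilingual BERT instead of the default model.
embeddings <- textEmbed(texts,
                        model = "bert-base-multilingual-cased")

# One aggregated embedding per text, comparable across languages.
embeddings$texts
```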
It is also possible to access many language analysis tasks, such as textClassify(), textGeneration(), and textTranslate().
``` r
library(text)

# Generate text from the prompt "I am happy to"
generated_text <- textGeneration("I am happy to",
                                 model = "gpt2")
generated_text
```

For a full list of language analysis tasks supported in text, see the References.
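The other tasks follow the same pattern: pass the text and a Hugging Face model name. A hedged sketch of textTranslate() under assumed argument names (source_lang, target_lang, and the t5-base model choice are illustrative here; check the function reference for the exact signature and defaults):

``` r
library(text)

# Translate a Swedish sentence to English.
# Argument names and the model choice are assumptions -- see the reference.
translated_text <- textTranslate("Jag mår bra!",
                                 source_lang = "sv",
                                 target_lang = "en",
                                 model = "t5-base")
translated_text
```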
text also provides functions to analyze the word embeddings with well-tested machine learning algorithms and statistics. The focus is to analyze and visualize texts and their relation to other texts or numerical variables. For example, the textTrain() function is used to examine how well the word embeddings from a text can predict a numeric or categorical variable. Another example is the functions that plot statistically significant words in the word embedding space.
``` r
library(text)

# Use data (DP_projections_HILS_SWLS_100) that have been pre-processed with the
# textProjectionData function; the preprocessed test data included in the
# package are called DP_projections_HILS_SWLS_100.
plot_projection <- textProjectionPlot(
  word_data = DP_projections_HILS_SWLS_100,
  y_axes = TRUE,
  title_top = " Supervised Bicentroid Projection of Harmony in life words",
  x_axes_label = "Low vs. High HILS score",
  y_axes_label = "Low vs. High SWLS score",
  position_jitter_hight = 0.5,
  position_jitter_width = 0.8
)
plot_projection$final_plot
```
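The textTrain() workflow mentioned above can be sketched as follows. The example assumes the demo datasets shipped with the package (the names word_embeddings_4 and Language_based_assessment_data_8, and their elements, are assumptions here; verify them in your installed version's data documentation):

``` r
library(text)

# Pre-computed example embeddings and ratings shipped with the package
# (dataset and element names assumed; see the package data documentation).
harmony_embeddings <- word_embeddings_4$texts$harmonytexts
harmony_ratings <- Language_based_assessment_data_8$hilstotal

# Train a model predicting the ratings from the embeddings
# (cross-validated by default).
trained_model <- textTrain(x = harmony_embeddings,
                           y = harmony_ratings)

# Correlation between predicted and observed ratings.
trained_model$results
```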
> Version 1.3 of the text package is now available from CRAN. This new version makes it easier to apply pre-trained language assessments from the LBAM library (r-text.org/articles/LBA…). — Oscar Kjell (@oscarkjell.bsky.social), Dec 22, 2024 at 9:48