Title: Access to Large Language Model Predictions
Description: Provides access to word predictability estimates using large language models (LLMs) based on 'transformer' architectures via integration with the 'Hugging Face' ecosystem <https://huggingface.co/>. The package interfaces with pre-trained neural networks and supports both causal/auto-regressive LLMs (e.g., 'GPT-2') and masked/bidirectional LLMs (e.g., 'BERT') to compute the probability of words, phrases, or tokens given their linguistic context. For details on GPT-2 and causal models, see Radford et al. (2019) <https://storage.prod.researchhub.com/uploads/papers/2020/06/01/language-models.pdf>; for details on BERT and masked models, see Devlin et al. (2019) <doi:10.48550/arXiv.1810.04805>. By enabling straightforward estimation of word predictability, the package facilitates research in psycholinguistics, computational linguistics, and natural language processing (NLP).
Authors: Bruno Nicenboim [aut, cre] (ORCID: <https://orcid.org/0000-0002-5176-3943>), Chris Emmerly [ctb], Giovanni Cassani [ctb], Lisa Levinson [rev], Utku Turk [rev]
Maintainer: Bruno Nicenboim <[email protected]>
License: MIT + file LICENSE
Version: 1.0.3
Built: 2025-06-28 06:17:35 UTC
Source: https://github.com/ropensci/pangoling
Returns the configuration of a causal model.

causal_config(
  model = getOption("pangoling.causal.default"),
  checkpoint = NULL,
  config_model = NULL
)
model | Name of a pre-trained model or folder. One should be able to use models based on "gpt2". See the Hugging Face website.
checkpoint | Folder of a checkpoint. |
config_model | List with other arguments that control how the model from Hugging Face is accessed.
A list with the configuration of the model.
A causal language model (also called a GPT-like, auto-regressive, or decoder model) is a type of large language model, usually used for text generation, that can predict the next word (or, more accurately, the next token) based on a preceding context.
If not specified, the causal model used will be the one set in the global option pangoling.causal.default, which can be accessed via getOption("pangoling.causal.default") (by default "gpt2"). To change the default option, use options(pangoling.causal.default = "newcausalmodel").
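For instance, a minimal sketch of switching the session default and restoring it afterwards ("distilgpt2" stands in here for any causal model available on the Hugging Face Hub):

old_default <- getOption("pangoling.causal.default")
options(pangoling.causal.default = "distilgpt2") # illustrative model name
getOption("pangoling.causal.default")
options(pangoling.causal.default = old_default)  # restore the previous default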
A list of possible causal models can be found on the Hugging Face website.
Using the config_model and config_tokenizer arguments, it's possible to control how the model and tokenizer from Hugging Face are accessed; see the Python method from_pretrained for details.
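As an illustrative sketch, extra arguments are supplied as a named list; revision (which pins a specific model revision on the Hugging Face Hub) is used here only as an example of a from_pretrained argument:

causal_config(
  model = "gpt2",
  config_model = list(revision = "main") # passed through to from_pretrained
)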
In case of errors when a new model is run, check the status of https://status.huggingface.co/.
Other causal model helper functions: causal_preload()
causal_config(model = "gpt2")
This function predicts the possible next tokens and their predictability (log-probabilities by default). The function sorts tokens in descending order of their predictability.
causal_next_tokens_pred_tbl(
  context,
  log.p = getOption("pangoling.log.p"),
  decode = FALSE,
  model = getOption("pangoling.causal.default"),
  checkpoint = NULL,
  add_special_tokens = NULL,
  config_model = NULL,
  config_tokenizer = NULL
)
context | A single string representing the context for which the next tokens and their predictabilities are predicted.
log.p | Base of the logarithm used for the output predictability values. If TRUE (the default of the pangoling.log.p option), predictabilities are natural log-probabilities (base e); if FALSE, raw probabilities are returned; alternatively, a number can be given as the base of another logarithm (e.g., 1/2 for surprisal in bits).
decode | Logical. If TRUE, tokens are converted into human-readable strings, handling special characters like accents and diacritics. Default is FALSE.
model | Name of a pre-trained model or folder. One should be able to use models based on "gpt2". See the Hugging Face website.
checkpoint | Folder of a checkpoint. |
add_special_tokens | Whether to include special tokens. It has the same default as the AutoTokenizer method in Python.
config_model | List with other arguments that control how the model from Hugging Face is accessed.
config_tokenizer | List with other arguments that control how the tokenizer from Hugging Face is accessed.
The function uses a causal transformer model to compute the predictability of all tokens in the model's vocabulary, given a single input context. It returns a table where each row represents a token, along with its predictability score. By default, the function returns log-probabilities in natural logarithm (base e), but you can specify a different logarithm base (e.g., log.p = 1/2 for surprisal in bits).

If decode = TRUE, the tokens are converted into human-readable strings, handling special characters like accents and diacritics. This ensures that tokens are more interpretable, especially for languages with complex tokenization.
A table with possible next tokens and their log-probabilities.
A causal language model (also called a GPT-like, auto-regressive, or decoder model) is a type of large language model, usually used for text generation, that can predict the next word (or, more accurately, the next token) based on a preceding context.
If not specified, the causal model used will be the one set in the global option pangoling.causal.default, which can be accessed via getOption("pangoling.causal.default") (by default "gpt2"). To change the default option, use options(pangoling.causal.default = "newcausalmodel").
A list of possible causal models can be found on the Hugging Face website.
Using the config_model and config_tokenizer arguments, it's possible to control how the model and tokenizer from Hugging Face are accessed; see the Python method from_pretrained for details.
In case of errors when a new model is run, check the status of https://status.huggingface.co/.
Other causal model functions: causal_pred_mats(), causal_words_pred()
causal_next_tokens_pred_tbl(
  context = "The apple doesn't fall far from the",
  model = "gpt2"
)
This function computes a list of matrices, where each matrix corresponds to a unique group specified by the by argument. Each matrix represents the predictability of every token in the input text (x) based on preceding context, as evaluated by a causal transformer model.
causal_pred_mats(
  x,
  by = rep(1, length(x)),
  sep = " ",
  log.p = getOption("pangoling.log.p"),
  sorted = FALSE,
  model = getOption("pangoling.causal.default"),
  checkpoint = NULL,
  add_special_tokens = NULL,
  decode = FALSE,
  config_model = NULL,
  config_tokenizer = NULL,
  batch_size = 1,
  ...
)
x | A character vector of words, phrases, or texts to evaluate. |
by | A grouping variable indicating how texts are split into groups. |
sep | A string specifying how words are separated within contexts or groups. Default is " " (a single space).
log.p | Base of the logarithm used for the output predictability values. If TRUE (the default of the pangoling.log.p option), predictabilities are natural log-probabilities (base e); if FALSE, raw probabilities are returned; alternatively, a number can be given as the base of another logarithm (e.g., 1/2 for surprisal in bits).
sorted | When FALSE (the default), the original order of the groups defined by by is retained; when TRUE, the output is sorted (according to by).
model | Name of a pre-trained model or folder. One should be able to use models based on "gpt2". See the Hugging Face website.
checkpoint | Folder of a checkpoint. |
add_special_tokens | Whether to include special tokens. It has the same default as the AutoTokenizer method in Python.
decode | Logical. If TRUE, tokens are converted into human-readable strings, handling special characters like accents and diacritics. Default is FALSE.
config_model | List with other arguments that control how the model from Hugging Face is accessed.
config_tokenizer | List with other arguments that control how the tokenizer from Hugging Face is accessed.
batch_size | Maximum number of sentences/texts processed in parallel. Larger batches increase speed but use more memory. Since all texts in a batch must have the same length, shorter ones are padded with placeholder tokens.
... | Currently not in use. |
The function splits the input x into groups specified by the by argument and processes each group independently. For each group, the model computes the predictability of each token in its vocabulary based on the preceding context.
Each matrix contains:
Rows representing the model's vocabulary.
Columns corresponding to tokens in the group (e.g., a sentence or paragraph).
By default, values in the matrices are the natural logarithm of word probabilities.
A list of matrices with tokens in their columns and the vocabulary of the model in their rows.
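A minimal sketch of inspecting the shape of these matrices, assuming the bundled df_sent data:

data("df_sent")
mats <- causal_pred_mats(x = df_sent$word, by = df_sent$sent_n, model = "gpt2")
# Rows span the model's vocabulary; columns are the tokens of the first sentence
dim(mats[[1]])
# Log-probabilities (base e by default) for the first rows and columns
mats[[1]][1:5, 1:3]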
A causal language model (also called a GPT-like, auto-regressive, or decoder model) is a type of large language model, usually used for text generation, that can predict the next word (or, more accurately, the next token) based on a preceding context.
If not specified, the causal model used will be the one set in the global option pangoling.causal.default, which can be accessed via getOption("pangoling.causal.default") (by default "gpt2"). To change the default option, use options(pangoling.causal.default = "newcausalmodel").
A list of possible causal models can be found on the Hugging Face website.
Using the config_model and config_tokenizer arguments, it's possible to control how the model and tokenizer from Hugging Face are accessed; see the Python method from_pretrained for details.
In case of errors when a new model is run, check the status of https://status.huggingface.co/.
Other causal model functions: causal_next_tokens_pred_tbl(), causal_words_pred()
data("df_sent")df_sentlist_of_mats <- causal_pred_mats( x = df_sent$word, by = df_sent$sent_n, model = "gpt2" )# View the structure of the resulting listlist_of_mats |> str()# Inspect the last rows of the first matrixlist_of_mats[[1]] |> tail()# Inspect the last rows of the second matrixlist_of_mats[[2]] |> tail()
data("df_sent")df_sentlist_of_mats<- causal_pred_mats( x= df_sent$word, by= df_sent$sent_n, model="gpt2")# View the structure of the resulting listlist_of_mats|> str()# Inspect the last rows of the first matrixlist_of_mats[[1]]|> tail()# Inspect the last rows of the second matrixlist_of_mats[[2]]|> tail()
Preloads a causal language model to speed up subsequent runs.
causal_preload(
  model = getOption("pangoling.causal.default"),
  checkpoint = NULL,
  add_special_tokens = NULL,
  config_model = NULL,
  config_tokenizer = NULL
)
model | Name of a pre-trained model or folder. One should be able to use models based on "gpt2". See the Hugging Face website.
checkpoint | Folder of a checkpoint. |
add_special_tokens | Whether to include special tokens. It has the same default as the AutoTokenizer method in Python.
config_model | List with other arguments that control how the model from Hugging Face is accessed.
config_tokenizer | List with other arguments that control how the tokenizer from Hugging Face is accessed.
Nothing.
A causal language model (also called a GPT-like, auto-regressive, or decoder model) is a type of large language model, usually used for text generation, that can predict the next word (or, more accurately, the next token) based on a preceding context.
If not specified, the causal model used will be the one set in the global option pangoling.causal.default, which can be accessed via getOption("pangoling.causal.default") (by default "gpt2"). To change the default option, use options(pangoling.causal.default = "newcausalmodel").
A list of possible causal models can be found on the Hugging Face website.
Using the config_model and config_tokenizer arguments, it's possible to control how the model and tokenizer from Hugging Face are accessed; see the Python method from_pretrained for details.
In case of errors when a new model is run, check the status of https://status.huggingface.co/.
Other causal model helper functions: causal_config()
causal_preload(model = "gpt2")
These functions calculate the predictability of words, phrases, or tokens using a causal transformer model.
causal_words_pred(
  x,
  by = rep(1, length(x)),
  word_n = NULL,
  sep = " ",
  log.p = getOption("pangoling.log.p"),
  ignore_regex = "",
  model = getOption("pangoling.causal.default"),
  checkpoint = NULL,
  add_special_tokens = NULL,
  config_model = NULL,
  config_tokenizer = NULL,
  batch_size = 1,
  ...
)

causal_tokens_pred_lst(
  texts,
  log.p = getOption("pangoling.log.p"),
  model = getOption("pangoling.causal.default"),
  checkpoint = NULL,
  add_special_tokens = NULL,
  config_model = NULL,
  config_tokenizer = NULL,
  batch_size = 1
)

causal_targets_pred(
  contexts,
  targets,
  sep = " ",
  log.p = getOption("pangoling.log.p"),
  ignore_regex = "",
  model = getOption("pangoling.causal.default"),
  checkpoint = NULL,
  add_special_tokens = NULL,
  config_model = NULL,
  config_tokenizer = NULL,
  batch_size = 1,
  ...
)
x | A character vector of words, phrases, or texts to evaluate. |
by | A grouping variable indicating how texts are split into groups. |
word_n | Word order; by default, this is the word order of the vector x.
sep | A string specifying how words are separated within contexts or groups. Default is " " (a single space).
log.p | Base of the logarithm used for the output predictability values. If TRUE (the default of the pangoling.log.p option), predictabilities are natural log-probabilities (base e); if FALSE, raw probabilities are returned; alternatively, a number can be given as the base of another logarithm (e.g., 1/2 for surprisal in bits).
ignore_regex | Can ignore certain characters when calculating the log probabilities; for example, a regular expression matching punctuation can be supplied so that punctuation is skipped.
model | Name of a pre-trained model or folder. One should be able to use models based on "gpt2". See the Hugging Face website.
checkpoint | Folder of a checkpoint. |
add_special_tokens | Whether to include special tokens. It has the same default as the AutoTokenizer method in Python.
config_model | List with other arguments that control how the model from Hugging Face is accessed.
config_tokenizer | List with other arguments that control how the tokenizer from Hugging Face is accessed.
batch_size | Maximum number of sentences/texts processed in parallel. Larger batches increase speed but use more memory. Since all texts in a batch must have the same length, shorter ones are padded with placeholder tokens.
... | Currently not in use. |
texts | A vector or list of sentences or paragraphs. |
contexts | A character vector of contexts corresponding to each target. |
targets | A character vector of target words or phrases. |
These functions calculate the predictability (by default, the natural logarithm of the word probability) of words, phrases, or tokens using a causal transformer model:
causal_targets_pred(): Evaluates specific target words or phrases based on their given contexts. Use when you have explicit context-target pairs to evaluate, with each target word or phrase paired with a single preceding context.
causal_words_pred(): Computes predictability for all elements of a vector grouped by a specified variable. Use when working with words or phrases split into groups, such as sentences or paragraphs, where predictability is computed for every word or phrase in each group.
causal_tokens_pred_lst(): Computes the predictability of each token in a sentence (or group of sentences) and returns a list of results for each sentence. Use when you want to calculate the predictability of every token in one or more sentences.
See the online article on the pangoling website for more examples.
For causal_targets_pred() and causal_words_pred(), a named numeric vector of predictability scores. For causal_tokens_pred_lst(), a list of named numeric vectors, one for each sentence or group.
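Since causal_words_pred() returns one value per element of x, a common pattern (sketched here with the bundled df_sent data) is to store the scores alongside the words:

data("df_sent")
df_sent$pred <- causal_words_pred(
  x = df_sent$word,
  by = df_sent$sent_n,
  model = "gpt2"
)
df_sent # one predictability value per word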
A causal language model (also called a GPT-like, auto-regressive, or decoder model) is a type of large language model, usually used for text generation, that can predict the next word (or, more accurately, the next token) based on a preceding context.
If not specified, the causal model used will be the one set in the global option pangoling.causal.default, which can be accessed via getOption("pangoling.causal.default") (by default "gpt2"). To change the default option, use options(pangoling.causal.default = "newcausalmodel").
A list of possible causal models can be found on the Hugging Face website.
Using the config_model and config_tokenizer arguments, it's possible to control how the model and tokenizer from Hugging Face are accessed; see the Python method from_pretrained for details.
In case of errors when a new model is run, check the status of https://status.huggingface.co/.
Other causal model functions: causal_next_tokens_pred_tbl(), causal_pred_mats()
# Using causal_targets_pred
causal_targets_pred(
  contexts = c(
    "The apple doesn't fall far from the",
    "Don't judge a book by its"
  ),
  targets = c("tree.", "cover."),
  model = "gpt2"
)
# Using causal_words_pred
causal_words_pred(
  x = df_sent$word,
  by = df_sent$sent_n,
  model = "gpt2"
)
# Using causal_tokens_pred_lst
preds <- causal_tokens_pred_lst(
  texts = c(
    "The apple doesn't fall far from the tree.",
    "Don't judge a book by its cover."
  ),
  model = "gpt2"
)
preds
# Convert the output to a tidy table
suppressPackageStartupMessages(library(tidytable))
map2_dfr(
  preds, seq_along(preds),
  ~ data.frame(tokens = names(.x), pred = .x, id = .y)
)
This dataset contains data from a self-paced reading experiment on Chinese relative clause comprehension. It is structured to support analysis of reaction times, comprehension accuracy, and surprisal values across various experimental conditions in a 2x2 fully crossed factorial design.
data(df_jaeger14)
A tibble with 8,624 rows and 15 variables:
Participant identifier, a character vector.
Trial item number, an integer.
Experimental condition, a character vector indicating variations in sentence structure (e.g., "a", "b", "c", "d").
Chinese word presented in each trial, a character vector.
Position of the word within the sentence, an integer.
Reaction time in milliseconds for reading each word, an integer.
Sentence region or phrase type (e.g., "hd1", "Det+CL"), a character vector.
Comprehension question associated with the trial, a character vector.
Binary accuracy score for the comprehension question (1 = correct, 0 = incorrect).
Expected correct answer for the comprehension question, a character vector ("Y" or "N").
Type of comprehension question, a character vector.
Name of the experiment, indicating self-paced reading, a character vector.
Experimental list number, for counterbalancing item presentation, an integer.
Full sentence used in the trial with words marked for analysis, a character vector.
Model-derived surprisal values for each word, a numeric vector.
Region codes in the dataset (column region):
N: Main clause subject (in object-modifications only)
V: Main clause verb (in object-modifications only)
Det+CL: Determiner+classifier
Adv: Adverb
VN: RC-verb+RC-object (subject relatives) or RC-subject+RC-verb (object relatives)
Note: These two words were merged into one region after the experiment; they were presented as separate regions during the experiment.
FreqP: Frequency phrase/durational phrase
DE: Relativizer "de"
head: Relative clause head noun
hd1: First word after the head noun
hd2: Second word after the head noun
hd3: Third word after the head noun
hd4: Fourth word after the head noun (only in subject-modifications)
hd5: Fifth word after the head noun (only in subject-modifications)
Notes on reading times (column rt):
The reading time of the relative clause region (e.g., "V-N" or "N-V") was computed by summing up the reading times of the relative clause verb and noun.
The verb and noun were presented as two separate regions during the experiment.
Factor I: Modification type (subject modification; object modification)
Factor II: Relative clause type (subject relative; object relative)
Condition labels:
a) subject modification; subject relative
b) subject modification; object relative
c) object modification; subject relative
d) object modification; object relative
Jäger, L., Chen, Z., Li, Q., Lin, C.-J. C., & Vasishth, S. (2015). The subject-relative advantage in Chinese: Evidence for expectation-based processing. Journal of Memory and Language, 79–80, 97–120. doi:10.1016/j.jml.2014.10.005
Other datasets: df_sent
# Basic exploration
head(df_jaeger14)
# Summarize reaction times by region
library(tidytable)
df_jaeger14 |>
  group_by(region) |>
  summarize(mean_rt = mean(rt, na.rm = TRUE))
This dataset contains two example sentences, split word-by-word. It is structured to demonstrate the use of the pangoling package for processing text data.
df_sent
A data frame with 15 rows and 2 columns:
(integer) Sentence number, indicating which sentence each word belongs to.
(character) Words from the sentences.
Other datasets: df_jaeger14
# Load the datasetdata("df_sent")df_sent
# Load the datasetdata("df_sent")df_sent
The install_py_pangoling function facilitates the installation of the Python packages needed for using pangoling within an R environment, utilizing the reticulate package for managing Python environments. It supports various installation methods, environment settings, and Python versions.
install_py_pangoling(method = c("auto", "virtualenv", "conda"), conda = "auto", version = "default", envname = "r-pangoling", restart_session = TRUE, conda_python_version = NULL, ..., pip_ignore_installed = FALSE, new_env = identical(envname, "r-pangoling"), python_version = NULL)
install_py_pangoling(method= c("auto","virtualenv","conda"), conda="auto", version="default", envname="r-pangoling", restart_session=TRUE, conda_python_version=NULL,..., pip_ignore_installed=FALSE, new_env= identical(envname,"r-pangoling"), python_version=NULL)
method | A character vector specifying the environment management method. Options are 'auto', 'virtualenv', and 'conda'. Default is 'auto'.
conda | Specifies the conda binary to use. Default is 'auto'. |
version | The Python version to use. Default is 'default', automaticallyselected. |
envname | Name of the virtual environment. Default is 'r-pangoling'. |
restart_session | Logical, whether to restart the R session after installation. Default is TRUE.
conda_python_version | Python version for conda environments. |
... | Additional arguments passed on to the underlying installation function.
pip_ignore_installed | Logical, whether to ignore already installedpackages. Default is FALSE. |
new_env | Logical, whether to create a new environment. Defaults to TRUE when envname is "r-pangoling".
python_version | Specifies the Python version for the environment. |
This function automatically selects the appropriate method for environment management and Python installation, with a focus on virtual and conda environments. It ensures flexibility in dependency management and Python version control. If a new environment is created, existing environments with the same name are removed.
The function returns NULL invisibly, but outputs a message on successful installation.
Other helper functions: installed_py_pangoling(), set_cache_folder()
# Install with default settings:
## Not run: 
install_py_pangoling()
## End(Not run)
Checks whether the Python dependencies for pangoling are installed. This function verifies whether the necessary Python modules (transformers and torch) are available in the current Python environment.
installed_py_pangoling()
A logical value: TRUE if both transformers and torch are installed and accessible, otherwise FALSE.
Other helper functions: install_py_pangoling(), set_cache_folder()
## Not run: 
if (installed_py_pangoling()) {
  message("Python dependencies are installed.")
} else {
  warning(
    "Python dependencies are missing. Please install `torch` and `transformers`."
  )
}
## End(Not run)
Returns the configuration of a masked model.
masked_config(
  model = getOption("pangoling.masked.default"),
  config_model = NULL
)
model | Name of a pre-trained model or folder. One should be able to use models based on "bert". See the Hugging Face website.
config_model | List with other arguments that control how the model from Hugging Face is accessed.
A masked language model (also called a BERT-like or encoder model) is a type of large language model that can be used to predict the content of a mask in a sentence.
If not specified, the masked model used will be the one set in the global option pangoling.masked.default, which can be accessed via getOption("pangoling.masked.default") (by default "bert-base-uncased"). To change the default option, use options(pangoling.masked.default = "newmaskedmodel").
A list of possible masked models can be found on the Hugging Face website.
Using the config_model and config_tokenizer arguments, it's possible to control how the model and tokenizer from Hugging Face are accessed; see the Python method from_pretrained for details. In case of errors, check the status of https://status.huggingface.co/.
A list with the configuration of the model.
Other masked model helper functions: masked_preload()
masked_config(model = "bert-base-uncased")
Preloads a masked language model to speed up subsequent runs.
masked_preload(
  model = getOption("pangoling.masked.default"),
  add_special_tokens = NULL,
  config_model = NULL,
  config_tokenizer = NULL
)
model | Name of a pre-trained model or folder. One should be able to use models based on "bert". See the Hugging Face website.
add_special_tokens | Whether to include special tokens. It has the same default as the AutoTokenizer method in Python.
config_model | List with other arguments that control how the model from Hugging Face is accessed.
config_tokenizer | List with other arguments that control how the tokenizer from Hugging Face is accessed.
A masked language model (also called a BERT-like or encoder model) is a type of large language model that can be used to predict the content of a mask in a sentence.
If not specified, the masked model used will be the one set in the global option pangoling.masked.default, which can be accessed via getOption("pangoling.masked.default") (by default "bert-base-uncased"). To change the default option, use options(pangoling.masked.default = "newmaskedmodel").
A list of possible masked models can be found on the Hugging Face website.
Using the config_model and config_tokenizer arguments, it's possible to control how the model and tokenizer from Hugging Face are accessed; see the Python method from_pretrained for details. In case of errors, check the status of https://status.huggingface.co/.
Nothing.
Other masked model helper functions: masked_config()
causal_preload(model = "bert-base-uncased")
causal_preload(model="bert-base-uncased")
Get the predictability (by default, the natural logarithm of the word probability) of a vector of target words (or phrases) given a vector of left and right contexts, using a masked transformer.
masked_targets_pred(
  prev_contexts,
  targets,
  after_contexts,
  log.p = getOption("pangoling.log.p"),
  ignore_regex = "",
  model = getOption("pangoling.masked.default"),
  checkpoint = NULL,
  add_special_tokens = NULL,
  config_model = NULL,
  config_tokenizer = NULL
)
prev_contexts | Left context of the target word in left-to-right written languages.
targets | Target words.
after_contexts | Right context of the target in left-to-right written languages.
log.p | Base of the logarithm used for the output predictability values. If TRUE (the default of the pangoling.log.p option), predictabilities are natural log-probabilities (base e); if FALSE, raw probabilities are returned; alternatively, a number can be given as the base of another logarithm (e.g., 1/2 for surprisal in bits).
ignore_regex | Can ignore certain characters when calculating the log probabilities; for example, a regular expression matching punctuation can be supplied so that punctuation is skipped.
model | Name of a pre-trained model or folder. One should be able to use models based on "bert". See the Hugging Face website.
checkpoint | Folder of a checkpoint. |
add_special_tokens | Whether to include special tokens. It has the same default as the AutoTokenizer method in Python.
config_model | List with other arguments that control how the model from Hugging Face is accessed.
config_tokenizer | List with other arguments that control how the tokenizer from Hugging Face is accessed.
A masked language model (also called a BERT-like or encoder model) is a type of large language model that can be used to predict the content of a mask in a sentence.
If not specified, the masked model used will be the one set in the global option pangoling.masked.default, which can be accessed via getOption("pangoling.masked.default") (by default "bert-base-uncased"). To change the default option, use options(pangoling.masked.default = "newmaskedmodel").
A list of possible masked models can be found on the Hugging Face website.
Using the config_model and config_tokenizer arguments, it's possible to control how the model and tokenizer from Hugging Face are accessed; see the Python method from_pretrained for details. In case of errors, check the status of https://status.huggingface.co/.
A named vector of predictability values (by default, the natural logarithm of the word probability).
See the online article on the pangoling website for more examples.
Other masked model functions: masked_tokens_pred_tbl()
masked_targets_pred( prev_contexts = c("The", "The"), targets = c("apple", "pear"), after_contexts = c( "doesn't fall far from the tree.", "doesn't fall far from the tree." ), model = "bert-base-uncased")
masked_targets_pred( prev_contexts= c("The","The"), targets= c("apple","pear"), after_contexts= c("doesn't fall far from the tree.","doesn't fall far from the tree."), model="bert-base-uncased")
For each mask (indicated with [MASK]) in a sentence, get the possible tokens and their predictability (by default, the natural logarithm of the word probability) using a masked transformer.
masked_tokens_pred_tbl(
  masked_sentences,
  log.p = getOption("pangoling.log.p"),
  model = getOption("pangoling.masked.default"),
  checkpoint = NULL,
  add_special_tokens = NULL,
  config_model = NULL,
  config_tokenizer = NULL
)
masked_sentences | Masked sentences. |
log.p | Base of the logarithm used for the output predictability values. If TRUE (the default of the pangoling.log.p option), predictabilities are natural log-probabilities (base e); if FALSE, raw probabilities are returned; alternatively, a number can be given as the base of another logarithm (e.g., 1/2 for surprisal in bits).
model | Name of a pre-trained model or folder. One should be able to use models based on "bert". See the Hugging Face website.
checkpoint | Folder of a checkpoint. |
add_special_tokens | Whether to include special tokens. It has the same default as the AutoTokenizer method in Python.
config_model | List with other arguments that control how the model from Hugging Face is accessed.
config_tokenizer | List with other arguments that control how the tokenizer from Hugging Face is accessed.
A masked language model (also called a BERT-like or encoder model) is a type of large language model that can be used to predict the content of a mask in a sentence.
If not specified, the masked model used will be the one set in the global option pangoling.masked.default, which can be accessed via getOption("pangoling.masked.default") (by default "bert-base-uncased"). To change the default option, use options(pangoling.masked.default = "newmaskedmodel").
A list of possible masked models can be found on the Hugging Face website.
Using the config_model and config_tokenizer arguments, it's possible to control how the model and tokenizer from Hugging Face are accessed; see the Python method from_pretrained for details. In case of errors, check the status of https://status.huggingface.co/.
A table with the masked sentences, the tokens (token), their predictability (pred), and the respective mask number (mask_n).
See the online article on the pangoling website for more examples.
Other masked model functions: masked_targets_pred()
masked_tokens_pred_tbl("The [MASK] doesn't fall far from the tree.", model = "bert-base-uncased")
masked_tokens_pred_tbl("The [MASK] doesn't fall far from the tree.", model="bert-base-uncased")
The number of tokens in a string or vector of strings.
ntokens(
  x,
  model = getOption("pangoling.causal.default"),
  add_special_tokens = NULL,
  config_tokenizer = NULL
)
x | Character input.
model | Name of a pre-trained model or folder. One should be able to use models based on "gpt2". See the Hugging Face website.
add_special_tokens | Whether to include special tokens. It has the same default as the AutoTokenizer method in Python.
config_tokenizer | List with other arguments that control how the tokenizer from Hugging Face is accessed.
The number of tokens in a string or vector of words.
Other token-related functions: tokenize_lst(), transformer_vocab()
ntokens(x = c("The apple doesn't fall far from the tree."), model = "gpt2")
Calculates the perplexity of a vector of (log-)probabilities.
perplexity_calc(x, na.rm = FALSE, log.p = TRUE)
x | A vector of log-probabilities. |
na.rm | Should missing values (including NaN) be removed? |
log.p | If TRUE (default), x is assumed to contain log-transformed probabilities with base e; if FALSE, x is assumed to contain raw probabilities; alternatively, log.p can be the base of another logarithmic transformation.
If x are raw probabilities (NOT the default), then perplexity is calculated as perplexity = (x_1 * x_2 * ... * x_n)^(-1/n), equivalently exp(-mean(log(x))), where n is the number of probabilities.
The perplexity.
probs <- c(.3, .5, .6)
perplexity_calc(probs, log.p = FALSE)
lprobs <- log(probs)
perplexity_calc(lprobs, log.p = TRUE)
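As a quick check of the formula above, the closed form and both calls should agree:

probs <- c(.3, .5, .6)
(prod(probs))^(-1 / length(probs))    # closed form on raw probabilities
exp(-mean(log(probs)))                # equivalent form via log-probabilities
perplexity_calc(probs, log.p = FALSE) # same value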
This function sets the cache directory for Hugging Face transformers. If a path is given, the function checks if the directory exists and then sets the HF_HOME environment variable to this path. If no path is provided, the function checks for an existing cache directory in a number of environment variables. If none of these environment variables are set, it provides the user with information on the default cache directory.
set_cache_folder(path = NULL)
path | Character string, the path to set as the cache directory. If NULL, the function will look for the cache directory in a number of environment variables. Default is NULL.
Nothing is returned; this function is called for its side effect of setting the HF_HOME environment variable, or providing information to the user.
Other helper functions: install_py_pangoling(), installed_py_pangoling()
## Not run: 
set_cache_folder("~/new_cache_dir")
## End(Not run)
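As a small follow-up sketch, since the function sets HF_HOME, the result can be verified with Sys.getenv():

## Not run: 
set_cache_folder("~/new_cache_dir")
Sys.getenv("HF_HOME") # should now point to the new cache directory
## End(Not run)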
Tokenize a string or token ids.
tokenize_lst(
  x,
  decode = FALSE,
  model = getOption("pangoling.causal.default"),
  add_special_tokens = NULL,
  config_tokenizer = NULL
)
x | Strings or token ids. |
decode | Logical. If TRUE, tokens are converted into human-readable strings, handling special characters like accents and diacritics. Default is FALSE.
model | Name of a pre-trained model or folder. One should be able to use models based on "gpt2". See the Hugging Face website.
add_special_tokens | Whether to include special tokens. It has the same default as the AutoTokenizer method in Python.
config_tokenizer | List with other arguments that control how the tokenizer from Hugging Face is accessed.
A list with tokens.
Other token-related functions: ntokens(), transformer_vocab()
tokenize_lst(x = c("The apple doesn't fall far from the tree."), model = "gpt2")
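A sketch contrasting raw and decoded output (assuming "gpt2"; raw GPT-2 tokens mark word-initial spaces with a special character, which decode = TRUE renders as plain text):

tokenize_lst(
  x = "The apple doesn't fall far from the tree.",
  model = "gpt2",
  decode = TRUE
)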
Returns the (decoded) vocabulary of a model.
transformer_vocab(
  model = getOption("pangoling.causal.default"),
  add_special_tokens = NULL,
  decode = FALSE,
  config_tokenizer = NULL
)
model | Name of a pre-trained model or folder. One should be able to use models based on "gpt2". See the Hugging Face website.
add_special_tokens | Whether to include special tokens. It has the same default as the AutoTokenizer method in Python.
decode | Logical. If TRUE, tokens are converted into human-readable strings, handling special characters like accents and diacritics. Default is FALSE.
config_tokenizer | List with other arguments that control how the tokenizer from Hugging Face is accessed.
A vector with the vocabulary of a model.
Other token-related functions: ntokens(), tokenize_lst()
transformer_vocab(model = "gpt2") |> head()
transformer_vocab(model="gpt2")|> head()