Package 'pangoling'

Title: Access to Large Language Model Predictions

Description: Provides access to word predictability estimates using large language models (LLMs) based on 'transformer' architectures via integration with the 'Hugging Face' ecosystem <https://huggingface.co/>. The package interfaces with pre-trained neural networks and supports both causal/auto-regressive LLMs (e.g., 'GPT-2') and masked/bidirectional LLMs (e.g., 'BERT') to compute the probability of words, phrases, or tokens given their linguistic context. For details on GPT-2 and causal models, see Radford et al. (2019) <https://storage.prod.researchhub.com/uploads/papers/2020/06/01/language-models.pdf>; for details on BERT and masked models, see Devlin et al. (2019) <doi:10.48550/arXiv.1810.04805>. By enabling a straightforward estimation of word predictability, the package facilitates research in psycholinguistics, computational linguistics, and natural language processing (NLP).

Authors: Bruno Nicenboim [aut, cre] (ORCID: <https://orcid.org/0000-0002-5176-3943>), Chris Emmerly [ctb], Giovanni Cassani [ctb], Lisa Levinson [rev], Utku Turk [rev]

Maintainer: Bruno Nicenboim <[email protected]>

License: MIT + file LICENSE

Version: 1.0.3

Built: 2025-06-28 06:17:35 UTC

Source: https://github.com/ropensci/pangoling

Help Index


Returns the configuration of a causal model

Description

Returns the configuration of a causal model

Usage

causal_config(
  model = getOption("pangoling.causal.default"),
  checkpoint = NULL,
  config_model = NULL
)

Arguments

model

Name of a pre-trained model or folder. One should be able to use models based on "gpt2". See the Hugging Face website.

checkpoint

Folder of a checkpoint.

config_model

List with other arguments that control how the model from Hugging Face is accessed.

Value

A list with the configuration of the model.

More details about causal models

A causal language model (also called a GPT-like, auto-regressive, or decoder model) is a type of large language model usually used for text generation that can predict the next word (or, more accurately, the next token) based on a preceding context.

If not specified, the causal model used will be the one set in the global option pangoling.causal.default, which can be accessed via getOption("pangoling.causal.default") (by default "gpt2"). To change the default option, use options(pangoling.causal.default = "newcausalmodel").

A list of possible causal models can be found on the Hugging Face website.

Using the config_model and config_tokenizer arguments, it's possible to control how the model and tokenizer from Hugging Face are accessed; see the Python method from_pretrained for details.

In case of errors when a new model is run, check the status of https://status.huggingface.co/

See Also

Other causal model helper functions: causal_preload()

Examples

causal_config(model = "gpt2")
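
As a follow-up, the default-model option described above can be inspected and changed; this is a minimal sketch not taken from the original examples, and "distilgpt2" is only an illustrative Hugging Face model name.

getOption("pangoling.causal.default")            # "gpt2" by default
options(pangoling.causal.default = "distilgpt2") # illustrative alternative model
causal_config()                                  # now uses the new default
options(pangoling.causal.default = "gpt2")       # restore the original default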

Generate next tokens after a context and their predictability using a causal transformer model

Description

This function predicts the possible next tokens and their predictability (log-probabilities by default). The function sorts tokens in descending order of their predictability.

Usage

causal_next_tokens_pred_tbl(
  context,
  log.p = getOption("pangoling.log.p"),
  decode = FALSE,
  model = getOption("pangoling.causal.default"),
  checkpoint = NULL,
  add_special_tokens = NULL,
  config_model = NULL,
  config_tokenizer = NULL
)

Arguments

context

A single string representing the context for which the next tokens and their predictabilities are predicted.

log.p

Base of the logarithm used for the output predictability values. If TRUE (default), the natural logarithm (base e) is used. If FALSE, the raw probabilities are returned. Alternatively, log.p can be set to a numeric value specifying the base of the logarithm (e.g., 2 for base-2 logarithms). To get surprisal in bits (rather than predictability), set log.p = 1/2.

decode

Logical. If TRUE, decodes the tokens into human-readable strings, handling special characters and diacritics. Default is FALSE.

model

Name of a pre-trained model or folder. One should be able to use models based on "gpt2". See the Hugging Face website.

checkpoint

Folder of a checkpoint.

add_special_tokens

Whether to include special tokens. It has the same default as the AutoTokenizer method in Python.

config_model

List with other arguments that control how the model from Hugging Face is accessed.

config_tokenizer

List with other arguments that control how the tokenizer from Hugging Face is accessed.

Details

The function uses a causal transformer model to compute the predictability of all tokens in the model's vocabulary, given a single input context. It returns a table where each row represents a token, along with its predictability score. By default, the function returns log-probabilities in natural logarithm (base e), but you can specify a different logarithm base (e.g., log.p = 1/2 for surprisal in bits).

If decode = TRUE, the tokens are converted into human-readable strings, handling special characters like accents and diacritics. This ensures that tokens are more interpretable, especially for languages with complex tokenization.

Value

A table with possible next tokens and their log-probabilities.

More details about causal models

A causal language model (also called a GPT-like, auto-regressive, or decoder model) is a type of large language model usually used for text generation that can predict the next word (or, more accurately, the next token) based on a preceding context.

If not specified, the causal model used will be the one set in the global option pangoling.causal.default, which can be accessed via getOption("pangoling.causal.default") (by default "gpt2"). To change the default option, use options(pangoling.causal.default = "newcausalmodel").

A list of possible causal models can be found on the Hugging Face website.

Using the config_model and config_tokenizer arguments, it's possible to control how the model and tokenizer from Hugging Face are accessed; see the Python method from_pretrained for details.

In case of errors when a new model is run, check the status of https://status.huggingface.co/

See Also

Other causal model functions: causal_pred_mats(), causal_words_pred()

Examples

causal_next_tokens_pred_tbl(
  context = "The apple doesn't fall far from the",
  model = "gpt2"
)
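
A follow-up sketch (not part of the original example): because the rows are sorted by predictability, the top rows are the model's most likely continuations; assuming the predictability column is named pred, as documented for masked_tokens_pred_tbl(), the natural-log values can be converted back to probabilities with exp().

tbl <- causal_next_tokens_pred_tbl(
  context = "The apple doesn't fall far from the",
  model = "gpt2"
)
head(tbl)            # most likely next tokens first
exp(head(tbl)$pred)  # assumed `pred` column, converted to probabilities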

Generate a list of predictability matrices using a causal transformer model

Description

This function computes a list of matrices, where each matrix corresponds to a unique group specified by the by argument. Each matrix represents the predictability of every token in the input text (x) based on preceding context, as evaluated by a causal transformer model.

Usage

causal_pred_mats(
  x,
  by = rep(1, length(x)),
  sep = " ",
  log.p = getOption("pangoling.log.p"),
  sorted = FALSE,
  model = getOption("pangoling.causal.default"),
  checkpoint = NULL,
  add_special_tokens = NULL,
  decode = FALSE,
  config_model = NULL,
  config_tokenizer = NULL,
  batch_size = 1,
  ...
)

Arguments

x

A character vector of words, phrases, or texts to evaluate.

by

A grouping variable indicating how texts are split into groups.

sep

A string specifying how words are separated within contexts or groups. Default is " ". For languages that don't have spaces between words (e.g., Chinese), set sep = "".

log.p

Base of the logarithm used for the output predictability values. If TRUE (default), the natural logarithm (base e) is used. If FALSE, the raw probabilities are returned. Alternatively, log.p can be set to a numeric value specifying the base of the logarithm (e.g., 2 for base-2 logarithms). To get surprisal in bits (rather than predictability), set log.p = 1/2.

sorted

When FALSE (the default), the order of the groups being split by is retained. When TRUE, the returned list(s) are sorted according to by.

model

Name of a pre-trained model or folder. One should be able to use models based on "gpt2". See the Hugging Face website.

checkpoint

Folder of a checkpoint.

add_special_tokens

Whether to include special tokens. It has the same default as the AutoTokenizer method in Python.

decode

Logical. If TRUE, decodes the tokens into human-readable strings, handling special characters and diacritics. Default is FALSE.

config_model

List with other arguments that control how the model from Hugging Face is accessed.

config_tokenizer

List with other arguments that control how the tokenizer from Hugging Face is accessed.

batch_size

Maximum number of sentences/texts processed in parallel. Larger batches increase speed but use more memory. Since all texts in a batch must have the same length, shorter ones are padded with placeholder tokens.

...

Currently not in use.

Details

The function splits the input x into groups specified by the by argument and processes each group independently. For each group, the model computes the predictability of each token in its vocabulary based on preceding context.

Each matrix contains:

  • Rows representing the model's vocabulary.

  • Columns corresponding to tokens in the group (e.g., a sentence or paragraph).

  • By default, values in the matrices are the natural logarithm of word probabilities.

Value

A list of matrices with tokens in their columns and the vocabulary of the model in their rows.

More details about causal models

A causal language model (also called a GPT-like, auto-regressive, or decoder model) is a type of large language model usually used for text generation that can predict the next word (or, more accurately, the next token) based on a preceding context.

If not specified, the causal model used will be the one set in the global option pangoling.causal.default, which can be accessed via getOption("pangoling.causal.default") (by default "gpt2"). To change the default option, use options(pangoling.causal.default = "newcausalmodel").

A list of possible causal models can be found on the Hugging Face website.

Using the config_model and config_tokenizer arguments, it's possible to control how the model and tokenizer from Hugging Face are accessed; see the Python method from_pretrained for details.

In case of errors when a new model is run, check the status of https://status.huggingface.co/

See Also

Other causal model functions: causal_next_tokens_pred_tbl(), causal_words_pred()

Examples

data("df_sent")df_sentlist_of_mats<- causal_pred_mats(                       x= df_sent$word,                       by= df_sent$sent_n,                         model="gpt2")# View the structure of the resulting listlist_of_mats|> str()# Inspect the last rows of the first matrixlist_of_mats[[1]]|> tail()# Inspect the last rows of the second matrixlist_of_mats[[2]]|> tail()

Preloads a causal language model

Description

Preloads a causal language model to speed up next runs.

Usage

causal_preload(
  model = getOption("pangoling.causal.default"),
  checkpoint = NULL,
  add_special_tokens = NULL,
  config_model = NULL,
  config_tokenizer = NULL
)

Arguments

model

Name of a pre-trained model or folder. One should be able to use models based on "gpt2". See the Hugging Face website.

checkpoint

Folder of a checkpoint.

add_special_tokens

Whether to include special tokens. It has the same default as the AutoTokenizer method in Python.

config_model

List with other arguments that control how the model from Hugging Face is accessed.

config_tokenizer

List with other arguments that control how the tokenizer from Hugging Face is accessed.

Value

Nothing.

More details about causal models

A causal language model (also called a GPT-like, auto-regressive, or decoder model) is a type of large language model usually used for text generation that can predict the next word (or, more accurately, the next token) based on a preceding context.

If not specified, the causal model used will be the one set in the global option pangoling.causal.default, which can be accessed via getOption("pangoling.causal.default") (by default "gpt2"). To change the default option, use options(pangoling.causal.default = "newcausalmodel").

A list of possible causal models can be found on the Hugging Face website.

Using the config_model and config_tokenizer arguments, it's possible to control how the model and tokenizer from Hugging Face are accessed; see the Python method from_pretrained for details.

In case of errors when a new model is run, check the status of https://status.huggingface.co/

See Also

Other causal model helper functions: causal_config()

Examples

causal_preload(model = "gpt2")

Compute predictability using a causal transformer model

Description

These functions calculate the predictability of words, phrases, or tokens using a causal transformer model.

Usage

causal_words_pred(
  x,
  by = rep(1, length(x)),
  word_n = NULL,
  sep = " ",
  log.p = getOption("pangoling.log.p"),
  ignore_regex = "",
  model = getOption("pangoling.causal.default"),
  checkpoint = NULL,
  add_special_tokens = NULL,
  config_model = NULL,
  config_tokenizer = NULL,
  batch_size = 1,
  ...
)

causal_tokens_pred_lst(
  texts,
  log.p = getOption("pangoling.log.p"),
  model = getOption("pangoling.causal.default"),
  checkpoint = NULL,
  add_special_tokens = NULL,
  config_model = NULL,
  config_tokenizer = NULL,
  batch_size = 1
)

causal_targets_pred(
  contexts,
  targets,
  sep = " ",
  log.p = getOption("pangoling.log.p"),
  ignore_regex = "",
  model = getOption("pangoling.causal.default"),
  checkpoint = NULL,
  add_special_tokens = NULL,
  config_model = NULL,
  config_tokenizer = NULL,
  batch_size = 1,
  ...
)

Arguments

x

A character vector of words, phrases, or texts to evaluate.

by

A grouping variable indicating how texts are split into groups.

word_n

Word order; by default this is the word order of the vector x.

sep

A string specifying how words are separated within contexts or groups. Default is " ". For languages that don't have spaces between words (e.g., Chinese), set sep = "".

log.p

Base of the logarithm used for the output predictability values. If TRUE (default), the natural logarithm (base e) is used. If FALSE, the raw probabilities are returned. Alternatively, log.p can be set to a numeric value specifying the base of the logarithm (e.g., 2 for base-2 logarithms). To get surprisal in bits (rather than predictability), set log.p = 1/2.

ignore_regex

Can ignore certain characters when calculating the log probabilities. For example, ^[[:punct:]]$ will ignore all punctuation that stands alone in a token.

model

Name of a pre-trained model or folder. One should be able to use models based on "gpt2". See the Hugging Face website.

checkpoint

Folder of a checkpoint.

add_special_tokens

Whether to include special tokens. It has the same default as the AutoTokenizer method in Python.

config_model

List with other arguments that control how the model from Hugging Face is accessed.

config_tokenizer

List with other arguments that control how the tokenizer from Hugging Face is accessed.

batch_size

Maximum number of sentences/texts processed in parallel. Larger batches increase speed but use more memory. Since all texts in a batch must have the same length, shorter ones are padded with placeholder tokens.

...

Currently not in use.

texts

A vector or list of sentences or paragraphs.

contexts

A character vector of contexts corresponding to each target.

targets

A character vector of target words or phrases.

Details

These functions calculate the predictability (by default the natural logarithm of the word probability) of words, phrases, or tokens using a causal transformer model:

  • causal_targets_pred(): Evaluates specific target words or phrases based on their given contexts. Use when you have explicit context-target pairs to evaluate, with each target word or phrase paired with a single preceding context.

  • causal_words_pred(): Computes predictability for all elements of a vector grouped by a specified variable. Use when working with words or phrases split into groups, such as sentences or paragraphs, where predictability is computed for every word or phrase in each group.

  • causal_tokens_pred_lst(): Computes the predictability of each token in a sentence (or group of sentences) and returns a list of results for each sentence. Use when you want to calculate the predictability of every token in one or more sentences.

See the online article on the pangoling website for more examples.

Value

For causal_targets_pred() and causal_words_pred(), a named numeric vector of predictability scores. For causal_tokens_pred_lst(), a list of named numeric vectors, one for each sentence or group.

More details about causal models

A causal language model (also called a GPT-like, auto-regressive, or decoder model) is a type of large language model usually used for text generation that can predict the next word (or, more accurately, the next token) based on a preceding context.

If not specified, the causal model used will be the one set in the global option pangoling.causal.default, which can be accessed via getOption("pangoling.causal.default") (by default "gpt2"). To change the default option, use options(pangoling.causal.default = "newcausalmodel").

A list of possible causal models can be found on the Hugging Face website.

Using the config_model and config_tokenizer arguments, it's possible to control how the model and tokenizer from Hugging Face are accessed; see the Python method from_pretrained for details.

In case of errors when a new model is run, check the status of https://status.huggingface.co/

See Also

Other causal model functions: causal_next_tokens_pred_tbl(), causal_pred_mats()

Examples

# Using causal_targets_pred
causal_targets_pred(
  contexts = c(
    "The apple doesn't fall far from the",
    "Don't judge a book by its"
  ),
  targets = c("tree.", "cover."),
  model = "gpt2"
)

# Using causal_words_pred
causal_words_pred(
  x = df_sent$word,
  by = df_sent$sent_n,
  model = "gpt2"
)

# Using causal_tokens_pred_lst
preds <- causal_tokens_pred_lst(
  texts = c(
    "The apple doesn't fall far from the tree.",
    "Don't judge a book by its cover."
  ),
  model = "gpt2"
)
preds

# Convert the output to a tidy table
suppressPackageStartupMessages(library(tidytable))
map2_dfr(preds, seq_along(preds),
         ~ data.frame(tokens = names(.x), pred = .x, id = .y))
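
An additional sketch grounded in the log.p documentation above: setting log.p = 1/2 returns surprisal in bits instead of natural-log predictability.

causal_targets_pred(
  contexts = c("The apple doesn't fall far from the"),
  targets = c("tree."),
  log.p = 1/2,  # surprisal in bits
  model = "gpt2"
)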

Self-Paced Reading Dataset on Chinese Relative Clauses

Description

This dataset contains data from a self-paced reading experiment on Chinese relative clause comprehension. It is structured to support analysis of reaction times, comprehension accuracy, and surprisal values across various experimental conditions in a 2x2 fully crossed factorial design:

Usage

data(df_jaeger14)

Format

A tibble with 8,624 rows and 15 variables:

subject

Participant identifier, a character vector.

item

Trial item number, an integer.

cond

Experimental condition, a character vector indicating variations in sentence structure (e.g., "a", "b", "c", "d").

word

Chinese word presented in each trial, a character vector.

wordn

Position of the word within the sentence, an integer.

rt

Reaction time in milliseconds for reading each word, an integer.

region

Sentence region or phrase type (e.g., "hd1", "Det+CL"), a character vector.

question

Comprehension question associated with the trial, a character vector.

accuracy

Binary accuracy score for the comprehension question (1 = correct, 0 = incorrect).

correct_answer

Expected correct answer for the comprehension question, a character vector ("Y" or "N").

question_type

Type of comprehension question, a character vector.

experiment

Name of the experiment, indicating self-paced reading, a character vector.

list

Experimental list number, for counterbalancing item presentation, an integer.

sentence

Full sentence used in the trial with words marked for analysis, a character vector.

surprisal

Model-derived surprisal values for each word, a numeric vector.

Region codes in the dataset (column region):

  • N: Main clause subject (in object-modifications only)

  • V: Main clause verb (in object-modifications only)

  • Det+CL: Determiner+classifier

  • Adv: Adverb

  • VN: RC-verb+RC-object (subject relatives) or RC-subject+RC-verb (object relatives)

    • Note: These two words were merged into one region after the experiment; they were presented as separate regions during the experiment.

  • FreqP: Frequency phrase/durational phrase

  • DE: Relativizer "de"

  • head: Relative clause head noun

  • hd1: First word after the head noun

  • hd2: Second word after the head noun

  • hd3: Third word after the head noun

  • hd4: Fourth word after the head noun (only in subject-modifications)

  • hd5: Fifth word after the head noun (only in subject-modifications)

Notes on reading times (column rt):

  • The reading time of the relative clause region (e.g., "V-N" or "N-V") was computed by summing up the reading times of the relative clause verb and noun.

  • The verb and noun were presented as two separate regions during the experiment.

Details

  • Factor I: Modification type (subject modification; object modification)

  • Factor II: Relative clause type (subject relative; object relative)

Condition labels:

  • a) subject modification; subject relative

  • b) subject modification; object relative

  • c) object modification; subject relative

  • d) object modification; object relative

Source

Jäger, L., Chen, Z., Li, Q., Lin, C.-J. C., & Vasishth, S. (2015). The subject-relative advantage in Chinese: Evidence for expectation-based processing. Journal of Memory and Language, 79–80, 97-120. doi:10.1016/j.jml.2014.10.005

See Also

Other datasets: df_sent

Examples

# Basic exploration
head(df_jaeger14)

# Summarize reaction times by region
library(tidytable)
df_jaeger14 |>
  group_by(region) |>
  summarize(mean_rt = mean(rt, na.rm = TRUE))
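
A further sketch (not in the original examples) that uses only documented columns: average surprisal and reading time by experimental condition.

df_jaeger14 |>
  group_by(cond) |>
  summarize(mean_surprisal = mean(surprisal, na.rm = TRUE),
            mean_rt = mean(rt, na.rm = TRUE))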

Example dataset: Two word-by-word sentences

Description

This dataset contains two example sentences, split word-by-word. It is structured to demonstrate the use of the pangoling package for processing text data.

Usage

df_sent

Format

A data frame with 15 rows and 2 columns:

sent_n

(integer) Sentence number, indicating which sentence each word belongs to.

word

(character) Words from the sentences.

See Also

Other datasets: df_jaeger14

Examples

# Load the dataset
data("df_sent")
df_sent

Install the Python packages needed for pangoling

Description

The install_py_pangoling() function facilitates the installation of Python packages needed for using pangoling within an R environment, utilizing the reticulate package for managing Python environments. It supports various installation methods, environment settings, and Python versions.

Usage

install_py_pangoling(method = c("auto", "virtualenv", "conda"),
                     conda = "auto",
                     version = "default",
                     envname = "r-pangoling",
                     restart_session = TRUE,
                     conda_python_version = NULL,
                     ...,
                     pip_ignore_installed = FALSE,
                     new_env = identical(envname, "r-pangoling"),
                     python_version = NULL)

Arguments

method

A character vector specifying the environment management method. Options are 'auto', 'virtualenv', and 'conda'. Default is 'auto'.

conda

Specifies the conda binary to use. Default is 'auto'.

version

The Python version to use. Default is 'default', automatically selected.

envname

Name of the virtual environment. Default is 'r-pangoling'.

restart_session

Logical, whether to restart the R session after installation. Default is TRUE.

conda_python_version

Python version for conda environments.

...

Additional arguments passed to reticulate::py_install.

pip_ignore_installed

Logical, whether to ignore already installed packages. Default is FALSE.

new_env

Logical, whether to create a new environment. Default is identical(envname, "r-pangoling"), i.e., TRUE when envname is 'r-pangoling'.

python_version

Specifies the Python version for the environment.

Details

This function automatically selects the appropriate method for environment management and Python installation, with a focus on virtual and conda environments. It ensures flexibility in dependency management and Python version control. If a new environment is created, existing environments with the same name are removed.

Value

The function returns NULL invisibly, but outputs a message on successful installation.

See Also

Other helper functions: installed_py_pangoling(), set_cache_folder()

Examples

# Install with default settings:
## Not run: install_py_pangoling()
## End(Not run)
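
A further hedged sketch using only the arguments documented above: install into a virtualenv named "r-pangoling" without restarting the R session.

## Not run: 
install_py_pangoling(method = "virtualenv",
                     envname = "r-pangoling",
                     restart_session = FALSE)
## End(Not run)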

Check if the required Python dependencies for pangoling are installed

Description

This function verifies whether the necessary Python modules (transformers and torch) are available in the current Python environment.

Usage

installed_py_pangoling()

Value

A logical value: TRUE if both transformers and torch are installed and accessible, otherwise FALSE.

See Also

Other helper functions: install_py_pangoling(), set_cache_folder()

Examples

## Not run: 
if (installed_py_pangoling()) {
  message("Python dependencies are installed.")
} else {
  warning("Python dependencies are missing. Please install `torch` and `transformers`.")
}
## End(Not run)

Returns the configuration of a masked model

Description

Returns the configuration of a masked model.

Usage

masked_config(
  model = getOption("pangoling.masked.default"),
  config_model = NULL
)

Arguments

model

Name of a pre-trained model or folder. One should be able to use models based on "bert". See the Hugging Face website.

config_model

List with other arguments that control how the model from Hugging Face is accessed.

Details

A masked language model (also called a BERT-like, or encoder model) is a type of large language model that can be used to predict the content of a mask in a sentence.

If not specified, the masked model that will be used is the one set in the global option pangoling.masked.default, which can be accessed via getOption("pangoling.masked.default") (by default "bert-base-uncased"). To change the default option, use options(pangoling.masked.default = "newmaskedmodel").

A list of possible masked models can be found on the Hugging Face website.

Using the config_model and config_tokenizer arguments, it's possible to control how the model and tokenizer from Hugging Face are accessed; see the Python method from_pretrained for details. In case of errors, check the status of https://status.huggingface.co/

Value

A list with the configuration of the model.

See Also

Other masked model helper functions: masked_preload()

Examples

masked_config(model = "bert-base-uncased")

Preloads a masked language model

Description

Preloads a masked language model to speed up next runs.

Usage

masked_preload(
  model = getOption("pangoling.masked.default"),
  add_special_tokens = NULL,
  config_model = NULL,
  config_tokenizer = NULL
)

Arguments

model

Name of a pre-trained model or folder. One should be able to use models based on "bert". See the Hugging Face website.

add_special_tokens

Whether to include special tokens. It has the same default as the AutoTokenizer method in Python.

config_model

List with other arguments that control how the model from Hugging Face is accessed.

config_tokenizer

List with other arguments that control how the tokenizer from Hugging Face is accessed.

Details

A masked language model (also called a BERT-like, or encoder model) is a type of large language model that can be used to predict the content of a mask in a sentence.

If not specified, the masked model that will be used is the one set in the global option pangoling.masked.default, which can be accessed via getOption("pangoling.masked.default") (by default "bert-base-uncased"). To change the default option, use options(pangoling.masked.default = "newmaskedmodel").

A list of possible masked models can be found on the Hugging Face website.

Using the config_model and config_tokenizer arguments, it's possible to control how the model and tokenizer from Hugging Face are accessed; see the Python method from_pretrained for details. In case of errors, check the status of https://status.huggingface.co/

Value

Nothing.

See Also

Other masked model helper functions: masked_config()

Examples

masked_preload(model = "bert-base-uncased")

Get the predictability of a target word (or phrase) given a left and right context

Description

Get the predictability (by default the natural logarithm of the word probability) of a vector of target words (or phrases) given a vector of left and of right contexts using a masked transformer.

Usage

masked_targets_pred(
  prev_contexts,
  targets,
  after_contexts,
  log.p = getOption("pangoling.log.p"),
  ignore_regex = "",
  model = getOption("pangoling.masked.default"),
  checkpoint = NULL,
  add_special_tokens = NULL,
  config_model = NULL,
  config_tokenizer = NULL
)

Arguments

prev_contexts

Left context of the target word in left-to-right written languages.

targets

Target words.

after_contexts

Right context of the target in left-to-right written languages.

log.p

Base of the logarithm used for the output predictability values. If TRUE (default), the natural logarithm (base e) is used. If FALSE, the raw probabilities are returned. Alternatively, log.p can be set to a numeric value specifying the base of the logarithm (e.g., 2 for base-2 logarithms). To get surprisal in bits (rather than predictability), set log.p = 1/2.

ignore_regex

Can ignore certain characters when calculating the log probabilities. For example, ^[[:punct:]]$ will ignore all punctuation that stands alone in a token.

model

Name of a pre-trained model or folder. One should be able to use models based on "bert". See the Hugging Face website.

checkpoint

Folder of a checkpoint.

add_special_tokens

Whether to include special tokens. It has the same default as the AutoTokenizer method in Python.

config_model

List with other arguments that control how the model from Hugging Face is accessed.

config_tokenizer

List with other arguments that control how the tokenizer from Hugging Face is accessed.

Details

A masked language model (also called a BERT-like, or encoder model) is a type of large language model that can be used to predict the content of a mask in a sentence.

If not specified, the masked model that will be used is the one set in the global option pangoling.masked.default, which can be accessed via getOption("pangoling.masked.default") (by default "bert-base-uncased"). To change the default option, use options(pangoling.masked.default = "newmaskedmodel").

A list of possible masked models can be found on the Hugging Face website.

Using the config_model and config_tokenizer arguments, it's possible to control how the model and tokenizer from Hugging Face are accessed; see the Python method from_pretrained for details. In case of errors, check the status of https://status.huggingface.co/

Value

A named vector of predictability values (by default the natural logarithm of the word probability).

More examples

See the online article on the pangoling website for more examples.

See Also

Other masked model functions: masked_tokens_pred_tbl()

Examples

masked_targets_pred(
  prev_contexts = c("The", "The"),
  targets = c("apple", "pear"),
  after_contexts = c("doesn't fall far from the tree.",
                     "doesn't fall far from the tree."),
  model = "bert-base-uncased"
)
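
A follow-up sketch (not in the original example): with log.p = FALSE, the same call returns raw probabilities instead of natural-log values.

masked_targets_pred(
  prev_contexts = c("The", "The"),
  targets = c("apple", "pear"),
  after_contexts = c("doesn't fall far from the tree.",
                     "doesn't fall far from the tree."),
  log.p = FALSE,  # raw probabilities
  model = "bert-base-uncased"
)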

Get the possible tokens and their log probabilities for each mask in a sentence

Description

For each mask, indicated with [MASK], in a sentence, get the possible tokens and their predictability (by default the natural logarithm of the word probability) using a masked transformer.

Usage

masked_tokens_pred_tbl(
  masked_sentences,
  log.p = getOption("pangoling.log.p"),
  model = getOption("pangoling.masked.default"),
  checkpoint = NULL,
  add_special_tokens = NULL,
  config_model = NULL,
  config_tokenizer = NULL
)

Arguments

masked_sentences

Masked sentences.

log.p

Base of the logarithm used for the output predictability values. If TRUE (default), the natural logarithm (base e) is used. If FALSE, the raw probabilities are returned. Alternatively, log.p can be set to a numeric value specifying the base of the logarithm (e.g., 2 for base-2 logarithms). To get surprisal in bits (rather than predictability), set log.p = 1/2.

model

Name of a pre-trained model or folder. One should be able to use models based on "bert". See the Hugging Face website.

checkpoint

Folder of a checkpoint.

add_special_tokens

Whether to include special tokens. It has the same default as the AutoTokenizer method in Python.

config_model

List with other arguments that control how the model from Hugging Face is accessed.

config_tokenizer

List with other arguments that control how the tokenizer from Hugging Face is accessed.

Details

A masked language model (also called a BERT-like, or encoder model) is a type of large language model that can be used to predict the content of a mask in a sentence.

If not specified, the masked model that will be used is the one set in the global option pangoling.masked.default, which can be accessed via getOption("pangoling.masked.default") (by default "bert-base-uncased"). To change the default option, use options(pangoling.masked.default = "newmaskedmodel").

A list of possible masked models can be found on the Hugging Face website.

Using the config_model and config_tokenizer arguments, it's possible to control how the model and tokenizer from Hugging Face are accessed; see the Python method from_pretrained for details. In case of errors, check the status of https://status.huggingface.co/

Value

A table with the masked sentences, the tokens (token), predictability (pred), and the respective mask number (mask_n).

More examples

See the online article on the pangoling website for more examples.

See Also

Other masked model functions: masked_targets_pred()

Examples

masked_tokens_pred_tbl("The [MASK] doesn't fall far from the tree.",  model="bert-base-uncased")

The number of tokens in a string or vector of strings

Description

The number of tokens in a string or vector of strings

Usage

ntokens(
  x,
  model = getOption("pangoling.causal.default"),
  add_special_tokens = NULL,
  config_tokenizer = NULL
)

Arguments

x

Character input (a string or vector of strings).

model

Name of a pre-trained model or folder. One should be able to use models based on "gpt2". See the Hugging Face website.

add_special_tokens

Whether to include special tokens. It has the same default as the AutoTokenizer method in Python.

config_tokenizer

List with other arguments that control how the tokenizer from Hugging Face is accessed.

Value

The number of tokens in a string or vector of words.

See Also

Other token-related functions: tokenize_lst(), transformer_vocab()

Examples

ntokens(x = c("The apple doesn't fall far from the tree."), model = "gpt2")

Calculates perplexity

Description

Calculates the perplexity of a vector of (log-)probabilities.

Usage

perplexity_calc(x, na.rm = FALSE, log.p = TRUE)

Arguments

x

A vector of log-probabilities.

na.rm

Should missing values (including NaN) be removed?

log.p

If TRUE (default), x is assumed to contain log-transformed probabilities with base e; if FALSE, x is assumed to contain raw probabilities. Alternatively, log.p can be the base of other logarithmic transformations.

Details

If x are raw probabilities (NOT the default), then perplexity is calculated as follows:

\left(\prod_{n} x_n \right)^{-\frac{1}{N}}

Value

The perplexity.

Examples

probs <- c(.3, .5, .6)
perplexity_calc(probs, log.p = FALSE)
lprobs <- log(probs)
perplexity_calc(lprobs, log.p = TRUE)
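
A quick manual check (a sketch, not from the original documentation): for natural-log probabilities, the perplexity defined above equals exp(-mean(x)), so the following should match the second perplexity_calc() call.

exp(-mean(lprobs))  # same value as perplexity_calc(lprobs, log.p = TRUE)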

Set cache folder for HuggingFace transformers

Description

This function sets the cache directory for HuggingFace transformers. If a path is given, the function checks if the directory exists and then sets the HF_HOME environment variable to this path. If no path is provided, the function checks for the existing cache directory in a number of environment variables. If none of these environment variables are set, it provides the user with information on the default cache directory.

Usage

set_cache_folder(path = NULL)

Arguments

path

Character string, the path to set as the cache directory. If NULL, the function will look for the cache directory in a number of environment variables. Default is NULL.

Value

Nothing is returned; this function is called for its side effect of setting the HF_HOME environment variable, or providing information to the user.

See Also

Installation docs

Other helper functions: install_py_pangoling(), installed_py_pangoling()

Examples

## Not run: set_cache_folder("~/new_cache_dir")
## End(Not run)
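
A hedged follow-up sketch: since the function works by setting the HF_HOME environment variable (see Value above), its effect can be inspected afterwards.

## Not run: 
set_cache_folder("~/new_cache_dir")
Sys.getenv("HF_HOME")  # should now point to the new cache directory
## End(Not run)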

Tokenize an input

Description

Tokenize a string or token ids.

Usage

tokenize_lst(
  x,
  decode = FALSE,
  model = getOption("pangoling.causal.default"),
  add_special_tokens = NULL,
  config_tokenizer = NULL
)

Arguments

x

Strings or token ids.

decode

Logical. If TRUE, decodes the tokens into human-readable strings, handling special characters and diacritics. Default is FALSE.

model

Name of a pre-trained model or folder. One should be able to use models based on "gpt2". See the Hugging Face website.

add_special_tokens

Whether to include special tokens. It has the same default as the AutoTokenizer method in Python.

config_tokenizer

List with other arguments that control how the tokenizer from Hugging Face is accessed.

Value

A list with tokens.

See Also

Other token-related functions: ntokens(), transformer_vocab()

Examples

tokenize_lst(x = c("The apple doesn't fall far from the tree."),
             model = "gpt2")
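
A follow-up sketch (an assumption based on the descriptions above, not an example from the original documentation): the number of tokens returned by tokenize_lst() should line up with ntokens() for the same input and model.

toks <- tokenize_lst(x = "The apple doesn't fall far from the tree.",
                     model = "gpt2")
length(toks[[1]])  # number of tokens in the first (and only) element
ntokens(x = "The apple doesn't fall far from the tree.", model = "gpt2")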

Returns the vocabulary of a model

Description

Returns the (decoded) vocabulary of a model.

Usage

transformer_vocab(
  model = getOption("pangoling.causal.default"),
  add_special_tokens = NULL,
  decode = FALSE,
  config_tokenizer = NULL
)

Arguments

model

Name of a pre-trained model or folder. One should be able to use models based on "gpt2". See the Hugging Face website.

add_special_tokens

Whether to include special tokens. It has the same default as the AutoTokenizer method in Python.

decode

Logical. If TRUE, decodes the tokens into human-readable strings, handling special characters and diacritics. Default is FALSE.

config_tokenizer

List with other arguments that control how the tokenizer from Hugging Face is accessed.

Value

A vector with the vocabulary of a model.

See Also

Other token-related functions: ntokens(), tokenize_lst()

Examples

transformer_vocab(model = "gpt2") |> head()

