# ginza-transformers: Use custom tokenizers in spacy-transformers
`ginza-transformers` is a simple extension of [spacy-transformers](https://github.com/explosion/spacy-transformers) to use custom tokenizers (defined outside of `huggingface/transformers`) in the `transformer` pipeline component of spaCy v3. `ginza-transformers` also provides the ability to download models from [Hugging Face Hub](https://huggingface.co/models) automatically at run time.

There are two fallback tricks in `ginza-transformers`.
Loading a custom tokenizer specified in the `components.transformer.model.tokenizer_config.tokenizer_class` attribute of `config.cfg` of a spaCy language model package works as follows (see the sketch after this list):

- `ginza-transformers` initially tries to import the tokenizer class in the standard manner of `huggingface/transformers` (via `AutoTokenizer.from_pretrained()`)
- If a `ValueError` is raised from `AutoTokenizer.from_pretrained()`, the fallback logic of `ginza-transformers` tries to import the class via `importlib.import_module` with the `tokenizer_class` value
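A minimal sketch of this two-step loading logic (illustrative only; the helper name `load_tokenizer` and its signature are not part of the actual `ginza-transformers` API):

```python
import importlib

from transformers import AutoTokenizer


def load_tokenizer(name, tokenizer_class, **kwargs):
    # Try the standard huggingface/transformers path first.
    try:
        return AutoTokenizer.from_pretrained(name, **kwargs)
    except ValueError:
        # Fallback: import the class named in tokenizer_config.tokenizer_class,
        # e.g. "sudachitra.tokenization_electra_sudachipy.ElectraSudachipyTokenizer",
        # and call its own from_pretrained().
        module_name, class_name = tokenizer_class.rsplit(".", 1)
        module = importlib.import_module(module_name)
        return getattr(module, class_name).from_pretrained(name, **kwargs)
```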
Downloading the model files published on Hugging Face Hub at run time works as follows (see the sketch after this list):

- `ginza-transformers` initially tries to load the local model directory (i.e. `/${local_spacy_model_dir}/transformer/model/`)
- If an `OSError` is raised, the first fallback logic passes the model name specified in the `components.transformer.model.name` attribute of `config.cfg` to `AutoModel.from_pretrained()` with the `local_files_only=True` option, which means the first fallback logic will immediately look in the local cache and will not reference the Hugging Face Hub at this point
- If an `OSError` is raised from the first fallback logic, the second fallback logic executes `AutoModel.from_pretrained()` without the `local_files_only` option, which means the second fallback logic will search for the specified model name on the Hugging Face Hub
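The three-step model loading can be sketched like this (again illustrative; `load_model` is a hypothetical helper, not the actual implementation):

```python
from transformers import AutoModel


def load_model(local_model_dir, name):
    # 1. Prefer the model files shipped inside the spaCy model package.
    try:
        return AutoModel.from_pretrained(local_model_dir)
    except OSError:
        pass
    # 2. First fallback: look only in the local Hugging Face cache.
    try:
        return AutoModel.from_pretrained(name, local_files_only=True)
    except OSError:
        pass
    # 3. Second fallback: download the model from the Hugging Face Hub.
    return AutoModel.from_pretrained(name)
```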
Before executing the `spacy train` command, make sure that spaCy is working with CUDA support, and then install this package:

```
pip install -U ginza-transformers
```
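If spaCy is not yet set up for GPU, it can be installed with the CUDA extra matching your environment, for example for CUDA 11.0 (adjust the extra to your CUDA version; see the spaCy installation docs):

```
pip install -U "spacy[cuda110]"
```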
You need to use a `config.cfg` with different settings when performing the analysis than the one used for `spacy train`.
Here is an example of spaCy's `config.cfg` for the training phase. With this config, `ginza-transformers` employs `SudachiTra` as a transformer tokenizer and uses `megagonlabs/transformers-ud-japanese-electra-base-discriminator` as a pretrained transformer model. The attributes of the training phase that differ from the defaults of the spacy-transformers model are as follows:
```
[components.transformer.model]
@architectures = "ginza-transformers.TransformerModel.v1"
name = "megagonlabs/transformers-ud-japanese-electra-base-discriminator"

[components.transformer.model.tokenizer_config]
use_fast = false
tokenizer_class = "sudachitra.tokenization_electra_sudachipy.ElectraSudachipyTokenizer"
do_lower_case = false
do_word_tokenize = true
do_subword_tokenize = true
word_tokenizer_type = "sudachipy"
subword_tokenizer_type = "wordpiece"
word_form_type = "dictionary_and_surface"

[components.transformer.model.tokenizer_config.sudachipy_kwargs]
split_mode = "A"
dict_type = "core"
```
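With this config in place, training can be run with the standard spaCy CLI, for example on GPU 0 (paths here are illustrative):

```
python -m spacy train config.cfg --output ./output --gpu-id 0
```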
Here is an example of `config.cfg` for the analysis phase. This config references `megagonlabs/transformers-ud-japanese-electra-base-ginza`. The transformer model specified at `components.transformer.model.name` would be downloaded from the Hugging Face Hub at run time. The attributes of the analysis phase that differ from the training phase are as follows:
```
[components.transformer]
factory = "transformer_custom"

[components.transformer.model]
name = "megagonlabs/transformers-ud-japanese-electra-base-ginza"
```
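Once a model package built with this analysis config is installed, it can be used like any other spaCy model. The package name `ja_ginza_electra` below is one example of a model built on `ginza-transformers`:

```python
import spacy

# Load a spaCy model packaged with the analysis-phase config above.
# On first run, the transformer weights are downloaded from Hugging Face Hub.
nlp = spacy.load("ja_ginza_electra")

doc = nlp("銀座でランチをご一緒しましょう。")
for token in doc:
    print(token.i, token.orth_, token.lemma_, token.pos_)
```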