# ginza-transformers: Use custom tokenizers in spacy-transformers
`ginza-transformers` is a simple extension of [spacy-transformers](https://github.com/explosion/spacy-transformers) to use custom tokenizers (defined outside of `huggingface/transformers`) in the `transformer` pipeline component of spaCy v3. `ginza-transformers` also provides the ability to download models from [Hugging Face Hub](https://huggingface.co/models) automatically at run time.

There are two fallback tricks in `ginza-transformers`.
Loading a custom tokenizer specified in the `components.transformer.model.tokenizer_config.tokenizer_class` attribute of `config.cfg` of a spaCy language model package works as follows (see the sketch after this list):

- `ginza-transformers` initially tries to import the tokenizer class in the standard manner of `huggingface/transformers` (via `AutoTokenizer.from_pretrained()`)
- If a `ValueError` is raised from `AutoTokenizer.from_pretrained()`, the fallback logic of `ginza-transformers` tries to import the class via `importlib.import_module` with the `tokenizer_class` value
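A minimal sketch of this two-step loading logic (illustrative only; the helper name `load_tokenizer` and its signature are not part of the actual `ginza-transformers` API):

```python
import importlib

from transformers import AutoTokenizer


def load_tokenizer(name, tokenizer_class, **kwargs):
    # Try the standard huggingface/transformers path first.
    try:
        return AutoTokenizer.from_pretrained(name, **kwargs)
    except ValueError:
        # Fallback: import the class named in tokenizer_config.tokenizer_class,
        # e.g. "sudachitra.tokenization_electra_sudachipy.ElectraSudachipyTokenizer",
        # and call its own from_pretrained().
        module_name, class_name = tokenizer_class.rsplit(".", 1)
        module = importlib.import_module(module_name)
        return getattr(module, class_name).from_pretrained(name, **kwargs)
```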
Downloading the model files published on Hugging Face Hub at run time works as follows (see the sketch after this list):

- `ginza-transformers` initially tries to load the local model directory (i.e. `/${local_spacy_model_dir}/transformer/model/`)
- If an `OSError` is raised, the first fallback logic passes the model name specified in the `components.transformer.model.name` attribute of `config.cfg` to `AutoModel.from_pretrained()` with the `local_files_only=True` option, which means the first fallback logic will immediately look in the local cache and will not reference the Hugging Face Hub at this point
- If an `OSError` is raised from the first fallback logic, the second fallback logic executes `AutoModel.from_pretrained()` without the `local_files_only` option, which means the second fallback logic will search for the specified model name on the Hugging Face Hub
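The three-step model loading can be sketched like this (again illustrative; `load_model` is a hypothetical helper, not the actual implementation):

```python
from transformers import AutoModel


def load_model(local_model_dir, name):
    # 1. Prefer the model files shipped inside the spaCy model package.
    try:
        return AutoModel.from_pretrained(local_model_dir)
    except OSError:
        pass
    # 2. First fallback: look only in the local Hugging Face cache.
    try:
        return AutoModel.from_pretrained(name, local_files_only=True)
    except OSError:
        pass
    # 3. Second fallback: download the model from the Hugging Face Hub.
    return AutoModel.from_pretrained(name)
```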
Before executing the `spacy train` command, make sure that spaCy is working with CUDA support, and then install this package:

```
pip install -U ginza-transformers
```
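If spaCy is not yet set up for GPU, it can be installed with the CUDA extra matching your environment, for example for CUDA 11.0 (adjust the extra to your CUDA version; see the spaCy installation docs):

```
pip install -U "spacy[cuda110]"
```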
You need to use a `config.cfg` with different settings when performing the analysis than the one used for `spacy train`.
Here is an example of spaCy's `config.cfg` for the training phase. With this config, `ginza-transformers` employs `SudachiTra` as a transformer tokenizer and uses `megagonlabs/transformers-ud-japanese-electra-base-discriminator` as a pretrained transformer model. The attributes of the training phase that differ from the defaults of the spacy-transformers model are as follows:
```
[components.transformer.model]
@architectures = "ginza-transformers.TransformerModel.v1"
name = "megagonlabs/transformers-ud-japanese-electra-base-discriminator"

[components.transformer.model.tokenizer_config]
use_fast = false
tokenizer_class = "sudachitra.tokenization_electra_sudachipy.ElectraSudachipyTokenizer"
do_lower_case = false
do_word_tokenize = true
do_subword_tokenize = true
word_tokenizer_type = "sudachipy"
subword_tokenizer_type = "wordpiece"
word_form_type = "dictionary_and_surface"

[components.transformer.model.tokenizer_config.sudachipy_kwargs]
split_mode = "A"
dict_type = "core"
```
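With this config in place, training can be run with the standard spaCy CLI, for example on GPU 0 (paths here are illustrative):

```
python -m spacy train config.cfg --output ./output --gpu-id 0
```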
Here is an example of `config.cfg` for the analysis phase. This config references `megagonlabs/transformers-ud-japanese-electra-base-ginza`. The transformer model specified at `components.transformer.model.name` would be downloaded from the Hugging Face Hub at run time. The attributes of the analysis phase that differ from the training phase are as follows:
```
[components.transformer]
factory = "transformer_custom"

[components.transformer.model]
name = "megagonlabs/transformers-ud-japanese-electra-base-ginza"
```
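Once a model package built with this analysis config is installed, it can be used like any other spaCy model. The package name `ja_ginza_electra` below is one example of a model built on `ginza-transformers`:

```python
import spacy

# Load a spaCy model packaged with the analysis-phase config above.
# On first run, the transformer weights are downloaded from Hugging Face Hub.
nlp = spacy.load("ja_ginza_electra")

doc = nlp("銀座でランチをご一緒しましょう。")
for token in doc:
    print(token.i, token.orth_, token.lemma_, token.pos_)
```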