rinnakk/japanese-pretrained-models

Code for producing Japanese pretrained models provided by rinna Co., Ltd.
This repository provides the code for training Japanese pretrained models. This code has been used for producing japanese-gpt2-medium, japanese-gpt2-small, japanese-gpt2-xsmall, and japanese-roberta-base released on the HuggingFace model hub by rinna Co., Ltd.
Currently supported pretrained models include: GPT-2 and RoBERTa.
Table of Contents

- Update log
- Use tips
- Use our pretrained models via Huggingface
- Train japanese-gpt2-xsmall from scratch
- Train japanese-roberta-base from scratch
- License
Please open an issue (in English or Japanese) if you encounter any problems using the code or our models via Huggingface.
If you find this work useful, please cite the following paper:
```
@article{rinna_pretrained2021,
    title={日本語自然言語処理における事前学習モデルの公開},
    author={趙 天雨 and 沢田 慶},
    journal={人工知能学会研究会資料 言語・音声理解と対話処理研究会},
    volume={93},
    pages={169-170},
    year={2021},
    doi={10.11517/jsaislud.93.0_169}
}
```
Update log

- 2022/01/25 Updated link to rinna/japanese-gpt-1b in the model summary table.
- 2022/01/17 Updated citation information.
- 2021/11/01 Updated corpora links.
- 2021/09/13 Added tips on using position_ids with japanese-roberta-base. Refer to issue 3 for details.
- 2021/08/26 [Important] Updated license from the MIT license to the Apache 2.0 license due to the use of the Wikipedia pre-processing code from cl-tohoku/bert-japanese. See issue 1 for details.
- 2021/08/23 Added Japanese Wikipedia to the training corpora. Published code for training rinna/japanese-gpt2-small, rinna/japanese-gpt2-xsmall, and rinna/japanese-roberta-base.
- 2021/08/18 Changed the repo name from japanese-gpt2 to japanese-pretrained-models.
- 2021/06/15 Fixed the best-PPL tracking bug when using a checkpoint.
- 2021/05/04 Fixed a random seeding bug for multi-GPU training.
- 2021/04/06 Published code for training rinna/japanese-gpt2-medium.
Use tips

- Use [CLS]: To predict a masked token, be sure to add a [CLS] token before the sentence so that the model encodes it correctly, since [CLS] is used during model training.
- Use [MASK] after tokenization: (A) directly typing [MASK] in the input string and (B) replacing a token with [MASK] after tokenization yield different token sequences, and thus different prediction results. It is more appropriate to use [MASK] after tokenization, as this is consistent with how the model was pretrained. However, the Huggingface Inference API only supports typing [MASK] in the input string and therefore produces less robust predictions. (See the short sketch after this list.)
- Provide position_ids as an argument explicitly: When position_ids are not provided for a Roberta* model, Huggingface's transformers constructs them automatically, but starting from padding_idx instead of 0 (see the linked issue and the function create_position_ids_from_input_ids() in Huggingface's implementation). This does not work as expected with rinna/japanese-roberta-base, because the padding_idx of the corresponding tokenizer is not 0. So please be sure to construct the position_ids yourself and make them start from position id 0.
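For the second tip, here is a minimal sketch (not from this repo) that simply prints the two token sequences so you can see that they differ; the exact pieces depend on your tokenizer version, and the token index 5 follows the example further below:

```python
# Sketch: "[MASK] typed into the raw string" vs. "replace a token after tokenization".
# Only illustrates that the two token sequences differ; exact splits may vary.
from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("rinna/japanese-roberta-base")
tokenizer.do_lower_case = True  # same tokenizer-config caveat as in the example further below

text = "[CLS]4年に1度オリンピックは開かれる。"

# (A) type [MASK] directly into the input string, then tokenize
tokens_a = tokenizer.tokenize(text.replace("オリンピック", "[MASK]"))

# (B) tokenize first, then replace the target token with [MASK]
tokens_b = tokenizer.tokenize(text)
tokens_b[5] = tokenizer.mask_token

print(tokens_a)
print(tokens_b)  # (A) and (B) generally differ, so the predictions differ too
```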
Use our pretrained models via Huggingface

Language model | # params | # layers | # emb dim | # epochs | dev ppl | training time*
---|---|---|---|---|---|---
rinna/japanese-gpt-1b | 1.3B | 24 | 2048 | 10+ | 13.9 | n/a**
rinna/japanese-gpt2-medium | 336M | 24 | 1024 | 4 | 18 | 45 days
rinna/japanese-gpt2-small | 110M | 12 | 768 | 3 | 21 | 15 days
rinna/japanese-gpt2-xsmall | 37M | 6 | 512 | 3 | 28 | 4 days

Masked language model | # params | # layers | # emb dim | # epochs | dev ppl | training time*
---|---|---|---|---|---|---
rinna/japanese-roberta-base | 110M | 12 | 768 | 8 | 3.9 | 15 days

\* Training was conducted on an 8x V100 32GB machine.
\*\* Training was conducted using a different codebase and a different computing environment.
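The example below covers masked-token prediction with japanese-roberta-base. As a complement, here is a minimal, hedged sketch of loading one of the GPT-2 models from the table for text generation with Huggingface transformers; the prompt and generation settings are illustrative, not taken from this repo:

```python
# Minimal sketch (illustrative): generate text with rinna/japanese-gpt2-medium.
import torch
from transformers import T5Tokenizer, AutoModelForCausalLM

tokenizer = T5Tokenizer.from_pretrained("rinna/japanese-gpt2-medium")
tokenizer.do_lower_case = True  # same tokenizer-config caveat as the RoBERTa example below

model = AutoModelForCausalLM.from_pretrained("rinna/japanese-gpt2-medium")
model.eval()

# encode a prompt and sample a continuation (sampling settings are illustrative)
input_ids = tokenizer.encode("日本で一番高い山は", return_tensors="pt")
with torch.no_grad():
    output_ids = model.generate(
        input_ids,
        max_length=50,
        do_sample=True,
        top_p=0.95,
        top_k=40,
        pad_token_id=tokenizer.pad_token_id,
    )
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```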
An example of using rinna/japanese-roberta-base to predict a masked token (note the prepended [CLS], masking after tokenization, and the explicit position_ids, as described in the use tips):

```python
import torch
from transformers import T5Tokenizer, RobertaForMaskedLM

# load tokenizer
tokenizer = T5Tokenizer.from_pretrained("rinna/japanese-roberta-base")
tokenizer.do_lower_case = True  # due to some bug of tokenizer config loading

# load model
model = RobertaForMaskedLM.from_pretrained("rinna/japanese-roberta-base")
model = model.eval()

# original text
text = "4年に1度オリンピックは開かれる。"

# prepend [CLS]
text = "[CLS]" + text

# tokenize
tokens = tokenizer.tokenize(text)
print(tokens)  # output: ['[CLS]', '▁4', '年に', '1', '度', 'オリンピック', 'は', '開かれる', '。']

# mask a token
masked_idx = 5
tokens[masked_idx] = tokenizer.mask_token
print(tokens)  # output: ['[CLS]', '▁4', '年に', '1', '度', '[MASK]', 'は', '開かれる', '。']

# convert to ids
token_ids = tokenizer.convert_tokens_to_ids(tokens)
print(token_ids)  # output: [4, 1602, 44, 24, 368, 6, 11, 21583, 8]

# convert to tensor
token_tensor = torch.LongTensor([token_ids])

# provide position ids explicitly
position_ids = list(range(0, token_tensor.size(1)))
print(position_ids)  # output: [0, 1, 2, 3, 4, 5, 6, 7, 8]
position_id_tensor = torch.LongTensor([position_ids])

# get the top 10 predictions of the masked token
with torch.no_grad():
    outputs = model(input_ids=token_tensor, position_ids=position_id_tensor)
    predictions = outputs[0][0, masked_idx].topk(10)

for i, index_t in enumerate(predictions.indices):
    index = index_t.item()
    token = tokenizer.convert_ids_to_tokens([index])[0]
    print(i, token)
"""
0 総会
1 サミット
2 ワールドカップ
3 フェスティバル
4 大会
5 オリンピック
6 全国大会
7 党大会
8 イベント
9 世界選手権
"""
```
Train japanese-gpt2-xsmall from scratch

- Install required packages by running the following command under the repo directory:
```
pip install -r requirements.txt
```
- Set up the fugashi tokenizer for preprocessing the Wikipedia corpus by running:
```
python -m unidic download
```
- Download the training corpus Japanese CC-100 and extract the ja.txt file.
- Move the ja.txt file, or modify src/corpus/jp_cc100/config.py so that self.raw_data_dir matches the filepath of ja.txt (a hypothetical sketch of this edit follows the command below).
- Split ja.txt into smaller files by running:
```
cd src/
python -m corpus.jp_cc100.split_to_small_files
```
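If you keep ja.txt where it is, the config edit might look roughly like the following. This is a hypothetical sketch: only the attribute name self.raw_data_dir and the file path src/corpus/jp_cc100/config.py come from the steps above, and the actual class layout in the repo may differ:

```python
# Hypothetical sketch of the edit in src/corpus/jp_cc100/config.py.
# Only self.raw_data_dir is named in the instructions; everything else is assumed.
class Config:
    def __init__(self):
        # directory containing the extracted ja.txt from Japanese CC-100
        self.raw_data_dir = "/data/corpora/jp_cc100"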
- First check the available versions of the Wikipedia dump at Wikipedia cirrussearch and fill in self.download_link (in the file src/corpus/jp_wiki/config.py) with the link to your preferred Wikipedia dump version. Then download the Japanese Wikipedia training corpus and split it by running:
```
python -m corpus.jp_wiki.build_pretrain_dataset
python -m corpus.jp_wiki.split_to_small_files
```
- Train an xsmall-sized GPT-2 on, for example, 4 V100 GPUs by running:
```
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m task.pretrain_gpt2.train \
    --n_gpus 4 \
    --save_model True \
    --enable_log True \
    --model_size xsmall \
    --model_config_filepath model/gpt2-ja-xsmall-config.json \
    --batch_size 20 \
    --eval_batch_size 40 \
    --n_training_steps 1600000 \
    --n_accum_steps 3 \
    --init_lr 0.0007
```
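As a rough sanity check of these hyperparameters (assuming the script accumulates gradients per GPU in the usual way, which is an assumption, not something stated in this README), the effective batch size per optimizer update is batch_size × n_accum_steps × n_gpus = 20 × 3 × 4 = 240 sequences.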
Assume you have run the training script and saved your xsmall-sized GPT-2 to data/model/pretrain_gpt2/gpt2-ja-xsmall-xxx.checkpoint. Run the following command to use it to complete text on one GPU by nucleus sampling with p=0.95 and k=40:
```
CUDA_VISIBLE_DEVICES=0 python -m task.pretrain_gpt2.interact \
    --checkpoint_path ../data/model/pretrain_gpt2/gpt2-ja-xsmall-xxx.checkpoint \
    --gen_type top \
    --top_p 0.95 \
    --top_k 40
```
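For reference, the sampling scheme that --top_p 0.95 and --top_k 40 refer to (top-k plus nucleus filtering) can be sketched in a few lines. This is an illustrative reimplementation, not the code the interact script actually uses:

```python
# Illustrative sketch of top-k + nucleus (top-p) filtering over next-token logits.
import torch

def top_k_top_p_filter(logits: torch.Tensor, top_k: int = 40, top_p: float = 0.95) -> torch.Tensor:
    """Mask out logits that fall outside the top-k set or the top-p (nucleus) set."""
    # keep only the top_k highest-scoring tokens
    kth_best = torch.topk(logits, top_k).values[..., -1:]
    logits = logits.masked_fill(logits < kth_best, float("-inf"))

    # among those, keep the smallest prefix whose cumulative probability exceeds top_p
    sorted_logits, sorted_idx = torch.sort(logits, descending=True)
    cum_probs = torch.softmax(sorted_logits, dim=-1).cumsum(dim=-1)
    sorted_remove = cum_probs > top_p
    sorted_remove[..., 1:] = sorted_remove[..., :-1].clone()  # shift right: keep the crossing token
    sorted_remove[..., 0] = False
    remove = sorted_remove.scatter(-1, sorted_idx, sorted_remove)
    return logits.masked_fill(remove, float("-inf"))

# example: sample the next token id from filtered logits
logits = torch.randn(32000)  # fake next-token logits over a 32k vocabulary
probs = torch.softmax(top_k_top_p_filter(logits), dim=-1)
next_id = torch.multinomial(probs, num_samples=1)
```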
- Make your Huggingface account, create a model repo, and clone it to your local machine.
- Create model and config files from a checkpoint by running:
```
python -m task.pretrain_gpt2.checkpoint2huggingface \
    --checkpoint_path ../data/model/gpt2-medium-xxx.checkpoint \
    --save_dir {huggingface's model repo directory}
```
- Validate the created files by running:
```
python -m task.pretrain_gpt2.check_huggingface \
    --model_dir {huggingface's model repo directory}
```
- Add files, commit, and push to your Huggingface repo.
Check the available arguments of the GPT-2 training script by running:
```
python -m task.pretrain_gpt2.train --help
```
Train japanese-roberta-base from scratch

Assuming you have finished the data construction process described above, run the following command to train a base-sized Japanese RoBERTa on, for example, 8 V100 GPUs:
```
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python -m task.pretrain_roberta.train \
    --n_gpus 8 \
    --save_model True \
    --enable_log True \
    --model_size base \
    --model_config_filepath model/roberta-ja-base-config.json \
    --batch_size 32 \
    --eval_batch_size 32 \
    --n_training_steps 3000000 \
    --n_accum_steps 16 \
    --init_lr 0.0006
```
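Under the same assumption about per-GPU gradient accumulation, the effective batch size per update here is 32 × 16 × 8 = 4096 sequences, i.e. the large-batch regime typical of RoBERTa-style pretraining.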
License

The Apache 2.0 license