# gpt-ja

GPT-2 Japanese model for HuggingFace's transformers
This repository provides a GPT-2-based Japanese model trained on the Japanese Wikipedia dataset.

Currently supported models are listed in the summary below.

Model summary:
| 🤗 Model Hub | Data | Revision | Code | Total params | Test set PPL | vocab_size | n_ctx | n_layer | n_head | n_embd | Epochs | Training time |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| colorfulscoop/gpt2-small-ja | jawiki_20210820 | 20210820.1.0 | ef927e1 | 110M | 29.13 | 32,000 | 1,024 | 12 | 12 | 768 | 30 | 15 days |
| colorfulscoop/gpt2-small-ja | jawiki_20210301 | 20210301.1.0 | - | 110M | - | 32,000 | 1,024 | 12 | 12 | 768 | 30 | - |
Data summary:

| Id | Corpus | #tokens in train set | #tokens in valid set | #tokens in test set |
|---|---|---|---|---|
| jawiki_20210820 | Japanese Wikipedia on 20210820 | 540M | 13M | 13M |
Note: the same tokenizer is used for models trained on the same data.
Sample usage:

```python
>>> import transformers
>>> pipeline = transformers.pipeline("text-generation", "models/gpt2-small", revision="20210820.1.0")
>>> pipeline("統計的機械学習でのニューラルネットワーク", do_sample=True)
[{'generated_text': '統計的機械学習でのニューラルネットワークの解析は、多くのアルゴリズムの完全な実装をもたらした。これらの'}]
```
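The published model can also be loaded directly from the Model Hub instead of a local path. Below is a minimal sketch, assuming the `colorfulscoop/gpt2-small-ja` repository and the `20210820.1.0` revision from the model summary table above.

```python
import transformers

# Load the tokenizer and model from the Hub, pinned to a revision
# (assumes colorfulscoop/gpt2-small-ja from the summary table above).
tokenizer = transformers.AutoTokenizer.from_pretrained(
    "colorfulscoop/gpt2-small-ja", revision="20210820.1.0"
)
model = transformers.AutoModelForCausalLM.from_pretrained(
    "colorfulscoop/gpt2-small-ja", revision="20210820.1.0"
)
pipeline = transformers.pipeline("text-generation", model=model, tokenizer=tokenizer)
print(pipeline("統計的機械学習でのニューラルネットワーク", do_sample=True))
```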
Model training was conducted in the following environment.
- OS: Ubuntu 18.04.5 LTS
- GPU: RTX 2080 Ti x1
```sh
$ docker container run --gpus all --ipc=host --rm -it -v $(pwd):/work -w /work nvidia/cuda:11.1-devel-ubuntu20.04 bash
(container)$ apt update && apt install -y python3 python3-pip git wget
(container)$ pip3 install torch==1.8.1+cu111 -f https://download.pytorch.org/whl/torch_stable.html
(container)$ pip3 install -r requirements.txt
```
Check the latest dump date in the list at https://dumps.wikimedia.org/jawiki/ , then run the data preparation script:

```sh
(container)$ bash src/get_jawiki.sh 20210820 input
```
The generated data can be found under the `input` directory.

```sh
(container)$ ls -1 input/20210820/{train,valid,test}.txt
input/20210820/test.txt
input/20210820/train.txt
input/20210820/valid.txt
```
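For illustration only, a minimal sketch of the final splitting step that produces such train/valid/test files; the actual work of `src/get_jawiki.sh` (downloading and extracting the dump), the intermediate `extracted.txt` file, and the split ratio are all assumptions, not the script's real behavior.

```python
import random

# Illustrative line-level split of extracted Wikipedia text into
# train/valid/test files. The intermediate file name and the 95/2.5/2.5
# ratio are hypothetical; real pipelines often split at document level.
random.seed(0)
files = {name: open(f"input/20210820/{name}.txt", "w") for name in ("train", "valid", "test")}
with open("input/20210820/extracted.txt") as f:  # hypothetical intermediate file
    for line in f:
        r = random.random()
        name = "train" if r < 0.95 else ("valid" if r < 0.975 else "test")
        files[name].write(line)
for fp in files.values():
    fp.close()
```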
Train the SentencePiece model in the same container used for data preparation.

```sh
(container)$ python3 src/train_tokenizer.py --train_file input/20210820/train.txt --model_dir models/gpt2-small
```
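For reference, a minimal sketch of what such a tokenizer-training step could look like with the sentencepiece library; apart from `vocab_size=32000` from the model summary table, the output path and options are assumptions, not `src/train_tokenizer.py`'s actual configuration.

```python
import sentencepiece as spm

# Train a SentencePiece model on the raw training text.
# vocab_size matches the 32,000 figure in the model summary table;
# the output path under models/gpt2-small/spm is a hypothetical example.
spm.SentencePieceTrainer.train(
    input="input/20210820/train.txt",
    model_prefix="models/gpt2-small/spm/sentencepiece",
    vocab_size=32000,
)
```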
Run training with the config file:
```sh
(container)$ python3 src/train.py train --config input/gpt2-small.json
...
255999it [10:21:51, 7.03it/s]
{'epoch': 30, 'batch': 256000, 'step': 493108, 'train_loss': 0.190585415356369, 'lr': 0.0001}
263236it [10:39:12, 6.86it/s]
6788it [10:28, 10.81it/s]
{'epoch': 30, 'valid_loss': 3.417723441833458, 'valid_ppl': 30.49990112587307, 'save_model': True}
```
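The architecture values in the model summary table presumably correspond to a transformers `GPT2Config` like the sketch below; the mapping to the actual `input/gpt2-small.json` format is an assumption.

```python
import transformers

# GPT-2 small configuration matching the model summary table:
# vocab_size=32,000, n_ctx=1,024, n_layer=12, n_head=12, n_embd=768.
config = transformers.GPT2Config(
    vocab_size=32000,
    n_positions=1024,
    n_embd=768,
    n_layer=12,
    n_head=12,
)
model = transformers.GPT2LMHeadModel(config)
print(model.num_parameters())  # roughly 110M, as in the table
```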
Evaluate the trained model on the test set:

```sh
(container)$ python3 src/train.py test --config input/gpt2-small.json
6793it [09:16, 12.20it/s]
{'test_loss': 3.371613106758486, 'test_ppl': 29.125471679484484}
```
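As a sanity check on the logged figures, perplexity is simply the exponential of the average cross-entropy loss:

```python
import math

# Perplexity = exp(mean cross-entropy loss); both reported values check out.
print(math.exp(3.417723441833458))  # 30.4999... (valid_ppl above)
print(math.exp(3.371613106758486))  # 29.1254... (test_ppl above)
```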
Convert the PyTorch model to a TensorFlow model:

```sh
(container)$ pip install tensorflow
(container)$ python3
>>> from transformers import TFGPT2LMHeadModel
>>> model = TFGPT2LMHeadModel.from_pretrained("models/gpt2-small", from_pt=True)
>>> model.save_pretrained("models/gpt2-small")
```
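A quick way to confirm the export worked is to reload the model without the `from_pt` flag; this only succeeds if TensorFlow weights were actually saved:

```python
from transformers import TFGPT2LMHeadModel

# Reload the converted TensorFlow weights to verify the export.
model = TFGPT2LMHeadModel.from_pretrained("models/gpt2-small")
```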
Follow the official document to upload the model.

Prepare Git LFS. In a macOS environment, Git LFS can be installed as follows.

```sh
$ brew install git-lfs
$ git lfs install
Updated git hooks.
Git LFS initialized.
```
Then clone the repository.
```sh
$ git clone https://huggingface.co/colorfulscoop/gpt2-small-ja release/gpt2-small-ja
```
Copy the trained model files into the cloned repository:

```sh
$ cp models/gpt2-small/* release/gpt2-small-ja/
cp: models/gpt2-small/spm is a directory (not copied).
$ cd release/gpt2-small-ja
```
Then modify `config.json` to specify default generation parameters, following the diff below.

```diff
 "unk_token_id": 1,
 "use_cache": true,
-"vocab_size": 32000
+"vocab_size": 32000,
+"top_k": 50,
+"top_p": 0.95,
+"do_sample": true
 }
```
Commit the changes and push them to the repository.

```sh
$ git add .
$ git push origin
```