Japanese Spelling Error Corrector using BERT (Masked-Language Model)
A Japanese version is available here.
The project fine-tunes a masked-language BERT model for the task of Japanese spelling error correction, using the whole-word-masking pretrained model. Besides the original BERT, the project also implements another BERT-based architecture that uses a method called soft-masking, which introduces a Bi-GRU layer into the original BERT architecture. The training data (Japanese Wikipedia Typo Dataset, version 1) consists of pairs of erroneous and error-free sentences collected from Wikipedia and can be downloaded here.
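For intuition, the soft-masking idea works roughly as follows: a Bi-GRU detector estimates a per-token error probability, and each input embedding handed to the BERT corrector is interpolated toward the `[MASK]` embedding in proportion to that probability. The snippet below is only a minimal sketch of this interpolation step, with illustrative tensor names and shapes; it is not the project's actual implementation in `bertjsc.lit_model`.

```python
import torch
import torch.nn as nn

# Minimal sketch of the soft-masking step (illustrative shapes and names only).
hidden = 768                                     # BERT-base embedding size
detector = nn.GRU(hidden, hidden // 2, bidirectional=True, batch_first=True)
to_prob = nn.Linear(hidden, 1)

token_emb = torch.randn(2, 16, hidden)           # (batch, seq_len, hidden) input embeddings
mask_emb = torch.randn(hidden)                   # embedding of the [MASK] token

gru_out, _ = detector(token_emb)                 # Bi-GRU detector over the sequence
p_err = torch.sigmoid(to_prob(gru_out))          # per-token error probability, (batch, seq_len, 1)

# Soft-masked input: lean toward [MASK] wherever an error is likely.
soft_masked = p_err * mask_emb + (1 - p_err) * token_emb
```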
- Clone the project and install the necessary packages.

```sh
pip install -r requirements.txt
```

- Download the fine-tuned model (BERT) from here and put it in an arbitrary directory.
- Make an inference with the following code.
```python
import torch
from transformers import BertJapaneseTokenizer
from bertjsc import predict_of_json
from bertjsc.lit_model import LitBertForMaskedLM

# Tokenizer & model declaration.
tokenizer = BertJapaneseTokenizer.from_pretrained("cl-tohoku/bert-base-japanese-whole-word-masking")
model = LitBertForMaskedLM("cl-tohoku/bert-base-japanese-whole-word-masking")

# Load the model downloaded in Step 2.
model.load_state_dict(torch.load('load/from/path/lit-bert-for-maskedlm-230313.pth'), strict=False)

# Set computing device to GPU if available, else CPU.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Inference
result = predict_of_json(model, tokenizer, device, "日本語校正してみす。")
print(result)
```
- A JSON-style result will be displayed as below (a short sketch after this list shows one way to turn it back into a corrected sentence).
```
{0: {'token': '日本語', 'score': 0.999341},
 1: {'token': '校', 'score': 0.996382},
 2: {'token': '正', 'score': 0.997387},
 3: {'token': 'し', 'score': 0.999978},
 4: {'token': 'て', 'score': 0.999999},
 5: {'token': 'み', 'score': 0.999947},
 6: {'token': 'す', 'correct': 'ます', 'score': 0.972711},
 7: {'token': '。', 'score': 1.0}}
```
- To train the model from scratch, you need to download the training data from here. The file `./scripts/trainer.py` contains the steps of the training process and includes a function to evaluate the model's performance. You may refer to this file to run the task on a GPU cloud computing platform such as AWS SageMaker or Google Colab.
- To use Soft-Masked BERT, download the fine-tuned model from here and declare the model as in the following code. The other usages are the same.
```python
from transformers import BertJapaneseTokenizer
from bertjsc.lit_model import LitSoftMaskedBert

tokenizer = BertJapaneseTokenizer.from_pretrained("cl-tohoku/bert-base-japanese-whole-word-masking")
model = LitSoftMaskedBert("cl-tohoku/bert-base-japanese-whole-word-masking", tokenizer.mask_token_id, tokenizer.vocab_size)
```
- Downgrade `transformers` to 4.29.2 (`pip install transformers==4.29.2`) or pass `strict=False` to `model.load_state_dict()` if you encounter the error `Unexpected key(s) in state_dict: "mlbert.bert.embeddings.position_ids"`.
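As noted in the result step above, the dictionary returned by `predict_of_json` keeps the original token at every position and adds a `correct` field where a replacement is suggested. Below is a minimal sketch of turning that output back into a corrected sentence; the helper name `apply_corrections` is ours for illustration and not part of the bertjsc API.

```python
def apply_corrections(result: dict) -> str:
    """Join tokens in position order, preferring the suggested correction when present."""
    return "".join(
        entry.get("correct", entry["token"])
        for _, entry in sorted(result.items())
    )

# For the example output above this returns: 日本語校正してみます。
```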
Model | Accuracy | Precision | Recall | F1 Score |
---|---|---|---|---|
Pre-train | 43.5 | 31.9 | 41.5 | 36.1 |
Fine-tune | 77.3 | 71.1 | 81.2 | 75.8 |
Soft-masked | 78.4 | 65.3 | 88.4 | 75.1 |
RoBERTa | 72.3 | 55.9 | 83.2 | 66.9 |

Model | Accuracy | Precision | Recall | F1 Score |
---|---|---|---|---|
Pre-train | 37.0 | 19.0 | 29.8 | 23.2 |
Fine-tune | 74.9 | 66.4 | 80.1 | 72.6 |
Soft-masked | 76.4 | 61.4 | 87.8 | 72.2 |
RoBERTa | 69.7 | 50.7 | 81.8 | 62.7 |
- Training Platform: AWS SageMaker Lab
- Batch Size: 32
- Epochs: 6
- Learning Rate: 2e-6
- Coefficient (soft-masked): 0.8
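For orientation only, the hyperparameters above slot into a fairly standard masked-LM fine-tuning loop. The sketch below is a plausible formulation using Hugging Face's `BertForMaskedLM` directly with a single toy sentence pair; the project's actual training logic lives in `./scripts/trainer.py`, a real run would batch 32 pairs from the typo dataset, and the 0.8 coefficient applies only to the Soft-Masked variant (weighting the corrector loss against the detector loss).

```python
import torch
from transformers import BertJapaneseTokenizer, BertForMaskedLM

# Illustrative fine-tuning skeleton only; see ./scripts/trainer.py for the real training code.
tokenizer = BertJapaneseTokenizer.from_pretrained("cl-tohoku/bert-base-japanese-whole-word-masking")
model = BertForMaskedLM.from_pretrained("cl-tohoku/bert-base-japanese-whole-word-masking")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-6)   # Learning Rate: 2e-6
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# One toy pair standing in for the Japanese Wikipedia Typo Dataset: the erroneous
# sentence is the input, the corrected sentence supplies the labels. This pair
# happens to tokenize to the same length, which token-level correction requires.
wrong, right = "日本語校正してみす。", "日本語校正してみます。"
enc = tokenizer(wrong, return_tensors="pt").to(device)
labels = tokenizer(right, return_tensors="pt")["input_ids"].to(device)

model.train()
for epoch in range(6):                                       # Epochs: 6
    loss = model(**enc, labels=labels).loss                  # token-level cross-entropy
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```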
Distributed under the MIT License. See LICENSE.txt for more information.
Any suggestions for improvement or contributions to this project are appreciated! Feel free to submit an issue or pull request!
- Attention Is All You Need
- BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
- Spelling Error Correction with Soft-Masked BERT
- HuggingFace🤗: BERT base Japanese
- pycorrector
- A BERT-based Text Error Correction Model (基于BERT的文本纠错模型)
- An Implementation of the Soft-Masked BERT Text Error Correction Model (SoftMasked Bert文本纠错模型实现)