Japanese Spelling Error Corrector using BERT (Masked-Language Model)
A Japanese version is available here.
The project fine-tunes a masked-language BERT model for the task of Japanese spelling error correction, using the whole-word-masking pretrained model. Besides the original BERT, the project also implements another BERT-based architecture that uses a method called soft-masking, which introduces a Bi-GRU layer into the original BERT architecture. The training data (Japanese Wikipedia Typo Dataset, version 1) consists of pairs of erroneous and error-free sentences collected from Wikipedia and can be downloaded here.
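For intuition, the soft-masking idea works roughly as follows: a Bi-GRU detector estimates a per-token error probability, and each input embedding handed to the BERT corrector is interpolated toward the `[MASK]` embedding in proportion to that probability. The snippet below is only a minimal sketch of this interpolation step, with illustrative tensor names and shapes; it is not the project's actual implementation in `bertjsc.lit_model`.

```python
import torch
import torch.nn as nn

# Minimal sketch of the soft-masking step (illustrative shapes and names only).
hidden = 768                                     # BERT-base embedding size
detector = nn.GRU(hidden, hidden // 2, bidirectional=True, batch_first=True)
to_prob = nn.Linear(hidden, 1)

token_emb = torch.randn(2, 16, hidden)           # (batch, seq_len, hidden) input embeddings
mask_emb = torch.randn(hidden)                   # embedding of the [MASK] token

gru_out, _ = detector(token_emb)                 # Bi-GRU detector over the sequence
p_err = torch.sigmoid(to_prob(gru_out))          # per-token error probability, (batch, seq_len, 1)

# Soft-masked input: lean toward [MASK] wherever an error is likely.
soft_masked = p_err * mask_emb + (1 - p_err) * token_emb
```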
- Clone the project and install the necessary packages.

```sh
pip install -r requirements.txt
```

- Download the fine-tuned model (BERT) from here and put it in an arbitrary directory.
- Make an inference with the following code.
```python
import torch
from transformers import BertJapaneseTokenizer
from bertjsc import predict_of_json
from bertjsc.lit_model import LitBertForMaskedLM

# Tokenizer & model declaration.
tokenizer = BertJapaneseTokenizer.from_pretrained("cl-tohoku/bert-base-japanese-whole-word-masking")
model = LitBertForMaskedLM("cl-tohoku/bert-base-japanese-whole-word-masking")

# Load the model downloaded in Step 2.
model.load_state_dict(torch.load('load/from/path/lit-bert-for-maskedlm-230313.pth'), strict=False)

# Set computing device to GPU if available, else CPU.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Inference
result = predict_of_json(model, tokenizer, device, "日本語校正してみす。")
print(result)
```
- A JSON-style result will be displayed as below (a short sketch after this list shows one way to turn it back into a corrected sentence).
```
{0: {'token': '日本語', 'score': 0.999341},
 1: {'token': '校', 'score': 0.996382},
 2: {'token': '正', 'score': 0.997387},
 3: {'token': 'し', 'score': 0.999978},
 4: {'token': 'て', 'score': 0.999999},
 5: {'token': 'み', 'score': 0.999947},
 6: {'token': 'す', 'correct': 'ます', 'score': 0.972711},
 7: {'token': '。', 'score': 1.0}}
```
- To train the model from scratch, you need to download the training data from here. The file `./scripts/trainer.py` contains the steps of the training process and includes a function to evaluate the model's performance. You may refer to this file to run the task on a GPU cloud computing platform such as AWS SageMaker or Google Colab.
- To use Soft-Masked BERT, download the fine-tuned model from here and declare the model as in the following code. The other usages are the same.
```python
from transformers import BertJapaneseTokenizer
from bertjsc.lit_model import LitSoftMaskedBert

tokenizer = BertJapaneseTokenizer.from_pretrained("cl-tohoku/bert-base-japanese-whole-word-masking")
model = LitSoftMaskedBert("cl-tohoku/bert-base-japanese-whole-word-masking", tokenizer.mask_token_id, tokenizer.vocab_size)
```
- Downgrade `transformers` to 4.29.2 (`pip install transformers==4.29.2`) or pass `strict=False` to `model.load_state_dict()` if you encounter the error `Unexpected key(s) in state_dict: "mlbert.bert.embeddings.position_ids"`.
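As noted in the result step above, the dictionary returned by `predict_of_json` keeps the original token at every position and adds a `correct` field where a replacement is suggested. Below is a minimal sketch of turning that output back into a corrected sentence; the helper name `apply_corrections` is ours for illustration and not part of the bertjsc API.

```python
def apply_corrections(result: dict) -> str:
    """Join tokens in position order, preferring the suggested correction when present."""
    return "".join(
        entry.get("correct", entry["token"])
        for _, entry in sorted(result.items())
    )

# For the example output above this returns: 日本語校正してみます。
```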
Model | Accuracy | Precision | Recall | F1 Score |
---|---|---|---|---|
Pre-train | 43.5 | 31.9 | 41.5 | 36.1 |
Fine-tune | 77.3 | 71.1 | 81.2 | 75.8 |
Soft-masked | 78.4 | 65.3 | 88.4 | 75.1 |
RoBERTa | 72.3 | 55.9 | 83.2 | 66.9 |

Model | Accuracy | Precision | Recall | F1 Score |
---|---|---|---|---|
Pre-train | 37.0 | 19.0 | 29.8 | 23.2 |
Fine-tune | 74.9 | 66.4 | 80.1 | 72.6 |
Soft-masked | 76.4 | 61.4 | 87.8 | 72.2 |
RoBERTa | 69.7 | 50.7 | 81.8 | 62.7 |
- Training Platform: AWS SageMaker Lab
- Batch Size: 32
- Epochs: 6
- Learning Rate: 2e-6
- Coefficient (soft-masked): 0.8
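For orientation only, the hyperparameters above slot into a fairly standard masked-LM fine-tuning loop. The sketch below is a plausible formulation using Hugging Face's `BertForMaskedLM` directly with a single toy sentence pair; the project's actual training logic lives in `./scripts/trainer.py`, a real run would batch 32 pairs from the typo dataset, and the 0.8 coefficient applies only to the Soft-Masked variant (weighting the corrector loss against the detector loss).

```python
import torch
from transformers import BertJapaneseTokenizer, BertForMaskedLM

# Illustrative fine-tuning skeleton only; see ./scripts/trainer.py for the real training code.
tokenizer = BertJapaneseTokenizer.from_pretrained("cl-tohoku/bert-base-japanese-whole-word-masking")
model = BertForMaskedLM.from_pretrained("cl-tohoku/bert-base-japanese-whole-word-masking")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-6)   # Learning Rate: 2e-6
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# One toy pair standing in for the Japanese Wikipedia Typo Dataset: the erroneous
# sentence is the input, the corrected sentence supplies the labels. This pair
# happens to tokenize to the same length, which token-level correction requires.
wrong, right = "日本語校正してみす。", "日本語校正してみます。"
enc = tokenizer(wrong, return_tensors="pt").to(device)
labels = tokenizer(right, return_tensors="pt")["input_ids"].to(device)

model.train()
for epoch in range(6):                                       # Epochs: 6
    loss = model(**enc, labels=labels).loss                  # token-level cross-entropy
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```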
Distributed under the MIT License. See LICENSE.txt for more information.
Any suggestions for improvement or contributions to this project are appreciated! Feel free to submit an issue or pull request!
- Attention Is All You Need
- BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
- Spelling Error Correction with Soft-Masked BERT
- HuggingFace🤗: BERT base Japanese
- pycorrector
- A BERT-based Text Error Correction Model (基于BERT的文本纠错模型)
- An Implementation of the Soft-Masked BERT Text Error Correction Model (SoftMasked Bert文本纠错模型实现)