Japanese Spelling Error Corrector using BERT (Masked-Language Model).

A Japanese version of this README is available here.

Abstract

This project fine-tunes a masked-language-model BERT for the task of Japanese spelling error correction, using the whole-word-masking pretrained model. Besides the original BERT, the project also implements a second BERT-based architecture that uses a method called soft-masking, which adds a Bi-GRU layer to the original BERT (a simplified sketch follows below). The training dataset (Japanese Wikipedia Typo Dataset version 1) consists of pairs of erroneous and error-free sentences collected from Wikipedia and can be downloaded here.
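The soft-masking mechanism, in brief: a lightweight detection network (a Bi-GRU) predicts an error probability for each token, and each token's embedding is blended with the `[MASK]` embedding in proportion to that probability before the BERT correction network runs. The sketch below is a simplified illustration of that idea only; the class and variable names are hypothetical, and the project's actual implementation lives in `bertjsc.lit_model`.

```python
import torch
import torch.nn as nn

class SoftMaskingSketch(nn.Module):
    """Simplified sketch of soft-masking; not the bertjsc implementation."""
    def __init__(self, embedding_layer, embedding_dim, mask_token_id):
        super().__init__()
        self.embedding = embedding_layer      # BERT's word-embedding layer
        self.mask_token_id = mask_token_id
        # Detection network: Bi-GRU + linear head -> per-token error probability.
        self.gru = nn.GRU(embedding_dim, embedding_dim // 2,
                          batch_first=True, bidirectional=True)
        self.detector = nn.Linear(embedding_dim, 1)

    def forward(self, input_ids):
        e = self.embedding(input_ids)                  # (batch, seq, dim)
        h, _ = self.gru(e)
        p = torch.sigmoid(self.detector(h))            # (batch, seq, 1)
        # Soft-mask: blend each token embedding with the [MASK] embedding,
        # weighted by the detected error probability.
        e_mask = self.embedding(torch.full_like(input_ids, self.mask_token_id))
        e_soft = p * e_mask + (1 - p) * e
        return e_soft, p   # e_soft is then fed to the BERT correction network
```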

Getting Started

  1. Clone the project and install the necessary packages.
    pip install -r requirements.txt
  2. Download the fine-tuned model (BERT) from here and put it in a directory of your choice.
  3. Run inference with the following code.
```python
import torch
from transformers import BertJapaneseTokenizer
from bertjsc import predict_of_json
from bertjsc.lit_model import LitBertForMaskedLM

# Tokenizer & model declaration.
tokenizer = BertJapaneseTokenizer.from_pretrained("cl-tohoku/bert-base-japanese-whole-word-masking")
model = LitBertForMaskedLM("cl-tohoku/bert-base-japanese-whole-word-masking")

# Load the model downloaded in Step 2.
model.load_state_dict(torch.load('load/from/path/lit-bert-for-maskedlm-230313.pth'), strict=False)

# Set computing device on GPU if available, else CPU.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Inference
result = predict_of_json(model, tokenizer, device, "日本語校正してみす。")
print(result)
```
  4. A JSON-style result will be displayed as below.
```
{0: {'token': '日本語', 'score': 0.999341},
 1: {'token': '校', 'score': 0.996382},
 2: {'token': '正', 'score': 0.997387},
 3: {'token': 'し', 'score': 0.999978},
 4: {'token': 'て', 'score': 0.999999},
 5: {'token': 'み', 'score': 0.999947},
 6: {'token': 'す', 'correct': 'ます', 'score': 0.972711},
 7: {'token': '。', 'score': 1.0}}
```
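If you want the corrected sentence as a plain string instead of the per-token JSON, a small helper like the one below works. It is not part of bertjsc; it simply assumes the result format shown above, where each entry has a 'token' field and corrected tokens additionally carry a 'correct' field.

```python
# Hypothetical helper (not part of bertjsc): rebuild the corrected sentence,
# preferring the 'correct' field where the model proposed a correction.
def to_corrected_sentence(result):
    return "".join(entry.get('correct', entry['token'])
                   for _, entry in sorted(result.items()))

print(to_corrected_sentence(result))  # 日本語校正してみます。
```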
  • To train the model from scratch, download the training data from here. The script (./scripts/trainer.py) contains the steps of the training process and includes a function to evaluate the model's performance. You may refer to it when running the task on a GPU cloud-computing platform such as AWS SageMaker or Google Colab.
  • To use Soft-Masked BERT, download the fine-tuned model from here and declare the model as in the following code; everything else works the same (see the usage sketch after the code block).
```python
from transformers import BertJapaneseTokenizer
from bertjsc.lit_model import LitSoftMaskedBert

tokenizer = BertJapaneseTokenizer.from_pretrained("cl-tohoku/bert-base-japanese-whole-word-masking")
model = LitSoftMaskedBert("cl-tohoku/bert-base-japanese-whole-word-masking",
                          tokenizer.mask_token_id, tokenizer.vocab_size)
```
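From here the workflow is the same as Step 3 above: load the downloaded weights and call predict_of_json. The checkpoint path below is a placeholder for wherever you saved the Soft-Masked BERT file.

```python
import torch
from bertjsc import predict_of_json

# Placeholder path: point this at the Soft-Masked BERT checkpoint you downloaded.
model.load_state_dict(torch.load('load/from/path/soft-masked-bert.pth'), strict=False)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
result = predict_of_json(model, tokenizer, device, "日本語校正してみす。")
print(result)
```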

Note

  • Downgrade `transformers` to 4.29.2 (`pip install transformers==4.29.2`) or pass `strict=False` to `model.load_state_dict()` if you encounter the error `Unexpected key(s) in state_dict: "mlbert.bert.embeddings.position_ids"`.

Evaluation

Detection

| Model | Accuracy | Precision | Recall | F1 Score |
| --- | --- | --- | --- | --- |
| Pre-train | 43.5 | 31.9 | 41.5 | 36.1 |
| Fine-tune | 77.3 | 71.1 | 81.2 | 75.8 |
| Soft-masked | 78.4 | 65.3 | 88.4 | 75.1 |
| RoBERTa | 72.3 | 55.9 | 83.2 | 66.9 |

Correction

| Model | Accuracy | Precision | Recall | F1 Score |
| --- | --- | --- | --- | --- |
| Pre-train | 37.0 | 19.0 | 29.8 | 23.2 |
| Fine-tune | 74.9 | 66.4 | 80.1 | 72.6 |
| Soft-masked | 76.4 | 61.4 | 87.8 | 72.2 |
| RoBERTa | 69.7 | 50.7 | 81.8 | 62.7 |
  • Training Platform: AWS SageMaker Lab
  • Batch Size: 32
  • Epochs: 6
  • Learning Rate: 2e-6
  • Coefficient (soft-masked): 0.8
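As a point of reference, the snippet below sketches how detection precision, recall, and F1 of the kind reported above can be computed. It is an illustration under assumptions rather than the project's evaluation code (see ./scripts/trainer.py for that); in particular, matching on token positions is an assumption here.

```python
# Hypothetical illustration: 'flagged' and 'gold' are sets of token positions
# the model flagged as errors / the annotation marks as errors.
def detection_metrics(flagged: set, gold: set):
    tp = len(flagged & gold)                              # true positives
    precision = tp / len(flagged) if flagged else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

print(detection_metrics({1, 3, 5}, {1, 5, 6}))  # ≈ (0.667, 0.667, 0.667)
```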

License

Distributed under the MIT License. See LICENSE.txt for more information.

Contribution

Suggestions for improvement and contributions to this project are appreciated! Feel free to submit an issue or pull request.

References

  1. Attention Is All You Need
  2. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
  3. Spelling Error Correction with Soft-Masked BERT
  4. HuggingFace🤗: BERT base Japanese
  5. pycorrector
  6. 基于BERT的文本纠错模型 (A BERT-based text error correction model, in Chinese)
  7. SoftMasked Bert文本纠错模型实现 (An implementation of the Soft-Masked BERT text correction model, in Chinese)

