BERT-based GEC tagging for Japanese
jonnyli1125/gector-ja
Grammatical error correction model described in the paper "GECToR -- Grammatical Error Correction: Tag, Not Rewrite" (Omelianchuk et al. 2020), implemented for Japanese. This project's code is based on the official implementation (https://github.com/grammarly/gector).
The pretrained Japanese BERT model used in this project was provided by Tohoku University NLP Lab.
- Japanese Wikipedia dump, extracted with WikiExtractor, with synthetic errors generated using the preprocessing scripts
  - 19,841,767 training sentences
- NAIST Lang8 Learner Corpora
  - 6,066,306 training sentences (generated from 3,084,0376 original sentences)
The Wikipedia corpus was used to synthetically generate errorful sentences, with a method similar to Awasthi et al. 2019, but with adjustments for Japanese. The details of the implementation can be found in the preprocessing scripts in this repository.
Example error-generated sentence:
西口側には宿泊施設や地元の日本酒や海、山の幸を揃えた飲食店、呑み屋など多くある。 # Correct
西口側までは宿泊から施設や地元の日本酒や、山の幸を揃えた飲食は店、呑み屋など多くあろう。 # Errorful
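As a rough illustration of token-level error injection (this is not the repository's preprocessing code), the sketch below randomly deletes, replaces, or inserts particles in a pre-tokenized sentence. The particle list, the error rate, the short example sentence, and the function name `inject_errors` are all assumptions made for this example.

```python
import random

# Hypothetical particle inventory; the real scripts may use a different set.
PARTICLES = ["は", "が", "を", "に", "で", "と", "も", "から", "まで"]


def inject_errors(tokens, rate=0.15, seed=None):
    """Return a noisy copy of `tokens` with random particle errors."""
    rng = random.Random(seed)
    noisy = []
    for token in tokens:
        if rng.random() >= rate:
            noisy.append(token)  # leave the token untouched
            continue
        op = rng.choice(["delete", "replace", "insert"])
        if op == "delete":
            continue  # drop the token
        if op == "replace" and token in PARTICLES:
            # swap a particle for a different particle
            noisy.append(rng.choice([p for p in PARTICLES if p != token]))
        elif op == "insert":
            # keep the token and add a spurious particle after it
            noisy.extend([token, rng.choice(PARTICLES)])
        else:
            noisy.append(token)  # "replace" drawn for a non-particle: keep it
    return noisy


# Illustrative pre-tokenized sentence (not taken from the corpus).
correct = ["西口", "側", "に", "は", "宿泊", "施設", "が", "ある", "。"]
print("".join(inject_errors(correct, rate=0.3, seed=0)))
```

Passing a fixed `seed` makes the noise reproducible, which helps when regenerating a corpus.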
Using the preprocessed Wikipedia corpus and Lang8 corpus, the errorful sentences were tokenized using the WordPiece tokenizer from the pretrained BERT model. Each token was then mapped to a minimal sequence of token transformations such that applying the transformations to the errorful sentence yields the target sentence. The GECToR paper explains this preprocessing step in more detail (section 3), and the code specifics can be found in the official implementation.
Example edit-tagged sentence (using the same pair of sentences above):
[CLS] 西口 側 まで は 宿泊 から 施設 や 地元 の 日本 酒 や 、 山 の 幸 を 揃え た 飲食 は 店 、 呑 ##み ##屋 など 多く あろう 。 [SEP]
$KEEP $KEEP $KEEP $REPLACE_に $KEEP $KEEP $DELETE $KEEP $KEEP $KEEP $KEEP $KEEP $KEEP $APPEND_海 $KEEP $KEEP $KEEP $KEEP $KEEP $KEEP $KEEP $KEEP $DELETE $KEEP $KEEP $KEEP $KEEP $KEEP $KEEP $KEEP $TRANSFORM_VBV_VB $KEEP $KEEP
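To make the tag semantics concrete, here is a minimal sketch of applying one edit tag per token, using the first part of the tagged sentence above (without `[CLS]`). The function name `apply_edits` is hypothetical, and `$TRANSFORM_*` tags are simply passed through unchanged because this sketch does not implement g-transformations.

```python
def apply_edits(tokens, tags):
    """Apply one edit tag per token and return the corrected token list."""
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")
    output = []
    for token, tag in zip(tokens, tags):
        if tag == "$DELETE":
            continue  # drop the token
        if tag.startswith("$REPLACE_"):
            output.append(tag[len("$REPLACE_"):])  # substitute the token
        elif tag.startswith("$APPEND_"):
            output.extend([token, tag[len("$APPEND_"):]])  # add after token
        else:
            # $KEEP, and any tag this sketch does not handle (e.g. $TRANSFORM_*)
            output.append(token)
    return output


tokens = ["西口", "側", "まで", "は", "宿泊", "から", "施設",
          "や", "地元", "の", "日本", "酒", "や", "、"]
tags = ["$KEEP", "$KEEP", "$REPLACE_に", "$KEEP", "$KEEP", "$DELETE", "$KEEP",
        "$KEEP", "$KEEP", "$KEEP", "$KEEP", "$KEEP", "$APPEND_海", "$KEEP"]
print("".join(apply_edits(tokens, tags)))  # 西口側には宿泊施設や地元の日本酒や海、
```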
Furthermore, on top of the basic 4 token transformations (`$KEEP`, `$DELETE`, `$APPEND`, `$REPLACE`), there is a set of special transformations called "g-transformations" (e.g. `$TRANSFORM_VBV_VB` in the example above). G-transformations are mainly used for common replacements, such as switching verb conjugations, as described in the GECToR paper (section 3). The g-transformations in this model were redefined to accommodate Japanese verbs and i-adjectives, which both inflect for tense.
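One way to picture a g-transformation is as a lookup from one inflected form to another. The sketch below reads `$TRANSFORM_VBV_VB` as "convert form `VBV` to form `VB`"; that reading, the meaning of the form codes, the conjugation table, and the function name `apply_g_transform` are all assumptions for illustration and are not taken from this repository.

```python
# Hypothetical conjugation table: lemma -> {form code: surface form}.
# The codes "VB" and "VBV" are copied from the tag in the example; what they
# denote and the table contents are assumptions for illustration only.
CONJUGATIONS = {
    "ある": {"VB": "ある", "VBV": "あろう"},
}


def apply_g_transform(token, tag, table=CONJUGATIONS):
    """Rewrite `token` from the source form to the target form named in `tag`."""
    prefix = "$TRANSFORM_"
    if not tag.startswith(prefix):
        return token
    source_form, _, target_form = tag[len(prefix):].partition("_")
    for forms in table.values():
        if forms.get(source_form) == token and target_form in forms:
            return forms[target_form]
    return token  # unknown word or form: leave the token unchanged


print(apply_g_transform("あろう", "$TRANSFORM_VBV_VB"))  # ある
```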
The model consists of a pretrained BERT encoder layer and two linear classification heads, one for `labels` and one for `detect`. `labels` predicts a specific edit transformation (`$KEEP`, `$DELETE`, `$APPEND_x`, etc.), and `detect` predicts whether the token is `CORRECT` or `INCORRECT`. The results from the two are used to make a prediction. The predicted transformations are then applied to the errorful input sentence to obtain a corrected sentence.
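This README does not spell out how the two head outputs are combined. One plausible scheme, assumed here and not confirmed by the project, is to use the `detect` head as a sentence-level gate: if no token is likely enough to be incorrect, every token is kept; otherwise each token takes its most probable label. The function name `combine_heads` and the threshold `min_error_prob` are hypothetical.

```python
def combine_heads(label_probs, incorrect_probs, labels, min_error_prob=0.5):
    """Pick one edit tag per token from the two heads' outputs.

    label_probs:     per token, a list of probabilities over `labels`
    incorrect_probs: per token, the detect head's probability of INCORRECT
    """
    # If no token looks wrong enough, leave the whole sentence unchanged.
    if max(incorrect_probs, default=0.0) < min_error_prob:
        return ["$KEEP"] * len(label_probs)
    tags = []
    for probs in label_probs:
        best = max(range(len(labels)), key=probs.__getitem__)  # argmax
        tags.append(labels[best])
    return tags


labels = ["$KEEP", "$DELETE", "$REPLACE_に"]
label_probs = [[0.9, 0.05, 0.05], [0.1, 0.1, 0.8]]
print(combine_heads(label_probs, [0.1, 0.9], labels))  # ['$KEEP', '$REPLACE_に']
print(combine_heads(label_probs, [0.1, 0.2], labels))  # ['$KEEP', '$KEEP']
```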
Furthermore, in some cases, one pass of predicted transformations is not sufficient to transform the errorful sentence to the target sentence. Therefore, the process is repeated on the result of the previous pass of transformations until the model predicts that the sentence no longer contains incorrect tokens.
For more details about the model architecture and iterative sequence tagging approach, refer to sections 4 and 5 of the GECToR paper or the official implementation.
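The iterative procedure can be sketched as a loop that re-tags and re-edits the sentence until every predicted tag is `$KEEP`. The model is replaced here by a toy rule-based predictor, and the pass limit `max_passes` is an added safeguard that this README does not mention; all function names are hypothetical.

```python
def correct_iteratively(tokens, predict_tags, apply_edits, max_passes=5):
    """Re-tag and re-edit until every tag is $KEEP or the pass budget runs out."""
    for _ in range(max_passes):
        tags = predict_tags(tokens)
        if all(tag == "$KEEP" for tag in tags):
            break  # the predictor sees nothing left to fix
        tokens = apply_edits(tokens, tags)
    return tokens


def toy_predict(tokens):
    """Stand-in for the model: fixes at most one error per pass."""
    tags = ["$KEEP"] * len(tokens)
    if "まで" in tokens:
        tags[tokens.index("まで")] = "$REPLACE_に"
    elif "から" in tokens:
        tags[tokens.index("から")] = "$DELETE"
    return tags


def toy_apply(tokens, tags):
    """Minimal edit application supporting $KEEP, $DELETE and $REPLACE_x."""
    out = []
    for token, tag in zip(tokens, tags):
        if tag == "$DELETE":
            continue
        out.append(tag[len("$REPLACE_"):] if tag.startswith("$REPLACE_") else token)
    return out


errorful = ["側", "まで", "は", "宿泊", "から", "施設"]
print(correct_iteratively(errorful, toy_predict, toy_apply))
# ['側', 'に', 'は', '宿泊', '施設']
```

Here the toy predictor needs two passes to fix both errors, and a third pass confirms that nothing is left to change.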
The model was trained in Colab with TPUs on each corpus with the following hyperparameters (default is used if unspecified):
batch_size: 64
learning_rate: 1e-5
bert_trainable: true
Synthetic error corpus (Wikipedia dump):
length: 19841767
epochs: 3
Lang8 corpus:
length: 6066306
epochs: 10
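For a sense of scale, the number of optimizer steps implied by these settings can be worked out as below. This assumes one step per batch and that a final partial batch is kept rather than dropped, neither of which is stated in this README; the dictionary keys are hypothetical names.

```python
import math

BATCH_SIZE = 64
corpora = {
    "wikipedia_synthetic": {"length": 19_841_767, "epochs": 3},
    "lang8": {"length": 6_066_306, "epochs": 10},
}


def total_steps(length, epochs, batch_size=BATCH_SIZE):
    """Steps per epoch (rounded up) times the number of epochs."""
    steps_per_epoch = math.ceil(length / batch_size)
    return steps_per_epoch * epochs


for name, cfg in corpora.items():
    print(name, total_steps(**cfg))
# wikipedia_synthetic 930084
# lang8 947870
```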
Trained weights can be downloaded here.
Extract `model.zip` to the `data/` directory. You should have the following folder structure:
gector-ja/
  data/
    model/
      checkpoint
      model_checkpoint.data-00000-of-00001
      model_checkpoint.index
    ...
  main.py
  ...
After downloading and extracting the weights, the demo app can be run with the command `python main.py`.
You may need to `pip install flask` if Flask is not already installed.
The model can be evaluated with `evaluate.py` on a parallel sentences corpus. The evaluation corpus used was the TMU Evaluation Corpus for Japanese Learners (Koyama et al. 2020), and the metric is GLEU score.
The model trained with the parameters described above achieved a GLEU score of around 0.81, which appears to outperform the CNN-based method by Chollampatt and Ng, 2018 (state of the art on the CoNLL-2014 dataset prior to transformer-based models) that Koyama et al. 2020 chose to use in their paper.
F0.5 on the CoNLL-2014 dataset:

Method | F0.5 |
---|---|
Chollampatt and Ng, 2018 | 56.52 |
Omelianchuk et al., 2020 | 66.5 |

GLEU on the TMU Evaluation Corpus for Japanese Learners:

Method | GLEU |
---|---|
Chollampatt and Ng, 2018 | 0.739 |
gector-ja (this project) | 0.81 |
In the GECToR paper, F0.5 score was used, which can also be determined through use of errant and m2scorer. However, these tools were designed for evaluation on the CoNLL-2014 dataset, and using them for this project would require modifying the tools' source code to accommodate Japanese. In this project GLEU score was used as in Koyama et al. 2020, which works "out of the box" from the NLTK library.
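For readers unfamiliar with the metric, the following stdlib-only sketch computes a GLEU-style sentence score as the minimum of n-gram precision and n-gram recall over 1- to 4-grams, which is how NLTK's `nltk.translate.gleu_score` module documents its score. It is a simplified illustration, not the code in `evaluate.py`; splitting sentences into characters is an assumption, and the project may tokenize differently.

```python
from collections import Counter


def ngram_counts(tokens, max_n=4):
    """Count all n-grams of order 1..max_n in a token sequence."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def sentence_gleu(reference, hypothesis, max_n=4):
    """min(n-gram precision, n-gram recall) for one reference/hypothesis pair."""
    ref_counts = ngram_counts(reference, max_n)
    hyp_counts = ngram_counts(hypothesis, max_n)
    overlap = sum((ref_counts & hyp_counts).values())  # clipped matches
    total_ref = sum(ref_counts.values())
    total_hyp = sum(hyp_counts.values())
    if total_ref == 0 or total_hyp == 0:
        return 0.0
    return min(overlap / total_hyp, overlap / total_ref)


# Character-level tokens (an assumption made for this sketch).
reference = list("西口側には宿泊施設がある。")
hypothesis = list("西口側までは宿泊施設がある。")
print(round(sentence_gleu(reference, hypothesis), 3))
```

In practice, `nltk.translate.gleu_score.corpus_gleu(list_of_references, hypotheses)` aggregates the same n-gram counts over a whole corpus and would take the place of this function.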