jonnyli1125/gector-ja

BERT-based GEC tagging for Japanese

License

Apache-2.0 license

gector-ja

Grammatical error correction model described in the paper "GECToR -- Grammatical Error Correction: Tag, Not Rewrite" (Omelianchuk et al. 2020), implemented for Japanese. This project's code is based on the official implementation (https://github.com/grammarly/gector).

The pretrained Japanese BERT model used in this project was provided by Tohoku University NLP Lab.

Datasets

- Japanese Wikipedia dump, extracted with WikiExtractor, with synthetic errors generated using the preprocessing scripts
  - 19,841,767 training sentences
- NAIST Lang8 Learner Corpora
  - 6,066,306 training sentences (generated from 3,084,0376 original sentences)

Synthetically Generated Error Corpus

The Wikipedia corpus was used to synthetically generate errorful sentences, with a method similar to Awasthi et al. 2019, but with adjustments for Japanese. The details of the implementation can be found in the preprocessing scripts in this repository.

Example error-generated sentence:

西口側には宿泊施設や地元の日本酒や海、山の幸を揃えた飲食店、呑み屋など多くある。 # Correct
西口側までは宿泊から施設や地元の日本酒や、山の幸を揃えた飲食は店、呑み屋など多くあろう。 # Errorful

Edit Tagging

Using the preprocessed Wikipedia corpus and Lang8 corpus, the errorful sentences were tokenized using the WordPiece tokenizer from the pretrained BERT model. Each token was then mapped to a minimal sequence of token transformations, such that applying the transformations to the errorful sentence yields the target sentence. The GECToR paper explains this preprocessing step in more detail (section 3), and the code specifics can be found in the official implementation.

Example edit-tagged sentence (using the same pair of sentences above):

[CLS] 西口  側    まで         は    宿泊  から     施設  や    地元  の     日本  酒    や         、    山    の     幸    を    揃え  た     飲食  は      店    、    呑     ##み  ##屋  など   多く  あろう             。    [SEP]
$KEEP $KEEP $KEEP $REPLACE_に $KEEP $KEEP $DELETE $KEEP $KEEP $KEEP $KEEP $KEEP $KEEP $APPEND_海 $KEEP $KEEP $KEEP $KEEP $KEEP $KEEP $KEEP $KEEP $DELETE $KEEP $KEEP $KEEP $KEEP $KEEP $KEEP $KEEP $TRANSFORM_VBV_VB $KEEP $KEEP

Furthermore, on top of the 4 basic token transformations ($KEEP, $DELETE, $APPEND, $REPLACE), there is a set of special transformations called "g-transformations" (e.g. $TRANSFORM_VBV_VB in the example above). G-transformations are mainly used for common replacements, such as switching verb conjugations, as described in the GECToR paper (section 3). The g-transformations in this model were redefined to accommodate Japanese verbs and i-adjectives, which both inflect for tense.
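To make the tagging scheme concrete, here is a minimal sketch of how a tag sequence could be applied to the example above. It is not the repository's code: the function names `apply_edits` and `detokenize` are hypothetical, and the handler registered for `$TRANSFORM_VBV_VB` is a hand-written stand-in that only covers this one example (the real g-transformations are defined in the project's own utilities).

```python
from typing import Callable, Dict, List, Optional

# Hypothetical type: maps a g-transformation tag to a function rewriting one token.
Transforms = Dict[str, Callable[[str], str]]


def apply_edits(tokens: List[str], tags: List[str],
                transforms: Optional[Transforms] = None) -> List[str]:
    """Apply one edit tag per token and return the edited token list."""
    if len(tokens) != len(tags):
        raise ValueError("expected exactly one tag per token")
    transforms = transforms or {}
    output: List[str] = []
    for token, tag in zip(tokens, tags):
        if tag == "$KEEP":
            output.append(token)
        elif tag == "$DELETE":
            continue
        elif tag.startswith("$APPEND_"):
            # Keep the token and insert the new token right after it.
            output.extend([token, tag[len("$APPEND_"):]])
        elif tag.startswith("$REPLACE_"):
            output.append(tag[len("$REPLACE_"):])
        elif tag.startswith("$TRANSFORM_"):
            # Unknown g-transformations leave the token unchanged (an assumption).
            handler = transforms.get(tag)
            output.append(handler(token) if handler else token)
        else:
            raise ValueError(f"unknown tag: {tag}")
    return output


def detokenize(tokens: List[str]) -> str:
    """Drop [CLS]/[SEP], merge WordPiece continuations, join without spaces."""
    pieces = [t[2:] if t.startswith("##") else t
              for t in tokens if t not in ("[CLS]", "[SEP]")]
    return "".join(pieces)


tokens = ("[CLS] 西口 側 まで は 宿泊 から 施設 や 地元 の 日本 酒 や 、 山 の 幸 を "
          "揃え た 飲食 は 店 、 呑 ##み ##屋 など 多く あろう 。 [SEP]").split()
tags = ("$KEEP $KEEP $KEEP $REPLACE_に $KEEP $KEEP $DELETE $KEEP $KEEP $KEEP $KEEP "
        "$KEEP $KEEP $APPEND_海 $KEEP $KEEP $KEEP $KEEP $KEEP $KEEP $KEEP $KEEP "
        "$DELETE $KEEP $KEEP $KEEP $KEEP $KEEP $KEEP $KEEP $TRANSFORM_VBV_VB "
        "$KEEP $KEEP").split()

# Stand-in for the real conjugation logic; it handles only this example token.
example_transforms: Transforms = {
    "$TRANSFORM_VBV_VB": lambda token: "ある" if token == "あろう" else token,
}

corrected = detokenize(apply_edits(tokens, tags, example_transforms))
print(corrected)
```

With the stand-in handler, the printed sentence matches the correct sentence from the synthetic error example.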

Model Architecture

The model consists of a pretrained BERT encoder layer and two linear classification heads, one for labels and one for detect. labels predicts a specific edit transformation ($KEEP, $DELETE, $APPEND_x, etc.), and detect predicts whether the token is CORRECT or INCORRECT. The results from the two heads are combined to make a prediction. The predicted transformations are then applied to the errorful input sentence to obtain a corrected sentence.
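The two-head layout can be illustrated with a small dependency-free sketch. Everything numeric here is an assumption made for illustration: the hidden size of 8, the four-tag vocabulary, the random weights, and the random vectors standing in for the BERT encoder output. The sketch shows only the shapes involved, since this README does not spell out how the two outputs are combined.

```python
import math
import random
from typing import List

Vector = List[float]
Matrix = List[Vector]

HIDDEN_SIZE = 8  # illustrative only; the real encoder is much wider
LABEL_TAGS = ["$KEEP", "$DELETE", "$APPEND_海", "$REPLACE_に"]  # toy tag vocabulary
DETECT_TAGS = ["CORRECT", "INCORRECT"]


def random_matrix(rng: random.Random, rows: int, cols: int) -> Matrix:
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def linear(x: Vector, weights: Matrix, bias: Vector) -> Vector:
    """One linear layer: a dot product per output class, plus bias."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def softmax(logits: Vector) -> Vector:
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)
labels_w = random_matrix(rng, len(LABEL_TAGS), HIDDEN_SIZE)
labels_b = [0.0] * len(LABEL_TAGS)
detect_w = random_matrix(rng, len(DETECT_TAGS), HIDDEN_SIZE)
detect_b = [0.0] * len(DETECT_TAGS)

# Stand-in for the encoder output: one hidden vector per input token.
encoded = random_matrix(rng, 5, HIDDEN_SIZE)

# Both heads read the same hidden vector for each token.
label_probs = [softmax(linear(h, labels_w, labels_b)) for h in encoded]
detect_probs = [softmax(linear(h, detect_w, detect_b)) for h in encoded]

for position, (lp, dp) in enumerate(zip(label_probs, detect_probs)):
    best_tag = LABEL_TAGS[lp.index(max(lp))]
    print(position, best_tag, f"P(INCORRECT)={dp[1]:.3f}")
```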

Furthermore, in some cases, one pass of predicted transformations is not sufficient to transform the errorful sentence into the target sentence. Therefore, the process is repeated on the result of the previous pass of transformations, until the model predicts that the sentence no longer contains incorrect tokens.

For more details about the model architecture and iterative sequence tagging approach, refer to sections 4 and 5 of the GECToR paper or the official implementation.
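The repeated tagging passes can be sketched as a simple loop. This is an illustration under stated assumptions, not the project's implementation: `toy_tagger` is a hypothetical rule-based stand-in for the model that flags at most one error per pass, an all-`$KEEP` prediction is used as the stopping signal, and the `max_passes` cap is added here as a safety limit.

```python
from typing import Callable, List, Tuple

Tagger = Callable[[List[str]], List[str]]


def apply_basic_edits(tokens: List[str], tags: List[str]) -> List[str]:
    """Apply $KEEP / $DELETE / $APPEND_x / $REPLACE_x tags, one per token."""
    output: List[str] = []
    for token, tag in zip(tokens, tags):
        if tag == "$DELETE":
            continue
        if tag.startswith("$REPLACE_"):
            output.append(tag[len("$REPLACE_"):])
            continue
        output.append(token)
        if tag.startswith("$APPEND_"):
            output.append(tag[len("$APPEND_"):])
    return output


def correct_iteratively(tokens: List[str], tagger: Tagger,
                        max_passes: int = 5) -> Tuple[List[str], int]:
    """Re-tag and re-apply edits until nothing is flagged or the cap is hit."""
    passes = 0
    while passes < max_passes:
        tags = tagger(tokens)
        if all(tag == "$KEEP" for tag in tags):
            break  # the tagger considers every token correct
        tokens = apply_basic_edits(tokens, tags)
        passes += 1
    return tokens, passes


def toy_tagger(tokens: List[str]) -> List[str]:
    """Hypothetical stand-in for the model: fixes only one error per pass."""
    tags = ["$KEEP"] * len(tokens)
    for i, token in enumerate(tokens):
        if token == "まで":
            tags[i] = "$REPLACE_に"
            break
        if token == "から":
            tags[i] = "$DELETE"
            break
    return tags


result, passes_used = correct_iteratively(
    ["西口", "側", "まで", "は", "宿泊", "から", "施設"], toy_tagger)
print("".join(result), passes_used)
```

Because the toy tagger corrects one token at a time, this input needs two editing passes before a third prediction returns only `$KEEP`.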

Training

The model was trained in Colab with TPUs on each corpus with the following hyperparameters (default is used if unspecified):

batch_size: 64
learning_rate: 1e-5
bert_trainable: true

Synthetic error corpus (Wikipedia dump):

length: 19841767
epochs: 3

Lang8 corpus:

length: 6066306
epochs: 10
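As a rough guide to the size of each run, the number of optimizer steps implied by these settings can be worked out as follows. The sketch assumes the final partial batch of each epoch is kept (rounding up); if the training pipeline drops the remainder instead, each epoch has one step fewer. The dictionary keys are labels chosen here for illustration.

```python
BATCH_SIZE = 64

# Corpus sizes and epoch counts as listed above; the key names are illustrative.
CORPORA = {
    "wikipedia_synthetic": {"length": 19_841_767, "epochs": 3},
    "lang8": {"length": 6_066_306, "epochs": 10},
}


def steps_per_epoch(length: int, batch_size: int) -> int:
    """Integer ceiling division: the last, smaller batch counts as one step."""
    return (length + batch_size - 1) // batch_size


def total_steps(length: int, epochs: int, batch_size: int = BATCH_SIZE) -> int:
    return steps_per_epoch(length, batch_size) * epochs


for name, cfg in CORPORA.items():
    per_epoch = steps_per_epoch(cfg["length"], BATCH_SIZE)
    print(f"{name}: {per_epoch} steps/epoch, "
          f"{total_steps(cfg['length'], cfg['epochs'])} steps in total")
```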

Demo App

Trained weights can be downloaded here.

Extract model.zip to the data/ directory. You should have the following folder structure:

gector-ja/
  data/
    model/
      checkpoint
      model_checkpoint.data-00000-of-00001
      model_checkpoint.index
    ...
  main.py
  ...

After downloading and extracting the weights, the demo app can be run with the command python main.py.

You may need to pip install flask if Flask is not already installed.
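Before starting the app, it can help to confirm that the weights were extracted to the expected location. The helper below is a hypothetical convenience and not part of the repository; it checks only for the three file names shown in the folder structure above.

```python
from pathlib import Path
from typing import List

# File names taken from the folder structure shown above.
EXPECTED_WEIGHT_FILES = [
    "checkpoint",
    "model_checkpoint.data-00000-of-00001",
    "model_checkpoint.index",
]


def missing_weight_files(repo_root: str = ".") -> List[str]:
    """Return the expected weight files not found under <repo_root>/data/model/."""
    model_dir = Path(repo_root) / "data" / "model"
    return [name for name in EXPECTED_WEIGHT_FILES
            if not (model_dir / name).is_file()]


missing = missing_weight_files(".")
if missing:
    print("Missing weight files in data/model/:", ", ".join(missing))
else:
    print("All expected weight files found.")
```

Run it from the repository root; an empty result means the layout matches the structure above.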

Evaluation

The model can be evaluated with evaluate.py on a parallel sentence corpus. The evaluation corpus used was the TMU Evaluation Corpus for Japanese Learners (Koyama et al. 2020), and the metric is GLEU score.

The model trained with the parameters described above achieved a GLEU score of around 0.81, which appears to outperform the CNN-based method of Chollampatt and Ng, 2018 (state of the art on the CoNLL-2014 dataset prior to transformer-based models) that Koyama et al. 2020 chose to use in their paper.

CoNLL-2014 (GEC dataset for English)

Method	F0.5
Chollampatt and Ng, 2018	56.52
Omelianchuk et al., 2020	66.5

TMU Evaluation Corpus for Japanese Learners (GEC dataset for Japanese)

Method	GLEU
Chollampatt and Ng, 2018	0.739
gector-ja (this project)	0.81

In the GECToR paper, F0.5 score was used, which can also be computed using errant and m2scorer. However, these tools were designed for evaluation on the CoNLL-2014 dataset, and using them for this project would require modifying their source code to accommodate Japanese. In this project GLEU score was used, as in Koyama et al. 2020, since it works "out of the box" from the NLTK library.
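For readers unfamiliar with the metric, the following dependency-free sketch computes a sentence-level GLEU score in the way the `nltk.translate.gleu_score` module is understood to define it for a single reference: matching n-grams of length 1 to 4, divided by the larger of the hypothesis and reference n-gram totals. The function name `gleu_sentence_score` and the character-level tokenization are assumptions made for illustration; evaluate.py may tokenize differently, and in practice the NLTK functions would be called directly.

```python
from collections import Counter
from typing import Sequence


def ngram_counts(tokens: Sequence[str], min_len: int = 1, max_len: int = 4) -> Counter:
    """Count every n-gram with min_len <= n <= max_len."""
    counts: Counter = Counter()
    for n in range(min_len, max_len + 1):
        for start in range(len(tokens) - n + 1):
            counts[tuple(tokens[start:start + n])] += 1
    return counts


def gleu_sentence_score(reference: Sequence[str], hypothesis: Sequence[str],
                        min_len: int = 1, max_len: int = 4) -> float:
    """min(n-gram precision, n-gram recall) against a single reference."""
    ref = ngram_counts(reference, min_len, max_len)
    hyp = ngram_counts(hypothesis, min_len, max_len)
    matches = sum((ref & hyp).values())  # clipped overlap of the two multisets
    denominator = max(sum(ref.values()), sum(hyp.values()))
    return matches / denominator if denominator else 0.0


# Character-level tokens are an assumption made for this example.
reference = list("西口側には宿泊施設が多くある。")
hypothesis = list("西口側までは宿泊施設が多くある。")
print(round(gleu_sentence_score(reference, hypothesis), 3))
```

A hypothesis identical to the reference scores 1.0, and a hypothesis sharing no tokens with it scores 0.0.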
