Fast Grammatical Error Correction using BERT
Code and Pre-trained models accompanying our paper "Parallel Iterative Edit Models for Local Sequence Transduction" (EMNLP-IJCNLP 2019)
PIE is a BERT-based architecture for local sequence transduction tasks like Grammatical Error Correction (GEC). Unlike the standard approach of modeling GEC as translation from an "incorrect" to a "correct" language, we pose GEC as a local sequence editing task. We further reduce the local sequence editing problem to a sequence labeling setup, where we use BERT to non-autoregressively label input tokens with edits. We rewire the BERT architecture (without retraining) specifically for the task of sequence editing. We find that PIE models for GEC are 5 to 15 times faster than existing state-of-the-art architectures while maintaining competitive accuracy. For more details, please check out our EMNLP-IJCNLP 2019 paper.
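To make the "edits as labels" formulation concrete, here is a minimal, runnable sketch of how per-token edit labels can be applied in parallel to produce a corrected sentence. The label names and the apply_edits helper below are illustrative assumptions and do not mirror the exact edit space in opcodes.py (which also includes suffix transformations).

```python
# Illustrative only: a toy "apply per-token edits in parallel" step.
# The real edit space (opcodes.py) also includes suffix transformations.

def apply_edits(tokens, edits):
    """Apply one edit label per input token, all at once (non-autoregressively)."""
    output = []
    for token, edit in zip(tokens, edits):
        if edit == "COPY":                       # keep the token unchanged
            output.append(token)
        elif edit == "DELETE":                   # drop the token
            continue
        elif edit.startswith("APPEND_"):         # keep the token, then insert a word after it
            output.append(token)
            output.append(edit[len("APPEND_"):])
        elif edit.startswith("REPLACE_"):        # substitute the token with another word
            output.append(edit[len("REPLACE_"):])
    return output

tokens = ["he", "go", "to", "school", "everyday"]
edits  = ["COPY", "REPLACE_goes", "COPY", "COPY", "REPLACE_every day"]
print(" ".join(apply_edits(tokens, edits)))
# -> "he goes to school every day"
```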
    @inproceedings{awasthi-etal-2019-parallel,
      title = "Parallel Iterative Edit Models for Local Sequence Transduction",
      author = "Awasthi, Abhijeet and Sarawagi, Sunita and Goyal, Rasna and Ghosh, Sabyasachi and Piratla, Vihari",
      booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)",
      month = nov,
      year = "2019",
      address = "Hong Kong, China",
      publisher = "Association for Computational Linguistics",
      url = "https://www.aclweb.org/anthology/D19-1435",
      doi = "10.18653/v1/D19-1435",
      pages = "4259--4269",
    }

- All the public GEC datasets used in the paper can be obtained from here
- Synthetically created datasets (a perturbed version of the One Billion Word corpus), divided into 5 parts to independently train 5 different ensembles. (All the ensembles are further finetuned using the public GEC datasets mentioned above.)
- PIE as reported in the paper
- trained on a synthetically created GEC dataset starting with BERT's initialization
- finetuned further on Lang8, NUCLE and FCE datasets
- Inference using the pretrained PIE ckpt
- Copy the pretrained checkpoint files provided above to the PIE_ckpt directory
- Your PIE_ckpt directory should contain
- bert_config.json
- multi_round_infer.sh
- pie_infer.sh
- pie_model.ckpt.data-00000-of-00001
- pie_model.ckpt.index
- pie_model.ckpt.meta
- vocab.txt
- Run:
$ ./multi_round_infer.sh (from the PIE_ckpt directory)
- NOTE: If you are using Cloud TPUs for inference, move the PIE_ckpt directory to the cloud bucket and change the paths in pie_infer.sh and multi_round_infer.sh accordingly
An example usage of the code is described in the directory "example_scripts".
- preprocess.sh
- Extracts common insertions from the sample training data in the "scratch" directory
- converts the training data into the form of incorrect tokens and aligned edits
- pie_train.sh
- trains a PIE model using the converted training data
- multi_round_infer.sh
- uses a trained PIE model to obtain edits for incorrect sentences
- does 4 rounds of iterative editing (a minimal sketch of this loop appears after this list)
- uses CoNLL-14 test sentences
- m2_eval.sh
- evaluates the final output using m2scorer
- end_to_end.sh
- describes the use of pre-processing, training, inference and evaluation scripts end to end.
- More information in README.md inside "example_scripts"
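As a rough illustration of what multi_round_infer.sh does, the loop below re-applies a single round of inference to its own output for a fixed number of rounds, stopping early once a round makes no change. predict_and_apply_edits is a hypothetical stand-in for one round of PIE inference followed by apply_opcode.py; it is not a function in this repository.

```python
def iterative_editing(sentence, predict_and_apply_edits, max_rounds=4):
    """Iteratively refine a sentence with a (hypothetical) single-round editor."""
    for _ in range(max_rounds):
        edited = predict_and_apply_edits(sentence)
        if edited == sentence:   # no edits predicted; further rounds will not change it
            return sentence
        sentence = edited
    return sentence
```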
Pre-processing and edits related
- seq2edits_utils.py
- contains an implementation of the edit-distance algorithm.
- cost for substitution is modified as per Section A.1 in the paper.
- Adapted from belambert's implementation
- get_edit_vocab.py : Extracts common insertions (the \Sigma_a set described in the paper) from a parallel corpus (a minimal sketch appears after this list)
- get_seq2edits.py : Extracts edits aligned to input tokens
- tokenize_input.py : tokenizes a file containing sentences. The obtained token_ids go as input to the model.
- opcodes.py : A class whose members are all possible edit operations
- transform_suffixes.py: Contains logic for suffix transformations
- tokenization.py : Similar to BERT's implementation, with some GEC specific changes
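The snippet below sketches one simple way to collect a common-insertions vocabulary (the \Sigma_a set) from a parallel corpus, using difflib for alignment and a frequency counter. It only illustrates the idea behind get_edit_vocab.py and does not reproduce its exact alignment or filtering.

```python
from collections import Counter
from difflib import SequenceMatcher

def common_insertions(pairs, top_k=1000):
    """pairs: iterable of (incorrect_tokens, correct_tokens) token lists."""
    counts = Counter()
    for src, tgt in pairs:
        matcher = SequenceMatcher(a=src, b=tgt, autojunk=False)
        for op, i1, i2, j1, j2 in matcher.get_opcodes():
            if op == "insert":            # tokens present only on the correct side
                counts.update(tgt[j1:j2])
    return [tok for tok, _ in counts.most_common(top_k)]

pairs = [(["he", "go", "school"], ["he", "goes", "to", "school"])]
print(common_insertions(pairs, top_k=10))   # -> ['to']
```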
PIE model (uses the implementation of BERT in TensorFlow)
- word_edit_model.py: Implementation of PIE for learning from a parallel corpus of incorrect tokens and aligned edits.
- logit factorization logic (keep the flag use_bert_more=True to enable logit factorization; a simplified sketch appears after this list)
- parallel sequence labeling
- modeling.py : Same as in BERT's implementation
- modified_modeling.py
- Rewires attention mask to obtain representations of candidate appends and replacements
- Used for logit factorization.
- optimization.py : Same as in BERT's implementation
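To give a flavour of logit factorization (enabled with use_bert_more=True), the numpy sketch below scores append/replace edits as the sum of an edit-type score and a candidate-word compatibility score computed from a separate per-token representation (the role played by the rewired attention in modified_modeling.py). The shapes, names, and the way the scores are combined are simplifying assumptions, not the actual parameterization in word_edit_model.py.

```python
import numpy as np

def factorized_edit_logits(h_edit, h_cand, edit_type_W, word_emb, append_type_id=0):
    """
    h_edit:      [seq_len, hidden]        per-token states scoring the edit type
    h_cand:      [seq_len, hidden]        per-token states scoring candidate words
    edit_type_W: [hidden, n_edit_types]   projection to edit-type logits
    word_emb:    [n_candidates, hidden]   embeddings of the \Sigma_a candidate words
    """
    edit_type_logits = h_edit @ edit_type_W            # [seq_len, n_edit_types]
    word_logits = h_cand @ word_emb.T                  # [seq_len, n_candidates]
    # Factorized score: e.g. logit(APPEND_w at position i) is (roughly) the sum of
    # the generic "append" edit-type score and the compatibility score of word w.
    append_logits = edit_type_logits[:, [append_type_id]] + word_logits
    return edit_type_logits, append_logits

h_edit, h_cand = np.random.randn(5, 8), np.random.randn(5, 8)
edit_type_logits, append_logits = factorized_edit_logits(
    h_edit, h_cand, edit_type_W=np.random.randn(8, 4), word_emb=np.random.randn(10, 8))
print(edit_type_logits.shape, append_logits.shape)    # (5, 4) (5, 10)
```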
Post processing
- apply_opcode.py
- Applies inferred edits from the PIE model to the incorrect sentences.
- Handles punctuation and spacing as per the requirements of a standard dataset (INFER_MODE).
- Contains some obvious rules for capitalization etc.
Creating synthetic GEC dataset
- The errorify directory contains the scripts we used for perturbing the one-billion-word corpus
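As a simplified illustration of rule-based perturbation (the real rules live in the errorify scripts), the snippet below drops articles and swaps a few easily confused prepositions in clean sentences to make (incorrect, correct) training pairs. The specific rules and probabilities are made up for illustration only.

```python
import random

ARTICLES = {"a", "an", "the"}
PREPOSITION_SWAPS = {"in": "on", "on": "in", "at": "in", "for": "to"}

def errorify(tokens, p_drop_article=0.3, p_swap_prep=0.2, seed=None):
    """Return a noised copy of a clean token list (the clean list serves as the label)."""
    rng = random.Random(seed)
    noisy = []
    for tok in tokens:
        low = tok.lower()
        if low in ARTICLES and rng.random() < p_drop_article:
            continue                                   # drop the article
        if low in PREPOSITION_SWAPS and rng.random() < p_swap_prep:
            noisy.append(PREPOSITION_SWAPS[low])       # swap the preposition
            continue
        noisy.append(tok)
    return noisy

clean = ["she", "works", "at", "the", "hospital"]
print(errorify(clean, seed=0), "->", clean)
```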
This research was partly sponsored by a Google India AI/ML Research Award and Google PhD Fellowship in Machine Learning. We gratefully acknowledge Google's TFRC program for providing us Cloud TPUs. Thanks to Varun Patil for helping us improve the speed of pre-processing and synthetic-data generation pipelines.