# BERT-pytorch
PyTorch implementation of BERT in ["BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding"](https://arxiv.org/abs/1810.04805).

## Requirements
- Python 3.6+
- PyTorch 0.4.1+
- tqdm
All dependencies can be installed via:

```
pip install -r requirements.txt
```

## Preparing your data

First things first, you need to prepare your data in an appropriate format. Your corpus is assumed to follow the constraints below:
- Each line is a *document*.
- A *document* consists of *sentences*, separated by a vertical bar (|).
- A *sentence* is assumed to be already tokenized. Tokens are separated by spaces.
- A *sentence* has no more than 256 tokens.
- A *document* has at least 2 sentences.
- You have two distinct data files, one for train data and the other for val data.
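If you are unsure whether your corpus satisfies these constraints, a small script can check them before you preprocess. The sketch below is illustrative only and not part of this repo:

```python
# check_corpus.py -- illustrative sketch, not part of this repo.
# Validates a corpus file against the constraints listed above.
import sys

def check_corpus(path, max_tokens=256, min_sentences=2):
    ok = True
    with open(path, encoding='utf-8') as f:
        for line_num, line in enumerate(f, start=1):
            # Each line is a document; sentences are separated by '|'.
            sentences = line.rstrip('\n').split('|')
            if len(sentences) < min_sentences:
                print(f'line {line_num}: only {len(sentences)} sentence(s)')
                ok = False
            for sentence in sentences:
                # Sentences are pre-tokenized; tokens are space-separated.
                num_tokens = len(sentence.split())
                if num_tokens > max_tokens:
                    print(f'line {line_num}: sentence has {num_tokens} tokens '
                          f'(limit is {max_tokens})')
                    ok = False
    return ok

if __name__ == '__main__':
    sys.exit(0 if check_corpus(sys.argv[1]) else 1)
```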
This repo comes with example data for pretraining in the `data/example` directory. Here is the content of the `data/example/train.txt` file:

```
One, two, three, four, five,|Once I caught a fish alive,|Six, seven, eight, nine, ten,|Then I let go again.
I’m a little teapot|Short and stout|Here is my handle|Here is my spout.
Jack and Jill went up the hill|To fetch a pail of water.|Jack fell down and broke his crown,|And Jill came tumbling after.
```

Also, this repo includes SST-2 data in the `data/SST-2` directory for sentiment classification.
## Building a dictionary

```
python bert.py preprocess-index data/example/train.txt --dictionary=dictionary.txt
```

Running the above command produces a `dictionary.txt` file in your current directory.
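The exact contents of `dictionary.txt` are determined by this repo's preprocessing code, but conceptually this step is a token-frequency count over the training corpus. A minimal sketch of the idea (not the repo's actual implementation):

```python
from collections import Counter

# Count token frequencies over a corpus in the format described above.
counts = Counter()
with open('data/example/train.txt', encoding='utf-8') as f:
    for document in f:
        for sentence in document.rstrip('\n').split('|'):
            counts.update(sentence.split())  # tokens are space-separated

# Write tokens sorted by frequency, most common first.
with open('dictionary.txt', 'w', encoding='utf-8') as f:
    for token, count in counts.most_common():
        f.write(f'{token}\t{count}\n')
```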
## Pretraining

```
python bert.py pretrain --train_data data/example/train.txt --val_data data/example/val.txt --checkpoint_output model.pth
```

This step trains the BERT model with an unsupervised objective. This step also:
- logs the training procedure every epoch
- outputs a model checkpoint periodically
- reports the best checkpoint based on the validation metric
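For background, the unsupervised objective described in the BERT paper combines masked language modeling with next-sentence prediction. The sketch below shows how a masked-LM training example is typically constructed, using the 80/10/10 masking rule from the paper; it illustrates the objective, not this repo's exact code:

```python
import random

def mask_tokens(tokens, vocab, mask_token='[MASK]', mask_prob=0.15):
    """Return (masked_tokens, labels) for masked language modeling.

    labels[i] holds the original token at masked positions, None elsewhere.
    """
    masked, labels = [], []
    for token in tokens:
        if random.random() < mask_prob:
            labels.append(token)  # the model must predict this token
            roll = random.random()
            if roll < 0.8:
                masked.append(mask_token)            # 80%: replace with [MASK]
            elif roll < 0.9:
                masked.append(random.choice(vocab))  # 10%: random token
            else:
                masked.append(token)                 # 10%: keep unchanged
        else:
            labels.append(None)
            masked.append(token)
    return masked, labels

tokens = 'Jack and Jill went up the hill'.split()
print(mask_tokens(tokens, vocab=tokens))
```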
## Fine-tuning

You can fine-tune a pretrained BERT model on a downstream task. For example, you can fine-tune your model on the SST-2 sentiment classification task:
```
python bert.py finetune --pretrained_checkpoint model.pth --train_data data/SST-2/train.tsv --val_data data/SST-2/dev.tsv
```

This command also logs the procedure, outputs checkpoints, and reports the best checkpoint.
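Conceptually, fine-tuning loads the pretrained encoder weights and trains a small task-specific head on top of them. Here is a plain-PyTorch sketch of the idea; `Encoder` is a hypothetical stand-in, not this repo's actual model class:

```python
import torch
import torch.nn as nn

class SentimentClassifier(nn.Module):
    """Pretrained encoder plus a linear classification head."""

    def __init__(self, encoder, hidden_size, num_classes=2):
        super().__init__()
        self.encoder = encoder
        self.classifier = nn.Linear(hidden_size, num_classes)

    def forward(self, token_ids):
        hidden = self.encoder(token_ids)  # (batch, seq_len, hidden_size)
        pooled = hidden[:, 0]             # first-token representation
        return self.classifier(pooled)

# encoder = Encoder(...)                             # hypothetical model class
# encoder.load_state_dict(torch.load('model.pth'))   # reuse pretrained weights
# model = SentimentClassifier(encoder, hidden_size=768)
```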
## Related projects

- Transformer-pytorch: My own implementation of the Transformer. This BERT implementation is based on that repo.