ku-nlp/bert (Public)

forked from google-research/bert

TensorFlow code and pre-trained models for BERT

arxiv.org/abs/1810.04805

License

Apache-2.0 license


BERT

This fork makes the following modifications:

Add an option to handle Japanese sentences (the original code treats Japanese sentences as Chinese).
Add a processor class for our FAQ retrieval dataset (project).

***** New May 31st, 2019: Whole Word Masking Models *****

This is a release of several new models which were the result of an improvement to the pre-processing code.

In the original pre-processing code, we randomly select WordPiece tokens to mask. For example:

Input Text: the man jumped up , put his basket on phil ##am ##mon ' s head
Original Masked Input: [MASK] man [MASK] up , put his [MASK] on phil [MASK] ##mon ' s head

The new technique is called Whole Word Masking. In this case, we always mask all of the tokens corresponding to a word at once. The overall masking rate remains the same.

Whole Word Masked Input: the man [MASK] up , put his basket on [MASK] [MASK] [MASK] ' s head

The training is identical -- we still predict each masked WordPiece token independently. The improvement comes from the fact that the original prediction task was too 'easy' for words that had been split into multiple WordPieces.

This can be enabled during data generation by passing the flag --do_whole_word_mask=True to create_pretraining_data.py.
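The idea can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not the code in create_pretraining_data.py: the helper names group_word_pieces and whole_word_mask are hypothetical, special tokens such as [CLS] and [SEP] are not handled, and every selected token is replaced by [MASK].

```python
import random


def group_word_pieces(tokens):
    """Group WordPiece indices into whole words ("##" marks a continuation)."""
    words = []
    for i, token in enumerate(tokens):
        if token.startswith("##") and words:
            words[-1].append(i)
        else:
            words.append([i])
    return words


def whole_word_mask(tokens, masked_lm_prob=0.15, seed=0):
    """Mask whole words until about masked_lm_prob of the tokens are masked."""
    rng = random.Random(seed)
    words = group_word_pieces(tokens)
    rng.shuffle(words)
    # The budget is counted in tokens, so the overall masking rate is unchanged.
    budget = max(1, int(round(len(tokens) * masked_lm_prob)))
    output = list(tokens)
    num_masked = 0
    for word in words:
        if num_masked >= budget:
            break
        if num_masked + len(word) > budget:
            continue  # this word would exceed the budget, try another one
        for i in word:
            output[i] = "[MASK]"
        num_masked += len(word)
    return output


tokens = "the man jumped up , put his basket on phil ##am ##mon ' s head".split()
print(group_word_pieces(tokens)[9])  # indices of "phil ##am ##mon"
print(" ".join(whole_word_mask(tokens, masked_lm_prob=0.3, seed=1)))
```

Because selection happens per word, "phil ##am ##mon" is either masked as three [MASK] tokens or left untouched; it is never partially masked.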

Pre-trained models with Whole Word Masking are linked below. The data and training were otherwise identical, and the models have identical structure and vocab to the original models. We only include BERT-Large models. When using these models, please make it clear in the paper that you are using the Whole Word Masking variant of BERT-Large.

BERT-Large, Uncased (Whole Word Masking): 24-layer, 1024-hidden, 16-heads, 340M parameters
BERT-Large, Cased (Whole Word Masking): 24-layer, 1024-hidden, 16-heads, 340M parameters

Model	SQuAD 1.1 F1/EM	Multi NLI Accuracy
BERT-Large, Uncased (Original)	91.0/84.3	86.05
BERT-Large, Uncased (Whole Word Masking)	92.8/86.7	87.07
BERT-Large, Cased (Original)	91.5/84.8	86.09
BERT-Large, Cased (Whole Word Masking)	92.9/86.7	86.46

***** New February 7th, 2019: TfHub Module *****

BERT has been uploaded to TensorFlow Hub. See run_classifier_with_tfhub.py for an example of how to use the TF Hub module, or run an example in the browser on Colab.

***** New November 23rd, 2018: Un-normalized multilingual model + Thai + Mongolian *****

We uploaded a new multilingual model which does not perform any normalization on the input (no lower casing, accent stripping, or Unicode normalization), and additionally includes Thai and Mongolian.

It is recommended to use this version for developing multilingual models, especially on languages with non-Latin alphabets.

This does not require any code changes, and can be downloaded here:

BERT-Base, Multilingual Cased: 104 languages, 12-layer, 768-hidden, 12-heads, 110M parameters

***** New November 15th, 2018: SOTA SQuAD 2.0 System *****

We released code changes to reproduce our 83% F1 SQuAD 2.0 system, which is currently 1st place on the leaderboard by 3%. See the SQuAD 2.0 section of the README for details.

***** New November 5th, 2018: Third-party PyTorch and Chainer versions of BERT available *****

NLP researchers from HuggingFace made a PyTorch version of BERT available which is compatible with our pre-trained checkpoints and is able to reproduce our results. Sosuke Kobayashi also made a Chainer version of BERT available (Thanks!). We were not involved in the creation or maintenance of the PyTorch implementation so please direct any questions towards the authors of that repository.

***** New November 3rd, 2018: Multilingual and Chinese models available *****

We have made two new BERT models available:

BERT-Base, Multilingual (Not recommended, use Multilingual Cased instead): 102 languages, 12-layer, 768-hidden, 12-heads, 110M parameters
BERT-Base, Chinese: Chinese Simplified and Traditional, 12-layer, 768-hidden, 12-heads, 110M parameters

We use character-based tokenization for Chinese, and WordPiece tokenization for all other languages. Both models should work out-of-the-box without any code changes. We did update the implementation of BasicTokenizer in tokenization.py to support Chinese character tokenization, so please update if you forked it. However, we did not change the tokenization API.

For more, see the Multilingual README.

***** End new information *****

Introduction

BERT, or Bidirectional Encoder Representations from Transformers, is a new method of pre-training language representations which obtains state-of-the-art results on a wide array of Natural Language Processing (NLP) tasks.

Our academic paper which describes BERT in detail and provides full results on a number of tasks can be found here: https://arxiv.org/abs/1810.04805.

To give a few numbers, here are the results on the SQuAD v1.1 question answering task:

SQuAD v1.1 Leaderboard (Oct 8th 2018)	Test EM	Test F1
1st Place Ensemble - BERT	87.4	93.2
2nd Place Ensemble - nlnet	86.0	91.7
1st Place Single Model - BERT	85.1	91.8
2nd Place Single Model - nlnet	83.5	90.1

And several natural language inference tasks:

System	MultiNLI	Question NLI	SWAG
BERT	86.7	91.1	86.3
OpenAI GPT (Prev. SOTA)	82.2	88.1	75.0

Plus many other tasks.

Moreover, these results were all obtained with almost no task-specific neural network architecture design.

If you already know what BERT is and you just want to get started, you can download the pre-trained models and run a state-of-the-art fine-tuning in only a few minutes.

What is BERT?

BERT is a method of pre-training language representations, meaning that we train a general-purpose "language understanding" model on a large text corpus (like Wikipedia), and then use that model for downstream NLP tasks that we care about (like question answering). BERT outperforms previous methods because it is the first unsupervised, deeply bidirectional system for pre-training NLP.

Unsupervised means that BERT was trained using only a plain text corpus, which is important because an enormous amount of plain text data is publicly available on the web in many languages.

Pre-trained representations can also either be context-free or contextual, and contextual representations can further be unidirectional or bidirectional. Context-free models such as word2vec or GloVe generate a single "word embedding" representation for each word in the vocabulary, so "bank" would have the same representation in "bank deposit" and "river bank". Contextual models instead generate a representation of each word that is based on the other words in the sentence.

BERT was built upon recent work in pre-training contextual representations, including Semi-supervised Sequence Learning, Generative Pre-Training, ELMo, and ULMFit, but crucially these models are all unidirectional or shallowly bidirectional. This means that each word is only contextualized using the words to its left (or right). For example, in the sentence "I made a bank deposit" the unidirectional representation of "bank" is only based on "I made a" but not "deposit". Some previous work does combine the representations from separate left-context and right-context models, but only in a "shallow" manner. BERT represents "bank" using both its left and right context ("I made a ... deposit") starting from the very bottom of a deep neural network, so it is deeply bidirectional.

BERT uses a simple approach for this: We mask out 15% of the words in the input, run the entire sequence through a deep bidirectional Transformer encoder, and then predict only the masked words. For example:

Input: the man went to the [MASK1] . he bought a [MASK2] of milk.
Labels: [MASK1] = store; [MASK2] = gallon

In order to learn relationships between sentences, we also train on a simple task which can be generated from any monolingual corpus: Given two sentences A and B, is B the actual next sentence that comes after A, or just a random sentence from the corpus?

Sentence A: the man went to the store .
Sentence B: he bought a gallon of milk .
Label: IsNextSentence

Sentence A: the man went to the store .
Sentence B: penguins are flightless .
Label: NotNextSentence
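As a rough sketch of how such pairs can be generated from a corpus, the following Python builds labeled sentence pairs from a list of documents. The function name make_nsp_examples is hypothetical, the 50/50 split between real and random next sentences is an assumption of this sketch, and the second sentence of the second document is made up for illustration.

```python
import random


def make_nsp_examples(documents, seed=0):
    """Build (sentence_a, sentence_b, label) triples from a list of documents.

    Each document is a list of sentences. About half of the time sentence B
    is the true next sentence; otherwise it comes from a different document.
    """
    rng = random.Random(seed)
    examples = []
    for doc_index, doc in enumerate(documents):
        for i in range(len(doc) - 1):
            if len(documents) > 1 and rng.random() < 0.5:
                others = [d for j, d in enumerate(documents) if j != doc_index]
                random_sentence = rng.choice(rng.choice(others))
                examples.append((doc[i], random_sentence, "NotNextSentence"))
            else:
                examples.append((doc[i], doc[i + 1], "IsNextSentence"))
    return examples


documents = [
    ["the man went to the store .", "he bought a gallon of milk ."],
    ["penguins are flightless .", "they live in the southern hemisphere ."],
]
for sentence_a, sentence_b, label in make_nsp_examples(documents):
    print(sentence_a, "|||", sentence_b, "->", label)
```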

We then train a large model (12-layer to 24-layer Transformer) on a large corpus (Wikipedia + BookCorpus) for a long time (1M update steps), and that's BERT.

Using BERT has two stages: Pre-training and fine-tuning.

Pre-training is fairly expensive (four days on 4 to 16 Cloud TPUs), but is a one-time procedure for each language (current models are English-only, but multilingual models will be released in the near future). We are releasing a number of pre-trained models from the paper which were pre-trained at Google. Most NLP researchers will never need to pre-train their own model from scratch.

Fine-tuning is inexpensive. All of the results in the paper can be replicated in at most 1 hour on a single Cloud TPU, or a few hours on a GPU, starting from the exact same pre-trained model. SQuAD, for example, can be trained in around 30 minutes on a single Cloud TPU to achieve a Dev F1 score of 91.0%, which is the single system state-of-the-art.

The other important aspect of BERT is that it can be adapted to many types of NLP tasks very easily. In the paper, we demonstrate state-of-the-art results on sentence-level (e.g., SST-2), sentence-pair-level (e.g., MultiNLI), word-level (e.g., NER), and span-level (e.g., SQuAD) tasks with almost no task-specific modifications.

What has been released in this repository?

We are releasing the following:

TensorFlow code for the BERT model architecture (which is mostly a standard Transformer architecture).
Pre-trained checkpoints for both the lowercase and cased version of BERT-Base and BERT-Large from the paper.
TensorFlow code for push-button replication of the most important fine-tuning experiments from the paper, including SQuAD, MultiNLI, and MRPC.

All of the code in this repository works out-of-the-box with CPU, GPU, and Cloud TPU.

Pre-trained models

We are releasing the BERT-Base and BERT-Large models from the paper. Uncased means that the text has been lowercased before WordPiece tokenization, e.g., John Smith becomes john smith. The Uncased model also strips out any accent markers. Cased means that the true case and accent markers are preserved. Typically, the Uncased model is better unless you know that case information is important for your task (e.g., Named Entity Recognition or Part-of-Speech tagging).

These models are all released under the same license as the source code (Apache 2.0).

For information about the Multilingual and Chinese model, see the Multilingual README.

When using a cased model, make sure to pass --do_lower_case=False to the training scripts. (Or pass do_lower_case=False directly to FullTokenizer if you're using your own script.)

The links to the models are here (right-click, 'Save link as...' on the name):

BERT-Large, Uncased (Whole Word Masking): 24-layer, 1024-hidden, 16-heads, 340M parameters
BERT-Large, Cased (Whole Word Masking): 24-layer, 1024-hidden, 16-heads, 340M parameters
BERT-Base, Uncased: 12-layer, 768-hidden, 12-heads, 110M parameters
BERT-Large, Uncased: 24-layer, 1024-hidden, 16-heads, 340M parameters
BERT-Base, Cased: 12-layer, 768-hidden, 12-heads, 110M parameters
BERT-Large, Cased: 24-layer, 1024-hidden, 16-heads, 340M parameters
BERT-Base, Multilingual Cased (New, recommended): 104 languages, 12-layer, 768-hidden, 12-heads, 110M parameters
BERT-Base, Multilingual Uncased (Orig, not recommended; use Multilingual Cased instead): 102 languages, 12-layer, 768-hidden, 12-heads, 110M parameters
BERT-Base, Chinese: Chinese Simplified and Traditional, 12-layer, 768-hidden, 12-heads, 110M parameters

Each .zip file contains three items:

A TensorFlow checkpoint (bert_model.ckpt) containing the pre-trained weights (which is actually 3 files).
A vocab file (vocab.txt) to map WordPiece to word id.
A config file (bert_config.json) which specifies the hyperparameters of the model.

Fine-tuning with BERT

Important: All results in the paper were fine-tuned on a single Cloud TPU, which has 64GB of RAM. It is currently not possible to reproduce most of the BERT-Large results in the paper using a GPU with 12GB - 16GB of RAM, because the maximum batch size that can fit in memory is too small. We are working on adding code to this repository which allows for much larger effective batch size on the GPU. See the section on out-of-memory issues for more details.

This code was tested with TensorFlow 1.11.0. It was tested with Python2 and Python3 (but more thoroughly with Python2, since this is what's used internally in Google).

The fine-tuning examples which use BERT-Base should be able to run on a GPU that has at least 12GB of RAM using the hyperparameters given.

Fine-tuning with Cloud TPUs

Most of the examples below assume that you will be running training/evaluation on your local machine, using a GPU like a Titan X or GTX 1080.

However, if you have access to a Cloud TPU that you want to train on, just add the following flags to run_classifier.py or run_squad.py:

  --use_tpu=True \
  --tpu_name=$TPU_NAME

Please see the Google Cloud TPU tutorial for how to use Cloud TPUs. Alternatively, you can use the Google Colab notebook "BERT FineTuning with Cloud TPUs".

On Cloud TPUs, the pretrained model and the output directory will need to be on Google Cloud Storage. For example, if you have a bucket named some_bucket, you might use the following flags instead:

  --output_dir=gs://some_bucket/my_output_dir/

The unzipped pre-trained model files can also be found in the Google Cloud Storage folder gs://bert_models/2018_10_18. For example:

export BERT_BASE_DIR=gs://bert_models/2018_10_18/uncased_L-12_H-768_A-12

Sentence (and sentence-pair) classification tasks

Before running this example you must download the GLUE data by running this script and unpack it to some directory $GLUE_DIR. Next, download the BERT-Base checkpoint and unzip it to some directory $BERT_BASE_DIR.

This example code fine-tunes BERT-Base on the Microsoft Research Paraphrase Corpus (MRPC), which only contains 3,600 examples and can fine-tune in a few minutes on most GPUs.

export BERT_BASE_DIR=/path/to/bert/uncased_L-12_H-768_A-12
export GLUE_DIR=/path/to/glue

python run_classifier.py \
  --task_name=MRPC \
  --do_train=true \
  --do_eval=true \
  --data_dir=$GLUE_DIR/MRPC \
  --vocab_file=$BERT_BASE_DIR/vocab.txt \
  --bert_config_file=$BERT_BASE_DIR/bert_config.json \
  --init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \
  --max_seq_length=128 \
  --train_batch_size=32 \
  --learning_rate=2e-5 \
  --num_train_epochs=3.0 \
  --output_dir=/tmp/mrpc_output/

You should see output like this:

***** Eval results *****
  eval_accuracy = 0.845588
  eval_loss = 0.505248
  global_step = 343
  loss = 0.505248

This means that the Dev set accuracy was 84.56%. Small sets like MRPC have a high variance in the Dev set accuracy, even when starting from the same pre-training checkpoint. If you re-run multiple times (making sure to point to a different output_dir), you should see results between 84% and 88%.

A few other pre-trained models are implemented off-the-shelf in run_classifier.py, so it should be straightforward to follow those examples to use BERT for any single-sentence or sentence-pair classification task.

Note: You might see a message Running train on CPU. This really just means that it's running on something other than a Cloud TPU, which includes a GPU.

Prediction from classifier

Once you have trained your classifier you can use it in inference mode by using the --do_predict=true flag. You need to have a file named test.tsv in the input folder. Output will be written to a file called test_results.tsv in the output folder. Each line contains the output for one sample; the columns are the class probabilities.

export BERT_BASE_DIR=/path/to/bert/uncased_L-12_H-768_A-12
export GLUE_DIR=/path/to/glue
export TRAINED_CLASSIFIER=/path/to/fine/tuned/classifier

python run_classifier.py \
  --task_name=MRPC \
  --do_predict=true \
  --data_dir=$GLUE_DIR/MRPC \
  --vocab_file=$BERT_BASE_DIR/vocab.txt \
  --bert_config_file=$BERT_BASE_DIR/bert_config.json \
  --init_checkpoint=$TRAINED_CLASSIFIER \
  --max_seq_length=128 \
  --output_dir=/tmp/mrpc_output/
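To turn the class probabilities into predicted labels, something like the following should work. It is a minimal sketch that assumes the columns are tab-separated and follow the order of the label list you pass in; the helper predicted_labels is hypothetical and the sample probabilities are made up, not real model output.

```python
import csv
import io


def predicted_labels(tsv_text, label_list):
    """Return the most probable label for each line of class probabilities."""
    predictions = []
    for row in csv.reader(io.StringIO(tsv_text), delimiter="\t"):
        if not row:
            continue  # ignore blank lines
        probabilities = [float(value) for value in row]
        best = max(range(len(probabilities)), key=probabilities.__getitem__)
        predictions.append(label_list[best])
    return predictions


# Made-up stand-in for the contents of test_results.tsv (two samples).
sample = "0.91\t0.09\n0.23\t0.77\n"
print(predicted_labels(sample, ["0", "1"]))
```

For a real run, read the text from the test_results.tsv file in your output folder instead of the sample string.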

SQuAD 1.1

The Stanford Question Answering Dataset (SQuAD) is a popular question answering benchmark dataset. BERT (at the time of the release) obtains state-of-the-art results on SQuAD with almost no task-specific network architecture modifications or data augmentation. However, it does require semi-complex data pre-processing and post-processing to deal with (a) the variable-length nature of SQuAD context paragraphs, and (b) the character-level answer annotations which are used for SQuAD training. This processing is implemented and documented in run_squad.py.

To run on SQuAD, you will first need to download the dataset. The SQuAD website does not seem to link to the v1.1 datasets any longer, but the necessary files can be found here:

train-v1.1.json
dev-v1.1.json
evaluate-v1.1.py

Download these to some directory $SQUAD_DIR.

The state-of-the-art SQuAD results from the paper currently cannot be reproduced on a 12GB-16GB GPU due to memory constraints (in fact, even batch size 1 does not seem to fit on a 12GB GPU using BERT-Large). However, a reasonably strong BERT-Base model can be trained on the GPU with these hyperparameters:

python run_squad.py \
  --vocab_file=$BERT_BASE_DIR/vocab.txt \
  --bert_config_file=$BERT_BASE_DIR/bert_config.json \
  --init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \
  --do_train=True \
  --train_file=$SQUAD_DIR/train-v1.1.json \
  --do_predict=True \
  --predict_file=$SQUAD_DIR/dev-v1.1.json \
  --train_batch_size=12 \
  --learning_rate=3e-5 \
  --num_train_epochs=2.0 \
  --max_seq_length=384 \
  --doc_stride=128 \
  --output_dir=/tmp/squad_base/

The dev set predictions will be saved into a file called predictions.json in the output_dir:

python $SQUAD_DIR/evaluate-v1.1.py $SQUAD_DIR/dev-v1.1.json ./squad/predictions.json

Which should produce an output like this:

{"f1": 88.41249612335034, "exact_match": 81.2488174077578}

You should see a result similar to the 88.5% reported in the paper for BERT-Base.

If you have access to a Cloud TPU, you can train with BERT-Large. Here is a set of hyperparameters (slightly different than the paper) which consistently obtain around 90.5%-91.0% F1 single-system trained only on SQuAD:

python run_squad.py \
  --vocab_file=$BERT_LARGE_DIR/vocab.txt \
  --bert_config_file=$BERT_LARGE_DIR/bert_config.json \
  --init_checkpoint=$BERT_LARGE_DIR/bert_model.ckpt \
  --do_train=True \
  --train_file=$SQUAD_DIR/train-v1.1.json \
  --do_predict=True \
  --predict_file=$SQUAD_DIR/dev-v1.1.json \
  --train_batch_size=24 \
  --learning_rate=3e-5 \
  --num_train_epochs=2.0 \
  --max_seq_length=384 \
  --doc_stride=128 \
  --output_dir=gs://some_bucket/squad_large/ \
  --use_tpu=True \
  --tpu_name=$TPU_NAME

For example, one random run with these parameters produces the following Dev scores:

{"f1": 90.87081895814865, "exact_match": 84.38978240302744}

If you fine-tune for one epoch on TriviaQA before this the results will be even better, but you will need to convert TriviaQA into the SQuAD json format.

SQuAD 2.0

This model is also implemented and documented in run_squad.py.

To run on SQuAD 2.0, you will first need to download the dataset. The necessary files can be found here:

train-v2.0.json
dev-v2.0.json
evaluate-v2.0.py

Download these to some directory $SQUAD_DIR.

On Cloud TPU you can run with BERT-Large as follows:

python run_squad.py \
  --vocab_file=$BERT_LARGE_DIR/vocab.txt \
  --bert_config_file=$BERT_LARGE_DIR/bert_config.json \
  --init_checkpoint=$BERT_LARGE_DIR/bert_model.ckpt \
  --do_train=True \
  --train_file=$SQUAD_DIR/train-v2.0.json \
  --do_predict=True \
  --predict_file=$SQUAD_DIR/dev-v2.0.json \
  --train_batch_size=24 \
  --learning_rate=3e-5 \
  --num_train_epochs=2.0 \
  --max_seq_length=384 \
  --doc_stride=128 \
  --output_dir=gs://some_bucket/squad_large/ \
  --use_tpu=True \
  --tpu_name=$TPU_NAME \
  --version_2_with_negative=True

We assume you have copied everything from the output directory to a local directory called ./squad/. The initial dev set predictions will be at ./squad/predictions.json and the differences between the score of no answer ("") and the best non-null answer for each question will be in the file ./squad/null_odds.json.

Run this script to tune a threshold for predicting null versus non-null answers:

python $SQUAD_DIR/evaluate-v2.0.py $SQUAD_DIR/dev-v2.0.json ./squad/predictions.json --na-prob-file ./squad/null_odds.json

Assume the script outputs "best_f1_thresh" THRESH. (Typical values are between -1.0 and -5.0.) You can now re-run the model to generate predictions with the derived threshold or alternatively you can extract the appropriate answers from ./squad/nbest_predictions.json.

python run_squad.py \
  --vocab_file=$BERT_LARGE_DIR/vocab.txt \
  --bert_config_file=$BERT_LARGE_DIR/bert_config.json \
  --init_checkpoint=$BERT_LARGE_DIR/bert_model.ckpt \
  --do_train=False \
  --train_file=$SQUAD_DIR/train-v2.0.json \
  --do_predict=True \
  --predict_file=$SQUAD_DIR/dev-v2.0.json \
  --train_batch_size=24 \
  --learning_rate=3e-5 \
  --num_train_epochs=2.0 \
  --max_seq_length=384 \
  --doc_stride=128 \
  --output_dir=gs://some_bucket/squad_large/ \
  --use_tpu=True \
  --tpu_name=$TPU_NAME \
  --version_2_with_negative=True \
  --null_score_diff_threshold=$THRESH
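The effect of the threshold can be illustrated with a small sketch. It rests on assumptions that are not confirmed here: that the null score differences are available as a mapping from question id to a float, that a larger difference favors the empty answer, and that you have already extracted the best non-null answer text per question. The function apply_null_threshold and all sample values are hypothetical.

```python
def apply_null_threshold(null_odds, best_non_null, threshold):
    """Predict "" (no answer) when the null score difference exceeds threshold."""
    predictions = {}
    for question_id, score_diff in null_odds.items():
        if score_diff > threshold:
            predictions[question_id] = ""
        else:
            predictions[question_id] = best_non_null[question_id]
    return predictions


# Made-up values: (null score) minus (best non-null score) for each question.
null_odds = {"q1": -6.2, "q2": 1.4, "q3": -2.0}
best_non_null = {"q1": "a puppeteer", "q2": "in 1936", "q3": "the muppets"}
print(apply_null_threshold(null_odds, best_non_null, threshold=-3.0))
```

Lowering the threshold makes the sketch answer "no answer" more often, which is why the value is tuned on the dev set.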

Out-of-memory issues

All experiments in the paper were fine-tuned on a Cloud TPU, which has 64GB of device RAM. Therefore, when using a GPU with 12GB - 16GB of RAM, you are likely to encounter out-of-memory issues if you use the same hyperparameters described in the paper.

The factors that affect memory usage are:

max_seq_length: The released models were trained with sequence lengths up to 512, but you can fine-tune with a shorter max sequence length to save substantial memory. This is controlled by the max_seq_length flag in our example code.
train_batch_size: The memory usage is also directly proportional to the batch size.
Model type, BERT-Base vs. BERT-Large: The BERT-Large model requires significantly more memory than BERT-Base.
Optimizer: The default optimizer for BERT is Adam, which requires a lot of extra memory to store the m and v vectors. Switching to a more memory efficient optimizer can reduce memory usage, but can also affect the results. We have not experimented with other optimizers for fine-tuning.

Using the default training scripts (run_classifier.py and run_squad.py), we benchmarked the maximum batch size on a single Titan X GPU (12GB RAM) with TensorFlow 1.11.0:

System	Seq Length	Max Batch Size
`BERT-Base`	64	64
...	128	32
...	256	16
...	320	14
...	384	12
...	512	6
`BERT-Large`	64	12
...	128	6
...	256	2
...	320	1
...	384	0
...	512	0

Unfortunately, these max batch sizes for BERT-Large are so small that they will actually harm the model accuracy, regardless of the learning rate used. We are working on adding code to this repository which will allow much larger effective batch sizes to be used on the GPU. The code will be based on one (or both) of the following techniques:

Gradient accumulation: The samples in a minibatch are typically independent with respect to gradient computation (excluding batch normalization, which is not used here). This means that the gradients of multiple smaller minibatches can be accumulated before performing the weight update, and this will be exactly equivalent to a single larger update.
Gradient checkpointing: The major use of GPU/TPU memory during DNN training is caching the intermediate activations in the forward pass that are necessary for efficient computation in the backward pass. "Gradient checkpointing" trades memory for compute time by re-computing the activations in an intelligent way.

However, this is not implemented in the current release.
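The equivalence claimed for gradient accumulation can be checked on a toy problem. The sketch below is plain Python and has nothing to do with the BERT training code: it uses a one-parameter linear model with a mean squared error loss (both chosen only for illustration) and compares the gradient of one batch of 8 samples with the average gradient of 4 micro-batches of 2.

```python
def mean_gradient(w, batch):
    """Gradient of the mean squared error of y = w * x with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def accumulated_gradient(w, batch, micro_batch_size):
    """Average the gradients of equally sized micro-batches."""
    assert len(batch) % micro_batch_size == 0
    chunks = [batch[i:i + micro_batch_size]
              for i in range(0, len(batch), micro_batch_size)]
    return sum(mean_gradient(w, chunk) for chunk in chunks) / len(chunks)


# Toy data: y = 3x + 1 for x = 0..7.
data = [(float(x), 3.0 * x + 1.0) for x in range(8)]
full = mean_gradient(0.5, data)
accumulated = accumulated_gradient(0.5, data, micro_batch_size=2)
print(abs(full - accumulated) < 1e-9)
```

The two gradients match because the loss is a mean over independent samples. With micro-batches of unequal size, the accumulated gradients would need to be weighted by micro-batch size instead of averaged uniformly.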

Using BERT to extract fixed feature vectors (like ELMo)

In certain cases, rather than fine-tuning the entire pre-trained model end-to-end, it can be beneficial to obtain pre-trained contextual embeddings, which are fixed contextual representations of each input token generated from the hidden layers of the pre-trained model. This should also mitigate most of the out-of-memory issues.

As an example, we include the script extract_features.py which can be used like this:

# Sentence A and Sentence B are separated by the ||| delimiter for sentence
# pair tasks like question answering and entailment.
# For single sentence inputs, put one sentence per line and DON'T use the
# delimiter.
echo 'Who was Jim Henson ? ||| Jim Henson was a puppeteer' > /tmp/input.txt

python extract_features.py \
  --input_file=/tmp/input.txt \
  --output_file=/tmp/output.jsonl \
  --vocab_file=$BERT_BASE_DIR/vocab.txt \
  --bert_config_file=$BERT_BASE_DIR/bert_config.json \
  --init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \
  --layers=-1,-2,-3,-4 \
  --max_seq_length=128 \
  --batch_size=8

This will create a JSON file (one line per line of input) containing the BERT activations from each Transformer layer specified by layers (-1 is the final hidden layer of the Transformer, etc.).

Note that this script will produce very large output files (by default, around 15kb for every input token).
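Reading the output back might look like the sketch below. The record layout used here (a "features" list whose entries hold a "token" and a list of "layers", each with an "index" and "values") is an assumption about the file format, so check it against a line of your own output before relying on it. The helper token_vectors is hypothetical and the sample record is hand-written; real vectors are much longer.

```python
import json


def token_vectors(jsonl_line, layer_index=-1):
    """Return [(token, vector)] for one line of the output file."""
    record = json.loads(jsonl_line)
    pairs = []
    for feature in record["features"]:
        for layer in feature["layers"]:
            if layer["index"] == layer_index:
                pairs.append((feature["token"], layer["values"]))
    return pairs


# Hand-written stand-in for one output line (assumed layout, tiny vectors).
sample_line = json.dumps({
    "linex_index": 0,
    "features": [
        {"token": "[CLS]", "layers": [{"index": -1, "values": [0.1, 0.2]}]},
        {"token": "who", "layers": [{"index": -1, "values": [0.3, 0.4]}]},
    ],
})
print(token_vectors(sample_line))
```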

If you need to maintain alignment between the original and tokenized words (for projecting training labels), see the Tokenization section below.

Note: You may see a message like Could not find trained model in model_dir: /tmp/tmpuB5g5c, running initialization to predict. This message is expected; it just means that we are using the init_from_checkpoint() API rather than the saved model API. If you don't specify a checkpoint or specify an invalid checkpoint, this script will complain.

Tokenization

For sentence-level (or sentence-pair) tasks, tokenization is very simple. Just follow the example code in run_classifier.py and extract_features.py. The basic procedure for sentence-level tasks is:

Instantiate an instance of tokenizer = tokenization.FullTokenizer
Tokenize the raw text with tokens = tokenizer.tokenize(raw_text).
Truncate to the maximum sequence length. (You can use up to 512, but you probably want to use shorter if possible for memory and speed reasons.)
Add the [CLS] and [SEP] tokens in the right place.
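The last two steps can be sketched as follows. This is not the code from run_classifier.py: build_inputs is a hypothetical helper, the tokens are assumed to be already tokenized (the WordPiece split shown is a stand-in), and the strategy of trimming the longer segment first is one reasonable choice rather than a requirement.

```python
def build_inputs(tokens_a, tokens_b=None, max_seq_length=128):
    """Truncate and add [CLS]/[SEP] for one sentence or a sentence pair."""
    tokens_a = list(tokens_a)
    tokens_b = list(tokens_b) if tokens_b else []
    # Reserve room for [CLS] and one [SEP] per segment.
    budget = max_seq_length - (3 if tokens_b else 2)
    if budget < 0:
        raise ValueError("max_seq_length is too small for the special tokens")
    # Trim the longer segment one token at a time until everything fits.
    while len(tokens_a) + len(tokens_b) > budget:
        longer = tokens_a if len(tokens_a) > len(tokens_b) else tokens_b
        longer.pop()
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"]
    if tokens_b:
        tokens += tokens_b + ["[SEP]"]
    return tokens


question = "who was jim henson ?".split()
answer = "jim henson was a puppet ##eer".split()  # stand-in tokenization
print(build_inputs(question, answer, max_seq_length=10))
```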

Word-level and span-level tasks (e.g., SQuAD and NER) are more complex, since you need to maintain alignment between your input text and output text so that you can project your training labels. SQuAD is a particularly complex example because the input labels are character-based, and SQuAD paragraphs are often longer than our maximum sequence length. See the code in run_squad.py to show how we handle this.

Before we describe the general recipe for handling word-level tasks, it's important to understand what exactly our tokenizer is doing. It has three main steps:

Text normalization: Convert all whitespace characters to spaces, and (for the Uncased model) lowercase the input and strip out accent markers. E.g., John Johanson's, → john johanson's,.
Punctuation splitting: Split all punctuation characters on both sides (i.e., add whitespace around all punctuation characters). Punctuation characters are defined as (a) Anything with a P* Unicode class, (b) any non-letter/number/space ASCII character (e.g., characters like $ which are technically not punctuation). E.g., john johanson's, → john johanson ' s ,
WordPiece tokenization: Apply whitespace tokenization to the output of the above procedure, and apply WordPiece tokenization to each token separately. (Our implementation is directly based on the one from tensor2tensor, which is linked). E.g., john johanson ' s , → john johan ##son ' s ,
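The three steps can be imitated with a short, self-contained sketch. It is a simplified stand-in for tokenization.py, not a copy of it: the function names are hypothetical, the vocabulary is a toy set rather than vocab.txt, and the greedy longest-match-first split is how this sketch approximates WordPiece.

```python
import unicodedata


def normalize(text, do_lower_case=True):
    """Step 1: collapse whitespace; optionally lowercase and strip accents."""
    text = " ".join(text.split())
    if do_lower_case:
        text = unicodedata.normalize("NFD", text.lower())
        text = "".join(ch for ch in text if unicodedata.category(ch) != "Mn")
    return text


def is_punctuation(ch):
    """Unicode P* classes plus any non-alphanumeric, non-space ASCII character."""
    if unicodedata.category(ch).startswith("P"):
        return True
    return ch.isascii() and not ch.isalnum() and not ch.isspace()


def split_punctuation(text):
    """Step 2: put whitespace around every punctuation character."""
    spaced = "".join(f" {ch} " if is_punctuation(ch) else ch for ch in text)
    return spaced.split()


def wordpiece(word, vocab, unk_token="[UNK]"):
    """Step 3: greedy longest-match-first split of a single word."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = ("##" if start > 0 else "") + word[start:end]
            if piece in vocab:
                break
            end -= 1
        if end == start:
            return [unk_token]  # no prefix of the remainder is in the vocab
        pieces.append(piece)
        start = end
    return pieces


toy_vocab = {"john", "johan", "##son", "'", "s", ","}  # toy set, not vocab.txt
words = split_punctuation(normalize("John Johanson's, "))
print([piece for word in words for piece in wordpiece(word, toy_vocab)])
```

With this toy vocabulary the sketch reproduces the example above: john johan ##son ' s ,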

The advantage of this scheme is that it is "compatible" with most existing English tokenizers. For example, imagine that you have a part-of-speech tagging task which looks like this:

Input:  John Johanson 's   house
Labels: NNP  NNP      POS NN

The tokenized output will look like this:

Tokens: john johan ##son ' s house

Crucially, this would be the same output as if the raw text were John Johanson's house (with no space before the 's).

If you have a pre-tokenized representation with word-level annotations, you can simply tokenize each input word independently, and deterministically maintain an original-to-tokenized alignment:

import tokenization  # tokenization.py from this repository

### Input
orig_tokens = ["John", "Johanson", "'s", "house"]
labels      = ["NNP", "NNP", "POS", "NN"]

### Output
bert_tokens = []

# Token map will be an int -> int mapping between the `orig_tokens` index and
# the `bert_tokens` index.
orig_to_tok_map = []

# `vocab_file` is the path to the vocab.txt of the model you are using.
tokenizer = tokenization.FullTokenizer(
    vocab_file=vocab_file, do_lower_case=True)

bert_tokens.append("[CLS]")
for orig_token in orig_tokens:
  # Record where the first WordPiece of this word will land.
  orig_to_tok_map.append(len(bert_tokens))
  bert_tokens.extend(tokenizer.tokenize(orig_token))
bert_tokens.append("[SEP]")

# bert_tokens == ["[CLS]", "john", "johan", "##son", "'", "s", "house", "[SEP]"]
# orig_to_tok_map == [1, 2, 4, 6]

Now orig_to_tok_map can be used to project labels to the tokenized representation.
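One possible projection is sketched below: each word's label goes to its first WordPiece and every other position gets a placeholder. Both the helper project_labels and the "X" placeholder are assumptions made for this illustration; other conventions (for example repeating the label on every piece) are equally valid.

```python
def project_labels(labels, orig_to_tok_map, num_bert_tokens, pad_label="X"):
    """Give each word's label to its first WordPiece; pad everything else."""
    projected = [pad_label] * num_bert_tokens
    for label, tok_index in zip(labels, orig_to_tok_map):
        projected[tok_index] = label
    return projected


bert_tokens = ["[CLS]", "john", "johan", "##son", "'", "s", "house", "[SEP]"]
labels = ["NNP", "NNP", "POS", "NN"]
orig_to_tok_map = [1, 2, 4, 6]
print(project_labels(labels, orig_to_tok_map, len(bert_tokens)))
```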

There are common English tokenization schemes which will cause a slight mismatch with how BERT was pre-trained. For example, if your input tokenization splits off contractions like do n't, this will cause a mismatch. If it is possible to do so, you should pre-process your data to convert these back to raw-looking text, but if it's not possible, this mismatch is likely not a big deal.

Pre-training with BERT

We are releasing code to do "masked LM" and "next sentence prediction" on an arbitrary text corpus. Note that this is not the exact code that was used for the paper (the original code was written in C++, and had some additional complexity), but this code does generate pre-training data as described in the paper.

Here's how to run the data generation. The input is a plain text file, with one sentence per line. (It is important that these be actual sentences for the "next sentence prediction" task). Documents are delimited by empty lines. The output is a set of tf.train.Examples serialized into TFRecord file format.

You can perform sentence segmentation with an off-the-shelf NLP toolkit such as spaCy. The create_pretraining_data.py script will concatenate segments until they reach the maximum sequence length to minimize computational waste from padding (see the script for more details). However, you may want to intentionally add a slight amount of noise to your input data (e.g., randomly truncate 2% of input segments) to make it more robust to non-sentential input during fine-tuning.

This script stores all of the examples for the entire input file in memory, so for large data files you should shard the input file and call the script multiple times. (You can pass in a file glob to run_pretraining.py, e.g., tf_examples.tf_record*.)

The max_predictions_per_seq is the maximum number of masked LM predictions per sequence. You should set this to around max_seq_length * masked_lm_prob (the script doesn't do that automatically because the exact value needs to be passed to both scripts).

python create_pretraining_data.py \
  --input_file=./sample_text.txt \
  --output_file=/tmp/tf_examples.tfrecord \
  --vocab_file=$BERT_BASE_DIR/vocab.txt \
  --do_lower_case=True \
  --max_seq_length=128 \
  --max_predictions_per_seq=20 \
  --masked_lm_prob=0.15 \
  --random_seed=12345 \
  --dupe_factor=5
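The value 20 used above follows from the rule of thumb for max_predictions_per_seq. A quick way to compute it, assuming you round up (the rounding direction is a choice made here, since the guidance only says "around"):

```python
import math


def suggested_max_predictions(max_seq_length, masked_lm_prob):
    """Round max_seq_length * masked_lm_prob up to a whole number of tokens."""
    return math.ceil(max_seq_length * masked_lm_prob)


# 128 * 0.15 = 19.2, which rounds up to the 20 used in the command above.
print(suggested_max_predictions(128, 0.15))
```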

Here's how to run the pre-training. Do not include init_checkpoint if you are pre-training from scratch. The model configuration (including vocab size) is specified in bert_config_file. This demo code only pre-trains for a small number of steps (20), but in practice you will probably want to set num_train_steps to 10000 steps or more. The max_seq_length and max_predictions_per_seq parameters passed to run_pretraining.py must be the same as those passed to create_pretraining_data.py.

python run_pretraining.py \
  --input_file=/tmp/tf_examples.tfrecord \
  --output_dir=/tmp/pretraining_output \
  --do_train=True \
  --do_eval=True \
  --bert_config_file=$BERT_BASE_DIR/bert_config.json \
  --init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \
  --train_batch_size=32 \
  --max_seq_length=128 \
  --max_predictions_per_seq=20 \
  --num_train_steps=20 \
  --num_warmup_steps=10 \
  --learning_rate=2e-5

This will produce an output like this:

***** Eval results *****
  global_step = 20
  loss = 0.0979674
  masked_lm_accuracy = 0.985479
  masked_lm_loss = 0.0979328
  next_sentence_accuracy = 1.0
  next_sentence_loss = 3.45724e-05

Note that since our sample_text.txt file is very small, this example training will overfit that data in only a few steps and produce unrealistically high accuracy numbers.
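Because the vocab size is read from the model configuration rather than from the vocab file, a quick consistency check before a long run can be useful. The sketch below assumes the vocab file has one token per line and that the configuration is a JSON file with a vocab_size key; the helper name is hypothetical, and the demo uses throwaway files rather than a real checkpoint directory.

```python
import json
import os
import tempfile
from typing import Tuple


def compare_vocab_to_config(vocab_path: str, config_path: str) -> Tuple[int, int]:
    """Return (number of lines in the vocab file, vocab_size from the config)."""
    with open(vocab_path, encoding="utf-8") as vocab_file:
        num_tokens = sum(1 for _ in vocab_file)
    with open(config_path, encoding="utf-8") as config_file:
        vocab_size = json.load(config_file)["vocab_size"]
    return num_tokens, vocab_size


# Demo on temporary files; a real check would point at vocab.txt and bert_config.json.
with tempfile.TemporaryDirectory() as tmp_dir:
    vocab_path = os.path.join(tmp_dir, "vocab.txt")
    config_path = os.path.join(tmp_dir, "bert_config.json")
    with open(vocab_path, "w", encoding="utf-8") as f:
        f.write("[PAD]\n[UNK]\n[CLS]\n[SEP]\n[MASK]\n")
    with open(config_path, "w", encoding="utf-8") as f:
        json.dump({"vocab_size": 4}, f)
    num_tokens, vocab_size = compare_vocab_to_config(vocab_path, config_path)

if num_tokens > vocab_size:
    print("vocab has %d entries but vocab_size is %d" % (num_tokens, vocab_size))
```

A vocab file with more entries than vocab_size is the case to catch, since token ids would then fall outside the embedding table.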

Pre-training tips and caveats

If using your own vocabulary, make sure to change vocab_size in bert_config.json. If you use a larger vocabulary without changing this, you will likely get NaNs when training on GPU or TPU due to unchecked out-of-bounds access.
If your task has a large domain-specific corpus available (e.g., "movie reviews" or "scientific papers"), it will likely be beneficial to run additional steps of pre-training on your corpus, starting from the BERT checkpoint.
The learning rate we used in the paper was 1e-4. However, if you are doing additional steps of pre-training starting from an existing BERT checkpoint, you should use a smaller learning rate (e.g., 2e-5).
Current BERT models are English-only, but we do plan to release a multilingual model which has been pre-trained on a lot of languages in the near future (hopefully by the end of November 2018).
Longer sequences are disproportionately expensive because attention is quadratic in the sequence length. In other words, a batch of 64 sequences of length 512 is much more expensive than a batch of 256 sequences of length 128. The fully-connected/convolutional cost is the same, but the attention cost is far greater for the 512-length sequences. Therefore, one good recipe is to pre-train for, say, 90,000 steps with a sequence length of 128 and then for 10,000 additional steps with a sequence length of 512. The very long sequences are mostly needed to learn positional embeddings, which can be learned fairly quickly. Note that this does require generating the data twice with different values of max_seq_length.
If you are pre-training from scratch, be prepared that pre-training is computationally expensive, especially on GPUs. Our recommended recipe is to pre-train a BERT-Base on a single preemptible Cloud TPU v2, which takes about 2 weeks at a cost of about $500 USD (based on the pricing in October 2018). You will have to scale down the batch size when only training on a single Cloud TPU, compared to what was used in the paper. It is recommended to use the largest batch size that fits into TPU memory.
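The sequence-length comparison in the tips above can be checked with a back-of-the-envelope calculation. The sketch below uses two rough proxies that are assumptions of this example: per-token cost proportional to batch size times sequence length, and attention cost proportional to batch size times sequence length squared. Constant factors such as hidden size and number of heads are ignored.

```python
def relative_costs(batch_size: int, seq_length: int) -> dict:
    """Return rough cost proxies for one batch (arbitrary units)."""
    return {
        # Proxy for the per-token (fully-connected) cost.
        "tokens": batch_size * seq_length,
        # Proxy for the attention cost, which grows with the square of the length.
        "attention": batch_size * seq_length ** 2,
    }


long_batch = relative_costs(batch_size=64, seq_length=512)
short_batch = relative_costs(batch_size=256, seq_length=128)

print(long_batch["tokens"], short_batch["tokens"])          # 32768 32768
print(long_batch["attention"] / short_batch["attention"])   # 4.0
```

Both batches contain the same number of tokens, but under these proxies the attention term is four times larger for the 512-length batch.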

Pre-training data

We will not be able to release the pre-processed datasets used in the paper. For Wikipedia, the recommended pre-processing is to download the latest dump, extract the text with WikiExtractor.py, and then apply any necessary cleanup to convert it into plain text.
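A minimal sketch of one such cleanup step is shown below. It assumes the extracted text wraps each article in a line starting with `<doc ` and a closing `</doc>` line, which should be verified against the actual extractor output; the function name is hypothetical. It only removes the wrappers and inserts an empty line between articles, so sentence segmentation (and any further cleanup, such as dropping title lines) still has to happen afterwards.

```python
from typing import Iterable, Iterator


def strip_doc_wrappers(lines: Iterable[str]) -> Iterator[str]:
    """Yield article text lines, with an empty line after each article."""
    in_doc = False
    wrote_text = False
    for raw in lines:
        line = raw.strip()
        if line.startswith("<doc ") and line.endswith(">"):
            # Opening wrapper: start a new article.
            in_doc, wrote_text = True, False
        elif line == "</doc>":
            if wrote_text:
                # The empty line marks the document boundary.
                yield ""
            in_doc = False
        elif in_doc and line:
            # Blank lines inside an article are dropped so they are not
            # mistaken for document boundaries.
            wrote_text = True
            yield line


sample = [
    '<doc id="1" url="u" title="A">',
    "A",
    "",
    "First paragraph.",
    "</doc>",
    '<doc id="2" url="u" title="B">',
    "B",
    "Second.",
    "</doc>",
]
cleaned = list(strip_doc_wrappers(sample))
print("\n".join(cleaned))
```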

Unfortunately the researchers who collected the BookCorpus no longer have it available for public download. The Project Gutenberg Dataset is a somewhat smaller (200M word) collection of older books that are public domain.

Common Crawl is another very large collection of text, but you will likely have to do substantial pre-processing and cleanup to extract a usable corpus for pre-training BERT.

Learning a new WordPiece vocabulary

This repository does not include code for learning a new WordPiece vocabulary. The reason is that the code used in the paper was implemented in C++ with dependencies on Google's internal libraries. For English, it is almost always better to just start with our vocabulary and pre-trained models. For learning vocabularies of other languages, there are a number of open source options available. However, keep in mind that these are not compatible with our tokenization.py library.

Using BERT in Colab

If you want to use BERT with Colab, you can get started with the notebook "BERT FineTuning with Cloud TPUs". At the time of this writing (October 31st, 2018), Colab users can access a Cloud TPU completely for free. Note: One per user, availability limited, requires a Google Cloud Platform account with storage (although storage may be purchased with free credit for signing up with GCP), and this capability may no longer be available in the future. Click on the BERT Colab that was just linked for more information.

FAQ

Is this code compatible with Cloud TPUs? What about GPUs?

Yes, all of the code in this repository works out-of-the-box with CPU, GPU, andCloud TPU. However, GPU training is single-GPU only.

I am getting out-of-memory errors, what is wrong?

See the section on out-of-memory issues for more information.

Is there a PyTorch version available?

There is no official PyTorch implementation. However, NLP researchers from HuggingFace made a PyTorch version of BERT available which is compatible with our pre-trained checkpoints and is able to reproduce our results. We were not involved in the creation or maintenance of the PyTorch implementation so please direct any questions towards the authors of that repository.

Is there a Chainer version available?

There is no official Chainer implementation. However, Sosuke Kobayashi made a Chainer version of BERT available which is compatible with our pre-trained checkpoints and is able to reproduce our results. We were not involved in the creation or maintenance of the Chainer implementation so please direct any questions towards the authors of that repository.

Will models in other languages be released?

Yes, we plan to release a multi-lingual BERT model in the near future. We cannot make promises about exactly which languages will be included, but it will likely be a single model which includes most of the languages which have a significantly-sized Wikipedia.

Will models larger than BERT-Large be released?

So far we have not attempted to train anything larger than BERT-Large. It is possible that we will release larger models if we are able to obtain significant improvements.

What license is this library released under?

All code and models are released under the Apache 2.0 license. See the LICENSE file for more information.

How do I cite BERT?

For now, cite the arXiv paper:

@article{devlin2018bert,
  title={BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding},
  author={Devlin, Jacob and Chang, Ming-Wei and Lee, Kenton and Toutanova, Kristina},
  journal={arXiv preprint arXiv:1810.04805},
  year={2018}
}

If we submit the paper to a conference or journal, we will update the BibTeX.

Disclaimer

This is not an official Google product.

Contact information

For help or issues using BERT, please submit a GitHub issue.

For personal communication related to BERT, please contact Jacob Devlin (jacobdevlin@google.com), Ming-Wei Chang (mingweichang@google.com), or Kenton Lee (kentonl@google.com).
