Empower Sequence Labeling with Task-Aware Neural Language Model | a PyTorch Tutorial to Sequence Labeling

This is a PyTorch Tutorial to Sequence Labeling.

This is the second in a series of tutorials I'm writing about implementing cool models on your own with the amazing PyTorch library.

Basic knowledge of PyTorch and recurrent neural networks is assumed.

If you're new to PyTorch, first read Deep Learning with PyTorch: A 60 Minute Blitz and Learning PyTorch with Examples.

Questions, suggestions, or corrections can be posted as issues.

I'm using PyTorch 0.4 in Python 3.6.


27 Jan 2020: Working code for two new tutorials has been added — Super-Resolution and Machine Translation


Contents

Objective

Concepts

Overview

Implementation

Training

Frequently Asked Questions

Objective

To build a model that can tag each word in a sentence with entities, parts of speech, etc.

We will be implementing the Empower Sequence Labeling with Task-Aware Neural Language Model paper. This is more advanced than most sequence tagging models, but you will learn many useful concepts – and it works extremely well. The authors' original implementation can be found here.

This model is special because it augments the sequence labeling task by training it concurrently with language models.

Concepts

  • Sequence Labeling. duh.

  • Language Models. Language modeling is the task of predicting the next word or character in a sequence of words or characters. Neural language models achieve impressive results across a wide variety of NLP tasks like text generation, machine translation, image captioning, optical character recognition, and what have you.

  • Character RNNs. RNNs operating on individual characters in a text are known to capture the underlying style and structure. In a sequence labeling task, they are especially useful since sub-word information can often yield important clues to an entity or tag.

  • Multi-Task Learning. Datasets available to train a model are often small. Creating annotations or handcrafted features to help your model along is not only cumbersome, but also frequently not adaptable to the diverse domains or settings in which your model may be useful. Sequence labeling, unfortunately, is a prime example. There is a way to mitigate this problem – jointly training multiple models that are joined at the hip will maximize the information available to each model, improving performance.

  • Conditional Random Fields. Discrete classifiers predict a class or label at a word. Conditional Random Fields (CRFs) can do you one better – they predict labels based on not just the word, but also the neighborhood. Which makes sense, because there are patterns in a sequence of entities or labels. CRFs are widely used to model ordered information, be it for sequence labeling, gene sequencing, or even object detection and image segmentation in computer vision.

  • Viterbi Decoding. Since we're using CRFs, we're not so much predicting the right label at each word as we are predicting the right label sequence for a word sequence. Viterbi Decoding is a way to do exactly this – find the optimal tag sequence from the scores computed by a Conditional Random Field.

  • Highway Networks. Fully connected layers are a staple in any neural network to transform or extract features at different locations. Highway Networks accomplish this, but also allow information to flow unimpeded across transformations. This makes deep networks much more efficient or feasible.

Overview

In this section, I will present an overview of this model. If you're already familiar with it, you can skip straight to the Implementation section or the commented code.

LM-LSTM-CRF

The authors refer to the model as the Language Model - Long Short-Term Memory - Conditional Random Field since it involves co-training language models with an LSTM + CRF combination.

This image from the paper thoroughly represents the entire model, but don't worry if it seems too complex at this time. We'll break it down to take a closer look at the components.

Multi-Task Learning

Multi-task learning is when you simultaneously train a model on two or more tasks.

Usually we're only interested in one of these tasks – in this case, the sequence labeling.

But when layers in a neural network contribute towards performing multiple functions, they learn more than they would have if they had trained only on the primary task. This is because the information extracted at each layer is expanded to accommodate all tasks. When there is more information to work with, performance on the primary task is enhanced.

Enriching existing features in this manner removes the need for using handcrafted features for sequence labeling.

The total loss during multi-task learning is usually a linear combination of the losses on the individual tasks. The parameters of the combination can be fixed or learned as updateable weights.

Since we're aggregating individual losses, you can see how upstream layers shared by multiple tasks would receive updates from all of them during backpropagation.

The authors of the paper simply add the losses (β=1), and we will do the same.
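As a rough sketch of what this summation looks like in code (the tensors below are dummies standing in for the real model outputs, and the variable names are not the ones used in train.py):

```python
import torch
import torch.nn as nn

# Dummy stand-ins for the scores the model would produce for one sentence.
vocab_size, timesteps = 100, 4
lm_f_scores = torch.randn(timesteps, vocab_size, requires_grad=True)  # forward char-LM word scores
lm_b_scores = torch.randn(timesteps, vocab_size, requires_grad=True)  # backward char-LM word scores
lm_targets = torch.randint(0, vocab_size, (timesteps,))               # words to predict
vb_loss = torch.tensor(3.7)   # stands in for the Viterbi Loss from the CRF (primary task)

ce = nn.CrossEntropyLoss()
loss = ce(lm_f_scores, lm_targets) + ce(lm_b_scores, lm_targets) + vb_loss  # β = 1, a plain sum
loss.backward()   # shared layers would receive gradients from all three terms
```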

Let's take a look at the tasks that make up our model.

There are three –

The first is a character-level language model that reads the characters in the forward direction. This leverages sub-word information to predict the next word.

The second does the same in the backward direction.

The third is the sequence labeling task itself. We also use the outputs of these two character-RNNs as inputs to our word-RNN and Conditional Random Field (CRF) to perform our primary task of sequence labeling.

We're using sub-word information in our tagging task because it can be a powerful indicator of the tags, whether they're parts of speech or entities. For example, it may learn that adjectives commonly end with "-y" or "-ul", or that places often end with "-land" or "-burg".

But our sub-word features, viz. the outputs of the Character RNNs, are also enriched with additional information – the knowledge they need to predict the next word in both forward and backward directions, because of models 1 and 2.

Therefore, our sequence tagging model uses both

  • word-level information in the form of word embeddings.
  • character-level information up to and including each word in both directions, enriched with the know-how required to be able to predict the next word in both directions.

The Bidirectional LSTM/RNN encodes these features into new features at each word containing information about the word and its neighborhood, at both the word-level and the character-level. This forms the input to the Conditional Random Field.

Conditional Random Field (CRF)

Without a CRF, we would have simply used a single linear layer to transform the output of the Bidirectional LSTM into scores for each tag. These are known as emission scores, which are a representation of the likelihood of the word being a certain tag.

A CRF calculates not only the emission scores but also the transition scores, which are the likelihood of a word being a certain tag considering the previous word was a certain tag. Therefore the transition scores measure how likely it is to transition from one tag to another.

If there are m tags, transition scores are stored in a matrix of dimensions m, m, where the rows represent the tag of the previous word and the columns represent the tag of the current word. A value in this matrix at position i, j is the likelihood of transitioning from the ith tag at the previous word to the jth tag at the current word. Unlike emission scores, transition scores are not defined for each word in the sentence. They are global.

In our model, the CRF layer outputs the aggregate of the emission and transition scores at each word.

For a sentence of length L, emission scores would be an L, m tensor. Since the emission scores at each word do not depend on the tag of the previous word, we create a new dimension like L, _, m and broadcast (copy) the tensor along this direction to get an L, m, m tensor.

The transition scores are an m, m tensor. Since the transition scores are global and do not depend on the word, we create a new dimension like _, m, m and broadcast (copy) the tensor along this direction to get an L, m, m tensor.

We can now add them to get the total scores, which are an L, m, m tensor. A value at position k, i, j is the aggregate of the emission score of the jth tag at the kth word and the transition score of the jth tag at the kth word considering the previous word was the ith tag.
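Here's a minimal sketch of this broadcasting in PyTorch (the tensor names and random values are purely illustrative):

```python
import torch

L, m = 4, 5                                  # sentence length, number of tags
emission = torch.randn(L, m)                 # emission scores per word
transition = torch.randn(m, m)               # transition scores, global

# (L, m) -> (L, 1, m): emission scores don't depend on the previous tag
# (m, m) -> (1, m, m): transition scores don't depend on the word
crf_scores = emission.unsqueeze(1) + transition.unsqueeze(0)   # (L, m, m)

# crf_scores[k, i, j] = emission score of tag j at word k
#                     + transition score from tag i (previous word) to tag j
print(crf_scores.shape)   # torch.Size([4, 5, 5])
```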

For our example sentence dunston checks in <end>, if we assume there are 5 tags in total, the total scores would look like this –

But wait a minute, why are there <start> and <end> tags? While we're at it, why are we using an <end> token?

About <start> and <end> tags, <start> and <end> tokens

Since we're modeling the likelihood of transitioning between tags, we also include a <start> tag and an <end> tag in our tag-set.

The transition score of a certain tag given that the previous tag was a <start> tag represents the likelihood of this tag being the first tag in a sentence. For example, sentences usually start with articles (a, an, the) or nouns or pronouns.

The transition score of the <end> tag considering a certain previous tag indicates the likelihood of this previous tag being the last tag in a sentence.

We will use an <end> token in all sentences and not a <start> token because the total CRF scores at each word are defined with respect to the previous word's tag, which would make no sense at a <start> token.

The correct tag of the <end> token is always the <end> tag. The "previous tag" of the first word is always the <start> tag.

To illustrate, if our example sentence dunston checks in <end> had the tags tag_2, tag_3, tag_3, <end>, the values in red indicate the scores of these tags.

Highway Networks

We generally use activated linear layers to transform and process outputs of an RNN/LSTM.

If you're familiar with residual connections, we can add the input before the transformation to the transformed output, creating a path for data-flow around the transformation.

This path is a shortcut for the flow of gradients during backpropagation, and aids in the convergence of deep networks.

A Highway Network is similar to a residual network, but we use a sigmoid-activated gate to determine the ratio in which the input and transformed output are combined.

Since the character-RNNs contribute towards multiple tasks, Highway Networks are used for extracting task-specific information from their outputs.

Therefore, we will use Highway Networks at three locations in our combined model –

  • to transform the output of the forward character-RNN to predict the next word.
  • to transform the output of the backward character-RNN to predict the next word (in the backward direction).
  • to transform the concatenated output of the forward and backward character-RNNs for use in the word-level RNN along with the word embedding.

In a naive co-training setting, where we use the outputs of the character-RNNs directly for multiple tasks, i.e. without transformation, the discordance between the nature of the tasks could hurt performance.

Putting it all together

It might be clear by now what our combined network looks like.

Other configurations

Progressively removing parts of our network results in progressively simpler networks that are used widely for sequence labeling.

(a) a Bi-LSTM + CRF sequence tagger that leverages sub-word information.

There is no multi-task learning.

Using character-level information without co-training still improves performance.

(b) a Bi-LSTM + CRF sequence tagger.

There is no multi-task learning or character-level processing.

This configuration is used quite commonly in the industry and works well.

(c) a Bi-LSTM sequence tagger.

There is no multi-task learning, character-level processing, or CRFing. Note that a linear or Highway layer would replace the latter.

This could work reasonably well, but a Conditional Random Field provides a sizeable performance boost.

Viterbi Loss

Remember, we're not using a linear layer that computes only the emission scores, so Cross Entropy is not a suitable loss metric.

Instead we will use the Viterbi Loss which, like Cross Entropy, is a "negative log likelihood". But here we will measure the likelihood of the gold (true) tag sequence, instead of the likelihood of the true tag at each word in the sequence. To find the likelihood, we consider the softmax over the scores of all tag sequences.

The score of a tag sequence t is defined as the sum of the scores of the individual tags.

For example, consider the CRF scores we looked at earlier –

The score of the tag sequence tag_2, tag_3, tag_3, <end> is the sum of the values in red, 4.85 + 6.79 + 3.85 + 3.52 = 19.01.

The Viterbi Loss is then defined as

where t_G is the gold tag sequence and T represents the space of all possible tag sequences.

This simplifies to –
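Writing S(t) for the score of a tag sequence t (notation assumed here for clarity), these two equations can be written as –

```latex
\mathcal{L}_{\text{Viterbi}}
  = -\log \frac{\exp\big(S(t_G)\big)}{\sum_{t \in T} \exp\big(S(t)\big)}
  = \log \sum_{t \in T} \exp\big(S(t)\big) \;-\; S(t_G)
```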

Therefore, the Viterbi Loss is the difference between the log-sum-exp of the scores of all possible tag sequences and the score of the gold tag sequence, i.e. log-sum-exp(all scores) - gold score.

Viterbi Decoding

Viterbi Decoding is a way to construct the optimal tag sequence, considering not only the likelihood of a tag at a certain word (emission scores), but also the likelihood of a tag considering the previous and next tags (transition scores).

Once we generate CRF scores in an L, m, m tensor for a sequence of length L, we start decoding.

Viterbi Decoding is best understood with an example. Consider again –

For the first word in the sequence, the previous_tag can only be <start>. Therefore only consider that one row.

These are also the cumulative scores for each current_tag at the first word.

We will also keep track of the previous_tag that corresponds to each score. These are known as backpointers. At the first word, they are obviously all <start> tags.

At the second word, add the previous cumulative scores to the CRF scores of this word to generate new cumulative scores.

Note that the first word's current_tags are the second word's previous_tags. Therefore, broadcast the first word's cumulative scores along the current_tag dimension.

For each current_tag, consider only the maximum of the scores from all previous_tags.

Store backpointers, i.e. the previous tags that correspond to these maximum scores.

Repeat this process at the third word.

...and the last word, which is the <end> token.

Here, the only difference is you already know the correct tag. You need the maximum score and backpointer only for the <end> tag.

Now that you have accumulated CRF scores across the entire sequence, you trace backwards to reveal the tag sequence with the highest possible score.

We find that the optimal tag sequence for dunston checks in <end> is tag_2, tag_3, tag_3, <end>.

Implementation

The sections below briefly describe the implementation.

They are meant to provide some context, but details are best understood directly from the code, which is quite heavily commented.

Dataset

I use the CoNLL 2003 NER dataset to compare my results with the paper.

Here's a snippet –

-DOCSTART- -X- O O

EU NNP I-NP I-ORG
rejects VBZ I-VP O
German JJ I-NP I-MISC
call NN I-NP O
to TO I-VP O
boycott VB I-VP O
British JJ I-NP I-MISC
lamb NN I-NP O
. . O O

This dataset is not meant to be publicly distributed, although you may find it somewhere online.

There are several public datasets online that you can use to train the model. These may not all be 100% human annotated, but they are sufficient.

For NER tagging, you can use the Groningen Meaning Bank.

For POS tagging, NLTK has a small dataset available that you can access with nltk.corpus.treebank.tagged_sents().

You would either have to convert it to the CoNLL 2003 NER data format, or modify the code referenced in the Data Pipeline section.

Inputs to model

We will need eight inputs.

Words

These are the word sequences that must be tagged.

dunston checks in

As discussed earlier, we will not use <start> tokens, but we will need to use <end> tokens.

dunston, checks, in, <end>

Since we pass the sentences around as fixed size Tensors, we need to pad sentences (which are naturally of varying length) to the same length with <pad> tokens.

dunston, checks, in, <end>, <pad>, <pad>, <pad>, ...

Furthermore, we create a word_map, which is an index mapping for each word in the corpus, including the <end> and <pad> tokens. PyTorch, like other libraries, needs words encoded as indices to look up embeddings for them, or to identify their place in the predicted word scores.

4381, 448, 185, 4669, 0, 0, 0, ...

Therefore, word sequences fed to the model must be an Int tensor of dimensions N, L_w where N is the batch_size and L_w is the padded length of the word sequences (usually the length of the longest word sequence).

Characters (Forward)

These are the character sequences in the forward direction.

'd', 'u', 'n', 's', 't', 'o', 'n', ' ', 'c', 'h', 'e', 'c', 'k', 's', ' ', 'i', 'n', ' '

We need <end> tokens in the character sequences to match the <end> token in the word sequences. Since we're going to use character-level features at each word in the word sequence, we need character-level features at <end> in the word sequence.

'd', 'u', 'n', 's', 't', 'o', 'n', ' ', 'c', 'h', 'e', 'c', 'k', 's', ' ', 'i', 'n', ' ', <end>

We also need to pad them.

'd', 'u', 'n', 's', 't', 'o', 'n', ' ', 'c', 'h', 'e', 'c', 'k', 's', ' ', 'i', 'n', ' ', <end>, <pad>, <pad>, <pad>, ...

And encode them with a char_map.

29, 2, 12, 8, 7, 14, 12, 3, 6, 18, 1, 6, 21, 8, 3, 17, 12, 3, 60, 0, 0, 0, ...

Therefore, forward character sequences fed to the model must be an Int tensor of dimensions N, L_c, where L_c is the padded length of the character sequences (usually the length of the longest character sequence).

Characters (Backward)

This would be processed the same as the forward sequence, but backward. (The <end> tokens would still be at the end, naturally.)

'n', 'i', ' ', 's', 'k', 'c', 'e', 'h', 'c', ' ', 'n', 'o', 't', 's', 'n', 'u', 'd', ' ', <end>, <pad>, <pad>, <pad>, ...

12, 17, 3, 8, 21, 6, 1, 18, 6, 3, 12, 14, 7, 8, 12, 2, 29, 3, 60, 0, 0, 0, ...

Therefore, backward character sequences fed to the model must be an Int tensor of dimensions N, L_c.

Character Markers (Forward)

These markers are positions in the character sequences where we extract features to –

  • generate the next word in the language models, and
  • use as character-level features in the word-level RNN in the sequence labeler

We will extract features at every space ' ' in the character sequence, and at the <end> token.

For the forward character sequence, we extract at –

7, 14, 17, 18

These are points after dunston, checks, in, <end> respectively. Thus, we have a marker for each word in the word sequence, which makes sense. (In the language models, however, since we're predicting the next word, we won't predict at the marker which corresponds to <end>.)

We pad these with 0s. It doesn't matter what we pad with as long as they're valid indices. (We will extract features at the pads, but we will not use them.)

7, 14, 17, 18, 0, 0, 0, ...

They are padded to the padded length of the word sequences, L_w.

Therefore, forward character markers fed to the model must be an Int tensor of dimensions N, L_w.
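A tiny sketch of how these marker positions can be computed for our example (char_markers below is an illustrative helper, not a function from utils.py):

```python
def char_markers(chars):
    """Return positions of every space ' ' and of the final <end> token."""
    return [i for i, c in enumerate(chars) if c == ' '] + [len(chars) - 1]

# 'dunston checks in ' followed by an <end> token
forward_chars = list('dunston checks in ') + ['<end>']
print(char_markers(forward_chars))   # [7, 14, 17, 18]
```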

Character Markers (Backward)

For the markers in the backward character sequences, we similarly find the positions of every space ' ' and the <end> token.

We also ensure that these positions are in the same word order as in the forward markers. This alignment makes it easier to concatenate features extracted from the forward and backward character sequences, and also prevents having to re-order the targets in the language models.

17, 9, 2, 18

These are points after notsnud, skcehc, ni, <end> respectively.

We pad with 0s.

17, 9, 2, 18, 0, 0, 0, ...

Therefore, backward character markers fed to the model must be an Int tensor of dimensions N, L_w.

Tags

Let's assume the correct tags for dunston, checks, in, <end> are –

tag_2, tag_3, tag_3, <end>

We have a tag_map (containing the tags <start>, tag_1, tag_2, tag_3, <end>).

Normally, we would just encode them directly (before padding) –

2, 3, 3, 4

These are 1D encodings, i.e., tag positions in a 1D tag map.

But the outputs of the CRF layer are 2D m, m tensors at each word. We would need to encode tag positions in these 2D outputs.

The correct tag positions are marked in red.

(0, 2), (2, 3), (3, 3), (3, 4)

If we unroll these scores into a 1D m*m tensor, then the tag positions in the unrolled tensor would be

tag_map[previous_tag]*len(tag_map)+tag_map[current_tag]

Therefore, we encode tag_2, tag_3, tag_3, <end> as

2, 13, 18, 19

Note that you can retrieve the original tag_map indices by taking the modulus

t%len(tag_map)

They will be padded to the padded length of the word sequences, L_w.

Therefore, tags fed to the model must be an Int tensor of dimensions N, L_w.
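A small sketch of this encoding (the tag_map below is an assumed ordering consistent with the example above; the helper is illustrative, not code from utils.py):

```python
tag_map = {'<start>': 0, 'tag_1': 1, 'tag_2': 2, 'tag_3': 3, '<end>': 4}

def encode_tags(tags, tag_map):
    """Encode a tag sequence as positions in the unrolled m*m CRF scores."""
    previous = ['<start>'] + tags[:-1]   # the previous tag of the first word is <start>
    return [tag_map[p] * len(tag_map) + tag_map[c] for p, c in zip(previous, tags)]

encoded = encode_tags(['tag_2', 'tag_3', 'tag_3', '<end>'], tag_map)
print(encoded)                                # [2, 13, 18, 19]
print([t % len(tag_map) for t in encoded])    # [2, 3, 3, 4] -- the original tag_map indices
```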

Word Lengths

These are the actual lengths of the word sequences including the <end> tokens. Since PyTorch supports dynamic graphs, we will compute only over these lengths and not over the <pad>s.

Therefore, word lengths fed to the model must be an Int tensor of dimensions N.

Character Lengths

These are the actual lengths of the character sequences including the <end> tokens. Since PyTorch supports dynamic graphs, we will compute only over these lengths and not over the <pad>s.

Therefore, character lengths fed to the model must be an Int tensor of dimensions N.

Data Pipeline

See read_words_tags() in utils.py.

This reads the input files in the CoNLL 2003 format, and extracts the word and tag sequences.

See create_maps() in utils.py.

Here, we create encoding maps for words, characters, and tags. We bin rare words and characters as <unk>s (unknowns).

See create_input_tensors() in utils.py.

We generate the eight inputs detailed in the Inputs to Model section.

See load_embeddings() in utils.py.

We load pre-trained embeddings, with the option to expand the word_map to include out-of-corpus words present in the embedding vocabulary. Note that this may also include rare in-corpus words that were binned as <unk>s earlier.

See WCDataset in datasets.py.

This is a subclass of PyTorch Dataset. It needs a __len__ method defined, which returns the size of the dataset, and a __getitem__ method which returns the ith set of the eight inputs to the model.

The Dataset will be used by a PyTorch DataLoader in train.py to create and feed batches of data to the model for training or validation.
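A stripped-down sketch of such a Dataset (the real WCDataset returns all eight inputs; the class and tensors here are illustrative only):

```python
import torch
from torch.utils.data import Dataset, DataLoader

class ToyTaggingDataset(Dataset):
    """Wraps pre-built input tensors so a DataLoader can batch them."""

    def __init__(self, words, tags, word_lengths):
        self.words, self.tags, self.word_lengths = words, tags, word_lengths

    def __len__(self):
        return self.words.size(0)          # number of sentences

    def __getitem__(self, i):
        # The real dataset returns the eight inputs described above.
        return self.words[i], self.tags[i], self.word_lengths[i]

# Dummy data: 3 sentences padded to length 5
dataset = ToyTaggingDataset(torch.zeros(3, 5, dtype=torch.long),
                            torch.zeros(3, 5, dtype=torch.long),
                            torch.tensor([4, 3, 5]))
loader = DataLoader(dataset, batch_size=2, shuffle=True)
```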

Highway Networks

See Highway in models.py.

A transform is a ReLU-activated linear transformation of the input. A gate is a sigmoid-activated linear transformation of the input. Note that both transformations must be the same size as the input, to allow for adding the input in a residual connection.

The num_layers attribute specifies how many transform-gate-residual-connection operations we perform in series. Usually just one is sufficient.

We store the requisite number of transform and gate layers in separate ModuleList()s, and use a for loop to perform successive operations.
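A minimal sketch of a Highway layer along these lines (a simplified stand-in, not the Highway class itself):

```python
import torch
import torch.nn as nn

class ToyHighway(nn.Module):
    """y = g * relu(W_t x) + (1 - g) * x, with g = sigmoid(W_g x)."""

    def __init__(self, size, num_layers=1):
        super().__init__()
        # Transform and gate layers preserve the input size so the input can be mixed back in.
        self.transforms = nn.ModuleList([nn.Linear(size, size) for _ in range(num_layers)])
        self.gates = nn.ModuleList([nn.Linear(size, size) for _ in range(num_layers)])

    def forward(self, x):
        for transform, gate in zip(self.transforms, self.gates):
            g = torch.sigmoid(gate(x))                       # how much transformed output to keep
            x = g * torch.relu(transform(x)) + (1 - g) * x   # gated residual combination
        return x

out = ToyHighway(size=50)(torch.randn(8, 50))                # (batch, features) -> same shape
```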

Language Models

See LM_LSTM_CRF in models.py.

At the very outset, we sort the forward and backward character sequences by decreasing lengths. This is required to use pack_padded_sequence() in order for the LSTM to compute over only the valid timesteps, i.e. the true lengths of the sequences.

Remember to also sort all other tensors in the same order.

See dynamic_rnn.py for an illustration of how pack_padded_sequence() can be used to take advantage of PyTorch's dynamic graphing and batching capabilities so that we don't process the pads. It flattens the sorted sequences by timestep while ignoring the pads, and the LSTM computes over only the effective batch size N_t at each timestep.

The sorting allows the top N_t at any timestep to align with the outputs from the previous step. At the third timestep, for example, we process only the top 5 sequences, using the top 5 outputs from the previous step. Except for the sorting, all of this is handled internally by PyTorch, but it's still very useful to understand what pack_padded_sequence() does so we can use it in other scenarios to achieve similar ends. (See the related question about handling variable length sequences in the FAQs section.)
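A self-contained sketch of the sort-pack-unpack pattern (the dimensions and lengths are made up for illustration):

```python
import torch
from torch import nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

batch, max_len, emb, hidden = 4, 6, 10, 16
x = torch.randn(batch, max_len, emb)              # padded input sequences
lengths = torch.tensor([6, 5, 3, 2])              # true lengths, already sorted (decreasing)

lstm = nn.LSTM(emb, hidden, batch_first=True)

packed = pack_padded_sequence(x, lengths, batch_first=True)   # flatten by timestep, drop pads
out_packed, _ = lstm(packed)                                  # LSTM runs only on valid timesteps
out, _ = pad_packed_sequence(out_packed, batch_first=True)    # re-pad to (batch, max_len, hidden)
print(out.shape)                                              # torch.Size([4, 6, 16])
```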

Upon sorting, we apply the forward and backward LSTMs on the forward and backward packed_sequences respectively. We use pad_packed_sequence() to unflatten and re-pad the outputs.

We extract only the outputs at the forward and backward character markers with gather(). This function is very useful for extracting only certain indices from a tensor that are specified in a separate tensor.
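For instance, extracting outputs at marker positions with gather() might look like this (the shapes are illustrative):

```python
import torch

batch, L_c, hidden, L_w = 2, 22, 8, 4
char_rnn_out = torch.randn(batch, L_c, hidden)       # outputs at every character position
markers = torch.randint(0, L_c, (batch, L_w))        # one marker per word (padded to L_w)

# Expand markers to (batch, L_w, hidden) so we can gather along the timestep dimension.
index = markers.unsqueeze(2).expand(batch, L_w, hidden)
word_level_features = char_rnn_out.gather(dim=1, index=index)   # (batch, L_w, hidden)
```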

These extracted outputs are processed by the forward and backward Highway layers before applying a linear layer to compute scores over the vocabulary for predicting the next word at each marker. We do this only during training, since it makes no sense to perform language modeling for multi-task learning during validation or inference. The training attribute of any model is set with model.train() or model.eval() in train.py. (Note that this is primarily used to enable or disable dropout and batch-norm layers in a PyTorch model during training and inference respectively.)

Sequence Labeling Model

See LM_LSTM_CRF in models.py (continued).

We also sort the word sequences by decreasing lengths, because there may not always be a correlation between the lengths of the word sequences and the character sequences.

Remember to also sort all other tensors in the same order.

We concatenate the forward and backward character LSTM outputs at the markers, and run the result through the third Highway layer. This extracts the sub-word information at each word that we will use for sequence labeling.

We concatenate this result with the word embeddings, and compute BLSTM outputs over the packed_sequence.

Upon re-padding with pad_packed_sequence(), we have the features we need to feed to the CRF layer.

Conditional Random Field (CRF)

See CRF in models.py.

You may find this layer surprisingly straightforward considering the value it adds to our model.

A linear layer is used to transform the outputs from the BLSTM to scores for each tag, which are the emission scores.

A single tensor is used to hold the transition scores. This tensor is a Parameter of the model, which means it is updateable during backpropagation, just like the weights of the other layers.

To find the CRF scores, we compute the emission scores at each word and add them to the transition scores, after broadcasting both as described in the CRF Overview.
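A simplified sketch of such a CRF layer (not a copy of the CRF class; names and dimensions are illustrative):

```python
import torch
import torch.nn as nn

class ToyCRF(nn.Module):
    """Emission scores from a linear layer plus a learned transition matrix."""

    def __init__(self, hidden_dim, tagset_size):
        super().__init__()
        self.emission = nn.Linear(hidden_dim, tagset_size)
        # Transition scores are a Parameter, updated by backprop like any other weight.
        self.transition = nn.Parameter(torch.randn(tagset_size, tagset_size))

    def forward(self, blstm_out):
        # blstm_out: (batch, L, hidden_dim)
        emission = self.emission(blstm_out)                       # (batch, L, m)
        # Broadcast: (batch, L, 1, m) + (1, 1, m, m) -> (batch, L, m, m)
        return emission.unsqueeze(2) + self.transition.unsqueeze(0).unsqueeze(0)

scores = ToyCRF(hidden_dim=16, tagset_size=5)(torch.randn(2, 4, 16))
print(scores.shape)    # torch.Size([2, 4, 5, 5])
```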

Viterbi Loss

See ViterbiLoss in models.py.

We established in the Viterbi Loss Overview that we want to minimize the difference between the log-sum-exp of the scores of all possible valid tag sequences and the score of the gold tag sequence, i.e. log-sum-exp(all scores) - gold score.

We sum the CRF scores of each true tag as described earlier to calculate the gold score.

Remember how we encoded tag sequences with their positions in the unrolled CRF scores? We extract the scores at these positions with gather() and eliminate the pads with pack_padded_sequence() before summing.

Finding the log-sum-exp of the scores of all possible sequences is slightly trickier. We use a for loop to iterate over the timesteps. At each timestep, we accumulate scores for each current_tag by –

  • adding the CRF scores at this timestep to the accumulated scores from the previous timestep to find the accumulated score for each current_tag for each previous_tag. We do this at only the effective batch size, i.e. for sequences that haven't completed yet. (Our sequences are still sorted by decreasing word lengths, from the LM-LSTM-CRF model.)
  • for each current_tag, compute the log-sum-exp over the previous_tags to find the new accumulated scores at each current_tag.

After computing over the variable lengths of all sequences, we are left with a tensor of dimensions N, m, where m is the number of (current) tags. These are the log-sum-exp accumulated scores over all possible sequences ending in each of the m tags. However, since valid sequences can only end with the <end> tag, sum over only the <end> column to find the log-sum-exp of the scores of all possible valid sequences.

We find the difference, log-sum-exp(all scores) - gold score.
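Ignoring batching and variable lengths for clarity, the log-sum-exp accumulation could be sketched like this (the <start>/<end> indices below are assumptions for the toy example, not the real tag_map):

```python
import torch

L, m = 4, 5
crf_scores = torch.randn(L, m, m)        # (word, previous_tag, current_tag)
start, end = 0, m - 1                    # assumed indices of the <start> and <end> tags

# At the first word, the previous tag can only be <start>.
acc = crf_scores[0, start, :]            # (m,) accumulated scores per current_tag

for k in range(1, L):
    # acc[i] + crf_scores[k, i, j], then log-sum-exp over previous tags i.
    acc = torch.logsumexp(acc.unsqueeze(1) + crf_scores[k], dim=0)   # (m,)

all_scores = acc[end]                    # valid sequences must end with the <end> tag
print(all_scores)                        # log-sum-exp over all valid tag sequences
```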

Viterbi Decoding

See ViterbiDecoder in inference.py.

This implements the process described in the Viterbi Decoding Overview.

We accumulate scores in a for loop in a manner similar to what we did in ViterbiLoss, except here we find the maximum of the previous_tag scores for each current_tag, instead of computing the log-sum-exp. We also keep track of the previous_tag that corresponds to this maximum score in a backpointer tensor.

We pad the backpointer tensor with <end> tags because this allows us to trace backwards over the pads, eventually arriving at the actual <end> tag, whereupon the actual backtracing begins.
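Here is a compact sketch of the decoding for a single unpadded sequence (again with assumed <start>/<end> indices, and without the batching in ViterbiDecoder):

```python
import torch

L, m = 4, 5
crf_scores = torch.randn(L, m, m)        # (word, previous_tag, current_tag)
start, end = 0, m - 1                    # assumed indices of the <start> and <end> tags

acc = crf_scores[0, start, :]            # first word: the previous tag is <start>
backpointers = []

for k in range(1, L):
    totals = acc.unsqueeze(1) + crf_scores[k]     # (previous_tag, current_tag)
    acc, best_prev = totals.max(dim=0)            # best score and previous tag per current_tag
    backpointers.append(best_prev)

# The last word is the <end> token, so its tag must be <end>; trace backwards from there.
tags = [end]
for best_prev in reversed(backpointers):
    tags.append(best_prev[tags[-1]].item())
tags.reverse()                                    # decoded tag indices, one per word
```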

Training

See train.py.

The parameters for the model (and training it) are at the beginning of the file, so you can easily check or modify them should you wish to.

To train your model from scratch, simply run this file –

python train.py

To resume training at a checkpoint, point to the corresponding file with the checkpoint parameter at the beginning of the code.

Note that we perform validation at the end of every training epoch.

Trimming Batch Inputs

You will notice we trim the inputs at each batch to the maximum sequence lengths in that batch. This is so we don't have more pads in each batch than we actually need.

But why? Although the RNNs in our model don't compute over the pads, the linear layers still do. It's pretty straightforward to change this – see the related question about handling variable length sequences in the FAQs section.

For this tutorial, I figured a little extra computation over some pads was worth the straightforwardness of not having to perform a slew of operations – Highway, CRF, other linear layers, concatenations – on a packed_sequence.

Loss

In the multi-task scenario, we have chosen to sum the Cross Entropy losses from the two language modelling tasks and the Viterbi Loss from the sequence labeling task.

Even though we are minimizing the sum of these losses, we are really only interested in minimizing the Viterbi Loss, which we do by virtue of minimizing the sum. It is the Viterbi Loss that reflects performance on the primary task.

We use pack_padded_sequence() to eliminate pads wherever necessary.
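For instance, a cross-entropy over only the true timesteps could be computed like this (a sketch with made-up shapes, not the exact code in train.py):

```python
import torch
from torch import nn
from torch.nn.utils.rnn import pack_padded_sequence

batch, max_len, vocab = 3, 6, 50
scores = torch.randn(batch, max_len, vocab)        # padded per-timestep word scores
targets = torch.randint(0, vocab, (batch, max_len))
lengths = torch.tensor([6, 4, 2])                  # true lengths, sorted (decreasing)

# .data holds the flattened, pad-free timesteps of both tensors, in the same order.
scores_flat = pack_padded_sequence(scores, lengths, batch_first=True).data
targets_flat = pack_padded_sequence(targets, lengths, batch_first=True).data

loss = nn.CrossEntropyLoss()(scores_flat, targets_flat)
```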

F1 Score

Like in the paper, we use the macro-averaged F1 score as the criterion for early-stopping. Naturally, computing the F1 score requires Viterbi Decoding the CRF scores to generate our optimal tag sequences.

We use pack_padded_sequence() to eliminate pads wherever necessary.

Remarks

I have followed the parameters in the authors' implementation as closely as possible.

I used a batch size of 10 sentences. I employed Stochastic Gradient Descent with momentum. The learning rate was decayed every epoch. I used 100D GloVe pretrained embeddings without fine-tuning.

It took about 80s to train one epoch on a Titan X (Pascal).

The F1 score on the validation set hit 91% around epoch 50, and peaked at 91.6% on epoch 171. I ran it for a total of 200 epochs. This is pretty close to the results in the paper.

Model Checkpoint

You can download this pretrained model here.

FAQs

How do we decide if we need <start> and <end> tokens for a model that uses sequences?

If this seems confusing at first, it will easily resolve itself when you think about the requirements of the model you are planning to train.

For sequence labeling with a CRF, you need the <end> token (or the <start> token; see next question) because of how the CRF scores are structured.

In my other tutorial on image captioning, I used both <start> and <end> tokens. The model needed to start decoding somewhere, and learn to recognize when to stop decoding during inference.

If you're performing text classification, you would need neither.


Can we have the CRF generate current_word -> next_word scores instead of previous_word -> current_word scores?

Yes. In this case you would broadcast the emission scores like L, m, _, and you would have a <start> token in every sentence instead of an <end> token. The correct tag of the <start> token would always be the <start> tag. The "next tag" of the last word would always be the <end> tag.

I think the previous word -> current word convention is slightly better because there are language models in the mix. It fits in quite nicely to be able to predict the <end> token at the last real word, and therefore learn to recognize when a sentence is complete.


Why are we using different vocabularies for the sequence tagger's inputs and language models' outputs?

The language models will learn to predict only those words they have seen during training. It's really unnecessary, and a huge waste of computation and memory, to use a linear-softmax layer with the extra ~400,000 out-of-corpus words from the embedding file that they will never learn to predict.

But we can add these words to the input layer even if the model never sees them during training. This is because we're using pre-trained embeddings at the input. It doesn't need to see them because the meanings of words are encoded in these vectors. If it's encountered a chimpanzee before, it very likely knows what to do with an orangutan.


Is it a good idea to fine-tune the pre-trained word embeddings we use in this model?

I refrain from fine-tuning because most of the input vocabulary is not in-corpus. Most embeddings will remain the same while a few are fine-tuned. If fine-tuning changes these embeddings sufficiently, the model may not work well with the words that weren't fine-tuned. In the real world, we're bound to encounter many words that weren't present in a newspaper corpus from 2003.


What are some ways we can construct dynamic graphs in PyTorch to compute over only the true lengths of sequences?

If you're using an RNN, simply use pack_padded_sequence(). PyTorch will internally compute over only the true lengths. See dynamic_rnn.py for an example.

If you want to execute an operation (like a linear transformation) only on the true timesteps, pack_padded_sequence() is still the way to go. This flattens the tensor by timestep while removing the pads. You can perform your operation on this flattened tensor, and then use pad_packed_sequence() to unflatten it and re-pad it with 0s.

Similarly, if you want to perform an aggregation operation, like computing the loss, use pack_padded_sequence() to eliminate the pads.

If you want to perform timestep-wise operations, you can take a leaf out of how pack_padded_sequence() works, and compute only on the effective batch size at each timestep with a for loop to iterate over the timesteps. We did this in the ViterbiLoss and ViterbiDecoder. I also used an LSTMCell() in this fashion in my image captioning tutorial.


Dunston Checks In? Really?

I had no memory of this movie for twenty years. I was trying to think of a short sentence that would be easier to visualize in this tutorial and it just popped into my mind riding a wave of 90s nostalgia.

I wish I hadn't googled it though. Damn, the critics were harsh, weren't they? This gem was overwhelmingly and universally panned. I'm not sure I'd disagree if I watched it now, but that just goes to show the world is so much more fun when you're a kid.

Didn't have to worry about LM-LSTM-CRFs or nuthin...
