| XLNet | |
|---|---|
| Original author | Google AI |
| Initial release | 19 June 2019 |
| Repository | https://github.com/zihangdai/xlnet/ |
| Type | |
| License | Apache-2.0 |
XLNet was an autoregressive Transformer designed as an improvement over BERT, with 340M parameters and trained on about 33 billion tokens. It was released on 19 June 2019 under the Apache 2.0 license.[1] At the time of its release, it achieved state-of-the-art results on a variety of natural language processing tasks, including language modeling, question answering, and natural language inference.
The main idea of XLNet is to model language autoregressively like the GPT models, but to allow for all possible permutations of the order in which the words of a sentence are predicted.[2] Concretely, consider the following sentence:
My dog is cute.
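In standard autoregressive language modeling, the model is tasked with predicting the probability of each word, conditioned on the previous words as its context. The joint probability of a sequence of words $x_1, \dots, x_T$ is factorized using the chain rule:
$$\Pr(x_1, \dots, x_T) = \Pr(x_1) \Pr(x_2 \mid x_1) \Pr(x_3 \mid x_1, x_2) \cdots \Pr(x_T \mid x_1, \dots, x_{T-1})$$
For example, the sentence "My dog is cute" is factorized as:
$$\Pr(\text{My}, \text{dog}, \text{is}, \text{cute}) = \Pr(\text{My}) \Pr(\text{dog} \mid \text{My}) \Pr(\text{is} \mid \text{My}, \text{dog}) \Pr(\text{cute} \mid \text{My}, \text{dog}, \text{is})$$
Schematically, we can write it as
`<MASK> <MASK> <MASK> <MASK>` → `My <MASK> <MASK> <MASK>` → `My dog <MASK> <MASK>` → `My dog is <MASK>` → `My dog is cute`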
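For XLNet, however, the model is required to predict the words in a randomly generated order. Suppose the sampled order is 3241, meaning the third word is predicted first, then the second, the fourth, and finally the first. Schematically, the model is required to perform the following prediction task:
`<MASK> <MASK> <MASK> <MASK>` → `<MASK> <MASK> is <MASK>` → `<MASK> dog is <MASK>` → `<MASK> dog is cute` → `My dog is cute`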
By considering all permutations, XLNet is able to capture longer-range dependencies and better model the bidirectional context of words.
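The prediction tasks described above can be enumerated with a short script. This is only an illustrative sketch, not code from the XLNet repository: the helper name `prediction_steps` and the 1-based order convention are assumptions made for this example. For each step it lists the word to be predicted and the words already revealed as context.

```python
from typing import List, Sequence, Tuple


def prediction_steps(
    tokens: Sequence[str], order: Sequence[int]
) -> List[Tuple[str, List[str]]]:
    """Return (target, context) pairs for a 1-based factorization order.

    `prediction_steps` is a hypothetical helper for illustration only.
    The context lists words in the order they were revealed; the real
    model additionally knows each word's original position.
    """
    if sorted(order) != list(range(1, len(tokens) + 1)):
        raise ValueError("order must be a permutation of 1..len(tokens)")
    steps: List[Tuple[str, List[str]]] = []
    revealed: List[int] = []
    for position in order:
        context = [tokens[p - 1] for p in revealed]
        steps.append((tokens[position - 1], context))
        revealed.append(position)
    return steps


tokens = ["My", "dog", "is", "cute"]
left_to_right = prediction_steps(tokens, [1, 2, 3, 4])  # standard order
permuted = prediction_steps(tokens, [3, 2, 4, 1])       # sampled order 3241

for target, context in permuted:
    print(f"predict {target!r} given {context}")
```

With the order 3241, the first step predicts "is" with no context and the last step predicts "My" with the other three words as context, so over many sampled orders each word is eventually predicted from context on both of its sides.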
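To implement permutation language modeling, XLNet uses a two-stream self-attention mechanism. The two streams are the content stream, which encodes each token together with its context, and the query stream, which encodes the context and the position of the token to be predicted, but not the token itself.
The content stream uses the causal mask $M_{\text{causal}}$ permuted by a random permutation matrix $P$ to $P M_{\text{causal}} P^{-1}$.
The query stream uses the cross-attention mask $P (M_{\text{causal}} - I) P^{-1}$, where the diagonal is subtracted away specifically to prevent the model from "cheating" by looking at the content stream to see what the current masked token is.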
Like the causal masking for GPT models, this two-stream masked architecture allows the model to train on all tokens in one forward pass.
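The two masks can be written out explicitly for a small example. The sketch below is a minimal illustration in plain Python rather than the released implementation. The function name `two_stream_masks` and the convention that entry `[i][j] == 1` means "position i+1 may attend to position j+1" are assumptions chosen for this example.

```python
from typing import List, Sequence, Tuple

Mask = List[List[int]]


def two_stream_masks(order: Sequence[int]) -> Tuple[Mask, Mask]:
    """Build content and query masks for a 1-based factorization order.

    Hypothetical helper for illustration. Rows and columns are indexed by
    original sentence position, not by position in the sampled order.
    """
    n = len(order)
    # rank[p] = step at which original position p is predicted
    rank = {position: step for step, position in enumerate(order)}
    positions = range(1, n + 1)
    # Content stream: a token sees itself and everything earlier in the order.
    content = [[int(rank[j] <= rank[i]) for j in positions] for i in positions]
    # Query stream: same as content, but with the diagonal removed,
    # so a token cannot see its own content.
    query = [[int(rank[j] < rank[i]) for j in positions] for i in positions]
    return content, query


content_mask, query_mask = two_stream_masks([3, 2, 4, 1])
for row in content_mask:
    print(row)
```

For the identity order 1234 the content mask reduces to the usual lower-triangular causal mask. For any other order it is that same mask with rows and columns permuted, and the query mask differs from it only on the diagonal.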
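Two models were released:[1][2] XLNet-Large, Cased (24 layers, 1024 hidden units, 16 attention heads) and XLNet-Base, Cased (12 layers, 768 hidden units, 12 attention heads).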
XLNet was trained on a dataset that amounted to 32.89 billion tokens after tokenization with SentencePiece. The dataset was composed of BooksCorpus, English Wikipedia, Giga5, ClueWeb 2012-B, and Common Crawl.
Training ran on 512 TPU v3 chips for 5.5 days. At the end of training, the model still under-fitted the data, meaning it could have achieved lower loss with more training. Training took 500,000 steps with an Adam optimizer, linear learning rate decay, and a batch size of 8192.[3]
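The training figures above imply some rough totals, which the following back-of-the-envelope calculation makes explicit. The sequence length of 512 tokens is an assumption that is not stated in this article, so the resulting number of passes over the data should be read as a rough upper estimate rather than a reported figure.

```python
# Figures taken from the text above.
chips = 512
days = 5.5
steps = 500_000
batch_size = 8192
dataset_tokens = 32.89e9

# Assumption (not stated in the article): each sequence holds 512 tokens.
assumed_sequence_length = 512

chip_days = chips * days                      # total TPU v3 chip-days
sequences_seen = steps * batch_size           # sequences processed in training
tokens_processed = sequences_seen * assumed_sequence_length
approx_passes = tokens_processed / dataset_tokens

print(f"{chip_days:.0f} chip-days")
print(f"{sequences_seen:,} sequences")
print(f"about {approx_passes:.0f} passes over the data (under the assumption)")
```

Under these assumptions the model would have seen each token of the corpus several dozen times.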