XLNet

From Wikipedia, the free encyclopedia
A large language model developed by Google AI
XLNet
Original author: Google AI
Initial release: 19 June 2019
Repository: https://github.com/zihangdai/xlnet/
Type: Large language model
License: Apache-2.0

XLNet is an autoregressive Transformer language model designed as an improvement over BERT, with 340M parameters and trained on 33 billion words. It was released on 19 June 2019 under the Apache 2.0 license.[1] It achieved state-of-the-art results on a variety of natural language processing tasks, including language modeling, question answering, and natural language inference.

Architecture


The main idea of XLNet is to model language autoregressively, like the GPT models, but over all possible permutations of a sentence's word order.[2] Concretely, consider the following sentence:

My dog is cute.

In standard autoregressive language modeling, the model would be tasked with predicting the probability of each word, conditioned on the previous words as its context:

We factorize the joint probability of a sequence of words \(x_1, \ldots, x_T\) using the chain rule:

\[
\Pr(x_1,\ldots,x_T)=\Pr(x_1)\,\Pr(x_2\mid x_1)\,\Pr(x_3\mid x_1,x_2)\cdots\Pr(x_T\mid x_1,\ldots,x_{T-1}).
\]

For example, the sentence "My dog is cute" is factorized as:

\[
\Pr(\text{My},\text{dog},\text{is},\text{cute})=\Pr(\text{My})\,\Pr(\text{dog}\mid\text{My})\,\Pr(\text{is}\mid\text{My},\text{dog})\,\Pr(\text{cute}\mid\text{My},\text{dog},\text{is}).
\]

Schematically, we can write it as

\[
\texttt{<MASK>}\,\texttt{<MASK>}\,\texttt{<MASK>}\,\texttt{<MASK>}
\;\to\; \text{My}\;\texttt{<MASK>}\,\texttt{<MASK>}\,\texttt{<MASK>}
\;\to\; \text{My dog}\;\texttt{<MASK>}\,\texttt{<MASK>}
\;\to\; \text{My dog is}\;\texttt{<MASK>}
\;\to\; \text{My dog is cute}.
\]
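The chain-rule factorization can be evaluated directly once a model supplies the conditional probabilities. The following toy sketch (purely illustrative; the hard-coded probabilities are made up and would, in practice, come from a Transformer such as GPT or XLNet) sums the log-conditionals for the example sentence:

```python
# Toy sketch of left-to-right (autoregressive) factorization via the chain rule.
# The conditional probabilities below are invented for illustration only.
import math

toy_cond_prob = {
    ("My",): 0.10,                       # Pr(My)
    ("dog", "My"): 0.02,                 # Pr(dog | My)
    ("is", "My", "dog"): 0.30,           # Pr(is | My, dog)
    ("cute", "My", "dog", "is"): 0.05,   # Pr(cute | My, dog, is)
}

def sentence_log_prob(words):
    """Sum of log Pr(x_t | x_1, ..., x_{t-1}) over the sentence."""
    total = 0.0
    for t, word in enumerate(words):
        key = (word, *words[:t])         # current word followed by its left context
        total += math.log(toy_cond_prob[key])
    return total

print(sentence_log_prob(["My", "dog", "is", "cute"]))
```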

However, XLNet requires the model to predict the words in a randomly sampled order. Suppose the sampled order is 3-2-4-1; then, schematically, the model must perform the following prediction task:

\[
\texttt{<MASK>}\,\texttt{<MASK>}\,\texttt{<MASK>}\,\texttt{<MASK>}
\;\to\; \texttt{<MASK>}\,\texttt{<MASK>}\;\text{is}\;\texttt{<MASK>}
\;\to\; \texttt{<MASK>}\;\text{dog is}\;\texttt{<MASK>}
\;\to\; \texttt{<MASK>}\;\text{dog is cute}
\;\to\; \text{My dog is cute}.
\]

By considering all permutations, XLNet is able to capture longer-range dependencies and better model the bidirectional context of words.
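The sampling of such a prediction order can be sketched as follows; permutation_lm_targets is an illustrative helper, not part of the released XLNet codebase:

```python
# Toy sketch of generating permutation-language-modeling targets for one sentence.
import random

def permutation_lm_targets(tokens, seed=0):
    """For one sampled factorization order, yield (visible positions, position to predict)."""
    rng = random.Random(seed)
    order = list(range(len(tokens)))
    rng.shuffle(order)                   # a random factorization order, e.g. [2, 1, 3, 0] (3-2-4-1 in 1-based terms)
    for step, target in enumerate(order):
        visible = sorted(order[:step])   # positions whose content the model may condition on
        yield visible, target

for visible, target in permutation_lm_targets(["My", "dog", "is", "cute"]):
    print(f"predict position {target} given positions {visible}")
```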

Two-Stream Self-Attention


To implement permutation language modeling, XLNet uses a two-stream self-attention mechanism. The two streams are:

  • Content stream: This stream encodes the content of each word, as in standard causally masked self-attention.
  • Query stream: This stream encodes the context available to each word (the words preceding it in the permutation order) without access to the word's own content. In more detail, it is a masked cross-attention mechanism, where the queries come from the query stream and the key-value pairs come from the content stream.

The content stream uses the causal mask

\[
M_{\text{causal}}=\begin{bmatrix}
0 & -\infty & -\infty & \cdots & -\infty \\
0 & 0 & -\infty & \cdots & -\infty \\
0 & 0 & 0 & \cdots & -\infty \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
0 & 0 & 0 & \cdots & 0
\end{bmatrix}
\]

permuted by a random permutation matrix \(P\) to \(P M_{\text{causal}} P^{-1}\).

The query stream uses the cross-attention mask \(P(M_{\text{causal}}-\infty I)P^{-1}\), where the diagonal is subtracted away specifically to prevent the model from "cheating" by reading the current masked token's content from the content stream.
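A minimal numpy sketch of the two masks, for the four-word example and the sampled order 3-2-4-1, is given below. A large negative constant stands in for \(-\infty\):

```python
# Sketch: build the content-stream and query-stream masks for sequence length 4
# and the factorization order 3-2-4-1 (0-based: [2, 1, 3, 0]).
import numpy as np

T = 4
NEG_INF = -1e9  # stand-in for -infinity in an additive attention mask

# Standard causal mask: entry (i, j) is 0 when j <= i, "-inf" otherwise.
M_causal = np.where(np.tril(np.ones((T, T))) == 1, 0.0, NEG_INF)

# Permutation matrix P: column `step` selects the position predicted at that step.
order = [2, 1, 3, 0]
P = np.zeros((T, T))
for step, pos in enumerate(order):
    P[pos, step] = 1.0

# Content-stream mask: P M_causal P^{-1}; permutation matrices are orthogonal, so P^{-1} = P.T.
M_content = P @ M_causal @ P.T

# Query-stream mask: additionally mask the diagonal, i.e. P (M_causal - inf*I) P^{-1},
# so a position cannot peek at its own content.
M_query = M_content + NEG_INF * np.eye(T)

print(M_content)
print(M_query)
```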

Like the causal masking for GPT models, this two-stream masked architecture allows the model to train on all tokens in one forward pass.
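The two streams can be sketched as follows. This is a simplified illustration, not the released implementation: it uses a single attention head, random placeholder weights, and a single "memory" vector standing in for Transformer-XL's cached previous segment (which also gives the first position in the permutation something to attend to); relative positional encodings, layer normalization, and feed-forward sublayers are omitted.

```python
# Sketch of one two-stream self-attention step for permutation language modeling.
import numpy as np

rng = np.random.default_rng(0)
T, d = 4, 8
NEG_INF = -1e9

# Masks for the order 3-2-4-1 (0-based [2, 1, 3, 0]); rank[p] = step at which position p is predicted.
order = np.array([2, 1, 3, 0])
rank = np.argsort(order)
M_content = np.where(rank[None, :] <= rank[:, None], 0.0, NEG_INF)  # may see itself and earlier steps
M_query = M_content + NEG_INF * np.eye(T)                           # may never see its own content

h = rng.normal(size=(T, d))    # content stream: starts from the token embeddings
g = rng.normal(size=(T, d))    # query stream: starts from position information only, not the token
mem = rng.normal(size=(1, d))  # stand-in for Transformer-XL memory, always visible

W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))  # placeholder projection weights

def attend(queries, mask):
    """Scaled dot-product attention; keys and values always come from memory plus the content stream."""
    kv = np.concatenate([mem, h], axis=0)
    full_mask = np.concatenate([np.zeros((T, 1)), mask], axis=1)  # the memory column is never masked
    scores = (queries @ W_q) @ (kv @ W_k).T / np.sqrt(d) + full_mask
    scores -= scores.max(axis=-1, keepdims=True)                  # numerically stable softmax
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ (kv @ W_v)

h_next = attend(h, M_content)  # content-stream update
g_next = attend(g, M_query)    # query-stream update; its output is used to predict each masked token
print(h_next.shape, g_next.shape)
```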

Training


Two models were released:[1][2]

  • XLNet-Large, cased: 340M parameters, 24-layer, 1024-hidden, 16-heads
  • XLNet-Base, cased: 110M parameters, 12-layer, 768-hidden, 12-heads

It was trained on a dataset that amounted to 32.89 billion tokens after tokenization with SentencePiece. The dataset was composed of BooksCorpus, English Wikipedia, Giga5, ClueWeb 2012-B, and Common Crawl.

It was trained on 512 TPU v3 chips for 5.5 days. At the end of training, it still under-fitted the data, meaning it could have achieved lower loss with more training. Training took 500,000 steps with an Adam optimizer, linear learning-rate decay, and a batch size of 8192.[3]
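A sketch of the reported schedule is below; the peak learning rate is a placeholder, since this article does not state it:

```python
# Sketch of the reported optimization setup: Adam, linear learning-rate decay over
# 500,000 steps, batch size 8192. PEAK_LR is a placeholder, not a value from the article.
TOTAL_STEPS = 500_000
BATCH_SIZE = 8_192
PEAK_LR = 1e-4  # placeholder

def linear_decay_lr(step, peak_lr=PEAK_LR, total_steps=TOTAL_STEPS):
    """Learning rate decays linearly from peak_lr at step 0 to 0 at total_steps."""
    return peak_lr * max(0.0, 1.0 - step / total_steps)

for step in (0, 100_000, 250_000, 500_000):
    print(step, linear_decay_lr(step))
```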


References

  1. ^ a b "xlnet". GitHub. Retrieved 2 January 2024.
  2. ^ a b "Pretrained models — transformers 2.0.0 documentation". huggingface.co. Retrieved 5 August 2024.
  3. ^ Yang, Zhilin; Dai, Zihang; Yang, Yiming; Carbonell, Jaime; Salakhutdinov, Ruslan; Le, Quoc V. (2 January 2020). "XLNet: Generalized Autoregressive Pretraining for Language Understanding". arXiv:1906.08237 [cs.CL].