
Attention Is All You Need

From Wikipedia, the free encyclopedia
2017 research paper by Google

Attention Is All You Need
An illustration of the main components of the transformer model from the paper
Project type: Artificial intelligence research
Sponsors: Google
Objective: Provide a novel approach to train AI
Duration: 2017
Website: proceedings.neurips.cc/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html

"Attention Is All You Need"[1] is a 2017 research paper inmachine learning authored by eight scientists working atGoogle. The paper introduced a newdeep learning architecture known as thetransformer, based on theattention mechanism proposed in 2014 by Bahdanauet al.[2] The transformer approach it describes has become the main architecture of a wide variety of AI, such aslarge language models.[3][4] At the time, the focus of the research was on improvingSeq2seq techniques formachine translation, but the authors go further in the paper, foreseeing the technique's potential for other tasks likequestion answering and what is now known asmultimodalgenerative AI.[1]

Some early examples that the team tried their Transformer architecture on included English-to-German translation, generating Wikipedia articles on "The Transformer", and parsing. These convinced the team that the Transformer is a general-purpose language model, and not just good for translation.[5]

As of 2025, the paper has been cited more than 173,000 times, placing it among the top ten most-cited papers of the 21st century.[6] After the paper was published by Google, each of the authors left the company to join other companies or to found startups.

Background


The authors of the paper are Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan Gomez, Łukasz Kaiser, and Illia Polosukhin. All eight authors were "equal contributors" to the paper; the listed order was randomized (according to the paper itself). After the paper, each of the authors left Google to join other companies or to found startups.[7][8]

The paper's title is a reference to the song "All You Need Is Love" by the Beatles.[9] The name "Transformer" was picked because Jakob Uszkoreit, one of the paper's authors, liked the sound of that word.[5] An early design document was titled "Transformers: Iterative Self-Attention and Processing for Various Tasks", and included an illustration of six characters from the Transformers franchise. The team was named Team Transformer.[9]

Methods discussed and introduced


The paper is best known for introducing the Transformer architecture, which underlies most modern large language models (LLMs). A key reason the architecture is preferred for modern LLMs is that it is more parallelizable than its predecessors: the operations needed for training can be accelerated on a GPU, allowing both faster training and larger models.

The paper introduced the following mechanisms as part of the development of the transformer architecture.

Scaled dot-product attention & self-attention

The use of the scaled dot-product attention and self-attention mechanism instead of a recurrent neural network or long short-term memory (which rely on recurrence instead) allows for better performance, as described below. The paper described the scaled dot-product attention as follows:

$\mathrm{Attention}(Q,K,V) := \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_{k}}}\right)V$

where $Q$, $K$, $V$ are respectively the query, key, and value matrices, and $d_{k}$ is the dimension of the keys.

Since the model relies on query (Q), key (K), and value (V) matrices that come from the same source (i.e., the input sequence or context window), recurrence is no longer needed, which makes the architecture fully parallelizable. This differs from the original form of the attention mechanism introduced in 2014. Additionally, the paper discusses a scaling factor based on the dimension of the key vectors (represented as $d_{k}$ and initially set to 64 within the paper), applied in the manner shown above, which was found to be most effective.

In the specific context of translation, which the paper focused on, the decoder's cross-attention takes its query matrix from the target-language (decoder) side, while the key and value matrices come from the source-language (encoder) side.
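As a concrete illustration of the formula above, the following is a minimal NumPy sketch of scaled dot-product attention (not the paper's original code); the matrices, their shapes, and the random values are hypothetical, with the key dimension of 64 matching the value used in the paper.

    import numpy as np

    def scaled_dot_product_attention(Q, K, V):
        """Compute softmax(Q K^T / sqrt(d_k)) V for 2-D query/key/value matrices."""
        d_k = K.shape[-1]                        # dimension of the keys (64 in the paper)
        scores = Q @ K.T / np.sqrt(d_k)          # similarity of every query to every key
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))  # numerically stable softmax
        weights /= weights.sum(axis=-1, keepdims=True)
        return weights @ V                       # weighted sum of the value vectors

    # Hypothetical input: 10 tokens with d_k = d_v = 64
    rng = np.random.default_rng(0)
    Q, K, V = (rng.normal(size=(10, 64)) for _ in range(3))
    output = scaled_dot_product_attention(Q, K, V)   # shape (10, 64)

Because every query attends to every key in a single matrix product, the whole computation reduces to dense matrix operations, which is what makes it straightforward to parallelize on a GPU.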

Multi-head attention

In the self-attention mechanism, queries (Q), keys (K), and values (V) are dynamically generated for each input sequence (typically limited by the size of the context window), allowing the model to focus on different parts of the input sequence at different steps. Multi-head attention enhances this process by introducing multiple parallel attention heads. Each attention head learns different linear projections of the Q, K, and V matrices. This allows the model to capture different aspects of the relationships between words in the sequence simultaneously, rather than focusing on a single aspect.

By doing this, multi-head attention ensures that the input embeddings are updated from a more varied and diverse set of perspectives. After the attention outputs from all heads are calculated, they are concatenated and passed through a final linear transformation to generate the output.
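The following is a minimal NumPy sketch of this multi-head scheme, assuming the paper's base configuration of 8 heads and a model dimension of 512; the projection matrices are randomly initialized here purely for illustration, whereas in a real model they are learned.

    import numpy as np

    def attention(Q, K, V):
        """softmax(Q K^T / sqrt(d_k)) V, as in the scaled dot-product sketch above."""
        scores = Q @ K.T / np.sqrt(K.shape[-1])
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        return (w / w.sum(axis=-1, keepdims=True)) @ V

    def multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads=8):
        """Project X into per-head Q/K/V slices, attend in each head, concatenate, project."""
        d_head = X.shape[-1] // num_heads
        Q, K, V = X @ W_q, X @ W_k, X @ W_v                  # (seq_len, d_model) each
        heads = []
        for h in range(num_heads):
            cols = slice(h * d_head, (h + 1) * d_head)       # this head's slice of the projections
            heads.append(attention(Q[:, cols], K[:, cols], V[:, cols]))
        return np.concatenate(heads, axis=-1) @ W_o          # final linear transformation

    # Hypothetical, randomly initialized projections (learned in a real model)
    d_model, seq_len = 512, 10
    rng = np.random.default_rng(0)
    W_q, W_k, W_v, W_o = (rng.normal(scale=d_model ** -0.5, size=(d_model, d_model)) for _ in range(4))
    X = rng.normal(size=(seq_len, d_model))
    Y = multi_head_attention(X, W_q, W_k, W_v, W_o)          # shape (10, 512)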

Positional encoding

Since the Transformer does not rely on recurrence or convolution over the text to perform encoding and decoding, the paper uses sine and cosine functions to encode the position of each token into its embedding. The encoding introduced in the paper is as follows:

$PE_{(\mathrm{pos},\,2i)} = \sin\left(\mathrm{pos}/10000^{2i/d_{\mathrm{model}}}\right)$

$PE_{(\mathrm{pos},\,2i+1)} = \cos\left(\mathrm{pos}/10000^{2i/d_{\mathrm{model}}}\right)$

where $\mathrm{pos}$, $i$, and $d_{\mathrm{model}}$ correspond to the position of the word, the current dimension index, and the dimension of the model, respectively. The sine function is used for even indices of the embedding and the cosine function for odd indices. The resulting $PE$ vector is then added to the embedding of the word at that position in the current context window. The paper explains why this method was chosen:

"We chose the sinusoidal version because it may allow the model to extrapolate to sequence lengths longer than the ones encountered during training."[1]

Historical context

Main articles: Transformer (deep learning architecture) § History, and Seq2seq § History
See also: Timeline of machine learning

Predecessors


For many years, sequence modelling and generation was done by using plain recurrent neural networks (RNNs). A well-cited early example was the Elman network (1990). In theory, the information from one token can propagate arbitrarily far down the sequence, but in practice the vanishing-gradient problem leaves the model's state at the end of a long sentence without precise, extractable information about preceding tokens.

A key breakthrough was LSTM (1995),[note 1] an RNN which used various innovations to overcome the vanishing gradient problem, allowing efficient learning of long-sequence modelling. One key innovation was the use of an attention mechanism which used neurons that multiply the outputs of other neurons, so-called multiplicative units.[10] Neural networks using multiplicative units were later called sigma-pi networks[11] or higher-order networks.[12] LSTM became the standard architecture for long sequence modelling until the 2017 publication of transformers. However, LSTM still used sequential processing, like most other RNNs.[note 2] Specifically, RNNs operate one token at a time from first to last; they cannot operate in parallel over all tokens in a sequence.

Modern transformers overcome this problem, but unlike RNNs, they require computation time that is quadratic in the size of the context window. The linearly scaling fast weight controller (1992) learns to compute a weight matrix for further processing depending on the input.[13] One of its two networks has "fast weights" or "dynamic links" (1981).[14][15][16] A slow neural network learns by gradient descent to generate keys and values for computing the weight changes of the fast neural network which computes answers to queries.[13] This was later shown to be equivalent to the unnormalized linear transformer.[17][18]

Attention with seq2seq

Main article: Seq2seq § History

The idea of encoder–decoder sequence transduction had been developed in the early 2010s; two concurrently published papers from 2014 are commonly cited as the origin of seq2seq.[19][20]

A 380M-parameter model for machine translation used two long short-term memories (LSTMs).[20] Its architecture consists of two parts. The encoder is an LSTM that takes in a sequence of tokens and turns it into a vector. The decoder is another LSTM that converts the vector into a sequence of tokens. Similarly, another 130M-parameter model used gated recurrent units (GRUs) instead of LSTMs.[19] Later research showed that GRUs are neither better nor worse than LSTMs for seq2seq.[21][22]

These early seq2seq models had no attention mechanism, and the state vector was accessible only after the last word of the source text had been processed. Although in theory such a vector retains the information about the whole original sentence, in practice the information is poorly preserved. This is because the input is processed sequentially by one recurrent network into a fixed-size output vector, which is then processed by another recurrent network into an output. If the input is long, the output vector cannot contain all relevant information, degrading the output. As evidence, reversing the input sentence improved seq2seq translation.[23]

The RNNsearch model introduced an attention mechanism into seq2seq for machine translation to solve the bottleneck problem (of the fixed-size output vector), allowing the model to process long-distance dependencies more easily. The name comes from the fact that it "emulates searching through a source sentence during decoding a translation".[2]

Global attention (as in RNNsearch) and local (sliding-window) attention model architectures for machine translation were later compared, finding that mixed attention had higher quality than global attention, while local attention reduced translation time.[24]

In 2016, Google Translate was revamped to Google Neural Machine Translation, which replaced the previous model based on statistical machine translation. The new model was a seq2seq model where the encoder and the decoder were both 8 layers of bidirectional LSTM.[25] It took nine months to develop, and it outperformed the statistical approach, which took ten years to develop.[26]

Parallelizing attention

Main article: Attention (machine learning) § History

Seq2seq models with attention (including self-attention) still suffered from the same issue as recurrent networks, which is that they are hard to parallelize, which prevented them from being accelerated on GPUs. In 2016, decomposable attention applied a self-attention mechanism to feedforward networks, which are easy to parallelize, and achieved SOTA results in textual entailment with an order of magnitude fewer parameters than LSTMs.[27] One of its authors, Jakob Uszkoreit, suspected that attention without recurrence would be sufficient for language translation, hence the title "attention is all you need".[28] That hypothesis was against conventional wisdom at the time, and even his father Hans Uszkoreit, a well-known computational linguist, was skeptical.[28] In the same year, self-attention (called intra-attention or intra-sentence attention) was proposed for LSTMs.[29]

In 2017, the original (100M-sized) encoder–decoder transformer model was proposed in the "Attention is all you need" paper. At the time, the focus of the research was on improving seq2seq for machine translation, by removing its recurrence to process all tokens in parallel while preserving its dot-product attention mechanism to keep its text-processing performance.[1] This led to the introduction of a multi-head attention model that was easier to parallelize due to the use of independent heads and the lack of recurrence. Its parallelizability was an important factor in its widespread use in large neural networks.[30]

AI boom era


As early as spring 2017, even before the "Attention is all you need" preprint was published, one of the co-authors applied the "decoder-only" variation of the architecture to generate fictitious Wikipedia articles.[31] The transformer architecture is now used in many generative models that contribute to the ongoing AI boom.

In language modelling, ELMo (2018) was a bi-directional LSTM that produces contextualized word embeddings, improving upon the line of research from bag of words and word2vec. It was followed by BERT (2018), an encoder-only transformer model.[32] In October 2019, Google started using BERT to process search queries.[33] In 2020, Google Translate replaced the previous RNN-encoder–RNN-decoder model by a transformer-encoder–RNN-decoder model.[34]

Starting in 2018, the OpenAI GPT series of decoder-only transformers became state of the art in natural language generation. In 2022, a chatbot based on GPT-3, ChatGPT, became unexpectedly[35] popular, triggering a boom around large language models.[36][37]

Since 2020, transformers have been applied in modalities beyond text, including the vision transformer,[38] speech recognition,[39] robotics,[40] and multimodal learning.[41] The vision transformer, in turn, stimulated new developments in convolutional neural networks.[42] Image and video generators like DALL-E (2021), Stable Diffusion 3 (2024),[43] and Sora (2024) use transformers to analyse input data (like text prompts) by breaking it down into "tokens" and then calculating the relevance between each token using self-attention, which helps the model understand the context and relationships within the data.

Training


While the primary focus of the paper at the time was to improve machine translation, the paper also discussed applying the architecture to English constituency parsing, with both limited and large-sized datasets, achieving high scores without task-specific tuning, indicating the promising nature of the model for a wide variety of general-purpose seq2seq tasks.

  • Dataset - The English-to-German translation model was trained on the 2014 WMT (Workshop on Statistical Machine Translation) English-German dataset, consisting of nearly 4.5 million sentences derived from TED Talks and high-quality news articles. A separate translation model was trained on the much larger 2014 WMT English-French dataset, consisting of 36 million sentences. Both datasets were encoded with byte-pair encoding.
  • Hardware - The models were trained using 8 NVIDIA P100 GPUs. The base models were trained for 100,000 steps, and the big models were trained for 300,000 steps - each step taking about 0.4 seconds to complete for the base models and 1.0 seconds for the big models. The base model was trained for a total of 12 hours, and the big model was trained for a total of 3.5 days. Both the base and big models outperform the 2017 state-of-the-art in both English-German and English-French, while achieving the comparatively lowest training cost.[1]
  • Hyperparameters and regularization - For their 100M-parameter Transformer model, the authors increased the learning rate linearly for the first 4000 (warmup) steps and decreased it proportionally to the inverse square root of the current step number (a sketch of this schedule follows this list). Dropout layers were applied to the output of each sub-layer before normalization, the sums of the embeddings, and the positional encodings. The dropout rate was set to 0.1. Label smoothing was applied with a value of 0.1, which "improves accuracy and BLEU score".[1]
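The warmup-then-decay behaviour described in the hyperparameters item corresponds to the schedule lrate = d_model^(-0.5) · min(step^(-0.5), step · warmup_steps^(-1.5)) given in the paper; the following is a minimal Python sketch of that schedule, with the function name chosen here only for illustration.

    def transformer_learning_rate(step, d_model=512, warmup_steps=4000):
        """Linear warmup for the first warmup_steps, then inverse-square-root decay."""
        step = max(step, 1)                           # avoid division by zero at step 0
        return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

    # The rate rises linearly to its peak at step 4000, then decays
    peak = transformer_learning_rate(4000)            # ~7.0e-4
    later = transformer_learning_rate(100_000)        # ~1.4e-4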

Notes

  1. ^ Gated recurrent units (2014) further reduced its complexity.
  2. ^ Some architectures, such as RWKV or state space models, avoid the issue.

References

  1. ^ a b c d e f Vaswani, Ashish; Shazeer, Noam; Parmar, Niki; Uszkoreit, Jakob; Jones, Llion; Gomez, Aidan N.; Kaiser, Łukasz; Polosukhin, Illia (December 2017). "Attention is All you Need" (PDF). In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan and R. Garnett (eds.). 31st Conference on Neural Information Processing Systems (NIPS). Advances in Neural Information Processing Systems. Vol. 30. Curran Associates, Inc. arXiv:1706.03762.
  2. ^ a b Bahdanau, Dzmitry; Cho, Kyunghyun; Bengio, Yoshua (19 May 2016). "Neural Machine Translation by Jointly Learning to Align and Translate". arXiv:1409.0473 [cs.CL].
  3. ^ Shinde, Gitanjali; Wasatkar, Namrata; Mahalle, Parikshit (6 June 2024). Data-Centric Artificial Intelligence for Multidisciplinary Applications. CRC Press. p. 75. ISBN 9781040031131.
  4. ^ Murgia, Madhumita (23 July 2023). "Transformers: the Google scientists who pioneered an AI revolution". Financial Times. Archived from the original on 28 December 2023. Retrieved 22 March 2024.
  5. ^ a b Marche, Stephen (23 August 2024). "Was Linguistic A.I. Created by Accident?". The New Yorker. ISSN 0028-792X. Retrieved 24 August 2024.
  6. ^ Pearson, Helen; Ledford, Heidi; Hutson, Matthew; Van Noorden, Richard (15 April 2025). "Exclusive: the most-cited papers of the twenty-first century". Nature. 640 (8059): 588–592. Bibcode:2025Natur.640..588P. doi:10.1038/d41586-025-01125-9. PMID 40234577. Retrieved 18 April 2025.
  7. ^ Murgia, Madhumita (23 July 2023). "Transformers: the Google scientists who pioneered an AI revolution". Financial Times. Retrieved 22 March 2025.
  8. ^ "Meet the $4 Billion AI Superstars That Google Lost". Bloomberg. 13 July 2023 – via www.bloomberg.com.
  9. ^ a b Levy, Steven. "8 Google Employees Invented Modern AI. Here's the Inside Story". Wired. ISSN 1059-1028. Retrieved 20 March 2024.
  10. ^ Feldman, J. A.; Ballard, D. H. (1 July 1982). "Connectionist models and their properties". Cognitive Science. 6 (3): 205–254. doi:10.1016/S0364-0213(82)80001-3. ISSN 0364-0213.
  11. ^ Rumelhart, David E.; McClelland, James L.; Hinton, Geoffrey E. (29 July 1987). Parallel Distributed Processing, Volume 1: Explorations in the Microstructure of Cognition: Foundations, Chapter 2 (PDF). Cambridge, Mass: Bradford Books. ISBN 978-0-262-68053-0.
  12. ^ Giles, C. Lee; Maxwell, Tom (1 December 1987). "Learning, invariance, and generalization in high-order neural networks". Applied Optics. 26 (23): 4972–4978. doi:10.1364/AO.26.004972. ISSN 0003-6935. PMID 20523475.
  13. ^ a b Schmidhuber, Jürgen (1992). "Learning to control fast-weight memories: an alternative to recurrent nets" (PDF). Neural Computation. 4 (1): 131–139. doi:10.1162/neco.1992.4.1.131. S2CID 16683347.
  14. ^ Christoph von der Malsburg: The correlation theory of brain function. Internal Report 81-2, MPI Biophysical Chemistry, 1981. http://cogprints.org/1380/1/vdM_correlation.pdf See Reprint in Models of Neural Networks II, chapter 2, pages 95–119. Springer, Berlin, 1994.
  15. ^ Jerome A. Feldman, "Dynamic connections in neural networks," Biological Cybernetics, vol. 46, no. 1, pp. 27–39, Dec. 1982.
  16. ^ Hinton, Geoffrey E.; Plaut, David C. (1987). "Using Fast Weights to Deblur Old Memories". Proceedings of the Annual Meeting of the Cognitive Science Society. 9.
  17. ^ Katharopoulos, Angelos; Vyas, Apoorv; Pappas, Nikolaos; Fleuret, François (2020). "Transformers are RNNs: Fast autoregressive Transformers with linear attention". ICML 2020. PMLR. pp. 5156–5165.
  18. ^ Schlag, Imanol; Irie, Kazuki; Schmidhuber, Jürgen (2021). "Linear Transformers Are Secretly Fast Weight Programmers". ICML 2021. Springer. pp. 9355–9366.
  19. ^ a b Cho, Kyunghyun; van Merriënboer, Bart; Gulcehre, Caglar; Bahdanau, Dzmitry; Bougares, Fethi; Schwenk, Holger; Bengio, Yoshua (October 2014). "Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation". In Moschitti, Alessandro; Pang, Bo; Daelemans, Walter (eds.). Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Doha, Qatar: Association for Computational Linguistics. pp. 1724–1734. arXiv:1406.1078. doi:10.3115/v1/D14-1179.
  20. ^ a b Sutskever, Ilya; Vinyals, Oriol; Le, Quoc Viet (14 December 2014). "Sequence to sequence learning with neural networks". arXiv:1409.3215 [cs.CL]. [first version posted to arXiv on 10 Sep 2014]
  21. ^ Chung, Junyoung; Gulcehre, Caglar; Cho, KyungHyun; Bengio, Yoshua (2014). "Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling". arXiv:1412.3555 [cs.NE].
  22. ^ Gruber, N.; Jockisch, A. (2020), "Are GRU cells more specific and LSTM cells more sensitive in motive classification of text?", Frontiers in Artificial Intelligence, 3: 40, doi:10.3389/frai.2020.00040, PMC 7861254, PMID 33733157, S2CID 220252321
  23. ^ Sutskever, Ilya; Vinyals, Oriol; Le, Quoc V (2014). "Sequence to Sequence Learning with Neural Networks". Advances in Neural Information Processing Systems. 27. Curran Associates, Inc. arXiv:1409.3215.
  24. ^ Luong, Minh-Thang; Pham, Hieu; Manning, Christopher D. (2015). "Effective Approaches to Attention-based Neural Machine Translation". arXiv:1508.04025 [cs.CL].
  25. ^ Wu, Yonghui; et al. (1 September 2016). "Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation". arXiv:1609.08144 [cs.CL].
  26. ^ Lewis-Kraus, Gideon (14 December 2016). "The Great A.I. Awakening". The New York Times. ISSN 0362-4331. Archived from the original on 24 May 2023. Retrieved 22 June 2023.
  27. ^ Parikh, Ankur P.; Täckström, Oscar; Das, Dipanjan; Uszkoreit, Jakob (25 September 2016). "A Decomposable Attention Model for Natural Language Inference". arXiv:1606.01933 [cs.CL].
  28. ^ a b Levy, Steven. "8 Google Employees Invented Modern AI. Here's the Inside Story". Wired. ISSN 1059-1028. Archived from the original on 20 March 2024. Retrieved 6 August 2024.
  29. ^ Cheng, Jianpeng; Dong, Li; Lapata, Mirella (November 2016). "Long Short-Term Memory-Networks for Machine Reading". In Su, Jian; Duh, Kevin; Carreras, Xavier (eds.). Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. Austin, Texas: Association for Computational Linguistics. pp. 551–561. doi:10.18653/v1/D16-1053.
  30. ^ Peng, Bo; Alcaide, Eric; Anthony, Quentin; Albalak, Alon; Arcadinho, Samuel; Biderman, Stella; Cao, Huanqi; Cheng, Xin; Chung, Michael (10 December 2023), RWKV: Reinventing RNNs for the Transformer Era, arXiv:2305.13048
  31. ^ Marche, Stephen (23 August 2024). "Was Linguistic A.I. Created by Accident?". The New Yorker. ISSN 0028-792X. Retrieved 27 August 2024.
  32. ^ Devlin, Jacob; Chang, Ming-Wei; Lee, Kenton; Toutanova, Kristina (11 October 2018). "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding". arXiv:1810.04805v2 [cs.CL].
  33. ^ "Google: BERT now used on almost every English query". Search Engine Land. 15 October 2020. Retrieved 24 November 2020.
  34. ^ Caswell, Isaac; Liang, Bowen (8 June 2020). "Recent Advances in Google Translate". Google Research. Archived from the original on 4 July 2024. Retrieved 7 August 2024.
  35. ^ "The inside story of how ChatGPT was built from the people who made it". MIT Technology Review. Retrieved 6 August 2024.
  36. ^ "Improving language understanding with unsupervised learning". openai.com. 11 June 2018. Archived from the original on 18 March 2023. Retrieved 18 March 2023.
  37. ^ finetune-transformer-lm, OpenAI, 11 June 2018, retrieved 1 May 2023
  38. ^ Dosovitskiy, Alexey; Beyer, Lucas; Kolesnikov, Alexander; Weissenborn, Dirk; Zhai, Xiaohua; Unterthiner, Thomas; Dehghani, Mostafa; Minderer, Matthias; Heigold, Georg; Gelly, Sylvain; Uszkoreit, Jakob (3 June 2021). "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale". arXiv:2010.11929 [cs.CV].
  39. ^ Gulati, Anmol; Qin, James; Chiu, Chung-Cheng; Parmar, Niki; Zhang, Yu; Yu, Jiahui; Han, Wei; Wang, Shibo; Zhang, Zhengdong; Wu, Yonghui; Pang, Ruoming (2020). "Conformer: Convolution-augmented Transformer for Speech Recognition". arXiv:2005.08100 [eess.AS].
  40. ^ Chen, Lili; Lu, Kevin; Rajeswaran, Aravind; Lee, Kimin; Grover, Aditya; Laskin, Michael; Abbeel, Pieter; Srinivas, Aravind; Mordatch, Igor (24 June 2021), Decision Transformer: Reinforcement Learning via Sequence Modeling, arXiv:2106.01345
  41. ^ Choromanski, Krzysztof; Likhosherstov, Valerii; Dohan, David; Song, Xingyou; Gane, Andreea; Sarlos, Tamas; Hawkins, Peter; Davis, Jared; Mohiuddin, Afroz (19 November 2022), Rethinking Attention with Performers, arXiv:2009.14794
  42. ^ Liu, Zhuang; Mao, Hanzi; Wu, Chao-Yuan; Feichtenhofer, Christoph; Darrell, Trevor; Xie, Saining (2022). A ConvNet for the 2020s. Conference on Computer Vision and Pattern Recognition (CVPR). pp. 11976–11986.
  43. ^ Esser, Patrick; Kulal, Sumith; Blattmann, Andreas; Entezari, Rahim; Müller, Jonas; Saini, Harry; Levi, Yam; Lorenz, Dominik; Sauer, Axel (5 March 2024), Scaling Rectified Flow Transformers for High-Resolution Image Synthesis, arXiv:2403.03206
