T5 (language model)

From Wikipedia, the free encyclopedia
Series of large language models developed by Google AI
Text-to-Text Transfer Transformer (T5)
Original author(s): Google AI
Initial release: 23 October 2019
Repository: https://github.com/google-research/text-to-text-transfer-transformer
License: Apache-2.0
Website: blog.research.google/2020/02/exploring-transfer-learning-with-t5.html

T5 (Text-to-Text Transfer Transformer) is a series of large language models developed by Google AI, introduced in 2019.[1][2] Like the original Transformer model,[3] T5 models are encoder-decoder Transformers, where the encoder processes the input text, and the decoder generates the output text.

T5 models are usually pretrained on a massive dataset of text and code, after which they can perform text-based tasks similar to the tasks they were pretrained on. They can also be finetuned to perform other tasks.

T5 models have been employed in various applications, including chatbots, machine translation systems, text summarization tools, code generation, and robotics.[4]

Training


The original T5 models are pre-trained on the Colossal Clean Crawled Corpus (C4), containing text and code scraped from the internet. This pre-training process enables the models to learn general language understanding and generation abilities. T5 models can then be fine-tuned on specific downstream tasks, adapting their knowledge to perform well in various applications.

The T5 models were pretrained on many tasks, all in the format of <input text> -> <output text>; a brief usage sketch of this format follows the examples below.

How a T5 can be finetuned for a summarization task.[5]

Some examples are:

  • restoring corrupted text: Thank you <X> me to your party <Y> week. -> <X> for inviting <Y> last <Z>, where <Z> means "end of output", and <X> and <Y> denote blanks to be filled, called "sentinels" in the original report.
  • translation: translate English to German: That is good. -> Das ist gut.
  • judging the grammatical acceptability of a sentence (CoLA sentence): The course is jumping well. -> not acceptable.
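The text-to-text format can be exercised directly with the released checkpoints. The following is a minimal sketch of the translation example above, assuming the Hugging Face Transformers library and the "google-t5/t5-small" checkpoint (both are assumptions for illustration, not part of the original report):

    # Minimal sketch of T5's text-to-text usage (assumes Hugging Face Transformers
    # and the "google-t5/t5-small" checkpoint).
    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

    tokenizer = AutoTokenizer.from_pretrained("google-t5/t5-small")
    model = AutoModelForSeq2SeqLM.from_pretrained("google-t5/t5-small")

    # Every task is phrased as plain text; the task is selected by a prefix in the input.
    inputs = tokenizer("translate English to German: That is good.", return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=20)
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))  # expected: "Das ist gut."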

Architecture

T5 encoder-decoder structure, showing the attention structure. In the encoder self-attention (lower square), all input tokens attend to each other; in the encoder–decoder cross-attention (upper rectangle), each target token attends to all input tokens; in the decoder self-attention (upper triangle), each target token attends only to present and past target tokens (causal).[5]

The T5 series encompasses several models with varying sizes and capabilities, all encoder-decoder Transformers, where the encoder processes the input text, and the decoder generates the output text.
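The three attention patterns described in the figure caption above can be written down as boolean masks. The following is a small illustrative sketch in PyTorch (the sizes and variable names are chosen for illustration, not taken from any T5 implementation):

    import torch

    n_src, n_tgt = 4, 3  # number of input (encoder) and output (decoder) tokens

    # Encoder self-attention: every input token attends to every input token.
    enc_self = torch.ones(n_src, n_src).bool()

    # Decoder self-attention: each output token attends only to itself and to
    # earlier output tokens (causal, lower-triangular mask).
    dec_self = torch.tril(torch.ones(n_tgt, n_tgt)).bool()

    # Encoder-decoder cross-attention: every output token attends to every input token.
    cross = torch.ones(n_tgt, n_src).bool()

    print(enc_self, dec_self, cross, sep="\n\n")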

These models are often distinguished by their parameter count, which indicates the complexity and potential capacity of the model. The original paper[1] reported the following 5 models:

T5 properties[note 1]
Name  | Total parameters | Encoder parameters | Decoder parameters | n_layer | d_model | d_ff  | d_kv | n_head
Small | 76,956,160       | 35,330,816         | 41,625,344         | 6       | 512     | 2048  | 64   | 8
Base  | 247,577,856      | 109,628,544        | 137,949,312        | 12      | 768     | 3072  | 64   | 12
Large | 770,567,168      | 334,939,648        | 435,627,520        | 24      | 1024    | 4096  | 64   | 16
3B    | 2,884,497,408    | 1,240,909,824      | 1,643,587,584      | 24      | 1024    | 16384 | 128  | 32
11B   | 11,340,220,416   | 4,864,791,552      | 6,475,428,864      | 24      | 1024    | 65536 | 128  | 128

* The encoder and the decoder have the same shape. So for example, the T5-small has 6 layers in the encoder and 6 layers in the decoder.

In the above table, n_layer is the number of layers in the encoder (the decoder has the same number), d_model is the dimension of the embedding vectors, d_ff is the inner dimension of the feedforward layers, d_kv is the dimension of the key and value vectors in each attention head, and n_head is the number of attention heads.

Note that, unlike typical Transformers, the 3B and 11B models do not satisfy d_model = d_kv × n_head.[6] For example, the 11B model has d_kv × n_head = 128 × 128 = 16384, which is far larger than its d_model of 1024.

Compared to the original Transformer, T5 uses a few minor modifications: layer normalization with no additive bias, layer normalization placed outside the residual path, and relative positional embedding.[7]

For all experiments, the authors used a WordPiece tokenizer with a vocabulary size of 32,000. The tokenizer is shared across both the input and output of each model. It was trained on a mixture of English, German, French, and Romanian data from the C4 dataset, at a ratio of 10:1:1:1.
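The shared tokenizer, including the sentinel tokens used in the span-corruption examples above, can be inspected from the released checkpoints. A brief sketch, assuming the Hugging Face Transformers library and the "google-t5/t5-small" checkpoint (an assumption for illustration, not part of the original experiments):

    # Sketch of inspecting the shared T5 tokenizer (assumes Hugging Face
    # Transformers and the "google-t5/t5-small" checkpoint).
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("google-t5/t5-small")

    # One tokenizer is shared between input and output text.
    print(tokenizer.vocab_size)  # on the order of 32,000 entries
    print(tokenizer.tokenize("translate English to German: That is good."))

    # Sentinel tokens such as <extra_id_0>, <extra_id_1>, ... play the role of the
    # <X>, <Y>, <Z> blanks in the span-corruption objective.
    print(tokenizer.convert_tokens_to_ids(["<extra_id_0>", "<extra_id_1>"]))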

Variants


Several subsequent models used the T5 architecture, with non-standardized naming conventions used to differentiate them. This section attempts to collect the main ones. An exhaustive list of the variants released by Google Brain is on the GitHub repo for T5X.[8]

Some models are trained from scratch, while others start from a previously trained model. Unless otherwise noted, each model listed below is trained from scratch.

  • T5 small, base, large, 3B, 11B (2019): The original models.[1]
  • T5 1.1 small, base, large, XL, XXL: Improved versions of the original T5 series, with roughly the same parameter counts. The activation function is GEGLU[9] instead of ReLU. The 3B and 11B models were renamed "XL" and "XXL", and their shapes were changed:[8][10][11]
T5 v1.1 properties[note 2]
Name  | Total parameters | Encoder parameters | Decoder parameters | n_layer | d_model | d_ff  | d_kv | n_head
Small | 76,961,152       | 35,332,800         | 41,628,352         | 8       | 512     | 1024  | 64   | 6
Base  | 247,577,856      | 109,628,544        | 137,949,312        | 12      | 768     | 2048  | 64   | 12
Large | 783,150,080      | 341,231,104        | 441,918,976        | 24      | 1024    | 2816  | 64   | 16
XL    | 2,849,757,184    | 1,223,527,424      | 1,626,229,760      | 24      | 2048    | 5120  | 64   | 32
XXL   | 11,135,332,352   | 4,762,310,656      | 6,373,021,696      | 24      | 4096    | 10240 | 64   | 64
  • LM-adapted T5 (2021): a series of models (from small to XXL) that started from checkpoints of the T5 series, but were trained further on 100B additional tokens from C4.[12]
  • Switch Transformer (2021): a mixture-of-experts variant of T5, obtained by replacing the feedforward layers in the encoder and decoder blocks with mixture-of-experts feedforward layers.[13][14]
  • T0 3B, 11B (2021): a series of models that started from checkpoints of LM-adapted T5, and were further trained to perform tasks based only on a task instruction (zero-shot).[15] Different entries in the series use different finetuning data.[16]
  • ByT5 (2021): a byte-level version of T5, trained on the mC4 (multilingual C4) dataset.[17] It operates on text encoded as UTF-8 bytes, without tokenizers.
  • Flan-T5-XL (2022): a model that started with a checkpoint of T5 XL, then instruction-tuned on the FLAN dataset (a usage sketch follows this list).[18][19][20][21]
  • T5X (2022): a JAX-based re-implementation of the original T5 codebase. It is not a model.[22] The original T5 codebase was implemented in TensorFlow with MeshTF.[2]
  • UL2 20B (2022): a model with the same architecture as the T5 series, but scaled up to 20B parameters and trained with a "mixture of denoisers" objective on C4.[23] It was trained by accident, when a training run was left running on a TPU cluster for a month.[24]
  • Flan-UL2 20B (2022): UL2 20B instruction-finetuned on the FLAN dataset.[23][20]
  • Pile-T5 (2024): has the same architecture as T5, except it uses the Llama tokenizer. It was trained on The Pile, and comes in base, large, XL, and XXL sizes.[25]
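As referenced in the Flan-T5 entry above, instruction-tuned variants can follow free-form instructions without task-specific prefixes. The following is a hedged sketch, assuming the Hugging Face Transformers library and the "google/flan-t5-small" checkpoint (the small checkpoint is an assumption chosen for illustration; the variant discussed above is the larger Flan-T5-XL):

    # Zero-shot instruction following with an instruction-tuned T5 variant
    # (assumes Hugging Face Transformers and the "google/flan-t5-small" checkpoint).
    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

    tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-small")
    model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-small")

    # Unlike the original T5, the instruction-tuned model is prompted with a
    # natural-language instruction rather than a fixed task prefix.
    prompt = "Answer the following question. What language is spoken in Germany?"
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=20)
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))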

Applications


The T5 model itself is an encoder-decoder model, allowing it to be used for instruction following. The encoder encodes the instruction, and the decoder autoregressively generates the reply.

The T5 encoder can be used as a text encoder, much like BERT. It encodes text into a sequence of real-valued vectors, which can be used for downstream applications. For example, Google Imagen[26] uses T5-XXL as its text encoder, and the encoded text vectors are used as conditioning for a diffusion model. As another example, the AuraFlow diffusion model[27] uses Pile-T5-XL.
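The encoder-only usage described above can be sketched as follows, assuming the Hugging Face Transformers library and a small checkpoint for illustration (Imagen itself uses the much larger T5-XXL):

    # Using only the T5 encoder as a text encoder (assumes Hugging Face
    # Transformers and the "google-t5/t5-small" checkpoint for illustration).
    import torch
    from transformers import AutoTokenizer, T5EncoderModel

    tokenizer = AutoTokenizer.from_pretrained("google-t5/t5-small")
    encoder = T5EncoderModel.from_pretrained("google-t5/t5-small")

    inputs = tokenizer("A photograph of a corgi riding a bicycle.", return_tensors="pt")
    with torch.no_grad():
        outputs = encoder(**inputs)

    # One real-valued vector per input token; a downstream model (e.g. a diffusion
    # model) can use these vectors as conditioning.
    print(outputs.last_hidden_state.shape)  # (batch_size, sequence_length, d_model)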

References

  1. Raffel, Colin; Shazeer, Noam; Roberts, Adam; Lee, Katherine; Narang, Sharan; Matena, Michael; Zhou, Yanqi; Li, Wei; Liu, Peter J. (2020). "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer". Journal of Machine Learning Research. 21 (140): 1–67. arXiv:1910.10683. ISSN 1533-7928.
  2. google-research/text-to-text-transfer-transformer, Google Research, 2024-08-21. Retrieved 2024-08-21.
  3. Vaswani, Ashish; Shazeer, Noam; Parmar, Niki; Uszkoreit, Jakob; Jones, Llion; Gomez, Aidan N.; Kaiser, Łukasz; Polosukhin, Illia (2017). "Attention is All you Need". Advances in Neural Information Processing Systems. 30. Curran Associates, Inc.
  4. Jiang, Yunfan; Gupta, Agrim; Zhang, Zichen; Wang, Guanzhi; Dou, Yongqiang; Chen, Yanjun; Fei-Fei, Li; Anandkumar, Anima; Zhu, Yuke (2022-10-06). "VIMA: General Robot Manipulation with Multimodal Prompts". arXiv:2210.03094 [cs.RO].
  5. Zhang, Aston; Lipton, Zachary; Li, Mu; Smola, Alexander J. (2024). "11.9. Large-Scale Pretraining with Transformers". Dive into Deep Learning. Cambridge University Press. ISBN 978-1-009-38943-3.
  6. "config.json · google-t5/t5-11b at main". huggingface.co. 2020-04-24. Retrieved 2024-09-17.
  7. Shaw, Peter; Uszkoreit, Jakob; Vaswani, Ashish (2018-04-12). Self-Attention with Relative Position Representations. arXiv:1803.02155.
  8. "t5x/docs/models.md at main · google-research/t5x". GitHub. Retrieved 2024-08-05.
  9. Shazeer, Noam (2020-02-12). GLU Variants Improve Transformer. arXiv:2002.05202.
  10. "config.json · google/t5-v1_1-xl at main". huggingface.co. 2020-11-19. Retrieved 2024-09-17.
  11. "config.json · google/t5-v1_1-xxl at main". huggingface.co. 2020-11-19. Retrieved 2024-09-17.
  12. Lester, Brian; Al-Rfou, Rami; Constant, Noah (2021-09-02). The Power of Scale for Parameter-Efficient Prompt Tuning. arXiv:2104.08691.
  13. Fedus, William; Zoph, Barret; Shazeer, Noam (2022-06-16). Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity. arXiv:2101.03961.
  14. "SwitchTransformers". huggingface.co. Retrieved 2024-08-05.
  15. Sanh, Victor; Webson, Albert; Raffel, Colin; Bach, Stephen H.; Sutawika, Lintang; Alyafeai, Zaid; Chaffin, Antoine; Stiegler, Arnaud; Scao, Teven Le (2022-03-17). Multitask Prompted Training Enables Zero-Shot Task Generalization. arXiv:2110.08207.
  16. "bigscience/T0 · Hugging Face". huggingface.co. 2024-03-04. Retrieved 2024-08-21.
  17. Xue, Linting; Barua, Aditya; Constant, Noah; Al-Rfou, Rami; Narang, Sharan; Kale, Mihir; Roberts, Adam; Raffel, Colin (2022-03-25). "ByT5: Towards a Token-Free Future with Pre-trained Byte-to-Byte Models". Transactions of the Association for Computational Linguistics. 10: 291–306. arXiv:2105.13626. doi:10.1162/tacl_a_00461. ISSN 2307-387X.
  18. Chung, Hyung Won; Hou, Le; Longpre, Shayne; Zoph, Barret; Tay, Yi; Fedus, William; Li, Yunxuan; Wang, Xuezhi; Dehghani, Mostafa; Brahma, Siddhartha; Webson, Albert; Gu, Shixiang Shane; Dai, Zhuyun; Suzgun, Mirac; Chen, Xinyun (2024). "Scaling Instruction-Finetuned Language Models". Journal of Machine Learning Research. 25 (70): 1–53. arXiv:2210.11416. ISSN 1533-7928.
  19. Longpre, Shayne; Hou, Le; Vu, Tu; Webson, Albert; Chung, Hyung Won; Tay, Yi; Zhou, Denny; Le, Quoc V.; Zoph, Barret; Wei, Jason; Roberts, Adam (2023-07-03). "The Flan Collection: Designing Data and Methods for Effective Instruction Tuning". Proceedings of the 40th International Conference on Machine Learning. PMLR: 22631–22648. arXiv:2301.13688.
  20. google-research/FLAN, Google Research, 2024-08-03. Retrieved 2024-08-05.
  21. "google/flan-t5-xl · Hugging Face". huggingface.co. 2024-01-04. Retrieved 2024-08-05.
  22. Roberts, Adam; Chung, Hyung Won; Mishra, Gaurav; Levskaya, Anselm; Bradbury, James; Andor, Daniel; Narang, Sharan; Lester, Brian; Gaffney, Colin; Mohiuddin, Afroz; Hawthorne, Curtis; Lewkowycz, Aitor; Salcianu, Alex; Zee, Marc van; Austin, Jacob (2023). "Scaling Up Models and Data with t5x and seqio". Journal of Machine Learning Research. 24 (377): 1–8. ISSN 1533-7928.
  23. Tay, Yi; Dehghani, Mostafa; Tran, Vinh Q.; Garcia, Xavier; Wei, Jason; Wang, Xuezhi; Chung, Hyung Won; Shakeri, Siamak; Bahri, Dara (2023-02-28). UL2: Unifying Language Learning Paradigms. arXiv:2205.05131.
  24. "Training great LLMs entirely from ground up in the wilderness as a startup". Yi Tay. Retrieved 2024-10-18.
  25. Sutawika, Lintang; Komatsuzaki, Aran; Raffel, Colin (2024-04-15). "Pile-T5". EleutherAI Blog. Retrieved 2024-05-05.
  26. "Imagen: Text-to-Image Diffusion Models". imagen.research.google. Retrieved 2024-08-23.
  27. "AuraFlow". huggingface.co. Retrieved 2024-08-23.

Notes

  1. ^
     # Count total, encoder, and decoder parameters for the original T5 checkpoints
     # (uses the Hugging Face Transformers library).
     import torch
     from transformers import AutoConfig, AutoModelForSeq2SeqLM

     def count_parameters(model):
         enc = sum(p.numel() for p in model.encoder.parameters())
         dec = sum(p.numel() for p in model.decoder.parameters())
         total = enc + dec
         return total, enc, dec

     for name in ["t5-small", "t5-base", "t5-large", "t5-3b", "t5-11b"]:
         print(f"Model: {name}")
         config = AutoConfig.from_pretrained(f"google-t5/{name}")
         torch_dtype = torch.float16
         model = AutoModelForSeq2SeqLM.from_config(config, torch_dtype=torch_dtype)
         total, enc, dec = count_parameters(model)
         print(f"Total number of parameters in {name}: {total}")
         print(f"Total number of parameters in encoder: {enc}")
         print(f"Total number of parameters in decoder: {dec}")
         del model
  2. ^
     # Count total, encoder, and decoder parameters for the T5 v1.1 checkpoints
     # (uses the Hugging Face Transformers library).
     import torch
     from transformers import AutoConfig, AutoModelForSeq2SeqLM

     def count_parameters(model):
         enc = sum(p.numel() for p in model.encoder.parameters())
         dec = sum(p.numel() for p in model.decoder.parameters())
         total = enc + dec
         return total, enc, dec

     for name in ["small", "base", "large", "xl", "xxl"]:
         print(f"Model: {name}")
         config = AutoConfig.from_pretrained(f"google/t5-v1_1-{name}")
         torch_dtype = torch.float16
         model = AutoModelForSeq2SeqLM.from_config(config, torch_dtype=torch_dtype)
         total, enc, dec = count_parameters(model)
         print(f"Total number of parameters in {name}: {total}")
         print(f"Total number of parameters in encoder: {enc}")
         print(f"Total number of parameters in decoder: {dec}")
         del model