Go to the T5X ReadTheDocs Documentation Page.
T5X is a modular, composable, research-friendly framework for high-performance, configurable, self-service training, evaluation, and inference of sequence models (starting with language) at many scales.
It is essentially a new and improved implementation of the T5 codebase (based on Mesh TensorFlow) in JAX and Flax. To learn more, see the T5X Paper.
Below is a quick start guide for training models with TPUs on Google Cloud. For additional tutorials and background, see the complete documentation.
T5X can be run with XManager on Vertex AI. Vertex AI is a platform for training that creates TPU instances and runs code on the TPUs. Vertex AI will also shut down the TPUs when the jobs terminate. This is significantly easier than managing GCE VM and TPU VM instances.
Follow the pre-requisites and directions to install XManager.
Request TPU quota as required. GCP projects come with 8 cores by default, which is enough to run one training experiment on a single TPU host. If you want to run multi-host training or run multiple trials in parallel, you will need more quota. Navigate to Quotas.
The quota you want is:
- Service: `Vertex AI API`
- Dimensions (location): `us-central1`
- If you want to run single-host experiments:
  - `Custom model training TPU V2 cores per region`
  - `Custom model training TPU V3 cores per region`
- If you want to run multi-host experiments:
  - `Custom model training TPU V2 pod cores per region`
  - `Custom model training TPU V3 pod cores per region`

TIP: You won't be able to run single-host experiments with multi-host quota (i.e., you can't run `tpu_v2=8` using `TPU V2 pod` quota).
- Launch the xmanager script located at `t5x/scripts/xm_launch.py`.
As a running example, we use the WMT14 En-De translation, which is described in more detail in the Examples section below.
```sh
export GOOGLE_CLOUD_BUCKET_NAME=...
export TFDS_DATA_DIR=gs://$GOOGLE_CLOUD_BUCKET_NAME/t5x/data
export MODEL_DIR=gs://$GOOGLE_CLOUD_BUCKET_NAME/t5x/$(date +%Y%m%d)

# Pre-download dataset in multi-host experiments.
tfds build wmt_t2t_translate --data_dir=$TFDS_DATA_DIR

git clone https://github.com/google-research/t5x
cd ./t5x/

python3 ./t5x/scripts/xm_launch.py \
  --gin_file=t5x/examples/t5/t5_1_1/examples/base_wmt_from_scratch.gin \
  --model_dir=$MODEL_DIR \
  --tfds_data_dir=$TFDS_DATA_DIR
```
Check `gs://$GOOGLE_CLOUD_BUCKET_NAME/t5x/` for the output artifacts, which can be read by TensorBoard.
Note: NVIDIA has released an updated version of this repository with H100 FP8 support and broad GPU performance improvements. Please visit the NVIDIA Rosetta repository for more details and usage instructions.
T5X can be run easily on GPUs either in single-node configurations or multi-node configurations with a SLURM+pyxis cluster. Further instructions are at t5x/contrib/gpu. The `t5x/contrib/gpu/scripts_gpu` folder contains example scripts for pretraining T5X on The Pile and for finetuning on SQuAD and MNLI. These scripts and the associated `gin` configurations also contain additional GPU optimizations for better throughput. More examples and instructions can be found in the NVIDIA Rosetta repository, maintained by NVIDIA, with H100 FP8 support and broad GPU performance improvements.
Note that all the commands in this document should be run in the command line of the TPU VM instance unless otherwise stated.
Follow the instructions to set up a Google Cloud Platform (GCP) account and enable the Cloud TPU API.
Note: T5X also works with GPU; please follow the instructions in t5x/contrib/gpu if you'd like to use the GPU version.
Create a Cloud TPU VM instance following this instruction. We recommend that you develop your workflow on a single v3-8 TPU (i.e., `--accelerator-type=v3-8`) and scale up to pod slices once the pipeline is ready. In this README, we focus on using a single v3-8 TPU. See here to learn more about TPU architectures.

With Cloud TPU VMs, you ssh directly into the host machine of the TPU VM. You can install packages, run your code, etc. on the host machine. Once the TPU instance is created, ssh into it with
```sh
gcloud alpha compute tpus tpu-vm ssh ${TPU_NAME} --zone=${ZONE}
```
where `TPU_NAME` and `ZONE` are the name and the zone used in step 2.

Install T5X and the dependencies.
```sh
git clone --branch=main https://github.com/google-research/t5x
cd t5x

python3 -m pip install -e '.[tpu]' -f \
  https://storage.googleapis.com/jax-releases/libtpu_releases.html
```
Create a Google Cloud Storage (GCS) bucket to store the dataset and model checkpoints. To create a GCS bucket, see these instructions.
(optional) If you prefer working with a Jupyter/Colab style environment, you can set up a custom Colab runtime by following the steps from t5x/notebooks.
As a running example, we use the WMT14 En-De translation. The raw dataset is available in TensorFlow Datasets as "wmt_t2t_translate".
T5 casts a translation example such as the following
```py
{'en': 'That is good.', 'de': 'Das ist gut.'}
```
to the form called "text-to-text":
```py
{'inputs': 'translate English to German: That is good.', 'targets': 'Das ist gut.'}
```
This formulation allows many different classes of language tasks to be expressed in a uniform manner, and a single encoder-decoder architecture can handle them without any task-specific parameters. For more detail, refer to the T5 paper (Raffel et al. 2019).
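To make the cast concrete, here is a minimal sketch (not one of the quickstart commands) that applies the `t5.data.preprocessors.translate` preprocessor, the same one used in the custom task example later in this README, to a toy in-memory example. It assumes the `t5` and `tensorflow` packages are installed.

```py
# Sketch only: a one-element tf.data.Dataset stands in for the real TFDS split.
import tensorflow as tf
from t5.data import preprocessors

raw = tf.data.Dataset.from_tensors(
    {'en': 'That is good.', 'de': 'Das ist gut.'})

# translate() maps {'en': ..., 'de': ...} to the text-to-text form shown above.
text_to_text = preprocessors.translate(
    raw, source_language='en', target_language='de')

for example in text_to_text.as_numpy_iterator():
  print(example)  # expect 'inputs'/'targets' keys as in the example above
```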
For a scalable data pipeline and an evaluation framework, we use SeqIO, which was factored out of the T5 library. A `seqio.Task` packages together the raw dataset, the vocabulary, preprocessing such as tokenization, and evaluation metrics such as BLEU, and provides a `tf.data` instance.
The T5 library provides a number of `seqio.Task`s that were used in the T5 paper. In this example, we use wmt_t2t_ende_v003.
Before training or fine-tuning, you need to download the ["wmt_t2t_translate"](https://www.tensorflow.org/datasets/catalog/wmt_t2t_translate) dataset.
```sh
# Data dir to save the processed dataset in "gs://data_dir" format.
TFDS_DATA_DIR="..."

# Make sure that dataset package is up-to-date.
python3 -m pip install --upgrade tfds-nightly

# Pre-download dataset.
tfds build wmt_t2t_translate --data_dir=${TFDS_DATA_DIR}
```
To run a training job, we use the `t5x/train.py` script.
```sh
# Model dir to save logs, ckpts, etc. in "gs://model_dir" format.
MODEL_DIR="..."
T5X_DIR="..."  # directory where the T5X repo is cloned.
TFDS_DATA_DIR="..."

python3 ${T5X_DIR}/t5x/train.py \
  --gin_file="t5x/examples/t5/t5_1_1/examples/base_wmt_from_scratch.gin" \
  --gin.MODEL_DIR=\"${MODEL_DIR}\" \
  --tfds_data_dir=${TFDS_DATA_DIR}
```
The configuration for this training run is defined in the Gin file base_wmt_from_scratch.gin. Gin-config is a library to handle configurations based on dependency injection. Among many benefits, Gin allows users to pass custom components such as a custom model to the T5X library without having to modify the core library. The custom components section shows how this is done.
While the core library is independent of Gin, it is central to the examples we provide. Therefore, we provide a short introduction to Gin in the context of T5X. All the configurations are written to a file "config.gin" in `MODEL_DIR`. This makes debugging as well as reproducing the experiment much easier.
In addition to the config.gin, a model-info.txt file summarizes the model parameters (shape, names of the axes, partitioning info) as well as the optimizer states.
To monitor the training in TensorBoard, it is much easier (due to authentication issues) to launch TensorBoard on your own machine and not in the TPU VM. So, in the command line on your own machine (the one from which you ssh'ed into the TPU VM), launch TensorBoard with the `logdir` pointing to the `MODEL_DIR`.
```sh
# NB: run this on your machine, not the TPU VM!
MODEL_DIR="..."  # Copy from the TPU VM.
tensorboard --logdir=${MODEL_DIR}
```
Or you can launch the TensorBoard inside a Colab. In a Colab cell, run
```py
from google.colab import auth
auth.authenticate_user()
```
to authorize the Colab to access the GCS bucket and launch the TensorBoard.
```py
%load_ext tensorboard
model_dir = "..."  # Copy from the TPU VM.
%tensorboard --logdir=model_dir
```
We can leverage the benefits of self-supervised pre-training by initializing from one of our pre-trained models. Here we use the T5.1.1 Base checkpoint.
```sh
# Model dir to save logs, ckpts, etc. in "gs://model_dir" format.
MODEL_DIR="..."

# Data dir to save the processed dataset in "gs://data_dir" format.
TFDS_DATA_DIR="..."
T5X_DIR="..."  # directory where the T5X repo is cloned.

python3 ${T5X_DIR}/t5x/train.py \
  --gin_file="t5x/examples/t5/t5_1_1/examples/base_wmt_finetune.gin" \
  --gin.MODEL_DIR=\"${MODEL_DIR}\" \
  --tfds_data_dir=${TFDS_DATA_DIR}
```
Note: when supplying a string, dict, list, tuple value, or a bash variable via a flag, you must put it in quotes. In the case of strings, it requires escaped quotes (`\"<string>\"`). For example: `--gin.utils.DatasetConfig.split=\"validation\"` or `--gin.MODEL_DIR=\"${MODEL_DIR}\"`.
Gin makes it easy to change a number of configurations. For example, you can change `partitioning.PjitPartitioner.num_partitions` (overriding the value in base_wmt_from_scratch.gin) to change the parallelism strategy and pass it as a command-line arg.
```sh
--gin.partitioning.PjitPartitioner.num_partitions=8
```
To run the offline (i.e., without training) evaluation, you can use the `t5x/eval.py` script.
```sh
EVAL_OUTPUT_DIR="..."  # directory to write eval output
T5X_DIR="..."  # directory where the t5x is cloned, e.g., ${HOME}/t5x.
TFDS_DATA_DIR="..."
CHECKPOINT_PATH="..."

python3 ${T5X_DIR}/t5x/eval.py \
  --gin_file="t5x/examples/t5/t5_1_1/examples/base_wmt_eval.gin" \
  --gin.CHECKPOINT_PATH=\"${CHECKPOINT_PATH}\" \
  --gin.EVAL_OUTPUT_DIR=\"${EVAL_OUTPUT_DIR}\" \
  --tfds_data_dir=${TFDS_DATA_DIR}
```
To run inference, you can use the `t5x/infer.py` script. Here we use the same `seqio.Task`, but for inference we do not use the targets features other than logging them alongside the predictions in a JSON file.
```sh
INFER_OUTPUT_DIR="..."  # directory to write infer output
T5X_DIR="..."  # directory where the t5x is cloned, e.g., ${HOME}/t5x.
TFDS_DATA_DIR="..."
CHECKPOINT_PATH="..."

python3 ${T5X_DIR}/t5x/infer.py \
  --gin_file="t5x/examples/t5/t5_1_1/examples/base_wmt_infer.gin" \
  --gin.CHECKPOINT_PATH=\"${CHECKPOINT_PATH}\" \
  --gin.INFER_OUTPUT_DIR=\"${INFER_OUTPUT_DIR}\" \
  --tfds_data_dir=${TFDS_DATA_DIR}
```
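To peek at the predictions afterwards, something like the sketch below can be used. The exact file names and sharding under `INFER_OUTPUT_DIR` depend on the task and run, so the glob pattern here is only an assumption; list the directory to see what was actually written.

```py
import json
import os
import tensorflow as tf

INFER_OUTPUT_DIR = 'gs://...'  # the same value passed via --gin.INFER_OUTPUT_DIR

# Assumed layout: newline-delimited JSON records with inputs and predictions.
for path in tf.io.gfile.glob(os.path.join(INFER_OUTPUT_DIR, '*.json*')):
  with tf.io.gfile.GFile(path) as f:
    for i, line in enumerate(f):
      if i >= 3:  # peek at a few records only
        break
      print(json.loads(line).keys())
```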
A pretrained model can be exported as a TensorFlow SavedModel and deployed to the Vertex AI Prediction service using the [Optimized TensorFlow Runtime](https://cloud.google.com/vertex-ai/docs/predictions/optimized-tensorflow-runtime). Please note that the exported model won't work on the OSS-based TensorFlow Model Server.
```sh
T5X_DIR="..."  # directory where the t5x is cloned, e.g., ${HOME}/t5x.
CHECKPOINT_PATH="..."

BATCH_SIZE=None
BEAM_SIZE=1

# Use 'bfloat16' if you plan to run exported model on NVIDIA A100 or newer GPUs,
# for other GPUs use 'float32'.
ACTIVATION_DTYPE=bfloat16

# Version numbers must be numeric. We generate one based on datetime.
VERSION=$(date +%Y%m%d%H%M%S)

NAME=t5x_base_${ACTIVATION_DTYPE}  # Model name.

# Path to export model to. Note that export script is going to add _cpu suffix
# after model name.
OUTPUT=${CHECKPOINT_PATH}/saved_model.${NAME}/${VERSION}

declare -a ARGS=(
  --gin_file=t5x/examples/t5/t5_1_1/base.gin
  --gin_file=t5x/t5x/configs/runs/export.gin
  --gin.TASK_FEATURE_LENGTHS="{'inputs': 256, 'targets': 256}"
  --gin.CHECKPOINT_PATH=\"${CHECKPOINT_PATH}\"
  --gin.MODEL_NAME=\"/ml/${USER}/t5x_base\"
  --gin.MODEL_OUTPUT_DIR=\"${OUTPUT}\"
  --gin.BEAM_SIZE=${BEAM_SIZE}
  --gin.BATCH_SIZE=${BATCH_SIZE}
  --gin.export_lib.save.partitioner=None
  --gin.export_lib.save.warmup_examples="['hello world']"
  --gin.export_lib.ExportableModule.use_batch_function=False
  --gin.export_lib.ExportableModule.use_gpu=False
  --gin.export_lib.ExportableModule.jit_compile=False
  --gin.ACTIVATION_DTYPE=\"${ACTIVATION_DTYPE}\"
  --gin.network.T5Config.dtype=\"${ACTIVATION_DTYPE}\"
  --gin.utils.RestoreCheckpointConfig.dtype=\"${ACTIVATION_DTYPE}\"
  --gin.DROPOUT_RATE=0.0
)

(python3 ${T5X_DIR}/t5x/export.py "${ARGS[@]}")
```
For a detailed definition of the arguments, refer to [export.gin](t5x/configs/runs/export.gin).
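As a quick sanity check on the export, you can load the SavedModel back with stock TensorFlow. The path below is hypothetical (the export script may append a suffix such as `_cpu` to the model name, as noted in the comment above), so adjust it to whatever `export.py` actually wrote.

```py
import tensorflow as tf

# Hypothetical path; substitute the directory produced by export.py.
saved_model_dir = 'gs://.../saved_model.t5x_base_bfloat16_cpu/20240101000000'

loaded = tf.saved_model.load(saved_model_dir)
print(list(loaded.signatures.keys()))  # serving signature(s) exposed by the export
```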
You can run XL and smaller models on NVIDIA A100 40GB, and XXL models on NVIDIA A100 80GB.
The translation example uses the encoder-decoder model that T5X provides as well as the dataset from the T5 library. This section shows how you can use your own dataset and model and pass them via Gin.
For this example, we have the following directory structure, with `${HOME}/dir1/user_dir` representing a user directory with custom components.
```
${HOME}
└── dir1
    └── user_dir
        ├── t5_1_1_base_de_en.gin
        └── tasks.py
```
As an example, let's define a new dataset. Here we use the same translation dataset, but we define the translation task in the opposite direction, i.e., German to English instead of English to German. We define this task in `tasks.py`.
```py
# ${HOME}/dir1/user_dir/tasks.py

import functools
import seqio
import tensorflow_datasets as tfds
from t5.evaluation import metrics
from t5.data import preprocessors

vocabulary = seqio.SentencePieceVocabulary(
    'gs://t5-data/vocabs/cc_all.32000/sentencepiece.model', extra_ids=100)
output_features = {
    'inputs': seqio.Feature(vocabulary=vocabulary),
    'targets': seqio.Feature(vocabulary=vocabulary)
}

seqio.TaskRegistry.add(
    'wmt_t2t_de_en_v003',
    source=seqio.TfdsDataSource(tfds_name='wmt_t2t_translate/de-en:1.0.0'),
    preprocessors=[
        functools.partial(
            preprocessors.translate,
            source_language='de',
            target_language='en'),
        seqio.preprocessors.tokenize,
        seqio.CacheDatasetPlaceholder(),
        seqio.preprocessors.append_eos_after_trim,
    ],
    metric_fns=[metrics.bleu],
    output_features=output_features)
```
In the Gin file, most of the settings are equivalent to those used in the En->De example, so we include the Gin file from that example. To use the "wmt_t2t_de_en_v003" task we just defined, we need to import the task module "tasks.py". Note that we use a relative path defined with respect to the user directory. This will be specified as a flag.
```gin
# ${HOME}/dir1/user_dir/t5_1_1_base_de_en.gin
from __gin__ import dynamic_registration
import tasks  # This imports the task defined in dir1/user_dir/tasks.py.

include "t5x-tmp/t5x/examples/t5/t5_1_1/examples/base_wmt_from_scratch.gin"

MIXTURE_OR_TASK_NAME = "wmt_t2t_de_en_v003"
```
Finally, we launch training, passing the user directory via the flag `gin_search_paths` so that the Gin file and Python modules can be specified with relative paths.
```sh
PROJECT_DIR=${HOME}"/dir1/user_dir"
T5X_DIR="..."  # directory where the t5x is cloned.
TFDS_DATA_DIR="..."
MODEL_DIR="..."
export PYTHONPATH=${PROJECT_DIR}

python3 ${T5X_DIR}/t5x/train.py \
  --gin_search_paths=${PROJECT_DIR} \
  --gin_file="t5_1_1_base_de_en.gin" \
  --gin.MODEL_DIR=\"${MODEL_DIR}\" \
  --tfds_data_dir=${TFDS_DATA_DIR}
```
We have released the checkpoints of many of the original T5 models and their variants in a native T5X format for maximal efficiency. See the complete list, including the matching Gin configuration files.
These are converted from the public Mesh TensorFlow checkpoints.
The Mesh TensorFlow checkpoints trained using the T5 library can be directly loaded into T5X. For example, we can rerun the fine-tuning example initializing from the MTF checkpoint by changing the `INIT_CHECKPOINT` Gin macro.
```sh
# Model dir to save logs, ckpts, etc. in "gs://model_dir" format.
MODEL_DIR="..."

# Data dir to save the processed dataset in "gs://data_dir" format.
TFDS_DATA_DIR="..."
T5X_DIR="..."  # directory where the T5X repo is cloned.

python3 ${T5X_DIR}/t5x/train.py \
  --gin_file="t5x/examples/t5/t5_1_1/examples/base_wmt19_ende_train.gin" \
  --gin.MODEL_DIR=\"${MODEL_DIR}\" \
  --gin.MIXTURE_OR_TASK_NAME=\"wmt_t2t_ende_v003\" \
  --gin.INIT_CHECKPOINT=\"gs://t5-data/pretrained_models/t5.1.1.base/model.ckpt-1000000\" \
  --tfds_data_dir=${TFDS_DATA_DIR}
```
Note that restoring directly from the Mesh TensorFlow checkpoints can be inefficient if heavy model parallelism is used for large models. This is because each host loads the entire copy of the model first and then keeps only the relevant slices dictated by the model parallelism specification. If you have Mesh TensorFlow checkpoints that you use often, we recommend converting them to the T5X native format using the convert_tf_checkpoint script.
Please use the following bibtex entry to cite T5X.
```bibtex
@article{roberts2022t5x,
  url = {https://arxiv.org/abs/2203.17189},
  author = {Roberts, Adam and Chung, Hyung Won and Levskaya, Anselm and Mishra, Gaurav and Bradbury, James and Andor, Daniel and Narang, Sharan and Lester, Brian and Gaffney, Colin and Mohiuddin, Afroz and Hawthorne, Curtis and Lewkowycz, Aitor and Salcianu, Alex and van Zee, Marc and Austin, Jacob and Goodman, Sebastian and Soares, Livio Baldini and Hu, Haitang and Tsvyashchenko, Sasha and Chowdhery, Aakanksha and Bastings, Jasmijn and Bulian, Jannis and Garcia, Xavier and Ni, Jianmo and Chen, Andrew and Kenealy, Kathleen and Clark, Jonathan H. and Lee, Stephan and Garrette, Dan and Lee-Thorp, James and Raffel, Colin and Shazeer, Noam and Ritter, Marvin and Bosma, Maarten and Passos, Alexandre and Maitin-Shepard, Jeremy and Fiedel, Noah and Omernick, Mark and Saeta, Brennan and Sepassi, Ryan and Spiridonov, Alexander and Newlan, Joshua and Gesmundo, Andrea},
  title = {Scaling Up Models and Data with $\texttt{t5x}$ and $\texttt{seqio}$},
  journal = {arXiv preprint arXiv:2203.17189},
  year = {2022},
}
```
This is not an officially supported Google product.