
Multinode Fine-Tuning of Stable Diffusion XL on AMD GPUs with Hugging Face Accelerate and OCI’s Kubernetes Engine (OKE)#

October 15, 2024 by Douglas Jia.
3 min read. | 682 total words.

As the scale and complexity of generative AI and deep learning models grow, multinode training, which divides a training job across multiple processors, has become an essential strategy for speeding up the training and fine-tuning of large generative AI models such as SDXL. By distributing the training workload across multiple GPUs on multiple nodes, a multinode setup can significantly accelerate the training process. In this blog post we will show you, step by step, how to set up and fine-tune a Stable Diffusion XL (SDXL) model on a multinode Oracle Cloud Infrastructure (OCI) Kubernetes Engine (OKE) cluster of AMD GPUs using ROCm.

Getting things ready: OCI OKE, RoCE, SDXL and Hugging Face Accelerate#

This blog post will guide you through the setup and fine-tuning of an SDXL model on a multinode cluster with 8 GPUs on each node. A working configuration file is provided in this blog's GitHub repository, along with instructions on how to customize the setup to meet your specific workload requirements.

Oracle Kubernetes Engine (OKE) provides a robust and flexible platform for deploying, managing, and scaling containerized applications in the Oracle Cloud. It allows users to easily orchestrate multinode setups, making it an ideal environment for training large models on multiple AMD processors.

A critical aspect of our setup involves using RDMA over Converged Ethernet (RoCE) for internode communication. RoCE offers significant benefits over traditional Ethernet connections: it reduces latency and increases data transfer speeds between nodes, improving overall performance in distributed training scenarios.

Stable Diffusion XL (SDXL) is a generative AI model designed for high-quality image synthesis from text prompts. Fine-tuning this model involves adapting it to a specific task or dataset, tailoring its image styles to match those of the dataset.

In this post we will use the lambdalabs/naruto-blip-captions dataset from Hugging Face. This dataset contains images originally obtained from Narutopedia, which were captioned using the pre-trained BLIP model. Hugging Face Accelerate is used in this process to simplify and optimize training.

Keep in mind that while this blog provides an example for setting up and fine-tuning the SDXL model in a multinode environment on OCI, a similar approach can be used for setting up and fine-tuning other generative AI models.

Note: This blog focuses on setting up multinode fine-tuning, which can be easily adapted for pre-training, rather than on performance studies of multinode setups.

Implementation#

We implemented the multinode fine-tuning of SDXL on an OCI cluster with multiple nodes. Each node contains 8 AMD MI300X GPUs, and you can adjust the number of nodes based on your available resources in the scripts we will walk you through in the following section.

Your host machine will interact with the OKE cluster using kubectl commands. To ensure that kubectl interacts with the correct Kubernetes cluster, set the KUBECONFIG environment variable to point to the location of the configuration file (be sure to update the path to point to your specific config file):

```shell
# Please update the path to your own config file.
export KUBECONFIG=<your_own_config_file>
```

Because Weights & Biases (wandb) will be used to track the fine-tuning progress and a Hugging Face dataset will be used for fine-tuning, you will need to generate an OKE "secret" using a wandb API key and a Hugging Face token. An OKE secret is a Kubernetes object used to securely store and manage sensitive information such as passwords, tokens, and SSH keys. An OKE secret allows confidential data to be passed to your pods and containers securely.
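As a side note, a Kubernetes secret stores its values base64-encoded rather than encrypted by default. The sketch below mirrors locally what `kubectl create secret generic` does to a literal value; the token string is a made-up placeholder, not a real credential:

```shell
# Illustration only: the token value here is a hypothetical placeholder.
token='hf_example_token'

# Kubernetes stores the value base64-encoded in the secret's data field.
encoded=$(printf '%s' "$token" | base64)
echo "$encoded"

# Decoding recovers the original value; inside the cluster you would see
# the same with:
#   kubectl get secret hf-secret -o jsonpath='{.data.HF_TOKEN}' | base64 --decode
printf '%s' "$encoded" | base64 --decode
```

This is why base64-encoded secret values should still be treated as sensitive: encoding is not encryption.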

```shell
# Create a secret for the wandb API key
kubectl create secret generic wandb-secret --from-literal=WANDB_API_KEY=<Your_wandb_API_key>

# Create a secret for the Hugging Face token
kubectl create secret generic hf-secret --from-literal=HF_TOKEN=<Your_Hugging_Face_token>
```

Create a Kubernetes ConfigMap by downloading the Hugging Face Accelerate configuration file, default_config_accelerate.yaml, from this blog's GitHub repository src folder to your working directory on your host machine and running the command below:

```shell
kubectl create configmap accelerate-config --from-file=default_config_accelerate.yaml
```

A Kubernetes ConfigMap stores configuration information in key-value pairs. The command above will create a ConfigMap that contains the default_config_accelerate.yaml file. This will allow your pods to access and use the Hugging Face Accelerate configuration.
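When a ConfigMap is mounted as a volume (the Job definition in this blog mounts it at /etc/config), each key surfaces as a regular file inside the container. A minimal local stand-in for what the container sees, using a temporary directory in place of the volume mount (the directory and file contents here are illustrative):

```shell
# Local stand-in for a ConfigMap volume mount (paths are illustrative).
demo_dir=$(mktemp -d)

# The ConfigMap key becomes a file named after the key:
cat > "$demo_dir/default_config_accelerate.yaml" <<'EOF'
# your Accelerate settings would appear here
EOF

# Inside the pod, the training script reads it exactly like any file:
ls "$demo_dir"
```

Because the mount is just a file, the training command can point Accelerate at it with `--config_file /etc/config/default_config_accelerate.yaml` without baking the configuration into the container image.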

Download accelerate_blog_multinode.yaml from this blog's GitHub repository src folder to your host machine and adjust the paths in the file to align with your actual file system.

Start the fine-tuning process by running the command below:

```shell
kubectl apply -f accelerate_blog_multinode.yaml
```

If everything is set up correctly, you’ll see the following output:

```
service/sdxl-headless-svc created
configmap/sdxl-finetune-multinode-config created
job.batch/sdxl-finetune-multinode created
```

You can then monitor the fine-tuning progress via the wandb dashboard or other supported reporting and logging platforms.

The accelerate_blog_multinode.yaml file is organized into three sections, Service, ConfigMap, and Job, separated by dashes (---). Breaking the file into these three sections reveals how Kubernetes orchestrates the different components necessary for the multinode fine-tuning of SDXL. This modular approach provides flexibility and simplifies the management of complex workloads, especially in a multinode environment.

  1. The Service section specifies how the application running within the pods should be exposed as a service both within the cluster and externally. This includes the service name and type, and the ports the service will use for communication. This setup ensures robust inter-node communication within the cluster.

    Service section of the yaml file:

    ```yaml
    apiVersion: v1
    kind: Service
    metadata:
      name: sdxl-headless-svc
    spec:
      clusterIP: None
      ports:
      - port: 12342
        protocol: TCP
        targetPort: 12342
      selector:
        job-name: sdxl-finetune-multinode
    ```
  2. The ConfigMap section provides additional key-value pairs for inter-node communication. Storing these settings in this section makes it easier to manage and update them without altering the container images or pod specifications.

    ConfigMap section of the yaml file:

    ```yaml
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: sdxl-finetune-multinode-config
    data:
      headless_svc: sdxl-headless-svc
      job_name: sdxl-finetune-multinode
      master_addr: sdxl-finetune-multinode-0.sdxl-headless-svc
      master_port: '12342'
      num_replicas: '3'
    ```
  3. The Job section provides details about how the fine-tuning process runs. This includes information such as which container image to use, which resources to allocate to the job, which command to run within the container, and how the ConfigMap for the Accelerate settings should be mounted into the container.

    Job section of the yaml file:

    ```yaml
    apiVersion: batch/v1
    kind: Job
    metadata:
      name: sdxl-finetune-multinode
    spec:
      backoffLimit: 0
      completions: 3
      parallelism: 3
      completionMode: Indexed
      template:
        metadata:
          labels:
            job: sdxl-multinode-job
        spec:
          hostNetwork: true
          dnsPolicy: ClusterFirstWithHostNet
          containers:
          - name: accelerate-sdxl
            image: rocm/pytorch:rocm6.2_ubuntu20.04_py3.9_pytorch_release_2.1.2
            securityContext:
              privileged: true
              capabilities:
                add: ["IPC_LOCK"]
            env:
            - name: HIP_VISIBLE_DEVICES
              value: "0,1,2,3,4,5,6,7"
            - name: HIP_FORCE_DEV_KERNARG
              value: "1"
            - name: GPU_MAX_HW_QUEUES
              value: "2"
            - name: USE_ROCMLINEAR
              value: "1"
            - name: NCCL_SOCKET_IFNAME
              value: "rdma0"
            - name: MASTER_ADDRESS
              valueFrom:
                configMapKeyRef:
                  key: master_addr
                  name: sdxl-finetune-multinode-config
            - name: MASTER_PORT
              valueFrom:
                configMapKeyRef:
                  key: master_port
                  name: sdxl-finetune-multinode-config
            - name: NCCL_IB_HCA
              value: "mlx5_0,mlx5_2,mlx5_3,mlx5_4,mlx5_5,mlx5_7,mlx5_8,mlx5_9"
            - name: HEADLESS_SVC
              valueFrom:
                configMapKeyRef:
                  key: headless_svc
                  name: sdxl-finetune-multinode-config
            - name: NNODES
              valueFrom:
                configMapKeyRef:
                  key: num_replicas
                  name: sdxl-finetune-multinode-config
            - name: NODE_RANK
              valueFrom:
                fieldRef:
                  fieldPath: metadata.annotations['batch.kubernetes.io/job-completion-index']
            - name: WANDB_API_KEY
              valueFrom:
                secretKeyRef:
                  name: wandb-secret
                  key: WANDB_API_KEY
            - name: HF_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-secret
                  key: HF_TOKEN
            volumeMounts:
            - mountPath: /mnt
              name: model-weights-volume
            - mountPath: /etc/config
              name: diffusers-config-volume
            - { mountPath: /dev/infiniband, name: devinf }
            - { mountPath: /dev/shm, name: shm }
            resources:
              requests:
                amd.com/gpu: 8
              limits:
                amd.com/gpu: 8
            command: ["/bin/bash", "-c", "--"]
            args:
            - |
              # Clone the GitHub repo
              git clone --recurse https://github.com/ROCm/bitsandbytes.git
              cd bitsandbytes
              git checkout rocm_enabled
              # Install dependencies
              pip install -r requirements-dev.txt
              # Use -DBNB_ROCM_ARCH to specify target GPU arch
              cmake -DBNB_ROCM_ARCH="gfx942" -DCOMPUTE_BACKEND=hip -S .
              make
              pip install .
              cd ..
              # Set up Hugging Face authentication using the secret
              mkdir -p ~/.huggingface
              echo $HF_TOKEN > ~/.huggingface/token
              pip install deepspeed==0.14.5 wandb
              git clone https://github.com/huggingface/diffusers && cd diffusers && pip install -e . && cd examples/text_to_image && pip install -r requirements_sdxl.txt
              export EXP_DIR=./output
              mkdir -p output
              LOG_FILE="${EXP_DIR}/sdxl_$(date '+%Y-%m-%d_%H-%M-%S')_MI300_SDXL_FINETUNE.log"
              export MODEL_NAME="stabilityai/stable-diffusion-xl-base-1.0"
              export VAE_NAME="madebyollin/sdxl-vae-fp16-fix"
              export DATASET_NAME="lambdalabs/naruto-blip-captions"
              export ACCELERATE_CONFIG_FILE="/etc/config/default_config_accelerate.yaml"
              export HF_HOME=/mnt/huggingface
              accelerate launch --config_file $ACCELERATE_CONFIG_FILE \
                --main_process_ip $MASTER_ADDRESS \
                --main_process_port $MASTER_PORT \
                --machine_rank $NODE_RANK \
                --num_processes $((8 * NNODES)) \
                --num_machines $NNODES train_text_to_image_sdxl.py \
                --pretrained_model_name_or_path=$MODEL_NAME \
                --pretrained_vae_model_name_or_path=$VAE_NAME \
                --dataset_name=$DATASET_NAME \
                --resolution=512 --center_crop --random_flip \
                --proportion_empty_prompts=0.1 \
                --train_batch_size=12 \
                --gradient_checkpointing \
                --num_train_epochs=500 \
                --use_8bit_adam \
                --learning_rate=1e-04 --lr_scheduler="cosine" --lr_warmup_steps=200 \
                --mixed_precision="fp16" \
                --validation_prompt="a cute Sundar Pichai creature" --validation_epochs 20 \
                --checkpointing_steps=1000 \
                --report_to="wandb" \
                --output_dir="sdxl-naruto-model" 2>&1 | tee "$LOG_FILE"
              sleep 30m
          volumes:
          - name: model-weights-volume
            hostPath:
              path: /mnt/model_weights
              type: Directory
          - name: diffusers-config-volume
            configMap:
              name: accelerate-config
          - { name: devinf, hostPath: { path: /dev/infiniband } }
          - { name: shm, emptyDir: { medium: Memory, sizeLimit: 512Gi } }
          restartPolicy: Never
          subdomain: sdxl-headless-svc
    ```
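Note that the master_addr value in the ConfigMap is not arbitrary. With an Indexed Job and a matching headless Service subdomain, each pod receives a stable DNS name of the form `<job-name>-<index>.<headless-service>`, and the index-0 pod serves as the rendezvous master for Accelerate. A minimal local sketch of how that name is composed, using the values from the ConfigMap:

```shell
# Values taken from the ConfigMap section above.
JOB_NAME=sdxl-finetune-multinode
HEADLESS_SVC=sdxl-headless-svc
MASTER_INDEX=0   # rank-0 pod acts as the main process

# Indexed-Job pod DNS pattern: <job-name>-<index>.<headless-service>
MASTER_ADDR="${JOB_NAME}-${MASTER_INDEX}.${HEADLESS_SVC}"
echo "$MASTER_ADDR"   # sdxl-finetune-multinode-0.sdxl-headless-svc
```

This is why the Service must be headless (clusterIP: None) and why the pod template sets `subdomain: sdxl-headless-svc`: without them, the per-pod DNS names that the workers use to reach the master would not resolve.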

Please note that you can change the num_replicas, completions, and parallelism values in the accelerate_blog_multinode.yaml file to any number between 1 and the total number of nodes in your cluster, to specify how many nodes to use for the multinode workload. Currently all three are set to 3, meaning we are running fine-tuning on 3 nodes, each with 8 GPUs. You will need to modify the configuration to match your actual infrastructure. For instance, if your nodes have only 4 GPUs each, you will need to change the number of GPUs requested per job (resources -> requests -> amd.com/gpu) to 4, ensuring the correct number of nodes is allocated.
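These counts feed directly into the launch command: Accelerate's --num_processes must equal the number of GPUs per node times the number of nodes, which the Job computes as $((8 * NNODES)). A quick sketch of that arithmetic:

```shell
# Total Accelerate worker processes = GPUs per node x number of nodes.
# The Job hard-codes 8 GPUs per node; if your nodes differ (e.g. 4 GPUs
# each), adjust GPUS_PER_NODE and the resource requests together.
NNODES=3
GPUS_PER_NODE=8
NUM_PROCESSES=$((GPUS_PER_NODE * NNODES))
echo "$NUM_PROCESSES"   # 24 processes across the 3-node cluster
```

If the process count does not match the GPUs actually allocated, the launch will either oversubscribe devices or leave GPUs idle, so keep these two knobs in sync whenever you resize the cluster.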

The sleep 30m command at the end of the Job section, under spec -> template -> spec -> containers -> args, keeps the job running for an additional 30 minutes after training completes. This gives you time to review the results and troubleshoot any issues. You can adjust the sleep time or remove this command as needed.

Summary#

In this blog post we showed you how to set up and fine-tune a generative AI model on Oracle Cloud Infrastructure's (OCI) Kubernetes Engine (OKE) using a cluster of AMD GPUs. You can use this tutorial as a starting point and adjust the YAML file to reflect your own network resources and the specific needs of your particular task.

Acknowledgment#

We want to thank the OCI team for helping us set up the multinode environment to implement the workload.

Disclaimers#

Third-party content is licensed to you directly by the third party that owns the content and is not licensed to you by AMD. ALL LINKED THIRD-PARTY CONTENT IS PROVIDED “AS IS” WITHOUT A WARRANTY OF ANY KIND. USE OF SUCH THIRD-PARTY CONTENT IS DONE AT YOUR SOLE DISCRETION AND UNDER NO CIRCUMSTANCES WILL AMD BE LIABLE TO YOU FOR ANY THIRD-PARTY CONTENT. YOU ASSUME ALL RISK AND ARE SOLELY RESPONSIBLE FOR ANY DAMAGES THAT MAY ARISE FROM YOUR USE OF THIRD-PARTY CONTENT.

