Image classification using Vision Transformer with AMD GPUs

4 Apr, 2024
The Vision Transformer (ViT) model was first proposed in An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. ViT is an attractive alternative to conventional Convolutional Neural Network (CNN) models due to its excellent scalability and adaptability in the field of computer vision. On the other hand, ViT can be more expensive than a CNN for large input images, because the cost of self-attention grows quadratically with the number of image patches.
This blog demonstrates how to use the ViT model on AMD GPUs with ROCm Software.
The preceding figure, which is taken from the original paper, shows the ViT architecture.
Some of the code used in this blog is adapted from the blog: Quick demo: Vision Transformer (ViT) by Google Brain by Niels Rogge.
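To make the quadratic cost concrete, here is a minimal sketch of how the number of tokens grows with image size. It assumes square images, 16x16 patches, and one extra [CLS] token; the helper name `vit_token_count` is made up for this illustration.

```python
def vit_token_count(image_size: int, patch_size: int) -> int:
    """Tokens seen by the encoder: one per patch plus the [CLS] token."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size
    return patches_per_side ** 2 + 1


for size in (224, 448, 896):
    tokens = vit_token_count(size, patch_size=16)
    # Self-attention scores every token against every other token,
    # so the score matrix has tokens * tokens entries per head.
    print(f"{size}x{size}: {tokens} tokens, {tokens ** 2:,} attention scores per head")
```

A 224x224 image yields 197 tokens. Doubling the side length roughly quadruples the token count, so the attention score matrix grows by roughly sixteen times.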
Setup#
The code in this blog has been tested on a single AMD MI210 GPU with a Docker image that has ROCm 6.0 and PyTorch 2.2 installed. It will also run on any ROCm-supported GPU.
Software#
You can find the Docker image used in this demo on Docker Hub. Start a container from it using the following command:
docker run -it --cap-add=SYS_PTRACE --security-opt seccomp=unconfined \
  --device=/dev/kfd --device=/dev/dri --group-add video \
  --ipc=host --shm-size 8G rocm/pytorch:latest
For further details on how to install ROCm and PyTorch, visit:
To see if you have the correct version of ROCm installed, use:
apt show rocm-libs -a
Package: rocm-libs
Version: 6.0.0.60000-91~20.04
Priority: optional
Section: devel
Maintainer: ROCm Dev Support <rocm-dev.support@amd.com>
Installed-Size: 13.3 kB
Depends: hipblas (= 2.0.0.60000-91~20.04), hipblaslt (= 0.6.0.60000-91~20.04), hipfft (= 1.0.12.60000-91~20.04), hipsolver (= 2.0.0.60000-91~20.04), hipsparse (= 3.0.0.60000-91~20.04), hiptensor (= 1.1.0.60000-91~20.04), miopen-hip (= 3.00.0.60000-91~20.04), half (= 1.12.0.60000-91~20.04), rccl (= 2.18.3.60000-91~20.04), rocalution (= 3.0.3.60000-91~20.04), rocblas (= 4.0.0.60000-91~20.04), rocfft (= 1.0.23.60000-91~20.04), rocrand (= 2.10.17.60000-91~20.04), hiprand (= 2.10.16.60000-91~20.04), rocsolver (= 3.24.0.60000-91~20.04), rocsparse (= 3.0.2.60000-91~20.04), rocm-core (= 6.0.0.60000-91~20.04), composablekernel-dev (= 1.1.0.60000-91~20.04), hipblas-dev (= 2.0.0.60000-91~20.04), hipblaslt-dev (= 0.6.0.60000-91~20.04), hipcub-dev (= 3.0.0.60000-91~20.04), hipfft-dev (= 1.0.12.60000-91~20.04), hipsolver-dev (= 2.0.0.60000-91~20.04), hipsparse-dev (= 3.0.0.60000-91~20.04), hiptensor-dev (= 1.1.0.60000-91~20.04), miopen-hip-dev (= 3.00.0.60000-91~20.04), rccl-dev (= 2.18.3.60000-91~20.04), rocalution-dev (= 3.0.3.60000-91~20.04), rocblas-dev (= 4.0.0.60000-91~20.04), rocfft-dev (= 1.0.23.60000-91~20.04), rocprim-dev (= 3.0.0.60000-91~20.04), rocrand-dev (= 2.10.17.60000-91~20.04), hiprand-dev (= 2.10.16.60000-91~20.04), rocsolver-dev (= 3.24.0.60000-91~20.04), rocsparse-dev (= 3.0.2.60000-91~20.04), rocthrust-dev (= 3.0.0.60000-91~20.04), rocwmma-dev (= 1.3.0.60000-91~20.04)
Homepage: https://github.com/RadeonOpenCompute/ROCm
Download-Size: 1046 B
APT-Manual-Installed: yes
APT-Sources: http://repo.radeon.com/rocm/apt/6.0 focal/main amd64 Packages
Description: Radeon Open Compute (ROCm) Runtime software stack
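If you would rather check the version from a script than read the output by eye, the following is one possible sketch. It assumes the `apt show` text has a `Version:` field on its own line, as in the output above; the helper `parse_rocm_version` and the sample string are made up for this illustration.

```python
import re


def parse_rocm_version(apt_output):
    """Return (major, minor, patch) from the 'Version:' field of apt output."""
    match = re.search(r"^Version:\s*(\d+)\.(\d+)\.(\d+)", apt_output, flags=re.MULTILINE)
    if match is None:
        raise ValueError("no 'Version:' field found")
    return tuple(int(part) for part in match.groups())


# Shortened sample of the output shown above, used here instead of calling apt.
sample = "Package: rocm-libs\nVersion: 6.0.0.60000-91~20.04\nPriority: optional\n"

version = parse_rocm_version(sample)
print("ROCm version:", version)
print("At least 6.0.0:", version >= (6, 0, 0))
```

Comparing tuples avoids the pitfalls of comparing version strings directly, where "10.0" would sort before "6.0".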
You’ll also need to install the transformers package from Hugging Face.
pip install -q transformers
Hardware#
For a list of supported hardware, visit the ROCm System requirements page.
Check your hardware to make sure that the system recognizes the AMD GPU.
rocm-smi --showproductname
============================ ROCm System Management Interface ============================
====================================== Product Info ======================================
GPU[0] : Card series: 0x740f
GPU[0] : Card model: 0x0c34
GPU[0] : Card vendor: Advanced Micro Devices, Inc. [AMD/ATI]
GPU[0] : Card SKU: D67301V
==========================================================================================
================================== End of ROCm SMI Log ===================================
Make sure PyTorch also recognizes the GPU.
import torch

print(f"number of GPUs: {torch.cuda.device_count()}")
print([torch.cuda.get_device_name(i) for i in range(torch.cuda.device_count())])
number of GPUs: 1
['AMD Instinct MI210']
Loading the ViT model#
Load the pre-trained ViT model vit-base-patch16-224 from Hugging Face.
from transformers import ViTForImageClassification

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = ViTForImageClassification.from_pretrained('google/vit-base-patch16-224')
model.to(device)
Image classification#
Load an unlicensed image for the model to classify.
from PIL import Image
import requests

url = 'https://images.pexels.com/photos/19448090/pexels-photo-19448090.png'
image = Image.open(requests.get(url, stream=True).raw)
image

The ViT model accepts an input resolution of 224x224. You can use the ViTImageProcessor to normalize and resize the image so that it’s ready for the model.
from transformers import ViTImageProcessor

processor = ViTImageProcessor.from_pretrained('google/vit-base-patch16-224')
inputs = processor(images=image, return_tensors="pt").to(device)
pixel_values = inputs.pixel_values
Check that the image has the correct size. The shape of the pixel_values tensor represents [batch, channel, height, width].
print(pixel_values.shape)
torch.Size([1, 3, 224, 224])
The processed image shows the correct size. To look at the output image, use:
import torchvision.transforms as T

# Convert the first (and only) image in the batch to a PIL image for display
T.ToPILImage()(pixel_values[0])

As expected, the processed image is a normalized and resized version of the original one.
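The following sketch illustrates what "normalized" means for a single pixel value and how to reverse it. It assumes the processor rescales by 1/255 and then normalizes with a mean and standard deviation of 0.5 per channel; these constants are an assumption of this sketch, so check `processor.image_mean` and `processor.image_std` for the values your checkpoint actually uses. The helper names are made up for illustration.

```python
# Assumed preprocessing constants (verify against your processor's config).
RESCALE = 1 / 255
MEAN = 0.5
STD = 0.5


def normalize(pixel):
    """Map an 8-bit pixel value (0..255) to the model's input range."""
    return (pixel * RESCALE - MEAN) / STD


def denormalize(value):
    """Invert normalize() and round back to an 8-bit pixel value."""
    return round((value * STD + MEAN) / RESCALE)


for pixel in (0, 128, 255):
    value = normalize(pixel)
    print(pixel, "->", round(value, 4), "->", denormalize(value))
```

Under these assumptions the model sees values in roughly the range -1 to 1 rather than 0 to 255, which may explain why the displayed tensor looks different in color from the original photo.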
The model has been trained to classify images into one of 1,000 classes. Here are some sample classes that the model supports.
import random

random.seed(1002)
for _ in range(10):
    print(model.config.id2label[random.randint(0, 999)])
dishrag, dishcloth
jeep, landrover
bassinet
traffic light, traffic signal, stoplight
briard
grey whale, gray whale, devilfish, Eschrichtius gibbosus, Eschrichtius robustus
bittern
Siamese cat, Siamese
china cabinet, china closet
flamingo
Now you’re ready to use the ViT model to classify the image. The ViT model consists of a BERT-like encoder and a linear classification head on top of the last hidden state of the [CLS] (classification) token. The model outputs one logit (an unnormalized score) for each of the 1,000 classes that it has been trained on. The classification result is simply the class that has the highest logit value.
with torch.no_grad():
    outputs = model(pixel_values)
logits = outputs.logits
print("Total number of classes:", logits.shape[1])
prediction = logits.argmax(-1)
print("Predicted class:", model.config.id2label[prediction.item()])
Total number of classes: 1000
Predicted class: leopard, Panthera pardus

The model got it right! You can give this a try with other images to see how this model performs on your AMD GPUs.
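Picking the highest logit tells you the winning class but not how confident the model is. A common way to get that is to pass the logits through a softmax and look at the top few probabilities. The sketch below does this in plain Python on a handful of made-up logits and labels; the numbers are hypothetical and are not output from the model above.

```python
import math


def softmax(logits):
    """Convert logits to probabilities that sum to 1."""
    largest = max(logits)
    # Subtracting the maximum keeps exp() from overflowing.
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(logits, labels, k=3):
    """Return the k (label, probability) pairs with the highest probability."""
    probs = softmax(logits)
    ranked = sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Hypothetical logits and labels, made up for illustration only.
example_labels = ["leopard", "jaguar", "cheetah", "tabby"]
example_logits = [9.1, 6.3, 5.8, 0.2]

for label, prob in top_k(example_logits, example_labels):
    print(f"{label}: {prob:.3f}")
```

Softmax preserves the ordering of the logits, so the top entry always matches the argmax result. With the real model, `torch.softmax(logits, dim=-1)` followed by `.topk(5)` should give the same kind of ranking directly on the GPU.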
To learn more about Vision Transformer, see: Official Vision Transformer repo from Google Research.
Disclaimers#
Third-party content is licensed to you directly by the third party that owns the content and is not licensed to you by AMD. ALL LINKED THIRD-PARTY CONTENT IS PROVIDED “AS IS” WITHOUT A WARRANTY OF ANY KIND. USE OF SUCH THIRD-PARTY CONTENT IS DONE AT YOUR SOLE DISCRETION AND UNDER NO CIRCUMSTANCES WILL AMD BE LIABLE TO YOU FOR ANY THIRD-PARTY CONTENT. YOU ASSUME ALL RISK AND ARE SOLELY RESPONSIBLE FOR ANY DAMAGES THAT MAY ARISE FROM YOUR USE OF THIRD-PARTY CONTENT.