Amazon ECS supports workloads that use GPUs when you create clusters with container instances that support GPUs. Amazon EC2 GPU-based container instances that use the p2, p3, p5, g3, g4, and g5 instance types provide access to NVIDIA GPUs. For more information, see Linux Accelerated Computing Instances in the Amazon EC2 Instance Types guide.
Amazon ECS provides a GPU-optimized AMI that comes with pre-configured NVIDIA kernel drivers and a Docker GPU runtime. For more information, see Amazon ECS-optimized Linux AMIs.
You can designate a number of GPUs in your task definition for task placement consideration at a container level. Amazon ECS schedules tasks to available container instances that support GPUs and pins physical GPUs to the appropriate containers for optimal performance.
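For example, a container-level GPU reservation is declared with the resourceRequirements field of a container definition. The following sketch registers a task definition that reserves one GPU; the family name matches the ecs-gpu-task-def used in the placement example later in this topic, while the container name and image are placeholders.

```bash
# Sketch: the GPU reservation uses resourceRequirements with type GPU at the
# container level. The container name and image are placeholders.
cat > gpu-task-def.json <<'EOF'
{
  "family": "ecs-gpu-task-def",
  "requiresCompatibilities": ["EC2"],
  "containerDefinitions": [
    {
      "name": "gpu-container",
      "image": "nvidia/cuda:12.2.0-base-ubuntu22.04",
      "memory": 2048,
      "essential": true,
      "resourceRequirements": [
        { "type": "GPU", "value": "1" }
      ]
    }
  ]
}
EOF

aws ecs register-task-definition --cli-input-json file://gpu-task-def.json
```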
The following Amazon EC2 GPU-based instance types are supported. For more information, see Amazon EC2 P2 Instances, Amazon EC2 P3 Instances, Amazon EC2 P4d Instances, Amazon EC2 P5 Instances, Amazon EC2 G3 Instances, Amazon EC2 G4 Instances, Amazon EC2 G5 Instances, Amazon EC2 G6 Instances, and Amazon EC2 G6e Instances.
| Instance type | GPUs | GPU memory (GiB) | vCPUs | Memory (GiB) |
|---|---|---|---|---|
| p3.2xlarge | 1 | 16 | 8 | 61 |
| p3.8xlarge | 4 | 64 | 32 | 244 |
| p3.16xlarge | 8 | 128 | 64 | 488 |
| p3dn.24xlarge | 8 | 256 | 96 | 768 |
| p4d.24xlarge | 8 | 320 | 96 | 1152 |
| p5.48xlarge | 8 | 640 | 192 | 2048 |
| g3s.xlarge | 1 | 8 | 4 | 30.5 |
| g3.4xlarge | 1 | 8 | 16 | 122 |
| g3.8xlarge | 2 | 16 | 32 | 244 |
| g3.16xlarge | 4 | 32 | 64 | 488 |
| g4dn.xlarge | 1 | 16 | 4 | 16 |
| g4dn.2xlarge | 1 | 16 | 8 | 32 |
| g4dn.4xlarge | 1 | 16 | 16 | 64 |
| g4dn.8xlarge | 1 | 16 | 32 | 128 |
| g4dn.12xlarge | 4 | 64 | 48 | 192 |
| g4dn.16xlarge | 1 | 16 | 64 | 256 |
| g5.xlarge | 1 | 24 | 4 | 16 |
| g5.2xlarge | 1 | 24 | 8 | 32 |
| g5.4xlarge | 1 | 24 | 16 | 64 |
| g5.8xlarge | 1 | 24 | 32 | 128 |
| g5.16xlarge | 1 | 24 | 64 | 256 |
| g5.12xlarge | 4 | 96 | 48 | 192 |
| g5.24xlarge | 4 | 96 | 96 | 384 |
| g5.48xlarge | 8 | 192 | 192 | 768 |
| g6.xlarge | 1 | 24 | 4 | 16 |
| g6.2xlarge | 1 | 24 | 8 | 32 |
| g6.4xlarge | 1 | 24 | 16 | 64 |
| g6.8xlarge | 1 | 24 | 32 | 128 |
| g6.16xlarge | 1 | 24 | 64 | 256 |
| g6.12xlarge | 4 | 96 | 48 | 192 |
| g6.24xlarge | 4 | 96 | 96 | 384 |
| g6.48xlarge | 8 | 192 | 192 | 768 |
| g6.metal | 8 | 192 | 192 | 768 |
| gr6.4xlarge | 1 | 24 | 16 | 128 |
| gr6.8xlarge | 1 | 24 | 32 | 256 |
| g6e.xlarge | 1 | 48 | 4 | 32 |
| g6e.2xlarge | 1 | 48 | 8 | 64 |
| g6e.4xlarge | 1 | 48 | 16 | 128 |
| g6e.8xlarge | 1 | 48 | 32 | 256 |
| g6e.16xlarge | 1 | 48 | 64 | 512 |
| g6e.12xlarge | 4 | 192 | 48 | 384 |
| g6e.24xlarge | 4 | 192 | 96 | 768 |
| g6e.48xlarge | 8 | 384 | 192 | 1536 |
You can retrieve the Amazon Machine Image (AMI) ID for Amazon ECS-optimized AMIs by querying the AWS Systems Manager Parameter Store API. Using this parameter, you don't need to manually look up Amazon ECS-optimized AMI IDs. For more information about the Systems Manager Parameter Store API, see GetParameter. The user that you use must have the ssm:GetParameter IAM permission to retrieve the Amazon ECS-optimized AMI metadata.
```bash
aws ssm get-parameters --names /aws/service/ecs/optimized-ami/amazon-linux-2/gpu/recommended --region us-east-1
```

Support for the g2 instance family type has been deprecated.
The p2 instance family type is only supported on versions earlier than 20230912 of the Amazon ECS GPU-optimized AMI. If you need to continue to use p2 instances, see What to do if you need a P2 instance.
In-place updates of the NVIDIA/CUDA drivers on both of these instance family types can cause GPU workload failures.
We recommend that you consider the following before you begin working with GPUs on Amazon ECS.
Your clusters can contain a mix of GPU and non-GPU container instances.
You can run GPU workloads on external instances. When registering an external instance with your cluster, ensure that the --enable-gpu flag is included on the installation script. For more information, see Registering an external instance to an Amazon ECS cluster.
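As a sketch, the registration command might look like the following; the script path, cluster name, and activation values are placeholders taken from the linked registration procedure, and only the --enable-gpu flag is specific to GPU workloads.

```bash
# Placeholder values; create the SSM activation and download the install script
# as described in the external instance registration procedure.
sudo bash /tmp/ecs-anywhere-install.sh \
    --region us-east-1 \
    --cluster my-cluster \
    --activation-id <activation-id> \
    --activation-code <activation-code> \
    --enable-gpu
```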
You must set ECS_ENABLE_GPU_SUPPORT to true in your agent configuration file. For more information, see Amazon ECS container agent configuration.
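For example, you can append the setting to the agent configuration file, typically from the instance user data before the agent starts:

```bash
echo "ECS_ENABLE_GPU_SUPPORT=true" >> /etc/ecs/ecs.config
```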
When running a task or creating a service, you can use instance type attributes when you configure task placement constraints to determine the container instances that the task is launched on. By doing this, you can more effectively use your resources. For more information, see How Amazon ECS places tasks on container instances.
The following example launches a task on a g4dn.xlarge container instance in your default cluster.
```bash
aws ecs run-task --cluster default --task-definition ecs-gpu-task-def \
    --placement-constraints type=memberOf,expression="attribute:ecs.instance-type == g4dn.xlarge" --region us-east-2
```

For each container that has a GPU resource requirement specified in the container definition, Amazon ECS sets the container runtime to be the NVIDIA container runtime.
The NVIDIA container runtime requires some environment variables to be set in the container to function properly. For a list of these environment variables, see Specialized Configurations with Docker. Amazon ECS sets the NVIDIA_VISIBLE_DEVICES environment variable value to a list of the GPU device IDs that Amazon ECS assigns to the container. Amazon ECS doesn't set the other required environment variables, so make sure that your container image sets them or that they're set in the container definition.
The p5 instance type family is supported on version 20230929 and later of the Amazon ECS GPU-optimized AMI.
The g4 instance type family is supported on version 20230913 and later of the Amazon ECS GPU-optimized AMI. For more information, see Amazon ECS-optimized Linux AMIs. It's not supported in the Create Cluster workflow in the Amazon ECS console. To use these instance types, you must use the Amazon EC2 console, AWS CLI, or API and manually register the instances to your cluster.
The p4d.24xlarge instance type only works with CUDA 11 or later.
The Amazon ECS GPU-optimized AMI has IPv6 enabled, which causes issues when using yum. This can be resolved by configuring yum to use IPv4 with the following command.
echo "ip_resolve=4" >> /etc/yum.conf When you build a container image that doesn't use the NVIDIA/CUDA baseimages, you must set theNVIDIA_DRIVER_CAPABILITIES containerruntime variable to one of the following values:
utility,compute
all
For information about how to set the variable, see Controlling the NVIDIA Container Runtime on the NVIDIA website.
GPUs are not supported on Windows containers.
When you want to share GPUs, you need to configure the following.
Remove GPU resource requirements from your task definitions so that Amazon ECS doesnot reserve any GPUs that should be shared.
Add the following user data to your instances when you want to share GPUs. This will make nvidia the default Docker container runtime on the container instance so that all Amazon ECS containers can use the GPUs. For more information, see Run commands when you launch an EC2 instance with user data input in the Amazon EC2 User Guide.
```typescript
const userData = ec2.UserData.forLinux();
userData.addCommands(
  'sudo rm /etc/sysconfig/docker',
  'echo DAEMON_MAXFILES=1048576 | sudo tee -a /etc/sysconfig/docker',
  'echo OPTIONS="--default-ulimit nofile=32768:65536 --default-runtime nvidia" | sudo tee -a /etc/sysconfig/docker',
  'echo DAEMON_PIDFILE_TIMEOUT=10 | sudo tee -a /etc/sysconfig/docker',
  'sudo systemctl restart docker',
);
```

Set the NVIDIA_VISIBLE_DEVICES environment variable on your container. You can do this by specifying the environment variable in your task definition. For information on the valid values, see GPU Enumeration on the NVIDIA documentation site.
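For example, the following sketch registers a task definition for a shared-GPU workload. It has no GPU resourceRequirements, so Amazon ECS doesn't reserve a GPU, and NVIDIA_VISIBLE_DEVICES=all exposes every GPU on the instance to the container. The family, container name, and image are placeholders.

```bash
# Sketch for a shared-GPU task: no resourceRequirements, and the
# NVIDIA_VISIBLE_DEVICES environment variable controls which devices are visible.
aws ecs register-task-definition \
    --family shared-gpu-task-def \
    --requires-compatibilities EC2 \
    --container-definitions '[
      {
        "name": "shared-gpu-container",
        "image": "nvidia/cuda:12.2.0-base-ubuntu22.04",
        "memory": 2048,
        "essential": true,
        "environment": [
          { "name": "NVIDIA_VISIBLE_DEVICES", "value": "all" }
        ]
      }
    ]'
```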
What to do if you need a P2 instance

If you need to use P2 instances, you can use one of the following options to continue using the instances.
You must modify the instance user data for both options. For more information, see Run commands when you launch an EC2 instance with user data input in the Amazon EC2 User Guide.
Use the last supported GPU-optimized AMI
You can use the 20230906 version of the GPU-optimized AMI, and add the following to the instance user data.
Replace cluster-name with the name of your cluster.
```bash
#!/bin/bash
echo "exclude=*nvidia* *cuda*" >> /etc/yum.conf
echo "ECS_CLUSTER=cluster-name" >> /etc/ecs/ecs.config
```

Use the latest GPU-optimized AMI, and update the user data
You can add the following to the instance user data. This uninstalls the NVIDIA 535/CUDA 12.2 drivers, and then installs the NVIDIA 470/CUDA 11.4 drivers and pins the version.
```bash
#!/bin/bash
yum remove -y cuda-toolkit* nvidia-driver-latest-dkms*
tmpfile=$(mktemp)
cat >$tmpfile <<EOF
[amzn2-nvidia]
name=Amazon Linux 2 Nvidia repository
mirrorlist=\$awsproto://\$amazonlinux.\$awsregion.\$awsdomain/\$releasever/amzn2-nvidia/latest/\$basearch/mirror.list
priority=20
gpgcheck=1
gpgkey=https://developer.download.nvidia.com/compute/cuda/repos/rhel7/x86_64/7fa2af80.pub
enabled=1
exclude=libglvnd-*
EOF
mv $tmpfile /etc/yum.repos.d/amzn2-nvidia-tmp.repo
yum install -y system-release-nvidia cuda-toolkit-11-4 nvidia-driver-latest-dkms-470.182.03
yum install -y libnvidia-container-1.4.0 libnvidia-container-tools-1.4.0 nvidia-container-runtime-hook-1.4.0 docker-runtime-nvidia-1
echo "exclude=*nvidia* *cuda*" >> /etc/yum.conf
nvidia-smi
```

Create your own P2 compatible GPU-optimized AMI
You can create your own custom Amazon ECS GPU-optimized AMI that is compatible with P2instances, and then launch P2 instances using the AMI.
Run the following command to clone the amazon-ecs-ami repo.
```bash
git clone https://github.com/aws/amazon-ecs-ami
```

Set the required Amazon ECS agent and source Amazon Linux AMI versions in release.auto.pkrvars.hcl or overrides.auto.pkrvars.hcl.
Run the following command to build a private P2 compatible EC2 AMI.
Replace region with the Region of your instance.
```bash
REGION=region make al2keplergpu
```

Use the AMI with the following instance user data to connect to the Amazon ECS cluster.
Replace cluster-name with the name of your cluster.
#!/bin/bashecho "ECS_CLUSTER=cluster-name" >> /etc/ecs/ecs.config