Designing resilient systems

This document describes best practices for designing resilient systems on Compute Engine. It provides general advice and covers some features in Compute Engine that can help mitigate instance downtime and prepare for times when your Compute Engine instances unexpectedly fail.

A resilient system is a system that can withstand a certain number of failures or disruptions without interrupting your service or affecting your users' experience of your service. While Compute Engine makes every effort to prevent such disruptions, certain events are unpredictable, and it's best to be prepared for them.

Types of failures

At some point, one or more of your compute instances might be lost due to system or hardware failures. The following list contains some types of failure scenarios that you can mitigate:

Tips for designing resilient systems

To help mitigate compute instance failures, design your application to be resilient against failures, network interruptions, and unexpected disasters. A resilient system gracefully handles failures, for example, by redirecting traffic from an inaccessible instance to a live instance, or by automating tasks on reboot.

Here are some general tips to help you design a system that is resilient against failures.

Use live migration

Google Cloud periodically performs maintenance on its infrastructure by patching systems with the latest software, performing routine tests and preventative maintenance, and generally ensuring that its infrastructure is as secure, fast, and efficient as possible. Compute Engine employs live migration to ensure that this infrastructure maintenance is transparent by default to your compute instances.

Live migration is a technology that moves your running instances away from systems that are about to undergo maintenance work. Compute Engine does this automatically for supported instance types.

During live migration, your instance might experience a decrease in performance for a short period of time. For instances that demand constant, maximum performance, you can configure the instances to be restarted on another host instead of undergoing live migration. If you choose this option, Compute Engine stops the instance and restarts it on a host that isn't involved in a maintenance event. Stopping and restarting the instance is suitable only when your overall application is also built to handle instance failures or reboots.

To configure your instances for live migration or to configure them to restart instead of migrate, see Set the host maintenance policy for a compute instance.
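The two options described above can be sketched as a small helper that builds the scheduling settings for an instance. This is a minimal sketch, not an official client: `build_scheduling` is a hypothetical helper name, and the sketch assumes the `onHostMaintenance` field (with the values `MIGRATE` and `TERMINATE`) and the `automaticRestart` field of the instance scheduling settings in the Compute Engine API. Sending the result to the API is left out.

```python
# Sketch only: build_scheduling is a hypothetical helper, not part of any
# Google Cloud library. It assumes the "onHostMaintenance" and
# "automaticRestart" fields of the instance scheduling settings.

VALID_POLICIES = ("MIGRATE", "TERMINATE")


def build_scheduling(on_host_maintenance: str, automatic_restart: bool = True) -> dict:
    """Return scheduling settings for the chosen host maintenance policy."""
    policy = on_host_maintenance.upper()
    if policy not in VALID_POLICIES:
        raise ValueError(f"unknown host maintenance policy: {on_host_maintenance!r}")
    return {"onHostMaintenance": policy, "automaticRestart": automatic_restart}
```

For example, `build_scheduling("TERMINATE")` describes an instance that is stopped and restarted on another host instead of being live migrated.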

Distribute your instances

Create instances across more than one region and zone so that you have alternative compute instances to point to if a zone or region containing one of your instances is disrupted. If you create all your instances in the same zone or region, then you won't be able to access any of those instances if that zone or region becomes unreachable.
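As an illustration of why spreading instances matters, the following sketch assigns instance names to zones in round-robin order and then lists which instances remain if one zone becomes unreachable. The function names, instance names, and the choice of round-robin placement are assumptions made for this example; they aren't a Compute Engine feature.

```python
from itertools import cycle


def spread_across_zones(instance_names, zones):
    """Assign each instance name to a zone in round-robin order."""
    if not zones:
        raise ValueError("at least one zone is required")
    # zip stops when instance_names is exhausted, so cycle() is safe here.
    return {name: zone for name, zone in zip(instance_names, cycle(zones))}


def surviving_instances(placement, failed_zone):
    """Return the instances that are not in the failed zone."""
    return [name for name, zone in placement.items() if zone != failed_zone]
```

With three instances placed across `us-central1-a` and `us-east1-b`, losing `us-central1-a` still leaves one instance to receive traffic. With a single zone, `surviving_instances` returns an empty list.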

Use zone-specific internal DNS names

Set the default internal DNS type for your project or organization to zonal DNS. In your applications, use zonal DNS names when accessing other compute instances. Internal DNS servers are distributed across all zones, so you can rely on zonal DNS names to resolve even if there are failures in other locations.

Global DNS is less resilient because it has single points of failure. Zonal DNS mitigates the risk of cross-regional outages. Zonal DNS also doesn't require instance names to be unique across all regions in a project, which allows for faster instance creation.

To check if an instance uses zonal DNS names or global DNS names, see Determine the internal DNS name for a VM.

If your project uses global DNS names, you can switch to using zonal DNS names. For more information, see Use Zonal DNS for your internal DNS type.
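The difference between the two name types can be sketched as follows. The helper names are hypothetical, and the sketch assumes the internal DNS name formats `INSTANCE.ZONE.c.PROJECT_ID.internal` (zonal) and `INSTANCE.c.PROJECT_ID.internal` (global); check the name that your instance actually reports before relying on either format.

```python
def zonal_dns_name(instance: str, zone: str, project_id: str) -> str:
    """Build a zonal internal DNS name (assumed format, see lead-in)."""
    return f"{instance}.{zone}.c.{project_id}.internal"


def global_dns_name(instance: str, project_id: str) -> str:
    """Build a global internal DNS name (assumed format, see lead-in)."""
    return f"{instance}.c.{project_id}.internal"
```

Because the zonal name includes the zone, two instances with the same name in different zones produce different zonal names, while their global names would collide.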

Create groups of VMs

Use managed instance groups to create homogeneous groups of VMs so that load balancers can direct traffic to more than one VM in case a single VM becomes unhealthy.

Managed instance groups (MIGs) also offer features like autoscaling and autohealing. Autoscaling lets you deal with spikes in traffic by scaling the number of VMs up or down based on specific signals. Autohealing performs health checking and, if necessary, automatically recreates unhealthy VMs.

MIGs are also available for regions, so you can create a group of VMs distributed across multiple zones within a single region. For more information, see Creating and managing regional MIGs.
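To make the autoscaling and autohealing ideas concrete, here is a simplified model of both decisions. It is a sketch under stated assumptions, not the algorithm that MIGs use: the utilization-based sizing rule, the consecutive-failure threshold, and all function and parameter names are chosen for illustration only.

```python
def recommended_size(current_size: int, observed_pct: int, target_pct: int,
                     min_size: int, max_size: int) -> int:
    """Scale the group so that utilization moves toward target_pct.

    Utilization values are whole percentages (for example, 90 for 90%).
    """
    if target_pct <= 0:
        raise ValueError("target_pct must be positive")
    # Integer ceiling division avoids floating-point rounding surprises.
    needed = (current_size * observed_pct + target_pct - 1) // target_pct
    return max(min_size, min(max_size, needed))


def needs_recreation(consecutive_failures: int, unhealthy_threshold: int) -> bool:
    """Treat a VM as unhealthy after enough consecutive failed health checks."""
    return consecutive_failures >= unhealthy_threshold
```

In this model, a group of 4 VMs running at 90% utilization with a 60% target grows to 6 VMs, and a VM that fails 3 health checks in a row with a threshold of 3 is flagged for recreation.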

Use load balancing

Google Cloud offers a load balancing service that helps you support periods of heavy traffic so that you don't overload your compute instances. With Cloud Load Balancing, you can do the following:

  • Deploy your application on VMs within multiple zones using regional MIGs. Then, you can configure a forwarding rule that can spread traffic across all VMs in all zones within the region. Each forwarding rule can define one entry point to your application using an external IP address.

  • Deploy VMs across multiple regions using global load balancing. HTTP(S) load balancing enables your traffic to enter the Google Cloud system at the location nearest the client. Cross-regional load balancing provides redundancy so that if a region is unreachable, traffic is automatically diverted to another region. In this way, your service remains reachable using the same external IP address.

  • Use autoscaling to automatically add or delete VMs from a MIG based on increases or decreases in load.

Additionally, Cloud Load Balancing offers VM health checking, providing support in detecting and handling VM failures.
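The combination of health checking and cross-regional failover can be illustrated with a toy backend selector. This is only a sketch of the idea: Cloud Load Balancing makes this decision for you and takes more signals into account, and the `pick_backend` function, its data shapes, and the fallback order are assumptions made for this example.

```python
def pick_backend(backends, client_region, healthy):
    """Pick a healthy VM, preferring the client's region.

    backends: dict that maps a region name to a list of VM names.
    healthy: set of VM names that currently pass health checks.
    Returns a (region, vm_name) tuple.
    """
    # Try the client's region first, then the other regions in sorted order.
    regions = [client_region] + sorted(r for r in backends if r != client_region)
    for region in regions:
        candidates = [vm for vm in backends.get(region, []) if vm in healthy]
        if candidates:
            return region, candidates[0]
    raise RuntimeError("no healthy backend in any region")
```

If every VM in the client's region fails its health checks, the selector diverts the request to a healthy VM in another region, which mirrors how the service stays reachable when a region is unavailable.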

Use startup and shutdown scripts

Compute Engine offers startup and shutdown scripts that run when an instance boots up or shuts down, respectively. Startup and shutdown scripts can automate tasks like installing software, running updates, making backups, and logging data.

Both startup and shutdown scripts are an efficient and invaluable way to bootstrap or cleanly shut down your instances. Instead of configuring your instances using custom images, it can be beneficial to configure instances using startup scripts.

Startup scripts run whenever the instance reboots or restarts due to failures, and can be used to install software and updates. You can also use startup scripts to ensure that services are running within the instance. Coding the changes to configure an instance in a startup script is often easier than trying to figure out what files or bytes have changed on a custom image.

Shutdown scripts run when your instance shuts down, either intentionally or not. They can perform last-minute tasks like backing up data, saving logs, and gracefully closing connections before the instance stops.

For more information, see Running startup scripts and Running shutdown scripts.
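Because a startup script runs on every boot, it helps to make each step safe to repeat. The following sketch shows two building blocks that a Python startup script might use: a request for a custom metadata attribute, and a helper that runs a one-time task only once. It assumes the metadata server address `metadata.google.internal` and the `Metadata-Flavor: Google` request header; the function names and the marker-file approach are hypothetical choices for this example.

```python
#!/usr/bin/env python3
import os
import urllib.request

# Assumed metadata server path for custom instance attributes.
METADATA_URL = "http://metadata.google.internal/computeMetadata/v1/instance/attributes/"


def metadata_request(key: str) -> urllib.request.Request:
    """Build the request for one custom metadata attribute."""
    return urllib.request.Request(METADATA_URL + key,
                                  headers={"Metadata-Flavor": "Google"})


def read_attribute(key: str, timeout: float = 2.0) -> str:
    """Fetch the attribute value. This only works from inside an instance."""
    with urllib.request.urlopen(metadata_request(key), timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def run_once(marker_path: str, task) -> bool:
    """Run task only if marker_path doesn't exist yet, then create the marker.

    Returns True if the task ran, False if it was skipped.
    """
    if os.path.exists(marker_path):
        return False
    task()
    with open(marker_path, "w") as marker:
        marker.write("done\n")
    return True
```

A script built this way can wrap a one-time installation step in `run_once` so that later reboots skip it, while steps that must happen on every boot, such as checking that a service is running, stay outside the wrapper.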

Back up your data

Back up your data regularly and in multiple locations. You can upload your files to Cloud Storage, create disk snapshots, or replicate your data to a disk in another zone using synchronous replication or to another region using asynchronous replication.
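One way to check that backups are both recent and spread across locations is a small audit function like the one below. It is a sketch only: the `backup_gaps` name, the tuple format for backup records, and the default of two locations are assumptions for illustration, and gathering the actual list of backups is left out.

```python
from datetime import datetime, timedelta, timezone


def backup_gaps(backups, now, max_age, min_locations=2):
    """Return a list of problems found in a set of backup records.

    backups: list of (location, taken_at) tuples, where taken_at is a
    timezone-aware datetime. An empty result means no problems were found.
    """
    problems = []
    # Only backups that are recent enough count toward the location check.
    fresh = [(loc, taken_at) for loc, taken_at in backups if now - taken_at <= max_age]
    if not fresh:
        problems.append("no backup is newer than max_age")
    locations = {loc for loc, _ in fresh}
    if len(locations) < min_locations:
        problems.append(
            f"recent backups cover {len(locations)} location(s), expected at least {min_locations}"
        )
    return problems
```

For example, two snapshots from the last day that both sit in the same region produce one problem, because the recent backups don't span multiple locations.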


Last updated 2025-12-15 UTC.