About repairing VMs for high availability

This document describes how a managed instance group (MIG) provides high availability of your application by repairing failed and unhealthy VMs in the group.

A MIG keeps your application up and available by proactively maintaining the number of running VMs in the group. If a VM in the group goes down, the MIG repairs the VM by recreating it in the following ways to bring the VM back to service:

  • Automatically repair a failed VM: If a VM fails or is deleted by an action not initiated by the MIG, then the MIG automatically repairs the failed VM. In this document, see Automatically repair a failed VM.
  • Repair a VM based on an application health check: An optional way to further improve high availability by repairing unhealthy VMs. If you configure an application-based health check and your application fails the health check, then the MIG marks that VM as unhealthy and repairs it. Repairing a VM based on an application health check is also called autohealing. In this document, see Repair a VM based on an application health check.

Automatically repair a failed VM

If a VM in a MIG fails, the MIG automatically repairs the failed VM by recreating it. This applies when a VM fails or is deleted by an action that the MIG didn't initiate.

If the MIG intentionally stops a VM (for example, when an autoscaler deletes a VM), then the MIG doesn't repair that VM.

Note: To make sure that the MIG doesn't revert your configuration changes, you must manage the MIG by using the instance groups console page, the instance-groups managed gcloud CLI commands, and the zonal or regional instance group manager API resources.

Repair a VM based on an application health check

In addition to the automatic repair of failed VMs, you might want to repair a VM if an application that you're running on the VM freezes, crashes, or runs out of memory. To make sure that the application is responding as expected, you can configure an application-based health check.

An application-based health check periodically verifies that the applications that you're running on each VM in a MIG are responding as expected. If the application on a VM doesn't respond, then the MIG marks that VM as unhealthy. The MIG then repairs the unhealthy VM. Repairing a VM based on an application health check is called autohealing.
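For a health check to have something to probe, the application needs to answer on a known endpoint. The following is a minimal sketch of such an endpoint that uses only the Python standard library. The /healthz path, the port, and the helper names are assumptions made for illustration; use whatever path and port your own health check is configured to probe.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def health_response(path: str) -> tuple[int, bytes]:
    """Returns the (status code, body) pair for a probe of the given path.

    The /healthz path is a hypothetical choice, not something a MIG requires.
    """
    if path == "/healthz":
        return 200, b"ok"
    return 404, b"not found"


class HealthHandler(BaseHTTPRequestHandler):
    """Answers GET requests by delegating to health_response()."""

    def do_GET(self):
        status, body = health_response(self.path)
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        # Suppress per-request logging so frequent probes don't flood the output.
        pass


def make_server(host: str = "0.0.0.0", port: int = 8080) -> HTTPServer:
    """Builds the server; call serve_forever() on the result to start serving."""
    return HTTPServer((host, port), HealthHandler)
```

In a real application, health_response would typically check that the application's own dependencies are reachable before returning 200, so that a frozen or crashed application stops answering successfully.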

To make sure that the MIG keeps running a subset of its VMs, the group never concurrently autoheals all of its VMs. This approach helps to prevent issues such as an incorrect health check triggering unnecessary repairs, a misconfigured firewall rule preventing a health check from probing the VM, or network connectivity or infrastructure issues that cause a healthy VM to be wrongly identified as unhealthy. However, if a zonal MIG has only one VM, or a regional MIG has only one VM per zone, the MIG autoheals these VMs when they become unhealthy.
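The rule above can be pictured with a small model. This is a sketch only: this document doesn't state how many VMs a MIG autoheals at once, so the code assumes nothing more than "at least one VM is left untouched", and the function name is hypothetical.

```python
def pick_for_autohealing(unhealthy: list[str], group_size: int) -> list[str]:
    """Returns the unhealthy VMs that may be autohealed concurrently.

    Assumption: the only constraint modeled is that the group never
    autoheals all of its VMs at the same time, except when the group
    (or the zone of a regional group) holds a single VM.
    """
    if group_size <= 1:
        # A lone VM is autohealed when it becomes unhealthy.
        return list(unhealthy)
    # Leave at least one VM out of the current batch of repairs.
    limit = group_size - 1
    return list(unhealthy[:limit])
```

With three VMs that all report unhealthy, for example because of a misconfigured firewall rule, this model repairs two and leaves the third running.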

Autohealing policy

Each MIG has an autohealing policy in which you can configure a health check and also set an initial delay. The initial delay is the time that a new VM takes to initialize and run its startup script. The initial delay timer starts when the MIG changes the VM's currentAction field to VERIFYING. During a VM's initial delay period, the MIG ignores unsuccessful health checks because the VM might be in the startup process. This approach prevents the MIG from prematurely recreating a VM. If the health check receives a healthy response during the initial delay, it indicates that the startup process is complete and the VM is ready.
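The effect of the initial delay on a repair decision can be sketched as follows. This is an illustrative model, not the MIG's implementation: the function and parameter names are hypothetical, and the sketch doesn't model what happens after a healthy response arrives during the delay.

```python
from datetime import datetime, timedelta


def should_repair(
    healthy: bool,
    verifying_since: datetime,
    initial_delay_sec: int,
    now: datetime,
) -> bool:
    """Decides whether an unsuccessful health check should lead to a repair.

    verifying_since is the moment the VM's currentAction changed to
    VERIFYING, which is when the initial delay timer starts.
    """
    if healthy:
        return False
    delay_ends = verifying_since + timedelta(seconds=initial_delay_sec)
    # Unsuccessful checks are ignored while the VM may still be starting up.
    return now >= delay_ends
```

For example, with an initial delay of 300 seconds, an unsuccessful check 100 seconds after the VM enters VERIFYING is ignored, while the same result at 300 seconds or later leads to a repair.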

For more information about configuring an autohealing policy, see Set up an application health check and autohealing.

Monitor application health state changes

If you've configured an application-based health check in your MIG, you can check the health state of each VM in the MIG. For more information, see Check whether VMs are healthy.

You can also monitor the changes in the health state of a VM. For more information, see Monitor health state changes.

Pricing

When you set up an application-based health check, by default Compute Engine writes a log entry whenever a managed instance's health state changes. Cloud Logging provides a free allotment per month after which logging is priced by data volume. To avoid costs, you can disable the health state change logs.

Behavior during a repair

The following sections explain the behavior during automatic repairs and repairs based on an application health check.

Update on repair

By default, during a repair, a MIG recreates a VM using the original instance template that was used to create the VM. For example, if a VM was created using instance-template-a and then you update the MIG to use instance-template-b in OPPORTUNISTIC mode, the MIG still uses instance-template-a to recreate the VM.

If you want your MIG to use the latest instance template and per-instance configurations during VM repair, you can configure the group to apply configuration updates during repairs.
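The template choice described in this section can be summarized in a few lines. This is a sketch of the documented behavior only; the function name and the update_on_repair parameter are hypothetical stand-ins for the group setting, not API fields.

```python
def template_for_repair(
    original_template: str,
    latest_template: str,
    update_on_repair: bool = False,
) -> str:
    """Returns the instance template used to recreate a VM during a repair.

    By default the VM is recreated from the template it was created with.
    The latest template is used only if the group is configured to apply
    configuration updates during repairs.
    """
    return latest_template if update_on_repair else original_template
```

With the defaults, a VM created from instance-template-a is recreated from instance-template-a even after the MIG is updated to instance-template-b in OPPORTUNISTIC mode.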

Repair a VM in an alternate zone

Preview

This feature is subject to the "Pre-GA Offerings Terms" in the General Service Terms section of the Service Specific Terms. Pre-GA features are available "as is" and might have limited support. For more information, see the launch stage descriptions.

By default, a regional MIG repairs a failed or an unhealthy VM by recreating the VM in its original zone. You can configure a regional MIG to repair VMs in any of the MIG's selected zones. When the MIG cannot repair the VM in the original zone, then the MIG selects an alternate zone based on available capacity and quota, and recreates the VM in that zone.

When a MIG repairs a VM in a zone different from its original zone, then the URL of the VM changes because the URL contains the zone, for example, projects/example-project/zones/us-central1-b/instances/example-mig-0289gx.
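If your tooling stores VM URLs, it helps to treat the zone as a part of the URL that can change. The following sketch shows how the zone segment can be read from, and replaced in, a URL of the form shown above. The helper names are hypothetical, and us-central1-c is used only as an example of an alternate zone.

```python
def zone_of(vm_url: str) -> str:
    """Returns the zone segment that follows 'zones' in a VM URL."""
    parts = vm_url.split("/")
    return parts[parts.index("zones") + 1]


def with_zone(vm_url: str, new_zone: str) -> str:
    """Returns the VM URL with its zone segment replaced by new_zone."""
    parts = vm_url.split("/")
    parts[parts.index("zones") + 1] = new_zone
    return "/".join(parts)


url = "projects/example-project/zones/us-central1-b/instances/example-mig-0289gx"
print(zone_of(url))                     # us-central1-b
print(with_zone(url, "us-central1-c"))  # same URL with zones/us-central1-c
```

Looking up a VM by name within the MIG, rather than by a stored URL, avoids stale references after a repair in an alternate zone.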

Repairing a VM in an alternate zone has the following benefits:

  • Makes your application more resilient to zonal failures.

  • Improves resource obtainability, particularly for high-demand hardware like GPUs, VMs with a large number of cores or a large amount of memory, or Spot VMs.

For more information, see Repair a VM in an alternate zone.

Disk handling

During a repair, when recreating a VM based on its template, the MIG handles different types of disks differently. Some disk configurations can cause a repair to fail when attempting to recreate a VM.

| Disk type | autodelete | Behavior during a repair |
| --- | --- | --- |
| New persistent disk | true | Disk is recreated as specified in the instance template. Any data that was written to that disk is lost when the disk and its VM are recreated. |
| New persistent disk | false | Disk is preserved and reattached when the MIG recreates the VM. |
| Existing persistent disk | true | Old disk is deleted. VM recreate operation fails because Compute Engine cannot reattach a deleted disk to the VM. However, for existing read/write disks, a MIG can have only up to one VM because a single persistent disk cannot be attached to multiple VMs in read/write mode. |
| Existing persistent disk | false | Old disk is reattached as specified in the instance template. The data on the disk is preserved. However, for existing read/write disks, a MIG can have only up to one VM because a single persistent disk cannot be attached to multiple VMs in read/write mode. |
| New local SSD | N/A | Disk is recreated as specified in the instance template. The data on a local SSD is lost when a VM is recreated or deleted. |
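The table can also be expressed as a lookup, which is handy if you want to flag risky disk configurations before relying on repairs. This is a sketch that encodes only what the table states. The kind labels and names are hypothetical and are not API values.

```python
from typing import NamedTuple, Optional


class RepairOutcome(NamedTuple):
    repair_succeeds: bool  # whether the VM can be recreated
    data_preserved: bool   # whether data on the disk survives the repair


# Keys are (disk kind, autodelete). The kind labels are hypothetical names
# for the rows of the table. autodelete is None where it doesn't apply.
DISK_BEHAVIOR = {
    ("new-persistent", True): RepairOutcome(True, False),
    ("new-persistent", False): RepairOutcome(True, True),
    ("existing-persistent", True): RepairOutcome(False, False),
    ("existing-persistent", False): RepairOutcome(True, True),
    ("new-local-ssd", None): RepairOutcome(True, False),
}


def outcome_for(kind: str, autodelete: Optional[bool]) -> RepairOutcome:
    """Looks up the repair outcome for one disk configuration."""
    return DISK_BEHAVIOR[(kind, autodelete)]
```

For example, outcome_for("existing-persistent", True) reports that the repair fails, which matches the row where the old disk is deleted and cannot be reattached.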

The MIG does not reattach disks that are not specified in the instance template or per-instance configurations, such as disks that you attached to a VM manually after the VM was created.

To preserve important data that was written to disk, take precautions, such as the following:

If your VMs have important settings that you want to preserve, Google also recommends that you use a custom image in your instance template. A custom image contains any custom settings that you need. When you specify a custom image in your instance template, the MIG recreates VMs using the custom image that contains the custom settings you need.

Turn off repairs

You can turn off repairs that are automatically done by a MIG. When you turn off repairs in a MIG, repairing of failed VMs and repairing based on an application health check are turned off. You can also separately turn off repairs based on an application health check.
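The two switches described above combine as shown in the following sketch. It is a model of the documented behavior, not the MIG's implementation: the event labels, parameter names, and return values are hypothetical and don't correspond to API fields.

```python
def action_for(
    event: str,
    repairs_enabled: bool = True,
    health_check_repairs_enabled: bool = True,
) -> str:
    """Returns "repair" or "do_nothing" for a VM event.

    event is "vm_failed" (the VM failed or was deleted outside the MIG)
    or "health_check_failed" (the application failed its health check).
    """
    if event not in ("vm_failed", "health_check_failed"):
        raise ValueError(f"unknown event: {event}")
    if not repairs_enabled:
        # Turning off repairs covers both kinds of repair.
        return "do_nothing"
    if event == "health_check_failed" and not health_check_repairs_enabled:
        # Only repairs based on the application health check are off.
        return "do_nothing"
    return "repair"
```

Turning off only the health-check-based repairs leaves failed VMs covered: in this model, action_for("vm_failed", health_check_repairs_enabled=False) still returns "repair".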

You might want to turn off repairs in a MIG in scenarios such as the following:

  • To investigate or debug a failed VM without interruption from automatic repair.
  • To repair VMs manually or implement your own repair logic.
  • To prevent recreating VMs while a batch job is in progress.
  • To observe application health states without repairing an unhealthy VM.
  • To fine-tune health check configuration without inadvertently triggering repairs.

When you turn off repairs, the MIG doesn't take any action if a VM in the group fails or becomes unhealthy. Failed and unhealthy VMs continue to be in the group and the target number of running VMs in the MIG (targetSize) remains the same. However, if you've set a time limit for the VMs, then when the failed and unhealthy VMs reach that time limit, the MIG automatically deletes those VMs and the targetSize decreases.

If the MIG's update type is set to proactive and a new instance template is available, then the MIG updates the failed and unhealthy VMs by recreating those VMs using the new template. If you don't want to update the failed and unhealthy VMs, you must set the update type to opportunistic.

If you've configured an application-based health check, turning off repairs doesn't affect the functioning of the health check. The health check continues to probe the application and provide the VM health states. This setting lets you monitor application health states while preventing the MIG from repairing unhealthy VMs.

If the MIG is part of a backend service of a load balancer and you turn off repairs in the MIG, any unrepaired failed and unhealthy VMs don't respond to the load balancer health check. If the number of these failed or unhealthy VMs in the MIG increases, the load balancer might reduce traffic to that MIG or switch to another backend, if configured. When the failed VMs become available again, the load balancer resumes the traffic to the MIG.

For more information, see Turn off repairs in a MIG.

Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see the Google Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.

Last updated 2025-12-09 UTC.