Hibernating Guest VMs

Background

Linux supports the ability to hibernate itself in order to save power. Hibernation is sometimes called suspend-to-disk, as it writes a memory image to disk and puts the hardware into the lowest possible power state. Upon resume from hibernation, the hardware is restarted and the memory image is restored from disk so that it can resume execution where it left off. See the “Hibernation” section of System Sleep States.

Hibernation is usually done on devices with a single user, such as a personal laptop. For example, the laptop goes into hibernation when the cover is closed, and resumes when the cover is opened again. Hibernation and resume happen on the same hardware, and Linux kernel code orchestrating the hibernation steps assumes that the hardware configuration is not changed while in the hibernated state.

Hibernation can be initiated within Linux by writing “disk” to /sys/power/state or by invoking the reboot system call with the appropriate arguments. This functionality may be wrapped by user space commands such as “systemctl hibernate” that are run directly from a command line or in response to events such as the laptop lid closing.
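
For illustration, the following is a minimal user-space sketch of the two self-initiation paths just described: writing “disk” to /sys/power/state, or calling reboot(2) with the suspend-to-disk command (RB_SW_SUSPEND is glibc’s name for LINUX_REBOOT_CMD_SW_SUSPEND). Both paths require root privileges, and the program itself is illustrative rather than part of any existing tool:

    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/reboot.h>

    /* Path 1: write "disk" to /sys/power/state (the usual method). */
    static int hibernate_via_sysfs(void)
    {
        FILE *f = fopen("/sys/power/state", "w");

        if (!f)
            return -1;
        /* The write does not return until the system resumes. */
        if (fputs("disk", f) == EOF || fclose(f) == EOF)
            return -1;
        return 0;
    }

    /* Path 2: the reboot() system call with the suspend-to-disk command. */
    static int hibernate_via_reboot(void)
    {
        sync();
        return reboot(RB_SW_SUSPEND);
    }

    int main(int argc, char **argv)
    {
        int ret;

        if (argc > 1 && strcmp(argv[1], "--reboot-syscall") == 0)
            ret = hibernate_via_reboot();
        else
            ret = hibernate_via_sysfs();

        if (ret) {
            perror("hibernate");
            return 1;
        }
        return 0;
    }

In practice, “systemctl hibernate” or a similar distribution-provided command is normally used rather than a custom program.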

Considerations for Guest VM Hibernation

Linux guests on Hyper-V can also be hibernated, in which case the hardware is the virtual hardware provided by Hyper-V to the guest VM. Only the targeted guest VM is hibernated, while other guest VMs and the underlying Hyper-V host continue to run normally. While the underlying Windows Hyper-V and physical hardware on which it is running might also be hibernated using hibernation functionality in the Windows host, host hibernation and its impact on guest VMs is not in scope for this documentation.

Resuming a hibernated guest VM can be more challenging than with physical hardware because VMs make it very easy to change the hardware configuration between the hibernation and resume. Even when the resume is done on the same VM that hibernated, the memory size might be changed, or virtual NICs or SCSI controllers might be added or removed. Virtual PCI devices assigned to the VM might be added or removed. Most such changes cause the resume steps to fail, though adding a new virtual NIC, SCSI controller, or vPCI device should work.

Additional complexity can ensue because the disks of the hibernated VM can be moved to another newly created VM that otherwise has the same virtual hardware configuration. While it is desirable for resume from hibernation to succeed after such a move, there are challenges. See details on this scenario and its limitations in the “Resuming on a Different VM” section below.

Hyper-V also provides ways to move a VM from one Hyper-V host to another. Hyper-V tries to ensure processor model and Hyper-V version compatibility using VM Configuration Versions, and prevents moves to a host that isn’t compatible. Linux adapts to host and processor differences by detecting them at boot time, but such detection is not done when resuming execution in the hibernation image. If a VM is hibernated on one host, then resumed on a host with a different processor model or Hyper-V version, settings recorded in the hibernation image may not match the new host. Because Linux does not detect such mismatches when resuming the hibernation image, undefined behavior and failures could result.

Enabling Guest VM Hibernation

Hibernation of a Hyper-V guest VM is disabled by default because hibernation is incompatible with memory hot-add, as provided by the Hyper-V balloon driver. If hot-add is used and the VM hibernates, it hibernates with more memory than it started with. But when the VM resumes from hibernation, Hyper-V gives the VM only the originally assigned memory, and the memory size mismatch causes resume to fail.

To enable a Hyper-V VM for hibernation, the Hyper-V administrator must enable the ACPI virtual S4 sleep state in the ACPI configuration that Hyper-V provides to the guest VM. Such enablement is accomplished by modifying a WMI property of the VM, the steps for which are outside the scope of this documentation but are available on the web. Enablement is treated as the indicator that the administrator prioritizes Linux hibernation in the VM over hot-add, so the Hyper-V balloon driver in Linux disables hot-add. Enablement is indicated if /sys/power/disk contains “platform” as an option. The enablement is also visible in /sys/bus/vmbus/hibernation. See function hv_is_hibernation_supported().
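
As a quick check, the two indicators mentioned above can be read from user space. The sketch below uses the sysfs paths given in the text and assumes that /sys/bus/vmbus/hibernation reads “1” when hv_is_hibernation_supported() reports support; the helper program itself is hypothetical:

    #include <stdio.h>
    #include <string.h>

    /* Return 1 if the file at 'path' contains the string 'needle'. */
    static int file_contains(const char *path, const char *needle)
    {
        char buf[256];
        size_t n;
        FILE *f = fopen(path, "r");

        if (!f)
            return 0;
        n = fread(buf, 1, sizeof(buf) - 1, f);
        fclose(f);
        buf[n] = '\0';
        return strstr(buf, needle) != NULL;
    }

    int main(void)
    {
        /* "platform" in /sys/power/disk indicates the ACPI S4 state is enabled. */
        int s4 = file_contains("/sys/power/disk", "platform");
        /* Assumed: /sys/bus/vmbus/hibernation reads "1" when supported. */
        int vmbus = file_contains("/sys/bus/vmbus/hibernation", "1");

        printf("ACPI S4 (platform): %s\n", s4 ? "available" : "not available");
        printf("vmbus hibernation:  %s\n", vmbus ? "supported" : "not supported");
        return (s4 && vmbus) ? 0 : 1;
    }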

Linux supports ACPI sleep states on x86, but not on arm64. So Linux guest VM hibernation is not available on Hyper-V for arm64.

Initiating Guest VM Hibernation

Guest VMs can self-initiate hibernation using the standard Linux methods of writing “disk” to /sys/power/state or the reboot system call. As an additional layer, Linux guests on Hyper-V support the “Shutdown” integration service, via which a Hyper-V administrator can tell a Linux VM to hibernate using a command outside the VM. The command generates a request to the Hyper-V shutdown driver in Linux, which sends the uevent “EVENT=hibernate”. See kernel functions shutdown_onchannelcallback() and send_hibernate_uevent(). A udev rule must be provided in the VM that handles this event and initiates hibernation.
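
In practice the uevent is normally handled by a udev rule that runs a command such as “systemctl hibernate”. Purely to illustrate the flow, the hedged sketch below uses libudev to watch kernel uevents for an “EVENT=hibernate” property and then writes “disk” to /sys/power/state; it is an illustration, not the recommended mechanism (build with -ludev):

    #include <libudev.h>
    #include <poll.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        struct udev *udev = udev_new();
        struct udev_monitor *mon;
        struct pollfd pfd;

        if (!udev)
            return 1;
        mon = udev_monitor_new_from_netlink(udev, "kernel");
        if (!mon)
            return 1;
        udev_monitor_enable_receiving(mon);
        pfd.fd = udev_monitor_get_fd(mon);
        pfd.events = POLLIN;

        for (;;) {
            struct udev_device *dev;
            const char *ev;

            if (poll(&pfd, 1, -1) <= 0)
                continue;
            dev = udev_monitor_receive_device(mon);
            if (!dev)
                continue;
            /* The Hyper-V shutdown driver attaches EVENT=hibernate to its uevent. */
            ev = udev_device_get_property_value(dev, "EVENT");
            if (ev && strcmp(ev, "hibernate") == 0) {
                FILE *f = fopen("/sys/power/state", "w");

                if (f) {
                    fputs("disk", f);
                    fclose(f);
                }
            }
            udev_device_unref(dev);
        }
    }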

Handling VMBus Devices During Hibernation & Resume

The VMBus bus driver, and the individual VMBus device drivers, implement suspend and resume functions that are called as part of the Linux orchestration of hibernation and of resuming from hibernation. The overall approach is to leave in place the data structures for the primary VMBus channels and their associated Linux devices, such as SCSI controllers and others, so that they are captured in the hibernation image. This approach allows any state associated with the device to be persisted across the hibernation/resume. When the VM resumes, the devices are re-offered by Hyper-V and are connected to the data structures that already exist in the resumed hibernation image.

VMBus devices are identified by class and instance GUID. (See section “VMBus device creation/deletion” in VMBus.) Upon resume from hibernation, the resume functions expect that the devices offered by Hyper-V have the same class/instance GUIDs as the devices present at the time of hibernation. Having the same class/instance GUIDs allows the offered devices to be matched to the primary VMBus channel data structures in the memory of the now resumed hibernation image. If any devices are offered that don’t match primary VMBus channel data structures that already exist, they are processed normally as newly added devices. If primary VMBus channels that exist in the resumed hibernation image are not matched with a device offered in the resumed VM, the resume sequence waits for 10 seconds, then proceeds. But the unmatched device is likely to cause errors in the resumed VM.

When resuming existing primary VMBus channels, the newly offered relids might be different because relids can change on each VM boot, even if the VM configuration hasn’t changed. The VMBus bus driver resume function matches the class/instance GUIDs, and updates the relids in case they have changed.

VMBus sub-channels are not persisted in the hibernation image. Each VMBus device driver’s suspend function must close any sub-channels prior to hibernation. Closing a sub-channel causes Hyper-V to send a RESCIND_CHANNELOFFER message, which Linux processes by freeing the channel data structures so that all vestiges of the sub-channel are removed. By contrast, primary channels are marked closed and their ring buffers are freed, but Hyper-V does not send a rescind message, so the channel data structure continues to exist. Upon resume, the device driver’s resume function re-allocates the ring buffer and re-opens the existing channel. It then communicates with Hyper-V to re-open sub-channels from scratch.
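
The sketch below shows what a VMBus device driver’s suspend and resume callbacks can look like under this scheme. vmbus_open(), vmbus_close(), hv_get_drvdata(), and the .suspend/.resume members of struct hv_driver are existing kernel interfaces; the “hv_foo” driver, its foo_* helpers (declared but not implemented here), and the ring buffer size are hypothetical placeholders, so this is not the code of any particular in-tree driver:

    #include <linux/hyperv.h>

    #define FOO_RING_SIZE   (16 * 4096)   /* illustrative ring buffer size */

    struct foo_device;                    /* hypothetical per-device state */

    /* Hypothetical helpers; a real driver has its own equivalents. */
    static void foo_onchannelcallback(void *context);
    static void foo_close_subchannels(struct foo_device *foo);
    static int foo_request_subchannels(struct foo_device *foo);

    static int hv_foo_suspend(struct hv_device *hv_dev)
    {
        struct foo_device *foo = hv_get_drvdata(hv_dev);

        /*
         * Close sub-channels first.  Hyper-V rescinds them, so their state
         * is freed and none of it lands in the hibernation image.
         */
        foo_close_subchannels(foo);

        /*
         * Close the primary channel.  Its ring buffers are freed, but the
         * channel structure stays allocated and is captured in the image.
         */
        vmbus_close(hv_dev->channel);
        return 0;
    }

    static int hv_foo_resume(struct hv_device *hv_dev)
    {
        struct foo_device *foo = hv_get_drvdata(hv_dev);
        int ret;

        /* Re-allocate ring buffers and re-open the preserved primary channel. */
        ret = vmbus_open(hv_dev->channel, FOO_RING_SIZE, FOO_RING_SIZE,
                         NULL, 0, foo_onchannelcallback, foo);
        if (ret)
            return ret;

        /* Sub-channels were rescinded at suspend time; request them anew. */
        return foo_request_subchannels(foo);
    }

    static struct hv_driver hv_foo_drv = {
        .name    = "hv_foo",
        /* .id_table, .probe, .remove, ... */
        .suspend = hv_foo_suspend,
        .resume  = hv_foo_resume,
    };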

The Linux ends of Hyper-V sockets are forced closed at the time of hibernation. The guest can’t force closing the host end of the socket, but any host-side actions on the host end will produce an error.

VMBus devices use the same suspend function for the “freeze” and the “poweroff” phases, and the same resume function for the “thaw” and “restore” phases. See the “Entering Hibernation” section of Device Power Management Basics for the sequencing of the phases.
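
As a generic illustration of that phase mapping (not a claim about how the Hyper-V drivers register their callbacks), the kernel’s SET_SYSTEM_SLEEP_PM_OPS() helper points all four of those phases, plus ordinary suspend/resume, at a single pair of functions; foo_suspend()/foo_resume() below are hypothetical:

    #include <linux/pm.h>

    static int foo_suspend(struct device *dev) { return 0; }
    static int foo_resume(struct device *dev) { return 0; }

    /*
     * SET_SYSTEM_SLEEP_PM_OPS() wires .freeze and .poweroff to the suspend
     * function and .thaw and .restore to the resume function, so one pair
     * of callbacks serves all of the hibernation phases.
     */
    static const struct dev_pm_ops foo_pm_ops = {
        SET_SYSTEM_SLEEP_PM_OPS(foo_suspend, foo_resume)
    };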

Detailed Hibernation Sequence

  1. The Linux power management (PM) subsystem prepares for hibernation by freezing user space processes and allocating memory to hold the hibernation image.

  2. As part of the “freeze” phase, Linux PM calls the “suspend” function for each VMBus device in turn. As described above, this function removes sub-channels, and leaves the primary channel in a closed state.

  3. Linux PM calls the “suspend” function for the VMBus bus, which closes any Hyper-V socket channels and unloads the top-level VMBus connection with the Hyper-V host.

  4. Linux PM disables non-boot CPUs, creates the hibernation image in the previously allocated memory, then re-enables non-boot CPUs. The hibernation image contains the memory data structures for the closed primary channels, but no sub-channels.

  5. As part of the “thaw” phase, Linux PM calls the “resume” function for the VMBus bus, which re-establishes the top-level VMBus connection and requests that Hyper-V re-offer the VMBus devices. As offers are received for the primary channels, the relids are updated as previously described.

  6. Linux PM calls the “resume” function for each VMBus device. Each device re-opens its primary channel, and communicates with Hyper-V to re-establish sub-channels if appropriate. The sub-channels are re-created as new channels since they were previously removed entirely in Step 2.

  7. With VMBus devices now working again, Linux PM writes the hibernation image from memory to disk.

  8. Linux PM repeats Steps 2 and 3 above as part of the “poweroff” phase. VMBus channels are closed and the top-level VMBus connection is unloaded.

  9. Linux PM disables non-boot CPUs, and then enters ACPI sleep state S4. Hibernation is now complete.

Detailed Resume Sequence

  1. The guest VM boots into a fresh Linux OS instance. During boot, the top-level VMBus connection is established, and synthetic devices are enabled. This happens via the normal paths that don’t involve hibernation.

  2. Linux PM hibernation code reads swap space to find and read the hibernation image into memory. If there is no hibernation image, then this boot becomes a normal boot.

  3. If this is a resume from hibernation, the “freeze” phase is used to shut down VMBus devices and unload the top-level VMBus connection in the running fresh OS instance, just like Steps 2 and 3 in the hibernation sequence.

  4. Linux PM disables non-boot CPUs, and transfers control to the read-in hibernation image. In the now-running hibernation image, non-boot CPUs are restarted.

  5. As part of the “resume” phase, Linux PM repeats Steps 5 and 6 from the hibernation sequence. The top-level VMBus connection is re-established, and offers are received and matched to primary channels in the image. Relids are updated. VMBus device resume functions re-open primary channels and re-create sub-channels.

  6. Linux PM exits the hibernation resume sequence and the VM is now running normally from the hibernation image.

Key-Value Pair (KVP) Pseudo-Device Anomalies

The VMBus KVP device behaves differently from other pseudo-devices offered by Hyper-V. When the KVP primary channel is closed, Hyper-V sends a rescind message, which causes all vestiges of the device to be removed. But Hyper-V then re-offers the device, causing it to be newly re-created. The removal and re-creation occur during the “freeze” phase of hibernation, so the hibernation image contains the re-created KVP device. Similar behavior occurs during the “freeze” phase of the resume sequence while still in the fresh OS instance. But in both cases, the top-level VMBus connection is subsequently unloaded, which causes the device to be discarded on the Hyper-V side. So no harm is done and everything still works.

Virtual PCI devices

Virtual PCI devices are physical PCI devices that are mapped directly into the VM’s physical address space so the VM can interact directly with the hardware. vPCI devices include those accessed via what Hyper-V calls “Discrete Device Assignment” (DDA), as well as SR-IOV NIC Virtual Functions (VF) devices. See PCI pass-thru devices.

Hyper-V DDA devices are offered to guest VMs after the top-level VMBus connection is established, just like VMBus synthetic devices. They are statically assigned to the VM, and their instance GUIDs don’t change unless the Hyper-V administrator makes changes to the configuration. DDA devices are represented in Linux as virtual PCI devices that have a VMBus identity as well as a PCI identity. Consequently, Linux guest hibernation first handles DDA devices as VMBus devices in order to manage the VMBus channel. But then they are also handled as PCI devices using the hibernation functions implemented by their native PCI driver.

SR-IOV NIC VFs also have a VMBus identity as well as a PCI identity, and overall are processed similarly to DDA devices. A difference is that VFs are not offered to the VM during initial boot of the VM. Instead, the VMBus synthetic NIC driver first starts operating and communicates to Hyper-V that it is prepared to accept a VF, and then the VF offer is made. However, the VMBus connection might later be unloaded and then re-established without the VM being rebooted, as happens in Steps 3 and 5 in the Detailed Hibernation Sequence above and in the Detailed Resume Sequence. In such a case, the VFs likely became part of the VM during initial boot, so when the VMBus connection is re-established, the VFs are offered on the re-established connection without intervention by the synthetic NIC driver.

UIO Devices

A VMBus device can be exposed to user space using the Hyper-V UIO driver (uio_hv_generic.c) so that a user space driver can control and operate the device. However, the VMBus UIO driver does not support the suspend and resume operations needed for hibernation. If a VMBus device is configured to use the UIO driver, hibernating the VM fails and Linux continues to run normally. The most common use of the Hyper-V UIO driver is for DPDK networking, but there are other uses as well.
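
Before initiating hibernation, an administrator may therefore want to confirm that no VMBus device is bound to uio_hv_generic. The sketch below walks the standard sysfs layout, where /sys/bus/vmbus/devices/<device>/driver is a symlink to the bound driver; the check itself is illustrative and not part of an existing tool:

    #include <dirent.h>
    #include <limits.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        const char *base = "/sys/bus/vmbus/devices";
        DIR *dir = opendir(base);
        struct dirent *de;
        int found = 0;

        if (!dir) {
            perror(base);
            return 1;
        }
        while ((de = readdir(dir)) != NULL) {
            char link[PATH_MAX], target[PATH_MAX];
            ssize_t n;

            if (de->d_name[0] == '.')
                continue;
            snprintf(link, sizeof(link), "%s/%s/driver", base, de->d_name);
            n = readlink(link, target, sizeof(target) - 1);
            if (n < 0)
                continue;        /* no driver bound to this device */
            target[n] = '\0';
            if (strstr(target, "uio_hv_generic")) {
                printf("%s is bound to uio_hv_generic\n", de->d_name);
                found = 1;
            }
        }
        closedir(dir);
        return found;            /* non-zero exit if any UIO-bound device exists */
    }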

Resuming on a Different VM

This scenario occurs in the Azure public cloud in that a hibernated customer VM only exists as saved configuration and disks -- the VM no longer exists on any Hyper-V host. When the customer VM is resumed, a new Hyper-V VM with identical configuration is created, likely on a different Hyper-V host. That new Hyper-V VM becomes the resumed customer VM, and the steps the Linux kernel takes to resume from the hibernation image must work in that new VM.

While the disks and their contents are preserved from the original VM, the Hyper-V-provided VMBus instance GUIDs of the disk controllers and other synthetic devices would typically be different. The difference would cause the resume from hibernation to fail, so several things are done to solve this problem:

  • For VMBus synthetic devices that support only a single instance, Hyper-V always assigns the same instance GUIDs. For example, the Hyper-V mouse, the shutdown pseudo-device, the time sync pseudo-device, etc., always have the same instance GUID, both for local Hyper-V installs as well as in the Azure cloud.

  • VMBus synthetic SCSI controllers may have multiple instances in a VM, and in the general case instance GUIDs vary from VM to VM. However, Azure VMs always have exactly two synthetic SCSI controllers, and Azure code overrides the normal Hyper-V behavior so these controllers are always assigned the same two instance GUIDs. Consequently, when a customer VM is resumed on a newly created VM, the instance GUIDs match. But this guarantee does not hold for local Hyper-V installs.

  • Similarly, VMBus synthetic NICs may have multiple instances in a VM, and the instance GUIDs vary from VM to VM. Again, Azure code overrides the normal Hyper-V behavior so that the instance GUID of a synthetic NIC in a customer VM does not change, even if the customer VM is deallocated or hibernated, and then re-constituted on a newly created VM. As with SCSI controllers, this behavior does not hold for local Hyper-V installs.

  • vPCI devices do not have the same instance GUIDs when resuming from hibernation on a newly created VM. Consequently, Azure does not support hibernation for VMs that have DDA devices such as NVMe controllers or GPUs. For SR-IOV NIC VFs, Azure removes the VF from the VM before it hibernates so that the hibernation image does not contain a VF device. When the VM is resumed, it instantiates a new VF, rather than trying to match against a VF that is present in the hibernation image. Because Azure must remove any VFs before initiating hibernation, Azure VM hibernation must be initiated externally from the Azure Portal or Azure CLI, which in turn uses the Shutdown integration service to tell Linux to do the hibernation. If hibernation is self-initiated within the Azure VM, VFs remain in the hibernation image, and are not resumed properly.

In summary, Azure takes special actions to remove VFs and to ensure that VMBus device instance GUIDs match on a new/different VM, allowing hibernation to work for most general-purpose Azure VM sizes. While similar special actions could be taken when resuming on a different VM on a local Hyper-V install, orchestrating such actions is not provided out-of-the-box by local Hyper-V and so requires custom scripting.