Confidential Computing VMs¶
Hyper-V can create and run Linux guests that are Confidential Computing (CoCo) VMs. Such VMs cooperate with the physical processor to better protect the confidentiality and integrity of data in the VM’s memory, even in the face of a hypervisor/VMM that has been compromised and may behave maliciously. CoCo VMs on Hyper-V share the generic CoCo VM threat model and security objectives described in Confidential Computing in Linux for x86 virtualization. Note that Hyper-V specific code in Linux refers to CoCo VMs as “isolated VMs” or “isolation VMs”.
A Linux CoCo VM on Hyper-V requires the cooperation and interaction of the following:
Physical hardware with a processor that supports CoCo VMs
The hardware runs a version of Windows/Hyper-V with support for CoCo VMs
The VM runs a version of Linux that supports being a CoCo VM
The physical hardware requirements are as follows:
AMD processor with SEV-SNP. Hyper-V does not run guest VMs with AMD SME, SEV, or SEV-ES encryption, and such encryption is not sufficient for a CoCo VM on Hyper-V.
Intel processor with TDX
To create a CoCo VM, the “Isolated VM” attribute must be specified to Hyper-V when the VM is created. A VM cannot be changed from a CoCo VM to a normal VM, or vice versa, after it is created.
Operational Modes¶
Hyper-V CoCo VMs can run in two modes. The mode is selected when the VM is created and cannot be changed during the life of the VM.
Fully-enlightened mode. In this mode, the guest operating system is enlightened to understand and manage all aspects of running as a CoCo VM.
Paravisor mode. In this mode, a paravisor layer between the guest and the host provides some operations needed to run as a CoCo VM. The guest operating system can have fewer CoCo enlightenments than is required in the fully-enlightened case.
Conceptually, fully-enlightened mode and paravisor mode may be treated as points on a spectrum spanning the degree of guest enlightenment needed to run as a CoCo VM. Fully-enlightened mode is one end of the spectrum. A full implementation of paravisor mode is the other end of the spectrum, where all aspects of running as a CoCo VM are handled by the paravisor, and a normal guest OS with no knowledge of memory encryption or other aspects of CoCo VMs can run successfully. However, the Hyper-V implementation of paravisor mode does not go this far, and is somewhere in the middle of the spectrum. Some aspects of CoCo VMs are handled by the Hyper-V paravisor while the guest OS must be enlightened for other aspects. Unfortunately, there is no standardized enumeration of the features/functions that might be provided in the paravisor, and there is no standardized mechanism for a guest OS to query the paravisor for the features/functions it provides. The understanding of what the paravisor provides is hard-coded in the guest OS.
Paravisor mode has similarities to the Coconut project, which aims to provide a limited paravisor offering services to the guest such as a virtual TPM. However, the Hyper-V paravisor generally handles more aspects of CoCo VMs than is currently envisioned for Coconut, and so is further toward the “no guest enlightenments required” end of the spectrum.
In the CoCo VM threat model, the paravisor is in the guest security domain and must be trusted by the guest OS. By implication, the hypervisor/VMM must protect itself against a potentially malicious paravisor just like it protects against a potentially malicious guest.
The hardware architectural approach to fully-enlightened vs. paravisor modevaries depending on the underlying processor.
With AMD SEV-SNP processors, in fully-enlightened mode the guest OS runs in VMPL 0 and has full control of the guest context. In paravisor mode, the guest OS runs in VMPL 2 and the paravisor runs in VMPL 0. The paravisor running in VMPL 0 has privileges that the guest OS in VMPL 2 does not have. Certain operations require the guest to invoke the paravisor. Furthermore, in paravisor mode the guest OS operates in “virtual Top Of Memory” (vTOM) mode as defined by the SEV-SNP architecture. This mode simplifies guest management of memory encryption when a paravisor is used.
With Intel TDX processors, in fully-enlightened mode the guest OS runs in an L1 VM. In paravisor mode, TD partitioning is used. The paravisor runs in the L1 VM, and the guest OS runs in a nested L2 VM.
Hyper-V exposes a synthetic MSR to guests that describes the CoCo mode. This MSR indicates if the underlying processor uses AMD SEV-SNP or Intel TDX, and whether a paravisor is being used. It is straightforward to build a single kernel image that can boot and run properly on either architecture, and in either mode.
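As a rough illustration, a guest might branch on the reported isolation type and paravisor presence as in the hedged sketch below. The helper names and the ms_hyperv.paravisor_present field reflect the current mainline Hyper-V code as understood here and should be treated as an assumption rather than a stable API:

  #include <linux/printk.h>
  #include <asm/mshyperv.h>

  static void hv_report_coco_mode(void)
  {
      if (hv_isolation_type_snp())
          pr_info("Hyper-V CoCo VM: AMD SEV-SNP\n");
      else if (hv_isolation_type_tdx())
          pr_info("Hyper-V CoCo VM: Intel TDX\n");
      else
          return;                        /* not an isolated VM */

      if (ms_hyperv.paravisor_present)
          pr_info("Hyper-V CoCo VM: paravisor mode\n");
      else
          pr_info("Hyper-V CoCo VM: fully-enlightened mode\n");
  }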
Paravisor Effects¶
Running in paravisor mode affects the following areas of generic Linux kernel CoCo VM functionality:
Initial guest memory setup. When a new VM is created in paravisor mode, the paravisor runs first and sets up the guest physical memory as encrypted. The guest Linux does normal memory initialization, except for explicitly marking appropriate ranges as decrypted (shared). In paravisor mode, Linux does not perform the early boot memory setup steps that are particularly tricky with AMD SEV-SNP in fully-enlightened mode.
#VC/#VE exception handling. In paravisor mode, Hyper-V configures the guest CoCo VM to route #VC and #VE exceptions to VMPL 0 and the L1 VM, respectively, and not the guest Linux. Consequently, these exception handlers do not run in the guest Linux and are not a required enlightenment for a Linux guest in paravisor mode.
CPUID flags. Both AMD SEV-SNP and Intel TDX provide a CPUID flag in the guest indicating that the VM is operating with the respective hardware support. While these CPUID flags are visible in fully-enlightened CoCo VMs, the paravisor filters out these flags and the guest Linux does not see them. Throughout the Linux kernel, explicitly testing these flags has mostly been eliminated in favor of the cc_platform_has() function (see the sketch after this list), with the goal of abstracting the differences between SEV-SNP and TDX. But the cc_platform_has() abstraction also allows the Hyper-V paravisor configuration to selectively enable aspects of CoCo VM functionality even when the CPUID flags are not set. The exception is early boot memory setup on SEV-SNP, which tests the CPUID SEV-SNP flag. But not having the flag in a Hyper-V paravisor mode VM achieves the desired effect of not running SEV-SNP specific early boot memory setup.
Device emulation. In paravisor mode, the Hyper-V paravisor provides emulation of devices such as the IO-APIC and TPM. Because the emulation happens in the paravisor in the guest context (instead of the hypervisor/VMM context), MMIO accesses to these devices must be encrypted references instead of the decrypted references that would be used in a fully-enlightened CoCo VM. The __ioremap_caller() function has been enhanced to make a callback to check whether a particular address range should be treated as encrypted (private). See the “is_private_mmio” callback.
Encrypt/decrypt memory transitions. In a CoCo VM, transitioning guest memory between encrypted and decrypted requires coordinating with the hypervisor/VMM. This is done via callbacks invoked from __set_memory_enc_pgtable(). In fully-enlightened mode, the normal SEV-SNP and TDX implementations of these callbacks are used. In paravisor mode, a Hyper-V specific set of callbacks is used. These callbacks invoke the paravisor so that the paravisor can coordinate the transitions and inform the hypervisor as necessary. See hv_vtom_init() where these callbacks are set up.
Interrupt injection. In fully-enlightened mode, a malicious hypervisor could inject interrupts into the guest OS at times that violate x86/x64 architectural rules. For full protection, the guest OS should include enlightenments that use the interrupt injection management features provided by CoCo-capable processors. In paravisor mode, the paravisor mediates interrupt injection into the guest OS, and ensures that the guest OS only sees interrupts that are “legal”. The paravisor uses the interrupt injection management features provided by the CoCo-capable physical processor, thereby masking these complexities from the guest OS.
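As a rough illustration of the cc_platform_has() abstraction described above, a caller that wants to share a page with the host can test a CC attribute instead of the SEV-SNP or TDX CPUID flags. The sketch below is illustrative only and is not taken from the Hyper-V code:

  #include <linux/cc_platform.h>
  #include <linux/set_memory.h>

  static int example_share_page_with_host(unsigned long vaddr)
  {
      /*
       * True for SEV-SNP, TDX, and Hyper-V paravisor (vTOM) guests, even
       * though the paravisor hides the SEV-SNP/TDX CPUID flags.
       */
      if (!cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT))
          return 0;                      /* nothing to do in a normal VM */

      /* Transition the page to decrypted so the host can access it. */
      return set_memory_decrypted(vaddr, 1);
  }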
Hyper-V Hypercalls¶
When in fully-enlightened mode, hypercalls made by the Linux guest are routed directly to the hypervisor, just as in a non-CoCo VM. But in paravisor mode, normal hypercalls trap to the paravisor first, which may in turn invoke the hypervisor. However, the paravisor is idiosyncratic in this regard, and a few hypercalls made by the Linux guest must always be routed directly to the hypervisor. These hypercall sites test for a paravisor being present, and use a special invocation sequence. See hv_post_message(), for example.
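A hedged sketch of that pattern, loosely modeled on hv_post_message(): hv_do_hypercall() is the normal hypercall path, while the two “direct” helpers below are hypothetical stand-ins for the architecture-specific entry points (GHCB-based on SEV-SNP, tdcall-based on TDX) that bypass the paravisor:

  #include <asm/mshyperv.h>

  /* Hypothetical stand-ins for the paths that bypass the paravisor: */
  u64 example_snp_direct_hypercall(u64 control, void *in, void *out);
  u64 example_tdx_direct_hypercall(u64 control, void *in, void *out);

  static u64 example_hypercall_to_hypervisor(u64 control, void *in, void *out)
  {
      if (!ms_hyperv.paravisor_present)
          return hv_do_hypercall(control, in, out);

      /*
       * A paravisor is present, but this particular hypercall must reach
       * the hypervisor directly rather than trapping to the paravisor.
       */
      if (hv_isolation_type_snp())
          return example_snp_direct_hypercall(control, in, out);

      return example_tdx_direct_hypercall(control, in, out);
  }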
Guest communication with Hyper-V¶
Separate from the generic Linux kernel handling of memory encryption in Linux CoCo VMs, Hyper-V has VMBus and VMBus devices that communicate using memory shared between the Linux guest and the host. This shared memory must be marked decrypted to enable communication. Furthermore, since the threat model includes a compromised and potentially malicious host, the guest must guard against leaking any unintended data to the host through this shared memory.
These Hyper-V and VMBus memory pages are marked as decrypted:
VMBus monitor pages
Synthetic interrupt controller (SynIC) related pages (unless supplied by the paravisor)
Per-cpu hypercall input and output pages (unless running with a paravisor)
VMBus ring buffers. The direct mapping is marked decrypted in __vmbus_establish_gpadl(). The secondary mapping created in hv_ringbuffer_init() must also include the “decrypted” attribute.
When the guest writes data to memory that is shared with the host, it must ensure that only the intended data is written. Padding or unused fields must be initialized to zeros before copying into the shared memory so that random kernel data is not inadvertently given to the host.
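A minimal sketch of that rule, assuming a made-up message layout; the point is that the local copy is fully zeroed before individual fields are filled in and the result is copied into the decrypted (host-visible) page:

  #include <linux/minmax.h>
  #include <linux/string.h>
  #include <linux/types.h>

  struct example_host_msg {              /* hypothetical message layout */
      u32 type;
      u32 len;
      u8  payload[48];                   /* may be only partially used */
  };

  static void send_example_msg(void *shared_page, const void *data, u32 len)
  {
      struct example_host_msg msg;

      memset(&msg, 0, sizeof(msg));      /* padding/unused bytes become zero */
      msg.type = 1;
      msg.len = min_t(u32, len, sizeof(msg.payload));
      memcpy(msg.payload, data, msg.len);

      memcpy(shared_page, &msg, sizeof(msg));   /* into decrypted memory */
  }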
Similarly, when the guest reads memory that is shared with the host, it must validate the data before acting on it so that a malicious host cannot induce the guest to expose unintended data. Doing such validation can be tricky because the host can modify the shared memory areas even while or after validation is performed. For messages passed from the host to the guest in a VMBus ring buffer, the length of the message is validated, and the message is copied into a temporary (encrypted) buffer for further validation and processing. The copying adds a small amount of overhead, but is the only way to protect against a malicious host. See hv_pkt_iter_first().
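A minimal sketch of the copy-before-validate pattern, with an illustrative descriptor layout (hv_pkt_iter_first() implements the real thing for VMBus ring buffers):

  #include <linux/compiler.h>
  #include <linux/errno.h>
  #include <linux/string.h>
  #include <linux/types.h>

  struct example_desc {                  /* hypothetical packet descriptor */
      u32 len;
      u8  data[];
  };

  static int consume_host_packet(const void *shared, u32 avail,
                                 void *priv_buf, u32 priv_len)
  {
      const struct example_desc *d = shared;
      u32 len = READ_ONCE(d->len);       /* length as currently visible in shared memory */

      if (len < sizeof(*d) || len > avail || len > priv_len)
          return -EINVAL;

      /*
       * Snapshot the whole packet into encrypted (private) memory.  All
       * further validation and processing uses priv_buf, which the host
       * cannot modify after the copy.
       */
      memcpy(priv_buf, shared, len);
      return 0;
  }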
Many drivers for VMBus devices have been “hardened” by adding code to fully validate messages received over VMBus, instead of assuming that Hyper-V is acting cooperatively. Such drivers are marked as “allowed_in_isolated” in the vmbus_devs[] table. Other drivers for VMBus devices that are not needed in a CoCo VM have not been hardened, and they are not allowed to load in a CoCo VM. See vmbus_is_valid_offer() where such devices are excluded.
Two VMBus devices depend on the Hyper-V host to do DMA data transfers: storvsc for disk I/O and netvsc for network I/O. storvsc uses the normal Linux kernel DMA APIs, and so bounce buffering through decrypted swiotlb memory is done implicitly. netvsc has two modes for data transfers. The first mode goes through send and receive buffer space that is explicitly allocated by the netvsc driver, and is used for most smaller packets. These send and receive buffers are marked decrypted by __vmbus_establish_gpadl(). Because the netvsc driver explicitly copies packets to/from these buffers, the equivalent of bounce buffering between encrypted and decrypted memory is already part of the data path. The second mode uses the normal Linux kernel DMA APIs, and is bounce buffered through swiotlb memory implicitly like in storvsc.
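For illustration, a driver that uses the standard DMA API (as storvsc and netvsc’s second mode do) needs nothing CoCo-specific; in a CoCo VM the bounce through decrypted swiotlb memory happens inside the mapping call. A hedged sketch with a placeholder device and buffer:

  #include <linux/dma-mapping.h>
  #include <linux/errno.h>

  static int example_dma_to_host(struct device *dev, void *buf, size_t len)
  {
      dma_addr_t dma;

      /* In a CoCo VM this transparently bounces through decrypted swiotlb. */
      dma = dma_map_single(dev, buf, len, DMA_TO_DEVICE);
      if (dma_mapping_error(dev, dma))
          return -ENOMEM;

      /* ... hand 'dma' to the device / VSP and wait for completion ... */

      dma_unmap_single(dev, dma, len, DMA_TO_DEVICE);
      return 0;
  }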
Finally, the VMBus virtual PCI driver needs special handling in a CoCo VM. Linux PCI device drivers access PCI config space using standard APIs provided by the Linux PCI subsystem. On Hyper-V, these functions directly access MMIO space, and the access traps to Hyper-V for emulation. But in CoCo VMs, memory encryption prevents Hyper-V from reading the guest instruction stream to emulate the access. So in a CoCo VM, these functions must make a hypercall with arguments explicitly describing the access. See _hv_pcifront_read_config() and _hv_pcifront_write_config() and the “use_calls” flag indicating to use hypercalls.
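A hedged sketch of the “use_calls” decision; both back ends below are hypothetical stand-ins, since the real logic lives in _hv_pcifront_read_config() and _hv_pcifront_write_config():

  #include <linux/io.h>
  #include <linux/types.h>

  /* Hypothetical back ends for the two access methods: */
  u32 example_cfg_read_mmio(void __iomem *cfg_base, int where, int size);
  u32 example_cfg_read_hypercall(u16 device_id, int where, int size);

  static u32 example_pci_cfg_read(bool use_calls, void __iomem *cfg_base,
                                  u16 device_id, int where, int size)
  {
      if (use_calls)
          /* CoCo VM: describe the access explicitly in a hypercall. */
          return example_cfg_read_hypercall(device_id, where, size);

      /* Non-CoCo VM: MMIO access that Hyper-V traps and emulates. */
      return example_cfg_read_mmio(cfg_base, where, size);
  }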
Confidential VMBus¶
Confidential VMBus allows the guest to avoid interacting with the untrusted host partition and the untrusted hypervisor. Instead, the guest relies on the trusted paravisor to communicate with the devices that process sensitive data. The hardware (SNP or TDX) encrypts the guest memory and register state, and the paravisor image is measured by the platform security processor, to ensure trusted and confidential computing.
Confidential VMBus provides a secure communication channel between the guestand the paravisor, ensuring that sensitive data is protected from hypervisor-level access through memory encryption and register state isolation.
Confidential VMBus is an extension of Confidential Computing (CoCo) VMs (a.k.a. “Isolated” VMs in Hyper-V terminology). Without Confidential VMBus, guest VMBus device drivers (the “VSC”s in VMBus terminology) communicate with VMBus servers (the VSPs) running on the Hyper-V host. The communication must be through memory that has been decrypted so the host can access it. With Confidential VMBus, one or more of the VSPs reside in the trusted paravisor layer in the guest VM. Since the paravisor layer also operates in encrypted memory, the memory used for communication with such VSPs does not need to be decrypted and thereby exposed to the Hyper-V host. The paravisor is responsible for communicating securely with the Hyper-V host as necessary.
The data is transferred directly between the VM and a vPCI device (a.k.a. a PCI pass-thru device, see PCI pass-thru devices) that is directly assigned to VTL2 and that supports encrypted memory. In such a case, neither the host partition nor the hypervisor has any access to the data. The guest needs to establish a VMBus connection only with the paravisor for the channels that process sensitive data, and the paravisor abstracts away the details of communicating with the specific devices, providing the guest with the well-established VSP (Virtual Service Provider) interface that has had support in the Hyper-V drivers for a decade.
If the device does not support encrypted memory, the paravisor provides bounce-buffering, and although the data is not encrypted, the backing pages aren’t mapped into the host partition through SLAT. While not impossible, it becomes much more difficult for the host partition to exfiltrate the data than it would be with a conventional VMBus connection where the host partition has direct access to the memory used for communication.
Here is the data flow for a conventional VMBus connection (C stands for the client or VSC, S for the server or VSP; the DEVICE is a physical one, possibly with multiple virtual functions):
[Diagram: conventional VMBus connection. The client C (the VSC) sits in the GUEST and the server S (the VSP) on the HOST; the physical DEVICE is attached to the HOST. C and S communicate over VMBus via interrupts and MMIO.]
and the Confidential VMBus connection:
[Diagram: Confidential VMBus connection. The client C runs in the Linux kernel in the GUEST (VTL0). The server S runs in the VMBus Relay inside the OpenHCL PARAVISOR (VTL2), which exchanges interrupts and MMIO directly with the DEVICE assigned to it. A second client C in VTL0 still connects over the conventional VMBus path to a server S on the HOST for channels that are not confidential.]
An implementation of the VMBus relay that offers the Confidential VMBus channels is available in the OpenVMM project as a part of the OpenHCL paravisor. Please refer to the OpenVMM/OpenHCL documentation for more information about the OpenHCL paravisor.
A guest that is running with a paravisor must determine at runtime if Confidential VMBus is supported by the current paravisor. The x86_64-specific approach relies on the CPUID Virtualization Stack leaf; the ARM64 implementation is expected to support Confidential VMBus unconditionally when running ARM CCA guests.
Confidential VMBus is a characteristic of the VMBus connection as a whole, and of each VMBus channel that is created. When a Confidential VMBus connection is established, the paravisor provides the guest the message-passing path that is used for VMBus device creation and deletion, and it provides a per-CPU synthetic interrupt controller (SynIC) just like the SynIC that is offered by the Hyper-V host. Each VMBus device that is offered to the guest indicates the degree to which it participates in Confidential VMBus. The offer indicates if the device uses encrypted ring buffers, and if the device uses encrypted memory for DMA that is done outside the ring buffer. These settings may be different for different devices using the same Confidential VMBus connection.
Although these settings are separate, in practice a device will use either an encrypted ring buffer only, or both an encrypted ring buffer and encrypted external data. If a channel is offered by the paravisor with Confidential VMBus, the ring buffer can always be encrypted since it is strictly for communication between the VTL2 paravisor and the VTL0 guest. However, other memory regions are often used for e.g. DMA, so they need to be accessible by the underlying hardware, and must be unencrypted (unless the device supports encrypted memory). Currently, there are not any VSPs in OpenHCL that support encrypted external memory, but future versions are expected to enable this capability.
Because some devices on a Confidential VMBus may require decrypted ring buffers and DMA transfers, the guest must interact with two SynICs -- the one provided by the paravisor and the one provided by the Hyper-V host when Confidential VMBus is not offered. Interrupts are always signaled by the paravisor SynIC, but the guest must check for messages and for channel interrupts on both SynICs.
In the case of a Confidential VMBus, regular SynIC access by the guest is intercepted by the paravisor (this includes various MSRs such as the SIMP and SIEFP, as well as hypercalls like HvPostMessage and HvSignalEvent). If the guest actually wants to communicate with the hypervisor, it has to use special mechanisms (the GHCB page on SNP, or tdcall on TDX). Messages can be of either kind: with Confidential VMBus, messages use the paravisor SynIC, and if the guest chooses to communicate directly with the hypervisor, they use the hypervisor SynIC. For interrupt signaling, some channels may be running on the host (non-confidential, using the VMBus relay) and use the hypervisor SynIC, and some on the paravisor and use its SynIC. The RelIDs are coordinated by the OpenHCL VMBus server and are guaranteed to be unique regardless of whether the channel originated on the host or the paravisor.
load_unaligned_zeropad()¶
When transitioning memory between encrypted and decrypted, the caller of set_memory_encrypted() or set_memory_decrypted() is responsible for ensuring the memory isn’t in use and isn’t referenced while the transition is in progress. The transition has multiple steps, and includes interaction with the Hyper-V host. The memory is in an inconsistent state until all steps are complete. A reference while the state is inconsistent could result in an exception that can’t be cleanly fixed up.
However, the kernel load_unaligned_zeropad() mechanism may make stray references that can’t be prevented by the caller of set_memory_encrypted() or set_memory_decrypted(), so there’s specific code in the #VC or #VE exception handler to fixup this case. But a CoCo VM running on Hyper-V may be configured to run with a paravisor, with the #VC or #VE exception routed to the paravisor. There’s no architectural way to forward the exceptions back to the guest kernel, and in such a case, the load_unaligned_zeropad() fixup code in the #VC/#VE handlers doesn’t run.
To avoid this problem, the Hyper-V specific functions for notifying the hypervisor of the transition mark pages as “not present” while a transition is in progress. If load_unaligned_zeropad() causes a stray reference, a normal page fault is generated instead of #VC or #VE, and the page-fault-based handlers for load_unaligned_zeropad() fixup the reference. When the encrypted/decrypted transition is complete, the pages are marked as “present” again. See hv_vtom_clear_present() and hv_vtom_set_host_visibility().
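A hedged sketch of that ordering, assuming set_memory_np()/set_memory_p() as the x86 helpers that clear and restore the present bit (treat the exact helper names as an assumption) and a hypothetical notification helper standing in for the real hypercall sequence used by hv_vtom_clear_present() and hv_vtom_set_host_visibility():

  #include <linux/types.h>
  #include <asm/set_memory.h>

  /* Hypothetical stand-in for the hypercall(s) that negotiate visibility: */
  int example_notify_visibility(unsigned long vaddr, int npages, bool host_visible);

  static int example_change_visibility(unsigned long vaddr, int npages,
                                       bool host_visible)
  {
      int ret;

      /*
       * 1. Mark the pages not-present so a stray reference (e.g. from
       *    load_unaligned_zeropad()) takes an ordinary, fixable page fault.
       */
      ret = set_memory_np(vaddr, npages);
      if (ret)
          return ret;

      /* 2. Coordinate the encrypted/decrypted change while the pages are hidden. */
      ret = example_notify_visibility(vaddr, npages, host_visible);
      if (ret)
          return ret;

      /* 3. Mark the pages present again under the new attributes. */
      return set_memory_p(vaddr, npages);
  }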