KVM VCPU Requests

Overview

KVM supports an internal API enabling threads to request a VCPU thread to perform some activity. For example, a thread may request a VCPU to flush its TLB with a VCPU request. The API consists of the following functions:

/* Check if any requests are pending for VCPU @vcpu. */
bool kvm_request_pending(struct kvm_vcpu *vcpu);

/* Check if VCPU @vcpu has request @req pending. */
bool kvm_test_request(int req, struct kvm_vcpu *vcpu);

/* Clear request @req for VCPU @vcpu. */
void kvm_clear_request(int req, struct kvm_vcpu *vcpu);

/*
 * Check if VCPU @vcpu has request @req pending. When the request is
 * pending it will be cleared and a memory barrier, which pairs with
 * another in kvm_make_request(), will be issued.
 */
bool kvm_check_request(int req, struct kvm_vcpu *vcpu);

/*
 * Make request @req of VCPU @vcpu. Issues a memory barrier, which pairs
 * with another in kvm_check_request(), prior to setting the request.
 */
void kvm_make_request(int req, struct kvm_vcpu *vcpu);

/* Make request @req of all VCPUs of the VM with struct kvm @kvm. */
bool kvm_make_all_cpus_request(struct kvm *kvm, unsigned int req);

Typically a requester wants the VCPU to perform the activity as soon as possible after making the request. This means most requests (kvm_make_request() calls) are followed by a call to kvm_vcpu_kick(), and kvm_make_all_cpus_request() has the kicking of all VCPUs built into it.
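The semantics of these functions can be illustrated with a small userspace model. This is only a sketch, not KVM's implementation: struct model_vcpu and every model_* name are hypothetical, the kick is reduced to a counter, and C11 sequentially consistent atomics stand in for the kernel's bitops and explicit memory barriers.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct kvm_vcpu; only the request bitmap is modeled. */
struct model_vcpu {
    atomic_ulong requests;  /* one bit per request number */
    int kicks;              /* counts model_kick() calls, for illustration only */
};

static void model_vcpu_init(struct model_vcpu *vcpu)
{
    atomic_init(&vcpu->requests, 0UL);
    vcpu->kicks = 0;
}

/* Models kvm_request_pending(): is any request bit set? */
static bool model_request_pending(struct model_vcpu *vcpu)
{
    return atomic_load(&vcpu->requests) != 0;
}

/* Models kvm_test_request(): is this particular request bit set? */
static bool model_test_request(int req, struct model_vcpu *vcpu)
{
    return (atomic_load(&vcpu->requests) & (1UL << req)) != 0;
}

/* Models kvm_clear_request(). */
static void model_clear_request(int req, struct model_vcpu *vcpu)
{
    atomic_fetch_and(&vcpu->requests, ~(1UL << req));
}

/* Models kvm_make_request(); the seq_cst RMW stands in for barrier + set_bit. */
static void model_make_request(int req, struct model_vcpu *vcpu)
{
    atomic_fetch_or(&vcpu->requests, 1UL << req);
}

/* Models kvm_check_request(): test, and clear when pending. */
static bool model_check_request(int req, struct model_vcpu *vcpu)
{
    if (!model_test_request(req, vcpu))
        return false;
    model_clear_request(req, vcpu);
    return true;
}

/* Placeholder for kvm_vcpu_kick(); a real kick may send an IPI or wake the thread. */
static void model_kick(struct model_vcpu *vcpu)
{
    vcpu->kicks++;
}

/* The usual requester sequence: make the request, then kick. */
static void model_request_and_kick(int req, struct model_vcpu *vcpu)
{
    model_make_request(req, vcpu);
    model_kick(vcpu);
}
```

In this model, a requester calls model_request_and_kick(), and the VCPU side later calls model_check_request() for each request it knows how to handle; a successful check consumes the request.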

VCPU Kicks

The goal of a VCPU kick is to bring a VCPU thread out of guest mode in order to perform some KVM maintenance. To do so, an IPI is sent, forcing a guest mode exit. However, a VCPU thread may not be in guest mode at the time of the kick. Therefore, depending on the mode and state of the VCPU thread, there are two other actions a kick may take. All three actions are listed below:

  1. Send an IPI. This forces a guest mode exit.

  2. Wake a sleeping VCPU. Sleeping VCPUs are VCPU threads outside guest mode that wait on waitqueues. Waking them removes the threads from the waitqueues, allowing the threads to run again. This behavior may be suppressed; see KVM_REQUEST_NO_WAKEUP below.

  3. Nothing. When the VCPU is not in guest mode and the VCPU thread is not sleeping, there is nothing to do.
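The three outcomes can be written as a small decision function. This is a simplified sketch under stated assumptions: the enum, the function name, and the boolean inputs are hypothetical, and the real kick derives this information from the VCPU's mode and waitqueue state rather than from flags passed in by the caller.

```c
#include <stdbool.h>

/* Hypothetical names for the three possible outcomes of a kick. */
enum model_kick_action {
    MODEL_KICK_SEND_IPI,  /* VCPU is in guest mode: force an exit */
    MODEL_KICK_WAKE,      /* VCPU thread is sleeping on a waitqueue: wake it */
    MODEL_KICK_NOTHING    /* VCPU is outside guest mode and already running */
};

/*
 * Pick the action for a kick.
 * @in_guest_mode: the VCPU thread is currently executing guest code.
 * @sleeping:      the VCPU thread is waiting on a waitqueue.
 * @no_wakeup:     the request carries KVM_REQUEST_NO_WAKEUP, so sleeping
 *                 VCPUs are left alone.
 */
static enum model_kick_action
model_pick_kick_action(bool in_guest_mode, bool sleeping, bool no_wakeup)
{
    if (in_guest_mode)
        return MODEL_KICK_SEND_IPI;
    if (sleeping && !no_wakeup)
        return MODEL_KICK_WAKE;
    return MODEL_KICK_NOTHING;
}
```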

VCPU Mode

VCPUs have a mode state, vcpu->mode, that is used to track whether the guest is running in guest mode or not, as well as some specific outside guest mode states. The architecture may use vcpu->mode to ensure VCPU requests are seen by VCPUs (see “Ensuring Requests Are Seen”), as well as to avoid sending unnecessary IPIs (see “IPI Reduction”), and even to ensure IPI acknowledgements are waited upon (see “Waiting for Acknowledgements”). The following modes are defined:

OUTSIDE_GUEST_MODE

The VCPU thread is outside guest mode.

IN_GUEST_MODE

The VCPU thread is in guest mode.

EXITING_GUEST_MODE

The VCPU thread is transitioning from IN_GUEST_MODE to OUTSIDE_GUEST_MODE.

READING_SHADOW_PAGE_TABLES

The VCPU thread is outside guest mode, but it wants the sender of certain VCPU requests, namely KVM_REQ_TLB_FLUSH, to wait until the VCPU thread is done reading the page tables.

VCPU Request Internals

VCPU requests are simply bit indices of the vcpu->requests bitmap. This means general bitops, like those documented in [atomic-ops], could also be used, e.g.

clear_bit(KVM_REQ_UNBLOCK & KVM_REQUEST_MASK, &vcpu->requests);

However, VCPU request users should refrain from doing so, as it would break the abstraction. The first 8 bits are reserved for architecture independent requests; all additional bits are available for architecture dependent requests.

Architecture Independent Requests

KVM_REQ_TLB_FLUSH

KVM’s common MMU notifier may need to flush all of a guest’s TLB entries, calling kvm_flush_remote_tlbs() to do so. Architectures that choose to use the common kvm_flush_remote_tlbs() implementation will need to handle this VCPU request.

KVM_REQ_VM_DEAD

This request informs all VCPUs that the VM is dead and unusable, e.g. due to a fatal error or because the VM’s state has been intentionally destroyed.

KVM_REQ_UNBLOCK

This request tells the vCPU to exit kvm_vcpu_block(). It is used, for example, from timer handlers that run on the host on behalf of a vCPU, or in order to update the interrupt routing and ensure that assigned devices will wake up the vCPU.

KVM_REQ_OUTSIDE_GUEST_MODE

This “request” ensures the target vCPU has exited guest mode prior to the sender of the request continuing on. No action needs to be taken by the target, and so no request is actually logged for the target. This request is similar to a “kick”, but unlike a kick it guarantees the vCPU has actually exited guest mode. A kick only guarantees the vCPU will exit at some point in the future, e.g. a previous kick may have started the process, but there’s no guarantee the to-be-kicked vCPU has fully exited guest mode.

KVM_REQUEST_MASK

VCPU requests should be masked by KVM_REQUEST_MASK before using them with bitops. This is because only the lower 8 bits are used to represent the request’s number. The upper bits are used as flags. Currently only two flags are defined.

VCPU Request Flags

KVM_REQUEST_NO_WAKEUP

This flag is applied to requests that only need immediate attention from VCPUs running in guest mode. That is, sleeping VCPUs do not need to be awakened for these requests. Sleeping VCPUs will handle the requests when they are awakened later for some other reason.

KVM_REQUEST_WAIT

When requests with this flag are made with kvm_make_all_cpus_request(), then the caller will wait for each VCPU to acknowledge its IPI before proceeding. This flag only applies to VCPUs that would receive IPIs. If, for example, the VCPU is sleeping, so no IPI is necessary, then the requesting thread does not wait. This means that this flag may be safely combined with KVM_REQUEST_NO_WAKEUP. See “Waiting for Acknowledgements” for more information about requests with KVM_REQUEST_WAIT.
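The split between request number and flags can be sketched as follows. The MODEL_* macros and helpers are hypothetical: the 8-bit mask follows from the text above, but the exact bit positions chosen for the two flags are an assumption made for illustration, as is the example request number 5.

```c
#include <stdbool.h>

/* Low 8 bits hold the request number, as described in the text. */
#define MODEL_REQUEST_MASK       0xffu
/* Flag bit positions are assumptions made for this sketch. */
#define MODEL_REQUEST_NO_WAKEUP  (1u << 8)
#define MODEL_REQUEST_WAIT       (1u << 9)

/* A hypothetical request: number 5, carrying both flags. */
#define MODEL_REQ_EXAMPLE \
    (5u | MODEL_REQUEST_WAIT | MODEL_REQUEST_NO_WAKEUP)

/* Extract the bit index to use with the request bitmap. */
static unsigned int model_request_nr(unsigned int req)
{
    return req & MODEL_REQUEST_MASK;
}

/* Sleeping VCPUs need not be woken for this request. */
static bool model_request_no_wakeup(unsigned int req)
{
    return (req & MODEL_REQUEST_NO_WAKEUP) != 0;
}

/* The requester waits for IPI acknowledgements for this request. */
static bool model_request_wait(unsigned int req)
{
    return (req & MODEL_REQUEST_WAIT) != 0;
}
```

Only the value returned by model_request_nr() would ever be used as a bit index; passing the unmasked value to a bitop would address a bit far outside the intended range.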

VCPU Requests with Associated State

Requesters that want the receiving VCPU to handle new state need to ensure the newly written state is observable to the receiving VCPU thread’s CPU by the time it observes the request. This means a write memory barrier must be inserted after writing the new state and before setting the VCPU request bit. Additionally, on the receiving VCPU thread’s side, a corresponding read barrier must be inserted after reading the request bit and before proceeding to read the new state associated with it. See scenario 3, Message and Flag, of [lwn-mb] and the kernel documentation [memory-barriers].

The pair of functions, kvm_check_request() and kvm_make_request(), provide the memory barriers, allowing this requirement to be handled internally by the API.
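The message-and-flag ordering can be sketched in userspace with C11 fences. This is not the kernel code: struct model_mailbox and the model_* functions are hypothetical, a release fence plays the role of the write barrier, and an acquire fence plays the role of the read barrier.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical pairing of new state ("message") with a request bit ("flag"). */
struct model_mailbox {
    int payload;            /* the new state; plain, non-atomic data */
    atomic_bool requested;  /* stands in for the VCPU request bit */
};

static void model_mailbox_init(struct model_mailbox *mb)
{
    mb->payload = 0;
    atomic_init(&mb->requested, false);
}

/* Requester side: write the state, barrier, then set the flag. */
static void model_post(struct model_mailbox *mb, int value)
{
    mb->payload = value;
    atomic_thread_fence(memory_order_release);  /* write barrier */
    atomic_store_explicit(&mb->requested, true, memory_order_relaxed);
}

/*
 * Receiver side: read and clear the flag, barrier, then read the state.
 * Returns true and stores the state in *out when a request was pending.
 */
static bool model_take(struct model_mailbox *mb, int *out)
{
    if (!atomic_exchange_explicit(&mb->requested, false,
                                  memory_order_relaxed))
        return false;
    atomic_thread_fence(memory_order_acquire);  /* read barrier */
    *out = mb->payload;
    return true;
}
```

Because the fences sit inside model_post() and model_take(), callers never place barriers themselves, which mirrors how the API handles the requirement internally.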

Ensuring Requests Are Seen

When making requests to VCPUs, we want to avoid the receiving VCPU executing in guest mode for an arbitrarily long time without handling the request. We can be sure this won’t happen as long as we ensure the VCPU thread checks kvm_request_pending() before entering guest mode and that a kick will send an IPI to force an exit from guest mode when necessary. Extra care must be taken to cover the period after the VCPU thread’s last kvm_request_pending() check and before it has entered guest mode, as kick IPIs will only trigger guest mode exits for VCPU threads that are in guest mode or at least have already disabled interrupts in order to prepare to enter guest mode. This means that an optimized implementation (see “IPI Reduction”) must be certain when it’s safe to not send the IPI. One solution, which all architectures except s390 apply, is to:

  • set vcpu->mode to IN_GUEST_MODE between disabling the interrupts and the last kvm_request_pending() check;

  • enable interrupts atomically when entering the guest.

This solution also requires memory barriers to be placed carefully in both the requesting thread and the receiving VCPU. With the memory barriers we can exclude the possibility of a VCPU thread observing !kvm_request_pending() on its last check and then not receiving an IPI for the next request made of it, even if the request is made immediately after the check. This is done by way of the Dekker memory barrier pattern (scenario 10 of [lwn-mb]). As the Dekker pattern requires two variables, this solution pairs vcpu->mode with vcpu->requests. Substituting them into the pattern gives:

CPU1                                    CPU2
=================                       =================
local_irq_disable();
WRITE_ONCE(vcpu->mode, IN_GUEST_MODE);  kvm_make_request(REQ, vcpu);
smp_mb();                               smp_mb();
if (kvm_request_pending(vcpu)) {        if (READ_ONCE(vcpu->mode) ==
                                            IN_GUEST_MODE) {
    ...abort guest entry...                 ...send IPI...
}                                       }

As stated above, the IPI is only useful for VCPU threads in guest mode or that have already disabled interrupts. This is why this specific case of the Dekker pattern has been extended to disable interrupts before setting vcpu->mode to IN_GUEST_MODE. WRITE_ONCE() and READ_ONCE() are used to pedantically implement the memory barrier pattern, guaranteeing the compiler doesn’t interfere with vcpu->mode’s carefully planned accesses.
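The two sides of the pattern can be modeled in userspace with C11 atomics. This is a sketch under stated assumptions: the model_* names are hypothetical, interrupt disabling is not modeled, relaxed atomic accesses stand in for WRITE_ONCE() and READ_ONCE(), and a sequentially consistent fence stands in for smp_mb().

```c
#include <stdatomic.h>
#include <stdbool.h>

enum { MODEL_OUTSIDE_GUEST_MODE, MODEL_IN_GUEST_MODE };

/* Hypothetical stand-in for the two variables of the Dekker pattern. */
struct model_vcpu {
    atomic_int mode;
    atomic_ulong requests;
};

static void model_vcpu_init(struct model_vcpu *vcpu)
{
    atomic_init(&vcpu->mode, MODEL_OUTSIDE_GUEST_MODE);
    atomic_init(&vcpu->requests, 0UL);
}

/*
 * CPU1 (VCPU thread): write mode, full barrier, read requests.
 * Returns true when guest entry may proceed, false when it is aborted.
 */
static bool model_try_enter_guest(struct model_vcpu *vcpu)
{
    /* local_irq_disable() would precede this; not modeled in userspace. */
    atomic_store_explicit(&vcpu->mode, MODEL_IN_GUEST_MODE,
                          memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);  /* smp_mb() */
    if (atomic_load_explicit(&vcpu->requests, memory_order_relaxed) != 0) {
        /* Abort guest entry so the pending request gets handled. */
        atomic_store_explicit(&vcpu->mode, MODEL_OUTSIDE_GUEST_MODE,
                              memory_order_relaxed);
        return false;
    }
    return true;
}

/*
 * CPU2 (requester): write requests, full barrier, read mode.
 * Returns true when an IPI must be sent.
 */
static bool model_make_request_check_ipi(struct model_vcpu *vcpu, int req)
{
    atomic_fetch_or_explicit(&vcpu->requests, 1UL << req,
                             memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);  /* smp_mb() */
    return atomic_load_explicit(&vcpu->mode, memory_order_relaxed) ==
           MODEL_IN_GUEST_MODE;
}
```

With both fences in place, at least one side observes the other's write: either the VCPU sees the request and aborts entry, or the requester sees IN_GUEST_MODE and sends the IPI. The outcome in which neither happens is the one the barriers exclude.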

IPI Reduction

As only one IPI is needed to get a VCPU to check for any/all requests, IPIs may be coalesced. This is easily done by having the first IPI-sending kick also change the VCPU mode to something !IN_GUEST_MODE. The transitional state, EXITING_GUEST_MODE, is used for this purpose.
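One way to express this coalescing is an atomic compare-and-exchange on the mode, so that only the kick that wins the transition out of IN_GUEST_MODE sends the IPI. The following is a userspace sketch with hypothetical model_* names, using C11 atomics rather than the kernel's primitives.

```c
#include <stdatomic.h>
#include <stdbool.h>

enum {
    MODEL_OUTSIDE_GUEST_MODE,
    MODEL_IN_GUEST_MODE,
    MODEL_EXITING_GUEST_MODE
};

/*
 * Returns true only for the kick that moves the VCPU from IN_GUEST_MODE
 * to EXITING_GUEST_MODE; that kick is the one that should send the IPI.
 * Later kicks find the mode already changed and send nothing.
 */
static bool model_kick_should_send_ipi(atomic_int *mode)
{
    int expected = MODEL_IN_GUEST_MODE;

    return atomic_compare_exchange_strong(mode, &expected,
                                          MODEL_EXITING_GUEST_MODE);
}
```

A VCPU that is not in guest mode at all is left untouched: the exchange fails and its mode is not modified.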

Waiting for Acknowledgements

Some requests, those with the KVM_REQUEST_WAIT flag set, require IPIs to be sent, and the acknowledgements to be waited upon, even when the target VCPU threads are in modes other than IN_GUEST_MODE. For example, one case is when a target VCPU thread is in READING_SHADOW_PAGE_TABLES mode, which is set after disabling interrupts. To support these cases, the KVM_REQUEST_WAIT flag changes the condition for sending an IPI from checking that the VCPU is IN_GUEST_MODE to checking that it is not OUTSIDE_GUEST_MODE.
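The changed condition can be written as a small predicate. This is a sketch only: the MODEL_* names and the function are hypothetical, the bit position of the wait flag is an assumption, and the mode is passed in as a plain value instead of being read from a VCPU.

```c
#include <stdbool.h>

enum {
    MODEL_OUTSIDE_GUEST_MODE,
    MODEL_IN_GUEST_MODE,
    MODEL_EXITING_GUEST_MODE,
    MODEL_READING_SHADOW_PAGE_TABLES
};

/* Assumed flag bit; stands in for KVM_REQUEST_WAIT. */
#define MODEL_REQUEST_WAIT (1u << 9)

/*
 * Decide whether a request needs an IPI, given the mode the VCPU was
 * observed in when the request was made.
 */
static bool model_request_needs_ipi(int mode, unsigned int req)
{
    if (req & MODEL_REQUEST_WAIT)
        return mode != MODEL_OUTSIDE_GUEST_MODE;  /* widened condition */
    return mode == MODEL_IN_GUEST_MODE;           /* default condition */
}
```

Under this predicate, a VCPU in READING_SHADOW_PAGE_TABLES mode receives an IPI only for requests that carry the wait flag, which is what lets the requester wait until the VCPU has finished reading the page tables.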

Request-less VCPU Kicks

As the determination of whether or not to send an IPI depends on the two-variable Dekker memory barrier pattern, it’s clear that request-less VCPU kicks are almost never correct. Without the assurance that a non-IPI generating kick will still result in an action by the receiving VCPU, as the final kvm_request_pending() check does for request-accompanying kicks, the kick may not do anything useful at all. If, for instance, a request-less kick was made to a VCPU that was just about to set its mode to IN_GUEST_MODE, meaning no IPI is sent, then the VCPU thread may continue its entry without actually having done whatever it was the kick was meant to initiate.

One exception is x86’s posted interrupt mechanism. In this case, however, even the request-less VCPU kick is coupled with the same local_irq_disable() + smp_mb() pattern described above; the ON bit (Outstanding Notification) in the posted interrupt descriptor takes the role of vcpu->requests. When sending a posted interrupt, PIR.ON is set before reading vcpu->mode; dually, in the VCPU thread, vmx_sync_pir_to_irr() reads PIR after setting vcpu->mode to IN_GUEST_MODE.

Additional Considerations

Sleeping VCPUs

VCPU threads may need to consider requests before and/or after calling functions that may put them to sleep, e.g. kvm_vcpu_block(). Whether they do or not, and, if they do, which requests need consideration, is architecture dependent. kvm_vcpu_block() calls kvm_arch_vcpu_runnable() to check if it should awaken. One reason to do so is to provide architectures a function where requests may be checked if necessary.

References

[atomic-ops]

Documentation/atomic_bitops.txt and Documentation/atomic_t.txt

[memory-barriers]

Documentation/memory-barriers.txt