vfio-ccw: the basic infrastructure

Introduction

Here we describe the vfio support for I/O subchannel devices for Linux/s390. The motivation for vfio-ccw is to pass subchannels through to a virtual machine, with vfio as the means.

Unlike other hardware architectures, s390 defines a unified I/O access method, the so-called Channel I/O. It has its own access patterns:

  • Channel programs run asynchronously on a separate (co)processor.
  • The channel subsystem will access any memory designated by the caller in the channel program directly, i.e. there is no iommu involved.

Thus when we introduce vfio support for these devices, we realize it with a mediated device (mdev) implementation. The vfio mdev is added to an iommu group so that it can be managed by the vfio framework. We also add read/write callbacks for special vfio I/O regions to pass the channel programs from the mdev to its parent device (the real I/O subchannel device), which does further address translation and performs the I/O instructions.

This document does not intend to explain the s390 I/O architecture in every detail. More information and references can be found here:

  • A good starting point for Channel I/O in general: https://en.wikipedia.org/wiki/Channel_I/O
  • s390 architecture: s390 Principles of Operation manual (IBM Form. No. SA22-7832)
  • The existing QEMU code which implements a simple emulated channel subsystem is also a good reference. It makes it easier to follow the flow: qemu/hw/s390x/css.c

For the vfio mediated device framework:

  • Documentation/driver-api/vfio-mediated-device.rst

Motivation of vfio-ccw

Typically, a guest virtualized via QEMU/KVM on s390 only sees paravirtualized virtio devices via the “Virtio Over Channel I/O (virtio-ccw)” transport. This makes virtio devices discoverable via standard operating system algorithms for handling channel devices.

However, this is not enough. On s390, the majority of devices use the standard Channel I/O based mechanism, and we also need to be able to pass them through to a QEMU virtual machine. This includes devices that do not have a virtio counterpart (e.g. tape drives) or that have specific characteristics which guests want to exploit.

For passing a device to a guest, we want to use the same interface as everybody else, namely vfio. We implement this vfio support for channel devices via the vfio mediated device framework and the subchannel device driver “vfio_ccw”.

Access patterns of CCW devices

The s390 architecture implements a so-called channel subsystem, which provides a unified view of the devices physically attached to the system. Although the s390 hardware platform knows about a huge variety of different peripheral attachments such as disk devices (aka DASDs), tapes, communication controllers, etc., they can all be accessed by a well-defined access method, and they all present I/O completion in a unified way: I/O interruptions.

All I/O requires the use of channel command words (CCWs). A CCW is an instruction to a specialized I/O channel processor. A channel program is a sequence of CCWs which are executed by the I/O channel subsystem. To issue a channel program to the channel subsystem, it is required to build an operation request block (ORB), which specifies the format of the CCWs and other control information. The operating system signals the I/O channel subsystem to begin executing the channel program with an SSCH (start subchannel) instruction. The central processor is then free to proceed with non-I/O instructions until interrupted. The I/O completion result is received by the interrupt handler in the form of an interrupt response block (IRB).
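To make the notion of a channel program more concrete, here is a minimal user-space sketch of a format-1 CCW and of walking a chain of them. The struct and function names and the CCW_MAX_CHAIN constant are our own choices for illustration and are not taken from the vfio-ccw code; the sketch also does not follow TIC (transfer in channel) CCWs, which a real implementation has to handle.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Format-1 CCW: command code, flags, byte count, data address.
 * On s390 the multi-byte fields are big-endian.
 */
struct ccw1 {
    uint8_t  cmd_code;
    uint8_t  flags;
    uint16_t count;
    uint32_t cda;
};

#define CCW_FLAG_DATA_CHAIN 0x80 /* next CCW continues the data transfer */
#define CCW_FLAG_CMD_CHAIN  0x40 /* next CCW starts a new command */
#define CCW_MAX_CHAIN       255  /* illustrative limit on the chain length */

/*
 * Return the number of CCWs in the chain starting at ccw, or -EINVAL if
 * no unchained (final) CCW is found within avail entries or within the
 * limit. Simplification: TIC CCWs are not followed.
 */
int ccw_chain_len(const struct ccw1 *ccw, size_t avail)
{
    size_t i;

    assert(ccw != NULL);
    for (i = 0; i < avail && i < CCW_MAX_CHAIN; i++) {
        if (!(ccw[i].flags & (CCW_FLAG_DATA_CHAIN | CCW_FLAG_CMD_CHAIN)))
            return (int)(i + 1);
    }
    return -EINVAL;
}
```

A chain ends at the first CCW that has neither chaining flag set, which is why the walk only needs to inspect the flags byte.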

Back to vfio-ccw, in short:

  • ORBs and channel programs are built in the guest kernel (with guest physical addresses).
  • ORBs and channel programs are passed to the host kernel.
  • The host kernel translates the guest physical addresses to real addresses and starts the I/O by issuing a privileged Channel I/O instruction (e.g. SSCH).
  • Channel programs run asynchronously on a separate processor.
  • I/O completion is signaled to the host with I/O interruptions. The result is copied as an IRB to user space, which passes it back to the guest.

Physical vfio ccw device and its child mdev

As mentioned above, we realize vfio-ccw with an mdev implementation.

Channel I/O does not have IOMMU hardware support, so the physical vfio-ccw device does not have an IOMMU level translation or isolation.

Subchannel I/O instructions are all privileged instructions. When handling the I/O instruction interception, vfio-ccw polices and translates the channel program in software before it gets sent to hardware.

Within this implementation, we have two drivers for two types of devices:

  • The vfio_ccw driver for the physical subchannel device.

    This is an I/O subchannel driver for the real subchannel device. It implements a group of callbacks and registers with the mdev framework as a parent (physical) device. As a consequence, mdev provides vfio_ccw with a generic interface (sysfs) to create mdev devices. A vfio mdev can then be created by vfio_ccw and added to the mediated bus. It is the vfio device that is added to an IOMMU group and a vfio group.

    vfio_ccw also provides an I/O region to accept channel program requests from user space and to store the I/O interrupt result for user space to retrieve. To notify user space of an I/O completion, it offers an interface to set up an eventfd for asynchronous signaling.

  • The vfio_mdev driver for the mediated vfio ccw device.

    This is provided by the mdev framework. It is a vfio device driver for the mdev that is created by vfio_ccw.

    It implements a group of vfio device driver callbacks, adds itself to a vfio group, and registers itself with the mdev framework as an mdev driver.

    It uses a vfio iommu backend that uses the existing map and unmap ioctls, but rather than programming them into an IOMMU for a device, it simply stores the translations for use by later requests. This means that a device programmed in a VM with guest physical addresses can have the vfio kernel convert that address to a process virtual address, pin the page and program the hardware with the host physical address in one step.

    For an mdev, the vfio iommu backend will not pin the pages during the VFIO_IOMMU_MAP_DMA ioctl. The mdev framework will only maintain a database of the iova<->vaddr mappings in this operation. The vfio iommu backend exports the vfio_pin_pages and vfio_unpin_pages interfaces so that the physical devices can pin and unpin pages on demand.
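The idea of keeping a database of iova<->vaddr mappings and resolving addresses only when a request arrives can be illustrated with a small user-space sketch. This is not the kernel's data structure: the struct dma_map type and the iova_to_vaddr() function are hypothetical, and a linear table stands in for whatever lookup structure the backend really uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One recorded mapping, as it would be stored when a map request arrives. */
struct dma_map {
    uint64_t iova;   /* start of the range in guest physical (I/O virtual) space */
    uint64_t vaddr;  /* start of the range in process virtual space */
    uint64_t size;   /* length of the range in bytes */
};

/*
 * Translate iova using the recorded mappings. Nothing is pinned here;
 * the table is only consulted when a request needs the translation.
 * Returns 0 and stores the result in *vaddr, or -1 if iova is unmapped.
 */
int iova_to_vaddr(const struct dma_map *maps, size_t n,
                  uint64_t iova, uint64_t *vaddr)
{
    size_t i;

    assert(maps != NULL && vaddr != NULL);
    for (i = 0; i < n; i++) {
        if (iova >= maps[i].iova && iova - maps[i].iova < maps[i].size) {
            *vaddr = maps[i].vaddr + (iova - maps[i].iova);
            return 0;
        }
    }
    return -1;
}
```

In the real flow, the page behind the resulting address would then be pinned and its host physical address programmed into the channel program.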

Below is a high-level block diagram:

 +-------------+
 |             |
 | +---------+ | mdev_register_driver() +--------------+
 | |  Mdev   | +<-----------------------+              |
 | |  bus    | |                        | vfio_mdev.ko |
 | | driver  | +----------------------->+              |<-> VFIO user
 | +---------+ |    probe()/remove()    +--------------+    APIs
 |             |
 |  MDEV CORE  |
 |   MODULE    |
 |   mdev.ko   |
 | +---------+ | mdev_register_device() +--------------+
 | |Physical | +<-----------------------+              |
 | | device  | |                        |  vfio_ccw.ko |<-> subchannel
 | |interface| +----------------------->+              |     device
 | +---------+ |       callback         +--------------+
 +-------------+

The process of how these work together:

  1. vfio_ccw.ko drives the physical I/O subchannel, and registers the physical device (with callbacks) with the mdev framework.

     When vfio_ccw probes the subchannel device, it registers the device pointer and callbacks with the mdev framework. Mdev-related file nodes are then created in sysfs under the device node of the subchannel device, namely ‘mdev_create’, ‘mdev_destroy’ and ‘mdev_supported_types’.

  2. Create a mediated vfio ccw device.

     Using the ‘mdev_create’ sysfs file, we need to manually create one (and only one in our case) mediated device.

  3. vfio_mdev.ko drives the mediated ccw device.

     vfio_mdev is also the vfio device driver. It probes the mdev and adds it to an iommu_group and a vfio_group. We can then pass the mdev through to a guest.

VFIO-CCW Regions

The vfio-ccw driver exposes MMIO regions to accept requests from and return results to userspace.

vfio-ccw I/O region

An I/O region is used to accept channel program requests from user space and to store the I/O interrupt result for user space to retrieve. The definition of the region is:

  struct ccw_io_region {
  #define ORB_AREA_SIZE 12
          __u8    orb_area[ORB_AREA_SIZE];
  #define SCSW_AREA_SIZE 12
          __u8    scsw_area[SCSW_AREA_SIZE];
  #define IRB_AREA_SIZE 96
          __u8    irb_area[IRB_AREA_SIZE];
          __u32   ret_code;
  } __packed;

This region is always available.

While starting an I/O request, orb_area should be filled with the guest ORB, and scsw_area should be filled with the SCSW of the Virtual Subchannel.

irb_area stores the I/O result.

ret_code stores a return code for each access of the region. The following values may occur:

0
    The operation was successful.
-EOPNOTSUPP
    The orb specified transport mode or an unidentified IDAW format, or the scsw specified a function other than the start function.
-EIO
    A request was issued while the device was not in a state ready to accept requests, or an internal error occurred.
-EBUSY
    The subchannel was status pending or busy, or a request is already active.
-EAGAIN
    A request was being processed, and the caller should retry.
-EACCES
    The channel path(s) used for the I/O were found to be not operational.
-ENODEV
    The device was found to be not operational.
-EINVAL
    The orb specified a chain longer than 255 ccws, or an internal error occurred.
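As a rough illustration of how a user space program might prepare the I/O region and interpret ret_code, consider the following sketch. It mirrors the region layout with user-space types (uint8_t and uint32_t standing in for __u8 and __u32, and the GCC/Clang packed attribute standing in for __packed); the two helper functions are hypothetical and not part of any existing API.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>

#define ORB_AREA_SIZE  12
#define SCSW_AREA_SIZE 12
#define IRB_AREA_SIZE  96

/* User-space mirror of the region layout shown above. */
struct ccw_io_region {
    uint8_t  orb_area[ORB_AREA_SIZE];
    uint8_t  scsw_area[SCSW_AREA_SIZE];
    uint8_t  irb_area[IRB_AREA_SIZE];
    uint32_t ret_code;
} __attribute__((packed));

/* Hypothetical helper: build the region image for a start request. */
void ccw_io_region_prepare(struct ccw_io_region *r,
                           const void *orb, const void *scsw)
{
    assert(r != NULL && orb != NULL && scsw != NULL);
    memset(r, 0, sizeof(*r));
    memcpy(r->orb_area, orb, ORB_AREA_SIZE);
    memcpy(r->scsw_area, scsw, SCSW_AREA_SIZE);
}

/*
 * Hypothetical helper: returns 1 if the documented meaning of ret_code
 * is "retry the request" (-EAGAIN), 0 otherwise. ret_code is declared
 * unsigned but carries 0 or a negative errno value.
 */
int ccw_io_should_retry(const struct ccw_io_region *r)
{
    assert(r != NULL);
    return (int32_t)r->ret_code == -EAGAIN;
}
```

The prepared image would then be written to the device file descriptor at the offset reported for the I/O region by VFIO_DEVICE_GET_REGION_INFO, for example with pwrite(2).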

vfio-ccw cmd region

The vfio-ccw cmd region is used to accept asynchronous instructions from userspace:

  #define VFIO_CCW_ASYNC_CMD_HSCH (1 << 0)
  #define VFIO_CCW_ASYNC_CMD_CSCH (1 << 1)
  struct ccw_cmd_region {
         __u32 command;
         __u32 ret_code;
  } __packed;

This region is exposed via region type VFIO_REGION_SUBTYPE_CCW_ASYNC_CMD.

Currently, CLEAR SUBCHANNEL and HALT SUBCHANNEL use this region.

command specifies the command to be issued; ret_code stores a return code for each access of the region. The following values may occur:

0
    The operation was successful.
-ENODEV
    The device was found to be not operational.
-EINVAL
    A command other than halt or clear was specified.
-EIO
    A request was issued while the device was not in a state ready to accept requests.
-EAGAIN
    A request was being processed, and the caller should retry.
-EBUSY
    The subchannel was status pending or busy while processing a halt request.

vfio-ccw schib region

The vfio-ccw schib region is used to return Subchannel-Information Block (SCHIB) data to userspace:

  struct ccw_schib_region {
  #define SCHIB_AREA_SIZE 52
         __u8 schib_area[SCHIB_AREA_SIZE];
  } __packed;

This region is exposed via region type VFIO_REGION_SUBTYPE_CCW_SCHIB.

Reading this region triggers a STORE SUBCHANNEL to be issued to the associated hardware.

vfio-ccw crw region

The vfio-ccw crw region is used to return Channel Report Word (CRW) data to userspace:

  struct ccw_crw_region {
         __u32 crw;
         __u32 pad;
  } __packed;

This region is exposed via region type VFIO_REGION_SUBTYPE_CCW_CRW.

Reading this region returns a CRW if one that is relevant for this subchannel (e.g. one reporting changes in channel path state) is pending, or all zeroes if not. If multiple CRWs are pending (including possibly chained CRWs), reading this region again will return the next one, until no more CRWs are pending and zeroes are returned. This is similar to how STORE CHANNEL REPORT WORD works.
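The read-until-zero behaviour described above suggests a simple drain loop on the user space side. The sketch below is an illustration only: crw_drain(), the callback type and the in-memory queue are all hypothetical, with the queue standing in for a real read of the region from the device file descriptor.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* User-space mirror of the region layout shown above. */
struct ccw_crw_region {
    uint32_t crw;
    uint32_t pad;
} __attribute__((packed));

/* Hypothetical callback: reads the region once, returns 0 on success. */
typedef int (*crw_read_fn)(void *ctx, struct ccw_crw_region *out);

/*
 * Read CRWs until an all-zero CRW is returned or max entries are stored.
 * Returns the number of CRWs stored in crws, or -1 if a read fails.
 */
int crw_drain(crw_read_fn read_region, void *ctx, uint32_t *crws, size_t max)
{
    size_t n = 0;

    assert(read_region != NULL && crws != NULL);
    while (n < max) {
        struct ccw_crw_region r;

        if (read_region(ctx, &r) != 0)
            return -1;
        if (r.crw == 0)
            break;          /* nothing (more) pending */
        crws[n++] = r.crw;
    }
    return (int)n;
}

/* In-memory stand-in for the device: hands out queued CRWs, then zeroes. */
struct crw_queue {
    const uint32_t *items;
    size_t len;
    size_t pos;
};

int crw_queue_read(void *ctx, struct ccw_crw_region *out)
{
    struct crw_queue *q = ctx;

    out->crw = q->pos < q->len ? q->items[q->pos++] : 0;
    out->pad = 0;
    return 0;
}
```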

vfio-ccw operation details

vfio-ccw follows what vfio-pci did on the s390 platform and uses vfio-iommu-type1 as the vfio iommu backend.

  • CCW translation APIs

    A group of APIs (prefixed with cp_) to do CCW translation. The CCWs passed in by a user space program are organized with their guest physical memory addresses. These APIs will copy the CCWs into kernel space, and assemble a runnable kernel channel program by updating the guest physical addresses with their corresponding host physical addresses. Note that we have to use IDALs even for direct-access CCWs, as the referenced memory can be located anywhere, including above 2G.

  • vfio_ccw device driver

    This driver utilizes the CCW translation APIs and introduces vfio_ccw, which is the driver for the I/O subchannel devices you want to pass through. vfio_ccw implements the following vfio ioctls:

    VFIO_DEVICE_GET_INFO
    VFIO_DEVICE_GET_IRQ_INFO
    VFIO_DEVICE_GET_REGION_INFO
    VFIO_DEVICE_RESET
    VFIO_DEVICE_SET_IRQS

    This provides an I/O region, so that the user space program can pass a channel program to the kernel, which does further CCW translation before issuing it to a real device. This also provides the VFIO_DEVICE_SET_IRQS ioctl to set up an event notifier that notifies the user space program of the I/O completion in an asynchronous way.
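To give an idea of what building an IDAL for a data buffer involves during CCW translation, here is a simplified user-space sketch. It assumes 4K IDAW blocks and only computes the guest addresses each IDAW would cover: the first entry is the buffer start, each following entry the next block boundary. The function names are hypothetical, and the step that makes the kernel version useful, namely pinning each page and replacing the guest address with the host physical address, is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define IDA_BLOCK_SIZE 4096ULL /* assumption: 4K IDAW blocks */

/* Number of IDAWs needed to cover len bytes starting at addr. */
size_t idaw_count(uint64_t addr, size_t len)
{
    uint64_t offset = addr & (IDA_BLOCK_SIZE - 1);

    return (size_t)((offset + len + IDA_BLOCK_SIZE - 1) / IDA_BLOCK_SIZE);
}

/*
 * Fill idaws with the (untranslated) address covered by each IDAW.
 * Returns the number of entries written, or 0 if len is 0 or the
 * array is too small.
 */
size_t idal_fill(uint64_t addr, size_t len, uint64_t *idaws, size_t max)
{
    size_t n = idaw_count(addr, len);
    size_t i;

    assert(idaws != NULL);
    if (n > max)
        return 0;
    for (i = 0; i < n; i++) {
        idaws[i] = addr;
        /* advance to the start of the next block */
        addr = (addr & ~(IDA_BLOCK_SIZE - 1)) + IDA_BLOCK_SIZE;
    }
    return n;
}
```

A 32-byte buffer that starts 16 bytes before a block boundary therefore needs two IDAWs, even though it is far smaller than one block.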

The use of vfio-ccw is not limited to QEMU, although QEMU is definitely a good example for understanding how these patches work. Here is a little more detail on how an I/O request triggered by the QEMU guest is handled (without error handling).

Explanation:

  • Q1-Q7: QEMU side process.
  • K1-K6: Kernel side process.

Q1.
    Get I/O region info during initialization.
Q2.
    Set up the event notifier and handler to handle I/O completion.

… …

Q3.
    Intercept an ssch instruction.
Q4.
    Write the guest channel program and ORB to the I/O region.

    K1.
        Copy from guest to kernel.
    K2.
        Translate the guest channel program to a host kernel space channel program, which becomes runnable for a real device.
    K3.
        With the necessary information contained in the orb passed in by QEMU, issue the ccwchain to the device.
    K4.
        Return the ssch CC code.
Q5.
    Return the CC code to the guest.

… …

    K5.
        The interrupt handler gets the I/O result and writes it to the I/O region.
    K6.
        Signal QEMU to retrieve the result.

Q6.
    Get the signal; the event handler reads the result out of the I/O region.
Q7.
    Update the irb for the guest.
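The notification part of this flow (K6 and Q6) relies on an eventfd. The following Linux user-space sketch only demonstrates the eventfd mechanics in isolation: the function name is hypothetical, and the write() stands in for the kernel signaling the eventfd that user space registered via VFIO_DEVICE_SET_IRQS.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/*
 * Create an eventfd, signal it once and consume the notification.
 * Returns the counter value that was read, or 0 on failure.
 */
uint64_t eventfd_roundtrip(void)
{
    uint64_t one = 1, seen = 0;
    int fd = eventfd(0, 0);

    if (fd < 0)
        return 0;

    /* Stand-in for K6: in the real flow the kernel signals the eventfd. */
    if (write(fd, &one, sizeof(one)) != (ssize_t)sizeof(one)) {
        close(fd);
        return 0;
    }

    /* Q6: the handler wakes up; here it would read the I/O region. */
    if (read(fd, &seen, sizeof(seen)) != (ssize_t)sizeof(seen)) {
        close(fd);
        return 0;
    }
    assert(seen > 0); /* a successful eventfd read never yields zero */

    close(fd);
    return seen;
}
```

In an event loop such as QEMU's, the blocking read() would typically be replaced by polling the file descriptor together with other event sources.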

Limitations

The current vfio-ccw implementation focuses on supporting the basic commands needed to implement block device functionality (read/write) of DASD/ECKD devices only. Some commands may need special handling in the future, for example, anything related to path grouping.

DASD is a kind of storage device, while ECKD is a data recording format. More information on DASD and ECKD can be found here:

  • https://en.wikipedia.org/wiki/Direct-access_storage_device
  • https://en.wikipedia.org/wiki/Count_key_data

Together with the corresponding work in QEMU, we can now bring the passed through DASD/ECKD device online in a guest and use it as a block device.

The current code allows the guest to start channel programs via START SUBCHANNEL, and to issue HALT SUBCHANNEL, CLEAR SUBCHANNEL, and STORE SUBCHANNEL.

Currently all channel programs are prefetched, regardless of the p-bit setting in the ORB. As a result, self-modifying channel programs are not supported. For this reason, IPL has to be handled as a special case by a userspace/guest program; this has been implemented in QEMU’s s390-ccw bios as of QEMU 4.1.

vfio-ccw supports classic (command mode) channel I/O only. Transport mode (HPF) is not supported.

QDIO subchannels are currently not supported. Classic devices other than DASD/ECKD might work, but have not been tested.

Reference

  1. ESA/s390 Principles of Operation manual (IBM Form. No. SA22-7832)
  2. ESA/390 Common I/O Device Commands manual (IBM Form. No. SA22-7204)
  3. https://en.wikipedia.org/wiki/Channel_I/O
  4. Documentation/s390/cds.rst
  5. Documentation/driver-api/vfio.rst
  6. Documentation/driver-api/vfio-mediated-device.rst