Coherent Accelerator (CXL) Flash

Introduction

The IBM Power architecture provides support for CAPI (Coherent Accelerator Power Interface), which is available to certain PCIe slots on Power 8 systems. CAPI can be thought of as a special tunneling protocol through PCIe that allows PCIe adapters to look like special purpose co-processors which can read or write an application’s memory and generate page faults. As a result, the host interface to an adapter running in CAPI mode does not require the data buffers to be mapped to the device’s memory (IOMMU bypass) nor does it require memory to be pinned.

On Linux, Coherent Accelerator (CXL) kernel services present CAPI devices as PCI devices by implementing a virtual PCI host bridge. This abstraction simplifies the infrastructure and programming model, allowing drivers to look similar to other native PCI device drivers.

CXL provides a mechanism by which user space applications can directly talk to a device (network or storage), bypassing the typical kernel/device driver stack. The CXL Flash Adapter Driver gives a user space application direct access to Flash storage.

The CXL Flash Adapter Driver is a kernel module that sits in the SCSI stack as a low level device driver (below the SCSI disk and protocol drivers) for the IBM CXL Flash Adapter. This driver is responsible for the initialization of the adapter, setting up the special path for user space access, and performing error recovery. It communicates directly with the Flash Accelerator Functional Unit (AFU) as described in Documentation/powerpc/cxl.rst.

The cxlflash driver supports two mutually exclusive modes of operation at the device (LUN) level:

  • Any flash device (LUN) can be configured to be accessed as a regular disk device (e.g., /dev/sdc). This is the default mode.
  • Any flash device (LUN) can be configured to be accessed from user space with a special block library. This mode further specifies the means of accessing the device and provides for either raw access to the entire LUN (referred to as direct or physical LUN access) or access to a kernel/AFU-mediated partition of the LUN (referred to as virtual LUN access). The segmentation of a disk device into virtual LUNs is assisted by special translation services provided by the Flash AFU.

Overview

The Coherent Accelerator Interface Architecture (CAIA) introduces a concept of a master context. A master typically has special privileges granted to it by the kernel or hypervisor allowing it to perform AFU-wide management and control. The master may or may not be involved directly in each user I/O, but at the minimum is involved in the initial setup before the user application is allowed to send requests directly to the AFU.

The CXL Flash Adapter Driver establishes a master context with the AFU. It uses memory mapped I/O (MMIO) for this control and setup. The Adapter Problem Space Memory Map looks like this:

    +-------------------------------+
    |    512 * 64 KB User MMIO      |
    |        (per context)          |
    |       User Accessible         |
    +-------------------------------+
    |    512 * 128 B per context    |
    |    Provisioning and Control   |
    |   Trusted Process accessible  |
    +-------------------------------+
    |         64 KB Global          |
    |   Trusted Process accessible  |
    +-------------------------------+
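As a rough illustration of the sizes in the map above, the following Python sketch computes byte offsets for each region. The diagram gives sizes rather than offsets, so the sketch assumes the three regions are packed contiguously in the order drawn, starting at offset 0. All names in it are illustrative and do not come from the driver.

```python
KB = 1024

MAX_CONTEXTS = 512          # number of contexts in the map, per the diagram
USER_MMIO_SIZE = 64 * KB    # user-accessible MMIO bytes per context
CTRL_SIZE = 128             # provisioning and control bytes per context
GLOBAL_SIZE = 64 * KB       # single global area

# Assumption: regions are laid out back to back in diagram order from offset 0.
USER_BASE = 0
CTRL_BASE = USER_BASE + MAX_CONTEXTS * USER_MMIO_SIZE
GLOBAL_BASE = CTRL_BASE + MAX_CONTEXTS * CTRL_SIZE
MAP_SIZE = GLOBAL_BASE + GLOBAL_SIZE


def _check_context(ctx_id):
    """Reject context ids that fall outside the 512-entry map."""
    if not 0 <= ctx_id < MAX_CONTEXTS:
        raise ValueError(f"context id {ctx_id} out of range")


def user_mmio_offset(ctx_id):
    """Byte offset of the user MMIO area for one context."""
    _check_context(ctx_id)
    return USER_BASE + ctx_id * USER_MMIO_SIZE


def ctrl_offset(ctx_id):
    """Byte offset of the provisioning and control area for one context."""
    _check_context(ctx_id)
    return CTRL_BASE + ctx_id * CTRL_SIZE


if __name__ == "__main__":
    print(hex(user_mmio_offset(1)), hex(ctrl_offset(1)), hex(GLOBAL_BASE))
```

Under that layout assumption, only the first region would ever be mapped by a user process; the remaining two are reserved for the trusted master context.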

This driver configures itself into the SCSI software stack as an adapter driver. The driver is the only entity that is considered a Trusted Process to program the Provisioning and Control and Global areas in the MMIO Space shown above. The master context driver discovers all LUNs attached to the CXL Flash adapter and instantiates scsi block devices (/dev/sdb, /dev/sdc etc.) for each unique LUN seen from each path.

Once these scsi block devices are instantiated, an application written to a specification provided by the block library may get access to the Flash from user space (without requiring a system call).

This master context driver also provides a series of ioctls for this block library to enable this user space access. The driver supports two modes for accessing the block device.

The first mode is called the virtual mode. In this mode a single scsi block device (/dev/sdb) may be carved up into any number of distinct virtual LUNs. The virtual LUNs may be resized as long as the sum of the sizes of all the virtual LUNs, along with the meta-data associated with them, does not exceed the physical capacity.
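The resize constraint described above can be sketched as a simple capacity check. This is only an illustration: the function name, the unit (blocks), and the idea of passing the meta-data overhead as a single number are assumptions, since the text does not say how the driver accounts for meta-data.

```python
def resize_fits(vlun_sizes, index, new_size, metadata_blocks, capacity_blocks):
    """Return True if resizing one virtual LUN keeps the total within capacity.

    vlun_sizes      -- current sizes of all virtual LUNs on the device
    index           -- position in vlun_sizes of the LUN being resized
    new_size        -- requested size for that LUN
    metadata_blocks -- hypothetical meta-data overhead, in the same unit
    capacity_blocks -- physical capacity of the underlying device
    """
    if new_size < 0:
        raise ValueError("size must be non-negative")
    # Replace the old size of the chosen LUN with the requested one.
    total = sum(vlun_sizes) - vlun_sizes[index] + new_size
    return total + metadata_blocks <= capacity_blocks


if __name__ == "__main__":
    print(resize_fits([100, 200], 0, 300, 10, 512))
    print(resize_fits([100, 200], 0, 303, 10, 512))
```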

The second mode is called the physical mode. In this mode a single block device (/dev/sdb) may be opened directly by the block library and the entire space for the LUN is available to the application.

Only the physical mode provides persistence of the data, i.e., the data written to the block device will survive application exit and restart and also reboot. The virtual LUNs do not persist (i.e., they do not survive after the application terminates or the system reboots).

Block library API

Applications intending to get access to the CXL Flash from user space should use the block library, as it abstracts the details of interfacing directly with the cxlflash driver that are necessary for performing administrative actions (e.g., setup, tear down, resize). The block library can be thought of as a ‘user’ of services, implemented as IOCTLs, that are provided by the cxlflash driver specifically for devices (LUNs) operating in user space access mode. While it is not a requirement that applications understand the interface between the block library and the cxlflash driver, a high-level overview of each supported service (IOCTL) is provided below.

The block library can be found on GitHub: http://github.com/open-power/capiflash

CXL Flash Driver LUN IOCTLs

Users, such as the block library, that wish to interface with a flash device (LUN) via user space access need to use the services provided by the cxlflash driver. As these services are implemented as ioctls, a file descriptor handle must first be obtained in order to establish the communication channel between a user and the kernel. This file descriptor is obtained by opening the device special file associated with the scsi disk device (/dev/sdb) that was created during LUN discovery. Because of where the cxlflash driver sits within the SCSI protocol stack, this open is not actually seen by the cxlflash driver. Upon successful open, the user receives a file descriptor (herein referred to as fd1) that should be used for issuing the subsequent ioctls listed below.

The structure definitions for these IOCTLs are available in: uapi/scsi/cxlflash_ioctl.h

DK_CXLFLASH_ATTACH

This ioctl obtains, initializes, and starts a context using the CXL kernel services. These services specify a context id (u16) by which to uniquely identify the context and its allocated resources. The services additionally provide a second file descriptor (herein referred to as fd2) that is used by the block library to initiate memory mapped I/O (via mmap()) to the CXL flash device and poll for completion events. This file descriptor is intentionally installed by this driver and not the CXL kernel services to allow for intermediary notification and access in the event of a non-user-initiated close(), such as a killed process. This design point is described in further detail in the description for the DK_CXLFLASH_DETACH ioctl.

There are a few important aspects regarding the “tokens” (context id and fd2) that are provided back to the user:

  • These tokens are only valid for the process under which they were created. The child of a forked process cannot continue to use the context id or file descriptor created by its parent (see DK_CXLFLASH_VLUN_CLONE for further details).

  • These tokens are only valid for the lifetime of the context and the process under which they were created. Once either is destroyed, the tokens are to be considered stale and subsequent usage will result in errors.

  • A valid adapter file descriptor (fd2 >= 0) is only returned on the initial attach for a context. Subsequent attaches to an existing context (DK_CXLFLASH_ATTACH_REUSE_CONTEXT flag present) do not provide the adapter file descriptor as it was previously made known to the application.

  • When a context is no longer needed, the user shall detach from the context via the DK_CXLFLASH_DETACH ioctl. When this ioctl returns with a valid adapter file descriptor and the return flag DK_CXLFLASH_APP_CLOSE_ADAP_FD is present, the application _must_ close the adapter file descriptor following a successful detach.

  • When this ioctl returns with a valid fd2 and the return flag DK_CXLFLASH_APP_CLOSE_ADAP_FD is present, the application _must_ close fd2 in the following circumstances:

    • Following a successful detach of the last user of the context
    • Following a successful recovery on the context’s original fd2
    • In the child process of a fork(), following a clone ioctl, on the fd2 associated with the source context
  • At any time, a close on fd2 will invalidate the tokens. Applications should exercise caution to only close fd2 when appropriate (outlined in the previous bullet) to avoid premature loss of I/O.
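The rules above for when the application must close fd2 can be condensed into a small decision helper. The sketch below is illustrative only: the `Event` names are hypothetical, and the flag value is a placeholder, as the real DK_CXLFLASH_APP_CLOSE_ADAP_FD definition lives in uapi/scsi/cxlflash_ioctl.h. The helper answers only whether the "must close" rule in the list applies.

```python
from enum import Enum

# Placeholder bit for illustration only; take the real value from
# uapi/scsi/cxlflash_ioctl.h.
DK_CXLFLASH_APP_CLOSE_ADAP_FD = 1 << 63


class Event(Enum):
    """Hypothetical names for the situations listed in the text."""
    DETACH_LAST_USER = 1          # successful detach of the last user
    RECOVERY_ON_ORIGINAL_FD2 = 2  # successful recovery on the original fd2
    CHILD_AFTER_CLONE = 3         # child of fork(), after the clone ioctl
    OTHER = 4                     # anything else (normal operation)


_CLOSE_EVENTS = {
    Event.DETACH_LAST_USER,
    Event.RECOVERY_ON_ORIGINAL_FD2,
    Event.CHILD_AFTER_CLONE,
}


def must_close_fd2(fd2, attach_return_flags, event):
    """Return True when the listed rules require the application to close fd2."""
    # The rule only applies to a valid fd2 returned together with the flag.
    if fd2 < 0 or not (attach_return_flags & DK_CXLFLASH_APP_CLOSE_ADAP_FD):
        return False
    return event in _CLOSE_EVENTS


if __name__ == "__main__":
    flags = DK_CXLFLASH_APP_CLOSE_ADAP_FD
    print(must_close_fd2(5, flags, Event.DETACH_LAST_USER))
    print(must_close_fd2(5, flags, Event.OTHER))
```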

DK_CXLFLASH_USER_DIRECT

This ioctl is responsible for transitioning the LUN to direct (physical) mode access and configuring the AFU for direct access from user space on a per-context basis. Additionally, the block size and last logical block address (LBA) are returned to the user.

As mentioned previously, when operating in user space access mode, LUNs may be accessed in whole or in part. Only one mode is allowed at a time and if one mode is active (outstanding references exist), requests to use the LUN in a different mode are denied.
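The mode exclusivity rule above behaves like a reference-counted state: the first reference fixes the mode, and a different mode is refused until every reference is dropped. The following toy model illustrates that behaviour; the class and method names are hypothetical and are not taken from the driver.

```python
class LunModeTracker:
    """Toy model of per-LUN mode exclusivity with reference counting."""

    MODES = ("direct", "virtual")

    def __init__(self):
        self.mode = None  # no mode is active until the first reference
        self.refs = 0     # outstanding references in the active mode

    def acquire(self, mode):
        """Take a reference in the given mode; return False if denied."""
        if mode not in self.MODES:
            raise ValueError(f"unknown mode: {mode}")
        if self.refs > 0 and self.mode != mode:
            return False  # another mode is active, so the request is denied
        self.mode = mode
        self.refs += 1
        return True

    def release(self):
        """Drop one reference; the mode clears when none remain."""
        if self.refs == 0:
            raise RuntimeError("no outstanding references")
        self.refs -= 1
        if self.refs == 0:
            self.mode = None


if __name__ == "__main__":
    lun = LunModeTracker()
    print(lun.acquire("virtual"), lun.acquire("direct"))
```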

The AFU is configured for direct access from user space by adding an entry to the AFU’s resource handle table. The index of the entry is treated as a resource handle that is returned to the user. The user is then able to use the handle to reference the LUN during I/O.

DK_CXLFLASH_USER_VIRTUAL

This ioctl is responsible for transitioning the LUN to virtual mode of access and configuring the AFU for virtual access from user space on a per-context basis. Additionally, the block size and last logical block address (LBA) are returned to the user.

As mentioned previously, when operating in user space access mode, LUNs may be accessed in whole or in part. Only one mode is allowed at a time and if one mode is active (outstanding references exist), requests to use the LUN in a different mode are denied.

The AFU is configured for virtual access from user space by adding an entry to the AFU’s resource handle table. The index of the entry is treated as a resource handle that is returned to the user. The user is then able to use the handle to reference the LUN during I/O.

By default, the virtual LUN is created with a size of 0. The user would need to use the DK_CXLFLASH_VLUN_RESIZE ioctl to grow the virtual LUN to a desired size. To avoid having to perform this resize for the initial creation of the virtual LUN, the user has the option of specifying a size as part of the DK_CXLFLASH_USER_VIRTUAL ioctl, such that when success is returned to the user, the resource handle that is provided is already referencing provisioned storage. This is reflected by the last LBA being a non-zero value.

When a LUN is accessible from more than one port, this ioctl will return with the DK_CXLFLASH_ALL_PORTS_ACTIVE return flag set. This provides the user with a hint that I/O can be retried in the event of an I/O error as the LUN can be reached over multiple paths.

DK_CXLFLASH_VLUN_RESIZE

This ioctl is responsible for resizing a previously created virtual LUN and will fail if invoked upon a LUN that is not in virtual mode. Upon success, an updated last LBA is returned to the user indicating the new size of the virtual LUN associated with the resource handle.

The partitioning of virtual LUNs is jointly mediated by the cxlflash driver and the AFU. An allocation table is kept for each LUN that is operating in the virtual mode and used to program a LUN translation table that the AFU references when provided with a resource handle.

This ioctl can return -EAGAIN if an AFU sync operation takes too long. In addition to returning a failure to the user, cxlflash will also schedule an asynchronous AFU reset. If this ioctl fails with -EAGAIN, the user can either retry the operation, in which case it is expected to succeed, or treat it as a failure.
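A caller that chooses to retry on -EAGAIN might wrap the ioctl in a bounded retry loop. The sketch below shows the pattern in Python, where a failed system call surfaces as an `OSError` carrying `errno.EAGAIN`. The helper name, the attempt count, and the delay are illustrative assumptions, and the `operation` callable stands in for whatever code issues the real ioctl.

```python
import errno
import time


def retry_on_eagain(operation, attempts=3, delay=0.0):
    """Call operation(), retrying only when it fails with EAGAIN.

    Any other error, or EAGAIN on the final attempt, is re-raised so the
    caller can treat it as a failure.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return operation()
        except OSError as exc:
            if exc.errno != errno.EAGAIN or attempt == attempts - 1:
                raise
            # Pause before retrying; a non-zero delay is an assumption meant
            # to give the scheduled AFU reset time to complete.
            time.sleep(delay)


if __name__ == "__main__":
    state = {"calls": 0}

    def fake_resize():
        # Stand-in for the real ioctl: fails once with EAGAIN, then succeeds.
        state["calls"] += 1
        if state["calls"] == 1:
            raise OSError(errno.EAGAIN, "AFU sync timed out")
        return "resized"

    print(retry_on_eagain(fake_resize), state["calls"])
```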

DK_CXLFLASH_RELEASE

This ioctl is responsible for releasing a previously obtained reference to either a physical or virtual LUN. This can be thought of as the inverse of the DK_CXLFLASH_USER_DIRECT or DK_CXLFLASH_USER_VIRTUAL ioctls. Upon success, the resource handle is no longer valid and the entry in the resource handle table is made available to be used again.

As part of the release process for virtual LUNs, the virtual LUN is first resized to 0 to clear out and free the translation tables associated with the virtual LUN reference.
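The relationship between resource handles and table entries described for the USER_DIRECT, USER_VIRTUAL, and RELEASE ioctls can be pictured with a toy table in which the handle is simply the index of the entry. This is a conceptual sketch only: the class is hypothetical, and picking the first free slot is an assumption, since the text does not say how the driver chooses an entry.

```python
class ResourceHandleTable:
    """Toy model: a fixed-size table whose indices act as resource handles."""

    def __init__(self, size):
        self.entries = [None] * size  # None marks a free entry

    def add(self, lun):
        """Store a LUN reference and return its index as the handle."""
        for index, entry in enumerate(self.entries):
            if entry is None:
                self.entries[index] = lun
                return index
        raise RuntimeError("resource handle table is full")

    def lookup(self, handle):
        """Return the LUN reference behind a valid handle."""
        self._check(handle)
        return self.entries[handle]

    def release(self, handle):
        """Invalidate the handle and make its entry available again."""
        self._check(handle)
        self.entries[handle] = None

    def _check(self, handle):
        if not 0 <= handle < len(self.entries) or self.entries[handle] is None:
            raise ValueError(f"invalid resource handle: {handle}")


if __name__ == "__main__":
    table = ResourceHandleTable(2)
    first = table.add("lun-a")
    table.release(first)
    print(table.add("lun-b"))
```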

DK_CXLFLASH_DETACH

This ioctl is responsible for unregistering a context with the cxlflash driver and releasing outstanding resources that were not explicitly released via the DK_CXLFLASH_RELEASE ioctl. Upon success, all “tokens” which had been provided to the user from the DK_CXLFLASH_ATTACH onward are no longer valid.

When the DK_CXLFLASH_APP_CLOSE_ADAP_FD flag was returned on a successful attach, the application _must_ close the fd2 associated with the context following the detach of the final user of the context.

DK_CXLFLASH_VLUN_CLONE

This ioctl is responsible for cloning a previously created context to a more recently created context. It exists solely to support maintaining user space access to storage after a process forks. Upon success, the child process (which invoked the ioctl) will have access to the same LUNs via the same resource handle(s) as the parent, but under a different context.

Context sharing across processes is not supported with CXL and therefore each fork must be met with establishing a new context for the child process. This ioctl simplifies the state management and playback required by a user in such a scenario. When a process forks, the child process can clone the parent’s context by first creating a context (via DK_CXLFLASH_ATTACH) and then using this ioctl to perform the clone from the parent to the child.

The clone itself is fairly simple. The resource handle and LUN translation tables are copied from the parent context to the child’s and then synced with the AFU.

When the DK_CXLFLASH_APP_CLOSE_ADAP_FD flag was returned on a successful attach, the application _must_ close the fd2 associated with the source context (still resident/accessible in the parent process) following the clone. This is to avoid a stale entry in the file descriptor table of the child process.

This ioctl can return -EAGAIN if an AFU sync operation takes too long. In addition to returning a failure to the user, cxlflash will also schedule an asynchronous AFU reset. If this ioctl fails with -EAGAIN, the user can either retry the operation, in which case it is expected to succeed, or treat it as a failure.

DK_CXLFLASH_VERIFY

This ioctl is used to detect various changes such as the capacity of the disk changing, the number of LUNs visible changing, etc. In cases where the changes affect the application (such as a LUN resize), the cxlflash driver will report the changed state to the application.

The user calls in when they want to validate that a LUN hasn’t been changed in response to a check condition. As the user is operating out of band from the kernel, they will see these types of events without the kernel’s knowledge. When encountered, the user’s architected behavior is to call in to this ioctl, indicating what they want to verify and passing along any appropriate information. For now, only verifying a LUN change (i.e., a different size) with sense data is supported.

DK_CXLFLASH_RECOVER_AFU

This ioctl is used to drive recovery (if such an action is warranted) of a specified user context. Any state associated with the user context is re-established upon successful recovery.

User contexts are put into an error condition when the device needs to be reset or is terminating. Users are notified of this error condition by seeing all 0xF’s on an MMIO read. Upon encountering this, the architected behavior for a user is to call into this ioctl to recover their context. A user may also call into this ioctl at any time to check if the device is operating normally. If a failure is returned from this ioctl, the user is expected to gracefully clean up their context via release/detach ioctls. Until they do, the context they hold is not relinquished. The user may also optionally exit the process at which time the context/resources they held will be freed as part of the release fop.
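The error notification described above, an MMIO read that returns all 0xF’s, amounts to comparing the value read against an all-ones pattern of the register width. A minimal sketch follows; the function name is hypothetical and the 64-bit default width is an assumption, as the text does not state the width of the read.

```python
def mmio_read_signals_error(value, width_bits=64):
    """Return True if an MMIO read came back as all 0xF's.

    width_bits is the width of the register that was read; 64 is an
    assumed default, not something the text specifies.
    """
    all_ones = (1 << width_bits) - 1
    return value == all_ones


if __name__ == "__main__":
    # An all-ones read suggests the context needs DK_CXLFLASH_RECOVER_AFU.
    print(mmio_read_signals_error(0xFFFFFFFFFFFFFFFF))
    print(mmio_read_signals_error(0x0000000000001234))
```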

When the DK_CXLFLASH_APP_CLOSE_ADAP_FD flag was returned on a successful attach, the application _must_ unmap and close the fd2 associated with the original context following this ioctl returning success and indicating that the context was recovered (DK_CXLFLASH_RECOVER_AFU_CONTEXT_RESET).

DK_CXLFLASH_MANAGE_LUN

This ioctl is used to switch a LUN from a mode where it is available for file-system access (legacy), to a mode where it is set aside for exclusive user space access (superpipe). In case a LUN is visible across multiple ports and adapters, this ioctl is used to uniquely identify each LUN by its World Wide Node Name (WWNN).

CXL Flash Driver Host IOCTLs

Each host adapter instance that is supported by the cxlflash driver has a special character device associated with it to enable a set of host management functions. These character devices are hosted in a class dedicated for cxlflash and can be accessed via /dev/cxlflash/*.

Applications can be written to perform various functions using the host ioctl APIs below.

The structure definitions for these IOCTLs are available in: uapi/scsi/cxlflash_ioctl.h

HT_CXLFLASH_LUN_PROVISION

This ioctl is used to create and delete persistent LUNs on cxlflash devices that lack an external LUN management interface. It is only valid when used with AFUs that support the LUN provision capability.

When sufficient space is available, LUNs can be created by specifying the target port to host the LUN and a desired size in 4K blocks. Upon success, the LUN ID and WWID of the created LUN will be returned and the SCSI bus can be scanned to detect the change in LUN topology. Note that partial allocations are not supported. Should a creation fail due to a space issue, the target port can be queried for its current LUN geometry.

To remove a LUN, the device must first be disassociated from the Linux SCSI subsystem. The LUN deletion can then be initiated by specifying a target port and LUN ID. Upon success, the LUN geometry associated with the port will be updated to reflect the new number of provisioned LUNs and available capacity.

To query the LUN geometry of a port, the target port is specified and upon success, the following information is presented:

  • Maximum number of provisioned LUNs allowed for the port
  • Current number of provisioned LUNs for the port
  • Maximum total capacity of provisioned LUNs for the port (4K blocks)
  • Current total capacity of provisioned LUNs for the port (4K blocks)

With this information, the number of available LUNs and capacity can be calculated.
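The calculation is a pair of subtractions over the four values returned by the geometry query. In the Python sketch below, the class and field names are hypothetical and do not match the structure in uapi/scsi/cxlflash_ioctl.h; capacities are in 4K blocks as in the list above. The `can_create` check reflects the statement that partial allocations are not supported.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PortGeometry:
    """Hypothetical container for the values returned by a geometry query."""
    max_luns: int       # maximum number of provisioned LUNs for the port
    cur_luns: int       # current number of provisioned LUNs for the port
    max_capacity: int   # maximum total capacity, in 4K blocks
    cur_capacity: int   # current total capacity, in 4K blocks

    def available_luns(self):
        """Number of additional LUNs that could still be provisioned."""
        return self.max_luns - self.cur_luns

    def available_capacity(self):
        """Remaining capacity in 4K blocks."""
        return self.max_capacity - self.cur_capacity

    def can_create(self, size_blocks):
        """True if a LUN of size_blocks fits entirely (no partial allocation)."""
        return (self.available_luns() > 0
                and 0 < size_blocks <= self.available_capacity())


if __name__ == "__main__":
    geo = PortGeometry(max_luns=8, cur_luns=3,
                       max_capacity=1000, cur_capacity=400)
    print(geo.available_luns(), geo.available_capacity(), geo.can_create(600))
```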

HT_CXLFLASH_AFU_DEBUG

This ioctl is used to debug AFUs by supporting a command pass-through interface. It is only valid when used with AFUs that support the AFU debug capability.

With the exception of buffer management, AFU debug commands are opaque to cxlflash and treated as pass-through. For debug commands that do require data transfer, the user supplies an adequately sized data buffer and must specify the data transfer direction with respect to the host. There is a maximum transfer size of 256K imposed. Note that partial read completions are not supported: when errors are experienced with a host read data transfer, the data buffer is not copied back to the user.