PCI Peer-to-Peer DMA Support
The PCI bus has pretty decent support for performing DMA transfers between two devices on the bus. This type of transaction is henceforth called Peer-to-Peer (or P2P). However, there are a number of issues that make P2P transactions tricky to do in a perfectly safe way.
For PCIe the routing of Transaction Layer Packets (TLPs) is well-defined up until they reach a host bridge or root port. If the path includes PCIe switches then, based on the ACS settings, the transaction can route entirely within the PCIe hierarchy and never reach the root port. The kernel will evaluate the PCIe topology and always permit P2P in these well-defined cases.
However, if the P2P transaction reaches the host bridge then it might have to hairpin back out the same root port, be routed inside the CPU SOC to another PCIe root port, or be routed internally to the SOC.
The PCIe specification doesn’t define the forwarding of transactions between hierarchy domains, and the kernel defaults to blocking such routing. There is an allow list for detecting known-good HW, in which case P2P between any two PCIe devices will be permitted.
Since P2P inherently is doing transactions between two devices it requires two drivers to be co-operating inside the kernel. The providing driver has to convey its MMIO to the consuming driver. To meet the driver model lifecycle rules the MMIO must have all DMA mapping removed, all CPU accesses prevented, and all page table mappings undone before the providing driver completes remove().
This requires the providing and consuming driver to actively work together to guarantee that the consuming driver has stopped using the MMIO during a removal cycle. This is done by either a synchronous invalidation shutdown or waiting for all usage refcounts to reach zero.
At the lowest level the P2P subsystem offers a naked struct p2pdma_provider that delegates lifecycle management to the providing driver. It is expected that drivers using this option will wrap their MMIO memory in DMABUF and use DMABUF to provide an invalidation shutdown. These MMIO addresses have no struct page, and if used with mmap() must create special PTEs. As such there are very few kernel uAPIs that can accept pointers to them; in particular they cannot be used with read()/write(), including O_DIRECT.
Building on this, the subsystem offers a layer to wrap the MMIO in a ZONE_DEVICE pgmap of MEMORY_DEVICE_PCI_P2PDMA to create struct pages. The lifecycle of the pgmap ensures that when the pgmap is destroyed all other drivers have stopped using the MMIO. This option works with O_DIRECT flows, in some cases, if the underlying subsystem supports handling MEMORY_DEVICE_PCI_P2PDMA through FOLL_PCI_P2PDMA. The use of FOLL_LONGTERM is prevented. As this relies on pgmap it also relies on architecture support along with alignment and minimum size limitations.
Driver Writer’s Guide
In a given P2P implementation there may be three or more different types of kernel drivers in play:
- Provider - A driver which provides or publishes P2P resources like memory or doorbell registers to other drivers.
- Client - A driver which makes use of a resource by setting up a DMA transaction to or from it.
- Orchestrator - A driver which orchestrates the flow of data between clients and providers.
In many cases there could be overlap between these three types (e.g., it may be typical for a driver to be both a provider and a client).
For example, in the NVMe Target Copy Offload implementation:
- The NVMe PCI driver is a client, provider and orchestrator all at once: it exposes any CMB (Controller Memory Buffer) as a P2P memory resource (provider), it accepts P2P memory pages as buffers in requests to be used directly (client) and it can also make use of the CMB as submission queue entries (orchestrator).
- The RDMA driver is a client in this arrangement so that an RNIC can DMA directly to the memory exposed by the NVMe device.
- The NVMe Target driver (nvmet) can orchestrate the data from the RNIC to the P2P memory (CMB) and then to the NVMe device (and vice versa).
This is currently the only arrangement supported by the kernel but one could imagine slight tweaks to this that would allow for the same functionality. For example, if a specific RNIC added a BAR with some memory behind it, its driver could add support as a P2P provider and then the NVMe Target could use the RNIC’s memory instead of the CMB in cases where the NVMe cards in use do not have CMB support.
Provider Drivers
A provider simply needs to register a BAR (or a portion of a BAR) as a P2P DMA resource using pci_p2pdma_add_resource(). This will register struct pages for all the specified memory.
After that it may optionally publish all of its resources as P2P memory using pci_p2pmem_publish(). This will allow any orchestrator drivers to find and use the memory. When marked in this way, the resource must be regular memory with no side effects.
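The two provider steps can be sketched as follows. This is a minimal illustration, not a real driver: so that it builds outside a kernel tree, struct pci_dev and both P2PDMA calls are replaced by stand-in stubs that only mimic the signatures documented in the library section below (in a real driver they come from the kernel headers). The names example_setup_p2pmem, EXAMPLE_P2P_BAR and the stub fields are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* ---- Stand-ins for kernel definitions (NOT the real implementation) ---- */
typedef uint64_t u64;

struct pci_dev {
	int fail_add;    /* knob for this sketch: make registration fail */
	int bar_added;   /* BAR index that was registered, or -1 */
	bool published;  /* whether the memory has been published */
};

static int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar,
				   size_t size, u64 offset)
{
	(void)size;
	(void)offset;
	assert(pdev != NULL);
	if (pdev->fail_add)
		return -ENOMEM;
	pdev->bar_added = bar;
	return 0;
}

static void pci_p2pmem_publish(struct pci_dev *pdev, bool publish)
{
	assert(pdev != NULL);
	pdev->published = publish;
}
/* ---- End of stand-ins ---- */

#define EXAMPLE_P2P_BAR 4 /* hypothetical BAR holding side-effect-free memory */

/* Register the whole BAR as P2P memory and, on success, publish it. */
int example_setup_p2pmem(struct pci_dev *pdev)
{
	int rc;

	/* size == 0 asks for the whole BAR; offset 0 starts at its base. */
	rc = pci_p2pdma_add_resource(pdev, EXAMPLE_P2P_BAR, 0, 0);
	if (rc)
		return rc;

	/* Optional step: let orchestrator drivers discover this memory. */
	pci_p2pmem_publish(pdev, true);
	return 0;
}
```

A driver that wants to keep the memory for its own exclusive use would simply skip the publish call.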
For the time being this is fairly rudimentary in that all resources are typically going to be P2P memory. Future work will likely expand this to include other types of resources like doorbells.
Client Drivers
A client driver only has to use the mapping API functions dma_map_sg() and dma_unmap_sg() as usual, and the implementation will do the right thing for the P2P capable memory.
Orchestrator Drivers
The first task an orchestrator driver must do is compile a list of all client devices that will be involved in a given transaction. For example, the NVMe Target driver creates a list including the namespace block device and the RNIC in use. If the orchestrator has access to a specific P2P provider to use, it may check compatibility using pci_p2pdma_distance(); otherwise it may find a memory provider that’s compatible with all clients using pci_p2pmem_find(). If more than one provider is supported, the one nearest to all the clients will be chosen first. If more than one provider is an equal distance away, the one returned will be chosen at random (the choice is not merely arbitrary but truly random). This function returns the PCI device to use for the provider with a reference taken, and therefore when it’s no longer needed it should be returned with pci_dev_put().
Once a provider is selected, the orchestrator can then use pci_alloc_p2pmem() and pci_free_p2pmem() to allocate P2P memory from the provider. pci_p2pmem_alloc_sgl() and pci_p2pmem_free_sgl() are convenience functions for allocating scatter-gather lists with P2P memory.
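The find, allocate, free and put sequence described above might look roughly like this. It is a sketch under stated assumptions: the kernel types and the four calls are replaced by stand-in stubs so the flow can be compiled and exercised in userspace, and the stubs only imitate the documented signatures and reference behaviour. example_one_transfer, stub_provider and the stub fields are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* ---- Stand-ins for kernel definitions (NOT the real implementation) ---- */
struct device { int unused; };

struct pci_dev {
	int refs;       /* references handed out by the find stub */
	size_t in_use;  /* bytes currently allocated from this provider */
};

static struct pci_dev stub_provider; /* the only provider the stub knows */

static struct pci_dev *pci_p2pmem_find_many(struct device **clients,
					    int num_clients)
{
	if (!clients || num_clients <= 0)
		return NULL;
	stub_provider.refs++; /* returned with a reference taken */
	return &stub_provider;
}

static void pci_dev_put(struct pci_dev *pdev)
{
	assert(pdev && pdev->refs > 0);
	pdev->refs--;
}

static void *pci_alloc_p2pmem(struct pci_dev *pdev, size_t size)
{
	void *p = malloc(size); /* real code hands out BAR-backed memory */

	assert(pdev != NULL);
	if (p)
		pdev->in_use += size;
	return p;
}

static void pci_free_p2pmem(struct pci_dev *pdev, void *addr, size_t size)
{
	assert(pdev && pdev->in_use >= size);
	pdev->in_use -= size;
	free(addr);
}
/* ---- End of stand-ins ---- */

/*
 * Hypothetical orchestrator step: find a provider compatible with all
 * clients, borrow a buffer for one transfer, then release everything.
 * Returns 0 on success, -1 if no provider or no memory was available.
 */
int example_one_transfer(struct device **clients, int num_clients, size_t len)
{
	struct pci_dev *provider;
	void *buf;
	int rc = -1;

	provider = pci_p2pmem_find_many(clients, num_clients);
	if (!provider)
		return -1;

	buf = pci_alloc_p2pmem(provider, len);
	if (buf) {
		/* ... hand buf to the clients and run the DMA here ... */
		pci_free_p2pmem(provider, buf, len);
		rc = 0;
	}

	pci_dev_put(provider); /* drop the reference taken by the find call */
	return rc;
}
```

The point of the shape is that every exit path after a successful find drops the provider reference, and every successful allocation is freed with the same size it was allocated with.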
Struct Page Caveats
While the MEMORY_DEVICE_PCI_P2PDMA pages can be installed in VMAs, pin_user_pages() and related will not return them unless FOLL_PCI_P2PDMA is set.
The MEMORY_DEVICE_PCI_P2PDMA pages require care to support in the kernel. The KVA is still MMIO and must still be accessed through the normal readX()/writeX()/etc helpers. Direct CPU access (e.g. memcpy) is forbidden, just like any other MMIO mapping. While this will actually work on some architectures, others will experience corruption or just crash in the kernel. Supporting FOLL_PCI_P2PDMA in a subsystem requires scrubbing it to ensure no CPU access happens.
Usage With DMABUF
DMABUF provides an alternative to the above struct page-based client/provider/orchestrator system and should be used when struct page doesn’t exist. In this mode the exporting driver will wrap some of its MMIO in a DMABUF and give the DMABUF FD to userspace.
Userspace can then pass the FD to an importing driver which will ask the exporting driver to map it to the importer.
In this case the initiator and target pci_devices are known and the P2P subsystem is used to determine the mapping type. The phys_addr_t-based DMA API is used to establish the dma_addr_t.
Lifecycle is controlled by DMABUF move_notify(). When the exporting driver wants to remove() it must deliver an invalidation shutdown to all DMABUF importing drivers through move_notify() and synchronously DMA unmap all the MMIO.
No importing driver can continue to have a DMA map to the MMIO after the exporting driver has destroyed its p2pdma_provider.
P2P DMA Support Library
- int pcim_p2pdma_init(struct pci_dev *pdev)
Initialise peer-to-peer DMA providers
Parameters
struct pci_dev *pdev
    The PCI device to enable P2PDMA for
Description
This function initializes the peer-to-peer DMA infrastructure for a PCI device. It allocates and sets up the necessary data structures to support P2PDMA operations, including mapping type tracking.
- struct p2pdma_provider *pcim_p2pdma_provider(struct pci_dev *pdev, int bar)
Get peer-to-peer DMA provider
Parameters
struct pci_dev *pdev
    The PCI device to enable P2PDMA for
int bar
    BAR index to get provider
Description
This function gets the peer-to-peer DMA provider for a PCI device. The lifetime of the provider (and of course the MMIO) is bound to the lifetime of the driver. A driver calling this function must ensure that all references to the provider, and any DMA mappings created for any MMIO, are cleaned up before the driver remove() completes.
Since P2P is almost always shared with a second driver, this means some system to notify, invalidate and revoke the MMIO’s DMA must be in place to use this function. For example, a revoke can be built using DMABUF.
- int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size, u64 offset)
add memory for use as p2p memory
Parameters
struct pci_dev *pdev
    the device to add the memory to
int bar
    PCI BAR to add
size_t size
    size of the memory to add, may be zero to use the whole BAR
u64 offset
    offset into the PCI BAR
Description
The memory will be given ZONE_DEVICE struct pages so that it may be used with any DMA request.
- int pci_p2pdma_distance_many(struct pci_dev *provider, struct device **clients, int num_clients, bool verbose)
Determine the cumulative distance between a p2pdma provider and the clients in use.
Parameters
struct pci_dev *provider
    p2pdma provider to check against the client list
struct device **clients
    array of devices to check (NULL-terminated)
int num_clients
    number of clients in the array
bool verbose
    if true, print warnings for devices when we return -1
Description
Returns -1 if any of the clients are not compatible, otherwise returns a positive number where a lower number is the preferable choice. (If there’s one client that’s the same as the provider it will return 0, which is the best choice.)
“compatible” means the provider and the clients are either all behind the same PCI root port or the host bridges connected to each of the devices are listed in the ‘pci_p2pdma_whitelist’.
- struct pci_dev *pci_p2pmem_find_many(struct device **clients, int num_clients)
find a peer-to-peer DMA memory device compatible with the specified list of clients and shortest distance
Parameters
struct device **clients
    array of devices to check (NULL-terminated)
int num_clients
    number of client devices in the list
Description
If multiple devices are behind the same switch, the one “closest” to the client devices in use will be chosen first. (So if one of the providers is the same as one of the clients, that provider will be used ahead of any other providers that are unrelated). If multiple providers are an equal distance away, one will be chosen at random.
Returns a pointer to the PCI device with a reference taken (use pci_dev_put() to return the reference) or NULL if no compatible device is found. The found provider will also be assigned to the client list.
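The selection rule described above (skip incompatible providers, prefer the lowest distance, break ties at random) can be modelled in a few lines of userspace C. This is only a model of the documented behaviour, not the kernel's code: the distances array stands in for per-provider results of pci_p2pdma_distance_many(), rand() stands in for whatever randomness source the kernel uses, and example_pick_provider is a hypothetical name.

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Model only: pick the index of the closest compatible provider.
 * distances[i] plays the role of a pci_p2pdma_distance_many() result:
 * negative means "not compatible", otherwise lower is better.
 * Returns -1 when no provider is compatible.
 */
int example_pick_provider(const int *distances, int n)
{
	int best = -1; /* index of the current choice */
	int ties = 0;  /* how many providers share the best distance so far */

	assert(distances || n == 0);

	for (int i = 0; i < n; i++) {
		if (distances[i] < 0)
			continue; /* incompatible: never selected */

		if (best < 0 || distances[i] < distances[best]) {
			best = i; /* strictly closer than anything seen */
			ties = 1;
		} else if (distances[i] == distances[best]) {
			ties++;
			/*
			 * Replace the current choice with probability of
			 * roughly 1/ties, so tied providers get about the
			 * same chance of being selected.
			 */
			if (rand() % ties == 0)
				best = i;
		}
	}
	return best;
}
```

With distances {4, -1, 2, 6} the model always picks index 2; with {2, 5, 2} it returns either index 0 or index 2.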
- void *pci_alloc_p2pmem(struct pci_dev *pdev, size_t size)
allocate peer-to-peer DMA memory
Parameters
struct pci_dev *pdev
    the device to allocate memory from
size_t size
    number of bytes to allocate
Description
Returns the allocated memory or NULL on error.
- void pci_free_p2pmem(struct pci_dev *pdev, void *addr, size_t size)
free peer-to-peer DMA memory
Parameters
struct pci_dev *pdev
    the device the memory was allocated from
void *addr
    address of the memory that was allocated
size_t size
    number of bytes that were allocated
- pci_bus_addr_t pci_p2pmem_virt_to_bus(struct pci_dev *pdev, void *addr)
return the PCI bus address for a given virtual address obtained with pci_alloc_p2pmem()
Parameters
struct pci_dev *pdev
    the device the memory was allocated from
void *addr
    address of the memory that was allocated
- struct scatterlist *pci_p2pmem_alloc_sgl(struct pci_dev *pdev, unsigned int *nents, u32 length)
allocate peer-to-peer DMA memory in a scatterlist
Parameters
struct pci_dev *pdev
    the device to allocate memory from
unsigned int *nents
    the number of SG entries in the list
u32 length
    number of bytes to allocate
Return
NULL on error, or a struct scatterlist pointer and nents on success
- void pci_p2pmem_free_sgl(struct pci_dev *pdev, struct scatterlist *sgl)
free a scatterlist allocated by pci_p2pmem_alloc_sgl()
Parameters
struct pci_dev *pdev
    the device to allocate memory from
struct scatterlist *sgl
    the allocated scatterlist
- void pci_p2pmem_publish(struct pci_dev *pdev, bool publish)
publish the peer-to-peer DMA memory for use by other devices with pci_p2pmem_find()
Parameters
struct pci_dev *pdev
    the device with peer-to-peer DMA memory to publish
bool publish
    set to true to publish the memory, false to unpublish it
Description
Published memory can be used by other PCI device drivers for peer-to-peer DMA operations. Non-published memory is reserved for exclusive use of the device driver that registers the peer-to-peer memory.
- int pci_p2pdma_enable_store(const char *page, struct pci_dev **p2p_dev, bool *use_p2pdma)
parse a configfs/sysfs attribute store to enable p2pdma
Parameters
const char *page
    contents of the value to be stored
struct pci_dev **p2p_dev
    returns the PCI device that was selected to be used (if one was specified in the stored value)
bool *use_p2pdma
    returns whether to enable p2pdma or not
Description
Parses an attribute value to decide whether to enable p2pdma. The value can select a PCI device (using its full BDF device name) or a boolean (in any format kstrtobool() accepts). A false value disables p2pdma, a true value expects the caller to automatically find a compatible device, and specifying a PCI device expects the caller to use the specific provider.
pci_p2pdma_enable_show() should be used as the show operation for the attribute.
Returns 0 on success
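The three outcomes a caller has to handle after a successful store can be written down as a small decision function. This is an illustrative model of the rules in the description above, not kernel code: struct pci_dev is a placeholder type here, and the enum and example_classify_p2p are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Placeholder so the sketch builds in userspace; not the kernel's type. */
struct pci_dev { int unused; };

enum example_p2p_mode {
	EXAMPLE_P2P_DISABLED,      /* a false value was stored */
	EXAMPLE_P2P_AUTO_FIND,     /* a true value: caller must find a provider */
	EXAMPLE_P2P_USE_GIVEN_DEV, /* a PCI device name was stored */
};

/*
 * Interpret the two outputs of a successful store:
 * p2p_dev is the selected device (NULL if none was named) and
 * use_p2pdma says whether P2P DMA was enabled at all.
 */
enum example_p2p_mode example_classify_p2p(const struct pci_dev *p2p_dev,
					    bool use_p2pdma)
{
	if (!use_p2pdma)
		return EXAMPLE_P2P_DISABLED;

	return p2p_dev ? EXAMPLE_P2P_USE_GIVEN_DEV : EXAMPLE_P2P_AUTO_FIND;
}
```

In the auto-find case the caller would typically go on to look up a provider that is compatible with its clients, as described in the orchestrator section.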
- ssize_t pci_p2pdma_enable_show(char *page, struct pci_dev *p2p_dev, bool use_p2pdma)
show a configfs/sysfs attribute indicating whether p2pdma is enabled
Parameters
char *page
    contents of the stored value
struct pci_dev *p2p_dev
    the selected p2p device (NULL if no device is selected)
bool use_p2pdma
    whether p2pdma has been enabled
Description
Attributes that use pci_p2pdma_enable_store() should use this function to show the value of the attribute.
Returns 0 on success