PCI Peer-to-Peer DMA Support

The PCI bus has pretty decent support for performing DMA transfers between two devices on the bus. This type of transaction is henceforth called Peer-to-Peer (or P2P). However, there are a number of issues that make P2P transactions tricky to do in a perfectly safe way.

One of the biggest issues is that PCI doesn’t require forwarding transactions between hierarchy domains, and in PCIe, each Root Port defines a separate hierarchy domain. To make things worse, there is no simple way to determine if a given Root Complex supports this or not (see PCIe r4.0, sec 1.3.1). Therefore, as of this writing, the kernel only supports doing P2P when the endpoints involved are all behind the same PCI bridge. Such devices are all in the same PCI hierarchy domain, and the spec guarantees that all transactions within the hierarchy will be routable, but it does not require routing between hierarchies.

The second issue is that to make use of existing interfaces in Linux, memory that is used for P2P transactions needs to be backed by struct pages. However, PCI BARs are not typically cache coherent, so there are a few corner-case gotchas with these pages and developers need to be careful about what they do with them.

Driver Writer’s Guide

In a given P2P implementation there may be three or more different types of kernel drivers in play:

  • Provider - A driver which provides or publishes P2P resources like memory or doorbell registers to other drivers.
  • Client - A driver which makes use of a resource by setting up a DMA transaction to or from it.
  • Orchestrator - A driver which orchestrates the flow of data between clients and providers.

In many cases there could be overlap between these three types (e.g., it may be typical for a driver to be both a provider and a client).

For example, in the NVMe Target Copy Offload implementation:

  • The NVMe PCI driver is a client, a provider and an orchestrator all at once: it exposes any CMB (Controller Memory Buffer) as a P2P memory resource (provider), it accepts P2P memory pages as buffers in requests to be used directly (client) and it can also make use of the CMB as submission queue entries (orchestrator).
  • The RDMA driver is a client in this arrangement so that an RNIC can DMA directly to the memory exposed by the NVMe device.
  • The NVMe Target driver (nvmet) can orchestrate the data from the RNIC to the P2P memory (CMB) and then to the NVMe device (and vice versa).

This is currently the only arrangement supported by the kernel, but one could imagine slight tweaks to this that would allow for the same functionality. For example, if a specific RNIC added a BAR with some memory behind it, its driver could add support as a P2P provider and then the NVMe Target could use the RNIC’s memory instead of the CMB in cases where the NVMe cards in use do not have CMB support.

Provider Drivers

A provider simply needs to register a BAR (or a portion of a BAR) as a P2P DMA resource using pci_p2pdma_add_resource(). This will register struct pages for all the specified memory.

After that it may optionally publish all of its resources as P2P memory using pci_p2pmem_publish(). This will allow any orchestrator drivers to find and use the memory. When marked in this way, the resource must be regular memory with no side effects.

For the time being this is fairly rudimentary in that all resources are typically going to be P2P memory. Future work will likely expand this to include other types of resources like doorbells.

Client Drivers

A client driver typically only has to conditionally change its DMA map routine to use the mapping function pci_p2pdma_map_sg() instead of the usual dma_map_sg() function. Scatterlists mapped in this way should be unmapped with the matching P2P unmap function, pci_p2pdma_unmap_sg_attrs() (see the library reference below).

The client may also, optionally, make use of is_pci_p2pdma_page() to determine when to use the P2P mapping functions and when to use the regular mapping functions. In some situations, it may be more appropriate to use a flag to indicate a given request is P2P memory and map appropriately. It is important to ensure that struct pages that back P2P memory stay out of code that does not have support for them, as other code may treat the pages as regular memory, which may not be appropriate.

Orchestrator Drivers

The first task an orchestrator driver must do is compile a list of all client devices that will be involved in a given transaction. For example, the NVMe Target driver creates a list including the namespace block device and the RNIC in use. If the orchestrator has access to a specific P2P provider to use, it may check compatibility using pci_p2pdma_distance(); otherwise it may find a memory provider that’s compatible with all clients using pci_p2pmem_find(). If more than one provider is supported, the one nearest to all the clients will be chosen first. If more than one provider is an equal distance away, the one returned will be chosen at random (it is not arbitrary but truly random). This function returns the PCI device to use for the provider with a reference taken, and therefore when it’s no longer needed it should be returned with pci_dev_put().
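
The selection rule described above (nearest provider wins, ties broken at random) can be illustrated with a small userspace model. This is only a sketch and not the kernel's implementation: the function name pick_nearest_provider() is hypothetical, distances are assumed to be precomputed integers with a negative value marking an incompatible provider, and rand() stands in for the kernel's random number source.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/*
 * Userspace model only (hypothetical helper, not a kernel function):
 * return the index of the candidate with the smallest non-negative
 * distance, breaking ties at random. A negative distance marks a
 * candidate that is incompatible with the client list.
 * Returns -1 when no candidate is compatible.
 */
static int pick_nearest_provider(const int *distance, size_t count)
{
	int best = -1;      /* index of the current choice */
	int best_dist = -1; /* distance of the current choice */
	int ties = 0;       /* candidates seen so far at best_dist */
	size_t i;

	assert(distance != NULL || count == 0);

	for (i = 0; i < count; i++) {
		if (distance[i] < 0)
			continue; /* incompatible, never eligible */

		if (best < 0 || distance[i] < best_dist) {
			/* Strictly closer: restart the tie group. */
			best = (int)i;
			best_dist = distance[i];
			ties = 1;
		} else if (distance[i] == best_dist) {
			/* Equal distance: keep the newcomer with chance 1/ties. */
			ties++;
			if (rand() % ties == 0)
				best = (int)i;
		}
	}

	return best;
}
```

Replacing the current choice with probability 1/ties keeps the pick random across all equally distant candidates without storing the whole tie group. In a real orchestrator none of this is needed, since pci_p2pmem_find() performs the selection itself.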

Once a provider is selected, the orchestrator can then use pci_alloc_p2pmem() and pci_free_p2pmem() to allocate and free P2P memory from the provider. pci_p2pmem_alloc_sgl() and pci_p2pmem_free_sgl() are convenience functions for allocating scatter-gather lists with P2P memory.

Struct Page Caveats

Driver writers should be very careful about not passing these special struct pages to code that isn’t prepared for it. At this time, the kernel interfaces do not have any checks for ensuring this. This obviously precludes passing these pages to userspace.

P2P memory is also technically IO memory but should never have any side effects behind it. Thus, the order of loads and stores should not be important and ioreadX(), iowriteX() and friends should not be necessary.

P2P DMA Support Library

int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size, u64 offset)

add memory for use as p2p memory

Parameters

struct pci_dev *pdev
the device to add the memory to
int bar
PCI BAR to add
size_t size
size of the memory to add, may be zero to use the whole BAR
u64 offset
offset into the PCI BAR

Description

The memory will be given ZONE_DEVICE struct pages so that it may be used with any DMA request.

int pci_p2pdma_distance_many(struct pci_dev *provider, struct device **clients, int num_clients, bool verbose)

Determine the cumulative distance between a p2pdma provider and the clients in use.

Parameters

struct pci_dev *provider
p2pdma provider to check against the client list
struct device **clients
array of devices to check (NULL-terminated)
int num_clients
number of clients in the array
bool verbose
if true, print warnings for devices when we return -1

Description

Returns -1 if any of the clients are not compatible, otherwise returns a positive number where a lower number is the preferable choice. (If there’s one client that’s the same as the provider it will return 0, which is the best choice).
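
The return convention above can be modelled in userspace as follows. This is a sketch rather than the kernel's code: cumulative_distance() is a hypothetical helper, the per-client distances are assumed to be already known, "cumulative" is assumed to mean a plain sum, and treating an empty client list as incompatible is also an assumption.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Userspace model only (hypothetical helper, not a kernel function):
 * combine per-client distances into one score for a provider.
 * per_client[i] is the distance between the provider and client i,
 * with a negative value meaning "not compatible".
 * Returns -1 if any client is incompatible, otherwise the sum.
 */
static int cumulative_distance(const int *per_client, size_t num_clients)
{
	int total = 0;
	size_t i;

	assert(per_client != NULL || num_clients == 0);

	/* Assumption: an empty client list has no meaningful distance. */
	if (num_clients == 0)
		return -1;

	for (i = 0; i < num_clients; i++) {
		if (per_client[i] < 0)
			return -1; /* one incompatible client rules the provider out */
		total += per_client[i];
	}

	return total;
}
```

A single client at distance 0 (the provider itself) gives the best possible score of 0, and one incompatible client is enough to make the whole result -1.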

“compatible” means the provider and the clients are either all behind the same PCI root port or the host bridges connected to each of the devices are listed in the ‘pci_p2pdma_whitelist’.

bool pci_has_p2pmem(struct pci_dev *pdev)

check if a given PCI device has published any p2pmem

Parameters

struct pci_dev *pdev
PCI device to check

struct pci_dev *pci_p2pmem_find_many(struct device **clients, int num_clients)

find a peer-to-peer DMA memory device compatible with the specified list of clients and with the shortest distance (as determined by pci_p2pdma_distance_many())

Parameters

struct device **clients
array of devices to check (NULL-terminated)
int num_clients
number of client devices in the list

Description

If multiple devices are behind the same switch, the one “closest” to the client devices in use will be chosen first. (So if one of the providers is the same as one of the clients, that provider will be used ahead of any other providers that are unrelated). If multiple providers are an equal distance away, one will be chosen at random.

Returns a pointer to the PCI device with a reference taken (use pci_dev_put() to return the reference) or NULL if no compatible device is found. The found provider will also be assigned to the client list.

void *pci_alloc_p2pmem(struct pci_dev *pdev, size_t size)

allocate peer-to-peer DMA memory

Parameters

struct pci_dev *pdev
the device to allocate memory from
size_t size
number of bytes to allocate

Description

Returns the allocated memory or NULL on error.

void pci_free_p2pmem(struct pci_dev *pdev, void *addr, size_t size)

free peer-to-peer DMA memory

Parameters

struct pci_dev *pdev
the device the memory was allocated from
void *addr
address of the memory that was allocated
size_t size
number of bytes that were allocated

pci_bus_addr_t pci_p2pmem_virt_to_bus(struct pci_dev *pdev, void *addr)

return the PCI bus address for a given virtual address obtained with pci_alloc_p2pmem()

Parameters

struct pci_dev *pdev
the device the memory was allocated from
void *addr
address of the memory that was allocated

struct scatterlist *pci_p2pmem_alloc_sgl(struct pci_dev *pdev, unsigned int *nents, u32 length)

allocate peer-to-peer DMA memory in a scatterlist

Parameters

struct pci_dev *pdev
the device to allocate memory from
unsigned int *nents
the number of SG entries in the list
u32 length
number of bytes to allocate

Return

NULL on error, or a struct scatterlist pointer and nents on success

void pci_p2pmem_free_sgl(struct pci_dev *pdev, struct scatterlist *sgl)

free a scatterlist allocated by pci_p2pmem_alloc_sgl()

Parameters

struct pci_dev *pdev
the device to allocate memory from
struct scatterlist *sgl
the allocated scatterlist

void pci_p2pmem_publish(struct pci_dev *pdev, bool publish)

publish the peer-to-peer DMA memory for use by other devices with pci_p2pmem_find()

Parameters

struct pci_dev *pdev
the device with peer-to-peer DMA memory to publish
bool publish
set to true to publish the memory, false to unpublish it

Description

Published memory can be used by other PCI device drivers for peer-to-peer DMA operations. Non-published memory is reserved for exclusive use of the device driver that registers the peer-to-peer memory.

int pci_p2pdma_map_sg_attrs(struct device *dev, struct scatterlist *sg, int nents, enum dma_data_direction dir, unsigned long attrs)

map a PCI peer-to-peer scatterlist for DMA

Parameters

struct device *dev
device doing the DMA request
struct scatterlist *sg
scatter list to map
int nents
elements in the scatterlist
enum dma_data_direction dir
DMA direction
unsigned long attrs
DMA attributes passed to dma_map_sg() (if called)

Description

Scatterlists mapped with this function should be unmapped using pci_p2pdma_unmap_sg_attrs().

Returns the number of SG entries mapped or 0 on error.

void pci_p2pdma_unmap_sg_attrs(struct device *dev, struct scatterlist *sg, int nents, enum dma_data_direction dir, unsigned long attrs)

unmap a PCI peer-to-peer scatterlist that was mapped with pci_p2pdma_map_sg()

Parameters

struct device *dev
device doing the DMA request
struct scatterlist *sg
scatter list to unmap
int nents
number of elements returned by pci_p2pdma_map_sg()
enum dma_data_direction dir
DMA direction
unsigned long attrs
DMA attributes passed to dma_unmap_sg() (if called)

int pci_p2pdma_enable_store(const char *page, struct pci_dev **p2p_dev, bool *use_p2pdma)

parse a configfs/sysfs attribute store to enable p2pdma

Parameters

const char *page
contents of the value to be stored
struct pci_dev **p2p_dev
returns the PCI device that was selected to be used (if one was specified in the stored value)
bool *use_p2pdma
returns whether to enable p2pdma or not

Description

Parses an attribute value to decide whether to enable p2pdma. The value can select a PCI device (using its full BDF device name) or a boolean (in any format strtobool() accepts). A false value disables p2pdma, a true value expects the caller to automatically find a compatible device, and specifying a PCI device expects the caller to use the specific provider.

pci_p2pdma_enable_show() should be used as the show operation for the attribute.

Returns 0 on success

ssize_t pci_p2pdma_enable_show(char *page, struct pci_dev *p2p_dev, bool use_p2pdma)

show a configfs/sysfs attribute indicating whether p2pdma is enabled

Parameters

char *page
contents of the stored value
struct pci_dev *p2p_dev
the selected p2p device (NULL if no device is selected)
bool use_p2pdma
whether p2pdma has been enabled

Description

Attributes that use pci_p2pdma_enable_store() should use this function to show the value of the attribute.

Returns 0 on success