DMAengine controller documentation

Hardware Introduction

Most of the Slave DMA controllers have the same general principles of operation.

They have a given number of channels to use for the DMA transfers, and a given number of request lines.

Requests and channels are pretty much orthogonal. Channels can be used to serve several to any requests. To simplify, channels are the entities that will be doing the copy, and requests what endpoints are involved.

The request lines actually correspond to physical lines going from the DMA-eligible devices to the controller itself. Whenever the device wants to start a transfer, it will assert a DMA request (DRQ) by asserting that request line.

A very simple DMA controller would only take into account a single parameter: the transfer size. At each clock cycle, it would transfer a byte of data from one buffer to another, until the transfer size has been reached.

That wouldn’t work well in the real world, since slave devices might require a specific number of bits to be transferred in a single cycle. For example, we may want to transfer as much data as the physical bus allows to maximize performance when doing a simple memory copy operation, but our audio device could have a narrower FIFO that requires data to be written exactly 16 or 24 bits at a time. This is why most if not all of the DMA controllers can adjust this, using a parameter called the transfer width.

Moreover, some DMA controllers, whenever the RAM is used as a source or destination, can group the reads or writes in memory into a buffer, so instead of having a lot of small memory accesses, which is not really efficient, you’ll get several bigger transfers. This is done using a parameter called the burst size, that defines how many single reads/writes it’s allowed to do without the controller splitting the transfer into smaller sub-transfers.
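To make the three parameters concrete, here is a minimal arithmetic sketch in plain userspace C. The names (xfer_params, beats_needed, bursts_needed) are hypothetical and not part of dmaengine. The sketch assumes that the burst size is expressed as a number of width-sized accesses, and that a partial trailing access or burst counts as a whole one.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical parameter set for the theoretical controller described above. */
struct xfer_params {
	size_t size;   /* transfer size, in bytes */
	size_t width;  /* transfer width, in bytes per single access */
	size_t burst;  /* burst size, in accesses per burst (assumption) */
};

/* Number of single accesses needed to move the whole transfer, rounding up. */
static size_t beats_needed(const struct xfer_params *p)
{
	assert(p->width > 0);
	return (p->size + p->width - 1) / p->width;
}

/* Number of bursts needed to carry those accesses, rounding up. */
static size_t bursts_needed(const struct xfer_params *p)
{
	assert(p->burst > 0);
	return (beats_needed(p) + p->burst - 1) / p->burst;
}
```

With a 4096-byte transfer, a 4-byte width and bursts of 8 accesses, this model yields 1024 accesses grouped into 128 bursts.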

Our theoretical DMA controller would then only be able to do transfers that involve a single contiguous block of data. However, some of the transfers we usually have are not, and want to copy data from non-contiguous buffers to a contiguous buffer, which is called scatter-gather.

DMAEngine, at least for mem2dev transfers, requires support for scatter-gather. So we’re left with two cases here: either we have a quite simple DMA controller that doesn’t support it, and we’ll have to implement it in software, or we have a more advanced DMA controller, that implements scatter-gather in hardware.

The latter are usually programmed using a collection of chunks to transfer, and whenever the transfer is started, the controller will go over that collection, doing whatever we programmed there.

This collection is usually either a table or a linked list. You will then push either the address of the table and its number of elements, or the first item of the list to one channel of the DMA controller, and whenever a DRQ is asserted, it will go through the collection to know where to fetch the data from.

Either way, the format of this collection is completely dependent on your hardware. Each DMA controller will require a different structure, but all of them will require, for every chunk, at least the source and destination addresses, whether it should increment these addresses or not and the three parameters we saw earlier: the burst size, the transfer width and the transfer size.
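As an illustration of the linked-list flavour, the sketch below models such a collection in userspace C and walks it in software, the way a driver might when the hardware has no scatter-gather support. The chunk_desc layout and run_chunk_list() are hypothetical; real descriptor formats are dictated by each controller. The burst field is carried along only to mirror the per-chunk parameters listed above and is not used by this software model.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical chunk descriptor; real hardware layouts differ per controller. */
struct chunk_desc {
	uint8_t *src;             /* source address */
	uint8_t *dst;             /* destination address */
	size_t size;              /* transfer size, in bytes */
	size_t width;             /* transfer width, in bytes per access */
	size_t burst;             /* burst size (unused by this software model) */
	int src_inc;              /* non-zero: increment source after each access */
	int dst_inc;              /* non-zero: increment destination after each access */
	struct chunk_desc *next;  /* next chunk, or NULL at the end of the list */
};

/* Software model of a controller walking the list; returns the bytes moved. */
static size_t run_chunk_list(const struct chunk_desc *d)
{
	size_t total = 0;

	for (; d; d = d->next) {
		uint8_t *src = d->src, *dst = d->dst;

		/* This model only handles sizes that are a multiple of the width. */
		assert(d->width > 0 && d->size % d->width == 0);
		for (size_t done = 0; done < d->size; done += d->width) {
			memcpy(dst, src, d->width);
			if (d->src_inc)
				src += d->width;
			if (d->dst_inc)
				dst += d->width;
		}
		total += d->size;
	}
	return total;
}
```

Chaining two descriptors whose sources are separate buffers and whose destinations are adjacent gathers the data into one contiguous buffer. Clearing dst_inc models a fixed device register such as a FIFO.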

The one last thing is that usually, slave devices won’t issue DRQ by default, and you have to enable this in your slave device driver first whenever you’re willing to use DMA.

These were just the general memory-to-memory (also called mem2mem) or memory-to-device (mem2dev) kinds of transfers. Most devices often support other kinds of transfers or memory operations that dmaengine supports and that will be detailed later in this document.

DMA Support in Linux

Historically, DMA controller drivers have been implemented using the async TX API, to offload operations such as memory copy, XOR, cryptography, etc., basically any memory to memory operation.

Over time, the need for memory to device transfers arose, and dmaengine was extended. Nowadays, the async TX API is written as a layer on top of dmaengine, and acts as a client. Still, dmaengine accommodates that API in some cases, and made some design choices to ensure that it stayed compatible.

For more information on the Async TX API, please look at the relevant documentation file in Asynchronous Transfers/Transforms API.

DMAEngine APIs

struct dma_device Initialization

Just like any other kernel framework, the whole DMAEngine registration relies on the driver filling a structure and registering against the framework. In our case, that structure is dma_device.

The first thing you need to do in your driver is to allocate this structure. Any of the usual memory allocators will do, but you’ll also need to initialize a few fields in there:

  • channels: should be initialized as a list using the INIT_LIST_HEAD macro for example

  • src_addr_widths: should contain a bitmask of the supported source transfer widths

  • dst_addr_widths: should contain a bitmask of the supported destination transfer widths

  • directions: should contain a bitmask of the supported slave directions (i.e. excluding mem2mem transfers)

  • residue_granularity: granularity of the transfer residue reported to dma_set_residue. This can be either:

    • Descriptor: your device doesn’t support any kind of residue reporting. The framework will only know that a particular transaction descriptor is done.

    • Segment: your device is able to report which chunks have been transferred

    • Burst: your device is able to report which bursts have been transferred

  • dev: should hold the pointer to the struct device associated to your current driver instance.
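The width bitmasks can be sketched in userspace C as follows. The enum and BIT() below are local stand-ins that mirror kernel names so the example builds outside the kernel; the helper functions are hypothetical. The sketch assumes each supported width is encoded as one bit whose position is the width in bytes, which is how the enum dma_slave_buswidth values are usually combined with BIT() in drivers.

```c
#include <assert.h>
#include <stdint.h>

/* Local stand-ins mirroring kernel names so the sketch builds in userspace. */
enum dma_slave_buswidth {
	DMA_SLAVE_BUSWIDTH_1_BYTE = 1,
	DMA_SLAVE_BUSWIDTH_2_BYTES = 2,
	DMA_SLAVE_BUSWIDTH_4_BYTES = 4,
	DMA_SLAVE_BUSWIDTH_8_BYTES = 8,
};
#define BIT(n) (UINT32_C(1) << (n))

/* Hypothetical helper: mask for a controller doing 1, 2 and 4 byte accesses. */
static uint32_t example_addr_widths(void)
{
	return BIT(DMA_SLAVE_BUSWIDTH_1_BYTE) |
	       BIT(DMA_SLAVE_BUSWIDTH_2_BYTES) |
	       BIT(DMA_SLAVE_BUSWIDTH_4_BYTES);
}

/* Hypothetical helper: is a given width advertised in the mask? */
static int width_supported(uint32_t mask, enum dma_slave_buswidth w)
{
	return (mask & BIT(w)) != 0;
}
```

A driver would typically assign a mask built this way to both src_addr_widths and dst_addr_widths when the controller is symmetric.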

Supported transaction types

The next thing you need is to set which transaction types your device (and driver) supports.

Our dma_device structure has a field called cap_mask that holds the various types of transaction supported, and you need to modify this mask using the dma_cap_set function, with various flags depending on transaction types you support as an argument.

All those capabilities are defined in the dma_transaction_type enum, in include/linux/dmaengine.h
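The following userspace sketch shows the idea of filling a capability mask. The type and the three macros are simplified stand-ins named after kernel macros (dma_cap_zero, dma_cap_set, dma_has_cap); the enum is a shortened copy whose numeric values are illustrative only, and example_set_caps() is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Subset of transaction types; the numeric values here are illustrative only. */
enum dma_transaction_type {
	DMA_MEMCPY,
	DMA_XOR,
	DMA_SLAVE,
	DMA_CYCLIC,
	DMA_TX_TYPE_END,
};

/* Simplified stand-in for the kernel's dma_cap_mask_t, which is a bitmap. */
typedef struct { uint32_t bits; } dma_cap_mask_t;

/* Userspace stand-ins; the kernel versions are built on its bitmap helpers. */
#define dma_cap_zero(mask)     ((mask).bits = 0)
#define dma_cap_set(tx, mask)  ((mask).bits |= UINT32_C(1) << (tx))
#define dma_has_cap(tx, mask)  ((((mask).bits) >> (tx)) & 1u)

/* Hypothetical driver step: advertise slave and cyclic support. */
static void example_set_caps(dma_cap_mask_t *cap_mask)
{
	dma_cap_zero(*cap_mask);
	dma_cap_set(DMA_SLAVE, *cap_mask);
	dma_cap_set(DMA_CYCLIC, *cap_mask);
}
```

Note the argument order: the transaction type comes first and the mask second.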

Currently, the types available are:

  • DMA_MEMCPY

    • The device is able to do memory to memory copies

    • No matter what the overall size of the combined chunks for source and destination is, only as many bytes as the smallest of the two will be transmitted. That means the number and size of the scatter-gather buffers in both lists need not be the same, and that the operation is functionally equivalent to a strncpy where the count argument equals the smallest total size of the two scatter-gather list buffers.

    • It’s usually used for copying pixel data between host memory and memory-mapped GPU device memory, such as found on modern PCI video graphics cards. The most immediate example is the OpenGL API function glReadPixels(), which might require a verbatim copy of a huge framebuffer from local device memory onto host memory.

  • DMA_XOR

    • The device is able to perform XOR operations on memory areas

    • Used to accelerate XOR intensive tasks, such as RAID5

  • DMA_XOR_VAL

    • The device is able to perform parity check using the XOR algorithm against a memory buffer.

  • DMA_PQ

    • The device is able to perform RAID6 P+Q computations, P being a simple XOR, and Q being a Reed-Solomon algorithm.

  • DMA_PQ_VAL

    • The device is able to perform parity check using RAID6 P+Q algorithm against a memory buffer.

  • DMA_MEMSET

    • The device is able to fill memory with the provided pattern

    • The pattern is treated as a single byte signed value.

  • DMA_INTERRUPT

    • The device is able to trigger a dummy transfer that will generate periodic interrupts

    • Used by the client drivers to register a callback that will be called on a regular basis through the DMA controller interrupt

  • DMA_PRIVATE

    • The device only supports slave transfers, and as such isn’t available for async transfers.

  • DMA_ASYNC_TX

    • The device supports asynchronous memory-to-memory operations, including memcpy, memset, xor, pq, xor_val, and pq_val.

    • This capability is automatically set by the DMA engine framework and must not be configured manually by device drivers.

  • DMA_SLAVE

    • The device can handle device to memory transfers, including scatter-gather transfers.

    • While in the mem2mem case we were having two distinct types to deal with a single chunk to copy or a collection of them, here, we just have a single transaction type that is supposed to handle both.

    • If you want to transfer a single contiguous memory buffer, simply build a scatter list with only one item.

  • DMA_CYCLIC

    • The device can handle cyclic transfers.

    • A cyclic transfer is a transfer where the chunk collection will loop over itself, with the last item pointing to the first.

    • It’s usually used for audio transfers, where you want to operate on a single ring buffer that you will fill with your audio data.

  • DMA_INTERLEAVE

    • The device supports interleaved transfer.

    • These transfers can transfer data from a non-contiguous buffer to a non-contiguous buffer, as opposed to DMA_SLAVE, which can transfer data from a non-contiguous data set to a contiguous destination buffer.

    • It’s usually used for 2d content transfers, in which case you want to transfer a portion of uncompressed data directly to the display to print it

  • DMA_COMPLETION_NO_ORDER

    • The device does not support in-order completion.

    • The driver should return DMA_OUT_OF_ORDER for device_tx_status if the device is setting this capability.

    • All cookie tracking and checking API should be treated as invalid if the device exports this capability.

    • At this point, this is incompatible with the polling option for dmatest.

    • If this cap is set, the user is recommended to provide a unique identifier for each descriptor sent to the DMA device in order to properly track the completion.

  • DMA_REPEAT

    • The device supports repeated transfers. A repeated transfer, indicated by the DMA_PREP_REPEAT transfer flag, is similar to a cyclic transfer in that it gets automatically repeated when it ends, but can additionally be replaced by the client.

    • This feature is limited to interleaved transfers; this flag should thus not be set if the DMA_INTERLEAVE flag isn’t set. This limitation is based on the current needs of DMA clients; support for additional transfer types should be added in the future if and when the need arises.

  • DMA_LOAD_EOT

    • The device supports replacing repeated transfers at end of transfer (EOT) by queuing a new transfer with the DMA_PREP_LOAD_EOT flag set.

    • Support for replacing a currently running transfer at another point (such as end of burst instead of end of transfer) will be added in the future based on DMA clients’ needs, if and when the need arises.

These various types will also affect how the source and destination addresses change over time.

Addresses pointing to RAM are typically incremented (or decremented) after each transfer. In case of a ring buffer, they may loop (DMA_CYCLIC). Addresses pointing to a device’s register (e.g. a FIFO) are typically fixed.

Per descriptor metadata support

Some data movement architectures (DMA controller and peripherals) use metadata associated with a transaction. The DMA controller’s role is to transfer the payload and the metadata alongside. The metadata is not used by the DMA engine itself, but it contains parameters, keys, vectors, etc. for the peripheral or from the peripheral.

The DMAengine framework provides generic ways to facilitate the metadata for descriptors. Depending on the architecture the DMA driver can implement either or both of the methods and it is up to the client driver to choose which one to use.

  • DESC_METADATA_CLIENT

    The metadata buffer is allocated/provided by the client driver and it is attached (via the dmaengine_desc_attach_metadata() helper) to the descriptor.

    From the DMA driver the following is expected for this mode:

    • DMA_MEM_TO_DEV / DMA_MEM_TO_MEM

      The data from the provided metadata buffer should be prepared for the DMA controller to be sent alongside the payload data, either by copying to a hardware descriptor, or a highly coupled packet.

    • DMA_DEV_TO_MEM

      On transfer completion the DMA driver must copy the metadata to the client provided metadata buffer before notifying the client about the completion. After the transfer completion, DMA drivers must not touch the metadata buffer provided by the client.

  • DESC_METADATA_ENGINE

    The metadata buffer is allocated/managed by the DMA driver. The client driver can ask for the pointer, maximum size and the currently used size of the metadata and can directly update or read it. dmaengine_desc_get_metadata_ptr() and dmaengine_desc_set_metadata_len() are provided as helper functions.

    From the DMA driver the following is expected for this mode:

    • get_metadata_ptr()

      Should return a pointer for the metadata buffer, the maximum size of the metadata buffer and the currently used / valid (if any) bytes in the buffer.

    • set_metadata_len()

      It is called by the clients after they have placed the metadata into the buffer to let the DMA driver know the number of valid bytes provided.

    Note: since the client will ask for the metadata pointer in the completion callback (in DMA_DEV_TO_MEM case) the DMA driver must ensure that the descriptor is not freed up before the callback is called.
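The DESC_METADATA_ENGINE contract can be modelled in userspace C as below. Everything here is a mock: mock_desc, mock_get_metadata_ptr(), mock_set_metadata_len() and client_fill_metadata() are hypothetical names that only imitate the roles of get_metadata_ptr() and set_metadata_len(); the 16-byte buffer size and the -1 error value are arbitrary assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MOCK_MD_MAX 16  /* arbitrary size chosen for this sketch */

/* Hypothetical driver-side descriptor that owns its metadata buffer. */
struct mock_desc {
	unsigned char md[MOCK_MD_MAX];
	size_t md_len;  /* number of currently valid bytes */
};

/* Models get_metadata_ptr(): returns the buffer plus current and maximum sizes. */
static void *mock_get_metadata_ptr(struct mock_desc *d, size_t *payload_len,
				   size_t *max_len)
{
	*payload_len = d->md_len;
	*max_len = sizeof(d->md);
	return d->md;
}

/* Models set_metadata_len(): the client reports how many bytes it wrote. */
static int mock_set_metadata_len(struct mock_desc *d, size_t payload_len)
{
	if (payload_len > sizeof(d->md))
		return -1;  /* a real driver would return a negative error code */
	d->md_len = payload_len;
	return 0;
}

/* Client side: write metadata in place, then publish its length. */
static int client_fill_metadata(struct mock_desc *d, const void *src, size_t len)
{
	size_t cur, max;
	void *p = mock_get_metadata_ptr(d, &cur, &max);

	(void)cur;  /* the previous length is not needed when overwriting */
	if (len > max)
		return -1;
	memcpy(p, src, len);
	return mock_set_metadata_len(d, len);
}
```

The point of the model is the ordering: the client obtains the pointer and limits first, writes within the maximum size, and only then tells the driver how many bytes are valid.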

Device operations

Our dma_device structure also requires a few function pointers in order to implement the actual logic, now that we described what operations we were able to perform.

The functions that we have to fill in there, and hence have to implement, obviously depend on the transaction types you reported as supported.

  • device_alloc_chan_resources

  • device_free_chan_resources

    • These functions will be called whenever a driver calls dma_request_channel or dma_release_channel for the first/last time on the channel associated to that driver.

    • They are in charge of allocating/freeing all the needed resources in order for that channel to be useful for your driver.

    • These functions can sleep.

  • device_prep_dma_*

    • These functions match the capabilities you registered previously.

    • These functions all take the buffer or the scatterlist relevant for the transfer being prepared, and should create a hardware descriptor or a list of hardware descriptors from it

    • These functions can be called from an interrupt context

    • Any allocation you might do should be using the GFP_NOWAIT flag, in order not to potentially sleep, but without depleting the emergency pool either.

    • Drivers should try to pre-allocate any memory they might need during the transfer setup at probe time to avoid putting too much pressure on the nowait allocator.

    • It should return a unique instance of the dma_async_tx_descriptor structure, that further represents this particular transfer.

    • This structure can be initialized using the function dma_async_tx_descriptor_init.

    • You’ll also need to set two fields in this structure:

      • flags: TODO: Can it be modified by the driver itself, or should it be always the flags passed in the arguments

      • tx_submit: A pointer to a function you have to implement, that is supposed to push the current transaction descriptor to a pending queue, waiting for issue_pending to be called.

    • In this structure the function pointer callback_result can be initialized in order for the submitter to be notified that a transaction has completed. In the earlier code the function pointer callback has been used. However it does not provide any status to the transaction and will be deprecated. The result structure defined as dmaengine_result that is passed in to callback_result has two fields:

      • result: This provides the transfer result defined by dmaengine_tx_result. Either success or some error condition.

      • residue: Provides the residue bytes of the transfer for those that support residue.

  • device_prep_peripheral_dma_vec

    • Similar to device_prep_slave_sg, but it takes a pointer to an array of dma_vec structures, which (in the long run) will replace scatterlists.

  • device_issue_pending

    • Takes the first transaction descriptor in the pending queue, and starts the transfer. Whenever that transfer is done, it should move to the next transaction in the list.

    • This function can be called in an interrupt context

  • device_tx_status

    • Should report the bytes left to go over on the given channel

    • Should only care about the transaction descriptor passed as argument, not the currently active one on a given channel

    • The tx_state argument might be NULL

    • Should use dma_set_residue to report it

    • In the case of a cyclic transfer, it should only take into account the total size of the cyclic buffer.

    • Should return DMA_OUT_OF_ORDER if the device does not support in-order completion and is completing the operation out of order.

    • This function can be called in an interrupt context.

  • device_config

    • Reconfigures the channel with the configuration given as argument

    • This command should NOT perform synchronously, or on any currently queued transfers, but only on subsequent ones

    • In this case, the function will receive a dma_slave_config structure pointer as an argument, that will detail which configuration to use.

    • Even though that structure contains a direction field, this field is deprecated in favor of the direction argument given to the prep_* functions

    • This call is mandatory for slave operations only. This should NOT be set or expected to be set for memcpy operations. If a driver supports both, it should use this call for slave operations only and not for memcpy ones.

  • device_pause

    • Pauses a transfer on the channel

    • This command should operate synchronously on the channel, pausing right away the work of the given channel

  • device_resume

    • Resumes a transfer on the channel

    • This command should operate synchronously on the channel, resuming right away the work of the given channel

  • device_terminate_all

    • Aborts all the pending and ongoing transfers on the channel

    • For aborted transfers the complete callback should not be called

    • Can be called from atomic context or from within a complete callback of a descriptor. Must not sleep. Drivers must be able to handle this correctly.

    • Termination may be asynchronous. The driver does not have to wait until the currently active transfer has completely stopped. See device_synchronize.

  • device_synchronize

    • Must synchronize the termination of a channel to the current context.

    • Must make sure that memory for previously submitted descriptors is no longer accessed by the DMA controller.

    • Must make sure that all complete callbacks for previously submitted descriptors have finished running and none are scheduled to run.

    • May sleep.
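The interplay between tx_submit, device_issue_pending and device_tx_status described above can be modelled in userspace C. This is only a sketch of the queueing logic: all names (chan, desc, model_*) are hypothetical, cookie wraparound is ignored, and the hardware is replaced by a function that completes the active transfer.

```c
#include <assert.h>
#include <stddef.h>

typedef int cookie_t;  /* stand-in for dma_cookie_t */
enum status { ST_COMPLETE, ST_IN_PROGRESS };

struct desc {
	cookie_t cookie;
	size_t len;   /* total bytes */
	size_t done;  /* bytes already transferred */
	struct desc *next;
};

struct chan {
	cookie_t last_cookie;    /* last cookie handed out */
	cookie_t completed;      /* last cookie completed (in-order model) */
	struct desc *submitted;  /* queued by tx_submit, not yet issued */
	struct desc *issued;     /* issued; the head is the active transfer */
};

static void append(struct desc **list, struct desc *d)
{
	while (*list)
		list = &(*list)->next;
	d->next = NULL;
	*list = d;
}

static struct desc *find(struct desc *list, cookie_t cookie)
{
	for (; list; list = list->next)
		if (list->cookie == cookie)
			return list;
	return NULL;
}

/* Models tx_submit: assign a cookie and park the descriptor. */
static cookie_t model_tx_submit(struct chan *c, struct desc *d)
{
	d->cookie = ++c->last_cookie;
	d->done = 0;
	append(&c->submitted, d);
	return d->cookie;
}

/* Models device_issue_pending: move submitted descriptors to the issued queue. */
static void model_issue_pending(struct chan *c)
{
	while (c->submitted) {
		struct desc *d = c->submitted;

		c->submitted = d->next;
		append(&c->issued, d);
	}
}

/* Models the hardware finishing the active transfer and moving to the next. */
static void model_complete_active(struct chan *c)
{
	struct desc *d = c->issued;

	if (!d)
		return;
	d->done = d->len;
	c->completed = d->cookie;
	c->issued = d->next;
}

/* Models device_tx_status for one cookie; the residue pointer may be NULL. */
static enum status model_tx_status(struct chan *c, cookie_t cookie, size_t *residue)
{
	enum status st = ST_COMPLETE;
	size_t left = 0;

	if (cookie > c->completed) {
		struct desc *d = find(c->issued, cookie);

		if (!d)
			d = find(c->submitted, cookie);
		st = ST_IN_PROGRESS;
		left = d ? d->len - d->done : 0;
	}
	if (residue)
		*residue = left;
	return st;
}
```

Note that nothing reaches the issued queue until model_issue_pending() runs, and that the status query looks up the descriptor for the given cookie rather than the active one.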

Misc notes

(stuff that should be documented, but don’t really know where to put them)

dma_run_dependencies

  • Should be called at the end of an async TX transfer, and can be ignored in the slave transfers case.

  • Makes sure that dependent operations are run before marking it as complete.

dma_cookie_t

  • it’s a DMA transaction ID that will increment over time.

  • Not really relevant any more since the introduction of virt-dma that abstracts it away.

dma_vec

  • A small structure that contains a DMA address and length.

DMA_CTRL_ACK

  • If clear, the descriptor cannot be reused by provider until the client acknowledges receipt, i.e. has a chance to establish any dependency chains

  • This can be acked by invoking async_tx_ack()

  • If set, does not mean descriptor can be reused

DMA_CTRL_REUSE

  • If set, the descriptor can be reused after being completed. It should not be freed by provider if this flag is set.

  • The descriptor should be prepared for reuse by invoking dmaengine_desc_set_reuse() which will set DMA_CTRL_REUSE.

  • dmaengine_desc_set_reuse() will succeed only when the channel supports reusable descriptors, as exhibited by its capabilities

  • As a consequence, if a device driver wants to skip the dma_map_sg() and dma_unmap_sg() in between 2 transfers, because the DMA’d data wasn’t used, it can resubmit the transfer right after its completion.

  • A descriptor can be freed in a few ways

    • Clearing DMA_CTRL_REUSE by invoking dmaengine_desc_clear_reuse() and submitting for last txn

    • Explicitly invoking dmaengine_desc_free(), this can succeed only when DMA_CTRL_REUSE is already set

    • Terminating the channel

  • DMA_PREP_CMD

    • If set, the client driver tells the DMA controller that the data passed in the DMA API is command data.

    • Interpretation of command data is DMA controller specific. It can be used for issuing commands to other peripherals/register reads/register writes for which the descriptor should be in different format from normal data descriptors.

  • DMA_PREP_REPEAT

    • If set, the transfer will be automatically repeated when it ends until a new transfer is queued on the same channel with the DMA_PREP_LOAD_EOT flag. If the next transfer to be queued on the channel does not have the DMA_PREP_LOAD_EOT flag set, the current transfer will be repeated until the client terminates all transfers.

    • This flag is only supported if the channel reports the DMA_REPEAT capability.

  • DMA_PREP_LOAD_EOT

    • If set, the transfer will replace the transfer currently being executed at the end of the transfer.

    • This is the default behaviour for non-repeated transfers; specifying DMA_PREP_LOAD_EOT for non-repeated transfers will thus make no difference.

    • When using repeated transfers, DMA clients will usually need to set the DMA_PREP_LOAD_EOT flag on all transfers, otherwise the channel will keep repeating the last repeated transfer and ignore the new transfers being queued. Failure to set DMA_PREP_LOAD_EOT will appear as if the channel was stuck on the previous transfer.

    • This flag is only supported if the channel reports the DMA_LOAD_EOT capability.
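The interaction of the repeat and load-at-EOT flags can be summarised by a small decision function. This is a userspace sketch of the rules stated above, not driver code: the flag macros are illustrative stand-ins with made-up values (the real flags are defined in include/linux/dmaengine.h), and struct xfer and next_at_eot() are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative stand-ins; the values are not those of the kernel flags. */
#define PREP_REPEAT    (1u << 0)
#define PREP_LOAD_EOT  (1u << 1)

struct xfer {
	unsigned int flags;
	struct xfer *next;  /* next transfer queued on the channel, if any */
};

/* At end of transfer, decide which transfer the channel runs next. */
static struct xfer *next_at_eot(struct xfer *cur)
{
	struct xfer *queued = cur->next;

	if (cur->flags & PREP_REPEAT) {
		/* A repeated transfer is only replaced by a LOAD_EOT transfer. */
		if (queued && (queued->flags & PREP_LOAD_EOT))
			return queued;
		return cur;
	}
	/* Non-repeated transfers always hand over at EOT (possibly to NULL). */
	return queued;
}
```

The first branch shows why a client that forgets the load-at-EOT flag sees the channel apparently stuck on the previous repeated transfer.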

General Design Notes

Most of the DMAEngine drivers you’ll see are based on a similar design that handles the end of transfer interrupts in the handler, but defers most work to a tasklet, including the start of a new transfer whenever the previous transfer ended.

This is a rather inefficient design though, because the inter-transfer latency will be not only the interrupt latency, but also the scheduling latency of the tasklet, which will leave the channel idle in between, which will slow down the global transfer rate.

You should avoid this kind of practice, and instead of electing a new transfer in your tasklet, move that part to the interrupt handler in order to have a shorter idle window (that we can’t really avoid anyway).
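The recommended split can be sketched as a userspace model in which the interrupt handler restarts the hardware immediately and the tasklet is left with client callbacks only. All names here (chan, desc, start_next, irq_handler, completion_tasklet) are hypothetical, and the hardware is reduced to a counter of how often it was started.

```c
#include <assert.h>
#include <stddef.h>

struct desc {
	int id;
	struct desc *next;
};

struct chan {
	struct desc *active;     /* transfer currently on the hardware */
	struct desc *issued;     /* issued transfers waiting for the hardware */
	struct desc *completed;  /* finished, callbacks not yet run */
	int hw_starts;           /* how many times the hardware was started */
};

static struct desc *pop(struct desc **list)
{
	struct desc *d = *list;

	if (d) {
		*list = d->next;
		d->next = NULL;
	}
	return d;
}

static void push_tail(struct desc **list, struct desc *d)
{
	while (*list)
		list = &(*list)->next;
	d->next = NULL;
	*list = d;
}

/* Hypothetical hardware kick: program the next issued transfer, if any. */
static void start_next(struct chan *c)
{
	c->active = pop(&c->issued);
	if (c->active)
		c->hw_starts++;
}

/* Interrupt handler model: retire the finished transfer and restart at once. */
static void irq_handler(struct chan *c)
{
	if (c->active)
		push_tail(&c->completed, c->active);
	start_next(c);  /* the channel is busy again before any callback runs */
	/* a real driver would schedule its tasklet here */
}

/* Tasklet model: only runs client callbacks; returns how many it handled. */
static int completion_tasklet(struct chan *c)
{
	int n = 0;

	while (pop(&c->completed))
		n++;  /* a real driver would invoke the client callback here */
	return n;
}
```

In this model the idle window between two transfers is bounded by the interrupt latency alone, since start_next() no longer waits for the tasklet to be scheduled.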

Glossary

  • Burst: A number of consecutive read or write operations that can be queued to buffers before being flushed to memory.

  • Chunk: A contiguous collection of bursts

  • Transfer: A collection of chunks (be it contiguous or not)