udmabuf is a Linux device driver that allocates contiguous memory blocks in the kernel space as DMA buffers and makes them available from the user space. These memory blocks are intended to be used as DMA buffers when a user application implements a device driver in user space using UIO (User space I/O).
A DMA buffer allocated by udmabuf can be accessed from the user space by opening the device file (e.g. /dev/udmabuf0) and mapping it to the user memory space, or by using the read()/write() functions.
CPU cache for the allocated DMA buffer can be disabled by setting the O_SYNC flag when opening the device file. It is also possible to flush or invalidate the CPU cache while keeping the CPU cache enabled.
The physical address of a DMA buffer allocated by udmabuf can be obtained by reading /sys/class/udmabuf/udmabuf0/phys_addr.
The size of a DMA buffer and the device minor number can be specified when the device driver is loaded (e.g. when loaded via the insmod command). Some platforms also allow them to be specified in the device tree.
Architecture of udmabuf
Figure 1. Architecture
Supported platforms
OS : Linux Kernel Version 3.6 - 3.8, 3.18, 4.4, 4.8, 4.12, 4.14, 4.19 (the author tested on 3.18, 4.4, 4.8, 4.12, 4.14).
CPU: ARM Cortex-A9 (Xilinx ZYNQ / Altera CycloneV SoC)
CPU: x86 (64bit). This platform has not been verified sufficiently, and reports of results are welcome. In addition, the following features are currently limited on this platform:
The CPU cache cannot be controlled by the O_SYNC flag. The CPU cache is always enabled.
Settings cannot be made via the device tree.
Usage
Compile
The following Makefile is included in the repository.
Load the udmabuf kernel driver using insmod. The size of a DMA buffer should be provided as an argument as follows. The device driver is created, and allocates a DMA buffer with the specified size. The maximum number of DMA buffers that can be allocated using insmod is 8 (udmabuf0/1/2/3/4/5/6/7).
    zynq$ insmod udmabuf.ko udmabuf0=1048576
    udmabuf udmabuf0: driver installed
    udmabuf udmabuf0: major number = 248
    udmabuf udmabuf0: minor number = 0
    udmabuf udmabuf0: phys address = 0x1e900000
    udmabuf udmabuf0: buffer size = 1048576
    udmabuf udmabuf0: dma coherent = 0
    zynq$ ls -la /dev/udmabuf0
    crw------- 1 root root 248, 0 Dec 1 09:34 /dev/udmabuf0
In the above result, the device is only read/write accessible by root. If the permission needs to be changed at the load of the kernel module, create /etc/udev/rules.d/99-udmabuf.rules with the following content.
In addition to the allocation via the insmod command and its arguments, DMA buffers can be allocated by specifying the size in the device tree file. When a device tree file contains an entry like the following, udmabuf will allocate buffers and create device drivers when loaded by insmod.
    zynq$ insmod udmabuf.ko
    udmabuf udmabuf0: driver installed
    udmabuf udmabuf0: major number = 248
    udmabuf udmabuf0: minor number = 0
    udmabuf udmabuf0: phys address = 0x1e900000
    udmabuf udmabuf0: buffer size = 1048576
    udmabuf udmabuf0: dma coherent = 0
    zynq$ ls -la /dev/udmabuf0
    crw------- 1 root root 248, 0 Dec 1 09:34 /dev/udmabuf0
The following properties can be set in the device tree.
compatible
size
minor-number
device-name
sync-mode
sync-always
sync-offset
sync-size
sync-direction
dma-coherent
memory-region
compatible
The compatible property is used to set the corresponding device driver when loading udmabuf. The compatible property is mandatory. Be sure to specify the compatible property as "ikwzm,udmabuf-0.10.a".
size
The size property is used to set the capacity of the DMA buffer in bytes. The size property is mandatory.
minor-number
The minor-number property is used to set the minor number. The valid minor number range is 0 to 255. A minor number given as an insmod argument takes precedence: if a device tree entry specifies the same minor number, creation of the device defined in the device tree fails.
The minor-number property is optional. When it is not specified, udmabuf automatically assigns an appropriate one.
device-name
The device-name property is used to set the name of the device.
The device-name property is optional. The device name is determined as follows:
If the device-name property is specified, its value is used.
If the device-name property is not present and the minor-number property is specified, sprintf("udmabuf%d", minor-number) is used.
If neither the device-name property nor the minor-number property is present, the entry name of the device tree is used (udmabuf@0x00 in this example).
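The three naming rules above can be sketched as a small C function. This is only an illustration of the precedence order, not code taken from the driver; the function name resolve_device_name() and its parameters are hypothetical.

```c
#include <stdio.h>
#include <stddef.h>

/*
 * Illustrative only: mirrors the naming rules described above.
 * resolve_device_name() is a hypothetical name, not part of udmabuf.
 *
 * device_name : value of the device-name property, or NULL if absent
 * minor       : value of the minor-number property, or -1 if absent
 * node_name   : entry name of the device tree node (e.g. "udmabuf@0x00")
 */
void resolve_device_name(char *out, size_t out_len,
                         const char *device_name,
                         int minor,
                         const char *node_name)
{
    if (device_name != NULL) {
        /* Rule 1: an explicit device-name always wins. */
        snprintf(out, out_len, "%s", device_name);
    } else if (minor >= 0) {
        /* Rule 2: fall back to "udmabuf<minor-number>". */
        snprintf(out, out_len, "udmabuf%d", minor);
    } else {
        /* Rule 3: fall back to the device tree entry name. */
        snprintf(out, out_len, "%s", node_name);
    }
}
```

For example, with no device-name property and a minor-number of 3, the resulting name would be "udmabuf3".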
sync-mode
The sync-mode property is used to configure the behavior when udmabuf is opened with the O_SYNC flag.
sync-mode=<1>: If O_SYNC is specified or the sync-always property is specified, CPU cache is disabled. Otherwise CPU cache is enabled.
sync-mode=<2>: If O_SYNC is specified or the sync-always property is specified, CPU cache is disabled, but the CPU uses write-combining when writing data to the DMA buffer, which improves performance by combining multiple write accesses. Otherwise CPU cache is enabled.
sync-mode=<3>: If O_SYNC is specified or the sync-always property is specified, DMA coherency mode is used. Otherwise CPU cache is enabled.
The sync-mode property is optional. When it is not specified, sync-mode is set to <1>.
Details on O_SYNC and cache management are described in the next section.
sync-always
If the sync-always property is specified, the operation selected by the sync-mode property is always performed when udmabuf is opened, regardless of whether O_SYNC is specified.
Details on cache management will be described in the next section.
dma-coherent
If the dma-coherent property is specified, it indicates that coherency between the DMA buffer and the CPU cache is guaranteed by hardware.
The dma-coherent property is optional. When it is not specified, it indicates that coherency between the DMA buffer and the CPU cache cannot be guaranteed by hardware.
Details on cache management will be described in the next section.
memory-region
Linux can specify a reserved memory area in the device tree. The Linux kernel excludes the physical memory space specified by the reserved-memory property from normal memory allocation. To access this reserved memory area, it is necessary to use a general-purpose memory access driver such as /dev/mem, or to associate the area with a device driver in the device tree.
The memory-region property associates a reserved memory area with udmabuf.
In this example, 64MiB from 0x3C000000 to 0x3FFFFFFF is reserved as "image_buf0". For "image_buf0", specify "shared-dma-pool" in the compatible property and specify the reusable property. With these properties, the reserved memory area is managed by the CMA. Also, be careful about the alignment of the address and the size.
The above "image_buf0" is associated with "udmabuf@0" by the memory-region property. With this association, "udmabuf@0" allocates physical memory from the CMA area specified by "image_buf0".
The memory-region property is optional. When it is not specified, udmabuf allocates the DMA buffer from the CMA area allocated to the Linux kernel.
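The address range and the alignment remark for the reserved area can be checked with a little arithmetic. The sketch below is only a worked example: the helper names are hypothetical, and the alignment actually required by the CMA depends on the kernel configuration, so it is left as a parameter rather than stated as a fact.

```c
#include <stdint.h>
#include <stdbool.h>

/* Last byte address covered by a region that starts at base and spans size bytes. */
uint64_t region_last_addr(uint64_t base, uint64_t size)
{
    return base + size - 1;
}

/*
 * True when both base and size are multiples of align.
 * align must be a power of two. Which value the CMA requires is an
 * assumption the caller has to supply (it depends on the kernel).
 */
bool region_is_aligned(uint64_t base, uint64_t size, uint64_t align)
{
    return ((base | size) & (align - 1)) == 0;
}
```

With base 0x3C000000 and size 0x04000000 (64MiB), region_last_addr() yields 0x3FFFFFFF, which matches the range quoted above. Taking 4MiB purely as an example alignment, region_is_aligned() accepts this region.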
Device file
When udmabuf is loaded into the kernel, the following device files are created. <device-name> is a placeholder for the device name described in the previous section.
/dev/<device-name>
/sys/class/udmabuf/<device-name>/phys_addr
/sys/class/udmabuf/<device-name>/size
/sys/class/udmabuf/<device-name>/sync_mode
/sys/class/udmabuf/<device-name>/sync_offset
/sys/class/udmabuf/<device-name>/sync_size
/sys/class/udmabuf/<device-name>/sync_direction
/sys/class/udmabuf/<device-name>/sync_owner
/sys/class/udmabuf/<device-name>/sync_for_cpu
/sys/class/udmabuf/<device-name>/sync_for_device
/sys/class/udmabuf/<device-name>/dma_coherent
/dev/<device-name>
/dev/<device-name> is used when mmap()-ed to the user space or accessed via read()/write().

    if ((fd = open("/dev/udmabuf0", O_RDWR)) != -1) {
        buf = mmap(NULL, buf_size, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
        /* Do some read/write access to buf */
        close(fd);
    }

The device file can be directly read/written by specifying the device as the target of dd in the shell.
    zynq$ dd if=/dev/urandom of=/dev/udmabuf0 bs=4096 count=1024
    1024+0 records in
    1024+0 records out
    4194304 bytes (4.2 MB) copied, 3.07516 s, 1.4 MB/s

    zynq$ dd if=/dev/udmabuf4 of=random.bin
    8192+0 records in
    8192+0 records out
    4194304 bytes (4.2 MB) copied, 0.173866 s, 24.1 MB/s
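The read()/write() access path mentioned above can also be used from C instead of dd. The following is a minimal sketch, assuming the device file behaves like an ordinary seekable file for write(), lseek(), and read(); the helper name copy_through_file() is hypothetical.

```c
#include <fcntl.h>
#include <unistd.h>
#include <sys/types.h>

/*
 * Write len bytes from src at the start of the file at path, then read
 * them back into dst. Returns 0 on success, -1 on any error or short
 * transfer. copy_through_file() is a hypothetical helper name.
 */
int copy_through_file(const char *path, const void *src, void *dst, size_t len)
{
    int ret = -1;
    int fd  = open(path, O_RDWR);

    if (fd == -1)
        return -1;

    if (write(fd, src, len) == (ssize_t)len &&  /* CPU -> buffer      */
        lseek(fd, 0, SEEK_SET) == 0 &&          /* rewind to offset 0 */
        read(fd, dst, len) == (ssize_t)len)     /* buffer -> CPU      */
        ret = 0;

    close(fd);
    return ret;
}
```

A call such as copy_through_file("/dev/udmabuf0", in, out, sizeof(in)) would be the intended use; the same function can be tried on a regular file first.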
phys_addr
The physical address of a DMA buffer can be retrieved by reading /sys/class/udmabuf/<device-name>/phys_addr.
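Reading this attribute from C could look like the sketch below. It assumes the file contains the address as hexadecimal text (with or without a leading "0x"), which should be checked against the actual output on the target; read_hex_attr() is a hypothetical helper name.

```c
#include <stdio.h>
#include <stdlib.h>
#include <errno.h>

/*
 * Read a hexadecimal value from a sysfs-style text attribute.
 * Returns 0 on success and stores the value in *out, -1 on failure.
 * read_hex_attr() is a hypothetical helper name.
 */
int read_hex_attr(const char *path, unsigned long *out)
{
    char  text[64];
    char *end;
    FILE *fp = fopen(path, "r");

    if (fp == NULL)
        return -1;
    if (fgets(text, sizeof(text), fp) == NULL) {
        fclose(fp);
        return -1;
    }
    fclose(fp);

    errno = 0;
    /* Base 16 accepts the digits with or without a leading "0x". */
    *out = strtoul(text, &end, 16);
    return (errno != 0 || end == text) ? -1 : 0;
}
```

Typical use would be read_hex_attr("/sys/class/udmabuf/udmabuf0/phys_addr", &phys_addr), with the result passed on to the accelerator's DMA registers.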
Details on O_SYNC and cache management will be described in the next section.
sync_offset
The device file /sys/class/udmabuf/<device-name>/sync_offset is used to specify the start address of the memory block whose cache is manually managed.

    char          attr[1024];
    unsigned long sync_offset = 0x00000000;
    if ((fd = open("/sys/class/udmabuf/udmabuf0/sync_offset", O_WRONLY)) != -1) {
        /* unsigned long needs %lu (or "0x%lx" for hexadecimal) */
        sprintf(attr, "%lu", sync_offset);
        write(fd, attr, strlen(attr));
        close(fd);
    }

Details of manual cache management are described in the next section.
sync_size
The device file /sys/class/udmabuf/<device-name>/sync_size is used to specify the size of the memory block whose cache is manually managed.

    char          attr[1024];
    unsigned long sync_size = 1024;
    if ((fd = open("/sys/class/udmabuf/udmabuf0/sync_size", O_WRONLY)) != -1) {
        /* unsigned long needs %lu (or "0x%lx" for hexadecimal) */
        sprintf(attr, "%lu", sync_size);
        write(fd, attr, strlen(attr));
        close(fd);
    }

Details of manual cache management are described in the next section.
sync_direction
The device file /sys/class/udmabuf/<device-name>/sync_direction is used to set the direction of DMA transfer to/from the DMA buffer whose cache is manually managed.
Details of manual cache management are described in the next section.
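Setting the direction from C can be sketched as follows. The three numeric values are the ones this document gives for sync_direction (0, 1, and 2); the enum names and the write_attr() helper are hypothetical and chosen only for this example.

```c
#include <stdio.h>

/* Values accepted by sync_direction, as listed in this document. */
enum sync_direction {
    SYNC_BIDIRECTIONAL = 0,  /* DMA_BIDIRECTIONAL */
    SYNC_TO_DEVICE     = 1,  /* DMA_TO_DEVICE     */
    SYNC_FROM_DEVICE   = 2   /* DMA_FROM_DEVICE   */
};

/*
 * Write an unsigned decimal value to a sysfs-style text attribute.
 * Returns 0 on success, -1 on failure. write_attr() is a hypothetical helper.
 */
int write_attr(const char *path, unsigned long value)
{
    int   ok;
    FILE *fp = fopen(path, "w");

    if (fp == NULL)
        return -1;
    ok = fprintf(fp, "%lu", value) > 0;
    /* fclose() flushes the buffered text; an error here means the write failed. */
    if (fclose(fp) != 0)
        ok = 0;
    return ok ? 0 : -1;
}
```

For an accelerator that only writes into the buffer, the call would be write_attr("/sys/class/udmabuf/udmabuf0/sync_direction", SYNC_FROM_DEVICE).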
dma_coherent
The device file /sys/class/udmabuf/<device-name>/dma_coherent reports whether the coherency of the DMA buffer and the CPU cache can be guaranteed by hardware. Whether hardware can guarantee coherency is specified with the dma-coherent property in the device tree; this device file itself is read-only.
If this value is 1, the coherency of the DMA buffer and the CPU cache can be guaranteed by hardware. If this value is 0, the coherency of the DMA buffer and the CPU cache cannot be guaranteed by hardware.
Details of manual cache management are described in the next section.
sync_for_cpu
In the manual cache management mode, the CPU can become the owner of the buffer by writing a non-zero value to the device file /sys/class/udmabuf/<device-name>/sync_for_cpu.
If '1' is written to the device file and sync_direction is 2 (=DMA_FROM_DEVICE) or 0 (=DMA_BIDIRECTIONAL), the write invalidates the cache for the area specified by sync_offset and sync_size.
Any sync_offset, sync_size, or sync_direction specified through sync_for_cpu is temporary and does not affect the sync_offset, sync_size, or sync_direction device files.
Details of manual cache management are described in the next section.
sync_for_device
In the manual cache management mode, the device can become the owner of the buffer by writing a non-zero value to the device file /sys/class/udmabuf/<device-name>/sync_for_device.
If '1' is written to the device file and sync_direction is 1 (=DMA_TO_DEVICE) or 0 (=DMA_BIDIRECTIONAL), the write flushes the cache for the area specified by sync_offset and sync_size (i.e. the cached data, if any, is written back to DDR memory).
Any sync_offset, sync_size, or sync_direction specified through sync_for_device is temporary and does not affect the sync_offset, sync_size, or sync_direction device files.
Details of manual cache management are described in the next section.
Coherency of data on DMA buffer and CPU cache
The CPU usually accesses a DMA buffer in the main memory through its cache, while the hardware accelerator logic accesses the data stored in the DMA buffer in the main memory directly. In this situation, coherency between the data held in the CPU cache and the data in the main memory must be considered carefully.
When the coherency is maintained by hardware
When hardware assures the coherency, the CPU cache can be turned on without additional treatment. For example, ZYNQ provides the ACP (Accelerator Coherency Port), and the coherency is maintained by hardware as long as the accelerator accesses the main memory via this port.
In this case, accesses from the CPU to the main memory can be fast by using the CPU cache as usual. To enable the CPU cache on the DMA buffer allocated by udmabuf, open udmabuf without specifying the O_SYNC flag.

    /* To enable CPU cache on the DMA buffer, */
    /* open udmabuf without specifying the `O_SYNC` flag. */
    if ((fd = open("/dev/udmabuf0", O_RDWR)) != -1) {
        buf = mmap(NULL, buf_size, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
        /* Read/write access to the buffer */
        close(fd);
    }

The manual management of cache, described in the following section, is not necessary when hardware maintains the coherency.
If the dma-coherent property is specified in the device tree, udmabuf treats coherency as guaranteed by hardware. In this case, the cache control described later in "2. Manual cache management with the CPU cache still being enabled" is not performed.
When hardware does not maintain the coherency
To maintain coherency of data between the CPU and the main memory, another coherency mechanism is necessary. udmabuf supports two ways of maintaining coherency: one is to disable the CPU cache, and the other is to flush/invalidate the cache manually while keeping the CPU cache enabled.
1. Disabling CPU cache
To disable the CPU cache of the allocated DMA buffer, specify the O_SYNC flag when opening udmabuf.

    /* To disable CPU cache on the DMA buffer, */
    /* open udmabuf with the `O_SYNC` flag. */
    if ((fd = open("/dev/udmabuf0", O_RDWR | O_SYNC)) != -1) {
        buf = mmap(NULL, buf_size, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
        /* Read/write access to the buffer */
        close(fd);
    }
As listed below, sync_mode can be used to configure the cache behavior when the O_SYNC flag is present in open():
sync_mode=0: CPU cache is enabled regardless of the presence of the O_SYNC flag.
sync_mode=1: If O_SYNC is specified, CPU cache is disabled. If O_SYNC is not specified, CPU cache is enabled.
sync_mode=2: If O_SYNC is specified, CPU cache is disabled, but the CPU uses write-combining when writing data to the DMA buffer, which improves performance by combining multiple write accesses. If O_SYNC is not specified, CPU cache is enabled.
sync_mode=3: If O_SYNC is specified, DMA coherency mode is used. If O_SYNC is not specified, CPU cache is enabled.
sync_mode=4: CPU cache is enabled regardless of the presence of the O_SYNC flag.
sync_mode=5: CPU cache is disabled regardless of the presence of the O_SYNC flag.
sync_mode=6: The CPU uses write-combining to write data to the DMA buffer regardless of the presence of the O_SYNC flag.
sync_mode=7: DMA coherency mode is used regardless of the presence of the O_SYNC flag.
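The list above can be condensed into a small decision function. This is a sketch of the documented behavior only, not the driver's implementation: the enum, the function name, and the use of the low two bits of sync_mode are assumptions made for this example, and sync_mode is assumed to be in the range 0 to 7.

```c
#include <stdbool.h>

/* How the CPU ends up accessing the mapped buffer (names are hypothetical). */
enum cache_mode {
    CACHE_ENABLED,        /* normal cached access          */
    CACHE_DISABLED,       /* uncached access               */
    CACHE_WRITE_COMBINE,  /* uncached, writes are combined */
    CACHE_DMA_COHERENT    /* DMA coherency mode            */
};

/* Map (sync_mode, O_SYNC given?) to the resulting access mode. */
enum cache_mode effective_cache_mode(int sync_mode, bool o_sync)
{
    /* Modes 5..7 apply unconditionally; modes 1..3 need O_SYNC. */
    bool active = (sync_mode >= 5) || o_sync;

    if (!active)
        return CACHE_ENABLED;

    switch (sync_mode & 3) {  /* 1/5, 2/6 and 3/7 share the low two bits */
    case 1:  return CACHE_DISABLED;
    case 2:  return CACHE_WRITE_COMBINE;
    case 3:  return CACHE_DMA_COHERENT;
    default: return CACHE_ENABLED;  /* modes 0 and 4 */
    }
}
```

For instance, effective_cache_mode(2, false) gives CACHE_ENABLED, while effective_cache_mode(6, false) gives CACHE_WRITE_COMBINE.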
As a practical example, the execution times of a sample program listed below were measured under several test conditions as presented in the table.

Table-1 The execution time of the sample program checkbuf

| sync_mode | O_SYNC | 1MByte DMA buffer | 5MByte DMA buffer | 10MByte DMA buffer |
|---|---|---|---|---|
| 0 | Not specified | 0.437[sec] | 2.171[sec] | 4.340[sec] |
| 0 | Specified | 0.437[sec] | 2.171[sec] | 4.340[sec] |
| 1 | Not specified | 0.434[sec] | 2.179[sec] | 4.337[sec] |
| 1 | Specified | 2.283[sec] | 11.414[sec] | 22.830[sec] |
| 2 | Not specified | 0.434[sec] | 2.169[sec] | 4.337[sec] |
| 2 | Specified | 1.616[sec] | 8.262[sec] | 16.562[sec] |
| 3 | Not specified | 0.434[sec] | 2.169[sec] | 4.337[sec] |
| 3 | Specified | 1.600[sec] | 8.391[sec] | 16.587[sec] |
| 4 | Not specified | 0.437[sec] | 2.171[sec] | 4.337[sec] |
| 4 | Specified | 0.437[sec] | 2.171[sec] | 4.337[sec] |
| 5 | Not specified | 2.283[sec] | 11.414[sec] | 22.809[sec] |
| 5 | Specified | 2.283[sec] | 11.414[sec] | 22.840[sec] |
| 6 | Not specified | 1.655[sec] | 8.391[sec] | 16.587[sec] |
| 6 | Specified | 1.655[sec] | 8.391[sec] | 16.587[sec] |
| 7 | Not specified | 1.655[sec] | 8.391[sec] | 16.587[sec] |
| 7 | Specified | 1.655[sec] | 8.391[sec] | 16.587[sec] |
Table-2 The execution time of the sample program clearbuf

| sync_mode | O_SYNC | 1MByte DMA buffer | 5MByte DMA buffer | 10MByte DMA buffer |
|---|---|---|---|---|
| 0 | Not specified | 0.067[sec] | 0.359[sec] | 0.713[sec] |
| 0 | Specified | 0.067[sec] | 0.362[sec] | 0.716[sec] |
| 1 | Not specified | 0.067[sec] | 0.362[sec] | 0.718[sec] |
| 1 | Specified | 0.912[sec] | 4.563[sec] | 9.126[sec] |
| 2 | Not specified | 0.068[sec] | 0.360[sec] | 0.721[sec] |
| 2 | Specified | 0.063[sec] | 0.310[sec] | 0.620[sec] |
| 3 | Not specified | 0.068[sec] | 0.361[sec] | 0.715[sec] |
| 3 | Specified | 0.062[sec] | 0.310[sec] | 0.620[sec] |
| 4 | Not specified | 0.068[sec] | 0.360[sec] | 0.718[sec] |
| 4 | Specified | 0.067[sec] | 0.360[sec] | 0.710[sec] |
| 5 | Not specified | 0.913[sec] | 4.562[sec] | 9.126[sec] |
| 5 | Specified | 0.913[sec] | 4.562[sec] | 9.126[sec] |
| 6 | Not specified | 0.062[sec] | 0.310[sec] | 0.618[sec] |
| 6 | Specified | 0.062[sec] | 0.310[sec] | 0.619[sec] |
| 7 | Not specified | 0.062[sec] | 0.310[sec] | 0.620[sec] |
| 7 | Specified | 0.062[sec] | 0.310[sec] | 0.621[sec] |
2. Manual cache management with the CPU cache still being enabled
As explained above, by opening udmabuf without specifying the O_SYNC flag, the CPU cache can be left turned on.

    /* To enable CPU cache on the DMA buffer, */
    /* open udmabuf without specifying the `O_SYNC` flag. */
    if ((fd = open("/dev/udmabuf0", O_RDWR)) != -1) {
        buf = mmap(NULL, buf_size, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
        /* Read/write access to the buffer */
        close(fd);
    }
To manually manage cache coherency, users need to follow the steps below.
Specify the memory area shared between the CPU and the accelerator via the sync_offset and sync_size device files. sync_offset accepts an offset from the start address of the allocated buffer in units of bytes. The size of the shared memory area should be set in sync_size in units of bytes.
The data transfer direction should be set in sync_direction. If the accelerator performs only read accesses to the memory area, sync_direction should be set to 1 (=DMA_TO_DEVICE), and to 2 (=DMA_FROM_DEVICE) if it performs only write accesses.
If the accelerator both reads and writes data from/to the memory area, sync_direction should be set to 0 (=DMA_BIDIRECTIONAL).
Following the above configuration, sync_for_cpu and/or sync_for_device should be used to set the owner of the buffer specified by the above-mentioned offset and size.
When the CPU accesses the buffer, '1' should be written to sync_for_cpu to set the CPU as the owner. Upon the write to sync_for_cpu, the CPU cache is invalidated if sync_direction is 2 (=DMA_FROM_DEVICE) or 0 (=DMA_BIDIRECTIONAL). Once the CPU becomes the owner of the buffer, the accelerator must not access the buffer.
On the other hand, when the accelerator needs to access the buffer, '1' should be written to sync_for_device to change ownership of the buffer to the accelerator. Upon the write to sync_for_device, the CPU cache of the specified memory area is flushed to the main memory if sync_direction is 1 (=DMA_TO_DEVICE) or 0 (=DMA_BIDIRECTIONAL).
However, if the dma-coherent property is specified in the device tree, the CPU cache is neither invalidated nor flushed.
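The ownership hand-over described above can be sketched as a handful of small helpers. This is a minimal illustration, assuming the attribute files accept decimal text; all function names are hypothetical, and dir stands for the sysfs directory of the device (for example /sys/class/udmabuf/udmabuf0).

```c
#include <stdio.h>

/*
 * Write value as decimal text to <dir>/<name>.
 * Returns 0 on success, -1 on failure. Hypothetical helper.
 */
int write_sync_attr(const char *dir, const char *name, unsigned long value)
{
    char  path[512];
    FILE *fp;
    int   n, ok;

    n = snprintf(path, sizeof(path), "%s/%s", dir, name);
    if (n < 0 || n >= (int)sizeof(path))
        return -1;
    fp = fopen(path, "w");
    if (fp == NULL)
        return -1;
    ok = fprintf(fp, "%lu", value) > 0;
    if (fclose(fp) != 0)  /* fclose() flushes the text to the file */
        ok = 0;
    return ok ? 0 : -1;
}

/* Describe the shared area and the transfer direction. */
int sync_configure(const char *dir, unsigned long offset,
                   unsigned long size, unsigned long direction)
{
    if (write_sync_attr(dir, "sync_offset", offset) != 0)
        return -1;
    if (write_sync_attr(dir, "sync_size", size) != 0)
        return -1;
    return write_sync_attr(dir, "sync_direction", direction);
}

/* Hand the area to the accelerator before starting a DMA transfer. */
int sync_give_to_device(const char *dir)
{
    return write_sync_attr(dir, "sync_for_device", 1);
}

/* Take the area back before the CPU touches the data. */
int sync_give_to_cpu(const char *dir)
{
    return write_sync_attr(dir, "sync_for_cpu", 1);
}
```

A typical round trip would call sync_configure() once, then sync_give_to_device() before starting the accelerator and sync_give_to_cpu() after it has finished and before the CPU reads the result.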
Example using udmabuf with Python
The programming language "Python" provides an extension called "NumPy". This section explains how to handle the DMA buffer allocated in the kernel by udmabuf in the same way as an "ndarray", by mapping it with the memmap of "NumPy".
The execution times for "udmabuf0" (the buffer area allocated in the kernel) and for the same operation on an ndarray (for comparison) were almost the same. That is, the CPU cache appears to be in effect for "udmabuf0" as well.
I confirmed the contents of "udmabuf0" after running this script.
After executing the script, it was confirmed that the result of the execution remainsin the buffer. Just to be sure, let's check that NumPy can read it.
    zynq# python
    Python 2.7.9 (default, Aug 13 2016, 17:56:53)
    [GCC 4.9.2] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import numpy as np
    >>> a = np.memmap('/dev/udmabuf0', dtype=np.uint8, mode='r+', shape=(8388608))
    >>> a
    memmap([49, 49, 49, ..., 49, 49, 49], dtype=uint8)
    >>> a.itemsize
    1
    >>> a.size
    8388608
    >>>