Xillybus driver for generic FPGA interface

Author:

Eli Billauer, Xillybus Ltd. (http://xillybus.com)

Email:

eli.billauer@gmail.com or as advertised on Xillybus’ site.

Introduction

Background

An FPGA (Field Programmable Gate Array) is a piece of logic hardware, which can be programmed to become virtually anything that is usually found as a dedicated chipset: For instance, a display adapter, network interface card, or even a processor with its peripherals. FPGAs are the LEGO of hardware: Based upon certain building blocks, you make your own toys the way you like them. It’s usually pointless to reimplement something that is already available on the market as a chipset, so FPGAs are mostly used when some special functionality is needed, and the production volume is relatively low (hence not justifying the development of an ASIC).

The challenge with FPGAs is that everything is implemented at a very low level, even lower than assembly language. In order to allow FPGA designers to focus on their specific project, and not reinvent the wheel over and over again, pre-designed building blocks, IP cores, are often used. These are the FPGA parallels of library functions. IP cores may implement certain mathematical functions, a functional unit (e.g. a USB interface), an entire processor (e.g. ARM) or anything that might come in handy. Think of them as a building block, with electrical wires dangling on the sides for connection to other blocks.

One of the daunting tasks in FPGA design is communicating with a full-blown operating system (actually, with the processor running it): Implementing the low-level bus protocol and the somewhat higher-level interface with the host (registers, interrupts, DMA etc.) is a project in itself. When the FPGA’s function is a well-known one (e.g. a video adapter card, or a NIC), it can make sense to design the FPGA’s interface logic specifically for the project. A special driver is then written to present the FPGA as a well-known interface to the kernel and/or user space. In that case, there is no reason to treat the FPGA differently than any device on the bus.

It’s common, however, that the desired data communication doesn’t fit any well-known peripheral function. Also, the effort of designing an elegant abstraction for the data exchange is often considered too big. In those cases, a quicker and possibly less elegant solution is sought: The driver is effectively written as a user space program, leaving the kernel space part with just elementary data transport. This still requires designing some interface logic for the FPGA, and writing a simple ad-hoc driver for the kernel.

Xillybus Overview

Xillybus is an IP core and a Linux driver. Together, they form a kit for elementary data transport between an FPGA and the host, providing pipe-like data streams with a straightforward user interface. It’s intended as a low-effort solution for mixed FPGA-host projects, for which it makes sense to have the project-specific part of the driver running in a user-space program.

Since the communication requirements may vary significantly from one FPGA project to another (the number of data pipes needed in each direction and their attributes), there isn’t one specific chunk of logic being the Xillybus IP core. Rather, the IP core is configured and built based upon a specification given by its end user.

Xillybus presents independent data streams, which resemble pipes or TCP/IP communication to the user. At the host side, a character device file is used just like any pipe file. On the FPGA side, hardware FIFOs are used to stream the data. This is contrary to a common method of communicating through fixed-sized buffers (even though such buffers are used by Xillybus under the hood). Depending on the configuration, a single IP core may have anything from one stream to more than a hundred.

In order to ease the deployment of the Xillybus IP core, it contains a simple data structure which completely defines the core’s configuration. The Linux driver fetches this data structure during its initialization process, and sets up the DMA buffers and character devices accordingly. As a result, a single driver is used to work out of the box with any Xillybus IP core.

The data structure just mentioned should not be confused with PCI’s configuration space or the Flattened Device Tree.

Usage

User interface

On the host, all interface with Xillybus is done through /dev/xillybus_* device files, which are generated automatically as the driver loads. The names of these files depend on the IP core that is loaded in the FPGA (see Probing below). To communicate with the FPGA, open the device file that corresponds to the hardware FIFO you want to send data to or receive data from, and use plain write() or read() calls, just like with a regular pipe. In particular, it makes perfect sense to go:

$ cat mydata > /dev/xillybus_thisfifo

$ cat /dev/xillybus_thatfifo > hisdata

possibly pressing CTRL-C at some stage, even though the xillybus_* pipes have the capability to send an EOF (but may not use it).
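
The same kind of transfer can be done from a C program. What follows is a minimal sketch and is not taken from the driver sources: write_all() is a hypothetical helper name, and the descriptor is assumed to come from an open() on a device file such as the /dev/xillybus_thisfifo example above. It loops because write() may accept fewer bytes than requested.

```c
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Hypothetical helper: write exactly len bytes to fd, retrying on
 * partial writes and on EINTR. Returns 0 on success, -1 on error.
 */
int write_all(int fd, const void *buf, size_t len)
{
	const char *p = buf;

	while (len > 0) {
		ssize_t rc = write(fd, p, len);

		if (rc < 0) {
			if (errno == EINTR)
				continue;	/* interrupted by a signal, try again */
			return -1;
		}
		p += rc;		/* skip the bytes that were accepted */
		len -= (size_t)rc;
	}
	return 0;
}
```

A caller would typically open the device file with O_WRONLY, pass the returned descriptor to write_all(), and close() it when done.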

The driver and hardware are designed to behave sensibly as pipes, including:

  • Supporting non-blocking I/O (by setting O_NONBLOCK on open() ).

  • Supporting poll() and select().

  • Being bandwidth efficient under load (using DMA) but also handling small pieces of data sent across (like TCP/IP) by autoflushing.

A device file can be read only, write only or bidirectional. Bidirectional device files are treated like two independent pipes (except for sharing a “channel” structure in the implementation code).
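
As an illustration of the poll() support listed above, the following sketch waits for a descriptor to become readable. It is only an example under assumptions: wait_readable() is a hypothetical helper name, and the descriptor is assumed to refer to an FPGA to host device file opened for reading (possibly with O_NONBLOCK).

```c
#include <poll.h>

/*
 * Hypothetical helper: wait up to timeout_ms milliseconds for fd to
 * become readable. Returns 1 if data can be read, 0 on timeout, and
 * -1 on error or if only an error/hang-up condition was reported.
 */
int wait_readable(int fd, int timeout_ms)
{
	struct pollfd pfd = { .fd = fd, .events = POLLIN };
	int rc = poll(&pfd, 1, timeout_ms);

	if (rc < 0)
		return -1;	/* poll() itself failed */
	if (rc == 0)
		return 0;	/* nothing arrived within the timeout */
	return (pfd.revents & POLLIN) ? 1 : -1;
}
```

A return value of 1 means that a subsequent read() on the descriptor is expected to return data without blocking.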

Synchronization

Xillybus pipes are configured (on the IP core) to be either synchronous or asynchronous. For a synchronous pipe, write() returns successfully only after some data has been submitted and acknowledged by the FPGA. This slows down bulk data transfers, and is nearly impossible to use with streams that require data at a constant rate: There is no data transmitted to the FPGA between write() calls, in particular when the process loses the CPU.

When a pipe is configured as asynchronous, write() returns if there was enough room in the buffers to store any of the data.

For FPGA to host pipes, asynchronous pipes allow data transfer from the FPGA as soon as the respective device file is opened, regardless of whether the data has been requested by a read() call. On synchronous pipes, only the amount of data requested by a read() call is transmitted.

In summary, for synchronous pipes, data between the host and FPGA is transmitted only to satisfy the read() or write() call currently handled by the driver, and those calls wait for the transmission to complete before returning.

Note that the synchronization attribute has nothing to do with the possibility that read() or write() completes fewer bytes than requested. There is a separate configuration flag (“allowpartial”) that determines whether such a partial completion is allowed.
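
When partial completion is allowed, a program that needs a fixed amount of data has to call read() repeatedly. The sketch below shows one common way to do that; read_full() is a hypothetical helper name and nothing in it is specific to Xillybus.

```c
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Hypothetical helper: keep calling read() until len bytes have
 * arrived or EOF is reached. Returns the number of bytes actually
 * read (less than len only on EOF), or -1 on error.
 */
ssize_t read_full(int fd, void *buf, size_t len)
{
	char *p = buf;
	size_t done = 0;

	while (done < len) {
		ssize_t rc = read(fd, p + done, len - done);

		if (rc < 0) {
			if (errno == EINTR)
				continue;	/* interrupted by a signal, try again */
			return -1;
		}
		if (rc == 0)
			break;		/* EOF: no more data will arrive */
		done += (size_t)rc;
	}
	return (ssize_t)done;
}
```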

Seekable pipes

A synchronous pipe can be configured to have the stream’s position exposed to the user logic at the FPGA. Such a pipe is also seekable on the host API. With this feature, a memory or register interface can be attached on the FPGA side to the seekable stream. Reading or writing to a certain address in the attached memory is done by seeking to the desired address, and calling read() or write() as required.
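
The seek-then-access pattern can be sketched as follows. This is an illustration only: mem_write() and mem_read() are hypothetical helper names, the descriptor is assumed to refer to a seekable device file opened for both reading and writing, and a short transfer is simply treated as a failure to keep the example small.

```c
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: write len bytes starting at address addr. */
int mem_write(int fd, off_t addr, const void *buf, size_t len)
{
	if (lseek(fd, addr, SEEK_SET) == (off_t)-1)
		return -1;	/* seeking to the address failed */
	return write(fd, buf, len) == (ssize_t)len ? 0 : -1;
}

/* Hypothetical helper: read len bytes starting at address addr. */
int mem_read(int fd, off_t addr, void *buf, size_t len)
{
	if (lseek(fd, addr, SEEK_SET) == (off_t)-1)
		return -1;	/* seeking to the address failed */
	return read(fd, buf, len) == (ssize_t)len ? 0 : -1;
}
```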

Internals

Source code organization

The Xillybus driver consists of a core module, xillybus_core.c, and modules that depend on the specific bus interface (xillybus_of.c and xillybus_pcie.c).

The bus specific modules are those probed when a suitable device is found by the kernel. Since the DMA mapping and synchronization functions, which are bus dependent by their nature, are used by the core module, a xilly_endpoint_hardware structure is passed to the core module on initialization. This structure is populated with pointers to wrapper functions which execute the DMA-related operations on the bus.

Pipe attributes

Each pipe has a number of attributes which are set when the FPGA component (IP core) is built. They are fetched from the IDT (the data structure which defines the core’s configuration, see Probing below) by xilly_setupchannels() in xillybus_core.c as follows:

  • is_writebuf: The pipe’s direction. A non-zero value means it’s an FPGA to host pipe (the FPGA “writes”).

  • channelnum: The pipe’s identification number in communication between the host and FPGA.

  • format: The underlying data width. See Data Granularity below.

  • allowpartial: A non-zero value means that a read() or write() (whichever applies) may return with fewer than the requested number of bytes. The common choice is a non-zero value, to match standard UNIX behavior.

  • synchronous: A non-zero value means that the pipe is synchronous. See Synchronization above.

  • bufsize: Each DMA buffer’s size. Always a power of two.

  • bufnum: The number of buffers allocated for this pipe. Always a power of two.

  • exclusive_open: A non-zero value forces exclusive opening of the associated device file. If the device file is bidirectional, and already opened only in one direction, the opposite direction may be opened once.

  • seekable: A non-zero value indicates that the pipe is seekable. See Seekable pipes above.

  • supports_nonempty: A non-zero value (which is typical) indicates that the hardware will send the messages that are necessary to support select() and poll() for this pipe.

Host never reads from the FPGA

Even though PCI Express is hotpluggable in general, a typical motherboard doesn’t expect a card to go away all of a sudden. But since the PCIe card is based upon reprogrammable logic, a sudden disappearance from the bus is quite likely as a result of an accidental reprogramming of the FPGA while the host is up. In practice, nothing happens immediately in such a situation. But if the host attempts to read from an address that is mapped to the PCI Express device, that leads to an immediate freeze of the system on some motherboards, even though the PCIe standard requires a graceful recovery.

In order to avoid these freezes, the Xillybus driver refrains completely from reading from the device’s register space. All communication from the FPGA to the host is done through DMA. In particular, the Interrupt Service Routine doesn’t follow the common practice of checking a status register when it’s invoked. Rather, the FPGA prepares a small buffer which contains short messages, which inform the host what the interrupt was about.

This mechanism is used on non-PCIe buses as well for the sake of uniformity.

Channels, pipes, and the message channel

Each of the (possibly bidirectional) pipes presented to the user is allocated a data channel between the FPGA and the host. The distinction between channels and pipes is necessary only because of channel 0, which is used for interrupt-related messages from the FPGA, and has no pipe attached to it.

Data streaming

Even though a non-segmented data stream is presented to the user at both sides, the implementation relies on a set of DMA buffers which is allocated for each channel. For the sake of illustration, let’s take the FPGA to host direction: As data streams into the respective channel’s interface in the FPGA, the Xillybus IP core writes it to one of the DMA buffers. When the buffer is full, the FPGA informs the host about that (appending a XILLYMSG_OPCODE_RELEASEBUF message to channel 0 and sending an interrupt if necessary). The host responds by making the data available for reading through the character device. When all data has been read, the host writes to the FPGA’s buffer control register, allowing the buffer to be overwritten. Flow control mechanisms exist on both sides to prevent underflows and overflows.

This is not good enough for creating a TCP/IP-like stream: If the data flow stops momentarily before a DMA buffer is filled, the intuitive expectation is that the partial data in the buffer will arrive anyhow, despite the buffer not being completed. This is implemented by adding a field in the XILLYMSG_OPCODE_RELEASEBUF message, through which the FPGA informs not just which buffer is submitted, but how much data it contains.

But the FPGA will submit a partially filled buffer only if directed to do so by the host. This situation occurs when the read() method has been blocking for XILLY_RX_TIMEOUT jiffies (currently 10 ms), after which the host commands the FPGA to submit a DMA buffer as soon as it can. This timeout mechanism balances between bus bandwidth efficiency (preventing a lot of partially filled buffers being sent) and a latency held fairly low for tails of data.

A similar setting is used in the host to FPGA direction. The handling of partial DMA buffers is somewhat different, though. The user can tell the driver to submit all data it has in the buffers to the FPGA, by issuing a write() with the byte count set to zero. This is similar to a flush request, but it doesn’t block. There is also an autoflushing mechanism, which triggers an equivalent flush roughly XILLY_RX_TIMEOUT jiffies after the last write(). This allows the user to be oblivious about the underlying buffering mechanism and yet enjoy a stream-like interface.
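
The zero-length write() mentioned above can be issued from C as in the following sketch. xilly_flush_hint() is a hypothetical helper name, and the descriptor is assumed to refer to a host to FPGA device file opened for writing.

```c
#include <stddef.h>
#include <unistd.h>

/*
 * Hypothetical helper: ask the driver to submit whatever it holds in
 * its buffers by issuing a write() with a byte count of zero.
 * Returns 0 on success, -1 on error.
 */
int xilly_flush_hint(int fd)
{
	/* No data is transferred, so the buffer pointer is not dereferenced. */
	return write(fd, NULL, 0) < 0 ? -1 : 0;
}
```

Since the request doesn’t block, a successful return only means that the request was accepted, not that the data has already reached the FPGA.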

Note that the issue of partial buffer flushing is irrelevant for pipes having the “synchronous” attribute nonzero, since synchronous pipes don’t allow data to lie around in the DMA buffers between read() and write() calls anyhow.

Data granularity

The data arrives or is sent at the FPGA as 8, 16 or 32 bit wide words, as configured by the “format” attribute. Whenever possible, the driver attempts to hide this when the pipe is accessed differently from its natural alignment. For example, reading single bytes from a pipe with 32 bit granularity works with no issues. Writing single bytes to pipes with 16 or 32 bit granularity will also work, but the driver can’t send partially completed words to the FPGA, so the transmission of up to one word may be held until it’s fully occupied with user data.

This somewhat complicates the handling of host to FPGA streams, because when a buffer is flushed, it may contain up to 3 bytes that don’t form a word in the FPGA, and hence can’t be sent. To prevent loss of data, these leftover bytes need to be moved to the next buffer. The parts in xillybus_core.c that mention “leftovers” in some way are related to this complication.
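
The arithmetic behind the leftovers can be sketched as follows. This is an illustration and not the code in xillybus_core.c: struct flush_split and split_for_flush() are hypothetical names, and word_bytes stands for the word size in bytes (1, 2 or 4) that corresponds to the pipe’s “format” attribute.

```c
#include <stddef.h>

/* Hypothetical result type for this sketch. */
struct flush_split {
	size_t sendable;	/* bytes that form complete words */
	size_t leftover;	/* 0 to word_bytes - 1 bytes to carry over */
};

/*
 * Split a buffered byte count into the part that can be sent to the
 * FPGA and the tail that must wait in the next buffer. word_bytes
 * must be a power of two.
 */
struct flush_split split_for_flush(size_t buffered, size_t word_bytes)
{
	struct flush_split s;

	s.leftover = buffered & (word_bytes - 1);	/* buffered mod word size */
	s.sendable = buffered - s.leftover;
	return s;
}
```

For example, with 11 buffered bytes and 4-byte words, 8 bytes can be sent and 3 bytes are carried over to the next buffer.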

Probing

As mentioned earlier, the number of pipes that are created when the driver loads and their attributes depend on the Xillybus IP core in the FPGA. During the driver’s initialization, a blob containing configuration info, the Interface Description Table (IDT), is sent from the FPGA to the host. The bootstrap process is done in three phases:

  1. Acquire the length of the IDT, so a buffer can be allocated for it. This is done by sending a quiesce command to the device, since the acknowledgment for this command contains the IDT’s buffer length.

  2. Acquire the IDT itself.

  3. Create the interfaces according to the IDT.

Buffer allocation

In order to simplify the logic that prevents illegal boundary crossings of PCIe packets, the following rule applies: If a buffer is smaller than 4kB, it must not cross a 4kB boundary. Otherwise, it must be 4kB aligned. The xilly_setupchannels() function allocates these buffers by requesting whole pages from the kernel, and dividing them into DMA buffers as necessary. Since all buffers’ sizes are powers of two, it’s possible to pack any set of such buffers, with a maximal waste of one page of memory.

All buffers are allocated when the driver is loaded. This is necessary, since large contiguous physical memory segments are sometimes requested, which are more likely to be available when the system is freshly booted.

The allocation of buffer memory takes place in the same order they appear in the IDT. The driver relies on a rule that the pipes are sorted with decreasing buffer size in the IDT. If a requested buffer is larger than or equal to a page, the necessary number of pages is requested from the kernel, and these are used for this buffer. If the requested buffer is smaller than a page, one single page is requested from the kernel, and that page is partially used. Or, if there already is a partially used page at hand, the buffer is packed into that page. It can be shown that all pages requested from the kernel (except possibly for the last) are 100% utilized this way.
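
The packing rule can be illustrated with a small simulation. This is a sketch under assumptions and not the driver’s allocation code: pack_buffers() is a hypothetical name, the page size is assumed to be 4096 bytes, and instead of requesting memory it only computes where each buffer would land in an imaginary page-aligned arena.

```c
#include <stddef.h>

#define PAGE_BYTES 4096u	/* assumed page size for this sketch */

/*
 * sizes[] holds n power-of-two buffer sizes in decreasing order.
 * offsets[] receives each buffer's byte offset in the imaginary arena.
 * Returns the number of pages used.
 */
size_t pack_buffers(const size_t *sizes, size_t n, size_t *offsets)
{
	size_t pages = 0;	/* whole pages requested so far */
	size_t left = 0;	/* unused bytes in the partially used page */
	size_t i;

	for (i = 0; i < n; i++) {
		if (sizes[i] >= PAGE_BYTES) {
			/* Large buffer: a page-aligned run of whole pages */
			offsets[i] = pages * PAGE_BYTES;
			pages += sizes[i] / PAGE_BYTES;
			left = 0;
		} else {
			if (left < sizes[i]) {
				/* No partially used page with room: take a new one */
				pages++;
				left = PAGE_BYTES;
			}
			/* Pack the buffer right after the previous one in the page */
			offsets[i] = pages * PAGE_BYTES - left;
			left -= sizes[i];
		}
	}
	return pages;
}
```

Because the sizes are powers of two in decreasing order, each small buffer lands at an offset that is a multiple of its own size, so it never crosses a page boundary.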

The “nonempty” message (supporting poll)

In order to support the “poll” method (and hence select() ), there is a small catch regarding the FPGA to host direction: The FPGA may have filled a DMA buffer with some data, but not submitted that buffer. If the host waited for the buffer’s submission by the FPGA, there would be a possibility that the FPGA side has sent data, but a select() call would still block, because the host has not received any notification about this. This is solved with XILLYMSG_OPCODE_NONEMPTY messages sent by the FPGA when a channel goes from completely empty to containing some data.

These messages are used only to support poll() and select(). The IP core can be configured not to send them, which slightly reduces the bandwidth consumed.