The present application claims the benefit of Italian Application No. 102020000009358, filed on April 29, 2020, the contents of which are incorporated herein by reference.
Detailed Description
In the following description, one or more specific details are set forth in order to provide a thorough understanding of examples of the embodiments described herein. Embodiments may be obtained without one or more of the specific details, or by other methods, components, materials, etc. In other instances, well-known structures, materials, or operations are not shown or described in detail to avoid obscuring aspects of the embodiments.
References to "an embodiment" or "one embodiment" in the framework of this description are intended to indicate that a particular configuration, structure, or feature described in connection with the embodiment is included in at least one embodiment. Thus, phrases such as "in an embodiment" or "in one embodiment" that may occur at one or more points of the present description do not necessarily refer to one and the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.
In the drawings attached hereto, like parts or elements are denoted by like reference numerals, and the corresponding description will not be repeated for the sake of brevity.
The references used herein are for convenience only and thus do not define the scope of protection or the scope of the embodiments.
Fig. 1 is an example of an electronic system 1, such as a system on a chip (SoC), in accordance with one or more embodiments. The electronic system 1 may include various electronic circuits such as a central processing unit 10 (CPU, e.g., microprocessor), a main system memory 12 (e.g., system RAM-random access memory), a Direct Memory Access (DMA) controller 14, and a hardware accelerator circuit 16.
As shown in fig. 1, electronic circuits in electronic system 1 may be connected through a system interconnect network 18 (e.g., a SoC interconnect).
It is an object of one or more embodiments to provide a (runtime) reconfigurable hardware accelerator circuit 16 designed to support the execution of various (basic) arithmetic functions and with improved flexibility of use. Accordingly, one or more embodiments may facilitate improved use of silicon regions and provide satisfactory processing performance, e.g., to meet processing time requirements of a real-time data processing system.
As shown in fig. 1, in one or more embodiments, the hardware accelerator circuit 16 may include at least one (runtime) configurable processing element 160, preferably a number P of (runtime) configurable processing elements 160_0, 160_1, …, 160_(P-1), and a set of local data memory banks, preferably a number Q = 2*P of local data memory banks M_0, …, M_(Q-1).
In one or more embodiments, the hardware accelerator circuit 16 may further include a local control unit 161, a local interconnection network 162, a local data memory controller 163, a local ROM controller 164 (coupled to a local read-only memory set 165, preferably a number P of local read-only memories 165_0, 165_1, …, 165_(P-1)), and a local configuration memory controller 166 (coupled to a local configurable coefficient memory set 167, preferably a number P of local configurable coefficient memories 167_0, 167_1, …, 167_(P-1)). For example, memory 167 may include volatile memory (e.g., RAM memory) and/or nonvolatile memory (e.g., PCM memory).
Different embodiments may include different numbers P of processing elements 160 and/or different numbers Q of local data memory banks M_0, …, M_(Q-1). For example, P may be equal to 8 and Q may be equal to 16.
In one or more embodiments, processing element 160 may be configured to support different (basic) processing functions with different levels of computational parallelism. For example, processing element 160 may support (e.g., based on an appropriate static configuration) different types of arithmetic (e.g., floating point single precision 32 bits, fixed point/integer 32 bits, or 16 or 8 bits with parallel computing or vectorization modes).
The processing elements 160 may include corresponding internal Direct Memory Access (DMA) controllers 168_0, 168_1, …, 168_(P-1) of low complexity. In particular, a processing element 160 may be configured to retrieve input data from the local data memory banks M_0, …, M_(Q-1) and/or from the main system memory 12 via the corresponding direct memory access controller 168. The processing element 160 may then process the retrieved input data to generate processed output data. The processing element 160 may be configured to store the processed output data in the local data memory banks M_0, …, M_(Q-1) and/or the main system memory 12 via the respective direct memory access controller 168.
Further, the processing element 160 may be configured to retrieve input data from the local read-only memory 165 and/or from the local configurable coefficient memory 167 to perform such processing.
In one or more embodiments, providing a set of local data memory banks M_0, …, M_(Q-1) can facilitate parallel processing of data and reduce memory access conflicts.
Preferably, the local data memory banks M_0, …, M_(Q-1) may provide buffering (e.g., double buffering), which may help mask memory upload (write operation) and/or download (read operation) times. In particular, each local data memory bank may be replicated so that data may be read (e.g., for processing) from one of the two memory banks while, at the same time, (new) data may be stored (e.g., for later processing) in the other memory bank. Thus, data movement may not negatively impact computing performance, as it may be masked.
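The double-buffering ("ping-pong") scheme described above can be sketched behaviorally as follows. This is an illustrative software model, not the disclosed hardware; the class and method names are assumptions made for the example.

```python
# Minimal ping-pong (double) buffering sketch: while the compute side reads
# the "front" bank, the DMA side writes new data into the "back" bank; the
# roles are swapped at the end of each processing round, so data movement is
# masked behind computation.

class DoubleBuffer:
    def __init__(self, depth: int):
        self.banks = [[0] * depth, [0] * depth]
        self.front = 0  # index of the bank currently being read/processed

    def read_bank(self):
        return self.banks[self.front]

    def write_bank(self):
        return self.banks[1 - self.front]

    def swap(self):
        self.front = 1 - self.front

buf = DoubleBuffer(depth=4)
buf.write_bank()[:] = [1, 2, 3, 4]   # DMA upload while the front bank is in use
buf.swap()
print(buf.read_bank())               # [1, 2, 3, 4]
```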
In one or more embodiments, a double buffering scheme of the local data memory banks M_0, …, M_(Q-1) in combination with stream-mode or back-to-back data processing may be advantageous (e.g., as applicable to an FFT N-point processor configured to process a continuous sequence of N data inputs).
In one or more embodiments, the local data memory banks M_0, …, M_(Q-1) may have a limited storage capacity (and, therefore, a limited silicon footprint). In the exemplary case of an FFT processor, each local data memory bank may have a memory capacity of at least (maxN)/Q points, where maxN is the longest FFT that can be handled by the hardware. Usual values in applications involving hardware accelerators may be as follows:
N = 4096 points, e.g., with each point a floating-point single-precision complex number (real, imaginary), which is 64 bits (or 8 bytes) in size,
P = 8, resulting in Q = 16,
so that the storage capacity of each local data memory bank may be equal to (4096 × 8 bytes)/16 = 2 KB (KB = kilobytes).
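As a quick check of the sizing arithmetic above, a minimal sketch (the function name is an illustrative assumption):

```python
# Sketch of the local data memory bank sizing described above.
# Assumed values: maxN = 4096 FFT points, Q = 16 banks, 8 bytes per point
# (single-precision complex: 32-bit real part + 32-bit imaginary part).

def bank_capacity_bytes(max_n: int, q: int, bytes_per_point: int = 8) -> int:
    """Minimum capacity of each local data memory bank, in bytes."""
    return (max_n * bytes_per_point) // q

print(bank_capacity_bytes(max_n=4096, q=16))   # 2048 bytes = 2 KB
```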
In one or more embodiments, local control unit 161 may include a register file that includes information for setting the configuration of processing element 160. For example, the local control unit 161 may set the processing element 160 to execute a specific algorithm as directed by a host application running on the central processing unit 10.
In one or more embodiments, the local control unit 161 may thus include a controller circuit for the hardware accelerator circuit 16. Such controller circuitry may configure (e.g., dynamically) each processing element 160 for computing a particular (basic) function, and may configure a corresponding internal direct memory access controller 168 with a particular memory access scheme and cycle period.
In one or more embodiments, local interconnect network 162 may include a low-complexity interconnect system, e.g., an AXI4-based interconnect of a known type. For example, the data parallelism of the local interconnect network 162 may be 64 bits and the address width may be 32 bits.
Local interconnect network 162 may be configured to connect the processing elements 160 to the local data memory banks M_0, …, M_(Q-1) and/or the main system memory 12. Further, the local interconnect network 162 may be configured to connect the local control unit 161 and the local configuration memory controller 166 to the system interconnect network 18.
In particular, the interconnect network 162 may include: a set of P master ports MP_0, MP_1, …, MP_(P-1), each coupled to a respective processing element 160; a set of P slave ports SP_0, SP_1, …, SP_(P-1), which may be coupled to the local data memory banks M_0, …, M_(Q-1) via the local data memory controller 163; another pair of ports including a system master port MP_P and a system slave port SP_P configured to be coupled to the system interconnect network 18 (e.g., to receive instructions from the central processing unit 10 and/or to access data stored in the system memory 12); and a further slave port SP_(P+1) coupled to the local control unit 161 and the local configuration memory controller 166.
In one or more embodiments, the interconnection network 162 may be fixed (i.e., non-reconfigurable).
In an exemplary embodiment (see, e.g., Table I-1 provided below, wherein the "X" symbol indicates an existing connection between two ports), interconnect network 162 may implement connections in which the P master ports MP_0, MP_1, …, MP_(P-1) coupled to the processing elements 160 may be connected to the corresponding slave ports SP_0, SP_1, …, SP_(P-1) coupled to the local data memory controller 163, and the system master port MP_P coupled to the system interconnect network 18 may be connected to the slave port SP_(P+1) coupled to the local control unit 161 and the local configuration memory controller 166.
Table I-1 provided below summarizes such exemplary connections implemented through the interconnection network 162.
TABLE I-1
| | SP_0 | SP_1 | ... | SP_(P-1) | SP_P | SP_(P+1) |
| MP_0 | X | | | | | |
| MP_1 | | X | | | | |
| ... | | | ... | | | |
| MP_(P-1) | | | | X | | |
| MP_P | | | | | | X |
In another exemplary embodiment (see, e.g., Table I-2 provided below), interconnect network 162 may further implement connections in which each of the P master ports MP_0, MP_1, …, MP_(P-1) may be connected to the system slave port SP_P coupled to the system interconnect network 18. In this way, connectivity may be provided between any processing element 160 and the SoC via the system interconnect network 18.
Table I-2 provided below summarizes such exemplary connections implemented through the interconnection network 162.
TABLE I-2
| | SP_0 | SP_1 | ... | SP_(P-1) | SP_P | SP_(P+1) |
| MP_0 | X | | | | X | |
| MP_1 | | X | | | X | |
| ... | | | ... | | ... | |
| MP_(P-1) | | | | X | X | |
| MP_P | | | | | | X |
In another exemplary embodiment (see, e.g., Table I-3 provided below, wherein the "X" symbol indicates an existing connection between two ports and an "X" in brackets indicates an optional connection), the interconnect network 162 may further implement a connection in which the system master port MP_P coupled to the system interconnect network 18 may be connected to at least one of the slave ports SP_0, SP_1, …, SP_(P-1) (here, the first slave port SP_0 of the set of P slave ports). In this way, a connection may be provided between the master port MP_P and (any of) the slave ports. Depending on the specific application of the system 1, the connection of the master port MP_P may be extended to a plurality (e.g., all) of the slave ports SP_0, SP_1, …, SP_(P-1). The connection of the master port MP_P to at least one of the slave ports SP_0, SP_1, …, SP_(P-1) may be used (only) for loading the input data to be processed into the local data memory banks M_0, …, M_(Q-1), since all memory banks may be accessed via a single slave port. Loading input data may thus be accomplished using only one slave port, while processing data by parallel computing may advantageously use multiple (e.g., all) slave ports SP_0, SP_1, …, SP_(P-1).
Table I-3 provided below summarizes such exemplary connections implemented via the interconnection network 162.
TABLE I-3
| | SP_0 | SP_1 | ... | SP_(P-1) | SP_P | SP_(P+1) |
| MP_0 | X | | | | X | |
| MP_1 | | X | | | X | |
| ... | | | ... | | ... | |
| MP_(P-1) | | | | X | X | |
| MP_P | X | (X) | (X) | (X) | | X |
In one or more embodiments, the local data memory controller 163 may be configured to arbitrate access (e.g., by the processing elements 160) to the local data memory banks M_0, …, M_(Q-1). For example, the local data memory controller 163 may use a memory access scheme (e.g., for computation of a particular algorithm) selectable according to signals received from the local control unit 161.
In one or more embodiments, local data memory controller 163 may convert incoming read/write transaction bursts (e.g., AXI bursts) generated by direct read/write memory access controller 168 into read/write memory access sequences according to specified burst types, burst lengths, and memory access schemes.
Thus, one or more embodiments of the hardware accelerator circuit 16 as shown in fig. 1 may aim to reduce the complexity of the local interconnect network 162 by delegating the implementation of the (reconfigurable) connections between the processing elements and the local data memory banks to the local data memory controller 163.
In one or more embodiments, the local read-only memories 165_0, 165_1, …, 165_(P-1), accessible by the processing elements 160 via the local ROM controller 164, may be configured to store digital factors and/or fixed coefficients (e.g., rotation factors or other complex coefficients for FFT computation) for implementing a particular algorithm or operation. The local ROM controller 164 may implement a specific addressing scheme.
In one or more embodiments, the local configurable coefficient memories 167_0, 167_1, …, 167_(P-1), accessible by the processing elements 160 via the local configuration memory controller 166, may be configured to store application-dependent digital factors and/or coefficients (e.g., coefficients for implementing FIR filters or beamforming operations, weights of neural networks, etc.) that may be configured by software. The local configuration memory controller 166 may implement a particular addressing scheme.
In one or more embodiments, the local read-only memories 165_0, 165_1, …, 165_(P-1) and/or the local configurable coefficient memories 167_0, 167_1, …, 167_(P-1) may advantageously be partitioned into a number P of groups equal to the number of processing elements 160 included in the hardware accelerator circuit 16. This helps avoid collisions during parallel computation. For example, each local configurable coefficient memory may be configured to provide, in parallel, the complete set of coefficients required by each processing element 160.
Fig. 2 is an exemplary circuit block diagram of one or more embodiments of processing element 160 and the associated connections to the local ROM controller 164, the local configuration memory controller 166, and the local data memory banks M_0, …, M_(Q-1) (where the dashed lines schematically indicate a reconfigurable connection between the processing element 160 and the local data memory banks M_0, …, M_(Q-1)).
The processing element 160 as shown in fig. 2 may be configured to receive: a first input signal P (e.g., a digital signal indicative of binary values, possibly complex data having real and imaginary parts, from the local data memory banks M_0, …, M_(Q-1)) via a corresponding read direct memory access 200_0 and buffer register 202_0 (e.g., a FIFO register); a second input signal Q (e.g., a digital signal indicative of binary values, possibly complex data having real and imaginary parts, from the local data memory banks M_0, …, M_(Q-1)) via a corresponding read direct memory access 200_1 and buffer register 202_1 (e.g., a FIFO register); a first input coefficient W0 (e.g., a digital signal indicative of binary values from the local read-only memory 165); and second, third, fourth and fifth input coefficients W1, W2, W3, W4 (e.g., digital signals indicative of corresponding binary values from the local configurable coefficient memory 167).
In one or more embodiments, processing element 160 may include a number of direct read memory accesses 200 equal to the number of input signals P, Q.
It should be appreciated that the number of input signals and/or input coefficients received at the processing element 160 may vary in different embodiments.
The processing element 160 may include a computation circuit 20, which may be configured (possibly at run-time) to process the input values P, Q and the input coefficients W0, W1, W2, W3, W4 to generate a first output signal X0 (e.g., a digital signal indicative of binary values to be stored in the local data memory banks M_0, …, M_(Q-1) via a respective write direct memory access 204_0 and buffer register 206_0, such as a FIFO register) and a second output signal X1 (e.g., a digital signal indicative of binary values to be stored in the local data memory banks M_0, …, M_(Q-1) via a respective write direct memory access 204_1 and buffer register 206_1, such as a FIFO register).
In one or more embodiments, the processing element 160 may include a number of write direct memory accesses 204 equal to the number of output signals X0, X1.
In one or more embodiments, the programming of the read and/or write direct memory accesses 200, 204 (included in the direct memory access controller 168) may be performed via an interface (e.g., AMBA interface) that may allow access to internal control registers located in the local control unit 161.
In addition, the processing element 160 may include a ROM address generator circuit 208 coupled to the local ROM controller 164 and a memory address generator circuit 210 coupled to the local configuration memory controller 166 to manage data retrieved therefrom.
Fig. 3 is an exemplary circuit block diagram of computing circuitry 20 that may be included in one or more embodiments of processing element 160.
As shown in fig. 3, the computing circuit 20 may comprise a set of processing resources, e.g., comprising four complex/real multiplier circuits (30a, 30b, 30c, 30d), two complex adder-subtractor circuits (32a, 32b) and two accumulator circuits (34a, 34b), whose coupling may be reconfigured as shown in fig. 3. For example, reconfigurable coupling of the processing resources may be obtained by means of multiplexer circuits (e.g., 36a to 36j) to form different data paths, wherein different data paths correspond to different mathematical operations, and wherein each multiplexer receives a respective control signal (e.g., S0 to S7).
In one or more embodiments, the multiplier circuits 30a, 30b, 30c, 30d may be configured to operate (e.g., by means of internal multiplexer circuits not visible in the figures) according to two different configurations, which may be selected according to a control signal S8 provided to the multipliers. In a first configuration (e.g., if S8 = 0), a multiplier may calculate two real products on four real operands per clock cycle (i.e., each input signal carries two real values). In a second configuration (e.g., if S8 = 1), a multiplier may calculate one complex product on two complex operands per clock cycle (i.e., each input signal carries two values, where the first value is the real part of the operand and the second value is the imaginary part of the operand).
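The two multiplier configurations selected by S8 can be sketched behaviorally as follows. This is a software model under an assumed operand packing (each operand passed as a pair of values), not the documented RTL:

```python
# Behavioral sketch of the two multiplier configurations selected by S8.

def multiply(a, b, s8: int):
    if s8 == 0:
        # Real mode: each operand is a pair of reals -> two products per cycle.
        (a0, a1), (b0, b1) = a, b
        return (a0 * b0, a1 * b1)
    else:
        # Complex mode: each operand is (real, imag) -> one complex product.
        (ar, ai), (br, bi) = a, b
        return (ar * br - ai * bi, ar * bi + ai * br)

print(multiply((2.0, 3.0), (4.0, 5.0), s8=0))  # (8.0, 15.0)
print(multiply((1.0, 2.0), (3.0, 4.0), s8=1))  # (1+2j)*(3+4j) -> (-5.0, 10.0)
```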
Table II provided below summarizes exemplary possible configurations of multiplier circuits 30a, 30b, 30c, 30 d.
Table II
By way of example and with reference to fig. 3, processing resources may be arranged as follows.
The first multiplier 30a may receive a first input signal W1 and a second input signal P (e.g., complex operands).
The second multiplier 30b may receive the first input signal Q and a second input signal selected from the input signals W2 and W4 by means of the first multiplexer 36a, which receives a corresponding control signal S2. For example, if S2 = 0, the multiplier 30b receives the signal W2 as a second input, and if S2 = 1, it receives the signal W4 as a second input.
The third multiplier 30c may receive a first input signal selected from the output signal from the first multiplier 30a and the input signal P.
For example, as shown in fig. 3, the second multiplexer 36b may provide either the output signal from the first multiplier 30a (e.g., if S0 = 0) or the input signal P (e.g., if S0 = 1) as an output, according to the corresponding control signal S0. The third multiplexer 36c may provide either the output signal from the second multiplexer 36b (e.g., if S3 = 1) or the input signal P (e.g., if S3 = 0) as an output to the first input of the third multiplier 30c, according to the respective control signal S3.
The third multiplier 30c may receive a second input signal selected from among the input signal W3, the input signal W4, and the input signal W0.
For example, as shown in fig. 3, the fourth multiplexer 36d may provide either the input signal W4 (e.g., if S3 = 0) or the input signal W0 (e.g., if S3 = 1) as an output according to the respective control signal S3. The fifth multiplexer 36e may provide either the input signal W3 (e.g., if S3 = 0) or the output signal from the fourth multiplexer 36d (e.g., if S3 = 1) as an output to the second input of the third multiplier 30c according to the respective control signal S3.
The fourth multiplier 30d may receive a first input signal selected from the input signal Q and the output signal from the second multiplier 30b.
For example, as shown in fig. 3, the sixth multiplexer 36f may provide either the input signal Q (e.g., if S1 = 0) or the output signal from the second multiplier 30b (e.g., if S1 = 1) as an output to the first input of the fourth multiplier 30d according to the respective control signal S1.
The fourth multiplier 30d may receive a second input signal selected from the input signal W4 and the input signal W0.
For example, as shown in fig. 3, a second input of the fourth multiplier 30d may be coupled to the output of the fourth multiplexer 36d.
The first adder-subtractor 32a may receive a first input signal selected from the output signal from the first multiplier 30a, the input signal P, and the output signal from the third multiplier 30c.
For example, as shown in fig. 3, the seventh multiplexer 36g may provide either the output signal from the second multiplexer 36b (e.g., if S7 = 1) or the output signal from the third multiplier 30c (e.g., if S7 = 0) as an output to the first input of the first adder-subtractor 32a, according to the respective control signal S7.
The first adder-subtractor 32a may receive a second input signal selected from the input signal Q, the output from the second multiplier 30b, and a zero signal (i.e., a binary signal equal to zero).
For example, as shown in fig. 3, the eighth multiplexer 36h may provide either the input signal Q (e.g., if S6 = 0) or the output signal from the second multiplier 30b (e.g., if S6 = 1) as an output according to the respective control signal S6. The first AND gate 38a may receive the output signal from the eighth multiplexer 36h as a first input signal and the control signal G0 as a second input signal. The output of the first AND gate 38a may be coupled to the second input of the first adder-subtractor 32a.
The second adder-subtractor 32b may receive a first input signal selected from the output signal of the third multiplier 30c and the output signal of the fourth multiplier 30d.
For example, as shown in fig. 3, the ninth multiplexer 36i may provide either the output signal from the third multiplier 30c (e.g., if S5 = 0) or the output signal from the fourth multiplier 30d (e.g., if S5 = 1) as an output to the first input of the second adder-subtractor 32b according to the respective control signal S5.
The second adder-subtractor 32b may receive a second input signal selected from the output from the fourth multiplier 30d, the output from the second multiplier 30b, and a zero signal (i.e., a binary signal equal to zero).
For example, as shown in fig. 3, the tenth multiplexer 36j may provide either the output signal from the fourth multiplier 30d (e.g., if S4 = 0) or the output signal from the second multiplier 30b (e.g., if S4 = 1) as an output according to the corresponding control signal S4. The second AND gate 38b may receive the output signal from the tenth multiplexer 36j as a first input signal and the control signal G1 as a second input signal. The output of the second AND gate 38b may be coupled to the second input of the second adder-subtractor 32b.
The first accumulator 34a may receive the input signal from the output of the first adder-subtractor 32a and the control signal EN to provide a first output signal X0 of the calculation circuit 20.
The second accumulator 34b may receive the input signal from the output of the second adder-subtractor 32b and the control signal EN to provide a second output signal X1 of the calculation circuit 20.
In one or more embodiments, the operation of the adder-subtractors 32a, 32b may be kept "bypassed" by means of the AND gates 38a, 38b, which may be used to force a zero signal at the second input of the adder-subtractors 32a, 32b.
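The AND-gate bypass described above can be sketched behaviorally as follows (an illustrative software model; function and parameter names are assumptions):

```python
# Behavioral sketch of the adder-subtractor "bypass": an AND gate whose
# control input G is 0 forces the second operand to zero, so the
# adder-subtractor passes its first operand through unchanged.

def add_sub(first, second, gate: int, subtract: bool = False):
    gated = second if gate else 0          # AND gate 38a/38b behavior
    return first - gated if subtract else first + gated

print(add_sub(7, 5, gate=1))   # 12  (normal addition)
print(add_sub(7, 5, gate=0))   # 7   (bypassed: second input forced to zero)
```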
Fig. 4 is an exemplary circuit block diagram of other embodiments of computing circuitry 20 that may be included in one or more embodiments of processing element 160.
One or more embodiments as shown in fig. 4 may include the same arrangement of processing resources and multiplexer circuits as discussed with reference to fig. 3, with the addition of two circuits configured to compute an activation non-linear function (ANLF) and of corresponding multiplexer circuits.
By way of example and with reference to fig. 4, additional processing resources may be arranged as follows.
The first ANLF circuit 40a may receive an input signal from the output of the first accumulator 34a. The eleventh multiplexer 36k may provide the first output signal X0 of the calculation circuit 20 by selecting either the output signal from the first accumulator 34a (e.g., if S9 = 0) or the output signal from the first ANLF circuit 40a (e.g., if S9 = 1) according to the respective control signal S9.
The second ANLF circuit 40b may receive an input signal from the output of the second accumulator 34b. The twelfth multiplexer 36m may provide the second output signal X1 of the calculation circuit 20 by selecting either the output signal from the second accumulator 34b (e.g., if S9 = 0) or the output signal from the second ANLF circuit 40b (e.g., if S9 = 1) according to the respective control signal S9.
Thus, in one or more embodiments as shown in fig. 4, ANLF circuits 40a and 40b may be "bypassed" by means of multiplexer circuits 36k and 36m, thereby providing operation similar to the embodiment shown in fig. 3.
Thus, as shown with reference to fig. 3 and 4, the data paths in the computation circuit 20 may be configured to support parallel computation and may facilitate execution of different functions. In one or more embodiments, the internal pipeline may be designed to meet timing constraints (e.g., clock frequency) for minimum delay.
Various non-limiting examples of possible configurations of the computing circuitry 20 are provided below. In each example, the calculation circuit 20 is configured to compute an algorithm-dependent (basic) function.
In a first example, a configuration of the calculation circuit 20 for performing a Fast Fourier Transform (FFT) algorithm is described.
Where the hardware accelerator circuit 16 is required to compute an FFT algorithm, a single processing element 160 may be programmed to implement a radix-2 DIF (decimation in frequency) butterfly algorithm, performing the following complex operations, for example, using signals from the local control unit 161:
X0=P+Q
X1=P*W0-Q*W0
Where W0 may be a rotation factor stored in the local read-only memory 165.
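The butterfly equations above can be checked in software. The sketch below applies them recursively to build a complete radix-2 DIF FFT and compares the result against a direct DFT; it is a reference model only, not the hardware schedule or memory access pattern.

```python
# Radix-2 DIF butterfly (X0 = P + Q, X1 = P*W0 - Q*W0) used as the building
# block of a decimation-in-frequency FFT, verified against a naive DFT.
import cmath

def butterfly(p: complex, q: complex, w0: complex):
    return p + q, p * w0 - q * w0   # equivalently (P - Q)*W0

def fft_dif(x):
    n = len(x)
    if n == 1:
        return list(x)
    top, bottom = [], []
    for k in range(n // 2):
        w = cmath.exp(-2j * cmath.pi * k / n)   # rotation (twiddle) factor
        x0, x1 = butterfly(x[k], x[k + n // 2], w)
        top.append(x0)       # feeds the even-indexed outputs
        bottom.append(x1)    # feeds the odd-indexed outputs
    out = [0] * n
    out[0::2] = fft_dif(top)
    out[1::2] = fft_dif(bottom)
    return out

data = [complex(i, 0) for i in range(8)]
ref = [sum(data[t] * cmath.exp(-2j * cmath.pi * t * k / 8) for t in range(8))
       for k in range(8)]
assert all(abs(a - b) < 1e-9 for a, b in zip(fft_dif(data), ref))
```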
In this first example, the input signals (P, Q, W0, W1, W2, W3, W4) and the output signals (X0, X1) may be complex data types.
Optionally, to reduce the effect of discontinuities at the edges of the data block on the computed spectrum, a window function may be applied to the input data prior to computing the FFT. For example, processing element 160 may support such window processing by using the four multiplier circuits.
Alternatively, the modulus or phase of the spectral components may be computed in place of the complex values (e.g., in radar target detection applications, etc.). In this case, the internal (optional) ANLF circuits may be used during the last FFT stage. For example, the input complex vector may be rotated to align with the x-axis to compute the modulus.
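One well-known way to realize such a rotate-onto-the-x-axis modulus computation is a CORDIC-style vectoring loop. The sketch below is an illustrative assumption, since the ANLF implementation is not detailed here: after the iterative rotations, x holds K·|P| for a known gain K, which is divided out.

```python
# Illustrative CORDIC vectoring sketch (assumed, not the documented ANLF):
# rotate (x, y) onto the x-axis by successive micro-rotations of atan(2^-i).
import math

def cordic_modulus(x: float, y: float, iterations: int = 24) -> float:
    if x < 0:                      # pre-rotate into the right half-plane
        x, y = -x, -y
    for i in range(iterations):
        d = 1 if y < 0 else -1     # rotate toward the x-axis
        x, y = x - d * y * 2**-i, y + d * x * 2**-i
    # Each micro-rotation scales the magnitude by sqrt(1 + 2^-2i).
    gain = math.prod(math.sqrt(1 + 4**-i) for i in range(iterations))
    return x / gain

print(round(cordic_modulus(3.0, 4.0), 4))   # 5.0
```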
Table III provided below summarizes some exemplary configurations of the calculation circuit 20 for computing different radix-2 algorithms.
Table III
Thus, the data flow corresponding to the function "radix-2 butterfly algorithm" illustrated above may be:
X0=P+Q
X1=P*W0-Q*W0
The data flow corresponding to the function "radix-2 butterfly + window" illustrated above may be:
X0=W1*P+W2*Q
X1=(W1*P)*W0-(W2*Q)*W0
The data flow corresponding to the function "radix-2 butterfly + modulo" illustrated above may be:
X0=abs(P+Q)
X1=abs(P*W0-Q*W0)
In the first example considered herein, the configuration corresponding to the "radix-2 butterfly algorithm" may involve the use of two multiplier circuits and two adder-subtractor circuits, with no accumulators and no ANLF circuits.
In the first example considered herein, the configuration corresponding to "radix-2 butterfly + window" may involve the use of four multiplier circuits and two adder-subtractor circuits, with no accumulators and no ANLF circuits.
In the first example considered herein, the configuration corresponding to "radix-2 butterfly + modulo" may involve the use of two multiplier circuits, two adder-subtractor circuits and two ANLF circuits, with no accumulators.
In a second example, a configuration of the calculation circuit 20 for executing a scalar product of complex data vectors is described.
Hardware accelerator circuitry 16 may be required to compute scalar products of complex data vectors. This may be the case, for example, in connection with applications involving filtering operations, such as phased-array radar systems involving a processing stage called beamforming. Beamforming techniques may help the radar system resolve targets in angle (azimuth), in addition to range and radial velocity.
In this second example, the input signals (P, Q, W0, W1, W2, W3, W4) and the output signals (X0, X1) may be complex data types.
In this second example, two different scalar vector product operations (e.g., beamforming operations) may be performed simultaneously by a single processing element 160 (e.g., by utilizing all internal hardware resources).
The local configurable coefficient memory 167 may be used to store phase shifts for different array antenna elements during beamforming operations.
Similar to the first example, in this second example, if a modulus is to be calculated instead of a complex value, then the ANLF circuit may be selected for use.
Table IV provided below illustrates possible configurations of the calculation circuit 20 for calculating scalar products of two vectors simultaneously.
Table IV
Thus, the data flow corresponding to the function "scalar product of vectors" illustrated above may be:
X0=ACC(P*W1+Q*W2)
X1=ACC(P*W3+Q*W4)
The data flow corresponding to the function "scalar product of vectors + modulus" illustrated above may be:
X0=abs(ACC(P*W1+Q*W2))
X1=abs(ACC(P*W3+Q*W4))
In the second example considered herein, the configuration corresponding to "scalar product of vectors" may involve the use of four multiplier circuits, two adder-subtractor circuits and two accumulators, with no ANLF circuits.
In the second example considered herein, the configuration corresponding to "scalar product of vectors + modulus" may involve the use of four multiplier circuits, two adder-subtractor circuits, two accumulators and two ANLF circuits.
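The two simultaneous scalar products described above can be sketched in software as follows. The coefficient interpretation (W1/W2 for one steering direction, W3/W4 for another) follows the beamforming discussion; the vector lengths and values are illustrative assumptions.

```python
# Sketch of one processing element computing two complex scalar (dot)
# products at once: X0 = ACC(P*W1 + Q*W2), X1 = ACC(P*W3 + Q*W4).

def dual_dot(p, q, w1, w2, w3, w4):
    acc0 = acc1 = 0j
    for pk, qk, a, b, c, d in zip(p, q, w1, w2, w3, w4):
        acc0 += pk * a + qk * b     # first accumulator (34a)
        acc1 += pk * c + qk * d     # second accumulator (34b)
    return acc0, acc1

p = [1 + 1j, 2 - 1j]
q = [0 + 1j, 1 + 0j]
x0, x1 = dual_dot(p, q, [1, 1], [1, 1], [1j, 1j], [1j, 1j])
print(x0)   # (1+2j) + (3-1j) = (4+1j)
print(x1)   # 1j * x0        = (-1+4j)
```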
In a third example, a configuration of the calculation circuit 20 for executing a scalar product of real data vectors is described.
Hardware accelerator circuitry 16 may be required to compute scalar products of real data vectors over large real data structures, for example, for computing digital filters. For example, in many applications, the real world (e.g., analog) signal may be filtered after being digitized in order to extract (only) relevant information.
In the digital domain, the convolution of an input signal with a finite impulse response (FIR) filter may take the form of a scalar product of two real data vectors. One of the two vectors may hold the input data, while the other may hold the coefficients defining the filtering operation.
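The mapping of FIR filtering onto scalar products can be sketched in a few lines of Python. This is a behavioral reference model, not the accelerator implementation; the function name is illustrative:

```python
def fir_filter(x, h):
    """FIR filtering of input x with impulse response h, written so that each
    output sample is the scalar product of a window of recent input samples
    with the coefficient vector (the form mapped onto scalar-product mode)."""
    n_taps = len(h)
    y = []
    for n in range(len(x)):
        acc = 0.0  # accumulator, one scalar product per output sample
        for k in range(n_taps):
            if n - k >= 0:  # samples before the start of x are taken as zero
                acc += h[k] * x[n - k]
        y.append(acc)
    return y
```

Each output sample thus requires `n_taps` multiply-accumulate steps, which is the workload the processing elements parallelize.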
In this third example, the input signals (P, Q, W, W1, W2, W3, W4) and the output signals (X0, X1) are real data types.
In this third example, two different filtering operations may be performed simultaneously on the same data set by a single processing element 160, for example by processing four different input data per clock cycle using all internal hardware resources.
Table V provided below illustrates a possible configuration of the calculation circuit 20 for calculating two filtering operations simultaneously on a real data vector.
Table V
Thus, the data stream corresponding to the function illustrated above is as follows, where the subscript "h" represents the MSB portion and the subscript "l" represents the LSB portion:
X0h=ACC(Ph*W1h+Qh*W2h)
X0l=ACC(Pl*W1l+Ql*W2l)
X1h=ACC(Ph*W3h+Qh*W4h)
X1l=ACC(Pl*W3l+Ql*W4l)
In a third example considered herein, a configuration corresponding to "scalar product of real vectors" may involve the use of four multiplier circuits, two adder-subtractor circuits, two accumulators, and no ANLF circuits.
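The packed MSB/LSB dataflow above resembles a two-lane SIMD multiply-accumulate. The following Python sketch models it under the assumption of 16-bit unsigned half-words packed into 32-bit words; the text does not fix the operand widths, so `HALF_BITS` and the function names are assumptions for illustration:

```python
HALF_BITS = 16                      # assumed half-word width (not fixed by the text)
HALF_MASK = (1 << HALF_BITS) - 1

def pack(hi, lo):
    """Pack two unsigned half-words into one word: MSB part 'h', LSB part 'l'."""
    return ((hi & HALF_MASK) << HALF_BITS) | (lo & HALF_MASK)

def dual_lane_mac(p_words, w_words, acc_h=0, acc_l=0):
    """Accumulate independent MACs on the MSB and LSB lanes of each word,
    mirroring X0h = ACC(Ph*W1h + ...) and X0l = ACC(Pl*W1l + ...)."""
    for p, w in zip(p_words, w_words):
        ph, pl = p >> HALF_BITS, p & HALF_MASK
        wh, wl = w >> HALF_BITS, w & HALF_MASK
        acc_h += ph * wh  # MSB lane: one filtering operation
        acc_l += pl * wl  # LSB lane: a second, independent filtering operation
    return acc_h, acc_l
```

Packing two lanes per operand is what lets a single processing element 160 run two filtering operations on the same data set simultaneously.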
In a fourth example, a configuration of the calculation circuit 20 for calculating nonlinear functions is described.
Multilayer perceptrons (MLPs) are a type of fully connected feedforward artificial neural network including at least three layers of nodes/neurons. Each neuron, except for the neurons in the input layer, calculates a weighted sum over all nodes of the previous layer and then applies a nonlinear activation function to the result. The processing element 160 as disclosed herein may handle such nonlinear functions, for example, using the internal ANLF circuits. Typically, neural networks process real-world data and use real-valued weights and functions to calculate class-membership probabilities (the output of the last layer). Thus, for such artificial neural networks, the real-valued scalar product may be the most computationally demanding and most frequently used operation.
Fig. 5 is an exemplary diagram of the general architecture of a multilayer perceptron network 50.
As shown in fig. 5, the multilayer perceptron network 50 may include an input layer 50a comprising N inputs U1, …, UN (Ui, i = 1, …, N), a hidden layer 50b comprising M hidden nodes X1, …, XM (Xk, k = 1, …, M), and an output layer 50c comprising P output nodes Y1, …, YP (Yj, j = 1, …, P).
It should be appreciated that in one or more embodiments, the multilayer perceptron network may include more than one hidden layer 50b.
As shown in fig. 5, the multilayer perceptron network 50 may include a first set of N × M weights Wi,k between the inputs U1, …, UN and the hidden nodes X1, …, XM, and a second set of M × P weights Wk,j between the hidden nodes X1, …, XM and the output nodes Y1, …, YP.
The values stored in the inputs Ui, hidden nodes Xk, and output nodes Yj may be calculated, for example, as single-precision floating-point multiply-accumulate (MAC) operations.
The values of the hidden nodes Xk and the output nodes Yj may be calculated according to the following equations, where f denotes the nonlinear activation function:
Xk = f(Σi=1..N Wi,k * Ui), k = 1, …, M
Yj = f(Σk=1..M Wk,j * Xk), j = 1, …, P
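The per-layer computation (a weighted sum followed by a nonlinear activation) can be sketched in Python as follows. This is a reference model only; a sigmoid is used here purely as an example activation, the function names are illustrative, and the weight-matrix layout (`weights[k][i]` connecting input i to node k) is an assumption:

```python
import math

def sigmoid(v):
    # One example of a nonlinear activation an ANLF stage could provide.
    return 1.0 / (1.0 + math.exp(-v))

def layer_forward(inputs, weights, activation=sigmoid):
    """weights[k][i] connects input i to node k, so each node computes
    f(sum_i weights[k][i] * inputs[i]) - the same form for Xk and Yj."""
    return [activation(sum(w_i * u_i for w_i, u_i in zip(row, inputs)))
            for row in weights]

def mlp_forward(u, w_hidden, w_out, activation=sigmoid):
    x = layer_forward(u, w_hidden, activation)   # hidden layer X1..XM
    return layer_forward(x, w_out, activation)   # output layer Y1..YP
```

Each node's weighted sum is exactly the scalar-product-plus-activation dataflow that a processing element 160 computes.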
In this fourth example, the trained real-valued weights associated with all edges of the MLP may be stored in the local configurable coefficient memory 167. The real-valued layer inputs may be retrieved from the local data memory banks (e.g., M0, …, MQ-1) of the hardware accelerator circuit 16, and the real-valued layer outputs may be stored back into those local data memory banks.
Since the MLP model is mapped onto the hardware accelerator circuit 16, each of the processing elements 160 (e.g., P processing elements) included therein may be used to calculate the scalar products and activation function outputs associated with two different neurons of the same layer, e.g., processing four edges per clock cycle. Thus, all of the processing elements 1600, 1601, …, 160P-1 may be used simultaneously.
Table VI provided below illustrates possible configurations of the calculation circuit 20 for calculating two activation function outputs associated with two different neurons simultaneously.
Table VI
Thus, the data stream corresponding to the function illustrated above is as follows, where the subscript "h" represents the MSB portion and the subscript "l" represents the LSB portion:
X0h=f(ACC(Ph*W1h+Qh*W2h))
X0l=f(ACC(Pl*W1l+Ql*W2l))
X1h=f(ACC(Ph*W3h+Qh*W4h))
X1l=f(ACC(Pl*W3l+Ql*W4l))
In a fourth example considered herein, a configuration corresponding to a functional "MLP computation engine" (which may include computing two scalar products of vectors and applying a nonlinear activation function thereto) may involve the use of four multiplier circuits, two adder-subtractor circuits, two accumulators, and two ANLF circuits.
Table VII provided below illustrates a nonlinear function that may be implemented in one or more embodiments. Some functions denoted by "algorithm=nn" may be used exclusively in the context of neural networks.
Table VII
Accordingly, one or more embodiments of the hardware accelerator circuit 16, including at least one calculation circuit 20 as described herein and/or in the above examples, may facilitate implementing a digital signal processing system having one or more of the following advantages: flexibility (e.g., the ability to handle different types of algorithms); improved use of hardware resources; improved parallel-computing performance; extended connectivity and high bandwidth from each processing element 160 to the local data memory banks M0, …, MQ-1 and/or to the system memory 12; and support for additional algorithms through the simple local interconnect network 162 and the internal direct memory access controllers 1680, 1681, …, 168P-1, as well as through a scalable architecture that integrates different processing elements.
In one or more embodiments, the electronic system 1 may be implemented as a single silicon chip or as an integrated circuit in a chip (e.g., as a system on a chip). Alternatively, the electronic system 1 may be a distributed system comprising a plurality of integrated circuits interconnected together, for example by means of a printed circuit board (PCB).
As shown herein, a circuit (e.g., 160) may include a set of input terminals configured to receive a set of input digital signals (e.g., P, Q, W0, W1, W2, W3, W4) carrying input data, a set of output terminals configured to provide a set of output digital signals (e.g., X0, X1) carrying output data, and a computing circuit arrangement (e.g., 20) configured to generate the output data from the input data. The computing circuit arrangement may include a set of multiplier circuits (e.g., 30a, 30b, 30c, 30d), a set of adder-subtractor circuits (e.g., 32a, 32b), a set of accumulator circuits (e.g., 34a, 34b), and a configurable interconnect network (e.g., 36a, …, 36j) configured to selectively couple (e.g., S1, …, S7) the multiplier circuits, the adder-subtractor circuits, the accumulator circuits, the input terminals, and the output terminals in at least two processing configurations.
As shown herein, in a first processing configuration, the computing circuitry may be configured to compute output data from a first set of functions, and in at least one second processing configuration, the computing circuitry may be configured to compute output data from a respective second set of functions, the respective second set of functions being different from the first set of functions.
As shown herein, the circuitry may include a respective configurable direct read memory access controller (e.g., 2000, 2001) coupled to a first subset of the set of input terminals to receive (e.g., 162, 163) a respective first subset of the input digital signals carrying the first subset of input data (e.g., P, Q). The configurable direct read memory access controller may be configured to control retrieval of the first subset of the input data from a memory (e.g., M0, …, MQ-1).
As shown herein, the circuitry may include a respective configurable direct write memory access controller (e.g., 2040, 2041) coupled to the set of output terminals to provide an output digital signal carrying output data. The configurable direct write memory access controller may be configured to control storage of the output data into the memory.
As shown herein, the circuitry may include respective input buffers (e.g., 2020, 2021) coupled to the configurable direct read memory access controller and respective output buffers (e.g., 2060, 2061) coupled to the configurable direct write memory access controller.
As shown herein, the circuitry may include ROM address generator circuitry (e.g., 208) configured to control retrieval of a second subset of the input data (e.g., W0) from the at least one read-only memory (e.g., 164, 165) via a second subset of the input digital signals, and/or memory address generator circuitry (e.g., 210) configured to control retrieval of a third subset of the input data (e.g., W1, W2, W3, W4) from the at least one configurable memory (e.g., 166, 167) via a third subset of the input digital signals.
As shown herein, in a circuit according to an embodiment, the set of multiplier circuits may include a first multiplier circuit (e.g., 30a), a second multiplier circuit (e.g., 30b), a third multiplier circuit (e.g., 30c), and a fourth multiplier circuit (e.g., 30d). The set of adder-subtractor circuits may include a first adder-subtractor circuit (e.g., 32a) and a second adder-subtractor circuit (e.g., 32b). The set of accumulator circuits may include a first accumulator circuit (e.g., 34a) and a second accumulator circuit (e.g., 34b).
As shown herein, the first multiplier circuit may receive a first input signal (e.g., W1) of the input digital signal set as a first operand and may receive a second input signal (e.g., P) of the input digital signal set as a second operand. The second multiplier circuit may receive a third input signal (e.g., Q) of the set of input digital signals as a first operand and may receive a signal selectable from a fourth input signal (e.g., W2) and a fifth input signal (e.g., W4) of the set of input digital signals as a second operand. The third multiplier circuit may receive as a first operand a signal selectable from the output signal from the first multiplier circuit and the second input signal, and may receive as a second operand a signal selectable from a sixth input signal (e.g., W3), a seventh input signal (e.g., W0), and the fifth input signal. The fourth multiplier circuit may receive as a first operand a signal selectable from the output signal from the second multiplier circuit and the third input signal, and may receive as a second operand a signal selectable from the fifth input signal and the seventh input signal. The first adder-subtractor circuit may receive as a first operand a signal selectable from the output signal from the first multiplier circuit, the second input signal, and the output signal from the third multiplier circuit, and may receive as a second operand a signal selectable from the third input signal, the output signal from the second multiplier circuit, and a zero signal. The second adder-subtractor circuit may receive as a first operand a signal selectable from the output signal from the third multiplier circuit and the output signal from the fourth multiplier circuit, and may receive as a second operand a signal selectable from the output signal from the fourth multiplier circuit, the output signal from the second multiplier circuit, and the zero signal.
The first accumulator circuit may receive as input an output signal from the first adder-subtractor circuit and the second accumulator circuit may receive as input an output signal from the second adder-subtractor circuit. The first accumulator circuit may be selectively activated (e.g., EN) to provide a first output signal (e.g., X0) and the second accumulator circuit may be selectively activated to provide a second output signal (e.g., X1).
As shown herein, the computing circuitry may include a set of circuits (e.g., 40a, 40b) configured to compute a nonlinear function.
As shown herein, the set of circuits configured to compute a nonlinear function may include a first circuit (e.g., 40a) configured to compute a nonlinear function and a second circuit (e.g., 40b) configured to compute a nonlinear function. The first circuit configured to compute a nonlinear function may receive as input the output signal from the first accumulator circuit. The second circuit configured to compute a nonlinear function may receive as input the output signal from the second accumulator circuit. The first output signal may be selectable (e.g., 36k) between the output signal from the first accumulator circuit and the output signal from the first circuit configured to compute the nonlinear function, and the second output signal may be selectable (e.g., 36m) between the output signal from the second accumulator circuit and the output signal from the second circuit configured to compute the nonlinear function.
As shown herein, a device (e.g., 16) may include a set of circuits, a set of data memory banks (e.g., M0, …, MQ-1), and a control unit (e.g., 161) in accordance with one or more embodiments. Depending on the configuration data stored in the control unit, the circuitry may be configured (e.g., 161, 168) to read data from and write data to the data memory banks via the interconnect network (e.g., 162, 163).
As shown herein, the data memory bank may include a buffer, preferably a double buffer.
As shown herein, a system (e.g., 1) may include a device according to one or more embodiments and a processing unit (e.g., 10) coupled to the device via a system interconnect (e.g., 18). The circuitry in the set of circuitry of the device may be configured in at least two processing configurations in accordance with control signals received from the processing unit.
As shown herein, a method of operating a circuit in accordance with one or more embodiments, an apparatus in accordance with one or more embodiments, or a system in accordance with one or more embodiments may include dividing the operating time of the computing circuit arrangement into at least a first operating interval and a second operating interval, wherein the computing circuit arrangement operates in the first processing configuration and in the at least one second processing configuration, respectively.
Without prejudice to the underlying principles, the details and the embodiments may vary, even significantly, with respect to what has been described by way of example only, without departing from the scope of the protection.
The protection scope is defined by the appended claims.
While this invention has been described with reference to illustrative embodiments, this description is not intended to be construed in a limiting sense. Various modifications and combinations of the illustrative embodiments, as well as other embodiments of the invention, will be apparent to persons skilled in the art upon reference to the description. Accordingly, the appended claims are intended to encompass any such modifications or embodiments.