The present application claims the benefit of Italian Application No. 102020000009358, filed on April 29, 2020, the contents of which are incorporated herein by reference.
Detailed Description
In the following description, one or more specific details are set forth in order to provide a thorough understanding of examples of the embodiments described herein. Embodiments may be obtained without one or more of the specific details, or by other methods, components, materials, etc. In other instances, well-known structures, materials, or operations are not shown or described in detail to avoid obscuring aspects of the embodiments.
References to "an embodiment" or "one embodiment" in the framework of this description are intended to indicate that a particular configuration, structure, or feature described in connection with the embodiment is included in at least one embodiment. Thus, phrases such as "in an embodiment" or "in one embodiment" that may occur at one or more points of the present description do not necessarily refer to one and the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.
In the drawings attached hereto, like parts or elements are denoted by like reference numerals, and the corresponding description will not be repeated for the sake of brevity.
The references used herein are for convenience only and thus do not define the scope of protection or the scope of the embodiments.
Fig. 1 is an example of an electronic system 1, such as a system on a chip (SoC), in accordance with one or more embodiments. The electronic system 1 may include various electronic circuits such as a central processing unit 10 (CPU, e.g., microprocessor), a main system memory 12 (e.g., system RAM-random access memory), a Direct Memory Access (DMA) controller 14, and a hardware accelerator circuit 16.
As shown in fig. 1, electronic circuits in electronic system 1 may be connected through a system interconnect network 18 (e.g., a SoC interconnect).
It is an object of one or more embodiments to provide a (runtime) reconfigurable hardware accelerator circuit 16 designed to support the execution of various (basic) arithmetic functions and with improved flexibility of use. Accordingly, one or more embodiments may facilitate improved use of silicon regions and provide satisfactory processing performance, e.g., to meet processing time requirements of a real-time data processing system.
As shown in fig. 1, in one or more embodiments, the hardware accelerator circuit 16 may include at least one (runtime) configurable processing element 160, preferably a number P of (runtime) configurable processing elements 160_0, 160_1, …, 160_(P-1), and a set of local data memory banks, preferably a number Q = 2*P of local data memory banks M_0, …, M_(Q-1).
In one or more embodiments, the hardware accelerator circuit 16 may further include a local control unit 161, a local interconnection network 162, a local data memory controller 163, a local ROM controller 164 (coupled to a local read-only memory set 165, preferably a number P of local read-only memories 165_0, 165_1, …, 165_(P-1)), and a local configuration memory controller 166 (coupled to a local configurable coefficient memory set 167, preferably a number P of local configurable coefficient memories 167_0, 167_1, …, 167_(P-1)). For example, memory 167 may include volatile memory (e.g., RAM memory) and/or nonvolatile memory (e.g., PCM memory).
Different embodiments may include different numbers P of processing elements 160 and/or different numbers Q of local data memory banks M_0, …, M_(Q-1). For example, P may be equal to 8 and Q may be equal to 16.
In one or more embodiments, processing element 160 may be configured to support different (basic) processing functions with different levels of computational parallelism. For example, processing element 160 may support (e.g., based on an appropriate static configuration) different types of arithmetic (e.g., floating point single precision 32 bits, fixed point/integer 32 bits, or 16 or 8 bits with parallel computing or vectorization modes).
The processing elements 160 may include corresponding internal Direct Memory Access (DMA) controllers 168_0, 168_1, …, 168_(P-1) of low complexity. In particular, a processing element 160 may be configured to retrieve input data from the local data memory banks M_0, …, M_(Q-1) and/or from the main system memory 12 via the corresponding direct memory access controller 168. The processing element 160 may then process the retrieved input data to generate processed output data. The processing element 160 may be configured to store the processed output data in the local data memory banks M_0, …, M_(Q-1) and/or the main system memory 12 via the respective direct memory access controller 168.
Further, the processing element 160 may be configured to retrieve input data from the local read-only memory 165 and/or from the local configurable coefficient memory 167 to perform such processing.
In one or more embodiments, providing a set of local data memory banks M_0, …, M_(Q-1) can facilitate parallel processing of data and reduce memory access conflicts.
Preferably, the local data memory banks M_0, …, M_(Q-1) may provide buffering (e.g., double buffering), which may help mask memory upload (write operation) and/or download (read operation) times. In particular, each local data memory bank may be replicated so that data may be read (e.g., for processing) from one of the two memory banks while, at the same time, (new) data may be stored (e.g., for later processing) in the other memory bank. Thus, data movement may not negatively impact computing performance, as it may be masked.
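The double-buffering ("ping-pong") scheme described above can be sketched behaviorally as follows. This is an illustrative software model, not the disclosed hardware; the class and method names are assumptions made for the example.

```python
# Minimal ping-pong (double) buffering sketch: while the compute side reads
# the "front" bank, the DMA side writes new data into the "back" bank; the
# roles are swapped at the end of each processing round, so data movement is
# masked behind computation.

class DoubleBuffer:
    def __init__(self, depth: int):
        self.banks = [[0] * depth, [0] * depth]
        self.front = 0  # index of the bank currently being read/processed

    def read_bank(self):
        return self.banks[self.front]

    def write_bank(self):
        return self.banks[1 - self.front]

    def swap(self):
        self.front = 1 - self.front

buf = DoubleBuffer(depth=4)
buf.write_bank()[:] = [1, 2, 3, 4]   # DMA upload while the front bank is in use
buf.swap()
print(buf.read_bank())               # [1, 2, 3, 4]
```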
In one or more embodiments, a double buffering scheme of the local data memory banks M_0, …, M_(Q-1) in combination with stream-mode or back-to-back data processing may be advantageous (e.g., as applicable to an FFT N-point processor configured to process a continuous sequence of N data inputs).
In one or more embodiments, the local data memory banks M_0, …, M_(Q-1) may have a limited storage capacity (and, therefore, a limited silicon footprint). In the exemplary case of an FFT processor, each local data memory bank may have a memory capacity of at least (maxN)/Q points, where maxN is the longest FFT that can be handled by the hardware. Usual values in applications involving hardware accelerators may be as follows:
N = 4096 points, e.g., with each point a floating-point single-precision complex number (real, imaginary), which is 64 bits (or 8 bytes) in size,
P = 8, resulting in Q = 16,
so that the storage capacity of each local data memory bank may be equal to (4096 × 8 bytes)/16 = 2 KB (KB = kilobytes).
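As a quick check of the sizing arithmetic above, a minimal sketch (the function name is an illustrative assumption):

```python
# Sketch of the local data memory bank sizing described above.
# Assumed values: maxN = 4096 FFT points, Q = 16 banks, 8 bytes per point
# (single-precision complex: 32-bit real part + 32-bit imaginary part).

def bank_capacity_bytes(max_n: int, q: int, bytes_per_point: int = 8) -> int:
    """Minimum capacity of each local data memory bank, in bytes."""
    return (max_n * bytes_per_point) // q

print(bank_capacity_bytes(max_n=4096, q=16))   # 2048 bytes = 2 KB
```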
In one or more embodiments, local control unit 161 may include a register file that includes information for setting the configuration of processing element 160. For example, the local control unit 161 may set the processing element 160 to execute a specific algorithm as directed by a host application running on the central processing unit 10.
In one or more embodiments, the local control unit 161 may thus include a controller circuit for the hardware accelerator circuit 16. Such controller circuitry may configure (e.g., dynamically) each processing element 160 for computing a particular (basic) function, and may configure a corresponding internal direct memory access controller 168 with a particular memory access scheme and cycle period.
In one or more embodiments, local interconnect network 162 may include a low-complexity interconnect system, e.g., an AXI4-based interconnect of a known type. For example, the data parallelism of the local interconnect network 162 may be 64 bits and the address width may be 32 bits.
Local interconnect network 162 may be configured to connect the processing elements 160 to the local data memory banks M_0, …, M_(Q-1) and/or the main system memory 12. Further, the local interconnect network 162 may be configured to connect the local control unit 161 and the local configuration memory controller 166 to the system interconnect network 18.
In particular, the interconnect network 162 may include: a set of P master ports MP_0, MP_1, …, MP_(P-1), each coupled to a respective processing element 160; a set of P slave ports SP_0, SP_1, …, SP_(P-1), which may be coupled to the local data memory banks M_0, …, M_(Q-1) via the local data memory controller 163; another pair of ports including a system master port MP_P and a system slave port SP_P configured to be coupled to the system interconnect network 18 (e.g., to receive instructions from the central processing unit 10 and/or to access data stored in the system memory 12); and a further slave port SP_(P+1) coupled to the local control unit 161 and the local configuration memory controller 166.
In one or more embodiments, the interconnection network 162 may be fixed (i.e., non-reconfigurable).
In an exemplary embodiment (see, e.g., Table I-1 provided below, wherein the "X" symbol indicates an existing connection between two ports), interconnect network 162 may implement connections in which the P master ports MP_0, MP_1, …, MP_(P-1) coupled to the processing elements 160 may be connected to the corresponding slave ports SP_0, SP_1, …, SP_(P-1) coupled to the local data memory controller 163, and the system master port MP_P coupled to the system interconnect network 18 may be connected to the slave port SP_(P+1) coupled to the local control unit 161 and the local configuration memory controller 166.
Table I-1 provided below summarizes such exemplary connections implemented through the interconnection network 162.
TABLE I-1
| | SP_0 | SP_1 | ... | SP_(P-1) | SP_P | SP_(P+1) |
| MP_0 | X | | | | | |
| MP_1 | | X | | | | |
| ... | | | ... | | | |
| MP_(P-1) | | | | X | | |
| MP_P | | | | | | X |
In another exemplary embodiment (see, e.g., Table I-2 provided below), interconnect network 162 may further implement connections in which each of the P master ports MP_0, MP_1, …, MP_(P-1) may be connected to the system slave port SP_P coupled to the system interconnect network 18. In this way, connectivity may be provided between any processing element 160 and the SoC via the system interconnect network 18.
Table I-2 provided below summarizes such exemplary connections implemented through the interconnection network 162.
TABLE I-2
| | SP_0 | SP_1 | ... | SP_(P-1) | SP_P | SP_(P+1) |
| MP_0 | X | | | | X | |
| MP_1 | | X | | | X | |
| ... | | | ... | | ... | |
| MP_(P-1) | | | | X | X | |
| MP_P | | | | | | X |
In another exemplary embodiment (see, e.g., Table I-3 provided below, wherein the "X" symbol indicates an existing connection between two ports and an "X" in brackets indicates an optional connection), the interconnect network 162 may further implement a connection in which the system master port MP_P coupled to the system interconnect network 18 may be connected to at least one of the slave ports SP_0, SP_1, …, SP_(P-1) (here, the first slave port SP_0 of the set of P slave ports). In this way, a connection may be provided between the master port MP_P and (any of) the slave ports. Depending on the specific application of the system 1, the connection of the master port MP_P may be extended to a plurality (e.g., all) of the slave ports SP_0, SP_1, …, SP_(P-1). The connection of the master port MP_P to at least one of the slave ports SP_0, SP_1, …, SP_(P-1) may be used (only) for loading the input data to be processed into the local data memory banks M_0, …, M_(Q-1), since all memory banks may be accessed via a single slave port. Loading input data may thus be accomplished using only one slave port, while processing data by parallel computing may advantageously use multiple (e.g., all) slave ports SP_0, SP_1, …, SP_(P-1).
Table I-3 provided below summarizes such exemplary connections implemented via the interconnection network 162.
TABLE I-3
| | SP_0 | SP_1 | ... | SP_(P-1) | SP_P | SP_(P+1) |
| MP_0 | X | | | | X | |
| MP_1 | | X | | | X | |
| ... | | | ... | | ... | |
| MP_(P-1) | | | | X | X | |
| MP_P | X | (X) | (X) | (X) | | X |
In one or more embodiments, the local data memory controller 163 may be configured to arbitrate access (e.g., by the processing elements 160) to the local data memory banks M_0, …, M_(Q-1). For example, the local data memory controller 163 may use a memory access scheme (e.g., for computation of a particular algorithm) selectable according to signals received from the local control unit 161.
In one or more embodiments, local data memory controller 163 may convert incoming read/write transaction bursts (e.g., AXI bursts) generated by direct read/write memory access controller 168 into read/write memory access sequences according to specified burst types, burst lengths, and memory access schemes.
Thus, one or more embodiments of the hardware accelerator circuit 16 as shown in fig. 1 may aim to reduce the complexity of the local interconnect network 162 by delegating the implementation of the (reconfigurable) connections between the processing elements and the local data memory banks to the local data memory controller 163.
In one or more embodiments, the local read-only memories 165_0, 165_1, …, 165_(P-1), accessible by the processing elements 160 via the local ROM controller 164, may be configured to store digital factors and/or fixed coefficients (e.g., rotation factors or other complex coefficients for FFT computation) for implementing a particular algorithm or operation. The local ROM controller 164 may implement a specific addressing scheme.
In one or more embodiments, the local configurable coefficient memories 167_0, 167_1, …, 167_(P-1), accessible by the processing elements 160 via the local configuration memory controller 166, may be configured to store application-dependent digital factors and/or coefficients (e.g., coefficients for implementing FIR filters or beamforming operations, weights of neural networks, etc.) that may be configured by software. The local configuration memory controller 166 may implement a particular addressing scheme.
In one or more embodiments, the local read-only memories 165_0, 165_1, …, 165_(P-1) and/or the local configurable coefficient memories 167_0, 167_1, …, 167_(P-1) may advantageously be partitioned into a number P of groups equal to the number of processing elements 160 included in the hardware accelerator circuit 16. This helps avoid collisions during parallel computation. For example, each local configurable coefficient memory may be configured to provide, in parallel, the complete set of coefficients required by each processing element 160.
Fig. 2 is an exemplary circuit block diagram of one or more embodiments of processing element 160 and the associated connections to the local ROM controller 164, the local configuration memory controller 166, and the local data memory banks M_0, …, M_(Q-1) (where the dashed lines schematically indicate a reconfigurable connection between the processing element 160 and the local data memory banks M_0, …, M_(Q-1)).
The processing element 160 as shown in fig. 2 may be configured to receive: a first input signal P (e.g., a digital signal indicative of binary values, possibly complex data having real and imaginary parts, from the local data memory banks M_0, …, M_(Q-1)) via a corresponding read direct memory access 200_0 and buffer register 202_0 (e.g., a FIFO register); a second input signal Q (e.g., a digital signal indicative of binary values, possibly complex data having real and imaginary parts, from the local data memory banks M_0, …, M_(Q-1)) via a corresponding read direct memory access 200_1 and buffer register 202_1 (e.g., a FIFO register); a first input coefficient W0 (e.g., a digital signal indicative of binary values from the local read-only memory 165); and second, third, fourth and fifth input coefficients W1, W2, W3, W4 (e.g., digital signals indicative of corresponding binary values from the local configurable coefficient memory 167).
In one or more embodiments, processing element 160 may include a number of direct read memory accesses 200 equal to the number of input signals P, Q.
It should be appreciated that the number of input signals and/or input coefficients received at the processing element 160 may vary in different embodiments.
The processing element 160 may include a computation circuit 20, which may be configured (possibly at run-time) to process the input values P, Q and the input coefficients W0, W1, W2, W3, W4 to generate a first output signal X0 (e.g., a digital signal indicative of binary values to be stored in the local data memory banks M_0, …, M_(Q-1) via a respective write direct memory access 204_0 and buffer register 206_0, such as a FIFO register) and a second output signal X1 (e.g., a digital signal indicative of binary values to be stored in the local data memory banks M_0, …, M_(Q-1) via a respective write direct memory access 204_1 and buffer register 206_1, such as a FIFO register).
In one or more embodiments, the processing element 160 may include a number of write direct memory accesses 204 equal to the number of output signals X0, X1.
In one or more embodiments, the programming of the read and/or write direct memory accesses 200, 204 (included in the direct memory access controller 168) may be performed via an interface (e.g., AMBA interface) that may allow access to internal control registers located in the local control unit 161.
In addition, the processing element 160 may include a ROM address generator circuit 208 coupled to the local ROM controller 164 and a memory address generator circuit 210 coupled to the local configuration memory controller 166 to manage data retrieved therefrom.
Fig. 3 is an exemplary circuit block diagram of computing circuitry 20 that may be included in one or more embodiments of processing element 160.
As shown in fig. 3, the computing circuit 20 may comprise a set of processing resources, e.g., comprising four complex/real multiplier circuits (30a, 30b, 30c, 30d), two complex adder-subtractor circuits (32a, 32b) and two accumulator circuits (34a, 34b), whose coupling may be reconfigured as shown in fig. 3. For example, reconfigurable coupling of the processing resources may be obtained by means of multiplexer circuits (e.g., 36a to 36j) to form different data paths, wherein different data paths correspond to different mathematical operations, and wherein each multiplexer receives a respective control signal (e.g., S0 to S7).
In one or more embodiments, the multiplier circuits 30a, 30b, 30c, 30d may be configured to operate (e.g., by means of internal multiplexer circuits not visible in the figures) according to two different configurations, which may be selected according to a control signal S8 provided to the multipliers. In a first configuration (e.g., if S8 = 0), a multiplier may calculate two real products on four real operands per clock cycle (i.e., each input signal carries two real values). In a second configuration (e.g., if S8 = 1), a multiplier may calculate one complex product on two complex operands per clock cycle (i.e., each input signal carries two values, where the first value is the real part of the operand and the second value is the imaginary part of the operand).
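The two multiplier configurations selected by S8 can be sketched behaviorally as follows. This is a software model under an assumed operand packing (each operand passed as a pair of values), not the documented RTL:

```python
# Behavioral sketch of the two multiplier configurations selected by S8.

def multiply(a, b, s8: int):
    if s8 == 0:
        # Real mode: each operand is a pair of reals -> two products per cycle.
        (a0, a1), (b0, b1) = a, b
        return (a0 * b0, a1 * b1)
    else:
        # Complex mode: each operand is (real, imag) -> one complex product.
        (ar, ai), (br, bi) = a, b
        return (ar * br - ai * bi, ar * bi + ai * br)

print(multiply((2.0, 3.0), (4.0, 5.0), s8=0))  # (8.0, 15.0)
print(multiply((1.0, 2.0), (3.0, 4.0), s8=1))  # (1+2j)*(3+4j) -> (-5.0, 10.0)
```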
Table II provided below summarizes exemplary possible configurations of multiplier circuits 30a, 30b, 30c, 30 d.
Table II
By way of example and with reference to fig. 3, processing resources may be arranged as follows.
The first multiplier 30a may receive a first input signal W1 and a second input signal P (e.g., complex operands).
The second multiplier 30b may receive the first input signal Q and a second input signal selected from the input signals W2 and W4 by means of the first multiplexer 36a, which receives a corresponding control signal S2. For example, if S2 = 0, the multiplier 30b receives the signal W2 as a second input, and if S2 = 1, it receives the signal W4 as a second input.
The third multiplier 30c may receive a first input signal selected from the output signal from the first multiplier 30a and the input signal P.
For example, as shown in fig. 3, the second multiplexer 36b may provide either the output signal from the first multiplier 30a (e.g., if S0 = 0) or the input signal P (e.g., if S0 = 1) as an output, according to the corresponding control signal S0. The third multiplexer 36c may provide either the output signal from the second multiplexer 36b (e.g., if S3 = 1) or the input signal P (e.g., if S3 = 0) as an output to the first input of the third multiplier 30c, according to the respective control signal S3.
The third multiplier 30c may receive a second input signal selected from among the input signal W3, the input signal W4, and the input signal W0.
For example, as shown in fig. 3, the fourth multiplexer 36d may provide either the input signal W4 (e.g., if S3 = 0) or the input signal W0 (e.g., if S3 = 1) as an output according to the respective control signal S3. The fifth multiplexer 36e may provide either the input signal W3 (e.g., if S3 = 0) or the output signal from the fourth multiplexer 36d (e.g., if S3 = 1) as an output to the second input of the third multiplier 30c according to the respective control signal S3.
The fourth multiplier 30d may receive a first input signal selected from the input signal Q and the output signal from the second multiplier 30b.
For example, as shown in fig. 3, the sixth multiplexer 36f may provide either the input signal Q (e.g., if S1 = 0) or the output signal from the second multiplier 30b (e.g., if S1 = 1) as an output to the first input of the fourth multiplier 30d according to the respective control signal S1.
The fourth multiplier 30d may receive a second input signal selected from the input signal W4 and the input signal W0.
For example, as shown in fig. 3, a second input of the fourth multiplier 30d may be coupled to the output of the fourth multiplexer 36d.
The first adder-subtractor 32a may receive a first input signal selected from the output signal from the first multiplier 30a, the input signal P, and the output signal from the third multiplier 30c.
For example, as shown in fig. 3, the seventh multiplexer 36g may provide either the output signal from the second multiplexer 36b (e.g., if S7 = 1) or the output signal from the third multiplier 30c (e.g., if S7 = 0) as an output to the first input of the first adder-subtractor 32a, according to the respective control signal S7.
The first adder-subtractor 32a may receive a second input signal selected from the input signal Q, the output from the second multiplier 30b, and a zero signal (i.e., a binary signal equal to zero).
For example, as shown in fig. 3, the eighth multiplexer 36h may provide either the input signal Q (e.g., if S6 = 0) or the output signal from the second multiplier 30b (e.g., if S6 = 1) as an output according to the respective control signal S6. The first AND gate 38a may receive the output signal from the eighth multiplexer 36h as a first input signal and the control signal G0 as a second input signal. The output of the first AND gate 38a may be coupled to the second input of the first adder-subtractor 32a.
The second adder-subtractor 32b may receive a first input signal selected from the output signal of the third multiplier 30c and the output signal of the fourth multiplier 30d.
For example, as shown in fig. 3, the ninth multiplexer 36i may provide either the output signal from the third multiplier 30c (e.g., if S5 = 0) or the output signal from the fourth multiplier 30d (e.g., if S5 = 1) as an output to the first input of the second adder-subtractor 32b according to the respective control signal S5.
The second adder-subtractor 32b may receive a second input signal selected from the output from the fourth multiplier 30d, the output from the second multiplier 30b, and a zero signal (i.e., a binary signal equal to zero).
For example, as shown in fig. 3, the tenth multiplexer 36j may provide either the output signal from the fourth multiplier 30d (e.g., if S4 = 0) or the output signal from the second multiplier 30b (e.g., if S4 = 1) as an output according to the corresponding control signal S4. The second AND gate 38b may receive the output signal from the tenth multiplexer 36j as a first input signal and the control signal G1 as a second input signal. The output of the second AND gate 38b may be coupled to the second input of the second adder-subtractor 32b.
The first accumulator 34a may receive the input signal from the output of the first adder-subtractor 32a and the control signal EN to provide a first output signal X0 of the calculation circuit 20.
The second accumulator 34b may receive the input signal from the output of the second adder-subtractor 32b and the control signal EN to provide a second output signal X1 of the calculation circuit 20.
In one or more embodiments, the operation of the adder-subtractors 32a, 32b may be kept "bypassed" by means of the AND gates 38a, 38b, which may be used to force a zero signal at the second input of the adder-subtractors 32a, 32b.
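The AND-gate bypass described above can be sketched behaviorally as follows (an illustrative software model; function and parameter names are assumptions):

```python
# Behavioral sketch of the adder-subtractor "bypass": an AND gate whose
# control input G is 0 forces the second operand to zero, so the
# adder-subtractor passes its first operand through unchanged.

def add_sub(first, second, gate: int, subtract: bool = False):
    gated = second if gate else 0          # AND gate 38a/38b behavior
    return first - gated if subtract else first + gated

print(add_sub(7, 5, gate=1))   # 12  (normal addition)
print(add_sub(7, 5, gate=0))   # 7   (bypassed: second input forced to zero)
```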
Fig. 4 is an exemplary circuit block diagram of other embodiments of computing circuitry 20 that may be included in one or more embodiments of processing element 160.
One or more embodiments as shown in fig. 4 may include the same arrangement of processing resources and multiplexer circuits as discussed with reference to fig. 3, with the addition of two circuits configured to compute an activation non-linear function (ANLF) and of corresponding multiplexer circuits.
By way of example and with reference to fig. 4, additional processing resources may be arranged as follows.
The first ANLF circuit 40a may receive an input signal from the output of the first accumulator 34a. The eleventh multiplexer 36k may provide the first output signal X0 of the calculation circuit 20 by selecting either the output signal from the first accumulator 34a (e.g., if S9 = 0) or the output signal from the first ANLF circuit 40a (e.g., if S9 = 1) according to the respective control signal S9.
The second ANLF circuit 40b may receive an input signal from the output of the second accumulator 34b. The twelfth multiplexer 36m may provide the second output signal X1 of the calculation circuit 20 by selecting either the output signal from the second accumulator 34b (e.g., if S9 = 0) or the output signal from the second ANLF circuit 40b (e.g., if S9 = 1) according to the respective control signal S9.
Thus, in one or more embodiments as shown in fig. 4, ANLF circuits 40a and 40b may be "bypassed" by means of multiplexer circuits 36k and 36m, thereby providing operation similar to the embodiment shown in fig. 3.
Thus, as shown with reference to fig. 3 and 4, the data paths in the computation circuit 20 may be configured to support parallel computation and may facilitate execution of different functions. In one or more embodiments, the internal pipeline may be designed to meet timing constraints (e.g., clock frequency) for minimum delay.
Various non-limiting examples of possible configurations of the computing circuitry 20 are provided below. In each example, the calculation circuit 20 is configured to compute an algorithm-dependent (basic) function.
In a first example, a configuration of the calculation circuit 20 for performing a Fast Fourier Transform (FFT) algorithm is described.
Where the hardware accelerator circuit 16 is required to compute an FFT algorithm, a single processing element 160 may be programmed to implement a radix-2 DIF (decimation in frequency) butterfly algorithm, performing the following complex operations, for example, using signals from the local control unit 161:
X0=P+Q
X1=P*W0-Q*W0
Where W0 may be a rotation factor stored in the local read-only memory 165.
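The butterfly equations above can be checked in software. The sketch below applies them recursively to build a complete radix-2 DIF FFT and compares the result against a direct DFT; it is a reference model only, not the hardware schedule or memory access pattern.

```python
# Radix-2 DIF butterfly (X0 = P + Q, X1 = P*W0 - Q*W0) used as the building
# block of a decimation-in-frequency FFT, verified against a naive DFT.
import cmath

def butterfly(p: complex, q: complex, w0: complex):
    return p + q, p * w0 - q * w0   # equivalently (P - Q)*W0

def fft_dif(x):
    n = len(x)
    if n == 1:
        return list(x)
    top, bottom = [], []
    for k in range(n // 2):
        w = cmath.exp(-2j * cmath.pi * k / n)   # rotation (twiddle) factor
        x0, x1 = butterfly(x[k], x[k + n // 2], w)
        top.append(x0)       # feeds the even-indexed outputs
        bottom.append(x1)    # feeds the odd-indexed outputs
    out = [0] * n
    out[0::2] = fft_dif(top)
    out[1::2] = fft_dif(bottom)
    return out

data = [complex(i, 0) for i in range(8)]
ref = [sum(data[t] * cmath.exp(-2j * cmath.pi * t * k / 8) for t in range(8))
       for k in range(8)]
assert all(abs(a - b) < 1e-9 for a, b in zip(fft_dif(data), ref))
```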
In this first example, the input signals (P, Q, W0, W1, W2, W3, W4) and the output signals (X0, X1) may be complex data types.
Optionally, to reduce the effect of discontinuities at the edges of the data block on the computed spectrum, a window function may be applied to the input data prior to computing the FFT. For example, processing element 160 may support such window processing by using the four multiplier circuits.
Alternatively, the modulus or phase of the spectral components may be computed in place of the complex values (e.g., in radar target detection applications, etc.). In this case, the internal (optional) ANLF circuits may be used during the last FFT stage. For example, the input complex vector may be rotated to align with the x-axis to compute the modulus.
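One well-known way to realize such a rotate-onto-the-x-axis modulus computation is a CORDIC-style vectoring loop. The sketch below is an illustrative assumption, since the ANLF implementation is not detailed here: after the iterative rotations, x holds K·|P| for a known gain K, which is divided out.

```python
# Illustrative CORDIC vectoring sketch (assumed, not the documented ANLF):
# rotate (x, y) onto the x-axis by successive micro-rotations of atan(2^-i).
import math

def cordic_modulus(x: float, y: float, iterations: int = 24) -> float:
    if x < 0:                      # pre-rotate into the right half-plane
        x, y = -x, -y
    for i in range(iterations):
        d = 1 if y < 0 else -1     # rotate toward the x-axis
        x, y = x - d * y * 2**-i, y + d * x * 2**-i
    # Each micro-rotation scales the magnitude by sqrt(1 + 2^-2i).
    gain = math.prod(math.sqrt(1 + 4**-i) for i in range(iterations))
    return x / gain

print(round(cordic_modulus(3.0, 4.0), 4))   # 5.0
```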
Table III provided below summarizes some exemplary configurations of the calculation circuit 20 for computing different radix-2 algorithms.
Table III
Thus, the data flow corresponding to the function "radix-2 butterfly algorithm" illustrated above may be:
X0=P+Q
X1=P*W0-Q*W0
The data flow corresponding to the function "radix-2 butterfly + window" illustrated above may be:
X0=W1*P+W2*Q
X1=(W1*P)*W0-(W2*Q)*W0
The data flow corresponding to the function "radix-2 butterfly + modulo" illustrated above may be:
X0=abs(P+Q)
X1=abs(P*W0-Q*W0)
In the first example considered herein, the configuration corresponding to the "radix-2 butterfly algorithm" may involve the use of two multiplier circuits and two adder-subtractor circuits, with no accumulators and no ANLF circuits.
In the first example considered herein, the configuration corresponding to "radix-2 butterfly + window" may involve the use of four multiplier circuits and two adder-subtractor circuits, with no accumulators and no ANLF circuits.
In the first example considered herein, the configuration corresponding to "radix-2 butterfly + modulo" may involve the use of two multiplier circuits, two adder-subtractor circuits and two ANLF circuits, with no accumulators.
In a second example, a configuration of the calculation circuit 20 for executing a scalar product of complex data vectors is described.
Hardware accelerator circuitry 16 may be required to compute scalar products of complex data vectors. This may be the case, for example, in connection with applications involving filtering operations, such as phased-array radar systems involving a processing stage called beamforming. Beamforming techniques may help the radar system resolve targets in angle (azimuth), in addition to range and radial velocity.
In this second example, the input signals (P, Q, W0, W1, W2, W3, W4) and the output signals (X0, X1) may be complex data types.
In this second example, two different scalar vector product operations (e.g., beamforming operations) may be performed simultaneously by a single processing element 160 (e.g., by utilizing all internal hardware resources).
The local configurable coefficient memory 167 may be used to store phase shifts for different array antenna elements during beamforming operations.
Similar to the first example, in this second example, if a modulus is to be calculated instead of a complex value, then the ANLF circuit may be selected for use.
Table IV provided below illustrates possible configurations of the calculation circuit 20 for calculating scalar products of two vectors simultaneously.
Table IV
Thus, the data flow corresponding to the function "scalar product of vectors" illustrated above may be:
X0=ACC(P*W1+Q*W2)
X1=ACC(P*W3+Q*W4)
The data flow corresponding to the function "scalar product of vectors + modulus" illustrated above may be:
X0=abs(ACC(P*W1+Q*W2))
X1=abs(ACC(P*W3+Q*W4))
In the second example considered herein, the configuration corresponding to "scalar product of vectors" may involve the use of four multiplier circuits, two adder-subtractor circuits and two accumulators, with no ANLF circuits.
In the second example considered herein, the configuration corresponding to "scalar product of vectors + modulus" may involve the use of four multiplier circuits, two adder-subtractor circuits, two accumulators and two ANLF circuits.
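The two simultaneous scalar products described above can be sketched in software as follows. The coefficient interpretation (W1/W2 for one steering direction, W3/W4 for another) follows the beamforming discussion; the vector lengths and values are illustrative assumptions.

```python
# Sketch of one processing element computing two complex scalar (dot)
# products at once: X0 = ACC(P*W1 + Q*W2), X1 = ACC(P*W3 + Q*W4).

def dual_dot(p, q, w1, w2, w3, w4):
    acc0 = acc1 = 0j
    for pk, qk, a, b, c, d in zip(p, q, w1, w2, w3, w4):
        acc0 += pk * a + qk * b     # first accumulator (34a)
        acc1 += pk * c + qk * d     # second accumulator (34b)
    return acc0, acc1

p = [1 + 1j, 2 - 1j]
q = [0 + 1j, 1 + 0j]
x0, x1 = dual_dot(p, q, [1, 1], [1, 1], [1j, 1j], [1j, 1j])
print(x0)   # (1+2j) + (3-1j) = (4+1j)
print(x1)   # 1j * x0        = (-1+4j)
```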
In a third example, a configuration of the calculation circuit 20 for executing a scalar product of real data vectors is described.
Hardware accelerator circuitry 16 may be required to compute scalar products of real data vectors over large real data structures, for example, for computing digital filters. For example, in many applications, the real world (e.g., analog) signal may be filtered after being digitized in order to extract (only) relevant information.
In the digital domain, the convolution of an input signal with a finite impulse response (FIR) filter may take the form of a scalar product of two real data vectors. One of the two vectors may hold the input data, while the other may hold the coefficients defining the filtering operation.
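The mapping of FIR filtering onto scalar products can be sketched in a few lines of Python. This is a behavioral reference model, not the accelerator implementation; the function name is illustrative:

```python
def fir_filter(x, h):
    """FIR filtering of input x with impulse response h, written so that each
    output sample is the scalar product of a window of recent input samples
    with the coefficient vector (the form mapped onto scalar-product mode)."""
    n_taps = len(h)
    y = []
    for n in range(len(x)):
        acc = 0.0  # accumulator, one scalar product per output sample
        for k in range(n_taps):
            if n - k >= 0:  # samples before the start of x are taken as zero
                acc += h[k] * x[n - k]
        y.append(acc)
    return y
```

Each output sample thus requires `n_taps` multiply-accumulate steps, which is the workload the processing elements parallelize.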
In this third example, the input signals (P, Q, W, W1, W2, W3, W4) and the output signals (X0, X1) are real data types.
In this third example, two different filtering operations may be performed simultaneously on the same data set by a single processing element 160, for example by processing four different input data per clock cycle using all internal hardware resources.
Table V provided below illustrates a possible configuration of the calculation circuit 20 for calculating two filtering operations simultaneously on a real data vector.
Table V
Thus, the data stream corresponding to the function illustrated above is as follows, where the subscript "h" represents the MSB portion and the subscript "l" represents the LSB portion:
X0h=ACC(Ph*W1h+Qh*W2h)
X0l=ACC(Pl*W1l+Ql*W2l)
X1h=ACC(Ph*W3h+Qh*W4h)
X1l=ACC(Pl*W3l+Ql*W4l)
In a third example considered herein, a configuration corresponding to "scalar product of real vectors" may involve the use of four multiplier circuits, two adder-subtractor circuits, two accumulators, and no ANLF circuits.
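The packed MSB/LSB dataflow above resembles a two-lane SIMD multiply-accumulate. The following Python sketch models it under the assumption of 16-bit unsigned half-words packed into 32-bit words; the text does not fix the operand widths, so `HALF_BITS` and the function names are assumptions for illustration:

```python
HALF_BITS = 16                      # assumed half-word width (not fixed by the text)
HALF_MASK = (1 << HALF_BITS) - 1

def pack(hi, lo):
    """Pack two unsigned half-words into one word: MSB part 'h', LSB part 'l'."""
    return ((hi & HALF_MASK) << HALF_BITS) | (lo & HALF_MASK)

def dual_lane_mac(p_words, w_words, acc_h=0, acc_l=0):
    """Accumulate independent MACs on the MSB and LSB lanes of each word,
    mirroring X0h = ACC(Ph*W1h + ...) and X0l = ACC(Pl*W1l + ...)."""
    for p, w in zip(p_words, w_words):
        ph, pl = p >> HALF_BITS, p & HALF_MASK
        wh, wl = w >> HALF_BITS, w & HALF_MASK
        acc_h += ph * wh  # MSB lane: one filtering operation
        acc_l += pl * wl  # LSB lane: a second, independent filtering operation
    return acc_h, acc_l
```

Packing two lanes per operand is what lets a single processing element 160 run two filtering operations on the same data set simultaneously.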
In a fourth example, a configuration of the calculation circuit 20 for calculating nonlinear functions is described.
Multilayer perceptrons (MLPs) are a type of fully connected feedforward artificial neural network including at least three layers of nodes/neurons. Each neuron, except for the neurons in the input layer, calculates a weighted sum over all nodes of the previous layer and then applies a nonlinear activation function to the result. The processing element 160 as disclosed herein may handle such nonlinear functions, for example, using the internal ANLF circuits. Typically, neural networks process real-world data and use real-valued weights and functions to calculate class-membership probabilities (the output of the last layer). Thus, for such artificial neural networks, the real-valued scalar product may be the most computationally demanding and most frequently used operation.
Fig. 5 is an exemplary diagram of the general architecture of a multilayer perceptron network 50.
As shown in fig. 5, the multilayer perceptron network 50 may include an input layer 50a comprising N inputs U1, …, UN (Ui, i = 1, …, N), a hidden layer 50b comprising M hidden nodes X1, …, XM (Xk, k = 1, …, M), and an output layer 50c comprising P output nodes Y1, …, YP (Yj, j = 1, …, P).
It should be appreciated that in one or more embodiments, the multilayer perceptron network may include more than one hidden layer 50b.
As shown in fig. 5, the multilayer perceptron network 50 may include a first set of N × M weights Wi,k between the inputs U1, …, UN and the hidden nodes X1, …, XM, and a second set of M × P weights Wk,j between the hidden nodes X1, …, XM and the output nodes Y1, …, YP.
The values stored in the inputs Ui, hidden nodes Xk, and output nodes Yj may be calculated, for example, as single-precision floating-point multiply-accumulate (MAC) operations.
The values of the hidden nodes Xk and the output nodes Yj may be calculated according to the following equations, where f denotes the nonlinear activation function:
Xk = f(Σi=1..N Wi,k * Ui), k = 1, …, M
Yj = f(Σk=1..M Wk,j * Xk), j = 1, …, P
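The per-layer computation (a weighted sum followed by a nonlinear activation) can be sketched in Python as follows. This is a reference model only; a sigmoid is used here purely as an example activation, the function names are illustrative, and the weight-matrix layout (`weights[k][i]` connecting input i to node k) is an assumption:

```python
import math

def sigmoid(v):
    # One example of a nonlinear activation an ANLF stage could provide.
    return 1.0 / (1.0 + math.exp(-v))

def layer_forward(inputs, weights, activation=sigmoid):
    """weights[k][i] connects input i to node k, so each node computes
    f(sum_i weights[k][i] * inputs[i]) - the same form for Xk and Yj."""
    return [activation(sum(w_i * u_i for w_i, u_i in zip(row, inputs)))
            for row in weights]

def mlp_forward(u, w_hidden, w_out, activation=sigmoid):
    x = layer_forward(u, w_hidden, activation)   # hidden layer X1..XM
    return layer_forward(x, w_out, activation)   # output layer Y1..YP
```

Each node's weighted sum is exactly the scalar-product-plus-activation dataflow that a processing element 160 computes.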
In this fourth example, the trained real-valued weights associated with all edges of the MLP may be stored in the local configurable coefficient memory 167. The real-valued layer inputs may be retrieved from the local data memory banks (e.g., M0, …, MQ-1) of the hardware accelerator circuit 16, and the real-valued layer outputs may be stored back into those local data memory banks.
Since the MLP model is mapped onto the hardware accelerator circuit 16, each of the processing elements 160 (e.g., P processing elements) included therein may be used to calculate the scalar products and activation function outputs associated with two different neurons of the same layer, e.g., processing four edges per clock cycle. Thus, all of the processing elements 1600, 1601, …, 160P-1 may be used simultaneously.
Table VI provided below illustrates possible configurations of the calculation circuit 20 for calculating two activation function outputs associated with two different neurons simultaneously.
Table VI
Thus, the data stream corresponding to the function illustrated above is as follows, where the subscript "h" represents the MSB portion and the subscript "l" represents the LSB portion:
X0h=f(ACC(Ph*W1h+Qh*W2h))
X0l=f(ACC(Pl*W1l+Ql*W2l))
X1h=f(ACC(Ph*W3h+Qh*W4h))
X1l=f(ACC(Pl*W3l+Ql*W4l))
In a fourth example considered herein, a configuration corresponding to a functional "MLP computation engine" (which may include computing two scalar products of vectors and applying a nonlinear activation function thereto) may involve the use of four multiplier circuits, two adder-subtractor circuits, two accumulators, and two ANLF circuits.
Table VII provided below illustrates a nonlinear function that may be implemented in one or more embodiments. Some functions denoted by "algorithm=nn" may be used exclusively in the context of neural networks.
Table VII
Accordingly, one or more embodiments of the hardware accelerator circuit 16, including at least one calculation circuit 20 as described herein and/or in the above examples, may facilitate implementing a digital signal processing system having one or more of the following advantages: flexibility (e.g., the ability to handle different types of algorithms); improved use of hardware resources; improved parallel-computing performance; extended connectivity and high bandwidth from each processing element 160 to the local data memory banks M0, …, MQ-1 and/or to the system memory 12; and support for additional algorithms through the simple local interconnect network 162 and the internal direct memory access controllers 1680, 1681, …, 168P-1, as well as through a scalable architecture that integrates different processing elements.
In one or more embodiments, the electronic system 1 may be implemented as a single silicon chip or as an integrated circuit in a chip (e.g., as a system on a chip). Alternatively, the electronic system 1 may be a distributed system comprising a plurality of integrated circuits interconnected together, for example by means of a printed circuit board (PCB).
As shown herein, a circuit (e.g., 160) may include a set of input terminals configured to receive a set of input digital signals (e.g., P, Q, W0, W1, W2, W3, W4) carrying input data, a set of output terminals configured to provide a set of output digital signals (e.g., X0, X1) carrying output data, and a computing circuit arrangement (e.g., 20) configured to generate the output data from the input data. The computing circuit arrangement may include a set of multiplier circuits (e.g., 30a, 30b, 30c, 30d), a set of adder-subtractor circuits (e.g., 32a, 32b), a set of accumulator circuits (e.g., 34a, 34b), and a configurable interconnect network (e.g., 36a, …, 36j) configured to selectively couple (e.g., S1, …, S7) the multiplier circuits, the adder-subtractor circuits, the accumulator circuits, the input terminals, and the output terminals in at least two processing configurations.
As shown herein, in a first processing configuration, the computing circuitry may be configured to compute output data from a first set of functions, and in at least one second processing configuration, the computing circuitry may be configured to compute output data from a respective second set of functions, the respective second set of functions being different from the first set of functions.
As shown herein, the circuitry may include a respective configurable direct read memory access controller (e.g., 2000, 2001) coupled to a first subset of the set of input terminals to receive (e.g., 162, 163) a respective first subset of the input digital signals carrying the first subset of input data (e.g., P, Q). The configurable direct read memory access controller may be configured to control retrieval of the first subset of the input data from a memory (e.g., M0, …, MQ-1).
As shown herein, the circuitry may include a respective configurable direct write memory access controller (e.g., 2040, 2041) coupled to the set of output terminals to provide an output digital signal carrying output data. The configurable direct write memory access controller may be configured to control storage of the output data into the memory.
As shown herein, the circuitry may include respective input buffers (e.g., 2020, 2021) coupled to the configurable direct read memory access controller and respective output buffers (e.g., 2060, 2061) coupled to the configurable direct write memory access controller.
As shown herein, the circuitry may include ROM address generator circuitry (e.g., 208) configured to control retrieval of a second subset of the input data (e.g., W0) from the at least one read-only memory (e.g., 164, 165) via a second subset of the input digital signals, and/or memory address generator circuitry (e.g., 210) configured to control retrieval of a third subset of the input data (e.g., W1, W2, W3, W4) from the at least one configurable memory (e.g., 166, 167) via a third subset of the input digital signals.
As shown herein, in a circuit according to an embodiment, the set of multiplier circuits may include a first multiplier circuit (e.g., 30a), a second multiplier circuit (e.g., 30b), a third multiplier circuit (e.g., 30c), and a fourth multiplier circuit (e.g., 30d). The set of adder-subtractor circuits may include a first adder-subtractor circuit (e.g., 32a) and a second adder-subtractor circuit (e.g., 32b). The set of accumulator circuits may include a first accumulator circuit (e.g., 34a) and a second accumulator circuit (e.g., 34b).
As shown herein, the first multiplier circuit may receive a first input signal (e.g., W1) of the input digital signal set as a first operand and may receive a second input signal (e.g., P) of the input digital signal set as a second operand. The second multiplier circuit may receive a third input signal (e.g., Q) of the set of input digital signals as a first operand and may receive a signal selectable from a fourth input signal (e.g., W2) and a fifth input signal (e.g., W4) of the set of input digital signals as a second operand. The third multiplier circuit may receive as a first operand a signal selectable from the output signal from the first multiplier circuit and the second input signal, and may receive as a second operand a signal selectable from a sixth input signal (e.g., W3), a seventh input signal (e.g., W0), and the fifth input signal. The fourth multiplier circuit may receive as a first operand a signal selectable from the output signal from the second multiplier circuit and the third input signal, and may receive as a second operand a signal selectable from the fifth input signal and the seventh input signal. The first adder-subtractor circuit may receive as a first operand a signal selectable from the output signal from the first multiplier circuit, the second input signal, and the output signal from the third multiplier circuit, and may receive as a second operand a signal selectable from the third input signal, the output signal from the second multiplier circuit, and a zero signal. The second adder-subtractor circuit may receive as a first operand a signal selectable from the output signal from the third multiplier circuit and the output signal from the fourth multiplier circuit, and may receive as a second operand a signal selectable from the output signal from the fourth multiplier circuit, the output signal from the second multiplier circuit, and the zero signal.
The first accumulator circuit may receive as input an output signal from the first adder-subtractor circuit and the second accumulator circuit may receive as input an output signal from the second adder-subtractor circuit. The first accumulator circuit may be selectively activated (e.g., EN) to provide a first output signal (e.g., X0) and the second accumulator circuit may be selectively activated to provide a second output signal (e.g., X1).
As shown herein, the computing circuitry may include a set of circuits (e.g., 40a, 40b) configured to compute a nonlinear function.
As shown herein, the set of circuits configured to compute a nonlinear function may include a first circuit (e.g., 40a) configured to compute a nonlinear function and a second circuit (e.g., 40b) configured to compute a nonlinear function. The first circuit configured to compute a nonlinear function may receive as input the output signal from the first accumulator circuit. The second circuit configured to compute a nonlinear function may receive as input the output signal from the second accumulator circuit. The first output signal may be selectable (e.g., 36k) between the output signal from the first accumulator circuit and the output signal from the first circuit configured to compute the nonlinear function, and the second output signal may be selectable (e.g., 36m) between the output signal from the second accumulator circuit and the output signal from the second circuit configured to compute the nonlinear function.
As shown herein, a device (e.g., 16) may include a set of circuits, a set of data memory banks (e.g., M0, …, MQ-1), and a control unit (e.g., 161) in accordance with one or more embodiments. Depending on the configuration data stored in the control unit, the circuitry may be configured (e.g., 161, 168) to read data from and write data to the data memory banks via the interconnect network (e.g., 162, 163).
As shown herein, the data memory bank may include a buffer, preferably a double buffer.
As shown herein, a system (e.g., 1) may include a device according to one or more embodiments and a processing unit (e.g., 10) coupled to the device via a system interconnect (e.g., 18). The circuitry in the set of circuitry of the device may be configured in at least two processing configurations in accordance with control signals received from the processing unit.
As shown herein, a method of operating a circuit in accordance with one or more embodiments, an apparatus in accordance with one or more embodiments, or a system in accordance with one or more embodiments may include dividing the operating time of the computing circuit arrangement into at least a first operating interval and a second operating interval, wherein the computing circuit arrangement operates in the first processing configuration and in the at least one second processing configuration, respectively.
Without prejudice to the underlying principles, the details and the embodiments may vary, even significantly, with respect to what has been described by way of example only, without departing from the scope of the protection.
The protection scope is defined by the appended claims.
While this invention has been described with reference to illustrative embodiments, this description is not intended to be construed in a limiting sense. Various modifications and combinations of the illustrative embodiments, as well as other embodiments of the invention, will be apparent to persons skilled in the art upon reference to the description. Accordingly, the appended claims are intended to encompass any such modifications or embodiments.