CN116911357A - Convolutional computing accelerator and acceleration method based on CSR coding - Google Patents

Convolutional computing accelerator and acceleration method based on CSR coding

Info

Publication number
CN116911357A
CN116911357A (application CN202310848642.6A)
Authority
CN
China
Prior art keywords
data
module
window
calculation
csr
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310848642.6A
Other languages
Chinese (zh)
Other versions
CN116911357B (en)
Inventor
彭琪
陈纪宇
王一凡
朱樟明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xidian University
Original Assignee
Xidian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xidian University
Priority to CN202310848642.6A
Publication of CN116911357A
Application granted
Publication of CN116911357B
Status: Active
Anticipated expiration

Abstract

The invention discloses a convolution calculation accelerator based on CSR encoding, comprising: a data preprocessing module for reading data from the outside and performing block processing; a CSR encoding module for performing CSR encoding on the blocked data to obtain the encoded data and its corresponding addresses; a multiplication systolic array for computing on the corresponding encoded data according to the addresses; a data distribution module for dividing the calculation results into present-window data and cross-window data and passing them to the data accumulation module for accumulation; a data delay module for feeding a back-pressure signal to the multiplication systolic array when an add-write conflict is detected, so as to pause the current work and restart it after the delayed data has been added; and a data arrangement module for integrating the accumulated data and, after the re-quantization module remaps the bit width, writing it to off-chip storage. The method reduces the pressure on on-chip storage, lowers power consumption, and is suitable for highly parallel convolution calculation.

Description

Convolutional calculation accelerator based on CSR (Compressed Sparse Row) coding and acceleration method
Technical Field
The invention belongs to the technical field of chip design, and particularly relates to a convolutional calculation accelerator based on CSR (Compressed Sparse Row) coding and an acceleration method.
Background
With the continuous progress of research on convolutional neural networks, their sample-learning and target-classification capabilities keep improving, and the deployment scale of CNNs in many mobile-end and edge-end application scenarios grows year by year. Meanwhile, as application scenarios multiply and disciplines converge, the scale and the computational complexity of convolutional neural networks also increase sharply.
The traditional convolutional neural network application scheme mainly deploys the network in the cloud, processes the data collected by a terminal there, and returns the operation result to the terminal. However, this approach suffers from insufficient real-time performance. With the growing demand for Internet-of-Things devices that carry neural networks and process data locally, the problem of real-time network deployment is becoming more urgent. Therefore, how to deploy deep convolutional neural networks with high performance, low power consumption and real-time behavior has become a research hotspot for research institutions and related enterprises. Meanwhile, convolution accounts for more than ninety percent of the computation in a convolutional neural network, so efficiently realizing the convolution operation is a key point of deploying convolutional neural networks on edge devices.
The convolution operation involves a very large number of parameters, and it is impossible to store all the data required by the convolution operation in the storage space of an edge device. Based on this, researchers proposed adopting the systolic array: using a systolic-array mode in convolution calculation reduces the storage pressure of the device through a large amount of data multiplexing, making large-scale parallel convolution calculation possible. Document one, "BISMO: A Scalable Bit-Serial Matrix Multiplication Overlay for Reconfigurable Computing," proposes a bit-serial algorithm based on a systolic-array structure that decomposes the multiplication of quantized data into popcount and addition operations, thereby completing data multiplexing and reducing resource consumption to optimize the convolution operation. Document two, "Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks," proposes the Eyeriss convolutional neural network accelerator, which mainly adopts a reconfigurable architecture based on a systolic array, uses CSC coding to avoid redundant calculation in convolution, and proposes a row-stationary dataflow to complete data multiplexing.
However, the method of document one is only suitable for data quantized to low bit widths, such as 4-bit data and below; it is not suitable for highly parallel convolution calculation, and it does not consider the redundant calculation caused by zeros, so its power consumption is relatively high. Although the method of document two skips zero-redundant calculation through CSC coding, its row-stationary dataflow requires data with the same ID to be matched within one clock cycle; a multi-level buffer structure is therefore needed to buffer the data, which cannot be fully read until the IDs are completely matched, so the buffering pressure is high.
Disclosure of Invention
In order to solve the problems in the prior art, the invention provides a convolution calculation accelerator based on CSR coding and an acceleration method. The technical purpose of the invention is achieved by the following technical scheme:
In a first aspect, the present invention provides a CSR-coding-based convolutional calculation accelerator, comprising:
a data preprocessing module, a CSR encoding module, a multiplication systolic array, a data distribution module, a data accumulation module, a data delay module, a data arrangement module, a re-quantization module and a total control module; wherein
the data preprocessing module is used for reading data from the outside through DMA and performing blocking processing to obtain blocked data;
the CSR coding module is used for carrying out CSR coding on the data after the blocking to obtain characteristic data and corresponding addresses thereof;
the multiplication systolic array is used for carrying out natural-flow multiplexing calculation on the corresponding characteristic data and weights according to the addresses, and transmitting the calculation results to the data distribution module;
the data distribution module is used for dividing the incoming calculation results into present-window data and cross-window data, and feeding them into the corresponding accumulation units in the data accumulation module for accumulation;
the data delay module is used for feeding back a back-pressure signal to the multiplication systolic array when it judges that an add-write conflict occurs in the data accumulation module, so as to pause the current work, and for restarting the current work after the delayed data has been added;
the data arrangement module is used for integrating the accumulated data output by the data accumulation module and, after the re-quantization module remaps the bit width, writing the data into off-chip storage by DMA (direct memory access), so as to complete the multiplexed convolution calculation of discontinuous input data;
and the total control module is used for controlling the invocation of and the data interaction among all other modules.
In a second aspect, the present invention provides a convolutional calculation acceleration method based on CSR coding, which is applied to the convolutional calculation accelerator based on CSR coding provided in the foregoing embodiment, and includes the following steps:
external data is obtained, and block preprocessing and CSR coding are carried out to obtain characteristic data and its corresponding addresses;
natural-flow multiplexing calculation is performed on the corresponding characteristic data and weights through the addresses based on the systolic array, and the calculation results are divided into present-window data and cross-window data;
the present-window data and the cross-window data are accumulated through adder multiplexing; when an add-write conflict is judged to occur, the current work is paused through back-pressure control and restarted after the delayed data has been added;
the accumulated data is integrated, its bit width remapped, and written into off-chip storage to complete the multiplexed convolution calculation of discontinuous input data.
The invention has the beneficial effects that:
1. The convolution calculation accelerator based on CSR coding saves the extra overhead of coding by performing block preprocessing on the input data, effectively avoids the redundant parts in neural-network convolution calculation through CSR coding, completes the multiplexing of discontinuous input data through the systolic array after coding and a subsequent series of processing, realizes back-pressure control through the data delay module, and resolves the add-write conflicts caused by the discontinuity of the input data; meanwhile, dividing the data into present-window and cross-window data saves the storage required for matching the data, reduces the pressure of on-chip storage, lowers the power consumption, and makes high-parallelism convolution calculation possible;
2. The convolution calculation accelerator based on CSR coding provided by the invention packages the preprocessing, CSR encoding and caching together and uses ping-pong operation to save the time cost of coding;
3. The invention saves extra area overhead through data classification and module multiplexing: it not only realizes feature-value multiplexing and adder multiplexing in the systolic array, but also enjoys the removal of zero-redundancy calculation brought by the encoding.
The present invention will be described in further detail with reference to the accompanying drawings and examples.
Drawings
FIG. 1 is a block diagram of a convolutional calculation accelerator based on CSR coding provided by an embodiment of the invention;
FIG. 2 is a schematic diagram of a CSR encoding rule according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a detailed structure of a convolutional calculation accelerator based on CSR coding according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a PE unit according to an embodiment of the invention;
FIG. 5 is a schematic diagram of single window data classification according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of cross-window data interaction provided by an embodiment of the present invention;
FIG. 7 is a schematic diagram of an accumulation unit according to an embodiment of the present invention;
fig. 8 is a schematic flow chart of a convolution calculation acceleration method based on CSR coding according to an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to specific examples, but embodiments of the present invention are not limited thereto.
Example 1
Referring to fig. 1, fig. 1 is a block diagram of a convolutional calculation accelerator based on CSR coding according to an embodiment of the present invention, where the accelerator includes:
a data preprocessing module, a CSR encoding module, a multiplication systolic array, a data distribution module, a data accumulation module, a data delay module, a data arrangement module, a re-quantization module and a total control module; wherein
the data preprocessing module is used for reading data from the outside through DMA and performing block processing to obtain the blocked data;
the CSR encoding module is used for carrying out CSR encoding on the blocked data to obtain characteristic data and its corresponding addresses;
the multiplication systolic array is used for carrying out natural-flow multiplexing calculation on the corresponding characteristic data and weights according to the addresses, and transmitting the calculation results to the data distribution module;
the data distribution module is used for dividing the incoming calculation results into present-window data and cross-window data, and feeding them into the corresponding accumulation units in the data accumulation module for accumulation;
the data delay module is used for feeding back a back-pressure signal to the multiplication systolic array when it judges that an add-write conflict occurs in the data accumulation module, so as to pause the current work, and for restarting the current work after the delayed data has been added;
the data arrangement module is used for integrating the accumulated data output by the data accumulation module and, after the re-quantization module remaps the bit width, writing the data into off-chip storage by DMA, so as to complete the multiplexed convolution calculation of discontinuous input data;
the total control module is used for controlling the invocation of and the data interaction among all other modules.
Each module and its data processing procedure are described in detail in turn.
Data preprocessing module:
Specifically, the data preprocessing module divides the input feature map into matrices of a specific size after the padding operation, and the vacant parts of the matrices are filled with zeros. This blocking reduces the extra overhead caused by data coding and saves computing resources. In an actual implementation, the blocking may instead be performed on the software side before the data is input.
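As a concrete illustration, the following Python sketch models this blocking step. The 10×10 block size matches the window size chosen later in this embodiment, while the padding width of 1 and the NumPy representation are assumptions made purely for illustration.

```python
import numpy as np

def block_feature_map(fmap, pad=1, block=10):
    """Pad a 2-D feature map, then cut it into zero-filled block x block tiles."""
    padded = np.pad(fmap, pad, mode="constant")   # zero padding on all sides
    h, w = padded.shape
    tiles = []
    for r in range(0, h, block):
        for c in range(0, w, block):
            tile = np.zeros((block, block), dtype=padded.dtype)
            chunk = padded[r:r + block, c:c + block]
            tile[:chunk.shape[0], :chunk.shape[1]] = chunk  # zero-fill the ragged edge
            tiles.append(tile)
    return tiles

tiles = block_feature_map(np.arange(36).reshape(6, 6))
print(len(tiles), tiles[0].shape)  # 1 (10, 10): one zero-padded 10x10 block
```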
CSR coding module:
the CSR encoding module is used for carrying out CSR encoding on the blocked input data, filtering zero data in the blocked input data and inputting new data into the multiplication ripple calculation array. Specifically, the CSR coding rule is shown in fig. 2, for example, the example illustrated in the figure is a coding window with a size of 10×4, and an english letter indicates that the data is non-zero, and a no letter indicates that the data is zero. The new DATA obtained after encoding can be divided into characteristic DATA, denoted DATA in fig. 2, and addresses, denoted COUNT and ADDR in fig. 2, which together represent that the non-zero DATA is at a specific position in the original window. The characteristic data is input into a subsequent ripple calculation array to finish calculation, and the address is used as the input of the address calculation module.
In this embodiment, the data address calculation module spatially coincides with the multiplication systolic array.
Optionally, as one implementation, the data preprocessing module and the CSR coding module may be packaged together with the storage space of the data input channel and duplicated once, so as to realize ping-pong operation and save the time overhead caused by data coding.
Multiplication systolic array:
Since convolution calculation can be decomposed into multiplication and addition, the multiplication systolic array computes the product of the encoded data and the weights, then feeds the result into the data distribution module to await distribution. The multiplication systolic array in this embodiment therefore completes data multiplexing in a mode where the encoded data flows horizontally and the weights, after being broadcast, flow vertically.
It should be noted that, because the encoded data is not continuous with respect to the original feature-map layout, the weights need to be broadcast in a specific arrangement, namely arranged in columns according to the number of convolution kernels, so as to minimize the possibility of conflicts caused by data of the same convolution window being calculated in the same clock cycle.
Referring to fig. 3, fig. 3 is a detailed structural schematic diagram of a convolutional calculation accelerator based on CSR coding according to an embodiment of the present invention. The multiplication systolic array comprises a number of PE units, whose brief structure is shown in fig. 4; each PE unit comprises an address calculation module, a pipeline register and a multiplier, wherein
the address calculation module calculates the position of the corresponding encoded datum in the original window according to the currently received address;
the pipeline register sends the current address and its corresponding encoded datum to the next PE unit;
the multiplier multiplies the current non-zero datum with the weight to obtain the corresponding product, and the product, kept synchronized with the position information obtained by the address calculation module, is fed into the data distribution module.
Specifically, the address calculation module f_addr computes the position of the datum in the original window from the COUNT and ADDR values obtained by the preceding encoding; the datum and its COUNT and ADDR values are passed to the next PE unit through the pipeline register, so the encoded data flows horizontally through the systolic array; each non-zero datum is also multiplied with the weight by the multiplier, and the product, together with the position information obtained by the address calculation module, is sent synchronously to the data distribution module.
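A behavioral sketch of one PE unit, reusing the (COUNT, ADDR) assumption above; the token format and helper names are illustrative, not taken from the patent:

```python
from dataclasses import dataclass

@dataclass
class Token:
    data: int  # non-zero feature value
    row: int   # row decoded from the COUNT stream
    col: int   # column taken from the ADDR stream

def f_addr(counts, addrs):
    """Expand (COUNT, ADDR) back into per-element (row, col) positions."""
    positions, k = [], 0
    for row, nnz in enumerate(counts):
        for _ in range(nnz):
            positions.append((row, addrs[k]))
            k += 1
    return positions

def pe_step(token, weight):
    """One PE: multiply, keep the position synchronized, forward the token."""
    product = token.data * weight             # multiplier
    result = (product, token.row, token.col)  # sent to the distribution module
    return result, token                      # token flows on to the next PE

# Drive the PE with the encoded stream from the earlier example.
for (r, c), d in zip(f_addr([2, 0, 2], [1, 3, 0, 2]), [3, 1, 5, 2]):
    print(pe_step(Token(d, r, c), weight=2)[0])  # (6,0,1) (2,0,3) (10,2,0) (4,2,2)
```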
Data distribution module:
after the multiplication ripple calculation array is calculated, the data distribution module divides the data into two data streams according to specific situations, namely a window data stream and a cross-window data stream.
Referring to figs. 5-6: fig. 5 is a schematic diagram of data classification provided by an embodiment of the present invention, and fig. 6 is a schematic diagram of cross-window data interaction provided by an embodiment of the present invention, where F is the size of the convolution kernel, ROW is the number of rows of the coding window, COL is the number of columns, and the blue portion represents a convolution window of size 3×3. As shown in fig. 5, if non-zero data sits at a position covered by the third row of the convolution kernel, multiplying that non-zero datum with any value of the third kernel row generates cross-window data; that is, the product generated by the non-zero data of the white window needs to be matched with the product generated by the non-zero data of the gray window. Present-window data, by contrast, has no relation to the data of other windows, so its calculation can be completed independently. The data distribution module is responsible for dividing the data stream from the systolic array into present-window data and cross-window data: according to the addresses and the control signals, the finally calculated data is divided into two parts and sent to the different parts of the data accumulation module.
Optionally, in this embodiment, the data distribution module sets the window size based on the principle that the space for storing cross-window data is proportional to the window size, while the coding overhead bits are inversely proportional to it. A smaller window reduces the circuit's actual logic decisions and cross-window storage, but increases the operation and storage overhead generated by encoding; in practical application, the optimal result is obtained by weighing these two effects. For example, the window size selected by the present invention is 10×10.
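The classification rule this implies can be stated compactly: with kernel size F, only products originating in the bottom F-1 rows of a coding window can belong to convolution windows that extend into the neighbouring block. A sketch of that predicate, assuming a 10-row window and F = 3:

```python
def classify(row, window_rows, f):
    """Route products from the bottom F-1 rows to the cross-window stream."""
    return "cross" if row >= window_rows - (f - 1) else "local"

WINDOW_ROWS, F = 10, 3
for row in (0, 7, 8, 9):
    print(row, classify(row, WINDOW_ROWS, F))
# rows 8 and 9 (the bottom F-1 = 2 rows) go to the cross-window accumulator
```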
Data accumulation module:
With continued reference to fig. 3, the data accumulation module includes a first-stage adder and a second-stage adder, and the first-stage adder can be multiplexed; wherein
the first-stage adder adds the convolution-window data of the same input channel and outputs the convolution result of that channel;
the second-stage adder accumulates the data of the same convolution window across different convolution channels, and sends the present-window data and the cross-window data together to the data arrangement module.
In this embodiment, since the data distribution module divides the data into present-window data and cross-window data, the data accumulation module is likewise provided with two corresponding accumulation units that accumulate the two types of data respectively. As shown in fig. 3, the first-stage adder includes a present-window accumulation unit and a cross-window accumulation unit;
the data delay module is packaged together with the present-window accumulation unit and accumulates the present-window data;
the cross-window accumulation unit accumulates the cross-window data.
Specifically, the first-stage adder is divided into two parts corresponding to the two different data streams, and the data delay module is packaged together with part of the adders of the first-stage adder in the accumulation module as the input module for the present-window data. In the present-window data, because of the discontinuity and randomness of the encoded data mentioned above, two or more data may be written into one adder in the same clock cycle; this will hereinafter be referred to as an add-write conflict. The data delay module determines whether an add-write conflict has occurred; if so, it provides a back-pressure signal to the preceding pipeline, halts its operation, and waits until the delayed data has been added before restarting the pipeline.
It will be appreciated that, to support the back-pressure control, this embodiment also places a back-pressure memory bank formed by synchronous FIFOs between the distribution module and the present-window accumulation unit. The depth requirement of this memory bank is small: it only stores the in-flight pipeline data between the first-stage pipeline and the delay-module pipeline, so as to prevent data loss.
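A minimal cycle-level model of this conflict handling, with the FIFO drain simplified to an immediate stall-and-replay; the details are assumptions for illustration:

```python
from collections import deque

def accumulate_with_backpressure(writes_per_cycle, acc):
    """Accumulate (addr, value) writes; park conflicting writes and stall."""
    fifo = deque()                       # small back-pressure memory bank
    stall_cycles = 0
    for cycle_writes in writes_per_cycle:
        seen = set()
        for addr, value in cycle_writes:
            if addr in seen:             # add-write conflict detected
                fifo.append((addr, value))
                stall_cycles += 1        # back-pressure: upstream pipeline pauses
            else:
                acc[addr] += value
                seen.add(addr)
        while fifo:                      # drain the delayed data, then restart
            a, v = fifo.popleft()
            acc[a] += v
    return stall_cycles

acc = [0] * 4
stalls = accumulate_with_backpressure([[(0, 5), (0, 7), (2, 1)]], acc)
print(acc, stalls)  # [12, 0, 1, 0] 1
```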
Further, this embodiment adopts the conventional accumulation-unit structure shown in fig. 7 to implement the two-stage accumulator; it comprises two MUX data selectors, a DFF flip-flop group and an adder, and its detailed working principle can be found in the prior art.
The first-stage adder of the data accumulation module provided in this embodiment can be multiplexed to save the overhead of extra flip-flops. Its multiplexing function is briefly described below.
The adder stage groups the adders according to the different addition IDs as required, where the number of adder groups corresponds to the number of convolution slides within the coding window; for example, when the coding window is 10 and the convolution kernel size is 3, the adders are divided into eight groups, and the groups are then scheduled through the IDs to complete the multiplexing. When the data flows to row F+1 of the original window, all the data required by the first convolution window, namely all the data of the first F rows, has been calculated, so the adder that computed the first convolution window can be reused to compute the (F+1)-th convolution window. Each convolution window needs F rows of data to finish its calculation: the first window needs the time of F rows to obtain a result, but each later window needs only the time of one more row, which greatly improves the multiplexing efficiency while reducing the extra cost of adders. The cross-window data likewise creates a back-pressure requirement, but since only F-1 rows of data are distributed into the cross-window stream, the total amount of data involved in add-write conflicts there is small; this module therefore uses an adder tree, at the cost of area, to resolve the write conflicts, instead of generating two different back-pressure signals whose interaction would perturb the control.
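One plausible form of the adder-group schedule described above, assuming a 10-row coding window and F = 3 so that eight groups suffice; the simple modulo mapping is an illustration, not the patent's exact scheduler:

```python
ROW, F = 10, 3
GROUPS = ROW - F + 1  # eight adder groups for a 10-row window and a 3x3 kernel

def adder_group(window_index):
    """Map a convolution-window index to its (reused) adder group."""
    return window_index % GROUPS

print([adder_group(w) for w in range(12)])
# [0, 1, 2, 3, 4, 5, 6, 7, 0, 1, 2, 3]: group 0 is reused once window 0 is done
```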
Furthermore, this embodiment also designs a data temporary storage module, which connects the cross-window accumulation unit in the first-stage adder with the second-stage adder;
the data temporary storage module synchronizes the cross-window data stream so as to cooperate with the second-stage adder in accumulating data;
and the data temporary storage module is packaged together with part of the adders in the second-stage adder.
Specifically, the data temporary storage module corresponds to the synchronous FIFO in fig. 3 connecting the cross-window unit and the second-stage adder and, like the data delay module, is packaged together with part of the adders in the second-stage adder as the input module for the cross-window data. This module completes the matching of the cross-window data streams: under the control of the total control module, data streams that must wait for a subsequent window are written into or read out of the synchronous FIFO, and a read-out stream, representing matched data, is sent into the second-stage adder to complete the next accumulation. Because of the reasonable window partitioning, the cross-window data is only a small fraction of the whole, so the required storage space is small. In practice, the additional storage overhead of the invention is embodied in this temporary storage module, which contains the multiplexed adders and the cross-window data.
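A sketch of this cross-window matching behavior, with a dictionary standing in for the synchronous FIFO and window IDs assumed as the matching key:

```python
pending = {}  # window_id -> parked partial sum (stands in for the sync FIFO)

def match_cross_window(window_id, partial):
    """Park a partial sum until its partner from the adjacent block arrives."""
    if window_id in pending:                  # partner found: release the pair
        return pending.pop(window_id) + partial
    pending[window_id] = partial              # otherwise wait in the FIFO
    return None

print(match_cross_window(4, 10))  # None: parked, waiting for the next block
print(match_cross_window(4, 7))   # 17: matched, sent to the second-stage adder
```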
The second-stage adder in the accumulation module accumulates the data of the same convolution window across different convolution channels, after which the present-window data and the cross-window data are fed together into the data arrangement module.
Data arrangement module:
the data arrangement module is responsible for integrating two different data streams, the last output data stream is integrated into a form similar to the input original window, and the last data is input to the re-quantization module.
Re-quantization module:
the re-quantization module is used for remapping the data with the expanded bit width after convolution to the bit width required by the next input so as to facilitate the next convolution calculation. And finally writing the quantized data into off-chip storage through DMA (direct memory access) to complete convolution calculation.
It can be appreciated that the convolution calculation accelerator designed in this embodiment further includes a total control module for controlling the data flow direction and the data interaction of the whole system. The detailed control method can be implemented with reference to the prior art.
The convolution calculation accelerator based on CSR coding saves the extra overhead of coding by performing block preprocessing on the input data, effectively avoids the redundant parts in neural-network convolution calculation through CSR coding, completes the multiplexing of discontinuous input data through the systolic array after coding and a subsequent series of processing, realizes back-pressure control through the data delay module, and resolves the add-write conflicts caused by the discontinuity of the input data; meanwhile, dividing the data into present-window and cross-window data saves the storage required for matching the data, reduces the pressure of on-chip storage, lowers the power consumption, and makes high-parallelism convolution calculation possible.
In addition, the invention saves extra area overhead through data classification and module multiplexing: it not only realizes feature-value multiplexing and adder multiplexing in the systolic array, but also enjoys the removal of zero-redundancy calculation brought by the encoding.
Example 2
On the basis of the first embodiment, the present embodiment provides a convolution calculation acceleration method based on CSR coding. Referring to fig. 8, fig. 8 is a flow chart of a convolutional calculation acceleration method based on CSR coding according to an embodiment of the present invention, where the method includes:
step 1: external data is obtained, and block preprocessing and CSR coding are carried out to obtain characteristic data and its corresponding addresses;
step 2: natural-flow multiplexing calculation is performed on the corresponding characteristic data and weights through the addresses based on the systolic array, and the calculation results are divided into present-window data and cross-window data;
step 3: the present-window data and the cross-window data are accumulated through adder multiplexing; when an add-write conflict is judged to occur, the current work is paused through back-pressure control and restarted after the delayed data has been added;
step 4: the accumulated data is integrated, its bit width remapped, and written into off-chip storage to complete the multiplexed convolution calculation of discontinuous input data.
The method provided in this embodiment may be applied to the convolution calculation accelerator provided in the first embodiment; for the detailed process, refer to the description of the first embodiment. The method therefore likewise reduces the pressure of on-chip storage, lowers the power consumption, and is suitable for highly parallel convolution calculation.
The foregoing is a further detailed description of the invention in connection with the preferred embodiments, and it is not intended that the invention be limited to these specific embodiments. It will be apparent to those skilled in the art that several simple deductions or substitutions may be made without departing from the inventive concept, and these should be considered to fall within the scope of the invention.

Claims (10)

1. A convolution calculation accelerator based on CSR coding, characterized by comprising: a data preprocessing module, a CSR encoding module, a multiplication systolic array, a data distribution module, a data accumulation module, a data delay module, a data arrangement module, a re-quantization module and a total control module; wherein
the data preprocessing module is used for reading data from the outside through DMA and performing block processing to obtain the blocked data;
the CSR encoding module is used for performing CSR encoding on the blocked data to obtain characteristic data and its corresponding addresses;
the multiplication systolic array is used for performing natural-flow multiplexing calculation on the corresponding characteristic data and weights according to the addresses, and transmitting the calculation results to the data distribution module;
the data distribution module is used for dividing the incoming calculation results into present-window data and cross-window data, which are fed respectively into the corresponding accumulation units in the data accumulation module for accumulation;
the data delay module is used for feeding back a back-pressure signal to the multiplication systolic array when it judges that an add-write conflict occurs in the data accumulation module, so as to pause the current work, and for restarting the current work after the delayed data has been added;
the data arrangement module is used for integrating the accumulated data output by the data accumulation module and, after the re-quantization module remaps the bit width, writing the data into off-chip storage through DMA, so as to complete the multiplexed convolution calculation of discontinuous input data;
the total control module is used for controlling the invocation of and the data interaction among all other modules.
2. The convolution calculation accelerator based on CSR coding according to claim 1, characterized in that the data preprocessing module and the CSR encoding module are packaged together with the storage space of the data input channel and duplicated once, so as to realize ping-pong operation.
3. The convolution calculation accelerator based on CSR coding according to claim 1, characterized in that the multiplication systolic array completes data multiplexing in a mode where the encoded data flows horizontally and the weights, after being broadcast, flow vertically; wherein the weights are broadcast arranged in columns according to the convolution data.
4. The convolution calculation accelerator based on CSR coding according to claim 3, characterized in that the multiplication systolic array comprises a number of PE units, each PE unit comprising an address calculation module, a pipeline register and a multiplier; wherein
the address calculation module calculates the position of the corresponding encoded datum in the original window according to the currently received address;
the pipeline register is used for sending the current address and its corresponding encoded datum to the next PE unit;
the multiplier is used for multiplying the current non-zero datum with the weight to obtain the corresponding product, and the product, kept synchronized with the position information obtained by the address calculation module, is fed into the data distribution module.
5. The convolution calculation accelerator based on CSR coding according to claim 1, characterized in that, when performing data distribution, the data distribution module sets the window size based on the principle that the space for storing cross-window data is proportional to the window size and the coding overhead bits are inversely proportional to it.
6. The convolution calculation accelerator based on CSR coding according to claim 1, characterized in that the data accumulation module comprises a first-stage adder and a second-stage adder, and the first-stage adder can be multiplexed; wherein
the first-stage adder is used for adding the convolution-window data of the same input channel and outputting the convolution result of that channel;
the second-stage adder is used for accumulating the data of the same convolution window across different convolution channels, and sending the present-window data and the cross-window data together to the data arrangement module.
7. The convolution calculation accelerator based on CSR coding according to claim 6, characterized in that the first-stage adder comprises a present-window accumulation unit and a cross-window accumulation unit;
the data delay module is packaged together with the present-window accumulation unit to accumulate the present-window data;
the cross-window accumulation unit is used for accumulating the cross-window data.
8. The convolution calculation accelerator based on CSR coding according to claim 6, characterized in that the accelerator further comprises a data temporary storage module, which connects the cross-window accumulation unit in the first-stage adder with the second-stage adder;
the data temporary storage module is used for synchronizing the cross-window data stream so as to cooperate with the second-stage adder in accumulating data;
and the data temporary storage module is packaged together with part of the adders in the second-stage adder.
9. The convolution calculation accelerator based on CSR coding according to claim 8, characterized in that the data temporary storage module is implemented with a synchronous FIFO.
10. A convolution calculation acceleration method based on CSR coding, characterized in that it is applied to the convolution calculation accelerator based on CSR coding according to any one of claims 1-9, and comprises the following steps:
obtaining external data and performing block preprocessing and CSR encoding to obtain characteristic data and its corresponding addresses;
performing natural-flow multiplexing calculation on the corresponding characteristic data and weights through the addresses based on the systolic array, and dividing the calculation results into present-window data and cross-window data;
accumulating the present-window data and the cross-window data respectively through adder multiplexing; when an add-write conflict is judged to occur, pausing the current work through back-pressure control, and restarting the current work after the delayed data has been added;
after integrating the accumulated data and remapping its bit width, writing it into off-chip storage to complete the multiplexed convolution calculation of discontinuous input data.
CN202310848642.6A | Filed 2023-07-11 | Convolution calculation accelerator and acceleration method based on CSR coding | Active | Granted as CN116911357B (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202310848642.6A (CN116911357B) | 2023-07-11 | 2023-07-11 | Convolution calculation accelerator and acceleration method based on CSR coding

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202310848642.6A (CN116911357B) | 2023-07-11 | 2023-07-11 | Convolution calculation accelerator and acceleration method based on CSR coding

Publications (2)

Publication Number | Publication Date
CN116911357A | 2023-10-20
CN116911357B | 2025-09-30

Family

ID=88359658

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202310848642.6A (Active; granted as CN116911357B) | Convolution calculation accelerator and acceleration method based on CSR coding | 2023-07-11 | 2023-07-11

Country Status (1)

Country | Link
CN | CN116911357B (en)


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN107025317A (en)* | 2015-10-07 | 2017-08-08 | 阿尔特拉公司 | Method and apparatus for implementing layers on a convolutional neural network accelerator
CN109993297A (en)* | 2019-04-02 | 2019-07-09 | 南京吉相传感成像技术研究院有限公司 | A load-balanced sparse convolutional neural network accelerator and its acceleration method
CN110070178A (en)* | 2019-04-25 | 2019-07-30 | 北京交通大学 | A convolutional neural network computing device and method
WO2022252568A1 (en)* | 2021-06-03 | 2022-12-08 | 沐曦集成电路(上海)有限公司 | Method based on GPGPU reconfigurable architecture, computing system, and apparatus for reconfiguring architecture
CN113516235A (en)* | 2021-07-13 | 2021-10-19 | 南京大学 | A deformable convolution accelerator and deformable convolution acceleration method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
樊迪; 王健; 来金梅: "DSP blocks in FPGAs suitable for low-bit-width multiply-accumulate" ("FPGA中适用于低位宽乘累加的DSP块"), Journal of Fudan University (Natural Science), no. 05, 15 October 2020*

Also Published As

Publication number | Publication date
CN116911357B (en) | 2025-09-30

Similar Documents

Publication | Title
CN109635944B (en) A sparse convolutional neural network accelerator and implementation method
Gondimalla et al. SparTen: A sparse tensor accelerator for convolutional neural networks
CN111445012B (en) FPGA-based packet convolution hardware accelerator and method thereof
Yuan et al. High performance CNN accelerators based on hardware and algorithm co-optimization
US10223334B1 (en) Native tensor processor
CN109409511B (en) A Data Stream Scheduling Method for Convolution Operation for Dynamic Reconfigurable Arrays
CN106951395B (en) Parallel convolution operations method and device towards compression convolutional neural networks
CN112418396B (en) Sparse activation perception type neural network accelerator based on FPGA
CN113344179B (en) IP Core of Binarized Convolutional Neural Network Algorithm Based on FPGA
CN112488305B (en) Neural network storage device and configurable management method thereof
CN114925823B (en) A convolutional neural network compression method and edge-side FPGA accelerator
CN111507465B (en) A Configurable Convolutional Neural Network Processor Circuit
CN102968390A (en) Configuration information cache management method and system based on decoding analysis in advance
CN112734020B (en) Convolution multiplication accumulation hardware acceleration device, system and method of convolution neural network
CN113449855B (en) A computing device and a computing method
US12307357B2 (en) NPU and SoC being operated based on two or more different clock signals
CN114004351A (en) A Convolutional Neural Network Hardware Acceleration Platform
CN115688892A (en) An FPGA Implementation Method of Sparse Weight Fused-Layer Convolution Accelerator Structure
CN115496190A (en) Efficient reconfigurable hardware accelerator for convolutional neural network training
CN116888591A (en) Matrix multiplier, matrix calculation method and related equipment
CN113313244B (en) Near-memory neural network accelerator for additive networks and its acceleration method
CN116911357B (en) Convolution calculation accelerator and acceleration method based on CSR coding
CN113191493B (en) Convolutional neural network accelerator based on FPGA parallelism self-adaption
CN101534439A (en) Low power consumption parallel wavelet transforming VLSI structure
CN114819127B (en) A backpressure indexed combined computing unit based on FPGA

Legal Events

Code | Title
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant
