Technical Field
The present invention relates to a data processing mechanism, and in particular to a data processing method and circuit based on convolution operations.
Background
Neural networks are an important topic in artificial intelligence (AI); they make decisions by emulating the operation of human brain cells. Notably, the human brain contains many neurons, and these neurons are connected to one another through synapses. Each neuron receives signals via its synapses, and the transformed output of each signal is passed on to another neuron. Each neuron transforms signals differently, and it is through this transmission and transformation of signals that humans form the abilities to think and decide. A neural network obtains corresponding abilities by following this mode of operation.
In the operation of a neural network, an input vector is convolved with the weights of the corresponding synapses to extract features. Notably, the numbers of input values and weight values may be large, and for large amounts of data existing architectures typically suffer from higher power consumption, longer latency, and greater space usage.
Summary of the Invention
Embodiments of the present invention are directed to a data processing method and circuit based on convolution operations, which can provide a more efficient data configuration.
According to an embodiment of the present invention, a data processing method based on convolution operations includes (but is not limited to) the following steps: providing a sum buffer; reading a kernel group from among several convolution kernels according to the size of the sum buffer, where the number of kernels in the kernel group equals the size of the sum buffer; and temporarily storing the result of the convolution operation between the input data and the first kernel group in the sum buffer in first-in, first-out (FIFO) order.
According to an embodiment of the present invention, a data processing circuit based on convolution operations includes (but is not limited to) one or more memories and a processor. The memory stores program code. The processor is coupled to the memory and is configured to load and execute the program code to perform the following steps: providing a sum buffer; reading a kernel group from among several convolution kernels according to the size of the sum buffer, where the number of kernels in the kernel group equals the size of the sum buffer; and temporarily storing the result of the convolution operation between the input data and the first kernel group in the sum buffer in first-in, first-out order.
Based on the above, the data processing method and circuit based on convolution operations according to embodiments of the present invention can form and process multiple kernel groups in batches, thereby using memory space effectively and improving computational efficiency.
Brief Description of the Drawings
The accompanying drawings are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification. The drawings illustrate embodiments of the invention and, together with the description, serve to explain the principles of the invention.
FIG. 1 is a block diagram of the components of a data processing circuit according to an embodiment of the invention;
FIG. 2 is a flowchart of a data processing method (storage configuration) according to an embodiment of the invention;
FIG. 3 is a schematic diagram of input data according to an embodiment of the invention;
FIG. 4 is a schematic diagram of the storage space of multiple memories according to an embodiment of the invention;
FIG. 5A is a schematic diagram of a storage configuration of multiple memories according to an embodiment of the invention;
FIG. 5B is a schematic diagram of a storage configuration of multiple memories according to an embodiment of the invention;
FIG. 5C is a schematic diagram of a storage configuration of multiple memories according to an embodiment of the invention;
FIG. 6 is a flowchart of a data processing method (padding extension) according to an embodiment of the invention;
FIG. 7A is a schematic diagram of input data according to an embodiment of the invention;
FIG. 7B is a schematic diagram of padded input data according to an embodiment of the invention;
FIG. 8 is a schematic diagram of a shared memory according to an embodiment of the invention;
FIG. 9 is a flowchart of a data processing method (operation configuration) according to an embodiment of the invention;
FIG. 10 is a schematic diagram of a convolution operation according to an embodiment of the invention;
FIG. 11 is a schematic diagram of a convolution operation according to an embodiment of the invention;
FIG. 12 is a schematic diagram of a convolution operation according to an embodiment of the invention;
FIG. 13 is a schematic diagram of parallel operations according to an embodiment of the invention;
FIG. 14 is a schematic diagram of data duplication according to an embodiment of the invention;
FIG. 15 is a schematic diagram of data duplication according to an embodiment of the invention;
FIG. 16 is a flowchart of overall data processing according to an embodiment of the invention.
Description of Reference Numerals
100: data processing circuit;
110, M1~M8: memory;
150: processor;
151: processing element;
S210~S230, S610~S650, S910~S950, S1610~S1660: steps;
x: width;
y: height;
z: number of channels;
D1, Pixel, Pixel1~Pixelj: input data;
W: width;
x0~x6, y0~y6: coordinates;
n: positive integer;
CMD: command;
Arb: arbiter;
Bk0~Bkm-1: memory bank;
DATA: data;
rch0~rch3: data to be read;
wch0~wch3: data to be written;
rch0_rdata~rch3_rdata: read data;
WT, K1~K128: convolution kernel;
ch1~ch128: channel;
OT: output buffer;
SB: sum buffer.
Detailed Description of the Embodiments
Reference will now be made in detail to exemplary embodiments of the invention, examples of which are illustrated in the accompanying drawings. Wherever possible, the same reference numerals are used in the drawings and the description to refer to the same or similar parts.
FIG. 1 is a block diagram of the components of a data processing circuit 100 according to an embodiment of the invention. Referring to FIG. 1, the data processing circuit 100 includes (but is not limited to) one or more memories 110 and a processor 150.
The memory 110 may be a static or dynamic random access memory (RAM), a read-only memory (ROM), a flash memory, a register, a combinational logic circuit, or a combination of the above components. In one embodiment, the memory 110 stores the input data, convolution kernels, weights, and the values used by activation operations, pooling operations, and/or other neural network operations for multiply-accumulate (MAC) or convolution operations. In other embodiments, the user may decide the type of data stored in the memory 110 according to actual needs. In one embodiment, the memory 110 stores program code, software modules, configurations, data, or files, which are detailed in subsequent embodiments.
The processor 150 is coupled to the memory 110. The processor 150 may be a circuit composed of one or more multiplexers, adders, multipliers, encoders, decoders, or logic gates of various types, and may be a central processing unit (CPU), a graphics processing unit (GPU), another programmable general-purpose or special-purpose microprocessor, a digital signal processor (DSP), a programmable controller, a field-programmable gate array (FPGA), an application-specific integrated circuit (ASIC), a neural network accelerator, another similar component, or a combination of the above components. In one embodiment, the processor 150 is configured to perform all or part of the operations of the data processing circuit 100 and can load and execute the software modules, program code, files, and data stored in the memory 110. In some embodiments, the operations of the processor 150 may be implemented in software.
In one embodiment, the processor 150 includes one or more processing elements (PEs) 151. These processing elements 151 are configured to perform operations specified by the same or different instructions, for example, convolution operations, matrix operations, or other operations.
Hereinafter, the method according to embodiments of the invention is described in conjunction with the components and circuits of the data processing circuit 100. Each flow of the method may be adjusted according to the implementation situation and is not limited thereto.
FIG. 2 is a flowchart of a data processing method (storage configuration) according to an embodiment of the invention. Referring to FIG. 2, the processor 150 stores a first part of the input data in a first memory among the several memories 110 according to the storage space size of a single address of that first memory (hereinafter, this address of the memory 110 is called the first address). Specifically, the size of the input data to be processed is not necessarily the same each time. For example, FIG. 3 is a schematic diagram of input data D1 according to an embodiment of the invention. Referring to FIG. 3, the size of the input data D1 is width x by height y by number of channels z; that is, the input data D1 includes x*y*z elements. Taking a coordinate system as an example, the coordinates of the elements of the input data D1 whose channel index z is zero can be denoted as:
Table (1)
It should be noted that the values of the width x and the height y shown in Table (1) are only examples, and the number of channels z may be 8, 16, 32, or another value. In addition, the input data may be sensed values, images, detection data, feature maps, convolution kernels, or weights used for subsequent convolution operations or other operations, and its content may be changed according to the actual needs of the user.
Notably, the locations at which data are stored in the memories 110 may affect the efficiency of subsequent data accesses and the space utilization. In an embodiment of the invention, the size of the first part of the data is not greater than the storage space size of the first address. In other words, the processor 150 splits the input data into multiple parts according to the storage space size provided by a single address and stores a part of the input data in the memory 110. A part of the data here represents some or all of the input data.
In one embodiment, the processor 150 compares the number of channels of the input data with the storage space size of the first address. Each memory 110 includes one or more memory addresses (for example, the aforementioned first address), and each memory address provides a certain storage space size for data storage. For example, FIG. 4 is a schematic diagram of the storage space of multiple memories according to an embodiment of the invention. Referring to FIG. 4, assume that the data processing circuit 100 includes memories M1~M8, and the width W (i.e., the storage space) of a single address of each of the memories M1~M8 is 32 bytes.
FIG. 5A is a schematic diagram of a storage configuration of multiple memories according to an embodiment of the invention. Referring to FIG. 4 and FIG. 5A, assuming that the size of the input data is 7×7×8, the processor 150 compares the number of channels (i.e., 8) with the width of the first address (i.e., 32) and finds that the width is four times the number of channels.
FIG. 5B is a schematic diagram of a storage configuration of multiple memories according to an embodiment of the invention. Referring to FIG. 4 and FIG. 5B, assuming that the size of the input data is 7×7×16, the processor 150 compares the number of channels (i.e., 16) with the width of the first address (i.e., 32) and finds that the width is twice the number of channels.
FIG. 5C is a schematic diagram of a storage configuration of multiple memories according to an embodiment of the invention. Referring to FIG. 4 and FIG. 5C, assuming that the size of the input data is 7×7×64, the processor 150 compares the number of channels (i.e., 64) with the width of the first address (i.e., 32) and finds that the number of channels is twice the width.
The processor 150 may determine, according to the result of comparing the number of channels with the storage space size of the first address, the number of elements of the input data included in the first part of the data. In one embodiment, if the processor 150 determines that the number of channels is not greater than the storage space size of the first address, it further ensures that the product of the number of channels and the number of elements is not greater than the storage space size of the first address.
Taking FIG. 5A as an example, the width of a single address is four times the number of channels, so the number of elements may be 4, 3, 2, or 1. Taking 4 elements as an example, address n (n being a positive integer) of the memory M1 stores channels 1~8 of the elements of the input data at coordinates (x0,y0) (using the coordinate system of Table (1) as an example), (x1,y0), (x2,y0), and (x3,y0). Taking FIG. 5B as an example, the width is twice the number of channels, so the number of elements may be 2 or 1. Taking 2 elements as an example, address n stores channels 1~16 of the elements of the input data at coordinates (x0,y0) and (x1,y0). It can be seen that the first address stores multiple channels of elements at the same coordinates of the input data, and in the embodiment of the invention all channels of a single element are allocated first.
In another embodiment, if the processor 150 determines that the number of channels is greater than the storage space size of the first address, it further sets the number of elements included in the first part of the data to one. Since the storage space of a single address is insufficient to store all channels of a single element, the processor 150 may split the channels.
Taking FIG. 5C as an example, the number of channels is twice the width of a single address. Therefore, the number of elements is 1, and the processor 150 splits the 64 channels into channels 1~32 and channels 33~64. Address n stores channels 1~32 of the element of the input data at coordinates (x0,y0).
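The allocation rule above can be sketched as follows. This is an illustrative software model, not the patented circuit; it assumes a 32-byte address width and one byte per channel element, as in FIGS. 5A~5C.

```python
def elements_per_address(num_channels: int, addr_width: int) -> int:
    """How many spatial elements fit in one memory address.

    If all channels of one element fit in the address, pack as many
    whole elements as possible; otherwise a single element's channels
    must be split across multiple addresses, so one address holds
    (part of) exactly one element.
    """
    if num_channels <= addr_width:
        return addr_width // num_channels  # whole elements per address
    return 1  # channels are split across addresses

# Examples mirroring FIGS. 5A~5C (address width = 32 bytes):
print(elements_per_address(8, 32))   # 4 elements per address (FIG. 5A)
print(elements_per_address(16, 32))  # 2 elements per address (FIG. 5B)
print(elements_per_address(64, 32))  # 1 element, split over addresses (FIG. 5C)
```
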
Referring to FIG. 2, the processor 150 stores a second part of the input data in a second memory among those memories 110 according to the storage size of a single address of that second memory (hereinafter, this address of the memory 110 is called the second address) (step S230). Specifically, the size of the second part of the data is not greater than the storage space size of the second address. Notably, the coordinates, in the two-dimensional coordinates of the input data of any channel, of the first part of the data stored at the first address differ from the coordinates of the second part of the data stored at the second address. That is, the processor 150 continues to process other data of the input data that have not yet been stored. Similarly, in one embodiment, the processor 150 compares the number of channels of the input data with the storage space size of the second address and determines, according to the comparison result, the number of elements of the input data included in the second part of the data.
In one embodiment, if the processor 150 determines that the number of channels is not greater than the storage space size of the second address, it further ensures that the product of the number of channels and the number of elements is not greater than the storage space size of the second address. Taking FIG. 5A and 4 elements as an example, address n of the memory M2 stores channels 1~8 of the elements of the input data at coordinates (x4,y0), (x5,y0), (x6,y0), and (x7,y0) (since coordinates (x0,y0), (x1,y0), (x2,y0), and (x3,y0) are already stored in the memory M1, the allocation proceeds in order). Taking FIG. 5B and 2 elements as an example, address n of the memory M2 stores channels 1~16 of the elements of the input data at coordinates (x2,y0) and (x3,y0).
In another embodiment, if the processor 150 determines that the number of channels is greater than the storage space size of the second address, it further sets the number of elements included in the second part of the data to one. Taking FIG. 5C as an example with 1 element, address n of the memory M2 stores channels 1~32 of the element of the input data at coordinates (x1,y0). By analogy, the processor 150 may allocate other parts of the data to the other memories M3~M8.
In one embodiment, the processor 150 may store a third part of the input data at a third address (different from the first address) of the first memory according to the storage space size of that third address. The size of the third part of the data is not greater than the storage space size of the third address. In addition, the coordinates, in the two-dimensional coordinates of the input data of any channel, of the third part of the data stored at the third address may be the same as or different from the coordinates of the first part of the data stored at the first address.
Taking FIG. 5C as an example, address n of the memory M1 stores the element at coordinates (x0,y0), address n+1 of the memory M1 stores the element at coordinates (x1,y1), and address n+7 of the memory M1 stores the element at coordinates (x0,y0). In some embodiments, the channels included in the third part of the data may differ from the channels included in the first part of the data. Taking FIG. 5C as an example, address n of the memory M1 stores channels 1~32 of the element at coordinates (x1,y1), and address n+7 stores channels 33~64 of the element at coordinates (x1,y1).
In this way, embodiments of the invention can make full use of the storage space in the memories 110.
FIG. 6 is a flowchart of a data processing method (padding extension) according to an embodiment of the invention. Referring to FIG. 6, the processor 150 extends the input data according to a padding mode to generate extended input data (step S610). Specifically, in some application scenarios (for example, when the data undergo convolution operations, or when boundary information must be preserved), the size of the input data needs to be extended, and this requirement can be met by padding data. The padding mode may be a reflect-mirror mode or a symmetric-mirror mode.
For example, the input data are shown in Table (2):
Table (2)
If padded in the reflect-mirror mode, the result is:
Table (3)
If padded in the symmetric-mirror mode, the result is:
Table (4)
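As an illustration of the two padding modes (outside the patent itself), NumPy's `np.pad` offers both: `mode="reflect"` excludes the edge value from the mirror, while `mode="symmetric"` repeats it, matching the reflect-mirror and symmetric-mirror modes described above.

```python
import numpy as np

row = np.array([1, 2, 3, 4])

# Reflect mirror: the edge element itself is not repeated.
print(np.pad(row, 2, mode="reflect"))    # [3 2 1 2 3 4 3 2]

# Symmetric mirror: the edge element is repeated.
print(np.pad(row, 2, mode="symmetric"))  # [2 1 1 2 3 4 4 3]
```
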
The processor 150 provides coordinates in a two-dimensional coordinate system for the multiple elements of the extended input data (step S630). Specifically, in terms of the width and height of the input data in a single channel, these elements form a matrix. If each element of this matrix is given a coordinate, a two-dimensional coordinate system can be used. The horizontal axis of the two-dimensional coordinate system corresponds to the width of the input data, and the vertical axis corresponds to the height of the input data. In addition, any integer value on an axis corresponds to one or more elements of the input data.
In one embodiment, the processor 150 may set the coordinates of the unextended input data to range from 0 to w in the first dimension (i.e., the horizontal axis) and from 0 to h in the second dimension (i.e., the vertical axis), where w is the width of the unextended input data and h is its height. In addition, the processor 150 may set the coordinates of elements of the extended input data that do not belong to the unextended input data to be less than zero or greater than w in the first dimension and less than zero or greater than h in the second dimension.
For example, FIG. 7A is a schematic diagram of input data according to an embodiment of the invention. Referring to FIG. 7A, in the coordinates (x,y) of input data with a width of 3 and a height of 6, x ranges from 0 to 3 and y from 0 to 6. FIG. 7B is a schematic diagram of padded input data (i.e., extended input data) according to an embodiment of the invention. Referring to FIG. 7B, assuming that the processor 150 pads two elements outward on each of the top, bottom, left, and right sides of the input data, in the coordinates (x,y) of the extended input data, x ranges from -2 to 5 and y from -2 to 8. It can be seen that a padded element has an x or y coordinate less than zero, an x coordinate greater than w, or a y coordinate greater than h. Notably, negative values require signed numbers to represent, but signed numbers are unfavorable for storage or addressing.
Referring to FIG. 6, the processor 150 reads the elements of the extended input data according to position information (step S650). Specifically, the position information includes the size of the unextended input data and the coordinates of the elements of the extended input data. For example, the position information is (w,h,c,x,y), where w is the width of the input data, h is its height, c is its channel, x is the coordinate of an element on the horizontal axis of the two-dimensional coordinate system, and y is the coordinate of that element on the vertical axis. The input data are stored in the memory 110. To read a specific element of the input data, the processor 150 can access that element according to this position information.
Unlike coordinates represented by signed numbers, if the coordinates of an element in the position information lie outside the unextended input data in this two-dimensional coordinate system, the processor 150 converts the coordinates in the position information according to the padding mode. Notably, the coordinates in the position information all map to coordinates of the unextended input data; that is, the coordinates representing element positions in the position information can all correspond to non-negative values.
Taking Tables (3) and (4) as examples, the value of each padded element is the same as the value of some element of the unextended input data. Therefore, the coordinates of a padded element can be replaced by the coordinates of the element with the same value in the unextended input data.
In one embodiment, assuming that the width of the unextended input data is w and its height is h, the processor 150 may determine whether the coordinate of an element corresponding to the position information is less than zero or greater than w in the first dimension, and/or whether the coordinate of that element is less than zero or greater than h in the second dimension. If the coordinate is less than zero or greater than w in the first dimension, or less than zero or greater than h in the second dimension, the processor 150 determines that this element belongs to the extended input data. Conversely, if the coordinate is neither less than zero nor greater than w in the first dimension and neither less than zero nor greater than h in the second dimension, the processor 150 determines that this element belongs to the unextended input data.
For coordinate conversion, in one embodiment, the padding mode is the reflect-mirror mode. If the processor 150 determines that the coordinate of an element corresponding to the position information is less than zero in the first dimension, it further converts the first coordinate of this element in the first dimension into the absolute value of the first coordinate. Expressed mathematically:
If x < 0, then ABS(x) … (1)
where ABS() denotes the absolute value.
If the processor 150 determines that the coordinate of this element corresponding to the position information is greater than w in the first dimension, it further converts the first coordinate of this element into the difference between twice w and the first coordinate (that is, w minus the absolute value of the difference between w and the first coordinate). Expressed mathematically:
If x > w, then (w - ABS(w-x)) … (2)
If the processor 150 determines that the coordinate of this element corresponding to the position information is less than zero in the second dimension, it further converts the second coordinate of this element in the second dimension into the absolute value of the second coordinate. Expressed mathematically:
If y < 0, then ABS(y) … (3)
If the processor 150 determines that the coordinate of this element corresponding to the position information is greater than h in the second dimension, it further converts the second coordinate of this element into the difference between twice h and the second coordinate (that is, h minus the absolute value of the difference between h and the second coordinate). Expressed mathematically:
If y > h, then (h - ABS(h-y)) … (4)
In another embodiment, the padding mode is the symmetric-mirror mode. If the processor 150 determines that the coordinate of an element corresponding to the position information is less than zero in the first dimension, it further converts the first coordinate of this element in the first dimension into the absolute value of the first coordinate plus one. Expressed mathematically:
If x < 0, then ABS(x+1) … (5)
If the processor 150 determines that the coordinate of this element corresponding to the position information is greater than w in the first dimension, it further converts the first coordinate of this element into w minus the absolute value of (the first coordinate minus w minus one). Expressed mathematically:
If x > w, then (w - ABS(x-w-1)) … (6)
If the processor 150 determines that the coordinate of this element corresponding to the position information is less than zero in the second dimension, it further converts the second coordinate of this element in the second dimension into the absolute value of the second coordinate plus one. Expressed mathematically:
If y < 0, then ABS(y+1) … (7)
If the processor 150 determines that the coordinate of this element corresponding to the position information is greater than h in the second dimension, it further converts the second coordinate of this element into h minus the absolute value of (the second coordinate minus h minus one). Expressed mathematically:
If y > h, then (h - ABS(y-h-1)) … (8)
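Equations (1)~(8) can be sketched as a single coordinate-folding function. This is an illustrative reading of the formulas, treating w and h as the maximum valid coordinates of the unextended input data, as in the ranges given above.

```python
def fold(coord: int, upper: int, symmetric: bool) -> int:
    """Map an out-of-range padded coordinate back into [0, upper].

    Implements equations (1)-(8): reflect-mirror mode when symmetric
    is False, symmetric-mirror mode when True. The same function
    serves both dimensions (use upper = w for x, upper = h for y).
    """
    if coord < 0:
        # Equations (1)/(3) and (5)/(7)
        return abs(coord + 1) if symmetric else abs(coord)
    if coord > upper:
        # Equations (2)/(4) and (6)/(8)
        if symmetric:
            return upper - abs(coord - upper - 1)
        return upper - abs(upper - coord)
    return coord  # already inside the unextended input data

# Reflect mirror, FIG. 7B style: x in [-2, 5] folds into [0, 3].
print([fold(x, 3, symmetric=False) for x in range(-2, 6)])  # [2, 1, 0, 1, 2, 3, 2, 1]
# Symmetric mirror over the same range.
print([fold(x, 3, symmetric=True) for x in range(-2, 6)])   # [1, 0, 0, 1, 2, 3, 3, 2]
```
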
It can be seen that the processor 150 can determine, according to the padding mode, that the value of the element indicated by the position information is one of the values in the unextended input data. In this way, the elements of the extended input data can be accessed given only the size of the unextended input data and the type of the padding mode.
In one embodiment, in order to access the data stored in the memories 110 efficiently, an embodiment of the invention further proposes a shared memory architecture. FIG. 8 is a schematic diagram of a shared memory according to an embodiment of the invention. Referring to FIG. 8, the processor 150 may combine one or more memories 110 into a memory bank (for example, memory banks Bk0~Bkm-1, where m is a positive integer). Each of the memory banks Bk0~Bkm-1 is provided with an arbiter Arb.
In one embodiment, the arbiter Arb determines the storage location indicated by a command CMD. Taking FIG. 8 as an example, assume that the 8 commands CMD shown in the figure are respectively used to read one or more elements of data (for example, the aforementioned input data or the convolution kernels/weights) (for example, the data to be read rch0~rch3) and to write one or more elements of data (for example, the data to be written wch0~wch3). In one embodiment, a command CMD may include position information indicating the coordinates of an element, for example, coordinates in the two-dimensional coordinate system shown in Table (1) or in a three-dimensional coordinate system that incorporates the channel. In one embodiment, a command CMD may further include the size of the input data, for example, the width, height, and/or channel of the input data. In one embodiment, a command CMD may further include the padding mode.
In one embodiment, each arbiter Arb determines, according to the position information of a command CMD, whether the indicated element is in its own memory bank Bk0~Bkm-1. If the indicated element is in its own memory bank Bk0~Bkm-1, this arbiter Arb issues a read or write command to its own memory bank Bk0, Bk1, …, or Bkm-1 to read or write this element. If the indicated element is not in its own memory bank Bk0~Bkm-1, this arbiter Arb ignores this command CMD or disables/does not issue the read/write command for this element.
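One way to read the arbiter logic above: each bank's arbiter filters the incoming commands down to those whose element lies in its own bank. The sketch below is a software model under an assumed simple interleaved mapping of flattened element positions to banks; the actual bank mapping is not specified in the text.

```python
def owning_bank(x: int, y: int, width: int, num_banks: int) -> int:
    """Hypothetical interleaved mapping of a 2-D element to a bank."""
    return (y * width + x) % num_banks

def arbitrate(bank_id, commands, width, num_banks):
    """An arbiter keeps only commands whose element is in its bank;
    commands addressed to other banks are ignored (no access issued)."""
    return [cmd for cmd in commands
            if owning_bank(cmd[0], cmd[1], width, num_banks) == bank_id]

cmds = [(0, 0), (1, 0), (2, 0), (3, 0)]  # (x, y) positions to access
print(arbitrate(0, cmds, width=4, num_banks=2))  # [(0, 0), (2, 0)]
print(arbitrate(1, cmds, width=4, num_banks=2))  # [(1, 0), (3, 0)]
```
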
Taking FIG. 8 as an example, once an arbiter Arb confirms a command CMD for reading one or more elements rch0~rch3 of the input data, the data DATA of these elements rch0~rch3 (for example, the read data rch0_rdata~rch3_rdata) can be read out.
In one embodiment, each arbiter Arb orders the commands CMD according to their position information. Two or more commands CMD received by an arbiter Arb may access the same element, and the arbiter Arb can order these commands CMD.
In one embodiment, the commands CMD and the data DATA are input or output according to a first-in, first-out (FIFO) mechanism. In a FIFO buffer, the first command CMD or data DATA to enter is the first to be moved out, the second to enter is the second to be moved out, and so on. In this way, the efficiency of data access can be improved.
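The FIFO ordering described above can be illustrated with Python's `collections.deque`; this is a software illustration only, whereas the text describes a hardware buffer.

```python
from collections import deque

fifo = deque()
for cmd in ("cmd0", "cmd1", "cmd2"):
    fifo.append(cmd)        # enqueue in arrival order

print(fifo.popleft())       # cmd0 - first in, first out
print(fifo.popleft())       # cmd1 - second in, second out
```
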
FIG. 9 is a flowchart of a data processing method (operation configuration) according to an embodiment of the invention. Referring to FIG. 9, the processor 150 provides a sum buffer (step S910). Specifically, the processor 150 or the processing elements 151 may be configured with a specific computation capacity, for example, a single-pass capacity of 3×3×32. It should be noted that this capacity may vary with specifications or application requirements, and embodiments of the invention do not limit it. In addition, the sum buffer is used to store the data output by the processor 150 or the processing elements 151 after computation. The size of the sum buffer may be changed according to the needs of the user, and embodiments of the invention do not limit it.
Notably, the amount of data to be computed may exceed the computation capacity. For example, FIG. 10 is a schematic diagram of a convolution operation according to an embodiment of the invention. Referring to FIG. 10, the size of the input data Pixel is 3×3×128, the size of the convolution kernel WT is 3×3×128, and there are 128 convolution kernels K1~K128 in total. The numbers 1~9 shown in the figure represent the 1st to 9th elements of one channel of the input data Pixel, or the 1st to 9th elements of one channel of the convolution kernel WT. In addition, ch1~32 (i.e., ch1~ch32) shown in the figure represent the 1st to 32nd channels, ch33~64 (i.e., ch33~ch64) represent the 33rd to 64th channels, and so on. Assuming a 3×3×32 convolution operation (for example, the output buffer OT only provides a 3×3×32 output), the convolution of all the 3×3×128 input data Pixel with the 128 kernels K1~K128 cannot be completed in a single pass. Therefore, large-volume data computation can be realized by batched operations.
The processor 150 reads a first kernel group from among the multiple convolution kernels according to the size of the sum buffer (step S930). Specifically, the number of kernels in this first kernel group equals the size of the sum buffer. Taking FIG. 10 as an example, if the convolution operation is 3×3×32 and the size of the sum buffer is 64, the first kernel group may include channels ch1~ch32 of the kernels K1~K64.
The processor 150 temporarily stores the first convolution result of the input data and the first kernel group in the sum buffer in first-in, first-out (FIFO) order (step S950). Specifically, the processor 150 may perform the 3×3 convolution of the i-th channel (i being a positive integer) and store the result in the sum buffer, then perform the 3×3 convolution of the (i+1)-th channel and store the result in the sum buffer, and so on.
For example, FIG. 11 is a schematic diagram of a convolution operation according to an embodiment of the invention. Referring to FIG. 11, the first kernel group consists of channels ch1~ch32 of the kernels K1~K64. The processor 150 performs 3×3 convolutions of channel 1 of the input data Pixel with each of the kernels K1~K64 and outputs the results to the sum buffer SB. Next, the processor 150 performs 3×3 convolutions of channel 2 of the input data Pixel with each of the kernels K1~K64 and outputs the results to the sum buffer SB. The operations of the remaining channels follow by analogy and are not repeated here.
In one embodiment, the input data include a fourth part of data and a fifth part of data, and the fourth and fifth parts of data belong to different channels. The first kernel group includes a first part of kernels and a second part of kernels, and the first and second parts of kernels belong to different channels. In addition, the first convolution result is based only on the fourth part of the data and the first part of the kernels.
Taking FIG. 11 as an example, the fourth part of the data consists of channels ch1~ch32 of the input data Pixel, and the fifth part consists of channels ch33~ch64 of the input data Pixel. The first part of the kernels consists of channels ch1~ch32 of the kernels K1~K64, and the second part consists of channels ch33~ch64 of the kernels K1~K64. The first convolution result is the result of operating channels ch1~ch32 of the input data Pixel with channels ch1~ch32 of the kernels K1~K64.
Next, the processor 150 reads the second part of the kernels of the first kernel group according to the size of the sum buffer. Taking FIG. 11 as an example, the processor 150 reads channels ch33~ch64 of the kernels K1~K64 from the memory 110.
In addition, the processor 150 reads the first convolution result from the sum buffer. Taking FIG. 11 as an example, the processor 150 reads from the sum buffer SB the result of operating channels ch1~ch32 of the input data Pixel with channels ch1~ch32 of the kernels K1~K64.
The processor 150 temporarily stores, in the sum buffer in first-in, first-out order, the sum of the second convolution result of the fifth part of the data and the second part of the kernels and the first convolution result from the sum buffer. Taking FIG. 11 as an example, the processor 150 adds the result of operating channels ch1~ch32 of the input data Pixel with channels ch1~ch32 of the kernels K1~K64 to the result of operating channels ch33~ch64 of the input data Pixel with channels ch33~ch64 of the kernels K1~K64, and stores the sum in the sum buffer SB in channel order and in first-in, first-out fashion.
Next, the processor 150 performs the convolution of channels ch65~ch96 of the input data Pixel with channels ch65~ch96 of the kernels K1~K64 and stores the result in the sum buffer, and so on, until all channels ch1~ch128 of the input data Pixel have been computed.
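The channel-group accumulation walked through above can be sketched as follows. This is a simplified software model using NumPy with one-dimensional "kernels" (one spatial position, all channels), not the patented circuit: each pass computes a partial convolution over one channel group and adds it to the sum buffer holding the partial results of the previous groups.

```python
import numpy as np

rng = np.random.default_rng(0)
channels, group = 128, 32            # total channels, channels per pass
x = rng.random(channels)             # input data, all channels
k = rng.random(channels)             # one kernel, all channels

sum_buffer = 0.0
for c0 in range(0, channels, group):
    # Partial convolution over one channel group, accumulated into
    # the sum buffer together with the earlier groups' result.
    sum_buffer += np.dot(x[c0:c0 + group], k[c0:c0 + group])

# After all groups are processed, the accumulated value equals the
# full convolution over all 128 channels.
assert np.isclose(sum_buffer, np.dot(x, k))
```
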
On the other hand, the processor 150 reads a second kernel group from among those kernels according to the size of the sum buffer. Since the size of the sum buffer is smaller than the total number of kernels, multiple kernel groups must be computed in batches. Similarly, the number of kernels in this second kernel group equals the size of the sum buffer, and the kernels in the second kernel group differ from those in the first kernel group.
For example, FIG. 12 is a schematic diagram of a convolution operation according to an embodiment of the invention. Referring to FIG. 11 and FIG. 12, the difference from the kernels K1~K64 of FIG. 11 is that the second kernel group includes the kernels K65~K128.
The processor 150 temporarily stores the third convolution result of the input data and the second kernel group in the sum buffer in first-in, first-out order. Taking FIG. 12 as an example, the processor 150 first performs the convolution on channels ch1~ch32 of the kernels K65~K128 and stores the result in the sum buffer. Next, the processor 150 performs the convolution on channels ch33~ch64 of the kernels K65~K128. The remaining operations follow by analogy and are not repeated here.
It should be noted that the batched operation of embodiments of the invention can provide a more flexible computation architecture. In one embodiment, parallel operations can be provided. Taking FIG. 11 and FIG. 12 as examples, the embodiments shown in both figures target the same input data Pixel. In this case, the processor 150 may provide one or more additional sum buffers. Similarly, the processor 150 may read the first kernel group according to the size of another sum buffer and temporarily store the fourth convolution result of the input data and the first kernel group in the other sum buffer in first-in, first-out order. For the same input data, the processor 150 may copy the input data or output the same input data for use by different convolution operations.
For example, FIG. 13 is a schematic diagram of parallel operations according to an embodiment of the invention. Referring to FIG. 13, multiple copies of the same input data Pixel1~Pixelj (j being a positive integer) can be operated separately and in parallel with the same kernels K1~K128, where the input data Pixel1 is operated with channels ch1~ch32 of the kernels K1~K64, the input data Pixelj is operated with channels ch1~ch32 of the kernels K1~K64, and so on.
In one embodiment, the processor 150 provides two or more processing elements 151 and may provide the read first kernel group to these processing elements 151. That is, one convolution result is determined by one processing element 151, and another convolution result is determined by another processing element 151. Taking FIG. 13 as an example with j being 2, one processing element 151 performs the convolution of the input data Pixel1 with channels ch1~ch32 of the kernels K1~K64, while (at the same time) another processing element 151 performs the convolution of the input data Pixelj with channels ch1~ch32 of the kernels K1~K64.
In this way, multiple input data can be operated in parallel with the same kernels, there is time (part of the FIFO depth) to load the input data, each input data can be assigned to one processing element 151, and the architecture can be conveniently extended to more processing elements 151 as needed.
Notably, the invention can also provide different operation allocation mechanisms according to the size of the kernels, where FIG. 9 shows an embodiment of batched operation. In one embodiment, the processor 150 may determine whether the size of one or more kernels is smaller than the computation capacity of the convolution operation. Taking FIG. 11 as an example, the convolution operation has a capacity of 3×3×32, and the size of each of the kernels K1~K128 is 3×3×128. Therefore, the size of each of the kernels K1~K128 is not smaller than the computation capacity of the convolution operation.
For another example, FIG. 14 is a schematic diagram of data duplication according to an embodiment of the invention. Referring to FIG. 14, the convolution operation still has a capacity of 3×3×32, the size of the input data Pixel is 3×3×8, and the size of each of the kernels K1~K64 is 3×3×8. Therefore, the size of each of the kernels K1~K64 is smaller than the computation capacity of the convolution operation. For yet another example, FIG. 15 is a schematic diagram of data duplication according to an embodiment of the invention. Referring to FIG. 15, the convolution operation still has a capacity of 3×3×32, the size of the input data Pixel is 3×3×16, and the size of each of the kernels K1~K64 is 3×3×16. Therefore, the size of each of the kernels K1~K64 is smaller than the computation capacity of the convolution operation.
If the size of the kernels is not smaller than the computation capacity of the convolution operation, the processor 150 may compute in batches according to the foregoing embodiments (FIG. 9 to FIG. 13). If the processor 150 determines that the size of the kernels is smaller than the computation capacity of the convolution operation, it may repeatedly provide the input data for the convolution with those kernels, where the number of copies of the input data equals a multiple. This multiple is the quotient obtained by using the computation capacity as the dividend and the size of each kernel as the divisor.
Taking FIG. 14 as an example, the computation capacity is 4 times the size of each of the kernels K1~K64; that is, the multiple is 4. In this case, the processor 150 may simultaneously operate four copies of the same input data Pixel with the kernels K1~K4 respectively and output the results, or operate four copies of the same input data Pixel with the kernels K61~K64 respectively and output the results, and so on.
Taking FIG. 15 as an example, the computation capacity is 2 times the size of each of the kernels K1~K64; that is, the multiple is 2. In this case, the processor 150 may simultaneously operate two copies of the same input data Pixel with the kernels K1~K2 respectively and output the results, or operate two copies of the same input data Pixel with the kernels K63~K64 respectively and output the results, and so on.
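The replication factor described above is simply the quotient of the computation capacity divided by the kernel size; a small sketch under that reading (an illustrative model, not the circuit itself):

```python
def replication_factor(capacity: int, kernel_size: int) -> int:
    """How many copies of the input data can share one pass when each
    kernel is smaller than the per-pass computation capacity."""
    if kernel_size >= capacity:
        return 1  # no room to replicate: fall back to batched operation
    return capacity // kernel_size  # capacity as dividend, kernel size as divisor

print(replication_factor(3 * 3 * 32, 3 * 3 * 8))   # 4  (FIG. 14)
print(replication_factor(3 * 3 * 32, 3 * 3 * 16))  # 2  (FIG. 15)
```
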
图16是根据本发明一实施例的整体数据处理的流程图。请参照图16,在一实施例中,处理器150可读取讯框(frame)设定(步骤S1610)。例如,设定为(w,h,c,p),其中w为输入数据的宽,h为输入数据的高,c为输入数据的通道,p为填充模式。根据填充模式,处理器150可使用有符号讯框(步骤S1620)。例如,处理器150确定设定有特定填充模式。处理器150可形成未扩展的输入数据(步骤S1630),并扩展输入数据(步骤S1640),如图7A扩展成图7B的资料。处理器150可使用位置信息读取存储在存储器110或图8的存储器库Bk0~Bkm-1中的部分数据(步骤S1650),并可推送所读取的数据至特定处理组件151进行乘加运算或卷积运算(步骤S1660)。须说明的是,步骤S1610~S1660的详细运作可分别参酌图2~图15的说明,于此不再赘述。FIG. 16 is a flowchart of the overall data processing according to an embodiment of the present invention. Referring to FIG. 16, in one embodiment, the processor 150 may read the frame settings (step S1610). For example, the settings are (w, h, c, p), where w is the width of the input data, h is the height of the input data, c is the channel of the input data, and p is the padding mode. According to the padding mode, the processor 150 may use a signed frame (step S1620); for example, the processor 150 determines that a specific padding mode is set. The processor 150 may form unexpanded input data (step S1630) and expand the input data (step S1640), for example expanding the data of FIG. 7A into that of FIG. 7B. The processor 150 may use the position information to read part of the data stored in the memory 110 or in the memory banks Bk0~Bkm-1 of FIG. 8 (step S1650), and may push the read data to a specific processing element 151 for multiply-accumulate or convolution operations (step S1660). It should be noted that the detailed operations of steps S1610 to S1660 may refer to the descriptions of FIG. 2 to FIG. 15 respectively, and are not repeated here.
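上述步骤S1610~S1660的整体流程可用下列单通道的简化程序片段示意。A simplified, single-channel sketch of steps S1610 to S1660 in plain Python; the helper names, the one-pixel border, and the signed padding value of -1 are illustrative assumptions, not details taken from the patent:

```python
def pad(data, w, pad_val):
    """Step S1640: expand a w-wide frame with a one-pixel border."""
    out = [[pad_val] * (w + 2)]
    for row in data:
        out.append([pad_val] + row + [pad_val])
    out.append([pad_val] * (w + 2))
    return out

def process_frame(frame_cfg, data, kernel):
    w, h, c, p = frame_cfg                    # S1610: read frame settings
    pad_val = -1 if p == "signed" else 0      # S1620: signed-frame mode
    expanded = pad(data, w, pad_val)          # S1630/S1640: form and expand
    results = []
    for y in range(h):                        # S1650: read data by position
        for x in range(w):
            acc = 0
            for ky in range(3):               # S1660: 3x3 multiply-accumulate
                for kx in range(3):
                    acc += expanded[y + ky][x + kx] * kernel[ky][kx]
            results.append(acc)
    return results

identity = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
frame = [[1, 2], [3, 4]]
# The identity kernel reproduces the input regardless of the padding mode.
assert process_frame((2, 2, 1, "zero"), frame, identity) == [1, 2, 3, 4]
assert process_frame((2, 2, 1, "signed"), frame, identity) == [1, 2, 3, 4]
```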
综上所述,在本发明实施例的基于卷积运算的数据处理方法及电路中,提供分享存储器架构,提供分批或重复数据的卷积运算,提供数据存储到多存储器的分配机制,并提供有符号的填充机制。藉此,可提供弹性且有效率的卷积运算机制及架构。To sum up, in the data processing method and circuit based on convolution operations according to the embodiments of the present invention, a shared memory architecture is provided, convolution operations on batched or repeated data are provided, an allocation mechanism for storing data into multiple memories is provided, and a signed padding mechanism is provided. In this way, a flexible and efficient convolution operation mechanism and architecture can be provided.
最后应说明的是:以上各实施例仅用以说明本发明的技术方案,而非对其限制;尽管参照前述各实施例对本发明进行了详细的说明,本领域的普通技术人员应当理解:其依然可以对前述各实施例所记载的技术方案进行修改,或者对其中部分或者全部技术特征进行等同替换;而这些修改或者替换,并不使相应技术方案的本质脱离本发明各实施例技术方案的范围。Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some or all of their technical features may be equivalently replaced; and these modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present invention.
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US202163190252P | 2021-05-19 | 2021-05-19 | |
| US63/190,252 | 2021-05-19 | ||
| US202163224845P | 2021-07-22 | 2021-07-22 | |
| US63/224,845 | 2021-07-22 | ||
| TW111107980A (TW202247049A) | 2021-05-19 | 2022-03-04 | Data processing method and circuit based on convolution computation |
| TW111107982A (TW202247046A) | 2021-05-19 | 2022-03-04 | Data processing method and circuit based on convolution computation |
| TW111107981 | 2022-03-04 | ||
| TW111107981A (TWI799169B) | 2021-05-19 | 2022-03-04 | Data processing method and circuit based on convolution computation |
| TW111107982 | 2022-03-04 | ||
| TW111107980 | 2022-03-04 |
| Publication Number | Publication Date |
|---|---|
| CN114692073A | 2022-07-01 |
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202210375356.8A (pending) | Data processing method and circuit based on convolution operation | 2021-05-19 | 2022-04-11 |
| Country | Link |
|---|---|
| CN (1) | CN114692073A (en) |
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2024150895A1 (en)* | 2023-01-10 | 2024-07-18 | 삼성전자주식회사 | Electronic device and method for controlling same |
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN109416756A (en)* | 2018-01-15 | 2019-03-01 | 深圳鲲云信息科技有限公司 | Convolver and its applied artificial intelligence processing device |
| CN109740732A (en)* | 2018-12-27 | 2019-05-10 | 深圳云天励飞技术有限公司 | Neural network processor, convolutional neural network data multiplexing method and related equipment |
| CN110737473A (en)* | 2019-09-24 | 2020-01-31 | 北京小米移动软件有限公司 | Data processing method and device, terminal and storage medium |
| CN111882029A (en)* | 2020-06-22 | 2020-11-03 | 华控清交信息科技(北京)有限公司 | A data processing method and device |
| US20210097647A1 (en)* | 2018-09-13 | 2021-04-01 | Shanghai Cambrico Information Technology Co., Ltd. | Information processing method and terminal device |
| CN112712457A (en)* | 2021-01-14 | 2021-04-27 | 清华大学 | Data processing method and artificial intelligence processor |
| CN112734024A (en)* | 2020-04-17 | 2021-04-30 | 神亚科技股份有限公司 | Processing apparatus for performing convolutional neural network operations and method of operation thereof |
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | | |
| SE01 | Entry into force of request for substantive examination | | |
| WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20220701 | |