Technical Field
The present application relates to the technical field of artificial intelligence applications, and in particular to a compiler memory allocation method for large-scale operations in convolutional neural networks.
Background
With the continuous development of deep learning, the convolutional neural network (CNN) has become one of the representative technologies of artificial intelligence. CNNs are widely used in fields such as computer vision, natural language processing, and autonomous driving, where they have achieved unprecedented breakthroughs, demonstrating the importance of CNN applications.
With the rise of AI-enabled Internet of Things, developing artificial intelligence applications for embedded devices has become urgent. Deep learning compilers translate models from convolutional neural network frameworks into optimized target code for the AI hardware accelerators of embedded devices. However, embedded devices have limited memory and computing resources, while convolutional neural network structures are highly varied and their data flows differ widely in size. Executing large-scale operations correctly and efficiently within a limited memory space is therefore a difficult problem.
Summary of the Invention
The present application provides a compiler memory allocation method for large-scale operations in convolutional neural networks, which solves the problem of executing large-scale operations correctly and efficiently within a limited memory space, thereby enabling an artificial intelligence hardware accelerator to execute large-scale neural network operations efficiently.
In a first aspect, a compiler memory allocation method for large-scale operations in convolutional neural networks is provided, comprising:
storing the feature map of a convolutional neural network operator, and slicing the feature map when its size exceeds the allocated storage space;
slicing the feature map according to a slice count, then checking again whether the sliced sub-feature maps overflow memory; if they still overflow, re-determining the slice count and slicing again, until the sliced sub-feature maps no longer overflow memory; and
marking the start and end positions of each slice within the whole feature map, sending the slices to the hardware acceleration unit for computation in turn according to the marks, and finally performing the operator's reduction computation to obtain the final output, which includes mapping each slice's output back according to the start and end positions of that slice within the whole feature map.
In combination with the first aspect, in some implementations of the first aspect, memory overflow is determined according to formula (1), where row_fm, col_fm, and ch_fm denote the numbers of rows, columns, and channels of the feature map, ch is the channel alignment number, n is the number of SRAM lines, Byte_data is the number of bytes per data element, and offset is the memory space occupied by the feature map; memory overflow occurs when offset > SRAM_space.
In combination with the first aspect, in some implementations of the first aspect, for operators that slice the feature map from output to input or from either direction, the input feature map is sliced; for operators that slice the feature map from input to output, the output feature map is sliced.
In combination with the first aspect, in some implementations of the first aspect, for a convolution-type operator, according to formula (2):
row_in_chunk = (row_out_chunk − 1) × stride + kernel  (2)
where stride is the convolution stride, kernel is the convolution kernel size, row_in_chunk is the number of rows in a single slice of the input feature map, and row_out_chunk is the number of output-feature-map rows corresponding to a single slice of the input feature map. Substituting row_in_chunk for row_fm in formula (1), it is checked whether the memory space occupied by the input feature map overflows; if so, the output feature map is sliced further and row_out_chunk is updated, until the memory allocation condition is met.
In combination with the first aspect, in some implementations of the first aspect, for an upsampling-type operator, according to formula (3):
row_out_chunk = row_in_chunk × stride  (3)
where stride is the upsampling stride, row_in_chunk is the number of rows in a single slice of the input feature map, and row_out_chunk is the number of output-feature-map rows corresponding to a single slice of the input feature map. Substituting row_out_chunk for row_fm in formula (1), it is checked whether the memory space occupied by the output feature map overflows; if so, the input feature map is sliced further and row_in_chunk is updated, until the memory allocation condition is met.
In combination with the first aspect, in some implementations of the first aspect, for a deconvolution-type operator, according to formula (4):
where stride is the deconvolution stride, pad_row_col is the number of rows/columns of padding inserted between adjacent rows and columns, kernel is the convolution kernel size, and row_out_chunk is the number of output-feature-map rows corresponding to a single slice of the input feature map. Substituting row_out_chunk for row_fm in formula (1), it is checked whether the memory space occupied by the output feature map overflows; if so, the input feature map is sliced further and row_in_chunk is updated, until the memory allocation condition is met.
In combination with the first aspect, in some implementations of the first aspect, the method further comprises:
slicing and computing large-scale convolution kernels, including:
splitting the large convolution kernel into sub-convolution kernels in units of the convolution kernel size supported by the hardware accelerator; and
computing over the computation domain of each sub-convolution kernel and reducing the results, the output of each sub-convolution kernel on its corresponding feature map region being a partial sum of the large-kernel convolution.
In combination with the first aspect, in some implementations of the first aspect, if the kernel size is not evenly divisible, the number of splits is rounded up and the large convolution kernel is zero-padded on its left and bottom sides; the kernel size after zero padding is given by formula (5):
kernel_out = ⌈kernel_big / kernel_unit⌉ × kernel_unit  (5)
where kernel_big is the number of rows/columns of the large convolution kernel, kernel_unit is the number of rows/columns of the supported convolution kernel, and kernel_out is the number of rows/columns of the large convolution kernel after zero padding.
In combination with the first aspect, in some implementations of the first aspect, a sub-convolution kernel is the i-th split kernel in the row direction and the j-th in the column direction of the large convolution kernel, and the start and end positions of its computation domain are given by formula (6), where row_st and col_st are the starting row and column positions, row_ed and col_ed are the ending row and column positions, row_total is the total number of rows of the input feature map, and col_total is the total number of columns of the input feature map.
In combination with the first aspect, in some implementations of the first aspect, large-scale input/output feature map slicing and large-scale convolution kernel splitting participate in the computation simultaneously, including: obtaining the sliced sub-feature maps according to the large-scale input/output feature map slicing method combined with the memory overflow check; obtaining the split sub-convolution kernels according to the large-scale convolution kernel splitting method combined with the sub-feature-map sizes and operator attributes; and, during reduction, first reducing the sub-convolution kernel outputs to obtain the large-kernel output, and then reducing the sub-feature-map outputs to obtain the output of the large-scale input/output feature map.
Compared with the prior art, the solution provided by the present application offers at least the following beneficial technical effects:
1. The large-scale input/output feature map slicing method of the present invention allows the hardware accelerator to handle larger neural network structures. The method proposes a data storage format suitable for general RAM, and the data arrangement along a given dimension can be changed to adapt to special RAM designs. On this basis, the method classifies operators, with each class having a slicing method for a different computation direction, and provides a solution for input or output feature maps of arbitrary size. The method can therefore handle the data memory allocation problem of most neural networks on limited memory while remaining largely independent of the RAM design.
2. The large-scale convolution kernel splitting and computation method of the present invention can simplify hardware accelerator design to some extent, by decomposing a large convolution kernel into reduction operations over classic kernel sizes. Neural network hardware accelerators generally include dedicated acceleration logic for convolution and must support operations for at least one kernel size; in that case, a large convolution kernel can be split and zero-padded into a set of sub-convolution kernels whose unit size is the supported kernel size.
Brief Description of the Drawings
Figure 1 is a flow chart of feature map slicing.
Figure 2 is a schematic diagram of the data format and SRAM design.
Figure 3 is a flow chart of feature map slicing.
Figure 4 shows the reduction computation for feature map slicing.
Figure 5 shows the reduction computation for feature map slicing in the deconvolution operation.
Figure 6 is a schematic diagram of the start and end positions of convolution kernel splitting.
Figure 7 shows the reduction computation for convolution kernel splitting.
Detailed Description
The present application is described in further detail below with reference to the accompanying drawings and specific embodiments.
Figure 1 shows a compiler memory allocation method for large-scale operations in convolutional neural networks provided by an embodiment of the present application, comprising:
(1) a large-scale input/output feature map slicing method for convolutional neural network operators; and
(2) a splitting and computation method for large-scale convolution kernels.
The large-scale input/output feature map slicing method for convolutional neural network operators in the present invention is implemented based on an in-memory data storage format combined with feature map slicing logic. Its specific content is as follows.
Figure 2 shows the data format and memory design. In the dimensions of a three-dimensional feature map, the data format is assumed to store data in 1×1×ch units, so data storage is ch-channel aligned. The SRAM is logically organized as n lines. For an h×w×c feature map, the c channels are first divided into groups according to the ch-channel alignment rule. Within each group, following the channel-column-row (HWC) dimension order, each 1×w×c unit forms an entry stored in one SRAM line. To keep data reads contiguous in memory, slicing proceeds in row-column-channel dimension order, i.e., the row direction of the feature map is sliced first; if memory still overflows with a single row of data, the column direction is then sliced. After the n lines are filled with n rows of the feature map, row h−n of the feature map is stored in line 0, row h−(n+1) in line 1, and so on, until the entire feature map is stored in the SRAM, or the SRAM overflows and further slicing is required.
When the SRAM space cannot hold a full load of feature map data, i.e., when the input or output feature map of an operator is larger than the allocated memory address space, the feature map must be sliced. Figure 3 shows the feature map slicing flow chart: the compiler first traverses the neural network and checks, for each operator's input/output feature map, whether memory overflows.
Operators fall into three classes according to how they are sliced.
1) Operators sliced from output to input, such as convolution and pooling, characterized by multiple input rows corresponding to a single output row during computation.
2) Operators sliced from input to output, such as upsampling, characterized by a single input row corresponding to multiple output rows during computation.
3) Operators sliced from either direction, such as batch normalization and nonlinear activations, characterized by a one-to-one mapping between input and output rows.
The input/output storage for computing resources in the present invention uses ping-pong buffering: half of the memory stores input data and half stores output data. For convolution-type operations, the input and output data both refer to data after padding; the following derivations are all based on this scheme.
While traversing the neural network graph, the memory overflow status of each feature map is checked first. For an operator's feature map, according to formula (1), where row_fm, col_fm, and ch_fm denote the numbers of rows, columns, and channels of the feature map, ch is the channel alignment number, n is the number of SRAM lines, Byte_data is the number of bytes per data element, and offset is the memory space occupied by the feature map, memory overflows when offset > SRAM_space. If there is no overflow, the operator proceeds to the subsequent compilation flow. If memory overflows, the memory occupied by the operator's input or output feature map exceeds the SRAM space allocated to it, so the slicing type must be determined from the operator type and the feature map sliced accordingly. For each operator class, the number of slices is computed first; then, along the slicing direction, the slice size is computed from the smaller feature map toward the larger one. It is then checked again whether the input/output slices under that slice count overflow memory; if they still do, the slice count is increased and the computation repeated until the memory occupation condition is met, at which point the slicing parameters are recorded and the current slicing flow ends.
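The overflow check above can be sketched in Python. Formula (1) itself is not reproduced in the text, so the exact expression for `offset` below is an assumption inferred from the stated variables (channels padded up to the alignment unit `ch`, rows wrapped across the `n` SRAM lines):

```python
import math

def feature_map_offset(row_fm, col_fm, ch_fm, ch, n, byte_data):
    # Assumed form of formula (1): channels are padded up to the alignment
    # unit ch, and rows wrap across the n SRAM lines, so each line must hold
    # ceil(row_fm / n) entries of col_fm * (aligned channels) elements.
    aligned_ch = math.ceil(ch_fm / ch) * ch
    return math.ceil(row_fm / n) * col_fm * aligned_ch * byte_data

def overflows(row_fm, col_fm, ch_fm, ch, n, byte_data, sram_space):
    # Memory overflow condition: offset > SRAM_space
    return feature_map_offset(row_fm, col_fm, ch_fm, ch, n, byte_data) > sram_space
```

For example, a 16×8 feature map with 10 channels, an alignment unit of 8, 16 SRAM lines, and 1-byte data occupies 1 × 8 × 16 × 1 = 128 bytes per line under this reading.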
1) For an operator sliced from output to input, given an initial row_out_chunk, the number of input-feature-map rows per slice, row_in_chunk, is computed; for example, for convolution-type operators, according to formula (2):
row_in_chunk = (row_out_chunk − 1) × stride + kernel  (2)
where stride is the convolution stride, kernel is the convolution kernel size, row_in_chunk is the number of rows in a single slice of the input feature map (derived from the number of rows in the corresponding output slice), and row_out_chunk is the number of output-feature-map rows corresponding to a single input slice. Taking Figure 4 as an example, with a stride of 2, a kernel size of 3, and a row_out_chunk of 2, row_in_chunk is 5. To improve per-load storage efficiency, the number of overlapping rows between two adjacent input slices can equal kernel − stride. Substituting row_in_chunk for row_fm in formula (1), it is checked whether the memory occupied by the input feature map overflows; if so, the output feature map is sliced further and row_out_chunk updated, until the memory allocation condition is met.
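The iterative search just described, shrinking row_out_chunk until the corresponding input slice fits, can be sketched as follows. `fits` is a caller-supplied predicate standing in for the formula-(1) memory check, and the worked example from the text (stride 2, kernel 3) is reproduced:

```python
def rows_in_for_conv(row_out_chunk, stride, kernel):
    # formula (2): input rows needed to produce row_out_chunk output rows
    return (row_out_chunk - 1) * stride + kernel

def find_row_out_chunk(row_out_total, stride, kernel, fits):
    # Shrink the output chunk until the corresponding input chunk passes the
    # memory check; `fits` stands in for the formula-(1) test (an assumption).
    row_out = row_out_total
    while row_out > 1 and not fits(rows_in_for_conv(row_out, stride, kernel)):
        row_out -= 1
    return row_out
```

With stride 2 and kernel 3, two output rows require (2 − 1) × 2 + 3 = 5 input rows, matching the Figure 4 example.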
If there are consecutive output-to-input operators with no network branches, their row-direction slice counts can be computed cumulatively: an operator's input slice can be treated as the output slice of its parent operator, and the memory overflow then computed.
2) For an operator sliced from input to output, given an initial row_in_chunk, the corresponding number of output rows, row_out_chunk, is computed; for example, for upsampling-type operators, according to formula (3):
row_out_chunk = row_in_chunk × stride  (3)
where stride is the upsampling stride, row_in_chunk is the number of rows in a single input slice, and row_out_chunk is the number of output-feature-map rows corresponding to that slice, derived from the number of rows in the input slice.
For deconvolution-type operators, according to formula (4),
where stride is the deconvolution stride, pad_row_col is the number of rows/columns of padding inserted between adjacent rows and columns, kernel is the convolution kernel size, and row_out_chunk is the number of output-feature-map rows corresponding to a single input slice, derived from the number of rows in the input slice.
Substituting row_out_chunk for row_fm in formula (1), it is checked whether the memory occupied by the output feature map overflows; if so, the input feature map is sliced further and row_in_chunk updated, until the memory allocation condition is met.
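The input-to-output search mirrors the output-to-input one: the input chunk is shrunk until the output chunk it produces passes the memory check. A sketch using the upsampling relation of formula (3), with `fits` again standing in for the formula-(1) check:

```python
def rows_out_upsample(row_in_chunk, stride):
    # formula (3): output rows produced by row_in_chunk input rows
    return row_in_chunk * stride

def find_row_in_chunk(row_in_total, stride, fits):
    # Shrink the input chunk until the corresponding output chunk passes the
    # memory check; `fits` stands in for the formula-(1) test (an assumption).
    row_in = row_in_total
    while row_in > 1 and not fits(rows_out_upsample(row_in, stride)):
        row_in -= 1
    return row_in
```

For example, with stride 2 and an output budget of 6 rows, a 10-row input is shrunk to 3-row slices, each producing 6 output rows.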
Similarly, if there are consecutive input-to-output operators with no network branches, the computation can be cumulative.
3) For operators sliced from either direction, the input and output feature maps have the same size, so only the input feature map needs to be checked for memory overflow.
After the memory overflow check, if the feature map slice count is greater than 1, the start and end positions of each slice within the whole feature map are marked, and the slices are then sent to the hardware acceleration unit for computation in turn according to the marks. Finally, the operator's reduction computation is performed to obtain the final output. In the present invention, regardless of operator type (output-to-input, input-to-output, or either-direction slicing), whenever the original feature map is sliced, the final output must be obtained by reducing the computation results of the sub-feature maps.
The slice computation of one operator can be merged with that of its successor. For example, in a deconvolution (input-to-output slicing) → activation function (either-direction slicing) → convolution (output-to-input slicing) structure, under certain parameters the output slice size of the deconvolution is exactly the input slice size of the convolution, so the two reductions can be merged into a single reduction of the convolution, achieving fused operator computation. This not only saves a slicing pass but, with hardware support, may also improve the computation efficiency of both operators.
Figure 4 shows an example of the reduction computation for feature map slicing. Slicing is not an equal partition; it must take into account the convolution parameters and the actual memory size, so in some cases slices overlap in order to be equivalent to the true convolution computation. If a large-kernel computation is involved, feature map slicing and kernel splitting can be computed simultaneously.
Figure 5 shows an example of the reduction computation for feature map slicing in the deconvolution operation. Based on the computational characteristics of deconvolution, the slicing of the input feature map is computed from the output feature map size together with the deconvolution parameters and the actual memory size; if a large-kernel computation is involved, feature map slicing and kernel splitting can be computed simultaneously. It should be noted that the input-to-output slicing method proposed in the present invention can generally be combined with data compression optimizations (run-length encoding, compressed sparse rows/columns, etc.). This is because, in the computation of typical input-to-output operators, the input feature map is either computed after zero filling (deconvolution) or filled with values directly from the input according to computation rules (upsampling). The feature map can therefore be stored directly at the maximum memory footprint of the input feature map, i.e., the input feature map is stored without performing the slicing operation in the dashed box of Figure 5, saving data movement in memory.
In terms of computation logic, the original feature map is filled according to the established filling rules to obtain the deconvolution-filled feature map. The deconvolution-filled feature map is sliced, and the output is computed from the sliced filled feature map. During computation, a data decompressor restores the valid data needed for each single computation step (e.g., one kernel sliding window of the deconvolution). This is one of the differences between input-to-output and output-to-input operators.
The large-scale convolution kernel splitting and computation method in the present invention is implemented using feature map marker intervals combined with kernel splitting logic; the deep learning compiler realizes an equivalent conversion from a large kernel to small kernels. Its specific content is as follows.
1) Splitting the convolution kernel. The large kernel is split in the row and column directions in units of the kernel size supported by the hardware accelerator. If the split count is not an integer, it is rounded up, and the large kernel is zero-padded on its left and bottom sides. The kernel size after zero padding is given by formula (5):
kernel_out = ⌈kernel_big / kernel_unit⌉ × kernel_unit  (5)
where kernel_big is the number of rows/columns of the large convolution kernel, kernel_unit is the number of rows/columns of the supported convolution kernel, and kernel_out is the number of rows/columns of the large convolution kernel after zero padding.
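The rounding and padding step can be sketched as follows; `pad_kernel` is an illustrative helper, with the left/bottom padding sides taken from the text:

```python
import math
import numpy as np

def padded_kernel_size(kernel_big, kernel_unit):
    # formula (5): round the split count up, so the padded kernel is the
    # smallest multiple of the supported unit size covering the big kernel
    return math.ceil(kernel_big / kernel_unit) * kernel_unit

def pad_kernel(weights, kernel_unit):
    # Zero-pad the large kernel on its left and bottom sides, as described
    k = weights.shape[0]
    pad = padded_kernel_size(k, kernel_unit) - k
    return np.pad(weights, ((0, pad), (pad, 0)))  # (bottom rows, left columns)
```

A 6×6 kernel splits evenly into four 3×3 units, while a 5×5 kernel is first padded up to 6×6.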
2) Marking the computation domain of each split kernel. A split sub-kernel does not need to traverse the entire feature map when sliding its window; instead, the start and end positions of its sliding range on the feature map are marked according to its row and column position within the large kernel. Assuming the sub-kernel is the i-th split in the row direction and the j-th in the column direction, the start and end positions are computed by formula (6), where row_st and col_st are the starting row and column positions, row_ed and col_ed are the ending row and column positions, and row_total and col_total are the numbers of rows and columns of the feature map (the feature map may also be a sliced sub-feature map as described above, i.e., row_total = row_in_chunk). The sub-kernel slides its window within the feature map range [row_st, row_ed], [col_st, col_ed] at the stride of the large kernel to obtain its output.
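Formula (6) is not reproduced in the text; the sketch below is one consistent reading of it (0-based indexing, the same expression applying to columns with j), and should be taken as an assumption rather than the patented formula:

```python
def subkernel_domain(i, kernel_unit, kernel_out, row_total):
    # Assumed reading of formula (6): the i-th row-block of the split kernel
    # covers kernel rows [i*unit, (i+1)*unit), so its sliding range starts
    # i*unit rows into the feature map and ends where the kernel rows below
    # it would run past the bottom edge.
    row_st = i * kernel_unit
    row_ed = row_total - (kernel_out - (i + 1) * kernel_unit)
    return row_st, row_ed
```

For a 6×6 kernel split into 3×3 units on a 10-row feature map, the top row-block slides over rows 0 to 7 and the bottom row-block over rows 3 to 10, under this reading.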
Figure 6 shows the region distribution of the split kernels as the kernel slides over the input feature map. Suppose a convolution with a 6x6 kernel runs on a hardware accelerator that supports 3x3 kernels: the 6x6 kernel is split into four 3x3 kernels. If the kernel size is not divisible by the supported kernel size, the kernel is zero-padded according to formula (5). After splitting, each 3x3 kernel acts on a different region of the feature map, so formula (6) gives each sub-kernel's region (row_st ~ row_ed, col_st ~ col_ed) on the feature map; the sliding-window computation of a sub-kernel over its region yields a partial result of the 6x6 convolution.
3) Sub-kernel reduction. After the large kernel is split into multiple sub-kernels, the output of each sub-kernel on its feature-map region is a partial sum of the large-kernel convolution. Since a convolution is the multiply-accumulate of kernel values against the corresponding feature-map values, the output of the large-kernel convolution is the element-wise sum of all sub-kernel outputs. Figure 7 shows the reduction of a split kernel: in the example the convolution stride is 2, and the sub-feature map produced by each sub-kernel has the same numbers of rows and columns as the convolution's output feature map. The reduction is the element-wise accumulation of the four sub-kernels' results.
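The reduction property can be checked numerically: summing the partial outputs of the four 3x3 sub-kernels, each slid over its own shifted region, reproduces the 6x6 convolution exactly. This is a self-contained sketch (the helper name and the concrete feature-map values are illustrative):

```python
def conv2d(fmap, kernel, stride):
    """Plain sliding-window 2-D convolution (cross-correlation) on lists of lists."""
    kh, kw = len(kernel), len(kernel[0])
    oh = (len(fmap) - kh) // stride + 1
    ow = (len(fmap[0]) - kw) // stride + 1
    return [[sum(fmap[r * stride + a][c * stride + b] * kernel[a][b]
                 for a in range(kh) for b in range(kw))
             for c in range(ow)] for r in range(oh)]

# 10x10 feature map, 6x6 kernel, stride 2 -> 3x3 output (illustrative values).
fmap = [[(r * 10 + c) % 7 for c in range(10)] for r in range(10)]
big = [[(r * 6 + c) % 5 - 2 for c in range(6)] for r in range(6)]
full = conv2d(fmap, big, 2)

# Split into four 3x3 sub-kernels. Sub-kernel (i, j) slides only over its
# shifted region of the feature map; the element-wise sum of the four
# partial outputs equals the full 6x6 result.
acc = [[0] * len(full[0]) for _ in full]
for i in range(2):
    for j in range(2):
        sub = [row[j * 3:(j + 1) * 3] for row in big[i * 3:(i + 1) * 3]]
        region = [row[j * 3: 10 - 6 + (j + 1) * 3]
                  for row in fmap[i * 3: 10 - 6 + (i + 1) * 3]]
        part = conv2d(region, sub, 2)
        for r in range(len(acc)):
            for c in range(len(acc[0])):
                acc[r][c] += part[r][c]

assert acc == full  # the reduction reproduces the large-kernel convolution
```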
The splitting and computation of large-scale kernels must follow the data storage format defined by the hardware accelerator and compiler: over the dimensions of a three-dimensional feature map, data storage is aligned by ch channels and by n rows, so a large-scale kernel must be split along these data-dimension alignments to satisfy the dimension-matching rule between kernel and feature map during convolution.
Feature-map splitting for large-scale inputs/outputs and large-scale kernel splitting can be applied together. First, the sub-feature maps are obtained by the large-scale feature-map splitting method combined with the memory-overflow check. Then, the sub-kernels are obtained by the large-scale kernel splitting method combined with the sub-feature-map size and the operator attributes. During reduction, the sub-kernel outputs are first reduced to obtain the large-kernel output, and the sub-feature-map outputs are then reduced to obtain the output for the large-scale input/output feature map.
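The nesting order of the two splitting methods can be sketched in pseudocode (all names here are illustrative placeholders, not the original compiler's API; the inner reduction over sub-kernels runs inside the outer loop over sub-feature maps):

```
run_large_conv(fmap, kernel, hw_limits):
    outputs = []
    for sub_fmap in split_feature_map(fmap, hw_limits):     # driven by the memory-overflow check
        partials = []
        for sub_k in split_kernel(kernel, hw_limits):       # driven by sub-feature-map size and operator attributes
            partials.append(conv(sub_fmap, sub_k))          # each sub-kernel slides over its compute domain
        outputs.append(elementwise_sum(partials))           # inner reduction: sub-kernel outputs -> large-kernel output
    return assemble(outputs)                                # outer reduction: sub-feature-map outputs -> full output
```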
Although the present invention is disclosed above by way of preferred embodiments, they are not intended to limit it. Any person skilled in the art may make possible changes and modifications without departing from the spirit and scope of the invention; therefore, the scope of protection of the invention shall be as defined by its claims.
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202410477287.0A | 2024-04-19 | 2024-04-19 | Compiler memory allocation method for large-scale operation in convolutional neural network |
| Publication Number | Publication Date |
|---|---|
| CN118377616A | 2024-07-23 |
| Publication Number | Priority Date | Publication Date | Assignee | Title |
|---|---|---|---|---|
| CN119166948A | 2024-11-15 | 2024-12-20 | 之江实验室 | A method and device for adaptive distribution of DW type operator data in a multi-core environment |
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |