Detailed Description
The following description of the technical solutions in the embodiments of the present disclosure will be made clearly and completely with reference to the accompanying drawings in the embodiments of the present disclosure, and it is apparent that the described embodiments are some embodiments of the present disclosure, but not all embodiments. Based on the embodiments in this disclosure, all other embodiments that a person skilled in the art would obtain without making any inventive effort are within the scope of protection of this disclosure.
It should be understood that the terms "first," "second," "third," and "fourth," etc. in the claims, specification, and drawings of this disclosure are used for distinguishing between different objects and not for describing a particular sequential order. The terms "comprises" and "comprising" when used in the specification and claims of this disclosure are taken to specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the present disclosure is for the purpose of describing particular embodiments only, and is not intended to be limiting of the disclosure. As used in this disclosure and in the claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should be further understood that the term "and/or" as used in the present disclosure and claims refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations.
As used in this specification and the claims, the term "if" may be interpreted as "when," "once," "in response to a determination," or "in response to detection," depending on the context. Similarly, the phrase "if it is determined" or "if a [described condition or event] is detected" may be interpreted, depending on the context, as meaning "upon determination," "in response to determination," "upon detection of a [described condition or event]," or "in response to detection of a [described condition or event]."
Winograd convolution is a convolution acceleration implementation based on a polynomial interpolation algorithm. It operates on the two inputs of the convolution operation, the neurons and the weights: after tiling them at a certain scale, it applies a linear transformation (the winograd positive transform) to each, performs para-position (element-wise) multiplication on the transformed neurons and weights, and finally applies another linear transformation (the winograd inverse transform) to the para-position multiplication result, obtaining a convolution result equivalent to the original convolution operation.
The expression of winograd transform is as follows:
For one-dimensional neurons and weights: $S = A^T \left[ (Gg) \odot (B^T d) \right]$
For two-dimensional neurons and weights: $S = A^T \left[ (G g G^T) \odot (B^T d B) \right] A$
Here $g$ represents the weight, $G$ represents the forward left-multiplication transformation matrix corresponding to the weight, and $G^T$ represents the forward right-multiplication transformation matrix corresponding to the weight; $d$ represents the input neuron, $B$ represents the forward right-multiplication transformation matrix corresponding to the input neuron, and $B^T$ represents the forward left-multiplication transformation matrix corresponding to the input neuron; $\odot$ denotes para-position (element-wise) multiplication; $A$ represents the inverse right-multiplication transformation matrix and $A^T$ the inverse left-multiplication transformation matrix. For input neurons of different dimensions, there are corresponding $B$ and $B^T$; similarly, for weights of different dimensions, there are corresponding $G$ and $G^T$.
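For concreteness, the widely used transformation matrices for the one-dimensional F(2, 3) winograd algorithm (output tile of size 2, kernel of size 3) are shown below; these standard matrices are given only as an illustration, and the matrices actually used by a particular embodiment may differ:

$$
B^T = \begin{pmatrix} 1 & 0 & -1 & 0 \\ 0 & 1 & 1 & 0 \\ 0 & -1 & 1 & 0 \\ 0 & 1 & 0 & -1 \end{pmatrix}, \quad
G = \begin{pmatrix} 1 & 0 & 0 \\ \tfrac{1}{2} & \tfrac{1}{2} & \tfrac{1}{2} \\ \tfrac{1}{2} & -\tfrac{1}{2} & \tfrac{1}{2} \\ 0 & 0 & 1 \end{pmatrix}, \quad
A^T = \begin{pmatrix} 1 & 1 & 1 & 0 \\ 0 & 1 & -1 & -1 \end{pmatrix}
$$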
Replacing the original convolution operation with winograd convolution can bring considerable gains in hardware energy-efficiency ratio and computation time, and can achieve higher neural network performance with no increase, or only a small increase, in hardware overhead. However, winograd convolution still has an obvious disadvantage: a large number of multiplication operations remain in the calculation process, and they still consume a long operation time.
To solve this technical problem, the present disclosure provides a data processing method that disassembles the multiplication operations in the winograd convolution process into addition operations, thereby saving calculation time and reducing energy consumption, and that performs quantization processing on data in the winograd convolution process to further improve calculation performance.
In general, when quantizing data, if the selected value range is too wide, the quantized data will have lower precision; if the value range is too narrow, too much data will be truncated, causing information loss for the data distributed on both sides. Here, the value range refers to the numerical range between the minimum truncation threshold and the maximum truncation threshold used to quantize the data. It is therefore desirable to find an appropriate truncation threshold so that the loss caused by quantizing the data is minimal or small. Conventionally, the optimal truncation threshold is determined by the KL-divergence (Kullback-Leibler divergence) method, which measures the correlation between the data before and after quantization. KL divergence is also called relative entropy, information divergence, or information gain; it is an asymmetric measure of the difference between two probability distributions P and Q. Assuming the distribution of the 32-bit floating point numbers before quantization is P and the distribution of the 8-bit integers after quantization is Q, the smaller the KL divergence between P and Q, the closer the distributions before and after quantization and the more effective the quantization. However, the inventors of the present application found that the quantization effect achieved with the truncation threshold obtained by the conventional KL method is not good and generally causes a large loss of accuracy.
To this end, embodiments of the present disclosure propose a new approach to determining a truncation threshold for symmetric quantization that enables a smaller quantization accuracy loss than conventional techniques (e.g., the KL method). According to an embodiment of the present disclosure, after a set of data to be quantized in a winograd convolution process is acquired, a plurality of sets of quantized data are determined by separately quantizing the set of data to be quantized using a plurality of pairs of truncation thresholds, where each pair of truncation thresholds includes a symmetric truncation positive value and truncation negative value. Then, using the difference between the mean of the absolute values of each set of quantized data and the mean of the absolute values of the set of data to be quantized as an evaluation index, a suitable pair of truncation thresholds is selected from the plurality of pairs of truncation thresholds. In this way, a more suitable truncation threshold can be found.
Fig. 1 shows a flow chart of a data processing method according to an embodiment of the present disclosure. As shown in fig. 1, the method includes steps S11 to S14.
In step S11, a pair of cutoff thresholds is determined from a plurality of pairs of cutoff thresholds according to a mean value of absolute values of quantized data obtained by performing quantization processing on data to be quantized using the plurality of pairs of cutoff thresholds, where the data to be quantized is a set of data in a winograd convolution processing process, and each pair of cutoff thresholds in the plurality of pairs of cutoff thresholds includes a first cutoff threshold and a second cutoff threshold, and the second cutoff threshold is a minimum value in the data to be quantized.
For example, after a set of data to be quantized in a winograd convolution process is acquired, a plurality of sets of quantized data are determined by separately quantizing the set of data to be quantized using a plurality of pairs of truncation thresholds, where each pair of truncation thresholds includes a first truncation threshold and a second truncation threshold, and the first truncation threshold is greater than the second truncation threshold. Then, the difference between the mean of the absolute values of each set of quantized data and the mean of the absolute values of the set of data to be quantized is used as an evaluation index to select an appropriate pair of truncation thresholds from the plurality of pairs of truncation thresholds.
In one possible implementation, step S11 may include:
in the current round of searching, quantizing the data to be quantized by adopting a plurality of pairs of cutoff thresholds to obtain a plurality of groups of quantized data;
determining a pair of to-be-selected truncation thresholds from the plurality of pairs of truncation thresholds according to the difference between the mean of the absolute values of each group of quantized data and the mean of the absolute values of the data to be quantized;
when the current round of search meets the continued search condition, determining new pairs of truncation thresholds according to the to-be-selected truncation thresholds, and carrying out the next round of search according to the new pairs of truncation thresholds, until the current round of search does not meet the continued search condition;
and when the current round of search does not meet the continued search condition, determining the to-be-selected truncation thresholds as the pair of truncation thresholds determined from the plurality of pairs of truncation thresholds.
In one possible implementation, the continued search condition may include at least one of:
The number of searches that have been performed is less than a predetermined total number of searches;
the difference between the mean of the absolute values of each group of quantized data in the current round of search and the mean of the absolute values of the data to be quantized is greater than or equal to a difference threshold.
In this implementation, a set of data to be quantized during the winograd convolution is obtained, which may be any data during the winograd convolution process.
In this implementation, the determined pair of to-be-selected truncation thresholds includes a to-be-selected first truncation threshold and a to-be-selected second truncation threshold. Determining new pairs of truncation thresholds according to the to-be-selected truncation thresholds and carrying out the next round of search according to the new pairs includes: determining a new truncation threshold range according to the to-be-selected first truncation threshold, and determining new pairs of truncation thresholds within that range, so that the next round of search is performed.
For example, fig. 2 shows a schematic diagram of truncation threshold determination in a data processing method according to an embodiment of the present disclosure. As shown in fig. 2, assume that the total number of search rounds is 2, that each vertical line is a to-be-determined first truncation threshold, and that the number of truncation threshold pairs is 10. In the first round of the truncation threshold search, an optimal truncation threshold is determined as the "to-be-selected first truncation threshold" according to the difference between the mean of the absolute values of each group of quantized data (i.e., the data corresponding to two adjacent to-be-determined first truncation thresholds) and the mean of the absolute values of the data to be quantized. A second round of truncation threshold search is then performed, and a truncation threshold range is determined according to the to-be-selected first truncation threshold; for example, the to-be-determined first truncation thresholds adjacent to the to-be-selected first truncation threshold in the first round can be directly taken as the new truncation threshold range. The truncation threshold search is then performed within this range to determine a new to-be-selected first truncation threshold. Because the total number of search rounds is 2, the new to-be-selected first truncation threshold is determined as the first truncation threshold, and a pair of truncation thresholds is then determined together with the determined second truncation threshold.
In one possible implementation, the mean of the absolute values of the data to be quantized and the maximum absolute value in the data to be quantized may be determined, where the mean of the absolute values is the sum of the absolute values of all data in the data to be quantized divided by the number of elements. A minimum mean difference may be initialized, e.g., initially set to the maximum value representable by the floating point type, and the search order i of the loop search may be initialized (e.g., to 0). In some embodiments, the search order i may also be initialized to half the total number of searches, i.e., the search starts from the middle, which can improve search efficiency. According to embodiments of the present disclosure, one or more rounds of the threshold search process may be provided, and each round may have the same or a different total number of searches. In some embodiments, the total number of searches per round may be set between 10 and 32. In general, the greater the total number of searches, the longer the search takes and the more accurate the truncation threshold. However, once the total number of searches reaches a certain value, the search effect may no longer improve substantially.
Next, the first round of the coarse-grained truncation threshold search process begins. For example, 10 pairs of candidate truncation thresholds may be determined over the range of the data to be quantized, quantization may be performed using these 10 pairs of truncation thresholds in turn, and an optimal pair of truncation thresholds may be determined according to the difference in the mean of the absolute values of the data before and after quantization.
It is then determined whether the current search order i is smaller than the predetermined total number of searches, i.e., whether the calculation for every pair of truncation thresholds, selected in turn for quantization, has been completed. If the current search order i is smaller than the predetermined total number of searches, a pair of truncation thresholds is determined based on the current search order i; for example, the pair of truncation thresholds may be -(maximum absolute value / predetermined total number of searches) × (i+1) and (maximum absolute value / predetermined total number of searches) × (i+1), respectively. The pair of truncation thresholds is used to quantize the data to be quantized to obtain the corresponding quantized data quant_data_i, and the difference abs(quant_data_mean_i - data_mean) / data_mean between the mean quant_data_mean_i of the absolute values of the quantized data and the mean data_mean of the absolute values of the data to be quantized is calculated.
It is then determined whether the calculated difference is smaller than the current minimum difference. If so, the calculated difference is set as the current minimum difference, the truncation threshold at which the difference is minimal is recorded, and the current search order i is incremented. If not, the current search order i is incremented directly, i.e., the difference for the next pair of truncation thresholds continues to be determined. The above steps are executed in a loop until the value of the current search order i reaches the predetermined total number of searches, and the first round of the truncation threshold search is exited. Through the first round of searching, the truncation threshold with the smallest difference is determined as the optimal truncation threshold. It can be seen that the truncation threshold search process is: quantizing the data to be quantized with a plurality of pairs of truncation thresholds, determining, among the plurality of sets of quantized data, the set of quantized data whose mean of absolute values differs least from that of the data to be quantized, and then selecting, from the plurality of pairs of truncation thresholds, the pair of truncation thresholds corresponding to that set of quantized data.
Optionally, a second round of fine-grained truncation threshold search may then be performed. It can likewise use the aforementioned method, except that the second round of search is performed within a certain range around the optimal truncation threshold of the first round (e.g., between the previous truncation threshold and the subsequent truncation threshold of the selected truncation threshold), which further refines the first-round search result. For example, in the second round of searching, the interval between each pair of truncation thresholds may be (maximum absolute value × 2) / (total number of first-round searches × total number of second-round searches). Through the second round of searching, the fine-grained optimal truncation threshold is determined. By means of the two-round search, a more accurate truncation threshold can be obtained, reducing the precision loss caused by quantization.
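A minimal sketch of this two-round search is given below, assuming symmetric truncation thresholds of the form ±t and using the relative difference of absolute-value means as the evaluation index; the function names, the toy quantizer, and the 8-bit default are illustrative assumptions rather than the exact implementation of the disclosure.

```python
import numpy as np

def quantize(x, t_neg, t_pos, n_bits=8):
    """Toy symmetric linear quantizer used only to make the sketch runnable:
    clip to [t_neg, t_pos], scale to n-bit integer steps, then scale back so
    the absolute-value means are comparable with the original data."""
    scale = max(abs(t_neg), abs(t_pos)) / (2 ** (n_bits - 1) - 1)
    return np.round(np.clip(x, t_neg, t_pos) / scale) * scale

def search_truncation_threshold(data, total_searches=10, rounds=2):
    """Two-round (coarse then fine) search for a symmetric pair (-t, t) that
    minimizes the relative difference between the mean absolute value of the
    quantized data and that of the data to be quantized."""
    data = np.asarray(data, dtype=np.float32)
    data_mean = np.mean(np.abs(data))     # mean of absolute values before quantization
    abs_max = float(np.max(np.abs(data)))
    lo, hi = 0.0, abs_max                 # current search interval for t
    best_t = abs_max
    for _ in range(rounds):
        step = (hi - lo) / total_searches
        best_diff = np.inf
        for i in range(total_searches):
            t = lo + step * (i + 1)       # candidate first truncation threshold
            q = quantize(data, -t, t)
            diff = abs(np.mean(np.abs(q)) - data_mean) / data_mean
            if diff < best_diff:
                best_diff, best_t = diff, t
        # the next (fine-grained) round searches around the best threshold so far
        lo, hi = max(best_t - step, 0.0), best_t + step
    return -best_t, best_t
```

For instance, `search_truncation_threshold(np.random.randn(1024))` would return a symmetric pair (-t, t) found after one coarse and one fine round.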
Embodiments of the present disclosure provide a method for iteratively searching for an optimal cutoff threshold.
For example, three pairs of truncation thresholds are determined: the maximum value absmax of the absolute values of all data in the data Fx to be quantized may be determined, and the three pairs of truncation thresholds may be (absmin, absmax/2), (absmin, absmax × 3/4), and (absmin, absmax), respectively. The data to be quantized is quantized separately with these three pairs of truncation thresholds to obtain three sets of quantized data; the mean of the absolute values of each set of quantized data is then calculated, the difference between each such mean and the mean of the absolute values of the data to be quantized is computed according to the corresponding formula, and the minimum difference diff_min is selected. It is then determined whether the minimum difference diff_min is smaller than a preset threshold. If not, three pairs of truncation thresholds are re-determined based on the selected pair of truncation thresholds (the value corresponding to the minimum difference diff_min is taken as the new maximum absolute value), and the above process is repeated until the minimum difference diff_min is smaller than the preset threshold, at which point the iterative truncation threshold process exits. In some embodiments, in addition to the iteration stop condition that the minimum difference diff_min is smaller than the predetermined threshold, other iteration stop conditions may be set, such as a maximum number of iterations or reaching a predetermined minimum interval. In addition, embodiments of the present disclosure may also perform the iterative process only once and then directly use the pair of truncation thresholds corresponding to the minimum difference diff_min as the final truncation threshold.
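The iterative variant described above might be sketched as follows; the candidate layout, the stop conditions, and names such as `diff_min` follow the description above, while the toy quantizer and default parameters are assumptions.

```python
import numpy as np

def iterative_threshold_search(fx, diff_threshold=0.01, max_iters=20, n_bits=8):
    """Iteratively evaluate three candidate upper thresholds (absmax/2,
    absmax*3/4, absmax) against the lower threshold absmin, keep the one whose
    quantized data has the closest absolute-value mean, and shrink absmax to
    that candidate until diff_min falls below the preset threshold."""
    fx = np.asarray(fx, dtype=np.float32)
    f_mean = np.mean(np.abs(fx))
    abs_min = float(np.min(fx))            # second truncation threshold: data minimum
    abs_max = float(np.max(np.abs(fx)))
    best_pair = (abs_min, abs_max)
    for _ in range(max_iters):             # extra stop condition: maximum iterations
        candidates = [abs_max / 2, abs_max * 3 / 4, abs_max]
        diffs = []
        for c in candidates:
            scale = c / (2 ** (n_bits - 1) - 1)
            q = np.round(np.clip(fx, abs_min, c) / scale) * scale
            diffs.append(abs(np.mean(np.abs(q)) - f_mean) / f_mean)
        k = int(np.argmin(diffs))
        diff_min = diffs[k]
        best_pair = (abs_min, candidates[k])
        if diff_min < diff_threshold:
            break                          # stop condition on diff_min
        abs_max = candidates[k]            # selected value becomes the new maximum
    return best_pair
```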
In some embodiments, quantization parameters when quantizing data using the pairs of truncation thresholds may be determined by the following equations (1) - (3).
Where p is the maximum absolute value in the data to be quantized, n represents the number of binary bits after quantization, and S and f represent quantization parameters.
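The specific equations (1)-(3) are not reproduced above. Purely as an illustrative sketch of one common formulation of truncated linear quantization (an assumption, not necessarily the formulation used in this disclosure), a truncation threshold $p$ and $n$-bit quantization could be related by:

$$ S = \frac{p}{2^{\,n-1} - 1}, \qquad I_x = \operatorname{round}\!\left(\frac{\operatorname{clip}(F_x,\ -p,\ p)}{S}\right), \qquad \hat{F}_x \approx I_x \cdot S $$

where $F_x$ is a value to be quantized, $I_x$ its quantized integer, and $S$ a scaling factor; the parameter $f$ mentioned above could, depending on the quantization scheme, play the role of an offset or a point position, which is likewise an assumption here.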
According to embodiments of the present disclosure, the quantization parameters S1, f1, S2, f2, S3, and f3 may be found by selecting p as absmax/2, absmax × 3/4, and absmax, respectively, thereby obtaining the corresponding quantized data. Accordingly, after a pair of truncation thresholds is selected, the S and f corresponding to that pair of truncation thresholds are directly taken as the quantization parameters for quantizing the data to be quantized.
In this way, when the data to be quantized is asymmetric, a pair of truncation thresholds can be determined by the above method, and the data to be quantized can then be divided into different parts according to the truncation thresholds for quantization, which can improve the quantization precision. It should be noted that the above manner of determining the pair of truncation thresholds is only one example of the disclosure, and those skilled in the art may set the manner of determining the pair of truncation thresholds according to actual needs.
In one possible implementation, the method may include:
Determining the maximum absolute value of data in the data to be quantized;
determining a plurality of first truncation thresholds based on the maximum absolute value and the number of truncation threshold pairs;
and obtaining the plurality of pairs of cutoff thresholds according to the plurality of first cutoff thresholds and the second cutoff threshold.
According to embodiments of the present disclosure, pairs of truncation thresholds may be chosen to separately quantize the data to be quantized. In some embodiments, some cutoff thresholds may be chosen at fixed intervals.
For example, one truncation threshold may be selected at every predetermined interval according to the maximum of the absolute values of all the data in the data to be quantized. In some embodiments it is also possible to pick truncation thresholds only at a few specific locations, for example only at values that are a few predetermined proportions of the maximum absolute value.
In some embodiments, a corresponding one or more quantization parameters (e.g., point location parameters, scaling coefficients, offsets, etc.) may be calculated from each pair of truncation thresholds, and the calculated quantization parameters may then be used to quantize the data to be quantized. Alternatively, the data to be quantized may also be quantized directly from the truncation threshold by various formulas or models without separately calculating the values of the respective quantization parameters.
A pair of truncation thresholds is selected from the plurality of pairs of truncation thresholds for quantizing the set of data to be quantized based on a difference between a mean of absolute values of each of the plurality of sets of quantized data and a mean of absolute values of the set of data to be quantized. Since the average difference of absolute values of data before and after quantization can reflect the loss of accuracy before and after quantization, the smaller the average difference of absolute values, the smaller the loss of accuracy of quantization operation. Accordingly, the embodiments of the present disclosure can achieve less loss of accuracy than the conventional KL method using the difference in the mean of the absolute values of the data before and after quantization as an index to pick the optimal cutoff threshold.
In some embodiments, the difference between the average of the absolute values of the quantized data and the average of the absolute values of the data to be quantized may be the difference between the two absolute value averages. Alternatively, the difference between the average value of the absolute values of the quantized data and the average value of the absolute values of the data to be quantized may also be: the difference between the two absolute value averages is divided by the average of the absolute values of the data to be quantized, and then the absolute values are taken.
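Written out with illustrative notation, using $\bar{F}$ for the mean of the absolute values of the data to be quantized and $\bar{Q}_k$ for the mean of the absolute values of the $k$-th set of quantized data, the two forms of the evaluation index described above are:

$$ \mathrm{diff}_k = \bar{Q}_k - \bar{F} \qquad \text{or} \qquad \mathrm{diff}_k = \left| \frac{\bar{Q}_k - \bar{F}}{\bar{F}} \right| $$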
In one possible implementation, determining a pair of truncation thresholds from the plurality of pairs of truncation thresholds according to a difference between a mean of absolute values of each set of data and a mean of absolute values of the data to be quantized may include:
Selecting a group of first quantized data from the plurality of groups of quantized data, wherein the difference between the average value of the absolute values of the group of first quantized data and the average value of the absolute values of the group of data to be quantized is smaller than the difference between the average value of the absolute values of the plurality of groups of second quantized data and the average value of the absolute values of the group of data to be quantized, and the plurality of groups of second quantized data are other groups of quantized data except the group of first quantized data in the plurality of groups of quantized data;
a pair of truncation thresholds corresponding to the set of first quantized data is selected from the plurality of pairs of truncation thresholds.
Specifically, the mean of the absolute values of each set of quantized data is determined, and the difference between each such mean and the mean of the absolute values of the set of data to be quantized is determined. The set of quantized data with the smallest difference is determined as the first quantized data, and the other sets of quantized data are determined as the second quantized data; that is, the difference between the mean of the absolute values of the first quantized data and the mean of the absolute values of the set of data to be quantized is smaller than the difference between the mean of the absolute values of each set of second quantized data and the mean of the absolute values of the set of data to be quantized. The pair of truncation thresholds used to quantize the data to be quantized to obtain the first quantized data is determined as the pair of truncation thresholds for quantizing the data to be quantized.
For example, after a pair of truncation thresholds t1 and t2 is determined, data outside the range from t2 to t1 is set to t1 or t2, where t1 is the determined truncation positive value (first truncation threshold) and t2 is the determined truncation negative value (second truncation threshold). For example, data within the truncation range corresponding to the pair of truncation thresholds is quantized according to the quantization parameters, a value to be quantized that lies outside the truncation range and is smaller than t2 is quantized as the value t2, and a value to be quantized that lies outside the truncation range and is larger than t1 is quantized as the value t1. In this way, narrowing the value range of the data to be quantized by means of the truncation thresholds can improve the precision of the quantized data.
In step S12, the data to be quantized is quantized according to the determined pair of truncation thresholds, so as to obtain quantized data.
For example, after the best pair of truncation thresholds is selected, the set of data to be quantized may be quantized using the selected pair of truncation thresholds to obtain quantized data. This includes: truncating values in the set of data to be quantized that are larger than the first truncation threshold to the first truncation threshold, and truncating values that are smaller than the second truncation threshold to the second truncation threshold.
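A minimal sketch of this truncation-and-quantization step is shown below; the function name, the scale choice, and the 8-bit default are illustrative assumptions, not the exact scheme of the disclosure.

```python
import numpy as np

def quantize_with_thresholds(data, t2, t1, n_bits=8):
    """Sketch of step S12: truncate the data to be quantized to [t2, t1]
    (t2 = second/lower threshold, t1 = first/upper threshold), then apply a
    linear quantization to n-bit integers; also returns the scale so the
    result can later be dequantized."""
    x = np.clip(np.asarray(data, dtype=np.float32), t2, t1)
    scale = max(abs(t1), abs(t2)) / (2 ** (n_bits - 1) - 1)
    return np.round(x / scale).astype(np.int8), scale
```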
In step S13, a winograd convolution process is continuously performed according to the quantized data, so as to obtain a quantized winograd convolution result.
In step S14, inverse quantization processing is performed on the quantized winograd convolution result, so as to obtain a winograd convolution result.
The data to be quantized, the data to be operated on, and the like mentioned in the embodiments of the present disclosure may be data that occurs during actual data processing, and may correspond to data such as image data, video data, audio data, and text data. For example, the method provided by the present disclosure may be used in image processing, video processing, audio processing, and other application scenarios. Taking the data to be operated on as image data as an example, the data to be operated on may be represented in NHWC (batch, height, width, channels) form, where N represents the number of images, H and W represent the number of pixels in the height and width directions respectively, and C represents the number of channels; for example, C may represent the three channels RGB (Red, Green, Blue). This is only one example of the present disclosure, and the present disclosure is not limited thereto. It should be noted that the above method may be applied to any data that can be quantized and subjected to a winograd convolution operation, and those skilled in the art may set the type of data and the corresponding application scenario according to actual needs, which is not limited in this disclosure.
In one possible implementation, the winograd convolution process includes:
disassembling the winograd positive transform of the data to be operated on into summation operations, and performing the calculation to obtain the winograd positive transform result of each piece of data in the data to be operated on;
performing para-position multiplication between the winograd positive transform results of corresponding data in the data to be operated on to obtain a para-position multiplication result;
disassembling the winograd inverse transform of the para-position multiplication result into summation operations to obtain the winograd convolution result,
The data to be quantized may be any one of the data to be operated, winograd positive transform results of the data to be operated, and the para-multiplication results.
In one possible implementation, the data to be operated on includes at least one of an input neuron, a weight, and a gradient.
For example, the data to be quantized may be quantized, thereby increasing the processing speed of winograd convolutions. In some embodiments, the data to be quantized may be a 32-bit floating point number. Alternatively, the data to be quantized may be floating point numbers of other bits, or other data types.
In one possible implementation, quantizing the data to be quantized according to the determined pair of truncation thresholds to obtain quantized data includes any one of the following operations:
before the winograd positive transform of the data to be operated on is disassembled into summation operations, quantizing the data to be operated on as the data to be quantized;
before the para-position multiplication is performed, quantizing the winograd positive transform result of each piece of data in the data to be operated on as the data to be quantized;
and before the winograd inverse transform is disassembled into summation operations, quantizing the para-position multiplication result as the data to be quantized.
Illustratively, if the data to be quantized is data to be calculated, then winograd convolution process may be:
quantizing the data to be operated on with the determined pair of truncation thresholds to obtain quantized data to be operated on; disassembling the winograd positive transform of the quantized data to be operated on into summation operations and performing the calculation to obtain the winograd positive transform result of the quantized data to be operated on; performing para-position multiplication of the winograd positive transform results of the quantized data to be operated on to obtain a para-position multiplication result; disassembling the winograd inverse transform of the para-position multiplication result into summation operations to obtain a quantized winograd convolution result; and performing inverse quantization on the quantized winograd convolution result to obtain the winograd convolution result.
For example, if the data to be quantized is winograd positive transform results of the data to be operated on, the winograd convolution process may be:
disassembling the winograd positive transform of the data to be operated on into summation operations and performing the calculation to obtain the winograd positive transform result of the data to be operated on; quantizing the winograd positive transform result of the data to be operated on with the determined pair of truncation thresholds to obtain a quantized winograd positive transform result; performing para-position multiplication of the quantized winograd positive transform results to obtain a para-position multiplication result; disassembling the winograd inverse transform of the para-position multiplication result into summation operations to obtain a quantized winograd convolution result; and performing inverse quantization on the quantized winograd convolution result to obtain the winograd convolution result.
Illustratively, if the data to be quantized is a para-multiplication result, then winograd convolution process may be:
disassembling the winograd positive transform of the data to be operated on into summation operations and performing the calculation to obtain the winograd positive transform result of the data to be operated on; performing para-position multiplication of the winograd positive transform results of the data to be operated on to obtain a para-position multiplication result; quantizing the para-position multiplication result with the determined pair of truncation thresholds to obtain a quantized para-position multiplication result; disassembling the winograd inverse transform of the quantized para-position multiplication result into summation operations to obtain a quantized winograd convolution result; and performing inverse quantization on the quantized winograd convolution result to obtain the winograd convolution result.
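Putting these variants together, a minimal end-to-end sketch of the quantized winograd convolution flow is given below. It uses the standard F(2×2, 3×3) transformation matrices and a toy scale-based quantizer applied to the para-position multiplication result (the third variant); both the matrices and the quantizer are assumptions for illustration, not the exact scheme of the disclosure.

```python
import numpy as np

# Standard F(2x2, 3x3) winograd transformation matrices (an assumption; the
# matrices for other tile sizes or other embodiments will differ).
B_T = np.array([[1, 0, -1, 0], [0, 1, 1, 0], [0, -1, 1, 0], [0, 1, 0, -1]], dtype=np.float32)
G = np.array([[1, 0, 0], [0.5, 0.5, 0.5], [0.5, -0.5, 0.5], [0, 0, 1]], dtype=np.float32)
A_T = np.array([[1, 1, 1, 0], [0, 1, -1, -1]], dtype=np.float32)

def winograd_conv_quantized(d, g, t2, t1, n_bits=8):
    """d: 4x4 input tile, g: 3x3 weight, (t2, t1): determined pair of truncation
    thresholds, applied here to the para-position multiplication result."""
    V = B_T @ d @ B_T.T          # winograd positive transform of the input neuron
    U = G @ g @ G.T              # winograd positive transform of the weight
    M = U * V                    # para-position (element-wise) multiplication
    # quantize the para-position multiplication result with the truncation thresholds
    scale = max(abs(t1), abs(t2)) / (2 ** (n_bits - 1) - 1)
    M_q = np.round(np.clip(M, t2, t1) / scale)
    S_q = A_T @ M_q @ A_T.T      # winograd inverse transform of the quantized data
    return S_q * scale           # inverse quantization yields the 2x2 convolution result
```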
In one possible implementation manner, the disassembling the winograd positive transforms of the data to be operated into sum operations, and performing computation to obtain winograd positive transform results of each data in the data to be operated includes:
Respectively disassembling each data in the data to be operated into a plurality of first sub-tensors, carrying out winograd positive transformation on the plurality of first sub-tensors of each data in the data to be operated and summing to obtain winograd positive transformation results of each data in the data to be operated,
The number of the plurality of first sub-tensors split out of each data is the same as the number of elements which are not 0 in the corresponding data, one element in each first sub-tensor is the same as the element in the corresponding position in the corresponding data, and other elements are all 0.
For example, assume that the input neuron is represented as the 4×4 matrix

$$ d = \begin{pmatrix} d_{00} & d_{01} & d_{02} & d_{03} \\ d_{10} & d_{11} & d_{12} & d_{13} \\ d_{20} & d_{21} & d_{22} & d_{23} \\ d_{30} & d_{31} & d_{32} & d_{33} \end{pmatrix} $$

The input neuron is a 4×4 matrix including 16 elements, and the data to be operated on can therefore be disassembled into 16 first sub-tensors.
Then, according to the disassembly of the present disclosure, the 16 first sub-tensors are:

$$ d_{00} = \begin{pmatrix} d_{00} & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{pmatrix}, \quad d_{01} = \begin{pmatrix} 0 & d_{01} & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{pmatrix}, \quad \ldots, \quad d_{33} = \begin{pmatrix} 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & d_{33} \end{pmatrix} $$
One element in each first sub-tensor is the same as the element in the corresponding position in the data to be operated, and the other elements are all 0, which means that: taking the first sub-tensor d00 as an example, the element at the first row and first column positions is the same as the element at the first row and first column positions of the input neuron, the other elements are all 0, and the other first sub-tensors also have the same attribute.
It should be noted that the above disassembly is only some examples of the disclosure, and the disclosure is not limited in any way, for example, if the data to be operated has an element with a value of 0, the number of first sub-tensors obtained by disassembly may be less than the number of elements of the data to be operated, for example, the number of the plurality of first sub-tensors is the same as the number of elements of the data to be operated other than 0.
In one possible implementation, performing the winograd positive transform on the plurality of first sub-tensors of each piece of data in the data to be operated on and summing to obtain the winograd positive transform result of each piece of data in the data to be operated on may include:
obtaining the winograd positive transform result of the first meta-tensor corresponding to each first sub-tensor, where the first meta-tensor corresponding to a first sub-tensor is a tensor in which the value of the element at the first position is 1, and the first position in the first meta-tensor is the same as the position of the non-0 element in the first sub-tensor;
multiplying the winograd positive transform result of the corresponding first meta-tensor by the non-0 element value of the first sub-tensor, used as a coefficient, to obtain the winograd positive transform result of the first sub-tensor;
and adding the winograd positive transform results of the plurality of first sub-tensors to obtain the winograd positive transform result of the data to be operated on.
Still taking the first sub-tensor d00 as an example, the first meta-tensor corresponding to d00 may be

$$ \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{pmatrix} $$

That is, the first meta-tensor is obtained by setting the non-0 element value of the first sub-tensor to 1, and the non-0 element value itself is used as a coefficient of the first meta-tensor.
The winograd positive transformation result of the first meta-tensor corresponding to the first tensor may be obtained in advance through the following process: for each first sub-tensor, multiplying the left side of the first meta-tensor corresponding to the first sub-tensor by a forward conversion left multiplication matrix and multiplying the right side of the first meta-tensor by a forward conversion right multiplication matrix to obtain winograd forward conversion results of the first meta-tensor.
For matrices of different sizes, the form of the corresponding first element tensor is determined, as are the corresponding forward left-hand and forward right-hand matrices.
Thus, the winograd positive transform result of the first meta-tensor may be pre-computed, with the specific procedure as described above. For example, the winograd positive transform results of the first meta-tensors corresponding to d00 and d01 can each be computed in advance in this way.
Since the element values of the forward left-multiplication matrix and the forward right-multiplication matrix are both 0 or ±1, and the element values of the first meta-tensor are 0 or 1, the elements in the winograd positive transform result of the first meta-tensor are also 0 or ±1. Thus, the matrix multiplication operation can be disassembled into addition operations.
The process of calculating the winograd positive transform results of the first meta-tensors involves relatively many multiplication operations. With the method of the present disclosure, the pre-computed winograd positive transform results of first meta-tensors of various scales can be stored in the computing device, so that they can be obtained directly during the actual operation without repeated calculation, thereby shortening the calculation time and saving calculation resources.
Having obtained the winograd positive transform result of the first meta-tensor corresponding to a first sub-tensor, the non-0 element value in the first sub-tensor can be multiplied, as a coefficient, by the winograd positive transform result of the corresponding first meta-tensor to obtain the winograd positive transform result of the first sub-tensor. For example, the winograd positive transform result of d00 is d00 multiplied by the winograd positive transform result of its first meta-tensor, and likewise the winograd positive transform result of d01 is d01 multiplied by the winograd positive transform result of its first meta-tensor.
The winograd positive transform results of all the first sub-tensors are calculated in the above manner, and the winograd positive transform result of the data to be operated on is obtained by adding the winograd positive transform results of the plurality of first sub-tensors.
Since the elements in the winograd positive transform result of the first meta-tensor are also 0 or ±1, the above results involve only summation (addition and subtraction) operations.
According to the above embodiments of the present disclosure, the data to be calculated is disassembled to obtain a plurality of first sub-tensors, and the sum operation is performed according to the pre-calculated winograd positive conversion result of the first unitary sub-tensor corresponding to the first sub-tensor and the non-0 element value of the first sub-tensor to obtain the winograd positive conversion result of the data to be calculated.
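This decomposition can be checked numerically; the sketch below verifies that summing the scaled, pre-computed positive transform results of the first meta-tensors reproduces $B^T d B$. The matrix B_T used here is the standard F(2×2, 3×3) forward transformation matrix and, like the function name, is an assumption for illustration only.

```python
import numpy as np

B_T = np.array([[1, 0, -1, 0], [0, 1, 1, 0], [0, -1, 1, 0], [0, 1, 0, -1]], dtype=np.float32)

def forward_transform_by_decomposition(d):
    """Disassemble d into first sub-tensors, reuse the (pre-computable) positive
    transform results of the corresponding first meta-tensors, scale each by the
    non-0 element value used as a coefficient, and sum."""
    n = d.shape[0]
    result = np.zeros_like(d, dtype=np.float32)
    for i in range(n):
        for j in range(n):
            if d[i, j] == 0:
                continue                    # zero elements yield no sub-tensor
            e = np.zeros_like(d, dtype=np.float32)
            e[i, j] = 1.0                   # first meta-tensor for position (i, j)
            precomputed = B_T @ e @ B_T.T   # in practice looked up from a stored table
            result += d[i, j] * precomputed # element value acts as the coefficient
    return result

d = np.arange(16, dtype=np.float32).reshape(4, 4)
assert np.allclose(forward_transform_by_decomposition(d), B_T @ d @ B_T.T)
```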
After the winograd positive transform result of the input neuron is obtained through the above disassembly into summation operations, the winograd positive transform result of the weight is also used. The winograd positive transform result of the weight may be calculated by conventional matrix multiplication, or may likewise be obtained through the above disassembly into summation operations.
After the winograd positive transform results of the data to be operated on (input neurons, weights, gradients) are obtained, para-position multiplication of the winograd positive transform results of the data to be operated on can then be performed to obtain the para-position multiplication result. Para-position multiplication means that the value at each position of the result is obtained by multiplying the data at the corresponding positions of the two tensors.
Assume that the winograd positive transform result $B^T d B$ of an input neuron can be expressed as a 4×4 matrix $D_{4\times4}$, and that the winograd positive transform result of the weight can be expressed as a 4×4 matrix $G_{4\times4}$; the para-position multiplication result is then $G_{4\times4} \odot D_{4\times4}$.
The winograd convolution result of the data to be operated on can then be represented as $S = A^T (G_{4\times4} \odot D_{4\times4}) A$. The slave function processing unit of the present disclosure can disassemble $A^T (G_{4\times4} \odot D_{4\times4}) A$ into summation operations and calculate the winograd convolution result of the data to be operated on accordingly, so that the calculation time can be further saved and the energy consumption reduced.
In one possible implementation manner, the disassembling the winograd inverse transform of the para-multiplication result into a sum operation, to obtain the winograd convolution result may include:
Disassembling the para-multiplication result into a plurality of second sub-tensors, performing winograd inverse transformation on the plurality of second sub-tensors, and summing to obtain a winograd convolution result of the data to be calculated;
The number of the second sub-tensors is the same as the number of elements which are not 0 in the para-position multiplication result, one element in each of the second sub-tensors is the same as the element in the corresponding position in the para-position multiplication result, and other elements are all 0.
Assume that the para-position multiplication result is a 4×4 matrix. The para-position multiplication result is disassembled into a plurality of second sub-tensors; for example, it can be disassembled into 16 second sub-tensors, each of which keeps one element equal to the element at the corresponding position of the para-position multiplication result and has all other elements equal to 0.
after the disassembly, the plurality of second sub-tensors may be inverse winograd transformed and summed to obtain the winograd convolution result of the data to be operated on.
In one possible implementation, performing winograd inverse transform on the plurality of second sub-tensors and summing to obtain winograd convolution results of the data to be calculated may include the following procedures:
obtaining the winograd inverse transform result of the second meta-tensor corresponding to each second sub-tensor, where the second meta-tensor corresponding to a second sub-tensor is a tensor in which the value of the element at the second position is 1, and the second position in the second meta-tensor is the same as the position of the non-0 element in the second sub-tensor;
multiplying the winograd inverse transform result of the corresponding second meta-tensor by the non-0 element value of the second sub-tensor, used as a coefficient, to obtain the winograd inverse transform result of the second sub-tensor;
and adding the winograd inverse transform results of the plurality of second sub-tensors to obtain the winograd convolution result of the data to be operated on.
The manner of determining the second meta-tensor corresponding to a second sub-tensor is the same as the manner of determining the first meta-tensor above and is not repeated here. The winograd inverse transform result of the second meta-tensor is obtained in advance through the following process: for each second sub-tensor, the corresponding second meta-tensor is multiplied on the left by the inverse transform left-multiplication matrix and on the right by the inverse transform right-multiplication matrix to obtain the winograd inverse transform result of the second meta-tensor.
For matrices of different sizes, the form of the corresponding second meta-tensor is determined, as are the corresponding inverse transform left-multiplication matrix and inverse transform right-multiplication matrix. Thus, the winograd inverse transform result of the second meta-tensor may be pre-computed, with the specific procedure as described above. For the example listed above, the inverse transform left-multiplication matrix is a 2×4 matrix and the inverse transform right-multiplication matrix is a 4×2 matrix.
The dimensions of the inverse transform matrix may be determined from the dimensions of the input neurons and the dimensions of the weights and convolution steps, the above being just one example and not limiting the disclosure in any way.
Since the inverse transform matrices are composed of elements such as 0, ±1, and fractions, the matrix multiplication operation of the inverse transform can be split into addition and shift operations. The inverse transform matrices are multiplied with the second meta-tensor to obtain the winograd inverse transform result of the second meta-tensor; the element values in this result include such fractions, which can be calculated by simple shift operations and still save calculation time compared with multiplication operations.
For the steps of "multiplying the winograd inverse transform result of the corresponding second meta-tensor by the non-0 element value of the second sub-tensor as a coefficient to obtain the winograd inverse transform result of the second sub-tensor, and adding the winograd inverse transform results of the plurality of second sub-tensors to obtain the winograd convolution result of the data to be operated on", reference may be made to the description above. The difference is that the winograd inverse transform result of the second meta-tensor is not composed entirely of 0 and ±1: it also contains fractions, which can be calculated by simple shift operations. Compared with multiplication operations, disassembling the ordinary inverse transform process in this way can still save calculation time and reduce energy consumption.
According to the above embodiments of the present disclosure, the plurality of second sub-tensors are obtained by disassembling the para-multiplication result, and the winograd convolution result of the data to be calculated can be obtained by performing the summation operation according to the pre-calculated winograd inverse transform result of the second sub-tensor corresponding to the second sub-tensor and the non-0 element value of the second sub-tensor.
According to the data processing method, a pair of cutoff thresholds is determined from a plurality of pairs of cutoff thresholds according to the average value of absolute values of quantized data obtained by quantizing data to be quantized by adopting the plurality of pairs of cutoff thresholds, wherein the data to be quantized is a group of data in a winograd convolution processing process, and each pair of cutoff thresholds in the plurality of pairs of cutoff thresholds comprises a first cutoff threshold and a second cutoff threshold, and the second cutoff threshold is the minimum value in the data to be quantized; carrying out quantization processing on the data to be quantized according to the determined pair of cut-off thresholds to obtain quantized data; performing winograd convolution processing on the quantized data to obtain a quantized winograd convolution result; and performing inverse quantization processing on the quantized winograd convolution result to obtain a winograd convolution result, so that the accuracy of quantization can be improved, the operation time of winograd convolution can be saved, and the energy consumption can be reduced.
It should be noted that, for simplicity of description, the foregoing method embodiments are all described as a series of acts, but it should be understood by those skilled in the art that the present disclosure is not limited by the order of acts described, as some steps may be performed in other orders or concurrently in accordance with the present disclosure. Further, those skilled in the art will also appreciate that the embodiments described in the specification are all alternative embodiments, and that the acts and modules referred to are not necessarily required by the present disclosure.
It should be further noted that, although the steps in the flowchart of fig. 1 are shown sequentially as indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated herein, the execution order of these steps is not strictly limited, and the steps may be performed in other orders. Moreover, at least some of the steps in fig. 1 may include multiple sub-steps or stages that are not necessarily performed at the same moment but may be performed at different moments, and the execution order of these sub-steps or stages is not necessarily sequential; they may be performed in turn or alternately with at least a portion of other steps or of the sub-steps or stages of other steps.
Fig. 3 shows a block diagram of a data processing apparatus according to an embodiment of the present disclosure. As shown in fig. 3, the apparatus includes: a first determination module 41, a first quantization module 42, a convolution processing module 43 and an inverse quantization processing module 44.
The first determining module 41 determines a pair of cutoff thresholds from a plurality of pairs of cutoff thresholds according to a mean value of absolute values of quantized data obtained by performing quantization processing on data to be quantized by using the plurality of pairs of cutoff thresholds, wherein the data to be quantized is a group of data in a winograd convolution processing process, and each pair of cutoff thresholds in the plurality of pairs of cutoff thresholds comprises a first cutoff threshold and a second cutoff threshold, and the second cutoff threshold is a minimum value in the data to be quantized;
the first quantization module 42 performs quantization processing on the data to be quantized according to the determined pair of truncation thresholds, so as to obtain quantized data;
the convolution processing module 43 continues to execute winograd convolution processing according to the quantized data to obtain a quantized winograd convolution result;
The dequantization processing module 44 performs dequantization processing on the quantized winograd convolution result to obtain a winograd convolution result.
According to the data processing apparatus of the present disclosure, the calculation time of the winograd convolution is saved and the energy consumption is reduced while the quantization precision is improved.
In one possible implementation manner, the first determining module is further configured to:
in the current round of searching, quantizing the data to be quantized by adopting a plurality of pairs of cutoff thresholds to obtain a plurality of groups of quantized data;
determining a pair of to-be-selected truncation thresholds from the plurality of pairs of truncation thresholds according to the difference between the mean of the absolute values of each group of quantized data and the mean of the absolute values of the data to be quantized;
when the current round of search meets the continued search condition, determining new pairs of truncation thresholds according to the to-be-selected truncation thresholds, and carrying out the next round of search according to the new pairs of truncation thresholds, until the current round of search does not meet the continued search condition;
and when the current round of search does not meet the continued search condition, determining the to-be-selected truncation thresholds as the pair of truncation thresholds determined from the plurality of pairs of truncation thresholds.
In one possible implementation, the continued search condition includes at least one of:
The number of searches that have been performed is less than a predetermined total number of searches;
the difference between the mean of the absolute values of each group of quantized data in the current round of search and the mean of the absolute values of the data to be quantized is greater than or equal to a difference threshold.
In one possible implementation, the apparatus further includes:
a second determining module for determining the maximum absolute value of the data in the data to be quantized;
a third determining module, configured to determine a plurality of first truncation thresholds based on the maximum absolute value and the number of truncation threshold pairs;
and a fourth determining module, configured to obtain the plurality of pairs of cutoff thresholds according to the plurality of first cutoff thresholds and the second cutoff threshold.
In one possible implementation, the winograd convolution process includes:
disassembling the winograd positive transform of the data to be operated on into summation operations, and performing the calculation to obtain the winograd positive transform result of each piece of data in the data to be operated on;
performing para-position multiplication between the winograd positive transform results of corresponding data in the data to be operated on to obtain a para-position multiplication result;
disassembling the winograd inverse transform of the para-position multiplication result into summation operations to obtain the winograd convolution result,
The data to be quantized is any one of the data to be operated, winograd positive transformation results of the data to be operated and the para-multiplication results.
In one possible implementation, the first quantization module is further configured to perform any one of the following operations:
before the winograd positive transform of the data to be operated on is disassembled into summation operations, quantizing the data to be operated on as the data to be quantized;
before the para-position multiplication is performed, quantizing the winograd positive transform result of each piece of data in the data to be operated on as the data to be quantized;
and before the winograd inverse transform is disassembled into summation operations, quantizing the para-position multiplication result as the data to be quantized.
In one possible implementation manner, the disassembling the winograd positive transforms of the data to be operated into sum operations, and performing computation to obtain winograd positive transform results of each data in the data to be operated includes:
Respectively disassembling each data in the data to be operated into a plurality of first sub-tensors, carrying out winograd positive transformation on the plurality of first sub-tensors of each data in the data to be operated and summing to obtain winograd positive transformation results of each data in the data to be operated,
The number of the plurality of first sub-tensors split out of each data is the same as the number of elements which are not 0 in the corresponding data, one element in each first sub-tensor is the same as the element in the corresponding position in the corresponding data, and other elements are all 0.
In one possible implementation manner, the disassembling the winograd inverse transforms of the para-multiplication result into a sum operation, to obtain the winograd convolution result includes:
Disassembling the para-multiplication result into a plurality of second sub-tensors, performing winograd inverse transformation on the plurality of second sub-tensors, and summing to obtain a winograd convolution result of the data to be calculated;
The number of the second sub-tensors is the same as the number of elements which are not 0 in the para-position multiplication result, one element in each of the second sub-tensors is the same as the element in the corresponding position in the para-position multiplication result, and other elements are all 0.
In one possible implementation, the data to be operated on includes at least one of an input neuron, a weight, and a gradient.
It should be understood that the above-described apparatus embodiments are merely illustrative, and the apparatus of the present disclosure may be implemented in other ways. For example, the division of units/modules in the above embodiments is merely a division by logical function, and other division manners are possible in actual implementation. For example, multiple units, modules, or components may be combined or integrated into another system, or some features may be omitted or not performed.
In addition, each functional unit/module in the embodiments of the present disclosure may be integrated into one unit/module, or each unit/module may exist alone physically, or two or more units/modules may be integrated together, unless otherwise specified. The integrated units/modules described above may be implemented either in hardware or in software program modules.
The integrated units/modules, if implemented in hardware, may be digital circuits, analog circuits, and the like. Physical implementations of hardware structures include, but are not limited to, transistors, memristors, and the like. Unless otherwise indicated, the memory unit may be any suitable magnetic or magneto-optical storage medium, such as resistive random access memory (RRAM), dynamic random access memory (DRAM), static random access memory (SRAM), enhanced dynamic random access memory (EDRAM), high-bandwidth memory (HBM), hybrid memory cube (HMC), and the like.
The integrated units/modules may be stored in a computer-readable memory if implemented in the form of software program modules and sold or used as a stand-alone product. Based on such understanding, the technical solution of the present disclosure, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a memory and comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods described in the various embodiments of the present disclosure. The aforementioned memory includes: a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, an optical disk, or other media capable of storing program code.
The data processing method according to the embodiments of the present disclosure may be applied to a processor, which may be a general-purpose processor, such as a central processing unit (CPU), or an artificial intelligence processor (IPU) for performing artificial intelligence operations. The artificial intelligence operations may include machine learning operations, brain-like operations, and the like. The machine learning operations include neural network operations, k-means operations, support vector machine operations, and the like. The artificial intelligence processor may include, for example, one or a combination of a graphics processing unit (GPU), a neural-network processing unit (NPU), a digital signal processor (DSP), and a field-programmable gate array (FPGA) chip. The present disclosure does not limit the specific type of processor.
In one possible implementation, the processor referred to in this disclosure may include multiple processing units, each of which may independently execute the tasks assigned to it, such as a convolution operation task, a pooling task, or a fully-connected task. The present disclosure does not limit the tasks executed by the processing units.
The processor includes a plurality of processing units for executing sequences of instructions, and a memory unit for storing data, which may include a random access memory (RAM) and a register file. Multiple processing units in the processor may share part of the memory space, for example part of the RAM space and the register file, and may also have their own separate memory spaces.
In one possible implementation, an artificial intelligence chip is also disclosed, which includes the above-described data processing apparatus.
In one possible implementation, a board card is also disclosed, which includes a storage device, an interface device, a control device, and the above-described artificial intelligence chip; wherein the artificial intelligence chip is connected to the storage device, the control device, and the interface device, respectively; the storage device is used for storing data; the interface device is used for implementing data transmission between the artificial intelligence chip and external equipment; and the control device is used for monitoring the state of the artificial intelligence chip.
Fig. 4 shows a block diagram of a board card according to an embodiment of the present disclosure. Referring to Fig. 4, the board card may further include other supporting components in addition to the chip 389, and the supporting components include, but are not limited to: a storage device 390, an interface device 391, and a control device 392.
The storage device 390 is connected to the artificial intelligence chip through a bus and is used for storing data. The storage device may include multiple groups of storage units 393, each group of storage units being connected to the artificial intelligence chip through a bus. It can be understood that each group of storage units may be DDR SDRAM (Double Data Rate Synchronous Dynamic Random Access Memory).
DDR can double the speed of SDRAM without increasing the clock frequency, because it allows data to be read on both the rising and falling edges of the clock pulse; DDR is thus twice as fast as standard SDRAM. In one embodiment, the storage device may include four groups of storage units, and each group of storage units may include a plurality of DDR4 chips. In one embodiment, the artificial intelligence chip may include four 72-bit DDR4 controllers, where 64 bits of each 72-bit DDR4 controller are used to transfer data and 8 bits are used for ECC checking. It can be understood that when DDR4-3200 chips are used in each group of storage units, the theoretical bandwidth of data transfer can reach 25600 MB/s (3200 MT/s x 8 bytes per transfer over a 64-bit data path).
In one embodiment, each group of storage units includes a plurality of double data rate synchronous dynamic random access memories arranged in parallel. DDR can transfer data twice in one clock cycle. A controller for controlling the DDR is provided in the chip, and is used for controlling the data transfer and data storage of each storage unit.
The interface device is electrically connected to the artificial intelligence chip and is used for implementing data transmission between the artificial intelligence chip and an external device (such as a server or a computer). For example, in one embodiment, the interface device may be a standard PCIE interface, and the data to be processed is transferred from the server to the chip through the standard PCIE interface to implement data transfer. Preferably, when a PCIE 3.0 x16 interface is used for transmission, the theoretical bandwidth can reach 16000 MB/s. In another embodiment, the interface device may be another interface; the present disclosure does not limit the specific form of the other interface, provided that the interface unit can implement the transfer function. In addition, the computation results of the artificial intelligence chip are transmitted back to the external device (such as a server) by the interface device.
The control device is electrically connected to the artificial intelligence chip and is used for monitoring the state of the artificial intelligence chip. Specifically, the artificial intelligence chip and the control device may be electrically connected through an SPI interface. The control device may include a micro controller unit (MCU). The artificial intelligence chip may include multiple processing chips, multiple processing cores, or multiple processing circuits, and may drive multiple loads; therefore, the artificial intelligence chip can be in different working states such as multi-load and light-load. The control device can regulate the working states of the multiple processing chips, multiple processing cores, and/or multiple processing circuits in the artificial intelligence chip.
In one possible implementation, an electronic device is disclosed that includes the artificial intelligence chip described above. The electronic device includes a data processing apparatus, a robot, a computer, a printer, a scanner, a tablet, an intelligent terminal, a mobile phone, a driving recorder, a navigator, a sensor, a camera, a server, a cloud server, a still camera, a video camera, a projector, a watch, an earphone, a mobile storage device, a wearable device, a vehicle, a household appliance, and/or a medical device. The vehicle includes an aircraft, a ship, and/or a car; the household appliance includes a television, an air conditioner, a microwave oven, a refrigerator, an electric cooker, a humidifier, a washing machine, an electric lamp, a gas stove, and a range hood; the medical device includes a nuclear magnetic resonance apparatus, a B-mode ultrasound apparatus, and/or an electrocardiograph.
The disclosed embodiments also provide a computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the above-described method. The computer readable storage medium may be a non-volatile computer readable storage medium.
The embodiment of the disclosure also provides an electronic device, which comprises: a processor; a memory for storing processor-executable instructions; wherein the processor is configured to invoke the instructions stored in the memory to perform the above method.
Fig. 5 illustrates a block diagram of an electronic device 800, according to an embodiment of the disclosure. For example, electronic device 800 may be a mobile phone, computer, digital broadcast terminal, messaging device, game console, tablet device, medical device, exercise device, personal digital assistant, or the like.
Referring to fig. 5, an electronic device 800 may include one or more of the following components: a processing component 802, a memory 804, a power component 806, a multimedia component 808, an audio component 810, an input/output (I/O) interface 812, a sensor component 814, and a communication component 816.
The processing component 802 generally controls overall operation of the electronic device 800, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing component 802 may include one or more processors 820 to execute instructions to perform all or part of the steps of the methods described above. Further, the processing component 802 can include one or more modules that facilitate interactions between the processing component 802 and other components. For example, the processing component 802 can include a multimedia module to facilitate interaction between the multimedia component 808 and the processing component 802.
The memory 804 is configured to store various types of data to support operations at the electronic device 800. Examples of such data include instructions for any application or method operating on the electronic device 800, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 804 may be implemented by any type or combination of volatile or nonvolatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disk.
The power supply component 806 provides power to the various components of the electronic device 800. The power components 806 may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for the electronic device 800.
The multimedia component 808 includes a screen that provides an output interface between the electronic device 800 and the user. In some embodiments, the screen may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensors may sense not only the boundary of a touch or swipe action, but also the duration and pressure associated with the touch or swipe operation. In some embodiments, the multimedia component 808 includes a front camera and/or a rear camera. When the electronic device 800 is in an operational mode, such as a shooting mode or a video mode, the front camera and/or the rear camera may receive external multimedia data. Each front camera and rear camera may be a fixed optical lens system or have focal length and optical zoom capability.
The audio component 810 is configured to output and/or input audio signals. For example, the audio component 810 includes a Microphone (MIC) configured to receive external audio signals when the electronic device 800 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may be further stored in the memory 804 or transmitted via the communication component 816. In some embodiments, audio component 810 further includes a speaker for outputting audio signals.
The I/O interface 812 provides an interface between the processing component 802 and peripheral interface modules, which may be a keyboard, click wheel, buttons, etc. These buttons may include, but are not limited to: homepage button, volume button, start button, and lock button.
The sensor assembly 814 includes one or more sensors for providing status assessments of various aspects of the electronic device 800. For example, the sensor assembly 814 may detect the on/off state of the electronic device 800 and the relative positioning of components, such as the display and keypad of the electronic device 800; the sensor assembly 814 may also detect a change in position of the electronic device 800 or of a component of the electronic device 800, the presence or absence of user contact with the electronic device 800, the orientation or acceleration/deceleration of the electronic device 800, and a change in temperature of the electronic device 800. The sensor assembly 814 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact. The sensor assembly 814 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 814 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 816 is configured to facilitate wired or wireless communication between the electronic device 800 and other devices. The electronic device 800 may access a wireless network based on a communication standard, such as WiFi, 2G, or 3G, or a combination thereof. In one exemplary embodiment, the communication component 816 receives broadcast signals or broadcast-related information from an external broadcast management system via a broadcast channel. In one exemplary embodiment, the communication component 816 further includes a near field communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on radio frequency identification (RFID) technology, infrared data association (IrDA) technology, ultra wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the electronic device 800 may be implemented by one or more Application Specific Integrated Circuits (ASICs), digital Signal Processors (DSPs), digital Signal Processing Devices (DSPDs), programmable Logic Devices (PLDs), field Programmable Gate Arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic elements for executing the methods described above.
In an exemplary embodiment, a non-transitory computer readable storage medium is also provided, such as memory 804 including computer program instructions executable by processor 820 of electronic device 800 to perform the above-described methods.
Fig. 6 illustrates a block diagram of an electronic device 1900 according to an embodiment of the disclosure. For example, electronic device 1900 may be provided as a server. Referring to FIG. 6, electronic device 1900 includes a processing component 1922 that further includes one or more processors and memory resources represented by memory 1932 for storing instructions, such as application programs, that can be executed by processing component 1922. The application programs stored in memory 1932 may include one or more modules each corresponding to a set of instructions. Further, processing component 1922 is configured to execute instructions to perform the methods described above.
The electronic device 1900 may also include a power component 1926 configured to perform power management of the electronic device 1900, a wired or wireless network interface 1950 configured to connect the electronic device 1900 to a network, and an input/output (I/O) interface 1958. The electronic device 1900 may operate based on an operating system stored in the memory 1932, such as Windows ServerTM, Mac OS XTM, UnixTM, LinuxTM, FreeBSDTM, or the like.
In an exemplary embodiment, a non-transitory computer readable storage medium is also provided, such as memory 1932, including computer program instructions executable by processing component 1922 of electronic device 1900 to perform the methods described above.
In the foregoing embodiments, the description of each embodiment has its own emphasis; for parts of one embodiment that are not described in detail, reference may be made to the related descriptions of other embodiments. The technical features of the foregoing embodiments may be combined arbitrarily; for brevity, not all possible combinations of the technical features are described, but all such combinations should be considered within the scope of this disclosure.
The foregoing may be better understood in light of the following clauses:
Clause A1. A data processing method, the method comprising:
determining a pair of cutoff thresholds from a plurality of pairs of cutoff thresholds according to the mean of the absolute values of quantized data obtained by quantizing the data to be quantized with the plurality of pairs of cutoff thresholds, where the data to be quantized is a group of data in a winograd convolution process, each pair of cutoff thresholds in the plurality of pairs of cutoff thresholds includes a first cutoff threshold and a second cutoff threshold, and the second cutoff threshold is the minimum value in the data to be quantized;
performing quantization processing on the data to be quantized according to the determined pair of cutoff thresholds to obtain quantized data;
performing winograd convolution processing on the quantized data to obtain a quantized winograd convolution result;
and performing dequantization processing on the quantized winograd convolution result to obtain a winograd convolution result.
Clause A2. The method of clause A1, wherein determining a pair of cutoff thresholds from the plurality of pairs of cutoff thresholds according to the mean of the absolute values of quantized data obtained by quantizing the data to be quantized with the plurality of pairs of cutoff thresholds includes:
in the current search round, quantizing the data to be quantized with the plurality of pairs of cutoff thresholds to obtain a plurality of groups of quantized data;
determining a pair of candidate cutoff thresholds from the plurality of pairs of cutoff thresholds according to the difference between the mean of the absolute values of each group of quantized data and the mean of the absolute values of the data to be quantized;
when the current search round satisfies the continued-search condition, determining a new plurality of pairs of cutoff thresholds according to the candidate cutoff thresholds, and performing the next search round with the new plurality of pairs of cutoff thresholds, until the current search round no longer satisfies the continued-search condition;
and when the current search round does not satisfy the continued-search condition, determining the candidate cutoff thresholds as the pair of cutoff thresholds determined from the plurality of pairs of cutoff thresholds.
Clause A3. The method of clause A2, wherein the continued-search condition includes at least one of the following:
the number of search rounds already performed is less than a predetermined total number of searches;
the difference between the mean of the absolute values of each group of quantized data in the current search round and the mean of the absolute values of the data to be quantized is greater than or equal to a difference threshold. One possible reading of this search is sketched below.
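Clauses A2 and A3 describe an iterative search; the sketch below is one hedged reading of it: in each round the data are quantized with every candidate pair, the pair whose quantized-then-dequantized data has a mean absolute value closest to that of the original data is kept, new candidates are generated around it, and the loop stops when either the round limit or the difference criterion is reached. Comparing the dequantized data, the refinement strategy, and all function names are assumptions made only for illustration.

```python
import numpy as np

def quantize_dequantize(data, first, second, bits=8):
    """Quantize with a (first, second) cutoff pair, then dequantize."""
    levels = 2 ** bits - 1
    scale = (first - second) / levels
    q = np.round((np.clip(data, second, first) - second) / scale)
    return q * scale + second

def search_pair(data, num_pairs=8, max_rounds=10, diff_threshold=1e-3):
    """Iteratively search for the cutoff pair that best preserves mean |x|."""
    data = np.asarray(data, dtype=np.float32)
    target = np.mean(np.abs(data))
    second = float(np.min(data))
    lo, hi = 0.0, float(np.max(np.abs(data)))
    best = (hi, second)
    for _ in range(max_rounds):                      # continued-search condition 1
        firsts = np.linspace(lo, hi, num_pairs + 2)[1:-1]
        diffs = [abs(np.mean(np.abs(quantize_dequantize(data, f, second))) - target)
                 for f in firsts]
        k = int(np.argmin(diffs))
        best = (float(firsts[k]), second)
        if diffs[k] < diff_threshold:                # continued-search condition 2
            break
        # narrow the search interval around the selected candidate pair
        lo = float(firsts[max(k - 1, 0)])
        hi = float(firsts[min(k + 1, len(firsts) - 1)])
    return best

data = (np.random.randn(1024) * 2.0).astype(np.float32)
print(search_pair(data))
```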
Clause A4. The method of clause A1 or A2, further comprising:
determining the maximum absolute value among the data to be quantized;
determining a plurality of first cutoff thresholds according to the maximum absolute value and the number of cutoff-threshold pairs;
and obtaining the plurality of pairs of cutoff thresholds according to the plurality of first cutoff thresholds and the second cutoff threshold.
Clause A5. The method of clause A1, wherein the winograd convolution process includes:
decomposing the winograd forward transform of the data to be operated on into summation operations, and performing the calculation to obtain a winograd forward transform result of each piece of data in the data to be operated on;
performing element-wise multiplication between the winograd forward transform results of the corresponding data in the data to be operated on to obtain an element-wise multiplication result;
and decomposing the winograd inverse transform of the element-wise multiplication result into summation operations to obtain the winograd convolution result,
where the data to be quantized is any one of the data to be operated on, the winograd forward transform results of the data to be operated on, and the element-wise multiplication result.
Clause A6. The method of clause A5, wherein performing quantization processing on the data to be quantized according to the determined pair of cutoff thresholds to obtain quantized data includes any one of the following operations:
before the winograd forward transform of the data to be operated on is decomposed into summation operations, performing quantization processing with the data to be operated on as the data to be quantized;
before the element-wise multiplication is performed, performing quantization processing with the winograd forward transform result of each piece of data in the data to be operated on as the data to be quantized;
and before the winograd inverse transform is decomposed into summation operations, performing quantization processing with the element-wise multiplication result as the data to be quantized.
Clause A7. The method of clause A5, wherein decomposing the winograd forward transform of the data to be operated on into summation operations and performing the calculation to obtain the winograd forward transform result of each piece of data in the data to be operated on includes:
decomposing each piece of data in the data to be operated on into a plurality of first sub-tensors, performing the winograd forward transform on the plurality of first sub-tensors of each piece of data, and summing the results to obtain the winograd forward transform result of each piece of data in the data to be operated on,
where the number of first sub-tensors decomposed from a piece of data is the same as the number of non-zero elements in that data, one element in each first sub-tensor is identical to the element at the corresponding position in the original data, and all other elements are 0.
Clause A8. The method of clause A5, wherein decomposing the winograd inverse transform of the element-wise multiplication result into summation operations to obtain the winograd convolution result includes:
decomposing the element-wise multiplication result into a plurality of second sub-tensors, performing the winograd inverse transform on the plurality of second sub-tensors, and summing the results to obtain the winograd convolution result of the data to be operated on;
where the number of second sub-tensors is the same as the number of non-zero elements in the element-wise multiplication result, one element in each second sub-tensor is identical to the element at the corresponding position in the element-wise multiplication result, and all other elements are 0.
Clause A9. The method of clause A5, wherein the data to be operated on includes at least one of an input neuron, a weight, and a gradient.
Clause A10. A data processing apparatus, the apparatus comprising:
a first determining module, configured to determine a pair of cutoff thresholds from a plurality of pairs of cutoff thresholds according to the mean of the absolute values of quantized data obtained by quantizing the data to be quantized with the plurality of pairs of cutoff thresholds, where the data to be quantized is a group of data in a winograd convolution process, each pair of cutoff thresholds in the plurality of pairs of cutoff thresholds includes a first cutoff threshold and a second cutoff threshold, and the second cutoff threshold is the minimum value in the data to be quantized;
a first quantization module, configured to perform quantization processing on the data to be quantized according to the determined pair of cutoff thresholds to obtain quantized data;
a convolution processing module, configured to continue the winograd convolution processing according to the quantized data to obtain a quantized winograd convolution result;
and a dequantization processing module, configured to perform dequantization processing on the quantized winograd convolution result to obtain a winograd convolution result.
Clause A11. The apparatus of clause A10, wherein the first determining module is further configured to:
in the current search round, quantize the data to be quantized with the plurality of pairs of cutoff thresholds to obtain a plurality of groups of quantized data;
determine a pair of candidate cutoff thresholds from the plurality of pairs of cutoff thresholds according to the difference between the mean of the absolute values of each group of quantized data and the mean of the absolute values of the data to be quantized;
when the current search round satisfies the continued-search condition, determine a new plurality of pairs of cutoff thresholds according to the candidate cutoff thresholds, and perform the next search round with the new plurality of pairs of cutoff thresholds, until the current search round no longer satisfies the continued-search condition;
and when the current search round does not satisfy the continued-search condition, determine the candidate cutoff thresholds as the pair of cutoff thresholds determined from the plurality of pairs of cutoff thresholds.
Clause A12. The apparatus of clause A11, wherein the continued-search condition includes at least one of the following:
the number of search rounds already performed is less than a predetermined total number of searches;
the difference between the mean of the absolute values of each group of quantized data in the current search round and the mean of the absolute values of the data to be quantized is greater than or equal to a difference threshold.
Clause A13. The apparatus of clause A10 or A11, further comprising:
a second determining module, configured to determine the maximum absolute value among the data to be quantized;
a third determining module, configured to determine a plurality of first cutoff thresholds according to the maximum absolute value and the number of cutoff-threshold pairs;
and a fourth determining module, configured to obtain the plurality of pairs of cutoff thresholds according to the plurality of first cutoff thresholds and the second cutoff threshold.
Clause A14. The apparatus of clause A10, wherein the winograd convolution process includes:
decomposing the winograd forward transform of the data to be operated on into summation operations, and performing the calculation to obtain a winograd forward transform result of each piece of data in the data to be operated on;
performing element-wise multiplication between the winograd forward transform results of the corresponding data in the data to be operated on to obtain an element-wise multiplication result;
and decomposing the winograd inverse transform of the element-wise multiplication result into summation operations to obtain the winograd convolution result,
where the data to be quantized is any one of the data to be operated on, the winograd forward transform results of the data to be operated on, and the element-wise multiplication result.
Clause A15. The apparatus of clause A14, wherein the first quantization module is further configured to perform any one of the following operations:
before the winograd forward transform of the data to be operated on is decomposed into summation operations, performing quantization processing with the data to be operated on as the data to be quantized;
before the element-wise multiplication is performed, performing quantization processing with the winograd forward transform result of each piece of data in the data to be operated on as the data to be quantized;
and before the winograd inverse transform is decomposed into summation operations, performing quantization processing with the element-wise multiplication result as the data to be quantized.
Clause A16. The apparatus of clause A14, wherein decomposing the winograd forward transform of the data to be operated on into summation operations and performing the calculation to obtain the winograd forward transform result of each piece of data in the data to be operated on includes:
decomposing each piece of data in the data to be operated on into a plurality of first sub-tensors, performing the winograd forward transform on the plurality of first sub-tensors of each piece of data, and summing the results to obtain the winograd forward transform result of each piece of data in the data to be operated on,
where the number of first sub-tensors decomposed from a piece of data is the same as the number of non-zero elements in that data, one element in each first sub-tensor is identical to the element at the corresponding position in the original data, and all other elements are 0.
Clause A17. The apparatus of clause A14, wherein decomposing the winograd inverse transform of the element-wise multiplication result into summation operations to obtain the winograd convolution result includes:
decomposing the element-wise multiplication result into a plurality of second sub-tensors, performing the winograd inverse transform on the plurality of second sub-tensors, and summing the results to obtain the winograd convolution result of the data to be operated on;
where the number of second sub-tensors is the same as the number of non-zero elements in the element-wise multiplication result, one element in each second sub-tensor is identical to the element at the corresponding position in the element-wise multiplication result, and all other elements are 0.
Clause A18. The apparatus of clause A14, wherein the data to be operated on includes at least one of an input neuron, a weight, and a gradient.
Clause A19. An artificial intelligence chip, comprising the data processing apparatus of any one of clauses A10 to A18.
Clause A20. An electronic device, comprising the artificial intelligence chip of clause A19.
Clause A21. A board card, comprising: a storage device, an interface device, a control device, and the artificial intelligence chip of clause A19;
wherein the artificial intelligence chip is connected to the storage device, the control device, and the interface device, respectively;
the storage device is used for storing data;
the interface device is used for implementing data transmission between the artificial intelligence chip and external equipment;
and the control device is used for monitoring the state of the artificial intelligence chip.
Clause A22. The board card of clause A21, wherein
the storage device includes multiple groups of storage units, each group of storage units being connected to the artificial intelligence chip through a bus, and the storage units being DDR SDRAM;
the chip includes a DDR controller for controlling the data transfer and data storage of each storage unit;
and the interface device is a standard PCIE interface.
Clause A23. An electronic device, comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to invoke the instructions stored in the memory to perform the method of any one of clauses A1 to A9.
Clause A24. A computer-readable storage medium, having stored thereon computer program instructions which, when executed by a processor, implement the method of any one of clauses A1 to A9.
The embodiments of the present disclosure have been described in detail above, and specific examples have been used herein to explain the principles and implementations of the present disclosure; the description of the above embodiments is intended only to help in understanding the method of the present disclosure and its core ideas. Meanwhile, those skilled in the art will recognize that modifications or variations made, in light of the ideas of the present disclosure, to the specific embodiments and the scope of application are within the scope of protection of the present disclosure. In view of the foregoing, this description should not be construed as limiting the disclosure.