
Data processing method for neural network system and neural network system

Info

Publication number
CN112215331A
CN112215331A
Authority
CN
China
Prior art keywords
weight value
weight
value set
neural network
values
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910619898.3A
Other languages
Chinese (zh)
Inventor
蒋磊
程捷
杨文斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd
Priority to CN201910619898.3A
Publication of CN112215331A
Legal status: Pending

Abstract

(Translated from Chinese)

The present application provides a data processing method and a processing device for a neural network system, applied to neural network systems in the field of artificial intelligence. The method determines a quantization mode according to a target quantization bit width and the range of weight values in a weight value set to be quantized, and quantizes the weight values in that set according to the determined quantization mode. The weight value set to be quantized includes the initial weight values of one convolution kernel group among a plurality of convolution kernel groups in the neural network system, where the convolution kernel group includes a plurality of convolution kernels from different convolution layers, or some of the convolution kernels in the same convolution layer. Data input into the neural network system is then computed based on the weight values in the quantized weight value set. The method can improve data precision after low-bit-width quantization in a neural network system.


Description

Data processing method for neural network system and neural network system
Technical Field
The present application relates to the field of neural networks, and in particular, to a data processing method in a neural network system and a neural network system.
Background
Artificial intelligence (AI) is the theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use that knowledge to obtain the best results. Deep learning (DL), an important branch of artificial intelligence, has received wide attention and deep study in academia and industry, producing not only many theoretical innovations but also many practical industrial applications, such as image processing, speech recognition and motion analysis. Deep learning uses neural networks that simulate the structure of the human brain and can achieve better recognition results than traditional shallow learning methods.
A better-performing neural network typically has larger-scale model parameters and thus higher computational complexity, so compressing and accelerating the neural network becomes very important. An effective low-bit-width weight quantization method reduces the bit width of the weight matrix while keeping the precision loss as low as possible, which effectively compresses the model and reduces the storage requirements of the neural network. Therefore, how to guarantee the precision of quantized data while reducing the weight bit width is an urgent problem to be solved in the technical field of neural networks.
Disclosure of Invention
The application provides a data processing method in a neural network system and the neural network system, which can improve the precision of quantized data in the neural network system.
In a first aspect, a method for processing data in a neural network system is provided, including: determining a first quantization mode according to a target quantization bit width and a range of weight values in a first initial weight value set, where the first initial weight value set includes the initial weight values of a first convolution kernel group among a plurality of convolution kernel groups in the neural network system, the first convolution kernel group includes one or more convolution kernels, and the first convolution kernel group includes a plurality of convolution kernels in different convolution layers or includes some of the convolution kernels in the same convolution layer; quantizing the weight values in the first initial weight value set according to the first quantization mode to obtain a first target weight value set of the first convolution kernel group; and calculating data input into the neural network system according to the weight values in the first target weight value set.
In the embodiments of the application, a plurality of convolution kernels of a neural network can be divided into a plurality of convolution kernel groups, where convolution kernels in the same group can come from different convolution layers, or a group can include only some of the convolution kernels in the same convolution layer. The convolution kernels are thus grouped flexibly during quantization, and a quantization mode is determined separately according to the weight distribution of each convolution kernel group. Because the groups are not constrained by convolution-layer boundaries, a dedicated quantization mode can be determined for each group, the convolution kernels or kernel groups can be quantized in a more targeted way, the weight space of each group is fully utilized for quantization, and the accuracy of quantized data in the neural network system can be improved under low-bit-width quantization. In addition, since the precision of the weight values processed by the weight processing method provided by the embodiments of the application is improved, the output data obtained when the neural network system computes input data based on the weight values in the first target weight value set is more accurate.
With reference to the first aspect, in one possible implementation manner, the first convolution kernel group includes a plurality of convolution kernels with similar weight distributions.
In the embodiment of the application, the convolution kernels with similar weight distribution are divided into the same convolution kernel group, and respective quantization modes are determined for different convolution kernel groups, so that the convolution kernels or the convolution kernel groups can be quantized more specifically, the weight expression capability is improved, the precision loss in the quantization process is reduced, and the precision of quantized data is improved.
With reference to the first aspect, in one possible implementation manner, the first convolution kernel group satisfies at least one of the following conditions: the difference value of the maximum values of the weight values of any two convolution kernels in the first convolution kernel group is smaller than a first preset value; or the difference value of the minimum values of the weight values of any two convolution kernels in the first convolution kernel group is smaller than a second preset value; or the difference value of the average values of the weight values of any two convolution kernels in the first convolution kernel group is smaller than a third preset value; or the difference value of the variances of the weight values of any two convolution kernels in the first convolution kernel group is smaller than a fourth preset value.
In the embodiment of the application, convolution kernels with similar weight distribution are divided into the same convolution kernel group, for example, the convolution kernel group is divided according to the maximum value, the minimum value, the average value or the variance of the weight values of the convolution kernels, and respective quantization modes are determined for different convolution kernel groups, so that the convolution kernels or the convolution kernel groups can be quantized more specifically, the weight expression capability is improved, the precision loss in the quantization process is reduced, and the precision of quantized data is improved.
With reference to the first aspect, in a possible implementation manner, quantizing the first initial weight value set according to the first quantization manner includes: dividing the weight values in the first initial weight value set into a plurality of subsets according to the first quantization mode; and quantizing the weight values in each of the plurality of subsets to obtain the first target weight value set.
With reference to the first aspect, in a possible implementation manner, the first quantization manner is a uniform quantization manner, and dividing the weight values in the first initial weight value set into a plurality of subsets according to the first quantization manner includes: uniformly dividing the weight values in the first initial weight value set into the plurality of subsets using a first quantization step size.
With reference to the first aspect, in a possible implementation manner, the first quantization manner is a non-uniform quantization manner, and dividing the weight values in the first initial weight value set into a plurality of subsets according to the first quantization manner includes: dividing the weight values in the first initial weight value set into the plurality of subsets according to a plurality of quantization step sizes determined by the first quantization mode.
In the embodiment of the application, weights in the initial weight value set can be divided and quantized in a non-uniform quantization mode, so that the length of a weight space can be utilized, the weight expression capacity is improved, the precision loss after quantization is reduced, and the precision of quantized data is improved.
With reference to the first aspect, in a possible implementation manner, the method further includes: performing forward inference calculation on the weight values in the first target weight value set; determining, according to the result of the forward inference calculation, that the weight values of the first target weight value set do not meet a preset condition; adjusting the weight values in the first initial weight value set to obtain a second initial weight value set of the first convolution kernel group; determining a second quantization mode according to the range of the weight values in the second initial weight value set; and quantizing the weight values in the second initial weight value set according to the second quantization mode to obtain a second target weight value set of the first convolution kernel group.
In the embodiments of the application, iterative training is performed on the obtained first target weight value set, and the weight values and the quantization mode of the initial weight value set of the first convolution kernel group are continuously adjusted during training, so that the precision loss after neural network quantization is reduced as much as possible and the precision of the quantized data is improved.
With reference to the first aspect, in a possible implementation manner, determining, according to the result of the forward inference calculation, that the weight values of the first target weight value set do not satisfy the preset condition includes: determining that the weight values of the first target weight value set do not meet the preset condition when the error between the output value of the forward inference calculation and the true value is greater than a first preset threshold.
With reference to the first aspect, in a possible implementation manner, forward inference calculation is performed on the weight values of the first target weight value set in the neural network system, where the neural network system includes any one of the following processing devices: a graphics processing unit (GPU), a field programmable gate array (FPGA), or a memristor crossbar array.
With reference to the first aspect, in a possible implementation manner, the method further includes: performing forward inference calculation on the weight values in the first target weight value set; determining, according to the result of the forward inference calculation on the weight values of the first target weight value set, that these weight values do not meet a preset condition; and adjusting the weight values of the first target weight value set to obtain a second target weight value set of the first convolution kernel group.
With reference to the first aspect, in a possible implementation manner, the method further includes: determining a third quantization mode for a second convolution kernel group of the plurality of convolution kernel groups according to the range of the initial weight value set of the second convolution kernel group, the third quantization mode being different from the first quantization mode; and quantizing the initial weight value set of the second convolution kernel group according to the third quantization mode to obtain a target weight value set of the second convolution kernel group.
In a second aspect, there is provided a neural network system comprising means for implementing the first aspect described above and various possible implementations of the first aspect.
In a third aspect, a chip is provided, on which the neural network system as described in the second aspect or any one of the possible implementations of the second aspect is disposed.
In a fourth aspect, a neural network system is provided, where the neural network system includes a memory and a processor connected to the memory, the memory is configured to store computer instructions, and the processor is configured to execute the computer instructions to perform the method of the first aspect or any possible implementation manner of the first aspect.
In a fifth aspect, a computer-readable storage medium is provided, the computer-readable storage medium being configured to store program code comprising instructions for performing the method of the first aspect or any one of the possible implementations of the first aspect.
Drawings
Fig. 1 is a schematic structural diagram of a neural network system according to an embodiment of the present application.
Fig. 2 is a schematic diagram of a convolution kernel according to an embodiment of the present application.
Fig. 3 is a schematic diagram of a neural network system according to yet another embodiment of the present application.
Fig. 4 is a schematic structural diagram of a ReRAM chip according to an embodiment of the present application.
Fig. 5 is a flowchart illustrating a data processing method in a neural network system according to an embodiment of the present application.
Fig. 6 is a flowchart illustrating a data processing method in a neural network system according to another embodiment of the present application.
Fig. 7 is a flowchart illustrating a data processing method in a neural network system according to another embodiment of the present application.
Fig. 8 is a flowchart illustrating a data processing method in a neural network system according to another embodiment of the present application.
Fig. 9 is a schematic structural diagram of a neural network system according to another embodiment of the present application.
Detailed Description
The technical solution in the present application will be described below with reference to the accompanying drawings.
The architecture and operation of the artificial neural network system according to the embodiments of the present application will be described with reference to fig. 1. An artificial neural network (ANN), often simply called a neural network (NN), is a mathematical or computational model in machine learning and cognitive science that imitates the structure and function of a biological neural network (the central nervous system of animals, especially the brain) and is used to estimate or approximate functions. Artificial neural networks include the convolutional neural network (CNN), the deep neural network (DNN), the multilayer perceptron (MLP), and so on.
Fig. 1 is a schematic structural diagram of a neural network system according to an embodiment of the present disclosure. As shown in fig. 1, the neural network system 100 may include a host 105 and a neural network circuit 110. The neural network circuit 110 is connected to the host 105 through a host interface. The host interface may include a standard host interface as well as a network interface. For example, the host interface may include a Peripheral Component Interconnect Express (PCIE) interface. As shown in fig. 1, the neural network circuit 110 may be connected to the host 105 through the PCIE bus 106, so that data can be input into the neural network circuit 110 through the PCIE bus 106, and the data processed by the neural network circuit 110 is received through the PCIE bus 106. Also, the host 105 may monitor the operating state of the neural network circuit 110 through the host interface.
Host 105 may include a processor 1052 and a memory 1054. It should be noted that, in addition to the devices shown in fig. 1, the host 105 may further include other devices such as a communication interface and a magnetic disk serving as external storage, which is not limited herein.
The processor 1052 is the arithmetic core and control unit of the host 105. Processor 1052 may include multiple processor cores. Processor 1052 may be a very-large-scale integrated circuit. An operating system and other software programs are installed on processor 1052, so processor 1052 is able to access memory 1054, the cache, disks, and peripheral devices such as the neural network circuit in fig. 1. It is understood that, in the embodiments of the present invention, a core in the processor 1052 may be, for example, a central processing unit (CPU) or an application-specific integrated circuit (ASIC).
The memory 1054 is the main memory of the host 105. The memory 1054 is coupled to the processor 1052 via a double data rate (DDR) bus. Memory 1054 is typically used to store software running on the operating system, input and output data, and information exchanged with external storage. To increase the access speed of the processor 1052, the memory 1054 needs a high access speed. In a conventional computer system architecture, a dynamic random access memory (DRAM) is usually used as the memory 1054. The processor 1052 can access the memory 1054 at high speed through a memory controller (not shown in fig. 1) to perform read and write operations on any memory location in the memory 1054.
The neural network circuit 110 may be a chip array composed of a plurality of neural network chips. For example, as shown in fig. 1, the neural network circuit 110 includes a plurality of neural network chips 115 that perform data processing and a plurality of routers 120. For convenience of description, the neural network chip 115 is simply referred to as the chip 115 in the embodiments of this application. The chips 115 are connected to each other through the routers 120. For example, one chip 115 may be connected to one or more routers 120. The routers 120 may form one or more network topologies, and the chips 115 may exchange data through these topologies.
In some examples, the neural network circuit 110 may be implemented by resistive random-access memory (ReRAM) crossbar arrays based on analog computation. For example, each neural network chip 115 in the neural network circuit 110 may include one or more ReRAM crossbars. In embodiments of the present application, a ReRAM crossbar may also be referred to as a memristor crossbar array, a ReRAM device, or simply ReRAM. A chip including one or more ReRAM crossbars may be referred to as a ReRAM chip. The ReRAM crossbar is a completely new non-von Neumann computing architecture. It integrates storage and computation, is flexibly configurable, and, by using an analog computation mode, is expected to realize matrix-vector multiplication at higher speed and lower energy consumption than traditional computing architectures, with wide application prospects in deep network computation.
Note that the architecture of the neural network system 100 in fig. 1 is merely an example; those skilled in the art will appreciate that the neural network system 100 may in practice include more or fewer elements than those in fig. 1, and the modules, units or circuits in the neural network system 100 may be replaced by others with similar functions, which is not limited in this application. For example, in other examples, the neural network system 100 may also be implemented by a graphics processing unit (GPU) or a field programmable gate array (FPGA) based on digital computation.
The technical solutions of the embodiments of this application can be applied to various neural networks, such as convolutional neural networks (CNNs), recurrent neural networks widely used in natural language and speech processing, and deep neural networks that combine the two. The processing of convolutional neural networks resembles the animal visual system, making them very suitable for image recognition. Convolutional neural networks can be applied to image recognition fields such as security, computer vision and safe cities, and also to speech recognition, search engines, machine translation, and so on. In practical applications, the huge number of parameters and the amount of computation pose great challenges for applying neural networks in scenarios requiring high real-time performance and low power consumption. In an automatic driving scenario, for example, a CNN-based depth model must detect and identify vehicles and pedestrians within a millisecond response time, providing sufficient decision time for subsequent motion planning (e.g., avoidance); meanwhile, to ensure sufficient endurance of the autonomous vehicle, the power consumption of the deep neural network must be as low as possible. Therefore, how to complete the computation with the lowest possible power consumption and the highest possible speed is key to neural network applications.
The following continues with several terms that are referred to in the embodiments of the present application.
Convolution kernel (kernel): one convolution kernel represents one feature extraction method in the neural network computation process. For example, in image processing in a neural network system, given an input image, each pixel in the output image is a weighted average of the pixels in a small region of the input image, where the weights are defined by a function known as a convolution kernel. During computation, the convolution kernel passes over an input feature map with a certain stride to generate output data (also called an output feature map) after feature extraction. The convolution kernel size is therefore also used to indicate the amount of data that a compute node in the neural network system processes in one computation. Those skilled in the art will appreciate that a convolution kernel can be represented by a real matrix; for example, fig. 2 shows a 4-row, 3-column convolution kernel, where each element of the convolution kernel represents a weight value. In practice, a neural network layer may include a plurality of convolution kernels. In the neural network calculation process, multiply-add operations are performed between the input data and the convolution kernel.
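As an illustration of the sliding-window computation just described, the following minimal NumPy sketch slides one kernel over an input feature map with a given stride (the function name and array sizes are illustrative, not from the patent):

```python
import numpy as np

def conv2d_single(feature_map, kernel, stride=1):
    """Slide one kernel over an input feature map with the given stride and
    accumulate multiply-add results into an output feature map."""
    fh, fw = feature_map.shape
    kh, kw = kernel.shape  # e.g. the 4-row, 3-column kernel of fig. 2
    out = np.zeros(((fh - kh) // stride + 1, (fw - kw) // stride + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = feature_map[i * stride:i * stride + kh,
                                j * stride:j * stride + kw]
            out[i, j] = np.sum(patch * kernel)  # multiply-add of data and weights
    return out

kernel = np.random.randn(4, 3)  # each element is one weight value
print(conv2d_single(np.random.randn(8, 8), kernel).shape)  # (5, 6)
```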
A neural network layer: in the embodiment of the present application, the neural network layer is a logical layer concept, and one neural network layer means that a neural network operation is to be performed. The neural network system may include a plurality of neural network layers. Each layer of neural network calculation is realized by a calculation node. For example, the compute node may be implemented by a ReRAM chip. In practical applications, the neural network layer may include a convolutional layer, a pooling layer, and the like. For simplicity of description, the "neural network layer" in the present application may also be simply referred to as "layer".
Weight: may refer to the convolution kernel of a neural network layer. The weights of a neural network layer may include a plurality of convolution kernels. The elements in a weight are referred to as weight values.
Weight value: refers to an element in a convolution kernel; as shown in fig. 2, one element in the convolution kernel represents one weight value.
Quantization: general neural network models are computed with floating-point numbers. Quantization is the technique of approximating the original floating-point network by using integers of a certain bit width for the weights and feature maps of the model, with essentially no impact on model accuracy. For example, a 32-bit floating-point weight value may be quantized to a 4-bit or 8-bit weight value.
Quantization bit width: the number of bits used for weight values in the quantized neural network, typically 4 bits or 8 bits.
To obtain a neural network system with lower power consumption and higher speed, a common approach is to quantize the data in the neural network system. Specifically, the weight values in the neural network may be quantized to a low bit width to obtain weight values with lower precision that occupy less space. For example, a 32-bit floating-point number may be quantized to a 1-bit or 2-bit fixed-point number. Quantization effectively compresses the neural network model and reduces its storage and computation requirements, thereby increasing the operating speed of the neural network and reducing power consumption. However, quantization sacrifices data precision and therefore affects the accuracy of the neural network. Hence, how to reduce the precision loss of the neural network as much as possible during quantization is an urgent problem in the neural network field.
For example, on the one hand, it is difficult to support high-bit-width weights because of errors in writing the conductance values of ReRAM devices. Therefore, to fully exploit the advantages of ReRAM, an effective low-bit-width weight quantization method is needed that reduces the bit width of the weight matrix with as little precision loss as possible. On the other hand, to meet low-power and high-speed computation requirements, computing hardware such as GPUs and FPGAs has successively added support for low bit widths, which also places new demands on low-bit-width algorithms.
As will be appreciated by those skilled in the art, conventional low-bit-width quantization methods assume that the convolution kernels of each layer have a consistent weight distribution that remains unchanged during iterative training, so a single quantization step size is selected per layer and kept fixed during training. However, since the weight distributions of the convolution kernels differ and change during training, a single fixed quantization step size cannot achieve the optimal quantization effect, which affects network accuracy. The embodiments of this application provide a data processing method and device for a neural network system that can select a quantization mode per convolution kernel or per group of convolution kernels according to its weight distribution, and the quantization mode can change as the weight distribution changes during iterative training, keeping the accuracy of the quantized neural network as high as possible and providing a quantization scheme with low accuracy loss. The method can be applied to computation scenarios based on ReRAM chips, as well as those based on GPUs, FPGAs, and the like. It may also be applied to other neural network computing scenarios, which is not limited in the embodiments of this application.
Fig. 3 is a schematic diagram of another possible neural network system 300 of an embodiment of the present application. The neural network system 300 includes a processing device 310 and a computing device 320. As shown in fig. 3, the data processing method in the neural network system according to the embodiments of the present disclosure may also be executed by the neural network system 300. The processing device 310 may be a computer system; for example, it may include a central processing unit (CPU), a GPU, an FPGA, or another type of processor. After the weights of the neural network are quantized by the processing device 310, the quantized weights may be loaded into the computing device 320 for calculation. The computing device 320 may be any device capable of performing neural network computations and may include a neural network chip, for example a GPU, an FPGA, or a ReRAM chip. The computing device 320 may also include other types of processors as long as they can perform the calculations of the neural network. The processing device 310 and the computing device 320 may be the same device or different devices. For example, the processor of the processing device 310 may be a CPU while the computing device 320 is a ReRAM chip; alternatively, the processing device 310 and the computing device 320 are the same GPU or the same FPGA.
In one example, the processing device 310 may be the host 105 in fig. 1 and the computing device 320 may be the neural network circuit 110 in fig. 1.
Fig. 4 is a schematic structural diagram of a ReRAM chip provided in an embodiment of the present application. Fig. 4 takes the ReRAM crossbar architecture as an example and shows the principle of matrix-vector multiplication using a ReRAM chip. In the calculation on a ReRAM chip, the weight W of the matrix is represented by the conductance value G of a memristor in the chip, and the input voltage V represents the input vector, so the multiply-add result of the matrix is obtained by measuring the current value I at the output of the chip, using the following formula:
I=GV (1)
Due to device process issues, there are errors in writing the conductance values (corresponding to matrix weights) of the memristors in a ReRAM chip. When too many conductance values must be written, two adjacent conductance values are difficult to distinguish, so writing many different conductance values is difficult. Thus, due to device process limitations, ReRAM can only represent a small bit width. As an example, current memristors typically support 16 different conductance states, i.e., they can represent 4-bit weights. Therefore, to fully exploit the advantages of ReRAM analog computation in neural networks, an effective low-bit-width weight quantization method is needed, so that the bit width of the weight matrix is reduced as much as possible while keeping the precision loss as low as possible.
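As an illustration of formula (1), a software-level sketch of the crossbar matrix-vector multiplication might look as follows (the conductance scale and array size are assumptions for illustration, not values from the patent):

```python
import numpy as np

def crossbar_mvm(G, V):
    """Simulate a ReRAM crossbar computing formula (1): output currents I = GV.

    G: conductance matrix (each entry encodes one matrix weight W).
    V: input voltage vector encoding the input data.
    """
    return G @ V

G = np.random.rand(16, 16) * 1e-6  # conductances in siemens (illustrative scale)
V = np.random.rand(16)             # input voltages
I = crossbar_mvm(G, V)             # currents measured at the output lines
```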
Fig. 5 is a schematic diagram of a data processing method in a neural network system according to an embodiment of the present application. The method of fig. 5 may be performed by the neural network system (100, 300) of fig. 1 or fig. 3. For example, S501 and S502 may be performed by processingdevice 310, and S503 may be performed by computingdevice 320. As shown in fig. 5, the method includes:
S501, determining a first quantization mode according to a target quantization bit width and the range of weight values in a first initial weight value set, where the first initial weight value set includes the initial weight values of a first convolution kernel group among a plurality of convolution kernel groups in the neural network system, the first convolution kernel group includes one or more convolution kernels, and the first convolution kernel group includes a plurality of convolution kernels in different convolution layers or includes some of the convolution kernels in the same convolution layer.
As previously described, the neural network system may include a plurality of convolutional layers, each including one or more convolution kernels. During quantization, the convolution kernels in the neural network system can be divided into a plurality of convolution kernel groups, and a quantization mode is determined and applied according to the weight distribution of each group, where different groups may use different quantization modes. In particular, a convolution kernel group may include convolution kernels from different convolution layers, or may include some of the convolution kernels of one convolution layer. Each group may include one convolution kernel or a plurality of convolution kernels. In some examples, each group may include only one convolution kernel, in which case a quantization mode is determined and applied per convolution kernel according to its weight distribution.
Optionally, the first convolution kernel group may be one of the plurality of convolution kernel groups described above. The first initial weight value set may be the set of weight values of the first convolution kernel group before quantization, where the weight values of a convolution kernel are its elements. The target quantization bit width is the bit width of the quantized weight values; for example, the bit width of a weight value may be 32 bits before quantization and 4 bits after.
Optionally, in the embodiments of the present invention, the quantization mode may be uniform or non-uniform. Uniform quantization uniformly divides the weight values in the initial weight value set into a plurality of subsets with the same quantization step size and then quantizes the weight values in each subset, where the weight values in the same subset are quantized to the same value. For example, the weight interval of the initial weight value set may be uniformly divided into a fixed number of subintervals according to the target quantization bit width and the quantization step size, with the set of weight values falling into each subinterval forming one subset.
Non-uniform quantization divides the weight values in the initial weight value set into a plurality of subsets according to several different quantization step sizes and then quantizes the weight values in each subset, where the weight values in the same subset are again quantized to the same value. For example, the weight interval of the initial weight value set may be non-uniformly divided into a fixed number of subintervals according to the target quantization bit width and a plurality of quantization step sizes, with the set of weight values falling into each subinterval forming one subset.
If the first quantization mode is a uniform quantization mode, determining the first quantization mode may include determining a first quantization step size. For example, assume that the first initial weight value set has 128 weight values distributed over the range [0, 1], and that the quantization bit width is 4 bits. A 4-bit quantization bit width can represent 16 different values. With uniform quantization, the weight space [0, 1] is quantized to 16 different values, and the first quantization step size is 1/16.
If the first quantization mode is a non-uniform quantization mode, determining the first quantization mode may include determining a plurality of quantization step sizes. For example, assume again that the first initial weight value set has 128 weight values distributed over the range [0, 1] and that the quantization bit width is 4 bits, where 96 weight values (75% of the total) fall in [0, 0.25] and 32 weight values (25% of the total) fall in [0.25, 1]. A 4-bit quantization bit width can represent 16 different values. With non-uniform quantization, the weight space [0, 0.25] is quantized to 16 × 0.75 = 12 different values with a quantization step size of 0.25/12 = 1/48, and the weight space [0.25, 1] is quantized to 16 × 0.25 = 4 different values with a quantization step size of 0.75/4 = 3/16.
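A minimal sketch of both step-size computations in the two examples above (the function names and the two-region split are illustrative assumptions; the patent does not prescribe an implementation):

```python
import numpy as np

def uniform_step(w_min, w_max, bit_width):
    """Uniform quantization: one step size covering the whole weight range."""
    n_levels = 2 ** bit_width          # 4 bits -> 16 different values
    return (w_max - w_min) / n_levels  # e.g. (1 - 0) / 16 = 1/16

def nonuniform_steps(weights, split, bit_width):
    """Non-uniform quantization with two regions split at `split`: allocate
    the representable values in proportion to how many weights fall in each
    region, then derive one step size per region."""
    n_levels = 2 ** bit_width
    frac_low = float(np.mean(weights <= split))   # e.g. 96/128 = 0.75
    levels_low = int(round(n_levels * frac_low))  # 16 * 0.75 = 12 values
    levels_high = n_levels - levels_low           # 16 * 0.25 = 4 values
    w_min, w_max = weights.min(), weights.max()
    return ((split - w_min) / levels_low,         # 0.25 / 12 = 1/48
            (w_max - split) / levels_high)        # 0.75 / 4  = 3/16

weights = np.concatenate([np.random.uniform(0.0, 0.25, 96),
                          np.random.uniform(0.25, 1.0, 32)])
print(uniform_step(0.0, 1.0, 4))           # 0.0625
print(nonuniform_steps(weights, 0.25, 4))  # approximately (1/48, 3/16)
```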
In the embodiment of the application, weights in the initial weight value set can be divided and quantized in a non-uniform quantization mode, so that the length of a weight space can be utilized, the weight expression capacity is improved, the precision loss in the quantization process is reduced, and the precision of quantized data is improved.
S502, quantizing the weight values in the first initial weight value set according to the first quantization mode to obtain a first target weight value set of the first convolution kernel group.
The first target weight value set may refer to the set of weight values of the quantized first convolution kernel group. Quantizing the weight values in the first initial weight value set may include: dividing the weight values in the first initial weight value set into a plurality of subsets according to the first quantization mode; and quantizing the weight values in each of the plurality of subsets to obtain the first target weight value set. When quantizing the weight values in each subset, the weight values in the same subset are quantized to the same value, and the quantized values of all subsets together form the first target weight value set. Optionally, the quantized value of the weight values in each subset may be the middle value of the corresponding weight subinterval.
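A minimal sketch of this subset-and-midpoint scheme under uniform subintervals (the helper name is illustrative, not from the patent):

```python
import numpy as np

def quantize_to_midpoints(weights, w_min, w_max, bit_width):
    """Divide [w_min, w_max] into 2**bit_width equal subintervals and quantize
    every weight in a subinterval (subset) to that subinterval's middle value."""
    n_bins = 2 ** bit_width
    step = (w_max - w_min) / n_bins
    idx = np.clip(((weights - w_min) / step).astype(int), 0, n_bins - 1)
    return w_min + (idx + 0.5) * step  # all weights in one subset share one value

w = np.random.uniform(0.0, 1.0, 128)
w_q = quantize_to_midpoints(w, 0.0, 1.0, 4)
print(np.unique(w_q).size)  # at most 16 distinct quantized values
```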
If the first quantization mode is a uniform quantization mode, when dividing the weight values in the first initial weight value set, the weight values in the first initial weight value set may be uniformly divided into the plurality of subsets by using a first quantization step. If the first quantization mode is a non-uniform quantization mode, when dividing the weight values in the first initial weight value set, the weight values in the first initial weight value set may be non-uniformly divided into the plurality of subsets according to a plurality of quantization step sizes.
S503, calculating the data input into the neural network system based on the weight values in the first target weight value set.
For example, the weight values in the first target weight value set may be loaded into the computing device 320 for calculation. The processor of the computing device 320 may be a GPU, an FPGA, or a ReRAM chip. Optionally, the calculation on the data input into the neural network system may include neural network computations such as convolution or pooling, or forward inference calculation.
In the embodiments of the application, a plurality of convolution kernels of a neural network can be divided into a plurality of convolution kernel groups, and convolution kernels in the same group can come from different convolution layers or include only some of the convolution kernels in the same convolution layer. The convolution kernels are thus grouped flexibly during quantization, and the quantization mode is determined separately according to the weight distribution of each convolution kernel group, so the groups are not constrained by convolution-layer boundaries during quantization; the weight space of each group is fully utilized, the precision loss after neural network quantization can be reduced, and the data precision of the quantized neural network system is improved. In addition, since the precision of the weight values processed by the weight processing method provided by the embodiments of the application is improved, the weight values in the processed first target weight value set are loaded into the neural network system, and the output data the system computes from input data based on these weight values is more accurate. Further, the method of the embodiments of the application can also improve the accuracy of the specific applications handled by the neural network system. For example, if the neural network system performs image processing, the accuracy or precision of its image recognition can be improved.
Optionally, when dividing the convolution kernel groups, a plurality of convolution kernels with similar weight distributions may be placed in one group. The weight distribution may be characterized by the maximum, minimum, mean or variance of the weight values in the convolution kernel: two convolution kernels may be considered to have similar weight distributions if the maximum, minimum, mean or variance of their weight values are close. In practical applications, a clustering method can be used to cluster convolution kernels with similar weight distributions into the different convolution kernel groups.
For example, the principle of dividing the convolution kernel group may include at least one of the following division modes: the difference value of the maximum values of the weight values in any two convolution kernels in the same convolution kernel group is smaller than a first preset value; or the difference value of the minimum values of the weight values in any two convolution kernels in the same convolution kernel group is smaller than a second preset value; or the difference value of the average values of the weight values in any two convolution kernels in the same convolution kernel group is smaller than a third preset value; or the difference value of the variances of the weight values in any two convolution kernels in the same convolution kernel group is smaller than a fourth preset value. The first preset value to the fourth preset value may be set according to practice, and the embodiment of the present application is not limited thereto.
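A minimal sketch of such a grouping rule (the thresholds and the greedy strategy are illustrative assumptions; the patent only states the four difference conditions):

```python
import numpy as np

def kernel_stats(k):
    """Statistics that characterize a kernel's weight distribution."""
    return np.array([k.max(), k.min(), k.mean(), k.var()])

def similar(k1, k2, thresholds=(0.1, 0.1, 0.05, 0.02)):
    """The four conditions: the differences of the max, min, mean and variance
    of the weight values are each below a preset value."""
    return bool(np.all(np.abs(kernel_stats(k1) - kernel_stats(k2)) < np.array(thresholds)))

def group_kernels(kernels, thresholds=(0.1, 0.1, 0.05, 0.02)):
    """Greedily place each kernel into the first group whose members all satisfy
    the similarity conditions with it; otherwise open a new group."""
    groups = []
    for k in kernels:
        for g in groups:
            if all(similar(k, m, thresholds) for m in g):
                g.append(k)
                break
        else:
            groups.append([k])
    return groups

kernels = [np.random.randn(4, 3) * s for s in (1.0, 1.0, 0.1, 0.1)]
print(len(group_kernels(kernels)))  # kernels with close statistics tend to share a group
```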
In the embodiment of the application, the convolution kernels with similar weight distribution are divided into the same convolution kernel group, and quantization is performed in the same quantization mode, so that the length of a weight space can be utilized as much as possible, the weight expression capability is improved, the precision loss in the quantization process is reduced, and the precision of quantized data is improved.
Optionally, in the embodiments of the present invention, after the first target weight value set is obtained, it may be iteratively trained to obtain a final target weight value set that meets a preset condition. Specifically, forward inference calculation may be performed on the first target weight value set, and according to the result of the forward inference calculation, it is determined whether the weight values of the first target weight value set satisfy the preset condition. If they do, the first target weight value set is the final target weight value set. If they do not, the weight values in the first initial weight value set are adjusted to obtain a second initial weight value set of the first convolution kernel group; a second quantization mode is determined according to the range of the weight values in the second initial weight value set; and the weight values in the second initial weight value set are quantized according to the second quantization mode to obtain a second target weight value set of the first convolution kernel group.
Similarly, in the iterative training, after the second target weight value set is obtained, it may be continuously determined whether the weight value of the second target weight value set meets the preset condition, if not, the weight value of the second target weight value set is continuously adjusted, and the process is repeated until the target weight value set meeting the preset condition is obtained.
Optionally, the definition of the second quantization manner is similar to that of the first quantization manner, and for brevity, the description is omitted here.
In the embodiments of the application, the obtained first target weight value set is iteratively trained, and the weight values and the quantization mode of the initial weight value set of the first convolution kernel group are continuously adjusted during training, so that the precision loss after neural network quantization is reduced as much as possible.
Optionally, performing forward inference calculation on the weight values in the first target weight value set may include calculating the weight values in the first target weight value set in the neural network system. For example, the weight values may be loaded into any one of the following processing devices for calculation: a GPU, an FPGA, or a ReRAM chip; alternatively, the weight values in the first target weight value set may be loaded into another type of neural network chip for calculation.
Optionally, determining whether the weight values of the first target weight value set satisfy the preset condition may include: judging whether the error between the output value of the forward inference calculation on the first target weight value set and the true value converges. If the error converges, the weight values of the first target weight value set satisfy the preset condition; if it does not, they do not. The present application does not limit the manner of determining error convergence. In practical applications, the criterion for error convergence is that the error between the output value after forward inference calculation and the true value is less than a certain threshold, or that the error is almost unchanged over multiple iterations. In one example, whether the error converges may be determined according to whether the error is greater than a first preset threshold: if the error is smaller than the first preset threshold, the weight values of the first target weight value set satisfy the preset condition; if the error is larger than the first preset threshold, they do not. The size of the first preset threshold may be determined according to the practical application, which is not limited in the embodiments of the application. The error may also be referred to as a loss value.
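A small sketch of both convergence criteria mentioned above (threshold and near-unchanged error; the names and constants are illustrative):

```python
def converged(errors, threshold=1e-3, window=5, plateau_tol=1e-5):
    """The error converges if it drops below a preset threshold, or if it is
    almost unchanged over the last `window` iterations."""
    if errors[-1] < threshold:
        return True
    if len(errors) >= window:
        recent = errors[-window:]
        return max(recent) - min(recent) < plateau_tol
    return False

print(converged([0.5, 0.1, 0.0005]))         # True: error below threshold
print(converged([0.2, 0.2, 0.2, 0.2, 0.2]))  # True: error almost unchanged
```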
Optionally, adjusting the first initial weight value set to obtain the second initial weight value set of the first convolution kernel group includes: calculating the gradient of the weight values in the first initial weight value set; and adjusting the first initial weight value set according to that gradient to obtain the second initial weight value set. The error between the output value after forward inference calculation and the true value changes fastest along the direction of the gradient, where the rate of change is largest.
Optionally, a back-propagation method may be used to calculate the gradient of each weight value in the first initial weight value set. The back-propagation algorithm is divided into a forward pass and a backward pass: the forward pass calculates the error between the output value of the neural network and the true value, and the backward pass calculates the gradient corresponding to each weight from that error. In some examples, if the forward inference computation is performed in a GPU, the gradient of each weight value is computed from the first target weight value set and the feature maps output by each layer of convolution kernels in the GPU. If the forward inference calculation is performed in a ReRAM chip, the gradient of each weight value is calculated from the actual conductance values written into the ReRAM chip and the output values of the ReRAM chip.
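Putting these pieces together, a quantization-aware training loop in the spirit of this description might look as follows (a straight-through-style sketch; `quantize`, `forward` and `backward` are placeholder callables, not the patent's implementation):

```python
import numpy as np

def train_group(w_init, quantize, forward, backward, lr=0.01, max_iters=100, threshold=1e-3):
    """Iterative training of one convolution kernel group: re-quantize from the
    current full-precision weights, run forward inference, and (if the error has
    not converged) adjust the full-precision weights using back-propagated
    gradients, as described in the iterative training above."""
    w = w_init.copy()                  # initial weight value set (full precision)
    for _ in range(max_iters):
        w_q = quantize(w)              # target weight value set for this iteration
        error, outputs = forward(w_q)  # forward inference on the quantized weights
        if error < threshold:          # preset condition met: stop training
            return w_q
        grad = backward(w_q, outputs)  # gradient of each weight w.r.t. the error
        w -= lr * grad                 # adjust the *initial* weights, then re-quantize
    return quantize(w)

# toy usage: pull quantized weights toward a zero target
target = np.zeros(8)
quantizer = lambda w: np.round(w * 8) / 8            # illustrative fixed-step quantizer (step 1/8)
fwd = lambda wq: (np.mean((wq - target) ** 2), wq)   # error and "output values"
bwd = lambda wq, out: 2 * (out - target) / out.size  # gradient passed straight through
print(train_group(np.random.randn(8), quantizer, fwd, bwd))
```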
During iterative training, in addition to adjusting the first initial weight value set and the quantization mode, the target weight value set itself may be adjusted. For example, according to the result of forward inference calculation on the weight values of the first target weight value set, it is determined that these weight values do not satisfy the preset condition, and the weight values of the first target weight value set are adjusted to obtain a second target weight value set of the first convolution kernel group. Similarly, forward inference calculation may be performed on the weight values of the second target weight value set; if its result does not meet the preset condition, the second target weight value set is adjusted in turn, until a target weight value set meeting the preset condition is obtained.
Optionally, the quantization modes of different convolution kernel groups may differ. For example, assuming the second convolution kernel group is another group among the plurality of convolution kernel groups, different from the first, a third quantization mode may be determined for it according to the range of its initial weight value set, where the third quantization mode is different from the first quantization mode; the initial weight value set of the second convolution kernel group is then quantized according to the third quantization mode to obtain the target weight value set of the second convolution kernel group. Optionally, after the target weight value set of the second convolution kernel group is obtained, it may also be determined whether it meets a preset condition; if not, iterative training is performed on it to obtain a target weight value set that does, with the same or a similar iterative training process as for the first convolution kernel group, which is not repeated here.
Fig. 6 is a schematic diagram of a data processing method in a neural network system according to another embodiment of the present application. The example of fig. 6 is illustrated with the processing device 310 being a GPU and the computing device 320 being a ReRAM chip. As shown in fig. 6, the method includes the following steps.
S601, the GPU selects a quantization mode for each convolution kernel group, and quantizes the initial weight value set of each convolution kernel group to obtain a target weight value set.
For the specific way of selecting the quantization mode and quantizing the initial weight value set, refer to the related content of S501 and S502 in fig. 5; for brevity, details are not repeated here.
S602, the GPU loads the target weight value set to a ReRAM chip.
For example, the GPU writes the target set of weight values into the ReRAM chip as the weight values in the ReRAM chip, i.e., the conductance value G of the ReRAM chip.
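A sketch of this loading step as a linear weight-to-conductance mapping (the mapping and the conductance range are assumptions for illustration; the patent does not specify them):

```python
import numpy as np

def weights_to_conductance(w_q, g_min=1e-6, g_max=16e-6):
    """Linearly map quantized weights onto the memristor conductance range
    [g_min, g_max]; each distinct weight level becomes one conductance state."""
    w_min, w_max = w_q.min(), w_q.max()
    return g_min + (w_q - w_min) / (w_max - w_min) * (g_max - g_min)

w_q = np.round(np.random.uniform(0, 1, (16, 16)) * 15) / 15  # 4-bit weights: 16 levels
G = weights_to_conductance(w_q)  # conductance matrix written into the ReRAM chip
```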
S603, the ReRAM chip performs forward inference calculation on the target weight value set and obtains the output value of the forward inference calculation.
S604, the ReRAM chip determines whether the error between the output value of the forward inference calculation and the true value converges.
If the error converges, the iterative training process ends. If the error does not converge, S605 is executed.
S605, if the error does not converge, the ReRAM chip outputs the input values and output values of the forward inference calculation to the GPU.
Specifically, the conductance values G of the ReRAM chip and the output value of each crossbar array may be output to the GPU. For example, the actual conductance value written to each ReRAM device in the chip may be measured and output to the GPU together with the output value of each crossbar.
S606, the GPU adjusts the weight values in the initial weight value set, and then iterative training is continued.
For example, the GPU may update the weight values in the initial set of weight values in the GPU by a back propagation method.
In the embodiments of the application, because the forward inference calculation is performed in the ReRAM chip, noise problems present in the chip, such as inaccurate writing of ReRAM conductance values and the noise of the analog-to-digital converter (ADC) and digital-to-analog converter (DAC), can be fully taken into account, so the accuracy of the neural network system's forward inference calculation on the ReRAM chip can be assured as far as possible.
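To illustrate why training in the loop with the ReRAM chip helps, the following sketch injects two of the noise sources named above into a simulated forward pass (the noise model and magnitudes are assumptions for illustration, not values from the patent):

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_crossbar_forward(G, V, write_sigma=0.02, adc_sigma=0.01):
    """Forward pass through a crossbar with two illustrative noise sources:
    multiplicative conductance write error and additive ADC read noise."""
    G_actual = G * (1 + rng.normal(0.0, write_sigma, G.shape))  # imprecise writes
    I = G_actual @ V
    return I + rng.normal(0.0, adc_sigma * np.abs(I).mean(), I.shape)  # ADC noise

G = rng.random((16, 16)) * 1e-6
V = rng.random(16)
print(np.abs(noisy_crossbar_forward(G, V) - G @ V).mean())  # deviation the training sees
```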
Fig. 7 is a schematic diagram of a data processing method in a neural network system according to another embodiment of the present application. The example of fig. 7 is illustrated with the processing device 310 being a GPU and the computing device 320 being a GPU or an FPGA. As shown in fig. 7, the method includes the following steps.
S701, the GPU selects a quantization mode for each convolution kernel group, and quantizes the initial weight value set of each convolution kernel group to obtain a target weight value set.
For the specific way of selecting the quantization mode and quantizing the initial weight value set, refer to the related content of S501 and S502 in fig. 5; for brevity, details are not repeated here.
S702, the GPU performs forward reasoning calculation on the target weight value set and acquires an output value of the forward reasoning calculation.
S703, the GPU judges whether the error between the output value of the forward inference calculation and the true value converges.
If it is determined that the error converges, the process proceeds to S704. If the error does not converge, S705 is performed.
S704, if the error converges, loading the target weight value set to thecomputing device 320.
Alternatively, thecomputing device 320 may be a GPU or an FPGA. In one example,processing device 310 andcomputing device 320 may be the same GPU or different GPUs.
S705, if the error is not converged, the GPU adjusts the weight value in the initial weight value set, and then the iterative training is continued.
For example, the GPU may update the initial set of weight values in the GPU by a back propagation method.
In the embodiment of the present application, the processing device 310 is a GPU, and the computing device 320 is a GPU or an FPGA. Since the noise problems of a ReRAM chip are absent, the forward inference calculation of the iterative training may be performed in the processing device 310, and when the error between the output value of the forward inference calculation on the target weight value set and the true value converges, the target weight value set is deployed into the computing device 320.
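For contrast, a hedged sketch of this variant is given below: forward inference of the quantized weights runs on the GPU itself, and only the converged target weight set would be handed to the computing device 320. The helper names mirror the toy above and are our own.

```python
def train_on_gpu(w_fp, x, y, quantize, lr=0.1, iters=200, tol=1e-2):
    """Toy single-layer version of S701-S705; `quantize` is any k-bit quantizer."""
    for _ in range(iters):
        w_q = quantize(w_fp)                        # S701: quantize
        resid = x @ w_q - y                         # S702: forward on the GPU
        if (resid ** 2).mean() < tol:               # S703: error converged?
            return w_q                              # S704: deploy this weight set
        w_fp = w_fp - lr * (x.T @ resid) / len(x)   # S705: adjust and iterate
    return quantize(w_fp)
```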
Fig. 8 is a schematic diagram of a data processing method in a neural network system according to another embodiment of the present application. As shown in fig. 8, the method includes the following steps.
S801, analyzing the weight distribution of the initial weight value set of each convolution kernel group in the neural network system.
For example, the initial weight values of each convolution kernel (or each group of convolution kernels) in the respective convolution layers of the neural network system may be obtained, and the distribution of the weight values of each convolution kernel (or each group of convolution kernels) may be analyzed. In practical applications, the weight distribution is generally analyzed by taking the maximum absolute weight value of each convolution kernel (or each group of convolution kernels), or by fitting a Gaussian distribution to the weight values and taking a range of two standard deviations.
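A small sketch of these two range statistics, assuming NumPy arrays of weights; the function names are ours.

```python
import numpy as np

def range_max_abs(w: np.ndarray) -> float:
    """Range half-width from the maximum absolute weight."""
    return float(np.abs(w).max())

def range_two_sigma(w: np.ndarray) -> float:
    """Range half-width from a Gaussian fit: two standard deviations."""
    return float(2.0 * w.std())

w = np.random.default_rng(1).normal(0.0, 0.1, size=4096)
print(range_max_abs(w), range_two_sigma(w))   # e.g. ~0.36 vs ~0.20
```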
S802, determining the quantization mode of each convolution kernel group according to the target quantization bit width and the weight distribution of each convolution kernel group.
For example, a specific quantization step size of each convolution kernel (or each group of convolution kernels) in the current iteration can be selected according to the weight distribution of each convolution kernel (or each group of convolution kernels) in the current iteration and the target quantization bit width, so that the weight distribution range of each convolution kernel is divided uniformly or non-uniformly.
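One hedged way to realize a non-uniform division is to place the sub-interval edges at quantiles of the weights, so that each sub-interval holds roughly the same number of weights; this particular rule is our illustration, not a choice stated in the text.

```python
import numpy as np

def quantile_edges(w: np.ndarray, k: int) -> np.ndarray:
    """Non-uniform sub-interval edges: equal-population bins for k-bit quantization."""
    qs = np.linspace(0.0, 1.0, 2**k + 1)
    return np.quantile(w, qs)

w = np.random.default_rng(2).normal(0.0, 0.1, size=10000)
print(quantile_edges(w, 2))   # 5 edges -> 4 unequal-width sub-intervals
```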
S803, quantizing the initial weight value set of each convolution kernel group according to the determined quantization mode to obtain a target weight value set.
For example, the initial set of weight values may be low-bit-width quantized according to the quantization step determined in S802.
S804, judging whether the iteratively trained target weight value set will subsequently perform forward inference calculation on a ReRAM chip.
If the forward inference calculation is to be performed on the ReRAM chip, S806 is performed. If the forward inference calculation is not to be performed on the ReRAM chip, S805 is performed.
S805, if the iteratively trained target weight value set will not subsequently be used for forward inference calculation in a ReRAM chip, but in a digital chip such as a GPU or an FPGA, the currently quantized target weight value set is used to perform the forward inference calculation in the GPU.
In the present application, if the trained neural network system performs its calculation in a digital chip such as a GPU or an FPGA, the noise problems of a ReRAM chip need not be considered. Therefore, during the iterative training process, the currently quantized target weight value set only needs to undergo forward inference calculation in the GPU, and after the iterative training ends, the trained target weight value set is deployed in the GPU or the FPGA.
S806, if the trained weights will perform inference calculation in the ReRAM chip, the quantized target weight value set is written as the conductance values G of the ReRAM devices in the ReRAM chip.
It should be noted that, because there may be an error in writing the conductance value G of the ReRAM, the conductance value G actually written into the ReRAM device may not be completely equal to the conductance value G corresponding to the quantized target weight value set.
In the present application, if the trained neural network system needs to execute the forward inference calculation on a ReRAM chip, the noise problems of the ReRAM chip need to be considered; therefore, the forward inference calculation needs to be executed on the ReRAM chip during the iterative training process.
S807, forward inference calculation is performed in the ReRAM chip.
For example, forward inferential calculations may be made in the ReRAM chip based on the actual conductance value currently being written.
S808, judging whether the error between the output value of the forward inference calculation and the true value converges.
If the error converges, the iterative training process is stopped. If the error does not converge, the following steps are continued.
S809, if the error does not converge, the ReRAM chip outputs the input values and the output values of the forward inference calculation to the GPU.
For example, the actual conductance value written by each ReRAM device in the ReRAM chip may be measured and output to the GPU together with the output value of each crossbar.
S810, the GPU calculates the gradient of each weight value in the initial weight value set.
For example, in the GPU, the gradient of each weight value may be calculated by a back propagation method.
S811, updating the weight values in the initial weight value set according to the gradients of the weight values in the initial weight value set.
Optionally, the iterative training may be continued according to the updated initial weight value set until the iterative training process is ended.
The data processing method for the neural network system provided by the present application can be applied to various scenes. Taking automated driving as an example, quickly and accurately identifying pedestrians, vehicles, and other objects, and accurately locating them, is a prerequisite and basis for achieving safe and stable automated driving. To achieve this goal, an accurate target detection algorithm is critical. A currently common target detection algorithm is the CNN-based Resnet18-SSD model.
The following takes the Resnet18-SSD model as an example to continue describing the data processing method in the neural network system according to the embodiment of the present application. Assume that each convolution kernel group includes one convolution kernel, the initial weight value set is hereinafter the floating point weight, and the target weight value set is hereinafter the low bit width weight. The target quantization bit width has size $k$, where $k$ is an integer greater than or equal to 1. Let the floating point weight of a convolution kernel in Resnet18-SSD be $W^f$, and let the low bit width weight after $k$-bit quantization be $W^q$.
The method comprises the following steps.
S1, firstly, determine the maximum absolute value of the weights in the floating point weight $W^f$ of the convolution kernel, $W_{max} = \max\left(\left|W^f\right|\right)$.

S2, according to the current maximum value $W_{max}$, determine the weight distribution of the convolution kernel as $[-W_{max}, W_{max}]$.

It is assumed that a uniform quantization approach is used, i.e. a specific optimal quantization step size divides the weight range $[-W_{max}, W_{max}]$ evenly. The quantization step size can be set to $\Delta = \frac{2W_{max}}{2^{k}}$, which divides $[-W_{max}, W_{max}]$ into $2^{k}$ weight sub-intervals.
S3, perform low bit width quantization on the floating point weights of the neural network based on the quantization step size.

A specific way may be that, based on the weight sub-intervals divided by the quantization step size in step S2, the weight values within each sub-interval are all quantized to the same value. The quantized value may be the middle value of the sub-interval, thus quantizing the floating point weight $W^f$ to the low bit width weight $W^q$.
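A minimal sketch of this midpoint rule under the uniform division above; the helper name is ours.

```python
import numpy as np

def quantize_to_midpoints(w: np.ndarray, k: int) -> np.ndarray:
    """Map each weight to the midpoint of the uniform sub-interval it falls into."""
    w_max = np.abs(w).max()
    step = 2 * w_max / 2**k
    idx = np.clip(np.floor((w + w_max) / step), 0, 2**k - 1)
    return -w_max + (idx + 0.5) * step

w = np.array([-0.93, -0.2, 0.01, 0.48, 0.93])
print(quantize_to_midpoints(w, 2))   # step 0.465: midpoints of the 4 bins
```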
S4, perform forward inference calculation with the low bit width weight $W^q$, and calculate the error between the output value of the forward inference and the true value.

If the error converges, i.e. the error is less than a certain threshold, or the error hardly changes over multiple iterations, the iterative training is stopped to obtain the final quantization weight $W^q$. If the error does not converge, S5 is executed.
S5, based on the quantized low bit width weight $W^q$ and the feature map output by each convolution layer, calculate the gradient of each floating point weight $W^f$; then update each floating point weight $W^f$ based on the calculated gradient.

For example, in the embodiments of the present application, the gradient of $W^f$ is taken to be equal to the gradient of $W^q$, i.e. $\frac{\partial L}{\partial W^f} = \frac{\partial L}{\partial W^q}$, where $L$ represents the error between the true value and the output value.
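A tiny numeric illustration of this straight-through rule, with made-up values: the gradient is computed at the quantized weights and reused to update the floating point copy.

```python
import numpy as np

w_fp = np.array([0.23, -0.71])           # floating point weights W^f
w_q = np.array([0.25, -0.75])            # after low bit width quantization (illustrative)
x, t = np.array([1.0, 2.0]), 0.5         # one input sample and its true value
y = x @ w_q                              # forward inference uses W^q
grad_wq = 2 * (y - t) * x                # dL/dW^q for the squared error L = (y - t)^2
w_fp -= 0.1 * grad_wq                    # dL/dW^f := dL/dW^q, then gradient step
print(w_fp)                              # [0.58, -0.01]
```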
Fig. 9 is a schematic structural diagram of a neural network system 900 according to an embodiment of the present application. The neural network system 900 is capable of performing the methods and steps described above as being performed by the neural network system. As shown in fig. 9, the neural network system 900 includes: a determining module 910, configured to determine a first quantization mode according to a target quantization bit width and a range of weight values in a first initial weight value set, where the first initial weight value set includes an initial weight value of a first convolution kernel group in a plurality of convolution kernel groups in a neural network system, the first convolution kernel group includes one or more convolution kernels, and the first convolution kernel group includes multiple convolution kernels in different convolution layers or includes partial convolution kernels in the same convolution layer;
a quantization module 920, configured to quantize weight values in the first initial weight value set according to the first quantization manner to obtain a first target weight value set of the first convolution kernel group;
a calculating module 930, configured to calculate data input to the neural network system based on the weight values in the first target weight value set.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application or portions thereof that substantially contribute to the prior art may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (24)

1. A method for processing data in a neural network system, comprising:
determining a first quantization mode according to a target quantization bit width and a range of weight values in a first initial weight value set, wherein the first initial weight value set comprises an initial weight value of a first convolution kernel group in a plurality of convolution kernel groups in the neural network system, and the first convolution kernel group comprises a plurality of convolution kernels in different convolution layers or comprises partial convolution kernels in the same convolution layer;
according to the first quantization mode, quantizing the weight values in the first initial weight value set to obtain a first target weight value set of the first convolution kernel group;
and calculating the data input into the neural network system according to the weight value in the first target weight value set.
2. The method of claim 1, wherein the first set of convolution kernels comprises a plurality of convolution kernels having a similar weight distribution.
3. The method of claim 1 or 2, wherein the first convolution kernel group satisfies at least one of the following conditions:
the difference value of the maximum values of the weight values of any two convolution kernels in the first convolution kernel group is smaller than a first preset value;
the difference value of the minimum values of the weight values of any two convolution kernels in the first convolution kernel group is smaller than a second preset value;
the difference value of the average values of the weight values of any two convolution kernels in the first convolution kernel group is smaller than a third preset value; and
and the difference value of the variances of the weight values of any two convolution kernels in the first convolution kernel group is smaller than a fourth preset value.
4. The method of any of claims 1 to 3, wherein quantizing the first initial set of weights according to the first quantization mode comprises:
dividing the weight values in the first initial set of weight values into a plurality of subsets according to the first quantization mode;
and quantizing the weight value in each of the plurality of subsets to obtain the first target weight value set.
5. The method of claim 4, wherein the first quantization mode is a uniform quantization mode, and wherein the dividing the weight values in the first initial set of weight values into a plurality of subsets according to the first quantization mode comprises: uniformly dividing the weight values in the first initial weight value set into the plurality of subsets by adopting a first quantization step size.
6. The method of claim 4, wherein the first quantization mode is a non-uniform quantization mode, and wherein the dividing the weight values in the first initial set of weight values into a plurality of subsets according to the first quantization mode comprises: dividing the weight values in the first initial weight value set into a plurality of subsets according to a plurality of quantization step sizes determined by the first quantization mode.
7. The method of any of claims 1 to 6, further comprising:
carrying out forward reasoning calculation on the weight values in the first target weight value set;
determining that the weight value of the first target weight value set does not meet a preset condition according to the result of the forward reasoning calculation;
adjusting the weight value in the first initial weight value set to obtain a second initial weight value set of the first convolution kernel group;
determining a second quantization mode according to the range of the weight values in the second initial weight value set;
and quantizing the weight values in the second initial weight value set according to the second quantization mode to obtain a second target weight value set of the first convolution kernel group.
8. The method of claim 7, wherein said determining that the weight values of the first set of target weight values do not satisfy a preset condition according to the result of the forward inference calculation comprises:
and under the condition that the error between the output value of the forward reasoning calculation and the true value is greater than a first preset threshold value, determining that the weight value of the first target weight value set does not meet a preset condition.
9. The method of claim 7 or 8, wherein said performing a forward inference calculation on weight values in said first set of target weight values comprises:
performing forward inference calculation on the weight values of the first target weight value set in the neural network system, wherein the neural network system includes any one of the following processing devices: a graphics processor GPU, a field programmable gate array FPGA, and a memristor crossbar array.
10. The method of any of claims 1 to 6, further comprising:
carrying out forward reasoning calculation on the weight values in the first target weight value set;
determining that the weight value of the first target weight value set does not meet a preset condition according to a result of forward reasoning calculation on the weight value of the first target weight value set;
and adjusting the weight value of the first target weight value set to obtain a second target weight value set of the first convolution kernel group.
11. The method of any one of claims 1 to 10, further comprising:
determining a third quantization mode for a second convolution kernel group of the plurality of convolution kernel groups based on a range of initial sets of weight values for the second convolution kernel group, the third quantization mode being different from the first quantization mode;
and quantizing the initial weight value set of the second convolution kernel group according to the third quantization mode to obtain a target weight value set of the second convolution kernel group.
12. A neural network system, comprising:
a determining module, configured to determine a first quantization mode according to a target quantization bit width and a range of weight values in a first initial weight value set, where the first initial weight value set includes an initial weight value of a first convolutional kernel group in a plurality of convolutional kernel groups in the neural network system, and the first convolutional kernel group includes a plurality of convolutional kernels in different convolutional layers or includes partial convolutional kernels in a same convolutional layer;
a quantization module, configured to quantize weight values in the first initial weight value set according to the first quantization manner to obtain a first target weight value set of the first convolution kernel group;
and the calculation module is used for calculating the data input into the neural network system according to the weight value in the first target weight value set.
13. The neural network system of claim 12, wherein the first set of convolution kernels comprises a plurality of convolution kernels having a similar weight distribution.
14. The neural network system of claim 12 or 13, wherein the first convolutional kernel group satisfies at least one of the following conditions:
the difference value of the maximum values of the weight values of any two convolution kernels in the first convolution kernel group is smaller than a first preset value;
the difference value of the minimum values of the weight values of any two convolution kernels in the first convolution kernel group is smaller than a second preset value;
the difference value of the average values of the weight values of any two convolution kernels in the first convolution kernel group is smaller than a third preset value; and
and the difference value of the variances of the weight values of any two convolution kernels in the first convolution kernel group is smaller than a fourth preset value.
15. The neural network system of any one of claims 12-14, wherein the quantification module is specifically configured to:
dividing the weight values in the first initial set of weight values into a plurality of subsets according to the first quantization mode; and
quantizing the weight value in each of the plurality of subsets to obtain the first target weight value set.
16. The neural network system of claim 15, wherein the first quantization mode is a uniform quantization mode, and when the quantization module divides the weight values in the first initial set of weight values into a plurality of subsets according to the first quantization mode, the quantization module is specifically configured to: uniformly divide the weight values in the first initial weight value set into the plurality of subsets by adopting a first quantization step size.
17. The neural network system of claim 15, wherein the first quantization mode is a non-uniform quantization mode, and wherein when the quantization module divides the weight values in the first initial set of weight values into a plurality of subsets according to the first quantization mode, the quantization module is specifically configured to: divide the weight values in the first initial weight value set into a plurality of subsets according to a plurality of quantization step sizes determined by the first quantization mode.
18. The neural network system of any one of claims 12-17, wherein the computation module is further configured to:
carrying out forward reasoning calculation on the weight values in the first target weight value set;
determining that the weight value of the first target weight value set does not meet a preset condition according to the result of the forward reasoning calculation;
adjusting the weight value in the first initial weight value set to obtain a second initial weight value set of the first convolution kernel group;
determining a second quantization mode according to the range of the weight values in the second initial weight value set; and
quantizing the weight values in the second initial weight value set according to the second quantization mode to obtain a second target weight value set of the first convolution kernel group.
19. The neural network system of claim 18, wherein the computing module is specifically configured to determine that the weight values of the first set of target weight values do not satisfy a preset condition if an error between the output value of the forward inference computation and a true value is greater than a first preset threshold.
20. The neural network system of any one of claims 12-17, wherein the neural network system further comprises a computation module configured to:
carrying out forward reasoning calculation on the weight values in the first target weight value set;
determining that the weight value of the first target weight value set does not meet a preset condition according to a result of forward reasoning calculation on the weight value of the first target weight value set; and
adjusting the weight value of the first target weight value set to obtain a second target weight value set of the first convolution kernel group.
21. The neural network system of any one of claims 12-20, wherein the determination module is further configured to:
determining a third quantization mode for a second convolution kernel group of the plurality of convolution kernel groups based on a range of initial sets of weight values for the second convolution kernel group, the third quantization mode being different from the first quantization mode;
and quantizing the initial weight value set of the second convolution kernel group according to the third quantization mode to obtain a target weight value set of the second convolution kernel group.
22. The neural network system of any one of claims 12 to 21, wherein the computation module comprises any one of: a graphics processor GPU, a field programmable gate array FPGA, and a memristor crossbar array.
23. A neural network system, comprising:
a memory for storing computer instructions;
a processor coupled to the memory and configured to execute the computer instructions to perform the method of any of claims 1 to 11.
24. A computer-readable storage medium for storing program code, the program code comprising instructions for performing the method of any of claims 1-11.