CN110163357B


Info

Publication number: CN110163357B
Application number: CN201910195627.XA
Authority: CN
Inventors: Name withheld at the inventor's request
Original assignee: Shanghai Cambricon Information Technology Co Ltd
Current assignee: Shanghai Cambricon Information Technology Co Ltd
Priority date: 2018-02-13
Filing date: 2018-09-03
Publication date: 2021-06-25
Anticipated expiration: 2038-09-03
Also published as: CN110163363B; CN110163359B; CN110163353B; CN110163362A; CN110163354A; CN110163353A; CN110163360B; CN110163357A; CN110163358B; CN110163358A; CN110163355A; CN110163362B; CN110163356B; CN110163359A; CN110163354B; CN110163361B; CN110163355B; CN110163361A; CN110163356A; CN110163363A

Abstract

Translated from Chinese

A computing device, comprising: a storage unit (10) for acquiring input data and a calculation instruction; a controller unit (11) for extracting the calculation instruction from the storage unit (10), decoding the calculation instruction to obtain one or more operation instructions, and sending the one or more operation instructions and the input data to an operation unit (12); and the operation unit (12) for performing calculation on the input data according to the one or more operation instructions to obtain a result of the calculation instruction. The computing device represents the data participating in machine learning calculation as fixed-point data, which can improve the processing speed and processing efficiency of training operations.

Description

Computing device and method

Technical Field

The present application relates to the field of information processing technologies, and in particular, to a computing device and method.

Background

With the continuous development of information technology and growing user demand, requirements on the timeliness of information keep rising. Currently, a terminal obtains and processes information based on a general-purpose processor.

In practice, it is found that processing information by running a software program on a general-purpose processor is limited by the running speed of the processor. Particularly when the general-purpose processor is heavily loaded, information processing efficiency is low and latency is high. For a computation model used in information processing, such as a training model, the training operation involves a large amount of computation, so a general-purpose processor takes a long time to complete the training operation and does so with low efficiency.

Disclosure of Invention

The embodiment of the application provides a computing device and method, which can improve the processing speed of operation and improve the efficiency.

In a first aspect, an embodiment of the present application provides a computing device, including a storage unit, an arithmetic unit, a controller unit, and a conversion unit;

the controller unit is used for acquiring a configuration instruction before the operation unit performs operation, and an operation domain of the configuration instruction comprises a decimal point position and a data type participating in the operation; analyzing the configuration instruction to obtain the position of the decimal point and the data type participating in the operation, or acquiring the position of the decimal point and the data type participating in the operation from the storage unit;

the controller unit is also used for acquiring input data and judging whether the data type of the input data is consistent with the data type participating in operation; when the data type of the input data is determined to be inconsistent with the data type participating in operation, transmitting the input data, the decimal point position and the data type participating in operation to the conversion unit;

and the conversion unit performs data type conversion on the input data according to the decimal point position and the data type participating in the operation to obtain converted input data, wherein the data type of the converted input data is consistent with the data type participating in the operation.

In a possible embodiment, the controller unit obtains the configuration instruction before the operation unit performs the operation, specifically, obtains the configuration instruction before the operation unit performs the operation of the ith layer of the multilayer neural network.

In one possible embodiment, the computing device is configured to perform machine learning calculations,

the controller unit is further used for transmitting the converted input data to the arithmetic unit; when the data type of the input data is consistent with the data type participating in operation, transmitting the input data to the operation unit;

and the operation unit is used for operating the converted input data or the input data to obtain an operation result.
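As an illustration only, the type check and conversion described above can be sketched in Python as follows. The names `Config`, `convert`, and `prepare_input`, the string labels "fixed" and "float", and rounding to the nearest integer are assumptions made for this sketch; the application does not specify them.

```python
from dataclasses import dataclass


@dataclass
class Config:
    """Hypothetical stand-in for the fields parsed from the configuration instruction."""
    point_position: int  # decimal point position (number of fractional bits)
    data_type: str       # data type participating in the operation: "fixed" or "float"


def convert(values, target_type, point_position):
    """Conversion-unit sketch: scale by 2**point_position in the needed direction."""
    scale = 2.0 ** point_position
    if target_type == "fixed":
        # float -> raw fixed-point integers (rounding mode is an assumption)
        return [round(v * scale) for v in values]
    # raw fixed-point integers -> float
    return [v / scale for v in values]


def prepare_input(values, input_type, cfg):
    """Controller-unit sketch: convert only when the data types are inconsistent."""
    if input_type != cfg.data_type:
        return convert(values, cfg.data_type, cfg.point_position), True
    return values, False


if __name__ == "__main__":
    cfg = Config(point_position=4, data_type="fixed")
    print(prepare_input([1.5, -0.25], "float", cfg))
```

In this sketch the returned flag only records whether a conversion took place; either way the resulting data would then be passed to the operation unit.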

In one possible embodiment, the machine learning computation includes an artificial neural network operation; the input data comprises input neuron data and weight data; and the calculation result is output neuron data.

In a possible embodiment, the arithmetic unit comprises a master processing circuit and a plurality of slave processing circuits;

the main processing circuit is used for performing preprocessing on the input data or the converted input data and transmitting data to and from the plurality of slave processing circuits;

the plurality of slave processing circuits are used for executing intermediate operation according to the input data transmitted from the master processing circuit or the converted input data to obtain a plurality of intermediate results and transmitting the plurality of intermediate results to the master processing circuit;

and the main processing circuit is used for executing subsequent processing on the plurality of intermediate results to obtain the operation result.

In one possible embodiment, the computing device further comprises a direct memory access DMA unit, the storage unit comprising: any combination of a register and a cache;

the cache is used for storing the input data; wherein the cache comprises a scratch pad cache;

the register is used for storing scalar data in the input data;

the DMA unit is used for reading data from the storage unit or storing data into the storage unit.

In a possible embodiment, when the input data is fixed-point data, the arithmetic unit further includes:

and the derivation unit is used for deriving the decimal point position of one or more intermediate results according to the decimal point position of the input data, wherein the one or more intermediate results are obtained by operation according to the input data.
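The application does not state the derivation rule at this point. As a hedged sketch, assuming the usual fixed-point convention in which the decimal point position is a non-negative number of fractional bits, the position of a product is the sum of the operand positions, and an intermediate result can be moved to another position by shifting. The function names are hypothetical.

```python
def fixed_mul(a_raw, a_point, b_raw, b_point):
    """Multiply two fixed-point values given as (raw integer, decimal point position).

    The product's decimal point position is derived as the sum of the inputs' positions.
    """
    return a_raw * b_raw, a_point + b_point


def rescale(raw, from_point, to_point):
    """Move a raw value to another decimal point position.

    Dropping low-order bits with >> rounds toward negative infinity.
    """
    if to_point >= from_point:
        return raw << (to_point - from_point)
    return raw >> (from_point - to_point)


def to_float(raw, point):
    """Interpret a raw integer with `point` fractional bits as a real number."""
    return raw / (1 << point)
```

For instance, 1.5 stored with 4 fractional bits (raw 24) times 2.25 stored with 3 fractional bits (raw 18) gives raw 432 with 7 fractional bits, which is 3.375.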

In a possible embodiment, the arithmetic unit further includes: a data caching unit for caching the one or more intermediate results.

In a possible embodiment, the arithmetic unit comprises a tree module, the tree module comprising a root port and a plurality of branch ports; the root port of the tree module is connected with the main processing circuit, and each of the branch ports of the tree module is connected with one of the plurality of slave processing circuits;

the tree module is used for forwarding data and operation instructions between the main processing circuit and the plurality of slave processing circuits; the tree module is an n-ary tree structure, where n is an integer greater than or equal to 2.
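A minimal sketch of forwarding through an n-ary tree is given below, under the assumption that internal nodes only forward data and that the slave processing circuits sit at the leaves. The nested-list representation and the names `build_tree` and `forward` are hypothetical and chosen only for this illustration.

```python
def build_tree(slaves, n):
    """Group leaves n at a time, level by level, until a single root remains."""
    if n < 2:
        raise ValueError("n must be an integer greater than or equal to 2")
    level = list(slaves)
    if not level:
        raise ValueError("at least one slave processing circuit is required")
    while len(level) > 1:
        level = [level[i:i + n] for i in range(0, len(level), n)]
    return level[0]


def forward(node, payload, delivered, hops=0):
    """Forward `payload` from the root down to every leaf, counting forwarding hops."""
    if isinstance(node, list):
        for child in node:
            forward(child, payload, delivered, hops + 1)
    else:
        delivered[node] = (payload, hops)


if __name__ == "__main__":
    received = {}
    forward(build_tree(range(8), 2), "operation instruction", received)
    print(received)
```

With eight slaves and n = 2, each leaf in this sketch is reached after three forwarding steps; a larger n gives a shallower tree at the cost of more ports per node.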

In a possible embodiment, the arithmetic unit further comprises a branch processing circuit,

the main processing circuit is specifically configured to determine that the input neuron is broadcast data, determine that a weight is distribution data, allocate the distribution data to a plurality of data blocks, and send at least one data block of the plurality of data blocks, the broadcast data, and at least one operation instruction of the plurality of operation instructions to the branch processing circuit;

the branch processing circuit is used for forwarding data blocks, broadcast data and operation instructions between the main processing circuit and the plurality of slave processing circuits;

the plurality of slave processing circuits are used for carrying out operation on the received data blocks and the broadcast data according to the operation instruction to obtain an intermediate result and transmitting the intermediate result to the branch processing circuit;

the main processing circuit is further configured to perform subsequent processing on the intermediate result sent by the branch processing circuit to obtain a result of the operation instruction, and send the result of the calculation instruction to the controller unit.

In one possible embodiment, the plurality of slave processing circuits are distributed in an array; each slave processing circuit is connected with other adjacent slave processing circuits, the master processing circuit is connected with K slave processing circuits in the plurality of slave processing circuits, and the K slave processing circuits are as follows: the n slave processing circuits of row 1, the n slave processing circuits of row m, and the m slave processing circuits of column 1;

the K slave processing circuits are used for forwarding data and instructions between the main processing circuit and the plurality of slave processing circuits;

the main processing circuit is further configured to determine that the input neuron is broadcast data, determine that a weight is distribution data, distribute the distribution data into a plurality of data blocks, and send at least one data block of the plurality of data blocks and at least one operation instruction of the plurality of operation instructions to the K slave processing circuits;

the K slave processing circuits are used for converting data between the main processing circuit and the plurality of slave processing circuits;

the plurality of slave processing circuits are used for performing operation on the received data blocks according to the operation instruction to obtain an intermediate result and transmitting the operation result to the K slave processing circuits;

and the main processing circuit is used for processing the intermediate results sent by the K slave processing circuits to obtain the result of the calculation instruction, and sending the result of the calculation instruction to the controller unit.
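To make the selection of the K slave processing circuits concrete, the sketch below enumerates them for an array of m rows and n columns. The 1-based `(row, column)` coordinates, the function name `boundary_slaves`, and counting the two shared corner circuits only once are assumptions of this illustration.

```python
def boundary_slaves(m, n):
    """Return the positions of the slave circuits wired directly to the master.

    These are all circuits in row 1, all circuits in row m, and all circuits in column 1.
    """
    selected = set()
    for col in range(1, n + 1):
        selected.add((1, col))  # row 1
        selected.add((m, col))  # row m
    for row in range(1, m + 1):
        selected.add((row, 1))  # column 1
    return selected


if __name__ == "__main__":
    k = boundary_slaves(4, 3)
    print(len(k), sorted(k))
```

The remaining circuits are not connected to the master directly in this arrangement and exchange data with it through the selected ones.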

In a possible embodiment, the main processing circuit is specifically configured to combine and sort the intermediate results sent by the multiple processing circuits to obtain the result of the computation instruction;

or the main processing circuit is specifically configured to perform combination sorting and activation processing on the intermediate results sent by the multiple processing circuits to obtain a result of the calculation instruction.

In one possible embodiment, the main processing circuit includes: one or any combination of an activation processing circuit and an addition processing circuit;

the activation processing circuit is used for executing activation operation of data in the main processing circuit;

the addition processing circuit is used for executing addition operation or accumulation operation;

the slave processing circuit includes:

and the multiplication processing circuit is used for executing multiplication operation on the received data block to obtain a product result.

And the accumulation processing circuit is used for executing accumulation operation on the product result to obtain the intermediate result.

In a second aspect, an embodiment of the present invention provides a computing method, including:

before an operation is performed, the controller unit obtains a configuration instruction, where the operation domain of the configuration instruction comprises a decimal point position and a data type participating in the operation; the controller unit parses the configuration instruction to obtain the decimal point position and the data type participating in the operation, or directly obtains the decimal point position and the data type participating in the operation; the controller unit acquires input data and judges whether the data type of the input data is consistent with the data type participating in the operation; when the data type of the input data is determined to be inconsistent with the data type participating in the operation, the conversion unit performs data type conversion on the input data according to the decimal point position and the data type participating in the operation to obtain converted input data, wherein the data type of the converted input data is consistent with the data type participating in the operation.

In a possible embodiment, the controller unit obtains the configuration instruction before performing the operation, in particular, before performing the operation of the ith layer of the multilayer neural network model.

In a possible embodiment, the computing method is a method for performing machine learning computation, the method further comprising:

the operation unit operates the converted input data to obtain an operation result;

when the data type of the input data is consistent with the data type participating in operation, the computing device performs operation on the input data to obtain the operation result.

In one possible embodiment, the machine learning computation includes an artificial neural network operation; the input data comprises input neurons and weights; and the calculation result is an output neuron.

In one possible embodiment, when the first input data is fixed-point data, the method further includes:

the arithmetic unit derives decimal point positions of one or more intermediate results according to the decimal point positions of the first input data, wherein the one or more intermediate results are obtained through calculation according to the first input data.

In a third aspect, an embodiment of the present invention provides a machine learning arithmetic device, which includes one or more computing devices according to the first aspect. The machine learning arithmetic device is used for acquiring data to be operated and control information from other processing devices, executing specified machine learning arithmetic and transmitting an execution result to other processing devices through an I/O interface;

when the machine learning arithmetic device comprises a plurality of computing devices, the plurality of computing devices can be linked through a specific structure and transmit data;

the plurality of computing devices are interconnected through a PCIE bus and transmit data so as to support operation of larger-scale machine learning; a plurality of the computing devices share the same control system or own respective control systems; the computing devices share the memory or own the memory; the plurality of computing devices are interconnected in any interconnection topology.

In a fourth aspect, an embodiment of the present invention provides a combined processing device, which includes the machine learning arithmetic device according to the third aspect, a universal interconnection interface, and other processing devices. The machine learning arithmetic device interacts with the other processing devices to jointly complete the operation designated by the user. The combined processing device may further include a storage device, which is connected to the machine learning arithmetic device and the other processing devices, respectively, and stores data of the machine learning arithmetic device and the other processing devices.

In a fifth aspect, an embodiment of the present invention provides a neural network chip, where the neural network chip includes the computing device according to the first aspect, the machine learning arithmetic device according to the third aspect, or the combined processing device according to the fourth aspect.

In a sixth aspect, an embodiment of the present invention provides a neural network chip package structure, where the neural network chip package structure includes the neural network chip described in the fifth aspect.

In a seventh aspect, an embodiment of the present invention provides a board, where the board includes a storage device, an interface device, a control device, and the neural network chip in the fifth aspect;

wherein, the neural network chip is respectively connected with the storage device, the control device and the interface device;

the storage device is used for storing data;

the interface device is used for realizing data transmission between the chip and external equipment;

and the control device is used for monitoring the state of the chip.

Further, the storage device includes a plurality of groups of memory cells, each group of memory cells being connected with the chip through a bus, and the memory cells being DDR SDRAM;

the chip includes a DDR controller for controlling data transmission to and data storage in each memory cell;

the interface device is as follows: a standard PCIE interface.

In an eighth aspect, an embodiment of the present invention provides an electronic device, where the electronic device includes the neural network chip described in the fifth aspect, the neural network chip package structure described in the sixth aspect, or the board described in the seventh aspect.

In some embodiments, the electronic device comprises a data processing apparatus, a robot, a computer, a printer, a scanner, a tablet, a smart terminal, a cell phone, a tachograph, a navigator, a sensor, a camera, a server, a cloud server, a camera, a camcorder, a projector, a watch, a headset, a mobile storage, a wearable device, a vehicle, a household appliance, and/or a medical device.

In some embodiments, the vehicle comprises an aircraft, a ship, and/or a road vehicle; the household appliance comprises a television, an air conditioner, a microwave oven, a refrigerator, an electric cooker, a humidifier, a washing machine, an electric lamp, a gas stove, and/or a range hood; the medical device comprises a nuclear magnetic resonance apparatus, a B-mode ultrasound apparatus, and/or an electrocardiograph.

These and other aspects of the invention are apparent from and will be elucidated with reference to the embodiments described hereinafter.

Drawings

In order to illustrate the technical solutions in the embodiments of the present application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present application; those skilled in the art can obtain other drawings from them without creative effort.

FIG. 1 is a schematic diagram of a data structure of fixed-point data according to an embodiment of the present disclosure;

FIG. 2 is a schematic diagram of another data structure of fixed-point data according to an embodiment of the present disclosure;

FIG. 3 is a schematic diagram of a computing device according to an embodiment of the present application;

FIG. 3A is a schematic block diagram of a computing device according to an embodiment of the present application;

FIG. 3B is a schematic block diagram of a computing device according to another embodiment of the present application;

FIG. 3C is a schematic block diagram of a computing device according to another embodiment of the present application;

FIG. 3D is a schematic structural diagram of a main processing circuit provided in an embodiment of the present application;

FIG. 3E is a schematic block diagram of a computing device according to another embodiment of the present application;

FIG. 3F is a schematic structural diagram of a tree module according to an embodiment of the present disclosure;

FIG. 3G is a schematic block diagram of a computing device according to another embodiment of the present application;

FIG. 3H is a schematic block diagram of a computing device according to another embodiment of the present application;

FIG. 4 is a flowchart illustrating a forward operation of a single-layer artificial neural network according to an embodiment of the present disclosure;

FIG. 5 is a flow chart of a forward operation and a reverse training of a neural network according to an embodiment of the present disclosure;

FIG. 6 is a structural diagram of a combined processing device provided in an embodiment of the present application;

FIG. 6A is a schematic block diagram of a computing device according to another embodiment of the present application;

FIG. 7 is a block diagram of another combined processing device provided in an embodiment of the present application;

FIG. 8 is a schematic structural diagram of a board card provided in the embodiment of the present application;

FIG. 9 is a schematic flowchart of a calculation method according to an embodiment of the present application.

Detailed Description

The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some, but not all, embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

The terms "first," "second," "third," and "fourth," etc. in the description and claims of this application and in the accompanying drawings are used for distinguishing between different objects and not for describing a particular order. Furthermore, the terms "include" and "have," as well as any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements listed, but may alternatively include other steps or elements not listed, or inherent to such process, method, article, or apparatus.

Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those skilled in the art understand, explicitly and implicitly, that the embodiments described herein can be combined with other embodiments.

The embodiment of the application provides a data type, wherein the data type comprises an adjustment factor, and the adjustment factor is used for indicating the value range and the precision of the data type.

The adjustment factor comprises a first scaling factor and, optionally, a second scaling factor. The first scaling factor indicates the precision of the data type; the second scaling factor is used for adjusting the value range of the data type.

Optionally, the first scaling factor may be 2^-m, 8^-m, 10^-m, 2, 3, 6, 9, 10, 2^m, 8^m, 10^m, or another value.

Specifically, the first scaling factor may be a decimal point position. For example, shifting the decimal point of binary input data INA1 to the right by m bits gives input data INB1 = INA1 × 2^m; that is, INB1 is INA1 enlarged by a factor of 2^m. As another example, shifting the decimal point of decimal input data INA2 to the left by n digits gives input data INB2 = INA2 / 10^n; that is, INB2 is INA2 reduced by a factor of 10^n. Here m and n are integers.

Alternatively, the second scaling factor may be 2, 8, 10, 16, or other values.

For example, the value range of the data type corresponding to the input data is 8^-15 to 8^16. During operation, when an operation result is greater than the maximum value of the value range of the data type corresponding to the input data, the value range is multiplied by the second scaling factor of the data type (namely 8) to obtain a new value range of 8^-14 to 8^17; when an operation result is smaller than the minimum value of the value range of the data type corresponding to the input data, the value range is divided by the second scaling factor of the data type (8) to obtain a new value range of 8^-16 to 8^15.
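The range adjustment in this example can be sketched as follows. The function name `adjust_range` is hypothetical, and exact rational arithmetic from Python's standard `fractions` module is used here only so that the very small lower bound is not rounded.

```python
from fractions import Fraction


def adjust_range(low, high, result, factor):
    """Return the value range after observing `result`.

    The range is multiplied by the second scaling factor when the result exceeds
    the maximum, and divided by it when the result falls below the minimum.
    """
    if result > high:
        return low * factor, high * factor
    if result < low:
        return low / factor, high / factor
    return low, high


if __name__ == "__main__":
    low, high = Fraction(8) ** -15, Fraction(8) ** 16
    grown = adjust_range(low, high, high + 1, 8)
    print(grown == (Fraction(8) ** -14, Fraction(8) ** 17))
```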

Scaling factors may be added to data in any format (e.g., floating point number, discrete data) to adjust the size and precision of the data.

It should be noted that the decimal point position mentioned throughout the description of the present application may be the first scaling factor described above; this is not repeated below.

The structure of fixed-point data is described below with reference to FIG. 1, which is a schematic diagram of a data structure of fixed-point data according to an embodiment of the present application. FIG. 1 shows signed fixed-point data occupying X bits, which may also be referred to as X-bit fixed-point data. The X-bit fixed-point data includes a sign bit occupying 1 bit, integer bits occupying M bits, and fractional bits occupying N bits, where X - 1 = M + N. Unsigned fixed-point data includes only the M integer bits and the N fractional bits, i.e., X = M + N.

Compared with a 32-bit floating-point representation, the short-bit fixed-point representation adopted by the invention occupies fewer bits. In addition, for data of the same layer and the same type in a network model, such as all convolution kernels, input neurons, or bias data of the first convolution layer, a flag bit named Point Location is provided to record the decimal point position of the fixed-point data. The value of the flag bit can be adjusted according to the distribution of the input data, thereby adjusting the precision of the fixed-point data and the range the fixed-point data can represent.
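The trade-off controlled by the decimal point position can be illustrated with a short sketch. It assumes signed data with one sign bit and treats the decimal point position as the number of fractional bits; the function name `fixed_point_range` is hypothetical.

```python
def fixed_point_range(total_bits, point_position):
    """Return (resolution, largest positive value) of signed fixed-point data."""
    magnitude_bits = total_bits - 1           # one bit is reserved for the sign
    resolution = 2.0 ** -point_position       # value of the least significant bit
    max_value = (2 ** magnitude_bits - 1) * resolution
    return resolution, max_value


if __name__ == "__main__":
    # A larger decimal point position gives finer resolution but a smaller range.
    for position in (2, 5, 8):
        print(position, fixed_point_range(16, position))
```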

For example, the floating-point number 68.6875 is converted to signed 16-bit fixed-point data with a decimal point position of 5. In signed 16-bit fixed-point data with a decimal point position of 5, the integer part occupies 10 bits, the fractional part occupies 5 bits, and the sign bit occupies 1 bit. The conversion unit converts the floating-point number 68.6875 to the signed 16-bit fixed-point data 0000100010010110, as shown in FIG. 2.
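The conversion can be reproduced with a small sketch: the value is scaled by 2^5 (68.6875 × 32 = 2198) and the resulting integer is written in 16 bits. The function names, rounding to the nearest integer, and the two's-complement encoding of negative values are assumptions of this sketch; the application does not specify them here.

```python
def float_to_fixed_bits(value, total_bits=16, point_position=5):
    """Encode `value` as a signed fixed-point bit string."""
    scaled = round(value * (1 << point_position))
    low = -(1 << (total_bits - 1))
    high = (1 << (total_bits - 1)) - 1
    if not low <= scaled <= high:
        raise OverflowError("value does not fit in the requested fixed-point format")
    # Masking maps negative integers to their two's-complement bit pattern.
    return format(scaled & ((1 << total_bits) - 1), f"0{total_bits}b")


def fixed_bits_to_float(bits, point_position=5):
    """Decode a bit string produced by float_to_fixed_bits."""
    raw = int(bits, 2)
    if bits[0] == "1":            # sign bit set: undo the two's-complement wrap
        raw -= 1 << len(bits)
    return raw / (1 << point_position)


if __name__ == "__main__":
    bits = float_to_fixed_bits(68.6875)
    print(bits, fixed_bits_to_float(bits))
```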

First, the computing device used in the present application is described. Referring to FIG. 3, a computing device is provided, comprising a controller unit 11, an arithmetic unit 12, and a conversion unit 13, wherein the controller unit 11 is connected with the arithmetic unit 12, and the conversion unit 13 is connected with both the controller unit 11 and the arithmetic unit 12;

in a possible embodiment, the controller unit 11 is configured to obtain the first input data and the calculation instruction.

In one embodiment, the first input data is machine learning data. Further, the machine learning data includes input neuron data and weight data. The output neuron data is the final output result or intermediate data.

Optionally, the first input data and the calculation instruction may be obtained through a data input/output unit, which may be one or more data I/O interfaces or I/O pins.

The calculation instruction includes, but is not limited to, a convolution operation instruction, a forward training instruction, or another neural network operation instruction; the present invention does not limit the specific form of the calculation instruction.

The controller unit 11 is further configured to parse the calculation instruction to obtain a data conversion instruction and/or one or more operation instructions, where the data conversion instruction includes an operation domain and an operation code. The operation code indicates the function of the data conversion instruction, and the operation domain of the data conversion instruction includes a decimal point position, a flag bit indicating the data type of the first input data, and a conversion mode identifier of the data type.

When the operation domain of the data conversion instruction is an address of a storage space, the controller unit 11 obtains the decimal point position, the flag bit indicating the data type of the first input data, and the conversion mode identifier of the data type from the storage space corresponding to the address.
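The two ways of supplying the conversion parameters, directly in the operation domain or through an address, can be sketched as below. The class names, field names, the opcode string, and the dictionary used as a stand-in for the storage space are all hypothetical; the application does not define an encoding at this point.

```python
from dataclasses import dataclass
from typing import Dict, Union


@dataclass
class ConversionParams:
    point_position: int    # decimal point position
    input_type_flag: int   # flag bit indicating the data type of the first input data
    conversion_mode: int   # conversion mode identifier of the data type


@dataclass
class DataConversionInstruction:
    opcode: str
    # The operation domain either carries the parameters or an address to fetch them from.
    operation_domain: Union[ConversionParams, int]


def resolve_params(instruction: DataConversionInstruction,
                   storage: Dict[int, ConversionParams]) -> ConversionParams:
    """Return the conversion parameters, reading storage when an address is given."""
    domain = instruction.operation_domain
    if isinstance(domain, int):
        return storage[domain]
    return domain


if __name__ == "__main__":
    storage = {0x40: ConversionParams(5, 0, 1)}
    print(resolve_params(DataConversionInstruction("CONVERT", 0x40), storage))
```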

The controller unit 11 transmits the operation code and the operation domain of the data conversion instruction and the first input data to the conversion unit 13, and transmits the plurality of operation instructions to the operation unit 12;

the conversion unit 13 is configured to convert the first input data into second input data according to the operation code and the operation domain of the data conversion instruction, where the second input data is fixed-point data, and to transmit the second input data to the arithmetic unit 12;

the arithmetic unit 12 is configured to perform operations on the second input data according to the plurality of operation instructions to obtain a calculation result of the calculation instruction.

In a possible embodiment, the operation unit 12 is configured as a master-slave structure. For a calculation instruction of a forward operation, the operation unit can split the data according to that calculation instruction, so that the plurality of slave processing circuits 102 operate in parallel on the part with a large amount of computation, thereby increasing the operation speed, saving operation time, and in turn reducing power consumption. As shown in FIG. 3A, the arithmetic unit 12 includes a master processing circuit 101 and a plurality of slave processing circuits 102;

the master processing circuit 101 is configured to perform preprocessing on the second input data and to transfer data and the plurality of operation instructions to and from the plurality of slave processing circuits 102;

the plurality of slave processing circuits 102 are configured to perform intermediate operations according to the second input data and the plurality of operation instructions transmitted from the master processing circuit 101 to obtain a plurality of intermediate results, and to transmit the plurality of intermediate results to the master processing circuit 101;

the master processing circuit 101 is configured to perform subsequent processing on the plurality of intermediate results to obtain a calculation result of the calculation instruction.
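The master-slave division of work can be sketched for a single fully connected layer. Splitting the weights by rows, running the slaves one after another instead of in parallel, and using ReLU as the activation are assumptions of this sketch; the function names are hypothetical.

```python
def slave_compute(weight_block, inputs):
    """Slave sketch: multiply each weight row by the inputs and accumulate the products."""
    return [sum(w * x for w, x in zip(row, inputs)) for row in weight_block]


def master_forward(weights, inputs, num_slaves):
    """Master sketch: split the weights, collect intermediate results, then activate."""
    if num_slaves < 1:
        raise ValueError("at least one slave processing circuit is required")
    # Preprocessing: divide the weight rows into contiguous blocks, one per slave.
    rows_per_slave = max(1, -(-len(weights) // num_slaves))  # ceiling division
    blocks = [weights[i:i + rows_per_slave]
              for i in range(0, len(weights), rows_per_slave)]
    # Intermediate operation: in hardware the slaves would work in parallel.
    intermediate = [slave_compute(block, inputs) for block in blocks]
    # Subsequent processing: restore the original order and apply the activation.
    combined = [value for part in intermediate for value in part]
    return [max(value, 0) for value in combined]


if __name__ == "__main__":
    print(master_forward([[1, 2], [3, 4], [-1, -1]], [1, 1], num_slaves=2))
```

Integer inputs are used here to stand in for the raw values of fixed-point data; scaling by the decimal point position is omitted for brevity.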

In one embodiment, the machine learning operation includes a deep learning operation (i.e., an artificial neural network operation), and the machine learning data (i.e., the first input data) includes input neurons and weights (i.e., neural network model data). The output neuron is a calculation result or an intermediate result of the calculation instruction. The deep learning operation is taken as an example below, but it should be understood that the present application is not limited to deep learning operations.

Optionally, the computing device may further include the storage unit 10 and a direct memory access (DMA) unit 50. The storage unit 10 may include one or any combination of a register and a cache; specifically, the cache is used for storing the calculation instruction, and the register 201 is configured to store the first input data and a scalar. The first input data includes input neurons, weights, and output neurons.

Thecache 202 is a scratch pad cache.

TheDMA unit 50 is used to read or store data from thememory unit 10.

In a possible embodiment, the register 201 stores the operation instruction, the first input data, the decimal point position, a flag bit indicating the data type of the first input data, and a conversion mode identifier of the data type. The controller unit 11 obtains the operation instruction, the first input data, the decimal point position, the flag bit indicating the data type of the first input data, and the conversion mode identifier of the data type directly from the register 201; transmits the first input data, the decimal point position, the flag bit, and the conversion mode identifier to the conversion unit 13; and transmits the operation instruction to the operation unit 12;

the conversion unit 13 converts the first input data into the second input data according to the decimal point position, the flag bit indicating the data type of the first input data, and the conversion mode identifier of the data type, and then transmits the second input data to the operation unit 12;

the operation unit 12 performs an operation on the second input data according to the operation instruction to obtain an operation result.

Optionally, the controller unit 11 includes: an instruction cache unit 110, an instruction processing unit 111, and a storage queue unit 113;

the instruction cache unit 110 is configured to store the calculation instruction associated with the artificial neural network operation;

the instruction processing unit 111 is configured to parse the calculation instruction to obtain the data conversion instruction and the plurality of operation instructions, and to parse the data conversion instruction to obtain an operation code and an operation domain of the data conversion instruction;

the storage queue unit 113 is configured to store an instruction queue, where the instruction queue includes a plurality of operation instructions or calculation instructions to be executed in the order of the queue.

For example, in an alternative embodiment, the master processing circuit 101 may also include a control unit, and the control unit may include a master instruction processing unit, specifically configured to decode an instruction into microinstructions. Of course, in another alternative, the slave processing circuit 102 may also include another control unit that includes a slave instruction processing unit, specifically configured to receive and process microinstructions. A microinstruction may be the next-level instruction of an instruction; it may be obtained by splitting or decoding the instruction, and may be further decoded into control signals for each component, each unit, or each processing circuit.

In one alternative, the structure of the calculation instruction may be as shown in Table 1 below.

Operation code | Register or immediate | Register or immediate | ……

TABLE 1

The ellipsis in the above table indicates that multiple registers or immediates may be included.

In another alternative, the calculation instruction may include one or more operation domains and an operation code. The calculation instruction may include a neural network operation instruction. Taking the neural network operation instruction as an example, as shown in Table 2, register number 0, register number 1, register number 2, register number 3, and register number 4 may be operation domains. Each of register number 0, register number 1, register number 2, register number 3, and register number 4 may be the number of one or more registers.

TABLE 2

The register may be an off-chip memory, and in practical applications, may also be an on-chip memory for storing data, where the data may specifically be n-dimensional data, where n is an integer greater than or equal to 1, and for example, when n is equal to 1, the data is 1-dimensional data, that is, a vector, and when n is equal to 2, the data is 2-dimensional data, that is, a matrix, and when n is equal to 3 or more, the data is a multidimensional tensor.

Optionally, the controller unit 11 may further include:

a dependency processing unit 112, configured to determine, when there are multiple operation instructions, whether a first operation instruction has an association relationship with a zeroth operation instruction that precedes it; if the first operation instruction has an association relationship with the zeroth operation instruction, to cache the first operation instruction in the instruction cache unit 110 and, after the zeroth operation instruction has been completely executed, to extract the first operation instruction from the instruction cache unit 110 and transmit it to the operation unit 12;

determining whether the first operation instruction has an association relationship with the zeroth operation instruction that precedes it comprises: extracting, according to the first operation instruction, a first storage address interval of the data (such as a matrix) required by the first operation instruction, and extracting, according to the zeroth operation instruction, a zeroth storage address interval of the data required by the zeroth operation instruction; if the first storage address interval and the zeroth storage address interval overlap, determining that the first operation instruction and the zeroth operation instruction have an association relationship; and if the first storage address interval and the zeroth storage address interval do not overlap, determining that the first operation instruction and the zeroth operation instruction do not have an association relationship.
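The association check described above reduces to an interval-overlap test. The following is a minimal sketch, assuming half-open address intervals of non-zero length; the `OperationInstruction` record and its fields are hypothetical names introduced only for illustration and are not defined in this application.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OperationInstruction:
    """Hypothetical record: only the storage address interval matters here."""
    name: str
    start: int   # first address of the required data (inclusive)
    length: int  # number of addresses occupied, assumed > 0

    @property
    def end(self) -> int:
        return self.start + self.length  # exclusive upper bound


def has_dependency(first: OperationInstruction, zeroth: OperationInstruction) -> bool:
    """True when the two half-open address intervals share at least one address."""
    return first.start < zeroth.end and zeroth.start < first.end


zeroth = OperationInstruction("zeroth", start=0x100, length=0x40)
overlapping = OperationInstruction("first", start=0x120, length=0x40)
disjoint = OperationInstruction("first", start=0x140, length=0x40)

print(has_dependency(overlapping, zeroth))  # True: wait until the zeroth instruction completes
print(has_dependency(disjoint, zeroth))     # False: may be sent to the operation unit at once
```

In this sketch an instruction found to be dependent would stay in the instruction cache until the earlier instruction finishes, as the paragraph above describes.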

In another alternative embodiment, as shown in fig. 3B, the operation unit 12 includes a master processing circuit 101, a plurality of slave processing circuits 102, and a plurality of branch processing circuits 103.

The master processing circuit 101 is specifically configured to determine that the input neurons are broadcast data and that the weights are distribution data, to split the distribution data into a plurality of data blocks, and to send at least one data block of the plurality of data blocks, the broadcast data, and at least one operation instruction of the plurality of operation instructions to the branch processing circuits 103;

the branch processing circuits 103 are configured to forward data blocks, broadcast data, and operation instructions between the master processing circuit 101 and the plurality of slave processing circuits 102;

the slave processing circuits 102 are configured to perform an operation on the received data block and broadcast data according to the operation instruction to obtain an intermediate result, and to transmit the intermediate result to the branch processing circuits 103;

the master processing circuit 101 is further configured to perform subsequent processing on the intermediate results sent from the branch processing circuits 103 to obtain a result of the operation instruction, and to send the result of the operation instruction to the controller unit 11.

In another alternative embodiment, the operation unit 12 may include a master processing circuit 101 and a plurality of slave processing circuits 102, as shown in fig. 3C. The plurality of slave processing circuits 102 are arranged in an array of m rows and n columns; each slave processing circuit 102 is connected to the adjacent slave processing circuits 102, and the master processing circuit 101 is connected to K slave processing circuits 102 among the plurality of slave processing circuits 102. The K slave processing circuits 102 are: the n slave processing circuits 102 in the 1st row, the n slave processing circuits 102 in the m-th row, and the m slave processing circuits 102 in the 1st column. It should be noted that, as shown in fig. 3C, the K slave processing circuits 102 include only these circuits; that is, the K slave processing circuits 102 are the slave processing circuits 102 that are directly connected to the master processing circuit 101.

the K slave processing circuits 102 are configured to forward data and instructions between the master processing circuit 101 and the plurality of slave processing circuits 102;

the master processing circuit 101 is further configured to determine that the input neurons are broadcast data and that the weights are distribution data, to split the distribution data into a plurality of data blocks, and to send at least one data block of the plurality of data blocks and at least one operation instruction of the plurality of operation instructions to the K slave processing circuits 102;

the K slave processing circuits 102 are configured to transfer data between the master processing circuit 101 and the plurality of slave processing circuits 102;

the slave processing circuits 102 are configured to perform an operation on the received data block according to the operation instruction to obtain an intermediate result, and to transmit the intermediate result to the K slave processing circuits 102;

the master processing circuit 101 is configured to process the intermediate results sent by the K slave processing circuits 102 to obtain a result of the calculation instruction, and to send the result of the calculation instruction to the controller unit 11.

Optionally, as shown in fig. 3D, the master processing circuit 101 in fig. 3A to 3C may further include one or any combination of an activation processing circuit 1011 and an addition processing circuit 1012;

the activation processing circuit 1011 is configured to perform an activation operation on data in the master processing circuit 101;

the addition processing circuit 1012 is configured to perform an addition or accumulation operation.

The slave processing circuit 102 includes: a multiplication processing circuit, configured to perform a multiplication operation on the received data block to obtain a product result; a forwarding processing circuit (optional), configured to forward the received data block or the product result; and an accumulation processing circuit, configured to perform an accumulation operation on the product result to obtain the intermediate result.

In a possible embodiment, before the operation unit 12 of the computing device performs the operation of the i-th layer of the multilayer neural network model, the controller unit 11 of the computing device obtains a configuration instruction, which includes a decimal point position and the data type participating in the operation. The controller unit 11 parses the configuration instruction to obtain the decimal point position and the data type participating in the operation, or obtains them directly from the storage unit 10. After obtaining the input data, the controller unit 11 determines whether the data type of the input data is consistent with the data type participating in the operation. When the two are not consistent, the controller unit 11 transmits the input data, the decimal point position, and the data type participating in the operation to the conversion unit 13; the conversion unit 13 performs data type conversion on the input data according to the decimal point position and the data type participating in the operation, so that the data type of the input data becomes consistent with the data type participating in the operation; the converted data is then transmitted to the operation unit 12, and the master processing circuit 101 and the slave processing circuits 102 of the operation unit 12 operate on the converted input data. When the two are consistent, the controller unit 11 transmits the input data to the operation unit 12, and the master processing circuit 101 and the slave processing circuits 102 of the operation unit 12 operate on the input data directly, without data type conversion.

Further, when the input data is fixed-point data and the data type participating in the operation is fixed-point data, the controller unit 11 determines whether the decimal point position of the input data is consistent with the decimal point position participating in the operation. If not, the controller unit 11 transmits the input data, the decimal point position of the input data, and the decimal point position participating in the operation to the conversion unit 13; the conversion unit 13 converts the input data into fixed-point data whose decimal point position is consistent with the decimal point position participating in the operation, and then transmits the converted data to the operation unit 12; the master processing circuit 101 and the slave processing circuits 102 of the operation unit 12 operate on the converted data.
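The type and decimal-point check described above can be illustrated with a short sketch. It assumes that the data type participating in the operation is fixed-point, that a decimal point position `point` means a resolution of 2^(-point), that conversion rounds down, and that overflow saturates; the `FixedPoint` container and all function names are hypothetical and not taken from this application.

```python
import math
from dataclasses import dataclass
from typing import List, Union


@dataclass
class FixedPoint:
    """Hypothetical container: real value = raw * 2**(-point)."""
    raw: List[int]
    point: int      # decimal point position (number of fractional bits)
    bits: int = 16  # total bit width, sign included


def _saturate(value: int, bits: int) -> int:
    """Clamp to the signed range of the bit width (an assumed overflow policy)."""
    low, high = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return max(low, min(high, value))


def float_to_fixed(data: List[float], point: int, bits: int = 16) -> FixedPoint:
    """Floating point -> fixed point, rounding down to a multiple of 2**(-point)."""
    raw = [_saturate(math.floor(x * 2 ** point), bits) for x in data]
    return FixedPoint(raw, point, bits)


def change_point(fx: FixedPoint, point: int) -> FixedPoint:
    """Fixed point -> fixed point with a different decimal point position."""
    shift = point - fx.point
    if shift >= 0:
        raw = [_saturate(v << shift, fx.bits) for v in fx.raw]
    else:
        raw = [v >> -shift for v in fx.raw]  # arithmetic shift, rounds down
    return FixedPoint(raw, point, fx.bits)


def prepare_input(data: Union[List[float], FixedPoint], point: int) -> FixedPoint:
    """Controller-side check: convert only when type or point position differs."""
    if not isinstance(data, FixedPoint):
        return float_to_fixed(data, point)  # data type mismatch
    if data.point != point:
        return change_point(data, point)    # decimal point position mismatch
    return data                             # already consistent, no conversion


a = prepare_input([1.5, -0.25], point=4)
b = prepare_input(a, point=2)
print(a.raw, b.raw)  # [24, -4] [6, -1]
```

Input that already matches the configured type and decimal point position is passed through unchanged, which corresponds to the case in which the operation unit operates on the data directly.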

In other words, the arithmetic instruction may be replaced with the configuration instruction.

In another embodiment, the operation instruction is a matrix by matrix instruction, an accumulation instruction, an activation instruction, or the like.

In an alternative embodiment, as shown in fig. 3E, the operation unit 12 comprises a tree module 40, the tree module 40 comprising a root port and a plurality of branch ports, wherein the root port of the tree module is connected to the master processing circuit 101, and each of the branch ports of the tree module is connected to one slave processing circuit 102 of the plurality of slave processing circuits 102;

the tree module has a transceiving function: fig. 3E shows the tree module performing the transmitting function, and fig. 6A shows the tree module performing the receiving function.

The tree module is configured to forward data blocks, weights, and operation instructions between the master processing circuit 101 and the plurality of slave processing circuits 102.

Optionally, the tree module is an optional structure of the computing device and may include at least one layer of nodes, where the nodes are wiring structures with a forwarding function and need not have a computing function themselves. If the tree module has zero layers of nodes, the tree module is not needed.

Optionally, the tree module may have an n-ary tree structure, for example, a binary tree structure as shown in fig. 3F, or a ternary tree structure, where n may be an integer greater than or equal to 2. The present embodiment does not limit the specific value of n. The number of layers may be 2, and the slave processing circuits 102 may be connected to nodes of layers other than the penultimate layer, for example, to the nodes of the last layer shown in fig. 3F.

Optionally, the operation unit 12 may carry a separate cache. As shown in fig. 3G, it may include a neuron cache unit 63, which caches the input neuron vector data and the output neuron value data of the slave processing circuits 102.

As shown in fig. 3H, the operation unit 12 may further include a weight cache unit 64, configured to cache the weight data required by the slave processing circuits 102 during the calculation.

In an alternative embodiment, taking the fully-connected operation in the neural network operation as an example, the process may be y = f(wx + b), where x is the input neuron matrix, w is the weight matrix, b is the bias scalar, and f is the activation function, which may specifically be any of the sigmoid, tanh, relu, and softmax functions. Assuming here a binary tree structure with 8 slave processing circuits 102, the method may be implemented as follows:

the controller unit 11 acquires the input neuron matrix x, the weight matrix w, and a fully-connected operation instruction from the storage unit 10, and transmits the input neuron matrix x, the weight matrix w, and the fully-connected operation instruction to the master processing circuit 101;

the master processing circuit 101 splits the input neuron matrix x into 8 sub-matrices, distributes the 8 sub-matrices to the 8 slave processing circuits 102 through the tree module, and broadcasts the weight matrix w to the 8 slave processing circuits 102;

the slave processing circuits 102 perform, in parallel, the multiplication and accumulation operations of the 8 sub-matrices with the weight matrix w to obtain 8 intermediate results, and send the 8 intermediate results to the master processing circuit 101;

the master processing circuit 101 is configured to arrange the 8 intermediate results in order to obtain the operation result of wx, apply the bias b to the operation result, and then perform the activation operation to obtain the final result y; it sends the final result y to the controller unit 11, and the controller unit 11 outputs the final result y or stores it into the storage unit 10.
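The split, parallel multiply-accumulate, and reassembly steps above can be sketched in software as follows. This is only an illustration under stated assumptions: x is split column-wise, the 8 slave processing circuits are simulated sequentially, and relu is used as the activation; the function names are hypothetical.

```python
from typing import Callable, List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain multiply-accumulate, as each slave circuit would perform it."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def split_columns(x: Matrix, parts: int) -> List[Matrix]:
    """Split x column-wise into `parts` contiguous sub-matrices."""
    cols = len(x[0])
    bounds = [cols * p // parts for p in range(parts + 1)]
    return [[row[bounds[p]:bounds[p + 1]] for row in x] for p in range(parts)]


def fully_connected(w: Matrix, x: Matrix, b: float,
                    f: Callable[[float], float], slaves: int = 8) -> Matrix:
    blocks = split_columns(x, slaves)               # master: distribute sub-matrices
    partial = [matmul(w, blk) for blk in blocks]    # slaves: w times each sub-matrix
    # master: concatenate the intermediate results in order to rebuild wx
    wx = [sum((p[i] for p in partial), []) for i in range(len(w))]
    return [[f(v + b) for v in row] for row in wx]  # master: bias, then activation


def relu(v: float) -> float:
    return max(0.0, v)


w = [[1.0, -1.0]]
x = [[0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0],
     [7.0, 6.0, 5.0, 4.0, 3.0, 2.0, 1.0, 0.0]]
y = fully_connected(w, x, b=0.5, f=relu)
print(y)  # [[0.0, 0.0, 0.0, 0.0, 1.5, 3.5, 5.5, 7.5]]
```

Because each sub-matrix product is independent of the others, the result is the same as computing f(wx + b) without splitting, which is what allows the slave processing circuits to work in parallel.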

In one embodiment, the operation unit 12 includes, but is not limited to: one or more multipliers of a first part; one or more adders of a second part (more specifically, the adders of the second part may also form an addition tree); an activation function unit of a third part; and/or a vector processing unit of a fourth part. More specifically, the vector processing unit may process vector operations and/or pooling operations.

The first part multiplies input data 1 (in1) by input data 2 (in2) to obtain the multiplied output (out): out = in1 * in2.

The second part adds the input data in1 through an adder to obtain the output data (out). More specifically, when the second part is an addition tree, the input data in1 is added stage by stage through the addition tree to obtain the output data (out), where in1 is a vector of length N and N is greater than 1: out = in1[1] + in1[2] + ... + in1[N]; and/or the input data (in1) is accumulated through the addition tree and then added to the input data (in2) to obtain the output data (out): out = in1[1] + in1[2] + ... + in1[N] + in2; or the input data (in1) and the input data (in2) are added to obtain the output data (out): out = in1 + in2.

The third part applies an activation function (active) to the input data (in) to obtain the activation output data (out): out = active(in), where the activation function active may be sigmoid, tanh, relu, softmax, and the like. In addition to the activation operation, the third part may implement other non-linear functions, obtaining the output data (out) by applying an operation (f) to the input data (in): out = f(in).

The vector processing unit obtains the output data (out) by pooling the input data (in): out = pool(in), where pool is the pooling operation, which includes but is not limited to mean pooling, maximum pooling, and median pooling; the input data in is the data in the pooling kernel associated with the output out.

The operations executed by the operation unit include: the first part multiplying input data 1 by input data 2 to obtain the multiplied data; and/or the second part performing an addition operation (more specifically, an addition tree operation that adds input data 1 stage by stage through an addition tree) or adding input data 1 and input data 2 to obtain the output data; and/or the third part performing an activation function operation, obtaining the output data by applying an activation function (active) to the input data; and/or the fourth part performing a pooling operation, out = pool(in), where pool is a pooling operation including, but not limited to, mean pooling, maximum pooling, and median pooling, and the input data in is the data in the pooling kernel associated with the output out. One or more of the above parts can be freely selected and combined in different orders, thereby implementing operations with various functions. The operation unit correspondingly forms a two-stage, three-stage, or four-stage pipeline architecture.
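To make the combination of parts concrete, the following sketch chains a multiplier stage, an addition tree, and an activation stage, and adds a separate pooling helper. It is a software illustration only, assuming a binary addition tree; the function names are hypothetical and do not appear in this application.

```python
import statistics
from typing import Callable, List, Sequence


def multiply(in1: Sequence[float], in2: Sequence[float]) -> List[float]:
    """First part: element-wise out = in1 * in2."""
    return [a * b for a, b in zip(in1, in2)]


def adder_tree(in1: Sequence[float]) -> float:
    """Second part: sum pairwise, level by level, like a binary addition tree."""
    level = list(in1)
    if not level:
        return 0.0
    while len(level) > 1:
        nxt = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:          # odd element passes through to the next level
            nxt.append(level[-1])
        level = nxt
    return level[0]


def multiply_add_activate(in1: Sequence[float], in2: Sequence[float], bias: float,
                          active: Callable[[float], float]) -> float:
    """Three-stage combination: multiply, addition tree plus bias, activation."""
    return active(adder_tree(multiply(in1, in2)) + bias)


def pool(window: Sequence[float], kind: str = "max") -> float:
    """Fourth part: out = pool(in) over the data of one pooling kernel."""
    ops = {"mean": statistics.mean, "max": max, "median": statistics.median}
    return ops[kind](window)


print(adder_tree([1, 2, 3, 4, 5]))                      # 15
print(multiply_add_activate([1.0, 2.0], [3.0, 4.0], -1.0,
                            lambda v: max(0.0, v)))     # 10.0
print(pool([1, 2, 3, 6], "median"))                     # 2.5
```

Selecting a different subset of stages, or a different order, gives the other pipeline depths mentioned above.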

It should be noted that the first input data is long-bit non-fixed point data, such as 32-bit floating point data, or may be standard 64-bit or 16-bit floating point data, and the description is given here only with 32 bits as a specific example; the second input data is short-digit fixed-point data, which is also called less-digit fixed-point data and represents fixed-point data represented by a smaller number of digits relative to the first input data of long-digit non-fixed-point data.

In one possible embodiment, the first input data is non-fixed point data, the second input data is fixed point data, and the number of bits occupied by the first input data is greater than or equal to the number of bits occupied by the second input data. For example, the first input data is 32-bit floating point data, and the second input data is 32-bit fixed point data; for another example, the first input data is 32-bit floating point data, and the second input data is 16-bit fixed point data.

In particular, the first input data comprises different types of data for different layers of different network models. The decimal point positions of the different types of data are different, namely the accuracy of the corresponding fixed point data is different. For a fully connected layer, the first input data comprises data such as input neurons, weights, bias data and the like; in the case of convolutional layers, the first input data includes data such as convolutional kernels, input neurons, and offset data.

For example, for a fully connected layer, the decimal point locations include the decimal point locations of the input neurons, the decimal point locations of the weights, and the decimal point locations of the offset data. The positions of the decimal points of the input neurons, the positions of the decimal points of the weights and the positions of the decimal points of the offset data can be all the same or partially the same or different from each other.

In a possible embodiment, after the controller unit 11 or the operation unit 12 obtains the decimal point position of the first input data according to the above process, the decimal point position of the first input data is stored in the cache 202 of the storage unit 10.

When the calculation instruction is an immediate addressing instruction, the master processing circuit 101 directly converts the first input data into the second input data according to the decimal point position indicated by the operation domain of the calculation instruction; when the calculation instruction is a direct addressing instruction or an indirect addressing instruction, the master processing circuit 101 obtains the decimal point position of the first input data from the storage space indicated by the operation domain of the calculation instruction, and then converts the first input data into the second input data according to that decimal point position.
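The three addressing cases can be sketched as a small lookup routine. This assumes the conventional meaning of the modes (an immediate operand is the value itself, a direct operand is the address of the value, and an indirect operand is the address of a location that holds the address of the value); the dictionary-based `memory` and the function name are hypothetical.

```python
from typing import Dict


def resolve_point_position(mode: str, operand: int, memory: Dict[int, int]) -> int:
    """Return the decimal point position named by an operation domain."""
    if mode == "immediate":
        return operand                   # the operand is the position itself
    if mode == "direct":
        return memory[operand]           # the operand is the address of the position
    if mode == "indirect":
        return memory[memory[operand]]   # the operand points to that address
    raise ValueError(f"unknown addressing mode: {mode}")


# Assumed layout: address 0x10 holds the position 5, address 0x20 holds 0x10.
memory = {0x10: 5, 0x20: 0x10}
print(resolve_point_position("immediate", 5, memory))    # 5
print(resolve_point_position("direct", 0x10, memory))    # 5
print(resolve_point_position("indirect", 0x20, memory))  # 5
```

Whichever mode is used, the resolved position is then the one applied when converting the first input data into the second input data.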

The computing device may further include a rounding unit. During the operation, the precision of an operation result obtained by performing addition, multiplication, and/or other operations on the second input data (the operation result including intermediate operation results and the result of the calculation instruction) may exceed the precision range of the current fixed-point data, so the intermediate operation result is buffered. After the operation is finished, the rounding unit performs a rounding operation on the operation result that exceeds the precision range of the fixed-point data to obtain a rounded operation result, and the data conversion unit then converts the rounded operation result into data of the current fixed-point data type.

Specifically, the rounding unit performs a rounding operation on the intermediate operation result, the rounding operation being any one of a random rounding operation, a rounding (round-to-nearest) operation, a round-up operation, a round-down operation, and a truncation rounding operation.

When the rounding unit performs the random rounding operation, the rounding unit specifically performs the following operation:

$$y = \begin{cases} \lfloor x \rfloor & \text{with probability } 1 - \dfrac{x - \lfloor x \rfloor}{\varepsilon} \\ \lfloor x \rfloor + \varepsilon & \text{with probability } \dfrac{x - \lfloor x \rfloor}{\varepsilon} \end{cases}$$

wherein y represents the data obtained by randomly rounding the pre-rounding operation result x, i.e. the rounded operation result; $\varepsilon$ is the smallest positive number that can be represented by the current fixed-point data representation format, i.e. 2^{-Point Location}; and $\lfloor x \rfloor$ represents the data obtained by directly truncating the pre-rounding operation result x to fixed-point data (similar to rounding a decimal down). The formula indicates that randomly rounding the pre-rounding operation result x yields $\lfloor x \rfloor$ with probability $1 - \frac{x - \lfloor x \rfloor}{\varepsilon}$ and yields $\lfloor x \rfloor + \varepsilon$ with probability $\frac{x - \lfloor x \rfloor}{\varepsilon}$.

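When the rounding unit performs the rounding (round-to-nearest) operation, the rounding unit specifically performs the following operation:

$$y = \begin{cases} \lfloor x \rfloor & \text{if } \lfloor x \rfloor \le x < \lfloor x \rfloor + \dfrac{\varepsilon}{2} \\ \lfloor x \rfloor + \varepsilon & \text{if } \lfloor x \rfloor + \dfrac{\varepsilon}{2} \le x < \lfloor x \rfloor + \varepsilon \end{cases}$$

wherein y represents the data obtained by rounding the pre-rounding operation result x, i.e. the rounded operation result; $\varepsilon$ is the smallest positive number that can be represented by the current fixed-point data representation format, i.e. 2^{-Point Location}; and $\lfloor x \rfloor$ is an integer multiple of $\varepsilon$, namely the largest such number less than or equal to x. The formula indicates that when the pre-rounding operation result x lies in the lower half of the interval between $\lfloor x \rfloor$ and $\lfloor x \rfloor + \varepsilon$, the rounded operation result is $\lfloor x \rfloor$, and when it lies in the upper half (including the midpoint $\lfloor x \rfloor + \frac{\varepsilon}{2}$), the rounded operation result is $\lfloor x \rfloor + \varepsilon$.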

When the rounding unit performs the round-up operation, the rounding unit specifically performs the following operation:

$$y = \lceil x \rceil$$

wherein y represents the data obtained by rounding up the pre-rounding operation result x, i.e. the rounded operation result; $\lceil x \rceil$ is an integer multiple of $\varepsilon$, namely the smallest such number greater than or equal to x; and $\varepsilon$ is the smallest positive number that can be represented by the current fixed-point data representation format, i.e. 2^{-Point Location}.

When the rounding unit performs the round-down operation, the rounding unit specifically performs the following operation:

$$y = \lfloor x \rfloor$$

wherein y represents the data obtained by rounding down the pre-rounding operation result x, i.e. the rounded operation result; $\lfloor x \rfloor$ is an integer multiple of $\varepsilon$, namely the largest such number less than or equal to x; and $\varepsilon$ is the smallest positive number that can be represented by the current fixed-point data representation format, i.e. 2^{-Point Location}.

When the rounding unit performs the truncation rounding operation, the rounding unit specifically performs the following operation:

$$y = [x]$$

wherein y represents the data obtained by truncating the pre-rounding operation result x, i.e. the rounded operation result, and [x] represents the data obtained by directly truncating the operation result x to fixed-point data.
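The five rounding operations can be compared side by side in a short sketch. It assumes that ε = 2^(-point) for a decimal point position `point`, that round-to-nearest sends the midpoint upward, and that truncation discards the fractional part toward zero; the last point is an assumption, since dropping low-order bits of a two's-complement value would instead coincide with rounding down. The function name and mode labels are hypothetical.

```python
import math
import random
from typing import Optional


def round_fixed(x: float, point: int, mode: str,
                rng: Optional[random.Random] = None) -> float:
    """Round x to a multiple of eps = 2**(-point) using the named mode."""
    eps = 2.0 ** (-point)
    lower = math.floor(x / eps) * eps  # largest multiple of eps that is <= x
    frac = (x - lower) / eps           # position inside [lower, lower + eps), in [0, 1)
    if mode == "random":
        draw = rng.random() if rng else random.random()
        return lower + eps if draw < frac else lower  # up with probability frac
    if mode == "nearest":
        return lower + eps if frac >= 0.5 else lower
    if mode == "up":
        return math.ceil(x / eps) * eps
    if mode == "down":
        return lower
    if mode == "truncate":
        return math.trunc(x / eps) * eps  # toward zero (assumed interpretation)
    raise ValueError(f"unknown rounding mode: {mode}")


print(round_fixed(1.3, 2, "down"))        # 1.25
print(round_fixed(1.3, 2, "up"))          # 1.5
print(round_fixed(1.375, 2, "nearest"))   # 1.5
print(round_fixed(-1.3, 2, "truncate"))   # -1.25
print(round_fixed(1.3, 2, "random", random.Random(0)))  # 1.25 or 1.5
```

With random rounding the expected value of the result equals x, so rounding errors tend to cancel over many operations, whereas the deterministic modes introduce a consistent bias.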

After the rounding unit obtains the rounded intermediate operation result, the operation unit 12 converts the rounded intermediate operation result into data of the current fixed-point data type according to the decimal point position of the first input data.

In a possible embodiment, the operation unit 12 does not perform truncation processing on those of the one or more intermediate results whose data type is floating-point data.

The intermediate results obtained by the slave processing circuits 102 of the operation unit 12 through the above operations would normally be truncated, because intermediate results produced by multiplication, division, and the like during the operation may exceed the memory storage range. However, because the intermediate results generated during the operation of this method are not stored in the memory, intermediate results that exceed the memory storage range need not be truncated, which greatly reduces the precision loss of the intermediate results and improves the precision of the calculation result.

In a possible embodiment, the operation unit 12 further includes a derivation unit. When the operation unit 12 receives the decimal point position of the input data participating in the fixed-point operation, the derivation unit derives, from that decimal point position, the decimal point position of the one or more intermediate results obtained during the fixed-point operation. When an intermediate result obtained by the operation subunit exceeds the range indicated by its corresponding decimal point position, the derivation unit shifts the decimal point position of the intermediate result to the left by M bits, so that the precision of the intermediate result falls within the precision range indicated by the decimal point position of the intermediate result, where M is an integer greater than 0.

For example, the first input data includes input data I1 and input data I2, whose corresponding decimal point positions are P1 and P2 respectively, with P1 > P2. When the operation type indicated by the operation instruction is an addition or subtraction operation, that is, the operation subunit performs I1 + I2 or I1 - I2, the derivation unit derives P1 as the decimal point position of the intermediate result of the operation indicated by the operation instruction; when the operation type indicated by the operation instruction is a multiplication operation, that is, the operation subunit performs I1 × I2, the derivation unit derives P1 × P2 as the decimal point position of the intermediate result of the operation indicated by the operation instruction.

In a possible embodiment, the operation unit 12 further includes:

a data cache unit, configured to cache the one or more intermediate results.

In an optional embodiment, the computing apparatus further includes a data statistics unit, configured to perform statistics on input data of the same type in each layer of the multi-layer network model to obtain a position of a decimal point of each type of input data in each layer.

The data statistics unit may also be part of an external device, in which case the computing device acquires the decimal point position of the data participating in the operation from the external device before performing the data conversion.

Specifically, the data statistic unit includes:

the acquisition subunit is used for extracting input data of the same type in each layer of the multilayer network model;

the statistical subunit is used for counting and acquiring the distribution proportion of the input data of the same type in each layer of the multilayer network model in a preset interval;

and the analysis subunit is used for acquiring the decimal point position of the input data of the same type in each layer of the multilayer network model according to the distribution proportion.

Wherein the predetermined interval is [ -2 [ ]^X-1-i,2^X-1-i-2^-i]I is 0,1,2, …, n, n is a preset positive integer, and X is the number of bits occupied by the fixed-point data. The above-mentioned preset interval [ -2 [ ]^X-1-i,2^X-1-i-2^-i]Comprising n +1 subintervals. The statistical subunit counts distribution information of the input data of the same type in each layer of the multi-layer network model in the n +1 subintervals, and acquires the first distribution proportion according to the distribution information. The first distribution ratio is p₀,p₁,p₂,…,p_nAnd the n +1 numerical values are distribution ratios of the input data of the same type in each layer of the multilayer network model on the n +1 subintervals. The analysis subunit presets an overflow rate EPL, which takes the largest i from 0,1,2, …, n, so that p is_iAnd the maximum i is the decimal point position of the input data of the same type in each layer of the multilayer network model. In other words, the analysis subunit takes the decimal point position of the same type of input data in each layer of the multilayer network model as: max { i/p_i≧ 1-EPL, i ∈ {0,1,2, …, n } }, i.e., p satisfying greater than or equal to 1-EPL_iIn the method, the maximum subscript value i is selected as the decimal point position of the input data of the same type in each layer of the multilayer network model.

Here $p_i$ is the ratio of the number of input data of the same type in each layer of the multilayer network model whose values lie in the interval $[-2^{X-1-i},\ 2^{X-1-i}-2^{-i}]$ to the total number of input data of the same type in each layer of the multilayer network model. For example, if $m_2$ of the $m_1$ input data of the same type in each layer of the multilayer network model take values in the interval $[-2^{X-1-i},\ 2^{X-1-i}-2^{-i}]$, then $p_i = m_2 / m_1$.
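The selection rule described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the implementation of the data statistics unit: the function name `select_decimal_point_position` is hypothetical, and returning `None` when no subinterval meets the threshold is an assumption, since the text does not say what happens in that case.

```python
def select_decimal_point_position(values, bit_width, n, epl):
    """Return the largest i in 0..n with p_i >= 1 - EPL, or None if none qualifies.

    values    -- input data of one type in one layer (non-empty list of numbers)
    bit_width -- X, the number of bits occupied by the fixed-point data
    n         -- preset positive integer; subintervals are indexed 0..n
    epl       -- preset overflow rate
    """
    if not values:
        raise ValueError("values must not be empty")
    total = len(values)
    best = None
    for i in range(n + 1):
        lower = -(2.0 ** (bit_width - 1 - i))
        upper = 2.0 ** (bit_width - 1 - i) - 2.0 ** (-i)
        # p_i: share of the data whose values fall inside subinterval i
        p_i = sum(1 for v in values if lower <= v <= upper) / total
        if p_i >= 1.0 - epl:
            best = i  # keep the largest qualifying index
    return best


samples = [0.5, -0.25, 1.5, 3.0]
print(select_decimal_point_position(samples, bit_width=8, n=7, epl=0.0))   # 5
print(select_decimal_point_position(samples, bit_width=8, n=7, epl=0.25))  # 6
```

With an overflow rate of 0 every sample must fit, so the value 3.0 limits the result to i = 5; allowing 25% overflow lets one of the four samples fall outside the interval and raises the result to i = 6.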

In a feasible embodiment, to improve operation efficiency, the acquisition subunit extracts part of the input data of the same type in each layer of the multilayer network model, either randomly or by sampling, and obtains the decimal point position of that partial data according to the above method. Data conversion (including conversion from floating-point data to fixed-point data, conversion from fixed-point data to fixed-point data, and the like) is then performed on the input data of that type according to the decimal point position of the partial data, which improves calculation speed and efficiency while maintaining precision.

Optionally, the data statistics unit may determine bit width and decimal point position of the same type of data or the same layer of data according to the median of the same type of data or the same layer of data, or determine bit width and decimal point position of the same type of data or the same layer of data according to the average of the same type of data or the same layer of data.

Optionally, when an intermediate result obtained by the arithmetic unit from operating on data of the same type or of the same layer exceeds the value range corresponding to the decimal point position and bit width of that data, the arithmetic unit does not truncate the intermediate result but caches it in the data cache unit of the arithmetic unit for use in subsequent operations.

Specifically, the operation field includes a decimal point position of the input data and a conversion mode identifier of the data type. The instruction processing unit analyzes the data conversion instruction to obtain the decimal point position of the input data and the conversion mode identifier of the data type. The processing unit further comprises a data conversion unit which converts the first input data into second input data according to the decimal point position of the input data and the conversion mode identification of the data type.

It should be noted that the network model includes multiple layers, such as a fully connected layer, a convolutional layer, a pooling layer, and an input layer. Among the at least one input data, input data belonging to the same layer have the same decimal point position; that is, the input data of the same layer share the same decimal point position.

The input data includes different types of data, for example input neurons, weights, and bias data. Input data belonging to the same type have the same decimal point position; that is, input data of the same type share the same decimal point position.

For example, if the operation type indicated by the operation instruction is fixed-point operation and the input data participating in the operation indicated by the operation instruction is floating-point data, the data conversion unit converts the input data from floating-point data to fixed-point data before the fixed-point operation is performed; if the operation type indicated by the operation instruction is floating-point operation and the input data participating in the operation indicated by the operation instruction is fixed-point data, the data conversion unit converts the input data corresponding to the operation instruction from fixed-point data to floating-point data before the floating-point operation is performed.
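The two conversion directions can be illustrated with a short Python sketch. It assumes that the decimal point position `point` is the number of fractional bits (a resolution of 2^-point, matching the 2^-i term in the interval above) and that out-of-range values saturate; both are assumptions made for illustration, and the function names are hypothetical.

```python
def float_to_fixed(x, point, bit_width=16):
    """Quantize x to a signed integer code with `point` fractional bits.

    Assumption: values outside the representable range saturate.
    Note: Python's round() rounds exact halves to the nearest even integer.
    """
    code = round(x * 2.0 ** point)
    lowest = -(1 << (bit_width - 1))
    highest = (1 << (bit_width - 1)) - 1
    return max(lowest, min(highest, code))


def fixed_to_float(code, point):
    """Recover the real value represented by a fixed-point integer code."""
    return code / 2.0 ** point


code = float_to_fixed(1.5, point=8)
print(code, fixed_to_float(code, point=8))  # 384 1.5
```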

For the macro instructions involved in the present application (such as the calculation instruction and the data conversion instruction), the controller unit 11 may parse the macro instruction to obtain its operation field and operation code, and generate the micro instruction corresponding to the macro instruction according to the operation field and the operation code; alternatively, the controller unit 11 decodes the macro instruction to obtain the micro instruction corresponding to the macro instruction.

In one possible embodiment, a System on Chip (SOC) includes a main processor and a coprocessor, the main processor including the computing device. The coprocessor obtains the decimal point position of the input data of the same type in each layer of the multilayer network model according to the above method and transmits it to the computing device; alternatively, the computing device fetches that decimal point position from the coprocessor when it needs to use it.

In a possible embodiment, the first input data is non-fixed point data, and the non-fixed point data includes long-bit floating point data, short-bit floating point data, integer data, discrete data, and the like.

The first input data may be of a single data type or of mixed data types. For example, the input neurons, the weights and the bias data are all floating-point data; or part of the input neurons, weights and bias data is floating-point data and part is integer data; or the input neurons, weights and bias data are all integer data. The computing device can convert non-fixed-point data into fixed-point data, that is, convert data of types such as long-bit floating-point data, short-bit floating-point data, integer data and discrete data into fixed-point data. The fixed-point data may be signed fixed-point data or unsigned fixed-point data.

In a possible embodiment, the first input data and the second input data are fixed-point data, and the first input data and the second input data may be both signed fixed-point data, or both unsigned fixed-point data, or one of them is unsigned fixed-point data and the other is signed fixed-point data. And the position of the decimal point of the first input data is different from the position of the decimal point of the second input data.

In one possible embodiment, the first input data is fixed-point data, and the second input data is non-fixed-point data. In other words, the above-described computing device can implement conversion of fixed-point data into non-fixed-point data.

Fig. 4 is a flowchart of a forward operation of a single-layer neural network according to an embodiment of the present invention. The flow chart describes a process for a single layer neural network forward operation implemented using a computing device and instruction set implemented by the present invention. For each layer, the input neuron vectors are weighted and summed to calculate an intermediate result vector of the layer. The intermediate result vector is biased and activated to obtain an output neuron vector. And taking the output neuron vector as an input neuron vector of the next layer.
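The per-layer computation described above (weighted sum, bias, activation) can be sketched in plain Python. The sigmoid activation and the names `layer_forward` and `sigmoid` are assumptions chosen for illustration; the text does not name a particular activation function.

```python
import math


def sigmoid(h):
    """Example activation; the choice of sigmoid is an assumption."""
    return 1.0 / (1.0 + math.exp(-h))


def layer_forward(weights, inputs, biases, activation=sigmoid):
    """Compute one layer: activation(sum_j w[k][j] * x[j] + b[k]) per output neuron k."""
    outputs = []
    for row, b in zip(weights, biases):
        h = sum(w * x for w, x in zip(row, inputs)) + b  # intermediate result plus bias
        outputs.append(activation(h))
    return outputs


# The output neuron vector of one layer becomes the input neuron vector of the next.
hidden = layer_forward([[1.0, -1.0], [0.5, 0.5]], [2.0, 2.0], [0.0, 0.0])
output = layer_forward([[1.0, 1.0]], hidden, [0.0])
```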

In a specific application scenario, the computing device may be a training device. Before training the neural network model, the training device acquires the training data participating in the training, where the training data is non-fixed-point data, and obtains the decimal point position of the training data according to the above method. The training device converts the training data into training data represented as fixed-point data according to that decimal point position, and performs the forward neural network operation on the fixed-point training data to obtain a neural network operation result. For a neural network operation result that exceeds the data precision range representable at the decimal point position of the training data, the training device performs a random rounding operation, so that the rounded neural network operation result lies within that data precision range. In this way, the training device obtains the neural network operation result of each layer of the multilayer neural network, namely the output neurons. The training device obtains the output neuron gradient from the output neurons of each layer, performs the backward operation according to the output neuron gradient to obtain the weight gradient, and updates the weights of the neural network model according to the weight gradient.
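One common way to realize a random rounding operation is stochastic rounding, sketched below. The text does not specify the rounding rule, so this is an assumption: a value is rounded up to the next representable step with probability equal to its fractional remainder, which makes the rounding unbiased on average. The name `stochastic_round` and the interpretation of `point` as the number of fractional bits are likewise illustrative.

```python
import math
import random


def stochastic_round(x, point, rng=random):
    """Round x to a multiple of 2**-point.

    Rounds up with probability equal to the fractional remainder, otherwise down,
    so the expected value of the result equals x.
    """
    scaled = x * 2.0 ** point
    lower = math.floor(scaled)
    if rng.random() < scaled - lower:
        lower += 1
    return lower / 2.0 ** point


rng = random.Random(0)  # seeded only to make the example repeatable
print(stochastic_round(0.75, point=2, rng=rng))  # 0.75, already representable
print(stochastic_round(0.3, point=2, rng=rng))   # either 0.25 or 0.5
```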

The training device repeatedly executes the process to achieve the purpose of training the neural network model.

It should be noted that, before performing the forward operation and the reverse training, the computing device may perform data conversion on the data participating in the forward operation but not on the data participating in the reverse training; or perform data conversion on the data participating in the reverse training but not on the data participating in the forward operation; or perform data conversion on both the data participating in the forward operation and the data participating in the reverse training. For the specific data conversion process, refer to the description of the related embodiments above; it is not repeated here.

The forward operation includes the multilayer neural network operation, the multilayer neural network operation includes operations such as convolution, and the convolution operation is implemented by a convolution operation instruction.

The convolution operation instruction is an instruction in the Cambricon instruction set. In the Cambricon instruction set, each instruction is composed of an operation code and operands, and the instruction set includes four types of instructions: control instructions, data transmission instructions (data transfer instructions), operation instructions (computational instructions) and logic instructions (logical instructions).

Preferably, each instruction in the instruction set has a fixed length. For example, each instruction in the instruction set may be 64 bits long.

Further, the control instructions are used for controlling the execution process. The control instructions include jump (jump) instructions and conditional branch (conditional branch) instructions.

Further, the data transmission instruction is used for completing data transmission between different storage media. The data transmission instruction comprises a load (load) instruction, a store (store) instruction and a move (move) instruction. The load instruction is used for loading data from the main memory to the cache, the store instruction is used for storing the data from the cache to the main memory, and the move instruction is used for carrying the data between the cache and the cache or between the cache and the register or between the register and the register. The data transfer instructions support three different data organization modes including matrices, vectors and scalars.

Further, the arithmetic instruction is used for completing the neural network arithmetic operation. The operation instructions include a matrix operation instruction, a vector operation instruction, and a scalar operation instruction.

Further, the matrix operation instruction performs matrix operations in the neural network, including matrix multiply vector, vector multiply matrix, matrix multiply scalar, outer product, matrix add matrix, and matrix subtract matrix.

Further, the vector operation instruction performs vector operations in the neural network, including vector elementary operations, vector transcendental functions, dot product, random vector generation, and maximum/minimum of a vector. The vector elementary operations include vector addition, subtraction, multiplication, and division (add, subtract, multiply, divide). A vector transcendental function is a function that does not satisfy any polynomial equation whose coefficients are polynomials, including but not limited to exponential functions, logarithmic functions, trigonometric functions, and inverse trigonometric functions.

Further, the scalar operation instruction performs scalar operations in the neural network, including scalar elementary operations and scalar transcendental functions. The scalar elementary operations include scalar addition, subtraction, multiplication, and division (add, subtract, multiply, divide). A scalar transcendental function is a function that does not satisfy any polynomial equation whose coefficients are polynomials, including but not limited to exponential functions, logarithmic functions, trigonometric functions, and inverse trigonometric functions.

Further, the logic instruction is used for logic operations of the neural network. The logic instructions include vector logic operation instructions and scalar logic operation instructions.

Further, the vector logic operation instructions include vector compare, vector logical operations, and vector greater than merge. Vector comparisons include, but are not limited to, greater than, less than, equal to, greater than or equal to, less than or equal to, and not equal to. Vector logical operations include AND, OR, and NOT.

Further, the scalar logic operation instructions include scalar compare and scalar logical operations. Scalar comparisons include, but are not limited to, greater than, less than, equal to, greater than or equal to, less than or equal to, and not equal to. Scalar logical operations include AND, OR, and NOT.

For a multilayer neural network, the implementation process is as follows. In the forward operation, after the execution of the previous layer of the artificial neural network is completed, the operation instruction of the next layer takes the output neurons calculated in the operation unit as the input neurons of the next layer (or performs some operation on those output neurons before using them as the input neurons of the next layer), and at the same time the weights are replaced by the weights of the next layer. In the reverse operation, after the reverse operation of the previous layer of the artificial neural network is completed, the operation instruction of the next layer takes the input neuron gradient calculated in the operation unit as the output neuron gradient of the next layer (or performs some operation on that input neuron gradient before using it as the output neuron gradient of the next layer), and at the same time the weights are replaced by the weights of the next layer. As shown in fig. 5, the dashed arrows indicate the reverse operation and the solid arrows indicate the forward operation.

In another embodiment, the operation instruction may be another calculation instruction, such as a matrix-multiply-matrix instruction, an accumulation instruction or an activation instruction, and includes a forward operation instruction and a reverse training instruction.

The following describes a specific calculation method of the computing device shown in fig. 3A, taking a neural network operation instruction as an example. For a neural network operation instruction, the formula actually to be executed may be $s = s(\sum w x_i + b)$: the weight $w$ is multiplied by the input data $x_i$, the products are summed, the bias $b$ is added, and the activation operation $s(h)$ is performed to obtain the final output result $s$.

The method for executing the neural network forward operation instruction by the computing device shown in fig. 3A may specifically be:

After the conversion unit 13 performs data type conversion on the first input data, the controller unit 11 extracts the neural network forward operation instruction, the operation field corresponding to the neural network operation instruction, and at least one operation code from the instruction cache unit 110; the controller unit 11 transmits the operation field to the data access unit and sends the at least one operation code to the operation unit 12.

The controller unit 11 extracts the weight w and the bias b corresponding to the operation field from the storage unit 10 (when b is 0, the bias b does not need to be extracted) and transmits them to the main processing circuit 101 of the operation unit; the controller unit 11 also extracts the input data Xi from the storage unit 10 and transmits it to the main processing circuit 101.

The main processing circuit 101 splits the input data Xi into n data blocks.

The instruction processing unit 111 of the controller unit 11 determines a multiplication instruction, an offset instruction and an accumulation instruction according to the at least one operation code and sends them to the main processing circuit 101. The main processing circuit 101 broadcasts the multiplication instruction and the weight w to the plurality of slave processing circuits 102 and distributes the n data blocks among them (for example, if there are n slave processing circuits 102, each slave processing circuit 102 receives one data block). The slave processing circuits 102 multiply the weight w by the received data block according to the multiplication instruction to obtain intermediate results and send the intermediate results to the main processing circuit 101. The main processing circuit 101 accumulates the intermediate results sent by the slave processing circuits 102 according to the accumulation instruction to obtain an accumulation result, adds the bias b to the accumulation result according to the offset instruction to obtain the final result, and sends the final result to the controller unit 11.
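The division of work between the main processing circuit and the slave processing circuits can be mimicked sequentially in Python. This sketch makes several assumptions not stated in the text: the input is a flat vector, blocks are contiguous and of near-equal size, and each slave returns the partial product for its own block using the matching slice of the broadcast weight. All function names are hypothetical, and real hardware would run the slaves in parallel.

```python
def split_blocks(length, n):
    """Return n contiguous (start, end) spans covering range(length)."""
    size, extra = divmod(length, n)
    spans, start = [], 0
    for k in range(n):
        end = start + size + (1 if k < extra else 0)
        spans.append((start, end))
        start = end
    return spans


def slave_multiply(weights, inputs, span):
    """One slave: multiply its data block by the matching part of the broadcast weight."""
    start, end = span
    return sum(weights[j] * inputs[j] for j in range(start, end))


def main_forward(weights, inputs, bias, n_slaves):
    """Main circuit: split, distribute, accumulate the intermediate results, add the bias."""
    spans = split_blocks(len(inputs), n_slaves)
    intermediates = [slave_multiply(weights, inputs, span) for span in spans]
    return sum(intermediates) + bias


print(main_forward([1, 2, 3, 4], [1, 1, 1, 1], 0.5, n_slaves=2))  # 10.5
```

The result does not depend on the number of slaves, which is what allows the main processing circuit to choose the block count freely.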

In addition, the order of addition and multiplication may be reversed.

It should be noted that, the method for executing the neural network reverse training instruction by the computing apparatus is similar to the process for executing the neural network forward operation instruction by the computing apparatus, and specific reference may be made to the above description of the reverse training, and no description is given here.

According to this technical scheme, the multiplication and offset operations of the neural network are completed by a single instruction, namely the neural network operation instruction. Intermediate results of the neural network calculation need not be stored or fetched, which reduces the storage and fetch operations on intermediate data; the scheme therefore reduces the corresponding operation steps and improves the computational performance of the neural network.

The application also discloses a machine learning arithmetic device, which includes one or more of the computing devices mentioned in the application and is used for acquiring data to be operated on and control information from other processing devices, executing the specified machine learning operation, and transmitting the execution result to peripheral devices through an I/O interface. Examples of peripheral devices are cameras, displays, mice, keyboards, network cards, Wi-Fi interfaces and servers. When more than one computing device is included, the computing devices can be linked and transmit data through a specific structure, for example interconnected through a PCIE bus, so as to support larger-scale machine learning operations. In this case, the computing devices may share the same control system or have separate control systems, and they may share memory or each accelerator may have its own memory. In addition, the interconnection may use any interconnection topology.

The machine learning arithmetic device has high compatibility and can be connected with various types of servers through PCIE interfaces.

The application also discloses a combined processing device, which includes the machine learning arithmetic device, a universal interconnection interface, and other processing devices. The machine learning arithmetic device interacts with the other processing devices to jointly complete the operation specified by the user. Fig. 6 is a schematic diagram of the combined processing device.

The other processing devices include one or more types of general-purpose or special-purpose processors, such as central processing units (CPUs), graphics processing units (GPUs) and machine learning processors. The number of processors included in the other processing devices is not limited. The other processing devices serve as the interface between the machine learning arithmetic device and external data and control, performing data transfer and basic control of the machine learning arithmetic device such as starting and stopping; the other processing devices may also cooperate with the machine learning arithmetic device to complete computing tasks.

And the universal interconnection interface is used for transmitting data and control instructions between the machine learning arithmetic device and other processing devices. The machine learning arithmetic device acquires required input data from other processing devices and writes the input data into a storage device on the machine learning arithmetic device; control instructions can be obtained from other processing devices and written into a control cache on a machine learning arithmetic device chip; the data in the storage module of the machine learning arithmetic device can also be read and transmitted to other processing devices.

Alternatively, as shown in fig. 7, the configuration may further include a storage device, and the storage device is connected to the machine learning arithmetic device and the other processing device, respectively. The storage device is used for storing data in the machine learning arithmetic device and the other processing device, and is particularly suitable for data which is required to be calculated and cannot be stored in the internal storage of the machine learning arithmetic device or the other processing device.

The combined processing device can be used as the SOC (system on chip) of equipment such as a mobile phone, a robot, an unmanned aerial vehicle or video monitoring equipment, which effectively reduces the core area of the control part, increases the processing speed and reduces the overall power consumption. In this case, the universal interconnection interface of the combined processing device is connected to certain components of the equipment, such as a camera, a display, a mouse, a keyboard, a network card or a Wi-Fi interface.

In one possible embodiment, a distributed system is also claimed. The system includes n1 main processors and n2 coprocessors, where n1 is an integer greater than or equal to 0 and n2 is an integer greater than or equal to 1. The system may have various topologies, including but not limited to the topology shown in FIG. 3B and the topology shown in FIG. 3C.

The main processor sends the input data, the decimal point position of the input data and the calculation instruction to each of the plurality of coprocessors; or the main processor sends the input data, the decimal point position of the input data and the calculation instruction to some of the plurality of coprocessors, and those coprocessors forward the input data, the decimal point position of the input data and the calculation instruction to the other coprocessors. Each coprocessor includes the computing device, which computes on the input data according to the above method and the calculation instruction to obtain an operation result.

The input data includes, but is not limited to, input neurons, weights, bias data, and the like.

The coprocessor directly sends the operation result to the main processor, or the coprocessor which is not connected with the main processor firstly sends the operation result to the coprocessor which is connected with the main processor, and then the coprocessor sends the received operation result to the main processor.

In some embodiments, a chip is also claimed, which includes the above machine learning arithmetic device or the combined processing device.

In some embodiments, a chip package structure is provided, which includes the above chip.

In some embodiments, a board card is provided, which includes the above chip package structure.

In some embodiments, an electronic device is provided that includes the above board card. Referring to fig. 8, fig. 8 provides a board card that, in addition to the chip 389, may include other components, including but not limited to: a memory device 390, an interface device 391 and a control device 392.

The memory device 390 is connected to the chip in the chip package structure through a bus and is used for storing data. The memory device may include a plurality of groups of storage units 393. Each group of storage units is connected to the chip through a bus. It is understood that each group of storage units may be DDR SDRAM (Double Data Rate Synchronous Dynamic Random Access Memory).

DDR can double the speed of SDRAM without increasing the clock frequency: DDR allows data to be read out on both the rising and falling edges of the clock pulse, so DDR is twice as fast as standard SDRAM. In one embodiment, the memory device may include 4 groups of storage units. Each group of storage units may include a plurality of DDR4 chips (granules). In one embodiment, the chip may internally include four 72-bit DDR4 controllers, in which 64 bits of each 72-bit DDR4 controller are used for data transmission and 8 bits are used for ECC checking. It can be understood that when DDR4-3200 chips are used in each group of storage units, the theoretical data transmission bandwidth can reach 25600 MB/s.
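The 25600 MB/s figure follows from the transfer rate and the data width of one controller, as the short calculation below shows. It assumes that "DDR4-3200" denotes 3200 million transfers per second and that only the 64 data bits count toward bandwidth (the 8 ECC bits carry no payload); the function name is illustrative.

```python
def ddr_bandwidth_mb_per_s(mega_transfers_per_s, data_bits):
    """Theoretical peak bandwidth of one DDR channel in MB/s."""
    return mega_transfers_per_s * data_bits // 8  # 8 bits per byte


print(ddr_bandwidth_mb_per_s(3200, 64))  # 25600
```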

In one embodiment, each group of storage units includes a plurality of double data rate synchronous dynamic random access memories arranged in parallel. DDR can transfer data twice in one clock cycle. A controller for controlling the DDR is provided in the chip and controls the data transmission and data storage of each storage unit.

The interface device is electrically connected to the chip in the chip package structure. The interface device is used for data transmission between the chip and an external device (such as a server or a computer). For example, in one embodiment, the interface device may be a standard PCIE interface: the data to be processed is transmitted from the server to the chip through the standard PCIE interface, thereby implementing data transfer. Preferably, when a PCIE 3.0 x16 interface is used for transmission, the theoretical bandwidth can reach 16000 MB/s. In another embodiment, the interface device may be another interface; the present application does not limit the specific form of the other interface, provided that the interface unit can implement the transfer function. In addition, the calculation result of the chip is transmitted back to the external device (e.g., a server) by the interface device.
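The 16000 MB/s figure corresponds to the raw signalling rate of a PCIE 3.0 x16 link, as the sketch below illustrates. It relies on two facts about PCI Express 3.0 that are not stated in the text: each lane runs at 8 GT/s, and the link uses 128b/130b encoding, so the usable rate per direction is slightly lower than the raw rate. The function name and the `encoding` parameter are illustrative.

```python
def pcie_bandwidth_mb_per_s(giga_transfers_per_s, lanes, encoding=1.0):
    """Per-direction link bandwidth in MB/s; `encoding` is the payload share of each symbol."""
    return giga_transfers_per_s * 1000 * lanes * encoding / 8


raw = pcie_bandwidth_mb_per_s(8, 16)                 # 16000.0, ignoring encoding overhead
usable = pcie_bandwidth_mb_per_s(8, 16, 128 / 130)   # about 15754 after 128b/130b encoding
print(raw, round(usable))
```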

The control device is electrically connected to the chip and is used for monitoring the state of the chip. Specifically, the chip and the control device may be electrically connected through an SPI interface. The control device may include a microcontroller unit (MCU). The chip may include a plurality of processing chips, a plurality of processing cores, or a plurality of processing circuits, and may drive a plurality of loads. The chip can therefore be in different working states, such as heavy load and light load. The control device can regulate the working states of the plurality of processing chips, the plurality of processing cores and/or the plurality of processing circuits in the chip.

The electronic device includes a data processing device, a robot, a computer, a printer, a scanner, a tablet computer, an intelligent terminal, a mobile phone, a vehicle data recorder, a navigator, a sensor, a webcam, a server, a cloud server, a camera, a video camera, a projector, a watch, an earphone, a mobile storage device, a wearable device, a vehicle, a household appliance, and/or a medical device.

The vehicle includes an airplane, a ship and/or a motor vehicle; the household appliance includes a television, an air conditioner, a microwave oven, a refrigerator, an electric rice cooker, a humidifier, a washing machine, an electric lamp, a gas stove and a range hood; the medical device includes a nuclear magnetic resonance apparatus, a B-mode ultrasound apparatus and/or an electrocardiograph.

Referring to fig. 9, fig. 9 is a flowchart of a method for performing machine learning calculation according to an embodiment of the present invention. The method includes:

S901, the computing device acquires first input data and a calculation instruction.

The first input data includes input neurons and weights.

S902, the computing device parses the calculation instruction to obtain a data conversion instruction and a plurality of operation instructions.

The data conversion instruction comprises an operation field and an operation code, wherein the operation code is used for indicating the function of the data type conversion instruction, and the operation field of the data type conversion instruction comprises a decimal point position, a flag bit used for indicating the data type of the first input data and a conversion mode of the data type.

S903, the computing device converts the first input data into second input data according to the data conversion instruction, where the second input data is fixed-point data.

Wherein the converting the first input data into second input data according to the data conversion instruction comprises:

analyzing the data conversion instruction to obtain the decimal point position, the flag bit for indicating the data type of the first input data and the conversion mode of the data type;

determining the data type of the first input data according to the flag bit indicating the data type of the first input data;

and converting the first input data into second input data according to the decimal point position and the conversion mode of the data type, wherein the data type of the second input data is inconsistent with the data type of the first input data.

When the first input data and the second input data are fixed point data, the position of the decimal point of the first input data is inconsistent with the position of the decimal point of the second input data.
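The three conversion steps above (parse the operation field, check the data type flag, convert) can be sketched as follows. The field names, the string-valued conversion modes and the class `DataConversionInstruction` are hypothetical stand-ins; the text does not define the encoding of the instruction. Only two modes are shown, and `point` is assumed to be the number of fractional bits.

```python
from dataclasses import dataclass


@dataclass
class DataConversionInstruction:
    """Hypothetical decoded form of a data conversion instruction."""
    opcode: str            # identifies the instruction as a data type conversion
    point: int             # decimal point position from the operation field
    source_is_fixed: bool  # flag bit: data type of the first input data
    mode: str              # conversion mode, e.g. "float_to_fixed" or "fixed_to_float"


def execute_conversion(instr, value):
    """Convert one value as directed by the instruction's operation field."""
    if instr.mode.startswith("fixed") != instr.source_is_fixed:
        raise ValueError("data type flag does not match conversion mode")
    scale = 2.0 ** instr.point
    if instr.mode == "float_to_fixed":
        return round(value * scale)  # integer code of the second input data
    if instr.mode == "fixed_to_float":
        return value / scale
    raise ValueError(f"unsupported conversion mode: {instr.mode}")


instr = DataConversionInstruction("CONVERT", point=4, source_is_fixed=False, mode="float_to_fixed")
print(execute_conversion(instr, 1.25))  # 20
```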

In a possible embodiment, when the first input data is fixed-point data, the method further comprises:

and deducing the decimal point position of one or more intermediate results according to the decimal point position of the first input data, wherein the one or more intermediate results are obtained by operation according to the first input data.
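The text does not state how the decimal point position of an intermediate result is deduced. Under the assumption that the decimal point position is the number of fractional bits, standard fixed-point arithmetic gives simple rules, sketched below with hypothetical function names: a product carries the sum of the operands' fractional bits, and a sum is formed after aligning both operands to the larger number of fractional bits.

```python
def fixed_multiply(code_a, point_a, code_b, point_b):
    """Multiply two fixed-point codes; the product has point_a + point_b fractional bits."""
    return code_a * code_b, point_a + point_b


def fixed_add(code_a, point_a, code_b, point_b):
    """Add two fixed-point codes after aligning them to the larger fractional bit count."""
    point = max(point_a, point_b)
    aligned_a = code_a << (point - point_a)
    aligned_b = code_b << (point - point_b)
    return aligned_a + aligned_b, point


# 1.5 with 4 fractional bits is code 24; 0.25 with 3 fractional bits is code 2.
product, product_point = fixed_multiply(24, 4, 2, 3)
total, total_point = fixed_add(24, 4, 2, 3)
print(product / 2 ** product_point, total / 2 ** total_point)  # 0.375 1.75
```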

S904, the computing device performs computation on the second input data according to the plurality of operation instructions to obtain a result of the computing instruction.

The operation instructions include a forward operation instruction and a reverse training instruction. That is, while executing the forward operation instruction and/or the reverse training instruction (in other words, while performing the forward operation and/or the reverse training), the computing device may convert the data participating in the operation into fixed-point data according to the embodiment shown in fig. 9 and perform fixed-point operations.
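As an illustration of a fixed-point forward operation, the sketch below quantizes input neurons and weights, accumulates their products as integers, and rescales the accumulator to an output decimal point position. It covers only a single dot product, assumes that the decimal point position is the number of fractional bits, and uses hypothetical function names; it is not the operation unit's actual datapath.

```python
def quantize(values, point):
    """Convert floating-point values to stored integers with `point` fractional bits."""
    return [round(v * 2.0 ** point) for v in values]


def fixed_dot(neurons_q, weights_q, neuron_point, weight_point, out_point):
    """Integer dot product, rescaled to `out_point` fractional bits."""
    # Each product carries neuron_point + weight_point fractional bits.
    acc = sum(n * w for n, w in zip(neurons_q, weights_q))
    shift = neuron_point + weight_point - out_point
    return acc >> shift if shift >= 0 else acc << -shift


neurons = [0.5, -1.25, 2.0]
weights = [0.25, 0.5, -0.75]

neurons_q = quantize(neurons, 8)
weights_q = quantize(weights, 8)

out_q = fixed_dot(neurons_q, weights_q, neuron_point=8, weight_point=8, out_point=8)
out = out_q * 2.0 ** -8  # real value represented by the fixed-point result
```

For these inputs the fixed-point result equals the floating-point dot product exactly, because every value is representable with 8 fractional bits; in general the result differs by a quantization error.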

It should be noted that, for a detailed description of steps S901-S904, reference may be made to the related description of the embodiments shown in figs. 1-8, which is not repeated here.

It should be noted that, for simplicity of description, the above-mentioned method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present application is not limited by the order of acts described, as some steps may occur in other orders or concurrently depending on the application. Further, those skilled in the art should also appreciate that the embodiments described in the specification are exemplary embodiments and that the acts and modules referred to are not necessarily required in this application.

In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.

In the embodiments provided in the present application, it should be understood that the disclosed apparatus may be implemented in other manners. The above-described apparatus embodiments are merely illustrative. For example, the division of the units is only a division of logical functions, and other divisions are possible in an actual implementation: a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not implemented. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices, or units, and may be electrical or in other forms.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit may be implemented in the form of hardware, or may be implemented in the form of a software program module.

The integrated units, if implemented in the form of software program modules and sold or used as stand-alone products, may be stored in a computer-readable memory. Based on such understanding, the technical solution of the present application in essence, or the part thereof that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product that is stored in a memory and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method described in the embodiments of the present application. The aforementioned memory includes: a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, an optical disk, and other media capable of storing program code.

Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by associated hardware instructed by a program, which may be stored in a computer-readable memory, which may include: flash disk, ROM, RAM, magnetic or optical disk, and the like.

The foregoing detailed description of the embodiments of the present application has been presented to illustrate the principles and implementations of the present application, and the above description of the embodiments is only provided to help understand the method and the core concept of the present application; meanwhile, for a person skilled in the art, according to the idea of the present application, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present application.