CN110163357B - A computing device and method - Google Patents

A computing device and method

Info

Publication number
CN110163357B
Authority
CN
China
Prior art keywords
data
input data
unit
processing circuit
instruction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910195627.XA
Other languages
Chinese (zh)
Other versions
CN110163357A (en)
Inventor
Inventor not disclosed
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Cambricon Information Technology Co Ltd
Original Assignee
Shanghai Cambricon Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from CN201810149287.2A (CN110163350B)
Priority claimed from CN201810207915.8A (CN110276447B)
Application filed by Shanghai Cambricon Information Technology Co Ltd
Priority claimed from CN201880002628.1A (CN110383300B)
Publication of CN110163357A
Application granted
Publication of CN110163357B
Legal status: Active
Anticipated expiration


Abstract

A computing device, comprising: a storage unit (10) for acquiring input data and a calculation instruction; a controller unit (11) for extracting the calculation instruction from the storage unit (10), decoding the calculation instruction to obtain one or more operation instructions, and sending the one or more operation instructions and the input data to an operation unit (12); and the operation unit (12) for performing a calculation on the input data according to the one or more operation instructions to obtain a result of the calculation instruction. The computing device represents the data participating in machine learning calculations as fixed-point data, which can improve the processing speed and processing efficiency of training operations.

Figure 201910195627

Description

Computing device and method
Technical Field
The present application relates to the field of information processing technologies, and in particular, to a computing device and method.
Background
As information technology develops and user demand grows, the requirements on the timeliness of information keep rising. Currently, terminals obtain and process information using general-purpose processors.
In practice, processing information by running a software program on a general-purpose processor is limited by the running speed of that processor. In particular, when the general-purpose processor is heavily loaded, information processing is inefficient and latency is high. For a computation model used in information processing, such as a training model, the training operation involves a large amount of computation, so a general-purpose processor takes a long time to complete the training operation and does so inefficiently.
Disclosure of Invention
The embodiment of the application provides a computing device and method, which can improve the processing speed of operation and improve the efficiency.
In a first aspect, an embodiment of the present application provides a computing device, including a storage unit, an arithmetic unit, a controller unit and a conversion unit;
the controller unit is used for acquiring a configuration instruction before the operation unit performs operation, and an operation domain of the configuration instruction comprises a decimal point position and a data type participating in the operation; analyzing the configuration instruction to obtain the position of the decimal point and the data type participating in the operation, or acquiring the position of the decimal point and the data type participating in the operation from the storage unit;
the controller unit is also used for acquiring input data and judging whether the data type of the input data is consistent with the data type participating in operation; when the data type of the input data is determined to be inconsistent with the data type participating in operation, transmitting the input data, the decimal point position and the data type participating in operation to the conversion unit;
and the conversion unit performs data type conversion on the input data according to the decimal point position and the data type participating in the operation to obtain converted input data, wherein the data type of the converted input data is consistent with the data type participating in the operation.
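The check-then-convert flow described in the first aspect can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the names `Config`, `to_fixed` and `prepare_input`, the string labels `"fixed"` and `"float"`, and the rounding rule are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Config:
    """Values carried by the configuration instruction (hypothetical layout)."""
    point_location: int  # decimal point position
    data_type: str       # data type participating in the operation: "fixed" or "float"


def to_fixed(value: float, point_location: int) -> int:
    """Scale by 2**point_location and round to the nearest integer (assumed rule)."""
    return round(value * 2.0 ** point_location)


def prepare_input(values: Sequence[float], input_type: str, cfg: Config) -> List[float]:
    """Controller-side check: convert only when the data types are inconsistent."""
    if input_type == cfg.data_type:
        return list(values)  # consistent: pass the input data through unchanged
    if cfg.data_type == "fixed":
        return [to_fixed(v, cfg.point_location) for v in values]
    # Inconsistent and the target is floating point: undo the fixed-point scaling.
    return [v * 2.0 ** -cfg.point_location for v in values]
```

For instance, `prepare_input([1.5], "float", Config(5, "fixed"))` scales 1.5 by 32, while input that is already fixed-point is returned unchanged.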
In a possible embodiment, the controller unit obtains the configuration instruction before the operation unit performs the operation, specifically, obtains the configuration instruction before the operation unit performs the operation of the ith layer of the multilayer neural network.
In one possible embodiment, the computing device is configured to perform machine learning calculations,
the controller unit is further used for transmitting the converted input data to the arithmetic unit; when the data type of the input data is consistent with the data type participating in operation, transmitting the input data to the operation unit;
and the operation unit is used for operating the converted input data or the input data to obtain an operation result.
In one possible embodiment, the machine learning computation includes an artificial neural network operation; the first input data comprises input neuron data and weight data; and the calculation result is output neuron data.
In a possible embodiment, the arithmetic unit comprises a master processing circuit and a plurality of slave processing circuits;
the main processing circuit is used for performing preorder processing on the input data or the converted input data and transmitting data with the plurality of slave processing circuits;
the plurality of slave processing circuits are used for executing intermediate operation according to the input data transmitted from the master processing circuit or the converted input data to obtain a plurality of intermediate results and transmitting the plurality of intermediate results to the master processing circuit;
and the main processing circuit is used for executing subsequent processing on the plurality of intermediate results to obtain the operation result.
In one possible embodiment, the computing device further comprises a direct memory access DMA unit, the storage unit comprising: any combination of a register and a cache;
the cache is used for storing the input data; wherein the cache comprises a scratch pad cache;
the register is used for storing scalar data in the input data;
the DMA unit is used for reading data from the storage unit or storing data into the storage unit.
In a possible embodiment, when the input data is fixed-point data, the arithmetic unit further includes:
and the derivation unit is used for deriving the decimal point position of one or more intermediate results according to the decimal point position of the input data, wherein the one or more intermediate results are obtained by operation according to the input data.
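The source does not state the rule the derivation unit applies. As a hedged illustration, the sketch below uses the standard fixed-point convention that multiplying two values with point positions `p_a` and `p_b` yields a product with point position `p_a + p_b`; the function names are hypothetical.

```python
from typing import Tuple


def derive_product_point(p_a: int, p_b: int) -> int:
    """Point position of a product under the usual fixed-point convention (assumption)."""
    return p_a + p_b


def fixed_mul(a: int, p_a: int, b: int, p_b: int) -> Tuple[int, int]:
    """Multiply two raw fixed-point integers and derive the result's point position."""
    return a * b, derive_product_point(p_a, p_b)


def to_float(raw: int, point_location: int) -> float:
    """Interpret a raw fixed-point integer as a real value."""
    return raw / 2.0 ** point_location
```

With 1.5 stored as raw 48 at point position 5 and 2.5 stored as raw 10 at point position 2, `fixed_mul(48, 5, 10, 2)` gives a raw product at point position 7 that `to_float` maps back to 3.75.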
In a possible embodiment, the arithmetic unit further includes: a data caching unit for caching the one or more intermediate results.
In a possible embodiment, the arithmetic unit comprises a tree module, the tree module comprising a root port and a plurality of branch ports; the root port of the tree module is connected with the main processing circuit, and the branch ports of the tree module are each connected with one of the plurality of slave processing circuits;
the tree module is used for forwarding data and operation instructions between the main processing circuit and the plurality of slave processing circuits; the tree module has an n-ary tree structure, where n is an integer greater than or equal to 2.
In a possible embodiment, the arithmetic unit further comprises a branch processing circuit,
the main processing circuit is specifically configured to determine that the input neuron is broadcast data, determine that a weight is distribution data, allocate the distribution data to a plurality of data blocks, and send at least one data block of the plurality of data blocks, the broadcast data, and at least one operation instruction of the plurality of operation instructions to the branch processing circuit;
the branch processing circuit is used for forwarding data blocks, broadcast data and operation instructions between the main processing circuit and the plurality of slave processing circuits;
the plurality of slave processing circuits are used for carrying out operation on the received data blocks and the broadcast data according to the operation instruction to obtain an intermediate result and transmitting the intermediate result to the branch processing circuit;
the main processing circuit is further configured to perform subsequent processing on the intermediate result sent by the branch processing circuit to obtain a result of the operation instruction, and send the result of the calculation instruction to the controller unit.
In one possible embodiment, the plurality of slave processing circuits are distributed in an array; each slave processing circuit is connected with other adjacent slave processing circuits, the master processing circuit is connected with K slave processing circuits in the plurality of slave processing circuits, and the K slave processing circuits are as follows: n slave processing circuits of row 1, n slave processing circuits of row m, and m slave processing circuits of column 1;
the K slave processing circuits are used for forwarding data and instructions between the main processing circuit and the plurality of slave processing circuits;
the main processing circuit is further configured to determine that the input neuron is broadcast data, determine that a weight is distribution data, distribute the distribution data into a plurality of data blocks, and send at least one data block of the plurality of data blocks and at least one operation instruction of the plurality of operation instructions to the K slave processing circuits;
the K slave processing circuits are used for converting data between the main processing circuit and the plurality of slave processing circuits;
the plurality of slave processing circuits are used for performing operation on the received data blocks according to the operation instruction to obtain an intermediate result and transmitting the operation result to the K slave processing circuits;
and the main processing circuit is used for processing the intermediate results sent by the K slave processing circuits to obtain the result of the calculation instruction, and sending the result of the calculation instruction to the controller unit.
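To make the array embodiment concrete, the sketch below enumerates which positions in an m-by-n array of slave processing circuits would be wired to the main processing circuit. The function name and the 1-indexed `(row, column)` coordinates are assumptions, and the source does not say whether the two corner circuits shared by column 1 and rows 1 and m are counted once or twice; the sketch uses a set, so each is counted once.

```python
from __future__ import annotations


def boundary_slaves(m: int, n: int) -> set[tuple[int, int]]:
    """Return the (row, col) positions in row 1, row m and column 1 of an m x n array."""
    first_row = {(1, c) for c in range(1, n + 1)}
    last_row = {(m, c) for c in range(1, n + 1)}
    first_col = {(r, 1) for r in range(1, m + 1)}
    return first_row | last_row | first_col
```

For m, n >= 2 this yields 2n + m - 2 distinct circuits; the remaining circuits are reached only through their neighbours.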
In a possible embodiment, the main processing circuit is specifically configured to combine and sort the intermediate results sent by the multiple processing circuits to obtain the result of the computation instruction;
or the main processing circuit is specifically configured to perform combination sorting and activation processing on the intermediate results sent by the multiple processing circuits to obtain a result of the calculation instruction.
In one possible embodiment, the main processing circuit includes: one or any combination of an activation processing circuit and an addition processing circuit;
the activation processing circuit is used for executing activation operation of data in the main processing circuit;
the addition processing circuit is used for executing addition operation or accumulation operation;
the slave processing circuit includes:
a multiplication processing circuit used for executing a multiplication operation on the received data block to obtain a product result; and
an accumulation processing circuit used for executing an accumulation operation on the product result to obtain the intermediate result.
In a second aspect, an embodiment of the present invention provides a computing method, including:
before the controller unit carries out operation, a configuration instruction is obtained; the operation domain of the configuration instruction comprises decimal point positions and data types participating in operation; analyzing the configuration instruction to obtain the position of the decimal point and the data type participating in the operation, or directly obtaining the position of the decimal point and the data type participating in the operation; acquiring input data, and judging whether the data type of the input data is consistent with the data type participating in operation; when the data type of the input data is determined to be inconsistent with the data type participating in operation, the conversion unit performs data type conversion on the input data according to the decimal point position and the data type participating in operation to obtain converted input data, wherein the data type of the converted input data is consistent with the data type participating in operation.
In a possible embodiment, the controller unit obtains the configuration instruction before performing the operation, in particular, before performing the operation of the ith layer of the multilayer neural network model.
In a possible embodiment, the computing method is a method for performing machine learning computation, the method further comprising:
the operation unit operates the converted input data to obtain an operation result;
when the data type of the input data is consistent with the data type participating in operation, the computing device performs operation on the input data to obtain the operation result.
In one possible embodiment, the machine learning computation includes an artificial neural network operation; the input data comprises input neurons and weights; and the calculation result is an output neuron.
In one possible embodiment, when the first input data is fixed-point data, the method further includes:
the arithmetic unit derives decimal point positions of one or more intermediate results according to the decimal point positions of the first input data, wherein the one or more intermediate results are obtained through calculation according to the first input data.
In a third aspect, an embodiment of the present invention provides a machine learning arithmetic device, which includes one or more computing devices according to the first aspect. The machine learning arithmetic device is used for acquiring data to be operated and control information from other processing devices, executing specified machine learning arithmetic and transmitting an execution result to other processing devices through an I/O interface;
when the machine learning arithmetic device comprises a plurality of computing devices, the plurality of computing devices can be linked through a specific structure and transmit data;
the plurality of computing devices are interconnected through a PCIE bus and transmit data so as to support operation of larger-scale machine learning; a plurality of the computing devices share the same control system or own respective control systems; the computing devices share the memory or own the memory; the plurality of computing devices are interconnected in any interconnection topology.
In a fourth aspect, an embodiment of the present invention provides a combined processing device, which includes the machine learning processing device according to the third aspect, a universal interconnection interface, and other processing devices. The machine learning arithmetic device interacts with the other processing devices to jointly complete the operation designated by the user. The combined processing device may further include a storage device, which is connected to the machine learning arithmetic device and the other processing device, respectively, and stores data of the machine learning arithmetic device and the other processing device.
In a fifth aspect, an embodiment of the present invention provides a neural network chip, where the neural network chip includes the computing device according to the first aspect, the machine learning arithmetic device according to the third aspect, or the combined processing device according to the fourth aspect.
In a sixth aspect, an embodiment of the present invention provides a neural network chip package structure, where the neural network chip package structure includes the neural network chip described in the fifth aspect.
In a seventh aspect, an embodiment of the present invention provides a board, where the board includes a storage device, an interface device, a control device, and the neural network chip in the fifth aspect;
wherein, the neural network chip is respectively connected with the storage device, the control device and the interface device;
the storage device is used for storing data;
the interface device is used for realizing data transmission between the chip and external equipment;
and the control device is used for monitoring the state of the chip.
Further, the memory device includes: a plurality of groups of memory cells, each group of memory cells is connected with the chip through a bus, and the memory cells are: DDR SDRAM;
the chip includes: the DDR controller is used for controlling data transmission and data storage of each memory unit;
the interface device is as follows: a standard PCIE interface.
In an eighth aspect, an embodiment of the present invention provides an electronic device, where the electronic device includes the neural network chip described in the fifth aspect, the neural network chip package structure described in the sixth aspect, or the board described in the seventh aspect.
In some embodiments, the electronic device comprises a data processing apparatus, a robot, a computer, a printer, a scanner, a tablet, a smart terminal, a cell phone, a tachograph, a navigator, a sensor, a camera, a server, a cloud server, a camera, a camcorder, a projector, a watch, a headset, a mobile storage, a wearable device, a vehicle, a household appliance, and/or a medical device.
In some embodiments, the vehicle comprises an aircraft, a ship, and/or a vehicle; the household appliances comprise a television, an air conditioner, a microwave oven, a refrigerator, an electric cooker, a humidifier, a washing machine, an electric lamp, a gas stove and a range hood; the medical equipment comprises a nuclear magnetic resonance apparatus, a B-ultrasonic apparatus and/or an electrocardiograph.
These and other aspects of the invention are apparent from and will be elucidated with reference to the embodiments described hereinafter.
Drawings
To illustrate the technical solutions in the embodiments of the present application more clearly, the drawings used in describing the embodiments are briefly introduced below. The drawings described below show some embodiments of the present application; those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic diagram of a data structure of fixed-point data according to an embodiment of the present disclosure;
FIG. 2 is a schematic diagram of another data structure of fixed-point data according to an embodiment of the present disclosure;
FIG. 3 is a schematic diagram of a computing device according to an embodiment of the present application;
FIG. 3A is a schematic block diagram of a computing device according to an embodiment of the present application;
FIG. 3B is a schematic block diagram of a computing device according to another embodiment of the present application;
FIG. 3C is a schematic block diagram of a computing device according to another embodiment of the present application;
FIG. 3D is a schematic structural diagram of a main processing circuit provided in an embodiment of the present application;
FIG. 3E is a schematic block diagram of a computing device according to another embodiment of the present application;
FIG. 3F is a schematic structural diagram of a tree module according to an embodiment of the present disclosure;
FIG. 3G is a schematic block diagram of a computing device according to another embodiment of the present application;
FIG. 3H is a schematic block diagram of a computing device according to another embodiment of the present application;
FIG. 4 is a flowchart illustrating a forward operation of a single-layer artificial neural network according to an embodiment of the present disclosure;
FIG. 5 is a flow chart of a forward operation and a reverse training of a neural network according to an embodiment of the present disclosure;
FIG. 6 is a structural diagram of a combined processing device provided in an embodiment of the present application;
FIG. 6A is a schematic block diagram of a computing device according to another embodiment of the present application;
FIG. 7 is a block diagram of another combined processing device provided in an embodiment of the present application;
FIG. 8 is a schematic structural diagram of a board card provided in the embodiment of the present application;
FIG. 9 is a schematic flowchart of a calculation method according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some, but not all, embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The terms "first," "second," "third," and "fourth," etc. in the description and claims of this application and in the accompanying drawings are used for distinguishing between different objects and not for describing a particular order. Furthermore, the terms "include" and "have," as well as any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements listed, but may alternatively include other steps or elements not listed, or inherent to such process, method, article, or apparatus.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.
The embodiment of the application provides a data type, wherein the data type comprises an adjustment factor, and the adjustment factor is used for indicating the value range and the precision of the data type.
The adjustment factor comprises a first scaling factor and, optionally, a second scaling factor. The first scaling factor indicates the precision of the data type; the second scaling factor adjusts the value range of the data type.
Optionally, the first scaling factor may be $2^{-m}$, $8^{-m}$, $10^{-m}$, 2, 3, 6, 9, 10, $2^{m}$, $8^{m}$, $10^{m}$, or another value.
Specifically, the first scaling factor may be a decimal point position. For example, shifting the decimal point of binary input data INA1 to the right by m bits yields input data $INB1 = INA1 \times 2^{m}$, that is, INB1 is $2^{m}$ times INA1; for another example, shifting the decimal point of decimal input data INA2 to the left by n bits yields input data $INB2 = INA2 / 10^{n}$, that is, INB2 is INA2 reduced by a factor of $10^{n}$, where m and n are integers.
Optionally, the second scaling factor may be 2, 8, 10, 16, or another value.
For example, suppose the value range of the data type corresponding to the input data is $8^{-15}$ to $8^{16}$. During operation, when an operation result is greater than the maximum value of that range, the value range is multiplied by the second scaling factor of the data type (namely 8) to obtain a new value range of $8^{-14}$ to $8^{17}$; when an operation result is smaller than the minimum value of that range, the value range is divided by the second scaling factor (8) to obtain a new value range of $8^{-16}$ to $8^{15}$.
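The range adjustment in this example can be sketched in a few lines. The function name `adjust_range` is hypothetical, and `fractions.Fraction` is used only so that the powers of 8 stay exact; the source does not prescribe any particular representation of the range bounds.

```python
from fractions import Fraction
from typing import Tuple


def adjust_range(lo: Fraction, hi: Fraction, result: Fraction,
                 factor: int = 8) -> Tuple[Fraction, Fraction]:
    """Shift the value range by the second scaling factor when a result falls outside it."""
    if result > hi:   # overflow: multiply both bounds by the factor
        return lo * factor, hi * factor
    if result < lo:   # underflow: divide both bounds by the factor
        return lo / factor, hi / factor
    return lo, hi     # result fits: the range is unchanged


# The range used in the example above: 8^-15 to 8^16.
LO, HI = Fraction(8) ** -15, Fraction(8) ** 16
```

Calling `adjust_range(LO, HI, HI + 1)` moves the bounds to 8^-14 and 8^17, and a result below `LO` moves them to 8^-16 and 8^15, matching the two cases in the text.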
Scaling factors may be added to data in any format (e.g., floating point number, discrete data) to adjust the size and precision of the data.
It should be noted that the decimal point positions mentioned in the description of the present application may be the first scaling factor, and are not described herein.
The structure of fixed-point data is described below with reference to FIG. 1, which is a schematic diagram of a data structure of fixed-point data according to an embodiment of the present application. FIG. 1 shows signed fixed-point data occupying X bits, which may also be referred to as X-bit fixed-point data. The X-bit fixed-point data includes a sign bit occupying 1 bit, integer bits occupying M bits, and decimal (fractional) bits occupying N bits, so that X - 1 = M + N. Unsigned fixed-point data includes only the M integer bits and the N decimal bits, that is, X = M + N.
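The field layout just described can be illustrated by decoding a bit string into its three fields. This is a sketch under one notable assumption: the sign bit is treated as a sign-magnitude flag, which the source does not confirm (a two's-complement encoding is equally plausible). The function name `decode_signed_fixed` is hypothetical.

```python
def decode_signed_fixed(bits: str, m: int, n: int) -> float:
    """Decode X-bit signed fixed-point data with 1 sign bit, m integer bits, n decimal bits.

    Assumes a sign-magnitude layout; this is an illustrative choice, not the source's.
    """
    if len(bits) != 1 + m + n:
        raise ValueError("expected X = 1 + M + N bits")
    sign, integer, fraction = bits[0], bits[1:1 + m], bits[1 + m:]
    value = (int(integer, 2) if integer else 0) \
        + (int(fraction, 2) if fraction else 0) / 2 ** n
    return -value if sign == "1" else value
```

With M = 10 and N = 5, the 16-bit string `0000000000110000` splits into sign `0`, integer bits `0000000001` and decimal bits `10000`.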
Compared with the 32-bit floating-point representation, the short-bit fixed-point representation adopted by the present invention occupies fewer bits. In addition, for data of the same layer and the same type in a network model, such as all convolution kernels, input neurons, or offset data of the first convolutional layer, a flag bit, Point Location, is additionally provided to record the decimal point position of the fixed-point data. The value of this flag bit can be adjusted according to the distribution of the input data, thereby adjusting both the precision and the representable range of the fixed-point data.
For example, the floating-point number 68.6875 is converted to signed 16-bit fixed-point data with a decimal point position of 5. In signed 16-bit fixed-point data with a decimal point position of 5, the integer part occupies 10 bits, the decimal part occupies 5 bits, and the sign bit occupies 1 bit. Since 68.6875 × 2^5 = 2198, the conversion unit converts the floating-point number 68.6875 to the signed 16-bit fixed-point data 0000100010010110 (sign bit 0, integer bits 0001000100 = 68, decimal bits 10110 = 0.6875), as shown in FIG. 2.
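The conversion in this example can be checked with a short sketch. The function name `float_to_fixed_bits`, the round-to-nearest rule, and the sign-magnitude treatment of the sign bit are assumptions made for illustration; the arithmetic itself (68.6875 × 2^5 = 2198, which is 100010010110 in binary) can be verified by hand.

```python
def float_to_fixed_bits(value: float, point_location: int, width: int = 16) -> str:
    """Encode a float as a signed fixed-point bit string (sign-magnitude, by assumption)."""
    scaled = round(abs(value) * 2 ** point_location)  # shift the decimal point
    if scaled >= 1 << (width - 1):
        raise OverflowError("value does not fit in the integer and decimal bits")
    sign = "1" if value < 0 else "0"
    return sign + format(scaled, f"0{width - 1}b")     # pad the magnitude to width-1 bits
```

Reading the result back, the five rightmost bits are the decimal part and the ten bits before them are the integer part.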
First, a computing device as used herein is described. Referring to FIG. 3, there is provided a computing device comprising: a controller unit 11, an arithmetic unit 12 and a conversion unit 13, wherein the controller unit 11 is connected with the arithmetic unit 12, and the conversion unit 13 is connected with both the controller unit 11 and the arithmetic unit 12;
in a possible embodiment, the controller unit 11 is adapted to obtain the first input data and a calculation instruction.
In one embodiment, the first input data is machine learning data. Further, the machine learning data includes input neuron data, weight data. The output neuron data is the final output result or intermediate data.
In an alternative, the first input data and the calculation instruction may be obtained through a data input/output unit, which may be one or more data I/O interfaces or I/O pins.
The calculation instruction includes, but is not limited to, a convolution operation instruction, a forward training instruction, or another neural network operation instruction; the present invention does not limit the specific form of the calculation instruction.
The controller unit 11 is further configured to parse the calculation instruction to obtain a data conversion instruction and/or one or more operation instructions, where the data conversion instruction includes an operation field and an operation code, the operation code indicates the function of the data conversion instruction, and the operation field of the data conversion instruction includes a decimal point position, a flag bit indicating the data type of the first input data, and a conversion mode identifier of the data type.
When the operation field of the data conversion instruction is an address of a storage space, the controller unit 11 obtains the decimal point position, the flag bit indicating the data type of the first input data, and the conversion mode identifier of the data type from the storage space corresponding to the address.
The controller unit 11 transmits the operation code and the operation field of the data conversion instruction and the first input data to the conversion unit 13, and transmits the plurality of operation instructions to the operation unit 12;
the conversion unit 13 is configured to convert the first input data into second input data according to the operation code and the operation field of the data conversion instruction, where the second input data is fixed-point data, and to transmit the second input data to the arithmetic unit 12;
the arithmetic unit 12 is configured to perform an arithmetic operation on the second input data according to the plurality of operation instructions to obtain a calculation result of the calculation instruction.
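The two ways the controller unit can obtain the conversion parameters (directly from the instruction, or through an address held in it) can be sketched as below. Every name here is hypothetical: the source specifies which items the instruction carries, but not their encoding, so the dataclasses, the `"CONVERT"` opcode and the dictionary standing in for the storage space are illustrative only.

```python
from dataclasses import dataclass
from typing import Dict, Union


@dataclass(frozen=True)
class ConversionParams:
    point_location: int    # decimal point position
    input_type_flag: int   # flag bit indicating the data type of the first input data
    conversion_mode: int   # conversion mode identifier of the data type


@dataclass(frozen=True)
class DataConversionInstruction:
    opcode: str
    # Either the parameters themselves or the address of a storage space holding them.
    operand: Union[ConversionParams, int]


def resolve_params(instr: DataConversionInstruction,
                   storage: Dict[int, ConversionParams]) -> ConversionParams:
    """Return the conversion parameters, following the address when one is given."""
    if isinstance(instr.operand, ConversionParams):
        return instr.operand
    return storage[instr.operand]
```

Either path hands the same three values to the conversion unit, so the conversion logic itself does not need to know where they came from.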
In a possible embodiment, the present application provides a technical solution in which the operation unit 12 is set to a master-slave structure. For the calculation instruction of the forward operation, the operation unit can split data according to that calculation instruction, so that the plurality of slave processing circuits 102 can operate in parallel on the part with a large calculation amount, thereby increasing the operation speed, saving operation time, and further reducing power consumption. As shown in FIG. 3A, the arithmetic unit 12 includes a master processing circuit 101 and a plurality of slave processing circuits 102;
the main processing circuit 101 is configured to perform preamble processing on the second input data and to transfer data and the plurality of operation instructions with the plurality of slave processing circuits 102;
the plurality of slave processing circuits 102 are configured to perform an intermediate operation according to the second input data and the plurality of operation instructions transmitted from the master processing circuit 101 to obtain a plurality of intermediate results, and to transmit the plurality of intermediate results to the master processing circuit 101;
the main processing circuit 101 is configured to perform subsequent processing on the plurality of intermediate results to obtain a calculation result of the calculation instruction.
In one embodiment, the machine learning operation includes a deep learning operation (i.e., an artificial neural network operation), and the machine learning data (i.e., the first input data) includes input neurons and weights (i.e., neural network model data). The output neuron is a calculation result or an intermediate result of the calculation instruction. In the following, the deep learning operation is taken as an example, but it should be understood that the machine learning operation is not limited to deep learning.
Optionally, the computing device may further include the storage unit 10 and a direct memory access (DMA) unit 50. The storage unit 10 may include one or any combination of a register 201 and a cache 202. Specifically, the cache 202 is used for storing the calculation instruction, and the register 201 is used for storing the first input data and a scalar. The first input data includes input neurons, weights, and output neurons.
The cache 202 is a scratchpad cache.
The DMA unit 50 is used to read data from or store data to the storage unit 10.
In a possible embodiment, the register 201 stores the operation instruction, the first input data, the decimal point position, the flag bit indicating the data type of the first input data, and the conversion mode identifier of the data type. The controller unit 11 obtains these items directly from the register 201, transmits the first input data, the decimal point position, the flag bit indicating the data type of the first input data, and the conversion mode identifier of the data type to the conversion unit 13, and transmits the operation instruction to the operation unit 12;
the conversion unit 13 converts the first input data into the second input data according to the decimal point position, the flag bit indicating the data type of the first input data, and the conversion mode identifier of the data type, and then transmits the second input data to the operation unit 12;
the operation unit 12 operates on the second input data according to the operation instruction to obtain an operation result.
Optionally, the controller unit 11 includes an instruction cache unit 110, an instruction processing unit 111, and a store queue unit 113;
the instruction cache unit 110 is configured to store the calculation instruction associated with the artificial neural network operation;
the instruction processing unit 111 is configured to parse the calculation instruction to obtain the data conversion instruction and the plurality of operation instructions, and to parse the data conversion instruction to obtain the operation code and the operation domain of the data conversion instruction;
the store queue unit 113 is configured to store an instruction queue, where the instruction queue includes a plurality of operation instructions or calculation instructions to be executed in queue order.
For example, in an alternative embodiment, the master processing circuit 101 may also include a control unit, and the control unit may include a master instruction processing unit specifically configured to decode an instruction into microinstructions. In another alternative, the slave processing circuit 102 may also include another control unit that includes a slave instruction processing unit specifically configured to receive and process microinstructions. A microinstruction may be the next-level instruction of an instruction; it may be obtained by splitting or decoding the instruction, and may be further decoded into control signals for each component, unit, or processing circuit.
In one alternative, the structure of the calculation instruction may be as shown in Table 1 below.
Operation code | Registers or immediate data | Register/immediate | ……
TABLE 1
The ellipses in the above table indicate that multiple registers or immediate numbers may be included.
In another alternative, the calculation instruction may include one or more operation domains and an operation code. The calculation instruction may include a neural network operation instruction. Taking the neural network operation instruction as an example, as shown in table 1, register number 0, register number 1, register number 2, register number 3, and register number 4 may be operation domains. Each of register number 0, register number 1, register number 2, register number 3, and register number 4 may be the number of one or more registers.
Figure GDA0002892009860000081
TABLE 2
The register may be an off-chip memory; in practical applications it may also be an on-chip memory for storing data. The data may specifically be n-dimensional data, where n is an integer greater than or equal to 1: when n = 1 the data is 1-dimensional, i.e., a vector; when n = 2 the data is 2-dimensional, i.e., a matrix; and when n is 3 or more the data is a multidimensional tensor.
Optionally, the controller unit 11 may further include:
a dependency processing unit 112, configured to determine, when there are multiple operation instructions, whether a first operation instruction has an association relationship with a zeroth operation instruction preceding it; if so, the first operation instruction is cached in the instruction cache unit 110 and, after the zeroth operation instruction has finished executing, is extracted from the instruction cache unit 110 and transmitted to the operation unit;
determining whether the first operation instruction has an association relationship with the zeroth operation instruction preceding it comprises: extracting, according to the first operation instruction, a first storage address interval of the data (such as a matrix) required by the first operation instruction, and extracting, according to the zeroth operation instruction, a zeroth storage address interval of the data required by the zeroth operation instruction; if the first storage address interval and the zeroth storage address interval overlap, it is determined that the two instructions have an association relationship; if they do not overlap, it is determined that the two instructions have no association relationship.
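The overlap test described above can be illustrated with a minimal Python sketch. The half-open `(start, end)` interval convention and the names `Interval`, `intervals_overlap`, and `has_dependency` are assumptions made for illustration; the text does not specify how address intervals are encoded.

```python
from typing import Tuple

# Hypothetical encoding: (start, end) with `end` exclusive.
Interval = Tuple[int, int]


def intervals_overlap(a: Interval, b: Interval) -> bool:
    """Return True when two half-open address intervals share an address."""
    return a[0] < b[1] and b[0] < a[1]


def has_dependency(first: Interval, zeroth: Interval) -> bool:
    """The first instruction depends on the zeroth one if their data overlap."""
    return intervals_overlap(first, zeroth)


# Overlapping ranges: the first instruction must wait for the zeroth one.
print(has_dependency((0, 16), (8, 24)))   # True
# Adjacent but disjoint ranges: the first instruction can be issued at once.
print(has_dependency((0, 16), (16, 32)))  # False
```

In this sketch a dependent instruction would stay in the instruction cache until the earlier instruction completes; the scheduling itself is not modeled.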
In another alternative embodiment, as shown in fig. 3B, the operation unit 12 includes a master processing circuit 101, a plurality of slave processing circuits 102, and a plurality of branch processing circuits 103.
The master processing circuit 101 is specifically configured to determine that the input neurons are broadcast data and that the weights are distribution data, split the distribution data into a plurality of data blocks, and send at least one of the data blocks, the broadcast data, and at least one of the plurality of operation instructions to the branch processing circuits 103;
the branch processing circuits 103 are configured to forward data blocks, broadcast data, and operation instructions between the master processing circuit 101 and the plurality of slave processing circuits 102;
the slave processing circuits 102 are configured to operate on the received data block and broadcast data according to the operation instruction to obtain an intermediate result, and to transmit the intermediate result to the branch processing circuits 103;
the master processing circuit 101 is further configured to perform subsequent processing on the intermediate results sent from the branch processing circuits 103 to obtain the result of the operation instruction, and to send that result to the controller unit 11.
In another alternative embodiment, the operation unit 12 may include a master processing circuit 101 and a plurality of slave processing circuits 102, as shown in fig. 3C. The plurality of slave processing circuits 102 are distributed in an array; each slave processing circuit 102 is connected to the adjacent slave processing circuits 102, and the master processing circuit 101 is connected to K slave processing circuits 102 among the plurality of slave processing circuits 102. The K slave processing circuits 102 are the n slave processing circuits 102 in the 1st row, the n slave processing circuits 102 in the m-th row, and the m slave processing circuits 102 in the 1st column. It should be noted that, as shown in fig. 3C, the K slave processing circuits 102 include only these circuits; that is, the K slave processing circuits 102 are the slave processing circuits 102 that are directly connected to the master processing circuit 101.
The K slave processing circuits 102 are configured to forward data and instructions between the master processing circuit 101 and the plurality of slave processing circuits 102;
the master processing circuit 101 is further configured to determine that the input neurons are broadcast data and that the weights are distribution data, split the distribution data into a plurality of data blocks, and send at least one of the data blocks and at least one of the plurality of operation instructions to the K slave processing circuits 102;
the K slave processing circuits 102 are configured to convert data between the master processing circuit 101 and the plurality of slave processing circuits 102;
the slave processing circuits 102 are configured to operate on the received data block according to the operation instruction to obtain an intermediate result, and to transmit the intermediate result to the K slave processing circuits 102;
the master processing circuit 101 is configured to process the intermediate results sent by the K slave processing circuits 102 to obtain the result of the calculation instruction, and to send that result to the controller unit 11.
Optionally, as shown in fig. 3D, the master processing circuit 101 in figs. 3A to 3C may further include one or any combination of an activation processing circuit 1011 and an addition processing circuit 1012;
the activation processing circuit 1011 is configured to perform the activation operation on data in the master processing circuit 101;
the addition processing circuit 1012 is configured to perform addition or accumulation.
The slave processing circuit 102 includes: a multiplication processing circuit configured to multiply the received data blocks to obtain a product result; an optional forwarding processing circuit configured to forward the received data block or the product result; and an accumulation processing circuit configured to accumulate the product results to obtain the intermediate result.
In a possible embodiment, before the operation unit 12 of the computing device performs the operation of the i-th layer of the multilayer neural network model, the controller unit 11 of the computing device obtains a configuration instruction that includes a decimal point position and the data type participating in the operation. The controller unit 11 parses the configuration instruction to obtain the decimal point position and the data type participating in the operation, or obtains them directly from the storage unit 10. After obtaining the input data, the controller unit 11 determines whether the data type of the input data is consistent with the data type participating in the operation. When they are not consistent, the controller unit 11 transmits the input data, the decimal point position, and the data type participating in the operation to the conversion unit 13; the conversion unit 13 converts the data type of the input data according to the decimal point position and the data type participating in the operation, so that the two data types are consistent, and then transmits the converted data to the operation unit 12, whose master processing circuit 101 and slave processing circuits 102 operate on the converted input data. When they are consistent, the controller unit 11 transmits the input data to the operation unit 12, whose master processing circuit 101 and slave processing circuits 102 operate on the input data directly, without data type conversion.
Further, when the input data is fixed-point data and the data type participating in the operation is also fixed-point data, the controller unit 11 determines whether the decimal point position of the input data is consistent with the decimal point position participating in the operation. If not, the controller unit 11 transmits the input data, the decimal point position of the input data, and the decimal point position participating in the operation to the conversion unit 13; the conversion unit 13 converts the input data into fixed-point data whose decimal point position is consistent with the decimal point position participating in the operation, and then transmits the converted data to the operation unit 12, whose master processing circuit 101 and slave processing circuits 102 operate on the converted data.
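As a rough illustration of converting fixed-point data from one decimal point position to another, the following Python sketch re-expresses a raw integer under a new point position. It assumes that the decimal point position is the number of fractional bits (so a raw integer `raw` represents `raw * 2**-point`), and the function name `requantize` is hypothetical. Overflow handling is deliberately left out.

```python
def requantize(raw: int, src_point: int, dst_point: int) -> int:
    """Re-express a fixed-point integer under a new decimal point position.

    Assumption: a raw integer `raw` with point position `p` represents the
    value raw * 2**-p. Bit-width limits and saturation are not modeled.
    """
    shift = dst_point - src_point
    if shift >= 0:
        # More fractional bits: the value is preserved exactly.
        return raw << shift
    # Fewer fractional bits: low-order bits are dropped (rounds toward -inf).
    return raw >> (-shift)


# 1.25 stored with 2 fractional bits is raw 5; with 4 fractional bits it is 20.
print(requantize(5, 2, 4))   # 20
print(requantize(20, 4, 2))  # 5
```

Dropping bits with an arithmetic right shift is only one possible choice; a rounding step could be applied instead when precision matters.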
In other words, the arithmetic instruction may be replaced with the configuration instruction.
In another embodiment, the operation instruction is a matrix by matrix instruction, an accumulation instruction, an activation instruction, or the like.
In an alternative embodiment, as shown in fig. 3E, the operation unit comprises a tree module 40. The tree module comprises a root port and a plurality of branch ports; the root port of the tree module is connected to the master processing circuit 101, and each branch port of the tree module is connected to one of the plurality of slave processing circuits 102;
the tree module has a transceiving function: as shown in fig. 3E the tree module performs the transmitting function, and as shown in fig. 6A it performs the receiving function.
The tree module is configured to forward data blocks, weights, and operation instructions between the master processing circuit 101 and the plurality of slave processing circuits 102.
Optionally, the tree module is an optional component of the computing device and may include at least one layer of nodes. Each node is a wiring structure with a forwarding function and need not have a computing function itself. If the tree module has zero layers of nodes, the tree module is not needed.
Optionally, the tree module may have an n-ary tree structure, for example the binary tree structure shown in fig. 3F, or a ternary tree structure, where n may be an integer greater than or equal to 2. The present embodiment does not limit the specific value of n. The number of layers may be 2, and the slave processing circuits 102 may be connected to nodes of layers other than the penultimate layer, for example to the nodes of the last layer shown in fig. 3F.
Optionally, the operation unit may carry a separate cache. As shown in fig. 3G, it may include a neuron cache unit 63 that caches the input neuron vector data and the output neuron value data of the slave processing circuits 102.
As shown in fig. 3H, the operation unit may further include a weight cache unit 64 for caching the weight data required by the slave processing circuits 102 during computation.
In an alternative embodiment, taking the fully connected operation in the neural network operation as an example, the process may be y = f(wx + b), where x is the input neuron matrix, w is the weight matrix, b is a bias scalar, and f is an activation function, which may specifically be any one of the sigmoid, tanh, relu, and softmax functions. Assuming a binary tree structure with 8 slave processing circuits 102, the method may be implemented as follows:
the controller unit 11 acquires the input neuron matrix x, the weight matrix w, and the fully connected operation instruction from the storage unit 10, and transmits them to the master processing circuit 101;
the master processing circuit 101 splits the input neuron matrix x into 8 sub-matrices, distributes the 8 sub-matrices to the 8 slave processing circuits 102 through the tree module, and broadcasts the weight matrix w to the 8 slave processing circuits 102;
the slave processing circuits 102 perform the multiplication and accumulation of the 8 sub-matrices with the weight matrix w in parallel to obtain 8 intermediate results, and send the 8 intermediate results to the master processing circuit 101;
the master processing circuit 101 arranges the 8 intermediate results in order to obtain the operation result of wx, applies the bias b to that result, and then performs the activation operation to obtain the final result y; it sends the final result y to the controller unit 11, and the controller unit 11 outputs the final result y or stores it into the storage unit 10.
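The four steps above can be sketched in plain Python as follows. This is only an illustration under stated assumptions: x is split by columns, relu is chosen as the activation f, the slave computations run sequentially rather than in parallel, and the helper names (`matmul`, `split_columns`, `fully_connected`) are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product a @ b (multiply and accumulate)."""
    cols = list(zip(*b))
    return [[sum(p * q for p, q in zip(row, col)) for col in cols] for row in a]


def split_columns(x: Matrix, parts: int) -> List[Matrix]:
    """Split x into `parts` column blocks of near-equal width."""
    n = len(x[0])
    bounds = [n * k // parts for k in range(parts + 1)]
    return [[row[lo:hi] for row in x] for lo, hi in zip(bounds, bounds[1:])]


def fully_connected(w: Matrix, x: Matrix, b: float, num_slaves: int = 8) -> Matrix:
    """Compute relu(w @ x + b) by splitting x across `num_slaves` workers."""
    blocks = split_columns(x, num_slaves)          # master: distribute x
    partials = [matmul(w, blk) for blk in blocks]  # slaves: w is broadcast
    # Master: put the partial results back in order to rebuild w @ x.
    wx = [sum((p[r] for p in partials), []) for r in range(len(w))]
    # Master: add the bias scalar, then apply the activation (relu here).
    return [[max(0.0, v + b) for v in row] for row in wx]


w = [[1.0, -1.0]]
x = [[1.0, 2.0], [3.0, 1.0]]
print(fully_connected(w, x, 0.5))  # [[0.0, 1.5]]
```

Splitting by columns keeps each partial result a contiguous slice of wx, so the master only has to concatenate the slices in order.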
In one embodiment, the operation unit 12 includes, but is not limited to: one or more multipliers of a first part; one or more adders of a second part (more specifically, the adders of the second part may also form an adder tree); an activation function unit of a third part; and/or a vector processing unit of a fourth part. More specifically, the vector processing unit may process vector operations and/or pooling operations. The first part multiplies input data 1 (in1) by input data 2 (in2) to obtain the multiplied output (out): out = in1 * in2. The second part adds the input data in1 through adders to obtain the output data (out). More specifically, when the second part is an adder tree, the input data in1 is added stage by stage through the adder tree to obtain the output data (out), where in1 is a vector of length N and N is greater than 1: out = in1[1] + in1[2] + ... + in1[N]; and/or the input data (in1) is accumulated through the adder tree and then added to the input data (in2) to obtain the output data (out): out = in1[1] + in1[2] + ... + in1[N] + in2; or the input data (in1) and the input data (in2) are added to obtain the output data (out): out = in1 + in2. The third part applies an activation function (active) to the input data (in) to obtain the activation output data (out): out = active(in), where the activation function active may be sigmoid, tanh, relu, softmax, and the like. In addition to the activation operation, the third part may implement other non-linear functions, obtaining the output data (out) by applying an operation (f) to the input data (in): out = f(in).
The vector processing unit applies a pooling operation to the input data (in) to obtain the pooled output data (out): out = pool(in), where pool is the pooling operation. The pooling operation includes, but is not limited to, mean pooling, maximum pooling, and median pooling, and the input data in is the data in the pooling kernel associated with the output out.
The operation performed by the operation unit includes: the first part multiplying input data 1 by input data 2 to obtain the multiplied data; and/or the second part performing an addition operation (more specifically, an adder tree operation that adds input data 1 stage by stage through the adder tree), or adding input data 1 and input data 2 to obtain the output data; and/or the third part performing the activation function operation, obtaining the output data by applying the activation function (active) to the input data; and/or the fourth part performing the pooling operation, out = pool(in), where pool is a pooling operation including, but not limited to, mean pooling, maximum pooling, and median pooling, and the input data in is the data in the pooling kernel associated with the output out. One or more of these parts can be freely selected and combined in different orders to realize operations with various functions. The operation units correspondingly form a two-stage, three-stage, or four-stage pipeline architecture.
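The four parts and their free combination can be pictured with a small Python sketch. All function names here (`multiply`, `adder_tree`, `relu`, `max_pool`) are hypothetical, relu and maximum pooling are arbitrary picks from the functions listed above, and the pooling kernel is assumed to be a non-overlapping 1-D window.

```python
from typing import List, Sequence


def multiply(in1: Sequence[float], in2: Sequence[float]) -> List[float]:
    """First part: element-wise product, out = in1 * in2."""
    return [a * b for a, b in zip(in1, in2)]


def adder_tree(in1: Sequence[float]) -> float:
    """Second part: pairwise, stage-by-stage sum of a vector."""
    if not in1:
        raise ValueError("adder_tree needs at least one input")
    level = list(in1)
    while len(level) > 1:
        nxt = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:  # an odd element passes through to the next stage
            nxt.append(level[-1])
        level = nxt
    return level[0]


def relu(value: float) -> float:
    """Third part: one possible activation function."""
    return max(0.0, value)


def max_pool(values: Sequence[float], kernel: int) -> List[float]:
    """Fourth part: maximum pooling over non-overlapping 1-D windows."""
    return [max(values[i:i + kernel]) for i in range(0, len(values), kernel)]


# Three-stage combination: multiply -> adder tree -> activation.
in1, in2 = [1.0, 2.0, 3.0], [4.0, -5.0, 6.0]
print(relu(adder_tree(multiply(in1, in2))))  # 12.0
```

Chaining two, three, or four of these functions mirrors the two-, three-, and four-stage pipelines mentioned above, although real hardware stages would overlap in time.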
It should be noted that the first input data is long-bit non-fixed-point data, for example 32-bit floating-point data; it may also be standard 64-bit or 16-bit floating-point data, and 32 bits is used here only as a specific example. The second input data is short-bit fixed-point data, also called fewer-bit fixed-point data, that is, fixed-point data represented with fewer bits than the long-bit non-fixed-point first input data.
In one possible embodiment, the first input data is non-fixed point data, the second input data is fixed point data, and the number of bits occupied by the first input data is greater than or equal to the number of bits occupied by the second input data. For example, the first input data is 32-bit floating point data, and the second input data is 32-bit fixed point data; for another example, the first input data is 32-bit floating point data, and the second input data is 16-bit fixed point data.
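A minimal Python sketch of such a conversion is given below. It assumes that the decimal point position is the number of fractional bits, that out-of-range values saturate to the representable limits, and that the fractional remainder is rounded down; none of these choices is fixed by the text, and the names `float_to_fixed` and `fixed_to_float` are hypothetical.

```python
import math


def float_to_fixed(value: float, point_location: int, bit_width: int = 16) -> int:
    """Convert a float to a signed fixed-point integer.

    Assumptions: `point_location` is the number of fractional bits, the
    remainder is rounded down, and overflow saturates to the range limits.
    """
    scaled = math.floor(value * 2.0 ** point_location)
    lo = -(1 << (bit_width - 1))
    hi = (1 << (bit_width - 1)) - 1
    return max(lo, min(hi, scaled))


def fixed_to_float(raw: int, point_location: int) -> float:
    """Recover the real value represented by a fixed-point integer."""
    return raw * 2.0 ** -point_location


raw = float_to_fixed(1.5, point_location=8)
print(raw, fixed_to_float(raw, 8))    # 384 1.5
print(float_to_fixed(1000.0, 8, 16))  # 32767 (saturated)
```

A larger point location gives finer resolution but a smaller representable range, which is why the point position is chosen per data type and per layer.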
In particular, the first input data comprises different types of data for different layers of different network models. The decimal point positions of the different types of data differ, that is, the precision of the corresponding fixed-point data differs. For a fully connected layer, the first input data comprises data such as input neurons, weights, and bias data; for a convolutional layer, the first input data comprises data such as convolution kernels, input neurons, and bias data.
For example, for a fully connected layer, the decimal point positions include the decimal point position of the input neurons, the decimal point position of the weights, and the decimal point position of the bias data. These three decimal point positions may be all the same, partially the same, or all different.
In a possible embodiment, after the controller unit 11 or the operation unit 12 obtains the decimal point position of the first input data according to the above process, the decimal point position of the first input data is stored in the cache 202 of the storage unit 10.
When the calculation instruction is an immediate addressing instruction, the master processing circuit 101 directly converts the first input data into the second input data according to the decimal point position indicated by the operation domain of the calculation instruction; when the calculation instruction is a direct addressing instruction or an indirect addressing instruction, the master processing circuit 101 obtains the decimal point position of the first input data from the storage space indicated by the operation domain of the calculation instruction, and then converts the first input data into the second input data according to that decimal point position.
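The three addressing cases can be sketched as follows. The sketch assumes the conventional meaning of the terms (direct: the operand is the address of the value; indirect: the operand is the address of a cell holding that address), models storage as a dictionary, and uses the hypothetical names `Addressing` and `resolve_point_location`.

```python
from enum import Enum
from typing import Dict


class Addressing(Enum):
    IMMEDIATE = 0  # operand is the decimal point position itself
    DIRECT = 1     # operand is the address where the position is stored
    INDIRECT = 2   # operand is the address of a cell holding that address


def resolve_point_location(mode: Addressing, operand: int,
                           memory: Dict[int, int]) -> int:
    """Return the decimal point position selected by an operation domain."""
    if mode is Addressing.IMMEDIATE:
        return operand
    if mode is Addressing.DIRECT:
        return memory[operand]
    return memory[memory[operand]]


# Hypothetical storage contents: cell 0x10 holds the position 5,
# and cell 0x20 holds the address 0x10.
memory = {0x10: 5, 0x20: 0x10}
print(resolve_point_location(Addressing.IMMEDIATE, 7, memory))    # 7
print(resolve_point_location(Addressing.DIRECT, 0x10, memory))    # 5
print(resolve_point_location(Addressing.INDIRECT, 0x20, memory))  # 5
```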
The computing device may further include a rounding unit. During operation, an operation result (including intermediate operation results and the result of the calculation instruction) obtained by performing addition, multiplication, and/or other operations on the second input data may have a precision that exceeds the precision range of the current fixed-point data, so the intermediate operation result is buffered. After the operation is finished, the rounding unit performs a rounding operation on any operation result that exceeds the precision range of the fixed-point data to obtain a rounded operation result, and the conversion unit then converts the rounded operation result into data of the current fixed-point data type.
Specifically, the rounding unit performs a rounding operation on the intermediate operation result, the rounding operation being any one of a random rounding operation, a rounding (round-to-nearest) operation, a round-up operation, a round-down operation, and a truncation operation.
When the rounding unit performs the random rounding operation, it specifically performs the following operation:

$$y = \begin{cases} \lfloor x \rfloor & \text{w.p. } 1 - \dfrac{x - \lfloor x \rfloor}{\varepsilon} \\[2ex] \lfloor x \rfloor + \varepsilon & \text{w.p. } \dfrac{x - \lfloor x \rfloor}{\varepsilon} \end{cases}$$

where y represents the data obtained by randomly rounding the pre-rounding operation result x, i.e., the rounded operation result; $\varepsilon$ is the smallest positive number that can be represented in the current fixed-point data format, i.e., $2^{-\text{Point Location}}$; and $\lfloor x \rfloor$ represents the number obtained by directly truncating the pre-rounding operation result x to fixed-point data (similar to rounding a decimal down). The formula indicates that randomly rounding the pre-rounding operation result x yields $\lfloor x \rfloor$ with probability $1 - \frac{x - \lfloor x \rfloor}{\varepsilon}$ and yields $\lfloor x \rfloor + \varepsilon$ with probability $\frac{x - \lfloor x \rfloor}{\varepsilon}$.
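A minimal Python sketch of random rounding follows. It assumes the usual stochastic-rounding rule, in which the probability of rounding up to the next multiple of the step equals the fractional distance of x from the lower multiple, and it assumes a step of 2^(-Point Location). The function name `random_round` and the injectable `rng` parameter are hypothetical conveniences for testing.

```python
import math
import random
from typing import Callable


def random_round(x: float, point_location: int,
                 rng: Callable[[], float] = random.random) -> float:
    """Stochastically round x to a multiple of eps = 2**-point_location.

    Rounds up with probability (x - floor_x) / eps, otherwise rounds down,
    so the result equals x in expectation.
    """
    eps = 2.0 ** -point_location
    floor_x = math.floor(x / eps) * eps  # largest multiple of eps <= x
    p_up = (x - floor_x) / eps           # probability of rounding up
    return floor_x + eps if rng() < p_up else floor_x


# With eps = 0.25, the value 0.3 becomes 0.5 with probability about 0.2
# and 0.25 with probability about 0.8.
print(random_round(0.3, 2, rng=lambda: 0.0))   # 0.5
print(random_round(0.3, 2, rng=lambda: 0.99))  # 0.25
```

Because the expected value of the rounded result equals x, rounding errors tend to cancel over many operations instead of accumulating in one direction.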
When the rounding unit performs the rounding (round-to-nearest) operation, it specifically performs the following operation:

$$y = \begin{cases} \lfloor x \rfloor & \text{if } \lfloor x \rfloor \le x \le \lfloor x \rfloor + \dfrac{\varepsilon}{2} \\[2ex] \lfloor x \rfloor + \varepsilon & \text{if } \lfloor x \rfloor + \dfrac{\varepsilon}{2} \le x \le \lfloor x \rfloor + \varepsilon \end{cases}$$

where y represents the data obtained by rounding the pre-rounding operation result x, i.e., the rounded operation result; $\varepsilon$ is the smallest positive number that can be represented in the current fixed-point data format, i.e., $2^{-\text{Point Location}}$; and $\lfloor x \rfloor$ is an integer multiple of $\varepsilon$ whose value is the largest number less than or equal to x. The formula indicates that when the pre-rounding operation result x satisfies $\lfloor x \rfloor \le x \le \lfloor x \rfloor + \frac{\varepsilon}{2}$, the rounded operation result is $\lfloor x \rfloor$; when it satisfies $\lfloor x \rfloor + \frac{\varepsilon}{2} \le x \le \lfloor x \rfloor + \varepsilon$, the rounded operation result is $\lfloor x \rfloor + \varepsilon$.
When the rounding unit performs the round-up operation, it specifically performs the following operation:

$$y = \lceil x \rceil$$

where y represents the data obtained by rounding up the pre-rounding operation result x, i.e., the rounded operation result; $\lceil x \rceil$ is an integer multiple of $\varepsilon$ whose value is the smallest number greater than or equal to x; and $\varepsilon$ is the smallest positive number that can be represented in the current fixed-point data format, i.e., $2^{-\text{Point Location}}$.
When the rounding unit performs the round-down operation, it specifically performs the following operation:

$$y = \lfloor x \rfloor$$

where y represents the data obtained by rounding down the pre-rounding operation result x, i.e., the rounded operation result; $\lfloor x \rfloor$ is an integer multiple of $\varepsilon$ whose value is the largest number less than or equal to x; and $\varepsilon$ is the smallest positive number that can be represented in the current fixed-point data format, i.e., $2^{-\text{Point Location}}$.
When the rounding unit performs truncation rounding operation, the rounding unit specifically performs the following operations:
y=[x]
wherein y represents the data obtained by truncating the operation result x before rounding, i.e., the operation result after rounding, and [ x ] represents the data obtained by directly truncating the operation result x to fixed point data.
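The four deterministic rounding modes can be compared side by side in a short Python sketch. It assumes a step of 2^(-Point Location), resolves round-to-nearest ties upward, and treats truncation as dropping the fraction toward zero; both of the latter are assumptions, since in two's-complement hardware simply discarding low-order bits behaves like rounding down instead. All function names are hypothetical.

```python
import math


def _eps(point_location: int) -> float:
    """Smallest positive step of the assumed fixed-point format."""
    return 2.0 ** -point_location


def round_nearest(x: float, point_location: int) -> float:
    """Round to the nearest multiple of eps (ties go up: an assumption)."""
    eps = _eps(point_location)
    return math.floor(x / eps + 0.5) * eps


def round_up(x: float, point_location: int) -> float:
    """Smallest multiple of eps that is >= x."""
    eps = _eps(point_location)
    return math.ceil(x / eps) * eps


def round_down(x: float, point_location: int) -> float:
    """Largest multiple of eps that is <= x."""
    eps = _eps(point_location)
    return math.floor(x / eps) * eps


def truncate(x: float, point_location: int) -> float:
    """Drop the fraction toward zero (an assumption about truncation)."""
    eps = _eps(point_location)
    return math.trunc(x / eps) * eps


# With eps = 0.25 the modes differ most clearly on a negative input.
for fn in (round_nearest, round_up, round_down, truncate):
    print(fn.__name__, fn(0.3, 2), fn(-0.3, 2))
```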
After the rounding unit obtains the rounded intermediate operation result, the operation unit 12 converts the rounded intermediate operation result into data of the current fixed-point data type according to the decimal point position of the first input data.
In a possible embodiment, the operation unit 12 does not truncate those of the one or more intermediate results whose data type is floating-point data.
The intermediate results obtained by the slave processing circuits 102 of the operation unit 12 according to the above method would ordinarily be truncated, because intermediate results produced by multiplication, division, and similar operations can exceed the memory storage range. However, because the intermediate results generated during the operation of this method are not stored in the memory, intermediate results that exceed the memory storage range do not need to be truncated, which greatly reduces the precision loss of the intermediate results and improves the precision of the calculation result.
In a possible embodiment, the operation unit 12 further includes a derivation unit. When the operation unit 12 receives the decimal point position of the input data participating in the fixed-point operation, the derivation unit derives, from that decimal point position, the decimal point position of the one or more intermediate results obtained during the fixed-point operation. When an intermediate result obtained by the operation subunit exceeds the range indicated by its corresponding decimal point position, the derivation unit shifts the decimal point position of the intermediate result to the left by M bits, so that the precision of the intermediate result falls within the precision range indicated by its decimal point position, where M is an integer greater than 0.
For example, the first input data includes input data I1 and input data I2, whose decimal point positions are P1 and P2 respectively, with P1 > P2. When the operation type indicated by the operation instruction is addition or subtraction, that is, the operation subunit performs I1 + I2 or I1 - I2, the derivation unit derives P1 as the decimal point position of the intermediate result of the operation indicated by the operation instruction; when the operation type indicated by the operation instruction is multiplication, that is, the operation subunit performs I1 × I2, the derivation unit derives P1 × P2 as the decimal point position of the intermediate result of the operation indicated by the operation instruction.
In a possible embodiment, the operation unit 12 further includes:
a data cache unit, configured to cache the one or more intermediate results.
In an optional embodiment, the computing apparatus further includes a data statistics unit, configured to perform statistics on input data of the same type in each layer of the multi-layer network model to obtain a position of a decimal point of each type of input data in each layer.
The data statistics unit may be a part of an external device, and the computing device may acquire the decimal point position of the data participating in the calculation from the external device before data conversion is performed.
Specifically, the data statistics unit includes:
an acquisition subunit, configured to extract input data of the same type in each layer of the multilayer network model;
a statistical subunit, configured to count and obtain the distribution proportion, over a preset interval, of the input data of the same type in each layer of the multilayer network model; and
an analysis subunit, configured to obtain the decimal point position of the input data of the same type in each layer of the multilayer network model according to the distribution proportion.
The preset interval is $[-2^{X-1-i},\ 2^{X-1-i} - 2^{-i}]$, where $i = 0, 1, 2, \ldots, n$, n is a preset positive integer, and X is the number of bits occupied by the fixed-point data. The preset interval $[-2^{X-1-i},\ 2^{X-1-i} - 2^{-i}]$ thus comprises n + 1 subintervals. The statistical subunit counts the distribution information of the input data of the same type in each layer of the multilayer network model over these n + 1 subintervals and obtains the first distribution proportion from that information. The first distribution proportion is $p_0, p_1, p_2, \ldots, p_n$; these n + 1 values are the distribution proportions of the input data of the same type in each layer of the multilayer network model over the n + 1 subintervals. The analysis subunit presets an overflow rate EPL and takes, from 0, 1, 2, …, n, the largest i such that $p_i \geq 1 - \mathrm{EPL}$; this largest i is the decimal point position of the input data of the same type in each layer of the multilayer network model. In other words, the analysis subunit takes the decimal point position of the input data of the same type in each layer of the multilayer network model as $\max\{\, i \mid p_i \geq 1 - \mathrm{EPL},\ i \in \{0, 1, 2, \ldots, n\} \,\}$; that is, among the $p_i$ that are greater than or equal to 1 - EPL, the largest subscript i is selected as the decimal point position of the input data of the same type in each layer of the multilayer network model.
Here, $p_i$ is the ratio of the number of input data of the same type in each layer of the multilayer network model whose values lie in the interval $[-2^{X-1-i},\ 2^{X-1-i} - 2^{-i}]$ to the total number of input data of the same type in that layer. For example, if m2 of the m1 input data of the same type in each layer of the multilayer network model take values in the interval $[-2^{X-1-i},\ 2^{X-1-i} - 2^{-i}]$, then $p_i = m2 / m1$.
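The selection rule described above can be sketched as follows. This is an illustrative sketch only: the function name `choose_point_position` and its parameter names are hypothetical, and the input is assumed to be a non-empty list of values of one data type in one layer.

```python
def choose_point_position(values, bit_width, n, epl):
    """Return the largest i in 0..n whose interval covers at least 1 - EPL of values.

    bit_width corresponds to X, and epl to the preset overflow rate EPL.
    Returns None when no candidate i reaches the required proportion.
    """
    total = len(values)
    best = None
    for i in range(n + 1):
        lower = -(2.0 ** (bit_width - 1 - i))
        upper = 2.0 ** (bit_width - 1 - i) - 2.0 ** (-i)
        # p_i: share of the values that fall inside the i-th interval.
        p_i = sum(lower <= v <= upper for v in values) / total
        if p_i >= 1 - epl:
            best = i
    return best


samples = [0.1, -0.5, 0.9, 3.0]
strict = choose_point_position(samples, bit_width=8, n=7, epl=0.0)
tolerant = choose_point_position(samples, bit_width=8, n=7, epl=0.25)
```

With these samples, `strict` is 5 because the value 3.0 must still fit, while `tolerant` is 7 because one value in four is allowed to overflow.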
In a feasible embodiment, to improve operation efficiency, the acquisition subunit extracts, randomly or by sampling, a portion of the input data of the same type in each layer of the multilayer network model, obtains the decimal point position of that portion according to the above method, and then performs data conversion on that type of input data according to this decimal point position (including conversion from floating-point data to fixed-point data, conversion from fixed-point data to fixed-point data, and the like), so that calculation speed and efficiency can be improved while preserving precision.
Optionally, the data statistics unit may determine bit width and decimal point position of the same type of data or the same layer of data according to the median of the same type of data or the same layer of data, or determine bit width and decimal point position of the same type of data or the same layer of data according to the average of the same type of data or the same layer of data.
Optionally, when an intermediate result obtained by the arithmetic unit from operating on data of the same type or of the same layer exceeds the value range corresponding to the decimal point position and bit width of that data, the arithmetic unit does not truncate the intermediate result, and instead caches it in the data caching unit of the arithmetic unit for use in subsequent operations.
Specifically, the operation field includes a decimal point position of the input data and a conversion mode identifier of the data type. The instruction processing unit analyzes the data conversion instruction to obtain the decimal point position of the input data and the conversion mode identifier of the data type. The processing unit further comprises a data conversion unit which converts the first input data into second input data according to the decimal point position of the input data and the conversion mode identification of the data type.
It should be noted that the network model includes multiple layers, such as a fully connected layer, a convolutional layer, a pooling layer, and an input layer. Among the at least one input data, input data belonging to the same layer have the same decimal point position; that is, the input data of the same layer share the same decimal point position.
The input data includes different types of data, for example input neurons, weights, and bias data. Input data of the same type have the same decimal point position; that is, input data of the same type share the same decimal point position.
For example, if the operation type indicated by the operation instruction is a fixed-point operation and the input data participating in that operation is floating-point data, the data conversion unit converts the input data from floating-point data to fixed-point data before the fixed-point operation is performed; if the operation type indicated by the operation instruction is a floating-point operation and the input data participating in that operation is fixed-point data, the data conversion unit converts the input data corresponding to the operation instruction from fixed-point data to floating-point data before the floating-point operation is performed.
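The two conversion directions can be sketched as below. The sketch assumes the decimal point position is a non-negative count of fractional bits, and it saturates out-of-range values; the saturation policy and the function names `float_to_fixed` and `fixed_to_float` are assumptions for illustration, not taken from the text.

```python
def float_to_fixed(x, frac_bits, bit_width=16):
    """Quantize a float to a signed fixed-point integer with frac_bits fractional bits."""
    scaled = round(x * (1 << frac_bits))
    lowest = -(1 << (bit_width - 1))
    highest = (1 << (bit_width - 1)) - 1
    # Assumed policy: clamp values that do not fit in bit_width bits.
    return max(lowest, min(highest, scaled))


def fixed_to_float(q, frac_bits):
    """Recover the real value represented by a fixed-point integer."""
    return q / (1 << frac_bits)


q = float_to_fixed(1.5, frac_bits=4)
x = fixed_to_float(q, frac_bits=4)
```

A conversion unit would apply the first function before a fixed-point operation on floating-point inputs, and the second before a floating-point operation on fixed-point inputs.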
For macro instructions related to the present application (such as a calculation instruction and a data conversion instruction), the controller unit 11 may parse the macro instruction to obtain its operation field and operation code, and generate the micro instruction corresponding to the macro instruction according to the operation field and the operation code; alternatively, the controller unit 11 decodes the macro instruction to obtain the micro instruction corresponding to the macro instruction.
In one possible embodiment, a system on chip (SOC) includes a main processor and a coprocessor, and the main processor includes the computing device. The coprocessor obtains the decimal point position of the input data of the same type in each layer of the multilayer network model according to the above method and transmits it to the computing device; alternatively, the computing device obtains that decimal point position from the coprocessor when it needs to use it.
In a possible embodiment, the first input data is non-fixed point data, and the non-fixed point data includes long-bit floating point data, short-bit floating point data, integer data, discrete data, and the like.
The data types within the first input data may differ. For example, the input neurons, weights, and bias data are all floating-point data; or part of the input neurons, weights, and bias data is floating-point data and part is integer data; or the input neurons, weights, and bias data are all integer data. The computing device can convert non-fixed-point data to fixed-point data, that is, convert data of types such as long-bit floating-point data, short-bit floating-point data, integer data, and discrete data to fixed-point data. The fixed-point data may be signed or unsigned fixed-point data.
In a possible embodiment, the first input data and the second input data are both fixed-point data. The first input data and the second input data may both be signed fixed-point data, or both be unsigned fixed-point data, or one may be unsigned fixed-point data and the other signed fixed-point data. The decimal point position of the first input data differs from the decimal point position of the second input data.
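A fixed-point to fixed-point conversion between two decimal point positions can be sketched as a shift. The sketch assumes the decimal point position counts fractional bits and that dropped low-order bits are discarded by an arithmetic right shift (rounding toward negative infinity); both are illustrative assumptions, and `requantize` is a hypothetical name.

```python
def requantize(q, src_frac_bits, dst_frac_bits):
    """Re-express fixed-point integer q with a different number of fractional bits."""
    shift = dst_frac_bits - src_frac_bits
    if shift >= 0:
        # More fractional bits: exact, only the stored integer grows.
        return q << shift
    # Fewer fractional bits: low-order bits are dropped (floor behaviour).
    return q >> (-shift)


finer = requantize(24, 4, 6)    # 1.5 stays 1.5 on a finer grid
coarser = requantize(24, 4, 2)  # 1.5 stays 1.5 on a coarser grid
```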
In one possible embodiment, the first input data is fixed-point data, and the second input data is non-fixed-point data. In other words, the above-described computing device can implement conversion of fixed-point data into non-fixed-point data.
Fig. 4 is a flowchart of a forward operation of a single-layer neural network according to an embodiment of the present invention. The flowchart describes the forward operation of a single-layer neural network performed using the computing device and instruction set of the present invention. For each layer, the input neuron vector is first weighted and summed to calculate the intermediate result vector of the layer. The intermediate result vector is then biased and activated to obtain the output neuron vector. The output neuron vector is used as the input neuron vector of the next layer.
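The per-layer flow described above (weighted sum, bias, activation, then feeding the next layer) can be sketched in plain floating-point Python. The function names and the choice of ReLU as the activation are assumptions for illustration; the text does not fix a particular activation function.

```python
def relu(vector):
    """Example activation; chosen here only for illustration."""
    return [max(0.0, h) for h in vector]


def layer_forward(weights, bias, inputs, activation=relu):
    """Weighted sum of the input neuron vector, plus bias, then activation."""
    intermediate = [
        sum(w * x for w, x in zip(row, inputs)) + b
        for row, b in zip(weights, bias)
    ]
    return activation(intermediate)


def network_forward(layers, inputs):
    """Each layer's output neuron vector becomes the next layer's input vector."""
    for weights, bias in layers:
        inputs = layer_forward(weights, bias, inputs)
    return inputs


layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, 1.0]),
    ([[1.0, 1.0]], [-1.0]),
]
outputs = network_forward(layers, [2.0, 1.0])
```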
In a specific application scenario, the computing device may be a training device. Before training the neural network model, the training device acquires the training data participating in the training, where the training data is non-fixed-point data, and obtains the decimal point position of the training data according to the above method. The training device converts the training data into training data expressed as fixed-point data according to that decimal point position, and performs a forward neural network operation on the fixed-point training data to obtain a neural network operation result. For a neural network operation result that exceeds the data precision range representable with the decimal point position of the training data, the training device performs a random rounding operation to obtain a rounded neural network operation result that lies within that data precision range. In this way, the training device obtains the neural network operation result, that is, the output neurons, of each layer of the multilayer neural network. The training device obtains the output neuron gradients from the output neurons of each layer, performs the reverse operation according to the output neuron gradients to obtain the weight gradients, and updates the weights of the neural network model according to the weight gradients.
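The random rounding step can be sketched as stochastic rounding onto the fixed-point grid. This assumes that "random rounding" means rounding up with probability equal to the discarded fraction, and that the decimal point position counts fractional bits; the function name `stochastic_round` is hypothetical.

```python
import math
import random


def stochastic_round(x, frac_bits, rng=random):
    """Round x to a multiple of 2**-frac_bits, rounding up with probability
    equal to the fractional remainder (assumed meaning of random rounding)."""
    scaled = x * (1 << frac_bits)
    lower = math.floor(scaled)
    remainder = scaled - lower
    rounded = lower + (1 if rng.random() < remainder else 0)
    return rounded / (1 << frac_bits)


generator = random.Random(0)  # seeded only to make the example repeatable
result = stochastic_round(0.3, frac_bits=2, rng=generator)
```

With 2 fractional bits, 0.3 lies between the grid points 0.25 and 0.5, so `result` is one of those two values; values already on the grid are returned unchanged.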
The training device repeatedly executes the process to achieve the purpose of training the neural network model.
It should be noted that, before performing the forward operation and the reverse training, the computing device may perform data conversion on the data participating in the forward operation but not on the data participating in the reverse training; or perform data conversion on the data participating in the reverse training but not on the data participating in the forward operation; or perform data conversion on both the data participating in the forward operation and the data participating in the reverse training. For the specific data conversion process, refer to the description of the related embodiments above, which is not repeated here.
The forward operation includes the multilayer neural network operation, the multilayer neural network operation includes operations such as convolution, and the convolution operation is implemented by a convolution operation instruction.
The convolution operation instruction is an instruction in the Cambricon instruction set. In the Cambricon instruction set, each instruction is composed of an operation code and operands, and the instruction set includes four types of instructions: control instructions, data transfer instructions, computational instructions, and logical instructions.
Preferably, each instruction in the instruction set has a fixed length. For example, each instruction in the instruction set may be 64 bits long.
Further, the control instructions are used for controlling the execution process. The control instructions include jump (jump) instructions and conditional branch (conditional branch) instructions.
Further, the data transfer instructions are used for completing data transfer between different storage media. The data transfer instructions include a load instruction, a store instruction, and a move instruction. The load instruction is used for loading data from the main memory into the cache, the store instruction is used for storing data from the cache into the main memory, and the move instruction is used for moving data between cache and cache, between cache and register, or between register and register. The data transfer instructions support three different data organization modes: matrices, vectors, and scalars.
Further, the arithmetic instruction is used for completing the neural network arithmetic operation. The operation instructions include a matrix operation instruction, a vector operation instruction, and a scalar operation instruction.
Further, the matrix operation instruction performs matrix operations in the neural network, including matrix multiply vector, vector multiply matrix, matrix multiply scalar, outer product, matrix add matrix, and matrix subtract matrix.
Further, the vector operation instruction performs vector operations in the neural network, including vector elementary arithmetic, vector transcendental functions, dot product, random vector generator, and maximum/minimum of a vector. The vector elementary arithmetic includes vector add, subtract, multiply, and divide; a vector transcendental function is a function that does not satisfy any polynomial equation whose coefficients are polynomials, including but not limited to exponential functions, logarithmic functions, trigonometric functions, and inverse trigonometric functions.
Further, the scalar operation instruction performs scalar operations in the neural network, including scalar elementary arithmetic and scalar transcendental functions. The scalar elementary arithmetic includes scalar add, subtract, multiply, and divide; a scalar transcendental function is a function that does not satisfy any polynomial equation whose coefficients are polynomials, including but not limited to exponential functions, logarithmic functions, trigonometric functions, and inverse trigonometric functions.
Further, the logical instructions are used for the logical operations of the neural network. The logical instructions include vector logical operation instructions and scalar logical operation instructions.
Further, the vector logical operation instructions include vector compare, vector logical operations, and vector greater than merge. Vector compare includes, but is not limited to, greater than, less than, equal to, greater than or equal to, less than or equal to, and not equal to. The vector logical operations include AND, OR, and NOT.
Further, the scalar logical operation instructions include scalar compare and scalar logical operations. Scalar compare includes, but is not limited to, greater than, less than, equal to, greater than or equal to, less than or equal to, and not equal to. The scalar logical operations include AND, OR, and NOT.
For a multilayer neural network, the implementation process is as follows. In the forward operation, after execution of the previous layer of the artificial neural network is completed, the operation instruction of the next layer takes the output neurons calculated in the operation unit as the input neurons of the next layer (or performs certain operations on those output neurons before using them as the input neurons of the next layer), and at the same time replaces the weights with the weights of the next layer. In the reverse operation, after the reverse operation of the previous layer of the artificial neural network is completed, the operation instruction of the next layer takes the input neuron gradients calculated in the operation unit as the output neuron gradients of the next layer (or performs certain operations on those input neuron gradients before using them as the output neuron gradients of the next layer), and at the same time replaces the weights with the weights of the next layer. As shown in fig. 5, the dashed arrows indicate the reverse operation, and the solid arrows indicate the forward operation.
In another embodiment, the operation instruction may be a matrix-multiply-matrix instruction, an accumulation instruction, an activation instruction, or another calculation instruction, including forward operation instructions and reverse training instructions.
The following describes a specific calculation method of the computing device shown in fig. 3A by means of a neural network operation instruction. For a neural network operation instruction, the formula that actually needs to be executed may be $s = s(\sum w x_i + b)$; that is, the weight w is multiplied by the input data $x_i$, the products are summed, the bias b is added, and the activation operation s(h) is applied to obtain the final output result s.
The method for executing the neural network forward operation instruction by the computing device shown in fig. 3A may specifically be:
after the conversion unit 13 performs data type conversion on the first input data, the controller unit 11 extracts the neural network forward operation instruction, the operation field corresponding to the neural network operation instruction, and at least one operation code from the instruction cache unit 110; the controller unit 11 transmits the operation field to the data access unit and sends the at least one operation code to the operation unit 12.
The controller unit 11 extracts the weight w and the offset b corresponding to the operation field from the storage unit 10 (when b is 0, the offset b does not need to be extracted) and transmits the weight w and the offset b to the main processing circuit 101 of the arithmetic unit; the controller unit 11 also extracts the input data Xi from the storage unit 10 and transmits the input data Xi to the main processing circuit 101.
The main processing circuit 101 splits the input data Xi into n data blocks;
the instruction processing unit 111 of the controller unit 11 determines a multiplication instruction, an offset instruction, and an accumulation instruction according to the at least one operation code and sends them to the main processing circuit 101. The main processing circuit 101 sends the multiplication instruction and the weight w to the plurality of slave processing circuits 102 in a broadcast manner and distributes the n data blocks to the plurality of slave processing circuits 102 (for example, if there are n slave processing circuits 102, each slave processing circuit 102 is sent one data block). The plurality of slave processing circuits 102 perform a multiplication operation on the weight w and the received data block according to the multiplication instruction to obtain intermediate results and send the intermediate results to the main processing circuit 101. The main processing circuit 101 performs an accumulation operation on the intermediate results sent by the plurality of slave processing circuits 102 according to the accumulation instruction to obtain an accumulation result, applies the offset b to the accumulation result according to the offset instruction to obtain the final result, and sends the final result to the controller unit 11.
In addition, the order of addition and multiplication may be reversed.
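The split, broadcast, multiply, accumulate, and offset sequence can be sketched in software as follows. This is a sequential illustration of circuits that would run in parallel, and it assumes each slave pairs its data block with the matching slice of the broadcast weight; the function names are hypothetical.

```python
def split_into_blocks(inputs, n):
    """Master step: split the input data into at most n contiguous blocks,
    remembering where each block starts. Assumes a non-empty input."""
    size = -(-len(inputs) // n)  # ceiling division
    return [(start, inputs[start:start + size])
            for start in range(0, len(inputs), size)]


def slave_multiply(weights, start, block):
    """Slave step: multiply the block by the matching slice of the broadcast
    weights and return one intermediate result."""
    return sum(weights[start + k] * value for k, value in enumerate(block))


def master_forward(weights, inputs, offset, n):
    """Master step: accumulate the slaves' intermediate results, then add the offset."""
    intermediates = [slave_multiply(weights, start, block)
                     for start, block in split_into_blocks(inputs, n)]
    return sum(intermediates) + offset


final_result = master_forward([1, 2, 3, 4], [1, 1, 1, 1], offset=0.5, n=2)
```

In this example the two slaves return 3 and 7, and the master produces 10.5 after accumulation and offset.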
It should be noted that, the method for executing the neural network reverse training instruction by the computing apparatus is similar to the process for executing the neural network forward operation instruction by the computing apparatus, and specific reference may be made to the above description of the reverse training, and no description is given here.
In this technical solution, the multiplication and offset operations of the neural network are accomplished by a single instruction, namely the neural network operation instruction. Intermediate results of the neural network calculation need not be stored or fetched, which reduces the storage and fetch operations on intermediate data, reduces the corresponding operation steps, and improves the calculation efficiency of the neural network.
The application also discloses a machine learning operation device, which includes one or more of the computing devices mentioned in the application, and is used for acquiring data to be operated on and control information from other processing devices, executing the specified machine learning operation, and transmitting the execution result to peripheral equipment through an I/O interface. Peripheral equipment includes, for example, cameras, displays, mice, keyboards, network cards, Wi-Fi interfaces, and servers. When more than one computing device is included, the computing devices can be linked and transmit data through a specific structure, for example interconnected through a PCIE bus, so as to support larger-scale machine learning operations. In this case, the computing devices may share the same control system or have separate control systems, and may share memory or have a separate memory for each accelerator. In addition, the interconnection mode can be any interconnection topology.
The machine learning arithmetic device has high compatibility and can be connected with various types of servers through PCIE interfaces.
The application also discloses a combined processing device, which includes the machine learning arithmetic device, a universal interconnection interface, and other processing devices. The machine learning arithmetic device interacts with the other processing devices to jointly complete the operation specified by the user. Fig. 6 is a schematic view of the combined processing device.
The other processing devices include one or more types of general-purpose or special-purpose processors, such as central processing units (CPUs), graphics processing units (GPUs), and machine learning processors. The number of processors included in the other processing devices is not limited. The other processing devices serve as the interface between the machine learning arithmetic device and external data and control, including data transfer, and perform basic control of the machine learning arithmetic device such as starting and stopping; the other processing devices may also cooperate with the machine learning arithmetic device to complete computing tasks.
And the universal interconnection interface is used for transmitting data and control instructions between the machine learning arithmetic device and other processing devices. The machine learning arithmetic device acquires required input data from other processing devices and writes the input data into a storage device on the machine learning arithmetic device; control instructions can be obtained from other processing devices and written into a control cache on a machine learning arithmetic device chip; the data in the storage module of the machine learning arithmetic device can also be read and transmitted to other processing devices.
Alternatively, as shown in fig. 7, the configuration may further include a storage device, and the storage device is connected to the machine learning arithmetic device and the other processing device, respectively. The storage device is used for storing data in the machine learning arithmetic device and the other processing device, and is particularly suitable for data which is required to be calculated and cannot be stored in the internal storage of the machine learning arithmetic device or the other processing device.
The combined processing device can be used as the SOC (system on chip) of equipment such as a mobile phone, a robot, an unmanned aerial vehicle, or video monitoring equipment, effectively reducing the core area of the control part, increasing the processing speed, and reducing the overall power consumption. In this case, the universal interconnection interface of the combined processing device is connected to certain components of the equipment, such as a camera, a display, a mouse, a keyboard, a network card, or a Wi-Fi interface.
In one possible embodiment, a distributed system is also claimed, the system comprising n1 main processors and n2 coprocessors, n1 being an integer greater than or equal to 0 and n2 being an integer greater than or equal to 1. The system may have various types of topologies, including but not limited to the topological structure shown in FIG. 3B and the topological structure shown in FIG. 3C.
The main processor sends the input data, the decimal point position of the input data, and the calculation instruction to each of the plurality of coprocessors; or the main processor sends the input data, the decimal point position of the input data, and the calculation instruction to some of the plurality of coprocessors, and those coprocessors forward the input data, the decimal point position of the input data, and the calculation instruction to the other coprocessors. Each coprocessor includes the computing device, and the computing device operates on the input data according to the above method and the calculation instruction to obtain an operation result.
The input data includes, but is not limited to, input neurons, weights, bias data, and the like.
The coprocessor directly sends the operation result to the main processor, or the coprocessor which is not connected with the main processor firstly sends the operation result to the coprocessor which is connected with the main processor, and then the coprocessor sends the received operation result to the main processor.
In some embodiments, a chip is also claimed, which includes the above machine learning arithmetic device or the combined processing device.
In some embodiments, a chip package structure is provided, which includes the above chip.
In some embodiments, a board card is provided, which includes the above chip package structure.
In some embodiments, an electronic device is provided that includes the above board card. Referring to fig. 8, fig. 8 provides a board card that, in addition to the chip 389, may include other supporting components, including but not limited to: a memory device 390, an interface device 391, and a control device 392.
The memory device 390 is connected to the chip in the chip package structure through a bus and is used for storing data. The memory device may include a plurality of groups of memory cells 393. Each group of the memory cells is connected with the chip through a bus. It is understood that each group of the memory cells may be DDR SDRAM (Double Data Rate SDRAM).
DDR doubles the speed of SDRAM without increasing the clock frequency, because it allows data to be read out on both the rising and falling edges of the clock pulse. In one embodiment, the memory device may include 4 groups of the memory cells. Each group of the memory cells may include a plurality of DDR4 chips. In one embodiment, the chip may internally include four 72-bit DDR4 controllers, in which 64 bits of each 72-bit DDR4 controller are used for data transmission and 8 bits are used for ECC checking. It can be understood that when DDR4-3200 chips are used in each group of memory cells, the theoretical bandwidth of data transmission can reach 25600 MB/s.
In one embodiment, each group of the memory cells includes a plurality of double data rate synchronous dynamic random access memories arranged in parallel. DDR can transfer data twice in one clock cycle. A controller for controlling the DDR is provided in the chip and is used to control data transmission and data storage of each memory cell.
The interface device is electrically connected with the chip in the chip package structure. The interface device is used for data transmission between the chip and an external device (such as a server or a computer). For example, in one embodiment, the interface device may be a standard PCIE interface, and the data to be processed is transmitted from the server to the chip through the standard PCIE interface. Preferably, when a PCIE 3.0 x16 interface is used for transmission, the theoretical bandwidth can reach 16000 MB/s. In another embodiment, the interface device may be another type of interface; the present application does not limit the specific form of such other interfaces, as long as the interface unit can implement the transfer function. In addition, the calculation result of the chip is transmitted back to the external device (e.g., a server) by the interface device.
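The two theoretical bandwidth figures quoted for the DDR4 controllers and the PCIE interface can be checked with simple arithmetic. The sketch below uses the nominal rates of 3200 mega-transfers per second for DDR4-3200 and 8 giga-transfers per second per lane for PCIE 3.0; the PCIE figure is the raw rate before line-encoding overhead. The function names are hypothetical.

```python
def ddr_bandwidth_mb_s(mega_transfers_per_s, data_bits):
    """Peak bandwidth of one controller: transfers per second times bytes per transfer."""
    return mega_transfers_per_s * data_bits // 8


def pcie_raw_bandwidth_mb_s(giga_transfers_per_s, lanes):
    """Raw per-direction bandwidth: one bit per transfer per lane, before encoding overhead."""
    return int(giga_transfers_per_s * 1000) * lanes // 8


ddr4_3200 = ddr_bandwidth_mb_s(3200, 64)      # 64 data bits of the 72-bit controller
pcie3_x16 = pcie_raw_bandwidth_mb_s(8, 16)    # 16 lanes at 8 GT/s
```

These evaluate to 25600 MB/s and 16000 MB/s respectively, matching the figures in the text.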
The control device is electrically connected with the chip. The control device is used for monitoring the state of the chip. Specifically, the chip and the control device may be electrically connected through an SPI interface. The control device may include a microcontroller unit (MCU). The chip may include a plurality of processing chips, a plurality of processing cores, or a plurality of processing circuits, and may drive a plurality of loads. Therefore, the chip can be in different working states, such as heavy load and light load. Through the control device, the working states of the plurality of processing chips, processing cores, and/or processing circuits in the chip can be regulated.
The electronic device includes a data processing device, a robot, a computer, a printer, a scanner, a tablet computer, an intelligent terminal, a mobile phone, a vehicle data recorder, a navigator, a sensor, a webcam, a server, a cloud server, a camera, a video camera, a projector, a watch, an earphone, a mobile storage, a wearable device, a vehicle, a household appliance, and/or a medical device.
The vehicle includes an airplane, a ship, and/or a motor vehicle; the household appliance includes a television, an air conditioner, a microwave oven, a refrigerator, an electric cooker, a humidifier, a washing machine, an electric lamp, a gas stove, and a range hood; the medical device includes a nuclear magnetic resonance apparatus, a B-mode ultrasound apparatus, and/or an electrocardiograph.
Referring to Fig. 9, Fig. 9 shows a method for performing machine learning computation according to an embodiment of the present invention. The method includes:
S901, the computing device acquires first input data and a computing instruction.
The first input data includes input neurons and weights.
S902, the computing device parses the computing instruction to obtain a data conversion instruction and a plurality of operation instructions.
The data conversion instruction comprises an operation code and an operation field. The operation code indicates the function of the data conversion instruction, and the operation field comprises a decimal point position, a flag bit indicating the data type of the first input data, and a conversion mode of the data type.
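The fields listed above can be pictured as a small record decoded from an instruction word. The sketch below is only an illustration: the 16-bit width, the bit layout, the mode codes, and all names (`DataConversionInstruction`, `decode`, `ConversionMode`) are hypothetical and are not taken from this document.

```python
from dataclasses import dataclass
from enum import Enum


class ConversionMode(Enum):
    # Hypothetical mode codes; the text only says the field names a conversion mode.
    FLOAT_TO_FIXED = 0
    FIXED_TO_FIXED = 1


@dataclass(frozen=True)
class DataConversionInstruction:
    opcode: int            # identifies the instruction as a data conversion
    point_position: int    # decimal point position (signed)
    input_is_fixed: bool   # flag bit: data type of the first input data
    mode: ConversionMode   # conversion mode of the data type


def decode(word: int) -> DataConversionInstruction:
    """Decode a 16-bit word using an assumed layout:
    [15:12] opcode | [11:4] point position (two's complement) | [3] flag | [2:0] mode.
    """
    opcode = (word >> 12) & 0xF
    raw_pos = (word >> 4) & 0xFF
    point_position = raw_pos - 256 if raw_pos & 0x80 else raw_pos
    input_is_fixed = bool((word >> 3) & 0x1)
    mode = ConversionMode(word & 0x7)
    return DataConversionInstruction(opcode, point_position, input_is_fixed, mode)


# Opcode 0xA, point position -5, floating-point input, float-to-fixed mode.
word = (0xA << 12) | ((-5 & 0xFF) << 4) | (0 << 3) | 0
decoded = decode(word)
print(decoded)
```

Keeping the decimal point position signed lets one field describe both fractional scaling (negative values) and coarse scaling (positive values) under the convention assumed here.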
S903, the computing device converts the first input data into second input data according to the data conversion instruction, wherein the second input data is fixed-point data.
Wherein the converting the first input data into second input data according to the data conversion instruction comprises:
parsing the data conversion instruction to obtain the decimal point position, the flag bit indicating the data type of the first input data, and the conversion mode of the data type;
determining the data type of the first input data according to the flag bit indicating the data type of the first input data;
and converting the first input data into the second input data according to the decimal point position and the conversion mode of the data type, wherein the data type of the second input data is inconsistent with the data type of the first input data.
When the first input data and the second input data are both fixed-point data, the decimal point position of the first input data is inconsistent with that of the second input data.
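The conversion steps above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the patented circuit: it assumes a fixed-point number with decimal point position s stores a signed integer n representing n × 2^s, uses a 16-bit width, and saturates on overflow. The function names are hypothetical.

```python
def float_to_fixed(x: float, point_position: int, bit_width: int = 16) -> int:
    """Quantize x to a signed integer n such that x is approximately n * 2**point_position."""
    lo, hi = -(1 << (bit_width - 1)), (1 << (bit_width - 1)) - 1
    n = round(x / 2.0 ** point_position)
    return max(lo, min(hi, n))  # saturate instead of wrapping (an assumption)


def fixed_to_float(n: int, point_position: int) -> float:
    """Recover the real value represented by the stored integer n."""
    return n * 2.0 ** point_position


def requantize(n: int, old_pos: int, new_pos: int, bit_width: int = 16) -> int:
    """Fixed-to-fixed conversion: same value, different decimal point position."""
    return float_to_fixed(fixed_to_float(n, old_pos), new_pos, bit_width)


n = float_to_fixed(3.14159, -8)   # 8 fractional bits
print(n, fixed_to_float(n, -8))
m = requantize(n, -8, -4)         # move to 4 fractional bits
print(m, fixed_to_float(m, -4))
```

The second call shows the fixed-to-fixed case: the data stays fixed-point, but the decimal point position changes and some precision is lost.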
In a possible embodiment, when the first input data is fixed-point data, the method further comprises:
deriving the decimal point position of one or more intermediate results according to the decimal point position of the first input data, wherein the one or more intermediate results are obtained by operating on the first input data.
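One plausible way to derive the decimal point position of an intermediate result follows directly from the arithmetic. The sketch assumes the convention that a stored integer n with point position s represents n × 2^s; the rules and function names below are illustrative assumptions, since this passage does not spell out the derivation rule itself.

```python
def product_point_position(s_a: int, s_b: int) -> int:
    """(a * 2**s_a) * (b * 2**s_b) = (a * b) * 2**(s_a + s_b)."""
    return s_a + s_b


def sum_point_position(s_a: int, s_b: int) -> int:
    """Addition needs a common scale; choosing the finer one loses no bits."""
    return min(s_a, s_b)


def align(n: int, s_from: int, s_to: int) -> int:
    """Re-express integer n from scale 2**s_from at the finer scale 2**s_to."""
    if s_to > s_from:
        raise ValueError("align only moves to a finer (smaller) point position")
    return n << (s_from - s_to)


# Input neuron 1.5 at position -4, weight 0.75 at position -6.
x, s_x = 24, -4   # 24 * 2**-4 = 1.5
w, s_w = 48, -6   # 48 * 2**-6 = 0.75

s_prod = product_point_position(s_x, s_w)
prod = x * w
print(prod * 2.0 ** s_prod)

s_sum = sum_point_position(s_x, s_w)
total = align(x, s_x, s_sum) + align(w, s_w, s_sum)
print(total * 2.0 ** s_sum)
```

Because the result positions depend only on the input positions, they can be computed before any data is processed.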
S904, the computing device performs computation on the second input data according to the plurality of operation instructions to obtain a result of the computing instruction.
The operation instructions include a forward operation instruction and a reverse training instruction. That is, while executing the forward operation instruction and/or the reverse training instruction (i.e., while performing forward operation and/or reverse training), the computing device may convert the data participating in the operation into fixed-point data according to the embodiment shown in Fig. 9 and perform fixed-point operations.
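As a rough end-to-end illustration of steps S903 and S904 for a single output neuron, the sketch below quantizes input neurons and weights and then performs an integer multiply-accumulate. It assumes the n × 2^s convention, a 16-bit width, and saturation; the helper names are hypothetical and the code does not model the hardware described in this document.

```python
def quantize(values, point_position, bit_width=16):
    """Convert floats to saturated signed integers at the given point position."""
    lo, hi = -(1 << (bit_width - 1)), (1 << (bit_width - 1)) - 1
    return [max(lo, min(hi, round(v / 2.0 ** point_position))) for v in values]


def fixed_point_neuron(neurons, weights, s_in, s_w):
    """Integer multiply-accumulate; returns (accumulator, its point position)."""
    q_in = quantize(neurons, s_in)
    q_w = quantize(weights, s_w)
    acc = sum(a * b for a, b in zip(q_in, q_w))  # Python ints do not overflow
    return acc, s_in + s_w


neurons = [0.5, -1.25, 2.0]
weights = [0.25, 0.5, -0.125]
acc, s_out = fixed_point_neuron(neurons, weights, s_in=-8, s_w=-8)
print(acc * 2.0 ** s_out)                            # fixed-point result
print(sum(a * b for a, b in zip(neurons, weights)))  # floating-point reference
```

The values here are exactly representable with 8 fractional bits, so the fixed-point result matches the floating-point reference; in general a small quantization error remains.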
It should be noted that, for a detailed description of steps S901-S904, reference may be made to the related description of the embodiments shown in Figs. 1-8, which is not repeated here.
It should be noted that, for simplicity of description, the above-mentioned method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present application is not limited by the order of acts described, as some steps may occur in other orders or concurrently depending on the application. Further, those skilled in the art should also appreciate that the embodiments described in the specification are exemplary embodiments and that the acts and modules referred to are not necessarily required in this application.
In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus may be implemented in other manners. The apparatus embodiments described above are merely illustrative. For example, the division of the units is only a division by logical function, and other divisions are possible in an actual implementation: a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not implemented. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices, or units, and may be electrical or in another form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit may be implemented in the form of hardware, or may be implemented in the form of a software program module.
The integrated unit, if implemented in the form of a software program module and sold or used as a stand-alone product, may be stored in a computer-readable memory. Based on this understanding, the technical solution of the present application, in essence, or the part of it that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The software product is stored in a memory and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned memory includes: a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, an optical disk, or any other medium capable of storing program code.
Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by associated hardware instructed by a program, which may be stored in a computer-readable memory, which may include: flash disk, ROM, RAM, magnetic or optical disk, and the like.
The foregoing detailed description of the embodiments of the present application has been presented to illustrate the principles and implementations of the present application, and the above description of the embodiments is only provided to help understand the method and the core concept of the present application; meanwhile, for a person skilled in the art, according to the idea of the present application, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present application.

Claims (21)

1. A computing device, characterized in that the computing device comprises: a storage unit, an operation unit, a controller unit, and a conversion unit, the operation unit comprising a derivation unit; the controller unit is configured to obtain a configuration instruction before the operation unit performs an operation, the operation field of the configuration instruction comprising a decimal point position and a data type participating in the operation, and to parse the configuration instruction to obtain the decimal point position and the data type participating in the operation, or to obtain the decimal point position and the data type participating in the operation from the storage unit; the controller unit is further configured to obtain input data and determine whether the data type of the input data is consistent with the data type participating in the operation, and, when it is determined that the data type of the input data is inconsistent with the data type participating in the operation, to transmit the input data, the decimal point position, and the data type participating in the operation to the conversion unit; the conversion unit performs data type conversion on the input data according to the decimal point position and the data type participating in the operation to obtain converted input data, the data type of the converted input data being consistent with the data type participating in the operation; the derivation unit is configured to, when the input data is fixed-point data, derive the decimal point position of one or more intermediate results according to the decimal point position of the input data, wherein the one or more intermediate results are obtained by operating on the input data; the derivation unit is further configured to, when an intermediate result obtained by the operation exceeds the range indicated by its corresponding decimal point position, shift the decimal point position of the intermediate result to the left by M bits, so that the precision of the intermediate result lies within the precision range indicated by the decimal point position of the intermediate result, M being an integer greater than 0.
2. The device according to claim 1, characterized in that the controller unit obtaining the configuration instruction before the operation unit performs the operation is specifically obtaining the configuration instruction before the operation unit performs the operation of the i-th layer of a multi-layer neural network.
3. The device according to claim 2, characterized in that the computing device is configured to perform machine learning computation; the controller unit is further configured to transmit the converted input data to the operation unit, and, when the data type of the input data is consistent with the data type participating in the operation, to transmit the input data to the operation unit; the operation unit is configured to perform an operation on the converted input data or the input data to obtain an operation result.
4. The device according to claim 3, characterized in that the machine learning computation comprises an artificial neural network operation; the input data comprises input neuron data and weight data; and the operation result is output neuron data.
5. The device according to claim 4, characterized in that the operation unit comprises one master processing circuit and a plurality of slave processing circuits; the master processing circuit is configured to perform preprocessing on the input data or the converted input data and to transmit data to and from the plurality of slave processing circuits; the plurality of slave processing circuits are configured to perform intermediate operations according to the input data or the converted input data transmitted from the master processing circuit to obtain a plurality of intermediate results, and to transmit the plurality of intermediate results to the master processing circuit; the master processing circuit is configured to perform subsequent processing on the plurality of intermediate results to obtain the operation result.
6. The device according to claim 5, characterized in that the computing device further comprises a direct memory access (DMA) unit, and the storage unit comprises any combination of a register and a cache; the cache is configured to store the input data, wherein the cache comprises a scratchpad cache; the register is configured to store scalar data in the input data; the DMA unit is configured to read data from the storage unit or store data to the storage unit.
7. The device according to any one of claims 3-6, characterized in that the operation unit further comprises: a data cache unit configured to cache the one or more intermediate results.
8. The device according to claim 5 or 6, characterized in that the operation unit comprises a tree module, the tree module comprising one root port and a plurality of branch ports; the root port of the tree module is connected to the master processing circuit, and the plurality of branch ports of the tree module are each connected to one of the plurality of slave processing circuits; the tree module is configured to forward data and operation instructions between the master processing circuit and the plurality of slave processing circuits; wherein the tree module has an n-ary tree structure, n being an integer greater than or equal to 2.
9. The device according to claim 5 or 6, characterized in that the operation unit further comprises a branch processing circuit; the master processing circuit is specifically configured to determine that the input neurons are broadcast data and the weights are distribution data, to divide one piece of distribution data into a plurality of data blocks, and to send at least one of the plurality of data blocks, the broadcast data, and at least one of a plurality of operation instructions to the branch processing circuit; the branch processing circuit is configured to forward the data blocks, the broadcast data, and the operation instructions between the master processing circuit and the plurality of slave processing circuits; the plurality of slave processing circuits are configured to perform operations on the received data blocks and broadcast data according to the operation instruction to obtain intermediate results, and to transmit the intermediate results to the branch processing circuit; the master processing circuit is further configured to perform subsequent processing on the intermediate results sent by the branch processing circuit to obtain the result of the operation instruction, and to send the result of the operation instruction to the controller unit.
10. The device according to claim 5 or 6, characterized in that the plurality of slave processing circuits are arranged in an array; each slave processing circuit is connected to the adjacent slave processing circuits, and the master processing circuit is connected to K slave processing circuits among the plurality of slave processing circuits, the K slave processing circuits being: the n slave processing circuits in the first row, the n slave processing circuits in the m-th row, and the m slave processing circuits in the first column; the K slave processing circuits are configured to forward data and instructions between the master processing circuit and the plurality of slave processing circuits; the master processing circuit is further configured to determine that the input neurons are broadcast data and the weights are distribution data, to divide one piece of distribution data into a plurality of data blocks, and to send at least one of the plurality of data blocks and at least one of a plurality of operation instructions to the K slave processing circuits; the K slave processing circuits are configured to convert data between the master processing circuit and the plurality of slave processing circuits; the plurality of slave processing circuits are configured to perform operations on the received data blocks according to the operation instruction to obtain intermediate results, and to transmit the operation results to the K slave processing circuits; the master processing circuit is configured to process the intermediate results sent by the K slave processing circuits to obtain the result of the operation instruction, and to send the result of the operation instruction to the controller unit.
11. The device according to claim 10, characterized in that the master processing circuit is specifically configured to combine and sort the intermediate results sent by the plurality of processing circuits to obtain the result of the operation instruction; or the master processing circuit is specifically configured to combine and sort the intermediate results sent by the plurality of processing circuits and apply activation processing to obtain the result of the operation instruction.
12. The device according to claim 10, characterized in that the master processing circuit comprises one or any combination of an activation processing circuit and an addition processing circuit; the activation processing circuit is configured to perform an activation operation on data in the master processing circuit; the addition processing circuit is configured to perform an addition operation or an accumulation operation; the slave processing circuit comprises: a multiplication processing circuit configured to perform a product operation on the received data block to obtain a product result; and an accumulation processing circuit configured to perform an accumulation operation on the product result to obtain the intermediate result.
13. A machine learning operation device, characterized in that the machine learning operation device comprises one or more computing devices according to any one of claims 2-12, and is configured to obtain data to be operated on and control information from other processing devices, to perform a specified machine learning operation, and to pass the execution result to other processing devices through an I/O interface; when the machine learning operation device comprises a plurality of the computing devices, the plurality of computing devices can be connected through a specific structure and transmit data; wherein the plurality of computing devices are interconnected and transmit data through a PCIE (Peripheral Component Interconnect Express) bus to support larger-scale machine learning operations; the plurality of computing devices share the same control system or have their own control systems; the plurality of computing devices share memory or have their own memories; and the interconnection mode of the plurality of computing devices is any interconnection topology.
14. A combined processing device, characterized in that the combined processing device comprises the machine learning operation device according to claim 13, a universal interconnection interface, a storage device, and other processing devices; the machine learning operation device interacts with the other processing devices to jointly complete a computing operation specified by a user; the storage device is connected to the machine learning operation device and the other processing devices respectively, and is configured to store data of the machine learning operation device and the other processing devices.
15. A neural network chip, characterized in that the neural network chip comprises the machine learning operation device according to claim 13 or the combined processing device according to claim 14.
16. An electronic device, characterized in that the electronic device comprises the chip according to claim 15.
17. A board card, characterized in that the board card comprises: a memory device, an interface device, a control device, and the neural network chip according to claim 15; wherein the neural network chip is connected to the memory device, the control device, and the interface device respectively; the memory device is configured to store data; the interface device is configured to implement data transmission between the chip and an external device; the control device is configured to monitor the state of the chip; wherein the memory device comprises a plurality of groups of storage units, each group of storage units is connected to the chip through a bus, and the storage units are DDR SDRAM; the chip comprises a DDR controller configured to control data transmission to and data storage in each storage unit; and the interface device is a standard PCIE interface.
18. A computing method, characterized by comprising: a controller unit obtaining a configuration instruction before an operation is performed, the operation field of the configuration instruction comprising a decimal point position and a data type participating in the operation, and parsing the configuration instruction to obtain the decimal point position and the data type participating in the operation, or directly obtaining the decimal point position and the data type participating in the operation; obtaining input data and determining whether the data type of the input data is consistent with the data type participating in the operation; when it is determined that the data type of the input data is inconsistent with the data type participating in the operation, a conversion unit performing data type conversion on the input data according to the decimal point position and the data type participating in the operation to obtain converted input data, the data type of the converted input data being consistent with the data type participating in the operation; a derivation unit, when the input data is fixed-point data, deriving the decimal point position of one or more intermediate results according to the decimal point position of the input data, wherein the one or more intermediate results are obtained by operating on the input data; the derivation unit, when an intermediate result obtained by the operation exceeds the range indicated by its corresponding decimal point position, shifting the decimal point position of the intermediate result to the left by M bits, so that the precision of the intermediate result lies within the precision range indicated by the decimal point position of the intermediate result, M being an integer greater than 0.
19. The method according to claim 18, characterized in that the controller unit obtaining the configuration instruction before the operation is performed is specifically obtaining the configuration instruction before the operation of the i-th layer of a multi-layer neural network model is performed.
20. The method according to claim 19, characterized in that the computing method is a method for performing machine learning computation, and the method further comprises: an operation unit performing an operation on the converted input data to obtain an operation result.
21. The method according to claim 20, characterized in that the machine learning computation comprises an artificial neural network operation; the input data comprises input neurons and weights; and the operation result is output neurons.
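The master/slave division of work recited in claims 5, 9 and 12 can be illustrated in software. The sketch below is a loose analogy under stated assumptions, not the claimed circuitry: weights are treated as distribution data split into blocks of rows, input neurons as broadcast data sent to every slave, and the master concatenates the intermediate results in block order before an optional activation. All function names are hypothetical.

```python
def split_blocks(rows, num_slaves):
    """Split the distribution data (weight rows) into contiguous blocks, one per slave."""
    size = max(1, -(-len(rows) // num_slaves))  # ceiling division, at least 1
    return [rows[i:i + size] for i in range(0, len(rows), size)]


def slave_compute(weight_block, broadcast_neurons):
    """Multiply, then accumulate: one intermediate result per weight row."""
    return [sum(w * x for w, x in zip(row, broadcast_neurons)) for row in weight_block]


def master_compute(weights, neurons, num_slaves, activation=None):
    """Distribute weight blocks, broadcast neurons, then combine the partial results."""
    blocks = split_blocks(weights, num_slaves)              # distribution data
    partials = [slave_compute(b, neurons) for b in blocks]  # neurons are broadcast
    combined = [y for part in partials for y in part]       # combine in block order
    return [activation(y) for y in combined] if activation else combined


weights = [[1, 2], [3, 4], [5, 6]]
neurons = [10, 1]
plain = master_compute(weights, neurons, num_slaves=2)
activated = master_compute(weights, neurons, num_slaves=2, activation=lambda y: max(0, y))
print(plain, activated)
```

Splitting by weight rows keeps each slave's work independent, so the master only needs to order and post-process the results.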
Application CN201910195627.XA | Priority date 2018-02-13 | Filing date 2018-09-03 | A computing device and method | Active | CN110163357B (en)

Applications Claiming Priority (5)

Application Number | Priority Date | Filing Date | Title
CN201810149287.2A (CN110163350B (en)) | 2018-02-13 | 2018-02-13 | A computing device and method
CN2018101492872 | 2018-02-13 | |
CN201810207915.8A (CN110276447B (en)) | 2018-03-14 | 2018-03-14 | Computing device and method
CN2018102079158 | 2018-03-14 | |
CN201880002628.1A (CN110383300B (en)) | 2018-02-13 | 2018-09-03 | A computing device and method

Related Parent Applications (1)

Application Number | Title | Priority Date | Filing Date
CN201880002628.1A (Division, CN110383300B (en)) | A computing device and method | 2018-02-13 | 2018-09-03

Publications (2)

Publication Number | Publication Date
CN110163357A (en) | 2019-08-23
CN110163357B (en) | 2021-06-25

Family

ID=67638324

Family Applications (11)

Application Number | Title | Priority Date | Filing Date
CN201910195598.7A (Active, CN110163354B (en)) | Computing device and method | 2018-02-13 | 2018-09-03
CN201910195899.XA (Active, CN110163363B (en)) | Computing device and method | 2018-02-13 | 2018-09-03
CN201910195599.1A (Active, CN110163355B (en)) | A computing device and method | 2018-02-13 | 2018-09-03
CN201910195535.1A (Active, CN110163353B (en)) | Computing device and method | 2018-02-13 | 2018-09-03
CN201910195819.0A (Active, CN110163360B (en)) | A computing device and method | 2018-02-13 | 2018-09-03
CN201910195898.5A (Active, CN110163362B (en)) | A computing device and method | 2018-02-13 | 2018-09-03
CN201910195627.XA (Active, CN110163357B (en)) | A computing device and method | 2018-02-13 | 2018-09-03
CN201910195816.7A (Active, CN110163358B (en)) | Computing device and method | 2018-02-13 | 2018-09-03
CN201910195820.3A (Active, CN110163361B (en)) | A computing device and method | 2018-02-13 | 2018-09-03
CN201910195818.6A (Active, CN110163359B (en)) | Computing device and method | 2018-02-13 | 2018-09-03
CN201910195600.0A (Active, CN110163356B (en)) | Computing device and method | 2018-02-13 | 2018-09-03

Country Status (1)

Country | Link
CN (11) | CN110163354B (en)

Families Citing this family (20)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
US11620130B2 (en) | 2018-02-13 | 2023-04-04 | Shanghai Cambricon Information Technology Co., Ltd | Computing device and method
US11609760B2 (en) * | 2018-02-13 | 2023-03-21 | Shanghai Cambricon Information Technology Co., Ltd | Computing device and method
CN110597756B (en) * | 2019-08-26 | 2023-07-25 | 光子算数(北京)科技有限责任公司 | Calculation circuit and data operation method
CN112446496B (en) * | 2019-08-28 | 2025-09-12 | 上海寒武纪信息科技有限公司 | Method, device and related product for processing data
CN112446460A (en) * | 2019-08-28 | 2021-03-05 | 上海寒武纪信息科技有限公司 | Method, apparatus and related product for processing data
CN112445524A (en) * | 2019-09-02 | 2021-03-05 | 中科寒武纪科技股份有限公司 | Data processing method, related device and computer readable medium
WO2021081854A1 (en) * | 2019-10-30 | 2021-05-06 | 华为技术有限公司 | Convolution operation circuit and convolution operation method
CN112765537B (en) * | 2019-11-01 | 2024-08-23 | 中科寒武纪科技股份有限公司 | Data processing method, device, computer equipment and storage medium
CN112667241B (en) * | 2019-11-08 | 2023-09-29 | 安徽寒武纪信息科技有限公司 | Machine learning instruction conversion method and device, board card, main board and electronic equipment
CN110929862B (en) * | 2019-11-26 | 2023-08-01 | 陈子祺 | Fixed-point neural network model quantization device and method
KR102800376B1 (en) * | 2019-12-17 | 2025-04-25 | 에스케이하이닉스 주식회사 | Data Processing System and accelerating DEVICE therefor
CN113190209B (en) * | 2020-01-14 | 2025-01-10 | 中科寒武纪科技股份有限公司 | A computing device and a computing method
CN113408717B (en) * | 2020-03-17 | 2025-09-09 | 安徽寒武纪信息科技有限公司 | Computing device, method, board card and computer readable storage medium
CN113867792A (en) * | 2020-06-30 | 2021-12-31 | 上海寒武纪信息科技有限公司 | Computing device, integrated circuit chip, board card, electronic equipment and computing method
CN113867790A (en) * | 2020-06-30 | 2021-12-31 | 上海寒武纪信息科技有限公司 | Computing device, integrated circuit chip, board and computing method
CN113867789A (en) * | 2020-06-30 | 2021-12-31 | 上海寒武纪信息科技有限公司 | Computing device, integrated circuit chip, board card, electronic device and computing method
CN111767024A (en) * | 2020-07-09 | 2020-10-13 | 北京猿力未来科技有限公司 | A solution method and device for simple operation
CN114282160A (en) * | 2020-09-27 | 2022-04-05 | 中科寒武纪科技股份有限公司 | A data processing device, integrated circuit chip, equipment and method for realizing the same
WO2022062682A1 (en) * | 2020-09-27 | 2022-03-31 | 中科寒武纪科技股份有限公司 | Data processing device, integrated circuit chip, device, and implementation method therefor
CN114648438A (en) * | 2020-12-17 | 2022-06-21 | 安徽寒武纪信息科技有限公司 | Apparatus, method, and readable storage medium for processing image data

Citations (3)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN103988170A (en) * | 2011-12-07 | 2014-08-13 | Arm有限公司 | Apparatus and method for rounding a floating-point value to an integral floating-point value
CN104572011A (en) * | 2014-12-22 | 2015-04-29 | 上海交通大学 | FPGA (Field Programmable Gate Array)-based general matrix fixed-point multiplier and calculation method thereof
CN104679719A (en) * | 2015-03-17 | 2015-06-03 | 成都金本华科技股份有限公司 | Floating point calculation method based on FPGA

Family Cites Families (37)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
US6650327B1 (en)* | 1998-06-16 | 2003-11-18 | Silicon Graphics, Inc. | Display system having floating point rasterization and floating point framebuffering
US6834293B2 (en)* | 2001-06-15 | 2004-12-21 | Hitachi, Ltd. | Vector scaling system for G.728 annex G
CN100410871C (en)* | 2003-07-23 | 2008-08-13 | 联发科技股份有限公司 | Digital signal processor using jump floating point number operation method
US7432925B2 (en)* | 2003-11-21 | 2008-10-07 | International Business Machines Corporation | Techniques for representing 3D scenes using fixed point data
CN1658153B (en)* | 2004-02-18 | 2010-04-28 | 联发科技股份有限公司 | Composite dynamic fixed point number representation method and operation method and its processor structure
CN100340972C (en)* | 2005-06-07 | 2007-10-03 | 北京北方烽火科技有限公司 | Method for implementing logarithm computation by field programmable gate array in digital auto-gain control
JP4976798B2 (en)* | 2006-09-28 | 2012-07-18 | 株式会社東芝 | Two-degree-of-freedom position control method, two-degree-of-freedom position control device, and medium storage device
CN101231632A (en)* | 2007-11-20 | 2008-07-30 | 西安电子科技大学 | The Method of Using FPGA to Process Floating Point FFT
CN101183873B (en)* | 2007-12-11 | 2011-09-28 | 广州中珩电子科技有限公司 | BP neural network based embedded system data compression/decompression method
CN101510149B (en)* | 2009-03-16 | 2011-05-04 | 炬力集成电路设计有限公司 | Method and apparatus for processing data
CN101754039A (en)* | 2009-12-22 | 2010-06-23 | 中国科学技术大学 | Three-dimensional parameter decoding system for mobile devices
CN102981854A (en)* | 2012-11-16 | 2013-03-20 | 天津市天祥世联网络科技有限公司 | Neural network optimization method based on floating number operation inline function library
CN103019647B (en)* | 2012-11-28 | 2015-06-24 | 中国人民解放军国防科学技术大学 | Floating-point accumulation/gradual decrease operational method with floating-point precision maintaining function
CN103455983A (en)* | 2013-08-30 | 2013-12-18 | 深圳市川大智胜科技发展有限公司 | Image disturbance eliminating method in embedded type video system
US20170061279A1 (en)* | 2015-01-14 | 2017-03-02 | Intel Corporation | Updating an artificial neural network using flexible fixed point representation
CN104679720A (en)* | 2015-03-17 | 2015-06-03 | 成都金本华科技股份有限公司 | Operation method for FFT
CN105094744B (en)* | 2015-07-28 | 2018-01-16 | 成都腾悦科技有限公司 | A kind of variable floating data microprocessor
US9977116B2 (en)* | 2015-10-05 | 2018-05-22 | Analog Devices, Inc. | Scaling fixed-point fast Fourier transforms in radar and sonar applications
CN106355246B (en)* | 2015-10-08 | 2019-02-15 | 上海兆芯集成电路有限公司 | Three Configuration Neural Network Units
CN105426344A (en)* | 2015-11-09 | 2016-03-23 | 南京大学 | Matrix calculation method of distributed large-scale matrix multiplication based on Spark
CN109993285B (en)* | 2016-01-20 | 2020-02-07 | 中科寒武纪科技股份有限公司 | Apparatus and method for performing artificial neural network forward operations
CN107578099B (en)* | 2016-01-20 | 2021-06-11 | 中科寒武纪科技股份有限公司 | Computing device and method
CN109358900B (en)* | 2016-04-15 | 2020-07-03 | 中科寒武纪科技股份有限公司 | Artificial neural network forward operation device and method supporting discrete data representation
CN107315575B (en)* | 2016-04-26 | 2020-07-31 | 中科寒武纪科技股份有限公司 | Device and method for executing vector merging operation
CN111176608B (en)* | 2016-04-26 | 2025-03-11 | 中科寒武纪科技股份有限公司 | A device and method for performing vector comparison operations
CN111651199B (en)* | 2016-04-26 | 2023-11-17 | 中科寒武纪科技股份有限公司 | Apparatus and method for performing vector cyclic shift operation
CN111860811B (en)* | 2016-04-27 | 2024-01-16 | 中科寒武纪科技股份有限公司 | Device and method for executing full-connection layer forward operation of artificial neural network
CN110188870B (en)* | 2016-04-27 | 2021-10-12 | 中科寒武纪科技股份有限公司 | Apparatus and method for performing artificial neural network self-learning operation
CN107340993B (en)* | 2016-04-28 | 2021-07-16 | 中科寒武纪科技股份有限公司 | Computing device and method
CN107330515A (en)* | 2016-04-29 | 2017-11-07 | 北京中科寒武纪科技有限公司 | A device and method for performing forward operation of artificial neural network
CN111310904B (en)* | 2016-04-29 | 2024-03-08 | 中科寒武纪科技股份有限公司 | A device and method for performing convolutional neural network training
CN106502626A (en)* | 2016-11-03 | 2017-03-15 | 北京百度网讯科技有限公司 | Data processing method and device
CN106708780A (en)* | 2016-12-12 | 2017-05-24 | 中国航空工业集团公司西安航空计算技术研究所 | Low complexity branch processing circuit of uniform dyeing array towards SIMT framework
CN106775599B (en)* | 2017-01-09 | 2019-03-01 | 南京工业大学 | Multi-computing-unit coarse-grained reconfigurable system and method for recurrent neural network
CN107292334A (en)* | 2017-06-08 | 2017-10-24 | 北京深瞐科技有限公司 | Image-recognizing method and device
CN107807819B (en)* | 2017-07-20 | 2021-06-25 | 上海寒武纪信息科技有限公司 | A device and method for performing forward operation of artificial neural network supporting discrete data representation
CN107451658B (en)* | 2017-07-24 | 2020-12-15 | 杭州菲数科技有限公司 | Fixed-point method and system for floating-point operation

Also Published As

Publication number | Publication date
CN110163363B (en) | 2021-05-11
CN110163359B (en) | 2020-12-11
CN110163353B (en) | 2021-05-11
CN110163362A (en) | 2019-08-23
CN110163354A (en) | 2019-08-23
CN110163353A (en) | 2019-08-23
CN110163360B (en) | 2021-06-25
CN110163357A (en) | 2019-08-23
CN110163358B (en) | 2021-01-05
CN110163358A (en) | 2019-08-23
CN110163355A (en) | 2019-08-23
CN110163362B (en) | 2020-12-11
CN110163356B (en) | 2020-10-09
CN110163359A (en) | 2019-08-23
CN110163354B (en) | 2020-10-09
CN110163361B (en) | 2021-06-25
CN110163355B (en) | 2020-10-09
CN110163361A (en) | 2019-08-23
CN110163356A (en) | 2019-08-23
CN110163363A (en) | 2019-08-23
CN110163360A (en) | 2019-08-23

Similar Documents

Publication | Publication Date | Title
CN110163357B (en) | | A computing device and method
TWI827432B (en) | | Computing apparatus, machine learning computing apparatus, combined processing apparatus, neural network chip, electronic device, board, and computing method
CN110163350B (en) | | A computing device and method
CN110276447B (en) | | Computing device and method
CN111488976B (en) | | Neural network computing device, neural network computing method and related products
CN111045728A (en) | | A computing device and related products
CN111488963B (en) | | Neural network computing device and method
CN111930681A (en) | | Computing device and related product
CN111047024B (en) | | Computing device and related products
CN111368987B (en) | | Neural network computing device and method
CN111368985B (en) | | A neural network computing device and method

Legal Events

Date | Code | Title | Description
 | PB01 | Publication |
 | SE01 | Entry into force of request for substantive examination |
 | GR01 | Patent grant |
