Processing unit, computing device, and computational graph processing method for deep learning model

Info

Publication number: CN113705799B
Authority: CN (China)
Prior art keywords: operator, attribute data, join, operators, deep learning
Legal status: Active
Application number: CN202010435630.7A
Other languages: Chinese (zh)
Other versions: CN113705799A (en)
Inventors: 董俊, 尹莉, 陈琳
Current Assignee: Pingtouge Shanghai Semiconductor Co Ltd
Original Assignee: Pingtouge Shanghai Semiconductor Co Ltd
Application filed by Pingtouge Shanghai Semiconductor Co Ltd
Priority to CN202010435630.7A
Publication of CN113705799A
Application granted
Publication of CN113705799B

Abstract

Translated from Chinese:

The present invention discloses a processing unit, a computing device, and a computational graph processing method for a deep learning model. The method includes: converting a computational graph of a first deep learning framework into an intermediate expression that conforms to what the acceleration unit follows; performing model processing on the intermediate expression; determining a first operator that is contained in the processed intermediate expression and has not been registered in the first deep learning framework, wherein the processed intermediate expression characterizes an operator by an operator identifier, attribute data, and connection relationships with other operators; and converting the processed intermediate expression back into the computational graph of the first deep learning framework, which includes: replacing the first operator with a join operator, constructing the connection relationships between the join operator and other operators according to the connection relationships between the first operator and other operators, and then replacing the operator identifier and attribute data of the join operator with the operator identifier and attribute data of the first operator. Embodiments of the present disclosure use a single join operator to solve the conversion problem of multiple operators in the computational graph that are not defined and registered under the original framework.

Description

Processing unit, computing device, and computational graph processing method for deep learning model
Technical Field
The present disclosure relates to the field of chips, and in particular to a processing unit, a computing device, and a computational graph processing method for a deep learning model.
Background
The current mainstream deep learning frameworks include the TensorFlow framework, the MXNet framework, the Caffe framework, and so on. These frameworks each define their own sets of operators, and even functionally similar operators may have different attributes in different frameworks. Operators are the fundamental units of operation in a deep learning model, for example the convolution operator. In TensorFlow, the padding attribute of the convolution operator is "VALID" or "SAME", where "VALID" means that only valid convolutions are performed and boundary data is not processed, and "SAME" means that the convolution results at the boundary are retained; in MXNet and Caffe, however, the padding attribute of the convolution operator is a two-dimensional array, and the user can specify the values to be padded in the horizontal and vertical directions. Thus, two operators with the same identifier may have different attributes, so that the two operators are not identical. An operator is defined by its identifier and its attributes.
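As an illustration of this difference (not part of the claimed method), the sketch below builds the same 3×3 convolution with TensorFlow's string-valued padding attribute and MXNet's two-dimensional pad attribute; the tensor shapes are hypothetical and the exact API signatures may vary between framework versions:

```python
import tensorflow as tf
import mxnet as mx

# TensorFlow: the padding attribute is the string "SAME" or "VALID".
x_tf = tf.random.normal([1, 32, 32, 16])      # NHWC input (hypothetical shape)
w_tf = tf.random.normal([3, 3, 16, 32])       # 3x3 kernel, 16 -> 32 channels
y_tf = tf.nn.conv2d(x_tf, w_tf, strides=1, padding="SAME")

# MXNet: the padding attribute is a per-dimension tuple; pad=(1, 1) pads one
# pixel vertically and horizontally, which mimics "SAME" for a 3x3 kernel.
x_mx = mx.sym.Variable("data")
y_mx = mx.sym.Convolution(data=x_mx, kernel=(3, 3), pad=(1, 1),
                          num_filter=32, no_bias=True)
```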
As is known to those skilled in the art, before a deep learning model can run on an acceleration unit, it must be processed by a corresponding processing unit and compiled into an acceleration-unit model that the instruction set of the acceleration unit can support. Such model processing includes operator merging, quantization, and the like. Operator merging combines operators inside the deep learning model, for example merging multiple operators into a single operator. Quantization converts parameters such as weights in the deep learning model, as well as the inputs to the deep learning model, from a high-precision data type to a low-precision data type. These model processes are all performed under a framework or instruction set that the acceleration unit can support; they only support the attributes and attribute values of each operator under the framework of the acceleration unit, and do not support the attributes and attribute values of each operator under the original deep learning framework of the deep learning model. Therefore, the deep learning model under the original deep learning framework must first be converted into an intermediate expression that the acceleration unit can support, model processing is then performed under the framework supported by the acceleration unit, and finally the model-processed intermediate expression is converted back into the computational graph of the original framework.
Converting the model-processed intermediate expression back into the computational graph of the original framework means converting each operator represented in the intermediate expression into an operator that the original framework can recognize. In the computational graph, an operator is characterized by its operator identifier, its attribute data, and its connection relationships with other operators, so the operator identifier, attribute data, and connection relationships of each converted operator must conform to the definitions of the original framework. For this reason, operators contained in the intermediate expression but not defined under the original framework would have to be defined and registered under the original framework. Defining each such operator separately under the original framework means that developers need to develop and maintain multiple new operators under the original framework, and the maintenance cost is high.
Disclosure of Invention
Based on this, an object of the present disclosure is to provide a processing unit, a computing device and a computation graph processing method of a deep learning model, so as to solve the problems in the prior art.
In a first aspect, embodiments of the present disclosure provide a processing unit, comprising:
An instruction fetch unit for retrieving computer instructions from a memory external to the processing unit;
an instruction decoding unit for decoding the retrieved computer instructions;
an instruction execution unit, configured to execute the decoded computer instruction to implement:
converting a computational graph of a deep learning model for the first deep learning framework into an intermediate representation conforming to the acceleration unit;
performing model processing on the intermediate expression;
Determining a first operator which is contained in the processed intermediate expression and is not registered in the first deep learning framework, wherein the processed intermediate expression characterizes the operator by an operator identifier, attribute data and connection relation with other operators;
converting the processed intermediate representation back to a computational graph of the first deep learning framework, comprising:
and in the processed intermediate expression, replacing the first operator by a connection operator, constructing the connection relation between the connection operator and other operators according to the connection relation between the first operator and other operators, and replacing the operator identification and attribute data of the connection operator by the operator identification and attribute data of the first operator.
Optionally, the operator identification and attribute data of the first operator are stored in the attribute data of the join operator, and the replacing the operator identification and attribute data of the join operator with the operator identification and attribute data of the first operator includes:
And reading the operator identification and the attribute data of the first operator from the attribute data of the connection operator, and respectively replacing the operator identification and the attribute data of the connection operator.
Optionally, the step of replacing the first operator with a join operator is repeated until all first operators in the processed intermediate representation are replaced, and then the step of replacing the operator identity and attribute data of the join operator with the operator identity and attribute data of the first operator is repeated until all join operators in the processed intermediate representation are replaced.
Optionally, the model processing includes at least one of operator merging, quantization, graph cutting, model pruning.
Optionally, the instruction execution unit further implements:
Converting said processed intermediate representation into a json file before said replacing said first operator with a join operator, and
After the replacing the operator identification and attribute data of the join operator with the operator identification and attribute data of the first operator, converting the json file into a format of a computational graph followed by the first deep learning framework.
Optionally, the first deep learning framework is the MXNet framework, and the join operator is an operator registered under the MXNet framework.
Optionally, the constructor of the join operator specifies other operators having a join relationship with the first operator through input and/or output parameters, so that the constructor of the join operator can construct the join relationship of the join operator with other operators according to the join relationship of the first operator with other operators.
Optionally, the constructor of the join operator further stores the input tensor number and the output tensor number of the first operator in the attribute data of the join operator.
Optionally, the converting the computational graph into the intermediate representation conforming to the acceleration unit comprises converting attribute data of at least one operator of the computational graph into attribute data of a corresponding operator defined by the intermediate representation by a mapping function.
Optionally, determining the first operator included in the processed intermediate expression and not registered in the first deep learning framework comprises comparing an operator identifier of each operator in the processed intermediate expression with an operator identifier of a registered operator under the first deep learning framework to determine the first operator.
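A minimal sketch of this check, under the assumption that the processed intermediate expression is held as a list of operator records and that the operator identifiers registered under the first framework are available as a set (all names are hypothetical and only illustrate the comparison described above):

```python
from dataclasses import dataclass, field

@dataclass
class OpNode:
    op_id: str                                   # operator identifier
    attrs: dict = field(default_factory=dict)    # attribute data
    inputs: list = field(default_factory=list)   # connection relationships

def find_unregistered(ir_ops, registered_ids):
    """Return the operators of the processed intermediate expression whose
    identifiers are not registered under the first deep learning framework."""
    return [op for op in ir_ops if op.op_id not in registered_ids]

# Operators whose identifier is absent from the framework's registry are the
# "first operators" that will later be replaced by the join operator.
first_ops = find_unregistered(
    ir_ops=[OpNode("Conv"), OpNode("FusedGelu")],       # hypothetical IR
    registered_ids={"Conv", "Pooling", "Activation"})   # hypothetical registry
```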
In a second aspect, embodiments of the present disclosure provide a computing device comprising a memory and a processing unit as described in any one of the above.
Optionally, the computing device is a server or a terminal device.
In a third aspect, an embodiment of the present disclosure provides a computation graph processing method of a deep learning model, including:
converting a computational graph of a deep learning model for the first deep learning framework into an intermediate representation conforming to the acceleration unit;
performing model processing on the intermediate expression;
Determining a first operator which is contained in the processed intermediate expression and is not registered in the first deep learning framework, wherein the processed intermediate expression characterizes the operator by an operator identifier, attribute data and connection relation with other operators;
converting the processed intermediate representation back to a computational graph of the first deep learning framework, comprising:
and in the processed intermediate expression, replacing the first operator by a connection operator, constructing the connection relation between the connection operator and other operators according to the connection relation between the first operator and other operators, and replacing the operator identification and attribute data of the connection operator by the operator identification and attribute data of the first operator.
Optionally, the operator identification and attribute data of the first operator are stored in the attribute data of the join operator, and the replacing the operator identification and attribute data of the join operator with the operator identification and attribute data of the first operator includes:
And reading the operator identification and the attribute data of the first operator from the attribute data of the connection operator, and respectively replacing the operator identification and the attribute data of the connection operator.
Optionally, the step of replacing the first operator with a join operator is repeated until all first operators in the processed intermediate representation are replaced, and then the step of replacing the operator identity and attribute data of the join operator with the operator identity and attribute data of the first operator is repeated until all join operators in the processed intermediate representation are replaced.
Optionally, the model processing includes at least one of operator merging, quantization, graph cutting, model pruning.
Optionally, the method further comprises:
Converting said processed intermediate representation into a json file before said replacing said first operator with a join operator, and
After the replacing the operator identification and attribute data of the join operator with the operator identification and attribute data of the first operator, converting the json file into a format of a computational graph followed by the first deep learning framework.
Optionally, the first deep learning framework is the MXNet framework, and the join operator is an operator registered under the MXNet framework.
Optionally, the constructor of the join operator specifies other operators having a join relationship with the first operator through input and/or output parameters, so that the constructor of the join operator can construct the join relationship of the join operator with other operators according to the join relationship of the first operator with other operators.
Optionally, the constructor of the join operator further stores the input tensor number and the output tensor number of the first operator in the attribute data of the join operator.
In a fourth aspect, embodiments of the present disclosure provide a data center including the computing device described above as a server.
Unlike the prior art, in which every operator contained in the intermediate expression but not defined under the original framework must be defined under the original framework, the embodiments of the present disclosure use only one operator, namely the join operator, to replace all operators contained in the intermediate expression that are not defined and registered under the original framework, so the maintenance cost of new operators is greatly reduced. The join operator has the same connection relationships with other operators as the operator it replaces, but its attribute data and operator identifier differ from those of the replaced operator; therefore the attribute data and operator identifier of the join operator must be replaced with the attribute data and operator identifier of the operator contained in the intermediate expression but not defined under the original framework. According to the embodiments of the present disclosure, a developer only needs to define and register one join operator to solve the conversion problem of multiple operators in the computational graph that are not defined and registered under the original framework, thereby avoiding the development and maintenance of multiple new operators under the original framework.
Drawings
The above and other objects, features and advantages of the present disclosure will become more apparent by describing embodiments thereof with reference to the following drawings in which:
FIG. 1 illustrates a hierarchical structure diagram of a data center to which one embodiment of the present disclosure is applied;
FIG. 2 is a block diagram of a data center to which one embodiment of the present disclosure is applied;
FIG. 3 is a block diagram of the internal architecture of one server in a data center of one embodiment of the present disclosure;
FIG. 4 is a diagram of the control relationship between a central processing unit (CPU) and a neural network acceleration unit inside a server according to one embodiment of the present disclosure;
FIG. 5 is an internal block diagram of an acceleration cell core according to one embodiment of the present disclosure;
FIG. 6 is a software architecture diagram of a hierarchical design;
FIG. 7 is an example diagram of a computational graph conversion;
FIG. 8 is a partial flow chart of a computational graph processing method provided by an embodiment of the present disclosure;
FIG. 9 is a partial flow chart of a computational graph processing method provided by another embodiment of the present disclosure;
FIG. 10 is a schematic diagram of operator join relationships.
Detailed Description
The present disclosure is described below based on embodiments, but the present disclosure is not limited to only these embodiments. In the following detailed description of the present disclosure, certain specific details are set forth in detail. The present disclosure may be fully understood by one skilled in the art without a description of these details. Well-known methods, procedures, and flows have not been described in detail so as not to obscure the nature of the disclosure. The figures are not necessarily drawn to scale.
The following terms are used herein.
The acceleration unit is a processing unit designed to increase data processing speed in special-purpose fields (such as image processing or the various operations of neural network processing), addressing the fact that general-purpose processors are inefficient in these fields. It is often used together with a general-purpose CPU, is controlled by the general-purpose processor to perform the special-purpose or domain-specific processing, and thereby improves the computer's processing efficiency in that special purpose or field. It may also be referred to as an AI processing unit and may include a graphics processing unit (GPU), a field programmable gate array (FPGA), an application-specific integrated circuit (ASIC), and dedicated AI acceleration hardware (such as the neural network acceleration unit described herein).
On-chip memory, which is memory that is used alone in the primary core or secondary core and cannot be shared.
The command processor serves as a command interface between the acceleration unit and the central processing unit that drives the acceleration unit to work. The command processor receives the instructions that the central processing unit sends for the acceleration unit to execute and distributes them to the cores within the acceleration unit for execution. In addition, it is responsible for synchronizing the individual cores of the acceleration unit.
Lifecycle: an operand is not involved throughout an entire instruction sequence; the portion of the instruction sequence between the instruction in which it first appears and the instruction in which it is last used is the lifecycle of the operand. That is, after its lifecycle ends, it is no longer used and need not remain in on-chip memory.
A neural network generally refers to an artificial neural network (Artificial Neural Network, ANN), an algorithmic network that simulates the behavioral characteristics of animal neural networks and performs distributed parallel information processing. A classical neural network, which is also the simplest neural network structure, comprises three layers: an input layer, an output layer, and an intermediate layer (also called a hidden layer). The input layer, the output layer, and the intermediate layer each comprise a plurality of nodes.
In a neural network, the nodes are mathematized to produce mathematical models of the nodes, and the large number of node mathematical models in the neural network together constitute the neural network model.
Deep learning model: the concept of deep learning derives from the study of neural networks, and the neural networks concerned are also referred to as deep learning networks. Thus, in this sense, the deep learning model is also a neural network model. Both deep learning models and neural network models must be produced through training. Sample data is input into a designed network structure (i.e., the network structure has been determined), feature information is extracted through the multiple intermediate layers, and the weight parameters of the neurons are continuously corrected based on the output of the output layer so that the output tends toward a preset result, until the final weight parameters are determined. The trained deep learning model can then be applied in real scenarios, while data about how the deep learning model is used in those scenarios can be collected and used in turn to optimize the model.
A node is the minimum unit of independent operation in the deep learning model; it receives input and produces output after computation with its own weight parameters or with other parameters of the model (such as hyperparameters). The deep learning model may include various specific operations such as convolution and pooling, and correspondingly various operation nodes such as convolution nodes and pooling nodes. The deep learning model has multiple layers, each layer has multiple nodes, and the output of each node is the input of a node of the next layer. Specifically, a node includes the program for its specific operation and the related data. For example, a convolution operation node includes the program code used for the convolution operation and some data used in the convolution.
An operator refers to a set of operations built in a deep learning model to implement a particular function. Each layer of the deep learning model may contain multiple such operators. An operator may be called an operation in the TensorFlow framework and a layer in the Caffe framework. An operator can be regarded as a further abstraction on the basis of nodes; one operator may correspond to one or more nodes. Thus, operators and nodes sometimes characterize the same program code.
Instruction set: the set of instructions supported internally for operation; it mainly supports operations for deep learning operators such as Convolution, Pooling, ROI, and the like.
Quantization: converting the inputs of the operation nodes, the weight parameters of the operation nodes, and other parameters in the deep learning model from a high-precision data type to a low-precision data type, thereby reducing the required data throughput and storage space.
Inverse quantization (dequantization): the inverse process of quantization, i.e., converting the inputs of the operation nodes, and the weight parameters and other parameters of the operation nodes in the deep learning model, from a low-precision data type to a high-precision data type.
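As a rough illustration of these two definitions (a minimal sketch of symmetric linear quantization, not the scheme actually used by the acceleration unit):

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Quantization: float32 values -> int8 values plus a scale factor."""
    max_abs = np.abs(x).max()
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    """Inverse quantization: int8 values plus scale -> approximate float32."""
    return q.astype(np.float32) * scale

weights = np.random.randn(4, 4).astype(np.float32)   # hypothetical weights
q, s = quantize_int8(weights)
restored = dequantize_int8(q, s)                      # close to the original
```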
Intermediate expression (IR): since deep learning models differ in format depending on the model framework on which they depend, they can be classified into formats such as TensorFlow, PyTorch, and MXNet, and the code expressions of these deep learning models also differ. This presents great difficulty for the generality of deep learning model quantization. An intermediate expression is an expression into which deep learning model code in various formats is converted so as to conform to a single expression followed by one or more acceleration units. The meaning of each code statement in the deep learning model is analyzed, and the statement is translated into a general expression form according to its meaning, so that code statements with the same meaning in different deep learning models have the same expression in the intermediate expression. Tool products that convert the expressions of different deep learning models into intermediate expressions already exist.
Computational graph (computation graph): current deep learning frameworks mainly support two programming modes, declarative programming and imperative programming. In declarative programming, the program code first defines a neural network model structure that describes the computational logic but is not executed immediately; it is executed only when the program code that invokes the neural network model structure runs. The neural network model structure includes multiple operators (or symbolic representations of operators) and the connections between them, and can be represented graphically, so this structure is called a static computational graph. In imperative programming, the program code directly returns the result of the operation, and the definition and execution of the neural network model structure are synchronized. In general, static graphs facilitate optimization of the overall neural network model, which is more beneficial for performance, whereas dynamic graphs are very convenient for users to debug specific programs.
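A small illustration of the two programming modes, using MXNet purely as an example (other frameworks expose analogous APIs; the layer sizes are arbitrary):

```python
import mxnet as mx

# Declarative programming: the structure (a static computational graph) is
# defined first and is executed only when it is later bound and run.
data = mx.sym.Variable("data")
fc   = mx.sym.FullyConnected(data=data, num_hidden=10)
net  = mx.sym.SoftmaxOutput(data=fc, name="softmax")   # nothing computed yet

# Imperative programming: each statement returns its result immediately.
x = mx.nd.ones((2, 3))
y = x * 2 + 1            # computed as soon as this line executes
```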
Fig. 1 illustrates a hierarchical structure diagram of a data center as one scenario to which embodiments of the present disclosure are applied.
A data center is a globally coordinated network of specific devices used to transmit, accelerate, display, compute, and store data information over the Internet network infrastructure. In future development, data centers will also become an asset in enterprise competition. With the widespread use of data centers, artificial intelligence and the like are increasingly applied in data centers. As an important artificial intelligence technology, neural networks have been widely applied in the big data analysis operations of data centers.
In a conventional large data center, the network architecture is typically a three-layer architecture as shown in FIG. 1, namely a hierarchical interconnection network model (HIERARCHICAL INTER-networking model). This model contains the following three layers:
The access layer (ACCESS LAYER) 103, sometimes referred to as an edge layer, includes an access switch 130 and servers 140 to which the access switch is connected. Each server 140 is a processing and storage entity of a data center in which the processing and storage of large amounts of data is accomplished by these servers 140. The access switch 130 is a switch used to access these servers to the data center. An access switch 130 accesses a plurality of servers 140. The access switches 130 are typically located at the Top of the Rack, so they are also referred to as Top of Rack switches, which physically connect to the servers.
The aggregation layer (Aggregation Layer), sometimes referred to as the distribution layer, includes an aggregation switch 120. Each aggregation switch 120 connects multiple access switches while providing other services such as firewall, intrusion detection, network analysis, etc.
Core Layer (Core Layer) 101 includes Core switch 110. Core switch 110 provides high speed forwarding of packets into and out of the data center and provides connectivity for multiple convergence layers. The network of the entire data center is divided into an L3 layer routing network and an L2 layer routing network, and the core switch 110 provides a flexible L3 layer routing network for the network of the entire data center in general.
Typically, the aggregation switch 120 is a demarcation point for L2 and L3 layer routing networks, below the aggregation switch 120 is an L2 network, above is an L3 network. Each group of aggregation switches manages one transport point (POD, point Of Delivery), within each of which is a separate VLAN network. The server migration within the POD does not have to modify the IP address and default gateway because one POD corresponds to one L2 broadcast domain.
Spanning Tree Protocol (STP) is typically used between the aggregation switch 120 and the access switch 130. STP makes only one aggregation layer switch 120 available for a given VLAN network, and the other aggregation layer switches 120 are used only when a failure occurs (dashed lines in the figure above). That is, at the aggregation layer, there is no horizontal expansion, since only one aggregation switch 120 is working even if several are added.
Fig. 2 illustrates the physical connection of the components in the tiered data center of fig. 1. As shown in fig. 2, one core switch 110 is connected to a plurality of aggregation switches 120, one aggregation switch 120 is connected to a plurality of access switches 130, and one access switch 130 accesses a plurality of servers 140.
Server device
Since the server 140 is the real execution entity of the data center, FIG. 3 shows a block diagram of the internal structure of the server 140. The server 140 includes a memory 210, a central processing unit (CPU) 220, and various acceleration units connected by a bus. These acceleration units include an embedded neural network processor, i.e., the acceleration unit 230, a data transmission unit (DTU) 260, a graphics processing unit (GPU, not shown), an application-specific integrated circuit (ASIC, not shown), and a field programmable gate array (FPGA, not shown).
The architecture of traditional processors devotes a large part of the design to the control unit and the storage unit, while the space devoted to computing units is insufficient, so traditional processors are very effective at logic control but not efficient at massively parallel computation. Therefore, various specialized acceleration units have been developed to process computation in different functions and different fields more efficiently and to increase computation speed. The acceleration unit proposed by the present disclosure may be any one of them; these acceleration units are described below in turn.
The acceleration unit 230 is a processing unit that uses a data-driven parallel computing architecture for processing the large number of operations (e.g., convolution, pooling, etc.) of each neural network node. Because the data and intermediate results of these operations are closely interrelated throughout the computation and are used frequently, and because the memory capacity inside a CPU core is small under existing CPU architectures, a large number of frequent accesses to memory outside the core are required, resulting in low processing efficiency. With the acceleration unit, each core has an on-chip memory with a storage capacity suited to neural network computation, which avoids frequent accesses to memory outside the core, greatly improves processing efficiency, and improves computational performance.
A Data Transmission Unit (DTU) 260 is a wireless terminal device dedicated to converting serial data into IP data or converting IP data into serial data for transmission through a wireless communication network. The main function of the DTU is to transmit data from the remote device wirelessly back to the background center. At the front end, the DTU and the customer's device are connected through an interface. The DTU is firstly registered to a mobile GPRS network after power-on operation, and then socket connection is established with a background center arranged in the DTU. The background center is used as a service end of socket connection, and the DTU is a client end of socket connection. Thus, the DTU and the background software cooperate together, and after the connection is established, the front-end device and the center of the background can perform wireless data transmission through the DTU.
A graphics processing unit (GPU) is a microprocessor that specializes in image- and graphics-related operations. The GPU addresses the shortcoming that computing units occupy too little space in the CPU: it adopts a large number of computing units dedicated to graphics computation, reduces the graphics card's dependence on the CPU, and takes over some of the computation-intensive graphics and image processing work originally borne by the CPU.
An Application Specific Integrated Circuit (ASIC) refers to an integrated circuit that is designed and manufactured to meet the needs of a particular user and a particular electronic system. Because such integrated circuits are custom-built to the requirements of the user, their structure is often tailored to the specific user requirements.
A field programmable gate array (FPGA) is a product developed further on the basis of programmable devices such as PAL and GAL. As a semi-custom circuit in the field of application-specific integrated circuits (ASICs), it not only remedies the shortcomings of fully custom circuits but also overcomes the limitation on the number of gate circuits in earlier programmable devices.
Although the acceleration unit has the advantage of significantly higher execution efficiency than a conventional processor for a particular application or field, it is also under the control of the processing unit 220. Taking an acceleration unit dedicated to deep learning models as an example, various deep learning models, including the neurons of these models, the weight data of the neurons, and so on, are stored in the memory 210. These deep learning models are deployed to the acceleration unit 230 by the processing unit 220 in FIG. 3 when needed. Specifically, the processing unit 220 may inform the acceleration unit 230, in the form of instructions, of the storage locations of the deep learning model in the memory 210. The acceleration unit 230 may then address based on these locations and store the instructions to be executed in its on-chip memory. The processing unit 220 may also send the instructions to be executed by the acceleration unit 230 to the acceleration unit 230 in the form of instructions, and the acceleration unit 230 receives the instructions and stores them in its on-chip memory. Similarly, the acceleration unit 230 may acquire input data in the above manner. The acceleration unit 230 acquires the instructions to be executed and the input data to perform inference computation. The weight parameters of the nodes may be included in the instruction sequence of the deep learning model and fetched from the memory 210 by the acceleration unit 230. Of course, the weight parameters of the nodes may also be stored independently and fetched from the memory 210 by the acceleration unit 230 when needed. The processing unit 220 is understood here to be a hardware unit with scheduling and control capabilities, and may generally be a central processing unit (CPU), a microcontroller, a microprocessor, or the like.
Internal structure of processing unit and acceleration unit 230
The following illustrates how the processing unit controls the operation of the acceleration unit in conjunction with the internal block diagram of the processing unit and the acceleration unit 230 of fig. 4.
As shown in FIG. 4, the processing unit 220 includes a plurality of processor cores 222 and a cache 221 shared by the plurality of processor cores 222. Each processor core 222 includes an instruction fetch unit 223, an instruction decode unit 224, an instruction issue unit 225, and an instruction execution unit 226.
Instruction fetch unit 223 is configured to transfer instructions to be executed from memory 210 into an instruction register (which may be one of register files 229 shown in fig. 4 for storing instructions) and to receive a next fetch address or to obtain a next fetch address based on a fetch algorithm, e.g., comprising incrementing or decrementing the address based on the instruction length.
After fetching an instruction, the processing unit 220 enters an instruction decode stage, and the instruction decode unit 224 decodes the fetched instruction according to a predetermined instruction format to obtain the operand fetch information required by the fetched instruction, in preparation for the operation of the instruction execution unit 226. Operand fetch information refers, for example, to an immediate, a register, or other software/hardware capable of providing a source operand.
An instruction issue unit 225 is located between the instruction decode unit 224 and the instruction execution unit 226 for scheduling and control of instructions to efficiently distribute individual instructions to the different instruction execution units 226, enabling parallel operation of multiple instructions.
After instruction issue unit 225 issues instructions to instruction execution unit 226, instruction execution unit 226 begins executing instructions. But if the instruction execution unit 226 determines that the instruction should be executed by an acceleration unit, it forwards it to the corresponding acceleration unit for execution. For example, if the instruction is an instruction for neural network reasoning (inference), the instruction execution unit 226 no longer executes the instruction, but instead sends the instruction over the bus to the acceleration unit 230 for execution by the acceleration unit 230.
The acceleration unit 230 includes a plurality of cores 236 within it (4 cores are shown in fig. 4, but those skilled in the art will appreciate that other numbers of cores 236 may be included in the acceleration unit 230), a command processor 237, a direct memory access mechanism 235, and a bus channel 231.
The bus channel 231 is a channel in which instructions enter and exit the acceleration unit 230 from the bus. Bus lanes 231 may include PCIE lanes 232, I2C lanes 233, JTAG lanes 234, according to different mechanisms.
PCIE, PCI-Express, is a high-speed serial computer expansion bus standard proposed by Intel in 2001 and is intended to replace the old PCI, PCI-X and AGP bus standards. PCIE belongs to high-speed serial point-to-point dual-channel high-bandwidth transmission, and connected equipment allocates exclusive channel bandwidth without sharing bus bandwidth and mainly supports functions of active power management, error reporting, end-to-end reliability transmission, hot plug, service quality and the like. Its main advantage is high data transmission speed and considerable development potential. At present, most of the PCIE buses are PCIE GEN3, but PCIE GEN4, that is, a bus channel conforming to the PCI-express4.0 standard may also be used in the embodiments of the present disclosure.
The I2C channel 233 is a simple, bi-directional two-wire synchronous serial bus channel developed by Philips corporation. It requires only two wires to transfer information between devices connected to the bus.
JTAG is an acronym for Joint test action group (Joint Test Action Group) and is a common name in IEEE Standard 1149.1, entitled Standard test Access Port and boundary Scan architecture. This standard is used to verify the functionality of the printed circuit board produced by the design and test. JTAG was formally standardized by IEEE 1149.1-1990, and in 1994, supplementary documents were added to describe the Boundary Scan Description Language (BSDL). From then on, this standard is widely adopted by electronic enterprises worldwide. Boundary scan is almost synonymous with JTAG. JTAG channels 234 are bus channels that conform to this standard.
Direct memory access (DMA, direct Memory Access) mechanism 235 is a function provided by some computer bus architecture that enables data to be written directly from an additional device (e.g., external memory) into the on-chip memory of acceleration unit 230. This greatly improves the efficiency of data access over all data transfer between devices through command processor 237. Because of such a mechanism, the core of the acceleration unit 230 can directly access the memory 210, read parameters (such as weight parameters of each node) in the deep learning model, and the like, thereby greatly improving the data access efficiency. Although the direct memory access mechanism 235 is shown as being located between the processor 237 and the bus channel 231, the design of the acceleration unit 230 is not limited thereto. In some hardware designs, each core 236 may include a direct memory access mechanism 235 such that the cores 236 do not need to read data directly from the attached device via the command processor 237 and write to the on-chip memory of the acceleration unit 230.
The command processor 237 distributes the instructions sent by the processing unit 220 to the acceleration unit 230 among the cores 236 for execution. The instruction execution unit 226 either sends the instructions that need to be executed by the acceleration unit 230 to the acceleration unit 230, or informs the acceleration unit 230 of the storage location on the memory 210 of the instructions to be executed. After the sequence of instructions to be executed has entered through the bus channel 231, it is buffered in the command processor 237, and the command processor 237 selects cores 236 and allocates the instructions to them for execution. The instructions to be executed come from a compiled deep learning model. It should be appreciated that the sequence of instructions to be executed may include both instructions to be executed in the processing unit 220 and instructions that need to be executed in the acceleration unit 230.
Acceleration cell core
Fig. 5 is an internal structural diagram of an acceleration unit core according to one embodiment of the present disclosure.
In one embodiment, as shown in FIG. 5, the core 236 includes a tensor engine 310, a pooling engine 320, a convolution processing unit 330, an activation operation unit 380, a sequencer 350, an instruction buffer 340, an on-chip memory 360, and a constant buffer 370. The tensor engine 310, the pooling engine 320, the convolution processing unit 330, and the activation operation unit 380 are all hardware execution units. A hardware execution unit is a hardware module actually used to execute various operations. Still other hardware execution units are not shown in the figure.
The instruction sequence assigned to the core 236 by the command processor 237 first enters the instruction buffer 340 for buffering. The sequencer 350 then fetches instructions from the instruction buffer 340 in first-in, first-out order and assigns them to the individual hardware execution units for execution according to the nature of each instruction. The tensor engine 310 is responsible for tensor-related operations in the deep learning model. The pooling engine 320 is responsible for pooling operations in the deep learning model. The convolution processing unit 330 is responsible for convolution operations in the deep learning model. The activation operation unit 380 performs the operations corresponding to the activation functions in the deep learning model. The sequencer 350 decides which hardware execution unit to assign each fetched instruction to according to the nature of its operation, whether it is a convolution, a matrix multiplication, or a pooling operation.
The on-chip memory 360 is an in-core memory that stores the weight parameters in the deep learning model, as well as the inputs and various intermediate results when the deep learning model is actually used. The constant buffer 370 is a buffer that stores constant parameters in the deep learning model other than the weight parameters (e.g., hyperparameters). As described above, during the process in which the processing unit 220 configures the deep learning model in the acceleration unit 230 in advance, the processing unit 220 may send the locations in the memory 210 of the parameters in the model to the acceleration unit 230 in the form of instructions. These parameters include the weights of the nodes and other parameters (e.g., hyperparameters). For the weights, the acceleration unit 230 fetches them from the corresponding locations in the memory 210 and places them in the on-chip memory 360 as needed. For the other parameters, the acceleration unit 230 fetches them from the corresponding locations in the memory 210 and places them in the constant buffer 370 as needed. In addition, when the command processor 237 distributes executable instructions to the cores 236 for execution, the input parameters in the instructions (the inputs to the deep learning model) are also stored in the on-chip memory 360. Furthermore, when the tensor engine 310 and the pooling engine 320 perform convolution or pooling operations, the various intermediate results obtained are also stored in the on-chip memory 360.
Software architecture diagram
Improvement of deep learning models requires not only the support of the hardware layer described above but also continuous improvement at the software layer and the algorithm layer. Only by combining the underlying hardware support with the deep learning algorithm structures above it can a powerful compute engine be delivered.
FIG. 6 is a software architecture diagram of a hierarchical design. Hierarchical software design is the dominant design approach for large software projects. It reduces the dependencies between layers, so that a developer can focus on a single layer of the overall structure, and the implementation of an existing layer can easily be replaced with new program code.
As shown in the figure, the software architecture diagram includes, from top to bottom, an application layer 401, a framework layer 402, and a functional layer.
The application layer 401 comprises applications of the deep learning model in specific scenarios, such as vision 405, natural language 406, and recommendation 407. These applications are built using this architecture, and the architecture can also be invoked within an application to provide a runtime interface so that the application obtains inference capability.
The framework layer 402 integrates various deep learning frameworks such as TensorFlow 408, MXNet 409, and Caffe 410, and provides operator libraries and tools so that various algorithms can continue to be optimized and improved. TensorFlow 408 is a symbolic mathematical system based on dataflow programming that is widely used in the programming implementation of various machine learning algorithms. MXNet 409 is the deep learning library chosen by Amazon. Caffe 410, in full Convolutional Architecture for Fast Feature Embedding, is a deep learning framework characterized by expressiveness, speed, and modularity.
The functional layer includes a compilation stack 403 and a runtime stack 404. The compilation stack 403 is used to transform (converter 411), quantize (quantization 412), optimize (optimization 413), and compile (compilation 414) the various models. The transformation 411 converts the internal data of a model into an intermediate representation (IR) format. Quantization 412 converts parameters such as weights in the deep learning model, as well as the inputs to the deep learning model, from a high-precision data type to a low-precision data type. Optimization 413 performs operations such as the fusion of operators inside the model and the optimization of model links. Compilation 414 optimizes the model according to the hardware (e.g., a neural network processor) and generates a binary model that the hardware can recognize. The runtime stack 404 includes a runtime API 415, an execution manager 416, a user-mode driver 417, and a kernel-mode driver 418. The execution manager 416 performs batch scheduling of the resources allocated for execution. The runtime API 415 provides various interfaces that can be invoked at runtime. The user-mode driver 417 provides hardware commands and resource scheduling for the kernel mode. The kernel-mode driver 418 provides task scheduling, hardware control, and the like in kernel mode.
Multiple mainstream deep learning models can be integrated into one open-source platform, so that a developer can develop, compile, and run multiple deep learning models on a single open-source platform and avoid having to deploy and maintain multiple model frameworks. Moreover, the open-source platform can be extended to support more deep learning models.
Computational graph conversion
The computational graph conversion may convert a computational graph of one specification to a computational graph of another specification. The difference in specifications means that different deep learning frameworks have different expressions for the computational graph. The deep learning framework is used to support an integrated environment for computational graph compilation and execution. When a computational graph of one framework needs to be put into another framework process, computational graph transformations are required.
In particular, a large number of acceleration units with different instruction set architectures are currently emerging. To be able to deploy a computational graph under a specific deep learning framework onto an acceleration unit with a specific instruction set architecture for execution, adaptation and optimization processes need to be performed on the computational graph for that acceleration unit, and intermediate expressions were created for this purpose. An intermediate expression is a computational graph expression defined in terms of the instruction set architecture of the specified acceleration unit. The computational graphs under different frameworks are first converted into intermediate expressions, and developers then focus on model processing of the intermediate expressions for the specified acceleration unit. Meanwhile, the processed intermediate expression sometimes also needs to be returned to the original framework to continue training and model improvement, so the optimized intermediate expression must be converted back into a computational graph under that framework. This process can be summarized as: computational graph of the original framework -> intermediate expression -> model processing of the intermediate expression -> conversion of the model-processed intermediate expression back into the computational graph of the original framework. This process may be performed repeatedly.
The computational graph conversion is described below based on the software architecture diagram above. The framework layer 402 may provide computational graphs of deep learning models under various frameworks to the compilation stack 403 or the application layer 401. After receiving a computational graph, the compilation stack 403 needs to convert it into an intermediate expression and then perform processing such as optimization and quantization on the intermediate expression. The intermediate expression after optimization, quantization, and other processing may be deployed to the designated acceleration unit via the runtime stack 404. In this process, all the computational graphs generated are static computational graphs.
With continued reference to FIG. 7: as shown in the figure, the various deep learning frameworks support models A to M, and a first computational graph 701 of one particular deep learning framework is converted into an intermediate expression 702. The intermediate expression 702 predefines a number of operators and their attributes. The operators and attributes of the intermediate expression 702 are defined according to the instruction set of the specified acceleration unit. Converting the first computational graph 701 into the intermediate expression 702 includes converting each operator and its attributes in the first computational graph 701 into the corresponding operator and attributes defined by the intermediate expression 702. The conversion may be accomplished by means of the transformation 411. The transformation 411 defines mapping functions that implement operator conversion. A mapping function converts between operators that are functionally identical but have different attributes. The developer knows the specific function and the attribute definitions of each operator and predefines mapping functions for operators with the same function but different attributes. Referring to the operator mapping table shown in Table 1, the left column is the operator identifier of the first operator and the right column is the name of the mapping function.
Table 1
Of course, embodiments of the present disclosure are not limited to having to use a mapping function for operator attribute conversion, but may use other methods to accomplish operator attribute conversion.
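As one possible illustration of such a mapping function (the attribute names, default kernel size, and stride handling are assumptions made for this sketch, not the table entries of the embodiment), the following converts the TensorFlow-style string padding attribute of a convolution operator into an explicit two-dimensional pad attribute:

```python
def map_conv_padding(src_attrs: dict) -> dict:
    """Mapping function for a convolution operator: convert the "SAME"/"VALID"
    padding attribute into an explicit (pad_h, pad_w) attribute."""
    kh, kw = src_attrs.get("kernel", (3, 3))
    if src_attrs["padding"] == "SAME":
        pad = (kh // 2, kw // 2)     # keeps the spatial size for stride 1
    else:                            # "VALID": boundary data is not padded
        pad = (0, 0)
    return {"kernel": (kh, kw), "pad": pad}

# e.g. {"kernel": (3, 3), "padding": "SAME"} -> {"kernel": (3, 3), "pad": (1, 1)}
```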
Model processing is then performed on the intermediate expression 702, including quantization, graph cutting, model pruning, operator merging, and the like; after the model processing is complete, the intermediate expression is compiled and a second computational graph 703 is output. The quantization process inserts quantization nodes and inverse quantization nodes into the intermediate expression 702. Operator merging combines two or more operators into one operator according to the hardware. Graph cutting divides the intermediate expression into several sub-graphs to facilitate reading and processing by the acceleration unit. Model pruning is a model compression method that introduces sparsity into the dense connections of the deep learning model and reduces the number of non-zero weights by directly zeroing out "unimportant" weights. The second computational graph 703 can be deployed for execution on the designated acceleration unit.
Finally, if desired, the second computational graph 703 is converted back into a computational graph of the particular deep learning framework; specifically, this includes converting all the operators and their attributes in the second computational graph 703 into operators and attributes defined by that deep learning framework. This conversion can also be achieved with mapping functions. Of course, this step may also be implemented in other ways; for example, the attribute data of the native attributes of each operator may be extracted from the first computational graph 701, and, for the second computational graph obtained after model processing, the current attributes of some operators may be replaced with the stored native attributes, which converts the operator attributes more conveniently.
The following is an example. The first computational graph 701 includes operators A, B, C, D, E, and F, which are converted into operators A', B', C', D', E', and F' of the intermediate expression 702, whose attributes are changed accordingly. The computational graph containing A', B', C', D', E', and F' is provided to the compiler, which performs various processes such as quantization, graph cutting, operator merging, and model pruning to obtain the second computational graph 703, which contains A', H', G'(B'C'D'), E', and F', where H' is a newly added operator that may be a quantization or dequantization operator, and G' is a merged operator of B', C', and D' that has the combined functions of B', C', and D'. When the second computational graph 703 needs to be converted back, A', E', and F' are mapped back to A, E, and F, G' is mapped to G'', and the final computational graph is A, H', G'', E, F. Typically, A', B', C', D', E', and F' have the same operator identifiers as A, B, C, D, E, and F, or satisfy some preset specification.
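The example above can be pictured with simple data structures; the sketch below is purely illustrative and only lists the operators of each graph:

```python
# First computational graph 701 under the original framework.
graph_701 = ["A", "B", "C", "D", "E", "F"]

# Intermediate expression 702: same operators, attributes rewritten to the
# definitions of the acceleration unit's instruction set.
ir_702 = ["A'", "B'", "C'", "D'", "E'", "F'"]

# Second computational graph 703 after model processing: H' is a newly
# inserted (de)quantization operator, G' merges B', C' and D'.
graph_703 = ["A'", "H'", "G'", "E'", "F'"]

# Converted back to the original framework: registered operators map back
# directly, while operators such as H' and G' that are not registered there
# are handled by the join-operator mechanism described below.
final_graph = ["A", "H'", "G''", "E", "F"]
```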
In addition, since the computation graph contains individual operators and their connection relationships, it is also necessary to ensure that the connection relationships between operators are correct after the operator conversion, or that the connection relationships between operators can be directly reconstructed.
FIG. 8 is a partial flow chart of a computational graph processing method according to an embodiment of the present disclosure. As shown in fig. 8, the computational graph processing method includes the following steps.
Step S801 sequentially reads operators from the computational graph. The computational graph in this step is the processed generic intermediate expression, i.e., the second computational graph 703 in FIG. 7.
Step S802 determines whether the specified framework has a registered operator with the same operator identifier as the current operator. If yes, step S804 is performed; if no, step S803 is performed. The operator identifier of the current operator is compared with the operator identifiers of all operators registered under the specified framework to determine whether a registered operator with the same operator identifier exists. As long as the operator identifiers are the same, regardless of whether the attribute data of the two operators are the same, it is considered that a registered operator with the same operator identifier exists under the specified framework.
Step S803 constructs a join operator and replaces the current operator with it, where the join operator has the same connection relationship as the current operator. The connection relationship of an operator can be intuitively understood through the example shown in fig. 10: as shown in the figure, the connection relationship of operator D is three inputs A-C and two outputs. When the join operator is constructed, input and output information is acquired from the current operator, and the connection relationship of the join operator is constructed according to this information. At the same time, when the join operator is constructed, the operator identifier and attribute data of the current operator are stored as the attribute data of the join operator.
Step S804 calls the mapping function to convert the attribute data of the current operator.
Step S805 determines whether all operators in the computational graph have been processed. If yes, step S806 is performed, otherwise step S801 is performed.
According to the above steps, in the case where the specified framework does not support the current operator (i.e., the specified framework does not define the current operator), step S803 is invoked to construct a join operator to replace the current operator. The join operator is an operator registered under the specified framework. In a static computational graph, each operator is a segment of character expression; the character expression specifies the operator identifier, the attribute data, and the connection relationships with other operators of that operator. In this context, when a certain operator in a computational graph is referred to, what is actually referred to is the character expression in the computational graph that characterizes that operator. Likewise, the character expression of the join operator specifies its own operator identifier, attribute data, and connection relationships. The character expression of the join operator is generated by a constructor of the join operator. The constructor of the join operator designates, through input and/or output parameters, the other operators that have a connection relationship with the current operator, so that the constructor can construct the connection relationships between the join operator and those operators according to the connection relationships between the current operator and those operators; at the same time, the constructor also stores the operator identifier and attribute data of the current operator as the attribute data of the join operator. Finally, the constructor of the join operator is called to generate the character expression of the join operator and replace the character expression of the current operator in the computational graph.
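As a rough illustration of the constructor just described, the following Python sketch builds a join-operator node from an unregistered operator; the dict-based node layout and attribute field names are assumptions for this sketch, since the real character expression is framework-specific.

```python
# Sketch of a join-operator constructor: keep the same connections, stash the
# original identifier and attributes inside the join operator's own "attrs".
def build_join_operator(current_op):
    return {
        "op": "connection",                 # operator registered under the framework
        "name": current_op["name"],
        "inputs": current_op["inputs"],     # same connection relationship
        "attrs": {
            "input number": str(len(current_op["inputs"])),
            "output number": str(current_op.get("num_outputs", 1)),
            "op name": current_op["op"],               # original operator identifier
            "op attribute": str(current_op["attrs"]),  # original attribute data
        },
    }
```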
According to the above steps, in the case where the specified framework supports the current operator, as described above, the current operator and the corresponding registered operator in the specified framework have the same operator identifier but may have different attribute data. Regardless of whether the attribute data are the same, the mapping function may be used directly to perform the operator attribute conversion, the mapping function copying the attribute data of the corresponding registered operator of the specified framework into the attribute data of the current operator. Of course, it is also possible to first determine whether the attribute data of the two are the same, and to perform the operator attribute conversion with the mapping function only when they differ.
Based on steps S801 to S805, each operator in the computational graph is processed accordingly, thereby obtaining a converted computational graph in which the operators not supported by the specified framework have been replaced with join operators and the attribute conversion of the operators supported by the specified framework has been completed. This computational graph is provided to step S806 for further processing.
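Continuing the sketches above, the first pass over the graph (steps S801 to S805) might look roughly as follows; registered_ops, MAPPING_FUNCTIONS, and build_join_operator are the illustrative names used in the earlier sketches, not framework APIs.

```python
# First pass (S801-S805): convert registered operators with their mapping
# functions, replace unregistered ones with join operators.
def first_pass(graph, registered_ops):
    for i, node in enumerate(graph["nodes"]):                # S801: read in order
        if node["op"] in registered_ops:                     # S802: identifier found
            graph["nodes"][i] = MAPPING_FUNCTIONS[node["op"]](node)   # S804
        else:
            graph["nodes"][i] = build_join_operator(node)    # S803
    return graph                                             # S805: all processed
```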
Step S806 sequentially reads the join operators in the computational graph.
Step S807, for the current join operator, reads the operator identifier and attribute data of the operator it replaced from its attribute data, and uses them to replace the operator identifier and attribute data of the join operator, respectively.
Step S808 determines whether all join operators have been processed. If so, the computational graph is output; if not, the process returns to step S806.
As described above, each join operator stores, in its own attribute data, the operator identifier and attribute data of the operator it replaced. In steps S806-S808, therefore, for each join operator in the computational graph, the operator identifier and attribute data of that original operator are read from the join operator's attribute data and used to replace the operator identifier and attribute data of the join operator, respectively. After the replacement of all join operators in the computational graph has been completed, the resulting computational graph can be compiled and run under the specified framework.
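A corresponding sketch of the second pass (steps S806 to S808), again using the illustrative dict-based nodes and the stashing convention from the constructor sketch above:

```python
# Second pass (S806-S808): rewrite every join operator back to the operator it
# replaced, using the identity and attributes stored in its own "attrs".
import ast

def second_pass(graph):
    for node in graph["nodes"]:
        if node["op"] != "connection":                       # S806: join operators only
            continue
        stash = node["attrs"]
        node["op"] = stash["op name"]                        # S807: restore identifier
        node["attrs"] = ast.literal_eval(stash["op attribute"])  # restore attributes
    return graph                                             # S808: output the graph
```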
In this embodiment, a join operator with the same connection relationship as the current operator (an operator for which no registered operator with the same operator identifier exists under the specified framework) is constructed, the operator identifier and attribute data of the current operator are stored as the attribute data of the join operator, the current operator is then replaced with the join operator, and finally the operator identifier and attribute data of the join operator are replaced with the operator identifier and attribute data of the current operator. In this way, a single join operator suffices to replace all such operators, and the developer only needs to maintain one join operator, which helps reduce maintenance cost.
As a modification of the above embodiment, for each current operator that is not supported by the specified framework, a join operator is still constructed to replace it, and the join operator still has the same connection relationship as the current operator; however, instead of storing the operator identifier and attribute data of the current operator as the attribute data of the join operator, the operator identifier and attribute data of the current operator, together with their correspondence to the join operator (the correspondence including, for example, the position information of the join operator), are stored elsewhere. After each current operator not supported by the specified framework has been replaced with a join operator, the operator identifier and attribute data of the current operator corresponding to each join operator are retrieved from the stored correspondence to replace the operator identifier and attribute data of that join operator.
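A minimal sketch of this variation, assuming the same illustrative node layout; keying the correspondence by node position is only one possible choice.

```python
# Variation: keep the original identity and attributes in a separate table
# keyed by the join operator's position instead of inside the join operator.
def replace_with_external_table(graph, registered_ops):
    correspondence = {}                        # position -> (op name, attrs)
    for i, node in enumerate(graph["nodes"]):
        if node["op"] not in registered_ops:
            correspondence[i] = (node["op"], node["attrs"])
            graph["nodes"][i] = {"op": "connection", "name": node["name"],
                                 "inputs": node["inputs"], "attrs": {}}
    for i, (op_name, attrs) in correspondence.items():
        graph["nodes"][i]["op"] = op_name      # restore from the table
        graph["nodes"][i]["attrs"] = attrs
    return graph
```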
It should be emphasized that, although the above embodiment first replaces the current operators not supported by the specified framework with join operators one by one until all such operators have been replaced, and then replaces the operator identifiers and attribute data of the join operators with those of the original operators one by one until all join operators have been restored, embodiments of the present disclosure are not limited thereto. For example, it is also possible to replace one current operator not supported by the specified framework with a join operator, immediately replace the operator identifier and attribute data of that join operator with those of the current operator, and repeat these operations until all such operators have been processed.
FIG. 9 is a partial flow chart of a computational graph processing method according to another embodiment of the present disclosure. As shown in the figure, the computational graph processing method specifically includes steps S901 to S910. Steps S901 to S905 are the same as steps S801 to S805 and will not be described here again.
Step S906 converts the computational graph into a json file.
Step S907 sequentially reads join operators from the json file.
Step S908, for the current join operator, reads the operator identifier and attribute data of the first operator from its attribute data, and uses them to replace the operator identifier and attribute data of the join operator, respectively.
Step S909 determines whether all join operators have been processed; if not, the process jumps to step S907, and if yes, step S910 is executed.
Step S910 converts the json file back into a computational graph.
JSON (JavaScript Object Notation) is a lightweight data exchange format that is easy for people to read and write and easy for machines to parse and generate. In this embodiment, the computational graph is therefore converted into a json file in order to process the join operators, and the processed json file is then reloaded into a computational graph and returned to the specified deep learning framework. It should be noted, however, that some frameworks do not support converting a computational graph to a json file, so this conversion is performed only under frameworks that support it; for example, the MxNet framework supports this conversion.
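Under MxNet, the round trip of steps S906 to S910 could be sketched as follows; Symbol.tojson and mx.sym.load_json are MxNet 1.x APIs, while rewrite_join_operators stands in for steps S907 to S909 and is an assumed helper rather than something provided by the framework.

```python
# Sketch of the json round trip (S906-S910) under a framework that supports it.
import json
import mxnet as mx

def process_via_json(sym, rewrite_join_operators):
    graph = json.loads(sym.tojson())             # S906: computational graph -> json
    rewrite_join_operators(graph["nodes"])       # S907-S909: restore join operators
    return mx.sym.load_json(json.dumps(graph))   # S910: json -> computational graph
```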
The above embodiments are further explained below using an example. With the MxNet framework as the specified framework, the json expression of one operator, BatchNorm, is composed of key-value pairs consisting of keys and values. The key "op" corresponds to the operator identifier "BatchNorm"; the key "name" corresponds to the operation name "resnetv10_stage1_batchnorm1_fwd"; the key "attrs" corresponds to the attribute data, which itself consists of several key-value pairs, namely {"axis":"1","eps":"9.999999747378752e-06","fix_gamma":"False","momentum":"0.8999999761581421","use_global_stats":"False"}; and the key "inputs" corresponds to the array [[23,0,0],[24,0,0],[25,0,0],[26,0,1],[27,0,1]], which indicates the connection relationship of the operator BatchNorm.
The join operator whose json structure is converted from BatchNorm is likewise composed of key-value pairs. The key "op" corresponds to the operator identifier "connection"; the key "name" corresponds to the operation name and can be set to "resnetv10_stage1_batchnorm_fwd"; and the key "inputs" corresponds to the array [[23,0,0],[24,0,0],[25,0,0],[26,0,1],[27,0,1]]. The array corresponding to the key "inputs" indicates the connection relationship and is identical to that of BatchNorm, which indicates that the two have the same connection relationship. That is, the join operator and the operator BatchNorm in the computational graph both have the keys "op", "name", and "inputs".
In addition, the join operator also has an "attrs", but the structure of the "attrs" is different from the "attrs" in BatchNorm, as shown in the following table.
Table 2
Attribute | Meaning | Initial value
input number | number of input tensors | 1
output number | number of output tensors | 1
op name | type of the original operator | NULL
op attribute | attribute list of the original operator | NULL
According to an embodiment of the present disclosure, the attribute data "attrs" of the join operator constructed for the operator BatchNorm is {
"input number": "3",
"output number": "2",
"op name": "BatchNorm",
"op attribute": "{"axis":"1","eps":"9.999999747378752e-06","fix_gamma":"False","momentum":"0.8999999761581421","use_global_stats":"False"}",
}.
That is, the attribute data corresponding to the key "attrs" of the original operator BatchNorm is used as the attribute value of "op attribute".
In a specific implementation, when the join operator is constructed, the operators having a connection relationship with BatchNorm are used as input parameters; the same connection relationship as BatchNorm (i.e., the "inputs" data) is then established according to the number of input tensors, the number of output tensors, and the input parameters; the operator identifier "BatchNorm" is stored as the value of "op name"; and the attribute data corresponding to "attrs" of the operator BatchNorm is stored as the value of "op attribute". During replacement, the original operator identifier is taken out of "op name" to replace the value corresponding to "op", and all data is taken out of "op attribute" to replace the attribute data of "attrs" as a whole.
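Putting the pieces of this example together, the two node entries described above might look as follows when written out as Python dicts; the values are taken from the text, while the exact layout of a real MxNet json file may differ slightly.

```python
# The original BatchNorm node and the join ("connection") node that replaces it.
batchnorm_node = {
    "op": "BatchNorm",
    "name": "resnetv10_stage1_batchnorm1_fwd",
    "attrs": {
        "axis": "1",
        "eps": "9.999999747378752e-06",
        "fix_gamma": "False",
        "momentum": "0.8999999761581421",
        "use_global_stats": "False",
    },
    "inputs": [[23, 0, 0], [24, 0, 0], [25, 0, 0], [26, 0, 1], [27, 0, 1]],
}

connection_node = {
    "op": "connection",
    "name": "resnetv10_stage1_batchnorm_fwd",
    "attrs": {
        "input number": "3",
        "output number": "2",
        "op name": "BatchNorm",
        "op attribute": str(batchnorm_node["attrs"]),   # original attributes stashed
    },
    "inputs": [[23, 0, 0], [24, 0, 0], [25, 0, 0], [26, 0, 1], [27, 0, 1]],  # same connections
}
```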
In this example, the number of input tensors and the number of output tensors of the current operator are used as attributes of the join operator, and the connection relationship of the join operator is constructed in combination with the input parameters, so that the join operator and the current operator have the same connection relationship data; the attribute data and operator identifier of the current operator are stored in the attribute data of the join operator so that the replacement can be completed in the subsequent steps. The operator types and attributes in the computational graph are modified by means of a json file, so the replacement operation for the join operators can be conveniently completed using the json file. Finally, the processed json file is reloaded into a computational graph.
The technical scheme of the embodiments of the present disclosure can be applied to most frameworks, such as the TensorFlow framework, the MxNet framework, the Caffe framework, and the like, and therefore has a certain universality.
Further, although the above description takes a server of a data center as an example of the execution subject of the present disclosure, the present disclosure is not limited thereto. In theory, the execution subject may be any computing device, including the server and the terminal device described above; for a terminal device, as long as its processor, memory, network throughput capability, and the like meet the operation requirements of the deep learning model, the deep learning model may be deployed on it to perform various kinds of computational graph processing, including the computational graph processing scheme provided by the embodiments of the present disclosure.
Commercial value of embodiments of the present disclosure
Deep learning models currently have wide and successful application scenarios, so even a minor improvement to a deep learning model becomes important, not only on a technical level but also on a business level. Taking the face recognition field as an example, video surveillance is collected through a camera, face images are recognized through a deep learning model and compared with faces stored in the cloud, so that criminals in the surveillance video can be identified. In the field of speech recognition, speech is recognized through a deep learning model to realize simultaneous interpretation. These application scenarios can bring tremendous commercial benefit.
In the engineering practice of deep learning models, the computational graphs under various frameworks need to be converted into computational graphs adapted to a specified acceleration unit, and those adapted computational graphs then need to be returned to the original framework, so as to realize the organic combination and mutual promotion of the algorithm research and the engineering application of the model. The computational graph processing method provided by the embodiments of the present disclosure helps reduce the number of mapping functions required, and therefore has application prospects and commercial value.
Those skilled in the art will appreciate that the present disclosure may be implemented as a system, a method, or a computer program product. Accordingly, the present disclosure may be embodied entirely in hardware, entirely in software (including firmware, resident software, and micro-code), or in a combination of software and hardware. Furthermore, in some embodiments, the present disclosure may also be embodied in the form of a computer program product in one or more computer-readable media having computer-readable program code embodied therein.
Any combination of one or more computer readable media may be employed. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. The computer readable storage medium is, for example but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples of a computer-readable storage medium include an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical memory, a magnetic memory, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with a processing unit, apparatus, or device.
The computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electromagnetic signals, optical signals, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., and any suitable combination of the foregoing.
Computer program code for carrying out embodiments of the present disclosure may be written in one or more programming languages or a combination thereof. The programming languages include object-oriented programming languages such as Java and C++, and may also include conventional procedural programming languages such as C. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
The foregoing is merely a preferred embodiment of the present disclosure and is not intended to limit the present disclosure; various modifications and changes may be made to the present disclosure by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present disclosure should be included in the protection scope of the present disclosure.

Claims (21)



