
Data processing method, device, computer equipment and storage medium

Info

Publication number: CN114580606B
Application number: CN202011399187.9A
Authority: CN (China)
Prior art keywords: data, target, computing, dimension, core
Legal status: Active (the legal status is an assumption and is not a legal conclusion)
Other languages: Chinese (zh)
Other versions: CN114580606A
Inventor: name withheld at the inventor's request
Current assignee: Cambricon Technologies Corp Ltd
Original assignee: Cambricon Technologies Corp Ltd
Application filed by Cambricon Technologies Corp Ltd; priority to CN202011399187.9A
Publication of CN114580606A, later granted and published as CN114580606B

Abstract

The present disclosure relates to a data processing method, apparatus, computer equipment and storage medium. The disclosed board includes: a storage device, an interface device, a control device, and a chip provided with a data processing device, wherein the data processing device is connected to the storage device, the control device and the interface device respectively; the storage device is used to store data; the interface device is used to realize data transmission between the data processing device and an external device; and the control device is used to monitor the status of the data processing device. The data processing method, apparatus, computer equipment and storage medium provided by the embodiments of the present disclosure can split and lay out data based on the split index in the dynamic tag information when executing a neural network operation task, so that the second data can adapt to the multiple memory channels of the multi-core processor and the multiple computing cores in the multi-core processor can be used to perform the operations, thereby improving the processing efficiency and speed of neural network operation tasks.

Description

Data processing method, device, computer equipment and storage medium
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a data processing method, apparatus, computer device, and storage medium.
Background
With the development of computer technology, neural networks have also advanced significantly, and neural network operations can be performed by special-purpose or general-purpose processors. In the related art, under the influence of factors such as the multiple data types, large operation volume and hardware limitations in neural networks, the operation speed of a neural network is greatly limited.
Disclosure of Invention
In view of this, the present disclosure proposes a data processing method, apparatus, computer device, and storage medium.
According to an aspect of the present disclosure, there is provided a data processing method applied to a multi-core processor, the multi-core processor including a plurality of computing core clusters, each computing core cluster including a plurality of computing cores, the method including:
acquiring first data and tag information of the first data, wherein the tag information comprises dynamic tag information, and the dynamic tag information comprises splitting indexes and identification information of a target storage space;
Splitting the first data into a plurality of second data according to the splitting index;
storing the plurality of second data to the corresponding target storage space according to the identification information of the target storage space,
Wherein the dynamic tag information is used to characterize information associated with the first data and the multicore processor.
According to another aspect of the present disclosure, there is provided a data processing apparatus applied to a multi-core processor including a plurality of computing core clusters, each computing core cluster including a plurality of computing cores, the apparatus comprising:
the acquisition module is used for acquiring first data and tag information of the first data, wherein the tag information comprises dynamic tag information, and the dynamic tag information comprises a splitting index and a target storage space;
The splitting module splits the first data into a plurality of second data according to the splitting index;
A storage module for storing the second data into the corresponding target storage space according to the identification information of the target storage space,
Wherein the dynamic tag information is used to characterize information associated with the first data and the multicore processor.
According to another aspect of the present disclosure, there is provided a machine learning arithmetic device, the device including:
one or more data processing devices, configured to acquire data to be operated and control information from other processing devices, perform specified machine learning operation, and transmit an execution result to the other processing devices through an I/O interface;
When the machine learning arithmetic device comprises a plurality of data processing devices, the data processing devices can be connected through a specific structure and transmit data;
the data processing devices are interconnected through a Peripheral Component Interconnect Express (PCIE) bus to support larger-scale machine learning operations; they may share the same control system or have respective control systems, share memory or have respective memories, and be interconnected in any interconnection topology.
According to another aspect of the present disclosure, there is provided a combination processing apparatus including:
the machine learning operation device described above, a universal interconnect interface, and other processing devices;
the machine learning operation device interacts with the other processing devices to jointly complete the calculation operation designated by the user,
The combined processing device further comprises a storage device which is respectively connected with the machine learning operation device and the other processing devices and used for storing data of the machine learning operation device and the other processing devices.
According to another aspect of the present disclosure, there is provided a chip including the above-described combination processing device.
According to another aspect of the present disclosure, there is provided a board including a memory device, an interface device and a control device, and a chip as described above;
wherein the data processing device is respectively connected with the storage device, the control device and the interface device;
the storage device is used for storing data;
the interface device is used for realizing data transmission between the data processing device and external equipment;
The control device is used for monitoring the state of the data processing device,
The memory device comprises a plurality of groups of memory units, wherein each group of memory units is connected with the data processing device through a bus, and the memory units are DDR SDRAM;
The data processing device comprises a DDR controller, wherein the DDR controller is used for controlling data transmission to and data storage of each memory unit;
the interface device is a standard PCIE interface.
According to another aspect of the present disclosure, there is provided a data processing apparatus comprising a processor and a memory, the memory having stored therein a computer program, which when executed by the processor, implements a data processing method as described above.
According to another aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the above-described data processing method.
The data processing method, apparatus, computer device and storage medium comprise: obtaining first data and tag information of the first data, wherein the tag information comprises dynamic tag information and the dynamic tag information comprises a splitting index and a target storage space; splitting the first data into a plurality of second data according to the splitting index; and storing the plurality of second data into the corresponding target storage spaces according to the identification information of the target storage space, wherein the dynamic tag information is used for representing information related to the first data and the multi-core processor. According to the method, after the first data and its dynamic tag are determined, the first data can be split and stored in a planned layout based on the splitting index and the identification information of the target storage space in the dynamic tag information, so that the second data can be adapted to the multiple memory channels of the multi-core processor and operated on by multiple computing cores of the multi-core processor, improving the processing efficiency and speed of neural network operation.
Other features and aspects of the present disclosure will become apparent from the following detailed description of exemplary embodiments, which proceeds with reference to the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate exemplary embodiments, features and aspects of the present disclosure and together with the description, serve to explain the principles of the disclosure.
FIG. 1a shows a schematic diagram of a framework of a computer device according to an embodiment of the present disclosure.
FIGS. 1b, 1c illustrate diagrams of multi-core processor architectures according to an embodiment of the present disclosure.
Fig. 1d and 1e illustrate architecture diagrams of a computing core cluster according to an embodiment of the present disclosure.
Fig. 2 shows a flow chart of a data processing method according to an embodiment of the present disclosure.
Fig. 3 shows a schematic diagram of a first data packet according to an embodiment of the present disclosure.
Fig. 4 shows a flow chart of a data processing method according to an embodiment of the present disclosure.
Fig. 5 shows an architecture schematic of a multi-core processor 1 according to an embodiment of the present disclosure.
Fig. 6a shows a schematic diagram of allocation of operation tasks by the multicore processor 1 executing the operation instruction a according to an embodiment of the present disclosure.
Fig. 6b, 6c show a schematic process diagram of the multicore processor 1 executing the operation instruction a according to an embodiment of the present disclosure.
Fig. 7 shows a block diagram of a data processing apparatus according to an embodiment of the present disclosure.
Fig. 8 is a block diagram illustrating a combination processing apparatus 1200 according to an embodiment of the present disclosure.
Fig. 9 is a schematic diagram illustrating a board 1300 according to an embodiment of the disclosure.
Detailed Description
Various exemplary embodiments, features and aspects of the disclosure will be described in detail below with reference to the drawings. In the drawings, like reference numbers indicate identical or functionally similar elements. Although various aspects of the embodiments are illustrated in the accompanying drawings, the drawings are not necessarily drawn to scale unless specifically indicated.
The word "exemplary" is used herein to mean "serving as an example, embodiment, or illustration. Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.
In addition, numerous specific details are set forth in the following detailed description in order to provide a better understanding of the present disclosure. It will be understood by those skilled in the art that the present disclosure may be practiced without some of these specific details. In some instances, methods, means, elements, and circuits well known to those skilled in the art have not been described in detail in order not to obscure the present disclosure.
The data processing method of the present disclosure may be applied to the computer device shown in fig. 1a, where the computer device may include a host 101, a device 102, a host memory 103, and a device memory 104. The host 101 may be a general-purpose processor and the device 102 may be a neural network accelerator, although the present disclosure does not limit the nature of the host 101 and the device 102. The data flow in the data processing method is as follows: the host 101 moves input data from the host memory 103 to the device memory 104; the device 102 reads the input data from the device memory 104 and completes the calculation according to instructions; the device 102 writes the calculation results back to the device memory 104; and the host 101 moves the output data from the device memory 104 to the host memory 103.
With the development of computer technology and big data, a multi-core processor is becoming mainstream, and the device 102 in this embodiment may be a multi-core processor, and the multi-core processor may set different architectures according to the use requirement. For example, a multi-core processor may include a plurality of clusters of computing cores, each of which may include at least one computing core, each of which may be provided with a designated memory. FIGS. 1b, 1c illustrate diagrams of multi-core processor architectures according to an embodiment of the present disclosure. Fig. 1d and 1e illustrate architecture diagrams of a computing core cluster according to an embodiment of the present disclosure. The multi-core processor may be a Symmetric Multiprocessor (SMP) architecture, as shown in fig. 1b, which is assumed to include four compute core clusters (clusters) and four memories (off-chip memories). The compute cores (core) in the cluster of compute cores may access memory through the routing node R. The 4 memories are "local memories" of the corresponding 4 computing core clusters, but because of the symmetrical arrangement, the computing cores in each computing core cluster need to pass through two routing nodes R when accessing the 4 memories, and the rate of accessing any one memory is the same without congestion. For example, the "local memory" of the computing core cluster x in fig. 1b is the memory a, and in the process of data access, the data corresponding to the computing core cluster x is preferentially stored in the "local memory", and a corresponding memory channel is set between the computing core cluster and the corresponding "local memory", so that the computing core cluster can access its own "local memory" through the memory channel. In this way, the probability of the computing cores in different computing core clusters accessing the same memory can be reduced.
The multi-core processor may also be a non-uniform memory access (NUMA) architecture. As shown in FIG. 1c, assume that the NUMA multi-core processor includes four computing core clusters and four memories. The computing cores (core) in a cluster may access memory through the routing nodes R. The 4 memories are the "local memories" of the corresponding 4 computing core clusters, but due to the asymmetric arrangement, a computing core accessing its corresponding "local memory" needs to pass through only one routing node R and thus has the fastest access rate (for example, in fig. 1c the computing cores in computing core cluster n need to pass through only one routing node R when accessing memory s), whereas a computing core accessing other, non-local memories needs to pass through at least two routing nodes R and has a slower access rate.
Optionally, when the computing core cluster is provided with a plurality of computing cores, an organization architecture in the computing core cluster may also be set according to needs. The architecture of the computing core cluster may be of the bus contention type, as shown in fig. 1d, assuming that the computing core cluster includes 4 computing cores, each of which may access memory through the routing node R. The architecture of the computing core cluster may also be a shared cache type, as shown in fig. 1e, where it is assumed that the computing core cluster includes four computing cores, and an extra-core cache corresponding to the computing core cluster is provided, and the four computing cores in the computing core cluster share the extra-core cache. Each compute core in the cluster of compute cores may directly access the off-core cache and may access memory through the routing node R.
In order to accelerate tasks with large data volume, such as neural network operation tasks, specific calculation processing is generally performed by adopting data parallel, model parallel and hybrid parallel modes when the multi-core processor is used for operation. The data parallel refers to that different computing cores process different input data of the same neural network, which is equivalent to parallel computing of multiple batches of input data of the same neural network. Model parallelism refers to that different computing cores process different parts of the same input data of the same neural network, namely, each operator and data in the neural network are disassembled, and each computing core processes a part of disassembled data, which is equivalent to that of the whole accelerator processing one part of input data in parallel. Hybrid parallelism is a combination of data parallelism and model parallelism, such as model parallelism used within a compute core cluster and data parallelism used between compute core clusters. However, when the multi-core processor executes the operation task by using the three modes, how to split operators and data and plan the storage position of the data so as to ensure the efficient execution of the operation task is a technical problem to be solved.
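As an informal illustration (not part of the patent disclosure), the three parallel modes differ in which axis of the work is divided among the computing cores. A minimal sketch in Python, with all function names hypothetical:

```python
import numpy as np

def data_parallel_split(inputs: np.ndarray, num_cores: int) -> list:
    """Data parallel: each core processes different input samples (batch axis)
    of the same neural network."""
    return np.array_split(inputs, num_cores, axis=0)

def model_parallel_split(weights: np.ndarray, num_cores: int) -> list:
    """Model parallel: each core processes a different part of the same input,
    here modeled by slicing an operator's weight matrix along its output axis."""
    return np.array_split(weights, num_cores, axis=1)

# Hybrid parallel combines the two, e.g. data parallel across computing core
# clusters (split the batch first) and model parallel within each cluster
# (then split the weights among the cores of that cluster).
```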
The present disclosure provides a data processing method, before actually executing a neural network operation task, a host may obtain first data and dynamic tag information of the first data in advance, then split the first data into a plurality of second data according to a splitting index in the dynamic tag information, and store the plurality of second data into corresponding target storage spaces respectively according to identification information of target storage spaces in the dynamic tag information, so that layout storage for the first data is realized, and a multi-core processor may directly obtain the second data to perform related operation when executing the operation task, thereby improving processing efficiency and speed of the operation task. The data processing method of the present disclosure can be found in particular in the following description. Alternatively, a compiler running in a host may implement the data processing method described above.
Fig. 2 shows a flow chart of a data processing method according to an embodiment of the present disclosure. As shown in fig. 2, the method is applied to a multi-core processor including a plurality of computing core clusters each including a plurality of computing cores, and includes a "tag use step" of steps S11 to S13.
In step S11, first data and tag information of the first data are acquired, where the tag information includes dynamic tag information, and the dynamic tag information includes a split index and identification information of a target storage space.
Alternatively, the method of the present disclosure may first traverse a computational graph of a neural network, which may include at least one data node (i.e., first data), and determine tag information of each first data in the neural network, which may include dynamic tag information and static tag information, respectively, and bind the tag information of the first data to the corresponding first data. Further, the method of the present disclosure may process the neural network described above according to the tag information of the first data. For example, a hardware instruction which can be executed by the multi-core processor is generated according to the tag information of the first data, so that the multi-core processor carries out corresponding processing on the neural network according to the hardware instruction.
The static tag information can be used for representing information related to participation of the first data in the neural network operation, and can comprise at least one of data category, static data type, static data dimension sequence, dimension value corresponding to each static data dimension and the like. The dynamic tag information is used for representing information associated with the first data and the multi-core processor, and can be specifically determined according to the architecture of the multi-core processor. The dynamic tag information may include at least one of dynamic data type, dynamic data dimension order, fragmentation parameters, padding parameters and data size, split index, identification of target storage space, and target switch level. The determining process of the static tag information and the dynamic tag information can be specifically referred to the following description, and will not be repeated here.
Specifically, the first data may be data of various data categories participating in the neural network operation, such as Input Neuron (Input Neuron), output Neuron (Output Neuron), hidden Neuron (Hidden Neuron), constant Neuron (Constant Neuron), input Weight (Input Weight), output Weight (Output Weight), constant Weight (Constant Weight), and Auxiliary data (Auxiliary). The first data may also include instructions corresponding to neural network operations, which may be hardware instructions that are compiled to be directly executable on the multi-core processor. The method of the present disclosure may identify each first data by a data category in the static tag information.
In this embodiment, the splitting index in the dynamic tag information may be used to indicate the scheme by which first data of a given data class needs to be split into a plurality of second data when it participates in a neural network operation on the current multi-core processor. A neural network operation refers to a calculation node in the neural network computational graph, which may include a convolution operation, fully-connected operation, pooling operation, scaling operation, and the like; the present disclosure is not limited in this respect. In a neural network operation, one or more input data are usually processed to obtain an operation result, and the first data may be input data (including but not limited to input neurons and input weights) or an operation result (including but not limited to output neurons), which is not limited by the present disclosure.
In this embodiment, the identification information of the target memory space may be used to indicate where each second data needs to be stored to the multi-core processor (i.e., to which particular memory space). As with the multi-core processor architecture described above, the memory space of the multi-core processor may include local memory, or may also be off-core cache and local memory. The identification information of the target storage space may be a number, a name, etc. corresponding to a memory (or an out-core cache) storing each second data, which can distinguish different identities between the memory and other memories, or may be a category of the memory or the out-core cache, that is, a category identification of the memory or the out-core cache.
For example, when the identification information of the target storage space in the dynamic tag information is a category identification of the memory (assuming that the target storage space is the memory), it indicates that the plurality of second data are sequentially stored in the plurality of memories. When the identification information of the target storage space in the dynamic tag information is the identity of each memory (assuming that the target storage space is the memory), the identification information indicates that a plurality of second data are sequentially stored in the memory corresponding to each identity. The identification information of the target storage space stored in the dynamic tag information can be set by those skilled in the art according to actual needs, and the present disclosure is not limited thereto.
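For concreteness, the dynamic tag information described in step S11 can be pictured as a record like the following sketch; the field names and types are illustrative assumptions, not terminology from the patent itself:

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class DynamicTag:
    """Illustrative container for the dynamic tag information of first data."""
    dtype: str                           # dynamic data type, e.g. "float16"
    dim_order: str                       # dynamic data dimension order, e.g. "NHWC"
    tiling: int                          # fragmentation parameter
    padding: Tuple[int, ...]             # filling lengths per dimension
    size: int                            # data size after conversion, in bytes
    split_dim: str                       # target split dimension, e.g. "N"
    split_ranges: List[Tuple[int, int]]  # (start, end) positions per second data
    store_ids: List[str]                 # identification info of target storage spaces
    swap_level: str                      # target data exchange hierarchy
```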
In step S12, the first data is split into a plurality of second data according to the splitting index.
In step S13, the plurality of second data are respectively stored into the corresponding target storage spaces according to the identification information of the target storage spaces, so that each computing core can acquire second data from its corresponding target storage space to perform the operation and execute its assigned operation task.
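A minimal sketch of steps S12 and S13, reusing the hypothetical DynamicTag record above; the memories mapping stands in for the multi-core processor's storage spaces:

```python
import numpy as np

def split_and_store(first_data: np.ndarray, tag: "DynamicTag",
                    memories: dict) -> None:
    """Step S12: split first data along the target split dimension using the
    start/end positions of the splitting index. Step S13: store each second
    data into the target storage space named by its identification info."""
    axis = tag.dim_order.index(tag.split_dim)
    for (start, end), mem_id in zip(tag.split_ranges, tag.store_ids):
        second = np.take(first_data, range(start, end + 1), axis=axis)
        memories.setdefault(mem_id, []).append(second)
```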
Optionally, in the data processing method of the present disclosure, the host may obtain the first data and its corresponding dynamic tag information in advance at compile time, generate a corresponding instruction according to the dynamic tag information, split the first data into a plurality of second data according to the instruction at runtime, and store the second data obtained after splitting into the corresponding target storage spaces. Optionally, the data processing method of the present disclosure may instead transmit the dynamic tag information of the first data at runtime, split the first data into a plurality of second data according to the dynamic tag information, and store the split second data into the corresponding target storage spaces. It should be clear that compile time (compiling time) here refers to the stage of compiling the neural network computational graph to generate instructions, and runtime refers to the running environment of the program. Alternatively, the above data processing method may be implemented by a compiler.
According to the data processing method provided by the embodiment of the disclosure, after the first data and the dynamic label information thereof are determined, the first data can be split and stored in a layout mode based on the splitting index in the dynamic label information and the identification information of the target storage space, so that the second data can be adapted to a plurality of memory channels of the multi-core processor, operation is performed by using a plurality of computing cores in the multi-core processor, and the processing efficiency and the processing speed of neural network operation are improved.
In one possible implementation manner, the dynamic tag information further comprises a target data exchange level, and the tag using step of the method can further comprise generating a first instruction for neural network operation corresponding to the first data according to the parallel computing mode of the multi-core processor, the identification information of the target storage space and the target data exchange level, so that each computing core executes a corresponding operation task according to the first instruction. The first instruction may include at least one of a data access instruction and a data operation instruction.
In this implementation, the host may generate the first instruction described above at the compilation stage. The method of the present disclosure generates the first instruction according to the dynamic tag information at the time of compiling, so that the computing task to be executed by each computing core may be sent to the corresponding computing core through the first instruction, so that the computing core may execute its corresponding computing task for the first data under the control of the received first instruction, which is not limited in this disclosure. The data access instruction may be used to instruct the computing core to store and/or read the second data (i.e., the data split by the first data). In this embodiment, different data access instructions are generated through the target storage space identification information and the target data exchange hierarchy in the dynamic tag information, so that exchange of data between different computing cores is realized. The data operation instruction is used for indicating the detailed operation to be executed by the computing core for the second data, for example, the arithmetic operation corresponding to the second data is performed to obtain an operation result.
Optionally, the first instruction may also include a data access instruction, a data exchange instruction, and a data operation instruction, where the data exchange instruction is used to instruct the exchange hierarchy of the second data and how to exchange the data when actually exchanging the data. The data access instruction is used for indicating the storage and/or reading of data, and the data operation instruction is used for calculating the core to execute operation.
The target data exchange hierarchy may be used to indicate whether the second data needs to be exchanged between cores, where the exchange refers to whether the second data stored in the off-core cache or the memory is used by other computing cores in the same computing core cluster or computing cores in other computing core clusters (as the data to be computed or as a result of an operation) in addition to the computing cores that compute the second data. The data exchange represented by the target data exchange hierarchy may include no exchange, inter-cluster data exchange, inter-core data exchange, inter-memory data exchange. Where "no swap" may mean that data is not used by other compute cores, "inter-cluster data swap" may mean that data is used by compute cores in other compute core clusters, "inter-core data swap" may mean that data is used by compute cores in the same compute core cluster, and "inter-memory data swap" may mean that data is used by compute cores in different compute core clusters. According to the embodiment of the application, the corresponding first instruction can be generated according to the target data exchange level and the target storage space in the dynamic tag information in compiling, and the framework of the multi-core processor is combined, so that each computing core of the multi-core processor can execute the corresponding task according to the first instruction.
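The target data exchange hierarchy can be pictured as a small enumeration; the compile-time placement rule sketched below is a plausible reading of this paragraph, not a rule stated by the patent:

```python
from enum import Enum, auto

class SwapLevel(Enum):
    """Target data exchange hierarchy, using the terms of the description."""
    NO_SWAP = auto()        # data used only by the core that computed it
    INTER_CORE = auto()     # used by other cores in the same cluster
    INTER_CLUSTER = auto()  # used by cores in other clusters
    INTER_MEMORY = auto()   # used by cores in different clusters, via memory

def pick_store_target(level: SwapLevel) -> str:
    """Hypothetical compile-time choice of where a second datum should live so
    that the consumers implied by the swap level can reach it cheaply."""
    if level is SwapLevel.NO_SWAP:
        return "local memory of the producing core's cluster"
    if level is SwapLevel.INTER_CORE:
        return "shared off-core cache of the cluster"
    return "memory reachable by all clusters"
```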
In one possible implementation manner, the dynamic tag information may further include a target data exchange hierarchy, and the method may further include determining an operation task to be executed by each computing core according to a second instruction corresponding to the first data, a parallel computing manner of the multi-core processor, identification information of the target storage space, and the target data exchange hierarchy.
In this implementation manner, the second instruction is used to instruct the computing core to perform what operation task needs to be performed on the first data, including information related to a process of actually performing the operation task, such as how to acquire the first data, dynamic tag information of the first data, what operation is performed on the first data in the neural network, which operators need to be called, how to store the obtained operation result, and so on.
Optionally, when the neural network operation in which the first data participates, the target storage space, the parallel computing mode of the multi-core processor, and the target data exchange hierarchy differ, the operation tasks executed by the computing cores are not identical. In example 1, for a fully-connected MLP operator operation where the first data includes an input weight, an input neuron and an output neuron, when the operation task is performed in a data parallel manner, the input weight undergoes data exchange while the output neuron and the input neuron do not. When the computing core cluster architecture is the bus contention type, the input weight undergoes inter-memory data exchange; when the architecture is the shared cache type, the input weight undergoes inter-cluster data exchange. In example 2, for the same fully-connected MLP operator operation executed in a model parallel manner, the output neuron undergoes data exchange while the input neuron and input weight do not; when the computing core cluster architecture is the bus contention type, the output neuron undergoes inter-cluster data exchange. In example 3, for the same fully-connected MLP operator operation executed in a hybrid parallel manner, the input neuron and the output neuron undergo data exchange while the input weight does not; when the computing core cluster architecture is the shared cache type, the output neuron can undergo both inter-core and inter-cluster data exchange, and the input neuron can undergo inter-cluster data exchange.
To facilitate understanding of the "operation task performed by the computing core", the following describes the operation tasks performed by the computing cores when the first data includes an input neuron, an input weight and an output neuron and the data to be exchanged are determined according to the dynamic tag information; a sketch of the shared reduction pattern follows the three implementations below. The process by which a computing core performs its corresponding operation task is presented in the examples given under the "application examples" section below.
In one possible implementation manner, when the first data includes an input neuron, an input weight value, and an output neuron, and the data exchange is determined on the output neuron according to the dynamic tag information of the first data, the computing task executed by the computing core may include:
Each computing core obtains first target data in a plurality of second data of an input neuron and second target data in a plurality of second data of an input weight, calculates the first target data and the second target data to obtain a first intermediate result, and stores the first intermediate result into a storage space corresponding to the computing core;
a first computing core in the plurality of computing cores acquires at least two first intermediate results, performs operation processing on the at least two first intermediate results to obtain a second intermediate result, and stores the second intermediate result in a storage space corresponding to the first computing core;
And a second computing core in the plurality of computing cores acquires at least two second intermediate results, performs operation on the at least two second intermediate results to obtain an operation result, and stores the operation result as the output neuron into the target storage space.
In one possible implementation manner, when the first data includes an input neuron, an input weight and an output neuron, and it is determined according to the dynamic tag information of the first data that the input weight undergoes data exchange, the computing task executed by the computing core may include:
each computing core is used for acquiring first target data among the plurality of second data of the input neuron and second target data among the plurality of second data of the input weight, performing an operation on the first target data and the second target data to obtain a first type of first intermediate result, and storing it into the corresponding target storage space;
Each computing core is used for acquiring first target data in a plurality of second data of an input neuron and third target data which is different from the second target data in a plurality of second data of an input weight, and performing operation on the first target data and the third target data to obtain a second type of first intermediate result and storing the second type of first intermediate result in a corresponding target storage space;
a first computing core in the plurality of computing cores acquires at least two first intermediate results, performs operation processing on the at least two first intermediate results to obtain a second intermediate result and stores the second intermediate result in a corresponding target storage space, wherein the first intermediate result comprises the first type of first intermediate result and the second type of first intermediate result;
and a second computing core in the plurality of computing cores acquires at least two second intermediate results, performs operation on the at least two second intermediate results to obtain an operation result, and stores the operation result as the output neuron into the target storage space.
In one possible implementation manner, when the first data includes an input neuron, an input weight value, and an output neuron, and the data exchange between the input neuron and the output neuron is determined according to the dynamic tag information of the first data, the computing task executed by the computing core may include:
Each computing core is used for acquiring first target data among the plurality of second data of the input neuron and second target data among the plurality of second data of the input weight, operating on the first target data and the second target data to obtain a first type of first intermediate result, and storing it into the storage space corresponding to the computing core;
each computing core is used for acquiring fourth target data and second target data which are different from the first target data in a plurality of second data of the input neurons, calculating the fourth target data and the second target data to obtain a second type of first intermediate result, and storing the second type of first intermediate result into a storage space corresponding to the computing core;
a first computing core in the plurality of computing cores acquires at least two first intermediate results, performs operation processing on the at least two first intermediate results to obtain a second intermediate result and stores the second intermediate result in a storage space corresponding to the first computing core, wherein the first intermediate result comprises the first type of first intermediate result and the second type of first intermediate result;
and a second computing core in the plurality of computing cores acquires at least two second intermediate results, performs operation on the at least two second intermediate results to obtain an operation result, and stores the operation result as the output neuron into the target storage space.
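All three implementations above share a two-stage reduction pattern. A minimal sketch, with matrix multiplication standing in for the per-core operation and summation standing in for the merging operations (both are assumptions; the patent does not fix the operators):

```python
import numpy as np

def core_partial(first_target: np.ndarray, second_target: np.ndarray) -> np.ndarray:
    """Each computing core: operate on its slice of the input neuron and its
    slice of the input weight to produce a first intermediate result."""
    return first_target @ second_target

def first_core_merge(first_results: list) -> np.ndarray:
    """A first computing core: combine at least two first intermediate results
    into a second intermediate result."""
    return sum(first_results)

def second_core_merge(second_results: list) -> np.ndarray:
    """A second computing core: combine at least two second intermediate
    results into the operation result stored back as the output neuron."""
    return sum(second_results)
```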
In one possible implementation, the dynamic tag information further includes a dynamic data type, a dynamic data dimension order, a fragmentation parameter, and a padding parameter, the splitting index (a determination manner of the splitting index is described below) includes a target splitting dimension, a start splitting position and an end splitting position of each second data in the target splitting dimension of the first data, and step S12 may include:
When the current data state of the first data is inconsistent with the dynamic tag information, processing the first data according to the dynamic tag information to obtain processed first data; and splitting the processed first data into a plurality of second data based on the initial splitting position and the ending splitting position of each second data in the target splitting dimension of the first data. Wherein the data states include data types, order of data dimensions, and dimension values.
In this implementation, the dynamic tag information is determined after determining the processor running the neural network, based on static tag information (described below), algorithmic features of the neural network, computational power, performance of the processor, etc., so that the first data with the dynamic tag information can be adapted for operation by the processor. When the neural network operates with different processors, the dynamic tag information of the first data may be different. When the performance, calculation power, and other parameters of the two processors are the same, the dynamic tag information of the first data may be the same.
In this implementation, the dynamic data type may be determined based on the type of data the processor running the neural network can process, its computing power, and so on. If a processor can process 16-bit floating point numbers, then when that processor is used to run the neural network, the dynamic data type of the data to be processed is the 16-bit floating point number. The dynamic data dimension order may be determined based on how the processor running the neural network needs to read or store the data. The fragmentation parameter may be determined based on the computing power of the processor running the neural network; for example, if a processor can perform 8 number operations at a time, the fragmentation parameter may be set to 8. The filling parameter may be determined according to the dimension values of the static data dimensions of the data to be processed and the fragmentation parameter, and may include the length of data to be filled in different dimensions of the data and/or the filling value to be used. The data size is the product of the data bit width and the operation dimension values actually participating in the operation, which are determined according to the dimension values of the static data dimensions, the fragmentation parameter and the filling parameter. For example, if a certain datum is a matrix whose two operation dimension values during the operation are 4 and 8 and whose data bit width is 4 bytes, the data size of that datum is 4×8×4=128 bytes.
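A worked sketch of the padding and data-size arithmetic in this paragraph (helper names are illustrative):

```python
def padded_length(dim_value: int, tiling: int) -> int:
    """Fill a dimension up to the next multiple of the fragmentation parameter."""
    return ((dim_value + tiling - 1) // tiling) * tiling

def data_size_bytes(op_dim_values: list, bit_width_bytes: int) -> int:
    """Data size = product of the operation dimension values times the width."""
    size = bit_width_bytes
    for v in op_dim_values:
        size *= v
    return size

assert padded_length(13, 8) == 16         # 13 elements padded up to 16
assert data_size_bytes([4, 8], 4) == 128  # the 4 x 8 x 4 = 128-byte example
```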
In one possible implementation, the dynamic tag information of the first data may be represented as:
dynamic:type2,DIM_B1...Bn,tiling,padding,size,split x[(1_0,1_n-1)...(i_i×n,i_(i+1)n-1)],store,swap
wherein dynamic is the identification indicating that the tag information is dynamic tag information; type2 represents the dynamic data type; DIM_B1...Bn represents the dynamic data dimension order B1...Bn; tiling is the fragmentation parameter; padding is the filling parameter; and size is the data size, which represents the size of the storage space occupied by the data after dimension conversion, tiling, or padding. split x[(1_0,1_n-1)...(i_i×n,i_(i+1)n-1)] represents the split index, where x is the target split dimension and the split length is n: (1_0,1_n-1) indicates that the start split position of the first second data is 0 and its end split position is n-1, and (i_i×n,i_(i+1)n-1) indicates that the start split position of the i-th second data is i×n and its end split position is (i+1)×n-1. store represents the identification information of the target storage space, and swap represents the target data exchange hierarchy. The ",", "[]" and "()" are used in the present disclosure only to separate different parameters in the dynamic tag information and are not necessary content of the dynamic tag information; in practical applications they may be absent or replaced by other identifiers, and the present disclosure is not limited in this respect.
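A hypothetical concrete instance of this format, for a tensor split along target dimension N into four second data of split length n = 8, stored to four memories with inter-cluster exchange (every value below is invented for illustration):

```python
tag = ("dynamic:float16,DIM_NHWC,8,0,4096,"
       "split N[(1_0,1_7)(2_8,2_15)(3_16,3_23)(4_24,4_31)],"
       "mem_a;mem_b;mem_c;mem_d,inter-cluster")
```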
In one possible implementation manner, when determining that the current data state of the first data is consistent with the dynamic tag information, the processed first data may be split into a plurality of second data directly on the target splitting dimension with reference to a start splitting position and an end splitting position of each second data on the target splitting dimension of the first data.
In one possible implementation, the tag information further includes static tag information characterizing information associated with the neural network operation in which the first data participates, the static tag information including at least one of a static data type, a static data dimension order, and a dimension value corresponding to each static data dimension. When the current data state of the first data is inconsistent with the dynamic tag information, processing the first data according to the dynamic tag information to obtain the processed first data comprises at least one of the following items (a sketch follows this list):
Converting the data type of the first data from the static data type to the dynamic data type;
Adjusting the sequence of the data dimension of the first data from a static data dimension sequence to a dynamic data dimension sequence;
filling the first data according to the filling parameters;
Cutting the first data according to the fragmentation parameter.
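A minimal sketch of these normalization steps, assuming the hypothetical tag records introduced earlier; the cutting step is deferred to the later split into second data:

```python
import numpy as np

def normalize_first_data(x: np.ndarray, static_dim_order: str,
                         dyn_tag: "DynamicTag") -> np.ndarray:
    """Bring first data from its static state to the state the dynamic tag
    information expects."""
    # 1. convert the static data type to the dynamic data type
    x = x.astype(dyn_tag.dtype)
    # 2. reorder dimensions from the static order to the dynamic order
    perm = [static_dim_order.index(d) for d in dyn_tag.dim_order]
    x = np.transpose(x, perm)
    # 3. fill each dimension according to the filling parameter
    x = np.pad(x, [(0, p) for p in dyn_tag.padding])
    # 4. cutting by the fragmentation parameter happens when the data is
    #    later split into second data along the target split dimension
    return x
```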
In this implementation, the static tag information may include information describing a data type, a dimension value, etc. of the nature of the first data itself, and further includes information related to the neural network operation involved based on the first data. Thus, the static tag information for the same first data in different neural networks may be different. Static tag information may be determined after the neural network is established. The static tag information of the first data may be applicable to any processor running a neural network, i.e. the static tag information of the first data is unchanged in the different processors. The static tag information of the first data may be automatically detected and determined by the processor during the process of acquiring the first data (the process of inputting the first data by the user), or may be determined according to the information input by the user, which is not limited in the present disclosure.
In one possible implementation, the static tag information may also include a data category. The data categories include any of instructions, input neurons, output neurons, hidden neurons, constant neurons, input weights, output weights, constant weights, and auxiliary data. The data of which category the first data represented by the data category belongs to in the neural network is determined based on information such as whether the user is visible or not, the operation participated in the neural network and the like. The static data type represents the type and number of bits of the first data, e.g., the static data type may be a 32-bit floating point number or an 8-bit fixed point number, etc. The static data dimension may be a dimension of one dimension, two dimensions, multiple dimensions, etc., and the static data dimension order may represent a dimension order of storage and/or reading of the first data.
Alternatively, the static data dimensions may include, but are not limited to, at least one of a channel dimension C, a height dimension H, a width dimension W, a number dimension N, a depth dimension D (used in three-dimensional convolution operations), and a time dimension T (used in RNN network operations such as LSTM). The static data dimension order may be recorded as a permutation of the identifiers of these dimensions, where a dimension appearing earlier (further left) in the permutation is a higher dimension than one appearing later (further right). For example, a static data dimension order of NCHW for the first data indicates that the first data is a four-dimensional tensor whose dimensions are, in order, the number dimension N, the channel dimension C, the height dimension H, and the width dimension W, with the number dimension N being the highest dimension and the width dimension W the lowest. A static data dimension order of TNHWC indicates a five-dimensional tensor whose dimensions are, in order, the time dimension T, the number dimension N, the height dimension H, the width dimension W, and the channel dimension C, with the time dimension T the highest and the channel dimension C the lowest. When the first data is actually input by the user or fetched by the multi-core processor from physical memory, it is mapped into a one-dimensional array according to the determined static data dimension order and stored in physical memory. For a one-dimensional tensor (vector) it may not be easy to determine which dimension it belongs to; this can be determined from the algorithmic definition of the operator. For example, the bias data bias in a two-dimensional convolution is one-dimensional; by the algorithm definition, bias is superimposed onto the output feature map, one number per output channel, with different channels receiving different numbers, so the single dimension of bias can be regarded as the channel dimension C.
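The mapping to a one-dimensional array described here is ordinary row-major flattening, with the leftmost (highest) dimension varying slowest; a short sketch:

```python
def flat_offset(index: tuple, dim_values: tuple) -> int:
    """Row-major offset of a multi-dimensional index under a given static data
    dimension order, e.g. for NCHW the element (n, c, h, w) of an N x C x H x W
    tensor lands at ((n*C + c)*H + h)*W + w."""
    off = 0
    for i, v in zip(index, dim_values):
        off = off * v + i
    return off

# NCHW tensor with N=2, C=3, H=4, W=5: element (1, 2, 3, 4) maps to
# ((1*3 + 2)*4 + 3)*5 + 4 = 119
assert flat_offset((1, 2, 3, 4), (2, 3, 4, 5)) == 119
```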
In this implementation, the dimension value corresponding to each static data dimension represents the length or size of the corresponding static data dimension. For example, a certain first data is a matrix, the static data dimension includes rows and columns, the static data dimension order is row-first, the dimension value of a row is 10, the dimension value of a column is 4, which indicates that the length of a row is 10, and the length of a column is 4.
In one possible implementation, the static tag information of the first data may be represented as:
Static:classification,type1,DIM_A1...An,{x1...xn}
Wherein static is an identification indicating that the tag information is static tag information; classification indicates the data category; and type1 indicates the static data type. DIM_A1...An represents the static data dimension order A1...An, and {x1...xn} records the corresponding dimension values: the dimension value of A1 is x1, ..., and the dimension value of An is xn. The "," and "{}" are used in the present disclosure only to separate different parameters in the static tag information and are not necessary content of the static tag information; in practical applications they may be absent or replaced by other identifiers, and the present disclosure is not limited in this respect.
It should be understood that, a person skilled in the art may set the static tag information, the identification of the data category, and the location of each parameter in the static tag information according to actual needs, which is not limited by the present disclosure.
In one possible implementation manner, the method of the tag using step may further include a constant data packing step: when the first data includes a plurality of constant data, the plurality of second data obtained by splitting each constant datum are packed according to the total number of computing cores in the multi-core processor to form a plurality of first data packets, one corresponding to each computing core, and the first data packets are then stored so that each computing core performs its corresponding operation according to the loaded first data packet. The number of first data packets may be the same as the total number of computing cores.
The first data packet may include a constant data area and a tag area, the constant data area may include one of a plurality of second data obtained by splitting each constant data, and the tag area may include at least one of a constant total data amount tag, a hidden layer total data amount tag of a hidden layer in a neural network operation in which the first data participates, and an input/output address tag in the neural network operation. In the embodiment of the application, the constant data is packed, so that the constant data can be transported to the corresponding computing cores of the multi-core processor at one time in the operation process without repeated transportation, and the operation efficiency can be improved.
In this implementation, the "constant data packing step" may be performed in a compiling stage of the neural network, where constant data corresponding to the neural network may be obtained according to data static tag information (such as a data class in the static tag information) of the first data in the compiling stage, and at least one constant data is respectively split and packed into a plurality of first data packets, where the number of first data packets may be equal to the number of computing cores in the multicore processor. And before the computing core executes the assigned operation task, the host may send each first data packet to the "local memory" or the "out-of-core buffer" corresponding to the computing core in advance, and the computing core may load the first data packet into the local memory so as to execute the subsequent operation task. In this way, the constant data required by the execution of the operation task can be obtained in advance by the calculation core, the process of executing the operation task by the calculation core is simplified, and the speed of executing the operation task by the calculation core can be improved due to the fact that the constant data are obtained in advance, so that the speed of operating the first data is improved.
In this implementation, constant data may refer to data that does not change (or changes only rarely) during execution of an operation task once the architecture of the multi-core processor and the neural network is determined. Because the model parameters of a trained neural network are essentially unchanged, the constant data may include weight parameters, offset parameters, scaling parameters, beta parameters, and the like in the first data; and since the first instruction obtained after compiling does not change during computation, the constant data may also include the first instruction. Of course, the constant data may also include third data needed to perform the operation task.
Some constant data may be split into a plurality of second data according to the splitting index for use by different computing cores, for example the convolution filter parameters (weight parameters), offset parameters, scaling parameters and beta parameters in the first data, while other constant data may not need to be split. As for the third data, the third data required by a computing core can be determined according to the operation task that the computing core needs to execute.
For convenience of description, each second data obtained by splitting constant data according to a corresponding splitting index is also referred to herein as a constant data block.
In this implementation, the constant total data amount flag is used to record the total data amount of all constant data blocks in the first data packet; it may also record the data amount of each individual constant data block. The hidden layer total data amount flag records the total data amount of the hidden layer data in the neural network; it may also indicate the size of each hidden layer datum, the sum of whose data amounts equals the hidden layer total data amount. The input/output addresses in the neural network refer to the addresses of the input data and the output data: the input data may be the data to be operated on that the computing core obtains each time it executes an operation task, and the output data may be the result obtained after operating on that data.
In one possible implementation, the constant data, the hidden layer data, the input/output addresses, and the like in the first data packet may be arranged in a specific order to form different data segments. The "constant data packing step" then also includes calculating the intra-segment offset of each data segment, and may further include at least one of: determining the first intra-segment offsets, determining the second intra-segment offsets, and determining the third intra-segment offsets. The execution order of these operations is not particularly limited in the present disclosure; they may be executed sequentially in a specific order or in parallel.
Specifically:

Determining the first intra-segment offsets means determining and recording, according to the constant total data amount of each first data packet and the data amount of each second data it contains, the first intra-segment offset of each second data within the constant data segment of the corresponding first data packet. The constant data segment is the segment of the first data packet in which the constant data blocks are recorded. The computing core can thus read each constant data block from the constant data segment according to its first intra-segment offset.
Determining the second intra-segment offsets means determining and recording, according to the total data amount of the input and output addresses of each first data packet and the data amount of each input address and each output address, the second intra-segment offset of each address within the address data segment of the corresponding first data packet. The address data segment is the segment used to record the input addresses and output addresses. The computing core can thus obtain each input address and/or output address according to the second intra-segment offsets, and further obtain the input data and store the operation results.
Determining the third intra-segment offsets means determining and recording, according to the hidden-layer total data amount of each first data packet and the data amount of each piece of hidden layer data it contains, the third intra-segment offset of each piece of hidden layer data within the hidden layer data segment of the corresponding first data packet. The hidden layer data segment is used to record the data amount (i.e., size) of each piece of hidden layer data. The computing core can thus determine the data amount of each piece of hidden layer data within the hidden-layer total data amount according to the third intra-segment offsets, which improves the speed and efficiency of the packing processing.
Fig. 3 shows a schematic diagram of a first data packet according to an embodiment of the present disclosure. In the "constant data packing step", a plurality of constant data to be packed is first determined. As shown in fig. 3, assume that the constant data includes instruction 1 and instruction 2 (both first instructions) and a convolution filter, and that this constant data is allocated to computing core 1 and computing core 2. According to the tag information of instruction 1, instruction 2, and the convolution filter, it may be determined that instruction 1 and instruction 2 are the instructions to be executed by computing core 1 and computing core 2, respectively, and that the convolution filter needs to be split according to the splitting index into convolution filter data block 1 and convolution filter data block 2, used by computing core 1 and computing core 2, respectively. From the input and output addresses, input address 1 (the address from which all data to be operated on is obtained) and output address 1 (the address at which the operation result is stored) required by computing core 1 are determined, as are input address 2 and output address 2 required by computing core 2. Instruction 1, instruction 2, convolution filter data block 1, convolution filter data block 2, input address 1, input address 2, output address 1, and output address 2 are then re-ordered according to which computing core uses them, the re-ordered data is packed into first data packet 1 required by computing core 1 and first data packet 2 required by computing core 2, and a constant total data amount flag is recorded in each packet.
Taking first data packet 1 as an example, the second intra-segment offsets of input address 1 and output address 1 within the address data segment also need to be calculated: since input address 1 and output address 1 each have a data amount of 8, the second intra-segment offset of input address 1 is "0" and that of output address 1 is "8". The first intra-segment offsets of instruction 1 and convolution filter data block 1 are calculated likewise: since the data amount of instruction 1 is 2048 and that of convolution filter data block 1 is 1024, the first intra-segment offset of instruction 1 is 0, the first intra-segment offset of convolution filter data block 1 is 2048, and the constant total data amount flag in first data packet 1 is 3072. Finally, before the computing cores execute their allocated operation tasks, the host may send first data packet 1 and first data packet 2 to computing core 1 and computing core 2, respectively, or to the target storage spaces accessed by computing core 1 and computing core 2, respectively.
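To make the offset bookkeeping concrete, the following is a minimal Python sketch (illustrative only; the names and the byte-level layout are assumptions, not the patent's actual implementation) that packs the segments of first data packet 1 from fig. 3 and reproduces the offsets computed above.

```python
# Illustrative sketch: lay out blocks back to back within one data segment
# and record each block's intra-segment offset.

def pack_segment(blocks):
    """blocks: list of (name, size_in_bytes); returns (offsets, total_size)."""
    offsets, cursor = {}, 0
    for name, size in blocks:
        offsets[name] = cursor   # intra-segment offset of this block
        cursor += size
    return offsets, cursor

# Constant data segment of first data packet 1 in fig. 3:
# instruction 1 (2048 bytes) followed by convolution filter block 1 (1024 bytes).
const_offsets, const_total = pack_segment([("instr1", 2048), ("filter1", 1024)])
assert const_offsets == {"instr1": 0, "filter1": 2048} and const_total == 3072

# Address data segment: input address 1 and output address 1, 8 bytes each.
addr_offsets, _ = pack_segment([("in_addr1", 8), ("out_addr1", 8)])
assert addr_offsets == {"in_addr1": 0, "out_addr1": 8}
```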
In a possible implementation manner, before performing step S11, the host may further perform a "tag generation step" in advance to generate the dynamic tag information and static tag information of the first data, where the dynamic tag information may include the splitting index, the target storage space, and the target data exchange hierarchy of the first data.
Fig. 4 shows a flow chart of a data processing method according to an embodiment of the present disclosure. In one possible implementation, as shown in fig. 4, the "tag generation step" may include steps S14 to S16.
In step S14, the first data to participate in a neural network operation is acquired, together with the tag information of the first data, the tag information including the static tag information (as described above) and the dynamic tag information (not yet including the splitting index, the target storage space, or the target data exchange hierarchy, but including the other information described above).
In step S15, according to the determined parameters of the operator corresponding to the operation in which the first data participates, the number of memory channels in the multi-core processor, and the tag information, a splitting index required for splitting the first data into a plurality of second data is determined.
In this embodiment, the parameters of an operator are used to represent or describe the kind of operation the operator performs and the parameters the operator needs in order to implement that operation, such as the parameters of operators for convolution operations, pooling operations, and the like.
In one possible implementation, step S15 may include:
determining a splitting strategy corresponding to the first data from a splitting strategy database according to the determined parameters of the operator corresponding to the operation in which the first data participates and the tag information;
And determining a splitting index required for splitting the first data into a plurality of second data according to the splitting strategy, the number of memory channels in the multi-core processor and the tag information.
The splitting strategy includes at least one splittable dimension of the first data and a priority level corresponding to each splittable dimension. The target split dimension in the splitting index is a dimension selected from the at least one splittable dimension.
In this implementation, the splitting strategy may indicate the selectable splitting manners for data participating in different types of neural network operations on different multi-core processors, and for different data classes within those operations, so that the data can be split into a plurality of data in different manners. The target split dimension in the splitting index is determined from the splittable dimensions of the splitting strategy; the starting split position and ending split position of each second data along the target split dimension of the first data are then further determined, finally yielding the splitting index. The splitting strategy may be predetermined and stored in memory in the form of a splitting strategy database. Of course, in other embodiments, the splitting strategy may also be determined on the fly by the data processing apparatus according to the operation in which the first data participates and the characteristics of the arithmetic units of the multi-core processor. The splitting strategy database records the predetermined splitting strategies, with corresponding splitting strategies provided for different data classes under different operators.
For example, when the first data (such as an input neuron or an input weight) participates in the operation as an "input", the splitting index may be used to split the first data into a plurality of second data, thereby implementing the operation. When the first data (such as an output neuron) participates as an "output", the splitting index may be used to indicate or predict how many intermediate results, produced because the "input" was split, will be combined into the final output neuron (i.e., the final operation result); that is, it indicates the splitting relationship between the intermediate results and the output neuron. Alternatively, the output neuron may not be splittable because of its data size limitation, in which case it has no corresponding splitting index. In an actual operation task, one or more data to be operated on are usually operated on to obtain an operation result, and any of the data to be operated on and the operation result may serve as the first data for which the splitting index is determined, which is not limited in this disclosure.
In step S16, the splitting index is added to the dynamic tag information, so that the multi-core processor can perform the neural network operation on the first data based on the tag information.
In this embodiment, after the splitting index is added to the dynamic tag information, when executing an operation task using the first data as the data to be operated on, the multi-core processor may split the first data into a plurality of second data according to the splitting index in the dynamic tag and then execute the operation task (i.e., execute step S12); or, when the first data is the operation result, it may determine from the splitting index how the related calculation process proceeds during execution of the operation task. Thus, with the aid of the tag information, the multi-core processor can simplify the splitting and layout of the data involved in the operation task and increase its operation speed.
In one possible implementation, before executing step S14, the dynamic tag information required for executing steps S14-S16 (i.e., the dynamic tag information that does not yet include the splitting index, the target storage space, or the target data exchange hierarchy, but includes the other information described above) may be obtained in advance through a "step of determining dynamic tag information", which includes obtaining information of the target processor (i.e., the above-described multi-core processor) that runs the neural network. The information of the target processor may include information related to the computing power and performance of the target processor, such as the data types the target processor can process, the dimension order in which the target processor reads and stores data, and the number of data bits (or the amount of data) the target processor processes at a time. Further, based on the information of the target processor and the static tag information of the first data, the data processing apparatus may determine the dynamic tag information of the first data by at least one of the following operations (a sketch follows the list):
determining the dynamic data type according to the data types the target processor can process;

determining the dynamic data dimension order according to the dimension order in which the target processor reads and stores data;

determining the slicing parameters according to the number of data bits the target processor processes at a time;

determining the filling parameters according to the slicing parameters and the dimension values of the static data dimensions;

and determining the data size according to the dimension values of the static data dimensions, the slicing parameters, and the filling parameters.
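As an illustration of the last three operations, the following Python sketch (hypothetical helper names; element sizes assume a float16 dynamic data type) derives the slice count, filling parameter, and raw data size from a dimension value and a slice length; the values match the tags of input neuron I and input weight W in the application example later in this document.

```python
import math

# Hypothetical helper: derive slicing/filling parameters and the data size
# of the dynamic tag from a target-processor slice length.
def derive_dynamic_params(dim_value, slice_len, elem_bytes, n_other=1):
    n_slices = math.ceil(dim_value / slice_len)
    padding = n_slices * slice_len - dim_value     # filling parameter (fill length)
    raw_size = n_other * dim_value * elem_bytes    # size before padding
    return n_slices, padding, raw_size

# Input neuron I of Example 1: C = 1024, slice length 256, float16.
assert derive_dynamic_params(1024, 256, 2) == (4, 0, 2048)        # 2 KB
# Input weight W: C = 1000, slice length 512, float16, N = 4 rows.
assert derive_dynamic_params(1000, 512, 2, n_other=4) == (2, 24, 8000)
```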
In one possible implementation, the apparatus may determine the data class of the first data according to the out-degree and in-degree of the first data in the neural network and the operation in the neural network in which the first data participates.
In this implementation, the in-degree represents the number of preceding operation nodes for the first data as a data node (the first data is an output of those preceding operation nodes), and the out-degree represents the number of subsequent operation nodes for the first data as a data node (the first data is an input of those subsequent operation nodes). For example, if a certain first data cc is the output of 1 preceding operation node and the input of 3 subsequent operation nodes, the out-degree of cc is 3 and its in-degree is 1. Different codes may be set for different data classes to distinguish them. Table 1 below describes the characteristics and corresponding identifications of the different data classes.
TABLE 1 data category, corresponding identification and data characteristics
An instruction has an out-degree and in-degree of zero; it is used to trigger a neural network operation instruction for the data to be operated on and/or for an operation task of a computing core. Input neurons, constant neurons, input weights, constant weights, and auxiliary data have an out-degree greater than or equal to 1 and an in-degree of 0. Output neurons have an out-degree of 0 and an in-degree greater than or equal to 1. Hidden neurons have both an out-degree and an in-degree greater than or equal to 1.
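A coarse Python rendering of this degree-based classification follows (illustrative; distinguishing among the classes that share the same degree pattern, such as input neurons versus input weights, requires additional graph context not modeled here):

```python
# Sketch of the degree-based data-class rule described above.
def coarse_data_class(out_degree, in_degree):
    if out_degree == 0 and in_degree == 0:
        return "instruction"
    if in_degree == 0:                 # produced by no operation node
        return "input/constant data"   # input/constant neuron, weight, auxiliary data
    if out_degree == 0:
        return "output neuron"
    return "hidden neuron"             # both degrees >= 1

assert coarse_data_class(3, 1) == "hidden neuron"   # the first data cc above
```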
In one possible implementation, determining the splitting strategy corresponding to the first data from the splitting strategy database according to the determined parameters of the operator corresponding to the operation in which the first data participates and the tag information may include:

acquiring a computation graph corresponding to the neural network, the computation graph including operation nodes, data nodes, and the connection relationships between the data nodes and the operation nodes, where a data node may contain the tag information of the first data, the tag information including static tag information and dynamic tag information;

obtaining information of the target operation node connected to the data node where the first data is located in the computation graph, the information of the target operation node including the parameters of the operator corresponding to the operation of the target operation node;

and determining the splitting strategy corresponding to the first data from the splitting strategy database according to the parameters of the operator and the static tag information of the first data.
In this implementation, the static tag information and part of the dynamic tag information (excluding the splitting index, the target storage space, and the target data exchange hierarchy) may first be determined according to the computation graph of the neural network, and the determined tag information is stored bound to the corresponding data node.
In this implementation, the static tag information may further include the data class, so that the splitting strategy corresponding to the first data can be determined from the splitting strategy database according to the parameters of the operator and the data class of the first data. For example, referring to Table 2 below, assume it is determined based on the above steps that the operator corresponding to the operation in which the first data participates is the two-dimensional convolution Conv and its data class is input weight; the corresponding splitting strategy is then "the splittable dimension is N, the candidate storage space is Cluster, and the candidate data exchange hierarchy is No". Assume instead that the operator is the two-dimensional convolution Conv and the data class is input neuron; the corresponding splitting strategy is then "the splittable dimensions are N, H, W (in decreasing priority); when the split dimension is N, the candidate storage space is Mem and the candidate data exchange hierarchy is No; when the split dimension is H or W, the candidate storage space is Cluster and the candidate data exchange hierarchy is Cluster".
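A minimal sketch of such a splitting strategy database, holding only the two Conv entries quoted above (keys and field names are invented for illustration):

```python
# Illustrative splitting strategy database keyed by (operator, data class).
SPLIT_STRATEGY_DB = {
    ("Conv", "input_weight"): {
        "split_dims": ["N"],                      # priority order, high first
        "candidate_storage": {"N": "Cluster"},
        "candidate_exchange": {"N": "No"},
    },
    ("Conv", "input_neuron"): {
        "split_dims": ["N", "H", "W"],
        "candidate_storage": {"N": "Mem", "H": "Cluster", "W": "Cluster"},
        "candidate_exchange": {"N": "No", "H": "Cluster", "W": "Cluster"},
    },
}

def lookup_strategy(operator, data_class):
    return SPLIT_STRATEGY_DB[(operator, data_class)]

strategy = lookup_strategy("Conv", "input_neuron")
assert strategy["candidate_storage"]["H"] == "Cluster"
```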
In one possible implementation, determining the splitting index for splitting the first data into a plurality of second data according to the splitting strategy corresponding to the first data, the number of memory channels in the multi-core processor, and the tag information may include:
determining a target split dimension from the at least one splittable dimension according to the number of memory channels and the operation dimension value of each dimension of the first data during the neural network operation (determined according to the tag information), and determining the split length corresponding to the target split dimension, where the split length is the length of the second data along the target split dimension;

and determining the splitting index for splitting the first data into a plurality of second data according to the number of memory channels, the operation dimension value of the first data in the target split dimension, and the split length.
In this implementation, the operation dimension value refers to the dimension value of each dimension of the first data during the actual operation.
In this embodiment, the purpose of splitting the first data is to exploit the parallel processing of the multi-core processor as fully as possible. Therefore, before the split length is determined, the target number of second data may be determined, and the split length is then determined according to the dimension value of the target split dimension and the target number. The relationship between the operation dimension value of the target split dimension, the total number of computing cores of the multi-core processor (i.e., the number of computing cores the multi-core processor contains), and the number of memory channels may be evaluated, and the split length is then determined in the corresponding manner: mode one, mode two, or mode three below.
In mode one, when the operation dimension value of the target split dimension is greater than or equal to the total number of computing cores, the target number of second data is determined to be the total number of computing cores, and the split length is the operation dimension value of the first data in the target split dimension divided by the total number of computing cores; when the ratio is not an integer, a preset rounding function may be applied to obtain the split length. The rounding function may be a floor function (such as Floor, taking the largest integer not greater than the ratio), a ceiling function (such as Ceil, taking the smallest integer greater than or equal to the ratio), a rounding-to-nearest function (such as Round, rounding the ratio to the specified decimal place), or the like. This ensures that every computing core participates in the actual operation, reduces the idle computing power of the multi-core processor, and improves operation efficiency and speed.
In mode two, when the operation dimension value of the target split dimension is smaller than the total number of computing cores but greater than or equal to the number of memory channels, the number of memory channels is determined as the target number of second data, and the split length is the operation dimension value of the first data in the target split dimension divided by the number of memory channels; when the ratio is not an integer, a preset rounding function may likewise be applied to obtain the split length.
In mode three, when the operation dimension value of the target split dimension is smaller than the number of memory channels, the operation dimension value of the target split dimension is determined as the target number of second data, and the split length is one unit of the target split dimension.
Modes one, two, and three above for determining the split length ensure that the computing cores in the multi-core processor all participate in the operation task and that the operation task is executed at high speed and efficiency.
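The three modes can be summarized in a short Python sketch (illustrative; `rounding` stands in for the configurable floor/ceil/round function mentioned in mode one):

```python
import math

# Sketch of the three split-length modes described above.
def split_plan(dim_value, total_cores, n_channels, rounding=math.floor):
    if dim_value >= total_cores:          # mode one
        target_count = total_cores
    elif dim_value >= n_channels:         # mode two
        target_count = n_channels
    else:                                 # mode three: one unit per block
        return dim_value, 1
    ratio = dim_value / target_count
    split_len = ratio if ratio == int(ratio) else rounding(ratio)
    return target_count, int(split_len)

assert split_plan(1024, 4, 2) == (4, 256)   # mode one (input neuron I below)
assert split_plan(2, 4, 2) == (2, 1)        # mode two  (example 2 below)
assert split_plan(2, 4, 4) == (2, 1)        # mode three (example 3 below)
```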
In one possible implementation, the static tag information may include the static data dimensions and the dimension value corresponding to each static data dimension, and the dynamic tag information may include the dynamic data type, the filling parameters, and the data size. Determining the splitting index required for splitting the first data into a plurality of second data according to the splitting strategy, the number of memory channels in the multi-core processor, and the tag information may further include:
determining the operation dimension value of each dimension of the first data during the neural network operation according to the dimension value of each static data dimension of the first data and the filling parameters. The sum of the dimension value of a static data dimension and the fill data length corresponding to that dimension in the filling parameters may be determined as the operation dimension value.
In this implementation, the operation dimension value may also be calculated in other manners based on the static tag information and the dynamic tag information of the first data, which is not limited in this disclosure.
In one possible implementation, the operation dimension values may also be pre-computed and stored in the dynamic tag information. This makes it convenient to obtain the operation dimension values directly from the dynamic tag information, so that only the split length and the splitting index need to be determined, simplifying the process of determining the splitting index and improving efficiency.
In one possible implementation, determining the target split dimension from the at least one splittable dimension according to the number of memory channels and the operation dimension value of each dimension of the first data during the neural network operation, and determining the split length corresponding to the target split dimension, may include:
when there are a plurality of splittable dimensions, judging one by one, in order of priority from high to low, whether each splittable dimension satisfies the splitting condition, and when the current splittable dimension satisfies the splitting condition, determining the current splittable dimension as the target split dimension,

where the splitting condition is that the operation dimension value of the first data in the current splittable dimension is greater than or equal to the number of memory channels.
For example, assume the first data has three dimensions w1, w2, w3 with operation dimension values w1=2, w2=32, w3=64, that the corresponding splitting strategy lists the splittable dimensions w1 and w3, and that w1 has higher priority than w3. If the multi-core processor has 4 channels, the highest-priority splittable dimension w1 is judged first; since the operation dimension value 2 of w1 is smaller than the channel count 4, the splitting condition is not satisfied, so the next splittable dimension w3 is judged. Since the operation dimension value 64 of w3 satisfies the splitting condition, w3 is taken as the target split dimension of the first data.
In one possible implementation, when none of the splittable dimensions satisfies the splitting condition, that is, the operation dimension value in every splittable dimension is smaller than the number of memory channels, the splittable dimension with the largest operation dimension value is determined as the target split dimension. In this way, it can be ensured that as many computing cores as possible in the multi-core processor participate in executing the operation task for the first data.
For example:
Example 1: assume a multi-core processor includes 2 memory channels and 2 computing core clusters, each cluster including 2 computing cores. A certain first data includes three dimensions a, b, and c, with dimension values a=1, b=4, and c=2; its splittable dimensions are a and b, with a having higher priority than b. The splittable dimension a is judged first: since a=1 is smaller than the number of memory channels 2 (i.e., the splitting condition is not satisfied), a cannot be the target split dimension. The splittable dimension b is judged next: since b=4 is greater than the number of memory channels, b is determined as the target split dimension. Further, since b=4 equals the total number of computing cores 4, the target number of second data is the total number of computing cores 4, and the split length = dimension value 4 of the target split dimension b ÷ total number of computing cores 4 = 1.
Example 2: assume a multi-core processor includes 2 memory channels and 2 computing core clusters, each cluster including 2 computing cores. A certain first data includes three dimensions a, b, and c, with dimension values a=1, b=4, and c=2; its splittable dimensions are a and c, with a having higher priority than c. The splittable dimension a is judged first: since a=1 is smaller than the number of memory channels 2 (i.e., the splitting condition is not satisfied), a cannot be the target split dimension. The splittable dimension c is judged next: since c=2 is smaller than the total number of computing cores but equal to the number of memory channels, c is determined as the target split dimension. Further, since c=2 is smaller than the total number of computing cores 4 and equal to the number of memory channels, the target number of second data is the number of memory channels 2, and the split length = dimension value 2 of the target split dimension c ÷ number of memory channels 2 = 1.
Example 3: assume a multi-core processor includes 4 memory channels and 2 computing core clusters, each cluster including 2 computing cores. A certain first data includes three dimensions a, b, and c, with dimension values a=1, b=2, and c=2; its splittable dimensions are a and c, with a having higher priority than c. The splittable dimension a is judged first: since a=1 is smaller than the number of memory channels 4 (i.e., the splitting condition is not satisfied), a cannot be the target split dimension. The splittable dimension c is judged next: c=2 is also smaller than the number of memory channels 4 (the splitting condition is not satisfied). At this point, since c=2 > a=1, the splittable dimension c is determined as the target split dimension. Further, since c=2 is smaller than the number of memory channels 4, the target number of second data is the dimension value 2 of the target split dimension c, and the split length is one unit (per mode three above).
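The priority-ordered selection with its fallback, applied to the three examples above, might look as follows (a sketch; function names are illustrative):

```python
# Sketch of target-split-dimension selection for the three examples above.
def pick_target_dim(dim_values, splittable, n_channels):
    """splittable: candidate dimensions in descending priority order."""
    for d in splittable:                       # highest priority first
        if dim_values[d] >= n_channels:        # the splitting condition
            return d
    # Fallback: no candidate satisfies the condition; take the one with
    # the largest operation dimension value.
    return max(splittable, key=lambda d: dim_values[d])

dims = {"a": 1, "b": 4, "c": 2}
assert pick_target_dim(dims, ["a", "b"], n_channels=2) == "b"   # example 1
assert pick_target_dim(dims, ["a", "c"], n_channels=2) == "c"   # example 2
dims3 = {"a": 1, "b": 2, "c": 2}
assert pick_target_dim(dims3, ["a", "c"], n_channels=4) == "c"  # example 3
```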
In the actual operation process, an operation task involves at least performing an arithmetic operation between at least two data to be operated on to obtain an operation result, and any of the data to be operated on and the operation result may serve as the first data for which the splitting index is determined, which is not limited in this disclosure.
In this embodiment, because operators, target split dimensions, and multi-core processor architectures differ, the target storage space of the second data obtained by splitting the first data needs to be set for the different computing cores, so as to ensure that the computing cores that actually use the second data to execute the corresponding operation tasks can read and write it conveniently and quickly.
In a possible implementation, step S15 may include determining the at least one splittable dimension corresponding to the first data and the priority level corresponding to each splittable dimension according to the determined parameters of the operator corresponding to the operation in which the first data participates, the characteristics of the arithmetic units of the multi-core processor, and the tag information (including the dynamic data type and the data class).
In this implementation, the characteristics of the arithmetic units of the multi-core processor may include: the dimensions along which data of different data types can be split when executing different operators; whether the different split dimensions match the hardware configuration of the multi-core processor itself, or match the processing capability of the multi-core processor as set by the user; and other characteristics related to the processor's hardware configuration and user-defined settings, such as the influence of different split-dimension choices on operation efficiency. This ensures that a splitting strategy can be determined that meets the requirements on operation efficiency and speed.
In this implementation, a strategy model for determining the splitting strategy may be trained in advance according to the parameters of existing operators, the characteristics of the arithmetic units of the multi-core processor, and the tag information; or a correspondence among the operator parameters, the arithmetic-unit characteristics, the tag information, and the splitting strategy may be established, so that the splitting strategy of the first data can be determined in real time.
In one possible implementation, the multi-core processor is provided with a plurality of computing core clusters, each computing core cluster including a plurality of computing cores, and the "tag generation step" may further include:
determining the candidate storage space corresponding to each splittable dimension according to the determined parameters of the operator corresponding to the operation in which the first data participates and the storage spaces provided in the multi-core processor;

and determining the target storage space corresponding to the first data from the candidate storage spaces according to the target split dimension, and adding the identification information of the target storage space to the dynamic tag information,

where the storage spaces include a plurality of memories (as shown in fig. 1d, where no out-of-core cache is provided for the computing core clusters in the multi-core processor), or the storage spaces include a plurality of memories and a plurality of out-of-core caches (as shown in fig. 1e, where an out-of-core cache is provided for each computing core cluster); each computing core can access any of the memories, and the computing cores in each computing core cluster share the corresponding out-of-core cache.
In this implementation, a model for determining the candidate storage space may be trained in advance according to the parameters of existing operators and the configuration of the storage spaces provided in the multi-core processor; or a correspondence among the operator parameters, the storage spaces of the multi-core processor, and the candidate storage spaces may be established, so that the candidate storage space of the first data can be determined in real time.
In this implementation, the multi-core processor is provided with a plurality of memories; each computing core cluster can access its corresponding "local memory" through the channel between them, and can also access the other memories through the channels and routing nodes. Multi-core processors of different architectures may or may not be provided with out-of-core caches. The candidate storage space corresponding to each splittable dimension can be determined according to the parameters of the operator and the storage spaces provided in the multi-core processor; the choice of candidate storage space is influenced by the operator, and whether the storage spaces of the multi-core processor include an out-of-core cache determines whether the out-of-core cache is a selectable target for the operator.
In this implementation, there may be one or more candidate storage spaces, and in the process of determining them, the splittable dimension corresponding to each candidate storage space may be determined. Each candidate storage space corresponds to one or more splittable dimensions; different splittable dimensions may correspond to the same candidate storage space, but one splittable dimension has only one corresponding candidate storage space (i.e., one splittable dimension cannot correspond to two different candidate storage spaces). The candidate storage space whose corresponding splittable dimensions include the target split dimension is determined as the target storage space.
For example, fig. 5 shows an architecture schematic of a multi-core processor 1 according to an embodiment of the present disclosure. As shown in fig. 5, the multi-core processor 1 includes memory 1 and memory 2, computing core cluster 1 (containing computing core 1 and computing core 2) with its corresponding out-of-core cache 1, and computing core cluster 2 (containing computing core 3 and computing core 4) with its corresponding out-of-core cache 2; the computing cores access the memories through routing nodes R. The class of the memories is identified as mem and the class of the out-of-core caches as cluster; the identification information of memory 1 and memory 2 is mem1 and mem2, respectively, and that of out-of-core cache 1 and out-of-core cache 2 is cluster1 and cluster2, respectively. Assume the first data a is an input neuron that will be split into two second data a1, a2 to be stored in memory. Either the identification information "mem" indicating the class of the memories (a1 and a2 may be stored in mem1 and mem2, respectively) or the identification information "mem1" and "mem2" of memory 1 and memory 2 themselves may be stored in the dynamic tag information.
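The rule that the target storage space is the candidate mapped from the target split dimension reduces to a simple lookup; a sketch under assumed names:

```python
# Sketch: each splittable dimension maps to exactly one candidate storage
# space; the target storage space is the candidate mapped from the target
# split dimension.
candidate_storage = {"N": "mem", "C": "mem"}   # e.g. input neuron a of fig. 5

def target_storage(candidate_storage, target_split_dim):
    return candidate_storage[target_split_dim]

# Splitting a along C: blocks a1, a2 go to the memory class "mem"
# (concretely mem1 and mem2 in fig. 5).
assert target_storage(candidate_storage, "C") == "mem"
```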
In one possible implementation, the splitting strategy further includes the candidate storage spaces, each splittable dimension being provided with a corresponding candidate storage space, and the "tag generation step" may further include determining the target storage space from the candidate storage spaces according to the target split dimension, and adding the identification information of the target storage space to the dynamic tag information,
where the storage spaces include a plurality of memories, or a plurality of memories and a plurality of out-of-core caches; each computing core can access any of the memories, and the computing cores in each computing core cluster share the corresponding out-of-core cache.
In this implementation, the candidate storage spaces of the data can be determined in advance and stored in the splitting strategy database, which simplifies the determination of the target storage space and improves the speed and efficiency of data processing.
In one possible implementation, the "tag generation step" may further include:
determining the candidate data exchange hierarchy corresponding to each splittable dimension according to the determined parameters of the operator corresponding to the operation in which the first data participates, the characteristics of the arithmetic units of the multi-core processor, and the tag information;

and determining the target data exchange hierarchy corresponding to the first data from the candidate data exchange hierarchies according to the target split dimension, and adding the target data exchange hierarchy to the dynamic tag information.
In this implementation, a model for determining the candidate data exchange hierarchy may be trained according to the parameters of operators, the characteristics of the arithmetic units of the multi-core processor, and the tag information; or a correspondence among the operator parameters, the arithmetic-unit characteristics, the tag information, and the candidate data exchange hierarchy may be established, so that the candidate data exchange hierarchy of the first data can be determined in real time.
In this implementation, there may be one or more candidate data exchange hierarchies, and in the process of determining them, the splittable dimension corresponding to each candidate data exchange hierarchy may be determined. Each candidate data exchange hierarchy corresponds to one or more splittable dimensions; different splittable dimensions may correspond to the same candidate data exchange hierarchy, but one splittable dimension has only one corresponding candidate data exchange hierarchy (i.e., one splittable dimension cannot correspond to two different candidate data exchange hierarchies). The candidate data exchange hierarchy whose corresponding splittable dimensions include the target split dimension is determined as the target data exchange hierarchy.
In one possible implementation, the splitting strategy further includes the candidate data exchange hierarchy corresponding to each splittable dimension, and the "tag generation step" further includes:

determining the candidate data exchange hierarchy corresponding to the target split dimension among the at least one candidate data exchange hierarchy as the target data exchange hierarchy corresponding to the first data, and adding the target data exchange hierarchy to the dynamic tag information.
In this implementation, the candidate data exchange hierarchies of the data can be determined in advance and stored in the splitting strategy database, which simplifies the determination of the target data exchange hierarchy and improves the speed and efficiency of data processing.
In this embodiment, the splitting strategy of the first data may be set in advance in the manner described above. To further illustrate the setting of the splitting strategy, taking the architecture of the multi-core processor shown in fig. 5 and image data as the first data as an example, Table 2 below gives an example of the splitting strategy of the first data in an embodiment of the disclosure.
Table 2. Splitting strategy example
Here, splittable dimension N represents the number (batch) dimension, splittable dimension H the height dimension, splittable dimension W the width dimension, and splittable dimension C the channel dimension.
"Mem" in the storage space to be selected is the type identifier of the memory, and "Cluster" represents the identifier of the cache outside the core.
The "No" in the alternative data exchange hierarchy indicates No exchange, "Cluster" indicates inter-Cluster data exchange, and "Core" indicates inter-Core data exchange. The priority level of the former detachable dimension in the same data is higher than that of the later detachable dimension, and the input neuron in the two-dimensional convolution Conv is taken as an example, the detachable dimension is N, H, W, and the priority levels are N, H, W from high to low.
In this embodiment, a model may be built from the correspondence between the reference information related to the splitting strategy (including the splittable dimensions, the candidate storage spaces, and the candidate data exchange hierarchies) and the splitting strategy, so that the splittable dimensions, candidate storage space, and data exchange hierarchy of the first data can be determined in real time from the model, and the splitting index, the identification information of the target storage space, and the target data exchange hierarchy can then be added to the dynamic tag information of the first data. Alternatively, the splitting strategy database, recording the various splitting strategies required, may be determined in advance from the reference information, which is not limited by the present disclosure.
Application example
An application example according to an embodiment of the present disclosure is given below, taking "processing operation instruction A" as an exemplary application scenario, to facilitate understanding of the flow of the data processing method. Those skilled in the art will appreciate that the following application examples are intended only to facilitate understanding of the embodiments of the present disclosure and should not be construed as limiting them.
Example 1
Assume the data involved in processing operation instruction A includes an input neuron I, an input weight W, and an output neuron O; the multi-core processor executing operation instruction A is the multi-core processor 1 shown in fig. 5; and operation instruction A is a fully-connected MLP operation performed in a model-parallel manner. The input neuron I, input weight W, and output neuron O are 1×1024, 1024×4, and 1×4 data, respectively, and the computation is I×W=O. The input neuron I, input weight W, and output neuron O can each be treated as "first data" for which the corresponding splitting index, target storage space, and target data exchange hierarchy are determined.
Tag information of the input neuron I: Static: IN, float32, DIM_NC, {1, 1024}; Dynamic: float16, DIM_NC, C=256, C=0, 2KB. That is, the data class of I is input neuron, the static data type is float32, the static data dimensions are N and C with dimension values 1 and 1024, and the static data dimension order is NC. The dynamic data type is float16, the dynamic data dimension order is NC, the slicing parameter slices along dimension C with length 256, the filling parameter specifies a fill length of 0 along dimension C, and the data size is 2KB.
Tag information of the input weight W: Static: IW, float32, DIM_CN, {1000, 4}; Dynamic: float16, DIM_NC, C=512, C=24, 8000 bytes. That is, the data class of W is input weight, the static data type is float32, the static data dimensions are N and C with dimension values 4 and 1000, and the static data dimension order is CN. The dynamic data type is float16, the dynamic data dimension order is NC, the slicing parameter slices along dimension C with length 512, the filling parameter specifies a fill length of 24 along dimension C, and the data size is 8000 bytes.
Tag information of the output neuron O: Static: ON, float32, DIM_NC, {1, 4}; Dynamic: float16, DIM_NC, C=4, C=0, 8KB. That is, the data class of O is output neuron, the static data type is float32, the static data dimensions are N and C with dimension values 1 and 4, and the static data dimension order is NC. The dynamic data type is float16, the dynamic data dimension order is NC, the slicing parameter slices along dimension C with length 4, the filling parameter specifies a fill length of 0 along dimension C, and the data size is 8KB.
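The tag strings above can be read as structured records; the following dataclass rendering is purely illustrative (field names are invented):

```python
from dataclasses import dataclass

# Illustrative structure for the tag information strings above.
@dataclass
class TagInfo:
    data_class: str          # IN / IW / ON
    static_dtype: str
    static_dim_order: str
    static_dims: tuple
    dyn_dtype: str
    dyn_dim_order: str
    slice_len: int           # slicing parameter along C
    fill_len: int            # filling parameter along C
    size: str

I_tag = TagInfo("IN", "float32", "DIM_NC", (1, 1024), "float16", "DIM_NC", 256, 0, "2KB")
W_tag = TagInfo("IW", "float32", "DIM_CN", (1000, 4), "float16", "DIM_NC", 512, 24, "8000B")
O_tag = TagInfo("ON", "float32", "DIM_NC", (1, 4),    "float16", "DIM_NC", 4,   0, "8KB")
```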
In the "tag generation step":
For the input neuron I, assume that its splittable dimensions obtained from the predetermined splitting strategy database include N and C (N has higher priority than C); the input neuron I is 1×1024 and the multi-core processor 1 includes 2 channels. The splittable dimension N is judged first: the channel count 2 is larger than the operation dimension value of I in dimension N (determined as described above), so N cannot be the target split dimension. The splittable dimension C is judged next: the channel count 2 is smaller than the operation dimension value of I in dimension C, so C is determined as the target split dimension. The split length can then be determined as L = 1024 / 4 (the total number of computing cores) = 256, and the splitting index of the input neuron I is C[(0,255), (256,511), (512,767), (768,1023)]. Consulting the predetermined splitting strategy database with the target split dimension C shows that the target storage space of the input neuron I is "Mem" (i.e., memory) and the target data exchange hierarchy is "NO", i.e., no exchange. The tag information of the input neuron I is thus extended to: Static: IN, float32, DIM_NC, {1, 1024}; Dynamic: float16, DIM_NC, C=256, C=0, 2KB, C[(0,255),(256,511),(512,767),(768,1023)], Mem, NO.
Following the same procedure as for the input neuron I, the extended tag information of the input weight W is: Static: IW, float32, DIM_CN, {1000, 4}; Dynamic: float16, DIM_NC, C=512, C=24, 8000 bytes, C[(0,255),(256,511),(512,767),(768,1023)], Mem, NO. The extended tag information of the output neuron O is: Static: ON, float32, DIM_NC, {1, 4}; Dynamic: float16, DIM_NC, C=4, C=0, 8KB, C[(0,3)], Cluster, Core.
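For reference, the splitting index C[(0,255),(256,511),(512,767),(768,1023)] follows mechanically from the target number and split length; a one-function sketch:

```python
# Sketch: build the list of (start, end) split positions along the target
# split dimension from the target number and split length.
def split_index(dim_value, target_count, split_len):
    return [(k * split_len, min((k + 1) * split_len, dim_value) - 1)
            for k in range(target_count)]

assert split_index(1024, 4, 256) == [(0, 255), (256, 511), (512, 767), (768, 1023)]
```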
In the "tag use step",:
Based on the tag information of the input neuron I, the input weight W, and the output neuron O, the multi-core processor 1 executes operation instruction A as follows. Fig. 6a shows a schematic diagram of the allocation of operation tasks when the multi-core processor 1 executes operation instruction A according to an embodiment of the present disclosure, and figs. 6b and 6c show schematic diagrams of the execution process. With reference to figs. 6a and 6b, the specific process of the multi-core processor 1 is as follows:
The multi-core processor 1, or the processor that allocates tasks for it, first judges, according to the tag information of the input neuron I and the input weight W, whether the current data states of I and W are consistent with the corresponding dynamic tag information; when they are not, the data needs to be processed so that the processed data is consistent with the dynamic tag information. Taking the input weight W as the data needing processing, the processing procedure of W is shown in fig. 6a: the data type of W is converted from 32-bit floating point to 16-bit floating point; W is then transposed so that its data dimension order is consistent with "DIM_NC"; based on the slicing parameter "512", W is sliced into two data slices; the slice whose size does not match the slicing parameter is then filled according to the filling parameter "24" (for example, the fill bits are filled with 0), finally yielding the processed input weight W of 4×1024.
Then, the input neuron I and the input weight W are split according to their splitting indexes, as shown in fig. 6a: the input neuron is split into i1, i2, i3, i4, and the input weight W is split into w1, w2, w3, w4. As shown in fig. 6b, i1, i2, w1, and w2 are stored in the corresponding target storage space memory 1, and i3, i4, w3, and w4 are stored in the corresponding target storage space memory 2.
Then, the specific operation process of the operation instruction a includes the following operation tasks to be executed by each computing core:
Computing core 1 obtains i1 and w1 from memory 1, computes the first intermediate result o1, and stores o1 in its corresponding local memory NB; computing core 2 obtains i2 and w2 from memory 1, computes the first intermediate result o2, and stores o2 in out-of-core cache 1; computing core 3 obtains i3 and w3 from memory 2, computes the first intermediate result o3, and stores o3 in its corresponding local memory NB; computing core 4 obtains i4 and w4 from memory 2, computes the first intermediate result o4, and stores o4 in out-of-core cache 2.
Next, computing core 1 obtains o2 from out-of-core cache 1, adds o2 and o1 to obtain the second intermediate result o5, and stores o5 in its corresponding local memory NB; computing core 3 obtains o4 from out-of-core cache 2, adds o4 and o3 to obtain the second intermediate result o6, and stores o6 in out-of-core cache 2.
Then, computing core 1 obtains o6 from out-of-core cache 2, adds o5 and o6 to obtain the operation result o7 (i.e., the output neuron O), and stores o7 in memory 1, completing the operation task.
Because the target storage space of the output neuron O is the out-of-core cache Cluster and the target data exchange hierarchy is the inter-core data exchange Core, computing cores 2 and 4 store their results o2 and o4 in the corresponding out-of-core caches, and computing core 3 stores its o6 in out-of-core cache 2. In this way, it is ensured that computing core 1 and computing core 3 can obtain the required first or second intermediate results from the out-of-core caches.
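The correctness of this split-and-reduce scheme can be checked numerically; the following NumPy sketch (illustrative, not the patent's implementation) reproduces the partial products o1 to o4 and the reduction o5, o6, o7 of Example 1:

```python
import numpy as np

# Each core multiplies a 256-wide C slice of I by the matching rows of W,
# then the partial sums are reduced pairwise: o5 = o1 + o2, o6 = o3 + o4,
# o7 = o5 + o6, which equals I x W.
rng = np.random.default_rng(0)
I = rng.standard_normal((1, 1024))
W = rng.standard_normal((1024, 4))

parts = [I[:, k*256:(k+1)*256] @ W[k*256:(k+1)*256, :] for k in range(4)]
o5 = parts[0] + parts[1]          # computing core 1
o6 = parts[2] + parts[3]          # computing core 3
o7 = o5 + o6                      # final output neuron O
assert np.allclose(o7, I @ W)
```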
Example 2
Example 2 differs from Example 1 in the structure of the multi-core processor: the multi-core processor 2 in Example 2 is not provided with out-of-core caches.
In the "tag generation step":
Following the same procedure as Example 1, and consulting the splitting strategy database corresponding to the multi-core processor 2 (not shown in this disclosure), it can be determined that only the target storage space and target data exchange hierarchy in the tag information of the output neuron O change: Static: ON, float32, DIM_NC, {1, 4}; Dynamic: float16, DIM_NC, C=4, C=0, 8KB, C[(0,3)], Mem, Cluster. That is, compared with Example 1, the target storage space and target data exchange hierarchy of the output neuron O are changed.
In the "tag use step",:
Compared with Example 1, the specific process differs as follows (with reference to figs. 6a and 6c):
Computing core 1 obtains i1 and w1 from memory 1, computes the first intermediate result o1, and stores o1 in its corresponding local memory NB; computing core 2 obtains i2 and w2 from memory 1, computes the first intermediate result o2, and stores o2 in memory 1; computing core 3 obtains i3 and w3 from memory 2, computes the first intermediate result o3, and stores o3 in its corresponding local memory NB; computing core 4 obtains i4 and w4 from memory 2, computes the first intermediate result o4, and stores o4 in memory 2.
Next, computing core 1 obtains o2 from memory 1, adds o2 and o1 to obtain the second intermediate result o5, and stores o5 in its corresponding local memory NB; computing core 3 obtains o4 from memory 2, adds o4 and o3 to obtain the second intermediate result o6, and stores o6 in memory 2.
Then, computing core 1 obtains o6 from memory 2, adds o5 and o6 to obtain the operation result o7 (i.e., the output neuron O), and stores o7 in memory 1, completing the operation task.
Because the target storage space of the output neuron O is the memory Mem and the target data exchange hierarchy is the inter-cluster data exchange Cluster, computing cores 2 and 4 store their results o2 and o4 in the corresponding memories 1 and 2, respectively, and computing core 3 stores its o6 in memory 2. In this way, it is ensured that computing core 1 and computing core 3 can obtain the required first or second intermediate results from the memories.
It should be noted that, although the data processing method has been described above by way of the foregoing embodiments, those skilled in the art will appreciate that the present disclosure is not limited thereto. In fact, each step and each module can be configured flexibly according to personal preference and/or the actual application scenario, so long as the configuration conforms to the technical solution of the present disclosure.
Fig. 7 shows a block diagram of a data processing apparatus according to an embodiment of the present disclosure. As shown in fig. 7, the apparatus is applied to a multi-core processor, the multi-core processor includes a plurality of computing core clusters, each of which includes a plurality of computing cores, and the apparatus includes an information acquisition module 51, a data splitting module 52, and a data storage module 53.
The information acquisition module 51 obtains the first data and the tag information of the first data, where the tag information includes dynamic tag information, and the dynamic tag information includes the splitting index and the identification information of the target storage space.
The data splitting module 52 splits the first data into a plurality of second data according to the splitting index.
The data storage module 53 stores the plurality of second data into the corresponding target storage space according to the identification information of the target storage space.
Wherein the dynamic tag information is used to characterize information associated with the first data and the multicore processor.
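As a non-authoritative illustration, the acquire/split/store flow of the three modules can be sketched as follows; the field names of DynamicTag, the string-based dimension order, and the dict-based storage spaces are assumptions for readability, not the actual tag encoding:

from dataclasses import dataclass
from typing import List, Tuple
import numpy as np

# Hypothetical shape of the dynamic tag information.
@dataclass
class DynamicTag:
    split_dim: str                       # target split dimension, e.g. "C"
    split_index: List[Tuple[int, int]]   # (start, end) per second data
    target_storage: List[str]            # identification of target storage spaces
    dim_order: str = "NC"                # dynamic data dimension order

def split_and_store(first_data: np.ndarray, tag: DynamicTag, storages: dict):
    """Split the first data along the target dimension and lay out each piece."""
    axis = tag.dim_order.index(tag.split_dim)
    for (start, end), dest in zip(tag.split_index, tag.target_storage):
        second_data = first_data.take(list(range(start, end + 1)), axis=axis)
        storages[dest].append(second_data)   # store into the target storage space

# Example: split a 1x4 tensor along C into two halves for two memory channels.
storages = {"Mem1": [], "Mem2": []}
tag = DynamicTag("C", [(0, 1), (2, 3)], ["Mem1", "Mem2"])
split_and_store(np.arange(4, dtype=np.float16).reshape(1, 4), tag, storages)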
In one possible implementation, the dynamic tag information further includes a target data exchange level, and the apparatus may further include:
The instruction generation module is configured to generate, according to the parallel computing mode of the multi-core processor, the identification information of the target storage space, and the target data exchange level, a first instruction for the neural network operation corresponding to the first data, so that each computing core executes a corresponding operation task according to the first instruction, where the first instruction includes at least one of a data access instruction and a data operation instruction.
In one possible implementation, the dynamic tag information further includes a target data exchange level, and the apparatus may further include:
The operation task determining module is configured to determine the operation task to be executed by each computing core according to a second instruction corresponding to the first data, the parallel computing mode of the multi-core processor, the identification information of the target storage space, and the target data exchange level.
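A minimal sketch of how such first instructions might be assembled is given below; the opcodes, the three-instruction load-compute-store pattern, and the field names are illustrative assumptions only:

from dataclasses import dataclass
from typing import List

# Illustrative instruction record; not the actual instruction encoding.
@dataclass
class FirstInstruction:
    opcode: str           # data access ("LOAD"/"STORE") or data operation ("MUL")
    core_id: int
    src: str              # source storage space identification
    dst: str              # destination storage space identification
    exchange_level: str   # target data exchange level, e.g. "Cluster"

def generate_first_instructions(parallel_mode: str,
                                storage_ids: List[str],
                                exchange_level: str) -> List[FirstInstruction]:
    # parallel_mode would decide how the operation is partitioned across the
    # cores; here every core simply gets a load-compute-store sequence.
    instructions = []
    for core_id, storage in enumerate(storage_ids):
        instructions.append(FirstInstruction("LOAD", core_id, storage, "NB", exchange_level))
        instructions.append(FirstInstruction("MUL", core_id, "NB", "NB", exchange_level))
        instructions.append(FirstInstruction("STORE", core_id, "NB", storage, exchange_level))
    return instructions

# Example: four cores reading their second data from two memory channels.
program = generate_first_instructions("data-parallel",
                                      ["Mem1", "Mem1", "Mem2", "Mem2"], "Cluster")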
In one possible implementation, when the first data includes an input neuron, an input weight, and an output neuron, and it is determined according to the dynamic tag information of the first data that data exchange is to be performed on the output neuron, the operation task executed by the computing cores may include:
Each computing core is used for acquiring first target data in a plurality of second data of an input neuron and second target data in a plurality of second data of an input weight, calculating the first target data and the second target data to obtain a first intermediate result, and storing the first intermediate result into a storage space corresponding to the computing core;
a first computing core in the plurality of computing cores acquires at least two first intermediate results, performs operation processing on the at least two first intermediate results to obtain a second intermediate result, and stores the second intermediate result in a storage space corresponding to the first computing core;
And a second computing core in the plurality of computing cores acquires at least two second intermediate results, performs operation on the at least two second intermediate results to obtain an operation result, and stores the operation result as the output neuron into the target storage space.
In one possible implementation, when the first data includes an input neuron, an input weight, and an output neuron, and it is determined according to the dynamic tag information of the first data that data exchange is to be performed on the input weight, the operation task executed by the computing cores may include:
each computing core is used for acquiring first target data in a plurality of second data of an input neuron and second target data in a plurality of second data of an input weight, performing operation on the first target data and the second target data to obtain a first type of first intermediate result, and storing the first type of first intermediate result into a corresponding target storage space;
Each computing core is used for acquiring first target data in a plurality of second data of an input neuron and third target data which is different from the second target data in a plurality of second data of an input weight, and performing operation on the first target data and the third target data to obtain a second type of first intermediate result and storing the second type of first intermediate result in a corresponding target storage space;
a first computing core in the plurality of computing cores acquires at least two first intermediate results, performs operation processing on the at least two first intermediate results to obtain a second intermediate result and stores the second intermediate result in a corresponding target storage space, wherein the first intermediate result comprises the first type of first intermediate result and the second type of first intermediate result;
and a second computing core in the plurality of computing cores acquires at least two second intermediate results, performs operation on the at least two second intermediate results to obtain an operation result, and stores the operation result as the output neuron into the target storage space.
In one possible implementation, when the first data includes an input neuron, an input weight, and an output neuron, and it is determined according to the dynamic tag information of the first data that data exchange is to be performed on the input neuron and the output neuron, the operation task executed by the computing cores may include the following (see the illustrative sketch after this list):
Each computing core is used for acquiring first target data in a plurality of second data of an input neuron and second target data in a plurality of second data of an input weight, calculating the first target data and the second target data to obtain a first type of first intermediate result, and storing the first type of first intermediate result into a storage space corresponding to the computing core;
each computing core is used for acquiring fourth target data and second target data which are different from the first target data in a plurality of second data of the input neurons, calculating the fourth target data and the second target data to obtain a second type of first intermediate result, and storing the second type of first intermediate result into a storage space corresponding to the computing core;
a first computing core in the plurality of computing cores acquires at least two first intermediate results, performs operation processing on the at least two first intermediate results to obtain a second intermediate result and stores the second intermediate result in a storage space corresponding to the first computing core, wherein the first intermediate result comprises the first type of first intermediate result and the second type of first intermediate result;
and a second computing core in the plurality of computing cores acquires at least two second intermediate results, performs operation on the at least two second intermediate results to obtain an operation result, and stores the operation result as the output neuron into the target storage space.
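The common pattern of the above implementations — each computing core producing two types of first intermediate results that are then reduced in two stages — can be sketched as follows; the function names and the use of elementwise multiplication and summation are assumptions for illustration, not the actual operators:

import numpy as np

# Each computing core computes two types of first intermediate results from
# two different data slices (e.g. two neuron slices or two weight slices).
def core_task(neuron_a, neuron_b, weight_a, weight_b):
    first_type = neuron_a * weight_a     # first type of first intermediate result
    second_type = neuron_b * weight_b    # second type of first intermediate result
    return first_type, second_type

def first_core_reduce(first_intermediate_results):
    # A first computing core operates on at least two first intermediate
    # results to obtain a second intermediate result.
    return np.sum(first_intermediate_results, axis=0)

def second_core_reduce(second_intermediate_results):
    # A second computing core operates on at least two second intermediate
    # results to obtain the operation result, stored as the output neuron O.
    return np.sum(second_intermediate_results, axis=0)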
In one possible implementation, the dynamic tag information further includes a dynamic data type, a dynamic data dimension order, a fragmentation parameter, and a fill parameter, and the split index includes a target split dimension and a start split position and an end split position of each second data in the target split dimension of the first data, where the data splitting module 52 may include:
The data processing sub-module is used for processing the first data according to the dynamic tag information when determining that the current data state of the first data is inconsistent with the dynamic tag information, so as to obtain the processed first data;
A splitting sub-module, configured to split, in the target split dimension, the processed first data into a plurality of second data based on the start split position and the end split position of each second data in the target split dimension of the first data,
Wherein the data state includes the data type, the order of data dimensions, and dimension values.
In one possible implementation, the tag information may further include static tag information characterizing information associated with the neural network operation in which the first data participates, and the static tag information may include at least one of a static data type, a static data dimension order, and a dimension value corresponding to each static data dimension,
Wherein the processing performed by the data processing sub-module may include at least one of the following (see the illustrative sketch after this list):
Converting the data type of the first data from the static data type to the dynamic data type;
Adjusting the order of the data dimensions of the first data from the static data dimension order to the dynamic data dimension order;
Filling the first data according to the fill parameter;
Slicing the first data according to the fragmentation parameter.
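A minimal sketch of this processing step is given below, assuming simple encodings: dimension orders as strings such as "NCHW", the fill parameter as per-dimension pad widths, and the fragmentation parameter as per-dimension (start, end) bounds; these encodings are assumptions, not the actual tag format:

import numpy as np

def process_first_data(data: np.ndarray,
                       static_order: str, dynamic_order: str,
                       dynamic_dtype=np.float16, pad=None, slices=None):
    # Convert the data type from the static to the dynamic data type.
    data = data.astype(dynamic_dtype)
    # Adjust the data dimension order, e.g. from "NCHW" to "NHWC".
    perm = [static_order.index(d) for d in dynamic_order]
    data = data.transpose(perm)
    # Fill the data according to the fill parameter.
    if pad is not None:
        data = np.pad(data, pad)
    # Slice the data according to the fragmentation parameter.
    if slices is not None:
        data = data[tuple(slice(s, e) for s, e in slices)]
    return data

# Example: float32 NCHW data brought in line with a float16 NHWC dynamic tag.
x = np.zeros((1, 3, 8, 8), dtype=np.float32)
y = process_first_data(x, "NCHW", "NHWC", pad=((0, 0), (0, 0), (0, 0), (0, 1)))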
In one possible implementation, when the first data includes a plurality of constant data, the apparatus further includes:
The data packaging module is configured to package, according to the total number of computing cores in the multi-core processor, the plurality of second data obtained by splitting each constant data, to form a plurality of first data packets corresponding to the computing cores and then store them, so that each computing core executes a corresponding operation according to the first data packet it loads, where the number of first data packets is the same as the total number of computing cores;
Each first data packet includes one of the plurality of second data obtained by splitting each constant data, and further includes at least one of a constant total data amount tag, a hidden layer total data amount tag of a hidden layer in the neural network operation in which the first data participates, and an input/output address tag in the neural network operation.
In one possible implementation, the data packaging module is further configured to perform at least one of the following operations (see the illustrative sketch after this list):
Determining and recording, according to the constant total data amount of each first data packet and the data amount of each second data it contains, the offset of each second data within the first segment of the constant data segment of the corresponding first data packet;
Determining and recording, according to the total input/output address data amount of each first data packet, the data amount of each input address it contains, and the data amount of each output address it contains, the offset of each input address and each output address within the second segment of the address data segment of the corresponding first data packet;
Determining and recording, according to the hidden layer total data amount of each first data packet and the data amount of each hidden layer data it contains, the offset of each hidden layer data within the third segment of the hidden layer data segment of the corresponding first data packet.
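For illustration, the packing of one first data packet with recorded in-segment offsets can be sketched as follows; the concrete segment layout (constant data segment, address data segment, hidden layer data segment) and the 8-byte address width are assumptions:

import struct

def pack_first_packet(constant_blobs, io_addresses, hidden_blobs):
    packet = bytearray()
    offsets = {"constant": [], "address": [], "hidden": []}

    # First segment: constant data; offsets are recorded relative to the
    # start of the constant data segment.
    for blob in constant_blobs:
        offsets["constant"].append(len(packet))
        packet += blob

    # Second segment: input and output addresses (assumed 8 bytes each).
    base = len(packet)
    for addr in io_addresses:
        offsets["address"].append(len(packet) - base)
        packet += struct.pack("<Q", addr)

    # Third segment: hidden layer data.
    base = len(packet)
    for blob in hidden_blobs:
        offsets["hidden"].append(len(packet) - base)
        packet += blob

    return bytes(packet), offsets

# Example: two constants, one input and one output address, one hidden blob.
pkt, offs = pack_first_packet([b"\x01\x02", b"\x03"], [0x1000, 0x2000], [b"\x00" * 4])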
In one possible implementation, the apparatus further includes:
The tag determining module is configured to determine the splitting index, the target storage space, and the target data exchange level of the first data before the first data is split.
According to the apparatus provided by the embodiments of the present disclosure, after the first data and its dynamic tag are determined, the first data can be split and laid out based on the splitting index in the dynamic tag information and the identification information of the target storage space, so that the second data can be adapted to the plurality of memory channels of the multi-core processor and operations can be performed by the plurality of computing cores in the multi-core processor, which improves the processing efficiency and speed of the neural network operation.
The embodiments of the present disclosure also provide a data processing device, which includes a processor and a memory, where the memory stores a computer program, and the processor implements the above data processing method when executing the computer program.
The disclosed embodiments also provide a non-transitory computer readable storage medium having stored thereon computer program instructions that, when executed by a processor, implement the above-described data processing method.
The present disclosure provides a machine learning computing device that may include one or more of the above-described data processing devices, configured to acquire data to be computed and control information from other processing devices and perform specified machine learning operations. The machine learning computing device may obtain neural network computation macro-instructions or neural network computation instructions to be executed from other machine learning computing devices or non-machine learning computing devices, and transmit the execution results to peripheral devices (also called other processing devices) through an I/O interface. The peripheral devices include, for example, cameras, displays, mice, keyboards, network cards, Wi-Fi interfaces, and servers. When more than one data processing device is included, the data processing devices may be linked through a specific structure to transfer data, for example, interconnected through a PCIE bus, so as to support larger-scale neural network operations. In this case, the devices may share the same control system or have independent control systems, and may share memory or each accelerator may have its own memory. In addition, the interconnection mode may be any interconnection topology.
The machine learning operation device has higher compatibility and can be connected with various types of servers through PCIE interfaces.
Fig. 8 is a block diagram illustrating a combination processing apparatus 1200 according to an embodiment of the present disclosure. As shown in fig. 8, the combined processing device 1200 includes a computing processing device 1202, an interface device 1204, other processing devices 1206, and a storage device 1208. Depending on the application scenario, one or more computing devices 1210 may be included in the computing processing device. The calculation processing means 1202 may be the machine learning arithmetic means described above, or the data processing means described above.
In various embodiments, the computing processing means of the present disclosure may be configured to perform user-specified operations. In an exemplary application, the computing processing device may be implemented as a multi-core artificial intelligence processor. Similarly, one or more computing devices included within a computing processing device may be implemented as an artificial intelligence computing core (i.e., the computing cores described above) or as part of a hardware architecture of an artificial intelligence computing core.
In an exemplary operation, the computing processing device of the present disclosure may interact with other processing devices through an interface device to collectively accomplish user-specified operations. Depending on the implementation, the other processing devices of the present disclosure may include one or more types of general-purpose and/or special-purpose processors, such as a central processing unit (CPU), a graphics processing unit (GPU), and/or an artificial intelligence processor. These processors may include, but are not limited to, digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc., and their number may be determined according to actual needs. As previously mentioned, the computing processing device of the present disclosure, considered by itself, may be regarded as having a single-core structure or a homogeneous multi-core structure. However, when the computing processing device and the other processing devices are considered together, the two may be regarded as forming a heterogeneous multi-core structure.
In one or more embodiments, the other processing devices may serve as an interface between the computing processing device of the present disclosure (which may be embodied as a computing device related to artificial intelligence operations such as neural network operations) and external data and controls, and may perform basic control including, but not limited to, data handling and starting and/or stopping the computing device. In other embodiments, the other processing devices may also cooperate with the computing processing device to jointly accomplish computational tasks.
In one or more embodiments, the interface device may be used to transfer data and control instructions between the computing processing device and other processing devices. For example, the computing processing device may obtain input data from the other processing devices via the interface device and write the input data to a storage device (or memory) on the computing processing device. Further, the computing processing device may obtain control instructions from the other processing devices via the interface device and write them into a control cache on the computing processing device chip. Alternatively or additionally, the interface device may also read data from a storage device of the computing processing device and transmit it to the other processing devices.
Additionally or alternatively, the combined processing apparatus of the present disclosure may further comprise a storage device. As shown in fig. 8, the storage device is connected to the computing processing device and the other processing devices, respectively. In one or more embodiments, the storage device may be used to store data of the computing processing device and/or the other processing devices, for example, data that cannot be stored entirely within the internal or on-chip storage of the computing processing device or the other processing devices.
In some embodiments, the present disclosure also discloses a chip (e.g., chip 1302 shown in fig. 9). In one implementation, the chip is a system-on-chip (SoC) integrated with one or more combined processing devices as shown in fig. 8. The chip may be connected to other related components through an external interface device (such as the external interface device 1306 shown in fig. 9). The related components may be, for example, a camera, a display, a mouse, a keyboard, a network card, or a Wi-Fi interface. In some application scenarios, other processing units (e.g., video codecs) and/or interface modules (e.g., a DRAM interface) may be integrated on the chip. In some embodiments, the present disclosure further discloses a chip package structure including the chip. In some embodiments, the present disclosure further discloses a board card including the chip package structure. The board card is described in detail below with reference to fig. 9.
Fig. 9 is a schematic diagram illustrating a board 1300 according to an embodiment of the disclosure. As shown in fig. 9, the board includes a memory device 1304 for storing data, which includes one or more memory units 1310. The memory device may be connected to, and exchange data with, the control device 1308 and the above-described chip 1302 through, for example, a bus. Further, the board card also includes an external interface device 1306 configured for data relay or transfer between the chip (or a chip in the chip package structure) and an external device 1312 (e.g., a server or a computer). For example, the data to be processed may be transferred from the external device to the chip through the external interface device. For another example, the computation result of the chip may be transmitted back to the external device via the external interface device. The external interface device may have different interface forms according to different application scenarios; for example, it may use a standard PCIE interface.
In one or more embodiments, the control device in the board card of the present disclosure may be configured to regulate the state of the chip. In an application scenario, for example, the control device may include a microcontroller unit (MCU) to regulate the working state of the chip.
From the above description in connection with fig. 8 and 9, those skilled in the art will appreciate that the present disclosure also discloses an electronic device or apparatus that may include one or more of the above-described boards, one or more of the above-described chips, and/or one or more of the above-described combination processing apparatuses.
According to different application scenarios, the electronic device or apparatus of the present disclosure may include a server, a cloud server, a server cluster, a data processing apparatus, a robot, a computer, a printer, a scanner, a tablet, an intelligent terminal, a PC device, an internet of things terminal, a mobile phone, a driving recorder, a navigator, a sensor, a camera, a video camera, a projector, a watch, an earphone, a mobile storage, a wearable device, a visual terminal, an autopilot terminal, a vehicle, a household appliance, and/or a medical device. The vehicles include aircraft, ships, and/or cars; the household appliances include televisions, air conditioners, microwave ovens, refrigerators, electric cookers, humidifiers, washing machines, electric lamps, gas stoves, and range hoods; the medical devices include nuclear magnetic resonance instruments, B-ultrasound instruments, and/or electrocardiographs. The electronic device or apparatus of the present disclosure may also be applied to the internet, the internet of things, data centers, energy, transportation, public management, manufacturing, education, power grids, telecommunications, finance, retail, construction sites, medical care, and the like. Further, the electronic device or apparatus of the present disclosure may also be used in application scenarios related to artificial intelligence, big data, and/or cloud computing, such as the cloud, the edge, and terminals. In one or more embodiments, a computationally intensive electronic device or apparatus according to the aspects of the present disclosure may be applied to a cloud device (e.g., a cloud server), while a less power-consuming electronic device or apparatus may be applied to a terminal device and/or an edge device (e.g., a smartphone or camera). In one or more embodiments, the hardware information of the cloud device and the hardware information of the terminal device and/or the edge device are compatible with each other, so that appropriate hardware resources can be matched from the hardware resources of the cloud device according to the hardware information of the terminal device and/or the edge device to simulate the hardware resources of the terminal device and/or the edge device, thereby achieving unified management, scheduling, and collaborative work of cloud-edge-terminal integration.
It should be noted that, for the sake of brevity, the present disclosure describes some methods and embodiments thereof as a series of actions and combinations thereof, but those skilled in the art will understand that the solution of the present disclosure is not limited by the order of the described actions. Thus, based on the disclosure or teachings herein, one of ordinary skill in the art will appreciate that certain steps may be performed in other orders or concurrently. Further, those skilled in the art will appreciate that the embodiments described in this disclosure may be regarded as alternative embodiments, i.e., the actions or modules involved are not necessarily required for the implementation of one or more aspects of this disclosure. In addition, depending on the solution, the descriptions of different embodiments of the present disclosure each have their own emphasis. In view of this, those skilled in the art will appreciate that portions not described in detail in one embodiment of the disclosure may be found in the related descriptions of other embodiments.
In particular implementations, based on the disclosure and teachings of the present disclosure, one of ordinary skill in the art will appreciate that several embodiments of the disclosure disclosed herein may also be implemented in other ways not disclosed herein. For example, in terms of the foregoing embodiments of the electronic device or apparatus, the units are divided herein by taking into account the logic function, and there may be other manners of dividing the units when actually implemented. For another example, multiple units or components may be combined or integrated into another system, or some features or functions in the units or components may be selectively disabled. In terms of the connection relationship between different units or components, the connections discussed above in connection with the figures may be direct or indirect couplings between the units or components. In some scenarios, the foregoing direct or indirect coupling involves a communication connection utilizing an interface, where the communication interface may support electrical, optical, acoustical, magnetic, or other forms of signal transmission.
In the present disclosure, units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units. The aforementioned components or units may be co-located or distributed across multiple network elements. In addition, according to actual needs, some or all of the units may be selected to achieve the purposes of the solution described in the embodiments of the disclosure. In addition, in some scenarios, multiple units in embodiments of the disclosure may be integrated into one unit or each unit may physically reside separately.
In some implementation scenarios, the above-described integrated units may be implemented in the form of software program modules. If implemented in the form of software program modules and sold or used as a stand-alone product, the integrated unit may be stored in a computer-readable memory. In this regard, when the aspects of the present disclosure are embodied in the form of a software product (e.g., a computer-readable storage medium), the software product may be stored in a memory and may include instructions for causing a computer device (e.g., a personal computer, a server, or a network device, etc.) to perform some or all of the steps of the methods described in the embodiments of the present disclosure. The aforementioned memory may include, but is not limited to, a USB flash drive, a flash disk, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, an optical disk, or other media that can store program code.
In other implementation scenarios, the integrated units may also be implemented in hardware, i.e., as specific hardware circuits, which may include digital circuits and/or analog circuits, etc. The physical implementation of the hardware structure of the circuits may include, but is not limited to, physical devices, which may include, but are not limited to, devices such as transistors or memristors. In view of this, the various types of devices described herein (e.g., computing devices or other processing devices) may be implemented by suitable hardware processors, such as CPUs, GPUs, FPGAs, DSPs, and ASICs. Further, the aforementioned storage unit or storage device may be any suitable storage medium (including a magnetic storage medium or a magneto-optical storage medium, etc.), such as a resistive random access memory (RRAM), a dynamic random access memory (DRAM), a static random access memory (SRAM), an enhanced dynamic random access memory (EDRAM), a high bandwidth memory (HBM), a hybrid memory cube (HMC), a ROM, or a RAM.
The foregoing description of the embodiments of the present disclosure has been presented for purposes of illustration and description, and is not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the various embodiments described. The terminology used herein was chosen in order to best explain the principles of the embodiments, the practical application, or the technical improvement of the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (13)

1. A data processing method, applied to a multi-core processor, the multi-core processor comprising a plurality of computing core clusters, each computing core cluster comprising a plurality of computing cores, the method comprising:
acquiring first data and tag information of the first data, the tag information comprising dynamic tag information, the dynamic tag information comprising a split index and identification information of a target storage space;
splitting the first data into a plurality of second data according to the split index; and
storing the plurality of second data into corresponding target storage spaces according to the identification information of the target storage space,
wherein the dynamic tag information is used to characterize information associating the first data with the multi-core processor, and the split index is determined according to parameters of an operator corresponding to an operation in which the first data participates, the number of memory channels in the multi-core processor, static tag information, a dynamic data type, a dynamic data dimension order, a fragmentation parameter, and a fill parameter, the static tag information comprising at least one of: a static data type, a static data dimension, a static data dimension order, and a dimension value corresponding to each static data dimension.

2. The method according to claim 1, wherein the dynamic tag information further comprises a target data exchange level, and the method further comprises:
generating, according to a parallel computing mode of the multi-core processor, the identification information of the target storage space, and the target data exchange level, a first instruction for a neural network operation and corresponding to the first data, so that each computing core executes a corresponding operation task according to the first instruction,
wherein the first instruction comprises at least one of: a data access instruction and a data operation instruction.

3. The method according to claim 1, wherein the dynamic tag information further comprises a target data exchange level, and the method further comprises:
determining the operation task to be executed by each computing core according to a second instruction corresponding to the first data, the parallel computing mode of the multi-core processor, the identification information of the target storage space, and the target data exchange level.

4. The method according to claim 2 or 3, wherein when the first data comprises an input neuron, an input weight, and an output neuron, and it is determined according to the dynamic tag information of the first data that data exchange is to be performed on the output neuron, the operation tasks executed by the computing cores comprise:
each computing core acquiring first target data among the plurality of second data of the input neuron and second target data among the plurality of second data of the input weight, operating on the first target data and the second target data to obtain a first intermediate result, and storing the first intermediate result in a storage space corresponding to that computing core;
a first computing core among the plurality of computing cores acquiring at least two first intermediate results, operating on the at least two first intermediate results to obtain a second intermediate result, and storing the second intermediate result in a storage space corresponding to the first computing core; and
a second computing core among the plurality of computing cores acquiring at least two second intermediate results, operating on the at least two second intermediate results to obtain an operation result, and storing the operation result as the output neuron in the target storage space.

5. The method according to claim 2 or 3, wherein when the first data comprises an input neuron, an input weight, and an output neuron, and it is determined according to the dynamic tag information of the first data that data exchange is to be performed on the input weight, the operation tasks executed by the computing cores comprise:
each computing core acquiring first target data among the plurality of second data of the input neuron and second target data among the plurality of second data of the input weight, operating on the first target data and the second target data to obtain a first type of first intermediate result, and storing the first type of first intermediate result in a corresponding target storage space;
each computing core acquiring the first target data among the plurality of second data of the input neuron and third target data, different from the second target data, among the plurality of second data of the input weight, operating on the first target data and the third target data to obtain a second type of first intermediate result, and storing the second type of first intermediate result in a corresponding target storage space;
a first computing core among the plurality of computing cores acquiring at least two first intermediate results, operating on the at least two first intermediate results to obtain a second intermediate result, and storing the second intermediate result in a corresponding target storage space, the first intermediate results comprising the first type of first intermediate result and the second type of first intermediate result; and
a second computing core among the plurality of computing cores acquiring at least two second intermediate results, operating on the at least two second intermediate results to obtain an operation result, and storing the operation result as the output neuron in the target storage space.

6. The method according to claim 2 or 3, wherein when the first data comprises an input neuron, an input weight, and an output neuron, and it is determined according to the dynamic tag information of the first data that data exchange is to be performed on the input neuron and the output neuron, the operation tasks executed by the computing cores comprise:
each computing core acquiring first target data among the plurality of second data of the input neuron and second target data among the plurality of second data of the input weight, operating on the first target data and the second target data to obtain a first type of first intermediate result, and storing the first type of first intermediate result in a storage space corresponding to that computing core;
each computing core acquiring fourth target data, different from the first target data, among the plurality of second data of the input neuron and the second target data, operating on the fourth target data and the second target data to obtain a second type of first intermediate result, and storing the second type of first intermediate result in a storage space corresponding to that computing core;
a first computing core among the plurality of computing cores acquiring at least two first intermediate results, operating on the at least two first intermediate results to obtain a second intermediate result, and storing the second intermediate result in a storage space corresponding to the first computing core, the first intermediate results comprising the first type of first intermediate result and the second type of first intermediate result; and
a second computing core among the plurality of computing cores acquiring at least two second intermediate results, operating on the at least two second intermediate results to obtain an operation result, and storing the operation result as the output neuron in the target storage space.

7. The method according to claim 1, wherein the dynamic tag information further comprises a dynamic data type, a dynamic data dimension order, a fragmentation parameter, and a fill parameter, and the split index comprises: a target split dimension, and a start split position and an end split position of each second data in the target split dimension of the first data,
wherein splitting the first data into the plurality of second data according to the split index comprises:
when it is determined that a current data state of the first data is inconsistent with the dynamic tag information, processing the first data according to the dynamic tag information to obtain processed first data; and
splitting, in the target split dimension, the processed first data into the plurality of second data based on the start split position and the end split position of each second data in the target split dimension of the first data,
wherein the data state comprises a data type, an order of data dimensions, and dimension values.

8. The method according to claim 7, wherein the tag information further comprises static tag information used to characterize information associated with the neural network operation in which the first data participates,
wherein when it is determined that the current data state of the first data is inconsistent with the dynamic tag information, processing the first data according to the dynamic tag information to obtain the processed first data comprises at least one of:
converting the data type of the first data from the static data type to the dynamic data type;
adjusting the order of the data dimensions of the first data from the static data dimension order to the dynamic data dimension order;
filling the first data according to the fill parameter; and
slicing the first data according to the fragmentation parameter.

9. The method according to claim 1, wherein when the first data comprises a plurality of constant data, the method further comprises:
packaging, according to a total number of computing cores in the multi-core processor, the plurality of second data obtained by splitting each constant data, to form a plurality of first data packets corresponding to the computing cores, and storing the first data packets, so that each computing core executes a corresponding operation according to the first data packet it loads, the number of first data packets being the same as the total number of computing cores,
wherein each first data packet comprises one of the plurality of second data obtained by splitting each constant data, and further comprises at least one of: a constant total data amount tag, a hidden layer total data amount tag of a hidden layer in the neural network operation in which the first data participates, and an input/output address tag in the neural network operation.

10. The method according to claim 9, further comprising at least one of the following operations:
determining and recording, according to the constant total data amount of each first data packet and the data amount of each second data it contains, the offset of each second data within a first segment of the constant data segment of the corresponding first data packet;
determining and recording, according to the total input/output address data amount of each first data packet, the data amount of each input address it contains, and the data amount of each output address it contains, the offset of each input address and each output address within a second segment of the address data segment of the corresponding first data packet; and
determining and recording, according to the hidden layer total data amount of each first data packet and the data amount of each hidden layer data it contains, the offset of each hidden layer data within a third segment of the hidden layer data segment of the corresponding first data packet.

11. The method according to claim 1, further comprising:
determining the split index, the target storage space, and the target data exchange level of the first data before splitting the first data.

12. A non-volatile computer-readable storage medium having computer program instructions stored thereon, wherein the computer program instructions, when executed by a processor, implement the data processing method according to any one of claims 1 to 11.

13. A data processing apparatus, comprising a processor and a memory, the memory storing a computer program, wherein the processor, when executing the computer program, implements the method according to any one of claims 1 to 11.
CN202011399187.9A — Priority/filing date: 2020-12-02 — Title: Data processing method, device, computer equipment and storage medium — Status: Active — Granted publication: CN114580606B (en)

Priority Applications (1)

Application number: CN202011399187.9A — Priority date: 2020-12-02 — Filing date: 2020-12-02 — Title: Data processing method, device, computer equipment and storage medium


Publications (2)

CN114580606A — published 2022-06-03
CN114580606B — published 2025-08-12

Family ID: 81769614

Family Applications (1)

CN202011399187.9A — Active — Priority date: 2020-12-02 — Filing date: 2020-12-02 — Title: Data processing method, device, computer equipment and storage medium

Country Status (1)

CN — CN114580606B (en)




Legal Events

PB01 — Publication
SE01 — Entry into force of request for substantive examination
GR01 — Patent grant
TG01 — Patent term adjustment
