Data processing method, device, equipment, storage medium and program product

Info

Publication number
CN120523745A
CN120523745A
Authority
CN
China
Prior art keywords
tensor
index
target
thread
dimension
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202510457754.8A
Other languages
Chinese (zh)
Inventor
Request not to publish the inventor's name
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Mole Thread Intelligent Technology Beijing Co ltd
Original Assignee
Mole Thread Intelligent Technology Beijing Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Mole Thread Intelligent Technology Beijing Co ltd
Priority to CN202510457754.8A
Publication of CN120523745A
Legal status: Pending (Current)

Abstract

The embodiments of the present disclosure disclose a data processing method, apparatus, device, storage medium and program product, wherein the data processing method comprises: determining a kernel function and a thread block required for processing a target tensor; the target tensor comprises a plurality of element regions, and the threads in the thread block have a one-to-one correspondence with the element regions; determining the index of each element in the target tensor based on the index of the thread in the thread block, the number of elements in the element region on the target dimension, and the step length between adjacent elements in the element region; calling the kernel function through the thread block, and processing the target tensor according to the index of each element. In this way, it can support not only the input of non-continuous tensor data, but also the input of continuous tensor data, and has a wide range of applications.

Description

Data processing method, apparatus, device, storage medium, and program product
Technical Field
The present disclosure relates to, but is not limited to, the field of computer technology, and in particular, to a data processing method, apparatus, device, storage medium, and program product.
Background
With the development of artificial intelligence and deep learning technology, large language models exhibit powerful data processing capability in various fields such as natural language processing, question-answering systems, machine translation and the like.
In general, data of a model in a training process is input in a tensor form, and each element in the tensor is sequentially operated by adopting a kernel function so as to realize convolution, pooling and other processing of the model data. However, the existing scheme can only process tensors with continuous addresses, but cannot process tensors with discontinuous addresses.
Disclosure of Invention
In view of this, embodiments of the present disclosure at least provide a data processing method, apparatus, device, storage medium, and program product.
The technical scheme of the embodiment of the disclosure is realized as follows:
On one hand, the embodiment of the disclosure provides a data processing method, which comprises the steps of: determining a kernel function and a thread block required by a target tensor during processing, wherein the target tensor comprises a plurality of element areas and threads in the thread block have a one-to-one correspondence with the element areas; determining an index of each element in the target tensor based on the index of the threads in the thread block, the number of elements of the element areas on a target dimension, and step sizes between adjacent elements in the element areas; and calling the kernel function through the thread block to process the target tensor according to the index of each element.
On the other hand, the embodiment of the disclosure provides a data processing device, which comprises a determining module and a processing module, wherein: the determining module is used for determining a kernel function and a thread block required by a target tensor during processing, the target tensor comprises a plurality of element areas, and threads in the thread block have a one-to-one correspondence with the element areas; the processing module is used for determining the index of each element in the target tensor based on the index of the threads in the thread block, the number of elements of the element areas in the target dimension, and the step length between adjacent elements in the element areas; and the processing module is further used for calling the kernel function through the thread block and processing the target tensor according to the index of each element.
In yet another aspect, embodiments of the present disclosure provide a computer device comprising a memory and a processor, the memory storing a computer program executable on the processor, the processor implementing some or all of the steps of the above method when the program is executed.
In yet another aspect, the disclosed embodiments provide a computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs some or all of the steps of the above method.
In yet another aspect, the disclosed embodiments provide a computer program comprising computer readable code which, when run in a computer device, causes a processor in the computer device to perform some or all of the steps for carrying out the above method.
In yet another aspect, the disclosed embodiments provide a computer program product comprising a non-transitory computer readable storage medium storing a computer program which, when read and executed by a computer, performs some or all of the steps of the above method.
In the embodiment of the disclosure, the position of each element in the tensor is determined according to the step length between adjacent elements, so that not only can the input of discontinuous tensor data be supported, but also the input of continuous tensor data can be supported, the application range is wide, and no additional continuous operation is needed, thereby reducing the calculation delay and improving the calculation efficiency. Before the operator is executed, the input tensor data is subjected to buffer optimization, so that the number of times of data transmission between the memory and the GPU can be reduced, and the calculation speed is improved.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the aspects of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the technical aspects of the disclosure.
Fig. 1 is a schematic diagram of an implementation flow of a data processing method according to an embodiment of the disclosure;
Fig. 2 is a second schematic implementation flow chart of a data processing method according to an embodiment of the disclosure;
FIG. 3 is a schematic diagram of a data processing apparatus according to an embodiment of the present disclosure;
Fig. 4 is a schematic diagram of a hardware entity of a computer device according to an embodiment of the disclosure.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present disclosure more apparent, the technical solutions of the present disclosure are further elaborated below in conjunction with the drawings and the embodiments, and the described embodiments should not be construed as limiting the present disclosure, and all other embodiments obtained by those skilled in the art without making inventive efforts are within the scope of protection of the present disclosure.
In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is to be understood that "some embodiments" can be the same subset or different subsets of all possible embodiments and can be combined with one another without conflict.
The term "first/second/third" is merely to distinguish similar objects and does not represent a particular ordering of objects, it being understood that the "first/second/third" may be interchanged with a particular order or precedence where allowed, to enable embodiments of the disclosure described herein to be implemented in other than those illustrated or described herein.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs. The terminology used herein is for the purpose of describing the present disclosure only and is not intended to be limiting of the present disclosure.
In order to better understand the data processing method provided by the embodiments of the present disclosure, a description will be given below of a scheme in the related art.
In the related art, it is necessary to convert the discontinuous input tensor into the continuous input tensor and then calculate. This means that additional serialization operations are required to handle the discontinuities, adding to the computational complexity and delay.
In addition, the existing memory bandwidth optimization technology focuses on improving memory access efficiency and reducing data transmission time, but has no scheme for optimizing memory-bottleneck-type operators; the existing GPU hardware optimization technology has no scheme for optimizing specific application scenarios such as large language model pre-training; the existing compiler optimization technology has no scheme for optimizing memory-bottleneck-type operators; and the existing software stack adaptation technology has no scheme for optimizing large language model pre-training on non-Nvidia GPUs.
To this end, the disclosed embodiments provide a data processing method that may be performed by a processor of a computer device. The computer device may be a device with data processing capability, such as a server, a notebook computer, a tablet computer, a desktop computer, a smart television, a set-top box, or a mobile device (e.g., a mobile phone, a portable video player, a personal digital assistant, a dedicated messaging device, or a portable game device). The processor may be a graphics processing unit (GPU) or a central processing unit (CPU). As shown in fig. 1, the method includes the following steps 101 to 103:
And step 101, determining a kernel function and a thread block required by a target tensor during processing, wherein the target tensor comprises a plurality of element areas, and threads in the thread block have a one-to-one correspondence with the element areas.
In neural networks, data typically exists in the form of tensors, e.g., input features, activation values of hidden layers, etc. An operator is a function that performs a tensor operation; it defines how the tensor is mathematically operated on. Kernel functions are functions written in low-level code that implement an operator's specific arithmetic logic and are used to perform the operator-defined operations on hardware (e.g., GPU or CPU).
The target tensor refers to the tensor currently to be processed. Tensors are multidimensional arrays used to represent data and parameters in a neural network. For example, tensors may include, but are not limited to, input data (text, images, speech, etc.), weights, biases, and intermediate activation values. The dimension of a tensor may be a scalar (0-dimensional), a vector (1-dimensional), a matrix (2-dimensional), or a higher-dimensional array.
The kernel functions required by the target tensor at the time of processing refer to the underlying functions that perform specific operations on the target tensor for running on multiple processing cores of the GPU.
A thread block refers to a collection of threads. Threads within a thread block execute on the same multiprocessor and can share data and synchronize with each other. The size and number of thread blocks may be adjusted based on the computational task and the hardware characteristics. Thread blocks may be one-dimensional, two-dimensional, three-dimensional, or even higher-dimensional. For example, block(16, 16) represents a two-dimensional thread block in which each thread block contains 16 × 16 threads.
In performing operations on the GPU, the tensor may be partitioned into a plurality of element regions, with each thread within a thread block being responsible for processing one element region of the tensor. Wherein each element region includes one or more elements.
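As an illustrative sketch of this mapping (the kernel name, block size, and per-region work are assumptions, not taken from the disclosure), a CUDA launch in which each thread handles one element region could look as follows:
#include <cuda_runtime.h>
// Illustrative sketch: one thread per element region; kernel name, block size
// and the placeholder per-region work are assumptions.
__global__ void process_regions(const float* in, float* out, size_t num_regions) {
    size_t tid = (size_t)blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (tid < num_regions) {
        out[tid] = in[tid];  // placeholder per-region processing
    }
}
void launch_process_regions(const float* d_in, float* d_out, size_t num_regions) {
    dim3 block(256);                                   // threads per block
    dim3 grid((num_regions + block.x - 1) / block.x);  // one thread per element region
    process_regions<<<grid, block>>>(d_in, d_out, num_regions);
}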
In some embodiments, the specific implementation manner of the step 101 may be that a neural network operator is determined based on the processing requirement of the target tensor, a kernel function required by the target tensor in processing is configured based on the neural network operator, and a thread block required by the target tensor in processing is determined based on the size of the target tensor.
Neural network operators refer to operators that perform a specific operation on a target tensor. The neural network operator may be unitary (e.g., an activation function) or binary (e.g., a matrix multiplication). The neural network operator may be a memory constrained operator, or a computationally intensive operator. Wherein the performance of the memory constrained operator is limited by the memory access speed and the performance of the computationally intensive operator is limited by the processor processing rate.
Memory limited operators (memory bound Operators) refer to operators whose performance is limited by memory access speed. When an operator is memory limited, it means that its execution speed is mainly dependent on the read and write speed of the memory, not the processing speed of the processor. For example, memory limited operators include, but are not limited to, unary operators, binary operators, and the like.
A computationally intensive operator (Computationally Intensive Operators) refers to an operator that requires a large amount of computational resources to execute. Computationally intensive operators typically involve a large number of mathematical operations, e.g., matrix multiplication, convolution operations, and the like.
Since the neural network operator may be a memory constrained operator, or a computationally intensive operator, the kernel function may be determined based on the memory constrained operator or the computationally intensive operator.
Step 102, determining an index of each element in the target tensor based on the index of the thread in the thread block, the number of elements the element region has in the target dimension, and a step size between adjacent elements in the element region.
Each thread block has a unique index, and the index of the thread block is used to identify and locate the position of the thread block in the parallel computing environment. The threads within each thread block also have a unique index that is used to identify and locate the thread's location in the parallel computing environment. The step size between adjacent elements refers to the distance in memory between two adjacent elements. The index of each element is used to characterize the position of each element within the element region.
In some implementations, the target dimension can be the x dimension. At this time, the number of elements that the element region has in the x dimension may be expressed as x_size.
In some implementations, a specific implementation of step 102 may be to determine an index for each element in the target tensor based on the index of the thread, the number of elements the element region has in the target dimension, and the step sizes of the neighboring elements in different dimensions.
And 103, calling the kernel function through the thread block, and processing the target tensor according to the index of each element.
In some embodiments, the specific implementation manner of step 103 may be that according to the position of each element, the element in the element area corresponding to each thread is obtained, and the kernel function is called by each thread to process the element in the element area corresponding to each thread in turn.
It should be noted that, the operation of the present application for processing the target tensor may be performed on the CPU or the GPU.
It should be noted that memory limited operators such as unary and binary operators conventionally only support continuous tensor input and do not support discontinuous tensor input. The embodiments of the present disclosure instead pass the step sizes between adjacent elements into the kernel and determine the position of each element in the tensor according to these step sizes. Even if the addresses are discontinuous, the position of each element can be accurately obtained from the step sizes, so tensors with discontinuous addresses can be processed; when the addresses are continuous, the step size between adjacent elements can simply be regarded as one, so tensors with continuous addresses are processed as well.
In the embodiment of the disclosure, the position of each element in the tensor is determined according to the step length between adjacent elements, so that not only can the input of discontinuous tensor data be supported, but also the input of continuous tensor data can be supported, the application range is wide, and no additional continuous operation is needed, thereby reducing the calculation delay and improving the calculation efficiency. Before the operator is executed, the input tensor data is subjected to buffer optimization, so that the number of times of data transmission between the memory and the GPU can be reduced, and the calculation speed is improved.
Embodiments of the present disclosure provide a data processing method that may be performed by a processor of a computer device. As shown in fig. 2, the method includes the following steps 201 to 206:
Step 201, determining a kernel function and a thread block required by a target tensor during processing, wherein the target tensor comprises a plurality of element areas, and threads in the thread block have a one-to-one correspondence with the element areas.
Here, the above step 201 corresponds to the above step 101, and reference may be made to the specific embodiment of the above step 101 when implemented.
In some embodiments, the specific implementation manner of the kernel function required by the target tensor during processing is that a plurality of neural network operators required by the target tensor during processing are determined, the plurality of neural network operators are fused based on the association relation among the plurality of neural network operators to obtain a target neural network operator, and the kernel function is configured based on the target neural network operator.
The target neural network operator refers to an operator obtained by fusing a plurality of neural network operators.
By fusing a plurality of related neural network operators into one operator, the number of execution operators can be reduced, thereby reducing memory access and computation overhead.
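As a hedged illustration of such fusion (the two operators chosen here, an elementwise add followed by a ReLU, are assumptions and not taken from the disclosure), a fused kernel could look as follows:
// Illustrative sketch of operator fusion: elementwise add followed by ReLU
// executed in a single kernel, saving one kernel launch and one pass over memory.
__global__ void fused_add_relu(const float* a, const float* b, float* out, size_t n) {
    size_t tid = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) {
        float v = a[tid] + b[tid];       // first operator: add
        out[tid] = v > 0.0f ? v : 0.0f;  // second operator: ReLU, fused in the same pass
    }
}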
Step 202, determining an index of each element in a target dimension based on the index of the thread, the number of elements in the target dimension of the element region, and the first step size.
Wherein the step sizes between adjacent elements include a first step size of the target dimension and a second step size of the remaining dimensions other than the target dimension.
In some embodiments, if the element region is two-dimensional, the first step refers to a step between adjacent elements in the x-dimension. The second step size refers to the step size between adjacent elements in the y-dimension. If the element region is three-dimensional, the first step refers to the step between adjacent elements in the x-dimension. The second step size refers to a step size between adjacent elements in the y-dimension and a step size between adjacent elements in the z-dimension.
In some embodiments, the specific implementation manner of the step 202 may be that a modulo operation is performed on the index of the thread and the number of elements of the element region in the target dimension, and a multiplication operation is performed on the result obtained after the modulo operation and the first step length to obtain the index of each element in the target dimension.
For example, the index of each element in the target dimension may be calculated as size_t tx = tid % x_size * stride_x. Here, tid represents the global index of the thread, x_size represents the number of elements the element region has in the target dimension, tid % x_size gives the element's position within its row, stride_x represents the first step size, and tx represents the index of each element in the target dimension.
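As a purely illustrative numerical example (the values are not from the disclosure), with x_size = 4, stride_x = 2 and tid = 6, tid % x_size = 2, so tx = 2 * 2 = 4.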
In some embodiments, the specific implementation of determining the index of the thread may be that the starting index of the thread block in the target dimension is determined based on the index of the thread block in the target dimension in the computation space and the number of threads the thread block has in the target dimension, and the index of each thread is determined based on the starting index of the target dimension and the index of each thread in the target dimension in the thread block.
A computation space refers to a logical space or a physical space for performing a computation task. For example, in GPU programming, a computation space refers to a GPU's memory space that includes global memory, shared memory, registers, and the like. The computation space may be a grid (grid).
If the computation space is two-dimensional, the computation space includes two dimensions, an x-dimension and a y-dimension. If the computation space is three-dimensional, the computation space includes three dimensions, an x-dimension, a y-dimension, and a z-dimension.
The index of a thread block in the target dimension refers to the x-coordinate of the thread block in the computation space, which may be denoted as blockIdx.x. The number of threads a thread block has in the target dimension refers to the number of threads of the thread block in the x-direction, which may be denoted as blockDim.x. The index of each thread in the target dimension within the thread block refers to the thread's coordinate in the x dimension within the block, which may be denoted as threadIdx.x.
The index of the thread blocks in the target dimension and the index of each thread in the target dimension in the thread blocks are built-in variables that are automatically provided by the parallel computing platform and programming model (CUDA) when executing the kernel function. The size of the thread blocks may be specified by a programmer, including the number of threads the thread blocks have in each dimension.
In some embodiments, the specific implementation manner of determining the starting index of the thread block in the target dimension based on the index of the thread block in the target dimension in the computation space and the thread number of the thread block in the target dimension may be that the starting index of the thread block in the target dimension is obtained by multiplying the index of the thread block in the target dimension and the thread number of the thread block in the target dimension.
In some embodiments, the specific implementation of determining the index of each thread based on the starting index of the target dimension and the index of each thread in the thread block may be that the starting index of the target dimension and the index of each thread in the thread block are added to obtain the index of each thread.
For example, the index of the thread may be calculated by the formula int64_t tid = (int64_t)blockIdx.x * blockDim.x + threadIdx.x. Here, int64_t is a data type representing a 64-bit integer, blockIdx.x represents the index of the thread block in the target dimension in the computation space, blockDim.x represents the number of threads each thread block has in the target dimension, threadIdx.x represents the index of the thread in the target dimension of the thread block, and tid represents the global index of the thread, used to distinguish all threads in the computation space.
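As a purely illustrative numerical example (the values are not from the disclosure), with blockIdx.x = 2, blockDim.x = 256 and threadIdx.x = 5, tid = 2 * 256 + 5 = 517.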
Step 203, determining an index of each element in the other dimensions based on the index of the thread, the number of elements the element region has in the target dimension, and the second step size.
In some embodiments, the specific implementation manner of the step 203 may be that division operation is performed on the index of the thread and the number of elements of the element region in the target dimension, and multiplication operation is performed on the result of the division operation and the second step length to obtain the index of each element in the other dimensions.
For example, taking a two-dimensional element area as an example, where the second step size is the step size between adjacent elements in the y dimension, the index of each element in the y dimension may be calculated as size_t ty = tid / x_size * stride_y. Here, ty represents the index of each element in the y dimension, tid / x_size represents the row in which the element lies, and stride_y represents the second step size.
For discontinuous tensor input, division operation is involved in calculating the index of each element, and the complexity of division operation is high and the calculation speed is slow. Accordingly, to further increase computational efficiency and reduce computational complexity, embodiments of the present disclosure replace division operations with shift operations.
Specifically, for any element in the last dimension of the target tensor, when the number of elements in the last dimension is 2 to the power of n (n being a positive integer), the index of the thread is shifted right by n bits to obtain the starting address of the last dimension, and the index of that element in the last dimension is determined based on the starting address of the last dimension and the second step size.
For example, if the size of the last dimension is 4 (i.e., 2 to the power of 2), then the divide by 4 operation may be replaced by right shifting the index by 2 bits. Thus, not only the calculation efficiency is improved, but also the calculation complexity is reduced, thereby improving the overall performance.
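As a hedged sketch of this replacement (the function name and the device qualifier are assumptions), the row-index computation could be written as:
// Illustrative sketch: when the last dimension holds 2^n elements, the division
// used to compute the row index can be replaced by a right shift by n bits.
__device__ size_t row_index_by_shift(size_t tid, unsigned n, size_t stride_y) {
    return (tid >> n) * stride_y;  // equivalent to (tid / 2^n) * stride_y
}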
Step 204, determining the index of each element based on the index of each element in the target dimension and the index of each element in the other dimensions.
Here, the steps 202 to 204 correspond to the step 102, and reference may be made to the specific embodiment of the step 102 when implemented.
In some implementations, the index of each element in the target dimension is combined with the index of each element in the remaining dimensions in a target form to obtain the index of each element. The target form may be a set of numbers, coordinates, etc.
Step 205, obtaining a continuous output tensor corresponding to the target tensor according to the index of each element.
The target tensor includes a first tensor and a second tensor with discontinuous addresses. The continuous output tensor refers to a tensor with continuous addresses obtained by processing the discontinuous first tensor and second tensor.
In some embodiments, the specific implementation manner of step 205 may be that a memory space meeting a requirement of tensor storage is determined based on the size of the first tensor and the size of the second tensor, elements in the first tensor are read according to an index of each element in the first tensor to obtain a first output tensor, elements in the second tensor are read according to an index of each element in the second tensor to obtain a second output tensor, the first output tensor is stored in a first address in the memory space, and the second output tensor is stored in a second address adjacent to the first address to obtain a continuous output tensor.
The memory space is used for storing tensor data. The first output tensor refers to the first tensor read. The second output tensor refers to the read second tensor. The first address refers to any one of the addresses in the memory space, and the second address refers to an address adjacent to the first address. In the case of the first address determination, the second address is also determined.
In some embodiments, the memory space may refer to a memory space allocated on the host side, that is, a segment of memory space is allocated on the host side for tensor processing. Alternatively, the memory space may be a memory space allocated on the GPU, that is, a segment of memory space is allocated on the GPU for tensor processing.
In some embodiments, to further improve the processing performance of the tensor, after the first output tensor and the second output tensor are obtained, the first output tensor is stored in a first address in the memory space, and the second output tensor is stored in a second address adjacent to the first address, so as to obtain the continuous output tensor.
When tensor data is discontinuous, the get_offset function is called during calculation; this function involves division, which is time-consuming, especially when the index is of type int64, so optimization can be performed for this point. Specifically, when the host allocates memory to initialize the output tensor, a section of continuous memory space can be opened up, so that the two inputs (the first tensor and the second tensor) are discontinuous while the output (the continuous output tensor) is continuous. In this way, the number of get_offset function calls can be reduced by 1/3, yielding better performance.
Because the GPU has better computing power than the CPU on the host side, the processing operations for both inputs can also be performed on the GPU. Specifically, a section of continuous memory space is allocated on the GPU, the two input tensors are copied to that memory space on the GPU, the two input tensors are processed on the GPU, and the two output tensors are stored at continuous addresses to obtain a continuous output.
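As a hedged sketch of this arrangement (the function name and the element type are assumptions; only the standard CUDA allocation call is used), a single contiguous device buffer holding both output tensors could be set up as follows:
#include <cuda_runtime.h>
// Illustrative sketch: one contiguous device buffer holds both output tensors,
// so the output side needs no stride/offset computation.
float* alloc_contiguous_outputs(size_t n1, size_t n2, float** first, float** second) {
    float* d_out = nullptr;
    cudaMalloc((void**)&d_out, (n1 + n2) * sizeof(float));
    *first  = d_out;       // first output tensor starts at the first address
    *second = d_out + n1;  // second output tensor is stored at the adjacent address
    return d_out;
}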
In some embodiments, the operation of obtaining the continuous output tensor corresponding to the target tensor according to the index of each element is performed by a processor, or the operation of copying the target tensor to a graphics processor and obtaining the continuous output tensor corresponding to the target tensor according to the index of each element is performed by the graphics processor.
And 206, calling the kernel function through the thread block to process the continuous output tensor.
Here, the above steps 205 to 206 correspond to the above step 103, and reference may be made to the specific embodiment of the above step 103 when implemented.
In some embodiments, the data may also be cache optimized, in particular by storing the continuous output tensor to a cache region and processing the continuous output tensor within the cache region by calling a kernel function.
The input tensor data may be pre-processed prior to executing the operator, e.g., cache optimization of the data. Therefore, the number of times of data transmission between the memory and the GPU can be reduced, and the calculation speed is improved. Multiple tests can be carried out under different configurations, and a cache optimization scheme with optimal performance can be selected from the multiple tests.
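One possible reading of this cache optimization, offered only as a hedged sketch (the use of CUDA shared memory, the tile size, and the placeholder squaring operation are assumptions, not stated in the disclosure), is to stage a tile of the continuous tensor in on-chip memory before the elementwise work:
// Illustrative sketch: stage a tile of the continuous tensor in shared memory
// so each element is fetched from global memory only once before processing.
__global__ void process_with_shared_tile(const float* in, float* out, size_t n) {
    __shared__ float tile[256];  // assumes blockDim.x == 256
    size_t tid = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) {
        tile[threadIdx.x] = in[tid];  // load into the on-chip buffer
    }
    __syncthreads();
    if (tid < n) {
        float v = tile[threadIdx.x];
        out[tid] = v * v;  // placeholder elementwise operation
    }
}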
In the embodiment of the disclosure, the position of each element in the tensor is determined according to the step length between adjacent elements, so that both discontinuous and continuous tensor data input can be supported; the application range is wide, and no additional continuous operation is needed, thereby reducing the calculation delay and improving the calculation efficiency. Before the operator is executed, the input tensor data is subjected to cache optimization, so that the number of data transfers between the memory and the GPU can be reduced and the calculation speed is improved. Multiple related operators are fused into one operator, reducing the number of executed operators and thereby reducing memory access and calculation overhead. When the size of the tensor in the last dimension is a power of 2, a shift operation replaces the division calculation, so that the number of get_offset function calls can be reduced by 1/3 and the processing performance of the tensor is improved.
The application of the data processing method provided by the embodiment of the present disclosure in an actual scene is described below, taking the application in muDNN operator library as an example.
The embodiment of the disclosure relates to a performance optimization method for the muDNN operator library on a GPU, which improves the calculation efficiency of the muDNN library on the GPU by reducing the number of get_offset calls, using shift operations to replace division calculations, and directly supporting discontinuous input, particularly when processing high-performance computing tasks such as large language model pre-training. These optimizations help to improve the competitiveness of the muDNN operator library and promote the development of the high-performance computing field.
The embodiments of the present disclosure optimize the unary and binary operators when dealing with discontinuous inputs. These optimizations include reducing the number of get_offset calls in the kernel, replacing division operations with shift operations, and optimizing for performance bottlenecks in certain situations. In addition, the embodiments of the present disclosure also provide an innovative method so that the unary operator and the binary operator can directly support discontinuous input, thereby avoiding additional continuous operations and kernel calls and further improving the computing efficiency.
The specific scheme of the embodiment of the disclosure is as follows:
1. Directly supporting discontinuous input;
Currently, the unary operator and the binary operator do not support discontinuous input, and discontinuous tensors need to be serialized (made continuous) first and then fed into the kernel for calculation. The method proposed in the embodiment of the present disclosure is to pass the stride of each dimension into the kernel and then calculate the address (index) of each element according to the stride, so that the unary operator and the binary operator directly support discontinuous input. Calculating the address of each element according to the stride avoids the additional continuous operation and kernel call, reduces calculation delay, and improves calculation efficiency.
Before optimization, a kernel function that performs the serialization operation needs to be called for each of the two input tensors to convert them into continuous tensors, and then the kernel function that processes the input tensors is called; this process requires calling three kernel functions. After optimization with the method of the embodiment of the present disclosure, only the kernel function that processes the input tensors needs to be called, saving two kernel function calls.
This is illustrated by the following code:
int64_t tid = (int64_t)blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
size_t tx = tid % x_size * stride_x;  // element index in the x dimension
size_t ty = tid / x_size * stride_y;  // element index in the y dimension
The line of code 'int64_t tid = (int64_t)blockIdx.x * blockDim.x + threadIdx.x' is used to calculate the global index (thread ID) of the current thread. blockIdx.x is the x-coordinate of the thread block in the computation space and blockDim.x is the number of threads of the thread block in the x-direction; multiplying the two gives the starting global index of the current thread block in the x-direction. threadIdx.x is the index of the current thread within the thread block, and adding the two values gives the global index of the current thread.
The line code "size_ttx=tid% x_size_x" is used to calculate the local index of the current thread in the x-dimension.
The line code "size_tty=tid/x_size_stride_y" is used to calculate the local index of the current thread in the y-dimension.
2. Optimizing a kernel index calculation mode;
In the PyTorch framework, the logic that invokes the muDNN operator library involves a part that executes on the host side (the CPU in the host), where the output tensor is typically created within the framework. When tensor data is discontinuous, the get_offset function is called during calculation; it involves division, which is time-consuming, especially since the index is typically of type int64, and can therefore be optimized. Specifically, when the host allocates memory to initialize the output tensor, a section of continuous memory space can be opened up. During calculation, the two inputs are discontinuous while the output is continuous, which reduces the number of get_offset function calls by 1/3 and brings a notable performance improvement. This can significantly improve the processing performance of discontinuous inputs, because index calculation accounts for a relatively large share of the time. For example, in the unary-operator optimization, the bandwidth increases from 85GB/s to 110GB/s when the discontinuous vlen is 1, and from 258GB/s to 308GB/s when the discontinuous vlen is 4-aligned.
3. Replacing the division calculation with a shift operation;
In the muDNN operator library, processing of discontinuous input typically involves computing an index for each element, but when the last dimension of the input (dim[-1]) is a power of 2, division operations can be replaced by shift operations. For example, if dim[-1] is 4 (i.e., 2 to the power of 2), the division by 4 may be replaced by right-shifting the index by 2 bits. This not only improves calculation efficiency but also reduces calculation complexity, thereby improving overall performance. After the unary-operator optimization, in a discontinuous model scenario the performance improves from 308GB/s to 346GB/s, and the bandwidth approaches the 366GB/s of the continuous scenario under the same conditions. After the binary-operator optimization, the performance improves from 340GB/s to 376GB/s in a discontinuous model scenario without broadcasting, and from 425GB/s to 464GB/s in a discontinuous model scenario with broadcasting.
It should be noted that the embodiments of the present disclosure at least include the following innovative points:
1. Passing the stride of each dimension into the kernel and calculating the address of each element according to the stride, so that the number of get_offset calls in the kernel is reduced.
2. When the last dimension of the input (dim[-1]) is a power of 2, a shift operation is used instead of a division calculation.
3. A section of continuous memory space is opened up so that the two inputs remain discontinuous during calculation while the output is continuous, in order to optimize the performance bottleneck in the discontinuous model scenario.
It should be noted that, the embodiments of the present disclosure at least can achieve the following technical effects:
1. the embodiment of the disclosure remarkably improves the computation efficiency of muDNN operator libraries on the GPU by optimizing the memory bandwidth and reducing unnecessary computation, and particularly has obvious improvement when processing large-scale computation tasks such as large language model pre-training and the like.
2. Reduced calculation delay: by reducing the number of get_offset calls in the kernel and replacing division calculations with shift operations, the delay in the calculation process is effectively reduced, making operators faster when processing discontinuous input.
3. Embodiments of the present disclosure have a better competitive advantage in the GPU market, particularly in supporting deep learning and large-scale computing.
4. By optimizing the operator performance on the GPU, the embodiment of the disclosure provides better hardware support for training and application of the large-scale language model, and promotes the development of the fields of natural language processing and artificial intelligence.
5. The method and the device improve the utilization efficiency of GPU hardware resources by optimizing the memory use and operator execution efficiency, and are particularly important for application scenes needing high-performance calculation under limited resources.
6. The embodiment of the disclosure simplifies the adaptation process of the software stack, reduces the maintenance cost and makes the update and optimization of the operator library more convenient by directly supporting discontinuous input.
Based on the foregoing embodiments, the embodiments of the present disclosure provide a data processing apparatus. The apparatus includes units, and the modules included in the units may be implemented by a processor in a computer device, or of course by specific logic circuits. In the implementation process, the processor may be a Central Processing Unit (CPU), a Microprocessor Unit (MPU), a Digital Signal Processor (DSP), a Field Programmable Gate Array (FPGA), or the like.
Fig. 3 is a schematic diagram of a composition structure of a data processing apparatus according to an embodiment of the present disclosure, and as shown in fig. 3, a data processing apparatus 300 includes a determining module 310 and a processing module 320, where:
A determining module 310, configured to determine a kernel function and a thread block required by a target tensor during processing, where the target tensor includes a plurality of element areas, and threads in the thread block have a one-to-one correspondence with the element areas;
A processing module 320, configured to determine an index of each element in the target tensor based on an index of a thread in the thread block, a number of elements that the element region has in a target dimension, and a step size between adjacent elements in the element region;
And the processing module 320 is further configured to call the kernel function through the thread block, and process the target tensor according to the index of each element.
In some implementations, the processing module 320 is further configured to determine a starting index of the thread block in a target dimension based on an index of the thread block in a target dimension in a computation space and a number of threads the thread block has in the target dimension, and determine an index of each thread based on the starting index of the target dimension and the index of each thread in the target dimension in the thread block.
In some embodiments, the step sizes comprise a first step size of the target dimension and a second step size of the remaining dimensions except the target dimension, the processing module 320 is specifically configured to determine an index of each element in the target dimension based on the index of the thread, the number of elements the element region has in the target dimension, and the first step size, determine an index of each element in the remaining dimensions based on the index of the thread, the number of elements the element region has in the target dimension, and the second step size, and determine the index of each element based on the index of each element in the target dimension and the index of each element in the remaining dimensions.
In some embodiments, the processing module 320 is specifically configured to, for any element in the last dimension of the target tensor, right-shift the index of the thread by n bits when the number of elements in the last dimension is 2 to the power of n, to obtain a start address of the last dimension, and determine the index of that element in the last dimension based on the start address of the last dimension and the second step size.
In some embodiments, the processing module 320 is specifically configured to obtain a continuous output tensor corresponding to the target tensor according to the index of each element, and call the kernel function through the thread block to process the continuous output tensor.
In some embodiments, the target tensor includes a first tensor and a second tensor with discontinuous addresses, and the processing module 320 is specifically configured to determine a memory space that meets a requirement for tensor storage based on a size of the first tensor and a size of the second tensor, read an element in the first tensor according to an index of each element in the first tensor to obtain a first output tensor, read an element in the second tensor according to an index of each element in the second tensor to obtain a second output tensor, store the first output tensor at a first address in the memory space, and store the second output tensor at a second address adjacent to the first address to obtain a continuous output tensor.
In some embodiments, the operation of obtaining the continuous output tensor corresponding to the target tensor according to the index of each element is performed by a processor, or the operation of copying the target tensor to a graphics processor and obtaining the continuous output tensor corresponding to the target tensor according to the index of each element is performed by the graphics processor.
In some embodiments, the processing module 320 is specifically configured to determine a plurality of neural network operators required by the target tensor during processing, fuse the plurality of neural network operators based on an association relationship between the plurality of neural network operators to obtain a target neural network operator, and configure the kernel function based on the target neural network operator.
The description of the apparatus embodiments above is similar to that of the method embodiments above, with similar advantageous effects as the method embodiments. In some embodiments, functions or modules included in the apparatus provided by the embodiments of the present disclosure may be used to perform the methods described in the embodiments of the method, and for technical details not disclosed in the embodiments of the apparatus of the present disclosure, please understand with reference to the description of the embodiments of the method of the present disclosure.
It should be noted that, in the embodiments of the present disclosure, if the above-mentioned data processing method is implemented in the form of a software functional module and sold or used as a separate product, it may also be stored in a computer-readable storage medium. Based on such understanding, the technical solutions of the embodiments of the present disclosure, in essence or the part contributing to the related art, may be embodied in the form of a software product stored in a storage medium, including several instructions to cause a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the methods described in the embodiments of the present disclosure. The storage medium includes various media capable of storing program code, such as a USB disk, a removable hard disk, a Read-Only Memory (ROM), a magnetic disk, or an optical disk. Thus, embodiments of the present disclosure are not limited to any specific hardware, software, or firmware, or any combination of the three.
The disclosed embodiments provide a computer device comprising a memory storing a computer program executable on the processor and a processor implementing some or all of the steps of the above method when the processor executes the program.
The disclosed embodiments provide a computer readable storage medium having stored thereon a computer program which when executed by a processor performs some or all of the steps of the above method. The computer readable storage medium may be transitory or non-transitory.
The disclosed embodiments provide a computer program comprising computer readable code which, when run in a computer device, performs some or all of the steps for implementing the methods described above.
Embodiments of the present disclosure provide a computer program product comprising a non-transitory computer-readable storage medium storing a computer program which, when read and executed by a computer, performs some or all of the steps of the above-described method. The computer program product may be realized in particular by means of hardware, software or a combination thereof. In some embodiments, the computer program product is embodied as a computer storage medium, and in other embodiments, the computer program product is embodied as a software product, such as a software development kit (Software Development Kit, SDK), or the like.
It should be noted herein that the above description of various embodiments is intended to emphasize the differences between the various embodiments, and that the same or similar features may be referred to each other. The above description of apparatus, storage medium, computer program and computer program product embodiments is similar to that of method embodiments described above, with similar advantageous effects as the method embodiments. For technical details not disclosed in the embodiments of the disclosed apparatus, storage medium, computer program and computer program product, please refer to the description of the embodiments of the disclosed method.
It should be noted that fig. 4 is a schematic diagram of a hardware entity of a computer device in the embodiment of the present disclosure, as shown in fig. 4, the hardware entity of the computer device 400 includes a processor 401, a communication interface 402, and a memory 403, where:
the processor 401 generally controls the overall operation of the computer device 400.
The communication interface 402 may enable the computer device to communicate with other terminals or servers over a network.
The memory 403 is configured to store instructions and applications executable by the processor 401, and may also cache data (e.g., image data, audio data, voice communication data, and video communication data) to be processed or processed by the respective modules in the processor 401 and the computer device 400, which may be implemented by a FLASH memory (FLASH) or a random access memory (Random Access Memory, RAM). Data transfer may occur between processor 401, communication interface 402 and memory 403 via bus 404.
It should be appreciated that reference throughout this specification to "one embodiment" or "an embodiment" means that a particular feature, structure or characteristic described in connection with the embodiment is included in at least one embodiment of the present disclosure. Thus, the appearances of the phrases "in one embodiment" or "in an embodiment" in various places throughout this specification are not necessarily all referring to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. It should be understood that, in various embodiments of the present disclosure, the size of the sequence numbers of the steps/processes described above does not mean the order of execution, and the order of execution of the steps/processes should be determined by their functions and inherent logic, and should not constitute any limitation on the implementation of the embodiments of the present disclosure. The foregoing embodiment numbers of the present disclosure are merely for description and do not represent advantages or disadvantages of the embodiments.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises that element.
In the several embodiments provided in the present disclosure, it should be understood that the disclosed apparatus and method may be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative; for example, the division of the units is merely a logical function division, and there may be other divisions in actual implementation, such as combining multiple units or components or integrating them into another system, or omitting or not performing some features. In addition, the mutual coupling, direct coupling, or communication connection between the components shown or discussed may be indirect coupling or communication connection through some interfaces, devices, or units, and may be electrical, mechanical, or in other forms.
The units described as separate components may or may not be physically separate, and components displayed as units may or may not be physical units, may be located in one place or distributed on a plurality of network units, and may select some or all of the units according to actual needs to achieve the purpose of the embodiment.
In addition, each functional unit in each embodiment of the disclosure may be integrated in one processing unit, or each unit may be separately used as a unit, or two or more units may be integrated in one unit, where the integrated units may be implemented in a form of hardware or a form of hardware plus a form of software functional unit.
It will be appreciated by those of ordinary skill in the art that implementing all or part of the steps of the above method embodiments may be implemented by hardware associated with program instructions, where the above program may be stored in a computer readable storage medium, where the program when executed performs the steps comprising the above method embodiments, where the above storage medium includes various media that may store program code, such as a removable storage device, a Read Only Memory (ROM), a magnetic disk, or an optical disk.
Or the integrated units of the present disclosure may be stored in a computer readable storage medium if implemented in the form of software functional modules and sold or used as a stand-alone product. Based on such understanding, the technical solutions of the present disclosure may be embodied essentially or in part in a form of a software product stored in a storage medium, including several instructions to cause a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the methods described in the embodiments of the present disclosure. The storage medium includes various media capable of storing program codes such as a removable storage device, a ROM, a magnetic disk, or an optical disk.
The foregoing is merely an embodiment of the present disclosure, but the protection scope of the present disclosure is not limited thereto, and any person skilled in the art can easily think about the changes or substitutions within the technical scope of the present disclosure, and should be covered by the protection scope of the present disclosure.

Claims (12)

CN202510457754.8A | Filed 2025-04-11 | Data processing method, device, equipment, storage medium and program product | Pending | Published as CN120523745A (en)

Priority Applications (1)

Application Number: CN202510457754.8A | Priority Date: 2025-04-11 | Filing Date: 2025-04-11 | Title: Data processing method, device, equipment, storage medium and program product

Applications Claiming Priority (1)

Application Number: CN202510457754.8A | Priority Date: 2025-04-11 | Filing Date: 2025-04-11 | Title: Data processing method, device, equipment, storage medium and program product

Publications (1)

Publication Number: CN120523745A | Publication Date: 2025-08-22

Family

ID=96746172

Family Applications (1)

Application Number: CN202510457754.8A | Status: Pending | Publication: CN120523745A (en) | Priority Date: 2025-04-11 | Filing Date: 2025-04-11 | Title: Data processing method, device, equipment, storage medium and program product

Country Status (1)

Country: CN | Publication: CN120523745A (en)


Legal Events

Code: PB01 | Title: Publication
Code: SE01 | Title: Entry into force of request for substantive examination
