Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
Illustrative embodiments of the present application include, but are not limited to, an artificial intelligence based multi-dimensional parallel processing method, apparatus, device, and medium.
It is to be appreciated that the multidimensional parallel processing methods provided herein can be implemented on a variety of electronic devices including, but not limited to, a server, a distributed server cluster of multiple servers, a cell phone, a tablet, a laptop, a desktop, a wearable device, a head-mounted display, a mobile email device, a portable game console, a portable music player, a reader device, a personal digital assistant, a virtual reality or augmented reality device, a television or other electronic device having one or more processors embedded or coupled therein, and the like.
It is to be appreciated that in various embodiments of the present application, the processor may be a microprocessor, a digital signal processor, a microcontroller, or the like, and/or any combination thereof. According to another aspect, the processor may be a single-core processor, a multi-core processor, the like, and/or any combination thereof.
The inventive concepts of the embodiments of the present application are briefly described below.
From the perspective of the computing market, computing supply is currently insufficient. The present system aims to reduce the demand AI places on computing resources by accelerating large-scale multidimensional parallel processing. Market demand for AI infrastructure platforms is very urgent, and efficient multidimensional parallel processing is an indispensable function of such platforms, so an efficient training scheme like the present system will be needed in the future AI market. From the perspective of AI model application scenarios, a large number of application scenarios create a huge demand for efficient parallel training; many existing leading-edge models cannot be deployed in practice because of computational constraints, and more markets can be developed once computational efficiency is improved. Deployment is relatively difficult in the prior art; for example, NeRF (an application of deep learning to three-dimensional rendering), which appeared in 2019, has not been widely deployed because of the limits on computation speed.
In addition, the thresholds and costs of multidimensional parallel processing and deployment are high. Taking the PyTorch built-in scheme as an example, it is necessary to write code for process groups, intra-group collective communication, data sets, and parallel models, and to adjust the backend interface according to the hardware (CPU/GPU) used. Multidimensional parallel processing deployment engineers need to simultaneously understand many aspects such as algorithms (parallel strategies), systems (training architecture, synchronization methods), AI frameworks, training methods, communication programming, resource scheduling software, big data platforms, and low-level software programming; the talent requirements are extremely high, and the corresponding enterprise employment costs are also high. Different tasks require different multidimensional parallel processing solutions and hardware, with additional software and hardware costs. Existing training schemes are generally built on a vendor's own hardware as customized solutions directly integrated with that hardware; they are difficult to adapt to the endless stream of new hardware/model architectures, so a universal, standardized parallel training scheme is urgently needed. The prior art often seeks breakthroughs on the algorithm side, but on the one hand algorithmic breakthroughs are difficult, and on the other hand algorithms alone can hardly solve the problem of limited multidimensional parallel processing efficiency completely; for example, fields such as medical treatment and security may require data security or models with special structures. Manual parameter tuning and deployment can still be used for training in the short term, but in the long term a universal and automated parallel training mode is needed, so that it can adapt to rapidly iterating algorithms, reduce the cost of AI applications, and promote AI adoption.
In view of this, Fig. 1 shows an artificial intelligence based multidimensional parallel processing method for a hardware processor according to a first embodiment of the present application, the method being executed on a software platform and using a machine learning library;
characterized in that the method comprises the steps of:
data parallelism: automatically managing the to-be-processed data from a user request and distributing the to-be-processed data to each hardware processor;
sequence parallelism: further segmenting long sequence data in the to-be-processed data, dividing each piece of to-be-processed data by sequence, and placing the pieces on a plurality of processors;
pipeline parallelism: splitting the model into a plurality of sections, deploying each section on a different hardware processor, and connecting the sections in series in model order, the output of the previous section serving as the input of the next section;
multi-dimensional model parallelism: performing grid model division on the training model of the to-be-processed data scheduled to the processors, and scheduling the training model to the processors;
the data to be processed comprises a picture processing task and/or a natural language processing task;
The technical solution provided by the embodiments of the present application is suitable for multimedia content recommendation scenarios involving text, pictures (including static pictures in formats such as JPEG and animated pictures in formats such as GIF), videos, and the like, and is mainly exemplified here by corpus vector training in natural language processing. The corpus vectors in natural language processing come from a web corpus, such as Wikipedia. FIG. 2 illustrates a scenario diagram of an artificial intelligence based multi-dimensional parallel processing method, according to some embodiments of the present application. Specifically, the scenario includes a terminal 101, a server 102, and a network 103.
The terminal 101 may be a desktop terminal or a mobile terminal, and the mobile terminal may be, but is not limited to, a personal computer, a notebook computer, a smart phone, a tablet computer, a portable wearable device, and the like. The terminal 101 may be installed with an application that can collect training data sets for natural language processing corpora. The application related to the embodiments of the present application may be a software client, or a client such as a web page or an applet; if it is a web page or applet client, the background server is the background server corresponding to that software, web page, or applet, and the specific type of the client is not limited. The user can log in on the application and then perform data set collection.
The server 102 may be a background server corresponding to an application installed on the terminal 101, for example, an independent physical server or a server cluster or distributed system composed of a plurality of servers, and may also be a cloud server providing basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a network service, cloud communication, a middleware service, a domain name service, a security service, a CDN, and a big data and artificial intelligence platform, but is not limited thereto.
The server 102 may include one or more processors 1021, a memory 1022, and an I/O interface 1023 for interacting with the terminal, among other things. In addition, the server 102 may also be configured with a database 1024, and the database 1024 may be used to store training data sets of the natural language processing corpus submitted by users. The memory 1022 of the server 102 may further store program instructions such as the machine learning library and the optimizer provided in the embodiments of the present application; when executed by the processor 1021, these program instructions can be used to implement the steps of the multidimensional parallel processing method provided in the embodiments of the present application, so as to perform multidimensional parallel processing on the data to be trained input by a user and then push the trained content to a target user for use on the terminal 101 in subsequent artificial intelligence interactive applications.
The terminal 101 and the server 102 are connected via a network 103; the network 103 may include one or more networks and various connection types, such as wired or wireless communication links, a cloud, or fiber optic cables, and a specific example of the network is the internet provided by the communication provider of the terminal 101.
First, the processor 1021 reads, through the I/O interface 1023 interacting with the terminal 101, the training data set of the natural language processing corpus submitted by the user and stored in the database 1024 corresponding to the terminal 101; the processor 1021 then executes the program instructions of the multidimensional parallel processing method stored in the memory 1022, and after training is completed, the result is pushed to the terminal 101 through the I/O interface 1023 and displayed to the user.
The multi-dimensional model parallelism comprises 2-dimensional and/or 2.5-dimensional and/or 3-dimensional grid parallelism.
FIG. 3 illustrates a block diagram of a hardware architecture of an artificial intelligence based multidimensional parallel processing system, according to some embodiments of the present application. Specifically, as shown in fig. 3, it includes one or more processors, system control logic connected to at least one of the processors, system memory connected to the system control logic, non-volatile memory (NVM) connected to the system control logic, and a network interface connected to the system control logic.
In some embodiments, the processor may include one or more single-core or multi-core processors. In some embodiments, the processor may include any combination of general-purpose processors and special-purpose processors (e.g., graphics processors, application processors, baseband processors, etc.). In embodiments where the multidimensional parallel processing system employs an eNB (enhanced Node B) or RAN (Radio Access Network) controller, the processor may be configured to perform the corresponding embodiments described herein.
In some embodiments, the processors include a GPU, a CPU, an FPGA, and a TPU. Resource scheduling of the processors is performed based on the data set of the training task to be processed, tasks may be migrated from the GPU to other non-GPU processors, and the corresponding control logic processing of the training task is performed based on the computing resources of each processor.
In some embodiments, the system control logic may include any suitable interface controllers to provide any suitable interface to at least one of the processors and/or any suitable device or component in communication with the system control logic.
In some embodiments, the system control logic may include one or more memory controllers to provide an interface to system memory. System memory may be used to load and store data and/or instructions. The memory of the multidimensional parallel processing system may in some embodiments comprise any suitable volatile memory, such as suitable Dynamic Random Access Memory (DRAM). In some embodiments, the system memory may be used to load or store instructions for implementing the multidimensional parallel processing, or the system memory may be used to load or store instructions for implementing an application program for implementing multidimensional parallel processing by using the multidimensional parallel processing method.
The NVM/memory may include one or more tangible, non-transitory computer-readable media for storing data and/or instructions. In some embodiments, the NVM/memory may include any suitable non-volatile memory such as flash memory and/or any suitable non-volatile storage device, such as at least one of a HDD (Hard Disk Drive), CD (Compact Disc) Drive, DVD (Digital Versatile Disc) Drive. The NVM/memory may also be used to store training models used in the multidimensional parallel processing described above.
The NVM/memory may comprise a portion of the storage resources on the device on which the multidimensional parallel processing system is installed, or it may be accessible by, but not necessarily a part of, the device. For example, the NVM/memory may be accessed over a network via a network interface.
In particular, the system memory and NVM/storage may each include: a temporary copy and a permanent copy of the instruction. The instructions may include: instructions that when executed by at least one of the processors cause the multidimensional parallel processing system to implement the multidimensional parallel processing method of the present application. In some embodiments, instructions, hardware, firmware, and/or software components thereof may additionally/alternatively be placed in system control logic, a network interface, and/or a processor.
The network interface may include a transceiver to provide a radio interface for the multidimensional parallel processing system to communicate with any other suitable device (e.g., front end module, antenna, etc.) over one or more networks. In some embodiments, the network interface may be integrated with other components of the multidimensional parallel processing system. For example, the network interface may be integrated into at least one of a processor, a system memory, an NVM/storage, and a firmware device (not shown) having instructions that, when executed by at least one of the processors, the multidimensional parallel processing system implements the multidimensional parallel processing method of the present application.
The network interface may further include any suitable hardware and/or firmware to provide a multiple-input multiple-output radio interface. For example, the network interface may be a network adapter, a wireless network adapter, a telephone modem, and/or a wireless modem. The network interface is also used for being in communication connection with the cloud application to achieve data processing of the cloud.
In some embodiments, at least one of the processors may be packaged together with logic for one or more controllers of system control logic to form a System In Package (SiP). In some embodiments, at least one of the processors may be integrated on the same die with logic for one or more controllers of system control logic to form a system on a chip (SoC).
The multi-dimensional parallel processing system may further include: input/output (I/O) devices. The I/O device may include a user interface to enable a user to interact with the multidimensional parallel processing system; the design of the peripheral component interface enables the peripheral component to also interact with the multidimensional parallel processing system.
In a possible implementation of the first aspect, the data-parallel step of automatically managing to-be-processed data requested by a user and distributing the to-be-processed data to each of the hardware processors further includes:
the data in the data parallelism is divided; each node or process holds a copy of the model, each node takes a batch of different data, and forward and backward computation is then performed to obtain the gradient; the training processes are workers, and in addition to the workers there is a parameter server (ps server); the workers send the gradients they compute to the ps server, the ps server performs the update operation, and the updated model is transmitted back to each node;
data parallelism can expand the equivalent batch size, i.e., the equivalent batch size equals the number of parallel processors multiplied by the single-processor batch size, thereby accelerating the computation;
and different processors in the data parallelism use different data and synchronously update the model parameters.
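As an illustration of the data-parallel flow described above, the following minimal Python sketch simulates workers computing gradients on their own mini-batches and a parameter server aggregating them; the toy model, function names, and hyperparameters are hypothetical and do not correspond to any particular library API:

import numpy as np

# Each worker holds a full model copy, takes a different mini-batch, computes a
# gradient, and sends it to the ps server, which updates and broadcasts back.
def worker_gradient(params, batch_x, batch_y):
    # Toy linear model: loss = mean((x @ w - y)^2); gradient with respect to w.
    pred = batch_x @ params
    return 2.0 * batch_x.T @ (pred - batch_y) / len(batch_x)

def ps_update(params, gradients, lr=0.1):
    # The ps server averages the worker gradients and applies the update.
    return params - lr * np.mean(gradients, axis=0)

num_workers = 4          # number of parallel processors
local_batch = 8          # single-processor batch size
# Equivalent batch size = num_workers * local_batch = 32.
params = np.zeros(3)
x = np.random.randn(num_workers * local_batch, 3)
y = x @ np.array([1.0, -2.0, 0.5])

for step in range(100):
    grads = []
    for w in range(num_workers):                  # each worker gets its own shard
        sl = slice(w * local_batch, (w + 1) * local_batch)
        grads.append(worker_gradient(params, x[sl], y[sl]))
    params = ps_update(params, grads)             # update, then broadcast back
print(params)  # approaches [1.0, -2.0, 0.5]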
In a possible implementation of the first aspect, the sequence-parallel step of further segmenting long sequence data in the to-be-processed data and dividing each piece of to-be-processed data by sequence across the plurality of processors specifically includes:
sequence parallelism extends the length of data that a Transformer-type model can receive and handles long texts in NLP and high-resolution pictures in CV tasks, i.e., large pictures and/or videos; a picture can be cut into small pictures, and the small pictures arranged in order also form a sequence; a video is a sequence of pictures, and each picture can again be cut;
after the computing resources are obtained, the picture processing tasks and/or the feature data of the pictures are distributed in parallel to various processors, including but not limited to GPUs/CPUs, and the data are further segmented and distributed in parallel;
if the length of a single piece of data is larger than a threshold, a single processor cannot process it; after sequence-parallel segmentation, one piece of data is placed on a plurality of processors;
through communication, the computation is equivalent to directly processing the whole complete piece of data.
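The following sketch illustrates, under simplifying assumptions, how a single sequence longer than a per-processor threshold can be split into chunks placed on several processors, with a gather step (simulated here by concatenation) making the computation equivalent to processing the complete sequence; the threshold value and helper names are assumptions for illustration:

import numpy as np

MAX_TOKENS_PER_DEVICE = 512      # assumed per-processor threshold

def split_sequence(tokens, num_devices):
    # Divide one long sequence into contiguous chunks, one per device.
    return np.array_split(tokens, num_devices)

def local_embed(chunk):
    # Stand-in for the per-device computation on a sequence chunk.
    return np.tanh(chunk[:, None] * np.linspace(0.1, 1.0, 4))

tokens = np.arange(2048, dtype=np.float32)       # sequence longer than the threshold
num_devices = int(np.ceil(len(tokens) / MAX_TOKENS_PER_DEVICE))
chunks = split_sequence(tokens, num_devices)

# Each device computes on its own chunk; an all-gather (simulated here by
# concatenation) recovers the result for the complete sequence.
partial = [local_embed(c) for c in chunks]
full = np.concatenate(partial, axis=0)
assert full.shape[0] == len(tokens)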
In a possible implementation of the foregoing first embodiment, the multidimensional-model-parallel step of performing grid model division on the training model of the to-be-processed data scheduled to the processors and scheduling the training model to the processors specifically includes:
2-dimensional grid parallelism adopts the scalable dense matrix multiplication algorithm SUMMA, a highly efficient and extensible model-parallel mode based on two-dimensional matrix partitioning;
2.5-dimensional grid parallelism provides a quantifiable novel deep-learning model-parallel framework that minimizes expensive transmission losses between graphics processors and offers a flexible and efficient architecture, further improving model-parallel speed and efficiency;
3-dimensional grid parallelism adopts 3D parallel matrix multiplication: each matrix is divided into a plurality of small blocks by rows and columns, the multiplication of a large matrix is decomposed into multiplications of a plurality of small matrices, and the matrix storage is spread over all of the processors.
The 2-dimensional grid parallelism adopts the Scalable Universal Matrix Multiplication Algorithm (SUMMA) for dense matrices in its three forms, C = AB, C = AB^T, and C = A^T B, using the highly efficient and extensible model-parallel mode of two-dimensional matrix partitioning; based on the input model data, the following is defined:
the batch size is a variable b, the sequence length is a variable s, the hidden size is a variable h, the number of attention heads is a variable n, the vocabulary size is a variable v, the partition number is a variable p, the SUMMA dimension is a variable q, and the number of Transformer layers is a variable N.
Algorithm 1: C = AB;
Input: A_ij, B_ij;
Output: C_ij;
for l ∈ (0 … q-1): broadcast A_il within each row, broadcast B_lj within each column,
C_ij = C_ij + A_il B_lj;
return C_ij;
Algorithm 2: C = AB^T;
Input: A_ij, B_ij;
Output: C_ij;
for l ∈ (0 … q-1): broadcast B_lj within each column;
reduce within each row to C_il;
return C_ij;
Algorithm 3: C = A^T B;
Input: A_ij, B_ij;
Output: C_ij;
for l ∈ (0 … q-1): broadcast A_il within each row;
reduce C_ij within each column to C_lj;
return C_ij;
The main steps of the SUMMA algorithm partition the p processors into a two-dimensional grid, and matrices A and B are divided into p parts according to this partition. After the partitions of matrices A and B are sent to the corresponding processors, the SUMMA algorithm runs on each processor in parallel. At the end of the operation, the algorithm returns a result matrix C that is distributed over the processors, analogously to the partitioning of the A and B matrices.
The specific algorithm comprises the following steps:
inputting: a matrix A [ a, B ], a matrix B [ B, c ];
and (3) outputting: matrix C [ a, C ] ═ a × B;
dividing A and B into p parts for matching the shape of the processor;
sequentially mixing A withij,BijDeposit pij;
For i, j e {0, …, p
1/2-1} and executing C in parallel
ijFor any
Executing p
itA in (A)
itBroadcast to p
ijA 1 is to p
tjB in (1)
tjBroadcast to p
ij,
Cij=Cij+Ait*Bvt
Merging all CijA matrix C is obtained.
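The block structure of the SUMMA steps above can be simulated on a single machine as follows; the q × q processor grid is represented by a dictionary of blocks, and the row/column broadcasts are modeled simply by reading the relevant block. This is an illustrative sketch of the arithmetic, not a distributed implementation:

import numpy as np

def summa_2d(A, B, q):
    """Simulate SUMMA C = A @ B on a q x q processor grid."""
    a, b = A.shape
    b2, c = B.shape
    assert b == b2 and a % q == 0 and b % q == 0 and c % q == 0
    # Partition A, B, C into q x q blocks; block (i, j) lives on processor p_ij.
    A_blk = {(i, j): A[i*a//q:(i+1)*a//q, j*b//q:(j+1)*b//q]
             for i in range(q) for j in range(q)}
    B_blk = {(i, j): B[i*b//q:(i+1)*b//q, j*c//q:(j+1)*c//q]
             for i in range(q) for j in range(q)}
    C_blk = {(i, j): np.zeros((a // q, c // q)) for i in range(q) for j in range(q)}
    for l in range(q):                       # outer-product steps
        for i in range(q):
            for j in range(q):
                # A_il is broadcast along row i, B_lj along column j.
                C_blk[(i, j)] += A_blk[(i, l)] @ B_blk[(l, j)]
    # Merge all C_ij blocks into the result matrix C.
    return np.block([[C_blk[(i, j)] for j in range(q)] for i in range(q)])

A = np.random.randn(8, 8)
B = np.random.randn(8, 8)
assert np.allclose(summa_2d(A, B, q=4), A @ B)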
Fig. 4 is a schematic diagram of the implementation of Algorithm 1, in which a 4 × 4 grid is used and different colors represent different devices. First, each device holds a sub-block of matrices A and B. When the outer-product term A_2B_2 is computed, each device holding a sub-block of matrix A in the second column broadcasts that sub-block along its row, each device holding a sub-block of matrix B in the second row broadcasts that sub-block along its column, each device performs a local matrix multiplication with the broadcast sub-blocks, and the result is accumulated into the final result;
Fig. 5 is a structural layout diagram of the 2.5-dimensional grid parallel scheme, which adopts the SUMMA2.5 algorithm; the p processors are arranged in a 2.5-dimensional layout of [p, p, d], where d is the depth.
The 2.5-dimensional grid parallel operation is implemented by partitioning a matrix A of size [a, b] and a matrix B of size [b, c] and then merging to obtain a matrix C of size [a, c], specifically according to the following algorithm:
where q represents the dimension, b represents the batch size, h represents the hidden size, and s represents the sequence length;
Fig. 6 is a matrix partition-and-merge diagram using the SUMMA2.5 algorithm. Assume that in the structural layout of the processors p = 2, q = 2, and d = 2, and the dark regions indicate a processor layer structure of q = 2. The matrix A[a, b] is divided into dq^2 partition matrices of shape [a/qd, b/q], and a [q, q] grid of partition matrices is stored in each layer; the matrix B[b, c] is divided into q^2 partition matrices of shape [b/q, c/q], and the [q, q] grid of partition matrices is stored in each layer; the dq^2 partition matrices are merged into the matrix C of shape [a, c].
Inputting: a matrix A [ a, B ], a matrix B [ B, c ];
and (3) outputting: matrix C [ a, C ] ═ a × B;
the matrices A, B are divided into block matrices of the form [ a/qd, B/q ] and [ B/q, c/q ], respectively,
for i e {0, … qd-1}, j e {0, …, q-1}, performing h i% p, k i// p, and aijDeposit pkjhIn, CijWhen the ratio is 0, C isijDeposit pkjhPerforming the following steps;
for i ∈ {0, … p-1}, j ∈ {0, …, p-1}, k ∈ {0, …, d-1}, the execution will B ∈ {0, … p-1}, whereijDeposit pkjhPerforming the following steps;
for i, j e {0, … p-1}, k e {0, …, d-1}, and concurrent execution for each element in t e {0, … p-1}, p is addeditkA in (A)itkBroadcast to pijkA 1 is to ptjkB in (1)tjkBroadcast to pijk,Cijk=Cijk+Aitk*Bvtk;
Merging all CijkA matrix C is obtained.
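As a single-machine sketch of the 2.5-dimensional idea under the [q, q, d] layout described above, matrix A is additionally split along the depth dimension into d row bands, matrix B is available to every depth layer, each layer multiplies its own band, and the layer results are stacked into C; this illustrates the partition-and-merge behavior only and is not the SUMMA2.5 communication schedule itself:

import numpy as np

def summa_25d(A, B, q, d):
    """Sketch of 2.5D parallel matmul: depth-split A, B visible on every layer."""
    a, b = A.shape
    _, c = B.shape
    # Split A into d row bands of shape [a/d, b]; each depth layer owns one band.
    bands = np.array_split(A, d, axis=0)
    layer_results = []
    for k in range(d):                  # each depth layer works independently
        Ak = bands[k]
        Ck = np.zeros((Ak.shape[0], c))
        for l in range(q):              # q outer-product steps within the layer
            cols = slice(l * b // q, (l + 1) * b // q)
            Ck += Ak[:, cols] @ B[cols, :]
        layer_results.append(Ck)
    return np.vstack(layer_results)     # merge the layer results into C[a, c]

A = np.random.randn(8, 4)
B = np.random.randn(4, 6)
assert np.allclose(summa_25d(A, B, q=2, d=2), A @ B)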
3-dimensional grid parallelism adopts 3D parallel matrix multiplication: each matrix is divided into a plurality of small blocks by rows and columns, and the multiplication of a large matrix is decomposed into multiplications of a plurality of small matrices;
in the original version of three-dimensional matrix multiplication, each matrix is stored on only one face (a subset of the GPUs), which wastes storage resources; the present scheme optimizes the storage and communication algorithms so that the matrix storage is spread over all of the processors.
Fig. 7 is a structural diagram of matrix-vector parameter balancing according to an embodiment of the present invention; in the following algorithm, load-balancing optimization is adopted for operations between matrices and vectors, the vector b is stored evenly on the diagonal (i, l, j) of the B plane, and C = A + b is calculated;
When the parameter scale per GPU is fixed, the 3D method takes the least time compared with 1D and 2D (3D: 0.672 seconds, 1D: 1.560, 2D: 1.052); when the total parameter scale is fixed, the 3D method is accelerated by 2.3 and 1.6 times relative to the 1D and 2D methods, respectively. Figs. 8 and 9 are schematic diagrams comparing weak scaling efficiency and strong scaling efficiency, respectively. In the weak scaling comparison, the problem size (amount of computation) increases with the number of processors, i.e., the parameter scale per GPU is fixed and the number of GPUs is increased; in the strong scaling comparison, the problem size is kept constant while the number of processors is increased, in order to find the number of processors best suited to the problem, i.e., so that the time taken is as short as possible without incurring too much overhead. The average time for 3-dimensional model parallelism is lower, accelerated by 2.32 and 1.57 times compared with 1 and 2 dimensions, respectively.
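A minimal sketch of the 3-dimensional decomposition arithmetic: with the processors viewed as a q × q × q cube, A is split over the (i, l) indices and B over the (l, j) indices, each processor (i, l, j) computes one small block product, and the partial results along l are summed (the reduction performed by communication in the real algorithm) to form C; this is a single-machine simulation of the arithmetic only:

import numpy as np

def matmul_3d(A, B, q):
    """Simulate 3D parallel matmul on a q x q x q processor cube."""
    a, b = A.shape
    _, c = B.shape
    C_blocks = {}
    for i in range(q):
        for j in range(q):
            # Processors (i, *, j) accumulate over the l dimension; in a real run
            # this sum is a reduce across the depth of the cube.
            acc = np.zeros((a // q, c // q))
            for l in range(q):
                A_il = A[i*a//q:(i+1)*a//q, l*b//q:(l+1)*b//q]
                B_lj = B[l*b//q:(l+1)*b//q, j*c//q:(j+1)*c//q]
                acc += A_il @ B_lj       # block product held by processor (i, l, j)
            C_blocks[(i, j)] = acc
    return np.block([[C_blocks[(i, j)] for j in range(q)] for i in range(q)])

A = np.random.randn(6, 6)
B = np.random.randn(6, 6)
assert np.allclose(matmul_3d(A, B, q=3), A @ B)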
Data parallelism and sequence parallelism combined with 2/2.5/3-dimensional grid (model) parallelism may constitute 4/4.5/5-dimensional parallelism, which may be further combined with pipeline parallelism into 5/5.5/6-dimensional parallelism.
The specific dimension of the 2/2.5/3-dimensional model parallelism of the multi-dimensional grid parallelism is chosen according to the processor attributes. Specifically, 2-dimensional model parallelism requires a × a processors, e.g., 2 × 2 = 4, 3 × 3 = 9, 4 × 4 = 16; 2.5-dimensional model parallelism requires a × a × b processors, e.g., 2 × 2 × 1 = 4, 2 × 2 × 2 = 8, 2 × 2 × 3 = 12; and 3-dimensional model parallelism requires a × a × a processors, e.g., 2 × 2 × 2 = 8, 3 × 3 × 3 = 27.
Even with the same 8 processors, the specific operations of 2.5-dimensional model parallelism and 3-dimensional model parallelism are different; likewise, with 4 processors, the specific operations of 2.5-dimensional model parallelism and 2-dimensional model parallelism are different.
When the number of processors is compatible with several model-parallel configurations, for example 64, all three types are feasible, and which one to select needs to be further optimized according to the actual running performance (speed), because different operating environments differ in processor performance, memory, communication bandwidth, processor network topology, and the like, and the models and data used by different tasks also vary greatly.
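The processor-count constraints above can be expressed as a small helper that lists which grid-parallel modes fit a given number of processors; the function name and the final choice by measured speed are illustrative assumptions:

import math

def feasible_model_parallel_modes(num_procs: int, max_depth: int = 8):
    """List which of 2D / 2.5D / 3D model parallelism fit num_procs processors."""
    modes = []
    # 2D: needs a * a processors.
    if math.isqrt(num_procs) ** 2 == num_procs:
        modes.append(("2D", math.isqrt(num_procs)))
    # 2.5D: needs a * a * d processors for some depth d >= 1.
    for d in range(1, max_depth + 1):
        if num_procs % d == 0:
            side = num_procs // d
            if side >= 4 and math.isqrt(side) ** 2 == side:
                modes.append(("2.5D", (math.isqrt(side), d)))
    # 3D: needs a * a * a processors.
    a = round(num_procs ** (1.0 / 3.0))
    if a ** 3 == num_procs:
        modes.append(("3D", a))
    return modes

print(feasible_model_parallel_modes(4))    # 2D and 2.5D fit
print(feasible_model_parallel_modes(8))    # 2.5D and 3D fit
print(feasible_model_parallel_modes(64))   # all three fit; choose by measured speed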
The model for the data to be processed is parallelized through the 2/2.5/3-dimensional model parallelism, and the model parameters are decomposed onto the processors; since the capacity of a single machine is limited, after decomposition the combined capacity of all machines holds the model, which on the one hand allows a larger model to be accommodated and on the other hand reduces the communication of parameters during computation.
The data to be processed, such as pictures/sentences, are input into the model, and the processors communicate with each other during the forward computation, which is equivalent to computing on the complete long-sequence data. The forward computation produces an output result, which is compared with the training data label to obtain a loss function value; a gradient is then computed backwards and used to update the model parameters in the next step. Both the forward and backward computations can be performed in parallel through the 2/2.5/3-dimensional model parallelism, accelerating the computation.
The multidimensional parallel processing method can be further combined with pipeline parallelism into 5/5.5/6-dimensional parallelism.
In the pipeline parallel mode, the model is split into a plurality of sections, each section is deployed on a different device, and the sections are connected in series, with the output of the previous section serving as the input of the next section; pipeline parallelism is in effect cross-layer model parallelism.
In pipeline parallelism, each device is responsible for the forward and corresponding backward computation of a subset of layers; in a training scenario, because the next step can only begin after the backward pass of the current step is finished, each device experiences bubble waiting, so the device utilization of pipeline parallelism is not high; the batch size of each training step can be increased and the batch divided into a plurality of micro-batches, thereby improving device utilization.
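As a sketch of the micro-batch idea, the following loop splits one large batch into micro-batches and streams them through model sections placed (conceptually) on different devices; the section functions are placeholders, and a real pipeline schedule that overlaps forward and backward passes across devices would be considerably more involved:

import numpy as np

rng = np.random.default_rng(0)
W0 = rng.standard_normal((16, 32))
W1 = rng.standard_normal((32, 32))
W2 = rng.standard_normal((32, 4))
sections = [
    lambda x: np.maximum(x @ W0, 0),   # section 0, conceptually on device 0
    lambda x: np.maximum(x @ W1, 0),   # section 1, conceptually on device 1
    lambda x: x @ W2,                  # section 2, conceptually on device 2
]

def pipeline_forward(batch, num_micro_batches):
    # Splitting the batch into micro-batches lets later sections start working
    # while earlier sections process the next micro-batch, reducing bubbles.
    outputs = []
    for micro in np.array_split(batch, num_micro_batches):
        act = micro
        for section in sections:       # in a real pipeline these run on different devices
            act = section(act)
        outputs.append(act)
    return np.concatenate(outputs, axis=0)

batch = np.random.randn(64, 16)        # enlarged batch, split into 8 micro-batches
out = pipeline_forward(batch, num_micro_batches=8)
print(out.shape)                       # (64, 4)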
In a possible implementation of the first embodiment, the multidimensional parallel processing method further includes, after the data parallelism, the sequence or pipeline parallelism, and the multidimensional model parallelism, providing a plurality of optimizers and selecting an optimizer according to the attributes of the data to be processed and the system operating environment;
the specific multiple optimizer algorithms comprise a LAMB optimizer and/or a LARS optimizer and/or a ConAdv optimizer and/or a La-Lars optimizer;
the LAMB, LARS, and ConAdv optimizers are suitable for large-batch training;
the LARS is used for processing the computer vision related data to be processed;
the LAMB is used for processing related data to be processed in natural language processing;
the ConAdv is suitable for processing to-be-processed data with high speed requirement and low precision requirement;
the La-Lars is suitable for processing to-be-processed data with narrow communication bandwidth and high network communication cost.
Although data parallelism can speed up training by increasing the (equivalent) batch size, it makes optimization difficult, and an optimizer designed specifically for large batches must be used to ensure good convergence. LAMB/LARS/ConAdv are all suitable for large-batch training: LARS is best suited for computer vision related tasks (extending the batch size of CV tasks to 32K), LAMB is best suited for natural language processing related tasks (extending the batch size of NLP tasks to 64K), and ConAdv is suitable for CV tasks that pursue extreme speed with slightly lower precision requirements (extending the CV batch size to 96K with a slight loss of accuracy).
In addition, in data parallelism the gradients must be transmitted through communication and the model parameters updated synchronously; the communication volume is extremely large (proportional to the model size, i.e., the number of model parameters), especially for today's ever larger models. Therefore, if the communication bandwidth (the amount of data that can be transmitted simultaneously) of the system is small, the running speed is severely slowed down, and a large-batch optimizer with a small communication volume should be selected.
The LAMB optimizer and/or the LARS optimizer and/or the ConAdv optimizer and/or the La-Lars optimizer are extensible large-scale optimizers required for training large AI models, and different optimizers can be selected as needed; for example, LAMB/LARS/ConAdv are all suitable for large-batch training, LARS is best suited for computer vision related tasks, LAMB is best suited for natural language processing related tasks, and ConAdv further extends the maximum batch size of computer vision training. APS and La-Lars are suitable when the communication bandwidth is narrow and network communication cost becomes a bottleneck; APS mainly uses low-precision gradients, while La-Lars mainly uses gradient compression.
APS may require only about 1/4 of the communication traffic with little loss of accuracy. La-Lars further compresses the traffic to about one thousandth to accommodate narrow communication bandwidths, albeit at a slight loss of accuracy.
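For reference, a simplified sketch of the layer-wise adaptive rate scaling idea underlying the large-batch optimizers mentioned above: each layer's learning rate is scaled by a trust ratio computed from the norms of that layer's weights and gradients; momentum and the exact hyperparameters of the published LARS/LAMB optimizers are omitted here:

import numpy as np

def lars_step(weights, grads, base_lr=0.1, trust_coef=0.001, weight_decay=1e-4, eps=1e-9):
    """One simplified LARS-style update over per-layer weights and gradients."""
    new_weights = []
    for w, g in zip(weights, grads):
        g = g + weight_decay * w                       # weight decay added to the gradient
        w_norm, g_norm = np.linalg.norm(w), np.linalg.norm(g)
        # Layer-wise trust ratio: the local LR is proportional to ||w|| / ||g||.
        local_lr = trust_coef * w_norm / (g_norm + eps) if w_norm > 0 else 1.0
        new_weights.append(w - base_lr * local_lr * g)
    return new_weights

# Usage: layers with very different gradient scales receive comparable step sizes.
weights = [np.ones((4, 4)), np.ones(2) * 0.01]
grads = [np.random.randn(4, 4) * 10.0, np.random.randn(2) * 0.001]
weights = lars_step(weights, grads)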
FIG. 10 is a statistical chart of the experimental effect of the LAMB algorithm: ADAMW cannot converge under mixed-batch-size training (64k/32k), whereas LAMB reaches a scaling efficiency of 101.8% (with 64 times the computing resources, the computing speed is increased by 65.2 times).
La-Lars is a gradient sparsification algorithm, see Fig. 11: only the important gradients are sent each time gradients are exchanged; the remaining gradients are accumulated locally and transmitted later.
To speed up training, one of the simplest methods is to increase the number of compute nodes. However, when the number of nodes is large, the network communication cost becomes a bottleneck. Meanwhile, when the batch size exceeds a certain size, the generalization performance of the neural network may deteriorate.
LARS addresses the performance degradation caused by large-scale deep learning training. It is a layer-wise adaptive rate scaling optimizer that can extend the batch size to 32K without loss of performance. However, because of the sparse representation of gradients and local gradient accumulation, it is difficult to simply use DGC and LARS together, as this leads to gradient staleness problems.
The present scheme proposes the LA-LARS algorithm, which converges faster and has a smaller performance loss than directly using DGC and LARS together. On the MNIST and CIFAR-10 datasets, LA-LARS outperforms other baseline optimizers while guaranteeing a 0.1% compression ratio. On the ImageNet dataset, it requires only 60%-70% of the training time to achieve performance similar to the baseline optimizers.
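A sketch of the gradient sparsification idea behind La-Lars, in the spirit of deep gradient compression: only the largest-magnitude entries of the locally accumulated gradient are transmitted each step, while the remainder stays in a residual buffer and is sent later; the keep ratio and class name are illustrative assumptions, and the interaction with the LARS trust ratio is omitted:

import numpy as np

class SparseGradientSender:
    """Keep a local residual; each step send only the top-k fraction of it."""
    def __init__(self, shape, keep_ratio=0.001):
        self.residual = np.zeros(shape)
        self.keep_ratio = keep_ratio

    def compress(self, grad):
        self.residual += grad                          # local gradient accumulation
        flat = self.residual.ravel()
        k = max(1, int(len(flat) * self.keep_ratio))   # e.g. 0.1% of the entries
        idx = np.argpartition(np.abs(flat), -k)[-k:]   # indices of the important gradients
        values = flat[idx].copy()
        flat[idx] = 0.0                                # transmitted part leaves the residual
        return idx, values                             # this is what goes over the network

sender = SparseGradientSender((1000, 100), keep_ratio=0.001)
idx, values = sender.compress(np.random.randn(1000, 100))
print(len(values))   # 100 of 100000 entries transmitted this step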
In a second embodiment, referring to Fig. 12, the present application provides an artificial intelligence based multidimensional parallel processing system for a hardware processor, where the system is executed on a software platform and uses a machine learning library; the system comprises:
the data parallel module is used for automatically managing to-be-processed data from a user request and distributing the to-be-processed data to each hardware processor;
the sequence parallel module is used for further segmenting long sequence data in the data to be processed, and performing sequence division on each data to be processed and placing the data to be processed into a plurality of processors;
the pipeline parallel module splits the model into a plurality of sections, each section is deployed on a different hardware processor and connected in series in model order, and the output of the previous section is used as the input of the next section;
the multidimensional model parallel module performs grid model division on the training model of the to-be-processed data scheduled to the processors and schedules the training model to the processors;
the data to be processed comprises a picture processing task and/or a natural language processing task;
the multi-dimensional model parallelism comprises 2-dimensional and/or 2.5-dimensional and/or 3-dimensional grid parallelism.
The artificial intelligence based multidimensional parallel processing system runs in the cloud and interacts with local data through communication;
the artificial intelligence based multidimensional parallel processing system is executed on a software platform, the software platform including but not limited to CUDA and ROCm;
the artificial intelligence based multidimensional parallel processing system uses machine learning libraries including, but not limited to, TensorFlow, Keras, PyTorch.
In a possible implementation of the second embodiment, the data parallel module automatically manages to-be-processed data requested by a user and distributes the to-be-processed data to each of the hardware processors, further including:
the data in the data parallelism is divided; each node or process holds a copy of the model, each node takes a batch of different data, and forward and backward computation is then performed to obtain the gradient; the training processes are workers, and in addition to the workers there is a parameter server (ps server); the workers send the gradients they compute to the ps server, the ps server performs the update operation, and the updated model is transmitted back to each node;
the data parallelism can expand the equivalent batch size, i.e., the equivalent batch size equals the number of parallel processors multiplied by the single-processor batch size, thereby accelerating the computation.
Different processors in the data parallelism use different data and synchronously update the model parameters.
In a possible implementation of the second embodiment, the sequence parallel module further segments long sequence data in the to-be-processed data and divides each piece of to-be-processed data by sequence across the plurality of processors, specifically including:
the sequence parallelism extends the length of data that a Transformer-type model can receive and handles long texts in NLP and high-resolution pictures in CV tasks, i.e., large pictures and/or videos; a picture can be cut into small pictures, and the small pictures arranged in order also form a sequence; a video is a sequence of pictures, and each picture can again be cut;
after the computing resources are obtained, the picture processing tasks and/or the feature data of the pictures are distributed in parallel to various processors, including but not limited to GPUs/CPUs, and the data are further segmented and distributed in parallel;
if the length of a single piece of data is larger than a threshold, a single processor cannot process it; after sequence-parallel segmentation, one piece of data is placed on a plurality of processors;
through communication, the computation is equivalent to directly processing the whole complete piece of data.
In a possible implementation of the second embodiment, performing grid model division on the training model of the to-be-processed data scheduled to the processors and scheduling the training model to the processors specifically includes:
2-dimensional grid parallelism adopts the scalable dense matrix multiplication algorithm SUMMA, a highly efficient and extensible model-parallel mode based on two-dimensional matrix partitioning;
2.5-dimensional grid parallelism provides a quantifiable novel deep-learning model-parallel framework that minimizes expensive transmission losses between graphics processors and offers a flexible and efficient architecture, further improving model-parallel speed and efficiency;
3-dimensional grid parallelism adopts 3D parallel matrix multiplication: each matrix is divided into a plurality of small blocks by rows and columns, the multiplication of a large matrix is decomposed into multiplications of a plurality of small matrices, and the matrix storage is spread over all of the processors.
The 2-dimensional grid parallelism adopts the Scalable Universal Matrix Multiplication Algorithm (SUMMA) for dense matrices in its three forms, C = AB, C = AB^T, and C = A^T B, using the highly efficient and extensible model-parallel mode of two-dimensional matrix partitioning; based on the input model data, the following is defined:
the batch size is a variable b, the sequence length is a variable s, the hidden size is a variable h, the number of attention heads is a variable n, the vocabulary size is a variable v, the partition number is a variable p, the SUMMA dimension is a variable q, and the number of Transformer layers is a variable N.
Algorithm 1: C = AB;
Input: A_ij, B_ij;
Output: C_ij;
for l ∈ (0 … q-1): broadcast A_il within each row, broadcast B_lj within each column,
C_ij = C_ij + A_il B_lj;
return C_ij;
Algorithm 2: C = AB^T;
Input: A_ij, B_ij;
Output: C_ij;
for l ∈ (0 … q-1): broadcast B_lj within each column;
reduce within each row to C_il;
return C_ij;
Algorithm 3: C = A^T B;
Input: A_ij, B_ij;
Output: C_ij;
for l ∈ (0 … q-1): broadcast A_il within each row;
reduce C_ij within each column to C_lj;
return C_ij; Algorithms 1 to 3 above define the three forms of the matrix C.
The main steps of the SUMMA algorithm partition the p processors into a two-dimensional grid, and matrices A and B are divided into p parts according to this partition. After the partitions of matrices A and B are sent to the corresponding processors, the SUMMA algorithm runs on each processor in parallel. At the end of the operation, the algorithm returns a result matrix C that is distributed over the processors, analogously to the partitioning of the A and B matrices.
The specific algorithm comprises the following steps:
Input: matrix A[a, b], matrix B[b, c];
Output: matrix C[a, c] = A × B;
divide A and B into p parts to match the shape of the processor grid;
store A_ij and B_ij in p_ij in turn;
for i, j ∈ {0, …, p^(1/2) - 1}, execute in parallel on C_ij: for each t, broadcast A_it held by p_it to p_ij, broadcast B_tj held by p_tj to p_ij,
C_ij = C_ij + A_it * B_tj;
merge all C_ij to obtain the matrix C.
In the implementation of Algorithm 1, a 4 × 4 grid is used and different colors represent different devices. First, each device holds a sub-block of matrices A and B. When the outer-product term A_2B_2 is computed, each device holding a sub-block of matrix A in the second column broadcasts that sub-block along its row, each device holding a sub-block of matrix B in the second row broadcasts that sub-block along its column, each device performs a local matrix multiplication with the broadcast sub-blocks, and the result is accumulated into the final result;
The structural layout of the 2.5-dimensional grid parallel scheme adopts the SUMMA2.5 algorithm, in which the p processors are arranged in a 2.5-dimensional layout of [p, p, d], where d is the depth.
The 2.5-dimensional grid parallel operation is implemented by partitioning a matrix A of size [a, b] and a matrix B of size [b, c] and then merging to obtain a matrix C of size [a, c], specifically according to the following algorithm:
where q represents the dimension, b represents the batch size, h represents the hidden size, and s represents the sequence length;
The matrix partition-and-merge approach using the SUMMA2.5 algorithm assumes that in the structural layout of the processors p = 2, q = 2, and d = 2, and the dark regions indicate a processor layer structure of q = 2. The matrix A[a, b] is divided into dq^2 partition matrices of shape [a/qd, b/q], and a [q, q] grid of partition matrices is stored in each layer; the matrix B[b, c] is divided into q^2 partition matrices of shape [b/q, c/q], and the [q, q] grid of partition matrices is stored in each layer; the dq^2 partition matrices are merged into the matrix C of shape [a, c].
Inputting: a matrix A [ a, B ], a matrix B [ B, c ];
and (3) outputting: matrix C [ a, C ] ═ a × B;
the matrices A, B are divided into block matrices of the form [ a/qd, B/q ] and [ B/q, c/q ], respectively,
for i e {0, … qd-1}, j e {0, …, q-1}, performing h i% p, k i// p, and aijDeposit pkjhIn, CijWhen the ratio is 0, C isijDeposit pkjhPerforming the following steps;
for i ∈ {0, … p-1}, j ∈ {0, …, p-1}, k ∈ {0, …, d-1}, the execution will B ∈ {0, … p-1}, whereijDeposit pkjhPerforming the following steps;
for i, j e {0, … p-1}, k e {0, …, d-1}, and concurrent execution for each element in t e {0, … p-1}, p is addeditkA in (A)itkBroadcast to pijkA 1 is to ptjkB in (1)tjkBroadcast to pijk,Cijk=Cijk+Aitk*Bvtk;
Merging all CijkA matrix C is obtained.
3-dimensional grid parallelism adopts 3D parallel matrix multiplication: each matrix is divided into a plurality of small blocks by rows and columns, and the multiplication of a large matrix is decomposed into multiplications of a plurality of small matrices;
in the original version of three-dimensional matrix multiplication, each matrix is stored on only one face (a subset of the GPUs), which wastes storage resources; the present scheme optimizes the storage and communication algorithms so that the matrix storage is spread over all of the processors.
In the matrix-vector parameter balancing structure of the embodiment of the present invention, load-balancing optimization is adopted for operations between matrices and vectors, the vector b is stored evenly on the diagonal (i, l, j) of the B plane, and C = A + b is calculated;
When the parameter scale per GPU is fixed, the 3D method takes the least time compared with 1D and 2D (3D: 0.672 seconds, 1D: 1.560, 2D: 1.052); when the total parameter scale is fixed, the 3D method is accelerated by 2.3 and 1.6 times relative to the 1D and 2D methods, respectively. Comparing weak and strong scaling efficiency: in the weak scaling comparison, the problem size (amount of computation) increases with the number of processors, i.e., the parameter scale per GPU is fixed and the number of GPUs is increased; in the strong scaling comparison, the problem size is kept constant while the number of processors is increased, in order to find the number of processors best suited to the problem, i.e., so that the time taken is as short as possible without incurring too much overhead. The average time for 3-dimensional model parallelism is lower, accelerated by 2.32 and 1.57 times compared with 1 and 2 dimensions, respectively.
Data parallelism and sequence parallelism combined with 2/2.5/3-dimensional grid (model) parallelism may constitute 4/4.5/5-dimensional parallelism, which may be further combined with pipeline parallelism into 5/5.5/6-dimensional parallelism.
The specific dimension of the 2/2.5/3-dimensional model parallelism of the multi-dimensional grid parallelism is chosen according to the processor attributes. Specifically, 2-dimensional model parallelism requires a × a processors, e.g., 2 × 2 = 4, 3 × 3 = 9, 4 × 4 = 16; 2.5-dimensional model parallelism requires a × a × b processors, e.g., 2 × 2 × 1 = 4, 2 × 2 × 2 = 8, 2 × 2 × 3 = 12; and 3-dimensional model parallelism requires a × a × a processors, e.g., 2 × 2 × 2 = 8, 3 × 3 × 3 = 27.
Even with the same 8 processors, the specific operations of 2.5-dimensional model parallelism and 3-dimensional model parallelism are different; likewise, with 4 processors, the specific operations of 2.5-dimensional model parallelism and 2-dimensional model parallelism are different.
When the number of processors is compatible with several model-parallel configurations, for example 64, all three types are feasible, and which one to select needs to be further optimized according to the actual running performance (speed), because different operating environments differ in processor performance, memory, communication bandwidth, processor network topology, and the like, and the models and data used by different tasks also vary greatly.
The model for the data to be processed is parallelized through the 2/2.5/3-dimensional model parallelism, and the model parameters are decomposed onto the processors; since the capacity of a single machine is limited, after decomposition the combined capacity of all machines holds the model, which on the one hand allows a larger model to be accommodated and on the other hand reduces the communication of parameters during computation.
The data to be processed, such as pictures/sentences, are input into the model, and the processors communicate with each other during the forward computation, which is equivalent to computing on the complete long-sequence data. The forward computation produces an output result, which is compared with the training data label to obtain a loss function value; a gradient is then computed backwards and used to update the model parameters in the next step. Both the forward and backward computations can be performed in parallel through the 2/2.5/3-dimensional model parallelism, accelerating the computation.
The multidimensional parallel processing system can be further combined with pipeline parallelism into 5/5.5/6-dimensional parallelism.
In the pipeline parallel mode, the model is split into a plurality of sections, each section is deployed on a different device, and the sections are connected in series, with the output of the previous section serving as the input of the next section; pipeline parallelism is in effect cross-layer model parallelism.
In pipeline parallelism, each device is responsible for the forward and corresponding backward computation of a subset of layers; in a training scenario, because the next step can only begin after the backward pass of the current step is finished, each device experiences bubble waiting, so the device utilization of pipeline parallelism is not high; the batch size of each training step can be increased and the batch divided into a plurality of micro-batches, thereby improving device utilization.
In a possible implementation of the second embodiment, the multidimensional parallel processing system further includes, after the data parallelism, the sequence or pipeline parallelism, and the multidimensional model parallelism, providing a plurality of optimizers and selecting an optimizer according to the attributes of the data to be processed;
the specific multiple optimizer algorithms comprise a LAMB optimizer and/or a LARS optimizer and/or a ConAdv optimizer and/or a La-Lars optimizer;
the LAMB, LARS, and ConAdv optimizers are suitable for large-batch training;
the LARS is used for processing the computer vision related data to be processed;
the LAMB is used for processing related data to be processed in natural language processing;
the ConAdv is suitable for processing to-be-processed data with high speed requirement and low precision requirement;
the La-Lars is suitable for processing to-be-processed data with narrow communication bandwidth and high network communication cost.
Although data parallelism can speed up training by increasing the (equivalent) batch size, it makes optimization difficult, and an optimizer designed specifically for large batches must be used to ensure good convergence. LAMB/LARS/ConAdv are all suitable for large-batch training: LARS is best suited for computer vision related tasks (extending the batch size of CV tasks to 32K), LAMB is best suited for natural language processing related tasks (extending the batch size of NLP tasks to 64K), and ConAdv is suitable for CV tasks that pursue extreme speed with slightly lower precision requirements (extending the CV batch size to 96K with a slight loss of accuracy).
In addition, in data parallelism the gradients must be transmitted through communication and the model parameters updated synchronously; the communication volume is extremely large (proportional to the model size, i.e., the number of model parameters), especially for today's ever larger models. Therefore, if the communication bandwidth (the amount of data that can be transmitted simultaneously) of the system is small, the running speed is severely slowed down, and a large-batch optimizer with a small communication volume should be selected.
The LAMB optimizer and/or the LARS optimizer and/or the ConAdv optimizer and/or the La-Lars optimizer are extensible large-scale optimizers required for training large AI models, and different optimizers can be selected as needed; for example, LAMB/LARS/ConAdv are all suitable for large-batch training, LARS is best suited for computer vision related tasks, LAMB is best suited for natural language processing related tasks, and ConAdv further extends the maximum batch size of computer vision training. APS and La-Lars are suitable when the communication bandwidth is narrow and network communication cost becomes a bottleneck; APS mainly uses low-precision gradients, while La-Lars mainly uses gradient compression.
APS may require only about 1/4 of the communication traffic with little loss of accuracy. La-Lars further compresses the traffic to about one thousandth to accommodate narrow communication bandwidths, albeit at a slight loss of accuracy.
FIG. 10 is a statistical chart of the experimental effect of the LAMB algorithm: ADAMW cannot converge under mixed-batch-size training (64k/32k), whereas LAMB reaches a scaling efficiency of 101.8% (with 64 times the computing resources, the computing speed is increased by 65.2 times).
La-Lars is a gradient sparsification algorithm, see Fig. 11: only the important gradients are sent each time gradients are exchanged; the remaining gradients are accumulated locally and transmitted later.
To speed up training, one of the simplest methods is to increase the number of compute nodes. However, when the number of nodes is large, the network communication cost becomes a bottleneck. Meanwhile, when the batch size exceeds a certain size, the generalization performance of the neural network may deteriorate.
LARS addresses the performance degradation caused by large-scale deep learning training. It is a layer-wise adaptive rate scaling optimizer that can extend the batch size to 32K without loss of performance. However, because of the sparse representation of gradients and local gradient accumulation, it is difficult to simply use DGC and LARS together, as this leads to gradient staleness problems.
The present scheme proposes the LA-LARS algorithm, which converges faster and has a smaller performance loss than directly using DGC and LARS together. On the MNIST and CIFAR-10 datasets, LA-LARS outperforms other baseline optimizers while guaranteeing a 0.1% compression ratio. On the ImageNet dataset, it requires only 60%-70% of the training time to achieve performance similar to the baseline optimizers.
A third embodiment provides an artificial intelligence-based multidimensional parallel processing device, which is characterized by comprising:
a memory for storing instructions to be executed by one or more processors of the system; and
a processor, being one of the processors of the system, for executing the instructions to implement any one of the possible artificial intelligence based multidimensional parallel processing methods of the first aspect.
In a fourth embodiment, an embodiment of the present application provides a computer-readable storage medium encoded with a computer program, where the computer-readable storage medium stores instructions that, when executed on a computer, cause the computer to execute any one of the possible artificial intelligence based multidimensional parallel processing methods of the first aspect.
It should be noted that the method embodiments of the present application can be implemented in software, hardware, firmware, and the like. Whether implemented in software, hardware, or firmware, the instruction code may be stored in any type of computer-accessible memory (e.g., permanent or modifiable, volatile or non-volatile, solid or non-solid, fixed or removable media, etc.). Also, the Memory may be, for example, Programmable Array Logic (PAL), Random Access Memory (RAM), Programmable Read Only Memory (PROM), Read-Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), a magnetic disk, an optical disk, a Digital Versatile Disk (DVD), or the like.
It should be noted that, all units/modules mentioned in the embodiments of the apparatuses in this application are logic units/modules, and physically, a logic unit may be a physical unit, or a part of a physical unit, or may be implemented by a combination of multiple physical units, where the physical implementation manner of the logic unit itself is not the most important, and the combination of the functions implemented by the logic units is the key to solve the technical problem provided by this application. In addition, in order to highlight the innovative part of the present application, the above-mentioned embodiments of the apparatus of the present application do not introduce elements that are not so closely related to solve the technical problems proposed by the present application, which does not indicate that there are no other elements in the above-mentioned embodiments of the apparatus.
It is to be noted that in the claims and the description of the present patent, relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, the use of the verb "comprise a" to define an element does not exclude the presence of another, same element in a process, method, article, or apparatus that comprises the element.
While the present application has been shown and described with reference to certain preferred embodiments thereof, it will be understood by those of ordinary skill in the art that various changes in form and details may be made therein without departing from the spirit and scope of the present application.