The present application claims priority from Chinese Patent Application No. 202410137574.7, entitled "Method, Apparatus, and Other Devices for Data Processing," filed on January 31, 2024, which is incorporated herein by reference in its entirety.
Disclosure of Invention
The embodiments of the application provide a computing system, a multi-round session reasoning method, an apparatus, and a computing device cluster, which can avoid recalculation of the KV cache and improve the efficiency of multi-round session reasoning. The technical solution is as follows.
In a first aspect, a computing system is provided that includes a host, an accelerator for the host, and an external storage device;
wherein the external storage device is used for storing the history KV cache of completed sessions; the host is used for preloading the history KV cache of a session to be processed in a task queue from the external storage device into the memory of the host while the accelerator processes a session; and the accelerator is used for preloading the history KV cache required by the layer i+1 computation from the memory of the host into the accelerator while performing the layer i computation of a generative large model on a session.
In this system, the capacity of the external storage device is far greater than that of the HBM in the accelerator, so the external storage device can store a large amount of history KV cache. This increases the hit rate of the history KV cache and avoids recomputing the KV cache, which improves the efficiency of multi-round session reasoning and saves computing resources. In addition, the computation process and the data loading process proceed in parallel, so the computation does not have to wait for data loading to complete; the time cost of the accelerator accessing the external storage device is hidden, further improving the efficiency of multi-round session reasoning.
In some embodiments, the history KV cache of the input of the first session at layer i+1 is independent of the position encoding of the input of the first session.
Here, independent means that the position encoding of the input of the first session has no influence on the history KV cache of the first session.
In some embodiments, the input of the first session includes a number of tokens that is greater than or equal to the number of tokens that can be accommodated by a context window of the generative large model;
the accelerator is used for loading the history KV cache of a first sub-input at layer i+1 from the memory while performing the layer i computation on the first session, where the first sub-input is a sub-input of the first session and the number of tokens included in the first sub-input is smaller than the number of tokens that can be accommodated by the context window.
In this system, the stored KV cache is decoupled from the position encoding. When the context window overflows, a change in the position encoding of the input of the generative large model does not invalidate the stored KV cache, so the accelerator can reuse the history KV cache of the clipped input. This avoids KV cache recomputation caused by context window overflow, improves reasoning efficiency, and saves computing resources.
In some embodiments, the accelerator is further configured to:
embed the position encoding of the first sub-input into the history KV cache of the first sub-input at layer i+1, and perform the layer i+1 computation on the first session based on the input of the first session and the position-encoded history KV cache of the first sub-input at layer i+1.
Wherein the position code is a relative position code (relative positional encoding, RPE).
In some embodiments, the accelerator includes a first buffer for storing a history KV cache of the first session and a second buffer for storing a history KV cache of the input at layer 1 of a second session in the task queue, the second session being a first pending session in the task queue after the first session;
The accelerator is used for loading the history KV cache of the input of the first session at layer i+1 from the memory into the first buffer while performing the layer i computation on the first session, and loading the history KV cache of the input of the second session at layer 1 from the memory into the second buffer while performing the layer N computation on the first session.
In this system, the HBM of the accelerator includes a first buffer and a second buffer. The first buffer stores the KV cache of the session the accelerator is currently processing, and the second buffer stores the KV cache of the session the accelerator will process next. While processing the first session based on the KV cache in the first buffer, the accelerator loads the KV cache required by the layer 1 computation of the second session into the second buffer. After the accelerator finishes processing the first session, the second buffer becomes the execution buffer, and the accelerator releases the first buffer through an asynchronous thread while processing the second session. The compute thread of the accelerator can therefore start processing the second session immediately after finishing the first session without waiting for the first buffer to be released, which hides the time slot between two sessions caused by loading the data required by the next session and improves reasoning efficiency.
In some embodiments, the accelerator is further configured to write KV cache generated by performing the i-th layer computation on the first session into the memory when performing the i+1-th layer computation on the first session.
In this system, the accelerator writes the KV cache generated by the layer i computation back to the memory of the host while performing the layer i+1 computation. Compared with writing the KV cache generated by a round of session into the memory of the host only after that round of reasoning is complete, this hides the time slot between two adjacent rounds of sessions caused by writing back the data generated by the previous round, and improves reasoning efficiency.
In some embodiments, the accelerator includes a third buffer for storing a KV cache generated by the generative large model processing the first session;
The accelerator is used for: when the layer N computation on the first session is completed, if there is KV cache generated by the generative large model processing the first session that has not yet been written into the memory, writing that KV cache into the third buffer; and writing the KV cache in the third buffer into the memory while the generative large model processes the second session in the task queue.
In this system, the HBM of the accelerator further includes a third buffer. After the accelerator finishes processing the first session, the KV cache that was generated while processing the first session but has not yet been written back to the memory is copied from the first buffer into the third buffer. Because data copying between the first buffer and the third buffer is fast, the accelerator can quickly release the first buffer after finishing the first session, so the KV cache required by subsequent processing can be loaded into the first buffer without waiting for the KV cache generated by the first session to be written into the memory first. This hides the time slot between two sessions caused by waiting for the previous session's data to be written back, and improves reasoning efficiency.
In some embodiments, the host is configured to load the history KV cache of a first pending session in the task queue from the external storage device into the memory if that history KV cache does not exist in the memory, where the first pending session is a first number of pending sessions at the head of the task queue.
In this system, while the accelerator processes a session, the host can preload the history KV cache of a pending session in the task queue from the external storage device into the memory of the host. Because the computation process and the data loading process proceed in parallel, the computation does not have to wait for data loading to complete, the time cost of the accelerator accessing the external storage device is hidden, and the efficiency of multi-round session reasoning is improved.
In some embodiments, the number of session rounds included in the first pending session is determined based on the capacity of the memory.
In some embodiments, the host is configured to load the history KV cache of the first pending session from the external storage device into the memory if the available storage space of the memory is sufficient; and, if the available storage space of the memory is insufficient, to write the history KV cache of a second pending session in the task queue from the memory to the external storage device and then load the history KV cache of the first pending session from the external storage device into the memory, where the second pending session is a session in the task queue other than the first pending session.
In some embodiments, the host is further configured to delete the history KV cache of a third pending session in the task queue from the external storage device if the available storage space of the external storage device is insufficient, the third pending session being a second number of pending sessions at the tail of the task queue.
In a second aspect, a multi-round session reasoning method is provided, applied to a computing system, where the computing system includes a host, an accelerator of the host, and an external storage device, the external storage device is used for storing the history key-value cache (KV cache) of a session, and the host is used for loading the history KV cache of a session to be processed in a task queue from the external storage device into a memory of the host; the method includes:
Reasoning about a first session in the task queue through a generative large model based on the task queue, where the first session is one round of a multi-round session and the generative large model includes N layers; when the accelerator performs the layer i computation on the first session, the history KV cache of the input of the first session at layer i+1 is loaded from the memory into the accelerator, where i is an integer greater than or equal to 1 and less than N.
In some embodiments, the history KV cache of the input of the first session at layer i+1 is independent of the position encoding of the input of the first session.
In some embodiments, the input of the first session includes a number of tokens that is greater than or equal to the number of tokens that can be accommodated by a context window of the generative large model;
loading the history KV cache of the input of the first session at layer i+1 from the memory while performing the layer i computation on the first session includes: loading the history KV cache of a first sub-input at layer i+1 from the memory while performing the layer i computation on the first session, where the first sub-input is a sub-input of the first session and the number of tokens included in the first sub-input is smaller than the number of tokens that can be accommodated by the context window.
In some embodiments, the processing of the first session in the task queue by the generative large model includes:
embedding the position encoding of the first sub-input into the history KV cache of the first sub-input at layer i+1, and performing the layer i+1 computation on the first session based on the input of the first session and the position-encoded history KV cache of the first sub-input at layer i+1.
In some embodiments, the accelerator includes a first buffer for storing a history KV cache of the first session and a second buffer for storing a history KV cache of the input at layer 1 of a second session in the task queue, the second session being a first pending session in the task queue after the first session;
loading, when the accelerator performs the layer i computation on the first session, the history KV cache of the input of the first session at layer i+1 from the memory includes: loading, by the accelerator, the history KV cache of the input of the first session at layer i+1 from the memory into the first buffer.
In some embodiments, the method further comprises:
When the accelerator performs layer N calculation on the first session, the accelerator loads the historical KV cache of the input of the second session in the layer 1 from the memory to the second buffer zone.
In some embodiments, the method further comprises:
when the accelerator performs the layer i+1 computation on the first session, writing, by the accelerator, the KV cache generated by the layer i computation on the first session into the memory.
In some embodiments, the accelerator includes a third buffer for storing a KV cache generated by the generative large model processing the first session;
Writing the KV cache generated by the layer i computation on the first session into the memory while performing the layer i+1 computation includes: if, when the accelerator completes the layer N computation on the first session, there is KV cache generated by the generative large model processing the first session that has not yet been written into the memory, writing, by the accelerator, that KV cache into the third buffer; and, while the accelerator processes the second session in the task queue through the generative large model, writing, by the accelerator, the KV cache in the third buffer into the memory.
In some embodiments, the method further includes loading, by the host, the history KV cache of a first pending session in the task queue from the external storage device into the memory if that history KV cache does not exist in the memory, the first pending session being a first number of pending sessions at the head of the task queue.
In some embodiments, the number of session rounds included in the first pending session is determined based on the capacity of the memory.
In some embodiments, if the memory does not have the history KV cache of the first pending session in the task queue, loading, by the host, the history KV cache of the first pending session from the external storage device to the memory, including:
If the available storage space of the memory is insufficient, writing, by the host, the history KV cache of a second pending session in the task queue from the memory to the external storage device, and then loading, by the host, the history KV cache of the first pending session from the external storage device into the memory, where the second pending session is a session in the task queue other than the first pending session.
In some embodiments, the method further comprises:
If the available storage space of the external storage device is insufficient, deleting, by the host, the history KV cache of a third pending session in the task queue from the external storage device, where the third pending session is a second number of pending sessions at the tail of the task queue.
In a third aspect, a multi-round session reasoning method is provided, applied to an accelerator of a host in a computing system, where the computing system further includes the host and an external storage device, the external storage device is used for storing the history key-value cache (KV cache) of a session, and the host is used for loading the history KV cache of a session to be processed in a task queue from the external storage device into a memory of the host; the method includes:
Based on the task queue, reasoning about a first session in the task queue through a generative large model, where the first session is one round of a multi-round session and the generative large model includes N layers; while performing the layer i computation on the first session, the history KV cache of the input of the first session at layer i+1 is loaded from the memory, where i is an integer greater than or equal to 1 and less than N.
In some embodiments, the history KV cache of the input of the first session at layer i+1 is independent of the position encoding of the input of the first session.
In some embodiments, the input of the first session includes a number of tokens that is greater than or equal to the number of tokens that can be accommodated by a context window of the generative large model;
loading the history KV cache of the input of the first session at layer i+1 from the memory while performing the layer i computation on the first session includes: loading the history KV cache of a first sub-input at layer i+1 from the memory while performing the layer i computation on the first session, where the first sub-input is a sub-input of the first session and the number of tokens included in the first sub-input is smaller than the number of tokens that can be accommodated by the context window.
In some embodiments, the processing of the first session in the task queue by the generative large model includes:
embedding the position encoding of the first sub-input into the history KV cache of the first sub-input at layer i+1, and performing the layer i+1 computation on the first session based on the input of the first session and the position-encoded history KV cache of the first sub-input at layer i+1.
In some embodiments, the accelerator includes a first buffer for storing a history KV cache of the first session and a second buffer for storing a history KV cache of the input at layer 1 of a second session in the task queue, the second session being a first pending session in the task queue after the first session;
loading the history KV cache of the input of the first session at layer i+1 from the memory while performing the layer i computation on the first session includes: loading the history KV cache of the input of the first session at layer i+1 from the memory into the first buffer while performing the layer i computation on the first session.
In some embodiments, the method further comprises loading the history KV cache of the input of the second session at layer 1 from the memory into the second buffer while performing the layer N calculation on the first session.
In some embodiments, the method further includes writing KV cache generated by performing the i-th layer calculation on the first session into the memory when the i+1-th layer calculation is performed on the first session.
In some embodiments, the accelerator includes a third buffer for storing a KV cache generated by the generative large model processing the first session;
Writing the KV cache generated by the layer i computation on the first session into the memory while performing the layer i+1 computation includes: if, when the layer N computation on the first session is completed, there is KV cache generated by the generative large model processing the first session that has not yet been written into the memory, writing that KV cache into the third buffer; and writing the KV cache in the third buffer into the memory while processing the second session in the task queue through the generative large model.
In a fourth aspect, a multi-round session reasoning method is provided, applied to a host in a computing system, where the computing system further includes an accelerator of the host and an external storage device, the external storage device is configured to store the history key-value cache (KV cache) of a session, and the accelerator is configured to reason about a first session in the task queue through a generative large model based on the task queue, where the first session is one round of a multi-round session and the generative large model includes N layers; while performing the layer i computation on the first session, the accelerator loads the history KV cache of the input of the first session at layer i+1 from the memory of the host, where i is an integer greater than or equal to 1 and less than N;
The method comprises the following steps:
Loading the history KV cache of a session to be processed in the task queue from the external storage device into the memory of the host, where the memory of the host is used for storing the history KV cache of the session to be processed.
In some embodiments, loading the historical KV cache of the pending session in the task queue from the external storage device to the memory of the host includes:
if the memory does not contain the history KV cache of a first pending session in the task queue, loading the history KV cache of the first pending session from the external storage device into the memory, where the first pending session is a first number of pending sessions at the head of the task queue.
In some embodiments, the number of session rounds included in the first pending session is determined based on the capacity of the memory.
In some embodiments, if the memory does not have the history KV cache of the first pending session in the task queue, the host loads the history KV cache of the first pending session from the external storage device to the memory, including:
If the available storage space of the memory is insufficient, writing the history KV cache of a second pending session in the task queue from the memory to the external storage device, and then loading the history KV cache of the first pending session from the external storage device into the memory, where the second pending session is a session in the task queue other than the first pending session.
In some embodiments, the method further comprises:
If the available storage space of the external storage device is insufficient, deleting the history KV cache of a third pending session in the task queue from the external storage device, where the third pending session is a second number of pending sessions at the tail of the task queue.
In a fifth aspect, a multi-round session reasoning apparatus is provided for use with an accelerator of a host in a computing system, the apparatus comprising at least one functional module for performing a multi-round session reasoning method as provided by the foregoing third aspect or any one of the possible implementations of the third aspect.
In a sixth aspect, a multi-round session reasoning apparatus is provided for use with a host in a computing system, the apparatus comprising at least one functional module for performing the multi-round session reasoning method as provided by the fourth aspect or any one of the possible implementations thereof.
In a seventh aspect, an accelerator is provided, comprising a computational core and a memory, where the memory is used for storing computational data, and the computational core is used for performing computational operations on the computational data stored in the memory and for performing the multi-round session reasoning method as provided by the foregoing third aspect or any one of its possible implementations.
In an eighth aspect, a host is provided, the host comprising a processor and a memory, the host being adapted to perform the multi-round session reasoning method as provided by the foregoing fourth aspect or any one of the possible implementations of the fourth aspect.
In a ninth aspect, a computing device cluster is provided, where the computing device cluster includes at least one computing device, each computing device includes a host and an accelerator of the host, and an external storage device is located in the at least one computing device, where the external storage device is used for storing the history KV cache of a session and the host is used for loading the history KV cache of a session to be processed in a task queue from the external storage device into a memory of the host;
The host in the computing device cluster is configured to perform a multi-round session reasoning method as provided by the fourth aspect or any one of the possible implementations of the fourth aspect;
the accelerator in the cluster of computing devices is for performing a multi-round session reasoning method as provided by the foregoing third aspect or any of the possible implementations of the third aspect.
In a tenth aspect, there is provided a computer program product comprising instructions which, when executed by a cluster of computing devices, cause the cluster of computing devices to perform a multi-round session reasoning method as provided by the foregoing third aspect or any one of the possible implementations of the third aspect, or to perform a multi-round session reasoning method as provided by the foregoing fourth aspect or any one of the possible implementations of the fourth aspect.
In an eleventh aspect, a computer-readable storage medium is provided, comprising computer program instructions which, when executed by a cluster of computing devices, cause the cluster of computing devices to perform a multi-round session reasoning method as provided by the foregoing third aspect or any one of its possible implementations, or to perform a multi-round session reasoning method as provided by the foregoing fourth aspect or any one of its possible implementations.
Further combinations of the present application may be made to provide further implementations based on the implementations provided in the above aspects.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the embodiments of the present application will be described in further detail with reference to the accompanying drawings.
First, the present application relates to the application of a generative large model based on the Transformer architecture to multi-round session reasoning. To facilitate understanding of the embodiments of the present application, several technical terms involved in the embodiments are explained below.
Token, the smallest semantic unit represented by a vector, where the smallest semantic unit is, for example, a character, word, or phrase. A token sequence composed of a plurality of tokens serves as the input of the generative large model. In a multi-round session reasoning scenario, the input token sequence of the generative large model is the question prompt of a session. The generative large model reasons over the input token sequence and obtains the subsequent tokens of the question prompt one by one through multiple iterations, until a complete subsequent token sequence is obtained or an inference terminator appears. The token sequence output by the generative large model is the answer to the question prompt. For example, if the input token sequence of the generative large model is X[1:s], the generative large model performs a first iteration based on X[1:s] to obtain the (s+1)-th token, token[s+1], performs a second iteration based on X[1:s] and token[s+1] to obtain the (s+2)-th token, token[s+2], and so on, until a complete token sequence is obtained or an inference terminator appears. The generative large model then outputs the token sequence obtained by reasoning, i.e., the output answer, which completes one round of dialogue reasoning, where s is an integer greater than 1.
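By way of illustration only (this is not part of the claimed subject matter), the following Python sketch shows the iterative generation loop described above; model_step and EOS are hypothetical stand-ins for one forward pass of the generative large model and for its inference terminator.

```python
EOS = -1  # hypothetical inference terminator token id

def model_step(tokens):
    # Placeholder for one forward pass over all N layers; returns the next token id.
    # It simply echoes a dummy token so that the loop is runnable.
    return EOS if len(tokens) >= 8 else len(tokens)

def infer_answer(prompt_tokens, max_new_tokens=16):
    tokens = list(prompt_tokens)          # X[1:s], the input token sequence
    answer = []
    for _ in range(max_new_tokens):       # iteration 1 yields token s+1, and so on
        next_token = model_step(tokens)   # iteration k uses X[1:s] plus tokens s+1 .. s+k-1
        if next_token == EOS:             # stop when the inference terminator appears
            break
        tokens.append(next_token)
        answer.append(next_token)
    return answer                         # the generated token sequence, i.e. the answer

print(infer_answer([101, 102, 103]))
```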
A large model based on the Transformer architecture includes N Transformer layers, where N is an integer greater than 1. Each Transformer layer includes a self-attention mechanism (self-attention) and a feed-forward neural network (feed forward network, FFN).
In each iteration, the N Transformer layers of the generative large model sequentially process the token sequence input in that iteration to obtain one token. The processing of the input token sequence by the i-th layer of the generative large model includes: projecting each token in the input of the iteration with the model parameters corresponding to the self-attention mechanism of the i-th layer to generate the intermediate data key and value corresponding to each token; performing a nonlinear conversion on the key and value, where the nonlinear conversion is, for example, a residual connection and normalization; passing the result of the nonlinear conversion to the feed-forward neural network of the i-th layer; processing that result with the feed-forward neural network of the i-th layer to obtain the processing result of the i-th layer; and passing the processing result of the i-th layer to the (i+1)-th layer for processing. Here i is an integer greater than or equal to 1 and less than N. Processing the input token sequence through the i-th layer of the generative large model to obtain the processing result of the i-th layer is referred to as the layer i computation.
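The following numerical sketch illustrates the layer computation just described; the toy dimensions, random weights, and the simple normalization are assumptions for illustration only, not the patented implementation.

```python
import numpy as np

d = 8                                            # assumed hidden size
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
W_ffn = rng.standard_normal((d, d))

def layer_norm(x):
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + 1e-5)

def transformer_layer(x):
    # 1) project every token onto the self-attention parameters to get Q, K, V
    q, k, v = x @ Wq, x @ Wk, x @ Wv             # k and v are the intermediate data kept as KV cache
    # 2) self-attention over the token sequence
    scores = q @ k.T / np.sqrt(d)
    attn = np.exp(scores - scores.max(-1, keepdims=True))
    attn /= attn.sum(-1, keepdims=True)
    # 3) nonlinear conversion: residual connection and normalization
    h = layer_norm(x + attn @ v)
    # 4) feed-forward network; its output is the processing result passed to layer i+1
    return layer_norm(h + np.maximum(h @ W_ffn, 0)), (k, v)

x = rng.standard_normal((5, d))                  # 5 input tokens
out, (k, v) = transformer_layer(x)
print(out.shape, k.shape, v.shape)               # (5, 8) (5, 8) (5, 8)
```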
In order to maintain semantic consistency of the conversation context and ensure accurate understanding of the input of the current session, the generative large model refers to the inputs and outputs of the historical sessions with the same object when generating the answer of the current session. That is, the input of one round in a multi-round session consists of the token sequences input in the historical sessions, the token sequences output in the historical sessions, and the token sequence input by the object in the current session.
In one iteration of the generative large model (i.e., the process of generating one token), the model generates the corresponding intermediate data key and value when processing any token, and the generated intermediate data are used in subsequent iterations. Moreover, in a multi-round session, the token sequences input and output in the historical sessions are used in every round, so the intermediate data corresponding to those tokens are also used. Therefore, in order to save computing resources and improve reasoning efficiency, the intermediate data corresponding to the tokens input and output in the historical sessions are stored to form the history KV cache of the session. When reasoning about the session, the history KV cache can be reused directly without recomputation.
Pre-filling stage, namely the 1st iteration in a round of session reasoning. In the pre-filling stage, the generative large model performs the 1st iteration based on the token sequence X[1:s] input in this round of session to obtain token[s+1], and stores the key and value corresponding to each token in X[1:s] obtained in this iteration, forming KV cache[1:s].
Decoding stage, namely the 2nd iteration to the last iteration in a round of session reasoning. In any iteration of the decoding stage, the generative large model computes the key and value corresponding to the token obtained in the previous iteration, reads the stored history KV cache, obtains the token of this iteration based on the key and value computed in this iteration together with the history KV cache, and stores the key and value computed in this iteration, i.e., updates the history KV cache.
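The caching pattern of the pre-filling and decoding stages can be sketched as follows; project_kv and decode_step are illustrative stand-ins, and only the way the KV cache is built once and then appended to matters here.

```python
import numpy as np

d, rng = 4, np.random.default_rng(1)
Wk, Wv = rng.standard_normal((d, d)), rng.standard_normal((d, d))

def project_kv(token_vecs):
    return token_vecs @ Wk, token_vecs @ Wv

def decode_step(kv_cache, new_token_vec):
    # Compute key/value only for the token produced by the previous iteration,
    # append them to the stored history KV cache, then attend over the full cache.
    k_new, v_new = project_kv(new_token_vec[None, :])
    kv_cache = (np.vstack([kv_cache[0], k_new]), np.vstack([kv_cache[1], v_new]))
    scores = new_token_vec @ kv_cache[0].T / np.sqrt(d)
    weights = np.exp(scores - scores.max()); weights /= weights.sum()
    return kv_cache, weights @ kv_cache[1]       # attention output for this iteration

prompt = rng.standard_normal((6, d))             # token sequence X[1:s]
kv_cache = project_kv(prompt)                    # pre-filling: build KV cache[1:s] in one pass
for step in range(3):                            # decoding: iterations 2..M
    new_vec = rng.standard_normal(d)             # stands in for the token from the last iteration
    kv_cache, out = decode_step(kv_cache, new_vec)
print(kv_cache[0].shape)                         # (9, 4): 6 prompt keys + 3 decoded keys
```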
Context window (context window), namely the number of tokens that the generative large model can process at the same time.
Accelerator stream (stream) is an abstract unit of an accelerator that performs parallel computing tasks.
Execution buffer (execution buffer), namely an area in the high-bandwidth memory (HBM) of the accelerator for storing the KV cache that the inference computation stream of the accelerator can process.
Having described several terms of art to which embodiments of the present application relate, the environment in which embodiments of the present application are implemented is described below.
Fig. 1 is a schematic structural diagram of a computing system according to an embodiment of the present application, and as shown in fig. 1, the computing system includes a host 101, an accelerator 102 of the host, and an external storage device 103, where the host 101, the accelerator 102, and the external storage device 103 communicate through a wired network or a wireless network.
Wherein the computing system is capable of being deployed on a cluster of computing devices that includes at least one computing device, which may be a server, such as a central server, an edge server, or a local server in a local data center. In some embodiments, the computing device may also be a terminal device such as a desktop, notebook, or smart phone.
The external storage device 103 is, for example, a solid state disk (SSD) or a mechanical hard disk drive (HDD), or a cloud storage service such as an object storage service (OBS), a cloud disk service (elastic volume service, EVS), or a scalable file service (SFS); the embodiment of the present application does not limit the type of the external storage device 103. Illustratively, in the multi-level storage system, the external storage device 103 is used to store the history KV cache of sessions.
The host 101 is, for example, the host of a computing device, such as a host computing device for controlling and managing other computing devices in the computing device cluster. The memory of the host 101 is, for example, a dynamic random access memory (DRAM). Illustratively, a task scheduler runs on the host 101 and maintains a task queue indicating the pending sessions of the accelerator; the memory of the host 101 is used for storing the history KV cache of the pending sessions in the task queue, and the host 101 is used for loading, based on the task queue, the history KV cache of a pending session in the task queue from the external storage device 103 into the memory of the host 101.
The accelerator 102 is, for example, a general-purpose central processing unit (CPU), a graphics processing unit (GPU), a switching module processor unit (SMPU), a network processor unit (NPU), or a microprocessor, or one or more integrated circuits for implementing the solution of the present application, such as an application-specific integrated circuit (ASIC), a programmable logic device (PLD), or a combination thereof. The PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), generic array logic (GAL), or any combination thereof. The embodiment of the present application does not limit the type of the accelerator. Illustratively, the HBM in the accelerator 102 is used to store the history KV cache of the first session. The accelerator 102 can generate an answer to a first session in the task queue by reasoning about the first session, i.e., performing the layer computations of the generative large model. The first session is one round of a multi-round session; when performing the layer i computation, the accelerator 102 reuses the history KV cache of the input of the first session at layer i stored in its HBM, and while performing the layer i computation, it loads the history KV cache of the first session at layer i+1 from the memory into the HBM.
In some embodiments, each computing device in the computing device cluster includes a host 101, an accelerator 102, and an external storage device 103, i.e., the storage device configured in each computing device is used as the external storage device 103; in other embodiments, each computing device in the computing device cluster includes the host 101 and the accelerator 102, and at least one computing device in the cluster that has storage capability acts as the external storage device 103. The embodiment of the present application does not limit this.
Illustratively, Fig. 2 is a functional schematic diagram of a computing system according to an embodiment of the present application. As shown in Fig. 2, the computing system includes a control system and a multi-level storage system, where the multi-level storage system includes the memory of the host 101, the HBM in the accelerator 102, and the external storage device 103. The control system runs on the host 101 and the accelerator 102 and includes a task scheduler and a key-value cache management unit. The task scheduler maintains a task queue indicating the pending sessions of the accelerator, and the key-value cache management unit controls data migration between the memory of the host 101 and the external storage device 103. The key-value cache management unit includes a key-value cache pull subunit and a key-value cache put subunit: the key-value cache pull subunit pre-pulls the history KV cache of a pending session in the task queue into the memory of the host 101 and loads the KV cache in the memory of the host 101 into the HBM of the accelerator; the key-value cache put subunit evicts KV cache from the memory of the host 101 to the external storage device 103 when the available storage space of the memory of the host 101 is insufficient, and stores (writes back) the KV cache generated by the accelerator processing a session into the memory of the host. It should be noted that the division of the functional modules of the computing system shown in Fig. 2 is merely exemplary, and the embodiment of the present application does not limit the manner in which the functional modules are divided.
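For illustration, the control system of Fig. 2 can be sketched structurally as below; the class and attribute names are assumptions, and only the data movement between external storage and host memory is modelled.

```python
from collections import deque

class KVCacheManager:
    def __init__(self, host_memory, external_storage):
        self.host_memory = host_memory            # dict: session id -> history KV cache
        self.external_storage = external_storage  # dict: session id -> history KV cache

    def pull(self, session_id):
        # Pre-pull the history KV cache of a pending session into host memory.
        if session_id not in self.host_memory and session_id in self.external_storage:
            self.host_memory[session_id] = self.external_storage[session_id]

    def put(self, session_id, kv_cache):
        # Write back (or evict) a KV cache from host memory to external storage.
        self.external_storage[session_id] = kv_cache
        self.host_memory.pop(session_id, None)

class TaskScheduler:
    def __init__(self, manager):
        self.queue = deque()                      # task queue of pending sessions
        self.manager = manager

    def submit(self, session_id):
        self.queue.append(session_id)
        self.manager.pull(session_id)             # prefetch while earlier sessions run

    def next_session(self):
        return self.queue.popleft() if self.queue else None

mgr = KVCacheManager(host_memory={}, external_storage={"s1": "kv-of-s1"})
sched = TaskScheduler(mgr)
sched.submit("s1")
print(sched.next_session(), mgr.host_memory)      # s1 {'s1': 'kv-of-s1'}
```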
In some embodiments, the wireless or wired network described above uses standard communication techniques and/or protocols. The network is typically, without limitation, a transmission control protocol/internet protocol (TCP/IP) network in a data center network, or an RDMA network such as an RDMA over converged Ethernet (RoCE) network or an InfiniBand (IB) network. In other embodiments, custom and/or dedicated data communication techniques can also be used in place of, or in addition to, the data communication techniques described above.
The embodiment of the application provides a multi-round session reasoning method applied to the above computing system. In this method, the external storage device stores the history KV cache of completed sessions; because the capacity of the external storage device is far greater than that of the HBM in the accelerator, it can store a large amount of history KV cache, which increases the hit rate of the history KV cache, avoids recomputing the KV cache, improves the efficiency of multi-round session reasoning, and saves computing resources. At the same time, while the accelerator processes a session, the host can preload the history KV cache of a pending session in the task queue from the external storage device into the memory of the host, and while performing the layer i computation of the generative large model on a session, the accelerator can preload the history KV cache required by the layer i+1 computation from the memory of the host into the accelerator. Because the computation process and the data loading process proceed in parallel, the computation does not have to wait for data loading to complete, which hides the time cost of the accelerator accessing the external storage device and improves the efficiency of multi-round session reasoning.
The method relates to a data migration process between an accelerator and a memory of a host and a data migration process between the memory of the host and an external storage device in a multi-round session reasoning process. These two data migration processes will be described separately below.
First, a data migration process between the accelerator and the memory of the host will be described. Fig. 3 is a flowchart of a multi-round session reasoning method according to an embodiment of the present application, and as shown in fig. 3, the method is applied to a computing system, where the computing system includes a host, an accelerator of the host, and an external storage device, and the method includes the following steps 301 to 308.
Step 301, the accelerator loads the history KV cache of the input of the first session at layer 1 of the generative large model from the memory of the host into the first buffer of the accelerator, where the generative large model includes N layers and N is an integer greater than 1.
The first session is one round of a multi-round session initiated by a first object. The input of the first session includes the token sequences input in the historical sessions of the first session, the token sequences output in those historical sessions, and the token sequence input by the object in the first session. Multiple rounds of sessions initiated by the same object share the same session window identifier, so the historical sessions of the first session are the completed sessions having the same session window identifier as the first session. For example, the first object initiates three rounds of sessions, the first and second rounds are completed, and the first session is the third round: the token sequence input in the first round is Q1 and the output is A1, the token sequence input in the second round is Q2 and the output is A2, and the token sequence input in the third round is Q3, so the input of the first session is [Q1 A1 Q2 A2 Q3].
The history KV cache of the input of the first session at layer 1 is the KV cache associated with layer 1 among the history KV cache of the first session. The first buffer is the area in the HBM of the accelerator for storing the KV cache that the accelerator can currently process, i.e., the execution buffer of the accelerator.
Before executing the layer 1 computation of the 1st iteration of the first session, the accelerator starts a data-read thread and loads the history KV cache of the input of the first session at layer 1 from the memory of the host into the first buffer through that thread. In this way, the data required by the layer 1 computation is ready in the first buffer of the accelerator before the computation takes place, which avoids a time slot caused by loading data from the memory of the host and improves reasoning efficiency.
The history KV cache is a variable-length multidimensional vector and can be stored at different granularities. For example, in some embodiments the history KV cache is stored at the granularity of a session round: the intermediate data key and value obtained by the generative large model processing multiple sessions are stored as multiple KV caches, one per session, each with a session identifier. In other embodiments the history KV cache is stored at the granularity of a layer of the generative large model: the intermediate data key and value obtained at each layer are stored as multiple KV caches, one per layer, each with a session identifier and a layer identifier. In still other embodiments the history KV cache is stored at the granularity of a set of multi-round sessions: the intermediate data key and value obtained by the generative large model processing the multiple rounds of sessions initiated by the same object are stored as one KV cache. The foregoing description of the storage granularity of the history KV cache is merely exemplary, and the embodiment of the present application does not limit the storage granularity of the KV cache.
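As one possible illustration of the layer-granularity option above, a cache could be keyed by a session identifier plus a layer identifier; the key layout below is a hypothetical example, not prescribed by the text.

```python
from typing import Any, Dict, Optional, Tuple

KVStore = Dict[Tuple[str, int], Any]   # (session id, layer index) -> variable-length KV tensor

def put_layer_kv(store: KVStore, session_id: str, layer: int, kv: Any) -> None:
    store[(session_id, layer)] = kv

def get_layer_kv(store: KVStore, session_id: str, layer: int) -> Optional[Any]:
    # Returns None on a miss, in which case the KV cache must be recomputed.
    return store.get((session_id, layer))

store: KVStore = {}
put_layer_kv(store, "session-7", 1, "kv@layer1")
print(get_layer_kv(store, "session-7", 1), get_layer_kv(store, "session-7", 2))
```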
The history KV cache of the input of the first session at layer 1 is independent of the position encoding of the input of the first session, i.e., the position encoding of the input of the first session has no influence on the history KV cache of the first session.
In some embodiments, the input of the first session includes more tokens than the context window of the generative large model can accommodate, and the accelerator loads the history KV cache of a first sub-input at layer 1 from the memory of the host, where the first sub-input is a sub-input of the first session and includes fewer tokens than the context window can accommodate. This can be understood as follows: when the input of the first session causes the context window to overflow, the input is clipped so that the earliest tokens are discarded and the first sub-input of the first session is used as the input of the generative large model; correspondingly, the accelerator loads the history KV cache corresponding to the first sub-input from the memory of the host, i.e., the KV cache is clipped directly after the context window overflows. Fig. 4 illustrates this loading process. Fig. 4 is a schematic diagram of directly clipping the KV cache after the context window overflows. As shown in Fig. 4, the memory of the host stores the history KV cache of the input of the first session at layer 1, and the position encoding of the input of the first session is [0:2048]. When the input of the first session causes the context window to overflow, the accelerator loads the history KV cache of the first sub-input of the first session at layer 1 from the memory of the host, where the position encoding of the first sub-input is [0:1536]; the accelerator embeds the position encoding of the first sub-input into the history KV cache at layer 1 and performs subsequent reasoning based on the position-encoded history KV cache. Embedding the position encoding into the history KV cache means embedding the position encoding into the keys of the history KV cache.
In the above embodiment, the stored KV cache is decoupled from the position encoding. When the context window overflows, the change in the position encoding of the input of the generative large model does not invalidate the stored KV cache, so the accelerator can reuse the history KV cache of the clipped input. This avoids KV cache recomputation caused by context window overflow, improves reasoning efficiency, and saves computing resources.
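The clipping of Fig. 4 can be sketched as follows, assuming a KV cache stored without any position encoding; the way the relative position code is re-attached is a placeholder, since the concrete RPE is not specified here.

```python
import numpy as np

def clip_and_reposition(history_k, history_v, context_window):
    # history_k / history_v: (num_tokens, d) caches stored without position encoding
    keep = min(len(history_k), context_window)
    k, v = history_k[-keep:], history_v[-keep:]          # discard the oldest tokens' cache
    positions = np.arange(keep)                          # fresh positions 0..keep-1
    # Stand-in for embedding a relative position code into the clipped keys; a real
    # system might apply RoPE or another RPE here -- this is only illustrative.
    k = k * (1.0 + positions[:, None] * 1e-3)
    return k, v, positions

rng = np.random.default_rng(2)
k, v = rng.standard_normal((2048, 8)), rng.standard_normal((2048, 8))
k2, v2, pos = clip_and_reposition(k, v, context_window=1536)
print(k2.shape, pos[:3], pos[-1])                        # (1536, 8) [0 1 2] 1535
```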
It should be noted that the history KV cache of the first session is stored in the external storage device. Before the accelerator reasons about the first session through the generative large model, the host preloads the history KV cache of the first session from the external storage device into the memory of the host, and the accelerator then loads the history KV cache of the input of the first session at layer 1 from the memory of the host. The data migration process between the external storage device and the memory of the host is described in the following embodiments and is not repeated here.
Step 302, the accelerator performs the layer 1 computation of the 1st iteration on the first session through the generative large model based on the input of the first session, and at the same time loads the history KV cache of the input of the first session at layer 2 from the memory of the host into the first buffer of the accelerator.
The accelerator starts a compute thread, through which the layer 1 computation of the 1st iteration is performed on the first session.
The layer 1 computation includes: projecting the input token sequence of the first session with the model parameters corresponding to the self-attention mechanism of layer 1 to generate the intermediate data key and value corresponding to each token in the input token sequence of the first session; updating the history KV cache of the input of the first session at layer 1 based on the key and value of the input token sequence at layer 1; embedding the position encoding of the input of the first session into the updated history KV cache at layer 1, where the position encoding is a relative position encoding (relative positional encoding, RPE); performing a nonlinear conversion, including a residual connection and normalization, on the key and value in the position-encoded history KV cache; passing the result of the nonlinear conversion to the feed-forward neural network of layer 1; and obtaining the layer 1 processing result from the feed-forward neural network. Updating the history KV cache of the input of the first session at layer 1 based on the key and value of the input token sequence at layer 1 means that each time the compute thread of the accelerator computes an intermediate data key or value, that intermediate data is appended to the history KV cache of the input of the first session at layer 1 in the first buffer.
The layer 1 computation is illustrated by Fig. 5, which is a schematic diagram of reasoning based on a KV cache decoupled from the position encoding. As shown in Fig. 5, the accelerator projects the input token sequence (input) of the first session with the model parameters (Wk, Wv) corresponding to the self-attention mechanism of layer 1 to generate the intermediate data key and value corresponding to each token, updates the history KV cache, embeds the position encoding of the input of the first session into the keys of the updated history KV cache, and performs subsequent reasoning based on the position-encoded cache.
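The decoupling shown in Fig. 5 can be sketched as below: keys and values are produced and cached without any position information, and a position code is embedded into the cached keys only at computation time. A rotary-style embedding is used here purely as one possible relative position encoding; the actual RPE is not fixed by the text.

```python
import numpy as np

d = 4

def rotate_half(x):
    x1, x2 = x[..., : d // 2], x[..., d // 2 :]
    return np.concatenate([-x2, x1], axis=-1)

def embed_rpe(keys, positions):
    # Attach a rotary-style relative position code to position-free cached keys.
    inv_freq = 1.0 / (10000 ** (np.arange(0, d, 2) / d))
    angles = positions[:, None] * inv_freq[None, :]
    cos = np.concatenate([np.cos(angles), np.cos(angles)], axis=-1)
    sin = np.concatenate([np.sin(angles), np.sin(angles)], axis=-1)
    return keys * cos + rotate_half(keys) * sin

rng = np.random.default_rng(3)
tokens = rng.standard_normal((5, d))
Wk, Wv = rng.standard_normal((d, d)), rng.standard_normal((d, d))
cached_k, cached_v = tokens @ Wk, tokens @ Wv        # stored KV cache, decoupled from positions
k_with_pos = embed_rpe(cached_k, np.arange(5))       # positions are attached only when the cache is used
print(cached_k.shape, k_with_pos.shape)              # (5, 4) (5, 4)
```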
It should be noted that when the number of tokens included in the input of the first session is greater than or equal to the number of tokens that the context window of the generative large model can accommodate, i.e., when the context window overflows, the accelerator performs the layer 1 computation of the 1st iteration on the first session through the generative large model based on the first sub-input of the first session. The computation is the same as when the context window does not overflow, except that the input of the generative large model and the history KV cache loaded by the accelerator from the memory of the host are clipped; the rest is not repeated.
The process by which the accelerator loads the history KV cache of the input of the first session at layer 2 from the memory of the host into the first buffer is the same as the loading of the layer 1 history KV cache in step 301 and is not repeated.
Step 303, the accelerator performs the layer 2 computation of the 1st iteration on the first session, and at the same time loads the history KV cache of the input of the first session at layer 3 from the memory of the host into the first buffer and writes the KV cache obtained by the layer 1 computation of the 1st iteration into the memory of the host.
The accelerator performs the layer 2 computation of the 1st iteration on the first session through the compute thread, while loading the history KV cache of the input of the first session at layer 3 from the memory of the host through the data-read thread, and writing the KV cache obtained by the layer 1 computation of the 1st iteration from the first buffer into the memory of the host through the data write-back thread.
The accelerator performs, through the calculation thread, layer 2 calculation of the 1 st iteration on the first session based on the calculation result of layer 1 calculation of the 1 st iteration on the first session, and the calculation process of layer 2 is the same as the calculation process of layer 1, and is not described again.
The process of loading the historical KV cache of the first session in the layer 3 from the memory of the host to the first buffer by the accelerator through the data reading thread is the same as the process of loading the historical KV cache of the layer 1 in step 301, and will not be described again.
In step 303, while the compute thread of the accelerator performs the computation of a given layer, the data-read thread of the accelerator simultaneously loads the KV cache of the following layer into the first buffer. After the computation of the current layer is completed, the compute thread can immediately start the computation of the following layer based on the loaded data, which hides the time slot between two adjacent layers caused by loading the data required by the next layer and improves the efficiency of session reasoning.
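The layer-wise overlap of steps 301 to 303 can be sketched as follows; Python threads stand in for the accelerator's compute, data-read, and data write-back streams, and the buffer contents are dummy strings.

```python
import threading

N = 4
host_memory = {i: f"history-kv-layer-{i}" for i in range(1, N + 1)}
first_buffer, written_back = {}, {}

def load(layer):                      # data-read thread body
    first_buffer[layer] = host_memory[layer]

def write_back(layer):                # data write-back thread body
    written_back[layer] = f"new-kv-layer-{layer}"

def compute(layer):                   # compute-thread body (placeholder work)
    return f"output-of-layer-{layer} using {first_buffer[layer]}"

load(1)                               # layer-1 cache is ready before computation starts
for i in range(1, N + 1):
    workers = []
    if i < N:                         # prefetch the next layer's history KV cache
        workers.append(threading.Thread(target=load, args=(i + 1,)))
    if i > 1:                         # write back the KV cache produced by layer i-1
        workers.append(threading.Thread(target=write_back, args=(i - 1,)))
    for w in workers:
        w.start()
    result = compute(i)               # overlaps with the loads/write-backs above
    for w in workers:
        w.join()                      # layer i+1 starts only after its data is loaded
print(result, sorted(written_back))
```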
Step 304, the accelerator sequentially performs the layer 3 computation to the layer N computation of the 1st iteration on the first session in the same manner as in step 303, and infers the 1st token in the output of the first session.
Steps 301 to 304 describe the 1st iteration on the first session, i.e., the pre-filling stage of the first session reasoning process, which infers the 1st token in the output of the first session. The decoding stage of the first session reasoning process is described below through step 305; the decoding stage includes at least one iteration, and each iteration infers one token in the output of the first session.
Step 305, the accelerator sequentially performs the 2nd iteration to the M-th iteration on the first session in a manner similar to steps 301 to 304, sequentially inferring the 2nd to M-th tokens in the output of the first session, where M is equal to the total number of iterations of the first session and M is an integer greater than or equal to 2.
Any one of the 2nd to M-th iterations is the same as the 1st iteration, except that each layer computation in the 1st iteration generates the intermediate data key and value corresponding to each token in the token sequence input by the first session (the question prompt of the first session), whereas each layer computation in any of the 2nd to M-th iterations generates the intermediate data key and value corresponding to the token obtained in the previous iteration; the rest is not repeated.
Step 306, while the accelerator performs the layer N computation of the M-th iteration on the first session, the accelerator loads the history KV cache of the input of the second session in the task queue at layer 1 from the memory of the host into the second buffer of the accelerator.
Wherein the task queue is used for indicating the pending session. The second session is the first pending session in the task queue. The second buffer area is an area in the HBM of the accelerator for storing the KV cache to be processed by the accelerator, that is, the historical KV cache input in layer 1 for storing the second session.
The accelerator loads the history KV cache of the second session input in the layer 1 from the memory of the host through the data reading thread to the second buffer.
In this method, the HBM of the accelerator includes a first buffer and a second buffer. The first buffer stores the KV cache of the session the accelerator is currently processing, and the second buffer stores the KV cache of the session the accelerator will process next. While the accelerator processes the first session based on the KV cache in the first buffer, it loads the KV cache required by the layer 1 computation of the second session into the second buffer. When the accelerator finishes processing the first session, it processes the second session based on the KV cache in the second buffer, at which point the second buffer becomes the execution buffer, and the accelerator releases the first buffer through an asynchronous thread while processing the second session. The computation thread of the accelerator can therefore start processing the second session immediately after finishing the first session, without waiting for the first buffer to be released. The idle gap between two sessions that would otherwise be caused by loading the data required by the next session is hidden, which improves reasoning efficiency.
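A minimal sketch of this double-buffering arrangement, assuming Python dictionaries stand in for the first and second buffers in the HBM and a background thread stands in for the asynchronous release, could look as follows (the class and method names are invented for the example):

```python
import threading

class SessionBuffers:
    def __init__(self):
        self.active_buf = {}    # first buffer: KV of the session being processed
        self.standby_buf = {}   # second buffer: layer-1 KV of the next pending session

    def stage_next_session(self, session_id, layer1_kv):
        # runs while the N-th layer of the current session is still computing
        self.standby_buf[(session_id, 1)] = layer1_kv

    def switch_to_next_session(self):
        old = self.active_buf
        # the standby buffer becomes the execution buffer for the next session
        self.active_buf, self.standby_buf = self.standby_buf, {}
        # release the previous session's buffer without blocking the compute thread
        threading.Thread(target=old.clear, daemon=True).start()

bufs = SessionBuffers()
bufs.active_buf[("job1", 1)] = "KV of the first session"
bufs.stage_next_session("job2", "layer-1 KV of the second session")
bufs.switch_to_next_session()
print(list(bufs.active_buf))   # [('job2', 1)]
```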
It should be noted that step 306 is optional. In some embodiments the accelerator includes only the first buffer; after the N-th layer computation of the M-th iteration of the first session is completed, the accelerator releases the first buffer and then loads the historical KV cache of the second session's input at layer 1 into the first buffer. The embodiment of the present application is not limited in this respect. In the embodiment described above, using the HBM of the accelerator as the first buffer increases the capacity of the first buffer, so that the complete historical KV cache of a session can be accommodated even when its data size is large. This avoids later-loaded historical KV cache overwriting earlier-loaded historical KV cache due to insufficient capacity, which increases the hit rate of the historical KV cache in the first buffer and avoids repeatedly loading the historical KV cache.
Step 307: if, when the accelerator completes the N-th layer computation of the M-th iteration of the first session, part of the KV cache generated by the generative large model while processing the first session has not yet been written into the memory, the accelerator writes that KV cache from the first buffer into a third buffer of the accelerator.
While the accelerator performs the computation of the (i+1)-th layer through the computation thread, the data write-back thread of the accelerator writes the KV cache generated by the computation of the i-th layer back from the first buffer into the memory of the host, where i is an integer greater than or equal to 1 and less than N. The third buffer is an area in the HBM of the accelerator for storing KV cache that has not yet been written back.
In this method, the accelerator further includes a third buffer. After the accelerator finishes processing the first session, it moves the KV cache generated during the first session that has not yet been written back to the memory from the first buffer into the third buffer. Because data copying between the first buffer and the third buffer is fast, the accelerator can quickly drain the unwritten KV cache out of the first buffer after finishing the first session, and thus quickly release the first buffer so that the KV cache required by subsequent processing can be loaded into it, instead of waiting until all KV cache generated by the first session has been written into the memory. The idle gap between two sessions that would otherwise be caused by waiting for the previous session's data to be written back is therefore hidden, which improves reasoning efficiency.
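A possible sketch of the hand-off through the third buffer, under the assumption that dictionaries stand in for the buffers and a Python thread stands in for the data write-back thread, is:

```python
import threading

host_memory = {}     # stand-in for the KV store in the host's memory
third_buffer = {}    # KV generated by the finished session but not yet written back

def finish_session(first_buffer, already_written_back):
    """Fast HBM-to-HBM copy of unflushed KV, so the first buffer can be released at once."""
    for key, kv in first_buffer.items():
        if key not in already_written_back:
            third_buffer[key] = kv
    first_buffer.clear()   # the first buffer is now free for the next session's KV

def drain_third_buffer():
    """Runs concurrently with the next session's computation."""
    for key in list(third_buffer):
        host_memory[key] = third_buffer.pop(key)

first_buffer = {("job1", 1): "kv-L1", ("job1", 2): "kv-L2", ("job1", 3): "kv-L3"}
finish_session(first_buffer, already_written_back={("job1", 1), ("job1", 2)})
writer = threading.Thread(target=drain_third_buffer)
writer.start()        # overlaps with processing of the second session
writer.join()
print(host_memory)    # {('job1', 3): 'kv-L3'}
```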
It should be noted that step 307 is optional. In some embodiments the accelerator includes only the first buffer; after the accelerator completes the N-th layer computation of the M-th iteration of the first session, it releases the first buffer only after all KV cache generated while processing the first session has been written from the first buffer into the memory of the host. In the embodiment described above, using the HBM of the accelerator as the first buffer increases the capacity of the first buffer, so that the complete historical KV cache of a session can be accommodated even when its data size is large. This avoids later-loaded historical KV cache overwriting earlier-loaded historical KV cache due to insufficient capacity, which increases the hit rate of the historical KV cache in the first buffer and avoids repeatedly loading the historical KV cache.
Step 308, the accelerator processes the second session in the same manner as the above steps 301 to 307, and simultaneously writes the KV cache in the third buffer of the accelerator into the memory of the host.
In step 308, the process of writing the KV cache in the third buffer into the memory of the host by the accelerator is the same as the process of writing the KV cache obtained by the layer 1 calculation of the 1 st iteration into the memory of the host by the accelerator in step 303, and will not be described again.
With this method, the accelerator can preload the historical KV cache required by the (i+1)-th layer computation from the memory of the host into the accelerator while performing the i-th layer computation of the generative large model on a session. Because the computation and the data loading proceed in parallel, the computation does not need to wait for data loading to finish, the time overhead of the accelerator accessing the external storage device can be hidden, and the efficiency of multi-round session reasoning is improved. Furthermore, the KV cache generated by the i-th layer computation is written back into the memory of the host while the accelerator performs the (i+1)-th layer computation. Compared with writing the KV cache generated by the current round into the memory of the host only after the current round of reasoning is completed, this hides the idle gap between two adjacent rounds that would be caused by writing back the data generated by the current round, and thus improves reasoning efficiency.
The flow of steps 301 to 308 is illustrated below through fig. 6 and fig. 7. Fig. 6 is a schematic diagram of the computation process and the data loading process running in parallel in the multi-round session reasoning method provided by the embodiment of the present application. As shown in fig. 6, the generative large model includes 3 layers (L1, L2 and L3). After the previous session is processed, the accelerator processes the current session through the generative large model; before the computation of the L1 layer starts, the KV cache required by that layer is already ready in the first buffer in the HBM of the accelerator. The data-loading thread of the accelerator, i.e., the KV cache read flow, loads the KV cache required by the L1 layer computation into the first buffer; the computation thread of the accelerator, i.e., the execution flow, then starts the L1 layer computation; while the computation thread performs the L1 layer computation, the data-loading thread simultaneously loads the KV cache required by the L2 layer computation into the first buffer, and so on. The time the accelerator spends loading KV cache from the memory of the host therefore overlaps with the computation time, hiding the time overhead of loading the KV cache from the memory of the host and significantly improving reasoning efficiency.
Fig. 7 is a schematic diagram of the computation process and the data write-back process running in parallel in the multi-round session reasoning method provided by the embodiment of the application. The generative large model includes 3 layers (L1, L2 and L3). In the prefill stage, the computation thread of the accelerator, i.e., the execution flow, performs the computation of the L1 layer; after the computation thread completes the L1 layer, it performs the computation of the L2 layer while the data write-back thread of the accelerator, i.e., the KV cache write-back flow, writes the KV cache generated by the L1 layer computation back from the first buffer into the memory of the host; after the computation thread completes the L2 layer, it performs the computation of the L3 layer while the data write-back thread writes the KV cache generated by the L2 layer computation back from the first buffer into the memory of the host, and so on. In the decoding stage, the data write-back process is identical to that of the prefill stage and is not repeated. In the data write-back process shown in fig. 7, the time the accelerator spends writing the generated KV cache back to the memory of the host overlaps with the computation time, so the time overhead of writing the generated KV cache back to the memory of the host is hidden and reasoning efficiency is significantly improved.
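As a companion to the sketch given after step 303, the write-back overlap of fig. 7 can be pictured as follows; this is again an illustrative sketch with invented names (host_memory, write_back, compute_layer), not the claimed implementation.

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

N_LAYERS = 3
host_memory = {}                        # stand-in for the per-layer KV store in the host's memory

def write_back(layer, kv):
    host_memory[layer] = kv             # stand-in for the HBM-to-DRAM copy

def compute_layer(hidden):
    new_kv = np.random.rand(16, 64)     # KV produced by this layer's computation
    return hidden + new_kv.mean(), new_kv

hidden = np.zeros(64)
with ThreadPoolExecutor(max_workers=1) as writer:
    pending = None
    for layer in range(N_LAYERS):
        hidden, kv = compute_layer(hidden)
        if pending is not None:
            pending.result()            # the previous layer's write-back finished during this compute
        pending = writer.submit(write_back, layer, kv)  # flush this layer while the next layer computes
    pending.result()                    # flush the last layer before the iteration ends
print(sorted(host_memory))              # [0, 1, 2]
```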
The above describes the data migration process between the accelerator and the memory of the host, and the following describes the data migration process between the memory of the host and the external storage device.
The data migration between the memory of the host and the external storage device proceeds as follows: if the historical KV cache of a first pending session in the task queue does not exist in the memory of the host, the host loads the historical KV cache of the first pending session from the external storage device into the memory of the host, where the first pending sessions are the first number of pending sessions at the head of the task queue.
The number of session rounds included in the first pending sessions is determined based on the capacity of the memory of the host; that is, the first number is determined based on the capacity of the memory of the host. For example, first number = available storage space of the memory of the host ÷ average data size of the historical KV cache of a session. Note that, for the historical KV cache of the first session currently being processed by the accelerator, the accelerator writes newly generated KV cache back to the memory of the host while performing computation, so as to update the historical KV cache of the first session in the memory of the host; the space occupied by the historical KV cache of the first session in the memory of the host is therefore system-occupied and cannot be released before the processing of the first session is completed. Accordingly, available storage space of the memory of the host = memory capacity of the host − capacity of the system-occupied space.
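A toy calculation of the first number under assumed capacities (the figures below are invented purely for illustration) is:

```python
host_memory_capacity_gb = 512
pinned_by_current_session_gb = 32      # KV of the session being processed; cannot be released yet
avg_kv_cache_per_session_gb = 40

available_gb = host_memory_capacity_gb - pinned_by_current_session_gb
first_number = available_gb // avg_kv_cache_per_session_gb
print(first_number)                    # 12: KV caches of 12 pending sessions can be prefetched
```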
In some embodiments, if the available storage space of the memory of the host is sufficient, the host loads the historical KV cache of the first pending session from the external storage device into the memory of the host. If the available storage space of the memory of the host is insufficient, the host writes the historical KV cache of a second pending session in the task queue from the memory of the host into the external storage device, and loads the historical KV cache of the first pending session from the external storage device into the memory of the host, where the second pending session is a session in the task queue other than the first pending session.
Those skilled in the art can decide according to actual needs how to determine whether the available storage space of the memory of the host is sufficient. For example, in some embodiments, if the capacity of the available storage space of the memory of the host is greater than or equal to a target threshold, the available storage space of the host is sufficient, and if it is less than the target threshold, the available storage space of the host is insufficient. In other embodiments, if the capacity of the available storage space of the memory of the host is greater than or equal to the data size of the historical KV cache of the first pending session, the available storage space of the host is sufficient, and if it is less than that data size, the available storage space of the host is insufficient. These determination manners are merely exemplary, and the embodiment of the present application does not limit how the determination is made.
In some embodiments, the memory of the host includes, in addition to its available storage space, a fourth buffer configured to store the historical KV cache loaded from the external storage device when the available storage space of the memory of the host is insufficient. Because the memory of the host includes the fourth buffer, sufficient staging space is guaranteed when KV cache is migrated from the external storage device to the memory of the host. When the available storage space of the host is insufficient, the host can therefore write the KV cache preloaded from the external storage device into the fourth buffer without waiting for KV cache in the memory of the host to be evicted to the external storage device, avoiding the stall that eviction would otherwise cause when the available storage space of the host is insufficient.
In some embodiments, if the available storage space of the external storage device is insufficient, the historical KV cache of a third pending session in the task queue is deleted from the external storage device, where the third pending session is a second number of pending sessions at the tail of the task queue. The second number may be determined according to practical situations, for example, the second number is 1 or 2, which is not limited in the embodiment of the present application.
The data migration between the memory of the host and the external storage device is illustrated through fig. 8. Fig. 8 is a schematic diagram of the data migration between the memory of the host and the external storage device in the multi-round session reasoning process according to an embodiment of the present application. As shown in fig. 8, the accelerator is processing Job1, i.e., the first session. The task queue includes 8 pending sessions, Job2 through Job9 in order from the head to the tail of the queue. The memory of the host includes a fourth buffer (the area marked "buf" in fig. 8) and stores the historical KV caches of Job1, Job2 and Job4, namely KV1, KV2 and KV4. The external storage device is a disk, which stores the historical KV caches of Job9, Job8, Job7 and Job3, namely KV9, KV8, KV7 and KV3. The task scheduler in the host maintains a prefetching window for determining the first number of pending sessions in the task queue, i.e., the first pending sessions; in fig. 8 the prefetching window has a size of 2, indicating that the first pending sessions include two rounds of sessions (the first number is 2), namely Job2 and Job3. The task scheduler also maintains an eviction-exemption window (eviction window), with a size of 6 in fig. 8, which indicates the sessions that must not be evicted from the external storage device when the available storage space of the external storage device is insufficient. As shown in fig. 8, among the first pending sessions (Job2 and Job3), the historical KV cache of Job2 (KV2) exists in the memory of the host, while the historical KV cache of Job3 (KV3) does not, and the available storage space of the memory of the host is insufficient. The host therefore loads KV3 from the disk into the fourth buffer in the memory of the host, and writes the historical KV cache of a second pending session in the task queue from the memory of the host to the disk, where the second pending session is a session whose historical KV cache exists in the memory of the host and which is not a first pending session; in fig. 8 the second pending session is Job4, so the host writes the historical KV cache of Job4 (KV4) from the memory of the host to the disk. Because the available storage space of the disk is insufficient, the host deletes the historical KV cache of a third pending session from the disk, where the third pending sessions are the second number of sessions at the tail of the queue outside the eviction-exemption window; in fig. 8 the second number is 1, so the third pending session is Job9 and the host deletes the historical KV cache of Job9 (KV9) from the disk. It should be noted that fig. 8 is merely an example of the data migration between the memory of the host and the external storage device and is not intended to limit the present application.
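The scheduling decisions of fig. 8 can be mirrored in a short, purely illustrative sketch. The window sizes, capacities and job names below follow the example in fig. 8, while the data structures and the helper prefetch() are assumptions of the sketch rather than the claimed scheduler.

```python
from collections import deque

task_queue = deque(["Job2", "Job3", "Job4", "Job5", "Job6", "Job7", "Job8", "Job9"])
host_mem = {"Job1": "KV1", "Job2": "KV2", "Job4": "KV4"}
disk = {"Job9": "KV9", "Job8": "KV8", "Job7": "KV7", "Job3": "KV3"}

PREFETCH_WINDOW = 2          # the first number
EVICTION_EXEMPT_WINDOW = 6   # sessions near the head whose KV must stay on disk
HOST_MEM_SLOTS = 3           # toy capacity: at most 3 KV caches in host memory
DISK_SLOTS = 4               # toy capacity of the external storage device

def prefetch():
    for job in list(task_queue)[:PREFETCH_WINDOW]:
        if job in host_mem:
            continue
        if len(host_mem) >= HOST_MEM_SLOTS:                       # host memory insufficient:
            victim = next(j for j in host_mem
                          if j not in ("Job1", *list(task_queue)[:PREFETCH_WINDOW]))
            disk[victim] = host_mem.pop(victim)                   # evict Job4's KV to disk
        host_mem[job] = disk[job]                                 # load KV3 via the fourth buffer
        while len(disk) > DISK_SLOTS:                             # disk insufficient:
            tail = list(task_queue)[EVICTION_EXEMPT_WINDOW:][-1]  # tail session outside the window
            disk.pop(tail, None)                                  # delete KV9

prefetch()
print(host_mem)  # {'Job1': 'KV1', 'Job2': 'KV2', 'Job3': 'KV3'}
print(disk)      # Job9's KV has been deleted; Job4's KV has been written back to the disk
```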
In this method, the external storage device stores the historical KV cache of completed sessions. Because the capacity of the external storage device is far greater than the capacity of the HBM in the accelerator, the external storage device can store a large amount of historical KV cache, which improves the hit rate of the historical KV cache, avoids recalculating KV cache, improves the efficiency of multi-round session reasoning and saves computing resources.
It should be noted that, to facilitate the description of the multi-round session reasoning method provided by the embodiment of the present application, the data migration between the accelerator and the memory of the host and the data migration between the memory of the host and the external storage device are introduced as two embodiments. The two are not independent of each other: in the multi-round session reasoning method provided by the embodiment of the present application, data migration occurs between the accelerator and the memory of the host, and data migration can also occur between the memory of the host and the external storage device.
Fig. 9 is a schematic structural diagram of a multi-round session reasoning apparatus provided in an embodiment of the present application, applied to an accelerator of a host in a computing system. The computing system further includes the host and an external storage device; the external storage device is used for storing the historical key-value cache (KV cache) of a session, and the host is used for loading the historical KV cache of a to-be-processed session in a task queue from the external storage device into a memory of the host. The apparatus includes a processing module 901 and a loading module 902.
The processing module 901 is configured to infer a first session in the task queue through a large generative model based on the task queue, where the first session is one of multiple sessions, and the large generative model includes N layers;
the loading module 902 is configured to load, from the memory, a history KV cache of the input of the first session at the i+1st layer when performing the i-th layer calculation on the first session, where i is an integer greater than or equal to 1 and less than N.
In some embodiments, the history KV cache of the input of the first session at layer i+1 is independent of the position encoding of the input of the first session.
In some embodiments, the input of the first session includes a number of tokens that is greater than or equal to a number of tokens that can be accommodated by a contextual window of the generative large model;
The loading module 902 is configured to:
When the i-th layer calculation is performed on the first session, load, from the memory, the historical KV cache of a first sub-input at layer i+1, where the first sub-input is a sub-input of the first session and the number of tokens included in the first sub-input is smaller than the number of tokens that the context window can accommodate.
In some embodiments, the processing module 901 includes:
The embedding unit is used for embedding the position code of the first sub-input into the historical KV cache of the first sub-input in the i+1th layer;
and the computing unit is used for computing the first session in the (i+1) -th layer based on the input of the first session and the historical KV cache of the first sub-input in the (i+1) -th layer after embedding the position codes.
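The decoupling of the stored KV cache from the position code, and the embedding performed by the embedding unit above, can be pictured with a small sketch. A rotary-style position code is used below purely as an illustration (the embodiments do not prescribe a particular encoding), and names such as apply_position_code and the shapes involved are assumptions of the example only.

```python
import numpy as np

def rotate_half(x):
    x1, x2 = np.split(x, 2, axis=-1)
    return np.concatenate([-x2, x1], axis=-1)

def apply_position_code(keys, positions, head_dim=64, base=10000.0):
    """Embed rotary-style position codes into position-free cached keys."""
    inv_freq = 1.0 / (base ** (np.arange(0, head_dim, 2) / head_dim))
    angles = positions[:, None] * inv_freq[None, :]                    # (seq, head_dim/2)
    cos = np.concatenate([np.cos(angles), np.cos(angles)], axis=-1)    # (seq, head_dim)
    sin = np.concatenate([np.sin(angles), np.sin(angles)], axis=-1)
    return keys * cos + rotate_half(keys) * sin

cached_keys = np.random.rand(8, 64)     # position-free historical keys of the first sub-input
positions = np.arange(8)                # positions of the sub-input after clipping to the context window
keys_for_attention = apply_position_code(cached_keys, positions)
print(keys_for_attention.shape)         # (8, 64)
```

Because the cached keys carry no position information, the same cached entries remain valid when the context window overflows and the sub-input's positions shift; only the positions passed to apply_position_code change.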
In some embodiments, the accelerator includes a first buffer for storing a history KV cache of the first session and a second buffer for storing a history KV cache of the input at layer 1 of a second session in the task queue, the second session being a first pending session in the task queue after the first session;
The loading module 902 is configured to:
When the i-th layer calculation is performed on the first session, load the historical KV cache of the input of the first session at layer i+1 from the memory into the first buffer.
In some embodiments, the loading module 902 is further configured to:
When the N-th layer calculation is performed on the first session, load the historical KV cache of the input of the second session at layer 1 from the memory into the second buffer.
In some embodiments, the apparatus further comprises:
And the writing module is used for writing KV cache generated by performing the i-th layer calculation on the first session into the memory when the i+1th layer calculation is performed on the first session.
In some embodiments, the accelerator includes a third buffer for storing a KV cache generated by the generative large model processing the first session;
The writing module is used for:
If, when the N-th layer calculation of the first session is completed, part of the KV cache generated by the generative large model while processing the first session has not been written into the memory, write that KV cache into the third buffer;
And when the second session in the task queue is processed through the generative large model, writing the KV cache in the third buffer into the memory.
Fig. 10 is a schematic structural diagram of a multi-round session reasoning apparatus provided in an embodiment of the present application, applied to a host in a computing system. The computing system further includes an accelerator of the host and an external storage device; the external storage device is used for storing the historical key-value cache (KV cache) of a session. The accelerator is used for inferring, based on a task queue, a first session in the task queue through a generative large model, where the first session is one round of a multi-round session and the generative large model includes N layers, and for loading, from the memory while performing the i-th layer calculation on the first session, the historical KV cache of the input of the first session at layer i+1, where i is an integer greater than or equal to 1 and less than N. The apparatus includes a loading module 1001 and a storage module 1002.
The loading module 1001 is configured to load, from the external storage device, a history KV cache of a session to be processed in a task queue to a memory of the host;
the storage module 1002 is configured to store a history KV cache of the session to be processed.
In some embodiments, the loading module 1001 includes:
And the loading unit is used for loading the historical KV cache of the first to-be-processed session from the external storage device to the memory if the historical KV cache of the first to-be-processed session in the task queue does not exist in the memory, wherein the first to-be-processed sessions are the first number of to-be-processed sessions at the head of the task queue.
In some embodiments, the number of session rounds included in the first pending session is determined based on the capacity of the memory.
In some embodiments, the loading unit is to:
if the available storage space of the memory is enough, loading the historical KV cache of the first session to be processed from the external storage device to the memory;
If the available storage space of the memory is insufficient, writing the historical KV cache of the second to-be-processed session in the task queue from the memory into the external storage device, and loading the historical KV cache of the first to-be-processed session from the external storage device to the memory, wherein the second to-be-processed session is a session in the task queue other than the first to-be-processed session.
In some embodiments, the apparatus further comprises a deletion module for:
If the available storage space of the external storage device is insufficient, deleting the historical KV cache of a third to-be-processed session in the task queue from the external storage device, wherein the third to-be-processed session is a second number of to-be-processed sessions at the tail of the task queue.
The processing module 901, the loading module 902, the loading module 1001 and the storage module 1002 may each be implemented by software or by hardware. The implementation of the processing module 901 is described next as an example; the implementations of the loading module 902, the loading module 1001 and the storage module 1002 may refer to that of the processing module 901.
Taking a module as an example of a software functional unit, the processing module 901 may include code running on a computing instance. The computing instance may include at least one of a physical host (computing device), a virtual machine, and a container, and there may be one or more such computing instances. For example, the processing module 901 may include code running on multiple hosts/virtual machines/containers. It should be noted that the multiple hosts/virtual machines/containers used to run the code may be distributed in the same region or in different regions. Further, they may be distributed in the same availability zone (AZ) or in different AZs, each AZ comprising one data center or multiple geographically close data centers. A region typically comprises a plurality of AZs.
Also, multiple hosts/virtual machines/containers for running the code may be distributed in the same virtual private cloud (virtual private cloud, VPC) or may be distributed in multiple VPCs. In general, one VPC is disposed in one region, and a communication gateway is disposed in each VPC for implementing inter-connection between VPCs in the same region and between VPCs in different regions.
Taking a module as an example of a hardware functional unit, the processing module 901 may include at least one computing device, such as a server. Alternatively, the processing module 901 may be a device implemented using an application-specific integrated circuit (ASIC), a programmable logic device (PLD), or the like. The PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), a generic array logic (GAL), or any combination thereof.
The multiple computing devices included in the processing module 901 may be distributed in the same region or in different regions, in the same AZ or in different AZs, and in the same VPC or in multiple VPCs. The multiple computing devices may be any combination of computing devices such as servers, ASICs, PLDs, CPLDs, FPGAs, and GALs.
It should be noted that, in other embodiments, each of the processing module 901, the loading module 902, the loading module 1001 and the storage module 1002 may be used to perform any step in the multi-round session reasoning method, and the steps that these modules are responsible for implementing may be specified as needed. The processing module 901 and the loading module 902 respectively implement different steps performed by the accelerator in the multi-round session reasoning method, thereby realizing the overall functions of the multi-round session reasoning apparatus shown in fig. 9; the loading module 1001 and the storage module 1002 respectively implement different steps performed by the host in the multi-round session reasoning method, thereby realizing the overall functions of the multi-round session reasoning apparatus shown in fig. 10.
The present application also provides a computing device 1100. Fig. 11 is a schematic diagram of a computing device according to an embodiment of the present application, where, as shown in fig. 11, a computing device 1100 includes a bus 1101, a processor 1102, a memory 1103, and a communication interface 1104. The processor 1102, memory 1103 and communication interface 1104 communicate via the bus 1101. Computing device 1100 can be a computing device or a terminal device. It should be appreciated that the present application is not limited to the number of processors, memories in computing device 1100.
The bus 1101 may be a peripheral component interconnect (PCI) bus, an extended industry standard architecture (EISA) bus, or the like. Buses may be divided into address buses, data buses, control buses, and so on. For ease of illustration, only one line is shown in fig. 11, but this does not mean there is only one bus or only one type of bus. The bus 1101 may include a path for transferring information between the components of the computing device 1100 (e.g., the memory 1103, the processor 1102 and the communication interface 1104).
The processor 1102 may include any one or more of a central processing unit (CPU), a graphics processing unit (GPU), a microprocessor (MP), or a digital signal processor (DSP).
The memory 1103 may include volatile memory, such as random access memory (RAM). The memory 1103 may also include non-volatile memory, such as read-only memory (ROM), flash memory, a hard disk drive (HDD), or a solid state drive (SSD).
The memory 1103 stores executable program codes, and the processor 1102 executes the executable program codes to implement the functions of the foregoing processing module 901 and the loading module 902, respectively, so as to implement the steps executed by the accelerator in the multi-round session reasoning method, or implement the functions of the foregoing loading module 1001 and the storage module 1002, so as to implement the steps executed by the host in the multi-round session reasoning method. That is, the memory 1103 has instructions stored thereon for performing the multi-round session reasoning method. Fig. 11 shows, by way of example only, the memory 1103 storing program codes that implement the functions of the aforementioned processing module 901 and loading module 902.
Communication interface 1104 enables communication between computing device 1100 and other devices or communication networks using a transceiver module such as, but not limited to, a network interface card, transceiver, or the like.
The embodiment of the application also provides a computing device cluster. FIG. 12 is a schematic diagram of a computing device cluster including at least one computing device 1100, as shown in FIG. 12, provided by an embodiment of the application. The same instructions for performing the multi-round session reasoning method may be stored in the memory 1103 in one or more computing devices 1100 in the cluster of computing devices.
In some possible implementations, part of the instructions for performing the multi-round session reasoning method may also be stored separately in the memory 1103 of one or more computing devices 1100 in the cluster of computing devices. In other words, a combination of one or more computing devices 1100 may collectively execute instructions for performing the multi-round session reasoning method.
It should be noted that, the memory 1103 in different computing devices 1100 in the computing device cluster may store different instructions for performing part of the functions of the multiple session reasoning apparatus respectively. That is, instructions stored in the memory 1103 in different computing devices 1100 may implement the functions of one or more of the aforementioned processing module 901, loading module 902, loading module 1001, and storage module 1002.
It should be appreciated that the functionality of the computing device 1100 shown in fig. 12 may also be performed by multiple computing devices 1100.
In some possible implementations, one or more computing devices in a cluster of computing devices may be connected through a network. The network may be a wide area network, a local area network, or the like. Fig. 13 is a schematic diagram of one possible implementation of a computing device cluster according to an embodiment of the present application. As shown in fig. 13, a computing device 1100A and a computing device 1100B are connected through a network, specifically through the communication interfaces in the respective computing devices. In this type of implementation, instructions for performing the functions of the processing module 901 are stored in the memory 1103 in the computing device 1100A, and instructions for performing the functions of the loading module 902 are stored in the memory 1103 in the computing device 1100B.
The embodiment of the application also provides another computing device cluster. The connection between computing devices in the computing device cluster may be similar with reference to the connection of fig. 12 or fig. 13. In contrast, the memory 1103 in one or more computing devices 1100 in the computing device cluster may have stored therein the same instructions for performing the multi-round session reasoning method.
In some possible implementations, part of the instructions for performing the multi-round session reasoning method may also be stored separately in the memory 1103 of one or more computing devices 1100 in the cluster of computing devices. In other words, a combination of one or more computing devices 1100 may collectively execute instructions for performing the multi-round session reasoning method.
It should be noted that, the memory 1103 in different computing devices 1100 in the computing device cluster may store different instructions for performing part of the functions of the multiple session reasoning apparatus respectively. That is, instructions stored in the memory 1103 in different computing devices 1100 may implement the functions of one or more of the aforementioned processing module 901, loading module 902, loading module 1001, and storage module 1002.
An embodiment of the present application provides an accelerator comprising a computational core for performing steps performed by the accelerator in a multi-round session reasoning method as provided by any one of the possible implementations of the method embodiments described above, and a memory for storing computational data, the computational core for performing computational operations on the computational data stored in the memory.
Embodiments of the present application provide a host comprising a processor and a memory for performing the steps performed by the host in a multi-round session reasoning method as provided by any of the possible implementations of the method embodiments described above.
Embodiments of the present application also provide a computer program product comprising instructions. The computer program product may be software or a program product containing instructions capable of running on a computing device or stored in any useful medium. The computer program product, when run on at least one computing device, causes the at least one computing device to perform a multi-round session reasoning method.
The embodiment of the application also provides a computer readable storage medium. The computer readable storage medium may be any available medium that can be stored by a computing device or a data storage device such as a data center containing one or more available media. The usable medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., solid state disk), etc. The computer-readable storage medium includes instructions that instruct a computing device to perform a multi-round session reasoning method.
It should be noted that, the information (including but not limited to user equipment information, user personal information, etc.), data (including but not limited to data for analysis, stored data, presented data, etc.), and signals related to the present application are all authorized by the user or are fully authorized by the parties, and the collection, use, and processing of the related data is required to comply with the relevant laws and regulations and standards of the relevant countries and regions. For example, sessions and storage space etc. involved in the present application are all acquired with sufficient authorization.
Those of ordinary skill in the art will appreciate that the various method steps and elements described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the steps and components of the various embodiments have been described generally in terms of functionality in the foregoing description to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Those of ordinary skill in the art may implement the described functionality using different approaches for each particular application, but such implementation is not considered to be beyond the scope of the present application.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, and are not repeated herein.
In the several embodiments provided by the present application, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices, or elements, or may be an electrical, mechanical, or other form of connection.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the embodiment of the present application.
In addition, each unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium, comprising several instructions for causing a computing device (which may be a personal computer, a server, a computing device, etc.) to execute all or part of the steps of the methods in the various embodiments of the present application. The storage medium includes a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or other media capable of storing program code.
The terms "first," "second," and the like in this disclosure are used for distinguishing between similar elements or items having substantially the same function and function, and it should be understood that there is no logical or chronological dependency between the terms "first," "second," and "n," and that there is no limitation on the amount and order of execution. It will be further understood that, although the following description uses the terms first, second, etc. to describe various elements, these elements should not be limited by the terms. These terms are only used to distinguish one element from another element. For example, a first session may be referred to as a second session, and similarly, a second session may be referred to as a first session, without departing from the scope of the various examples. Both the first session and the second session may be node sessions, and in some cases may be separate and distinct sessions.
The terms "at least one" and "at least one" are used interchangeably herein to mean one or more, and the term "plurality" is used herein to mean two or more.
It should also be understood that the term "if" may be interpreted to mean "when" or "upon") or "in response to a determination" or "in response to detection. Similarly, the phrase "if determined" or "if detected [ stated condition or event ]" may be interpreted to mean "upon determination" or "in response to determination" or "upon detection of [ stated condition or event ]" or "in response to detection of [ stated condition or event", depending on the context.
The foregoing description is merely illustrative of the present application, and the scope of the present application is not limited thereto, and any equivalent modifications or substitutions will be apparent to those skilled in the art within the scope of the present application, and are intended to be included within the scope of the present application. Therefore, the protection scope of the application is subject to the protection scope of the claims.
The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer program instructions. When the computer program instructions are loaded and executed on a computer, the procedures or functions according to the embodiments of the application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus.
The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer program instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired or wireless means. The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or a data center that integrates one or more available media. The available medium may be a magnetic medium (e.g., a floppy disk, a hard disk, or a magnetic tape), an optical medium (e.g., a digital video disc (DVD)), or a semiconductor medium (e.g., a solid state drive), etc.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program for instructing relevant hardware, where the program may be stored in a computer readable storage medium, and the above storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The foregoing embodiments are merely intended to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those skilled in the art will understand that modifications may still be made to the technical solutions described in the foregoing embodiments, or equivalent replacements may be made to some of the technical features; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present application.