Detailed Description
The following description of the embodiments of the present invention is made clearly and completely with reference to the accompanying drawings; it is evident that the embodiments described are some, but not all, embodiments of the invention. All other embodiments obtained by those skilled in the art based on the embodiments of the invention without inventive effort are intended to fall within the scope of the invention.
The model training method suitable for a large language model provided by the embodiment of the invention can be applied to an application environment as shown in fig. 1, in which a client 110 communicates with a server 120 through a network. The server 120 may obtain a task set, a specific task, and an initial model through the client 110, and divide the task set to obtain a training set, a verification set, and a test set, where the initial model includes a prefix network and a large language model. Based on a meta-optimizer, the server 120 pre-trains the initial model using the training set and verifies the initial model using the verification set; it then performs fine-tuning training on the specific task on the initial model that passes verification, evaluates the performance of the initial model using the test set, and determines a target model corresponding to the specific task from the initial model that passes the evaluation. The training set divided from the task set is first used to improve the cooperative performance of the prefix network and the large language model, so that the model can quickly adapt to new tasks across multiple tasks; the initial model that passes verification is then fine-tuned on the specific task, so that the model can be quickly optimized for that task. This improves training efficiency, reduces the risks of overfitting and domain shift, and improves the performance of the target model corresponding to the specific task.
When the method is applied to the field of medical health, the initial model is pre-trained using the training set based on the meta-optimizer, and the initial model is verified using the verification set; fine-tuning training on a specific task in the medical health field (such as a doctor-reply generation task) is then performed, based on the meta-optimizer, on the initial model that passes verification, the performance of the initial model is evaluated using the test set, and a target model corresponding to the specific task is determined from the initial model that passes the evaluation. A target model suitable for the specific task in the medical health field is thus obtained.
When the application is applied to the field of financial technology, the initial model is pre-trained using the training set based on the meta-optimizer, and the initial model is verified using the verification set; fine-tuning training on a specific financial-technology task (such as a financial text classification task) is then performed, based on the meta-optimizer, on the initial model that passes verification, the performance of the initial model is evaluated using the test set, and a target model corresponding to the specific task is determined from the initial model that passes the evaluation. A target model suitable for the specific financial-technology task is thus obtained.
The client 110 may be, but is not limited to, various personal computers, notebook computers, smartphones, tablet computers, and portable wearable devices. The server 120 may be implemented by a stand-alone server or a server cluster formed by a plurality of servers. The present invention will be described in detail with reference to specific examples.
Referring to fig. 2, fig. 2 is a schematic flow chart of a model training method suitable for a large language model according to an embodiment of the invention, which includes the following steps:
S1: acquiring a task set, a specific task and an initial model, and dividing the task set to obtain a training set, a verification set and a test set, wherein the initial model comprises: prefix networks and large language models;
Specifically, the task set, the specific task, and the initial model may be acquired from user input, from a preset storage space, from a third-party application, or from a client.
Optionally, the input data of the prefix network and the output data of the prefix network are concatenated to form the input data of the large language model.
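As an illustrative sketch (not the embodiment's actual implementation), the splicing described above can be expressed as follows; the function name `build_llm_input` and the ordering (continuous prompt placed before the original input) are assumptions, since the text does not fix the concatenation order:

```python
def build_llm_input(input_tokens, prefix_network):
    # The prefix network maps its input to a continuous prompt; here the
    # prompt is placed before the original input, which is one common
    # convention (the ordering is an assumption, not fixed by the text).
    continuous_prompt = prefix_network(input_tokens)
    return continuous_prompt + input_tokens
```

In a real system the tokens would be embedding vectors rather than strings, and the concatenation would happen in the embedding space.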
Alternatively, the prefix network employs an autoencoder, and the large language model employs a GPT (Generative Pre-trained Transformer) model.
The task set contains tasks (also referred to as downstream tasks) of a plurality of different types (i.e., task types) and domains, such as text generation, text classification, question answering, and the like. Each task has a corresponding input-output format and annotation data. The input-output format includes an input format and an output format.
The type (i.e., task type) refers to the basic form or goal of a downstream task, such as text generation, text classification, or question answering. The type determines the input-output format of the task and the evaluation index. For example, the input of a text generation task is a text, or a text plus some special symbols, the output is a passage of natural language text, and the evaluation index can be BLEU (Bilingual Evaluation Understudy), ROUGE (Recall-Oriented Understudy for Gisting Evaluation), and the like; the input of a text classification task is a text, the output is a class label, and the evaluation index can be accuracy, the F1 score (also called the F-value or F-measure), and the like; the input of a question-answering task is a question and a related text, the output is a short answer, and the evaluation index can be exact match, the F1 score, and the like.
The domain refers to the specific content or subject matter of a downstream task, such as news, medical, or education. The domain determines the data source and feature distribution of the task. For example, data in the news domain can come from news websites or social media, and its feature distribution can include timeliness, fairness, objectivity, and the like; data in the medical health domain may come from medical records or medical literature, and its feature distribution may include professionalism, accuracy, privacy, and the like; data in the education domain may come from educational resources or learning platforms, and its feature distribution may include knowledge, interest, interactivity, and the like.
The relationship between type (i.e., task type) and domain is many-to-many: tasks of the same type may belong to different domains, and tasks of the same domain may be of different types. For example, tasks of the text generation type can belong to different domains such as news summary generation, poem generation, and dialogue generation; tasks in the news domain may be of different types such as text generation, text classification, and question answering.
The method for dividing the task set is random sampling division.
Optionally, when the task set is divided, the training set, the verification set, and the test set should cover as many types, domains, and tasks as possible, so as to improve the generalization capability and transferability of the initial model through diversity; at the same time, the training set, the verification set, and the test set should be as similar as possible in distribution, so as to reduce the performance differences of the initial model on different data sets through similarity. Therefore, a cross-validation method is employed when dividing the task set, to find a balance between diversity and similarity.
The cross-validation method divides the task set into K subsets; each subset is used in turn as the test set, and the remaining K-1 subsets are used as the training set and verification set. This approach makes full use of all the data and averages the performance of the initial model over the different subsets.
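The K-fold split described above can be sketched as follows; this is an illustrative minimal implementation, and the function name `k_fold_task_split` and the use of seeded random shuffling are assumptions not fixed by the embodiment:

```python
import random

def k_fold_task_split(tasks, k, seed=0):
    """Split a task set into k folds; each fold serves once as the test
    set while the remaining k-1 folds are pooled (to be divided further
    into training and verification sets)."""
    tasks = list(tasks)
    random.Random(seed).shuffle(tasks)          # randomize task order
    folds = [tasks[i::k] for i in range(k)]     # k near-equal folds
    splits = []
    for i in range(k):
        test = folds[i]
        rest = [t for j, f in enumerate(folds) if j != i for t in f]
        splits.append((rest, test))
    return splits
```

Each of the k returned pairs uses every task exactly once, so the model's performance can be averaged over the k test folds.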
A specific task refers to a downstream task that requires fine-tuning of the prefix network and the large language model after the pre-training phase, such as text generation, text classification, or question answering. These downstream tasks typically have their own input-output formats and annotation data, requiring adaptation and optimization of the initial model. The specific task may be a single task or a set of tasks, as in the case of multi-task learning or zero-shot learning.
S2: based on a meta-optimizer, pre-training the initial model by adopting the training set, and verifying the initial model by adopting the verification set;
Pre-training is performed on a large set of tasks so that the initial model learns general language knowledge and capabilities; it therefore requires the training set, i.e., the data set that contains the most data.
The meta-optimizer is an optimizer based on meta-learning. Meta-learning is a technique of learning how to learn, which can quickly adapt to new tasks across multiple tasks. Meta-learning can improve the speed and efficiency of parameter-efficient fine-tuning methods and can reduce the risks of overfitting and domain shift.
Specifically, based on a meta-optimizer, the training set is used to pre-train the initial model, so that the initial model can achieve good performance after only a small number of gradient updates; the verification set is then used to verify the initial model. When the verification result of the initial model is a failure, the training set is used again to pre-train the initial model in the next round, until the verification result of the initial model is a pass; when the verification result of the initial model is a pass, pre-training of the initial model ends.
S3: based on a meta-optimizer, performing fine tuning training on the specific task on the initial model which passes the verification result, and evaluating the performance of the initial model by adopting the test set;
the fine tuning is performed on a specific task in order to adapt the initial model to the needs and characteristics of the specific task.
The evaluation is performed on a specific task with the aim of testing the final performance of the initial model on the specific task, so that it is necessary to use a test set, i.e. a data set for evaluating the performance of the model.
Specifically, based on a meta-optimizer, performing fine tuning training on the specific task on the initial model which passes the verification result, so that the initial model is adapted to the requirements and characteristics of the specific task; and evaluating the performance of the initial model by adopting the test set so as to ensure that the final performance of the initial model on a specific task meets the requirements.
It can be understood that, when the evaluation result of the initial model is a failure, fine-tuning training on the specific task is performed again on the initial model that passed verification, until the evaluation result of the initial model is a pass; when the evaluation result of the initial model is a pass, fine-tuning training of the initial model ends.
S4: and determining a target model corresponding to the specific task according to the initial model which passes the evaluation result.
Specifically, the initial model which is passed as a result of the evaluation is taken as a target model corresponding to the specific task.
In this embodiment, based on the meta-optimizer, the training set is used to pre-train the initial model and the verification set is used to verify it; based on the meta-optimizer, fine-tuning training on the specific task is performed on the initial model that passes verification, and the test set is used to evaluate its performance; the target model corresponding to the specific task is determined from the initial model that passes the evaluation; and the initial model includes a prefix network and a large language model. The training set divided from the task set is first used to improve the cooperative performance of the prefix network and the large language model, so that the model can quickly adapt to new tasks across multiple tasks; the initial model that passes verification is then fine-tuned on the specific task, so that the model can be quickly optimized for that task, thereby improving training efficiency, reducing the risks of overfitting and domain shift, and improving the performance of the target model corresponding to the specific task.
In one embodiment, the step of pre-training the initial model using the training set and validating the initial model using the validation set includes:
S21: sampling a plurality of tasks from the training set as respective initial tasks;
specifically, a random sampling method is adopted to sample a plurality of tasks from the training set, and each sampled task is taken as an initial task.
The initial task refers to a complete downstream task such as text generation, text classification, or question answering. Each initial task has its own input-output format and annotation data, not just a single piece of training data. A plurality of tasks (i.e., a small batch of tasks) is randomly sampled from the training set to simulate a multi-task learning environment, so that the prefix network and the large language model can be pre-trained on multiple tasks, thereby improving the generality and transferability of the initial model.
S22: obtaining an initial task from each initial task in a traversing manner, taking the initial task as a training task, sampling a support set and a query set from the training task, calculating a loss value of the initial model and updating network parameters according to a preset first learning rate and the support set, and calculating a single sample loss value of the initial model by adopting the query set to obtain a single sample loss value set corresponding to the training task;
Specifically, one initial task is obtained from the initial tasks by traversal and taken as the training task; a plurality of training data are sampled from the training task as the support set, and a plurality of training data are sampled from the training task as the query set; based on a gradient descent algorithm and a target loss function, loss value calculation and network parameter updating are performed on the initial model according to the preset first learning rate and the support set, so as to pre-train the initial model; each training data in the query set and the target loss function are then used to calculate a loss value of the initial model, the loss value calculated for one training data in the query set is taken as a single sample loss value, and all the single sample loss values corresponding to the query set of the training task are taken as the single sample loss value set corresponding to that training task.
It will be appreciated that, by repeatedly executing step S22, the single sample loss value set corresponding to each initial task may be determined.
S23: carrying out average value calculation on the single sample loss value sets corresponding to the initial tasks to obtain comprehensive loss values, and carrying out network parameter updating on the initial model according to a preset second learning rate and the comprehensive loss values;
Specifically, the average of all the single sample loss values across the single sample loss value sets corresponding to the initial tasks is calculated, and the calculated average is taken as the comprehensive loss value; the network parameters of the initial model are then updated according to a gradient descent algorithm, the preset second learning rate, and the comprehensive loss value.
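Steps S22-S23 follow the general pattern of a MAML-style meta-update: adapt on each task's support set at the first learning rate, then update the shared initialization from the averaged query losses at the second learning rate. The following single-parameter sketch is illustrative only; the scalar parameter, the callable-based task interface, and the first-order approximation (ignoring second-order gradient terms) are simplifying assumptions, not the embodiment's implementation:

```python
def maml_step(theta, tasks, inner_lr, outer_lr):
    """One meta-optimizer update over a batch of sampled tasks.

    Each task supplies gradient callables for its support and query sets;
    theta is a single scalar parameter for illustration."""
    query_grads = []
    for task in tasks:
        # Inner loop (S22): adapt on the support set at the first learning rate.
        theta_adapted = theta - inner_lr * task["support_grad"](theta)
        # Query-set gradient evaluated at the adapted parameters
        # (first-order approximation).
        query_grads.append(task["query_grad"](theta_adapted))
    # Outer loop (S23): average the per-task query gradients (the
    # "comprehensive loss value" gradient) and update the shared
    # initialization at the second learning rate.
    avg_grad = sum(query_grads) / len(query_grads)
    return theta - outer_lr * avg_grad
```

Iterating this step moves the initialization toward parameters from which each sampled task can be reached with only a few gradient updates.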
S24: adopting the verification set to verify the initial model;
Different tasks generally use task-specific metrics: according to the task type and output format, an appropriate metric is selected to measure the quality or accuracy of the initial model's output against the desired output. For example, for a text generation task, metrics such as BLEU, ROUGE, and METEOR (a text-generation evaluation technique that takes into account word-level precision and recall, as well as word-order penalties) may be used to measure similarity between the generated text and the reference text; for a text classification task, metrics such as precision, recall, and the F1 score (F1-score) are used to measure consistency between the classification results and the true labels; for a question-answering task, metrics such as exact match, the F1 score (F1-score), and MRR (mean reciprocal rank) are used to measure the degree of match between the predicted answer and the correct answer. This approach has the advantage of intuitively reflecting the performance of the initial model on a specific task. Another approach is to use human evaluation, i.e., inviting human evaluators to score or rank the output of the initial model. Human evaluation may be performed according to different dimensions or criteria, such as grammatical correctness, logical consistency, content relevance, and information completeness. Human evaluation can take into account many aspects of the initial model's output and better reflects the preferences and needs of human users.
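As one concrete example of a task-specific metric, the token-overlap F1 commonly used for question answering can be sketched as follows; this is an illustrative implementation (the function name and whitespace tokenization are assumptions), not a metric definition fixed by the embodiment:

```python
def qa_token_f1(prediction, reference):
    """Token-overlap F1 between a predicted answer and a reference answer."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    # Count reference tokens, then consume them as they match predictions.
    ref_counts = {}
    for t in ref:
        ref_counts[t] = ref_counts.get(t, 0) + 1
    common = 0
    for t in pred:
        if ref_counts.get(t, 0) > 0:
            common += 1
            ref_counts[t] -= 1
    if common == 0:
        return 0.0
    precision = common / len(pred)
    recall = common / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Exact match, by contrast, simply checks whether the normalized prediction equals the reference; F1 gives partial credit for overlapping tokens.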
S25: if the verification result is not passed, jumping to the step of sampling a plurality of tasks from the training set as each initial task, and continuing to execute until the verification result is passed.
Specifically, if the verification result is a failure, the performance of the initial model is not yet satisfactory and pre-training needs to continue; execution therefore jumps back to the step of sampling a plurality of tasks from the training set as the initial tasks, i.e., back to step S21, which is executed again until the verification result is a pass. When the verification result is a pass, step S3 is performed without jumping back to step S21.
In this embodiment, based on the meta-optimizer, the training set divided from the task set is used to improve the cooperative performance of the prefix network and the large language model, so that the model can quickly adapt to new tasks across multiple tasks.
In one embodiment, the step of performing fine tuning training on the specific task on the initial model that is passed as a result of the verification, and performing performance evaluation on the initial model using the test set includes:
S31: sampling a plurality of training data from the specific task as respective initial data;
Specifically, a plurality of training data are randomly sampled from the specific task, and each sampled training data is used as initial data.
S32: performing fine tuning training on the initial model which passes the verification result by adopting each initial data, and performing performance evaluation on the initial model by adopting the test set;
Specifically, each initial data is used to perform fine-tuning training on the initial model that passed verification, with a target loss function and a gradient descent algorithm employed during fine-tuning; the test set is then used to evaluate the performance of the initial model, so as to judge whether its performance on the specific task meets the requirements.
S33: if the result of the evaluation is not passing, jumping to the step of sampling a plurality of training data from the specific task as each initial data, and continuing to execute until the result of the evaluation is passing.
Specifically, if the evaluation result is a failure, i.e., the performance of the initial model on the specific task is not satisfactory, fine-tuning training needs to continue; execution therefore jumps back to the step of sampling a plurality of training data from the specific task as the initial data, i.e., back to step S31, which is executed again until the evaluation result is a pass. When the evaluation result is a pass, step S4 is performed without jumping back to step S31.
When the test set is used to evaluate the performance of the initial model, the same methods as for the verification set, i.e., task-specific metrics or human evaluation, can be used to measure the quality or accuracy of the initial model's output against the expected output. The test set is the data set used to evaluate the final performance of the initial model on a specific task, so no parameter or hyperparameter adjustments should be made on the test set. One or more optimization objectives may be selected according to different evaluation methods and metrics, such as maximizing the BLEU score, minimizing cross-entropy loss, or maximizing accuracy. Methods such as hypothesis testing, confidence intervals, and significance levels may be used to compare model performance under different parameters. The parameters that achieve the optimization objective on the test set may be selected as the best parameters and saved.
In this embodiment, fine-tuning training on the specific task is performed on the initial model that passes verification, so that the initial model adapts to the requirements and characteristics of the specific task; the meta-optimizer can reduce the risks of overfitting and domain shift and improve the performance of the target model corresponding to the specific task.
In one embodiment, the step of performing fine tuning training on the initial model that is verified as passing by using each initial data, and performing performance evaluation on the initial model using the test set includes:
s321: acquiring a counter, an optimal performance variable and a preset frequency threshold, initializing the value of the counter to 0, and initializing the optimal performance variable;
The counter is a variable that records how many consecutive evaluations on the test set have shown no performance improvement.
The optimal performance variable is a variable that records the current best performance of the initial model.
The preset frequency threshold is used in the condition for deciding whether to stop fine-tuning, and is a positive integer.
S322: selecting one initial data from the initial data as target data;
Specifically, one initial data is selected from the initial data as the target data, providing the basis for each fine-tuning iteration and the subsequent performance evaluation of the initial model on the test set.
S323: performing fine tuning training on the initial model according to the target data;
specifically, based on a target loss function and a gradient descent algorithm, fine tuning training is performed on the initial model according to the target data.
S324: verifying the performance of the initial model according to the test set to obtain verification performance data;
specifically, the performance of the initial model is verified according to the test set, and performance data obtained through verification is used as verification performance data.
S325: if the verification performance data exceeds the optimal performance variable, updating the optimal performance variable according to the verification performance data, and initializing the value of the counter to 0;
Specifically, if the verification performance data exceeds the optimal performance variable, this fine-tuning iteration improved the performance of the initial model; the optimal performance variable is therefore updated with the verification performance data, so that it records the current best verification performance of the initial model, and the value of the counter is initialized to 0 so that the counter restarts counting from zero.
S326: and if the verification performance data does not exceed the optimal performance variable, adding 1 to the value of the counter, if the value of the counter is not smaller than the preset frequency threshold, taking the network parameter corresponding to the optimal performance variable as the network parameter of the initial model, and if the value of the counter is smaller than the preset frequency threshold, jumping to the step of selecting one initial data from the initial data as target data, and continuing to execute.
Specifically, if the verification performance data does not exceed the optimal performance variable, this fine-tuning iteration did not improve the performance of the initial model, so the value of the counter is increased by 1. If the value of the counter is not smaller than the preset frequency threshold, the number of iterations without improvement has reached the preset stop-fine-tuning condition, so the network parameters of the initial model corresponding to the optimal performance variable are used as the final network parameters of the initial model. If the value of the counter is smaller than the preset frequency threshold, the number of iterations without improvement has not yet reached the preset stop-fine-tuning condition, so fine-tuning can continue, and execution jumps back to the step of selecting one initial data from the initial data as the target data, i.e., back to step S322, which is executed again.
In this embodiment, after each fine-tuning iteration the performance of the initial model on the test set is computed and compared with the optimal performance variable. If the current performance (i.e., the verification performance data) does not exceed the previous best performance (i.e., the optimal performance variable), the counter is incremented; when the number of consecutive iterations without improvement is not less than the preset frequency threshold, fine-tuning stops, and the network parameters corresponding to the optimal performance variable are used as the final network parameters of the initial model, thereby preventing the initial model from overfitting the initial data.
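The counter logic of steps S321-S326 can be sketched as follows. This is an illustrative simplification: using the step index in place of the saved network parameters, and a precomputed list of scores in place of live evaluation on the test set, are assumptions made only to keep the sketch self-contained:

```python
def finetune_with_patience(initial_score, eval_scores, patience):
    """Counter-based early stopping as in steps S321-S326: stop after
    `patience` consecutive evaluations with no improvement, keeping the
    best score and the step (standing in for network parameters) at
    which it was reached."""
    best_score = initial_score
    best_step = -1              # stand-in for the saved network parameters
    counter = 0                 # S321: counter initialized to 0
    for step, score in enumerate(eval_scores):
        if score > best_score:  # S325: improvement -> save and reset counter
            best_score, best_step, counter = score, step, 0
        else:                   # S326: no improvement -> increment counter
            counter += 1
            if counter >= patience:
                break           # threshold reached: stop fine-tuning
    return best_score, best_step
```

With scores [0.5, 0.6, 0.55, 0.58, 0.57] and a patience of 3, fine-tuning stops after the third consecutive non-improving evaluation, and the parameters from the step that scored 0.6 are kept.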
In one embodiment, the prefix network includes an encoder and a decoder: the encoder encodes the input data of the prefix network into a hidden vector, the decoder decodes the hidden vector into a continuous prompt, and the input data of the prefix network and the continuous prompt are concatenated as the input data of the large language model.
The input data of the prefix network may be a piece of text, or a piece of text plus some special symbols; for example, the special symbols include [CLS] and/or [SEP].
The hidden vector is a fixed-length vector or a variable-length sequence.
The continuous prompt has the same length and format as the input data of the prefix network.
That is, the prefix network is an autoencoder consisting of an encoder and a decoder. The autoencoder learns an effective representation of the original input (i.e., the input data of the prefix network) without supervision, and can dynamically generate appropriate continuous prompts for different tasks.
The autoencoder is trained using a loss function based on contrastive learning, which includes a reconstruction loss function and a contrastive loss function. The reconstruction loss function measures the difference between the original input and the decoded output (i.e., the continuous prompt). The contrastive loss function measures the similarity between the hidden vector and the positive and negative samples. The contrastive loss function enhances the discriminative capability of the hidden vector, so that the initial model can distinguish the characteristics of different tasks, thereby improving the generalization capability and transferability of the initial model.
In one embodiment, the encoder employs a recurrent neural network or a Transformer model, the decoder employs a recurrent neural network or a Transformer model, and the large language model employs a unidirectional Transformer model.
The Transformer model is a deep learning model that adopts a self-attention mechanism, which can assign different weights according to the importance of each part of the input data.
The unidirectional Transformer model is a Transformer that attends in only one direction (i.e., a causal Transformer).
In one embodiment, the target loss function of the initial model is a weighted sum of a task loss function, a reconstruction loss function, and a contrast loss function;
the task loss function is used for measuring the difference between the output of the large language model and the expected output;
the reconstruction loss function is used for measuring the difference between the input data of the initial model and the continuous prompt output by the decoder;
the contrast loss function is used for measuring the similarity between the hidden vector output by the encoder and a positive sample and a negative sample, wherein the positive sample is data which is input into the initial model and belongs to the specific task, and the negative sample is data which is input into the initial model and does not belong to the specific task.
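A minimal sketch of the weighted-sum objective defined above; the function name and the weight values are illustrative hyperparameters assumed for the sketch, not values fixed by the embodiment:

```python
def total_loss(task_loss, recon_loss, contrast_loss,
               w_task=1.0, w_recon=0.5, w_contrast=0.5):
    """Target loss of the initial model: a weighted sum of the task loss,
    the reconstruction loss, and the contrastive loss.  The weights are
    illustrative hyperparameters to be tuned, e.g. on the verification set."""
    return (w_task * task_loss
            + w_recon * recon_loss
            + w_contrast * contrast_loss)
```

Raising `w_recon` emphasizes faithful continuous prompts, while raising `w_contrast` emphasizes task-discriminative hidden vectors; the balance is a design choice.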
The task loss function is defined according to the task type and model type. For example, for a generative task, cross-entropy loss is used as an indicator of the quality of the generated text. For a discriminative task, binary cross-entropy loss or mean squared error loss is used as an indicator of the accuracy of the classification or regression result. It will be appreciated that if a task contains multiple subtasks or outputs, the task loss functions of the different subtasks or outputs are weighted and summed to yield the total task loss function.
Different task types correspond to different loss functions; during training, an appropriate loss function needs to be selected according to the task type to calculate the difference between the output of the large language model and the expected output, and the initial model parameters are updated by gradient descent. For example, when the task types are generative tasks and discriminative tasks, the following differences exist in the training data: (1) the training data of a generative task is usually a correspondence between input data and output data; for example, for poem writing the input data is a topic and the output data is a poem, and for abstract writing the input data is an article and the output data is an abstract. The goal of a generative task is to generate natural language text from the input data, so cross-entropy loss can be used to measure the word-level match between the output text (output data) and the expected text. (2) The training data of a discriminative task is usually a correspondence between input data and a label; for example, for sentiment analysis the input data is a text and the label is a sentiment category, and for spam detection the input data is an email and the label is a binary value. The goal of a discriminative task is to judge which category the input data belongs to, or to give a value, so binary cross-entropy loss or mean squared error loss can be used to measure the difference between the output value and the expected value.
The task type can be inferred from the format and content of the training data, and also from the output form of the initial model and the evaluation index. For example, if the training data is a one-to-one text-to-text correspondence and the output of the initial model is a segment of natural language text, the task may be considered generative; if the training data is a one-to-one text-to-label correspondence and the output of the initial model is a category or a value, the task may be considered discriminant.
The reconstruction loss function uses a mean squared error loss or a cosine similarity loss.
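A minimal sketch of the two options, using pure-Python vector helpers (the embodiment does not prescribe an implementation):

```python
import math

def mse_reconstruction_loss(x, x_rec):
    # Mean squared error between the original input vector and its reconstruction.
    return sum((a - b) ** 2 for a, b in zip(x, x_rec)) / len(x)

def cosine_reconstruction_loss(x, x_rec):
    # 1 - cosine similarity: 0 when the reconstruction is parallel to the input.
    dot = sum(a * b for a, b in zip(x, x_rec))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_r = math.sqrt(sum(b * b for b in x_rec))
    return 1.0 - dot / (norm_x * norm_r)
```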
The contrast loss function may use information bottleneck contrastive learning (InfoNCE), a contrastive learning method based on mutual information. Its goal is to maximize the mutual information between the hidden vector and the positive sample while minimizing the mutual information between the hidden vector and the negative sample. The mutual information can be approximated by cross-entropy.
Information bottleneck contrastive learning is a contrastive learning method based on mutual information, which tries to make the hidden vector preserve as much of the information of the original input as possible while also distinguishing as much as possible the characteristics of different tasks or domains. Mutual information is an indicator that measures the correlation between two variables and represents the amount of information they share: if the two variables are identical, their mutual information is maximal; if they are completely independent, it is zero. The goal of the information bottleneck contrastive learning is to maximize the mutual information between the hidden vector z and the positive sample x+ while minimizing the mutual information between z and the negative sample x-. This allows the hidden vector z to better represent the original input x and to better distinguish the features of different tasks or domains.
The contrast loss function L_con is calculated as:

L_con = -E_{x, x+, x-} [ log ( exp(s(z, x+)) / ( exp(s(z, x+)) + Σ_{k=1}^{K} exp(s(z, x_k-)) ) ) ]

wherein L_con is the contrast loss function (contrastive loss function), a contrastive learning loss based on mutual information; E is the expectation symbol, denoting the average of the bracketed expression over the original input x, the positive sample x+, and the negative samples x-; log is the logarithm with the natural number e as its base; exp is the exponential function with the natural number e as its base; s(z, x) is a similarity function representing the similarity between the hidden vector z and an input x, which may be a cosine similarity or a dot product; z is the hidden vector, i.e. the vector or sequence to which the encoder maps the original input x; x+ is a positive sample, i.e. an input belonging to the same task as the original input x; x- is a negative sample, i.e. an input belonging to a different task from the original input x; and K is the number of negative samples, a hyperparameter used to control the difficulty and effect of the contrastive learning.
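The contrast loss can be sketched directly from its definition; here s(z, x) is taken to be a dot product and the vectors are plain Python lists (a toy illustration, not the embodiment's implementation):

```python
import math

def similarity(z, x):
    # s(z, x): dot-product similarity (cosine similarity is an equally valid choice).
    return sum(a * b for a, b in zip(z, x))

def contrast_loss(z, x_pos, x_negs):
    # -log( exp(s(z, x+)) / ( exp(s(z, x+)) + sum_k exp(s(z, x_k-)) ) )
    pos = math.exp(similarity(z, x_pos))
    neg = sum(math.exp(similarity(z, x_k)) for x_k in x_negs)
    return -math.log(pos / (pos + neg))
```

Raising the similarity to the positive sample lowers the loss, which is what pushes the hidden vector toward task-discriminative features.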
This embodiment enhances the discrimination capability and transferability of the hidden vector through contrastive learning and meta-learning, so that the model can distinguish the characteristics of different tasks or domains, adapt quickly to multiple tasks, and perform well in cross-domain or cross-subject settings, thereby improving the robustness and reusability of the initial model. The reconstruction loss and the task loss balance the generative and discriminative capabilities of the initial model, so that it can complete understanding and generation tasks simultaneously. Different task types and model types are accommodated without special processing for each task or model, thereby improving versatility and simplicity.
Referring to fig. 3, in one embodiment, a model training apparatus suitable for a large language model is provided, the apparatus comprising:
the training preparation module 801 is configured to obtain a task set, a specific task, and an initial model, and divide the task set to obtain a training set, a verification set, and a test set, where the initial model includes: prefix networks and large language models;
a pre-training module 802, configured to pre-train the initial model with the training set and verify the initial model with the verification set based on a meta-optimizer;
a fine-tuning training module 803, configured to perform, based on a meta-optimizer, fine-tuning training on the specific task on the initial model whose verification result is a pass, and to perform performance evaluation on the initial model using the test set;
the target model determining module 804 is configured to determine a target model corresponding to the specific task according to the initial model whose evaluation result is a pass.
In this embodiment, based on a meta-optimizer, the training set is used to pre-train the initial model and the verification set is used to verify it; based on the meta-optimizer, fine-tuning training on the specific task is performed on the initial model whose verification result is a pass, and the test set is used to evaluate its performance; and a target model corresponding to the specific task is determined according to the initial model whose evaluation result is a pass, where the initial model includes a prefix network and a large language model. The training set divided from the task set is first used to improve the cooperation between the prefix network and the large language model, so that the model can quickly adapt to new tasks across multiple tasks; the initial model whose verification result is a pass is then fine-tuned on the specific task, so that it can be quickly optimized for that task. This improves the training efficiency of the model, reduces the risks of overfitting and domain deviation, and improves the performance of the target model corresponding to the specific task.
In one embodiment, the step in which the pre-training module 802 pre-trains the initial model with the training set and verifies the initial model with the verification set includes:
sampling a plurality of tasks from the training set as respective initial tasks;
traversing the initial tasks: taking the current initial task as a training task, sampling a support set and a query set from the training task, calculating a loss value of the initial model on the support set and updating the network parameters according to a preset first learning rate, and calculating single-sample loss values of the initial model on the query set to obtain a single-sample loss value set corresponding to the training task;
averaging the single-sample loss value sets corresponding to the initial tasks to obtain a comprehensive loss value, and updating the network parameters of the initial model according to a preset second learning rate and the comprehensive loss value;
adopting the verification set to verify the initial model;
if the verification result is a fail, returning to the step of sampling a plurality of tasks from the training set as the initial tasks, and continuing until the verification result is a pass.
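The loop above can be sketched on a toy one-parameter model f(x) = w·x with mean-squared-error loss. The first-order outer update used here (FOMAML-style: outer gradients are taken at the adapted parameters) is an assumption for brevity, since the embodiment only specifies the two learning rates and the averaging of query losses:

```python
def mse(w, batch):
    # Mean squared error of f(x) = w*x over a batch of (x, y) pairs.
    return sum((w * x - y) ** 2 for x, y in batch) / len(batch)

def grad_mse(w, batch):
    # d/dw of the mean squared error above.
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

def meta_train_step(w, tasks, alpha, beta):
    """One outer step: for each task, adapt w on the support set at the first
    learning rate alpha, record the query-set loss, then update w at the second
    learning rate beta against the averaged query gradients."""
    outer_grad = 0.0
    query_losses = []
    for support, query in tasks:
        w_adapted = w - alpha * grad_mse(w, support)   # inner update on the support set
        query_losses.append(mse(w_adapted, query))     # single-sample loss on the query set
        outer_grad += grad_mse(w_adapted, query)       # first-order outer gradient
    outer_grad /= len(tasks)
    comprehensive_loss = sum(query_losses) / len(query_losses)
    return w - beta * outer_grad, comprehensive_loss
```

With real networks the inner/outer updates would act on all prefix-network and tunable parameters rather than a single scalar; the structure of the two nested updates is the same.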
In one embodiment, the step in which the fine-tuning training module 803 performs fine-tuning training on the specific task on the initial model whose verification result is a pass, and performs performance evaluation on the initial model using the test set, includes:
sampling a plurality of training data from the specific task as respective initial data;
performing fine-tuning training on the initial model whose verification result is a pass using each initial datum, and performing performance evaluation on the initial model using the test set;
if the evaluation result is a fail, returning to the step of sampling a plurality of training data from the specific task as the initial data, and continuing until the evaluation result is a pass.
In one embodiment, the step in which the fine-tuning training module 803 performs fine-tuning training on the initial model whose verification result is a pass using each initial datum, and performs performance evaluation on the initial model using the test set, includes:
acquiring a counter, an optimal performance variable and a preset frequency threshold, initializing the value of the counter to 0, and initializing the optimal performance variable;
selecting one initial datum from the initial data as target data;
performing fine tuning training on the initial model according to the target data;
verifying the performance of the initial model according to the test set to obtain verification performance data;
if the verification performance data exceeds the optimal performance variable, updating the optimal performance variable according to the verification performance data, and initializing the value of the counter to 0;
and if the verification performance data does not exceed the optimal performance variable, adding 1 to the counter; if the value of the counter is not smaller than the preset frequency threshold, taking the network parameters corresponding to the optimal performance variable as the network parameters of the initial model; and if the value of the counter is smaller than the preset frequency threshold, returning to the step of selecting one initial datum from the initial data as target data, and continuing.
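The steps above form an early-stopping loop and can be sketched as follows; `train_step` and `evaluate` are hypothetical callables standing in for the fine-tuning update and the test-set verification:

```python
def fine_tune_with_early_stopping(initial_params, initial_data, train_step, evaluate, patience):
    """Counter starts at 0, the optimal performance variable tracks the best
    test-set score, and training stops once `patience` consecutive selections
    bring no improvement; the best parameters are then restored."""
    counter = 0
    best_performance = float("-inf")   # initialize the optimal performance variable
    best_params = initial_params
    params = initial_params
    i = 0
    while True:
        target = initial_data[i % len(initial_data)]  # select one initial datum as target data
        params = train_step(params, target)           # fine-tune on the target data
        performance = evaluate(params)                # verify performance on the test set
        if performance > best_performance:
            best_performance = performance
            best_params = params
            counter = 0                               # improvement: reset the counter
        else:
            counter += 1
            if counter >= patience:                   # threshold reached: restore best params
                return best_params, best_performance
        i += 1
```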
In one embodiment, the prefix network includes an encoder and a decoder; the encoder encodes the input data of the prefix network into hidden vectors, and the decoder decodes the hidden vectors into continuous prompts. The input data of the prefix network and the continuous prompts are concatenated as the input data of the large language model.
In one embodiment, the encoder employs a recurrent neural network or a Transformer model, the decoder employs a recurrent neural network or a Transformer model, and the large language model employs a unidirectional Transformer model.
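The data flow can be sketched with toy stand-ins (mean pooling and vector copying replace the actual RNN/Transformer modules, which are not specified at this level of detail):

```python
def encode(input_embeddings):
    # Toy encoder: map the input embedding sequence to one hidden vector (mean pooling).
    dim = len(input_embeddings[0])
    n = len(input_embeddings)
    return [sum(vec[d] for vec in input_embeddings) / n for d in range(dim)]

def decode(hidden, prompt_length):
    # Toy decoder: expand the hidden vector into `prompt_length` continuous prompt vectors.
    return [list(hidden) for _ in range(prompt_length)]

def prefix_network_forward(input_embeddings, prompt_length=2):
    # The continuous prompts are concatenated with the original input
    # to form the input sequence of the large language model.
    hidden = encode(input_embeddings)
    continuous_prompts = decode(hidden, prompt_length)
    return continuous_prompts + input_embeddings
```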
In one embodiment, the target loss function of the initial model is a weighted sum of a task loss function, a reconstruction loss function, and a contrast loss function;
the task loss function is used for measuring the difference between the output of the large language model and the expected output;
the reconstruction loss function is used for measuring the difference between the input data of the initial model and the continuous prompts output by the decoder;
the contrast loss function is used for measuring the similarity between the hidden vector output by the encoder and a positive sample and a negative sample, wherein the positive sample is data which is input into the initial model and belongs to the specific task, and the negative sample is data which is input into the initial model and does not belong to the specific task.
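The weighted sum can be sketched as follows; the weight values shown are hypothetical hyperparameters, since the embodiment does not fix them:

```python
def target_loss(task_loss, reconstruction_loss, contrast_loss,
                w_task=1.0, w_rec=0.5, w_con=0.5):
    # Target loss of the initial model: a weighted sum of the three components.
    # The weights balance generative ability, reconstruction fidelity, and
    # task discrimination; their values are tuning choices, not fixed here.
    return w_task * task_loss + w_rec * reconstruction_loss + w_con * contrast_loss
```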
In one embodiment, a computer device is provided, which may be a server, and the internal structure of which may be as shown in fig. 4. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes non-volatile and/or volatile storage media and internal memory. The non-volatile storage medium stores an operating system, computer programs, and a database. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The network interface of the computer device is for communicating with an external client via a network connection. The computer program, when executed by a processor, performs functions or steps on the server side of a model training method suitable for large language models.
In one embodiment, a computer device is presented comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, the processor implementing the following steps when executing the computer program:
acquiring a task set, a specific task and an initial model, and dividing the task set to obtain a training set, a verification set and a test set, wherein the initial model comprises: prefix networks and large language models;
based on a meta-optimizer, pre-training the initial model by adopting the training set, and verifying the initial model by adopting the verification set;
based on a meta-optimizer, performing fine-tuning training on the specific task on the initial model whose verification result is a pass, and evaluating the performance of the initial model using the test set;
and determining a target model corresponding to the specific task according to the initial model whose evaluation result is a pass.
In this embodiment, based on a meta-optimizer, the training set is used to pre-train the initial model and the verification set is used to verify it; based on the meta-optimizer, fine-tuning training on the specific task is performed on the initial model whose verification result is a pass, and the test set is used to evaluate its performance; and a target model corresponding to the specific task is determined according to the initial model whose evaluation result is a pass, where the initial model includes a prefix network and a large language model. The training set divided from the task set is first used to improve the cooperation between the prefix network and the large language model, so that the model can quickly adapt to new tasks across multiple tasks; the initial model whose verification result is a pass is then fine-tuned on the specific task, so that it can be quickly optimized for that task. This improves the training efficiency of the model, reduces the risks of overfitting and domain deviation, and improves the performance of the target model corresponding to the specific task.
In one embodiment, a computer readable storage medium is presented, the computer readable storage medium storing a computer program which, when executed by a processor, performs the steps of:
acquiring a task set, a specific task and an initial model, and dividing the task set to obtain a training set, a verification set and a test set, wherein the initial model comprises: prefix networks and large language models;
based on a meta-optimizer, pre-training the initial model by adopting the training set, and verifying the initial model by adopting the verification set;
based on a meta-optimizer, performing fine-tuning training on the specific task on the initial model whose verification result is a pass, and evaluating the performance of the initial model using the test set;
and determining a target model corresponding to the specific task according to the initial model whose evaluation result is a pass.
In this embodiment, based on a meta-optimizer, the training set is used to pre-train the initial model and the verification set is used to verify it; based on the meta-optimizer, fine-tuning training on the specific task is performed on the initial model whose verification result is a pass, and the test set is used to evaluate its performance; and a target model corresponding to the specific task is determined according to the initial model whose evaluation result is a pass, where the initial model includes a prefix network and a large language model. The training set divided from the task set is first used to improve the cooperation between the prefix network and the large language model, so that the model can quickly adapt to new tasks across multiple tasks; the initial model whose verification result is a pass is then fine-tuned on the specific task, so that it can be quickly optimized for that task. This improves the training efficiency of the model, reduces the risks of overfitting and domain deviation, and improves the performance of the target model corresponding to the specific task.
It should be noted that, the functions or steps implemented by the computer readable storage medium or the computer device may correspond to the relevant descriptions of the server side and the client side in the foregoing method embodiments, and are not described herein for avoiding repetition.
Those skilled in the art will appreciate that implementing all or part of the above-described methods may be accomplished by way of a computer program stored on a non-transitory computer-readable storage medium, which, when executed, may comprise the steps of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. The non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), among others.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-described division of the functional units and modules is illustrated, and in practical application, the above-described functional distribution may be performed by different functional units and modules according to needs, i.e. the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-described functions.
The above embodiments are only for illustrating the technical solution of the present invention, and not for limiting the same; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention, and are intended to be included in the scope of the present invention.