CN118227107A

Movatterモバイル変換

Info

Publication number: CN118227107A
Application number: CN202410573736.1A
Authority: CN
Inventors: 许红波; 宁义双; 宁可; 郑政芳
Original assignee: Kingdee Software China Co Ltd
Current assignee: Kingdee Software China Co Ltd
Priority date: 2024-05-10
Filing date: 2024-05-10
Publication date: 2024-06-21

Abstract

The application relates to a code generation model training method, a code generation device, a computer device, a storage medium and a computer program product. The method comprises the following steps: acquiring a pre-training data set corresponding to a target service scene; the pre-training code samples of the pre-training data set comprise universal code samples and scene code samples corresponding to the target service scene; inputting the pre-training data set into a general large language model for secondary pre-training to obtain a first prediction code sample; based on the difference between the pre-training code sample and the first prediction code sample, adjusting a universal large language model to obtain an initial code generation model corresponding to the target service scene; and performing supervised training on the initial code generation model based on the supervised training set corresponding to the target service scene to obtain the target code generation model corresponding to the target service scene. By adopting the method, the code generation efficiency can be improved.

Description

Translated fromChinese

代码生成模型训练方法、代码生成方法、装置Code generation model training method, code generation method, and device

技术领域Technical Field

本申请涉及计算机技术领域，特别是涉及一种代码生成模型训练方法、代码生成方法、装置、计算机设备、存储介质和计算机程序产品。The present application relates to the field of computer technology, and in particular to a code generation model training method, a code generation method, an apparatus, a computer device, a storage medium, and a computer program product.

背景技术Background technique

随着计算机技术的快速发展，软件开发技术日益成熟，产生了越来越多样化的代码开发需求。在软件开发过程中，首先提出开发需求，由开发人员基于开发平台中提供的各种代码插件和开发工具，开发符合需求的代码。With the rapid development of computer technology, software development technology has become increasingly mature, resulting in more and more diverse code development requirements. In the software development process, development requirements are first proposed, and developers develop code that meets the requirements based on various code plug-ins and development tools provided in the development platform.

传统方法中，是由开发人员基于开发需求人为进行软件开发，存在代码生成效率低的问题。In the traditional method, developers manually develop software based on development requirements, which leads to the problem of low code generation efficiency.

发明内容Summary of the invention

基于此，有必要针对上述技术问题，提供一种能够提高代码生成效率的代码生成模型训练方法、代码生成方法、装置、计算机设备、计算机可读存储介质和计算机程序产品。Based on this, it is necessary to provide a code generation model training method, code generation method, device, computer equipment, computer-readable storage medium and computer program product that can improve code generation efficiency in response to the above technical problems.

本申请提供了一种代码生成模型训练方法。所述方法包括：The present application provides a code generation model training method. The method comprises:

获取目标业务场景对应的预训练数据集；预训练数据集的预训练代码样本包括通用代码样本和目标业务场景对应的场景代码样本；Obtain a pre-training data set corresponding to the target business scenario; the pre-training code samples of the pre-training data set include general code samples and scenario code samples corresponding to the target business scenario;

以通用大语言模型为基座模型，将预训练数据集输入通用大语言模型，进行二次预训练，得到第一预测代码样本，基于预训练代码样本和第一预测代码样本之间的差异，调整通用大语言模型，得到目标业务场景对应的初始代码生成模型；Taking the general large language model as the base model, the pre-training data set is input into the general large language model for secondary pre-training to obtain the first predicted code sample. Based on the difference between the pre-trained code sample and the first predicted code sample, the general large language model is adjusted to obtain the initial code generation model corresponding to the target business scenario.

获取目标业务场景对应的有监督训练集；有监督训练集包括指令训练样本和指令训练样本对应的标签代码样本；Obtain a supervised training set corresponding to the target business scenario; the supervised training set includes instruction training samples and label code samples corresponding to the instruction training samples;

将指令训练样本输入初始代码生成模型，得到第二预测代码样本，基于标签代码样本和第二预测代码样本之间的差异，调整初始代码生成模型，得到目标业务场景对应的目标代码生成模型。The instruction training sample is input into the initial code generation model to obtain a second predicted code sample. Based on the difference between the label code sample and the second predicted code sample, the initial code generation model is adjusted to obtain a target code generation model corresponding to the target business scenario.

本申请提供了一种代码生成方法。所述方法包括：The present application provides a code generation method. The method comprises:

获取目标业务场景对应的待预测指令；Obtain the instructions to be predicted corresponding to the target business scenario;

将待预测指令输入目标业务场景对应的目标代码生成模型，得到待预测指令对应的目标预测代码样本；Input the instruction to be predicted into the target code generation model corresponding to the target business scenario to obtain a target prediction code sample corresponding to the instruction to be predicted;

其中，目标代码生成模型的训练过程为：Among them, the training process of the target code generation model is:

本申请还提供了一种代码生成模型训练装置。所述装置包括：The present application also provides a code generation model training device. The device comprises:

第一获取模块，用于获取目标业务场景对应的预训练数据集；预训练数据集的预训练代码样本包括通用代码样本和目标业务场景对应的场景代码样本；The first acquisition module is used to acquire a pre-training data set corresponding to the target business scenario; the pre-training code samples of the pre-training data set include general code samples and scenario code samples corresponding to the target business scenario;

第一训练模块，用于以通用大语言模型为基座模型，将预训练数据集输入通用大语言模型，进行二次预训练，得到第一预测代码样本，基于预训练代码样本和第一预测代码样本之间的差异，调整通用大语言模型，得到目标业务场景对应的初始代码生成模型；The first training module is used to use the general large language model as the base model, input the pre-training data set into the general large language model, perform secondary pre-training, obtain the first predicted code sample, and adjust the general large language model based on the difference between the pre-training code sample and the first predicted code sample to obtain an initial code generation model corresponding to the target business scenario;

第二获取模块，用于获取目标业务场景对应的有监督训练集；有监督训练集包括指令训练样本和指令训练样本对应的标签代码样本；The second acquisition module is used to acquire a supervised training set corresponding to the target business scenario; the supervised training set includes instruction training samples and label code samples corresponding to the instruction training samples;

第二训练模块，用于将指令训练样本输入初始代码生成模型，得到第二预测代码样本，基于标签代码样本和第二预测代码样本之间的差异，调整初始代码生成模型，得到目标业务场景对应的目标代码生成模型。The second training module is used to input the instruction training sample into the initial code generation model to obtain the second predicted code sample, and adjust the initial code generation model based on the difference between the label code sample and the second predicted code sample to obtain the target code generation model corresponding to the target business scenario.

本申请还提供了一种代码生成装置。所述装置包括：The present application also provides a code generation device. The device comprises:

指令获取模块，用于获取目标业务场景对应的待预测指令；An instruction acquisition module is used to acquire instructions to be predicted corresponding to the target business scenario;

预测模块，用于将待预测指令输入目标业务场景对应的目标代码生成模型，得到待预测指令对应的目标预测代码样本；其中，目标代码生成模型的训练过程为：获取目标业务场景对应的预训练数据集；预训练数据集的预训练代码样本包括通用代码样本和目标业务场景对应的场景代码样本；以通用大语言模型为基座模型，将预训练数据集输入通用大语言模型，进行二次预训练，得到第一预测代码样本，基于预训练代码样本和第一预测代码样本之间的差异，调整通用大语言模型，得到目标业务场景对应的初始代码生成模型；获取目标业务场景对应的有监督训练集；有监督训练集包括指令训练样本和指令训练样本对应的标签代码样本；将指令训练样本输入初始代码生成模型，得到第二预测代码样本，基于标签代码样本和第二预测代码样本之间的差异，调整初始代码生成模型，得到目标业务场景对应的目标代码生成模型。A prediction module is used to input the instruction to be predicted into a target code generation model corresponding to the target business scenario, and obtain a target prediction code sample corresponding to the instruction to be predicted; wherein the training process of the target code generation model is as follows: obtaining a pre-training data set corresponding to the target business scenario; the pre-training code samples of the pre-training data set include general code samples and scenario code samples corresponding to the target business scenario; taking a general large language model as a base model, inputting the pre-training data set into the general large language model, performing secondary pre-training, obtaining a first prediction code sample, adjusting the general large language model based on the difference between the pre-training code sample and the first prediction code sample, and obtaining an initial code generation model corresponding to the target business scenario; obtaining a supervised training set corresponding to the target business scenario; the supervised training set includes an instruction training sample and a label code sample corresponding to the instruction training sample; inputting the instruction training sample into the initial code generation model, obtaining a second prediction code sample, adjusting the initial code generation model based on the difference between the label code sample and the second prediction code sample, and obtaining a target code generation model corresponding to the target business scenario.

一种计算机设备，包括存储器和处理器，存储器存储有计算机程序，处理器执行计算机程序时实现上述代码生成模型训练方法和代码生成方法的步骤。A computer device includes a memory and a processor, wherein the memory stores a computer program, and the processor implements the steps of the above-mentioned code generation model training method and code generation method when executing the computer program.

一种计算机可读存储介质，其上存储有计算机程序，计算机程序被处理器执行时实现上述代码生成模型训练方法和代码生成方法的步骤。A computer-readable storage medium stores a computer program, which implements the steps of the above-mentioned code generation model training method and code generation method when executed by a processor.

一种计算机程序产品，包括计算机程序，计算机程序被处理器执行时实现上述代码生成模型训练方法和代码生成方法的步骤。A computer program product includes a computer program, which implements the steps of the above-mentioned code generation model training method and code generation method when executed by a processor.

上述代码生成模型训练方法、装置、计算机设备、存储介质和计算机程序产品，通过获取目标业务场景对应的预训练数据集，预训练数据集的预训练代码样本包括通用代码样本和目标业务场景对应的场景代码样本。以通用大语言模型为基座模型，将预训练数据集输入通用大语言模型，进行二次预训练，得到第一预测代码样本，基于预训练代码样本和第一预测代码样本之间的差异，调整通用大语言模型，得到目标业务场景对应的初始代码生成模型。进而获取目标业务场景对应的有监督训练集，有监督训练集包括指令训练样本和对应的标签代码样本。基于有监督训练集对初始代码生成模型进行有监督训练，得到目标业务场景对应的目标代码生成模型。这样，基于包含通用代码样本和目标业务场景对应的场景代码样本的预训练数据集，对通用大语言模型进行训练，大大提升了代码生成模型针对垂直领域，即目标业务场景的代码生成的准确性，此外，将大语言模型的生成能力与代码开发融合，能够提高代码生成的效率。The above-mentioned code generation model training method, device, computer equipment, storage medium and computer program product obtain a pre-training data set corresponding to the target business scenario, and the pre-training code samples of the pre-training data set include general code samples and scene code samples corresponding to the target business scenario. With the general large language model as the base model, the pre-training data set is input into the general large language model for secondary pre-training to obtain a first predicted code sample. Based on the difference between the pre-training code sample and the first predicted code sample, the general large language model is adjusted to obtain an initial code generation model corresponding to the target business scenario. Then, a supervised training set corresponding to the target business scenario is obtained, and the supervised training set includes instruction training samples and corresponding label code samples. The initial code generation model is supervised trained based on the supervised training set to obtain a target code generation model corresponding to the target business scenario. In this way, based on the pre-training data set containing general code samples and scene code samples corresponding to the target business scenario, the general large language model is trained, which greatly improves the accuracy of the code generation model for the vertical field, that is, the target business scenario. In addition, the generation capability of the large language model is integrated with code development to improve the efficiency of code generation.

附图说明BRIEF DESCRIPTION OF THE DRAWINGS

图1为一个实施例中代码生成模型训练方法的应用环境图；FIG1 is an application environment diagram of a code generation model training method in one embodiment;

图2为一个实施例中代码生成模型训练方法的流程示意图；FIG2 is a schematic diagram of a flow chart of a code generation model training method in one embodiment;

图3为一个实施例中获取有监督训练集的流程示意图；FIG3 is a schematic diagram of a process for obtaining a supervised training set in one embodiment;

图4为一个实施例中代码生成方法的流程示意图；FIG4 is a schematic diagram of a flow chart of a code generation method in one embodiment;

图5为一个实施例中代码生成模型训练装置的结构框图；FIG5 is a block diagram of a code generation model training device in one embodiment;

图6为一个实施例中代码生成装置的结构框图；FIG6 is a block diagram of a code generating device according to an embodiment;

图7为一个实施例中计算机设备的内部结构图；FIG7 is a diagram showing the internal structure of a computer device in one embodiment;

图8为另一个实施例中计算机设备的内部结构图。FIG. 8 is a diagram showing the internal structure of a computer device in another embodiment.

具体实施方式Detailed ways

为了使本申请的目的、技术方案及优点更加清楚明白，以下结合附图及实施例，对本申请进行进一步详细说明。应当理解，此处描述的具体实施例仅仅用以解释本申请，并不用于限定本申请。In order to make the purpose, technical solution and advantages of the present application more clearly understood, the present application is further described in detail below in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are only used to explain the present application and are not used to limit the present application.

本申请实施例提供的代码生成模型训练方法，可以应用于如图1所示的应用环境中。其中，终端102通过网络与服务器104进行通信。数据存储系统可以存储服务器104需要处理的数据。数据存储系统可以集成在服务器104上，也可以放在云上或其他网络服务器上。终端102可以但不限于是各种个人计算机、笔记本电脑、智能手机、平板电脑、物联网设备和便携式可穿戴设备，物联网设备可为智能电视、智能车载设备等。便携式可穿戴设备可为智能手表、智能手环、头戴设备等。服务器104可以用独立的服务器或者是多个服务器组成的服务器集群来实现。终端102以及服务器104可以通过有线或无线通信方式进行直接或间接地连接，本申请在此不做限制。The code generation model training method provided in the embodiment of the present application can be applied to the application environment shown in Figure 1. Among them, the terminal 102 communicates with the server 104 through the network. The data storage system can store the data that the server 104 needs to process. The data storage system can be integrated on the server 104, or it can be placed on the cloud or other network servers. The terminal 102 can be, but is not limited to, various personal computers, laptops, smart phones, tablet computers, Internet of Things devices and portable wearable devices. The Internet of Things devices can be smart TVs, smart car-mounted devices, etc. Portable wearable devices can be smart watches, smart bracelets, head-mounted devices, etc. The server 104 can be implemented with an independent server or a server cluster consisting of multiple servers. The terminal 102 and the server 104 can be directly or indirectly connected via wired or wireless communication, and this application is not limited here.

终端和服务器均可单独用于执行本申请实施例中提供的代码生成模型训练方法。Both the terminal and the server can be used independently to execute the code generation model training method provided in the embodiments of the present application.

例如，服务器获取目标业务场景对应的预训练数据集，预训练数据集的代码样本包括通用代码样本和目标业务场景对应的场景代码样本。服务器以通用大语言模型为基座模型，将预训练数据集输入通用大语言模型，进行二次预训练，得到第一预测代码样本，基于预训练代码样本和第一预测代码样本之间的差异，调整通用大语言模型，得到目标业务场景对应的初始代码生成模型。终端获取目标业务场景对应的有监督训练集，有监督训练集包括指令训练样本和指令训练样本对应的标签代码样本。服务器将指令训练样本输入初始代码生成模型，得到第二预测代码样本，基于标签代码样本和第二预测代码样本之间的差异，调整初始代码生成模型，得到目标业务场景对应的目标代码生成模型。For example, the server obtains a pre-training data set corresponding to the target business scenario, and the code samples of the pre-training data set include general code samples and scenario code samples corresponding to the target business scenario. The server uses a general large language model as the base model, inputs the pre-training data set into the general large language model, performs secondary pre-training, obtains a first predicted code sample, and adjusts the general large language model based on the difference between the pre-training code sample and the first predicted code sample to obtain an initial code generation model corresponding to the target business scenario. The terminal obtains a supervised training set corresponding to the target business scenario, and the supervised training set includes an instruction training sample and a labeled code sample corresponding to the instruction training sample. The server inputs the instruction training sample into the initial code generation model to obtain a second predicted code sample, and adjusts the initial code generation model based on the difference between the labeled code sample and the second predicted code sample to obtain a target code generation model corresponding to the target business scenario.

终端和服务器也可协同用于执行本申请实施例中提供的代码生成模型训练方法。The terminal and the server can also be used together to execute the code generation model training method provided in the embodiments of the present application.

在一个实施例中，如图2所示，提供了一种代码生成模型训练方法，以该方法应用于计算机设备为例进行说明，计算机设备可以是终端或服务器，由终端或服务器自身单独执行，也可以通过终端和服务器之间的交互来实现。代码生成模型训练方法包括以下步骤：In one embodiment, as shown in FIG2 , a code generation model training method is provided, and the method is applied to a computer device as an example for explanation. The computer device may be a terminal or a server, and the method may be executed by the terminal or the server itself alone, or may be implemented through interaction between the terminal and the server. The code generation model training method includes the following steps:

步骤S202，获取目标业务场景对应的预训练数据集；预训练数据集的预训练代码样本包括通用代码样本和目标业务场景对应的场景代码样本。Step S202, obtaining a pre-training data set corresponding to the target business scenario; the pre-training code samples of the pre-training data set include general code samples and scenario code samples corresponding to the target business scenario.

其中，目标业务场景是指需要训练代码生成模型的业务场景，也就是需要通过训练代码生成模型来进行代码生成的垂直领域，不同业务场景对应的代码开发方案和标准是不同的。例如，由于各个企业在进行产品开发过程中，所使用的代码开发方案和标准（即代码开发环境）是不同的，因此，可以认为每个企业对应的私有开发环境即为一个特定的垂直领域，不同企业对应的私有开发环境即为不同的垂直领域。The target business scenario refers to the business scenario that requires the training of the code generation model, that is, the vertical field that needs to generate code by training the code generation model. The code development solutions and standards corresponding to different business scenarios are different. For example, since the code development solutions and standards (that is, code development environments) used by various companies in the product development process are different, it can be considered that the private development environment corresponding to each company is a specific vertical field, and the private development environments corresponding to different companies are different vertical fields.

预训练数据集是指用于对通用大语言模型进行二次预训练的代码数据集。The pre-training dataset refers to the code dataset used for secondary pre-training of the general large language model.

通用代码样本是指通用领域的代码数据，是不特定于某一行业或应用领域的代码数据，可以在多种不同的业务场景下使用。场景代码样本是指目标业务场景下的代码数据，是特定于某一行业或某一业务领域的代码数据，需要根据具体的业务场景和需求进行开发，具有较强的专业性和针对性。场景代码样本可以包括目标业务场景对应的案例数据和SDK（Software Development Kit，软件开发工具包）。General code samples refer to code data in general fields. They are not specific to a certain industry or application field and can be used in a variety of different business scenarios. Scenario code samples refer to code data in target business scenarios. They are specific to a certain industry or business field and need to be developed based on specific business scenarios and requirements. They are highly professional and targeted. Scenario code samples can include case data and SDK (Software Development Kit) corresponding to the target business scenario.

示例性地，为了提高代码生成效率，可以将大语言模型作为基座模型，在基座模型的基础上增强模型的代码生成能力，得到代码生成模型。将代码生成模型应用在开发者平台工具中，用户通过自然语言描述代码生成需求，将用户输入的自然语言文本（即代码生成指令）输入到代码生成模型中，由代码生成模型输出相应的代码数据，这样能够提高代码生成的效率。计算机设备将大语言模型作为基座模型，通过大量的训练样本对基座模型进行第一次预训练，得到通用大语言模型。第一次预训练所使用的训练样本包括自然语言文本和通用领域的代码数据。这样，通过第一次预训练，能够使得通用大语言模型具备理解和生成自然语言文本、通用代码数据的能力。为了提高模型针对垂直领域，即目标业务场景的代码生成准确性，获取目标业务场景对应的预训练数据集。具体地，基于多个通用代码样本和目标业务场景对应的多个场景代码样本，生成目标业务场景对应的预训练数据集。基于预训练数据集对通用大语言模型进行预训练得到初始代码生成模型。这样，通过对大语言模型进行第二次预训练，能够使得得到的初始代码生成模型具有生成通用代码的能力，同时针对目标业务场景，也具有较强的代码生成能力，提高代码生成模型在目标业务场景对应的私有开发环境下的代码开发准确性和效率。Exemplarily, in order to improve the efficiency of code generation, a large language model can be used as a base model, and the code generation capability of the model can be enhanced on the basis of the base model to obtain a code generation model. The code generation model is applied to the developer platform tool, and the user describes the code generation requirements through natural language, and the natural language text (i.e., code generation instructions) input by the user is input into the code generation model, and the code generation model outputs the corresponding code data, which can improve the efficiency of code generation. The computer device uses the large language model as the base model, and pre-trains the base model for the first time through a large number of training samples to obtain a general large language model. The training samples used in the first pre-training include natural language text and code data in a general field. In this way, through the first pre-training, the general large language model can have the ability to understand and generate natural language text and general code data. In order to improve the accuracy of code generation of the model for the vertical field, that is, the target business scenario, the pre-training data set corresponding to the target business scenario is obtained. Specifically, based on multiple general code samples and multiple scenario code samples corresponding to the target business scenario, a pre-training data set corresponding to the target business scenario is generated. The general large language model is pre-trained based on the pre-training data set to obtain an initial code generation model. In this way, by pre-training the large language model for the second time, the initial code generation model can have the ability to generate general code. At the same time, it also has strong code generation capabilities for target business scenarios, thereby improving the code development accuracy and efficiency of the code generation model in the private development environment corresponding to the target business scenario.

步骤S204，以通用大语言模型为基座模型，将预训练数据集输入通用大语言模型，进行二次预训练，得到第一预测代码样本，基于预训练代码样本和第一预测代码样本之间的差异，调整通用大语言模型，得到目标业务场景对应的初始代码生成模型。Step S204, using the general large language model as the base model, inputting the pre-training data set into the general large language model, performing secondary pre-training, obtaining a first predicted code sample, and adjusting the general large language model based on the difference between the pre-trained code sample and the first predicted code sample to obtain an initial code generation model corresponding to the target business scenario.

其中，通用大语言模型是指经过一次预训练的大语言模型，大语言模型是一种利用深度学习技术进行训练的大型机器学习模型，可以理解和生成自然语言文本。通用大语言模型是基于大量的文本数据对初始的大语言模型进行预训练得到的，用于进行第一次预训练的训练样本包括自然语言文本和通用领域的代码数据。Among them, the general large language model refers to a large language model that has been pre-trained once. The large language model is a large machine learning model trained using deep learning technology that can understand and generate natural language text. The general large language model is obtained by pre-training the initial large language model based on a large amount of text data. The training samples used for the first pre-training include natural language text and code data in general fields.

示例性地，计算机设备以通用大语言模型作为基座模型，将预训练数据集输入通用大语言模型，进行二次预训练，得到各个预训练代码样本分别对应的第一预测代码样本。进而基于预训练代码样本和对应的第一预测代码样本之间的差异，调整通用大语言模型的模型参数，得到目标业务场景对应的初始代码生成模型。Exemplarily, the computer device uses a general large language model as a base model, inputs a pre-training data set into the general large language model, performs secondary pre-training, and obtains first predicted code samples corresponding to each pre-trained code sample. Then, based on the difference between the pre-trained code sample and the corresponding first predicted code sample, the model parameters of the general large language model are adjusted to obtain an initial code generation model corresponding to the target business scenario.

步骤S206，获取目标业务场景对应的有监督训练集；有监督训练集包括指令训练样本和指令训练样本对应的标签代码样本。Step S206, obtaining a supervised training set corresponding to the target business scenario; the supervised training set includes instruction training samples and label code samples corresponding to the instruction training samples.

其中，指令训练样本是指用于描述代码生成需求的自然语言文本。标签代码样本是指符合指令训练样本对应的代码生成需求的代码数据。The instruction training sample refers to a natural language text used to describe the code generation requirements. The label code sample refers to the code data that meets the code generation requirements corresponding to the instruction training sample.

示例性地，计算机设备获取目标业务场景对应的多个指令训练样本和指令训练样本对应的标签代码样本，得到目标业务场景对应的有监督训练集。具体地，可以获取人工编写的目标业务场景对应的指令训练样本，以及人工编写的指令训练样本在目标业务场景下对应的标签代码样本，指令训练样本所对应的标签代码样本也可以是人为基于知识向量库编写的，此外，有监督训练集还可以是基于大语言模型，对少量的已知指令进行数据增强和扩充得到的。Exemplarily, the computer device obtains multiple instruction training samples corresponding to the target business scenario and label code samples corresponding to the instruction training samples, and obtains a supervised training set corresponding to the target business scenario. Specifically, the instruction training samples corresponding to the target business scenario written manually and the label code samples corresponding to the manually written instruction training samples in the target business scenario can be obtained. The label code samples corresponding to the instruction training samples can also be written manually based on the knowledge vector library. In addition, the supervised training set can also be obtained by data enhancement and expansion of a small number of known instructions based on a large language model.

在一个实施例中，预训练数据集和有监督训练集可以是同时获取的。也可以是先获取预训练数据集，基于预训练数据集对通用大语言模型进行二次预训练得到初始代码生成模型后，再获取有监督训练集。In one embodiment, the pre-training data set and the supervised training set may be obtained at the same time. Alternatively, the pre-training data set may be obtained first, and the general large language model is pre-trained twice based on the pre-training data set to obtain an initial code generation model, and then the supervised training set is obtained.

步骤S208，将指令训练样本输入初始代码生成模型，得到第二预测代码样本，基于标签代码样本和第二预测代码样本之间的差异，调整初始代码生成模型，得到目标业务场景对应的目标代码生成模型。Step S208, input the instruction training sample into the initial code generation model to obtain a second predicted code sample, and adjust the initial code generation model based on the difference between the label code sample and the second predicted code sample to obtain a target code generation model corresponding to the target business scenario.

其中，第二预测代码样本是指初始代码生成模型输出的指令训练样本对应的预测代码数据。目标代码生成模型是指训练好的，用于生成目标业务场景对应的代码数据的代码生成模型，目标代码生成模型的输入数据是指令，输出数据是指在目标业务场景下，该指令对应的代码数据。The second predicted code sample refers to the predicted code data corresponding to the instruction training sample output by the initial code generation model. The target code generation model refers to a trained code generation model used to generate code data corresponding to the target business scenario. The input data of the target code generation model is the instruction, and the output data refers to the code data corresponding to the instruction in the target business scenario.

示例性地，计算机设备将有监督训练集中的指令训练样本输入初始代码生成模型，初始代码生成模型输出指令训练样本对应的第二预测代码样本。进而基于指令训练样本对应的标签代码样本和第二预测代码样本之间的差异，计算模型损失，基于模型损失，对初始代码生成模型中的模型参数进行调整，直至模型收敛，得到目标业务场景对应的目标代码生成模型。Exemplarily, the computer device inputs the instruction training sample in the supervised training set into the initial code generation model, and the initial code generation model outputs the second predicted code sample corresponding to the instruction training sample. Then, based on the difference between the label code sample corresponding to the instruction training sample and the second predicted code sample, the model loss is calculated, and based on the model loss, the model parameters in the initial code generation model are adjusted until the model converges to obtain the target code generation model corresponding to the target business scenario.

上述代码生成模型方法中，通过获取目标业务场景对应的预训练数据集，预训练数据集的预训练代码样本包括通用代码样本和目标业务场景对应的场景代码样本。以通用大语言模型为基座模型，将预训练数据集输入通用大语言模型，进行二次预训练，得到第一预测代码样本，基于预训练代码样本和第一代码样本之间的差异，调整通用大语言模型，得到目标业务场景对应的初始代码生成模型。进而获取目标业务场景对应的有监督训练集，有监督训练集包括指令训练样本和对应的标签代码样本。基于有监督训练集对初始代码生成模型进行有监督训练，得到目标业务场景对应的目标代码生成模型。这样，基于包含通用代码样本和目标业务场景对应的场景代码样本的预训练数据集，对通用大语言模型进行训练，大大提升了代码生成模型针对垂直领域，即目标业务场景的代码生成的准确性，此外，将大语言模型的生成能力与代码开发融合，能够提高代码生成的效率。In the above code generation model method, by obtaining a pre-training data set corresponding to the target business scenario, the pre-training code samples of the pre-training data set include general code samples and scene code samples corresponding to the target business scenario. With the general large language model as the base model, the pre-training data set is input into the general large language model for secondary pre-training to obtain a first predicted code sample. Based on the difference between the pre-training code sample and the first code sample, the general large language model is adjusted to obtain an initial code generation model corresponding to the target business scenario. Then, a supervised training set corresponding to the target business scenario is obtained, and the supervised training set includes instruction training samples and corresponding label code samples. The initial code generation model is supervised trained based on the supervised training set to obtain a target code generation model corresponding to the target business scenario. In this way, based on the pre-training data set containing general code samples and scene code samples corresponding to the target business scenario, the general large language model is trained, which greatly improves the accuracy of the code generation model for the vertical field, that is, the target business scenario. In addition, the generation capability of the large language model is integrated with code development to improve the efficiency of code generation.

在一个实施例中，获取目标业务场景对应的预训练数据集，包括：In one embodiment, obtaining a pre-training data set corresponding to a target business scenario includes:

获取通用代码样本集和目标业务场景对应的场景代码样本集；Obtain a general code sample set and a scenario code sample set corresponding to the target business scenario;

基于样本组合比例，对通用代码样本集中的通用代码样本和场景代码样本集中的场景代码样本进行组合，得到目标业务场景对应的预训练数据集；预训练数据集中的通用代码样本的数量大于场景代码样本的数量。Based on the sample combination ratio, the general code samples in the general code sample set and the scenario code samples in the scenario code sample set are combined to obtain a pre-training data set corresponding to the target business scenario; the number of general code samples in the pre-training data set is greater than the number of scenario code samples.

其中，样本组合比例是指预训练数据集中通用代码样本和场景代码样本分别所占的比例。在通过预训练来增强模型针对垂直领域，即目标业务场景的代码生成能力的同时，仍然需要保证模型针对通用代码的生成能力，避免过拟合，导致模型原有的通用代码生成能力退化，因此在确定样本组合比例时，通用代码样本所占的比例大于场景代码样本所占的比例。The sample combination ratio refers to the proportion of general code samples and scenario code samples in the pre-training data set. While enhancing the model's code generation capabilities for vertical fields, i.e., target business scenarios, through pre-training, it is still necessary to ensure the model's ability to generate general code to avoid overfitting, which will lead to the degradation of the model's original general code generation capabilities. Therefore, when determining the sample combination ratio, the proportion of general code samples is greater than the proportion of scenario code samples.

示例性地，计算机设备获取通用代码集和目标业务场景对应的场景代码样本集。进而获取目标业务场景对应的样本组合比例，基于样本组合比例，分别从通用代码样本集和场景代码样本集中获取相应数量的代码样本，得到目标业务场景对应的预训练数据集，使得预训练数据集中通用代码样本和场景代码样本分别所占的比例与样本组合比例一致。Exemplarily, the computer device obtains a general code set and a scenario code sample set corresponding to the target business scenario. Then, the sample combination ratio corresponding to the target business scenario is obtained, and based on the sample combination ratio, a corresponding number of code samples are obtained from the general code sample set and the scenario code sample set, respectively, to obtain a pre-training data set corresponding to the target business scenario, so that the proportions of the general code samples and the scenario code samples in the pre-training data set are consistent with the sample combination ratio.

在一个实施例中，通过多种评估指标，分别评估通用大语言模型的模型训练效果，融合通用大语言模型在各个评估指标上分别对应的质量分数，得到通用大语言模型对应的综合质量分数。例如，评估指标可以为困惑度、召回率、准确率等。基于综合质量分数，确定样本组合比例。具体地，综合质量分数和样本组合比例中通用代码样本所占的比例呈负相关，和样本组合比例中场景代码样本所占的比例呈正相关，并且通用代码样本所占的比例大于场景代码样本所占的比例。这样，在通用大语言模型针对通用代码的生成能力越强时，在对通用大语言模型进行预训练时，相应地减少预训练数据集中通用代码样本所占的比例，增加预训练数据集中场景代码样本所占的比例，这样能够在保证模型针对通用代码的生成能力的同时，最大程度地增强模型针对目标业务场景的代码生成能力，提高训练得到的初始代码生成模型的预测准确性。In one embodiment, the model training effect of the general large language model is evaluated respectively by a variety of evaluation indicators, and the quality scores corresponding to the general large language model on each evaluation indicator are integrated to obtain the comprehensive quality score corresponding to the general large language model. For example, the evaluation indicators may be perplexity, recall rate, accuracy, etc. Based on the comprehensive quality score, the sample combination ratio is determined. Specifically, the comprehensive quality score is negatively correlated with the proportion of general code samples in the sample combination ratio, and positively correlated with the proportion of scene code samples in the sample combination ratio, and the proportion of general code samples is greater than the proportion of scene code samples. In this way, when the general large language model has a stronger ability to generate general code, when the general large language model is pre-trained, the proportion of general code samples in the pre-training data set is correspondingly reduced, and the proportion of scene code samples in the pre-training data set is increased. In this way, while ensuring the generation ability of the model for general code, the model's code generation ability for the target business scenario can be enhanced to the greatest extent, and the prediction accuracy of the initial code generation model obtained by training can be improved.

在一个实施例中，基于多个通用代码样本，提取通用代码样本对应的通用代码特征信息，基于目标业务场景对应的多个场景代码样本，提取场景代码样本对应的场景代码特征信息。比对通用代码特征信息和场景代码特征信息，得到通用代码样本与场景代码样本之间的特征相似度。基于特征相似度，确定目标业务场景对应的样本组合比例。特征相似度与样本组合比例中和通用代码样本所占的比例呈正相关，和样本组合比例中场景代码样本所占的比例呈负相关，并且通用代码样本所占的比例大于场景代码样本所占的比例。可以理解，在目标业务场景对应的场景代码样本与通用代码样本之间的相似度越低时，需要增加场景代码样本在预训练数据集中所占的比例，能够提高模型针对目标业务场景的代码生成能力。这样，通过特征相似度来确定样本组合比例，能够提高模型预训练的灵活性，使得模型在面对不同的业务场景时，针对业务场景的代码生成能力都能够得到相应的提升。In one embodiment, based on multiple general code samples, the general code feature information corresponding to the general code sample is extracted, and based on multiple scene code samples corresponding to the target business scenario, the scene code feature information corresponding to the scene code sample is extracted. The general code feature information and the scene code feature information are compared to obtain the feature similarity between the general code sample and the scene code sample. Based on the feature similarity, the sample combination ratio corresponding to the target business scenario is determined. The feature similarity is positively correlated with the proportion of the general code sample in the sample combination ratio, and negatively correlated with the proportion of the scene code sample in the sample combination ratio, and the proportion of the general code sample is greater than the proportion of the scene code sample. It can be understood that when the similarity between the scene code sample corresponding to the target business scenario and the general code sample is lower, it is necessary to increase the proportion of the scene code sample in the pre-training data set, which can improve the model's code generation ability for the target business scenario. In this way, by determining the sample combination ratio by feature similarity, the flexibility of the model pre-training can be improved, so that the model can be faced with different business scenarios. The code generation ability for the business scenario can be improved accordingly.

上述实施例中，根据样本组合比例对通用代码样本和场景代码样本进行组合，得到目标业务场景对应的预训练数据集。这样，融合通用代码样本和私有代码样本对通用大语言模型进行预训练，能够使得训练得到的初始代码生成模型，在具备生成通用代码的能力的同时，也具有针对目标业务场景的代码生成能力。并且通用代码样本所占的比例大于场景代码样本所占的比例，能够防止模型出现过拟合，也就是在增强模型针对目标业务场景的代码生成能力的同时，避免模型原有的通用代码生成能力退化，使得模型的代码生成准确性得到提升。In the above embodiment, the general code samples and the scenario code samples are combined according to the sample combination ratio to obtain a pre-trained data set corresponding to the target business scenario. In this way, the general large language model is pre-trained by integrating the general code samples and the private code samples, so that the initial code generation model obtained by training has the ability to generate general code while also having the ability to generate code for the target business scenario. And the proportion of general code samples is greater than the proportion of scenario code samples, which can prevent the model from overfitting, that is, while enhancing the model's code generation ability for the target business scenario, it avoids the degradation of the model's original general code generation ability, so that the model's code generation accuracy is improved.

在一个实施例中，如图3所示，获取目标业务场景对应的有监督训练集，包括：In one embodiment, as shown in FIG3 , obtaining a supervised training set corresponding to a target business scenario includes:

步骤S302，获取目标业务场景对应的多个已知指令和各个已知指令分别对应的标签代码样本，获取目标业务场景对应的任务背景信息。Step S302, obtaining multiple known instructions corresponding to the target business scenario and label code samples corresponding to each known instruction, and obtaining task background information corresponding to the target business scenario.

步骤S304，将已知指令和任务背景信息输入样本扩充模型，得到扩充指令集、以及扩充指令集中各个扩充指令在任务背景信息下分别对应的标签代码样本。Step S304, inputting known instructions and task background information into the sample expansion model to obtain an expanded instruction set and label code samples corresponding to each expanded instruction in the expanded instruction set under the task background information.

步骤S306，基于各个已知指令、各个已知指令分别对应的标签代码样本、各个扩充指令和各个扩充指令分别对应的标签代码样本，得到目标业务场景对应的有监督训练集。Step S306, based on each known instruction, the label code sample corresponding to each known instruction, each extended instruction and the label code sample corresponding to each extended instruction, a supervised training set corresponding to the target business scenario is obtained.

其中，已知指令是指针对目标业务场景的现有的指令，可以是人工编写的指令。扩充指令是指通过样本扩充模型生成的目标业务场景下的新的指令。The known instructions refer to existing instructions for the target business scenario, which may be manually written instructions. The extended instructions refer to new instructions for the target business scenario generated by the sample extended model.

任务背景信息用于指示目标业务场景对应的代码生成规范，也就是指令在目标业务场景下对应的上下文信息，任务背景信息可以包括代码所使用的编程语言或框架、代码的性能要求、内存使用限制、代码风格或格式要求、以及示例代码或模板等。基于任务背景信息，可以使得模型输出的标签代码样本更加准确，更加符合目标业务场景的需求。样本扩充模型是指基于已知指令生成相关的扩充指令集和扩充指令集中各个扩充指令分别对应的标签指令样本的模型。样本扩充模型可以是大语言模型。The task background information is used to indicate the code generation specification corresponding to the target business scenario, that is, the context information corresponding to the instruction in the target business scenario. The task background information may include the programming language or framework used by the code, the performance requirements of the code, memory usage restrictions, code style or format requirements, and sample code or templates. Based on the task background information, the labeled code samples output by the model can be more accurate and more in line with the needs of the target business scenario. The sample expansion model refers to a model that generates related extended instruction sets based on known instructions and labeled instruction samples corresponding to each extended instruction in the extended instruction set. The sample expansion model can be a large language model.

示例性地，计算机设备获取目标业务场景对应的多个已知指令和各个已知指令分别对应的标签代码样本，以及目标业务场景对应的任务背景信息。进而将已知指令和任务背景信息拼接在一起，得到带有提示的输入数据。进而将拼接得到的输入数据输入样本扩充模型，样本扩充模型输出由已知指令相关的多个扩充指令组成的扩充指令集，以及扩充指令集中各个扩充指令在任务背景信息下分别对应的标签代码样本。具体地，在样本扩充模型输出多个扩充指令和各个扩充指令分别对应的标签代码样本时，可以由专业开发人员，对各个扩充指令分别对应的标签代码样本进行修正和完善，得到更加准确的标签代码样本。进而过滤各个已知指令、各个已知指令分别对应的标签代码样本、各个扩充指令和各个扩充指令分别对应的标签代码样本中相似度较高的扩充指令和对应的标签代码样本，得到目标业务场景对应的有监督训练集。Exemplarily, the computer device obtains multiple known instructions corresponding to the target business scenario and label code samples corresponding to each known instruction, as well as the task background information corresponding to the target business scenario. Then, the known instructions and the task background information are spliced together to obtain input data with prompts. Then, the spliced input data is input into the sample expansion model, and the sample expansion model outputs an extended instruction set composed of multiple extended instructions related to the known instructions, and label code samples corresponding to each extended instruction in the extended instruction set under the task background information. Specifically, when the sample expansion model outputs multiple extended instructions and label code samples corresponding to each extended instruction, professional developers can revise and improve the label code samples corresponding to each extended instruction to obtain more accurate label code samples. Then, the extended instructions and corresponding label code samples with higher similarity are filtered out among each known instruction, the label code samples corresponding to each known instruction, each extended instruction and the label code samples corresponding to each extended instruction, to obtain a supervised training set corresponding to the target business scenario.

上述实施例中，通过样本扩充模型，可以快速地基于少量的已知指令和目标业务场景对应的任务背景信息生成大量的扩充指令，以及扩充指令对应的标签代码样本，能够提高有监督训练集的生成效率。In the above embodiment, through the sample expansion model, a large number of expanded instructions and labeled code samples corresponding to the expanded instructions can be quickly generated based on a small number of known instructions and task background information corresponding to the target business scenario, thereby improving the generation efficiency of the supervised training set.

在一个实施例中，基于各个已知指令、各个已知指令分别对应的标签代码样本、各个扩充指令和各个标签代码数据，得到目标业务场景对应的有监督训练集，包括：In one embodiment, based on each known instruction, the label code sample corresponding to each known instruction, each expanded instruction and each label code data, a supervised training set corresponding to the target business scenario is obtained, including:

基于各个扩充指令之间的第一相似度，对各个扩充指令进行过滤操作，得到初始扩充指令集；Based on the first similarity between the extended instructions, filtering the extended instructions to obtain an initial extended instruction set;

将初始扩充指令集中的扩充指令分别作为当前指令，计算当前指令分别与各个已知指令之间的第二相似度，得到各个扩充指令分别与各个已知指令之间的第二相似度；Taking the extended instructions in the initial extended instruction set as the current instructions respectively, calculating the second similarity between the current instructions and each known instruction respectively, and obtaining the second similarity between each extended instruction and each known instruction respectively;

在初始扩充指令集中过滤第二相似度大于预设相似度的扩充指令，得到目标扩充指令集；Filtering the extended instructions whose second similarity is greater than the preset similarity in the initial extended instruction set to obtain a target extended instruction set;

将目标扩充指令集中的各个扩充指令和各个已知指令分别作为指令训练样本，基于各个指令训练样本和各个指令训练样本分别对应的标签代码样本，得到目标业务场景对应的有监督训练集。Each extended instruction and each known instruction in the target extended instruction set is used as an instruction training sample, and based on each instruction training sample and the label code sample corresponding to each instruction training sample, a supervised training set corresponding to the target business scenario is obtained.

其中，第一相似度是指扩充指令之间的相似度。第二相似度是指扩充指令与已知指令之间的相似度。预设相似度可以根据实际需要进行设置。The first similarity refers to the similarity between the extended instructions. The second similarity refers to the similarity between the extended instructions and the known instructions. The preset similarity can be set according to actual needs.

示例性地，计算机设备比对各个扩充指令，得到各个扩充指令之间的第一相似度。具体地，可以将各个扩充指令之间的文本相似度作为第一相似度，可以将各个扩充指令分别对应的标签代码样本之间的文本相似度作为第一相似度，也可以基于扩充指令之间的文本相似度和扩充指令对应的标签代码样本之间的文本相似度，共同确定扩充指令之间的第一相似度。基于各个扩充指令之间的第一相似度，过滤各个扩充指令中第一相似度大于第一预设值的扩充指令，得到初始扩充指令集。Exemplarily, the computer device compares each extended instruction to obtain a first similarity between each extended instruction. Specifically, the text similarity between each extended instruction can be used as the first similarity, the text similarity between the label code samples corresponding to each extended instruction can be used as the first similarity, or the first similarity between the extended instructions can be determined based on the text similarity between the extended instructions and the text similarity between the label code samples corresponding to the extended instructions. Based on the first similarity between each extended instruction, the extended instructions whose first similarity is greater than a first preset value are filtered to obtain an initial extended instruction set.

计算机设备将初始扩充指令集中的扩充指令依次作为当前指令，计算当前指令分别与各个已知指令之间的第二相似度，得到每个扩充指令分别与各个已知指令之间的第二相似度。具体地，可以将当前指令与已知指令之间的文本相似度作为第二相似度。进而，在初始指令集中，过滤与至少一个已知指令之间的第二相似度大于预设相似度的扩充指令，得到目标扩充指令集。将目标扩充指令集中的各个扩充指令和各个已知指令分别作为指令训练样本，基于各个指令训练样本和各个指令训练样本分别对应的标签代码样本，得到目标业务场景对应的有监督训练集。The computer device uses the extended instructions in the initial extended instruction set as the current instruction in turn, calculates the second similarity between the current instruction and each known instruction, and obtains the second similarity between each extended instruction and each known instruction. Specifically, the text similarity between the current instruction and the known instructions can be used as the second similarity. Furthermore, in the initial instruction set, the extended instructions whose second similarity with at least one known instruction is greater than the preset similarity are filtered to obtain the target extended instruction set. Each extended instruction and each known instruction in the target extended instruction set are respectively used as instruction training samples, and based on each instruction training sample and the label code sample corresponding to each instruction training sample, a supervised training set corresponding to the target business scenario is obtained.

上述实施例中，首先基于各个扩充指令之间的相似度，过滤相似度高的扩充指令得到初始扩充指令集，进而将初始扩充指令集中的各个扩充指令与已知指令比对，进一步过滤与已知指令之间的相似度较高的扩充指令，得到目标扩充指令集。这样，能够保证新生成的目标扩充指令集与已知指令集的多样性和差异性，从而提高模型训练的效率。In the above embodiment, based on the similarity between each extended instruction, the extended instructions with high similarity are filtered to obtain the initial extended instruction set, and then each extended instruction in the initial extended instruction set is compared with the known instructions, and the extended instructions with high similarity with the known instructions are further filtered to obtain the target extended instruction set. In this way, the diversity and difference between the newly generated target extended instruction set and the known instruction set can be guaranteed, thereby improving the efficiency of model training.

在一个实施例中，计算当前指令分别与各个已知指令之间的第二相似度，包括：In one embodiment, calculating the second similarity between the current instruction and each known instruction includes:

提取当前指令和已知指令之间的目标公共子序列；Extracting a target common subsequence between the current instruction and the known instructions;

基于目标公共子序列对应的序列长度和已知指令对应的指令长度之间的比值，得到第一子相似度；Obtaining a first sub-similarity based on a ratio between a sequence length corresponding to the target common subsequence and an instruction length corresponding to the known instruction;

基于序列长度和当前指令对应的指令长度之间的比值，得到第二子相似度；Based on the ratio between the sequence length and the instruction length corresponding to the current instruction, a second sub-similarity is obtained;

基于第一相似度确定第一贡献值；第一贡献值与第一子相似度呈负相关；Determine a first contribution value based on the first similarity; the first contribution value is negatively correlated with the first sub-similarity;

基于第二相似度确定第二贡献值；第二贡献值与第二子相似度呈负相关；Determine a second contribution value based on the second similarity; the second contribution value is negatively correlated with the second sub-similarity;

融合第一子贡献值和第二贡献值，得到当前指令与已知指令之间的第二相似度。The first sub-contribution value and the second sub-contribution value are integrated to obtain a second similarity between the current instruction and the known instruction.

其中，目标公共子序列是指当前指令和已知指令之间最长的公共子序列，是指当前指令和已知指令中都存在的子序列，且子序列中的元素在当前指令和已知指令中的相对顺序保持一致，并且元素之间可以是不连续的。序列长度是指序列中元素的数量，对于数字序列，序列长度就是序列中包含的数字的个数，对于字符序列（如字符串），序列长度则是字符串中字符的个数。指令长度是指指令包含的元素的个数。The target common subsequence refers to the longest common subsequence between the current instruction and the known instructions, which refers to the subsequence that exists in both the current instruction and the known instructions, and the relative order of the elements in the subsequence in the current instruction and the known instructions remains consistent, and the elements can be discontinuous. The sequence length refers to the number of elements in the sequence. For a numeric sequence, the sequence length is the number of numbers contained in the sequence. For a character sequence (such as a string), the sequence length is the number of characters in the string. The instruction length refers to the number of elements contained in the instruction.

第一子相似度是指目标公共子序列与已知指令之间的相似度。第二子相似度是指目标公共子序列与当前指令之间的相似度。第一贡献值是指第一子相似度在计算第二相似度所占的比重，用于表征第一子相似度的重要程度。第二贡献值是指第二子相似度在计算第二相似度时所占的比重，用于表征第二子相似度的重要程度。The first sub-similarity refers to the similarity between the target common sub-sequence and the known instruction. The second sub-similarity refers to the similarity between the target common sub-sequence and the current instruction. The first contribution value refers to the proportion of the first sub-similarity in calculating the second similarity, which is used to characterize the importance of the first sub-similarity. The second contribution value refers to the proportion of the second sub-similarity in calculating the second similarity, which is used to characterize the importance of the second sub-similarity.

示例性地，计算机设备提取当前指令与已知指令之间的目标公共子序列。将目标公共子序列对应的序列长度和已知指令对应的指令长度之间的比值作为第一子相似度。将目标公共子序列对应的序列长度与当前指令对应的指令长度之间的比值作为第二子相似度。进而根据第一相似度计算第一贡献值，根据第二相似度计算第二贡献值。具体地，可以将第二预设值与第一子相似度之间的比值分别作为第一贡献值，将第二预设值与第二子贡献值之间的比值作为第二贡献值。进而对第一贡献值和第二贡献值进行加权融合，得到初始相似度。将第三预设值与初始相似度之间的比值作为当前指令与已知指令之间的第二相似度。初始相似度与第二相似度之间呈负相关。Exemplarily, the computer device extracts the target common subsequence between the current instruction and the known instruction. The ratio between the sequence length corresponding to the target common subsequence and the instruction length corresponding to the known instruction is used as the first sub-similarity. The ratio between the sequence length corresponding to the target common subsequence and the instruction length corresponding to the current instruction is used as the second sub-similarity. Then the first contribution value is calculated according to the first similarity, and the second contribution value is calculated according to the second similarity. Specifically, the ratio between the second preset value and the first sub-similarity can be used as the first contribution value, and the ratio between the second preset value and the second sub-contribution value can be used as the second contribution value. Then the first contribution value and the second contribution value are weightedly fused to obtain the initial similarity. The ratio between the third preset value and the initial similarity is used as the second similarity between the current instruction and the known instruction. The initial similarity is negatively correlated with the second similarity.

上述实施例中，由于第一贡献值与第一子相似度呈负相关，第二贡献值与第二子相似度呈负相关，因此，当第一子相似度和第二子相似度中存在极端值（特别是较小的值）时，贡献值能够将极端值对结果的影响增大，也就是当第一子相似度和第二子相似度之间存在显著差异时，通过将相似度转化为贡献值的方式，能够更多地考虑较小数值的影响，使得最终确定的第二相似度更加准确。In the above embodiment, since the first contribution value is negatively correlated with the first sub-similarity and the second contribution value is negatively correlated with the second sub-similarity, when there are extreme values (especially smaller values) in the first sub-similarity and the second sub-similarity, the contribution value can increase the influence of the extreme value on the result. That is, when there is a significant difference between the first sub-similarity and the second sub-similarity, by converting the similarity into a contribution value, the influence of the smaller value can be taken into account more, so that the second similarity finally determined is more accurate.

在一个实施例中，将指令训练样本输入初始代码生成模型，得到第二预测代码样本，基于标签代码样本和第二预测代码样本之间的差异，调整初始代码生成模型，得到目标业务场景对应的目标代码生成模型，包括：In one embodiment, the instruction training sample is input into the initial code generation model to obtain a second predicted code sample, and based on the difference between the label code sample and the second predicted code sample, the initial code generation model is adjusted to obtain a target code generation model corresponding to the target business scenario, including:

将指令训练样本输入初始代码生成模型，得到第二预测代码样本；Input the instruction training sample into the initial code generation model to obtain a second predicted code sample;

基于标签代码样本和第二预测代码样本之间的差异，调整初始代码生成模型的模型参数，直至模型收敛，在不同预设轮次的模型训练任务中，得到目标业务场景对应的多个候选代码生成模型；Based on the difference between the labeled code sample and the second predicted code sample, the model parameters of the initial code generation model are adjusted until the model converges, and multiple candidate code generation models corresponding to the target business scenario are obtained in different preset rounds of model training tasks;

基于目标业务场景对应的测试集，确定各个候选代码生成模型分别对应的质量分数；Based on the test set corresponding to the target business scenario, determine the quality score corresponding to each candidate code generation model;

基于质量分数，在各个候选代码生成模型中，确定目标业务场景对应的目标代码生成模型。Based on the quality scores, a target code generation model corresponding to the target business scenario is determined among the candidate code generation models.

其中，候选代码生成模型是指模型收敛过程中得到的代码生成模型。测试集是指用于测试模型性能的数据，包含模型训练阶段中未出现过的样本数据，用于评估模型的性能指标。质量分数是指代码生成模型的性能评分，用于指示模型的预测准确性。The candidate code generation model refers to the code generation model obtained during the model convergence process. The test set refers to the data used to test the performance of the model, which contains sample data that has not appeared in the model training phase and is used to evaluate the performance indicators of the model. The quality score refers to the performance score of the code generation model, which is used to indicate the prediction accuracy of the model.

示例性地，计算机设备将指令训练样本输入初始代码生成模型，初始代码生成模型输出指令训练样本对应的第二预测代码样本。进而指令训练样本对应的标签代码样本和预测代码样本之间的差异，确定模型损失，基于模型损失对初始代码生成模型的模型参数进行调整，得到更新后的初始代码生成模型。继续基于有监督训练集中的指令训练样本对更新后的初始代码生成模型进行有监督训练，直至模型收敛。在不同预设轮次的模型训练任务中，得到目标业务场景对应的多个候选代码生成模型。例如，对初始代码生成模型进行两轮有监督训练，得到两个候选代码生成模型，可以理解，经过一轮有监督训练得到的代码生成模型为一个候选代码生成模型，经过两轮有监督训练得到的代码生成模型为一个候选代码生成模型。Exemplarily, the computer device inputs the instruction training sample into the initial code generation model, and the initial code generation model outputs the second predicted code sample corresponding to the instruction training sample. Then, the difference between the label code sample and the predicted code sample corresponding to the instruction training sample is used to determine the model loss, and the model parameters of the initial code generation model are adjusted based on the model loss to obtain an updated initial code generation model. Continue to perform supervised training on the updated initial code generation model based on the instruction training samples in the supervised training set until the model converges. In different preset rounds of model training tasks, multiple candidate code generation models corresponding to the target business scenario are obtained. For example, two rounds of supervised training are performed on the initial code generation model to obtain two candidate code generation models. It can be understood that the code generation model obtained after one round of supervised training is a candidate code generation model, and the code generation model obtained after two rounds of supervised training is a candidate code generation model.

计算机设备获取目标业务场景对应的测试集，基于测试集确定各个候选代码生成模型分别对应的质量分数。具体地，将各个候选代码生成模型依次作为当前代码生成模型。将测试集中的各个测试指令样本输入当前代码生成模型，得到各个测试指令样本分别对应的预测代码样本。对各个预测代码样本进行校验，统计通过校验的预测代码样本在所有预测代码样本中所占的比例，作为当前代码生成模型对应的质量分数。将质量分数最大值对应的候选代码生成模型作为目标业务场景对应的目标代码生成模型。The computer device obtains a test set corresponding to the target business scenario, and determines the quality scores corresponding to each candidate code generation model based on the test set. Specifically, each candidate code generation model is used as the current code generation model in turn. Each test instruction sample in the test set is input into the current code generation model to obtain the predicted code samples corresponding to each test instruction sample. Each predicted code sample is verified, and the proportion of the predicted code samples that pass the verification in all predicted code samples is counted as the quality score corresponding to the current code generation model. The candidate code generation model corresponding to the maximum quality score is used as the target code generation model corresponding to the target business scenario.

上述实施例中，通过测试集对各个候选代码生成模型进行评测，得到各个候选代码生成模型分别对应的质量分数，将质量分数最高值对应的候选代码生成模型作为目标代码生成模型，这样能够得到预测准确性最佳的目标代码生成模型，提高代码生成准确性。In the above embodiment, each candidate code generation model is evaluated through a test set to obtain the quality scores corresponding to each candidate code generation model, and the candidate code generation model corresponding to the highest quality score is used as the target code generation model. In this way, the target code generation model with the best prediction accuracy can be obtained, thereby improving the code generation accuracy.

在一个实施例中，代码生成模型训练方法还包括：In one embodiment, the code generation model training method further includes:

将待预测指令输入目标代码生成模型，得到待预测指令对应的第三预测代码样本；Inputting the instruction to be predicted into the target code generation model to obtain a third predicted code sample corresponding to the instruction to be predicted;

基于第三预测代码样本得到待预测指令对应的修正代码样本；待预测指令和待预测指令对应的修正代码样本用于对目标代码生成模型进行强化训练。A corrected code sample corresponding to the instruction to be predicted is obtained based on the third predicted code sample; the instruction to be predicted and the corrected code sample corresponding to the instruction to be predicted are used to perform enhanced training on the target code generation model.

其中，待预测指令是指用户输入的指令，待预测指令可以是用户通过自然语言对代码生成需求进行描述得到的自然语言文本。修正代码样本是指对第三预测样本进行修正得到的正确的代码数据。The instruction to be predicted refers to an instruction input by a user, and the instruction to be predicted may be a natural language text obtained by the user describing the code generation requirement in natural language. The corrected code sample refers to the correct code data obtained by correcting the third prediction sample.

示例性地，计算机设备获取针对目标业务场景的待预测指令。进而将待预测指令输入目标代码生成模型，目标代码生成模型输出待预测指令对应的第三预测代码样本。将第三预测代码样本返回待预测指令对应的发送方。具体地，发送方可以是发送待预测指令的用户终端，用户终端对接收到的第三预测代码样本进行展示，用户可以对第三预测代码样本进行修正得到修正代码样本，并将修正代码样本返回至计算机设备。计算机设备接收并存储发送方返回的修正代码样本，将待预测指令和待预测指令对应的修正代码样本作为有监督训练样本，用于后续对目标代码生成模型进行训练，使得目标代码生成模型能够不断适应业务场景的变化，提高模型的泛化能力。Exemplarily, a computer device obtains an instruction to be predicted for a target business scenario. The instruction to be predicted is then input into a target code generation model, and the target code generation model outputs a third prediction code sample corresponding to the instruction to be predicted. The third prediction code sample is returned to the sender corresponding to the instruction to be predicted. Specifically, the sender may be a user terminal that sends the instruction to be predicted, and the user terminal displays the received third prediction code sample. The user may correct the third prediction code sample to obtain a corrected code sample, and return the corrected code sample to the computer device. The computer device receives and stores the corrected code sample returned by the sender, and uses the instruction to be predicted and the corrected code sample corresponding to the instruction to be predicted as supervised training samples for subsequent training of the target code generation model, so that the target code generation model can continuously adapt to changes in business scenarios and improve the generalization ability of the model.

上述实施例中，在目标代码生成模型的使用过程中，不断采集针对第三预测代码样本的反馈信息，即修正代码样本。将修正代码样本和待预测指令作为新的训练数据，用于后续对目标代码生成模型进行强化训练，这样能够使得目标代码生成模型不断适应业务场景的变化，提高模型的泛化能力。In the above embodiment, during the use of the target code generation model, feedback information for the third predicted code sample, i.e., the corrected code sample, is continuously collected. The corrected code sample and the instruction to be predicted are used as new training data for subsequent intensive training of the target code generation model, so that the target code generation model can continuously adapt to changes in business scenarios and improve the generalization ability of the model.

在一个实施例中，如图4所示，提供了一种代码生成方法，以该方法应用于计算机设备为例进行说明，计算机设备可以是终端或服务器，由终端或服务器自身单独执行，也可以通过终端和服务器之间的交互来实现。代码生成方法包括以下步骤：In one embodiment, as shown in FIG4 , a code generation method is provided, and the method is applied to a computer device as an example for explanation. The computer device may be a terminal or a server, and the method may be executed by the terminal or the server itself alone, or may be implemented through interaction between the terminal and the server. The code generation method includes the following steps:

步骤S402，获取目标业务场景对应的代码生成指令。Step S402: Obtain code generation instructions corresponding to the target business scenario.

步骤S404，将代码生成指令输入目标业务场景对应的目标代码生成模型，得到代码生成指令对应的目标代码。Step S404: input the code generation instruction into the target code generation model corresponding to the target business scenario to obtain the target code corresponding to the code generation instruction.

获取目标业务场景对应的预训练数据集；预训练数据集的预训练代码样本包括通用代码样本和目标业务场景对应的场景代码样本；以通用大语言模型为基座模型，将预训练数据集输入通用大语言模型，进行二次预训练，得到第一预测代码样本，基于预训练代码样本和第一预测代码样本之间的差异，调整通用大语言模型，得到目标业务场景对应的初始代码生成模型；获取目标业务场景对应的有监督训练集；有监督训练集包括指令训练样本和指令训练样本对应的标签代码样本；将指令训练样本输入初始代码生成模型，得到第二预测代码样本，基于标签代码样本和第二预测代码样本之间的差异，调整初始代码生成模型，得到目标业务场景对应的目标代码生成模型。A pre-training data set corresponding to a target business scenario is obtained; the pre-training code samples of the pre-training data set include general code samples and scenario code samples corresponding to the target business scenario; a general large language model is used as a base model, the pre-training data set is input into the general large language model, secondary pre-training is performed to obtain a first predicted code sample, based on the difference between the pre-training code sample and the first predicted code sample, the general large language model is adjusted to obtain an initial code generation model corresponding to the target business scenario; a supervised training set corresponding to the target business scenario is obtained; the supervised training set includes an instruction training sample and a label code sample corresponding to the instruction training sample; the instruction training sample is input into the initial code generation model to obtain a second predicted code sample, based on the difference between the label code sample and the second predicted code sample, the initial code generation model is adjusted to obtain a target code generation model corresponding to the target business scenario.

其中，代码生成指令是指用户输入的指令，可以是用户通过自然语言对代码生成需求进行描述得到的自然语言文本。目标代码是指目标代码生成模型输出的代码生成指令对应的预测代码数据。示例性地，计算机设备获取针对目标业务场景的代码生成指令。进而将代码生成指令输入目标代码生成模型，目标代码生成模型输出代码生成指令对应的目标代码。The code generation instruction refers to the instruction input by the user, which can be a natural language text obtained by the user describing the code generation requirements in natural language. The target code refers to the predicted code data corresponding to the code generation instruction output by the target code generation model. Exemplarily, the computer device obtains the code generation instruction for the target business scenario. Then, the code generation instruction is input into the target code generation model, and the target code generation model outputs the target code corresponding to the code generation instruction.

在获取针对目标业务场景对应的代码生成指令之前，计算机设备获取基于多个通用代码样本和目标业务场景对应的多个场景代码样本，生成目标业务场景对应的预训练数据集。以通用大语言模型作为基座模型，将预训练数据集输入通用大语言模型，进行二次预训练，得到各个预训练代码样本分别对应的第一预测代码样本。进而基于预训练代码样本和对应的第一预测代码样本之间的差异，调整通用大语言模型的模型参数，得到目标业务场景对应的初始代码生成模型。获取目标业务场景对应的多个指令训练样本和指令训练样本对应的标签代码样本，得到目标业务场景对应的有监督训练集。将有监督训练集中的指令训练样本输入初始代码生成模型，初始代码生成模型输出指令训练样本对应的第二预测代码样本。进而基于指令训练样本对应的标签代码样本和第二预测代码样本之间的差异，计算模型损失，基于模型损失，对初始代码生成模型中的模型参数进行调整，直至模型收敛，得到目标业务场景对应的目标代码生成模型。Before obtaining the code generation instruction corresponding to the target business scenario, the computer device obtains multiple scenario code samples based on multiple general code samples and the target business scenario, and generates a pre-training data set corresponding to the target business scenario. Using the general large language model as the base model, the pre-training data set is input into the general large language model, and secondary pre-training is performed to obtain the first predicted code samples corresponding to each pre-training code sample. Then, based on the difference between the pre-training code sample and the corresponding first predicted code sample, the model parameters of the general large language model are adjusted to obtain the initial code generation model corresponding to the target business scenario. Multiple instruction training samples corresponding to the target business scenario and label code samples corresponding to the instruction training samples are obtained to obtain a supervised training set corresponding to the target business scenario. The instruction training samples in the supervised training set are input into the initial code generation model, and the initial code generation model outputs the second predicted code sample corresponding to the instruction training sample. Then, based on the difference between the label code sample corresponding to the instruction training sample and the second predicted code sample, the model loss is calculated, and based on the model loss, the model parameters in the initial code generation model are adjusted until the model converges to obtain the target code generation model corresponding to the target business scenario.

上述代码生成方法中，基于包含通用代码样本和目标业务场景对应的场景代码样本的预训练数据集，对通用大语言模型进行训练，得到目标业务场景对应的目标代码生成模型。这样，能够有效提升目标业务场景对应的目标代码生成模型针对垂直领域，即目标业务场景的代码生成的准确性。此外，将大语言模型的生成能力与代码开发融合，能够提高代码生成的效率。In the above code generation method, based on a pre-trained data set containing general code samples and scenario code samples corresponding to the target business scenario, the general large language model is trained to obtain a target code generation model corresponding to the target business scenario. In this way, the accuracy of code generation of the target code generation model corresponding to the target business scenario for the vertical field, that is, the target business scenario, can be effectively improved. In addition, the integration of the generation capability of the large language model with code development can improve the efficiency of code generation.

在一个具体的实施例中，本申请提出的代码生成模型训练方法可以应用于开发者平台。代码生成模型训练方法包括以下步骤：In a specific embodiment, the code generation model training method proposed in this application can be applied to a developer platform. The code generation model training method includes the following steps:

1、二次预训练1. Secondary pre-training

开发者平台将大语言模型作为基座模型，首先通过大量的训练样本对基座模型进行第一次预训练，得到通用大语言模型。第一次预训练所使用的训练样本包括自然语言文本和通用领域的代码数据。这样，通过第一次预训练，能够使得通用大语言模型具备理解和生成自然语言文本、通用代码数据的能力。为了提高模型针对私有代码领域，即目标业务场景的代码生成准确性，获取私有业务领域对应的二次预训练集（即预训练数据集）。具体地，为了避免由于二次预训练导致通用大语言模型出现过拟合，导致模型原有的通用能力退化，按照预设配比，对通用代码数据和私有业务领域对应的领域代码数据进行混合，得到二次预训练集。例如，领域代码数据和通用代码数据按照15:85的比例进行混合，得到二次预训练集。基于二次预训练集对通用大语言模型进行无监督训练，得到二次预训练后的基础大模型。The developer platform uses the large language model as the base model. First, the base model is pre-trained for the first time through a large number of training samples to obtain a general large language model. The training samples used in the first pre-training include natural language text and code data in the general field. In this way, through the first pre-training, the general large language model can have the ability to understand and generate natural language text and general code data. In order to improve the accuracy of the model's code generation for the private code field, that is, the target business scenario, a secondary pre-training set (that is, a pre-training data set) corresponding to the private business field is obtained. Specifically, in order to avoid overfitting of the general large language model due to the secondary pre-training, resulting in the degradation of the original general ability of the model, the general code data and the domain code data corresponding to the private business field are mixed according to the preset ratio to obtain a secondary pre-training set. For example, the domain code data and the general code data are mixed in a ratio of 15:85 to obtain a secondary pre-training set. The general large language model is unsupervisedly trained based on the secondary pre-training set to obtain a basic large model after secondary pre-training.

2、代码数据生成和增广2. Code data generation and augmentation

开发者平台获取少量人工编写的已知指令。将已知指令输入通过基于大语言模型训练得到的样本扩充模型，得到初始扩充指令集。为了保证指令的多样性和差异性，首先去除完全相同的扩充指令、指令长度小于预设长度的扩充指令，基于剩余的扩充指令和原有的各个已知指令得到中间扩充指令集。进而计算目标扩充指令集中的各个扩充指令分别与已知指令之间的相似度，过滤相似度大于或等于预设阈值的扩充指令，得到目标扩充指令集。例如，将可以通过提取扩充指令和已知指令之间的公共子序列来计算指令之间的相似度。基于目标扩充指令集、各个已知指令、扩充指令对应的标签代码数据和已知指令对应的标签代码数据，得到有监督训练集。The developer platform obtains a small number of known instructions written manually. The known instructions are input into the sample expansion model obtained by training based on the large language model to obtain the initial expansion instruction set. In order to ensure the diversity and difference of the instructions, the expansion instructions that are exactly the same and the expansion instructions whose instruction length is less than the preset length are first removed, and the intermediate expansion instruction set is obtained based on the remaining expansion instructions and the original known instructions. Then, the similarity between each expansion instruction in the target expansion instruction set and the known instructions is calculated, and the expansion instructions whose similarity is greater than or equal to the preset threshold are filtered to obtain the target expansion instruction set. For example, the similarity between the instructions can be calculated by extracting the common subsequence between the expansion instruction and the known instructions. Based on the target expansion instruction set, each known instruction, the label code data corresponding to the expansion instruction, and the label code data corresponding to the known instruction, a supervised training set is obtained.

3、有监督的全参数训练3. Supervised full parameter training

开发者平台基于有监督训练集，对基础大模型进行全参数训练，全参数训练相比于微调方法具有更好的模型训练效果，直至模型收敛，得到多个候选代码生成模型。The developer platform performs full-parameter training on the basic large model based on the supervised training set. Compared with the fine-tuning method, the full-parameter training has a better model training effect until the model converges and obtains multiple candidate code generation models.

4、代码评测4. Code review

开发者平台通过私有业务领域对应的特定的案例数据集来构造评测集。通过评测集，来测试各个候选代码生成模型分别对应的质量分数。具体地，将评测集中的各个指令样本输入候选代码生成模型，统计候选代码生成模型输出的各个预测代码数据中，测试通过的预测代码数据的比例作为候选代码生成模型对应的质量分数。将质量分数最大值对应的候选代码生成模型作为私有业务领域对应的目标代码生成模型。The developer platform constructs an evaluation set through a specific case data set corresponding to the private business domain. The evaluation set is used to test the quality scores of each candidate code generation model. Specifically, each instruction sample in the evaluation set is input into the candidate code generation model, and the proportion of each predicted code data output by the candidate code generation model that passes the test is counted as the quality score corresponding to the candidate code generation model. The candidate code generation model corresponding to the maximum quality score is used as the target code generation model corresponding to the private business domain.

5、代码预测5. Code prediction

开发者平台获取待预测指令，并将待预测指令输入目标代码生成模型，目标代码生成模型输出待预测指令对应的预测代码数据，并展示待预测代码数据。用户可以选择是否接受或不接受生成的代码数据，也可以对待预测代码数据进行修正，将修正后的修正代码数据返回开发者平台，以便后续基于修正代码数据，对目标代码生成模型进行迭代训练，得到更加符合用户需求的代码生成模型。The developer platform obtains the instructions to be predicted and inputs them into the target code generation model. The target code generation model outputs the predicted code data corresponding to the instructions to be predicted and displays the predicted code data. The user can choose whether to accept or reject the generated code data, or can modify the predicted code data and return the modified code data to the developer platform so that the target code generation model can be iteratively trained based on the modified code data to obtain a code generation model that better meets user needs.

上述实施例中，提出了基于大模型训练的代码生成模型的方法，充分利用了大语言模型的生成能力。模型训练阶段采用两次无监督的预训练、有监督训练、人类反馈强化学习（RLHF，Reinforcement Learning from Human Feedback）等方法，能够提高代码生成模型的预测准确性。并且，在进行二次预训练时，将通用代码数据和私有代码数据进行混合，不仅能够避免大模型的训练出现过拟合，导致模型原有的通用能力退化，还能够增强模型针对私有业务领域的代码生成能力，提高模型预测的准确性。可以理解，上述方法不仅可以解决通用代码的生成能力，同时在此基础上大大提升了垂直领域的业务开发或二次开发的私域编程语言的代码生成能力，能够将大模型的生成能力与代码开发更好的融合，提升实际生产开发的效率。In the above embodiment, a method for a code generation model based on large model training is proposed, which makes full use of the generation capability of a large language model. The model training stage adopts two unsupervised pre-training, supervised training, human feedback reinforcement learning (RLHF, Reinforcement Learning from Human Feedback) and other methods to improve the prediction accuracy of the code generation model. In addition, when performing secondary pre-training, the general code data and the private code data are mixed, which can not only avoid overfitting in the training of the large model, resulting in the degradation of the original general ability of the model, but also enhance the model's code generation capability for private business areas and improve the accuracy of model prediction. It can be understood that the above method can not only solve the generation capability of general code, but also greatly improve the code generation capability of private domain programming languages for business development in vertical fields or secondary development on this basis, and can better integrate the generation capability of large models with code development, thereby improving the efficiency of actual production development.

应该理解的是，虽然如上所述的各实施例所涉及的流程图中的各个步骤按照箭头的指示依次显示，但是这些步骤并不是必然按照箭头指示的顺序依次执行。除非本文中有明确的说明，这些步骤的执行并没有严格的顺序限制，这些步骤可以以其它的顺序执行。而且，如上所述的各实施例所涉及的流程图中的至少一部分步骤可以包括多个步骤或者多个阶段，这些步骤或者阶段并不必然是在同一时刻执行完成，而是可以在不同的时刻执行，这些步骤或者阶段的执行顺序也不必然是依次进行，而是可以与其它步骤或者其它步骤中的步骤或者阶段的至少一部分轮流或者交替地执行。It should be understood that, although the various steps in the flowcharts involved in the above-mentioned embodiments are displayed in sequence according to the indication of the arrows, these steps are not necessarily executed in sequence according to the order indicated by the arrows. Unless there is a clear explanation in this article, the execution of these steps does not have a strict order restriction, and these steps can be executed in other orders. Moreover, at least a part of the steps in the flowcharts involved in the above-mentioned embodiments can include multiple steps or multiple stages, and these steps or stages are not necessarily executed at the same time, but can be executed at different times, and the execution order of these steps or stages is not necessarily carried out in sequence, but can be executed in turn or alternately with other steps or at least a part of the steps or stages in other steps.

基于同样的发明构思，本申请实施例还提供了一种用于实现上述所涉及的代码生成模型训练方法的代码生成模型训练装置。该装置所提供的解决问题的实现方案与上述方法中所记载的实现方案相似，故下面所提供的一个或多个代码生成模型训练装置实施例中的具体限定可以参见上文中对于代码生成模型训练方法的限定，在此不再赘述。Based on the same inventive concept, the embodiment of the present application also provides a code generation model training device for implementing the code generation model training method involved above. The implementation scheme for solving the problem provided by the device is similar to the implementation scheme recorded in the above method, so the specific limitations in one or more code generation model training device embodiments provided below can refer to the limitations of the code generation model training method above, and will not be repeated here.

在一个实施例中，如图5所示，提供了一种代码生成模型训练装置，包括：第一获取模块502、第一训练模块504、第二获取模块506和第二训练模块508，其中：In one embodiment, as shown in FIG5 , a code generation model training device is provided, including: a first acquisition module 502 , a first training module 504 , a second acquisition module 506 and a second training module 508 , wherein:

第一获取模块502，用于获取目标业务场景对应的预训练数据集；预训练数据集的预训练代码样本包括通用代码样本和目标业务场景对应的场景代码样本；The first acquisition module 502 is used to acquire a pre-training data set corresponding to the target business scenario; the pre-training code samples of the pre-training data set include general code samples and scenario code samples corresponding to the target business scenario;

第一训练模块504，用于以通用大语言模型为基座模型，将预训练数据集输入通用大语言模型，进行二次预训练，得到第一预测代码样本，基于预训练代码样本和第一预测代码样本之间的差异，调整通用大语言模型，得到目标业务场景对应的初始代码生成模型；The first training module 504 is used to use the general large language model as a base model, input the pre-training data set into the general large language model, perform secondary pre-training, obtain a first predicted code sample, and adjust the general large language model based on the difference between the pre-training code sample and the first predicted code sample to obtain an initial code generation model corresponding to the target business scenario;

第二获取模块506，用于获取目标业务场景对应的有监督训练集；有监督训练集包括指令训练样本和指令训练样本对应的标签代码样本；The second acquisition module 506 is used to acquire a supervised training set corresponding to the target business scenario; the supervised training set includes instruction training samples and label code samples corresponding to the instruction training samples;

第二训练模块508，用于将指令训练样本输入初始代码生成模型，得到第二预测代码样本，基于标签代码样本和第二预测代码样本之间的差异，调整初始代码生成模型，得到目标业务场景对应的目标代码生成模型。The second training module 508 is used to input the instruction training sample into the initial code generation model to obtain the second predicted code sample, and adjust the initial code generation model based on the difference between the label code sample and the second predicted code sample to obtain the target code generation model corresponding to the target business scenario.

在一个实施例中，第一获取模块502还用于：In one embodiment, the first acquisition module 502 is further configured to:

获取通用代码样本集和目标业务场景对应的场景代码样本集；基于样本组合比例，对通用代码样本集中的通用代码样本和场景代码样本集中的场景代码样本进行组合，得到目标业务场景对应的预训练数据集；预训练数据集中的通用代码样本的数量大于场景代码样本的数量。Obtain a general code sample set and a scenario code sample set corresponding to the target business scenario; based on the sample combination ratio, combine the general code samples in the general code sample set and the scenario code samples in the scenario code sample set to obtain a pre-training data set corresponding to the target business scenario; the number of general code samples in the pre-training data set is greater than the number of scenario code samples.

在一个实施例中，第二获取模块506还用于：In one embodiment, the second acquisition module 506 is further configured to:

获取目标业务场景对应的多个已知指令和各个已知指令分别对应的标签代码样本，获取目标业务场景对应的任务背景信息；将已知指令和任务背景信息输入样本扩充模型，得到扩充指令集、以及扩充指令集中各个扩充指令在任务背景信息下分别对应的标签代码样本；基于各个已知指令、各个已知指令分别对应的标签代码样本、各个扩充指令和各个标签代码数据，得到目标业务场景对应的有监督训练集。Obtain multiple known instructions corresponding to the target business scenario and label code samples corresponding to each known instruction, and obtain task background information corresponding to the target business scenario; input the known instructions and the task background information into the sample expansion model to obtain an expanded instruction set, and label code samples corresponding to each expanded instruction in the expanded instruction set under the task background information; based on each known instruction, the label code samples corresponding to each known instruction, each expanded instruction and each label code data, obtain a supervised training set corresponding to the target business scenario.

基于各个扩充指令之间的第一相似度，对各个扩充指令进行过滤操作，得到初始扩充指令集；将初始扩充指令集中的扩充指令分别作为当前指令，计算当前指令分别与各个已知指令之间的第二相似度，得到各个扩充指令分别与各个已知指令之间的第二相似度；在初始扩充指令集中过滤第二相似度大于预设相似度的扩充指令，得到目标扩充指令集；将目标扩充指令集中的各个扩充指令和各个已知指令分别作为指令训练样本，基于各个指令训练样本和各个指令训练样本分别对应的标签代码样本，得到目标业务场景对应的有监督训练集。Based on the first similarity between each extended instruction, each extended instruction is filtered to obtain an initial extended instruction set; the extended instructions in the initial extended instruction set are respectively used as current instructions, and the second similarity between the current instruction and each known instruction is calculated to obtain the second similarity between each extended instruction and each known instruction; the extended instructions whose second similarity is greater than a preset similarity are filtered in the initial extended instruction set to obtain a target extended instruction set; each extended instruction and each known instruction in the target extended instruction set are respectively used as instruction training samples, and based on each instruction training sample and the label code sample corresponding to each instruction training sample, a supervised training set corresponding to the target business scenario is obtained.

提取当前指令和已知指令之间的目标公共子序列；基于目标公共子序列对应的序列长度和已知指令对应的指令长度之间的比值，得到第一子相似度；基于序列长度和当前指令对应的指令长度之间的比值，得到第二子相似度；基于第一相似度确定第一贡献值；第一贡献值与第一子相似度呈负相关；基于第二相似度确定第二贡献值；第二贡献值与第二子相似度呈负相关；融合第一贡献值和第二贡献值，得到当前指令与已知指令之间的第二相似度。Extract a target common subsequence between the current instruction and the known instruction; obtain a first sub-similarity based on the ratio between the sequence length corresponding to the target common subsequence and the instruction length corresponding to the known instruction; obtain a second sub-similarity based on the ratio between the sequence length and the instruction length corresponding to the current instruction; determine a first contribution value based on the first similarity; the first contribution value is negatively correlated with the first sub-similarity; determine a second contribution value based on the second similarity; the second contribution value is negatively correlated with the second sub-similarity; fuse the first contribution value and the second contribution value to obtain a second similarity between the current instruction and the known instruction.

在一个实施例中，第二训练模块508还用于：In one embodiment, the second training module 508 is further configured to:

将指令训练样本输入初始代码生成模型，得到第二预测代码样本；基于标签代码样本和第二预测代码样本之间的差异，调整初始代码生成模型的模型参数，直至模型收敛，在不同预设轮次的模型训练任务中，得到目标业务场景对应的多个候选代码生成模型；基于目标业务场景对应的测试集，确定各个候选代码生成模型分别对应的质量分数；基于质量分数，在各个候选代码生成模型中，确定目标业务场景对应的目标代码生成模型。The instruction training sample is input into the initial code generation model to obtain a second predicted code sample; based on the difference between the label code sample and the second predicted code sample, the model parameters of the initial code generation model are adjusted until the model converges, and multiple candidate code generation models corresponding to the target business scenario are obtained in different preset rounds of model training tasks; based on the test set corresponding to the target business scenario, the quality scores corresponding to each candidate code generation model are determined; based on the quality scores, the target code generation model corresponding to the target business scenario is determined among the candidate code generation models.

在一个实施例中，代码生成模型训练装置还包括：In one embodiment, the code generation model training device further includes:

模型预测模块510，用于获取目标业务场景对应的待预测指令；将待预测指令输入目标代码生成模型，得到待预测指令对应的第三预测代码样本；基于第三预测代码样本得到待预测指令对应的修正代码样本；待预测指令和待预测指令对应的修正代码样本用于对目标代码生成模型进行强化训练。The model prediction module 510 is used to obtain the instructions to be predicted corresponding to the target business scenario; input the instructions to be predicted into the target code generation model to obtain a third prediction code sample corresponding to the instructions to be predicted; obtain a corrected code sample corresponding to the instructions to be predicted based on the third prediction code sample; the instructions to be predicted and the corrected code samples corresponding to the instructions to be predicted are used to perform intensive training on the target code generation model.

上述代码生成模型训练装置，通过获取目标业务场景对应的预训练数据集，预训练数据集的代码样本包括通用代码样本和目标业务场景对应的场景代码样本。基于预训练数据集对通用大语言模型进行预训练，得到目标业务场景对应的初始代码生成模型。进而获取目标业务场景对应的有监督训练集，有监督训练集包括指令训练样本和对应的标签代码样本。基于有监督训练集对初始代码生成模型进行有监督训练，得到目标业务场景对应的目标代码生成模型。这样，基于包含通用代码样本和目标业务场景对应的场景代码样本的预训练数据集，对通用大语言模型进行训练，大大提升了代码生成模型针对垂直领域，即目标业务场景的代码生成的准确性，此外，将大语言模型的生成能力与代码开发融合，能够提高代码生成的效率。The above-mentioned code generation model training device obtains a pre-training data set corresponding to the target business scenario, and the code samples of the pre-training data set include general code samples and scenario code samples corresponding to the target business scenario. The general large language model is pre-trained based on the pre-training data set to obtain an initial code generation model corresponding to the target business scenario. Then, a supervised training set corresponding to the target business scenario is obtained, and the supervised training set includes instruction training samples and corresponding label code samples. The initial code generation model is supervised trained based on the supervised training set to obtain a target code generation model corresponding to the target business scenario. In this way, based on the pre-training data set containing general code samples and scenario code samples corresponding to the target business scenario, the general large language model is trained, which greatly improves the accuracy of code generation of the code generation model for vertical fields, that is, target business scenarios. In addition, the integration of the generation capability of the large language model with code development can improve the efficiency of code generation.

在一个实施例中，如图6所示，提供了一种代码生成装置，包括：指令获取模块602和预测模块604，其中：In one embodiment, as shown in FIG6 , a code generation device is provided, including: an instruction acquisition module 602 and a prediction module 604, wherein:

指令获取模块602，用于获取目标业务场景对应的代码生成指令。The instruction acquisition module 602 is used to acquire the code generation instruction corresponding to the target business scenario.

预测模块604，用于将代码生成指令输入目标业务场景对应的目标代码生成模型，得到代码生成指令对应的目标代码；其中，目标代码生成模型的训练过程为：获取目标业务场景对应的预训练数据集；预训练数据集的预训练代码样本包括通用代码样本和目标业务场景对应的场景代码样本；以通用大语言模型为基座模型，将预训练数据集输入通用大语言模型，进行二次预训练，得到第一预测代码样本，基于预训练代码样本和第一预测代码样本之间的差异，调整通用大语言模型，得到目标业务场景对应的初始代码生成模型；获取目标业务场景对应的有监督训练集；有监督训练集包括指令训练样本和指令训练样本对应的标签代码样本；将指令训练样本输入初始代码生成模型，得到第二预测代码样本，基于标签代码样本和第二预测代码样本之间的差异，调整初始代码生成模型，得到目标业务场景对应的目标代码生成模型。上述代码生成模型训练装置和代码生成装置中的各个模块可全部或部分通过软件、硬件及其组合来实现。上述各模块可以硬件形式内嵌于或独立于计算机设备中的处理器中，也可以以软件形式存储于计算机设备中的存储器中，以便于处理器调用执行以上各个模块对应的操作。The prediction module 604 is used to input the code generation instruction into the target code generation model corresponding to the target business scenario to obtain the target code corresponding to the code generation instruction; wherein the training process of the target code generation model is as follows: obtaining a pre-training data set corresponding to the target business scenario; the pre-training code samples of the pre-training data set include general code samples and scenario code samples corresponding to the target business scenario; taking the general large language model as the base model, inputting the pre-training data set into the general large language model, performing secondary pre-training, obtaining a first predicted code sample, adjusting the general large language model based on the difference between the pre-training code sample and the first predicted code sample, and obtaining an initial code generation model corresponding to the target business scenario; obtaining a supervised training set corresponding to the target business scenario; the supervised training set includes an instruction training sample and a label code sample corresponding to the instruction training sample; inputting the instruction training sample into the initial code generation model, obtaining a second predicted code sample, adjusting the initial code generation model based on the difference between the label code sample and the second predicted code sample, and obtaining a target code generation model corresponding to the target business scenario. Each module in the above-mentioned code generation model training device and code generation device can be implemented in whole or in part by software, hardware and a combination thereof. The above modules may be embedded in or independent of a processor in a computer device in the form of hardware, or may be stored in a memory in a computer device in the form of software, so that the processor can call and execute operations corresponding to the above modules.

在一个实施例中，提供了一种计算机设备，该计算机设备可以是服务器，其内部结构图可以如图7所示。该计算机设备包括处理器、存储器、输入/输出接口(Input/Output，简称I/O）和通信接口。其中，处理器、存储器和输入/输出接口通过系统总线连接，通信接口通过输入/输出接口连接到系统总线。其中，该计算机设备的处理器用于提供计算和控制能力。该计算机设备的存储器包括非易失性存储介质和内存储器。该非易失性存储介质存储有操作系统、计算机程序和数据库。该内存储器为非易失性存储介质中的操作系统和计算机程序的运行提供环境。该计算机设备的数据库用于存储预训练数据集、有监督训练集等数据。该计算机设备的输入/输出接口用于处理器与外部设备之间交换信息。该计算机设备的通信接口用于与外部的终端通过网络连接通信。该计算机程序被处理器执行时以实现一种有监督训练集方法。In one embodiment, a computer device is provided, which may be a server, and its internal structure diagram may be shown in FIG7. The computer device includes a processor, a memory, an input/output interface (Input/Output, referred to as I/O) and a communication interface. The processor, the memory and the input/output interface are connected via a system bus, and the communication interface is connected to the system bus via the input/output interface. The processor of the computer device is used to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program and a database. The internal memory provides an environment for the operation of the operating system and the computer program in the non-volatile storage medium. The database of the computer device is used to store data such as pre-training data sets and supervised training sets. The input/output interface of the computer device is used to exchange information between the processor and an external device. The communication interface of the computer device is used to communicate with an external terminal through a network connection. When the computer program is executed by the processor, a supervised training set method is implemented.

在一个实施例中，提供了一种计算机设备，该计算机设备可以是终端，其内部结构图可以如图8所示。该计算机设备包括处理器、存储器、输入/输出接口、通信接口、显示单元和输入装置。其中，处理器、存储器和输入/输出接口通过系统总线连接，通信接口、显示单元和输入装置通过输入/输出接口连接到系统总线。其中，该计算机设备的处理器用于提供计算和控制能力。该计算机设备的存储器包括非易失性存储介质和内存储器。该非易失性存储介质存储有操作系统和计算机程序。该内存储器为非易失性存储介质中的操作系统和计算机程序的运行提供环境。该计算机设备的输入/输出接口用于处理器与外部设备之间交换信息。该计算机设备的通信接口用于与外部的终端进行有线或无线方式的通信，无线方式可通过WIFI、移动蜂窝网络、NFC（近场通信）或其他技术实现。该计算机程序被处理器执行时以实现一种有监督训练集方法。该计算机设备的显示单元用于形成视觉可见的画面，可以是显示屏、投影装置或虚拟现实成像装置。显示屏可以是液晶显示屏或者电子墨水显示屏，该计算机设备的输入装置可以是显示屏上覆盖的触摸层，也可以是计算机设备外壳上设置的按键、轨迹球或触控板，还可以是外接的键盘、触控板或鼠标等。In one embodiment, a computer device is provided, which may be a terminal, and its internal structure diagram may be shown in FIG8. The computer device includes a processor, a memory, an input/output interface, a communication interface, a display unit, and an input device. The processor, the memory, and the input/output interface are connected via a system bus, and the communication interface, the display unit, and the input device are connected to the system bus via the input/output interface. The processor of the computer device is used to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of the operating system and the computer program in the non-volatile storage medium. The input/output interface of the computer device is used to exchange information between the processor and an external device. The communication interface of the computer device is used to communicate with an external terminal in a wired or wireless manner, and the wireless manner may be implemented through WIFI, a mobile cellular network, NFC (near field communication) or other technologies. When the computer program is executed by the processor, a supervised training set method is implemented. The display unit of the computer device is used to form a visually visible picture, which may be a display screen, a projection device, or a virtual reality imaging device. The display screen can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer device can be a touch layer covering the display screen, or a button, trackball or touchpad set on the computer device shell, or an external keyboard, touchpad or mouse.

本领域技术人员可以理解，图7、8中示出的结构，仅仅是与本申请方案相关的部分结构的框图，并不构成对本申请方案所应用于其上的计算机设备的限定，具体的计算机设备可以包括比图中所示更多或更少的部件，或者组合某些部件，或者具有不同的部件布置。Those skilled in the art will understand that the structures shown in FIGS. 7 and 8 are merely block diagrams of partial structures related to the scheme of the present application, and do not constitute a limitation on the computer device to which the scheme of the present application is applied. The specific computer device may include more or fewer components than those shown in the figures, or combine certain components, or have a different arrangement of components.

在一个实施例中，提供了一种计算机设备，包括存储器和处理器，存储器中存储有计算机程序，该处理器执行计算机程序时实现上述各方法实施例中的步骤。In one embodiment, a computer device is provided, including a memory and a processor, wherein a computer program is stored in the memory, and the processor implements the steps in the above-mentioned method embodiments when executing the computer program.

在一个实施例中，提供了一种计算机可读存储介质，其上存储有计算机程序，计算机程序被处理器执行时实现上述各方法实施例中的步骤。In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored. When the computer program is executed by a processor, the steps in the above-mentioned method embodiments are implemented.

在一个实施例中，提供了一种计算机程序产品或计算机程序，该计算机产品或计算机程序包括计算机指令，该计算机指令存储在计算机可读存储介质中。计算机设备的处理器从计算机可读存储介质读取该计算机指令，处理器执行该计算机指令，使得该计算机设备执行上述各方法实施例中的步骤。In one embodiment, a computer program product or computer program is provided, the computer product or computer program includes computer instructions, the computer instructions are stored in a computer-readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device performs the steps in the above-mentioned method embodiments.

需要说明的是，本申请所涉及的用户信息（包括但不限于用户设备信息、用户个人信息等）和数据（包括但不限于用于分析的数据、存储的数据、展示的数据等），均为经用户授权或者经过各方充分授权的信息和数据，且相关数据的收集、使用和处理需要遵守相关国家和地区的相关法律法规和标准。It should be noted that the user information (including but not limited to user device information, user personal information, etc.) and data (including but not limited to data used for analysis, stored data, displayed data, etc.) involved in this application are all information and data authorized by the user or fully authorized by all parties, and the collection, use and processing of relevant data must comply with relevant laws, regulations and standards of relevant countries and regions.

本领域普通技术人员可以理解实现上述实施例方法中的全部或部分流程，是可以通过计算机程序来指令相关的硬件来完成，所述的计算机程序可存储于一非易失性计算机可读取存储介质中，该计算机程序在执行时，可包括如上述各方法的实施例的流程。其中，本申请所提供的各实施例中所使用的对存储器、数据库或其它介质的任何引用，均可包括非易失性和易失性存储器中的至少一种。非易失性存储器可包括只读存储器（Read-OnlyMemory，ROM）、磁带、软盘、闪存、光存储器、高密度嵌入式非易失性存储器、阻变存储器（ReRAM）、磁变存储器（Magnetoresistive Random Access Memory，MRAM）、铁电存储器（Ferroelectric Random Access Memory，FRAM）、相变存储器（Phase Change Memory，PCM）、石墨烯存储器等。易失性存储器可包括随机存取存储器（Random Access Memory，RAM）或外部高速缓冲存储器等。作为说明而非局限，RAM可以是多种形式，比如静态随机存取存储器（Static Random Access Memory，SRAM）或动态随机存取存储器（Dynamic RandomAccess Memory，DRAM）等。本申请所提供的各实施例中所涉及的数据库可包括关系型数据库和非关系型数据库中至少一种。非关系型数据库可包括基于区块链的分布式数据库等，不限于此。本申请所提供的各实施例中所涉及的处理器可为通用处理器、中央处理器、图形处理器、数字信号处理器、可编程逻辑器、基于量子计算的数据处理逻辑器等，不限于此。Those skilled in the art can understand that all or part of the processes in the above-mentioned embodiment methods can be completed by instructing the relevant hardware through a computer program, and the computer program can be stored in a non-volatile computer-readable storage medium. When the computer program is executed, it can include the processes of the embodiments of the above-mentioned methods. Among them, any reference to the memory, database or other medium used in the embodiments provided in this application can include at least one of non-volatile and volatile memory. Non-volatile memory can include read-only memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, resistive random access memory (ReRAM), magnetoresistive random access memory (MRAM), ferroelectric random access memory (FRAM), phase change memory (PCM), graphene memory, etc. Volatile memory can include random access memory (RAM) or external cache memory, etc. As an illustration and not limitation, RAM can be in various forms, such as static random access memory (SRAM) or dynamic random access memory (DRAM). The database involved in each embodiment provided in this application may include at least one of a relational database and a non-relational database. Non-relational databases may include distributed databases based on blockchains, etc., but are not limited to this. The processor involved in each embodiment provided in this application may be a general-purpose processor, a central processing unit, a graphics processor, a digital signal processor, a programmable logic device, a data processing logic device based on quantum computing, etc., but are not limited to this.

以上实施例的各技术特征可以进行任意的组合，为使描述简洁，未对上述实施例中的各个技术特征所有可能的组合都进行描述，然而，只要这些技术特征的组合不存在矛盾，都应当认为是本说明书记载的范围。The technical features of the above embodiments may be arbitrarily combined. To make the description concise, not all possible combinations of the technical features in the above embodiments are described. However, as long as there is no contradiction in the combination of these technical features, they should be considered to be within the scope of this specification.

以上所述实施例仅表达了本申请的几种实施方式，其描述较为具体和详细，但并不能因此而理解为对本申请专利范围的限制。应当指出的是，对于本领域的普通技术人员来说，在不脱离本申请构思的前提下，还可以做出若干变形和改进，这些都属于本申请的保护范围。因此，本申请的保护范围应以所附权利要求为准。The above-described embodiments only express several implementation methods of the present application, and the descriptions thereof are relatively specific and detailed, but they cannot be understood as limiting the scope of the present application. It should be pointed out that, for a person of ordinary skill in the art, several variations and improvements can be made without departing from the concept of the present application, and these all belong to the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the attached claims.