CN118228839A - Method, device, electronic device and storage medium for constructing complex instruction training data for model training - Google Patents

Method, device, electronic device and storage medium for constructing complex instruction training data for model training

Info

Publication number
CN118228839A
Authority
CN
China
Prior art keywords
initial
preset
content
field
training data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202410494963.5A
Other languages
Chinese (zh)
Other versions
CN118228839B (en)
Inventor
李思远
曾国洋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Facewall Intelligent Technology Co ltd
Original Assignee
Beijing Facewall Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Facewall Intelligent Technology Co ltd
Priority to CN202410494963.5A
Publication of CN118228839A
Application granted
Publication of CN118228839B
Legal status: Active
Anticipated expiration


Abstract

Translated from Chinese

The embodiments of the present application disclose a method, device, electronic device and storage medium for constructing complex instruction training data for model training, which relates to the field of artificial intelligence and can generate high-quality training data. The method includes: obtaining initial training data of a large language model, the initial training data including initial content, initial answers extracted from the initial content, and initial questions corresponding to the initial answers; determining preset content, preset answers and preset questions based on the initial content, initial answers and initial questions; combining preset tasks and the preset content, preset answers and preset questions to obtain seed instructions; generalizing the seed instructions to obtain training data. The present invention is suitable for scenarios where complex instruction training data is constructed for large language models.

Description

Translated from Chinese
Method, device, electronic device and storage medium for constructing complex instruction training data for model training

Technical Field

The present application relates to the field of artificial intelligence, and in particular to a method, device, electronic device and storage medium for constructing complex instruction training data for model training.

Background Art

When training a large language model (LLM), the ability to understand and follow complex instructions is very important, but building training data for complex instructions takes a great deal of labor and time. Existing schemes for constructing training data have the following drawbacks: they are limited when handling particular types of data and cannot effectively handle extremely complex instruction data structures; and because the questions and answers in the constructed training data are generated entirely by an existing language model, the answers are of low quality.

Therefore, a scheme is needed that can efficiently produce large amounts of high-quality training data for complex instructions.

Summary of the Invention

In view of this, the present application provides a method, device, electronic device and storage medium for constructing complex instruction training data for model training, so as to generate high-quality training data.

In a first aspect, an embodiment of the present invention provides a method for constructing complex instruction training data for model training, comprising: obtaining initial training data of a large language model, the initial training data comprising initial content, an initial answer extracted from the initial content, and an initial question corresponding to the initial answer; determining preset content, a preset answer and a preset question based on the initial content, the initial answer and the initial question; combining a preset task with the preset content, the preset answer and the preset question to obtain a seed instruction; and generalizing the seed instruction to obtain training data.

In a specific implementation, determining the preset content, preset answer and preset question based on the initial content, initial answer and initial question comprises: extracting the initial content, initial answer and initial question from the initial training data; and, in accordance with a target format, determining the preset content from the initial content, the preset answer from the initial answer, and the preset question from the initial question.

In a specific implementation, the method for determining the preset content comprises: obtaining the initial content; using the ID of the initial content as first field content, and configuring a corresponding first field to represent the first field content; using the initial content as second field content, and configuring a second field to represent the second field content; and combining the first field with the first field content and the second field with the second field content to obtain the preset content.

In a specific implementation, the method for determining the preset question comprises: obtaining the initial question; using the ID of the initial question as third field content, and configuring a corresponding third field to represent the third field content; using the initial question as fourth field content, and configuring a corresponding fourth field to represent the fourth field content; and combining the third field with the third field content and the fourth field with the fourth field content to obtain the preset question.

In a specific implementation, the method for determining the preset answer comprises: obtaining the initial answer; using the ID of the initial question corresponding to the initial answer as fifth field content, and configuring a corresponding fifth field to represent the fifth field content; using the initial answer as sixth field content, and configuring a corresponding sixth field to represent the sixth field content; using the starting position of the initial answer within the initial content as seventh field content, and configuring a corresponding seventh field to represent the seventh field content; and combining the fifth field with the fifth field content, the sixth field with the sixth field content, and the seventh field with the seventh field content to obtain the preset answer.
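The three answer fields described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the field names "id", "text" and "answer_start", the function name, and the English example strings are all assumptions, and the starting position is taken to be a character offset found with `str.find`.

```python
import json
from typing import Dict, Union


def build_preset_answer(question_id: str, answer: str, context: str) -> Dict[str, Union[str, int]]:
    """Build a preset-answer record (field names are assumptions)."""
    start = context.find(answer)  # character offset of the answer in the content
    if start < 0:
        raise ValueError("the answer must be a span of the content")
    return {
        "id": question_id,      # fifth field: ID of the corresponding question
        "text": answer,         # sixth field: the initial answer itself
        "answer_start": start,  # seventh field: where the answer starts in the content
    }


# Hypothetical example data in English.
context = "Samurai Warriors 3 was developed by Koei and ω-force."
preset_answer = build_preset_answer("DEV_0_QUERY_0", "Koei and ω-force", context)
print(json.dumps(preset_answer, ensure_ascii=False))
```

Storing the offset alongside the text lets a consumer check that the answer really is a span of the content by slicing the content at that position.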

In a specific implementation, generalizing the seed instruction comprises: determining a generalization example; and, based on the large language model's in-context learning from the generalization example, using the large language model to generalize the sentences in the seed instruction.
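One way to picture generalization by in-context learning is a few-shot prompt that shows the model example rewrites and then asks it to rewrite a sentence from the seed instruction. The sketch below is only an assumption about how such a prompt might look: the prompt wording, the function names, and the `stub_llm` stand-in (used so the example runs without any model) are all hypothetical.

```python
from typing import Callable, List, Tuple


def build_generalization_prompt(examples: List[Tuple[str, str]], sentence: str) -> str:
    """Assemble a few-shot prompt from (original, rewritten) generalization examples."""
    lines = ["Rewrite the last sentence in different words while keeping its meaning."]
    for original, rewritten in examples:
        lines.append(f"Original: {original}")
        lines.append(f"Rewritten: {rewritten}")
    lines.append(f"Original: {sentence}")
    lines.append("Rewritten:")
    return "\n".join(lines)


def generalize(sentence: str, examples: List[Tuple[str, str]], llm: Callable[[str], str]) -> str:
    """Ask a language model (any callable from prompt to text) for a rewritten sentence."""
    return llm(build_generalization_prompt(examples, sentence)).strip()


def stub_llm(prompt: str) -> str:
    # Stand-in for a real model call; it always returns the same text.
    return " Please use the passage to answer every question.\n"


examples = [("Answer the question.", "Could you tell me the answer?")]
sentence = "Answer each question using the passage."
prompt = build_generalization_prompt(examples, sentence)
rewritten = generalize(sentence, examples, stub_llm)
```

Passing the model in as a callable keeps the prompt construction testable on its own and leaves the choice of model outside the sketch.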

In a specific implementation, after the seed instruction is generalized, the method further comprises: scoring the generalized seed instructions, and selecting those whose scores exceed a preset threshold as the training data.
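The filtering step can be sketched as below. The text does not say how the score is computed, so the scorer here is a deliberately trivial placeholder (a word-count heuristic) and every name is hypothetical; only the "keep items whose score exceeds the threshold" logic comes from the description.

```python
from typing import Callable, List


def filter_by_score(candidates: List[str], score: Callable[[str], float], threshold: float) -> List[str]:
    """Keep only the generalized instructions whose score exceeds the threshold."""
    return [candidate for candidate in candidates if score(candidate) > threshold]


def toy_score(text: str) -> float:
    # Placeholder scorer, NOT a quality measure: word count scaled into [0, 1].
    return min(len(text.split()) / 10, 1.0)


candidates = [
    "Answer the question.",
    "Read the passage below and answer every question in the requested JSON format.",
]
kept = filter_by_score(candidates, toy_score, threshold=0.5)
```

In practice the scorer might be a model-based judge or a rule set; the sketch only fixes the interface between scoring and selection.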

In a specific implementation, the target format is any one of JSON, YAML and XML.
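As a rough illustration of what a choice of target format means in practice, the sketch below serializes the same field-and-content record as JSON and as XML using only the Python standard library. The record, the root tag "content" and the function names are assumptions; YAML is left out only because the standard library has no YAML writer.

```python
import json
import xml.etree.ElementTree as ET
from typing import Dict

# Hypothetical record: each key is a field, each value is that field's content.
record = {"id": "DEV_0", "context": "Samurai Warriors 3 was developed by Koei and ω-force."}


def to_json(rec: Dict[str, str]) -> str:
    """Serialize the record as JSON, keeping non-ASCII characters readable."""
    return json.dumps(rec, ensure_ascii=False, indent=2)


def to_xml(rec: Dict[str, str], root_tag: str = "content") -> str:
    """Serialize the record as XML, one child element per field."""
    root = ET.Element(root_tag)
    for field, value in rec.items():
        ET.SubElement(root, field).text = str(value)
    return ET.tostring(root, encoding="unicode")


json_text = to_json(record)
xml_text = to_xml(record)
```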

In a specific implementation, the preset task comprises a task description and a task guide; the task description comprises a task example and a description of the target format, and the task guide provides guiding prompts for the preset content, the preset question and the preset answer in sequence.

In a second aspect, an embodiment of the present invention further provides a device for constructing complex instruction training data for model training, comprising: an acquisition unit, configured to obtain initial training data of a large language model, the initial training data comprising initial content, an initial answer extracted from the initial content, and an initial question corresponding to the initial answer; a determination unit, configured to determine preset content, a preset answer and a preset question based on the initial content, the initial answer and the initial question; a combination unit, configured to combine a preset task with the preset content, the preset answer and the preset question to obtain a seed instruction; and a generalization unit, configured to generalize the seed instruction to obtain training data.

In a specific implementation, the determination unit comprises: an extraction module, configured to extract the initial content, initial answer and initial question from the initial training data; and a format determination module, configured to determine, in accordance with a target format, the preset content from the initial content, the preset answer from the initial answer, and the preset question from the initial question.

In a specific implementation, the format determination module comprises a preset content determination sub-block, configured to: obtain the initial content; use the ID of the initial content as first field content, and configure a corresponding first field to represent the first field content; use the initial content as second field content, and configure a second field to represent the second field content; and combine the first field with the first field content and the second field with the second field content to obtain the preset content.

In a specific implementation, the format determination module further comprises a preset question determination sub-block, configured to: obtain the initial question; use the ID of the initial question as third field content, and configure a corresponding third field to represent the third field content; use the initial question as fourth field content, and configure a corresponding fourth field to represent the fourth field content; and combine the third field with the third field content and the fourth field with the fourth field content to obtain the preset question.

In a specific implementation, the format determination module further comprises a preset answer determination sub-block, configured to: obtain the initial answer; use the ID of the initial question corresponding to the initial answer as fifth field content, and configure a corresponding fifth field to represent the fifth field content; use the initial answer as sixth field content, and configure a corresponding sixth field to represent the sixth field content; use the starting position of the initial answer within the initial content as seventh field content, and configure a corresponding seventh field to represent the seventh field content; and combine the fifth field with the fifth field content, the sixth field with the sixth field content, and the seventh field with the seventh field content to obtain the preset answer.

In a specific implementation, the generalization unit comprises: a generalization example module, configured to determine a generalization example; and a sentence generalization module, configured to use the large language model to generalize the sentences in the seed instruction, based on the large language model's in-context learning from the generalization example.

In a specific implementation, the generalization unit further comprises: a scoring module, configured to score the generalized seed instructions and select those whose scores exceed a preset threshold as the training data.

In a specific implementation, the target format is any one of JSON, YAML and XML.

In a specific implementation, the preset task comprises a task description and a task guide; the task description comprises a task example and a description of the target format, and the task guide provides guiding prompts for the preset content, the preset question and the preset answer in sequence.

In a third aspect, an embodiment of the present invention further provides an electronic device, comprising a housing, a processor, a memory, a circuit board and a power supply circuit, wherein the circuit board is placed inside the space enclosed by the housing, and the processor and the memory are arranged on the circuit board; the power supply circuit supplies power to the circuits and components of the electronic device; the memory stores executable program code; and the processor reads the executable program code stored in the memory and runs the corresponding program, so as to execute any of the methods for constructing complex instruction training data for model training provided in the embodiments of the present invention.

In a fourth aspect, an embodiment of the present invention further provides a computer-readable storage medium storing one or more programs, which can be executed by one or more processors to implement any of the methods for constructing complex instruction training data for model training provided in the embodiments of the present invention.

The method, device, electronic device and storage medium for constructing complex instruction training data for model training provided by the embodiments of the present invention obtain initial training data of a large language model, the initial training data including initial content, an initial answer extracted from the initial content, and an initial question corresponding to the initial answer; determine preset content, a preset answer and a preset question based on the initial content, initial answer and initial question; combine a preset task with the preset content, preset answer and preset question to obtain a seed instruction; and generalize the seed instruction to obtain training data. This approach can expand the initial training data of a large language model, generate high-quality training data, and improve the diversity and richness of the training data.

Brief Description of the Drawings

To illustrate the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention; those of ordinary skill in the art can derive other drawings from them without creative effort.

FIG. 1 is a flowchart of a method for constructing complex instruction training data for model training provided in an embodiment of the present application;

FIG. 2 is a schematic structural diagram of a device for constructing complex instruction training data for model training provided in an embodiment of the present application;

FIG. 3 is a schematic structural diagram of an electronic device provided in an embodiment of the present application.

Detailed Description

The embodiments of the present invention are described in detail below with reference to the accompanying drawings.

It should be understood that the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.

Existing schemes for constructing training data cannot effectively handle extremely complex instruction data structures when processing particular types of data, and the questions and answers in the training data they construct are generated entirely by an existing language model, so the answers are of low quality. Therefore, for training data of the complex-instruction type, in a first aspect, as shown in FIG. 1, an embodiment of the present invention provides a method for constructing complex instruction training data for model training, which may include:

S11. Obtain initial training data of a large language model, the initial training data including initial content, an initial answer extracted from the initial content, and an initial question corresponding to the initial answer.

For different natural language tasks, a large language model can be trained on the corresponding initial training data so that it understands the complexity of language more accurately and provides users with more intelligent and personalized services. In this embodiment, the initial training data includes initial content, an initial answer extracted from the initial content, and an initial question corresponding to the initial answer. For example, the initial training data may be selected from CMRC (Chinese Machine Reading Comprehension, a Chinese machine reading comprehension dataset). The CMRC dataset contains question-answer pairs together with the related article passages; having the large language model read and understand a passage in order to answer questions improves its natural language understanding. The initial content, initial question and initial answer may be, respectively, the article passage, question and answer of one training example in the CMRC dataset.
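A single item of initial training data can be pictured as a (content, question, answer) triple. The sketch below is only an illustration: the class name, attribute names and English example strings are assumptions, and the check in `__post_init__` simply encodes the statement that the initial answer is extracted from the initial content.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InitialExample:
    """One item of initial training data (names are illustrative)."""
    content: str   # initial content, e.g. an article passage
    question: str  # initial question about the passage
    answer: str    # initial answer, expected to be a span of the content

    def __post_init__(self) -> None:
        # The initial answer is extracted from the initial content,
        # so it should occur verbatim in it.
        if self.answer not in self.content:
            raise ValueError("the answer must be extracted from the content")


example = InitialExample(
    content="Samurai Warriors 3 was developed by Koei and ω-force.",
    question="Which two companies jointly developed Samurai Warriors 3?",
    answer="Koei and ω-force",
)
```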

S12. Determine preset content, a preset answer and a preset question based on the initial content, the initial answer and the initial question.

After the initial training data is obtained, the preset content, preset answer and preset question can be constructed from the initial content, initial answer and initial question according to the training requirements of the large language model. For example, if the initial content is only a background article, with no additional information about that article such as an index position, number or label, such additional information can be added to it. The preset content then includes both the initial content and the related additional information, which makes the training material more structured and facilitates further processing to generate training data.

S13. Combine a preset task with the preset content, preset answer and preset question to obtain a seed instruction.

After the preset content, preset answer and preset question are determined according to the training requirements of the large language model, the preset task can be combined with them to obtain a seed instruction. The way of combining can be designed according to the task type of the preset task and the training purpose of the large language model. For example, to make the training data easier to manage, the combined items (the preset task, preset content, preset answer, preset question and so on) can be designed in a modular manner, so that each item can be maintained or extended separately.
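The modular combination described above might look like the following sketch, in which each item stays a separate value until the final assembly. The split into an "instruction" prompt and an expected "output", the section labels, the field names and the example strings are all assumptions made for illustration; the text does not prescribe this layout.

```python
import json
from typing import Dict, List


def build_seed_instruction(
    task_description: str,
    content: Dict[str, str],
    questions: List[Dict[str, str]],
    answers: List[Dict[str, str]],
) -> Dict[str, str]:
    """Combine a preset task with preset content, questions and answers."""
    def dump(obj) -> str:
        return json.dumps(obj, ensure_ascii=False)

    # Each item is introduced in turn: content, then questions, then answers.
    prompt = "\n".join([
        task_description,
        "Content:", dump(content),
        "Questions:", dump(questions),
        "Answers:",
    ])
    return {"instruction": prompt, "output": dump(answers)}


seed = build_seed_instruction(
    task_description="Answer each question using a span of the content. Reply in JSON.",
    content={"id": "DEV_0", "context": "Samurai Warriors 3 was developed by Koei and ω-force."},
    questions=[{"id": "DEV_0_QUERY_0",
                "question": "Which two companies jointly developed Samurai Warriors 3?"}],
    answers=[{"id": "DEV_0_QUERY_0", "text": "Koei and ω-force"}],
)
```

Because the items are passed in separately, any one of them (for example the task description) can be replaced or extended without touching the others.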

S14. Generalize the seed instruction to obtain training data.

Once the seed instruction is obtained, its preset content, preset answer and preset question may be worded in a relatively terse or formal way. When a large language model trained on such a seed instruction interacts with a particular group of users, for example children, its language style may not suit those users, which hinders mutual understanding between human and machine. The preset content, preset answer or preset question in the seed instruction can therefore be generalized to obtain engaging, easy-to-understand versions in a variety of language styles. This expands the seed instruction and yields richer training data; a large language model trained on this data can interact better with those users, making the model more intelligent and personalized.

The method for constructing complex instruction training data for model training provided by the embodiments of the present invention obtains initial training data of a large language model, the initial training data including initial content, an initial answer extracted from the initial content, and an initial question corresponding to the initial answer; determines preset content, a preset answer and a preset question based on the initial content, initial answer and initial question; combines a preset task with the preset content, preset answer and preset question to obtain a seed instruction; and generalizes the seed instruction to obtain training data. This method can expand the initial training data of a large language model, generate high-quality training data, and improve the diversity and richness of the training data.

There may be several existing training sets for a large language model task of the same type, and their data structures and ways of expressing content often differ. To make it easier to organize and use initial training data from different datasets, optionally, in one embodiment of the present invention, step S12 of determining the preset content, preset answer and preset question based on the initial content, initial answer and initial question includes: extracting the initial content, initial answer and initial question from the initial training data; and, in accordance with a target format, determining the preset content from the initial content, the preset answer from the initial answer, and the preset question from the initial question.

Suppose that, for a certain task type of the large language model, different developers provide a first training set and a second training set. With this method, whether the data comes from the first training set, represented in a first format, or the second training set, represented in a second format, it is only necessary to extract the corresponding initial content, initial answer and initial question from an item of initial training data and, following the target format of this embodiment, convert the initial content into preset content, the initial answer into a preset answer, and the initial question into a preset question in that format. In this way, organizing and using existing training sets is no longer constrained by whatever format restrictions those sets may have, while the data becomes more modular and is organized more efficiently.
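The idea of extracting the same three items from differently formatted training sets can be sketched as below. Both source layouts are invented for illustration (a nested layout with "context", "qas" and "answers" keys, and a flat layout with "passage", "query" and "reply" keys); neither is claimed to match any particular dataset.

```python
from typing import Dict, List, Tuple

# (initial content, initial question, initial answer)
Triple = Tuple[str, str, str]


def from_nested(paragraph: Dict) -> List[Triple]:
    """Extract triples from a hypothetical nested (first) format."""
    return [
        (paragraph["context"], qa["question"], answer["text"])
        for qa in paragraph["qas"]
        for answer in qa["answers"]
    ]


def from_flat(row: Dict) -> List[Triple]:
    """Extract a triple from a hypothetical flat (second) format."""
    return [(row["passage"], row["query"], row["reply"])]


passage = "Samurai Warriors 3 was developed by Koei and ω-force."
question = "Which two companies jointly developed Samurai Warriors 3?"

first_format = {
    "context": passage,
    "qas": [{"question": question, "answers": [{"text": "Koei and ω-force"}]}],
}
second_format = {"passage": passage, "query": question, "reply": "Koei and ω-force"}

triples_a = from_nested(first_format)
triples_b = from_flat(second_format)
```

Once both sources are reduced to the same triples, the conversion to the target format only has to be written once.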

Optionally, in one embodiment of the present invention, the method for determining the preset content includes: obtaining the initial content; using the ID of the initial content as first field content, and configuring a corresponding first field to represent the first field content; using the initial content as second field content, and configuring a second field to represent the second field content; and combining the first field with the first field content and the second field with the second field content to obtain the preset content.

To organize the data efficiently, this embodiment uses the ID (identifier) of the obtained initial content as the first field content and configures a corresponding first field to represent it, and uses the initial content itself as the second field content and configures a second field to represent it. The ID of the initial content may be the identifier of the whole document of the initial training data in which it is located, or an identifier assigned to the initial content after it is obtained, so that the obtained initial content can be managed as data. For example, suppose the initial content is the background article "Samurai Warriors 3 is the official third sequel in the Samurai Warriors series, developed by Koei and ω-force. The game is built around three main stories, namely 'Kanto Three Kingdoms', centered on Takeda Shingen and others, 'Three Heroes of the Warring States', centered on Oda Nobunaga and others, and 'The Young Warriors of Sekigahara', centered on Ishida Mitsunari and others, which enrich the in-game plot." The ID "DEV_0" of the initial content is used as the first field content, and a first field, for example "id", is configured to represent it; the background article is used as the second field content, and a second field, for example "context", is configured to represent it. The first field and first field content and the second field and second field content are then combined to obtain the preset content, as shown in the following table:

Through the above method, the initial content in the initial training data can be further structured as data and modularized, which improves the efficiency of processing the initial training data and the degree to which it can be integrated and reused.
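The field pairing described above can be sketched in Python as a plain dictionary. This is a minimal illustration rather than the patent's implementation: the function name `build_preset_content` is hypothetical, the field names "id" and "context" follow the example in the text, and the article is shortened for brevity.

```python
import json


def build_preset_content(content_id: str, initial_content: str) -> dict:
    """Combine each field with its field content to form the preset content."""
    return {
        "id": content_id,            # first field and first field content
        "context": initial_content,  # second field and second field content
    }


# Shortened version of the background article, used only for illustration.
initial_content = (
    "Samurai Warriors 3 is the third mainline sequel in the "
    "Samurai Warriors series, developed by Koei and ω-force."
)
preset_content = build_preset_content("DEV_0", initial_content)
print(json.dumps(preset_content, ensure_ascii=False, indent=2))
```

Keeping the id next to the text lets later steps refer to the content by id instead of copying the article.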

Optionally, in one embodiment of the present invention, the method for determining the preset question includes: obtaining the initial question; using the id of the initial question as the third field content, and configuring a corresponding third field to represent the third field content; using the initial question as the fourth field content, and configuring a corresponding fourth field to represent the fourth field content; and combining the third field with the third field content and the fourth field with the fourth field content to obtain the preset question.

When the preset question is determined from the initial question, the data structure of the initial question can be modularized in the same way. In the example above, the initial question obtained from the initial training data is "Which two companies jointly developed "Samurai Warriors 3"?". The id "DEV_0_QUERY_0" of this initial question can be used as the third field content, with a corresponding third field "id" configured to represent it. This id may be the question's identifier in the initial training data, or an identifier assigned after the initial question is acquired. The initial question itself is used as the fourth field content, with a corresponding fourth field "question" configured to represent it. The initial question in the initial training data is thereby structured as data and modularized, and the preset question is obtained by combining the third field with the third field content and the fourth field with the fourth field content, as shown in the following table:

id: DEV_0_QUERY_0
question: Which two companies jointly developed "Samurai Warriors 3"?

Multiple different preset questions can be derived from the same preset content, and these preset questions can form a preset question set.
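A preset question set for one piece of content might be built as in the following sketch. The helper names are hypothetical, the id scheme `<content id>_QUERY_<index>` is an assumption inferred from the single id "DEV_0_QUERY_0" given in the text, and the second question is invented purely to show a set with more than one entry.

```python
from typing import Dict, List


def build_preset_question(question_id: str, initial_question: str) -> Dict[str, str]:
    """Pair the third field ("id") and fourth field ("question") with their contents."""
    return {"id": question_id, "question": initial_question}


def build_question_set(content_id: str, initial_questions: List[str]) -> List[Dict[str, str]]:
    """Build one preset question per initial question for the same content."""
    # Assumed id scheme: <content id>_QUERY_<index>, matching "DEV_0_QUERY_0".
    return [
        build_preset_question(f"{content_id}_QUERY_{index}", question)
        for index, question in enumerate(initial_questions)
    ]


question_set = build_question_set(
    "DEV_0",
    [
        "Which two companies jointly developed Samurai Warriors 3?",
        # Hypothetical second question, added only for illustration.
        "How many main stories does Samurai Warriors 3 have?",
    ],
)
for preset_question in question_set:
    print(preset_question)
```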

Optionally, in one embodiment of the present invention, the method for determining the preset answer includes: obtaining the initial answer; using the id of the initial question corresponding to the initial answer as the fifth field content, and configuring a corresponding fifth field to represent the fifth field content; using the initial answer as the sixth field content, and configuring a corresponding sixth field to represent the sixth field content; using the starting position of the initial answer in the initial content as the seventh field content, and configuring a corresponding seventh field to represent the seventh field content; and combining the fifth field with the fifth field content, the sixth field with the sixth field content, and the seventh field with the seventh field content to obtain the preset answer.

For example, for the initial answer, the id "DEV_0_QUERY_0" of the corresponding initial question can be used as the fifth field content, with a corresponding fifth field "id" configured to represent it, which establishes the correspondence between the initial answer and the initial question. The initial answer "Koei and ω-force" is used as the sixth field content, with a corresponding sixth field "text" configured to represent it. Further, the starting position of the initial answer in the initial content, for example character position 10, can be used as the seventh field content, with a corresponding seventh field "answer_start" configured to represent it. This captures the data relationship between the initial answer and the initial content more fully. The preset answer is then obtained by combining the fifth field with the fifth field content, the sixth field with the sixth field content, and the seventh field with the seventh field content, as shown in the following table:

id: DEV_0_QUERY_0
text: Koei and ω-force
answer_start: 10

Multiple preset answers can be constructed for different situations, for example a preset answer in a brief mode or a preset answer in a detailed-description mode. These preset answers can form a preset answer set, so that the generated training data improves the accuracy of the large language model it is applied to.
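The three answer fields can be sketched the same way. The names `build_preset_answer` and `answers_for` are hypothetical. The value 10 for "answer_start" is copied from the example in the text rather than computed, since the text does not say how positions are counted.

```python
from typing import Dict, List, Union

PresetAnswer = Dict[str, Union[str, int]]


def build_preset_answer(question_id: str, initial_answer: str, answer_start: int) -> PresetAnswer:
    """Pair the fifth, sixth and seventh fields with their contents."""
    return {
        "id": question_id,             # id of the question this answer belongs to
        "text": initial_answer,        # the answer itself
        "answer_start": answer_start,  # where the answer starts in the initial content
    }


def answers_for(question_id: str, answer_set: List[PresetAnswer]) -> List[PresetAnswer]:
    """Look up every preset answer linked to one question through the shared id."""
    return [answer for answer in answer_set if answer["id"] == question_id]


# answer_start=10 is taken from the example in the text, not computed here.
answer_set = [build_preset_answer("DEV_0_QUERY_0", "Koei and ω-force", 10)]
print(answers_for("DEV_0_QUERY_0", answer_set))
```

Because the answer reuses the question id, a question and its answers can be joined without any extra mapping table.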

To generalize the way seed instructions are expressed and improve the trained large language model's ability to understand complex instructions, optionally, in one embodiment of the present invention, generalizing the seed instruction includes: determining a generalization example; and, based on the large language model's in-context learning from the generalization example, using the large language model to generalize the sentences in the seed instruction.

In this embodiment, the wording of the sentences in the seed instruction can be generalized through the ICL (In-Context Learning) capability of a large language model: given a task description or a few examples, the model can understand the input task and produce a result. For example, when the preset question of a seed instruction is generalized in this embodiment, the following generalization example can first be determined:

Please generalize the following statements:

Which two companies jointly developed "Samurai Warriors 3"?

Which company developed Warriors 3?

In the above generalization example, "Please generalize the following statements:" is the generalization task description, "Which two companies jointly developed "Samurai Warriors 3"?" is the generalization task, and "Which company developed Warriors 3?" is the generalization result. In generalization examples used to train the large language model, the generalization result can be worded more concisely and more colloquially. Multiple generalization tasks and generalization results can be designed for a generalization example, and different generalization styles can be designed for different user groups, to improve the model's generalization ability after learning. The trained model then produces richer and more varied generalizations of the sentences in the seed instruction, so that the seed instructions are expanded into a large number of high-quality complex instructions that serve as training data, further improving the understanding ability of the large language model.
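A prompt built from such a generalization example might be assembled as in the sketch below. The layout (description, then task and result pairs, then the new sentence) mirrors the example above, but the function names are hypothetical and `stub_llm` is a stand-in callable, not a real model API.

```python
from typing import Callable, List, Tuple

TASK_DESCRIPTION = "Please generalize the following statements:"


def build_generalization_prompt(examples: List[Tuple[str, str]], target: str) -> str:
    """Lay out the task description, the worked examples, then the sentence to rewrite."""
    lines = [TASK_DESCRIPTION]
    for task, result in examples:
        lines.extend([task, result])  # generalization task, then its result
    lines.append(target)              # the model is expected to continue from here
    return "\n".join(lines)


def generalize_sentence(target: str, examples: List[Tuple[str, str]],
                        llm: Callable[[str], str]) -> str:
    """Ask the model for one reworded version of the target sentence."""
    return llm(build_generalization_prompt(examples, target)).strip()


examples = [
    ("Which two companies jointly developed Samurai Warriors 3?",
     "Which company developed Warriors 3?"),
]


def stub_llm(prompt: str) -> str:
    # Stand-in for a real model call; returns a fixed placeholder string.
    return " (model output would appear here) "


print(build_generalization_prompt(examples, "How many main stories does Samurai Warriors 3 have?"))
print(generalize_sentence("How many main stories does Samurai Warriors 3 have?", examples, stub_llm))
```

Passing the model in as a callable keeps the prompt layout testable without committing to any particular model service.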

After the seed instructions are generalized, the high-quality generalization results can be further selected as training data. Optionally, in one embodiment of the present invention, after the seed instruction is generalized, the method further includes: scoring the generalized seed instructions, and selecting the seed instructions whose score exceeds a preset threshold as training data. For example, a large language model can score the generalized seed instructions, low-scoring data is discarded, and data whose score exceeds the preset threshold is kept as training data, which improves the quality of the training data finally generated.
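The threshold filter can be sketched as follows. The scorer is passed in as a callable because the text leaves the scoring model open. The function name, the scores, and the threshold value are made up for illustration.

```python
from typing import Callable, List


def select_training_data(candidates: List[str],
                         score_fn: Callable[[str], float],
                         threshold: float) -> List[str]:
    """Keep only candidates whose score strictly exceeds the preset threshold."""
    return [candidate for candidate in candidates if score_fn(candidate) > threshold]


# Made-up scores standing in for a large language model's ratings.
stub_scores = {
    "Which company developed Warriors 3?": 0.9,
    "Warriors 3 company": 0.4,
    "Who made Samurai Warriors 3?": 0.6,
}

selected = select_training_data(list(stub_scores), stub_scores.__getitem__, threshold=0.6)
print(selected)
```

The comparison is strict, following the wording "exceeds a preset threshold", so a candidate scoring exactly at the threshold is dropped.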

When the preset content is determined from the initial content, the preset answer from the initial answer, and the preset question from the initial question, an existing data structure can also be adopted. Optionally, in one embodiment of the present invention, the target format is any one of JSON, YAML, and XML. For example, NLP (Natural Language Processing) datasets of different task types are organized in JSON format to determine the preset content, preset answer, and preset question, which are combined with the preset task to obtain the seed instruction. Conversion scripts between the JSON, YAML, and XML formats then convert the result into the required data structure format, which further improves how structured and general-purpose the data making up the seed instruction is.
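One possible conversion script, limited to the Python standard library, is sketched below for a flat record between a dictionary (the JSON form) and XML. The root tag "record" and both function names are assumptions. YAML is left out because it would need a third-party package.

```python
import json
import xml.etree.ElementTree as ET


def record_to_xml(record: dict, root_tag: str = "record") -> str:
    """Write each field as a child element; field names must be valid XML tag names."""
    root = ET.Element(root_tag)
    for field, value in record.items():
        ET.SubElement(root, field).text = str(value)
    return ET.tostring(root, encoding="unicode")


def xml_to_record(xml_text: str) -> dict:
    """Read the child elements back into a flat dict; all values come back as strings."""
    return {child.tag: child.text for child in ET.fromstring(xml_text)}


preset_answer = {"id": "DEV_0_QUERY_0", "text": "Koei and ω-force", "answer_start": 10}
as_json = json.dumps(preset_answer, ensure_ascii=False)
as_xml = record_to_xml(json.loads(as_json))
print(as_json)
print(as_xml)
print(xml_to_record(as_xml))
```

XML carries no type information here, so "answer_start" returns as the string "10". A real conversion script would need a rule for restoring numeric fields.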

Because the preset task must state explicitly the task type and the work to be done on the training data, optionally, in one embodiment of the present invention, the preset task contains a task description and task guidance. The task description contains a task example and a description of the target format, and the task guidance provides guiding prompts for the preset content, the preset question, and the preset answer in turn. The task description thus presents the task type and how the task is carried out, and the guiding prompts combine the preset content, preset question, and preset answer. For example, the following templated preset task can be designed:

In the above table, the first cell is the task description, and the second and third cells are the task guidance. The first cell contains a task example and a description of the target format, and can explain and present in detail the task type of the preset task and how the task is carried out. The second cell guides the presentation of the preset content and preset question: for example, replacing "input" in the second cell with the preset content and preset question combines them with the preset task. The third cell guides the presentation of the preset answer: for example, replacing "output" in the third cell with the preset answer combines it with the preset task. This templated configuration of the preset task further improves the modularization, structuring, and formatting of the data. Seed instructions constructed in this way are easy to process and integrate further, so that a large amount of high-quality training data can be generated quickly and efficiently.
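The three-cell template can be sketched as a list whose second and third cells hold the placeholders "input" and "output". The task description text below is hypothetical, since the original table is not reproduced in this text. The function name and the JSON layout of the filled cells are likewise assumptions.

```python
import json

# Hypothetical task description; the wording of the original first cell is not available.
TASK_DESCRIPTION = (
    "Task: answer the question using the given content. "
    "Input and output are JSON objects."
)

# First cell: task description. Second and third cells: task guidance placeholders.
TEMPLATE_CELLS = [TASK_DESCRIPTION, "input", "output"]


def build_seed_instruction(preset_content: dict, preset_question: dict, preset_answer: dict) -> list:
    """Fill the placeholders to combine the preset task with content, question and answer."""
    input_text = json.dumps(
        {"content": preset_content, "question": preset_question}, ensure_ascii=False
    )
    output_text = json.dumps(preset_answer, ensure_ascii=False)
    description, input_cell, output_cell = TEMPLATE_CELLS
    return [
        description,
        input_cell.replace("input", input_text),
        output_cell.replace("output", output_text),
    ]


seed_instruction = build_seed_instruction(
    {"id": "DEV_0", "context": "Samurai Warriors 3 was developed by Koei and ω-force."},
    {"id": "DEV_0_QUERY_0", "question": "Which two companies jointly developed Samurai Warriors 3?"},
    {"id": "DEV_0_QUERY_0", "text": "Koei and ω-force", "answer_start": 10},
)
for cell in seed_instruction:
    print(cell)
```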

In a second aspect, an embodiment of the present invention further provides a device for constructing complex instruction training data for model training, which is capable of generating high-quality training data.

As shown in Figure 2, the device for constructing complex instruction training data for model training provided in an embodiment of the present application may include: an acquisition unit 11, configured to acquire initial training data of a large language model, the initial training data containing initial content, an initial answer extracted from the initial content, and an initial question corresponding to the initial answer; a determination unit 12, configured to determine preset content, a preset answer, and a preset question based on the initial content, the initial answer, and the initial question; a combination unit 13, configured to combine a preset task with the preset content, preset answer, and preset question to obtain a seed instruction; and a generalization unit 14, configured to generalize the seed instruction to obtain training data.

The device for constructing complex instruction training data for model training provided in the embodiment of the present invention acquires initial training data of a large language model, where the initial training data contains initial content, an initial answer extracted from the initial content, and an initial question corresponding to the initial answer. It then determines preset content, a preset answer, and a preset question based on the initial content, initial answer, and initial question, combines a preset task with the preset content, preset answer, and preset question to obtain a seed instruction, and generalizes the seed instruction to obtain training data. This approach expands the initial training data of a large language model, generates high-quality training data, and increases the diversity and richness of the training data.
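The four units can be read as a small pipeline, sketched below with one function per unit. Everything here is illustrative: the record keys, the id scheme, and the `rewrite` callable (a stand-in for the model-based generalization step) are assumptions, and the "answer_start" field is omitted to keep the sketch short.

```python
import copy
from typing import Callable, Dict, List


def acquire(dataset: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Acquisition unit 11: keep records that carry content, question and answer."""
    required = {"content", "question", "answer"}
    return [record for record in dataset if required <= record.keys()]


def determine(index: int, record: Dict[str, str]) -> dict:
    """Determination unit 12: wrap each part in its id-tagged fields (assumed id scheme)."""
    content_id = f"DEV_{index}"
    question_id = f"{content_id}_QUERY_0"
    return {
        "content": {"id": content_id, "context": record["content"]},
        "question": {"id": question_id, "question": record["question"]},
        "answer": {"id": question_id, "text": record["answer"]},
    }


def combine(preset_task: str, preset: dict) -> dict:
    """Combination unit 13: attach the preset task to form a seed instruction."""
    return {
        "task": preset_task,
        "input": {"content": preset["content"], "question": preset["question"]},
        "output": preset["answer"],
    }


def generalize(seed: dict, rewrite: Callable[[str], List[str]]) -> List[dict]:
    """Generalization unit 14: one training item per rewording of the question."""
    variants = []
    for wording in rewrite(seed["input"]["question"]["question"]):
        variant = copy.deepcopy(seed)
        variant["input"]["question"]["question"] = wording
        variants.append(variant)
    return variants


def construct_training_data(dataset, preset_task, rewrite) -> List[dict]:
    training_data = []
    for index, record in enumerate(acquire(dataset)):
        training_data.extend(generalize(combine(preset_task, determine(index, record)), rewrite))
    return training_data


dataset = [
    {"content": "Samurai Warriors 3 was developed by Koei and ω-force.",
     "question": "Which two companies jointly developed Samurai Warriors 3?",
     "answer": "Koei and ω-force"},
    {"content": "An incomplete record with no question or answer."},
]
# Stand-in rewriter: keeps the original wording and adds one fixed variant.
training_data = construct_training_data(
    dataset, "Answer the question from the content.",
    lambda question: [question, "Which company developed Warriors 3?"],
)
print(len(training_data))
```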

Optionally, in one embodiment of the present invention, the determination unit includes: an extraction module, configured to extract the initial content, initial answer, and initial question from the initial training data; and a format determination module, configured to determine, in accordance with a target format, the preset content from the initial content, the preset answer from the initial answer, and the preset question from the initial question.

Optionally, in one embodiment of the present invention, the format determination module includes a preset content determination sub-block, which is configured to: obtain the initial content; use the id of the initial content as the first field content, and configure a corresponding first field to represent the first field content; use the initial content as the second field content, and configure a second field to represent the second field content; and combine the first field with the first field content and the second field with the second field content to obtain the preset content.

Optionally, in one embodiment of the present invention, the format determination module further includes a preset question determination sub-block, which is configured to: obtain the initial question; use the id of the initial question as the third field content, and configure a corresponding third field to represent the third field content; use the initial question as the fourth field content, and configure a corresponding fourth field to represent the fourth field content; and combine the third field with the third field content and the fourth field with the fourth field content to obtain the preset question.

Optionally, in one embodiment of the present invention, the format determination module further includes a preset answer determination sub-block, which is configured to: obtain the initial answer; use the id of the initial question corresponding to the initial answer as the fifth field content, and configure a corresponding fifth field to represent the fifth field content; use the initial answer as the sixth field content, and configure a corresponding sixth field to represent the sixth field content; use the starting position of the initial answer in the initial content as the seventh field content, and configure a corresponding seventh field to represent the seventh field content; and combine the fifth field with the fifth field content, the sixth field with the sixth field content, and the seventh field with the seventh field content to obtain the preset answer.

Optionally, in one embodiment of the present invention, the generalization unit includes: a generalization example module, configured to determine a generalization example; and a sentence generalization module, configured to use the large language model to generalize the sentences in the seed instruction, based on the model's in-context learning from the generalization example.

Optionally, in one embodiment of the present invention, the generalization unit further includes: a scoring module, configured to score the generalized seed instructions and select the seed instructions whose score exceeds a preset threshold as training data.

Optionally, in one embodiment of the present invention, the target format is any one of JSON, YAML, and XML.

Optionally, in one embodiment of the present invention, the preset task contains a task description and task guidance, the task description contains a task example and a description of the target format, and the task guidance provides guiding prompts for the preset content, the preset question, and the preset answer in turn.

In a third aspect, an embodiment of the present invention further provides an electronic device.

As shown in Figure 3, the electronic device provided in the embodiment of the present application includes: a housing 51, a processor 52, a memory 53, a circuit board 54, and a power supply circuit 55. The circuit board 54 is placed inside the space enclosed by the housing 51, and the processor 52 and the memory 53 are arranged on the circuit board 54. The power supply circuit 55 supplies power to the circuits and components of the electronic device. The memory 53 stores executable program code. The processor 52 reads the executable program code stored in the memory 53 and runs the corresponding program, so as to execute any one of the methods for constructing complex instruction training data for model training provided in the embodiments of the present invention.

For the specific way in which the processor 52 executes the above steps, and for the steps the processor 52 further executes by running the executable program code, refer to the description of the foregoing embodiments. Details are not repeated here.

The above electronic device exists in many forms, including but not limited to:

(1) Mobile communication devices: these devices have mobile communication functions, and their main purpose is to provide voice and data communication. Such terminals include smartphones (such as the iPhone), multimedia phones, feature phones, and low-end phones.

(2) Ultra-mobile personal computer devices: these devices belong to the category of personal computers, have computing and processing capabilities, and generally also support mobile Internet access. Such terminals include PDA, MID, and UMPC devices, such as the iPad.

(3) Portable entertainment devices: these devices can display and play multimedia content. They include audio and video players (such as the iPod), handheld game consoles, e-book readers, smart toys, and portable in-car navigation devices.

(4) Servers: devices that provide computing services. A server consists of a processor, hard disk, memory, system bus, and so on. Its architecture is similar to that of a general-purpose computer, but because it must provide highly reliable services, it has higher requirements for processing power, stability, reliability, security, scalability, and manageability.

(5) Other electronic devices with data interaction functions.

In a fourth aspect, an embodiment of the present application further provides a computer-readable storage medium that stores one or more programs. The one or more programs can be executed by one or more processors to implement any one of the methods for constructing complex instruction training data for model training provided in the embodiments of the present invention, and can therefore achieve the corresponding technical effects, which have been described in detail above and are not repeated here.

It should be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between these entities or operations. Moreover, the terms "include", "comprise", and any other variants are intended to cover non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, article, or device that includes the element.

The embodiments in this specification are described in a related manner. For parts that are the same or similar across embodiments, the embodiments may be referred to one another, and each embodiment focuses on its differences from the others.

In particular, because the device embodiment is essentially similar to the method embodiment, it is described more briefly. For related details, refer to the corresponding parts of the method embodiment.

For convenience of description, the above device is described as divided by function into separate units and modules. When the present invention is implemented, the functions of the units and modules may of course be realized in one or more pieces of software and/or hardware.

Those of ordinary skill in the art will understand that all or part of the processes of the above method embodiments can be carried out by a computer program that instructs the relevant hardware. The program can be stored in a computer-readable storage medium and, when executed, can include the processes of the above method embodiments. The storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), or the like.

The above are only specific embodiments of the present invention, and the protection scope of the present invention is not limited to them. Any change or substitution that a person skilled in the art could readily conceive within the technical scope disclosed by the present invention shall fall within the protection scope of the present invention. The protection scope of the present invention shall therefore be determined by the protection scope of the claims.

Claims (10)

Application number: CN202410494963.5A
Priority date: 2024-04-23
Filing date: 2024-04-23
Title: Method, device, electronic device and storage medium for constructing complex instruction training data for model training
Status: Active
Publications: CN118228839A (published 2024-06-21), CN118228839B (published 2025-05-06)
Family ID: 91510251
Country: CN

Also Published As

Publication numberPublication date
CN118228839B (en)2025-05-06

Similar Documents

Publication | Publication Date | Title
KR20210061141A (en) | | Method and apparatus for processimg natural languages
US10853716B2 (en) | | Systems and methods for a mathematical chat bot
CN110164435A (en) | | Audio recognition method, device, equipment and computer readable storage medium
US20110015920A1 (en) | | Apparatus for chinese language education and method thereof
CN111090727A (en) | | Language conversion processing method, device and dialect voice interaction system
CN111553138B (en) | | Assisted writing method and device for standardizing content structure documents
CN112685550B (en) | | Intelligent question-answering method, intelligent question-answering device, intelligent question-answering server and computer readable storage medium
CN113505198A (en) | | Keyword-driven generative dialogue reply method, device and electronic device
CN108133168B (en) | | Formula searching method and device in text recognition
CN116975437A (en) | | Information processing method, apparatus, device, storage medium, and program product
CN108345612A (en) | | A kind of question processing method and device, a kind of device for issue handling
CN113127621A (en) | | Dialogue module pushing method, device, equipment and storage medium
CN113220854B (en) | | Intelligent dialogue method and device for machine reading and understanding
CN118246474A (en) | | Tool routing method and device
CN113535916A (en) | | Question and answer method and device based on table and computer equipment
CN118228839A (en) | | Method, device, electronic device and storage medium for constructing complex instruction training data for model training
CN108255798A (en) | | A method and device for inputting Lateh format formulas
CN117649857A (en) | | Zero-sample audio classification model training method and zero-sample audio classification method
CN117612051A (en) | | A dance teaching video generation method, system, terminal and storage medium
CN108572956B (en) | | Method and device for calling knowledge point slice
CN113761147B (en) | | Questionnaire question display method, device and electronic device based on logic editor
CN110942775B (en) | | Data processing method and device, electronic equipment and storage medium
CN110377915B (en) | | Text emotion analysis method and device, storage medium and equipment
CN112328871A (en) | | Reply generation method, device, equipment and storage medium based on RPA module
CN113807148B (en) | | Text recognition matching method and device and terminal equipment

Legal Events

Date | Code | Title | Description
 | PB01 | Publication |
 | SE01 | Entry into force of request for substantive examination |
 | GR01 | Patent grant |
