CN118536511A - A text processing method, device, equipment and medium based on large model - Google Patents

A text processing method, device, equipment and medium based on large model

Publication number
CN118536511A
Authority
CN
China
Prior art keywords
text
enhanced
preset
similarity
texts
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410367189.1A
Other languages
Chinese (zh)
Inventor
田羽慧
刘微
孟卫明
杜兆臣
杨成喆
刘敏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hisense Group Holding Co Ltd
Original Assignee
Hisense Group Holding Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hisense Group Holding Co Ltd
Priority to CN202410367189.1A
Publication of CN118536511A


Abstract

The present application relates to the field of artificial intelligence, and in particular to a large model-based text processing method, apparatus, device and medium. In the embodiments of the present application, for each preset intent, each enhanced text corresponding to that intent is obtained, and the glyph similarity and semantic similarity between every two enhanced texts are determined. Enhanced texts whose glyph similarity is greater than a first preset threshold are determined as first candidate texts, and enhanced texts whose semantic similarity is less than a second preset threshold are determined as second candidate texts. The enhanced texts that appear among both the first and the second candidate texts are determined as abnormal texts. In other words, similarity comparison screens out abnormal texts that, under the same preset intent, are close in form but differ considerably in meaning. The enhanced texts are thus quality-inspected automatically, without manual participation, which improves text processing efficiency and saves manpower and material resources.

Description

A text processing method, apparatus, device and medium based on a large model

Technical Field

The present application relates to the field of artificial intelligence, and in particular to a large model-based text processing method, apparatus, device and medium.

Background Art

Text enhancement is currently an important way of obtaining text data. Its core requirement is to keep the semantics unchanged, which is also its greatest challenge, especially for short texts. Whichever text enhancement method in the related art is used, the process suffers from low efficiency and low accuracy, and may produce enhanced texts with grammatical, semantic or other language errors. The enhanced texts therefore have to be quality-inspected manually, which consumes manpower and material resources.

How to quality-inspect text automatically has therefore become an urgent problem.

Summary of the Invention

The embodiments of the present application provide a large model-based text processing method, apparatus, device and medium, to solve the prior-art problem that manual quality inspection of enhanced texts is time-consuming and labor-intensive.

In a first aspect, the present application provides a large model-based text processing method, the method comprising:

for each preset intent, obtaining each enhanced text corresponding to the preset intent; determining the glyph similarity and the semantic similarity between every two enhanced texts, wherein the glyph similarity describes the degree of similarity of the words included in the two corresponding enhanced texts; determining the enhanced texts whose glyph similarity is greater than a first preset threshold as first candidate texts, and determining the enhanced texts whose semantic similarity is less than a second preset threshold as second candidate texts; and

determining the enhanced texts that are the same in the first candidate texts and the second candidate texts as abnormal texts.

In a second aspect, the present application provides a large model-based text processing apparatus, the apparatus comprising:

an acquisition module, configured to obtain, for each preset intent, each enhanced text corresponding to the preset intent;

a similarity comparison module, configured to determine the glyph similarity and the semantic similarity between every two enhanced texts, wherein the glyph similarity describes the degree of similarity of the words included in the two corresponding enhanced texts; to determine the enhanced texts whose glyph similarity is greater than a first preset threshold as first candidate texts; and to determine the enhanced texts whose semantic similarity is less than a second preset threshold as second candidate texts; and

a determination module, configured to determine the enhanced texts that are the same in the first candidate texts and the second candidate texts as abnormal texts.

In a third aspect, the present application further provides an electronic device comprising a processor, wherein the processor, when executing a computer program stored in a memory, implements the steps of any of the large model-based text processing methods described above.

In a fourth aspect, the present application further provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of any of the large model-based text processing methods described above.

In the embodiments of the present application, for each preset intent, each enhanced text corresponding to that intent is obtained, and the glyph similarity and semantic similarity between every two enhanced texts are determined. Enhanced texts whose glyph similarity is greater than the first preset threshold are determined as first candidate texts, and enhanced texts whose semantic similarity is less than the second preset threshold are determined as second candidate texts. The enhanced texts that appear among both the first and the second candidate texts are determined as abnormal texts. In other words, similarity comparison screens out abnormal texts that, under the same preset intent, are close in form but differ considerably in meaning. The enhanced texts are thus quality-inspected automatically, without manual participation, which improves text processing efficiency and saves manpower and material resources.

Brief Description of the Drawings

To illustrate the technical solutions of the present application more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present application; a person of ordinary skill in the art can derive other drawings from them without creative effort.

FIG. 1 is a schematic flowchart of a large model-based text processing process provided by an embodiment of the present application;

FIG. 2 is a schematic diagram of determining a glyph feature vector provided by an embodiment of the present application;

FIG. 3 is a schematic diagram of a text enhancement process provided by an embodiment of the present application;

FIG. 4 is a schematic diagram of a large model text enhancement architecture provided by an embodiment of the present application;

FIG. 5 is a schematic diagram of a text review process provided by an embodiment of the present application;

FIG. 6 is a schematic diagram of a large model-based text review architecture provided by an embodiment of the present application;

FIG. 7 is a schematic diagram of a text processing process provided by an embodiment of the present application;

FIG. 8 is a schematic structural diagram of a large model-based text processing apparatus provided by an embodiment of the present application;

FIG. 9 is a schematic structural diagram of an electronic device provided by an embodiment of the present application.

Detailed Description

To make the purpose, technical solutions and advantages of the present application clearer, the technical solutions of the embodiments are described clearly and completely below with reference to the drawings. The described embodiments are only some of the embodiments of the present application, not all of them. All other embodiments obtained by a person of ordinary skill in the art on the basis of the embodiments in the present application fall within the scope of protection of the present application.

The embodiments of the present application provide a large model-based text processing method, apparatus, device and medium. In the method, for each preset intent, each enhanced text corresponding to the preset intent is obtained; the glyph similarity and semantic similarity between every two enhanced texts are determined, where the glyph similarity describes the degree of similarity of the words included in the two corresponding enhanced texts; the enhanced texts whose glyph similarity is greater than a first preset threshold are determined as first candidate texts, and the enhanced texts whose semantic similarity is less than a second preset threshold are determined as second candidate texts; and the enhanced texts that are the same in the first and second candidate texts are determined as abnormal texts.

FIG. 1 is a schematic flowchart of a large model-based text processing process provided by an embodiment of the present application. As shown in FIG. 1, the process includes the following steps:

S101: For each preset intent, obtain each enhanced text corresponding to the preset intent; determine the glyph similarity and semantic similarity between every two enhanced texts, where the glyph similarity describes the degree of similarity of the words included in the two corresponding enhanced texts.

The large model-based text processing method provided by the present application is applied to an electronic device, which may be a PC, a mobile terminal, a server, or the like.

Texts with the same intent generally have similar semantics and differ only in how they are expressed. Therefore, to screen the enhanced texts and identify those that may be abnormal, in the embodiments of the present application, when text is quality-inspected, each enhanced text corresponding to each preset intent may be obtained. An enhanced text may be a text obtained by applying text enhancement to some original text.

After each enhanced text corresponding to the preset intent has been obtained, the glyph similarity and semantic similarity between every two enhanced texts may be determined.

Glyph similarity describes how similar the words included in the two corresponding enhanced texts are; that is, it identifies which of the enhanced texts obtained for the same preset intent look alike in form. For example, the enhanced text "我爱中国" ("I love China") includes three words: "我" (I), "爱" (love) and "中国" (China). As another example, "青岛的天气挺好" ("The weather in Qingdao is quite good") and "青岛的天气不好" ("The weather in Qingdao is not good") differ by only one character, so the two enhanced texts are similar in form and the glyph similarity between them is relatively high. In the embodiments of the present application, the glyph similarity between two enhanced texts may be determined by counting the words that differ between the two texts and taking the ratio of that count to the total number of words in the two texts, or it may be determined by a pre-trained similarity determination model.
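A word-level similarity of this kind can be sketched in a few lines. The sketch below is only an illustration under stated assumptions: the texts are assumed to be already segmented into words (the segmentation shown is hypothetical), and the score used is the Jaccard index, that is, the share of distinct words common to both texts. This is a swapped-in measure rather than the differing-word ratio named above; it is chosen so that a higher score means more similar texts, which is the direction the greater-than threshold in S102 expects.

```python
def overlap_similarity(words_a, words_b):
    """Jaccard index of two word lists: shared distinct words / all distinct words."""
    set_a, set_b = set(words_a), set(words_b)
    if not set_a and not set_b:
        return 1.0  # two empty texts are treated as identical
    return len(set_a & set_b) / len(set_a | set_b)


# Hypothetical word segmentation of the two example sentences.
good = ["青岛", "的", "天气", "挺", "好"]
bad = ["青岛", "的", "天气", "不", "好"]

# Four of the six distinct words are shared.
print(overlap_similarity(good, bad))
```

The two sentences share four of their six distinct words, so they score as close in form even though their meanings are opposite.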

Semantic similarity describes how similar the meanings of the two corresponding enhanced texts are. In the embodiments of the present application, a pre-trained semantic extraction model may be used to extract the semantic features of the two enhanced texts, such as semantic feature vectors or semantic feature matrices; the similarity between the two semantic features is then calculated and taken as the semantic similarity.

It should be noted that a person skilled in the art may choose the methods for determining glyph similarity and semantic similarity as needed; the embodiments of the present application do not limit this.

S102: Determine the enhanced texts whose glyph similarity is greater than the first preset threshold as first candidate texts, and determine the enhanced texts whose semantic similarity is less than the second preset threshold as second candidate texts.

Text enhancement aims to produce, for the same preset intent, enhanced texts whose semantics are as similar as possible but whose wording is as different as possible. To screen out enhanced texts that are similar in form but different in meaning, the embodiments of the present application store a first preset threshold and a second preset threshold, each of which may be any positive number less than 1. After every glyph similarity and semantic similarity has been determined, the enhanced texts whose glyph similarity is greater than the first preset threshold may be determined as first candidate texts. That is, among the enhanced texts corresponding to the preset intent, those that are similar in form are selected. Because the first preset threshold is meant to select texts that are similar in form, it may be set to a relatively large value, such as 0.7, 0.88 or 0.91; a person skilled in the art may configure the value as needed.

While the first candidate texts are being selected, the enhanced texts whose semantic similarity is less than the second preset threshold may be determined as second candidate texts. That is, among the enhanced texts corresponding to the preset intent, those whose meanings differ considerably are selected. Because the second preset threshold is meant to select texts whose meanings differ considerably, it may be set to a relatively small value, such as 0.23, 0.3, 0.1 or 0.5; a person skilled in the art may configure the value as needed.

S103: Determine the enhanced texts that are the same in the first candidate texts and the second candidate texts as abnormal texts.

After the first candidate texts, which are similar in form, and the second candidate texts, which differ considerably in meaning, have been selected, the enhanced texts that appear in both sets are determined as abnormal texts. An abnormal text is thus one that is similar in form to another text but differs considerably from it in meaning. For example, "青岛的天气挺好" ("The weather in Qingdao is quite good") and "青岛的天气不好" ("The weather in Qingdao is not good") are similar in form but differ considerably in meaning.

After the abnormal texts have been determined, they may be saved in a preset storage space, or displayed so that the relevant staff can review them further.
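Steps S101 to S103 can be sketched as a pairwise loop followed by a set intersection. This is a minimal sketch, not the patented implementation: the function name `find_abnormal_texts` and its parameters are hypothetical, the two similarity functions are passed in as callables, and it assumes that both texts of a qualifying pair are added to the corresponding candidate set. The default thresholds 0.7 and 0.3 are taken from the example values mentioned above.

```python
from itertools import combinations


def find_abnormal_texts(texts, glyph_sim, semantic_sim,
                        glyph_threshold=0.7, semantic_threshold=0.3):
    """Return the texts of one intent that fall into both candidate sets."""
    first_candidates = set()   # texts in a pair with high glyph similarity
    second_candidates = set()  # texts in a pair with low semantic similarity
    for a, b in combinations(texts, 2):
        if glyph_sim(a, b) > glyph_threshold:
            first_candidates.update((a, b))
        if semantic_sim(a, b) < semantic_threshold:
            second_candidates.update((a, b))
    # S103: abnormal texts are those present in both candidate sets.
    return first_candidates & second_candidates


# Toy similarity tables standing in for real similarity models.
toy_glyph = {frozenset({"A", "B"}): 0.9,
             frozenset({"A", "C"}): 0.2,
             frozenset({"B", "C"}): 0.1}
toy_semantic = {frozenset({"A", "B"}): 0.1,
                frozenset({"A", "C"}): 0.8,
                frozenset({"B", "C"}): 0.9}

abnormal = find_abnormal_texts(
    ["A", "B", "C"],
    lambda a, b: toy_glyph[frozenset({a, b})],
    lambda a, b: toy_semantic[frozenset({a, b})],
)
print(sorted(abnormal))  # ['A', 'B']
```

In this toy data only the pair A and B is both close in form and far apart in meaning, so those two texts are flagged for review.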

In the embodiments of the present application, for each preset intent, each enhanced text corresponding to that intent is obtained, and the glyph similarity and semantic similarity between every two enhanced texts are determined. Enhanced texts whose glyph similarity is greater than the first preset threshold are determined as first candidate texts, and enhanced texts whose semantic similarity is less than the second preset threshold are determined as second candidate texts. The enhanced texts that appear among both the first and the second candidate texts are determined as abnormal texts. In other words, similarity comparison screens out abnormal texts that, under the same preset intent, are close in form but differ considerably in meaning. The enhanced texts are thus quality-inspected automatically, without manual participation, which improves text processing efficiency and saves manpower and material resources.

To improve the accuracy of text processing, on the basis of the above embodiment, in the embodiments of the present application the process of determining the glyph similarity between every two enhanced texts includes:

determining the similarity between the glyph feature vectors of every two enhanced texts, where a glyph feature vector describes the words included in the corresponding enhanced text.

When the glyph similarity between every two enhanced texts is determined, the glyph feature vector of each enhanced text may be extracted using a pre-trained text feature extraction model. The glyph feature vector describes the words included in the corresponding enhanced text; that is, it describes which words the enhanced text contains.

After the glyph feature vector of each enhanced text has been determined, the similarity between the glyph feature vectors of every two enhanced texts can be determined. This similarity describes whether the words included in the two enhanced texts are similar: the more similar the words, the greater the similarity, and vice versa.
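The text does not name the vector similarity measure. One common choice is cosine similarity, sketched below as an assumption; the function name `cosine_similarity` is illustrative only.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention used here: a zero vector is similar to nothing
    return dot / (norm_u * norm_v)


print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0
```

For vectors with non-negative entries, such as frequency-based feature vectors, the result lies between 0 and 1, which fits thresholds that are positive numbers less than 1.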

To further improve the accuracy of text processing, on the basis of the above embodiments, in the embodiments of the present application, determining the glyph feature vector of each enhanced text includes:

for each word in each enhanced text, determining a first frequency with which the word appears in the enhanced text to which it belongs; determining a second frequency according to a first number, which is the number of enhanced texts corresponding to the preset intent, and a second number, which is the number of enhanced texts containing the word; and determining, according to the first frequency and the second frequency, a target frequency of the word in the enhanced text to which it belongs; and

determining the glyph feature vector of each enhanced text according to the target frequency of each word included in that enhanced text and the target word corresponding to each item of a preset vector.

In the embodiments of the present application, when the glyph feature vector of each enhanced text is determined, the first frequency of each word may be determined for each enhanced text under the same preset intent. The first frequency is the frequency with which the word appears in the enhanced text to which it belongs, and may be determined by the following formula:

$$tf_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}}$$

where $tf_{i,j}$ is the first frequency of word $i$ in enhanced text $j$; $n_{i,j}$ is the number of times word $i$ appears in enhanced text $j$; $k$ ranges over the distinct words in enhanced text $j$, so $n_{k,j}$ is the number of times the $k$-th distinct word appears in enhanced text $j$, and $\sum_{k} n_{k,j}$ is the total number of word occurrences in enhanced text $j$. In words: first frequency = number of times the word appears in the enhanced text / total number of word occurrences in the enhanced text.

To determine how important each word of an enhanced text is among all the enhanced texts corresponding to the preset intent, the second frequency may be determined according to the first number, which is the number of enhanced texts corresponding to the preset intent, and the second number, which is the number of enhanced texts containing the word. The second frequency may be determined by the following formula, shown here in the standard logarithmic form of inverse document frequency that matches the symbols defined below:

$$idf_{i,j} = \log \frac{|D|}{|\{j : i \in d_j\}|}$$

where $idf_{i,j}$ is the second frequency of word $i$ in enhanced text $j$; $|D|$ is the number of enhanced texts corresponding to the preset intent; and $|\{j : i \in d_j\}|$ is the number of those enhanced texts that include word $i$. In words: second frequency = log(number of enhanced texts for the preset intent / number of those enhanced texts that contain the word).

After the first frequency and the second frequency of the word have been determined, the target frequency of the word in the enhanced text to which it belongs can be determined from them. Specifically, the target frequency may be the product of the first frequency and the second frequency, or the sum of the first frequency and the second frequency.
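The three quantities can be computed together. The following is a minimal sketch under explicit assumptions: the texts are already segmented into word lists, the second frequency takes the standard logarithmic inverse-document-frequency form, and the target frequency is the product of the two (the sum is the other option mentioned above). The function name `target_frequencies` is hypothetical.

```python
import math
from collections import Counter


def target_frequencies(tokenized_texts):
    """Return one {word: first_frequency * second_frequency} dict per text."""
    num_texts = len(tokenized_texts)  # first number: texts for this intent
    # Second number: how many texts contain each word at least once.
    texts_containing = Counter(
        word for words in tokenized_texts for word in set(words)
    )
    result = []
    for words in tokenized_texts:
        counts = Counter(words)
        total = sum(counts.values())
        result.append({
            word: (n / total) * math.log(num_texts / texts_containing[word])
            for word, n in counts.items()
        })
    return result


freqs = target_frequencies([["a", "b"], ["a", "c"]])
print(freqs[0])
```

Word "a" occurs in every text, so its second frequency is log(2/2) = 0 and its target frequency is 0, while "b" occurs in only one of the two texts and receives 0.5 × log 2.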

After the target frequency of each word included in each enhanced text has been determined, the glyph feature vector of each enhanced text can be determined according to those target frequencies and the target word corresponding to each item of the preset vector. In the embodiments of the present application, a vocabulary may be configured in advance that includes every frequently occurring word, with the words arranged in a certain order; each item of the preset vector then corresponds to one target word. That is, the first item of the preset vector corresponds to the first word in the ordering, the second item to the second word, the third item to the third word, and so on.

In one possible implementation, the following problems are taken into account. The vocabulary contains a large number of words, whereas the enhanced texts corresponding to a given preset intent contain relatively few, so a preset vector built from the whole vocabulary would have a large dimension and many useless items. The enhanced texts may also contain rare words that the vocabulary does not include. Therefore, before the glyph feature vector of each enhanced text is determined, the distinct words included in all the enhanced texts corresponding to the preset intent may be collected and arranged in a preset order, such as the order of their first letters. The first item of the preset vector then corresponds to the first word in the ordering, the second item to the second word, the third item to the third word, and so on. In other words, the dimension of the preset vector equals the number of distinct words in the enhanced texts corresponding to the preset intent.

When the glyph feature vector of an enhanced text is determined, the target frequency of each word included in that enhanced text is filled into the item of the preset vector that corresponds to that word. The items corresponding to words not included in the enhanced text are filled with a preset value, such as 0, 1 or any other value.

FIG. 2 is a schematic diagram of determining a glyph feature vector provided by an embodiment of the present application. As shown in FIG. 2, the items of the preset vector correspond to the target words A, B, C, D, E, F and G. The enhanced text AEF includes words A, E and F, whose target frequencies (referred to in this example as target probabilities) are probability 1, probability 2 and probability 3, respectively. When the glyph feature vector of this enhanced text is determined, each target probability is filled into the corresponding item of the preset vector, and the remaining items are filled with 0, yielding the glyph feature vector.
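The construction in FIG. 2 can be sketched as follows. This is an illustration, not the patented implementation: the function name `build_glyph_vectors` is hypothetical, the vocabulary is built from the texts of one intent and ordered by plain string sorting as a stand-in for the preset order, and the numeric target frequencies are made-up placeholders for probability 1, 2 and 3.

```python
def build_glyph_vectors(target_freqs, fill_value=0.0):
    """Turn per-text {word: target frequency} dicts into aligned vectors."""
    # One vector item per distinct word across all texts of the intent.
    vocabulary = sorted({word for freqs in target_freqs for word in freqs})
    vectors = [
        [freqs.get(word, fill_value) for word in vocabulary]
        for freqs in target_freqs
    ]
    return vocabulary, vectors


# Text "AEF" plus a second text that supplies the remaining words B, C, D, G.
vocabulary, vectors = build_glyph_vectors([
    {"A": 0.1, "E": 0.2, "F": 0.3},
    {"B": 0.4, "C": 0.1, "D": 0.2, "G": 0.3},
])
print(vocabulary)  # ['A', 'B', 'C', 'D', 'E', 'F', 'G']
print(vectors[0])  # [0.1, 0.0, 0.0, 0.0, 0.2, 0.3, 0.0]
```

Because every text of the intent is mapped onto the same ordered vocabulary, any two resulting vectors have the same dimension and can be compared item by item.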

To further improve the efficiency of text processing, before the target frequency of each word is determined, the common words in each enhanced text may be filtered out so that only the relatively important words remain, which reduces the amount of calculation. Which words count as common may be configured in advance.
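Such filtering amounts to removing a configured set of words before any frequency is computed. A minimal sketch, in which the function name `remove_common_words` and the contents of the common-word set are hypothetical:

```python
def remove_common_words(words, common_words):
    """Keep only the words that are not in the configured common-word set."""
    return [word for word in words if word not in common_words]


# Hypothetical pre-configured set of common words.
COMMON_WORDS = {"的"}

filtered = remove_common_words(["青岛", "的", "天气", "挺", "好"], COMMON_WORDS)
print(filtered)  # ['青岛', '天气', '挺', '好']
```

Using a set for the configured words keeps each membership check constant-time, which matters when many enhanced texts are processed per intent.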

为了进一步提高文本处理的准确率,在上述各实施例的基础上,在本申请实施例中,确定每两个增强文本之间的语义相似度的过程包括:In order to further improve the accuracy of text processing, based on the above embodiments, in the embodiment of the present application, the process of determining the semantic similarity between each two enhanced texts includes:

确定每两个增强文本的语义特征向量之间的相似度,其中所述语义特征向量为基于预先训练完成的特征提取模型对对应的增强文本进行处理后得到的。The similarity between the semantic feature vectors of every two enhanced texts is determined, wherein the semantic feature vectors are obtained by processing the corresponding enhanced texts based on a pre-trained feature extraction model.

When determining the semantic similarity between every two enhanced texts, in an embodiment of the present application, each enhanced text can be processed by a pre-trained feature extraction model to obtain its semantic feature vector. The pre-trained feature extraction model can be any model capable of feature extraction; it can be a deep learning model or a large model. For example, it can be a BERT (Bidirectional Encoder Representations from Transformers) model. BERT is a pre-trained language model (PLM): when it is used in a specific scenario, it does not need to be trained on a large corpus, which saves time, and it generalizes well. BERT is an end-to-end model that requires no changes to the network structure; only an output layer specific to the downstream task needs to be added at the end. Because it is based on the Transformer, it can be parallelized efficiently and stacked to a considerable depth, which makes full use of the characteristics of deep neural network (DNN) models and improves accuracy. BERT is a bidirectional model that is trained using context on both sides, which gives it better performance.

预训练是一种迁移学习的概念。所谓预训练模型,举个例子,假设我们有大量的维基百科数据,那么我们可以用这部分巨大的数据来训练一个泛化能力很强的模型,当我们需要在特定场景使用时,例如做医学命名实体识别,那么,只需要简单的修改一些输出层,再用我们自己的数据进行一个增量训练,对权重进行一个轻微的调整即可。预训练语言模型有很多,典型的如ELMO(Embedding from Language Models)、GPT(Generative Pre-trained Transformer)、BERT等。Pre-training is a concept of transfer learning. For example, if we have a large amount of Wikipedia data, we can use this huge amount of data to train a model with strong generalization ability. When we need to use it in a specific scenario, such as medical named entity recognition, we only need to simply modify some output layers, and then use our own data for incremental training and make a slight adjustment to the weights. There are many pre-trained language models, such as ELMO (Embedding from Language Models), GPT (Generative Pre-trained Transformer), BERT, etc.

本申请大模型可以理解为是基于transformer架构的模型;该大模型也可以理解为是具有庞大的参数规模和复杂程度的机器学习模型,例如,具有数百万到数十亿参数或者上百亿参数的神经网络模型;该大模型也可以理解为是通过半(弱)监督、全监督、自监督或者无监督等技术,在大规模训练数据上训练得到的一种深度学习模型。在本申请实施例中,大模型可以处理多种不同任务,在训练大模型时一般是基于某个目标任务领域的训练数据进行训练的,训练得到的大模型一般情况下可以被迁移到与目标任务领域相近的其他任务领域中进行使用。The large model of the present application can be understood as a model based on the transformer architecture; the large model can also be understood as a machine learning model with a huge parameter scale and complexity, for example, a neural network model with millions to billions of parameters or tens of billions of parameters; the large model can also be understood as a deep learning model trained on large-scale training data through semi-(weak) supervision, full supervision, self-supervision or unsupervised techniques. In an embodiment of the present application, the large model can handle a variety of different tasks. When training the large model, it is generally trained based on the training data of a certain target task field. The large model obtained by training can generally be migrated to other task fields similar to the target task field for use.

After the semantic feature vector of each enhanced text has been determined, the similarity between the semantic feature vectors of every two enhanced texts can be determined. This similarity describes how close the two enhanced texts are in meaning: the more similar their semantics, the greater the similarity, and the less similar their semantics, the smaller the similarity.
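The application does not name a specific similarity measure for the semantic feature vectors. One common choice is cosine similarity, sketched below as an assumption; the vectors passed in would come from the feature extraction model and are replaced here by small hand-written lists.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors.

    Returns 0.0 when either vector is all zeros, to avoid dividing by zero.
    """
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Stand-in vectors; real ones would be produced by the feature extraction model.
print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0
```

The same function could also be applied to the glyph feature vectors, since both are fixed-length numeric vectors.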

为了提高文本处理的效率,在上述各实施例的基础上,在本申请实施例中,所述获取该预设意图对应的每个增强文本,包括:In order to improve the efficiency of text processing, based on the above embodiments, in the embodiment of the present application, the step of obtaining each enhanced text corresponding to the preset intent includes:

获取第一提示文本以及针对该预设意图保存的原始文本,所述第一提示文本用于提示第一大模型对所述原始文本进行文本增强处理,所述文本增强处理包括词扩展、句子重组、语气语调变换中的至少一项;Acquire a first prompt text and an original text saved for the preset intent, wherein the first prompt text is used to prompt the first large model to perform text enhancement processing on the original text, wherein the text enhancement processing includes at least one of word expansion, sentence reorganization, and tone and intonation change;

将所述原始文本和所述第一提示文本按照预设格式输入到所述第一大模型,得到该预设意图对应的每个增强文本。The original text and the first prompt text are input into the first large model according to a preset format to obtain each enhanced text corresponding to the preset intention.

在本申请实施例中,在获取该预设意图对应的每个增强文本时,可以获取第一提示文本以及针对该预设意图保存的原始文本。其中,该第一提示文本用于提示第一大模型对获取到的原始文本进行文本增强处理。该第一提示文本可以是预先根据需要第一大模型所进行的处理编写的。该第一大模型可以是任意具备文本增强能力的大模型,如聊天生成式预训练转换器(Chat Generative Pre-trained Transformer,ChatGPT)、聊天通用语言模型(Chat General Language Model,ChatGLM)等模型。In an embodiment of the present application, when each enhanced text corresponding to the preset intent is obtained, a first prompt text and an original text saved for the preset intent can be obtained. Among them, the first prompt text is used to prompt the first large model to perform text enhancement processing on the acquired original text. The first prompt text can be pre-written according to the processing performed by the first large model as required. The first large model can be any large model with text enhancement capabilities, such as Chat Generative Pre-trained Transformer (ChatGPT), Chat General Language Model (ChatGLM) and other models.

获取到第一提示文本以及针对该预设意图保存的原始文本之后,可以将该原始文本和第一提示文本按照预设格式输入到第一大模型,得到该预设意图对应的每个增强文本。该第一大模型在对原始文本进行文本增强处理时,可以对原始文本进行词扩展,和/或句子重组,和/或语气语调变换。After obtaining the first prompt text and the original text saved for the preset intent, the original text and the first prompt text can be input into the first large model according to the preset format to obtain each enhanced text corresponding to the preset intent. When the first large model performs text enhancement processing on the original text, the original text can be expanded in words, reorganized in sentences, and/or changed in tone and intonation.

Word expansion mainly includes time expansion, place expansion, and the expansion of other wording. Time expansion and place expansion mean that the first large model automatically generalizes over time and place. For example, a question may or may not contain a time or a place. If the original text is "What is the GDP?", which contains neither a time nor a place, the enhanced texts obtained after the first large model performs text enhancement can be "What is the GDP of Qingdao?", "What was the GDP of Qingdao in 2022?", and so on.

Sentence reorganization mainly includes sentence inversion, position swapping, adding descriptive words, deleting descriptive words, and synonym replacement. It enhances the original text mainly by changing the sentence pattern, wording, or word order. For example, the original text "What is the GDP of Qingdao?" can be enhanced into the inverted form "The GDP is how much, for Qingdao?".

Tone and intonation change reflects the fact that users with different identities speak in different styles; calling the first large model can change the tone and intonation of the original text. For example, a user in a leadership role generally asks about the GDP of Qingdao in a firmer tone, so when the first large model enhances the original text "青岛市GDP是多少" ("What is the GDP of Qingdao"), it can produce the enhanced text "青岛GDP是多少啊?", which adds the modal particle "啊" to change the tone.

在本申请实施例中,第一提示文本除了用于提示第一大模型对原始文本进行文本增强处理,还可以规定第一大模型输出的增强文本的数量,以及输出的增强文本的格式。也就是说,第一提示文本可以使第一大模型按照固定格式输出固定数量的增强文本。In the embodiment of the present application, the first prompt text is used not only to prompt the first large model to perform text enhancement processing on the original text, but also to specify the number of enhanced texts output by the first large model and the format of the output enhanced texts. In other words, the first prompt text can cause the first large model to output a fixed number of enhanced texts in a fixed format.

Since the first large model is a pre-trained model, the first prompt text can be improved iteratively so that the first large model generates higher-quality enhanced texts. Therefore, in an embodiment of the present application, after the enhanced texts are obtained, each of them can also be output so that the relevant staff can check them. If the generated enhanced texts are unsatisfactory, the staff can refine the first prompt text and regenerate the enhanced texts for the preset intent. It should be noted that the content of the first prompt text is not limited to the above examples; those skilled in the art can write it as needed, and the embodiments of the present application do not limit the form of the first prompt text.

下面结合一个具体的实施例对文本增强过程进行说明,图3为本申请实施例提供的一种文本增强过程示意图,如图3所示,获取预先编写的原始文本,将原始文本、第一提示文本prompt和文本增强示例按照预设格式拼接,得到输入文本。第一提示文本中描述了使大语言模型(Large Language Model,LLM)进行词扩展、句子重组、语气语调变换等相关处理的内容,以便于LLM进行相应的处理。将输入文本输入到LLM,得到LLM按照固定格式输出的增强文本。The text enhancement process is described below in conjunction with a specific embodiment. FIG3 is a schematic diagram of a text enhancement process provided by an embodiment of the present application. As shown in FIG3, a pre-written original text is obtained, and the original text, the first prompt text prompt, and the text enhancement example are spliced according to a preset format to obtain an input text. The first prompt text describes the content of making the Large Language Model (LLM) perform related processing such as word expansion, sentence reorganization, and tone and intonation change, so that the LLM can perform corresponding processing. The input text is input into the LLM to obtain an enhanced text output by the LLM in a fixed format.
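The splicing step of FIG. 3 can be sketched as below. The prompt wording, the "Original:" and "Rewrites:" labels, and the function name `build_enhancement_input` are all hypothetical; the application only says that the original text, the first prompt text, and the text enhancement examples are concatenated in a preset format. The call to the LLM itself is deliberately left out.

```python
from typing import List, Tuple

# Hypothetical first prompt text; {n} fixes how many rewrites are requested.
FIRST_PROMPT = (
    "Rewrite the original text in {n} different ways using word expansion, "
    "sentence reorganization, or a change of tone. "
    "Output one numbered rewrite per line."
)


def build_enhancement_input(
    original_text: str,
    prompt: str,
    examples: List[Tuple[str, List[str]]],
    num_outputs: int = 5,
) -> str:
    """Concatenate prompt, worked examples, and the original text."""
    lines = [prompt.format(n=num_outputs), ""]
    for source, rewrites in examples:
        lines.append(f"Original: {source}")
        lines.append("Rewrites:")
        for index, rewrite in enumerate(rewrites, start=1):
            lines.append(f"{index}. {rewrite}")
        lines.append("")
    # The final block is left open for the model to complete.
    lines.append(f"Original: {original_text}")
    lines.append("Rewrites:")
    return "\n".join(lines)


examples = [("What is the population?", ["What is the population of Qingdao?"])]
input_text = build_enhancement_input("What is the GDP?", FIRST_PROMPT, examples, 3)
print(input_text)
```

Ending the input with an open "Rewrites:" block is one simple way to nudge a model toward the fixed output format; other layouts would serve equally well.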

下面结合另一个实施例对文本增强的过程进行说明,图4为本申请实施例提供的一种大模型文本增强架构示意图,如图4所示,数据层用于存储预先保存的原始文本,在进行文本增强时,将原始文本输入到语义解析层。语义解析层基于大模型技术对每个原始文本进行处理,得到增强文本。语义解析层基于大模型技术对每个原始文本进行处理时,基于大模型强大的语义理解能力,对原始文本的语义进行分析,并基于大模型强大的文本生成能力,对原始文本进行增强,得到增强文本。在对原始文本进行增强时,可以根据分析结果对原始文本进行用词扩展、句子重组及语气语调变换等处理。具体的,用词扩展包括地点扩展、时间扩展和其他用词扩展等;句子重组包括句子倒装、位置调换、增加描述词、删除描述词、同义词替换等;语气语调变换包括将原始文本改为领导语气、将原始文本改为职员语气、将原始文本改为居民语气等。The text enhancement process is described below in conjunction with another embodiment. FIG4 is a schematic diagram of a large model text enhancement architecture provided by an embodiment of the present application. As shown in FIG4, the data layer is used to store the pre-saved original text. When performing text enhancement, the original text is input into the semantic parsing layer. The semantic parsing layer processes each original text based on the large model technology to obtain an enhanced text. When the semantic parsing layer processes each original text based on the large model technology, it analyzes the semantics of the original text based on the powerful semantic understanding ability of the large model, and enhances the original text based on the powerful text generation ability of the large model to obtain an enhanced text. When the original text is enhanced, the original text can be processed by word expansion, sentence reorganization, tone and intonation transformation according to the analysis results. Specifically, word expansion includes location expansion, time expansion and other word expansions; sentence reorganization includes sentence inversion, position exchange, adding descriptive words, deleting descriptive words, synonym replacement, etc.; tone and intonation transformation includes changing the original text to the leadership tone, changing the original text to the staff tone, changing the original text to the resident tone, etc.

本申请实施例基于大模型进行文本增强,不需要单独训练模型,通过prompt提示方式,实现大模型数据的快速扩展,获取高质量的泛化语料。实现从用词扩展、句子重组,语气语调等多维度的数据扩展,提升语料泛化增强的多样性和可用性。The embodiment of the present application performs text enhancement based on a large model, does not require a separate model training, and achieves rapid expansion of large model data through prompts to obtain high-quality generalized corpus. It achieves data expansion in multiple dimensions such as word expansion, sentence reorganization, and tone and intonation, improving the diversity and usability of corpus generalization enhancement.

为了进一步提高文本处理的准确率,在上述各实施例的基础上,在本申请实施例中,所述方法还包括:In order to further improve the accuracy of text processing, based on the above embodiments, in the embodiment of the present application, the method further includes:

在所述每个增强文本中删除所述异常文本,得到目标待评价文本;Deleting the abnormal text in each enhanced text to obtain a target text to be evaluated;

将所述目标待评价文本和第二提示文本按照预设格式输入到第二大模型,得到所述目标待评价文本的目标评分,所述第二提示文本用于提示所述第二大模型对所述目标待评价文本进行评审处理,所述评审处理包括评审语义、评审语法中的至少一项。The target text to be evaluated and the second prompt text are input into the second large model in a preset format to obtain a target score for the target text to be evaluated. The second prompt text is used to prompt the second large model to perform a review process on the target text to be evaluated. The review process includes at least one of review semantics and review grammar.

To further improve the accuracy of text processing, after the abnormal text has been determined, it can be deleted from the enhanced texts corresponding to the preset intent to obtain the target texts to be evaluated. The resulting target texts to be evaluated are semantically similar but varied in phrasing. The second large model is then used to review the target texts to be evaluated and obtain a target score for each of them.

When determining the target score, the target text to be evaluated and the second prompt text can be input into the second large model, which reviews the target text to be evaluated according to the received second prompt text and produces a target score for each target text to be evaluated. In other words, the second prompt text is input into the second large model in order to prompt it to review the target text to be evaluated. The second large model can be any large model with text enhancement capabilities, such as Chat Generative Pre-trained Transformer (ChatGPT) or Chat General Language Model (ChatGLM), and it can be the same as or different from the first large model. In an embodiment of the present application, the review processing of the second large model may include reviewing semantics and/or reviewing grammar.

大模型在进行评审处理时,可以参考如下评分原则:没有语法问题、语义问题的句子为最高分10分,存在的语法、语义问题越多,句子的质量越低,分数越低,最低分为0,语法、语义问题越少,句子的质量越高。因此,在一种可能得实施方式中,第二提示文本中还可以包括评分原则,以便于第二大模型在进行评审处理时,基于接收到的评分原则确定目标评分。When the large model is conducting the review process, it can refer to the following scoring principles: the highest score is 10 for sentences without grammatical or semantic problems, the more grammatical and semantic problems there are, the lower the quality of the sentence, the lower the score, the lowest score is 0, the fewer grammatical and semantic problems there are, the higher the quality of the sentence. Therefore, in a possible implementation, the second prompt text can also include scoring principles, so that the second large model can determine the target score based on the received scoring principles when conducting the review process.

其中,评审语义时存在的语义问题可以包括用词不当、前后矛盾、不合事理、语句歧义等。评审语法时存在的语法问题可以包括搭配不当、归类不当、成分冗余、成分残缺、词序颠倒、句式杂糅、关联词错误、指代不明等。Among them, semantic problems in the review of semantics may include inappropriate wording, inconsistency, irrationality, ambiguous sentences, etc. Grammatical problems in the review of grammar may include inappropriate collocation, inappropriate classification, redundant components, incomplete components, reversed word order, mixed sentence structure, incorrect conjunctions, unclear reference, etc.

In an embodiment of the present application, in addition to prompting the second large model to review the target text to be evaluated, the second prompt text can also specify the format of the target score output by the second large model. That is, the second prompt text can also define the scoring format so that the second large model outputs the target score in the required format. It should be noted that the content of the second prompt text is not limited to the above examples; those skilled in the art can write it as needed, and the embodiments of the present application do not limit the form of the second prompt text.

下面结合一个具体的实施例对文本评审过程进行说明,图5为本申请实施例提供的一种文本评审过程示意图,如图5所示,获取待质检文本,该待质检文本可以是上述实施例中的目标待评价文本。将待质检文本、第二提示文本和文本评审示例按照预设格式拼接,得到输入文本。第二提示文本中描述了LLM进行语法评审、语义评审的相关处理的内容,以便于LLM进行相应的处理。将输入文本输入到LLM,得到LLM按照固定格式输出的评审结果,该评审结果中可以包括每个待质检文本对应的目标评分,以及每个待质检文本中存在的问题。The text review process is explained below in conjunction with a specific embodiment. FIG5 is a schematic diagram of a text review process provided by an embodiment of the present application. As shown in FIG5 , a text to be inspected is obtained, and the text to be inspected may be the target text to be evaluated in the above embodiment. The text to be inspected, the second prompt text, and the text review example are spliced in a preset format to obtain an input text. The second prompt text describes the content of the relevant processing of grammatical review and semantic review performed by LLM, so that LLM can perform corresponding processing. The input text is input into LLM to obtain a review result output by LLM in a fixed format. The review result may include the target score corresponding to each text to be inspected, as well as the problems existing in each text to be inspected.
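The review step of FIG. 5 can be sketched in two parts: assembling the input and reading back the fixed-format result. The prompt wording, the JSON output layout, and the names `build_review_input` and `parse_review_output` are assumptions; the application does not specify the fixed format, and the review examples it mentions are omitted here for brevity.

```python
import json
from typing import Any, Dict, List

# Hypothetical second prompt text carrying the 0 to 10 scoring principle.
SECOND_PROMPT = (
    "Review each sentence for semantic and grammatical problems. "
    "Score it from 0 to 10, where 10 means no problems and more problems "
    "mean a lower score. Reply with a JSON list of objects with the keys "
    '"text", "score" and "problems".'
)


def build_review_input(texts: List[str], prompt: str = SECOND_PROMPT) -> str:
    """Concatenate the second prompt text and the numbered texts to inspect."""
    numbered = "\n".join(f"{i}. {text}" for i, text in enumerate(texts, start=1))
    return f"{prompt}\n\nSentences:\n{numbered}"


def parse_review_output(raw: str) -> List[Dict[str, Any]]:
    """Parse the assumed JSON review result and validate the score range."""
    results = []
    for item in json.loads(raw):
        score = float(item["score"])
        if not 0.0 <= score <= 10.0:
            raise ValueError(f"score out of range: {score}")
        results.append(
            {
                "text": item["text"],
                "score": score,
                "problems": list(item.get("problems", [])),
            }
        )
    return results


# A hand-written reply standing in for real model output.
raw_reply = (
    '[{"text": "What is the GDP of Qingdao?", "score": 10, "problems": []},'
    ' {"text": "GDP Qingdao of what?", "score": 4,'
    ' "problems": ["reversed word order"]}]'
)
for review in parse_review_output(raw_reply):
    print(review["score"], review["problems"])
```

Validating the score range on the way in guards against replies that ignore the scoring principle, which is a practical concern whenever the output format is enforced only through the prompt.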

The text review process is described below in conjunction with another embodiment. FIG. 6 is a schematic diagram of a text review architecture based on a large model provided by an embodiment of the present application. As shown in FIG. 6, the data layer stores the texts to be inspected; when text review is performed, the texts to be inspected are input into the semantic parsing layer. The semantic parsing layer processes each text to be inspected based on large model technology to obtain its target score. In doing so, it relies on the large model's semantic understanding ability to analyze the semantics of the text to be inspected, and on the large model's text generation ability to review the text and generate the review result. When a text to be inspected is reviewed, the analysis result can be used to judge whether the text contains semantic problems or grammatical problems. Specifically, semantic problems include inappropriate wording, inconsistency, irrationality, and ambiguous sentences; grammatical problems include improper collocation, improper classification, redundant components, incomplete components, reversed word order, mixed sentence patterns, incorrect conjunctions, and unclear references.

本申请实施例基于大模型进行智能评分,不需要单独训练模型,通过prompt提示方式,实现大模型对增强文本的智能评分,提升人工质检修改效率。实现从用词、搭配、次序、句式等语义、语法上对泛化语料进行多维度的综合评分,增加评分的准确率和可靠性。The embodiment of the present application performs intelligent scoring based on a large model, and does not require a separate model training. Through prompts, the large model can be used to intelligently score the enhanced text, thereby improving the efficiency of manual quality inspection and modification. It can achieve multi-dimensional comprehensive scoring of generalized corpus from the semantic and grammatical aspects of word usage, collocation, order, sentence structure, etc., thereby increasing the accuracy and reliability of the scoring.

为了知晓目标待评价文本中存在的问题,在上述各实施例的基础上,在本申请实施例中,所述方法还包括:In order to know the problems existing in the target text to be evaluated, based on the above embodiments, in the embodiment of the present application, the method further includes:

If the target score is less than a preset score threshold, the problems present in the target text to be evaluated whose score is below the preset score threshold, together with the modified text, are obtained from the output of the second large model.

In an embodiment of the present application, after the target score of each target text to be evaluated is obtained, if a target score is less than the preset score threshold, the problems present in that target text to be evaluated, together with the modified text, can be obtained from the output of the second large model. In other words, the second large model is called to identify what is wrong with the low-scoring target texts to be evaluated and to output modified versions of them.

为便于相关人员的查看,在本申请实施例中,在获取到每个目标评分和评分较低的目标待评价文本的问题所在之后,可以保存在预设的存储空间,如某一文档或者数据库中,也可以将获取到的问题及修改后的文本进行显示。To facilitate the review of relevant personnel, in an embodiment of the present application, after obtaining the problems of each target score and the target text with a lower score to be evaluated, they can be saved in a preset storage space, such as a document or database, and the obtained problems and modified text can also be displayed.

Since the second large model is a pre-trained model, the second prompt text can be improved iteratively so that the second large model scores accurately. Therefore, in an embodiment of the present application, after the target score of each target text to be evaluated is obtained, each target text to be evaluated and its target score can also be output so that the relevant staff can check the scores. If the target scores output by the second large model are inaccurate, the staff can refine the second prompt text and run the review processing again.

在一种可能的实施方式中,在获取到每个目标待评价文本的目标评分之后,可以将每个目标待评价文本及对应的目标评分输出,由人工进行二次审核,得到高质量的增强文本。In a possible implementation, after obtaining the target score of each target text to be evaluated, each target text to be evaluated and the corresponding target score may be output and manually reviewed for a second time to obtain a high-quality enhanced text.

本申请实施例中,利用大模型进行文本增强和智能评分,可以提高文本增强的准确性和效率,提升增强文本质检效率,为各行业的数据处理提供更加可靠和高效的解决方案。In the embodiments of the present application, the use of a large model for text enhancement and intelligent scoring can improve the accuracy and efficiency of text enhancement, improve the efficiency of enhanced text quality inspection, and provide a more reliable and efficient solution for data processing in various industries.

下面结合一个具体的实施例对基于大模型的文本处理过程进行说明,图7为本申请实施例提供的一种文本处理过程示意图,该过程包括以下步骤:The text processing process based on the large model is described below in conjunction with a specific embodiment. FIG. 7 is a schematic diagram of a text processing process provided in an embodiment of the present application. The process includes the following steps:

S701:获取原始文本。S701: Obtain original text.

在本申请实施例中,可以针对每个预设意图,获取预先针对该预设意图保存的原始文本。In the embodiment of the present application, for each preset intent, the original text pre-saved for the preset intent can be obtained.

S702:使用第一大模型对原始文本进行文本增强,得到增强文本。S702: Use the first large model to perform text enhancement on the original text to obtain enhanced text.

将获取到的原始文本输入到第一大模型中,第一大模型根据第一提示文本中示例的多种增强方式对原始文本进行处理,得到增强文本。The acquired original text is input into the first large model, and the first large model processes the original text according to the multiple enhancement methods exemplified in the first prompt text to obtain an enhanced text.

S703:确定每两个增强文本之间的字形相似度。S703: Determine the glyph similarity between every two enhanced texts.

S704:确定每两个增强文本之间的语义相似度。S704: Determine the semantic similarity between every two enhanced texts.

S705:综合字形相似度和语义相似度,确定异常文本。S705: Determine abnormal text by combining the glyph similarity and semantic similarity.

在本申请实施例中,可以将字形相似度大于第一预设阈值的增强文本,确定为第一候选文本,并将语义相似度小于第二预设阈值的增强文本,确定为第二候选文本;将第一候选文本和第二候选文本中相同的增强文本,确定为异常文本。In an embodiment of the present application, enhanced text whose glyph similarity is greater than a first preset threshold can be determined as a first candidate text, and enhanced text whose semantic similarity is less than a second preset threshold can be determined as a second candidate text; enhanced text that is the same in the first candidate text and the second candidate text can be determined as abnormal text.
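The candidate selection in S705 can be sketched as follows. This is one possible reading, stated as an assumption: when a pair of enhanced texts crosses a threshold, both members of the pair are added to the corresponding candidate set, and the abnormal texts are the intersection of the two sets. The function name and the two similarity callables are placeholders.

```python
from itertools import combinations
from typing import Callable, List, Set

Similarity = Callable[[str, str], float]


def find_abnormal_texts(
    texts: List[str],
    glyph_similarity: Similarity,
    semantic_similarity: Similarity,
    first_threshold: float,
    second_threshold: float,
) -> Set[str]:
    """Return texts that are both first candidates and second candidates."""
    first_candidates: Set[str] = set()
    second_candidates: Set[str] = set()
    for a, b in combinations(texts, 2):
        # Similar in form: glyph similarity above the first threshold.
        if glyph_similarity(a, b) > first_threshold:
            first_candidates.update((a, b))
        # Different in meaning: semantic similarity below the second threshold.
        if semantic_similarity(a, b) < second_threshold:
            second_candidates.update((a, b))
    return first_candidates & second_candidates


# Toy similarity tables standing in for the real vector comparisons.
glyph_table = {frozenset(("t1", "t2")): 0.9}
semantic_table = {frozenset(("t2", "t3")): 0.2}

abnormal = find_abnormal_texts(
    ["t1", "t2", "t3"],
    lambda a, b: glyph_table.get(frozenset((a, b)), 0.1),
    lambda a, b: semantic_table.get(frozenset((a, b)), 0.95),
    first_threshold=0.8,
    second_threshold=0.5,
)
print(abnormal)  # {'t2'}
```

Under this reading, a pair that is close in form but far apart in meaning flags both of its members, so a well-formed text can be flagged alongside the faulty one; deciding which member of such a pair is the outlier is not addressed by this sketch.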

S706:在每个增强文本中删除异常文本,得到目标待评价文本。S706: Delete abnormal text in each enhanced text to obtain the target text to be evaluated.

S707: Use the second large model to intelligently score the target text to be evaluated to obtain a target score.

S708: Input the target text to be evaluated whose target score is less than the preset score threshold into the second large model to obtain a modified text.

S709:输出高质量文本。S709: Output high-quality text.

将修改后的文本和目标评分不小于预设评分阈值的目标待评价文本输出。The modified text and the target text to be evaluated whose target score is not less than the preset score threshold are output.
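Steps S701 to S709 can be tied together as in the sketch below. Every callable passed in (`enhance`, `find_abnormal`, `score`, `revise`) is a placeholder for a component described in the text, such as a call to the first or second large model; the stubs in the usage part are made up so that the flow can be run without any model.

```python
from typing import Callable, List, Set


def process_intent(
    original_texts: List[str],                   # S701: saved for one intent
    enhance: Callable[[str], List[str]],         # first large model
    find_abnormal: Callable[[List[str]], Set[str]],
    score: Callable[[str], float],               # second large model, scoring
    revise: Callable[[str], str],                # second large model, rewriting
    score_threshold: float,
) -> List[str]:
    """Run enhancement, anomaly removal, scoring, and revision for one intent."""
    # S702: text enhancement.
    enhanced = [variant for text in original_texts for variant in enhance(text)]
    # S703 to S705: glyph and semantic similarity, then abnormal texts.
    abnormal = find_abnormal(enhanced)
    # S706: delete abnormal texts to get the target texts to be evaluated.
    to_evaluate = [text for text in enhanced if text not in abnormal]
    output = []
    for text in to_evaluate:
        # S707: target score; S708: revise when below the threshold.
        if score(text) < score_threshold:
            output.append(revise(text))
        else:
            output.append(text)
    # S709: revised texts plus texts whose score is not below the threshold.
    return output


# Stub components for a dry run.
result = process_intent(
    ["q"],
    enhance=lambda t: [t + " v1", t + " bad", t + " v2"],
    find_abnormal=lambda texts: {t for t in texts if t.endswith("bad")},
    score=lambda t: 3.0 if t.endswith("v2") else 9.0,
    revise=lambda t: t + " (revised)",
    score_threshold=6.0,
)
print(result)  # ['q v1', 'q v2 (revised)']
```

Passing the model calls in as functions keeps the control flow testable on its own and makes it straightforward to use the same or different large models for the two roles.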

In summary, the embodiments of the present application identify, as needed, the intents to be enhanced and save the original text corresponding to each intent. Based on the acquired original text, the large model is called, or another data enhancement method is used, to perform text enhancement and obtain the expanded enhanced texts. The large model is then called again to score the expanded enhanced texts according to sentence quality, and, based on the scores, the problematic enhanced texts are quality-checked and modified or modification suggestions are given. This greatly improves the quality of the output enhanced texts and saves project labor and cost.

在本申请实施例中,两次调用大模型,实现自动化文本增强和智能评分和智能化修改,快速、高效的输出高质量泛化语料,大量节俭人力和物力。In the embodiment of the present application, the large model is called twice to realize automatic text enhancement, intelligent scoring and intelligent modification, and quickly and efficiently output high-quality generalized corpus, saving a lot of manpower and material resources.

基于同一发明构思,本申请实施例还提供了一种基于大模型的文本处理装置,图8为本申请实施例提供的一种基于大模型的文本处理装置结构示意图,该装置包括:Based on the same inventive concept, the embodiment of the present application further provides a text processing device based on a large model. FIG8 is a structural schematic diagram of a text processing device based on a large model provided by the embodiment of the present application, and the device includes:

获取模块801,用于针对每个预设意图,获取该预设意图对应的每个增强文本;An acquisition module 801 is used to acquire, for each preset intent, each enhanced text corresponding to the preset intent;

相似度比较模块802,用于确定每两个增强文本之间的字形相似度和语义相似度,所述字形相似度用于描述对应的两个增强文本中包括的词的相似程度;将所述字形相似度大于第一预设阈值的增强文本,确定为第一候选文本,并将所述语义相似度小于第二预设阈值的增强文本,确定为第二候选文本;A similarity comparison module 802 is used to determine the glyph similarity and semantic similarity between each two enhanced texts, wherein the glyph similarity is used to describe the similarity between the words included in the corresponding two enhanced texts; the enhanced text whose glyph similarity is greater than a first preset threshold is determined as a first candidate text, and the enhanced text whose semantic similarity is less than a second preset threshold is determined as a second candidate text;

确定模块803,用于将所述第一候选文本和所述第二候选文本中相同的增强文本,确定为异常文本。The determination module 803 is used to determine the same enhanced text in the first candidate text and the second candidate text as abnormal text.

在一种可能的实施方式中,所述确定模块803,还用于确定每两个增强文本的字形特征向量之间的相似度,所述字形特征向量用于描述对应的增强文本中包括的词的信息。In a possible implementation, the determination module 803 is further configured to determine the similarity between the glyph feature vectors of every two enhanced texts, where the glyph feature vectors are used to describe information about words included in the corresponding enhanced texts.

在一种可能的实施方式中,所述确定模块803,具体用于针对所述每个增强文本中的每个词,确定该词在所归属的增强文本中出现的第一频率;并根据该预设意图对应的增强文本的第一数量,以及包含该词的增强文本的第二数量,确定第二频率;根据所述第一频率和所述第二频率,确定该词在所归属的增强文本中的目标频率;根据每个增强文本中包括的每个词对应的所述目标频率,以及预设向量中每一项对应的目标词,分别确定每个增强文本的字形特征向量。In a possible implementation, the determination module 803 is specifically configured to determine, for each word in each enhanced text, a first frequency of occurrence of the word in the enhanced text to which it belongs; and determine a second frequency based on a first number of enhanced texts corresponding to the preset intention and a second number of enhanced texts containing the word; determine a target frequency of the word in the enhanced text to which it belongs based on the first frequency and the second frequency; and determine a glyph feature vector for each enhanced text based on the target frequency corresponding to each word included in each enhanced text and the target word corresponding to each item in the preset vector.
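The first frequency, second frequency, and target frequency described for this module resemble a TF-IDF weighting. The sketch below follows that reading as an assumption: the first frequency is the word's share of its own text, the second frequency is the logarithm of the number of enhanced texts divided by the number of texts containing the word, and the target frequency is their product. The exact formulas are not given in this passage.

```python
import math
from collections import Counter
from typing import Dict, List


def target_frequencies(tokenized_texts: List[List[str]]) -> List[Dict[str, float]]:
    """Compute a TF-IDF style target frequency per word, per enhanced text."""
    first_number = len(tokenized_texts)  # enhanced texts for this intent
    # Second number: how many enhanced texts contain each word.
    containing = Counter(word for words in tokenized_texts for word in set(words))
    results = []
    for words in tokenized_texts:
        counts = Counter(words)
        frequencies = {}
        for word, count in counts.items():
            first_frequency = count / len(words)
            second_frequency = math.log(first_number / containing[word])
            frequencies[word] = first_frequency * second_frequency
        results.append(frequencies)
    return results


weights = target_frequencies([["A", "E", "F"], ["A", "B"]])
print(weights[0])
```

With this unsmoothed form, a word that appears in every enhanced text gets a target frequency of 0, which is indistinguishable from a fill value of 0 for absent words; a smoothed variant would avoid that if the distinction matters.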

在一种可能的实施方式中,所述确定模块803,还用于确定每两个增强文本的语义特征向量之间的相似度,其中所述语义特征向量为基于预先训练完成的特征提取模型对对应的增强文本进行处理后得到的。In a possible implementation, the determination module 803 is further used to determine the similarity between the semantic feature vectors of every two enhanced texts, wherein the semantic feature vectors are obtained by processing the corresponding enhanced texts based on a pre-trained feature extraction model.

In a possible implementation, the acquisition module 801 is specifically configured to acquire a first prompt text and the original text saved for the preset intent, where the first prompt text prompts a first large model to perform text enhancement on the original text, and the text enhancement includes at least one of word expansion, sentence reorganization, and change of tone; and to input the original text and the first prompt text into the first large model in a preset format to obtain each enhanced text corresponding to the preset intent.
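The application does not disclose the preset format or the wording of the first prompt text. The sketch below shows one plausible way to assemble such an input and to split the model's reply into enhanced texts; the prompt wording, the one-sentence-per-line reply convention, and both function names are assumptions made for illustration.

```python
from typing import List, Sequence


def build_augmentation_prompt(
    original_texts: Sequence[str],
    intent: str,
    variants_per_text: int = 3,
) -> str:
    """Combine a hypothetical first prompt text with the saved original texts."""
    instruction = (
        f"Rewrite each sentence below in {variants_per_text} different ways "
        f"without changing its intent ('{intent}'). You may expand words, "
        "reorganize the sentence, or change its tone. "
        "Return one rewritten sentence per line."
    )
    numbered = "\n".join(
        f"{i}. {text}" for i, text in enumerate(original_texts, start=1)
    )
    return f"{instruction}\n\nOriginal sentences:\n{numbered}"


def parse_enhanced_texts(model_output: str) -> List[str]:
    """Split a one-sentence-per-line reply into a list of enhanced texts."""
    stripped = (line.strip() for line in model_output.splitlines())
    return [line for line in stripped if line]
```

The string returned by `build_augmentation_prompt` would be sent to the first large model through whatever client the deployment uses, and the reply passed to `parse_enhanced_texts`.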

In a possible implementation, the apparatus further includes:

A scoring module 804, configured to delete the abnormal texts from the enhanced texts to obtain target texts to be evaluated; and to input each target text to be evaluated, together with a second prompt text, into a second large model in a preset format to obtain a target score for that text, where the second prompt text prompts the second large model to review the target text to be evaluated, and the review includes at least one of a semantic review and a grammar review.

In a possible implementation, the apparatus further includes:

A modification module 805, configured to, if the target score is less than a preset score threshold, obtain from the second large model the problems it identifies in the target text to be evaluated whose score is below the preset score threshold, together with the revised text it outputs.
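The scoring and modification steps of modules 804 and 805 can be sketched together. The second large model is represented here by a hypothetical `reviewer` callable that returns a score, a description of the problems found, and a revised text; the `Review` type and the function name are likewise assumptions, and how the revised text is used afterwards is left open, as it is in the text above.

```python
from dataclasses import dataclass
from typing import Callable, Collection, List, Optional, Sequence, Tuple


@dataclass
class Review:
    score: float
    issues: str = ""
    revised_text: Optional[str] = None


def review_enhanced_texts(
    enhanced_texts: Sequence[str],
    abnormal_texts: Collection[str],
    reviewer: Callable[[str], Review],
    score_threshold: float,
) -> Tuple[List[str], List[Tuple[str, Review]]]:
    """Drop abnormal texts, score the rest, and collect low-scoring ones."""
    accepted: List[str] = []
    flagged: List[Tuple[str, Review]] = []
    for text in enhanced_texts:
        if text in abnormal_texts:
            continue  # abnormal texts are removed before scoring
        review = reviewer(text)
        if review.score < score_threshold:
            # Keep the model's stated problems and its revised text.
            flagged.append((text, review))
        else:
            accepted.append(text)
    return accepted, flagged
```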

Based on the same inventive concept, an embodiment of the present application further provides an electronic device. FIG. 9 is a schematic structural diagram of an electronic device provided by an embodiment of the present application. As shown in FIG. 9, the electronic device includes a processor 901, a communication interface 902, a memory 903, and a communication bus 904, where the processor 901, the communication interface 902, and the memory 903 communicate with one another through the communication bus 904.

The memory 903 stores a computer program which, when executed by the processor 901, causes the processor 901 to perform the following steps:

For each preset intent, obtaining each enhanced text corresponding to the preset intent; determining the glyph similarity and the semantic similarity between every two enhanced texts, where the glyph similarity describes how similar the words included in the two corresponding enhanced texts are; determining an enhanced text whose glyph similarity is greater than a first preset threshold as a first candidate text, and determining an enhanced text whose semantic similarity is less than a second preset threshold as a second candidate text;

Determining any enhanced text that appears in both the first candidate texts and the second candidate texts as an abnormal text.

In a possible implementation, the processor 901 is further configured to: determine the similarity between the glyph feature vectors of every two enhanced texts, where a glyph feature vector describes information about the words included in the corresponding enhanced text.

In a possible implementation, the processor 901 is further configured to: for each word in each enhanced text, determine a first frequency with which the word occurs in the enhanced text it belongs to; determine a second frequency from a first number, which is the number of enhanced texts corresponding to the preset intent, and a second number, which is the number of those enhanced texts that contain the word; determine, from the first frequency and the second frequency, a target frequency of the word in the enhanced text it belongs to;

Determine the glyph feature vector of each enhanced text from the target frequency of each word it includes and from the target word assigned to each entry of a preset vector.

In a possible implementation, the processor 901 is further configured to: determine the similarity between the semantic feature vectors of every two enhanced texts, where each semantic feature vector is obtained by processing the corresponding enhanced text with a pre-trained feature extraction model.

In a possible implementation, the processor 901 is further configured to: acquire a first prompt text and the original text saved for the preset intent, where the first prompt text prompts a first large model to perform text enhancement on the original text, and the text enhancement includes at least one of word expansion, sentence reorganization, and change of tone;

Input the original text and the first prompt text into the first large model in a preset format to obtain each enhanced text corresponding to the preset intent.

In a possible implementation, the processor 901 is further configured to: delete the abnormal texts from the enhanced texts to obtain target texts to be evaluated;

Input each target text to be evaluated, together with a second prompt text, into a second large model in a preset format to obtain a target score for that text, where the second prompt text prompts the second large model to review the target text to be evaluated, and the review includes at least one of a semantic review and a grammar review.

In a possible implementation, the processor 901 is further configured to: if the target score is less than a preset score threshold, obtain from the second large model the problems it identifies in the target text to be evaluated whose score is below the preset score threshold, together with the revised text it outputs.

Because the electronic device solves the problem on the same principle as the large-model-based text processing method, its implementation may refer to the method embodiments, and the overlapping details are not repeated here.

The communication bus of the electronic device may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, the figure shows it as a single thick line, which does not mean that there is only one bus or one type of bus. The communication interface 902 is used for communication between the electronic device and other devices. The memory may include random access memory (RAM) and may also include non-volatile memory (NVM), for example at least one disk storage. Optionally, the memory may also be at least one storage apparatus located remotely from the processor.

The processor may be a general-purpose processor, including a central processing unit, a network processor (NP), and the like; it may also be a digital signal processor (DSP), an application-specific integrated circuit, a field-programmable gate array or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and so on.

On the basis of the above embodiments, an embodiment of the present invention further provides a computer-readable storage medium storing a computer program executable by a processor. When the program runs on the processor, it causes the processor to perform the following steps:

For each preset intent, obtaining each enhanced text corresponding to the preset intent; determining the glyph similarity and the semantic similarity between every two enhanced texts, where the glyph similarity describes how similar the words included in the two corresponding enhanced texts are; determining an enhanced text whose glyph similarity is greater than a first preset threshold as a first candidate text, and determining an enhanced text whose semantic similarity is less than a second preset threshold as a second candidate text;

Determining any enhanced text that appears in both the first candidate texts and the second candidate texts as an abnormal text.

In a possible implementation, determining the glyph similarity between every two enhanced texts includes:

Determining the similarity between the glyph feature vectors of every two enhanced texts, where a glyph feature vector describes information about the words included in the corresponding enhanced text.

In a possible implementation, determining the glyph feature vector of each enhanced text includes:

For each word in each enhanced text, determining a first frequency with which the word occurs in the enhanced text it belongs to; determining a second frequency from a first number, which is the number of enhanced texts corresponding to the preset intent, and a second number, which is the number of those enhanced texts that contain the word; determining, from the first frequency and the second frequency, a target frequency of the word in the enhanced text it belongs to;

Determining the glyph feature vector of each enhanced text from the target frequency of each word it includes and from the target word assigned to each entry of a preset vector.

In a possible implementation, determining the semantic similarity between every two enhanced texts includes:

Determining the similarity between the semantic feature vectors of every two enhanced texts, where each semantic feature vector is obtained by processing the corresponding enhanced text with a pre-trained feature extraction model.

In a possible implementation, obtaining each enhanced text corresponding to the preset intent includes:

Acquiring a first prompt text and the original text saved for the preset intent, where the first prompt text prompts a first large model to perform text enhancement on the original text, and the text enhancement includes at least one of word expansion, sentence reorganization, and change of tone;

Inputting the original text and the first prompt text into the first large model in a preset format to obtain each enhanced text corresponding to the preset intent.

In a possible implementation, the method further includes:

Deleting the abnormal texts from the enhanced texts to obtain target texts to be evaluated;

Inputting each target text to be evaluated, together with a second prompt text, into a second large model in a preset format to obtain a target score for that text, where the second prompt text prompts the second large model to review the target text to be evaluated, and the review includes at least one of a semantic review and a grammar review.

In a possible implementation, the method further includes:

If the target score is less than a preset score threshold, obtaining from the second large model the problems it identifies in the target text to be evaluated whose score is below the preset score threshold, together with the revised text it outputs.

Because the computer-readable storage medium solves the problem on the same principle as the large-model-based text processing method, its implementation may refer to the method embodiments, and the overlapping details are not repeated here.

Those skilled in the art will appreciate that the embodiments of the present application may be provided as a method, a system, or a computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, and optical storage) containing computer-usable program code.

The present application is described with reference to flowcharts and/or block diagrams of the method, device (system), and computer program product according to the present application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and any combination of flows and/or blocks in them, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a specific manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps is executed on the computer or other programmable device to produce a computer-implemented process, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

Obviously, those skilled in the art can make various changes and variations to the present application without departing from its spirit and scope. If these changes and variations fall within the scope of the claims of the present application and their technical equivalents, the present application is intended to include them as well.

Claims (10)

CN202410367189.1A | 2024-03-28 | 2024-03-28 | A text processing method, device, equipment and medium based on large model | Pending | CN118536511A (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202410367189.1A (CN118536511A (en)) | 2024-03-28 | 2024-03-28 | A text processing method, device, equipment and medium based on large model

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202410367189.1A (CN118536511A (en)) | 2024-03-28 | 2024-03-28 | A text processing method, device, equipment and medium based on large model

Publications (1)

Publication Number | Publication Date
CN118536511A (en) | 2024-08-23

Family

ID=92386504

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202410367189.1A (Pending, CN118536511A (en)) | A text processing method, device, equipment and medium based on large model | 2024-03-28 | 2024-03-28

Country Status (1)

Country | Link
CN (1) | CN118536511A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN119272751A (en)* | 2024-12-10 | 2025-01-07 | 北京火山引擎科技有限公司 | Method, apparatus, device, medium and program product for processing generated content

Similar Documents

Publication | Publication Date | Title
Laurer et al. Less annotating, more classifying: Addressing the data scarcity issue of supervised machine learning with deep transfer learning and bert-nli
CN111125331B (en) Semantic recognition method, device, electronic equipment and computer-readable storage medium
CN108363790B (en) Method, device, equipment and storage medium for evaluating comments
CN115630640B (en) Intelligent writing method, device, equipment and medium
CN114547329A (en) Method for establishing pre-training language model, semantic analysis method and device
CN112287670A (en) Text error correction method, system, computer device and readable storage medium
CN109472022B (en) New word recognition method based on machine learning and terminal equipment
WO2023137911A1 (en) Intention classification method and apparatus based on small-sample corpus, and computer device
CN110263127A (en) Text search method and device is carried out based on user query word
CN115759254A (en) Question-answering method, system and medium based on knowledge-enhanced generative language model
CN115827819A (en) Intelligent question and answer processing method and device, electronic equipment and storage medium
CN118193733A (en) Method, device, electronic equipment and storage medium for generating report
CN117290482A (en) Knowledge base retrieval method and device
CN117094383A (en) Joint training method, system, equipment and storage medium for language model
CN118861244A (en) A method, device and apparatus for generating an answer
CN118536511A (en) A text processing method, device, equipment and medium based on large model
CN116955534A (en) Complaint work order intelligent processing methods, devices, equipment and storage media
CN112307048A (en) Semantic matching model training method, matching device, equipment and storage medium
CN114638231B (en) Entity linking method, device and electronic equipment
CN115392260A (en) Social media tweet emotion analysis method facing specific target
CN114218214A (en) Data processing method, system and storage medium based on Tapas model
CN117540004B (en) Industrial domain intelligent question-answering method and system based on knowledge graph and user behavior
CN118734928A (en) Method, device, equipment and medium for constructing fine-tuning instructions
CN117193823A (en) A code workload assessment method, system and equipment for software requirement changes
CN116028620A (en) Method and system for generating patent abstract based on multi-task feature cooperation

Legal Events

Date | Code | Title | Description
PB01 | Publication
SE01 | Entry into force of request for substantive examination
