




Technical Field
The present invention relates to the technical field of natural language processing, and in particular to a text extraction method, a storage medium, and a terminal for joint entity recognition and relation extraction.
Background Art
Natural language processing is a branch of artificial intelligence for analyzing human language. Its information extraction task refers to extracting structured information such as entities, entity relations, and events from human natural language text; it is a technique for converting semi-structured or unstructured text data into structured data.
With the rapid development of the Internet, ever more data exists on the web in textual form. How to convert unstructured text data into structured information that provides data support for downstream applications is the problem that information extraction must solve. The specific tasks of information extraction include named entity recognition, entity relation extraction, and event extraction. Among them, entity relation extraction is the core task: by extracting entity-relation triples (entity 1 - relation type - entity 2) from natural language text, it completes the process of converting raw natural language text into structured knowledge in triple form.
In the early stage, the entity relation extraction task treated entity extraction and relation extraction as tasks in series, i.e., as a pipeline model. The pipeline model is relatively simple to build, but treating entity recognition and relation extraction as two independent tasks clearly leads to a series of problems:
1. The correlation between the two subtasks is not considered when solving them, so the result of relation extraction depends heavily on the result of entity extraction. Severing the intrinsic connection between the two tasks means that an error in the upstream task also causes errors in the downstream task, producing error accumulation.
2. When a passage of text contains many relations, the entity overlap problem easily arises, i.e., two relation triples share a pair of entities; entity overlap causes information redundancy.
Therefore, how to solve the error accumulation and entity overlap problems in entity relation extraction has become a current research focus.
Summary of the Invention
The purpose of the present invention is to solve the problems of redundant information and error accumulation caused by performing entity recognition and relation extraction separately, and to provide a text extraction method, a storage medium, and a terminal for joint entity recognition and relation extraction.
The purpose of the present invention is achieved through the following technical solution: a text extraction method for joint entity recognition and relation extraction, the method comprising:
A training step: tagging each entity in the current training text set based on a sequence labeling scheme so that each entity obtains a corresponding label, where a label comprises entity position information, relation type information, and entity role information, yielding serialized labels of entity-relation triples;
converting the serialized labels of the entity-relation triples into word embedding vectors;
inputting the word embedding vectors into an encoder-decoder model to analyze the generation of the serialized labels of the entity-relation triples, obtaining serialized labels of first entity-relation triples; analyzing the generation comprises introducing a copy mechanism in the decoding stage, where at each time step the decoder receives the output of the previous time step, the context variable, and the hidden state of the previous time step, fully fusing entity recognition with relation extraction; and/or introducing an attention mechanism in the decoding stage, computing the attention value distribution over each position of the original text sequence to obtain the original-text feature vector output most important to the current head entity, thereby splicing the information of the head entity with the position vectors of the original text sequence;
inputting the serialized labels of the first entity-relation triples into a language model with a shared encoding layer, establishing the association between entity recognition and relation extraction, thereby optimizing word-level and sentence-level semantic encoding information, and outputting entity-relation triples in natural language;
verifying the natural-language entity-relation triples output by the language model against the texts in a validation text set, correcting the model parameters of the language model by back-propagation, and completing the training of the language model;
A text extraction step: inputting texts from a text set with the same attribute as in the training step into the trained language model, and outputting entity-relation triples in natural language.
In an example, converting the serialized labels of the entity-relation triples into word embedding vectors specifically comprises:
using a bidirectional long short-term memory (BiLSTM) encoder to convert the one-hot encoding of the text into word embedding vectors.
In an example, introducing the copy mechanism in the decoding stage comprises:
the encoder encoding the word embedding vector input at the current time step to obtain the corresponding hidden state vector;
obtaining the context variable based on the hidden state vectors;
the decoder receiving its own output y_{t'-1} of the previous time step, the context variable c, and the hidden state s_{t'-1} of the previous time step, and predicting the output y_{t'} of the current time step.
In an example, when the decoder is a recurrent neural network, the computation introduced by the copy mechanism in the decoding stage is:
y_{t'} = g(y_{t'-1}, c, s_{t'-1})
where g(·) denotes the computation combining the recurrent neural network with the normalized exponential (softmax) function.
In an example, introducing the attention mechanism in the decoding stage comprises:
obtaining the encoded head-entity sequence H_m and the encoded original text sequence H_n;
averaging the vectors of the head-entity sequence H_m to obtain the hidden-layer feature vector h_s of the head-entity sequence;
computing the attention score between the hidden-layer feature vector h_s and each position vector of the original text sequence H_n, and normalizing the per-position attention scores to obtain the attention values;
assigning the attention values to the original text sequence to obtain the final hidden-layer feature vector sequence H̃_n, splicing the information of the head entity with the position vectors of the original text sequence.
In an example, the language model is a BERT model comprising several parallel Transformer modules, each Transformer module comprising sequentially connected Transformer layers, used to establish the association between entity recognition and relation extraction, specifically comprising:
inputting the word embedding vectors into the language model;
the current Transformer layer encoding the word embedding vectors and outputting word vectors of the same dimension, which are passed to the Transformer layer of the same level in each Transformer module, thereby establishing the association between entity recognition and relation extraction.
In an example, establishing the association between entity recognition and relation extraction further comprises:
creating dynamic semantic representations to adjust the language model parameters, and adding a bias objective function to obtain an end-to-end bidirectional language model.
In an example, outputting the entity-relation triples in natural language comprises:
extracting the relation triples between entities following the principle of nearest matching.
It should be further noted that the technical features of the above examples may be combined with or substituted for one another to form new technical solutions.
The present application also includes a storage medium storing computer instructions which, when run, perform the steps of the text extraction method for joint entity recognition and relation extraction formed by any one of the above examples or a combination thereof.
The present application also includes a terminal comprising a memory and a processor, the memory storing computer instructions executable on the processor, the processor performing, when running the computer instructions, the steps of the text extraction method for joint entity recognition and relation extraction formed by any one of the above examples or a combination thereof.
Compared with the prior art, the beneficial effects of the present invention are:
1. In an example, the present invention casts the entity extraction and relation extraction tasks as a single sequence labeling task, joining the two tasks; the encoder-decoder model fully fuses the entity recognition and relation extraction processes, maximally learning the interaction between entities and relations; in the relation classification stage, clause representation information obtained via the attention mechanism is combined with the entity-pair encoding information, further improving model accuracy; and the language model with a shared encoding layer further establishes the link between the two tasks, preventing errors in the upstream task from causing errors in the downstream task and solving the error accumulation problem.
2. In an example, a bidirectional long short-term memory encoder encodes the text; through the forward and backward LSTM layers of the encoder, merged at the end, it captures the semantic information of individual words well.
3. In an example, introducing the copy mechanism allows any entity to be reused and entities and relations to be generated in time-step order, effectively handling the entity overlap problem.
4. In an example, introducing the attention mechanism obtains clause representation information that, combined with the entity-pair encoding information, further improves relation classification performance.
5. In an example, adding a bias objective function mitigates the class imbalance problem in the tail entity and relation extraction tasks.
6. In an example, extracting relation triples between entities following the principle of nearest matching effectively solves the entity overlap problem.
Description of the Drawings
The specific embodiments of the present invention are described in further detail below with reference to the accompanying drawings. The drawings described here are provided for further understanding of the present application and constitute a part of it; the same reference numerals denote the same or similar parts throughout. The exemplary embodiments of the present application and their descriptions are used to explain the present application and do not constitute an improper limitation of it.
Fig. 1 is a flowchart of a method in an example of the present invention;
Fig. 2 is a schematic diagram of the data labeling strategy in an example of the present invention;
Fig. 3 is a schematic structural diagram of the encoder-decoder model in an example of the present invention;
Fig. 4 is a schematic diagram of the attention mechanism structure used in the decoding stage in an example of the present invention;
Fig. 5 is a schematic structural diagram of the BERT pre-trained language model in an example of the present invention.
Detailed Description
The technical solutions of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by a person of ordinary skill in the art without creative effort fall within the protection scope of the present invention.
In the description of the present invention, it should be noted that directions or positional relationships indicated by terms such as "center", "upper", "lower", "left", "right", "vertical", "horizontal", "inner", and "outer" are based on the directions or positional relationships shown in the drawings; they are used only for convenience and simplicity of description, and do not indicate or imply that the device or element referred to must have a specific orientation or be constructed and operated in a specific orientation, and therefore cannot be construed as limiting the present invention. Furthermore, the terms "first" and "second" are used for descriptive purposes only and cannot be construed as indicating or implying relative importance.
In the description of the present invention, it should also be noted that, unless otherwise expressly specified and limited, the terms "mounted", "connected", and "coupled" should be understood broadly: for example, a connection may be fixed, detachable, or integral; mechanical or electrical; direct, indirect via an intermediate medium, or internal between two elements. A person of ordinary skill in the art can understand the specific meanings of the above terms in the present invention according to the specific situation.
In addition, the technical features involved in the different embodiments of the present invention described below may be combined with one another as long as they do not conflict.
The present application provides a text extraction method for joint entity recognition and relation extraction, aiming to solve the error accumulation problem caused by separating entity recognition from relation extraction, as well as the entity overlap problem. It is applicable to text extraction in various scenarios; the following embodiments are described using medical text extraction.
In an example, as shown in Fig. 1, a medical text extraction method for joint entity recognition and relation extraction comprises a model training step and a text extraction step, specifically:
S1: Tag each entity in the current text set based on the sequence labeling scheme, so that each entity obtains a corresponding label comprising entity position information, relation type information, and entity role information, yielding the serialized labels of entity-relation triples; a serialized label is the coded representation of the natural language text.
S2: Convert the serialized labels of the entity-relation triples into word embedding vectors;
S3: Input the word embedding vectors into the encoder-decoder model;
S4: Introduce a copy mechanism in the decoding stage: at each time step the decoder receives the output of the previous time step, the context variable, and the hidden state of the previous time step, fully fusing entity recognition with relation extraction;
S5: Introduce an attention mechanism in the decoding stage: compute the attention value distribution over each position of the original text sequence to obtain the original-text feature vector output most important to the current head entity, splicing the information of the head entity with the position vectors of the original text sequence, and obtain the serialized labels of the first entity-relation triples after the generation analysis;
S6: Input the serialized labels of the first entity-relation triples into the language model with the shared encoding layer, establish the association between entity recognition and relation extraction, thereby optimize word-level and sentence-level semantic encoding information, and output entity-relation triples in natural language;
S7: Verify the natural-language entity-relation triples output by the language model against the texts in the validation text set, correct the model parameters of the language model by back-propagation, and complete the training of the language model;
S8: Input texts from a text set with the same attribute as in the training step into the trained language model, and output entity-relation triples in natural language. The attribute of a text set is its category, e.g., a medical text set or a financial text set. When extracting from texts of a different category, the model must be trained and validated through the above steps S1-S7 so that it learns the semantics of the current category of text, thereby improving the accuracy of its text extraction.
The present invention casts the entity extraction and relation extraction tasks as a single sequence labeling task, joining the two tasks. On one hand, the encoder-decoder model fully fuses the entity recognition and relation extraction processes, sharing all model parameters, fusing the semantics of the whole text in the encoding stage, and generating entities and relations in an interleaved manner in the decoding stage. On the other hand, a pre-trained language model is introduced into the joint extraction of entities and relations, sharing most of the model parameters; joint training maximally learns the interaction between entities and relations in the shared-parameter encoding stage. At the same time, applying the copy mechanism and the attention mechanism in the decoding stage avoids generating irrelevant entity tokens, better describes the intrinsic connection between entities and relations, mitigates the entity overlap and relation direction problems, and improves model accuracy. Moreover, the language model with the shared encoding layer establishes the dependency between the two tasks, preventing errors in the upstream task from causing errors in the downstream task and solving the error accumulation problem.
In an example, the raw medical text dataset used for joint entity-relation extraction training is a Chinese electronic medical record dataset; the raw text dataset (text set) chosen in this example is the Yidu-N4K dataset. First, entities and the relations between entities (entity-relation labels) are selected according to the attributes of the text information, and specific tags are assigned to the entities. In the data, every token is assigned an entity label, yielding a labeled dataset used as the serialized annotation dataset for training the model. An entity-relation label characterizes the relation between two entities; for example, the entity relation between the entity "gastric cancer" (胃癌) and the entity "gastroscopy" (胃镜) is "examination". In the model training stage, the entity-relation labels must be defined first. Specifically, this example uses the "BIOES" sequence labeling rule to tag the position information of tokens within entities, where B denotes the beginning of an entity, I the inside of an entity, O a non-entity token, E the end of an entity, and S a single-token entity. An extracted entity-relation result is represented by a triple (entity 1 - relation type - entity 2). More specifically, the label of an entity-related token contains three pieces of information: 1. the position of the token within the entity; 2. the relation type information (entity-relation label), corresponding to the relation set predefined on the dataset; 3. the entity role information, using the two tags "1" and "2" to denote the head entity and the tail entity respectively. Assuming there are R relations, this labeling strategy produces N = 2*4*R + 1 label categories. By introducing the entity label of the head entity and obtaining the relation label embedding through a self-learned Embedding layer, the present application improves relation extraction performance.
As shown in Fig. 2, this example serializes the annotation of the raw medical text dataset based on the new "BIOES" sequence labeling strategy. "Gastric cancer" (胃癌) is a disease entity and the head entity in the disease-examination triple, so the serialized labels of its two characters are B-C-1 and E-C-1. "Gastroscopy" (胃镜) is an examination entity and the tail entity in the disease-examination triple, so the serialized labels of its two characters are B-C-2 and E-C-2. "Ulcerative mass" (溃疡型肿物) is a symptom entity and the tail entity in the disease-symptom triple, so the serialized labels of its five characters are B-N-2, I-N-2, I-N-2, I-N-2, and E-N-2. The finally identified triples are shown in Table 1:
Table 1. Triple identification table

Head entity | Relation type | Tail entity
Gastric cancer (胃癌) | Examination | Gastroscopy (胃镜)
Gastric cancer (胃癌) | Symptom | Ulcerative mass (溃疡型肿物)
During model training, the above labeling scheme is used to annotate all raw medical text datasets used for training, casting the entity extraction and relation extraction tasks as one sequence labeling task and joining the two tasks, as illustrated by the sketch below.
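For illustration, the following minimal Python sketch enumerates the label set and tags entity spans under this strategy. The relation abbreviations C and N are the toy values taken from the example above, not the full Yidu-N4K relation inventory:

```python
# Minimal sketch of the labeling scheme described above (toy relation set).
POSITIONS = ["B", "I", "E", "S"]   # BIOES position tags (O handled separately)
ROLES = ["1", "2"]                 # 1 = head entity, 2 = tail entity

def build_label_set(relations):
    """N = 2 * 4 * R + 1 label categories: roles x positions x relations + 'O'."""
    labels = ["O"]
    for rel in relations:
        for pos in POSITIONS:
            for role in ROLES:
                labels.append(f"{pos}-{rel}-{role}")
    return labels

def tag_entity(length, rel, role):
    """Serialized labels for one entity span of the given character length."""
    if length == 1:
        return [f"S-{rel}-{role}"]
    return ([f"B-{rel}-{role}"]
            + [f"I-{rel}-{role}"] * (length - 2)
            + [f"E-{rel}-{role}"])

relations = ["C", "N"]                   # e.g. C = examination, N = symptom
print(len(build_label_set(relations)))   # 2*4*2 + 1 = 17
print(tag_entity(2, "C", "1"))           # ['B-C-1', 'E-C-1'] for "胃癌"
print(tag_entity(5, "N", "2"))           # ['B-N-2', 'I-N-2', 'I-N-2', 'I-N-2', 'E-N-2']
```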
In an example, converting the serialized labels of the entity-relation triples into word embedding vectors specifically comprises:
using a bidirectional long short-term memory encoder to convert the one-hot encoding of the text into word embedding vectors. Specifically, the forward and backward LSTM layers of the encoder are merged at the end, while the word embedding layer converts the one-hot representation of each token into a word embedding vector, capturing the semantic information of individual words well.
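A minimal PyTorch sketch of this encoding step follows, with illustrative vocabulary and dimension sizes assumed; the embedding lookup is mathematically equivalent to multiplying the one-hot representation by an embedding matrix, and the bidirectional LSTM concatenates its forward and backward hidden states:

```python
import torch
import torch.nn as nn

class BiLSTMEncoder(nn.Module):
    def __init__(self, vocab_size=8000, emb_dim=128, hidden_dim=256):
        super().__init__()
        # Embedding lookup == one-hot vector times embedding matrix
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.bilstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True,
                              bidirectional=True)

    def forward(self, token_ids):         # (batch, seq_len)
        x = self.embed(token_ids)         # (batch, seq_len, emb_dim)
        h, _ = self.bilstm(x)             # forward/backward states concatenated
        return h                          # (batch, seq_len, 2*hidden_dim)

enc = BiLSTMEncoder()
h = enc(torch.randint(0, 8000, (2, 20)))  # two sentences of 20 tokens
print(h.shape)                            # torch.Size([2, 20, 512])
```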
Further, for the text to enter the model computation, it must be converted into numeric vectors. For each token of the input text sequence, the input word vector is the sum of three parts: a token embedding, a segment embedding, and a position embedding. The token embedding, as in other language models, represents the feature of each token. The BERT model typically prepends a special [CLS] token whose representation learns the overall semantic features of the text and is commonly used for text classification tasks. The segment embedding distinguishes the relationship between two sentences and serves the next-sentence prediction task during model pre-training; in actual use the text need not be distinguished. The position embedding introduces the order information of the input sequence into the model input. The parameters of all three embeddings can be optimized during both pre-training and fine-tuning.
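The following sketch shows this three-way sum; the sizes are illustrative assumptions (21128 is the vocabulary size of common Chinese BERT checkpoints, used here only as an assumption):

```python
import torch
import torch.nn as nn

class BertInputEmbedding(nn.Module):
    def __init__(self, vocab_size=21128, max_len=512, dim=768):
        super().__init__()
        self.token = nn.Embedding(vocab_size, dim)     # per-token features
        self.segment = nn.Embedding(2, dim)            # sentence A / sentence B
        self.position = nn.Embedding(max_len, dim)     # order information

    def forward(self, token_ids, segment_ids):         # both (batch, seq_len)
        pos_ids = torch.arange(token_ids.size(1), device=token_ids.device)
        # Input vector = token + segment + position embeddings, all trainable
        return (self.token(token_ids)
                + self.segment(segment_ids)
                + self.position(pos_ids))               # (batch, seq_len, dim)

emb = BertInputEmbedding()
ids = torch.randint(0, 21128, (1, 16))
out = emb(ids, torch.zeros(1, 16, dtype=torch.long))
print(out.shape)                                        # torch.Size([1, 16, 768])
```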
In an example, the structure of the encoder-decoder model is shown in Fig. 3. In the encoding stage, the entire input text is encoded to obtain token hidden state vectors that fuse the overall semantic context; in the decoding stage, adjusting the input at each time step raises the attention paid to particular local features. That is, the decoder is skip-connected to the corresponding encoder and receives the hidden state vectors output by the encoder in time-step order, giving the joint entity and relation extraction task a better fit. The process comprises the following steps:
S51: The encoder encodes the word embedding vector input at the current time step to obtain the corresponding hidden state vector;
S52: The context variable is obtained based on the hidden state vectors;
S53: The decoder receives its own output y_{t'-1} of the previous time step, the context variable c, and the hidden state s_{t'-1} of the previous time step, and predicts the output y_{t'} of the current time step.
Specifically, given an input medical text sequence x = {x_1, x_2, ..., x_n}, the encoder encodes the input sequence into a fixed-length context vector representation, which is then decoded into an output sequence y = {y_1, y_2, ..., y_m}. Encoding the input sequence yields the corresponding hidden state matrix h = {h_1, h_2, ..., h_n}, computed as:
h_t = f(x_t, h_{t-1})
where h_t and h_{t-1} are the hidden state representations of the current and previous time steps respectively, x_t is the current input word vector, and f(·) denotes the hidden-layer transformation of the recurrent neural network. The context vector c can be viewed as a semantic representation of the whole input sequence, summarizing the semantic information of the input text; it is extracted from the hidden state encodings of the input sequence by combining the hidden states of the time steps through a custom function q:
c = q(h_1, ..., h_n)
The context vector c serves as an input to the decoder at every time step, supplementing the decoder with information. In this example the function q is defined to take the hidden state h_n of the last time step as the context vector; preferably, the context variable is varied dynamically through the attention mechanism, so that at different moments semantic information better matching the current prediction target is passed in. At each time step the decoder receives the output y_{t'-1} of the previous time step, the context variable c, and the hidden state s_{t'-1} of the previous time step as inputs, and predicts the output y_{t'} of the current time step. In this example, introducing the copy mechanism allows any entity to be reused and entities and relations to be generated in time-step order, effectively handling the entity overlap problem.
In an example, when the decoder is a unidirectional recurrent neural network, the computation introduced by the copy mechanism in the decoding stage is:
y_{t'} = g(y_{t'-1}, c, s_{t'-1})
where g(·) denotes the computation combining the recurrent neural network with the normalized exponential (softmax) function. Decoding ends when the decoder outputs the end-of-sequence symbol "<EOS>" or the maximum number of steps is reached.
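A minimal PyTorch sketch of this encode-decode loop follows, assuming GRU cells, greedy prediction, and illustrative sizes; the token ids BOS = 0 and EOS_ID = 2 are assumptions, not from the original:

```python
import torch
import torch.nn as nn

EOS_ID, VOCAB, DIM = 2, 17, 64               # illustrative values

embed = nn.Embedding(VOCAB, DIM)
encoder = nn.GRU(DIM, DIM, batch_first=True)
decoder_cell = nn.GRUCell(DIM + DIM, DIM)    # input: [y_{t'-1} embedding; c]
out_proj = nn.Linear(DIM, VOCAB)

def decode(src_ids, max_steps=30):
    h, h_n = encoder(embed(src_ids))         # h: all hidden states; h_n: last
    c = h_n.squeeze(0)                       # context vector c = q(h) = h_n
    s = c.clone()                            # initial decoder hidden state
    y_prev = torch.zeros(src_ids.size(0), dtype=torch.long)  # <BOS> = 0
    outputs = []
    for _ in range(max_steps):
        inp = torch.cat([embed(y_prev), c], dim=-1)
        s = decoder_cell(inp, s)             # s_{t'} from (y_{t'-1}, c, s_{t'-1})
        probs = out_proj(s).softmax(dim=-1)  # g(.) = RNN step + softmax
        y_prev = probs.argmax(dim=-1)
        outputs.append(y_prev)
        if (y_prev == EOS_ID).all():         # stop on "<EOS>"
            break
    return torch.stack(outputs, dim=1)

print(decode(torch.randint(0, VOCAB, (1, 12))).shape)
```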
In an example, the structure of the attention mechanism applied in the decoding stage is shown in Fig. 4. The constraints between the outputs of the entity model and the relation model are modeled by attention: attention is computed between the feature vectors of the original text sequence encoded by the pre-trained model and the feature vectors of the head entity and head-entity category sequence encoded by the pre-trained model, yielding an attention value distribution over each position of the original text sequence, so as to obtain the original-text feature vector output most important to the current head entity. The process comprises the following steps:
S61: Obtain the encoded head-entity sequence H_m and the encoded original text sequence H_n;
S62: Average the vectors of the head-entity sequence H_m to obtain the hidden-layer feature vector h_s of the head-entity sequence;
S63: Compute the attention score between the hidden-layer feature vector h_s and each position vector of the original text sequence H_n, and normalize the per-position attention scores to obtain the attention values;
S64: Assign the attention values to the original text sequence to obtain the final hidden-layer feature vector sequence H̃_n, splicing the information of the head entity with the position vectors of the original text sequence.
Specifically, the head-entity sequence is X_m = (x_1, x_2, ..., x_m), where m is the length of the head-entity sequence; the original text sequence is X_n = (x_1, x_2, ..., x_n), where n is the length of the text sequence; and the text-sequence feature vector output after applying the attention distribution is H̃_n. The interactive attention computation can then be described by the following formulas:
H_m = BERT(X_m)
H_n = BERT(X_n)
e_{s,i} = V_att^T · tanh(W_att · [h_s; h_i])
a_{s,i} = softmax(e_{s,i})
In the above formulas, X_m encoded by BERT is denoted H_m, and X_n encoded by BERT is denoted H_n. Averaging the vectors in H_m yields the hidden-layer feature vector h_s of the head-entity sequence. The attention value between h_s and the text hidden-layer feature vector h_i at each position of H_n is computed by introducing a linear layer: e_{s,i} denotes the attention score between h_s and the vector at position i of the text-sequence features H_n, and the attention scores over all positions of the text sequence are normalized by the softmax function to give the attention values a_{s,i}. Finally, the attention values are assigned to the original text sequence to obtain the final hidden-layer feature vector sequence H̃_n. Here h_s ∈ R^D, where D is the hidden dimension; W_att ∈ R^{ω×2D} and V_att ∈ R^ω are the trainable parameters of the interactive attention layer; ω denotes the parameter dimension; and [h_s; h_i] denotes the concatenated vector.
In this example, introducing the attention mechanism obtains clause representation information that, combined with the entity-pair encoding information, resolves the entity overlap and relation directionality problems at a finer granularity.
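A minimal PyTorch sketch of this interactive attention computation follows; the dimensions D and ω are illustrative, and random tensors stand in for the BERT encodings:

```python
import torch
import torch.nn as nn

D, OMEGA, m, n = 768, 128, 4, 32
H_m = torch.randn(m, D)                        # encoded head-entity sequence
H_n = torch.randn(n, D)                        # encoded original text sequence

W_att = nn.Linear(2 * D, OMEGA, bias=False)    # W_att in R^{omega x 2D}
V_att = nn.Linear(OMEGA, 1, bias=False)        # V_att in R^{omega}

h_s = H_m.mean(dim=0)                               # sum-average of head entity
pairs = torch.cat([h_s.expand(n, D), H_n], dim=-1)  # [h_s; h_i] per position
e = V_att(torch.tanh(W_att(pairs))).squeeze(-1)     # scores e_{s,i}
a = e.softmax(dim=0)                                # attention values a_{s,i}
H_tilde = a.unsqueeze(-1) * H_n                     # weighted sequence H~_n
print(H_tilde.shape)                                # torch.Size([32, 768])
```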
In an example, as shown in Fig. 5, the language model is a BERT model comprising several parallel Transformer modules, each comprising sequentially connected Transformer layers. Further, establishing the association between entity recognition and relation extraction based on this BERT model specifically comprises:
inputting the word embedding vectors into each Transformer layer;
the current Transformer layer encodes the word embedding vectors and outputs word vectors of the same dimension, which are passed to the Transformer layer (Trm layer) of the same level in each Transformer module (T1, T2, ..., TN). The encoders encode the word embedding vectors to obtain word-level and sentence-level semantic encoding information; the dimension of the original input vector is unchanged throughout the encoding process, and the final output is word vectors fused with the overall semantic context, thereby establishing the association between entity recognition and relation extraction.
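As a sketch of the shared-encoding-layer idea, the following code feeds one shared encoder into both an entity recognition head and a relation extraction head, so that gradients from both tasks update the shared parameters. A stand-in Transformer encoder replaces the pre-trained BERT, and all sizes are assumptions:

```python
import torch
import torch.nn as nn

DIM, HEADS, LAYERS, N_TAGS, N_RELS = 256, 8, 4, 17, 3

# One shared encoding layer serving both tasks
shared_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=DIM, nhead=HEADS, batch_first=True),
    num_layers=LAYERS)
entity_head = nn.Linear(DIM, N_TAGS)       # per-token sequence labeling head
relation_head = nn.Linear(DIM, N_RELS)     # relation classification head

x = torch.randn(2, 40, DIM)                # word embedding vectors
h = shared_encoder(x)                      # same dimension in, same dimension out
tag_logits = entity_head(h)                # (2, 40, N_TAGS): entity recognition
rel_logits = relation_head(h.mean(dim=1))  # (2, N_RELS): relation extraction
```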
In an example, establishing the association between entity recognition and relation extraction, i.e., establishing the dependency between the two tasks, further comprises:
creating dynamic semantic representations from the textual context to adjust the BERT model parameters, and adding a bias objective function to obtain an end-to-end bidirectional language model, mitigating the class imbalance problem in the tail entity and relation extraction tasks and yielding high-quality word-level and sentence-level semantic encodings. The end-to-end bidirectional language model can be trained by outputting natural-language entity-relation triples when fed the serialized labels of entity-relation triples, and outputs the serialized labels of entity-relation triples when fed the word embedding vectors of natural-language entity-relation triples.
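The original does not spell out the form of the bias objective function; a common way to bias a training objective against class imbalance, assumed here purely for illustration, is to weight the cross-entropy loss inversely to class frequency:

```python
import torch
import torch.nn as nn

# Assumed illustration of a bias objective: rare tag classes (tail entities,
# relation tags) are up-weighted relative to the dominant 'O' class.
tag_counts = torch.tensor([9000., 120., 80., 95., 110., 40.])  # toy counts; class 0 = 'O'
weights = tag_counts.sum() / (len(tag_counts) * tag_counts)    # rarer => heavier
biased_loss = nn.CrossEntropyLoss(weight=weights)

logits = torch.randn(32, 6)              # per-token tag logits
targets = torch.randint(0, 6, (32,))
print(biased_loss(logits, targets))      # imbalance-biased training objective
```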
In an example, outputting the entity-relation triples in natural language further comprises:
extracting the relation triples between entities following the principle of nearest matching, which effectively solves the entity overlap problem and improves extraction accuracy.
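The following sketch shows one assumed reading of the nearest-matching principle: each tail entity is paired with the closest head entity that carries the same relation type:

```python
# Assumed illustration of nearest-match triple assembly.
def nearest_match(heads, tails):
    """heads/tails: lists of (position, relation_type, text) tuples."""
    triples = []
    for t_pos, t_rel, t_text in tails:
        candidates = [h for h in heads if h[1] == t_rel]
        if not candidates:
            continue
        # Pick the head entity closest in position to this tail entity
        h_pos, h_rel, h_text = min(candidates, key=lambda h: abs(h[0] - t_pos))
        triples.append((h_text, h_rel, t_text))
    return triples

heads = [(0, "C", "胃癌"), (0, "N", "胃癌")]          # head entity spans
tails = [(10, "C", "胃镜"), (18, "N", "溃疡型肿物")]   # tail entity spans
print(nearest_match(heads, tails))
# [('胃癌', 'C', '胃镜'), ('胃癌', 'N', '溃疡型肿物')]
```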
The present application also includes a problem extraction model for joint entity recognition and relation extraction, the model comprising an encoder-decoder sub-model and a BERT sub-model. In the encoder-decoder sub-model, the encoder is connected to the decoder so that at each time step the decoder receives the decoder output y_{t'-1} of the previous time step, the context variable c, and the hidden state s_{t'-1} of the previous time step, and predicts the output y_{t'} of the current time step. The BERT sub-model comprises several parallel Transformer modules, each comprising sequentially connected Transformer layers, used to obtain high-quality word-level and sentence-level semantic encoding information and to establish the relationship between entity recognition and relation extraction.
The present application also includes a storage medium, having the same inventive concept as Embodiment 1, on which computer instructions are stored; when run, the computer instructions perform the steps of the text extraction method for joint entity recognition and relation extraction described in Embodiment 1.
Based on this understanding, the technical solution of this embodiment, in essence, or the part contributing to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods described in the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
The present application also includes a terminal, having the same inventive concept as Embodiment 1, comprising a memory and a processor, the memory storing computer instructions executable on the processor; when running the computer instructions, the processor performs the steps of the text extraction method for joint entity recognition and relation extraction described in Embodiment 1. The processor may be a single-core or multi-core central processing unit, a specific integrated circuit, or one or more integrated circuits configured to implement the present invention.
The functional units in the embodiments provided by the present invention may be integrated into one processing unit, may each exist physically separately, or two or more units may be integrated into one unit.
The above specific embodiments describe the present invention in detail, but the specific embodiments of the present invention should not be considered limited to these descriptions. A person of ordinary skill in the art to which the present invention belongs may make several simple deductions and substitutions without departing from the concept of the present invention, all of which shall be deemed to fall within the protection scope of the present invention.