CN116467465A

Info

Publication number: CN116467465A
Application number: CN202310456160.6A
Authority: CN
Inventors: 张倩
Original assignee: Ping An Technology Shenzhen Co Ltd
Current assignee: Ping An Technology Shenzhen Co Ltd
Priority date: 2023-04-18
Filing date: 2023-04-18
Publication date: 2023-07-21

Abstract

The present application discloses a knowledge-graph-based text labeling method and apparatus, a storage medium, and a computer device. The method includes: identifying entity references in a text to be annotated, and recalling, from a preset knowledge graph, candidate graph entities corresponding to the entity references; calculating the similarity between the entity references and the candidate graph entities based on the entity reference representations of the entity references and the candidate graph entity representations of the candidate graph entities, screening target similarities that meet a first preset condition, and obtaining the entity reference to be linked and the graph entity to be linked that correspond to each target similarity; and linking the entity reference to be linked with the graph entity to be linked, and using all graph entities in the preset knowledge graph that are linked to the entity references as annotation labels for the text to be annotated. This improves the precision of labeling text with a knowledge graph.

Description

Text labeling method and apparatus based on a knowledge graph, and computer device

Technical Field

The present application relates to the field of digital healthcare technology, and in particular to a knowledge-graph-based text labeling method and apparatus, a storage medium, and a computer device.

Background Art

A recommendation system can generally be divided into an indexing layer, a recall layer, and a ranking layer. The indexing layer is the basis and prerequisite for the recall and ranking layers: it must integrate relevant material in order to attach diverse and accurate labels to content that is massive in volume, uneven in quality, and highly varied in type (for example, medical texts). Suitable content labels are therefore a necessary condition for a good recommendation system. In recommendation scenarios, content is generally recalled by matching user-profile labels against content labels, so a finite set of standardized content labels is required. Taking the medical field as an example, the label set for each type of disease must be known and finite, and the label name for a given disease must be unique and definite. When building labels for medical texts, a medical knowledge graph is needed to label the texts, but medical terms appear in many variant forms, so labeling medical texts appropriately with a medical knowledge graph is a very difficult task.

Summary of the Invention

In view of this, the present application provides a knowledge-graph-based text labeling method and apparatus, a storage medium, and a computer device, which improve the accuracy of labeling text with a knowledge graph.

According to one aspect of the present application, a knowledge-graph-based text labeling method is provided, the method comprising:

identifying entity references in a text to be annotated, and recalling, from a preset knowledge graph, candidate graph entities corresponding to the entity references;

calculating the similarity between the entity reference and the candidate graph entity based on the entity reference representation of the entity reference and the candidate graph entity representation of the candidate graph entity, screening target similarities that meet a first preset condition, and obtaining the entity reference to be linked and the graph entity to be linked that correspond to the target similarity;

linking the entity reference to be linked with the graph entity to be linked, and using all graph entities in the preset knowledge graph that are linked to the entity references as annotation labels for the text to be annotated.

Optionally, calculating the similarity between the entity reference and the candidate graph entity based on the entity reference representation of the entity reference and the candidate graph entity representation of the candidate graph entity includes:

calculating a similarity matrix $W$ from the entity reference representation $M$ of the entity reference and the candidate graph entity representation $E$ of the candidate graph entity, wherein each element of $W$ is $w_{i,j} = m_i e_j$, $m_i$ denotes the $i$-th element of the entity reference representation $M$, and $e_j$ denotes the $j$-th element of the candidate graph entity representation $E$;

obtaining the similarity between the entity reference and the candidate graph entity based on the similarity matrix, wherein the similarity between the $i$-th entity reference and the $j$-th candidate graph entity is $w_{i,j}$.

Optionally, obtaining the similarity between the entity reference and the candidate graph entity based on the similarity matrix includes:

normalizing the similarity matrix to obtain a standard similarity matrix;

calculating a weighted entity reference representation from the standard similarity matrix and the candidate graph entity representation using a first weighting formula, wherein, in the first weighting formula, $m'_i$ denotes the weighted entity reference representation and $w'_{i,j}$ denotes an element of the standard similarity matrix $W'$;

calculating a weighted candidate graph entity representation from the standard similarity matrix and the entity reference representation using a second weighting formula, wherein, in the second weighting formula, $e'_j$ denotes the weighted candidate graph entity representation;

calculating the similarity between the entity reference and the candidate graph entity based on the weighted entity reference representation and the weighted candidate graph entity representation.

Optionally, calculating the similarity between the entity reference and the candidate graph entity based on the weighted entity reference representation and the weighted candidate graph entity representation includes:

passing the sequence of weighted entity reference representations through a convolutional neural network to obtain an entity reference embedding vector, and passing the sequence of weighted candidate graph entity representations through a convolutional neural network to obtain a candidate graph entity embedding vector;

combining the entity reference embedding vector and the candidate graph entity embedding vector to obtain a vector to be scored, and obtaining the similarity between the entity reference and the candidate graph entity from the vector to be scored and a preset ranking and scoring model.
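The convolution-and-scoring step can be illustrated with a minimal sketch. The text does not specify the network architecture, how the two embedding vectors are combined, or what the preset ranking and scoring model is, so everything below is an assumption made for illustration: a single 1-D convolution layer with ReLU and max-over-time pooling, concatenation as the combination, and a logistic scorer standing in for the scoring model. The names `conv_max_pool` and `score_pair` are hypothetical.

```python
import math


def conv_max_pool(sequence, kernels):
    """Apply 1-D convolution kernels over a sequence of vectors, then ReLU
    and max-over-time pooling. Returns one pooled value per kernel.

    sequence: list of equal-length vectors (one per character).
    kernels:  list of kernels; each kernel is a list of `width` weight vectors.
    """
    width = len(kernels[0])
    if len(sequence) < width:
        raise ValueError("sequence is shorter than the kernel width")
    pooled = []
    for kernel in kernels:
        activations = []
        for start in range(len(sequence) - width + 1):
            act = sum(
                w * x
                for offset in range(width)
                for w, x in zip(kernel[offset], sequence[start + offset])
            )
            activations.append(max(0.0, act))  # ReLU
        pooled.append(max(activations))  # max-over-time pooling
    return pooled


def score_pair(mention_vec, entity_vec, weights, bias=0.0):
    """Concatenate the two embeddings and score them with a logistic unit
    (an assumed stand-in for the preset ranking and scoring model)."""
    features = mention_vec + entity_vec  # list concatenation
    z = sum(w * f for w, f in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-z))
```

In practice the kernels and scorer weights would be learned; here they are passed in explicitly so that the data flow from representation sequence to similarity score is visible.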

Optionally, the preset knowledge graph includes graph entity indexes, each graph entity index including a canonical graph entity and at least one graph entity alias corresponding to the canonical graph entity; and recalling, from the preset knowledge graph, the candidate graph entities corresponding to the entity reference includes:

for any entity reference, matching the entity reference against the graph entity indexes to obtain a target graph entity index, wherein the canonical graph entity and/or any graph entity alias in the target graph entity index matches the entity reference;

determining the canonical graph entity and the graph entity aliases of the target graph entity index as candidate graph entities.

Optionally, determining the canonical graph entity and the graph entity aliases of the target graph entity index as candidate graph entities includes:

determining the canonical graph entity and the graph entity aliases of the target graph entity index as basic graph entities;

calculating the deviation distance between each basic graph entity and the entity reference, and determining the basic graph entities whose deviation distance meets a second preset condition as candidate graph entities.
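The alias-based recall and the deviation-distance filter can be sketched as follows. This is an illustrative sketch only: the text does not say how the deviation distance is computed or what the second preset condition is, so Levenshtein edit distance and a maximum-distance threshold are assumptions, and the index contents, `build_lookup`, `edit_distance`, `recall_candidates`, and `max_distance` are all hypothetical.

```python
from collections import defaultdict

# Hypothetical toy index: canonical graph entity -> aliases (illustrative data).
GRAPH_ENTITY_INDEX = {
    "type 2 diabetes": ["diabetes", "adult-onset diabetes"],
    "type 1 diabetes": ["diabetes", "juvenile diabetes"],
    "metformin": ["metformin hydrochloride"],
}


def build_lookup(index):
    """Map every surface form (canonical name or alias) to its canonical entities."""
    lookup = defaultdict(set)
    for canonical, aliases in index.items():
        for form in [canonical, *aliases]:
            lookup[form.lower()].add(canonical)
    return lookup


def edit_distance(a, b):
    """Levenshtein distance, computed with a rolling one-row table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def recall_candidates(mention, index, max_distance=8):
    """Return basic graph entities of every matching index whose edit distance
    to the mention is at most max_distance (the assumed second preset condition)."""
    lookup = build_lookup(index)
    key = mention.lower()
    candidates = set()
    for canonical in lookup.get(key, ()):
        basic_entities = [canonical, *index[canonical]]
        candidates.update(
            e for e in basic_entities if edit_distance(key, e.lower()) <= max_distance
        )
    return sorted(candidates)
```

With this toy index, the mention "diabetes" matches two indexes through a shared alias, and the distance filter drops the alias that deviates most from the mention.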

Optionally, before using all graph entities in the preset knowledge graph that are linked to the entity references as annotation labels for the text to be annotated, the method further includes:

acquiring, according to a preset association rule, associated graph entities corresponding to the graph entity to be linked;

calculating the association confidence between each associated graph entity and the entity reference, screening target association confidences that meet a third preset condition, and taking the associated graph entities corresponding to the target association confidences as supplementary graph entities;

linking the entity reference to be linked with the supplementary graph entities.

According to another aspect of the present application, a knowledge-graph-based text labeling apparatus is provided, the apparatus comprising:

an entity recall module, configured to identify entity references in a text to be annotated, and to recall, from a preset knowledge graph, candidate graph entities corresponding to the entity references;

a similarity calculation module, configured to calculate the similarity between the entity reference and the candidate graph entity based on the entity reference representation of the entity reference and the candidate graph entity representation of the candidate graph entity, to screen target similarities that meet a first preset condition, and to obtain the entity reference to be linked and the graph entity to be linked that correspond to the target similarity;

a labeling module, configured to link the entity reference to be linked with the graph entity to be linked, and to use all graph entities in the preset knowledge graph that are linked to the entity references as annotation labels for the text to be annotated.

Optionally, the similarity calculation module is further configured to:

calculate a similarity matrix $W$ from the entity reference representation $M$ of the entity reference and the candidate graph entity representation $E$ of the candidate graph entity, wherein each element of $W$ is $w_{i,j} = m_i e_j$, $m_i$ denotes the $i$-th element of the entity reference representation $M$, and $e_j$ denotes the $j$-th element of the candidate graph entity representation $E$;

Optionally, the entity recall module is further configured to:

for any entity reference, match the entity reference against the graph entity indexes to obtain a target graph entity index, wherein the preset knowledge graph includes graph entity indexes, each graph entity index includes a canonical graph entity and at least one graph entity alias corresponding to the canonical graph entity, and the canonical graph entity and/or any graph entity alias in the target graph entity index matches the entity reference;

Optionally, the labeling module is further configured to:

According to yet another aspect of the present application, a storage medium is provided, on which a computer program is stored, wherein the program, when executed by a processor, implements the above knowledge-graph-based text labeling method.

According to still another aspect of the present application, a computer device is provided, including a storage medium, a processor, and a computer program stored on the storage medium and executable on the processor, wherein the processor implements the above knowledge-graph-based text labeling method when executing the program.

With the above technical solution, the knowledge-graph-based text labeling method and apparatus, storage medium, and computer device provided by the present application identify entity references in a text to be annotated and recall, from a preset knowledge graph, candidate graph entities corresponding to the entity references. The similarity between the entity reference and the candidate graph entity is calculated based on the entity reference representation and the candidate graph entity representation, and target similarities that meet a first preset condition are screened. The entity reference to be linked and the graph entity to be linked that correspond to the target similarity are obtained and linked, and all graph entities in the preset knowledge graph that are linked to the entity references are used as annotation labels for the text to be annotated. Calculating the similarity between entity references and candidate graph entities determines the graph entity that best matches each entity reference, and using that graph entity as the annotation label improves the precision of text labeling.

The above description is only an overview of the technical solution of the present application. So that the technical means of the present application can be understood more clearly and implemented in accordance with the specification, and so that the above and other objects, features, and advantages of the present application are more readily apparent, specific embodiments of the present application are set out below.

Brief Description of the Drawings

The drawings described here provide a further understanding of the present application and form a part of it. The illustrative embodiments of the present application and their descriptions are used to explain the present application and do not unduly limit it. In the drawings:

FIG. 1 is a schematic flowchart of a knowledge-graph-based text labeling method provided in an embodiment of the present application;

FIG. 2 is a schematic flowchart of another knowledge-graph-based text labeling method provided in an embodiment of the present application;

FIG. 3 is a schematic flowchart of yet another knowledge-graph-based text labeling method provided in an embodiment of the present application;

FIG. 4 is a schematic structural diagram of a knowledge-graph-based text labeling apparatus provided in an embodiment of the present application.

Detailed Description

The present application is described in detail below with reference to the accompanying drawings and in combination with embodiments. It should be noted that, where no conflict arises, the embodiments of the present application and the features in those embodiments may be combined with one another.

This embodiment provides a knowledge-graph-based text labeling method. As shown in FIG. 1, the method includes:

Step 101: identify entity references in the text to be annotated, and recall, from a preset knowledge graph, candidate graph entities corresponding to the entity references.

A knowledge graph, known in library and information science as knowledge domain visualization or knowledge domain mapping, is a family of graphs that show the development and structural relationships of knowledge. It uses visualization techniques to describe knowledge resources and their carriers, and to mine, analyze, construct, draw, and display knowledge and the connections between pieces of knowledge. Its main features include: 1. The more often and the more widely users search, the more information and content the search engine can obtain. 2. Strings are given meaning rather than being treated as plain strings. 3. All disciplines are integrated, which makes user searches more coherent. 4. More accurate information is found for users, with more comprehensive summaries and more in-depth related information. 5. The knowledge system related to a keyword is presented to users systematically. 6. Useful information is drawn from the entire Internet so that users can obtain more relevant public resources.

In the above embodiment of the present application, named entity recognition (NER) can be used to identify the entity references (mentions) in the text to be annotated. Taking the medical field as an example, entity references can include diseases, drugs, tests and examinations, and so on. Specifically, a BERT (Bidirectional Encoder Representations from Transformers, a pre-trained language representation model) + CRF (Conditional Random Field) framework can be used: the pre-trained BERT model serves as the underlying text feature encoder, and the CRF model then predicts the entity references, yielding the entity references of the text to be annotated. The preset knowledge graph is then used to recall the candidate graph entities corresponding to each entity reference. The preset knowledge graph may be a medical knowledge graph. For example, a medical text may contain the sentence "Metformin is a pancreatic drug currently used to treat diabetes." After this sentence is processed, the entity references may include "diabetes" and "metformin", where the attribute of "diabetes" is disease and the attribute of "metformin" is drug. The candidate graph entities recalled from the medical knowledge graph for the entity reference "diabetes" may be "type 2 diabetes" or "adult diabetes".
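To make the output of the recognition step concrete, the sketch below turns a per-token tag sequence, such as a BERT + CRF tagger might emit, into entity references with their attributes. The BIO tagging scheme, the tag names `DRUG` and `DISEASE`, and the function `decode_bio` are assumptions for illustration; the text only states that the CRF predicts the entity references. The tagger itself is not shown.

```python
def decode_bio(tokens, tags, sep=" "):
    """Collect (mention, type) pairs from parallel token and BIO tag lists.

    A span starts at a "B-X" tag and continues over following "I-X" tags of
    the same type; any other tag closes the current span.
    """
    mentions = []
    current, current_type = [], None

    def flush():
        if current:
            mentions.append((sep.join(current), current_type))

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            flush()
            current, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == current_type:
            current.append(token)
        else:
            flush()
            current, current_type = [], None
    flush()
    return mentions


# Hypothetical tagger output for a sentence similar to the example above.
tokens = ["Metformin", "is", "used", "to", "treat", "type", "2", "diabetes"]
tags = ["B-DRUG", "O", "O", "O", "O", "B-DISEASE", "I-DISEASE", "I-DISEASE"]
mentions = decode_bio(tokens, tags)
```

For character-level Chinese input, `sep=""` would join the characters of a span without spaces.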

Step 102: based on the entity reference representation of the entity reference and the candidate graph entity representation of the candidate graph entity, calculate the similarity between the entity reference and the candidate graph entity, screen target similarities that meet a first preset condition, and obtain the entity reference to be linked and the graph entity to be linked that correspond to the target similarity.

Next, the entity reference is converted into an entity reference representation and the candidate graph entity is converted into a candidate graph entity representation, so that the similarity can be calculated from the two representations. The conversion can proceed as follows: the entity reference is input into the aforementioned pre-trained BERT model, and the embedding at the CLS (classification) position of the BERT output is taken as the first entity reference representation; the entity reference is also encoded with a word2vec model and TF-IDF weighting to obtain the second entity reference representation; the first and second entity reference representations are added to obtain the entity reference representation.

Similarly, the candidate graph entity is input into the aforementioned pre-trained BERT model, and the embedding at the CLS (classification) position of the BERT output is taken as the first candidate graph entity representation; the candidate graph entity is also encoded with a word2vec model and TF-IDF weighting to obtain the second candidate graph entity representation; the first and second candidate graph entity representations are added to obtain the candidate graph entity representation.
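The way the two representations are combined can be sketched as follows, taking the BERT CLS embedding and the word2vec vectors as given inputs rather than computing them. Several details are assumptions not fixed by the text: the smoothed IDF formula, the use of a weight-normalized average for the TF-IDF-weighted word vectors, and the requirement that both representations share one dimension so that they can be added. The function names are hypothetical.

```python
import math


def tfidf_weights(tokens, doc_freq, num_docs):
    """TF-IDF weight for each distinct token, using a smoothed IDF."""
    weights = {}
    for tok in set(tokens):
        tf = tokens.count(tok) / len(tokens)
        idf = math.log((1 + num_docs) / (1 + doc_freq.get(tok, 0))) + 1.0
        weights[tok] = tf * idf
    return weights


def weighted_word_vector(tokens, word_vectors, weights):
    """Weight-normalized average of the word vectors of the known tokens."""
    dim = len(next(iter(word_vectors.values())))
    total, norm = [0.0] * dim, 0.0
    for tok in tokens:
        vec = word_vectors.get(tok)
        if vec is None:
            continue  # skip out-of-vocabulary tokens
        w = weights.get(tok, 0.0)
        norm += w
        for k in range(dim):
            total[k] += w * vec[k]
    return [x / norm for x in total] if norm else total


def combine(first_repr, second_repr):
    """Element-wise sum of the first (CLS) and second (word-vector) representations."""
    if len(first_repr) != len(second_repr):
        raise ValueError("representations must have the same dimension to be added")
    return [a + b for a, b in zip(first_repr, second_repr)]
```

The same three functions apply unchanged to candidate graph entities, since both sides are encoded in the same way.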

Then, the similarity between the entity reference and the candidate graph entity is calculated from the entity reference representation and the candidate graph entity representation, and the target similarities that meet the first preset condition are screened, in order to decide whether a recalled candidate graph entity matches its entity reference. The first preset condition is, for example: sort the similarities and then take the top K subject to a threshold, where K may be 3, 4, 5, and so on. The entity reference to be linked and the graph entity to be linked that correspond to each target similarity are then obtained for the subsequent labeling of the medical text.
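One way to read the example condition "sort, then take the top K subject to a threshold" is sketched below. How the threshold and K interact is not spelled out in the text, so this sketch assumes that pairs below the threshold are dropped first and the K highest-scoring survivors are kept. The function name `select_targets`, the default values, and the scores in the usage example are hypothetical.

```python
def select_targets(scored_pairs, k=3, threshold=0.5):
    """Keep at most k (mention, entity, similarity) triples whose similarity
    is at least `threshold`, ordered from most to least similar."""
    kept = [pair for pair in scored_pairs if pair[2] >= threshold]
    kept.sort(key=lambda pair: pair[2], reverse=True)
    return kept[:k]


# Hypothetical similarity scores for one entity reference.
scored = [
    ("diabetes", "type 2 diabetes", 0.91),
    ("diabetes", "adult diabetes", 0.84),
    ("diabetes", "diabetes insipidus", 0.32),
    ("diabetes", "gestational diabetes", 0.58),
]
targets = select_targets(scored, k=2, threshold=0.5)
```

Each surviving triple supplies one entity reference to be linked and one graph entity to be linked.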

Step 103: link the entity reference to be linked with the graph entity to be linked, and use all graph entities in the preset knowledge graph that are linked to the entity references as annotation labels for the text to be annotated.

In the above embodiment of the present application, the entity reference to be linked is linked with the graph entity to be linked, and all graph entities in the preset knowledge graph that are linked to the entity references are used as annotation labels for the text to be annotated, which completes the labeling of the text. Mapping the entity references predicted by named entity recognition onto graph entities in an existing medical knowledge graph (which contains millions of nodes and relationships covering drugs, diseases, tests and examinations, indications, and so on) standardizes the labels. The resulting content labels can also serve as the basis for subsequent recall of medical texts, where medical texts are matched against user-profile labels.

By applying the technical solution of this embodiment, entity references in the text to be annotated are identified, and candidate graph entities corresponding to the entity references are recalled from the preset knowledge graph. The similarity between the entity reference and the candidate graph entity is calculated based on the entity reference representation and the candidate graph entity representation, and target similarities that meet the first preset condition are screened. The entity reference to be linked and the graph entity to be linked that correspond to the target similarity are obtained and linked, and all graph entities in the preset knowledge graph that are linked to the entity references are used as annotation labels for the text to be annotated. Calculating the similarity between entity references and candidate graph entities determines the graph entity that best matches each entity reference, and using that graph entity as the annotation label improves the precision of text labeling.

Further, as a refinement and extension of the above embodiment, and in order to describe its implementation fully, another knowledge-graph-based text labeling method is provided. As shown in FIG. 2, the method includes:

Step 201: identify entity references in the text to be annotated, and recall, from a preset knowledge graph, candidate graph entities corresponding to the entity references.

In the above embodiment of the present application, named entity recognition (NER) is used to identify the entity references in the text to be annotated, and candidate graph entities corresponding to the entity references are recalled from the preset knowledge graph, so that graph entities can later be determined from the candidates and used to label the text. The preset knowledge graph may include a medical knowledge graph.

Step 202: calculate a similarity matrix $W$ from the entity reference representation $M$ of the entity reference and the candidate graph entity representation $E$ of the candidate graph entity, wherein each element of $W$ is $w_{i,j} = m_i e_j$, $m_i$ denotes the $i$-th element of the entity reference representation $M$, $e_j$ denotes the $j$-th element of the candidate graph entity representation $E$, and the similarity between the $i$-th entity reference and the $j$-th candidate graph entity is $w_{i,j}$.

Next, to represent entity references and candidate graph entities better, the idea of the attention mechanism can be used to learn the association between them: the similarity matrix $W$ is calculated from the entity reference representation $M$ and the candidate graph entity representation $E$ using the similarity matrix formula. Specifically, a similarity matrix $W$ of dimension $|M| \times |E|$ is calculated, in which each element $w_{i,j}$ represents the similarity between the entity reference and the candidate graph entity; that is, the $i$-th row of $W$ holds the similarity between the $i$-th word of the entity reference and each word of the candidate graph entity. The similarity between the entity reference and the candidate graph entity can then be obtained from the similarity matrix.

The similarity matrix formula is:

$$W = \left[ w_{i,j} \right]_{|M| \times |E|}$$

where $W$ denotes the similarity matrix and $w_{i,j}$ denotes an element of $W$, with $w_{i,j} = m_i^{\top} e_j$, that is, the inner product of $m_i$ and $e_j$;

$M$ denotes the entity reference representation, $M = \{m_1, m_2, \ldots, m_{|M|}\}$;

$E$ denotes the candidate graph entity representation, $E = \{e_1, e_2, \ldots, e_{|E|}\}$;

$m_i$ denotes the $i$-th element of the entity reference representation $M$, $e_j$ denotes the $j$-th element of the candidate graph entity representation $E$, $0 < i \le |M|$, and $0 < j \le |E|$.

再具体的，“m₁”、“m₂”等可以相当于一个字的embedding(嵌入)表征，embedding(嵌入)可以为768纬度，如果实体指称共有5个字，即|M|＝5，那么对应的相似度矩阵可以为5*768纬度的矩阵。同理，候选图谱实体如果共有7个字，即|E|＝7，那么对应的相似度矩阵可以为7*768维度的矩阵，将候选图谱实体矩阵进行转置后，可以得到【5*768】*【768*7】，即5*7的相似度矩阵。More specifically, "m₁ ", "m₂ ", etc. can be equivalent to the embedding representation of a word, and the embedding can be 768-dimensional. If the entity reference has a total of 5 words, that is, |M|=5, then the corresponding similarity matrix can be a matrix of 5*768 dimensions. Similarly, if the candidate graph entity has a total of 7 words, that is, |E|=7, then the corresponding similarity matrix can be a matrix of 7*768 dimensions. After transposing the candidate graph entity matrix, we can get [5*768]*[768*7], that is, a 5*7 similarity matrix.

When $w_{i,j} = m_i^{\top} e_j$ is used to measure the similarity between the entity reference and the candidate graph entity, and the similarity matrix is, for example, the 5 × 7 matrix described above, $w_{2,3}$ is the similarity between the second character of the entity reference and the third character of the candidate graph entity.
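The dot-product construction described above can be sketched in plain Python. This is only an illustration: the function name `similarity_matrix` is hypothetical, and toy 3-dimensional vectors stand in for the 768-dimensional embeddings mentioned in the text.

```python
def similarity_matrix(M, E):
    """Return W where W[i][j] is the dot product of M[i] and E[j]."""
    return [[sum(a * b for a, b in zip(m_i, e_j)) for e_j in E] for m_i in M]


# Toy data (assumed values): 2 entity-reference vectors, 3 candidate vectors.
M = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0]]
E = [[1.0, 1.0, 0.0], [0.0, 2.0, 1.0], [2.0, 0.0, 1.0]]

W = similarity_matrix(M, E)
print(W)  # [[1.0, 2.0, 4.0], [1.0, 3.0, 1.0]]
```

The result has |M| rows and |E| columns, matching the |M| × |E| shape stated in the text.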

Step 203, normalize the similarity matrix to obtain a standard similarity matrix, and, based on the standard similarity matrix and the candidate graph entity representation, compute the weighted entity reference representation using a first weighting formula, in which $m'_i$ denotes the weighted entity reference representation and $w'_{i,j}$ denotes an element of the standard similarity matrix $W'$.

Step 204, based on the standard similarity matrix and the entity reference representation, compute the weighted candidate graph entity representation using a second weighting formula, in which $e'_j$ denotes the weighted candidate graph entity representation.

Next, the similarity matrix is normalized to obtain the standard similarity matrix. For example, each row of the similarity matrix can be normalized with the softmax function, i.e. $W' = \mathrm{softmax}(W)$, where $W'$ denotes the standard similarity matrix. The elements of the candidate graph entity are then used to represent the elements of the entity reference; likewise, the elements of the entity reference can be used to represent the elements of the candidate graph entity.
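Row-wise softmax normalization, as mentioned above, might look like the following sketch. The function name `softmax_rows` is an assumption, and subtracting the row maximum before exponentiating is a common numerical-stability step that the text does not mention.

```python
import math


def softmax_rows(W):
    """Normalize each row of W so that its entries are positive and sum to 1."""
    normalized = []
    for row in W:
        row_max = max(row)  # shift for numerical stability (does not change the result)
        exps = [math.exp(x - row_max) for x in row]
        total = sum(exps)
        normalized.append([x / total for x in exps])
    return normalized


W_std = softmax_rows([[1.0, 2.0, 4.0], [0.0, 0.0, 0.0]])
```

Each row of `W_std` can then be read as attention weights of one entity-reference element over all candidate-entity elements.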

Specifically, the weighted entity reference representation $m'_i$ is computed with the first weighting formula from the standard similarity matrix $W'$ and the candidate graph entity representation $e_j$, and the weighted candidate graph entity representation $e'_j$ is computed with the second weighting formula from the standard similarity matrix $W'$ and the entity reference representation $m_i$, so that the similarity between the entity reference and the candidate graph entity can be calculated from $m'_i$ and $e'_j$.

In the first and second weighting formulas, $m'_i$ denotes the weighted entity reference representation, $e'_j$ denotes the weighted candidate graph entity representation, and $w'_{i,j}$ denotes an element of the standard similarity matrix $W'$.
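The two weighting formulas are not reproduced in this text. One reading that is consistent with the surrounding definitions, offered only as a sketch, assumes that each weighted representation is an attention-weighted sum and that the same row-normalized matrix $W'$ is used in both directions:

```latex
m'_i = \sum_{j=1}^{|E|} w'_{i,j}\, e_j ,
\qquad
e'_j = \sum_{i=1}^{|M|} w'_{i,j}\, m_i
```

Some attention-based matching models normalize over columns rather than rows for the second sum; the text does not say which variant is intended.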

Step 205, pass the weighted entity reference representation sequence through a convolutional neural network to obtain an entity reference embedding vector, and pass the weighted candidate graph entity representation sequence through a convolutional neural network to obtain a candidate graph entity embedding vector.

Step 206, combine the entity reference embedding vector and the candidate graph entity embedding vector to obtain a vector to be scored, and obtain the similarity between the entity reference and the candidate graph entity from the vector to be scored and a preset ranking and scoring model.

Step 207, screen for target similarities that meet the first preset condition, and obtain the to-be-linked entity reference and the to-be-linked graph entity corresponding to each target similarity.

Next, once the weighted entity reference representation $m'_i$ and the weighted candidate graph entity representation $e'_j$ have been obtained, the difference between each of them and its original input can be analyzed. Specifically, element-wise subtraction and element-wise multiplication can be used to represent and measure the difference between the entity reference and the candidate graph entity from different angles.

The entity reference representation sequence is computed with a first serialization formula from the weighted entity reference representation $m'_i$ and the entity reference representation $m_i$, and the candidate graph entity representation sequence is computed with a second serialization formula from the weighted candidate graph entity representation $e'_j$ and the candidate graph entity representation $e_j$, so that the similarity between the entity reference and the candidate graph entity can be calculated from the two sequences;

The first serialization formula defines the entity reference representation sequence, and the second serialization formula defines the candidate graph entity representation sequence.
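The serialization formulas are not reproduced in this text. The text does say that element-wise subtraction and element-wise multiplication are used to compare each weighted representation with its original input. A minimal sketch of that idea follows, assuming (this is not confirmed by the text) that the original vector, the weighted vector, their difference and their product are concatenated; the names `enhance` and `enhance_sequence` are hypothetical.

```python
def enhance(x, x_weighted):
    """Concatenate x, x_weighted, their element-wise difference and product.

    The concatenation layout is an assumption; the source only names the
    element-wise subtraction and multiplication.
    """
    diff = [a - b for a, b in zip(x, x_weighted)]
    prod = [a * b for a, b in zip(x, x_weighted)]
    return x + x_weighted + diff + prod


def enhance_sequence(seq, weighted_seq):
    """Apply enhance() position by position to build a representation sequence."""
    return [enhance(x, xw) for x, xw in zip(seq, weighted_seq)]


example = enhance([1.0, 2.0], [0.5, 4.0])
print(example)  # [1.0, 2.0, 0.5, 4.0, 0.5, -2.0, 0.5, 8.0]
```

Under this assumption each position of the sequence becomes four times as wide as the original embedding before it is passed to the CNN.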

After this richer information about the entity reference and the candidate graph entity has been obtained, the entity reference representation sequence and the candidate graph entity representation sequence are each passed through one CNN (Convolutional Neural Network) layer to better mine their high-level semantics, yielding N-dimensional embedding representations. Finally, the two embeddings are concatenated and fed into the ranking and scoring model.

Specifically, compute $f_{out} = [f_m, f_e]$, where $f_m$ is the embedding vector of the weighted entity reference representation sequence, $f_e$ is the embedding vector of the weighted candidate graph entity representation sequence, and $f_{out}$ is the vector to be scored.

The similarity between the entity reference and the candidate graph entity can then be obtained from the vector to be scored $f_{out}$ and the preset ranking and scoring model.

Specifically, $f_{out} = [f_m, f_e]$ is passed through a multi-layer perceptron (MLP), and the final output is obtained through an activation function:

$$\mathrm{similarity} = \mathrm{sigmoid}\big(W_2 \cdot \mathrm{ReLU}(W_1 \cdot f_{out} + b_1) + b_2\big)$$

Here the part $W_2 \cdot \mathrm{ReLU}(W_1 \cdot f_{out} + b_1) + b_2$ is the multi-layer perceptron, and sigmoid is the activation function.
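The scoring expression sigmoid(W₂·ReLU(W₁·f_out + b₁) + b₂) can be sketched in plain Python as follows. The helper names and the placeholder parameter values are assumptions made for illustration; in the described method W₁, W₂, b₁ and b₂ would be learned.

```python
import math


def affine(W, x, b):
    """Compute W @ x + b for a list-of-rows matrix W."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def relu(v):
    return [max(0.0, x) for x in v]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def score(f_m, f_e, W1, b1, W2, b2):
    """Return sigmoid(W2 * ReLU(W1 * f_out + b1) + b2) with f_out = [f_m, f_e]."""
    f_out = f_m + f_e                      # concatenate the two embeddings
    hidden = relu(affine(W1, f_out, b1))
    (logit,) = affine(W2, hidden, b2)      # W2 has a single output row
    return sigmoid(logit)


# Placeholder (untrained) parameters chosen so the result is easy to check.
W1, b1 = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]
W2, b2 = [[1.0, -1.0]], [0.0]
s = score([1.0], [1.0], W1, b1, W2, b2)
print(s)  # 0.5
```

Because of the final sigmoid, the score always lies strictly between 0 and 1, which is what allows it to be compared across candidates.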

$W_1$, $W_2$, $b_1$ and $b_2$ above are the parameters that the preset ranking and scoring model has to learn. After the sigmoid activation function, the output is a concrete number, namely the similarity. For example, if the inputs are the embedding of the entity reference "二型糖尿病" (type 2 diabetes) and the embedding of the candidate graph entity "成人糖尿病" (adult diabetes), the preset ranking and scoring model yields a similarity score between the two. In particular, candidate recall for "二型糖尿病" can return N candidate graph entities. The entity reference is then fed into the ranking and scoring model together with each of the N candidates in turn, giving one similarity per candidate. For a given entity reference, the similarities of all of its candidates are sorted from high to low, and, for example, the 3 highest can be taken to obtain the graph entities.
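Sorting the candidates of one entity reference by similarity and keeping the highest few could be sketched as below. The function name `rank_candidates`, the candidate names and the scores are all made up for illustration.

```python
def rank_candidates(similarities, k=3):
    """Return the k (candidate, similarity) pairs with the highest similarity."""
    ranked = sorted(similarities.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]


# Hypothetical similarities of four candidates for a single entity reference.
similarities = {
    "candidate_a": 0.42,
    "candidate_b": 0.91,
    "candidate_c": 0.15,
    "candidate_d": 0.77,
}
top3 = rank_candidates(similarities)
print(top3)  # [('candidate_b', 0.91), ('candidate_d', 0.77), ('candidate_a', 0.42)]
```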

Step 208, link the to-be-linked entity reference with the to-be-linked graph entity, and use all graph entities in the preset knowledge graph that are linked to the entity reference as annotation labels for the text to be annotated.

In the above embodiment of the present application, the to-be-linked entity reference is linked with the to-be-linked graph entity, and all graph entities in the preset knowledge graph that are linked to the entity reference are used as annotation labels of the text to be annotated, which completes the labeling. This makes it possible to label medical texts efficiently with diverse and accurate labels, and lays a good foundation for recall and ranking in subsequent medical text recommendation scenarios. Standardized labels have also brought substantial growth to medical text recommendation scenarios.

By applying the technical solution of this embodiment, the entity references of the text to be annotated are identified, and the candidate graph entities corresponding to each entity reference are recalled from the preset knowledge graph. The similarity matrix W is computed from the entity reference representation M and the candidate graph entity representation E; the weighted entity reference representation and the weighted candidate graph entity representation are then obtained and used to form the vector to be scored, and the similarity between the entity reference and the candidate graph entity is obtained from the vector to be scored and the preset ranking and scoring model. The to-be-linked entity reference and the to-be-linked graph entity are obtained according to the similarity and linked, and all graph entities in the preset knowledge graph that are linked to the entity reference are used as annotation labels for the text to be annotated. A ranking and scoring model based on a convolutional neural network mines the high-level semantic information of the entity references in the medical text and of the candidate graph entities in the medical knowledge graph, as well as the semantic association between the two, which improves the precision of graph entity selection and in turn the accuracy of label annotation.

Further, as a refinement and extension of the specific implementation of the above embodiment, and in order to describe the implementation process of this embodiment in full, another text labeling method based on a knowledge graph is provided. As shown in FIG. 3, the method includes:

Step 301, identify the entity references of the text to be annotated; for any entity reference, match the entity reference against the graph entity index to obtain a target graph entity index, and determine the graph entity ontology and the graph entity aliases corresponding to the target graph entity index as basic graph entities, wherein the graph entity ontology and/or any graph entity alias in the target graph entity index matches the entity reference, the preset knowledge graph includes the graph entity index, and the graph entity index includes a graph entity ontology and at least one graph entity alias corresponding to the graph entity ontology;

In the above embodiment of the present application, the entity references of the text to be annotated are identified, and each entity reference is matched against the graph entity index. Medical terms vary widely and synonyms are often substituted for one another (for example "脑卒中" and "脑中风", both meaning stroke), so matching the entity reference against the graph entity index widens the selection range when screening candidate graph entities and allows as many candidate graph entities as possible to be recalled. Specifically, a graph entity index consists of a graph entity ontology and at least one graph entity alias corresponding to it. For example, if the graph entity ontology is "脑溢血" (cerebral hemorrhage), its graph entity alias can be "脑出血" (another term for cerebral hemorrhage); the corresponding graph entity index then includes both "脑溢血" and "脑出血", and the index name is the standard name, "脑溢血". Likewise, if the graph entity ontology is "脑卒中" (stroke), its alias can be "脑中风" (another term for stroke); the index then includes "脑卒中" and "脑中风", and the index name is the standard name, "脑卒中".

Take the graph entity ontology "脑溢血" with the alias "脑出血" as an example. When the entity reference is matched against the graph entity index by word matching, a match on either "脑溢血" or "脑出血" returns the target graph entity index containing both terms, and the graph entity ontology (脑溢血) and the graph entity alias (脑出血) of that index are together determined as basic graph entities. Here, the graph entity ontology and/or any graph entity alias in the target graph entity index matches the entity reference; the preset knowledge graph includes the graph entity index; and the graph entity index includes a graph entity ontology and at least one corresponding graph entity alias.
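The index lookup in this example can be sketched as follows, assuming exact string matching for the "word matching" step (the text does not specify the matching rule). The dictionary layout and the names `GRAPH_ENTITY_INDEX`, `build_lookup` and `basic_graph_entities` are hypothetical.

```python
# Standard name (graph entity ontology) -> list of aliases, using the terms from the text.
GRAPH_ENTITY_INDEX = {
    "脑溢血": ["脑出血"],
    "脑卒中": ["脑中风"],
}


def build_lookup(index):
    """Map every standard name and alias to its standard name."""
    lookup = {}
    for standard_name, aliases in index.items():
        for name in [standard_name, *aliases]:
            lookup[name] = standard_name
    return lookup


def basic_graph_entities(mention, index):
    """Return the ontology and all aliases of the index entry matched by the mention."""
    standard_name = build_lookup(index).get(mention)
    if standard_name is None:
        return []
    return [standard_name, *index[standard_name]]


print(basic_graph_entities("脑出血", GRAPH_ENTITY_INDEX))  # ['脑溢血', '脑出血']
```

Matching on an alias returns the whole index entry, which is how the recall step widens the candidate set beyond the literal mention.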

Step 302, calculate the deviation distance between each basic graph entity and the entity reference, and determine the basic graph entities whose deviation distance meets the second preset condition as candidate graph entities.

Next, the deviation distance between each basic graph entity and the entity reference is calculated; the deviation distance may include the edit distance, the pinyin distance, and so on. The basic graph entities whose deviation distance meets the second preset condition are determined as candidate graph entities, for example by applying a threshold and taking the top N, where N may be 3, 4, 5, etc. This determines one or more candidate graph entities for each entity reference.
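A minimal sketch of the edit-distance variant of this filter is shown below. It uses the standard Levenshtein distance; the function names, the default threshold and the default N are assumptions, and the pinyin distance mentioned in the text is not covered.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, 1):
        cur = [i]
        for j, ch_b in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,                   # deletion
                cur[j - 1] + 1,                # insertion
                prev[j - 1] + (ch_a != ch_b),  # substitution (free if equal)
            ))
        prev = cur
    return prev[-1]


def select_candidates(mention, basic_entities, max_distance=2, top_n=3):
    """Keep entities within max_distance of the mention, closest first, at most top_n."""
    scored = sorted((edit_distance(mention, e), e) for e in basic_entities)
    return [e for d, e in scored if d <= max_distance][:top_n]


candidates = select_candidates("脑出血", ["脑溢血", "脑出血", "脑卒中"], max_distance=1)
print(candidates)  # ['脑出血', '脑溢血']
```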

Step 303, calculate the similarity between the entity reference and the candidate graph entity based on the entity reference representation of the entity reference and the candidate graph entity representation of the candidate graph entity.

Next, the similarity between the entity reference and each candidate graph entity is calculated from their representations in order to determine the graph entity that is most similar to the entity reference, so that labeling can subsequently be performed according to that graph entity.

Step 304, screen for target similarities that meet the first preset condition, obtain the to-be-linked entity reference and the to-be-linked graph entity corresponding to each target similarity, and link the to-be-linked entity reference with the to-be-linked graph entity.

Then, the target similarities that meet the first preset condition are screened, the corresponding to-be-linked entity references and to-be-linked graph entities are obtained, and each to-be-linked entity reference is linked with its to-be-linked graph entity.

Step 305, obtain the associated graph entities corresponding to the to-be-linked graph entity according to a preset association rule, and calculate the association confidence between each associated graph entity and the entity reference.

Step 306, screen for target association confidences that meet a third preset condition, take the associated graph entities corresponding to the target association confidences as supplementary graph entities, and link the to-be-linked entity reference with the supplementary graph entities.

Step 307, use all graph entities in the preset knowledge graph that are linked to the entity reference as annotation labels for the text to be annotated.

The associated graph entities corresponding to the to-be-linked graph entity are obtained according to the preset association rule. Specifically, the to-be-linked graph entity can be a label such as a disease, a drug, or a test or examination. Using the rich relations carried by the medical knowledge graph, associations can be made to the department corresponding to a disease, the indications of a drug, the drugs commonly used for a disease, common diagnostic methods, and so on. The association confidence between each associated graph entity and the entity reference is calculated, the target association confidences that meet the third preset condition are screened, the corresponding associated graph entities are taken as supplementary graph entities, and the to-be-linked entity reference is linked with the supplementary graph entities. In this way the text content can be annotated with labels that are as rich as possible, the related labels are supplemented, and some label indexes are added. Then, all graph entities in the preset knowledge graph that are linked to the entity reference are used as annotation labels for the text to be annotated.
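The supplementary-label step can be sketched as a threshold filter over related entities. The text does not specify how the association confidence is computed, so the sketch takes the confidences as given input; the function name, the relation names, the threshold and the confidence values are all assumptions for illustration.

```python
def supplementary_entities(mention, linked_entity, relations, confidence, threshold=0.8):
    """Return entities related to linked_entity whose confidence w.r.t. the mention
    reaches the threshold (standing in for the 'third preset condition')."""
    selected = []
    for _relation, associated in relations.get(linked_entity, []):
        if confidence.get((mention, associated), 0.0) >= threshold:
            selected.append(associated)
    return selected


# Hypothetical knowledge-graph relations and made-up confidence values.
relations = {
    "2型糖尿病": [("department", "内分泌科"), ("common_drug", "二甲双胍")],
}
confidence = {
    ("二型糖尿病", "内分泌科"): 0.6,
    ("二型糖尿病", "二甲双胍"): 0.9,
}

extra = supplementary_entities("二型糖尿病", "2型糖尿病", relations, confidence)
print(extra)  # ['二甲双胍']
```

The final label set for the text would then be the union of the directly linked graph entities and these supplementary ones.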

By applying the technical solution of this embodiment, the entity references of the text to be annotated are identified. For any entity reference, the entity reference is matched against the graph entity index to obtain a target graph entity index, the graph entity ontology and graph entity aliases corresponding to the target graph entity index are determined as basic graph entities, the deviation distance between each basic graph entity and the entity reference is calculated, and the basic graph entities whose deviation distance meets the second preset condition are determined as candidate graph entities. The similarity between the entity reference and each candidate graph entity is calculated, and the to-be-linked entity reference and the to-be-linked graph entity are obtained. Associated graph entities are then obtained from the to-be-linked graph entity, the association confidence between each associated graph entity and the entity reference is calculated to obtain supplementary graph entities, the to-be-linked entity reference is linked with the supplementary graph entities, and all graph entities in the preset knowledge graph that are linked to the entity reference are used as annotation labels for the text to be annotated. Matching the entity reference against the graph entity index widens the selection range when screening candidate graph entities, so that as many candidate graph entities as possible are recalled. At the same time, obtaining supplementary graph entities and linking the to-be-linked entity reference with them allows the text content to be annotated with labels that are as rich as possible.

Further, as a specific implementation of the method of FIG. 1, an embodiment of the present application provides a text labeling device based on a knowledge graph. As shown in FIG. 4, the device includes:

an entity recall module 401, used to identify the entity references of the text to be annotated and to recall, from the preset knowledge graph, the candidate graph entities corresponding to each entity reference;

a similarity calculation module 402, used to calculate the similarity between the entity reference and the candidate graph entity based on the entity reference representation of the entity reference and the candidate graph entity representation of the candidate graph entity, to screen for target similarities that meet the first preset condition, and to obtain the to-be-linked entity reference and the to-be-linked graph entity corresponding to each target similarity;

a label annotation module 403, used to link the to-be-linked entity reference with the to-be-linked graph entity, and to use all graph entities in the preset knowledge graph that are linked to the entity reference as annotation labels for the text to be annotated.

Optionally, the similarity calculation module 402 is further used to:

Optionally, the entity recall module 401 is further used to:

Optionally, the label annotation module 403 is further used to:

It should be noted that, for other descriptions of the functional units of the text labeling device based on a knowledge graph provided in this embodiment of the present application, reference can be made to the corresponding descriptions of the methods of FIG. 1 to FIG. 3, which are not repeated here.

Based on the methods shown in FIG. 1 to FIG. 3, an embodiment of the present application correspondingly provides a storage medium on which a computer program is stored. When the computer program is executed by a processor, it implements the text labeling method based on a knowledge graph shown in FIG. 1 to FIG. 3.

Based on this understanding, the technical solution of the present application can be embodied in the form of a software product. The software product can be stored in a non-volatile storage medium (such as a CD-ROM, a USB flash drive, or a removable hard disk) and includes a number of instructions that cause a computer device (such as a personal computer, a server, or a network device) to execute the methods described in the implementation scenarios of the present application.

Based on the methods shown in FIG. 1 to FIG. 3 and the virtual device embodiment shown in FIG. 4, and in order to achieve the above purpose, an embodiment of the present application further provides a computer device, which may specifically be a personal computer, a server, a network device, or the like. The computer device includes a storage medium and a processor: the storage medium is used to store a computer program, and the processor is used to execute the computer program to implement the text labeling method based on a knowledge graph shown in FIG. 1 to FIG. 3.

Optionally, the computer device may further include a user interface, a network interface, a camera, a radio frequency (RF) circuit, a sensor, an audio circuit, a Wi-Fi module, and so on. The user interface may include a display and an input unit such as a keyboard, and may optionally also include a USB interface, a card reader interface, and the like. The network interface may optionally include a standard wired interface and a wireless interface (such as a Bluetooth interface or a Wi-Fi interface).

Those skilled in the art will understand that the computer device structure provided in this embodiment does not limit the computer device, which may include more or fewer components, combine certain components, or arrange the components differently.

The storage medium may also include an operating system and a network communication module. The operating system is a program that manages and saves the hardware and software resources of the computer device, and supports the running of information processing programs and other software and/or programs. The network communication module is used for communication between the components inside the storage medium, and for communication with other hardware and software in the physical device.

From the description of the above implementations, those skilled in the art can clearly understand that the present application can be implemented by software plus a necessary general-purpose hardware platform, or by hardware. The entity references of the text to be annotated are identified, and the candidate graph entities corresponding to each entity reference are recalled from the preset knowledge graph. The similarity between the entity reference and each candidate graph entity is calculated based on the entity reference representation and the candidate graph entity representation, the target similarities that meet the first preset condition are screened, and the corresponding to-be-linked entity reference and to-be-linked graph entity are obtained and linked. All graph entities in the preset knowledge graph that are linked to the entity reference are then used as annotation labels for the text to be annotated. Calculating the similarity between the entity reference and the candidate graph entities identifies the graph entity that best fits the entity reference, and using that graph entity as the annotation label of the text improves the accuracy of text labeling.

Those skilled in the art will understand that the accompanying drawings are only schematic diagrams of a preferred implementation scenario, and that the modules or processes in the drawings are not necessarily required for implementing the present application. Those skilled in the art will also understand that the modules of the device in an implementation scenario can be distributed in the device as described for that scenario, or can be changed accordingly and located in one or more devices different from those of this implementation scenario. The modules of the above implementation scenario can be combined into one module or further split into multiple submodules.

The serial numbers used in the present application are for description only and do not indicate the relative merits of the implementation scenarios. The above discloses only a few specific implementation scenarios of the present application; the present application is not limited to them, and any variation that a person skilled in the art can conceive of shall fall within the scope of protection of the present application.