CN117634612A

Movatterモバイル変換

Info

Publication number: CN117634612A
Application number: CN202311679999.2A
Authority: CN
Inventors: 杜治娟; 邹佳呈; 李子轩
Original assignee: Inner Mongolia University
Current assignee: Inner Mongolia University
Priority date: 2023-12-07
Filing date: 2023-12-07
Publication date: 2024-03-01
Anticipated expiration: 2043-12-07
Also published as: CN117634612B

Abstract

The invention provides a method and a system for constructing a mathematical course class knowledge graph, and relates to the technical field of knowledge graphs. The invention solves the technical problem that the existing relation extraction method can not accurately mine the entity relation of the mathematical program class, realizes the relation classification method by fusing sparse dual affine attention and self-adaptive threshold loss, and overcomes the difficulty of complex dependence and overlapping relation extraction. Meanwhile, the hyperbolic space word is embedded into the sparse projection model to predict the upper and lower relationship, and the hyperbolic space has the capacity of expressing the hierarchical structure, so that the upper and lower relationship can be predicted better, and the calculation cost is reduced.

Description

Translated fromChinese

数学课程类知识图谱构建方法和系统Methods and systems for constructing knowledge graphs for mathematics courses

技术领域Technical field

本发明涉及知识图谱技术领域，具体涉及一种数学课程类知识图谱构建方法和系统。The invention relates to the technical field of knowledge graphs, and in particular to a method and system for constructing a mathematics course knowledge graph.

背景技术Background technique

知识图谱是结构化的语义知识库，用于以符号形式描述物理世界中的概念及其相互关系，其基本组成单位是“实体—关系—实体”三元组，以及实体及其相关属性—值对，实体间通过关系相互联结，构成网状的知识结构。Knowledge graph is a structured semantic knowledge base that is used to describe concepts and their relationships in the physical world in symbolic form. Its basic unit is the "entity-relationship-entity" triplet, as well as entities and their related attributes-values. Yes, entities are connected to each other through relationships, forming a networked knowledge structure.

数学课程类知识图谱构建主要是实例层的构建，在构建实例层主要流程包括实体识别和关系抽取。The construction of knowledge graphs for mathematics courses is mainly the construction of the instance layer. The main processes of building the instance layer include entity recognition and relationship extraction.

然而，目前没有针对数学课程类(如离散数学知识点)的数学课程类知识图谱构建方法，若通过现有的数学课程类知识图谱构建方法构建数学课程类知识图谱，因数学课程类实体的包含关系即一个上位词包含多个下位词；转化关系即一句话通常包含多个实体对，并且一个实体对可能与多种可能的关系相关联，目前关系抽取方法不能做到精准挖掘，因此，亟需要为数学课程构建知识点之间包含关系和转换关系的知识图谱。However, there is currently no mathematics course knowledge graph construction method for mathematics courses (such as discrete mathematics knowledge points). If the mathematics course knowledge graph is constructed through the existing mathematics course knowledge graph construction method, due to the inclusion of mathematics course entities, Relationship means that a hypernym contains multiple hyponyms; transformation relationship means that a sentence usually contains multiple entity pairs, and one entity pair may be associated with multiple possible relationships. The current relationship extraction method cannot achieve accurate mining. Therefore, there is an urgent need to It is necessary to construct a knowledge graph containing relationships and transformation relationships between knowledge points for mathematics courses.

发明内容Contents of the invention

(一)解决的技术问题(1) Technical problems solved

针对现有技术的不足，本发明提供了一种数学课程类知识图谱构建方法和系统，解决了目前关系抽取方法不能做到精准挖掘数学课程类的实体关系。In view of the shortcomings of the existing technology, the present invention provides a method and system for constructing a knowledge graph of mathematics courses, which solves the problem that current relationship extraction methods cannot accurately mine entity relationships of mathematics courses.

(二)技术方案(2) Technical solutions

为实现以上目的，本发明通过以下技术方案予以实现：In order to achieve the above objectives, the present invention is implemented through the following technical solutions:

第一方面，本发明提供一种数学课程类知识图谱构建方法，包括：In a first aspect, the present invention provides a method for constructing a mathematics course knowledge graph, including:

S1、获取数学知识点数据并进行预处理，得到训练数据和待自动化抽取数据库，基于训练数据构建实体识别和上下位关系获取的训练语料库；S1. Obtain mathematical knowledge point data and perform preprocessing to obtain training data and a database to be automatically extracted, and build a training corpus for entity recognition and hypernym relationship acquisition based on the training data;

S2、通过训练语料库训练实体识别模型，利用训练好的实体识别模型识别待自动化抽取数据库中的实体；S2. Train the entity recognition model through the training corpus, and use the trained entity recognition model to identify entities in the database to be automatically extracted;

S3、通过训练语料库训练双曲空间词嵌入稀疏投影模型，通过训练好的双曲空间词嵌入稀疏投影模型对实体进行上下位关系进行识别；S3. Use the training corpus to train the hyperbolic space word embedding sparse projection model, and use the trained hyperbolic space word embedding sparse projection model to identify the hypernym relationship of entities;

S4、根据实体识别结果和实体上下位关系，利用实体掩盖技术，标记待自动化抽取数据库中的句字中提到的每个实体的开始和结束；通过基于稀疏双仿射关注的BERT模型和自适应阈值损失的分类模型进行句子中的实体间的转化关系抽取。S4. Based on the entity recognition results and the hyponymic relationship between entities, use entity masking technology to mark the beginning and end of each entity mentioned in the sentences in the database to be automatically extracted; through the BERT model based on sparse biaffine attention and automatic The classification model adapted to threshold loss extracts transformation relationships between entities in sentences.

优选的，所述S1具体包括：Preferably, the S1 specifically includes:

将纸质版教材扫描为图片，将图片中的内容信息转化成文本；Scan paper textbooks into pictures and convert the content information in the pictures into text;

将获取到的文本分为课后知识总结和正文文本两个部分，课后知识总结作为训练数据，用于构建训练语料库；正文文本作为待自动化抽取数据库；The acquired text is divided into two parts: after-class knowledge summary and main text. The after-class knowledge summary is used as training data to build a training corpus; the main text is used as the database to be automatically extracted;

其中构建训练语料库的过程如下：The process of constructing the training corpus is as follows:

利用ChatGPT4从训练数据中识别出短语，去除低词频短语，与去除仅包含字母与标点符号的公式共同构成知识点实体，并在原句子中采用BIESO格式标注实体；Use ChatGPT4 to identify phrases from the training data, remove low word frequency phrases, and remove formulas that only contain letters and punctuation marks to form knowledge point entities, and use BIESO format to annotate entities in the original sentences;

以句子为单位挖去所有实体获取句子模式，并利用ChatGPT4找出表示逻辑推理的词，按照词频，无交集的为句法模式分类，再用规则标注关系，得到训练语料库。Digging out all entities in units of sentences to obtain sentence patterns, and using ChatGPT4 to find words that represent logical reasoning. According to word frequency, syntactic patterns without intersection are classified, and then rules are used to mark relationships to obtain a training corpus.

优选的，所述实体识别模型的结构包括词级表征、BERT和多任务级联；Preferably, the structure of the entity recognition model includes word-level representation, BERT and multi-task cascade;

其中，词级表征用于在原本字级别的序列标注任务上加入以词为单位的表征；Among them, word-level representation is used to add word-based representation to the original word-level sequence labeling task;

BERT用于通过联合调节所有层中的左右上下文来预训练深度双向表示；BERT is used to pre-train deep bidirectional representations by jointly conditioning left and right context in all layers;

多任务级联是指在得到实体识别的结果之后，返回到BERT输出层，找各个实体词的表征向量，再把实体的表征向量输入全连接层做分类，判断实体类型。Multi-task cascade means that after obtaining the results of entity recognition, return to the BERT output layer to find the representation vector of each entity word, and then input the representation vector of the entity into the fully connected layer for classification to determine the entity type.

优选的，所述实体识别模型通过加权损失函数衡量和优化性能；Preferably, the entity recognition model measures and optimizes performance through a weighted loss function;

若模型没有识别到一个本应该识别到的实体，就增大加权损失函数，加重对模型的惩罚；若模型识别到了一个不应该识别到的实体，则减小对应的加权损失函数。If the model does not recognize an entity that should be recognized, the weighted loss function is increased to increase the penalty on the model; if the model recognizes an entity that should not be recognized, the corresponding weighted loss function is reduced.

优选的，所述通过训练语料库训练双曲空间词嵌入稀疏投影模型，包括：Preferably, training the hyperbolic space word embedding sparse projection model through the training corpus includes:

其中，双曲空间词嵌入稀疏投影模型包括归一化嵌入映射单元、投影特征生成单元和分类单元；Among them, the hyperbolic space word embedding sparse projection model includes a normalized embedding mapping unit, a projection feature generation unit and a classification unit;

301、将训练语料库中上下位关系数据作为正样本数据，并根据正样本数据生成负样本数据；301. Use the hypernym relationship data in the training corpus as positive sample data, and generate negative sample data based on the positive sample data;

S302、通过归一化嵌入映射单元对正样本数据和负样本数据均进行归一化嵌入处理后映射到双曲空间，得到词向量样本数据，所述词向量样本数据包括具有上下位关系集合的正词向量样本数据和包括无上下位关系集合的负词向量样本数据；S302. Use the normalized embedding mapping unit to perform normalized embedding processing on both the positive sample data and the negative sample data and then map them to the hyperbolic space to obtain word vector sample data. The word vector sample data includes a set of superordinate and hyponymic relationships. Positive word vector sample data and negative word vector sample data including a set of non-hypernym relationships;

S303、对词向量样本数据进行聚类，得到聚类中心，并通过聚类中心和上下位关系对获取上下位关系词在上下位关系集合中的权重和非上下位关系词在非上下位关系集合中的权重；S303. Cluster the word vector sample data to obtain the cluster center, and obtain the weight of the hypernym relation word in the hyponym relation set and the non-hypernym relation word of the non-hypernym relation word through the cluster center and the hyponym relation pair. weight in the set;

304、基于上下位关系词在上下位关系集合中的权重和正词向量样本数据，进行稀疏投影向量的计算，得到稀疏的正投影矩阵；基于非上下位关系词在非上下位关系集合中的权重和负词向量样本数据，进行稀疏双投影向量的计算，得到稀疏的负投影矩阵；304. Based on the weight of the hypernym and hyponym relational words in the hyponymy and hyponym relation set and the sample data of orthographic word vectors, calculate the sparse projection vector and obtain a sparse orthographic projection matrix; based on the weight of the non-hypernym and hyponym relational words in the non-hypernym and hyponym relation set. and negative word vector sample data, calculate sparse dual projection vectors, and obtain a sparse negative projection matrix;

305、在投影特征生成单元中根据正投影矩阵、负投影矩阵和词向量样本数据生成正样本投影特征和负样本投影特征；305. In the projection feature generation unit, generate positive sample projection features and negative sample projection features based on the positive projection matrix, negative projection matrix and word vector sample data;

306、基于正样本投影特征和负样本投影特征训练分类单元。306. Train classification units based on positive sample projection features and negative sample projection features.

优选的，所述通过训练好的双曲空间词嵌入稀疏投影模型对S2的实体进行上下位关系进行识别，包括：Preferably, the method of identifying the hyponymy relationship of entities in S2 through the trained hyperbolic space word embedding sparse projection model includes:

通过归一化嵌入映射单元将实体(x，y)进行归一化嵌入后映射到双曲空间，得到将/>输入到投影特征生成单元，计算得到投影特征，投影特征输入到分类单元中，输出0或者1，其中0表示无上下位关系，1表示有上下位关系。The entity (x, y) is normalized and embedded through the normalized embedding mapping unit and then mapped to the hyperbolic space, and we get Will/> It is input to the projection feature generation unit, and the projection feature is calculated. The projection feature is input to the classification unit, and outputs 0 or 1, where 0 means there is no superior-hybrid relationship, and 1 means there is a superior-hybrid relationship.

优选的，所述通过基于稀疏双仿射关注的BERT模型和自适应阈值损失的分类模型进行句子中的实体间的转化关系抽取，包括：Preferably, the conversion relationship extraction between entities in the sentence is carried out through the BERT model based on sparse biaffine attention and the classification model of adaptive threshold loss, including:

通过稀疏矩阵，来求解权重矩阵；公式如下：Solve the weight matrix through a sparse matrix; the formula is as follows:

W⁽⁰⁾＝SW⁽⁰⁾ ＝S

其中，W⁽⁰⁾表示初始权重矩阵，S表示稀疏矩阵；使用稀疏矩阵来近似BERT模型内部的权重矩阵；给定一个标记好的句子s通过权重稀疏化后的BERT获取每个词的上下文表示h_i：Among them, W⁽⁰⁾ represents the initial weight matrix, and S represents the sparse matrix; use the sparse matrix to approximate the weight matrix inside the BERT model; given a labeled sentence s, obtain the context representation of each word through the weight sparse BERT h_i :

{h₁,...h_|s|}＝BERT{x₁,...x_|s|}{h₁ ,...h_|s| }＝BERT{x₁ ,...x_|s| }

其中，x_i表示每个词的输入嵌入；Among them,_xi represents the input embedding of each word;

采用跨句上下文跟踪，将句子扩展到固定的窗口大小W，为了更好的获取词的位置信息，采用了双仿射注意力机制，具体包括：Cross-sentence context tracking is used to expand the sentence to a fixed window size W. In order to better obtain the position information of the word, a double affine attention mechanism is adopted, including:

采用了二维降维MLPs，一个头MLP，一个尾MLP：Two-dimensional dimensionality reduction MLPs are used, a head MLP and a tail MLP:

和/>是头尾实体的映射表示； and/> It is a mapping representation of head and tail entities;

计算每个词的得分：Calculate the score of each word:

其中U₁∈R^|y|*d*d和U₂∈R^|y|*2d是权重参数，b是偏差，表示串联；where U₁ ∈R^|y|*d*d and U₂ ∈R^|y|*2d are weight parameters, b is the bias, Represents series connection;

每个词的得分g_i,j作为分类器的输出，范围在[0，1]内，该得分需要阈值化来转换为关系标签；The score g_i,j of each word is used as the output of the classifier, ranging from [0, 1]. This score needs to be thresholded to convert into a relationship label;

使用伪类TH来学习多标签分类的动态阈值，具体如下：Use pseudo-class TH to learn dynamic thresholds for multi-label classification, as follows:

对于每个实体对将关系集R划分为两个部分：包含关系r的正集P存在于/>之间，并且负集N＝R-P，应用自适应阈值损失来学习关系分类器，引入一个阈值类TH，并采用基于自适应阈值损失的标准分类交叉熵损失函数，如下所示：For each entity pair Divide the relation set R into two parts: the positive set P containing the relation r exists in/> , and the negative set N = RP, apply adaptive threshold loss to learn the relation classifier, introduce a threshold class TH, and adopt the standard classification cross-entropy loss function based on adaptive threshold loss, as shown below:

L＝L₁+L₂L＝L₁ +L₂

其中，logit关系类型的值P_T表示存在关系的实体集合，logit_r′表示在P_T类中也在TH类中关系类型的值，logit_r表示在P_T类中关系类型的值，logit_TH表示在P_T类中也在TH类中关系类型的值。Among them, the value P_T of the logit relationship type represents the set of entities with relationships, logit_r′ represents the value of the relationship type in the P_T class and also in the TH class, logit_r represents the value of the relationship type in the P_T class, logit_TH Represents the value of the relationship type in the P_T class and also in the TH class.

第二方面，本发明提供一种数学课程类知识图谱构建系统，包括：In a second aspect, the present invention provides a mathematics course knowledge graph construction system, which includes:

数据模块，用于获取数学知识点数据并进行预处理，得到训练数据和待自动化抽取数据库，基于训练数据构建实体识别和上下位关系获取的训练语料库；The data module is used to obtain mathematical knowledge point data and perform preprocessing to obtain training data and a database to be automatically extracted, and build a training corpus for entity recognition and hypernym relationship acquisition based on the training data;

实体识别模块，用于通过训练语料库训练实体识别模型，利用训练好的实体识别模型识别待自动化抽取数据库中的实体；The entity recognition module is used to train the entity recognition model through the training corpus, and use the trained entity recognition model to identify entities in the database to be automatically extracted;

关系识别模块，用于通过训练语料库训练双曲空间词嵌入稀疏投影模型，通过训练好的双曲空间词嵌入稀疏投影模型对实体进行上下位关系进行识别；The relationship recognition module is used to train the hyperbolic space word embedding sparse projection model through the training corpus, and identify the hyponym relationship of entities through the trained hyperbolic space word embedding sparse projection model;

转化关系抽取模块，用于根据实体识别结果和实体上下位关系，利用实体掩盖技术，标记待自动化抽取数据库中的句字中提到的每个实体的开始和结束；通过基于稀疏双仿射关注的BERT模型和自适应阈值损失的分类模型进行句子中的实体间的转化关系抽取。The conversion relationship extraction module is used to use entity masking technology to mark the beginning and end of each entity mentioned in the sentences in the database to be automatically extracted based on the entity recognition results and the entity's hyponym relationship; through sparse biaffine attention based The BERT model and the adaptive threshold loss classification model extract the transformation relationships between entities in the sentence.

第三方面，本发明提供一种计算机可读存储介质，其存储用于数学课程类知识图谱构建的计算机程序，其中，所述计算机程序使得计算机执行如上述所述的数学课程类知识图谱构建方法。In a third aspect, the present invention provides a computer-readable storage medium that stores a computer program for constructing a mathematics course knowledge graph, wherein the computer program causes the computer to execute the method for constructing a mathematics course knowledge graph as described above. .

第四方面，一种电子设备，其特征在于，包括：A fourth aspect, an electronic device, is characterized by including:

一个或多个处理器，存储器，以及一个或多个程序，其中所述一个或多个程序被存储在所述存储器中，并且被配置成由所述一个或多个处理器执行，所述程序包括用于执行如上述所述的数学课程类知识图谱构建方法。one or more processors, memory, and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the program It includes a method for constructing a mathematics course knowledge graph as described above.

(三)有益效果(3) Beneficial effects

本发明提供了一种数学课程类知识图谱构建方法和系统。与现有技术相比，具备以下有益效果：The present invention provides a method and system for constructing a mathematics course knowledge graph. Compared with existing technology, it has the following beneficial effects:

本发明实施例首先获取数学知识点数据并进行预处理，得到训练数据和待自动化抽取数据库，基于训练数据构建实体识别和上下位关系获取的训练语料库；然后通过训练语料库训练实体识别模型，利用训练好的实体识别模型识别待自动化抽取数据库中的实体；通过训练语料库训练双曲空间词嵌入稀疏投影模型，通过训练好的双曲空间词嵌入稀疏投影模型对实体进行上下位关系进行识别；最后根据实体识别结果和实体上下位关系，利用实体掩盖技术，标记待自动化抽取数据库中的句字中提到的每个实体的开始和结束；通过基于稀疏双仿射关注的BERT模型和自适应阈值损失的分类模型进行句子中的实体间的转化关系抽取。本发明解决了目前关系抽取方法不能做到精准挖掘数学程类的实体关系的技术问题，实现通过融合稀疏双仿射关注与自适应阈值损失的关系分类方法，克服复杂依赖与重叠关系抽取的困难。同时，利用双曲空间词嵌入稀疏投影模型进行上下位关系预测，双曲空间具有表达层次结构的能力，能更好预测上下位关系，并降低计算代价。The embodiment of the present invention first obtains mathematical knowledge point data and performs preprocessing to obtain training data and a database to be automatically extracted. Based on the training data, a training corpus for entity recognition and synchronization acquisition is constructed; then the entity recognition model is trained through the training corpus, and the training corpus is used to A good entity recognition model identifies the entities to be automatically extracted from the database; trains the hyperbolic space word embedding sparse projection model through the training corpus, and uses the trained hyperbolic space word embedding sparse projection model to identify the hyponymy relationship of the entity; finally, according to Entity recognition results and entity hyponymy relationships use entity masking technology to mark the beginning and end of each entity mentioned in the sentences in the database to be automatically extracted; through the BERT model based on sparse biaffine attention and adaptive threshold loss The classification model extracts transformation relationships between entities in sentences. The present invention solves the technical problem that current relationship extraction methods cannot accurately mine entity relationships of mathematical programs, and implements a relationship classification method that integrates sparse biaffine attention and adaptive threshold loss to overcome the difficulties of extracting complex dependencies and overlapping relationships. . At the same time, the hyperbolic space word embedding sparse projection model is used to predict the hyponym relationship. Hyperbolic space has the ability to express hierarchical structures, which can better predict the hypernym relationship and reduce the computational cost.

附图说明Description of drawings

为了更清楚地说明本发明实施例或现有技术中的技术方案，下面将对实施例或现有技术描述中所需要使用的附图作简单地介绍，显而易见地，下面描述中的附图仅仅是本发明的一些实施例，对于本领域普通技术人员来讲，在不付出创造性劳动的前提下，还可以根据这些附图获得其他的附图。In order to explain the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings in the following description are only These are some embodiments of the present invention. For those of ordinary skill in the art, other drawings can be obtained based on these drawings without exerting creative efforts.

图1为本发明实施例一种数学课程类知识图谱构建方法的框图；Figure 1 is a block diagram of a method for constructing a mathematics course knowledge graph according to an embodiment of the present invention;

图2为实体识别模型的结构图；Figure 2 is the structure diagram of the entity recognition model;

图3为实体识别模型的训练过程的流程图；Figure 3 is a flow chart of the training process of the entity recognition model;

图4为实体识别模型的加权损失函数的计算方法示意图；Figure 4 is a schematic diagram of the calculation method of the weighted loss function of the entity recognition model;

图5为步骤S4的具体流程图。Figure 5 is a specific flow chart of step S4.

具体实施方式Detailed ways

为使本发明实施例的目的、技术方案和优点更加清楚，对本发明实施例中的技术方案进行清楚、完整地描述，显然，所描述的实施例是本发明一部分实施例，而不是全部的实施例。基于本发明中的实施例，本领域普通技术人员在没有作出创造性劳动前提下所获得的所有其他实施例，都属于本发明保护的范围。In order to make the purpose, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention are clearly and completely described. Obviously, the described embodiments are part of the embodiments of the present invention, not all implementations. example. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without making creative efforts fall within the scope of protection of the present invention.

本申请实施例通过提供一种数学课程类知识图谱构建方法和系统，解决了目前关系抽取方法不能做到精准挖掘数学课程类的实体关系的技术问题，实现通过融合稀疏双仿射关注与自适应阈值损失的关系分类方法，克服复杂依赖与重叠关系抽取的困难。The embodiments of this application provide a method and system for constructing a knowledge graph of mathematics courses, which solves the technical problem that the current relationship extraction method cannot accurately mine the entity relationships of mathematics courses, and realizes the integration of sparse biaffine attention and adaptive The relationship classification method of threshold loss overcomes the difficulty of extracting complex dependencies and overlapping relationships.

本申请实施例中的技术方案为解决上述技术问题，总体思路如下：The technical solutions in the embodiments of this application are to solve the above technical problems. The general idea is as follows:

数学领域数学课程类知识图谱构建过程包括命名实体识别，关系抽取。命名实体识别包括：BERT、BiLSTM-CRF等，现有的上下位关系预测模型有单投影模、分段投影模型等，关系分类模型包括包括BERT等。但传统的模型命名实体识别模型、上下位关系预测模型、关系分类模型在数学课程类知识图谱构建中存在一下问题：The construction process of the knowledge graph for mathematics courses in the field of mathematics includes named entity recognition and relationship extraction. Named entity recognition includes: BERT, BiLSTM-CRF, etc. Existing epilepsy relationship prediction models include single projection models, segmented projection models, etc., and relationship classification models include BERT, etc. However, the traditional model named entity recognition model, hyponym relationship prediction model, and relationship classification model have the following problems in the construction of mathematics course knowledge graphs:

(1)数学课程中的实体属于专业术语，传统实体识别方法容易将实体割裂，即实体边界识别错误；并且精确度与召回率不能达到双高。(1) Entities in mathematics courses are professional terms. Traditional entity recognition methods are easy to separate entities, that is, entity boundary recognition errors; and the precision and recall cannot reach both high levels.

(2)数学课程中的包含关系，通常是1：N关系，即一个上位词包含多个下位词，且每个上位词包含的下位词的数量一般相差较大。目前大多采用词嵌入投影模型获取，传统欧氏空间中常规矩阵线性变换的方法不能很好地表示下位词较多且不均匀的情形。(2) The inclusion relationship in mathematics courses is usually a 1:N relationship, that is, a hypernym contains multiple hyponyms, and the number of hyponyms contained in each hypernym generally differs greatly. At present, most words are obtained using the word embedding projection model. The conventional matrix linear transformation method in the traditional Euclidean space cannot well represent the situation where there are many and uneven hyponyms.

(3)在数学课程资料中，一句话通常包含多个实体对，并且一个实体对可能与多种可能的关系相关联，因此，知识点之间的转化关系抽取需要同时捕捉多实体间的依赖关系和重叠关系，目前关系抽取方法不能同时做到这两点。(3) In mathematics course materials, a sentence usually contains multiple entity pairs, and one entity pair may be associated with multiple possible relationships. Therefore, the extraction of transformation relationships between knowledge points needs to capture the dependencies between multiple entities at the same time. Relationships and overlapping relationships, current relationship extraction methods cannot do both at the same time.

为解决上述问题，本发明实施例提出了一种数学课程类知识图谱构建方法和系统。In order to solve the above problems, embodiments of the present invention propose a method and system for constructing a mathematics course knowledge graph.

为了更好的理解上述技术方案，下面将结合说明书附图以及具体的实施方式对上述技术方案进行详细的说明。In order to better understand the above technical solution, the above technical solution will be described in detail below with reference to the accompanying drawings and specific implementation modes.

本发明实施例提供一种数学课程类知识图谱构建方法，如图1所示，该方法包括：An embodiment of the present invention provides a method for constructing a mathematics course knowledge graph, as shown in Figure 1. The method includes:

S4、根据实体识别结果和实体上下位关系，利用实体掩盖技术，标记待自动化抽取数据库中的句字中提到的每个实体的开始和结束；通过基于稀疏双仿射关注的BERT模型和自适应阈值损失的分类模型进行句子中的实体间的转化关系抽取。S4. Based on the entity recognition results and the entity's hyponymy relationship, use entity masking technology to mark the beginning and end of each entity mentioned in the sentences in the database to be automatically extracted; through the BERT model based on sparse biaffine attention and automatic The classification model adapted to threshold loss extracts transformation relationships between entities in sentences.

本发明实施例解决了目前关系抽取方法不能做到精准挖掘数学程类的实体关系的技术问题，实现通过融合稀疏双仿射关注与自适应阈值损失的关系分类方法，克服复杂依赖与重叠关系抽取的困难。同时，利用双曲空间词嵌入稀疏投影模型进行上下位关系预测，双曲空间具有表达层次结构的能力，能更好预测上下位关系，并降低计算代价。Embodiments of the present invention solve the technical problem that current relationship extraction methods cannot accurately mine entity relationships in mathematical programs, and implement a relationship classification method that integrates sparse biaffine attention and adaptive threshold loss to overcome complex dependencies and overlapping relationship extraction. Difficulties. At the same time, the hyperbolic space word embedding sparse projection model is used to predict the hyponym relationship. Hyperbolic space has the ability to express hierarchical structures, which can better predict the hypernym relationship and reduce the computational cost.

下面对各个步骤进行详细说明：Each step is explained in detail below:

在步骤S1中，获取数学知识点数据并进行预处理，得到训练数据和待自动化抽取数据库，基于训练数据构建实体识别和上下位关系获取的训练语料库；具体实施过程如下：In step S1, obtain mathematical knowledge point data and perform preprocessing to obtain training data and a database to be automatically extracted. Based on the training data, a training corpus for entity recognition and hypernym relationship acquisition is constructed; the specific implementation process is as follows:

在本发明实施例中，以构建离散数学知识点知识图谱为例进行详细说明。In the embodiment of the present invention, detailed description is given by taking the construction of a knowledge graph of discrete mathematical knowledge points as an example.

数据预处理的目的是将纸质版教材转化为机器可识别的数据，并构建训练语料库备训练模型用，主要采用以下方式：The purpose of data preprocessing is to convert paper textbooks into machine-recognizable data and build a training corpus for training models. The following methods are mainly used:

首先，从纸质版教材提取文字和公式形式的文本。将纸质版教材扫描为图片，利用Yolov8模型识别文字、公式、图片、表格等内容信息。对于文字部分，采用PDF解析工具提取文字形成文本；对于公式，采用Image2LaTeX工具将图片转化为latex格式的文本。将获取到的文本分为课后知识总结和正文文本两个部分，课后知识总结作为训练数据，用于构建训练语料库；正文文本作为待自动化抽取数据库，用于自动化抽取知识点实体及其关系。First, text in the form of words and formulas was extracted from the paper version of the textbook. Scan paper textbooks into pictures, and use the Yolov8 model to identify text, formulas, pictures, tables and other content information. For the text part, the PDF parsing tool is used to extract the text and form text; for the formula, the Image2LaTeX tool is used to convert the image into latex format text. The acquired text is divided into two parts: after-class knowledge summary and main text. The after-class knowledge summary is used as training data to build a training corpus; the main text is used as a database to be automatically extracted, used to automatically extract knowledge point entities and their relationships. .

然后，构建实体识别和上下位关系获取的训练语料库。利用ChatGPT4从课后知识总结文本中识别出短语，去除低词频短语，与去除仅包含字母与标点符号的公式共同构成知识点实体，并在原句子中采用BIESO格式标注实体，其中B-tages表示实体词的开始、I-tages表示实体词的中间、E-tages表示实体词的结束、S-tages表示单字的实体、O表示没实际意义的词。最后，以句子为单位挖去所有实体获取句子模式，并利用ChatGPT4找出表示逻辑推理的词，按照词频，无交集的为句法模式分类，再用规则标注关系。Then, a training corpus for entity recognition and hyponymy relationship acquisition is constructed. Use ChatGPT4 to identify phrases from the after-class knowledge summary text, remove low word frequency phrases, and remove formulas that only contain letters and punctuation marks to form knowledge point entities, and use BIESO format to annotate entities in the original sentences, where B-tages represent entities. The beginning of a word, I-tages represent the middle of an entity word, E-tages represent the end of an entity word, S-tages represent the entity of a single word, and O represents a word with no actual meaning. Finally, all entities are dug out in units of sentences to obtain sentence patterns, and ChatGPT4 is used to find words that represent logical reasoning. Syntactic patterns are classified according to word frequency and non-intersection, and then rules are used to mark relationships.

在步骤S2中，通过训练语料库训练实体识别模型，利用训练好的实体识别模型识别待自动化抽取数据库中的实体。具体实施过程如下：In step S2, the entity recognition model is trained through the training corpus, and the trained entity recognition model is used to identify entities in the database to be automatically extracted. The specific implementation process is as follows:

本发明实施例的实体识别模型的结构如图2所示，包括多任务级联(Cascade)、BERT、词级表征(WLF,Word Level Feature)、在训练实体识别模型时，通过加权损失函数(WOL,Weight of Loss)来衡量模型训练是否成功。其中，BERT是一种语言表示模型，旨在通过联合调节所有层中的左右上下文来预训练深度双向表示；多任务级联(Cascade)是在得到实体识别的结果之后，返回到BERT输出层，找各个实体词的表征向量，然后再把实体的表征向量输入全连接层做分类，判断实体类型；词级表征是在原本字级别的序列标注任务上加入以词为单位的表征；加权损失函数是在loss函数方面通过设置权重来权衡精确度与召回率，以达到提高F1的目的。The structure of the entity recognition model in the embodiment of the present invention is shown in Figure 2, including multi-task cascade (Cascade), BERT, word level representation (WLF, Word Level Feature). When training the entity recognition model, the weighted loss function ( WOL, Weight of Loss) to measure the success of model training. Among them, BERT is a language representation model that aims to pre-train deep bidirectional representation by jointly adjusting the left and right context in all layers; multi-task cascade (Cascade) returns to the BERT output layer after obtaining the results of entity recognition. Find the representation vector of each entity word, and then input the representation vector of the entity into the fully connected layer for classification to determine the entity type; word-level representation is to add word-based representation to the original word-level sequence labeling task; weighted loss function It is to weigh the precision and recall rate by setting the weight in the loss function to achieve the purpose of improving F1.

实体识别模型过程如图3所示，多任务级联是在在得到实体识别的结果之后，返回去到BERT输出层，找各个实体词的表征向量(表征向量是把各个实体词的向量做平均计算得到)，然后再把实体的表征向量输入一层全连接做分类，判断实体类型。即在训练时，每个词，无论是不是实体词，都过一遍全连接，做实体类型分类计算loss，然后把非实体词对应的loss掩盖；在预测时，就取实体最后一个词对应的分类结果，作为实体类型。The entity recognition model process is shown in Figure 3. The multi-task cascade is to return to the BERT output layer after obtaining the entity recognition results to find the representation vector of each entity word (the representation vector is to average the vectors of each entity word) Calculated), and then input the representation vector of the entity into a layer of fully connected classification to determine the entity type. That is, during training, each word, whether it is an entity word or not, is fully connected, the entity type is classified, the loss is calculated, and then the loss corresponding to the non-entity word is masked; during prediction, the loss corresponding to the last word of the entity is taken Classification results, as entity types.

词级表征把字和词分别通过嵌入矩阵(embedding matrix)做表征，按照对应关系，进行拼接。Word-level representation represents characters and words through embedding matrices, and splices them according to the corresponding relationship.

如图4所示，加权损失函数是通过mask进行。如果模型没有识别到一个本应该识别到的实体，就增大对应的Loss，加重对模型的惩罚；如果模型识别到了一个不应该识别到的实体，就减小对应的Loss。As shown in Figure 4, the weighted loss function is performed through mask. If the model does not recognize an entity that should be recognized, the corresponding Loss is increased and the penalty for the model is increased; if the model recognizes an entity that should not be recognized, the corresponding Loss is reduced.

识别出BIE和S结构的文字为需要的知识点实体Identify the text in BIE and S structures as required knowledge point entities

通过训练好的实体识别模型识别待自动化抽取数据库中的知识点实体。Use the trained entity recognition model to identify the knowledge point entities in the database to be automatically extracted.

在步骤S3中，通过训练语料库训练双曲空间词嵌入稀疏投影模型，通过训练好的双曲空间词嵌入稀疏投影模型对S2的实体进行上下位关系进行识别。具体实施过程如下：In step S3, the hyperbolic space word embedding sparse projection model is trained through the training corpus, and the hyperhypernym relationship of the entity in S2 is identified through the trained hyperbolic space word embedding sparse projection model. The specific implementation process is as follows:

双曲空间词嵌入稀疏投影模型包括归一化嵌入映射单元、投影特征生成单元和分类单元。The hyperbolic space word embedding sparse projection model includes a normalized embedding mapping unit, a projection feature generation unit and a classification unit.

双曲空间词嵌入稀疏投影模型的训练过程如下：The training process of the hyperbolic space word embedding sparse projection model is as follows:

301、将训练语料库中上下位关系数据作为正样本数据，并根据正样本数据生成负样本数据。具体实施过程如下：301. Use the hypernym relationship data in the training corpus as positive sample data, and generate negative sample data based on the positive sample data. The specific implementation process is as follows:

将样本数据分为正样本测试集和正样本训练集，比例为2：8，采用随机采样的方法，对一个下位词，从他的非上位词中随机选取一个词作为其上位词生成负样本，分为负样本测试集和负样本训练集，比例为2：8。Divide the sample data into a positive sample test set and a positive sample training set, with a ratio of 2:8. Using a random sampling method, for a hyponym, randomly select a word from its non-hypernyms as its hypernym to generate a negative sample. It is divided into a negative sample test set and a negative sample training set, with a ratio of 2:8.

S302、通过归一化嵌入映射单元对正样本数据和负样本数据均进行归一化嵌入处理后映射到双曲空间，得到词向量样本数据，所述词向量样本数据包括具有上下位关系集合的正词向量样本数据和包括无上下位关系集合的负词向量样本数据。具体实施过程如下：S302. Use the normalized embedding mapping unit to perform normalized embedding processing on both the positive sample data and the negative sample data and then map them to the hyperbolic space to obtain word vector sample data. The word vector sample data includes a set of superordinate and hyponymic relationships. Positive word vector sample data and negative word vector sample data including a set of non-hypernym relationships. The specific implementation process is as follows:

得到知识点实体以后，需要获取知识点之间的包含关系和转化关系。其中包含关系获取采用双曲空间词嵌入稀疏投影模型。After obtaining the knowledge point entities, it is necessary to obtain the inclusion relationship and transformation relationship between the knowledge points. The included relations are obtained using a hyperbolic space word embedding sparse projection model.

利用Word2Vec词嵌入模型，将步骤2获取到的实体进行归一化嵌入。本发明实施例选用300维，作为词的特征向量维度，之后将获取的嵌入投射到双曲空间。让表示实体x_i通过Word2Vec进行的归一化嵌入，/>表示/>通过映射到双曲空间的嵌入表示。映射公式如下：Use the Word2Vec word embedding model to normalize and embed the entities obtained in step 2. The embodiment of the present invention selects 300 dimensions as the feature vector dimension of the word, and then projects the obtained embedding into the hyperbolic space. let Represents the normalized embedding of entity x_i via Word2Vec, /> Express/> Represented by embeddings mapped to hyperbolic space. The mapping formula is as follows:

和/>表示实体x_i的上位词和非上位词通过Word2Vec进行的归一化嵌入后通过映射到双曲空间的嵌入表示，设立具有上下位关系集合/>和无上下位关系集合/>归一化嵌入处理后映射到双曲空间后，上下位关系集合无上下位关系集合/> and/> The hypernyms and non-hypernyms representing the entity x_i are normalized and embedded through Word2Vec and then represented by embeddings mapped to the hyperbolic space to establish a set of hypernyms and hyponyms/> And the set of no superior and inferior relations/> After normalized embedding processing and mapping to hyperbolic space, the set of superordinate and hyponymic relations No superior-subordinate relationship set/>

S303、对词向量样本数据进行聚类，得到聚类中心，并通过聚类中心和上下位关系对获取上下位关系词在上下位关系集合中的权重和非上下位关系词在非上下位关系集合中的权重。具体实施过程如下：S303. Cluster the word vector sample data to obtain the cluster center, and obtain the weight of the hypernym relation word in the hyponym relation set and the non-hypernym relation word of the non-hypernym relation word through the cluster center and the hyponym relation pair. The weight in the set. The specific implementation process is as follows:

用K表示为聚类数，其中每个聚类对应于一个上位性分量。应用K-means到D⁽⁺⁾使用向量偏移作为特征。/>表示/>和/>之间的内积，/>表示/>和/>之间的欧几里得距离。聚类中心记为/>将/>定义为/>在数据集D⁽⁺⁾上的权重:Let K be the number of clusters, where each cluster corresponds to an epistasis component. Apply K-means to D⁽⁺⁾ using vector offsets as a feature. /> Express/> and/> The inner product between,/> Express/> and/> Euclidean distance between . The clustering center is recorded as/> Will/> Defined as/> Weights on data set D⁽⁺⁾ :

304、基于上下位关系词在上下位关系集合中的权重和正词向量样本数据，进行稀疏投影向量的计算，得到稀疏的正投影矩阵；基于非上下位关系词在非上下位关系集合中的权重和负词向量样本数据，进行稀疏双投影向量的计算，得到稀疏的负投影矩阵。具体实施过程如下：304. Based on the weight of the hypernym and hyponym relational words in the hyponymy and hyponym relation set and the sample data of orthographic word vectors, calculate the sparse projection vector and obtain a sparse orthographic projection matrix; based on the weight of the non-hypernym and hyponym relational words in the non-hypernym and hyponym relation set. and negative word vector sample data, calculate sparse dual projection vectors, and obtain a sparse negative projection matrix. The specific implementation process is as follows:

最小化第j个簇上的加权投影误差，是一个d*d的投影矩阵。第j类:Minimize the weighted projection error on the jth cluster, is a d*d projection matrix. Category j:

在计算时，使用随机稀疏的方法，将填充进稀疏矩阵，矩阵相乘获得稀疏的投影矩阵/>让/>是所有K个的投影参数的集合。学习超值投影等价于最小化以下目标函数:When calculating, use the random sparse method to Fill in the sparse matrix and multiply the matrices to obtain the sparse projection matrix/> Let/> is the set of all K projection parameters. Learning supervalued projections is equivalent to minimizing the following objective function:

学习非上位投射在语义关系分类任务中，一个术语的非上位词可以是共下位词、同义词等。也适用于不同的非上位性关系的语义建模。该目标的定义如下:Learning non-hypernym projections In the task of semantic relationship classification, the non-hypernyms of a term can be co-hypernyms, synonyms, etc. Also suitable for semantic modeling of different non-epistatic relationships. The goal is defined as follows:

带有的字符串作为非上位聚类中心，和/>为投影参数。聚类中心是K-means在D^(-)生成，使用作为特征。用/>定义为在数据集D^(-)上的权重。with The string of is used as a non-superordinate clustering center, and/> are the projection parameters. Cluster centers are generated by K-means in D^(-) , using as a feature. Use/> defined as The weight on the data set D^(-) .

305、在投影特征生成单元中根据正投影矩阵、负投影矩阵和词向量样本数据生成正样本投影特征和负样本投影特征。具体实施过程如下：305. In the projection feature generation unit, generate positive sample projection features and negative sample projection features based on the positive projection matrix, negative projection matrix and word vector sample data. The specific implementation process is as follows:

给定一对(x_i,y_i)∈D⁽⁺⁾∪D^(-)生成两组特征：Given a pair (x_i ,y_i )∈D⁽⁺⁾ ∪D^(-) generate two sets of features:

其中是向量连接操作符。/>表示正特征，/>表示负特征。为投影特征。in is the vector concatenation operator. /> Indicates positive characteristics,/> represents negative characteristics. is a projected feature.

如果一组词存在上下位关系很小，/>很大；不存在上下位关系就相反。If a group of words has a hypernym relationship Very small,/> Very large; if there is no superior-subordinate relationship, it would be the opposite.

在本发明实施例中，根据词向量样本数据中的正词向量样本数据和正双投影矩阵、负双投影矩阵生成正样本投影特征。根据负词向量样本数据和正双投影矩阵、负双投影矩阵生成负样本投影特征。In the embodiment of the present invention, positive sample projection features are generated according to the positive word vector sample data, the positive bi-projection matrix, and the negative bi-projection matrix in the word vector sample data. Generate negative sample projection features based on negative word vector sample data, positive bi-projection matrices, and negative bi-projection matrices.

306、基于正样本投影特征和负样本投影特征训练分类单元。具体实施过程如下：306. Train classification units based on positive sample projection features and negative sample projection features. The specific implementation process is as follows:

分类单元包包括MLP(多层感知器)，MLP是一种基本的前馈神经网络，由输入层、多个隐藏层和输出层组成。每个层都由多个神经元(或节点)组成，相邻层之间的神经元全部连接并带有权重。The classification unit package includes MLP (Multi-Layer Perceptron), which is a basic feed-forward neural network consisting of an input layer, multiple hidden layers, and an output layer. Each layer is composed of multiple neurons (or nodes), and the neurons between adjacent layers are all connected and have weights.

联合作为投影特征，训练分类单元，在D⁽⁺⁾和D^(-)上计算输入到分类单元中，进行训练；选用线性激活函数，对于学习率等参数，设立参数范围，利用网格搜索，选择最优参数，利用Adam算法更新权重。joint As projected features, training classification units are calculated on D⁽⁺⁾ and D^(-) Input it into the classification unit for training; select a linear activation function, set a parameter range for parameters such as learning rate, use grid search to select the optimal parameters, and use the Adam algorithm to update the weights.

双曲空间词嵌入稀疏投影模型训练好后，对S2的实体进行上下位关系进行识别，具体包括：After the hyperbolic space word embedding sparse projection model is trained, the hypernym relationship of entities in S2 is identified, including:

通过归一化嵌入映射单元(对应步骤S302的操作)将实体(x，y)进行归一化嵌入后映射到双曲空间，得到将/>输入到投影特征生成单元，计算得到投影特征(对应步骤S305的操作)，将投影特征输入到分类单元中，输出0或者1，其中0表示无上下位关系，1表示有上下位关系。上下位关系即包括关系。The entity (x, y) is normalized and embedded through the normalized embedding mapping unit (corresponding to the operation of step S302) and then mapped to the hyperbolic space, and we get Will/> Input to the projection feature generation unit, and calculate the projection features (Corresponding to the operation of step S305), input the projected features into the classification unit, and output 0 or 1, where 0 means there is no superior-hybrid relationship, and 1 means there is a superior-hybrid relationship. The superior-subordinate relationship includes relationships.

在步骤S4中，根据实体识别结果和实体上下位关系，利用实体掩盖技术，标记待自动化抽取数据库中的句字中提到的每个实体的开始和结束；通过基于稀疏双仿射关注的BERT模型和自适应阈值损失的分类模型进行句子中的实体间的转化关系抽取。具体实施过程如下：In step S4, based on the entity recognition results and the entity's hyponym relationship, entity masking technology is used to mark the beginning and end of each entity mentioned in the sentences in the database to be automatically extracted; through BERT based on sparse biaffine attention Model and adaptive threshold loss classification model extract transformation relationships between entities in sentences. The specific implementation process is as follows:

如图5所示，本发明实施例通过BERT预训练模型进行转化关系分类，输入采用实体掩盖技术ENTITY MARKERS，用四个标记[E1],[/E1end],[E2]和[/E2]来标记句中提到的每个实体的开始和结束。通过对头实体和尾实体位置进行处理，得到对应的实体向量表征。之后利用稀疏双仿射关注与自适应阈值损失的分类模型进行句子中的实体间的转化关系抽取。具体实施过程如下：As shown in Figure 5, the embodiment of the present invention uses the BERT pre-training model to classify transformation relationships. The input uses the entity masking technology ENTITY MARKERS, using four markers [E1], [/E1end], [E2] and [/E2]. Mark the beginning and end of each entity mentioned in the sentence. By processing the positions of the head entity and the tail entity, the corresponding entity vector representation is obtained. Then, the classification model of sparse biaffine attention and adaptive threshold loss is used to extract the transformation relationship between entities in the sentence. The specific implementation process is as follows:

通过稀疏矩阵，来近似求解权重矩阵。公式如下：The weight matrix is approximately solved through a sparse matrix. The formula is as follows:

W⁽⁰⁾＝SW⁽⁰⁾ ＝S

其中，W⁽⁰⁾表示初始权重矩阵，S表示稀疏矩阵。以此使用了稀疏矩阵，来近似BERT模型内部的权重矩阵。给定一个标记好的句子s通过权重稀疏化后的BERT获取每个词的上下文表示h_i：Among them, W⁽⁰⁾ represents the initial weight matrix, and S represents the sparse matrix. This uses a sparse matrix to approximate the weight matrix inside the BERT model. Given a labeled sentence s, obtain the context representation h_i of each word through weighted sparse BERT:

{h₁,...h_|s|}＝BERT{x₁,...x_|s|}{h₁ ,...h_|s| }＝BERT{x₁ ,...x_|s| }

其中，x_i表示每个词的输入嵌入。where_xi represents the input embedding of each word.

为了捕获长范围依存关系，本发明实施例还采用跨句上下文跟踪，将句子扩展到固定的窗口大小W，为了更好的获取词的位置信息，采用了双仿射注意力机制。具体包括：In order to capture long-range dependencies, embodiments of the present invention also use cross-sentence context tracking to expand the sentence to a fixed window size W. In order to better obtain word position information, a double affine attention mechanism is adopted. Specifically include:

采用了二维降维MLPs，一个头MLP；一个尾MLP：Two-dimensional dimensionality reduction MLPs are used, one head MLP and one tail MLP:

和/>是头尾实体的映射表示。 and/> It is a mapping representation of head and tail entities.

紧接着计算每个词的得分：Then calculate the score of each word:

其中U₁∈R^|y|*d*d和U₂∈R^|y|*2d是权重参数，b是偏差，表示串联。where U₁ ∈R^|y|*d*d and U₂ ∈R^|y|*2d are weight parameters, b is the bias, Represents concatenation.

每个词的得分g_i,j作为分类器的输出，范围在[0，1]内，该得分需要阈值化来转换为关系标签。阈值化是指根据某些阈值将数据或模型输出转换为二元(或多元)形式的过程。由于阈值既没有闭合形式的解，也不可微分，并且对于不同的实体对或类，模型可能具有不同的置信度，其中一个全局阈值是不够的。全局阈值指的是在整个数据集或模型输出中通用的、应用于所有样本或预测结果的单一阈值。关系的数量不同(多标签问题)，并且模型可能不会被全局校准，因此相同的概率并不意味着所有实体对都相同。这个问题促使我们用可学习的自适应阈值来代替全局阈值，这可以减少推理过程中的决策错误。自适应阈值是根据特定规则、模型输出的其他指标或数据本身的特性动态调整的阈值。The score g_i,j of each word is used as the output of the classifier, ranging from [0, 1], and this score needs to be thresholded to convert into a relationship label. Thresholding refers to the process of converting data or model output into binary (or multivariate) form based on some threshold value. Since the threshold neither has a closed-form solution nor is differentiable, and the model may have different confidence levels for different pairs of entities or classes, a global threshold is not sufficient. A global threshold refers to a single threshold that is common across the entire dataset or model output and applies to all samples or predictions. The number of relations varies (multi-label problem), and the model may not be globally calibrated, so the same probability does not mean the same for all pairs of entities. This problem motivates us to replace the global threshold with a learnable adaptive threshold, which can reduce decision errors during inference. Adaptive thresholds are thresholds that are dynamically adjusted based on specific rules, other metrics output by the model, or characteristics of the data itself.

本发明实施例使用伪类TH来学习多标签分类的动态阈值，具体如下：The embodiment of the present invention uses pseudo-class TH to learn dynamic thresholds for multi-label classification, specifically as follows:

对于每个实体对将关系集R划分为两个部分：包含关系r的正集P存在于/>之间，并且负集N＝R-P。在这里，应用自适应阈值损失来学习关系分类器，引入一个阈值类TH，并采用基于自适应阈值损失的标准分类交叉熵损失函数，如下所示：For each entity pair Divide the relation set R into two parts: the positive set P containing the relation r exists in/> between, and the negative set N=RP. Here, an adaptive threshold loss is applied to learn a relational classifier, a threshold class TH is introduced, and a standard categorical cross-entropy loss function based on the adaptive threshold loss is adopted as follows:

L＝L₁+L₂L＝L₁ +L₂

其中，logit关系类型的值P_T表示存在关系的实体集合。logit_r′表示在P_T类中也在TH类中关系类型的值，logit_r表示在P_T类中关系类型的值，logit_TH表示在P_T类中也在TH类中关系类型的值。Among them, the value P_T of the logit relationship type represents the set of entities with a relationship. logit_r′ represents the value of the relationship type in the P_T class and also in the TH class, logit_r represents the value of the relationship type in the P_T class, and logit_TH represents the value of the relationship type in the P_T class and also in the TH class.

通过分类模型抽取数据中的全部的句子中的实体间的转化关系。The transformation relationship between entities in all sentences in the data is extracted through the classification model.

通过S2-S4的步骤的到了实体、上下位关系和关系分类获取实体间关系得到大量的三元组，构建知识图谱。Through the steps of S2-S4, entities, superordinate relationships and relationship classification are obtained to obtain relationships between entities and a large number of triples are obtained to build a knowledge graph.

本发明实施例还提供一种数学课程类知识图谱构建系统，该系统包括：Embodiments of the present invention also provide a mathematics course knowledge graph construction system, which includes:

关系识别模块，用于通过训练语料库训练双曲空间词嵌入稀疏投影模型，通过训练好的双曲空间词嵌入稀疏投影模型对实体识别模块得到的实体进行上下位关系进行识别；The relationship recognition module is used to train the hyperbolic space word embedding sparse projection model through the training corpus, and use the trained hyperbolic space word embedding sparse projection model to identify the hyponym relationship of the entities obtained by the entity recognition module;

可理解的是，本发明实施例提供的数学课程类知识图谱构建系统与上述数学课程类知识图谱构建方法相对应，其有关内容的解释、举例、有益效果等部分可以参考数学课程类知识图谱构建方法中的相应内容，此处不再赘述。It can be understood that the mathematics course knowledge graph construction system provided by the embodiment of the present invention corresponds to the above-mentioned mathematics course knowledge graph construction method. For explanations, examples, beneficial effects, etc. of the relevant content, please refer to the mathematics course knowledge graph construction. The corresponding content in the method will not be repeated here.

本发明实施例还提供一种计算机可读存储介质，其存储用于数学课程类知识图谱构建的计算机程序，其中，所述计算机程序使得计算机执行如上述所述的数学课程类知识图谱构建方法。Embodiments of the present invention also provide a computer-readable storage medium that stores a computer program for constructing a mathematics course knowledge graph, wherein the computer program causes the computer to execute the method for constructing a mathematics course knowledge graph as described above.

本发明实施例还提供一种电子设备，包括：An embodiment of the present invention also provides an electronic device, including:

一个或多个处理器；one or more processors;

存储器；以及memory; and

一个或多个程序，其中所述一个或多个程序被存储在所述存储器中，并且被配置成由所述一个或多个处理器执行，所述程序包括用于执行如上述所述的数学课程类知识图谱构建方法。one or more programs, wherein said one or more programs are stored in said memory and configured to be executed by said one or more processors, said program comprising means for performing mathematical operations as described above Course knowledge graph construction method.

综上所述，与现有技术相比，具备以下有益效果：To sum up, compared with the existing technology, it has the following beneficial effects:

1、本发明实施例解决了目前关系抽取方法不能做到精准挖掘数学程类的实体关系的技术问题，实现通过融合稀疏双仿射关注与自适应阈值损失的关系分类方法，克服复杂依赖与重叠关系抽取的困难。1. The embodiment of the present invention solves the technical problem that current relationship extraction methods cannot accurately mine entity relationships of mathematical programs, and implements a relationship classification method that integrates sparse biaffine attention and adaptive threshold loss to overcome complex dependencies and overlaps. Difficulties in relation extraction.

2、本发明实施例利用双曲空间词嵌入稀疏投影模型进行上下位关系预测，双曲空间具有表达层次结构的能力，能更好预测上下位关系，并降低计算代价。2. The embodiment of the present invention uses the hyperbolic space word embedding sparse projection model to predict the hyponym relationship. The hyperbolic space has the ability to express hierarchical structures, can better predict the hyponym relationship, and reduces the calculation cost.

3、本发明实施例采用多任务级联、BERT、词级表征、加权损失函数相结合的实体识别模型，将传统的命名实体识别拆分成两个任务，一个任务用来单纯抽取实体，一个任务用来判断实体类型，并且融入了词级表征，提高了命名实体识别的准确率。3. The embodiment of the present invention uses an entity recognition model that combines multi-task cascade, BERT, word-level representation, and weighted loss function to split the traditional named entity recognition into two tasks, one task is used to simply extract entities, and the other The task is used to determine entity types and incorporates word-level representation to improve the accuracy of named entity recognition.

需要说明的是，在本文中，诸如第一和第二等之类的关系术语仅仅用来将一个实体或者操作与另一个实体或操作区分开来，而不一定要求或者暗示这些实体或操作之间存在任何这种实际的关系或者顺序。而且，术语“包括”、“包含”或者其任何其他变体意在涵盖非排他性的包含，从而使得包括一系列要素的过程、方法、物品或者设备不仅包括那些要素，而且还包括没有明确列出的其他要素，或者是还包括为这种过程、方法、物品或者设备所固有的要素。在没有更多限制的情况下，由语句“包括一个……”限定的要素，并不排除在包括所述要素的过程、方法、物品或者设备中还存在另外的相同要素。It should be noted that in this article, relational terms such as first and second are only used to distinguish one entity or operation from another entity or operation, and do not necessarily require or imply that these entities or operations are mutually exclusive. any such actual relationship or sequence exists between them. Furthermore, the terms "comprises," "comprises," or any other variations thereof are intended to cover a non-exclusive inclusion such that a process, method, article, or apparatus that includes a list of elements includes not only those elements, but also those not expressly listed other elements, or elements inherent to the process, method, article or equipment. Without further limitation, an element defined by the statement "comprises a..." does not exclude the presence of additional identical elements in a process, method, article, or apparatus that includes the stated element.

以上实施例仅用以说明本发明的技术方案，而非对其限制；尽管参照前述实施例对本发明进行了详细的说明，本领域的普通技术人员应当理解：其依然可以对前述各实施例所记载的技术方案进行修改，或者对其中部分技术特征进行等同替换；而这些修改或者替换，并不使相应技术方案的本质脱离本发明各实施例技术方案的精神和范围。The above embodiments are only used to illustrate the technical solutions of the present invention, but not to limit them; although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they can still modify the technical solutions of the foregoing embodiments. The recorded technical solutions may be modified, or some of the technical features thereof may be equivalently replaced; however, these modifications or substitutions shall not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of each embodiment of the present invention.