CN116521886A - Deep learning-based education field discipline knowledge graph construction method and device - Google Patents


Info

Publication number
CN116521886A
Authority
CN
China
Prior art keywords
knowledge
education
deep learning
field
knowledge points
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310135812.6A
Other languages
Chinese (zh)
Inventor
曹柳
黄程韦
朱晓明
王琪皓
刘海丰
何贵甲
巨然
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Lab
Original Assignee
Zhejiang Lab
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Lab
Priority to CN202310135812.6A
Publication of CN116521886A
Status: Pending

Abstract

Translated from Chinese

The invention discloses a deep-learning-based method and device for constructing a knowledge graph in the field of education. The method first acquires multi-source data from the education domain; it then performs structured extraction on the multi-source data to obtain keywords and knowledge points; next, a keyword skeleton is formed from the precedence (prerequisite/successor) relations between keywords; the keyword skeleton is then applied to extract precedence relations between knowledge points, and the keyword skeleton together with these relations is used to derive a knowledge-point framework; finally, using the knowledge-point framework, knowledge fusion is performed with an unsupervised method and a similarity computation to construct the subject knowledge graph. The invention extracts knowledge points and their relations from massive teaching-resource data and constructs a teaching-resource knowledge graph, thereby enabling intelligent applications in the teaching field and providing foundational support for combining artificial intelligence with education.

Description

Translated from Chinese
Method and device for constructing a subject knowledge graph in the education field based on deep learning

Technical Field

The invention belongs to the technical field of knowledge graphs, and in particular relates to a deep-learning-based method and device for constructing a subject knowledge graph in the field of education.

Background Art

The construction of subject knowledge graphs is an important topic across the teaching field. Building the hierarchical and associative relations between subject knowledge points can effectively guide teachers' teaching order and students' learning paths, and weakly mastered knowledge points can be discovered in time, enabling targeted and more efficient learning.

In traditional knowledge graph design, graphs for some subject areas are usually built through rule extraction and manual curation. Such methods can generally consider only a small amount of subject data and a single knowledge system; limited by the annotators' accumulated experience and the time invested, the resulting knowledge graphs are small in scale and cannot comprehensively and accurately reflect the relations among the nodes of a massive knowledge system.

It is therefore necessary to design a deep-learning-based automated construction engine to complete the construction of the subject knowledge graph system.

Summary of the Invention

The purpose of the present invention is to address the deficiencies of the prior art and provide a deep-learning-based method for constructing a subject knowledge graph in the education field: using deep learning, machine learning, and related techniques, and based on data from the various channels of a subject domain, the method builds the hierarchical and precedence relations of knowledge points, establishes an association model of the knowledge-point system, and then completes knowledge reasoning.

The purpose of the present invention is achieved through the following technical solution: a deep-learning-based method for constructing a subject knowledge graph in the education field, comprising the following steps:

(1) Acquire multi-source data from the education domain.

(2) Perform structured extraction on the multi-source data to obtain keywords and knowledge points.

(3) Form a keyword skeleton from the precedence relations between keywords.

(4) Apply the keyword skeleton to extract precedence relations between knowledge points, and derive the knowledge-point framework from the keyword skeleton and these relations.

(5) Using the knowledge-point framework, perform knowledge fusion with an unsupervised method and similarity computation to construct the subject knowledge graph.

Further, the method also includes, on the basis of the subject knowledge graph, applying knowledge reasoning techniques to complete the relations between knowledge points, and empowering front-end business systems for recommendation, question answering, and search. The relations between knowledge points include hypernym/hyponym relations, ordering relations between peers, and similarity relations between peers.

Further, in step (1), the multi-source data includes textbooks, video data, and PPT courseware data.

Further, in step (2), two strategies, rules and NLP, are applied to the multi-source data for structured extraction; the two strategies run in parallel and complement each other.

The rule strategy extracts keywords and knowledge points through curated rules and regular expressions.

The NLP strategy first applies OCR processing and then uses an NER algorithm to extract the keywords and knowledge points; the NER algorithm adopts BiLSTM combined with CRF, RoBERTa combined with CRF, or RoBERTa combined with a span pointer.

Further, the NER algorithm uses RoBERTa combined with CRF to extract keywords and knowledge points.

Further, the RoBERTa model undergoes incremental pre-training to learn subject knowledge in the education domain.

Further, the loss function of the RoBERTa model, Focus Loss For Teaching Resources, is given by:

alpha = [α1, α2, α3, ..., αn],  n ∈ labels.count

log(pt) ← alpha · log(pt)

Loss = −(1 − log(pt))^γ · log(log(pt))

where alpha denotes the label loss weights, α1, α2, α3, ..., αn are the loss weights of the individual labels, and n is the number of labels; pt denotes the model's predicted probability: when the prediction y equals the true label, pt takes the model's predicted probability p, otherwise 1 − p; γ is the modulation factor.

Further, the unsupervised method applies the k-means clustering algorithm to cluster knowledge points, so that similar knowledge points form one cluster; the similarity computation vectorizes knowledge points with the TF-IDF algorithm (the keyword set can serve as the vectorized representation of a knowledge point), then computes the cosine similarity between knowledge points to obtain several similar knowledge points; a fusion operation is then performed according to a configured threshold, which is set to 0.95.

A deep-learning-based device for constructing a subject knowledge graph in the education field, comprising one or more processors configured to implement the above-described method.

A computer-readable storage medium storing a program which, when executed by a processor, implements the above-described deep-learning-based method for constructing a subject knowledge graph in the education field.

The beneficial effects of the present invention are: knowledge points and their relations are extracted from massive teaching-resource data and a teaching-resource knowledge graph is constructed, thereby enabling intelligent applications in the teaching field and providing foundational support for the combination of artificial intelligence and education.

Brief Description of the Drawings

To illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings required in the description of the embodiments are briefly introduced below. Obviously, the drawings described below are merely some embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.

Figure 1 is a schematic diagram of the overall architecture of subject knowledge graph generation;

Figure 2 is a schematic diagram of the weakly supervised data labeling logic;

Figure 3 is a flowchart of the NER data and algorithm;

Figure 4 shows the algorithm optimization and parameter tuning log data;

Figure 5 is a schematic diagram of the subject keyword skeleton;

Figure 6 is a schematic diagram of the subject knowledge-point framework;

Figure 7 is a hardware structure diagram of the present invention.

Detailed Description of the Embodiments

Exemplary embodiments are described in detail here, with examples shown in the accompanying drawings. Where the following description refers to the drawings, the same numerals in different drawings denote the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present invention; rather, they are merely examples of devices and methods consistent with some aspects of the invention as detailed in the appended claims.

The terminology used in the present invention is for the purpose of describing particular embodiments only and is not intended to limit the invention. The singular forms "a", "said", and "the" used in the present invention and the appended claims are also intended to include the plural forms, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.

It should be understood that although the terms first, second, third, etc. may be used in the present invention to describe various information, the information should not be limited by these terms; they serve only to distinguish information of the same type. For example, without departing from the scope of the present invention, first information may also be called second information, and similarly second information may be called first information. Depending on the context, the word "if" as used herein may be interpreted as "when", "while", or "in response to determining".

The present invention is described in detail below with reference to the drawings. Where no conflict arises, the features of the following embodiments and implementations may be combined with one another.

By acquiring data collections in the educational subject domain and comprehensively applying deep learning techniques such as NLP (Natural Language Processing) and OCR (Optical Character Recognition), the invention builds a complete solution for automated graph generation, and combines it with knowledge reasoning techniques to enrich and improve the generalization capability of the knowledge graph.

The deep-learning-based method of the present invention for constructing a subject knowledge graph in the education field, as shown in Figure 1, includes the following steps:

(1) Acquire multi-source data from the education domain.

The first step is data collection: a crawler is applied to obtain open-source education-domain datasets, which are combined with accumulated historical subject data, gathering as rich a corpus as possible, including data from various sources and modalities such as textbooks, video data, and PPT courseware.

(2) Perform structured extraction on the multi-source data to obtain keywords and knowledge points.

In one embodiment, a rule strategy is used to obtain keywords and knowledge points; the rules include curated rules and regular-expression extraction.

First, data in the form of raw images must be converted to text, mainly via a third-party OCR recognition API, yielding the raw text data. The data is then cleaned: removing line breaks, deduplicating keywords, stripping serial numbers, removing stop words, converting full-width characters to half-width, lowercasing English text, and so on.
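As a rough illustration, the cleaning steps listed above might look like the following Python sketch; the stop-word list and function names are illustrative, not from the patent.

```python
import re

# Tiny illustrative stop-word list; a real pipeline would load a full lexicon.
STOP_WORDS = {"的", "了", "和", "the", "a", "of"}

def to_halfwidth(text: str) -> str:
    """Convert full-width characters (U+FF01..U+FF5E and the ideographic space)
    to their half-width equivalents."""
    out = []
    for ch in text:
        code = ord(ch)
        if code == 0x3000:              # ideographic space -> ASCII space
            code = 0x20
        elif 0xFF01 <= code <= 0xFF5E:  # full-width ASCII range
            code -= 0xFEE0
        out.append(chr(code))
    return "".join(out)

def clean_text(text: str) -> str:
    text = text.replace("\n", " ").replace("\r", " ")  # remove line breaks
    text = to_halfwidth(text)                          # full-width -> half-width
    text = text.lower()                                # lowercase English
    text = re.sub(r"^\s*\d+(\.\d+)*\s*", "", text)     # strip leading serial numbers
    return text.strip()

def clean_keywords(keywords):
    """Deduplicate keywords (order-preserving) and drop stop words."""
    seen, result = set(), []
    for kw in map(clean_text, keywords):
        if kw and kw not in STOP_WORDS and kw not in seen:
            seen.add(kw)
            result.append(kw)
    return result
```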

The title information in textbooks follows certain rules. For example, at the start of the first chapter's title, "第1章" (Chapter 1) appears, a first-level knowledge point denoted "chapter"; "1.1" is a second-level knowledge point denoted "section"; and "1.1.1" is a third-level knowledge point denoted "subsection". This information helps determine whether a line is a title, and it also encodes hierarchical relations, e.g., 1.1.1 is contained in 1.1. Various types of textbooks were examined to compile the rules for the different title formats.

Regular expressions are then used for extraction. They are constructed from the compiled title rules, with the code implemented using Python's regular-expression module: chapters are extracted first, then sections within each chapter's content, and subsections within each section's content. The online service is built with the FastAPI framework and deployed as a web service; recognition results are obtained by calling the API over HTTP.
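A minimal sketch of this regular-expression extraction, assuming heading formats like "第1章", "1.1", and "1.1.1"; the patterns are illustrative and a real system would need the fuller rule set described above.

```python
import re

# Illustrative patterns for the three heading levels described above.
CHAPTER_RE = re.compile(r"^第[0-9一二三四五六七八九十]+章\s*(.+)$")
SUBSECTION_RE = re.compile(r"^(\d+)\.(\d+)\.(\d+)\s*(.+)$")
SECTION_RE = re.compile(r"^(\d+)\.(\d+)\s+(.+)$")

def extract_outline(text: str):
    """Return (level, title) pairs: 'chapter' / 'section' / 'subsection'."""
    outline = []
    for line in text.splitlines():
        line = line.strip()
        m = CHAPTER_RE.match(line)
        if m:
            outline.append(("chapter", m.group(1)))
            continue
        m = SUBSECTION_RE.match(line)   # check 3-level headings before 2-level
        if m:
            outline.append(("subsection", m.group(4)))
            continue
        m = SECTION_RE.match(line)
        if m:
            outline.append(("section", m.group(3)))
    return outline
```

A service wrapper (e.g., a FastAPI endpoint) would then expose `extract_outline` over HTTP, as the embodiment describes.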

In one embodiment, NLP can also be used to obtain keywords and knowledge points: an NER (Named Entity Recognition) algorithm extracts knowledge points (chapters, sections, subsections) and keywords, with algorithm training, debugging, and optimization.

As shown in Figure 3, this covers data annotation, the algorithm model, the model output layer, data optimization, incremental pre-training, other optimizations, and evaluation metrics. During data annotation, a "three labelers, one reviewer" scheme can be adopted to guarantee annotation quality, and the following points need attention. First, keyword annotation must avoid nested entities, and the annotation granularity should be as fine as possible. Second, during semi-supervised transfer of keyword annotations, priority must be given to annotating fine-grained keywords while avoiding duplicate annotations: keywords are sorted in descending order of length so that the finer-grained (more specific) keywords are annotated first, and the original textbook is searched for each keyword's hits; if a keyword hits one or more times, the existing index is checked and the start and end positions (and all occurrence positions) are recorded, so that duplicate annotation is avoided, as shown in Figure 2. Third, some keywords occur very frequently, much like stop words in a general dictionary; to preserve the discriminative power of keywords, these need not be treated as keywords. Fourth, single-character concepts may change meaning in different word combinations and must be verified before being used as keywords. Fifth, to improve annotation efficiency, a weakly supervised scheme can be adopted in which annotators supply a keyword lexicon, using the BIO tagging scheme: the start position of an entity is tagged B, middle and end positions are tagged I, and non-entity positions are tagged O. The challenge in model selection lies in improving recognition accuracy: the most suitable algorithm model must be applied and then iteratively optimized for vertical-domain text.
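The weakly supervised BIO labeling described above (longest, most specific keywords first; spans overlapping an earlier match are skipped) might be sketched as follows; the function name and return layout are assumptions.

```python
def bio_label(text: str, keywords):
    """Weakly supervised BIO tagging from a keyword lexicon.

    Keywords are matched longest-first, and any occurrence that overlaps an
    already-labelled span is skipped, avoiding duplicate/nested labelling.
    Returns a list of (character, tag) pairs.
    """
    tags = ["O"] * len(text)
    taken = [False] * len(text)
    for kw in sorted(set(keywords), key=len, reverse=True):
        start = text.find(kw)
        while start != -1:
            end = start + len(kw)
            if not any(taken[start:end]):       # no overlap with an earlier match
                tags[start] = "B"
                for i in range(start + 1, end):
                    tags[i] = "I"
                for i in range(start, end):
                    taken[i] = True
            start = text.find(kw, start + 1)    # record every occurrence
    return list(zip(text, tags))
```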

A traditional NER algorithm is generally built as BiLSTM+CRF (Bi-directional Long Short-Term Memory with a Conditional Random Field), which achieves a baseline result. With the rise of pre-trained language models in recent years, however, BERT-style (Bidirectional Encoder Representations from Transformers) pre-trained models can be used as feature extractors to further improve performance. RoBERTa (A Robustly Optimized BERT Pretraining Approach) builds on BERT with more data, larger batches, and longer training, combined with dynamic masking and removal of the next-sentence-prediction task, yielding better text representations. Although RoBERTa, trained on a large corpus, generalizes well, its out-of-the-box performance in vertical domains is mediocre: a large text corpus must be acquired and incremental pre-training performed on top of the model, so that RoBERTa learns subject knowledge in the education domain.

The model's loss function was also optimized. In label extraction for knowledge points, some labels are easier to extract than others; for example, in textbooks the "chapter" label is relatively easy to extract, while the "section" and "subsection" labels are harder to recognize. Based on this gap in recognition difficulty, a new loss function, Focus Loss For Teaching Resources, is designed so that the model pays more attention to hard-to-classify samples during training, assigning them higher weights and reducing the influence of easy samples on the total loss. The formula is as follows:

alpha = [α1, α2, α3, ..., αn],  n ∈ labels.count

log(pt) ← alpha · log(pt)

Loss = −(1 − log(pt))^γ · log(log(pt))

where alpha denotes the per-label loss weights, α1, α2, α3, ..., αn are the loss weights of the individual labels, and n is the number of labels; pt denotes the model's predicted probability: when the prediction y equals the true label, pt takes the model's predicted probability p, otherwise 1 − p; γ is the modulation factor, used to make training focus on hard samples. Different labels are assigned different weights according to how hard they are to learn, and the weighted probability of each label enters the Loss formula. This smoothly rescales the weight of easy samples, balances the weights of hard and easy samples, and reduces the weight of easily classified samples, so that the model attends more to hard samples during training; the harder a "specific" label is to learn, the larger its contribution to the loss.
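For numerical validity, the sketch below implements the underlying idea using the standard focal-loss form FL = −α_y · (1 − p_t)^γ · log(p_t) rather than the patent's exact formula; the per-label alpha weights and the modulation factor γ play the roles described above, with harder labels assigned larger α values.

```python
import math

def focal_loss(probs, label, alpha, gamma=2.0):
    """Per-label weighted focal-style loss (a sketch of the idea described
    above, not the patent's literal formula).

    probs : predicted probability for each label (sums to 1)
    label : index of the true label
    alpha : one loss weight per label; harder labels get larger weights
    gamma : modulation factor; larger gamma down-weights easy samples more
    """
    pt = probs[label]  # p_t: predicted probability of the true label
    return -alpha[label] * (1.0 - pt) ** gamma * math.log(pt)
```

Because (1 − p_t)^γ is small when p_t is large, a confidently correct (easy) prediction contributes far less loss than an uncertain (hard) one, which is exactly the focusing behavior the embodiment describes.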

During model training, multiple rounds of tuning at the model and hyperparameter levels were performed, and a tuning log was recorded, as shown in Figure 4. The log records the number of training samples (example), model architecture (structure), training epochs (epoch), per-batch data volume (train batch size), learning rate, CRF-layer learning rate (crf learning rate), accuracy (acc), recall, F1 score, validation loss (eval loss), and training time. After comparative experiments, the following conclusions were verified on this dataset:

(a) Model architecture: two architectures were compared, RoBERTa+CRF and RoBERTa with a span pointer. The former uses the discriminative CRF model to compute the globally optimal label path; the latter uses a pointer network to predict label start positions. With the same epoch of 5 and learning rate of 5e-5, the F1 of the RoBERTa+CRF architecture is 24 BP higher, and at epoch 4 RoBERTa+CRF is 101 BP higher; adjusting the learning rate could not raise the F score further, so the RoBERTa+CRF architecture was chosen.

(b) Training epochs: as the epoch count increases from 1 to 5, the model's F score keeps growing, indicating that more epochs let the model train more fully; but at epoch 6 the F score drops, indicating overfitting, so the optimal value of 5 was chosen as the epoch count.

(c) Batch size: increasing the train batch size within a certain range improves the F score. The RoBERTa model has hundreds of millions of parameters and consumes substantial computing resources; constrained by GPU memory and other hardware, the train batch size was set to 24.

(d) Learning rate: lowering the learning rate by 2 BP from 5e-5 caused the F1 score to drop; further comparative experiments settled on a learning rate of 5e-5.

(e) CRF-layer learning rate: because the RoBERTa body fits very strongly, the CRF learning rate must be increased so that the CRF layer can learn sufficiently; raising the crf learning rate from 5e-5 to 5e-3 produced a corresponding increase in the F score.

In one embodiment, the rule and NLP strategies can also be used together. Rules typically offer high precision but low recall, while a model meets the generalization requirement and typically has higher recall, so the two strategies can be combined. Specifically, a model threshold of 0.9 is set: when the model's confidence is below the threshold, the rule-based result is preferred; when it is above the threshold, the model's result is used, playing to the strengths of both strategies.
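The threshold-based arbitration between the two strategies could be sketched as follows; the function and argument names are illustrative.

```python
def combine(rule_result, model_result, model_score, threshold=0.9):
    """Arbitrate between the two extraction strategies described above:
    trust the NER model when its confidence clears the threshold,
    otherwise fall back to the high-precision rule result."""
    return model_result if model_score >= threshold else rule_result
```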

(3) Form a keyword skeleton from the precedence relations between keywords.

In one embodiment, keywords are used to build the keyword framework, and a precedence-relation extraction strategy yields the keyword skeleton.

To complete the task of extracting precedence relations between knowledge points, a knowledge framework based on a keyword skeleton must be devised. A course knowledge point is the smallest unit of knowledge, independent and indivisible, whereas keywords are the specialized terms of a subject, the vocabulary of the text that makes up knowledge points. As the cornerstone of a subject's knowledge system, keywords are unique, specific, and capable of representing knowledge. A knowledge point can thus be represented as a combination of keywords, yielding a keyword system and a knowledge-point system derived from it.

Precedence relations between keywords are then extracted. The table of contents of a subject textbook defines the order of the knowledge structure, i.e., the heading order of chapters, sections, and points. When keyword A appears in a knowledge point, if A also appears in the body text of a subsequent knowledge point that contains keyword B, then by rule A is considered a prerequisite (predecessor) of B. Following this logic, a precedence graph of the keywords in a book is obtained; after integrating multiple books of the same course, the keyword skeleton of the whole course is generated. As shown in Figure 5, the keyword skeleton contains the nodes keyword A and keyword B and the precedence relation between them.
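The precedence rule above can be sketched as a small graph-building routine; the data layout (an ordered list of knowledge points, each with its keyword set and body text) is an assumption.

```python
def keyword_skeleton(knowledge_points):
    """Build keyword precedence edges by the rule described above.

    knowledge_points: ordered list of (keywords, body_text) pairs, in the
    order the textbook presents them.
    Returns a set of (A, B) edges meaning "keyword A is a prerequisite of B".
    """
    edges = set()
    for i, (kws_i, _) in enumerate(knowledge_points):
        for j in range(i + 1, len(knowledge_points)):
            kws_j, body_j = knowledge_points[j]
            for a in kws_i:
                if a in body_j:               # A reappears in a later body text
                    for b in kws_j:
                        if b != a:
                            edges.add((a, b))  # A precedes B
    return edges
```

Running this over every book of a course and taking the union of the edge sets would give the course-level keyword skeleton of Figure 5.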

(4) Apply the keyword skeleton to extract precedence relations between knowledge points, and derive the knowledge-point framework from the keyword skeleton and these relations.

In one embodiment, the keyword skeleton is used as the underlying logical framework to infer the precedence relations between knowledge points.

The goal of subject knowledge graph construction is the knowledge-point precedence graph, which can be derived from the keyword skeleton. Keywords can represent knowledge points: the body of a knowledge point contains several of the subject's keywords. Specifically, suppose there are knowledge points KP1 and KP2, each represented by several keywords Kij, i.e., KP1(K11, K12, ...) and KP2(K21, K22, ...).

When inferring the relationship between KP1 and KP2, the associations between keywords must be incorporated. The knowledge-point relation extraction model consists of two parts: (1) a knowledge-point encoding model (T-Encoder), which captures the basic lexical and syntactic information in the knowledge-point text; and (2) a prerequisite-knowledge encoding model (K-Encoder), which captures the prerequisite relationships between keywords across knowledge points. The T-Encoder is a RoBERTa model; the K-Encoder pre-trains the keyword embeddings with TransE (Translating Embedding), an effective knowledge representation model. After the vector representations of the knowledge point and its keywords are obtained, they are summed to form the final vector representation, which is passed through a fully connected layer and a softmax classifier to decide whether a prerequisite relationship exists, finally yielding the knowledge-point framework, as shown in Figure 6.
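As a rough illustration of the final fusion and classification step (not the trained model itself), the sketch below sums a text vector and a keyword vector and passes the result through a randomly initialised fully connected layer and softmax; the dimensions and weights are placeholders standing in for the T-Encoder and K-Encoder outputs:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

dim = 8
text_vec = rng.normal(size=dim)     # knowledge-point text embedding (T-Encoder stand-in)
keyword_vec = rng.normal(size=dim)  # keyword-relation embedding (K-Encoder stand-in)

fused = text_vec + keyword_vec      # element-wise sum of the two representations
W = rng.normal(size=(2, dim))       # fully connected layer over 2 classes:
b = np.zeros(2)                     # prerequisite relation exists / does not exist

probs = softmax(W @ fused + b)
print(probs)  # two class probabilities summing to 1
```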

(5) Using the knowledge-point framework, knowledge fusion is performed with clustering and similarity-calculation schemes to construct the subject knowledge graph.

Knowledge fusion means that when the knowledge graph contains a large number of nodes, similar knowledge points must be merged; merge candidates can be predicted in an unsupervised way or via similarity calculation. The unsupervised way applies a clustering algorithm such as k-means to group similar knowledge points into one cluster. The similarity calculation vectorizes knowledge points with an algorithm such as TF-IDF (the keyword set can serve as the vectorized representation of a knowledge point), computes the cosine similarity between knowledge points to obtain the N most similar ones, and merges according to a preset threshold. Here the threshold is set to 0.95: if the similarity is greater than or equal to 0.95, the knowledge points are fused; otherwise they are not.
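A minimal sketch of the similarity-based fusion rule, assuming keyword sets as the representation and omitting TF-IDF weighting for brevity; all names and data are illustrative:

```python
import math

def cosine(a, b):
    """Cosine similarity between two keyword sets seen as binary vectors."""
    inter = len(a & b)
    return inter / math.sqrt(len(a) * len(b)) if a and b else 0.0

def fuse(points, threshold=0.95):
    """points: dict name -> keyword set; returns pairs whose similarity
    meets the threshold and should therefore be merged."""
    names = sorted(points)
    return [(x, y) for i, x in enumerate(names) for y in names[i + 1:]
            if cosine(points[x], points[y]) >= threshold]

kps = {
    "kp1": {"matrix", "rank", "determinant"},
    "kp2": {"matrix", "rank", "determinant"},  # duplicate entry from another book
    "kp3": {"limit", "continuity"},
}
print(fuse(kps))  # [('kp1', 'kp2')]
```

In the unsupervised variant, the same vectors could instead be fed to a k-means clusterer, with each resulting cluster collapsed into one node.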

After the knowledge-point graph is obtained, knowledge reasoning is used to complete the latent prerequisite relationships. Specifically, the triples in the knowledge graph are used to train the knowledge representation model TransE, making full use of the graph's existing structure to obtain vectors representing the knowledge-point entities and the prerequisite relation. Completion exploits the additivity of these vectors. Taking knowledge-point prediction as an example: if knowledge point 1 is a prerequisite of knowledge point 2, then vector(KP2) = vector(KP1) + vector(relation). Accordingly, among all knowledge points, the one whose vector is closest to vector(KP1) + vector(relation) is the predicted successor, i.e., knowledge point 1 is likely a prerequisite of that predicted knowledge point. Through such completion, a broader and more comprehensive subject knowledge graph is formed.
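The completion step can be illustrated with toy, hand-picked embeddings (not trained TransE vectors): the predicted successor of a knowledge point is the entity nearest to its vector plus the relation vector:

```python
import numpy as np

# Toy entity and relation embeddings chosen so that kp2 ≈ kp1 + relation,
# mirroring the TransE additivity property described above.
entities = {
    "kp1": np.array([0.0, 0.0]),
    "kp2": np.array([1.0, 0.0]),
    "kp3": np.array([0.0, 2.0]),
}
prerequisite_rel = np.array([1.0, 0.0])

def predict_successor(head, exclude):
    """Return the entity nearest to head + relation, i.e. the knowledge
    point that 'head' is most likely a prerequisite of."""
    target = entities[head] + prerequisite_rel
    candidates = {name: float(np.linalg.norm(vec - target))
                  for name, vec in entities.items() if name not in exclude}
    return min(candidates, key=candidates.get)

print(predict_successor("kp1", exclude={"kp1"}))  # kp2
```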

Corresponding to the foregoing embodiments of the deep-learning-based method for constructing a subject knowledge graph in the education field, the present invention also provides embodiments of a device for constructing such a graph.

Referring to Figure 7, an embodiment of the present invention provides a device for constructing a deep-learning-based subject knowledge graph in the education field, comprising one or more processors configured to implement the construction method of the above embodiments.

The device embodiments of the present invention may be applied to any equipment with data processing capability, such as a computer. The device embodiments may be implemented in software, in hardware, or in a combination of the two. Taking software implementation as an example, the device in the logical sense is formed by the processor of the equipment reading the corresponding computer program instructions from non-volatile memory into memory and running them. At the hardware level, Figure 4 shows a hardware structure diagram of the equipment on which the device resides; beyond the processor, memory, network interface, and non-volatile memory shown in Figure 4, the equipment typically also includes other hardware depending on its actual function, which is not described further here.

For the implementation of the functions of each unit in the above device, refer to the implementation of the corresponding steps in the above method, which is not repeated here.

Since the device embodiments basically correspond to the method embodiments, refer to the description of the method embodiments for related details. The device embodiments described above are merely illustrative: the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units, i.e., they may be located in one place or distributed across multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the present invention. Those of ordinary skill in the art can understand and implement this without creative effort.

An embodiment of the present invention also provides a computer-readable storage medium on which a program is stored; when executed by a processor, the program implements the deep-learning-based method for constructing a subject knowledge graph in the education field of the above embodiments.

The computer-readable storage medium may be an internal storage unit of any equipment with data processing capability described in any of the foregoing embodiments, such as a hard disk or memory. It may also be an external storage device of the equipment, such as a plug-in hard disk, a smart media card (SMC), an SD card, or a flash card. Further, it may include both an internal storage unit and an external storage device. The medium stores the computer program and the other programs and data required by the equipment, and may also temporarily store data that has been or will be output.

The above embodiments merely illustrate the design concept and characteristics of the present invention, so that those skilled in the art can understand and implement it; the protection scope of the present invention is not limited to these embodiments. Therefore, all equivalent changes or modifications made according to the principles and design ideas disclosed herein fall within the protection scope of the present invention.

Other embodiments of the present application will readily occur to those skilled in the art from consideration of the specification and practice of the disclosure herein. This application is intended to cover any variations, uses, or adaptations that follow its general principles and include common knowledge or customary technical means in the art not disclosed herein. The specification and examples are to be considered exemplary only.

It should be understood that the present application is not limited to the precise constructions described above and shown in the accompanying drawings, and that various modifications and changes may be made without departing from its scope.

Claims (10)

1. A method for constructing a subject knowledge graph in the education field based on deep learning, characterized by comprising the following steps:
(1) acquiring multi-source data in the education field;
(2) performing structured extraction on the multi-source data to obtain keywords and knowledge points;
(3) forming a keyword skeleton according to the prerequisite relationships between keywords;
(4) applying the keyword skeleton to extract the prerequisite relationships between knowledge points, and deriving the knowledge-point framework from the keyword skeleton and these relationships;
(5) using the knowledge-point framework, performing knowledge fusion in an unsupervised way and via similarity calculation to construct the subject knowledge graph.

2. The method according to claim 1, characterized by further comprising the step of, on the basis of the subject knowledge graph, applying knowledge reasoning to complete the associations between knowledge points, and empowering front-end business systems for recommendation, question answering, and search; the associations between knowledge points include hypernym-hyponym relations, sequential relations at the same level, and similarity relations at the same level.

3. The method according to claim 1, characterized in that in step (1) the multi-source data comprises teaching materials, video data, and PPT courseware data.

4. The method according to claim 1, characterized in that in step (2) two strategies, rules and NLP, are applied to the multi-source data for structured extraction, the two strategies being parallel and complementary to each other; the rule strategy extracts keywords and knowledge points through rule curation and regular expressions; the NLP strategy first applies OCR and then an NER algorithm to extract keywords and knowledge points, wherein the NER algorithm uses BiLSTM combined with CRF, RoBERTa combined with CRF, or RoBERTa combined with span to extract keywords and knowledge points.

5. The method according to claim 4, characterized in that the NER algorithm uses RoBERTa combined with CRF to extract keywords and knowledge points.

6. The method according to claim 5, characterized in that incremental pre-training is performed on the RoBERTa model to learn subject knowledge in the education field.

7. The method according to claim 5, characterized in that the loss function of the RoBERTa model, Focus Loss For Teaching Resources, is given by:
alpha = [α1, α2, α3, ..., αn], n ∈ labels.count
log(pt)' = log(pt) · alpha
Loss = −(1 − pt)^γ · log(pt)'
where alpha denotes the label loss weights, α1, α2, α3, ..., αn are the loss weights of the individual labels, and n is the number of labels; pt denotes the model's predicted probability, equal to the predicted probability p̂ when the predicted value y equals the label, and 1 − p̂ otherwise; γ denotes the modulation factor.

8. The method according to claim 1, characterized in that the unsupervised way specifically applies the k-means clustering algorithm to cluster knowledge points so that similar knowledge points form one cluster; the similarity calculation vectorizes knowledge points with the TF-IDF algorithm, the keyword set serving as the vectorized representation of a knowledge point, and computes the cosine similarity between knowledge points to obtain several similar knowledge points; fusion is then performed according to a preset threshold, the threshold being set to 0.95.

9. A device for constructing a subject knowledge graph in the education field based on deep learning, characterized by comprising one or more processors configured to implement the method according to any one of claims 1-8.

10. A computer-readable storage medium on which a program is stored, characterized in that when the program is executed by a processor, it implements the method according to any one of claims 1-8.
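The exact loss formula in claim 7 is garbled in the source text; as a hedged illustration, the sketch below implements the standard per-label-weighted focal loss, which uses the same quantities the claim defines (the label weights alpha, the probability pt, and the modulation factor gamma). The values are illustrative only:

```python
import numpy as np

def focal_loss(probs, label, alpha, gamma=2.0):
    """Per-label-weighted focal loss: -alpha[y] * (1 - p_t)**gamma * log(p_t).

    A stand-in for the (garbled) claim-7 formula: well-classified examples
    (p_t near 1) are down-weighted by the (1 - p_t)**gamma factor.
    """
    p_t = probs[label]
    return -alpha[label] * (1.0 - p_t) ** gamma * np.log(p_t)

probs = np.array([0.7, 0.2, 0.1])  # model probabilities over 3 labels
alpha = np.array([1.0, 2.0, 2.0])  # per-label loss weights
loss_easy = focal_loss(probs, label=0, alpha=alpha)  # confident, correct-ish
loss_hard = focal_loss(probs, label=1, alpha=alpha)  # low-probability label
print(loss_easy < loss_hard)  # True: the well-classified case is down-weighted
```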
CN202310135812.6A (filed 2023-02-20, priority 2023-02-20) | Deep learning-based education field discipline knowledge graph construction method and device | Pending | CN116521886A (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202310135812.6A | 2023-02-20 | 2023-02-20 | CN116521886A (en) Deep learning-based education field discipline knowledge graph construction method and device

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202310135812.6A | 2023-02-20 | 2023-02-20 | CN116521886A (en) Deep learning-based education field discipline knowledge graph construction method and device

Publications (1)

Publication Number | Publication Date
CN116521886A | 2023-08-01

Family

ID=87392853

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202310135812.6A | CN116521886A (en) Deep learning-based education field discipline knowledge graph construction method and device (Pending) | 2023-02-20 | 2023-02-20

Country Status (1)

Country | Link
CN (1) | CN116521886A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN117010451A (en)* | 2023-08-09 | 2023-11-07 | 北京珊瑚礁科技有限公司 | Knowledge hierarchical coding method and device of QA knowledge base
CN117634612A (en)* | 2023-12-07 | 2024-03-01 | Inner Mongolia University | Methods and systems for constructing knowledge graphs for mathematics courses
WO2025028586A1 (en)* | 2023-08-02 | 2025-02-06 | SoftBank Group Corp. | Action control system


Similar Documents

Publication | Publication Date | Title
CN114020862B (en)Search type intelligent question-answering system and method for coal mine safety regulations
CN111046179B (en) A text classification method for open network questions in specific domains
CN111738003B (en) Named entity recognition model training method, named entity recognition method and medium
CN117009490A (en)Training method and device for generating large language model based on knowledge base feedback
CN110442718B (en)Statement processing method and device, server and storage medium
CN113569050B (en)Method and device for automatically constructing government affair field knowledge map based on deep learning
CN113515632B (en)Text classification method based on graph path knowledge extraction
CN114298055B (en)Retrieval method and device based on multilevel semantic matching, computer equipment and storage medium
CN106997382A (en)Innovation intention label automatic marking method and system based on big data
CN116521886A (en)Deep learning-based education field discipline knowledge graph construction method and device
CN113051886B (en)Test question duplicate checking method, device, storage medium and equipment
WO2024036840A1 (en)Open-domain dialogue reply method and system based on topic enhancement
CN114818717A (en)Chinese named entity recognition method and system fusing vocabulary and syntax information
CN107798624A (en)A kind of technical label in software Ask-Answer Community recommends method
CN115600602B (en)Method, system and terminal device for extracting key elements of long text
CN114492441A (en) BiLSTM-BiDAF Named Entity Recognition Method Based on Machine Reading Comprehension
CN112784848A (en)Image description generation method based on multiple attention mechanisms and external knowledge
CN111552773A (en) A method and system for finding key sentences of question-like or not in reading comprehension task
CN114428850A (en)Text retrieval matching method and system
CN119669530B (en) Knowledge graph generation-assisted teaching question answering method and system based on LLM
CN118277509A (en)Knowledge graph-based data set retrieval method
CN117708336A (en) A multi-strategy sentiment analysis method based on topic enhancement and knowledge distillation
CN115344668A (en) A multi-field and multi-disciplinary science and technology policy resource retrieval method and device
CN117992614A (en) A method, device, equipment and medium for sentiment classification of Chinese online course reviews
CN113095063B (en)Two-stage emotion migration method and system based on shielding language model

Legal Events

Date | Code | Title | Description
PB01 | Publication
SE01 | Entry into force of request for substantive examination
