CN114997176A

Movatterモバイル変換

Info

Publication number: CN114997176A
Application number: CN202210553720.5A
Authority: CN
Inventors: 刘悦; 葛献远; 杨正伟; 孙拾雨; 施思齐
Original assignee: University of Shanghai for Science and Technology
Current assignee: University of Shanghai for Science and Technology
Priority date: 2022-05-19
Filing date: 2022-05-19
Publication date: 2022-09-02
Anticipated expiration: 2042-05-19
Also published as: CN114997176B

Abstract

Translated fromChinese

本发明公开一种文本数据的描述符识别方法、装置及介质。所述方法包括：将文本数据分隔成至少一个句子序列，并将各个句子序列分隔成单独的标记，基于预设的实体标签，对各个标记进行标注，所述预设的实体标签用于定义描述符；随机掩码所述句子序列中的部分单词，并通过学习到的上下文语义关系预测被掩码的单词，以实现对文本数据的增强；基于增强后的文本数据对识别模型进行训练，利用训练好的识别模型通过对文本数据的描述符进行识别，进而筛选出性能驱动的高质量描述符。本发明不仅可以从粗粒度和细粒度两方面从材料科学文献种自动抽取描述符，还可以将领域知识嵌入描述符识别方法中，使得本发明能够根据使用者需要筛选高质量描述符。The invention discloses a descriptor identification method, device and medium of text data. The method includes: dividing text data into at least one sentence sequence, dividing each sentence sequence into individual tags, and labeling each tag based on a preset entity tag, the preset entity tag being used to define a description Randomly mask some words in the sentence sequence, and predict the masked words through the learned contextual semantic relationship, so as to realize the enhancement of the text data; train the recognition model based on the enhanced text data, and use the The trained recognition model identifies the descriptors of the text data, and then filters out the performance-driven high-quality descriptors. The present invention can not only automatically extract descriptors from material science literature from both coarse-grained and fine-grained aspects, but also can embed domain knowledge into the descriptor identification method, so that the present invention can screen high-quality descriptors according to user needs.

Description

Translated fromChinese

文本数据的描述符识别方法、装置及介质Descriptor identification method, device and medium for text data

技术领域technical field

本发明涉及文本数据处理技术领域，更具体地，涉及一种文本数据的描述符识别方法、装置及介质。The present invention relates to the technical field of text data processing, and more particularly, to a method, device and medium for identifying descriptors of text data.

背景技术Background technique

随着材料科学的发展，材料数据积累越来越庞大，从数以亿万计纷繁复杂的数据中提取有用信息，分析并梳理材料的成分-结构-工艺-性能的构效关系，成为材料研究的核心和关键。机器学习能够建立材料影响因素(如成分、工艺、外界环境等描述符)与目标量(如性能等)间的映射关系，从而实现材料成分、结构、工艺、性能的预测与新材料的发现。然而，机器学习模型的性能受到数据质量的制约，因此，科学合理地构建与目标量相关的描述符特征仍然是建立高精度机器学习预测模型、辅助理解材料机制的基础。所以，快速有效的选择合适的描述符对于材料构效关系的研究具有重要意义。With the development of materials science, the accumulation of material data is becoming more and more huge, extracting useful information from hundreds of millions of complicated data, analyzing and sorting out the structure-activity relationship of material composition-structure-process-performance, which has become a material research core and key. Machine learning can establish the mapping relationship between material influencing factors (such as descriptors such as composition, process, and external environment) and target quantities (such as performance, etc.), so as to realize the prediction of material composition, structure, process, and performance and the discovery of new materials. However, the performance of machine learning models is constrained by the quality of data. Therefore, scientifically and reasonably constructing descriptor features related to target quantities is still the basis for building high-precision machine learning prediction models and assisting in understanding material mechanisms. Therefore, the rapid and effective selection of appropriate descriptors is of great significance for the study of material structure-activity relationship.

在材料领域构效关系的研究中，描述符的选取主要依靠专家的经验知识或手工阅读文献。例如，Jalem等人(2014)通过对现有材料知识的总结，选择了元素的电荷和配位数等化学成分描述符和晶格常数、晶胞体积、多面体的键长和键角以及原子间距离等晶体结构描述符来构建学习样本对橄榄石型化合物LiMXO_4(M,X为主族元素)的激活能进行预测。Sendek等人(2016) 从Material Project数据库中手工筛选了12831含锂晶体固体，筛选出了结构和化学稳定性高、电子导电性低、成本低的锂。然后，通过电导率筛选与晶体结构和化学成分相关的21个描述符，进而探讨锂离子电池固体电解质的结构-组成-电导率关系Xu等人(2020)基于大量文献搜集与阿伦尼乌斯法则外推得到的70个

空间群的NASICON型化合物的离子电导率数据，并利用逻辑回归方法构建了一个离子电导率的预测模型。他们凭经验选取了元素半径、元素电负性、离子个数等16个化学成分描述符和晶格常数、晶胞体积、原子体积等12个结构描述符来探究与电导率间的构效关系。然而，专家花费大量的时间通过总结已发表文献的单词或短语获得合适的描述符，这会导致随着材料科学文献出版数量的增加，这样的方式会严重影响描述符的选择，从而影响材料构效关系的研究。In the study of structure-activity relationship in the field of materials, the selection of descriptors mainly relies on the empirical knowledge of experts or manual reading of literature. For example, Jalem et al. (2014) selected chemical composition descriptors such as charge and coordination number of elements and lattice constants, unit cell volumes, bond lengths and bond angles of polyhedra, and interatomic Distance and other crystal structure descriptors are used to construct learning samples to predict the activation energy of the olivine-type compound LiMXO_4 (M, X main group elements). Sendek et al. (2016) manually screened 12831 lithium-containing crystalline solids from the Material Project database, and screened lithium with high structural and chemical stability, low electronic conductivity, and low cost. Then, 21 descriptors related to crystal structure and chemical composition were screened by electrical conductivity to explore the structure-composition-conductivity relationship of solid electrolytes for Li-ion batteries. Xu et al. (2020) based on extensive literature collection with Arrhenius 70 extrapolated from the law

The ionic conductivity data of NASICON-type compounds of the space group were used, and a predictive model of ionic conductivity was constructed using logistic regression method. Based on their experience, they selected 16 chemical composition descriptors such as element radius, element electronegativity, and number of ions, and 12 structural descriptors such as lattice constant, unit cell volume, and atomic volume to explore the structure-activity relationship with conductivity. . However, experts spend a lot of time to obtain suitable descriptors by summarizing the words or phrases of the published literature, which leads to a serious impact on the choice of descriptors as the number of publications of materials science literature increases, thus affecting the material structure. A study of the effect relationship.

命名实体识别(Named Entity Recognition,NER)能够从非结构化文本中自动提取信息。通常，这种类型的任务被视为有监督的机器学习问题，在这个问题中，模型学习识别句子中的关键字或短语。目前，NER已被应用于有机材料和无机材料的信息提取，例如Kim等人(2017)采用NER解析76000篇与氧化物材料合成相关的文章，提取其中的关键信息，并将这些信息编码到数据库中；Mysore等人(2017)通过NER从材料科学合成程序中提取行动图结构，在数据集较小的情况下能够较为准确地提取行动图合成文本；此外，一些基于化学的NER系统能够提取无机材料。例如，Krallinger等人(2017) 采用NER实现对科学文献、专利、技术报告或网络中包含的化学信息有效获取；Leaman等人(2015)利用NER技术准确地识别出文献中的化学提法、属性以及关系。最近，一些研究人员通过构建深度学习的NER模型，实现了对文献中无机材料信息的大规模提取，例如He等人(2020)通过构建基于Bi-LSTM的模型来实现文献中报道的无机固态合成反应中的前体和目标的精准提取；Weston等人(2019)通过NER实现从文献中识别无机材料的提法、样品描述符、相标签、材料属性和应用，以及任何使用的合成和表征方法。NER在无机材料信息提取中的研究成果引起了有机材料研究者的关注。例如，Zhao等人(2021)使用命名实体识别(NER)和BiLSTM-CNN-CRF深度学习模型自动从文献中提取有机材料信息。然而，目前尚未有采用NER 对材料文献科学的研究。此外，材料信息学研究往往对成百上千种材料进行预测，从材料文献中提取描述符特征对研究构效关系非常有用。Named Entity Recognition (NER) can automatically extract information from unstructured text. Typically, this type of task is viewed as a supervised machine learning problem, where a model learns to recognize keywords or phrases in sentences. At present, NER has been applied to the information extraction of organic materials and inorganic materials. For example, Kim et al. (2017) used NER to parse 76,000 articles related to the synthesis of oxide materials, extract the key information, and encode the information into the database (2017) extract action graph structure from material science synthesis programs through NER, which can more accurately extract action graph synthetic text in the case of small datasets; in addition, some chemical-based NER systems can extract inorganic Material. For example, Krallinger et al. (2017) used NER to effectively obtain chemical information contained in scientific literature, patents, technical reports or the Internet; Leaman et al. (2015) used NER technology to accurately identify chemical formulations and attributes in the literature. and relationship. Recently, some researchers have achieved large-scale extraction of inorganic material information in the literature by building a deep learning NER model, such as He et al. (2020) by building a Bi-LSTM-based model to achieve the inorganic solid-state synthesis reported in the literature Precise extraction of precursors and targets in reactions; Weston et al. (2019) NER enables identification of inorganic material formulations, sample descriptors, phase labels, material properties and applications from the literature, as well as any synthesis and characterization methods used . The research results of NER in information extraction of inorganic materials have attracted the attention of researchers of organic materials. For example, Zhao et al. (2021) used Named Entity Recognition (NER) and BiLSTM-CNN-CRF deep learning model to automatically extract organic material information from literature. However, there has been no research on materials literature science using NER. In addition, materials informatics research often makes predictions for hundreds of materials, and extracting descriptor features from materials literature is very useful for studying structure-activity relationships.

目前针对描述符选取的方法大都是通过专家的领域知识从材料文献中筛选出适合作为描述符的单词或短语。目前材料领域的NER方法大都是采用无监督实体识别和有监督实体识别以及基于深度学习的实体识别。其中，无监督实体识别主要是基于规则的，规则的设计需要领域相关的知识库、词典，甚至需要专家的精心设计；有监督实体识别多是采用条件随机场实现 (Conditional Random Field,CRF)；基于深度学习的实体识别多是采用 Bi-LSTM结合CRF的方式实现，其中Bi-LSTM是一种双向长短期记忆网络 (Longshort-term memory,LSTM)，而LSTM是一种改进的RNN。下面分别对基于规则的实体识别、CRF、RNN、LSTM、Bi-LSTM进行介绍。At present, most of the methods for descriptor selection are to filter words or phrases suitable as descriptors from the material literature through the domain knowledge of experts. At present, most of the NER methods in the field of materials use unsupervised entity recognition and supervised entity recognition as well as entity recognition based on deep learning. Among them, unsupervised entity recognition is mainly based on rules, and the design of rules requires domain-related knowledge bases, dictionaries, and even careful design by experts; supervised entity recognition is mostly realized by conditional random field (CRF); Entity recognition based on deep learning is mostly realized by using Bi-LSTM combined with CRF. Bi-LSTM is a bidirectional long short-term memory (LSTM) network, and LSTM is an improved RNN. The rule-based entity recognition, CRF, RNN, LSTM, and Bi-LSTM are introduced below.

(1)基于规则(1) Based on rules

比较有名的基于规则的实体识别系统有LaSIE-II，NetOwl，Facile，SAR， FASTUS和LTG。这些系统主要基于手动设计的语义和句法规则来识别实体，比如对句子进行词性标注，将满足某些限制的名词短语视为实体。当词典资源非常丰富时，通常可以取得不错的性能。KnowItAll系统是无监督的，利用领域无关的规则模板，可以自动从网页上抽取大量的实体(和关系)。无监督学习的实体识别的优势是不需要任何标注数据，可以借助于词典和人工设计规则得到大量的实体。然而，由于规则是特定领域的以及词典的不完整性，这些系统往往具有较高的精确率和较低的召回率，并且很难将此系统应用在其他的领域。The more well-known rule-based entity recognition systems are LaSIE-II, NetOwl, Facile, SAR, FASTUS and LTG. These systems identify entities mainly based on manually designed semantic and syntactic rules, such as part-of-speech tagging of sentences, and treating noun phrases that satisfy certain constraints as entities. When the dictionary resource is very rich, generally good performance can be achieved. The KnowItAll system is unsupervised and can automatically extract a large number of entities (and relationships) from web pages using domain-independent rule templates. The advantage of unsupervised learning entity recognition is that it does not require any labeled data, and a large number of entities can be obtained with the help of dictionaries and artificial design rules. However, because the rules are domain-specific and the lexicon is incomplete, these systems tend to have high precision and low recall, and it is difficult to apply this system to other domains.

(2)基于传统机器学习(2) Based on traditional machine learning

CRF是一个经典的序列标注模型，通过抽取每个位置l上的特征以及相邻输出标签之间的特征，假设特征有N种，表示为

模型的条件概率为：CRF is a classic sequence labeling model. By extracting the features at each position l and the features between adjacent output labels, it is assumed that there are N kinds of features, which are expressed as

The conditional probability of the model is:

其中，Z(x)为归一化函数。Among them, Z(x) is the normalization function.

(3)基于深度学习(3) Based on deep learning

RNN是一种可以处理序列数据的神经网络，它是以序列数据为输入，在序列的演进方向进行递归且所有节点(循环单元)按链式连接的递归神经网络递归且所有节点(循环单元)按链式连接的递归神经网络。RNN is a neural network that can process sequence data. It takes the sequence data as input, recurses in the evolution direction of the sequence, and all nodes (recurrent units) are connected in a chain. Recursive neural network recursively and all nodes (recurrent units) Chained recurrent neural network.

在传统的神经网络中，输入层和隐层、隐层和输出层之间是全连接的，每一层内部的各个节点是没有连接的，只能单独的处理一个个的输入，前一个输入和后一个输入是完全没有关系的。但是，当对句子这种序列数据进行处理时，单独的理解句子中的每一个词肯定是不合适的。例如在进行词性标注时，前一个词的词性对当前词的词性预测就有很大的影响。如果前一个词是动词，当前词是名词的概率就要远远大于当前词是动词的概率。因此，为了更好的处理类似的问题，就需要用当前词之前的其它词的信息，或者说，用一些历史信息给当前的预测提供帮助，而传统的神经网络无法提供这种历史信息。In the traditional neural network, the input layer and the hidden layer, the hidden layer and the output layer are fully connected, and each node inside each layer is not connected, and can only process one input individually, the previous input It has absolutely nothing to do with the latter input. However, when dealing with sequence data such as sentences, it is definitely not appropriate to understand each word in the sentence individually. For example, when performing part-of-speech tagging, the part-of-speech of the previous word has a great influence on the part-of-speech prediction of the current word. If the previous word is a verb, the probability that the current word is a noun is much greater than the probability that the current word is a verb. Therefore, in order to better deal with similar problems, it is necessary to use the information of other words before the current word, or to use some historical information to help the current prediction, and the traditional neural network cannot provide such historical information.

RNN在隐层的单元加入了递归连接，网络的历史信息就可以通过递归连接进行传播，使得网络的隐层有了保存和利用历史信息的功能，具体表现在隐层的输入不仅包含输入层的信息，还包括上一时刻隐层的输出。在RNN 中，t时刻的隐层输入除了x_t外，还包括上一时刻隐层的输出h_t-1，这样在处理句子中的第t个词时，可以利用到句子中该词前面的其他词的信息(h_t-1)，从而有助于处理序列的预测问题。RNN adds recursive connections to the units of the hidden layer, and the historical information of the network can be propagated through recursive connections, so that the hidden layer of the network has the function of saving and utilizing historical information. information, and also includes the output of the hidden layer at the previous moment. In RNN, in addition to x_t , the hidden layer input at time t also includes the output h_t-1 of the hidden layer at the previous moment, so that when processing the t-th word in the sentence, the previous word in the sentence can be used. information (h_t-1 ) of other words, thus helping to deal with the problem of sequence prediction.

h_t＝H(W[h_t-1,x_t]+b) 公式(3)h_t =H(W[h_t-1 ,x_t ]+b) Formula (3)

式中，h_t为隐含层输出，H为非线性函数(例如tanh函数)，b为偏置。In the formula, h_t is the output of the hidden layer, H is the nonlinear function (such as the tanh function), and b is the bias.

然而，标准的RNN也存在两个问题：一是如果梯度传递过长，会导致难以捕捉到序列中长距离的依赖关系；二是在处理长序列时，会出现梯度消失或梯度爆炸的现象。However, the standard RNN also has two problems: one is that if the gradient transfer is too long, it will make it difficult to capture long-distance dependencies in the sequence; the other is that when dealing with long sequences, the gradient disappears or the gradient explodes.

(4)LSTM模型(4) LSTM model

相较于RNN，LSTM的隐层单元中增加了三个门结构用于信息的记忆、更新和利用。这三个门是输入门(i)、遗忘门(f)和输出门(o)，并且添加了一个记忆单元(c)。其中，输入门i的作用是确定哪些新信息可以被存放在记忆单元中，遗忘门f控制有多少历史信息应该被忘记，输出门o决定有哪些信息可以被输出。其计算方式如公式(4)-(9)所示：Compared with RNN, three gate structures are added to the hidden layer unit of LSTM for information memory, update and utilization. The three gates are the input gate (i), the forget gate (f), and the output gate (o), and a memory cell (c) is added. Among them, the role of the input gate i is to determine which new information can be stored in the memory unit, the forget gate f controls how much historical information should be forgotten, and the output gate o determines which information can be output. Its calculation method is shown in formulas (4)-(9):

i_t＝σ(W_i[h_t-1,x_t]+b_i) 公式(4)i_t =σ(W_i [h_t-1 ,x_t ]+b_i ) Formula (4)

f_t＝σ(W_f[h_t-1,x_t]+b_f) 公式(5)f_t =σ(W_f [h_t-1 ,x_t ]+b_f ) Formula (5)

c_t％＝tanh(W_c[h_t-1,x_t]+b_c) 公式(6)c_t %=tanh(W_c [h_t-1 ,x_t ]+b_c ) Formula (6)

c_t＝f_te c_t-1+i_te c_t％公式(7)c_t =f_t ec_t-1 +i_t ec_t % Formula (7)

o_t＝σ(W_o[h_t-1,x_t]+b_o) 公式(8)o_t =σ(W_o [h_t-1 ,x_t ]+b_o ) Formula (8)

h_t＝o_te tan(c_t) 公式(9)h_t =o_t e tan(c_t ) Equation (9)

其中，符号σ和tanh表示不同的激活函数，e表示点积(dot product)。W_i，W_f，W_c，W_o是权重矩阵，b_i，b_f，b_c，b_o是偏置值。x_t是t时刻的输入向量，h_t是隐藏层状态，也是输出向量，其包含了t时刻之前的所有有效信息。i_t、f_t、o_t别表示t时刻对输入门、遗忘门、输出门的控制。where the symbols σ and tanh represent different activation functions, and e represents the dot product. W_i , W_f , W_c , and W_o are weight matrices, and b_i , b_f , b_c , and_bo are bias values. x_t is the input vector at time t, h_t is the hidden layer state, and is also the output vector, which contains all the valid information before time t. i_t , ft , and_ot respectively represent the control of the input gate, the forgetting gate, and the output gate at time_t .

然后，不论是RNN还是LSTM，都只能捕获序列的历史信息，由于自然语言句子结构的复杂性，在进行序列标注时有时会需要用到未来的信息，对应到句子中即处理当前词时，可能需要用到当前词后面(右面)的词的信息。Then, whether it is RNN or LSTM, only the historical information of the sequence can be captured. Due to the complexity of the sentence structure of natural language, future information is sometimes needed when performing sequence labeling. Corresponding to the sentence, that is, when processing the current word, Information about words following (to the right) of the current word may be used.

(5)Bi-LSTM模型(5) Bi-LSTM model

Bi-LSTM网络是由前向的LSTM单元和后向的LSTM单元组成。其基本思想是在隐层使用两个LSTM分别按照自前向后(正向)和自后向前(反向)对序列建模，然后将它们的输出连接起来。前向单元的隐藏层表示为

后向单元的隐藏层表示为

通过公式(4)-(9)，得到t时刻单向隐藏层的输出，如公式(10)-(11)所示。Bi-LSTM的隐藏层输出通过前向LSTM 单元与后向LSTM单元的隐藏层输出拼接得到，如公式(12)所示。Bi-LSTM network is composed of forward LSTM unit and backward LSTM unit. The basic idea is to use two LSTMs in the hidden layer to model the sequence in terms of front-to-back (forward) and back-to-front (reverse), respectively, and then concatenate their outputs. The hidden layer of the forward unit is represented as

The hidden layer of the backward unit is represented as

Through formulas (4)-(9), the output of the unidirectional hidden layer at time t is obtained, as shown in formulas (10)-(11). The hidden layer output of Bi-LSTM is obtained by splicing the hidden layer output of the forward LSTM unit and the backward LSTM unit, as shown in Equation (12).

因此，现有描述符的选取方法应用最广泛的是通过专家经验人为从材料文献中的单词或短语，这种方法耗时长，选取结果依赖于研究者的专家知识，存在一定的局限性和主观性，即不同的研究者针对某一种材料所选取的描述符可能是不同的，因此通过该方法选取的描述符的泛化性也会受到限制。同时，传统模型在处理材料命名实体识别问题时：1)缺乏对材料文献中的长句子单词的编码能力(即对于长句子，很难捕获到其每个单词之间的依赖，从而不能得到每个单词好的语境表示，进而影响实体的分类性能)；2)存在的无法表征词的多义性的问题(材料文献句子中存在很多含义相同的不同表示的单词，比如材料文献中存在许多如化学式、其缩写以及其英文名称)，传统的Word2Vec方法对出现多义性的实体提取到了不同的Embedding，因此很难对其进行区分，或是在实体识别后单独训练一个而分类器进行同义词的判断。传统的NER方法学习长距离依赖的能力较弱，并且要联合外部知识和大量人工参与来提取和处理特征。Therefore, the most widely used method for selecting descriptors is to artificially select words or phrases from the material literature through expert experience. This method is time-consuming, and the selection results depend on the expert knowledge of the researcher, which has certain limitations and subjectivity. In other words, the descriptors selected by different researchers for a certain material may be different, so the generalization of the descriptors selected by this method will also be limited. At the same time, when dealing with material named entity recognition problems, traditional models: 1) lack the ability to encode long sentence words in material literature (that is, for long sentences, it is difficult to capture the dependencies between each word, so that each word cannot be obtained. 2) There is a problem that the ambiguity of words cannot be represented (there are many words with the same meaning and different representations in the sentence of the material document, for example, there are many words in the material document. Such as chemical formula, its abbreviation and its English name), the traditional Word2Vec method extracts different Embeddings for entities with ambiguity, so it is difficult to distinguish them, or train a separate classifier after entity recognition and synonyms judgment. Traditional NER methods have weak ability to learn long-distance dependencies, and need to combine external knowledge and a large amount of human involvement to extract and process features.

发明内容SUMMARY OF THE INVENTION

提供了本发明以解决现有技术中存在的上述问题。本发明是一种文本数据的描述符识别方法、装置及介质，通过获得具有充分语义信息单词及句子的语义向量，在对词向量进行编码的时候会同时考虑单词的嵌入、句子的嵌入及单词的位置嵌入语义向量，因此一定程度上可以有效缓解材料领域特殊此的多义性问题，从而具有充分语义信息的单词的语义特征。接着，对于材料文本中的长句子，通过对整个句子进行建模，从而有效捕获词的远距离依赖来提取词的局部上下文语义特征。最后，通过CRF模型来对学习标签之间的依赖关系，最后得到最优的标签序列，识别出相应的描述符，从而可以准确的对实体进行分类。The present invention is provided to solve the above-mentioned problems in the prior art. The present invention is a descriptor recognition method, device and medium for text data. By obtaining the semantic vector of words and sentences with sufficient semantic information, the word embedding, the sentence embedding and the word embedding are taken into consideration when encoding the word vector. The position of the embedded semantic vector, so to a certain extent, it can effectively alleviate the ambiguity problem in the field of materials, so that the semantic features of words with sufficient semantic information. Next, for long sentences in the material text, the local contextual semantic features of words are extracted by modeling the whole sentence to effectively capture the long-distance dependencies of words. Finally, the CRF model is used to learn the dependencies between labels, and finally the optimal label sequence is obtained, and the corresponding descriptors are identified, so that entities can be accurately classified.

本发明具体采用如下技术方案：The present invention specifically adopts following technical scheme:

根据本发明的第一方案，提供一种文本数据的描述符识别方法，所述方法包括：According to a first aspect of the present invention, there is provided a method for identifying descriptors of text data, the method comprising:

利用训练好的识别模型通过如下方法对文本数据的描述符进行识别：Use the trained recognition model to recognize the descriptors of text data by the following methods:

基于文本数据，确定输入序列w＝(w₁,w₂,...,w_n)和与特征向量对应的标签序列y＝(y₁,y₂,...,y_n)，其中w_n是第n个单词的特征向量；Based on the text data, determine the input sequence w=(w₁ ,w₂ ,...,w_n ) and the label sequence y=(y₁ ,y₂ ,...,y_n ) corresponding to the feature vector, where w_n is the feature vector of the nth word;

通过如下公式(14)-(17)计算出一组总概率得分最大的标签序列：A set of label sequences with the largest total probability score is calculated by the following formulas (14)-(17):

其中，score(W,y)是所有输入序列的评价得分，T为迁移矩阵，

为y_i迁移到y_i+1的概率分数，

为第i个单词被标注为y_i的概率分数，p(y|S)表示语句S被标记为标签序列y的概率，y％为真标签，公式(16)表示训练过程中标签序列的似然函数，Y_W表示所有可能标注的集合，y^*表示一组总概率得分最大的标签序列；Among them, score(W, y) is the evaluation score of all input sequences, T is the transition matrix,

is the probability score for y_i to migrate to y_i+1 ,

is the probability score of the_i -th word being marked as yi, p(y|S) represents the probability that the sentence S is marked as the label sequence y, y% is the true label, and formula (16) represents the similarity of the label sequence in the training process. Ran function, Y_W represents the set of all possible labels, y^* represents a set of label sequences with the largest total probability score;

基于总概率得分最大的标签序列，确定粗粒度描述符；Determine coarse-grained descriptors based on the label sequence with the largest total probability score;

动态添加粗粒度描述符及其对应的句子序列，以构建知识库；Dynamically add coarse-grained descriptors and their corresponding sentence sequences to build a knowledge base;

基于所述知识库中的粗粒度描述符，通过描述符共同出现在同一个句子中的原则以及每个粗粒度描述符在对应句子序列中的重要性，筛选出性能驱动的高质量描述符。Based on the coarse-grained descriptors in the knowledge base, performance-driven high-quality descriptors are screened by the principle that descriptors co-occur in the same sentence and the importance of each coarse-grained descriptor in the corresponding sentence sequence.

根据本发明的第二方案，提供一种文本数据的描述符识别装置，所述装置包括处理器，所述处理器被配置为：利用训练好的识别模型通过如下方法对文本数据的描述符进行识别：According to a second aspect of the present invention, there is provided an apparatus for identifying descriptors of text data, the apparatus includes a processor, and the processor is configured to: use the trained identification model to perform the following method on the descriptors of the text data. Identify:

基于增强后的文本数据，确定输入序列w＝(w₁,w₂,...,w_n)和与特征向量对应的标签序列y＝(y₁,y₂,...,y_n)，其中w_n是第n个单词的特征向量；Based on the enhanced text data, determine the input sequence w=(w₁ ,w₂ ,...,w_n ) and the label sequence y=(y₁ ,y₂ ,...,y_n ) corresponding to the feature vector , where w_n is the feature vector of the nth word;

其中，score(W,y)是所有输入序列的评价得分，T为迁移矩阵，

为y_i迁移到y_i+1的概率分数，

is the probability score for y_i to migrate to y_i+1 ,

根据本发明的第三方案，提供一种存储有指令的非暂时性计算机可读存储介质，当所述指令由处理器执行时，执行根据本发明任一实施例所述的方法。According to a third aspect of the present invention, there is provided a non-transitory computer-readable storage medium storing instructions that, when executed by a processor, perform the method according to any of the embodiments of the present invention.

根据本发明各实施例所述的文本数据的描述符识别方法、装置及介质，不仅可以从粗粒度和细粒度两方面从材料科学文献种自动抽取描述符，还可以将领域知识嵌入描述符识别方法中，使得本发明能够根据使用者需要筛选高质量描述符。According to the method, device and medium for identifying descriptors of text data according to the embodiments of the present invention, not only can the descriptors be automatically extracted from the material science literature in both coarse-grained and fine-grained aspects, but also domain knowledge can be embedded in the descriptor identification method, enabling the present invention to screen high-quality descriptors according to user needs.

附图说明Description of drawings

在不一定按比例绘制的附图中，相同的附图标记可以在不同的视图中描述相似的部件。具有字母后缀或不同字母后缀的相同附图标记可以表示相似部件的不同实例。附图大体上通过举例而不是限制的方式示出各种实施例，并且与说明书以及权利要求书一起用于对所发明的实施例进行说明。在适当的时候，在所有附图中使用相同的附图标记指代同一或相似的部分。这样的实施例是例证性的，而并非旨在作为本装置或方法的穷尽或排他实施例。In the drawings, which are not necessarily to scale, the same reference numbers may describe similar parts in the different views. The same reference number with a letter suffix or a different letter suffix may denote different instances of similar components. The drawings illustrate various embodiments generally by way of example and not limitation, and together with the description and claims serve to explain the embodiments of the invention. Where appropriate, the same reference numbers will be used throughout the drawings to refer to the same or like parts. Such embodiments are illustrative, and are not intended to be exhaustive or exclusive embodiments of the present apparatus or method.

图1示出了根据本发明实施例的一种文本数据的描述符识别方法的流程图。Fig. 1 shows a flowchart of a method for identifying descriptors of text data according to an embodiment of the present invention.

图2示出了根据本发明实施例的一种文本数据的描述符识别方法的流程图。Fig. 2 shows a flowchart of a method for identifying descriptors of text data according to an embodiment of the present invention.

图3示出了根据本发明实施例的一种文本数据的描述符识别方法的流程图。Fig. 3 shows a flowchart of a method for identifying descriptors of text data according to an embodiment of the present invention.

图4示出了根据本发明实施例的一种文本数据的描述符识别方法的一个具体流程图。Fig. 4 shows a specific flowchart of a method for identifying descriptors of text data according to an embodiment of the present invention.

图5示出了根据本发明实施例的描述符识别器示意图。FIG. 5 shows a schematic diagram of a descriptor identifier according to an embodiment of the present invention.

图6示出了根据本发明实施例的材料领域知识融合的条件数据增强流程图。Fig. 6 shows a flow chart of conditional data enhancement for material domain knowledge fusion according to an embodiment of the present invention.

图7示出了根据本发明实施例的粗粒度描述符识别器结构图。FIG. 7 shows a structure diagram of a coarse-grained descriptor identifier according to an embodiment of the present invention.

图8示出了根据本发明实施例的基于MatBERT模型的单词向量表示输入示意图。Fig. 8 shows a schematic diagram of the input of word vector representation based on the MatBERT model according to an embodiment of the present invention.

图9示出了根据本发明实施例的知识库结构图。FIG. 9 shows a structure diagram of a knowledge base according to an embodiment of the present invention.

图10示出了根据本发明实施例的NER模型的实例预测示意图。FIG. 10 shows a schematic diagram of instance prediction of a NER model according to an embodiment of the present invention.

图11示出了根据本发明实施例的描述符重要度计算方法流程图。FIG. 11 shows a flowchart of a method for calculating the importance of a descriptor according to an embodiment of the present invention.

图12示出了根据本发明实施例的FGDR筛选示例。Figure 12 shows an example of FGDR screening according to an embodiment of the present invention.

具体实施方式Detailed ways

为使本领域技术人员更好的理解本发明的技术方案，下面结合附图和具体实施方式对本发明作详细说明。下面结合附图和具体实施例对本发明的实施例作进一步详细描述，但不作为对本发明的限定。本文中所描述的各个步骤，如果彼此之间没有前后关系的必要性，则本文中作为示例对其进行描述的次序不应视为限制，本领域技术人员应知道可以对其进行顺序调整，只要不破坏其彼此之间的逻辑性导致整个流程无法实现即可。In order to make those skilled in the art better understand the technical solutions of the present invention, the present invention will be described in detail below with reference to the accompanying drawings and specific embodiments. The embodiments of the present invention are described in further detail below in conjunction with the accompanying drawings and specific embodiments, but are not intended to limit the present invention. The steps described herein, if there is no need for a contextual relationship with each other, the order in which they are described herein as an example should not be regarded as a limitation, and those skilled in the art should know that the order can be adjusted as long as It is enough not to destroy the logic between them and make the whole process impossible.

本发明实施例提供一种文本数据的描述符识别方法，如图1所示，所述方法包括利用训练好的识别模型通过如下步骤对文本数据的描述符进行识别：An embodiment of the present invention provides a method for identifying descriptors of text data. As shown in Figure 1, the method includes using a trained recognition model to identify the descriptors of text data through the following steps:

步骤S100，基于文本数据，确定输入序列w＝(w₁,w₂,…,w_n)和与特征向量对应的标签序列y＝(y₁,y₂,…,y_n)，其中w_n是第n个单词的特征向量。Step S100, based on the text data, determine the input sequence w=(w₁ ,w₂ ,...,w_n ) and the label sequence y=(y₁ ,y₂ ,...,y_n ) corresponding to the feature vector, where w_n is the feature vector of the nth word.

步骤S200，通过公式(14)-(17)计算出一组总概率得分最大的标签序列：Step S200, by formula (14)-(17), calculate a group of label sequences with the largest total probability score:

其中，score(W,y)是所有输入序列的评价得分，T为迁移矩阵，

为y_i迁移到y_i+1的概率分数，

is the probability score for y_i to migrate to y_i+1 ,

在步骤S300，基于总概率得分最大的标签序列，确定粗粒度描述符。在这个步骤中，可以获得一个粗粒度描述符集合，并可以根据对应的粗粒度描述符来确定其对应的句子序列，因此本发明不仅仅可以识别对应描述符，还可以应用在文本分类上。At step S300, a coarse-grained descriptor is determined based on the label sequence with the largest total probability score. In this step, a set of coarse-grained descriptors can be obtained, and the corresponding sentence sequence can be determined according to the corresponding coarse-grained descriptors, so the present invention can not only identify the corresponding descriptors, but also be applied to text classification.

步骤S400，动态添加粗粒度描述符及其对应的句子序列，以构建知识库。仅作为示例，构建的知识库如图9所示。Step S400, dynamically adding coarse-grained descriptors and their corresponding sentence sequences to build a knowledge base. Just as an example, the constructed knowledge base is shown in Figure 9.

步骤S500，基于所述知识库中的粗粒度描述符，通过描述符共同出现在同一个句子中的原则及计算每个粗粒度描述符在对应句子序列中的重要性，筛选出性能驱动的高质量描述符。Step S500, based on the coarse-grained descriptors in the knowledge base, through the principle that the descriptors co-occur in the same sentence and calculating the importance of each coarse-grained descriptor in the corresponding sentence sequence, filter out the performance-driven high-level descriptors. quality descriptor.

在一些实施例中，如图2所示，所述步骤S500，基于所述知识库中的粗粒度描述符，通过描述符共同出现在同一个句子中的原则及计算每个粗粒度描述符在对应句子序列中的重要性，筛选出性能驱动描述符，包括：In some embodiments, as shown in FIG. 2 , in the step S500, based on the coarse-grained descriptors in the knowledge base, the principle of co-occurrence of the descriptors in the same sentence and the calculation of each coarse-grained descriptor in the Corresponding to the importance in the sentence sequence, filter out performance-driven descriptors, including:

步骤S501，在所述知识库中列出粗粒度描述符列表:D＝[D₁,D₂,...D_n]，并列出与所述描述符对应的语句列表:S＝[S₁,S₂,...S_n]；Step S501, list the coarse-grained descriptor list in the knowledge base: D=[D₁ , D₂ , . . . D_n ], and list the statement list corresponding to the descriptor: S=[S₁ , S₂ ,...S_n ];

步骤S502，选择描述符，然后创建一个临时队列，并将所述描述符放入所述临时队列中，将粗粒度描述符和语句从对应的粗粒度描述符列表和语句列表中取出，将粗粒度描述符与语句共同出现的描述符添加至临时队列中，在临时队列不为空队列的情况下，将临时队列中的头元素退出所述临时队列并赋值给性能驱动描述符集合，得到性能驱动的高质量描述符集合；Step S502, select a descriptor, then create a temporary queue, put the descriptor into the temporary queue, take out the coarse-grained descriptor and statement from the corresponding coarse-grained descriptor list and statement list, and store the coarse-grained descriptor and statement from the corresponding coarse-grained descriptor list and statement list. The descriptors that appear together with the granularity descriptor and the statement are added to the temporary queue. When the temporary queue is not empty, the head element in the temporary queue is withdrawn from the temporary queue and assigned to the performance-driven descriptor set to obtain the performance. Driven high-quality descriptor set;

步骤S503，通过公式(18)计算所述性能驱动的高质量描述符集合中的描述符在对应句子序列中的重要性：Step S503, calculate the importance of the descriptor in the corresponding sentence sequence in the performance-driven high-quality descriptor set by formula (18):

其中，I_i表示第i个词的重要性，E_i为第i个词的嵌入向量，S_[CLS]为相应的句子嵌入向量；Among them, I_i represents the importance of the ith word, E_i is the embedding vector of the ith word, and S_[CLS] is the corresponding sentence embedding vector;

步骤S504，基于描述符重要度的阈值筛选出性能驱动的高质量描述符。Step S504, based on the threshold of the descriptor importance, filter out the performance-driven high-quality descriptors.

在一些实施例中，基于描述符重要度的阈值，通过如下公式(19)筛选出性能驱动的高质量描述符；In some embodiments, based on the threshold of descriptor importance, performance-driven high-quality descriptors are filtered out by the following formula (19);

其中，D_i表示性能驱动的高质量描述符集合，T是描述符重要度的阈值， true表示在性能驱动的高质量描述符集合中保留的描述符，false表示在性能驱动的高质量描述符集合中删除的描述符。where D_i represents the performance-driven high-quality descriptor set, T is the threshold of descriptor importance, true represents the descriptors reserved in the performance-driven high-quality descriptor set, and false represents the performance-driven high-quality descriptor Descriptors removed from the collection.

在一些实施例中，在利用训练好的识别模型通过如下方法对文本数据的描述符进行识别之前，如图3所示，所述方法还包括：In some embodiments, before using the trained recognition model to identify the descriptor of the text data by the following method, as shown in Figure 3, the method further includes:

步骤S1001，将文本数据分隔成至少一个句子序列，并将各个句子序列分隔成单独的标记，基于预设的实体标签，对各个标记进行标注，所述预设的实体标签用于定义描述符；Step S1001, the text data is separated into at least one sentence sequence, and each sentence sequence is separated into a separate mark, and each mark is marked based on a preset entity tag, and the preset entity tag is used to define a descriptor;

步骤S1002，随机掩码所述句子序列中的部分单词，并通过学习到的上下文语义关系预测被掩码的单词，以实现对文本数据的增强；Step S1002, randomly mask the partial words in the described sentence sequence, and predict the masked word by the learned contextual semantic relationship, to realize the enhancement to text data;

步骤S1003，通过增强后的文本数据对所述识别模型进行训练。仅作为示例，训练方法可以采用识别模型对描述符进行识别的方法进行训练，具体步骤在上面已经阐述，此处不再赘述。Step S1003, the recognition model is trained through the enhanced text data. Just as an example, the training method can be trained by using the recognition model to recognize the descriptors, and the specific steps have been described above, and will not be repeated here.

在一些实施例中，在将文本数据分隔成至少一个句子序列之前，所述方法还包括对文本信息进行清理获得文本数据；所述对文本信息进行清理获得文本数据包括：通过正则表达式匹配方式去除文本信息中的无效数据，所述无效数据包括乱码、图片；在出现字符乱码的情况下，将导致出现乱码的字符转换为特殊符号标记。In some embodiments, before separating the text data into at least one sentence sequence, the method further includes cleaning the text information to obtain the text data; the cleaning the text information to obtain the text data includes: matching through a regular expression Invalid data in the text information is removed, and the invalid data includes garbled characters and pictures; in the case of garbled characters, the garbled characters are converted into special symbols.

本发明实施例还提供一种文本数据的描述符识别装置，所述装置包括处理器，所述处理器被配置为：An embodiment of the present invention further provides an apparatus for identifying descriptors of text data, the apparatus includes a processor, and the processor is configured to:

其中，score(W,y)是所有输入序列的评价得分，T为迁移矩阵，

为y_i迁移到y_i+1的概率分数，

is the probability score for y_i to migrate to y_i+1 ,

需要说明的是，处理器可以是包括一个以上通用处理设备的处理设备，诸如微处理器、中央处理单元(CPU)、图形处理单元(GPU)等。更具体地，处理器可以是复杂指令集计算(CISC)微处理器、精简指令集计算(RISC) 微处理器、超长指令字(VLIW)微处理器、运行其他指令集的处理器或运行指令集的组合的处理器。处理器还可以是一个以上专用处理设备，诸如专用集成电路(ASIC)、现场可编程门阵列(FPGA)、数字信号处理器(DSP)、片上系统(SoC)等。It should be noted that the processor may be a processing device including more than one general-purpose processing device, such as a microprocessor, a central processing unit (CPU), a graphics processing unit (GPU), and the like. More specifically, the processor may be a complex instruction set computing (CISC) microprocessor, a reduced instruction set computing (RISC) microprocessor, a very long instruction word (VLIW) microprocessor, a processor running other A processor with a combination of instruction sets. A processor may also be one or more special-purpose processing devices, such as an application specific integrated circuit (ASIC), field programmable gate array (FPGA), digital signal processor (DSP), system on a chip (SoC), or the like.

处理器可以通信地耦合到存储器并且被配置为执行存储在其上的计算机可执行指令，以执行根据本发明各个实施例的文本数据的描述符识别方法。A processor may be communicatively coupled to the memory and configured to execute computer-executable instructions stored thereon to perform a method of descriptor recognition of textual data in accordance with various embodiments of the present invention.

在一些实施例中，所述处理器被进一步配置为：将文本数据分隔成至少一个句子序列，并将各个句子序列分隔成单独的标记，基于预设的实体标签，对各个标记进行标注，所述预设的实体标签用于定义描述符；随机掩码所述句子序列中的部分单词，并通过学习到的上下文语义关系预测被掩码的单词，以实现对文本数据的增强；通过增强后的文本数据对所述识别模型进行训练。In some embodiments, the processor is further configured to: separate the text data into at least one sentence sequence, separate each sentence sequence into individual tokens, and annotate each token based on a preset entity label, where The preset entity labels described above are used to define descriptors; some words in the sentence sequence are randomly masked, and the masked words are predicted through the learned contextual semantic relations to realize the enhancement of text data; to train the recognition model.

在一些实施例中，所述处理器被进一步配置为：在所述知识库中列出粗粒度描述符列表:D＝[D₁，D₂，...D_n]，并列出与所述描述符对应的语句列表: S＝[S₁，S₂，...S_n]；选择描述符，然后创建一个临时队列，并将所述描述符放入所述临时队列中，将粗粒度描述符和语句从对应的粗粒度描述符列表和语句列表中取出，将粗粒度描述符与语句共同出现的描述符添加至临时队列中，在临时队列不为空队列的情况下，将临时队列中的头元素退出所述临时队列并赋值给性能驱动的高质量描述符集合，得到性能驱动的高质量描述符集合；通过如下公式(18)计算所述性能驱动的高质量描述符集合中的描述符在对应句子序列中的重要性：In some embodiments, the processor is further configured to: list in the knowledge base a list of coarse-_grained descriptors: D=[D₁ , D₂ , . . . The statement list corresponding to the descriptor: S=[S₁ , S₂ ,...S_n ]; select the descriptor, then create a temporary queue, and put the descriptor into the temporary queue, The granular descriptors and statements are taken out from the corresponding coarse-grained descriptor list and statement list, and the descriptors that co-occur with the coarse-grained descriptors and statements are added to the temporary queue. If the temporary queue is not empty, the temporary The head element in the queue exits the temporary queue and is assigned to the performance-driven high-quality descriptor set to obtain the performance-driven high-quality descriptor set; Calculate the performance-driven high-quality descriptor set by the following formula (18). The importance of the descriptor in the corresponding sentence sequence:

基于描述符重要度的阈值筛选出性能驱动的高质量描述符。Performance-driven high-quality descriptors are filtered out based on a threshold of descriptor importance.

在一些实施例中，所述处理器被进一步配置为：基于描述符重要度的阈值，通过如下公式(19)筛选出性能驱动的高质量描述符；In some embodiments, the processor is further configured to: filter out performance-driven high-quality descriptors by the following formula (19) based on a threshold of descriptor importance;

在一些实施例中，所述处理器被进一步配置为：通过正则表达式匹配方式去除文本信息中的无效数据，所述无效数据包括乱码、图片；在出现字符乱码的情况下，将导致出现乱码的字符转换为特殊符号标记。In some embodiments, the processor is further configured to: remove invalid data in the text information through regular expression matching, where the invalid data includes garbled characters and pictures; in the case of garbled characters, garbled characters will appear characters are converted to special symbol tokens.

本发明实施例提供一种存储有指令的非暂时性计算机可读存储介质，当所述指令由处理器执行时，执行根据本发明各个实施例的文本数据的描述符识别方法。Embodiments of the present invention provide a non-transitory computer-readable storage medium storing instructions, and when the instructions are executed by a processor, a method for identifying descriptors of text data according to various embodiments of the present invention is performed.

下面本发明实施例将结合具体的应用示例以进一步说明本发明的可行性和进步性。The following embodiments of the present invention will be combined with specific application examples to further illustrate the feasibility and progress of the present invention.

本发明实施例拟采用如下步骤对文本数据进行处理：1)采用数据处理器对文本数据进行预处理；2)通过粗粒度描述符识别器对实体进行筛选；3) 采用细粒度描述符识别器根据需要对实体进行进一步筛选。本方法流程如图 4所示，详细示意图如图5所示。The embodiment of the present invention intends to use the following steps to process the text data: 1) using a data processor to preprocess the text data; 2) using a coarse-grained descriptor identifier to screen entities; 3) using a fine-grained descriptor identifier Entities are further filtered as needed. The process flow of this method is shown in Figure 4, and the detailed schematic diagram is shown in Figure 5.

1)采用数据处理器对文本数据进行预处理1) Use a data processor to preprocess the text data

通过晶体信息文件(Crystallographic Information File,CIF)，收集了55篇适合描述符挖掘的语料库源的NASICON材料科学文献。这些文献的全文信息(包括标题、作者、摘要、关键字、机构、出版商和出版年份)然后存储为单独的文档，通过PDF解析(Python工具包)提取。最终得到的NASICON NER 数据集包含65690个数据，2434个句子，6036个单词。之后，将对这些文档执行预处理。Through the Crystallographic Information File (CIF), 55 NASICON materials science literatures of corpus sources suitable for descriptor mining were collected. Full-text information for these documents (including title, author, abstract, keywords, institution, publisher, and publication year) was then stored as a separate document, which was extracted by PDF parsing (Python toolkit). The final NASICON NER dataset contains 65690 data, 2434 sentences, and 6036 words. After that, preprocessing will be performed on these documents.

①文本清理①Text cleaning

因为从PDF转换的文档的文本信息有很多无效数据，如乱码、文本之外的其他信息等，我们通过正则表达式匹配来删除它们。这些特殊符号还可能出现字符乱码；但是，我们不能直接删除它们，因为有些可能是有用的信息，如化学单位；因此，在接下来的步骤中，我们将所有这些符号转换为特殊的 token<sYm>。这样，我们可以从PDF文献中得到相对干净的文档。Because the text information of documents converted from PDF has a lot of invalid data, such as garbled characters, other information other than text, etc., we remove them by regular expression matching. These special symbols may also have garbled characters; however, we cannot delete them directly, because some may be useful information, such as chemical units; therefore, in the next step, we convert all these symbols to the special token<sYm >. In this way, we can get relatively clean documents from PDF documents.

②对文本数据进行分词分句以及标注② Segmentation and labeling of text data

对于这项工作，我们首先通过ChemDataExtractor对清理后的文档进行分词分句操作，这涉及到将原始文本数据分割成句子，然后将每个句子分割成单独的标记。然后，为了标注这些标记，定义了描述符的8个实体标签，包括:Composition,Structure,Property,Processing,Characterization,Application, Feature and Condition，这些实体标签能够概括材料描述符的大部分信息。表 1给出了每个标记的定义和示例。For this work, we first perform word segmentation on the cleaned documents via ChemDataExtractor, which involves segmenting the raw text data into sentences, and then segmenting each sentence into individual tokens. Then, in order to label these tags, eight entity tags of the descriptor are defined, including: Composition, Structure, Property, Processing, Characterization, Application, Feature and Condition, these entity tags can summarize most of the information of the material descriptor. Table 1 gives definitions and examples of each marker.

表1.材料领域的8种描述符实体类型定义Table 1. 8 Descriptor Entity Type Definitions for Materials Domain

使用上面描述的标记方案，55个材料科学文献被手工注释。此外，在标注时，采用由内而外开始(inside-outside-beginning,IBO)的格式进行标注，该格式可以解释“活化能”等多词实体的情况。在这种方法中，token可以在 beginning(B)、inside(I)或outside(O)表示特殊的标签。例如，NASICON文献中的一句话“The ionic conductivitydecreases with increasing activation energy”将标注为(token；IOB-label)对按以下方式进行:(The；O),(ionic； B-Property),(conductivity；I-Property),(decreases；O),(with；O),(increasing；O), (activation；B-Property),(energy；I-Property)。Using the labeling scheme described above, 55 materials science articles were hand-annotated. In addition, when labeling, an inside-outside-beginning (IBO) format is used for labeling, which can explain the situation of multi-word entities such as "activation energy". In this approach, tokens can represent special tags at beginning(B), inside(I), or outside(O). For example, the sentence "The ionic conductivity decreases with increasing activation energy" in the NASICON literature will be labeled as (token; IOB-label) pair in the following way: (The; O), (ionic; B-Property), (conductivity; I-Property), (decreases; O), (with; O), (increasing; O), (activation; B-Property), (energy; I-Property).

③数据增强③Data enhancement

训练监督NER模型需要大量的标注数据，标注费时费力。在本方法中，针对NER数据不足的问题，提出了一种融合材料领域知识的条件数据增强方法(cDA-DK)，如图6所示。Training a supervised NER model requires a large amount of labeled data, which is time-consuming and labor-intensive. In this method, for the problem of insufficient NER data, a conditional data augmentation method (cDA-DK) that integrates material domain knowledge is proposed, as shown in Figure 6.

由于数据分析方法经常受到噪声的影响，因此为了在尽可能生成高质量数据的同时减少噪声的影响，我们将材料文本和标签约束等材料领域知识引入并作为预训练DistilRoBERTa模型的输入。如图2所示，我们对用于大规模扩充数据的DistilRoBERTa模型进行了微调。实际上，增强的数据是由DistilRoBERTa模型的掩码语言模型(masklanguage model,MLM)生成的，该模型可以随机掩码句子中的一些单词，然后通过学习到的上下文语义关系预测掩码单词。Since data analysis methods are often affected by noise, in order to generate high-quality data as much as possible while reducing the effect of noise, we introduce material domain knowledge such as material text and label constraints as input to the pre-trained DistilRoBERTa model. As shown in Figure 2, we fine-tune the DistilRoBERTa model for massively augmented data. In fact, the augmented data is generated by the mask language model (MLM) of the DistilRoBERTa model, which can randomly mask some words in a sentence and then predict the masked words through the learned contextual semantic relations.

例如，给定一个输入句子“The ionic conductivity decreased with increasedactivation energy”，该句子中的两个单词被屏蔽，变成“The<mask> conductivitydecreased with increasing<mask>energy”。然后，通过微调的 DistilRoBERTa模型预测和填充<mask>单词。最后，生成出句子“The electrode conductivity decreases withincreasing electric energy”，如表2所示。表2中，使用斜体加粗突出显示单词变化。For example, given an input sentence "The ionic conductivity decreased with increased activation energy", two words in the sentence are masked to become "The <mask> conductivity decreased with increasing <mask> energy". Then, the <mask> words are predicted and populated by the fine-tuned DistilRoBERTa model. Finally, the sentence "The conductivity decreases withincreasing electric energy" is generated, as shown in Table 2. In Table 2, word variations are highlighted using bold italics.

表2.初始训练和增强实例Table 2. Initial training and augmentation examples

2)通过粗粒度描述符识别器对实体进行筛选2) Entities are filtered by coarse-grained descriptor identifiers

对于这项工作，目的是训练NER模型，以这样一种方式，材料科学知识被编码；例如，我们希望让计算机知道，单词“activation energy”和“ionic conductivity”代表物质性质的描述符，而“tetrahedra”和“polyhedra”代表物质结构的描述符。因此，我们设计了CGDR，该模型构建了一个NER模型(MatBERT-BiLSTM-CRF)，用于从材料科学文献中识别不同类别的粗粒度描述符。有三个主要类型的信息,可以用来让模型识别哪些词或短语对应于一个特定的描述符类型:①基于MatBERT的词表征；②基于BiLSTM的语句上下文特征提取；③CRF-based描述符分类(如图7所示)。For this work, the aim is to train NER models in such a way that material science knowledge is encoded; for example, we want to let the computer know that the words "activation energy" and "ionic conductivity" represent descriptors of properties of matter, while " "tetrahedra" and "polyhedra" represent descriptors of the structure of matter. Therefore, we design CGDR, which constructs a NER model (MatBERT-BiLSTM-CRF) for identifying different classes of coarse-grained descriptors from materials science literature. There are three main types of information that can be used to allow the model to identify which words or phrases correspond to a particular descriptor type: 1) MatBERT-based word representation; 2) BiLSTM-based sentence context feature extraction; 3) CRF-based descriptor classification (eg. Figure 7).

①基于MatBERT的词表征①Word representation based on MatBERT

基于MatBERT的词表征，设计并利用MatBERT模型获得单词和句子的向量表示。如图7所示，MatBERT模型来源于Transformer(BERT)模型中预训练的双向编码器表示(Bidirectional Encoder Representation)，该模型需要与材料文献文本数据进行微调。通过对材料语篇的分析，我们发现意义相同或相近的词在不同的语境中可能表达出不同的意义。例如，英文的“Bottleneck” 不仅有“有限”的意思，还可以表明材料晶体的结构信息，这表明语境非常重要。因此，在对材料的复杂文本进行编码时，有必要考虑语境信息。然而，使用Mikolov等人(2013)的Word2vec方法生成的词嵌入是上下文无关的(静态嵌入)，并且不具有复杂的特征(如语法、语义)。总之，Word2vec 方法的缺点是不利于计算机充分理解材料词汇，从而影响描述符提取的准确性。本文采用MatBERT方法对材料文本进行编码，因为该方法能够充分捕捉词的上下文信息(即词嵌入、段嵌入和位置嵌入)，从而得到语义信息更丰富的向量表示。Based on the word representation of MatBERT, we design and use the MatBERT model to obtain the vector representation of words and sentences. As shown in Figure 7, the MatBERT model is derived from the pre-trained Bidirectional Encoder Representation in the Transformer (BERT) model, which needs to be fine-tuned with the material literature text data. Through the analysis of the material discourse, we find that words with the same or similar meanings may express different meanings in different contexts. For example, the English word "Bottleneck" not only means "limited", but can also indicate the structural information of the material crystal, which shows that the context is very important. Therefore, it is necessary to consider contextual information when encoding complex texts of materials. However, word embeddings generated using the Word2vec method of Mikolov et al. (2013) are context-free (static embeddings) and do not have complex features (e.g. syntax, semantics). In conclusion, the disadvantage of Word2vec method is that it is not conducive to the computer to fully understand the material vocabulary, thus affecting the accuracy of descriptor extraction. In this paper, the MatBERT method is used to encode the material text, because this method can fully capture the contextual information of words (i.e. word embedding, segment embedding and position embedding), so as to obtain a vector representation with richer semantic information.

具体来说，本文给出了一个句子序列。马特伯特使用微调参数机制。将输入序列设置为w＝([CLS],w₁,w₂,...,w_n,[SEP])，其中[CLS]表示样本句子序列的开头，[SEP]表示句子之间的间隔符号。它们都用于句子级训练任务。每个单词的向量表示由三部分组成:词嵌入向量(Word Embedding Vectyor),句嵌入向量(Sentence Embedding Vector)和位置嵌入向量(Position Embedding Vector), 其定义分别为

在这里，词嵌入向量由MatBERT提供的词汇决定，由于训练样例为一条语句，所以将句嵌入向量设为0。将三个嵌入向量相加，得到词特征作为MatBERT的输入，如图5所示。通过训练输入单词向量，最终的单词向量表示如公式(13)所示，作为BiLSTM的输入。Specifically, this paper presents a sequence of sentences. Matt Burt uses a fine-tuning parameter mechanism. Set the input sequence as w=([CLS],w₁ ,w₂ ,...,w_n ,[SEP]), where [CLS] represents the beginning of the sample sentence sequence and [SEP] represents the interval between sentences symbol. They are both used for sentence-level training tasks. The vector representation of each word consists of three parts: Word Embedding Vectyor, Sentence Embedding Vector and Position Embedding Vector, which are defined as

Here, the word embedding vector is determined by the vocabulary provided by MatBERT. Since the training example is a sentence, the sentence embedding vector is set to 0. Add the three embedding vectors to get the word features as the input of MatBERT, as shown in Figure 5. By training the input word vectors, the final word vector representation is shown in Equation (13) as the input of BiLSTM.

x＝[x₁,x₂,...,x_n] 公式(13)x=[x₁ ,x₂ ,...,x_n ] Equation (13)

②基于BiLSTM的语句上下文特征提取②Sentence context feature extraction based on BiLSTM

采用BiLSTM模型对材料文本的上下文特征进行捕获。NER是一个序列标注问题，属于标记级分类任务，即句子中的每个单词都需要进行分类。因此，有必要考虑句子中每个词的当地语境。例如，在句子“The overall_____ are near to 10-5S/cm-1at 200℃”中，很明显，缺少的单词是conductivity(一个属性类的粗粒度描述符)。虽然MatBERT引入的位置信息补齐了局部上下文信息，但在进行微调时，MatBERT的自注意机制削弱了位置信息。因此，采用RNN来解决上述问题，RNN具有捕获时序信息进行序列到序列分类的能力。然而，RNN在时间信息传播过程中经常存在梯度消失和梯度爆炸的问题，我们使用了RNN的一种变体——长短期记忆(LSTM)。LSTM引入了三个门单元，分别是输入门(InputGate)、遗忘门(Forget Gate)和输出门(Output Gate)，如图8所示。门(Gate)结构能够选择保存上下文信息，解决RNN 的上述问题。因此，在捕获远程依赖关系方面，LSTM优于RNN。The BiLSTM model is used to capture the contextual features of the material text. NER is a sequence labeling problem, which belongs to the label-level classification task, that is, every word in a sentence needs to be classified. Therefore, it is necessary to consider the local context of each word in the sentence. For example, in the sentence "The overall_____ are near to 10-5S/cm-1at 200℃", it is clear that the missing word is conductivity (a coarse-grained descriptor of an attribute class). Although the location information introduced by MatBERT complements the local context information, when fine-tuning, the self-attention mechanism of MatBERT weakens the location information. Therefore, RNN is adopted to solve the above problems, and RNN has the ability to capture time series information for sequence-to-sequence classification. However, RNNs often suffer from vanishing and exploding gradients during temporal information propagation, and we use a variant of RNNs, Long Short-Term Memory (LSTM). LSTM introduces three gate units, namely Input Gate, Forget Gate and Output Gate, as shown in Figure 8. The gate structure can choose to save the context information to solve the above problems of RNN. Therefore, LSTM outperforms RNN in capturing long-range dependencies.

其参数设置如表3所示：Its parameter settings are shown in Table 3:

表3 Bi-LSTM参数设置Table 3 Bi-LSTM parameter settings

参数名称parameter name参数值parameter value词向量维度word vector dimension768768LSTM单元维度LSTM cell dimension128128Dropout ratedropout rate0.10.1Learning rateLearning rate0.000030.00003OptimizerOptimizerAdamWAdamWBatch sizeBatch size3232Early stopping patienceEarly stopping patience33Max sentence lengthMax sentence length7575Tag schemaTag schemaBIOBIO

③CRF-based描述符分类③CRF-based descriptor classification

CRF能够通过学习标签之间的依赖关系来预测最优的标签序列，实现更准确的实体分类。同时，作为序列标记问题的分类器，CRF还能够捕获输出标签之间的强依赖性，得到最优的标签序列。由于句子中每个词的实体标签需要通过分类器来预测，而相邻的实体标签之间往往存在一定的迁移关系。因此，考虑邻域内标签之间的相关性，对给定的输入句子解码最佳标签链是很有用的。因此，NER的分类器层采用的是CRF而不是传统的Softmax层。CRF can predict the optimal label sequence by learning the dependencies between labels and achieve more accurate entity classification. At the same time, as a classifier for sequence labeling problems, CRF can also capture strong dependencies between output labels and obtain optimal label sequences. Since the entity label of each word in the sentence needs to be predicted by the classifier, there is often a certain transfer relationship between adjacent entity labels. Therefore, it is useful to decode the optimal chain of labels for a given input sentence, considering the correlation between labels within the neighborhood. Therefore, the classifier layer of NER adopts CRF instead of traditional Softmax layer.

具体来说，w＝(w₁,w₂,K,w_n)表示为一个一般的输入序列。其中，w_i是第i 个单词的输入向量，y＝(y₁,y₂,K,y_n)表示与输入向量对应的标签序列。CRF 模型计算公式的评价得分如公式(14)所示，其中T为迁移矩阵，

为y_i迁移到y_i+1的概率分数，

为第i个单词被标注为y_i的概率分数。p(y|S)表示语句S被标记为标签序列y的概率，由公式(15)计算，其中y％为真标签。此外，训练过程中标签序列的似然函数如公式(16)所示，其中Y_W表示所有可能标注的集合。需要注意的是，通过似然函数可以得到有效的输出序列。最后，通过公式(17)可以计算出一组总概率得分最大的序列。Specifically, w=(w₁ , w₂ , K,_wn ) is represented as a general input sequence. Among them,_wi is the input vector of the ith word, and y=(y₁ , y₂ , K, y_n ) represents the label sequence corresponding to the input vector. The evaluation score of the CRF model calculation formula is shown in formula (14), where T is the migration matrix,

is the probability score for y_i to migrate to y_i+1 ,

The probability score that the_ith word is marked as yi. p(y|S) represents the probability that sentence S is labeled as label sequence y, calculated by Equation (15), where y% is the true label. In addition, the likelihood function of the label sequence during training is shown in Equation (16), where Y_W represents the set of all possible labels. It should be noted that a valid output sequence can be obtained by the likelihood function. Finally, a set of sequences with the largest total probability score can be calculated by formula (17).

综上所述，可以使用CGDR的NER模型识别材料文本中的粗粒度描述符，然后构建一个知识库来存储它们，如图9所示。此外，还实现了在知识库中动态添加描述符及其对应的句子。在图9中，“activation energy”、 “occupancy”、“safety”是性质类的描述符，“conduction channels”、 “bottleneck”、“rhombohedral symmetry”是结构类的描述符。此外，每个描述符后面紧跟着它出现的相应句子。In summary, the NER model of CGDR can be used to identify coarse-grained descriptors in material text, and then build a knowledge base to store them, as shown in Figure 9. In addition, it is also implemented to dynamically add descriptors and their corresponding sentences in the knowledge base. In Fig. 9, "activation energy", "occupancy", and "safety" are descriptors of the property class, and "conduction channels", "bottleneck", and "rhombohedral symmetry" are the descriptors of the structure class. In addition, each descriptor is immediately followed by the corresponding sentence in which it appears.

利用CGDR训练后的NER模型，可以从材料科学文献中准确提取描述符信息。NER模型的性能如图10所示。从图中可以看出，语句“For those NASICON materials which showa phase transition,the activation energy differed at low temperature(LT)andhigh temperature(HT).”输入到训练后的NER模型中，能够对句子中的每个单词进行分类。结果表明:“NASICON materials” 是特征类的描述符，“phase transition”和“activation energy”属于属性类，“low temperature”和“high temperature”属于条件类。由此看出，模型能够准确地识别文本中的描述符信息。Using the CGDR-trained NER model, the descriptor information can be accurately extracted from the materials science literature. The performance of the NER model is shown in Figure 10. As can be seen from the figure, the sentence "For those NASICON materials which showa phase transition, the activation energy differed at low temperature(LT) and high temperature(HT)." words are classified. The results show that: "NASICON materials" is the descriptor of the feature class, "phase transition" and "activation energy" belong to the attribute class, and "low temperature" and "high temperature" belong to the condition class. It can be seen that the model can accurately identify the descriptor information in the text.

3)采用细粒度描述符识别器根据需要对实体进行进一步筛选3) Use fine-grained descriptor recognizers to further filter entities as needed

知识库中有大量不同类别的粗粒度描述符，用于材料性质预测或新材料发现。然而，如果我们完全手工筛选相关描述符，工作量不亚于从文献中进行选择。此外，知识库中粗粒度描述符的质量也是筛选过程中的一个重要因素。因此，FGDR被设计用来快速筛选高质量的描述符，这些描述符与研究的目标材料性质相关。注意，FGDR结合了性能驱动和重要性计算，这有助于研究人员构建描述符的样本数据集。然后，利用该数据集，可以利用ML模型进行结构-活动关系的研究。There are a large number of different categories of coarse-grained descriptors in the knowledge base for material property prediction or new material discovery. However, if we screen the relevant descriptors completely by hand, the workload is no less than selecting from the literature. Furthermore, the quality of coarse-grained descriptors in the knowledge base is also an important factor in the screening process. Therefore, FGDR is designed to rapidly screen high-quality descriptors that correlate with the target material properties of the study. Note that FGDR combines performance-driven and importance computations, which help researchers construct a sample dataset of descriptors. Then, with this dataset, the study of structure-activity relationships can be carried out using ML models.

性能驱动的具体过程如下。首先，输入一个描述符D_seed，这是要研究的目标材料性质；然后创建一个临时队列Q，并将D_seed放入队列。只要Q不为空，Q的头元素就会退出队列并赋值给D_current进行循环。此外，d_i和s_i也从D 和S的对应列表中取出。注意，d_i和D_current在这里是相同的描述符。然后，s_i中与d_i共出现的描述符w_j被添加到Q和D_associate中。最后，循环终止，直到Q中没有元素，并且得到性能驱动的描述符集合D_associate。The specific process of performance driving is as follows. First, a descriptor D_seed is input, which is the target material property to be studied; then a temporary queue Q is created, and D_seed is put into the queue. As long as Q is not empty, the head element of Q is dequeued and assigned to D_current to cycle. In addition_{, di and si}_are also taken from the corresponding lists of D and S. Note that_di and D_current are the same descriptor here. Then, the descriptor w_j that co-occurs with d_i in_si is added to Q and D_associate . Finally, the loop terminates until there are no elements in Q and the performance-driven descriptor set D_associate is obtained.

通过计算重要性筛选高质量的描述符，如图11所示。为了计算每个描述符在对应句子中的重要性，我们将MatBERT最后一层的词向量和句向量进行内积，然后用Softmax函数对前一步的结果进行归一化，得到最后的重要度。归一化的计算公式如公式(18)所示，其中I_i表示第i个词的重要性，E_i为MatBERT输出的第i个词的嵌入向量，S_[CLS]为相应的句子嵌入向量。在此之后，我们将为描述符筛选设置一个阈值。如公式(19)所示，其中true表示保留的描述符，false表示删除的描述符，T是描述符重要度的阈值。这里的 MatBERT模型与CGDR中的MatBERT相同，只是后者不需要输出单词和对应句子的向量，而是提供给下游模型进行进一步的特征提取。High-quality descriptors are screened by computational importance, as shown in Figure 11. In order to calculate the importance of each descriptor in the corresponding sentence, we take the inner product of the word vector and the sentence vector in the last layer of MatBERT, and then use the Softmax function to normalize the results of the previous step to get the final importance. The normalized calculation formula is shown in formula (18), where I_i represents the importance of the ith word, E_i is the embedding vector of the ith word output by MatBERT, and S_[CLS] is the corresponding sentence embedding vector . After this, we will set a threshold for descriptor filtering. As shown in formula (19), where true represents the reserved descriptor, false represents the deleted descriptor, and T is the threshold of the importance of the descriptor. The MatBERT model here is the same as the MatBERT in CGDR, except that the latter does not need to output the vectors of words and corresponding sentences, but provides it to the downstream model for further feature extraction.

FGDR可以在一定程度上准确筛选相应上下文中性能驱动的高质量描述符。以“activation energies”为例来筛选高质量的相关描述符，其有效性如图 12所示。从图中可以看出，在知识库中的语句“The calculated potential barriers are in goodagreement with the activation energies obtained from ac measurements ofpolycrystalline samples.”通过检索与“activation energies”共现的描述符来获得。利用MatBERT-BiLSTM-CRF模型识别句子的其他描述符(“potential barriers”和“polycrystalline samples”)。通过FGDR，计算出这个句子中每个单词的重要性。然而，我们关注的是描述符(即“potential barriers”、“activation energies”和“polycrystalline samples”)。最后，可以发现前两个描述符在该上下文中的重要性大于阈值，而后一个描述符的重要性小于阈值。结果表明，FGDR能够有效地能够筛选出高质量的性能驱动描述符。FGDR can screen the performance-driven high-quality descriptors in the corresponding context accurately to a certain extent. Taking "activation energies" as an example to filter high-quality related descriptors, its effectiveness is shown in Figure 12. As can be seen from the figure, the statement "The calculated potential barriers are in good agreement with the activation energies obtained from ac measurements of polycrystalline samples." in the knowledge base is obtained by retrieving descriptors that co-occur with "activation energies". Other descriptors of sentences (“potential barriers” and “polycrystalline samples”) were identified using the MatBERT-BiLSTM-CRF model. Through FGDR, calculate the importance of each word in this sentence. However, we focus on descriptors (i.e. "potential barriers", "activation energies" and "polycrystalline samples"). Finally, it can be found that the importance of the first two descriptors in this context is greater than the threshold, while the importance of the latter descriptor is less than the threshold. The results show that FGDR can effectively filter out high-quality performance-driven descriptors.

针对现有技术的缺点，本发明旨在从粗粒度和细粒度两个维度从材料科学文献中挖掘出可用的高质量描述符。本发明还综合考虑了材料科学领域在 NER中存在的问题，解决了传统方法长距离依赖的能力较弱，并且要联合外部知识和大量人工参与来提取和处理特征的问题。通过用材料科学文献对 BERT模型进行预训练，使其能够在较短的训练次数下达到最佳的效果。此外，我们还采用预训练的BERT对实体进行数据增强，解决了材料科学的命名实体数量不足的问题。同时，本方法通过引入领域知识，实现对描述符精准筛选，从而获得满足使用者需求的高质量描述符。In view of the shortcomings of the prior art, the present invention aims to mine available high-quality descriptors from the materials science literature from both coarse-grained and fine-grained dimensions. The invention also comprehensively considers the problems existing in NER in the field of material science, and solves the problem that the traditional method has weak long-distance dependence, and needs to combine external knowledge and a large number of manual participation to extract and process features. By pre-training the BERT model with materials science literature, it is able to achieve the best results with shorter training epochs. In addition, we also employ pre-trained BERT for entity data augmentation, addressing the insufficient number of named entities in materials science. At the same time, by introducing domain knowledge, this method realizes precise screening of descriptors, so as to obtain high-quality descriptors that meet the needs of users.

表4为CGDR对于8个命名实体分类的性能，其中F1-Score为准确率P (precision)与召回率R(recall)的调和均值。从表中我们可以看到,我们的模型的整体F1-score是0.87,其性能已经与最新的NER模型(2018)相近(0.92)。该模型训练和评估均在手动标注且仅有三个实体标签的新闻文章上进行。然而，由于数据集不同，因此无法直接通过性能指标的数值来比较模型之间的性能。值得注意的是，我们的模型是在更多的实体标签和更复杂的文本进行训练和评估。CGCR模型在Composition类别上达到了最高的F1-Score(0.94)，而在Application类别的F1-Score仅有0.58，这可能是由于该类别的训练数据较少，模型未能充分捕获该类实体与其标签之间的依赖关系。其他实体类别的F1-score在0.80以上，说明该模型在识别不同类别的描述符方面表现良好。Table 4 shows the performance of CGDR for 8 named entity classification, where F1-Score is the harmonic mean of the precision rate P (precision) and the recall rate R (recall). From the table we can see that the overall F1-score of our model is 0.87, and its performance is already close (0.92) to the state-of-the-art NER model (2018). The model is trained and evaluated on manually annotated news articles with only three entity labels. However, since the datasets are different, it is not possible to directly compare the performance between the models through the numerical values of the performance metrics. Notably, our model is trained and evaluated on more entity labels and more complex text. The CGCR model achieves the highest F1-Score (0.94) in the Composition category, while the F1-Score in the Application category is only 0.58, which may be due to the lack of training data in this category, and the model fails to fully capture this type of entities and their differences. Dependencies between tags. The F1-score of other entity categories is above 0.80, indicating that the model performs well in identifying descriptors of different categories.

表4 NER对8种实体分类的综合表现Table 4. Comprehensive performance of NER for 8 entity classifications

与基准模型(BiLSTM-CNNs-CRF)的性能对比,本方法的F1-score为0.87，其性能提高了16％，其结果如表5所示，进一步验证了CGDR模型的有效性，适合于材料领域描述符的自动识别。Compared with the performance of the benchmark model (BiLSTM-CNNs-CRF), the F1-score of this method is 0.87, and its performance is improved by 16%. The results are shown in Table 5, which further verifies the effectiveness of the CGDR model, which is suitable for materials Automatic recognition of domain descriptors.

表5模型结果对比Table 5 Comparison of model results

ModelModelPrecisionPrecisionRecallRecallF1-scoreF1-scoreBiLSTM-CNNs-CRFBiLSTM-CNNs-CRF0.740.740.690.690.710.71DA+MatBERT-BiLSTM-CRFDA+MatBERT-BiLSTM-CRF0.860.860.870.870.870.87

此外，尽管已经在本文中描述了示例性实施例，其范围包括任何和所有基于本发明的具有等同元件、修改、省略、组合(例如，各种实施例交叉的方案)、改编或改变的实施例。权利要求书中的元件将被基于权利要求中采用的语言宽泛地解释，并不限于在本说明书中或本申请的实施期间所描述的示例，其示例将被解释为非排他性的。因此，本说明书和示例旨在仅被认为是示例，真正的范围和精神由以下权利要求以及其等同物的全部范围所指示。Furthermore, although exemplary embodiments have been described herein, the scope includes any and all implementations of the present invention with equivalent elements, modifications, omissions, combinations (eg, where various embodiments intersect), adaptations, or alterations example. Elements in the claims are to be construed broadly based on the language employed in the claims, and are not to be limited to the examples described in this specification or during the practice of this application, the examples of which are to be construed as non-exclusive. Therefore, this specification and examples are intended to be regarded as examples only, with the true scope and spirit being indicated by the following claims along with their full scope of equivalents.

以上描述旨在是说明性的而不是限制性的。例如，上述示例(或其一个或更多方案)可以彼此组合使用。例如本领域普通技术人员在阅读上述描述时可以使用其它实施例。另外，在上述具体实施方式中，各种特征可以被分组在一起以简单化本发明。这不应解释为一种不要求保护的发明的特征对于任一权利要求是必要的意图。相反，本发明的主题可以少于特定的发明的实施例的全部特征。从而，以下权利要求书作为示例或实施例在此并入具体实施方式中，其中每个权利要求独立地作为单独的实施例，并且考虑这些实施例可以以各种组合或排列彼此组合。本发明的范围应参照所附权利要求以及这些权利要求赋权的等同形式的全部范围来确定。The above description is intended to be illustrative and not restrictive. For example, the above examples (or one or more of them) may be used in combination with each other. For example, other embodiments may be utilized by those of ordinary skill in the art upon reading the above description. Additionally, in the foregoing Detailed Description, various features may be grouped together to simplify the present invention. This should not be construed as an intention that a feature of an unclaimed invention is essential to any claim. Rather, inventive subject matter may lie in less than all features of a particular inventive embodiment. Thus, the following claims are hereby incorporated into the Detailed Description by way of example or embodiment, with each claim standing on its own as a separate embodiment, and it is contemplated that these embodiments may be combined with each other in various combinations or permutations. The scope of the invention should be determined with reference to the appended claims, along with the full scope of equivalents to which such claims are entitled.

Claims

1. A method for descriptor identification of text data, the method comprising:

and recognizing the descriptors of the text data by using the trained recognition model through the following method:

determining an input sequence w ═ (w) based on the text data₁ ,w₂ ,...,w_n ) And (y) a tag sequence y corresponding to the feature vector₁ ,y₂ ,...,y_n ) Wherein w is_n Is the feature vector of the nth word;

calculating a group of tag sequences with the maximum total probability score through the following formulas (14) to (17):

where score (W, y) is the evaluation score for all input sequences, T is the transition matrix,

is y_i Migration to y_i+1 The probability score of (a) is determined,

labeled y for the ith word_i P (y | S) represents the probability that the sentence S is labeled as the tag sequence y,

for true tags, equation (16) represents the likelihood function of the tag sequence during training, Y_W Representing a set of all possible annotations, y^* Representing a set of tag sequences with the highest total probability score;

determining a coarse-grained descriptor based on the tag sequence with the maximum total probability score;

dynamically adding coarse-grained descriptors and sentence sequences corresponding to the coarse-grained descriptors to construct a knowledge base;

based on the coarse-grained descriptors in the knowledge base, the performance-driven high-quality descriptors are screened out according to the principle that the descriptors commonly appear in the same sentence and the importance of each coarse-grained descriptor in the corresponding sentence sequence.

2. The method of claim 1, wherein prior to recognizing the descriptor of the text data using the trained recognition model, the method further comprises:

dividing text data into at least one sentence sequence, dividing each sentence sequence into separate marks, and marking each mark based on a preset entity label, wherein the preset entity label is used for defining a descriptor;

randomly masking partial words in the sentence sequence, and predicting the masked words through the learned context semantic relationship so as to realize the enhancement of text data;

and training the recognition model through the enhanced text data.

3. The method of claim 2, wherein screening performance-driven high-quality descriptors based on coarse-grained descriptors in the knowledge base by the principle that descriptors co-occur in the same sentence and the importance of each coarse-grained descriptor in the corresponding sentence sequence comprises:

listing a coarse-grained descriptor list in the knowledge base, D ═ D₁ ，D₂ ，...D_n ]And listing a sentence list corresponding to the descriptor, wherein s is ═ s₁ ，s₂ ，...s_n ]；

Selecting a descriptor, then creating a temporary queue, putting the descriptor into the temporary queue, taking out a coarse-grained descriptor and a statement from a corresponding coarse-grained descriptor list and a statement list, adding the descriptor in which the coarse-grained descriptor and the statement appear together into the temporary queue, and under the condition that the temporary queue is not an empty queue, withdrawing a head element in the temporary queue from the temporary queue and assigning the head element to a performance-driven high-quality descriptor set to obtain the performance-driven high-quality descriptor set;

the importance of the descriptors in the performance-driven high-quality descriptor set in the corresponding sentence sequence is calculated by the following formula (18):

wherein, I_i Representing the weight of the ith wordEssential, E_i Embedding vector for the ith word, S_[CLS] Embedding vectors for the corresponding sentences;

performance driven descriptors are screened out based on a threshold of descriptor importance.

4. The method of claim 3, wherein high quality descriptors are filtered out based on a threshold of descriptor importance by the following formula (19);

wherein D is_i Representing a performance driven high quality descriptor set, T being a threshold for descriptor importance, true representing descriptors retained in the performance driven high quality descriptor set, false representing descriptors deleted in the performance driven high quality descriptor set.

5. The method of claim 2, wherein prior to separating the text data into at least one sequence of sentences, the method further comprises cleaning up the text information to obtain text data;

the step of cleaning the text information to obtain text data comprises:

removing invalid data in the text information in a regular expression matching mode, wherein the invalid data comprises messy codes and pictures;

in the case of occurrence of character scrambling, the character causing the occurrence of the scrambling is converted into a special symbol mark.

6. An apparatus for descriptor identification of textual data, the apparatus comprising a processor configured to:

determining an input sequence w ═ (w) based on text data₁ ,w₂ ,...,w_n ) Andthe tag sequence y corresponding to the feature vector is (y)₁ ,y₂ ,...,y_n ) Wherein w is_n Is the feature vector of the nth word;

where score (W, y) is the evaluation score of all input sequences, T is the transition matrix,

is y_i Migration to y_i+1 The probability score of (a) is determined,

for the ith word is labeled y_i P (y | S) represents the probability that the sentence S is marked as the tag sequence y,

7. The apparatus of claim 6, wherein the processor is further configured to:

randomly masking partial words in the sentence sequence, and predicting the masked words through the learned context semantic relation so as to realize the enhancement of text data;

and training the recognition model through the enhanced text data.

8. The apparatus of claim 7, wherein the processor is further configured to:

Selecting a descriptor, then creating a temporary queue, putting the descriptor into the temporary queue, taking out a coarse-grained descriptor and a statement from a corresponding coarse-grained descriptor list and a statement list, adding a descriptor in which the coarse-grained descriptor and the statement appear together into the temporary queue, and under the condition that the temporary queue is not an empty queue, withdrawing a head element in the temporary queue from the temporary queue and assigning the head element to a performance-driven descriptor set to obtain a performance-driven high-quality descriptor set;

wherein, I_i Indicating the importance of the ith word, E_i For the embedded vector of the ith word, S_[CLS] Embedding vectors for the corresponding sentences;

a threshold based on descriptor importance screens out performance driven high quality descriptors.

9. The apparatus of claim 8, wherein the processor is further configured to:

based on the threshold of descriptor importance, a performance-driven high-quality descriptor is screened out by the following equation (19):

where T is a threshold for descriptor importance, true represents descriptors retained in the performance driven high quality descriptor set, and false represents descriptors deleted in the performance driven high quality descriptor set.

10. A non-transitory computer-readable storage medium having stored thereon instructions that, when executed by a processor, perform the method of any one of claims 1-5.