Background
With the rapid development of materials science, the volume of accumulated materials data has grown enormously. Extracting useful information from hundreds of millions of complex data points and organizing the composition-structure-process-performance (structure-activity) relationships of materials have therefore become central tasks of materials research. Machine learning can establish a mapping between the factors that influence a material (descriptors of composition, processing, external environment, and so on) and target quantities (such as properties), enabling the prediction of material compositions, structures, processes, and performance, as well as the discovery of new materials. However, the performance of a machine learning model is limited by the quality of its data, so constructing descriptor features that are scientifically and reasonably related to the target quantity remains the basis both for building high-accuracy prediction models and for understanding material mechanisms. Rapid and effective selection of suitable descriptors is therefore of great significance for studying the structure-activity relationships of materials.
In studies of structure-activity relationships in the materials field, the selection of descriptors mainly depends on expert knowledge or on manual reading of the literature. For example, Jalem et al. (2014) summarized existing materials knowledge to select chemical-composition descriptors, such as elemental charge and coordination number, and crystal-structure descriptors, such as lattice constants, unit-cell volumes, polyhedral bond lengths and bond angles, and interatomic distances, and used them to construct learning samples for predicting the activation energy of olivine compounds LiMXO4 (M, X being main-group elements). Sendek et al. (2016) manually screened 12831 lithium-containing crystalline solids from the Materials Project database for high structural and chemical stability, low electronic conductivity, and low cost, then screened 21 descriptors related to crystal structure and chemical composition against conductivity and discussed the structure-composition-conductivity relationship of solid electrolytes for lithium-ion batteries. Xu et al. (2020) collected from a large body of literature ionic-conductivity data, extrapolated from the Arrhenius law, of 70 NASICON-type compounds in the R-3c space group, and built an ionic-conductivity prediction model using logistic regression; they empirically selected 16 chemical-composition descriptors, such as element radius, element electronegativity, and ion number, and 12 structural descriptors, such as lattice constant, unit-cell volume, and atomic volume, to explore the structure-activity relationship with conductivity. However, experts spend a great deal of time obtaining suitable descriptors by summarizing the words or phrases of published documents, and as the number of materials-science publications grows, this manual approach becomes a bottleneck that seriously constrains descriptor selection for structure-activity studies.
Named entity recognition (NER) can automatically extract information from unstructured text. Typically, this task is treated as a supervised machine learning problem in which the model learns to identify keywords or phrases in sentences. NER has already been applied to information extraction for organic and inorganic materials. For example, Kim et al. (2017) used NER to parse 76000 articles related to oxide material synthesis, extracting the key information and encoding it into a database; Mysore et al. (2017) extracted action-graph structures from materials-science synthesis procedures with NER and could extract synthesis text relatively accurately even with small datasets. In addition, several chemistry-oriented NER systems can extract inorganic materials: Krallinger et al. (2017) employed NER to efficiently acquire the chemical information contained in scientific literature, patents, technical reports, and the web, and Leaman et al. (2015) used NER techniques to accurately identify chemical entities, attributes, and relationships in the literature. Recently, some researchers have achieved large-scale extraction of inorganic-material information from the literature by building deep-learning NER models; for example, He et al. (2020) built a Bi-LSTM-based model to accurately extract the precursors and targets of inorganic solid-state synthesis reactions reported in the literature, and Weston et al. (2019) extracted inorganic materials, sample descriptors, phase labels, material properties and applications, and the synthesis and characterization methods used. These results have also attracted attention from researchers of organic materials; for example, Zhao et al. (2021) used NER with a BiLSTM-CNN-CRF deep learning model to automatically extract organic-material information from the literature. However, NER has not yet been applied to descriptor selection from the materials-science literature. Moreover, materials-informatics studies often predict hundreds or thousands of materials, so extracting descriptor features from the materials literature is useful for studying structure-activity relationships.
Current methods for selecting descriptors mostly screen words or phrases suitable as descriptors from the materials literature using expert domain knowledge. NER methods in the materials field mostly fall into unsupervised entity recognition, supervised entity recognition, and deep-learning-based entity recognition. Unsupervised entity recognition is mainly rule-based; designing the rules requires a domain knowledge base and dictionary, and often careful design by experts. Supervised entity recognition is mostly implemented with conditional random fields (CRF). Deep-learning-based entity recognition is mostly implemented with a Bi-LSTM combined with a CRF, where the Bi-LSTM is a bidirectional long short-term memory network (LSTM) and the LSTM is an improved RNN. Rule-based entity recognition, the CRF, the RNN, the LSTM, and the Bi-LSTM are described below in turn.
(1) Rule-based
Well-known rule-based entity recognition systems include LaSIE-II, NetOwl, Facile, SAR, FASTUS, and LTG. These systems identify entities mainly from manually designed semantic and syntactic rules; for example, after part-of-speech tagging of a sentence, noun phrases that satisfy certain constraints are regarded as entities. Good performance is often achieved when dictionary resources are rich. The KnowItAll system is unsupervised and can automatically extract large numbers of entities (and relationships) from web pages using domain-independent rule templates. The advantage of unsupervised entity recognition is that many entities can be obtained from dictionaries and hand-crafted rules without any labeled data. However, because the rules are domain-specific and dictionaries are incomplete, such systems tend to have high precision but low recall, and they are difficult to transfer to other domains.
(2) Based on traditional machine learning
The CRF is a classical sequence labeling model. It extracts features at each position l of the sequence as well as features between adjacent output tags. Assuming N feature functions fk with weights λk, the conditional probability of the model is:

P(y|x) = (1/Z(x)) exp( Σl Σk=1..N λk fk(yl-1, yl, x, l) )   equation (1)

Z(x) = Σy exp( Σl Σk=1..N λk fk(yl-1, yl, x, l) )   equation (2)

where Z(x) is the normalization function.
(3) Deep learning-based
The RNN is a recurrent neural network for processing sequence data: it takes the sequence as input, recurses along the direction in which the sequence evolves, and connects all of its nodes (recurrent units) in a chain.
In a conventional neural network, the input layer and the hidden layer, and the hidden layer and the output layer, are fully connected, while the nodes within each layer are not connected to each other; each input can only be processed independently, so successive inputs are treated as completely unrelated. When processing sequence data such as sentences, however, it is clearly inappropriate to interpret each word in isolation. In part-of-speech tagging, for example, the part of speech of the preceding word has a significant impact on the prediction for the current word: if the previous word is a verb, the probability that the current word is a noun is much greater than the probability that it is a verb. To handle such problems, it is desirable to use information from the words preceding the current word, that is, to use historical information to assist the current prediction, which a conventional neural network cannot provide.
The RNN adds a recursive connection to the hidden-layer unit, through which historical information can be propagated, so the hidden layer can store and exploit history: its input includes not only the information from the input layer but also the hidden-layer output of the previous time step. Specifically, the hidden-layer input at time t includes, in addition to xt, the hidden-layer output ht-1 of the previous time step, so that when the t-th word of a sentence is processed, the information (ht-1) of the preceding words can be used, which helps with sequence prediction problems.
ht = H(W[ht-1, xt] + b)   equation (3)
where ht is the hidden-layer output at time t, H is a nonlinear function (e.g., the tanh function), W is the weight matrix, and b is the bias.
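By way of illustration only, the recurrence of equation (3) can be sketched in a few lines of NumPy (the dimensions are toy values, not part of the disclosed method):

```python
import numpy as np

def rnn_step(x_t, h_prev, W, b):
    """One RNN step: h_t = tanh(W [h_{t-1}, x_t] + b), as in equation (3)."""
    concat = np.concatenate([h_prev, x_t])   # [h_{t-1}, x_t]
    return np.tanh(W @ concat + b)           # new hidden state h_t

# toy dimensions (assumed for illustration): input size 4, hidden size 3
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 7))   # maps the 7-dim concatenated vector to the 3-dim hidden state
b = np.zeros(3)
h = np.zeros(3)
for x_t in rng.normal(size=(5, 4)):   # a toy sequence of 5 inputs
    h = rnn_step(x_t, h, W, b)        # h carries the history forward
```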
However, the standard RNN has two problems: when gradients are propagated over too many steps it is difficult to capture long-distance dependencies in the sequence, and vanishing or exploding gradients occur when long sequences are processed.
(4) LSTM model
Compared with the RNN, the hidden-layer unit of the LSTM has three gate structures for memorizing, updating, and using information: an input gate (i), a forget gate (f), and an output gate (o); a memory unit (c) is also added. The input gate i determines which new information is stored in the memory unit, the forget gate f controls how much historical information should be forgotten, and the output gate o determines which information is output. The calculations are given in formulas (4)-(9):
it = σ(Wi[ht-1, xt] + bi)   formula (4)
ft = σ(Wf[ht-1, xt] + bf)   formula (5)
c̃t = tanh(Wc[ht-1, xt] + bc)   formula (6)
ct = ft ⊙ ct-1 + it ⊙ c̃t   formula (7)
ot = σ(Wo[ht-1, xt] + bo)   formula (8)
ht = ot ⊙ tanh(ct)   formula (9)
where σ and tanh denote the activation functions and ⊙ denotes the element-wise product. Wi, Wf, Wc, Wo are weight matrices and bi, bf, bc, bo are bias values. xt is the input vector at time t; ht is the hidden-layer state and also the output vector, containing all valid information up to time t; it, ft, and ot represent the control of the input gate, forget gate, and output gate at time t, respectively.
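A minimal NumPy sketch of one LSTM step following formulas (4)-(9) (toy dimensions, for illustration only):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step following formulas (4)-(9).
    W and b hold the parameters (Wi, Wf, Wc, Wo) and (bi, bf, bc, bo)."""
    concat = np.concatenate([h_prev, x_t])
    i = sigmoid(W["i"] @ concat + b["i"])        # input gate, formula (4)
    f = sigmoid(W["f"] @ concat + b["f"])        # forget gate, formula (5)
    c_tilde = np.tanh(W["c"] @ concat + b["c"])  # candidate memory, formula (6)
    c = f * c_prev + i * c_tilde                 # memory update, formula (7)
    o = sigmoid(W["o"] @ concat + b["o"])        # output gate, formula (8)
    h = o * np.tanh(c)                           # hidden state, formula (9)
    return h, c

# toy dimensions (assumed): input size 4, hidden size 3
rng = np.random.default_rng(1)
W = {k: rng.normal(size=(3, 7)) for k in "ifco"}
b = {k: np.zeros(3) for k in "ifco"}
h, c = np.zeros(3), np.zeros(3)
for x_t in rng.normal(size=(5, 4)):
    h, c = lstm_step(x_t, h, c, W, b)
```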
However, both the RNN and the LSTM capture only the historical (left) context of the sequence. Because of the complexity of natural-language sentence structure, future information may also be needed in sequence labeling; that is, when processing the current word, information from the words after it (to its right) may be required.
(5) Bi-LSTM model
The Bi-LSTM network consists of forward LSTM units and backward LSTM units. The basic idea is to model the sequence at the hidden layer with two LSTMs, one running front-to-back (forward) and one back-to-front (backward), and then to connect their outputs. The hidden state of the forward unit at time t is denoted h→t and that of the backward unit h←t; each is obtained through formulas (4)-(9), as expressed in formulas (10)-(11). The hidden-layer output of the Bi-LSTM is obtained by concatenating the hidden states of the forward and backward LSTM units, ht = [h→t ; h←t], as shown in formula (12).
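By way of illustration, this concatenation of forward and backward hidden states is exactly what a bidirectional LSTM in PyTorch produces; the dimensions below follow Table 3 later in this document but are otherwise illustrative:

```python
import torch
import torch.nn as nn

# A bidirectional LSTM: the output at each step is the concatenation of the
# forward hidden state (left-to-right) and the backward hidden state (right-to-left).
bilstm = nn.LSTM(input_size=768, hidden_size=128,
                 batch_first=True, bidirectional=True)

x = torch.randn(1, 20, 768)   # one sentence of 20 word vectors (illustrative)
out, _ = bilstm(x)            # out: (1, 20, 256) = [forward_128 ; backward_128]
```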
In summary, the most widely used method for selecting descriptors is the manual selection of words or phrases from the materials literature based on expert experience. This approach is time-consuming, its result depends on the researcher's expertise, and it carries inherent limitations and subjectivity: different researchers may select different descriptors for the same material, so the descriptors selected in this way have limited generalizability. Meanwhile, conventional models face two problems in materials named entity recognition: 1) they lack the capability to encode the words of long sentences in materials documents (for long sentences it is difficult to capture the dependencies between words, so a good contextual representation of each word cannot be obtained, which harms entity classification performance); and 2) they cannot represent word ambiguity (materials documents contain many expressions with the same meaning but different surface forms, such as chemical formulas together with their abbreviations and English names), and the conventional Word2Vec method assigns such entities different static embeddings, making them difficult to relate, or else a separate classifier must be trained after entity recognition to judge synonyms. Traditional NER methods are therefore weak at learning long-range dependencies and require external knowledge and a great deal of human involvement to extract and process features.
Detailed Description
The present invention will be described in detail below with reference to the drawings and specific embodiments so that those skilled in the art can better understand its technical scheme. Embodiments of the present invention are described in further detail below with reference to the drawings and specific examples, but are not limiting. The order in which the steps are described herein by way of example should not be construed as limiting where the steps have no necessary relationship to one another; it should be understood by those skilled in the art that the order of steps may be adjusted, provided the logic between them, and hence the overall process, is not disrupted.
The embodiment of the invention provides a descriptor identification method for text data, as shown in fig. 1, comprising the following steps of identifying descriptors of the text data using a trained recognition model:
Step S100, based on the text data, determine an input sequence W = (w1, w2, …, wn) and a tag sequence y = (y1, y2, …, yn) corresponding to the feature vectors, where wn is the feature vector of the n-th word.
Step S200, calculate the tag sequence with the highest total probability score through formulas (14)-(17):

score(W, y) = Σi=0..n T(yi, yi+1) + Σi=1..n P(i, yi)   formula (14)

p(y|S) = exp(score(W, y)) / Σy′∈YW exp(score(W, y′))   formula (15)

log(p(ȳ|S)) = score(W, ȳ) − log( Σy′∈YW exp(score(W, y′)) )   formula (16)

y* = argmax y′∈YW score(W, y′)   formula (17)

where score(W, y) is the evaluation score of the input sequence, T is the transition matrix, T(yi, yi+1) is the transition score from yi to yi+1, P(i, yi) is the probability score of the i-th word being labeled yi, p(y|S) is the probability that sentence S is labeled with tag sequence y, ȳ is the true tag sequence, formula (16) is the likelihood function of the tag sequence during training, YW is the set of all possible tag sequences, and y* is the tag sequence with the highest total probability score;
In step S300, coarse-granularity descriptors are determined based on the tag sequence with the highest total probability score. In this step a coarse-granularity descriptor set is obtained, and the sentence sequence corresponding to each coarse-granularity descriptor can be determined, so the invention can not only identify the corresponding descriptors but also be applied to text classification.
Step S400, dynamically adding coarse-granularity descriptors and corresponding sentence sequences to construct a knowledge base. By way of example only, a constructed knowledge base is shown in FIG. 9.
Step S500, based on the coarse-granularity descriptors in the knowledge base, screen out performance-driven high-quality descriptors by applying the principle that descriptors co-occur in the same sentence and by calculating the importance of each coarse-granularity descriptor in its corresponding sentence sequence.
In some embodiments, as shown in fig. 2, step S500, in which performance-driven descriptors are filtered out based on the coarse-granularity descriptors in the knowledge base by using the rule that descriptors co-occur in the same sentence and by calculating the importance of each coarse-granularity descriptor in its corresponding sentence sequence, includes:
Step S501, list the coarse-granularity descriptors in the knowledge base as D = [D1, D2, ..., Dn] and the sentences corresponding to the descriptors as S = [S1, S2, ..., Sn];
Step S502, select a seed descriptor, create a temporary queue, and put the descriptor into the queue; take the coarse-granularity descriptors and sentences out of the corresponding coarse-granularity descriptor list and sentence list, add the descriptors that co-occur with them in the same sentence to the temporary queue, and, while the temporary queue is not empty, dequeue the head element and assign it to the performance-driven descriptor set, thereby obtaining the performance-driven high-quality descriptor set (a sketch of this procedure is given after formula (19) below);
Step S503, calculate the importance of the descriptors in the performance-driven high-quality descriptor set within the corresponding sentence sequence by formula (18):

Ii = softmax(Ei · S[CLS])   formula (18)

where Ii represents the importance of the i-th word, Ei is the embedding vector of the i-th word, and S[CLS] is the embedding vector of the corresponding sentence;
step S504, the performance driven high quality descriptors are screened out based on the threshold of descriptor importance.
In some embodiments, performance-driven high-quality descriptors are filtered out by the following formula (19) based on a threshold of descriptor importance:

f(Di) = true if Ii ≥ T, false if Ii < T   formula (19)

where Di denotes a descriptor in the performance-driven high-quality descriptor set, T is the threshold of descriptor importance, true denotes that the descriptor is retained in the performance-driven high-quality descriptor set, and false denotes that the descriptor is deleted from it.
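By way of illustration only, the queue-based co-occurrence screening of step S502 can be sketched as follows; the data structures and toy inputs are hypothetical, while in the real pipeline D and S come from the knowledge base:

```python
from collections import deque

def performance_driven_descriptors(d_seed, descriptors, sentences):
    """Collect descriptors that co-occur (in the same sentence) with the seed
    descriptor, following the queue-based procedure of step S502.
    descriptors: coarse-grained descriptor list D = [D1, ..., Dn]
    sentences:   sentence lists S = [S1, ..., Sn], Si holding the sentences in
                 which Di appears (assumed structure)."""
    queue = deque([d_seed])
    associated, seen = [], {d_seed}
    while queue:                                   # loop until the queue is empty
        d_current = queue.popleft()                # dequeue the head element
        associated.append(d_current)
        for d_i, s_i in zip(descriptors, sentences):
            if d_i != d_current:
                continue
            for sentence in s_i:                   # sentences containing d_current
                for d_j in descriptors:            # descriptors co-occurring in them
                    if d_j not in seen and d_j in sentence:
                        seen.add(d_j)
                        queue.append(d_j)
    return associated

# usage with toy data (hypothetical):
D = ["ionic conductivity", "activation energy", "bottleneck"]
S = [["the ionic conductivity decreases with increasing activation energy"],
     ["the activation energy depends on the bottleneck size"],
     ["the bottleneck size is set by the lattice"]]
print(performance_driven_descriptors("ionic conductivity", D, S))
```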
In some embodiments, before the descriptors of the text data are identified using the trained recognition model, as shown in fig. 3, the method further comprises:
Step S1001, divide the text data into at least one sentence sequence, divide each sentence sequence into individual tokens, and label each token based on preset entity tags, where the preset entity tags are used to define descriptors;
Step S1002, randomly mask some of the words in the sentence sequences and predict the masked words through the learned contextual semantic relationships, thereby enhancing the text data;
Step S1003, train the recognition model with the enhanced text data. By way of example only, the trained recognition model can then be used in the descriptor identification method whose specific steps are described above and are not repeated here.
In some embodiments, before the text data is divided into at least one sentence sequence, the method further comprises cleaning the text information to obtain the text data. Cleaning the text information comprises removing invalid data from the text information by regular-expression matching, where the invalid data include garbled characters and pictures; in the case of garbled characters, the characters causing the garbling are converted into a special symbol token.
The embodiment of the invention also provides a descriptor recognition device for text data, which comprises a processor, wherein the processor is configured to:
identify the descriptors of the text data by using a trained recognition model through the following method:
Based on the enhanced text data, determine an input sequence W = (w1, w2, ..., wn) and a tag sequence y = (y1, y2, ..., yn) corresponding to the feature vectors, where wn is the feature vector of the n-th word;
calculate the tag sequence with the highest total probability score through the following formulas (14)-(17):

score(W, y) = Σi=0..n T(yi, yi+1) + Σi=1..n P(i, yi)   formula (14)

p(y|S) = exp(score(W, y)) / Σy′∈YW exp(score(W, y′))   formula (15)

log(p(ȳ|S)) = score(W, ȳ) − log( Σy′∈YW exp(score(W, y′)) )   formula (16)

y* = argmax y′∈YW score(W, y′)   formula (17)

where score(W, y) is the evaluation score of the input sequence, T is the transition matrix, T(yi, yi+1) is the transition score from yi to yi+1, P(i, yi) is the probability score of the i-th word being labeled yi, p(y|S) is the probability that sentence S is labeled with tag sequence y, ȳ is the true tag sequence, formula (16) is the likelihood function of the tag sequence during training, YW is the set of all possible tag sequences, and y* is the tag sequence with the highest total probability score;
determining coarse-granularity descriptors based on the tag sequence with the largest total probability score;
Dynamically adding coarse-granularity descriptors and sentence sequences corresponding to the coarse-granularity descriptors to construct a knowledge base;
and screen out performance-driven high-quality descriptors based on the coarse-granularity descriptors in the knowledge base by applying the principle that descriptors co-occur in the same sentence and the importance of each coarse-granularity descriptor in its corresponding sentence sequence.
It should be noted that the processor may be a processing device including more than one general-purpose processing device, such as a microprocessor, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), and the like. More specifically, the processor may be a Complex Instruction Set Computing (CISC) microprocessor, a Reduced Instruction Set Computing (RISC) microprocessor, a Very Long Instruction Word (VLIW) microprocessor, a processor running other instruction sets, or a processor running a combination of instruction sets. A processor may also be one or more special purpose processing devices, such as an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA), a Digital Signal Processor (DSP), a system on a chip (SoC), or the like.
The processor may be communicatively coupled to the memory and configured to execute computer-executable instructions stored thereon to perform a descriptor recognition method for text data according to various embodiments of the present invention.
In some embodiments, the processor is further configured to: divide the text data into at least one sentence sequence and divide each sentence sequence into individual tokens; annotate each token based on preset entity tags, the preset entity tags defining descriptors; randomly mask a portion of the words in the sentence sequences and predict the masked words through the learned contextual semantic relationships to enhance the text data; and train the recognition model with the enhanced text data.
In some embodiments, the processor is further configured to: list the coarse-granularity descriptors in the knowledge base as D = [D1, D2, ..., Dn] and the sentences corresponding to the descriptors as S = [s1, s2, ..., sn]; select a seed descriptor, create a temporary queue, and put the descriptor into the queue; take the coarse-granularity descriptors and sentences out of the corresponding coarse-granularity descriptor list and sentence list and add the descriptors that co-occur with them in the same sentence to the temporary queue; while the temporary queue is not empty, dequeue the head element and assign it to the performance-driven high-quality descriptor set, thereby obtaining the performance-driven high-quality descriptor set; and calculate the importance of the descriptors in the performance-driven high-quality descriptor set within the corresponding sentence sequence by the following formula (18):

Ii = softmax(Ei · S[CLS])   formula (18)

where Ii represents the importance of the i-th word, Ei is the embedding vector of the i-th word, and S[CLS] is the embedding vector of the corresponding sentence;
performance driven high quality descriptors are screened out based on a threshold of descriptor importance.
In some embodiments, the processor is further configured to filter out performance-driven high-quality descriptors by the following formula (19) based on a threshold of descriptor importance:

f(Di) = true if Ii ≥ T, false if Ii < T   formula (19)

where Di denotes a descriptor in the performance-driven high-quality descriptor set, T is the threshold of descriptor importance, true denotes that the descriptor is retained in the performance-driven high-quality descriptor set, and false denotes that the descriptor is deleted from it.
In some embodiments, the processor is further configured to remove invalid data from the text information by regular-expression matching, where the invalid data include garbled characters and pictures; in the case of garbled characters, the characters causing the garbling are converted into a special symbol token.
Embodiments of the present invention provide a non-transitory computer-readable storage medium storing instructions which, when executed by a processor, perform a descriptor recognition method for text data according to various embodiments of the present invention.
The feasibility and advantages of the present invention are further illustrated below in combination with a specific application example.
The embodiment of the invention processes text data in three stages: 1) preprocessing the text data with a data processor; 2) screening entities with a coarse-granularity descriptor recognizer (CGDR); and 3) further screening entities as required with a fine-granularity descriptor recognizer (FGDR). The flow of the method is shown in fig. 4, and a detailed schematic diagram is shown in fig. 5.
1) Preprocessing text data using a data processor
Using crystallographic information files (CIF), 55 NASICON materials-science documents suitable as a corpus source for descriptor mining were collected. The full-text information of these documents (including title, authors, abstract, keywords, affiliations, publisher, and year of publication) was extracted by PDF parsing (a Python toolkit) and stored as individual documents. The resulting NASICON NER dataset contains 65690 data entries, 2434 sentences, and 6036 words. These documents were then preprocessed.
① Text cleaning
Text converted from PDF contains a great deal of invalid data, such as random characters and non-textual information, which we delete by regular-expression matching. Special symbols may also appear as garbled characters, but they cannot all be deleted directly because some carry useful information, such as chemical units; in the next step we therefore convert all such symbols into the special token < sYm >. In this way a relatively clean document is obtained from the PDF.
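By way of illustration, a cleaning pass of this kind might look as follows; the regular expressions shown are placeholders rather than the exact rules used for the NASICON corpus:

```python
import re

SYM_TOKEN = "<sYm>"   # special token kept for symbols that may carry meaning (e.g. units)

def clean_text(raw: str) -> str:
    """Illustrative cleaning pass; the patterns are placeholders, not the
    exact rules used in this work."""
    text = re.sub(r"https?://\S+", " ", raw)          # drop URLs and similar debris
    text = re.sub(r"\[\d+(,\s*\d+)*\]", " ", text)    # drop citation markers like [12]
    text = re.sub(r"[^\x00-\x7F]", SYM_TOKEN, text)   # map non-ASCII glyphs to <sYm>
    return re.sub(r"\s+", " ", text).strip()          # collapse whitespace

print(clean_text("Conductivity ≈ 10⁻⁵ S/cm [12] at 200 °C"))
```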
② Sentence segmentation, word segmentation, and annotation of text data
For this work, we first tokenize the cleaned documents with ChemDataExtractor, which involves splitting the raw text into sentences and then splitting each sentence into individual tokens. To annotate these tokens, 8 descriptor entity tags are defined, namely Composition, Structure, Property, Processing, Characterization, Application, Feature, and Condition, which cover most of the information carried by material descriptors. Table 1 gives the definition and examples of each tag.
TABLE 1  Definitions of the 8 descriptor entity types for the materials domain
Using the labeling scheme described above, the 55 materials-science documents were manually annotated. The inside-outside-begin (IOB) format is used for labeling, which can represent multi-word entities such as "activation energy": each token is marked as the beginning (B) of an entity, inside (I) an entity, or outside (O) any entity. For example, the sentence "The ionic conductivity decreases with increasing activation energy" in a NASICON document is labeled as (token; IOB-label) pairs in the following manner: (The; O), (ionic; B-Property), (conductivity; I-Property), (decreases; O), (with; O), (increasing; O), (activation; B-Property), (energy; I-Property).
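The IOB conversion can be illustrated with the following sketch (a hypothetical helper, not part of the disclosed pipeline), which reproduces the token-label pairs of the example above:

```python
def iob_tags(tokens, entities):
    """Assign IOB labels to a tokenized sentence.
    entities: dict mapping an entity phrase (as a token tuple) to its type."""
    tags = ["O"] * len(tokens)
    for phrase, etype in entities.items():
        n = len(phrase)
        for start in range(len(tokens) - n + 1):
            if tuple(tokens[start:start + n]) == phrase:
                tags[start] = f"B-{etype}"                 # first token of the entity
                for k in range(start + 1, start + n):
                    tags[k] = f"I-{etype}"                 # continuation tokens
    return list(zip(tokens, tags))

sentence = "The ionic conductivity decreases with increasing activation energy".split()
entities = {("ionic", "conductivity"): "Property", ("activation", "energy"): "Property"}
print(iob_tags(sentence, entities))
# [('The', 'O'), ('ionic', 'B-Property'), ('conductivity', 'I-Property'), ...]
```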
③ Data enhancement
Training a supervised NER model requires a large amount of labeled data, which is time-consuming and labor-intensive to produce. To address the shortage of NER data, a conditional data enhancement method incorporating materials-domain knowledge (cDA-DK) is proposed, as shown in fig. 6.
Since data enhancement methods are often subject to noise, to reduce its effect while generating data of as high quality as possible, we introduce materials-domain knowledge, such as material text and label constraints, as input to a pre-trained DistilRoBERTa model. As shown in fig. 2, we fine-tune the DistilRoBERTa model for large-scale expansion of the data. In effect, the enhanced data are generated by the masked language model (MLM) of DistilRoBERTa, which randomly masks some words in a sentence and then predicts the masked words from the learned contextual semantic relationships.
For example, given the input sentence "The ionic conductivity decreased with increasing activation energy", two words in the sentence are masked, giving "The <mask> conductivity decreased with increasing <mask> energy". The <mask> words are then predicted and filled in by the fine-tuned DistilRoBERTa model. Finally, a sentence such as "The electrode conductivity decreases with increasing electric energy" is generated, as shown in Table 2. In Table 2, the changed words are highlighted in bold italics.
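A minimal sketch of this masking-and-filling step, using the off-the-shelf distilroberta-base checkpoint from the transformers library (the actual cDA-DK method first fine-tunes the model on materials text, which is omitted here):

```python
from transformers import pipeline

# Off-the-shelf DistilRoBERTa; cDA-DK additionally fine-tunes it on materials text.
fill_mask = pipeline("fill-mask", model="distilroberta-base")

sentence = "The ionic conductivity decreased with increasing activation energy."
words = sentence.split()

# Mask one word (here the 2nd word, "ionic") and let the MLM propose replacements.
masked = " ".join(w if i != 1 else fill_mask.tokenizer.mask_token
                  for i, w in enumerate(words))
for candidate in fill_mask(masked)[:3]:    # top-3 predictions for the masked slot
    print(candidate["sequence"], candidate["score"])
```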
TABLE 2  Examples of initial training sentences and enhanced sentences
2) Screening entities with the coarse-granularity descriptor recognizer (CGDR)
The aim of this work is to train the NER model in a way that encodes materials-science knowledge; for example, we want the computer to know that the words "activation energy" and "ionic conductivity" are descriptors of material properties, while "tetrahedra" and "polyhedra" are descriptors of material structure. We therefore designed the CGDR, which builds an NER model (MatBERT-BiLSTM-CRF) to identify different classes of coarse-grained descriptors from the materials-science literature. Three main components let the model identify which words or phrases correspond to a particular descriptor type: ① word representation based on MatBERT, ② sentence-context feature extraction based on BiLSTM, and ③ descriptor classification based on the CRF (as shown in fig. 7).
① MatBERT-based word representation
Vector representations of words and sentences are designed and obtained using the MatBERT model. As shown in fig. 7, MatBERT is derived from the pre-trained Bidirectional Encoder Representations from Transformers (BERT) model and is fine-tuned with text data from the materials literature. Analysis of the material annotations shows that the same word may express different meanings in different contexts. For example, the English word "bottleneck" not only means "limitation" but can also carry structural information about a material crystal, which shows that context is very important. It is therefore necessary to consider context information when encoding complex material text. However, the word embeddings generated by the Word2vec method of Mikolov et al. (2013) are context-free (static embeddings) and do not capture complex features (e.g., grammar and semantics); this prevents the computer from understanding the materials vocabulary sufficiently and affects the accuracy of descriptor extraction. The MatBERT method is therefore used to encode the material text, because it fully captures the contextual information of words (through word, segment, and position embeddings) and thus yields vector representations with richer semantic information.
Specifically, consider a sentence sequence. MatBERT uses a fine-tuning mechanism for its parameters. The input sequence is set to W = ([CLS], w1, w2, ..., wn, [SEP]), where [CLS] marks the beginning of the sample sentence sequence and [SEP] is the separator between sentences; both are used for sentence-level training tasks. The vector representation of each word consists of three parts: a word embedding vector (Word Embedding Vector), a sentence embedding vector (Sentence Embedding Vector), and a position embedding vector (Position Embedding Vector). The word embedding vector is determined by the vocabulary provided by MatBERT, and since each training sample is a single sentence, the sentence embedding is set to 0. The three embedding vectors are summed to obtain the word feature used as the input to MatBERT, as shown in fig. 5. After the input word vectors are trained, the final word vector representations, shown in equation (13), serve as the input to the BiLSTM.

X = [x1, x2, ..., xn]   equation (13)
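By way of illustration, the word representations of equation (13) can be obtained with the transformers library as sketched below; the checkpoint name is a placeholder, since the fine-tuned MatBERT weights are assumed to be available locally, and any BERT-style model illustrates the mechanics:

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Placeholder checkpoint: substitute the locally fine-tuned MatBERT weights here.
CHECKPOINT = "bert-base-uncased"

tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModel.from_pretrained(CHECKPOINT)

sentence = "The ionic conductivity decreases with increasing activation energy"
inputs = tokenizer(sentence, return_tensors="pt")   # adds [CLS] ... [SEP] automatically
with torch.no_grad():
    # word, segment, and position embeddings are summed inside the embedding layer
    outputs = model(**inputs)

# X = [x1, ..., xn]: one contextual vector per token (equation (13)), fed to the BiLSTM.
X = outputs.last_hidden_state   # shape: (1, sequence_length, 768)
```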
② BiLSTM-based sentence context feature extraction
Contextual features of the material text are captured with a BiLSTM model. NER is a sequence labeling problem, a token-level classification task in which every word in the sentence must be classified, so the local context of each word must be considered. For example, in the sentence "The overlapping _____ are near to 10−5 S/cm at 200 °C", it is apparent that the missing word is "conductivity" (a coarse-grained descriptor of the Property class). Although the position information introduced by MatBERT complements the local context, the self-attention mechanism of MatBERT weakens this position information during fine-tuning. RNNs, which can capture temporal information for sequence-to-sequence classification, are therefore employed. However, RNNs often suffer from vanishing and exploding gradients while propagating information through time, so we use the long short-term memory network (LSTM), a variant of the RNN. The LSTM incorporates three gate units, an input gate (Input Gate), a forget gate (Forget Gate), and an output gate (Output Gate), as shown in fig. 8. The gate structure makes it possible to selectively retain context information, alleviating the above problems of RNNs; the LSTM is therefore preferred over the RNN for capturing long-range dependencies.
The parameter settings are shown in Table 3:
TABLE 3 Bi-LSTM parameter settings
| Parameter name | Parameter value |
| Word vector dimension | 768 |
| LSTM cell dimension | 128 |
| Dropout rate | 0.1 |
| Learning rate | 0.00003 |
| Optimizer | AdamW |
| Batch size | 32 |
| Early stopping patience | 3 |
| Max sentence length | 75 |
| Tag schema | BIO |
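As a minimal PyTorch sketch (not the original implementation) of how the BiLSTM context-encoding layer can be assembled with the hyperparameters of Table 3; the tag-inventory size of 17 is an assumption (8 entity types in the BIO scheme plus O), and the CRF classifier described in ③ below would be stacked on the emission scores produced here:

```python
import torch
import torch.nn as nn

class BiLSTMEncoder(nn.Module):
    """Sentence-context encoder with the hyperparameters of Table 3 (sketch only;
    the MatBERT embedding layer and the CRF classifier are attached separately)."""
    def __init__(self, num_tags: int, embed_dim: int = 768,
                 hidden_dim: int = 128, dropout: float = 0.1):
        super().__init__()
        self.bilstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True,
                              bidirectional=True)
        self.dropout = nn.Dropout(dropout)
        self.to_tags = nn.Linear(2 * hidden_dim, num_tags)   # emission scores P(i, y)

    def forward(self, word_vectors: torch.Tensor) -> torch.Tensor:
        # word_vectors: (batch, sentence_length, 768) MatBERT outputs, equation (13)
        context, _ = self.bilstm(word_vectors)
        return self.to_tags(self.dropout(context))            # (batch, length, num_tags)

# assumed tag inventory: 8 entity types in the BIO scheme -> 2*8 + 1 = 17 tags
encoder = BiLSTMEncoder(num_tags=17)
emissions = encoder(torch.randn(32, 75, 768))   # batch size 32, max length 75 (Table 3)
```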
③ CRF-based descriptor classification
The CRF predicts the optimal tag sequence by learning the dependencies between tags, enabling more accurate entity classification. As a classifier for sequence labeling, the CRF captures the strong dependencies among output labels to obtain the optimal tag sequence. Since the entity tag of every word in a sentence must be predicted, there is usually a transition relationship between adjacent entity tags; it is therefore useful to decode the best tag chain for a given input sentence while taking the correlations between neighboring tags into account. Accordingly, the classifier layer of the NER model uses a CRF instead of the traditional Softmax layer.
Specifically, W = (w1, w2, ..., wn) denotes a general input sequence, where wi is the input vector of the i-th word, and y = (y1, y2, ..., yn) denotes the tag sequence corresponding to the input. The evaluation score computed by the CRF model is given by formula (14), where T is the transition matrix, T(yi, yi+1) is the transition score from yi to yi+1, and P(i, yi) is the probability score of the i-th word being labeled yi. p(y|S), the probability that sentence S is labeled with tag sequence y, is calculated by formula (15), where ȳ is the true tag sequence. The likelihood function of the tag sequence during training is given by formula (16), where YW denotes the set of all possible tag sequences; an effective output sequence can be obtained from this likelihood function. Finally, the tag sequence with the highest total probability score is obtained by formula (17).
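As an illustration of the decoding in formulas (14) and (17), the following sketch (simplified: no start/stop transitions, toy random scores) finds the tag sequence with the highest total score by Viterbi dynamic programming:

```python
import numpy as np

def viterbi_decode(emissions, transitions):
    """Find the tag sequence maximizing formula (14):
    score(W, y) = sum_i T(y_i, y_{i+1}) + sum_i P(i, y_i)  (sketch, no start/stop tags).
    emissions:   (n_words, n_tags) scores P from the BiLSTM layer
    transitions: (n_tags, n_tags) matrix T of tag-to-tag scores"""
    n, k = emissions.shape
    score = emissions[0].copy()                    # best score ending in each tag
    back = np.zeros((n, k), dtype=int)
    for i in range(1, n):
        total = score[:, None] + transitions + emissions[i][None, :]
        back[i] = total.argmax(axis=0)             # best previous tag for each tag
        score = total.max(axis=0)
    best = [int(score.argmax())]
    for i in range(n - 1, 0, -1):                  # trace the best path backwards
        best.append(int(back[i, best[-1]]))
    return best[::-1]                              # y* of formula (17)

rng = np.random.default_rng(2)
print(viterbi_decode(rng.normal(size=(6, 5)), rng.normal(size=(5, 5))))
```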
In summary, the NER model of the CGDR can identify coarse-grained descriptors in material text; a knowledge base is then constructed to store them, as shown in fig. 9, and descriptors and the sentences in which they appear are added to the knowledge base dynamically. In fig. 9, "activation energy" and "security" are descriptors of the Property class, while "conduction channels", "bottleneck", and "rhombohedral symmetry" are descriptors of the Structure class; each descriptor is immediately followed by the corresponding sentence in which it appears.
Descriptor information can be accurately extracted from the materials-science literature using the NER model trained in the CGDR. The performance of the NER model is illustrated in fig. 10. When the sentence "For those NASICON materials which show a phase transition, the activation energy differed at low temperature (LT) and high temperature (HT)." is input into the trained NER model, the model classifies each word in the sentence. The results show that "NASICON materials" is a descriptor of the Feature class, "phase transition" and "activation energy" belong to the Property class, and "low temperature" and "high temperature" belong to the Condition class. The model is thus able to accurately identify descriptor information in the text.
3) Further screening entities as needed with the fine-granularity descriptor recognizer (FGDR)
The knowledge base contains many coarse-grained descriptors of different classes that can be used for material property prediction or new-material discovery. However, if the related descriptors were screened entirely by hand, the effort would be no less than selecting them from the literature directly. In addition, the quality of the coarse-grained descriptors in the knowledge base is an important factor in the screening process. The FGDR is therefore designed to rapidly screen high-quality descriptors related to the target material property under investigation. Note that the FGDR combines performance-driven screening with importance calculation, which helps researchers build descriptor sample datasets; with such a dataset, structure-activity relationships can then be studied with an ML model.
The specific performance-driven procedure is as follows. First, the descriptor Dseed, the target material property to be studied, is entered; a temporary queue Q is then created and Dseed is placed in it. As long as Q is not empty, the head element of Q is dequeued and assigned to Dcurrent for the current loop iteration. Di and Si are taken from the corresponding lists D and S (Di and Dcurrent are the same descriptor here). Then every descriptor wj that co-occurs with Di in Si is added to Q and to Dassociate. The loop terminates when there are no elements left in Q, and the set of performance-driven descriptors Dassociate is obtained.
High-quality descriptors are then filtered by computing importance, as shown in fig. 11. To calculate the importance of each descriptor in the corresponding sentence, we take the inner product of the word vector and the sentence vector from the last MatBERT layer and normalize the result with the Softmax function to obtain the final importance. The normalized calculation is given by formula (18), where Ii represents the importance of the i-th word, Ei is the embedding vector of the i-th word output by MatBERT, and S[CLS] is the embedding vector of the corresponding sentence. A threshold is then set for descriptor screening, as shown in formula (19), where true denotes a retained descriptor, false denotes a deleted descriptor, and T is the threshold of descriptor importance. The MatBERT model here is identical to the MatBERT in the CGDR, except that in the CGDR MatBERT does not output the word and sentence vectors directly but instead provides features to a downstream model for further extraction.
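The importance calculation of formulas (18)-(19) can be sketched as follows (illustrative only: the word and sentence vectors would come from the last MatBERT layer, and the threshold value 0.1 is an arbitrary placeholder):

```python
import numpy as np

def descriptor_importance(word_vecs, cls_vec):
    """Formula (18) as a sketch: inner product of each word vector with the
    sentence ([CLS]) vector from the last MatBERT layer, normalized by Softmax."""
    scores = word_vecs @ cls_vec     # E_i . S_[CLS]
    scores = scores - scores.max()   # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum()   # I_i

rng = np.random.default_rng(3)
word_vecs = rng.normal(size=(10, 768))   # 10 tokens (illustrative)
cls_vec = rng.normal(size=768)
importance = descriptor_importance(word_vecs, cls_vec)

T = 0.1                         # importance threshold, formula (19) (placeholder value)
keep = importance >= T          # True: retain the descriptor; False: delete it
```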
The FGDR can, to a certain extent, accurately screen performance-driven high-quality descriptors in the corresponding context. The effectiveness of screening high-quality associated descriptors is shown in fig. 12, taking "activation energies" as an example. The sentence "The calculated potential barriers are in good agreement with the activation energies obtained from ac measurements of polycrystalline samples." is retrieved from the knowledge base by searching for descriptors that co-occur with "activation energies". The other descriptors in the sentence ("potential barriers" and "polycrystalline samples") are identified with the MatBERT-BiLSTM-CRF model, and the importance of each word in the sentence is calculated by the FGDR. Focusing on the descriptors (i.e., "potential barriers", "activation energies", and "polycrystalline samples"), the importance of the first two descriptors in this context is greater than the threshold, while that of the last is below it. The results indicate that the FGDR can effectively screen out high-quality performance-driven descriptors.
In view of the shortcomings of the prior art, the present invention aims to mine usable high-quality descriptors from the materials-science literature at both coarse and fine granularity. The invention also comprehensively considers the problems of NER in the materials-science field and addresses the weakness of traditional methods in learning long-range dependencies and their need for external knowledge and extensive manual effort to extract and process features. Pre-training the BERT model with materials-science literature allows it to reach its best performance with fewer training iterations. In addition, the pre-trained BERT is used for data enhancement of the entities, which alleviates the shortage of named entities in materials science. Meanwhile, the method achieves accurate screening of descriptors by introducing domain knowledge, thereby obtaining high-quality descriptors that meet users' needs.
Table 4 shows the performance of the CGDR on the 8 named-entity classes, where F1-score is the harmonic mean of the precision P and the recall R. The table shows that the overall F1-score of our model is 0.87, comparable to a recent NER model (2018) with an F1-score of 0.92; that model, however, was trained and evaluated on manually tagged news articles with only three entity tags. Because the datasets differ, model performance cannot be compared directly through the values of the metrics; notably, our model is trained and evaluated on more entity tags and more complex text. The CGDR model reaches its highest F1-score (0.94) on the Composition class, while the F1-score on the Application class is only 0.58, probably because the training data for this class are scarce and the model fails to adequately capture the dependencies between these entities and their labels. The F1-scores of the other entity classes are above 0.80, indicating that the model performs well in identifying descriptors of different classes.
Table 4  Overall NER performance on the 8 entity classes
Compared with the baseline model (BiLSTM-CNNs-CRF), the F1-score of the proposed method is 0.87, an improvement of 16 percentage points over the baseline's 0.71; the results are shown in Table 5. This further verifies the validity of the CGDR model and its suitability for the automatic identification of descriptors in the materials field.
Table 5 comparison of model results
| Model | Precision | Recall | F1-score |
| BiLSTM-CNNs-CRF | 0.74 | 0.69 | 0.71 |
| DA+MatBERT-BiLSTM-CRF | 0.86 | 0.87 | 0.87 |
Furthermore, although exemplary embodiments have been described herein, the scope thereof includes any and all embodiments having equivalent elements, modifications, omissions, combinations (e.g., of the various embodiments across), adaptations or alterations as pertains to the present application. The elements in the claims are to be construed broadly based on the language employed in the claims and are not limited to examples described in the present specification or during the practice of the application, which examples are to be construed as non-exclusive. It is intended, therefore, that the specification and examples be considered as exemplary only, with a true scope and spirit being indicated by the following claims and their full scope of equivalents.
The above description is intended to be illustrative and not restrictive. For example, the above-described examples (or one or more aspects thereof) may be used in combination with each other. For example, other embodiments may be used by those of ordinary skill in the art upon reading the above description. In addition, in the above detailed description, various features may be grouped together to streamline the invention. This is not to be interpreted as an intention that the features of the claimed invention are essential to any of the claims. Rather, inventive subject matter may lie in less than all features of a particular inventive embodiment. Thus, the following claims are hereby incorporated into the detailed description as examples or embodiments, with each claim standing on its own as a separate embodiment, and it is contemplated that these embodiments may be combined with one another in various combinations or permutations. The scope of the invention should be determined with reference to the appended claims, along with the full scope of equivalents to which such claims are entitled.