Background
With the rapid development of materials science, the volume of accumulated materials data has grown enormously. Extracting useful information from hundreds of millions of complex data points and organizing the composition-structure-process-performance (structure-activity) relationships of materials have therefore become central tasks of materials research. Machine learning can establish a mapping between the factors that influence a material (descriptors of composition, processing, external environment, and so on) and target quantities (such as properties), enabling the prediction of material compositions, structures, processes, and performance, as well as the discovery of new materials. However, the performance of a machine learning model is limited by the quality of its data, so constructing descriptor features that are scientifically and reasonably related to the target quantity remains the basis both for building high-accuracy prediction models and for understanding material mechanisms. Rapid and effective selection of suitable descriptors is therefore of great significance for studying the structure-activity relationships of materials.
In studies of structure-activity relationships in the materials field, the selection of descriptors mainly depends on expert knowledge or on manual reading of the literature. For example, Jalem et al. (2014) summarized existing materials knowledge to select chemical-composition descriptors, such as elemental charge and coordination number, and crystal-structure descriptors, such as lattice constants, unit-cell volumes, polyhedral bond lengths and bond angles, and interatomic distances, and used them to construct learning samples for predicting the activation energy of olivine compounds LiMXO4 (M, X being main-group elements). Sendek et al. (2016) manually screened 12831 lithium-containing crystalline solids from the Materials Project database for high structural and chemical stability, low electronic conductivity, and low cost, then screened 21 descriptors related to crystal structure and chemical composition against conductivity and discussed the structure-composition-conductivity relationship of solid electrolytes for lithium-ion batteries. Xu et al. (2020) collected from a large body of literature ionic-conductivity data, extrapolated from the Arrhenius law, of 70 NASICON-type compounds in the R-3c space group, and built an ionic-conductivity prediction model using logistic regression; they empirically selected 16 chemical-composition descriptors, such as element radius, element electronegativity, and ion number, and 12 structural descriptors, such as lattice constant, unit-cell volume, and atomic volume, to explore the structure-activity relationship with conductivity. However, experts spend a great deal of time obtaining suitable descriptors by summarizing the words or phrases of published documents, and as the number of materials-science publications grows, this manual approach becomes a bottleneck that seriously constrains descriptor selection for structure-activity studies.
Named entity recognition (NER) can automatically extract information from unstructured text. Typically, this task is treated as a supervised machine learning problem in which the model learns to identify keywords or phrases in sentences. NER has already been applied to information extraction for organic and inorganic materials. For example, Kim et al. (2017) used NER to parse 76000 articles related to oxide material synthesis, extracting the key information and encoding it into a database; Mysore et al. (2017) extracted action-graph structures from materials-science synthesis procedures with NER and could extract synthesis text relatively accurately even with small datasets. In addition, several chemistry-oriented NER systems can extract inorganic materials: Krallinger et al. (2017) employed NER to efficiently acquire the chemical information contained in scientific literature, patents, technical reports, and the web, and Leaman et al. (2015) used NER techniques to accurately identify chemical entities, attributes, and relationships in the literature. Recently, some researchers have achieved large-scale extraction of inorganic-material information from the literature by building deep-learning NER models; for example, He et al. (2020) built a Bi-LSTM-based model to accurately extract the precursors and targets of inorganic solid-state synthesis reactions reported in the literature, and Weston et al. (2019) extracted inorganic materials, sample descriptors, phase labels, material properties and applications, and the synthesis and characterization methods used. These results have also attracted attention from researchers of organic materials; for example, Zhao et al. (2021) used NER with a BiLSTM-CNN-CRF deep learning model to automatically extract organic-material information from the literature. However, NER has not yet been applied to descriptor selection from the materials-science literature. Moreover, materials-informatics studies often predict hundreds or thousands of materials, so extracting descriptor features from the materials literature is useful for studying structure-activity relationships.
Current methods for selecting descriptors mostly screen words or phrases suitable as descriptors from the materials literature using expert domain knowledge. NER methods in the materials field mostly fall into unsupervised entity recognition, supervised entity recognition, and deep-learning-based entity recognition. Unsupervised entity recognition is mainly rule-based; designing the rules requires a domain knowledge base and dictionary, and often careful design by experts. Supervised entity recognition is mostly implemented with conditional random fields (CRF). Deep-learning-based entity recognition is mostly implemented with a Bi-LSTM combined with a CRF, where the Bi-LSTM is a bidirectional long short-term memory network (LSTM) and the LSTM is an improved RNN. Rule-based entity recognition, the CRF, the RNN, the LSTM, and the Bi-LSTM are described below in turn.
(1) Rule-based
Well-known rule-based entity recognition systems include LaSIE-II, NetOwl, Facile, SAR, FASTUS, and LTG. These systems identify entities mainly from manually designed semantic and syntactic rules; for example, after part-of-speech tagging of a sentence, noun phrases that satisfy certain constraints are regarded as entities. Good performance is often achieved when dictionary resources are rich. The KnowItAll system is unsupervised and can automatically extract large numbers of entities (and relationships) from web pages using domain-independent rule templates. The advantage of unsupervised entity recognition is that many entities can be obtained from dictionaries and hand-crafted rules without any labeled data. However, because the rules are domain-specific and dictionaries are incomplete, such systems tend to have high precision but low recall, and they are difficult to transfer to other domains.
(2) Based on traditional machine learning
The CRF is a classical sequence labeling model. It extracts features at each position l of the sequence as well as features between adjacent output tags. Assuming N feature functions fk with weights λk, the conditional probability of the model is:

P(y|x) = (1/Z(x)) exp( Σl Σk=1..N λk fk(yl-1, yl, x, l) )   equation (1)

Z(x) = Σy exp( Σl Σk=1..N λk fk(yl-1, yl, x, l) )   equation (2)

where Z(x) is the normalization function.
(3) Deep learning-based
The RNN is a recurrent neural network for processing sequence data: it takes the sequence as input, recurses along the direction in which the sequence evolves, and connects all of its nodes (recurrent units) in a chain.
In a conventional neural network, the input layer and the hidden layer, and the hidden layer and the output layer, are fully connected, while the nodes within each layer are not connected to each other; each input can only be processed independently, so successive inputs are treated as completely unrelated. When processing sequence data such as sentences, however, it is clearly inappropriate to interpret each word in isolation. In part-of-speech tagging, for example, the part of speech of the preceding word has a significant impact on the prediction for the current word: if the previous word is a verb, the probability that the current word is a noun is much greater than the probability that it is a verb. To handle such problems, it is desirable to use information from the words preceding the current word, that is, to use historical information to assist the current prediction, which a conventional neural network cannot provide.
The RNN adds a recursive connection to the hidden-layer unit, through which historical information can be propagated, so the hidden layer can store and exploit history: its input includes not only the information from the input layer but also the hidden-layer output of the previous time step. Specifically, the hidden-layer input at time t includes, in addition to xt, the hidden-layer output ht-1 of the previous time step, so that when the t-th word of a sentence is processed, the information (ht-1) of the preceding words can be used, which helps with sequence prediction problems.
ht = H(W[ht-1, xt] + b)   equation (3)
where ht is the hidden-layer output at time t, H is a nonlinear function (e.g., the tanh function), W is the weight matrix, and b is the bias.
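By way of illustration only, the recurrence of equation (3) can be sketched in a few lines of NumPy (the dimensions are toy values, not part of the disclosed method):

```python
import numpy as np

def rnn_step(x_t, h_prev, W, b):
    """One RNN step: h_t = tanh(W [h_{t-1}, x_t] + b), as in equation (3)."""
    concat = np.concatenate([h_prev, x_t])   # [h_{t-1}, x_t]
    return np.tanh(W @ concat + b)           # new hidden state h_t

# toy dimensions (assumed for illustration): input size 4, hidden size 3
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 7))   # maps the 7-dim concatenated vector to the 3-dim hidden state
b = np.zeros(3)
h = np.zeros(3)
for x_t in rng.normal(size=(5, 4)):   # a toy sequence of 5 inputs
    h = rnn_step(x_t, h, W, b)        # h carries the history forward
```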
However, the standard RNN has two problems: when gradients are propagated over too many steps it is difficult to capture long-distance dependencies in the sequence, and vanishing or exploding gradients occur when long sequences are processed.
(4) LSTM model
Compared with the RNN, the hidden-layer unit of the LSTM has three gate structures for memorizing, updating, and using information: an input gate (i), a forget gate (f), and an output gate (o); a memory unit (c) is also added. The input gate i determines which new information is stored in the memory unit, the forget gate f controls how much historical information should be forgotten, and the output gate o determines which information is output. The calculations are given in formulas (4)-(9):
it = σ(Wi[ht-1, xt] + bi)   formula (4)
ft = σ(Wf[ht-1, xt] + bf)   formula (5)
c̃t = tanh(Wc[ht-1, xt] + bc)   formula (6)
ct = ft ⊙ ct-1 + it ⊙ c̃t   formula (7)
ot = σ(Wo[ht-1, xt] + bo)   formula (8)
ht = ot ⊙ tanh(ct)   formula (9)
where σ and tanh denote the activation functions and ⊙ denotes the element-wise product. Wi, Wf, Wc, Wo are weight matrices and bi, bf, bc, bo are bias values. xt is the input vector at time t; ht is the hidden-layer state and also the output vector, containing all valid information up to time t; it, ft, and ot represent the control of the input gate, forget gate, and output gate at time t, respectively.
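A minimal NumPy sketch of one LSTM step following formulas (4)-(9) (toy dimensions, for illustration only):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step following formulas (4)-(9).
    W and b hold the parameters (Wi, Wf, Wc, Wo) and (bi, bf, bc, bo)."""
    concat = np.concatenate([h_prev, x_t])
    i = sigmoid(W["i"] @ concat + b["i"])        # input gate, formula (4)
    f = sigmoid(W["f"] @ concat + b["f"])        # forget gate, formula (5)
    c_tilde = np.tanh(W["c"] @ concat + b["c"])  # candidate memory, formula (6)
    c = f * c_prev + i * c_tilde                 # memory update, formula (7)
    o = sigmoid(W["o"] @ concat + b["o"])        # output gate, formula (8)
    h = o * np.tanh(c)                           # hidden state, formula (9)
    return h, c

# toy dimensions (assumed): input size 4, hidden size 3
rng = np.random.default_rng(1)
W = {k: rng.normal(size=(3, 7)) for k in "ifco"}
b = {k: np.zeros(3) for k in "ifco"}
h, c = np.zeros(3), np.zeros(3)
for x_t in rng.normal(size=(5, 4)):
    h, c = lstm_step(x_t, h, c, W, b)
```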
However, both the RNN and the LSTM capture only the historical (left) context of the sequence. Because of the complexity of natural-language sentence structure, future information may also be needed in sequence labeling; that is, when processing the current word, information from the words after it (to its right) may be required.
(5) Bi-LSTM model
The Bi-LSTM network consists of forward LSTM units and backward LSTM units. The basic idea is to model the sequence at the hidden layer with two LSTMs, one running front-to-back (forward) and one back-to-front (backward), and then to connect their outputs. The hidden state of the forward unit at time t is denoted h→t and that of the backward unit h←t; each is obtained through formulas (4)-(9), as expressed in formulas (10)-(11). The hidden-layer output of the Bi-LSTM is obtained by concatenating the hidden states of the forward and backward LSTM units, ht = [h→t ; h←t], as shown in formula (12).
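By way of illustration, this concatenation of forward and backward hidden states is exactly what a bidirectional LSTM in PyTorch produces; the dimensions below follow Table 3 later in this document but are otherwise illustrative:

```python
import torch
import torch.nn as nn

# A bidirectional LSTM: the output at each step is the concatenation of the
# forward hidden state (left-to-right) and the backward hidden state (right-to-left).
bilstm = nn.LSTM(input_size=768, hidden_size=128,
                 batch_first=True, bidirectional=True)

x = torch.randn(1, 20, 768)   # one sentence of 20 word vectors (illustrative)
out, _ = bilstm(x)            # out: (1, 20, 256) = [forward_128 ; backward_128]
```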
In summary, the most widely used method for selecting descriptors is the manual selection of words or phrases from the materials literature based on expert experience. This approach is time-consuming, its result depends on the researcher's expertise, and it carries inherent limitations and subjectivity: different researchers may select different descriptors for the same material, so the descriptors selected in this way have limited generalizability. Meanwhile, conventional models face two problems in materials named entity recognition: 1) they lack the capability to encode the words of long sentences in materials documents (for long sentences it is difficult to capture the dependencies between words, so a good contextual representation of each word cannot be obtained, which harms entity classification performance); and 2) they cannot represent word ambiguity (materials documents contain many expressions with the same meaning but different surface forms, such as chemical formulas together with their abbreviations and English names), and the conventional Word2Vec method assigns such entities different static embeddings, making them difficult to relate, or else a separate classifier must be trained after entity recognition to judge synonyms. Traditional NER methods are therefore weak at learning long-range dependencies and require external knowledge and a great deal of human involvement to extract and process features.
Detailed Description
The present invention will be described in detail below with reference to the drawings and specific embodiments so that those skilled in the art can better understand its technical scheme. Embodiments of the present invention are described in further detail below with reference to the drawings and specific examples, but are not limiting. The order in which the steps are described herein by way of example should not be construed as limiting where the steps have no necessary relationship to one another; it should be understood by those skilled in the art that the order of steps may be adjusted, provided the logic between them, and hence the overall process, is not disrupted.
The embodiment of the invention provides a descriptor identification method for text data, as shown in fig. 1, comprising the following steps of identifying descriptors of the text data using a trained recognition model:
Step S100, based on the text data, determine an input sequence W = (w1, w2, …, wn) and a tag sequence y = (y1, y2, …, yn) corresponding to the feature vectors, where wn is the feature vector of the n-th word.
Step S200, calculate the tag sequence with the highest total probability score through formulas (14)-(17):

score(W, y) = Σi=0..n T(yi, yi+1) + Σi=1..n P(i, yi)   formula (14)

p(y|S) = exp(score(W, y)) / Σy′∈YW exp(score(W, y′))   formula (15)

log(p(ȳ|S)) = score(W, ȳ) − log( Σy′∈YW exp(score(W, y′)) )   formula (16)

y* = argmax y′∈YW score(W, y′)   formula (17)

where score(W, y) is the evaluation score of the input sequence, T is the transition matrix, T(yi, yi+1) is the transition score from yi to yi+1, P(i, yi) is the probability score of the i-th word being labeled yi, p(y|S) is the probability that sentence S is labeled with tag sequence y, ȳ is the true tag sequence, formula (16) is the likelihood function of the tag sequence during training, YW is the set of all possible tag sequences, and y* is the tag sequence with the highest total probability score;
In step S300, coarse-granularity descriptors are determined based on the tag sequence with the highest total probability score. In this step a coarse-granularity descriptor set is obtained, and the sentence sequence corresponding to each coarse-granularity descriptor can be determined, so the invention can not only identify the corresponding descriptors but also be applied to text classification.
Step S400, dynamically adding coarse-granularity descriptors and corresponding sentence sequences to construct a knowledge base. By way of example only, a constructed knowledge base is shown in FIG. 9.
Step S500, based on the coarse-granularity descriptors in the knowledge base, screen out performance-driven high-quality descriptors by applying the principle that descriptors co-occur in the same sentence and by calculating the importance of each coarse-granularity descriptor in its corresponding sentence sequence.
In some embodiments, as shown in fig. 2, step S500, in which performance-driven descriptors are filtered out based on the coarse-granularity descriptors in the knowledge base by using the rule that descriptors co-occur in the same sentence and by calculating the importance of each coarse-granularity descriptor in its corresponding sentence sequence, includes:
Step S501, list the coarse-granularity descriptors in the knowledge base as D = [D1, D2, ..., Dn] and the sentences corresponding to the descriptors as S = [S1, S2, ..., Sn];
Step S502, select a seed descriptor, create a temporary queue, and put the descriptor into the queue; take the coarse-granularity descriptors and sentences out of the corresponding coarse-granularity descriptor list and sentence list, add the descriptors that co-occur with them in the same sentence to the temporary queue, and, while the temporary queue is not empty, dequeue the head element and assign it to the performance-driven descriptor set, thereby obtaining the performance-driven high-quality descriptor set (a sketch of this procedure is given after formula (19) below);
Step S503, calculate the importance of the descriptors in the performance-driven high-quality descriptor set within the corresponding sentence sequence by formula (18):

Ii = softmax(Ei · S[CLS])   formula (18)

where Ii represents the importance of the i-th word, Ei is the embedding vector of the i-th word, and S[CLS] is the embedding vector of the corresponding sentence;
step S504, the performance driven high quality descriptors are screened out based on the threshold of descriptor importance.
In some embodiments, performance-driven high-quality descriptors are filtered out by the following formula (19) based on a threshold of descriptor importance:

f(Di) = true if Ii ≥ T, false if Ii < T   formula (19)

where Di denotes a descriptor in the performance-driven high-quality descriptor set, T is the threshold of descriptor importance, true denotes that the descriptor is retained in the performance-driven high-quality descriptor set, and false denotes that the descriptor is deleted from it.
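By way of illustration only, the queue-based co-occurrence screening of step S502 can be sketched as follows; the data structures and toy inputs are hypothetical, while in the real pipeline D and S come from the knowledge base:

```python
from collections import deque

def performance_driven_descriptors(d_seed, descriptors, sentences):
    """Collect descriptors that co-occur (in the same sentence) with the seed
    descriptor, following the queue-based procedure of step S502.
    descriptors: coarse-grained descriptor list D = [D1, ..., Dn]
    sentences:   sentence lists S = [S1, ..., Sn], Si holding the sentences in
                 which Di appears (assumed structure)."""
    queue = deque([d_seed])
    associated, seen = [], {d_seed}
    while queue:                                   # loop until the queue is empty
        d_current = queue.popleft()                # dequeue the head element
        associated.append(d_current)
        for d_i, s_i in zip(descriptors, sentences):
            if d_i != d_current:
                continue
            for sentence in s_i:                   # sentences containing d_current
                for d_j in descriptors:            # descriptors co-occurring in them
                    if d_j not in seen and d_j in sentence:
                        seen.add(d_j)
                        queue.append(d_j)
    return associated

# usage with toy data (hypothetical):
D = ["ionic conductivity", "activation energy", "bottleneck"]
S = [["the ionic conductivity decreases with increasing activation energy"],
     ["the activation energy depends on the bottleneck size"],
     ["the bottleneck size is set by the lattice"]]
print(performance_driven_descriptors("ionic conductivity", D, S))
```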
In some embodiments, before the descriptors of the text data are identified using the trained recognition model, as shown in fig. 3, the method further comprises:
Step S1001, divide the text data into at least one sentence sequence, divide each sentence sequence into individual tokens, and label each token based on preset entity tags, where the preset entity tags are used to define descriptors;
Step S1002, randomly mask some of the words in the sentence sequences and predict the masked words through the learned contextual semantic relationships, thereby enhancing the text data;
Step S1003, train the recognition model with the enhanced text data. By way of example only, the trained recognition model can then be used in the descriptor identification method whose specific steps are described above and are not repeated here.
In some embodiments, before the text data is divided into at least one sentence sequence, the method further comprises cleaning the text information to obtain the text data. Cleaning the text information comprises removing invalid data from the text information by regular-expression matching, where the invalid data include garbled characters and pictures; in the case of garbled characters, the characters causing the garbling are converted into a special symbol token.
The embodiment of the invention also provides a descriptor recognition device for text data, which comprises a processor, wherein the processor is configured to:
identify the descriptors of the text data by using a trained recognition model through the following method:
Based on the enhanced text data, determine an input sequence W = (w1, w2, ..., wn) and a tag sequence y = (y1, y2, ..., yn) corresponding to the feature vectors, where wn is the feature vector of the n-th word;
calculate the tag sequence with the highest total probability score through the following formulas (14)-(17):

score(W, y) = Σi=0..n T(yi, yi+1) + Σi=1..n P(i, yi)   formula (14)

p(y|S) = exp(score(W, y)) / Σy′∈YW exp(score(W, y′))   formula (15)

log(p(ȳ|S)) = score(W, ȳ) − log( Σy′∈YW exp(score(W, y′)) )   formula (16)

y* = argmax y′∈YW score(W, y′)   formula (17)

where score(W, y) is the evaluation score of the input sequence, T is the transition matrix, T(yi, yi+1) is the transition score from yi to yi+1, P(i, yi) is the probability score of the i-th word being labeled yi, p(y|S) is the probability that sentence S is labeled with tag sequence y, ȳ is the true tag sequence, formula (16) is the likelihood function of the tag sequence during training, YW is the set of all possible tag sequences, and y* is the tag sequence with the highest total probability score;
determining coarse-granularity descriptors based on the tag sequence with the largest total probability score;
Dynamically adding coarse-granularity descriptors and sentence sequences corresponding to the coarse-granularity descriptors to construct a knowledge base;
and screen out performance-driven high-quality descriptors based on the coarse-granularity descriptors in the knowledge base by applying the principle that descriptors co-occur in the same sentence and the importance of each coarse-granularity descriptor in its corresponding sentence sequence.
It should be noted that the processor may be a processing device including more than one general-purpose processing device, such as a microprocessor, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), and the like. More specifically, the processor may be a Complex Instruction Set Computing (CISC) microprocessor, a Reduced Instruction Set Computing (RISC) microprocessor, a Very Long Instruction Word (VLIW) microprocessor, a processor running other instruction sets, or a processor running a combination of instruction sets. A processor may also be one or more special purpose processing devices, such as an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA), a Digital Signal Processor (DSP), a system on a chip (SoC), or the like.
The processor may be communicatively coupled to the memory and configured to execute computer-executable instructions stored thereon to perform a descriptor recognition method for text data according to various embodiments of the present invention.
In some embodiments, the processor is further configured to: divide the text data into at least one sentence sequence and divide each sentence sequence into individual tokens; annotate each token based on preset entity tags, the preset entity tags defining descriptors; randomly mask a portion of the words in the sentence sequences and predict the masked words through the learned contextual semantic relationships to enhance the text data; and train the recognition model with the enhanced text data.
In some embodiments, the processor is further configured to: list the coarse-granularity descriptors in the knowledge base as D = [D1, D2, ..., Dn] and the sentences corresponding to the descriptors as S = [s1, s2, ..., sn]; select a seed descriptor, create a temporary queue, and put the descriptor into the queue; take the coarse-granularity descriptors and sentences out of the corresponding coarse-granularity descriptor list and sentence list and add the descriptors that co-occur with them in the same sentence to the temporary queue; while the temporary queue is not empty, dequeue the head element and assign it to the performance-driven high-quality descriptor set, thereby obtaining the performance-driven high-quality descriptor set; and calculate the importance of the descriptors in the performance-driven high-quality descriptor set within the corresponding sentence sequence by the following formula (18):

Ii = softmax(Ei · S[CLS])   formula (18)

where Ii represents the importance of the i-th word, Ei is the embedding vector of the i-th word, and S[CLS] is the embedding vector of the corresponding sentence;
performance driven high quality descriptors are screened out based on a threshold of descriptor importance.
In some embodiments, the processor is further configured to filter out performance-driven high-quality descriptors by the following formula (19) based on a threshold of descriptor importance:

f(Di) = true if Ii ≥ T, false if Ii < T   formula (19)

where Di denotes a descriptor in the performance-driven high-quality descriptor set, T is the threshold of descriptor importance, true denotes that the descriptor is retained in the performance-driven high-quality descriptor set, and false denotes that the descriptor is deleted from it.
In some embodiments, the processor is further configured to remove invalid data from the text information by regular-expression matching, where the invalid data include garbled characters and pictures; in the case of garbled characters, the characters causing the garbling are converted into a special symbol token.
Embodiments of the present invention provide a non-transitory computer-readable storage medium storing instructions which, when executed by a processor, perform a descriptor recognition method for text data according to various embodiments of the present invention.
The feasibility and advantages of the present invention are further illustrated below in combination with a specific application example.
The embodiment of the invention processes text data in three stages: 1) preprocessing the text data with a data processor; 2) screening entities with a coarse-granularity descriptor recognizer (CGDR); and 3) further screening entities as required with a fine-granularity descriptor recognizer (FGDR). The flow of the method is shown in fig. 4, and a detailed schematic diagram is shown in fig. 5.
1) Preprocessing text data using a data processor
Using crystallographic information files (CIF), 55 NASICON materials-science documents suitable as a corpus source for descriptor mining were collected. The full-text information of these documents (including title, authors, abstract, keywords, affiliations, publisher, and year of publication) was extracted by PDF parsing (a Python toolkit) and stored as individual documents. The resulting NASICON NER dataset contains 65690 data entries, 2434 sentences, and 6036 words. These documents were then preprocessed.
① Text cleaning
Text converted from PDF contains a great deal of invalid data, such as random characters and non-textual information, which we delete by regular-expression matching. Special symbols may also appear as garbled characters, but they cannot all be deleted directly because some carry useful information, such as chemical units; in the next step we therefore convert all such symbols into the special token < sYm >. In this way a relatively clean document is obtained from the PDF.
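By way of illustration, a cleaning pass of this kind might look as follows; the regular expressions shown are placeholders rather than the exact rules used for the NASICON corpus:

```python
import re

SYM_TOKEN = "<sYm>"   # special token kept for symbols that may carry meaning (e.g. units)

def clean_text(raw: str) -> str:
    """Illustrative cleaning pass; the patterns are placeholders, not the
    exact rules used in this work."""
    text = re.sub(r"https?://\S+", " ", raw)          # drop URLs and similar debris
    text = re.sub(r"\[\d+(,\s*\d+)*\]", " ", text)    # drop citation markers like [12]
    text = re.sub(r"[^\x00-\x7F]", SYM_TOKEN, text)   # map non-ASCII glyphs to <sYm>
    return re.sub(r"\s+", " ", text).strip()          # collapse whitespace

print(clean_text("Conductivity ≈ 10⁻⁵ S/cm [12] at 200 °C"))
```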
② Sentence segmentation, word segmentation, and annotation of text data
For this work, we first tokenize the cleaned documents with ChemDataExtractor, which involves splitting the raw text into sentences and then splitting each sentence into individual tokens. To annotate these tokens, 8 descriptor entity tags are defined, namely Composition, Structure, Property, Processing, Characterization, Application, Feature, and Condition, which cover most of the information carried by material descriptors. Table 1 gives the definition and examples of each tag.
TABLE 1  Definitions of the 8 descriptor entity types for the materials domain
Using the labeling scheme described above, the 55 materials-science documents were manually annotated. The inside-outside-begin (IOB) format is used for labeling, which can represent multi-word entities such as "activation energy": each token is marked as the beginning (B) of an entity, inside (I) an entity, or outside (O) any entity. For example, the sentence "The ionic conductivity decreases with increasing activation energy" in a NASICON document is labeled as (token; IOB-label) pairs in the following manner: (The; O), (ionic; B-Property), (conductivity; I-Property), (decreases; O), (with; O), (increasing; O), (activation; B-Property), (energy; I-Property).
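The IOB conversion can be illustrated with the following sketch (a hypothetical helper, not part of the disclosed pipeline), which reproduces the token-label pairs of the example above:

```python
def iob_tags(tokens, entities):
    """Assign IOB labels to a tokenized sentence.
    entities: dict mapping an entity phrase (as a token tuple) to its type."""
    tags = ["O"] * len(tokens)
    for phrase, etype in entities.items():
        n = len(phrase)
        for start in range(len(tokens) - n + 1):
            if tuple(tokens[start:start + n]) == phrase:
                tags[start] = f"B-{etype}"                 # first token of the entity
                for k in range(start + 1, start + n):
                    tags[k] = f"I-{etype}"                 # continuation tokens
    return list(zip(tokens, tags))

sentence = "The ionic conductivity decreases with increasing activation energy".split()
entities = {("ionic", "conductivity"): "Property", ("activation", "energy"): "Property"}
print(iob_tags(sentence, entities))
# [('The', 'O'), ('ionic', 'B-Property'), ('conductivity', 'I-Property'), ...]
```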
③ Data enhancement
Training a supervised NER model requires a large amount of labeled data, which is time-consuming and labor-intensive to produce. To address the shortage of NER data, a conditional data enhancement method incorporating materials-domain knowledge (cDA-DK) is proposed, as shown in fig. 6.
Since data enhancement methods are often subject to noise, to reduce its effect while generating data of as high quality as possible, we introduce materials-domain knowledge, such as material text and label constraints, as input to a pre-trained DistilRoBERTa model. As shown in fig. 2, we fine-tune the DistilRoBERTa model for large-scale expansion of the data. In effect, the enhanced data are generated by the masked language model (MLM) of DistilRoBERTa, which randomly masks some words in a sentence and then predicts the masked words from the learned contextual semantic relationships.
For example, given the input sentence "The ionic conductivity decreased with increasing activation energy", two words in the sentence are masked, giving "The <mask> conductivity decreased with increasing <mask> energy". The <mask> words are then predicted and filled in by the fine-tuned DistilRoBERTa model. Finally, a sentence such as "The electrode conductivity decreases with increasing electric energy" is generated, as shown in Table 2. In Table 2, the changed words are highlighted in bold italics.
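A minimal sketch of this masking-and-filling step, using the off-the-shelf distilroberta-base checkpoint from the transformers library (the actual cDA-DK method first fine-tunes the model on materials text, which is omitted here):

```python
from transformers import pipeline

# Off-the-shelf DistilRoBERTa; cDA-DK additionally fine-tunes it on materials text.
fill_mask = pipeline("fill-mask", model="distilroberta-base")

sentence = "The ionic conductivity decreased with increasing activation energy."
words = sentence.split()

# Mask one word (here the 2nd word, "ionic") and let the MLM propose replacements.
masked = " ".join(w if i != 1 else fill_mask.tokenizer.mask_token
                  for i, w in enumerate(words))
for candidate in fill_mask(masked)[:3]:    # top-3 predictions for the masked slot
    print(candidate["sequence"], candidate["score"])
```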
TABLE 2  Examples of initial training sentences and enhanced sentences
2) Screening entities with the coarse-granularity descriptor recognizer (CGDR)
The aim of this work is to train the NER model in a way that encodes materials-science knowledge; for example, we want the computer to know that the words "activation energy" and "ionic conductivity" are descriptors of material properties, while "tetrahedra" and "polyhedra" are descriptors of material structure. We therefore designed the CGDR, which builds an NER model (MatBERT-BiLSTM-CRF) to identify different classes of coarse-grained descriptors from the materials-science literature. Three main components let the model identify which words or phrases correspond to a particular descriptor type: ① word representation based on MatBERT, ② sentence-context feature extraction based on BiLSTM, and ③ descriptor classification based on the CRF (as shown in fig. 7).
① MatBERT-based word representation
Vector representations of words and sentences are designed and obtained using the MatBERT model. As shown in fig. 7, MatBERT is derived from the pre-trained Bidirectional Encoder Representations from Transformers (BERT) model and is fine-tuned with text data from the materials literature. Analysis of the material annotations shows that the same word may express different meanings in different contexts. For example, the English word "bottleneck" not only means "limitation" but can also carry structural information about a material crystal, which shows that context is very important. It is therefore necessary to consider context information when encoding complex material text. However, the word embeddings generated by the Word2vec method of Mikolov et al. (2013) are context-free (static embeddings) and do not capture complex features (e.g., grammar and semantics); this prevents the computer from understanding the materials vocabulary sufficiently and affects the accuracy of descriptor extraction. The MatBERT method is therefore used to encode the material text, because it fully captures the contextual information of words (through word, segment, and position embeddings) and thus yields vector representations with richer semantic information.
Specifically, consider a sentence sequence. MatBERT uses a fine-tuning mechanism for its parameters. The input sequence is set to W = ([CLS], w1, w2, ..., wn, [SEP]), where [CLS] marks the beginning of the sample sentence sequence and [SEP] is the separator between sentences; both are used for sentence-level training tasks. The vector representation of each word consists of three parts: a word embedding vector (Word Embedding Vector), a sentence embedding vector (Sentence Embedding Vector), and a position embedding vector (Position Embedding Vector). The word embedding vector is determined by the vocabulary provided by MatBERT, and since each training sample is a single sentence, the sentence embedding is set to 0. The three embedding vectors are summed to obtain the word feature used as the input to MatBERT, as shown in fig. 5. After the input word vectors are trained, the final word vector representations, shown in equation (13), serve as the input to the BiLSTM.

X = [x1, x2, ..., xn]   equation (13)
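By way of illustration, the word representations of equation (13) can be obtained with the transformers library as sketched below; the checkpoint name is a placeholder, since the fine-tuned MatBERT weights are assumed to be available locally, and any BERT-style model illustrates the mechanics:

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Placeholder checkpoint: substitute the locally fine-tuned MatBERT weights here.
CHECKPOINT = "bert-base-uncased"

tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModel.from_pretrained(CHECKPOINT)

sentence = "The ionic conductivity decreases with increasing activation energy"
inputs = tokenizer(sentence, return_tensors="pt")   # adds [CLS] ... [SEP] automatically
with torch.no_grad():
    # word, segment, and position embeddings are summed inside the embedding layer
    outputs = model(**inputs)

# X = [x1, ..., xn]: one contextual vector per token (equation (13)), fed to the BiLSTM.
X = outputs.last_hidden_state   # shape: (1, sequence_length, 768)
```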
② BiLSTM-based sentence context feature extraction
Contextual features of the material text are captured with a BiLSTM model. NER is a sequence labeling problem, a token-level classification task in which every word in the sentence must be classified, so the local context of each word must be considered. For example, in the sentence "The overlapping _____ are near to 10−5 S/cm at 200 °C", it is apparent that the missing word is "conductivity" (a coarse-grained descriptor of the Property class). Although the position information introduced by MatBERT complements the local context, the self-attention mechanism of MatBERT weakens this position information during fine-tuning. RNNs, which can capture temporal information for sequence-to-sequence classification, are therefore employed. However, RNNs often suffer from vanishing and exploding gradients while propagating information through time, so we use the long short-term memory network (LSTM), a variant of the RNN. The LSTM incorporates three gate units, an input gate (Input Gate), a forget gate (Forget Gate), and an output gate (Output Gate), as shown in fig. 8. The gate structure makes it possible to selectively retain context information, alleviating the above problems of RNNs; the LSTM is therefore preferred over the RNN for capturing long-range dependencies.
The parameter settings are shown in Table 3:
TABLE 3 Bi-LSTM parameter settings
| Parameter name | Parameter value |
| Word vector dimension | 768 |
| LSTM cell dimension | 128 |
| Dropout rate | 0.1 |
| Learning rate | 0.00003 |
| Optimizer | AdamW |
| Batch size | 32 |
| Early stopping patience | 3 |
| Max sentence length | 75 |
| Tag schema | BIO |
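As a minimal PyTorch sketch (not the original implementation) of how the BiLSTM context-encoding layer can be assembled with the hyperparameters of Table 3; the tag-inventory size of 17 is an assumption (8 entity types in the BIO scheme plus O), and the CRF classifier described in ③ below would be stacked on the emission scores produced here:

```python
import torch
import torch.nn as nn

class BiLSTMEncoder(nn.Module):
    """Sentence-context encoder with the hyperparameters of Table 3 (sketch only;
    the MatBERT embedding layer and the CRF classifier are attached separately)."""
    def __init__(self, num_tags: int, embed_dim: int = 768,
                 hidden_dim: int = 128, dropout: float = 0.1):
        super().__init__()
        self.bilstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True,
                              bidirectional=True)
        self.dropout = nn.Dropout(dropout)
        self.to_tags = nn.Linear(2 * hidden_dim, num_tags)   # emission scores P(i, y)

    def forward(self, word_vectors: torch.Tensor) -> torch.Tensor:
        # word_vectors: (batch, sentence_length, 768) MatBERT outputs, equation (13)
        context, _ = self.bilstm(word_vectors)
        return self.to_tags(self.dropout(context))            # (batch, length, num_tags)

# assumed tag inventory: 8 entity types in the BIO scheme -> 2*8 + 1 = 17 tags
encoder = BiLSTMEncoder(num_tags=17)
emissions = encoder(torch.randn(32, 75, 768))   # batch size 32, max length 75 (Table 3)
```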
③ CRF-based descriptor classification
The CRF predicts the optimal tag sequence by learning the dependencies between tags, enabling more accurate entity classification. As a classifier for sequence labeling, the CRF captures the strong dependencies among output labels to obtain the optimal tag sequence. Since the entity tag of every word in a sentence must be predicted, there is usually a transition relationship between adjacent entity tags; it is therefore useful to decode the best tag chain for a given input sentence while taking the correlations between neighboring tags into account. Accordingly, the classifier layer of the NER model uses a CRF instead of the traditional Softmax layer.
Specifically, W = (w1, w2, ..., wn) denotes a general input sequence, where wi is the input vector of the i-th word, and y = (y1, y2, ..., yn) denotes the tag sequence corresponding to the input. The evaluation score computed by the CRF model is given by formula (14), where T is the transition matrix, T(yi, yi+1) is the transition score from yi to yi+1, and P(i, yi) is the probability score of the i-th word being labeled yi. p(y|S), the probability that sentence S is labeled with tag sequence y, is calculated by formula (15), where ȳ is the true tag sequence. The likelihood function of the tag sequence during training is given by formula (16), where YW denotes the set of all possible tag sequences; an effective output sequence can be obtained from this likelihood function. Finally, the tag sequence with the highest total probability score is obtained by formula (17).
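As an illustration of the decoding in formulas (14) and (17), the following sketch (simplified: no start/stop transitions, toy random scores) finds the tag sequence with the highest total score by Viterbi dynamic programming:

```python
import numpy as np

def viterbi_decode(emissions, transitions):
    """Find the tag sequence maximizing formula (14):
    score(W, y) = sum_i T(y_i, y_{i+1}) + sum_i P(i, y_i)  (sketch, no start/stop tags).
    emissions:   (n_words, n_tags) scores P from the BiLSTM layer
    transitions: (n_tags, n_tags) matrix T of tag-to-tag scores"""
    n, k = emissions.shape
    score = emissions[0].copy()                    # best score ending in each tag
    back = np.zeros((n, k), dtype=int)
    for i in range(1, n):
        total = score[:, None] + transitions + emissions[i][None, :]
        back[i] = total.argmax(axis=0)             # best previous tag for each tag
        score = total.max(axis=0)
    best = [int(score.argmax())]
    for i in range(n - 1, 0, -1):                  # trace the best path backwards
        best.append(int(back[i, best[-1]]))
    return best[::-1]                              # y* of formula (17)

rng = np.random.default_rng(2)
print(viterbi_decode(rng.normal(size=(6, 5)), rng.normal(size=(5, 5))))
```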
In summary, the NER model of the CGDR can identify coarse-grained descriptors in material text; a knowledge base is then constructed to store them, as shown in fig. 9, and descriptors and the sentences in which they appear are added to the knowledge base dynamically. In fig. 9, "activation energy" and "security" are descriptors of the Property class, while "conduction channels", "bottleneck", and "rhombohedral symmetry" are descriptors of the Structure class; each descriptor is immediately followed by the corresponding sentence in which it appears.
Descriptor information can be accurately extracted from the materials-science literature using the NER model trained in the CGDR. The performance of the NER model is illustrated in fig. 10. When the sentence "For those NASICON materials which show a phase transition, the activation energy differed at low temperature (LT) and high temperature (HT)." is input into the trained NER model, the model classifies each word in the sentence. The results show that "NASICON materials" is a descriptor of the Feature class, "phase transition" and "activation energy" belong to the Property class, and "low temperature" and "high temperature" belong to the Condition class. The model is thus able to accurately identify descriptor information in the text.
3) Further screening entities as needed with the fine-granularity descriptor recognizer (FGDR)
The knowledge base contains many coarse-grained descriptors of different classes that can be used for material property prediction or new-material discovery. However, if the related descriptors were screened entirely by hand, the effort would be no less than selecting them from the literature directly. In addition, the quality of the coarse-grained descriptors in the knowledge base is an important factor in the screening process. The FGDR is therefore designed to rapidly screen high-quality descriptors related to the target material property under investigation. Note that the FGDR combines performance-driven screening with importance calculation, which helps researchers build descriptor sample datasets; with such a dataset, structure-activity relationships can then be studied with an ML model.
The specific performance-driven procedure is as follows. First, the descriptor Dseed, the target material property to be studied, is entered; a temporary queue Q is then created and Dseed is placed in it. As long as Q is not empty, the head element of Q is dequeued and assigned to Dcurrent for the current loop iteration. Di and Si are taken from the corresponding lists D and S (Di and Dcurrent are the same descriptor here). Then every descriptor wj that co-occurs with Di in Si is added to Q and to Dassociate. The loop terminates when there are no elements left in Q, and the set of performance-driven descriptors Dassociate is obtained.
High-quality descriptors are then filtered by computing importance, as shown in fig. 11. To calculate the importance of each descriptor in the corresponding sentence, we take the inner product of the word vector and the sentence vector from the last MatBERT layer and normalize the result with the Softmax function to obtain the final importance. The normalized calculation is given by formula (18), where Ii represents the importance of the i-th word, Ei is the embedding vector of the i-th word output by MatBERT, and S[CLS] is the embedding vector of the corresponding sentence. A threshold is then set for descriptor screening, as shown in formula (19), where true denotes a retained descriptor, false denotes a deleted descriptor, and T is the threshold of descriptor importance. The MatBERT model here is identical to the MatBERT in the CGDR, except that in the CGDR MatBERT does not output the word and sentence vectors directly but instead provides features to a downstream model for further extraction.
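The importance calculation of formulas (18)-(19) can be sketched as follows (illustrative only: the word and sentence vectors would come from the last MatBERT layer, and the threshold value 0.1 is an arbitrary placeholder):

```python
import numpy as np

def descriptor_importance(word_vecs, cls_vec):
    """Formula (18) as a sketch: inner product of each word vector with the
    sentence ([CLS]) vector from the last MatBERT layer, normalized by Softmax."""
    scores = word_vecs @ cls_vec     # E_i . S_[CLS]
    scores = scores - scores.max()   # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum()   # I_i

rng = np.random.default_rng(3)
word_vecs = rng.normal(size=(10, 768))   # 10 tokens (illustrative)
cls_vec = rng.normal(size=768)
importance = descriptor_importance(word_vecs, cls_vec)

T = 0.1                         # importance threshold, formula (19) (placeholder value)
keep = importance >= T          # True: retain the descriptor; False: delete it
```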
The FGDR can, to a certain extent, accurately screen performance-driven high-quality descriptors in the corresponding context. The effectiveness of screening high-quality associated descriptors is shown in fig. 12, taking "activation energies" as an example. The sentence "The calculated potential barriers are in good agreement with the activation energies obtained from ac measurements of polycrystalline samples." is retrieved from the knowledge base by searching for descriptors that co-occur with "activation energies". The other descriptors in the sentence ("potential barriers" and "polycrystalline samples") are identified with the MatBERT-BiLSTM-CRF model, and the importance of each word in the sentence is calculated by the FGDR. Focusing on the descriptors (i.e., "potential barriers", "activation energies", and "polycrystalline samples"), the importance of the first two descriptors in this context is greater than the threshold, while that of the last is below it. The results indicate that the FGDR can effectively screen out high-quality performance-driven descriptors.
In view of the shortcomings of the prior art, the present invention aims to mine usable high-quality descriptors from the materials-science literature at both coarse and fine granularity. The invention also comprehensively considers the problems of NER in the materials-science field and addresses the weakness of traditional methods in learning long-range dependencies and their need for external knowledge and extensive manual effort to extract and process features. Pre-training the BERT model with materials-science literature allows it to reach its best performance with fewer training iterations. In addition, the pre-trained BERT is used for data enhancement of the entities, which alleviates the shortage of named entities in materials science. Meanwhile, the method achieves accurate screening of descriptors by introducing domain knowledge, thereby obtaining high-quality descriptors that meet users' needs.
Table 4 shows the performance of the CGDR on the 8 named-entity classes, where F1-score is the harmonic mean of the precision P and the recall R. The table shows that the overall F1-score of our model is 0.87, comparable to a recent NER model (2018) with an F1-score of 0.92; that model, however, was trained and evaluated on manually tagged news articles with only three entity tags. Because the datasets differ, model performance cannot be compared directly through the values of the metrics; notably, our model is trained and evaluated on more entity tags and more complex text. The CGDR model reaches its highest F1-score (0.94) on the Composition class, while the F1-score on the Application class is only 0.58, probably because the training data for this class are scarce and the model fails to adequately capture the dependencies between these entities and their labels. The F1-scores of the other entity classes are above 0.80, indicating that the model performs well in identifying descriptors of different classes.
Table 4  Overall NER performance on the 8 entity classes
Compared with the baseline model (BiLSTM-CNNs-CRF), the F1-score of the proposed method is 0.87, an improvement of 16 percentage points over the baseline's 0.71; the results are shown in Table 5. This further verifies the validity of the CGDR model and its suitability for the automatic identification of descriptors in the materials field.
Table 5 comparison of model results
| Model | Precision | Recall | F1-score |
| BiLSTM-CNNs-CRF | 0.74 | 0.69 | 0.71 |
| DA+MatBERT-BiLSTM-CRF | 0.86 | 0.87 | 0.87 |
Furthermore, although exemplary embodiments have been described herein, the scope thereof includes any and all embodiments having equivalent elements, modifications, omissions, combinations (e.g., of the various embodiments across), adaptations or alterations as pertains to the present application. The elements in the claims are to be construed broadly based on the language employed in the claims and are not limited to examples described in the present specification or during the practice of the application, which examples are to be construed as non-exclusive. It is intended, therefore, that the specification and examples be considered as exemplary only, with a true scope and spirit being indicated by the following claims and their full scope of equivalents.
The above description is intended to be illustrative and not restrictive. For example, the above-described examples (or one or more aspects thereof) may be used in combination with each other. For example, other embodiments may be used by those of ordinary skill in the art upon reading the above description. In addition, in the above detailed description, various features may be grouped together to streamline the invention. This is not to be interpreted as an intention that the features of the claimed invention are essential to any of the claims. Rather, inventive subject matter may lie in less than all features of a particular inventive embodiment. Thus, the following claims are hereby incorporated into the detailed description as examples or embodiments, with each claim standing on its own as a separate embodiment, and it is contemplated that these embodiments may be combined with one another in various combinations or permutations. The scope of the invention should be determined with reference to the appended claims, along with the full scope of equivalents to which such claims are entitled.