Scientific and technological figure knowledge graph construction method, device and terminal based on deep learning model
Technical Field
The invention relates to the technical field of information, in particular to a scientific and technological figure knowledge graph construction method, device and terminal based on a deep learning model.
Background
With the advent of the "big data" era, the amount of data that can be used for research and analysis has seen explosive growth. However, most of these mass data are unstructured text data composed of natural language, and therefore, how to extract effective information from unstructured text information to form structured information that is easy to understand and store has become a research focus in recent years.
Text data with complex entity relationships is processed in two steps: information is first extracted from the text data, and the extracted knowledge is then stored in structured form for further processing and use. For information extraction, there are extraction methods based on manually defined rules and extraction methods based on statistical machine learning models. For structured storage, a popular approach is to construct a knowledge graph. A knowledge graph is a natural language processing technique that presents the entity information in natural language as a structured, visual graph; it is often used for NLP tasks that involve many complex relationships and require logical reasoning, and it forms a visual, macro-level association of knowledge.
In the field of scientific and technological figure profiling, scientific and technological entities and relationships need to be extracted from a large amount of text data to form a scientific and technological knowledge graph that supports further search and inference. The traditional information extraction method is manual processing, but as data is updated faster, manual processing of text data not only consumes a great deal of time and effort but also delays information updates. It is therefore important to automate scientific and technological knowledge extraction when constructing the scientific and technological knowledge graph. The deep learning model here is a multi-class classification model that extracts scientific and technological information based on the corpus features of the training set samples; such models perform well in natural language processing tasks where rules are difficult to define manually.
Disclosure of Invention
The invention aims to solve the above problems and provides a scientific and technological figure knowledge graph construction method based on a deep learning model, which comprises the following steps:
S1, constructing a sample corpus: extracting text data containing scientific and technological information, and performing entity recognition and labeling to obtain a labeled sample corpus;
S2, training an information extraction model: building a deep learning model and training it on the labeled sample corpus to obtain an information extraction model;
S3, performing scientific and technological information extraction on open-domain text data based on the information extraction model to obtain scientific and technological knowledge triples of the open-domain text;
S4, knowledge fusion and updating;
S5, constructing the scientific and technological knowledge graph based on the fused and updated scientific and technological knowledge triples of the open-domain text.
Further, the scientific and technological information comprises text information containing the entity names of scientific research units, the names of scientific researchers, and the names of scientific research achievements.
Further, the specific implementation method of step S1 is as follows:
S11, randomly sampling open-domain text information, wherein the open-domain text information comprises free news text and semi-structured encyclopedia data;
S12, screening text data containing scientific and technological information based on a named entity recognition method;
preprocessing the Chinese text data with the Jieba word segmentation tool, filtering out abnormal and unusable data, and counting word frequencies;
identifying entity names and entity categories in the text information with the Stanford NER tool, and screening text passages containing persons, organizations, articles, geographic locations and technical products, to obtain a text data set containing scientific and technological information including the entity names of scientific research units, the names of scientific researchers and the names of scientific research achievements;
S13, dividing the data into a training set and a test set at a ratio of 7:3 and labeling both sets, wherein each sentence in the corpus is labeled in the format: entity pair - entity relationship - sentence text, to obtain a labeled sample corpus; the entity relationship is categorical data, and the prediction task is a multi-class classification problem of predicting the entity relationship of a sample sentence containing scientific and technological entity information; the entity pairs comprise scientific researcher-scientific research unit, scientific researcher-scientific research achievement, and scientific researcher-scientific researcher.
Further, the specific implementation method of step S2 is as follows:
S21, extracting sentence features based on a BERT Chinese pre-training model, wherein the BERT model uses a bidirectional Transformer as the encoder to pre-train deep bidirectional representations and combines the results of the masked language model task and the next sentence prediction task; the masked language model is used to obtain word-level representations, the next sentence prediction is used to obtain sentence-level representations, and the model structure is as follows:
the input representation of BERT is the sum of the word vector, segment vector and position vector corresponding to each word, and the sample data set is pre-trained in word-vector mode;
S22, acquiring sequential information based on a bidirectional GRU model, wherein the GRU model learns from the feature vectors output by BERT to further obtain the context information and overall feature representation of the sentence vectors; on the basis of the RNN model, the GRU model introduces two gating mechanisms, a reset gate and an update gate, wherein the reset gate determines whether, and with what weight, earlier data influences the current hidden layer data, and the update gate determines the influence of the current value on the output result. As an improvement on the GRU model, the bidirectional GRU model is used to extract scientific and technological information based on the contextual semantic environment;
S23, using an attention mechanism to assign sample weights for learning sentence features, wherein the seq2seq attention mechanism is applied after the bidirectional GRU layer, and a weighted sum of a given vector set Keys is computed according to a weight vector Query;
S24, model training: the cross-entropy loss function is used as the loss function and is defined as follows:
$$L(y, f(x, \theta)) = -\sum_{c=1}^{C} y_c \log f_c(x, \theta)$$
wherein C is the number of sample label categories, θ denotes the parameters of the model, y is the label of the sample (one-hot encoded, with c-th component y_c), and f(x, θ) is the conditional probability distribution over the category labels (with c-th component f_c(x, θ)); the number of fully connected nodes of the model, the number of training iterations and the number of samples fed to the network at a time (the batch size) are tuned, the average loss and accuracy are calculated, and the prediction performance of the model is evaluated.
Further, the specific implementation method of step S3 is as follows:
S31, predicting, based on the deep learning model, the entity relationships of the sample set to be predicted that contains scientific and technological entity information, and aligning the extracted entity information with the entity relationships to obtain a scientific and technological entity relationship triple data set;
S32, performing knowledge fusion based on the scientific and technological entity relationship triple data set, wherein the knowledge fusion comprises two aspects: entity linking, namely eliminating or merging entities that share a common meaning; and knowledge merging, namely extracting information from a knowledge database or existing structured data, and further merging and organizing the knowledge to obtain a knowledge graph.
Further, the specific implementation method of step S4 is as follows: the first aspect is entity linking, namely eliminating or merging entities that share a common meaning; the second aspect is knowledge merging, namely extracting information from an established knowledge database or existing structured data, and further merging and organizing the knowledge to obtain a knowledge graph whose sources are more complete than those of a single database.
Further, the specific implementation method of step S5 is as follows: selecting a graph database, and storing the extracted scientific and technological entity relationship triples in the format of that database.
A scientific and technological knowledge graph construction device based on a deep learning model comprises:
a corpus construction module, used for obtaining a labeled sample corpus of scientific and technological text;
a relation extraction module, used for constructing a scientific and technological entity relation extraction model and predicting the scientific and technological entity relations in text;
and a knowledge graph construction module, used for performing knowledge fusion and knowledge storage according to the entity relation extraction result to obtain the scientific and technological knowledge graph.
A scientific and technological knowledge graph construction terminal, based on the above scientific and technological knowledge graph construction device based on a deep learning model, comprises a processor and a memory connected to the processor.
The invention has the following beneficial effects: the invention greatly reduces the difficulty and time cost of information extraction and effectively reduces the difficulty of knowledge graph construction, so that a timely and accurate knowledge graph can be constructed from scientific and technological text data that is continuously updated and changing.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a schematic diagram of the BERT Chinese pre-training model;
FIG. 3 is a schematic diagram of a bidirectional GRU model;
FIG. 4 is a schematic diagram of a calculation for assigning sample weights using an attention mechanism.
Detailed Description
The invention is further described below with reference to the accompanying drawings.
As shown in FIG. 1, the scientific and technological figure knowledge graph construction method based on the deep learning model comprises the following steps:
S1, constructing a sample corpus: extracting text data containing scientific and technological information, and performing entity recognition and labeling to obtain a labeled sample corpus;
The specific implementation method of S1 is as follows:
S11, randomly sampling open-domain text information, wherein the open-domain text information comprises free news text and semi-structured encyclopedia data;
S12, screening text data containing scientific and technological information based on named entity recognition technology; the scientific and technological information comprises text information containing the entity names of scientific research units, the names of scientific researchers, and the names of scientific research achievements.
Preprocessing the Chinese text data with the Jieba word segmentation tool, filtering out abnormal and unusable data, and counting word frequencies;
identifying entity names and entity types in the text information with the Stanford NER tool, and screening text passages containing persons, organizations, articles, geographic locations and technical products, to obtain a text data set containing scientific and technological information such as the entity names of scientific research units, the names of scientific researchers and the names of scientific research achievements;
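The word-frequency count and the entity-based screening of S12 can be illustrated with a minimal sketch. It assumes the text has already been segmented (for example by Jieba) and tagged (for example by Stanford NER); the tag names in `TARGET_TYPES`, the `min_entities` threshold and the sample passages are hypothetical and are not taken from the invention.

```python
from collections import Counter

# Hypothetical entity categories kept by the screen; the real tag set
# depends on the NER tool and model that are actually used.
TARGET_TYPES = {"PERSON", "ORGANIZATION", "WORK", "LOCATION", "PRODUCT"}


def word_frequencies(tokenized_passages):
    """Count token frequencies over passages that are already segmented."""
    counts = Counter()
    for tokens in tokenized_passages:
        # Skip empty or whitespace-only tokens left over from cleaning.
        counts.update(t for t in tokens if t.strip())
    return counts


def screen_passages(tagged_passages, min_entities=2):
    """Keep passages with at least `min_entities` entities of a target type.

    Two is an assumed threshold: a relation needs an entity pair.
    """
    kept = []
    for passage in tagged_passages:
        hits = [e for e in passage["entities"] if e[1] in TARGET_TYPES]
        if len(hits) >= min_entities:
            kept.append(passage)
    return kept


# Made-up sample data in an assumed record layout.
passages = [
    {"text": "Zhang San of Example University published Paper A.",
     "tokens": ["Zhang San", "of", "Example University", "published", "Paper A"],
     "entities": [("Zhang San", "PERSON"),
                  ("Example University", "ORGANIZATION"),
                  ("Paper A", "WORK")]},
    {"text": "It rained all day.",
     "tokens": ["It", "rained", "all", "day"],
     "entities": []},
]

freqs = word_frequencies(p["tokens"] for p in passages)
screened = screen_passages(passages)
```

In this sketch only the first passage survives the screen, because the second contains no entity of a target type.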
S13, dividing the data into a training set and a test set at a ratio of 7:3 and labeling both sets, wherein each sentence in the corpus is labeled in the format: entity pair (scientific researcher-scientific research unit, scientific researcher-scientific research achievement, scientific researcher-scientific researcher) - entity relationship - sentence text, to obtain a labeled sample corpus; the entity relationship is categorical data, and the prediction task is a multi-class classification problem of predicting the entity relationship of a sample sentence containing scientific and technological entity information.
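A minimal sketch of the labeling format and the 7:3 split of S13 follows. The record type `LabeledSample`, the relation name `works_at`, the fixed random seed and the synthetic sentences are illustrative assumptions only.

```python
import random
from typing import NamedTuple


class LabeledSample(NamedTuple):
    head: str      # first entity of the pair, e.g. a scientific researcher
    tail: str      # second entity, e.g. a research unit or achievement
    relation: str  # categorical label that the model must predict
    sentence: str  # sentence text that mentions both entities


def split_corpus(samples, train_parts=7, test_parts=3, seed=42):
    """Shuffle the samples and split them train:test = 7:3 by default."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    # Integer arithmetic avoids floating-point rounding at the cut point.
    cut = len(shuffled) * train_parts // (train_parts + test_parts)
    return shuffled[:cut], shuffled[cut:]


# Synthetic corpus of ten labeled sentences (made-up data).
corpus = [
    LabeledSample("Researcher %d" % i, "Institute %d" % (i % 3), "works_at",
                  "Researcher %d works at Institute %d." % (i, i % 3))
    for i in range(10)
]
train_set, test_set = split_corpus(corpus)
```

With ten samples the split yields seven training samples and three test samples, and every sample lands in exactly one of the two sets.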
S2, training an information extraction model: building a deep learning model and training it on the labeled sample corpus to obtain an information extraction model;
The specific implementation method of S2 is as follows:
S21, extracting sentence features based on a BERT Chinese pre-training model, wherein the BERT model uses a bidirectional Transformer as the encoder to pre-train deep bidirectional representations and combines the results of the "masked language model" and "next sentence prediction" tasks; the "masked language model" is used to obtain word-level representations, the "next sentence prediction" is used to obtain sentence-level representations, and the model structure is shown in FIG. 2.
The input representation of BERT is the sum of the word vector, segment vector and position vector corresponding to each word, and the sample data set is pre-trained in word-vector mode (namely, posing_training = NONE);
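The element-wise sum that forms the BERT input representation can be illustrated with a toy sketch. The two-dimensional embedding tables below are made-up values chosen for readability; a real BERT model learns much larger tables, and the function names are hypothetical.

```python
def add_vectors(*vectors):
    """Element-wise sum of equally sized vectors."""
    return [sum(components) for components in zip(*vectors)]


def input_representation(token_ids, segment_ids,
                         token_emb, segment_emb, position_emb):
    """Sum token, segment and position embeddings for every position."""
    return [
        add_vectors(token_emb[tok], segment_emb[seg], position_emb[pos])
        for pos, (tok, seg) in enumerate(zip(token_ids, segment_ids))
    ]


# Toy embedding tables (assumed values, dimension 2).
token_emb = {0: [1.0, 0.0], 1: [0.0, 1.0]}
segment_emb = {0: [0.5, 0.5], 1: [-0.5, -0.5]}
position_emb = [[0.0, 0.25], [0.25, 0.0], [0.25, 0.25]]

vectors = input_representation(
    token_ids=[0, 1, 1], segment_ids=[0, 0, 1],
    token_emb=token_emb, segment_emb=segment_emb, position_emb=position_emb,
)
```

The same token (id 1) receives different input vectors at positions 1 and 2 because its segment and position embeddings differ, which is the purpose of the three-way sum.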
S22, acquiring sequential information based on a bidirectional GRU model: the GRU model learns from the feature vectors output by BERT to further obtain the context information and overall feature representation of the sentence vectors. On the basis of the RNN model, the GRU model introduces two gating mechanisms, a reset gate and an update gate; the reset gate determines whether, and with what weight, the data at the previous moment influences the current hidden layer data, and the update gate determines the influence of the current value on the output result. As an improvement on the GRU model, the bidirectional GRU model extracts scientific and technological information based on the contextual semantic environment, and its structure is shown in FIG. 3;
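For reference, the two gates described above are commonly written as follows. This is the standard GRU formulation and is given here as an assumption, since the description does not state the equations; the weight matrices W and U, the biases b, and the choice of which term the update gate scales (conventions differ between references) are not taken from the invention.

$$r_t = \sigma(W_r x_t + U_r h_{t-1} + b_r)$$
$$z_t = \sigma(W_z x_t + U_z h_{t-1} + b_z)$$
$$\tilde{h}_t = \tanh(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h)$$
$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$$

Here x_t is the BERT feature vector at position t, r_t is the reset gate, z_t is the update gate, and h_t is the hidden state. A bidirectional GRU runs one such recurrence from left to right and another from right to left, and concatenates the two hidden states at each position: $h_t = [\overrightarrow{h}_t ; \overleftarrow{h}_t]$.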
S23, introducing an attention mechanism to assign sample weights for learning sentence features: the seq2seq attention mechanism is applied after the bidirectional GRU layer, and a weighted sum of a given vector set Keys is computed according to a weight vector Query. The specific calculation process is shown in FIG. 4: stage 1 scores each vector in Keys against the Query, stage 2 normalizes the scores with a softmax-type function to obtain the attention distribution, and stage 3 computes the weighted sum according to the attention distribution;
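The three stages can be illustrated with a minimal sketch. Dot-product scoring is an assumption made here for simplicity, as the description does not fix the scoring function; the function names and the toy vectors are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values=None):
    """Return (attention weights, weighted sum) for one query vector."""
    if values is None:
        values = keys  # sum over the key vectors themselves
    # Stage 1: score each key against the query (assumed dot product).
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    # Stage 2: normalize the scores into an attention distribution.
    weights = softmax(scores)
    # Stage 3: weighted sum of the value vectors.
    dim = len(values[0])
    context = [sum(w * v[i] for w, v in zip(weights, values))
               for i in range(dim)]
    return weights, context


# Toy stand-ins for three bidirectional GRU output vectors.
query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, context = attend(query, keys)
```

In this toy input the first and third keys align equally well with the query and therefore receive equal, larger weights than the second key.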
S24, model training: the cross-entropy loss function is used as the loss function and is defined as follows:
$$L(y, f(x, \theta)) = -\sum_{c=1}^{C} y_c \log f_c(x, \theta)$$
wherein C is the number of sample label categories, θ denotes the parameters of the model, y is the label of the sample (one-hot encoded, with c-th component y_c), and f(x, θ) is the conditional probability distribution over the category labels (with c-th component f_c(x, θ)); the number of fully connected nodes of the model, the number of training iterations and the number of samples fed to the network at a time (the batch size) are tuned, the average loss and accuracy are calculated, and the prediction performance of the model is evaluated.
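As a worked illustration of the loss and of the average loss and accuracy computed during tuning, the following sketch evaluates two made-up predictions over three relation categories. The one-hot label encoding, the clipping constant `eps` and the function names are assumptions made for the example.

```python
import math


def cross_entropy(one_hot_label, probs, eps=1e-12):
    """Cross-entropy between a one-hot label and predicted probabilities."""
    # Clip probabilities at eps so that log(0) cannot occur.
    return -sum(y * math.log(max(p, eps))
                for y, p in zip(one_hot_label, probs))


def average_loss_and_accuracy(labels, predictions):
    """Mean cross-entropy and share of samples whose argmax is correct."""
    losses, correct = [], 0
    for y, p in zip(labels, predictions):
        losses.append(cross_entropy(y, p))
        correct += int(p.index(max(p)) == y.index(1))
    return sum(losses) / len(losses), correct / len(labels)


# Two samples, three relation categories (made-up numbers).
labels = [[1, 0, 0], [0, 1, 0]]
predictions = [[0.5, 0.25, 0.25], [0.5, 0.25, 0.25]]
avg_loss, accuracy = average_loss_and_accuracy(labels, predictions)
```

The first sample contributes a loss of log 2 and the second a loss of log 4, so the average loss is 1.5 log 2 (about 1.04), and one of the two predictions is correct.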
S3, performing scientific and technological information extraction on open-domain text data based on the information extraction model to obtain scientific and technological knowledge triples of the open-domain text;
The specific implementation method of S3 is as follows:
S31, predicting, based on the deep learning model, the entity relationships of the sample set to be predicted that contains scientific and technological entity information, and aligning the extracted entity information with the entity relationships to obtain a scientific and technological entity relationship triple data set;
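The alignment of entity pairs with predicted relations in S31 can be sketched as follows. The function `keyword_predictor` is a hypothetical stand-in for the trained information extraction model, and the relation names and the `no_relation` label are illustrative assumptions.

```python
def build_triples(samples, predict_relation, no_relation="none"):
    """Align each entity pair with its predicted relation as a triple."""
    triples = []
    for head, tail, sentence in samples:
        relation = predict_relation(head, tail, sentence)
        # Keep only the pairs for which a relation was predicted.
        if relation != no_relation:
            triples.append((head, relation, tail))
    return triples


def keyword_predictor(head, tail, sentence):
    """Hypothetical stand-in for the trained model, for illustration only."""
    if "works at" in sentence:
        return "works_at"
    if "published" in sentence:
        return "author_of"
    return "none"


# Made-up samples to be predicted: (head entity, tail entity, sentence).
to_predict = [
    ("Zhang San", "Example University",
     "Zhang San works at Example University."),
    ("Zhang San", "Paper A", "Zhang San published Paper A."),
    ("Zhang San", "Li Si", "Zhang San met Li Si."),
]
triples = build_triples(to_predict, keyword_predictor)
```

In practice the predictor would be the model obtained in S2; the wrapper only shows how predictions and entity pairs are combined into (head, relation, tail) triples.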
S32, performing knowledge fusion based on the scientific and technological entity relationship triple data set, wherein the knowledge fusion comprises two aspects: the first is entity linking, namely eliminating or merging entities that share a common meaning; the second is knowledge merging, namely extracting information from an established knowledge database or existing structured data, and further merging and organizing the knowledge to obtain a knowledge graph whose sources are more complete than those of a single database.
S4, knowledge fusion and updating, which comprises two aspects: the first is entity linking, namely eliminating or merging entities that share a common meaning; the second is knowledge merging, namely extracting information from an established knowledge database or existing structured data, and further merging and organizing the knowledge to obtain a knowledge graph whose sources are more complete than those of a single database.
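A minimal sketch of the two aspects of knowledge fusion is given below. It assumes that entity linking can be expressed as a lookup in an alias table that maps surface forms to canonical names; the table, the triples and the function names are made up for illustration, and a real system may use a more elaborate linking method.

```python
def link_entity(name, alias_table):
    """Map a surface form to its canonical entity name if one is known."""
    return alias_table.get(name, name)


def fuse(extracted, existing, alias_table):
    """Link entities, then merge both triple sources without duplicates."""
    merged = set()
    for head, relation, tail in list(extracted) + list(existing):
        merged.add((link_entity(head, alias_table),
                    relation,
                    link_entity(tail, alias_table)))
    return sorted(merged)


# Hypothetical alias table: surface form -> canonical name.
aliases = {"Example Univ.": "Example University"}
extracted = [("Zhang San", "works_at", "Example Univ.")]
existing = [("Zhang San", "works_at", "Example University"),
            ("Li Si", "works_at", "Example University")]
fused = fuse(extracted, existing, aliases)
```

After linking, the extracted triple coincides with a triple already present in the existing data, so the fused result contains two triples rather than three.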
S5, constructing the scientific and technological knowledge graph based on the scientific and technological knowledge triples of the open-domain text: a suitable graph database is selected, and the extracted scientific and technological entity relationship triples are stored in the format of that database, so that the knowledge graph is stored persistently and can be further processed and used.
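As one possible illustration of S5, the sketch below turns a triple into a parameterized Cypher statement. The use of Neo4j and Cypher is an assumption, since the description only calls for "a suitable graph database"; the node label `Entity` and the helper names are hypothetical, and the statement is only generated here, not executed.

```python
import re


def relation_type(relation):
    """Turn a relation label into a safe Cypher relationship type.

    Characters outside [A-Za-z0-9_] are replaced by underscores, so
    non-ASCII labels would need a different mapping in practice.
    """
    cleaned = re.sub(r"[^A-Za-z0-9_]", "_", relation).upper()
    if not cleaned or not cleaned[0].isalpha():
        cleaned = "REL_" + cleaned
    return cleaned


def triple_to_cypher(head, relation, tail):
    """Build a MERGE statement and its parameters for one triple."""
    # Relationship types cannot be passed as query parameters, so the
    # sanitized type is interpolated; entity names stay parameterized.
    query = (
        "MERGE (h:Entity {name: $head}) "
        "MERGE (t:Entity {name: $tail}) "
        "MERGE (h)-[:%s]->(t)" % relation_type(relation)
    )
    return query, {"head": head, "tail": tail}


query, params = triple_to_cypher("Zhang San", "works at",
                                 "Example University")
```

MERGE is chosen over CREATE in this sketch so that storing the same triple again does not duplicate nodes or relationships, which matches the updating behavior described for the knowledge graph.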
A scientific and technological knowledge graph construction device based on a deep learning model comprises:
a corpus construction module, used for obtaining a labeled sample corpus of scientific and technological text;
a relation extraction module, used for constructing a scientific and technological entity relation extraction model and predicting the scientific and technological entity relations in text;
and a knowledge graph construction module, used for performing knowledge fusion and knowledge storage according to the entity relation extraction result to obtain the scientific and technological knowledge graph.
A scientific and technological knowledge graph construction terminal, based on the above scientific and technological knowledge graph construction device based on a deep learning model, comprises a processor and a memory connected to the processor.
Those skilled in the art will appreciate that all or part of the steps of the above-described methods may be implemented by a program instructing the associated hardware; the program may be stored in a computer-readable storage medium, and the storage medium may include ROM, RAM, magnetic disks, optical disks, and the like.
According to the scientific and technological figure knowledge graph construction method, device and terminal based on the deep learning model, the information extraction problem is formulated as a multi-class classification problem, and the scientific and technological entity relationships to be extracted are defined as predefined categorical variables, so that they can be learned with a supervised deep learning model. Unlike the classification of numerical samples, the classification of natural language text requires quantifiable features to be extracted from the text data; because text data is highly complex and irregular, such features are difficult to extract with the traditional approach of manually defined rule templates, so a pre-training model is introduced to perform feature extraction and vectorization based on the sample data. In the field of scientific and technological knowledge graph construction, using a deep learning model greatly reduces the difficulty and time cost of information extraction and effectively reduces the difficulty of knowledge graph construction, so that a timely and accurate knowledge graph can be constructed from scientific and technological text data that is continuously updated and changing. The application of the deep learning model is not limited to this field: the invention provides a prediction framework based on data features extracted automatically from sample information, and the model framework and model parameters can be adjusted flexibly according to actual requirements for different types of knowledge and for texts in different fields, so that the invention has the potential to serve as a general-purpose information extraction algorithm, device and terminal.
The method and the device can reduce the cost of extracting and fusing scientific and technological knowledge, and can update the scientific and technological knowledge graph as the scientific and technological text data changes, thereby ensuring the timeliness of the knowledge graph.
The technical solution of the present invention is not limited to the above specific embodiments, and all technical modifications made according to the technical solution of the present invention fall within the protection scope of the present invention.