Disclosure of Invention
In view of the problems in the prior art, the invention aims to provide a knowledge graph network construction method, system and medium based on massive scientific research data. By mining historical topic data, the invention captures the research progress and achieved technical level of key technologies, and analyzes and extracts future technology development trends.
The technical scheme of the invention is as follows:
A knowledge graph network construction method based on massive scientific research data, comprising the following steps:
1) parsing the project library, extracting the basic information of each topic, and analyzing the document information and subject direction of each topic; wherein the basic information comprises the field to which the topic belongs;
2) extracting subject words from the title information of the designated documents of each topic to serve as the key technologies of that topic;
3) clustering the subject directions belonging to the same field; for the topics in the same clustering result, parsing key technical indexes from the general task book and the requirement analysis description of each topic, and then associating key technologies with key technical indexes according to the degree of correlation between the key technologies and the key technical indexes of each topic in the same clustering result, so that each key technology corresponds to a plurality of key technical indexes; and finally generating a relation table among the fields, subject directions, topics, key technologies and key technical indexes, which is the knowledge graph of the project library.
Further, the method for analyzing the subject direction comprises the following steps:
1) performing word segmentation on the target document content of the topic, and forming a directed graph from the word segmentation results;
2) for each word node V_i in the directed graph, calculating the final weight S(V_i) of the node by the weighted TextRank formula (formula (3) of the detailed description); wherein In(V_i) is the set of word nodes pointing to V_i, Out(V_j) is the set of word nodes that V_j points to, d is the adjustment coefficient, w_ji is the weight of the edge from word node V_j to word node V_i, and w_i is the composite weight of word node V_i;
3) selecting several words as the subject direction of the topic according to the final weights of the word nodes.
Further, the composite weight of word node V_i is w_i = w_1*A_i + w_2*B_i + w_3*C_i + w_4*D_i; wherein A_i is the TF-IDF of word node V_i, with weight w_1; B_i represents the position of word node V_i, with weight w_2; C_i represents the part of speech of word node V_i, with weight w_3; and D_i represents the word length of word node V_i, with weight w_4.
Further, the method for obtaining the key technologies of a topic comprises: extracting subject words from the titles of the key technology documents and development summary reports of the topic to serve as key technologies of the topic; parsing the text under the technical sections of the key technology documents and development summary reports, searching that text for words followed by the word "technology", judging whether the word immediately preceding "technology" is a noun, a nominal verb or a verb, and if so, taking that word together with "technology" as a key technology of the topic; and searching the same text for noun-verb or verb-noun combinations, and taking a combination as a key technology if it is a keyword of the text.
Further, the subject directions and the key technologies are each merged: identical or similar subject directions are merged together, and identical or similar key technologies are merged together.
Further, for a new subject direction, its similarity to each subject direction in the same field is calculated and the maximum value max is taken; if max exceeds a set threshold K, the new subject direction is taken as an alternative item of the most similar subject direction, and otherwise the new subject direction is added to the corresponding field. For a new key technology, its similarity to each key technology in the same subject direction is calculated and the maximum value max is taken; if max exceeds a set threshold G, the new key technology is taken as an alternative item of the most similar key technology, and otherwise the new key technology is added to the corresponding subject direction.
Further, the basic information also comprises a topic number, a topic name, a contract number, an undertaking unit, a topic security level, an application number, a contact person of the responsible office, a responsible person of the science and technology department, and participants; the document information comprises a topic number, a topic name, a result form, main research contents, scientific researchers, key technologies and key technical indexes.
A knowledge graph network construction system based on massive scientific research data, characterized by comprising a topic analysis module, a topic key technology extraction module and a knowledge graph generation module; wherein,
the topic analysis module is used for parsing the project library, extracting the basic information of each topic, and analyzing the document information and subject direction of each topic; wherein the basic information comprises the field to which the topic belongs;
the topic key technology extraction module is used for extracting subject words from the title information of the designated documents of each topic to serve as the key technologies of the corresponding topic;
the knowledge graph generation module is used for clustering the subject directions belonging to the same field; for the topics in the same clustering result, parsing key technical indexes from the general task book and the requirement analysis description of each topic, and then associating key technologies with key technical indexes according to the degree of correlation between the key technologies and the key technical indexes of each topic in the same clustering result, so that each key technology corresponds to a plurality of key technical indexes; and finally generating a relation table among the fields, subject directions, topics, key technologies and key technical indexes, which is the knowledge graph of the project library.
A computer-readable storage medium, characterized in that it stores a computer program comprising instructions for performing the steps of any of the methods described above.
In the invention, information extraction for the scientific research knowledge graph obtains fixed attributes directly from the project library files, and obtains attributes that are absent from the project library by parsing the topic documents. The data are preprocessed to remove noise, and the data that conform to the project specification are parsed and stored in a database. For subject direction analysis, each topic should have one or more subject directions; the topic name and main research content are analyzed by a computer program, and the extracted subject words are taken as the directions of the topic. The problem of the same meaning being expressed with different words is handled through a synonym library, so that synonymous or identical directions are clustered together. For key technology analysis and clustering, key technologies are extracted based on word frequency and word meaning. Using the synonym library and similarity calculation with a fixed threshold, the similarity between key technologies is calculated and highly related key technologies are clustered together, which reduces the number of key technology categories. The engine first reads the configuration file, monitors the files under the fixed directory named in the configuration file, parses the project library, and stores the basic topic information. It processes the existing historical topic documents and receives new topic documents. Information is extracted from the topic documents and divided into basic information, subject direction and key technology information, and the extracted directions and key technologies are refined through a manual candidate system. Finally, similarity is calculated for the subject directions and key technologies, and the association table is completed.
The information extraction module of the scientific research knowledge graph is divided into three parts: the WEB front end, the WEB back end and the engine. The WEB front end interacts with the user: the user operates a page and the request is sent to the back end. The back end mainly handles business logic, data processing and abstract extraction. The engine is responsible for extracting information from scientific research documents, extracting basic information, and merging subject directions and key technologies; it mainly processes data in the database. The architecture of the scientific research knowledge graph is shown in figure 1. The key technologies used are: data analysis and data mining; the lightweight web framework SSI; Apache POI and PDFBox.
The user operates the front-end system from a notebook computer. The front end communicates with the back end, which performs business processing by interacting with the database, while high-speed retrieval on the server is provided by the engine. The engine is responsible for parsing data, extracting information and high-speed search. The back end communicates with the engine through a Socket to speed up data transfer. The knowledge graph architecture diagram is shown in figure 2. The whole system receives data and documents through one server, which parses the documents, extracts their information, and stores the necessary topic information for use by other modules. The user accesses the system through a browser on a notebook computer; the deployment diagram of the knowledge graph is shown in figure 3.
A series of operations can be performed on the knowledge graph, such as locking, abbreviating, adding, deleting and searching, as well as adding or deleting versions. The knowledge graph is organized in sequence into levels: version, field, direction, key technology and key technical index. Each level has a plurality of nodes, and each node can be added, deleted, modified and moved. For example, when one version is selected, the version includes a plurality of fields, each field includes a plurality of directions, each direction node includes a plurality of key technologies, and each key technology includes a plurality of key technical indexes.
1) Parsing the project library
By parsing the project library, the basic information of each topic (topic number, topic name, contract number, undertaking unit, topic security level, application number, field, contact person of the responsible office, responsible person of the science and technology department, participants and the like) is extracted and stored in the database, and the status is set to 0. The logic flowchart for parsing the project library is shown in figure 4.
2) Analyzing the document information of each topic
To address incomplete document contents, all documents of each topic are read, and content information (topic number, topic name, result form, main research contents, scientific researchers, key technologies, key technical indexes and the like) is extracted from multiple documents to ensure the completeness of the data. The document classification and parsing flowchart is shown in fig. 5. First, a FilterMap set is initialized, and the files or folders under the topic are obtained through File. Files are stored in the FilterMap set by file type under fixed keys (the general task book has key 1 and the requirement specification has key 2), so that the files containing the most attributes are extracted first and in a fixed order. Folders are stored in the FileList; if an acceptance document exists, only the acceptance document is stored and it is taken as authoritative. The entries are stored in fileMap by file type. Parsing a document produces a parse result, basic, and the topic number is then looked up in the project map. If the topic number is present, the database is updated directly and the project map removes the record for that topic number; if it is not present but the document is an acceptance document, the database is likewise updated and the project map removes the record for that topic number. The topic document is then stored in the retrieval subsystem.
3) Analyzing the subject direction of each topic
Each topic has one or more subject directions, and the subject directions of a topic are extracted from multiple documents. First, a word segmenter segments the documents and stop words are removed; because research subject directions are typically combinations of nouns and verbs, only nouns, verbs and adjectives are retained. The overall design of subject direction extraction is shown in fig. 6. The topic name and the main research content are analyzed by a computer program, the keywords of the text are calculated by an improved TextRank algorithm in an extractive manner, and the direction of the topic is determined in combination with semantic analysis.
4) Analyzing the key technologies of each topic
First, subject words are extracted from the title information in the key technology research report and the development summary report of the topic; then text analysis based on word frequency, part of speech and semantics is applied to determine the key technologies. The key technology analysis flowchart is shown in FIG. 7.
5) Analyzing key technical indexes
The key technical indexes are parsed from the index content in the general task book and the requirement analysis description of the topic. Using a full-text retrieval model, each key technology is treated as a query term and each technical index as a document; the relevance between key technologies and key technical indexes is compared, and the indexes relevant to a key technology are associated with it. Each key technology corresponds to zero or more key technical indexes. The key technical index extraction flowchart is shown in fig. 8.
6) Merging of subject directions and key technologies
The field, subject direction, key technology and key technical index are regarded as one whole knowledge graph, and similar subject directions and key technologies are merged so that the development trends of the subject directions and key technologies within a field can be observed. Therefore, when topic information is stored, the associations among field, subject direction, key technology and key technical index are recorded. Subject directions and key technologies with high similarity are merged (see fig. 9). A field has multiple directions, a direction has multiple key technologies, and a key technology has multiple key technical indexes, forming multiple many-to-many relations.
Compared with the prior art, the invention has the following positive effects:
In functional testing of information extraction, the file monitoring module correctly monitors files and generates result files (.ok); the topic document parsing module successfully parses documents and reports topics that are not in the library; and the merging of subject directions and key technologies meets expectations, with similar subject directions and key technologies being merged. In the information extraction performance test, topic documents of different sizes were parsed and the parsing speed did not change noticeably, indicating that parsing speed is not directly related to document size. In the parsing completeness test, more than 90% of the attributes were parsed, which meets the requirement. The IP address of the server must be configured at first login. When an operator uploads a text, a data source must be selected; clicking Import completes the upload, and the history of imported data can be viewed. Importing supports linkage and content in multiple languages. The user can perform a series of operations on the scientific research knowledge graph, such as locking, abbreviating, adding, deleting and searching, as well as adding or deleting versions. The knowledge tree is organized in sequence into version, field, direction, key technology and key technical index, and the user can operate on each node: every node can be added, deleted, modified and moved. The navigation bar displays all version information, and clicking a version switches the knowledge tree to that version. A node being moved can first be placed in the navigation bar, which serves as a cache area for node movement. The user can use the navigator to change the position at which the knowledge tree is displayed.
After the user clicks a node, the area on the right shows all the topics associated with that node, and the user can use the time selection box in that area to view the topics of a given year.
Detailed Description
In order to make the aforementioned and other features and advantages of the invention more comprehensible, embodiments accompanied with figures are described in detail below.
Extracting literature information for the scientific research knowledge graph requires reading and parsing the project library file and storing the basic topic information in the database; parsing the topic documents to extract the other important information of each topic; processing the topics through the candidate system; and finally inserting the complete data into the database. Similarity is then calculated for the subject directions and key technologies, those with high similarity are merged so that items of the same kind are grouped together, and the association relation field-direction-topic-key technology is formed.
The overall scheme design for extracting the subject document comprises the following nine points:
1) Monitor the directory in which the project library is stored; if there is an unprocessed or newly added project library file, parse it and store the basic topic information in the database. A check is made when storing: if the database has no record for the topic number, the record is stored with its status set to 0; otherwise it is not stored.
2) Query the database for all records with status 0 (only basic information has been extracted from the project library and the key information has not yet been parsed), and store (topic number, id) in the system global variable Map.
3) Find unprocessed documents (the folder exists but there is no corresponding ok file) and store them in the queue.
4) If an unprocessed document is a project library file, parse it, store the topic information in the database in batches, and store the topic number and id in the system global variable Map. If parsing succeeds, an ok file is generated.
5) If an unprocessed document is a topic document, store it in the queue awaiting processing until a thread becomes idle.
6) Monitor the folders through a FileMonitor and store newly added folders (scientific research documents) in the queue awaiting processing.
7) A thread monitors the queue; if the queue is not empty, the head of the queue is removed and parsed. If the topic number exists in the Map, the topic is updated in the database, its status in the database is set to 2, and the topic information is passed to the retrieval system.
8) Through manual processing and verification, the basic data of the topic (subject direction and key technology) are corrected and updated, and the status is set to 1, which is the final state.
9) The topic table is processed: the sorted basic topic information is inserted into the field table, the subject direction table, the key technology index table, the field-direction association table, the direction-topic association table and the key technology-topic association table respectively, and the status of the field, direction, key technology and topic association records is set to 1.
The overall information extraction flowchart is shown in fig. 10.
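The status transitions in steps 1) to 8) can be illustrated with a minimal in-memory sketch. The class and method names (`TopicPipeline`, `add_from_library`, `process_queue`, `confirm`) are hypothetical and a plain dictionary stands in for the database; only the status values 0, 1 and 2 and the role of the global Map come from the description above.

```python
from collections import deque

STATUS_BASIC = 0    # only basic information from the project library
STATUS_FINAL = 1    # manually verified, final state
STATUS_PARSED = 2   # topic documents parsed, awaiting manual review


class TopicPipeline:
    """Hypothetical stand-in for the engine's bookkeeping."""

    def __init__(self):
        self.database = {}    # topic number -> record (stands in for the database)
        self.pending = {}     # topic number -> id, mirrors the global Map of step 2
        self.queue = deque()  # extracted topic documents waiting for a worker
        self._next_id = 1

    def add_from_library(self, number, basic_info):
        """Step 1: store a topic only if its number is not yet in the database."""
        if number in self.database:
            return False
        record = dict(basic_info, id=self._next_id, status=STATUS_BASIC)
        self.database[number] = record
        self.pending[number] = record["id"]
        self._next_id += 1
        return True

    def enqueue_document(self, number, extracted):
        """Steps 5 and 6: queue the information extracted from a topic document."""
        self.queue.append((number, extracted))

    def process_queue(self):
        """Step 7: update topics found in the Map and return those forwarded."""
        forwarded = []
        while self.queue:
            number, extracted = self.queue.popleft()
            if number in self.pending:
                self.database[number].update(extracted)
                self.database[number]["status"] = STATUS_PARSED
                del self.pending[number]
                forwarded.append(number)  # would be sent to the retrieval system
        return forwarded

    def confirm(self, number, corrections):
        """Step 8: apply manual corrections and mark the record final."""
        self.database[number].update(corrections)
        self.database[number]["status"] = STATUS_FINAL
```

Documents whose topic number is not in the Map are simply skipped in this sketch; how the real system reports them is not modeled here.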
Subject direction extraction uses the TextRank algorithm and addresses its shortcomings by adopting an improved TextRank algorithm based on composite weights: the composite weight of each word is calculated using the G1 weighting method, and the improved TextRank algorithm then uses these composite weights to calculate the keywords of the text. Extracting the subject direction requires processing the main research contents of multiple documents. First, a word segmenter segments the documents and stop words are removed; because research subject directions are typically combinations of nouns and verbs, only nouns, verbs and adjectives are retained. The system uses the HanLP word segmenter, which segments the text and returns the part of speech of each segmentation result at the same time. The keywords of the topic are extracted using the improved TextRank algorithm. In addition, the main research content is segmented, and the final subject direction is determined by combining semantic segmentation of the main research content with the keywords.
TextRank is a graph-based algorithm for extracting keywords and subject words from a text, thereby summarizing the research content of the text. The weight of each point in the graph is calculated, and each weight is influenced by the other points: the larger the weight of a word, the larger the weights of the points connected to it. The composite weight is calculated by formula (1):
w_i = w_1*A_i + w_2*B_i + w_3*C_i + w_4*D_i    (1)
A: the TF-IDF of the word, where TF is the term frequency and IDF is the inverse document frequency. B: the position of the word (beginning, end or middle of the sentence). C: the part of speech. D: the word length. w_1 to w_4: the respective weights.
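Formula (1) is a plain weighted sum, which the following sketch makes concrete. The feature scaling and the weight values are placeholders chosen for illustration; in the described system the weights come from the G1 weighting method, which is not reproduced here, and the names `WordFeatures` and `composite_weight` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class WordFeatures:
    tfidf: float     # A_i: TF-IDF score of the word
    position: float  # B_i: score for the word's position in the sentence
    pos_tag: float   # C_i: score for the word's part of speech
    length: float    # D_i: score for the word's length


def composite_weight(features, weights=(0.4, 0.2, 0.2, 0.2)):
    """Formula (1): w_i = w_1*A_i + w_2*B_i + w_3*C_i + w_4*D_i.

    The default weights are illustrative placeholders, not values
    taken from the described system.
    """
    w1, w2, w3, w4 = weights
    return (w1 * features.tfidf + w2 * features.position
            + w3 * features.pos_tag + w4 * features.length)
```

For instance, a word with features (0.5, 1.0, 0.0, 0.5) under the placeholder weights has composite weight 0.4*0.5 + 0.2*1.0 + 0.2*0.0 + 0.2*0.5 = 0.5.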
The TextRank algorithm forms a directed graph from the word segmentation results of the text. Let G = (V, E) be a directed graph with point set V and edge set E, where E is a subset of V × V. For a word node V_i in the graph, let In(V_i) be the set of points pointing to V_i and Out(V_i) be the set of points that V_i points to. The weight of word node V_i can then be calculated by formula (2):
S(V_i) = (1 - d) + d * Σ_{V_j ∈ In(V_i)} S(V_j) / |Out(V_j)|    (2)
where d is an adjustment coefficient, generally 0.85.
However, the TextRank algorithm treats all points alike, that is, it assumes that every point has the same importance, which is not the case in real text. The invention therefore calculates a weight for each point and gives important points a larger weight. The calculation formula thus becomes formula (3), in which w_ji is the weight of the edge from point v_j to point v_i.
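For reference, the weighted TextRank update as published in the literature can be written with the symbols defined above, as in the first equation below. The second equation shows one plausible way for the composite weight w_i of formula (1) to enter the update; that placement is an assumption made here for illustration and is not stated in the text.

```latex
% Weighted TextRank from the literature, using the symbols defined above
S(V_i) = (1 - d) + d \sum_{V_j \in In(V_i)}
         \frac{w_{ji}}{\sum_{V_k \in Out(V_j)} w_{jk}} \, S(V_j)

% Assumed variant: the composite weight w_i scales the (1 - d) term
S(V_i) = (1 - d)\, w_i + d \sum_{V_j \in In(V_i)}
         \frac{w_{ji}}{\sum_{V_k \in Out(V_j)} w_{jk}} \, S(V_j)
```

With all edge weights equal, the first equation reduces to the unweighted TextRank update, since the fraction becomes 1/|Out(V_j)|.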
The improved TextRank keyword extraction algorithm comprises the following steps:
1) using a word segmenter to segment the main research content; the set of all words forms the point set V, and words are associated with one another according to the segmentation result to establish the corresponding edge set E;
2) calculating the weights of all points in the set V by formula (2), iterating until the result converges;
3) after the weight of each point has been calculated, sorting the points by weight in descending order and selecting the phrases within a certain range as the keywords of the text.
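The iterate-until-convergence loop in steps 2) and 3) can be sketched as follows. This is an illustration under assumptions, not the system's implementation: edge weights normalize each node's outgoing contribution as in weighted TextRank, and the optional per-node weight scales the (1 - d) term, which is one assumed way of using the composite weight. The names `weighted_textrank` and `top_keywords` are hypothetical.

```python
def weighted_textrank(edges, node_weight=None, d=0.85, tol=1e-6, max_iter=200):
    """Iterate a weighted TextRank update until the scores converge.

    edges[u][v] is the weight of the edge u -> v. node_weight maps a node
    to its composite weight (assumed to scale the (1 - d) term); nodes
    without an entry default to 1.0.
    """
    node_weight = node_weight or {}
    nodes = set(edges) | {v for targets in edges.values() for v in targets}
    out_sum = {u: sum(edges.get(u, {}).values()) for u in nodes}
    score = {v: 1.0 for v in nodes}
    for _ in range(max_iter):
        new_score = {}
        for v in nodes:
            # Contribution of every node u that points to v.
            incoming = sum(
                edges[u][v] / out_sum[u] * score[u]
                for u in nodes
                if v in edges.get(u, {}) and out_sum[u] > 0
            )
            new_score[v] = (1 - d) * node_weight.get(v, 1.0) + d * incoming
        delta = max(abs(new_score[v] - score[v]) for v in nodes)
        score = new_score
        if delta < tol:
            break
    return score


def top_keywords(score, n):
    """Step 3: sort by weight in descending order and keep the first n."""
    return sorted(score, key=score.get, reverse=True)[:n]
```

In a small star-shaped graph where two words each link to and from a central word, the central word ends up with the highest score, which matches the intuition that well-connected words are better keyword candidates.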
The specific algorithm of the improved TextRank is as follows:
a) Remove stop words from the text to obtain the processed result set.
For example: [software, personnel, programmer, senior, programmer, system, analyst, project, manager]
b) For each word, the 5 words before and after it are taken and marked as words it can be associated with.
{
software = [personnel, programmer, senior, system],
personnel = [software, programmer, senior, system, analyst],
programmer = [software, personnel, senior, programmer, system, analyst, project, manager],
senior = [software, personnel, programmer, system, analyst, project, manager],
system = [software, personnel, senior, programmer, analyst, project, manager],
analyst = [personnel, senior, programmer, system, project, manager],
project = [senior, programmer, system, analyst, manager],
manager = [senior, programmer, system, analyst, project]
}.
c) Calculate the weight between words according to their distance, that is, for word i, calculate its distance to each of the 5 words before and after it and derive the edge weight from that distance. The distance-based weight formula is w_ji = (5 - k + 1)/5, where k is the number of words separating the two words; for repeated words the weights are averaged.
{
software = [personnel (1), programmer (0.8), senior (0.6), system (0.2)],
personnel = [software (1), programmer (0.8), senior (0.8), system (0.4), analyst (0.2)],
programmer = [software (0.8), personnel (1), senior (1), programmer (0.8), system (0.8), analyst (0.6), project (0.4), manager (0.2)],
senior = [software (0.6), personnel (0.8), programmer (1), system (0.8), analyst (0.6), project (0.4), manager (0.2)],
system = [software (0.2), personnel (0.4), senior (0.8), programmer (0.8), analyst (1), project (0.8), manager (0.6)],
analyst = [personnel (0.2), senior (0.6), programmer (0.6), system (1), project (0.8), manager (0.6)],
project = [senior (0.4), programmer (0.4), system (0.8), analyst (1), manager (1)],
manager = [senior (0.2), programmer (0.4), system (0.6), analyst (0.8), project (1)]
}.
d) Calculate the composite weight w_i of each word according to formula (1).
e) Substitute the result of formula (1) into formula (3), recalculate the weight of each word, and iterate formula (3) until convergence.
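Steps b) and c) can be sketched as a sliding-window pass over the segmented words. The function name `window_edge_weights` is hypothetical, and the averaging convention is an assumption: when a word pair occurs at several distances, this sketch averages the weights over every occurrence, and it skips pairs of identical words, so individual values may differ from the table above where a different convention was applied.

```python
from collections import defaultdict


def window_edge_weights(words, window=5):
    """Distance-based edge weights, w = (window - k + 1) / window.

    k is the distance in words between two words that lie within
    `window` positions of each other. Weights of repeated pairs
    are averaged (an assumed convention).
    """
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(lambda: defaultdict(int))
    for i, word in enumerate(words):
        start = max(0, i - window)
        stop = min(len(words), i + window + 1)
        for j in range(start, stop):
            other = words[j]
            if j == i or other == word:
                continue  # skip the word itself and identical words
            k = abs(i - j)
            sums[word][other] += (window - k + 1) / window
            counts[word][other] += 1
    return {
        word: {other: sums[word][other] / counts[word][other]
               for other in sums[word]}
        for word in sums
    }


words = ["software", "personnel", "programmer", "senior", "programmer",
         "system", "analyst", "project", "manager"]
weights = window_edge_weights(words)
```

For the word "personnel" this gives software 1.0, programmer 0.8 (the average of 1.0 and 0.6), senior 0.8, system 0.4 and analyst 0.2, which agrees with the corresponding row of the table.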
Topic document content extraction and semantic fusion extracts the key technologies of a topic from different documents. Following the key technology analysis flow shown in fig. 7, the key technologies of a topic are extracted from the following parts.
1) Extracting titles from the key technology documents and the development summary reports, filtering the titles by replacing filter words with empty strings, and taking the remaining string as a key technology if its length is greater than 5;
2) based on semantic analysis, parsing the text under the technical sections of the key technology documents and development summary reports, searching for words followed by the word "technology", and judging whether the word immediately preceding "technology" is a noun, a nominal verb or a verb; if so, that word together with "technology" is taken as a key technology;
3) as in the subject direction analysis, searching the text under the technical sections of the key technology documents and development summary reports for noun-verb or verb-noun combinations, and taking a combination as a key technology if it is a keyword of the text.
Statistics show that the titles extracted from the key technology documents and development summary reports may contain words that interfere with key technology extraction, and that some titles are not technical titles at all; such words must be filtered out when they appear.
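The title filtering in part 1) can be sketched as follows. The function name and the filter word list are hypothetical; the source gives only the rule that filter words are replaced by empty strings and that a remainder longer than 5 characters is kept. The threshold was presumably tuned for Chinese titles, so the English strings below are for illustration only.

```python
def key_technologies_from_titles(titles, filter_words, min_length=5):
    """Strip filter words from each title and keep sufficiently long remainders."""
    results = []
    for title in titles:
        cleaned = title
        for word in filter_words:
            cleaned = cleaned.replace(word, "")  # replace with the empty string
        cleaned = cleaned.strip()
        # Keep the remainder only if it is longer than min_length characters.
        if len(cleaned) > min_length and cleaned not in results:
            results.append(cleaned)
    return results


# Hypothetical filter words and titles, for illustration only.
filter_words = ["Research on", "Overview", "Summary of"]
titles = [
    "Research on adaptive beamforming",
    "Overview",
    "Summary of test",
    "Research on adaptive beamforming",
]
candidates = key_technologies_from_titles(titles, filter_words)
```

Here "Overview" is removed entirely and "test" is too short, so only "adaptive beamforming" remains as a candidate key technology.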
The database stores only the basic directions, key technologies and key technical indexes, and their accuracy is improved through a manual recommendation system. The basic directions of the topics are read from the database, the directions within the same field are clustered, the key technologies under similar directions within the same field are clustered, and finally the association table field-direction-topic-key technology-key technical index is generated.
Because merging subject directions and key technologies means combining items that are identical or similar, a similarity calculation is required. Several similarity measures are therefore applied to the short texts, the mean of the resulting similarities is taken, and texts with high similarity are merged.
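Building the association table can be sketched as follows, together with the association of key technical indexes to key technologies described earlier (each key technology treated as a query, each index as a document). The relevance score used here is a simple term-overlap ratio chosen for illustration, not the full-text retrieval model of the system, and all names and the `min_score` value are hypothetical.

```python
def relevance(query, document):
    """Share of document terms that also appear in the query (illustrative score)."""
    query_terms = set(query.lower().split())
    doc_terms = document.lower().split()
    if not query_terms or not doc_terms:
        return 0.0
    return sum(1 for term in doc_terms if term in query_terms) / len(doc_terms)


def associate_indexes(key_technologies, indexes, min_score=0.2):
    """Map each key technology to the indexes whose relevance reaches min_score."""
    return {
        tech: [index for index in indexes if relevance(tech, index) >= min_score]
        for tech in key_technologies
    }


def relation_rows(field, direction, topic, tech_to_indexes):
    """One row per (key technology, index); a technology with no index keeps one row."""
    rows = []
    for tech, indexes in tech_to_indexes.items():
        for index in indexes or [None]:
            rows.append((field, direction, topic, tech, index))
    return rows


techs = ["adaptive filtering", "antenna design", "thermal control"]
indexes = ["filtering delay below 5 ms", "antenna gain above 20 dB"]
table = relation_rows("radar", "signal processing", "T-001",
                      associate_indexes(techs, indexes))
```

A key technology with no matching index still produces a row with an empty index slot, reflecting that each key technology corresponds to zero or more key technical indexes.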
1) Simple common words
This similarity is computed from the number of characters in the words common to both documents and the number of characters in the longer document: the former divided by the latter gives the similarity.
2) Edit distance
This method counts the number of operations needed to convert one character string into another. A large number of operations means that the strings differ greatly and the similarity is small; a small number means that the similarity is large.
3) Cosine similarity
This method calculates the cosine of the angle between two vectors and uses the angle to reflect the degree of similarity: the larger the cosine (the smaller the angle), the higher the similarity, and the smaller the cosine, the lower the similarity.
4) Jaccard similarity coefficient
The Jaccard similarity is a set-based calculation: the two sentences are each converted into a set, and the size of the intersection of the sets is divided by the size of their union.
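The four measures and their mean can be sketched with the standard library alone. As assumptions made for illustration, each measure works on single characters rather than segmented words, and the edit distance is normalized by the length of the longer string so that all four scores fall between 0 and 1; the source does not specify these details.

```python
import math
from collections import Counter


def common_char_similarity(a, b):
    """Shared characters (with multiplicity) divided by the longer length."""
    if not a or not b:
        return 0.0
    shared = sum((Counter(a) & Counter(b)).values())
    return shared / max(len(a), len(b))


def edit_distance(a, b):
    """Levenshtein distance computed row by row."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + (char_a != char_b)))  # substitution
        previous = current
    return previous[-1]


def edit_similarity(a, b):
    """1 minus the edit distance normalized by the longer length."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - edit_distance(a, b) / longest


def cosine_similarity(a, b):
    """Cosine of the angle between character-count vectors."""
    va, vb = Counter(a), Counter(b)
    dot = sum(va[c] * vb[c] for c in va)
    norm = (math.sqrt(sum(v * v for v in va.values()))
            * math.sqrt(sum(v * v for v in vb.values())))
    return dot / norm if norm else 0.0


def jaccard_similarity(a, b):
    """Intersection size over union size of the character sets."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def mean_similarity(a, b):
    """Mean of the four measures."""
    scores = [common_char_similarity(a, b), edit_similarity(a, b),
              cosine_similarity(a, b), jaccard_similarity(a, b)]
    return sum(scores) / len(scores)
```

Averaging several measures dampens the quirks of any single one; for example, Jaccard ignores repeated characters while the cosine measure does not.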
The basic directions, key technologies and key technical indexes of the topics are read from the database, and the directions within the same field and the key technologies within the same direction are merged. For a new subject direction or key technology, the similarity to each existing one is calculated and the maximum value max is taken. If max exceeds a certain threshold, the new item is considered to belong under the most similar subject direction or key technology and is taken as an alternative item of it. If max does not exceed the threshold, the new subject direction or key technology is added under the corresponding field or direction.
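The threshold rule can be sketched as a small decision function. The names, the word-level Jaccard similarity and the threshold value 0.6 are placeholders for illustration; the described system averages several similarity measures and uses its own thresholds.

```python
def word_jaccard(a, b):
    """Word-level Jaccard similarity, used here only as a stand-in measure."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def place_new_item(new_item, existing, similarity=word_jaccard, threshold=0.6):
    """Return ("alternative", best_match) if the best similarity exceeds
    the threshold, otherwise ("new", new_item)."""
    best_match, best_score = None, 0.0
    for item in existing:
        score = similarity(new_item, item)
        if best_match is None or score > best_score:
            best_match, best_score = item, score
    if best_match is not None and best_score > threshold:
        return ("alternative", best_match)
    return ("new", new_item)


directions = ["radar signal processing", "image recognition"]
print(place_new_item("radar signal processing method", directions))
print(place_new_item("quantum communication", directions))
```

The first call shares three of four words with an existing direction (similarity 0.75) and is attached to it as an alternative; the second shares none and is added as a new direction.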
The above embodiments are only for illustrating the technical solution of the present invention and not for limiting the same, and a person skilled in the art can make modifications or equivalent substitutions to the technical solution of the present invention without departing from the spirit and scope of the present invention, and the scope of the present invention should be determined by the claims.