CN112906826B


Info

Publication number: CN112906826B
Application number: CN202110341589.1A
Authority: CN
Inventors: 梁丽娜; 贺春艳; 王雍富; 梁方殷
Original assignee: Ping An Technology Shenzhen Co Ltd
Current assignee: Ping An Technology Shenzhen Co Ltd
Priority date: 2021-03-30
Filing date: 2021-03-30
Publication date: 2024-07-02
Anticipated expiration: 2041-03-30
Also published as: CN112906826A

Abstract

The invention discloses a fusion method, a fusion device and computer equipment based on multi-dimensional knowledge graphs. The method comprises the following steps: acquiring data of entities from a plurality of data sources and cleaning the data; extracting the entities, the entity attributes and the connection relations among the entities in each data source from the cleaned data; fusing the entities in each data source according to a preset entity fusion rule; fusing the entity attributes among the data sources according to a preset attribute similarity rule; constructing a knowledge graph of each data source according to the fused entities in each data source, the fused entity attributes and the connection relations among the entities; and fusing the knowledge graphs of the data sources according to a preset graph matching rule to obtain a fused knowledge graph. Based on knowledge graph technology, the invention fuses the entities of multiple data sources in the course of fusing the knowledge graphs, and improves the accuracy of the entities in the knowledge graph.

Description

Fusion method, device and computer equipment based on multi-dimensional knowledge graphs

Technical Field

The present invention relates to data analysis technologies, and in particular to a fusion method and apparatus based on multi-dimensional knowledge graphs, and a computer device.

Background

A knowledge graph, known in library and information science as knowledge domain visualization or knowledge domain mapping, is a family of graphs that show the development and structure of knowledge. It uses visualization technology to describe knowledge resources and their carriers, and to mine, analyze, construct, draw and display knowledge and the relationships within it. A knowledge graph is usually constructed from a plurality of different data sources, and entity fusion is an important task in the construction process. In the prior art, when an enterprise knowledge graph is constructed from a plurality of different data sources, entity fusion mainly targets enterprise entities and is based either on similarity or on combining the enterprise entity structure at a single level. The accuracy of enterprise entity fusion cannot be guaranteed in this process: enterprises whose names have changed cannot be fused, enterprises whose names have a shortened enterprise-specific part cannot be fused, and enterprises whose names contain wrongly written characters cannot be fused.

Disclosure of Invention

The embodiment of the invention provides a multidimensional knowledge graph based fusion method, a multidimensional knowledge graph based fusion device and computer equipment, and aims to solve the problem that the accuracy of entity fusion cannot be improved when a plurality of data sources are adopted for entity fusion in the related technology.

In a first aspect, an embodiment of the present invention provides a fusion method based on multi-dimensional knowledge graphs, including:

acquiring data of entities from a plurality of data sources, and cleaning the acquired data to obtain cleaned data;

extracting the entities in each data source, the entity attributes in each data source and the connection relations among the entities in each data source from the cleaned data;

fusing the entities in each data source according to a preset entity fusion rule to obtain fused entities in each data source;

fusing the entity attributes among the data sources according to a preset attribute similarity rule to obtain fused entity attributes;

constructing a knowledge graph of each data source according to the fused entities in each data source, the fused entity attributes and the connection relations among the entities in each data source;

and fusing the knowledge graphs of the data sources according to a preset graph matching rule to obtain a fused knowledge graph.

In a second aspect, an embodiment of the present invention provides a fusion apparatus based on a multidimensional knowledge graph, including:

a first acquisition unit, used for acquiring data of entities from a plurality of data sources and cleaning the acquired data to obtain cleaned data;

an extraction unit, used for extracting the entities in each data source, the entity attributes in each data source and the connection relations among the entities in each data source from the cleaned data;

a first fusion unit, used for fusing the entities in each data source according to a preset entity fusion rule to obtain fused entities in each data source;

a second fusion unit, used for fusing the entity attributes among the data sources according to a preset attribute similarity rule to obtain fused entity attributes;

a construction unit, used for constructing a knowledge graph of each data source according to the fused entities in each data source, the fused entity attributes and the connection relations among the entities in each data source;

and a third fusion unit, used for fusing the knowledge graphs of the data sources according to a preset graph matching rule to obtain a fused knowledge graph.

In a third aspect, an embodiment of the present invention further provides a computer device, including a memory, a processor, and a computer program stored in the memory and capable of running on the processor, where the processor implements the multi-dimensional knowledge graph fusion method according to the first aspect when executing the computer program.

In a fourth aspect, an embodiment of the present invention further provides a computer readable storage medium, where the computer readable storage medium stores a computer program, where the computer program when executed by a processor causes the processor to perform the method for fusing a multidimensional knowledge graph according to the first aspect.

The embodiment of the invention provides a fusion method, a fusion device and computer equipment based on multi-dimensional knowledge graphs. The method comprises: acquiring data of entities from a plurality of data sources and cleaning the data; extracting the entities, the entity attributes and the connection relations among the entities in each data source from the cleaned data; fusing the entities in each data source according to a preset entity fusion rule; fusing the entity attributes among the data sources according to a preset attribute similarity rule; constructing a knowledge graph of each data source according to the fused entities in each data source, the fused entity attributes and the connection relations among the entities; and fusing the knowledge graphs of the data sources according to a preset graph matching rule to obtain a fused knowledge graph. In the course of fusing the knowledge graphs, the method performs fusion along three different dimensions, namely the entities, the entity attributes and the knowledge graphs themselves, so that the entities of multiple data sources are fused and the accuracy of the entities in the knowledge graph is improved.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

Fig. 1 is a schematic flowchart of a fusion method based on multi-dimensional knowledge graphs according to an embodiment of the present invention;

Fig. 2 is a schematic sub-flowchart of the fusion method based on multi-dimensional knowledge graphs according to an embodiment of the present invention;

Fig. 3 is another schematic sub-flowchart of the fusion method based on multi-dimensional knowledge graphs according to an embodiment of the present invention;

Fig. 4 is another schematic sub-flowchart of the fusion method based on multi-dimensional knowledge graphs according to an embodiment of the present invention;

Fig. 5 is another schematic sub-flowchart of the fusion method based on multi-dimensional knowledge graphs according to an embodiment of the present invention;

Fig. 6 is another schematic sub-flowchart of the fusion method based on multi-dimensional knowledge graphs according to an embodiment of the present invention;

Fig. 7 is another schematic sub-flowchart of the fusion method based on multi-dimensional knowledge graphs according to an embodiment of the present invention;

Fig. 8 is a schematic block diagram of a fusion device based on multi-dimensional knowledge graphs according to an embodiment of the present invention;

Fig. 9 is a schematic block diagram of a subunit of the fusion device based on multi-dimensional knowledge graphs according to an embodiment of the present invention;

Fig. 10 is a schematic block diagram of another subunit of the fusion device based on multi-dimensional knowledge graphs according to an embodiment of the present invention;

Fig. 11 is a schematic block diagram of another subunit of the fusion device based on multi-dimensional knowledge graphs according to an embodiment of the present invention;

Fig. 12 is a schematic block diagram of another subunit of the fusion device based on multi-dimensional knowledge graphs according to an embodiment of the present invention;

Fig. 13 is a schematic block diagram of another subunit of the fusion device based on multi-dimensional knowledge graphs according to an embodiment of the present invention;

Fig. 14 is a schematic block diagram of another subunit of the fusion device based on multi-dimensional knowledge graphs according to an embodiment of the present invention;

Fig. 15 is a schematic block diagram of a computer device according to an embodiment of the present invention.

Detailed Description

The following description of the embodiments of the present invention will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

It should be understood that the terms "comprises" and "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.

It is also to be understood that the terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in this specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.

It should be further understood that the term "and/or" as used in the present specification and the appended claims refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations.

Referring to Fig. 1, Fig. 1 is a schematic flowchart of a fusion method based on multi-dimensional knowledge graphs according to an embodiment of the present invention. The method is applied to a server and is executed by application software installed on the server. The method is described in detail below.

As shown in fig. 1, the method includes the following steps S110 to S160.

S110, acquiring data of entities from a plurality of data sources, and cleaning the acquired data to obtain cleaned data.

Specifically, the plurality of data sources may be any websites that carry data about the entities, such as news websites and encyclopedia websites. Data cleaning is the process of re-examining and verifying the data in order to delete duplicate information, correct errors and ensure data consistency. In this embodiment, information on a plurality of specific websites is input in advance into a preset web crawler program; the web crawler program then crawls the data, so that the data of the entities from the plurality of data sources is obtained; the data is then cleaned to obtain the cleaned data.

In another embodiment, as shown in FIG. 2, step S110 includes sub-steps S111 and S112.

S111, performing Traditional-to-Simplified Chinese conversion on the entity with a preset conversion tool to obtain the converted entity.

Specifically, Traditional-to-Simplified conversion builds a character correspondence table between two code tables (a Simplified Chinese code table and a Traditional Chinese code table) according to their coding rules. A program reads this mapping table, looks up the corresponding character byte code in the other encoding, and replaces the content byte by byte. In this embodiment, a correspondence table between Traditional Chinese characters and Simplified Chinese characters is constructed. When an enterprise name is converted, the conversion tool scans the character string of the entity name to determine whether it contains Traditional Chinese characters and, if so, converts them into Simplified Chinese characters according to the correspondence table. In this embodiment, the OpenCC conversion tool is used to convert the names of enterprise entities. For example, if the entity name before processing is "China Ping An Insurance (Group) Limited 1 @" written in Traditional Chinese characters, then after conversion by the OpenCC tool the entity name is "China Ping An Insurance (Group) Limited 1 @" written in Simplified Chinese characters.
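The table-driven conversion described above can be sketched as follows. This is a minimal illustration, not the OpenCC implementation: the three-entry table `T2S_TABLE` and the function names are hypothetical, and a real correspondence table would cover thousands of characters.

```python
# Hypothetical, deliberately tiny Traditional -> Simplified correspondence table.
T2S_TABLE = {"國": "国", "險": "险", "團": "团"}


def has_traditional(name: str) -> bool:
    """Scan the string and report whether any character appears in the table."""
    return any(ch in T2S_TABLE for ch in name)


def to_simplified(name: str) -> str:
    """Replace every Traditional character found in the table; keep all others."""
    if not has_traditional(name):
        return name
    return name.translate(str.maketrans(T2S_TABLE))


print(to_simplified("中國平安保險(集團)有限公司1@"))
```

Special symbols such as "1" and "@" pass through unchanged at this stage; they are handled by the later cleaning step.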

S112, removing special symbols from the converted entity based on a regular expression to obtain the cleaned entity.

Specifically, a regular expression is a logical formula for operating on character strings that describes a string-matching pattern: predefined specific characters, and combinations of them, form a "rule string" that expresses a filtering logic for strings. A regular expression is constructed in the same way as a mathematical expression, that is, small expressions are combined with meta-characters and operators to create larger expressions. The components of a regular expression can be a single character, a set of characters, a range of characters, a choice between characters, or any combination of these. In this embodiment, a regular expression for entity names is preset and is used to match the character string of the converted entity name. If special symbols are present in the name, they are removed, which yields the name without special characters, that is, the cleaned entity. For example, if the converted entity name is "China Ping An Insurance (Group) Limited 1 @", matching the name against the regular expression identifies the special characters "1", "()" and "@"; these characters are then removed, and the name "China Ping An Insurance Group Limited" is obtained.
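A minimal sketch of this cleaning step, assuming that "special symbols" means every character that is neither a CJK ideograph nor a Latin letter. The patent does not give its actual pattern; `SPECIAL` and `clean_name` are hypothetical names.

```python
import re

# Assumption: anything outside CJK unified ideographs and Latin letters is a
# special symbol. Digits are stripped too, as in the "1" of the example above.
SPECIAL = re.compile(r"[^\u4e00-\u9fffA-Za-z]")


def clean_name(name: str) -> str:
    """Remove every character matched by the special-symbol pattern."""
    return SPECIAL.sub("", name)


print(clean_name("中国平安保险(集团)有限公司1@"))
```

A production rule would probably need to be narrower, since some legitimate enterprise names contain digits.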

S120, extracting the entity in each data source, the entity attribute in each data source and the connection relation among the entities in each data source from the cleaned data.

Specifically, a knowledge graph is essentially a semantic network that describes objective things in the form of a graph, where "graph" means the graph data structure; that is, a knowledge graph consists of nodes and edges. Nodes represent concepts and entities: concepts are abstract things, and entities are concrete things. Edges represent the relations and attributes of things: the internal features of a thing are represented by entity attributes, and its external links are represented by relations. In other words, the connection relations between entities cover both the internal features of entities and their external links. The concepts and entities at the nodes of a knowledge graph are generally treated as entities, and the external links and entity attributes on the edges are treated as connection relations. After the entities in each data source, the entity attributes in each data source and the connection relations among the entities in each data source have been extracted from the cleaned data, they are stored in the RDF (Resource Description Framework) triple format.

For example, suppose the data of the entities from the plurality of data sources is the data required for an enterprise knowledge graph, and the cleaned data is: "Baidu Technology was established on June 5, 2001 and is registered at Baidu Building, No. 10 Shangdi 10th Street, Haidian District, Beijing". Then "Baidu Technology", "June 5, 2001" and "Baidu Building, No. 10 Shangdi 10th Street, Haidian District, Beijing" are entities in the enterprise knowledge graph; "establishment time" is the connection relation between the two entities "Baidu Technology" and "June 5, 2001"; and "registered place" is the connection relation between the two entities "Baidu Technology" and "Baidu Building, No. 10 Shangdi 10th Street, Haidian District, Beijing".
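The Baidu example can be written down as subject-predicate-object triples. The sketch below only illustrates the triple structure; it does not use an RDF library or a real RDF serialization, and the `Triple` type, the predicate names and the helper `objects_of` are hypothetical.

```python
from typing import List, NamedTuple


class Triple(NamedTuple):
    """One [entity - relation - entity] fact."""
    subject: str
    predicate: str
    obj: str


triples = [
    Triple("Baidu Technology", "establishment_time", "2001-06-05"),
    Triple("Baidu Technology", "registered_place",
           "Baidu Building, No. 10 Shangdi 10th Street, Haidian District, Beijing"),
]


def objects_of(facts: List[Triple], subject: str, predicate: str) -> List[str]:
    """Return every object linked to `subject` by `predicate`."""
    return [t.obj for t in facts if t.subject == subject and t.predicate == predicate]


print(objects_of(triples, "Baidu Technology", "establishment_time"))
```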

S130, fusing the entities in each data source according to a preset entity fusion rule to obtain fused entities in each data source.

In this embodiment, the names of the entities in each data source are fused at the entity-structure level to obtain the fused entity names in each data source. Entity fusion at the entity-structure level is a technical scheme in which the names of entities from the plurality of data sources are fused at the level of the structure that makes up each name. The entity fusion rule is rule information that presets the composition structure of entity names and fuses the enterprise names acquired from the plurality of data sources according to that structure. For example, if the composition structure of an entity name is "place name + enterprise-specific part + enterprise-general part", the names of the entities in the plurality of data sources may be fused with reference to this structure.

In another embodiment, as shown in fig. 3, step S130 includes sub-steps S131 and S132.

S131, performing pattern extraction on the names of the cleaned entities according to a preset extraction rule to obtain a plurality of words.

Specifically, the extraction rule is rule information formulated in advance for pattern extraction from the names of the cleaned entities. In pattern extraction, a machine-learning knowledge extraction engine applies a set of part-of-speech extraction patterns to a "stack" of cleaned entity names in order to identify the words and phrases to extract, that is, the plurality of words in each cleaned entity name. Part-of-speech extraction patterns are built from grammatical elements such as nouns, adjectives, past participles, determiners, prepositions, conjunctions, proper names, abbreviations and function words. In this embodiment, the part-of-speech extraction pattern consists of a place-name part, an enterprise-specific part and an enterprise-general part, so that pattern extraction splits the name of a cleaned entity into these three parts. For example, "China Ping An Group Limited" is split into "China", "Ping An Group" and "Limited" after pattern extraction.
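The three-part split can be illustrated with a simple rule-based stand-in. This is not the machine-learning extraction engine mentioned above: the word lists `PLACE_NAMES` and `GENERAL_PARTS` and the function `split_name` are hypothetical, and the lists are ordered longest-first so that the longest prefix or suffix wins.

```python
from typing import Tuple

# Hypothetical lookup lists, longest entries first.
PLACE_NAMES = ("深圳市", "北京市", "深圳", "北京", "中国")
GENERAL_PARTS = ("股份有限公司", "有限公司")


def split_name(name: str) -> Tuple[str, str, str]:
    """Split a cleaned name into (place name, enterprise-specific, enterprise-general)."""
    place = next((p for p in PLACE_NAMES if name.startswith(p)), "")
    rest = name[len(place):]
    general = next((g for g in GENERAL_PARTS if rest.endswith(g)), "")
    specific = rest[: len(rest) - len(general)] if general else rest
    return place, specific, general


print(split_name("深圳市平安科技有限公司"))
```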

S132, recombining the words according to a preset recombination rule to obtain recombined names.

Specifically, the recombination rule is rule information for recombining the plurality of words, obtained by pattern extraction from the cleaned entity names in each data source, into a new name. In this embodiment, common words are removed from the plurality of words before recombination, and the remaining words are then recombined to obtain the recombined name, which completes the fusion of the entities in each data source. For example, if the plurality of words are "Shenzhen", "City", "Ping An Technology" and "Limited Company", the common words "City" and "Limited Company" are removed and the remaining words are recombined to give "Shenzhen Ping An Technology", which is the recombined name.
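A minimal sketch of the recombination step, assuming the common words are held in a fixed set. The set `COMMON_WORDS` and the function `recombine` are hypothetical; the input is the Chinese form of the Shenzhen example.

```python
from typing import Iterable

# Hypothetical set of common words carrying no identifying information.
COMMON_WORDS = {"市", "有限公司", "股份有限公司"}


def recombine(words: Iterable[str]) -> str:
    """Drop common words and join the remaining words in their original order."""
    return "".join(w for w in words if w not in COMMON_WORDS)


# "Shenzhen", "City", "Ping An Technology", "Limited Company"
print(recombine(["深圳", "市", "平安科技", "有限公司"]))
```

Because two spellings of the same enterprise (with or without "City", with either company suffix) reduce to the same recombined name, they can be treated as one entity.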

S140, fusing the entity attributes among the data sources according to a preset attribute similarity rule to obtain fused entity attributes.

Specifically, the attribute similarity rule is rule information for fusing the same entity attribute across the data sources, so that the entity attributes in the plurality of data sources are fused into the entity attributes of a single data source. The similarity of the same entity attribute across the data sources is calculated, and fusion is performed according to the calculated similarity to obtain the fused entity attributes.

In another embodiment, as shown in fig. 4, step S140 includes sub-steps S141, S142, and S143.

S141, performing feature selection on the entity attribute in each data source to obtain feature data of the entity attribute in each data source.

Specifically, feature selection is performed on the entity attributes in each data source by machine learning to obtain the feature data of the entity attributes in each data source; the feature data preserves the original features of the entity attributes. Feature selection can be performed with a filter method, a wrapper method or an embedded method. The filter method ranks features by their statistical characteristics or by their degree of association with the target value and removes the features that do not reach a set threshold; common filter methods include variance filtering, filtering based on statistical correlation and filtering based on mutual information. The wrapper method screens features by evaluating the statistical characteristics of a feature x or its correlation with the target y. The embedded method computes an overall feature-importance ranking for feature selection, for example with a decision-tree model. In this embodiment, the feature data of the entity attributes in each data source is obtained by a filter method, namely filtering based on statistical correlation.
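Filtering based on statistical correlation can be sketched as below. The patent does not name the correlation measure, so the Pearson coefficient, the 0.8 threshold, the numeric encoding of features and the function names are all assumptions made for illustration.

```python
from math import sqrt
from typing import Dict, List, Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient; 0.0 if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return 0.0 if sx == 0 or sy == 0 else cov / (sx * sy)


def select_features(features: Dict[str, List[float]],
                    target: List[float],
                    threshold: float = 0.8) -> List[str]:
    """Keep the features whose absolute correlation with the target reaches the threshold."""
    return [name for name, values in features.items()
            if abs(pearson(values, target)) >= threshold]


features = {"a": [1, 2, 3, 4], "b": [1, -1, 1, -1], "c": [5, 5, 5, 5]}
print(select_features(features, [2, 4, 6, 8]))
```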

S142, obtaining the similarity of the entity attributes among the data sources according to the characteristic data of the entity attributes in the data sources.

Specifically, a similarity calculation is performed on the same feature data across the data sources; the resulting similarity is the similarity of the entity attributes among the data sources. In this embodiment, an edit distance algorithm is used for this calculation.

In another embodiment, as shown in FIG. 5, step S142 includes sub-steps S1421 and S1422.

S1421, performing numerical processing on the feature data of the entity attribute in each data source to obtain the numerical value of the entity attribute in each data source.

Specifically, each character in the feature data is replaced by its sequence number in a preset dictionary, and the sequence numbers are combined into the numerical value of the feature data, that is, the numerical value of the entity attribute in each data source. For example, if the feature data is "Beijing" (北京) and the characters "北" and "京" have sequence numbers 12 and 78 in the dictionary, the numerical value of the feature data is 1278, that is, the numerical value of the entity attribute in the data source is 1278.
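The Beijing example can be reproduced with a few lines. The dictionary `CHAR_INDEX` holds only the two sequence numbers quoted in the example, and `numericalize` is a hypothetical name; concatenating the sequence numbers as digits is inferred from the value 1278 given above.

```python
from typing import Dict

# Sequence numbers taken from the example; a real dictionary would be far larger.
CHAR_INDEX: Dict[str, int] = {"北": 12, "京": 78}


def numericalize(text: str, index: Dict[str, int]) -> int:
    """Concatenate the dictionary sequence numbers of the characters into one number."""
    return int("".join(str(index[ch]) for ch in text))


print(numericalize("北京", CHAR_INDEX))
```

Note that plain concatenation is ambiguous when sequence numbers have different lengths (1 followed by 23 and 12 followed by 3 both give 123), so a fixed-width encoding may be preferable in practice.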

S1422, obtaining the similarity of the entity attributes among the data sources according to the numerical value of the entity attribute in each data source.

S143, fusing the entity attributes among the data sources according to the similarity of the entity attributes among the data sources to obtain the fused entity attributes.

Specifically, an edit distance algorithm is applied to the numerical values of the same entity attribute across the data sources to obtain the similarity of the entity attribute among the data sources. If the similarity is greater than a preset threshold, the entity attributes among the data sources are fused to obtain the fused entity attribute. The edit distance between two character strings is the minimum number of edit operations required to convert one into the other; the greater the distance, the more the two strings differ. The principle is as follows. Let d[i, j] denote the minimum number of steps required to convert the string s[1 … i] into the string t[1 … j]. In the base cases, when i equals 0, that is, the string s is empty, d[0, j] = j, because j characters must be inserted to convert s into t; when j equals 0, that is, the string t is empty, d[i, 0] = i, because i characters must be deleted to convert s into t.
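The base cases above extend to the standard Levenshtein recurrence, sketched here with a full d[i][j] table. Turning the distance into a similarity in [0, 1] by dividing by the longer string length is an assumption; the patent does not state how the distance is normalized or what threshold is used.

```python
def edit_distance(s: str, t: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning s into t."""
    m, n = len(s), len(t)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i  # t is empty: delete i characters
    for j in range(n + 1):
        d[0][j] = j  # s is empty: insert j characters
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[m][n]


def similarity(s: str, t: str) -> float:
    """Assumed normalization: 1 minus distance divided by the longer length."""
    longest = max(len(s), len(t))
    return 1.0 if longest == 0 else 1 - edit_distance(s, t) / longest


print(edit_distance("kitten", "sitting"), similarity("1278", "1279"))
```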

In another embodiment, as shown in fig. 6, step S140 includes sub-steps S140a, S140b, S140c, and S140d.

S140a, performing isomorphism processing on the entity attributes in each data source to obtain the entity attributes after isomorphism processing.

Specifically, isomorphism processing converts the entity attributes of each data source into data with the same table structure, to facilitate the subsequent fusion of the entity attributes in each data source. In this embodiment, in the field of enterprise knowledge graphs, data source A is Table 1 and data source B is Table 2. To give data source B the same table structure as data source A, data source B undergoes isomorphism processing: attribute values are extracted from data source B, and a table with the same structure as that of data source A is generated from the table template of data source A. Table 1 and Table 2 are as follows:

TABLE 1

TABLE 2

S140b, constructing partition indexes of the data of the entities from the plurality of data sources based on the Blocking technique, and generating matching pairs of the entities of each data source according to the partition indexes.

The Blocking technique is the partitioning concept in knowledge fusion: from all entity pairs in a given knowledge base, the record pairs that are potential matches are selected as candidates, and the candidate set is kept as small as possible. Blocking techniques generally include hash-function partitioning, neighbor partitioning and index partitioning. In this embodiment, in the field of enterprise knowledge graphs, a matching pair is composed of the entity attributes in each data source, which include the registered address, business scope, fax, e-mail, enterprise website, legal representative, general manager, enterprise abbreviation, enterprise telephone and province and city of the enterprise. When the data of the entities from the plurality of data sources is partitioned, the partition can be based on the province and city or on the first n digits of the telephone number, and matching pairs of the entities are generated accordingly. The partition index of the data of the entities from the plurality of data sources is then constructed with the index partitioning method of the Blocking technique, and the qualifying candidate matching pairs are screened out from the M² possible matching pairs, which reduces the amount of computation.
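Blocking on a shared key can be sketched as follows: only records that fall into the same block are paired, instead of pairing every record of one source with every record of the other. The record layout, the key name `province` and both function names are hypothetical.

```python
from collections import defaultdict
from itertools import product
from typing import Dict, List, Tuple

Record = Dict[str, str]


def block_by_key(records: List[Record], key: str) -> Dict[str, List[Record]]:
    """Build the partition index: blocking-key value -> records in that block."""
    blocks: Dict[str, List[Record]] = defaultdict(list)
    for record in records:
        blocks[record[key]].append(record)
    return blocks


def candidate_pairs(source_a: List[Record], source_b: List[Record],
                    key: str = "province") -> List[Tuple[Record, Record]]:
    """Pair records across two sources only when they share a blocking-key value."""
    blocks_a, blocks_b = block_by_key(source_a, key), block_by_key(source_b, key)
    pairs: List[Tuple[Record, Record]] = []
    for value in blocks_a.keys() & blocks_b.keys():
        pairs.extend(product(blocks_a[value], blocks_b[value]))
    return pairs


source_a = [{"name": "A1", "province": "Guangdong"}, {"name": "A2", "province": "Beijing"}]
source_b = [{"name": "B1", "province": "Guangdong"}, {"name": "B2", "province": "Shanghai"}]
pairs = candidate_pairs(source_a, source_b)
print(len(pairs))
```

In this toy input the four possible cross-source pairs shrink to the single pair that shares a province. Blocking on the first n digits of a telephone number would work the same way with a derived key.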

S140c, performing similarity calculation on the matching pairs between the data sources based on an edit distance algorithm to obtain the similarity of the matching pairs.

S140d, generating a fusion rule for the entity attributes according to the similarity of the matching pairs, and fusing the entity attributes among the data sources according to the fusion rule to obtain fused entity attributes.

Specifically, the fusion rule is rule information for fusing the entity attributes among the data sources. The same matching pairs among the data sources are evaluated with an edit distance algorithm to obtain the similarity of the matching pairs, and the fusion rule for the entity attributes is derived from this similarity, which completes the fusion of the entity attributes among the data sources. For example, in the field of enterprise knowledge graphs, it is first judged whether the registered address in an entity matching pair is empty. If the registered address is empty, it is then judged whether the business scope is empty: if the business scope is also empty, the entity attributes among the data sources are fused according to the similarity of the company names; if it is not empty, they are fused according to the similarity of the business scopes. If the registered address is not empty, fusion is performed according to the similarity of the registered addresses.
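The decision cascade in the example can be sketched as a function that picks which attribute to compare for a matching pair. Treating an attribute as empty when either record of the pair lacks it is an assumption, and the field names and `choose_attribute` are hypothetical.

```python
from typing import Dict, Tuple

Record = Dict[str, str]


def choose_attribute(pair: Tuple[Record, Record]) -> str:
    """Return the attribute whose similarity decides the fusion of this pair."""
    a, b = pair
    # Assumption: an attribute counts as empty if it is missing or blank in either record.
    if a.get("registered_address") and b.get("registered_address"):
        return "registered_address"
    if a.get("business_scope") and b.get("business_scope"):
        return "business_scope"
    return "company_name"


pair = ({"company_name": "深圳平安科技", "registered_address": "", "business_scope": "软件开发"},
        {"company_name": "深圳平安科技", "registered_address": "", "business_scope": "软件研发"})
print(choose_attribute(pair))
```

The chosen attribute values would then be compared with the edit distance, and the pair fused if the similarity exceeds the preset threshold.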

S150, constructing a knowledge graph of each data source according to the fused entity in each data source, the fused entity attribute and the connection relation among the entities in each data source.

In this embodiment, once the fused entities, the fused entity attributes, and the connection relations among the entities in each data source have been obtained, the knowledge graph of each data source can be constructed from this information. For example, in the field of enterprise knowledge graphs, the enterprise knowledge graph comprises a parallel-table graph and a generalized graph. The parallel-table graph lists the companies up to three levels above and three levels below the enterprise (with data such as names and shareholding ratios), and is mainly used to inspect the composition structure and the skeleton of the enterprise. The generalized graph is developed mainly along 8 dimensions, namely control, guarantee relations, main clients, competitors, senior management, stakeholders, hot words, and legal litigation, and gives a comprehensive view of the information of the enterprise.

The logical structure of the parallel-table graph and the generalized graph is divided into two layers: a data layer and a schema layer. In the data layer, knowledge is stored in a graph database in units of facts. If [entity-relationship-entity] or [entity-attribute-value] triples are used as the basic expression of facts, all data stored in the graph database form a huge entity relationship network, which constitutes the enterprise knowledge graph. The schema layer sits above the data layer and is the core of the knowledge graph; it stores refined knowledge. An ontology library is generally used to manage the schema layer, and its support for axioms, rules, and constraint conditions is used to standardize the relations among entities, relationships, entity types, attributes, and other objects.

The ontology library plays the role of a mold for the knowledge base: a knowledge base built on an ontology library contains less redundant knowledge.
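The two-layer structure described above can be illustrated with a minimal sketch. It assumes an in-memory store in place of a real graph database; the names `Triple`, `FactStore`, and `SCHEMA`, as well as the example predicates, are hypothetical and chosen only for illustration. The schema layer is reduced to a table that says which predicates are relations and which are attributes, and the data layer rejects facts that the schema does not allow.

```python
from collections import defaultdict
from typing import NamedTuple


class Triple(NamedTuple):
    """One fact: [entity-relationship-entity] or [entity-attribute-value]."""
    subject: str
    predicate: str
    obj: str


# Hypothetical schema layer: the allowed predicates and their kind.
SCHEMA = {
    "controls": "relation",
    "guarantees": "relation",
    "registered_city": "attribute",
}


class FactStore:
    """In-memory stand-in for the data layer of a graph database."""

    def __init__(self, schema):
        self.schema = schema
        self.by_subject = defaultdict(list)

    def add(self, subject, predicate, obj):
        # The schema layer constrains the data layer: unknown predicates are rejected.
        if predicate not in self.schema:
            raise ValueError(f"predicate not in schema: {predicate}")
        self.by_subject[subject].append(Triple(subject, predicate, obj))

    def facts(self, subject, kind):
        """Return the facts of one kind ('relation' or 'attribute') for an entity."""
        return [t for t in self.by_subject[subject]
                if self.schema[t.predicate] == kind]


store = FactStore(SCHEMA)
store.add("Company A", "controls", "Company B")
store.add("Company A", "registered_city", "Shenzhen")
```

Here `store.facts("Company A", "relation")` yields the control edge and `store.facts("Company A", "attribute")` yields the registered city, so the same triple form covers both kinds of fact.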

S160, fusing the knowledge graphs of each data source according to a preset graph matching rule to obtain a fused knowledge graph.

Specifically, the graph matching rule is rule information for fusing the knowledge graphs of the individual data sources into a single knowledge graph: the knowledge graph of one of the plurality of data sources is taken as the base knowledge graph, and the knowledge graphs of the other data sources are fused into the base knowledge graph to obtain the fused knowledge graph.

In another embodiment, as shown in fig. 7, step S160 includes sub-steps S161 and S162.

S161, calculating the similarity of the entities between each knowledge graph based on an edit distance algorithm.

S162, fusing the knowledge maps of each data source according to the similarity of the entities between the knowledge maps to obtain fused knowledge maps.

In this embodiment, the knowledge graph of one of the plurality of data sources is taken as the base knowledge graph, the similarity of the entities between the knowledge graphs is calculated with an edit distance algorithm, the number of entities whose similarity exceeds a preset threshold is then counted, and finally the knowledge graphs are fused according to that number of entities.
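A minimal sketch of sub-steps S161 and S162 follows. Several points are assumptions rather than details given in the text: each graph is represented as a dictionary mapping an entity name to a set of (relation, target) edges, similarity is the edit distance normalized by the longer name, and the count of matched entities is used as a simple gate (`min_matches`) that decides whether fusion proceeds. The function names `fuse_graphs`, `similarity`, and `edit_distance` are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Normalize the edit distance into [0, 1]; 1.0 means identical names."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def fuse_graphs(base, other, threshold=0.8, min_matches=1):
    """Fuse `other` into a copy of `base`; return (fused graph, entity mapping)."""
    mapping = {}
    for name in other:
        best = max(base, key=lambda b: similarity(name, b), default=None)
        if best is not None and similarity(name, best) >= threshold:
            mapping[name] = best
    fused = {entity: set(edges) for entity, edges in base.items()}
    # Assumption: fuse only when enough entities exceed the threshold.
    if len(mapping) < min_matches:
        return fused, mapping
    for name, edges in other.items():
        target = mapping.get(name, name)
        renamed = {(rel, mapping.get(dst, dst)) for rel, dst in edges}
        fused.setdefault(target, set()).update(renamed)
    return fused, mapping


base = {"Ping An Technology": {("controls", "Sub One")}, "Sub One": set()}
other = {"Ping An Technolgy": {("guarantees", "Sub Two")}, "Sub Two": set()}
fused, mapping = fuse_graphs(base, other)
```

In this example the misspelled name in `other` is mapped onto the base entity, so its edge is attached to "Ping An Technology", while "Sub Two" has no close match and is added as a new entity.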

In the multi-dimensional knowledge-graph-based fusion method provided by this embodiment of the invention, data of entities are acquired from a plurality of data sources and cleaned to obtain cleaned data; the entities, the entity attributes, and the connection relations among the entities in each data source are extracted from the cleaned data; the entities in each data source are fused according to a preset entity fusion rule to obtain the fused entities in each data source; the entity attributes are fused between the data sources according to a preset attribute similarity rule to obtain the fused entity attributes; a knowledge graph of each data source is constructed from the fused entities, the fused entity attributes, and the connection relations among the entities in each data source; and the knowledge graphs of the data sources are fused according to a preset graph matching rule to obtain a fused knowledge graph. With this method, entities from multiple data sources are fused during knowledge graph fusion, which improves the accuracy of the entities in the knowledge graph. In the field of enterprise knowledge graphs in particular, it addresses the problems that enterprises whose names have changed, enterprises whose names omit certain word-structure parts, and enterprises whose names contain wrongly written characters cannot be fused.

The embodiment of the invention also provides a multi-dimensional knowledge-graph-based fusion device 100, which is used for executing any embodiment of the multi-dimensional knowledge-graph-based fusion method.

Specifically, referring to fig. 8, fig. 8 is a schematic block diagram of a fusion device 100 based on a multidimensional knowledge-graph according to an embodiment of the present invention.

As shown in fig. 8, the multi-dimensional knowledge-graph-based fusion device 100 includes a first acquisition unit 110, an extraction unit 120, a first fusion unit 130, a second fusion unit 140, a construction unit 150, and a third fusion unit 160.

The first acquisition unit 110 is configured to acquire data of entities from a plurality of data sources and to perform data cleaning on the acquired data to obtain cleaned data.

In other embodiments of the invention, as shown in fig. 9, the first acquisition unit 110 includes a conversion unit 111 and a rejection unit 112.

The conversion unit 111 is configured to perform traditional-to-simplified Chinese conversion on the entity with a preset conversion tool, so as to obtain a converted entity; and the rejection unit 112 is configured to remove special symbols from the converted entity based on a regular expression, so as to obtain the cleaned entity.
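The two cleaning operations can be sketched as follows. The conversion tool is not named in this passage, so the sketch substitutes a deliberately tiny, hypothetical character table (`T2S`); a real system would call a full traditional-to-simplified conversion tool at that point. The choice of which characters count as special symbols (everything except CJK ideographs, ASCII letters, and digits) is likewise an assumption.

```python
import re

# Hypothetical, deliberately tiny traditional-to-simplified table.
# A real system would call a full conversion tool here.
T2S = str.maketrans({"國": "国", "華": "华", "銀": "银", "務": "务", "發": "发"})

# Assumption: anything other than CJK ideographs, ASCII letters, and digits
# is treated as a special symbol and removed.
SPECIAL = re.compile(r"[^0-9A-Za-z\u4e00-\u9fff]")


def clean_entity(name: str) -> str:
    """Convert traditional characters, then strip special symbols."""
    simplified = name.translate(T2S)
    return SPECIAL.sub("", simplified)


cleaned = clean_entity("中國銀行（深圳）*")
```

Conversion runs before symbol removal so that both variants of a name collapse to the same string before any later comparison.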

The extracting unit 120 is configured to extract, from the cleaned data, the entity in each data source, the entity attribute in each data source, and the connection relationship between the entities in each data source.

The first fusing unit 130 is configured to fuse the entities in each data source according to a preset entity fusion rule, so as to obtain a fused entity in each data source.

In other embodiments of the invention, as shown in fig. 10, the first fusing unit 130 includes: a pattern extraction unit 131 and a reorganization unit 132.

A pattern extraction unit 131, configured to perform pattern extraction on the name of the cleaned entity according to a preset extraction rule, so as to obtain a plurality of words; and the reorganizing unit 132 is configured to reorganize the plurality of words according to a preset reorganizing rule to obtain reorganized names.
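As an illustration of pattern extraction followed by reorganization, the sketch below assumes one possible rule set: an enterprise name is split into region, brand, industry, and organization type using small hypothetical vocabularies, and the words are then reassembled in a fixed order. The vocabularies `REGIONS`, `INDUSTRIES`, and `ORG_TYPES` and the function names are assumptions for illustration and are not the preset rules of the embodiment.

```python
# Hypothetical vocabularies; a real rule set would be far larger.
REGIONS = ("深圳", "上海", "北京")
INDUSTRIES = ("科技", "银行", "保险")
ORG_TYPES = ("股份有限公司", "有限公司")  # longer suffix is tried first


def extract_parts(name: str) -> dict:
    """Split a cleaned enterprise name into region, brand, industry, org type."""
    parts = {"region": "", "brand": "", "industry": "", "org_type": ""}
    rest = name
    for org in ORG_TYPES:
        if rest.endswith(org):
            parts["org_type"] = org
            rest = rest[: -len(org)]
            break
    for field, vocabulary in (("region", REGIONS), ("industry", INDUSTRIES)):
        for word in vocabulary:
            if word in rest:
                parts[field] = word
                rest = rest.replace(word, "", 1)
                break
    parts["brand"] = rest  # whatever remains is treated as the brand
    return parts


def reorganize(parts: dict) -> str:
    """Reassemble the words in one canonical order."""
    return "".join(parts[k] for k in ("region", "brand", "industry", "org_type"))


name_a = reorganize(extract_parts("深圳平安科技有限公司"))
name_b = reorganize(extract_parts("平安科技深圳有限公司"))
```

Both inputs reorganize to the same canonical name, so two records that place the region in different positions can be recognized as one entity.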

The second fusing unit 140 is configured to fuse the entity attributes between the data sources according to a preset attribute similarity rule, so as to obtain the fused entity attributes.

In other embodiments of the invention, as shown in fig. 11, the second fusing unit 140 includes: a feature selection unit 141, a second acquisition unit 142, and a fourth fusion unit 143.

A feature selection unit 141, configured to perform feature selection on the entity attribute in each data source, so as to obtain feature data of the entity attribute in each data source; a second obtaining unit 142, configured to obtain similarity of entity attributes between each data source according to the feature data of the entity attributes in each data source; and a fourth fusing unit 143, configured to fuse the entity attributes between the data sources according to the similarity of the entity attributes between the data sources, so as to obtain the fused entity attributes.

In other embodiments of the invention, as shown in fig. 12, the second obtaining unit 142 includes: a digitizing processing unit 1421 and a third acquiring unit 1422.

A digitizing processing unit 1421, configured to numerically process the feature data of the entity attribute in each data source to obtain a numerical value of the entity attribute in each data source; a third obtaining unit 1422, configured to obtain the similarity of the entity attributes between the data sources according to the value of the entity attribute in each data source.
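One way to picture numerical processing followed by similarity calculation is sketched below. The encoding is an assumption: each selected attribute value is turned into counts of character bigrams, and two records are compared by cosine similarity of these sparse vectors. The attribute names in `FEATURES` and the functions `digitize` and `cosine` are hypothetical; the embodiment may use a different numerical representation and similarity measure.

```python
import math
from collections import Counter

# Hypothetical result of feature selection: the attributes that are compared.
FEATURES = ("name", "city", "phone")


def digitize(record: dict, features=FEATURES) -> Counter:
    """Turn the selected attribute values into a sparse numeric vector."""
    vector = Counter()
    for feature in features:
        value = str(record.get(feature, ""))
        for i in range(len(value) - 1):
            vector[(feature, value[i:i + 2])] += 1
    return vector


def cosine(u: Counter, v: Counter) -> float:
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0


record_a = {"name": "Ping An Technology", "city": "Shenzhen", "phone": "0755123"}
record_b = {"name": "Ping An Technolgy", "city": "Shenzhen", "phone": "0755123"}
score = cosine(digitize(record_a), digitize(record_b))
```

A fusion step could then treat the two records as describing the same entity attribute set when `score` exceeds a chosen threshold.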

In other embodiments of the invention, as shown in fig. 13, the second fusing unit 140 further includes: an isomorphism processing unit 140a, a matching pair generating unit 140b, a first calculating unit 140c, and a fifth fusing unit 140d.

The isomorphism processing unit 140a is configured to perform isomorphism processing on the entity attribute in each data source, so as to obtain an isomorphism processed entity attribute; a matching pair generating unit 140b, configured to construct partition indexes of the data of the entities from the plurality of data sources based on a Blocking technique and generate a matching pair of the entities of each data source according to the partition indexes; a first calculating unit 140c, configured to perform similarity calculation on the matching pairs between each data source based on the edit distance algorithm, so as to obtain a similarity of the matching pairs; and a fifth fusion unit 140d, configured to generate a fusion rule of the entity attribute according to the similarity of the matching pair, and fuse the entity attribute between each data source according to the fusion rule, so as to obtain a fused entity attribute.
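The Blocking step and the edit-distance scoring of matching pairs can be sketched as follows. The sketch assumes that isomorphism processing has already produced records with the same fields (`name`, `city`, `phone`, all hypothetical), and it partitions either by city or by the first n digits of the telephone number, which are the two partition keys named in the claims. Only records that fall into the same partition are paired, so most cross-source comparisons are never made.

```python
from collections import defaultdict
from itertools import product


def key_by_city(record):
    return record["city"]


def key_by_phone_prefix(n):
    """Build a partition key that uses the first n digits of the phone number."""
    def key(record):
        return record["phone"][:n]
    return key


def matching_pairs(source_a, source_b, key):
    """Partition both sources by `key`, then pair records within each partition."""
    blocks = defaultdict(lambda: ([], []))
    for record in source_a:
        blocks[key(record)][0].append(record)
    for record in source_b:
        blocks[key(record)][1].append(record)
    pairs = []
    for left, right in blocks.values():
        pairs.extend(product(left, right))
    return pairs


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def score_pairs(pairs):
    """Attach a normalized name similarity to every matching pair."""
    scored = []
    for a, b in pairs:
        longest = max(len(a["name"]), len(b["name"])) or 1
        sim = 1.0 - edit_distance(a["name"], b["name"]) / longest
        scored.append((a["name"], b["name"], sim))
    return scored


source_a = [{"name": "Ping An Tech", "city": "Shenzhen", "phone": "0755123"},
            {"name": "Acme", "city": "Beijing", "phone": "010999"}]
source_b = [{"name": "Ping An Tec", "city": "Shenzhen", "phone": "0755456"},
            {"name": "Other", "city": "Shanghai", "phone": "021000"}]
scored = score_pairs(matching_pairs(source_a, source_b, key_by_city))
```

With four records, a full comparison would score four pairs; partitioning by city leaves a single candidate pair, and a fusion rule could then be derived from its similarity.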

The construction unit 150 is configured to construct a knowledge graph of each data source according to the fused entity in each data source, the fused entity attribute, and the connection relationship between the entities in each data source.

The third fusing unit 160 is configured to fuse the knowledge graphs of each data source according to a preset graph matching rule, so as to obtain a fused knowledge graph.

In other embodiments of the invention, as shown in fig. 14, the third fusing unit 160 includes: a second calculation unit 161 and a sixth fusion unit 162.

A second calculation unit 161 for calculating the similarity of entities between each knowledge graph based on an edit distance algorithm; and a sixth fusion unit 162, configured to fuse the knowledge-graph of each data source according to the similarity of the entities between each knowledge-graph, so as to obtain a fused knowledge-graph.

The multi-dimensional knowledge-graph-based fusion device 100 provided by this embodiment of the invention is used for executing the following steps: acquiring data of entities from a plurality of data sources, and cleaning the acquired data to obtain cleaned data; extracting the entities in each data source, the entity attributes in each data source, and the connection relations among the entities in each data source from the cleaned data; fusing the entities in each data source according to a preset entity fusion rule to obtain the fused entities in each data source; fusing the entity attributes between the data sources according to a preset attribute similarity rule to obtain the fused entity attributes; constructing a knowledge graph of each data source according to the fused entities in each data source, the fused entity attributes, and the connection relations among the entities in each data source; and fusing the knowledge graphs of each data source according to a preset graph matching rule to obtain a fused knowledge graph.

Referring to fig. 15, fig. 15 is a schematic block diagram of a computer device according to an embodiment of the present invention.

With reference to fig. 15, the device 500 includes a processor 502, a memory, and a network interface 505, which are connected by a system bus 501, wherein the memory may include a storage medium 503 and an internal memory 504.

The storage medium 503 may store an operating system 5031 and a computer program 5032. The computer program 5032, when executed, may cause the processor 502 to perform a multi-dimensional knowledge-graph based fusion method.

The processor 502 is used to provide computing and control capabilities to support the operation of the overall device 500.

The internal memory 504 provides an environment for the execution of a computer program 5032 in the non-volatile storage medium 503, which computer program 5032, when executed by the processor 502, causes the processor 502 to perform a multi-dimensional knowledge-graph based fusion method.

The network interface 505 is used for network communication, such as providing for transmission of data information, etc. It will be appreciated by those skilled in the art that the structure shown in fig. 15 is merely a block diagram of a portion of the structure associated with the present inventive arrangements and is not limiting of the apparatus 500 to which the present inventive arrangements are applied, and that a particular apparatus 500 may include more or fewer components than shown, or may combine certain components, or have a different arrangement of components.

The processor 502 is configured to execute the computer program 5032 stored in the memory to perform the following functions: acquiring data of entities from a plurality of data sources, and cleaning the acquired data to obtain cleaned data; extracting the entities in each data source, the entity attributes in each data source, and the connection relations among the entities in each data source from the cleaned data; fusing the entities in each data source according to a preset entity fusion rule to obtain the fused entities in each data source; fusing the entity attributes between the data sources according to a preset attribute similarity rule to obtain the fused entity attributes; constructing a knowledge graph of each data source according to the fused entities in each data source, the fused entity attributes, and the connection relations among the entities in each data source; and fusing the knowledge graphs of each data source according to a preset graph matching rule to obtain a fused knowledge graph.

Those skilled in the art will appreciate that the embodiment of the apparatus 500 shown in fig. 15 is not limiting of the specific construction of the apparatus 500, and in other embodiments, the apparatus 500 may include more or less components than illustrated, or certain components may be combined, or a different arrangement of components. For example, in some embodiments, the device 500 may include only the memory and the processor 502, and in such embodiments, the structure and the function of the memory and the processor 502 are consistent with the embodiment shown in fig. 15, and will not be described herein.

It should be appreciated that, in embodiments of the present invention, the processor 502 may be a central processing unit (CPU), or may be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor or any conventional processor.

In another embodiment of the invention, a computer storage medium is provided. The storage medium may be a non-volatile computer-readable storage medium or a volatile storage medium. The storage medium stores a computer program 5032, and the computer program 5032, when executed by the processor 502, performs the following steps: acquiring data of entities from a plurality of data sources, and cleaning the acquired data to obtain cleaned data; extracting the entities in each data source, the entity attributes in each data source, and the connection relations among the entities in each data source from the cleaned data; fusing the entities in each data source according to a preset entity fusion rule to obtain the fused entities in each data source; fusing the entity attributes between the data sources according to a preset attribute similarity rule to obtain the fused entity attributes; constructing a knowledge graph of each data source according to the fused entities in each data source, the fused entity attributes, and the connection relations among the entities in each data source; and fusing the knowledge graphs of each data source according to a preset graph matching rule to obtain a fused knowledge graph.

It will be clearly understood by those skilled in the art that, for convenience and brevity of description, specific working procedures of the apparatus, device and unit described above may refer to corresponding procedures in the foregoing method embodiments, which are not repeated herein. Those of ordinary skill in the art will appreciate that the elements and algorithm steps described in connection with the embodiments disclosed herein may be embodied in electronic hardware, in computer software, or in a combination of the two, and that the elements and steps of the examples have been generally described in terms of function in the foregoing description to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.

In the several embodiments provided by the present invention, it should be understood that the disclosed apparatus, device and method may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, and for example, the division of the units is merely a logical function division, there may be another division manner in actual implementation, or units having the same function may be integrated into one unit, for example, multiple units or components may be combined or may be integrated into another system, or some features may be omitted, or not performed. In addition, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices, or elements, or may be an electrical, mechanical, or other form of connection.

The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the embodiment of the present invention.

In addition, each functional unit in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.

The integrated units may be stored in a storage medium if implemented in the form of software functional units and sold or used as stand-alone products. Based on such understanding, the technical solution of the present invention, in essence, or the part of it that contributes over the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium, comprising several instructions for causing a device 500 (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a magnetic disk, an optical disk, or other media capable of storing program code.

While the invention has been described with reference to certain preferred embodiments, those skilled in the art will readily conceive of various equivalent modifications or substitutions without departing from the scope of the invention. Therefore, the protection scope of the invention is subject to the protection scope of the claims.

Claims

1. A multi-dimensional knowledge-graph-based fusion method, characterized by comprising the following steps:

Fusing the knowledge graphs of each data source according to a preset graph matching rule to obtain a fused knowledge graph;

wherein fusing the entity attributes among the data sources according to a preset attribute similarity rule to obtain fused entity attributes comprises:

Isomorphism processing is carried out on the entity attribute in each data source, and entity attributes after isomorphism processing are obtained;

Constructing partition indexes of the data of the entities from the plurality of data sources based on a Blocking technique and generating matching pairs of the entities of each data source according to the partition indexes, wherein, when the data of the entities of the plurality of data sources are partitioned, the data are partitioned by province and city or by the first n digits of the telephone number to generate the matching pairs of the entities, and an indexing partition method of the Blocking technique is used to construct the partition indexes of the data of the entities from the plurality of data sources and to generate the matching pairs of the entities of each data source according to the partition indexes;

Generating a fusion rule of the entity attributes according to the similarity of the matching pairs, and fusing the entity attributes among the data sources according to the fusion rule to obtain fused entity attributes;

Performing feature selection on the entity attribute in each data source to obtain feature data of the entity attribute in each data source;

performing numerical processing on the characteristic data of the entity attribute in each data source to obtain the numerical value of the entity attribute in each data source;

obtaining the similarity of the entity attributes among the data sources according to the numerical value of the entity attribute in each data source;

and fusing the entity attributes among the data sources according to the similarity of the entity attributes among the data sources to obtain the fused entity attributes.

2. The multi-dimensional knowledge-graph-based fusion method of claim 1, wherein the performing data cleaning on the acquired data to obtain cleaned data comprises:

Performing traditional-to-simplified Chinese conversion on the entity with a preset conversion tool to obtain a converted entity;

and removing special symbols from the converted entity based on a regular expression to obtain the cleaned entity.

3. The multi-dimensional knowledge-graph-based fusion method of claim 2, wherein the fusing the entities in each data source according to a preset entity fusion rule to obtain the fused entities in each data source comprises:

Performing mode extraction on the names of the cleaned entities according to a preset extraction rule to obtain a plurality of words;

And recombining the plurality of words according to a preset recombination rule to obtain recombined names.

4. The multi-dimensional knowledge-graph-based fusion method of claim 1, wherein the fusing the knowledge graphs of each data source according to a preset graph matching rule to obtain a fused knowledge graph comprises:

calculating the similarity of entities between each knowledge graph based on an edit distance algorithm;

And fusing the knowledge maps of each data source according to the similarity of the entities between each knowledge map to obtain the fused knowledge map.

5. A multi-dimensional knowledge-graph-based fusion device, characterized by comprising:

the third fusion unit is used for fusing the knowledge graphs of each data source according to a preset graph matching rule to obtain a fused knowledge graph;

wherein the second fusion unit comprises:

The isomorphism processing unit is used for carrying out isomorphism processing on the entity attribute in each data source to obtain the entity attribute after isomorphism processing;

A matching pair generating unit, configured to construct partition indexes of the data of the entities from the plurality of data sources based on a Blocking technique and to generate matching pairs of the entities of each data source according to the partition indexes, wherein, when the data of the entities of the plurality of data sources are partitioned, the data are partitioned by province and city or by the first n digits of the telephone number to generate the matching pairs of the entities, and an indexing partition method of the Blocking technique is used to construct the partition indexes of the data of the entities from the plurality of data sources and to generate the matching pairs of the entities of each data source according to the partition indexes;

the first calculation unit is used for calculating the similarity of the matching pairs between each data source based on an edit distance algorithm to obtain the similarity of the matching pairs;

A fifth fusion unit, configured to generate a fusion rule of the entity attribute according to the similarity of the matching pair, and fuse the entity attribute between each data source according to the fusion rule, so as to obtain a fused entity attribute;

wherein the second fusion unit comprises:

the feature selection unit is used for performing feature selection on the entity attribute in each data source to obtain feature data of the entity attribute in each data source;

The numerical processing unit is used for performing numerical processing on the characteristic data of the entity attribute in each data source to obtain the numerical value of the entity attribute in each data source;

The third acquisition unit is used for acquiring the similarity of the entity attributes among the data sources according to the numerical value of the entity attribute in each data source;

And a fourth fusion unit, configured to fuse the entity attributes between each data source according to the similarity of the entity attributes between each data source, so as to obtain the fused entity attributes.

6. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the multi-dimensional knowledge-graph based fusion method according to any one of claims 1 to 4 when executing the computer program.

7. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program, which when executed by a processor causes the processor to perform the multi-dimensional knowledge-graph based fusion method according to any one of claims 1 to 4.