Knowledge graph-based hybrid group discovery methodTechnical Field
The invention relates to the technical field of big data processing, in particular to a knowledge graph-based hybrid group discovery method.
Background
Community members gradually form a certain stable relationship through interaction in activities such as work, study, life, entertainment and the like, and further form a social network. Many practical networks have a community structure, that is, the whole network is composed of several communities, the connections between communities are relatively sparse, and the connections inside communities are relatively dense.
At present, business databases of various departments of public security are relatively dispersed, and criminal teams are mainly identified by means of investigation and investigation, meticulous reconnaissance, data investigation and the like of workers, but with the increasing of data volume and the huge and complicated social networks, the workload of case analysis and team discovery is increased, social network teams are effectively decomposed, and the team investigation range can be reduced, so that the working efficiency is improved, the working intensity is reduced, and the purpose of assisting decision making is achieved.
Disclosure of Invention
In order to solve the technical problems, the invention provides a mixed group discovery method based on a knowledge graph, various graph calculations are mature day by day and are applied to various fields, such as fraud loop monitoring, shortest path calculation and the like.
In order to achieve the purpose, the invention provides a hybrid group discovery method based on a knowledge graph, which is explained by using a sample (a community where a suspect is discovered based on the knowledge graph); the system comprises a data acquisition module (1), a data sorting module (2), a knowledge graph building module (3) and a community dividing module (4);
the system diagram is as shown in fig. 2, and the data acquisition module (1) is used for cleaning and converting the mastered relevant data of each social relationship of the criminal suspect according to the characteristics of the existing business database of each department of public security;
the data sorting module (2) is used for sorting out attributes and relations required by the entities in the map and naming the attributes and relations in a unified manner;
the knowledge graph building module (3) is used for building an entity relationship network;
the community division module (4) performs community division on the entity relationship network by using a louvain algorithm.
Preferably, the data acquisition module (1) is responsible for centralizing data stored in each business database of each department of the public security institution, such as social basic information, case information and other data;
the data sorting module (2) sorts the entity data set, the relationship types among the entities and the entity attribute types; like the social network in fig. 1, the entities are persons (a, b, c, d), the attributes may be basic information of the persons, i.e., birth date, graduation, native place, profession, occupation, etc., and the relationships include the number of calls, the number of co-morbidities, the number of co-dwellings, etc.
The knowledge graph building module (3) is used for building an entity relationship network, and the relationship can be extracted from unstructured data or directly extracted from structured data;
the extraction of the unstructured data can be obtained by means of named entity recognition, case type classification and the like, for example, entities such as Zhang III and Li IV are obtained from a text document that Zhang III and Li IV kill people in west street through entity recognition, the relationship is the same party (intentional injury), and a triple (Zhang III, the same party (intentional injury): n, Li IV) is formed, wherein n represents the common crime frequency of two people.
The structured data can be directly obtained from the database, and if the parent column of zhang san in the base table is lie si, a triple (zhang san, father, lie si) can be formed, wherein the relationship type of the triple is determined according to the actual situation.
And constructing an effective knowledge graph, namely a complex network, by the three-element data set, and initializing parameters, wherein the parameters are determined according to actual application. A complex network can be abstracted as a graph consisting of a set of points and a set of edges. Each edge has corresponding points, that is, each edge in the 'edge set' has a pair of points in the 'point set' corresponding to the point. For example, in fig. 1, the triplets (b, number of calls: 40, c), (b, number of lives: 2, c) are formed between the entity b and the entity c, and the triplets are graphically formed into a network set, i.e., a knowledge graph, by using the same principle between every two other entities. In a specific occasion, if a certain attribute is a critical factor of community division, a correlation relationship is constructed between entities with the same attribute. If a professional attribute is x company manager and b professional attribute is x company employee, then this triple can be constructed: (a, x company, b).
The community dividing module (4) extracts a sub-network with a path being n and centered around a suspect from the knowledge graph of the knowledge graph building module (3), divides the sub-network into communities by a louvain algorithm, namely, the compactness of one community is measured by the modularity, and if a node is added into the community, the modularity of the community is increased to the maximum extent, the node should belong to the community. If the modularity of the community is not increased after other communities are added, the community is left in the current community. The value of the path n is generally 2 according to a six-degree separation theory (the depth of any network does not exceed 4 layers) and case experience (the depth of the connection between any two persons in a criminal partnership is mostly within a two-layer relation), and the method is as follows:
and (4-1) assuming that the network has N nodes, and allocating a community to each node, namely how many nodes have how many communities in the initial stage. There were initially 4 communities as in fig. 1;
and (4-2) considering all neighbor nodes j of each node i in the network, calculating the incremental change of the modularity when the node i moves from the community where the node i is located to the community where the neighbor nodes j are located, and moving the node i to the community where the node j which increases the modularity to the maximum and is not negative is located. If all the calculated gains are not positive numbers, the node is still in the original community. The process is repeated for all nodes and applied in sequence until no node moves, the first process stops, i.e. no movement of any node results in an increase in modularity;
the modularity gain of the community C where each time a node moves an isolated node to its neighbor is Δ Q:
Σ in: the sum of the weights of all edges inside community C;
Σ tot: the sum of the weights of the edges associated with all nodes in community C;
ki: the sum of the weights of all edges that occur at node i;
ki, in: the sum of the weights of the edges of node i to all nodes in community C;
m: the sum of the weights of all edges in the network;
the relationships in the different industry maps can be given different weights according to actual conditions, such as: if the probability of two persons who have worked together in the criminal group to do a crime again is relatively high, the weight of the number of times of the crime together with the entity is given to be higher; if the probability of two persons working against a crime group who are often connected is relatively high, the weight given to the relationship between the entities as the number of calls is higher. The weight can be obtained through historical data calculation or expert summary experience in related fields;
and (4-3) forming a new network by using the community divided in the step (4-2) as a node. The weight of the edge between the new nodes is the sum of the original weights between the two new nodes (communities), the community nodes have self-connection edges, and the weight of the community nodes is 2 times of the sum of the weights of the edges connected between all the nodes in the communities. And then iterating using the method of step (4-3) for the new network being constructed. Stopping iteration when the network is not changed any more, namely the maximum modularity appears;
Δ Q changes brought by any node moving to other communities:
Δ Q ═ modularity of the community after node i removed the original community-modularity of the community where node i was originally located) + (modularity of community j after node i moved to community j-original modularity of community j).
Compared with the prior art, the technical scheme of the invention has the following beneficial effects:
aiming at the relative dispersion of the service databases of various departments of the public security at present and the identification of criminal teams mainly by the modes of investigation and investigation, meticulous reconnaissance, data investigation and the like of workers, the technical scheme of the invention can effectively decompose social network communities and reduce the team investigation range, thereby improving the working efficiency, reducing the working intensity and achieving the aim of assisting decision.
Drawings
FIG. 1 is a social network diagram listed in the present invention;
fig. 2 is a flow chart of a knowledge-graph based hybrid group discovery method in an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
As shown in fig. 1-2, the present invention provides a specific embodiment of a knowledge-graph-based hybrid partnership discovery method, which is described herein using a sample (knowledge-graph-based discovery of a community in which a suspect is located); the system comprises a data acquisition module [1], a data sorting module [2], a knowledge graph building module [3] and a community dividing module [4 ];
the system diagram is as shown in fig. 2, and a data acquisition module [1] is used for cleaning and converting mastered relevant data of each social relation of a criminal suspect according to the characteristics of the existing business database of each department of public security;
the data sorting module [2] is used for sorting out attributes and relations required by the entities in the map and naming the attributes and relations in a unified manner;
the knowledge graph building module [3] is used for building an entity relationship network;
the community division module [4] performs community division on the entity relationship network by using a louvain algorithm;
specifically, the data acquisition module [1] is responsible for centralizing data stored in business databases of all departments of a public security organization, such as social basic information, case information and other data;
the data sorting module [2] sorts the entity data set, the relationship types among the entities and the entity attribute types; like the social network in fig. 1, the entities are persons (a, b, c, d), the attributes may be basic information of the persons, i.e., birth date, graduation, native place, profession, occupation, etc., and the relationships include the number of calls, the number of co-morbidities, the number of co-dwellings, etc.
A knowledge graph module [3] is constructed, namely an entity relation network is constructed, and the relation can be extracted from unstructured data or directly extracted from structured data;
the extraction of the unstructured data can be obtained by means of named entity recognition, case type classification and the like, for example, entities such as Zhang III and Li IV are obtained from a text document that Zhang III and Li IV kill people in west street through entity recognition, the relationship is the same party (intentional injury), and a triple (Zhang III, the same party (intentional injury): n, Li IV) is formed, wherein n represents the common crime frequency of two people.
The structured data can be directly obtained from the database, and if the parent column of zhang san in the base table is lie si, a triple (zhang san, father, lie si) can be formed, wherein the relationship type of the triple is determined according to the actual situation.
And constructing an effective knowledge graph, namely a complex network, by the three-element data set, and initializing parameters, wherein the parameters are determined according to actual application. A complex network can be abstracted as a graph consisting of a set of points and a set of edges. Each edge has corresponding points, that is, each edge in the 'edge set' has a pair of points in the 'point set' corresponding to the point. For example, in fig. 1, the triplets (b, number of calls: 40, c), (b, number of lives: 2, c) are formed between the entity b and the entity c, and the triplets are graphically formed into a network set, i.e., a knowledge graph, by using the same principle between every two other entities. In a specific occasion, if a certain attribute is a critical factor of community division, a correlation relationship is constructed between entities with the same attribute. If a professional attribute is x company manager and b professional attribute is x company employee, then this triple can be constructed: (a, x company, b).
The community dividing module [4] extracts a sub-network with a certain suspected person as a center and a path of n from the knowledge graph of the knowledge graph building module [3], divides the sub-network into communities by a louvain algorithm, namely, the compactness of one community is measured by the modularity, and if a node is added into a certain community, the modularity of the community is increased to the maximum extent, the node should belong to the community. If the modularity of the community is not increased after other communities are added, the community is left in the current community. The value of the path n is generally 2 according to a six-degree separation theory (the depth of any network does not exceed 4 layers) and case experience (the depth of the connection between any two persons in a criminal partnership is mostly within a two-layer relation), and the method is as follows:
and (4-1) assuming that the network has N nodes, and allocating a community to each node, namely how many nodes have how many communities in the initial stage. There were initially 4 communities as in fig. 1;
and (4-2) considering all neighbor nodes j of each node i in the network, calculating the incremental change of the modularity when the node i moves from the community where the node i is located to the community where the neighbor nodes j are located, and moving the node i to the community where the node j which increases the modularity to the maximum and is not negative is located. If all the calculated gains are not positive numbers, the node is still in the original community. The process is repeated for all nodes and applied in sequence until no node moves, the first process stops, i.e. no movement of any node results in an increase in modularity;
the modularity gain of the community C where each time a node moves an isolated node to its neighbor is Δ Q:
Σ in: the sum of the weights of all edges inside community C;
Σ tot: the sum of the weights of the edges associated with all nodes in community C;
ki: the sum of the weights of all edges that occur at node i;
ki, in: the sum of the weights of the edges of node i to all nodes in community C;
m: the sum of the weights of all edges in the network;
the relationships in the different industry maps can be given different weights according to actual conditions, such as: if the probability of two persons who have worked together in the criminal group to do a crime again is relatively high, the weight of the number of times of the crime with the entity is given to be a higher value. If the probability of two persons working against a crime group who are often connected is relatively high, the weight given to the relationship between the entities as the number of calls is higher. The weight can be obtained through historical data calculation or expert summary experience in related fields;
and (4-3) forming a new network by using the community divided in the step (4-2) as a node. The weight of the edge between the new nodes is the sum of the original weights between the two new nodes (communities), the community nodes have self-connection edges, and the weight of the community nodes is 2 times of the sum of the weights of the edges connected between all the nodes in the communities. And then iterating using the method of step (4-3) for the new network being constructed. Stopping iteration when the network is not changed any more, namely the maximum modularity appears;
Δ Q changes brought by any node moving to other communities:
Δ Q ═ modularity of the community after node i removed the original community-modularity of the community where node i was originally located) + (modularity of community j after node i moved to community j-original modularity of community j).
Aiming at the relative dispersion of the service databases of various departments of the public security at present and the identification of criminal teams mainly by the modes of investigation and investigation, meticulous reconnaissance, data investigation and the like of workers, the technical scheme of the invention can effectively decompose social network communities and reduce the team investigation range, thereby improving the working efficiency, reducing the working intensity and achieving the aim of assisting decision.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
The principle and embodiments of the present invention have been described herein by way of specific examples, which are provided only to help understand the method and the core idea of the present invention, and the above is only a preferred embodiment of the present invention, and it should be noted that there are objectively infinite specific structures due to the limited character expressions, and it will be apparent to those skilled in the art that a plurality of modifications, decorations or changes can be made without departing from the principle of the present invention, and the above technical features can also be combined in a suitable manner; such modifications, variations, combinations, or adaptations of the invention using its spirit and scope, as defined by the claims, may be directed to other uses and embodiments.