Disclosure of Invention
The invention addresses the problem in the prior art that virtual identities on the network are numerous and varied, which makes useful information difficult to extract. It provides a method for association fusion of virtual identity information, which can determine virtual identities to a certain extent and provides a reliable scheme for tracing them.
A method for association fusion of virtual identity information comprises the following steps:
acquiring virtual identity information of a plurality of platforms;
analyzing the virtual identity information and extracting metadata;
associating the metadata to form an associated network;
building a tag for the metadata;
calculating a membership coefficient of the tag;
and performing identity fusion by adopting a label propagation algorithm, and determining a community.
Further, the virtual identity information is acquired by means of a crawler and network monitoring.
Further, the virtual identity information at least includes an account ID.
Further, the metadata includes attributes and attribute values.
Further, the format of the tag is ((account ID, membership coefficient), source), and the initial membership coefficient of the tag is 1.
Further, calculating a membership coefficient of the tag, comprising:
storing metadata from different sources and tags thereof;
simplifying the tags by removing the source;
and comparing all the attributes and attribute values, and reassigning the membership coefficients.
Further, comparing all the attributes and attribute values and reassigning the membership coefficients comprises:
for tags with the same attribute and attribute value but different account IDs, reassigning the membership coefficients of the tags so that the sum of the membership coefficients of all tags with the same attribute and attribute value is 1.
Further, comparing all the attributes and attribute values and reassigning the membership coefficients further comprises:
merging initial tags with the same attribute, attribute value, and account ID, and reassigning the merged tags so that the sum of the membership coefficients of all tags with the same attribute and attribute value is 1.
Further, performing identity fusion by adopting a label propagation algorithm and determining a community comprises:
arbitrarily selecting two account IDs and fusing them to serve as the target node;
extracting all metadata and tags related to the two account IDs to serve as neighbor nodes of the target node;
fusing the tags of the neighbor nodes according to the fused account ID;
updating the membership tag of the target node, and normalizing the membership coefficients of the membership tag so that their sum is 1;
and judging whether the membership coefficient is greater than or equal to a preset threshold; if so, determining that the two account IDs belong to the same person, and defining the account IDs of the same person as a community.
The method for association fusion of virtual identity information provided by the invention targets account IDs that belong to the same person or organization. By using a label propagation algorithm, it can effectively determine the association between account IDs, and the corresponding real-world information can help confirm whether the account IDs belong to the same person or the same organization. The method supports the construction of a knowledge graph of virtual identity information and gives network supervisors a new scheme for tracing virtual identity information to its source.
Detailed Description
In order to make the objects, technical solutions, and effects of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Referring to fig. 1 and fig. 2, the present embodiment provides a method for fusing association of virtual identity information, including:
step S101, acquiring virtual identity information of a plurality of platforms;
step S102, analyzing the virtual identity information and extracting metadata;
step S103, associating the metadata to form an associated network;
step S104, building a tag for the metadata;
step S105, calculating a membership coefficient of the tag;
step S106, performing identity fusion by adopting a label propagation algorithm, and determining a community.
Specifically, in step S101, the virtual identity information may be obtained by means of a crawler, network monitoring, and the like. A crawler, i.e. a web crawler, is a program or script that automatically captures web information according to certain rules; other, less commonly used names include ant, automatic indexer, emulator, and worm. Network monitoring includes network traffic monitoring, website keyword monitoring, and other modes: monitoring network traffic and restoring it yields the effective information carried in the traffic, while monitoring website keywords lets the user obtain the data for the required keywords more directly.
Referring to fig. 3, fig. 3 provides an example of obtaining virtual identity information from a web page associated with the X site.
In step S102, after the virtual identity information is obtained, it is analyzed. Information obtained by the crawler must undergo data preprocessing to yield the corresponding virtual identity data; for traffic monitoring, the traffic must be restored to obtain the corresponding data; for website keyword monitoring, the keywords must be extracted; and so on. The extracted data consists of attributes and attribute values, which together are called metadata. For example, in "name-Zhang San", "name" is an attribute and "Zhang San" is the value corresponding to that attribute; "name-Zhang San" is defined as one piece of metadata, and similarly "nickname-HappyZS" is one piece of metadata.
Since each virtual identity has a fixed account, the account ID is selected as the unique identifier of the account's virtual identity information on a given platform X, and is recorded as (account ID: information source X). The obtained virtual identity information therefore includes at least the account ID, and should also include other virtual identity information such as email, phone number, QQ number, password, and nickname; the more information is extracted, the more complete the resulting virtual identity association network. Information such as name, gender, birth date, age, identification card number, and school is real-world information, so it is used as auxiliary information and plays an important role in virtual identity information fusion. For convenience of subsequent retrieval, the metadata is saved in a corresponding database in the triple form ((account ID: information source)-attribute-value). Referring to fig. 4, fig. 4 shows the database format in which the metadata is saved.
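The triple form described above can be sketched in Python as follows. This is only an illustration: the class name Metadata, the helper to_row, and the sample records are assumptions made for the example and are not taken from fig. 4.

```python
from typing import NamedTuple


class Metadata(NamedTuple):
    """One record in the form ((account ID: information source)-attribute-value)."""
    account_id: str
    source: str
    attribute: str
    value: str

    def key(self) -> str:
        # Unique identifier of the account within its platform.
        return f"{self.account_id}:{self.source}"


def to_row(m: Metadata) -> tuple:
    # Flatten a record into the three columns of the triple.
    return (m.key(), m.attribute, m.value)


# Illustrative records only; the platform name "X" follows the text.
records = [
    Metadata("zhangsan0123", "X", "mailbox", "zhangsan1990@163.com"),
    Metadata("zhangsan0123", "X", "nickname", "HappyZS"),
]
rows = [to_row(m) for m in records]
```

Keeping the account ID and the source together in the first column means that the same account name on two platforms is never confused before fusion.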
Further, step S103 is executed to associate the metadata and form an association network. A tree-shaped association network displays the associations among account IDs, attributes, and attribute values more intuitively. Referring to fig. 5, a schematic diagram of an association network in an application scenario is provided.
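One possible in-memory representation of such an association network is sketched below. The nested-dictionary layout, the reverse index, and the function name build_network are assumptions chosen for illustration; the embodiment itself only specifies a tree-shaped network as in fig. 5.

```python
from collections import defaultdict

# Illustrative triples in the form (account ID:source, attribute, value).
triples = [
    ("zhangsan0123:X", "mailbox", "zhangsan1990@163.com"),
    ("zhangsi0123:Y", "mailbox", "zhangsan1990@163.com"),
    ("zhangsi0234:Y", "mailbox", "zhangsan1990@163.com"),
]


def build_network(triples):
    # tree: account -> attribute -> set of values (the tree-shaped view).
    tree = defaultdict(lambda: defaultdict(set))
    # shared: (attribute, value) -> accounts that carry this metadata.
    shared = defaultdict(set)
    for account, attribute, value in triples:
        tree[account][attribute].add(value)
        shared[(attribute, value)].add(account)
    return tree, shared


tree, shared = build_network(triples)
linked = sorted(shared[("mailbox", "zhangsan1990@163.com")])
```

The reverse index makes it cheap to find every account that shares a given piece of metadata, which is what the later tag steps operate on.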
Further, step S104 is executed to build a tag for the metadata. The tag has the format ((account ID, membership coefficient), source), and the initial membership coefficient is 1. For example, the initial tag of the metadata "name-tree" derived from platform X with account ID zhangsan0123 is ((zhangsan0123, 1), X).
Further, step S105 is executed. The metadata from different sources and the initial tags built in the previous step are stored in a unified manner. Because the end goal is the fusion of virtual identity information, the source of the information no longer needs to be considered, so the source can be removed from the tag. That is, for a piece of metadata ((zhangsan0123: X)-mailbox-zhangsan1990@163.com) in source X, the tag becomes (zhangsan0123, 1) after the source is removed.
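Building an initial tag and then removing its source can be sketched as follows. The function names initial_tag and strip_source are hypothetical, and Fraction is used only so that the coefficients in later steps stay exact.

```python
from fractions import Fraction


def initial_tag(account_id, source):
    # Tag format: ((account ID, membership coefficient), source), coefficient 1.
    return ((account_id, Fraction(1)), source)


def strip_source(tag):
    # Drop the source and keep only (account ID, membership coefficient).
    (account_id, coefficient), _source = tag
    return (account_id, coefficient)


tag = initial_tag("zhangsan0123", "X")
simplified = strip_source(tag)
```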
Subsequently, the membership coefficients are reassigned by comparing all attributes and attribute values:
for tags with the same attribute and attribute value but different account IDs, the membership coefficients of the tags are reassigned so that the sum of the membership coefficients of all tags with the same attribute and attribute value is 1.
An example follows:
S1. for a piece of metadata ((zhangsan0123: X)-mailbox-zhangsan1990@163.com) in source X, the initial tag built is ((zhangsan0123, 1), X);
S2. for another piece of metadata ((zhangsi0123: Y)-mailbox-zhangsan1990@163.com) in source Y, the initial tag built is ((zhangsi0123, 1), Y);
S3. for another piece of metadata ((zhangsi0234: Y)-mailbox-zhangsan1990@163.com) in source Y, the initial tag built is ((zhangsi0234, 1), Y).
The three pieces of metadata all have the same "attribute-attribute value", namely "mailbox-zhangsan1990@163.com", but because their sources or account IDs differ, three initial tags are built for them. Since the "attribute-attribute value" is identical, the sum of their membership coefficients must be 1, so the membership coefficients of all three are reassigned to 1/3. The tag for this piece of metadata is shown in fig. 6.
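The reassignment in S1 to S3 amounts to giving each of the n tags that share one "attribute-attribute value" a coefficient of 1/n. A minimal sketch, with the function name reassign assumed for illustration:

```python
from fractions import Fraction


def reassign(tags):
    # Every tag sharing the same attribute and value gets an equal share,
    # so the coefficients sum to 1.
    n = len(tags)
    return [(account_id, Fraction(1, n)) for account_id, _old in tags]


# Source-stripped initial tags from S1, S2, and S3.
tags = [
    ("zhangsan0123", Fraction(1)),
    ("zhangsi0123", Fraction(1)),
    ("zhangsi0234", Fraction(1)),
]
result = reassign(tags)
total = sum(coefficient for _id, coefficient in result)
```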
Further, initial tags with the same attribute, attribute value, and account ID are merged, and the merged tags are reassigned so that the sum of the membership coefficients of all tags with the same attribute and attribute value is 1.
An example follows:
S4. for a piece of metadata ((zhangsan0123: X)-mailbox-zhangsan1990@163.com) in source X, the initial tag built is ((zhangsan0123, 1), X);
S5. for another piece of metadata ((zhangsi0123: Y)-mailbox-zhangsan1990@163.com) in source Y, the initial tag built is ((zhangsi0123, 1), Y);
S6. for another piece of metadata ((zhangsan0123: Y)-mailbox-zhangsan1990@163.com) in source Y, the initial tag built is ((zhangsan0123, 1), Y).
Among the three pieces of metadata, S4 and S6 differ only in their information source; their account IDs, attributes, and attribute values are the same. Since the "attribute-attribute value" of all three is identical, the sum of their membership coefficients must be 1, so each is first reassigned a membership coefficient of 1/3. The two tags from S4 and S6 can then be merged, giving zhangsan0123 a membership coefficient of 2/3 while zhangsi0123 keeps 1/3. The tag for this piece of metadata is shown in fig. 7.
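The merge in S4 to S6 can be sketched by first giving each initial tag an equal share and then summing the shares per account ID. The function name reassign_and_merge is an assumption made for this example.

```python
from collections import defaultdict
from fractions import Fraction


def reassign_and_merge(tags):
    # Each initial tag receives 1/n; tags with the same account ID are merged
    # by adding their shares, so the total stays 1.
    share = Fraction(1, len(tags))
    merged = defaultdict(Fraction)  # Fraction() is 0
    for account_id, _old in tags:
        merged[account_id] += share
    return dict(merged)


# Source-stripped initial tags from S4, S5, and S6.
tags = [
    ("zhangsan0123", Fraction(1)),
    ("zhangsi0123", Fraction(1)),
    ("zhangsan0123", Fraction(1)),
]
merged = reassign_and_merge(tags)
```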
Further, step S106 is executed, and identity fusion is performed using a label propagation algorithm. First, any two account IDs are selected from the plurality of accounts and fused to serve as the target node. All metadata and tags associated with the two account IDs are then extracted to serve as neighbor nodes of the target node. The tags of the neighbor nodes are fused according to the fused account ID, and the target node updates its own tag according to the membership coefficients of the tags of these nodes.
This step is illustrated with an example. After the above steps, a plurality of pieces of virtual identity metadata and their tags have been obtained. Two account IDs, zhangsan0123 and zhangsi0123, are selected, and for simplicity it is assumed that all metadata and tags of the two account IDs are as shown in fig. 8-10: fig. 8 is a schematic diagram of the tag of the metadata "mailbox-zhangsan1990@163.com", fig. 9 is a schematic diagram of the tag of the metadata "phone-13312341234", and fig. 10 is a schematic diagram of the tag of the metadata "nickname-HappyZS". Fig. 11 is a schematic diagram of the fusion of the two account IDs zhangsan0123 and zhangsi0123, which together serve as the target node.
Further, referring to fig. 12, all metadata and tags of two account IDs are extracted and serve as neighbor nodes of the target node.
Referring to fig. 12 and 13, the tags of the neighbor nodes are fused according to the fused account ID. For the metadata "mailbox-zhangsan1990@163.com", the original tag is (zhangsan0123, 1/3) (zhangsi0123, 1/3) (zhangsi0234, 1/3); the account IDs belonging to the target node zhangsan0123&zhangsi0123 are combined and their membership coefficients are added, so the fused tag is (zhangsan0123&zhangsi0123, 2/3) (zhangsi0234, 1/3). Similarly, for the metadata "phone-13312341234", the fused tag is (zhangsan0123&zhangsi0123, 1). For the metadata "nickname-HappyZS", the original tag does not contain zhangsi0123; after fusion, the account ID zhangsan0123 becomes zhangsan0123&zhangsi0123, and the fused tag is (zhangsan0123&zhangsi0123, 1/2) (zhangsi0234, 1/2).
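The fusion of a neighbor node's tag can be sketched as follows. The function name fuse_tag and the dictionary representation are assumptions for illustration; the third account ID is written zhangsi0234, following S3, and the nickname tag is the one implied by the sums used in the subsequent update step.

```python
from fractions import Fraction


def fuse_tag(tag, targets, fused_id):
    # Replace every target account ID with the fused ID and add up the
    # membership coefficients that now share the same ID.
    fused = {}
    for account_id, coefficient in tag.items():
        key = fused_id if account_id in targets else account_id
        fused[key] = fused.get(key, Fraction(0)) + coefficient
    return fused


targets = {"zhangsan0123", "zhangsi0123"}
fused_id = "zhangsan0123&zhangsi0123"

mailbox = {
    "zhangsan0123": Fraction(1, 3),
    "zhangsi0123": Fraction(1, 3),
    "zhangsi0234": Fraction(1, 3),
}
nickname = {"zhangsan0123": Fraction(1, 2), "zhangsi0234": Fraction(1, 2)}

fused_mailbox = fuse_tag(mailbox, targets, fused_id)
fused_nickname = fuse_tag(nickname, targets, fused_id)
```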
Further, referring to fig. 14, the membership tag of the target node is updated, and the membership coefficients of the membership tag are normalized so that their sum is 1.
To update the membership tag of the target node, the corresponding membership coefficients are first added. For the tag (zhangsan0123&zhangsi0123), the sum is 2/3 + 1 + 1/2 = 13/6; for the tag (zhangsi0234), the sum is 1/3 + 1/2 = 5/6. However, 13/6 + 5/6 = 3, which is not equal to 1, so normalization is performed on the principle that the membership coefficients sum to 1: each sum is divided by 3, giving 13/18 and 5/18. The updated membership tag of the target node is therefore (zhangsan0123&zhangsi0123, 13/18) (zhangsi0234, 5/18).
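The update and normalization of the target node's tag can be checked with a short sketch. The function name update_target is hypothetical, and the three input tags are the fused neighbor tags for the mailbox, phone, and nickname metadata, with the third account ID written zhangsi0234 as in S3.

```python
from fractions import Fraction


def update_target(fused_tags):
    # Add the coefficients of each account ID across all neighbor tags,
    # then divide by the grand total so the result sums to 1.
    totals = {}
    for tag in fused_tags:
        for account_id, coefficient in tag.items():
            totals[account_id] = totals.get(account_id, Fraction(0)) + coefficient
    grand_total = sum(totals.values())
    return {account_id: c / grand_total for account_id, c in totals.items()}


fused_id = "zhangsan0123&zhangsi0123"
fused_tags = [
    {fused_id: Fraction(2, 3), "zhangsi0234": Fraction(1, 3)},  # mailbox
    {fused_id: Fraction(1)},                                    # phone
    {fused_id: Fraction(1, 2), "zhangsi0234": Fraction(1, 2)},  # nickname
]
target_tag = update_target(fused_tags)
```

Exact fractions are used so that the result can be compared directly with the 13/18 and 5/18 worked out in the text.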
According to a preset threshold, when the membership coefficient of the fused account ID reaches the preset threshold, it can be determined that the two account IDs belong to the same person or the same organization. Assuming that the preset threshold is 1/2, the membership tag of the target node is (zhangsan0123&zhangsi0123, 13/18), and since the membership coefficient 13/18 is greater than 1/2, zhangsan0123 and zhangsi0123 can be considered the same person.
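The threshold decision is a single comparison, sketched here with a hypothetical function name same_entity. The comparison uses "greater than or equal to", matching the wording of the method steps.

```python
from fractions import Fraction


def same_entity(target_tag, fused_id, threshold):
    # The two account IDs are treated as one person or organization when the
    # fused ID's membership coefficient reaches the preset threshold.
    return target_tag.get(fused_id, Fraction(0)) >= threshold


fused_id = "zhangsan0123&zhangsi0123"
target_tag = {fused_id: Fraction(13, 18), "zhangsi0234": Fraction(5, 18)}
decision = same_entity(target_tag, fused_id, Fraction(1, 2))
```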
Auxiliary information such as name, gender, birth date, and identification number is then consulted: if it points to the same person, the two account IDs can be considered to belong to the same person, and if it points to multiple persons, the two account IDs can be considered to belong to two persons in the same organization.
After repeated rounds of virtual identity fusion, the account IDs confirmed as belonging to the same person or the same organization are defined as a new community, and the community holds the attributes and attribute values of the metadata of all its account IDs.
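A community that collects the metadata of all its account IDs might be represented as below. The dictionary layout, the function name build_community, and the assignment of sample metadata to accounts are assumptions made purely for illustration.

```python
def build_community(account_ids, metadata_by_account):
    # The community holds the union of the (attribute, value) pairs of
    # every member account ID.
    metadata = set()
    for account_id in account_ids:
        metadata |= metadata_by_account.get(account_id, set())
    return {"members": sorted(account_ids), "metadata": metadata}


# Illustrative metadata only.
metadata_by_account = {
    "zhangsan0123": {
        ("mailbox", "zhangsan1990@163.com"),
        ("phone", "13312341234"),
        ("nickname", "HappyZS"),
    },
    "zhangsi0123": {
        ("mailbox", "zhangsan1990@163.com"),
        ("phone", "13312341234"),
    },
}
community = build_community({"zhangsan0123", "zhangsi0123"}, metadata_by_account)
```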
The method for association fusion of virtual identity information provided by this embodiment targets account IDs that belong to the same person or organization. By using a label propagation algorithm, it can effectively determine the association between account IDs, and the corresponding real-world information can help confirm whether the account IDs belong to the same person or the same organization. The method also supports the construction of a knowledge graph of virtual identity information and gives network supervisors a new scheme for tracing virtual identity information to its source.
It will be understood that modifications and variations can be made by persons skilled in the art in light of the above teachings and all such modifications and variations are intended to be included within the scope of the invention as defined in the appended claims.