CN108334580A

Movatterモバイル変換

Info

Publication number: CN108334580A
Application number: CN201810071418.XA
Authority: CN
Inventors: 黄海辉; 王欣; 禹果; 余浩; 周秀秀
Original assignee: Chongqing University of Post and Telecommunications
Current assignee: Chongqing University of Post and Telecommunications
Priority date: 2018-01-25
Filing date: 2018-01-25
Publication date: 2018-07-27

Abstract

Translated fromChinese

本发明涉及社交网络与数据挖掘领域，提出一种结合链接和属性信息的社区发现方法；主要包括以下步骤：输入具有链接和属性信息的社交网络数据，构造以链接关系为基础的邻接矩阵和以属性信息为基础的属性矩阵；根据两个数据矩阵，构造联合贝叶斯概率模型；采用非负矩阵分解方法计算节点的最大归属度，获得初步的社区划分；根据节点的归属情况，计算节点的绝对归属度，得到重叠社区结构；本发明将社交网络中的链接信息与属性信息结合，有利于提高社区检测中数据的利用价值，可提高社区发现的精确度和效率，适合应用于同时具有属性和链接信息的主题社区发现。

The present invention relates to the field of social network and data mining, and proposes a community discovery method combining links and attribute information; it mainly includes the following steps: inputting social network data with link and attribute information, constructing an adjacency matrix based on link relationship and An attribute matrix based on attribute information; construct a joint Bayesian probability model based on two data matrices; use the non-negative matrix decomposition method to calculate the maximum degree of attribution of nodes, and obtain preliminary community division; according to the attribution of nodes, calculate the Absolute degree of belonging can obtain overlapping community structure; the present invention combines link information and attribute information in social networks, which is conducive to improving the utilization value of data in community detection, can improve the accuracy and efficiency of community discovery, and is suitable for applications with attributes and link information to topic community discovery.

Description

Translated fromChinese

一种结合链接和属性信息的社区发现方法A Community Discovery Method Combining Link and Attribute Information

技术领域technical field

本发明涉及社交网络和数据挖掘领域，具体为一种结合链接和属性信息的社区发现方法。The invention relates to the fields of social network and data mining, in particular to a community discovery method combining link and attribute information.

背景技术Background technique

社交网络作为信息传递的主要载具，其涵盖的信息量对于当今社会具有重要的研究意义，从个体到群体，从小世界到大社会，现实生活中总存在隐含的联系将人们链接起来。社区发现常用于分析社会群体之间的结构特征。As the main carrier of information transmission, social network has important research significance for the amount of information it covers in today's society. From individuals to groups, from the small world to the big society, there are always implicit connections in real life that link people together. Community discovery is often used to analyze structural characteristics among social groups.

图的拓扑结构和节点属性在社交网络中具有重大的参考价值，仅从其中一点来研究社区的结构特征显然不具有说服力，通常一个社区表示一类具有相同属性的集合，如果它们之间存在直接的联系，表明相邻两节点之间的社会关系链接更为紧密，如果同边相连的两个节点属于不同的社区，这个边就可以当作两个不同属性社区链接的“纽带”，如果“纽带”越多，说明两个社区之间链接也更为紧密，从而为重叠社区的发现和社区演化的研究提供了更好的研究条件。针对缺乏有效的方法来平衡网络结构和节点的结构特征，Wu等人提出一种利用全局结构和局部邻域特征的SAGL方法来探索社区的结构，首先评估每个节点的全局重要性，并通过组合边缘强度和节点相似度来计算每个节点对的相似度，然后使用K-Medoids聚类算法来对节点属性图和拓扑结构图进行分割，以分割后节点的相似度和每个集群的质心来作为节点分配的条件，从而划分社区，该方法能有效将节点拓扑结构和属性信息进行结合，从而检测到更有意义的主题社区，但是社区聚类质量不高且聚类时间复杂度较高，在实际的复杂网络中应用效果较低。常等人提出一种基于联合矩阵分解的节点多属性网络社团检测方法CDJMF，采用非负矩阵分解联合节点间的余弦相似度进行求解，将带约束的最小均方误差问题转化为经典的凸二次规划算法，从理论上证明了求解的目标函数的收敛性与非增性，在相同评价指标的情况下证明该算法有更高的社区发现效率，但是算法中依赖的参数太多，导致算法精确度不高。The graph topology and node attributes have great reference value in social networks. It is obviously not convincing to study the structural characteristics of communities only from one point. Usually, a community represents a set with the same attributes. If there is A direct connection indicates that the social relationship link between two adjacent nodes is closer. If the two nodes connected to the same edge belong to different communities, this edge can be used as a "link" for the link between two communities with different attributes. If The more "links" there are, the closer the link between the two communities is, which provides better research conditions for the discovery of overlapping communities and the study of community evolution. Aiming at the lack of an effective method to balance the network structure and the structural characteristics of nodes, Wu et al. proposed a SAGL method using global structure and local neighborhood features to explore the structure of the community. First, evaluate the global importance of each node, and pass Combine the edge strength and node similarity to calculate the similarity of each node pair, and then use the K-Medoids clustering algorithm to segment the node attribute graph and topology graph to obtain the similarity of the segmented nodes and the centroid of each cluster As the condition of node allocation, so as to divide the community, this method can effectively combine the node topology and attribute information, so as to detect more meaningful topic communities, but the quality of community clustering is not high and the clustering time complexity is high , the application effect is low in the actual complex network. Chang et al. proposed a node multi-attribute network community detection method CDJMF based on joint matrix factorization, using non-negative matrix factorization to solve the joint cosine similarity between nodes, and transforming the constrained minimum mean square error problem into a classic convex two The sub-programming algorithm theoretically proves the convergence and non-increment of the objective function to be solved, and proves that the algorithm has higher community discovery efficiency under the same evaluation index, but the algorithm relies on too many parameters, which leads to the algorithm Not very accurate.

非负矩阵分解(Non-negative Matrix Factorization NMF)已被证明在数据分析、模式识别、信号处理、机器学习等领域得到了广泛的研究。其类似聚类的方法非常适合社交网络领域中的社区发现，通常将社交网络的拓扑结构进行矩阵分解，能够得到满意的社区发现结果，所以利用NMF对既包含链接信息又具有属性信息的社交网络数据进行社区分析为社区发现提供了新的思路。Non-negative Matrix Factorization (NMF) has been proven to be widely researched in data analysis, pattern recognition, signal processing, machine learning and other fields. Its clustering-like method is very suitable for community discovery in the field of social networks. Usually, the topological structure of the social network is matrix decomposed to obtain satisfactory community discovery results. Community analysis of data provides a new idea for community discovery.

李亚芳等人提出了一种非负矩阵的方法(参考李亚芳,贾彩燕,于剑.应用非负矩阵分解模型的社区发现方法综述[J].计算机科学与探索,2016,10(01):1-13.)，能够利用连接信息和属性信息的数据网络数据进行社区分析，但该方法的算法复杂度较高，不利用运算。Li Yafang and others proposed a non-negative matrix method (refer to Li Yafang, Jia Caiyan, Yu Jian. A review of community discovery methods using non-negative matrix factorization models [J]. Computer Science and Exploration, 2016, 10(01): 1- 13.), can use the data network data of connection information and attribute information for community analysis, but the algorithm complexity of this method is high, and no calculation is used.

发明内容Contents of the invention

鉴于此，本发明提出一种结合NMF算法的社区发现方法，具体为一种结合链接和属性信息的社区发现方法，本发明分别将属性矩阵和邻接矩阵以非负矩阵分解的形式联合分解能够降低算法的复杂度，本发明所采用的方法包括：In view of this, the present invention proposes a community discovery method combined with the NMF algorithm, specifically a community discovery method that combines link and attribute information. The present invention jointly decomposes the attribute matrix and the adjacency matrix in the form of non-negative matrix decomposition to reduce the The complexity of algorithm, the method that the present invention adopts comprises:

S1、输入社交网络数据，构造基于链接信息的邻接矩阵与基于属性信息的属性矩阵；S1. Input social network data, and construct an adjacency matrix based on link information and an attribute matrix based on attribute information;

S2、利用所述邻接矩阵和所述属性矩阵设计一种联合贝叶斯概率模型；S2. Designing a joint Bayesian probability model by using the adjacency matrix and the attribute matrix;

S3、根据所述联合贝叶斯概率模型，按照联合矩阵分解方法计算邻接矩阵和属性矩阵隐含的节点归属度矩阵，计算所述归属度矩阵的最大归属度，进行初步非重叠社区划分；S3. According to the joint Bayesian probability model, calculate the node membership matrix implied by the adjacency matrix and the attribute matrix according to the joint matrix decomposition method, calculate the maximum belonging degree of the membership matrix, and perform preliminary non-overlapping community division;

S4、计算所述归属度矩阵中节点的绝对归属度，得到社交网络的重叠社区结构。S4. Calculate the absolute membership of the nodes in the membership matrix to obtain the overlapping community structure of the social network.

进一步的，所述构建邻接矩阵的方法包括以下步骤：Further, the method for constructing an adjacency matrix includes the following steps:

S201、根据输入的社交网络数据，构造社交网络拓扑结构图，统计节点数量并标记为N；S201. According to the input social network data, construct a social network topology diagram, count the number of nodes and mark it as N;

S202、以N为阶数构造全零对称方阵X₀，迭代更新X₀；判断社交网络图中两个节点是否直接相连，如果相连就将矩阵X₀中对应的行和列置为1，否则置为0；S202. Constructing an all-zero symmetrical square matrix X₀ with N as the order, and updating X₀ iteratively; judging whether two nodes in the social network graph are directly connected, and if they are connected, set the corresponding row and column in the matrix X₀ to 1, Otherwise set to 0;

S203、将对称方阵X₀对角线设为0，即节点与节点自身默认不相邻；S203. Set the diagonal of the symmetrical square matrix_X0 to 0, that is, the node is not adjacent to the node itself by default;

S204、迭代终止时输出更新后的X₀，标记为邻接矩阵X。S204. Output the updated X₀ when the iteration is terminated, and mark it as an adjacency matrix X.

进一步的，构造属性矩阵的方法的步骤包括：Further, the steps of the method for constructing the attribute matrix include:

统计属性数量M并为每个属性从1到M开始标记，以属性数量M为行，节点数N为列构造全零属性矩阵Y₀，进行迭代更新；如果节点具有属性就将对应的行和列置为1，否则为0，从而得到属性矩阵Y。Count the number of attributes M and mark each attribute from 1 to M, construct an all-zero attribute matrix Y₀ with the number of attributes M as the row, and the number of nodes N as the column, and update iteratively; if the node has an attribute, the corresponding row and The column is set to 1, otherwise it is 0, so that the attribute matrix Y is obtained.

进一步的，所述步骤S2的联合贝叶斯概率模型包括：设计一种联合贝叶斯概率模型，综合考虑邻接矩阵与属性矩阵同时含有隐含的社区结构特征，联合贝叶斯概率模型为：Further, the joint Bayesian probability model in step S2 includes: designing a joint Bayesian probability model, taking into account the adjacency matrix and the attribute matrix at the same time containing hidden community structure features, the joint Bayesian probability model is:

p(X,Y,W₁,W₂,H,β)＝p(X|W₁,H)p(Y|W₂,H)p(W₁|β)p(W₂|β)p(H|β)p(β)p(X,Y,W₁ ,W₂ ,H,β)=p(X|W₁ ,H)p(Y|W₂ ,H)p(W₁ |β)p(W₂ |β)p (H|β)p(β)

其中，W₁表示属性与社区的关联强度矩阵；W₂表示节点与社区的关联强度矩阵；H表示节点与社区的归属度矩阵，H由邻接矩阵与属性矩阵同时决定；β为超参数；p(X|W₁,H)表示矩阵W₁和矩阵H条件下X的条件概率；p(Y|W₂,H)表示矩阵W₂和矩阵H条件下X的条件概率；p(W₁|β)表示β条件下的W₁的条件概率；p(W₂|β)表示β条件下的矩阵W₂的条件概率；p(H|β)表示β条件下的矩阵H的条件概率；p(β)表示参数β的概率。Among them, W₁ represents the correlation strength matrix between attributes and communities; W₂ represents the correlation strength matrix between nodes and communities; H represents the belongingness matrix between nodes and communities, and H is determined by the adjacency matrix and attribute matrix at the same time; β is a hyperparameter; p (X|W₁ ,H) represents the conditional probability of X under the condition of matrix W₁ and matrix H; p(Y|W₂ ,H) represents the conditional probability of X under the condition of matrix W₂ and matrix H; p(W₁ | β) represents the conditional probability of W₁ under β condition; p(W₂ |β) represents the conditional probability of matrix W₂ under β condition; p(H|β) represents the conditional probability of matrix H under β condition; p (β) represents the probability of parameter β.

进一步的，优选的，所述步骤S3的初步非重叠社区划分步骤包括：Further, preferably, the preliminary non-overlapping community division step in step S3 includes:

S301、根据联合贝叶斯概率模型，计算联合贝叶斯模型中每个先验概率的负对数似然函数；得到基于联合概率NMF模型的目标函数；S301. According to the joint Bayesian probability model, calculate the negative log likelihood function of each prior probability in the joint Bayesian model; obtain the objective function based on the joint probability NMF model;

S302、根据非负矩阵分解算法，对似然函数进行优化分解，得到迭代更新公式；S302. Optimizing and decomposing the likelihood function according to the non-negative matrix factorization algorithm to obtain an iterative update formula;

S303、根据迭代更新公式进行反复迭代，直到满足收敛条件；S303. Perform repeated iterations according to the iterative update formula until the convergence condition is satisfied;

S304、根据迭代终止时的输出的归属度矩阵，计算节点最大归属度，输出非重叠社区结构。S304. Calculate the maximum belonging degree of the node according to the belonging degree matrix output when the iteration is terminated, and output the non-overlapping community structure.

进一步的，优选的，所述步骤S301中联合贝叶斯模型中每个先验概率的负对数似然函数的计算方法为：对联合贝叶斯概率模型取负对数，根据半正态先验知识，得到W₁，W₂和H的负对数似然函数，分别对应为：Further, preferably, the calculation method of the negative log likelihood function of each prior probability in the joint Bayesian model in the step S301 is: take the negative logarithm for the joint Bayesian probability model, according to the half normal Prior knowledge, the negative log likelihood functions of W₁ , W₂ and H are obtained, which correspond to:

其中，-log p(w₁|β)表示W₁的负对数似然函数，-log p(w₂|β)表示W₂的负对数似然函数，-log p(h|β)表示H的负对数似然函数；，W₁表示节点与社区的关联强度矩阵，W₂表示属性与社区的关联强度矩阵，H表示节点与社区的归属度矩阵；w_1,ik表示矩阵W₁的第i行第k列元素，w_2,ik表示矩阵W₂的第i行第k列元素，h_kj表示矩阵H的第k行第j列元素；μ是一个常数，N表示节点数量，M表示属性数量，K表示社区个数，β表示超参数。Among them, -log p(w₁ |β) represents the negative log-likelihood function of W₁ , -log p(w₂ |β) represents the negative log-likelihood function of W₂ , -log p(h|β) Represents the negative log-likelihood function of H; W₁ represents the correlation strength matrix between nodes and communities, W₂ represents the correlation strength matrix between attributes and communities, H represents the membership degree matrix between nodes and communities; w_1,ik represents the matrix W The element in row i, column k of₁ , w_2,ik represents the element in row i, column k of matrix W₂ , h_kj represents the element in row k, column j of matrix H; μ is a constant, and N represents the number of nodes , M represents the number of attributes, K represents the number of communities, and β represents hyperparameters.

上述负对数似然函数公式表示为：通过β控制社区的数量，β越大，说明W中的列向量和H中的列向量越趋近于0，也即社区之间越不相关；每个β的值都服从Gamma分布，即：The above negative log likelihood function formula is expressed as: the number of communities is controlled by β, and the larger the β, the closer the column vector in W and the column vector in H are to 0, that is, the less correlation between the communities; each The values of β all obey the Gamma distribution, that is:

其中，p(β|a_k,b_k)表示服从伽马分布；a_k表示形状参数因子，b_k表示尺度参数因子，Γ(a_k)表示伽马函数，β≥0。Among them, p(β|a_k ,b_k ) means obeying the gamma distribution; a_k means the shape parameter factor, b_k means the scale parameter factor, Γ(a_k ) means the gamma function, and β≥0.

进一步的，优选的，所述步骤S302中迭代更新公式的计算方法包括：对负对数似然函数取最小值，结合联合非负矩阵分解算法计算迭代更新公式具体包括：利用非负矩阵分解的标准形式得到邻接矩阵的简化形式，即X≈HH^T，利用非负矩阵分解的标准形式得到属性矩阵的简化形式，即Y≈WH^T；分别用H、W和H^T对应替代负对数似然函数中的W₁，W₂和H，分别计算H、W和β梯度；利用梯度下降法得到迭代更新公式。其中，W表示替换后的邻接矩阵。Further, preferably, the calculation method of the iterative update formula in step S302 includes: taking the minimum value of the negative logarithmic likelihood function, and calculating the iterative update formula in conjunction with the joint non-negative matrix factorization algorithm specifically includes: using the non-negative matrix factorization The standard form obtains the simplified form of the adjacency matrix, that is, X≈HH^T , and the simplified form of the attribute matrix is obtained by using the standard form of non-negative matrix decomposition, that is, Y≈WH^T ; use H, W and H^T to replace the negative logarithm-like W₁ , W₂ and H in the natural function, calculate the H, W and β gradients respectively; use the gradient descent method to obtain the iterative update formula. Among them, W represents the adjacency matrix after replacement.

作为一种可选方式，所述迭代更新公式如下所示：As an optional manner, the iterative update formula is as follows:

其中，←表示赋值，Y表示属性矩阵，I表示单位矩阵，B表示由超参数β构成的对角矩阵，H表示节点与社区的归属度矩阵；β_K表示超参数β的迭代值，上标为T的矩阵表示相应矩阵的转置矩阵，a表示形状常数，b表示尺度常数，a和b均为常数。Among them, ← represents the assignment, Y represents the attribute matrix, I represents the unit matrix, B represents the diagonal matrix composed of hyperparameter β, H represents the belongingness matrix of nodes and communities; β_K represents the iterative value of hyperparameter β, superscript The matrix T represents the transpose of the corresponding matrix, a represents the shape constant, b represents the scale constant, and both a and b are constants.

所述的重叠社区结构划分方法为：根据节点的归属度矩阵计算每个节点的绝对归属度，判断绝对归属度是否大于某一阈值，如果是，则将该节点划分到当前社区，从而得到重叠的社区划分；绝对归属度的计算方法为：The method for dividing the overlapping community structure is as follows: calculate the absolute belonging degree of each node according to the belonging degree matrix of the node, judge whether the absolute belonging degree is greater than a certain threshold, and if so, divide the node into the current community, thereby obtaining the overlapping The community division; the calculation method of absolute attribution is:

其中，和分别表示节点的归属强度向量中的最小分量和最大分量，h_kj为v_ij的子向量，h_kj表示v_ij表示N个节点组成的邻接矩阵的子元素。in, and Respectively represent the minimum component and maximum component in the attribution_strength vector of the node, h_kj is the sub-vector of v_ij , h_kj represents the sub-element of the adjacency matrix composed of N nodes.

与现有技术相比，本发明具有如下优点：Compared with prior art, the present invention has following advantage:

1)本发明提出了一种新的结合属性和链接信息的社区发现方法，既保留了基于节点链接信息的社区发现的优势，又结合了社交网络中属性信息进行协同社区发现，更能体现社交网络中的多维特性，能对主题社区进行发现。1) The present invention proposes a new community discovery method combining attributes and link information, which not only retains the advantages of community discovery based on node link information, but also combines attribute information in social networks for collaborative community discovery, which can better reflect social The multidimensional nature of the network enables discovery of topical communities.

2)本发明利用了非负矩阵分解算法的优势，即其“模糊聚类”的思想能为带属性的社区发现提供必要的理论基础，在严格的数学推导下，得到的社区发现方法能够更具有代表性与稳定性。2) The present invention utilizes the advantages of the non-negative matrix factorization algorithm, that is, its "fuzzy clustering" idea can provide the necessary theoretical basis for community discovery with attributes, and under strict mathematical derivation, the obtained community discovery method can be more representative and stable.

3)本发明既能发现非重叠社区又能发现重叠社区，其算法的精确性更高，复杂度更低，适合用于大规模重叠社区挖掘。3) The present invention can discover both non-overlapping communities and overlapping communities, and its algorithm has higher accuracy and lower complexity, and is suitable for mining large-scale overlapping communities.

附图说明Description of drawings

图1是本发明的实施总体流程图；Fig. 1 is the implementation overall flowchart of the present invention;

图2是邻接矩阵生成流程图；Fig. 2 is a flow chart of adjacency matrix generation;

图3是属性矩阵生成流程图；Fig. 3 is a flow chart of attribute matrix generation;

图4是联合贝叶斯概率模型图；Fig. 4 is a joint Bayesian probability model figure;

图5是归属度矩阵计算流程图。Fig. 5 is a flow chart of belongingness matrix calculation.

具体实施方式Detailed ways

下面结合具体实施例以及具体实验数据集对本发明的结合链接与属性信息的社区发现方法做进一步说明，如图1所示，本发明包括以下步骤：The community discovery method combining links and attribute information of the present invention will be further described below in conjunction with specific embodiments and specific experimental data sets. As shown in Figure 1, the present invention includes the following steps:

S3、根据所述联合贝叶斯概率模型，按照联合矩阵分解方法计算邻接矩阵和属性矩阵隐含的节点归属度矩阵，计算所述归属度矩阵中节点的最大归属度，进行初步非重叠社区划分；S3. According to the joint Bayesian probability model, calculate the node membership matrix implied by the adjacency matrix and the attribute matrix according to the joint matrix decomposition method, calculate the maximum membership degree of the nodes in the membership matrix, and perform preliminary non-overlapping community division ;

S4、计算所述归属度矩阵中的节点的绝对归属度，得到社交网络的重叠社区结构。S4. Calculate the absolute membership of the nodes in the membership matrix to obtain the overlapping community structure of the social network.

进一步的，所述步骤S1输入社交网络数据，构造基于链接信息的邻接矩阵与基于属性信息的属性矩阵包括以下步骤：Further, the step S1 inputs social network data, and constructing an adjacency matrix based on link information and an attribute matrix based on attribute information includes the following steps:

S203、将对称方阵X₀的对角线设为0，即节点与节点自身默认不相邻；S203, the diagonal of the symmetrical square matrix_X0 is set to 0, that is, the node is not adjacent to the node itself by default;

S204、迭代终止时输出更新后的X₀，标记为邻接矩阵X。其中，构建邻接矩阵如图2所示。S204. Output the updated X₀ when the iteration is terminated, and mark it as an adjacency matrix X. Among them, the adjacency matrix is constructed as shown in Figure 2.

进一步的，构造属性矩阵的如图3所示，步骤包括：统计属性数量M并为每个属性从1到M开始标记，以属性数量M为行，节点数N为列构造全零属性矩阵Y₀，进行迭代更新；如果节点具有属性就将对应的行和列置为1，否则为0，从而得到属性矩阵Y。Further, the construction of the attribute matrix is shown in Figure 3. The steps include: counting the number of attributes M and marking each attribute from 1 to M, and constructing an all-zero attribute matrix Y with the number of attributes M as rows and the number of nodes as columns.₀ , perform iterative update; if the node has attributes, set the corresponding row and column to 1, otherwise to 0, so as to obtain the attribute matrix Y.

所述步骤S2的联合贝叶斯概率模型包括：设计一种联合贝叶斯概率模型，综合考虑邻接矩阵与属性矩阵同时含有隐含的社区结构特征，联合贝叶斯概率模型为：The joint Bayesian probability model in the step S2 includes: designing a joint Bayesian probability model, comprehensively considering that the adjacency matrix and the attribute matrix contain hidden community structure features, and the joint Bayesian probability model is:

其中，W₁表示属性与社区的关联强度矩阵；W₂表示节点与社区的关联强度矩阵；H表示节点与社区的归属度矩阵，H由邻接矩阵与属性矩阵同时决定；β为超参数；p(X|W₁,H)表示矩阵W₁和矩阵H条件下X的条件概率；p(Y|W₂,H)表示矩阵W₂和矩阵H条件下X的条件概率；p(W₁|β)表示β条件下的W₁的条件概率；p(W₂|β)表示β条件下的矩阵W₂的条件概率；p(H|β)表示β条件下的矩阵H的条件概率；p(β)表示β的概率，β表示超参数。Among them, W₁ represents the correlation strength matrix between attributes and communities; W₂ represents the correlation strength matrix between nodes and communities; H represents the belongingness matrix between nodes and communities, and H is determined by the adjacency matrix and attribute matrix at the same time; β is a hyperparameter; p (X|W₁ ,H) represents the conditional probability of X under the condition of matrix W₁ and matrix H; p(Y|W₂ ,H) represents the conditional probability of X under the condition of matrix W₂ and matrix H; p(W₁ | β) represents the conditional probability of W₁ under β condition; p(W₂ |β) represents the conditional probability of matrix W₂ under β condition; p(H|β) represents the conditional probability of matrix H under β condition; p (β) represents the probability of β, and β represents the hyperparameter.

其中联合贝叶斯概率模型如图4所示，其中v_ij表示节点的交互信息，它服从泊松分布，根据图4可以将本模型描述为：在N个节点组成的矩阵中，每个子元素v_ij可由三个隐含的子向量h_kj，w_1,ik和w_2,ik来表示，这三个子向量相互作用，共同决定了节点归属情况，而这三个子向量之间存在一个共同的因子β_k，β_k决定了社区K在三个子向量中的划分。其中，h_kj和w_1,ik之间的交互表示为节点的邻接关系矩阵X；h_kj和w_2,ik之间的交互表示为节点的属性矩阵Y。The joint Bayesian probability model is shown in Figure 4, where v_ij represents the interaction information of nodes, which obeys the Poisson distribution. According to Figure 4, this model can be described as: in a matrix composed of N nodes, each sub-element v_ij can be represented by three implicit sub-vectors h_kj , w_1,ik and w_2,ik , these three sub-vectors interact to determine the node attribution, and there is a common The factor β_k , β_k determines the division of the community K in the three sub-vectors. Among them, the interaction between h_kj and w_1,ik is expressed as the adjacency matrix X of the node; the interaction between h_kj and w_2,ik is expressed as the attribute matrix Y of the node.

进一步的，上述负对数似然函数公式表示为：通过β控制社区的数量，β越大，说明W中的列向量和H中的列向量越趋近于0，也即社区之间越不相关；每个β的值都服从Gamma分布，即：Furthermore, the formula of the above-mentioned negative logarithmic likelihood function is expressed as follows: the number of communities is controlled by β, and the larger the β, the closer the column vectors in W and the column vectors in H are to 0, that is, the closer the communities are to each other. Correlation; the value of each β obeys the Gamma distribution, namely:

所述步骤S302中迭代更新公式的计算方法包括：对负对数似然函数取最小值，结合联合非负矩阵分解算法计算迭代更新公式具体包括：利用非负矩阵分解的三分解形式得到邻接矩阵的简化形式即X≈HH^T，利用非负矩阵分解的标准形式得到属性矩阵的简化形式，即Y≈WH^T；分别用H、W和H^T对应替代负对数似然函数中的W₁，W₂和H，分别计算H、W和β梯度；利用梯度下降法得到迭代更新公式。其中，W表示替换后的邻接矩阵。The calculation method of the iterative update formula in the step S302 includes: taking the minimum value of the negative log-likelihood function, and calculating the iterative update formula in combination with the joint non-negative matrix factorization algorithm, which specifically includes: using the three-decomposition form of the non-negative matrix factorization to obtain the adjacency matrix The simplified form of the attribute matrix is X≈HH^T , and the simplified form of the attribute matrix is obtained by using the standard form of non-negative matrix factorization, that is, Y≈WH^T ; H, W and H^T are respectively used to replace W₁ in the negative logarithmic likelihood function , W₂ and H, calculate H, W and β gradients respectively; use the gradient descent method to obtain the iterative update formula. Among them, W represents the adjacency matrix after replacement.

其中，←表示赋值，W表示替换后的邻接矩阵，Y表示属性矩阵，I表示单位矩阵，B表示参数由超参数β构成的对角矩阵，H表示节点与社区的归属度矩阵；上标为T的矩阵表示相应矩阵的转置矩阵，a表示形状常数，b表示尺度常数，a和b均为常数。Among them, ← represents the assignment, W represents the adjacency matrix after replacement, Y represents the attribute matrix, I represents the identity matrix, B represents the diagonal matrix whose parameters are composed of hyperparameters β, and H represents the belongingness matrix of nodes and communities; the superscript is The matrix of T represents the transpose of the corresponding matrix, a represents the shape constant, b represents the scale constant, and both a and b are constants.

所述的重叠社区结构划分方法为：根据节点的归属度矩阵计算每个节点的绝对归属度，判断绝对归属度是否大于某一阈值，如果是，则将其划分到当前社区，从而得到重叠的社区划分；归属度矩阵的计算如图5所示。绝对归属度的计算方法为：The method for dividing the overlapping community structure is as follows: calculate the absolute belonging degree of each node according to the belonging degree matrix of the node, judge whether the absolute belonging degree is greater than a certain threshold, and if so, divide it into the current community, thereby obtaining the overlapping Community division; the calculation of the belonging matrix is shown in Figure 5. The calculation method of absolute attribution is:

本发明设计了一种结合链接和属性信息的社区发现方法，首先结合社交网络的数据特征计算邻接矩阵和属性矩阵，并在此基础上设计一种基于两种矩阵的联合贝叶斯概率模型；再针对此模型进行优化，采用非负矩阵分解算法提取模型中的迭代更新公式；然后根据迭代更新公式计算节点归属度矩阵，并统计最大贡献度以发现非重叠的社区结构；最后迭代计算节点的绝对归属度，从而发现重叠的社区。所述的方法基于联合贝叶斯概率非负矩阵分解模型，具有严格的数学理论基础，应用于真实的社交网络进行社区发现，具有更好的社区划分结果。The present invention designs a community discovery method combining link and attribute information. First, the adjacency matrix and attribute matrix are calculated in combination with the data characteristics of the social network, and on this basis, a joint Bayesian probability model based on the two matrices is designed; Then optimize this model, use the non-negative matrix factorization algorithm to extract the iterative update formula in the model; then calculate the node membership matrix according to the iterative update formula, and count the maximum contribution to find the non-overlapping community structure; finally iteratively calculate the node’s Absolute attribution to discover overlapping communities. The method described is based on the joint Bayesian probability non-negative matrix factorization model, has a strict mathematical theory basis, is applied to real social networks for community discovery, and has better community division results.

以上所举实施例，对本发明的目的、技术方案和优点进行了进一步的详细说明，所应理解的是，以上所举实施例仅为本发明的优选实施方式而已，并不用以限制本发明，凡在本发明的精神和原则之内对本发明所作的任何修改、等同替换、改进等，均应包含在本发明的保护范围之内。The above examples have further described the purpose, technical solutions and advantages of the present invention in detail. It should be understood that the above examples are only preferred implementations of the present invention and are not intended to limit the present invention. Any modification, equivalent replacement, improvement, etc. made to the present invention within the spirit and principles of the present invention shall be included within the protection scope of the present invention.

Claims

Translated fromChinese

1.一种结合链接和属性信息的社区发现方法，其特征在于：包括：1. A community discovery method combining links and attribute information, characterized in that: comprising:

S2、利用所述基于链接信息的邻接矩阵和所述基于属性信息的属性矩阵设计一种联合贝叶斯概率模型；S2. Using the adjacency matrix based on link information and the attribute matrix based on attribute information to design a joint Bayesian probability model;

2.根据权利要求1所述的一种结合链接和属性信息的社区发现方法，其特征在于，构建邻接矩阵的方法为：2. A kind of community discovery method combining link and attribute information according to claim 1, is characterized in that, the method for building adjacency matrix is:

3.根据权利要求2所述的一种结合链接和属性信息的社区发现方法，其特征在于，构造属性矩阵的方法为：统计属性数量M并为每个属性从1到M开始标记，以属性数量M为行，节点数N为列构造全零属性矩阵Y₀，进行迭代更新；如果节点具有属性就将对应的行和列置为1，否则为0，从而得到属性矩阵Y。3. A method for community discovery combining links and attribute information according to claim 2, characterized in that, the method for constructing an attribute matrix is: counting the number of attributes M and marking each attribute from 1 to M, with attribute The number M is the row, and the number of nodes N is the column to construct an all-zero attribute matrix Y₀ for iterative update; if the node has an attribute, set the corresponding row and column to 1, otherwise it is 0, so as to obtain the attribute matrix Y.

4.根据权利要求3所述的一种结合链接和属性信息的社区发现方法，其特征在于，步骤S2中所述的联合贝叶斯概率模型为：4. A method for community discovery combining links and attribute information according to claim 3, characterized in that the joint Bayesian probability model described in step S2 is:

其中，W₁表示属性与社区的关联强度矩阵；W₂表示节点与社区的关联强度矩阵；H表示节点与社区的归属度矩阵，H由邻接矩阵与属性矩阵同时决定；β为超参数；p(X|W₁,H)表示矩阵W₁和矩阵H条件下X的条件概率；p(Y|W₂,H)表示矩阵W₂和矩阵H条件下X的条件概率；p(W₁|β)表示β条件下的W₁的条件概率；p(W₂|β)表示β条件下的矩阵W₂的条件概率；p(H|β)表示β条件下的矩阵H的条件概率；p(β)表示超参数β的概率。Among them, W₁ represents the correlation strength matrix between attributes and communities; W₂ represents the correlation strength matrix between nodes and communities; H represents the belongingness matrix between nodes and communities, and H is determined by the adjacency matrix and attribute matrix at the same time; β is a hyperparameter; p (X|W₁ ,H) represents the conditional probability of X under the condition of matrix W₁ and matrix H; p(Y|W₂ ,H) represents the conditional probability of X under the condition of matrix W₂ and matrix H; p(W₁ | β) represents the conditional probability of W₁ under β condition; p(W₂ |β) represents the conditional probability of matrix W₂ under β condition; p(H|β) represents the conditional probability of matrix H under β condition; p (β) denotes the probability of the hyperparameter β.

5.根据权利要求1-4任一所述的一种结合链接和属性信息的社区发现方法，其特征在于，步骤S3中所述的初步非重叠社区划分步骤包括：5. A community discovery method combining links and attribute information according to any one of claims 1-4, characterized in that the preliminary non-overlapping community division step described in step S3 includes:

6.根据权利要求5所述的一种结合链接和属性信息的社区发现方法，其特征在于，所述步骤S301中联合贝叶斯模型中每个先验概率的负对数似然函数的计算方法为：对联合贝叶斯概率模型取负对数，根据半正态先验知识，得到W₁，W₂和H的负对数似然函数，分别对应为：6. A method for community discovery combining links and attribute information according to claim 5, characterized in that the calculation of the negative log-likelihood function of each prior probability in the joint Bayesian model in the step S301 The method is: take the negative logarithm of the joint Bayesian probability model, and obtain the negative logarithm likelihood functions of W₁ , W₂ and H according to the half-normal prior knowledge, corresponding to:

其中，-logp(w₁|β)表示W₁的负对数似然函数，-logp(w₂|β)表示W₂的负对数似然函数，-logp(h|β)表示H的负对数似然函数；w_1,ik表示矩阵W₁的第i行第k列元素，w_2,ik表示矩阵W₂的第i行第k列元素，h_kj表示矩阵H的第k行第j列元素；μ是一个常数，N表示节点数量，M表示属性数量，K表示社区个数，β表示超参数。Among them, -logp(w₁ |β) represents the negative log-likelihood function of W₁ , -logp(w₂ |β) represents the negative log-likelihood function of W₂ , and -logp(h|β) represents the Negative log-likelihood function; w_1,ik represents the i-th row and k-column element of matrix W₁ , w_2,ik represents the i-th row and k-column element of matrix W₂ , h_kj represents the k-th row of matrix H The element in the jth column; μ is a constant, N represents the number of nodes, M represents the number of attributes, K represents the number of communities, and β represents hyperparameters.

7.根据权利要求5所述一种结合链接和属性信息的社区发现方法，其特征在于，所述步骤S302中迭代更新公式的计算方法包括：对负对数似然函数取最小值，结合联合非负矩阵分解算法计算迭代更新公式，具体包括：利用非负矩阵分解的三分解形式得到邻接矩阵的简化形式，即X≈HH^T，利用非负矩阵分解的标准形式得到属性矩阵的简化形式，即Y≈WH^T；分别用H、W和H^T对应替代负对数似然函数中的W₁，W₂和H，分别计算H、W和β梯度；利用梯度下降法得到迭代更新公式；其中，W表示替换后的邻接矩阵。7. A community discovery method combining link and attribute information according to claim 5, characterized in that the calculation method of the iterative update formula in the step S302 comprises: taking the minimum value of the negative logarithmic likelihood function, combining The non-negative matrix factorization algorithm calculates the iterative update formula, which specifically includes: using the three-factorization form of the non-negative matrix factorization to obtain the simplified form of the adjacency matrix, that is, X≈HH^T , using the standard form of the non-negative matrix factorization to obtain the simplified form of the attribute matrix, That is, Y≈WH^T ; use H, W and H^T to replace W₁ , W₂ and H in the negative logarithmic likelihood function respectively, and calculate H, W and β gradients respectively; use the gradient descent method to obtain the iterative update formula; Among them, W represents the adjacency matrix after replacement.

8.根据权利要求7所述一种结合链接和属性信息的社区发现方法，其特征在于，所述迭代更新公式如下所示：8. A method for community discovery combining links and attribute information according to claim 7, wherein the iterative update formula is as follows:

其中，←表示赋值，I表示单位矩阵；B表示由超参数β构成的对角矩阵；H表示节点与社区的归属度矩阵；β_K表示超参数β的迭代值；a表示形状常数，b表示尺度常数。Among them, ← means assignment, I means identity matrix; B means a diagonal matrix composed of hyperparameter β; H means the belonging matrix of nodes and communities; β_K means iterative value of hyper parameter β; a means shape constant, b means scale constant.

9.根据权利要求4或6-9任一所述的一种结合链接和属性信息的社区发现方法，其特征在于，所述β的值都服从伽马分布，即：9. A community discovery method combining links and attribute information according to any one of claims 4 or 6-9, wherein the values of β all obey the gamma distribution, that is:

10.根据权利要求1所述一种结合链接和属性信息的社区发现方法，其特征在于，社交网络的重叠社区结构的划分方法为：根据节点的归属度矩阵计算每个节点的绝对归属度，判断绝对归属度是否大于某一阈值，如果是，则将所述节点划分到当前社区，从而得到重叠的社区划分；10. A method for community discovery combining links and attribute information according to claim 1, characterized in that the division method of the overlapping community structure of the social network is: calculate the absolute membership of each node according to the membership matrix of the node, Judging whether the absolute degree of belonging is greater than a certain threshold, if so, dividing the node into the current community, thereby obtaining an overlapping community division;

其中绝对归属度的计算方法为：The calculation method of absolute attribution is:

其中，表示节点的归属强度向量中的最小分量；表示节点的归属强度向量中的最大分量，h_kj表示矩阵H的第k行第j列元素，H表示节点与社区的归属度矩阵。in, Represents the smallest component in the attribution strength vector of a node; Indicates the maximum component of the node's belonging strength vector, h_kj indicates the kth row and jth column element of the matrix H, and H indicates the membership degree matrix of the node and the community.