技术领域technical field
本发明涉及社交网络和数据挖掘领域,具体为一种结合链接和属性信息的社区发现方法。The invention relates to the fields of social network and data mining, in particular to a community discovery method combining link and attribute information.
背景技术Background technique
社交网络作为信息传递的主要载具,其涵盖的信息量对于当今社会具有重要的研究意义,从个体到群体,从小世界到大社会,现实生活中总存在隐含的联系将人们链接起来。社区发现常用于分析社会群体之间的结构特征。As the main carrier of information transmission, social network has important research significance for the amount of information it covers in today's society. From individuals to groups, from the small world to the big society, there are always implicit connections in real life that link people together. Community discovery is often used to analyze structural characteristics among social groups.
图的拓扑结构和节点属性在社交网络中具有重大的参考价值,仅从其中一点来研究社区的结构特征显然不具有说服力,通常一个社区表示一类具有相同属性的集合,如果它们之间存在直接的联系,表明相邻两节点之间的社会关系链接更为紧密,如果同边相连的两个节点属于不同的社区,这个边就可以当作两个不同属性社区链接的“纽带”,如果“纽带”越多,说明两个社区之间链接也更为紧密,从而为重叠社区的发现和社区演化的研究提供了更好的研究条件。针对缺乏有效的方法来平衡网络结构和节点的结构特征,Wu等人提出一种利用全局结构和局部邻域特征的SAGL方法来探索社区的结构,首先评估每个节点的全局重要性,并通过组合边缘强度和节点相似度来计算每个节点对的相似度,然后使用K-Medoids聚类算法来对节点属性图和拓扑结构图进行分割,以分割后节点的相似度和每个集群的质心来作为节点分配的条件,从而划分社区,该方法能有效将节点拓扑结构和属性信息进行结合,从而检测到更有意义的主题社区,但是社区聚类质量不高且聚类时间复杂度较高,在实际的复杂网络中应用效果较低。常等人提出一种基于联合矩阵分解的节点多属性网络社团检测方法CDJMF,采用非负矩阵分解联合节点间的余弦相似度进行求解,将带约束的最小均方误差问题转化为经典的凸二次规划算法,从理论上证明了求解的目标函数的收敛性与非增性,在相同评价指标的情况下证明该算法有更高的社区发现效率,但是算法中依赖的参数太多,导致算法精确度不高。The graph topology and node attributes have great reference value in social networks. It is obviously not convincing to study the structural characteristics of communities only from one point. Usually, a community represents a set with the same attributes. If there is A direct connection indicates that the social relationship link between two adjacent nodes is closer. If the two nodes connected to the same edge belong to different communities, this edge can be used as a "link" for the link between two communities with different attributes. If The more "links" there are, the closer the link between the two communities is, which provides better research conditions for the discovery of overlapping communities and the study of community evolution. Aiming at the lack of an effective method to balance the network structure and the structural characteristics of nodes, Wu et al. proposed a SAGL method using global structure and local neighborhood features to explore the structure of the community. First, evaluate the global importance of each node, and pass Combine the edge strength and node similarity to calculate the similarity of each node pair, and then use the K-Medoids clustering algorithm to segment the node attribute graph and topology graph to obtain the similarity of the segmented nodes and the centroid of each cluster As the condition of node allocation, so as to divide the community, this method can effectively combine the node topology and attribute information, so as to detect more meaningful topic communities, but the quality of community clustering is not high and the clustering time complexity is high , the application effect is low in the actual complex network. Chang et al. proposed a node multi-attribute network community detection method CDJMF based on joint matrix factorization, using non-negative matrix factorization to solve the joint cosine similarity between nodes, and transforming the constrained minimum mean square error problem into a classic convex two The sub-programming algorithm theoretically proves the convergence and non-increment of the objective function to be solved, and proves that the algorithm has higher community discovery efficiency under the same evaluation index, but the algorithm relies on too many parameters, which leads to the algorithm Not very accurate.
非负矩阵分解(Non-negative Matrix Factorization NMF)已被证明在数据分析、模式识别、信号处理、机器学习等领域得到了广泛的研究。其类似聚类的方法非常适合社交网络领域中的社区发现,通常将社交网络的拓扑结构进行矩阵分解,能够得到满意的社区发现结果,所以利用NMF对既包含链接信息又具有属性信息的社交网络数据进行社区分析为社区发现提供了新的思路。Non-negative Matrix Factorization (NMF) has been proven to be widely researched in data analysis, pattern recognition, signal processing, machine learning and other fields. Its clustering-like method is very suitable for community discovery in the field of social networks. Usually, the topological structure of the social network is matrix decomposed to obtain satisfactory community discovery results. Community analysis of data provides a new idea for community discovery.
李亚芳等人提出了一种非负矩阵的方法(参考李亚芳,贾彩燕,于剑.应用非负矩阵分解模型的社区发现方法综述[J].计算机科学与探索,2016,10(01):1-13.),能够利用连接信息和属性信息的数据网络数据进行社区分析,但该方法的算法复杂度较高,不利用运算。Li Yafang and others proposed a non-negative matrix method (refer to Li Yafang, Jia Caiyan, Yu Jian. A review of community discovery methods using non-negative matrix factorization models [J]. Computer Science and Exploration, 2016, 10(01): 1- 13.), can use the data network data of connection information and attribute information for community analysis, but the algorithm complexity of this method is high, and no calculation is used.
发明内容Contents of the invention
鉴于此,本发明提出一种结合NMF算法的社区发现方法,具体为一种结合链接和属性信息的社区发现方法,本发明分别将属性矩阵和邻接矩阵以非负矩阵分解的形式联合分解能够降低算法的复杂度,本发明所采用的方法包括:In view of this, the present invention proposes a community discovery method combined with the NMF algorithm, specifically a community discovery method that combines link and attribute information. The present invention jointly decomposes the attribute matrix and the adjacency matrix in the form of non-negative matrix decomposition to reduce the The complexity of algorithm, the method that the present invention adopts comprises:
S1、输入社交网络数据,构造基于链接信息的邻接矩阵与基于属性信息的属性矩阵;S1. Input social network data, and construct an adjacency matrix based on link information and an attribute matrix based on attribute information;
S2、利用所述邻接矩阵和所述属性矩阵设计一种联合贝叶斯概率模型;S2. Designing a joint Bayesian probability model by using the adjacency matrix and the attribute matrix;
S3、根据所述联合贝叶斯概率模型,按照联合矩阵分解方法计算邻接矩阵和属性矩阵隐含的节点归属度矩阵,计算所述归属度矩阵的最大归属度,进行初步非重叠社区划分;S3. According to the joint Bayesian probability model, calculate the node membership matrix implied by the adjacency matrix and the attribute matrix according to the joint matrix decomposition method, calculate the maximum belonging degree of the membership matrix, and perform preliminary non-overlapping community division;
S4、计算所述归属度矩阵中节点的绝对归属度,得到社交网络的重叠社区结构。S4. Calculate the absolute membership of the nodes in the membership matrix to obtain the overlapping community structure of the social network.
进一步的,所述构建邻接矩阵的方法包括以下步骤:Further, the method for constructing an adjacency matrix includes the following steps:
S201、根据输入的社交网络数据,构造社交网络拓扑结构图,统计节点数量并标记为N;S201. According to the input social network data, construct a social network topology diagram, count the number of nodes and mark it as N;
S202、以N为阶数构造全零对称方阵X0,迭代更新X0;判断社交网络图中两个节点是否直接相连,如果相连就将矩阵X0中对应的行和列置为1,否则置为0;S202. Constructing an all-zero symmetrical square matrix X0 with N as the order, and updating X0 iteratively; judging whether two nodes in the social network graph are directly connected, and if they are connected, set the corresponding row and column in the matrix X0 to 1, Otherwise set to 0;
S203、将对称方阵X0对角线设为0,即节点与节点自身默认不相邻;S203. Set the diagonal of the symmetrical square matrixX0 to 0, that is, the node is not adjacent to the node itself by default;
S204、迭代终止时输出更新后的X0,标记为邻接矩阵X。S204. Output the updated X0 when the iteration is terminated, and mark it as an adjacency matrix X.
进一步的,构造属性矩阵的方法的步骤包括:Further, the steps of the method for constructing the attribute matrix include:
统计属性数量M并为每个属性从1到M开始标记,以属性数量M为行,节点数N为列构造全零属性矩阵Y0,进行迭代更新;如果节点具有属性就将对应的行和列置为1,否则为0,从而得到属性矩阵Y。Count the number of attributes M and mark each attribute from 1 to M, construct an all-zero attribute matrix Y0 with the number of attributes M as the row, and the number of nodes N as the column, and update iteratively; if the node has an attribute, the corresponding row and The column is set to 1, otherwise it is 0, so that the attribute matrix Y is obtained.
进一步的,所述步骤S2的联合贝叶斯概率模型包括:设计一种联合贝叶斯概率模型,综合考虑邻接矩阵与属性矩阵同时含有隐含的社区结构特征,联合贝叶斯概率模型为:Further, the joint Bayesian probability model in step S2 includes: designing a joint Bayesian probability model, taking into account the adjacency matrix and the attribute matrix at the same time containing hidden community structure features, the joint Bayesian probability model is:
p(X,Y,W1,W2,H,β)=p(X|W1,H)p(Y|W2,H)p(W1|β)p(W2|β)p(H|β)p(β)p(X,Y,W1 ,W2 ,H,β)=p(X|W1 ,H)p(Y|W2 ,H)p(W1 |β)p(W2 |β)p (H|β)p(β)
其中,W1表示属性与社区的关联强度矩阵;W2表示节点与社区的关联强度矩阵;H表示节点与社区的归属度矩阵,H由邻接矩阵与属性矩阵同时决定;β为超参数;p(X|W1,H)表示矩阵W1和矩阵H条件下X的条件概率;p(Y|W2,H)表示矩阵W2和矩阵H条件下X的条件概率;p(W1|β)表示β条件下的W1的条件概率;p(W2|β)表示β条件下的矩阵W2的条件概率;p(H|β)表示β条件下的矩阵H的条件概率;p(β)表示参数β的概率。Among them, W1 represents the correlation strength matrix between attributes and communities; W2 represents the correlation strength matrix between nodes and communities; H represents the belongingness matrix between nodes and communities, and H is determined by the adjacency matrix and attribute matrix at the same time; β is a hyperparameter; p (X|W1 ,H) represents the conditional probability of X under the condition of matrix W1 and matrix H; p(Y|W2 ,H) represents the conditional probability of X under the condition of matrix W2 and matrix H; p(W1 | β) represents the conditional probability of W1 under β condition; p(W2 |β) represents the conditional probability of matrix W2 under β condition; p(H|β) represents the conditional probability of matrix H under β condition; p (β) represents the probability of parameter β.
进一步的,优选的,所述步骤S3的初步非重叠社区划分步骤包括:Further, preferably, the preliminary non-overlapping community division step in step S3 includes:
S301、根据联合贝叶斯概率模型,计算联合贝叶斯模型中每个先验概率的负对数似然函数;得到基于联合概率NMF模型的目标函数;S301. According to the joint Bayesian probability model, calculate the negative log likelihood function of each prior probability in the joint Bayesian model; obtain the objective function based on the joint probability NMF model;
S302、根据非负矩阵分解算法,对似然函数进行优化分解,得到迭代更新公式;S302. Optimizing and decomposing the likelihood function according to the non-negative matrix factorization algorithm to obtain an iterative update formula;
S303、根据迭代更新公式进行反复迭代,直到满足收敛条件;S303. Perform repeated iterations according to the iterative update formula until the convergence condition is satisfied;
S304、根据迭代终止时的输出的归属度矩阵,计算节点最大归属度,输出非重叠社区结构。S304. Calculate the maximum belonging degree of the node according to the belonging degree matrix output when the iteration is terminated, and output the non-overlapping community structure.
进一步的,优选的,所述步骤S301中联合贝叶斯模型中每个先验概率的负对数似然函数的计算方法为:对联合贝叶斯概率模型取负对数,根据半正态先验知识,得到W1,W2和H的负对数似然函数,分别对应为:Further, preferably, the calculation method of the negative log likelihood function of each prior probability in the joint Bayesian model in the step S301 is: take the negative logarithm for the joint Bayesian probability model, according to the half normal Prior knowledge, the negative log likelihood functions of W1 , W2 and H are obtained, which correspond to:
其中,-log p(w1|β)表示W1的负对数似然函数,-log p(w2|β)表示W2的负对数似然函数,-log p(h|β)表示H的负对数似然函数;,W1表示节点与社区的关联强度矩阵,W2表示属性与社区的关联强度矩阵,H表示节点与社区的归属度矩阵;w1,ik表示矩阵W1的第i行第k列元素,w2,ik表示矩阵W2的第i行第k列元素,hkj表示矩阵H的第k行第j列元素;μ是一个常数,N表示节点数量,M表示属性数量,K表示社区个数,β表示超参数。Among them, -log p(w1 |β) represents the negative log-likelihood function of W1 , -log p(w2 |β) represents the negative log-likelihood function of W2 , -log p(h|β) Represents the negative log-likelihood function of H; W1 represents the correlation strength matrix between nodes and communities, W2 represents the correlation strength matrix between attributes and communities, H represents the membership degree matrix between nodes and communities; w1,ik represents the matrix W The element in row i, column k of1 , w2,ik represents the element in row i, column k of matrix W2 , hkj represents the element in row k, column j of matrix H; μ is a constant, and N represents the number of nodes , M represents the number of attributes, K represents the number of communities, and β represents hyperparameters.
上述负对数似然函数公式表示为:通过β控制社区的数量,β越大,说明W中的列向量和H中的列向量越趋近于0,也即社区之间越不相关;每个β的值都服从Gamma分布,即:The above negative log likelihood function formula is expressed as: the number of communities is controlled by β, and the larger the β, the closer the column vector in W and the column vector in H are to 0, that is, the less correlation between the communities; each The values of β all obey the Gamma distribution, that is:
其中,p(β|ak,bk)表示服从伽马分布;ak表示形状参数因子,bk表示尺度参数因子,Γ(ak)表示伽马函数,β≥0。Among them, p(β|ak ,bk ) means obeying the gamma distribution; ak means the shape parameter factor, bk means the scale parameter factor, Γ(ak ) means the gamma function, and β≥0.
进一步的,优选的,所述步骤S302中迭代更新公式的计算方法包括:对负对数似然函数取最小值,结合联合非负矩阵分解算法计算迭代更新公式具体包括:利用非负矩阵分解的标准形式得到邻接矩阵的简化形式,即X≈HHT,利用非负矩阵分解的标准形式得到属性矩阵的简化形式,即Y≈WHT;分别用H、W和HT对应替代负对数似然函数中的W1,W2和H,分别计算H、W和β梯度;利用梯度下降法得到迭代更新公式。其中,W表示替换后的邻接矩阵。Further, preferably, the calculation method of the iterative update formula in step S302 includes: taking the minimum value of the negative logarithmic likelihood function, and calculating the iterative update formula in conjunction with the joint non-negative matrix factorization algorithm specifically includes: using the non-negative matrix factorization The standard form obtains the simplified form of the adjacency matrix, that is, X≈HHT , and the simplified form of the attribute matrix is obtained by using the standard form of non-negative matrix decomposition, that is, Y≈WHT ; use H, W and HT to replace the negative logarithm-like W1 , W2 and H in the natural function, calculate the H, W and β gradients respectively; use the gradient descent method to obtain the iterative update formula. Among them, W represents the adjacency matrix after replacement.
作为一种可选方式,所述迭代更新公式如下所示:As an optional manner, the iterative update formula is as follows:
其中,←表示赋值,Y表示属性矩阵,I表示单位矩阵,B表示由超参数β构成的对角矩阵,H表示节点与社区的归属度矩阵;βK表示超参数β的迭代值,上标为T的矩阵表示相应矩阵的转置矩阵,a表示形状常数,b表示尺度常数,a和b均为常数。Among them, ← represents the assignment, Y represents the attribute matrix, I represents the unit matrix, B represents the diagonal matrix composed of hyperparameter β, H represents the belongingness matrix of nodes and communities; βK represents the iterative value of hyperparameter β, superscript The matrix T represents the transpose of the corresponding matrix, a represents the shape constant, b represents the scale constant, and both a and b are constants.
所述的重叠社区结构划分方法为:根据节点的归属度矩阵计算每个节点的绝对归属度,判断绝对归属度是否大于某一阈值,如果是,则将该节点划分到当前社区,从而得到重叠的社区划分;绝对归属度的计算方法为:The method for dividing the overlapping community structure is as follows: calculate the absolute belonging degree of each node according to the belonging degree matrix of the node, judge whether the absolute belonging degree is greater than a certain threshold, and if so, divide the node into the current community, thereby obtaining the overlapping The community division; the calculation method of absolute attribution is:
其中,和分别表示节点的归属强度向量中的最小分量和最大分量,hkj为vij的子向量,hkj表示vij表示N个节点组成的邻接矩阵的子元素。in, and Respectively represent the minimum component and maximum component in the attributionstrength vector of the node, hkj is the sub-vector of vij , hkj represents the sub-element of the adjacency matrix composed of N nodes.
与现有技术相比,本发明具有如下优点:Compared with prior art, the present invention has following advantage:
1)本发明提出了一种新的结合属性和链接信息的社区发现方法,既保留了基于节点链接信息的社区发现的优势,又结合了社交网络中属性信息进行协同社区发现,更能体现社交网络中的多维特性,能对主题社区进行发现。1) The present invention proposes a new community discovery method combining attributes and link information, which not only retains the advantages of community discovery based on node link information, but also combines attribute information in social networks for collaborative community discovery, which can better reflect social The multidimensional nature of the network enables discovery of topical communities.
2)本发明利用了非负矩阵分解算法的优势,即其“模糊聚类”的思想能为带属性的社区发现提供必要的理论基础,在严格的数学推导下,得到的社区发现方法能够更具有代表性与稳定性。2) The present invention utilizes the advantages of the non-negative matrix factorization algorithm, that is, its "fuzzy clustering" idea can provide the necessary theoretical basis for community discovery with attributes, and under strict mathematical derivation, the obtained community discovery method can be more representative and stable.
3)本发明既能发现非重叠社区又能发现重叠社区,其算法的精确性更高,复杂度更低,适合用于大规模重叠社区挖掘。3) The present invention can discover both non-overlapping communities and overlapping communities, and its algorithm has higher accuracy and lower complexity, and is suitable for mining large-scale overlapping communities.
附图说明Description of drawings
图1是本发明的实施总体流程图;Fig. 1 is the implementation overall flowchart of the present invention;
图2是邻接矩阵生成流程图;Fig. 2 is a flow chart of adjacency matrix generation;
图3是属性矩阵生成流程图;Fig. 3 is a flow chart of attribute matrix generation;
图4是联合贝叶斯概率模型图;Fig. 4 is a joint Bayesian probability model figure;
图5是归属度矩阵计算流程图。Fig. 5 is a flow chart of belongingness matrix calculation.
具体实施方式Detailed ways
下面结合具体实施例以及具体实验数据集对本发明的结合链接与属性信息的社区发现方法做进一步说明,如图1所示,本发明包括以下步骤:The community discovery method combining links and attribute information of the present invention will be further described below in conjunction with specific embodiments and specific experimental data sets. As shown in Figure 1, the present invention includes the following steps:
S1、输入社交网络数据,构造基于链接信息的邻接矩阵与基于属性信息的属性矩阵;S1. Input social network data, and construct an adjacency matrix based on link information and an attribute matrix based on attribute information;
S2、利用所述邻接矩阵和所述属性矩阵设计一种联合贝叶斯概率模型;S2. Designing a joint Bayesian probability model by using the adjacency matrix and the attribute matrix;
S3、根据所述联合贝叶斯概率模型,按照联合矩阵分解方法计算邻接矩阵和属性矩阵隐含的节点归属度矩阵,计算所述归属度矩阵中节点的最大归属度,进行初步非重叠社区划分;S3. According to the joint Bayesian probability model, calculate the node membership matrix implied by the adjacency matrix and the attribute matrix according to the joint matrix decomposition method, calculate the maximum membership degree of the nodes in the membership matrix, and perform preliminary non-overlapping community division ;
S4、计算所述归属度矩阵中的节点的绝对归属度,得到社交网络的重叠社区结构。S4. Calculate the absolute membership of the nodes in the membership matrix to obtain the overlapping community structure of the social network.
进一步的,所述步骤S1输入社交网络数据,构造基于链接信息的邻接矩阵与基于属性信息的属性矩阵包括以下步骤:Further, the step S1 inputs social network data, and constructing an adjacency matrix based on link information and an attribute matrix based on attribute information includes the following steps:
S201、根据输入的社交网络数据,构造社交网络拓扑结构图,统计节点数量并标记为N;S201. According to the input social network data, construct a social network topology diagram, count the number of nodes and mark it as N;
S202、以N为阶数构造全零对称方阵X0,迭代更新X0;判断社交网络图中两个节点是否直接相连,如果相连就将矩阵X0中对应的行和列置为1,否则置为0;S202. Constructing an all-zero symmetrical square matrix X0 with N as the order, and updating X0 iteratively; judging whether two nodes in the social network graph are directly connected, and if they are connected, set the corresponding row and column in the matrix X0 to 1, Otherwise set to 0;
S203、将对称方阵X0的对角线设为0,即节点与节点自身默认不相邻;S203, the diagonal of the symmetrical square matrixX0 is set to 0, that is, the node is not adjacent to the node itself by default;
S204、迭代终止时输出更新后的X0,标记为邻接矩阵X。其中,构建邻接矩阵如图2所示。S204. Output the updated X0 when the iteration is terminated, and mark it as an adjacency matrix X. Among them, the adjacency matrix is constructed as shown in Figure 2.
进一步的,构造属性矩阵的如图3所示,步骤包括:统计属性数量M并为每个属性从1到M开始标记,以属性数量M为行,节点数N为列构造全零属性矩阵Y0,进行迭代更新;如果节点具有属性就将对应的行和列置为1,否则为0,从而得到属性矩阵Y。Further, the construction of the attribute matrix is shown in Figure 3. The steps include: counting the number of attributes M and marking each attribute from 1 to M, and constructing an all-zero attribute matrix Y with the number of attributes M as rows and the number of nodes as columns.0 , perform iterative update; if the node has attributes, set the corresponding row and column to 1, otherwise to 0, so as to obtain the attribute matrix Y.
所述步骤S2的联合贝叶斯概率模型包括:设计一种联合贝叶斯概率模型,综合考虑邻接矩阵与属性矩阵同时含有隐含的社区结构特征,联合贝叶斯概率模型为:The joint Bayesian probability model in the step S2 includes: designing a joint Bayesian probability model, comprehensively considering that the adjacency matrix and the attribute matrix contain hidden community structure features, and the joint Bayesian probability model is:
p(X,Y,W1,W2,H,β)=p(X|W1,H)p(Y|W2,H)p(W1|β)p(W2|β)p(H|β)p(β)p(X,Y,W1 ,W2 ,H,β)=p(X|W1 ,H)p(Y|W2 ,H)p(W1 |β)p(W2 |β)p (H|β)p(β)
其中,W1表示属性与社区的关联强度矩阵;W2表示节点与社区的关联强度矩阵;H表示节点与社区的归属度矩阵,H由邻接矩阵与属性矩阵同时决定;β为超参数;p(X|W1,H)表示矩阵W1和矩阵H条件下X的条件概率;p(Y|W2,H)表示矩阵W2和矩阵H条件下X的条件概率;p(W1|β)表示β条件下的W1的条件概率;p(W2|β)表示β条件下的矩阵W2的条件概率;p(H|β)表示β条件下的矩阵H的条件概率;p(β)表示β的概率,β表示超参数。Among them, W1 represents the correlation strength matrix between attributes and communities; W2 represents the correlation strength matrix between nodes and communities; H represents the belongingness matrix between nodes and communities, and H is determined by the adjacency matrix and attribute matrix at the same time; β is a hyperparameter; p (X|W1 ,H) represents the conditional probability of X under the condition of matrix W1 and matrix H; p(Y|W2 ,H) represents the conditional probability of X under the condition of matrix W2 and matrix H; p(W1 | β) represents the conditional probability of W1 under β condition; p(W2 |β) represents the conditional probability of matrix W2 under β condition; p(H|β) represents the conditional probability of matrix H under β condition; p (β) represents the probability of β, and β represents the hyperparameter.
其中联合贝叶斯概率模型如图4所示,其中vij表示节点的交互信息,它服从泊松分布,根据图4可以将本模型描述为:在N个节点组成的矩阵中,每个子元素vij可由三个隐含的子向量hkj,w1,ik和w2,ik来表示,这三个子向量相互作用,共同决定了节点归属情况,而这三个子向量之间存在一个共同的因子βk,βk决定了社区K在三个子向量中的划分。其中,hkj和w1,ik之间的交互表示为节点的邻接关系矩阵X;hkj和w2,ik之间的交互表示为节点的属性矩阵Y。The joint Bayesian probability model is shown in Figure 4, where vij represents the interaction information of nodes, which obeys the Poisson distribution. According to Figure 4, this model can be described as: in a matrix composed of N nodes, each sub-element vij can be represented by three implicit sub-vectors hkj , w1,ik and w2,ik , these three sub-vectors interact to determine the node attribution, and there is a common The factor βk , βk determines the division of the community K in the three sub-vectors. Among them, the interaction between hkj and w1,ik is expressed as the adjacency matrix X of the node; the interaction between hkj and w2,ik is expressed as the attribute matrix Y of the node.
进一步的,优选的,所述步骤S3的初步非重叠社区划分步骤包括:Further, preferably, the preliminary non-overlapping community division step in step S3 includes:
S301、根据联合贝叶斯概率模型,计算联合贝叶斯模型中每个先验概率的负对数似然函数;得到基于联合概率NMF模型的目标函数;S301. According to the joint Bayesian probability model, calculate the negative log likelihood function of each prior probability in the joint Bayesian model; obtain the objective function based on the joint probability NMF model;
S302、根据非负矩阵分解算法,对似然函数进行优化分解,得到迭代更新公式;S302. Optimizing and decomposing the likelihood function according to the non-negative matrix factorization algorithm to obtain an iterative update formula;
S303、根据迭代更新公式进行反复迭代,直到满足收敛条件;S303. Perform repeated iterations according to the iterative update formula until the convergence condition is satisfied;
S304、根据迭代终止时的输出的归属度矩阵,计算节点最大归属度,输出非重叠社区结构。S304. Calculate the maximum belonging degree of the node according to the belonging degree matrix output when the iteration is terminated, and output the non-overlapping community structure.
进一步的,优选的,所述步骤S301中联合贝叶斯模型中每个先验概率的负对数似然函数的计算方法为:对联合贝叶斯概率模型取负对数,根据半正态先验知识,得到W1,W2和H的负对数似然函数,分别对应为:Further, preferably, the calculation method of the negative log likelihood function of each prior probability in the joint Bayesian model in the step S301 is: take the negative logarithm for the joint Bayesian probability model, according to the half normal Prior knowledge, the negative log likelihood functions of W1 , W2 and H are obtained, which correspond to:
其中,-log p(w1|β)表示W1的负对数似然函数,-log p(w2|β)表示W2的负对数似然函数,-log p(h|β)表示H的负对数似然函数;,W1表示节点与社区的关联强度矩阵,W2表示属性与社区的关联强度矩阵,H表示节点与社区的归属度矩阵;w1,ik表示矩阵W1的第i行第k列元素,w2,ik表示矩阵W2的第i行第k列元素,hkj表示矩阵H的第k行第j列元素;μ是一个常数,N表示节点数量,M表示属性数量,K表示社区个数,β表示超参数。Among them, -log p(w1 |β) represents the negative log-likelihood function of W1 , -log p(w2 |β) represents the negative log-likelihood function of W2 , -log p(h|β) Represents the negative log-likelihood function of H; W1 represents the correlation strength matrix between nodes and communities, W2 represents the correlation strength matrix between attributes and communities, H represents the membership degree matrix between nodes and communities; w1,ik represents the matrix W The element in row i, column k of1 , w2,ik represents the element in row i, column k of matrix W2 , hkj represents the element in row k, column j of matrix H; μ is a constant, and N represents the number of nodes , M represents the number of attributes, K represents the number of communities, and β represents hyperparameters.
进一步的,上述负对数似然函数公式表示为:通过β控制社区的数量,β越大,说明W中的列向量和H中的列向量越趋近于0,也即社区之间越不相关;每个β的值都服从Gamma分布,即:Furthermore, the formula of the above-mentioned negative logarithmic likelihood function is expressed as follows: the number of communities is controlled by β, and the larger the β, the closer the column vectors in W and the column vectors in H are to 0, that is, the closer the communities are to each other. Correlation; the value of each β obeys the Gamma distribution, namely:
其中,p(β|ak,bk)表示服从伽马分布;ak表示形状参数因子,bk表示尺度参数因子,Γ(ak)表示伽马函数,β≥0。Among them, p(β|ak ,bk ) means obeying the gamma distribution; ak means the shape parameter factor, bk means the scale parameter factor, Γ(ak ) means the gamma function, and β≥0.
所述步骤S302中迭代更新公式的计算方法包括:对负对数似然函数取最小值,结合联合非负矩阵分解算法计算迭代更新公式具体包括:利用非负矩阵分解的三分解形式得到邻接矩阵的简化形式即X≈HHT,利用非负矩阵分解的标准形式得到属性矩阵的简化形式,即Y≈WHT;分别用H、W和HT对应替代负对数似然函数中的W1,W2和H,分别计算H、W和β梯度;利用梯度下降法得到迭代更新公式。其中,W表示替换后的邻接矩阵。The calculation method of the iterative update formula in the step S302 includes: taking the minimum value of the negative log-likelihood function, and calculating the iterative update formula in combination with the joint non-negative matrix factorization algorithm, which specifically includes: using the three-decomposition form of the non-negative matrix factorization to obtain the adjacency matrix The simplified form of the attribute matrix is X≈HHT , and the simplified form of the attribute matrix is obtained by using the standard form of non-negative matrix factorization, that is, Y≈WHT ; H, W and HT are respectively used to replace W1 in the negative logarithmic likelihood function , W2 and H, calculate H, W and β gradients respectively; use the gradient descent method to obtain the iterative update formula. Among them, W represents the adjacency matrix after replacement.
作为一种可选方式,所述迭代更新公式如下所示:As an optional manner, the iterative update formula is as follows:
其中,←表示赋值,W表示替换后的邻接矩阵,Y表示属性矩阵,I表示单位矩阵,B表示参数由超参数β构成的对角矩阵,H表示节点与社区的归属度矩阵;上标为T的矩阵表示相应矩阵的转置矩阵,a表示形状常数,b表示尺度常数,a和b均为常数。Among them, ← represents the assignment, W represents the adjacency matrix after replacement, Y represents the attribute matrix, I represents the identity matrix, B represents the diagonal matrix whose parameters are composed of hyperparameters β, and H represents the belongingness matrix of nodes and communities; the superscript is The matrix of T represents the transpose of the corresponding matrix, a represents the shape constant, b represents the scale constant, and both a and b are constants.
所述的重叠社区结构划分方法为:根据节点的归属度矩阵计算每个节点的绝对归属度,判断绝对归属度是否大于某一阈值,如果是,则将其划分到当前社区,从而得到重叠的社区划分;归属度矩阵的计算如图5所示。绝对归属度的计算方法为:The method for dividing the overlapping community structure is as follows: calculate the absolute belonging degree of each node according to the belonging degree matrix of the node, judge whether the absolute belonging degree is greater than a certain threshold, and if so, divide it into the current community, thereby obtaining the overlapping Community division; the calculation of the belonging matrix is shown in Figure 5. The calculation method of absolute attribution is:
其中,和分别表示节点的归属强度向量中的最小分量和最大分量,hkj为vij的子向量,hkj表示vij表示N个节点组成的邻接矩阵的子元素。in, and Respectively represent the minimum component and maximum component in the attributionstrength vector of the node, hkj is the sub-vector of vij , hkj represents the sub-element of the adjacency matrix composed of N nodes.
本发明设计了一种结合链接和属性信息的社区发现方法,首先结合社交网络的数据特征计算邻接矩阵和属性矩阵,并在此基础上设计一种基于两种矩阵的联合贝叶斯概率模型;再针对此模型进行优化,采用非负矩阵分解算法提取模型中的迭代更新公式;然后根据迭代更新公式计算节点归属度矩阵,并统计最大贡献度以发现非重叠的社区结构;最后迭代计算节点的绝对归属度,从而发现重叠的社区。所述的方法基于联合贝叶斯概率非负矩阵分解模型,具有严格的数学理论基础,应用于真实的社交网络进行社区发现,具有更好的社区划分结果。The present invention designs a community discovery method combining link and attribute information. First, the adjacency matrix and attribute matrix are calculated in combination with the data characteristics of the social network, and on this basis, a joint Bayesian probability model based on the two matrices is designed; Then optimize this model, use the non-negative matrix factorization algorithm to extract the iterative update formula in the model; then calculate the node membership matrix according to the iterative update formula, and count the maximum contribution to find the non-overlapping community structure; finally iteratively calculate the node’s Absolute attribution to discover overlapping communities. The method described is based on the joint Bayesian probability non-negative matrix factorization model, has a strict mathematical theory basis, is applied to real social networks for community discovery, and has better community division results.
以上所举实施例,对本发明的目的、技术方案和优点进行了进一步的详细说明,所应理解的是,以上所举实施例仅为本发明的优选实施方式而已,并不用以限制本发明,凡在本发明的精神和原则之内对本发明所作的任何修改、等同替换、改进等,均应包含在本发明的保护范围之内。The above examples have further described the purpose, technical solutions and advantages of the present invention in detail. It should be understood that the above examples are only preferred implementations of the present invention and are not intended to limit the present invention. Any modification, equivalent replacement, improvement, etc. made to the present invention within the spirit and principles of the present invention shall be included within the protection scope of the present invention.
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201810071418.XACN108334580A (en) | 2018-01-25 | 2018-01-25 | A kind of community discovery method of combination link and attribute information |
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201810071418.XACN108334580A (en) | 2018-01-25 | 2018-01-25 | A kind of community discovery method of combination link and attribute information |
| Publication Number | Publication Date |
|---|---|
| CN108334580Atrue CN108334580A (en) | 2018-07-27 |
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN201810071418.XAPendingCN108334580A (en) | 2018-01-25 | 2018-01-25 | A kind of community discovery method of combination link and attribute information |
| Country | Link |
|---|---|
| CN (1) | CN108334580A (en) |
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN109508455A (en)* | 2018-10-18 | 2019-03-22 | 山西大学 | A kind of GloVe hyper parameter tuning method |
| CN109859063A (en)* | 2019-01-18 | 2019-06-07 | 河北工业大学 | A kind of community discovery method, device, storage medium and terminal device |
| CN109949176A (en)* | 2019-03-28 | 2019-06-28 | 南京邮电大学 | A method for detecting abnormal users in social networks based on graph embedding |
| CN110334285A (en)* | 2019-07-04 | 2019-10-15 | 仲恺农业工程学院 | Symbolic network community discovery method based on structural balance constraint |
| CN110851732A (en)* | 2019-10-28 | 2020-02-28 | 天津大学 | Attribute network semi-supervised community discovery method based on non-negative matrix three-factor decomposition |
| CN111047453A (en)* | 2019-12-04 | 2020-04-21 | 兰州交通大学 | Detection method and device for large-scale social network community based on high-order tensor decomposition |
| CN112084418A (en)* | 2020-07-29 | 2020-12-15 | 浙江工业大学 | Microblog user community discovery method based on neighbor information and attribute network representation learning |
| CN112464107A (en)* | 2020-11-26 | 2021-03-09 | 重庆邮电大学 | Social network overlapping community discovery method and device based on multi-label propagation |
| CN112487110A (en)* | 2020-12-07 | 2021-03-12 | 中国船舶重工集团公司第七一六研究所 | Overlapped community evolution analysis method and system based on network structure and node content |
| CN113158080A (en)* | 2021-04-27 | 2021-07-23 | 华南师范大学 | Community discovery method, system and device based on fusion attribute and storage medium |
| CN113516562A (en)* | 2021-07-28 | 2021-10-19 | 中移(杭州)信息技术有限公司 | Family social network construction method, device, device and storage medium |
| CN114020999A (en)* | 2021-10-20 | 2022-02-08 | 山西大学 | A method and system for community structure detection of movie social network |
| CN116226549A (en)* | 2023-03-07 | 2023-06-06 | 江苏警官学院 | Community discovery method based on triangle structure fusion node characteristics and network topology |
| CN119323008A (en)* | 2024-12-19 | 2025-01-17 | 陕西思极科技有限公司 | Cross fusion mining method and system for platform region information |
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN109508455A (en)* | 2018-10-18 | 2019-03-22 | 山西大学 | A kind of GloVe hyper parameter tuning method |
| CN109508455B (en)* | 2018-10-18 | 2021-11-19 | 山西大学 | GloVe super-parameter tuning method |
| CN109859063A (en)* | 2019-01-18 | 2019-06-07 | 河北工业大学 | A kind of community discovery method, device, storage medium and terminal device |
| CN109859063B (en)* | 2019-01-18 | 2023-05-05 | 河北工业大学 | Community discovery method and device, storage medium and terminal equipment |
| CN109949176A (en)* | 2019-03-28 | 2019-06-28 | 南京邮电大学 | A method for detecting abnormal users in social networks based on graph embedding |
| CN109949176B (en)* | 2019-03-28 | 2022-07-15 | 南京邮电大学 | A method for detecting abnormal users in social networks based on graph embedding |
| CN110334285B (en)* | 2019-07-04 | 2021-08-06 | 仲恺农业工程学院 | A Symbolic Network Community Discovery Method Based on Structural Equilibrium Constraints |
| CN110334285A (en)* | 2019-07-04 | 2019-10-15 | 仲恺农业工程学院 | Symbolic network community discovery method based on structural balance constraint |
| CN110851732A (en)* | 2019-10-28 | 2020-02-28 | 天津大学 | Attribute network semi-supervised community discovery method based on non-negative matrix three-factor decomposition |
| CN111047453A (en)* | 2019-12-04 | 2020-04-21 | 兰州交通大学 | Detection method and device for large-scale social network community based on high-order tensor decomposition |
| CN112084418B (en)* | 2020-07-29 | 2023-07-28 | 浙江工业大学 | Microblog user community discovery method based on neighbor information and attribute network characterization learning |
| CN112084418A (en)* | 2020-07-29 | 2020-12-15 | 浙江工业大学 | Microblog user community discovery method based on neighbor information and attribute network representation learning |
| CN112464107A (en)* | 2020-11-26 | 2021-03-09 | 重庆邮电大学 | Social network overlapping community discovery method and device based on multi-label propagation |
| CN112487110A (en)* | 2020-12-07 | 2021-03-12 | 中国船舶重工集团公司第七一六研究所 | Overlapped community evolution analysis method and system based on network structure and node content |
| CN113158080A (en)* | 2021-04-27 | 2021-07-23 | 华南师范大学 | Community discovery method, system and device based on fusion attribute and storage medium |
| CN113158080B (en)* | 2021-04-27 | 2023-07-11 | 华南师范大学 | Community discovery method, system, device and storage medium based on fusion attribute |
| CN113516562A (en)* | 2021-07-28 | 2021-10-19 | 中移(杭州)信息技术有限公司 | Family social network construction method, device, device and storage medium |
| CN113516562B (en)* | 2021-07-28 | 2023-09-19 | 中移(杭州)信息技术有限公司 | Home social network construction method, device, equipment and storage medium |
| CN114020999A (en)* | 2021-10-20 | 2022-02-08 | 山西大学 | A method and system for community structure detection of movie social network |
| CN114020999B (en)* | 2021-10-20 | 2024-12-03 | 山西大学 | A community structure detection method and system for movie social networks |
| CN116226549A (en)* | 2023-03-07 | 2023-06-06 | 江苏警官学院 | Community discovery method based on triangle structure fusion node characteristics and network topology |
| CN119323008A (en)* | 2024-12-19 | 2025-01-17 | 陕西思极科技有限公司 | Cross fusion mining method and system for platform region information |
| Publication | Publication Date | Title |
|---|---|---|
| CN108334580A (en) | A kind of community discovery method of combination link and attribute information | |
| CN103678671B (en) | A kind of dynamic community detection method in social networks | |
| CN111125358B (en) | Text classification method based on hypergraph | |
| CN111932386B (en) | User account determining method and device, information pushing method and device, and electronic equipment | |
| CN109063021B (en) | A Knowledge Graph Distributed Representation Method Encoding Relational Semantic Diversity Structure | |
| CN111985623A (en) | Attribute graph group discovery method based on maximized mutual information and graph neural network | |
| CN112085125A (en) | Missing value filling method, storage medium and system based on linear self-learning network | |
| CN110096630A (en) | Big data processing method of the one kind based on clustering | |
| Gao et al. | CNL: collective network linkage across heterogeneous social platforms | |
| CN113822419A (en) | Self-supervision graph representation learning operation method based on structural information | |
| WO2023207013A1 (en) | Graph embedding-based relational graph key personnel analysis method and system | |
| CN118314946A (en) | Single-cell multi-omics data integration method based on variational graph autoencoder | |
| He et al. | Representation learning of knowledge graph for wireless communication networks | |
| CN111382318A (en) | Dynamic community detection method based on information dynamics | |
| CN103164533B (en) | Complex network community detection method based on information theory | |
| CN104317853B (en) | A kind of service cluster construction method based on Semantic Web | |
| Chen et al. | Community Detection Based on DeepWalk Model in Large‐Scale Networks | |
| CN118568534A (en) | Node representation learning and community division method based on multi-order neighbor information fusion | |
| CN116913394A (en) | Cell type annotation method based on single-cell transcriptome data | |
| Kang et al. | Transfer learning for dynamic community knowledge detection based on dual-population cooperation and competition | |
| Yu et al. | SDHGCN: A Heterogeneous Graph Convolutional Neural Network Combined With Shadowed Set | |
| CN115017275A (en) | A Conversation Recommendation Method and System Based on Graph Neural Network and Knowledge Graph | |
| Shu et al. | sigRGCN: A robust residual graph convolutional network for scRNA-seq data clustering | |
| Zhao et al. | Probability-Corrected Overlapping Community Detection Algorithm Based on GCN | |
| Ji et al. | An improved random walk based community detection algorithm |
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | ||
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||
| SE01 | Entry into force of request for substantive examination | ||
| RJ01 | Rejection of invention patent application after publication | Application publication date:20180727 | |
| RJ01 | Rejection of invention patent application after publication |