CN104158840A

Movatterモバイル変換

Info

Publication number: CN104158840A
Application number: CN201410323742.8A
Authority: CN
Inventors: 申德荣; 冯朔; 寇月; 聂铁铮; 王振华; 于戈
Original assignee: Northeastern University China
Current assignee: Northeastern University China
Priority date: 2014-07-09
Filing date: 2014-07-09
Publication date: 2014-11-19
Anticipated expiration: 2034-07-09
Also published as: CN104158840B

Abstract

一种分布式计算图节点相似度的方法，属于计算机数据挖掘领域，包括：采用主从模式搭建分布式计算平台；主计算机读入对象数据，建立图模型并发送给各子计算机；主计算机进行任务划分，并将各子任务分配给各子计算机；各子计算机计算其各任务节点分别传递给图模型中节点对的相似度增量计算值；主计算机计算偏移系数并分别发送给对应的各子计算机；子计算机对本地各任务节点的相似度增量计算值进行修正，并将修正后的本地各任务节点的相似度增量进行求和后传送给主计算机；主计算机对图模型中各节点对的相似度进行整合，最终得到图模型中各个节点对的相似度；该方法相比于传统SimRank计算方法，传输代价低，计算时间短，效率明显提高。

A method for distributed computing graph node similarity belongs to the field of computer data mining, comprising: adopting a master-slave mode to build a distributed computing platform; a master computer reads in object data, establishes a graph model and sends it to each sub-computer; The task is divided, and each sub-task is assigned to each sub-computer; each sub-computer calculates the similarity incremental calculation value that each task node transmits to the node pair in the graph model; the main computer calculates the offset coefficient and sends it to the corresponding Each sub-computer; the sub-computer corrects the calculated value of the similarity increment of each local task node, and sends the sum of the corrected similarity increments of each local task node to the main computer; The similarity of each node pair is integrated to finally obtain the similarity of each node pair in the graph model; compared with the traditional SimRank calculation method, this method has low transmission cost, short calculation time, and significantly improved efficiency.

Description

Translated fromChinese

一种分布式计算图节点相似度的方法A Distributed Method for Computing Graph Node Similarity

技术领域technical field

本发明属于计算机数据挖掘领域，具体涉及一种分布式计算图节点相似度的方法。 The invention belongs to the field of computer data mining, in particular to a method for distributed computing graph node similarity. the

背景技术Background technique

随着图结构的广泛应用，计算两节点相似度已经成为一种基本的图操作方法。例如，针对一个社交网络所建立的图模型中，节点代表个人账号，节点之间的边代表个人账号之间的关系，节点相似度可表述成为两个账号的关联程度，其在探测相似人群以及朋友推荐中有重要应用；又如，针对引文网络所建立的图模型中，节点代表文章，节点之间的边代表文章之间的引用关系，而节点相似度可应用于文章分类及相似文章推荐。 With the wide application of graph structure, calculating the similarity between two nodes has become a basic graph operation method. For example, in a graph model established for a social network, nodes represent personal accounts, and the edges between nodes represent the relationship between personal accounts. The similarity of nodes can be expressed as the degree of association between two accounts, which is useful in detecting similar groups and There are important applications in friend recommendation; another example, in the graph model established for the citation network, the nodes represent articles, and the edges between nodes represent the citation relationship between articles, and the similarity of nodes can be applied to article classification and similar article recommendation . the

目前，已有多种计算节点相似度的方法，其中SimRank方法因具有高准确性而被广泛应用。传统SimRank方法计算复杂性很高，计算方法如下：对于给定的有向图，令s(a,b)表示节点a和b之间的相似性，则这两个节点的SimRank相似度如下： At present, there are many methods for computing node similarity, among which the SimRank method is widely used because of its high accuracy. The traditional SimRank method has high computational complexity, and the calculation method is as follows: For a given directed graph, let s(a,b) represent the similarity between nodes a and b, then the SimRank similarity of these two nodes is as follows:

1)若a＝b，则s(a,b)＝1；2)若a≠b，则s(a,b)的计算公式如下： 1) If a=b, then s(a,b)=1; 2) If a≠b, then the calculation formula of s(a,b) is as follows:

$s the s ((a a,, b b)) = = \frac{C C}{| | I I ((a a)) | | | | I I ((b b)) | |} {Σ Σ}_{i i = = 11}^{| | I I ((a a)) | |} {Σ Σ}_{j j = = 11}^{| | I I ((b b)) | |} s the s (({I I}_{i i} ((a a)),, {I I}_{j j} ((b b))))$

其中，C为衰减因子，是介于0，1之间的常数；|I(a)|和|I(b)|分别为节点a和b的入邻居个数；I_i(a)表示节点a的第i个入邻居，I_j(b)表示节点b的第j个入邻居。 Among them, C is the attenuation factor, which is a constant between 0 and 1; |I(a)| and |I(b)| are the number of incoming neighbors of nodes a and b respectively; I_i (a) represents the node The i-th incoming neighbor of a, I_j (b) represents the j-th incoming neighbor of node b.

在初始化阶段，若a≠b，则R₀(a,b)＝0，否则R₀(a,b)＝1，然后按下式进行迭代： R_k(I_i(a),I_j(b))；其中R_k+1(a,b)表示节点a和b之间的第k+1次迭代的SimRank相似度；在每次迭代过程中，依次算出所有节点对之间的相似度后进入下一次迭代，在迭代计算次后，可得R_k(a,b)为节点对(a,b)的SimRank相似度，其中ε为用户所能容忍的误差大小； In the initialization stage, if a≠b, then R₀ (a,b)=0, otherwise R₀ (a,b)=1, and then iterate as follows: R_k (I_i (a), I_j (b)); where R_k+1 (a, b) represents the SimRank similarity of the k+1th iteration between nodes a and b; in each iteration process In, the similarity between all node pairs is calculated in turn and then enters the next iteration. In the iterative calculation After times, R_k (a, b) can be obtained as the SimRank similarity of the node pair (a, b), where ε is the error size that the user can tolerate;

由公式可知，相同或相似节点的出邻居节点也具有高相似性，在迭代计算节点对相似度时，需通过节点对的入邻居的相似度对该节点对相似度进行更新，则s(a,b)＝lim_k→∞R_k(a,b)。可以看出，该方法计算复杂性过高，且在计算过程中需等待前一次迭代计算全部完成后才能进行下一次迭代计算。 It can be seen from the formula that the outgoing neighbor nodes of the same or similar nodes also have high similarity. When iteratively calculating the similarity of a node pair, the similarity of the node pair needs to be updated by the similarity of the incoming neighbor of the node pair. Then s(a ,b)=lim_k→∞ R_k (a,b). It can be seen that the calculation complexity of this method is too high, and it is necessary to wait for the previous iterative calculation to be completed before the next iterative calculation can be performed during the calculation process.

为降低传统SimRank计算复杂度，人们提出了多种SimRank优化计算方法，例如：基于局部和的SimRank优化计算方法、基于随机游走的SimRank计算方法以及基于GPU的SimRank 并行计算方法等。其中基于随机游走的SimRank优化计算方法中，节点a和b之间SimRank相似度计算公式为s(a,b)＝Σ_t:(a,b)→mP[t]C^l(t)，其中t表示分别从节点a和b出发，沿入边方向游走，经过相同步数，首次在节点m相遇的两条路径，P[t]表示该相遇发生的概率；l(t)表示路径的长度；并称这两条路径形成了一次从节点对(a,b)到节点m的首遇游历(tour)，P[t]C^l(t)为游历值，同样称l(t)为该游历的长度； In order to reduce the complexity of traditional SimRank calculations, various SimRank optimization calculation methods have been proposed, such as SimRank optimization calculation methods based on local sums, SimRank calculation methods based on random walks, and SimRank parallel calculation methods based on GPUs. Among them, in the SimRank optimization calculation method based on random walk, the calculation formula of SimRank similarity between nodes a and b is s(a,b)=Σ_t:(a,b)→m P[t]C^l(t) , where t represents the two paths starting from nodes a and b respectively, walking along the direction of the edge, passing through the same number of steps, and meeting the two paths for the first time at node m, P[t] represents the probability of the encounter; l(t) represents The length of the path; and these two paths form a first encounter tour (tour) from node pair (a, b) to node m, P[t]C^l(t) is the tour value, also called l(t ) is the length of the tour;

上述的多种SimRank优化计算方法计算复杂性均达到o(Kn⁴)，其中K表示迭代计算次数，n表示图中节点个数。对于大规模图数据，如大型社交网络、网页数据以及引文信息网络，这些网络中节点的数量和边的数量异常庞大，由于这些计算方法仅在单个计算机中进行处理，而单个计算机CPU计算能力有限，导致计算时间过长。据实验测试，传统单机方法在2.1GHz英特尔奔腾处理器且1G内存的实验环境中，处理10000个节点的图数据需要的计算时间大致46个小时。 The computational complexity of the various SimRank optimization calculation methods mentioned above all reaches o(Kn⁴ ), where K represents the number of iterative calculations, and n represents the number of nodes in the graph. For large-scale graph data, such as large social networks, web page data, and citation information networks, the number of nodes and edges in these networks is extremely large, because these calculation methods are only processed in a single computer, and the computing power of a single computer CPU is limited , resulting in excessive calculation time. According to the experimental test, the traditional stand-alone method in the experimental environment of 2.1GHz Intel Pentium processor and 1G memory, the calculation time required to process the graph data of 10,000 nodes is about 46 hours.

基于随机游走的SimRank优化计算方法在计算节点对(a,b)相似度时，需得到所有以(a,b)为端点的路径，因此可在分布式环境中对其实现：以节点对为划分单元对相似度计算任务进行划分，每台计算机处理计算多个节点对的相似度大小。然而这种计算方法复杂性过高，且需多次图数据读取操作。 The SimRank optimization calculation method based on random walk needs to get all the paths with (a, b) as endpoints when calculating the similarity of node pairs (a, b), so it can be implemented in a distributed environment: node pairs The similarity calculation tasks are divided into division units, and each computer processes and calculates the similarity of multiple node pairs. However, this calculation method is too complex and requires multiple graph data read operations. the

发明内容Contents of the invention

针对现有技术存在的不足，本发明提供一种分布式计算图节点相似度(DcSimRank)的方法。 Aiming at the deficiencies in the prior art, the present invention provides a method for distributed computing graph node similarity (DcSimRank). the

本发明的技术方案： Technical scheme of the present invention:

一种分布式计算图节点相似度的方法，包括如下步骤： A method for distributed computing graph node similarity, comprising the steps of:

步骤1：采用主从(master-slave)模式搭建分布式计算平台； Step 1: Build a distributed computing platform using the master-slave mode;

搭建的分布式环境中共有N台计算机，将其中一台计算机作为主(master)计算机，其余N-1台计算机均作为子(slave)计算机； There are a total of N computers in the distributed environment built, and one of the computers is used as the master (master) computer, and the remaining N-1 computers are all used as slave (slave) computers;

步骤2：主计算机读入现有的对象数据，建立图模型并发送给分布式计算平台中的每台子计算机； Step 2: The main computer reads in the existing object data, builds a graph model and sends it to each sub-computer in the distributed computing platform;

步骤3：以该图模型的所有节点为任务划分对象，主计算机进行任务划分，并将划分后的各子任务及迭代次数分配给各子计算机；子任务由所划分的不同节点构成；子任务中的不同节点，称作任务节点； Step 3: Taking all the nodes of the graph model as the task division objects, the main computer divides the tasks, and distributes the divided sub-tasks and the number of iterations to each sub-computer; the sub-tasks are composed of different divided nodes; the sub-tasks The different nodes in are called task nodes;

迭代次数K根据下式给出， The number of iterations K is given by the following formula,

其中，ε为用户所能容忍的误差大小；C为衰减因子，是介于0，1之间的常数； Among them, ε is the error size that the user can tolerate; C is the attenuation factor, which is a constant between 0 and 1;

步骤4：根据接收的任务，各子计算机计算其各任务节点分别传递给图模型中的各个节点对的相似度增量计算值；相似度增量计算值为不考虑所得到的游历是否为首遇游历时，任务节点传递给对应节点对的相似度增量； Step 4: According to the received task, each subcomputer calculates the similarity incremental calculation value that each task node transmits to each node pair in the graph model; the similarity incremental calculation value does not consider whether the obtained tour is the first encounter When traveling, the similarity increment passed by the task node to the corresponding node pair;

对于单个任务节点m，对应的子计算机首先构建该节点的相似度增量计算值矩阵，大小为n×n，并将其中每个元素初始化为0，该矩阵用于存放该节点传递给各相应节点对的相似度增量计算值；接下来采用广度优先算法，以该节点m为根节点进行广度优先搜索，将节点m在每一层所能到达的节点存储在集合List_k(m)中，则有 For a single task node m, the corresponding subcomputer first constructs the similarity increment calculation value matrix of the node, with a size of n×n, and initializes each element to 0. This matrix is used to store the node and pass it to each corresponding Incremental calculation value of the similarity of the node pair; next, the breadth-first algorithm is adopted, and the breadth-first search is performed with the node m as the root node, and the nodes that the node m can reach in each layer are stored in the set List_k (m) , then there is

List_k(m)＝{(j,P_k[m,j])}，k＝1,…,K (2) List_k (m)={(j,P_k [m,j])}, k=1,...,K (2)

其中List_k(m)表示以节点m为根节点，在第k层所能到达的图节点；j代表图中任意节点；P_k[m,j]表示从节点j出发，沿入边方向，经k步游走到达节点m的概率，称为节点m到节点j的路径值； Among them, List_k (m) represents the graph node that can be reached at the kth layer with node m as the root node; j represents any node in the graph; P_k [m, j] represents starting from node j, along the direction of the edge, The probability of reaching node m after k steps of walking is called the path value from node m to node j;

以迭代方式计算该节点的相似度增量计算值，即List_k(m)根据List_k-1(m)计算得到，List_k(m)中每个元素根据式(3)进行更新， Iteratively calculate the calculated value of the similarity increment of the node, that is, List_k (m) is calculated according to List_k-1 (m), and each element in List_k (m) is updated according to formula (3),

P_k[m,j]＝Σ_iP_k-1[m,I_i(j)]/|I(j)|，P₀[m,m]＝1，i＝1,2,…,|I(j)| (3) P_k [m,j]=Σ_i P_k-1 [m,I_i (j)]/|I(j)|, P₀ [m,m]=1, i=1,2,...,| I(j)| (3)

根据公式(4)及集合List_k(m)，计算该节点m在各层传递给节点对(j,j′)的相似度增量计算值v′_m,k(j,j′)，其中j′为图中任意节点；任务节点m所对应的子计算机将j′＝j时的相似度增量计算值v′_m,k(j,j′)发送给主计算机；将j′≠j时的相似度增量计算值v′_m,k(j,j′)通过公式(5) 累加到相似度增量计算值矩阵中。 According to the formula (4) and the set List_k (m), calculate the similarity increment calculation value v′_m,k (j,j′) passed by the node m to the node pair (j,j′) at each layer, where j' is any node in the graph; the sub-computer corresponding to the task node m sends the similarity increment calculation value v'_m,k (j,j') when j'=j to the main computer; set j'≠j The similarity increment calculation value v′_m,k (j,j′) at time is accumulated into the similarity increment calculation value matrix through the formula (5).

v′_m,k(j,j′)＝P_k[m,j]P_k[m,j′]C^k (4) v′_m,k (j,j′)=P_k [m,j]P_k [m,j′]C^k (4)

其中，v′_m,k(j,j′)表示节点m在第k层传递给节点对(j,j′)的相似度增量计算值，即节点对(j,j′)沿入边方向经过k步游走，在节点m相遇的游历值之和； Among them, v′_m,k (j,j′) represents the calculated value of the similarity increment passed by node m to the node pair (j,j′) at layer k, that is, the node pair (j,j′) along the incoming edge The direction travels through k steps, and the sum of the tour values that meet at node m;

s′_m(j,j′)＝Σ_kv′_m,k(j,j′) (5) s′_m (j,j′)=Σ_k v′_m,k (j,j′) (5)

其中，s′_m(j,j′)表示该任务节点m带给节点对(j,j′)的相似度增量计算值； Among them, s′_m (j, j′) represents the similarity incremental calculation value brought by the task node m to the node pair (j, j′);

当迭代k＝K次以后，迭代停止，相似度增量计算值矩阵存放的即为节点m传递给各节点对的相似度增量计算值。 After iteration k=K times, the iteration stops, and the similarity increment calculation value matrix stores the similarity increment calculation value transmitted by the node m to each node pair. the

步骤5.主计算机计算各任务节点的偏移系数并分别发送给对应的各子计算机；偏移系数为用于对各任务节点的相似度增量计算值进行修正的系数； Step 5. The host computer calculates the offset coefficients of each task node and sends them to corresponding subcomputers respectively; the offset coefficient is a coefficient used to correct the similarity increment calculation value of each task node;

在所有子计算机完成步骤4后，主计算机根据接收到的每个子计算机发送的其各任务节点传递给节点对(j,j′)，j′＝j时的相似度增量计算值v′_m,k(j,j′)，建立集合Set＝{({x′,k′,y′},v′)}，其中x′和y′均表示任意图节点，v′表示节点x′在第k′＝1,2,…,K次迭代对节点对(y′,y′)的相似度增量计算值v′_x′,k′(_y′,_y′)，并用Set(x′,k′,y′)表示集合Set对应的元素v′＝v′_x′,k′(y′,y′)；构建集合Sco＝{({x,k,y},v)}，其中x和y均表示任意图节点，v表示以节点x为起点，经过k步游走，在节点y发生首遇的游历值之和v_x,k(y,y)，并用Sco(x,k,y)表示集合Sco对应的元素v＝v_x,k(y,y)；下面介绍集合Sco的更新方法： After all subcomputers complete step 4, the host computer transmits the task nodes sent by each subcomputer to the node pair (j, j′) according to the received task nodes, and the similarity increment calculation value v′_m when j′=j_,k (j,j′), establish a set Set={({x′,k′,y′},v′)}, where x′ and y′ represent any graph node, and v′ represents node x′ in k' = 1, 2,..., k-th iteration to the node pair (y', y') similarity incremental calculation value v'_x',k' (_y ',_y '), and use Set(x' ,k′,y′) means the element v′=v′_x′,k′ (y′,y′) corresponding to the set Set; construct the set Sco={({x,k,y},v)}, where Both x and y represent arbitrary graph nodes, and v represents the sum of travel values v_x,k (y,y) of the first encounter at node y after k steps starting from node x, and using Sco(x,k ,y) represents the element v=v_x,k (y,y) corresponding to the set Sco; the update method of the set Sco is introduced below:

以k＝1,2,…,K为迭代顺序更新集合Sco，在第k-1次迭代计算Sco(x,k-1,y)全部完成后，才进行第k次迭代计算Sco(x,k,y)，其中单个元素Sco(x,k,y)计算方法如下： Update the set Sco with k=1,2,...,K as the iterative order, and calculate Sco(x, k, y), where the single element Sco(x, k, y) is calculated as follows:

A.若k＝1，则Sco(x,k,y)＝Set(x,k,y)，则转去执行步骤C；否则执行步骤B； A. If k=1, then Sco(x,k,y)=Set(x,k,y), then go to step C; otherwise, go to step B;

B.Sco(x,k,y)通过从集合Sco中查找出元素Sco(x,k₁,p)与从集合Set中查找出元素Set(p,k₂,y),且满足k₁+k₂＝k,根据公式(6)计算得到偏差v″_x,k(y,y),公式(6)如下： B.Sco(x,k,y) finds the element Sco(x,k₁ ,p) from the set Sco and finds the element Set(p,k₂ ,y) from the set Set, and satisfies k₁ + k₂ =k, the deviation v″_x,k (y,y) is calculated according to the formula (6), and the formula (6) is as follows:

${v v}_{x x,, k k}^{' '' '} ((y the y,, y the y)) = = \underset{p p}{Σ Σ} \underset{{k k}_{11} + + {k k}_{22} = = k k}{Σ Σ} {v v}_{x x,, {k k}_{11}} ((p p,, p p)) {v v}_{p p,, {k k}_{22}}^{' '} ((y the y,, y the y)) - - - - - - ((66))$

再通过公式(7)计算得到Sco(x,k,y)，公式(7)如下： Then calculate Sco(x,k,y) through the formula (7), the formula (7) is as follows:

$Sco Sco ((x x,, k k,, y the y)) = = \frac{((Set set ((x x,, k k,, y the y)) - - {v v}_{x x,, k k}^{' '' '} ((y the y,, y the y))))}{22} - - - - - - ((77))$

C.结束； C. end;

集合Sco更新完毕后，接下来通过公式(8)计算各个任务节点的偏移系数， After the set Sco is updated, the offset coefficient of each task node is calculated by formula (8),

${δ δ}_{y the y} = = \underset{x x}{Σ Σ} \underset{k k}{Σ Σ} Sco Sco ((x x,, k k,, y the y)) - - - - - - ((88))$

δ_y为节点y的偏移系数，主计算机依据根节点的对应关系，将该偏移系数发送给相应的子计算机。 δ_y is the offset coefficient of node y, and the host computer sends the offset coefficient to the corresponding subcomputer according to the corresponding relationship of the root node.

步骤6.子计算机接收到其各任务节点的偏移系数之后对本地各任务节点的相似度增量计算值进行修正，并将修正后的本地各任务节点的相似度增量进行求和后传送给主计算机； Step 6. After receiving the offset coefficients of each task node, the subcomputer corrects the calculated value of similarity increment of each local task node, and sums the corrected similarity increment of each local task node before sending to the host computer;

子计算机根据公式(9)对步骤4求得的本地相似度增量计算值进行修正，即 The sub-computer corrects the calculated value of the local similarity increment obtained in step 4 according to the formula (9), that is,

s(j,j′)＝Σ_ys′_y(j,j′)(1-δ_y) (9) s(j,j′)=Σ_y s′_y (j,j′)(1-δ_y ) (9)

其中s(j,j′)表示节点对(j,j′)的相似度，s′_y(j,j′)表示节点y传递给节点对(j,j′)的相似度增量计算值，其为不考虑所得到的游历是否为首遇游历的计算结果；δ_y表示任务节点y的偏移系数，其用于对任务节点y的相似度增量计算值进行修正；求和符号内s′_y(j,j′)(1-δ_y)整体表示节点y传递给节点对(j,j′)的相似度增量，为修正后的节点y对应矩阵中的记录； Where s(j, j′) represents the similarity of the node pair (j, j′), and s′_y (j, j′) represents the calculated value of the similarity increment passed from node y to the node pair (j, j′) , which is the calculation result regardless of whether the obtained tour is a first encounter tour; δ_y represents the offset coefficient of task node y, which is used to correct the calculated value of similarity increment of task node y; '_y (j,j')(1-δ_y ) overall represents the similarity increment passed from node y to node pair (j,j'), which is the record in the corresponding matrix of node y after correction;

步骤7：根据从各子计算机接收的各节点的相似度增量，主计算机对图模型中各节点对的相似度进行整合，最终得到图模型中各个节点对的相似度； Step 7: According to the similarity increment of each node received from each sub-computer, the main computer integrates the similarity of each node pair in the graph model, and finally obtains the similarity of each node pair in the graph model;

主计算机接收到所有子计算机发送的各节点的相似度增量后，将对应节点对的相似度增量相加，得到每个节点对的相似度值，再将每个节点与其自身的相似度修正为1.0。 After the host computer receives the similarity increments of each node sent by all the subcomputers, it adds the similarity increments of the corresponding node pairs to obtain the similarity value of each node pair, and then calculates the similarity value between each node and itself Fixed to 1.0. the

有益效果：本发明的分布式求取图节点相似度的方法，通过搭建分布式计算平台，由主计算机读入对象数据，建立图模型，划分任务后将各个任务块分配给各个子计算机，最后再由主计算机对各子计算机完成的任务结果进行整合，计算出图模型中两节点相似度大小，用于对象间联系程度的分析，其计算复杂度为O(Kn³)，通信代价为O(Nn²)，相比于传统SimRank计算方法，传输代价低，计算时间短，效率明显提高。 Beneficial effects: the distributed computing method of the present invention for obtaining graph node similarity, by building a distributed computing platform, the main computer reads in object data, establishes a graph model, and assigns each task block to each subcomputer after dividing tasks, and finally Then the main computer integrates the task results completed by each sub-computer, and calculates the similarity of two nodes in the graph model, which is used for the analysis of the relationship between objects. The computational complexity is O(Kn³ ), and the communication cost is O (Nn² ), compared with the traditional SimRank calculation method, the transmission cost is low, the calculation time is short, and the efficiency is significantly improved.

附图说明Description of drawings

图1为本发明实施例1的人际关系结构示意图； Fig. 1 is a schematic diagram of the interpersonal relationship structure of Embodiment 1 of the present invention;

图2为本发明实施例1的分布式求取图节点相似度的方法流程图； Fig. 2 is the flow chart of the distributed method for obtaining graph node similarity in Embodiment 1 of the present invention;

图3为本发明实施例1根据图1所示的人际关系结构建立的图模型示意图； Fig. 3 is a schematic diagram of a graph model established according to the interpersonal relationship structure shown in Fig. 1 according to Embodiment 1 of the present invention;

图4为本发明实施例2的3种不同相似度计算方法分别对5个不同数据集进行相似度计算的运行时间测试图。 FIG. 4 is a running time test diagram of three different similarity calculation methods in Embodiment 2 of the present invention for performing similarity calculation on five different data sets respectively. the

具体实施方式Detailed ways

下面结合附图对本发明的一种实施方式作进一步详细的说明。 An embodiment of the present invention will be further described in detail below in conjunction with the accompanying drawings. the

本实施方式中所有实施例中涉及程序部分的均使用Java编程实现，且实验所使用的分布式环境为15台计算机的机群，每台计算机配置为Intel Core i321003.1GHz CPU，8G内存，500G硬盘，操作系统为Red Hat Linux6.1。 In this embodiment, all the procedures involved in the embodiments are implemented using Java programming, and the distributed environment used in the experiment is a cluster of 15 computers, and each computer is configured as an Intel Core i32100 3.1GHz CPU, 8G memory, and a 500G hard disk , the operating system is Red Hat Linux6.1. the

实施例1 Example 1

为方便理解，本实施例提供一个较小的图数据的详细处理过程。本实施例以人物关系网络为例，该人物关系网络出自微软人立方关系搜索 (http://renlifang.msra.cn/GuanxiMap.aspx？query＝355)，如图1所示，描述了七位名人间的人际关系结构，这七位名人分别是董卿、朱军、李咏、白岩松、毕福剑、赵本山以及李思思。该实施例根据这七人间的人际关系，可计算分析出这七人间的关联紧密程度。 For the convenience of understanding, this embodiment provides a detailed processing process of a small graph data. This embodiment takes the character relationship network as an example. The character relationship network comes from Microsoft People Cube Relationship Search(http://renlifang.msra.cn/GuanxiMap.aspx?query=355 ), as shown in Figure 1, which describes seven The structure of interpersonal relationships between celebrities, the seven celebrities are Dong Qing, Zhu Jun, Li Yong, Bai Yansong, Bi Fujian, Zhao Benshan and Li Sisi. In this embodiment, according to the interpersonal relationships among the seven persons, the degree of closeness of association among the seven persons can be calculated and analyzed.

为减少计算次数，本实施例确定计算精度为ε＝0.01；为了加快收敛速度，本实施例确定衰减因子为C＝0.4；本实施例中约定所有数据均取至小数点后三位。针对该七位名人构成的人际关系网络结构进行分布式计算图节点相似度(DcSimRank)的方法，如图2所示，该方法流程开始于步骤201。 In order to reduce the number of calculations, this embodiment determines the calculation accuracy to be ε=0.01; in order to speed up the convergence speed, this embodiment determines the attenuation factor to be C=0.4; in this embodiment, it is agreed that all data are rounded to three decimal places. A method for performing distributed calculation of graph node similarity (DcSimRank) on the interpersonal network structure formed by the seven celebrities, as shown in FIG. 2 , starts at step 201 . the

在步骤202，采用主从(master-slave)模式搭建分布式计算平台：本实施例所使用的计算机数量N＝4，对分布式环境中的这四台计算机进行划分，其中一台作为主计算机，其余三台作为子计算机，并分别标记为第一子计算机、第二子计算机以及第三子计算机； In step 202, adopt the master-slave (master-slave) mode to build a distributed computing platform: the number of computers used in this embodiment is N=4, these four computers in the distributed environment are divided, and one of them is used as the main computer , and the remaining three are sub-computers, and are respectively marked as the first sub-computer, the second sub-computer and the third sub-computer;

在步骤203，主计算机读入现有的对象数据，建立图模型并发送给分布式计算平台中的每台子计算机； In step 203, the main computer reads in the existing object data, builds a graph model and sends it to each subcomputer in the distributed computing platform;

本实施例中主计算机读入图1所示的7个对象数据，7个对象即董卿、赵本山、朱军、毕福剑、李思思、白岩松以及李咏；主计算机以该7个对象分别为节点A,B,C,D,E,F,G，对象之间的关系为节点之间的边，建立图模型，如图3所示；主计算机以文本形式对该图模型进行存储，如表1所示，每行存储节点tail的所有出邻居及出邻居入度，并以节点tail作为主键，为tail建立索引，存储结构如下：tail,(head₁,|I(head₁)|),…(head_|O(tail)|,|I(head_|O(tail)|)|)，其中(tail，head_z)，z＝1,2,…,|O(tail)|均为图中从节点tail到节点head_z的有向边，|O(tail)|为tail的出度；|I(head_z)|表示节点head_z的入度；主计算机将该图模型分别发送给第一子计算机、第二子计算机以及第三子计算机；各子计算机将其接收的该图模型以相同的存储结构存储于本地； In this embodiment, the host computer reads in the data of 7 objects shown in Figure 1, and the 7 objects are Dong Qing, Zhao Benshan, Zhu Jun, Bi Fujian, Li Sisi, Bai Yansong, and Li Yong; the host computer uses these 7 objects as nodes A and B respectively , C, D, E, F, G, the relationship between objects is the edge between nodes, and a graph model is established, as shown in Figure 3; the host computer stores the graph model in text form, as shown in Table 1 , each line stores all outgoing neighbors and incoming neighbors of the node tail, and uses the node tail as the primary key to build an index for tail. The storage structure is as follows: tail,(head₁ ,|I(head₁ )|),…(head_|O(tail)| ,|I(head_|O(tail)| )|), where (tail, head_z ), z=1,2,...,|O(tail)| are slave nodes tail in the graph The directed edge to the node head_z , |O(tail)| is the out-degree of tail; |I(head_z )| represents the in-degree of the node head_z ; the host computer sends the graph model to the first sub-computer, The second sub-computer and the third sub-computer; each sub-computer stores the graphic model it receives locally with the same storage structure;

表1图模型存储结构 Table 1 Diagram model storage structure

A,(B,2),(C,6),(D,4),(E,3),(F,2),(G,3)A,(B,2),(C,6),(D,4),(E,3),(F,2),(G,3)B,(A,6),(C,6)B,(A,6),(C,6)C,(A,6),(B,2),(D,4),(E,3),(F,2),(G,3)C,(A,6),(B,2),(D,4),(E,3),(F,2),(G,3)D,(A,6),(C,6),(E,3),(G,3)D,(A,6),(C,6),(E,3),(G,3)E,(A,6),(C,6),(D,4)E,(A,6),(C,6),(D,4)F,(A,6),(C,6)F,(A,6),(C,6)G,(A,6),(C,6),(D,4)G,(A,6),(C,6),(D,4)

在步骤204，以该图模型的所有节点为任务划分对象，主计算机进行任务划分，并将划分后的各子任务及迭代次数分配给各子计算机；子任务由所划分的不同节点构成；子任务中的不同节点，称作任务节点； In step 204, with all the nodes of the graph model as task division objects, the main computer divides the tasks, and distributes the divided sub-tasks and the number of iterations to each sub-computer; the sub-tasks are composed of different divided nodes; Different nodes in the task are called task nodes;

对任务进行划分时，本实施例中的主计算机将节点A,B，C,D，E,F,G作为任务划分对象，为保证负载平衡，将节点A,B划分给第一子计算机进行处理，将节点C,D划分给第二子计算机进行处理，将节点E,F,G划分给第三子计算机进行处理；则其中节点A,B分别为第一子计算机的任务节点，节点C,D分别为第二子计算机的任务节点，节点E,F,G分别为第三子计算机的任务节点； When dividing tasks, the host computer in this embodiment uses nodes A, B, C, D, E, F, and G as task division objects. In order to ensure load balance, nodes A and B are allocated to the first subcomputer to perform processing, assigning nodes C and D to the second subcomputer for processing, and assigning nodes E, F, and G to the third subcomputer for processing; wherein nodes A and B are task nodes of the first subcomputer respectively, and node C , D is the task node of the second sub-computer respectively, nodes E, F, G are the task nodes of the third sub-computer respectively;

本实施例中主计算机根据本实施例确定的计算精度为ε＝0.01和衰减因子C＝0.4及式(1)，计算出迭代次数K＝5，并将其分别发送给第一子计算机、第二子计算机以及第三子计算机； In the present embodiment, the calculation accuracy determined by the host computer according to the present embodiment is ε=0.01 and attenuation factor C=0.4 and formula (1), calculates the number of iterations K=5, and sends it to the first subcomputer, the first subcomputer, respectively the second sub-computer and the third sub-computer;

其中，ε为用户所能容忍的误差大小；C为衰减因子，是介于0，1之间的常数。 Among them, ε is the error size that the user can tolerate; C is the attenuation factor, which is a constant between 0 and 1. the

在步骤205，根据接收的任务，子计算机计算其各任务节点分别传递给图模型中的各个节点对的相似度增量计算值；相似度增量计算值为不考虑所得到的游历是否为首遇游历时，任务节点传递给对应节点对的相似度增量； In step 205, according to the received task, the subcomputer calculates the similarity incremental calculation value that each task node transmits to each node pair in the graph model; the similarity incremental calculation value does not consider whether the obtained tour is the first encounter When traveling, the similarity increment passed by the task node to the corresponding node pair;

本实施例以第一子计算机的任务节点A为例进行阐述，其余节点的计算方法类似，此处不再赘述。 In this embodiment, the task node A of the first sub-computer is taken as an example for illustration, and the calculation methods of other nodes are similar, and will not be repeated here. the

对于节点A，第一子计算机首先构建节点A的相似度增量计算值矩阵S′_A，矩阵大小为 7*7，并将其中每个元素初始化为0，该矩阵用于存放该节点传递给各相应节点对的相似度增量计算值；接下来采用广度优先算法，以节点A为根节点进行广度优先搜索，将节点A在每一层所能到达的节点记录并存储在集合List_k(A)中，则有 For node A, the first subcomputer first constructs the similarity increment calculation value matrix S′_A of node A, the size of which is 7*7, and initializes each element to 0, the matrix is used to store the node passed to The similarity incremental calculation value of each corresponding node pair; next, the breadth-first algorithm is used, and the breadth-first search is performed with node A as the root node, and the nodes that node A can reach in each layer are recorded and stored in the set List_k ( In A), there is

List_k(A)＝{(j,P_k[A,j])}，k＝1,…,K (2) List_k (A)={(j,P_k [A,j])}, k=1,...,K (2)

其中List_k(A)表示以节点A为根节点，在第k层所能到达的节点集合；j代表图中任意节点，即j代表节点A,B,C,D,E,F,G中的任意节点；P_k[A,j]表示从节点j出发，沿入边方向，经k步游走到达节点A的概率，称为节点A到节点j的路径值； Among them, List_k (A) represents the set of nodes that can be reached at the kth layer with node A as the root node; j represents any node in the graph, that is, j represents nodes A, B, C, D, E, F, and G Any node of ; P_k [A, j] represents the probability of starting from node j and walking along the direction of the edge to reach node A after k steps, which is called the path value from node A to node j;

List_k(A)根据List_k-1(A)计算得到，List_k(A)中每个元素根据式(3)进行更新， List_k (A) is calculated according to List_k-1 (A), and each element in List_k (A) is updated according to formula (3),

P_k[A,j]＝Σ_iP_k-1[A,I_i(j)]/|I(j)，P₀[m,m]＝1，i＝1,2,…,|I(j)| (3) P_k [A,j]=Σ_i P_k-1 [A,I_i (j)]/|I(j), P₀ [m,m]=1, i=1,2,...,|I (j)| (3)

则节点A在k＝1层时，所能到达的节点集合List₁(A)，如表2第一行所示；依此类推，则节点A在各层所能到达的节点的集合List_k(A)，如表2所示。 Then when node A is at k=1 layer, the set of nodes that can be reached is List₁ (A), as shown in the first row of Table 2; and so on, then the set of nodes that node A can reach in each layer List_k (A), as shown in Table 2.

表2节点A的List_k(A)存储结果 Table 2 List_k (A) storage result of node A

List₁(A)＝{(B,0.5),(C,0.167),(D,0.25),(E,0.333),(F,0.5),(G,0.333)}List₁ (A)＝{(B,0.5),(C,0.167),(D,0.25),(E,0.333),(F,0.5),(G,0.333)}List₂(A)＝{(A,0.347),(B,0.083),(C,0.319),(D,0.208),(E,0.139),(F,0.083),(G,0.139)}List₂ (A)＝{(A,0.347),(B,0.083),(C,0.319),(D,0.208),(E,0.139),(F,0.083),(G,0.139)}List₃(A)＝{(A,0.162),(B,0.333),(C,0.167),(D,0.236),(E,0.292),(F,0.333),(G,0.292)}List₃ (A)＝{(A,0.162),(B,0.333),(C,0.167),(D,0.236),(E,0.292),(F,0.333),(G,0.292)}List₄(A)＝{(A,0.275),(B,0.164),(C,0.275),(D,0.229),(E,0.188),(F,0.164),(G,0.188)}List₄ (A)＝{(A,0.275),(B,0.164),(C,0.275),(D,0.229),(E,0.188),(F,0.164),(G,0.188)}List₅(A)＝{(A,0.201),(B,0.275),(C,0.201),(D,0.232),(E,0.259),(F,0.275),(G,0.259)}List₅ (A)＝{(A,0.201),(B,0.275),(C,0.201),(D,0.232),(E,0.259),(F,0.275),(G,0.259)}

根据公式(4)及集合List_k(A)，计算该节点A在各层传递给节点对(j,j′)的相似度增量计算值v′_A,k(j,j′)，其中j′为图中任意节点；第一子计算机将j′＝j时的相似度增量计算值v′_A,k(j,j′)发送给主计算机；将j′≠j时的相似度增量计算值v′_A,k(j,j′)通过公式(5)累加到相似度增量计算值矩阵S_A′中。 According to the formula (4) and the set List_k (A), calculate the similarity incremental calculation value v′_A,k (j,j′) passed by the node A to the node pair (j,j′) at each layer, where j' is any node in the figure; the first sub-computer sends the calculated value v'_A,k (j,j') of the similarity increment when j'=j to the host computer; the similarity value when j'≠j The incremental calculation value v′_A,k (j,j′) is accumulated into the similarity incremental calculation value matrix S_A ′ through formula (5).

v′_A,k(j,j′)＝P_k[A,j]P_k[A,j′]C^k k＝1,…,K (4) v′_A,k (j,j′)=P_k [A,j]P_k [A,j′]C^k k=1,…,K (4)

s′_A(j,j′)＝Σ_kv′_A,k(j,j′ )(5) s′_A (j,j′)=Σ_k v′_A,k (j,j′ )(5)

以节点A在k＝1层传递给节点对(j,j′)的相似度增量计算值v′_A,1(j,j′)为例：根据公式(4) 及集合List₁(A)计算v′_A,1(j,j′)，即，将List₁(A)中元素两两相乘并乘以衰减因子C，如v′_A,1(B,C)＝P₁[A,B]P₁[A,C]C＝0.5*0.167*0.4＝0.033，依此类推，可得到该次迭代节点A传递给节点对(j,j′)的相似度增量计算值，并将其累加到矩阵S′_A中；将List₁(A)中的各元素分别与其自身相乘并乘以衰减因子C，如v′_A,1(B,B)＝P₁[A,B]P₁[A,B]C＝0.5*0.5*0.4＝0.100，依此类推，分别计算出v′_A,1(C,C)、v′_A,1(D,D)、v′_A,1(E,E)、v′_A,1(F,F)和v′_A,1(G,G)，并将它们传送给主计算机，作为主计算机计算偏移系数的参数，至此第一次迭代计算完毕。 Take the similarity incremental calculation value v′_A,1 (j,j′) passed by node A to node pair (j,j′) at layer k=1 as an example: according to formula (4) and set List₁ (A ) Calculate v′_A,1 (j,j′), that is, multiply the elements in List₁ (A) two by two and multiply by the attenuation factor C, such as v′_A,1 (B,C)=P₁ [ A,B]P₁ [A,C]C＝0.5*0.167*0.4＝0.033, and so on, we can get the similarity incremental calculation value passed by node A to node pair (j,j′) in this iteration, and add it to the matrix S′_A ; each element in List₁ (A) is multiplied by itself and multiplied by the attenuation factor C, such as v′_A,1 (B,B)=P₁ [A, B]P₁ [A,B]C＝0.5*0.5*0.4＝0.100, and so on, respectively calculate v′_A,1 (C,C), v′_A,1 (D,D), v′_A,1 (E,E), v′_A,1 (F,F) and v′_A,1 (G,G), and send them to the host computer as parameters for the host computer to calculate the offset coefficient, so far The first iteration is calculated.

接下来进行第二次迭代计算，即节点A在k＝2层时，所能到达的节点集合List₂(A)，如表2的第二行所示，其是通过迭代方式对List₁(A)进行更新后得到的，其更新方法如下：根据公式(3)，List₂(A)中元素对应节点是通过List₁(A)中该节点的入节点的值除以其入度所得到的，如List₂(A)中第一个元素(A,P₂[A,A])，由表1可知，节点A的入邻居包括节点B～G，因此，P₂[A,A]＝(P₁[A,B]+…P₁[A,G])/|I(A)|＝0.347，依此类推，依照此方法对List₂(A)更新完毕，得到表2中的第二行，其余步骤与第一次迭代计算相同：将List₂(A)中元素两两相乘并乘以衰减因子C²，得到第二次迭代节点A传递给节点对(j,j′)的相似度增量计算值，并将其累加到矩阵S′_A中；将List₂(A)中的各元素分别与其自身相乘并乘以衰减因子C²，分别计算出v′_A,2(A,A)、v′_A,2(B,B)、v′_A,2(C,C)、v′_A,2(D,D)、v′_A,2(E,E)、v′_A,2(F,F)和v′_A,2(G,G)，并将它们传送给主计算机，作为计算偏移系数的参数，至此第二次迭代计算完毕。 Next, the second iterative calculation is carried out, that is, when node A is at k=2 layers, the node set List₂ (A) that can be reached, as shown in the second row of table 2, it is to List_{1 (A} ) by iterative mode. A) After updating, the update method is as follows: According to the formula (3), the corresponding node of the element in List₂ (A) is obtained by dividing the value of the entry node of the node in List₁ (A) by its entry degree For example, the first element (A,P₂ [A,A]) in List₂ (A), it can be seen from Table 1 that the incoming neighbors of node A include nodes B～G, therefore, P₂ [A,A] ＝(P₁ [A,B]+…P₁ [A,G])/|I(A)|＝0.347, and so on, update List₂ (A) according to this method, and get the The second line, the rest of the steps are the same as the first iteration calculation: multiply the elements in List₂ (A) two by two and multiply by the attenuation factor C² , get the node A of the second iteration and pass it to the node pair (j, j′ ), and add it to the matrix S′_A ; each element in List₂ (A) is multiplied by itself and multiplied by the attenuation factor C² to calculate v′_{A, 2} (A,A), v′_A,2 (B,B), v′_A,2 (C,C), v′_A,2 (D,D), v′_A,2 (E,E) , v′_A,2 (F,F) and v′_A,2 (G,G), and send them to the host computer as the parameters for calculating the offset coefficient, and the second iteration is calculated so far.

按照第一次迭代计算与第二次迭代计算的方法，将第三次、第四次和第五次迭代计算完成以后，该任务节点A处理完毕，第一子计算机按照同样的方法继续处理其任务节点B。同理，根据该计算方法，第二子计算机和第三子计算机分别对其任务节点进行计算，得到其各任务节点分别传递给节点对(j,j′)的相似度增量计算值，并分别将j′＝j时的相似度增量计算值v′_m,k(j,j′)发送给主计算机，将j′≠j时的相似度增量计算值v′_m,k(j,j′)通过公式(5)累加到各任务节点的相似度增量计算值矩阵中。 According to the method of the first iterative calculation and the second iterative calculation, after the third, fourth and fifth iterative calculations are completed, the task node A is processed, and the first sub-computer continues to process other tasks according to the same method. Task node B. Similarly, according to this calculation method, the second sub-computer and the third sub-computer calculate their task nodes respectively, and obtain the similarity incremental calculation values that each task node transmits to the node pair (j, j′), and Send the similarity increment calculation value v′_m,k (j,j′) when j′=j to the host computer, and the similarity increment calculation value v′_m,k (j , j′) are accumulated into the similarity increment calculation value matrix of each task node through formula (5).

在步骤206，主计算机计算各任务节点的偏移系数并分别发送给对应的各子计算机；偏移系数为用于对各任务节点的相似度增量计算值进行修正的系数； In step 206, the host computer calculates the offset coefficients of each task node and sends them to corresponding subcomputers respectively; the offset coefficient is a coefficient used to correct the similarity increment calculation value of each task node;

在所有子计算机完成步骤205后，主计算机根据接收到的每个子计算机发送的其各任务节点传递给节点对(j,j′)，j′＝j时的相似度增量计算值v′_m,k(j,j′)，建立集合Set＝{({x′,k′,y′},v′)}，表3展示了集合Set部分元素，其中x′和y′均表示任意图节点；v′表示节点x′在第k′＝1,2,…,K次迭代对节点对(y′,y′)的相似度增量计算值v′_x′,k′(y′,y′)，并用Set(x′,k′,y′)表示集合Set对应的元素v′＝v′_x′,k′(y′,y′)； After all subcomputers complete step 205, the main computer transmits to the node pair (j, j') according to its task nodes received by each subcomputer, and the similarity incremental calculation value v'_m when j'=j_,k (j,j′), establish a set Set={({x′,k′,y′},v′)}, Table 3 shows some elements of the set Set, where x′ and y′ represent any graph Node; v' represents the similarity increment calculation value v'_x',k'(y', y′), and use Set(x′,k′,y′) to represent the element v′=v′_x′,k′ (y′,y′) corresponding to the set Set;

表3集合Set部分元素 Table 3 Collection Set Partial Elements

构建集合Sco＝{({x,k,y},v)}，其中x和y均表示任意图节点，v表示以节点x为起点，经过k步游走，在节点y发生首遇的游历值之和v_x,k(y,y)，并用Sco(x,k,y)表示集合Sco对应的元素v＝v_x,k(y,y)；下面根据公式(6)和(7)，利用集合Set对集合Sco进行更新： Construct the set Sco={({x,k,y},v)}, where x and y represent any graph node, and v represents the tour that takes node x as the starting point, travels through k steps, and first encounters at node y The sum of values v_{x, k} (y, y), and use Sco(x, k, y) to represent the element v=v_{x, k} (y, y) corresponding to the set Sco; according to formulas (6) and (7) , use the set Set to update the set Sco:

首先对Sco中k＝1的结果进行更新：由于在两条长度为1的路径中，不可能出现第二个相遇节点，因此将集合Set中k＝1的结果直接导入集合Sco即可；若Sco(x,k,y)中k＝1，则转去执行步骤C；若Sco(x,k,y)中k＞1，则继续转去执行步骤B； First, update the result of k=1 in Sco: since it is impossible for the second meeting node to appear in two paths with length 1, the result of k=1 in the set Set can be directly imported into the set Sco; if If k=1 in Sco(x,k,y), go to step C; if k>1 in Sco(x,k,y), continue to go to step B;

B.对于k＞1时的Sco(x,k,y)，通过从集合Sco中查找出元素Sco(x,k₁,p)与从集合Set中查找出元素Set(p,k₂,y),且满足k₁+k₂＝k,p为在集合Sco和集合Set中同时出现的节点；根据公式(6)计算得到偏差v″_x，k(y,y),公式(6)如下： B. For Sco(x,k,y) when k>1, by finding the element Sco(x,k₁ ,p) from the set Sco and finding the element Set(p,k₂ ,y from the set Set ), and satisfy k₁ +k₂ =k, p is the node that appears simultaneously in the set Sco and the set Set; according to the formula (6), the deviation v″_{x, k} (y, y) is calculated, and the formula (6) is as follows :

对于Sco中k＝2时，以计算Sco(A,2,A)为例，首先对集合Set与集合Sco进行查询，得到满足条件的Set(A,1,B)与Sco(B,1,A)，Set(A,1,C)与Sco(C,1,A)等六条记录，通过公式(6)得到偏差v″_A,2(A,A)＝Set(A,1,B)*Sco(B,1,A)+Set(A,1,C)*Sco(C,1,A)+…＝0.004，通过公式(7)求得Sco(A,2,A)＝(Set(A,2,A)-v″_A,2(A,A))/2＝0.008。其余Sco元素更新方法相同，不再阐述。 When k=2 in Sco, take the calculation of Sco(A,2,A) as an example, first query the set Set and the set Sco, and obtain Set(A,1,B) and Sco(B,1, A), Set(A,1,C) and Sco(C,1,A) and other six records, get the deviation v″_A,2 (A,A)=Set(A,1,B) through formula (6) *Sco(B,1,A)+Set(A,1,C)*Sco(C,1,A)+…＝0.004, Sco(A,2,A)=(Set (A,2,A)−v″_A,2 (A,A))/2=0.008. The update methods of other Sco elements are the same and will not be described here.

C.结束； C. end;

集合Sco更新完毕后，如表4所示，展示了集合Sco部分元素 After the set Sco is updated, as shown in Table 4, it shows some elements of the set Sco

表4集合Sco部分元素 Table 4 Collection Sco Partial Elements

集合Sco更新完毕后，接下来通过公式(8)计算每个任务节点的偏移系数， After the set Sco is updated, the offset coefficient of each task node is calculated by formula (8),

其中δ_y为节点y的偏移系数，其用于对任务节点y的相似度增量计算值进行修正；主计算机依据根节点的对应关系，将该偏移系数发送给相应的子计算机。如主计算机计算出节点A的偏移系数为： Where δ_y is the offset coefficient of node y, which is used to correct the incremental calculation value of the similarity of task node y; the host computer sends the offset coefficient to the corresponding sub-computer according to the corresponding relationship of the root node. For example, the host computer calculates the offset coefficient of node A as:

$δ_{A} = \underset{x}{Σ} \underset{k}{Σ} Sco (x, k, A) = Sco (A, 1, A) + . . . + Sco (A, 5, A) + Sco (B, 5, A) + . . . + Sco (B, 5, A) + . . . = 0.087$ ，并将其发送给第一子计算机；按照此方法，主计算机完成计算δ_B,δ_C,δ_D,δ_E,δ_F,δ_G，如表5所示，并分别将δ_B发送给第一子计算机，将δ_C,δ_D发送给第二子计算机，将δ_E,δ_F,δ_G发送给第三子计算机。 $δ_{A} = \underset{x}{Σ} \underset{k}{Σ} Sco (x, k, A) = Sco (A, 1, A) + . . . + Sco (A, 5, A) + Sco (B, 5, A) + . . . + Sco (B, 5, A) + . . . = 0.087$ , and send it to the first sub-computer; according to this method, the main computer completes the calculation of δ_B , δ_C , δ_D , δ_E , δ_F , δ_G , as shown in Table 5, and sends δ_B to The first sub-computer sends δ_C , δ_D to the second sub-computer, and sends δ_E , δ_F , δ_G to the third sub-computer.

表5各节点偏移系数 Table 5 Offset coefficient of each node

节点(y)node(y)AABBCCDD.EE.FfGG偏移系数(δ_y)Offset coefficient (δ_y )0.0870.0870.2120.2120.0870.0870.1150.1150.1460.1460.2120.2120.1460.146

在步骤207，子计算机接收到其各任务节点的偏移系数之后对本地各任务节点的相似度增量计算值进行修正，并将修正后的本地各任务节点的相似度增量进行求和后传送给主计算机； In step 207, after the subcomputer receives the offset coefficients of each task node, it corrects the calculated value of the similarity increment of each local task node, and sums the corrected similarity increments of each local task node. sent to the host computer;

各子计算机接收到偏移系数后，根据公式(9)对步骤4求得的本地相似度增量计算值进行修正，即，对于每个任务节点，用1减去其偏移系数的结果再与其相似度增量计算值矩阵中每个元素相乘，得到该任务节点的相似度增量矩阵，之后对其对应的子计算机内所有任务节点的相似度增量矩阵进行求和后得到该子计算机的相似度增量矩阵，并传送给主计算机。 After each subcomputer receives the offset coefficient, it corrects the calculated value of the local similarity increment obtained in step 4 according to the formula (9), that is, for each task node, subtract the result of its offset coefficient from 1 and then Multiply each element in the similarity increment calculation value matrix to obtain the similarity increment matrix of the task node, and then sum the similarity increment matrices of all task nodes in the corresponding sub-computer to obtain the sub-computer The similarity increment matrix of the computer is sent to the host computer. the

其中s(j,j′)表示节点对(j,j′)的相似度，s′_y(j,j′)表示节点y传递给节点对(j,j′)的相似度增量计算值，其为不考虑所得到的游历是否为首遇游历的计算结果；求和符号内s′_y(j,j′)(1-δ_y)整体表示节点y传递给节点对(j,j′)的相似度增量，为修正后的节点y对应矩阵中的记录； Where s(j, j′) represents the similarity of the node pair (j, j′), and s′_y (j, j′) represents the calculated value of the similarity increment passed from node y to the node pair (j, j′) , which is the calculation result regardless of whether the obtained tour is a first-encounter tour; s′_y (j,j′)(1-δ_y ) in the summation symbol as a whole indicates that node y is passed to the node pair (j,j′) The similarity increment of is the record in the matrix corresponding to the corrected node y;

在步骤208，根据从各子计算机接收的相似度增量，主计算机对图模型中各节点对的相似度进行整合，最终得到图模型中各个节点对的相似度； In step 208, according to the similarity increment received from each subcomputer, the main computer integrates the similarity of each node pair in the graph model, and finally obtains the similarity of each node pair in the graph model;

主计算机接收到所有子计算机发送的各节点的相似度增量后，将对应节点对的相似度增量相加，得到每个节点对的相似度值，再将每个节点与其自身的相似度修正为1.0，最终得到相似度计算结果如表6所示。 After the host computer receives the similarity increments of each node sent by all the subcomputers, it adds the similarity increments of the corresponding node pairs to obtain the similarity value of each node pair, and then calculates the similarity value between each node and itself Corrected to 1.0, the final similarity calculation results are shown in Table 6. the

表6相似度计算结果 Table 6 Similarity Calculation Results

theAABBCCDD.EE.FfGGAA1.0001.0000.0600.0600.0970.0970.0850.0850.0740.0740.0600.0600.0740.074BB0.0600.0601.0001.0000.0600.0600.1260.1260.1600.1600.2230.2230.1600.160CC0.0970.0970.0600.0601.0001.0000.0850.0850.0740.0740.0600.0600.0740.074DD.0.0850.0850.1260.1260.0850.0851.0001.0000.0960.0960.01260.01260.0960.096EE.0.0740.0740.1600.1600.0740.0740.0960.0961.0001.0000.1600.1600.1590.159Ff0.0600.0600.2230.2230.0600.0600.1260.1260.1600.1601.0001.0000.1600.160GG0.0740.0740.1600.1600.0740.0740.0960.0960.1590.1590.1600.1601.0001.000

通过本实施例，对本发明(DcSimRank方法)的正确性进行分析： Through the present embodiment, the correctness of the present invention (DcSimRank method) is analyzed:

如表6所示数据，为图1给出的7位名人的人际关系紧密度的分析结果，以第一行为例：通过数据可得董卿(A)与自身关联紧密程度最高，与其他人物相似度差别较大；在其他节点中与朱军(C)具有较高相似度，其次是毕福剑(D)，之后为李思思(E)与李咏(G)，最后为赵本山(B)和白岩松(F)。与现实情况进行比较分析，由于董卿(A)、朱军(C)、毕福剑(D)、李思思(E)、李咏(G)、白岩松(F)均为央视主持，且董卿(A)与朱军(C)常年合作主持春晚,而毕福剑(D)和李思思(E)为最近两年才加入春晚，自然与董卿(A)的关联紧密程度不如朱军(C)，虽然李咏(G)也曾经与董卿(A)合作过多次，但今年以来李咏(G)已离开春晚，因此其关联紧密程度也较低。通过分析表明，该方法所计算得到的节点相似度具有较高的正确性。 The data shown in Table 6 is the analysis result of the closeness of interpersonal relationships of the 7 celebrities given in Figure 1, taking the first row as an example: through the data, it can be found that Dong Qing (A) is the most closely related to herself, similar to other characters In other nodes, it has a high degree of similarity with Zhu Jun (C), followed by Bi Fujian (D), then Li Sisi (E) and Li Yong (G), and finally Zhao Benshan (B) and Bai Yansong (F ). Comparative analysis with the actual situation, since Dong Qing (A), Zhu Jun (C), Bi Fujian (D), Li Sisi (E), Li Yong (G), Bai Yansong (F) are all CCTV hosts, and Dong Qing (A) and Zhu Jun (C) co-hosted the Spring Festival Gala all the year round, while Bi Fujian (D) and Li Sisi (E) only joined the Spring Festival Gala in the last two years, so the relationship with Dong Qing (A) is not as close as that of Zhu Jun (C), although Li Yong (G) also used to He has collaborated with Dong Qing (A) many times, but Li Yong (G) has left the Spring Festival Gala this year, so his relationship is relatively low. The analysis shows that the node similarity calculated by this method has high accuracy. the

本发明(DcSimRank方法)的性能测试： The performance test of the present invention (DcSimRank method):

实施例2 Example 2

表7所示为5个不同数据集，其中Wiki数据集为维基百科页面数据，其中节点表示每个页面，节点之间的边代表页面之间的链接，节点对的相似度可用来表示两个页面的关联程度，例如维基百科中辽宁省页面、奉天省页面和沈阳市页面三者之间具有很高的关联程度，由于在大量网页中同时存在该三个页面的链接；数据集Gnu1至Gnu4为一个分布式p2p文件共享系统的通信数据，其中节点表示每个服务器，节点之间的边代表服务器间的访问过程，节点对的相似度可用来表示两个服务器内数据的关联程度，在数据维护、数据融合操作时有重要作用。 Table 7 shows 5 different datasets, among which the Wiki dataset is Wikipedia page data, where nodes represent each page, edges between nodes represent links between pages, and the similarity of node pairs can be used to represent two The degree of association of pages, for example, the pages of Liaoning Province, Fengtian Province, and Shenyang City in Wikipedia have a high degree of association, because there are links to these three pages in a large number of web pages; datasets Gnu1 to Gnu4 is the communication data of a distributed p2p file sharing system, in which the nodes represent each server, and the edges between nodes represent the access process between servers. It plays an important role in maintenance and data fusion operations. the

表7数据集 Table 7 data set

数据集data set节点数量Number of nodes边数量number of sidesWikiwiki71557155103689103689Gnu1Gnu1630163012077720777Gnu2Gnu2886488643183931839Gnu3Gnu310876108763999439994Gnu4Gnu422687226875470554705

图4为对该5个不同数据集采用3种不同的相似度计算方法进行相似度计算的运行时间测试图，从图中可以看出本发明的优化计算方法DcSimRank相比于传统SimRank以及LSimRank(基于随机游走的SimRank)具有较高的执行效率。 Fig. 4 adopts 3 kinds of different similarity calculation methods to carry out the running time test figure of similarity calculation to this 5 different data sets, can find out from the figure that the optimization calculation method DcSimRank of the present invention compares traditional SimRank and LSimRank ( SimRank based on random walk) has high execution efficiency. the

虽然以上描述了本发明的具体实施方式，但是本领域内的熟练的技术人员应当理解，这些仅是举例说明，可以对这些实施方式做出多种变更或修改，而不背离本发明的原理和实质。 Although the specific embodiments of the present invention have been described above, those skilled in the art should understand that these are only examples, and various changes or modifications can be made to these embodiments without departing from the principles and principles of the present invention. substance. the