CN111611785A

Info

Publication number: CN111611785A
Application number: CN202010361406.8A
Authority: CN
Inventors: 礼欣; 吴昊; 洪辉婷; 潘元刚; 曾伟鸿
Original assignee: Beijing Institute of Technology BIT
Current assignee: Beijing Institute of Technology BIT
Priority date: 2020-04-30
Filing date: 2020-04-30
Publication date: 2020-09-01

Abstract

The invention relates to a generative adversarial network embedding representation learning method, applied to the technical field of network entity alignment. The invention unifies network embedding and cross-network entity alignment: a graph convolutional network extracts network features, while generative adversarial learning guides the learning of domain-invariant features and avoids the influence of domain-dependent features during embedding learning. On this basis, a direction-aware graph convolutional network is proposed to better exploit the structural information of directed networks, and, based on the characteristics of graph convolutional networks, a weight-sharing technique is used to improve the efficiency of cross-network embedding learning. Compared with the prior art, the invention addresses the poor alignment caused by domain features in entity alignment tasks: domain-adversarial learning yields domain-invariant features that are better suited to entity alignment, which improves the alignment result.

Description

A Generative Adversarial Network Embedding Representation Learning Method

Technical Field

The invention relates to a generative adversarial network embedding representation learning method, and in particular to an alignment-oriented generative adversarial network embedding representation learning method, applied to the technical field of network entity alignment.

Background

Network entity alignment was first applied in bioinformatics, where protein-protein interaction networks of different species are compared to find structures common to their proteins. Current entity alignment methods rest on one shared assumption: associated nodes should have a consistent connection structure across different networks. Because networks differ in function and audience, they often differ greatly from one another. If each network is viewed as a domain, the node attributes of a network are often highly domain-specific, and attribute information also suffers from low reliability and missing values. For these reasons, topology-based methods that explore the network structure are the most common solution for alignment. With the development of network embedding learning, the structural information of a network can be embedded in a low-dimensional space, and entity alignment can be accomplished by exploring a common low-dimensional subspace of the networks or a subspace transformation between them.

Several entity alignment methods based on embedding learning already exist, such as SNNA and IONE, but these models do not consider the extraction of domain-invariant features, so the alignment task is easily affected by domain bias. Current methods still tend to embed local network structure and high-order network structure into the low-dimensional vector space at the same time. For example, IONE uses LINE so that the network embedding preserves the second-order proximity structure, and implicitly preserves higher-order structure through link propagation. Embeddings learned in this way, however, tend to carry domain-specific signal features. These features help distinguish one network domain from another but do not help entity alignment, and they may even cause domain-invariant features to be learned insufficiently, which weakens the alignment result.

In summary, there is an urgent need for an embedding representation learning method that extracts domain-invariant features in order to improve entity alignment.

Summary of the Invention

The purpose of the invention is to provide an entity-alignment-oriented generative adversarial network embedding representation learning method that solves the poor alignment of existing embedding-based entity alignment methods, which results from their failure to fully consider domain-invariant features. The invention guides the learning of domain-invariant features through adversarial learning, thereby avoiding over-learning of domain-dependent features during embedding learning and improving the alignment result.

The idea of the invention is a new framework, DANA (Domain-Adversarial Network Alignment), which unifies network embedding and cross-network entity alignment. While maximizing the posterior probability of the anchor links, the feature extractor (which produces the network embeddings) plays an adversarial game in which it maximizes the classification loss of the domain discriminator so as to filter out domain-dependent features, and finally obtains network embeddings that are better suited to the alignment task. On top of the basic DANA framework, the invention proposes two further optimizations: a direction-aware graph convolutional network (Directed GCNs) that better exploits the structural information of directed networks, and a weight-sharing technique, based on the characteristics of graph convolutional networks, that improves the efficiency of cross-network embedding learning.

The purpose of the invention is achieved through the following technical solution:

A generative adversarial network embedding representation learning method, comprising the following:

The input data are the structural information $N^A$ and $N^B$ of network A and network B, with $N^A = (V^A, E^A)$ and $N^B = (V^B, E^B)$, where $V^A$ and $V^B$ are the vertex sets of network A and network B and $E^A$ and $E^B$ are their edge sets. The accounts of the same user in network A and network B are a vertex $v_i^A \in V^A$ in network A and a vertex $v_j^B \in V^B$ in network B. Each vertex of network A carries the domain label $d^A$, and each vertex of network B carries the domain label $d^B$. $(v_i^A, v_j^B)$ is a pair of anchor nodes, and $S$ is the set of anchor nodes;

Two graph convolutional networks (GCNs), $\mathrm{GCN}^A$ and $\mathrm{GCN}^B$, explore the structural information of network A and network B respectively, and each GCN outputs the low-dimensional representation $R = H_L$ of its network. The learning of the network embeddings is constrained by maximizing the posterior probability of the anchor nodes in the embedding space, which gives the optimization criterion $\mathcal{L}_E$:

$$\mathcal{L}_E = -\sum_{(v_i^A, v_j^B)\in S}\left[\log p(v_j^B\mid v_i^A)+\log p(v_i^A\mid v_j^B)\right]+\lambda\left(\lVert\theta^A\rVert^2+\lVert\theta^B\rVert^2\right)$$

where $p(v_j^B\mid v_i^A)$ is the probability of $v_j^B$ given $v_i^A$, $p(v_i^A\mid v_j^B)$ is the probability of $v_i^A$ given $v_j^B$, $\lVert\theta^A\rVert$ and $\lVert\theta^B\rVert$ are the norms of $\theta^A$ and $\theta^B$, $\theta^A$ and $\theta^B$ are the parameter sets of network A and network B respectively, and $\lambda$ is the regularization parameter;

A domain classifier acts as the discriminator. Domain-adversarial learning, expressed as follows, yields network embeddings with domain-invariant features:

$$\min_{\theta^D}\ \max_{\theta^A,\theta^B}\ \mathcal{L}_D = -\sum_{v\in V^A\cup V^B}\ \sum_{d\in\{d^A,d^B\}}\mathbb{I}_d(v)\,\log p(d\mid v)$$

where $\theta^D$ is the parameter set of the domain classifier, $d\in\{d^A, d^B\}$ is the domain label of vertex $v$, and $v\in V^A\cup V^B$. The indicator function $\mathbb{I}_d(v)$ indicates whether vertex $v$ belongs to domain $d$: its value is 1 if $v$ belongs to $d$ and 0 otherwise. $p(d\mid v)$ is the conditional probability of label $d$ given vertex $v$;

As a feature extractor, the GCN minimizes the following objective, that is, it maximizes both the posterior probability of the anchor links and the classification loss of the domain classifier:

$$\min_{\theta^A,\theta^B}\ \mathcal{L}_E-\gamma\,\mathcal{L}_D$$

where the hyperparameter $\gamma$ is the trade-off factor that weights the embedding objective $\mathcal{L}_E$ against the domain-classification loss $\mathcal{L}_D$;

The domain classifier minimizes the following objective:

$$\min_{\theta^D}\ \mathcal{L}_D$$

A gradient reversal layer is introduced during learning so that $\theta^A$, $\theta^B$ and $\theta^D$ are updated synchronously.

Preferably, the graph convolutional network GCN is a graph convolutional network for directed networks.

Preferably, the directed-network GCN characterizes each node from two views, its out-degree and its in-degree, and performs the convolution according to the out-degree and in-degree distributions.

Preferably, $\mathrm{GCN}^A$ and $\mathrm{GCN}^B$ share network parameter weights.

Beneficial Effects

For network entity alignment, the invention proposes DANA, an embedding representation learning method based on generative adversarial networks. It uses domain-adversarial learning to obtain domain-invariant features that are better suited to entity alignment, and thereby achieves better alignment. In addition, the invention proposes two optimizations of the DANA framework. A weight-sharing strategy simplifies the parameter settings among the components and further improves the alignment performance of the model. For directed networks, a graph convolutional network that captures the out-degree and in-degree distributions of the vertices extends DANA to directed-network applications.

Brief Description of the Drawings

Figure 1 is a schematic diagram of the basic DANA framework;

Figure 2 is a schematic diagram of the GCN framework for directed networks;

Figure 3 is a schematic diagram of the DANA framework using the directed-network GCN;

Figure 4 is a schematic diagram of the DANA framework using the GCN weight-sharing strategy.

Detailed Description

The invention is described in detail below with reference to the drawings and embodiments.

In recent years, generative adversarial networks have been widely used for problems involving complex data distributions. A generative adversarial network (GAN) consists mainly of a generative network (the generator, G) and a discriminative network (the discriminator, D). The generator produces data from a given noise distribution (usually uniform or normal), and the discriminator must decide whether a given sample is real data from the dataset or fake data produced by the generator. The generator tries to fit the real data distribution as closely as possible in order to produce deceptive data, while the discriminator improves its discrimination ability to avoid being fooled by the generator's fake data. Throughout this minimax game the generator and the discriminator oppose and also strengthen each other, and they finally reach an equilibrium in which the generator can fit the real data distribution and the discriminator discriminates at a high level.

Inspired by recent progress in domain adaptation research, the invention uses adversarial learning to remove, as far as possible, the effect of domain signal features on entity alignment performance. A domain classifier is introduced into the embedding-based alignment framework, and adversarial learning guides the learning of domain-invariant features. This avoids over-learning of domain-dependent features during embedding learning and improves the alignment result.

The idea of the invention is to learn the embeddings through the following process:

Step 1. Obtain the network embeddings

Two graph convolutional networks (GCNs) serve as feature extractors: $\mathrm{GCN}^A$ and $\mathrm{GCN}^B$ explore the structural information of network A and network B respectively, and each GCN outputs the low-dimensional representation $R = H_L$ of its network;

Step 2. Supervise the GCN embedding learning with anchor-node information

Using a probabilistic entity alignment criterion, the learning of the network embeddings is constrained by maximizing the posterior probability of the anchor nodes in the embedding space;

Step 3. Adversarial learning with a domain classifier

A domain classifier is introduced and trained against the feature extractor GCN in a minimax game. This domain-adversarial learning further strengthens the embedding learning of the feature extractor for entity alignment: the extractor filters out domain-dependent features as far as possible while maximizing the classification loss of the domain classifier;

Step 4. Optimize the parameters with a gradient reversal layer to minimize the objective loss function.

To illustrate the learning behavior of the framework, a mirror network is constructed from the well-known Zachary karate club network. The two networks correspond to network $N^A$ and network $N^B$ in the alignment task and together form a twin dataset. The twin dataset is built as follows: (1) take the Zachary karate club network as $N^A$ and visualize it in 2D space with a network layout algorithm; (2) with the Y axis as the axis of symmetry, reflect each vertex of $N^A$ across the Y axis according to its coordinates; the mirrored vertices form the vertex set of $N^B$; (3) keep the links between the vertices of $N^B$ identical to those of $N^A$, that is, $M^A = M^B$; (4) treat each pair of vertices of $N^A$ and $N^B$ that are symmetric about the Y axis as an anchor pair in the dataset.
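The construction of the twin dataset can be sketched as follows. This is only an illustration under stated assumptions: a four-vertex toy layout stands in for the Zachary karate club network, and the function name `build_mirror_network` and the `("B", u)` vertex naming are hypothetical, not taken from the patent.

```python
def build_mirror_network(coords, edges):
    """Mirror a laid-out network across the Y axis.

    coords: dict mapping each vertex of network A to its 2D layout (x, y)
    edges:  list of (u, v) links of network A
    Returns (coords_b, edges_b, anchors) for the mirrored network B.
    """
    # Vertex u of A maps to vertex ("B", u) of B, reflected across the Y axis.
    coords_b = {("B", u): (-x, y) for u, (x, y) in coords.items()}
    # Links are copied unchanged, so both networks share one adjacency structure.
    edges_b = [(("B", u), ("B", v)) for u, v in edges]
    # Every vertex and its mirror image form one anchor pair.
    anchors = [(u, ("B", u)) for u in coords]
    return coords_b, edges_b, anchors


# Toy 4-vertex layout standing in for the Zachary karate club network.
coords_a = {0: (1.0, 0.0), 1: (2.0, 1.0), 2: (2.0, -1.0), 3: (3.0, 0.0)}
edges_a = [(0, 1), (0, 2), (1, 3), (2, 3)]
coords_b, edges_b, anchors = build_mirror_network(coords_a, edges_a)
```

Because the links are copied verbatim, the two adjacency matrices coincide, which matches construction step (3).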

The raw data are the anchor nodes of the twin networks, so the alignment task operates on the nodes of both networks, as shown in Figure 1. The specific steps are as follows:

Step 1. Obtain the network embeddings

Two graph convolutional networks (GCNs) serve as feature extractors for network embedding learning: $\mathrm{GCN}^A$ and $\mathrm{GCN}^B$ explore the structural information of network A and network B respectively to obtain the network embeddings. Given the adjacency matrix $M \in \mathbb{R}^{|V|\times|V|}$ of a network, the GCN obtains its $l$-th hidden-layer representation by the forward rule of neural networks:

$$H_l=\sigma(F H_{l-1} W_l)$$

where $H_l \in \mathbb{R}^{|V|\times k_l}$ and $H_{l-1} \in \mathbb{R}^{|V|\times k_{l-1}}$ are the hidden representations of layer $l$ and layer $l-1$ of the GCN, $k_l$ and $k_{l-1}$ are the numbers of neurons in layer $l$ and layer $l-1$, $l \in \{1, 2, \dots, L\}$, and $L$ is the total number of layers. $F$ is the convolution kernel of the GCN, which analyzes the structural information of the network. It is built from the adjacency matrix $M$, the diagonal matrix $D$ of node degrees with $D_{ii}=\sum_j M_{ij}$, and the identity matrix $I$, which represents the self-connections of the vertices and passes each node's own hidden representation from layer to layer. $W_l \in \mathbb{R}^{k_{l-1}\times k_l}$ holds the trainable parameters of layer $l$, which learn the feature extraction function between layers. The activation function $\sigma$ is ReLU. Each GCN thus yields the low-dimensional representation $R = H_L$ of its network.
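The propagation rule $H_l=\sigma(F H_{l-1} W_l)$ can be illustrated with a minimal pure-Python sketch. The exact form of the kernel $F$ is not recoverable from this text, so the example assumes a row-normalized kernel with self-loops (each row of $M + I$ divided by its row sum); the helper names `matmul`, `row_normalized_kernel` and `gcn_layer` are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def row_normalized_kernel(m):
    """Assumed kernel: add self-loops, then divide each row by its sum."""
    n = len(m)
    m_hat = [[m[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    return [[v / sum(row) for v in row] for row in m_hat]


def gcn_layer(f, h_prev, w):
    """One propagation step H_l = ReLU(F H_{l-1} W_l)."""
    z = matmul(matmul(f, h_prev), w)
    return [[max(0.0, v) for v in row] for row in z]


# Path graph 0 - 1 - 2 with 2-dimensional input features.
M = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
H0 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W1 = [[1.0, -1.0], [0.5, 0.5]]
H1 = gcn_layer(row_normalized_kernel(M), H0, W1)
```

Stacking $L$ such layers and reading off the last one gives the representation $R = H_L$ described above.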

Step 2. Supervise the GCN embedding learning with anchor-node information

Anchor-node information supervises and guides the embedding learning of the two graph convolutional networks. The learning of the network embeddings is constrained by maximizing the posterior probability of the anchor nodes in the embedding space:

$$\max_{\theta^A,\theta^B}\ p(\theta^A,\theta^B\mid S)$$

where $S$ is the set of all anchor nodes, and $\theta^A$ and $\theta^B$ are the parameter sets of $\mathrm{GCN}^A$ for network A and $\mathrm{GCN}^B$ for network B. For a pair of anchor nodes $(v_i^A, v_j^B)$, Bayes' theorem gives

$$p(v_i^A, v_j^B)=p(v_j^B\mid v_i^A)\,p(v_i^A)=p(v_i^A\mid v_j^B)\,p(v_j^B)$$

For entity alignment, the two conditional views in this formula are equally important, so $p(v_i^A, v_j^B)$ is approximated by $p(v_j^B\mid v_i^A)\,p(v_i^A\mid v_j^B)$. The likelihood $p(S\mid\theta^A,\theta^B)$ can therefore be replaced by the product of $p(v_j^B\mid v_i^A)\,p(v_i^A\mid v_j^B)$ over all anchor pairs. For brevity, the dependence of these conditional probabilities on $\theta^A$ and $\theta^B$ is omitted from the notation. A Gaussian distribution is used as the prior on the model parameters. Taking the negative logarithm of the posterior then gives the optimization criterion of the model:

$$\mathcal{L}_E = -\sum_{(v_i^A, v_j^B)\in S}\left[\log p(v_j^B\mid v_i^A)+\log p(v_i^A\mid v_j^B)\right]+\lambda\left(\lVert\theta^A\rVert^2+\lVert\theta^B\rVert^2\right)$$

where $p(v_j^B\mid v_i^A)$ is the probability of $v_j^B$ given $v_i^A$, $p(v_i^A\mid v_j^B)$ is the probability of $v_i^A$ given $v_j^B$, $\lVert\theta^A\rVert$ and $\lVert\theta^B\rVert$ are the norms of $\theta^A$ and $\theta^B$, and $\lambda$ is the regularization parameter;

Here the conditional probability of an anchor node may be computed by any suitable method, such as a random forest, multiple independent logistic regressions, or Softmax. In this example it is approximated with a sampled Softmax function, which replaces the set of all vertices with a randomly sampled vertex set; this reduces the computational complexity and improves learning efficiency. The conditional probabilities of an anchor pair are:

$$p(v_j^B\mid v_i^A)=\frac{\exp\left((r_i^A)^{\top} r_j^B\right)}{\sum_{v_c\in C^B}\exp\left((r_i^A)^{\top} r_c^B\right)},\qquad p(v_i^A\mid v_j^B)=\frac{\exp\left((r_j^B)^{\top} r_i^A\right)}{\sum_{v_c\in C^A}\exp\left((r_j^B)^{\top} r_c^A\right)}$$

where $r_i^A$ is the embedding of vertex $v_i^A$, $r_j^B$ is the embedding of vertex $v_j^B$, $r_c^B$ is the embedding of vertex $v_c$ in the vertex set $C^B$, and $r_c^A$ is the embedding of vertex $v_c$ in the vertex set $C^A$. The sampled vertex set $C^B$ is drawn according to a log-uniform distribution over the vertices, and $C^A$ is obtained in the same way.
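A sampled Softmax of this kind can be sketched in a few lines of pure Python. The sketch makes several assumptions that are not confirmed by the text: the score between two vertices is the inner product of their embeddings, the candidate set passed in already contains the true target, and the log-uniform sampler follows the common recipe of exponentiating a uniform draw. All function names are hypothetical.

```python
import math
import random


def dot(u, v):
    """Inner product of two embedding vectors."""
    return sum(a * b for a, b in zip(u, v))


def sampled_softmax_prob(r_src, r_tgt, candidates):
    """Approximate p(target | source) over a sampled candidate set.

    The denominator sums over `candidates` only, instead of over all vertices.
    """
    scores = [dot(r_src, r_c) for r_c in candidates]
    target_score = dot(r_src, r_tgt)
    top = max(scores + [target_score])  # subtract the max for numerical stability
    denom = sum(math.exp(s - top) for s in scores)
    return math.exp(target_score - top) / denom


def log_uniform_sample(n, size, rng):
    """Draw `size` vertex indices in [0, n); small indices are drawn more often."""
    return [min(n - 1, int(math.exp(rng.random() * math.log(n + 1))) - 1)
            for _ in range(size)]


r_i = [1.0, 0.0]                      # embedding of the source vertex
r_j = [1.0, 0.0]                      # embedding of its true counterpart
candidates = [r_j, [0.0, 1.0], [-1.0, 0.0]]
p = sampled_softmax_prob(r_i, r_j, candidates)
indices = log_uniform_sample(10, 100, random.Random(0))
```

A log-uniform sampler favors low indices, so it is normally paired with a vertex ordering in which frequent or high-degree vertices come first; whether the patent orders vertices that way is not stated here.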

Step 3. Adversarial learning with a domain classifier

A domain classifier is introduced and trained adversarially against the feature extractor GCN of the preceding steps, which further strengthens the embedding learning of the feature extractor for entity alignment.

In domain-adversarial learning, the domain classifier plays the role of the discriminator and tries to tell which domain (network A or network B) a given vertex $v\in V^A\cup V^B$ comes from. The feature extractor plays the role of the generator and learns network embeddings with domain-invariant features in order to confuse the discriminator. The domain-adversarial learning of the domain classifier and the feature extractor is carried out as a minimax game:

$$\min_{\theta^D}\ \max_{\theta^A,\theta^B}\ \mathcal{L}_D = -\sum_{v\in V^A\cup V^B}\ \sum_{d\in\{d^A,d^B\}}\mathbb{I}_d(v)\,\log p(d\mid v)$$

where $\theta^D$ is the parameter set of the domain classifier, $\theta^A$ and $\theta^B$ are the parameter sets of $\mathrm{GCN}^A$ and $\mathrm{GCN}^B$, and $d\in\{d^A, d^B\}$ is the domain label of vertex $v$. The indicator function $\mathbb{I}_d(v)$ indicates whether vertex $v$ belongs to domain $d$: its value is 1 if $v$ belongs to $d$ and 0 otherwise. $p(d\mid v)$ is the conditional probability of domain label $d$ given vertex $v$. In this example the domain classifier is implemented as a multi-layer perceptron (MLP) whose last hidden layer feeds a Softmax layer; the normalized output models the conditional probability $p(d\mid v)$ of the domain class of the input vertex $v$. The implementation is not limited to this: those skilled in the art may also compute $p(d\mid v)$ with other methods, such as $k$ independent logistic regressions.
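The classification loss of the domain classifier can be illustrated as a plain cross-entropy over the two domain labels. The sketch below assumes that the classifier has already produced two output scores (logits) per vertex, ordered as network A then network B; the MLP itself is omitted, and the names `softmax` and `domain_loss` are hypothetical.

```python
import math


def softmax(z):
    """Normalize a list of scores into probabilities."""
    top = max(z)
    exps = [math.exp(v - top) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def domain_loss(logits, labels):
    """Cross-entropy: minus the sum over vertices of log p(true domain | v).

    logits: per-vertex classifier outputs [score for d^A, score for d^B]
    labels: per-vertex domain index, 0 for network A and 1 for network B
    """
    loss = 0.0
    for z, d in zip(logits, labels):
        p = softmax(z)
        loss -= math.log(p[d])  # the indicator keeps only the true-domain term
    return loss


# A classifier that cannot tell the domains apart outputs equal scores.
confused = domain_loss([[0.0, 0.0]] * 4, [0, 0, 1, 1])
# A confident, correct classifier yields a much smaller loss.
confident = domain_loss([[5.0, 0.0], [5.0, 0.0], [0.0, 5.0], [0.0, 5.0]],
                        [0, 0, 1, 1])
```

The first case is the situation the feature extractor aims for: when the embeddings carry no domain signal, the best the classifier can do is assign probability 1/2 to each domain, giving a loss of $\log 2$ per vertex.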

During training, the feature extractor must filter out domain-dependent features as far as possible while extracting features for the alignment task. That is, $\mathrm{GCN}^A$ and $\mathrm{GCN}^B$ must maximize not only the posterior probability of the anchor links but also the classification loss of the domain classifier:

$$\min_{\theta^A,\theta^B}\ \mathcal{L}_E-\gamma\,\mathcal{L}_D$$

where $\mathcal{L}_E$ is the objective function of embedding learning and $\mathcal{L}_D$ is the objective function of domain-adversarial learning. The hyperparameter $\gamma$ is the trade-off factor that weights $\mathcal{L}_E$ against $\mathcal{L}_D$.

Steps 1, 2 and 3 constitute the basic DANA learning framework.

Step 4. DANA learning procedure

Data are collected from network A and network B described above. The inputs are the vertex batch size $U$, the anchor-node batch size $Z$, the trade-off factor $\gamma$ and the regularization coefficient $\lambda$. The parameters are the $\mathrm{GCN}^A$ parameters $\theta^A$, the $\mathrm{GCN}^B$ parameters $\theta^B$ and the domain classifier parameters $\theta^D$. The procedure is as follows:

1. Randomly initialize the parameters $\theta^A$, $\theta^B$ and $\theta^D$.

2. Sample a batch of $U$ vertices from the vertex set $V^A$.

3. Sample a batch of $U$ vertices from the vertex set $V^B$.

4. Sample a batch of $Z$ anchor pairs from the anchor link set $S$.

5. Use the Adam optimizer to update the parameters $\theta^A$ and $\theta^B$ to minimize the objective $\mathcal{L}_E-\gamma\,\mathcal{L}_D$.

6. Use the Adam optimizer to update the parameters $\theta^D$ to minimize the objective $\mathcal{L}_D$.

7. Repeat steps 2 to 6 until training converges.

During the iterations, a gradient reversal layer is inserted between the feature extractor and the domain classifier so that the feature extractor parameters $\theta^A$, $\theta^B$ and the domain classifier parameters $\theta^D$ are updated synchronously. This minimizes the loss function and lets training converge more easily and quickly.
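The behavior of a gradient reversal layer can be shown with a small hand-written sketch. It assumes manual forward and backward passes rather than an automatic differentiation framework, and the class name `GradientReversal` and its methods are hypothetical. The layer is the identity in the forward pass and multiplies the incoming gradient by $-\gamma$ in the backward pass.

```python
class GradientReversal:
    """Identity in the forward pass; scales gradients by -gamma going backward."""

    def __init__(self, gamma):
        self.gamma = gamma

    def forward(self, features):
        # The domain classifier sees the embeddings unchanged.
        return list(features)

    def backward(self, grad_from_classifier):
        # The feature extractor receives the reversed, scaled gradient.
        return [-self.gamma * g for g in grad_from_classifier]


grl = GradientReversal(gamma=0.5)
passed = grl.forward([1.0, -2.0])
# Suppose the classification loss has gradient [0.2, -0.4] w.r.t. the embeddings.
grad_to_extractor = grl.backward([0.2, -0.4])
```

With this layer in place, one ordinary gradient-descent step lowers the classification loss for the classifier parameters while raising it, scaled by $\gamma$, for the feature extractor parameters, so both sides of the minimax game are updated in the same backward pass.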

After the iterations converge, the result is output as the embeddings $R^A$ and $R^B$ of network A and network B. These embeddings emphasize domain-invariant features and avoid learning domain-dependent features, which helps improve the accuracy and effectiveness of entity alignment.

Step 5. Optimization 1: extending DANA to directed networks

Starting from the basic DANA framework for undirected networks, the internal structure of the graph convolutional network is modified to capture the out-degree and in-degree distributions of the vertices, giving a DANA framework with a direction-aware graph convolutional network:

As shown in Figure 2, each node is characterized from two views, its out-degree and its in-degree, and the convolution follows the out-degree and in-degree distributions. Given the adjacency matrix $M$ of a directed network, let $H_0$ denote the initial feature vectors of the out-degree view and $\tilde{H}_0$ the initial feature vectors of the in-degree view; the initial-layer inputs $H_0$ and $\tilde{H}_0$ of the GCN are randomly initialized. The hidden representations $H_l$ and $\tilde{H}_l$ of layer $l$ are obtained by the following rules:

$$H_l=\sigma(F H_{l-1} W_l),\qquad \tilde{H}_l=\sigma(\tilde{F}\,\tilde{H}_{l-1}\tilde{W}_l)$$

where $F=D^{-1}(M+I)$, $\tilde{F}=\tilde{D}^{-1}(M^{\top}+I)$, and $M^{\top}$ is the transpose of $M$. $H_l$ and $\tilde{H}_l$ describe the out-degree and in-degree distributions respectively. $D$ and $\tilde{D}$ are the diagonal matrices of node out-degrees and in-degrees, with diagonal elements $D_{ii}=\sum_j M_{ij}$ and $\tilde{D}_{ii}=\sum_j M_{ji}$. $W_l$ and $\tilde{W}_l$ are the trainable parameters of layer $l$ for the out-degree and in-degree views. Under these convolution rules, the last layer of the GCN outputs two low-dimensional representations of the vertices, $R=H_L$ and $\tilde{R}=\tilde{H}_L$. In the subsequent alignment, the two low-dimensional vectors $r_i$ and $\tilde{r}_i$ of vertex $v_i$ are concatenated before alignment is performed, where $r_i$ and $\tilde{r}_i$ are the embeddings of the $i$-th vertex in $R$ and $\tilde{R}$. The DANA framework with the directed-network GCN is thus modified as shown in Figure 3. This optimization better captures the structural features of directed networks and addresses the insufficient mining of directed-network features by standard graph convolutional networks.
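The two-view convolution can be sketched in pure Python as follows. The sketch makes one deliberate change that should be read as an assumption: the normalizing degree is taken as the row sum of $M + I$ (out-degree plus one) rather than of $M$, so that vertices with no outgoing or incoming links do not cause a division by zero. All function names are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(m):
    return [list(col) for col in zip(*m)]


def kernel(m):
    """Row-normalize M + I (assumed variant of D^{-1}(M + I))."""
    n = len(m)
    m_hat = [[m[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    return [[v / sum(row) for v in row] for row in m_hat]


def relu_layer(f, h, w):
    return [[max(0.0, v) for v in row] for row in matmul(matmul(f, h), w)]


def directed_gcn_layer(m, h_out, h_in, w_out, w_in):
    """One layer of the out-degree view (uses M) and the in-degree view (uses M^T)."""
    return (relu_layer(kernel(m), h_out, w_out),
            relu_layer(kernel(transpose(m)), h_in, w_in))


def concat_views(r_out, r_in):
    """Concatenate the two per-vertex vectors before alignment."""
    return [a + b for a, b in zip(r_out, r_in)]


# Directed star: vertex 0 links to vertices 1 and 2.
M = [[0, 1, 1], [0, 0, 0], [0, 0, 0]]
H0 = [[1.0], [2.0], [3.0]]
W = [[1.0]]
R_out, R_in = directed_gcn_layer(M, H0, H0, W, W)
R = concat_views(R_out, R_in)
```

In this toy graph the out-degree view aggregates over vertex 0's targets, while the in-degree view lets vertices 1 and 2 aggregate from vertex 0, so the two halves of each concatenated vector differ.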

Step 6. Optimization 2: weight sharing between parameters

A weight-sharing strategy between the GCNs is further introduced to tighten the coupling between the network embedding spaces. The trainable parameters of $\mathrm{GCN}^A$ and $\mathrm{GCN}^B$ are shared, that is, $W_l^A = W_l^B$ for $l \in \{1, 2, \dots, L\}$. This reduces the number of model parameters and simplifies the DANA framework, as shown in Figure 4.

评价指标Evaluation indicators

下面对本发明对齐任务上的性能进行评价。通过定义Hits@k评估模型来评价DANA在实体对齐任务上的性能：The performance of the present invention on the alignment task is evaluated below. The performance of DANA on the entity alignment task is evaluated by defining the Hits@k evaluation model:

对于一堆锚节点测试样本，其hits@k的计算规则为：For a bunch of anchor node test samples, the calculation rule of hits@k is:

其中，

表示网络A中顶点

在网络B中可能的锚节点候选列表，

由模型的排序列表中的前k个顶点组成。同理对于网络B中的顶点

模型给出对应的锚节点候选列表

in,

represents the vertices in network A

A list of possible anchor node candidates in network B,

Consists of the top k vertices in the model's sorted list. Similarly for vertices in network B

The model gives the corresponding anchor node candidate list

Let S_test be the set of anchor nodes in the test set; the Hits@k of network entity alignment on the test set is then obtained by aggregating the per-pair hits@k over S_test.

To rank the candidate list, DANA scores candidate anchor nodes by cosine similarity.
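The evaluation procedure can be sketched as follows. This is a minimal sketch under stated assumptions: a pair scores a hit in one direction when the true counterpart appears among the top k candidates ranked by cosine similarity, and the two directions (A to B and B to A) are averaged. The averaging, the function names and the toy embeddings are illustrative choices, not taken from the patent's formulas.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def cosine_similarity(u: Vector, v: Vector) -> float:
    """Cosine of the angle between u and v (0.0 if either is a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_k(query: Vector, candidates: Dict[str, Vector], k: int) -> List[str]:
    """Names of the k candidates most similar to the query vector."""
    ranked = sorted(candidates,
                    key=lambda name: cosine_similarity(query, candidates[name]),
                    reverse=True)
    return ranked[:k]


def hits_at_k(emb_a: Dict[str, Vector], emb_b: Dict[str, Vector],
              test_pairs: List[Tuple[str, str]], k: int) -> float:
    """Fraction of directed lookups whose true counterpart is in the top k."""
    hits = 0
    for va, vb in test_pairs:
        hits += int(vb in top_k(emb_a[va], emb_b, k))  # direction A -> B
        hits += int(va in top_k(emb_b[vb], emb_a, k))  # direction B -> A
    return hits / (2 * len(test_pairs))


# Toy embeddings (hypothetical values).
emb_a = {"a1": (1.0, 0.0), "a2": (0.0, 1.0)}
emb_b = {"b1": (0.9, 0.1), "b2": (0.1, 0.9), "b3": (1.0, 0.0)}
test_pairs = [("a1", "b1"), ("a2", "b2")]

print(hits_at_k(emb_a, emb_b, test_pairs, k=1))
print(hits_at_k(emb_a, emb_b, test_pairs, k=2))
```

In the toy data the distractor b3 outranks b1 for the query a1, so that lookup misses at k = 1 and succeeds at k = 2, which shows why the score can only grow as k increases.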

Datasets

The entity alignment experiments of the present invention use four real cross-network datasets, whose statistics are listed in Table 1. The DBLP dataset is a co-author network constructed according to the research field (Machine Learning or Data Mining) of the papers each author has published. Unlike DBLP, whose networks are undirected graphs built from co-authorship relations, the other three datasets, Fq-Tw, Fb-Tw and Db-Wb, are built from social networks.

Table 1. Basic statistics of the alignment experiment data

Experimental results

In the present invention, the DANA framework learns the embedded representation of each network and performs entity alignment across networks. The experiments therefore analyze and evaluate the final entity alignment task.

The experiments compare several variants of DANA proposed by the present invention against several state-of-the-art baseline methods.

The DANA architecture proposed in the present invention adapts the graph convolutional network to directed graph networks and also proposes a weight sharing strategy for the graph convolutional networks. For clarity, the experiments use DANA to denote the basic framework of the proposed generative adversarial network embedded representation learning for the entity alignment task. The suffix "-S" indicates that the model adopts the weight sharing strategy, and the suffix "-SD" indicates that the model combines the weight sharing strategy with the GCN framework based on directed graph networks.

In addition, on the alignment task, the present invention (Ours) is compared with the following algorithms:

(1) MAH: a manifold alignment method that uses hypergraphs to model the higher-order relations of a network. At test time, cosine similarity is used to evaluate the correlation between two vertices in the networks.

(2) PALE: this model uses embedded representation learning to obtain low-dimensional vector representations of the two networks separately, then uses the anchor nodes as supervision to learn a mapping function between the two embedding spaces, so that the mapped low-dimensional vectors of the anchor nodes have the shortest Euclidean distance.

(3) IONE: this algorithm builds three low-dimensional vector representations of each node from three perspectives: an embedded representation based on the node's input context, an embedded representation based on the node's output context, and an embedded representation of the node itself.

(4) ULink: this model models user entity alignment by building a latent user space that captures the attribute features of nodes.

(5) SNNA: building on the core of the PALE method, this algorithm introduces a generative adversarial network into the learning of the projection function between the embedded representation spaces of different networks.

(6) GANE: a generative adversarial network embedded representation learning model designed for link prediction. The model forces anchor nodes to share the same embedded representation in GANE so that the networks overlap. After training, cross-network embedded representations are obtained, and alignment is performed by computing the cosine distance between the low-dimensional vectors of vertices.

Tables 2, 3, 4 and 5 list the experimental results of network entity alignment. The model parameters are optimized on the training set, and the optimal parameter values are then used on the test set. The results are as follows:

(1) Under the different @k settings on all datasets, DANA and its variants clearly outperform the other compared algorithms, which verifies the effectiveness of the DANA framework proposed herein.

(2) The comparison among the DANA variants shows that introducing domain adversarial learning does improve the feature extractor's ability to learn embedded representations for the alignment task. Adding the weight sharing strategy improves alignment performance only slightly, which indirectly indicates that the domain adversarial learning module is already able to bring the embedded representation spaces of the two networks close together. Using the GCN framework based on directed graph networks improves performance further, which means that both the out-degree distribution and the in-degree distribution of a directed graph network carry important information.

(3) Compared with algorithms such as ULink and SNNA, which rely more on attribute information beyond the network structure, the DANA model of the present invention uses a graph convolutional network with a strong ability to explore network structure information, and therefore still obtains relatively robust results even when only network structure information is available.

Table 2. Alignment performance comparison of models on the DBLP dataset

Table 3. Alignment performance comparison of models on the Foursquare-Twitter dataset

Table 4. Alignment performance comparison of models on the Facebook-Twitter dataset

Table 5. Alignment performance comparison of models on the Douban-Weibo dataset

In summary, the entity-alignment-oriented generative adversarial network embedded representation learning method proposed in the present invention outperforms the other compared algorithms in alignment performance, which demonstrates the effectiveness of the method and its applicability to network entity alignment tasks.

To illustrate the content and implementation of the present invention, this specification provides a specific embodiment. The details introduced in the embodiment are not intended to limit the scope of the claims but to aid understanding of the method of the invention. Those skilled in the art will understand that various modifications, changes or substitutions of the steps of the preferred embodiment are possible without departing from the spirit and scope of the invention and the appended claims. The present invention should therefore not be limited to the contents disclosed in the preferred embodiment and the accompanying drawings.

Claims

1. A generative adversarial network embedded representation learning method, characterized in that the method comprises the following steps:

inputting data as the structure information N^A and N^B of network A and network B, N^A = (V^A, E^A), N^B = (V^B, E^B), where V^A and V^B are the sets of vertices in network A and network B, respectively, and E^A and E^B are the sets of link edges of network A and network B, respectively; the accounts of the same user in network A and network B are, respectively, a vertex in network A and a vertex in network B; the domain label of each vertex in network A is d^A, and the domain label corresponding to each vertex in network B is d^B;

such a pair of vertices is a pair of anchor nodes, and S is the anchor node set;

two graph convolutional networks (GCNs), GCN^A and GCN^B, are applied to explore the structure information of network A and network B, respectively; each GCN yields the low-dimensional representation vector R = H_L of the corresponding network; the learning of the network embedded representations is constrained by maximizing the posterior probability of the anchor nodes in the embedded representation space, according to an optimization criterion in which: the two probability terms denote the conditional probability of one vertex of an anchor pair given the other, in each of the two directions; the two norm terms denote the norms of the respective parameter sets; the two parameter sets are those of network A and network B, respectively; and λ is a regularization parameter;

using a domain classifier as the discriminator, domain adversarial learning is performed to obtain network embedded representations with domain-invariant characteristics, according to an expression in which: the domain classifier has its own parameter set; d ∈ {d^A, d^B} is the domain label to which vertex v belongs, with v ∈ {V^A ∪ V^B}; the indicator function indicates whether vertex v belongs to domain d, taking the value 1 if it does and 0 otherwise; and p(d | v) is the conditional probability of label d given vertex v;

the optimization goal of the graph convolutional network GCN, acting as the feature extractor, is to minimize an objective function that maximizes both the posterior probability of the anchor links and the classification loss of the domain classifier, wherein the hyperparameter γ is a trade-off factor that adjusts the relative weight between the two terms;

the optimization goal of the domain classifier is to minimize its own objective function;

a gradient reversal layer is introduced in the learning process, and the two parameter sets are updated and optimized synchronously.

2. The method of claim 1, wherein: the conditional probabilities are defined in terms of the embedded representations of the two vertices of an anchor pair, the embedded representation of a vertex v_c ∈ C^B, and the embedded representation of a vertex v_c ∈ C^A; the sampled vertex sets C^A and C^B are each obtained by sampling vertices according to a log-uniform distribution.

3. The method of claim 1, wherein: the domain classifier is implemented using a multi-layer perceptron (MLP), the last hidden layer of which is connected to a Softmax layer to model the conditional probability p(d | v) of the domain class of the input vertex v.

4. A method according to any one of claims 1 to 3, wherein: the graph convolution network GCN is a graph convolution network based on a directed graph network.

5. The method of claim 4, wherein: the graph convolutional network based on a directed graph network describes the features of each node from both its out-degree view and its in-degree view, and performs the convolution operation according to the distributions of the out-degree view and the in-degree view.

6. A method according to any one of claims 1 to 3, wherein: GCN^A and GCN^B share network parameter weights.