CN110110122A


Info

Publication number: CN110110122A
Application number: CN201810649234.7A
Authority: CN
Inventors: 冀振燕; 姚伟娜; 杨文韬; 皮怀雨
Original assignee: Beijing Jiaotong University
Current assignee: Beijing Jiaotong University
Priority date: 2018-06-22
Filing date: 2018-06-22
Publication date: 2019-08-09

Abstract

Translated from Chinese

The invention relates to an image-text cross-modal retrieval model that combines deep learning with hashing. Conventional deep-learning-based cross-modal hashing methods handle multi-label data by converting it directly into a single-label problem; to overcome this limitation, a deep cross-modal hashing algorithm based on multi-level semantics is proposed. The similarity between data items is defined by the co-occurrence relationships among their labels and serves as the supervisory information for network training. A loss function that jointly considers multi-level semantic similarity and binary similarity is designed and used to train the network, so that feature extraction and hash code learning are unified in one framework and learned end to end. The algorithm makes full use of the semantic correlation between data items and improves retrieval accuracy.

Description


Image-text cross-modal retrieval based on a multi-level semantic deep hashing algorithm

Technical Field

The invention relates to the field of cross-modal retrieval, and in particular to an image-text cross-modal retrieval algorithm that combines deep learning and hashing on the basis of multi-level semantics.

Background

With the development of the mobile Internet and the spread of smartphones, digital cameras, and similar devices, multimedia data on the Internet is growing explosively. In information retrieval, this continued growth of multimedia big data has created demand for cross-modal retrieval applications. However, current mainstream search engines such as Baidu, Google, and Bing return results in only one modality. In addition, as deep learning has achieved a series of breakthroughs in computer vision, natural language processing, and other fields, combining multimedia big data with artificial intelligence has become a shared development trend for both fields. Exploring new cross-modal retrieval models that combine new technologies with new requirements has therefore become one of the pressing challenges in information retrieval.

Traditional cross-modal retrieval usually relies on handcrafted features that depend on domain knowledge, and the "semantic gap" remains a difficulty in the field. Applying deep learning to cross-modal retrieval provides a large body of advanced results in feature learning and representation for bridging the "media gap" between heterogeneous data of different modalities. However, as multimedia data keeps growing, deep feature representations face challenges in storage space and retrieval efficiency because of their high dimensionality, which makes them ill-suited to large-scale multimedia retrieval. Cross-modal retrieval must also cope with real data that carries multiple labels. Most existing solutions convert the problem into a single-label learning problem with binary relevance, so the learned model cannot fully preserve the relationships between data items in the original semantic space, which degrades the final retrieval results.

Summary of the Invention

The purpose of the invention is to overcome the deficiencies of the prior art. It combines deep-learning-based feature representation, considers both the binary similarity and the multi-level semantic similarity of image and text data, and applies a hashing method in which the mapping from data to hash codes is obtained through network training, providing an image-text cross-modal retrieval method with higher retrieval accuracy.

To achieve the above purpose, the technical solution provided by the invention is as follows:

The method is divided into three modules: a deep feature extraction module, a similarity matrix generation module, and a hash code learning module.

The deep feature extraction module uses deep neural networks to extract features from image and text data. It contains two sub-networks, one for image features and one for text features. Image features are extracted with the CNN-F deep convolutional neural network, which consists of 5 convolutional layers and 3 fully connected layers. For text, the data is first modeled as a Bag-of-Words (BOW) vector; on top of this representation, the text feature extraction network is a Multi-Layer Perceptron (MLP) made up of three fully connected layers.

The similarity matrix generation module covers binary similarity matrix generation and multi-level semantic similarity matrix generation, each of which produces a cross-modal similarity matrix. In the binary similarity matrix, the entry for image i and text j is 1 when the two are similar and 0 when they are not. The multi-level semantic similarity matrix is computed from label co-occurrence relationships, so that the more similar labels the category label sets of two samples have, the greater the similarity of the samples; it reaches its maximum of 1 when the two label sets are identical and its minimum of 0 when the labels in the two sets are completely different.

For the hash code learning module, an objective function is designed so that the learned hash codes preserve the semantic information in both the binary similarity matrix and the multi-level semantic similarity matrix.

Optimizing this objective function learns the network parameters and yields the mapping from data to hash codes.

Compared with the prior art, the principle and advantages of this solution are as follows:

This solution combines deep learning with hashing. It overcomes both the limited representational power of traditional handcrafted features and the high dimensionality of deep features, which is unfavorable for data storage and computation. By combining binary similarity with multi-level semantic similarity, it takes full account of the complex similarity relationships between cross-modal data, so that the learned hash codes retain more semantic information and retrieval accuracy improves.

Brief Description of the Drawings

Fig. 1 is the overall framework diagram of the image-text cross-modal retrieval based on the multi-level semantic deep hashing algorithm of the invention.

Detailed Description

The invention is further described below with reference to a specific example.

Throughout, the discussion uses the image and text modalities as the example.

The invention provides an image-text cross-modal retrieval method based on a multi-level semantic deep hashing algorithm (Deep Multi-Level Semantic Hashing for Cross-modal Retrieval, DMSH). As shown in Fig. 1, it comprises three modules: a deep feature extraction module, a similarity matrix generation module, and a hash code learning module.

Table 1. Image feature extraction network structure

The deep feature extraction module uses deep neural networks to extract features from image and text data. Image features are extracted with the CNN-F deep convolutional neural network, configured as shown in Table 1. For text, the data is first modeled as a bag-of-words vector; on top of this representation, the text feature extraction network is a multi-layer perceptron made up of three fully connected layers, configured as shown in Table 2.

In Table 1, the conv1 layer uses a stride of 4 and the conv2 to conv5 layers use a stride of 1. "pad" denotes padding: the border of the image is padded, usually so that the output of the convolution has the same size as the input. LRN denotes Local Response Normalization, which imitates the lateral inhibition of biological neurons by creating competition among the activities of neighboring neurons, amplifying larger responses and suppressing neurons with smaller responses, which improves the generalization ability of the model. Max pooling takes the maximum value within a window of a given size, which shrinks the feature maps, reduces the number of model parameters, and helps prevent overfitting. Dropout regularization randomly drops a number of neurons during training, which also helps prevent overfitting.
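The effect of stride and padding on layer output size can be checked with the standard convolution size formula. The sketch below is only an illustration: the kernel sizes and input sizes are hypothetical values chosen for the example, since the contents of Table 1 are not reproduced in this text.

```python
def conv_output_size(n: int, kernel: int, stride: int, pad: int) -> int:
    """Output length along one spatial dimension of a convolution.

    n      : input length along that dimension
    kernel : kernel length
    stride : step between successive kernel positions
    pad    : number of padded cells added on each side
    """
    return (n + 2 * pad - kernel) // stride + 1


# Hypothetical settings, for illustration only (not taken from Table 1).
# A stride-4 layer shrinks the feature map sharply:
print(conv_output_size(224, kernel=11, stride=4, pad=0))  # 54
# A stride-1 layer with suitable padding keeps the size unchanged:
print(conv_output_size(27, kernel=5, stride=1, pad=2))    # 27
```

With stride 1, choosing pad = (kernel - 1) / 2 for an odd kernel keeps the output the same size as the input, which is the "same size" behavior described above.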

Table 2. Text feature extraction network

In Table 2, the first hidden layer is a fully connected layer whose length equals that of the input bag-of-words vector, the second hidden layer is a 4096-dimensional fully connected layer, and the third layer is a fully connected layer whose length equals the hash code length. The output of the network is the text feature vector.
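The layer sizes described above can be sketched as a plain NumPy forward pass. This is a minimal illustration under stated assumptions, not the patented implementation: the function names, the random initialization, and the ReLU activations on the two hidden layers are assumptions, because the text specifies only the layer widths.

```python
import numpy as np


def init_text_mlp(vocab_size, code_length, hidden=4096, seed=0):
    """Create weights for the three fully connected layers.

    Widths follow the description: vocab_size -> vocab_size -> 4096 -> code_length.
    The small Gaussian initialization is an assumption.
    """
    rng = np.random.default_rng(seed)
    sizes = [vocab_size, vocab_size, hidden, code_length]
    return [
        (rng.normal(0.0, 0.01, size=(m, n)), np.zeros(n))
        for m, n in zip(sizes[:-1], sizes[1:])
    ]


def text_mlp_forward(bow, params):
    """Map bag-of-words vectors (one per row) to text feature vectors."""
    h = bow
    for k, (W, b) in enumerate(params):
        h = h @ W + b
        if k < len(params) - 1:
            h = np.maximum(h, 0.0)  # ReLU on hidden layers (assumed)
    return h  # real-valued features; binarization happens later via sign


params = init_text_mlp(vocab_size=50, code_length=16)
features = text_mlp_forward(np.ones((3, 50)), params)
print(features.shape)  # (3, 16)
```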

The similarity matrix generation module covers binary similarity matrix generation and multi-level semantic similarity matrix generation, each of which produces a cross-modal similarity matrix. In the binary similarity matrix, the entry for image i and text j is 1 when the two are similar and 0 when they are not. Similarity between data of different modalities is measured through category labels: image i and text j are considered similar if they have category labels in common, and dissimilar otherwise.
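The binary similarity rule above can be sketched with multi-hot label matrices. The function name and the multi-hot encoding are assumptions made for the example, and "labels in common" is read here as "at least one shared label".

```python
import numpy as np


def binary_similarity(img_labels, txt_labels):
    """Cross-modal binary similarity matrix.

    img_labels : (n_images, n_labels) multi-hot array
    txt_labels : (n_texts,  n_labels) multi-hot array
    Entry (i, j) is 1 if image i and text j share at least one label, else 0.
    """
    shared = np.asarray(img_labels) @ np.asarray(txt_labels).T  # shared-label counts
    return (shared > 0).astype(np.float32)


img = np.array([[1, 0, 1],
                [0, 1, 0]])
txt = np.array([[1, 0, 0],
                [0, 0, 1],
                [0, 1, 0]])
print(binary_similarity(img, txt))
# [[1. 1. 0.]
#  [0. 0. 1.]]
```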

The multi-level semantic similarity matrix is computed with a method based on the co-occurrence of category labels; the generation procedure is described below.

For two category labels t_i and t_j, a label similarity s(t_i, t_j) is defined as a function of their semantic distance d(t_i, t_j). The distance is computed from three counts over the training set: the number of occurrences of t_i, the number of occurrences of t_j, and the number of times t_i and t_j occur together. N_c denotes the number of all labels in the training set.
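The occurrence and co-occurrence counts that feed the label distance can be gathered in one matrix product. The sketch below assumes the training labels are given as a multi-hot matrix (one row per sample); the function name is hypothetical, and the distance formula itself is not reproduced here because it is not preserved in this text.

```python
import numpy as np


def label_cooccurrence(labels):
    """Count label occurrences and pairwise co-occurrences.

    labels : (n_samples, n_labels) multi-hot array of 0/1 entries
    Returns (counts, co) where
      counts[i] = number of samples tagged with label i
      co[i, j]  = number of samples tagged with both label i and label j
    """
    labels = np.asarray(labels, dtype=np.int64)
    co = labels.T @ labels
    counts = np.diag(co).copy()
    return counts, co


train_labels = np.array([[1, 1, 0],
                         [1, 0, 0],
                         [1, 1, 1]])
counts, co = label_cooccurrence(train_labels)
print(counts)    # [3 2 1]
print(co[0, 1])  # 2
```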

From definition (2), s(t_i, t_j) ∈ [0, 1], and the more often two labels co-occur, the greater their similarity. Based on the label similarity s, a similarity between two samples D_m and D_n is then defined.

Here t_m and t_n denote the category label sets of samples D_m and D_n, and |t_m| and |t_n| denote the numbers of labels in t_m and t_n. By this definition, the more similar labels the category label sets of two samples have, the greater the similarity of the samples; the similarity reaches its maximum of 1 when the two label sets t_m and t_n are identical, and takes its minimum of 0 when every label in t_m is dissimilar to every label in t_n. The multi-label semantic similarity matrix can therefore serve as supervisory information for hash code learning. Compared with the binary similarity matrix, it extends cross-modal similarity from the discrete set {0, 1} to the continuous interval [0, 1] and so retains more of the rich semantic information implicit in the category labels.
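The boundary behavior described above (1 for identical label sets, 0 when all labels are dissimilar) can be illustrated with a simple aggregation over a label-similarity matrix. The aggregation used here, a symmetric average of best matches, is a swapped-in assumption chosen only because it satisfies those stated properties; it is not the formula from the original, which is not preserved in this text. The function name and the example matrix `s` are likewise hypothetical.

```python
import numpy as np


def sample_similarity(t_m, t_n, s):
    """Similarity of two samples from their (non-empty) label index lists.

    t_m, t_n : lists of label indices for samples D_m and D_n
    s        : label-similarity matrix with entries in [0, 1] and s[i, i] == 1
    Assumed aggregation: average, in both directions, of each label's
    best match in the other label set.
    """
    block = s[np.ix_(t_m, t_n)]          # |t_m| x |t_n| pairwise similarities
    m_to_n = block.max(axis=1).mean()    # averages over the |t_m| labels
    n_to_m = block.max(axis=0).mean()    # averages over the |t_n| labels
    return 0.5 * (m_to_n + n_to_m)


# Hypothetical label similarities: labels 0 and 1 are related, label 2 is not.
s = np.array([[1.0, 0.5, 0.0],
              [0.5, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
print(sample_similarity([0, 1], [0, 1], s))  # 1.0  (identical label sets)
print(sample_similarity([0, 1], [2], s))     # 0.0  (all labels dissimilar)
print(sample_similarity([0], [1], s))        # 0.5  (partially similar)
```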

In the hash code learning module, F^(g)_*i denotes the learned image feature of sample D_i, that is, the output of the image feature extraction network, and F^(x)_*j denotes the learned text feature of sample D_j, that is, the output of the text feature extraction network. Each of the two deep networks has its own set of parameters.

To make the learned hash codes preserve the semantic information in the binary similarity matrix, a sigmoid cross-entropy loss function, equation (5), is adopted. To keep training stable and avoid numerical overflow, the implementation uses an equivalent form of equation (5).

Building on the binary semantic loss above, a multi-level semantic loss is further introduced so that the learned model preserves the richer semantic information contained in the multi-level semantic similarity matrix. The same equivalent form of the sigmoid cross-entropy loss is used here.
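A common overflow-free way to evaluate a sigmoid cross-entropy is shown below as a sketch. It relies on the identity log(1 + e^θ) = max(θ, 0) + log(1 + e^(-|θ|)), so no large positive exponent is ever computed. The pairwise logit θ_ij = ½⟨f_i, f_j⟩ is an assumption borrowed from common deep cross-modal hashing formulations, the function names are hypothetical, and features are stored one per row here for convenience. The same function works for binary targets in {0, 1} and for multi-level targets in [0, 1].

```python
import numpy as np


def pairwise_logits(F_img, F_txt):
    """theta[i, j] = 0.5 * <image feature i, text feature j> (assumed form)."""
    return 0.5 * F_img @ F_txt.T


def sigmoid_cross_entropy(theta, S):
    """Element-wise -[S*log(sig(theta)) + (1-S)*log(1-sig(theta))].

    Algebraically this equals log(1 + exp(theta)) - S*theta; the form
    below evaluates it without overflowing for large |theta|.
    """
    return np.maximum(theta, 0.0) - S * theta + np.log1p(np.exp(-np.abs(theta)))


theta = np.array([-1000.0, 0.0, 1000.0])
S = np.array([0.0, 0.5, 1.0])
print(sigmoid_cross_entropy(theta, S))  # finite everywhere; middle entry is log(2)
```

A naive evaluation of log(1 + exp(1000)) would overflow to infinity, which is the failure the rewritten form avoids.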

Combining these terms gives the complete form of the objective function, equation (8). In it, F^(g) and F^(x) denote the learned feature vectors of the images and texts, which carry the semantic information of the similarity matrices; C^(g) and C^(x) denote the hash codes of the images and texts; and sign(·) denotes the sign function, applied as in equation (9). The semantic information in F^(g) and F^(x) is passed to C^(g) and C^(x) through the sign function. ‖·‖_F denotes the Frobenius norm, E denotes a vector whose elements are all 1, and μ, ρ, and τ are hyperparameters.

C^(g) = sign(F^(g))    (9)

C^(x) = sign(F^(x))    (10)
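Equations (9) and (10) can be sketched in NumPy as follows. One detail needs care: `np.sign` returns 0 for an input of exactly 0, which is not a valid ±1 code bit, so this sketch maps zeros to +1. That tie-breaking convention and the function name are assumptions, not something stated in the source.

```python
import numpy as np


def to_hash_code(features):
    """Binarize real-valued features into +1/-1 hash code bits.

    Values >= 0 map to +1 and values < 0 map to -1 (zero handling assumed).
    """
    return np.where(np.asarray(features) >= 0, 1, -1).astype(np.int8)


print(to_hash_code(np.array([0.3, -2.0, 0.0])))  # [ 1 -1  1]
```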

The first two terms of the objective function are negative log-likelihoods of the cross-modal similarities. Optimizing them ensures that the larger the similarity entry for image i and text j, the more similar F^(g)_*i and F^(x)_*j are, and the smaller the entry, the less similar they are. Optimizing terms 1 and 2 therefore ensures that the image and text features learned by the networks preserve the cross-modal similarity of the original semantic space.

The third term of the objective function is a regularization term. Optimizing it yields the hash codes C^(g) and C^(x) of the images and texts while preserving the similarity between the network features F^(g)_*i and F^(x)_*j. Because F^(g)_*i and F^(x)_*j preserve the cross-modal similarity of the semantic space, the resulting hash codes preserve it as well.

Optimizing the fourth term balances, for each bit of the final hash codes, the numbers of "1" and "-1" values over the whole training set, so that each bit position takes "1" for half of the samples and "-1" for the other half. This constraint maximizes the information carried by each bit of the hash code.
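The bit-balance idea can be checked numerically. The sketch below assumes the balance term has the form ‖F E‖² with E an all-ones vector, which is suggested by the symbols defined for the objective but is not confirmed by a formula in this text; codes are stored one per row here, so multiplying by E corresponds to summing each column. The function name is hypothetical.

```python
import numpy as np


def bit_balance_penalty(codes):
    """Sum over bit positions of (number of +1 minus number of -1) squared.

    codes : (n_samples, code_length) array of +1/-1 values (or real features)
    The penalty is 0 exactly when every bit is +1 for half of the samples.
    """
    bit_sums = np.asarray(codes, dtype=np.float64).sum(axis=0)
    return float(np.sum(bit_sums ** 2))


balanced = np.array([[1, -1],
                     [-1, 1]])
skewed = np.array([[1, 1],
                   [1, -1]])
print(bit_balance_penalty(balanced))  # 0.0
print(bit_balance_penalty(skewed))    # 4.0
```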

Experiments show that, during training, forcing the image and the text from the same data point to take exactly the same hash code improves network performance. The constraint C^(g) = C^(x) = C is therefore added to the original objective function to obtain the final objective function.

By optimizing this objective function, the network learns the feature extraction parameters and the hash code representation at the same time; feature learning and hash code learning are unified in one deep learning framework, achieving end-to-end learning.

In the testing and application stage, any single-modality input, whether an image or a text, can be passed through the trained network to generate its binary code vector, that is, its hash code.

Specifically, the image modality g_i of data point D_i is fed into the network, and its hash code is generated by forward propagation through the image network followed by the sign function.

Similarly, for the text modality x_j of data point D_j, the corresponding hash code is generated by forward propagation through the text network followed by the sign function.

The proposed DMSH retrieval model can therefore take a query in either modality, image or text, and return the top k most similar results from a database of the other modality. During retrieval, the distance between the hash code of the query and each hash code stored in the database is computed first; the k hash codes with the smallest distances are then returned, and the k data items they correspond to are the final retrieval results.
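The retrieval step can be sketched for ±1 codes using Hamming distance. For two codes of length c with entries in {+1, -1}, the dot product equals c minus twice the number of differing bits, so the Hamming distance is (c - dot) / 2. The function names are hypothetical, and the use of Hamming distance follows the claims' reference to Hamming space.

```python
import numpy as np


def hamming_distances(query, database):
    """Hamming distance from one +1/-1 query code to every database code.

    query    : (code_length,) array of +1/-1
    database : (n_items, code_length) array of +1/-1
    """
    c = database.shape[1]
    dots = database.astype(np.int32) @ query.astype(np.int32)
    return (c - dots) // 2


def top_k(query, database, k):
    """Indices and distances of the k database codes nearest to the query."""
    d = hamming_distances(query, database)
    order = np.argsort(d, kind="stable")[:k]  # stable sort keeps index order on ties
    return order, d[order]


db = np.array([[1, 1, 1, 1],
               [1, -1, 1, -1],
               [-1, -1, -1, -1]])
q = np.array([1, 1, 1, -1])
idx, dist = top_k(q, db, k=2)
print(idx, dist)  # [0 1] [1 1]
```

Because the codes are short binary vectors, this comparison needs only integer arithmetic, which is the storage and speed benefit the method aims for over comparing high-dimensional real-valued features.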

Claims


1. An image-text cross-modal retrieval method based on a multi-level semantic deep hashing algorithm, characterized in that: the overall framework comprises three modules, namely a deep feature extraction module, a similarity matrix generation module, and a hash code learning module; two deep neural networks are used to extract image features and text features respectively, and feature learning and hash code learning are unified in one framework; multi-level semantic supervisory information based on label co-occurrence is introduced to guide the whole training process, so that the resulting binary codes not only preserve the basic similar/dissimilar relationships of the original sample space but also distinguish degrees of similarity between samples, preserving more of the high-level semantics between samples and improving retrieval accuracy; structurally, the network is trained under the constraint that "images and texts that are similar in the semantic space have similar hash codes in the Hamming space", and the hash codes are taken directly as the output of the network, achieving end-to-end learning and ensuring that the learned features suit the specific retrieval task.

2. The image-text cross-modal retrieval method based on a multi-level semantic deep hashing algorithm according to claim 1, characterized in that: the overall framework consists of three parts, namely the deep feature extraction module, the similarity matrix generation module, and the hash code learning module, and data in the original space is mapped to binary code vectors of the uniform form "+1/-1" in the Hamming space, reducing storage space and improving computational efficiency.

3. The image-text cross-modal retrieval method based on a multi-level semantic deep hashing algorithm according to claim 1, characterized in that: the deep feature extraction module uses a different deep neural network for image data and for text data to extract the semantic features of the two modalities, using an improved CNN-F network for image data and a multi-layer perceptron network for text data.

4. The image-text cross-modal retrieval method based on a multi-level semantic deep hashing algorithm according to claim 1, characterized in that: the similarity matrix generation module generates the binary similarity matrix according to whether data of different modalities have labels in common, and generates the multi-level semantic similarity matrix according to the degree of similarity between the labels of data of different modalities, retaining more of the semantic information provided by the labels.

5. The image-text cross-modal retrieval method based on a multi-level semantic deep hashing algorithm according to claim 1, characterized in that: the hash code learning module trains the network with an objective function designed to preserve both the binary similarity information and the multi-level semantic similarity information of the data in the original semantic space, learning the mapping from the feature space to the Hamming space.