Technical Field
The present invention relates to image retrieval and the automatic representation of visual content, and in particular to a method for automatically obtaining region weights for image content.
Background Art
In computer vision and multimedia, and especially with the rapid growth of e-commerce imagery, image retrieval has become an important and challenging task in both academia and industry. Better image representation has been the main driving force of research in this field in recent years. When people look at a product image, their attention generally falls on the region containing the product; by strengthening the features at that location, the product information can be highlighted while the surrounding noise is suppressed, and the resulting image features yield a clear performance gain for retrieval and matching. Consider, for example, a casual mobile-phone photo of clothes someone is wearing. Matching such a noisy real-world picture against an e-commerce dataset to find the same garment faces several technical difficulties: the clothes in the real photo may be deformed, the surroundings may be cluttered, and the image quality may be poor, whereas the corresponding pictures in the e-commerce dataset are generally professionally produced commercial images. If a street photo is used as the query, the large visual discrepancy makes it hard to match the same-style e-commerce images.
A wide variety of image description methods have emerged over the past few years. In particular, with the recent rise of deep learning in machine intelligence, representing images with features from deep networks has become the mainstream direction. Initially, the fully connected layers of a deep network were generally used as the image representation; more recently, researchers have begun to explore convolutional-layer features for this purpose. Previous work proposed adding Gaussian weights to deep features, on the assumption that the object in a typical picture lies in the central region, so that the importance of the center can be strengthened by a Gaussian weighting. How to bridge the gap between query and database images by extracting better, more robust features is the research direction of the present invention.
Main references:
(1) A. Babenko and V. Lempitsky. "Aggregating local deep features for image retrieval," International Conference on Computer Vision (ICCV), pp. 1269-1277, 2015;
(2) K. He, X. Zhang, S. Ren, and J. Sun. "Spatial pyramid pooling in deep convolutional networks for visual recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(9):1904-1916, 2015;
(3) H. Jégou, M. Douze, C. Schmid, and P. Pérez. "Aggregating local descriptors into a compact image representation," Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3304-3311, 2010;
(4) C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. "Going deeper with convolutions," Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1-9, 2015;
(5) Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. "Caffe: Convolutional architecture for fast feature embedding," ACM International Conference on Multimedia, pp. 675-678, 2014;
(6) F. Perronnin and C. Dance. "Fisher kernels on visual vocabularies for image categorization," Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1-8, 2007;
(7) F. Perronnin, Y. Liu, J. Sánchez, and H. Poirier. "Large-scale image retrieval with compressed Fisher vectors," Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3384-3391, 2010.
Summary of the Invention
Building on the prior art, the present invention proposes a method for generating image convolution features based on top-layer weights. A convolutional weight map is extracted with a convolutional neural network and used to automatically weight the target object in the image: the network extracts features from the low layers up to the high layers, and the weights obtained from the top convolutional layer are applied to these low-to-high-layer features, yielding a more accurate description of the image target.
The method for generating image convolution features based on top-layer weights proposed by the present invention comprises the following steps:
Step 1. Download images from the Internet and annotate each image, forming <image, category> pairs that constitute the training set. The images are represented as the set IMG = {Image_i}, i = 1, ..., N_d, where N_d is the total number of images in IMG. Each image corresponds to one category; the categories are represented as the set GroundTruth = {Image_id_i, class_id_i}. The image set IMG together with the GroundTruth of each image forms the final dataset DataSet = {IMG, GroundTruth};
Step 2. Train a convolutional neural network model on an image dataset for the image classification task, either by using a model already trained on the ILSVRC dataset or by retraining a model on the downloaded ILSVRC dataset;
Step 3. Use the trained convolutional neural network model to extract deep convolutional features from different layers of the image:
GoogLeNet(X)_inception(3a), GoogLeNet(X)_inception(3b),
GoogLeNet(X)_inception(4a), GoogLeNet(X)_inception(4e)
Here NET(X)_layer denotes such a feature: a three-dimensional matrix C × W × H, where C is the number of channels of the convolutional-layer feature, W is its width, and H is its height.
Step 4. Compute the top-layer convolutional weight map from the extracted deep convolutional features: take the convolutional feature of the last convolutional layer of GoogLeNet, GoogLeNet(X)_inception(5b), and apply average pooling to it along the channel direction, as follows:
W(x, y) = (1/N) Σ_{c=1}^{N} F_c(x, y)

where N is the number of channels of the convolutional feature (N = 1024 in this scheme), c indexes the c-th channel of the convolutional feature, F_c(x, y) is channel c of GoogLeNet(X)_inception(5b), and W is the resulting top-layer convolutional weight map;
Step 5. Apply the top-layer convolutional weight map to the convolutional features from the shallow layers up to the high layers to obtain new convolutional features. Specifically: take the Inception(3a), Inception(3b), Inception(4a), and Inception(4e) features of GoogLeNet extracted in step 3; max-pool each of them down to the same spatial size as the top convolutional layer Inception(5b) and concatenate them; then multiply the top-layer convolutional weights obtained in step 4 with the concatenated convolutional features to obtain the weighted deep convolutional features, according to the formula:
F_W(x, y) = W(x, y) × F(x, y)
where W is the top-layer convolutional weight map produced in step 4, F is the convolutional feature after the pooling and concatenation in step 5, F_W is the weighted feature obtained by multiplying the top-layer weight W element-wise with the corresponding positions of the multi-layer concatenated feature F, and (x, y) are the coordinates within the convolutional feature;
Step 6. Pool the weighted convolutional features into a vector to obtain the image's depth feature with the convolutional weights applied: take the weighted convolutional feature F_W obtained in step 5 and apply max pooling, yielding the final target vector as the final representation of the image:
Image_represent = MaxPooling(F_W)
where the pooling window size is 7 × 7;
Step 7. Extract the above top-layer-weighted features for the query image, giving QueryImage_represent, and for each image of the evaluation dataset, giving EvaluationImage_represent, and compute the similarity distance Distance between the two:
Distance = similar_Function(QueryImage_represent, EvaluationImage_represent)
where similar_Function is a function measuring the distance between two features;
Finally, sort the images of the evaluation set by their similarity distance Distance to the query image and perform the final similarity matching to obtain the final retrieval result.
The present invention has the following beneficial technical effects:
(1) Compared with traditional image description methods, the present invention understands the target region and the visual attention region of an image more accurately;
(2) It applies not only when the product lies in the central region of the image but when the product lies at any position. The new top-layer weight features are more effective and more accurate than the earlier Gaussian weights, and they also bring a substantial performance gain to the final image retrieval task.
(3) It guarantees the robustness and accuracy of the image features; experiments show that the new top-layer weight features are more effective and more accurate than the earlier Gaussian weights, and considerably improve the retrieval task to which they are finally applied;
(4) Obtaining more accurate image features yields a large benefit for the image retrieval task.
Brief Description of the Drawings
Fig. 1 is a schematic diagram of the structure of one inception unit of GoogLeNet;
Fig. 2 is the flow of the method of the present invention for generating image convolution features based on top-layer weights;
Fig. 3 is a schematic diagram of original images and the visualizations obtained from the top-layer convolutional weights;
Fig. 4 is a schematic diagram of retrieval results obtained by the present invention.
Detailed Description of the Embodiments
The present invention is described in further detail below with reference to the accompanying drawings.
The top-layer convolutional weights proposed by the present invention are not based on artificial prior knowledge but are learned automatically by a convolutional neural network. The method therefore applies not only when the product lies in the central region of the image but when it lies at any position. In the present invention, the deep convolutional network GoogLeNet is structured as a forward pass from the input layer to the loss layers. The unit nodes of the network fall into four types, each representing one network layer: the first type represents the input layer and the loss layer; the second type represents the convolutional layers and the fully connected layers; the third type represents the pooling layers; and the fourth type represents the concatenation (concat) layers. GoogLeNet has 22 convolutional and fully connected layers and 3 loss layers, whereas other networks generally have only one loss layer. The present invention sets three loss layers in order to prevent vanishing gradients during training.
Fig. 1 shows a small unit of GoogLeNet called an inception unit. The previous layer replicates its features into several copies, which are processed in parallel by the branches {1×1 convolution}, {1×1 convolution, 3×3 convolution}, {1×1 convolution, 5×5 convolution}, and {3×3 max pooling, 1×1 convolution}, where 1×1, 3×3, and 5×5 denote the size of the convolution kernel or of the pooling window; "convolution" denotes a convolution operation and "pooling" a pooling operation.
Fig. 3 visualizes the top-layer convolutional weights proposed by the present invention: the left side shows six original images and the right side the visualizations obtained from the top-layer convolutional weights. The top-layer weights shown in this figure are taken from Inception(5b) of GoogLeNet, and the weight map is 7 × 7 in size. The visualized weights exhibit a clear positional correspondence with the objects in the original images, and the weights at the object locations are strengthened.
Fig. 4 shows the retrieval results obtained by applying the top-layer weighting strategy to both the query and the evaluation images. The leftmost picture is the query image; the ten pictures on the right are the retrieval results, i.e., the ten images most similar to the query. A green mark above a retrieved image indicates a correct match, a red mark an incorrect one. Note: these images come from the dataset of the Alibaba image search contest.
Here one image is selected as the image whose features are to be generated, and the top-layer convolutional weights are used to produce its deep feature representation. Fig. 2 shows the flow of the method of the present invention for generating image convolution features based on top-layer weights, described in detail as follows:
Step 1. Download images from the Internet and annotate each image, forming <image, category> pairs that constitute the training set.
(1) Download a common image classification dataset (ILSVRC) from the Internet, forming the image set IMG = {Image_i}, i = 1, ..., N_d, where N_d is the total number of images in IMG;
(2) Each image has a corresponding category. For the image set, the category of each image is described as GroundTruth = {Image_id_i, class_id_i}, where i = 1, ..., N_d, N_d is the total number of images in the dataset, and class_id_i = 1, ..., N gives the category of each image. ILSVRC uses a 1000-class dataset, so N = 1000.
(3) The existing image set IMG and the GroundTruth of each image together form the final dataset DataSet = {IMG, GroundTruth}.
Step 2. Train the convolutional neural network model on the existing image dataset for the image classification task.
(1) Select the deep convolutional network model to use (GoogLeNet, VGG16, AlexNet). The main layer structure of the GoogLeNet model is defined as follows:
GoogLeNet = {Conv, Conv, Inception(3a), Inception(4a), Inception(4b),
Inception(4c), Inception(4e), Inception(5a), Inception(5b), FC}
The main layer structure of the VGG16 model is defined as follows:
VGG16 = {Conv1_1, Conv1_2, Conv2_1, Conv2_2, Conv3_1, Conv3_2,
Conv3_3, Conv4_1, Conv4_2, Conv4_3, Conv5_1, Conv5_2, Conv5_3,
FC6, FC7, FC8}
Download an existing GoogLeNet model already trained on the ILSVRC dataset (GoogLeNet.caffemodel), or retrain the model with the downloaded ILSVRC dataset; alternatively, download an existing model Model_ILSVRC trained on ILSVRC, or retrain the model on a dataset built from Alibaba's e-commerce data or other e-commerce data. In either case, the last fully connected layer is changed to the number of categories of the current dataset.
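The retraining of this step can be sketched with pycaffe as follows. This is a minimal sketch rather than the exact procedure of the invention: the file names solver.prototxt and GoogLeNet.caffemodel are placeholders for a prepared solver definition and the downloaded weights, and the last fully connected layer is assumed to have already been renamed and resized in the train prototxt.

```python
# Hedged sketch of step 2: fine-tuning GoogLeNet with pycaffe.
# 'solver.prototxt' and 'GoogLeNet.caffemodel' are placeholder file names.
import caffe

caffe.set_mode_gpu()                          # or caffe.set_mode_cpu()
solver = caffe.SGDSolver('solver.prototxt')   # points at the train/val prototxt
solver.net.copy_from('GoogLeNet.caffemodel')  # layers with matching names take
                                              # the ILSVRC weights; the renamed
                                              # last FC layer starts fresh
solver.solve()                                # run the full training schedule
```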
Step 3. Use the trained convolutional neural network model to extract the deep convolutional features of the image.
(1) After the trained deep convolutional network model is obtained, deep features are extracted for the input image X from the following layers:
GoogLeNet(X)_inception(3a), GoogLeNet(X)_inception(3b),
GoogLeNet(X)_inception(4a), GoogLeNet(X)_inception(4e)
Here NET(X)_layer denotes such a feature: a three-dimensional matrix C × W × H, where C is the number of channels of the convolutional-layer feature, W is its width, and H is its height.
The dimensions of GoogLeNet(X)_inception(3a) are 256 × 28 × 28,
the dimensions of GoogLeNet(X)_inception(3b) are 480 × 28 × 28,
the dimensions of GoogLeNet(X)_inception(4a) are 512 × 14 × 14,
and the dimensions of GoogLeNet(X)_inception(4e) are 832 × 14 × 14.
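This extraction step can be sketched with pycaffe as follows, assuming the standard BVLC GoogLeNet deploy definition; the blob names such as 'inception_3a/output' and the file names are assumptions and may differ for other model files.

```python
# Hedged sketch of step 3: extracting multi-layer convolutional features
# with pycaffe. File and blob names are placeholders taken from the
# standard BVLC GoogLeNet definition.
import caffe
import numpy as np

net = caffe.Net('deploy.prototxt', 'GoogLeNet.caffemodel', caffe.TEST)

image = np.random.rand(3, 224, 224).astype(np.float32)  # stand-in for a
net.blobs['data'].reshape(1, 3, 224, 224)               # mean-subtracted image X
net.blobs['data'].data[0] = image
net.forward()

layers = ['inception_3a/output', 'inception_3b/output',
          'inception_4a/output', 'inception_4e/output',
          'inception_5b/output']
feats = {name: net.blobs[name].data[0].copy() for name in layers}
for name in layers:
    print(name, feats[name].shape)  # e.g. (256, 28, 28) for inception_3a
```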
Step 4. Compute the top-layer convolutional weight map from the extracted deep convolutional features.
(1) First, following the feature extraction of step 3, extract the top-layer convolutional feature of image X: the convolutional feature of the last convolutional layer of GoogLeNet, GoogLeNet(X)_inception(5b).
(2) Then apply average pooling to the GoogLeNet(X)_inception(5b) feature along the channel direction, as follows:
W(x, y) = (1/N) Σ_{c=1}^{N} F_c(x, y)

where N is the number of channels of the convolutional feature (N = 1024 in this scheme), c indexes the c-th channel, F_c(x, y) is channel c of GoogLeNet(X)_inception(5b), and W is the resulting top-layer convolutional weight map. In this scheme the size of W is 7 × 7.
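As a minimal sketch, this channel-wise averaging is a single NumPy reduction; a random array stands in for the extracted inception(5b) feature:

```python
import numpy as np

F5b = np.random.rand(1024, 7, 7)  # stand-in for GoogLeNet(X)_inception(5b)
W = F5b.mean(axis=0)              # W(x, y) = (1/N) * sum_c F_c(x, y); shape (7, 7)
```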
Step 5. Apply the top-layer convolutional weight map to the convolutional features from the shallow layers to the high layers to obtain new convolutional features.
(1) First, extract several layers of convolutional features from the low layers to the high layers. Following step 3, these are the Inception(3a), Inception(3b), Inception(4a), and Inception(4e) features of GoogLeNet.
(2) Max-pool all of these features down to the same spatial size as the top convolutional layer Inception(5b), namely 7 × 7, and then concatenate the features of these different layers along the channel dimension to obtain the feature F of size 2080 × 7 × 7, as sketched below.
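A minimal NumPy sketch of this pooling and concatenation, under the feature sizes listed in step 3; random arrays stand in for the extracted feature maps:

```python
import numpy as np

def block_max_pool(f, out=7):
    """Max-pool a C x S x S feature map down to C x out x out.
    Assumes S is a multiple of out (e.g. 28 -> 7 or 14 -> 7)."""
    c, s, _ = f.shape
    k = s // out
    return f.reshape(c, out, k, out, k).max(axis=(2, 4))

# Stand-ins for the four extracted feature maps
f3a = np.random.rand(256, 28, 28)
f3b = np.random.rand(480, 28, 28)
f4a = np.random.rand(512, 14, 14)
f4e = np.random.rand(832, 14, 14)

F = np.concatenate([block_max_pool(f) for f in (f3a, f3b, f4a, f4e)], axis=0)
print(F.shape)  # (2080, 7, 7)
```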
(3) Finally, multiply the top-layer convolutional weights obtained in step 4 with the concatenated convolutional features to obtain the weighted deep convolutional features. This convolutional feature carries two kinds of information:
first, information spanning from low-level image features to high-level semantic features; and second, information in which the target object in the picture is made more prominent. The specific operation is:
F_W(x, y) = W(x, y) × F(x, y)
where W is the top-layer convolutional weight map produced in step 4, F is the convolutional feature after the pooling and concatenation in step 5 (2), and F_W is the weighted feature obtained by multiplying the top-layer weight W element-wise with the corresponding positions of the multi-layer concatenated feature F. Here (x, y) are the coordinates within the convolutional feature, with x = 1, ..., 7 and y = 1, ..., 7, and the same 7 × 7 weight map is applied to every channel. The dimensions of F_W are 2080 × 7 × 7.
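As a sketch, the weighting itself is a single broadcast multiplication in NumPy; random arrays stand in for F and W:

```python
import numpy as np

F = np.random.rand(2080, 7, 7)  # pooled and concatenated features, step 5 (2)
W = np.random.rand(7, 7)        # top-layer weight map, step 4

F_W = F * W[np.newaxis, :, :]   # F_W(x, y) = W(x, y) * F(x, y) on every channel
print(F_W.shape)                # (2080, 7, 7)
```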
Step 6. Pool these convolutional features into a vector, obtaining the image's depth feature with the convolutional weights applied.
(1) Apply max pooling to the weighted convolutional feature F_W obtained in step 5, yielding the final target vector as the final representation of the image:
Image_represent = MaxPooling(F_W)
The pooling window size is 7 × 7 and the pooling strategy is max pooling, so the resulting feature is a vector of dimension 2080.
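A minimal sketch of this global max pooling; a random array stands in for F_W:

```python
import numpy as np

F_W = np.random.rand(2080, 7, 7)         # weighted features from step 5
image_represent = F_W.max(axis=(1, 2))   # 7 x 7 max pool per channel -> (2080,)
```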
Step 7. Extract the top-layer-weighted features for the query image and for the dataset of images to be retrieved, and perform the final similarity matching, thereby achieving image retrieval.
(1) For the query image, perform the six steps above to obtain its final representation QueryImage_represent; likewise, perform the same operations on each image to be retrieved to obtain EvaluationImage_represent.
(2) Then compute the similarity distance between the two:
Distance = similar_Function(QueryImage_represent, EvaluationImage_represent)
where Distance is the similarity distance between the two images and similar_Function is a function measuring the distance between two features. Common distance functions include the Euclidean distance and the cosine distance;
The distance adopted here is the cosine distance:

Distance = 1 − (A · B) / (‖A‖ × ‖B‖)

where A and B denote QueryImage_represent and EvaluationImage_represent respectively.
(3) Finally, sort the images of the evaluation set by their distance to the query image to obtain the final retrieval result.
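A minimal sketch of the distance computation and ranking of step 7; random vectors stand in for the extracted 2080-dimensional representations, and the evaluation-set size of 1000 is arbitrary:

```python
import numpy as np

def cosine_distance(a, b):
    """similar_Function instantiated as the cosine distance."""
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

query = np.random.rand(2080)                  # QueryImage_represent
evaluation_set = np.random.rand(1000, 2080)   # EvaluationImage_represent set

distances = np.array([cosine_distance(query, e) for e in evaluation_set])
ranking = np.argsort(distances)               # most similar images first
```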