CN115761258A - Image direction prediction method based on multi-scale fusion and attention mechanism - Google Patents

Image direction prediction method based on multi-scale fusion and attention mechanism

Info

Publication number
CN115761258A
CN115761258A, CN202211406464.3A, CN202211406464A
Authority
CN
China
Prior art keywords
feature
image
attention
convolution
lbp
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211406464.3A
Other languages
Chinese (zh)
Inventor
白茹意
郭小英
贾春花
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanxi University
Original Assignee
Shanxi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanxi University
2022-11-10 Priority to CN202211406464.3A
2023-03-07 Publication of CN115761258A
Legal status: Pending

Abstract

The invention discloses an image orientation prediction method based on multi-scale fusion and an attention mechanism, and belongs to the technical field of computer vision and image processing. To address automatic prediction of image orientation, the feature maps output by the last residual structure of each of the last four stages of a ResNet50 network are each passed through an attention mechanism module to obtain four spatial attention maps, which are added element-wise to the corresponding original feature maps. The three smaller-scale features are then up-sampled by bilinear interpolation to the resolution of the largest scale and concatenated along the channel dimension to obtain the final multi-scale attention fusion feature, called the local feature. Next, four VR_LBP feature maps of the image at different scales are taken as network input, and ResNet50 combined with residual dilated convolution yields four feature maps, whose corresponding elements are added to obtain the global feature. Finally, the local and global features are concatenated and fused, and orientation prediction is realized through global average pooling (GAP) and a fully connected layer.

Description

An Image Orientation Prediction Method Based on Multi-Scale Fusion and Attention Mechanism

Technical Field

The invention belongs to the technical field of image processing and computer vision perception, and in particular relates to an image orientation prediction method based on multi-scale fusion and an attention mechanism.

Background Art

Advances in digital imaging technology, along with the proliferation of digital cameras, smartphones, and other devices, have led to a dramatic increase in the number of photos people take. Since the camera is not always level during a shot, the resulting photo often needs rotation correction so that it is displayed in its correct orientation, which is defined as the orientation in which the scene originally appeared. Most digital cameras have a built-in orientation sensor that allows the camera's orientation to be recorded in the image's EXIF metadata during capture, but this field is not managed and updated consistently across image processing applications and image formats. Automatic detection of the standard image orientation is therefore an important task for applications such as automatic creation of digital photo albums, digitization of analog photos, and computer vision systems that require input images in an upright orientation. Without it, user intervention is required, relying on human image understanding to identify the correct orientation of a photo. In general, the orientation of a photo is determined by the rotation of the camera when the photo was taken; although any angle is possible, a 90-degree rotation is the most common, so the image is usually assumed to have been taken in one of four orientations (0 degrees (up), 90 degrees (right), 180 degrees (down), 270 degrees (left)). Because of the wide variability of scene content, automating this task is challenging.

In current research, image orientation recognition methods mostly use image processing and machine learning algorithms. Nevertheless, these methods have several problems: (1) Some orientation detection methods rely on low-level features and then apply a suitable classifier, but low-level features cannot capture the rich semantic content of an image. (2) Some neural-network-based orientation detection methods require the original image to be rescaled; for example, a VGG network scales the image to 224×224. However, the aspect ratio of an image is one of the cues for judging its orientation, and rescaling loses some of the image's information. (3) Most current neural network methods for image orientation detection are fine-tuned on existing backbone networks without considering whether the extracted features express human visual perception, so the generalization ability of the models is limited.

Summary of the Invention

Aiming at the current problems of image orientation recognition, the present invention provides an image orientation prediction method based on multi-scale fusion and an attention mechanism.

In order to achieve the above object, the present invention adopts the following technical solutions:

An image orientation prediction method based on multi-scale fusion and an attention mechanism, comprising the following steps:

Step 1: rotate each image clockwise by three angles, 90 degrees, 180 degrees and 270 degrees, so that each image finally yields images in four different orientations: up, right, down and left;
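As a minimal illustration of step 1, the sketch below generates the four orientation classes from one image with NumPy; the label encoding 0–3 and the use of np.rot90 are assumptions made for the example, not part of the patent text.

```python
import numpy as np

def make_orientation_samples(image):
    """Produce the four orientation variants of one image (step 1)."""
    samples = []
    for k in range(4):                                  # k quarter-turns clockwise
        rotated = np.rot90(image, k=-k, axes=(0, 1))    # negative k = clockwise rotation
        samples.append((rotated, k))                    # label 0=up, 1=right, 2=down, 3=left
    return samples
```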

Step 2: extract the local features of each image with ResNet50 combined with a residual attention mechanism; the specific steps are as follows:

Step 2.1: the ResNet50 network consists of six parts: a convolutional layer, C0, C1, C2, C3 and C4. C0 contains one 7×7 convolutional layer with stride 2 and one 3×3 max-pooling layer with stride 2. The feature maps of C1, C2, C3 and C4 are 1/4, 1/8, 1/16 and 1/32 of the original image size respectively, and they contain 3, 4, 6 and 3 bottleneck layers (Bottleneck, BTNK for short) respectively;

Step 2.2: denote the feature maps output by the last residual structure of each of C1, C2, C3 and C4 as C1, C2, C3 and C4 respectively; each of the four feature maps at different scales is passed through an attention mechanism module (Convolutional Block Attention Module, CBAM) to obtain four spatial attention maps, denoted A1, A2, A3 and A4;

Step 2.3: add the spatial attention maps and the corresponding original feature maps element-wise, denoted as Fi = Ai ⊕ Ci (i = 1, 2, 3, 4), where ⊕ denotes element-wise addition;

Step 2.4: up-sample the three smaller-scale feature maps F2, F3 and F4 by bilinear interpolation to the same scale as F1, concatenate them along the channel dimension, and apply a 1×1 convolution to obtain the final multi-scale attention fusion feature, the local feature, denoted as: Local_Feature = concat(F1, up_2x(F2), up_4x(F3), up_8x(F4)), where concat denotes feature concatenation and up_2x denotes up-sampling by a factor of 2;
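A minimal sketch of this multi-scale fusion in TensorFlow/Keras is given below, assuming f1–f4 are the CBAM-refined maps at 1/4, 1/8, 1/16 and 1/32 resolution; the output channel count of the 1×1 convolution is an assumed hyperparameter, since the text only specifies the operation.

```python
import tensorflow as tf
from tensorflow.keras import layers

def local_feature_head(f1, f2, f3, f4, out_channels=256):
    """Step 2.4: bilinear up-sampling, channel concatenation, 1x1 convolution."""
    f2_up = layers.UpSampling2D(size=2, interpolation="bilinear")(f2)
    f3_up = layers.UpSampling2D(size=4, interpolation="bilinear")(f3)
    f4_up = layers.UpSampling2D(size=8, interpolation="bilinear")(f4)
    fused = layers.Concatenate(axis=-1)([f1, f2_up, f3_up, f4_up])
    return layers.Conv2D(out_channels, kernel_size=1, padding="same")(fused)
```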

Step 3: take the image's four VR_LBP (Variable Rotation Local Binary Pattern) feature maps at different scales as network input, and extract the global features of the image with ResNet50 combined with residual dilated convolution;

Step 3.1: with the image in the standard RGB color mode, compute the VR_LBP feature maps that express the "orientation" characteristics of the image; in the computation, four different scales VR_LBP1,8, VR_LBP2,16, VR_LBP3,24 and VR_LBP4,32 are used to generate four VR_LBP feature maps, denoted P1, P2, P3 and P4, which serve as the input of the ResNet50 network;

Step 3.2: input the four VR_LBP feature maps P1, P2, P3 and P4 of different scales into ResNet50, and denote the output feature maps of the last convolution block of the ResNet50 network as {RP1, RP2, RP3, RP4}; feed these four feature maps into residual dilated convolution blocks with corresponding sampling rates, the four rates corresponding to the R values of the VR_LBP feature maps. A residual dilated convolution block adds a 1×1 convolution shortcut connection to one 3×3 dilated convolution, forming a residual block; the shortcut connection matches the spatial dimensions of the feature maps, and the residual block realizes an identity mapping while the convolution extracts image features. After the residual dilated convolution blocks, four feature maps denoted RPD1, RPD2, RPD3 and RPD4 are obtained, and their corresponding elements are added to obtain the global feature: Global_Feature = RPD1 ⊕ RPD2 ⊕ RPD3 ⊕ RPD4;
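The sketch below shows one way to realize the residual dilated convolution blocks and the element-wise summation in TensorFlow/Keras; the filter width is an assumed hyperparameter, while the rates 1–4 follow the text.

```python
import tensorflow as tf
from tensorflow.keras import layers

def residual_dilated_block(x, filters, dilation_rate):
    """One 3x3 dilated convolution plus a 1x1 shortcut that matches dimensions."""
    main = layers.Conv2D(filters, 3, padding="same",
                         dilation_rate=dilation_rate, activation="relu")(x)
    shortcut = layers.Conv2D(filters, 1, padding="same")(x)
    return layers.Add()([main, shortcut])

def global_feature_head(rp_maps, filters=256):
    """Step 3.2: apply rates 1..4 to RP1..RP4 and sum the results element-wise."""
    rpd = [residual_dilated_block(rp, filters, rate)
           for rate, rp in zip((1, 2, 3, 4), rp_maps)]
    return layers.Add()(rpd)
```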

Step 4: concatenate and fuse the local features obtained in step 2.4 with the global features obtained in step 3.2 to finally realize orientation prediction;

Step 4.1: down-sample the local feature Local_Feature by bilinear interpolation to the same resolution as the global feature Global_Feature, then concatenate them to obtain the fused feature: LG_Feature = concat(down(Local_Feature), Global_Feature), where down denotes down-sampling;

Step 4.2: pass LG_Feature through a global average pooling (Global Average Pooling, GAP) layer to obtain a one-dimensional vector, and then through a 256-unit fully connected layer to realize the prediction of the image orientation;
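A minimal sketch of steps 4.1–4.2 follows; the bilinear resize, the ReLU on the 256-unit layer and the final 4-way softmax output are assumptions added so the head is runnable, since the text only specifies GAP and a 256 fully connected layer.

```python
import tensorflow as tf
from tensorflow.keras import layers

def prediction_head(local_feature, global_feature):
    """Steps 4.1-4.2: fuse local and global features, then GAP and dense layers."""
    target_hw = tf.shape(global_feature)[1:3]
    local_down = tf.image.resize(local_feature, target_hw, method="bilinear")
    lg = layers.Concatenate(axis=-1)([local_down, global_feature])
    vec = layers.GlobalAveragePooling2D()(lg)           # 1 x 1 x C -> C-dim vector
    vec = layers.Dense(256, activation="relu")(vec)
    return layers.Dense(4, activation="softmax")(vec)   # four orientation classes
```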

Step 4.3: use the logistic regression maximum likelihood loss function as the loss function to realize orientation classification and automatic prediction of the image orientation; the loss function is defined as follows:

Figure BDA0003937122620000041
Figure BDA0003937122620000041

Figure BDA0003937122620000042
Figure BDA0003937122620000042

where hθ(x) denotes the probability that sample x belongs to a given class; yi is the predicted orientation class, xi denotes the feature of the i-th sample, m is the number of samples, θ is the parameter learned by the network model, and T denotes the transpose of a matrix.
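For illustration, a NumPy sketch of this maximum-likelihood loss in its binary form is shown below; for the four orientation classes it would in practice be generalized to a softmax cross-entropy, which is stated here as an assumption rather than taken from the patent.

```python
import numpy as np

def logistic_ml_loss(theta, X, y):
    """J(theta) from step 4.3; X is (m, d), y is (m,) with values in {0, 1}."""
    h = 1.0 / (1.0 + np.exp(-X @ theta))      # h_theta(x_i) for every sample
    return -np.mean(y * np.log(h) + (1.0 - y) * np.log(1.0 - h))
```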

Further, in step 2.1 there are two kinds of bottleneck layer BTNK: BTNK1 and BTNK2. The left branch of BTNK2 consists of three conv+BN+ReLU convolution blocks; the convolved result F(x) is added to the input x, i.e. F(x)+x, and then passed through one ReLU activation function; this module has the same number of input and output channels. The left branch of BTNK1 consists of three conv+BN+ReLU convolution blocks F(x), and the right branch is one conv+BN convolution block G(x) that matches the difference between the input and output dimensions, i.e. F(x) and G(x) have the same number of channels, so that the sum F(x)+G(x) can be formed; this module has different numbers of input and output channels. The ResNet50 network is built by stacking multiple bottleneck layers BTNK of these different kinds.
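The following Keras sketch shows a bottleneck block in both variants described above; the 1-3-1 kernel layout, the 4× channel expansion and the stride handling follow the standard ResNet50 design and are assumptions where the patent text is silent.

```python
import tensorflow as tf
from tensorflow.keras import layers

def btnk(x, filters, downsample=False):
    """Bottleneck block: BTNK1 when dimensions change (conv+BN shortcut G(x)),
    BTNK2 otherwise (identity shortcut)."""
    strides = 2 if downsample else 1
    f = layers.ReLU()(layers.BatchNormalization()(layers.Conv2D(filters, 1, strides=strides)(x)))
    f = layers.ReLU()(layers.BatchNormalization()(layers.Conv2D(filters, 3, padding="same")(f)))
    f = layers.BatchNormalization()(layers.Conv2D(4 * filters, 1)(f))
    if downsample or x.shape[-1] != 4 * filters:      # BTNK1: conv+BN shortcut matches dimensions
        g = layers.BatchNormalization()(layers.Conv2D(4 * filters, 1, strides=strides)(x))
    else:                                             # BTNK2: identity shortcut
        g = x
    return layers.ReLU()(layers.Add()([f, g]))
```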

Further, the attention mechanism module CBAM in step 2.2 combines a channel attention module and a spatial attention module. The channel attention module applies global max pooling and global average pooling to the input feature map to obtain two C×1×1 features, where C denotes the number of channels, and then feeds the two C×1×1 features into a two-layer neural network (MLP).

The first layer of the two-layer MLP has C/r neurons, where r is the reduction rate, with a ReLU activation function, and the second layer has C neurons; this two-layer network is shared.

The two features output by the MLP are added element-wise and passed through a sigmoid activation to generate the final channel attention feature; finally, the channel attention feature is multiplied element-wise with the input feature to generate the input feature required by the spatial attention module.

The spatial attention module first applies global max pooling and global average pooling along the channel dimension to the output features of the channel attention module, obtaining two 1×H×W single-channel pooled features; these two features are concatenated along the channel dimension, a 7×7 convolution reduces the number of channels to 1, and a sigmoid generates the spatial attention feature; finally, the spatial attention feature is multiplied element-wise with the input feature of the spatial attention module, yielding a weighted feature that fuses channel attention and spatial attention.
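A compact CBAM sketch consistent with this description is given below; the reduction rate r = 16 is an assumed default, as the text does not fix it.

```python
import tensorflow as tf
from tensorflow.keras import layers

def cbam(x, reduction=16):
    """Channel attention (shared two-layer MLP) followed by 7x7 spatial attention."""
    c = x.shape[-1]
    mlp = tf.keras.Sequential([layers.Dense(c // reduction, activation="relu"),
                               layers.Dense(c)])                  # shared MLP
    avg = mlp(layers.GlobalAveragePooling2D()(x))
    mx = mlp(layers.GlobalMaxPooling2D()(x))
    channel_att = tf.sigmoid(avg + mx)[:, None, None, :]          # B x 1 x 1 x C
    x = x * channel_att                                           # channel-refined input
    avg_sp = tf.reduce_mean(x, axis=-1, keepdims=True)            # B x H x W x 1
    max_sp = tf.reduce_max(x, axis=-1, keepdims=True)
    spatial_att = layers.Conv2D(1, 7, padding="same", activation="sigmoid")(
        tf.concat([avg_sp, max_sp], axis=-1))
    return x * spatial_att
```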

Further, for the VR_LBP feature map in step 3.1, a pixel in the image is taken as the center point, and a set of circular sampling points around that center is obtained by interpolation, where R is the radius and P is the number of sampling points. The value of the center pixel is then compared with the values of its neighborhood pixels: if a neighborhood value is greater than the value of the center pixel, that position is set to 1, otherwise to 0. The circular sampling points are read clockwise and combined into a binary number sequence, which is converted to decimal to give the VR_LBPR,P code, computed as:

VR_LBPR,P = Σ_{i=0}^{P−1} t(grayi − grayc)·2^i

where grayc is the gray level of the current pixel and grayi is the gray level of its neighborhood; when x is less than 0, t(x) is 0, and otherwise 1.
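A NumPy sketch of this computation follows; the bilinear interpolation of the circular neighbours and the angular convention for the clockwise read-out are assumptions, since the figure defining the sampling coordinates is not reproduced in the text.

```python
import numpy as np

def vr_lbp(gray, R, P):
    """Compute a VR_LBP(R,P) code map for a 2-D grayscale array (sketch)."""
    gray = gray.astype(np.float64)
    h, w = gray.shape
    out = np.zeros((h, w), dtype=np.int64)
    region = np.s_[R + 1:h - R - 1, R + 1:w - R - 1]     # interior pixels only
    ys, xs = np.mgrid[region]
    center = gray[region]
    for i in range(P):
        angle = 2.0 * np.pi * i / P
        y = ys - R * np.sin(angle)                       # sub-pixel neighbour coordinates
        x = xs + R * np.cos(angle)
        y0, x0 = np.floor(y).astype(int), np.floor(x).astype(int)
        fy, fx = y - y0, x - x0
        # bilinear interpolation of the neighbour gray level
        val = ((1 - fy) * (1 - fx) * gray[y0, x0] + (1 - fy) * fx * gray[y0, x0 + 1]
               + fy * (1 - fx) * gray[y0 + 1, x0] + fy * fx * gray[y0 + 1, x0 + 1])
        out[region] += ((val - center) >= 0).astype(np.int64) << i   # t(gray_i - gray_c) * 2^i
    return out
```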

Further, the dilated convolution in step 3.2 has a dilation rate hyperparameter that defines the spacing between the values of the convolution kernel when processing data, i.e. (dilation rate − 1) zeros are inserted into the convolution kernel; therefore, setting different dilation rates yields different receptive fields and captures multi-scale information. The effective kernel size of a dilated convolution is K = k + (k − 1)(r − 1), where k is the original kernel size and r is the dilation rate. Four different dilated convolution kernels are used according to the different VR_LBP scales, with r equal to 1, 2, 3 and 4 respectively.
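The relation K = k + (k − 1)(r − 1) can be checked with a one-line sketch; for a 3×3 kernel the four rates give effective sizes 3, 5, 7 and 9.

```python
def effective_kernel_size(k=3, rates=(1, 2, 3, 4)):
    """Effective size K = k + (k - 1) * (r - 1) of a dilated kernel per rate."""
    return {r: k + (k - 1) * (r - 1) for r in rates}

# effective_kernel_size() -> {1: 3, 2: 5, 3: 7, 4: 9}
```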

Further, the window size of the global average pooling in step 4.2 is the size of the entire feature map: an average is computed over all pixels of the feature map of each output channel, and after global average pooling a feature vector of size 1×1×C is obtained, where C is the number of channels of the original feature map.

Compared with the prior art, the present invention has the following advantages:

(1) The feature maps output by the last residual structure of each of the last four stages of the ResNet50 network are each passed through an attention mechanism module (CBAM) to obtain four spatial attention maps, which are added element-wise to the corresponding original feature maps. The three smaller-scale features are then up-sampled by bilinear interpolation to the resolution of the largest scale and concatenated along the channel dimension to obtain the final multi-scale attention fusion feature, called the local feature. The attention mechanism makes the machine's judgment of orientation more consistent with that of human vision. (2) The image's four LBP feature maps at different scales are taken as network input, ResNet50 combined with residual dilated convolution yields four feature maps, and their corresponding elements are added to obtain the global feature. Extracting the orientation characteristics of the image at multiple scales expresses them from different fields of view and improves the accuracy of orientation detection. (3) The local and global features are concatenated and fused, and orientation prediction is finally realized through GAP and a fully connected layer. The GAP module converts an image of arbitrary input size into a fixed-size feature vector, which reduces overfitting and speeds up network convergence.

Brief Description of the Drawings

Fig. 1 is a schematic diagram of the images of the present invention;

Fig. 2 is a flowchart of the present invention;

Fig. 3 shows the dilated convolution kernels of different scales of the present invention;

Fig. 4 shows the VR_LBP feature maps of different scales of the present invention;

Fig. 5 is the network model framework of the present invention;

Fig. 6 is a schematic diagram of the performance indicators of the eight models LModel1–LModel8 of the present invention;

Fig. 7 is a schematic diagram of the performance indicators of the ten models Gmodel1–Gmodel10 of the present invention.

Detailed Description of the Embodiments

In order to make the technical problems to be solved, the technical solutions and the beneficial effects of the present invention clearer, the present invention is further described in detail below in combination with the embodiments and the accompanying drawings. It should be understood that the specific embodiments described here are only used to explain the present invention and are not intended to limit it. The technical solutions of the present invention are described in detail below in conjunction with the embodiments and drawings, but the scope of protection is not limited thereto.

Embodiment 1

The present invention selects public datasets for experiments; the specific implementation steps are as follows:

Step 1: the public SUN dataset is selected for testing; it contains 108,754 images in 397 scene categories. Each image is rotated clockwise by three angles, 90 degrees, 180 degrees and 270 degrees, so that each image finally yields images in four different orientations: up, right, down and left.

Step 2: ResNet50 combined with a residual attention mechanism is used to extract the local features of the image; the specific steps are as follows:

Step 2.1: the ResNet50 network consists of six parts: a convolutional layer, C0, C1, C2, C3 and C4. C0 contains one 7×7 convolutional layer with stride 2 and one 3×3 max-pooling layer with stride 2. The feature maps of C1, C2, C3 and C4 are 1/4, 1/8, 1/16 and 1/32 of the original image size respectively, and contain 3, 4, 6 and 3 bottleneck layers (Bottleneck, BTNK for short) respectively. There are two kinds of bottleneck layer BTNK: BTNK1 and BTNK2. The left branch of BTNK2 consists of three conv+BN+ReLU convolution blocks; the convolved result F(x) is added to the input x, i.e. F(x)+x, and then passed through one ReLU activation function; this module has the same number of input and output channels. The left branch of BTNK1 consists of three conv+BN+ReLU convolution blocks F(x), and the right branch is one conv+BN convolution block G(x) that matches the difference between the input and output dimensions, i.e. F(x) and G(x) have the same number of channels, so that the sum F(x)+G(x) can be formed; this module has different numbers of input and output channels.

Step 2.2: the feature maps output by the last residual structure of each of C1, C2, C3 and C4 are denoted C1, C2, C3 and C4 respectively; each of the four feature maps at different scales is passed through an attention mechanism module (Convolutional Block Attention Module, CBAM) to obtain four spatial attention maps, denoted A1, A2, A3 and A4. The attention mechanism module CBAM combines a channel attention module and a spatial attention module. The channel attention module applies global max pooling and global average pooling to the input feature map to obtain two C×1×1 features, where C denotes the number of channels, and then feeds the two C×1×1 features into a two-layer neural network (MLP); the first layer of the MLP has C/r neurons, where r is the reduction rate, with a ReLU activation function, and the second layer has C neurons; this two-layer network is shared. The two features output by the MLP are added element-wise and passed through a sigmoid activation to generate the final channel attention feature; this channel attention feature is then multiplied element-wise with the input feature to generate the input feature required by the spatial attention module. The spatial attention module first applies global max pooling and global average pooling along the channel dimension to the output features of the channel attention module, obtaining two 1×H×W single-channel pooled features; these are concatenated along the channel dimension, a 7×7 convolution reduces the number of channels to 1, a sigmoid generates the spatial attention feature, and finally the spatial attention feature is multiplied element-wise with the input feature of the spatial attention module, yielding a weighted feature that fuses channel attention and spatial attention.

Step 2.3: the spatial attention maps and the corresponding original feature maps are added element-wise, denoted as Fi = Ai ⊕ Ci (i = 1, 2, 3, 4), where ⊕ denotes element-wise addition.

Step 2.4: the three smaller-scale feature maps F2, F3 and F4 are up-sampled by bilinear interpolation to the same scale as F1, concatenated along the channel dimension, and a 1×1 convolution is applied to obtain the final multi-scale attention fusion feature, the local feature, denoted as: Local_Feature = concat(F1, up_2x(F2), up_4x(F3), up_8x(F4)), where concat denotes feature concatenation and up_2x denotes up-sampling by a factor of 2;

Step 3: the image's four VR_LBP (Variable Rotation Local Binary Pattern) feature maps at different scales are taken as network input, and ResNet50 combined with residual dilated convolution is used to extract the global features of the image; the specific steps are as follows:

Step 3.1: with the image in the standard RGB color mode, the VR_LBP feature maps that express the "orientation" characteristics of the image are computed; four different scales VR_LBP1,8, VR_LBP2,16, VR_LBP3,24 and VR_LBP4,32 are used to generate four VR_LBP feature maps, denoted P1, P2, P3 and P4, which serve as the input of the ResNet50 network. For the VR_LBP feature map, a pixel in the image is taken as the center point, and a set of circular sampling points around that center is obtained by interpolation, where R is the radius and P is the number of sampling points; the value of the center pixel is then compared with the values of its neighborhood pixels, and if a neighborhood value is greater than the value of the center pixel, that position is set to 1, otherwise to 0. The circular sampling points are read clockwise and combined into a binary number sequence, which is converted to decimal to give the VR_LBPR,P code, computed as:

VR_LBPR,P = Σ_{i=0}^{P−1} t(grayi − grayc)·2^i

where grayc is the gray level of the current pixel and grayi is the gray level of its neighborhood; when x is less than 0, t(x) is 0, and otherwise 1.

Step 3.2: the four VR_LBP feature maps P1, P2, P3 and P4 of different scales are input into ResNet50, and the output feature maps of the last convolution block of the ResNet50 network are denoted {RP1, RP2, RP3, RP4}; these four feature maps are fed into residual dilated convolution blocks with corresponding sampling rates, the four rates corresponding to the R values of the VR_LBP feature maps;

A residual dilated convolution block adds a 1×1 convolution shortcut connection to one 3×3 dilated convolution, forming a residual block. The shortcut connection matches the spatial dimensions of the feature maps, and the residual block realizes an identity mapping while the convolution extracts image features. After the residual dilated convolution blocks, four feature maps denoted RPD1, RPD2, RPD3 and RPD4 are obtained, and their corresponding elements are added to obtain the global feature: Global_Feature = RPD1 ⊕ RPD2 ⊕ RPD3 ⊕ RPD4, where ⊕ denotes element-wise addition.

Dilated convolution has a dilation rate hyperparameter that defines the spacing between the values of the convolution kernel when processing data, i.e. (dilation rate − 1) zeros are inserted into the convolution kernel; therefore, setting different dilation rates yields different receptive fields, and multi-scale information is obtained. The effective kernel size of a dilated convolution is K = k + (k − 1)(r − 1), where k is the original kernel size and r is the dilation rate. The present invention uses four different dilated convolution kernels according to the different VR_LBP scales, with r equal to 1, 2, 3 and 4 respectively.

Step 4: the local features obtained in step 2.4 and the global features obtained in step 3.2 are concatenated and fused to finally realize orientation prediction; the specific steps are as follows:

Step 4.1: the local feature Local_Feature is down-sampled by bilinear interpolation to the same resolution as the global feature Global_Feature, and the two are then concatenated to obtain the fused feature: LG_Feature = concat(down(Local_Feature), Global_Feature), where down denotes down-sampling.

Step 4.2: LG_Feature is passed through a global average pooling layer to obtain a one-dimensional vector, and then through a 256-unit fully connected layer to realize the prediction of the image orientation. The window size of GAP is the size of the entire feature map: an average is computed over all pixels of the feature map of each output channel, and after global average pooling a feature vector of size 1×1×C is obtained, where C is the number of channels of the original feature map.

Step 4.3: the logistic regression maximum likelihood loss function is used as the loss function to realize orientation classification and automatic prediction of the image orientation; the loss function is defined as follows:

hθ(x) = 1 / (1 + e^(−θ^T x))

J(θ) = −(1/m) · Σ_{i=1}^{m} [ yi·log hθ(xi) + (1 − yi)·log(1 − hθ(xi)) ]

where hθ(x) denotes the probability that sample x belongs to a given class; yi is the predicted orientation class, xi denotes the feature of the i-th sample, m is the number of samples, θ is the parameter learned by the network model, and T denotes matrix transposition.

Step 5: the experimental environment is Anaconda3 and the deep learning framework is TensorFlow (GPU). 70% of each dataset is selected as the training set and 30% as the test set. The original image size is kept unchanged. A 10-fold cross-validation method is adopted, so the final evaluation index is the average accuracy over the 10 folds.

ResNet50 is pre-trained on the ImageNet dataset, the obtained convolutional layer parameters are applied to the method proposed by the present invention, and the other modules are fine-tuned on this basis.

Experiment-related parameter settings: the batch size is set to 128, the network is trained end-to-end with an SGD optimizer with momentum, the momentum is set to 0.9, the learning rate is 0.001, and the number of epochs is 30; L2 regularization is added to prevent overfitting. The method of the present invention addresses a multi-class classification problem, so classification accuracy (ACC), macro-average precision (MAP), macro-average recall (MAR) and the confusion matrix are used to evaluate the performance of the model.
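A sketch of this training configuration in TensorFlow/Keras is shown below; the reported values (batch size, momentum, learning rate, epochs) come from the text, while the concrete L2 weight and the use of a softmax cross-entropy loss are assumptions for illustration.

```python
import tensorflow as tf

BATCH_SIZE = 128
EPOCHS = 30
L2_WEIGHT = 1e-4                                             # assumed value; only "L2" is stated

optimizer = tf.keras.optimizers.SGD(learning_rate=0.001, momentum=0.9)
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy()    # four orientation classes
regularizer = tf.keras.regularizers.l2(L2_WEIGHT)            # applied to conv/dense kernels
```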

To fully verify the effectiveness and applicability of the method, and in particular the effectiveness of the local features in the image orientation detection task, ablation experiments were carried out on the feature fusion manner of the different ResNet50 layers used in this part of the network structure and on CBAM, with the global features using the proposed structure, producing eight different models. As shown in Table 1, Lmodel1 and Lmodel5 use feature map C4; Lmodel2 and Lmodel6 use C3 and C4; Lmodel3 and Lmodel7 use C2, C3 and C4; and Lmodel4 and Lmodel8 use C1, C2, C3 and C4.

In addition, whether CBAM is combined is also considered. Lmodel1–Lmodel4 do not include CBAM and instead use direct up-sampling fusion, while Lmodel5–Lmodel8 include CBAM. Lmodel1 is ResNet50, i.e. the backbone network, and Lmodel8 is the model proposed by the present invention. From the performance indicators of the eight models in Fig. 6, the accuracy, macro-average precision and macro-average recall of the proposed model (Lmodel8) are 99.2%, 97.1% and 95.5% respectively, better than the other models.

Table 1

(Table 1 is provided as an image in the original publication and is not reproduced here.)

To verify the effectiveness of the global features proposed by the present invention in the image orientation detection task, ablation experiments were carried out on the different-scale "VR_LBP" images and the residual dilated convolution used in this part of the network structure, with the local features using the proposed structure, producing ten different models, as shown in Table 2. First, the "VR_LBP" images of the original image at different scales are computed, and either the original image or the VR_LBP maps of different scales are selected as network input to the model. As shown in the table, the inputs of Gmodel1 and Gmodel2 are the original image; the inputs of Gmodel3 and Gmodel7 are VR_LBP1,8; the inputs of Gmodel4 and Gmodel8 are VR_LBP1,8 and VR_LBP2,16; the inputs of Gmodel5 and Gmodel9 are VR_LBP1,8, VR_LBP2,16 and VR_LBP3,24; and the inputs of Gmodel6 and Gmodel10 are VR_LBP1,8, VR_LBP2,16, VR_LBP3,24 and VR_LBP4,32. In addition, whether the residual dilated convolution layer is combined is also considered: Gmodel1 and Gmodel3–Gmodel6 do not include the residual dilated convolution layer, while Gmodel2 and Gmodel7–Gmodel10 do. From the performance indicators of the ten models in Fig. 7, the accuracy, macro-average precision and macro-average recall of the proposed model (Gmodel10) are 99.2%, 97.5% and 94.7% respectively, better than the other models.

Table 2

(Table 2 is provided as an image in the original publication and is not reproduced here.)

To verify the effectiveness of fusing local and global features in this task, experiments were carried out on four models: the backbone network, the local feature network, the global feature network, and the fusion of the two kinds of features. Table 3 shows that the classification accuracy of the proposed fusion method on the dataset is 99.2%, better than the other single-feature models. The experimental results show that the accuracy of the local-global feature fusion model is higher than that of the backbone network or of a network using a single feature. This also confirms that when viewing a picture, our judgment of its orientation attends both to the specific content of the picture and to its overall layout, so that the classification model performs well on various types of images.

Table 3

(Table 3 is provided as an image in the original publication and is not reproduced here.)

The invention was evaluated experimentally on the SUN dataset and compared with current related research, and the classification effect is significant. The design of the VR_LBP image descriptor of the present invention expresses the orientation characteristics of images well. The fusion of local and global features helps the model perceive the orientation of an image from different visual features, so that the method performs well on different datasets.

Compared with existing image orientation perception methods, the advantages of the present invention are: (1) The original images in the dataset are not rescaled; the initial size of the image is kept, retaining more of the image's effective information. (2) Attention mechanism features are extracted from the feature maps of the neural network model at different scales and fused to obtain local features. This is similar to the human visual attention mechanism, which obtains more detail relevant to the target while ignoring irrelevant information; with this mechanism, limited attention resources can quickly filter out high-value information from a large amount of information. (3) "VR_LBP" (variable rotation local binary pattern) features of different scales are extracted from the image, ResNet50 combined with residual dilated convolution yields four feature maps, and their corresponding elements are added to obtain the global feature; "VR_LBP" expresses the orientation characteristics of the image more accurately and improves the generalization ability of the model. (4) Fusing the global and local features expresses the orientation semantics of the image more comprehensively and improves the classification accuracy of the model.

Contents not described in detail in the description of the present invention belong to the prior art known to those skilled in the art. Although illustrative specific embodiments of the present invention have been described above so that those skilled in the art can understand the present invention, it should be clear that the present invention is not limited to the scope of the specific embodiments. To those of ordinary skill in the art, as long as various changes are within the spirit and scope of the present invention defined and determined by the appended claims, these changes are obvious, and all inventions and creations using the concept of the present invention are within the scope of protection.

Claims (7)

1. An image direction prediction method based on multi-scale fusion and attention mechanism is characterized in that: the method comprises the following steps:
step 1, respectively rotating each image clockwise by three angles: 90 degrees, 180 degrees and 270 degrees, and each image can finally obtain images in four different directions, namely an upper direction, a right direction, a lower direction and a left direction;
step 2, extracting local features of each image by adopting a ResNet50 fusion residual attention mechanism, and specifically comprising the following steps:
step 2.1, the ResNet50 network consists of 6 parts, respectively: convolutional layer, C0, C1, C2, C3, and C4; the C0 comprises 1 convolution layer of 7×7 with a stride of 2 and 1 maximum pooling layer of 3×3 with a stride of 2; the C1, C2, C3 and C4 are respectively 1/4, 1/8, 1/16 and 1/32 times of the original image, and respectively comprise 3, 4, 6 and 3 bottleneck layers BTNK;
step 2.2, respectively marking the last residual structure output feature map of each part C1, C2, C3 and C4 as C1, C2, C3, C4; passing each of the four feature maps of different scales through an attention mechanism module CBAM to respectively obtain 4 spatial attention maps marked as A1, A2, A3, A4;
step 2.3, corresponding elements of the spatial attention map and the corresponding original feature map are added, and the sum is marked as Fi = Ai ⊕ Ci (i = 1, 2, 3, 4), wherein ⊕ denotes the corresponding element addition;
step 2.4, the three small-scale feature maps F2, F3, F4 are up-sampled by bilinear interpolation to the same scale as F1 and spliced along the channel, and a 1×1 convolution operation is then carried out to obtain the final multi-scale attention fusion feature, the local feature, recorded as: Local_Feature = concat(F1, up_2x(F2), up_4x(F3), up_8x(F4)); wherein concat represents feature concatenation, and up_2x represents up-sampling by a factor of 2;
step 3, taking the variable rotation local binary pattern feature maps VR_LBP of 4 different scales of the image as network input, and extracting the global features of the image by adopting ResNet50 fused with residual dilated convolution, wherein the specific steps are as follows:
step 3.1, calculating a VR_LBP feature map capable of expressing the orientation characteristic of the image in a standard three-primary-colors (RGB) mode; 4 different scales VR_LBP1,8, VR_LBP2,16, VR_LBP3,24 and VR_LBP4,32 are adopted in the calculation process to generate 4 VR_LBP feature maps, respectively denoted as P1, P2, P3, P4, which serve as the input of the ResNet50 network;
step 3.2, the VR_LBP feature maps P1, P2, P3 and P4 of four different scales are input into ResNet50, and the output feature maps of the last convolution block of the ResNet50 network are marked as RP1, RP2, RP3, RP4; the 4 feature maps are respectively input into residual dilated convolution blocks with corresponding sampling rates, the 4 sampling rates respectively corresponding to the R values in the VR_LBP feature maps; 4 feature maps obtained after the residual dilated convolution blocks are marked as RPD1, RPD2, RPD3, RPD4, and corresponding elements of the 4 feature maps RPD1, RPD2, RPD3, RPD4 are added to obtain a global feature: Global_Feature = RPD1 ⊕ RPD2 ⊕ RPD3 ⊕ RPD4;
Step 4, splicing and fusing the local features obtained in the step 2.4 and the global features obtained in the step 3.2, and finally realizing direction prediction;
step 4.1, down-sampling the Local_Feature through bilinear interpolation to the same resolution as the Global_Feature, and then splicing and connecting to obtain a fused feature: LG_Feature = concat(down(Local_Feature), Global_Feature), down representing down-sampling;
step 4.2, performing global average pooling on the LG_Feature to obtain a one-dimensional vector; then, the image orientation is predicted through a 256-unit fully connected layer;
step 4.3, using the logistic regression maximum likelihood loss function as a loss function to realize orientation classification and automatic prediction of image orientation, wherein the loss function is defined as follows:
hθ(x) = 1 / (1 + e^(−θ^T x))
J(θ) = −(1/m) · Σ_{i=1}^{m} [ yi·log hθ(xi) + (1 − yi)·log(1 − hθ(xi)) ]
wherein hθ(x) represents the probability that sample x belongs to a class; yi is the predicted orientation class, xi represents the feature of the i-th sample, m is the number of samples, θ is the parameter solved by the network model, and T represents the transposition of the matrix.
2. The method of claim 1, wherein the image direction prediction method based on the multi-scale fusion and attention mechanism comprises: the bottleneck layer BTNK in the step 2.1 comprises BTNK1 and BTNK2; the left side of BTNK2 is provided with 3 conv+BN+ReLU convolution blocks, the result F(x) after convolution is added to the input x, namely F(x)+x, and then 1 ReLU activation function is applied, the number of input channels and the number of output channels of the module being the same; the left side of BTNK1 is provided with 3 conv+BN+ReLU convolution blocks F(x), and the right side is provided with 1 conv+BN convolution block G(x), so as to match the difference between the input and output dimensions; since the number of channels of F(x) and G(x) is the same, F(x)+G(x) is summed, and the number of input and output channels of the module is different; the ResNet50 network is formed by stacking a plurality of bottleneck layers BTNK of different types.
3. The method of claim 1, wherein the image direction prediction method based on multi-scale fusion and attention mechanism comprises: in the step 2.2, the attention mechanism module CBAM combines the channel attention module and the spatial attention module at the same time; the channel attention module obtains two C×1×1 features from the input feature map through global maximum pooling and global average pooling respectively, wherein C represents the number of channels, and then the two C×1×1 features are respectively sent into a two-layer neural network MLP,
the number of neurons in the first layer of the two-layer neural network MLP is C/r, r is the reduction rate, the activation function is ReLU, the number of neurons in the second layer is C, and the two-layer neural network MLP is shared;
performing element-by-element addition on the two features output by the MLP, and performing sigmoid activation operation to generate a final channel attention feature; finally, multiplying the channel attention feature by the input feature element by element to generate the input feature required by the space attention module;
the spatial attention module firstly applies global maximum pooling and global average pooling to output characteristics of the channel attention module based on channel dimensions respectively to obtain two 1 XHXW single-channel pooling characteristics, then splices the two 1 XHXW single-channel pooling characteristics based on the channel dimensions, reduces the number of channels to 1 through 7X 7 convolution operation, generates spatial attention characteristics through sigmoid, and finally multiplies the spatial attention characteristics generated through sigmoid and input characteristics of the spatial attention module element by element to obtain weighting characteristics fusing channel attention and spatial attention.
4. The method of claim 1, wherein the image direction prediction method based on the multi-scale fusion and attention mechanism comprises: the VR_LBP feature map in the step 3.1 takes a certain pixel point in the image as a central point, and a set of circular sampling points around the central point is obtained by interpolation, wherein R is the radius and P is the number of sampling points; then the value of the central pixel point is compared with the values of its neighborhood pixel points, if the value of a neighborhood pixel point is larger than that of the central pixel point, the position is set to 1, otherwise it is set to 0; then the circular sampling points are read clockwise and finally combined into a binary number sequence, and the sequence is converted into decimal, namely the VR_LBPR,P code, calculated as:

VR_LBPR,P = Σ_{i=0}^{P−1} t(grayi − grayc)·2^i

wherein grayc is the gray level of the current pixel and grayi is the gray level of its neighborhood; when x is less than 0, t(x) is 0, otherwise it is 1.
5. The method of claim 1, wherein the image direction prediction method based on the multi-scale fusion and attention mechanism comprises: in the step 3.2, the dilated convolution has a hyper-parameter, the expansion rate, which is used for defining the distance between values when a convolution kernel processes data, with (expansion rate − 1) zeros filled in the convolution kernel; therefore, when different expansion rates are set, the receptive fields are different, and multi-scale information is obtained; the convolution kernel of the dilated convolution is K = k + (k−1)(r−1), wherein k is the size of the original convolution kernel, and r is the dilation rate; 4 different dilated convolution kernels are used according to VR_LBP of different scales, and r is 1, 2, 3 and 4 respectively.
6. The method of claim 1, wherein the image direction prediction method based on the multi-scale fusion and attention mechanism comprises: the residual dilated convolution block is formed by adding a shortcut connection of a 1×1 convolution to one 3×3 dilated convolution; the function of the shortcut connection is to match the spatial dimensions of the feature map, and the function of the residual block is to realize identity mapping while the image features are extracted by convolution.
7. The method of claim 1, wherein the image direction prediction method based on multi-scale fusion and attention mechanism comprises: the window size of the global average pooling in the step 4.2 is the size of the whole feature map, an average value is calculated for all pixels of the feature map of each output channel, a feature vector with the size of 1 × 1 × C is obtained after the global average pooling, and C is the number of channels of the original feature map.



Cited By (11)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN116129207A (en)* | 2023-04-18 | 2023-05-16 | 江西师范大学 | A Method for Image Data Processing with Multi-Scale Channel Attention
CN116129207B (en)* | 2023-04-18 | 2023-08-04 | 江西师范大学 | A Method for Image Data Processing with Multi-Scale Channel Attention
CN116563615A (en)* | 2023-04-21 | 2023-08-08 | 南京讯思雅信息科技有限公司 | Bad picture classification method based on improved multi-scale attention mechanism
CN116563615B (en)* | 2023-04-21 | 2023-11-07 | 南京讯思雅信息科技有限公司 | Bad picture classification method based on improved multi-scale attention mechanism
CN116740866A (en)* | 2023-08-11 | 2023-09-12 | 上海银行股份有限公司 | Banknote loading and clearing system and method for self-service machine
CN116740866B (en)* | 2023-08-11 | 2023-10-27 | 上海银行股份有限公司 | Banknote loading and clearing system and method for self-service machine
CN117314787A (en)* | 2023-11-14 | 2023-12-29 | 河北工业大学 | Underwater image enhancement method based on self-adaptive multi-scale fusion and attention mechanism
CN119540762A (en)* | 2024-11-15 | 2025-02-28 | 广州市南沙区北科光子感知技术研究院 | A multi-scale local contrast attention infrared small target detection method
CN119540762B (en)* | 2024-11-15 | 2025-10-17 | 广州市南沙区北科光子感知技术研究院 | Multi-scale local contrast attention infrared small target detection method
CN119600031A (en)* | 2025-02-11 | 2025-03-11 | 国网四川省电力公司信息通信公司 | Transmission line insulator string defect location system and method based on cascade detection strategy
CN119600031B (en)* | 2025-02-11 | 2025-05-27 | 国网四川省电力公司信息通信公司 | Transmission line insulator chain defect positioning system and method based on cascade detection strategy

Similar Documents

Publication | Title
CN110717851B | Image processing method and device, training method of neural network and storage medium
CN115761258A | Image direction prediction method based on multi-scale fusion and attention mechanism
CN110738697B | Monocular depth estimation method based on deep learning
CN107154023B | Based on the face super-resolution reconstruction method for generating confrontation network and sub-pix convolution
US11625813B2 | Automatically removing moving objects from video streams
CN108596330B | A parallel feature full convolutional neural network device and its construction method
US9633282B2 | Cross-trained convolutional neural networks using multimodal images
CN112396645B | Monocular image depth estimation method and system based on convolution residual learning
CN112396607A | Streetscape image semantic segmentation method for deformable convolution fusion enhancement
CN111861880B | Image super-fusion method based on regional information enhancement and block self-attention
CN108475415A | Method and system for image processing
CN116797787B | Remote sensing image semantic segmentation method based on cross-modal fusion and graph neural network
CN112257759A | Method and device for image processing
CN115359370B | A remote sensing image cloud detection method, device, computer device and storage medium
CN112509021B | Parallax optimization method based on attention mechanism
CN110148138A | A kind of video object dividing method based on dual modulation
CN113780305A | A salient object detection method based on the interaction of two cues
WO2022205018A1 | License plate character recognition method and apparatus, and device and storage medium
CN114170167A | Polyp segmentation method and computer equipment based on attention-guided context correction
CN116128792A | Image processing method and related equipment
CN111178363B | Character recognition method, character recognition device, electronic equipment and readable storage medium
CN114022458B | Skeleton detection method, device, electronic equipment and computer readable storage medium
CN115936992A | Garbage image super-resolution method and system of lightweight transform
Bricman et al. | CocoNet: A deep neural network for mapping pixel coordinates to color values
US20250054115A1 | Deep learning-based high resolution image inpainting

Legal Events

Code | Title
PB01 | Publication
SE01 | Entry into force of request for substantive examination
