
A Video Object Detection Method Based on Attention Mechanism

Info

Publication number
CN110287826A
Authority
CN
China
Prior art keywords
detected
frame
feature
feature map
video
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910499786.9A
Other languages
Chinese (zh)
Other versions
CN110287826B (en)
Inventor
李建强
白骏
刘雅琦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Kuaima (Beijing) Electronic Technology Co.,Ltd.
Original Assignee
Beijing University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Technology
Priority to CN201910499786.9A
Publication of CN110287826A
Application granted
Publication of CN110287826B
Legal status: Active (current)
Anticipated expiration


Abstract

Translated from Chinese

The invention relates to a video object detection method based on an attention mechanism, in the field of computer vision. The method comprises the following steps. Step S1: extract the candidate feature map of the current frame. Step S2: set a fusion window over the past time period, compute the Laplacian variance of each frame in the window, normalize the variances to obtain the weight of each frame, compute a weighted sum of the candidate feature maps of all frames in the window to obtain the temporal features, and concatenate the candidate features of the current frame with the temporal features to obtain the feature map to be detected. Step S3: use convolutional layers to extract feature maps of additional scales from the feature map to be detected. Step S4: use convolutional layers on the feature maps of different scales to predict object categories and locations. The feature fusion method of the invention assigns different weights to frame features of different quality within the past time period, so that the temporal information is fused more thoroughly and the performance of the detection model is improved.

Description

Translated from Chinese
A Video Object Detection Method Based on Attention Mechanism

Technical Field

The present invention relates to computer vision, deep learning, and video object detection technology.

Background Art

Image object detection methods based on deep learning have made great progress over the past five years, for example the RCNN family of networks, the SSD network, and the YOLO family of networks. However, in fields such as video surveillance and driver assistance, video-based object detection is in much wider demand. Because video suffers from motion blur, occlusion, diverse changes in object appearance, and diverse changes in illumination, detecting objects in video with image object detection techniques alone does not give good results. Adjacent frames in a video are continuous in time and similar in space, and the positions of objects in neighbouring frames are correlated; how to exploit the temporal information about objects in the video is therefore the key to improving video object detection performance.

Current video object detection frameworks fall into three main categories. The first treats video frames as independent images and detects them with image object detection algorithms; such methods ignore temporal information and detect each frame independently, so the results are unsatisfactory. The second combines object detection with object tracking; these methods post-process the detection results in order to track the objects, so the tracking accuracy depends on the detection and errors propagate easily. The third detects only on a few key frames and then uses optical flow together with the key-frame features to generate the features of the remaining frames; although such methods exploit temporal information, computing optical flow is expensive, which makes fast detection difficult.

Summary of the Invention

The purpose of the present invention is to provide a fast and accurate video object detection method that fully fuses temporal features.

To solve the above technical problems, the present invention provides a video object detection method based on an attention mechanism, comprising the following steps:

Step S1: input the video frame image at the current time point into a MobileNet network to extract a candidate feature map;

Step S2: set a temporal feature fusion window within the past time period adjacent to the current time point; for each video frame to be fused within the fusion window, compute its image Laplacian variance; normalize the variances and use them as the fusion weights of the frames to be fused; compute, according to these fusion weights, a weighted sum of the candidate feature maps of all frames to be fused to obtain the temporal features required by the current frame; and concatenate the candidate features of the current frame with the temporal features along the channel dimension of the features, obtaining a feature map to be detected that fuses the temporal information;

Step S3: use convolutional feature extraction layers and max pooling layers to extract feature maps to be detected at additional scales from the feature map to be detected;

Step S4: on the feature maps to be detected at the different scales, use convolutional layers to predict the object categories and bounding box coordinates in the current frame.

Further, in step S1, the video frame at the current time point t is detected. The frame image I_t, of height H_I and width W_I, is first input into the MobileNet network for feature extraction, which yields the candidate feature map F_t ∈ ℝ^{C1×H1×W1}, where ℝ denotes the real numbers and C1, H1, and W1 are the number of feature channels, the height, and the width of the candidate feature map, respectively.
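A minimal sketch of this per-frame candidate feature extraction, assuming a torchvision mobilenet_v2 backbone as the MobileNet feature extractor (the patent does not specify the MobileNet variant or the layer at which the candidate feature map is taken):

```python
# Sketch of step S1 (assumed backbone: torchvision mobilenet_v2 trunk).
# The patent only states that a MobileNet network produces the candidate
# feature map F_t of shape (C1, H1, W1); the layer choice here is illustrative.
import torch
import torchvision

backbone = torchvision.models.mobilenet_v2(weights=None).features  # convolutional trunk only
backbone.eval()

def extract_candidate_features(frame: torch.Tensor) -> torch.Tensor:
    """frame: (3, H_I, W_I) image tensor -> candidate feature map of shape (C1, H1, W1)."""
    with torch.no_grad():
        return backbone(frame.unsqueeze(0)).squeeze(0)
```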

Further, in step S2, a feature fusion window of width w = s is set within the past time period of the current time point t. Let the video frame images to be fused within the feature fusion window be {I_{t−i}}, i ∈ [1, s], and the corresponding candidate feature maps be {F_{t−i}}, i ∈ [1, s]. Each frame image I_{t−i} to be fused is converted into a grayscale image G_{t−i}, and the Laplacian variance of the image is computed from the grayscale image, with the Laplacian operator evaluated at every coordinate (x, y) of the grayscale image G. By computing the second derivative of each pixel in every direction, the image Laplacian operator captures regions where pixel values change sharply and can be used to detect corners in the image; the Laplacian variance of the image reflects how the pixel values vary over the whole image: a large Laplacian variance indicates a relatively sharp image, whereas a small one indicates a relatively blurred image.

First, the Laplacian mean of each grayscale image G_{t−i} is computed by averaging the Laplacian response over all pixels, where H_I and W_I are the height and width of the grayscale image.

Next, the Laplacian variance of each grayscale image G_{t−i} is computed as the mean squared deviation of its Laplacian response from that mean.
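A small sketch of this sharpness measure; the use of OpenCV and its default 3×3 Laplacian kernel is an assumption, since the text only specifies "the Laplacian operator":

```python
# Sketch of the per-frame sharpness score: grayscale conversion, Laplacian
# response, then its mean and variance over all H_I x W_I pixels.
# cv2.Laplacian's default 3x3 kernel is an assumed choice of discrete operator.
import cv2
import numpy as np

def laplacian_variance(frame_bgr: np.ndarray) -> float:
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)   # G_{t-i}
    lap = cv2.Laplacian(gray, ddepth=cv2.CV_64F)          # second-derivative response
    mean = lap.mean()                                     # Laplacian mean
    return float(((lap - mean) ** 2).mean())              # Laplacian variance
```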

If a video frame is relatively sharp, its candidate features help detect the target; conversely, some frames are blurred by moving objects, and the candidate features of such frames are unfavourable for detecting the target. Video frames of different sharpness should therefore be assigned different fusion weights, so that the detection model concentrates on sharp rather than blurred features. First, the fusion weight α_{t−i} of every video frame to be fused is computed by normalizing the Laplacian variances of the frames within the window.

The frame candidate features within the feature fusion window are fused by weighted summation to obtain the temporal features at the current time point; the temporal features are then concatenated with the candidate features of the current frame along the channel dimension, which completes the fusion of the temporal information and yields the first feature map to be detected.
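The fusion itself might look like the following sketch. The variances are sum-normalized into the weights α_{t−i} (the text says the variances are normalized; plain sum-normalization rather than, say, a softmax is an assumption), the candidate feature maps are combined by a weighted sum, and the result is concatenated with the current frame's candidate features along the channel dimension:

```python
# Sketch of the attention-style temporal fusion of step S2.
# window_feats: list of s candidate feature maps F_{t-i}, each of shape (C1, H1, W1)
# window_vars:  list of s Laplacian variances for the same frames
# current_feat: candidate feature map F_t of the current frame, shape (C1, H1, W1)
import torch

def fuse_temporal_features(window_feats, window_vars, current_feat):
    vars_t = torch.tensor(window_vars, dtype=torch.float32)
    weights = vars_t / vars_t.sum()                   # alpha_{t-i}: sharper frames get larger weights
    stacked = torch.stack(window_feats, dim=0)        # (s, C1, H1, W1)
    temporal = (weights.view(-1, 1, 1, 1) * stacked).sum(dim=0)   # weighted sum over the window
    return torch.cat([current_feat, temporal], dim=0)             # channel concat -> (2*C1, H1, W1)
```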

Further, in step S3, once the feature map to be detected that fuses the temporal features at the current time point has been obtained, feature maps to be detected at additional scales are derived from it: a 3×3 convolutional layer and a 2×2 pooling layer perform further feature extraction while reducing the size of the feature map to be detected. Large feature maps to be detected retain rich local information and are suitable for predicting small objects, while small feature maps to be detected carry stronger global semantic information and are suitable for detecting larger objects. After e−1 rounds of feature extraction, e feature maps to be detected are finally obtained.
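A sketch of this extra-scale extraction; the channel width, the activation function, and the number of extra scales are illustrative placeholders, as the text only fixes the 3×3 convolution and 2×2 pooling:

```python
# Sketch of step S3: repeatedly apply a 3x3 convolution followed by 2x2 max
# pooling to obtain e feature maps of decreasing spatial size.
import torch
import torch.nn as nn

def build_extra_scales(in_ch: int, num_extra: int) -> nn.ModuleList:
    blocks = []
    for _ in range(num_extra):                        # e - 1 extra extraction stages
        blocks.append(nn.Sequential(
            nn.Conv2d(in_ch, in_ch, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),                    # activation is an assumption
            nn.MaxPool2d(kernel_size=2, stride=2),    # halves H and W
        ))
    return nn.ModuleList(blocks)

def extract_scales(x: torch.Tensor, blocks: nn.ModuleList):
    maps = [x]                                        # the fused map is the first map to be detected
    for blk in blocks:
        x = blk(x)
        maps.append(x)
    return maps                                       # e feature maps to be detected
```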

Further, in step S4, the additional feature extraction has produced multi-scale feature maps to be detected. Anchor boxes with prior positions are set on the maps to be detected at the different scales, and two 3×3 convolutional layers operate on these feature maps, using the channel dimension to predict, respectively, the offsets of the object bounding boxes relative to the anchor boxes and the categories of the objects. Let the number of categories be d (including the background); for each feature map to be detected, a 3×3 convolutional category prediction layer and a 3×3 convolutional bounding-box prediction layer produce the classification prediction result and the bounding-box prediction result.
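The per-scale prediction could be sketched as follows; laying the outputs out as (anchors × classes) and (anchors × 4) channels follows the usual SSD-style convention and is an assumption not stated in the text:

```python
# Sketch of step S4: 3x3 convolutional heads on one detection scale.
# num_anchors corresponds to n_i anchor boxes per pixel position and
# num_classes to d categories (including background).
import torch
import torch.nn as nn

class DetectionHead(nn.Module):
    def __init__(self, in_ch: int, num_anchors: int, num_classes: int):
        super().__init__()
        self.cls_head = nn.Conv2d(in_ch, num_anchors * num_classes, kernel_size=3, padding=1)
        self.box_head = nn.Conv2d(in_ch, num_anchors * 4, kernel_size=3, padding=1)

    def forward(self, feat: torch.Tensor):
        cls_pred = self.cls_head(feat)   # class scores for every anchor at every position
        box_pred = self.box_head(feat)   # offsets of each predicted box relative to its anchor
        return cls_pred, box_pred
```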

Brief Description of the Drawings

Figure 1 is a schematic diagram of the present invention.

Detailed Description of the Embodiments

The present invention is now described in further detail with reference to the accompanying drawing. The drawing is a simplified schematic that illustrates the basic structure of the invention in a schematic manner only, and therefore shows only the components related to the invention.

Embodiment 1

As shown in Figure 1, this embodiment provides a video object detection method based on an attention mechanism, comprising the following steps:

Step S1: input the video frame image at the current time point into a MobileNet network to extract a candidate feature map;

Step S2: set a temporal feature fusion window within the past time period adjacent to the current time point; for each video frame to be fused within the fusion window, compute its image Laplacian variance; normalize the variances and use them as the fusion weights of the frames to be fused; compute, according to these weights, a weighted sum of the candidate feature maps of all frames to be fused to obtain the temporal features required by the current frame; and concatenate the candidate features of the current frame with the temporal features along the channel dimension, obtaining a feature map to be detected that fuses the temporal information;

Step S3: use convolutional feature extraction layers and max pooling layers to extract feature maps to be detected at additional scales from the feature map to be detected;

Step S4: on the feature maps to be detected at the different scales, use convolutional layers to predict the object categories and bounding box coordinates in the current frame.

In step S1, to detect the video frame at the current time point t, the frame image I_t at the current time point is first input into MobileNet for feature extraction, where H_I and W_I are the height and width of the frame image; this yields the candidate feature map F_t ∈ ℝ^{C1×H1×W1}, where C1, H1, and W1 are the number of channels, the height, and the width of the candidate feature map, respectively.

In step S2, a feature fusion window of width w is set within the past time period of the current time point t. Let the length of the past time period be q; the feature fusion window width is then set according to the rule w = min(q, s): if the length of the past time period is greater than s, the fusion window width is set to s; if the length of the past time period is less than s, so that there are not enough features, the fusion window width is set to the length of the past time period.

Let the video frame images to be fused within the feature fusion window be {I_{t−i}}, i ∈ [1, s], and the candidate feature maps corresponding to the video frames to be fused within the feature fusion window be {F_{t−i}}, i ∈ [1, s]. Each frame image I_{t−i} to be fused is converted into a grayscale image G_{t−i}, and the Laplacian variance of the image is computed from the grayscale image, with the Laplacian operator evaluated at every coordinate (x, y) of the grayscale image G.

Here G(x, y) denotes the pixel value of the grayscale image G at coordinate (x, y). By computing the second derivative of each pixel in every direction, the image Laplacian operator captures regions where pixel values change sharply and can be used to detect corners in the image; the Laplacian variance of the image reflects how the pixel values vary over the whole image: a large Laplacian variance indicates a relatively sharp image, whereas a small one indicates a relatively blurred image.

First, the Laplacian mean of each grayscale image G_{t−i} is computed by averaging the Laplacian response over all pixels; H_I and W_I are the height and width of the grayscale image.

Next, the Laplacian variance of each grayscale image G_{t−i} is computed as the mean squared deviation of its Laplacian response from that mean.

If a video frame is relatively sharp, its candidate features help detect the target; conversely, some frames are blurred by moving objects, and the candidate features of such frames are unfavourable for detecting the target. Video frames of different sharpness should therefore be assigned different fusion weights, with sharper frames receiving larger weights, so that the detection model concentrates on sharp rather than blurred features. First, the fusion weight α_{t−i} of every video frame to be fused is computed by normalizing the Laplacian variances of the frames within the window.

The frame candidate features within the feature fusion window are fused by weighted summation to obtain the temporal features at the current time point.

The temporal features are concatenated with the candidate features of the current frame along the channel dimension, which completes the fusion of the temporal information and yields the first feature map to be detected.

In step S3, once the feature map to be detected that fuses the temporal features at the current time point has been obtained, feature maps to be detected at additional scales are derived from it: convolutional layers and pooling layers perform further feature extraction while reducing the size of the feature map to be detected. Large feature maps to be detected retain rich local information and are suitable for predicting small objects, while small feature maps to be detected carry stronger global semantic information and are suitable for detecting larger objects. After e−1 rounds of feature extraction, e feature maps to be detected are finally obtained.

In step S4, the additional feature extraction has produced multi-scale feature maps to be detected. Anchor boxes with prior positions are set on the maps to be detected at the different scales, and two convolutional layers operate on these feature maps, using the channel dimension to predict, respectively, the offsets of the object bounding boxes relative to the anchor boxes and the categories of the objects. Let the number of categories be d (including the background); for each feature map to be detected, whose number of channels, height, and width are C_{Fi}, H_{Fi}, and W_{Fi} respectively and which has n_i anchor boxes at each pixel position, the convolutional category prediction layer and the convolutional bounding-box prediction layer produce the classification prediction result and the bounding-box prediction result.

Claims (5)

In step S4, additional feature extraction yields multi-scale feature maps to be detected. Anchor boxes with prior positions are set on the maps to be detected at different scales, and two 3×3 convolutional layers operate on these feature maps, using the channel dimension to predict, respectively, the offsets of the object bounding boxes relative to the anchor boxes and the categories of the objects; each feature map to be detected is passed through a 3×3 convolutional category prediction layer and a 3×3 convolutional bounding-box prediction layer, yielding the classification prediction result and the bounding-box prediction result.
CN201910499786.9A (priority 2019-06-11, filed 2019-06-11) — Video object detection method based on attention mechanism — Active — granted as CN110287826B (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN201910499786.9A (CN110287826B) | 2019-06-11 | 2019-06-11 | Video object detection method based on attention mechanism

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN201910499786.9A (CN110287826B) | 2019-06-11 | 2019-06-11 | Video object detection method based on attention mechanism

Publications (2)

Publication Number | Publication Date
CN110287826A | 2019-09-27
CN110287826B (en) | 2021-09-17

Family

ID=68003699

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN201910499786.9A (Active, CN110287826B) | Video object detection method based on attention mechanism | 2019-06-11 | 2019-06-11

Country Status (1)

Country | Link
CN (1) | CN110287826B (en)

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN110674886A (en)*2019-10-082020-01-10中兴飞流信息科技有限公司Video target detection method fusing multi-level features
CN110751646A (en)*2019-10-282020-02-04支付宝(杭州)信息技术有限公司Method and device for identifying damage by using multiple image frames in vehicle video
CN111310609A (en)*2020-01-222020-06-19西安电子科技大学Video target detection method based on time sequence information and local feature similarity
CN112016472A (en)*2020-08-312020-12-01山东大学Driver attention area prediction method and system based on target dynamic information
CN112434607A (en)*2020-11-242021-03-02北京奇艺世纪科技有限公司Feature processing method and device, electronic equipment and computer-readable storage medium
CN112561001A (en)*2021-02-222021-03-26南京智莲森信息技术有限公司Video target detection method based on space-time feature deformable convolution fusion
CN112686913A (en)*2021-01-112021-04-20天津大学Object boundary detection and object segmentation model based on boundary attention consistency
CN113393491A (en)*2020-03-122021-09-14阿里巴巴集团控股有限公司Method and device for detecting target object from video and electronic equipment
CN113609995A (en)*2021-08-062021-11-05中国工商银行股份有限公司 Remote sensing image processing method, device and server
CN113688801A (en)*2021-10-222021-11-23南京智谱科技有限公司Chemical gas leakage detection method and system based on spectrum video
WO2022036567A1 (en)*2020-08-182022-02-24深圳市大疆创新科技有限公司Target detection method and device, and vehicle-mounted radar
CN114594770A (en)*2022-03-042022-06-07深圳市千乘机器人有限公司Inspection method for inspection robot without stopping
CN115131710A (en)*2022-07-052022-09-30福州大学 A real-time action detection method based on multi-scale feature fusion attention
CN116386134A (en)*2023-03-012023-07-04中国科学院深圳先进技术研究院 Timing action detection method, device, electronic device and storage medium

Citations (11)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN102393958A (en)*2011-07-162012-03-28西安电子科技大学Multi-focus image fusion method based on compressive sensing
CN103152513A (en)*2011-12-062013-06-12瑞昱半导体股份有限公司Image processing method and relative image processing device
CN103702032A (en)*2013-12-312014-04-02华为技术有限公司Image processing method, device and terminal equipment
CN105913404A (en)*2016-07-012016-08-31湖南源信光电科技有限公司Low-illumination imaging method based on frame accumulation
US20170127016A1 (en)*2015-10-292017-05-04Baidu Usa LlcSystems and methods for video paragraph captioning using hierarchical recurrent neural networks
CN107481238A (en)*2017-09-202017-12-15众安信息技术服务有限公司Image quality measure method and device
US20180060666A1 (en)*2016-08-292018-03-01Nec Laboratories America, Inc.Video system using dual stage attention based recurrent neural network for future event prediction
CN108921803A (en)*2018-06-292018-11-30华中科技大学A kind of defogging method based on millimeter wave and visual image fusion
CN109104568A (en)*2018-07-242018-12-28苏州佳世达光电有限公司The intelligent cleaning driving method and drive system of monitoring camera
CN109684912A (en)*2018-11-092019-04-26中国科学院计算技术研究所A kind of video presentation method and system based on information loss function
CN109829398A (en)*2019-01-162019-05-31北京航空航天大学A kind of object detection method in video based on Three dimensional convolution network

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN102393958A (en)*2011-07-162012-03-28西安电子科技大学Multi-focus image fusion method based on compressive sensing
CN103152513A (en)*2011-12-062013-06-12瑞昱半导体股份有限公司Image processing method and relative image processing device
CN103702032A (en)*2013-12-312014-04-02华为技术有限公司Image processing method, device and terminal equipment
US20170127016A1 (en)*2015-10-292017-05-04Baidu Usa LlcSystems and methods for video paragraph captioning using hierarchical recurrent neural networks
CN105913404A (en)*2016-07-012016-08-31湖南源信光电科技有限公司Low-illumination imaging method based on frame accumulation
US20180060666A1 (en)*2016-08-292018-03-01Nec Laboratories America, Inc.Video system using dual stage attention based recurrent neural network for future event prediction
CN107481238A (en)*2017-09-202017-12-15众安信息技术服务有限公司Image quality measure method and device
CN108921803A (en)*2018-06-292018-11-30华中科技大学A kind of defogging method based on millimeter wave and visual image fusion
CN109104568A (en)*2018-07-242018-12-28苏州佳世达光电有限公司The intelligent cleaning driving method and drive system of monitoring camera
CN109684912A (en)*2018-11-092019-04-26中国科学院计算技术研究所A kind of video presentation method and system based on information loss function
CN109829398A (en)*2019-01-162019-05-31北京航空航天大学A kind of object detection method in video based on Three dimensional convolution network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
XIN WANG: "Infrared dim target detection based on visual attention", Infrared Physics & Technology *
王昕: "Image sharpness evaluation algorithm based on lifting wavelet transform", Wanfang Data Knowledge Service Platform *

Cited By (20)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN110674886B (en)*2019-10-082022-11-25中兴飞流信息科技有限公司Video target detection method fusing multi-level features
CN110674886A (en)*2019-10-082020-01-10中兴飞流信息科技有限公司Video target detection method fusing multi-level features
CN110751646A (en)*2019-10-282020-02-04支付宝(杭州)信息技术有限公司Method and device for identifying damage by using multiple image frames in vehicle video
CN111310609A (en)*2020-01-222020-06-19西安电子科技大学Video target detection method based on time sequence information and local feature similarity
CN113393491A (en)*2020-03-122021-09-14阿里巴巴集团控股有限公司Method and device for detecting target object from video and electronic equipment
WO2022036567A1 (en)*2020-08-182022-02-24深圳市大疆创新科技有限公司Target detection method and device, and vehicle-mounted radar
CN114450720A (en)*2020-08-182022-05-06深圳市大疆创新科技有限公司Target detection method and device and vehicle-mounted radar
CN112016472A (en)*2020-08-312020-12-01山东大学Driver attention area prediction method and system based on target dynamic information
CN112016472B (en)*2020-08-312023-08-22山东大学Driver attention area prediction method and system based on target dynamic information
CN112434607A (en)*2020-11-242021-03-02北京奇艺世纪科技有限公司Feature processing method and device, electronic equipment and computer-readable storage medium
CN112434607B (en)*2020-11-242023-05-26北京奇艺世纪科技有限公司Feature processing method, device, electronic equipment and computer readable storage medium
CN112686913A (en)*2021-01-112021-04-20天津大学Object boundary detection and object segmentation model based on boundary attention consistency
CN112686913B (en)*2021-01-112022-06-10天津大学 Object Boundary Detection and Object Segmentation Models Based on Boundary Attention Consistency
CN112561001A (en)*2021-02-222021-03-26南京智莲森信息技术有限公司Video target detection method based on space-time feature deformable convolution fusion
CN113609995A (en)*2021-08-062021-11-05中国工商银行股份有限公司 Remote sensing image processing method, device and server
CN113688801A (en)*2021-10-222021-11-23南京智谱科技有限公司Chemical gas leakage detection method and system based on spectrum video
CN114594770A (en)*2022-03-042022-06-07深圳市千乘机器人有限公司Inspection method for inspection robot without stopping
CN114594770B (en)*2022-03-042024-04-26深圳市千乘机器人有限公司Inspection method for inspection robot without stopping
CN115131710A (en)*2022-07-052022-09-30福州大学 A real-time action detection method based on multi-scale feature fusion attention
CN116386134A (en)*2023-03-012023-07-04中国科学院深圳先进技术研究院 Timing action detection method, device, electronic device and storage medium

Also Published As

Publication numberPublication date
CN110287826B (en)2021-09-17

Similar Documents

Publication | Publication Date | Title
CN110287826A (en) A Video Object Detection Method Based on Attention Mechanism
CN111460926B (en) A video pedestrian detection method incorporating multi-target tracking cues
CN111311666B (en)Monocular vision odometer method integrating edge features and deep learning
CN108509859B (en)Non-overlapping area pedestrian tracking method based on deep neural network
CN109242884B (en) Remote sensing video target tracking method based on JCFNet network
CN102426705B (en)Behavior splicing method of video scene
CN113344932B (en) A Semi-Supervised Single-Object Video Segmentation Method
WO2008020598A1 (en)Subject number detecting device and subject number detecting method
CN111160291B (en)Human eye detection method based on depth information and CNN
CN117949942A (en)Target tracking method and system based on fusion of radar data and video data
CN116645592B (en) A crack detection method and storage medium based on image processing
CN108256462A (en)A kind of demographic method in market monitor video
CN117974895B (en) A pipeline monocular video 3D reconstruction and depth prediction method and system
CN118172399A (en)Target ranging system based on self-supervision monocular depth estimation method
Liu et al.D-vpnet: A network for real-time dominant vanishing point detection in natural scenes
Guo et al.Monocular 3D multi-person pose estimation via predicting factorized correction factors
Midwinter et al.Unsupervised defect segmentation with pose priors
CN113888604B (en) A target tracking method based on deep optical flow
CN113920161B (en) A multi-target tracking method with fusion mechanism
CN111986233A (en)Large-scene minimum target remote sensing video tracking method based on feature self-learning
CN120147619A (en) Deformable object detection method based on adaptive feature extraction network and attention mechanism
WO2023093086A1 (en)Target tracking method and apparatus, training method and apparatus for model related thereto, and device, medium and computer program product
Long et al.Detail preserving residual feature pyramid modules for optical flow
CN110910497A (en)Method and system for realizing augmented reality map
CN112380970B (en)Video target detection method based on local area search

Legal Events

Date | Code | Title | Description
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant
TR01 | Transfer of patent right

Effective date of registration:20241211

Address after:518000 1002, Building A, Zhiyun Industrial Park, No. 13, Huaxing Road, Henglang Community, Longhua District, Shenzhen, Guangdong Province

Patentee after:Shenzhen Wanzhida Technology Co.,Ltd.

Country or region after:China

Address before:100124 No. 100 Chaoyang District Ping Tian Park, Beijing

Patentee before:Beijing University of Technology

Country or region before:China

TR01 | Transfer of patent right

Effective date of registration:20250521

Address after:Room 1-103, 1st Floor, Building 3, No. 5 Guangmao Street, Daxing Economic Development Zone, Daxing District, Beijing, 102600

Patentee after:Kuaima (Beijing) Electronic Technology Co.,Ltd.

Country or region after:China

Address before:518000 1002, Building A, Zhiyun Industrial Park, No. 13, Huaxing Road, Henglang Community, Longhua District, Shenzhen, Guangdong Province

Patentee before:Shenzhen Wanzhida Technology Co.,Ltd.

Country or region before:China

TR01 | Transfer of patent right
