Technical Field
The present invention relates to image processing technology, and in particular to an intelligent video surveillance method for public scenes based on visual saliency and deep autoencoding.
Background
In recent years, surveillance equipment has been deployed in all walks of life, and modern airports, railway stations, hospitals, and other public scenes are covered by thousands of cameras. Given the huge volume of video data, relying on security personnel alone to analyze the footage, filter out the normal behavior of normal scenes, and detect abnormal behavior in time is an enormous workload; moreover, as the amount of material to analyze grows, the attention and working efficiency of the operators drop markedly. To free people from this mass of analysis and interpretation, the study of an intelligent video surveillance method is of great significance.
An intelligent surveillance system mainly involves three parts. The first is the extraction of motion information from the video, i.e., extracting the moving targets; since the surveillance cameras are fixed, this part mainly extracts the motion information of the foreground targets in the video. The second is the extraction of behavioral features, a major challenge for intelligent surveillance systems, since the extracted features must be unique, robust, and so on. The third is abnormal-behavior detection, which divides into rule-based detection, such as checking whether a target violates certain predefined rules, and statistics-based detection, i.e., finding behavior patterns in a large number of samples and identifying abnormal behavior with pattern-recognition methods and models. Most existing techniques are of the second kind and rely on pattern recognition, but their accuracy is lower than that of deep-learning methods; the present invention therefore identifies abnormal behavior with the more accurate deep autoencoder network from deep learning.
Summary of the Invention
In view of this, the main purpose of the present invention is to provide an intelligent video surveillance method for public scenes based on visual saliency and deep autoencoding that offers high detection accuracy and strong robustness: it greatly improves detection accuracy and can at the same time handle abnormal-behavior recognition in a variety of scenes.
To achieve the above purpose, the technical solution proposed by the present invention is an intelligent video surveillance method for public scenes based on visual saliency and deep autoencoding, implemented in the following steps:
Step 1. Read the video of a public scene and decompose it into individual frames; then compute the visual saliency map of each frame with a band-pass filter built from a difference of Gaussians, so as to extract motion information;
Step 2. On the basis of the per-frame saliency maps, compute the optical flow between adjacent frames, thereby extracting the motion information of the foreground targets and obtaining motion features;
Step 3. The anomaly-recognition algorithm comprises a training process and a testing process. During training, compute the visual saliency maps of the training samples and extract motion features; convert the resulting optical-flow features into column vectors as the input of a deep autoencoder network; using the dimensionality reduction of the encoder and the reconstruction of the decoder in the deep autoencoder network, reconstruct the input by minimizing a loss function and thereby train the network;
Step 4. After the deep autoencoder network has been trained by reconstructing the input while minimizing the loss function, extract the encoder part of the trained network as the network used during testing. Once the saliency maps and motion features of the training samples and test samples have been computed, feed the optical-flow features of each sample into the encoder; through the dimensionality-reduction operation of the encoder network, a low-dimensional vector extracts the features that best represent the input;
Step 5. Visualize the outputs of the encoder network during testing in three-dimensional coordinates, and represent the distribution range of the dimension-reduced training samples with a hypersphere;
Step 6. For the anomaly recognition of an input test sample: if the visualized test sample falls within the hypersphere, the test sample is judged to be a normal sequence; conversely, if it falls outside the hypersphere, the test sample is judged to be an abnormal sequence. This realizes the recognition of abnormal behavior and the intelligent surveillance of video in public scenes.
The visual saliency map in step 1 is computed as follows:
Step i) For a frame of the video, the saliency of each point in the image is defined as:
S(x,y) = ||Iμ − Iwhc(x,y)||
where Iμ is the mean color of the input image's pixels in Lab space, Iwhc(x,y) is the value of each pixel in Lab space after the image has been Gaussian-blurred, and S(x,y), the saliency of each pixel, is the Euclidean distance between the two;
Step ii) First apply a Gaussian blur to the image. The two-dimensional Gaussian distribution function is:
G(x,y) = (1 / (2πσ²)) · exp(−(x² + y²) / (2σ²))
where x and y are the horizontal and vertical coordinates of the 8 points around the center point, σ is the standard deviation of the Gaussian distribution function, and G(x,y) is the blur weight of each pixel;
For a color image, convolve each of the R, G, and B channels with the Gaussian kernel and merge the per-channel results; this yields the Gaussian-blurred image. Convert both the blurred image and the original image to Lab space.
Step iii) Compute the Lab value Iwhc(x,y) of every pixel of the blurred image and the mean Lab color Iμ of the original image's pixels, and take the Euclidean distance between the two to obtain the visual saliency map of the original image.
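As an illustration only, the following Python sketch implements the saliency computation above with OpenCV and NumPy; the 5×5 blur kernel and the min–max normalization at the end are assumptions, not requirements of the method.

```python
import cv2
import numpy as np

def saliency_map(frame_bgr):
    """S(x,y) = ||I_mu - I_whc(x,y)||: per-pixel Lab-space distance between
    the mean image color and the Gaussian-blurred image."""
    # Step ii): Gaussian-blur the frame (kernel size is an assumption).
    blurred = cv2.GaussianBlur(frame_bgr, (5, 5), 0)
    # Convert both the blurred and the original image to Lab space.
    lab_orig = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2LAB).astype(np.float64)
    lab_blur = cv2.cvtColor(blurred, cv2.COLOR_BGR2LAB).astype(np.float64)
    # Mean Lab color of the original image, I_mu.
    mu = lab_orig.reshape(-1, 3).mean(axis=0)
    # Step iii): Euclidean distance between I_mu and I_whc(x,y) at each pixel.
    sal = np.linalg.norm(lab_blur - mu, axis=2)
    # Normalize to [0, 1] for thresholding or display (assumption).
    return (sal - sal.min()) / (sal.max() - sal.min() + 1e-12)
```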
The specific process of training the deep autoencoder network in step 3 is as follows:
Step i) The training set contains only normal samples. During training, compute the optical-flow features between adjacent frames of the training samples and convert them into column vectors as the input of the deep autoencoder network. An autoencoder is a fully connected network with an input layer–hidden layer–output layer structure whose output is made as close as possible to its input. The whole network consists of an encoder (the left half) and a decoder (the right half): the encoder performs dimensionality reduction on the data and extracts the feature information that best represents the input, while the decoder takes the encoder's output as its input and reconstructs the original input of the whole network with as small an error as possible. A deep autoencoder network adds several hidden layers to the encoder and decoder networks of the basic autoencoder;
Step ii) Take the optical flow as the input X = {x1, x2, ..., xn}. The network's activation function is the ReLU function f(x) = max(0, x), where x is the input (independent variable) of the activation function and f(x) its output (dependent variable). The output of the first half of the network, the encoder, is Z = f(wX + b), where w is the weight of the encoder network and b its bias; Z, the encoder's output, is the dimension-reduced result of X and represents the feature information of X. The output of the second half, the decoder, is Y = f(w'Z + b'), where w' is the weight of the decoder network and b' its bias; Y is the reconstruction of X. The whole network can thus be written as Y = f(w'(f(wX + b)) + b').
Step iii) The loss function is the mean squared error MSE = ||X − Y||² = ||X − f(w'(f(wX + b)) + b')||². Reconstructing the input by minimizing the loss function means driving this mean squared error to its minimum through the training process of the deep autoencoder network; the output is then a reconstruction of the input.
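A minimal PyTorch sketch of steps i)–iii) follows, assuming flattened optical-flow column vectors of length input_dim; the hidden-layer widths, the learning rate, and the use of Adam are illustrative assumptions (the patent fixes only the 3-neuron bottleneck used later for visualization):

```python
import torch
import torch.nn as nn

# input_dim: length of one flattened optical-flow feature vector (assumed).
input_dim = 2 * 64 * 64  # e.g. magnitude + direction of a 64x64 flow field

encoder = nn.Sequential(
    nn.Linear(input_dim, 512), nn.ReLU(),
    nn.Linear(512, 64), nn.ReLU(),
    nn.Linear(64, 3), nn.ReLU(),          # bottleneck Z = f(wX + b), 3 neurons
)
decoder = nn.Sequential(
    nn.Linear(3, 64), nn.ReLU(),
    nn.Linear(64, 512), nn.ReLU(),
    nn.Linear(512, input_dim), nn.ReLU(), # Y = f(w'Z + b'); flow features are nonnegative
)
autoencoder = nn.Sequential(encoder, decoder)

optimizer = torch.optim.Adam(autoencoder.parameters(), lr=1e-3)
criterion = nn.MSELoss()                  # MSE = ||X - Y||^2

def train_step(batch):
    """One gradient step on a (N, input_dim) batch of normal flow vectors."""
    optimizer.zero_grad()
    recon = autoencoder(batch)            # Y = f(w'(f(wX + b)) + b')
    loss = criterion(recon, batch)        # reconstruct the input, minimize MSE
    loss.backward()
    optimizer.step()
    return loss.item()
```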
The process in step 4 of extracting the encoder part of the trained deep autoencoder network as the network used during testing is:
Step i) First, the image preprocessing is similar to that of the training process: the column vectors converted from the optical-flow features of the training and test samples serve as the input of the network;
Step ii) Unlike the network used during training, the testing process extracts the encoder of the trained deep autoencoder network as its network. Using the dimensionality reduction of the encoder network, the input is compressed to 3 neurons; by the nature of the encoder, these three neurons can carry all of the information of the input.
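Continuing the training sketch above, the testing network is obtained simply by reusing the trained encoder; train_flows and test_flows below are placeholder tensors standing in for the real optical-flow column vectors:

```python
import torch

# Placeholders for the prepared (N, input_dim) optical-flow feature matrices.
train_flows = torch.randn(100, input_dim)
test_flows = torch.randn(20, input_dim)

encoder.eval()                           # reuse the encoder trained above
with torch.no_grad():
    train_codes = encoder(train_flows)   # (100, 3) codes of normal samples
    test_codes = encoder(test_flows)     # (20, 3) codes to be classified
```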
To sum up, the intelligent video surveillance method for public scenes based on visual saliency and deep autoencoding according to the present invention includes: decomposing a public-scene video into single frames; extracting motion information from the decomposed frames using visual saliency; and then computing the optical flow of moving objects between adjacent frames, including the magnitude and direction of the motion velocity. The subsequent detection is divided into a training process and a testing process. During training, the optical flow of the training samples is fed to the deep autoencoder, and the whole deep autoencoder network is trained by minimizing the loss function. During testing, the optical flow of the training and test samples serves as input; the encoder of the trained deep autoencoder network is extracted, and the features of the input are obtained through the encoder's dimensionality reduction. By the properties of the encoder network, the dimension-reduced features can represent all of the input's information. The dimension-reduced results are then visualized, and a hypersphere represents the visualization range of the training samples. When a test sample is input, it is visualized in the same way: if the visualized result falls within the hypersphere, the sample is judged normal; conversely, if it falls outside the hypersphere, the sample is judged abnormal, thereby realizing intelligent video surveillance.
Compared with the prior art, the advantages of the present invention are:
(1) The present invention focuses on the recognition of abnormal behavior. It first extracts motion information with visual saliency and the optical-flow method, and then extracts features with a deep autoencoder from deep learning for training and detection. Because the deep autoencoder reconstructs the input by minimizing the loss function, and the encoder's dimensionality reduction extracts low-dimensional features that represent the input information, the extracted features are highly robust; it is precisely this robustness that allows abnormal behavior to be recognized efficiently and improves the accuracy of the algorithm. Since the normal range is represented by a hypersphere, judging an anomaly only requires checking where the visualized result lies, so the judgment is fast.
(2) The present invention features high detection accuracy and strong robustness, and can be widely applied to community security and to the protection of hospitals, banks, and other public scenes. By using the optical-flow method and the deep autoencoder network of deep learning, low-dimensional features that can represent all the information of an object are extracted, making the judgment accurate and robust; since the normal range is represented by a hypersphere, judging an anomaly only requires checking where the visualized result lies, so the judgment is fast.
Description of the Drawings
Fig. 1 is a flow chart of the implementation of the present invention.
Detailed Description
To make the objectives, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawing and specific embodiments.
The intelligent video surveillance method for public scenes based on visual saliency and deep autoencoding according to the present invention includes: decomposing a public-scene video into single frames; extracting motion information from the decomposed frames using visual saliency; and then computing the optical flow of moving objects between adjacent frames, including the magnitude and direction of the motion velocity. The subsequent detection is divided into a training process and a testing process. During training, the optical flow of the training samples is fed to the deep autoencoder, and the whole deep autoencoder network is trained by minimizing the loss function. During testing, the optical flow of the training and test samples serves as input; the encoder of the trained deep autoencoder network is extracted, and the features of the input are obtained through the encoder's dimensionality reduction. By the properties of the encoder network, the dimension-reduced features can represent all of the input's information. The dimension-reduced results are then visualized, and a hypersphere represents the visualization range of the training samples. When a test sample is input, it is visualized in the same way: if the visualized result falls within the hypersphere, the sample is judged normal; conversely, if it falls outside the hypersphere, the sample is judged abnormal, thereby realizing intelligent video surveillance.
As shown in Fig. 1, the present invention is implemented in the following steps:
Step 1) Read the video of a public scene and decompose it into individual frames; then compute the visual saliency map of each frame with a band-pass filter built from a difference of Gaussians, so as to extract motion information;
Step 2) On the basis of the per-frame saliency maps, compute the optical flow between adjacent frames, thereby extracting the motion information of the foreground targets and obtaining motion features (an illustrative sketch of this step follows step 6 below);
Step 3) The anomaly-recognition algorithm comprises a training process and a testing process. During training, compute the visual saliency maps of the training samples and extract motion features; convert the optical-flow features obtained for each frame into column vectors as the input of a deep autoencoder network; using the dimensionality reduction of the encoder and the reconstruction of the decoder in the deep autoencoder network, reconstruct the input by minimizing a loss function and thereby train the network;
Step 4) After the deep autoencoder network has been trained by reconstructing the input while minimizing the loss function, extract the encoder part of the trained network as the network used during testing. Once the saliency maps and motion features of the training samples and test samples have been computed, feed the optical-flow features of each image-frame sample into the encoder; through the dimensionality-reduction operation of the encoder network, a low-dimensional vector extracts the features that best represent the input;
Step 5) Visualize the outputs of the encoder network during testing in three-dimensional coordinates, and represent the distribution range of the dimension-reduced training samples with a hypersphere;
Step 6) For the anomaly recognition of an input test sample: if the visualized test sample falls within the hypersphere, the test sample is judged to be a normal sequence; conversely, if it falls outside the hypersphere, the test sample is judged to be an abnormal sequence. This realizes the recognition of abnormal behavior and the intelligent surveillance of video in public scenes.
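The patent does not name a particular optical-flow algorithm for step 2); as an illustrative sketch, dense Farnebäck flow from OpenCV can be computed between adjacent grayscale frames, restricted to salient pixels, and flattened into the column vector the network consumes (the Farnebäck parameters and the binarized saliency mask sal_mask are assumptions):

```python
import cv2
import numpy as np

def flow_feature(prev_gray, curr_gray, sal_mask):
    """Dense optical flow between two adjacent frames, kept only where the
    saliency mask marks foreground, flattened into one column vector."""
    flow = cv2.calcOpticalFlowFarneback(
        prev_gray, curr_gray, None,
        pyr_scale=0.5, levels=3, winsize=15,
        iterations=3, poly_n=5, poly_sigma=1.2, flags=0)
    flow = flow * sal_mask[..., None]          # suppress non-salient motion
    # Magnitude and direction of the motion velocity at every pixel.
    mag, ang = cv2.cartToPolar(flow[..., 0], flow[..., 1])
    return np.concatenate([mag.ravel(), ang.ravel()])[:, None]  # column vector
```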
The visual saliency map in step 1) is computed as follows:
Step i) For a frame of the video, the saliency of each point in the image is defined as:
S(x,y) = ||Iμ − Iwhc(x,y)||
where Iμ is the mean color of the input image's pixels in Lab space, Iwhc(x,y) is the value of each pixel in Lab space after the image has been Gaussian-blurred, and S(x,y), the saliency of each pixel, is the Euclidean distance between the two;
Step ii) First apply a Gaussian blur to the image. The two-dimensional Gaussian distribution function is:
G(x,y) = (1 / (2πσ²)) · exp(−(x² + y²) / (2σ²))
where x and y are the horizontal and vertical coordinates of the 8 points around the center point, σ is the standard deviation of the Gaussian distribution function, and G(x,y) is the blur weight of each pixel;
In the R, G, and B channels, convolve the original image with the Gaussian kernel and merge the per-channel results; this yields the Gaussian-blurred image. Convert both the blurred image and the original image to Lab space.
Step iii) Compute the Lab value Iwhc(x,y) of every pixel of the blurred image and the mean Lab color Iμ of the original image's pixels, and take the Euclidean distance between the two to obtain the visual saliency map of the original image.
The principle of training the deep autoencoder network in step 3) is as follows:
Step i) The training set contains only normal samples. During training, compute the optical-flow features between adjacent frames of the training samples and convert them into column vectors as the input of the deep autoencoder network. An autoencoder is a fully connected network with an input layer–hidden layer–output layer structure whose output is made as close as possible to its input. The whole network consists of an encoder and a decoder: the encoder performs dimensionality reduction on the data and extracts the feature information that best represents the input, while the decoder takes the encoder's output as its input and reconstructs the original input of the whole network with as small an error as possible. A deep autoencoder network adds several hidden layers to the encoder and decoder networks of the basic autoencoder;
Step ii) Take the optical flow as the input X = {x1, x2, ..., xn}. The network's activation function is the ReLU function f(x) = max(0, x), where x is the input (independent variable) of the activation function and f(x) its output (dependent variable). The output of the first half of the network, the encoder, is Z = f(wX + b), where w is the weight of the encoder network and b its bias; Z, the encoder's output, is the dimension-reduced result of X and represents the feature information of X. The output of the second half, the decoder, is Y = f(w'Z + b'), where w' is the weight of the decoder network and b' its bias; Y is the reconstruction of X. The whole network can thus be written as Y = f(w'(f(wX + b)) + b').
Step iii) The loss function is the mean squared error MSE = ||X − Y||² = ||X − f(w'(f(wX + b)) + b')||². Reconstructing the input by minimizing the loss function means driving this mean squared error to its minimum through the training process of the deep autoencoder network; the output is then a reconstruction of the input.
The specific process in step 4) of extracting the encoder part of the trained deep autoencoder network as the network used during testing is:
Step i) First, the image preprocessing is similar to that of the training process: the column vectors converted from the optical-flow features of the training and test samples serve as the input of the network;
Step ii) Unlike the network used during training, the testing process extracts the encoder of the trained deep autoencoder network as its network. Using the dimensionality reduction of the encoder network, the input is compressed to 3 neurons, and these 3 neurons can carry all of the information of the input.
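The patent states only that a hypersphere bounds the visualized training codes (steps 5 and 6); as a sketch under that assumption, one simple fitting rule is the centroid of the training codes with the maximum distance from it as radius:

```python
import numpy as np

# Placeholders standing in for the 3-dimensional encoder outputs (codes)
# of the training and test samples computed during the testing process.
rng = np.random.default_rng(0)
train_codes = rng.normal(size=(100, 3))
test_codes = rng.normal(size=(20, 3))

def fit_hypersphere(codes):
    """Bound the training codes with a sphere: centroid as center, maximum
    distance from the centroid as radius (the fitting rule is an assumption)."""
    center = codes.mean(axis=0)
    radius = np.linalg.norm(codes - center, axis=1).max()
    return center, radius

def is_abnormal(code, center, radius):
    """Step 6: a code falling outside the sphere is an abnormal sequence."""
    return np.linalg.norm(code - center) > radius

center, radius = fit_hypersphere(train_codes)
flags = [is_abnormal(c, center, radius) for c in test_codes]
```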
To sum up, the above are only preferred embodiments of the present invention and are not intended to limit its scope of protection. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present invention shall be included within the scope of protection of the present invention.