CN108509880A - A kind of video personage behavior method for recognizing semantics - Google Patents

A kind of video personage behavior method for recognizing semantics

Info

Publication number
CN108509880A
Authority
CN
China
Prior art keywords
video
personage
feature
behavior
lstm
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810236363.3A
Other languages
Chinese (zh)
Inventor
陈志
高翔
岳文静
杨天明
陈璐
掌静
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Post and Telecommunication University
Original Assignee
Nanjing Post and Telecommunication University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Post and Telecommunication University
Priority to CN201810236363.3A
Publication of CN108509880A
Legal status: Pending (current)


Abstract

The invention discloses a method for recognizing the semantics of character behavior in video. Aiming to recognize the behavior semantics and social relationships of the characters in a video, the method first uses convolutional neural networks to concurrently extract mid-level semantic features of three aspects of each video scene: character identity, character behavior, and context. The semantic information of these three aspects is then fused by a two-layer recurrent neural network, which completes the recognition of character behavior semantics in the video. The method effectively bridges the gap between the low-level features and the high-level semantics of a video scene; the extracted video features are comprehensive, covering character facial features, character behavior features, and context features, which improves recognition accuracy. By establishing mid-level features between the low-level features and the high-level semantics, the invention overcomes the difficulty of modeling complex behavior in real scenes with low-level features alone, thereby closing the gap between low-level features and high-level semantics.

Description

Translated from Chinese
A Semantic Recognition Method for Video Character Behavior

Technical Field

The invention relates to machine learning. It completes the high-level semantic recognition of video scenes mainly through a method that converts low-level features into high-level semantics, and belongs to the intersecting application fields of deep learning, pattern recognition, and video information processing.

Background

Video semantic analysis performs semantic analysis on the ordered frame images of a video. A video may contain multiple scenes, and each scene consists of an ordered set of frame images. To analyze video semantics effectively, the video must first be preprocessed: its content is segmented into shots and organized into scenes. First, the video is segmented by detecting shots and locating shot transitions. Second, the set of key frames within each shot is identified, and shots are clustered by computing the similarity between the key-frame images of all shots. Then, on the basis of this scene structure, the semantics of the characters in the video are studied.
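For illustration, the shot segmentation described above can be sketched as a histogram comparison between consecutive frames; the following Python snippet is a minimal sketch under that assumption (the 8-bin histograms and the 0.5 threshold are illustrative choices, not values specified by the invention).

```python
import cv2

# Minimal sketch of shot-boundary detection via frame-histogram
# differences. Histogram settings and the threshold are assumptions.
def detect_shot_boundaries(video_path, threshold=0.5):
    cap = cv2.VideoCapture(video_path)
    boundaries, prev_hist, idx = [], None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        # Color histogram of the current frame.
        hist = cv2.calcHist([frame], [0, 1, 2], None,
                            [8, 8, 8], [0, 256] * 3)
        hist = cv2.normalize(hist, hist).flatten()
        if prev_hist is not None:
            # Bhattacharyya distance: a large value suggests a shot change.
            d = cv2.compareHist(prev_hist, hist, cv2.HISTCMP_BHATTACHARYYA)
            if d > threshold:
                boundaries.append(idx)
        prev_hist, idx = hist, idx + 1
    cap.release()
    return boundaries
```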

The semantic analysis of characters in video typically centers on the behavior semantics of the characters, while the semantics of the contextual objects formed by everything in the video other than the characters serve as auxiliary information that improves the accuracy of the character semantics. Current video semantic analysis generally proceeds by learning image features, whose representations fall into two main categories: low-level features and mid-level features. Low-level features are obtained by applying various transformations to the video's pixels and carry no semantic meaning.

Summary of the Invention

Technical problem: The purpose of the invention is to provide a method for recognizing the semantics of character behavior in video. It mainly solves the semantic recognition problem, namely how to complete the conversion from low-level features to high-level semantics and thereby bridge the gap between them. The conversion from low-level features to high-level semantics consists of low-level feature extraction, mid-level feature fusion, and recognition with a long short-term memory (LSTM) network. First, mid-level semantic features are extracted and fused through convolutional neural networks (CNNs), sampling, and fully connected operations; then the fused mid-level semantics are sequence-modeled with an LSTM to complete the recognition.

Technical solution: The invention solves the problem of mid-level semantic feature extraction by using CNNs to extract the features of characters, context, and actions.

The invention solves the problem of high-level semantic recognition mainly by using an LSTM-based semantic sequence model to recognize character semantics in video sequences.

The deep-learning-based video scene semantic recognition method of the invention comprises the following steps:

Step 1): Describe and extract the low-level features of the video images. The specific steps are as follows:

Step 1.1): Describe and extract the low-level features of character identity by detecting and preprocessing the faces of the characters in the video scene. Face detection is implemented with a local binary patterns histograms face detector; after detection, the face images are converted to grayscale, shrunk, and equalized.

Step 1.2): Describe and extract the low-level features of character behavior by fusing the spatio-temporal features of the character image sequences in the video scene. The spatio-temporal features are character motion-trajectory features obtained from the original video frames and from the optical-flow images between frames.

Step 1.3): Describe and extract the low-level context features by extracting features of the environment in which the scene takes place and of the objects that appear in the scene.

Step 2): Extract mid-level semantic features with pre-trained CNNs. The specific steps are as follows:

Step 2.1): Mid-level character-identity feature extraction. A convolutional neural network is trained on a face dataset, and the feature vector of the fully connected layer of the pre-trained network is used as the mid-level character-identity feature.

Step 2.2): Mid-level character-behavior feature extraction. Two convolutional neural networks are fused to recognize character behavior in the video, and the feature vector of the fully connected layer of the trained fusion network is used as the mid-level character-behavior feature.

Step 2.3): Mid-level context feature extraction. A convolutional neural network is trained using the ImageNet dataset as experimental data, and the feature vector of the fully connected layer of the pre-trained network is used as the mid-level context feature.

Step 3): LSTM-based semantic recognition of video character behavior. The specific steps are as follows:

Step 3.1): Build an LSTM-based semantic recognition model for video character behavior. The model consists of two LSTM layers: the first receives the video's semantic feature sequence and encodes it; the second takes the first layer's encoding as input and decodes it into a semantic description sentence.

Step 3.2): LSTM-based semantic sequence recognition. First, the CNN-extracted character-identity, character-behavior, and context semantic feature vectors are taken as input and encoded by the first LSTM layer into a fixed-length output vector; this vector is then fed to the second LSTM layer, which decodes it into a sentence describing the semantics of the video sequence. The first and second layers share a single LSTM, so parameters are shared between the encoding and decoding stages, which reduces training complexity.

Step 4): Fuse the extracted character-identity, character-behavior, and context features, and feed the fused features into the LSTM-based video semantic recognition model to perform video semantic recognition.

As a further refinement of the deep-learning-based video scene semantic recognition method of the invention, the spatio-temporal features in step 1.2) are character motion-trajectory features obtained from the original video frames and from the optical-flow images between frames.

As a further refinement of the deep-learning-based video scene semantic recognition method of the invention, the training of the convolutional neural network in step 2.3) is completed using the ImageNet dataset as experimental data.

As a further refinement of the deep-learning-based video scene semantic recognition method of the invention, the model in step 3.2) consists of two LSTM layers whose first and second layers share a single LSTM, so parameters are shared between the encoding and decoding stages, which reduces training complexity.

Beneficial effects: The video character behavior semantic recognition method proposed by the invention is a deep-learning-based video scene semantic recognition method whose effects are as follows:

(1) The invention provides a method for mid-level video features that effectively bridges the gap between the low-level features and the high-level semantics of a video scene.

(2) The method of the invention extracts comprehensive video features, including character facial features, character behavior features, and context features, which improves the accuracy of semantic recognition.

(3) The LSTM-based two-layer video scene semantic recognition model of the invention is an end-to-end model, which improves the accuracy of semantic recognition for long videos.

Description of the Drawings

Figure 1 is a structural diagram of the video scene semantic extraction method.

Figure 2 shows feature extraction with the CNN-People architecture.

Figure 3 shows feature extraction with the D-CNNs-Activity architecture.

Figure 4 shows feature extraction with the CNN-Context architecture.

Detailed Description

Certain embodiments of the invention are described below in more detail with reference to the accompanying drawings.

With reference to Figure 1, the specific embodiments of the invention are as follows:

1) Describe and extract the low-level features of the video images. The specific steps are as follows:

1.1) Describe and extract the low-level features of character identity by detecting and preprocessing the faces of the characters in the video scene. Face detection is implemented with a Local Binary Patterns Histograms (LBPH) face detector; after detection, the face images are converted to grayscale, shrunk, and equalized. After this preprocessing, the LBPH face detector yields a two-dimensional feature vector for each face image, completing the low-level feature description and extraction of character identity. Face detection and extraction use the face detection interface provided by OpenCV3, from which the character semantic input of the video scene is extracted.
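A minimal OpenCV sketch of this detection-and-preprocessing step is given below; the LBP cascade file name and the 100x100 target size are assumptions, since the text specifies only an LBPH-based detector followed by grayscale, shrinking, and equalization.

```python
import cv2

# Sketch of step 1.1): LBP-cascade face detection followed by the
# grayscale / shrink / equalize preprocessing. The cascade file and
# the 100x100 size are illustrative assumptions.
detector = cv2.CascadeClassifier('lbpcascade_frontalface.xml')

def extract_face_inputs(frame):
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)            # grayscale
    faces = detector.detectMultiScale(gray, scaleFactor=1.1,
                                      minNeighbors=5)
    processed = []
    for (x, y, w, h) in faces:
        face = cv2.resize(gray[y:y + h, x:x + w], (100, 100))  # shrink
        face = cv2.equalizeHist(face)                          # equalize
        processed.append(face)
    return processed
```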

1.2) Describe and extract the low-level features of character behavior by fusing the spatial and temporal features of the character image sequences in the video scene. The goodFeaturesToTrack() function of OpenCV3 is used to obtain the strong corners in the image as the feature points to track. The pyramidal optical-flow function calcOpticalFlowPyrLK() then processes two consecutive input images: a set of feature points is selected in the first image, and the function outputs the positions of those points in the next image. The tracking results are then filtered to remove bad feature points and the character's motion trajectory is marked out, completing the extraction of the character-behavior semantics of the video scene.
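The tracking step itself can be sketched as follows (parameter values are illustrative assumptions; the text names only the two OpenCV functions):

```python
import cv2

# Sketch of step 1.2): corners from goodFeaturesToTrack() are tracked
# into the next frame with pyramidal Lucas-Kanade optical flow, and
# badly tracked points are filtered out.
def track_trajectory(prev_frame, next_frame):
    prev_gray = cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY)
    next_gray = cv2.cvtColor(next_frame, cv2.COLOR_BGR2GRAY)
    # Strong corners of the first image serve as the points to track.
    p0 = cv2.goodFeaturesToTrack(prev_gray, maxCorners=200,
                                 qualityLevel=0.01, minDistance=7)
    # Positions of the same points in the next image.
    p1, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, next_gray, p0, None)
    ok = status.flatten() == 1        # drop points that were lost
    return p0[ok], p1[ok]             # start/end of each trajectory segment
```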

1.3) Describe and extract the low-level context features; the targets of feature extraction are the contextual environment and the objects that appear in the scene. Object features use the 4,096-D DeCAF general-purpose visual features. For location features, several pre-trained context detectors extract the location features of the scene, and the results form a set that serves as the extracted context feature representation, completing the extraction of the context semantics of the video scene.

2) Extract mid-level semantic features with pre-trained CNNs. The specific steps are as follows:

2.1) Mid-level character-identity feature extraction. CNN-People is trained on the Olivetti Faces dataset; its structure is shown in Figure 2. The CNN-People model has two convolution-and-subsampling layers. The fully connected layer corresponds to the hidden layer of a multilayer perceptron. The output layer is the classifier, which uses multi-class logistic regression. Overall, the CNN model is built as a serial structure in which the output of one layer feeds the input of the next. With the pre-trained CNN-People, the 4096-dimensional feature vector of the forward-propagated fc7 layer serves as the mid-level character-identity feature.
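The text fixes only the overall shape of CNN-People (two convolution-plus-subsampling stages, a 4096-D fully connected "fc7"-style layer used as the feature, and a multi-class logistic-regression output); the following PyTorch sketch fills in illustrative channel counts, kernel sizes, and a 64x64 grayscale input as assumptions.

```python
import torch
import torch.nn as nn

# Sketch of a CNN-People-like network. Channel counts, kernel sizes,
# and the 1x64x64 input are illustrative assumptions.
class CNNPeople(nn.Module):
    def __init__(self, num_identities):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 5), nn.ReLU(), nn.MaxPool2d(2),   # conv+subsample 1
            nn.Conv2d(16, 32, 5), nn.ReLU(), nn.MaxPool2d(2),  # conv+subsample 2
        )
        self.fc = nn.Linear(32 * 13 * 13, 4096)        # "fc7"-style hidden layer
        self.classifier = nn.Linear(4096, num_identities)

    def forward(self, x, return_features=False):
        x = self.features(x).flatten(1)
        feat = torch.relu(self.fc(x))
        if return_features:
            return feat                 # 4096-D mid-level identity feature
        return self.classifier(feat)    # logits for multi-class logistic regression
```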

2.2) Mid-level character-behavior feature extraction. A two-stream convolutional network fuses two CNNs to recognize character behavior in the video. Concretely, two CNNs are first built separately, one over single frames and one over multi-frame motion information (optical flow), and the outputs of the two networks are then fused by convolution at the score layer. The parameters are tuned by feed-forward and back-propagation on the UCF101 dataset, completing the training of the two-stream network. The trained D-CNNs-Activity network is applied to the images of the video scene, and the feature vector of its fc7 fully connected layer is output as the mid-level character-behavior feature. The D-CNNs-Activity network structure is shown in Figure 3.
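A sketch of this two-stream arrangement follows; the backbone layers, the 10 stacked optical-flow channels, and the simple additive score fusion are assumptions standing in for the convolutional score-layer fusion described above.

```python
import torch
import torch.nn as nn

# Sketch of a D-CNNs-Activity-style two-stream network: one CNN over an
# RGB frame, one over stacked optical-flow images, fused at the score level.
def make_stream(in_channels):
    return nn.Sequential(
        nn.Conv2d(in_channels, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(7),
        nn.Flatten(), nn.Linear(64 * 7 * 7, 4096), nn.ReLU(),  # fc7-style layer
    )

class TwoStreamActivity(nn.Module):
    def __init__(self, num_actions):
        super().__init__()
        self.spatial = make_stream(3)     # single RGB frame
        self.temporal = make_stream(10)   # stacked optical-flow images
        self.spatial_score = nn.Linear(4096, num_actions)
        self.temporal_score = nn.Linear(4096, num_actions)

    def forward(self, rgb, flow):
        s = self.spatial_score(self.spatial(rgb))
        t = self.temporal_score(self.temporal(flow))
        return s + t  # score-level fusion of the two streams
```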

2.3) Mid-level context feature extraction. The ImageNet dataset is selected as the experimental data, and the images obtained from the dataset are preprocessed with the method introduced in section 1.3) to obtain the low-level context feature description. The network of the handwriting recognition model LeNet is then trained on the ImageNet dataset, and the 4096-dimensional feature vector of the forward-propagated fc7 layer of the trained CNN-Context network serves as the mid-level context feature. CNN-Context is shown in Figure 4.

3) LSTM-based semantic recognition of video character behavior. The specific steps are as follows:

3.1) Build the LSTM-based semantic sequence model. The model consists of two LSTM layers: the first receives the video's semantic feature sequence and encodes it; the second takes the first layer's encoding as input and decodes it into a semantic description sentence. The model operates in two main stages, an encoding stage and a decoding stage, as shown in Figure 4.

3.2) LSTM-based semantic sequence recognition. Each layer of the two-layer LSTM model has 1000 hidden units for recording encoding information. During the first few time steps, the first LSTM layer encodes two inputs. The first is the mid-level semantic features of character identity, character behavior, and context extracted by the CNNs, from which the hidden-layer output Ht is computed through the four interacting transformations inside the LSTM module; the second is the padded input semantic word description <pad>. No loss is applied while the first LSTM layer encodes. When the mid-level semantic features of all the video sequences have been input, decoding begins. The second LSTM layer likewise takes two inputs: one is the hidden-layer information Ht output by the first layer's encoding; and since the mid-level semantic features of the input sequence are exhausted, the other is again the padded input semantic word description <pad>. At the first decoding time step, a <BOS> flag is added to mark the start of decoding. Decoder training maximizes the log-likelihood of the predicted output sentence given the previously output word and the hidden-layer output Ht of the previous time step. Finally, conditioned on the output Zt of the first LSTM layer's hidden layer, the Softmax function computes the distribution of each word over the vocabulary S. Throughout decoding, the LSTM ends generation dynamically at the end flag <EOS>, which dynamically controls the length of the output semantic description sentence.
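A minimal sketch of this encode-then-decode scheme with a single shared LSTM is given below; the greedy decoding, token handling, and dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Sketch of steps 3.1)-3.2): one LSTM is reused for encoding and
# decoding (shared parameters). It first consumes the CNN feature
# sequence, then emits words until <EOS>.
class SeqRecognizer(nn.Module):
    def __init__(self, feat_dim, vocab_size, hidden=1000):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)  # shared LSTM
        self.embed = nn.Embedding(vocab_size, feat_dim)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, feats, max_len=20, bos=1, eos=2):
        # Encoding stage: feed the mid-level feature sequence, keep the state.
        _, state = self.lstm(feats)
        # Decoding stage: start from <BOS>, reuse the same LSTM and state.
        token = torch.full((feats.size(0), 1), bos, dtype=torch.long)
        words = []
        for _ in range(max_len):
            h, state = self.lstm(self.embed(token), state)
            logits = self.out(h[:, -1])          # softmax distribution over S
            token = logits.argmax(dim=-1, keepdim=True)
            words.append(token)
            if (token == eos).all():             # dynamic stop on <EOS>
                break
        return torch.cat(words, dim=1)
```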

4) Fuse the extracted character-identity, character-behavior, and context features, and feed the fused features into the LSTM-based video semantic recognition model to perform video semantic recognition. In the two-layer LSTM model, a shallow fusion technique is used to combine the character-identity, character-behavior, and context features. At each time step of the decoding stage, the LSTM model proposes a set of candidate words. The scores of these hypotheses are then recomputed by accumulating, with certain weights, the scores obtained from the character-identity, character-behavior, and context networks, according to the formula

P(y_t = y′) = α·P_p(y_t = y′) + β·P_a(y_t = y′) + γ·P_c(y_t = y′)

where y_t = y′ denotes choosing y′ at time step t, P(y_t = y′) is the probability score of the video character-behavior semantics at that step, P_p(y_t = y′) is the probability score of the character identity, P_a(y_t = y′) is the probability score of the character behavior, and P_c(y_t = y′) is the probability score of the context, with α + β + γ = 1. The three parameters are initialized to 1/3 and dynamically tuned on the dataset through the LSTM modeling process.
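Transcribing the shallow-fusion rule directly into code gives the following sketch (tensor shapes are assumptions):

```python
import torch

# Sketch of the shallow fusion of step 4): per-word distributions from
# the identity (p), behavior (a), and context (c) networks are combined
# with weights alpha + beta + gamma = 1, each initialized to 1/3.
def shallow_fusion(p_p, p_a, p_c, alpha=1/3, beta=1/3, gamma=1/3):
    """Each argument is a (batch, vocab) tensor of word probabilities."""
    assert abs(alpha + beta + gamma - 1.0) < 1e-6
    return alpha * p_p + beta * p_a + gamma * p_c

# Usage: pick the next word from the fused distribution.
# fused = shallow_fusion(p_p, p_a, p_c)
# next_word = fused.argmax(dim=-1)
```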

Claims (7)

CN201810236363.3A | 2018-03-21 (priority) | 2018-03-21 (filed) | A kind of video personage behavior method for recognizing semantics | Pending | CN108509880A (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN201810236363.3A | 2018-03-21 | 2018-03-21 | A kind of video personage behavior method for recognizing semantics

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN201810236363.3A | 2018-03-21 | 2018-03-21 | A kind of video personage behavior method for recognizing semantics

Publications (1)

Publication Number | Publication Date
CN108509880A (en) | 2018-09-07

Family

ID=63377957

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN201810236363.3A (Pending; published as CN108509880A (en)) | A kind of video personage behavior method for recognizing semantics | 2018-03-21 | 2018-03-21

Country Status (1)

Country | Link
CN (1) | CN108509880A (en)



Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
US20150066496A1 (en) * | 2013-09-02 | 2015-03-05 | Microsoft Corporation | Assignment of semantic labels to a sequence of words using neural network architectures
CN103824051A (en) * | 2014-02-17 | 2014-05-28 | 北京旷视科技有限公司 | Local region matching-based face search method
CN104021381A (en) * | 2014-06-19 | 2014-09-03 | 天津大学 | Human movement recognition method based on multistage characteristics
CN106845351A (en) * | 2016-05-13 | 2017-06-13 | 苏州大学 | It is a kind of for Activity recognition method of the video based on two-way length mnemon in short-term
US20170337271A1 (en) * | 2016-05-17 | 2017-11-23 | Intel Corporation | Visual search and retrieval using semantic information
CN107038221A (en) * | 2017-03-22 | 2017-08-11 | 杭州电子科技大学 | A kind of video content description method guided based on semantic information
CN107256221A (en) * | 2017-04-26 | 2017-10-17 | 苏州大学 | Video presentation method based on multi-feature fusion
CN107609460A (en) * | 2017-05-24 | 2018-01-19 | 南京邮电大学 | A kind of Human bodys' response method for merging space-time dual-network stream and attention mechanism

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
高翔: "基于视频深度学习的人物行为分析与社交关系识别", 《中国优秀硕士学位论文全文数据库 (信息科技辑)》*

Cited By (40)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN113646800A (en) * | 2018-09-27 | 2021-11-12 | 株式会社OPTiM | Object condition judging system, object condition judging method and program
CN109409297B (en) * | 2018-10-30 | 2021-11-23 | 咪付(广西)网络技术有限公司 | Identity recognition method based on dual-channel convolutional neural network
CN109409297A (en) * | 2018-10-30 | 2019-03-01 | 咪付(广西)网络技术有限公司 | A kind of personal identification method based on binary channels convolutional neural networks
CN109740419A (en) * | 2018-11-22 | 2019-05-10 | 东南大学 | A Video Action Recognition Method Based on Attention-LSTM Network
CN109815785A (en) * | 2018-12-05 | 2019-05-28 | 四川大学 | A facial emotion recognition method based on two-stream convolutional neural network
CN110084259B (en) * | 2019-01-10 | 2022-09-20 | 谢飞 | Facial paralysis grading comprehensive evaluation system combining facial texture and optical flow characteristics
CN110084259A (en) * | 2019-01-10 | 2019-08-02 | 谢飞 | A kind of facial paralysis hierarchical synthesis assessment system of combination face texture and Optical-flow Feature
CN109902565A (en) * | 2019-01-21 | 2019-06-18 | 深圳市烨嘉为技术有限公司 | The Human bodys' response method of multiple features fusion
US12100209B2 | 2019-01-23 | 2024-09-24 | Huawei Cloud Computing Technologies Co., Ltd. | Image analysis method and system
WO2020151247A1 (en) * | 2019-01-23 | 2020-07-30 | 华为技术有限公司 | Image analysis method and system
CN109977970A (en) * | 2019-03-27 | 2019-07-05 | 浙江水利水电学院 | Character recognition method under water conservancy project complex scene based on saliency detection
CN110060264B (en) * | 2019-04-30 | 2021-03-23 | 北京市商汤科技开发有限公司 | Neural network training method, video frame processing method, device and system
CN110060264A (en) * | 2019-04-30 | 2019-07-26 | 北京市商汤科技开发有限公司 | Neural network training method, video frame processing method, apparatus and system
CN110163876A (en) * | 2019-05-24 | 2019-08-23 | 山东师范大学 | Left ventricle dividing method, system, equipment and medium based on multi-feature fusion
CN110163876B (en) * | 2019-05-24 | 2021-08-17 | 山东师范大学 | Left ventricular segmentation method, system, device and medium based on multi-feature fusion
CN110245603A (en) * | 2019-06-12 | 2019-09-17 | 成都信息工程大学 | A method for real-time detection of group abnormal behavior
CN110674761A (en) * | 2019-09-27 | 2020-01-10 | 三星电子(中国)研发中心 | A regional behavior early warning method and system
CN110807379A (en) * | 2019-10-21 | 2020-02-18 | 腾讯科技(深圳)有限公司 | A semantic recognition method, device, and computer storage medium
CN113312942B (en) * | 2020-02-27 | 2024-05-17 | 阿里巴巴集团控股有限公司 | Data processing method and device and converged network architecture system
CN113312942A (en) * | 2020-02-27 | 2021-08-27 | 阿里巴巴集团控股有限公司 | Data processing method and equipment and converged network architecture
CN111523378A (en) * | 2020-03-11 | 2020-08-11 | 浙江工业大学 | A human behavior prediction method based on deep learning
CN111428593A (en) * | 2020-03-12 | 2020-07-17 | 北京三快在线科技有限公司 | Character recognition method and device, electronic equipment and storage medium
CN111460933A (en) * | 2020-03-18 | 2020-07-28 | 哈尔滨拓博科技有限公司 | Method for real-time recognition of continuous handwritten pattern
CN111460933B (en) * | 2020-03-18 | 2022-08-09 | 哈尔滨拓博科技有限公司 | Method for real-time recognition of continuous handwritten pattern
CN111340006B (en) * | 2020-04-16 | 2024-06-11 | 深圳市康鸿泰科技有限公司 | Sign language recognition method and system
CN111340006A (en) * | 2020-04-16 | 2020-06-26 | 深圳市康鸿泰科技有限公司 | Sign language identification method and system
CN112699730A (en) * | 2020-12-01 | 2021-04-23 | 贵州电网有限责任公司 | Machine room character re-identification method based on YOLO and convolution-cycle network
CN112818740A (en) * | 2020-12-29 | 2021-05-18 | 南京智能情资创新科技研究院有限公司 | Psychological quality dimension evaluation method and device for intelligent interview
CN112818739A (en) * | 2020-12-29 | 2021-05-18 | 南京智能情资创新科技研究院有限公司 | Image instrument dimension evaluation method and device for intelligent interview
CN112818741A (en) * | 2020-12-29 | 2021-05-18 | 南京智能情资创新科技研究院有限公司 | Behavior etiquette dimension evaluation method and device for intelligent interview
CN112975964B (en) * | 2021-02-23 | 2022-04-01 | 青岛海科虚拟现实研究院 | Robot automatic control method and system based on big data and robot
CN112975964A (en) * | 2021-02-23 | 2021-06-18 | 青岛海科虚拟现实研究院 | Robot automatic control method and system based on big data and robot
CN113449801A (en) * | 2021-07-08 | 2021-09-28 | 西安交通大学 | Image character behavior description generation method based on multilevel image context coding and decoding
CN113449801B (en) * | 2021-07-08 | 2023-05-02 | 西安交通大学 | Image character behavior description generation method based on multi-level image context coding and decoding
CN113744524A (en) * | 2021-08-16 | 2021-12-03 | 武汉理工大学 | Pedestrian intention prediction method and system based on cooperative computing communication between vehicles
CN113642482A (en) * | 2021-08-18 | 2021-11-12 | 西北工业大学 | A video character relationship analysis method based on video spatiotemporal context
CN113642482B (en) * | 2021-08-18 | 2024-02-02 | 西北工业大学 | Video character relation analysis method based on video space-time context
CN113779303A (en) * | 2021-11-12 | 2021-12-10 | 腾讯科技(深圳)有限公司 | Video set indexing method and device, storage medium and electronic equipment
CN114972841A (en) * | 2022-04-21 | 2022-08-30 | 北京邮电大学 | Knowledge distillation-based video multi-cue social relationship extraction method and device
CN114972841B (en) * | 2022-04-21 | 2025-07-04 | 北京邮电大学 | Video multi-cue social relationship extraction method and device based on knowledge distillation

Similar Documents

Publication | Title
CN108509880A (en) | A kind of video personage behavior method for recognizing semantics
Jiang et al. | Skeleton aware multi-modal sign language recognition
Liu et al. | Video-based person re-identification with accumulative motion context
Chen et al. | Once for all: a two-flow convolutional neural network for visual tracking
CN113139468B (en) | Video abstract generation method fusing local target features and global features
CN106096568B (en) | A pedestrian re-identification method based on CNN and convolutional LSTM network
CN106709461B (en) | Activity recognition method and device based on video
Wang et al. | Hierarchical attention network for action recognition in videos
CN103984943B (en) | A kind of scene text recognition methods based on Bayesian probability frame
CN106529477B (en) | Video Human Behavior Recognition Method Based on Salient Trajectories and Spatial-Temporal Evolution Information
CN109190479A (en) | A kind of video sequence expression recognition method based on interacting depth study
CN110188637A (en) | A method of behavior recognition technology based on deep learning
Ditsanthia et al. | Video representation learning for cctv-based violence detection
Chen et al. | Action segmentation with mixed temporal domain adaptation
CN112183468A (en) | Pedestrian re-identification method based on multi-attention combined multi-level features
CN107862275A (en) | Human bodys' response model and its construction method and Human bodys' response method
CN113221770A (en) | Cross-domain pedestrian re-identification method and system based on multi-feature hybrid learning
WO2023185074A1 (en) | Group behavior recognition method based on complementary spatio-temporal information modeling
Tao et al. | CENet: A channel-enhanced spatiotemporal network with sufficient supervision information for recognizing industrial smoke emissions
Zhou et al. | Hidden Two-Stream Collaborative Learning Network for Action Recognition
Zhang et al. | Progressive modality cooperation for multi-modality domain adaptation
Huang et al. | Spatial-temporal context-aware online action detection and prediction
Yoon et al. | A novel online action detection framework from untrimmed video streams
CN113642482B (en) | Video character relation analysis method based on video space-time context
CN106709419A (en) | Video human behavior recognition method based on significant trajectory spatial information

Legal Events

Code | Title | Description
PB01 | Publication |
SE01 | Entry into force of request for substantive examination |
RJ01 | Rejection of invention patent application after publication | Application publication date: 2018-09-07

