CN108509880A - A kind of video personage behavior method for recognizing semantics - Google Patents

A kind of video personage behavior method for recognizing semantics

Info

Publication number
CN108509880A
Authority
CN
China
Prior art keywords
video
personage
feature
behavior
lstm
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810236363.3A
Other languages
Chinese (zh)
Inventor
陈志
高翔
岳文静
杨天明
陈璐
掌静
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Post and Telecommunication University
Original Assignee
Nanjing Post and Telecommunication University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Post and Telecommunication University
Priority to CN201810236363.3A
Publication of CN108509880A
Legal status: Pending (current)


Abstract

The invention discloses a method for recognizing the semantics of character behavior in video. Aiming to recognize the behavior semantics and social relationships of the characters in a video, the method first uses convolutional neural networks to concurrently extract mid-level semantic features of three aspects of each video scene: character identity, character behavior, and context. The semantic information of these three aspects is then fused by a two-layer recurrent neural network, which completes the recognition of character behavior semantics in the video. The method effectively bridges the gap between the low-level features and the high-level semantics of a video scene; the extracted video features are comprehensive, covering character facial features, character behavior features, and context features, which improves recognition accuracy. By establishing mid-level features between the low-level features and the high-level semantics, the invention overcomes the difficulty of modeling complex behavior in real scenes with low-level features alone, thereby closing the gap between low-level features and high-level semantics.

Description

Translated from Chinese
A Semantic Recognition Method for Video Character Behavior

Technical Field

The invention relates to machine learning. It completes the high-level semantic recognition of video scenes mainly through a method that converts low-level features into high-level semantics, and belongs to the intersecting application fields of deep learning, pattern recognition, and video information processing.

Background

Video semantic analysis performs semantic analysis on the ordered frame images of a video. A video may contain multiple scenes, and each scene consists of an ordered set of frame images. To analyze video semantics effectively, the video must first be preprocessed: its content is segmented into shots and organized into scenes. First, the video is segmented by detecting shots and locating shot transitions. Second, the set of key frames within each shot is identified, and shots are clustered by computing the similarity between the key-frame images of all shots. Then, on the basis of this scene structure, the semantics of the characters in the video are studied.
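For illustration, the shot segmentation described above can be sketched as a histogram comparison between consecutive frames; the following Python snippet is a minimal sketch under that assumption (the 8-bin histograms and the 0.5 threshold are illustrative choices, not values specified by the invention).

```python
import cv2

# Minimal sketch of shot-boundary detection via frame-histogram
# differences. Histogram settings and the threshold are assumptions.
def detect_shot_boundaries(video_path, threshold=0.5):
    cap = cv2.VideoCapture(video_path)
    boundaries, prev_hist, idx = [], None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        # Color histogram of the current frame.
        hist = cv2.calcHist([frame], [0, 1, 2], None,
                            [8, 8, 8], [0, 256] * 3)
        hist = cv2.normalize(hist, hist).flatten()
        if prev_hist is not None:
            # Bhattacharyya distance: a large value suggests a shot change.
            d = cv2.compareHist(prev_hist, hist, cv2.HISTCMP_BHATTACHARYYA)
            if d > threshold:
                boundaries.append(idx)
        prev_hist, idx = hist, idx + 1
    cap.release()
    return boundaries
```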

The semantic analysis of characters in video typically centers on the behavior semantics of the characters, while the semantics of the contextual objects formed by everything in the video other than the characters serve as auxiliary information that improves the accuracy of the character semantics. Current video semantic analysis generally proceeds by learning image features, whose representations fall into two main categories: low-level features and mid-level features. Low-level features are obtained by applying various transformations to the video's pixels and carry no semantic meaning.

Summary of the Invention

Technical problem: The purpose of the invention is to provide a method for recognizing the semantics of character behavior in video. It mainly solves the semantic recognition problem, namely how to complete the conversion from low-level features to high-level semantics and thereby bridge the gap between them. The conversion from low-level features to high-level semantics consists of low-level feature extraction, mid-level feature fusion, and recognition with a long short-term memory (LSTM) network. First, mid-level semantic features are extracted and fused through convolutional neural networks (CNNs), sampling, and fully connected operations; then the fused mid-level semantics are sequence-modeled with an LSTM to complete the recognition.

Technical solution: The invention solves the problem of mid-level semantic feature extraction by using CNNs to extract the features of characters, context, and actions.

The invention solves the problem of high-level semantic recognition mainly by using an LSTM-based semantic sequence model to recognize character semantics in video sequences.

The deep-learning-based video scene semantic recognition method of the invention comprises the following steps:

Step 1): Describe and extract the low-level features of the video images. The specific steps are as follows:

Step 1.1): Describe and extract the low-level features of character identity by detecting and preprocessing the faces of the characters in the video scene. Face detection is implemented with a local binary patterns histograms face detector; after detection, the face images are converted to grayscale, shrunk, and equalized.

Step 1.2): Describe and extract the low-level features of character behavior by fusing the spatio-temporal features of the character image sequences in the video scene. The spatio-temporal features are character motion-trajectory features obtained from the original video frames and from the optical-flow images between frames.

Step 1.3): Describe and extract the low-level context features by extracting features of the environment in which the scene takes place and of the objects that appear in the scene.

Step 2): Extract mid-level semantic features with pre-trained CNNs. The specific steps are as follows:

Step 2.1): Mid-level character-identity feature extraction. A convolutional neural network is trained on a face dataset, and the feature vector of the fully connected layer of the pre-trained network is used as the mid-level character-identity feature.

Step 2.2): Mid-level character-behavior feature extraction. Two convolutional neural networks are fused to recognize character behavior in the video, and the feature vector of the fully connected layer of the trained fusion network is used as the mid-level character-behavior feature.

Step 2.3): Mid-level context feature extraction. A convolutional neural network is trained using the ImageNet dataset as experimental data, and the feature vector of the fully connected layer of the pre-trained network is used as the mid-level context feature.

Step 3): LSTM-based semantic recognition of video character behavior. The specific steps are as follows:

Step 3.1): Build an LSTM-based semantic recognition model for video character behavior. The model consists of two LSTM layers: the first receives the video's semantic feature sequence and encodes it; the second takes the first layer's encoding as input and decodes it into a semantic description sentence.

Step 3.2): LSTM-based semantic sequence recognition. First, the CNN-extracted character-identity, character-behavior, and context semantic feature vectors are taken as input and encoded by the first LSTM layer into a fixed-length output vector; this vector is then fed to the second LSTM layer, which decodes it into a sentence describing the semantics of the video sequence. The first and second layers share a single LSTM, so parameters are shared between the encoding and decoding stages, which reduces training complexity.

Step 4): Fuse the extracted character-identity, character-behavior, and context features, and feed the fused features into the LSTM-based video semantic recognition model to perform video semantic recognition.

As a further refinement of the deep-learning-based video scene semantic recognition method of the invention, the spatio-temporal features in step 1.2) are character motion-trajectory features obtained from the original video frames and from the optical-flow images between frames.

As a further refinement of the deep-learning-based video scene semantic recognition method of the invention, the training of the convolutional neural network in step 2.3) is completed using the ImageNet dataset as experimental data.

As a further refinement of the deep-learning-based video scene semantic recognition method of the invention, the model in step 3.2) consists of two LSTM layers whose first and second layers share a single LSTM, so parameters are shared between the encoding and decoding stages, which reduces training complexity.

Beneficial effects: The video character behavior semantic recognition method proposed by the invention is a deep-learning-based video scene semantic recognition method whose effects are as follows:

(1) The invention provides a method for mid-level video features that effectively bridges the gap between the low-level features and the high-level semantics of a video scene.

(2) The method of the invention extracts comprehensive video features, including character facial features, character behavior features, and context features, which improves the accuracy of semantic recognition.

(3) The LSTM-based two-layer video scene semantic recognition model of the invention is an end-to-end model, which improves the accuracy of semantic recognition for long videos.

Description of the Drawings

Figure 1 is a structural diagram of the video scene semantic extraction method.

Figure 2 shows feature extraction with the CNN-People architecture.

Figure 3 shows feature extraction with the D-CNNs-Activity architecture.

Figure 4 shows feature extraction with the CNN-Context architecture.

Detailed Description

Certain embodiments of the invention are described below in more detail with reference to the accompanying drawings.

With reference to Figure 1, the specific embodiments of the invention are as follows:

1) Describe and extract the low-level features of the video images. The specific steps are as follows:

1.1) Describe and extract the low-level features of character identity by detecting and preprocessing the faces of the characters in the video scene. Face detection is implemented with a Local Binary Patterns Histograms (LBPH) face detector; after detection, the face images are converted to grayscale, shrunk, and equalized. After this preprocessing, the LBPH face detector yields a two-dimensional feature vector for each face image, completing the low-level feature description and extraction of character identity. Face detection and extraction use the face detection interface provided by OpenCV3, from which the character semantic input of the video scene is extracted.
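A minimal OpenCV sketch of this detection-and-preprocessing step is given below; the LBP cascade file name and the 100x100 target size are assumptions, since the text specifies only an LBPH-based detector followed by grayscale, shrinking, and equalization.

```python
import cv2

# Sketch of step 1.1): LBP-cascade face detection followed by the
# grayscale / shrink / equalize preprocessing. The cascade file and
# the 100x100 size are illustrative assumptions.
detector = cv2.CascadeClassifier('lbpcascade_frontalface.xml')

def extract_face_inputs(frame):
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)            # grayscale
    faces = detector.detectMultiScale(gray, scaleFactor=1.1,
                                      minNeighbors=5)
    processed = []
    for (x, y, w, h) in faces:
        face = cv2.resize(gray[y:y + h, x:x + w], (100, 100))  # shrink
        face = cv2.equalizeHist(face)                          # equalize
        processed.append(face)
    return processed
```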

1.2) Describe and extract the low-level features of character behavior by fusing the spatial and temporal features of the character image sequences in the video scene. The goodFeaturesToTrack() function of OpenCV3 is used to obtain the strong corners in the image as the feature points to track. The pyramidal optical-flow function calcOpticalFlowPyrLK() then processes two consecutive input images: a set of feature points is selected in the first image, and the function outputs the positions of those points in the next image. The tracking results are then filtered to remove bad feature points and the character's motion trajectory is marked out, completing the extraction of the character-behavior semantics of the video scene.
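The tracking step itself can be sketched as follows (parameter values are illustrative assumptions; the text names only the two OpenCV functions):

```python
import cv2

# Sketch of step 1.2): corners from goodFeaturesToTrack() are tracked
# into the next frame with pyramidal Lucas-Kanade optical flow, and
# badly tracked points are filtered out.
def track_trajectory(prev_frame, next_frame):
    prev_gray = cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY)
    next_gray = cv2.cvtColor(next_frame, cv2.COLOR_BGR2GRAY)
    # Strong corners of the first image serve as the points to track.
    p0 = cv2.goodFeaturesToTrack(prev_gray, maxCorners=200,
                                 qualityLevel=0.01, minDistance=7)
    # Positions of the same points in the next image.
    p1, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, next_gray, p0, None)
    ok = status.flatten() == 1        # drop points that were lost
    return p0[ok], p1[ok]             # start/end of each trajectory segment
```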

1.3) Describe and extract the low-level context features; the targets of feature extraction are the contextual environment and the objects that appear in the scene. Object features use the 4,096-D DeCAF general-purpose visual features. For location features, several pre-trained context detectors extract the location features of the scene, and the results form a set that serves as the extracted context feature representation, completing the extraction of the context semantics of the video scene.

2) Extract mid-level semantic features with pre-trained CNNs. The specific steps are as follows:

2.1) Mid-level character-identity feature extraction. CNN-People is trained on the Olivetti Faces dataset; its structure is shown in Figure 2. The CNN-People model has two convolution-and-subsampling layers. The fully connected layer corresponds to the hidden layer of a multilayer perceptron. The output layer is the classifier, which uses multi-class logistic regression. Overall, the CNN model is built as a serial structure in which the output of one layer feeds the input of the next. With the pre-trained CNN-People, the 4096-dimensional feature vector of the forward-propagated fc7 layer serves as the mid-level character-identity feature.
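The text fixes only the overall shape of CNN-People (two convolution-plus-subsampling stages, a 4096-D fully connected "fc7"-style layer used as the feature, and a multi-class logistic-regression output); the following PyTorch sketch fills in illustrative channel counts, kernel sizes, and a 64x64 grayscale input as assumptions.

```python
import torch
import torch.nn as nn

# Sketch of a CNN-People-like network. Channel counts, kernel sizes,
# and the 1x64x64 input are illustrative assumptions.
class CNNPeople(nn.Module):
    def __init__(self, num_identities):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 5), nn.ReLU(), nn.MaxPool2d(2),   # conv+subsample 1
            nn.Conv2d(16, 32, 5), nn.ReLU(), nn.MaxPool2d(2),  # conv+subsample 2
        )
        self.fc = nn.Linear(32 * 13 * 13, 4096)        # "fc7"-style hidden layer
        self.classifier = nn.Linear(4096, num_identities)

    def forward(self, x, return_features=False):
        x = self.features(x).flatten(1)
        feat = torch.relu(self.fc(x))
        if return_features:
            return feat                 # 4096-D mid-level identity feature
        return self.classifier(feat)    # logits for multi-class logistic regression
```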

2.2) Mid-level character-behavior feature extraction. A two-stream convolutional network fuses two CNNs to recognize character behavior in the video. Concretely, two CNNs are first built separately, one over single frames and one over multi-frame motion information (optical flow), and the outputs of the two networks are then fused by convolution at the score layer. The parameters are tuned by feed-forward and back-propagation on the UCF101 dataset, completing the training of the two-stream network. The trained D-CNNs-Activity network is applied to the images of the video scene, and the feature vector of its fc7 fully connected layer is output as the mid-level character-behavior feature. The D-CNNs-Activity network structure is shown in Figure 3.
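A sketch of this two-stream arrangement follows; the backbone layers, the 10 stacked optical-flow channels, and the simple additive score fusion are assumptions standing in for the convolutional score-layer fusion described above.

```python
import torch
import torch.nn as nn

# Sketch of a D-CNNs-Activity-style two-stream network: one CNN over an
# RGB frame, one over stacked optical-flow images, fused at the score level.
def make_stream(in_channels):
    return nn.Sequential(
        nn.Conv2d(in_channels, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(7),
        nn.Flatten(), nn.Linear(64 * 7 * 7, 4096), nn.ReLU(),  # fc7-style layer
    )

class TwoStreamActivity(nn.Module):
    def __init__(self, num_actions):
        super().__init__()
        self.spatial = make_stream(3)     # single RGB frame
        self.temporal = make_stream(10)   # stacked optical-flow images
        self.spatial_score = nn.Linear(4096, num_actions)
        self.temporal_score = nn.Linear(4096, num_actions)

    def forward(self, rgb, flow):
        s = self.spatial_score(self.spatial(rgb))
        t = self.temporal_score(self.temporal(flow))
        return s + t  # score-level fusion of the two streams
```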

2.3) Mid-level context feature extraction. The ImageNet dataset is selected as the experimental data, and the images obtained from the dataset are preprocessed with the method introduced in section 1.3) to obtain the low-level context feature description. The network of the handwriting recognition model LeNet is then trained on the ImageNet dataset, and the 4096-dimensional feature vector of the forward-propagated fc7 layer of the trained CNN-Context network serves as the mid-level context feature. CNN-Context is shown in Figure 4.

3) LSTM-based semantic recognition of video character behavior. The specific steps are as follows:

3.1) Build the LSTM-based semantic sequence model. The model consists of two LSTM layers: the first receives the video's semantic feature sequence and encodes it; the second takes the first layer's encoding as input and decodes it into a semantic description sentence. The model operates in two main stages, an encoding stage and a decoding stage, as shown in Figure 4.

3.2) LSTM-based semantic sequence recognition. Each layer of the two-layer LSTM model has 1000 hidden units for recording encoding information. During the first few time steps, the first LSTM layer encodes two inputs. The first is the mid-level semantic features of character identity, character behavior, and context extracted by the CNNs, from which the hidden-layer output Ht is computed through the four interacting transformations inside the LSTM module; the second is the padded input semantic word description <pad>. No loss is applied while the first LSTM layer encodes. When the mid-level semantic features of all the video sequences have been input, decoding begins. The second LSTM layer likewise takes two inputs: one is the hidden-layer information Ht output by the first layer's encoding; and since the mid-level semantic features of the input sequence are exhausted, the other is again the padded input semantic word description <pad>. At the first decoding time step, a <BOS> flag is added to mark the start of decoding. Decoder training maximizes the log-likelihood of the predicted output sentence given the previously output word and the hidden-layer output Ht of the previous time step. Finally, conditioned on the output Zt of the first LSTM layer's hidden layer, the Softmax function computes the distribution of each word over the vocabulary S. Throughout decoding, the LSTM ends generation dynamically at the end flag <EOS>, which dynamically controls the length of the output semantic description sentence.
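A minimal sketch of this encode-then-decode scheme with a single shared LSTM is given below; the greedy decoding, token handling, and dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Sketch of steps 3.1)-3.2): one LSTM is reused for encoding and
# decoding (shared parameters). It first consumes the CNN feature
# sequence, then emits words until <EOS>.
class SeqRecognizer(nn.Module):
    def __init__(self, feat_dim, vocab_size, hidden=1000):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)  # shared LSTM
        self.embed = nn.Embedding(vocab_size, feat_dim)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, feats, max_len=20, bos=1, eos=2):
        # Encoding stage: feed the mid-level feature sequence, keep the state.
        _, state = self.lstm(feats)
        # Decoding stage: start from <BOS>, reuse the same LSTM and state.
        token = torch.full((feats.size(0), 1), bos, dtype=torch.long)
        words = []
        for _ in range(max_len):
            h, state = self.lstm(self.embed(token), state)
            logits = self.out(h[:, -1])          # softmax distribution over S
            token = logits.argmax(dim=-1, keepdim=True)
            words.append(token)
            if (token == eos).all():             # dynamic stop on <EOS>
                break
        return torch.cat(words, dim=1)
```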

4) Fuse the extracted character-identity, character-behavior, and context features, and feed the fused features into the LSTM-based video semantic recognition model to perform video semantic recognition. In the two-layer LSTM model, a shallow fusion technique is used to combine the character-identity, character-behavior, and context features. At each time step of the decoding stage, the LSTM model proposes a set of candidate words. The scores of these hypotheses are then recomputed by accumulating, with certain weights, the scores obtained from the character-identity, character-behavior, and context networks, according to the formula

P(y_t = y′) = α·P_p(y_t = y′) + β·P_a(y_t = y′) + γ·P_c(y_t = y′)

where y_t = y′ denotes choosing y′ at time step t, P(y_t = y′) is the probability score of the video character-behavior semantics at that step, P_p(y_t = y′) is the probability score of the character identity, P_a(y_t = y′) is the probability score of the character behavior, and P_c(y_t = y′) is the probability score of the context, with α + β + γ = 1. The three parameters are initialized to 1/3 and dynamically tuned on the dataset through the LSTM modeling process.
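Transcribing the shallow-fusion rule directly into code gives the following sketch (tensor shapes are assumptions):

```python
import torch

# Sketch of the shallow fusion of step 4): per-word distributions from
# the identity (p), behavior (a), and context (c) networks are combined
# with weights alpha + beta + gamma = 1, each initialized to 1/3.
def shallow_fusion(p_p, p_a, p_c, alpha=1/3, beta=1/3, gamma=1/3):
    """Each argument is a (batch, vocab) tensor of word probabilities."""
    assert abs(alpha + beta + gamma - 1.0) < 1e-6
    return alpha * p_p + beta * p_a + gamma * p_c

# Usage: pick the next word from the fused distribution.
# fused = shallow_fusion(p_p, p_a, p_c)
# next_word = fused.argmax(dim=-1)
```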

Claims (7)

CN201810236363.3A | 2018-03-21 (priority) | 2018-03-21 (filed) | A kind of video personage behavior method for recognizing semantics | Pending | CN108509880A (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN201810236363.3A | 2018-03-21 | 2018-03-21 | A kind of video personage behavior method for recognizing semantics

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN201810236363.3A | 2018-03-21 | 2018-03-21 | A kind of video personage behavior method for recognizing semantics

Publications (1)

Publication Number | Publication Date
CN108509880A (en) | 2018-09-07

Family

ID=63377957

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN201810236363.3A (Pending; published as CN108509880A (en)) | A kind of video personage behavior method for recognizing semantics | 2018-03-21 | 2018-03-21

Country Status (1)

Country | Link
CN (1) | CN108509880A (en)



Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
US20150066496A1 (en) * | 2013-09-02 | 2015-03-05 | Microsoft Corporation | Assignment of semantic labels to a sequence of words using neural network architectures
CN103824051A (en) * | 2014-02-17 | 2014-05-28 | 北京旷视科技有限公司 | Local region matching-based face search method
CN104021381A (en) * | 2014-06-19 | 2014-09-03 | 天津大学 | Human movement recognition method based on multistage characteristics
CN106845351A (en) * | 2016-05-13 | 2017-06-13 | 苏州大学 | It is a kind of for Activity recognition method of the video based on two-way length mnemon in short-term
US20170337271A1 (en) * | 2016-05-17 | 2017-11-23 | Intel Corporation | Visual search and retrieval using semantic information
CN107038221A (en) * | 2017-03-22 | 2017-08-11 | 杭州电子科技大学 | A kind of video content description method guided based on semantic information
CN107256221A (en) * | 2017-04-26 | 2017-10-17 | 苏州大学 | Video presentation method based on multi-feature fusion
CN107609460A (en) * | 2017-05-24 | 2018-01-19 | 南京邮电大学 | A kind of Human bodys' response method for merging space-time dual-network stream and attention mechanism

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
高翔: "基于视频深度学习的人物行为分析与社交关系识别", 《中国优秀硕士学位论文全文数据库 (信息科技辑)》*

Cited By (40)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN113646800A (en) * | 2018-09-27 | 2021-11-12 | 株式会社OPTiM | Object condition judging system, object condition judging method and program
CN109409297B (en) * | 2018-10-30 | 2021-11-23 | 咪付(广西)网络技术有限公司 | Identity recognition method based on dual-channel convolutional neural network
CN109409297A (en) * | 2018-10-30 | 2019-03-01 | 咪付(广西)网络技术有限公司 | A kind of personal identification method based on binary channels convolutional neural networks
CN109740419A (en) * | 2018-11-22 | 2019-05-10 | 东南大学 | A Video Action Recognition Method Based on Attention-LSTM Network
CN109815785A (en) * | 2018-12-05 | 2019-05-28 | 四川大学 | A facial emotion recognition method based on two-stream convolutional neural network
CN110084259B (en) * | 2019-01-10 | 2022-09-20 | 谢飞 | Facial paralysis grading comprehensive evaluation system combining facial texture and optical flow characteristics
CN110084259A (en) * | 2019-01-10 | 2019-08-02 | 谢飞 | A kind of facial paralysis hierarchical synthesis assessment system of combination face texture and Optical-flow Feature
CN109902565A (en) * | 2019-01-21 | 2019-06-18 | 深圳市烨嘉为技术有限公司 | The Human bodys' response method of multiple features fusion
US12100209B2 | 2019-01-23 | 2024-09-24 | Huawei Cloud Computing Technologies Co., Ltd. | Image analysis method and system
WO2020151247A1 (en) * | 2019-01-23 | 2020-07-30 | 华为技术有限公司 | Image analysis method and system
CN109977970A (en) * | 2019-03-27 | 2019-07-05 | 浙江水利水电学院 | Character recognition method under water conservancy project complex scene based on saliency detection
CN110060264B (en) * | 2019-04-30 | 2021-03-23 | 北京市商汤科技开发有限公司 | Neural network training method, video frame processing method, device and system
CN110060264A (en) * | 2019-04-30 | 2019-07-26 | 北京市商汤科技开发有限公司 | Neural network training method, video frame processing method, apparatus and system
CN110163876A (en) * | 2019-05-24 | 2019-08-23 | 山东师范大学 | Left ventricle dividing method, system, equipment and medium based on multi-feature fusion
CN110163876B (en) * | 2019-05-24 | 2021-08-17 | 山东师范大学 | Left ventricular segmentation method, system, device and medium based on multi-feature fusion
CN110245603A (en) * | 2019-06-12 | 2019-09-17 | 成都信息工程大学 | A method for real-time detection of group abnormal behavior
CN110674761A (en) * | 2019-09-27 | 2020-01-10 | 三星电子(中国)研发中心 | A regional behavior early warning method and system
CN110807379A (en) * | 2019-10-21 | 2020-02-18 | 腾讯科技(深圳)有限公司 | A semantic recognition method, device, and computer storage medium
CN113312942B (en) * | 2020-02-27 | 2024-05-17 | 阿里巴巴集团控股有限公司 | Data processing method and device and converged network architecture system
CN113312942A (en) * | 2020-02-27 | 2021-08-27 | 阿里巴巴集团控股有限公司 | Data processing method and equipment and converged network architecture
CN111523378A (en) * | 2020-03-11 | 2020-08-11 | 浙江工业大学 | A human behavior prediction method based on deep learning
CN111428593A (en) * | 2020-03-12 | 2020-07-17 | 北京三快在线科技有限公司 | Character recognition method and device, electronic equipment and storage medium
CN111460933A (en) * | 2020-03-18 | 2020-07-28 | 哈尔滨拓博科技有限公司 | Method for real-time recognition of continuous handwritten pattern
CN111460933B (en) * | 2020-03-18 | 2022-08-09 | 哈尔滨拓博科技有限公司 | Method for real-time recognition of continuous handwritten pattern
CN111340006B (en) * | 2020-04-16 | 2024-06-11 | 深圳市康鸿泰科技有限公司 | Sign language recognition method and system
CN111340006A (en) * | 2020-04-16 | 2020-06-26 | 深圳市康鸿泰科技有限公司 | Sign language identification method and system
CN112699730A (en) * | 2020-12-01 | 2021-04-23 | 贵州电网有限责任公司 | Machine room character re-identification method based on YOLO and convolution-cycle network
CN112818740A (en) * | 2020-12-29 | 2021-05-18 | 南京智能情资创新科技研究院有限公司 | Psychological quality dimension evaluation method and device for intelligent interview
CN112818739A (en) * | 2020-12-29 | 2021-05-18 | 南京智能情资创新科技研究院有限公司 | Image instrument dimension evaluation method and device for intelligent interview
CN112818741A (en) * | 2020-12-29 | 2021-05-18 | 南京智能情资创新科技研究院有限公司 | Behavior etiquette dimension evaluation method and device for intelligent interview
CN112975964B (en) * | 2021-02-23 | 2022-04-01 | 青岛海科虚拟现实研究院 | Robot automatic control method and system based on big data and robot
CN112975964A (en) * | 2021-02-23 | 2021-06-18 | 青岛海科虚拟现实研究院 | Robot automatic control method and system based on big data and robot
CN113449801A (en) * | 2021-07-08 | 2021-09-28 | 西安交通大学 | Image character behavior description generation method based on multilevel image context coding and decoding
CN113449801B (en) * | 2021-07-08 | 2023-05-02 | 西安交通大学 | Image character behavior description generation method based on multi-level image context coding and decoding
CN113744524A (en) * | 2021-08-16 | 2021-12-03 | 武汉理工大学 | Pedestrian intention prediction method and system based on cooperative computing communication between vehicles
CN113642482A (en) * | 2021-08-18 | 2021-11-12 | 西北工业大学 | A video character relationship analysis method based on video spatiotemporal context
CN113642482B (en) * | 2021-08-18 | 2024-02-02 | 西北工业大学 | Video character relation analysis method based on video space-time context
CN113779303A (en) * | 2021-11-12 | 2021-12-10 | 腾讯科技(深圳)有限公司 | Video set indexing method and device, storage medium and electronic equipment
CN114972841A (en) * | 2022-04-21 | 2022-08-30 | 北京邮电大学 | Knowledge distillation-based video multi-cue social relationship extraction method and device
CN114972841B (en) * | 2022-04-21 | 2025-07-04 | 北京邮电大学 | Video multi-cue social relationship extraction method and device based on knowledge distillation

Similar Documents

Publication | Title
CN108509880A (en) | A kind of video personage behavior method for recognizing semantics
Jiang et al. | Skeleton aware multi-modal sign language recognition
Liu et al. | Video-based person re-identification with accumulative motion context
Chen et al. | Once for all: a two-flow convolutional neural network for visual tracking
CN113139468B (en) | Video abstract generation method fusing local target features and global features
CN106096568B (en) | A pedestrian re-identification method based on CNN and convolutional LSTM network
CN106709461B (en) | Activity recognition method and device based on video
Wang et al. | Hierarchical attention network for action recognition in videos
CN103984943B (en) | A kind of scene text recognition methods based on Bayesian probability frame
CN106529477B (en) | Video Human Behavior Recognition Method Based on Salient Trajectories and Spatial-Temporal Evolution Information
CN109190479A (en) | A kind of video sequence expression recognition method based on interacting depth study
CN110188637A (en) | A method of behavior recognition technology based on deep learning
Ditsanthia et al. | Video representation learning for cctv-based violence detection
Chen et al. | Action segmentation with mixed temporal domain adaptation
CN112183468A (en) | Pedestrian re-identification method based on multi-attention combined multi-level features
CN107862275A (en) | Human bodys' response model and its construction method and Human bodys' response method
CN113221770A (en) | Cross-domain pedestrian re-identification method and system based on multi-feature hybrid learning
WO2023185074A1 (en) | Group behavior recognition method based on complementary spatio-temporal information modeling
Tao et al. | CENet: A channel-enhanced spatiotemporal network with sufficient supervision information for recognizing industrial smoke emissions
Zhou et al. | Hidden Two-Stream Collaborative Learning Network for Action Recognition
Zhang et al. | Progressive modality cooperation for multi-modality domain adaptation
Huang et al. | Spatial-temporal context-aware online action detection and prediction
Yoon et al. | A novel online action detection framework from untrimmed video streams
CN113642482B (en) | Video character relation analysis method based on video space-time context
CN106709419A (en) | Video human behavior recognition method based on significant trajectory spatial information

Legal Events

Code | Title | Description
PB01 | Publication |
SE01 | Entry into force of request for substantive examination |
RJ01 | Rejection of invention patent application after publication | Application publication date: 2018-09-07

