CN116884412A - A lip language recognition method based on hybrid three-dimensional residual gated recurrent units - Google Patents

A lip language recognition method based on hybrid three-dimensional residual gated recurrent units

Info

Publication number
CN116884412A
Authority
CN
China
Prior art keywords
batch
network
lip
residual
mixed
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310835916.8A
Other languages
Chinese (zh)
Inventor
李鹏华
苏沁伟
项盛
侯杰
茹懿
吕涛
尹绍云
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University of Post and Telecommunications
Original Assignee
Chongqing University of Post and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University of Post and Telecommunications
Priority to CN202310835916.8A
Publication of CN116884412A
Legal status: Pending


Abstract

Translated from Chinese

The invention relates to a lip language recognition method based on a hybrid three-dimensional residual gated recurrent unit, belonging to the field of lip language recognition, and comprising the following steps. S1: taking the lip-image feature sequence as the object, design mixed-data training to augment the data. S2: use a network fusing residual connections and spatiotemporal convolution as the front-end network to produce the final representation of the sequence. S3: construct a back-end network based on a sequence-information gating network to recognize lip language. The invention addresses two problems common in lip language recognition: high similarity between lip shapes and scarcity of data.

Description

Translated from Chinese
A lip language recognition method based on hybrid three-dimensional residual gated recurrent units

Technical field

The invention belongs to the technical field of lip language recognition and relates to a lip language recognition method based on a hybrid three-dimensional residual gated recurrent unit.

Background

Speech is a vital form of everyday communication: people use it to express thoughts, emotions, and opinions, to exchange information, and to understand one another. Deep learning is a machine learning approach based on neural networks; with advances in computer hardware and growing data volumes, it has been widely adopted across application domains, although optimizing its algorithms and model structures remains challenging. Speech recognition, an important branch of natural language processing, faces several practical difficulties. One of the largest is environmental noise and cross-talk between speakers, which severely distorts the input signal and lowers recognition accuracy. Speaking rate, accent, and tone also affect recognition quality, as do speech impairments such as unclear enunciation and mispronunciation.

Lip language recognition (lipreading) is a technology that can address the noise problem in speech recognition. Lip movements still provide useful speech information when the audio signal is unavailable or severely corrupted; recognition accuracy can be comparable to that of audio speech recognition, with high robustness. A lipreading system captures human lip-movement information with a video device and analyzes it with a model to recognize and transcribe speech. Used alongside audio, lip movements supply additional visual cues that ease recognition and open further possibilities for speech recognition technology. People who cannot vocalize may be unable to communicate through conventional speech recognition; in that case, lipreading technology can help them communicate effectively by analyzing lip-movement information and converting it into speech information. Lip language recognition therefore plays an important role in enabling rapid, effective communication for people with disabilities.

Traditional data augmentation for lipreading models includes rotation, scaling, and horizontal flipping, but its effect is limited by problems such as imbalanced data distributions. To improve robustness and generalization, this method uses mixed-data training for sample augmentation: pairs of samples are linearly combined into new mixed samples, increasing data diversity and the effective size of the lip-video dataset, so that the features the network learns are more representative. A network fusing residual connections with spatiotemporal convolution extracts both the static and the dynamic information in lip videos, allowing the model to distinguish different lip movements more accurately; spatiotemporal residual blocks and an adaptive attention mechanism model short- and long-term temporal dependencies, so that very long time series can be handled effectively. A sequence-information gating network serves as the back end; it captures long-range temporal dependencies in lip videos well, improving recognition accuracy and robustness. In addition, a locally sparsely connected decoder reduces model parameters and compute cost, making the model lightweight, efficient, and easy to deploy while maintaining high accuracy, which improves the practicality of lip language recognition.

Summary of the invention

In view of the above, the object of the present invention is to provide a lip language recognition method based on a hybrid three-dimensional residual gated recurrent unit.

To achieve this object, the present invention provides the following technical solution:

A lip language recognition method based on hybrid three-dimensional residual gated recurrent units, comprising the following steps:

S1: taking the lip-image feature sequence as the object, design mixed-data training to augment the data;

S2: use a network fusing residual connections and spatiotemporal convolution as the front-end network to produce the final representation of the sequence;

S3: construct a back-end network based on a sequence-information gating network to recognize lip language.

Further, step S1 specifically comprises:

S11: based on multiple facial landmarks, first face-align the lip images in the dataset, then crop the images and resize them to a fixed size;

S12: using the median coordinates of each landmark, apply a common crop to all frames of a given clip;

S13: convert the frames to grayscale and normalize them by the overall mean and variance to obtain the lip region;

S14: finally, apply mixed-data training for data augmentation.

Further, in step S1, let batch_x1 be a batch of samples and batch_y1 its corresponding labels; let batch_x2 be another batch and batch_y2 its corresponding labels; and let λ be a mixing coefficient drawn from a Beta distribution with parameters α and β. The mixed-data training formulas are:

λ = Beta(α, β)

mixed_batch_x = λ*batch_x1 + (1-λ)*batch_x2

mixed_batch_y = λ*batch_y1 + (1-λ)*batch_y2

where Beta denotes the Beta distribution, mixed_batch_x is the mixed batch of samples, and mixed_batch_y the labels corresponding to the mixed batch;

When batch size = 1, two single images are mixed; when batch size > 1, the samples of the two batches are mixed pairwise.

Further, the front-end network specifically comprises:

First layer: applies spatiotemporal convolution to the preprocessed frame stream and captures the short-term dynamics of the mouth region. It consists of one convolutional layer with 64 three-dimensional kernels, followed by batch normalization and a rectified linear unit; the extracted feature maps then pass through a spatiotemporal max-pooling layer.

Second layer: at each time step the 3D feature maps pass through a residual network, using the 34-layer identity-mapping version. Its building block consists of two convolutional layers with BN and ReLU, while skip connections facilitate information propagation. The ResNet uses max-pooling layers to progressively reduce the spatial dimensions until its output is a one-dimensional tensor at each time step.

The computation in the original residual block is:

y_l = h(x_l) + F(x_l, W_l)

x_{l+1} = f(y_l)

where x_l is the input feature of the l-th residual unit, W_l = {W_{l,k} | 1 ≤ k ≤ K} is the set of weights associated with the l-th residual unit, K is the number of layers in the unit, F denotes the residual function, and f is the operation applied after the element-wise addition, i.e. the ReLU activation. Taking the functions to be identity mappings:

h(x_l) = x_l,  x_{l+1} = y_l

yields the function:

x_{l+1} = x_l + F(x_l, W_l)

Applying this recursively gives:

x_L = x_l + Σ_{i=l}^{L-1} F(x_i, W_i)

That is, the feature x_L of an arbitrarily deep unit can be expressed as the feature x_l of a shallower unit plus a residual function of the form Σ_{i=l}^{L-1} F(x_i, W_i), showing that a residual relation holds between any pair of units L and l. For any L-layer deep network:

x_L = x_0 + Σ_{i=0}^{L-1} F(x_i, W_i)

the output feature x_L of the last layer is x_0 plus the residual functions of the intermediate layers. Writing the network's loss function as ε, the chain rule gives:

∂ε/∂x_l = (∂ε/∂x_L)(∂x_L/∂x_l) = (∂ε/∂x_L)(1 + ∂/∂x_l Σ_{i=l}^{L-1} F(x_i, W_i))

Further, the back-end network specifically comprises:

The output of the spatiotemporal convolution layers consists of spatiotemporal features extracted by the convolutional neural network and an optical-flow algorithm; these features contain the spatial and temporal information of the video data. They serve as input to the residual network and are then passed to the sequence-information gating network for semantic modeling and learning over the time series, completing recognition of the speech signal in the video.

The activation h_t of the sequence-information gating network at time t is a linear interpolation between the previous activation h_{t-1} and the candidate activation h̃_t:

h_t = (1 - z_t) ⊙ h_{t-1} + z_t ⊙ h̃_t

where the update gate z_t determines the degree to which the unit updates its activation or content; it is computed as:

z_t = σ(W_z x_t + U_z h_{t-1})

This process takes a linear sum of the existing state and the newly computed state, but adopts no mechanism to control how much of the state is exposed: the whole state is exposed each time. The candidate activation is computed as:

h̃_t = tanh(W x_t + U (r_t ⊙ h_{t-1}))

where r_t is a set of reset gates and ⊙ is element-wise multiplication. When closed, a reset gate makes the unit act as if it were reading the first symbol of the input sequence, enabling it to forget the previously computed state. The reset gate is computed similarly to the update gate:

r_t = σ(W_r x_t + U_r h_{t-1})

At the back end, the outputs of the sequence-information gating network are averaged over the time dimension and the result is sent to a final fully connected layer for prediction; cross-entropy loss is used for optimization.

Beneficial effects of the invention: from the model and algorithm perspective, the lip language recognition model of the invention fully combines a network fusing residual connections and spatiotemporal convolution with a sequence-information gating network. The front-end network uses a three-dimensional convolutional neural network to extract static and dynamic features from lip videos and relies on a deep residual structure to overcome the vanishing-gradient problem of ordinary two-dimensional networks. The back-end network, based on the sequence-information gating network, handles the temporal information in lip videos well, further improving the accuracy and robustness of lip language recognition. This combined front-end/back-end model successfully addresses the problems of high lip-shape similarity and scarce data that are common in lip language recognition, and achieved good results in experiments. The model has broad practical significance for applications serving people who rely on lipreading.

Other advantages, objects, and features of the present invention will be set forth to some extent in the description that follows and, to some extent, will be apparent to those skilled in the art upon study of the following, or may be learned from practice of the invention. The objects and other advantages of the invention may be realized and obtained through the following description.

Description of the drawings

To make the purpose, technical solution, and advantages of the present invention clearer, the invention is described in detail below with reference to the accompanying drawings, in which:

Figure 1 is a flow chart of the lip language recognition method based on a hybrid three-dimensional residual gated recurrent unit according to the present invention.

Detailed description

The embodiments of the present invention are described below through specific examples; those skilled in the art can readily understand other advantages and effects of the invention from the contents disclosed in this specification. The invention may also be implemented or applied through other, different embodiments, and the details in this specification may be modified or changed in various ways from different viewpoints and for different applications without departing from the spirit of the invention. It should be noted that the illustrations provided in the following embodiments only describe the basic concept of the invention schematically, and the following embodiments and their features may be combined with one another provided there is no conflict.

The drawings are for illustration only; they are schematic rather than physical diagrams and are not to be understood as limiting the invention. To better illustrate the embodiments, some components in the drawings may be omitted, enlarged, or reduced, and do not represent the dimensions of an actual product; those skilled in the art will understand that certain well-known structures and their descriptions may be omitted from the drawings.

In the drawings of the embodiments, identical or similar reference numbers denote identical or similar components. In the description of the invention, it should be understood that terms such as "upper", "lower", "left", "right", "front", and "rear" indicate orientations or positional relationships based on those shown in the drawings; they are used only to simplify the description, and do not indicate or imply that the device or element referred to must have a specific orientation or be constructed and operated in a specific orientation. The positional terms in the drawings are therefore illustrative only and are not to be understood as limiting the invention; those of ordinary skill in the art can understand their specific meaning according to the specific circumstances.

To meet the demand for high-accuracy, lightweight lip language recognition models, a lip language recognition method based on hybrid three-dimensional residual gated recurrent units is studied. Taking the lip-image feature sequence as the object, mixed-data training is designed to augment the data; a network fusing residual connections and spatiotemporal convolution serves as the front-end network, and the back-end network uses a sequence-information gating network, making the model lightweight, efficient, and easy to deploy.

As shown in Figure 1, lip language recognition based on hybrid three-dimensional residual gated recurrent units comprises three steps: mixed-data training, a front-end network based on fused residuals and spatiotemporal convolution, and a back-end network.

Data preprocessing: based on 66 facial landmarks, the lip images in the dataset are first face-aligned, then cropped and resized to a fixed 88×88 size. Using the median coordinates of each landmark, a common crop is applied to all frames of a given clip. The frames are converted to grayscale and normalized by the overall mean and variance to obtain the lip region. Finally, mixed-data training is applied for data augmentation.
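The grayscale normalization step can be sketched in plain Python. This is a minimal illustration only: nested lists stand in for image arrays, and the function and variable names are our own rather than the patent's.

```python
def normalize_frames(frames):
    """Z-score-normalize grayscale frames by the overall (clip-wide)
    mean and variance, as in the preprocessing step above."""
    pixels = [p for frame in frames for p in frame]
    mean = sum(pixels) / len(pixels)
    var = sum((p - mean) ** 2 for p in pixels) / len(pixels)
    std = var ** 0.5 if var > 0 else 1.0  # guard against constant frames
    return [[(p - mean) / std for p in frame] for frame in frames]

# Two toy 2-pixel "frames" of one clip.
clip = [[10.0, 20.0], [30.0, 40.0]]
normalized = normalize_frames(clip)
```

After normalization the pixel values of the whole clip have zero mean and unit variance, which is what "normalized according to the overall mean and variance" describes.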

Assume batch_x1 is a batch of samples and batch_y1 its corresponding labels; batch_x2 is another batch and batch_y2 its corresponding labels; and λ is a mixing coefficient drawn from a Beta distribution with parameters α and β. The mixed-data training formulas are then:

λ = Beta(α, β)  (1)

mixed_batch_x = λ*batch_x1 + (1-λ)*batch_x2  (2)

mixed_batch_y = λ*batch_y1 + (1-λ)*batch_y2  (3)

where Beta denotes the Beta distribution, mixed_batch_x is the mixed batch of samples, and mixed_batch_y the labels corresponding to the mixed batch.

There are few restrictions on batch_x1 and batch_x2: when batch size = 1, two single images are mixed; when batch size > 1, the samples of the two batches are mixed pairwise. Moreover, batch_x1 and batch_x2 may come from the same batch or from different batches.
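The mixed-data training of formulas (1)-(3) can be sketched in plain Python. Nested lists stand in for image tensors, `random.betavariate` draws λ from Beta(α, β), and the default α = β = 0.4 is an illustrative choice, not a value from the patent:

```python
import random

def mixup(batch_x1, batch_y1, batch_x2, batch_y2, alpha=0.4, beta=0.4):
    """Mix two batches pairwise, per formulas (1)-(3).

    batch_x*: lists of flattened image samples; batch_y*: one-hot labels.
    """
    lam = random.betavariate(alpha, beta)                          # (1)
    mix = lambda a, b: [lam * u + (1 - lam) * v for u, v in zip(a, b)]
    mixed_x = [mix(xa, xb) for xa, xb in zip(batch_x1, batch_x2)]  # (2)
    mixed_y = [mix(ya, yb) for ya, yb in zip(batch_y1, batch_y2)]  # (3)
    return mixed_x, mixed_y, lam

# batch size = 1: mixing two single "images" with one-hot labels.
mx, my, lam = mixup([[1.0, 1.0, 1.0]], [[1.0, 0.0]],
                    [[0.0, 0.0, 0.0]], [[0.0, 1.0]])
```

Because the first sample is all ones and the second all zeros, every mixed pixel equals λ, and the mixed label is [λ, 1-λ], which makes the linear-combination behaviour easy to check.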

Front-end network: spatiotemporal convolution layers are combined with a residual network to produce the final representation of the sequence.

The first layer applies spatiotemporal convolution to the preprocessed frame stream and can capture the short-term dynamics of the mouth region. It consists of one convolutional layer with 64 three-dimensional kernels of size 5×7×7 (time/width/height), followed by batch normalization and rectified linear units. The extracted feature maps pass through a spatiotemporal max-pooling layer, which reduces the spatial size of the 3D feature maps.

In the second stage, the 3D feature maps pass through a residual network at each time step, using the 34-layer identity-mapping version. Its building block consists of two convolutional layers with BN and ReLU, while skip connections facilitate information propagation. The ResNet uses max-pooling layers to progressively reduce the spatial dimensions until its output is a one-dimensional tensor at each time step.

The computation in the original residual block is:

y_l = h(x_l) + F(x_l, W_l)  (4)

x_{l+1} = f(y_l)  (5)

Here x_l is the input feature of the l-th residual unit, W_l = {W_{l,k} | 1 ≤ k ≤ K} is the set of weights (and biases) associated with the l-th residual unit, and K is the number of layers in the unit. F denotes the residual function, and f is the operation applied after the element-wise addition, i.e. the ReLU activation. Taking the functions to be identity mappings:

h(x_l) = x_l,  x_{l+1} = y_l

gives the function:

x_{l+1} = x_l + F(x_l, W_l)  (6)

Applying this recursively yields:

x_L = x_l + Σ_{i=l}^{L-1} F(x_i, W_i)  (7)

That is, the feature x_L of an arbitrarily deep unit can be expressed as the feature x_l of a shallower unit plus a residual function of the form Σ_{i=l}^{L-1} F(x_i, W_i), showing that a residual relation holds between any pair of units L and l. For any L-layer deep network:

x_L = x_0 + Σ_{i=0}^{L-1} F(x_i, W_i)

the output feature x_L of the last layer is x_0 plus the residual functions of the intermediate layers. Formula (7) also has good backpropagation properties: writing the network's loss function as ε, the chain rule gives:

∂ε/∂x_l = (∂ε/∂x_L)(∂x_L/∂x_l) = (∂ε/∂x_L)(1 + ∂/∂x_l Σ_{i=l}^{L-1} F(x_i, W_i))  (8)

This equation shows that the gradient of the loss with respect to the input decomposes into the sum of two terms: the first, the partial derivative of the loss with respect to x_L, involves no weight layer, while the second does. Information can therefore be propagated back directly to any shallow layer of the network. The equation also shows that, for a mini-batch of training data, it is unlikely that the second term in parentheses equals -1 for every training sample in the mini-batch, so the overall gradient value is unlikely to be 0; even when the weights are small, vanishing gradients are therefore unlikely to occur.
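The telescoping identity of formula (7) is easy to verify numerically. Below is a minimal sketch with scalar features and toy residual functions standing in for F(x_l, W_l); all names and values are illustrative, not from the patent:

```python
def residual_forward(x0, residual_fns):
    """Identity-mapping residual units: x_{l+1} = x_l + F_l(x_l).

    Returns the list of features [x_0, x_1, ..., x_L]."""
    xs = [x0]
    for f in residual_fns:
        xs.append(xs[-1] + f(xs[-1]))
    return xs

# Toy scalar residual functions in place of F(x_l, W_l).
fns = [lambda x: 0.1 * x, lambda x: -0.05 * x, lambda x: 0.2]
xs = residual_forward(1.0, fns)

# Formula (7): x_L = x_l + sum_{i=l}^{L-1} F(x_i, W_i), for every l < L.
L = len(fns)
for l in range(L):
    x_L = xs[l] + sum(fns[i](xs[i]) for i in range(l, L))
    assert abs(x_L - xs[L]) < 1e-12
```

The loop checks the identity for every shallow unit l at once: the deep feature is always the shallow feature plus the accumulated residuals, which is exactly the property that lets gradients flow directly back to any shallow layer.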

Back-end network: the output of the spatiotemporal convolution layers consists of spatiotemporal features extracted by the convolutional neural network and an optical-flow algorithm. These are high-level semantic features containing the spatial and temporal information of the video data. They are fed to the residual network for feature extraction; the resulting features represent the speech signal in the video better, improving the model's accuracy and robustness. These vectors are then passed to the sequence-information gating network for semantic modeling and learning over the time series, completing recognition of the speech signal in the video.

The activation h_t of the sequence-information gating network at time t is a linear interpolation between the previous activation h_{t-1} and the candidate activation h̃_t:

h_t = (1 - z_t) ⊙ h_{t-1} + z_t ⊙ h̃_t

where the update gate z_t determines the degree to which the unit updates its activation or content. The update gate is computed as:

z_t = σ(W_z x_t + U_z h_{t-1})

This process takes a linear sum of the existing state and the newly computed state, without any mechanism to control how much of the state is exposed: the whole state is exposed each time. The candidate activation is computed as:

h̃_t = tanh(W x_t + U (r_t ⊙ h_{t-1}))

where r_t is a set of reset gates and ⊙ denotes element-wise multiplication. When closed (r_t near 0), the reset gate effectively makes the unit act as if it were reading the first symbol of the input sequence, enabling it to forget the previously computed state. The reset gate r_t is computed similarly to the update gate:

r_t = σ(W_r x_t + U_r h_{t-1})

At the back end, the outputs of the sequence-information gating network are averaged over the time dimension and the result is sent to a final fully connected layer for prediction. Cross-entropy loss is used for optimization.
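The gating equations above describe a GRU-style cell. A minimal scalar sketch in plain Python follows; scalar weights stand in for the weight matrices W, U, and all parameter values are illustrative assumptions, not values from the patent:

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def gated_unit_step(x_t, h_prev, p):
    """One step of the sequence-information gating unit:
    z_t  = sigmoid(W_z x_t + U_z h_{t-1})   update gate
    r_t  = sigmoid(W_r x_t + U_r h_{t-1})   reset gate
    h~_t = tanh(W x_t + U (r_t * h_{t-1}))  candidate activation
    h_t  = (1 - z_t) h_{t-1} + z_t h~_t     linear interpolation
    """
    z = sigmoid(p["Wz"] * x_t + p["Uz"] * h_prev)
    r = sigmoid(p["Wr"] * x_t + p["Ur"] * h_prev)
    h_cand = math.tanh(p["W"] * x_t + p["U"] * (r * h_prev))
    return (1.0 - z) * h_prev + z * h_cand

# Illustrative scalar parameters.
params = {"Wz": 0.5, "Uz": 0.3, "Wr": 0.4, "Ur": 0.2, "W": 0.8, "U": 0.6}
h = 0.0
for x_t in [1.0, -0.5, 0.25]:  # a toy input sequence
    h = gated_unit_step(x_t, h, params)
```

Since h_t is a convex combination of h_{t-1} and a tanh output, the state stays bounded in (-1, 1) over arbitrarily long sequences, which is the behaviour that lets the back end track long-range temporal dependencies stably.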

Finally, it should be noted that the above embodiments are intended only to illustrate, not to limit, the technical solution of the present invention. Although the invention has been described in detail with reference to preferred embodiments, those of ordinary skill in the art should understand that the technical solution may be modified or equivalently replaced without departing from its purpose and scope, and all such modifications fall within the scope of the claims of the present invention.

Claims (5)

CN202310835916.8A | 2023-07-07 | A lip language recognition method based on hybrid three-dimensional residual gated recurrent units | Pending | CN116884412A (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202310835916.8A | 2023-07-07 | 2023-07-07 | CN116884412A (en): A lip language recognition method based on hybrid three-dimensional residual gated recurrent units

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202310835916.8A | 2023-07-07 | 2023-07-07 | CN116884412A (en): A lip language recognition method based on hybrid three-dimensional residual gated recurrent units

Publications (1)

Publication Number | Publication Date
CN116884412A | 2023-10-13

Family

ID=88256110

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202310835916.8A | CN116884412A (en): A lip language recognition method based on hybrid three-dimensional residual gated recurrent units (Pending) | 2023-07-07 | 2023-07-07

Country Status (1)

Country | Link
CN | CN116884412A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN117707091A (*) | 2023-12-25 | 2024-03-15 | 盐城中科高通量计算研究院有限公司 | Agricultural straw processing quality control system based on image processing


Similar Documents

Publication | Title
Tang et al.: Real-time neural radiance talking portrait synthesis via audio-spatial decomposition
CN109524006B (en): Chinese mandarin lip language identification method based on deep learning
CN112307995B (en): Semi-supervised pedestrian re-identification method based on feature decoupling learning
CN112307958A (en): Micro-expression identification method based on spatiotemporal appearance movement attention network
Sun et al.: Facial age synthesis with label distribution-guided generative adversarial network
CN113343937A (en): Lip language identification method based on deep convolution and attention mechanism
CN108427921A (en): A kind of face identification method based on convolutional neural networks
CN108363973B (en): An Unconstrained 3D Expression Transfer Method
CN107609460A (en): A kind of Human bodys' response method for merging space-time dual-network stream and attention mechanism
Liu et al.: Geometry-guided dense perspective network for speech-driven facial animation
CN112949481A (en): Lip language identification method and system for irrelevant speakers
CN110163156A (en): It is a kind of based on convolution from the lip feature extracting method of encoding model
CN116758621B (en): Self-attention mechanism-based face expression depth convolution identification method for shielding people
Wu et al.: Adversarial UV-transformation texture estimation for 3D face aging
CN116563450A (en): Expression migration method, model training method and device
CN117152310A (en): A facial animation generation method based on memory sharing and attention enhancement
CN114240811A (en): A method for generating new images based on multiple images
CN117953270A (en): Cancer molecular subtype classification method, model training method, equipment and medium
CN116884412A (en): A lip language recognition method based on hybrid three-dimensional residual gated recurrent units
Huang et al.: Dynamic sign language recognition based on CBAM with autoencoder time series neural network
Pranoto et al.: Enhanced IPCGAN-Alexnet model for new face image generating on age target
CN119206424B (en): Intent recognition method and system based on multi-modal fusion of voice and sight
CN108009512A (en): A kind of recognition methods again of the personage based on convolutional neural networks feature learning
CN112329875B (en): A continuous image sequence recognition method based on continuous attractor network
Bian et al.: Conditional adversarial consistent identity autoencoder for cross-age face synthesis

Legal Events

Code | Title
PB01 | Publication
SE01 | Entry into force of request for substantive examination
