CN114299418B - A Cantonese lip reading recognition method, device and storage medium - Google Patents

A Cantonese lip reading recognition method, device and storage medium

Info

Publication number
CN114299418B
CN114299418B (application CN202111507949.7A)
Authority
CN
China
Prior art keywords
cantonese
network
lip reading
sequence
reading recognition
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111507949.7A
Other languages
Chinese (zh)
Other versions
CN114299418A (en)
Inventor
肖业伟
滕连伟
朱澳苏
刘烜铭
田丕承
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xiangtan University
Original Assignee
Xiangtan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xiangtan University
Priority to CN202111507949.7A
Publication of CN114299418A
Application granted
Publication of CN114299418B
Legal status: Active (Current)
Anticipated expiration

Abstract


The present invention discloses a Cantonese lip reading recognition method, device and storage medium. The method comprises obtaining a first Cantonese video clip; cutting out useless clips in the first Cantonese video clip to obtain a second Cantonese video clip; dividing the video sequence and audio sequence in the second Cantonese video clip, segmenting the audio sequence and generating a segmentation timestamp, and generating a label according to the segmentation and the segmentation timestamp; extracting a face image in the video sequence, filtering incomplete face images, and generating a sample image according to the filtered face image and the label; training a preset Cantonese lip reading recognition model according to the sample image to obtain a trained Cantonese lip reading recognition model; and recognizing a target video sequence according to the trained Cantonese lip reading recognition model to obtain a recognition result. The method can collect a Cantonese word-level lip reading sample image data set, and can improve the recognition accuracy of the trained model because useless sequences in the video sequence are eliminated.

Description

Cantonese lip reading recognition method, device and storage medium
Technical Field
The invention relates to the technical field of lip reading recognition, and in particular to a Cantonese lip reading recognition method, device and storage medium.
Background
Lip reading recognition is the recognition of the corresponding text information by observing the sequence of a speaker's mouth movements, hence the name "lip reading". In recent years, with the development of deep learning, lip reading technology has made great breakthroughs, and word-level and sentence-level English lip reading recognition has achieved rather high recognition accuracy.
As the only Chinese dialect openly studied both at home and abroad, Cantonese has a complete word system, can be fully expressed with Chinese characters, and is widely used in Guangdong, Guangxi, Hong Kong, Macau and overseas Chinese communities around the world. Cantonese has approximately 120 million speakers worldwide, which gives the Cantonese lip reading task a huge research prospect. Unlike the English lip reading task, Chinese is a syllable-based language that distinguishes words or sentences through tones and pinyin, which undoubtedly poses a significant challenge to the Chinese lip reading task as a whole. In addition, even when the spoken content is identical, the manner of pronunciation, the pitch and the duration of pronunciation in Cantonese differ from those in Mandarin Chinese. Therefore, a Mandarin Chinese lip reading model cannot be directly transferred to the Cantonese lip reading task.
In related Cantonese lip reading recognition schemes, the model is usually trained by extracting features from a video sequence, but part of the video sequence is often useless, and training the model on sequences that contain such useless parts reduces the recognition accuracy of the model. Moreover, judging from the research progress in the lip reading field at home and abroad, no institution or individual has published a large-scale Cantonese lip reading data set.
Disclosure of Invention
The present invention aims to solve at least one of the technical problems existing in the prior art. Therefore, the invention provides a Cantonese lip reading recognition method, device and storage medium, which can improve the recognition accuracy of the trained model.
A first aspect of the invention provides a Cantonese lip reading recognition method, which comprises the following steps:
acquiring a first Cantonese video clip;
cutting out useless segments from the first Cantonese video clip to obtain a second Cantonese video clip, wherein the useless segments include segments with a human voice but no human figure and/or segments in which the human voice and the human figure do not match;
dividing the video sequence and the audio sequence in the second Cantonese video clip, segmenting the audio sequence into words and generating word-segmentation timestamps, and generating labels according to the word segments and the word-segmentation timestamps;
extracting face images from the video sequence, filtering out incomplete face images, and generating a sample sequence according to the filtered face images and the labels;
training a preset Cantonese lip reading recognition model according to the sample sequence to obtain the trained Cantonese lip reading recognition model;
recognizing a target video sequence according to the trained Cantonese lip reading recognition model to obtain a recognition result.
According to the embodiment of the invention, at least the following technical effects are achieved:
After the first Cantonese video clip is obtained, segments with a human voice but no human figure and/or segments in which the human voice and the human figure do not match are eliminated from the first Cantonese video clip to obtain the second Cantonese video clip, and a labeled sample sequence data set is generated from the second Cantonese video clip. The method can collect a word-level Cantonese lip reading sample sequence data set, filling the gap left by the current absence of a large-scale lip reading sample sequence data set; and because useless sequences in the video sequence are eliminated, training the model on the sample sequences can improve the recognition accuracy of the trained model.
According to some embodiments of the invention, before training the preset cantonese lip reading recognition model according to the sample sequence, the method further comprises:
adding boundary information to the sample sequence, and encoding the sample sequence with the added boundary information according to Libjpeg.
According to some embodiments of the invention, the cantonese lip reading recognition model includes a feature extraction network, an LSTM network, a three-layer BiGRU network, and a mutual information maximization network, and the training process of the cantonese lip reading recognition model includes:
extracting features in the sample sequence according to the feature extraction network, and setting mutual information constraint between the features and the labels;
Generating corresponding weights based on different frames of the tags according to the LSTM network;
classifying the characteristics according to the three-layer BiGRU network to obtain an output result;
Generating global average characteristics according to the output result and the weight;
and maximizing mutual information between the global average feature and the tag according to the mutual information maximizing network.
According to some embodiments of the present invention, the feature extraction network includes a 3D CNN network, a spatial maximization pooling layer, a ResNet network, and a global averaging pooling layer connected in sequence, and the extracting the features in the sample sequence according to the feature extraction network and setting mutual information constraints between the features and tags includes:
extracting initial features in the sample sequence according to the 3D CNN network;
compressing the initial feature according to the spatial maximization pooling layer;
dividing the initial features into a plurality of parts, respectively extracting the features of each part according to the ResNet network, and adding mutual information constraint between the features and the labels;
And carrying out average pooling on the characteristics added with mutual information constraint according to the global average pooling layer.
According to some embodiments of the invention, the ResNet network is a ResNet-34 network.
According to some embodiments of the present invention, the LSTM network includes an LSTM layer and a linear layer connected in sequence, and the calculation formula for generating the corresponding weight based on different frames of the tag according to the LSTM network includes:
at = Relu(wlinear × LSTM(G)t + blinear)
Wherein G represents the output result of the spatial maximum pooling layer, wlinear and blinear represent the parameters of the linear layer, LSTM(G)t represents the hidden state of the LSTM layer at time step t, Relu() represents the Relu function, and at represents the weight of the frame sequence at time step t.
According to some embodiments of the invention, the optimization function of the mutual information maximization network comprises:
LossMI = Ep(F,L)[log(MI(F,L))] + Ep(F)p(L)[log(1 - MI(F,L))]
Wherein LossMI represents the optimization function of the global mutual information maximization network, p(F,L) represents the joint distribution of the sample pair (F,L), p(F)p(L) represents the marginal distribution of the sample pair (F,L), MI(F,L) represents the mutual information between F and L, E denotes the mathematical expectation, F represents the global average feature, and L represents the label.
According to some embodiments of the invention, the loss function of the cantonese lip reading recognition model comprises:
LossCE = -Σi=1..c Li log(SLi)
Wherein LossCE represents a cross entropy loss function, c represents the total number of vocabulary classes in the sample sequence, Li represents the i-th label, and SLi represents the score of the label Li output by the three-layer BiGRU network.
In a second aspect of the invention, an electronic device is provided, comprising at least one control processor and a memory communicatively coupled to the at least one control processor, the memory storing instructions executable by the at least one control processor to enable the at least one control processor to perform the cantonese lip reading identification method described above.
In a third aspect of the present invention, there is provided a computer-readable storage medium storing computer-executable instructions for causing a computer to perform the above-described cantonese lip reading recognition method.
It should be noted that the advantages of the second and third aspects of the present invention over the prior art are the same as those of the Cantonese lip reading recognition method described above, and will not be described in detail herein.
Additional aspects and advantages of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
Drawings
The foregoing and/or additional aspects and advantages of the invention will become apparent and may be better understood from the following description of embodiments taken in conjunction with the accompanying drawings in which:
FIG. 1 is a schematic flow chart of a Cantonese lip reading recognition method according to one embodiment of the present invention;
FIG. 2 is a schematic flow chart of a Cantonese lip reading recognition method according to another embodiment of the present invention;
FIG. 3 is a schematic flow chart of a Cantonese lip reading recognition method according to another embodiment of the present invention;
FIG. 4 is a schematic flow chart of a Cantonese lip reading recognition method according to another embodiment of the present invention;
FIG. 5 is a block flow diagram of constructing a sample dataset provided by one embodiment of the present invention;
FIG. 6 is a block flow diagram of a Cantonese lip reading recognition model provided by an embodiment of the invention;
fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present invention;
Fig. 8 is a schematic structural diagram of a storage medium according to an embodiment of the present invention.
Detailed Description
Embodiments of the present invention are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are illustrative only and are not to be construed as limiting the invention.
Referring to fig. 1 to 3, in one embodiment of the present invention, there is provided a cantonese lip reading recognition method including the steps of:
Step S110, a first Cantonese video clip is obtained.
For example, the first Cantonese video clip is obtained by crawling Cantonese video programs, such as Cantonese news broadcasts, Cantonese variety shows, interviews with Cantonese-speaking personalities and talk shows, from the Internet using the you-get tool (you-get is a Python 3 based video, picture and music download tool).
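As an illustration of this crawling step, the following is a minimal Python sketch that drives the you-get command-line tool; the example URL, the output directory and the assumption that you-get is installed on the system are hypothetical and not taken from the patent.

```python
import subprocess
from pathlib import Path

def download_cantonese_clips(urls, out_dir="raw_clips"):
    """Download each Cantonese program with the you-get CLI into out_dir."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for url in urls:
        # -o sets the output directory; you-get chooses container and quality itself
        subprocess.run(["you-get", "-o", out_dir, url], check=True)

if __name__ == "__main__":
    # hypothetical URL, for illustration only
    download_cantonese_clips(["https://example.com/cantonese_news_episode_1"])
```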
Step S120, useless segments are cut out of the first Cantonese video clip to obtain the second Cantonese video clip, wherein the useless segments include segments with a human voice but no human figure and/or segments in which the human voice and the human figure do not match. Here, the useless segments in the first Cantonese video clip may be cut manually.
Step S130, the video sequence and the audio sequence in the second Cantonese video clip are separated, the audio sequence is segmented into words and word-segmentation timestamps are generated, and labels are generated according to the word segments and the word-segmentation timestamps.
For example, the iFLYTEK speech transcription tool is used to segment the audio sequence into words and generate word-segmentation timestamps; the video sequence and the audio sequence are named according to the same naming scheme, and a video sequence and its corresponding audio sequence share the same name, which facilitates later pairing. All word-segmentation timestamps are expanded by 0.02 s on each side, and labels are generated according to the video sequence name, the word-segmentation timestamp, the word-segmentation pinyin and the order in which the word segments are generated. The audio sequence in this step is used to annotate the sample data set obtained later: the audio is processed with the iFLYTEK speech transcription tool to generate the corresponding text information, which is the text content of the corresponding video sequence, i.e. it serves as the label.
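The following is a hedged Python sketch of this labelling step. The ±0.02 s expansion and the label fields (video name, word order, pinyin, timestamps) follow the description above, while the word-list structure returned by the transcription tool and the CSV output layout are assumptions.

```python
import csv

EXPAND_S = 0.02  # expand each word timestamp by 0.02 s on both sides

def build_labels(video_name, words, out_csv):
    """words: list of dicts such as {"pinyin": "nei5 hou2", "start": 1.24, "end": 1.58},
    in the order produced by the speech transcription tool (hypothetical structure)."""
    with open(out_csv, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["video", "word_index", "pinyin", "start", "end"])
        for idx, w in enumerate(words):
            start = max(0.0, w["start"] - EXPAND_S)
            end = w["end"] + EXPAND_S
            writer.writerow([video_name, idx, w["pinyin"], f"{start:.2f}", f"{end:.2f}"])
```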
Step S140, extracting face images in the video sequence, filtering incomplete face images, and generating a sample sequence according to the filtered face images and the labels.
For example, the MediaPipe tool is used to extract faces from the video sequence, a filter is trained to remove all images in which the face is detected abnormally (incomplete face crops), and finally a sample sequence data set consisting of a number of face images is obtained.
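A minimal sketch of this face-extraction step using MediaPipe's face detection is shown below; the detection confidence threshold and the particular test for an incomplete face (a bounding box clipped by the frame border) are assumptions rather than values taken from the patent.

```python
import cv2
import mediapipe as mp

def extract_face_crops(video_path):
    """Return the face crops of a video, discarding frames with no face or a clipped face."""
    crops = []
    detector = mp.solutions.face_detection.FaceDetection(min_detection_confidence=0.5)
    cap = cv2.VideoCapture(video_path)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        h, w = frame.shape[:2]
        result = detector.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        if not result.detections:
            continue  # no face detected in this frame
        box = result.detections[0].location_data.relative_bounding_box
        x0, y0 = int(box.xmin * w), int(box.ymin * h)
        x1, y1 = x0 + int(box.width * w), y0 + int(box.height * h)
        if x0 < 0 or y0 < 0 or x1 > w or y1 > h:
            continue  # incomplete face: bounding box clipped by the frame border
        crops.append(frame[y0:y1, x0:x1])
    cap.release()
    return crops
```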
And step S150, training a preset cantonese lip reading identification model according to the sample sequence to obtain the trained cantonese lip reading identification model.
And step 160, identifying the target video sequence according to the training-completed cantonese lip reading identification model to obtain an identification result.
After the first Cantonese video clip is obtained, segments with a human voice but no human figure and/or segments in which the human voice and the human figure do not match are eliminated from the first Cantonese video clip to obtain the second Cantonese video clip, and a labeled sample image data set is generated from the second Cantonese video clip. The method can collect a word-level Cantonese lip reading sample data set, filling the gap left by the current absence of a large-scale lip reading sample data set; and because useless sequences in the video sequence are removed, training the model on the sample sequences can improve the recognition accuracy of the trained model.
In related schemes, during model training the boundary information is ambiguous and key frames and useless frames are poorly distinguished. When the boundary information is ambiguous, the useless frames inside the boundary are hard to remove; when key frames and useless frames are poorly distinguished, the model cannot properly select key frames when extracting features from the video sequence, lip reading becomes slow, and the recognition accuracy of the model also decreases because of the influence of the non-key frames (useless frames).
Therefore, based on the above embodiment, the method further includes, before step S150, the steps of:
Step S1401, boundary information is added to the sample sequence, and the sample sequence with the added boundary information is encoded according to Libjpeg. By adding the boundary information, step S1401 solves the problem that useless boundary frames cannot be removed. The training data are encoded with Libjpeg in order to compress the data and thereby speed up the subsequent training process.
Existing schemes generally use a recognition model whose backbone consists of ResNet-18 and a GRU. Unlike the existing schemes, the preset Cantonese lip reading recognition model comprises a feature extraction network, an LSTM network, a three-layer BiGRU network and a mutual information maximization network, and the training process of the Cantonese lip reading recognition model comprises the following steps:
Step S1501, extracting features in the sample sequence according to the feature extraction network, and setting mutual information constraint between the features and the labels.
In one embodiment, the feature extraction network includes a ResNet-34 network and a global average pooling layer. Compared with the commonly used ResNet-18 network, the deeper ResNet-34 network can extract deeper features, and the global average pooling layer is used instead of a fully connected layer in order to structurally regularize the entire network and prevent overfitting.
In another embodiment, the feature extraction network comprises a 3D CNN network, a spatial maximization pooling layer, a ResNet-34 network, and a global averaging pooling layer, connected in sequence. Step S1501 specifically includes the steps of:
Step S15011, extracting initial features in the sample sequence according to the 3D CNN network.
Step S15012, the initial features are compressed according to the spatial maximum pooling layer.
Step S15013, the initial features are divided evenly into a plurality of parts, the features of each part are extracted by the ResNet-34 network, and a mutual information constraint between the features and the labels is added.
Step S15014, the features added with mutual information constraint are averaged and pooled according to the global averaging pooling layer.
In steps S15011 and S15012, a 3D CNN network and a spatial maximum pooling layer are placed at the front end of the ResNet-34 network. The 3D CNN network first performs feature extraction on the initial frames for preliminary temporal alignment, and the spatial maximum pooling layer then compresses the features in the spatial domain; this processing achieves a better recognition effect. In steps S15013 and S15014, the sequence features are divided evenly into T parts according to the number of frames (e.g., T frames) of the input sequence, and the features of each part are extracted by the ResNet-34 network. To improve the ability to associate fine-grained lip movements with the corresponding labels, and thus improve the recognition accuracy of the model, a mutual information constraint is applied between the outputs of ResNet-34 and the labels. Note that mutual information is usually used in feature extraction to measure the degree of association between feature items and categories; when the mutual information is maximized, the features correspond one-to-one to their labels more closely. The obtained features are then fed into the global average pooling layer, which structurally regularizes the whole network to prevent overfitting.
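The following PyTorch sketch illustrates the front end described above (3D CNN, spatial maximum pooling, per-frame ResNet-34, global average pooling). Kernel sizes, channel widths and strides are assumptions, since the patent does not specify them.

```python
import torch
import torch.nn as nn
import torchvision

class VisualFrontEnd(nn.Module):
    """3D CNN -> spatial max pooling -> per-frame ResNet-34 -> global average pooling."""
    def __init__(self):
        super().__init__()
        # 3D convolution over (T, H, W) for preliminary temporal alignment
        self.conv3d = nn.Sequential(
            nn.Conv3d(1, 64, kernel_size=(5, 7, 7), stride=(1, 2, 2), padding=(2, 3, 3)),
            nn.BatchNorm3d(64),
            nn.ReLU(inplace=True),
        )
        # spatial max pooling compresses the spatial domain only, not the time axis
        self.spatial_pool = nn.MaxPool3d(kernel_size=(1, 3, 3), stride=(1, 2, 2), padding=(0, 1, 1))
        # per-frame ResNet-34 trunk; its built-in average pooling acts as the global average pooling layer
        resnet = torchvision.models.resnet34(weights=None)
        resnet.conv1 = nn.Conv2d(64, 64, kernel_size=7, stride=2, padding=3, bias=False)
        resnet.fc = nn.Identity()
        self.resnet = resnet

    def forward(self, x):                      # x: (B, 1, T, H, W) grayscale mouth crops
        g = self.spatial_pool(self.conv3d(x))  # (B, 64, T, H', W')
        b, c, t, h, w = g.shape
        frames = g.permute(0, 2, 1, 3, 4).reshape(b * t, c, h, w)
        feats = self.resnet(frames)            # (B*T, 512) after global average pooling
        return feats.view(b, t, -1)            # per-frame 512-d features
```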
Step S1502, corresponding weights are generated based on different frames of the tag according to the LSTM network.
At the output of ResNet-34, an LSTM network is added, which includes an LSTM layer and a linear layer. In this embodiment, the purpose of the added LSTM network is to select key frames, that is, to assign different weights to different frames according to the labels so as to distinguish key frames from non-key frames. In this embodiment, a weight may be any positive number, but the weights of useless frames should be as close to 0 as possible. A Relu function can then be introduced to obtain the weights.
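A minimal PyTorch sketch of this frame-weighting branch, implementing at = Relu(wlinear × LSTM(G)t + blinear), is given below; the LSTM hidden size is an assumption, and G is taken to be the per-frame feature sequence produced by the front end.

```python
import torch.nn as nn

class FrameWeightLSTM(nn.Module):
    """Produces a_t = Relu(w_linear * LSTM(G)_t + b_linear), one weight per frame."""
    def __init__(self, in_dim=512, hidden=256):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hidden, batch_first=True)
        self.linear = nn.Linear(hidden, 1)   # w_linear, b_linear
        self.relu = nn.ReLU()

    def forward(self, g):                    # g: (B, T, in_dim) per-frame features
        h, _ = self.lstm(g)                  # LSTM(G)_t, shape (B, T, hidden)
        a = self.relu(self.linear(h))        # a_t >= 0; weights of useless frames pushed toward 0
        return a.squeeze(-1)                 # (B, T)
```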
And step S1503, classifying the characteristics according to the three-layer BiGRU network to obtain an output result.
The output of the feature extraction network is fed into the three-layer BiGRU network, and the three-layer BiGRU network performs feature classification to obtain the output of the three-layer BiGRU network.
Step S1504, generating global average characteristics according to the output result and the weight.
The global average feature is obtained by weighting the outputs of the three-layer BiGRU network with the weights.
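The following sketch shows one way the three-layer BiGRU back end and the weighted global average feature could be realized; the hidden size, the frame-averaging of the classification scores and the 1/T normalization of the weighted sum are assumptions.

```python
import torch.nn as nn

class BiGRUBackEnd(nn.Module):
    """Three-layer BiGRU classifier plus the weighted global average feature F."""
    def __init__(self, in_dim=512, hidden=256, num_classes=1000):
        super().__init__()
        self.bigru = nn.GRU(in_dim, hidden, num_layers=3, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden, num_classes)   # projects P_t to c classes

    def forward(self, feats, weights):        # feats: (B, T, in_dim), weights: (B, T)
        p, _ = self.bigru(feats)              # P_t, shape (B, T, 2*hidden)
        logits = self.classifier(p).mean(dim=1)          # frame-averaged classification scores
        f = (weights.unsqueeze(-1) * p).mean(dim=1)      # F = (1/T) * sum_t a_t * P_t
        return logits, f
```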
Step S1505, maximizing the mutual information between the global average feature and the tag according to the mutual information maximizing network.
The global average feature and the one-hot vector of the label are concatenated, and the concatenation result is taken as the input of the global mutual information maximization network. In this embodiment, the mutual information maximization network is composed of two linear layers and one sigmoid activation layer; it maximizes the mutual information between the global average feature and a given label. If the global average feature and the label come from the same sample, the output of the global mutual information maximization network is close to 1 (positive sample); if the global average feature and the label are not a paired sample, the output is close to 0 (negative sample).
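A hedged sketch of the mutual information maximization network (two linear layers and a sigmoid) and of the corresponding loss term is given below; the hidden width, the intermediate ReLU and the way negative pairs are formed (shuffling labels within the batch) are assumptions, and the loss is negated so that minimizing it maximizes LossMI.

```python
import torch
import torch.nn as nn

class MIEstimator(nn.Module):
    """Two linear layers and a sigmoid; scores how well F matches a one-hot label."""
    def __init__(self, feat_dim, num_classes, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim + num_classes, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, 1),
            nn.Sigmoid(),
        )

    def forward(self, f, label_onehot):       # f: (B, D), label_onehot: (B, C)
        return self.net(torch.cat([f, label_onehot], dim=-1)).squeeze(-1)

def mi_loss(mi_net, f, label_onehot, eps=1e-8):
    pos = mi_net(f, label_onehot)                           # paired (F, L): push toward 1
    perm = torch.randperm(f.size(0), device=f.device)
    neg = mi_net(f, label_onehot[perm])                     # mismatched pairs: push toward 0
    return -(torch.log(pos + eps).mean() + torch.log(1.0 - neg + eps).mean())
```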
The method embodiment has the following beneficial effects:
(1) When building the sample data set used to train the model, after the first Cantonese video clip is acquired, segments with a human voice but no human figure and/or segments in which the human voice and the human figure do not match are removed from the first Cantonese video clip to obtain the second Cantonese video clip, and labeled sample sequences are generated from the second Cantonese video clip. Useless sequences in the video sequence are eliminated, so the recognition accuracy of the trained model can be improved.
(2) The traditional lip reading recognition model adopts a ResNet-18 + GRU structure, whereas this method adopts a ResNet-34 network, which is deeper than the ResNet-18 network, so deeper features can be extracted.
(3) In the lip reading task, boundary information is ambiguous and key frames and useless frames are poorly distinguished. Before the samples in the sample data set are input into the model, boundary information is added to the sample sequences, which solves the problem that useless boundary frames cannot be removed. An LSTM network consisting of an LSTM layer and a linear layer is added at the back end of the feature extraction network, and key frames are selected by the LSTM network through weight generation, so that key frames and non-key frames can be correctly distinguished and the lip reading recognition accuracy of the model can be effectively improved.
(4) To improve the ability to associate fine-grained lip movements with the corresponding labels and further improve the recognition accuracy of the model, the method also designs a mutual information maximization network consisting of two linear layers and a sigmoid activation layer. The global average feature is obtained by weighting the features output by the three-layer BiGRU network with the weights; the global average feature and the label are concatenated and used as the input of the global mutual information maximization network, which performs mutual information maximization on the global average feature and the label. If they form a paired sample, the output of the global mutual information maximization network is close to 1, otherwise it is close to 0; this effectively improves the lip reading recognition accuracy of the model.
Referring to fig. 4 to 6, in order to facilitate understanding of those skilled in the art, according to one embodiment of the present invention, there is provided a cantonese lip reading recognition method, comprising the steps of:
Step S210, constructing a Cantonese lip reading sample data set.
Step S2101, Cantonese television programs are crawled from the Internet with the you-get tool to obtain video clips. The term "television programs" as used herein includes, but is not limited to, Cantonese news broadcasts, Cantonese variety shows, interviews with Cantonese-speaking personalities, and talk shows.
Step S2102, useless segments are removed from the video clips. The useless segments here refer to segments that have a human voice but no human figure, or segments in which the human voice and the human figure do not match.
Step S2103, the audio sequence and the video sequence of each video clip are separated, a speech processing tool (e.g. the iFLYTEK speech transcription function) is used to segment the audio sequence into words and generate timestamps, and the video sequence and the audio sequence are named according to the same naming scheme, with a video sequence and its corresponding audio sequence sharing the same name to facilitate later pairing.
Step S2104, all word-segmentation timestamps are expanded by 0.02 s, and labels are generated according to the video sequence name, the word-segmentation timestamp, the word-segmentation pinyin and the word-segmentation order.
Step S2105, the MediaPipe tool is used to extract faces from the video sequences, and a filter is trained to remove all images in which the detected face is incomplete.
Steps S2101 to S2105 can effectively remove useless segments, and can also ensure that the collected data has environmental diversity and is closer to real-life scenes.
Step S220, constructing a Cantonese lip reading recognition model based on global mutual information maximization.
The Cantonese lip reading recognition model mainly comprises a backbone network formed by combining a ResNet-34 network with a three-layer BiGRU network, and a global mutual information maximization network.
First, a 3D CNN layer and a spatial maximum pooling layer are added before the ResNet-34 network. The 3D CNN layer extracts features from the initial frames of the input video sequence for preliminary temporal alignment, and the spatial maximum pooling layer then compresses the features in the spatial domain, which reduces the training time without affecting the recognition effect. Adding the 3D CNN layer and the spatial maximum pooling layer before the ResNet-34 network allows the Cantonese lip reading network model to achieve a better recognition effect.
Assuming the input video sequence has T frames, the sequence features are divided into T parts according to the number of frames T, and the features of each part are extracted with the ResNet-34 network. To improve the ability to associate fine-grained lip movements with the corresponding label L, a mutual information constraint is imposed between the output of the ResNet-34 network and the label L. After the mutual information constraint is applied, the obtained features are input to a global average pooling layer; global average pooling is used instead of a fully connected layer in order to structurally regularize the entire network and prevent overfitting.
At the output of the ResNet-34 network, an LSTM layer and a linear layer are added and combined into an LSTM network. The added LSTM network assigns a different weight a to each frame according to the label L. For selecting key frames, the weight a may be any positive number, but the weight a of a useless frame should be as close to 0 as possible. A Relu function is then introduced to obtain the weight a:
at = Relu(wlinear × LSTM(G)t + blinear)
In the above equation, G represents the output of the global average pooling layer, wlinear and blinear represent the parameters of the linear layer, LSTM(G)t represents the hidden state of the LSTM layer at time step t, and at represents the weight of the frame sequence at time step t.
The final global average feature F consists of the output P and weights a of the three-layer BiGRU network.
F = (1/T) Σt=1..T at × Pt
Where T represents the length of the entire video sequence, i.e. the total number of frames.
The global average feature F and the one-hot vector of the label L are concatenated as the input to the global mutual information maximization network.
The global mutual information maximization network consists of two linear layers and a sigmoid activation layer, and mutual information between the global average feature and a given tag can be maximized through the global mutual information maximization network. If the global average feature F and the label L are from the same sample, the output of the global mutual information maximization network should be as close to 1 (positive sample) as possible, and if the global average feature F and the label L are not paired samples, the output of the global mutual information maximization network should be as close to 0 (negative sample) as possible, so the global mutual information maximization network can be expressed as:
LossMI = Ep(F,L)[log(MI(F,L))] + Ep(F)p(L)[log(1 - MI(F,L))]
Where LossMI represents the optimization function for global mutual information maximization, p(F,L) represents the joint distribution of the sample pair (F,L), p(F)p(L) represents the marginal distribution of the sample pair (F,L), MI(F,L) represents the mutual information between the two variables F and L, and E denotes the mathematical expectation.
After the output Pt of the three-layer BiGRU network in the backbone network passes through the linear layer, the dimension becomes C-dimension, and the final classification score is as follows:
where c represents the total number of vocabulary classes in the data set, and SLi represents the classification score of the label Li.
The loss function of the entire model is as follows:
Losstotal = -Σi=1..c Li log(SLi) + LossMI
Wherein Li represents the i-th label, Losstotal represents the total loss, the first term of the loss function is the cross entropy loss function, and the second term LossMI is the optimization function of the global mutual information maximization network.
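As a usage note, the two terms could be combined as in the minimal sketch below; the equal weighting of the cross entropy term and the mutual information term is an assumption, and the helper names follow the sketches above.

```python
import torch.nn.functional as nnf

def total_loss(logits, targets, mi_term):
    """logits: (B, C) classification scores before softmax; targets: (B,) class indices;
    mi_term: the value returned by mi_loss above."""
    return nnf.cross_entropy(logits, targets) + mi_term
```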
Step S230, boundary information is added to the data in the data set, and the Cantonese lip reading recognition model is trained.
Step S2301, the boundary information is added. The video sequence is cropped to a size of 88 × 88, and boundary information is then added to the video sequence according to the timestamp information. The boundary information is added as follows: the rounded value of OP (the start timestamp) × 25 (the video frame rate) is selected as the start frame, the 40 frames after it are taken as the input video sequence, and data shorter than 40 frames is screened out. Adding the boundary information solves the problem that useless boundary frames cannot be removed.
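A minimal sketch of this boundary-information step is shown below: the start frame is round(start timestamp × 25), the next 40 frames form the input sequence, and shorter clips are screened out. Approximating the 88 × 88 crop with a resize of the face crops is an assumption.

```python
import cv2

FPS = 25       # video frame rate from the description above
SEQ_LEN = 40   # fixed number of frames per input sequence

def clip_by_boundary(frames, start_timestamp):
    """frames: list of face crops for one video sequence; start_timestamp in seconds."""
    start = round(start_timestamp * FPS)
    clip = frames[start:start + SEQ_LEN]
    if len(clip) < SEQ_LEN:
        return None                          # screened out: shorter than 40 frames
    return [cv2.resize(f, (88, 88)) for f in clip]
```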
Step S2302, the video sequences are encoded with Libjpeg, training parameters such as batch size and number of epochs are set, and the encoded data are input into the Cantonese lip reading recognition model for training to obtain the training weights and the trained Cantonese lip reading recognition model. The purpose of the encoding is to compress the data after the boundary information has been added, thereby increasing the later training speed.
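The following sketch illustrates the encoding step, using OpenCV's JPEG encoder (which is typically backed by libjpeg or libjpeg-turbo) to stand in for Libjpeg; the JPEG quality setting and the pickle container format are assumptions.

```python
import cv2
import pickle

def encode_clip(frames, label, out_path, quality=90):
    """Store one 40-frame sample as JPEG byte strings plus its label."""
    jpegs = [cv2.imencode(".jpg", f, [cv2.IMWRITE_JPEG_QUALITY, quality])[1].tobytes()
             for f in frames]
    with open(out_path, "wb") as fh:
        pickle.dump({"frames": jpegs, "label": label}, fh)
```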
Step S240, the target video sequence is recognized with the trained Cantonese lip reading recognition model.
Step S2401, the PySide toolkit is used to design the UI of the system, which is divided into three parts: a prompter area, a face display area and a result display area.
Step S2402, the training weights and the Cantonese lip reading recognition model are loaded, and the buttons of the UI are matched to the corresponding functions in the code.
Step S2403, recognition starts: the target video sequence is collected through the face display area, and the face in the target video sequence is extracted with MediaPipe.
Step S2404, the network is loaded, and the target video sequence is processed with the Cantonese lip reading recognition model.
Step S2405, the lip reading recognition result produced by the Cantonese lip reading recognition model is shown in the result display area.
Research on lip reading recognition is significant for the large hearing-impaired population, and Cantonese is widely used in China's Guangdong and Guangxi provinces, in Hong Kong and Macau, and in Chinese communities worldwide, so research on Cantonese lip reading is of great significance. At present, no company or large research institution has released a large-scale Cantonese lip reading data set; this application obtains word-level sample data by downloading online video resources, manually cutting them, and screening out useless segments. Because ambiguous boundary information and poorly distinguished key frames and useless frames occur in the lip reading task, the method first preprocesses the data in the sample data set and solves the problem that useless boundary frames cannot be removed by adding boundary information. The method improves on the existing model: the ResNet-18 network is replaced with the ResNet-34 network to extract deeper features, an LSTM network (consisting of an LSTM layer and a linear layer) is added at the output end of the ResNet-34 network, and key frames are selected by the LSTM network by generating weights a; since a is produced by a Relu, the weights of useless frames can become 0. The features output by the three-layer BiGRU network are then weighted with the weights a to obtain the global average feature F. To improve the ability to associate fine-grained lip movements with the corresponding labels and further improve the recognition accuracy of the model, the global average feature F and the label L are concatenated and used as the input of the global mutual information maximization network, which then performs mutual information maximization on F and L; if they form a paired sample, the output of the global mutual information maximization network is close to 1, otherwise it is close to 0.
The method fills the gap left by the absence of a large-scale lip reading sample data set in the field of Cantonese lip reading, and in the Cantonese lip reading recognition model used, the proposed combination of adding boundary information and maximizing global mutual information can effectively improve the accuracy of lip reading recognition.
Referring to fig. 7, the present application further provides a computer device 301, including a memory 310, a processor 320, and a computer program 311 stored in the memory 310 and capable of running on the processor, where the processor 320 implements the cantonese lip reading recognition method as described above when executing the computer program 311.
The processor 320 and the memory 310 may be connected by a bus or other means.
Memory 310 acts as a non-transitory computer readable storage medium that may be used to store non-transitory software programs as well as non-transitory computer executable programs. In addition, memory 310 may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some implementations, memory 310 may optionally include memory located remotely from the processor to which the remote memory may be connected via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The non-transitory software program and instructions required to implement the cantonese lip reading identification method of the above-described embodiments are stored in the memory, and when executed by the processor, the cantonese lip reading identification method of the above-described embodiments is performed, for example, the method steps S110 to S160 in fig. 1 or the method steps S210 to S240 in fig. 4 described above are performed.
Referring to fig. 8, the present application also provides a computer-readable storage medium 401 storing computer-executable instructions 410, the computer-executable instructions 410 being for performing the cantonese lip reading identification method as described above.
The computer-readable storage medium 401 stores computer-executable instructions 410, where the computer-executable instructions 410 are executed by a processor or controller, for example, by a processor in the above-described electronic device embodiment, and may cause the processor to perform the cantonese lip reading identification method in the above-described embodiment, for example, performing the method steps S110 to S160 in fig. 1 or the method steps S210 to S240 in fig. 4 described above.
Those of ordinary skill in the art will appreciate that all or some of the steps, systems, and methods disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof. Some or all of the physical components may be implemented as software executed by a processor, such as a central processing unit, digital signal processor, or microprocessor, or as hardware, or as an integrated circuit, such as an application specific integrated circuit. Such software may be distributed on computer readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). The term computer storage media includes both volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of data such as computer readable instructions, data structures, program modules or other data, as known to those skilled in the art. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital Versatile Disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired data and which can be accessed by a computer. Furthermore, as is well known to those of ordinary skill in the art, communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any data delivery media.
In the description of the present specification, reference to the terms "one embodiment," "some embodiments," "illustrative embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiments or examples. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
Although embodiments of the present invention have been shown and described, it will be understood by those skilled in the art that various changes, modifications, substitutions and alterations can be made therein without departing from the spirit and scope of the invention as defined by the appended claims and their equivalents.

Claims (10)

1. A Cantonese lip reading recognition method, characterized by comprising the following steps:
obtaining a first Cantonese video clip;
cutting out useless segments from the first Cantonese video clip to obtain a second Cantonese video clip, the useless segments including segments with a human voice but no human figure and/or segments in which the human voice and the human figure do not match;
dividing the video sequence and the audio sequence in the second Cantonese video clip, segmenting the audio sequence into words and generating word-segmentation timestamps, and generating labels according to the word segments and the word-segmentation timestamps;
extracting face images from the video sequence, filtering out incomplete face images, and generating a sample sequence according to the filtered face images and the labels;
training a preset Cantonese lip reading recognition model according to the sample sequence to obtain the trained Cantonese lip reading recognition model;
recognizing a target video sequence according to the trained Cantonese lip reading recognition model to obtain a recognition result.
2. The Cantonese lip reading recognition method according to claim 1, characterized in that, before training the preset Cantonese lip reading recognition model according to the sample sequence, the method further comprises:
adding boundary information to the sample sequence, and encoding the sample sequence with the added boundary information according to Libjpeg.
3. The Cantonese lip reading recognition method according to claim 2, characterized in that the Cantonese lip reading recognition model comprises a feature extraction network, an LSTM network, a three-layer BiGRU network and a mutual information maximization network, and the training process of the Cantonese lip reading recognition model comprises:
extracting features from the sample sequence according to the feature extraction network, and setting a mutual information constraint between the features and the labels;
generating corresponding weights based on different frames of the label according to the LSTM network;
classifying the features according to the three-layer BiGRU network to obtain an output result;
generating a global average feature according to the output result and the weights;
maximizing the mutual information between the global average feature and the label according to the mutual information maximization network.
4. The Cantonese lip reading recognition method according to claim 3, characterized in that the feature extraction network comprises a 3D CNN network, a spatial maximum pooling layer, a ResNet network and a global average pooling layer connected in sequence; extracting features from the sample sequence according to the feature extraction network and setting a mutual information constraint between the features and the labels comprises:
extracting initial features from the sample sequence according to the 3D CNN network;
compressing the initial features according to the spatial maximum pooling layer;
dividing the initial features evenly into a plurality of parts, extracting the features of each part according to the ResNet network, and adding a mutual information constraint between the features and the labels;
average-pooling the features with the added mutual information constraint according to the global average pooling layer.
5. The Cantonese lip reading recognition method according to claim 4, characterized in that the ResNet network is a ResNet-34 network.
6. The Cantonese lip reading recognition method according to claim 4, characterized in that the LSTM network comprises an LSTM layer and a linear layer connected in sequence; the calculation formula for generating the corresponding weights based on different frames of the label according to the LSTM network comprises:
at = Relu(wlinear × LSTM(G)t + blinear)
wherein G represents the output result of the spatial maximum pooling layer, wlinear and blinear represent the parameters of the linear layer, LSTM(G)t represents the hidden state of the LSTM layer at time step t, Relu() represents the Relu function, and at represents the weight of the frame sequence at time step t.
7. The Cantonese lip reading recognition method according to claim 6, characterized in that the optimization function of the mutual information maximization network comprises:
LossMI = Ep(F,L)[log(MI(F,L))] + Ep(F)p(L)[log(1 - MI(F,L))]
wherein LossMI represents the optimization function of the global mutual information maximization network, p(F,L) represents the joint distribution of the sample pair (F,L), p(F)p(L) represents the marginal distribution of the sample pair (F,L), MI(F,L) represents the mutual information between F and L, E denotes the mathematical expectation, F represents the global average feature, and L represents the label.
8. The Cantonese lip reading recognition method according to claim 7, characterized in that the loss function of the Cantonese lip reading recognition model comprises:
LossCE = -Σi=1..c Li log(SLi)
wherein LossCE represents the cross entropy loss function, c represents the total number of vocabulary classes in the sample sequence, and SLi represents the score of the label Li output by the three-layer BiGRU network.
9. An electronic device, characterized by comprising at least one control processor and a memory communicatively connected to the at least one control processor; the memory stores instructions executable by the at least one control processor, and the instructions are executed by the at least one control processor to enable the at least one control processor to perform the Cantonese lip reading recognition method according to any one of claims 1 to 8.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores computer-executable instructions, and the computer-executable instructions are used to cause a computer to perform the Cantonese lip reading recognition method according to any one of claims 1 to 8.
CN202111507949.7A | 2021-12-10 | 2021-12-10 | A Cantonese lip reading recognition method, device and storage medium | Active | CN114299418B (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202111507949.7A (CN114299418B (en)) | 2021-12-10 | 2021-12-10 | A Cantonese lip reading recognition method, device and storage medium

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202111507949.7A (CN114299418B (en)) | 2021-12-10 | 2021-12-10 | A Cantonese lip reading recognition method, device and storage medium

Publications (2)

Publication Number | Publication Date
CN114299418A (en) | 2022-04-08
CN114299418B | 2025-01-03

Family

ID=80968485

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202111507949.7A (Active, CN114299418B (en)) | A Cantonese lip reading recognition method, device and storage medium | 2021-12-10 | 2021-12-10

Country Status (1)

Country | Link
CN (1) | CN114299418B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN115019772A (en)* | 2022-06-07 | 2022-09-06 | 湘潭大学 | Guangdong language voice recognition enhancing method based on visual information
CN116386142A (en)* | 2023-04-03 | 2023-07-04 | 湘潭大学 | Conv former-based Guangdong sentence-level lip language identification method
CN116311537A (en)* | 2023-05-18 | 2023-06-23 | 讯龙(广东)智能科技有限公司 | Training method, storage medium and system for video motion recognition algorithm model
WO2024261626A1 (en)* | 2023-06-22 | 2024-12-26 | Technology Innovation Institute – Sole Proprietorship LLC | Visual speech recognition based communication training system

Citations (2)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN109524006A (en)* | 2018-10-17 | 2019-03-26 | 天津大学 | A kind of standard Chinese lip reading recognition methods based on deep learning
CN110276259A (en)* | 2019-05-21 | 2019-09-24 | 平安科技(深圳)有限公司 | Lip reading recognition methods, device, computer equipment and storage medium

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication numberPriority datePublication dateAssigneeTitle
KR101092820B1 (en)*2009-09-222011-12-12현대자동차주식회사Lipreading and Voice recognition combination multimodal interface system
CN109409204B (en)*2018-09-072021-08-06北京市商汤科技开发有限公司Anti-counterfeiting detection method and device, electronic equipment and storage medium

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication numberPriority datePublication dateAssigneeTitle
CN109524006A (en)*2018-10-172019-03-26天津大学A kind of standard Chinese lip reading recognition methods based on deep learning
CN110276259A (en)*2019-05-212019-09-24平安科技(深圳)有限公司Lip reading recognition methods, device, computer equipment and storage medium

Also Published As

Publication number | Publication date
CN114299418A (en) | 2022-04-08

Similar Documents

Publication | Publication Date | Title
CN114299418B (en) | A Cantonese lip reading recognition method, device and storage medium
CN109117777B (en) | Method and device for generating information
CN112668559B (en) | Multi-mode information fusion short video emotion judgment device and method
CN111738251B (en) | Optical character recognition method, device and electronic device fused with language model
CN109065021B (en) | End-to-end dialect identification method for generating countermeasure network based on conditional deep convolution
CN104735468B (en) | A kind of method and system that image is synthesized to new video based on semantic analysis
KR102433393B1 (en) | Apparatus and method for recognizing character in video contents
CN113766314B (en) | Video segmentation method, device, equipment, system and storage medium
CN110188829B (en) | Neural network training method, target recognition method and related products
CN112348111B (en) | Multi-modal feature fusion method and device in video, electronic equipment and medium
CN107507620A (en) | Voice broadcast sound setting method and device, mobile terminal and storage medium
CN114694070B (en) | Automatic video editing method, system, terminal and storage medium
CN111143617A (en) | A method and system for automatic generation of picture or video text description
CN114880496A (en) | Multimedia information topic analysis method, device, equipment and storage medium
CN112861864A (en) | Topic entry method, topic entry device, electronic device and computer-readable storage medium
CN115311595B (en) | Video feature extraction method and device and electronic equipment
CN112052687A (en) | Semantic feature processing method, device and medium based on deep separable convolution
CN112465596A (en) | Image information processing cloud computing platform based on electronic commerce live broadcast
Cosovic et al. | Classification methods in cultural heritage
CN114051154A (en) | News video strip splitting method and system
CN114398952A (en) | Training text generation method, device, electronic device and storage medium
EP4207771B1 (en) | Video processing method and apparatus
CN115392254A (en) | Interpretable cognitive prediction and discrimination method and system based on target task
CN119580738A (en) | Video processing method, device, equipment and medium based on multimodal information fusion
CN119478763A (en) | Video segmentation method, device, equipment and storage medium based on multimodal features

Legal Events

Date | Code | Title | Description
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant
