Disclosure of Invention
In view of the above, embodiments of the present disclosure provide a method, apparatus, device, and medium for identifying a lip-syllable, which at least partially solve the problems of poor efficiency, accuracy, and adaptability of lip-syllable identification in the prior art.
In a first aspect, an embodiment of the present disclosure provides a method for identifying a lip syllable, including:
acquiring a dynamic face video of a target person, identifying a plurality of initial feature points in each frame of face image of the video by using the Dlib face recognition framework, and obtaining coordinate information of each initial feature point;
selecting a preset number of feature points related to the lips from all the initial feature points as target feature points, and obtaining a plurality of groups of distance change functions according to the coordinate information of the target feature points to form a distance time sequence;
converting the distance time sequence into a target curve;
calculating a lip opening and closing degree according to the target curve, comparing the lip opening and closing degree with a first threshold to determine the start and end moments of speaking lip movement, and segmenting the corresponding pronunciation sequence from the distance time sequence accordingly;
inputting all features of the dimension-reduced pronunciation sequence into a cross-validation recursive feature elimination model, adding a weight to each feature, and recursively eliminating the feature with the lowest importance each time until an optimal feature sequence is obtained;
calculating the distance similarity between the optimal feature sequence and a sample feature sequence, wherein the sample feature sequence is the lip feature sequence corresponding to a sample syllable;
judging whether the similarity is smaller than a second threshold;
if the similarity is smaller than the second threshold, judging that the syllables corresponding to the optimal feature sequence and the sample feature sequence are identical;
and if the similarity is greater than or equal to the second threshold, judging that the syllables corresponding to the optimal feature sequence and the sample feature sequence are different, and calculating the similarity between the optimal feature sequence and other sample feature sequences until a similarity smaller than the second threshold is found.
According to a specific implementation manner of the embodiment of the present disclosure, the coordinate information is the time-varying coordinate value of each initial feature point.
According to a specific implementation manner of the embodiment of the present disclosure, the step of obtaining a plurality of sets of distance change functions according to the coordinate information of the target feature point to form a distance time sequence includes:
Calculating coordinate change values of the target feature points in the face images of all frames according to the coordinate information;
Calculating the Euclidean distance between every two target feature points according to the coordinate change values to obtain a plurality of groups of distance change functions forming the distance time sequence, wherein the Euclidean distance formula is:
dis(Pi, Pj) = √((xi − xj)² + (yi − yj)²)
where (xi, yi) and (xj, yj) are the coordinates of two different target feature points Pi and Pj, respectively.
According to a specific implementation manner of the embodiment of the present disclosure, the step of converting the target curve according to the distance time sequence includes:
Calculating a maximum value in the distance time series;
normalizing each value in the distance time sequence according to the maximum value to obtain an initial curve;
And carrying out smoothing treatment on the initial curve to obtain a target curve.
According to a specific implementation manner of the embodiment of the present disclosure, the step of inputting all the features of the pronunciation sequence after the dimension reduction into a cross-validation recursive feature elimination model, adding a weight to each feature, and recursively eliminating the feature with the lowest importance each time until an optimal feature sequence is obtained includes:
Dividing the pronunciation sequence into an initial training set and an initial verification set after dimension reduction;
Training the cross-validation recursive feature elimination model using all feature sequences of the initial training set, calculating the importance of each feature sequence and ranking them;
selecting the first N feature sequences according to the ranking result to form a new training set, wherein N is a positive integer greater than 1;
Evaluating the cross-validation recursive feature elimination model using the initial validation set and splitting the new training set into a sub-training set and a sub-validation set;
recalculating the importance of each feature sequence in the sub-training set and re-ranking them;
And selecting the first N feature sequences according to the new ranking result, forming a new training set and re-splitting it, until a preset number of feature sequences is reached and the optimal feature sequences are selected.
According to a specific implementation manner of the embodiment of the present disclosure, the step of calculating the similarity of the distance between the optimal feature sequence and the sample feature sequence includes:
establishing a matrix grid according to the number of the feature points in the optimal feature sequence and the sample feature sequence;
For each of the feature points in the optimal feature sequence and the sample feature sequence, searching the shortest path in the matrix grid;
and traversing all the characteristic points in the optimal characteristic sequence and the sample characteristic sequence to obtain the similarity.
In a second aspect, an embodiment of the present disclosure provides a lip-syllable recognition apparatus, including:
The acquisition module is used for acquiring a dynamic face video of a target person, identifying a plurality of initial feature points in each frame of face image of the video by using the Dlib face recognition framework, and obtaining coordinate information of each initial feature point;
The selection module is used for selecting a preset number of feature points related to the lips from all the initial feature points as target feature points, and obtaining a plurality of groups of distance change functions according to the coordinate information of the target feature points to form a distance time sequence;
the conversion module is used for converting the distance time sequence into a target curve;
The segmentation module is used for calculating the lip opening and closing degree according to the target curve, comparing it with a first threshold to determine the start and end moments of speaking lip movement, and segmenting the corresponding pronunciation sequence from the distance time sequence accordingly;
The recursion module is used for inputting all features of the dimension-reduced pronunciation sequence into a cross-validation recursive feature elimination model, adding a weight to each feature, and recursively eliminating the feature with the lowest importance each time until an optimal feature sequence is obtained;
The calculation module is used for calculating the distance similarity between the optimal feature sequence and a sample feature sequence, wherein the sample feature sequence is the lip feature sequence corresponding to a sample syllable;
the judgment module is used for judging whether the similarity is smaller than a second threshold;
If the similarity is smaller than the second threshold, it is judged that the syllables corresponding to the optimal feature sequence and the sample feature sequence are identical;
And if the similarity is greater than or equal to the second threshold, it is judged that the syllables corresponding to the optimal feature sequence and the sample feature sequence are different, and the similarity between the optimal feature sequence and other sample feature sequences is calculated until a similarity smaller than the second threshold is found.
In a third aspect, embodiments of the present disclosure further provide an electronic device, including:
at least one processor; and
A memory communicatively coupled to the at least one processor; wherein,
The memory stores instructions executable by the at least one processor to enable the at least one processor to perform the lip-syllable recognition method of the first aspect or any implementation of the first aspect.
In a fourth aspect, embodiments of the present disclosure also provide a non-transitory computer-readable storage medium storing computer instructions for causing the computer to perform the lip-syllable recognition method of the first aspect or any implementation of the first aspect.
In a fifth aspect, the presently disclosed embodiments also provide a computer program product comprising a computer program stored on a non-transitory computer readable storage medium, the computer program comprising program instructions which, when executed by a computer, cause the computer to perform the lip-syllable recognition method of the first aspect or any implementation of the first aspect.
The lip syllable recognition scheme in the embodiments of the present disclosure includes the following steps: acquiring a dynamic face video of a target person, identifying a plurality of initial feature points in each frame of face image of the video by using the Dlib face recognition framework, and obtaining coordinate information of each initial feature point; selecting a preset number of feature points related to the lips from all the initial feature points as target feature points, and obtaining a plurality of groups of distance change functions according to the coordinate information of the target feature points to form a distance time sequence; converting the distance time sequence into a target curve; calculating a lip opening and closing degree according to the target curve, comparing it with a first threshold to determine the start and end moments of speaking lip movement, and segmenting the corresponding pronunciation sequence from the distance time sequence accordingly; inputting all features of the dimension-reduced pronunciation sequence into a cross-validation recursive feature elimination model, adding a weight to each feature, and recursively eliminating the feature with the lowest importance each time until an optimal feature sequence is obtained; calculating the distance similarity between the optimal feature sequence and a sample feature sequence, wherein the sample feature sequence is the lip feature sequence corresponding to a sample syllable; judging whether the similarity is smaller than a second threshold; if the similarity is smaller than the second threshold, judging that the syllables corresponding to the optimal feature sequence and the sample feature sequence are identical; and if the similarity is greater than or equal to the second threshold, judging that the syllables corresponding to the optimal feature sequence and the sample feature sequence are different, and calculating the similarity between the optimal feature sequence and other sample feature sequences until a similarity smaller than the second threshold is found.
The beneficial effects of the embodiments of the present disclosure are as follows: the scheme computes fine-grained, syllable-level features, trains a cross-validation recursive feature elimination model, and compares the selected optimal feature sequence with sample feature sequences, so that the corresponding syllable is identified and the efficiency, accuracy, and adaptability of syllable detection are improved.
Detailed Description
Embodiments of the present disclosure are described in detail below with reference to the accompanying drawings.
Other advantages and effects of the present disclosure will become readily apparent to those skilled in the art from the following disclosure, which describes embodiments of the present disclosure by way of specific examples. It will be apparent that the described embodiments are merely some, but not all embodiments of the present disclosure. The disclosure may be embodied or practiced in other different specific embodiments, and details within the subject specification may be modified or changed from various points of view and applications without departing from the spirit of the disclosure. It should be noted that the following embodiments and features in the embodiments may be combined with each other without conflict. All other embodiments, which can be made by one of ordinary skill in the art without inventive effort, based on the embodiments in this disclosure are intended to be within the scope of this disclosure.
It is noted that various aspects of the embodiments are described below within the scope of the following claims. It should be apparent that the aspects described herein may be embodied in a wide variety of forms and that any specific structure and/or function described herein is merely illustrative. Based on the present disclosure, one skilled in the art will appreciate that one aspect described herein may be implemented independently of any other aspect, and that two or more of these aspects may be combined in various ways. For example, an apparatus may be implemented and/or a method practiced using any number of the aspects set forth herein. In addition, such apparatus may be implemented and/or such methods practiced using other structure and/or functionality in addition to one or more of the aspects set forth herein.
It should also be noted that the illustrations provided in the following embodiments merely illustrate the basic concepts of the disclosure by way of illustration, and only the components related to the disclosure are shown in the drawings and are not drawn according to the number, shape and size of the components in actual implementation, and the form, number and proportion of the components in actual implementation may be arbitrarily changed, and the layout of the components may be more complicated.
In addition, in the following description, specific details are provided in order to provide a thorough understanding of the examples. However, it will be understood by those skilled in the art that the aspects may be practiced without these specific details.
The embodiment of the disclosure provides a lip syllable recognition method, which can be applied to syllable recognition in video recognition scenarios.
Referring to fig. 1, a flow chart of a method for identifying a lip syllable according to an embodiment of the present disclosure is shown.
As shown in fig. 1, the method mainly comprises the following steps:
S101, acquiring a dynamic face video of a target person, identifying a plurality of initial feature points in each frame of face image of the video by using the Dlib face recognition framework, and obtaining coordinate information of each initial feature point;
optionally, the coordinate information is a coordinate value of each initial feature point that varies in time series.
In specific implementation, the electronic device may have a built-in video acquisition module or be connected to an external video capture device; for example, a mobile phone camera can capture the video stream of the changing lips, the Dlib face recognition framework identifies the 68 feature points on the face frame by frame, and the time-varying coordinate value of each initial feature point is obtained for each frame of image.
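By way of illustration only, this step can be sketched in Python roughly as follows; the sketch assumes Dlib's publicly distributed 68-point shape predictor model file and is not part of the claimed subject matter:

```python
# Illustrative sketch of S101: per-frame extraction of the 68 Dlib facial
# landmarks from a captured video.
import cv2
import dlib

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def landmark_coordinates(video_path):
    """Return one list of 68 (x, y) tuples per frame of the video."""
    frames = []
    cap = cv2.VideoCapture(video_path)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        faces = detector(gray)
        if faces:  # use the first detected face in each frame
            shape = predictor(gray, faces[0])
            frames.append([(shape.part(k).x, shape.part(k).y)
                           for k in range(68)])
    cap.release()
    return frames
```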
S102, selecting a preset number of feature points related to lips from all the initial feature points as target feature points, and obtaining a plurality of groups of distance change functions according to the coordinate information of the target feature points to form a distance time sequence;
Further, in step S102, a plurality of sets of distance change functions are obtained according to the coordinate information of the target feature point, so as to form a distance time sequence, including:
Calculating coordinate change values of the target feature points in the face images of all frames according to the coordinate information;
Calculating the Euclidean distance between every two target feature points according to the coordinate change values to obtain a plurality of groups of distance change functions forming the distance time sequence, wherein the Euclidean distance formula is:
dis(Pi, Pj) = √((xi − xj)² + (yi − yj)²)
where (xi, yi) and (xj, yj) are the coordinates of two different target feature points Pi and Pj, respectively.
For example, after the 68 initial feature points are obtained, 20 feature points related to the lip information are selected as the target feature points, and the distance between the coordinates of every two target feature points is then calculated from the coordinate information of all the target feature points according to the above Euclidean distance formula.
Calculating the distances between every two of the 20 target feature points yields C(20, 2) = 190 groups of distance change functions, i.e., 190 distance time series SOD(Pi,Pj), each representing the temporal change of the distance between feature point Pi and feature point Pj.
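A minimal sketch of this computation, assuming the per-frame landmark list produced above and taking the 20 mouth landmarks to be indices 48–67 of the Dlib 68-point scheme:

```python
# Illustrative sketch of S102: build the 190 distance time series SOD(Pi, Pj)
# from the 20 lip landmarks (indices 48-67 in Dlib's 68-point scheme).
from itertools import combinations

import numpy as np

def distance_time_series(frames):
    """frames: list of 68 (x, y) tuples per frame -> {(i, j): 1-D array}."""
    pts = np.asarray(frames, dtype=float)    # shape (T, 68, 2)
    lip = pts[:, 48:68, :]                   # the 20 lip landmarks
    sod = {}
    for a, b in combinations(range(20), 2):  # C(20, 2) = 190 pairs
        # Euclidean distance between the two landmarks in every frame
        sod[(48 + a, 48 + b)] = np.linalg.norm(lip[:, a] - lip[:, b], axis=1)
    return sod
```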
S103, converting the distance time sequence into a target curve;
optionally, converting the distance time sequence into the target curve in step S103 includes:
Calculating a maximum value in the distance time series;
normalizing each value in the distance time sequence according to the maximum value to obtain an initial curve;
And carrying out smoothing treatment on the initial curve to obtain a target curve.
In particular, plotting the distance time series yields the 190 initial curves of distance over time. Because lip shapes differ from person to person, the camera's depth of field changes the apparent lip size, and the amplitude of lip movement varies, these differences affect the subsequent calculation and recognition; moreover, the ordinate ranges of the resulting curves differ, making their trends hard to compare. The curves are therefore normalized to remove these influences.
The maximum value of each of the 190 distance time series SOD(Pi,Pj) is determined:
SODmax(Pi,Pj) = max over all frames t of SOD(Pi,Pj)(t)
and, in order to make the normalized scales consistent so that features can be found, each value in the distance time series SOD(Pi,Pj) is normalized by it:
SOD'(Pi,Pj)(t) = SOD(Pi,Pj)(t) / SODmax(Pi,Pj)
Observing the curves of the normalized distance time series SOD(Pi,Pj) reveals unstable burrs even while the lips are not moving, presumably tiny displacements caused by instability when the face recognition framework locates the feature points; since these would affect the subsequent similarity calculation, the signal needs denoising. The denoising algorithm used, Savitzky-Golay convolution smoothing (see the specific embodiment below), is an improvement on plain average smoothing.
the processed distance time series SOD (Pi,Pj) is close to a smooth curve so as to facilitate the later screening and calculation.
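A minimal sketch of the normalization and smoothing, assuming SciPy's Savitzky-Golay filter (the convolution smoothing named in the specific embodiment below); the window length and polynomial order are illustrative assumptions:

```python
# Illustrative sketch of S103: normalize each SOD(Pi, Pj) by its maximum and
# smooth it. The window length must be odd and no longer than the series.
from scipy.signal import savgol_filter

def normalize_and_smooth(sod, window=11, polyorder=3):
    out = {}
    for pair, series in sod.items():
        norm = series / series.max()  # SOD'(t) = SOD(t) / SODmax
        out[pair] = savgol_filter(norm, window, polyorder)
    return out
```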
S104, calculating the lip opening and closing degree according to the target curve, comparing it with a first threshold to determine the start and end moments of speaking lip movement, and segmenting the corresponding pronunciation sequence from the distance time sequence accordingly;
In view of the need to ensure accuracy in calculating the similarity of the curves, it is necessary to segment the phonetically meaningful portions of the curves.
To judge the start and end moments of pronunciation, i.e., the moments when the mouth opens and closes, a mouth opening degree can be defined:
openness = dis(P62, P66) / dis(P48, P54)
that is, the distance between the 62nd and 66th feature points (the centers of the inner upper and lower lip) is calculated and divided by the distance between the 48th and 54th feature points (the two mouth corners).
A suitable opening-and-closing threshold is set, the start and end times are recorded according to it, the two time points are mapped onto the distance time series, and the 190 distance time series of each test sample are cut at these points, yielding the segmented distance time series SOD(Pi,Pj) curve segments.
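A minimal sketch of this segmentation; the threshold value 0.03 is merely one of the opening-degree thresholds listed in the experiments below:

```python
# Illustrative sketch of S104: openness = dis(P62, P66) / dis(P48, P54);
# frames whose openness exceeds the first threshold bound the utterance.
import numpy as np

def segment_utterance(frames, threshold=0.03):
    pts = np.asarray(frames, dtype=float)                    # shape (T, 68, 2)
    gap = np.linalg.norm(pts[:, 62] - pts[:, 66], axis=1)    # inner-lip gap
    width = np.linalg.norm(pts[:, 48] - pts[:, 54], axis=1)  # mouth width
    openness = gap / width
    moving = np.flatnonzero(openness > threshold)
    if moving.size == 0:
        return None                # the lips never open
    return moving[0], moving[-1]   # start and end frame indices
```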
S105, inputting all the features of the pronunciation sequence subjected to dimension reduction into a cross-validation recursive feature elimination model, adding weights for each feature, and recursively eliminating the features with the lowest importance each time until an optimal feature sequence is obtained;
further, in step S105, inputting all the features of the pronunciation sequence after the dimension reduction into a cross-validation recursive feature elimination model, adding a weight to each feature, and recursively eliminating the feature with the lowest importance each time until an optimal feature sequence is obtained, including:
Dividing the pronunciation sequence into an initial training set and an initial verification set after dimension reduction;
Training the cross-validation recursive feature elimination model using all feature sequences of the initial training set, calculating the importance of each feature sequence and ranking them;
selecting the first N feature sequences according to the sorting result to form a new training set, wherein N is a positive integer greater than 1;
Evaluating the cross-validation recursive feature elimination model using the initial validation set and splitting a new training set into a sub-training set and a sub-validation set;
recalculating the importance of each feature sequence in the sub-training set and re-ranking them;
And selecting the first N feature sequences according to the new ranking result, forming a new training set and re-splitting it, until a preset number of feature sequences is reached and the optimal feature sequences are selected.
In practice, the 190 groups of smoothed distance time series SOD(Pi,Pj) need to be screened, since with so many sequences it is difficult to extract features of general applicability.
Observing the images of the 190 groups of distance time series SOD(Pi,Pj), some curves have a curvature close to 0 and show a straight-line trend; it can be presumed that points near the upper lip or near the lower lip move little relative to one another, so those distance time series SOD(Pi,Pj) provide no feature assistance for identification and classification, and feature dimension reduction is required.
A cross-validation recursive feature selection model can be adopted, which selects the optimal features through step-by-step recursive elimination. To reduce the performance fluctuation in feature evaluation caused by the small number of samples in the training set, a resampling process, i.e., cross-validation, is added. The method specifically comprises the following steps:
first, dividing the feature sequence data set into a training set and a validation set;
second, training the model using all the feature sequences, calculating the importance of each feature sequence, and ranking them;
third, extracting the first N most important feature sequences and training on the new feature data set;
fourth, evaluating the model on the validation set;
fifth, recalculating the importance of each feature sequence and re-ranking;
sixth, splitting the training set again into a new training set and a new validation set and training the model;
seventh, evaluating the model on the validation set and computing the importance ranking of each feature sequence;
eighth, repeating from the third step until a suitable number of feature sequences is found;
and ninth, selecting the optimal feature set to construct the final model (an illustrative sketch of this procedure follows).
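By way of illustration, the nine steps above can be approximated with scikit-learn's RFECV, which combines recursive feature elimination with cross-validated resampling; the random-forest estimator and the assumption that each of the 190 sequences has been reduced to one scalar feature per sample are illustrative choices, not requirements of the disclosure:

```python
# Illustrative sketch of S105: cross-validated recursive feature elimination.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV

def select_features(X, y):
    """X: (n_samples, 190) feature matrix, y: syllable labels."""
    selector = RFECV(
        estimator=RandomForestClassifier(n_estimators=100, random_state=0),
        step=1,                     # drop the least important feature per round
        cv=5,                       # resampling via 5-fold cross-validation
        min_features_to_select=20,  # the 20 sequences kept in the embodiment
    )
    selector.fit(X, y)
    return np.flatnonzero(selector.support_)  # indices of the kept sequences
```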
S106, calculating the similarity of the distance between the optimal feature sequence and the sample feature sequence, wherein the sample feature sequence is a lip feature sequence corresponding to a sample syllable;
on the basis of the above embodiment, the calculating the similarity of the distance between the optimal feature sequence and the sample feature sequence in step S106 includes:
establishing a matrix grid according to the number of the feature points in the optimal feature sequence and the sample feature sequence;
For each of the feature points in the optimal feature sequence and the sample feature sequence, searching the shortest path in the matrix grid;
and traversing all the characteristic points in the optimal characteristic sequence and the sample characteristic sequence to obtain the similarity.
In specific implementation, the distance time series SOD(Pi,Pj) formed by pairing feature points of different regions are selected; calculation and experiments show that the remaining 20 groups of distance time series SOD(Pi,Pj) play a key role in recognition.
The present disclosure adopts the DTW (dynamic time warping) algorithm, which is widely used in the field of speech recognition and performs well for similarity calculation between curves.
To calculate the similarity of the two distance time series SOD1(Pi,Pj) and SOD2(Pi,Pj) of the test syllable and the template syllable, whose lengths are m and n respectively:
SOD1(Pi,Pj) = d1,1, d1,2, d1,3, …, d1,m
SOD2(Pi,Pj) = d2,1, d2,2, d2,3, …, d2,n
An n × m matrix grid is constructed, where matrix element (i, j) represents the distance dis(d1,i, d2,j) between d1,i and d2,j (the Euclidean distance is generally adopted), and the shortest path through the grid is found by dynamic programming:
DTW(SOD1, SOD2) = min{ (Σk=1..K wk) / K }
where the wk are the matrix elements on the warping path, so that their sum defines a mapping of the sequence SOD1(Pi,Pj) onto SOD2(Pi,Pj), and the denominator K compensates for warping paths of different lengths. Finally, the cumulative distance, i.e., the similarity of the distance time series SOD1(Pi,Pj) and SOD2(Pi,Pj), is calculated recursively:
γ(i,j)=d(d1,i,d2,j)+min{γ(i-1,j),γ(i,j-1),γ(i-1,j-1)}
The minimum path cost, i.e., the value of γ(m, n), can be calculated by this dynamic programming recursion; this value is the DTW similarity.
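A minimal sketch of this dynamic programming recursion for two one-dimensional distance time series (in one dimension the Euclidean distance reduces to the absolute difference):

```python
# Illustrative sketch of S106: DTW similarity via
# gamma(i,j) = d(d1_i, d2_j) + min(gamma(i-1,j), gamma(i,j-1), gamma(i-1,j-1)).
import numpy as np

def dtw_distance(s1, s2):
    m, n = len(s1), len(s2)
    gamma = np.full((m + 1, n + 1), np.inf)
    gamma[0, 0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = abs(s1[i - 1] - s2[j - 1])  # 1-D Euclidean distance
            gamma[i, j] = cost + min(gamma[i - 1, j],
                                     gamma[i, j - 1],
                                     gamma[i - 1, j - 1])
    return gamma[m, n]  # the value gamma(m, n)
```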
S107, judging whether the similarity is smaller than a second threshold value;
In specific implementation, a DTW threshold may be preset as the second threshold, and after the similarity is obtained, the similarity may be compared with the second threshold, so as to determine a next operation flow.
If the similarity is smaller than the second threshold, step S108 is executed to determine that syllables corresponding to the optimal feature sequence and the sample feature sequence are the same;
If the similarity is greater than or equal to the second threshold, step S109 is executed to determine that syllables corresponding to the optimal feature sequence and the sample feature sequence are different, and calculate the similarity between the optimal feature sequence and other sample feature sequences until the similarity is less than the second threshold.
In a specific implementation, the sample feature sequences may be feature sequences obtained by recording and processing videos of testers pronouncing different syllables. If the similarity is smaller than the second threshold, the optimal feature sequence is judged to correspond to the same syllable as the sample feature sequence; if the similarity is greater than or equal to the second threshold, the two are judged to correspond to different syllables, and the similarity between the optimal feature sequence and other sample feature sequences is calculated until a similarity smaller than the second threshold is found.
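An illustrative sketch of this matching loop, reusing the dtw_distance sketch above and requiring, as in the specific embodiment below, that every per-pair DTW value fall below the second threshold; the value 1.04 is merely one of the DTW thresholds listed in the experiments:

```python
# Illustrative sketch of S107-S109: compare the optimal feature sequences
# against each sample syllable's templates in turn.
def match_syllable(optimal, templates, dtw_threshold=1.04):
    """optimal: {pair: series}; templates: {syllable: {pair: series}}."""
    for syllable, sample in templates.items():
        if all(dtw_distance(optimal[pair], sample[pair]) < dtw_threshold
               for pair in optimal):
            return syllable  # every sequence matched this template
    return None              # no template syllable matched
```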
According to the lip syllable recognition method described above, fine-grained syllable-level features are computed, a cross-validation recursive feature elimination model is then trained, and the selected optimal feature sequence is compared with sample feature sequences, so that the corresponding syllable is recognized and the efficiency, accuracy, and adaptability of syllable detection are improved.
The scheme is described below with reference to a specific embodiment. The method captures the video stream of the changing lips through a mobile phone camera, extracts the lip feature points from the captured video, plots the distance change sequence curves, normalizes and denoises the curves, matches the syllables to be identified against template syllables, calculates the DTW similarity, and performs classification and recognition.
A mobile phone camera captures the video stream of the changing lips, the Dlib face recognition framework identifies the 68 feature points on the face, the coordinate information of the 20 feature points related to the lip information is selected, and the distance between the coordinates of every two of these feature points is calculated.
The 190 groups of distance time series SOD(Pi,Pj) are then computed, each representing the temporal change of the distance between feature point Pi and feature point Pj. During implementation, different volunteers were recruited for testing, and the experiments were repeated under different lighting conditions and at different times to simulate the challenges addressed by the present disclosure.
The curves of the normalized distance time series SOD(Pi,Pj) still contain some unstable burrs while the lips are not moving, so the signals need denoising; the present disclosure uses the Savitzky-Golay convolution smoothing algorithm.
The processed features form smooth curve sequences. To extract the important features for subsequent training, a cross-validation recursive feature selection model can be adopted, which selects the optimal feature sequences by recursive elimination; finally, 20 feature sequences are left for the subsequent model calculation.
The syllables of the sequences selected by the model then need to be identified; to capture the spatio-temporal characteristics of the sequences, the DTW algorithm is chosen.
A DTW threshold is set: if the DTW values obtained between the test syllable and the template syllable are all smaller than the threshold, the two syllables are judged identical; otherwise, if any DTW value is larger than the threshold, the test syllable is judged different from the template syllable. Syllables can thus be identified without complex calculation.
To verify generality across different syllables and different user scenarios, the disclosed embodiment selects 12 syllables; data from multiple volunteers were tested, and multiple groups of data were collected at different times and under different lighting to simulate the variability of actual scenes.
Two indexes, the false positive rate (FPR) and the false negative rate (FNR), are set to characterize the reliability of the effect:
FPR = FP / (FP + TN), FNR = FN / (FN + TP)
where FP, TN, FN, and TP denote the numbers of false positive, true negative, false negative, and true positive recognitions, respectively.
FPR and FNR describe the accuracy of the recognition algorithm well; the lower their values, the better the recognition effect of the present disclosure.
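For illustration, the two indexes reduce to simple ratios over the counted outcomes:

```python
# Illustrative sketch: FPR = FP / (FP + TN), FNR = FN / (FN + TP).
def error_rates(fp, tn, fn, tp):
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    fnr = fn / (fn + tp) if (fn + tp) else 0.0
    return fpr, fnr
```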
The experimental content has the following three aspects:
(1) Identifying effects for different users;
(2) The recognition effect under different depth of field and lip movement amplitude;
(3) The influence of parameter settings on the recognition effect.
Parameter settings:
Opening-degree threshold: 0.01, 0.02, 0.03, 0.04, 0.05, 0.06;
DTW threshold: 1, 1.02, 1.04, 1.06, 1.08, 1.1;
Experimental results:
Fig. 2 shows the false positive rate, illustrating the recognition effect under different DTW thresholds; when 6 features are selected the error rate is lowest, with an average accuracy of over 88%.
Fig. 3 shows the false negative rate, whose values are smaller than the false positive rates, indicating a small probability of missed recognition; the average error is optimal at about 6%.
Corresponding to the above method embodiment, referring to fig. 4, the disclosed embodiment further provides a lip-syllable recognition device 40, including:
The acquisition module 401 is configured to acquire a dynamic face video of a target person, identify a plurality of initial feature points in each frame of face image of the video by using the Dlib face recognition framework, and obtain coordinate information of each initial feature point;
a selecting module 402, configured to select a preset number of feature points related to the lip from all the initial feature points as target feature points, and obtain a plurality of sets of distance change functions according to coordinate information of the target feature points, so as to form a distance time sequence;
a conversion module 403, configured to convert the distance time sequence into a target curve;
The segmentation module 404 is configured to calculate the lip opening and closing degree according to the target curve, compare it with a first threshold to determine the start and end moments of speaking lip movement, and segment the corresponding pronunciation sequence from the distance time sequence accordingly;
A recursion module 405, configured to input all the features of the pronunciation sequence after the dimension reduction into a cross-validation recursion feature elimination model, add a weight to each feature, and recursively reject the feature with the lowest importance each time until an optimal feature sequence is obtained;
A calculating module 406, configured to calculate a similarity of a distance between the optimal feature sequence and a sample feature sequence, where the sample feature sequence is a lip feature sequence corresponding to a sample syllable;
A determining module 407, configured to determine whether the similarity is smaller than a second threshold;
If the similarity is smaller than the second threshold, judging that syllables corresponding to the optimal feature sequence and the sample feature sequence are identical;
And if the similarity is greater than or equal to the second threshold, judging that syllables corresponding to the optimal feature sequence and the sample feature sequence are different, and calculating the similarity of the optimal feature sequence and other sample feature sequences until the similarity is smaller than the second threshold.
The apparatus shown in fig. 4 may correspondingly perform the content in the foregoing method embodiment, and the portions not described in detail in this embodiment refer to the content described in the foregoing method embodiment and are not described herein again.
Referring to fig. 5, an embodiment of the present disclosure also provides an electronic device 50, including: at least one processor and a memory communicatively coupled to the at least one processor. Wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the lip-syllable recognition method of the foregoing method embodiments.
The disclosed embodiments also provide a non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the lip-syllable recognition method in the foregoing method embodiments.
The disclosed embodiments also provide a computer program product comprising a computer program stored on a non-transitory computer readable storage medium, the computer program comprising program instructions which, when executed by a computer, cause the computer to perform the lip-syllable recognition method in the foregoing method embodiments.
Referring now to fig. 5, a schematic diagram of an electronic device 50 suitable for use in implementing embodiments of the present disclosure is shown. The electronic devices in the embodiments of the present disclosure may include, but are not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), in-vehicle terminals (e.g., in-vehicle navigation terminals), and the like, and stationary terminals such as digital TVs, desktop computers, and the like. The electronic device shown in fig. 5 is merely an example and should not be construed to limit the functionality and scope of use of the disclosed embodiments.
As shown in fig. 5, the electronic device 50 may include a processing means (e.g., a central processing unit, a graphics processor, etc.) 501, which may perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 502 or a program loaded from a storage means 508 into a Random Access Memory (RAM) 503. In the RAM 503, various programs and data required for the operation of the electronic device 50 are also stored. The processing device 501, the ROM 502, and the RAM 503 are connected to each other via a bus 504. An input/output (I/O) interface 505 is also connected to bus 504.
In general, the following devices may be connected to the I/O interface 505: input devices 506 including, for example, a touch screen, touchpad, keyboard, mouse, image sensor, microphone, accelerometer, gyroscope, etc.; an output device 507 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; storage 508 including, for example, magnetic tape, hard disk, etc.; and communication means 509. The communication means 509 may allow the electronic device 50 to communicate with other devices wirelessly or by wire to exchange data. While an electronic device 50 having various means is shown, it should be understood that not all of the illustrated means are required to be implemented or provided. More or fewer devices may be implemented or provided instead.
In particular, according to embodiments of the present disclosure, the processes described above with reference to flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method shown in the flowcharts. In such an embodiment, the computer program may be downloaded and installed from a network via the communication means 509, or from the storage means 508, or from the ROM 502. The above-described functions defined in the methods of the embodiments of the present disclosure are performed when the computer program is executed by the processing device 501.
It should be noted that the computer readable medium described in the present disclosure may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this disclosure, a computer-readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present disclosure, however, the computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with the computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, fiber optic cables, RF (radio frequency), and the like, or any suitable combination of the foregoing.
The computer readable medium may be contained in the electronic device; or may exist alone without being incorporated into the electronic device.
The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to perform the relevant steps of the method embodiments described above.
Or the computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to perform the relevant steps of the method embodiments described above.
Computer program code for carrying out operations of the present disclosure may be written in one or more programming languages, including an object oriented programming language such as Java, smalltalk, C ++ and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units involved in the embodiments of the present disclosure may be implemented by means of software, or may be implemented by means of hardware.
It should be understood that portions of the present disclosure may be implemented in hardware, software, firmware, or a combination thereof.
The foregoing is merely specific embodiments of the disclosure, but the protection scope of the disclosure is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the disclosure are intended to be covered by the protection scope of the disclosure. Therefore, the protection scope of the present disclosure shall be subject to the protection scope of the claims.