Disclosure of Invention
1. Technical problem to be solved by the invention
The invention aims to solve the problem that existing voice recognition technology is inaccurate.
2. Technical scheme
In order to achieve the above purpose, the technical scheme provided by the invention is as follows:
the voice correction fusion method comprises: simultaneously collecting sound data and video data of a speaker; performing point-marking preprocessing on the mouth shape captured in the video data, marking six points on the inner side of the lips with letters; measuring the preprocessed image and calculating the lip change angle from the positions of the six points; comparing the sound data with an audio database to obtain a voice recognition result; and comparing the lip change angle with a mouth-shape database to obtain a lip-reading recognition result. When the matching degrees of the voice recognition result and the lip-reading recognition result are the same, the voice recognition result is preferentially selected; when the matching degrees differ, the lip-reading recognition result is preferentially selected.
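As an illustration of this selection rule, a minimal sketch in Python follows (the function and variable names are assumptions for illustration, not part of the invention; P1 and P2 denote the two matching degrees defined later in this disclosure):

    def fuse_results(speech_result, lip_result, p1, p2):
        # Selection rule of the scheme: when the two matching degrees are
        # the same, prefer the voice recognition result; otherwise prefer
        # the lip-reading recognition result.
        if p1 == p2:
            return speech_result
        return lip_result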
Preferably, the six points are: points A and F at the mouth corners on the two sides of the inner lip, points B and G on the upper lip, and points C and H on the lower lip.
Preferably, the midpoint of the line connecting points B and G is taken as D, the midpoint of the line connecting points C and H is taken as E, and the sizes of angle CAF and angle BAF, the length of segment AF, and the length of segment DE are measured.
Preferably, an evaluation function is calculated to judge the speech; the specific evaluation function is
Pre=k*(p*Angle(a,b)+q*Line(LAF,LDE));
where k, p and q are the weight coefficients of the respective cost functions; p and q are both 0.5; the value of k is the pronunciation coefficient for a given geographic region; Angle(a, b) is the cost sub-function of the lip included angle; and Line(LAF, LDE) is the cost sub-function of the degree of lip opening.
Preferably, when the picture of the speaker is collected, the speaker's face needs to directly face the camera; the camera first captures the face and, after the face is recognized, begins to capture the speaker's lip region.
Preferably, during voice recognition, two temporary storage areas are arranged in the storage database for holding the audio information and the video information. The two areas mainly store audio streams and timestamps, and each area is emptied after its contents are stored, or emptied directly if no obvious audio input occurs within 30 seconds.
Preferably, voice wake-up is required before voice recognition; the wake-up is mainly triggered by specific voice keywords. After wake-up succeeds, voice receiving and processing begin: the audio information is denoised and its feature points are collected, the preprocessed audio is matched against the standard audio in the database, and the matching degree P1 is output. For the video information, the face is recognized first, the lip region is tracked, and 10 frames are extracted for comparison; the 10 calculated Pre values are averaged to obtain the result Pra, which is compared with the Pre values in the database to screen out values in the corresponding range, and the matching degree P2 with the database is output. Finally, P1 and P2 are compared and the corresponding recognition result is output.
Preferably, the audio database stores preset mouth shape data of dialect pronunciation.
3. Advantageous effects
Compared with the prior art, the technical scheme provided by the invention has the following beneficial effects:
the voice correction fusion method simultaneously collects sound data and video data of a speaker; performs point-marking preprocessing on the mouth shape captured in the video data, marking six points on the inner side of the lips with letters; measures the preprocessed image and calculates the lip change angle from the positions of the six points; compares the sound data with an audio database to obtain a voice recognition result; and compares the lip change angle with a mouth-shape database to obtain a lip-reading recognition result. When the matching degrees of the voice recognition result and the lip-reading recognition result are the same, the voice recognition result is preferentially selected; when the matching degrees differ, the lip-reading recognition result is preferentially selected. By adding lip-reading recognition on top of voice recognition, the influence of accent on voice recognition can be effectively removed; because lip-reading is image-based, it is unaffected by the sound itself, and recognizing the speaker's speech through the lips makes the method more accurate.
Detailed Description
To facilitate an understanding of the invention, it will now be described more fully with reference to the accompanying drawings, in which several embodiments of the invention are shown. The invention may, however, be embodied in many different forms and is not limited to the embodiments described herein; rather, these embodiments are provided so that the disclosure will be more thorough.
It will be understood that when an element is referred to as being "secured to" another element, it can be directly on the other element or intervening elements may also be present; when an element is referred to as being "connected" to another element, it can be directly connected to the other element or intervening elements may also be present; the terms "vertical," "horizontal," "left," "right," and the like as used herein are for illustrative purposes only.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs; the terminology used herein in the description of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention; as used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.
Example 1
Referring to fig. 1, in the voice correction fusion method of this embodiment, sound data and video data of a speaker are collected at the same time; the mouth shape captured in the video data is subjected to point-marking preprocessing, and six points on the inner side of the lips are marked with letters; the preprocessed image is measured, and the lip change angle is calculated from the positions of the six points; the sound data is compared with an audio database to obtain a voice recognition result, and the lip change angle is compared with a mouth-shape database to obtain a lip-reading recognition result. When the matching degrees of the voice recognition result and the lip-reading recognition result are the same, the voice recognition result is preferentially selected; when the matching degrees differ, the lip-reading recognition result is preferentially selected.
The six points are: points A and F at the mouth corners on the two sides of the inner lip, points B and G on the upper lip, and points C and H on the lower lip.
The midpoint of the line connecting points B and G is taken as D, the midpoint of the line connecting points C and H is taken as E, and the sizes of angle CAF and angle BAF, the length of segment AF, and the length of segment DE are measured.
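A minimal sketch of these measurements, assuming each marked point is available as an (x, y) pixel coordinate after the point-marking preprocessing (Python; the helper names are illustrative):

    import math

    def midpoint(p, q):
        return ((p[0] + q[0]) / 2, (p[1] + q[1]) / 2)

    def angle_at(vertex, p, q):
        # Angle in degrees between the rays vertex->p and vertex->q.
        v1 = (p[0] - vertex[0], p[1] - vertex[1])
        v2 = (q[0] - vertex[0], q[1] - vertex[1])
        cos_a = (v1[0] * v2[0] + v1[1] * v2[1]) / (math.hypot(*v1) * math.hypot(*v2))
        return math.degrees(math.acos(max(-1.0, min(1.0, cos_a))))

    def lip_measurements(A, B, C, F, G, H):
        D = midpoint(B, G)             # midpoint of segment BG
        E = midpoint(C, H)             # midpoint of segment CH
        angle_CAF = angle_at(A, C, F)  # angle CAF at vertex A
        angle_BAF = angle_at(A, B, F)  # angle BAF at vertex A
        L_AF = math.dist(A, F)         # length of segment AF (mouth width)
        L_DE = math.dist(D, E)         # length of segment DE (mouth opening)
        return angle_CAF, angle_BAF, L_AF, L_DE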
An evaluation function is calculated to judge the speech; the specific evaluation function is
Pre=k*(p*Angle(a,b)+q*Line(LAF,LDE));
where k, p and q are the weight coefficients of the respective cost functions; p and q are both 0.5; the value of k is the pronunciation coefficient for a given geographic region; Angle(a, b) is the cost sub-function of the lip included angle; and Line(LAF, LDE) is the cost sub-function of the degree of lip opening.
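A sketch of this evaluation function follows. The text does not specify the internal forms of the two cost sub-functions, so simple placeholder forms are assumed here purely for illustration:

    def pre_score(angle_a, angle_b, l_af, l_de, k, p=0.5, q=0.5):
        # Pre = k * (p * Angle(a, b) + q * Line(LAF, LDE))
        def Angle(a, b):      # cost sub-function of the lip included angle
            return a + b      # placeholder: combined measured angles
        def Line(laf, lde):   # cost sub-function of the degree of lip opening
            return lde / laf  # placeholder: opening height relative to width
        return k * (p * Angle(angle_a, angle_b) + q * Line(l_af, l_de))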
In Chinese Pinyin there are 63 pinyin elements: 23 initials, 24 finals, and 16 whole-syllable readings. Most regional dialects are hard to tell apart because of flat versus curled tongue sounds, front versus back nasal sounds, individual letters, and the like, so when people from different regions speak Mandarin, the difference from standard Mandarin is large, and a certain regional division can be made through the coefficient k. During the first training of the k coefficient, the initial value is 1 because pronunciation from only one region is collected; when training on other regions, the value of k is adjusted so that the difference from the established standard database remains small. For example, the lip opening angles of L and N differ greatly in the standard case, but in some regions their pronunciations are similar and the change of mouth shape is smaller; in that case the k value is enlarged or reduced by a certain amount so that L and N remain distinguishable without affecting other values.
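One possible way to fit the regional coefficient k is sketched below. The text states only that k starts at 1 for the first region and is then adjusted so that the difference from the established standard database stays small; the proportional scaling rule used here is an assumption:

    def calibrate_k(regional_pre_values, standard_pre, k_init=1.0):
        # Scale k so that the region's average Pre for an utterance comes
        # close to the value stored in the standard database.
        avg = sum(regional_pre_values) / len(regional_pre_values)
        return k_init * (standard_pre / avg)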
When the picture of the speaker is collected, the speaker's face needs to directly face the camera; the camera first captures the face and, after the face is recognized, begins to capture the speaker's lip region.
During voice recognition, two temporary storage areas are arranged in the storage database for holding the audio information and the video information. The two areas mainly store audio streams and timestamps, and each area is emptied after its contents are stored, or emptied directly if no obvious audio input occurs within 30 seconds.
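A minimal sketch of such a temporary storage area with the 30-second rule (Python; the class and its methods are illustrative assumptions):

    import time

    class TempStore:
        TIMEOUT = 30.0  # seconds without obvious input before direct emptying

        def __init__(self):
            self.chunks = []  # (timestamp, data) pairs for the stream
            self.last_input = time.monotonic()

        def push(self, data):
            now = time.monotonic()
            self.chunks.append((now, data))
            self.last_input = now

        def flush(self, persist):
            persist(self.chunks)   # store the contents ...
            self.chunks.clear()    # ... then empty the area

        def expire_if_idle(self):
            if time.monotonic() - self.last_input > self.TIMEOUT:
                self.chunks.clear()  # directly emptied after 30 s idle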
Before voice recognition, voice wake-up is needed; the wake-up is mainly triggered by specific voice keywords. After wake-up succeeds, voice receiving and processing begin: the audio information is denoised and its feature points are collected, the preprocessed audio is matched against the standard audio in the database, and the matching degree P1 is output. For the video information, the face is recognized first, the lip region is tracked, and 10 frames are extracted for comparison; the 10 calculated Pre values are averaged to obtain the result Pra, which is compared with the Pre values in the database to screen out values in the corresponding range, and the matching degree P2 with the database is output. Finally, P1 and P2 are compared and the corresponding recognition result is output.
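The overall flow can be sketched as follows. The helpers denoise_and_extract, match_audio, mark_points and match_mouth_shape are hypothetical stand-ins for the preprocessing, database matching, and point-marking steps described above; pre_score and lip_measurements are the sketches given earlier:

    def recognize(audio, frames, db, k):
        # Audio branch: denoise, collect feature points, match -> P1.
        features = denoise_and_extract(audio)         # hypothetical helper
        speech_result, p1 = match_audio(features, db)
        # Video branch: 10 frames -> 10 Pre values -> average Pra -> P2.
        # mark_points(f) is assumed to return the points (A, B, C, F, G, H).
        pre_values = [pre_score(*lip_measurements(*mark_points(f)), k=k)
                      for f in frames[:10]]
        pra = sum(pre_values) / len(pre_values)
        lip_result, p2 = match_mouth_shape(pra, db)   # screen Pre range in db
        # Fusion: same matching degree -> voice result, else lip result.
        return speech_result if p1 == p2 else lip_result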
The audio database stores preset mouth shape data of dialect pronunciation.
If the result still has no matching data, the current result is recorded, and the speaker is prompted to pronounce again or may choose to quit. If pronunciation continues and matching succeeds, the Pra value is stored and the original database is modified to a certain degree.
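How the database is "modified to a certain degree" is not specified; one plausible sketch is a running average that folds the newly stored Pra into the existing entry (the smoothing factor alpha is an assumption):

    def update_entry(entry, pra, alpha=0.1):
        # Fold the successful Pra into the stored Pre value (assumed
        # exponential moving average).
        entry["pre"] = (1 - alpha) * entry["pre"] + alpha * pra
        return entry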
In order to improve the recognition efficiency of the machine, after a certain recognition period, a major correction is applied to the database once through a new evaluation function (formula 2):
N_Pre=k*(p*Angle(a,b)+q*Line(LAF,LDE)+m*Time(t));
where t is the update time period of the weight coefficients, m is the weight coefficient of the periodic cost function and determines the importance of the periodic cost, and Time(t) is the periodic maintenance cost function.
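A sketch of this corrected evaluation function follows; the form of Time(t) and the default value of m are assumptions, since the text defines them only as the periodic cost function and its weight:

    def n_pre_score(angle_a, angle_b, l_af, l_de, t, k, p=0.5, q=0.5, m=0.1):
        # N_Pre = k * (p * Angle(a, b) + q * Line(LAF, LDE) + m * Time(t))
        def Angle(a, b):
            return a + b            # same placeholder as in pre_score
        def Line(laf, lde):
            return lde / laf
        def Time(t):
            return 1.0 / (1.0 + t)  # placeholder: stale entries cost less
        return k * (p * Angle(angle_a, angle_b)
                    + q * Line(l_af, l_de)
                    + m * Time(t))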
Because some unavoidable problems can occur during recognition, several situations are analyzed and handled in advance so that machine downtime is avoided.
Case one: if the matching degrees of voice recognition and lip-reading recognition are the same, the voice recognition result is preferentially selected; if voice recognition produces no match, the lip-reading recognition results with the same matching degree are output.
Case two: during matching, repeated matching failures can cause the buffer to become increasingly occupied until it overflows. To ensure that the machine does not crash, a limit on the number of recognition and matching attempts can be set; if matching still fails after that number of attempts, a related log file is generated, the buffer is emptied, and a new round of recognition is started (a sketch of this retry limit follows case three below).
Case three: to improve the working efficiency of the machine, recognition is performed over the network and some data are stored; to ensure that the machine can still operate normally when the network is interrupted, it needs a certain amount of local storage space. Meanwhile, because log files are continuously generated, the machine needs to merge and organize them within a certain time period or when a certain space limit is reached, and unnecessary files are emptied in time.
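For case two, the retry limit can be sketched as follows (the limit of 5 attempts is an assumption; the text says only "a certain number of times"):

    MAX_ATTEMPTS = 5  # assumed limit on recognition-and-matching attempts

    def match_with_retry(recognize_once, buffer, log):
        for _ in range(MAX_ATTEMPTS):
            result = recognize_once()
            if result is not None:
                return result
        # Still unmatched: generate a log record, empty the buffer,
        # and let a new round of recognition begin.
        log.write("matching failed after %d attempts\n" % MAX_ATTEMPTS)
        buffer.clear()
        return None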
The above-described embodiments express only certain implementations of the present invention, and their description is specific and detailed, but they should not therefore be construed as limiting the scope of the present invention. It should be noted that those skilled in the art can make several variations and modifications without departing from the concept of the present invention, all of which fall within the protection scope of the present invention; therefore, the protection scope of this patent shall be subject to the appended claims.