Disclosure of Invention
1. Technical problem to be solved by the invention
The invention aims to solve the problem that existing voice recognition technology is inaccurate.
2. Technical scheme
In order to achieve the above purpose, the technical scheme provided by the invention is as follows:
the voice correction fusion method comprises: simultaneously collecting sound data and video data of a speaker; performing point-marking preprocessing on the mouth shape captured in the video data, marking six points on the inner side of the lips with letters; measuring the preprocessed image and calculating the lip change angle from the positions of the six points; comparing the sound data with an audio database to obtain a voice recognition result; and comparing the lip change angle with a mouth-shape database to obtain a lip-reading recognition result. When the matching degrees of the voice recognition result and the lip-reading recognition result are the same, the voice recognition result is preferentially selected; when the matching degrees differ, the lip-reading recognition result is preferentially selected.
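As an illustration of this selection rule, a minimal sketch in Python follows (the function and variable names are assumptions for illustration, not part of the invention; P1 and P2 denote the two matching degrees defined later in this disclosure):

    def fuse_results(speech_result, lip_result, p1, p2):
        # Selection rule of the scheme: when the two matching degrees are
        # the same, prefer the voice recognition result; otherwise prefer
        # the lip-reading recognition result.
        if p1 == p2:
            return speech_result
        return lip_result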
Preferably, the six points are: points A and F at the mouth corners on the two sides of the inner lip, points B and G on the upper lip, and points C and H on the lower lip.
Preferably, the midpoint of the line connecting points B and G is taken as D, the midpoint of the line connecting points C and H is taken as E, and the sizes of angle CAF and angle BAF, the length of segment AF, and the length of segment DE are measured.
Preferably, an evaluation function is calculated to judge the speech; the specific evaluation function is
Pre=k*(p*Angle(a,b)+q*Line(LAF,LDE));
where k, p and q are the weight coefficients of the respective cost functions; p and q are both 0.5; the value of k is the pronunciation coefficient for a given geographic region; Angle(a, b) is the cost sub-function of the lip included angle; and Line(LAF, LDE) is the cost sub-function of the degree of lip opening.
Preferably, when the picture of the speaker is collected, the speaker's face needs to directly face the camera; the camera first captures the face and, after the face is recognized, begins to capture the speaker's lip region.
Preferably, during voice recognition, two temporary storage areas are arranged in the storage database for holding the audio information and the video information. The two areas mainly store audio streams and timestamps, and each area is emptied after its contents are stored, or emptied directly if no obvious audio input occurs within 30 seconds.
Preferably, voice wake-up is required before voice recognition; the wake-up is mainly triggered by specific voice keywords. After wake-up succeeds, voice receiving and processing begin: the audio information is denoised and its feature points are collected, the preprocessed audio is matched against the standard audio in the database, and the matching degree P1 is output. For the video information, the face is recognized first, the lip region is tracked, and 10 frames are extracted for comparison; the 10 calculated Pre values are averaged to obtain the result Pra, which is compared with the Pre values in the database to screen out values in the corresponding range, and the matching degree P2 with the database is output. Finally, P1 and P2 are compared and the corresponding recognition result is output.
Preferably, the audio database stores preset mouth shape data of dialect pronunciation.
3. Advantageous effects
Compared with the prior art, the technical scheme provided by the invention has the following beneficial effects:
the voice correction fusion method simultaneously collects sound data and video data of a speaker; performs point-marking preprocessing on the mouth shape captured in the video data, marking six points on the inner side of the lips with letters; measures the preprocessed image and calculates the lip change angle from the positions of the six points; compares the sound data with an audio database to obtain a voice recognition result; and compares the lip change angle with a mouth-shape database to obtain a lip-reading recognition result. When the matching degrees of the voice recognition result and the lip-reading recognition result are the same, the voice recognition result is preferentially selected; when the matching degrees differ, the lip-reading recognition result is preferentially selected. By adding lip-reading recognition on top of voice recognition, the influence of accent on voice recognition can be effectively removed; because lip-reading is image-based, it is unaffected by the sound itself, and recognizing the speaker's speech through the lips makes the method more accurate.
Detailed Description
To facilitate an understanding of the invention, it will now be described more fully with reference to the accompanying drawings, in which several embodiments of the invention are shown. The invention may, however, be embodied in many different forms and is not limited to the embodiments described herein; rather, these embodiments are provided so that the disclosure will be more thorough.
It will be understood that when an element is referred to as being "secured to" another element, it can be directly on the other element or intervening elements may also be present; when an element is referred to as being "connected" to another element, it can be directly connected to the other element or intervening elements may also be present; the terms "vertical," "horizontal," "left," "right," and the like as used herein are for illustrative purposes only.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs; the terminology used herein in the description of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention; as used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.
Example 1
Referring to fig. 1, in the voice correction fusion method of this embodiment, sound data and video data of a speaker are collected at the same time; the mouth shape captured in the video data is subjected to point-marking preprocessing, and six points on the inner side of the lips are marked with letters; the preprocessed image is measured, and the lip change angle is calculated from the positions of the six points; the sound data is compared with an audio database to obtain a voice recognition result, and the lip change angle is compared with a mouth-shape database to obtain a lip-reading recognition result. When the matching degrees of the voice recognition result and the lip-reading recognition result are the same, the voice recognition result is preferentially selected; when the matching degrees differ, the lip-reading recognition result is preferentially selected.
The six points are: points A and F at the mouth corners on the two sides of the inner lip, points B and G on the upper lip, and points C and H on the lower lip.
The midpoint of the line connecting points B and G is taken as D, the midpoint of the line connecting points C and H is taken as E, and the sizes of angle CAF and angle BAF, the length of segment AF, and the length of segment DE are measured.
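A minimal sketch of these measurements, assuming each marked point is available as an (x, y) pixel coordinate after the point-marking preprocessing (Python; the helper names are illustrative):

    import math

    def midpoint(p, q):
        return ((p[0] + q[0]) / 2, (p[1] + q[1]) / 2)

    def angle_at(vertex, p, q):
        # Angle in degrees between the rays vertex->p and vertex->q.
        v1 = (p[0] - vertex[0], p[1] - vertex[1])
        v2 = (q[0] - vertex[0], q[1] - vertex[1])
        cos_a = (v1[0] * v2[0] + v1[1] * v2[1]) / (math.hypot(*v1) * math.hypot(*v2))
        return math.degrees(math.acos(max(-1.0, min(1.0, cos_a))))

    def lip_measurements(A, B, C, F, G, H):
        D = midpoint(B, G)             # midpoint of segment BG
        E = midpoint(C, H)             # midpoint of segment CH
        angle_CAF = angle_at(A, C, F)  # angle CAF at vertex A
        angle_BAF = angle_at(A, B, F)  # angle BAF at vertex A
        L_AF = math.dist(A, F)         # length of segment AF (mouth width)
        L_DE = math.dist(D, E)         # length of segment DE (mouth opening)
        return angle_CAF, angle_BAF, L_AF, L_DE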
An evaluation function is calculated to judge the speech; the specific evaluation function is
Pre=k*(p*Angle(a,b)+q*Line(LAF,LDE));
where k, p and q are the weight coefficients of the respective cost functions; p and q are both 0.5; the value of k is the pronunciation coefficient for a given geographic region; Angle(a, b) is the cost sub-function of the lip included angle; and Line(LAF, LDE) is the cost sub-function of the degree of lip opening.
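A sketch of this evaluation function follows. The text does not specify the internal forms of the two cost sub-functions, so simple placeholder forms are assumed here purely for illustration:

    def pre_score(angle_a, angle_b, l_af, l_de, k, p=0.5, q=0.5):
        # Pre = k * (p * Angle(a, b) + q * Line(LAF, LDE))
        def Angle(a, b):      # cost sub-function of the lip included angle
            return a + b      # placeholder: combined measured angles
        def Line(laf, lde):   # cost sub-function of the degree of lip opening
            return lde / laf  # placeholder: opening height relative to width
        return k * (p * Angle(angle_a, angle_b) + q * Line(l_af, l_de))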
In Chinese Pinyin there are 63 pinyin elements: 23 initials, 24 finals, and 16 whole-syllable readings. Most regional dialects are hard to tell apart because of flat versus curled tongue sounds, front versus back nasal sounds, individual letters, and the like, so when people from different regions speak Mandarin, the difference from standard Mandarin is large, and a certain regional division can be made through the coefficient k. During the first training of the k coefficient, the initial value is 1 because pronunciation from only one region is collected; when training on other regions, the value of k is adjusted so that the difference from the established standard database remains small. For example, the lip opening angles of L and N differ greatly in the standard case, but in some regions their pronunciations are similar and the change of mouth shape is smaller; in that case the k value is enlarged or reduced by a certain amount so that L and N remain distinguishable without affecting other values.
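One possible way to fit the regional coefficient k is sketched below. The text states only that k starts at 1 for the first region and is then adjusted so that the difference from the established standard database stays small; the proportional scaling rule used here is an assumption:

    def calibrate_k(regional_pre_values, standard_pre, k_init=1.0):
        # Scale k so that the region's average Pre for an utterance comes
        # close to the value stored in the standard database.
        avg = sum(regional_pre_values) / len(regional_pre_values)
        return k_init * (standard_pre / avg)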
When the picture of the speaker is collected, the speaker's face needs to directly face the camera; the camera first captures the face and, after the face is recognized, begins to capture the speaker's lip region.
During voice recognition, two temporary storage areas are arranged in the storage database for holding the audio information and the video information. The two areas mainly store audio streams and timestamps, and each area is emptied after its contents are stored, or emptied directly if no obvious audio input occurs within 30 seconds.
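A minimal sketch of such a temporary storage area with the 30-second rule (Python; the class and its methods are illustrative assumptions):

    import time

    class TempStore:
        TIMEOUT = 30.0  # seconds without obvious input before direct emptying

        def __init__(self):
            self.chunks = []  # (timestamp, data) pairs for the stream
            self.last_input = time.monotonic()

        def push(self, data):
            now = time.monotonic()
            self.chunks.append((now, data))
            self.last_input = now

        def flush(self, persist):
            persist(self.chunks)   # store the contents ...
            self.chunks.clear()    # ... then empty the area

        def expire_if_idle(self):
            if time.monotonic() - self.last_input > self.TIMEOUT:
                self.chunks.clear()  # directly emptied after 30 s idle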
Before voice recognition, voice wake-up is needed; the wake-up is mainly triggered by specific voice keywords. After wake-up succeeds, voice receiving and processing begin: the audio information is denoised and its feature points are collected, the preprocessed audio is matched against the standard audio in the database, and the matching degree P1 is output. For the video information, the face is recognized first, the lip region is tracked, and 10 frames are extracted for comparison; the 10 calculated Pre values are averaged to obtain the result Pra, which is compared with the Pre values in the database to screen out values in the corresponding range, and the matching degree P2 with the database is output. Finally, P1 and P2 are compared and the corresponding recognition result is output.
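The overall flow can be sketched as follows. The helpers denoise_and_extract, match_audio, mark_points and match_mouth_shape are hypothetical stand-ins for the preprocessing, database matching, and point-marking steps described above; pre_score and lip_measurements are the sketches given earlier:

    def recognize(audio, frames, db, k):
        # Audio branch: denoise, collect feature points, match -> P1.
        features = denoise_and_extract(audio)         # hypothetical helper
        speech_result, p1 = match_audio(features, db)
        # Video branch: 10 frames -> 10 Pre values -> average Pra -> P2.
        # mark_points(f) is assumed to return the points (A, B, C, F, G, H).
        pre_values = [pre_score(*lip_measurements(*mark_points(f)), k=k)
                      for f in frames[:10]]
        pra = sum(pre_values) / len(pre_values)
        lip_result, p2 = match_mouth_shape(pra, db)   # screen Pre range in db
        # Fusion: same matching degree -> voice result, else lip result.
        return speech_result if p1 == p2 else lip_result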
The audio database stores preset mouth shape data of dialect pronunciation.
If the result still has no matching data, the current result is recorded, and the speaker is prompted to pronounce again or may choose to quit. If pronunciation continues and matching succeeds, the Pra value is stored and the original database is modified to a certain degree.
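How the database is "modified to a certain degree" is not specified; one plausible sketch is a running average that folds the newly stored Pra into the existing entry (the smoothing factor alpha is an assumption):

    def update_entry(entry, pra, alpha=0.1):
        # Fold the successful Pra into the stored Pre value (assumed
        # exponential moving average).
        entry["pre"] = (1 - alpha) * entry["pre"] + alpha * pra
        return entry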
In order to improve the recognition efficiency of the machine, after a certain recognition period, a major correction is applied to the database once through a new evaluation function (formula 2):
N_Pre=k*(p*Angle(a,b)+q*Line(LAF,LDE)+m*Time(t));
where t is the update time period of the weight coefficients, m is the weight coefficient of the periodic cost function and determines the importance of the periodic cost, and Time(t) is the periodic maintenance cost function.
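A sketch of this corrected evaluation function follows; the form of Time(t) and the default value of m are assumptions, since the text defines them only as the periodic cost function and its weight:

    def n_pre_score(angle_a, angle_b, l_af, l_de, t, k, p=0.5, q=0.5, m=0.1):
        # N_Pre = k * (p * Angle(a, b) + q * Line(LAF, LDE) + m * Time(t))
        def Angle(a, b):
            return a + b            # same placeholder as in pre_score
        def Line(laf, lde):
            return lde / laf
        def Time(t):
            return 1.0 / (1.0 + t)  # placeholder: stale entries cost less
        return k * (p * Angle(angle_a, angle_b)
                    + q * Line(l_af, l_de)
                    + m * Time(t))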
Because some unavoidable problems can occur during recognition, several situations are analyzed and handled in advance so that machine downtime is avoided.
Case one: if the matching degrees of voice recognition and lip-reading recognition are the same, the voice recognition result is preferentially selected; if voice recognition produces no match, the lip-reading recognition results with the same matching degree are output.
Case two: during matching, repeated matching failures can cause the buffer to become increasingly occupied until it overflows. To ensure that the machine does not crash, a limit on the number of recognition and matching attempts can be set; if matching still fails after that number of attempts, a related log file is generated, the buffer is emptied, and a new round of recognition is started (a sketch of this retry limit follows case three below).
Case three: to improve the working efficiency of the machine, recognition is performed over the network and some data are stored; to ensure that the machine can still operate normally when the network is interrupted, it needs a certain amount of local storage space. Meanwhile, because log files are continuously generated, the machine needs to merge and organize them within a certain time period or when a certain space limit is reached, and unnecessary files are emptied in time.
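For case two, the retry limit can be sketched as follows (the limit of 5 attempts is an assumption; the text says only "a certain number of times"):

    MAX_ATTEMPTS = 5  # assumed limit on recognition-and-matching attempts

    def match_with_retry(recognize_once, buffer, log):
        for _ in range(MAX_ATTEMPTS):
            result = recognize_once()
            if result is not None:
                return result
        # Still unmatched: generate a log record, empty the buffer,
        # and let a new round of recognition begin.
        log.write("matching failed after %d attempts\n" % MAX_ATTEMPTS)
        buffer.clear()
        return None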
The above-described embodiments express only certain implementations of the present invention, and their description is specific and detailed, but they should not therefore be construed as limiting the scope of the present invention. It should be noted that those skilled in the art can make several variations and modifications without departing from the concept of the present invention, all of which fall within the protection scope of the present invention; therefore, the protection scope of this patent shall be subject to the appended claims.