CN112037788B - Voice correction fusion method - Google Patents

Voice correction fusion method

Info

Publication number
CN112037788B
Authority
CN
China
Prior art keywords
lip
voice
recognition result
point
database
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010947107.2A
Other languages
Chinese (zh)
Other versions
CN112037788A (en)
Inventor
许召辉
马翼平
徐淑波
陈年生
范光宇
饶蕾
孙焜
朱羿孜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Avic East China Photoelectric Shanghai Co ltd
Original Assignee
Avic East China Photoelectric Shanghai Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Avic East China Photoelectric Shanghai Co., Ltd.
Priority to CN202010947107.2A
Publication of CN112037788A
Application granted
Publication of CN112037788B
Status: Active (current)
Anticipated expiration

Abstract

A voice correction fusion method collects a speaker's sound data and video data simultaneously. The mouth shape captured in the video data is preprocessed by marking six points on the inner lip with letters; the preprocessed image is then measured, and the lip change angle is calculated from the positions of the six points. The sound data is compared with an audio database to obtain a speech recognition result, and the lip change angle is compared with a mouth-shape database to obtain a lip-reading recognition result. When the matching degrees of the speech recognition result and the lip-reading recognition result are the same, the speech recognition result is preferred; when they differ, the lip-reading recognition result is preferred. Adding lip reading on top of speech recognition effectively removes the influence of accent on recognition: lip reading is image-based and therefore unaffected by the sound itself, and recognizing what the speaker says from the lips makes the method more accurate.

Description

Voice correction fusion method
Technical Field
The invention belongs to the technical field of voice recognition, and particularly relates to a voice correction fusion method.
Background
With the development of computers and related software and hardware, speech recognition technology is being applied in more and more fields, and its recognition rate keeps improving. Under favorable conditions such as a quiet environment and standard pronunciation, the recognition rate of current speech-to-text systems exceeds 95%. However, with heavy noise interference or non-standard pronunciation, the recognition rate drops sharply and the systems become impractical. If other methods can assist the decision and improve the accuracy of speech recognition, its practicability will improve significantly.
Human language perception is a multi-channel process. In daily communication, people perceive what others say through sound, and in a noisy environment or when the other party's pronunciation is unclear, they can still understand accurately by watching changes in the speaker's mouth shape, expression, and so on. Existing speech recognition systems ignore the visual component of language perception and rely only on the auditory channel, so their recognition rate drops noticeably in noisy environments or when several people speak at once, which reduces the practicability of speech recognition and limits its range of application.
Disclosure of Invention
1. Technical problem to be solved by the invention
The invention aims to solve the problem that the existing voice recognition technology is inaccurate in recognition.
2. Technical scheme
In order to achieve the purpose, the technical scheme provided by the invention is as follows:
The voice correction fusion method collects a speaker's sound data and video data simultaneously; the mouth shape captured in the video data is preprocessed by marking six points on the inner lip with letters; the preprocessed image is measured, and the lip change angle is calculated from the positions of the six points; the sound data is compared with an audio database to obtain a speech recognition result, and the lip change angle is compared with a mouth-shape database to obtain a lip-reading recognition result. When the matching degrees of the speech recognition result and the lip-reading recognition result are the same, the speech recognition result is preferred; when they differ, the lip-reading recognition result is preferred.
Preferably, the six points are: points A and F at the mouth corners on the two sides of the inner lip, points B and G on the upper lip, and points C and H on the lower lip; the exact fractional positions of B, G, C, and H on the lips are given by the formula images of the original document (GDA0003147296660000021, GDA0003147296660000022, GDA0003147296660000023).
Preferably, the midpoint of the line connecting points B and G is labeled D and the midpoint of the line connecting points C and H is labeled E; the angles CAF and BAF, the length of segment AF, and the length of segment DE are then measured.
Preferably, an evaluation function is calculated to judge the speech; the specific evaluation function is
Pre = k * (p * Angle(a, b) + q * Line(LAF, LDE));
where k, p, and q are the weight coefficients of the cost functions: p and q are both 0.5, k is a pronunciation coefficient whose value depends on the geographic region, Angle(a, b) is the cost sub-function for the lip included angles, and Line(LAF, LDE) is the cost sub-function for the degree of lip opening.
Preferably, when the speaker's image is captured, the speaker's face must directly face the camera; the camera first captures and recognizes the face, and only then starts capturing the speaker's lip region.
Preferably, during speech recognition two temporary storage areas are set up in the storage database to hold the audio and video information; they mainly store the audio stream and timestamps, and are emptied after their contents have been saved, or emptied directly if there is no obvious audio input within 30 seconds.
Preferably, voice wake-up is required before speech recognition and is mainly triggered by specific spoken keywords. After a successful wake-up, speech reception and processing begin: the audio information is denoised and its feature points are collected, the preprocessed audio is matched against the standard audio in the database, and the matching degree P1 is output. For the video information, the face is first recognized and the lip region is tracked; 10 frame-extracted images are compared, the 10 computed Pre values are averaged to obtain Pra, Pra is compared with the Pre values in the database, values in the corresponding range are screened out, and the matching degree P2 with the database is output. Finally, P1 and P2 are compared and the corresponding recognition result is output.
Preferably, the audio database stores preset mouth shape data of dialect pronunciation.
3. Advantageous effects
Compared with the prior art, the technical scheme provided by the invention has the following beneficial effects:
The voice correction fusion method collects a speaker's sound data and video data simultaneously, preprocesses the mouth shape captured in the video data by marking six points on the inner lip with letters, measures the preprocessed image, and calculates the lip change angle from the positions of the six points; the sound data is compared with an audio database to obtain a speech recognition result, and the lip change angle is compared with a mouth-shape database to obtain a lip-reading recognition result. When the matching degrees of the two results are the same, the speech recognition result is preferred; when they differ, the lip-reading recognition result is preferred. Adding lip reading on top of speech recognition effectively removes the influence of accent: lip reading is image-based and therefore unaffected by the sound itself, and recognizing what the speaker says from the lips makes the method more accurate.
Drawings
FIG. 1 is a flow chart of the present invention.
Detailed Description
To facilitate understanding, the invention is described more fully below with reference to the accompanying drawings, in which several embodiments are shown. The invention may, however, be embodied in many different forms and is not limited to the embodiments described herein; these embodiments are provided so that the disclosure of the invention will be more thorough.
It will be understood that when an element is referred to as being "secured to" another element, it can be directly on the other element or intervening elements may also be present; when an element is referred to as being "connected" to another element, it can be directly connected to the other element or intervening elements may also be present; the terms "vertical," "horizontal," "left," "right," and the like as used herein are for illustrative purposes only.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs; the terminology used herein in the description of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention; as used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.
Example 1
Referring to fig. 1, the voice correction fusion method of this embodiment collects a speaker's sound data and video data simultaneously. The mouth shape captured in the video data is preprocessed by marking six points on the inner lip with letters, the preprocessed image is measured, and the lip change angle is calculated from the positions of the six points. The sound data is compared with an audio database to obtain a speech recognition result, and the lip change angle is compared with a mouth-shape database to obtain a lip-reading recognition result. When the matching degrees of the speech recognition result and the lip-reading recognition result are the same, the speech recognition result is preferred; when they differ, the lip-reading recognition result is preferred.
The six points are: points A and F at the mouth corners on the two sides of the inner lip, points B and G on the upper lip, and points C and H on the lower lip; the exact fractional positions of B, G, C, and H on the lips are given by the formula images of the original document (GDA0003147296660000041, GDA0003147296660000042).
The midpoint of the line connecting points B and G is labeled D and the midpoint of the line connecting points C and H is labeled E; the angles CAF and BAF, the length of segment AF, and the length of segment DE are then measured.
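The measurement step can be illustrated with a minimal Python sketch (an illustration only, not part of the patent text; the point coordinates are assumed to come from some lip landmark detector, which is not specified here):

import math

def dist(p, q):
    return math.hypot(q[0] - p[0], q[1] - p[1])

def angle_at(vertex, p1, p2):
    # angle p1-vertex-p2, in degrees
    v1 = (p1[0] - vertex[0], p1[1] - vertex[1])
    v2 = (p2[0] - vertex[0], p2[1] - vertex[1])
    cos_t = (v1[0] * v2[0] + v1[1] * v2[1]) / (math.hypot(*v1) * math.hypot(*v2))
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_t))))

def lip_measurements(pts):
    # pts: dict mapping the labels A, B, C, F, G, H to (x, y) pixel coordinates
    A, B, C, F, G, H = (pts[k] for k in "ABCFGH")
    D = ((B[0] + G[0]) / 2, (B[1] + G[1]) / 2)  # midpoint of segment BG
    E = ((C[0] + H[0]) / 2, (C[1] + H[1]) / 2)  # midpoint of segment CH
    return {
        "angle_CAF": angle_at(A, C, F),  # lower-lip angle at corner A
        "angle_BAF": angle_at(A, B, F),  # upper-lip angle at corner A
        "len_AF": dist(A, F),            # mouth width between corners A and F
        "len_DE": dist(D, E),            # mouth opening between midpoints D and E
    }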
An evaluation function is calculated to judge the speech; the specific evaluation function is
Pre = k * (p * Angle(a, b) + q * Line(LAF, LDE));
where k, p, and q are the weight coefficients of the cost functions: p and q are both 0.5, k is a pronunciation coefficient whose value depends on the geographic region, Angle(a, b) is the cost sub-function for the lip included angles, and Line(LAF, LDE) is the cost sub-function for the degree of lip opening.
Chinese Pinyin has 63 Pinyin letters: 23 initials, 24 finals, and 16 whole-syllable readings. Most regional dialects are hard to distinguish because of flat versus retroflex tongue sounds, front versus back nasal sounds, individual letters, and the like, so when speakers from different places speak Mandarin, their pronunciation differs considerably from standard Mandarin; a degree of regional differentiation can therefore be made through the coefficient k. During the first training of the k coefficient, only one region's pronunciation is collected, so the initial value is 1; when other regions are trained, k is adjusted so that the difference from the established standard database stays small. For example, in the standard case the lip opening angles for L and N differ greatly, but in some regions their pronunciations are similar and the change in mouth shape is smaller; in that case the k value is enlarged or reduced by an appropriate amount so that L and N remain distinguishable without affecting other values.
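A minimal Python sketch of this evaluation function follows. The Angle(a, b) and Line(LAF, LDE) cost sub-functions and the per-region k values are not defined in the text, so the normalised expressions and the region table below are placeholder assumptions:

# Hypothetical per-region pronunciation coefficients; k starts at 1 for the
# first trained region and is adjusted slightly for the others (values assumed).
REGION_K = {"standard": 1.00, "region_a": 1.05, "region_b": 0.95}

def pre_score(m, region="standard", p=0.5, q=0.5):
    # Pre = k * (p * Angle(a, b) + q * Line(LAF, LDE))
    # m is the dict returned by lip_measurements() above.
    angle_cost = (m["angle_BAF"] + m["angle_CAF"]) / 180.0          # placeholder Angle(a, b)
    line_cost = m["len_DE"] / m["len_AF"] if m["len_AF"] else 0.0   # placeholder Line(LAF, LDE)
    k = REGION_K.get(region, 1.0)                                   # regional pronunciation coefficient
    return k * (p * angle_cost + q * line_cost)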
When the speaker's image is captured, the speaker's face must directly face the camera; the camera first captures and recognizes the face, and only then starts capturing the speaker's lip region.
During speech recognition, two temporary storage areas are set up in the storage database to hold the audio and video information; they mainly store the audio stream and timestamps, and are emptied after their contents have been saved, or emptied directly if there is no obvious audio input within 30 seconds.
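The two temporary areas can be sketched as follows; the in-memory list standing in for the storage database and the polling of the 30-second check are assumptions, since the text does not specify them:

import time

class TempBuffer:
    """Temporary area holding a stream (audio or video) together with timestamps."""

    def __init__(self, idle_timeout=30.0):
        self.items = []                       # (timestamp, payload) pairs
        self.idle_timeout = idle_timeout
        self._last_input = time.monotonic()

    def push(self, payload):
        self.items.append((time.time(), payload))
        self._last_input = time.monotonic()

    def flush_to(self, store):
        store.extend(self.items)              # persist the contents, then empty the area
        self.items.clear()

    def clear_if_idle(self):
        # empty directly if there has been no obvious input for 30 seconds
        if time.monotonic() - self._last_input > self.idle_timeout:
            self.items.clear()

audio_buffer = TempBuffer()
video_buffer = TempBuffer()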
Voice wake-up is required before speech recognition and is mainly triggered by specific spoken keywords. After a successful wake-up, speech reception and processing begin: the audio information is denoised and its feature points are collected, the preprocessed audio is matched against the standard audio in the database, and the matching degree P1 is output. For the video information, the face is first recognized and the lip region is tracked; 10 frame-extracted images are compared, the 10 computed Pre values are averaged to obtain Pra, Pra is compared with the Pre values in the database, values in the corresponding range are screened out, and the matching degree P2 with the database is output. Finally, P1 and P2 are compared and the corresponding recognition result is output.
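The overall recognition flow can be sketched as below. All helper functions (denoise, extract_features, best_audio_match, detect_face, track_lips, locate_points, best_mouth_match) are hypothetical placeholders for components the text leaves unspecified; the final step compares P1 and P2 and outputs the result with the larger matching degree, with the stated tie-break in favour of the speech result:

def recognize_utterance(audio, video_frames, audio_db, mouth_db):
    # assumes the voice wake-up by keyword has already succeeded

    # Audio branch: denoise, collect feature points, match against the database -> P1
    features = extract_features(denoise(audio))
    speech_text, p1 = best_audio_match(features, audio_db)

    # Video branch: face recognition, lip tracking, 10 frame-extracted lip images
    lip_frames = track_lips(detect_face(video_frames))[:10]
    pre_values = [pre_score(lip_measurements(locate_points(f))) for f in lip_frames]
    pra = sum(pre_values) / len(pre_values)          # averaged Pre -> Pra
    lip_text, p2 = best_mouth_match(pra, mouth_db)   # screen the database range -> P2

    # Compare P1 and P2 and output the corresponding result
    # (tie-break: equal matching degrees -> prefer the speech result).
    if p1 >= p2:
        return speech_text
    return lip_text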
The audio database stores preset mouth shape data of dialect pronunciation.
If no matching data is found, the current result is recorded and the speaker is prompted to pronounce again or choose to quit. If the speaker continues pronouncing and the match succeeds, the Pra value is stored and the original database is adjusted to a certain degree.
To improve the machine's recognition efficiency, after a certain recognition period the database is given a major correction through a new evaluation function (formula 2).
N_Pre = k * (p * Angle(a, b) + q * Line(LAF, LDE) + m * Time(t))
Here t is the update period of the weight coefficients, m is the weight coefficient of the periodic cost function and determines the importance of the periodic cost, and Time(t) is the cost function of the correction period.
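A sketch of the periodic correction function, under the same placeholder assumptions as the pre_score() sketch above; the form of Time(t) and the value of m are not given in the text, so the decay term and default weight below are assumptions:

def n_pre_score(m, t, region="standard", p=0.5, q=0.5, m_weight=0.1):
    # N_Pre = k * (p * Angle(a, b) + q * Line(LAF, LDE) + m * Time(t))
    # m is the lip_measurements() dict; m_weight stands for the formula's m.
    angle_cost = (m["angle_BAF"] + m["angle_CAF"]) / 180.0
    line_cost = m["len_DE"] / m["len_AF"] if m["len_AF"] else 0.0
    time_cost = 1.0 / (1.0 + t)      # placeholder Time(t): older periods weigh less (assumption)
    k = REGION_K.get(region, 1.0)
    return k * (p * angle_cost + q * line_cost + m_weight * time_cost)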
Because some unavoidable problems can occur during recognition, several conditions are analyzed and handled in advance so that machine downtime is avoided.
Case one: if the matching degrees of speech recognition and lip-reading recognition are the same, the speech recognition result is preferred; if the speech recognition results have no match, the lip-reading results with the same matching degree are output.
case two: in the matching process, the phenomenon that the buffer area is always occupied by increasing due to the fact that the matching always fails, and the buffer area overflows. In order to ensure that the machine does not crash, the number of times of identification and matching can be set, if the machine still cannot be successfully matched after a certain number of times of identification, a related log file is generated, a buffer area is emptied, and a new round of identification is carried out again;
case three: in order to enhance the working efficiency of the machine, networking identification is carried out, some data are stored, and in order to ensure that the machine can still normally operate under the condition of network interruption, the machine needs to have a certain amount of local storage space. Meanwhile, because log files are continuously generated, the machine needs to merge and arrange the log files in a certain time period or under the limit of a certain space, and unnecessary files are emptied in time.
The above embodiments express only one implementation of the present invention; their description is specific and detailed, but it shall not be construed as limiting the scope of the invention. It should be noted that those skilled in the art can make several variations and modifications without departing from the concept of the invention, and these fall within its protection scope; therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (5)

1. A voice correction fusion method, characterized by: collecting a speaker's sound data and video data simultaneously; preprocessing the mouth shape captured in the video data by marking six points on the inner lip with letters; measuring the preprocessed image and calculating the lip change angle from the positions of the six points; comparing the sound data with an audio database to obtain a speech recognition result, and comparing the lip change angle with a mouth-shape database to obtain a lip-reading recognition result; when the matching degrees of the speech recognition result and the lip-reading recognition result are the same, preferring the speech recognition result; when the matching degrees differ, preferring the lip-reading recognition result; the six points are points A and F at the mouth corners on the two sides of the inner lip, points B and G on the upper lip, and points C and H on the lower lip, with the exact fractional positions of B, G, C, and H given by the formula images of the original document (FDA0003147296650000011, FDA0003147296650000012); the midpoint of the line connecting points B and G is labeled D and the midpoint of the line connecting points C and H is labeled E, and the angles CAF and BAF, the length of segment AF, and the length of segment DE are measured; the method further comprises calculating an evaluation function to judge the speech, the specific evaluation function being
Pre = k * (p * Angle(a, b) + q * Line(LAF, LDE));
where k, p, and q are the weight coefficients of the cost functions: p and q are both 0.5, k is a pronunciation coefficient whose value depends on the geographic region, Angle(a, b) is the cost sub-function for the lip included angles, and Line(LAF, LDE) is the cost sub-function for the degree of lip opening.
2. The voice correction fusion method of claim 1, wherein: when the speaker's image is captured, the speaker's face must directly face the camera; the camera first captures and recognizes the face, and only then starts capturing the speaker's lip region.
3. The voice correction fusion method of claim 1, wherein: during speech recognition, two temporary storage areas are set up in the storage database to hold the audio and video information; they store the audio stream and timestamps, and are emptied after their contents have been saved, or emptied directly if there is no obvious audio input within 30 seconds.
4. The voice correction fusion method of claim 3, characterized in that: voice wake-up is required before speech recognition and is triggered specifically by spoken keywords; after a successful wake-up, speech reception and processing begin: the audio information is denoised and its feature points are collected, the preprocessed audio is matched against the standard audio in the database, and the matching degree P1 is output; for the video information, the face is first recognized and the lip region is tracked, 10 frame-extracted images are compared, the 10 computed Pre values are averaged to obtain Pra, Pra is compared with the Pre values in the database, values in the corresponding range are screened out, the matching degree P2 with the database is output, and finally P1 and P2 are compared and the corresponding recognition result is output.
5. The voice correction fusion method of claim 3, characterized in that: the audio database stores preset mouth-shape data of dialect pronunciations.
CN202010947107.2A | 2020-09-10 | 2020-09-10 | Voice correction fusion method | Active | CN112037788B (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202010947107.2A (CN112037788B, en) | 2020-09-10 | 2020-09-10 | Voice correction fusion method

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202010947107.2A (CN112037788B, en) | 2020-09-10 | 2020-09-10 | Voice correction fusion method

Publications (2)

Publication Number | Publication Date
CN112037788A (en) | 2020-12-04
CN112037788B (en) | 2021-08-24

Family

ID=73584699

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202010947107.2A (Active, CN112037788B, en) | Voice correction fusion method | 2020-09-10 | 2020-09-10

Country Status (1)

Country | Link
CN (1) | CN112037788B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN112820274B (en)* | 2021-01-08 | 2021-09-28 | 上海仙剑文化传媒股份有限公司 | Voice information recognition correction method and system
CN112949554B (en)* | 2021-03-22 | 2022-02-08 | 湖南中凯智创科技有限公司 | Intelligent children accompanying education robot
CN113421467A (en)* | 2021-06-15 | 2021-09-21 | 读书郎教育科技有限公司 | System and method for assisting in learning pinyin spelling and reading
CN114141249A (en)* | 2021-12-02 | 2022-03-04 | 河南职业技术学院 | Teaching voice recognition optimization method and system
CN114420124B (en)* | 2022-03-31 | 2022-06-24 | 北京妙医佳健康科技集团有限公司 | Speech recognition method

Citations (6)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN104409075A (en)* | 2014-11-28 | 2015-03-11 | 深圳创维-Rgb电子有限公司 | Speech Recognition Method and System
EP2889804A1 (en)* | 2013-12-30 | 2015-07-01 | Alcatel Lucent | Systems and methods for contactless speech recognition
US9159321B2 (en)* | 2012-02-27 | 2015-10-13 | Hong Kong Baptist University | Lip-password based speaker verification system
CN109389098A (en)* | 2018-11-01 | 2019-02-26 | 重庆中科云丛科技有限公司 | A kind of verification method and system based on lip reading identification
CN109637521A (en)* | 2018-10-29 | 2019-04-16 | 深圳壹账通智能科技有限公司 | A kind of lip reading recognition methods and device based on deep learning
CN110111783A (en)* | 2019-04-10 | 2019-08-09 | 天津大学 | A kind of multi-modal audio recognition method based on deep neural network

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
US9159321B2 (en)* | 2012-02-27 | 2015-10-13 | Hong Kong Baptist University | Lip-password based speaker verification system
EP2889804A1 (en)* | 2013-12-30 | 2015-07-01 | Alcatel Lucent | Systems and methods for contactless speech recognition
CN104409075A (en)* | 2014-11-28 | 2015-03-11 | 深圳创维-Rgb电子有限公司 | Speech Recognition Method and System
CN109637521A (en)* | 2018-10-29 | 2019-04-16 | 深圳壹账通智能科技有限公司 | A kind of lip reading recognition methods and device based on deep learning
CN109389098A (en)* | 2018-11-01 | 2019-02-26 | 重庆中科云丛科技有限公司 | A kind of verification method and system based on lip reading identification
CN110111783A (en)* | 2019-04-10 | 2019-08-09 | 天津大学 | A kind of multi-modal audio recognition method based on deep neural network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Lip geometric features for human-computer interaction using bimodal speech recognition: comparison and analysis; Mustafa N. Kaynak et al.; Speech Communication; 2004-06-30; Vol. 43; page 4, column 1 line 8 to column 2 paragraph 1, and page 8, section 4.1*

Also Published As

Publication number | Publication date
CN112037788A (en) | 2020-12-04

Similar Documents

Publication | Title
CN112037788B (en) | Voice correction fusion method
US10397646B2 (en) | Method, system, and program product for measuring audio video synchronization using lip and teeth characteristics
US20020161582A1 (en) | Method and apparatus for presenting images representative of an utterance with corresponding decoded speech
CN112997186A (en) | "Liveness" detection system
CN101199208A (en) | Method, system, and program product for measuring audio video synchronization
US20080111887A1 (en) | Method, system, and program product for measuring audio video synchronization independent of speaker characteristics
CN108305616A (en) | A kind of audio scene recognition method and device based on long feature extraction in short-term
CN112786052B (en) | Speech recognition method, electronic equipment and storage device
US20170263237A1 (en) | Speech synthesis from detected speech articulator movement
CN113920560B (en) | Method, device and equipment for identifying multi-mode speaker identity
CN110196914B (en) | Method and device for inputting face information into database
WO2016173132A1 (en) | Method and device for voice recognition, and user equipment
US20240135956A1 (en) | Method and apparatus for measuring speech-image synchronicity, and method and apparatus for training model
CN114466178B (en) | Method and device for measuring synchronization between speech and image
WO2006113409A2 (en) | Method, system, and program product for measuring audio video synchronization using lip and teeth charateristics
CN109102813B (en) | Voiceprint recognition method and device, electronic equipment and storage medium
CN114466179A (en) | Method and device for measuring synchronism of voice and image
CN114494930A (en) | Training method and device for voice and image synchronism measurement model
CN117995176A (en) | Multi-source voice recognition method and system
Abel et al. | Cognitively inspired audiovisual speech filtering: towards an intelligent, fuzzy based, multimodal, two-stage speech enhancement system
CN110223700A (en) | Talker estimates method and talker's estimating device
JP2020091559A (en) | Expression recognition device, expression recognition method, and program
Yoshinaga et al. | Audio-visual speech recognition using new lip features extracted from side-face images
Goecke et al. | Validation of an automatic lip-tracking algorithm and design of a database for audio-video speech processing
Kratt et al. | Large vocabulary audio-visual speech recognition using the Janus speech recognition toolkit

Legal Events

Date | Code | Title | Description
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant
