CN107293300A - Speech recognition method and device, computer device and readable storage medium - Google Patents

Speech recognition method and device, computer device and readable storage medium

Info

Publication number
CN107293300A
Authority
CN
China
Prior art keywords
voice information
lip
pause information
user
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN201710648985.2A
Other languages
Chinese (zh)
Inventor
关超雄
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Meizu Technology Co Ltd
Original Assignee
Meizu Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Meizu Technology Co Ltd
Priority to CN201710648985.2A
Publication of CN107293300A
Legal status: Withdrawn (current)

Abstract

The invention provides a speech recognition method. The method includes: obtaining voice information input by a user; obtaining lip images of the user while the voice information is being input; recognizing pause information in the voice information according to the lip images; and performing speech recognition on the voice information according to the pause information. The invention also provides a speech recognition device, a computer device and a computer-readable storage medium. The invention can perform speech recognition with the aid of lip images and improve the accuracy of speech recognition.

Description

Speech recognition method and device, computer device and readable storage medium
Technical field
The present invention relates to the field of intelligent speech technology, and in particular to a speech recognition method and device, a computer device and a readable storage medium.
Background technology
At present, with the development of electronic and communication technology, terminals such as mobile phones and tablet computers are widely used, and the modes of human-computer interaction are increasingly diverse. Voice input, as one of the most convenient and natural modes of human-computer interaction, is accepted by more and more users. However, the accuracy of current speech recognition is not high, and the user experience is poor.
Summary of the invention
In view of the above, it is necessary to provide a speech recognition method and device, a computer device and a readable storage medium that can perform speech recognition with the aid of lip images and improve the accuracy of speech recognition.
A first aspect of the application provides a speech recognition method, the method including:
obtaining voice information input by a user;
obtaining lip images of the user while the voice information is being input;
recognizing pause information in the voice information according to the lip images;
performing speech recognition on the voice information according to the pause information.
In another possible implementation, performing speech recognition on the voice information according to the pause information includes:
inserting the pause information into the text information converted from the voice information according to the time mapping relationship between the pause information and the voice information; or
removing the pause information from the voice information, and performing speech recognition on the voice information from which the pause information has been removed.
In another possible implementation, recognizing pause information in the voice information according to the lip images includes:
recognizing word-break pause information and/or punctuation pause information in the voice information according to the lip images;
and performing speech recognition on the voice information according to the pause information includes:
performing speech recognition on the voice information according to the word-break pause information and/or the punctuation pause information.
In another possible implementation, obtaining the voice information input by the user and obtaining the lip images of the user while the voice information is being input includes:
when the user inputs the voice information, collecting the voice information through a microphone of a terminal and capturing the lip images through a camera of the terminal.
In another possible implementation, the method further includes:
judging whether the lip motion information matches the voice information;
if the lip motion information does not match the voice information, controlling the camera to stop capturing the lip images.
In another possible implementation, the method further includes:
obtaining the motion amplitude of the user's lips according to the lip images, and recognizing the tone corresponding to the voice information according to the motion amplitude of the user's lips; or
obtaining lip characteristics of the user's pronunciation, determining user features according to the lip characteristics, and performing speech recognition on the voice information according to the user features and the pause information.
A second aspect of the application provides a speech recognition device, the device including:
a first acquisition unit for obtaining voice information input by a user;
a second acquisition unit for obtaining lip images of the user while the voice information is being input;
a first recognition unit for recognizing pause information in the voice information according to the lip images;
a second recognition unit for performing speech recognition on the voice information according to the pause information.
In another possible implementation, the second recognition unit is specifically configured to:
insert the pause information into the text information converted from the voice information according to the time mapping relationship between the pause information and the voice information; or
remove the pause information from the voice information, and perform speech recognition on the voice information from which the pause information has been removed.
A third aspect of the application provides a computer device including a processor, the processor being configured to implement the steps of the speech recognition method when executing a computer program stored in a memory.
A fourth aspect of the application provides a computer-readable storage medium on which a computer program is stored, the computer program implementing the steps of the speech recognition method when executed by a processor.
The present invention obtains voice information input by a user, obtains lip images of the user while the voice information is being input, recognizes pause information in the voice information according to the lip images, and performs speech recognition on the voice information according to the pause information. The present invention can therefore perform speech recognition with the aid of lip images and improve the accuracy of speech recognition.
Brief description of the drawings
Fig. 1 is a flowchart of the speech recognition method provided by embodiment one of the present invention;
Fig. 2 is a structural diagram of the speech recognition device provided by embodiment two of the present invention;
Fig. 3 is a schematic diagram of the computer device provided by embodiment three of the present invention.
Description of main element symbols
Computer device 1
Speech recognition device 10
Memory 20
Processor 30
Computer program 40
First acquisition unit 201
Second acquisition unit 202
First recognition unit 203
Second recognition unit 204
The following embodiments will further illustrate the present invention with reference to the above drawings.
Detailed description of the embodiments
In order that the above objects, features and advantages of the present invention may be understood more clearly, the invention is described in detail below with reference to the accompanying drawings and specific embodiments. It should be noted that, where no conflict arises, the embodiments of the application and the features in the embodiments may be combined with one another.
Many specific details are set forth in the following description to facilitate a thorough understanding of the invention. The described embodiments are only some, rather than all, of the embodiments of the invention. All other embodiments obtained by persons of ordinary skill in the art based on the embodiments of the invention without creative work fall within the scope of protection of the invention.
Unless otherwise defined, all technical and scientific terms used herein have the meaning commonly understood by those skilled in the technical field of the invention. The terms used in the description are intended only to describe specific embodiments and are not intended to limit the invention.
Preferably, the speech recognition method of the invention is applied to one or more terminals. A terminal is a device capable of automatically performing numerical computation and/or information processing according to instructions that are set or stored in advance; its hardware includes, but is not limited to, a microprocessor, an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA), a digital signal processor (Digital Signal Processor, DSP), an embedded device, and the like.
The terminal may be, but is not limited to, any electronic product capable of human-computer interaction with a user through a keyboard, a mouse, a remote control, a touch pad or a voice-control device, for example a tablet computer, a smart phone, a personal digital assistant (Personal Digital Assistant, PDA), an intelligent wearable device, and the like.
Embodiment one
Fig. 1 is a flowchart of the speech recognition method provided by embodiment one of the present invention. As shown in Fig. 1, the method specifically includes the following steps:
101: Obtain voice information input by a user.
The voice information is speech data obtained from the user's natural speech. For example, the voice information is a voice signal obtained by converting the user's natural speech into an electrical signal through a microphone.
The voice information may be collected through a microphone of the terminal while the user is inputting it. For example, it may be detected whether a voice-input start instruction is received (for example, whether the home key of the terminal is long-pressed); if a voice-input start instruction is received, collection of the voice information input by the user through the microphone of the terminal is started. It may also be detected whether a voice-input end instruction is received (for example, whether the home key of the terminal is released); if a voice-input end instruction is received, collection of the voice information through the microphone of the terminal is stopped.
Alternatively, voice information collected in advance may be read. For example, the voice information input by the user may be collected in advance and read when speech recognition needs to be performed on it.
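The following Python sketch illustrates this collection step under stated assumptions: it is not part of the patent, the third-party sounddevice library stands in for the terminal's microphone interface, and a threading.Event stands in for the home-key press/release signal.

# Minimal sketch of step 101, assuming the start/stop signal is exposed as a
# threading.Event (the home-key detection itself is platform specific).
import queue
import threading

import numpy as np
import sounddevice as sd  # third-party microphone access library

SAMPLE_RATE = 16000  # Hz, a common rate for speech recognition

def record_while_pressed(pressed: threading.Event) -> np.ndarray:
    """Collect microphone audio for as long as `pressed` is set."""
    blocks: "queue.Queue[np.ndarray]" = queue.Queue()

    def on_audio(indata, frames, time_info, status):
        if pressed.is_set():          # only keep audio while the key is held
            blocks.put(indata.copy())

    with sd.InputStream(samplerate=SAMPLE_RATE, channels=1,
                        callback=on_audio):
        pressed.wait()                # wait for the voice-input start signal
        while pressed.is_set():       # keep streaming until the key is released
            sd.sleep(50)              # milliseconds

    chunks = []
    while not blocks.empty():
        chunks.append(blocks.get())
    return np.concatenate(chunks) if chunks else np.empty((0, 1))

Reading pre-collected voice information would simply replace this capture with loading a stored waveform.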
102: Obtain lip images of the user while the voice information is being input.
A lip image, also called a lip-movement image or lip-reading image, is an image of the changing lip movements of a speaker while speaking. The lip images over a period of time form an image sequence or video.
A face image of the user while inputting the voice information may be obtained, and the lip position determined from the face image, so as to obtain the lip images.
The camera may also be aimed directly at the user's lips. For example, the camera may be built into the microphone (for example in a headset), or the microphone may be built into the camera, so that in use the camera points directly at the user's lips and the lip images are obtained conveniently.
The lip images may be captured through the camera of the terminal while the user is inputting the voice information. For example, it may be detected whether a voice-input start instruction is received; if so, the lip images of the user are captured through the camera of the terminal while the voice information input by the user is collected through the microphone of the terminal. It may also be detected whether a voice-input end instruction is received; if so, capturing of the lip images through the camera of the terminal is stopped at the same time as collection of the voice information through the microphone of the terminal is stopped.
Alternatively, lip images captured in advance may be read. For example, the lip images may be captured while the voice information input by the user is collected in advance, and read when speech recognition needs to be performed on the voice information.
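As a rough illustration of capturing lip images on commodity hardware (not prescribed by the patent), the sketch below uses OpenCV's stock Haar face cascade and crops the lower third of the detected face as the lip region; a real system would use a dedicated lip or facial-landmark detector.

# Sketch of step 102: grab one camera frame and crop an approximate lip region.
import cv2

face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def grab_lip_frame(capture: cv2.VideoCapture):
    """Return a cropped lip image from one camera frame, or None."""
    ok, frame = capture.read()
    if not ok:
        return None
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None
    x, y, w, h = faces[0]
    # Heuristic: the mouth sits roughly in the lower third of the face box.
    return frame[y + 2 * h // 3 : y + h, x : x + w]

# Usage: call grab_lip_frame(cv2.VideoCapture(0)) inside the recording loop and
# append the results to a list to build the lip-image sequence.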
While the voice information input by the user is being collected through the microphone of the terminal and the lip images are being captured through the camera of the terminal, it may be judged whether the lip motion information matches the voice information; if the lip motion information does not match the voice information, the camera is controlled to stop capturing the lip images.
It may be detected whether the lip motion information is synchronous with the voice information; if they are not synchronous, the lip motion information does not match the voice information. For example, if the voice information indicates that the user started speaking at the 1st second while the lip motion information indicates that the user started speaking at the 5th second, the lip motion information and the voice information are not synchronous, and therefore do not match.
Alternatively, it may be detected whether the text corresponding to the lip motion information is consistent with the text corresponding to the voice information; if not, the lip motion information does not match the voice information. For example, if within a certain period the text corresponding to the lip motion information is "I have a meeting" while the text corresponding to the voice information is "the weather is nice today", the two texts are inconsistent, and therefore the lip motion information does not match the voice information.
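A minimal sketch of the synchrony check follows, assuming the voice information is a mono NumPy waveform and the lip images are a list of cropped frames; the energy and motion thresholds and the one-second tolerance are illustrative values, not taken from the patent.

# Compare when speech energy first appears with when lip motion first appears.
import numpy as np

def first_speech_time(audio, sr, energy_thresh=0.01, frame_len=0.02):
    step = int(sr * frame_len)
    for i in range(0, len(audio) - step, step):
        if np.mean(audio[i:i + step] ** 2) > energy_thresh:
            return i / sr
    return None

def first_lip_motion_time(lip_frames, fps, motion_thresh=5.0):
    for i in range(1, len(lip_frames)):
        diff = np.mean(np.abs(lip_frames[i].astype(float) -
                              lip_frames[i - 1].astype(float)))
        if diff > motion_thresh:
            return i / fps
    return None

def lips_match_voice(audio, sr, lip_frames, fps, tolerance=1.0):
    t_voice = first_speech_time(audio, sr)
    t_lip = first_lip_motion_time(lip_frames, fps)
    if t_voice is None or t_lip is None:
        return False
    return abs(t_voice - t_lip) <= tolerance  # asynchronous onsets = mismatch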
103: Recognize pause information in the voice information according to the lip images.
Pauses often occur while a user is speaking. The lip images therefore include lip images during pauses, and the voice information includes the voice information during pauses (the pause information). The pause information contained in the voice information can be recognized from the lip images during the pauses.
A user may pause when a word break or punctuation is needed. The pause information may therefore indicate a word break and/or punctuation (in this case the pause information may be a mute signal), and may include word-break pause information and/or punctuation pause information.
Alternatively, a user may pause while the other party is speaking or while thinking. The pause information may therefore represent a period of silence; in this case the pause information is invalid voice input.
Alternatively, a user may pause when there is noise (for example when the noise is excessive). The pause information may therefore represent noise (in this case the pause information may be a noise signal); in this case the pause information is also invalid voice input.
When the pause information indicates a word break and/or punctuation, word-break pause information and/or punctuation pause information in the voice information may be recognized according to the lip images.
It may be detected from the lip images whether the user's lips do not change, or change by no more than a preset amplitude, within a first preset time (for example 0.1 second); if so, the voice information corresponding to the first preset time is identified as word-break pause information.
It may be detected from the lip images whether the user's lips do not change, or change by no more than the preset amplitude, within a second preset time (for example 0.5 second); if so, the voice information corresponding to the second preset time is identified as punctuation pause information. The second preset time may be longer than the first preset time.
When the pause information represents a period of silence or noise, it may be detected from the lip images whether the user's lips do not change, or change by no more than the preset amplitude, within a third preset time (for example 3 seconds); if so, the voice information corresponding to the third preset time is identified as pause information. Alternatively, if the user's lips do not change, or change by no more than the preset amplitude, within the third preset time and the amplitude of the voice signal corresponding to the third preset time is greater than a preset threshold, the voice information corresponding to the third preset time is identified as pause information. The third preset time may be longer than the second preset time.
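The sketch below shows one way these rules could be expressed in code, assuming the lip images are a list of equally sized frames captured at a known frame rate; the frame-difference motion measure and the preset amplitude are illustrative stand-ins for whatever lip-motion metric an implementation actually uses.

# Sketch of step 103: classify still spans of the lip-image sequence into
# word-break, punctuation, or invalid-input pauses using the three example
# preset durations from the description (0.1 s, 0.5 s, 3 s).
import numpy as np

WORD_BREAK_S, PUNCTUATION_S, INVALID_S = 0.1, 0.5, 3.0

def lip_motion_amplitudes(lip_frames):
    """Mean absolute frame-to-frame difference, one value per frame pair."""
    frames = [f.astype(float) for f in lip_frames]
    return [np.mean(np.abs(b - a)) for a, b in zip(frames, frames[1:])]

def find_pauses(lip_frames, fps, preset_amplitude=2.0):
    """Return (start_s, end_s, kind) spans where the lips barely move."""
    still = [a <= preset_amplitude for a in lip_motion_amplitudes(lip_frames)]
    pauses, i = [], 0
    while i < len(still):
        if still[i]:
            j = i
            while j < len(still) and still[j]:
                j += 1
            duration = (j - i) / fps
            if duration >= INVALID_S:
                kind = "invalid"        # long silence or noise
            elif duration >= PUNCTUATION_S:
                kind = "punctuation"
            elif duration >= WORD_BREAK_S:
                kind = "word_break"
            else:
                kind = None
            if kind:
                pauses.append((i / fps, j / fps, kind))
            i = j
        else:
            i += 1
    return pauses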
104: Perform speech recognition on the voice information according to the pause information.
If the pause information includes word-break pause information, speech recognition may be performed on the voice information according to the word-break pause information.
Alternatively, if the pause information includes punctuation pause information, speech recognition may be performed on the voice information according to the punctuation pause information.
Alternatively, if the pause information includes both word-break pause information and punctuation pause information, speech recognition may be performed on the voice information according to the word-break pause information and the punctuation pause information.
The pause information may be inserted into the text information converted from the voice information according to the time mapping relationship (that is, the corresponding time relationship) between the pause information and the voice information. For example, speech recognition may be performed on the voice information to obtain the corresponding text information, and the pause information (word-break pause information and/or punctuation pause information) may then be inserted into the text information according to the times at which it occurs in the voice information, so as to obtain text information containing the pause information.
Alternatively, the pause information may be removed from the voice information, and speech recognition performed on the voice information from which the pause information has been removed. As described above, the pause information may represent noise or silence, that is, invalid voice input; performing speech recognition on the voice information from which the pause information has been removed removes the noise or silence in the voice information.
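A sketch of both options follows, assuming a recognizer that returns words with end timestamps (so the time mapping is available) and reusing the pause spans produced by the previous sketch; the mapping window and the punctuation marks are illustrative.

# Sketch of step 104: insert pause markers by the time mapping, or cut
# invalid spans out of the waveform before recognition.
import numpy as np

def insert_pauses(words, pauses):
    """words: list of (text, end_time_s); pauses: output of find_pauses()."""
    marks = {"word_break": " ", "punctuation": ", "}
    out = []
    for text, end_t in words:
        out.append(text)
        for start, _, kind in pauses:
            # A pause beginning just after this word maps onto it in time.
            if kind in marks and end_t <= start < end_t + 0.3:
                out.append(marks[kind])
    return "".join(out)

def remove_invalid_spans(audio, sr, pauses):
    """Drop the samples covered by 'invalid' (silence/noise) pause spans."""
    keep = np.ones(len(audio), dtype=bool)
    for start, end, kind in pauses:
        if kind == "invalid":
            keep[int(start * sr):int(end * sr)] = False
    return audio[keep]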
Various speech recognition techniques, such as dynamic time warping (Dynamic Time Warping, DTW), hidden Markov models (Hidden Markov Model, HMM), vector quantization (Vector Quantization, VQ) and artificial neural networks (Artificial Neural Network, ANN), may be used to perform speech recognition on the voice information or on the voice information from which the pause information has been removed.
The speech recognition method of embodiment one obtains voice information input by a user, obtains lip images of the user while the voice information is being input, recognizes pause information in the voice information according to the lip images, and performs speech recognition on the voice information according to the pause information. The speech recognition method of embodiment one can thus perform speech recognition with the aid of lip images and improve the accuracy of speech recognition.
In another embodiment, the method may further include: obtaining the motion amplitude of the user's lips according to the lip images, and recognizing the tone corresponding to the voice information according to the motion amplitude of the user's lips. The tone may include a declarative tone, an interrogative tone, an imperative tone, an exclamatory tone, and the like. For example, if the motion amplitude of the user's lips is within a first preset amplitude range, the tone corresponding to the voice information is determined to be exclamatory; if the motion amplitude of the user's lips is within a second preset amplitude range, the tone is determined to be imperative.
In another embodiment, the method may further include: obtaining lip characteristics of the user's pronunciation; determining user features according to the lip characteristics; and performing speech recognition on the voice information according to the user features and the pause information. The user features may include the user's gender, language type, dialect type and/or habitual expressions, and the like. For example, the language type (for example Chinese) may be determined according to the lip characteristics of the user's pronunciation, and speech recognition performed on the voice information according to the language type and the pause information. Obtaining such auxiliary information (the user features) before performing speech recognition on the voice information can further improve the accuracy of speech recognition.
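As a toy illustration of the tone refinement, the sketch below maps the peak lip-motion amplitude onto a tone label; the amplitude ranges are placeholders for the patent's unspecified first and second preset amplitude ranges, and the amplitudes are those computed by lip_motion_amplitudes above.

# Map the peak lip-motion amplitude to a tone label (placeholder thresholds).
def classify_tone(amplitudes):
    peak = max(amplitudes) if amplitudes else 0.0
    if peak >= 20.0:          # assumed "first preset amplitude range"
        return "exclamatory"
    if peak >= 10.0:          # assumed "second preset amplitude range"
        return "imperative"
    return "declarative"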
Embodiment two
Fig. 2 is a structural diagram of the speech recognition device provided by embodiment two of the present invention. As shown in Fig. 2, the speech recognition device 10 may include a first acquisition unit 201, a second acquisition unit 202, a first recognition unit 203 and a second recognition unit 204.
The first acquisition unit 201 is used to obtain voice information input by a user. The manner of obtaining the voice information is as described for step 101 of embodiment one and is not repeated here.
The second acquisition unit 202 is used to obtain lip images of the user while the voice information is being input. The manner of obtaining the lip images, of judging whether the lip motion information matches the voice information, and of stopping the camera on a mismatch is as described for step 102 of embodiment one and is not repeated here.
The first recognition unit 203 is used to recognize pause information in the voice information according to the lip images. The manner of recognizing word-break pause information, punctuation pause information and invalid-input (silence or noise) pause information from the lip images and the preset times is as described for step 103 of embodiment one and is not repeated here.
The second recognition unit 204 is used to perform speech recognition on the voice information according to the pause information. The manner of inserting the pause information into the converted text according to the time mapping relationship, or of removing the pause information before recognition, and the applicable speech recognition techniques are as described for step 104 of embodiment one and are not repeated here.
The speech recognition device 10 of embodiment two obtains voice information input by a user, obtains lip images of the user while the voice information is being input, recognizes pause information in the voice information according to the lip images, and performs speech recognition on the voice information according to the pause information. The speech recognition device 10 of embodiment two can thus perform speech recognition with the aid of lip images and improve the accuracy of speech recognition.
In another embodiment, the speech recognition device 10 may further include:
a third recognition unit for obtaining the motion amplitude of the user's lips according to the lip images and recognizing the tone corresponding to the voice information according to the motion amplitude of the user's lips. The tone may include a declarative tone, an interrogative tone, an imperative tone, an exclamatory tone, and the like. For example, if the motion amplitude of the user's lips is within a first preset amplitude range, the tone corresponding to the voice information is determined to be exclamatory; if the motion amplitude of the user's lips is within a second preset amplitude range, the tone is determined to be imperative.
In another embodiment, the speech recognition device 10 may further include:
a fourth recognition unit for obtaining lip characteristics of the user's pronunciation, determining user features according to the lip characteristics, and performing speech recognition on the voice information according to the user features and the pause information. The user features may include the user's gender, language type, dialect type and/or habitual expressions, and the like. For example, the language type (for example Chinese) may be determined according to the lip characteristics of the user's pronunciation, and speech recognition performed on the voice information according to the language type and the pause information. Obtaining such auxiliary information (the user features) before performing speech recognition can further improve the accuracy of speech recognition.
Embodiment three
Fig. 3 is a schematic diagram of the computer device provided by embodiment three of the present invention. The computer device 1 includes a memory 20, a processor 30 and a computer program 40, such as a speech recognition program, stored in the memory 20 and executable on the processor 30. When executing the computer program 40, the processor 30 implements the steps of the above speech recognition method embodiment, for example steps 101 to 104 shown in Fig. 1. Alternatively, when executing the computer program 40, the processor 30 implements the functions of the modules/units in the above device embodiment, for example units 201 to 204.
Exemplarily, the computer program 40 may be divided into one or more modules/units, which are stored in the memory 20 and executed by the processor 30 to carry out the present invention. The one or more modules/units may be a series of computer program instruction segments capable of completing specific functions, the instruction segments describing the execution of the computer program 40 in the computer device 1. For example, the computer program 40 may be divided into the first acquisition unit 201, the second acquisition unit 202, the first recognition unit 203 and the second recognition unit 204 of Fig. 2; the specific functions of each module are described in embodiment two, and a structural sketch is given below.
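The module split referred to above can be pictured with the following sketch, which wires four stand-in callables for units 201 to 204 into one device object; it reflects the structure described here, not any concrete implementation, and the unit bodies would reuse the sketches given in embodiment one.

# Structural sketch of the device in embodiment two / the module split above.
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class SpeechRecognitionDevice:
    acquire_voice: Callable[[], Any]              # first acquisition unit 201
    acquire_lip_images: Callable[[], Any]         # second acquisition unit 202
    recognize_pauses: Callable[[Any, Any], Any]   # first recognition unit 203
    recognize_speech: Callable[[Any, Any], str]   # second recognition unit 204

    def run(self) -> str:
        voice = self.acquire_voice()
        lips = self.acquire_lip_images()
        pauses = self.recognize_pauses(voice, lips)
        return self.recognize_speech(voice, pauses)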
The computer device 1 may be a computing device such as a desktop computer, a notebook computer, a palmtop computer or a cloud server. Those skilled in the art will understand that the schematic diagram of Fig. 3 is only an example of the computer device 1 and does not constitute a limitation on it; the computer device 1 may include more or fewer parts than illustrated, may combine certain parts, or may have different parts; for example, it may also include input/output devices, network access devices, buses and the like.
The processor 30 may be a central processing unit (Central Processing Unit, CPU), another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or the processor 30 may be any conventional processor. The processor 30 is the control center of the computer device 1 and connects the various parts of the whole computer device 1 through various interfaces and lines.
The memory 20 may be used to store the computer program 40 and/or the modules/units. The processor 30 realizes the various functions of the computer device 1 by running or executing the computer program and/or modules/units stored in the memory 20 and by calling the data stored in the memory 20. The memory 20 may mainly include a program storage area and a data storage area, where the program storage area may store the operating system and the application programs required for at least one function (such as a sound playing function or an image playing function), and the data storage area may store data created according to the use of the computer device 1 (such as audio data or a phone book). In addition, the memory 20 may include a high-speed random access memory, and may also include a non-volatile memory such as a hard disk, an internal memory, a plug-in hard disk, a smart media card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card, a flash card (Flash Card), at least one magnetic disk storage device, a flash memory device, or another volatile solid-state storage device.
If the integrated modules/units of the computer device 1 are implemented in the form of software functional units and sold or used as independent products, they may be stored in a computer-readable storage medium. Based on this understanding, the present invention may implement all or part of the processes of the above method embodiments by instructing the relevant hardware through a computer program. The computer program may be stored in a computer-readable storage medium, and when executed by a processor can implement the steps of each of the above method embodiments. The computer program includes computer program code, which may be in source code form, object code form, an executable file, some intermediate form, or the like. The computer-readable medium may include any entity or device capable of carrying the computer program code, a recording medium, a USB flash disk, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM, Read-Only Memory), a random access memory (RAM, Random Access Memory), an electric carrier signal, a telecommunication signal, a software distribution medium, and the like. It should be noted that the content contained in the computer-readable medium may be increased or decreased as appropriate according to the requirements of legislation and patent practice in the jurisdiction; for example, in some jurisdictions, according to legislation and patent practice, the computer-readable medium does not include electric carrier signals and telecommunication signals.
In the several embodiments provided by the present invention, it should be understood that the disclosed computer device and method may be implemented in other ways. For example, the computer device embodiment described above is only schematic; the division of the units is only a division by logical function, and there may be other ways of dividing them in actual implementation.
In addition, the functional units in each embodiment of the invention may be integrated in the same processing unit, may exist physically separately, or two or more units may be integrated in the same unit. The above integrated units may be implemented in the form of hardware, or in the form of hardware plus software functional modules.
It is obvious to those skilled in the art that the invention is not restricted to the details of the above exemplary embodiments, and that the invention can be realized in other specific forms without departing from its spirit or essential attributes. Therefore, from whichever point of view, the embodiments should be regarded as exemplary and non-restrictive. The scope of the present invention is defined by the appended claims rather than by the above description, and all changes falling within the meaning and scope of equivalency of the claims are intended to be included in the invention. Any reference sign in a claim should not be regarded as limiting the claim concerned. Furthermore, it is clear that the word "comprising" does not exclude other units or steps, and the singular does not exclude the plural. Multiple units or computer devices stated in a computer device claim may also be realized by the same unit or computer device through software or hardware. Words such as "first" and "second" are used to denote names and do not indicate any particular order.
Finally, it should be noted that the above embodiments are only intended to illustrate, not to restrict, the technical solution of the present invention. Although the present invention has been described in detail with reference to the preferred embodiments, those skilled in the art will understand that the technical solution of the invention may be modified or equivalently substituted without departing from the spirit and scope of the technical solution of the invention.

Claims (10)

CN201710648985.2A, priority date 2017-08-01, filing date 2017-08-01: Speech recognition method and device, computer device and readable storage medium (Withdrawn); published as CN107293300A (en)

Priority Applications (1)

Application number: CN201710648985.2A (published as CN107293300A (en)); priority date: 2017-08-01; filing date: 2017-08-01; title: Speech recognition method and device, computer device and readable storage medium


Publications (1)

Publication number: CN107293300A; publication date: 2017-10-24

Family

ID=60104131

Family Applications (1)

Application number: CN201710648985.2A (Withdrawn, CN107293300A (en)); priority date: 2017-08-01; filing date: 2017-08-01; title: Speech recognition method and device, computer device and readable storage medium

Country Status (1)

Country: CN; publication: CN107293300A (en)



Legal Events

Code: PB01; event: Publication
Code: SE01; event: Entry into force of request for substantive examination
Code: WW01; event: Invention patent application withdrawn after publication; application publication date: 2017-10-24

