CN103021412A - Voice recognition method and system - Google Patents

Voice recognition method and system

Info

Publication number
CN103021412A
CN103021412A (application CN201210584746.2A; granted publication CN103021412B)
Authority
CN
China
Prior art keywords
voice
error
character string
error correction
character
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2012105847462A
Other languages
Chinese (zh)
Other versions
CN103021412B (en)
Inventor
何婷婷
胡郁
刘庆峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
iFlytek Co Ltd
Original Assignee
iFlytek Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by iFlytek Co Ltd
Priority to CN201210584746.2A (patent CN103021412B)
Publication of CN103021412A
Application granted
Publication of CN103021412B
Status: Active
Anticipated expiration

Abstract

An embodiment of the invention discloses a voice recognition method and system. The method includes: performing voice recognition on a voice signal input by a user to obtain a recognition result and the voice segment corresponding to each character in the result; receiving error-correction information separately input by the user and generating an error-correction string from it; determining, according to the error-correction string, the misrecognized voice segment in the input signal; determining the character string in the recognition result that corresponds to that voice segment, and taking it as the error string; and replacing the error string with the error-correction string. Because the misrecognized voice segment is determined from the error-correction string generated from information the user inputs separately, and the corresponding error string in the recognition result is then located from that segment, the error string is located in the recognition result automatically, and the problem of inconvenient manual positioning is solved.

Description

Voice recognition method and system
Technical field
The present invention relates to the field of speech recognition technology, and more particularly to a voice recognition method and system.
Background technology
Speech recognition technology recognizes a voice signal entered by a user and converts it into text (that is, the recognition result is a character string); it makes natural human-machine interaction convenient. Take a mobile device that adopts speech recognition as an example: with the support of speech recognition, the user only needs to speak to the device and the speech is automatically turned into text, which greatly improves input efficiency.
However, in applications with a large vocabulary and freely spoken input, speech recognition still cannot reach a fully correct recognition rate, and the recognition result must be revised manually. After the mobile device (speech recognition system) displays the recognition result in the text input area of the screen, a user who wants to edit the result must first locate, within the result, the characters that need revision.
On a mobile device, and especially on a small-screen finger-touch device, screen size is limited. When the user tries to pick out a particular character in a long, continuous passage of text, and in particular when inserting the editing cursor between two adjacent characters, positioning is inconvenient.
Summary of the invention
In view of this, embodiments of the invention aim to provide a voice recognition method and system that solve the above problem of inconvenient manual positioning.
To achieve the above object, embodiments of the invention provide the following technical solutions:
According to one aspect of the embodiments of the invention, a voice recognition method is provided, comprising:
performing speech recognition on a voice signal input by a user to obtain a first optimal decoding path, the first optimal decoding path comprising the recognition result and the voice segment corresponding to each character in the recognition result;
receiving error-correction information separately input by the user and generating a corresponding error-correction string, the error-correction information being input in a non-voice mode or a voice mode;
determining, according to the error-correction string, the voice segment of the input signal in which the recognition error occurred;
determining, according to the voice segments corresponding to the characters of the recognition result, the character string in the recognition result that corresponds to the misrecognized voice segment, and taking it as the error string; and
replacing the error string with the error-correction string.
According to another aspect of the embodiments of the invention, a speech recognition system is provided, comprising:
a voice recognition unit, configured to perform speech recognition on the voice signal input by the user and obtain a first optimal decoding path comprising the recognition result and the voice segment corresponding to each character in the recognition result;
an error-correction string input unit, configured to receive the error-correction information separately input by the user and generate the corresponding error-correction string, the error-correction information being input in a non-voice mode or a voice mode; and
an automatic error-correction unit, configured to determine, according to the error-correction string, the voice segment of the input signal in which the recognition error occurred; to determine, according to the voice segments corresponding to the characters of the recognition result, the character string in the recognition result that corresponds to that segment, taking it as the error string; and to replace the error string with the error-correction string.
As can be seen from the above technical solutions, the disclosed scheme determines the misrecognized voice segment from the error-correction string generated from the error-correction information the user inputs separately, and then uses that segment to find the corresponding error string in the recognition result. The error-correction string is thus matched to the error string, the error string is located in the recognition result automatically, and the problem of inconvenient manual positioning is solved.
Brief description of the drawings
To illustrate the embodiments of the invention or the prior art more clearly, the drawings needed in their description are briefly introduced below. Obviously, the drawings described below are only some embodiments of the invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of the voice recognition method provided by an embodiment of the invention;
Fig. 2 is a flowchart of handwriting-input recognition provided by an embodiment of the invention;
Fig. 3 is a schematic diagram of the minimum area covered by a character, provided by an embodiment of the invention;
Fig. 4 is a flowchart of the automatic error-correction process provided by an embodiment of the invention;
Fig. 5 is a schematic diagram of the structure of the error-correction string retrieval network provided by an embodiment of the invention;
Fig. 6 is a schematic diagram of the structure of the speech recognition system provided by an embodiment of the invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the invention are described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by those of ordinary skill in the art from these embodiments without creative effort fall within the protection scope of the invention.
As a simple, convenient and efficient input mode, speech recognition has changed traditional keyboard input based on complex coding or pinyin, providing convenient conditions for natural human-machine interaction. In recent years in particular, with the development of science and technology and the spread of wireless communication networks, various online speech recognition applications, such as posting to a microblog, creating messages, and network instant messaging, have received increasing attention. With the support of speech recognition, the user only needs to speak to the device and the speech is automatically turned into text, which greatly improves input efficiency.
However, in applications with a large vocabulary and freely spoken input, speech recognition still cannot reach a fully correct recognition rate, and the recognition result must be revised manually. After the mobile device (speech recognition system) displays the recognition result in the text input area of the screen, a user who wants to edit the result must first locate, within it, the characters that need revision.
On a mobile device, and especially on a small-screen finger-touch device, screen size is limited. When the user tries to pick out a particular character in a long, continuous passage of text, and in particular when inserting the editing cursor between two adjacent characters, positioning is imprecise.
For ease of understanding, speech recognition is first described as follows:
Let a voice signal to be recognized be denoted S. After a series of processing steps, S yields a corresponding acoustic feature sequence O, denoted O = {O1, O2, …, Oi, …, OT}, where Oi is the i-th acoustic feature and T is the total number of features. The sentence corresponding to the voice signal S can be regarded as a word string composed of several words, denoted W = {w1, w2, …, wn}. The task of speech recognition is to find, from the known feature sequence O, the most probable word string W′.
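Formally, this decision rule is conventionally written with Bayes' rule (a standard textbook formulation supplied here for clarity, not quoted from the patent):

```latex
W' = \arg\max_{W} P(W \mid O)
   = \arg\max_{W} \frac{P(O \mid W)\,P(W)}{P(O)}
   = \arg\max_{W} P(O \mid W)\,P(W)
```

Here $P(O \mid W)$ is supplied by the acoustic model and $P(W)$ by the language model; $P(O)$ does not depend on $W$ and can be dropped from the maximization.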
In the concrete process of speech recognition, the acoustic feature parameters corresponding to the voice signal are generally extracted first; then, in a search space constructed from a preset acoustic model and language model, a preset search algorithm (such as the Viterbi algorithm) searches for the path that is optimal with respect to the extracted feature parameters, i.e. the optimal decoding path.
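The Viterbi search mentioned above can be sketched in miniature. The following is an illustrative dynamic-programming decoder over a toy discrete HMM, not the patent's implementation; the state names, probabilities and observation symbols are invented for the example.

```python
import math

def viterbi(obs, states, log_start, log_trans, log_emit):
    """Return the most likely state path for an observation sequence."""
    # V[t][s]: best log-probability of any path ending in state s at time t
    V = [{s: log_start[s] + log_emit[s][obs[0]] for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        V.append({})
        back.append({})
        for s in states:
            # pick the predecessor state with the highest accumulated score
            prev, score = max(
                ((p, V[t - 1][p] + log_trans[p][s]) for p in states),
                key=lambda x: x[1],
            )
            V[t][s] = score + log_emit[s][obs[t]]
            back[t][s] = prev
    # trace back from the best final state to recover the optimal path
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))
```

A real decoder works over a large network of acoustic-model states with beam pruning of active nodes, as the text describes for step S14 below, but the accumulate-and-backtrace structure is the same.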
Having covered these basic concepts of speech recognition, the technical solutions of the embodiments of the invention are now described.
To solve the above problem of inconvenient positioning, the voice recognition method provided by the embodiments of the invention comprises at least the following steps:
Speech recognition process: perform speech recognition on the voice signal input by the user to obtain the optimal decoding path, which comprises the recognition result and the voice segment corresponding to each character in the recognition result;
Error-correction string input process: receive the error-correction information separately input by the user and generate the corresponding error-correction string; the error-correction information may be input in a non-voice mode or a voice mode;
Automatic error-correction process: determine, according to the error-correction string, the voice segment of the input signal in which the recognition error occurred; determine, according to the voice segments corresponding to the characters of the recognition result, the character string in the result that corresponds to that segment, taking it as the error string; and replace the error string with the error-correction string. For convenience, "error string" is used below as shorthand for "the character string in which the recognition error occurred".
Each process is introduced in turn below.
1. The speech recognition process
To satisfy users' everyday interaction needs as far as possible, the embodiment of the invention adopts large-vocabulary continuous speech recognition, so that freely spoken speech can be converted to text.
Referring to Fig. 1, the speech recognition process specifically comprises:
S11: track and collect the voice signal input by the user (the voice signal to be recognized mentioned above);
In other embodiments of the invention, the voice signal may be stored in a data buffer;
S12: pre-process the voice signal to obtain pre-processed speech data;
The pre-processing may comprise sampling, anti-aliasing band-pass filtering, framing, removal of noise introduced by individual pronunciation differences, devices and the environment, and endpoint detection. To improve the robustness of the speech recognition system, the pre-processing may also include front-end noise reduction, so that subsequent speech processing receives comparatively clean speech.
S13: perform feature extraction on each frame of the pre-processed speech data to obtain a feature vector sequence.
In step S13, effective speech features (feature vectors) are extracted from each frame of speech data. After feature extraction, each frame of speech data yields one feature vector, so the speech data can correspondingly be represented by a feature vector sequence.
Those skilled in the art will understand that if the pre-processed speech data comprises 30 frames, 30 feature vectors can be extracted from them, and these 30 feature vectors form the feature vector sequence in chronological order.
In other embodiments of the invention, the effective speech features may be linear prediction cepstral coefficients or MFCC (Mel-frequency cepstral) features. Taking MFCC features as an example, each frame of speech data, with a 25 ms window moved in 10 ms steps, is analyzed over the short term to obtain the MFCC parameters and/or their first- and second-order differences, 39 dimensions in total. Feature extraction on each frame thus yields one 39-dimensional feature vector.
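The 25 ms window / 10 ms shift framing and the 39-dimensional feature layout can be checked with a little arithmetic. This is an illustrative sketch (the 16 kHz sample rate in the example is an assumption, not stated in the patent):

```python
def frame_count(num_samples, sample_rate, win_ms=25, shift_ms=10):
    """Number of full analysis frames for a signal under a sliding window."""
    win = int(sample_rate * win_ms / 1000)      # samples per 25 ms window
    shift = int(sample_rate * shift_ms / 1000)  # samples per 10 ms shift
    if num_samples < win:
        return 0
    return 1 + (num_samples - win) // shift

# 13 static MFCCs plus first- and second-order differences -> 39 dimensions
FEATURE_DIM = 13 * 3
```

For one second of audio at an assumed 16 kHz, the window is 400 samples and the shift 160, giving 1 + (16000 − 400) // 160 = 98 frames, i.e. 98 feature vectors of 39 dimensions each.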
In other embodiments of the invention, the speech feature vector sequence may be stored in a feature buffer.
S14: perform an optimal-path search on the feature vector sequence in a pre-built retrieval network (constructed mainly from the system's preset acoustic model, dictionary, language model, and so on), to obtain the model string with the maximum model likelihood, which is output (displayed) as the recognition result.
In a specific implementation, the mainstream Viterbi search algorithm, based on dynamic programming, may be adopted: as each feature vector traverses the retrieval network, the accumulated history-path probability is computed at the active nodes that satisfy preset conditions, and the history paths that satisfy those conditions are kept as the active nodes of the subsequent search; finally, the path with the maximum history-path probability (the first optimal decoding path) is traced back to complete decoding of the input speech. Because the first optimal decoding path keeps the recognition-unit model corresponding to every frame of speech data during decoding, the voice segment corresponding to each character in the recognition result can be obtained, and so, of course, can the start-position and end-position information of each character's segment.
It should be noted that a "voice segment" as mentioned above may be a segment of the voice signal input by the user, or at least one frame of the pre-processed speech data, or a feature vector subsequence of the feature vector sequence. For convenience, the user's voice signal, the pre-processed speech data and the feature vector sequence are collectively referred to below as the voice signal to be recognized.
That is, the "voice signal to be recognized" mentioned below may specifically be the voice signal input by the user, the pre-processed speech data, or the feature vector sequence; and a "voice segment" mentioned below may specifically be a segment of the user's voice signal, at least one frame of speech data, or a feature vector subsequence.
In other words, the voice signal of step S11, the pre-processed speech data of step S12 or the feature vector sequence of step S13 can be divided into the voice segments corresponding to the characters of the recognition result, so that each character in the recognition result corresponds to a determinate voice segment.
By way of example, if the recognition result is the five-character string 我们去爬山 ("we go to climb the mountain"), the decoding-path information corresponding to this string may be saved as: (0000000 2200000), (2200000 3600000), (3600000 4300000), (4300000 5000000), (5000000 7400000).
Here (0000000 2200000) indicates the start-position and end-position information of the voice segment corresponding to the first character, 我 ("I"): 0000000 is the start position (moment) of that character's segment in the voice signal to be recognized, and 2200000 is its end position (moment).
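The character-to-segment bookkeeping described above can be sketched as a simple pairing of the recognition result with its (start, end) spans. This is an illustrative helper, not the patent's data structure; the timestamps are the example values from the text:

```python
def align_chars(text, spans):
    """Pair each character of the recognition result with its voice-segment span.

    Characters are keyed by position, since the same character may repeat.
    """
    assert len(text) == len(spans), "one span per character"
    return {i: (ch, spans[i]) for i, ch in enumerate(text)}

# Example from the text: five characters, five (start, end) moments
spans = [(0, 2200000), (2200000, 3600000), (3600000, 4300000),
         (4300000, 5000000), (5000000, 7400000)]
alignment = align_chars("我们去爬山", spans)
```

With this table, any time interval in the signal can be mapped back to the characters it covers, which is exactly what the automatic error-correction process below relies on.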
2. The error-correction string input process
The embodiments of the invention allow the user to input error-correction information in a non-voice mode or a voice mode, from which the error-correction string is generated.
When the error-correction information is input by voice, the input is itself a voice signal. Because this is the same input mode as ordinary speech recognition, the system may be unable to determine whether the current voice input continues the entry of new text or is a voice correction of the original text. A separate error-correction input control button can therefore be provided, to switch from voice input of new text to voice correction of the original text. In this mode, since the error-correction information is a voice signal, the process of converting it into an error-correction string is the same as the speech recognition process described above and is not repeated here; in addition, several candidate recognition strings can be presented for the user to choose from, to improve the accuracy of the generated error-correction string.
The embodiments of the invention also allow the user to input error-correction information in non-voice modes such as key input (for example pinyin input, stroke input, or region-position code input) or handwriting input. With key input the error-correction information is a keystroke sequence; with handwriting input it is the written handwriting.
Pinyin input and handwriting input are now taken as examples to introduce the non-voice input process.
The specific flow, still referring to Fig. 1, is:
S21: determine the user's input mode; for pinyin key input go to step S22, for handwriting input go to step S23.
S22: convert the keystroke sequence input by the user into candidate error-correction strings.
Step S22 may specifically comprise:
S221: track and collect the user's keystroke sequence and convert it into a letter-string sequence;
S222: match the collected letter-string sequence against a preset pinyin dictionary to find candidate error-correction strings, and display them.
For example, after the user inputs "qinghua", the system may display several candidate error-correction strings, such as 清华 (Tsinghua), 青花 (blue-and-white) and 亲华 (pro-China), for the user to choose from.
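The dictionary match in step S222 can be sketched as a lookup from a letter sequence to candidate strings. This is a toy illustration seeded with the "qinghua" example from the text; a real system would use a large weighted lexicon rather than a hard-coded dict:

```python
# Toy pinyin dictionary; entries are the example candidates from the text.
PINYIN_DICT = {
    "qinghua": ["清华", "青花", "亲华"],  # Tsinghua / blue-and-white / pro-China
}

def candidates(letter_sequence, limit=5):
    """Return up to `limit` candidate error-correction strings for a key sequence."""
    return PINYIN_DICT.get(letter_sequence, [])[:limit]
```

The `limit` default of 5 mirrors the 3-to-5 candidate count the text suggests further below.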
S23: recognize the handwriting input by the user and convert it into at least one candidate error-correction string;
Referring to Fig. 2, step S23 may specifically comprise:
S231: track the handwriting input by the user and keep the collected handwriting in a handwriting data buffer;
In an online handwriting recognition system, the user's handwriting is usually represented as a sequence of two-dimensional (position) or three-dimensional (position plus pen-up/pen-down state) point coordinates, which describe the spatial and temporal information of the writing.
S232: pre-process the handwriting.
Because of the collecting device, or user-side effects such as jitter while writing, the raw collected handwriting may contain various kinds of noise. To improve the robustness of the system, the collected handwriting can be pre-processed; specifically, character-size normalization, outlier removal, smoothing and resampling can be combined to reduce, as far as possible, the drop in recognition rate that noise interference brings.
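Of the pre-processing steps just listed, resampling is easy to sketch: redistribute a stroke's points evenly along its arc length so later feature extraction sees a uniform trajectory. This is an illustrative sketch, not the patent's algorithm:

```python
import math

def resample_stroke(points, n):
    """Resample a handwriting stroke to n points evenly spaced by arc length."""
    if len(points) < 2 or n < 2:
        return list(points)
    # cumulative arc length at each original point
    dists = [0.0]
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        dists.append(dists[-1] + math.hypot(x1 - x0, y1 - y0))
    total = dists[-1]
    out, j = [], 0
    for i in range(n):
        target = total * i / (n - 1)
        # advance to the original segment containing the target arc length
        while j < len(dists) - 2 and dists[j + 1] < target:
            j += 1
        seg = dists[j + 1] - dists[j]
        t = 0.0 if seg == 0 else (target - dists[j]) / seg
        (x0, y0), (x1, y1) = points[j], points[j + 1]
        out.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
    return out
```

Resampling by arc length rather than by time removes the speed variation of the writer's hand, which is one source of the noise the text mentions.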
S233: perform handwriting feature extraction on the pre-processed handwriting.
As in speech recognition, handwriting recognition also needs to extract, from the raw trajectory, character features that reflect the characteristics of the character.
Specifically, the present embodiment extracts the directional features commonly used in the handwriting recognition field, and improves the discriminability of the handwriting features with techniques such as LDA.
S234: match the extracted character features against preset models and compute similarities.
S235: choose the preset model or models with the highest similarity to the character features as the candidate error-correction strings, and display them.
Considering the accuracy of pinyin input and handwriting recognition technology, the number of candidate error-correction strings can usually be set to 3 to 5.
Of course, those skilled in the art will appreciate that when the user's non-voice input is long enough, there may be only one candidate error-correction string.
S25: determine the error-correction string from the candidate error-correction strings.
Step S25 may specifically comprise:
accepting the user's selection and determining the unique error-correction string from the at least one candidate error-correction string.
S25 can be listed separately, as a further confirmation of the error-correction string, so as to be compatible with both the voice and the non-voice input modes.
3. The automatic error-correction process
Considering that the error-correction string and the voice segment corresponding to the error string in the recognition result tend to be consistent, the core idea of the automatic error correction in the embodiments of the invention is: map the error-correction string onto a voice segment, then use that segment to find the corresponding words in the recognition result (that is, the error string), thereby matching the error-correction string to the error string. In this way the error string is located in the recognition result automatically, and the problem of inconvenient manual positioning is solved.
Specifically, the voice segment corresponding to the error-correction string is first found in the voice signal to be recognized. The character string corresponding to that segment is then located in the recognition result as the error string. The error string is a substring of the model string obtained in step S14, and the start and end times of its voice segment in the signal to be recognized are consistent with the start and end times of the segment corresponding to the error-correction string.
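The correspondence just described — pick out the characters of the recognition result whose segment times overlap the segment matched to the error-correction string — can be sketched as an interval-overlap scan. This is an illustrative sketch under the simplifying assumption that overlap alone decides membership; the spans in the test are the example timestamps from the text:

```python
def locate_error_string(char_spans, err_start, err_end):
    """Locate the error string via time overlap.

    char_spans: list of (start, end) spans, one per character of the
    recognition result, in order. (err_start, err_end) is the span decoded
    for the error-correction string. Returns the (first, last) character
    indices whose spans overlap it, or None if nothing overlaps.
    """
    hit = [i for i, (s, e) in enumerate(char_spans)
           if s < err_end and e > err_start]
    if not hit:
        return None
    return hit[0], hit[-1]
```

In practice the consistency check would tolerate small boundary differences between the two decodings rather than require exact overlap, but the interval logic is the same.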
The flow of the automatic error-correction process, still referring to Fig. 1, comprises:
S31: determine, according to the error-correction string, the voice segment of the signal to be recognized in which the recognition error occurred;
S32: determine, according to the voice segments corresponding to the characters of the recognition result, the character string in the recognition result of the first optimal decoding path that corresponds to the misrecognized segment, and take it as the error string;
S33: replace the error string with the error-correction string.
In other embodiments of the invention, step S33 may comprise the following steps:
when the number of error strings equals 1, directly replace that error string with the error-correction string generated from the error-correction information the user input;
when the number of error strings is greater than 1, replace the user-designated error string with the error-correction string.
Some embodiments of the invention accept the user's active participation in the selection, so the specific flow of "replacing the user-designated error string with the error-correction string" may comprise:
A: highlight all the error strings in the recognition result.
In other embodiments of the invention, besides highlighting all the error strings, the rest of the recognition result can be set to an inactive state, to improve positioning accuracy;
B: accept the user's selection and update the selected error string with the error-correction string.
In addition, other embodiments of the invention also support fuzzy selection by the user: the user is not required to position the error string precisely, but positions it in a nearest-neighbour manner, so that when the pen-down point of the stylus falls into the neighbour zone of an error string, that error string is selected automatically.
Specifically, the shortest distance from the pen-down point to the minimum area covered by each error string is computed, and the error string with the smallest such shortest distance is taken as the user's selection. For example, referring to Fig. 3, the height H of the minimum area covered by a character can be set to A times the character height h, and its width W to B times the character width w, where A and B are positive numbers greater than or equal to 1. The minimum area covered by an error string is then the union of the minimum areas covered by all the characters in that string.
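The nearest-neighbour selection just described can be sketched as a point-to-rectangle distance over the covering regions. This is an illustrative sketch; the rectangles and string names in the example are invented, and each character's rectangle is assumed to already include the A·h by B·w enlargement:

```python
import math

def rect_distance(px, py, rect):
    """Shortest distance from point (px, py) to an axis-aligned rectangle."""
    x0, y0, x1, y1 = rect
    dx = max(x0 - px, 0, px - x1)  # 0 when px lies within [x0, x1]
    dy = max(y0 - py, 0, py - y1)  # 0 when py lies within [y0, y1]
    return math.hypot(dx, dy)

def pick_error_string(pen_point, error_strings):
    """error_strings: {string: [rect, ...]}, one enlarged rect per character.
    Return the error string whose covering region is nearest the pen-down point."""
    px, py = pen_point
    return min(
        error_strings,
        key=lambda s: min(rect_distance(px, py, r) for r in error_strings[s]),
    )
```

Because the distance to a string's region is the minimum over its characters' rectangles, a pen-down point inside any character's enlarged rectangle gives distance 0 and that string wins immediately.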
Referring to Fig. 4, in other embodiments of the invention, the above step S31 may specifically comprise the following steps:
S311: construct an error-correction string retrieval network from the above error-correction string.
Referring to Fig. 5, the error-correction string retrieval network comprises an error-correction string model and a preset absorbing model.
The error-correction string model is built from the error-correction string: the string is expanded into a corresponding model sequence using a preset dictionary, which yields the error-correction string model. Because the error-correction string generated from each piece of error-correction information input by the user differs, the error-correction string model in the retrieval network must be updated in real time.
Accordingly, step S311 may further comprise:
obtaining the error-correction string model corresponding to the error-correction string;
obtaining the preset absorbing model;
generating the error-correction string retrieval network from the obtained error-correction string model and the absorbing model.
Note that if the speech recognition result contains multiple recognition errors that are non-adjacent and unrelated, for example both "Tsing-Hua University" and "western station" are misrecognized, the user needs to input error-correction information multiple times, by voice or by a non-voice means, generating multiple error-correction strings. The error-correction string generated from each piece of input, no matter how many characters it contains, is treated as one independent error-correction string. For example, if the user input 3 Chinese characters for a given error-correction string, that string comprises those 3 characters, and the 3-character string is subsequently expanded into the corresponding error-correction string model via the dictionary.
When expanding the error-correction string into an error-correction string model, different expansion schemes may be adopted depending on the preset acoustic model. The acoustic model may, for example, be based on syllable model units (in which case a single Chinese character consists of 1 syllable) or on phoneme model units (in which case a single Chinese character consists of 2 phonemes); the choice is determined by the model unit adopted during speech recognition. Accordingly, the above 3-character error-correction string may be expanded into an error-correction string model formed by concatenating 3 syllable model units, or into one formed by concatenating 6 phoneme model units.
As for the absorbing model, it is a background model trained in advance by the system on massive speech data; multiple absorbing models may also be adopted to improve matching accuracy for complex speech. Note that multiple independent absorbing models are connected in parallel.
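The construction of step S311 can be sketched as follows. This is a minimal sketch under stated assumptions, not the patent's implementation: the toy pronunciation dictionaries and unit names are invented, and the network is represented as a plain dictionary rather than a real decoding graph.

```python
# Expand an error-correction string into a model-unit sequence and place
# it in parallel with the preset absorbing (background) model(s).
SYLLABLE_DICT = {"清": ["qing1"], "华": ["hua2"]}       # 1 syllable unit per character
PHONEME_DICT = {"清": ["q", "ing"], "华": ["h", "ua"]}  # 2 phoneme units per character

def expand(correction, dictionary):
    """Concatenate each character's model units into one string model."""
    return [unit for ch in correction for unit in dictionary[ch]]

def build_retrieval_network(correction, dictionary, absorbing_models):
    """The string model is rebuilt per correction; absorbing models are shared."""
    return {
        "string_model": expand(correction, dictionary),  # updated in real time
        "absorbing": list(absorbing_models),             # independent models in parallel
    }
```

Under the syllable dictionary a 2-character string yields 2 units; under the phoneme dictionary it yields 4, mirroring the 3-character / 3-syllable / 6-phoneme example above.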
S312: decode the speech signal to be recognized again within the error-correction string retrieval network to obtain the second optimal decoding path.
The second optimal decoding path includes the speech segment corresponding to the error-correction string model, which serves as the speech segment in which the recognition error occurred.
Specifically, the speech segment corresponding to the error-correction string model may be a segment of the speech signal input by the user, at least one frame of preprocessed speech data, or a feature-vector subsequence of the feature vector sequence. For simplicity, the feature-vector subsequence corresponding to the error-correction string model may be selected as the speech segment in which the recognition error occurred. Step S312 may then specifically comprise:
searching the error-correction string retrieval network for the optimal path of the feature vector sequence (i.e., the second optimal path), and obtaining the start position and end position, within the whole feature vector sequence, of the feature-vector subsequence corresponding to the error-correction string model.
The decoding in step S312 is similar to that in step S14; the difference is that step S312 uses the retrieval network built from the error-correction string, whereas the retrieval network used in step S14 has a larger scope than the error-correction string retrieval network. The decoding in step S312 can therefore still adopt the mainstream Viterbi search algorithm based on dynamic programming: each frame's feature vector traverses the active nodes of the error-correction string retrieval network, history paths that satisfy the preset condition are retained as the active nodes of the network for the subsequent search, and finally the path with the maximum history path probability (i.e., the second optimal decoding path) yields the speech segment corresponding to the error-correction string model, thereby determining the speech segment in which the recognition error occurred.
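The frame-synchronous search just described can be sketched as follows. This is a toy illustration, not the patent's decoder: the states, transition scores, and emission scorer are invented, and a real decoder would score acoustic model units against feature vectors and track time boundaries.

```python
import math

def viterbi(frames, states, trans, emit_logprob, beam=10.0):
    """Frame-synchronous Viterbi with beam pruning.
    frames: list of observations; states: list of state ids (states[0] is start);
    trans: {state: [(next_state, log_prob)]}; emit_logprob(state, frame) -> log prob.
    Returns (best_log_prob, best_state_sequence)."""
    active = {states[0]: (0.0, [states[0]])}      # state -> (score, history path)
    for f in frames:
        nxt = {}
        for s, (score, hist) in active.items():
            for s2, tp in trans[s]:
                cand = score + tp + emit_logprob(s2, f)
                if s2 not in nxt or cand > nxt[s2][0]:
                    nxt[s2] = (cand, hist + [s2])
        best = max(v[0] for v in nxt.values())
        # Retain only history paths within the beam of the best as active nodes.
        active = {s: v for s, v in nxt.items() if v[0] >= best - beam}
    return max(active.values(), key=lambda v: v[0])
```

The surviving path with the maximum history-path probability plays the role of the second optimal decoding path; in the real network its passage through the error-correction string model marks the erroneous speech segment.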
Because step S312 obtains the start position (time) and end position (time) of the speech segment corresponding to the error-correction string model, the subsequent step S32 can, according to the speech snippet corresponding to each character in the recognition result, determine the starting character in the recognition result that corresponds to the start position of the erroneous speech segment, and likewise the terminating character that corresponds to its end position. Once the starting character and terminating character have been determined, the erroneous character string can be determined.
More specifically, the starting character can be determined as follows:
take the character corresponding to the start position as the first character, and the speech snippet corresponding to this first character as the first speech snippet;
if the start position lies in the front portion of the first speech snippet, take the first character as the starting character; otherwise, take the next character in the recognition result as the starting character.
The terminating character can be determined as follows:
take the character corresponding to the end position as the second character, and the speech snippet corresponding to the second character as the second speech snippet;
if the end position lies in the front portion of the second speech snippet, take the previous character in the recognition result as the terminating character; otherwise, take the second character as the terminating character.
Still taking the recognition result "we go to climb the mountain" as an example: as stated above, the start and end positions of the speech snippets corresponding to the characters in this recognition result are (0000000 2200000), (2200000 3600000), (3600000 4300000), (4300000 5000000), and (5000000 7400000), respectively.
By way of example, suppose that in step S312 the start position and end position of the speech segment producing the recognition error are (0000050 3600000). Since the start position 0000050 lies in the front portion of (0000000 2200000), its character ("I") is determined as the starting character; since the end position 3600000 lies in the rear portion of (2200000 3600000), the character corresponding to that snippet is determined as the terminating character. Hence "we" is the erroneous character string in the recognition result that corresponds to the erroneous speech segment.
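The start/end character rule and this worked example can be sketched as follows. One assumption is made: the boundary between the "front" and "rear" portion of a snippet is taken to be its midpoint, which the text does not fix; the snippet boundaries are those of the example.

```python
# Map erroneous-segment boundaries to a span of character indices in the
# recognition result, using the per-character snippet boundaries above.
snippets = [(0, 2200000), (2200000, 3600000), (3600000, 4300000),
            (4300000, 5000000), (5000000, 7400000)]

def char_for(pos):
    """Index of the character whose snippet contains pos."""
    for i, (s, e) in enumerate(snippets):
        if s <= pos < e or (pos == e and i == len(snippets) - 1):
            return i
    raise ValueError(pos)

def in_front(pos, snippet):
    s, e = snippet
    return pos < (s + e) / 2  # front portion = first half (assumed threshold)

def error_char_span(start_pos, end_pos):
    i = char_for(start_pos)
    if not in_front(start_pos, snippets[i]):
        i += 1                # start fell in the rear half: begin at the next character
    j = char_for(end_pos)
    if in_front(end_pos, snippets[j]):
        j -= 1                # end fell in the front half: stop at the previous character
    return i, j
```

With the example boundaries (0000050, 3600000), the span covers character indices 0 through 1, i.e., the two characters rendered as "we".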
Corresponding to the above method, an embodiment of the invention further provides a speech recognition system. Fig. 6 shows one structure of the system, comprising:
a speech recognition unit 1, configured to perform speech recognition on the speech signal input by the user and obtain the optimal decoding path, where the optimal decoding path comprises the speech recognition result and the speech snippet corresponding to each character in the recognition result;
more specifically, the speech recognition unit may comprise a processor, which performs the speech recognition on the speech signal input by the user;
an error-correction string generating unit 2, configured to receive the error-correction information independently input by the user and generate the corresponding error-correction string;
more specifically, if the error-correction information is input by voice, the error-correction string generating unit may likewise comprise the above processor, which performs speech recognition on the error-correction information and generates the error-correction string;
if the error-correction information is input by keystrokes, the error-correction string generating unit may comprise at least a keyboard and a processor; the processor converts the keystroke sequence input by the user into candidate error-correction strings and accepts the user's selection to determine the unique error-correction string from the at least one candidate. Alternatively, another independent chip or processor may convert the keystroke sequence into candidate error-correction strings and accept the user's selection to determine the unique error-correction string;
if the error-correction information is input by handwriting, the error-correction string generating unit may comprise at least a stylus, a touch screen, and a processor; the processor converts the handwriting input by the user into candidate error-correction strings and accepts the user's selection to determine the unique error-correction string from the at least one candidate. Again, another independent chip or processor may perform the conversion and accept the selection;
of course, to allow the user to input error-correction information in multiple ways, the error-correction string generating unit may also comprise several of the above devices simultaneously;
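The keystroke path through unit 2 can be sketched as follows. This is a hypothetical illustration: the tiny candidate table and the pinyin-style keystroke keys are invented for the example, and a real input method would generate candidates from a full lexicon.

```python
# Convert a keystroke sequence into candidate error-correction strings, then
# let the user's selection determine the unique error-correction string.
CANDIDATES = {"qinghua": ["清华", "青花"], "xizhan": ["西站"]}

def candidates_for(keystrokes):
    """All candidate strings for a keystroke sequence (empty if unknown)."""
    return CANDIDATES.get(keystrokes, [])

def choose(keystrokes, user_choice_index):
    """Return the unique error-correction string selected by the user."""
    return candidates_for(keystrokes)[user_choice_index]
```

The handwriting path is analogous, with a handwriting recognizer producing the candidate list instead of a keystroke table.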
an automatic error-correction unit 3, configured to determine, according to the error-correction string, the speech segment of the user's input speech signal in which the recognition error occurred; to determine, according to the speech snippet corresponding to each character in the recognition result, the character string in the recognition result that corresponds to the erroneous speech segment, as the erroneous character string; and to replace the erroneous character string with the error-correction string.
More specifically, the functions of the automatic error-correction unit 3 may also be implemented by the above processor or by another independent chip or processor.
More detailed functions of the above units are described in the foregoing method embodiments and are not repeated here.
Those of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of each example have been described above generally in terms of function. Whether these functions are executed in hardware or software depends on the particular application and the design constraints of the technical solution. Skilled artisans may implement the described functions in different ways for each particular application, but such implementations should not be considered beyond the scope of the present invention.
Those skilled in the art can clearly understand that, for convenience and brevity of description, the specific working processes of the devices and units described above may refer to the corresponding processes in the foregoing method embodiments and are not repeated here.
In the several embodiments provided in this application, it should be understood that the disclosed apparatus and methods may be implemented in other ways. For example, the apparatus embodiments described above are merely schematic; the division of the units is only a logical functional division, and other divisions are possible in actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, devices, or units, and may be electrical, mechanical, or in other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over multiple units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present invention may be integrated in one processing unit, may exist physically separately, or two or more units may be integrated in one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on such an understanding, the technical solution of the present invention, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods described in the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash disk, a portable hard drive, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
The above description of the disclosed embodiments enables those skilled in the art to implement or use the present invention. Various modifications to these embodiments will be apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the present invention. Therefore, the present invention is not limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

CN201210584746.2A | 2012-12-28 | 2012-12-28 | Voice recognition method and system | Active | CN103021412B (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN201210584746.2A (CN103021412B) | 2012-12-28 | 2012-12-28 | Voice recognition method and system


Publications (2)

Publication Number | Publication Date
CN103021412A | 2013-04-03
CN103021412B | 2014-12-10

Family

ID=47969943

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN201210584746.2A (Active, CN103021412B) | Voice recognition method and system | 2012-12-28 | 2012-12-28

Country Status (1)

Country | Link
CN (1) | CN103021412B (en)


Citations (4)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN1282072A * | 1999-07-27 | 2001-01-31 | International Business Machines Corp. | Error correcting method for voice identification result and voice identification system
WO2007053294A1 * | 2005-10-28 | 2007-05-10 | Microsoft Corporation | Combined speech and alternate input modality to a mobile device
CN1979638A * | 2005-12-02 | 2007-06-13 | Institute of Automation, Chinese Academy of Sciences | Method for correcting error of voice identification result
CN101295293A * | 2007-04-29 | 2008-10-29 | Motorola Inc. | Automatic error correction method for input character string of ideographic character

Cited By (29)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN105374356A * | 2014-08-29 | 2016-03-02 | Ricoh Co., Ltd. | Speech recognition method, speech assessment method, speech recognition system, and speech assessment system
CN105374356B * | 2014-08-29 | 2019-07-30 | Ricoh Co., Ltd. | Audio recognition method, speech assessment method, speech recognition system and speech assessment system
CN105469801B * | 2014-09-11 | 2019-07-12 | Alibaba Group Holding Ltd. | Method and device for repairing input voice
CN105469801A * | 2014-09-11 | 2016-04-06 | Alibaba Group Holding Ltd. | Input speech restoring method and device
CN105786438A * | 2014-12-25 | 2016-07-20 | Lenovo (Beijing) Co., Ltd. | Electronic system
CN105786204A * | 2014-12-26 | 2016-07-20 | Lenovo (Beijing) Co., Ltd. | Information processing method and electronic equipment
CN105912558A * | 2015-02-24 | 2016-08-31 | Casio Computer Co., Ltd. | Voice retrieval apparatus, and voice retrieval method
CN105182763A * | 2015-08-11 | 2015-12-23 | Sun Yat-sen University | Intelligent remote controller based on voice recognition and realization method thereof
CN105206260B * | 2015-08-31 | 2016-09-28 | Nubia Technology Co., Ltd. | Terminal speech broadcasting method and device, and terminal speech operation method
CN106782546A * | 2015-11-17 | 2017-05-31 | Shenzhen Raisound Technology Co., Ltd. | Speech recognition method and device
CN105679319B * | 2015-12-29 | 2019-09-03 | Baidu Online Network Technology (Beijing) Co., Ltd. | Voice recognition processing method and device
CN105679319A * | 2015-12-29 | 2016-06-15 | Baidu Online Network Technology (Beijing) Co., Ltd. | Speech recognition processing method and device
WO2018018867A1 * | 2016-07-26 | 2018-02-01 | Baidu Online Network Technology (Beijing) Co., Ltd. | Method and device for correcting error in speech recognition result
CN106328145B * | 2016-08-19 | 2019-10-11 | Beijing Unisound Information Technology Co., Ltd. | Voice modification method and device
CN106328145A * | 2016-08-19 | 2017-01-11 | Beijing Unisound Information Technology Co., Ltd. | Voice correction method and voice correction device
US10699696B2 | 2017-05-23 | 2020-06-30 | Beijing Baidu Netcom Science And Technology Co., Ltd. | Method and apparatus for correcting speech recognition error based on artificial intelligence, and storage medium
CN107220235A * | 2017-05-23 | 2017-09-29 | Beijing Baidu Netcom Science and Technology Co., Ltd. | Speech recognition error correction method, device and storage medium based on artificial intelligence
CN107220235B * | 2017-05-23 | 2021-01-22 | Beijing Baidu Netcom Science and Technology Co., Ltd. | Speech recognition error correction method and device based on artificial intelligence and storage medium
CN108182001A * | 2017-12-28 | 2018-06-19 | iFlytek Co., Ltd. | Input error correction method and device, storage medium and electronic equipment
CN110647785B * | 2018-06-27 | 2022-09-23 | Zhuhai Kingsoft Office Software Co., Ltd. | Method, device and electronic device for identifying the accuracy of input text
CN110647785A * | 2018-06-27 | 2020-01-03 | Zhuhai Kingsoft Office Software Co., Ltd. | Method and device for identifying accuracy of input text and electronic equipment
CN112272847A * | 2019-05-08 | 2021-01-26 | Interactive Solutions Corp. | Error conversion dictionary making system
CN112272847B * | 2019-05-08 | 2022-02-11 | Interactive Solutions Corp. | Error conversion dictionary making system and speech recognition system
US12243512B2 | 2019-05-08 | 2025-03-04 | Interactive Solutions Corp. | Erroneous conversion dictionary creation system
CN110764647A * | 2019-10-21 | 2020-02-07 | iFlytek Co., Ltd. | Input error correction method, input error correction device, electronic equipment and storage medium
CN110764647B * | 2019-10-21 | 2023-10-31 | iFlytek Co., Ltd. | Input error correction method, input error correction device, electronic equipment and storage medium
CN112820276A * | 2020-12-21 | 2021-05-18 | Beijing Jietong Huasheng Technology Co., Ltd. (SinoVoice) | Voice processing method and device, computer readable storage medium and processor
CN112820276B * | 2020-12-21 | 2023-05-16 | Beijing Jietong Huasheng Technology Co., Ltd. (SinoVoice) | Speech processing method, device, computer readable storage medium and processor
CN112669825A * | 2020-12-24 | 2021-04-16 | Hangzhou Zhongke Advanced Technology Research Institute Co., Ltd. | Speech recognition system and method automatically trained through speech synthesis method

Also Published As

Publication number | Publication date
CN103021412B (en) | 2014-12-10

Similar Documents

Publication | Publication Date | Title
CN103021412B (en) | Voice recognition method and system
CN103000176B (en) | Speech recognition method and system
CN108305634B (en) | Decoding method, decoder and storage medium
US10134388B1 (en) | Word generation for speech recognition
CN101223572B (en) | System, program, and control method for speech synthesis
CN101847405B (en) | Voice recognition device and voice recognition method, language model generating device and language model generating method
TWI266280B (en) | Multimodal disambiguation of speech recognition
US9529898B2 (en) | Clustering classes in language modeling
US6415258B1 (en) | Background audio recovery system
US11093110B1 (en) | Messaging feedback mechanism
JP5098613B2 (en) | Speech recognition apparatus and computer program
KR20170063037A (en) | Apparatus and method for speech recognition
CN113327597B (en) | Speech recognition method, medium, device and computing equipment
US20120016671A1 (en) | Tool and method for enhanced human machine collaboration for rapid and accurate transcriptions
KR20170106951A (en) | Method and device for performing speech recognition using a grammar model
CN101415259A (en) | System and method for searching information of embedded equipment based on double-language voice enquiry
CN103645876A (en) | Voice inputting method and device
JP5753769B2 (en) | Voice data retrieval system and program therefor
CN102063900A (en) | Speech recognition method and system for overcoming confusing pronunciation
CN110010136A (en) | The training and text analyzing method, apparatus, medium and equipment of prosody prediction model
JP2016062069A (en) | Speech recognition method and speech recognition apparatus
US11900072B1 (en) | Quick lookup for speech translation
CN116320607A (en) | Intelligent video generation method, device, equipment and medium
JP2024050983A (en) | Multilingual rescoring models for automatic speech recognition
CN103903618A (en) | Voice input method and electronic device

Legal Events

Date | Code | Title | Description
C06 | Publication
PB01 | Publication
C10 | Entry into substantive examination
SE01 | Entry into force of request for substantive examination
C14 | Grant of patent or utility model
GR01 | Patent grant
CP03 | Change of name, title or address
CP03 | Change of name, title or address

Address after: 230031, 666 Wangjiang West Road, Hefei High-tech Zone, Anhui

Patentee after: iFlytek Co., Ltd.

Address before: 230088, No. 616, Mount Huangshan Road, High-tech Development Zone, Hefei, Anhui

Patentee before: Anhui USTC iFlytek Co., Ltd.


[8]ページ先頭

©2009-2025 Movatter.jp