Embodiment
The present embodiment proposes a method of calculating an utterance verification threshold. Once the recognition target is determined, a suggested threshold can be obtained according to an expected utterance verification result, without collecting additional corpus data or training additional models.
Please refer to Fig. 3. When the recognition target is defined as a command set 310, an automatic analysis tool 320 obtains a suggested threshold from a preset analysis condition, in a fully automatic manner and without manual offline processing. This embodiment does not obtain recognition results through speech recognition in a new environment, analyze them, and then update a previously preset utterance verification threshold. Instead, before the speech recognition system is put into use, the utterance verification result is tuned for the specific recognition target, and a suggested threshold is obtained dynamically and output to the utterance verifier, which uses it to make its decision and produce a verification result.
For integrated circuit (IC) design companies, the method of the present embodiment makes a speech recognition solution more complete: downstream manufacturers can develop speech recognition products rapidly without worrying about corpus collection. This considerably helps the spread of speech recognition technology.
The idea of this embodiment is to predict the utterance verification threshold for the current recognition target before speech recognition and utterance verification start running. The prior art first uses a preset threshold and then updates it with corpus data collected while the speech recognition system and the utterance verification module are running, which differs greatly from the implementation of the present application. Moreover, the present application does not analyze any data collected while the speech recognition and utterance verification system is running; it uses only pre-existing speech data, such as the training corpus of the speech recognition system or of the utterance verification system. The innovation proposed by the present application is that the utterance verification threshold can be determined after the recognition vocabulary is fixed and before the speech recognition system or the utterance verification module runs, with no additional data collection. This framework is clearly different from the prior art.
Please refer to Fig. 4A, which is a block diagram of a speech recognition system according to an embodiment of the invention. The speech recognition system 400 comprises a speech recognizer 410, a recognition target storage unit 420, an utterance verification threshold generator 430 and an utterance verifier 440. The input speech signal is sent to the speech recognizer 410 and the utterance verifier 440. The recognition target storage unit 420 stores the various recognition targets and outputs them to the speech recognizer 410 and the utterance verification threshold generator 430.
The speech recognizer 410 performs recognition according to the received speech signal and the recognition target 422, and outputs a recognition result 412 to the utterance verifier 440. Meanwhile, the utterance verification threshold generator 430 produces a corresponding threshold 432 for the recognition target 422 and outputs it to the utterance verifier 440. The utterance verifier 440 then checks the recognition result 412 against the threshold 432 to verify whether the recognition result 412 is correct, that is, whether its verification score is higher than the generated threshold 432.
This embodiment proposes the utterance verification threshold generator 430. As shown in the figure, the recognition target of the speech recognizer 410 is a preset vocabulary (for example, N Chinese phrases), which can be read through the recognition target storage unit 420. After the speech signal passes through the recognizer, the recognition result is sent to the utterance verifier 440.
On the other hand, the recognition target is also input to the utterance verification threshold generator 430. Given an expected utterance verification result, such as a 10% false rejection rate, a suggested threshold θ_UV can be obtained.
In one implementation example, the utterance verification threshold generator 430 may use the statistically common hypothesis testing method to calculate the utterance verification score, but the embodiment is not limited thereto.
Each speech unit has a forward model and a reverse model (denoted by H0 and H1 respectively). After the recognition result is converted into a speech unit sequence, the corresponding forward and reverse models are used to calculate a forward and a reverse verification score for each unit. These are summed separately to obtain the forward verification score (H0 score) and the reverse verification score (H1 score), from which the utterance verification score (UV score) is obtained as follows:
$$\text{UV score} = \frac{\text{H0 score} - \text{H1 score}}{T}$$
where T is the total number of frames of the speech signal.
Finally, the UV score is compared with the threshold θ_UV. If the UV score is greater than θ_UV, verification succeeds and the recognition result is output.
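The scoring and decision steps described above can be sketched as follows. This is a minimal illustration that assumes the per-unit forward (H0) scores, reverse (H1) scores and frame counts are already available from forced alignment; the `UnitScore` type, the function names and all numeric values are hypothetical and are not part of the embodiment.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class UnitScore:
    """Scores of one aligned speech unit (hypothetical container)."""
    h0: float    # forward verification score of the unit
    h1: float    # reverse verification score of the unit
    frames: int  # number of frames aligned to the unit


def uv_score(units: Sequence[UnitScore]) -> float:
    """Sum H0 and H1 over all units, subtract, and divide by total frames T."""
    total_frames = sum(u.frames for u in units)
    if total_frames <= 0:
        raise ValueError("the utterance must contain at least one frame")
    h0_total = sum(u.h0 for u in units)
    h1_total = sum(u.h1 for u in units)
    return (h0_total - h1_total) / total_frames


def verify(units: Sequence[UnitScore], threshold: float) -> bool:
    """Accept the recognition result only if the UV score exceeds the threshold."""
    return uv_score(units) > threshold


# Invented example values: (-130.0 - (-157.0)) / 30 = 0.9
units = [UnitScore(-50.0, -62.0, 10), UnitScore(-80.0, -95.0, 20)]
print(uv_score(units), verify(units, threshold=0.5))
```

Normalizing by the frame count keeps scores of short and long utterances comparable, which is what allows a single threshold to be applied to the whole vocabulary.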
For the above embodiment, please refer to Fig. 4B, which is a schematic diagram of the hypothesis testing performed by the utterance verifier 440 on the phrase "last". There are eight frame segments in total, t1, t2 to t8, which can be divided into eight different hypothesis testing regions. The speech signal is aligned to these frame segments by forced alignment and cut into the corresponding speech units "sil" (silence, no sound), "ㄑ", "ㄧ", "ㄢ", "null", "ㄧ", "ㄒ", "ㄧㄤ" and "sil". A forward and a reverse verification score are calculated for each speech unit, for example the illustrated H0_sil and H1_sil, H0_ㄑ and H1_ㄑ, H0_ㄧ and H1_ㄧ, H0_ㄢ and H1_ㄢ, H0_null and H1_null, H0_ㄧ and H1_ㄧ, H0_ㄒ and H1_ㄒ, H0_ㄧㄤ and H1_ㄧㄤ, and H0_sil and H1_sil.
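Finally, the scores are summed separately to obtain the forward verification score (H0 score) and the reverse verification score (H1 score), and the utterance verification score (UV score) is obtained:
$$\text{UV score} = \frac{\text{H0 score} - \text{H1 score}}{T}$$
where T is the total number of frames of the speech signal.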
In one embodiment, the utterance verification threshold generator described above is as shown in the block diagram of Fig. 5.
The utterance verification threshold generator 500 comprises a processing-target-to-speech-unit processor 520, a target score generator 540 and a threshold determiner 550. It also comprises a value calculation module 530, which produces values and supplies them to the target score generator 540. In one embodiment, the value calculation module 530 may comprise a speech unit verification module 532 and a speech database 534. The speech database 534 stores a pre-existing corpus; it may be a database with a built-in training corpus, or a storage medium into which the user inputs the relevant training corpus. The stored data may include raw audio files, speech feature parameters and the like. The speech unit verification module 532 calculates the verification score of each speech unit from the speech database 534 and supplies it to the target score generator 540 in the form of one or more values.
The target score generator 540 receives a speech unit sequence and, from the value calculation module 530, one or more values for each speech unit in that sequence. It combines them into a value distribution corresponding to the speech unit sequence and supplies this distribution to the threshold determiner 550.
The threshold determiner 550 produces a suggested threshold as output according to an expected utterance verification result 560 and the received value distribution of the speech unit sequence. In one embodiment, for example, a 10% false rejection rate is given. The threshold determiner 550 then finds, in the value distribution, the point that satisfies the condition defined by the expected utterance verification result, and outputs the corresponding value as the suggested threshold.
The value calculation module 530 collects a plurality of score samples for each speech unit. For example, for a speech unit pho_i there are X score samples, and the corresponding values are stored. The hypothesis testing method adopted in the preceding embodiment is again used here as the preferred embodiment, but the embodiment is not limited thereto.
For the speech unit pho_i, a forward and a reverse verification score (denoted H0score and H1score respectively) are stored for each of the different samples, together with the frame length of the sample.
Here H0score_{pho_i, sample1} is the first forward score sample of pho_i, H1score_{pho_i, sample1} is the first reverse score sample of pho_i, and T_{pho_i, sample1} is the frame length of the first sample of pho_i.
After the utterance verification threshold generator 500 receives the recognition target (assume W Chinese words), the processing-target-to-speech-unit processor 520 applies Chinese letter-to-phone conversion to every word, converting it to a speech unit sequence Seq_i = {pho_1, ..., pho_k}, where i denotes the i-th Chinese word and k is the number of speech units in that word.
The resulting speech unit sequence is then input to the target score generator 540.
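In the target score generator 540, for each unit in the speech unit sequence, the scores of the corresponding forward and reverse models are taken from the value calculation module 530 according to a selection method (for example, random selection) and combined into one score sample x as follows:
$$x = \frac{\left(\text{H0score}_{pho_1,\,sampleN} + \dots + \text{H0score}_{pho_k,\,sampleM}\right) - \left(\text{H1score}_{pho_1,\,sampleN} + \dots + \text{H1score}_{pho_k,\,sampleM}\right)}{T_{pho_1,\,sampleN} + \dots + T_{pho_k,\,sampleM}}$$
where H0score_{pho_1, sampleN} and H1score_{pho_1, sampleN} are the N-th H0 and H1 score samples selected in the value calculation module 530 for the first speech unit (pho_1). Likewise, H0score_{pho_k, sampleM} and H1score_{pho_k, sampleM} are the M-th H0 and H1 score samples selected in the statistics database for the k-th speech unit (pho_k), and each T term is the frame length of the corresponding selected sample.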
P utterance verification score (UV score) samples {x_1, x_2, ..., x_P} are produced for each Chinese word, forming the score sample set of that word. The score samples of all words are then pooled into the score set of the whole recognition target, which is input to the threshold determiner 550.
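The sampling procedure just described can be sketched as follows. This is only an illustration under stated assumptions: the per-unit pools are a plain dictionary with invented values, the unit labels "a" and "b" are placeholders rather than real speech units, and `simulate_sample` and `simulate_word` are hypothetical helper names.

```python
import random
from typing import Dict, List, Sequence, Tuple

# Hypothetical pool: unit label -> list of (H0 score, H1 score, frame length).
ScorePool = Dict[str, List[Tuple[float, float, int]]]

pool: ScorePool = {
    "a": [(-40.0, -48.0, 8), (-35.0, -45.0, 10)],
    "b": [(-60.0, -66.0, 12), (-55.0, -70.0, 15)],
}


def simulate_sample(sequence: Sequence[str], pool: ScorePool,
                    rng: random.Random) -> float:
    """Draw one stored sample per unit at random and combine them into one x."""
    h0_sum = h1_sum = 0.0
    frame_sum = 0
    for unit in sequence:
        h0, h1, frames = rng.choice(pool[unit])
        h0_sum += h0
        h1_sum += h1
        frame_sum += frames
    return (h0_sum - h1_sum) / frame_sum


def simulate_word(sequence: Sequence[str], pool: ScorePool,
                  p: int, seed: int = 0) -> List[float]:
    """Produce P simulated UV score samples for one word."""
    rng = random.Random(seed)
    return [simulate_sample(sequence, pool, rng) for _ in range(p)]


samples = simulate_word(["a", "b"], pool, p=100)
print(min(samples), max(samples))
```

Pooling the lists returned by `simulate_word` for every word in the vocabulary gives the score set of the whole recognition target.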
In the threshold determiner 550, the score set of the whole recognition target is compiled into a histogram and converted to a cumulative probability distribution, from which a suitable threshold θ_UV can be found. For example, the threshold at which the cumulative probability equals 0.1 is output.
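A minimal sketch of this histogram-based lookup is given below. The bin count, the choice of the upper bin edge as the returned value, and the function name `suggest_threshold` are assumptions made for illustration; the embodiment does not specify them.

```python
from typing import Sequence


def suggest_threshold(scores: Sequence[float], false_rejection_rate: float,
                      bins: int = 100) -> float:
    """Return the score at which the cumulative probability first reaches the target rate."""
    if not scores:
        raise ValueError("scores must not be empty")
    lo, hi = min(scores), max(scores)
    if lo == hi:
        return lo
    width = (hi - lo) / bins
    counts = [0] * bins
    for s in scores:
        # The maximum score falls on the last edge, so clamp it into the last bin.
        index = min(int((s - lo) / width), bins - 1)
        counts[index] += 1
    cumulative = 0
    for i, count in enumerate(counts):
        cumulative += count
        if cumulative / len(scores) >= false_rejection_rate:
            return lo + (i + 1) * width  # upper edge of the current bin
    return hi


# Invented, evenly spread scores from 0.00 to 9.99.
scores = [i / 100 for i in range(1000)]
theta = suggest_threshold(scores, false_rejection_rate=0.1)
print(theta)
```

Because an utterance is accepted only when its score exceeds the threshold, the fraction of in-vocabulary scores at or below the returned value approximates the requested false rejection rate, up to the resolution of the bins.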
In the above embodiment, the value calculation module 530 is implemented with the speech unit verification module 532 and the speech database 534; this is an implementation example in which the values are computed in real time. However, the value calculation module 530 may adopt any technique capable of performing utterance verification, all of which fall within the scope of this embodiment. Examples are the content disclosed in Taiwan Patent Publication No. 200421261, "word verification method and system", and the techniques described in Hui Jiang, "Confidence measures for speech recognition: A survey", Speech Communication, 2005. In another embodiment, a speech unit score database may be used, which directly outputs the corresponding values according to the selection, but the embodiment is not limited thereto. The values stored in the speech unit score database are produced by receiving pre-existing speech data, segmenting it, and generating the corresponding scores with a speech unit score generator; the scores are then stored in the speech unit score database. This embodiment is described below.
Please refer to Fig. 6A and Fig. 6B, which illustrate implementation examples of the value calculation module. Fig. 6A is a block diagram of an implementation example of the value calculation module, and Fig. 6B is a schematic diagram of generating a value. The value calculation module 600 comprises a segmentation processor 610 and a speech unit score generator 620, and outputs the processed data to a speech unit score statistics database 650.
The speech data 602, which serves as the training corpus, can be obtained from an existing speech database. For example, the 500-People TRSC (Telephone Read Speech Corpus) speech database or the Shanghai Mandarin ELDA FDB 1000 speech database is one possible source.
With this framework, once the recognition target is determined, a suggested threshold can be obtained according to the expected utterance verification result, without collecting additional corpus data or training additional models. This embodiment does not need to obtain recognition results through speech recognition in a new environment, analyze them, and then update a previously preset threshold. Before the speech recognition system is put into use, the utterance verification result is tuned for the specific recognition target, and a suggested threshold is obtained dynamically and output to the utterance verifier, which uses it to make its decision and produce a verification result. For integrated circuit (IC) design companies, the method of the present embodiment makes a speech recognition solution more complete: downstream manufacturers can develop speech recognition products rapidly without worrying about corpus collection. This considerably helps the spread of speech recognition technology.
In this method, the speech data 602 is first cut into individual speech units by the segmentation processor 610. In one embodiment, the segmentation model 630 used is the same as the model used for forced alignment in the utterance verifier.
Next, each speech unit is processed by the speech unit score generator 620 to obtain the corresponding result. The scores of the speech unit score generator 620 are computed with a set of utterance verification models 640, which are the same as the utterance verification models used in the recognition system. The composition of the speech unit score may take different forms depending on the utterance verification method used in the speech recognition system. For example, in one embodiment, when hypothesis testing is used as the utterance verification method, the speech unit score consists of a forward score calculated for the unit with the forward model of the speech unit, and a reverse score calculated for the unit with the reverse model of the speech unit. In one embodiment, for each speech unit, the forward and reverse scores of all corresponding segments in the corpus are stored in the speech unit score statistics database 650 together with the unit lengths; this may be called the first implementation type. In another embodiment, for each speech unit, only statistics (for example, the mean and variance) are stored in the speech unit score statistics database 650, namely the statistics of the difference of the two scores divided by the length, and the statistics of the length itself; this is the second implementation type.
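The difference between the first and second implementation types can be illustrated with a short sketch. The record layout, the dictionary keys and the numbers are invented for illustration, and the population variance is an arbitrary choice because the text does not say which variance estimator is used.

```python
from statistics import mean, pvariance
from typing import Dict, List, Tuple

# First implementation type (hypothetical layout): raw records per speech unit,
# each record being (H0 score, H1 score, unit length in frames).
raw_db: Dict[str, List[Tuple[float, float, int]]] = {
    "a": [(-40.0, -48.0, 8), (-35.0, -45.0, 10), (-30.0, -48.0, 12)],
}


def summarise(records: List[Tuple[float, float, int]]) -> Dict[str, float]:
    """Second implementation type: keep only statistics of (H0 - H1) / length and of length."""
    normalised = [(h0 - h1) / length for h0, h1, length in records]
    lengths = [length for _, _, length in records]
    return {
        "score_mean": mean(normalised),
        "score_var": pvariance(normalised),
        "len_mean": mean(lengths),
        "len_var": pvariance(lengths),
    }


stats_db = {unit: summarise(records) for unit, records in raw_db.items()}
print(stats_db["a"])
```

The first type keeps every record and therefore supports random sampling later; the second type is far more compact but only supports calculations on means and variances.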
Depending on the utterance verification method, the speech unit score may instead comprise a forward score calculated for the speech unit with the forward model of that speech unit, and a plurality of forward competing scores calculated with the forward model of that speech unit for all other units in the training corpus. For each unit, the forward scores of all corresponding segments in the corpus and all of their corresponding forward competing scores may be stored in the speech unit score statistics database 650 together with the unit lengths; this may be called the third implementation type, in which either all of the corresponding forward competing scores or only a subset of them may be stored. Alternatively, only statistics may be stored in the speech unit score statistics database 650: the forward competing scores are first combined by a mathematical operation (such as the arithmetic mean or the geometric mean), the result is subtracted from the forward score and divided by the length, and the statistics (such as the mean and variance) of this value and of the length are stored; this may be called the fourth implementation type.
The operation of the target score generator 540 in Fig. 5 differs according to what is stored in the speech unit score statistics database 650. When the database stores data of the first or third implementation type, score samples are assembled by random selection from the database according to the content of the speech unit sequence, forming the score distribution of the sequence. When it stores data of the second or fourth implementation type, the means and variances in the database are combined directly by calculation according to the content of the unit sequence, giving the mean and variance of the score distribution of the speech unit sequence.
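The text does not state how the stored means and variances are combined, so the following is only one possible sketch. It assumes that the per-unit normalized scores are independent, that each unit's length can be fixed at its stored mean length, and that the resulting sequence score is approximately Gaussian so that a threshold can be read from the normal quantile. None of these assumptions comes from the embodiment, and all names and numbers are hypothetical.

```python
from math import sqrt
from statistics import NormalDist
from typing import Dict, Sequence, Tuple

# Hypothetical second-type database: statistics of (H0 - H1) / length per unit.
stats_db: Dict[str, Dict[str, float]] = {
    "a": {"score_mean": 1.0, "score_var": 0.04, "len_mean": 10.0},
    "b": {"score_mean": 2.0, "score_var": 0.09, "len_mean": 30.0},
}


def sequence_stats(sequence: Sequence[str],
                   db: Dict[str, Dict[str, float]]) -> Tuple[float, float]:
    """Length-weighted mean and variance of the sequence score (independence assumed)."""
    total_len = sum(db[u]["len_mean"] for u in sequence)
    seq_mean = sum(db[u]["len_mean"] * db[u]["score_mean"] for u in sequence) / total_len
    seq_var = sum((db[u]["len_mean"] / total_len) ** 2 * db[u]["score_var"]
                  for u in sequence)
    return seq_mean, seq_var


def gaussian_threshold(seq_mean: float, seq_var: float,
                       false_rejection_rate: float) -> float:
    """Score below which the assumed Gaussian places the requested probability mass."""
    return NormalDist(seq_mean, sqrt(seq_var)).inv_cdf(false_rejection_rate)


seq_mean, seq_var = sequence_stats(["a", "b"], stats_db)
theta = gaussian_threshold(seq_mean, seq_var, false_rejection_rate=0.1)
print(seq_mean, seq_var, theta)
```

The weighting follows from writing the sequence score as the sum of per-unit score differences divided by the total length, which equals a length-weighted average of the per-unit normalized scores.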
One of the implementation examples is described below with reference to Fig. 6B. In the hypothesis testing performed for the phrase "last", for the speech unit "ㄑ", the forward model (H0) 652 and the reverse model (H1) 654 of the speech unit "ㄑ" give the utterance verification score (UV score) of the speech unit "ㄑ" by the same formula as above, that is, the difference between the forward score and the reverse score of the unit divided by the frame length of the unit.
After each speech unit is processed by the speech unit score generator 620, the utterance verification model 640 is used to calculate its forward (H0) and reverse (H1) scores, which are stored in the speech unit score statistics database 650 together with the length of the speech unit.
Please refer to Fig. 7, which illustrates how the data stored in the speech unit score statistics database is used in hypothesis testing. As shown, the phrase "last" is used as an example, with speech units such as "sil", "ㄑ" and "ㄧ", but the embodiment is not limited thereto. Each speech unit has its own set of corresponding sequences: the speech unit "sil" corresponds to a first sequence through an N1-th sequence, the speech unit "ㄑ" to a first sequence through an N2-th sequence, and the speech unit "ㄧ" to a first sequence through an N3-th sequence.
When the utterance verification score (UV score) is calculated, one of the sequences corresponding to each speech unit is selected at random as the basis of the calculation; each includes a forward (H0) score, a reverse (H1) score and the length of the speech unit. Finally, the scores are summed separately to obtain the forward verification score (H0 score) and the reverse verification score (H1 score), and the UV score is obtained:
$$\text{UV score} = \frac{\text{H0 score} - \text{H1 score}}{T}$$
where T is the total number of frames for the phrase "last".
Several actual verification examples are described below.
An existing speech database is used for verification, here the 500-People TRSC (Telephone Read Speech Corpus) speech database. From this TRSC database, 9006 sentences are extracted as training sentences for the segmentation model and the utterance verification model (see the utterance verification model 640 and the segmentation model 630 in Fig. 6A). Segmentation and speech unit score generation are carried out following the flow of the embodiment of Fig. 6A (see the operations of the segmentation processor 610 and the speech unit score generator 620 in Fig. 6A), finally producing the speech unit score database.
The speech data for the simulation test comes from the Shanghai Mandarin ELDA FDB 1000 speech database, from which three test vocabulary groups are taken.
Vocabulary group (1) consists of the five words "last item, message box, operator, answering equipment, emergency call", with 4865 utterances in total;
Vocabulary group (2) consists of the six words "pound sign, inside, outside, make a phone call, catalogue, tabulation", with 5235 utterances in total;
Vocabulary group (3) consists of the six words "forward, wire back, deletion, change, cancellation, service", with 5755 utterances in total.
Each of the three vocabulary groups is processed by the utterance verification threshold generator shown, for example, in Fig. 5. The groups pass through the processing-target-to-speech-unit processor 520 and the target score generator 540, working with the value calculation module 530, and the threshold found is finally output by the threshold determiner 550.
The final results are illustrated in Fig. 8A to Fig. 8E. Fig. 8A shows that different thresholds are obtained according to the required expected utterance verification result, and that they give different false rejection rates and false alarm rates. The curve labeled 810 is the utterance verification score distribution of the in-vocabulary words of the test set, which can be obtained by analyzing the test corpus. For illustration, a second test corpus is used here to analyze the utterance verification score distribution of out-of-vocabulary words, shown by the curve labeled 820; the recognition vocabulary of the second test corpus does not overlap with that of the first. For example, when the threshold in the figure is 0.0, the false rejection rate is 2% and the false alarm rate is 0.2%. When the threshold is 4.1, the false rejection rate is 10% and the false alarm rate is 0%. The figure shows that, based on the in-vocabulary utterance verification score distribution 810, a value on the horizontal axis can be chosen as the threshold of the verification score, and the corresponding false rejection rate and false alarm rate are obtained. In practice, the present method produces a simulated in-vocabulary utterance verification score distribution, which is converted to a cumulative probability distribution by way of a histogram; a suitable verification score threshold can then be found from it, and the corresponding cumulative probability multiplied by 100% is the false rejection rate (%).
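The relationship between a chosen threshold and the two error rates can be sketched as follows. The score lists are invented toy values and are not the data behind Fig. 8A; the function names are hypothetical.

```python
from typing import Sequence


def false_rejection_rate(in_vocab_scores: Sequence[float], threshold: float) -> float:
    """Fraction of in-vocabulary utterances rejected (score not above the threshold)."""
    return sum(s <= threshold for s in in_vocab_scores) / len(in_vocab_scores)


def false_alarm_rate(out_vocab_scores: Sequence[float], threshold: float) -> float:
    """Fraction of out-of-vocabulary utterances accepted (score above the threshold)."""
    return sum(s > threshold for s in out_vocab_scores) / len(out_vocab_scores)


# Invented toy scores, for illustration only.
in_vocab = [3.0, 4.5, 5.0, 6.2, 7.1, 8.0, 8.4, 9.0, 9.5, 10.0]
out_vocab = [-6.0, -4.0, -3.5, -2.0, -1.0, 0.5, 1.0, 2.0, 3.5, 4.0]

for threshold in (0.0, 4.1):
    print(threshold,
          false_rejection_rate(in_vocab, threshold),
          false_alarm_rate(out_vocab, threshold))
```

Raising the threshold trades false alarms for false rejections, which is the trade-off the expected utterance verification result is meant to fix in advance.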
In Fig. 8B, the solid line labeled 830 is the utterance verification score distribution obtained for vocabulary group 1 by passing the actual test corpus through the recognizer and the utterance verifier, and the dashed line labeled 840 is the utterance verification score distribution simulated by the present method using a corpus outside the test corpus (the aforementioned TRSC). In Fig. 8C, the solid line labeled 832 is the utterance verification score distribution obtained for vocabulary group 2 from the actual test corpus, and the dashed line labeled 842 is the distribution simulated by the present method using the corpus outside the test corpus. In Fig. 8D, the solid line labeled 834 is the utterance verification score distribution obtained for vocabulary group 3 from the actual test corpus, and the dashed line labeled 844 is the distribution simulated by the present method using the corpus outside the test corpus.
After the results labeled 830, 832, 834 and 840, 842, 844 are each converted to a cumulative probability distribution, the relationship between the utterance verification score and the false rejection rate can be expressed as three groups of operating performance curves, as shown in Fig. 8E. The horizontal axis is the utterance verification score (UV score) and the vertical axis is the false rejection rate (FR% in the figure). The figure shows the performance for the three vocabulary groups, where the solid lines are the distributions of the actual data and the dashed lines are the simulated distributions. Fig. 8E shows that, when the false rejection rate is between 0% and 20%, the error between the simulated curve and the actual curve of each vocabulary group is less than 6%, which is within a practically acceptable range.
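One plausible way to quantify the gap between a simulated and an actual operating curve is sketched below. The text does not define how the error is measured, so taking the largest absolute difference in false rejection rate at shared thresholds is an assumption, and the scores are invented toy values.

```python
from typing import List, Sequence


def frr_curve(scores: Sequence[float], thresholds: Sequence[float]) -> List[float]:
    """False rejection rate at each threshold (cumulative distribution of the scores)."""
    return [sum(s <= t for s in scores) / len(scores) for t in thresholds]


def max_curve_error(actual: Sequence[float], simulated: Sequence[float],
                    thresholds: Sequence[float], max_frr: float = 0.20) -> float:
    """Largest |actual - simulated| rate where the actual rate is within the range of interest."""
    actual_curve = frr_curve(actual, thresholds)
    simulated_curve = frr_curve(simulated, thresholds)
    errors = [abs(a - s) for a, s in zip(actual_curve, simulated_curve) if a <= max_frr]
    return max(errors) if errors else 0.0


# Invented toy scores: the simulated set is the actual set shifted by one.
actual_scores = [float(i) for i in range(1, 21)]
simulated_scores = [float(i) + 1.0 for i in range(1, 21)]
error = max_curve_error(actual_scores, simulated_scores, thresholds=[1.0, 2.0, 3.0, 4.0])
print(error)
```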
Although the present invention has been disclosed above by way of embodiments, they are not intended to limit the present invention. Those skilled in the art may make some changes and modifications without departing from the spirit and scope of the present invention.
For example, the present invention can also be combined with an utterance verifier alone. As shown in Fig. 9, in this speech recognition system, an utterance verification threshold generator 910 receives an utterance verification target and produces a suggested threshold 912 for an utterance verifier 920. A speech signal can be input to the utterance verifier 920, which performs utterance verification for this verification target and produces a verification result.
Summarizing the possible embodiments above, the recognition target and the utterance verification target are collectively referred to as the processing target. The utterance verification threshold generator proposed by the present application receives one or more such processing targets and outputs a suggested threshold corresponding to the processing target or targets.
The protection scope of the present invention is therefore defined by the appended claims.