CN102117615B - Device, method and system for generating word confirmation threshold - Google Patents

Device, method and system for generating word confirmation threshold

Info

Publication number
CN102117615B
CN102117615B
Authority
CN
China
Prior art keywords
voice unit
critical value
word
confirmed
voice
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN2009102618864A
Other languages
Chinese (zh)
Other versions
CN102117615A (en)
Inventor
林政贤
张森嘉
邱祺添
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Industrial Technology Research Institute ITRI
Original Assignee
Industrial Technology Research Institute ITRI
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Industrial Technology Research Institute ITRI
Priority to CN2009102618864A
Publication of CN102117615A
Application granted
Publication of CN102117615B
Expired - Fee Related (current legal status)
Anticipated expiration

Abstract

An apparatus, method, and system for generating a word verification threshold. Once the recognition target is determined, a suggested threshold can be obtained according to the expected word verification performance, without collecting additional corpora or training additional models. First, one or more numerical data corresponding to at least one phonetic unit are calculated. Then, when at least one phonetic unit sequence is received, the numerical data corresponding to each phonetic unit in the sequence are retrieved and combined into a numerical distribution for that sequence. Finally, a suggested threshold is generated and output according to the expected word verification performance and the distribution.

Description

Apparatus, method, and system for generating a word verification threshold
Technical field
The present invention relates to speech recognition systems, and more particularly to an apparatus and method for generating a word verification threshold suitable for use in a speech recognition system.
Background technology
Word verification (also called word confirmation or utterance verification) is an indispensable part of a speech recognition system; it effectively rejects recognition errors caused by out-of-vocabulary words. Existing word verification algorithms compute a word verification score and compare it with a threshold: if the score exceeds the threshold, verification succeeds; otherwise it fails. In practice, the best threshold is found by collecting additional corpora and analyzing them against the desired verification performance, and most solutions under this framework likewise try to find the threshold that yields the best verification performance.
As shown in FIG. 1A, a conventional speech recognition system comprises a speech recognition engine 110 and an utterance verifier 120. When a spoken command is received, for example a request for TV, movie, or music, or a command outside the voice-controlled set, for example operating a light or a game, the speech recognition engine 110 makes its decision according to a recognition command set 112 and speech models 114. The recognition command set 112 covers commands such as TV, movie, or music requests, and the speech models 114 provide the models built for these commands to the speech recognition engine 110 as the basis of the decision. The recognition result is output to the utterance verifier 120, which computes a confidence score and compares the confidence score of the corresponding speech input with a threshold, as in decision step 130. If the confidence score is greater than the threshold, the speech input is treated as a command belonging to the recognition command set 112 and the corresponding action is taken, for example controlling the TV, movie, or music. If the speech input is not a command in the recognition command set 112, for example operating a light or a game, no corresponding action is taken.
The generation of the threshold is shown in FIG. 1B: for the commands in the recognition command set 112, a large amount of speech data is collected and analyzed to produce the best threshold; for example, command set 1 yields best threshold 1 and command set 2 yields best threshold 2. Because this speech data is collected through extensive manual input, the whole procedure must be repeated whenever the recognition vocabulary changes. In addition, when the originally set threshold does not perform as expected, another approach is to let the user adjust the threshold manually, as shown in FIG. 1C, raising or lowering it until a satisfactory setting is found.
These approaches limit the range of applications of a speech recognition system and greatly reduce its practicality. For example, if the speech recognition system is used in an embedded system such as a system-on-a-chip (SoC), cost considerations may make it impossible to provide a threshold-adjustment mechanism, and this problem must be solved. As shown in FIG. 2, when an integrated circuit (IC) supplier delivers ICs with speech recognition functions to a system manufacturer, the system manufacturer integrates them into embedded systems. Under such a framework, unless the threshold is re-adjusted and the ICs are shipped again from the IC supplier to the system manufacturer, the threshold cannot be adjusted.
Solutions for threshold adjustment are discussed in many patents on utterance verification systems, such as the following United States patents.
U.S. Patent No. 5,675,706, "Vocabulary Independent Discriminative Utterance Verification For Non-Keyword Rejection In Subword Based Speech Recognition", discloses a threshold that is a predefined value; changing this value affects two kinds of errors, the false alarm rate and the false reject rate, and the system designer must adjust it manually to find a balance between them. By contrast, the method of the present application obtains the threshold corresponding to a given verification performance (such as a false alarm rate or false reject rate) from at least one recognition target and an expected word verification performance, instead of having it adjusted manually by the user.
U.S. Patent No. 5,737,489, "Discriminative Utterance Verification For Connected Digits Recognition", further mentions that the threshold can be computed dynamically from data collected on-line, solving the problem of setting the threshold when the recognition environment changes. Although that document describes a way of computing the threshold, its on-line collection means that, while the speech recognition and utterance verification system is operating, test data from the new environment is first passed through speech recognition to obtain recognition results, which are then analyzed to update the previously preset word verification threshold.
Summarizing the prior art, finding the best threshold through additional data collection and analysis is the most common practice; the next most common is to expose the threshold to the user for manual adjustment. In either case, test data from the new environment is first passed through speech recognition to obtain recognition results, which are then analyzed to update the previously preset word verification threshold.
Summary of the invention
The invention provides a word verification threshold generation apparatus suitable for a speech recognition system. The apparatus comprises a numerical simulation module, a target score generator, and a threshold resolver. The numerical simulation module calculates and stores a plurality of numerical data corresponding to a plurality of recognition targets. The target score generator receives a phonetic unit sequence formed from at least one of the recognition targets and selects, from the numerical simulation module, the numerical data corresponding to the phonetic unit sequence to form at least one numerical distribution. The threshold resolver receives the numerical distribution and, according to an expected word verification performance and the numerical distribution, produces a suggested threshold as output.
The invention also provides a word verification threshold generation method suitable for a speech recognition system. In the method, a plurality of numerical data corresponding to a plurality of recognition targets are calculated and stored. A phonetic unit sequence formed from one of the recognition targets is received, and the numerical data corresponding to the phonetic unit sequence are selected to form a numerical distribution. According to an expected word verification performance and the numerical distribution, a suggested threshold is produced as output.
To make the above features and advantages of the present invention more comprehensible, embodiments are described in detail below with reference to the accompanying drawings.
Description of drawings
FIG. 1A is a schematic diagram of the system architecture of a conventional speech recognition system.
FIG. 1B and FIG. 1C are schematic diagrams of how the threshold in the speech recognition system of FIG. 1A is generated or adjusted.
FIG. 2 is a simplified schematic diagram of the process flow of ICs with speech recognition functions from the supplier to the system integrator.
FIG. 3 is a schematic diagram of the method proposed in the present embodiment for automatically computing a word verification threshold.
FIG. 4A is a block diagram of a speech recognition system according to an embodiment of the invention.
FIG. 4B is a schematic diagram of the hypothesis testing method performed by the utterance verifier for a word.
FIG. 5 is a block diagram of the word verification threshold generator of the invention.
FIG. 6A is a block diagram of an implementation example of the numerical simulation module according to an embodiment of the invention, and FIG. 6B is a schematic diagram of generating a numerical value.
FIG. 7 is a schematic diagram of how the data stored in the phonetic unit score statistics database is used in the hypothesis testing method.
FIG. 8A to FIG. 8E are diagrams verifying the method proposed in the present embodiment for automatically computing a word verification threshold.
FIG. 9 is a block diagram of a speech recognition system according to another embodiment of the invention.
[Main element symbol description]
110: speech recognition engine
120: utterance verifier
112: recognition command set
114: speech models
310: command set
320: automatic analysis tool
400: speech recognition system
410: speech recognizer
420: recognition target storage unit
430: word verification threshold generator
440: utterance verifier
510: recognition target
520: target-to-phonetic-unit processor (letter-to-phone processor)
530: numerical simulation module
540: target score generator
550: threshold resolver
560: expected word verification performance
600: numerical simulation module
602: speech data
610: speech segmentation processor
620: phonetic unit score generator
630: segmentation models
640: word verification models
650: phonetic unit score statistics database
652: forward model (H0) of phonetic unit "ㄑ"
654: reverse model (H1) of phonetic unit "ㄑ"
Embodiment
The present embodiment proposes a method for computing a word verification threshold: after the recognition target is determined, a suggested threshold can be obtained according to the expected word verification performance, without collecting additional corpora or training additional models.
Referring to FIG. 3, when the recognition target is defined as a command set 310, an automatic analysis tool 320 processes it in a fully automatic, non-manual, offline manner and obtains a suggested threshold according to a predetermined condition. This embodiment does not obtain recognition results in the new environment through speech recognition and then analyze them to update a previously preset threshold. In the present embodiment, before the speech recognition system is put into use, the word verification performance is adjusted for the specific recognition target and a suggested threshold is obtained dynamically, so that the output can be judged by the utterance verifier to produce the verification result.
For IC design vendors, the method of the present embodiment makes the speech recognition solution more complete, and downstream manufacturers can quickly develop speech-recognition-related products without worrying about corpus collection. This is of considerable help to the popularization of speech recognition technology.
The concept of this embodiment is to predict the word verification threshold for the current recognition target before the speech recognition and utterance verification system operates. This differs greatly from the prior art, which first uses a preset threshold and then collects corpora to update that preset threshold while the speech recognition and utterance verification modules are running. Moreover, the present application does not collect and analyze any data while the speech recognition and utterance verification system is operating; it uses only pre-existing speech data, such as the corpus of the speech recognition system or of the utterance verification system. The innovation proposed here is that the word verification threshold can be determined in advance, after the recognition vocabulary is decided but before the speech recognition or utterance verification module operates, without additional data collection; this framework is clearly different from the prior art.
FIG. 4A is a block diagram of a speech recognition system according to an embodiment of the invention. The speech recognition system 400 comprises a speech recognizer 410, a recognition target storage unit 420, a word verification threshold generator 430, and an utterance verifier 440. The input speech signal is sent to the speech recognizer 410 and the utterance verifier 440. The recognition target storage unit 420 stores the various recognition targets and outputs them to the speech recognizer 410 and the word verification threshold generator 430.
The speech recognizer 410 makes a decision according to the received speech signal and the recognition target 422, and outputs a recognition result 412 to the utterance verifier 440. Meanwhile, the word verification threshold generator 430 produces a corresponding threshold 432 for the recognition target 422 and outputs it to the utterance verifier 440. The utterance verifier 440 then performs verification according to the recognition result 412 and the threshold 432, to verify whether the recognition result 412 is correct, that is, whether its score is higher than the generated threshold 432.
In the word verification threshold generator 430 proposed in this embodiment, as shown in the figure, the recognition target of the speech recognizer 410 is a set of predefined vocabulary (for example, N Chinese phrases) that can be read from the recognition target storage unit 420. After the speech signal passes through the recognizer, the recognition result is delivered to the utterance verifier 440.
On the other hand, the recognition target is also input to the word verification threshold generator 430, and given an expected word verification performance, such as a 10% false rejection rate, a suggested threshold θ_UV can be obtained.
In the word verification threshold generator 430, in a first implementation example, the statistically common hypothesis testing method can be used to compute the word verification score, but the invention is not limited thereto.
Each phonetic unit has a forward model and a reverse model (denoted H0 and H1, respectively). After the recognition result is converted into a phonetic unit sequence, a forward verification score and a reverse verification score are computed for each unit using the corresponding forward and reverse models; these are summed to obtain the overall forward verification score (H0 score) and reverse verification score (H1 score), from which the word verification score (abbreviated UV score) is obtained as follows:
UV score = (H0 score − H1 score) / T,
where T is the total number of speech frames of the speech signal.
Finally, the word verification score UV score is compared with the threshold θ_UV; if UV score is greater than θ_UV, verification succeeds and the recognition result is output.
For the above embodiment, refer to FIG. 4B, which illustrates the hypothesis testing method performed by the utterance verifier 440 for the first word, "last item". There are eight frame segments t1 to t8 in total, which can be divided into eight hypothesis testing regions. The speech signal is aligned to these frame segments by forced alignment and segmented into the phonetic units "sil" (representing silence), "ㄑ", "ㄧ", "ㄢ", "null", "ㄧ", "ㄒ", "ㄧㄤ", and "sil" of the corresponding speech signal. A forward and a reverse verification score are computed for each phonetic unit, for example H0_sil and H1_sil, H0_ㄑ and H1_ㄑ, H0_ㄧ and H1_ㄧ, H0_ㄢ and H1_ㄢ, H0_null and H1_null, H0_ㄧ and H1_ㄧ, H0_ㄒ and H1_ㄒ, H0_ㄧㄤ and H1_ㄧㄤ, and H0_sil and H1_sil.
Finally, the forward and reverse scores are summed separately to obtain the overall forward verification score (H0 score) and reverse verification score (H1 score), from which the word verification score (abbreviated UV score) is obtained:
UV score = (H0 score − H1 score) / T,
where T is the total number of speech frames of the speech signal.
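As a concrete illustration of this hypothesis-testing computation, the following Python sketch assumes that the per-unit forward (H0) and reverse (H1) scores and frame counts have already been produced by forced alignment; the data layout and function names are illustrative only and are not defined by the patent.

```python
from typing import List, NamedTuple

class UnitScore(NamedTuple):
    unit: str     # phonetic unit label, e.g. "sil", "ㄑ", "ㄧ"
    h0: float     # forward (H0) verification score accumulated over the unit's frames
    h1: float     # reverse (H1) verification score accumulated over the unit's frames
    frames: int   # number of speech frames aligned to this unit

def uv_score(units: List[UnitScore]) -> float:
    """UV score = (sum of H0 scores - sum of H1 scores) / total frame count T."""
    h0_total = sum(u.h0 for u in units)
    h1_total = sum(u.h1 for u in units)
    t = sum(u.frames for u in units)
    return (h0_total - h1_total) / t

def verify(units: List[UnitScore], theta_uv: float) -> bool:
    """Accept the recognition result only when the UV score exceeds the threshold."""
    return uv_score(units) > theta_uv
```

For example, with the per-unit scores of the word "last item" from the forced alignment above, `verify(aligned_units, theta_uv=4.1)` would accept or reject the recognition result against the suggested threshold.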
A block diagram of one embodiment of the above word verification threshold generator is shown in FIG. 5.
The word verification threshold generator 500 comprises a target-to-phonetic-unit processor 520, a target score generator 540, and a threshold resolver 550. The word verification threshold generator 500 also comprises a numerical simulation module 530, which produces numerical values and provides them to the target score generator 540. In one embodiment, the numerical simulation module 530 may comprise a phonetic unit verification module 532 and a speech database 534. The speech database 534 may be a database with a built-in corpus, storing a pre-existing corpus, or a storage medium into which the user inputs relevant training corpora. The stored data may include sound source files, speech feature parameters, or the like. The phonetic unit verification module 532 computes the word verification score of each phonetic unit from the speech database 534 and provides it, in the form of one or more numerical values, to the target score generator 540.
According to a received phonetic unit sequence, the target score generator 540 receives from the numerical simulation module 530 the one or more numerical values corresponding to each phonetic unit in the sequence, combines them to form the numerical distribution corresponding to the phonetic unit sequence, and provides it to the threshold resolver 550.
The threshold resolver 550 produces a suggested threshold as output according to an expected word verification performance 560 and the received numerical distribution of the phonetic unit sequence. In one embodiment, a 10% false rejection rate is given, for example. The threshold resolver 550 then finds, in the numerical distribution, the point corresponding to the condition defined by the expected word verification performance and outputs the corresponding numerical value as the suggested threshold.
The numerical simulation module 530 collects a plurality of score samples corresponding to each phonetic unit. For example, for phonetic unit pho_i there are X score samples, and their values are stored. The hypothesis testing method used in the previous embodiment is taken here as the preferred embodiment, but the invention is not limited thereto.
For phonetic unit pho_i, there are forward and reverse verification scores (denoted H0 score and H1 score, respectively) corresponding to different samples:
(H0 score_pho_i,sample1, H1 score_pho_i,sample1, T_pho_i,sample1), ..., (H0 score_pho_i,sampleX, H1 score_pho_i,sampleX, T_pho_i,sampleX)
Here H0 score_pho_i,sample1 denotes the first forward score sample of pho_i, H1 score_pho_i,sample1 denotes the first reverse score sample of pho_i, and T_pho_i,sample1 denotes the frame length of the first sample of pho_i.
After the word verification threshold generator 500 receives the recognition target (assume W Chinese words), every word is processed by the Chinese letter-to-phone conversion of the target-to-phonetic-unit processor 520 and converted into a phonetic unit sequence Seq_i = {pho_1, ..., pho_k}, where i is the i-th Chinese word and k is the number of phonetic units of that word.
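A minimal sketch of this conversion step, assuming a small hypothetical lexicon that maps each word of the recognition target to its Zhuyin phonetic units (the real processor 520 would use a full Chinese letter-to-phone module; the entries and pronunciations below are illustrative placeholders, not data from the patent):

```python
from typing import Dict, List

# Hypothetical lexicon: recognition-target word -> phonetic unit sequence Seq_i
# (pronunciations are illustrative placeholders only)
LEXICON: Dict[str, List[str]] = {
    "last item":      ["sil", "ㄑ", "ㄧ", "ㄢ", "ㄧ", "ㄒ", "ㄧㄤ", "sil"],
    "emergency call": ["sil", "ㄐ", "ㄧㄣ", "ㄐ", "ㄧ", "ㄉ", "ㄧㄢ", "ㄏ", "ㄨㄚ", "sil"],
}

def target_to_units(words: List[str]) -> Dict[str, List[str]]:
    """Convert each word of the recognition target into its phonetic unit sequence."""
    return {w: LEXICON[w] for w in words}
```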
The generated phonetic unit sequences are then input to the target score generator 540.
In the target score generator 540, for the content of a phonetic unit sequence, scores of the corresponding forward and reverse models are taken from the numerical simulation module 530 according to a selection scheme (for example, random selection) and combined into a score sample X as follows:
X = (H0 score_sample − H1 score_sample) / T_sample, where
H0 score_sample = H0 score_pho1,sampleN + ... + H0 score_phok,sampleM
H1 score_sample = H1 score_pho1,sampleN + ... + H1 score_phok,sampleM
T_sample = T_pho1,sampleN + ... + T_phok,sampleM
Here H0 score_pho1,sampleN and H1 score_pho1,sampleN denote the N-th H0 and H1 score samples chosen from the numerical simulation module 530 for the first phonetic unit (pho1); likewise, H0 score_phok,sampleM and H1 score_phok,sampleM denote the M-th H0 and H1 score samples chosen from the statistics database for the k-th phonetic unit (phok).
For each Chinese word, P word verification score (abbreviated UV score) samples {x_1, x_2, ..., x_P} are produced to form the score sample set of that word; the score samples of all words are then pooled into the score set of the entire recognition target and input to the threshold resolver 550.
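A sketch of this sampling step, assuming the stored data take the simple per-unit form of (H0 score, H1 score, frame length) samples (the first implementation type described further below); the structure names are illustrative assumptions:

```python
import random
from typing import Dict, List, Tuple

# stats_db[unit] = list of (H0 score, H1 score, frame length T) samples for that unit
StatsDB = Dict[str, List[Tuple[float, float, int]]]

def sample_uv_score(units: List[str], stats_db: StatsDB) -> float:
    """Form one synthetic UV-score sample X by randomly picking one stored
    (H0, H1, T) sample per phonetic unit and combining them."""
    h0_sum = h1_sum = 0.0
    t_sum = 0
    for unit in units:
        h0, h1, t = random.choice(stats_db[unit])
        h0_sum += h0
        h1_sum += h1
        t_sum += t
    return (h0_sum - h1_sum) / t_sum

def word_score_samples(units: List[str], stats_db: StatsDB, p: int = 1000) -> List[float]:
    """Generate P UV-score samples {x_1, ..., x_P} for one word of the target."""
    return [sample_uv_score(units, stats_db) for _ in range(p)]
```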
In the threshold resolver 550, the score set of the entire recognition target is compiled into a histogram and converted into a cumulative probability distribution, from which a suitable threshold θ_UV can be found; for example, the threshold at which the cumulative probability equals 0.1 is output.
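A sketch of the threshold resolver step: the pooled score samples of the whole recognition target are read off as an empirical cumulative distribution, and the score at which the cumulative probability equals the expected false rejection rate (e.g. 0.1) is returned as the suggested threshold. The direct quantile lookup below is an assumed shortcut that is equivalent, up to histogram binning, to the histogram and cumulative-probability description above.

```python
from typing import Iterable, List

def suggest_threshold(score_samples: Iterable[float], expected_fr: float = 0.10) -> float:
    """Return the UV-score value whose empirical cumulative probability equals the
    expected false rejection rate (scores below the threshold would be rejected)."""
    scores: List[float] = sorted(score_samples)
    if not scores:
        raise ValueError("no score samples")
    # index of the expected_fr quantile in the sorted sample set
    idx = min(int(expected_fr * len(scores)), len(scores) - 1)
    return scores[idx]

# Example: pool the samples of all W words and ask for a 10% false rejection rate
# theta_uv = suggest_threshold(all_word_samples, expected_fr=0.10)
```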
In the above embodiment, the numerical simulation module 530 is implemented with a phonetic unit verification module 532 and a speech database 534; this is an implementation example in which the computation is performed on the fly. However, the numerical simulation module 530 may adopt any technique capable of word verification, all of which fall within the scope of the present embodiment, for example the content disclosed in Taiwan patent publication No. 200421261, "Method and system for utterance verification", or the techniques mentioned in "Confidence measures for speech recognition: A survey" by Hui Jiang, Speech Communication, 2005. In another embodiment, a phonetic unit score database may be used that directly outputs the corresponding numerical values according to the selection, although the invention is not limited thereto. The numerical values stored in the phonetic unit score database are produced by receiving pre-existing speech data, processing it with speech segmentation and a phonetic unit score generator to produce the corresponding scores, and storing them in the phonetic unit score database. This embodiment is described below.
FIG. 6A and FIG. 6B illustrate implementation examples of the numerical simulation module: FIG. 6A is a block diagram of the implementation example, and FIG. 6B is a schematic diagram of generating a numerical value. The numerical simulation module 600 comprises a speech segmentation processor 610 and a phonetic unit score generator 620, and outputs the processed data to a phonetic unit score statistics database 650.
The speech data 602, serving as the corpus, can be obtained from an existing speech database; for example, the 500-people TRSC (Telephone Read Speech Corpus) speech database or the Shanghai Mandarin ELDA FDB 1000 speech database are possible sources.
Under such a framework, once the recognition target is determined, a suggested threshold can be obtained according to the expected word verification performance, without collecting additional corpora or training additional models. This embodiment does not need to obtain recognition results in a new environment through speech recognition and then analyze them to update a previously preset word verification threshold. Before the speech recognition system is put into use, the word verification performance is adjusted for the specific recognition target and a suggested threshold is obtained dynamically, so that the output can be judged by the utterance verifier to produce the verification result. For IC design vendors, the method of the present embodiment makes the speech recognition solution more complete, and downstream manufacturers can quickly develop speech-recognition-related products without worrying about corpus collection. This is of considerable help to the popularization of speech recognition technology.
In this method, the speech data 602 is first segmented into individual phonetic units by the speech segmentation processor 610. In one embodiment, the segmentation models 630 used here are the same models used for forced alignment in the utterance verifier.
Then, each phonetic unit is processed by the phonetic unit score generator 620 to obtain the corresponding result. The scores produced by the phonetic unit score generator 620 are computed using a set of word verification models 640, which are consistent with the word verification models used in the recognition system. The composition of the phonetic unit scores may take different forms depending on the word verification scheme used in the speech recognition system. For example, in one embodiment where hypothesis testing is used as the word verification scheme, the phonetic unit scores consist of a forward score computed for the unit with the forward model of that phonetic unit and a reverse score computed for the unit with the reverse model of that phonetic unit. In one embodiment, the forward and reverse scores of the corresponding segments of all corpus utterances for each phonetic unit, together with the unit length, are all stored in the phonetic unit score statistics database 650; this may be called the first implementation type. In another embodiment, instead of storing the forward and reverse scores of the corresponding segments of all corpus utterances for each phonetic unit, only statistics (such as the mean and variance) of the difference of the two scores divided by the length, together with statistics of the length, are stored in the phonetic unit score statistics database 650; this is the second implementation type.
Depending on the word verification scheme, the phonetic unit scores may also comprise a forward score computed for the phonetic unit with its own forward model, together with a plurality of forward competing scores computed, over all units of the corpus other than this phonetic unit, with the forward model of this phonetic unit. For each unit, the forward scores of the corresponding segments of all corpus utterances and all of their corresponding forward competing scores, together with the unit length, may all be stored in the phonetic unit score statistics database 650; this may be called the third implementation type, in which the corresponding forward competing scores may all be stored or only a subset of them. Alternatively, only the statistics (such as the mean and variance) of the value obtained by combining the forward score with its corresponding forward competing scores through a mathematical operation (such as an arithmetic or geometric mean), subtracting, and dividing by the length, together with the statistics of the length, may be stored in the phonetic unit score statistics database 650; this may be called the fourth implementation type.
The operation of the target score generator 540 in FIG. 5 may differ according to what the phonetic unit score statistics database 650 stores. When the database 650 stores data of the first or the third implementation type, sample scores can be combined by random selection from the phonetic unit score statistics database 650 according to the content of the phonetic unit sequence, forming the score distribution of that phonetic unit sequence. When the database stores data of the second or the fourth implementation type, the mean and variance of the score distribution of the phonetic unit sequence are formed directly by arithmetic combination of the means and variances in the phonetic unit score statistics database 650 according to the content of the unit sequence.
The operation of one implementation example is explained below with reference to FIG. 6B. In the hypothesis testing carried out for the word "last item", for the phonetic unit "ㄑ", the forward model (H0) 652 and the reverse model (H1) 654 of the phonetic unit "ㄑ" are used to obtain the word verification score (UV score) of the phonetic unit "ㄑ" as
UV score_ㄑ = (H0 score_ㄑ − H1 score_ㄑ) / T_ㄑ
After each phonetic unit is processed by the phonetic unit score generator 620, the word verification models 640 are used to compute its forward (H0) and reverse (H1) scores, which are stored in the phonetic unit score statistics database 650 together with the length of the phonetic unit.
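A sketch of how the phonetic unit score statistics database might be populated from a pre-existing corpus, following the first implementation type (store every per-unit H0 score, H1 score, and length); `force_align` and `score_unit` stand in for the segmentation models 630 and word verification models 640 and are assumptions, not interfaces defined by the patent.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple

# aligner: utterance -> list of (unit label, start frame, end frame)
Aligner = Callable[[object], List[Tuple[str, int, int]]]
# scorer: (utterance, start, end, unit) -> (H0 score, H1 score)
Scorer = Callable[[object, int, int, str], Tuple[float, float]]

def build_stats_db(corpus: Iterable[object],
                   force_align: Aligner,
                   score_unit: Scorer) -> Dict[str, List[Tuple[float, float, int]]]:
    """First implementation type: store (H0, H1, length) for every aligned segment."""
    stats_db: Dict[str, List[Tuple[float, float, int]]] = defaultdict(list)
    for utterance in corpus:
        for unit, start, end in force_align(utterance):      # segmentation models 630
            h0, h1 = score_unit(utterance, start, end, unit)  # verification models 640
            stats_db[unit].append((h0, h1, end - start))
    return stats_db
```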
FIG. 7 illustrates how the data stored in the phonetic unit score statistics database is used in the hypothesis testing method. As shown in the figure, the word "last item" and its phonetic units "sil", "ㄑ", "ㄧ", and so on are taken as an example, but the invention is not limited thereto. Each phonetic unit has its own corresponding stored sample sequences: phonetic unit "sil" corresponds to the first through N1-th sequences, phonetic unit "ㄑ" to the first through N2-th sequences, and phonetic unit "ㄧ" to the first through N3-th sequences.
When computing the word verification score (UV score), one of the corresponding stored sequences of each phonetic unit is randomly selected as the basis of the computation, including the forward (H0) score, the reverse (H1) score, and the length of that phonetic unit. Finally, the forward verification score (H0 score) and reverse verification score (H1 score) are each summed, and the word verification score (abbreviated UV score) is obtained:
UV score = (H0 score − H1 score) / T,
where T is the total number of speech frames of the word "last item".
Several actual verification examples are described below.
The verification uses an existing speech database, here the 500-people TRSC (Telephone Read Speech Corpus) speech database as an example. From the TRSC database, 9006 utterances are extracted as training sentences for the segmentation models and the word verification models (refer to the word verification models 640 and segmentation models 630 in FIG. 6A). Speech segmentation and phonetic unit score generation are performed following the embodiment flow of FIG. 6A (refer to the processing of the speech segmentation processor 610 and the phonetic unit score generator 620 in FIG. 6A), and finally the phonetic unit score database is produced.
The simulated test speech data uses the Shanghai Mandarin ELDA FDB 1000 speech database, from which three test vocabulary groups are taken.
Vocabulary group (1) contains the five words "last item, message box, operator, answering equipment, emergency call", with 4865 utterances;
Vocabulary group (2) contains the six words "pound sign, inside, outside, make a phone call, catalogue, tabulation", with 5235 utterances;
Vocabulary group (3) contains the six words "forward, wire back, deletion, change, cancellation, service", with 5755 utterances.
The three vocabulary groups are each processed by the word verification threshold generator shown in FIG. 5: through the target-to-phonetic-unit processor 520 and the target score generator 540, in cooperation with the numerical simulation module 530, the threshold found by the threshold resolver 550 is finally output.
The final results are shown in FIG. 8A to FIG. 8E. In FIG. 8A, different thresholds can be obtained according to the required expected word verification performance, with different false rejection rates and false alarm rates. The word verification score distribution of the in-vocabulary words in the test set is indicated by reference numeral 810 and can be obtained by analyzing the test corpus. For illustration, the word verification score distribution of out-of-vocabulary words, analyzed from a second set of test corpus, is indicated by reference numeral 820; the recognition vocabulary of the second set has no overlap with that of the first set. For example, at a threshold of 0.0 in the figure, the false rejection rate is 2% and the false alarm rate is 0.2%; at a threshold of 4.1, the false rejection rate is 10% and the false alarm rate is 0%. As can be seen from the figure, a value on the horizontal axis can be chosen as the word verification score threshold according to the word verification score distribution 810 of the in-vocabulary words, yielding the corresponding false rejection and false alarm rates. In fact, the word verification score distribution of the in-vocabulary words can be simulated by the present method; after converting it into a cumulative probability distribution through histogram statistics, a suitable word verification score threshold can be found from it, the false rejection ratio (%) being the corresponding cumulative probability multiplied by 100%.
In FIG. 8B, the solid line indicated by reference numeral 830 is the word verification score distribution obtained for vocabulary group 1 by passing the actual test corpus through the recognizer and the utterance verifier, and the dashed line indicated by reference numeral 840 is the word verification score distribution simulated by the present method using a corpus outside the test corpus set (the aforementioned TRSC). In FIG. 8C, the solid line 832 is the word verification score distribution obtained for vocabulary group 2 from the actual test corpus through the recognizer and utterance verifier, and the dashed line 842 is the distribution simulated by the present method using the out-of-set corpus (TRSC). In FIG. 8D, the solid line 834 is the word verification score distribution obtained for vocabulary group 3 from the actual test corpus, and the dashed line 844 is the distribution simulated by the present method using the out-of-set corpus (TRSC).
After the results indicated by 830, 832, 834 and 840, 842, 844 are each converted into cumulative probability distributions, they can be turned into three groups of operating performance curves of word verification score versus false rejection ratio, as shown in FIG. 8E. The horizontal axis is the word verification score (UV score) and the vertical axis is the false rejection rate (FR% in the figure). The figure shows the performance after applying the method to the three vocabulary groups, where the solid lines describe the distributions of the real data and the dashed lines describe the simulated distributions. As can be seen from FIG. 8E, when the false rejection rate is between 0% and 20%, the error between the simulated curve and the actual curve of each vocabulary group is less than 6%, within a practically acceptable range.
Although the invention has been disclosed above by way of embodiments, they are not intended to limit the invention; those skilled in the art may make slight changes and modifications without departing from the spirit and scope of the invention.
For example, the invention may also be combined with an utterance verifier alone, as shown in FIG. 9: in this speech recognition system, the word verification threshold generator 910 receives a word verification target and then produces a suggested threshold 912 for the utterance verifier 920. A speech signal can be input to the utterance verifier 920, which performs word verification for this verification target and obtains the verification result.
Combining the above possible embodiments, the recognition target and the word verification target are collectively referred to as a processing target: the word verification threshold generator proposed in the present application receives one or more of these processing targets and outputs the suggested threshold corresponding to the processing target or targets.
Therefore, the protection scope of the invention shall be determined by the appended claims.

Claims (22)

1. An apparatus for generating a word verification threshold, the apparatus comprising:
a numerical simulation module, for calculating and producing one or more numerical data corresponding to at least one phonetic unit;
a target score generator, for receiving at least one phonetic unit sequence, taking from the numerical simulation module the one or more numerical data corresponding to each phonetic unit in the phonetic unit sequence, and accordingly combining them into a numerical distribution corresponding to the phonetic unit sequence; and
a threshold resolver, connected to the target score generator, for receiving the numerical distribution and producing a suggested threshold as output according to an expected word verification performance and the numerical distribution,
wherein the numerical simulation module comprises:
a speech database, for storing one or more speech data corresponding to the at least one phonetic unit; and
a phonetic unit verification module, for receiving the speech data in the speech database, calculating one or more word verification scores corresponding to the phonetic unit, and providing them in the form of numerical data to the target score generator.
2. The apparatus for generating a word verification threshold as claimed in claim 1, further comprising a target-to-phonetic-unit processor, for receiving a processing target, converting the processing target into the phonetic unit sequence, and outputting it to the target score generator.
3. The apparatus for generating a word verification threshold as claimed in claim 1, wherein the target score generator combines the one or more numerical data corresponding to each phonetic unit in the phonetic unit sequence, in a linear combination manner, into the numerical distribution corresponding to the phonetic unit sequence.
4. The apparatus for generating a word verification threshold as claimed in claim 1, wherein the threshold resolver maps an input condition of the expected word verification performance to a corresponding value of the numerical distribution, and the corresponding value is the suggested threshold that is output.
5. The apparatus for generating a word verification threshold as claimed in claim 4, wherein the input condition of the expected word verification performance is a false rejection rate.
6. The apparatus for generating a word verification threshold as claimed in claim 1, wherein the format of the speech data stored in the speech database comprises one of a sound source file and speech feature parameters, or both a sound source file and speech feature parameters.
7. A method for generating a word verification threshold, the method comprising:
calculating one or more numerical data corresponding to at least one phonetic unit;
receiving at least one phonetic unit sequence, receiving the one or more numerical data corresponding to each phonetic unit in the phonetic unit sequence, and accordingly combining them into a numerical distribution corresponding to the phonetic unit sequence; and
producing a suggested threshold as output according to an expected word verification performance and the numerical distribution,
wherein the step of calculating the one or more numerical data corresponding to the phonetic unit comprises:
calculating, from the speech data of the phonetic unit stored in a speech database, the word verification score of each phonetic unit, and providing the numerical data in the form of one or more numerical values.
8. The method for generating a word verification threshold as claimed in claim 7, further comprising converting a processing target into the phonetic unit sequence, so as to accordingly select the numerical data corresponding to the phonetic unit sequence and form the numerical distribution.
9. The method for generating a word verification threshold as claimed in claim 7, wherein after the phonetic unit sequence is received, the one or more numerical values of each phonetic unit in the phonetic unit sequence are combined in a linear combination manner into the numerical distribution of the phonetic unit sequence.
10. The method for generating a word verification threshold as claimed in claim 7, wherein an input condition of the expected word verification performance is mapped to a corresponding value of the numerical distribution, and the corresponding value is the suggested threshold that is output.
11. The method for generating a word verification threshold as claimed in claim 10, wherein the input condition of the expected word verification performance is a false rejection rate.
12. The method for generating a word verification threshold as claimed in claim 7, wherein the format of the speech data stored in the speech database comprises one of a sound source file and speech feature parameters, or both a sound source file and speech feature parameters.
13. A system for generating a word verification threshold, the system comprising:
a numerical simulation module, for calculating and producing one or more numerical data corresponding to at least one phonetic unit;
a target score generation module, for receiving at least one phonetic unit sequence, taking from the numerical simulation module the one or more numerical data corresponding to each phonetic unit in the phonetic unit sequence, and accordingly combining them into a numerical distribution corresponding to the phonetic unit sequence; and
a threshold decision module, connected to the target score generation module, for receiving the numerical distribution and producing a suggested threshold as output according to an expected word verification performance and the numerical distribution,
wherein the numerical simulation module comprises:
a speech database, for storing one or more speech data corresponding to the at least one phonetic unit; and
a phonetic unit verification module, for receiving the speech data in the speech database, calculating one or more word verification scores corresponding to the phonetic unit, and providing them in the form of numerical data to the target score generation module.
14. The system for generating a word verification threshold as claimed in claim 13, further comprising a target-to-phonetic-unit processing module, for receiving a processing target, converting the processing target into the phonetic unit sequence, and outputting it to the target score generation module.
15. The system for generating a word verification threshold as claimed in claim 13, wherein the target score generation module combines the one or more numerical data corresponding to each phonetic unit in the phonetic unit sequence, in a linear combination manner, into the numerical distribution corresponding to the phonetic unit sequence.
16. The system for generating a word verification threshold as claimed in claim 13, wherein the threshold decision module maps an input condition of the expected word verification performance to a corresponding value of the numerical distribution, and the corresponding value is the suggested threshold that is output.
17. The system for generating a word verification threshold as claimed in claim 16, wherein the input condition of the expected word verification performance is a false rejection rate.
18. The system for generating a word verification threshold as claimed in claim 13, wherein the format of the speech data stored in the speech database comprises one of a sound source file and speech feature parameters, or both a sound source file and speech feature parameters.
19. A speech recognition system, comprising an apparatus for generating a word verification threshold as claimed in claim 1, for producing a suggested threshold according to which the speech recognition system performs verification and accordingly outputs a verification result.
20. The speech recognition system as claimed in claim 19, further comprising:
a speech recognizer, for receiving a speech signal;
a processing target storage unit, storing a plurality of processing targets, wherein the speech recognizer reads the processing targets, makes a decision according to the speech signal and the read processing targets, and then outputs a recognition result; and
an utterance verifier, for receiving the recognition result, performing verification with the suggested threshold, and accordingly outputting a verification result.
21. A word verification system, comprising an apparatus for generating a word verification threshold as claimed in claim 1, for producing a suggested threshold according to which the word verification system performs verification and accordingly outputs a verification result.
22. The word verification system as claimed in claim 21, further comprising:
a processing target storage unit, storing a processing target; and
an utterance verifier, for receiving a speech signal, reading the processing target, comparing the speech signal with the read processing target, then performing verification with the suggested threshold, and accordingly outputting a verification result.
CN2009102618864A | 2009-12-31 | 2009-12-31 | Device, method and system for generating word confirmation threshold | Expired - Fee Related | CN102117615B (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN2009102618864A (granted as CN102117615B) | 2009-12-31 | 2009-12-31 | Device, method and system for generating word confirmation threshold

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN2009102618864A (granted as CN102117615B) | 2009-12-31 | 2009-12-31 | Device, method and system for generating word confirmation threshold

Publications (2)

Publication Number | Publication Date
CN102117615A (en) | 2011-07-06
CN102117615B | 2013-01-02

Family

ID=44216347

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN2009102618864A | Expired - Fee Related | CN102117615B (en)

Country Status (1)

Country | Link
CN (1) | CN102117615B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
US5675706A (en) * | 1995-03-31 | 1997-10-07 | Lucent Technologies Inc. | Vocabulary independent discriminative utterance verification for non-keyword rejection in subword based speech recognition
US5737489A (en) * | 1995-09-15 | 1998-04-07 | Lucent Technologies Inc. | Discriminative utterance verification for connected digits recognition
TW200421261A (en) * | 2003-04-14 | 2004-10-16 | Ind Tech Res Inst | Method and system for utterance verification
CN1963917A (en) * | 2005-11-11 | 2007-05-16 | 株式会社东芝 | Method for estimating distinguish of voice, registering and validating authentication of speaker and apparatus thereof

Also Published As

Publication number | Publication date
CN102117615A (en) | 2011-07-06

Similar Documents

Publication | Title
CN100559462C (en) | Voice processing apparatus, method of speech processing, program and recording medium
US6839667B2 (en) | Method of speech recognition by presenting N-best word candidates
CN106935239A (en) | The construction method and device of a kind of pronunciation dictionary
CN1196104C (en) | Speech processing
US8165887B2 (en) | Data-driven voice user interface
CN108447471A (en) | Audio recognition method and speech recognition equipment
CN112489655B (en) | Method, system and storage medium for correcting voice recognition text error in specific field
CN110415725B (en) | Method and system for evaluating pronunciation quality of second language using first language data
CA2531455A1 (en) | Improving error prediction in spoken dialog systems
CN111081229B (en) | Scoring method based on voice and related device
TWI421857B (en) | Apparatus and method for generating a threshold for utterance verification and speech recognition system and utterance verification system
CN111883122A (en) | Voice recognition method and device, storage medium and electronic equipment
CN113535925A (en) | Voice broadcasting method, device, equipment and storage medium
US20080294433A1 (en) | Automatic Text-Speech Mapping Tool
CN112216284A (en) | Training data updating method and system, voice recognition method and system, and equipment
CN108984510A (en) | By voice by the system of data input table
CN112530405B (en) | End-to-end speech synthesis error correction method, system and device
CN111563034A (en) | Method and device for generating simulation data
CN112530402B (en) | Speech synthesis method, speech synthesis device and intelligent equipment
CN102117615B (en) | Device, method and system for generating word confirmation threshold
CN111816171A (en) | Speech recognition model training method, speech recognition method and device
CN116863923A (en) | Enhanced accent voice recognition technology based on generation of countermeasure network data
CN112307757A (en) | Emotion analysis method, device and equipment based on auxiliary task and storage medium
Scheffler et al. | Speecheval – evaluating spoken dialog systems by user simulation
CN112784607B (en) | Customer intention identification method and device

Legal Events

Date | Code | Title | Description
C06 | Publication
PB01 | Publication
C10 | Entry into substantive examination
SE01 | Entry into force of request for substantive examination
C14 | Grant of patent or utility model
GR01 | Patent grant
CF01 | Termination of patent right due to non-payment of annual fee

Granted publication date: 2013-01-02

