CN102117615B

Info

Publication number: CN102117615B
Application number: CN2009102618864A
Authority: CN
Inventors: 林政贤; 张森嘉; 邱祺添
Original assignee: Industrial Technology Research Institute ITRI
Current assignee: Industrial Technology Research Institute ITRI
Priority date: 2009-12-31
Filing date: 2009-12-31
Publication date: 2013-01-02
Anticipated expiration: 2029-12-31
Also published as: CN102117615A

Abstract

An apparatus, method and system for generating an utterance verification threshold. Once the recognition targets are determined, a suggested threshold can be obtained according to an expected utterance verification performance, without collecting additional speech data or training additional models. First, one or more values corresponding to each of at least one speech unit are calculated. Then, when at least one speech unit sequence is received, the value or values corresponding to each speech unit in the sequence are retrieved and combined into a value distribution corresponding to the speech unit sequence. A suggested threshold is generated and output based on the expected utterance verification performance and the value distribution.

Description

Apparatus, method and system for generating an utterance verification threshold

Technical field

The present invention relates to a speech recognition system, and in particular to an apparatus and method for generating an utterance verification threshold suitable for a speech recognition system.

Background technology

Utterance verification (also called word confirmation) is an indispensable part of a speech recognition system; it can effectively reject the recognition errors caused by out-of-vocabulary (OOV) input. After computing an utterance verification score, current algorithms compare it with a threshold: a score above the threshold means verification succeeds, otherwise verification fails. In practice, the best threshold is found by collecting additional speech data and analyzing it against the expected verification performance, and most solutions try to obtain the best verification performance within this framework.

For example, as shown in Fig. 1A, a conventional speech recognition system comprises a speech recognition engine 110 and an utterance verifier 120. When a spoken command is received, for example a request for TV, a movie or music, or a command outside the recognition command set such as operating a light or a game, the speech recognition engine 110 makes its decision according to a recognition command set 112 and a speech model 114. The recognition command set 112 defines the commands for the TV, movie or music requests, and the speech model 114 supplies the speech models built for these commands to the speech recognition engine 110 as the basis for its decision. The recognition result is output to the utterance verifier 120, which computes a confidence score; the confidence score of the speech input is then compared with a threshold, as shown in decision step 130. When the confidence score is greater than the threshold, the speech input is taken to be a command in the recognition command set 112 and the corresponding response is made, for example to the TV, movie or music request. If the speech input is not a command in the recognition command set 112, for example operating a light or a game, no response is made.

As shown in Fig. 1B, the threshold is generated by collecting a large amount of speech data for the commands in the recognition command set 112 and analyzing it to produce the best threshold; for example, command set 1 yields best threshold 1 and command set 2 yields best threshold 2. All of this speech data is gathered through extensive manual work, so whenever the recognition vocabulary changes, the work must be repeated. Alternatively, when the originally set threshold does not perform as expected, the threshold can be left for the user to adjust: as shown in Fig. 1C, the user raises or lowers the threshold to find the most satisfactory setting.

These approaches limit the range of application of a speech recognition system and greatly reduce its practicality. For example, when the speech recognition system is used in an embedded system such as a system-on-a-chip (SoC), cost considerations rule out designing in a way to adjust the threshold, so the problem has to be solved by other means. As shown in Fig. 2, when an integrated circuit (IC) supplier provides ICs with a speech recognition function to a system manufacturer, the system manufacturer integrates these ICs into an embedded system. Under this arrangement, unless the IC supplier adjusts the threshold before shipping to the system manufacturer, the threshold cannot be adjusted at all.

Among the many patents on utterance verification systems, those that discuss threshold adjustment include the following United States patents.

U.S. Patent No. 5,675,706, "Vocabulary Independent Discriminative Utterance Verification For Non-Keyword Rejection In Subword Based Speech Recognition", discloses a threshold that is a predefined value. Changing this value affects two kinds of error, the false acceptance rate (false alarm rate) and the false rejection rate, and the system designer must adjust it personally to find a balance between them. By contrast, the method of the present application takes at least one recognition target and an expected utterance verification performance (such as a false acceptance rate or a false rejection rate) and obtains the threshold corresponding to that performance; the threshold is not adjusted manually by the user.
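The trade-off between the two error types can be illustrated with a minimal sketch. The function name `error_rates` and the score values are hypothetical and serve only as an illustration; acceptance is assumed to mean a score strictly greater than the threshold, as in the decision step described above.

```python
def error_rates(in_vocab_scores, out_vocab_scores, threshold):
    """Return (false_rejection_rate, false_acceptance_rate) at a threshold.

    An utterance is accepted when its score is greater than the threshold.
    """
    # In-vocabulary utterances that fail verification are false rejections.
    false_rejects = sum(1 for s in in_vocab_scores if s <= threshold)
    # Out-of-vocabulary utterances that pass verification are false acceptances.
    false_accepts = sum(1 for s in out_vocab_scores if s > threshold)
    return (false_rejects / len(in_vocab_scores),
            false_accepts / len(out_vocab_scores))


# Hypothetical verification scores, for illustration only.
in_vocab = [0.9, 0.7, 0.4, 0.2]
out_vocab = [0.5, 0.1, -0.3, -0.6]

low = error_rates(in_vocab, out_vocab, 0.0)   # lenient threshold
high = error_rates(in_vocab, out_vocab, 0.6)  # strict threshold
```

With these made-up scores, the lenient threshold rejects no in-vocabulary input but accepts half of the out-of-vocabulary input, and the strict threshold does the opposite, which is the balance a designer would otherwise have to find by hand.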

U.S. Patent No. 5,737,489, "Discriminative Utterance Verification For Connected Digits Recognition", further mentions that the threshold can be computed dynamically from data collected online, which solves the problem of setting the threshold when the recognition environment changes. Although this document does describe a way of computing the threshold, its online collection takes place while the speech recognition and utterance verification system is operating: test data from the new environment is first passed through speech recognition to obtain recognition results, and after these are analyzed, the previously preset utterance verification threshold is updated.

Taken together, the prior documents show that finding the best threshold through additional data collection and analysis is the most common practice, followed by leaving the threshold open for the user to adjust. These methods, however, all amount to first passing test data from a new environment through speech recognition to obtain recognition results and then, after analysis, updating the previously preset utterance verification threshold.

Summary of the invention

The invention provides an utterance verification threshold generating apparatus suitable for a speech recognition system. The apparatus comprises a value calculation module, a target score generator and a threshold determiner. The value calculation module calculates and stores a plurality of values corresponding to a plurality of recognition targets. The target score generator receives a speech unit sequence formed from at least one of the recognition targets and selects from the value calculation module the values corresponding to the speech unit sequence to form at least one value distribution. The threshold determiner receives the value distribution and generates a suggested threshold output according to an expected utterance verification performance and the value distribution.

The invention also provides an utterance verification threshold generating method suitable for a speech recognition system. In the method, a plurality of values corresponding to a plurality of recognition targets are calculated and stored. A speech unit sequence formed from at least one of the recognition targets is received, and the values corresponding to the speech unit sequence are selected to form a value distribution. A suggested threshold output is generated according to an expected utterance verification performance and the value distribution.

To make the above features and advantages of the invention easier to understand, embodiments are described in detail below with reference to the accompanying drawings.

Description of drawings

Fig. 1A is a schematic diagram of the system architecture of a conventional speech recognition system.

Fig. 1B and Fig. 1C are schematic diagrams of how the threshold is generated or adjusted in the speech recognition system of Fig. 1A.

Fig. 2 is a simplified schematic diagram of the flow of an IC with a speech recognition function from the manufacturer to the system integrator.

Fig. 3 is a schematic diagram of the method proposed in the present embodiment for automatically calculating the utterance verification threshold.

Fig. 4A is a block diagram of a speech recognition system according to an embodiment of the invention.

Fig. 4B is a schematic diagram of the hypothesis testing method performed by the utterance verifier on a phrase.

Fig. 5 is a block diagram of the utterance verification threshold generator of the invention.

Fig. 6A is a block diagram of an implementation example of the value calculation module according to an embodiment of the invention, and Fig. 6B is a schematic diagram of generating values.

Fig. 7 is a schematic diagram of how the data stored in the speech unit score statistics database is used in the hypothesis testing method.

Fig. 8A to Fig. 8E are verification plots for the method proposed in the present embodiment for automatically calculating the utterance verification threshold.

Fig. 9 is a block diagram of a speech recognition system according to another embodiment of the invention.

[Description of main reference numerals]

110: speech recognition engine

120: utterance verifier

112: recognition command set

114: speech model

310: command set

320: automatic analysis tool

400: speech recognition system

410: speech recognizer

420: recognition target storage unit

430: utterance verification threshold generator

440: utterance verifier

510: recognition target

520: processing-target-to-speech-unit processor

530: value calculation module

540: target score generator

550: threshold determiner

560: expected utterance verification performance

600: value calculation module

602: speech data

610: segmentation processor

620: speech unit score generator

630: segmentation model

640: utterance verification model

650: speech unit score statistics database

652: positive model (H0) of speech unit "ㄑ"

654: negative model (H1) of speech unit "ㄑ"

Embodiment

The present embodiment proposes a method of calculating an utterance verification threshold: once the recognition targets are determined, a suggested threshold can be obtained according to an expected utterance verification performance, without collecting additional speech data or training additional models.

Referring to Fig. 3, when the recognition target is defined as a command set 310, an automatic analysis tool 320 obtains a suggested threshold by analysis according to a preset condition, in a fully automatic offline process that requires no manual work. This embodiment does not obtain recognition results through speech recognition in a new environment, analyze them, and then update a previously preset utterance verification threshold. Instead, before the speech recognition system is put into use, the utterance verification performance is tuned for the specific recognition targets and a suggested threshold is obtained dynamically. This threshold is output to the utterance verifier, which uses it to make its decision and produce the verification result.

For IC design vendors, the method of the present embodiment makes the speech recognition solution more complete, and downstream manufacturers can develop speech recognition products quickly without worrying about collecting speech data. This greatly helps the adoption of speech recognition technology.

The idea of this embodiment is to predict the utterance verification threshold for the current recognition targets before speech recognition and utterance verification begin operating. The prior documents instead start with a preset threshold and then update it with speech data collected while the speech recognition system and the utterance verification module are running, which is very different from the implementation of the present application. Moreover, the present application does not analyze any data collected while the speech recognition and utterance verification system is operating; it uses only pre-existing speech data, such as the training corpus of the speech recognition system or of the utterance verification system. The approach proposed here holds that the utterance verification threshold can be predicted after the recognition vocabulary is determined and before the speech recognition system or the utterance verification module operates, with no additional data collection. This framework clearly differs from the prior documents.

Fig. 4A is a block diagram of a speech recognition system according to an embodiment of the invention. The speech recognition system 400 comprises a speech recognizer 410, a recognition target storage unit 420, an utterance verification threshold generator 430 and an utterance verifier 440. The input speech signal is sent to the speech recognizer 410 and the utterance verifier 440. The recognition target storage unit 420 stores the recognition targets and outputs them to the speech recognizer 410 and the utterance verification threshold generator 430.

The speech recognizer 410 makes its decision based on the received speech signal and the recognition targets 422, and outputs a recognition result 412 to the utterance verifier 440. Meanwhile, the utterance verification threshold generator 430 generates a threshold 432 corresponding to the recognition targets 422 and outputs it to the utterance verifier 440. The utterance verifier 440 then performs verification based on the recognition result 412 and the threshold 432 to check whether the recognition result 412 is correct, that is, whether its score is higher than the generated threshold 432.

This embodiment proposes the utterance verification threshold generator 430. As shown in the figure, the recognition targets of the speech recognizer 410 are a preset vocabulary (for example, N Chinese words), which can be read from the recognition target storage unit 420. After the speech signal passes through the recognizer, the recognition result is sent to the utterance verifier 440.

The recognition targets are also input to the utterance verification threshold generator 430. Given an expected utterance verification performance, such as a 10% false rejection rate, a suggested threshold θ_UV is obtained.

In one implementation example, the utterance verification threshold generator 430 calculates the utterance verification score with the hypothesis testing method commonly used in statistics, although the embodiment is not limited to this.

Each speech unit has a set of positive models and a set of negative models (denoted H0 and H1 respectively). After the recognition result is converted into a speech unit sequence, the corresponding positive and negative models are used to calculate a positive and a negative verification score for each unit. These are summed separately to give the positive verification score (H0 score) and the negative verification score (H1 score), from which the utterance verification score (UV score) is obtained by the following formula:

\text{UV score} = \frac{\text{H0 score} - \text{H1 score}}{T}

where T is the total number of frames of the speech signal.

Finally, the UV score is compared with the threshold θ_UV. If the UV score is greater than θ_UV, verification succeeds and the recognition result is output.
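The scoring and decision step can be sketched as follows. This is an illustrative sketch only: the function names, the tuple layout `(h0_score, h1_score, frames)` and the numbers are assumptions, and T is taken to be the sum of the per-unit frame counts produced by the alignment.

```python
def uv_score(unit_scores):
    """Compute (H0 score - H1 score) / T for one recognized utterance.

    unit_scores: list of (h0_score, h1_score, frames), one entry per
    speech unit of the aligned utterance (hypothetical layout).
    """
    h0_total = sum(h0 for h0, _, _ in unit_scores)
    h1_total = sum(h1 for _, h1, _ in unit_scores)
    total_frames = sum(frames for _, _, frames in unit_scores)
    return (h0_total - h1_total) / total_frames


def verify(unit_scores, threshold):
    """Verification succeeds when the UV score exceeds the threshold."""
    return uv_score(unit_scores) > threshold


# Hypothetical per-unit log-likelihood scores and frame counts.
units = [(-40.0, -46.0, 10), (-55.0, -61.0, 12), (-30.0, -38.0, 8)]
accepted = verify(units, 0.5)   # UV score is 20 / 30, above 0.5
```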

For the above embodiment, refer to Fig. 4B, which illustrates the hypothesis testing method performed by the utterance verifier 440 on the phrase "last". There are eight frame segments in total, t1 to t8, which form eight different hypothesis testing regions. The speech signal is aligned to these frame segments by forced alignment and segmented into the corresponding speech units "sil" (silence), "ㄑ", "一", "ㄢ", "null", "一", "ㄒ", "一ㄤ" and "sil". A positive and a negative verification score are calculated for each speech unit, shown in the figure as H0_sil and H1_sil, H0_ㄑ and H1_ㄑ, H0_一 and H1_一, H0_ㄢ and H1_ㄢ, H0_null and H1_null, H0_一 and H1_一, H0_ㄒ and H1_ㄒ, H0_一ㄤ and H1_一ㄤ, and H0_sil and H1_sil.

Finally, the scores are summed separately to give the positive verification score (H0 score) and the negative verification score (H1 score), and the utterance verification score (UV score) is obtained:

\text{UV score} = \frac{\text{H0 score} - \text{H1 score}}{T}

where T is the total number of frames of the speech signal.

In one embodiment, the utterance verification threshold generator described above has the structure shown in the block diagram of Fig. 5.

The utterance verification threshold generator 500 comprises a processing-target-to-speech-unit processor 520, a target score generator 540 and a threshold determiner 550. It also comprises a value calculation module 530, which generates values and supplies them to the target score generator 540. In one embodiment, the value calculation module 530 comprises a speech unit verification module 532 and a speech database 534. The speech database 534 stores pre-existing speech data; it may be a database with a built-in training corpus, or a storage medium into which the user loads relevant training data. The stored data may include raw audio files or speech feature parameters. The speech unit verification module 532 calculates the verification score of each speech unit from the speech database 534 and supplies it to the target score generator 540 as one or more values.

The target score generator 540 receives a speech unit sequence, obtains from the value calculation module 530 the one or more values corresponding to each speech unit in the sequence, combines them into a value distribution corresponding to the sequence, and supplies the distribution to the threshold determiner 550.

The threshold determiner 550 generates a suggested threshold output according to an expected utterance verification performance 560 and the received value distribution of the speech unit sequence. In one embodiment, the expected performance is, for example, a given 10% false rejection rate. The threshold determiner 550 finds the position in the value distribution that satisfies the condition defined by the expected performance and outputs the corresponding value as the suggested threshold.

The value calculation module 530 collects a plurality of score samples for each speech unit. For example, for speech unit pho_i there are X score samples, and their values are stored. The hypothesis testing method of the preceding embodiment is again used here as the preferred embodiment, but the invention is not limited to it.

For speech unit pho_i, there are positive and negative verification scores (denoted H0score and H1score respectively) for each sample.

Here H0score_{pho_i, sample1} denotes the first positive score sample of pho_i, H1score_{pho_i, sample1} denotes the first negative score sample of pho_i, and T_{pho_i, sample1} denotes the frame length of the first sample of pho_i.

After the utterance verification threshold generator 500 receives the recognition targets (assume W Chinese words), the processing-target-to-speech-unit processor 520 applies Chinese text-to-phone conversion to every word, converting each into a speech unit sequence Seq_i = {pho_1, ..., pho_k}, where i denotes the i-th Chinese word and k is the number of speech units in that word.
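The conversion of recognition targets into speech unit sequences can be pictured as a lexicon lookup. The sketch below is hypothetical: the patent does not specify how processor 520 is implemented, and the lexicon entries, the unit labels and the padding with "sil" at both ends (suggested by the alignment in Fig. 4B) are assumptions made for illustration.

```python
# Hypothetical pronunciation lexicon: word -> list of speech unit labels.
LEXICON = {
    "inside": ["n", "ei", "b", "u"],
    "outside": ["w", "ai", "b", "u"],
}


def to_unit_sequences(words, lexicon=LEXICON):
    """Convert each recognition target into a speech unit sequence Seq_i."""
    sequences = []
    for word in words:
        if word not in lexicon:
            raise KeyError(f"no pronunciation for {word!r}")
        # Assumption: leading and trailing silence units are included.
        sequences.append(["sil"] + lexicon[word] + ["sil"])
    return sequences


sequences = to_unit_sequences(["inside", "outside"])
```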

The resulting speech unit sequences are then input to the target score generator 540.

In the target score generator 540, for the contents of each speech unit sequence, the scores of the corresponding positive and negative models are taken from the value calculation module 530 according to a selection method (for example, random selection) and combined into a score sample X as follows:

X = \frac{H0score_{sample} - H1score_{sample}}{T_{sample}},

H0score_{sample} = H0score_{pho_1, sampleN} + \dots + H0score_{pho_k, sampleM}

H1score_{sample} = H1score_{pho_1, sampleN} + \dots + H1score_{pho_k, sampleM}

T_{sample} = T_{pho_1, sampleN} + \dots + T_{pho_k, sampleM}

where H0score_{pho_1, sampleN} and H1score_{pho_1, sampleN} are the N-th H0 and H1 score samples selected in the value calculation module 530 for the first speech unit (pho_1). Likewise, H0score_{pho_k, sampleM} and H1score_{pho_k, sampleM} are the M-th H0 and H1 score samples selected in the statistics database for the k-th speech unit (pho_k).

For each Chinese word, P utterance verification score (UV score) samples {x_1, x_2, ..., x_P} are generated to form the score sample set of that word. The score samples of all words are then pooled into the score set of the whole recognition target and input to the threshold determiner 550.
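The random combination of per-unit samples into simulated UV scores can be sketched as below. The function name, the `score_db` layout (a mapping from speech unit to a list of `(h0_score, h1_score, frames)` entries) and the sample data are assumptions for illustration; the selection method shown is the random selection mentioned as one option.

```python
import random


def sample_word_scores(unit_sequence, score_db, num_samples, rng=None):
    """Draw num_samples simulated UV scores (x_1 ... x_P) for one word.

    score_db maps a speech unit label to a list of stored
    (h0_score, h1_score, frames) entries (hypothetical layout).
    """
    rng = rng or random.Random()
    samples = []
    for _ in range(num_samples):
        h0_total = 0.0
        h1_total = 0.0
        total_frames = 0
        for unit in unit_sequence:
            # Randomly pick one stored entry for this speech unit.
            h0, h1, frames = rng.choice(score_db[unit])
            h0_total += h0
            h1_total += h1
            total_frames += frames
        samples.append((h0_total - h1_total) / total_frames)
    return samples


# Hypothetical database with one entry per unit, so the result is fixed.
db = {"a": [(-10.0, -14.0, 4)], "b": [(-20.0, -22.0, 6)]}
xs = sample_word_scores(["a", "b"], db, 3, random.Random(0))
```

Pooling the lists returned for every word then gives the score set for the whole recognition target.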

In the threshold determiner 550, the score set of the whole recognition target is compiled into a histogram and converted into a cumulative probability distribution, from which a suitable threshold θ_UV can be found. For example, the threshold at which the cumulative probability equals 0.1 is output.
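A minimal sketch of the histogram and cumulative distribution step follows. The number of bins and the choice of returning the upper edge of the bin in which the cumulative probability first reaches the target are assumptions; the text does not specify either.

```python
def suggest_threshold(scores, target_rate, num_bins=100):
    """Return the score at which the cumulative probability of the
    histogram of `scores` first reaches `target_rate`.

    num_bins and the use of the bin's upper edge are assumptions.
    """
    if not scores:
        raise ValueError("scores must not be empty")
    if not 0.0 < target_rate <= 1.0:
        raise ValueError("target_rate must be in (0, 1]")
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return lo
    width = (hi - lo) / num_bins
    counts = [0] * num_bins
    for s in scores:
        # Clamp so that the maximum score falls into the last bin.
        counts[min(int((s - lo) / width), num_bins - 1)] += 1
    cumulative = 0
    for b, count in enumerate(counts):
        cumulative += count
        if cumulative / len(scores) >= target_rate:
            return lo + (b + 1) * width
    return hi


# Hypothetical simulated UV scores, evenly spread for illustration.
scores = [i / 100 for i in range(1000)]
theta_uv = suggest_threshold(scores, 0.1)
```

Here `theta_uv` is chosen so that roughly 10% of the simulated scores fall at or below it, which corresponds to the 10% false rejection example.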

In the above embodiment, the value calculation module 530 is implemented with the speech unit verification module 532 and the speech database 534; this is an implementation example that computes the values on the fly. The value calculation module 530 may, however, use any technique capable of utterance verification, and all such techniques fall within the scope of the present embodiment. Examples include the "Word verification method and system" disclosed in Taiwan Patent Application Publication No. 200421261, and the techniques described in Hui Jiang, "Confidence measures for speech recognition: A survey", Speech Communication, 2005. In another embodiment, a speech unit score database may be used that directly outputs the corresponding values as selected, although the invention is not limited to this. The values stored in the speech unit score database are generated by receiving pre-existing speech data, segmenting it, and having a speech unit score generator produce the corresponding scores, which are then stored in the database. This embodiment is described below.

Fig. 6A and Fig. 6B illustrate an implementation example of the value calculation module. Fig. 6A is a block diagram of the implementation example, and Fig. 6B is a schematic diagram of generating values. The value calculation module 600 comprises a segmentation processor 610 and a speech unit score generator 620, and outputs the processed data to a speech unit score statistics database 650.

The speech data 602 serving as the training corpus can be taken from an existing speech database; the 500-People TRSC (Telephone Read Speech Corpus) speech database and the Shanghai Mandarin ELDA FDB 1000 speech database are among the possible sources.

With this framework, once the recognition targets are determined, a suggested threshold can be obtained according to the expected utterance verification performance, without collecting additional speech data or training additional models. The embodiment does not need to obtain recognition results through speech recognition in a new environment and then, after analysis, update a previously preset utterance verification threshold. Before the speech recognition system is put into use, the utterance verification performance is tuned for the specific recognition targets and a suggested threshold is obtained dynamically; this threshold is output to the utterance verifier, which uses it to make its decision and produce the verification result. For IC design vendors, the method of the present embodiment makes the speech recognition solution more complete, and downstream manufacturers can develop speech recognition products quickly without worrying about collecting speech data. This greatly helps the adoption of speech recognition technology.

In this method, the speech data 602 is first divided into individual speech units by the segmentation processor 610. In one embodiment, the segmentation model 630 used is the same as the model used for forced alignment in the utterance verifier.

Each speech unit is then processed by the speech unit score generator 620 to obtain a corresponding result. The scores produced by the speech unit score generator 620 are computed with a set of utterance verification models 640, which are the same as the utterance verification models used in the recognition system. The composition of the speech unit score can take different forms depending on the utterance verification method used in the speech recognition system. For example, in one embodiment where the verification method is hypothesis testing, the speech unit score consists of a positive score, computed for the unit with the positive model of that speech unit, and a negative score, computed for the unit with the negative model of that speech unit. In one embodiment, the positive and negative scores of every corpus segment corresponding to each speech unit are all stored, together with the unit length, in the speech unit score statistics database 650; this may be called the first implementation type. In another embodiment, the individual scores are not stored; instead, only statistics such as the mean and variance of the difference between the two scores divided by the length, and of the length itself, are stored in the speech unit score statistics database 650. This is the second implementation type.

Depending on the verification method, the speech unit score may instead consist of a positive score, computed for the speech unit with its own positive model, and a plurality of positive competing scores, computed with the positive model of this speech unit for all units in the training corpus other than this speech unit. For each unit, the positive scores of all corresponding corpus segments and all their corresponding positive competing scores may be stored, together with the unit length, in the speech unit score statistics database 650; this may be called the third implementation type, in which either all of the corresponding competing scores or only a subset of them may be stored. Alternatively, the competing scores may be combined by a mathematical operation such as an arithmetic mean or a geometric mean, the result subtracted from the positive score and divided by the length, and only statistics such as the mean and variance of this value and of the length stored in the speech unit score statistics database 650; this may be called the fourth implementation type.

The way the target score generator 540 of Fig. 5 operates depends on what the speech unit score statistics database 650 stores. When the database 650 stores data of the first or third implementation type, score samples are assembled by randomly selecting entries from the database 650 according to the contents of the speech unit sequence, and these samples form the score distribution of the sequence. When it stores data of the second or fourth implementation type, the means and variances in the database 650 are combined directly according to the contents of the unit sequence to form the mean and variance of the score distribution of the speech unit sequence.
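For the second and fourth implementation types, the text does not give the formula used to combine the stored means and variances. The sketch below shows one possible combination under loudly stated assumptions: unit scores are treated as independent, each unit's length is fixed at its stored mean length, the sequence score is the length-weighted average of the unit scores, and the resulting distribution is approximated as Gaussian. All names and numbers are hypothetical.

```python
from statistics import NormalDist


def sequence_score_stats(unit_sequence, unit_stats):
    """Combine per-unit statistics into a sequence-level mean and variance.

    unit_stats maps a unit label to (score_mean, score_var, mean_frames),
    where the score is the per-unit (H0 - H1) / length value.
    Assumes independent units and fixed (mean) unit lengths.
    """
    total_frames = sum(unit_stats[u][2] for u in unit_sequence)
    mean = 0.0
    var = 0.0
    for u in unit_sequence:
        mu, sigma2, frames = unit_stats[u]
        weight = frames / total_frames  # length-weighted contribution
        mean += weight * mu
        var += weight * weight * sigma2
    return mean, var


def gaussian_threshold(mean, var, false_rejection_rate):
    """Quantile of a Gaussian approximation to the score distribution."""
    return NormalDist(mean, var ** 0.5).inv_cdf(false_rejection_rate)


# Hypothetical stored statistics for two speech units.
stats = {"a": (1.0, 0.04, 10.0), "b": (2.0, 0.04, 10.0)}
mean, var = sequence_score_stats(["a", "b"], stats)
theta_uv = gaussian_threshold(mean, var, 0.1)
```

The appeal of this route is that only a few numbers per speech unit need to be stored, at the cost of the distributional assumptions listed above.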

Fig. 6B illustrates the computation in one implementation example. Referring to Fig. 6B, in the hypothesis testing method performed on the phrase "last", the utterance verification score (UV score) of speech unit "ㄑ" is obtained from the positive model (H0) 652 and the negative model (H1) 654 of speech unit "ㄑ".

After each speech unit is processed by the speech unit score generator 620, the utterance verification model 640 is used to calculate its positive (H0) and negative (H1) scores, which are stored together with the length of the speech unit in the speech unit score statistics database 650.
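Building a database of the first implementation type can be sketched as follows. The function `build_score_db`, the segment layout and the `score_fn` callback are hypothetical stand-ins for the segmentation processor 610 output and the verification models 640; the stand-in scorer below is made up purely so the example runs.

```python
from collections import defaultdict


def build_score_db(segments, score_fn):
    """Collect (h0_score, h1_score, length) entries per speech unit.

    segments: iterable of (unit_label, frames), where frames is the list
    of feature frames aligned to that unit (hypothetical layout).
    score_fn(unit_label, frames) -> (h0_score, h1_score), standing in
    for scoring with the unit's positive and negative models.
    """
    db = defaultdict(list)
    for unit, frames in segments:
        h0, h1 = score_fn(unit, frames)
        db[unit].append((h0, h1, len(frames)))
    return dict(db)


def dummy_score_fn(unit, frames):
    # Stand-in scorer for illustration only; not a real acoustic model.
    return (-1.0 * len(frames), -2.0 * len(frames))


segments = [("a", [0, 0, 0]), ("b", [0, 0]), ("a", [0])]
score_db = build_score_db(segments, dummy_score_fn)
```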

Please refer to Fig. 7, is how the data that explanation is stored in the voice unit fractional statistics database are used in the hypothesis calibration method.As shown in the figure, illustrate for example with " one " such as voice unit " sil ", " ㄑ " take words and phrases " last ", but not as limit.Each voice unit has its corresponding different phonetic unit sequence (Sequence), such as the corresponding First ray of voice unit " sil " to the N1 sequence, the corresponding First ray of voice unit " ㄑ " is to the N2 sequence, and the corresponding First ray of voice unit " " is to the N3 sequence.

When the word confirmation score (UV score) is calculated, one of the sequences corresponding to each voice unit is selected at random (Randomly Select) as the basis of the calculation. Each selected entry comprises the forward (H0) score, the reverse (H1) score and the length of the voice unit. Finally, the selected entries are summed to give the forward confirmation score (H0 score) and the reverse confirmation score (H1 score) respectively, from which the word confirmation score (UV score) is obtained.

Here, T is the total number of sound frames of the phrase "last".
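The random-selection procedure can be sketched as a small simulation. The combining formula is not reproduced in this text, so the sketch assumes that the UV score is the difference between the summed H0 and H1 scores divided by T, with T taken as the sum of the selected unit lengths. The function name `simulate_uv_scores` and the database layout (a mapping from each voice unit to a list of `(h0, h1, length)` tuples) are hypothetical.

```python
import random


def simulate_uv_scores(sequence, database, n_samples=1000, seed=0):
    """Draw simulated UV scores for one voice unit sequence.

    database maps each voice unit to a list of (h0, h1, length) entries.
    Assumption: UV score = (sum of H0 - sum of H1) / T, where T is the
    total number of frames of the randomly selected entries.
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(n_samples):
        h0_total = 0.0
        h1_total = 0.0
        frames = 0
        for unit in sequence:
            # Randomly select one stored entry for this voice unit.
            h0, h1, length = rng.choice(database[unit])
            h0_total += h0
            h1_total += h1
            frames += length
        scores.append((h0_total - h1_total) / frames)
    return scores
```

The returned list is a sample of the score distribution for the sequence; a fixed `seed` keeps the simulation reproducible.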

Several actual verification examples are described below.

Verification is performed with an existing speech database, taking the 500-People TRSC (Telephone Read Speech Corpus) speech database as an example. From this TRSC database, 9006 sentences are extracted as training sentences for the segmentation model and the word confirmation model (refer to the word confirmation model 640 and the segmentation model 630 in Fig. 6A). Segmentation and voice unit score generation are then carried out following the embodiment flow of Fig. 6A (refer to the operation of the segmentation processor 610 and the voice unit score generator 620 in Fig. 6A), finally producing the voice unit score database.

The simulated test speech data come from the Shanghai Mandarin ELDA FDB 1000 speech database, from which three test vocabulary groups are taken.

Vocabulary group (1) consists of five words, "last item, message box, operator, answering equipment, emergency call", totaling 4865 utterances;

Vocabulary group (2) consists of six words, "pound sign, inside, outside, make a phone call, catalogue, tabulation", totaling 5235 utterances;

Vocabulary group (3) consists of six words, "forward, wire back, deletion, change, cancellation, service", totaling 5755 utterances.

Each of the three vocabulary groups is processed by the word confirmation threshold generator shown, for example, in Fig. 5. The group passes through the processing-target-to-voice-unit processor 520 and the target score generator 540, which work with the numerical simulation module 530, and the threshold determiner 550 finally outputs the threshold that is found.

The final results are shown in Fig. 8A to Fig. 8E. Fig. 8A shows that different thresholds are obtained according to the requirement placed on the expected word confirmation effect, each with a different false rejection rate (False Rejection Rate) and false acceptance rate (False Alarm Rate). The curve labeled 810 is the word confirmation score distribution of the in-set vocabulary, which can be obtained by analyzing the test corpus. For illustration, a second test corpus is used here to obtain the word confirmation score distribution of out-of-set vocabulary, shown by the curve labeled 820; the recognition vocabulary of the second test corpus does not overlap with that of the first. For example, when the threshold in the figure is 0.0, the false rejection rate is 2% and the false acceptance rate is 0.2%. When the threshold is 4.1, the false rejection rate is 10% and the false acceptance rate is 0%. As the figure shows, a value on the horizontal axis can be chosen as the threshold of the confirmation score according to the in-set word confirmation score distribution 810, which gives the corresponding false rejection and false acceptance rates. In practice, this method produces a simulated word confirmation score distribution for the in-set vocabulary. After this distribution is converted into a cumulative probability distribution by histogram statistics, a suitable word confirmation score threshold can be read from it, where the corresponding cumulative probability value multiplied by 100% is the false rejection rate (%).
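The conversion from a simulated score distribution to a threshold can be sketched as follows. This is a minimal illustration under stated assumptions: an utterance is taken to be rejected when its score is below the threshold, the histogram uses equal-width bins, and the name `suggest_threshold` and the default bin count are hypothetical.

```python
def suggest_threshold(scores, target_fr_percent, n_bins=100):
    """Return the largest bin edge whose cumulative probability, expressed
    as a percentage, does not exceed the target false rejection rate.

    Assumption: a score strictly below the threshold counts as a rejection.
    """
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return lo
    width = (hi - lo) / n_bins
    # Histogram statistics over equal-width bins.
    counts = [0] * n_bins
    for s in scores:
        idx = min(int((s - lo) / width), n_bins - 1)
        counts[idx] += 1
    # Accumulate the histogram into a cumulative probability distribution.
    total = len(scores)
    cumulative = 0
    for i, count in enumerate(counts):
        cumulative += count
        if 100.0 * cumulative / total > target_fr_percent:
            # The lower edge of bin i is the last edge still within the target.
            return lo + i * width
    return hi
```

Called with the simulated in-set scores and a target such as 10 (percent), the function returns a score value below which at most about that share of the simulated scores fall.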

Among Fig. 8 B, the solid line that label 830 indicates, for score distribution confirmed in the word that uses actual testing material process identifier and word validator to count for vocabulary 1, and the dotted line that label 840 indicates then is that score distribution confirmed in the word that the outer language material (as the aforementioned TRSC) of expression use test language material set and process this method simulate.The solid line that label 832 among Fig. 8 C indicates, for expression is confirmed score distribution for the word that vocabulary 2 uses actual testing material process identifier and word validator to count, and the dotted line that label 842 indicates then is that score distribution confirmed in the word that the outer language material (as the aforementioned TRSC) of expression use test language material set and process this method simulate.The solid line that label 834 indicates among Fig. 8 D, for expression is confirmed score distribution for the word that vocabulary 3 uses actual testing material process identifier and word validator to count, and the dotted line that label 844 indicates then is that score distribution confirmed in the word that the outer language material (as the aforementioned TRSC) of expression use test language material set and process this method simulate.

After the results labeled 830, 832, 834 and 840, 842, 844 are each converted into cumulative probability distributions, the relation between word confirmation score and false rejection rate can be expressed as three sets of operating performance curves, as shown in Fig. 8E. The horizontal axis is the word confirmation score (UV score) and the vertical axis is the false rejection rate (FR% in the figure). The figure shows the performance obtained for the three vocabulary groups, where the solid lines describe the actual data and the dashed lines describe the simulation. As Fig. 8E shows, when the false rejection rate is between 0% and 20%, the error between the simulated curve and the actual curve of each vocabulary group is less than 6%, which is within a practically acceptable range.
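A comparison of this kind can be sketched as follows. The text does not state how the error between the curves is measured, so the sketch assumes it is the absolute difference in false rejection rate, in percentage points, at the same score value, restricted to scores where the actual false rejection rate does not exceed 20%. The names `false_rejection_percent` and `max_curve_error` are hypothetical.

```python
from bisect import bisect_left


def false_rejection_percent(sorted_scores, threshold):
    """Percentage of scores strictly below the threshold (empirical CDF)."""
    return 100.0 * bisect_left(sorted_scores, threshold) / len(sorted_scores)


def max_curve_error(actual, simulated, thresholds, fr_limit=20.0):
    """Largest gap between the actual and simulated false rejection curves.

    Assumption: only thresholds whose actual false rejection rate is at or
    below fr_limit are compared, and the gap is in percentage points.
    """
    actual_sorted = sorted(actual)
    simulated_sorted = sorted(simulated)
    worst = 0.0
    for t in thresholds:
        fr_actual = false_rejection_percent(actual_sorted, t)
        if fr_actual > fr_limit:
            continue
        fr_simulated = false_rejection_percent(simulated_sorted, t)
        worst = max(worst, abs(fr_actual - fr_simulated))
    return worst
```

Here `actual` would hold scores measured on a test corpus and `simulated` the scores produced by the simulation; the returned value can then be compared against an acceptance limit chosen by the user.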

Although the present invention has been disclosed above by way of embodiments, these embodiments are not intended to limit the invention. Those skilled in the art may make minor changes and modifications without departing from the spirit and scope of the present invention.

For example, the present invention may also be used in combination with a word verifier alone. As shown in Fig. 9, in this speech recognition system the word confirmation threshold generator 910 receives a word confirmation target and then produces a suggested threshold 912 for the word verifier 920. A voice signal can be input to the word verifier 920, which performs word confirmation against the confirmation target and obtains a confirmation result.

Summarizing the possible embodiments above, the recognition target and the word confirmation target are collectively referred to as the processing target. The word confirmation threshold generator proposed in this application receives one or more processing targets and outputs a suggested threshold corresponding to the processing target or targets.

The scope of protection of the present invention is therefore defined by the appended claims.

Claims

1. A device for generating a word confirmation threshold, the device comprising:

a numerical simulation module, configured to calculate one or more numerical data corresponding to at least one voice unit;

a target score generator, which receives at least one voice unit sequence, retrieves from the numerical simulation module the one or more numerical data corresponding to each voice unit in the voice unit sequence, and combines them into a numerical distribution corresponding to the voice unit sequence; and

a threshold determiner, connected to the target score generator, configured to receive the numerical distribution and to generate and output a suggested threshold according to an expected word confirmation effect and the numerical distribution,

wherein the numerical simulation module comprises:

a speech database, configured to store one or more pieces of speech data corresponding to the at least one voice unit; and

a voice unit confirmation module, which receives the speech data from the speech database, calculates one or more word confirmation scores corresponding to the voice unit, and provides them to the target score generator in the form of numerical data.

2. The device for generating a word confirmation threshold as claimed in claim 1, further comprising a processing-target-to-voice-unit processor, configured to receive a processing target, convert the processing target into the voice unit sequence, and output the voice unit sequence to the target score generator.

3. The device for generating a word confirmation threshold as claimed in claim 1, wherein the target score generator combines, by linear combination, the one or more numerical data corresponding to each voice unit in the voice unit sequence into the numerical distribution corresponding to the voice unit sequence.

4. The device for generating a word confirmation threshold as claimed in claim 1, wherein the threshold determiner maps an input condition of the expected word confirmation effect to a corresponding value of the numerical distribution, and the corresponding value is output as the suggested threshold.

5. The device for generating a word confirmation threshold as claimed in claim 4, wherein the input condition of the expected word confirmation effect is a false rejection rate.

6. The device for generating a word confirmation threshold as claimed in claim 1, wherein the format of the speech data stored in the speech database comprises one of original audio files and speech feature parameters, or both original audio files and speech feature parameters.

7. A method for generating a word confirmation threshold, the method comprising:

calculating one or more numerical data corresponding to at least one voice unit;

receiving at least one voice unit sequence, receiving the one or more numerical data corresponding to each voice unit in the voice unit sequence, and combining them into a numerical distribution corresponding to the voice unit sequence; and

generating and outputting a suggested threshold according to an expected word confirmation effect and the numerical distribution,

wherein the step of calculating the one or more numerical data corresponding to the voice unit comprises:

performing computation on the speech data of the voice unit stored in a speech database to produce a word confirmation score for each voice unit, and providing the numerical data in the form of one or more numerical values.

8. the method for critical value confirmed in generation word as claimed in claim 7, wherein also comprises transferring processing target to the voice unit sequence, so that according to this as choosing corresponding these numeric datas of this voice unit sequence, and forms this numeric distribution.

9. the method for critical value confirmed in generation word as claimed in claim 7, wherein after receiving this voice unit sequence, utilize linear combination mode will to one or more combinations of values of each voice unit in should the voice unit sequence in pairs should the voice unit sequence this numeric distribution.

10. the method for critical value confirmed in generation word as claimed in claim 7, wherein confirms an initial conditions of effect according to these expection words and phrases, corresponds to a respective value of this numeric distribution, and then this respective value then is this suggestion critical value of output.

11. the method for critical value confirmed in generation word as claimed in claim 10, wherein these expection words and phrases confirm that an initial conditions of effect is false rejection rate.

12. the method for critical value confirmed in generation word as claimed in claim 7, the form of these speech datas of wherein storing at this speech database comprise sound source document or speech characteristic parameter one of them, or sound source document and speech characteristic parameter both.

13. A system for generating a word confirmation threshold, the system comprising:

a target score generation module, which receives at least one voice unit sequence, retrieves from the numerical simulation module the one or more numerical data corresponding to each voice unit in the voice unit sequence, and combines them into a numerical distribution corresponding to the voice unit sequence; and

a threshold determination module, connected to the target score generation module, configured to receive the numerical distribution and to generate and output a suggested threshold according to an expected word confirmation effect and the numerical distribution,

wherein the numerical simulation module comprises:

a voice unit confirmation module, which receives the speech data from the speech database, calculates one or more word confirmation scores corresponding to the voice unit, and provides them to the target score generation module in the form of numerical data.

14. The system for generating a word confirmation threshold as claimed in claim 13, further comprising a processing-target-to-voice-unit processing module, configured to receive a processing target, convert the processing target into the voice unit sequence, and output the voice unit sequence to the target score generation module.

15. The system for generating a word confirmation threshold as claimed in claim 13, wherein the target score generation module combines, by linear combination, the one or more numerical data corresponding to each voice unit in the voice unit sequence into the numerical distribution corresponding to the voice unit sequence.

16. The system for generating a word confirmation threshold as claimed in claim 13, wherein the threshold determination module maps an input condition of the expected word confirmation effect to a corresponding value of the numerical distribution, and the corresponding value is output as the suggested threshold.

17. The system for generating a word confirmation threshold as claimed in claim 16, wherein the input condition of the expected word confirmation effect is a false rejection rate.

18. The system for generating a word confirmation threshold as claimed in claim 13, wherein the format of the speech data stored in the speech database comprises one of original audio files and speech feature parameters, or both original audio files and speech feature parameters.

19. A speech recognition system, comprising the device for generating a word confirmation threshold as claimed in claim 1, configured to produce a suggested threshold, according to which the speech recognition system performs confirmation and outputs a confirmation result.

20. The speech recognition system as claimed in claim 19, further comprising:

a speech recognizer, configured to receive a voice signal;

a processing target storage unit, storing a plurality of processing targets, wherein the speech recognizer reads the processing targets, makes a determination according to the voice signal and the processing targets read, and then outputs a recognition result; and

a word verifier, configured to receive the recognition result, perform confirmation with the suggested threshold, and output a confirmation result accordingly.

21. A word confirmation system, comprising the device for generating a word confirmation threshold as claimed in claim 1, configured to produce a suggested threshold, according to which the word confirmation system performs confirmation and outputs a confirmation result.

22. The word confirmation system as claimed in claim 21, further comprising:

a processing target storage unit, storing a processing target; and

a word verifier, configured to receive a voice signal, read the processing target, compare the voice signal with the processing target read, perform confirmation with the suggested threshold, and output a confirmation result accordingly.