Summary of the invention
According to one aspect, the present invention is a speech synthesis method comprising: receiving an input text string; comparing the input text string with an indexed sound inventory; retrieving from the sound inventory complete sub-word waveforms that match the input text string; retrieving from the sound inventory phoneme-string waveforms that match the input text string; retrieving from the sound inventory single-phoneme waveforms that match the input text string; and concatenating the waveforms to produce synthesized speech corresponding to the input text string.
The invention preferably includes the step of generating the sound inventory by performing a statistical analysis of a large text corpus to determine common words and dividing the common words into positional syllables.
The step of generating the sound inventory may further include the steps of classifying the positional syllables and discarding those with low clarity.
The step of generating the sound inventory may further include the steps of calculating the frequency of the CV-like units in the large text corpus and selecting the most common sub-words in the large text corpus.
The step of concatenating the waveforms may include hard concatenation of the sub-word waveforms (concatenation requiring almost no signal processing), or may include modified concatenation of the phoneme-string waveforms and the single-phoneme waveforms.
Modified concatenation preferably includes changing the duration of the concatenated waveforms.
According to another aspect, the present invention is a system for synthesizing speech from input text, comprising a sound inventory of sub-word waveforms. A multi-level unit selector is coupled to the sound inventory, and a multi-layer synthesizer is coupled to the unit selector. A level of the unit selector is selected according to whether a segment of the input text matches a sub-word waveform in the sound inventory.
The multi-layer synthesizer preferably includes a first layer for performing hard concatenation and a second layer for performing modified concatenation.
The sound inventory may include CV-like unit waveforms, which may be indexed by an annotation file.
The multi-level unit selector preferably includes a first level that can be coupled to the first layer of the multi-layer synthesizer to perform hard concatenation, and a second level and a third level that can be coupled to the second layer of the multi-layer synthesizer to perform modified concatenation.
In this specification, including the claims, the words "comprises", "comprising" and similar terms are intended to denote a non-exclusive inclusion, so that a method or apparatus comprising a list of elements does not include those elements only, but may also include other elements not listed.
Detailed description of the preferred embodiment
Referring to Fig. 1, a schematic diagram shows the functional components of a system 100 for speech synthesis according to the present invention. A sound inventory 110 comprises a plurality of sub-word components 120, for example initials, consonant endings and CV-like units. The sub-word components 120 are classified using an index 130.
The sound inventory 110 interfaces with a multi-level unit selector 140. The unit selector 140 determines which of three levels is used to synthesize the text input to the system 100. The first level of the unit selector 140 is selected when a segment of the input text string can be divided into sub-words whose corresponding waveforms are contained in the sound inventory 110. The second level of the unit selector 140 is selected when the sub-words needed to synthesize the segment are not contained in the sound inventory 110, but phoneme strings in the sound inventory 110 can be used to synthesize the segment. Finally, the third level of the unit selector 140 is selected when the segment can be synthesized only from single phonemes contained in the sound inventory 110.
The unit selector 140 interfaces with a two-layer synthesizer 150, which synthesizes the speech output by the system 100. A first layer 160 synthesizes, by hard concatenation, the sub-words received from the first level of the unit selector 140. A second layer 170 of the synthesizer 150 synthesizes, by modified concatenation, the speech components received from the second or third level of the unit selector 140. Hard concatenation and modified concatenation are described in detail later in this description. The dotted arrows in Fig. 1 indicate that speech components received from the second or third level of the unit selector 140 may also be joined by hard concatenation.
Referring to Fig. 2, a flow chart shows a method 200 of generating the sound inventory 110. In step 205, a statistical analysis is performed on a large text corpus. The analysis includes determining the words that account for a significant majority of the words in any given typical input text. For most Western languages, for example English, there are more than 150,000 words, comprising at least 41,000 positional syllables. Then, in step 210, the common words from step 205 are divided into positional syllables. A positional syllable is defined as a syllable carrying a word-position mark, as follows:
Ws: a syllable in a monosyllabic word;
Wo: a syllable in a polysyllabic word, other than the final syllable of the word; and
Wf: the final syllable of a polysyllabic word.
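The position marks defined above can be illustrated with a minimal sketch. It assumes each word has already been split into syllables; the function name `tag_positions` and the list-of-strings representation are hypothetical and are not part of the described method.

```python
def tag_positions(syllables):
    """Attach a word-position mark (Ws, Wo or Wf) to each syllable of one word.

    `syllables` is an assumed representation: the word's syllables in order.
    """
    if not syllables:
        return []
    if len(syllables) == 1:
        # Single syllable of a monosyllabic word.
        return [(syllables[0], "Ws")]
    # Every syllable except the last is marked Wo; the last is marked Wf.
    tagged = [(s, "Wo") for s in syllables[:-1]]
    tagged.append((syllables[-1], "Wf"))
    return tagged


print(tag_positions(["b'ae", "tax", "riy"]))  # syllables of "Battery"
print(tag_positions(["l'ow"]))                # syllable of "Low"
```

The two calls use the syllables that Table 1 lists for "Battery" and "Low".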
Then, method 200 proceeds to step 215, where the phonemes in each syllable are classified. Phonemes can be roughly divided into four classes: consonants, semivowels, vowels and voiced codas. The clarity of the boundaries between these classes differs. Accordingly, in step 220, phonemes with low clarity can be discarded. The speech units according to the present invention are therefore defined on the basis of syllables, and the length of a speech unit varies from one syllable to four or more syllables. This means that the following combinations can be omitted from the sound inventory 110: consonant to consonant, vowel to consonant, semivowel to consonant and nasal coda to consonant. However, the following combinations are taken into account when joining speech units: consonant to vowel, semivowel to vowel, and vowel to semivowel. The ending of a consonant string can be shared by different words. As a result, the more than 41,000 positional syllables mentioned above are reduced to only 16,000 CV-like units. Table 1 below gives an example of how the sub-word units described above are used to represent, for example, the syllables of "Battery level is low":
Table 1
Syllable conversion in " Battery level is low "
| Word | CV-like unit | 
| Battery | b’ae(Wo)+tax(Wo)+riy(Wf) | 
| Level | l’eh(Wo)+vaxl(Wf) | 
| Is | ’Ih(Ws)+s | 
| Low | l’ow(Ws) | 
Then, method 200 proceeds to step 225, where the frequency of each CV-like unit is calculated from the word frequencies and unit frequencies in a dictionary (which, according to a preferred embodiment of the invention, contains more than 190,000 entries). Statistical analysis of English text shows that about 6,900 words cover about 90% of input text, and about 4,100 words cover about 85% of input text. The frequency, or number of occurrences, of each sub-word is defined as follows:
n_i = n_{1i} + n_{2i}
where n_i is the number of occurrences of the i-th sub-word, n_{1i} is the number of occurrences of words containing the i-th sub-word, and n_{2i} is the number of times the i-th sub-word occurs in the dictionary. The frequency of each sub-word can be calculated from n_i for i = 1, 2, ..., N, where N is the number of sub-words in the dictionary.
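The sub-word count defined above can be sketched as follows. This is one possible reading under stated assumptions: the first term counts corpus occurrences of every word that contains the sub-word, and the second term counts how often the sub-word appears across dictionary entries. The function name `subword_frequencies` and the data layout (a list of corpus words and a dictionary mapping each word to its sub-word units) are hypothetical.

```python
from collections import Counter


def subword_frequencies(corpus_words, dictionary):
    """Return n_i = n_1i + n_2i for every sub-word in `dictionary`.

    corpus_words: list of words as they occur in the text corpus (assumed).
    dictionary:   maps each word to its list of sub-word units (assumed).
    """
    word_counts = Counter(corpus_words)
    n1 = Counter()  # occurrences of words that contain the sub-word
    n2 = Counter()  # occurrences of the sub-word in dictionary entries
    for word, units in dictionary.items():
        for unit in set(units):
            n1[unit] += word_counts[word]
        for unit in units:
            n2[unit] += 1
    return {unit: n1[unit] + n2[unit] for unit in n2}


dictionary = {
    "low": ["l'ow(Ws)"],
    "level": ["l'eh(Wo)", "vaxl(Wf)"],
}
corpus = ["low", "level", "low"]
print(subword_frequencies(corpus, dictionary))
```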
In step 230, the most common sub-words, which will cover most of the expected input text, are finally selected. Applied to English, the above calculation shows that 20% of the sub-words cover more than 85% of English text. Accordingly, about 2,400 sub-words are selected to form the speech unit inventory. The speech waveform associated with each sub-word is extracted from a speech corpus to form the sound inventory 110. The method 200 thus significantly reduces redundancy in the sound inventory 110.
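The selection in step 230 can be sketched as a cumulative-coverage cut-off over sub-words sorted by frequency. This is an illustrative assumption: coverage is approximated here as the share of total sub-word occurrences, and the function name `select_units` and the default threshold parameter are hypothetical.

```python
def select_units(freqs, coverage=0.85):
    """Pick the most frequent sub-words until `coverage` of all occurrences is reached.

    freqs: maps each sub-word to its occurrence count (assumed input).
    """
    total = sum(freqs.values())
    selected = []
    covered = 0
    # Most frequent first; ties are broken alphabetically for a stable result.
    for unit, count in sorted(freqs.items(), key=lambda kv: (-kv[1], kv[0])):
        if total == 0 or covered / total >= coverage:
            break
        selected.append(unit)
        covered += count
    return selected


print(select_units({"a": 70, "b": 20, "c": 5, "d": 5}))
```

In this toy input, two of the four units already account for 90% of the occurrences, so the remaining two are left out of the inventory.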
The speech waveform associated with each sub-word in the sound inventory 110 is indexed by the index 130. The index 130 may comprise a simple annotation file accompanying the recorded speech waveforms. The index 130 is thus used to identify the phoneme strings and single phonemes contained in the sub-word waveforms.
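One way such an index could expose phoneme strings and single phonemes inside a sub-word waveform is sketched below. The description does not specify the annotation format, so the `Annotation` record, its sample-offset boundaries and the `phoneme_slice` helper are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Annotation:
    """Hypothetical annotation entry for one sub-word waveform."""
    unit: str         # sub-word label, e.g. "tax(Wo)"
    phonemes: list    # phoneme labels in order
    boundaries: list  # sample offsets; one more entry than `phonemes`


def phoneme_slice(ann, waveform, first, last):
    """Return the samples covering phonemes first..last (inclusive) of the unit."""
    start = ann.boundaries[first]
    end = ann.boundaries[last + 1]
    return waveform[start:end]


ann = Annotation("tax(Wo)", ["t", "ax"], [0, 3, 8])
waveform = list(range(8))                   # stand-in for recorded samples
print(phoneme_slice(ann, waveform, 1, 1))   # single phoneme "ax"
print(phoneme_slice(ann, waveform, 0, 1))   # phoneme string "t ax"
```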
Referring to Fig. 3, a flow chart shows a speech synthesis method 300 according to the present invention. The method 300 is invoked at an initial step 305, for example when the user of a handheld device receives a text message and wants it synthesized into speech. In step 310, the speech synthesis system 100 receives an input text string, for example the text message mentioned above. In step 315, the input text string is preprocessed. The preprocessing divides the input text string into segments that include position information for each segment. Then, in step 320, a segment of the input text string is compared with the sound inventory 110. In step 325, it is determined whether a complete sub-word waveform in the sound inventory 110 matches the current segment of the input text string. If so, the method 300 proceeds to step 330, where the matching sub-word waveform is retrieved from the sound inventory 110. Next, in step 360, the sub-word waveforms are concatenated. Steps 330 and 360 are associated with the first level of the unit selector 140, and the sub-words are joined by hard concatenation performed by the first layer 160 of the two-layer synthesizer 150. Hard concatenation is described in detail below. Next, in step 335, it is determined whether any further segments of the input text string remain to be compared with the sound inventory 110. If so, the method 300 returns to step 320, where the next segment of the input text string is compared with the sound inventory 110; otherwise, the method 300 ends at step 340.
If step 325 determines that no complete sub-word waveform in the sound inventory 110 matches the current segment of the input text string, the method 300 advances to step 345 to determine whether the sound inventory 110 contains phoneme-string waveforms that match the current segment. If so, the method 300 proceeds to step 350, where the matching phoneme-string waveforms are retrieved from the sound inventory 110. Next, in step 365, the phoneme-string waveforms are concatenated. Steps 350 and 365 are associated with the second level of the unit selector 140, and the phoneme strings are joined by modified concatenation performed by the second layer 170 of the synthesizer 150. Modified concatenation is also described in detail below. The method 300 then returns to step 335 to determine whether any further segments of the input text string remain to be compared with the sound inventory 110.
If step 345 determines that the sound inventory 110 contains no phoneme-string waveforms matching the current segment of the input text string, the method 300 advances to step 355, where single-phoneme waveforms are retrieved from the sound inventory 110. Then, in step 365, the single-phoneme waveforms are concatenated to best match the current segment of the input text string. Here, steps 355 and 365 are associated with the third level of the unit selector 140, and the single phonemes are again joined by modified concatenation performed by the second layer 170 of the synthesizer 150. The method 300 then returns to step 335 to determine whether any further segments of the input text string remain to be compared with the sound inventory 110. Once all segments of the input text string have been compared with the indexed sound inventory 110, the method 300 ends at step 340.
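The three-level fallback of steps 325, 345 and 355 can be sketched as below. This is a simplified illustration, not the claimed implementation: the inventory is assumed to be three plain lookup tables, phoneme strings are matched with a greedy longest-match search (which may miss a cover that another segmentation would find), and the names `cover_with_strings` and `select_level` are hypothetical.

```python
def cover_with_strings(phonemes, strings):
    """Greedily cover `phonemes` with stored phoneme strings of length >= 2.

    Returns the matching waveforms in order, or None if no cover is found.
    """
    waveforms = []
    i = 0
    while i < len(phonemes):
        for j in range(len(phonemes), i + 1, -1):  # longest candidate first
            key = tuple(phonemes[i:j])
            if key in strings:
                waveforms.append(strings[key])
                i = j
                break
        else:
            return None  # nothing stored starts at position i
    return waveforms


def select_level(label, phonemes, inventory):
    """Return (level, concatenation type, waveforms) for one text segment."""
    if label in inventory["subwords"]:                 # steps 325 and 330
        return 1, "hard", [inventory["subwords"][label]]
    covered = cover_with_strings(phonemes, inventory["strings"])
    if covered:                                        # steps 345 and 350
        return 2, "modified", covered
    # Step 355: fall back to single-phoneme waveforms.
    return 3, "modified", [inventory["phonemes"][p] for p in phonemes]


inventory = {
    "subwords": {"l'ow(Ws)": "W_low"},
    "strings": {("t", "ax"): "W_tax", ("r", "iy"): "W_riy"},
    "phonemes": {"ih": "W_ih", "s": "W_s"},
}
print(select_level("l'ow(Ws)", ["l", "ow"], inventory))
print(select_level("unknown", ["t", "ax", "r", "iy"], inventory))
print(select_level("unknown", ["ih", "s"], inventory))
```

The returned concatenation type reflects the default routing only; as noted for Fig. 1, units from the second and third levels may also be joined by hard concatenation.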
The method 300 according to the present invention therefore concatenates waveforms from the sound inventory 110 on the basis of a "best fit" analysis of the segments of the input text string. Hard concatenation, performed by the first layer 160 of the two-layer synthesizer 150, means that waveforms from the sound inventory 110 are simply spliced together without modification. This process can yield natural-sounding speech when the concatenated waveforms are large enough that their total duration is very close to the natural spoken duration of the corresponding segment of the input text string.
On the other hand, modified concatenation is used when hard concatenation cannot produce natural-sounding speech. Modified concatenation is performed by the second layer 170 of the synthesizer 150. Here, the duration of the concatenated waveforms is adjusted to obtain more natural-sounding speech.
Modified concatenation can be better understood with reference to Table 2 below.
Table 2
Table 2 gives examples of ten different cases, in which the sub-word components 120 of the sound inventory 110 are divided into left and right text. The rightmost column of Table 2 describes the type of concatenation needed to produce natural-sounding synthesized speech when the sub-word components 120 are joined. For example, case 2 in Table 2 shows that when modified concatenation is used to join two vowel waveforms from the sound inventory 110, the duration of the concatenated waveform must be reduced by 25% to obtain natural-sounding speech.
By contrast, case 9 in Table 2 shows that when two waveforms consisting of a vowel and a consonant are joined, the duration of the concatenated waveform need not be modified. The first layer 160 of the synthesizer 150 therefore performs this hard concatenation.
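The difference between the two cases can be sketched as follows. The description does not state how the duration is changed, so this example substitutes naive linear-interpolation resampling, which also shifts pitch; a practical system would more likely use a pitch-preserving time-scale method. The function names and the `factor` parameter are hypothetical.

```python
def hard_concatenate(waveforms):
    """Splice waveforms end to end with no signal processing."""
    joined = []
    for samples in waveforms:
        joined.extend(samples)
    return joined


def scale_duration(samples, factor):
    """Naively change duration by linear interpolation (assumed technique).

    factor < 1 shortens the waveform, factor > 1 lengthens it.
    """
    n_out = max(1, round(len(samples) * factor))
    if len(samples) < 2 or n_out == 1:
        return list(samples[:n_out])
    step = (len(samples) - 1) / (n_out - 1)
    out = []
    for k in range(n_out):
        pos = k * step
        i = min(int(pos), len(samples) - 1)
        frac = pos - i
        nxt = samples[min(i + 1, len(samples) - 1)]
        out.append(samples[i] * (1 - frac) + nxt * frac)
    return out


def modified_concatenate(waveforms, factor):
    """Join waveforms, then adjust the duration of the result."""
    return scale_duration(hard_concatenate(waveforms), factor)


left = [0.0] * 4   # stand-in samples for the left unit
right = [1.0] * 4  # stand-in samples for the right unit
print(len(hard_concatenate([left, right])))            # case 9: duration unchanged
print(len(modified_concatenate([left, right], 0.75)))  # case 2: duration reduced by 25%
```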
The present invention is therefore an improved method and system for speech synthesis that uses a relatively small sound inventory 110. A suitably constructed sound inventory 110 yields an indexed set of waveforms from which about 85% of input text strings can be synthesized by hard concatenation. The remaining 15% of input text strings can be synthesized using the modified concatenation technique described above. The sound inventory 110 is therefore highly compressed and contains minimally redundant waveforms, making it particularly suitable for handheld devices with limited memory. In addition, the reduced size of the sound inventory 110 makes the search algorithm of the present invention faster and more efficient.
The foregoing detailed description provides only a preferred embodiment and is not intended to limit the scope, applicability or configuration of the invention. Rather, the detailed description of the preferred exemplary embodiment enables those skilled in the art to implement the preferred exemplary embodiment of the invention. It should be understood that various changes may be made in the function and arrangement of elements and steps without departing from the spirit and scope of the invention as set forth in the claims.