Word	CV-like unit
Battery	b’ae(Wo)+tax(Wo)+riy(Wf)
Level	l’eh(Wo)+vaxl(Wf)
Is	’Ih(Ws)+s
Low	l’ow(Ws)
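
The notation in the table can be split mechanically into its units. The following is a minimal sketch, assuming that "+" separates units and that the parenthesised tag (Wo, Wf, Ws) is a position label; the function name `parse_units` is hypothetical and the meaning of the tags is not interpreted here.

```python
import re

# One unit looks like "tax(Wo)": a label followed by an optional position tag.
_UNIT = re.compile(r"^(?P<label>[^()+]+)(?:\((?P<tag>[A-Za-z]+)\))?$")


def parse_units(notation):
    """Split a '+'-joined notation string into (label, position_tag) pairs."""
    units = []
    for part in notation.split("+"):
        match = _UNIT.match(part.strip())
        if match is None:
            raise ValueError(f"unrecognised unit: {part!r}")
        # The tag is None when a unit carries no parenthesised label.
        units.append((match.group("label"), match.group("tag")))
    return units


battery_units = parse_units("b’ae(Wo)+tax(Wo)+riy(Wf)")
```

A unit without a tag, such as the trailing "s" in the entry for "Is", is returned with `None` in place of the position label.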

Then, method 200 proceeds to step 225, in which the frequency of each CV-like sub-word is calculated from the word frequencies and unit frequencies in the dictionary (which, according to a preferred embodiment of the invention, comprises more than 190,000 entries). Statistical studies of English text show that about 6,900 words cover about 90% of input text, and about 4,100 words cover about 85% of input text. The frequency, or number of occurrences, of each sub-word is defined as follows:

n_i = n_{1i} + n_{2i}

where n_i is the number of occurrences of the i-th sub-word, n_{1i} is the number of times that words containing the i-th sub-word appear, and n_{2i} is the number of times that the i-th sub-word occurs in the dictionary. The frequency of each sub-word can be calculated from n_i for i = 1, 2, ..., N, where N is the number of sub-words in the dictionary.
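
The count defined above can be sketched in a few lines. This is only an illustration under stated assumptions: `word_freq` (word occurrences in a text corpus) and `lexicon` (dictionary entry to sub-word list) are hypothetical inputs, and the reading of n_{1i} as the summed frequency of the words that contain sub-word i is one interpretation of the definition.

```python
from collections import Counter


def subword_counts(word_freq, lexicon):
    """Return n_i = n_1i + n_2i for every sub-word i.

    word_freq: word -> occurrences in a text corpus (hypothetical input)
    lexicon:   word -> list of sub-words for that dictionary entry
    """
    n1 = Counter()  # n_1i: occurrences of words that contain sub-word i
    n2 = Counter()  # n_2i: occurrences of sub-word i in the dictionary itself
    for word, subwords in lexicon.items():
        for sw in set(subwords):       # count each word once per sub-word
            n1[sw] += word_freq.get(word, 0)
        for sw in subwords:            # count every dictionary occurrence
            n2[sw] += 1
    return n1 + n2


# Invented toy data, for illustration only.
lexicon = {"battery": ["bae", "tax", "riy"], "level": ["leh", "vaxl"], "low": ["low"]}
word_freq = {"battery": 3, "level": 5, "low": 2}
counts = subword_counts(word_freq, lexicon)
```

Dividing each count by the sum of all counts gives the relative frequency of each sub-word.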

Finally, in step 230, the most frequently used sub-words, which cover the largest share of the expected input text, are selected. When the method is applied to English, the results of the above calculation show that 20% of the sub-words cover more than 85% of English text. Therefore, about 2,400 sub-words are selected to form the speech unit inventory. The speech waveforms associated with each sub-word are extracted from a speech corpus to form sound inventory 110. Method 200 thereby significantly reduces the redundancy in sound inventory 110.
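
The selection in step 230 can be illustrated as a simple frequency-ranked cut-off. This is a sketch, assuming that coverage is measured as the share of all sub-word occurrences; the function name `select_inventory` and the default threshold are illustrative and are not taken from the specification.

```python
def select_inventory(counts, target_coverage=0.85):
    """Pick the most frequent sub-words until the target coverage is reached.

    counts: sub-word -> occurrence count n_i
    Returns (selected sub-words, achieved coverage).
    """
    total = sum(counts.values())
    if total == 0:
        return [], 0.0
    selected, covered = [], 0
    # Highest count first; ties are broken alphabetically for determinism.
    for subword, n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if covered / total >= target_coverage:
            break
        selected.append(subword)
        covered += n
    return selected, covered / total


# Invented toy counts, for illustration only.
selected, coverage = select_inventory({"a": 50, "b": 30, "c": 15, "d": 5})
```

Only the waveforms of the selected sub-words would then need to be stored, which is where the reduction in inventory size comes from.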

Index 130 indexes all of the speech waveforms associated with each sub-word in sound inventory 110. Index 130 may comprise a simple annotation file recorded together with the speech waveforms. Index 130 is thus used to identify the phoneme strings and the single phonemes contained in the sub-word waveforms.
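
One way to picture such an index is as a per-waveform list of labelled sample ranges. The sketch below is an assumption about the layout, not the format used by the invention; `PhonemeSpan`, `INDEX`, `find_phoneme_string`, and all labels and offsets are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PhonemeSpan:
    phoneme: str
    start: int  # first sample of the phoneme inside the sub-word waveform
    end: int    # one past the last sample


# Hypothetical annotations for two sub-word waveforms.
INDEX = {
    "tax": [PhonemeSpan("t", 0, 400), PhonemeSpan("ax", 400, 1300)],
    "riy": [PhonemeSpan("r", 0, 500), PhonemeSpan("iy", 500, 1600)],
}


def find_phoneme_string(index, phonemes):
    """Return (sub-word, start, end) for the first waveform containing the run."""
    want = list(phonemes)
    if not want:
        return None
    for subword, spans in index.items():
        labels = [span.phoneme for span in spans]
        for i in range(len(labels) - len(want) + 1):
            if labels[i:i + len(want)] == want:
                return subword, spans[i].start, spans[i + len(want) - 1].end
    return None
```

With this layout, a single phoneme is simply a phoneme string of length one, so the same lookup serves both cases.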

Referring to Fig. 3, a flow chart of a speech synthesis method 300 according to the present invention is shown. Method 300 is invoked at initial step 305, for example when the user of a hand-held device receives a text message and wants it synthesized into speech. In step 310, speech synthesis system 100 receives an input text string, for example the above-mentioned text message. In step 315, the input text string is pre-processed. Pre-processing sorts the input text string into segments, each of which includes positional information relating to that segment. Then, in step 320, a segment of the input text string is compared with sound inventory 110. In step 325, it is determined whether a complete sub-word waveform in sound inventory 110 matches the current segment of the input text string. If so, method 300 proceeds to step 330, in which the matching sub-word waveform is retrieved from sound inventory 110. Next, in step 360, the sub-word waveforms are concatenated. Steps 330 and 360 are associated with the first stage of unit selector 140, and the sub-words are joined by hard concatenation performed by the first layer 160 of the two-layer synthesizer 150. Hard concatenation is described in detail below. Next, in step 335, it is determined whether the input text string has further segments to be compared with sound inventory 110. If so, method 300 returns to step 320, in which the next segment of the input text string is compared with sound inventory 110; otherwise method 300 ends at step 340.

If it is determined in step 325 that sound inventory 110 contains no complete sub-word waveform matching the current segment of the input text string, method 300 advances to step 345 to determine whether sound inventory 110 contains a plurality of phoneme string waveforms matching the current segment. If so, method 300 proceeds to step 350, in which the matching phoneme string waveforms are retrieved from sound inventory 110. Next, in step 365, the phoneme string waveforms are concatenated. Steps 350 and 365 are associated with the second stage of unit selector 140, and the phoneme strings are joined by modified concatenation performed by the second layer 170 of synthesizer 150. Modified concatenation is also described in detail below. Method 300 then returns to step 335 to determine whether the input text string has further segments to be compared with sound inventory 110.

If it is determined in step 345 that sound inventory 110 contains no phoneme string waveforms matching the current segment of the input text string, method 300 advances to step 355, in which single phoneme waveforms are retrieved from sound inventory 110. Then, in step 365, the single phoneme waveforms are concatenated so as to correspond as closely as possible to the current segment of the input text string. Here, steps 355 and 365 are associated with the third stage of unit selector 140, and the single phonemes are again joined by modified concatenation performed by the second layer 170 of synthesizer 150. Method 300 then returns to step 335 to determine whether the input text string has further segments to be compared with sound inventory 110. After all segments of the input text string have been compared with the indexed sound inventory 110, method 300 ends at step 340.
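
The three-stage fallback of steps 325 to 365 can be summarised in code. The following is a simplified sketch and not the claimed implementation: the inventory layout (three lookup tables), the segment format, the longest-match rule in stage 2, and the names `select_units` and `synthesis_plan` are all assumptions made for illustration.

```python
def select_units(segment, inventory):
    """Return (stage, units) for one segment using the three-stage fallback.

    segment:   dict with "subword" (label) and "phonemes" (list of labels)
    inventory: dict with "subwords", "strings" and "phonemes" lookup tables
    """
    # Stage 1: a complete sub-word waveform (steps 325 and 330).
    if segment["subword"] in inventory["subwords"]:
        return 1, [inventory["subwords"][segment["subword"]]]

    # Stage 2: cover the segment with indexed phoneme strings (steps 345 and 350).
    phonemes, units, pos = segment["phonemes"], [], 0
    while pos < len(phonemes):
        for length in range(len(phonemes) - pos, 1, -1):  # longest match first
            key = tuple(phonemes[pos:pos + length])
            if key in inventory["strings"]:
                units.append(inventory["strings"][key])
                pos += length
                break
        else:
            break  # no phoneme string starts at this position
    if units and pos == len(phonemes):
        return 2, units

    # Stage 3: fall back to single phoneme waveforms (step 355).
    return 3, [inventory["phonemes"][p] for p in phonemes]


def synthesis_plan(segments, inventory):
    """Pair each segment's units with the concatenation mode of its stage."""
    plan = []
    for segment in segments:
        stage, units = select_units(segment, inventory)
        # Stage 1 uses hard concatenation; stages 2 and 3 use modified concatenation.
        plan.append(("hard" if stage == 1 else "modified", units))
    return plan
```

The plan only records which layer of the synthesizer would handle each segment; the waveform joining itself is outside this sketch.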

Thus, method 300 according to the invention concatenates waveforms from sound inventory 110 on the basis of a "best fit" analysis of the segments of the input text string. The hard concatenation performed by the first layer 160 of the two-layer synthesizer 150 means that a plurality of waveforms from sound inventory 110 are simply spliced together without modification. When the concatenated waveforms are large enough that their total duration is very close to the natural speaking duration of the corresponding segment of the input text string, this process yields natural-sounding speech.

On the other hand, when hard concatenation cannot produce natural-sounding speech, modified concatenation is used. The second layer 170 of synthesizer 150 performs the modified concatenation, in which the duration of the concatenated waveform is adjusted to obtain more natural-sounding speech.

Modified concatenation can be better understood with reference to Table 2 below.

Table 2

Table 2 gives examples of ten different cases, in which the sub-word units 120 of sound inventory 110 are divided into left-hand and right-hand text. The rightmost column of Table 2 describes the type of concatenation needed to produce natural-sounding synthesized speech when the sub-word units 120 are joined. For example, case 2 in Table 2 shows that when modified concatenation is used to join two vowel waveforms from sound inventory 110, the duration of the concatenated waveform must be reduced by 25% to obtain natural-sounding speech.

Alternatively, case 9 in Table 2 shows that when two waveforms consisting of a vowel and a consonant are joined, the duration of the concatenated waveform need not be modified. The first layer 160 of synthesizer 150 therefore performs this as a hard concatenation.
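
The difference between the two kinds of concatenation can be shown with a toy example. The text does not state how the duration is adjusted, so the sketch below substitutes plain linear-interpolation resampling purely for illustration; this naive approach also shifts pitch, and a real system would more likely use a pitch-preserving time-scale technique. The names `scale_duration` and `concatenate` are hypothetical.

```python
def scale_duration(samples, factor):
    """Resample `samples` to round(len * factor) points by linear interpolation."""
    if not samples or factor <= 0:
        raise ValueError("need a non-empty waveform and a positive factor")
    new_len = max(1, round(len(samples) * factor))
    if new_len == 1 or len(samples) == 1:
        return [samples[0]] * new_len
    step = (len(samples) - 1) / (new_len - 1)
    out = []
    for k in range(new_len):
        pos = k * step
        i = min(int(pos), len(samples) - 2)  # left neighbour, kept in range
        frac = pos - i
        out.append(samples[i] * (1 - frac) + samples[i + 1] * frac)
    return out


def concatenate(left, right, duration_change=0.0):
    """Join two waveforms; duration_change=-0.25 shortens the result by 25%."""
    joined = list(left) + list(right)
    if duration_change == 0.0:
        return joined  # hard concatenation: spliced together unmodified
    return scale_duration(joined, 1.0 + duration_change)  # modified concatenation


# Toy waveforms: a 25% reduction as in case 2, and no change as in case 9.
shortened = concatenate([0.0] * 4, [1.0] * 4, duration_change=-0.25)
unchanged = concatenate([1, 2], [3])
```

In this sketch a hard concatenation is just the special case in which the duration change is zero.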

The present invention is therefore an improved method and system for speech synthesis that uses a relatively small sound inventory 110. A suitably built sound inventory 110 yields an indexed set of waveforms that can synthesize about 85% of input text strings by hard concatenation. The remaining 15% of input text strings can be synthesized using the modified concatenation technique described above. Sound inventory 110 is therefore highly compressed and contains minimally redundant waveforms, which makes it particularly suitable for hand-held devices with limited memory. Moreover, the reduced size of sound inventory 110 makes the search algorithm of the present invention faster and more efficient.

The foregoing detailed description provides only a preferred embodiment and is not intended to limit the scope, applicability, or configuration of the present invention. Rather, the detailed description of the preferred exemplary embodiment enables those skilled in the art to implement the preferred exemplary embodiment of the invention. It should be understood that various changes may be made in the function and arrangement of elements and steps without departing from the spirit and scope of the invention as set forth in the claims.

Claims

1. A speech synthesis method in a hand-held device, comprising:

Receiving an input text string;

Comparing said input text string with an indexed sound inventory, said indexed sound inventory comprising CV-like sub-word waveforms, indexed phoneme string waveforms contained in said CV-like sub-word waveforms, and indexed single phoneme waveforms contained in said CV-like sub-word waveforms;

Retrieving from said sound inventory a complete CV-like sub-word waveform corresponding to said input text string;

If no complete CV-like sub-word waveform corresponding to said input text string is retrieved, retrieving from said sound inventory indexed phoneme string waveforms corresponding to said input text string;

If no indexed phoneme string waveform corresponding to said input text string is retrieved, retrieving from said sound inventory indexed single phoneme waveforms corresponding to said input text string; and

Concatenating the retrieved waveforms to provide synthesized speech corresponding to said input text string.

2. according to the method for claim 1, also comprise the step that generates described sound inventory as follows:

To big text corpus implement a statistical study decide everyday words and

Described everyday words is divided into the position syllable.

3. according to the method for claim 2, the step of the described sound inventory of wherein said generation is further comprising the steps of:

The phoneme of each described position syllable is classified; With

Give up consonant in the syllable of described position to consonant, vowel to consonant, the described phoneme that makes up to consonant to consonant and nose last or end syllable of semivowel, with formation CV type word.

4. according to the method for claim 3, the step of the described sound inventory of wherein said generation is further comprising the steps of:

Calculate the frequency of described CV type word in described big text corpus;

Be chosen in described CV type word the most frequently used in the described big text corpus; And

From described big text corpus, extract the sound inventory that comprises described the most frequently used CV type word.

5. The method according to claim 1, wherein said step of concatenating the retrieved waveforms comprises: hard concatenating the retrieved CV-like sub-word waveforms.

6. The method according to claim 1, wherein said step of concatenating the retrieved waveforms comprises: modified concatenation of the retrieved indexed phoneme string waveforms and modified concatenation of the retrieved indexed single phoneme waveforms.

7. The method according to claim 6, wherein said modified concatenation comprises changing the duration of the concatenated waveform.

8. A system for performing speech synthesis on an input text string, comprising:

A sound inventory comprising CV-like sub-word waveforms, indexed phoneme string waveforms contained in said CV-like sub-word waveforms, and indexed single phoneme waveforms contained in said CV-like sub-word waveforms;

A multi-stage speech unit selector, connected to said sound inventory, for selecting from said sound inventory waveforms corresponding to said input text string, comprising: a first stage for selecting said CV-like sub-word waveforms, a second stage for selecting the indexed phoneme string waveforms, and a third stage for selecting the indexed single phoneme waveforms; and

A multi-layer synthesizer, connected to said multi-stage speech unit selector, for concatenating the selected waveforms to provide synthesized speech for said input text string.

9. The system according to claim 8, wherein said multi-layer synthesizer comprises: a first layer for performing hard concatenation of the selected CV-like sub-word waveforms, and a second layer for performing modified concatenation of the selected phoneme string waveforms and of the selected single phoneme waveforms, respectively.

10. The system according to claim 8, wherein said indexed phoneme string waveforms and said single phoneme waveforms contained in said CV-like sub-word waveforms are indexed using an associated annotation file.