CN1604185B - Voice synthesizing system and method by utilizing length variable sub-words - Google Patents

Voice synthesizing system and method by utilizing length variable sub-words

Info

Publication number
CN1604185B
Authority
CN
China
Prior art keywords
waveform
input text
text string
string
indexed
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN 03164848
Other languages
Chinese (zh)
Other versions
CN1604185A (en)
Inventor
祖漪清
陈桂林
俞振利
岳东剑
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Cerence Operating Company
Original Assignee
Motorola Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Motorola Inc
Priority to CN 03164848
Publication of CN1604185A
Application granted
Publication of CN1604185B
Anticipated expiration
Expired - Fee Related


Abstract

A system and method for synthesizing speech from input text. The method comprises the following steps: receiving an input text string; comparing the input text string with an indexed sound inventory; retrieving from the sound inventory complete sub-word waveforms corresponding to the input text string; retrieving from the sound inventory phoneme-string waveforms corresponding to the input text string; retrieving from the sound inventory single-phoneme waveforms corresponding to the input text string; concatenating the waveforms; and generating synthesized speech corresponding to the input text string.

Description

Speech synthesis system and method utilizing variable-length sub-words
Technical field
The present invention relates generally to a method and system for speech synthesis using a relatively small sound inventory. The invention is particularly suited to, but not limited to, speech synthesis on hand-held devices such as mobile phones and personal digital assistants.
Background technology
A well-known, sophisticated speech synthesis technique is the concatenative method. It uses actual recordings of spoken utterances stored in a pronunciation database. Parts of the utterances are recombined, or concatenated, to generate various spoken phrases. The recombined parts may be complete words, word segments, or even units smaller than a single syllable. The larger the segments that are concatenated, the more natural the resulting synthesized speech sounds. However, using larger segments requires a large memory to store the audio data if the audio database is to support synthesis of a reasonably large vocabulary.
The size of the audio database can be reduced by storing only smaller segments, such as diphones or single phones; however, the quality of the resulting synthesized speech usually drops as well. This is because it is difficult to form correct intonation and transition durations between very short speech segments, and therefore difficult to produce natural-sounding speech. Sophisticated techniques exist for analyzing small phoneme-chain units such as CV and VCV (where C denotes a consonant and V a vowel). However, the algorithms that implement these techniques are very complex and require a more powerful processor.
Other methods for reducing the size of the audio database of a speech synthesis system include the technique known as formant synthesis. With formant synthesis an audio database is no longer needed, because the human voice is simulated using only a filtered electronic excitation signal. However, the resulting synthesized speech usually sounds very unnatural and "robotic".
The popularity of portable electronic devices such as mobile phones and personal digital assistants (PDAs) has increased the demand for high-quality speech synthesizers. A hand-held device with a built-in speech synthesizer is considerably more convenient. For example, e-mail and text messages such as SMS messages can be synthesized into speech and listened to by the mobile phone user. However, the storage and processing resources of such hand-held devices are usually very limited, so a speech synthesizer built into such a device must use a compact and efficient audio database.
There is therefore a need for an improved speech synthesis method and system that uses a compact audio database while still producing natural-sounding speech.
Summary of the invention
According to one aspect, the present invention is a method of speech synthesis comprising: receiving an input text string; comparing the input text string with an indexed sound inventory; retrieving from the sound inventory complete sub-word waveforms corresponding to the input text string; retrieving from the sound inventory phoneme-string waveforms corresponding to the input text string; retrieving from the sound inventory single-phoneme waveforms corresponding to the input text string; and concatenating the waveforms to produce synthesized speech corresponding to the input text string.
The invention preferably includes the step of generating the sound inventory by performing a statistical analysis of a large text corpus to determine common words and dividing the common words into positional syllables.
The step of generating the sound inventory may further include classifying the syllables of the positional syllables and discarding syllables with low clarity.
The step of generating the sound inventory may further include calculating the frequency of the CV-like units in the large text corpus and selecting the most common sub-words in the large text corpus.
The step of concatenating the waveforms may include hard concatenation of the sub-word waveforms (concatenation that requires almost no signal processing), or may include modified concatenation of the phoneme-string waveforms and the single-phoneme waveforms.
Modified concatenation preferably includes changing the duration of the concatenated waveforms.
According to another aspect, the present invention is a system for synthesizing speech from input text, comprising a sound inventory containing sub-word waveforms. A multi-level speech-unit selector is coupled to the sound inventory, and a multi-layer synthesizer is coupled to the speech-unit selector. A level of the speech-unit selector is selected according to whether a segment of the input text matches a sub-word waveform in the sound inventory.
The multi-layer synthesizer preferably comprises a first layer for performing hard concatenation and a second layer for performing modified concatenation.
The sound inventory may contain CV-like unit waveforms, and the CV-like unit waveforms may be indexed by an annotation file.
The multi-level speech-unit selector preferably comprises a first level that can be coupled to the first layer of the multi-layer synthesizer to perform hard concatenation, and a second level and a third level that can be coupled to the second layer of the multi-layer synthesizer to perform modified concatenation.
In this specification, including the claims, the words "comprises", "comprising" and similar terms are intended to denote a non-exclusive inclusion, so that a method or apparatus comprising a list of elements does not include only those elements but may also include other elements not listed.
Description of drawings
To make the invention easy to understand and put into practice, a preferred embodiment will now be described with reference to the accompanying drawings, in which like reference numerals denote like elements, and in which:
Fig. 1 is a schematic diagram of the functional components of a speech synthesis system according to the invention;
Fig. 2 is a flow chart showing how a sound inventory is generated according to the invention; and
Fig. 3 is a flow chart of a speech synthesis method according to the invention.
Detailed description of the preferred embodiment
Referring to Fig. 1, there is shown a schematic diagram of the functional components of a speech synthesis system 100 according to the present invention. A sound inventory 110 comprises a plurality of sub-word components 120, for example initials, consonant endings, and CV-like units. The sub-word components 120 are classified using an index 130.
The sound inventory 110 interfaces with a multi-level unit selector 140. The unit selector 140 determines which of its three levels is used to synthesize the text input to the system 100. The first level of the unit selector 140 is selected when a segment of the input text string can be divided into sub-words whose corresponding waveforms are contained in the sound inventory 110. The second level is selected when the sub-words needed to synthesize the segment are not contained in the sound inventory 110, but phoneme strings in the sound inventory 110 can be used to synthesize the segment. Finally, the third level is selected when the segment can be synthesized only from single phonemes contained in the sound inventory 110.
The unit selector 140 interfaces with a two-layer synthesizer 150, which synthesizes the speech output by the system 100. A first layer 160 performs hard concatenation of the sub-words from the first level of the unit selector 140. A second layer 170 of the synthesizer 150 performs modified concatenation of the speech components received from the second or third level of the unit selector 140. Hard concatenation and modified concatenation are described in detail later in this description. The dotted arrow in Fig. 1 indicates that speech components received from the second or third level of the unit selector 140 can also be joined using hard concatenation.
Referring to Fig. 2, there is shown a flow chart of a method 200 for generating the sound inventory 110. In step 205, a statistical analysis is performed on a large text corpus. The analysis includes determining the words that make up a significant majority of the words in any given typical input text. Among Western languages, English, for example, has more than 150,000 words comprising at least 41,000 positional syllables. Then, in step 210, the common words from step 205 are divided into positional syllables. A positional syllable is defined as a syllable with a word-position mark, as follows:
Ws: the syllable of a monosyllabic word;
Wo: a syllable of a polysyllabic word other than the final syllable; and
Wf: the final syllable of a polysyllabic word.
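The three word-position marks can be illustrated with a minimal Python sketch. The function name `tag_positional_syllables` and the representation of a word as a list of syllable strings are assumptions made for this illustration only; the patent does not specify how a word is split into syllables.

```python
def tag_positional_syllables(syllables):
    """Attach a word-position mark to each syllable of one word.

    Ws: the only syllable of a monosyllabic word
    Wo: a non-final syllable of a polysyllabic word
    Wf: the final syllable of a polysyllabic word
    """
    if not syllables:
        return []
    if len(syllables) == 1:
        return [(syllables[0], "Ws")]
    # Every syllable except the last is marked Wo; the last is marked Wf.
    tagged = [(syllable, "Wo") for syllable in syllables[:-1]]
    tagged.append((syllables[-1], "Wf"))
    return tagged
```

With the syllables of "Battery" given as `["bae", "tax", "riy"]`, the sketch marks them Wo, Wo and Wf, which agrees with the marks shown for that word in Table 1.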
The method 200 then proceeds to step 215, where the phonemes in each syllable are classified. Phonemes fall roughly into four classes: consonants, semivowels, vowels, and voiced codas. The degree of clarity between the classes differs, so in step 220 phonemes with low clarity can be discarded. Accordingly, a speech unit according to the invention is defined on the basis of syllables, and the length of a speech unit varies from one syllable to four or more syllables. This means that the following combinations can be omitted from the sound inventory 110: consonant to consonant, vowel to consonant, semivowel to consonant, and nasal coda to consonant. However, the following combinations are taken into account when joining speech units: consonant to vowel, semivowel to vowel, and vowel to semivowel. The endings of consonant strings can be shared by different words. As a result, the more than 41,000 positional syllables mentioned above are reduced to only 16,000 CV-like units. Table 1 below gives an example showing how the sub-word units described above represent the syllables in, for example, "Battery level is low":
Table 1
Syllable conversion in "Battery level is low"

Word       CV-like unit
Battery    b’ae(Wo)+tax(Wo)+riy(Wf)
Level      l’eh(Wo)+vaxl(Wf)
Is         ’Ih(Ws)+s
Low        l’ow(Ws)
The method 200 then proceeds to step 225, where the frequency of the CV-like units is calculated from the word frequencies and unit frequencies in a dictionary (which, in a preferred embodiment of the invention, contains more than 190,000 entries). Statistical analysis of English text shows that about 6,900 words cover about 90% of input text and about 4,100 words cover about 85% of input text. The frequency, or number of occurrences, of each sub-word is defined as follows:
$n_i = n_{1i} + n_{2i}$
where $n_i$ is the number of occurrences of the i-th sub-word, $n_{1i}$ is the number of occurrences of words containing the i-th sub-word, and $n_{2i}$ is the number of times the i-th sub-word occurs in the dictionary. The frequency of each sub-word can be calculated for $n_i$, $i = 1, 2, \ldots, N$ (where N is the number of sub-words in the dictionary).
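The count defined above can be sketched in Python as follows. This is one possible reading, not the patent's implementation: it assumes that the first term is obtained from word counts in a text corpus and the second from a dictionary that lists the sub-words of each word. The names `subword_counts`, `corpus_word_counts` and `lexicon` are hypothetical.

```python
from collections import Counter


def subword_counts(corpus_word_counts, lexicon):
    """Compute n_i = n_1i + n_2i for every sub-word in the lexicon.

    corpus_word_counts: word -> number of occurrences of the word (assumed source of n_1i)
    lexicon: word -> list of the sub-words that make up the word (assumed source of n_2i)
    """
    n1 = Counter()  # occurrences of words that contain the sub-word
    n2 = Counter()  # occurrences of the sub-word in the dictionary itself
    for word, units in lexicon.items():
        for unit in units:
            n2[unit] += 1
        # A word counts once per distinct sub-word it contains.
        for unit in set(units):
            n1[unit] += corpus_word_counts.get(word, 0)
    return n1 + n2
```

For a toy lexicon in which "low" appears in two words, one of which occurs five times in the corpus, the sketch gives n = 5 + 2 = 7 for that sub-word.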
In step 230, the most commonly used sub-words, those that will cover most of the expected input text, are finally selected. Applied to English, the calculation above shows that 20% of the sub-words cover more than 85% of English text. Accordingly, about 2,400 sub-words are selected to form the speech-unit inventory. The speech waveform associated with each sub-word is extracted from a speech corpus to form the sound inventory 110. The method 200 thus significantly reduces redundancy in the sound inventory 110.
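The selection in step 230 can be approximated by a greedy cut-off on cumulative frequency. The Python sketch below is an illustration under stated assumptions: coverage is measured over sub-word occurrences rather than over running text, and `select_inventory` and `target_coverage` are hypothetical names that do not come from the patent.

```python
def select_inventory(counts, target_coverage):
    """Pick the most frequent sub-words until their share of all occurrences
    reaches target_coverage (a fraction between 0 and 1).

    Returns the selected sub-words and the coverage actually reached.
    """
    total = sum(counts.values())
    if total == 0:
        return [], 0.0
    selected = []
    covered = 0
    # Most frequent first; ties are broken alphabetically for a stable result.
    for unit, n in sorted(counts.items(), key=lambda item: (-item[1], item[0])):
        if covered / total >= target_coverage:
            break
        selected.append(unit)
        covered += n
    return selected, covered / total
```

With counts of 50, 30, 15 and 5 and a target of 0.85, the first three sub-words are selected, because the first two together reach only 0.80.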
The speech waveform associated with each sub-word in the sound inventory 110 is indexed by the index 130. The index 130 may comprise a simple annotation file accompanying the recorded speech waveforms. The index 130 is thus used to identify the phoneme strings and single phonemes contained in the sub-word waveforms.
Referring to Fig. 3, there is shown a flow chart of a speech synthesis method 300 according to the present invention. The method 300 is invoked at an initial step 305, for example when the user of a hand-held device receives a text message and wants it synthesized into speech. In step 310, the speech synthesis system 100 receives an input text string, for example the text message mentioned above. In step 315, the input text string is preprocessed. Preprocessing divides the input text string into segments that include position information for each segment. Then, in step 320, a segment of the input text string is compared with the sound inventory 110. In step 325, it is determined whether a complete sub-word waveform in the sound inventory 110 matches the current segment of the input text string. If so, the method 300 proceeds to step 330 and retrieves the matching sub-word waveform from the sound inventory 110. Next, in step 360, the sub-word waveforms are concatenated. Steps 330 and 360 correspond to the first level of the unit selector 140, and the sub-words are joined by hard concatenation performed by the first layer 160 of the two-layer synthesizer 150. Hard concatenation is described in detail below. Next, in step 335, it is determined whether the input text string has further segments to compare with the sound inventory 110. If so, the method 300 returns to step 320, where the next segment of the input text string is compared with the sound inventory 110; otherwise the method 300 ends at step 340.
If it is determined in step 325 that no complete sub-word waveform in the sound inventory 110 matches the current segment of the input text string, the method 300 advances to step 345 to determine whether the sound inventory 110 contains multiple phoneme-string waveforms that match the current segment. If so, the method 300 proceeds to step 350 and retrieves the matching phoneme-string waveforms from the sound inventory 110. Next, in step 365, the phoneme-string waveforms are concatenated. Steps 350 and 365 correspond to the second level of the unit selector 140, and the phoneme strings are joined by modified concatenation performed by the second layer 170 of the synthesizer 150. Modified concatenation is also described in detail below. The method 300 then returns to step 335 to determine whether the input text string has further segments to compare with the sound inventory 110.
If it is determined in step 345 that the sound inventory 110 contains no phoneme-string waveforms matching the current segment of the input text string, the method 300 advances to step 355 and retrieves single-phoneme waveforms from the sound inventory 110. Then, in step 365, the single-phoneme waveforms are concatenated so as to correspond as closely as possible to the current segment of the input text string. Here, steps 355 and 365 correspond to the third level of the unit selector 140, and the single phonemes are again joined by modified concatenation performed by the second layer 170 of the synthesizer 150. The method 300 then returns to step 335 to determine whether the input text string has further segments to compare with the sound inventory 110. Once all segments of the input text string have been compared with the indexed sound inventory 110, the method 300 ends at step 340.
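The three-level fallback of method 300 can be summarized in a short Python sketch. It is a simplified illustration, not the patented implementation: a segment is assumed to be a sequence of phoneme labels, the three lookup tables are assumed to be plain dictionaries, and the greedy longest-match cover used for the second level is a swapped-in simplification that can miss covers a full search would find. All names are hypothetical.

```python
# Synthesizer layer assumed for each selector level (first layer: hard joining,
# second layer: duration-modified joining).
LAYER_FOR_LEVEL = {1: "hard", 2: "modified", 3: "modified"}


def cover_with_strings(key, strings):
    """Greedily cover key with stored phoneme strings of two or more phonemes.

    Returns the list of pieces, or None if some position cannot be covered.
    """
    pieces = []
    i = 0
    while i < len(key):
        for j in range(len(key), i + 1, -1):  # longest candidate first
            if key[i:j] in strings:
                pieces.append(key[i:j])
                i = j
                break
        else:
            return None
    return pieces


def select_units(segment, subwords, strings, phonemes):
    """Return (level, waveforms) for one segment of the input text."""
    key = tuple(segment)
    if key in subwords:                       # level 1: complete sub-word
        return 1, [subwords[key]]
    pieces = cover_with_strings(key, strings)
    if pieces is not None:                    # level 2: phoneme strings
        return 2, [strings[piece] for piece in pieces]
    # Level 3: single phonemes; a missing phoneme raises KeyError in this sketch.
    return 3, [phonemes[p] for p in key]
```

A caller would look up `LAYER_FOR_LEVEL[level]` to decide whether the returned waveforms are spliced directly or joined with a duration adjustment.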
Thus the method 300 according to the invention concatenates waveforms from the sound inventory 110 based on a "best fit" analysis of the segments of the input text string. Hard concatenation, performed by the first layer 160 of the two-layer synthesizer 150, means that waveforms from the sound inventory 110 are simply spliced together without modification. This process yields natural-sounding speech when the concatenated waveforms are large enough that their total duration is very close to the natural speaking duration of the corresponding segment of the input text string.
On the other hand, when hard concatenation cannot produce natural-sounding speech, modified concatenation is used. The second layer 170 of the synthesizer 150 performs modified concatenation, in which the duration of the concatenated waveforms is adjusted to obtain more natural-sounding speech.
Modified concatenation can be better understood with reference to Table 2 below.
Table 2
Table 2 gives examples of ten different cases, in which the sub-word components 120 of the sound inventory 110 are divided into left and right text. The rightmost column of Table 2 describes the type of concatenation needed to produce natural-sounding synthesized speech when the sub-word components 120 are joined. For example, case 2 in Table 2 shows that when modified concatenation is used to join two vowel waveforms from the sound inventory 110, the duration of the concatenated waveform must be reduced by 25% to obtain natural-sounding speech.
Alternatively, case 9 in Table 2 shows that when two waveforms consisting of a vowel and a consonant are joined, the duration of the concatenated waveform need not be modified. The first layer 160 of the synthesizer 150 therefore performs this join as a hard concatenation.
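The difference between the two kinds of joining can be shown with a small Python sketch. The source states only that the duration of the joined waveform is changed, not how. The linear-interpolation resampling used here is therefore an assumption chosen for brevity; it also shifts pitch, so a practical system would more likely use a pitch-preserving time-scale method. Waveforms are assumed to be lists of samples, and all names are hypothetical.

```python
def hard_concatenate(waveforms):
    """Splice the waveforms end to end without any modification."""
    joined = []
    for waveform in waveforms:
        joined.extend(waveform)
    return joined


def modified_concatenate(waveforms, duration_scale):
    """Splice the waveforms, then rescale the total duration.

    duration_scale=0.75 shortens the result by 25%, as in case 2 of Table 2.
    The rescaling is plain linear interpolation (an assumption of this sketch).
    """
    joined = hard_concatenate(waveforms)
    if len(joined) < 2:
        return list(joined)
    target_len = max(2, round(len(joined) * duration_scale))
    step = (len(joined) - 1) / (target_len - 1)
    result = []
    for k in range(target_len):
        position = k * step
        low = int(position)
        high = min(low + 1, len(joined) - 1)
        fraction = position - low
        # Interpolate between the two neighbouring input samples.
        result.append(joined[low] * (1 - fraction) + joined[high] * fraction)
    return result
```

Calling `modified_concatenate` with a scale of 0.75 on two four-sample waveforms returns six samples, while `hard_concatenate` on the same input returns all eight unchanged.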
Thus the present invention is an improved method and system for speech synthesis that uses a relatively small sound inventory 110. A properly built sound inventory 110 provides an indexed set of waveforms that can synthesize about 85% of input text strings by hard concatenation. The remaining 15% of input text strings can be synthesized using the modified concatenation technique described. The sound inventory 110 is therefore highly compact and contains minimal redundant waveforms, which makes it particularly suitable for hand-held devices with limited memory. In addition, the reduced size of the sound inventory 110 makes the search algorithm of the invention faster and more efficient.
The foregoing detailed description provides only a preferred embodiment and is not intended to limit the scope, applicability, or configuration of the invention. Rather, the detailed description of the preferred exemplary embodiment enables those skilled in the art to implement a preferred exemplary embodiment of the invention. It should be understood that various changes may be made to the function and arrangement of elements and steps without departing from the spirit and scope of the invention as set forth in the claims.

Claims (10)

Application number: CN 03164848
Priority date: 2003-09-29
Filing date: 2003-09-29
Title: Voice synthesizing system and method by utilizing length variable sub-words
Status: Expired - Fee Related

Publications (2)

Publication number    Publication date
CN1604185A (en)       2005-04-06
CN1604185B (en)       2010-05-26

Family ID: 34660846

Country status (1)

Country    Link
CN (1)     CN1604185B (en)

Legal Events

Code    Title and description

C06     Publication
PB01    Publication
C10     Entry into substantive examination
SE01    Entry into force of request for substantive examination
C14     Grant of patent or utility model
GR01    Patent grant
ASS     Succession or assignment of patent right
        Owner name: NUANCE COMMUNICATIONS CO., LTD.
        Free format text: FORMER OWNER: MOTOROLA INC.
        Effective date: 20100908
C41     Transfer of patent application or patent right or utility model
COR     Change of bibliographic data
        Free format text: CORRECT: ADDRESS; FROM: ILLINOIS, UNITED STATES TO: MASSACHUSETTS, UNITED STATES
TR01    Transfer of patent right
        Effective date of registration: 20100908
        Address after: Massachusetts, USA
        Patentee after: Nuance Communications, Inc.
        Address before: Illinois, USA
        Patentee before: Motorola, Inc.
TR01    Transfer of patent right
        Effective date of registration: 20200923
        Address after: Massachusetts, USA
        Patentee after: Cerence Operating Company
        Address before: Massachusetts, USA
        Patentee before: Nuance Communications, Inc.
CF01    Termination of patent right due to non-payment of annual fee
        Granted publication date: 20100526
