CN101014996A


Info

Publication number: CN101014996A
Application number: CNA2004800268024A
Authority: CN
Inventors: 杰拉尔德·E·科里根; 史蒂文·W·阿尔布雷布特
Original assignee: Motorola Inc
Current assignee: Motorola Solutions Inc
Priority date: 2003-09-17
Filing date: 2004-08-23
Publication date: 2007-08-08

Abstract

In a speech synthesis technique used in a network (110, 115), a set of text words is accepted by a speech engine software function (210) in a client device (105). From the set of text words, an invalid subset of text words is determined for which the text words are not in a word synthesis dictionary of the client device. The invalid subset of text words is transmitted over the network to a server device (120), which generates a set of word pronunciations including at least a portion of the text words of the invalid subset of text words and pronunciations associated with each of the text words. The client device uses the pronunciations for speech synthesis and may store them in a local word synthesis dictionary (220) stored in a memory (150) of the client device.

Description

Speech synthesis

Background

Speech synthesis, or text-to-speech (TTS) conversion, requires determining a pronunciation for each word in the text. The process that controls this conversion is called a speech engine. It typically has access to one or more pronunciation dictionaries or lexicon files, which store the pronunciations of the text words the speech engine is expected to handle. For example, one dictionary may be a common-word dictionary, while another, containing words specific to a particular software application, is supplied to the speech engine by that application when it runs. However, some words can be expected to be missing from any given set of pronunciation dictionaries, so speech engines include methods for generating pronunciations of unknown words without using a pronunciation dictionary. These methods are error-prone.

TTS is a desirable feature in many situations; two examples are a driver using a cell phone and a visually impaired person using a cell phone. TTS is therefore valuable in electronic devices with limited resources, and a pressing problem is to minimize the size of the pronunciation dictionary used in such resource-constrained devices while also minimizing the mispronunciation of unknown words.

In both examples above, the client device (a cell phone) typically operates in a wireless communication system through which it can connect to the World Wide Web. The World Wide Web Consortium (W3C) is developing a pronunciation lexicon standard (at www.w3.org/TR/lexicon-reqs) for voice applications written with tools such as VoiceXML.

Brief description of the drawings

The present invention is illustrated by way of example, and not limitation, in the accompanying drawings, in which like reference numerals indicate like elements, and in which:

Fig. 1 is an electrical block diagram of a communication system that includes a client device, in accordance with an embodiment of the present invention.

Fig. 2 is a software block diagram of a programming model of the client device of Fig. 1.

Fig. 3 is a flow chart of a speech synthesis method used in the communication system of Fig. 1.

Those skilled in the art will appreciate that elements in the drawings are illustrated for simplicity and clarity and are not necessarily drawn to scale. For example, some elements in the drawings may be enlarged relative to other elements to help improve understanding of embodiments of the present invention.

Detailed description of the embodiments

Before describing in detail the text-to-speech (TTS) conversion technique in accordance with the present invention, it should be noted that the invention resides primarily in combinations of method steps and apparatus components related to TTS conversion. Accordingly, the apparatus components and method steps are represented where appropriate by conventional symbols in the drawings, showing only those details that are pertinent to understanding the invention, so as not to obscure the disclosure with details that will be readily apparent to those of ordinary skill in the art having the benefit of this description.

Referring to Fig. 1, an electrical block diagram of a communication system is shown in accordance with an embodiment of the present invention. The communication system 100 comprises a first device 105, which is a client device in the communication system 100, such as a personal communication device, one example of which is a cell phone. The client device 105 is coupled to a wireless communication network 110, which is in turn coupled to the World Wide Web 115. The World Wide Web is, of course, an information network that primarily uses wired and optical connections, but it may include some wireless connections. A second device 120, which is a server device, is also coupled to the World Wide Web 115.

The client device 105 comprises a processor 155 that is coupled to a memory 150, a speaker 160, a network interface 165, and a user interface 170. The processor 155 may be a microprocessor, a digital signal processor, or any other processor suitable for use in the client device 105. The memory 150 stores program instructions that control the operation of the processor 155; these may use conventional instructions, implemented in a manner that provides a plurality of essentially independent functions. Some of these functions are of the kind typically classified as applications. Many of the functions may be conventional, but some of the functions described here are unique in at least some aspects. The memory 150 also stores temporary, short-term, and long-term information, for example caches and tables. The memory 150 may therefore comprise storage of different hardware types, such as random access memory, programmable read-only memory, and flash memory.

The speaker 160 may be a speaker of the kind found in conventional client devices such as cell phones. The network interface 165 may be a radio transceiver of the kind found in a cell phone or, when the client device is, for example, a Bluetooth-connected device, a Bluetooth transceiver. Alternatively, the network interface 165 may be a wireline interface for a client device that operates over a personal area network to another client device (not shown) that is connected to the World Wide Web through the wireless network 110, or it may be a wireline interface for a client device that connects directly to the World Wide Web 115. As a further alternative, the World Wide Web 115 may instead be a sizable private network, for example a corporate network that supports several thousand users in several locations. The user interface 170 may be a small or large display and a small or large keyboard.

The server device 120 is preferably a device with a large storage capacity relative to the client device 105. For example, a server typically has a very large hard disk drive or a plurality of drives (for example, 20 GB of storage).

Referring to Fig. 2, a programming model of the client device 105 is shown in accordance with the embodiment of the invention described with reference to Fig. 1. An application 205 and a word synthesis dictionary 220 are coupled to a speech engine 210. A network transmission function 225 is also coupled to the speech engine 210. The application 205 is one of several software applications that may be coupled to the speech engine 210, and it is an application that generates sets of text words to be synthesized by the speech engine 210. The speech engine 210 produces an analog signal 211 so that a sound representation can be presented using the speaker 160 of the client device 105. The speech engine 210 may have, embedded in its program instructions and data in the memory 150, a function that synthesizes a sound representation of a word directly from the letter combinations of that word. As is well known, such synthesis typically sounds quite artificial and may often be wrong, causing the user to misunderstand the words.

The word synthesis dictionary 220 is therefore provided. It may comprise a set of common words and an associated set of word pronunciations, which reduces the user's misunderstanding of words. In practice, the word synthesis dictionary 220 may comprise several sets of words combined. For example, a default set of common words and their pronunciations, which does not change across applications, may be combined with a set of words and pronunciations associated with a particular application, where the application-specific set is merged into the dictionary only while that application is running. This can be effective when the group of applications to be used with the speech engine is determined in advance. For example, a phone dialer may supply different words to the speech engine 210 than a web browser does. However, this approach can cause memory storage problems, because memory must be associated with each application to store these words and their pronunciations, along with knowledge of exactly which words are stored by default in the dictionary 220. Moreover, a word synthesis dictionary located in a client device may be quite limited by the device's memory capacity (for example, less than a gigabit).
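The combination of a default word set with application-specific sets that are present only while their application runs can be sketched as follows. This is a minimal illustration in Python, not the patent's implementation: the class name `WordSynthesisDictionary`, its methods, and the phoneme strings are all hypothetical.

```python
from collections import ChainMap

# Hypothetical default set of common words; the phoneme strings are illustrative only.
DEFAULT_WORDS = {"the": "DH AH", "number": "N AH M B ER", "is": "IH Z"}


class WordSynthesisDictionary:
    """Default word set plus per-application sets that exist only while the app runs."""

    def __init__(self, default_words):
        self._default = dict(default_words)
        self._app_sets = {}  # application name -> {word: pronunciation}

    def load_application_set(self, app_name, words):
        # Called when an application starts: merge its words into the dictionary.
        self._app_sets[app_name] = dict(words)

    def unload_application_set(self, app_name):
        # Called when an application stops: free the memory its words used.
        self._app_sets.pop(app_name, None)

    def lookup(self, word):
        # Application-specific sets are searched before the default set.
        view = ChainMap(*self._app_sets.values(), self._default)
        return view.get(word.lower())
```

A phone dialer could call `load_application_set("dialer", {"dial": "D AY AH L"})` at start-up and `unload_application_set("dialer")` on exit. The memory concern raised above shows up here as the growth of `_app_sets` with each running application.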

In one embodiment of the invention, an application may provide a set of text words (without associated pronunciations) to the word synthesis dictionary 220 in the memory 150. The set of text words may be a set that the application commonly uses, that is, words the application is expected to use within a relatively short time while it runs (for example, from less than one second to many minutes). Alternatively, the set of text words may be a set that comprises a speech text. In the context of this application, a speech text is a set of text words intended to be presented through the speaker promptly upon command. For example, the sentence "The number entered is 847-576-9999", prepared as a prompt to the user in response to the user entering a telephone number, is a speech text. The digits 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 are examples of text words that are more likely to form a set of words that an address application expects to use. Word pronunciations that are not in the word synthesis dictionary 220 of the client device are obtained remotely by the technique described below. For this purpose, the speech engine 210 is coupled to the network transmission function 225, so that words not found in the word synthesis dictionary 220 of the client device can be sent over the network.

Referring to Fig. 3, a method of speech synthesis is shown in accordance with an embodiment of the present invention. At step 305, a function associated with the word synthesis dictionary 220 (for example, the speech engine 210) accepts a set of text words, whether it is a speech text or otherwise. At step 310, the word synthesis dictionary 220 determines whether the currently configured word synthesis dictionary 220 contains pronunciations for the set of text words. The resulting subset of text words for which no pronunciation is found constitutes an invalid subset of text words (when there are one or more such words). The client device 105 then sends the invalid subset of text words over the network to the server device at step 315. In the example described above with reference to Fig. 1, the network comprises the wireless network 110 and the World Wide Web 115, but the network may also comprise a wired network and no wireless network.

The server device 120 receives the invalid subset of text words at step 320 and, at step 325, generates a set of word pronunciations for the invalid subset by referring to a large word synthesis dictionary that is within the server device 120 or accessible to it. Because it resides in a server or other computer that is typically a fixed network device, this word synthesis dictionary can be large enough (for example, greater than a gigabit) to contain nearly all the words needed by all the client devices it serves. The server device 120 preferably generates the set of word pronunciations so that it includes all the text words in the invalid subset, although the set may, of course, fail to include a given text word. Each text word in the set of word pronunciations generated by the server has an associated pronunciation. At step 330, the server sends the set of word pronunciations to the client device 105 over the network (or over multiple networks, as the case may be).
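Steps 305 through 330 can be sketched as two small functions, one for the client and one for the server. This is a hedged illustration only: the function names, the use of plain Python dictionaries for both word synthesis dictionaries, the lowercase normalization, and the phoneme strings are assumptions, and the network transfer of steps 315 to 320 and step 330 is replaced by a direct function call.

```python
def find_invalid_subset(text_words, local_dictionary):
    """Client side (steps 305-310): collect words with no local pronunciation.

    Words are lowercased (an assumption) and returned in first-seen order
    without duplicates.
    """
    invalid = []
    for word in text_words:
        key = word.lower()
        if key not in local_dictionary and key not in invalid:
            invalid.append(key)
    return invalid


def generate_pronunciations(invalid_subset, server_dictionary):
    """Server side (step 325): look up each received word in the large dictionary.

    Words the server does not know are simply omitted, so the result may cover
    zero or more words of the invalid subset.
    """
    return {w: server_dictionary[w] for w in invalid_subset if w in server_dictionary}


# Illustrative data; the phoneme strings are placeholders.
local = {"the": "DH AH", "number": "N AH M B ER"}
server = {"entered": "EH N T ER D", "is": "IH Z"}

invalid = find_invalid_subset(["The", "number", "entered", "is", "entered", "xyzzy"], local)
pronunciations = generate_pronunciations(invalid, server)
```

Here `invalid` holds the three words missing from the client dictionary, and `pronunciations` covers only the two that the server dictionary knows, which matches the "zero or more text words" wording used in the claims.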

When the client device 105 receives the set of word pronunciations at step 335, it determines at step 337 whether the set is associated with a speech text. At step 340, it determines whether that speech text has already been presented (synthesized). When the speech text has not yet been synthesized, the speech engine 210 uses the set of word pronunciations to synthesize the speech text at step 345, thereby reducing conversion errors. When step 340 finds that the speech text has already been synthesized (for example, because the delay in receiving the set of word pronunciations exceeded a specified minimum delay, or because a command to present the speech text was received before the set of word pronunciations arrived), or when step 337 determines that the set of word pronunciations is not for a speech text, the client device 105 determines at step 350 whether the set of pronunciations is to be stored in the memory 150 of the client device 105, where it serves as a supplement to the word synthesis dictionary of the client device 105. Such storage may last for a predetermined time, for example while the application that requested the set of word pronunciations is in use; it may instead be based on the capacity limit of the memory 150, or on the priority of the application together with the memory capacity limit and/or time. When the set of pronunciations is to be stored in the memory 150, it is stored at step 355. The process ends at step 360.
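The client-side decisions of steps 337 through 355 can be sketched as a single handler. This is an illustrative sketch under stated assumptions, not the patented implementation: the function name, its parameters, and the `synthesize` callback are hypothetical, and the storage policy of step 350 is reduced to a boolean supplied by the caller. The text above describes reaching step 350 from the "already synthesized" and "not a speech text" branches; the sketch additionally assumes that the storage decision is also considered after synthesis at step 345, which is consistent with the abstract but not stated explicitly in the description.

```python
def on_pronunciations_received(pronunciations, is_for_speech_text, already_synthesized,
                               store_allowed, local_dictionary, synthesize):
    """Handle a received pronunciation set; return the list of actions taken."""
    actions = []

    # Steps 337-345: use the pronunciations right away if they belong to a
    # speech text that has not been presented yet.
    if is_for_speech_text and not already_synthesized:
        synthesize(pronunciations)
        actions.append("synthesized")

    # Steps 350-355: optionally keep the pronunciations as a supplement to the
    # local word synthesis dictionary. Assumption: this check also runs after
    # synthesis, not only on the two branches named in the description.
    if store_allowed:
        local_dictionary.update(pronunciations)
        actions.append("stored")

    return actions


# Example: pronunciations arrive in time for a pending speech text and may be stored.
spoken = []
dictionary = {}
result = on_pronunciations_received({"entered": "EH N T ER D"}, True, False, True,
                                    dictionary, spoken.append)
```

In a fuller version, `store_allowed` would be computed from the factors the description lists: whether the requesting application is still in use, the remaining memory capacity, and application priority.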

It will be appreciated that the present invention provides a unique technique for supplying pronunciations of text words in a client device that has a limited word synthesis dictionary capacity (for example, less than a gigabit), thereby reducing misunderstanding errors.

In the foregoing specification, the invention and its benefits and advantages have been described with reference to specific embodiments. However, one of ordinary skill in the art will appreciate that various modifications and changes can be made without departing from the scope of the invention as set forth in the claims below. Accordingly, the specification and drawings are to be regarded as illustrative rather than restrictive, and all such modifications are intended to be included within the scope of the present invention. The benefits, advantages, and solutions to problems, and any element that may cause any benefit, advantage, or solution to occur or become more pronounced, are not to be construed as a critical, required, or essential feature or element of any or all of the claims.

As used herein, the terms "comprises", "comprising", or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, technique, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, technique, or apparatus.

A "set", as used in the following claims, means a non-empty set. The term "another", as used herein, is defined as at least a second or more. The terms "including" and/or "having", as used herein, are defined as comprising. The term "coupled", as used herein, is defined as connected, although not necessarily directly and not necessarily mechanically. The term "program", as used herein, is defined as a sequence of instructions designed for execution on a computer system. A "program" or "computer program" may include a subroutine, a function, a procedure, an object method, an object implementation, an executable application, a Java applet, a servlet, source code, object code, a shared library/dynamic load library, and/or another sequence of instructions designed for execution on a computer system.

Claims

1. A method used in a client device for speech synthesis, comprising:

accepting a set of text words;

determining an invalid subset of the set of text words, wherein the text words in the invalid subset are not in a word synthesis dictionary of the client device; and

transmitting the invalid subset of text words over a network to a server device.

2. The method according to claim 1, wherein the set of text words comprises a speech text.

3. The method according to claim 1, wherein the set of text words comprises a set of words associated with a particular application.

4. The method according to claim 1, further comprising:

receiving, over the network, a set of word pronunciations comprising zero or more text words of the invalid subset of text words, each text word in the set of word pronunciations having an associated pronunciation.

5. The method according to claim 4, further comprising:

using at least one pronunciation from the set of word pronunciations to generate a synthesis of a word in the set of text words.

6. The method according to claim 5, wherein the step of using at least one pronunciation to generate the synthesis is performed when the set of word pronunciations is received before a command to generate a synthesis of the set of text words.

7. The method according to claim 4, further comprising:

adding at least one word pronunciation from the set of word pronunciations to the word synthesis dictionary of the client device.

8. The method according to claim 7, wherein the step of adding at least one word pronunciation to the word synthesis dictionary is performed when the set of word pronunciations is received after a command to generate a synthesis of the set of text words.

9. A method used in a network for speech synthesis,

comprising, at a first device:

accepting a set of text words;

transmitting an invalid subset of text words over the network;

further comprising, at a second device:

receiving the invalid subset of text words from the first device;

generating a set of word pronunciations comprising zero or more text words of the invalid subset of text words, each text word in the set of word pronunciations having an associated pronunciation; and

transmitting the set of word pronunciations over the network to the first device; and

further comprising, at the first device:

receiving the set of word pronunciations.

10. A device for speech synthesis, comprising:

a processor;

a memory that stores program instructions that control the processor to perform the following functions:

an application function that generates a set of text words,

a local word synthesis dictionary function that stores text words and their pronunciations, and

a speech engine that accepts the set of text words and determines an invalid subset of the set of text words, wherein the local word synthesis dictionary function cannot find the text words in the invalid subset; and

a transmission function for transmitting the invalid subset of text words over a network to a server device.

11. A personal communication device comprising the device for speech synthesis according to claim 10.