CN101014996A - Speech synthesis - Google Patents

Speech synthesis
Info

Publication number
CN101014996A
Authority
CN
China
Prior art keywords
literal
text
pronunciation
invalid
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CNA2004800268024A
Other languages
Chinese (zh)
Inventor
杰拉尔德·E·科里根
史蒂文·W·阿尔布雷布特
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Motorola Solutions Inc
Original Assignee
Motorola Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Motorola Inc
Publication of CN101014996A

Abstract

In a speech synthesis technique used in a network (110, 115), a set of text words is accepted by a speech engine software function (210) in a client device (105). From the set of text words, an invalid subset of text words is determined for which the text words are not in a word synthesis dictionary of the client device. The invalid subset of text words is transmitted over the network to a server device (120), which generates a set of word pronunciations including at least a portion of the text words of the invalid subset of text words and pronunciations associated with each of the text words. The client device uses the pronunciations for speech synthesis and may store them in a local word synthesis dictionary (220) stored in a memory (150) of the client device.

Description

Speech synthesis
Background
Speech synthesis, or text-to-speech (TTS) conversion, requires determining a pronunciation for each word in the text. The process that controls this conversion is called a speech engine. It typically has access to one or more pronunciation dictionaries or lexicon files that store pronunciations of the text words the speech engine is expected to handle. For example, one dictionary may be a common-word dictionary, while for words specific to a particular software application, the application supplies another dictionary to the speech engine when it runs. Some words can nevertheless be expected to be missing from any given set of pronunciation dictionaries, so speech engines also include methods for generating pronunciations of unknown words without using a pronunciation dictionary. These methods are error-prone.
TTS is a highly desirable feature in many situations; two examples are a driver using a cellular telephone and a visually impaired person using a cellular telephone. TTS is therefore valuable in electronic devices with limited resources, and the problem to be solved is to minimize the size of the pronunciation dictionaries used in such resource-constrained devices while also minimizing mispronunciation of unknown words.
Both examples above involve client devices (cellular telephones), which typically operate in a wireless communication system through which the client device can connect to the World Wide Web. The World Wide Web Consortium (W3C) is developing a pronunciation lexicon standard for voice applications written with tools such as VoiceXML (see www.w3.org/TR/lexicon-reqs).
Brief description of the drawings
The present invention is illustrated by way of example and is not limited by the accompanying figures, in which like reference numerals indicate similar elements, and in which:
Fig. 1 is an electrical block diagram of a communication system, including a client device, according to an embodiment of the invention.
Fig. 2 is a software block diagram of the programming model of the client device of Fig. 1.
Fig. 3 is a flow chart of a speech synthesis method used in the communication system of Fig. 1.
Those skilled in the art will appreciate that elements in the figures are illustrated for simplicity and clarity and are not necessarily drawn to scale. For example, some elements may be enlarged relative to others to aid understanding of embodiments of the invention.
Detailed description
Before the text-to-speech (TTS) conversion technique of the present invention is described in detail, it should be noted that the invention resides primarily in combinations of method steps and apparatus components related to TTS conversion. Accordingly, the apparatus components and method steps are represented where appropriate by conventional symbols in the drawings, which show only those details pertinent to understanding the invention, so as not to obscure the disclosure with details that will be readily apparent to those of ordinary skill in the art having the benefit of this description.
Referring to Fig. 1, an electrical block diagram of a communication system 100 according to an embodiment of the invention is shown. The communication system 100 includes a first device 105, which is a client device in the communication system 100, such as a personal communication device, one example of which is a cellular telephone. The client device 105 is coupled to a wireless communication network 110, which in turn is coupled to the World Wide Web 115, an information network that mainly uses wired and optical connections but may include some wireless connections. A second device 120, which is a server device, is also coupled to the World Wide Web 115.
The client device 105 includes a processor 155 coupled to a memory 150, a speaker 160, a network interface 165, and a user interface 170. The processor 155 may be a microprocessor, a digital signal processor, or any other processor suitable for use in the client device 105. The memory 150 stores program instructions that control the operation of the processor 155; these may use conventional instructions and be organized to provide a number of essentially independent functions. Some of these functions are of the kind typically classified as applications. Many of the functions may be conventional, but some described here are unique at least in certain respects. The memory 150 also stores temporary, short-term, and long-term information, for example in caches and tables. The memory 150 may therefore comprise storage of different hardware types, such as random access memory, programmable read-only memory, and flash memory. The speaker 160 may be a speaker of the kind found in conventional client devices such as cellular telephones. The network interface 165 may be a radio transceiver of the kind found in a cellular telephone or, when the client device is, for example, a Bluetooth-connected device, a Bluetooth transceiver. Alternatively, the network interface 165 may be a wireline interface for a client device that operates over a personal area network to another client device (not shown) connected to the World Wide Web through the wireless network 110, or a wireline interface for a client device connected directly to the World Wide Web 115. As a further alternative, the World Wide Web 115 may be a sizable private network, such as a corporate network supporting several thousand users in certain regions. The user interface 170 may be a small or large display and a small or large keyboard. The server device 120 is preferably a device with substantially larger storage capacity than the client device 105. For example, a server typically has a very large hard disk drive or a plurality of drives (for example, 20 GB of storage).
Referring to Fig. 2, the programming model of the client device 105 according to the embodiment described with reference to Fig. 1 is shown. An application 205 and a word synthesis dictionary 220 are coupled to a speech engine 210. A network transmission function 225 is also coupled to the speech engine 210. The application 205 is one of several software applications that may be coupled to the speech engine 210; it generates a set of text words to be synthesized by the speech engine 210, which generates an analog signal 211 so that the speaker 160 of the client device 105 can present them audibly. The speech engine 210 may embed, in its program instructions and data in the memory 150, a function that synthesizes an audible presentation of a word directly from its letter combinations. As is well known, such synthesis typically sounds very artificial and is often wrong, causing the user to misunderstand the words. The word synthesis dictionary 220 is therefore provided; it may contain a set of common words with associated pronunciations, which reduces misunderstanding of words by the user. The word synthesis dictionary 220 may in fact comprise several word sets combined. For example, a default set of common words and pronunciations, unchanged across applications, may be combined with a set of words and pronunciations associated with a particular application, the latter being merged into the dictionary only while that application runs. This can be effective when a predetermined group of applications is used with the speech engine. For example, a telephone dialer may supply different words to the speech engine 210 than a web browser does. This approach can cause storage problems, however, because memory must be associated with each application to store its words and pronunciations, together with knowledge of exactly which words are stored by default in the dictionary 220. The word synthesis dictionary in the client device may also be quite limited by the device's memory capacity (for example, less than a gigabit).
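The combination of a default word set with application-specific word sets can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation described in the patent: the class name `WordSynthesisDictionary`, its methods, and the ARPAbet-style pronunciation strings are all hypothetical.

```python
# Hypothetical default set of common words and pronunciations (illustrative only).
DEFAULT_WORDS = {"the": "DH AH", "number": "N AH M B ER", "is": "IH Z"}

# Hypothetical application-specific set, merged in only while that application runs.
DIALER_WORDS = {"redial": "R IY D AY AH L"}


class WordSynthesisDictionary:
    """A default word set plus word sets added while particular applications run."""

    def __init__(self, default_words):
        self._default = dict(default_words)
        self._app_sets = {}  # application name -> {word: pronunciation}

    def load_application(self, app_name, words):
        """Merge an application's words into the dictionary when it starts."""
        self._app_sets[app_name] = dict(words)

    def unload_application(self, app_name):
        """Drop an application's words when it stops, freeing memory."""
        self._app_sets.pop(app_name, None)

    def lookup(self, word):
        """Return the stored pronunciation, or None if the word is unknown."""
        key = word.lower()
        for words in self._app_sets.values():
            if key in words:
                return words[key]
        return self._default.get(key)
```

With this layout, `lookup("redial")` returns a pronunciation only while the dialer's word set is loaded, which mirrors the memory trade-off the paragraph describes.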
In one embodiment of the invention, an application may supply a set of text words (without associated pronunciations) to the word synthesis dictionary 220 in the memory 150. The set may consist of words the application commonly uses, that is, words the application is expected to use within a short time (for example, from less than one second to many minutes) while it runs, or it may alternatively be a set of text words that makes up a speech text. In this context, a speech text is a set of text words intended to be presented through the speaker imminently. For example, the sentence "The number entered is 847-576-9999", prepared as a prompt to the user in response to the user entering a telephone number, is a speech text. The digits 0, 1, 2, 3, 4, 5, 6, 7, 8, and 9 are examples of text words that are more likely to form a set that an address application expects to use. Pronunciations of words that are not in the word synthesis dictionary 220 of the client device are obtained remotely by the following technique. For this purpose, the speech engine 210 is coupled to the network transmission function 225, which sends over the network the words that are not in the client device's word synthesis dictionary 220.
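The two kinds of word set distinguished here, a speech text and a set of words an application merely expects to use, might be represented as in the Python sketch below. The names `TextWordSet` and `words_from_speech_text`, and the choice to treat each digit as its own text word, are assumptions made for illustration; the patent does not specify a tokenization.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class TextWordSet:
    """A set of text words supplied by an application, without pronunciations."""
    words: tuple
    is_speech_text: bool  # True when the words are to be spoken imminently


def words_from_speech_text(sentence):
    """Split a prompt into lowercase words and single digits (assumed tokenization)."""
    tokens = re.findall(r"[A-Za-z']+|\d", sentence)
    return TextWordSet(tuple(t.lower() for t in tokens), True)


# A speech text: the prompt used as the example in the description.
prompt = words_from_speech_text("The number entered is 847-576-9999")

# An anticipated set: digits an application expects to need soon.
expected_digits = TextWordSet(tuple("0123456789"), False)
```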
Referring to Fig. 3, a speech synthesis method according to an embodiment of the invention is shown. At step 305, a function associated with the word synthesis dictionary 220 (for example, the speech engine 210) accepts a set of text words, whether a speech text or otherwise, and at step 310 it is determined whether the word synthesis dictionary 220, as currently configured, contains pronunciations for the words in the set. The text words for which no pronunciation is found form the invalid subset of text words (when there are one or more such words). At step 315, the client device 105 transmits the invalid subset of text words to the server device over the network. In the example described above with reference to Fig. 1, the network comprises the wireless network 110 and the World Wide Web 115, but it may alternatively comprise a wired network and no wireless network. The server device 120 receives the invalid subset of text words at step 320 and, at step 325, generates a set of word pronunciations for it by consulting a large word synthesis dictionary within, or accessible to, the server device 120. Because it resides on a server or other computer that is typically a fixed network device, this word synthesis dictionary can be large enough (for example, greater than a gigabit) to contain almost all the words needed by all the client devices it serves. The server device 120 preferably generates the set of word pronunciations to include every text word in the invalid subset, although the set may of course omit a text word. For each text word it does include, the set generated by the server has an associated pronunciation. At step 330, the server transmits the set of word pronunciations to the client device 105 over the network (or networks, as the case may be).
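Steps 310 through 325 can be sketched in Python as two small functions, one on the client and one on the server. This is a hedged illustration only: the function names, the use of plain dictionaries for both word synthesis dictionaries, and the sample pronunciations are assumptions, and the network transfer of steps 315 and 330 is left out.

```python
def find_invalid_subset(text_words, local_dictionary):
    """Step 310 (client): collect words that have no local pronunciation.

    Order is preserved and repeated words are listed once.
    """
    seen = set()
    invalid = []
    for word in text_words:
        if word not in local_dictionary and word not in seen:
            seen.add(word)
            invalid.append(word)
    return invalid


def generate_pronunciations(invalid_subset, server_dictionary):
    """Steps 320-325 (server): pair each known word with its pronunciation.

    Words the server dictionary lacks are omitted from the result,
    matching the note that the returned set may leave out a text word.
    """
    return {w: server_dictionary[w] for w in invalid_subset if w in server_dictionary}


# Hypothetical dictionaries with illustrative pronunciation strings.
local = {"the": "DH AH", "is": "IH Z"}
server = {"number": "N AH M B ER", "entered": "EH N T ER D"}

invalid = find_invalid_subset(
    ["the", "number", "entered", "is", "number", "zyzzyva"], local
)
pronunciations = generate_pronunciations(invalid, server)
```

Here `invalid` holds the three words the client cannot pronounce, and `pronunciations` covers only the two the server knows, so the client would still need its letter-based fallback for the remaining word.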
When the client device 105 receives the set of word pronunciations at step 335, it determines at step 337 whether the set relates to a speech text. If so, it determines at step 340 whether the speech text has already been presented (synthesized). When the speech text has not yet been synthesized, the speech engine 210 uses the set of word pronunciations at step 345 to synthesize the speech text, thereby reducing interpretation errors. When the speech text has already been synthesized at step 340 (as may happen when the delay in receiving the set of word pronunciations exceeds a specified minimum delay, or when a command to present the speech text arrives before the set is received), or when it is determined at step 337 that the set of word pronunciations is not for a speech text, the client device 105 determines at step 350 whether to store the set in the memory 150 of the client device 105 as a supplement to its word synthesis dictionary. Such storage may be for a predetermined time, for example while the application that requested the set of word pronunciations is in use, or may be based on the capacity limits of the memory 150, or on application priority together with memory capacity limits and/or time. When the set is to be stored in the memory 150, it is stored at step 355. The process ends at step 360.
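The client-side decisions of steps 337 through 360 might look like the Python sketch below. The function name, its parameters, and the `synthesize` callback are hypothetical. The description does not state what follows step 345, so the sketch assumes the storage decision of step 350 is also made after synthesis; that ordering is an assumption, not something the text confirms.

```python
def handle_received_pronunciations(pronunciations, for_speech_text,
                                   already_presented, store_allowed,
                                   local_dictionary, synthesize):
    """Process a set of word pronunciations received at step 335."""
    # Steps 337 and 340: is this set for a speech text not yet presented?
    if for_speech_text and not already_presented:
        # Step 345: synthesize the speech text using the received set.
        synthesize(pronunciations)
    # Step 350: decide whether to keep the set (the policy is supplied by the caller).
    if store_allowed:
        # Step 355: supplement the local word synthesis dictionary.
        local_dictionary.update(pronunciations)
    # Step 360: processing ends.


# Usage: the set arrives in time, so it is both spoken and stored.
spoken = []
local_words = {}
handle_received_pronunciations(
    {"entered": "EH N T ER D"}, True, False, True, local_words, spoken.append
)

# Usage: the set arrives too late and storage is declined, so nothing happens.
late_spoken = []
late_words = {}
handle_received_pronunciations(
    {"entered": "EH N T ER D"}, True, True, False, late_words, late_spoken.append
)
```

The `store_allowed` flag stands in for whatever policy the device applies, such as time limits, memory capacity, or application priority.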
It will be appreciated that the invention provides a unique technique for supplying pronunciations of text words in a client device whose word synthesis dictionary capacity is limited (for example, to less than a gigabit), thereby reducing misunderstanding.
In the foregoing specification, the invention and its benefits and advantages have been described with reference to specific embodiments. One of ordinary skill in the art will appreciate, however, that various modifications and changes can be made without departing from the scope of the invention as set forth in the claims below. Accordingly, the specification and figures are to be regarded as illustrative rather than restrictive, and all such modifications are intended to be included within the scope of the invention. The benefits, advantages, and solutions to problems, and any element that may cause any benefit, advantage, or solution to occur or become more pronounced, are not to be construed as critical, required, or essential features or elements of any or all of the claims.
As used herein, the terms "comprises", "comprising", or any variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to that process, method, article, or apparatus.
A "set" as used in the claims below means a non-empty set. The term "another", as used herein, is defined as at least a second or more. The terms "including" and/or "having", as used herein, are defined as comprising. The term "coupled", as used herein, is defined as connected, although not necessarily directly and not necessarily mechanically. The term "program", as used herein, is defined as a sequence of instructions designed for execution on a computer system. A "program" or "computer program" may include a subroutine, a function, a procedure, an object method, an object implementation, an executable application, a Java applet, a servlet, source code, object code, a shared library/dynamic load library, and/or another sequence of instructions designed for execution on a computer system.

Claims (11)

CNA2004800268024A | 2003-09-17 | 2004-08-23 | Speech synthesis | Pending | CN101014996A (en)

Applications Claiming Priority (3)

Application Number | Priority Date | Filing Date | Title
US50368503P | 2003-09-17 | 2003-09-17
US60/503,685 | 2003-09-17
US10/914,583 | 2004-08-09

Publications (1)

Publication Number | Publication Date
CN101014996A (en) | 2007-08-08

Family

ID=38701565

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CNA2004800268024A (Pending, published as CN101014996A (en)) | Speech synthesis | 2003-09-17 | 2004-08-23

Country Status (1)

Country | Link
CN (1) | CN101014996A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN102378050A (en)* | 2010-07-13 | 2012-03-14 | 索尼欧洲有限公司 | Broadcast system using text to speech conversion
CN109246214A (en)* | 2018-09-10 | 2019-01-18 | 北京奇艺世纪科技有限公司 | A kind of prompt tone acquisition methods, device, terminal and server
CN112562638A (en)* | 2020-11-26 | 2021-03-26 | 北京达佳互联信息技术有限公司 | Voice preview method and device and electronic equipment

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN102378050A (en)* | 2010-07-13 | 2012-03-14 | 索尼欧洲有限公司 | Broadcast system using text to speech conversion
CN102378050B (en)* | 2010-07-13 | 2017-03-01 | 索尼欧洲有限公司 | Broadcast system using text-to-speech conversion
CN109246214A (en)* | 2018-09-10 | 2019-01-18 | 北京奇艺世纪科技有限公司 | A kind of prompt tone acquisition methods, device, terminal and server
CN112562638A (en)* | 2020-11-26 | 2021-03-26 | 北京达佳互联信息技术有限公司 | Voice preview method and device and electronic equipment

Similar Documents

Publication | Publication Date | Title
CN101911041B (en) | | Method and apparatus for implementing distributed multimodal applications
CN1790326B (en) | | System for synchronizing natural language input element and graphical user interface
US8073700B2 (en) | | Retrieval and presentation of network service results for mobile device using a multimodal browser
CN101923858B (en) | | Real-time and synchronous mutual translation voice terminal
KR100561228B1 (en) | | Method for converting Voice XM document to XM LPlus Voice document and multi-modal service system using the same
JP5965175B2 (en) | | Response generation apparatus, response generation method, and response generation program
US9507771B2 (en) | | Methods for using a speech to obtain additional information
US6188985B1 (en) | | Wireless voice-activated device for control of a processor-based host system
US8290775B2 (en) | | Pronunciation correction of text-to-speech systems between different spoken languages
CN100576171C (en) | | System and method for combined use of step-by-step markup language and object-oriented development tools
US8682640B2 (en) | | Self-configuring language translation device
US20060218480A1 (en) | | Data output method and system
EP1215656A2 (en) | | Idiom handling in voice service systems
EP0959401A2 (en) | | Audio control method and audio controlled device
JP6625772B2 (en) | | Search method and electronic device using the same
CN1235387C (en) | | Distributed speech recognition for internet access
EP1665229B1 (en) | | Speech synthesis
JP2005524119A (en) | | Encoding method and decoding method of text data including enhanced speech data used in text speech system, and mobile phone including TTS system
CN101014996A (en) | | Speech synthesis
JP2002091473A (en) | | Information processing device
US20020077814A1 (en) | | Voice recognition system method and apparatus
JP2001350682A (en) | | Internet connection mediating system by voice domain, mediating device, mediating method, and voice domain database generating method
JP3857188B2 (en) | | Text-to-speech system and method
KR100702789B1 (en) | | Mobile service system and method using multimodal platform
WO2002099786A1 (en) | | Method and device for multimodal interactive browsing

Legal Events

Date | Code | Title | Description
| C06 | Publication |
| PB01 | Publication |
| C10 | Entry into substantive examination |
| SE01 | Entry into force of request for substantive examination |
| C02 | Deemed withdrawal of patent application after publication (patent law 2001) |
| WD01 | Invention patent application deemed withdrawn after publication |

Open date: 2007-08-08

