WO2005034085A1


Info

Publication number: WO2005034085A1
Application number: PCT/US2004/030570
Authority: WO
Inventors: Gui-Lin Chen; Yi-Qing Zu
Original assignee: Motorola, Inc.
Priority date: 2003-09-29
Filing date: 2004-09-17
Publication date: 2005-04-14
Also published as: CN1604183A; KR20060056403A; RU2319221C1; CN1320482C; EP1668631A4; EP1668631A1

Abstract

There is described a method (400) for automatically identifying natural speech pauses in a text string, the pauses being for use in text to speech conversion performed on an electronic device (100). The method (400) includes obtaining (420) the text string comprising two ends, these ends being a start end and a finish end. Then there is effected a step of analyzing (440) at least one word in the text string to determine if there is a natural speech pause adjacent to the word, the analyzing being based on at least one predefined threshold value for the word, the threshold value being associated with a number of syllables between the word and one of the two ends of the text string. Then there is provided a step of inserting (460) the natural speech pause into a synthesized speech signal output representative of the text string.

Description

IDENTIFYING NATURAL SPEECH PAUSES IN A TEXT STRING

FIELD OF THE INVENTION

The present invention relates generally to Text-To-Speech (TTS) synthesis. The invention is particularly useful for determining natural pauses in synthesized pronunciation of a text segment.

BACKGROUND OF THE INVENTION

Text to Speech (TTS) conversion, often referred to as concatenated text to speech synthesis, allows electronic devices to receive an input text string and provide a converted representation of the string in the form of synthesized speech. However, a device that may be required to synthesize speech originating from a non-deterministic number of received text strings will have difficulty in providing high quality realistic synthesized speech. This is because the pronunciation of each word or syllable (for Chinese characters and the like) to be synthesized is context and location dependent. For example, the pronunciation of a word at the end of a sentence (input text string) may be drawn out or lengthened. The pronunciation of the same word may be lengthened even more if it occurs in the middle of a sentence where emphasis is required at a natural speech pause.

In most languages the pronunciation of a word depends on acoustic prosodic parameters comprising tone (pitch), volume (power or amplitude) and duration. The prosodic parameter values for a word are dependent upon word position in a phrase and locations of natural speech pauses. However, identification of natural speech pauses for varying random input text patterns does not readily occur in state of the art Text-To-Speech (TTS) synthesis.

In this specification, including the claims, the terms 'comprises', 'comprising' or similar terms are intended to mean a non-exclusive inclusion, such that a method or apparatus that comprises a list of elements does not include those elements solely, but may well include other elements not listed.

SUMMARY OF THE INVENTION

According to one aspect of the invention there is provided a method for automatically identifying natural speech pauses in a text string, the pauses being for use in text to speech conversion performed on an electronic device, the method comprising: obtaining the text string comprising two ends, these ends being a start end and a finish end; analyzing at least one word in the text string to determine if there is a natural speech pause adjacent to the word, the analyzing being based on at least one predefined threshold value for the word, the threshold value being associated with a number of syllables between the word and one of the two ends of the text string; and inserting the natural speech pause into a synthesized speech signal output representative of the text string.

Suitably, the at least one predefined threshold value includes a P_word threshold value based on the number of syllables between the start end and the word. Suitably, the at least one predefined threshold value includes an F_word threshold value based on the number of syllables between the finish end and the word. Preferably, the at least one predefined threshold value is determined by the steps of: providing a training set of transcriptions, with at least one natural speech pause identified by an inserted identifier; identifying words in each of the transcriptions as P_words and F_words; statistically analyzing the P_words and F_words in the training set; and determining the F_word threshold value and P_word threshold value from results of the statistically analyzing. Suitably, the inserting the natural speech pause may also include pauses identified as Part Of Speech (POS) pattern natural breaks. Suitably, the inserting the natural speech pause may also include pauses identified as Compound word natural pauses.

BRIEF DESCRIPTION OF THE DRAWINGS

In order that the invention may be readily understood and put into practical effect, reference will now be made to a preferred embodiment as illustrated with reference to the accompanying drawings in which:
Fig. 1 is a schematic block diagram of an electronic device in accordance with the present invention;
Fig. 2 illustrates a method 200 for determining threshold values associated with natural speech pauses in text strings;
Figs. 3A to 3D illustrate examples of transcriptions used for the method of Fig. 2;
Fig. 4 illustrates a method for automatically identifying natural speech pauses in a text string; and
Fig. 5 illustrates details of an analyzing step of Fig. 4.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT OF THE INVENTION

Referring to Fig. 1 there is illustrated an electronic device 100, in the form of a radio-telephone, comprising a device processor 102 operatively coupled by a bus 103 to a user interface 104 that is typically a touch screen or alternatively a display screen and keypad. The electronic device 100 also has an utterance corpus 106, a speech synthesizer 110, Non Volatile memory 120, Read Only Memory 118 and Radio communications module 116, all operatively coupled to the processor 102 by the bus 103. The speech synthesizer 110 has an output coupled to drive a speaker 112. The corpus 106 includes representations of words or phonemes and associated sampled, digitized and processed utterance waveforms PUWs. In use, and as described below, the Non Volatile memory 120 (memory module) provides text strings for Text-To-Speech (TTS) synthesis (the text may be received by module 116 or otherwise). Also, the waveform utterance corpus comprises transcriptions, representing phrases and corresponding sampled and digitized utterance waveforms, text strings at locations relative to the natural phrase boundaries as described below.

As will be apparent to a person skilled in the art, the radio frequency communications unit 116 is typically a combined receiver and transmitter having a common antenna. The radio frequency communications unit 116 has a transceiver coupled to the antenna via a radio frequency amplifier. The transceiver is also coupled to a combined modulator/demodulator that couples the communications unit 116 to the processor 102. Also, in this embodiment the non-volatile memory 120 (memory module) stores a user programmable phonebook database Db and Read Only Memory 118 stores operating code (OC) for the device processor 102.

Referring to Fig. 2 there is illustrated a method 200 for determining threshold values associated with natural speech pauses in text strings. The threshold values are based on a number of preceding and following syllables in transcriptions in a training set TS. After a start step 210 the method 200 effects a providing step 220 for providing the training set TS of transcriptions, typically sentences, with at least one natural speech pause identified by a manually inserted punctuation mark or identifier "|". Examples of such transcriptions or sentences are shown in Figs. 3A to 3D. One of these transcriptions 300, "Based on our history | in China,", has a natural speech pause 310 between the words "history" and "in". Also, there is a start end 305 and a finish end 315 for the transcription 300. As will be apparent to a person skilled in the art, all the transcriptions 300 in Figs. 3A to 3D have at least one natural speech pause 310 and a start end 305 and a finish end 315. Further analysis of the transcription shows the following:

Based = 2 syllables
on = 1 syllable
our = 1 syllable
history = 3 syllables
in = 1 syllable
China = 2 syllables

Also, each word in the transcription can be designated as:
(i) a P_word, that is, a word in the transcription with an immediately preceding natural pause identified by the punctuation mark "|";
(ii) an F_word, that is, a word in the transcription with an immediately following natural pause identified by the punctuation mark "|"; or
(iii) a neutral word that does not have an adjacent natural speech pause in the transcription.

After the step 220 an identifying step 230 provides for identifying words in each of the transcriptions as (i) a P_word; (ii) an F_word; or (iii) a neutral word. Hence, for the transcription "Based on our history | in China,", Table 1 below identifies attributes of each word in the transcription.

Table 1: Analysis of the transcription "Based on our history | in China,"
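The per-word attributes described for the identifying step 230 can be derived mechanically from a marked transcription. The following is a minimal sketch only, not the implementation of the described method: the function name `classify_words`, the row fields, and the lookup table `SYLLABLES` are illustrative assumptions, with the syllable counts taken from the analysis given above.

```python
PAUSE = "|"  # pause identifier used in the training transcriptions

# Syllable counts as listed in the text for the example transcription
# (assumption: a simple lookup table stands in for a real syllable counter).
SYLLABLES = {"based": 2, "on": 1, "our": 1, "history": 3, "in": 1, "china": 2}


def classify_words(transcription, syllable_counts):
    """Label each word as P_word, F_word or neutral and count the
    syllables between the word and each end of the transcription."""
    tokens = [t.strip(",.;:!?") for t in transcription.split()]
    tokens = [t for t in tokens if t]
    total = sum(syllable_counts[t.lower()] for t in tokens if t != PAUSE)

    rows = []
    preceding = 0  # syllables between the start end and the current word
    for i, tok in enumerate(tokens):
        if tok == PAUSE:
            continue
        n = syllable_counts[tok.lower()]
        is_p = i > 0 and tokens[i - 1] == PAUSE                 # pause just before
        is_f = i + 1 < len(tokens) and tokens[i + 1] == PAUSE   # pause just after
        # A word with pauses on both sides is reported as a P_word here;
        # that tie-break is an assumption of this sketch.
        kind = "P_word" if is_p else "F_word" if is_f else "neutral"
        rows.append({
            "word": tok,
            "syllables": n,
            "kind": kind,
            "preceding": preceding,
            "following": total - preceding - n,
        })
        preceding += n
    return rows


rows = classify_words("Based on our history | in China,", SYLLABLES)
for row in rows:
    print(row)
```

For the example transcription this labels "history" as an F_word with 4 preceding and 3 following syllables, "in" as a P_word with 7 preceding and 2 following syllables, and the remaining words as neutral.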

The method 200 then performs a statistically analyzing step 240. In this step 240, if the provided training set TS has 90,000 transcriptions (e.g. sentences) and presuming the word "in" occurs 10,000 times in the training set TS, then for these 10,000 instances of "in" the following statistical analysis could be observed:
(i) Number of occurrences (OPW) of "in" as a P_word = 8,000 instances;
(ii) Number of occurrences (OFW) of "in" as an F_word = 1,000 instances;
(iii) Number of occurrences (ONW) of "in" as a neutral word (a word that is neither a P_word nor an F_word) = 1,000 instances.

Further, from the 8,000 P_word occurrences (OPW) of "in" identified in the training set TS, the following statistical analysis could be observed:
(i) Occurrences (OPS) of 8 or more preceding syllables = 0;
(ii) Occurrences (OPS) of 7 preceding syllables = 400;
(iii) Occurrences (OPS) of 6 preceding syllables = 600;
(iv) Occurrences (OPS) of 5 preceding syllables = 2,000;
(v) Occurrences (OPS) of 4 preceding syllables = 3,000;
(vi) Occurrences (OPS) of 3 preceding syllables = 1,000;
(vii) Occurrences (OPS) of 2 preceding syllables = 1,000;
(viii) Occurrences (OPS) of 1 preceding syllable = 0.

A heuristic ratio HR of 0.75, selected by intuition and experimentation, is used to determine a P_word break threshold value PT for the word "in". This threshold value PT is determined at a determining threshold values step 250 as follows, starting from the maximum number of observed syllables and working to the minimum number of observed syllables:

Do from largest OPS
  Until: Sum OPS / OPW >= 0.75
  Select PT to be the number of observed syllables identified by the last OPS of Sum OPS;
End Do.
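The threshold selection of step 250 can be sketched in a few lines. This is an illustrative reading of the procedure rather than a definitive implementation: the function name `select_threshold` and the dictionary layout are assumptions, the total OPW is taken to be the sum of the OPS counts, and the "8 or more" bucket is represented by the key 8.

```python
def select_threshold(ops_by_syllables, heuristic_ratio=0.75):
    """Accumulate occurrence counts from the largest syllable count downwards
    and return the syllable count at which the cumulative share of all
    occurrences first reaches the heuristic ratio."""
    total = sum(ops_by_syllables.values())  # OPW (or OFW for an F_word)
    if total == 0:
        return None  # the word was never observed next to a pause
    running = 0  # "Sum OPS" in the text
    for n_syllables in sorted(ops_by_syllables, reverse=True):
        running += ops_by_syllables[n_syllables]
        if running / total >= heuristic_ratio:
            return n_syllables
    return None


# OPS counts for "in" as a P_word, as given in the example above.
ops_for_in = {8: 0, 7: 400, 6: 600, 5: 2000, 4: 3000, 3: 1000, 2: 1000, 1: 0}
print(select_threshold(ops_for_in))  # 4
```

The same function could be applied to following-syllable counts to obtain an F_word threshold FT, since the text states that a similar analysis is used.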

Thus PT for "in" would be determined at step 250 as follows:
400/8,000 = 0.05 for 7 preceding syllables;
(400+600)/8,000 = 0.125 for 6 preceding syllables;
(400+600+2,000)/8,000 = 0.375 for 5 preceding syllables;
(400+600+2,000+3,000)/8,000 = 0.75 for 4 preceding syllables;
and thus PT is selected to be 4.

A similar statistical analysis is used to determine at step 250 an F_word break threshold value FT for "in"; again the heuristic ratio HR of 0.75 is used. Also, PT and FT values are determined for all other P_word and F_word instances for all other words in the training set TS (using the heuristic ratio HR of 0.75). The method 200 then ends at step 260 and all P_word and F_word instances for all words in the training set TS are stored in the Non-Volatile Memory 120.

Referring to Fig. 4, there is illustrated a method 400 for automatically identifying natural speech pauses in a text string STR, the pauses being for use in text to speech conversion performed on the electronic device 100. After a start step 410, the method 400 effects a step 420 of obtaining the text string STR comprising two ends, these ends being a start end SE and a finish end FE. A selecting word step 430 selects one of the words (or a compound word CW) and an analyzing step 440 provides for analyzing at least one word (or compound word CW) in the text string STR to determine if there is a natural speech pause adjacent to the word (or compound word CW), the analyzing being based on at least one predefined threshold value (PT or FT) for the word, the threshold value being associated with a number of syllables between the word and one of the two ends of the text string. The threshold value includes the P_word threshold value PT based on the number of syllables between the start end and the word. Also, the threshold value includes the F_word threshold value FT based on the number of syllables between the finish end and the word. If a test step 450 determines that a pause was identified by step 440, then a natural speech pause is inserted, at step 460, for speech synthesis. Otherwise no pause is inserted for the word that was selected at the step 430. A step 470 then checks to determine if all words in the text string STR have been analyzed, and if there are more words not yet analyzed the method returns to step 430. Otherwise, a speech synthesis step 480 provides for synthesizing speech at synthesizer 110, using the corpus 106, wherein the natural speech pause or pauses (inserted into the text string STR at step 460) are inserted into a synthesized speech signal output representative of the text string STR.

Referring to Fig. 5, there is illustrated a more detailed diagram of the analyzing step 440. Firstly, the text string STR is checked, at test step 441, to determine if it has a Part Of Speech (POS) pattern natural pause break.

Examples of POS pattern natural break pauses are as follows:
1. numeral + noun. For instance: two thousand books.
2. verb + adverb. For instance: look carefully.
3. preposition + noun. For instance: with telescopes.
4. adjective + noun. For instance: beautiful city.
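A check for the POS patterns listed above might be sketched as follows. This is only an illustration under stated assumptions: the tag names, the `POS_PATTERNS` set and the function `find_pos_pattern_matches` are hypothetical, the input is assumed to be already tagged by some external part-of-speech tagger, and the sketch only reports where a pattern occurs rather than deciding where a pause is placed.

```python
# Hypothetical tag names; the text lists the patterns but does not name a tag set.
POS_PATTERNS = {
    ("numeral", "noun"),
    ("verb", "adverb"),
    ("preposition", "noun"),
    ("adjective", "noun"),
}


def find_pos_pattern_matches(tagged_words):
    """Return the index of the first word of every adjacent word pair
    whose tags match one of the listed POS patterns."""
    matches = []
    for i in range(len(tagged_words) - 1):
        pair = (tagged_words[i][1], tagged_words[i + 1][1])
        if pair in POS_PATTERNS:
            matches.append(i)
    return matches


example = [("two", "numeral"), ("thousand", "numeral"), ("books", "noun")]
print(find_pos_pattern_matches(example))  # [1]
```

A set of tag pairs keeps the lookup simple; further patterns could be added to the set without changing the function.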

If a break is determined at step 441 then a step 446 is effected and the break is identified as an F_word break. If no break is determined at step 441 then the text string STR is checked, at test step 442, to determine if it has a Compound word natural pause insertion break. Examples of Compound word natural break pauses are as follows: a bit of; a body of; a few; a fleet of; a flooding of; a fraction of; a function of; a good deal; a good deal of; a great deal; a great deal of; a growing number of; a hint of; a large body of; a large number of; a lot of land; a majority of.

If a break is determined at step 442 then the step 446 is effected and the break is identified as an F_word break. If no break is identified at step 442 then at a step 443 a test determines if the P_word threshold value PT for the selected word has been reached. This is determined by comparing the number of syllables in the text string STR between the start end and the selected word with the threshold value PT. If the P_word threshold value PT is reached then a natural break is determined and identified as a P_word break at step 444. Alternatively, if no break is identified at step 443 then at a step 445 a test determines if the F_word threshold value FT for the selected word has been reached. This is determined by comparing the number of syllables in the text string STR between the finish end and the selected word with the threshold value FT. If the F_word threshold value FT is reached then a natural break is determined and identified as an F_word break at step 446. Otherwise, no break is identified at step 447.

Advantageously, the present invention allows for identifying natural speech pauses in text strings for use in Text-To-Speech (TTS) synthesis, thereby improving the quality of synthesized speech. The detailed description provides a preferred exemplary embodiment only, and is not intended to limit the scope, applicability, or configuration of the invention. Rather, the detailed description of the preferred exemplary embodiment provides those skilled in the art with an enabling description for implementing the preferred exemplary embodiment of the invention. It should be understood that various changes may be made in the function and arrangement of elements without departing from the spirit and scope of the invention as set forth in the appended claims.
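The threshold tests of Fig. 5 (steps 443 to 447) can be illustrated with a short sketch. It is a simplified reading under stated assumptions, not the described implementation: the POS pattern and compound word checks of steps 441 and 442 are omitted, "reached" is interpreted as greater than or equal to, and the names `find_breaks`, `insert_pauses`, `SYLLABLES` and `THRESHOLDS` are hypothetical. Only the PT value of 4 for "in" comes from the worked example in the description; no FT value is given there, so none is used.

```python
# Syllable counts from the example transcription (lookup table is an assumption).
SYLLABLES = {"based": 2, "on": 1, "our": 1, "history": 3, "in": 1, "china": 2}

# Per-word thresholds; PT = 4 for "in" is taken from the worked example.
THRESHOLDS = {"in": {"PT": 4}}


def find_breaks(words, syllable_counts, thresholds):
    """For each word return "P_word", "F_word" or None."""
    counts = [syllable_counts[w.lower()] for w in words]
    total = sum(counts)
    breaks = []
    preceding = 0  # syllables between the start end and the word
    for word, n in zip(words, counts):
        following = total - preceding - n  # syllables up to the finish end
        limits = thresholds.get(word.lower(), {})
        pt, ft = limits.get("PT"), limits.get("FT")
        if pt is not None and preceding >= pt:
            breaks.append("P_word")   # steps 443 and 444
        elif ft is not None and following >= ft:
            breaks.append("F_word")   # steps 445 and 446
        else:
            breaks.append(None)       # step 447
        preceding += n
    return breaks


def insert_pauses(words, breaks, mark="|"):
    """Place a pause mark before P_word breaks and after F_word breaks."""
    out = []
    for word, kind in zip(words, breaks):
        if kind == "P_word" and out and out[-1] != mark:
            out.append(mark)
        out.append(word)
        if kind == "F_word":
            out.append(mark)
    return " ".join(out)


words = "Based on our history in China".split()
breaks = find_breaks(words, SYLLABLES, THRESHOLDS)
print(insert_pauses(words, breaks))  # Based on our history | in China
```

In this example 7 syllables precede "in", which reaches its PT of 4, so a pause is placed before "in"; the other words have no stored thresholds and receive no break.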

Claims

WE CLAIM:

1. A method for automatically identifying natural speech pauses in a text string, the pauses being for use in text to speech conversion performed on an electronic device, the method comprising: obtaining the text string comprising two ends, these ends being a start end and a finish end; analyzing at least one word in the text string to determine if there is a natural speech pause adjacent to the word, the analyzing being based on at least one predefined threshold value for the word, the threshold value being associated with a number of syllables between the word and one of the two ends of the text string; and inserting the natural speech pause into a synthesized speech signal output representative of the text string.

2. A method for automatically identifying natural speech pauses in a text string, as claimed in claim 1, wherein the at least one predefined threshold value includes a P_word threshold value based on the number of syllables between the start end and the word.

3. A method for automatically identifying natural speech pauses in a text string, as claimed in claim 1, wherein the at least one predefined threshold value includes an F_word threshold value based on the number of syllables between the finish end and the word.

4. A method for automatically identifying natural speech pauses in a text string, as claimed in claim 1, wherein the at least one predefined threshold value is determined by the steps of: providing a training set of transcriptions, with at least one natural speech pause identified by an inserted identifier; identifying words in each of the transcriptions as P_words and F_words; statistically analyzing the P_words and F_words in the training set; and determining the F_word threshold value and P_word threshold value from results of the statistically analyzing.

5. A method for automatically identifying natural speech pauses in a text string, as claimed in claim 1, wherein the inserting the natural speech pause may also include pauses identified as Part Of Speech (POS) pattern natural breaks.

6. A method for automatically identifying natural speech pauses in a text string, as claimed in claim 1, wherein, the inserting the natural speech pause may also include pauses identified as Compound word natural pauses.