WO2011004502A1

Info

Publication number: WO2011004502A1
Application number: PCT/JP2009/062771
Authority: WO
Inventors: 孫慶華; 永松健司; 藤田雄介
Original assignee: 株式会社日立製作所
Priority date: 2009-07-08
Filing date: 2009-07-08
Publication date: 2011-01-13
Also published as: JPWO2011004502A1; JP5343293B2

Abstract

A TTS system based on concatenative synthesis produces speech by joining speech fragments. Concatenative speech synthesis is regarded as a promising approach to realizing a practical mixed-language TTS system. However, prosody prediction that must take the entire text into account, and the discontinuity between the synthesized speech segments of the two languages, both degrade the quality of the synthesized speech, and the existing art has not solved this problem. To solve it, a method for mixed-language text-to-speech synthesis is provided. The method includes a procedure of detecting all secondary-language units by analyzing the language structure of mixed-language text containing at least two languages, a procedure of replacing all of the secondary-language units with primary-language units, a procedure of replacing the parts other than the secondary-language units with the secondary language, and a procedure of synthesizing speech on the basis of the resulting primary-language and secondary-language texts.

Description

Speech editing/synthesizing apparatus and speech editing/synthesizing method

The present invention relates to technology for synthesizing speech, and in particular to technology for synthesizing natural speech from mixed-language text.

In recent years, there have been more and more opportunities to hear artificially synthesized speech in everyday life. Sound quality has improved considerably, for example through the introduction of waveform concatenation methods, and services that automatically provide information by voice, such as in-vehicle navigation devices, automatic announcement systems in public facilities, e-mail reading devices, and automatic interpretation systems, have become widespread.
On the other hand, as globalization progresses and exchanges between countries deepen, text that mixes multiple languages is widely used. In mixed-language TTS (Text to Speech) for bilingual text, the one main language that makes up most of the text is usually called the primary language, and the other language is called the secondary language. For example, when most of the text is written in the native language of its author or user and the rest is in a foreign language, the native language is the primary language and the foreign language is the secondary language. In such text, foreign words such as personal names, place names, proper names, and neologisms are often used with their foreign pronunciation, without being translated into the native language. A system that synthesizes speech from such mixed-language text is therefore needed.
Several multilingual speech synthesis methods have already been proposed; they fall broadly into two categories. In the first, the secondary language is analyzed into a pronunciation string, which is then converted into a primary-language pronunciation string, so that speech is synthesized using the primary language only.
For example, Japanese has a long history of converting foreign words into katakana and using them as loanwords. Japanese speech synthesis therefore commonly defines rules that convert every foreign word to a Japanese pronunciation and reads foreign words with that pronunciation (see JP-A-2000-352990).
However, suppose, for example, that a Japanese driver uses an in-vehicle navigation device while driving in the United States. Voice guidance in Japanese is desirable, but place names, proper nouns, and the like may be conveyed more easily in the American pronunciation the driver is used to hearing. In particular, applications such as electronic dictionaries must read words in their foreign pronunciation, so this method cannot be used.
The other is a method in which a primary language and a secondary language synthesis engine are prepared in advance and switched for each language (see JP-A-2006-48056 and JP-A-2007-155833).
In Chinese, foreign words have in recent years often been written in Chinese text in their original foreign notation and are also pronounced as foreign words when read aloud, so at present a Chinese speech synthesis system alone cannot read them out. This engine-switching method is therefore adopted in many Chinese speech synthesis systems.
A speech conversion device generally consists of three parts: a language processing unit that performs linguistic analysis of the input sentence and determines the reading of each word in it; a prosody prediction unit that predicts prosodic features such as phoneme and pause lengths, voice pitch, and sound intensity; and an acoustic processing unit that synthesizes the actual speech signal based on this information.
In the language processing unit, text containing multiple languages can be handled easily by using a word dictionary that covers both the primary and the secondary language. In the acoustic processing unit, it can likewise be handled easily by using a speech database containing both languages recorded from the same speaker. In the prosody prediction unit, however, it is extremely difficult to create a prosody model that can make predictions for text containing multiple languages. Many conventional systems therefore divide the text into units that each contain only a single language, synthesize speech for each unit, and join the synthesized speech together. Because speech is synthesized unit by unit, discontinuities between units arise easily, and the quality of the synthesized speech is very poor. Inserting pauses before and after secondary-language words softens the discontinuity, but naturalness suffers badly and the result sounds unnatural.
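The conventional approach of dividing text into single-language units can be sketched as follows. This is a minimal illustration, not the method of any cited publication: the helper names `script_of` and `split_monolingual` are hypothetical, and language identification is approximated by Unicode script (ASCII letters stand for the secondary language, CJK ideographs and kana for the primary language), which is an assumption made only for this example.

```python
from itertools import groupby
from typing import List, Tuple


def script_of(ch: str) -> str:
    """Classify a character as 'latin', 'cjk', or 'other' (spaces, digits, punctuation)."""
    if ch.isascii() and ch.isalpha():
        return "latin"
    # Assumption: CJK Unified Ideographs plus hiragana/katakana count as the primary language.
    if "\u4e00" <= ch <= "\u9fff" or "\u3040" <= ch <= "\u30ff":
        return "cjk"
    return "other"


def split_monolingual(text: str) -> List[Tuple[str, str]]:
    """Split text into (script, chunk) units that each contain a single script."""
    units: List[Tuple[str, str]] = []
    for script, chars in groupby(text, key=script_of):
        chunk = "".join(chars)
        # Neutral characters, and runs in the same script as the previous unit,
        # are appended to the previous unit instead of starting a new one.
        if units and (script == "other" or units[-1][0] == script):
            units[-1] = (units[-1][0], units[-1][1] + chunk)
        else:
            units.append((script, chunk))
    return units


units = split_monolingual("我在New York工作。")
```

Each unit would then be passed to its own synthesis engine and the resulting waveforms joined, which is the point at which the discontinuities described above arise.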

FIG. 1 is a block diagram showing the configuration of a speech editing/synthesizing apparatus according to an embodiment of the present invention.
FIG. 2A is a first part of a flowchart showing the operation of the language replacement device according to the embodiment of the present invention.
FIG. 2B is a second part of the flowchart showing the operation of the language replacement device according to the embodiment of the present invention.
FIG. 3 is an explanatory diagram of an inter-language feature correspondence table according to the embodiment of this invention.
FIG. 4 is an explanatory diagram of a text database according to the embodiment of this invention.
FIG. 5A is a first part of a flowchart showing the operation of the speech synthesizer according to the embodiment of the present invention.
FIG. 5B is a second part of the flowchart showing the operation of the speech synthesizer according to the embodiment of the present invention.
FIG. 6 is a block diagram showing the hardware configuration of the speech editing/synthesizing apparatus according to the embodiment of the present invention.

Claims

A speech editing/synthesizing apparatus that synthesizes speech reading out a sentence, the apparatus comprising:
an input device that receives input of the text of a sentence, an output device that outputs synthesized speech, a control device connected to the input device and the output device, and a storage device connected to the control device,
wherein the speech editing/synthesizing apparatus:
receives input of a first sentence including a first word in a first language and a second word in a second language;
creates, by replacing the second word with a third word in the first language, a second sentence that includes a plurality of words in the first language and no word in the second language;
synthesizes speech reading out the second sentence;
acquires a third sentence that includes a plurality of words in the second language, including the second word, and no word in the first language;
synthesizes speech reading out the third sentence such that at least one acoustic feature quantity of the speech reading out the third sentence matches at least one acoustic feature quantity of the speech reading out the second sentence; and
synthesizes speech reading out the first sentence, which includes the first word and the second word, by replacing the speech reading out the third word, contained in the speech reading out the second sentence, with the speech reading out the second word, contained in the speech reading out the third sentence.
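For illustration only, the final replacement step of this claim can be sketched as a waveform splice. The function name `splice_word`, the representation of speech as a list of samples, and the half-open sample ranges are assumptions made for this example; the claim itself does not prescribe any data layout.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # half-open sample range [start, end)


def splice_word(second_sentence_audio: List[float], third_word_span: Span,
                third_sentence_audio: List[float], second_word_span: Span) -> List[float]:
    """Replace the third word's samples with the second word's samples.

    second_sentence_audio: speech for the all-first-language sentence.
    third_sentence_audio:  speech for the all-second-language sentence.
    """
    p_start, p_end = third_word_span
    s_start, s_end = second_word_span
    return (second_sentence_audio[:p_start]
            + third_sentence_audio[s_start:s_end]
            + second_sentence_audio[p_end:])


# Toy samples: 9.0 marks the third word, 5.0 marks the second word.
mixed = splice_word([1.0, 1.0, 9.0, 9.0, 1.0], (2, 4),
                    [0.0, 5.0, 5.0, 5.0, 0.0], (1, 4))
```

The splice only sounds continuous if the acoustic features at the two cut points agree, which is why the claim requires the speech for the third sentence to be synthesized so that its features match those of the second sentence.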
The speech editing/synthesizing apparatus according to claim 1, wherein the apparatus:
holds text information including information indicating features of a plurality of words in the first language;
holds correspondence information that associates features of the first language with features of the second language;
identifies, based on the correspondence information, a feature of a word in the first language that corresponds to a feature of the second word; and
acquires the third word by searching the words in the first language included in the text information, using the identified feature as a search key.
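For illustration only, the lookup described in this claim can be sketched as a table mapping followed by a search. Everything named here is hypothetical: `WordEntry`, `find_third_word`, the phoneme labels, and the toy entries are made up, and using the start and end phonemes as the feature is just one of the options the claims allow.

```python
from typing import Dict, List, NamedTuple, Optional


class WordEntry(NamedTuple):
    """One first-language word in the text information (hypothetical layout)."""
    surface: str
    start_phoneme: str
    end_phoneme: str


def find_third_word(second_start: str, second_end: str,
                    correspondence: Dict[str, str],
                    text_info: List[WordEntry]) -> Optional[WordEntry]:
    """Return a first-language word whose boundary phonemes correspond to the second word's."""
    # Map the second-language boundary phonemes into the first language.
    first_start = correspondence.get(second_start)
    first_end = correspondence.get(second_end)
    if first_start is None or first_end is None:
        return None  # no correspondence is known for one of the phonemes
    # Use the mapped features as the search key.
    for entry in text_info:
        if (entry.start_phoneme, entry.end_phoneme) == (first_start, first_end):
            return entry
    return None


# Toy data; the phoneme labels and words are invented for this example.
correspondence = {"N": "n", "K": "k", "TH": "s"}
text_info = [
    WordEntry("word_a", "s", "k"),
    WordEntry("word_b", "n", "k"),
]
third_word = find_third_word("N", "K", correspondence, text_info)
```

Returning the first match keeps the sketch short; a fuller search could rank candidates by additional features such as part of speech or word length.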
The speech editing/synthesizing apparatus according to claim 2, wherein the feature of the second word is at least one of a start phoneme of the second word and an end phoneme of the second word.
The speech editing/synthesizing apparatus according to claim 3, wherein the feature of the second word further includes at least one of the part of speech of the second word, the length of the second word, the accent position of the second word, the stress position of the second word, and the tone of the second word.
The speech editing/synthesizing apparatus according to claim 2, wherein the text information further includes information indicating features of a plurality of sentences, each including a plurality of words in the second language, including the second word, and
wherein the apparatus:
identifies, based on the correspondence information, a feature of a sentence in the second language that corresponds to a feature of the first sentence; and
acquires the third sentence by searching the sentences included in the text information, using the identified feature as a search key.
The speech editing/synthesizing apparatus according to claim 5, wherein the feature of the first sentence is at least one of the phoneme immediately before the second word in the first sentence and the phoneme immediately after the second word in the first sentence.
The speech editing/synthesizing apparatus according to claim 6, wherein the feature of the first sentence further includes at least one of the position occupied by the second word in the first sentence, the position occupied by a phrase including the second word in the first sentence, the position occupied by a prosodic word including the second word in the first sentence, the part of speech of the second word, and the length of the first sentence.
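For illustration only, the search for the third sentence can be sketched as a filter on the phonemes adjacent to the second word followed by a ranking on the remaining features. The function name `find_third_sentence`, the dictionary keys, and the toy data are hypothetical. The sketch requires both adjacent phonemes to match, although the claims only require at least one of them, and it assumes the query features have already been mapped into the second language through the correspondence information.

```python
from typing import Dict, List, Optional

REQUIRED = ("prev_phoneme", "next_phoneme")
OPTIONAL = ("word_position", "phrase_position", "part_of_speech")


def find_third_sentence(query: Dict[str, str],
                        candidates: List[Dict[str, str]]) -> Optional[Dict[str, str]]:
    """Pick the candidate sentence whose context features best match the query."""
    # Keep only candidates whose phonemes adjacent to the second word match.
    matching = [c for c in candidates
                if all(c.get(k) == query.get(k) for k in REQUIRED)]
    if not matching:
        return None
    # Rank the remaining candidates by how many optional features also agree.
    return max(matching,
               key=lambda c: sum(c.get(k) == query.get(k) for k in OPTIONAL))


# Toy data; sentence ids and feature values are invented for this example.
query = {"prev_phoneme": "a", "next_phoneme": "g",
         "word_position": "middle", "part_of_speech": "noun"}
candidates = [
    {"id": "s1", "prev_phoneme": "a", "next_phoneme": "g",
     "word_position": "start", "part_of_speech": "noun"},
    {"id": "s2", "prev_phoneme": "a", "next_phoneme": "g",
     "word_position": "middle", "part_of_speech": "noun"},
    {"id": "s3", "prev_phoneme": "o", "next_phoneme": "g",
     "word_position": "middle", "part_of_speech": "noun"},
]
best = find_third_sentence(query, candidates)
```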
The speech editing/synthesizing apparatus according to claim 5, wherein the apparatus synthesizes the speech reading out the third sentence such that the acoustic feature quantities at the start point and the end point of the second word match, respectively, the acoustic feature quantities at the start point and the end point of the third word in the synthesized speech reading out the second sentence.
The speech editing/synthesizing apparatus according to claim 8, wherein the acoustic feature quantity includes at least one of a prosodic feature quantity and a phonological feature quantity,
the prosodic feature quantity includes at least a fundamental frequency, and
the phonological feature quantity includes at least a spectrum.
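For illustration only, matching the fundamental frequency at the start and end points of a word can be sketched as adding a linearly interpolated offset to an F0 contour. This is one simple way to make boundary values agree and is not stated in the claims to be the technique used; the function name `match_boundary_f0` and the per-frame list representation are assumptions.

```python
from typing import List


def match_boundary_f0(f0: List[float], target_start: float, target_end: float) -> List[float]:
    """Shift an F0 contour (Hz per frame) so its first and last frames hit the targets.

    The correction applied to each frame is interpolated linearly between the
    offset needed at the start and the offset needed at the end.
    """
    n = len(f0)
    if n < 2:
        raise ValueError("need at least two frames to match both boundaries")
    d_start = target_start - f0[0]
    d_end = target_end - f0[-1]
    return [v + d_start + (d_end - d_start) * i / (n - 1)
            for i, v in enumerate(f0)]


# Second word's contour, adjusted to the third word's boundary values (toy numbers).
adjusted = match_boundary_f0([100.0, 110.0, 120.0], target_start=120.0, target_end=130.0)
```

The same idea could be applied to other feature quantities, but spectral matching generally needs more than a per-frame offset and is not sketched here.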
A speech editing/synthesizing method performed by a speech editing/synthesizing apparatus that synthesizes speech reading out a sentence,
the speech editing/synthesizing apparatus comprising an input device that receives input of the text of a sentence, an output device that outputs synthesized speech, a control device connected to the input device and the output device, and a storage device connected to the control device,
the speech editing/synthesizing method comprising:
a first procedure in which the speech editing/synthesizing apparatus receives input of a first sentence including a first word in a first language and a second word in a second language;
a second procedure in which the speech editing/synthesizing apparatus creates, by replacing the second word with a third word in the first language, a second sentence that includes a plurality of words in the first language and no word in the second language;
a third procedure in which the speech editing/synthesizing apparatus synthesizes speech reading out the second sentence;
a fourth procedure in which the speech editing/synthesizing apparatus acquires a third sentence that includes a plurality of words in the second language, including the second word, and no word in the first language;
a fifth procedure in which the speech editing/synthesizing apparatus synthesizes speech reading out the third sentence such that at least one acoustic feature quantity of the speech reading out the third sentence matches at least one acoustic feature quantity of the speech reading out the second sentence; and
a sixth procedure in which the speech editing/synthesizing apparatus synthesizes speech reading out the first sentence, which includes the first word and the second word, by replacing the speech reading out the third word, contained in the speech reading out the second sentence, with the speech reading out the second word, contained in the speech reading out the third sentence.
The speech editing/synthesizing method according to claim 10, wherein the speech editing/synthesizing apparatus:
holds text information including information indicating features of a plurality of words in the first language; and
holds correspondence information that associates features of the first language with features of the second language,
and wherein the speech editing/synthesizing method further comprises:
a procedure in which the speech editing/synthesizing apparatus identifies, based on the correspondence information, a feature of a word in the first language that corresponds to a feature of the second word; and
a procedure of acquiring the third word by searching the words in the first language included in the text information, using the identified feature as a search key.
The speech editing/synthesizing method according to claim 11, wherein the feature of the second word is at least one of a start phoneme of the second word and an end phoneme of the second word.
The speech editing/synthesizing method according to claim 12, wherein the feature of the second word further includes at least one of the part of speech of the second word, the length of the second word, the accent position of the second word, the stress position of the second word, and the tone of the second word.
The speech editing/synthesizing method according to claim 11, wherein the text information further includes information indicating features of a plurality of sentences, each including a plurality of words in the second language, including the second word,
the method further comprises a procedure of identifying, based on the correspondence information, a feature of a sentence in the second language that corresponds to a feature of the first sentence, and
the fourth procedure includes a procedure in which the speech editing/synthesizing apparatus acquires the third sentence by searching the sentences included in the text information, using the identified feature as a search key.
The speech editing/synthesizing method according to claim 14, wherein the feature of the first sentence is at least one of the phoneme immediately before the second word in the first sentence and the phoneme immediately after the second word in the first sentence.
The speech editing/synthesizing method according to claim 15, wherein the feature of the first sentence further includes at least one of the position occupied by the second word in the first sentence, the position occupied by a phrase including the second word in the first sentence, the position occupied by a prosodic word including the second word in the first sentence, the part of speech of the second word, and the length of the first sentence.
The speech editing/synthesizing method according to claim 14, wherein the fifth procedure includes a procedure in which the speech editing/synthesizing apparatus synthesizes the speech reading out the third sentence such that the acoustic feature quantities at the start point and the end point of the second word match, respectively, the acoustic feature quantities at the start point and the end point of the third word in the synthesized speech reading out the second sentence.
The speech editing/synthesizing method according to claim 17, wherein the acoustic feature quantity includes at least one of a prosodic feature quantity and a phonological feature quantity,
the prosodic feature quantity includes at least a fundamental frequency, and
the phonological feature quantity includes at least a spectrum.