



Technical Field
The present invention relates to the field of speech signal processing, and in particular to a speech synthesis system and method.
Background Art
Speech synthesis aims to make machines speak as smoothly and naturally as humans, and it has a wide range of applications such as voice assistants and audiobooks. A speech synthesis system usually consists of two parts: a front end and a back end. The front end focuses on text analysis, converting a text sequence into linguistic features; it covers a series of functions such as text normalization, grapheme-to-phoneme (G2P) conversion, word segmentation, part-of-speech tagging, and prosody prediction [1]. The purpose of G2P conversion is to generate a phoneme sequence from a character sequence [2]. A pronunciation lexicon consists of word-pronunciation pairs of a language [3] and is essential for G2P conversion. Since a pronunciation lexicon cannot cover all the words of a language, a G2P model trained on the lexicon is usually adopted to generate pronunciations for words that do not exist in the lexicon [4]. However, pronunciation lexicons are language-specific, and building a pronunciation lexicon for a new language requires expertise in the language and in phonetic annotation systems, which is more labor-intensive, time-consuming, and difficult than collecting speech recordings of that language [3]. Even though some open-source G2P tools exist, the number of languages they cover is still very limited, considering that there are roughly 7,000 languages in the world.
On the other hand, the back end of a speech synthesis system usually consists of an acoustic model that converts linguistic features into acoustic features and a vocoder that reconstructs the speech waveform from the acoustic features. In recent years, neural sequence-to-sequence acoustic modeling [5, 6, 7] has become the mainstream approach, outperforming traditional statistical parametric speech synthesis methods based on hidden Markov models and deep neural networks. Some sequence-to-sequence acoustic models can take character sequences directly as input [5], thus eliminating the need for pronunciation lexicons and G2P models. However, compared with using phoneme sequences, using character sequences as input usually degrades the naturalness and intelligibility of the synthesized speech.
A traditional speech synthesis system needs to process the input text into a phoneme sequence in the front-end text analysis stage by means of a pronunciation lexicon and a G2P model, and then feed the phoneme sequence to the back-end modules for acoustic feature prediction and waveform reconstruction. The G2P model itself is usually also trained on a pronunciation lexicon. Building a pronunciation lexicon relies on language-specific expert knowledge, and building a large, high-accuracy lexicon is time-consuming and labor-intensive. Yet feeding the raw character sequence directly into the back-end acoustic model degrades the quality of the synthesized speech.
In view of this, the present invention is proposed.
Summary of the Invention
The purpose of the present invention is to provide a speech synthesis system and method that do not rely on a pronunciation lexicon, so that speech synthesis can be performed without a pronunciation lexicon, thereby solving the above-mentioned technical problems in the prior art.
The purpose of the present invention is achieved by the following technical solutions:
An embodiment of the present invention provides a speech synthesis system that does not rely on a pronunciation lexicon, comprising:
a language-independent speech recognition model, a text-to-pronunciation-representation prediction model, a pronunciation-representation-to-acoustic prediction model, and a neural network vocoder; wherein
the language-independent speech recognition model extracts, in the training stage, pronunciation representations from input speech waveforms of the target language, and provides the pronunciation representations to the text-to-pronunciation-representation prediction model and the pronunciation-representation-to-acoustic prediction model for training, so as to obtain the trained text-to-pronunciation-representation prediction model and pronunciation-representation-to-acoustic prediction model;
the text-to-pronunciation-representation prediction model, once trained, predicts pronunciation representations from the character sequence of the input text to be synthesized and outputs them to the trained pronunciation-representation-to-acoustic prediction model;
the pronunciation-representation-to-acoustic prediction model, connected with the neural network vocoder, generates a mel spectrogram from the pronunciation representations predicted by the text-to-pronunciation-representation prediction model;
the neural network vocoder reconstructs the mel spectrogram generated by the pronunciation-representation-to-acoustic prediction model into the speech waveform corresponding to the text to be synthesized.
An embodiment of the present invention also provides a speech synthesis method that does not rely on a pronunciation lexicon, using the speech synthesis system of the present invention. First, the language-independent speech recognition model of the speech synthesis system extracts pronunciation representations from input speech waveforms of the target language; the pronunciation representations are used to train the text-to-pronunciation-representation prediction model and the pronunciation-representation-to-acoustic prediction model of the speech synthesis system, yielding the trained text-to-pronunciation-representation prediction model and pronunciation-representation-to-acoustic prediction model. Synthesis then proceeds as follows:
the text to be synthesized is input into the trained text-to-pronunciation-representation prediction model of the speech synthesis system, which predicts pronunciation representations from the character sequence of the text to be synthesized and outputs them to the pronunciation-representation-to-acoustic prediction model of the speech synthesis system;
the pronunciation-representation-to-acoustic prediction model generates a mel spectrogram from the pronunciation representations and outputs the mel spectrogram to the neural network vocoder of the speech synthesis system;
the neural network vocoder reconstructs the mel spectrogram into the speech waveform corresponding to the text to be synthesized.
Compared with the prior art, the speech synthesis system and method provided by the present invention, which do not rely on a pronunciation lexicon, have the following beneficial effects:
By adopting a language-independent automatic speech recognition model, pronunciation representations can be extracted automatically from speech data of the target language, and these pronunciation representations are then used to train the text-to-pronunciation-representation prediction model and the pronunciation-representation-to-acoustic prediction model that constitute the speech synthesis system. The constructed system first predicts pronunciation representations from text characters and then generates speech from the pronunciation representations. The system and method thus avoid the reliance of traditional speech synthesis methods on language-specific pronunciation lexicons when building multilingual systems, and avoid the need for language experts and the large amount of manpower and time usually required to build such lexicons. Compared with existing methods that predict acoustic features directly from text characters, the method reduces pronunciation errors in the synthesized speech and improves its naturalness.
Brief Description of the Drawings
In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.
FIG. 1 is a schematic diagram of the overall structure of the speech synthesis system that does not rely on a pronunciation lexicon provided by an embodiment of the present invention;
FIG. 2 is a schematic structural diagram of the language-independent speech recognition model of the speech synthesis system that does not rely on a pronunciation lexicon provided by an embodiment of the present invention;
FIG. 3 is a schematic diagram of the pronunciation representation extraction process of the language-independent speech recognition model of the speech synthesis system that does not rely on a pronunciation lexicon provided by an embodiment of the present invention;
FIG. 4 is a schematic diagram of the acoustic modeling process of the speech synthesis system that does not rely on a pronunciation lexicon provided by an embodiment of the present invention.
Detailed Description of the Embodiments
The technical solutions in the embodiments of the present invention are described clearly and completely below in connection with the specific content of the present invention. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention, and they do not constitute a limitation of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the protection scope of the present invention.
Terms that may be used herein are first explained as follows:
The term "and/or" means that either or both of the connected items may be realized; for example, "X and/or Y" covers three cases: "X" alone, "Y" alone, and "X and Y".
The terms "comprise", "include", "contain", "have", or descriptions with similar meanings shall be construed as non-exclusive inclusions. For example, a recitation that includes certain technical feature elements (such as raw materials, components, ingredients, carriers, dosage forms, materials, dimensions, parts, components, mechanisms, devices, steps, processes, methods, reaction conditions, processing conditions, parameters, algorithms, signals, data, products, or articles) shall be construed to include not only the explicitly listed technical feature elements but also other technical feature elements known in the art that are not explicitly listed.
The term "consisting of" excludes any technical feature element not explicitly listed. If this term is used in a claim, it makes the claim closed, so that the claim does not contain technical feature elements other than those explicitly listed, except for conventional impurities associated therewith. If the term appears only in a clause of a claim, it limits only the elements explicitly listed in that clause; elements recited in other clauses are not excluded from the claim as a whole.
Unless otherwise expressly specified or limited, the terms "mounted", "connected", "coupled", "fixed", and the like shall be understood broadly; for example, a connection may be fixed, detachable, or integral; it may be mechanical or electrical; it may be direct, indirect through an intermediate medium, or an internal communication between two elements. For those of ordinary skill in the art, the specific meanings of the above terms herein can be understood according to the specific situation.
Orientations or positional relationships indicated by terms such as "center", "longitudinal", "lateral", "length", "width", "thickness", "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", "outer", "clockwise", and "counterclockwise" are based on the orientations or positional relationships shown in the drawings; they are used only for convenience and simplicity of description, and do not indicate or imply that the referenced device or element must have a particular orientation or be constructed and operated in a particular orientation, and therefore shall not be construed as limiting this text.
The speech synthesis method that does not rely on a pronunciation lexicon provided by the present invention is described in detail below. Content not described in detail in the embodiments of the present invention belongs to the prior art known to those skilled in the art. Where specific conditions are not indicated in the embodiments of the present invention, the conventional conditions in the art or the conditions suggested by the manufacturer are followed. Reagents or instruments used in the embodiments of the present invention whose manufacturers are not indicated are conventional products that are commercially available.
As shown in FIG. 1, an embodiment of the present invention provides a speech synthesis system that does not rely on a pronunciation lexicon, comprising:
a language-independent speech recognition model, a text-to-pronunciation-representation prediction model, a pronunciation-representation-to-acoustic prediction model, and a neural network vocoder; wherein
the language-independent speech recognition model extracts, in the training stage, pronunciation representations from input speech waveforms of the target language, and provides the pronunciation representations to the text-to-pronunciation-representation prediction model and the pronunciation-representation-to-acoustic prediction model for training, so as to obtain the trained text-to-pronunciation-representation prediction model and pronunciation-representation-to-acoustic prediction model;
the text-to-pronunciation-representation prediction model, once trained, predicts pronunciation representations from the character sequence of the input text to be synthesized and outputs them to the trained pronunciation-representation-to-acoustic prediction model;
the pronunciation-representation-to-acoustic prediction model, connected with the neural network vocoder, generates a mel spectrogram from the pronunciation representations predicted by the text-to-pronunciation-representation prediction model;
the neural network vocoder reconstructs the mel spectrogram generated by the pronunciation-representation-to-acoustic prediction model into the speech waveform corresponding to the text to be synthesized.
Referring to FIG. 2, in the above speech synthesis system, the language-independent speech recognition model comprises:
a wav2vec 2.0 model, a first linear layer, and a second linear layer connected in sequence; wherein
the wav2vec 2.0 model is a wav2vec 2.0 model without the quantization module, and its training input is a multilingual corpus with IPA phoneme transcriptions;
the first linear layer is a bottleneck layer that maps the 1024-dimensional contextual representations (C) to 512-dimensional bottleneck representations (B);
the second linear layer is a classification layer that predicts class probabilities (P) from the bottleneck representations output by the first linear layer;
the training objective of the language-independent speech recognition model is the CTC loss between the class probabilities and the target IPA sequence.
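For illustration, a minimal PyTorch sketch of such a recognition model is given below: a wav2vec 2.0-style backbone followed by the bottleneck and classification layers, trained with CTC. The backbone object, the class count of 204 (taken from Embodiment 1), and all variable names are assumptions for illustration, not the reference implementation of the patent.

```python
import torch.nn as nn

class LanguageIndependentASR(nn.Module):
    def __init__(self, backbone, num_classes=204):  # e.g. 203 IPA phonemes + CTC blank
        super().__init__()
        self.backbone = backbone                        # wav2vec 2.0 encoder, quantization module removed
        self.bottleneck = nn.Linear(1024, 512)          # first linear layer: contextual C -> bottleneck B
        self.classifier = nn.Linear(512, num_classes)   # second linear layer: bottleneck B -> class logits

    def forward(self, waveform):
        context = self.backbone(waveform)               # frame-level contextual representations C, (T, 1024)
        bottleneck = self.bottleneck(context)           # frame-level bottleneck representations B, (T, 512)
        logits = self.classifier(bottleneck)            # class logits over phoneme symbols + blank, (T, num_classes)
        return bottleneck, logits

# Training objective: CTC loss between the class probabilities and the target IPA sequence.
ctc_loss = nn.CTCLoss(blank=0)  # assumes class index 0 is the CTC blank symbol
```

In practice, a pretrained multilingual wav2vec 2.0 encoder with its quantization module removed would be plugged in as `backbone`.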
In the above speech synthesis system, the text-to-pronunciation-representation prediction model adopts a sequence-to-sequence structure based on Tacotron2 [5];
the error function for training the text-to-pronunciation-representation prediction model is the mean squared error and mean absolute error between the predicted pronunciation representations and the extracted pronunciation representations, plus the binary cross-entropy of the stop token.
In the above speech synthesis system, the pronunciation-representation-to-acoustic prediction model adopts the structure of Tacotron2;
the loss function of the pronunciation-representation-to-acoustic prediction model is the mean squared error and mean absolute error between the predicted mel spectrogram and the ground-truth mel spectrogram, plus the binary cross-entropy of the stop token.
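Both prediction models therefore share the same form of training loss. The following is a minimal sketch of such a loss, where the target is either the extracted pronunciation representations or the ground-truth mel spectrogram; tensor names and shapes are illustrative assumptions.

```python
import torch.nn.functional as F

def seq2seq_loss(pred, target, stop_logits, stop_labels):
    """pred/target: (batch, frames, dim); stop_logits/stop_labels: (batch, frames)."""
    mse = F.mse_loss(pred, target)                                        # mean squared error term
    mae = F.l1_loss(pred, target)                                         # mean absolute error term
    stop = F.binary_cross_entropy_with_logits(stop_logits, stop_labels)   # stop-token term
    return mse + mae + stop
```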
An embodiment of the present invention also provides a speech synthesis method that does not rely on a pronunciation lexicon, using the speech synthesis system of the present invention. First, the language-independent speech recognition model of the speech synthesis system extracts pronunciation representations from input speech waveforms of the target language; the pronunciation representations are used to train the text-to-pronunciation-representation prediction model and the pronunciation-representation-to-acoustic prediction model of the speech synthesis system, yielding the trained text-to-pronunciation-representation prediction model and pronunciation-representation-to-acoustic prediction model. Synthesis then proceeds as follows:
the text to be synthesized is input into the trained text-to-pronunciation-representation prediction model of the speech synthesis system, which predicts pronunciation representations from the character sequence of the text to be synthesized and outputs them to the pronunciation-representation-to-acoustic prediction model of the speech synthesis system;
the pronunciation-representation-to-acoustic prediction model generates a mel spectrogram from the pronunciation representations and outputs the mel spectrogram to the neural network vocoder of the speech synthesis system;
the neural network vocoder reconstructs the mel spectrogram into the speech waveform corresponding to the text to be synthesized.
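A minimal sketch of this synthesis procedure is given below: characters, then pronunciation representations, then mel spectrogram, then waveform. The three model objects and their `infer` method names are illustrative assumptions standing in for the trained models of the system.

```python
def synthesize(text, text_to_cpr_model, cpr_to_mel_model, vocoder):
    chars = list(text)                     # character sequence of the text to be synthesized
    cpr = text_to_cpr_model.infer(chars)   # predicted pronunciation representations, e.g. (N, 512)
    mel = cpr_to_mel_model.infer(cpr)      # predicted mel spectrogram, e.g. (frames, 80)
    return vocoder.infer(mel)              # reconstructed speech waveform
```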
The language-independent speech recognition model extracts pronunciation representations from the input speech waveform of the target language in the following manner:
computing the frame-level bottleneck representations B = [b1, …, bT] of the input speech waveform of the target language;
applying the argmax function to the class probabilities P output by the language-independent speech recognition model to obtain the phoneme symbol class corresponding to each frame;
performing a classification operation on the frame-level bottleneck representations by combining B = [b1, …, bT] with the phoneme symbol class of each frame, i.e., assigning the class of the t-th frame to the bottleneck representation bt;
applying a merge operation to remove the bottleneck representations of the blank class and to merge adjacent bottleneck representations with the same class into single vectors by averaging, yielding R = [r1, …, rN], i.e., the pronunciation representations, where N is the number of pronunciation representations of the speech waveform and N is less than T, T being the number of frame-level bottleneck representations of the speech waveform.
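The extraction procedure above can be sketched as follows. This is a minimal NumPy illustration that assumes the blank class has index 0; it is not the patent's reference implementation.

```python
import numpy as np

def extract_cpr(bottleneck, probs, blank_id=0):
    """bottleneck: (T, 512) frame-level bottleneck representations B;
    probs: (T, num_classes) class probabilities P; returns (N, 512) pronunciation representations R."""
    classes = probs.argmax(axis=-1)                   # per-frame class: phoneme symbol or CTC blank
    keep = classes != blank_id                        # remove bottleneck representations of the blank class
    bottleneck, classes = bottleneck[keep], classes[keep]
    reps, start = [], 0
    for t in range(1, len(classes) + 1):              # scan for runs of adjacent frames sharing one class
        if t == len(classes) or classes[t] != classes[start]:
            reps.append(bottleneck[start:t].mean(axis=0))  # merge each run into one vector by averaging
            start = t
    return np.stack(reps) if reps else np.empty((0, bottleneck.shape[1]))
```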
In summary, the speech synthesis system and method of the embodiments of the present invention, by adopting a language-independent automatic speech recognition model, can automatically extract pronunciation representations from speech data of the target language and then use these representations to train the text-to-pronunciation-representation prediction model and the pronunciation-representation-to-acoustic prediction model of the speech synthesis system. The constructed system first predicts pronunciation representations from text characters and then generates speech from them. The system and method thus avoid the reliance of traditional speech synthesis methods on language-specific pronunciation lexicons when building multilingual systems, and avoid the need for language experts and the large amount of manpower and time usually required to build such lexicons. Compared with existing methods that predict acoustic features directly from text characters, the method reduces pronunciation errors in the synthesized speech and improves its naturalness.
In order to present the technical solutions provided by the present invention and the resulting technical effects more clearly, the speech synthesis method that does not rely on a pronunciation lexicon provided by the embodiments of the present invention is described in detail below with specific embodiments.
Embodiment 1
As shown in FIG. 1, this embodiment provides a speech synthesis system that does not rely on a pronunciation lexicon, comprising a language-independent speech recognition model, a text-to-pronunciation-representation prediction model, a pronunciation-representation-to-acoustic prediction model, and a neural network vocoder;
The input of the language-independent speech recognition model is the speech waveform of the target language, and its training targets are International Phonetic Alphabet (IPA) [9] phoneme sequences shared across languages. As shown in FIG. 2, the language-independent speech recognition model uses wav2vec 2.0 [10] as its backbone architecture, adds two linear layers after the contextual representations, and removes the quantization module of the original wav2vec 2.0 model. The first linear layer, called the bottleneck layer, maps the 1024-dimensional contextual representations (C) to 512-dimensional bottleneck representations (B). The second layer is the classification layer, which predicts class probabilities (P) from the bottleneck representations; the number of classes depends on the size of the phoneme set. To learn pronunciation representations shared across languages, the speech recognition model is trained on a multilingual corpus with IPA phoneme transcriptions, using the CTC loss between the output class probabilities and the target IPA sequences.
The extraction process of the pronunciation representations of the language-independent speech recognition model is shown in FIG. 3, taking an audio segment of the English word "work" as an example. φ denotes the blank symbol of the CTC loss, and w, o, r, and k are the recognized phoneme symbols.
The frame-level bottleneck representations B = [b1, …, bT] are first computed with the language-independent automatic speech recognition model, where T is the number of frames of the speech waveform;
the argmax function is then applied to the class probabilities P output by the speech recognition model, producing the corresponding class of each frame (a phoneme symbol or the CTC blank symbol);
a classification operation is then performed to assign the class of the t-th frame to the bottleneck representation bt;
finally, a merge operation is applied to remove the bottleneck representations of the blank class and to merge adjacent bottleneck representations with the same class into a single vector by averaging, yielding the continuous phonetic representations (CPR). In FIG. 2, the extracted pronunciation representations are written as R = [r1, …, rN], where N is the number of pronunciation representations of the waveform; obviously, N is less than T.
When training this language-independent speech recognition model, corpora in 19 languages with a total duration of 2,771 hours were used in this embodiment. The text transcriptions of all speech were first converted into IPA phoneme sequences using the open-source tool Phonemizer, yielding a phoneme set of size 203; with the CTC blank symbol added, the class probability dimension output by the classification layer of the speech recognition model is therefore 204. The corpus of each language was split into a training set and a validation set at a ratio of 99:1, and the training batch size was 1.2 hours of speech. The model checkpoint with the lowest phoneme recognition error rate on the validation set (5.87%) was finally selected. The loss function used for training is the CTC error between the output class probabilities and the ground-truth IPA sequences.
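As a small illustration of the transcription step above, the open-source Phonemizer tool can convert text into an IPA phoneme sequence roughly as follows; the espeak backend, the language code, and the exact output string are assumptions that depend on the installed backend and version.

```python
from phonemizer import phonemize

ipa = phonemize("work", language="en-us", backend="espeak", strip=True)
print(ipa)  # e.g. an IPA string such as "wˈɜːk"
```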
FIG. 4 illustrates the acoustic modeling of the speech synthesis system of this embodiment based on pronunciation representations. It consists of two parts: the text-to-pronunciation-representation prediction model and the pronunciation-representation-to-acoustic prediction model. The text-to-pronunciation-representation model predicts pronunciation representations from the input text, and the pronunciation-representation-to-acoustic model generates a mel spectrogram from the pronunciation representations; in the synthesis stage, a neural network vocoder reconstructs the speech waveform from the mel spectrogram, completing the speech synthesis of the text to be synthesized. Specifically, the text-to-pronunciation-representation prediction model adopts a sequence-to-sequence structure based on Tacotron2 [5], with three differences from Tacotron2: first, the training target is changed from the 80-dimensional mel spectrogram to the 512-dimensional pronunciation representations; second, since the temporal continuity of pronunciation representations is less pronounced than that of acoustic features, the post-net is removed; finally, the dropout in the pre-net is not applied in the synthesis stage.
The training error function is the mean squared error and mean absolute error between the predicted pronunciation representations and the extracted pronunciation representations, plus the binary cross-entropy of the stop token.
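As a rough illustration, the three modifications above could be expressed as configuration options of a Tacotron2-style implementation. The option names below are hypothetical and only summarize the differences described above; they are not taken from any particular Tacotron2 code base.

```python
text_to_cpr_config = {
    "decoder_output_dim": 512,              # target is 512-dim pronunciation representations, not the 80-dim mel spectrogram
    "use_postnet": False,                   # post-net removed: CPRs are less temporally continuous than acoustic features
    "prenet_dropout_at_inference": False,   # pre-net dropout is not applied in the synthesis stage
}
```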
The pronunciation-representation-to-acoustic prediction model is also based on the Tacotron2 structure, the difference being that it takes pronunciation representations as input rather than character or phoneme sequences, so no text embedding is needed. Its loss function consists of the mean squared error and mean absolute error between the predicted and ground-truth mel spectrograms, plus the binary cross-entropy of the stop token. In the synthesis stage, the predicted mel spectrogram is fed into the neural network vocoder to reconstruct the speech waveform.
Embodiment 2
As shown in FIG. 1, an embodiment of the present invention provides a speech synthesis system that does not rely on a pronunciation lexicon but instead uses a multilingual universal speech recognition model to extract language-independent pronunciation representations. The system realizes speech synthesis based on continuous phonetic representations (CPR) and improves synthesis performance when no pronunciation lexicon is available for the target language. The language-independent speech recognition model of the system extracts pronunciation representations from speech waveforms of the target language; a pronunciation representation is the output vector of a hidden layer of the speech recognition model and is used as the intermediate representation for acoustic modeling. The acoustic model based on pronunciation representations consists of two parts: the text-to-pronunciation-representation prediction model and the pronunciation-representation-to-acoustic prediction model. The text-to-pronunciation-representation model predicts pronunciation representations from the character sequence, and the pronunciation-representation-to-acoustic model generates the mel spectrogram from the pronunciation representations. Since the language-independent speech recognition model is trained with the connectionist temporal classification (CTC) loss [8], the extracted pronunciation representations are at the segment level and have a length similar to that of the phoneme sequence. Finally, in the synthesis stage, the system reconstructs the speech waveform from the generated mel spectrogram through a neural network vocoder, completing the speech synthesis of the text to be synthesized.
The acoustic model based on pronunciation representations of the present invention can automatically predict pronunciation representations from character sequences without relying on a pronunciation lexicon, which improves the naturalness and intelligibility of the synthesized speech compared with predicting the mel spectrogram directly from character sequences. In contrast, a traditional speech synthesis system needs to process the input text into a phoneme sequence with a pronunciation lexicon and a G2P model in the front-end text analysis stage before feeding it to the back-end modules for acoustic feature prediction and waveform reconstruction; although some sequence-to-sequence acoustic models can take character sequences directly as input, doing so degrades the quality of the synthesized speech.
The effectiveness of the speech synthesis system and method of the present invention is verified as follows:
(1) Test setup:
Experiments were conducted on six target languages: English (en), Spanish (es), Kazakh (kk), Hindi (hi), Bulgarian (bg), and Malay (ms). The corpus durations (i.e., speech waveforms) are 24, 28, 11, 16, 5, and 10 hours, respectively, with 13,100, 19,351, 6,447, 9,293, 3,006, and 5,780 sentences. All languages use a single female speaker. The corpus of each language is divided into a training set, a validation set, and a test set; for English, Spanish, Kazakh, Hindi, Bulgarian, and Malay, the validation and test sets each contain 300, 400, 200, 250, 150, and 200 sentences, respectively, and the remaining corpus is used for training. In addition, another text-only test set of 1,000 sentences per language was added for speech-recognition-based intelligibility evaluation. The acoustic model based on pronunciation representations of the present invention (i.e., the speech synthesis system of the present invention) was compared with the three models listed below:
(1) Taco-Char model: the Tacotron2 [5] model using character sequences as input, serving as the experimental baseline.
(2) Taco-Phone model: the Tacotron2 model using phoneme sequences as input; the open-source G2P tool Phonemizer was used to convert the text transcriptions of the datasets into phoneme sequences, except for English, for which Festival was used.
(3) DPS model: the discrete phoneme symbols recognized by the language-independent speech recognition model are used as the intermediate representation of the acoustic model instead of the pronunciation representations; here the text-to-pronunciation-representation prediction model becomes a G2P model based on long short-term memory networks, and the pronunciation-representation-to-acoustic model is changed to take discrete phoneme symbols as input.
The character error rate (CER) given by speech recognition was used to evaluate the intelligibility of the synthesized speech. In this embodiment, the Google speech recognition API was used to transcribe the synthesized speech for CER computation. As mentioned above, a test set of 1,000 sentences per language was used for the CER evaluation. Subjective listening tests were also conducted to evaluate the naturalness mean opinion scores (MOS) of the synthesized speech. For each language, 30 sentences were selected from the test set and synthesized with the different models. The naturalness of each synthesized utterance was rated on a scale of 1 to 5 with an interval of 0.5 by 4 to 6 native speakers.
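For reference, the CER metric mentioned above is the character-level Levenshtein edit distance between the recognized transcript and the reference text, normalized by the reference length. A minimal sketch follows; the function names are illustrative.

```python
def edit_distance(hyp, ref):
    # single-row dynamic programming for the Levenshtein distance
    d = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, 1):
        prev, d[0] = d[0], i
        for j, r in enumerate(ref, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (h != r))
    return d[len(ref)]

def cer(hypothesis, reference):
    return edit_distance(list(hypothesis), list(reference)) / max(len(reference), 1)

print(cer("speech synthesis", "speech synthesys"))  # 0.0625: one substituted character out of 16
```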
(2) Test results:
Table 1 below lists the CER results. It can be seen that the DPS model performs worst, because deriving the discrete recognition results involves hard decisions. The Taco-Phone model outperforms the Taco-Char model on three of the languages, but performs comparably on es and slightly worse on kk and bg. There may be two reasons: first, the word pronunciations of the latter three languages can easily be inferred from the spelling; second, the Phonemizer tool may not provide sufficiently accurate phoneme transcriptions for those languages. The acoustic model based on pronunciation representations of the present invention (i.e., the Proposed model) outperforms all the other models on all languages, except the Taco-Phone model on hi. This demonstrates the effectiveness of the speech synthesis system and method of the present invention in synthesizing highly intelligible speech without relying on a pronunciation lexicon.
Table 1: CER (%) of the different models on the six target languages
Since the CER performance of the DPS model is poor, it was not included in the subjective listening test. Table 2 below gives the naturalness MOS results of the subjective listening test. In this embodiment, the p-value of a paired t-test is used to assess the significance of the difference between two models. Similar to the results in Table 1, the Taco-Phone model achieves better naturalness than the Taco-Char model on en, hi, and ms (p = 5×10⁻⁶, 0.03, 2×10⁻³); however, it is comparable to the Taco-Char model on kk (p = 0.14) and bg (p = 1.00), and even worse than the Taco-Char model on es (p = 0.02). The speech synthesis system of this embodiment performs significantly better than the Taco-Char model on en, kk, hi, bg, and ms (p = 2×10⁻¹², 7×10⁻³, 8×10⁻⁴, 7×10⁻⁵, 1×10⁻⁴) and is comparable to it on es (p = 0.17), probably because the spelling-to-pronunciation mapping of Spanish is simple. In addition, compared with the Taco-Phone model, the speech synthesis system of the present invention achieves better naturalness on en, es, and bg (p = 3×10⁻⁵, 3×10⁻⁴, 1×10⁻⁶), and the two are comparable on kk, hi, and ms (p = 0.15, 0.25, 0.18). These results also demonstrate the superiority of the speech synthesis system and method of the present invention.
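As a small illustration of the significance test used above, a paired t-test over per-utterance scores of two models can be computed with scipy's ttest_rel. The score arrays below are hypothetical and are not the experimental data.

```python
from scipy.stats import ttest_rel

proposed_scores = [4.5, 4.0, 4.5, 3.5, 4.0, 4.5]   # hypothetical per-utterance MOS ratings, model A
baseline_scores = [4.0, 3.5, 4.0, 3.5, 3.5, 4.0]   # hypothetical per-utterance MOS ratings, model B
statistic, p_value = ttest_rel(proposed_scores, baseline_scores)
print(p_value)  # a small p-value indicates a significant difference between the two models
```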
Table 2: Naturalness MOS of the different models on the six target languages, with 95% confidence intervals
In summary, the speech synthesis system and method of the embodiments of the present invention do not rely on a pronunciation lexicon; instead, a language-independent speech recognition model is used to extract language-independent pronunciation representations for text-to-speech prediction and synthesis. The quality of the synthesized speech is high, which improves the performance of the speech synthesis system.
The above is only a preferred embodiment of the present invention, but the protection scope of the present invention is not limited thereto. Any change or substitution that can readily be conceived by those skilled in the art within the technical scope disclosed by the present invention shall be covered by the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims. The information disclosed in this Background section is only intended to enhance the understanding of the general background of the present invention and should not be taken as an acknowledgement or any form of suggestion that this information constitutes prior art already known to those skilled in the art.