CN113160792B - A multilingual speech synthesis method, device and system - Google Patents

A multilingual speech synthesis method, device and system

Info

Publication number
CN113160792B
Authority
CN
China
Prior art keywords: splicing, unit, target, voice, speech
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110052188.4A
Other languages
Chinese (zh)
Other versions
CN113160792A (en)
Inventor
李心广
梁楚铧
李苏梅
杨远城
陈帅
马姗娴
刘聪聪
张浩
龙晓岚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong University of Foreign Studies
Original Assignee
Guangdong University of Foreign Studies
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong University of Foreign Studies
Priority to CN202110052188.4A
Publication of CN113160792A
Application granted
Publication of CN113160792B
Legal status: Active
Anticipated expiration

Abstract

The invention discloses a multilingual speech synthesis method, device and system. The method acquires a text to be processed; performs language identification on each character of the text and stores the characters in corresponding preset character libraries; divides each character string in the language character libraries into prosodic structures using a pre-trained prosodic structure prediction model; obtains, for each character string and its corresponding prosodic structure, a corresponding target speech splicing unit from a preset speech splicing unit library; determines adjacent target speech splicing units in recognition order and calculates the optimal splicing point between them; and splices adjacent target splicing units at the optimal splicing points to obtain the synthesized speech. The embodiments of the invention can synthesize speech for the different languages in a text passage while effectively ensuring the naturalness and fluency of the synthesized speech, thereby meeting user needs.

Description

Translated from Chinese
A multilingual speech synthesis method, device and system

Technical Field

The present invention relates to the field of speech synthesis technology, and in particular to a multilingual speech synthesis method, device and system.

Background Art

With the rapid development of science and technology, speech synthesis technology is constantly evolving and entering the market to meet the needs of society. At present, most speech synthesis systems on the market can synthesize only one language at a time. When the text passage a user needs to convert to speech contains both Chinese and English, the user has to switch back and forth between different language recognition systems many times and then splice the results manually with audio editing software, which is both cumbersome and yields poor results.

Therefore, multilingual speech synthesis systems came into being. However, in the process of implementing the present invention, the inventors found that the prior art has at least the following problems: existing multilingual speech synthesis systems often have the ability to recognize different text languages, but during the speech synthesis stage the system can only select the synthesis rules of one language, so that the synthesized speech corresponding to the other-language parts of the text is distorted, unnatural, or even degenerates into noise.

Summary of the Invention

The purpose of the embodiments of the present invention is to provide a multilingual speech synthesis method, device and system that can synthesize speech for the different languages in a text passage while effectively ensuring the naturalness and fluency of the synthesized speech, thereby meeting user needs.

To achieve the above object, an embodiment of the present invention provides a multilingual speech synthesis method, comprising:

acquiring a text to be processed; performing language identification on each character in the text to be processed and storing the characters in corresponding preset character libraries; wherein the text to be processed contains at least two languages; the preset character libraries include a special character library and language character libraries corresponding to at least two languages; the characters stored in a language character library form corresponding character strings in units of sentences;

dividing each character string in the language character libraries into prosodic structures using a pre-trained prosodic structure prediction model;

obtaining, according to each character string and its corresponding prosodic structure, a corresponding target speech splicing unit from a preset speech splicing unit library;

determining adjacent target speech splicing units in recognition order and calculating the optimal splicing point between adjacent target speech splicing units; and splicing adjacent target splicing units according to the optimal splicing points to obtain synthesized speech.

As an improvement of the above solution, when the text to be processed includes Chinese, the language character libraries include a Chinese character library;

the dividing of each character string in the language character libraries into prosodic structures using the pre-trained prosodic structure prediction model then includes:

segmenting the character strings in the Chinese character library into words using a preset word segmentation neural network model, and determining the pinyin and tone of each segmented phrase to obtain the pinyin string, expressed in pinyin and tones, corresponding to the character string;

dividing each character string into prosodic structures according to its corresponding pinyin string using a preset prosodic structure prediction model; wherein the prosodic structure comprises at least three levels: prosodic words, prosodic phrases and intonation phrases.

As an improvement of the above solution, the training method of the preset prosodic structure prediction model specifically includes:

acquiring a training corpus; wherein the training corpus includes a number of Chinese sentences and their corresponding prosodic structure division results;

training an RNN (recurrent neural network) on the training corpus to obtain a trained recurrent neural network model;

performing linear fusion training on the recurrent neural network model using a classification and regression tree algorithm to obtain the fully trained recurrent neural network model, which serves as the prosodic structure prediction model.

As an improvement of the above solution, when the text to be processed includes English, the language character libraries include an English character library;

the dividing of each character string in the language character libraries into prosodic structures using the pre-trained prosodic structure prediction model then includes:

determining the text attributes of the character strings in the English character library; wherein the text attributes are related to the context;

dividing each character string into prosodic structures according to its text attributes using a preset prosodic structure prediction model; wherein the prosodic structure prediction model is a binary decision regression tree that reflects the mapping between the text attributes and the prosodic structures.

As an improvement of the above solution, the method for constructing the binary decision regression tree specifically includes:

determining the text attributes of each speech splicing unit in the speech splicing unit library;

clustering the speech splicing units using a decision regression tree algorithm, with the average distance between speech splicing units as the impurity measure, to obtain prosodic structures of different levels;

constructing the binary decision regression tree according to the mapping between the text attributes and the prosodic structures.

As an improvement of the above solution, the obtaining, according to each character string and its corresponding prosodic structure, of a corresponding target speech splicing unit from the preset speech splicing unit library specifically includes:

obtaining, according to the prosodic structure corresponding to each character string in the English character library, candidate speech splicing units that conform to the prosodic structure from the preset speech splicing unit library;

calculating, according to the target cost and connection cost of each candidate speech splicing unit, the splicing cost of each candidate speech splicing unit by the following formula:

C = Wt × (1/n) × Σ(i=1..n) Ct(i) + Wc × Cc,  with Ct(i) = Σ(j=1..P) ωj × dj

where C is the splicing cost of the candidate speech splicing unit, Ct(i) is the target cost of the candidate speech splicing unit, and Cc is its connection cost; Wt and Wc are the weights of the target cost and the connection cost respectively, n is the length of the candidate speech splicing unit, P is the number of text attributes, dj is the rule distance of the j-th text attribute, and ωj is the weight of the j-th text attribute;

taking the candidate speech splicing unit that minimizes the splicing cost as the target speech splicing unit.
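
For illustration, a minimal Python sketch of this cost computation and selection, assuming the cost form C = Wt·(1/n)·ΣCt(i) + Wc·Cc given above; the weights, rule distances and candidate structure are hypothetical placeholders, not the patent's actual data:

    def target_cost(rule_distances, attr_weights):
        """Ct(i): weighted sum of the rule distances over the P text attributes."""
        return sum(w * d for w, d in zip(attr_weights, rule_distances))

    def splicing_cost(candidate, attr_weights, wt=0.6, wc=0.4):
        """C = Wt * (1/n) * sum_i Ct(i) + Wc * Cc for one candidate unit."""
        n = len(candidate["rule_distances"])  # length of the candidate unit
        ct = sum(target_cost(d, attr_weights) for d in candidate["rule_distances"]) / n
        return wt * ct + wc * candidate["connection_cost"]

    def select_unit(candidates, attr_weights):
        """Pick the candidate speech splicing unit with the minimum splicing cost."""
        return min(candidates, key=lambda c: splicing_cost(c, attr_weights))

    # Usage: two hypothetical candidates with P = 3 text attributes each.
    candidates = [
        {"rule_distances": [[0.0, 0.5, 1.0], [0.5, 0.5, 0.0]], "connection_cost": 1.0},
        {"rule_distances": [[0.0, 0.0, 0.5]], "connection_cost": 0.0},
    ]
    best = select_unit(candidates, attr_weights=[0.5, 0.3, 0.2])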

As an improvement of the above solution, the determining of adjacent target speech splicing units in recognition order, the calculating of the optimal splicing points between adjacent target speech splicing units, and the splicing of adjacent target splicing units according to the optimal splicing points to obtain synthesized speech specifically include:

dividing each target splicing unit into frames, and extracting the MFCC feature parameters of the first m frames and the last m frames of each target splicing unit;

determining adjacent target speech splicing units in recognition order, and taking any frame among the last m frames of the preceding target splicing unit and any frame among the first m frames of the following target splicing unit as a candidate splicing point pair of the adjacent target speech splicing units, thereby obtaining m² splicing combinations;

calculating the splicing cost corresponding to each splicing combination by the following formula, and determining the optimal splicing point between adjacent target speech splicing units from the splicing combination with the minimum splicing cost:

Cm = (1/Tm) × Σ(i=1..MFCCdim) |MFCC1,m(i) − MFCC2,m(i)|

where Cm is the splicing cost corresponding to the splicing combination; MFCC1,m(i) is the MFCC feature parameter value corresponding to the m-th splicing point of the preceding target splicing unit, MFCC2,m(i) is that of the m-th splicing point of the following target splicing unit, MFCCdim is the dimension of the MFCC feature parameters of one frame, and Tm is the threshold of the fundamental-frequency splicing cost function; m ≥ 1;
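
A minimal sketch of this m × m splice-point search, assuming the MFCC-distance cost reconstructed above; the frame count, MFCC dimension and threshold are illustrative:

    import numpy as np

    def best_splice_point(mfcc_prev_tail, mfcc_next_head, t_m=1.0):
        """Search all m x m frame pairs (last m frames of the preceding unit,
        first m frames of the following unit) and return the pair with the
        smallest MFCC-distance splicing cost.
        Both inputs are arrays of shape (m, MFCCdim)."""
        m = mfcc_prev_tail.shape[0]
        best = (0, 0, np.inf)
        for i in range(m):          # candidate frame in the preceding unit
            for j in range(m):      # candidate frame in the following unit
                cost = np.abs(mfcc_prev_tail[i] - mfcc_next_head[j]).sum() / t_m
                if cost < best[2]:
                    best = (i, j, cost)
        return best                 # (frame in prev unit, frame in next unit, cost)

    # Usage with random features: m = 5 candidate frames, 13-dimensional MFCCs.
    rng = np.random.default_rng(0)
    i, j, cost = best_splice_point(rng.normal(size=(5, 13)), rng.normal(size=(5, 13)))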

splicing adjacent target splicing units according to the optimal splicing points to obtain initial synthesized speech;

calculating the energy of the last N frames of the preceding target speech splicing unit and the energy of the first N frames of the following target speech splicing unit; when the difference between these two energies is higher than a preset silence energy threshold, calculating the volume adjustment decibels of the adjacent target speech splicing units by the following formulas, and adjusting the volume of the target speech splicing units by those decibels to obtain the final synthesized speech:

DB1 = (E1 − E2) × W1;

DB2 = (E1 − E2) × W2;

where DB1 is the volume adjustment in decibels of the preceding target speech splicing unit and DB2 is that of the following target speech splicing unit; E1 is the energy of the last N frames of the preceding target speech splicing unit, E2 is the energy of the first N frames of the following target speech splicing unit, N ≥ 1; W1 and W2 are preset weights.
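
A minimal sketch of this energy check and volume adjustment; the frame-energy definition (mean squared amplitude), threshold and weights below are assumptions:

    import numpy as np

    def volume_adjust_db(prev_tail_frames, next_head_frames,
                         silence_threshold=1e-4, w1=0.5, w2=0.5):
        """If the energy difference between the last N frames of the preceding
        unit (E1) and the first N frames of the following unit (E2) exceeds the
        silence-energy threshold, return DB1 = (E1 - E2) * W1 and
        DB2 = (E1 - E2) * W2; otherwise no adjustment is needed.
        Inputs are NumPy arrays of audio samples."""
        e1 = float(np.mean(prev_tail_frames ** 2))  # energy of the last N frames
        e2 = float(np.mean(next_head_frames ** 2))  # energy of the first N frames
        if abs(e1 - e2) <= silence_threshold:
            return 0.0, 0.0
        return (e1 - e2) * w1, (e1 - e2) * w2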

As an improvement of the above solution, after the splicing of adjacent target splicing units according to the optimal splicing points to obtain synthesized speech, the method further includes:

smoothing the synthesized speech using a speech splicing smoothing algorithm based on the tacotron model.

An embodiment of the present invention further provides a multilingual speech synthesis device, comprising:

a language identification module, configured to acquire a text to be processed, perform language identification on each character in the text to be processed, and store the characters in corresponding preset character libraries; wherein the text to be processed contains at least two languages; the preset character libraries include a special character library and language character libraries corresponding to at least two languages; the characters stored in a language character library form corresponding character strings in units of sentences;

a prosodic structure division module, configured to divide each character string in the language character libraries into prosodic structures using a pre-trained prosodic structure prediction model;

a target unit acquisition module, configured to obtain, according to each character string and its corresponding prosodic structure, a corresponding target speech splicing unit from a preset speech splicing unit library;

a speech synthesis module, configured to determine adjacent target speech splicing units in recognition order, calculate the optimal splicing points between adjacent target speech splicing units, and splice adjacent target splicing units according to the optimal splicing points to obtain synthesized speech.

An embodiment of the present invention further provides a multilingual speech synthesis system, comprising a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor, the processor implementing the multilingual speech synthesis method described in any of the above when executing the computer program.

Compared with the prior art, the multilingual speech synthesis method, device and system disclosed by the present invention acquire a text to be processed; perform language identification on each character in the text to be processed and store the characters in corresponding preset character libraries; divide each character string in the language character libraries into prosodic structures using a pre-trained prosodic structure prediction model; obtain, according to each character string and its corresponding prosodic structure, a corresponding target speech splicing unit from a preset speech splicing unit library; determine adjacent target speech splicing units in recognition order and calculate the optimal splicing point between adjacent target speech splicing units; and splice adjacent target splicing units according to the optimal splicing points to obtain synthesized speech. With the technical means of the embodiments of the present invention, the prosodic parameters of the texts in different languages are taken into account during speech synthesis of multilingual text, so that the synthesized speech can express different ages, genders, tones and speaking rates and carry personal emotional color. During speech synthesis, suitable splicing points are selected for adjacent target speech splicing units, making the synthesized speech smoother and more fluent and improving its naturalness and expressiveness.

Brief Description of the Drawings

FIG. 1 is a schematic flow chart of the steps of a multilingual speech synthesis method in Embodiment 1 of the present invention;

FIG. 2 is a schematic flow chart of the language identification step in Embodiment 1 of the present invention;

FIG. 3 is a schematic diagram of the structure of the word segmentation neural network model in Embodiment 2 of the present invention;

FIG. 4 is a schematic diagram of prosodic structure division in Embodiment 2 of the present invention;

FIG. 5 is a schematic flow chart of the MFCC feature parameter extraction step in Embodiment 4 of the present invention;

FIG. 6 is a schematic flow chart of the splicing smoothing step for target speech splicing units in Embodiment 4 of the present invention;

FIG. 7 is a schematic diagram of the structure of a multilingual speech synthesis device in Embodiment 6 of the present invention;

FIG. 8 is a schematic diagram of the structure of a multilingual speech synthesis system in Embodiment 7 of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, rather than all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.

Referring to FIG. 1, which is a schematic flow chart of the steps of a multilingual speech synthesis method in Embodiment 1 of the present invention, the multilingual speech synthesis method provided by this embodiment is performed through steps S1 to S4:

S1. Acquire a text to be processed; perform language identification on each character in the text to be processed and store the characters in corresponding preset character libraries; wherein the text to be processed contains at least two languages; the preset character libraries include a special character library and language character libraries corresponding to at least two languages; the characters stored in a language character library form corresponding character strings in units of sentences.

In this embodiment of the present invention, the text to be processed contains at least two languages, and language identification is performed using a text language identification method based on Unicode code points. According to the Unicode encoding ranges of different languages, the type of a character is determined by checking the interval in which its Unicode code point falls.

As an example, the Unicode encoding ranges of Chinese characters, digits, uppercase and lowercase letters, and common punctuation marks are as follows:

Basic Chinese characters: [0x4E00, 0x9FA5] (decimal [19968, 40869])

Digits: [0x0030, 0x0039] (decimal [48, 57])

Lowercase letters: [0x0061, 0x007A] (decimal [97, 122])

Uppercase letters: [0x0041, 0x005A] (decimal [65, 90])

Common punctuation: [0x2000, 0x206F]

Referring to FIG. 2, which is a schematic flow chart of the language identification step in Embodiment 1 of the present invention: in the case where the text to be processed includes Chinese and English, the performing of language identification on each character in the text to be processed and the storing of the characters in the corresponding preset character libraries specifically include the following steps:

S11. Create a Chinese character library, an English character library and a special character library, used to store Chinese characters, English characters and special characters respectively, and create a string variable "string" for storing English letters.

S12. Traverse all characters in the text to be processed and determine whether there is a next character. If so, determine the interval in which the next character x falls according to its Unicode code point: when x falls in the Chinese interval, execute S13; when x falls in the digit interval, execute S14; when x falls in the special character interval, execute S15; when x falls in the English interval, execute S16. If not, the identification ends.

S13. Place x into the Chinese character library and execute S12.

S14. Convert x into a Chinese numeral, place it into the Chinese character library, and execute S12.

S15. Place x into the special character library and execute S12.

S16. Set string = string + x, and execute S17.

S17. Determine whether the next character falls in the English interval. If so, execute S16; otherwise place string into the English character library as an English word and clear string.

It should be noted that in the language character libraries, the stored characters form corresponding character strings in units of sentences. Specifically, while traversing and identifying the characters of the text to be processed, when the next identified character is a special character or belongs to a different language from the previously identified characters, the previously identified characters are formed into a character string and stored in the corresponding language character library.

It should be noted that the above language identification method is only one implementation and does not specifically limit this solution. In practical applications, language identification can also be performed by checking the ASCII codes of the characters, without affecting the beneficial effects achieved by the present invention.
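
For illustration, a minimal Python sketch of steps S11 to S17 using the Unicode ranges listed above; the digit-to-Chinese-numeral mapping and the return format are simplifying choices:

    def classify_text(text):
        """Sort each character of the input into Chinese / English / special
        character libraries by its Unicode code point (steps S11-S17)."""
        chinese, english, special = [], [], []
        word = ""                                  # the 'string' buffer for English letters
        for ch in text:
            cp = ord(ch)
            if 0x4E00 <= cp <= 0x9FA5:             # basic Chinese characters
                if word:
                    english.append(word); word = ""
                chinese.append(ch)
            elif 0x30 <= cp <= 0x39:               # digits -> Chinese numerals
                if word:
                    english.append(word); word = ""
                chinese.append("零一二三四五六七八九"[cp - 0x30])
            elif 0x41 <= cp <= 0x5A or 0x61 <= cp <= 0x7A:  # English letters
                word += ch
            else:                                  # everything else: special characters
                if word:
                    english.append(word); word = ""
                special.append(ch)
        if word:
            english.append(word)
        return chinese, english, special

    # Usage:
    # classify_text("Hello世界2!") -> (['世', '界', '二'], ['Hello'], ['!'])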

S2. Divide each character string in the language character libraries into prosodic structures using a pre-trained prosodic structure prediction model.

Specifically, prosody is also called rhythm, cadence, or the nonlinear features of speech. Prosodic structure division is an important component of a speech synthesis system, because people do not usually pause strictly according to word segmentation results in spoken communication (although the segmentation results still play a guiding role). Accurately predicting the positions and levels of the prosodic boundaries of a text is therefore an important link in speech synthesis and an important premise and guarantee for synthesizing natural, fluent output speech.

In this embodiment of the present invention, a prosodic structure prediction model is trained in advance and used to divide each character string in the language character libraries into prosodic structures, thereby ensuring good naturalness and fluency of the synthesized speech generated in the subsequent speech synthesis steps.

S3. Obtain, according to each character string and its corresponding prosodic structure, a corresponding target speech splicing unit from a preset speech splicing unit library.

A speech splicing unit library is prepared in advance to store a large number of speech splicing units that can be used to synthesize sentences. Concatenative synthesis inevitably loses some sound quality: the larger the splicing units, the fewer the splicing points and the better the synthesis effect, but the lower the flexibility, and the speech library then needs more splicing units. To balance the naturalness of the synthesized speech against the size of the speech library, this embodiment of the present invention uses speech splicing units of three different lengths: words, syllables and phonemes. The larger units are used to cover the various phonetic phenomena of continuous speech, especially coarticulation, while the smaller units are used to improve the flexibility and robustness of the system.

During speech synthesis, for each character string in the language character libraries, a suitable target speech splicing unit is found according to the phonetic parameters obtained from text analysis and prosodic structure analysis. If a word in the text to be processed has a corresponding word-level speech splicing unit, that unit is selected as the target speech splicing unit; if no word-level unit exists, the corresponding syllable-level speech splicing units are selected as the target speech splicing units; and if no syllable-level unit exists either, the corresponding phoneme-level speech splicing units are selected as the target speech splicing units.
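
A minimal sketch of this word → syllable → phoneme fallback, assuming the unit library is a simple dict from unit keys to waveform fragments (the keys below are hypothetical):

    def find_target_units(word, syllables, phonemes, unit_library):
        """Prefer a word-level unit, then syllable-level units, then
        phoneme-level units, as described above."""
        if word in unit_library:
            return [unit_library[word]]             # best case: one unit, no splices
        if all(s in unit_library for s in syllables):
            return [unit_library[s] for s in syllables]
        return [unit_library[p] for p in phonemes]  # smallest, most flexible units

    # Usage (hypothetical keys):
    # find_target_units("hello", ["HH-EH", "L-OW"], ["HH", "EH", "L", "OW"], lib)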

S4. Determine adjacent target speech splicing units in recognition order, calculate the optimal splicing point between adjacent target speech splicing units, and splice adjacent target splicing units according to the optimal splicing points to obtain synthesized speech.

In the speech synthesis process of the present invention, synthesized speech is obtained by waveform splicing of a number of target speech splicing units. Adjacent target splicing units usually come from different original recordings in the corpus, and even when the same splicing unit comes from the same original recording, its position and context within that recording differ, producing fairly obvious differences that strongly affect the quality of the synthesized speech.

To make the prosodic parameters of two adjacent target speech splicing units connect smoothly at the splicing point, the differences between the prosodic parameters of the two speech units at the candidate splicing points are calculated, the optimal splicing point is determined as the one with the minimum difference, and the adjacent target splicing units are spliced according to the optimal splicing point to obtain the synthesized speech.

Embodiment 1 of the present invention provides a multilingual speech synthesis method: acquire a text to be processed; perform language identification on each character in the text to be processed and store the characters in corresponding preset character libraries; divide each character string in the language character libraries into prosodic structures using a pre-trained prosodic structure prediction model; obtain, according to each character string and its corresponding prosodic structure, a corresponding target speech splicing unit from a preset speech splicing unit library; determine adjacent target speech splicing units in recognition order and calculate the optimal splicing point between adjacent target speech splicing units; and splice adjacent target splicing units according to the optimal splicing points to obtain synthesized speech. With the technical means of this embodiment, the prosodic parameters of texts in different languages are taken into account during speech synthesis of multilingual text, so that the synthesized speech can express different ages, genders, tones and speaking rates and carry personal emotional color. During speech synthesis, suitable splicing points are selected for adjacent target speech splicing units, making the synthesized speech smoother and more fluent and improving its naturalness and expressiveness.

Embodiment 2 of the present invention provides a multilingual speech synthesis method, implemented further on the basis of Embodiment 1. When the text to be processed includes Chinese, the language character libraries include a Chinese character library.

Step S2 is then specifically performed through steps S211 and S212:

S211. Segment the character strings in the Chinese character library into words using a preset word segmentation neural network model, and determine the pinyin and tone of each segmented phrase to obtain the pinyin string, expressed in pinyin and tones, corresponding to the character string.

S212. Divide each character string into prosodic structures according to its corresponding pinyin string using a preset prosodic structure prediction model; wherein the prosodic structure comprises at least three levels: prosodic words, prosodic phrases and intonation phrases.

In this embodiment of the present invention, in order to divide a Chinese character string into an accurate prosodic structure, the string must first be converted into a pinyin string to obtain the initials and finals corresponding to the Chinese characters, so that they can be matched against the acoustic parameters at the boundaries and the acoustic behavior at prosodic-level boundaries can be studied better. To make the pinyin conversion more accurate, the Chinese character string must first be segmented into words. For Chinese, word segmentation avoids most pronunciation errors caused by ambiguity and also yields the rich prosodic information carried by word boundaries; for example, a slight pause is often required at a word boundary.

The present invention segments Chinese character strings into words using a preset word segmentation neural network model. In one implementation of Embodiment 2 of the present invention, referring to FIG. 3, which is a schematic diagram of the structure of the word segmentation neural network model in Embodiment 2, the word segmentation neural network model uses a recurrent neural network (RNN). Specifically, the training process of the word segmentation neural network model is as follows:

The input is a character string in units of Chinese characters or words; it may also consist of abstract lexical or syntactic concepts. Each word or concept is the input of a single neuron. The string may be discontinuous, i.e., empty neurons may occur. The structure of the neural network can be freely reconstructed, for example by creating a new neuron for a newly encountered word or syntactic concept. Newly added knowledge does not overwrite old knowledge, so the forgetting coefficient of the network is set to 1 (no forgetting). Owing to the learning characteristics of the neural network, adding new word segmentation knowledge is independent of search time; that is, as knowledge grows, the search time does not grow excessively.

The word segmentation neural network model is shown in FIG. 3. F1 is the input layer, F2 is the competitive network, and F3 is the adaptive learning mechanism. The input to layer F1 is a character string composed of arbitrary words or concepts. Each node of the input layer corresponds to one word or concept: it is 1 when the corresponding word is non-empty and 0 for an empty character. Let xi be the i-th node of F1 and Si be the i-th word; then

xi = 1 if Si is non-empty, and xi = 0 otherwise.

F2 is a competitive network. Each neuron has a positive feedback excitation, i.e., a self-recovering, self-reinforcing mechanism, while the neurons inhibit one another. When information is fed into F2 in parallel, each neuron actively reinforces itself and inhibits the others; but since the weight Wi of each neuron differs, the activation levels also differ. Through repeated evolution and iteration, one neuron becomes dominant while the others are suppressed. The result of the evolution depends on the weights and the F1 input. The larger a neuron's initial activation, the greater its probability of winning. Let tjk be the weight from node j to node k; then

tjk = 1 when j = k, and tjk = −ε when j ≠ k,

where 0 < ε < 1.

The network evolves iteratively in time steps Δt. The output value Yj′(t+1) of node j at time t+1 changes with t:

Yj′(t+1) = f(Yj(t) − ε·Σ(k≠j) Yk(t)),  j, k = 1, …, M

where M is the total number of nodes and f(·) is the linear threshold function

f(x) = x for x > 0, and f(x) = 0 otherwise.

Let the input be the character string (x1 x2 … xn) and Wij(t+1) be the weight of neuron node j given the state of input i; the output of neuron j is then

Yj(t+1) = f(Σ(i=1..n) Wij(t+1)·xi).

If the output of the neuron that finally wins the competition in network F2 is greater than a preset segmentation threshold, the word represented by the current output node is a segmentation point. By consulting the labels, the F2 network judges whether this is a correct segmentation result; if not, the F3 self-learning algorithm adjusts the weights and the competition is run again until the correct segmentation position is found. If a new word is encountered, a new node is added to input layer F1. Since each neuron of input layer F1 corresponds to a fixed word, once the self-learning process is complete the system can correctly process language composed of these words. The whole learning process goes from simple to complex, and the neural network gradually refines its stored knowledge.
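
A small NumPy sketch of the winner-take-all iteration Yj′(t+1) = f(Yj(t) − ε·Σ(k≠j) Yk(t)) above; the initial activations, ε and iteration count are arbitrary:

    import numpy as np

    def compete(y0, eps=0.1, steps=50):
        """Iterate the competitive update with the linear threshold
        f(x) = max(x, 0) until a single neuron dominates."""
        y = np.asarray(y0, dtype=float)
        for _ in range(steps):
            inhibition = y.sum() - y                 # sum over k != j for every j
            y = np.maximum(y - eps * inhibition, 0.0)
            if (y > 0).sum() <= 1:                   # a single winner remains
                break
        return y

    # Usage: the neuron with the largest initial activation wins.
    # compete([0.9, 0.5, 0.3]) -> array with one positive entry (index 0)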

Further, the present invention extracts Chinese characters and their corresponding pinyin from a Chinese electronic dictionary to build a character-pinyin dictionary. After a Chinese character string has been segmented, the segmented string is converted into a pinyin string based on this character-pinyin dictionary when the input Chinese characters are processed. For example:

Input sentence: 最让我品味不尽的秋韵 ("the autumn charm that I can never finish savoring")

Converted pinyin string: zui4 rang4 wo3 pin3_wei4 bu2_jin4 de5 qiu1_yun4

Here the digits (1, 2, 3, 4) represent the tones of the Chinese characters.
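
A toy sketch of this dictionary-based conversion; the dictionary entries below are just the words of the example above, and the function assumes segmentation has already been done:

    # Hypothetical excerpt of a character/word-to-pinyin dictionary; tones are
    # encoded as trailing digits, as in the example above.
    PINYIN = {"最": "zui4", "让": "rang4", "我": "wo3", "品味": "pin3_wei4",
              "不尽": "bu2_jin4", "的": "de5", "秋韵": "qiu1_yun4"}

    def to_pinyin(segmented_words):
        """Map each segmented word to its pinyin-with-tone string; unknown
        words are passed through unchanged."""
        return " ".join(PINYIN.get(w, w) for w in segmented_words)

    # Usage:
    # to_pinyin(["最", "让", "我", "品味", "不尽", "的", "秋韵"])
    # -> 'zui4 rang4 wo3 pin3_wei4 bu2_jin4 de5 qiu1_yun4'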

Further, each character string is divided into prosodic structures according to its corresponding pinyin string using the preset prosodic structure prediction model. In this embodiment of the present invention, three levels of prosodic structure are defined for Chinese, from small to large: prosodic words, prosodic phrases and intonation phrases.

Taking 致以诚挚的问候和良好的祝愿 ("with sincere regards and good wishes") as an example, the character string is converted into a pinyin string and input into the prosodic structure prediction model. Referring to FIG. 4, which is a schematic diagram of prosodic structure division in Embodiment 2 of the present invention, PW, PP, IP and S denote prosodic word, prosodic phrase, intonation phrase and sentence respectively. The prosodic structure prediction model outputs the prosodic structure division result corresponding to the character string.

In one implementation of Embodiment 2 of the present invention, the training method of the preset prosodic structure prediction model specifically includes steps S212a to S212c:

S212a. Acquire a training corpus, which includes a number of Chinese sentences and their corresponding prosodic structure division results.

In this embodiment of the present invention, the corpus used to train the prosodic structure prediction model includes 20,000 Chinese sentences, more than 400,000 syllables in total, all annotated with prosodic boundaries at the three levels (prosodic word, prosodic phrase, intonation phrase).

S212b. Train an RNN on the training corpus to obtain a trained recurrent neural network model.

The present invention predicts prosodic words, prosodic phrases and intonation phrases based on an RNN (recurrent neural network). Word vectors (word2vec) are selected as features. Through training, word2vec can represent each word as an N-dimensional vector, so that the distance between two words can be expressed as a similarity in the word vector space. Table 1 shows the F-score (an evaluation metric for classification models) results of the RNN deep learning network model:

Table 1: F-score results of the RNN model

S212c. Perform linear fusion training on the recurrent neural network model using a classification and regression tree algorithm to obtain the fully trained recurrent neural network model, which serves as the prosodic structure prediction model.

Compared with other models, the RNN model improves accuracy. However, for a very small fraction of inputs it produces outputs close to 1 or 0, which are not smooth enough. The usual remedy is to fuse it linearly with other models. The present invention fuses the probability outputs of the classification and regression tree algorithm (CART), the recurrent neural network (RNN) and the conditional random field (CRF) model, together with some important features. First, a linear model is trained to learn the fusion coefficients. The exponential loss function selected for training is:

L(y, f) = e^(−y·f(x))

where y is the predicted value and f is the target value. After calculation, the fusion coefficients of CRF, CART and RNN are 0.43, 0.11 and 0.49 respectively, and the bias coefficient is −0.14. The result evidently depends far more on the RNN and CRF than on CART. The results for the different coefficients are shown in Table 2:

Table 2: Fusion coefficient results

To improve the accuracy of the model fusion, a phrase length constraint is introduced by adding three features: the distance to the previous boundary, the number of syllables of the current grammatical word, and the number of syllables of the next grammatical word; CART is then used to fuse the RNN. The specific training steps are as follows:

Step 1: Based on the fusion results of the multiple models, obtain the "distance to the previous boundary" of each training sample in the training corpus.

Step 2: Randomly select 75% of the training samples and train with CART.

Step 3: Cross-validate with the remaining 25% of the training samples and judge whether the result reaches the preset value. If so, end the training.

Step 4: If the preset value is not reached, update the phrase length constraint features with the CART training results and return to Step 2 to continue training.

It should be noted that, besides CART's own hyperparameters, the hyperparameters to be considered here include the F-score value used to judge whether the result is satisfactory: if the result exceeds this value, training stops.
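
For illustration, a minimal sketch of the linear fusion using the reported coefficients (CRF 0.43, CART 0.11, RNN 0.49, bias −0.14); the sigmoid squashing of the fused score is an assumption, not stated in the text:

    import math

    def fused_boundary_score(p_crf, p_cart, p_rnn):
        """Linearly blend the three models' boundary probabilities."""
        z = 0.43 * p_crf + 0.11 * p_cart + 0.49 * p_rnn - 0.14
        return 1.0 / (1.0 + math.exp(-z))   # assumed squashing to (0, 1)

    # Usage: a position all three models consider a likely prosodic boundary.
    # fused_boundary_score(0.9, 0.8, 0.95) -> ~0.69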

In Embodiment 2 of the present invention, after the prosodic structure of each Chinese character string has been obtained through the prosodic structure prediction model, the corresponding target speech splicing unit can be obtained directly from the preset speech splicing unit library according to each character string and its corresponding prosodic structure.

The technical means of this embodiment of the present invention can improve the accuracy of the prosodic structure division of Chinese character strings, further improving the naturalness and fluency of the synthesized speech.

Embodiment 3 of the present invention provides a multilingual speech synthesis method, implemented further on the basis of Embodiment 1. When the text to be processed includes English, the language character libraries include an English character library.

Step S2 is then specifically performed through steps S221 and S222:

S221. Determine the text attributes of the character strings in the English character library; wherein the text attributes are related to the context.

S222. Divide each character string into prosodic structures according to its text attributes using a preset prosodic structure prediction model.

In this embodiment of the present invention, for English the prosodic structure prediction model uses a binary decision regression tree, which reflects the mapping between text attributes and prosodic structures. According to the text attributes of an English character string, the corresponding prosodic structure can be determined through the binary decision regression tree.

In one implementation of Embodiment 3 of the present invention, the method for constructing the binary decision regression tree specifically includes steps S222a to S222c:

S222a. Determine the text attributes of each speech splicing unit in the speech splicing unit library.

S222b. Cluster the speech splicing units using a decision regression tree algorithm, with the average distance between speech splicing units as the impurity measure, to obtain prosodic structures of different levels.

S222c. Construct the binary decision regression tree according to the mapping between the text attributes and the prosodic structures.

In this embodiment of the present invention, since English character strings have no predefined prosodic structure levels, every speech splicing unit in the speech splicing unit library must be clustered in order to derive the corresponding prosodic structure levels.

In the speech splicing unit library, words, syllables and phonemes are clustered according to context, and speech splicing units with similar prosodic features are grouped into the same prosodic level. The present invention uses a binary classification and regression tree, CART (Classification and Regression Tree), to cluster the words, syllables and phonemes that have more than 20 samples in the speech splicing unit library, obtaining a binary decision regression tree that reflects the mapping between context and prosodic information.

In one implementation, based on knowledge of English linguistics, the present invention selects the following context-related attributes as the text attributes for CART:

A1: the type of the single phone preceding the speech splicing unit

A2: the type of the single phone following the speech splicing unit

A3: the stress of the syllable preceding the speech splicing unit (unstressed, stressed, or no preceding syllable)

A4: the stress of the syllable following the speech splicing unit (unstressed, stressed, or no following syllable)

A5: the pause level before the word containing the speech splicing unit

A6: the pause level after the word containing the speech splicing unit

A7: the position in the paragraph of the word containing the speech splicing unit (paragraph-initial, paragraph-medial, paragraph-final, or the only word in the paragraph)

A8: the position in the sentence of the word containing the speech splicing unit (sentence-initial, sentence-medial, sentence-final, or the only word in the sentence)

A9: the position in the clause of the word containing the speech splicing unit (clause-initial, clause-medial, clause-final, or the only word in the clause)

A10: the position in the phrase of the word containing the speech splicing unit (phrase-initial, phrase-medial, phrase-final, or the only word in the phrase)

A11: the position in the prosodic word of the word containing the speech splicing unit (initial, medial, final, or the only word in the prosodic word)

A12: the position in the word of the syllable containing the speech splicing unit (word-initial, word-medial, word-final, or the word has only one syllable)

A13: the number of syllables of the word containing the speech splicing unit

A14: the position of the phone in the syllable (syllable-initial, syllable-medial, syllable-final, or the syllable has only one phone)

A15: the number of phones in the syllable containing the phone

Attributes A1 to A11 are used when clustering words; attributes A1 to A13 are used when clustering syllables; all 15 attributes A1 to A15 are used when clustering phones.

Based on the above attributes, the CART algorithm is used for clustering, with the average distance between sample units as the impurity measure. In the clustering process, the smaller the difference between the prosodic features of two speech splicing units, the more similar their prosody and the better the effect of splicing them together. The difference between the prosodic features of two speech splicing units is therefore measured, and if the difference exceeds a preset threshold, the corresponding weights are adjusted and the clustering is repeated.

The difference between the prosodic features of two speech splicing units A and B is measured with the following formula:

D(A, B) = α × (1/N) × Σ(i=1..N) |PA(i) − PB(i)| + β × |EA − EB| + γ × |DA − DB|

where N is the number of audio sample points; PA(i) and PB(i) are the pitch periods at the i-th sequential sample of units A and B respectively; EA and EB are the energies of units A and B; DA and DB are the durations of units A and B; and α, β, γ are weights summing to 1. Since English intonation is itself relatively level, the pitch-period weight α can be small, while the energy and duration weights β and γ are roughly equal and can take larger values.
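
A minimal sketch of this weighted prosodic-feature difference; the weight values follow the guidance above (small α, larger and roughly equal β and γ) but are otherwise arbitrary:

    import numpy as np

    def prosodic_distance(pa, pb, ea, eb, da, db,
                          alpha=0.1, beta=0.45, gamma=0.45):
        """Weighted difference of pitch period, energy and duration between
        two speech splicing units A and B (alpha + beta + gamma = 1).
        pa, pb: per-sample pitch-period sequences of length N;
        ea/eb and da/db: the units' energies and durations."""
        pitch_term = np.abs(np.asarray(pa) - np.asarray(pb)).mean()  # average over N
        return alpha * pitch_term + beta * abs(ea - eb) + gamma * abs(da - db)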

在此基础上,作为优选的实施方式,步骤S3,具体通过步骤S31至S33执 行:On this basis, as a preferred implementation, step S3 is specifically performed through steps S31 to S33:

S31、根据所述英文字符库中的每一字符串对应的韵律结构,在所述预设的 语音拼接单元库中获取符合所述韵律结构的候选语音拼接单元;S31, according to the prosodic structure corresponding to each character string in the English character library, obtaining candidate speech splicing units that conform to the prosodic structure in the preset speech splicing unit library;

S32、根据每一所述候选语音拼接单元的目标代价和连接代价,通过以下计 算公式,计算每一候选语音拼接单元的拼接代价:S32. According to the target cost and connection cost of each candidate speech splicing unit, the splicing cost of each candidate speech splicing unit is calculated by the following calculation formula:
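The formula is not reproduced in the text; a plausible reconstruction, consistent with standard unit-selection synthesis and with the symbols defined below (the notation dj(i) for the rule distance of the j-th text attribute is introduced here), is:

$$C=W_{t}\sum_{i=1}^{n}C_{t}(i)+W_{c}\sum_{i=2}^{n}C_{c}(i),\qquad C_{t}(i)=\sum_{j=1}^{P}\omega_{j}\,d_{j}(i)$$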

where C is the splicing cost of the candidate speech splicing unit, Ct(i) is its target cost and Cc its connection cost; Wt and Wc are the weights of the target cost and the connection cost, respectively; n is the length of the candidate speech splicing unit; P is the number of text attributes; dj(i) is the rule distance of the j-th text attribute; and ωj is the weight of the j-th text attribute;

S33、获取使得所述拼接代价最小对应的候选语音拼接单元,作为目标语音 拼接单元。S33. Obtain a candidate speech splicing unit that minimizes the splicing cost as a target speech splicing unit.

The present invention determines the cost of substituting a candidate unit for a target unit by defining a rule distance table for each text attribute. The rule distance table is shown in Table 3, where A1t is the attribute value of the target speech splicing unit and A1u is the attribute value of the candidate speech splicing unit.

表3文本属性规则距离表Table 3 Text attribute rule distance table

连接代价主要考虑候选语音拼接单元在字符串中的位置,如果路径中前一字 符与后一字符在语料中是相邻的,那么代价为0,否则代价为1。The connection cost mainly considers the position of the candidate speech concatenation unit in the string. If the previous character and the next character in the path are adjacent in the corpus, the cost is 0, otherwise the cost is 1.

综合考虑目标代价和连接代价,在候选单元网络中,分别对连续的单词、音 节和phone组成的子网络,选择一条代价最小的路径,进而组合为一个目标语 音拼接单元的序列。由于单词内部各个音位之间的联系很紧密,相互之间会产生 较大的影响,故在phone和phone的拼接点处,给连接代价赋以更高的权值,而 在单词和单词的拼接点处,目标代价权值更高,音节和音节的拼接点,连接代价 和目标代价的权值基本相当。Taking the target cost and connection cost into consideration, in the candidate unit network, a path with the minimum cost is selected for the sub-networks composed of consecutive words, syllables and phones, and then combined into a sequence of target speech splicing units. Since the phonemes within a word are closely connected and have a great influence on each other, a higher weight is assigned to the connection cost at the splicing point between phones, while the target cost weight is higher at the splicing point between words, and the weights of the connection cost and the target cost are basically equivalent at the splicing point between syllables.
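As an illustration of the path search described above, the following is a minimal sketch, not the patent's actual implementation; the cost functions and the 0.5 weights are assumed placeholders for Wt and Wc:

```python
# Dynamic-programming (Viterbi-style) selection of the minimum-cost path
# through a candidate-unit network. `candidates[t]` is the list of candidate
# units for position t; `target_cost(u)` and `join_cost(v, u)` stand in for
# the target and connection costs defined above.

def best_unit_sequence(candidates, target_cost, join_cost, wt=0.5, wc=0.5):
    n = len(candidates)
    cost = [[wt * target_cost(u) for u in candidates[0]]]
    back = [[-1] * len(candidates[0])]
    for t in range(1, n):
        row_cost, row_back = [], []
        for u in candidates[t]:
            # pick the predecessor minimizing accumulated target + join cost
            best_c, best_k = min(
                (cost[t - 1][k] + wc * join_cost(v, u), k)
                for k, v in enumerate(candidates[t - 1])
            )
            row_cost.append(best_c + wt * target_cost(u))
            row_back.append(best_k)
        cost.append(row_cost)
        back.append(row_back)
    # trace back the cheapest path
    k = min(range(len(cost[-1])), key=cost[-1].__getitem__)
    path = [k]
    for t in range(n - 1, 0, -1):
        k = back[t][k]
        path.append(k)
    path.reverse()
    return [candidates[t][k] for t, k in enumerate(path)]
```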

采用本发明实施例的技术手段,能够提高对英文字符串的韵律结构划分的准 确性,以进一步提高合成语音的自然度和流畅度。By adopting the technical means of the embodiments of the present invention, the accuracy of the prosodic structure division of English character strings can be improved, so as to further improve the naturalness and fluency of synthesized speech.

本发明实施例四提供了一种多语种的语音合成方法,在实施例一至实施例三 任一实施例的基础上进一步实施。步骤S4,具体通过步骤S41至S45执行:Embodiment 4 of the present invention provides a multi-language speech synthesis method, which is further implemented on the basis of any one of Embodiments 1 to 3. Step S4 is specifically performed through steps S41 to S45:

S41、对每一所述目标拼接单元进行分帧处理,并提取每一所述目标拼接单 元的前m帧和后m帧的MFCC特征参数;S41, performing frame processing on each of the target splicing units, and extracting MFCC feature parameters of the first m frames and the last m frames of each of the target splicing units;

In this embodiment of the present invention, referring to FIG. 5, a schematic flowchart of the MFCC feature parameter extraction step in Embodiment 4, each target speech splicing unit is divided into frames to obtain the speech signal. The spectrum obtained by the discrete Fourier transform is passed through a bank of triangular filters to map it to the Mel frequency domain, because human perception of pitch is not proportional to the sound frequency. The correspondence between Mel frequency and frequency can be approximated by the following formula:
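The formula is not reproduced in the text; the standard Mel-scale mapping, which is presumably the intended relation, is:

$$\mathrm{Mel}(f)=2595\,\log_{10}\!\left(1+\frac{f}{700}\right)$$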

Since MFCC features offer good recognition performance and noise robustness and reflect the characteristic parameters of the audio, the present invention performs spectral analysis by extracting the MFCC parameters of the speech splicing units.

S42. Adjacent target speech splicing units are determined in recognition order, and any one of the last m frames of the preceding target splicing unit is paired with any one of the first m frames of the following target splicing unit as a candidate splicing point of the adjacent target speech splicing units, yielding m² splicing combinations.

As an example, take m = 2. Then, for two adjacent target speech splicing units A and B, there are 4 splicing combinations: the last frame of unit A with the first frame of unit B; the second-to-last frame of unit A with the first frame of unit B; the last frame of unit A with the second frame of unit B; and the second-to-last frame of unit A with the second frame of unit B.

S43、通过以下计算公式,计算每一所述拼接组合对应的拼接代价;并根据 拼接代价最小对应的拼接组合,确定相邻的目标语音拼接单元之间的最佳拼接点:S43, calculating the splicing cost corresponding to each splicing combination by the following calculation formula; and determining the best splicing point between adjacent target speech splicing units according to the splicing combination corresponding to the minimum splicing cost:
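The formula is not reproduced in the text; a plausible reconstruction, writing the combination's cost as Cm and assuming a threshold-normalized Euclidean distance between the MFCC vectors of the two candidate frames (the exact functional form is an assumption), is:

$$C_{m}=\frac{1}{T_{m}}\sqrt{\sum_{i=1}^{\mathrm{MFCC_{dim}}}\bigl(\mathrm{MFCC}_{1,m}(i)-\mathrm{MFCC}_{2,m}(i)\bigr)^{2}}$$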

where Cm is the splicing cost of the splicing combination; MFCC1,m(i) is the i-th MFCC feature parameter value at the m-th candidate splicing point of the preceding target splicing unit, and MFCC2,m(i) is the corresponding value at the m-th candidate splicing point of the following target splicing unit; MFCCdim is the dimensionality of one frame of MFCC feature parameters; Tm is the threshold of the fundamental-frequency splicing cost function; and m ≥ 1.

同样以m=2作为举例,则分别计算出4种拼接组合对应的拼接代价C1、C2、 C3和C4。若拼接代价C1最小,则对应的拼接组合(A单元的最后一帧与B单元的 第一帧作为拼接点)作为最佳拼接组合,则A单元的最后一帧与B单元的第一 帧作为最佳拼接点。Taking m=2 as an example, the splicing costsC1 ,C2 ,C3 andC4 corresponding to the four splicing combinations are calculated respectively. If the splicing costC1 is the smallest, the corresponding splicing combination (the last frame of unit A and the first frame of unit B are used as the splicing point) is the best splicing combination, and the last frame of unit A and the first frame of unit B are used as the best splicing point.
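A minimal sketch of this best-splice-point search, assuming a Euclidean frame distance in place of the exact cost above; mfcc_a and mfcc_b are (frames × dim) NumPy arrays:

```python
import numpy as np

def best_splice_point(mfcc_a, mfcc_b, m=2):
    """Enumerate the m*m frame pairings and return the cheapest one."""
    best = None
    for i in range(1, m + 1):          # i-th frame from the end of unit A
        for j in range(m):             # j-th frame from the start of unit B
            d = np.linalg.norm(mfcc_a[-i] - mfcc_b[j])
            if best is None or d < best[0]:
                best = (d, len(mfcc_a) - i, j)
    cost, frame_a, frame_b = best
    # splice unit A up to frame_a, then continue with unit B from frame_b
    return frame_a, frame_b, cost
```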

S44、根据所述最佳拼接点对相邻的目标拼接单元进行拼接,得到初始合成 语音;S44, splicing adjacent target splicing units according to the optimal splicing point to obtain initial synthesized speech;

S45、计算前一目标语音拼接单元的后N帧的能量和后一目标语音拼接单元 的前N帧的能量;当前一目标语音拼接单元的后N帧的能量和后一目标语音拼 接单元的前N帧的能量之差高于预设的静音能量阈值时,通过以下计算公式计算 相邻的目标语音拼接单元的音量调整分贝;并通过所述音量调整分贝,对所述目 标语音拼接单元进行音量调整,以得到最终的合成语音:S45, calculating the energy of the last N frames of the previous target speech splicing unit and the energy of the first N frames of the next target speech splicing unit; when the difference between the energy of the last N frames of the previous target speech splicing unit and the energy of the first N frames of the next target speech splicing unit is higher than the preset silence energy threshold, calculating the volume adjustment decibel of the adjacent target speech splicing unit by the following calculation formula; and adjusting the volume of the target speech splicing unit by the volume adjustment decibel to obtain the final synthesized speech:

DB1=(E1–E2)×W1DB1 =(E1 –E2 )×W1 ;

DB2=(E1–E2)×W2DB2 =(E1 –E2 )×W2 ;

where DB1 is the volume adjustment in decibels for the preceding target speech splicing unit, and DB2 is that for the following target speech splicing unit; E1 is the energy of the last N frames of the preceding target speech splicing unit, and E2 is the energy of the first N frames of the following target speech splicing unit, N ≥ 1; W1 and W2 are preset weights.

具体地,由于不同语义单位(段、句、词等)的分割处,语音的某些特征有 明显变化,比如在句子边界处,音频的能量特征就显著减少。因此,本发明可以 利用这一特点来做边界检测,以检测语音合成时,语言拼接单元的拼接处是否足 够平滑,以降低拼接处的断层程度。Specifically, due to the segmentation of different semantic units (segments, sentences, words, etc.), some features of speech have obvious changes, such as the energy feature of the audio is significantly reduced at the sentence boundary. Therefore, the present invention can use this feature to perform boundary detection to detect whether the splicing of the language splicing unit is smooth enough during speech synthesis to reduce the degree of discontinuity at the splicing.

参见图6,是本发明实施例四中的目标语音拼接单元的拼接平滑步骤的流程 示意图。通过判断前一目标语音拼接单元的后N帧的能量和后一目标语音拼接单 元的前N帧的能量之差是否高于预设的静音能量阈值,来对相邻的目标语音拼接 单元进行音量调整。Referring to Fig. 6, it is a schematic flow diagram of the smoothing step of the target speech splicing unit in the fourth embodiment of the present invention. By judging whether the difference between the energy of the last N frames of the previous target speech splicing unit and the energy of the first N frames of the next target speech splicing unit is higher than the preset silence energy threshold, the volume of the adjacent target speech splicing unit is adjusted.

As an example, when DB1 is negative, the volume of the last N frames of the preceding target speech splicing unit is reduced; when DB1 is positive, the volume of those frames is increased. The absolute value of DB1 gives the number of decibels by which the volume is raised or lowered. The volume adjustment of the following target speech splicing unit is obtained in the same way.
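A minimal sketch of this dual-threshold volume adjustment, applying DB1 and DB2 literally as defined above; the frame length, N, the silence threshold and the weights are assumed example values, not the patent's presets:

```python
import numpy as np

def smooth_volume(unit_a, unit_b, frame_len=256, n=3,
                  silence_thresh=6.0, w1=0.5, w2=0.5):
    unit_a = unit_a.astype(float)      # work on float copies
    unit_b = unit_b.astype(float)

    def frame_energy_db(x):
        # mean-power energy in dB; epsilon guards against log of zero
        return 10.0 * np.log10(np.mean(x ** 2) + 1e-12)

    e1 = frame_energy_db(unit_a[-n * frame_len:])   # last N frames of A
    e2 = frame_energy_db(unit_b[:n * frame_len])    # first N frames of B
    if abs(e1 - e2) > silence_thresh:
        db1 = (e1 - e2) * w1                        # DB1 = (E1 - E2) * W1
        db2 = (e1 - e2) * w2                        # DB2 = (E1 - E2) * W2
        # positive DB raises the volume, negative DB lowers it, as stated above
        unit_a[-n * frame_len:] *= 10 ** (db1 / 20.0)
        unit_b[:n * frame_len] *= 10 ** (db2 / 20.0)
    return unit_a, unit_b
```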

采用本发明实施例的技术手段,先通过分析频谱寻找最佳拼接点,再通过基 于双阈值语音拼接平滑算法调整原始音素文件的音量,可有效解决使用拼接单元 所合成的语音不平滑的问题。By adopting the technical means of the embodiment of the present invention, the optimal splicing point is first found by analyzing the spectrum, and then the volume of the original phoneme file is adjusted by a double-threshold speech splicing smoothing algorithm, which can effectively solve the problem of uneven speech synthesized by using the splicing unit.

本发明实施例五提供了一种多语种的语音合成方法,在实施例一至四任一实 施例的基础上进一步实施。在步骤S4之后,所述方法还包括步骤S5:Embodiment 5 of the present invention provides a multi-language speech synthesis method, which is further implemented on the basis of any one of Embodiments 1 to 4. After step S4, the method further includes step S5:

S5、采用基于tacotron模型的语音拼接平滑算法对所述合成语音进行平滑处 理。S5. Smoothing the synthesized speech using a speech concatenation smoothing algorithm based on a tacotron model.

In this embodiment of the present invention, after the synthesized speech is obtained, unnatural transitions may still remain in it; a tacotron-based speech splicing smoothing algorithm therefore applies a layer of filtering to the synthesized speech. Rather than substantially altering the speech as a whole, it smooths away the unnatural parts.

The tacotron model consists of an encoder module, a decoder module and a post-processing network connected in sequence. The spectrogram of the synthesized speech is passed through the encoder, decoder and post-processing modules in turn to obtain the smoothed spectrogram. The encoder converts the spectrogram into sequence data that is easier to process, the decoder processes and predicts the data, and the post-processing module converts the smoothed sequence back into a spectrogram.

以下对tacotron模型的各个模块进行解释说明:The following is an explanation of the various modules of the tacotron model:

encoder模块:encoder module:

During smoothing, the synthesized speech is first fed into the encoder's pre-net, which has two hidden layers with full connections between layers; the number of hidden units in the first layer equals the number of input units, and the second layer has half as many. Both hidden layers use the ReLU activation function, and dropout is applied to improve generalization.

The output of the pre-net is then fed into the CBHG module to obtain a highly robust representation of the original input.

The CBHG module consists of a 1-D convolution bank, highway networks and a bidirectional gated recurrent unit (bidirectional GRU). Its function is to extract valuable features from the input, which helps improve the generalization ability of the model.

The input sequence first passes through the convolution layers, followed by a residual connection that compensates for the information lost across the stacked convolutions. The result is then fed into the highway network, whose structure is as follows: the input is passed simultaneously through two single-layer fully connected networks whose activation functions are ReLU and sigmoid and whose outputs are output1 and output2, respectively; the output of the highway layer is then:

output=output1*output2+input*(1-output2)output=output1*output2+input*(1-output2)

In this embodiment of the present invention, four highway layers are used. The highway network is given by

y = H(x, WH) · T(x, WT) + x · C(x, WC)

where H is the transform, T is the sigmoid transform gate and C = 1 − T is the carry gate, matching the output1/output2 formulation above.
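A minimal NumPy sketch of one such highway layer, with ReLU transform and sigmoid gate as described above; the weight matrices and biases are assumed inputs:

```python
import numpy as np

def highway_layer(x, w_h, b_h, w_t, b_t):
    h = np.maximum(0.0, x @ w_h + b_h)            # transform H: ReLU (output1)
    t = 1.0 / (1.0 + np.exp(-(x @ w_t + b_t)))    # gate T: sigmoid (output2)
    return h * t + x * (1.0 - t)                  # carry gate C = 1 - T
```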

然后将highway层的输出输入到双向GRU,GRU的输出结果就是encoder 的输出。GRU是LSTM的优化版,GRU只有更新门与重置门,使得训练速度有 所加快。两个门的公式如下:Then the output of the highway layer is input to the bidirectional GRU, and the output of the GRU is the output of the encoder. GRU is an optimized version of LSTM. GRU only has an update gate and a reset gate, which speeds up the training. The formulas of the two gates are as follows:

z(t) = σ(W(z)x(t) + U(z)h(t-1))  (update gate)

r(t) = σ(W(r)x(t) + U(r)h(t-1))  (reset gate)

h̃(t) = tanh(r(t) ∘ Uh(t-1) + Wx(t))  (new memory)

Reset gate: r(t) determines the weight of h(t-1) in the new memory h̃(t); if r(t) is approximately 0, h(t-1) is not passed on to the new memory h̃(t).

New memory: h̃(t) is a summary of the new input x(t) and the previous hidden state h(t-1); the resulting vector h̃(t) combines the preceding context with the new input x(t).

Update gate: z(t) determines how much of h(t-1) is passed on to h(t). If z(t) is approximately 1, h(t-1) is copied almost directly to h(t); conversely, if z(t) is approximately 0, the new memory h̃(t) is passed directly to h(t).

Hidden state: h(t) is the weighted combination of h(t-1) and the new memory h̃(t), with the weights controlled by the update gate z(t).
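A minimal NumPy sketch of one GRU step following the gate equations and the copying convention described above (z near 1 copies h(t-1) through, z near 0 passes the new memory through); the weight matrices are assumed inputs:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x, h_prev, Wz, Uz, Wr, Ur, W, U):
    z = sigmoid(Wz @ x + Uz @ h_prev)          # update gate
    r = sigmoid(Wr @ x + Ur @ h_prev)          # reset gate
    h_new = np.tanh(W @ x + U @ (r * h_prev))  # new memory (h-tilde)
    return z * h_prev + (1.0 - z) * h_new      # hidden state
```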

decoder模块:Decoder module:

The sequence obtained from the CBHG module is fed into the decoder module. The decoder consists of three parts: a pre-net, an attention layer (Attention-RNN) and a decoding layer (Decoder-RNN). The decoder's pre-net has the same structure as the encoder's pre-net and applies a nonlinear transformation to the input.

The Attention-RNN is a single-layer RNN of 256 GRU units; it takes the pre-net output and the attention output as input and, after the GRU units, passes its output to the Decoder-RNN. The Decoder-RNN consists of two layers of residual GRUs, its output being the sum of its input and the GRU output; each layer contains 256 GRU units. At the first step the decoder's input is a zero matrix; thereafter the output of step t is used as the input of step t+1.

post-processing模块:Post-processing module:

Because the output of the Decoder-RNN is not an audio file, post-processing is required to convert it into a spectrogram. The output of the decoder module serves as the input of the post-processing module, which uses a CBHG module as the post-processing net to perform the conversion.

进一步地,再通过wavenet可将频谱图逆转换成音频,从而得到经过平滑处 理后的最优的合成语音。Furthermore, the spectrogram can be inversely converted into audio through Wavenet, thereby obtaining the optimal synthetic speech after smoothing.

Specifically, the present invention uses a modified WaveNet structure to convert the Mel-spectrogram feature representation into a time-domain waveform, yielding the final audio file. This WaveNet has 30 dilated convolution layers, grouped into three dilation cycles. Following PixelCNN++ and Parallel WaveNet, the softmax layer is replaced by a mixture-of-logistics (MoL) output to generate 16-bit samples at a frequency of 24 kHz. To compute the logistic mixture distribution, the output of the WaveNet stack is passed through a ReLU activation and then a linear projection to predict the parameters of each mixture component. The loss is obtained by negative log-likelihood estimation.

Training procedure: the feature prediction network is trained first, and the modified WaveNet is then trained on the outputs of the first stage. The feature prediction network is trained by maximum-likelihood estimation. The Adam optimizer is used with a learning rate of 0.001, which decays by 0.00001 after every 50,000 iterations; L2 regularization with a weight of 0.000001 is applied.

The modified WaveNet is then trained on the predictions produced by the feature prediction network. Training is synchronized across 32 GPUs with batch_size = 128, likewise using Adam as the optimizer with a fixed learning rate of 0.0001. The model weights are averaged at each parameter update.
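A hedged sketch of this two-stage optimizer setup in PyTorch; the nn.Linear stand-ins are placeholders for the real networks, and reading "decays by 0.00001 every 50,000 iterations" as a stepwise schedule is an assumption:

```python
import torch
from torch import nn, optim

feature_net = nn.Linear(80, 80)  # stands in for the feature prediction network
wavenet = nn.Linear(80, 30)      # stands in for the modified WaveNet

# Stage 1: feature prediction network; L2 regularization via weight_decay.
opt1 = optim.Adam(feature_net.parameters(), lr=1e-3, weight_decay=1e-6)
# LambdaLR scales the base lr; call sched1.step() once per iteration so the
# "epoch" argument counts iterations.
sched1 = optim.lr_scheduler.LambdaLR(
    opt1, lambda it: max(1e-5, 1e-3 - 1e-5 * (it // 50_000)) / 1e-3)

# Stage 2: modified WaveNet trained on stage-1 predictions, fixed learning rate.
opt2 = optim.Adam(wavenet.parameters(), lr=1e-4)
```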

By the technical means of this embodiment of the present invention, further smoothing of the synthesized speech effectively improves its naturalness and fluency. Given the strong learning capacity of the tacotron model, the model is simplified and its task is cut down from full text-to-speech to smoothing only the discontinuities introduced by concatenative splicing. In this way the strong learning capacity of the tacotron model is fully exploited, further improving the naturalness and fluency of the synthesized speech while compensating for the difficulty of tuning and optimizing an end-to-end model.

参见图7,是本发明实施例六中的一种多语种的语音合成装置的结构示意图。 本发明实施例六提供的一种多语种的语音合成装置60,包括:语种识别模块61、 韵律结构划分模块62、目标单元获取模块63和语音合成模块64,其中,Referring to FIG. 7 , it is a schematic diagram of the structure of a multi-language speech synthesis device in Embodiment 6 of the present invention. Embodiment 6 of the present invention provides a multi-language speech synthesis device 60, comprising: a language recognition module 61, a prosodic structure division module 62, a target unit acquisition module 63 and a speech synthesis module 64, wherein:

所述语种识别模块61,用于获取待处理文本;对所述待处理文本中的每一字 符进行语种识别,并存入对应的预设的字符库中;其中,所述待处理文本中包含 至少两种语种;所述预设的字符库包括特殊字符库和至少两个语种对应的语种字 符库;在所述语种字符库中存储的所述字符以句子为单位形成对应的字符串;The language identification module 61 is used to obtain a text to be processed; perform language identification on each character in the text to be processed and store the characters in a corresponding preset character library; wherein the text to be processed contains at least two languages; the preset character library includes a special character library and a language character library corresponding to at least two languages; the characters stored in the language character library form a corresponding character string in units of sentences;

所述韵律结构划分模块62,用于采用预先训练完成的韵律结构预测模型,对 所述语种字符库中的每一字符串进行韵律结构划分;The prosodic structure division module 62 is used to use a pre-trained prosodic structure prediction model to perform prosodic structure division on each character string in the language character library;

所述目标单元获取模块63,用于根据每一所述字符串及其对应的韵律结构, 在预设的语音拼接单元库中获取对应的目标语音拼接单元;The target unit acquisition module 63 is used to acquire a corresponding target speech concatenation unit in a preset speech concatenation unit library according to each character string and its corresponding prosodic structure;

所述语音合成模块64,用于按识别顺序确定相邻的目标语音拼接单元,计算 相邻的目标语音拼接单元之间的最佳拼接点;并根据所述最佳拼接点对相邻的目 标拼接单元进行拼接,以得到合成语音。The speech synthesis module 64 is used to determine adjacent target speech splicing units according to the recognition order, calculate the optimal splicing points between adjacent target speech splicing units, and splice the adjacent target splicing units according to the optimal splicing points to obtain synthesized speech.

需要说明的是,本发明实施例提供的一种多语种的语音合成装置用于执行上 述实施例一至五的一种多语种的语音合成方法的所有流程步骤,两者的工作原理 和有益效果一一对应,因而不再赘述。It should be noted that the multilingual speech synthesis device provided in the embodiment of the present invention is used to execute all the process steps of the multilingual speech synthesis method in the above-mentioned embodiments 1 to 5, and the working principles and beneficial effects of the two correspond one to one, so they are not described in detail.

参见图8,是本发明实施例七中的一种多语种的语音合成系统的结构示意图。 本发明实施例七提供了一种多语种的语音合成系统70,包括处理器71、存储器 72以及存储在所述存储器中且被配置为由所述处理器执行的计算机程序,所述处 理器执行所述计算机程序时实现如实施例一至五任一实施例所述的多语种的语 音合成方法。Referring to Fig. 8, it is a schematic diagram of the structure of a multi-language speech synthesis system in Embodiment 7 of the present invention. Embodiment 7 of the present invention provides a multi-language speech synthesis system 70, comprising a processor 71, a memory 72, and a computer program stored in the memory and configured to be executed by the processor, wherein the processor implements the multi-language speech synthesis method as described in any one of Embodiments 1 to 5 when executing the computer program.

本领域普通技术人员可以理解实现上述实施例方法中的全部或部分流程,是 可以通过计算机程序来指令相关的硬件来完成,所述的程序可存储于一计算机可 读取存储介质中,该程序在执行时,可包括如上述各方法的实施例的流程。其中, 所述的存储介质可为磁碟、光盘、只读存储记忆体(Read-OnlyMemory,ROM) 或随机存储记忆体(RandomAccessMemory,RAM)等。Those skilled in the art can understand that all or part of the processes in the above-mentioned embodiments can be implemented by instructing related hardware through a computer program, and the program can be stored in a computer-readable storage medium. When the program is executed, it can include the processes of the embodiments of the above-mentioned methods. The storage medium can be a disk, an optical disk, a read-only memory (ROM) or a random access memory (RAM).

以上所述是本发明的优选实施方式,应当指出,对于本技术领域的普通技术 人员来说,在不脱离本发明原理的前提下,还可以做出若干改进和润饰,这些改 进和润饰也视为本发明的保护范围。The above is a preferred embodiment of the present invention. It should be pointed out that, for ordinary technicians in this technical field, several improvements and modifications can be made without departing from the principle of the present invention. These improvements and modifications are also considered to be within the scope of protection of the present invention.

