Technical Field
The present application relates to the field of computer technology, and in particular to a method and apparatus for generating a speech synthesis model, a storage medium, and an electronic device.
Background Art
The development of the Internet has steadily driven advances in artificial intelligence, among which intelligent speech technology is especially prominent: both converting text into speech and converting speech into text are built on it. Converting text into speech is the speech synthesis stage of intelligent speech technology, and it relies on a trained speech synthesis model. In the related art, such a model is generated by training it repeatedly on training data; however, when training data is missing or scarce, the speech synthesis model trained on that data suffers from poor data-processing accuracy, and the training process takes a long time.
Summary of the Invention
Embodiments of the present application provide a method and apparatus for generating a speech synthesis model, a storage medium, and an electronic device, which can solve the problem that a speech synthesis model of good quality cannot be generated when training data is missing or insufficient. The technical solution is as follows:
In a first aspect, an embodiment of the present application provides a method for generating a speech synthesis model, the method comprising:
sampling sample audio data at each of at least two different sampling rates to obtain corresponding training audio data;
training based on the at least two pieces of training audio data to obtain the speech synthesis model.
In a second aspect, an embodiment of the present application provides an apparatus for generating a speech synthesis model, the apparatus comprising:
a sampling module, configured to sample sample audio data at each of at least two different sampling rates to obtain corresponding training audio data;
a training module, configured to train based on the at least two pieces of training audio data to obtain the speech synthesis model.
In a third aspect, an embodiment of the present application provides a computer storage medium storing a plurality of instructions, the instructions being adapted to be loaded by a processor to perform the above method steps.
In a fourth aspect, an embodiment of the present application provides an electronic device, which may include a processor and a memory, wherein the memory stores a computer program adapted to be loaded by the processor to perform the above method steps.
The beneficial effects of the technical solutions provided by some embodiments of the present application include at least the following:
When the solution of an embodiment of the present application is executed, the server samples the sample audio data at each of at least two different sampling rates to obtain corresponding training audio data and trains on the at least two pieces of training audio data to obtain the speech synthesis model. This expands the training data and provides sufficient data for training the speech synthesis model, ensuring that a model of good quality can be generated.
附图说明BRIEF DESCRIPTION OF THE DRAWINGS
To illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings required in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present application, and a person of ordinary skill in the art may derive other drawings from them without creative effort.
FIG. 1 is a diagram of a system architecture provided by an embodiment of the present application;
FIG. 2 is a schematic flowchart of a method for generating a speech synthesis model provided by an embodiment of the present application;
FIG. 3 is another schematic flowchart of a method for generating a speech synthesis model provided by an embodiment of the present application;
FIG. 4 is a schematic structural diagram of an apparatus provided by an embodiment of the present application;
FIG. 5 is a schematic structural diagram of an apparatus provided by an embodiment of the present application.
Detailed Description
To make the objectives, technical solutions, and advantages of the present application clearer, the embodiments of the present application are described in further detail below with reference to the accompanying drawings.
FIG. 1 shows a schematic diagram of an exemplary system architecture 100 to which the method or apparatus for generating a speech synthesis model of an embodiment of the present application can be applied.
As shown in FIG. 1, the system architecture 100 may include one or more of terminal devices 101, 102, and 103, a network 104, and a server 105. The network 104 is the medium providing communication links between the terminal devices 101, 102, 103 and the server 105. Various communication client applications may be installed on the terminal devices 101, 102, 103, such as video recording applications, video playback applications, voice interaction applications, search applications, instant messaging tools, email clients, and social platform software. The network 104 may include various connection types, such as wired links, wireless communication links, or fiber-optic cables.
Users may use the terminal devices 101, 102, 103 to interact with the server 105 through the network 104 to receive or send messages. The terminal devices 101, 102, 103 may be various electronic devices with a display screen, including but not limited to smartphones, tablet computers, portable computers, and desktop computers. The network 104 may include various types of wired or wireless communication links; for example, wired links include optical fiber, twisted pair, or coaxial cable, while wireless links include Bluetooth, wireless fidelity (Wi-Fi), or microwave links. The terminal devices 101, 102, 103 may be hardware or software. When they are software, they may be installed in the electronic devices listed above and implemented as multiple pieces of software or software modules (for example, to provide distributed services) or as a single piece of software or software module, which is not specifically limited here. When the terminal devices 101, 102, 103 are hardware, a display device and a camera may also be installed on them; the display device may be any device capable of display, and the camera is used to capture video streams. For example, the display device may be a cathode ray tube (CRT) display, a light-emitting diode (LED) display, an electronic ink screen, a liquid crystal display (LCD), or a plasma display panel (PDP). Users may use the display device on the terminal devices 101, 102, 103 to view displayed text, pictures, videos, and other information.
It should be noted that the method for generating a speech synthesis model provided in the embodiments of the present application is generally executed by the server 105, and accordingly the speech synthesis apparatus is generally disposed in the server 105. The server 105 may be a server providing various services and may be hardware or software. When the server 105 is hardware, it may be implemented as a distributed server cluster composed of multiple servers or as a single server. When the server 105 is software, it may be implemented as multiple pieces of software or software modules (for example, to provide distributed services) or as a single piece of software or software module, which is not specifically limited here.
The server 105 in the present application may provide various services; for example, the server samples the sample audio data at each of at least two different sampling rates to obtain corresponding training audio data, and trains on the at least two pieces of training audio data to obtain a speech synthesis model.
It should be noted here that the method for generating a speech synthesis model provided in the embodiments of the present application may be executed by one or more of the terminal devices 101, 102, 103 and/or the server 105. Accordingly, the speech synthesis apparatus provided in the embodiments of the present application is generally disposed in the corresponding terminal device and/or the server 105, but the present application is not limited thereto.
It should be understood that the numbers of terminal devices, networks, and servers in FIG. 1 are merely illustrative. Any number of terminal devices, networks, and servers may be provided according to implementation needs.
The method for generating a speech synthesis model provided in the embodiments of the present application is described in detail below with reference to FIG. 2 and FIG. 3. Referring to FIG. 2, which is a schematic flowchart of a method for generating a speech synthesis model provided by an embodiment of the present application, the method of this embodiment may include the following steps:
S201: sample the sample audio data at each of at least two different sampling rates to obtain corresponding training audio data.
Here, the sampling rate is the number of samples extracted per second from an audio signal to form a discrete signal, for example 22,050 samples per second or 16,000 samples per second. The sample audio data is the sample data used to train the model and contains the voice characteristics the user wants to synthesize. The training audio data is the audio data obtained by sampling the sample audio data; different sampling rates yield different training audio data.
In general, a speech synthesis model can only be generated by training on a large amount of sample data, which includes a large amount of sample text data and the sample audio data corresponding to each piece of sample text data. When sample audio data is scarce or partially missing, new training audio data can be obtained by sampling the existing sample audio data multiple times (at least twice), ensuring a sufficient volume of audio data for training the model; with sufficient training audio data, an accurate speech synthesis model can be obtained. For example, suppose there is originally only one piece of sample audio data: after sampling it at 22,050 samples per second and at 16,000 samples per second, first training audio data and second training audio data are obtained, and model training can then proceed on the first training audio data, the second training audio data, and the sample audio data.
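For illustration only (the patent does not name any tool for this step), the multi-rate sampling might be realized as in the following Python sketch, which assumes the librosa library and a hypothetical input file sample.wav:

```python
import librosa

# Load the original sample audio at its native sampling rate.
audio, orig_sr = librosa.load("sample.wav", sr=None)

# Resample at two different rates to derive two pieces of training
# audio data from a single piece of sample audio data.
target_rates = [22050, 16000]
training_audio = [
    librosa.resample(audio, orig_sr=orig_sr, target_sr=sr)
    for sr in target_rates
]

# The sample audio plus the resampled versions together form the
# expanded audio training set.
training_set = [audio] + training_audio
```

Any resampler with equivalent behavior (for example, scipy.signal.resample) would serve the same purpose; the point is only that each target rate yields a distinct piece of training audio data.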
S202: train on the at least two pieces of training audio data to obtain the speech synthesis model.
In general, the server obtains sample text data together with the at least two pieces of sampled training audio data and the sample audio data. It encodes and then decodes the sample text data to obtain the mel spectrum currently corresponding to the sample text data, and separately computes the mel spectra corresponding to the at least two pieces of training audio data and to the sample audio data. It then computes the losses between the mel spectrum currently corresponding to the sample text data and the mel spectra corresponding to the at least two pieces of training audio data and the sample audio data, and generates the pre-trained speech synthesis model when the losses are less than or equal to a preset threshold.
Specifically, the training process of the speech synthesis model may include the following.
The sample text data is converted into at least one phoneme sequence by looking up a phoneme table, and the at least one phoneme sequence is converted into at least one phoneme feature vector. Dimensionality reduction is applied to the at least one phoneme feature vector to obtain a first feature vector, and positional encoding is applied to the first feature vector based on the text sequence information of the sample text data to obtain a second feature vector, where the text sequence information represents at least one of the order and the features of the words in the sample text data; a fast Fourier transform (FFT) block then processes the second feature vector to obtain a phoneme sequence vector. The duration of the at least one phoneme sequence in the sample text data is evaluated probabilistically to obtain its pronunciation duration, and duration extraction is applied to the phoneme sequence vector based on that pronunciation duration to obtain a phoneme alignment vector. Positional encoding is applied to the phoneme alignment vector based on the text sequence information of the sample text data to obtain a third feature vector; an FFT block processes the third feature vector to obtain a fourth feature vector; and a linear layer processes the fourth feature vector to obtain the mel spectrum currently corresponding to the sample text data. Meanwhile, the mel spectra corresponding to the at least two pieces of training audio data and to the sample audio data are computed from their respective spectral features. The losses between the mel spectrum currently corresponding to the sample text data and the mel spectra of the at least two pieces of training audio data, as well as the loss against the mel spectrum of the sample audio data, are then computed, and the pre-trained speech synthesis model is generated when the losses are less than or equal to a preset threshold.
When the solution of this embodiment of the present application is executed, the server samples the sample audio data at each of at least two different sampling rates to obtain corresponding training audio data and trains on the at least two pieces of training audio data to obtain the speech synthesis model. This expands the training data and provides sufficient data for training the speech synthesis model, ensuring that a model of good quality can be generated.
Referring to FIG. 3, which is a schematic flowchart of a method for generating a speech synthesis model provided by an embodiment of the present application, the method may include the following steps:
S301: sample the sample audio data at each of at least two different sampling rates to obtain corresponding training audio data.
Here, the sampling rate is the number of samples extracted per second from an audio signal to form a discrete signal, for example 22,050 samples per second or 16,000 samples per second. The sample audio data is the sample data used to train the model and contains the voice characteristics the user wants to synthesize. The training audio data is the audio data obtained by sampling the sample audio data; different sampling rates yield different training audio data.
In general, a speech synthesis model can only be generated by training on a large amount of sample data, which includes a large amount of sample text data and the sample audio data corresponding to each piece of sample text data. When sample audio data is scarce or partially missing, new training audio data can be obtained by sampling the existing sample audio data multiple times, ensuring a sufficient volume of audio data for training the model; with sufficient training audio data, an accurate speech synthesis model can be obtained. For example, suppose there is originally only one piece of sample audio data: after sampling it at 22,050 samples per second and at 16,000 samples per second, first training audio data and second training audio data are obtained, and model training can then proceed on the first training audio data, the second training audio data, and the sample audio data.
S302: determine the linear spectra corresponding to the at least two pieces of training audio data and to the sample audio data.
Here, the linear spectrum is the frequency spectral density corresponding to each of the at least two pieces of training audio data and the sample audio data, as distinguished from the mel spectrum; the mel spectrum is a spectrum expressed on the mel scale and is nonlinear.
In general, after obtaining the at least two pieces of training audio data and the sample audio data, the server applies a Fourier transform to each of them, converting the time-domain audio signals into the frequency domain to obtain their respective spectra; spectrum conversion is then applied to each spectrum to obtain the corresponding mel spectrum.
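As a hedged illustration of this step (the patent does not prescribe a transform configuration; n_fft and hop_length below are assumed values), a short-time Fourier transform produces the linear magnitude spectrogram. The sketch continues the earlier one and again assumes librosa:

```python
import numpy as np
import librosa

def linear_spectrogram(audio: np.ndarray, n_fft: int = 1024,
                       hop_length: int = 256) -> np.ndarray:
    """Convert a time-domain signal to its linear magnitude spectrogram via STFT."""
    stft = librosa.stft(audio, n_fft=n_fft, hop_length=hop_length)
    return np.abs(stft)  # shape: (1 + n_fft // 2, num_frames)

# One linear spectrum for each piece of training audio data and for
# the sample audio data.
linear_spectra = [linear_spectrogram(a) for a in training_set]
```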
S303: convert the linear spectra into the mel spectra corresponding to the at least two pieces of training audio data and to the sample audio data.
Here, the mel spectrum is a spectrum expressed on the mel scale, and it carries the characteristics of the sound.
In general, because the mel spectrum of audio data better reflects the characteristics of sound, the original spectrum needs to be converted into the nonlinear mel spectrum; this nonlinear spectrum conversion can be performed with the mel-scale formula m = 2595 · log10(1 + f/700), where f is the frequency in Hz, to obtain the mel spectrum of the audio data.
S304: add the mel spectra to the sample mel spectrum set.
Here, the sample mel spectrum set is a set containing at least two mel spectra, where the at least two mel spectra correspond to the training audio data; the set also contains the mel spectrum corresponding to the sample audio data, so the number of mel spectra in the set depends on the numbers of training audio data and sample audio data. Typically, after computing the mel spectra of the at least two pieces of training audio data and of the sample audio data, the server adds each of them to the set, which facilitates the subsequent computation of the losses between the mel spectrum currently corresponding to the sample text data and each mel spectrum in the sample mel spectrum set.
S305: obtain sample text data and encode the sample text data to obtain a phoneme sequence vector.
Here, the sample text data is data presented in text form that carries sample content information and corresponds to the content of the sample audio data. The phoneme sequence vector is the vector, expressed in vector form, obtained after a phoneme sequence undergoes multiple conversions. A phoneme sequence is a row of phoneme elements: the sample text data may be English text, in which case each English word corresponds to one phoneme sequence, or Chinese words, in which case each word corresponds to one phoneme sequence.
In general, a speech synthesis model capable of synthesizing speech data from text data can only be obtained by training on a large amount of sample data, which includes a large amount of sample text data and the sample audio data corresponding to each piece of sample text data.
The encoding of the sample text data into a phoneme sequence vector may proceed as follows. The phoneme sequences corresponding to the sample text data are obtained by looking up a phoneme table; each word in the sample text data corresponds to one phoneme sequence. To allow subsequent processing, the phoneme sequences are converted into phoneme feature vectors, one per phoneme sequence; a phoneme feature vector is the vector, obtained by a preliminary conversion, that contains the features of its phoneme sequence. Dimensionality reduction is applied to the at least one phoneme feature vector to obtain a first feature vector, which differs in dimensionality from the original phoneme feature vector. Positional encoding of the first feature vector adds the text sequence information of the sample text data to it and yields a second feature vector that reflects the temporal order; the text sequence information concerns the words of the sample text data and may represent at least one of their order and their features, and the second feature vector, being the result of positional encoding, is distinct from the first. The second feature vector is then processed by a transformer feed-forward network composed of FFT blocks containing an attention mechanism and convolutional layers; after the parameters contained in the second feature vector are trained and the information requiring attention is extracted, the phoneme sequence vector is obtained.
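A minimal sketch of the positional encoding step described above, assuming the sinusoidal scheme of standard transformer encoders (the patent does not name a specific scheme), PyTorch, and an even d_model:

```python
import math
import torch

def add_positional_encoding(x: torch.Tensor) -> torch.Tensor:
    """Add sinusoidal positional encodings to a (seq_len, d_model) tensor,
    injecting word-order information into the first feature vector to
    produce the second feature vector."""
    seq_len, d_model = x.shape
    position = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)
    div_term = torch.exp(
        torch.arange(0, d_model, 2, dtype=torch.float32)
        * (-math.log(10000.0) / d_model)
    )
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return x + pe
```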
S306: apply duration extraction to the phoneme sequence vector to obtain a phoneme alignment vector.
Here, the phoneme alignment vector is the vector obtained by aligning the phonemes according to the pronunciation duration of the phoneme sequence.
In general, a probabilistic evaluation of the duration of the at least one phoneme sequence in the sample text data yields its pronunciation duration, which is the sum of the pronunciation durations of the individual phonemes in the sequence; each phoneme sequence has one pronunciation duration, which is the duration information of that sequence. Existing approaches extract the duration of each phoneme in a phoneme sequence with a pre-trained model, which performs poorly: it only achieves sentence-level alignment, not phoneme-to-phoneme alignment. This solution instead uses a statistical model (a classic decoder) to process the phoneme sequences and force-align the phonemes. Specifically, the pronunciation durations of the phoneme sequence of each word are gathered statistically, the resulting duration of each phoneme sequence is evaluated probabilistically, and the phoneme sequence with the highest probability is selected from the evaluation results as the output. This achieves phoneme-to-phoneme alignment and yields the phoneme alignment vector.
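One way to realize this duration-based expansion is a FastSpeech-style length regulator that repeats each phoneme's hidden vector for its allotted number of frames; the sketch below is an illustration under that assumption, not necessarily the patent's exact mechanism:

```python
import torch

def length_regulate(phoneme_vecs: torch.Tensor,
                    durations: torch.Tensor) -> torch.Tensor:
    """Expand (num_phonemes, d_model) phoneme vectors into a frame-level
    (total_frames, d_model) alignment sequence; durations[i] is the number
    of spectrogram frames assigned to phoneme i."""
    return torch.repeat_interleave(phoneme_vecs, durations, dim=0)

# Example: three phonemes lasting 2, 4, and 3 frames respectively.
vecs = torch.randn(3, 256)
aligned = length_regulate(vecs, torch.tensor([2, 4, 3]))
print(aligned.shape)  # torch.Size([9, 256])
```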
S307: decode the phoneme alignment vector to obtain the mel spectrum currently corresponding to the sample text data.
In general, positional encoding of the phoneme alignment vector adds the text sequence information of the sample text data to the phoneme alignment feature vector and yields a third feature vector that reflects the temporal order; the third feature vector, being the result of positional encoding of the phoneme alignment vector, is distinct from the first and second feature vectors. The third feature vector is then processed by a transformer feed-forward network composed of FFT blocks containing an attention mechanism and convolutional layers; after the parameters contained in the third feature vector are trained and the information requiring attention is extracted, a fourth feature vector is obtained, distinct from the first, second, and third feature vectors. A linear layer then processes the fourth feature vector to obtain the mel spectrum α currently corresponding to the sample text data.
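The final projection from decoder features onto mel bins is, in spirit, a single linear layer; a minimal PyTorch sketch follows, with d_model and n_mels as assumed values:

```python
import torch
import torch.nn as nn

# Project frame-level decoder features (total_frames, d_model) onto
# n_mels mel-frequency bins to obtain the predicted mel spectrum alpha.
d_model, n_mels = 256, 80
mel_linear = nn.Linear(d_model, n_mels)

fourth_feature_vector = torch.randn(9, d_model)  # e.g., FFT-block output
predicted_mel = mel_linear(fourth_feature_vector)  # shape: (9, n_mels)
```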
S308: compute the losses between the mel spectrum currently corresponding to the sample text data and the at least one mel spectrum in the sample mel spectrum set.
Here, the loss measures the degree of inconsistency between the mel spectrum α currently corresponding to the sample text data and a comparison label; the smaller the loss, the more robust the trained model.
In general, a linear layer processes the fourth feature vector to obtain the mel spectrum α currently corresponding to the sample text data; meanwhile, the mel spectra corresponding to the at least two pieces of training audio data and to the sample audio data are computed from their respective spectral features and added to the sample mel spectrum set. Each mel spectrum in the set may be denoted β and used as a comparison label: the mel spectrum α currently corresponding to the sample text data is compared with each comparison label in the set, and the pairwise losses are computed. Until every loss reaches the preset threshold, training iterates over the above steps; once every loss is less than or equal to the preset threshold, model training is finished and the trained speech synthesis model is generated.
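The patent does not specify a loss function; assuming a mean absolute error (L1) between spectrograms, the comparison against every label β in the set might look like this (all shapes and the threshold are assumed values):

```python
import torch
import torch.nn.functional as F

def mel_losses(alpha: torch.Tensor, mel_set: list) -> list:
    """Loss between the predicted mel spectrum alpha and each comparison
    label beta in the sample mel spectrum set (L1 distance assumed)."""
    return [F.l1_loss(alpha, beta) for beta in mel_set]

# Training iterates until every pairwise loss is at or below the threshold.
threshold = 0.05
alpha = torch.randn(9, 80)                       # predicted mel spectrum
labels = [torch.randn(9, 80) for _ in range(3)]  # sample mel spectrum set
converged = all(loss.item() <= threshold for loss in mel_losses(alpha, labels))
```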
S309: generate the speech synthesis model when the losses are less than or equal to the preset threshold.
Here, the preset threshold is the maximum loss set in advance for the model to be trained; when the loss is less than or equal to this preset maximum, model training is complete.
In general, when iterative training has driven the losses between the mel spectrum α currently corresponding to the sample text data and the comparison labels down to or below the preset threshold, the speech synthesis model has finished training, and the server can use it to perform speech synthesis on input text data and obtain the speech data corresponding to that text data.
S310: obtain text data and convert the text data into at least one phoneme sequence.
Here, text data is data presented in text form that carries content information, and a phoneme sequence is a row of phoneme elements. The text data may be English text, in which case each English word corresponds to one phoneme sequence, or Chinese words, in which case each word corresponds to one phoneme sequence.
In general, the server may obtain the text data by receiving text information sent by a terminal and parsing it, or by recognizing specified text information. After obtaining the text data, the server converts each word in it into the corresponding phoneme sequence by looking up a phoneme table, which facilitates the subsequent processing of the at least one phoneme sequence corresponding to the text data.
For example, if the text data consists of Chinese words and reads "今天你吃饭了吗?" ("Have you eaten today?"), the server can, after looking up the phoneme table, convert it into seven phoneme sequences: {j,i,n}, {t,i,a,n}, {n,i}, {c,h,i}, {f,a,n}, {l,e}, {m,a}; each word in the text data corresponds to one phoneme sequence.
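A toy version of this lookup, with a hypothetical hand-written phoneme table covering only the example sentence (a real system would use a full pronunciation lexicon):

```python
# Hypothetical phoneme table for the example sentence only.
PHONEME_TABLE = {
    "今": ["j", "i", "n"],
    "天": ["t", "i", "a", "n"],
    "你": ["n", "i"],
    "吃": ["c", "h", "i"],
    "饭": ["f", "a", "n"],
    "了": ["l", "e"],
    "吗": ["m", "a"],
}

def text_to_phonemes(text: str) -> list:
    """Convert text into one phoneme sequence per character via table lookup."""
    return [PHONEME_TABLE[ch] for ch in text if ch in PHONEME_TABLE]

print(text_to_phonemes("今天你吃饭了吗?"))
# [['j', 'i', 'n'], ['t', 'i', 'a', 'n'], ['n', 'i'], ['c', 'h', 'i'],
#  ['f', 'a', 'n'], ['l', 'e'], ['m', 'a']]
```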
S311: perform speech synthesis on the at least one phoneme sequence based on the speech synthesis model to obtain the mel spectrum corresponding to the text data.
In general, after obtaining the at least one phoneme sequence corresponding to the text data, the server can run it through the pre-trained speech synthesis model to obtain the mel spectrum corresponding to the text data; this mel spectrum carries the sound features of the text data, and the speech corresponding to the text data can be determined from it.
S312: obtain the synthesized speech corresponding to the text data based on the mel spectrum corresponding to the text data.
Here, synthesized speech is speech produced by accurate synthesis processing; it reflects the user's voice characteristics fairly faithfully.
In general, because the mel spectrum corresponding to the text data carries the sound features of the text data, a Fourier transform can be applied to the mel spectrum, based on its feature information, to obtain the synthesized speech corresponding to the text data. To make the final audio more realistic, background noise data can be computed from a preset signal-to-noise ratio and added to the synthesized speech to obtain the speech corresponding to the text data, that is, speech with a realistic background environment; synthesized speech with background noise added sounds more natural and lifelike.
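A sketch of the noise-mixing step (the Gaussian noise source and the SNR value are assumptions; the patent only states that background noise data is derived from a preset signal-to-noise ratio):

```python
import numpy as np

def add_noise_at_snr(speech: np.ndarray, snr_db: float = 30.0) -> np.ndarray:
    """Mix Gaussian background noise into synthesized speech so that
    10 * log10(P_signal / P_noise) equals the preset SNR in dB."""
    signal_power = np.mean(speech ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = np.random.randn(len(speech)) * np.sqrt(noise_power)
    return speech + noise
```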
When the solution of this embodiment of the present application is executed, the server samples the sample audio data at each of at least two different sampling rates to obtain corresponding training audio data; determines the linear spectra corresponding to the at least two pieces of training audio data and to the sample audio data; converts the linear spectra into the corresponding mel spectra and adds them to the sample mel spectrum set; obtains sample text data and encodes it into a phoneme sequence vector; applies duration extraction to the phoneme sequence vector to obtain a phoneme alignment vector; decodes the phoneme alignment vector to obtain the mel spectrum currently corresponding to the sample text data; computes the losses between that mel spectrum and the at least one mel spectrum in the sample mel spectrum set; generates the speech synthesis model when the losses are less than or equal to the preset threshold; obtains text data and converts it into at least one phoneme sequence; performs speech synthesis on the at least one phoneme sequence based on the speech synthesis model to obtain the mel spectrum corresponding to the text data; and obtains the synthesized speech corresponding to the text data from that mel spectrum. In this way, an accurate speech synthesis model can be generated even with little training data, and speech synthesis can be performed accurately on input text data to obtain the corresponding speech.
The following are apparatus embodiments of the present application, which can be used to execute the method embodiments of the present application. For details not disclosed in the apparatus embodiments, please refer to the method embodiments of the present application.
Referring to FIG. 4, which shows a schematic structural diagram of an apparatus for generating a speech synthesis model provided by an exemplary embodiment of the present application, hereinafter referred to as apparatus 4. Apparatus 4 may be implemented as all or part of a terminal through software, hardware, or a combination of both, and includes a sampling module 401 and a training module 402.
The sampling module 401 is configured to sample the sample audio data at each of at least two different sampling rates to obtain corresponding training audio data.
The training module 402 is configured to train on the at least two pieces of training audio data to obtain the speech synthesis model.
Optionally, the training module 402 further includes:
a training unit, configured to train based on the sample audio data to obtain the speech synthesis model.
Optionally, the training module 402 includes:
a first obtaining unit, configured to obtain sample text data as well as the at least two pieces of training audio data and the sample audio data;
a first processing unit, configured to encode and decode the sample text data to obtain the mel spectrum currently corresponding to the sample text data;
a first generating unit, configured to generate the pre-trained speech synthesis model when the loss is less than or equal to a preset threshold, where the loss is the loss between the mel spectrum currently corresponding to the sample text data and the mel spectra corresponding to the at least two pieces of training audio data and the sample audio data.
Optionally, the training module 402 includes:
a second obtaining unit, configured to obtain the mel spectra corresponding to the at least two pieces of training audio data and to the sample audio data;
an adding unit, configured to add the mel spectra to the sample mel spectrum set.
Optionally, the training module 402 includes:
a determining unit, configured to determine the linear spectra corresponding to the at least two pieces of training audio data and to the sample audio data;
a converting unit, configured to convert the linear spectra into the mel spectra corresponding to the at least two pieces of training audio data and to the sample audio data.
Optionally, the training module 402 includes:
a third obtaining unit, configured to obtain sample text data;
an encoding unit, configured to encode the sample text data to obtain a phoneme sequence vector;
a duration extraction unit, configured to apply duration extraction to the phoneme sequence vector to obtain a phoneme alignment vector;
a decoding unit, configured to decode the phoneme alignment vector to obtain the mel spectrum currently corresponding to the sample text data;
a computing unit, configured to compute the losses between the mel spectrum currently corresponding to the sample text data and the at least one mel spectrum in the sample mel spectrum set;
a second generating unit, configured to generate the speech synthesis model when the losses are less than or equal to the preset threshold.
Optionally, the training module 402 further includes:
a second processing unit, configured to obtain text data and convert the text data into at least one phoneme sequence;
a third processing unit, configured to perform speech synthesis on the at least one phoneme sequence based on the speech synthesis model to obtain the mel spectrum corresponding to the text data;
a fourth processing unit, configured to obtain the synthesized speech corresponding to the text data based on the mel spectrum corresponding to the text data.
It should be noted that when the apparatus 4 provided in the above embodiment executes the method for generating a speech synthesis model, the division into the above functional modules is merely illustrative; in practical applications, the above functions may be assigned to different functional modules as needed, that is, the internal structure of the device may be divided into different functional modules to complete all or part of the functions described above. In addition, the apparatus embodiment and the method embodiments for generating a speech synthesis model provided above belong to the same conception; see the method embodiments for the implementation process, which is not repeated here.
FIG. 5 is a schematic structural diagram of an apparatus for generating a speech synthesis model provided by an embodiment of the present application, hereinafter referred to as apparatus 5. Apparatus 5 may be integrated into the aforementioned server or terminal device. As shown in FIG. 5, the apparatus includes a memory 502, a processor 501, an input device 503, an output device 504, and a communication interface.
The memory 502 may be an independent physical unit and may be connected to the processor 501, the input device 503, and the output device 504 through a bus. The memory 502, the processor 501, the input device 503, and the output device 504 may also be integrated together and implemented in hardware.
The memory 502 is configured to store a program implementing the above method embodiments or the modules of the apparatus embodiments, and the processor 501 calls the program to perform the operations of the above method embodiments.
The input device 503 includes but is not limited to a keyboard, a mouse, a touch panel, a camera, and a microphone; the output device includes but is not limited to a display screen.
The communication interface is configured to send and receive various types of messages and includes but is not limited to a wireless interface or a wired interface.
Optionally, when part or all of the method of the above embodiments is implemented in software, the apparatus may also include only a processor. The memory storing the program is located outside the apparatus, and the processor is connected to the memory through circuits/wires and is configured to read and execute the program stored in the memory.
The processor may be a central processing unit (CPU), a network processor (NP), or a combination of a CPU and an NP.
The processor may further include a hardware chip. The hardware chip may be an application-specific integrated circuit (ASIC), a programmable logic device (PLD), or a combination thereof. The PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), generic array logic (GAL), or any combination thereof.
The memory may include volatile memory, such as random-access memory (RAM); the memory may also include non-volatile memory, such as flash memory, a hard disk drive (HDD), or a solid-state drive (SSD); the memory may also include a combination of the above types of memory.
The processor 501 calls the program code in the memory 502 to perform the following steps:
sampling the sample audio data at each of at least two different sampling rates to obtain corresponding training audio data;
training on the at least two pieces of training audio data to obtain the speech synthesis model.
In one or more embodiments, the processor 501 is further configured to:
train based on the sample audio data to obtain the speech synthesis model.
In one or more embodiments, the processor 501 is further configured to:
obtain sample text data, and obtain the at least two pieces of training audio data and the sample audio data;
encode and decode the sample text data to obtain the mel spectrum currently corresponding to the sample text data;
generate the pre-trained speech synthesis model when the loss is less than or equal to a preset threshold, where the loss is the loss between the mel spectrum currently corresponding to the sample text data and the mel spectra corresponding to the at least two pieces of training audio data and the sample audio data.
In one or more embodiments, the processor 501 is further configured to:
obtain the mel spectra corresponding to the at least two pieces of training audio data and to the sample audio data;
add the mel spectra to the sample mel spectrum set.
In one or more embodiments, the processor 501 is further configured to:
determine the linear spectra corresponding to the at least two pieces of training audio data and to the sample audio data;
convert the linear spectra into the mel spectra corresponding to the at least two pieces of training audio data and to the sample audio data.
In one or more embodiments, the processor 501 is further configured to:
obtain sample text data;
encode the sample text data to obtain a phoneme sequence vector;
apply duration extraction to the phoneme sequence vector to obtain a phoneme alignment vector;
decode the phoneme alignment vector to obtain the mel spectrum currently corresponding to the sample text data;
compute the losses between the mel spectrum currently corresponding to the sample text data and the at least one mel spectrum in the sample mel spectrum set;
generate the speech synthesis model when the losses are less than or equal to the preset threshold.
In one or more embodiments, the processor 501 is further configured to:
obtain text data and convert the text data into at least one phoneme sequence;
perform speech synthesis on the at least one phoneme sequence based on the speech synthesis model to obtain the mel spectrum corresponding to the text data;
obtain the synthesized speech corresponding to the text data based on the mel spectrum corresponding to the text data.
It should be noted that when the apparatus 5 provided in the above embodiment executes the method for generating a speech synthesis model, the division into the above functional modules is merely illustrative; in practical applications, the above functions may be assigned to different functional modules as needed, that is, the internal structure of the device may be divided into different functional modules to complete all or part of the functions described above. In addition, the apparatus embodiment and the method embodiments for generating a speech synthesis model provided above belong to the same conception; see the method embodiments for the implementation process, which is not repeated here.
The serial numbers of the above embodiments of the present application are for description only and do not indicate the relative merits of the embodiments.
An embodiment of the present application also provides a computer storage medium that can store a plurality of instructions adapted to be loaded by a processor to perform the method steps of the embodiments shown in FIG. 2 and FIG. 3 above; for the specific execution process, see the descriptions of the embodiments shown in FIG. 2 and FIG. 3, which are not repeated here.
本领域内的技术人员应明白,本申请的实施例可提供为方法、系统、或计算机程序产品。因此,本申请可采用完全硬件实施例、完全软件实施例、或结合软件和硬件方面的实施例的形式。而且,本申请可采用在一个或多个其中包含有计算机可用程序代码的计算机可用存储介质(包括但不限于磁盘存储器、CD-ROM、光学存储器等)上实施的计算机程序产品的形式。Those skilled in the art will appreciate that the embodiments of the present application may be provided as methods, systems, or computer program products. Therefore, the present application may adopt the form of a complete hardware embodiment, a complete software embodiment, or an embodiment in combination with software and hardware. Moreover, the present application may adopt the form of a computer program product implemented in one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) that contain computer-usable program code.
本申请是参照根据本申请实施例的方法、设备(系统)、和计算机程序产品的流程图和/或方框图来描述的。应理解可由计算机程序指令实现流程图和/或方框图中的每一流程和/或方框、以及流程图和/或方框图中的流程和/或方框的结合。可提供这些计算机程序指令到通用计算机、专用计算机、嵌入式处理机或其他可编程数据处理设备的处理器以产生一个机器,使得通过计算机或其他可编程数据处理设备的处理器执行的指令产生用于实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能的装置。The present application is described with reference to the flowchart and/or block diagram of the method, device (system) and computer program product according to the embodiment of the present application. It should be understood that each process and/or box in the flowchart and/or block diagram, and the combination of the process and/or box in the flowchart and/or block diagram can be realized by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor or other programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing device produce a device for realizing the function specified in one process or multiple processes in the flowchart and/or one box or multiple boxes in the block diagram.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps are executed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
| Field | Value |
|---|---|
| Application Number | CN202010827835.XA |
| Priority / Filing Date | 2020-08-17 |
| Publication | CN112116903A (2020-12-22) |
| Grant | CN112116903B (2024-09-13) |
| Status | Active |
| Country | CN |
| Title | Method, device, storage medium and electronic device for generating speech synthesis model |
| Date | Code | Event |
|---|---|---|
| | PB01 | Publication |
| | SE01 | Entry into force of request for substantive examination |
| | GR01 | Patent grant |
| 2025-02-07 | TR01 | Transfer of patent right from BEIJING DA MI TECHNOLOGY Co., Ltd. (Chaoyang District, Beijing) to Chengdu Yudi Technology Co., Ltd. (Chengdu High-tech Zone, Sichuan Province, China) |