CN106971703A - Song synthesis method and device based on HMM - Google Patents

Song synthesis method and device based on HMM

Info

Publication number: CN106971703A (application CN201710160104.2A)
Authority: CN (China)
Prior art keywords: HMM, model, speaker, voice, synthesis
Legal status: Pending (the listed status is an assumption, not a legal conclusion)
Original language: Chinese (zh)
Inventors: 杨鸿武, 赵娜, 冯欢, 甘振业
Current and original assignee: Northwest Normal University (the listed assignee may be inaccurate)
Priority/filing date: 2017-03-17
Publication date: 2017-07-21

Abstract

The invention discloses an HMM-based song synthesis method and device. Using text-to-speech (TTS) technology, the HMM-based speech synthesis system HTS, and the STRAIGHT algorithm, it builds a speaker-dependent acoustic model for song synthesis and a melody control model for songs, and performs speaker-adaptive training, realizing a personalized HMM-based speech synthesis device that converts lyrics to songs in real time. The device enriches the research content of speech synthesis and makes the synthesized voice more expressive and emotional; in particular, it gives music lovers an opportunity to learn technical operations such as song production and music processing, and it adds to the social resources available to people, which gives it practical value and significance.

Description

Song synthesis method and device based on HMM

Technical Field

The present invention relates to the fields of human-computer interaction, text-to-speech (TTS) conversion, and speech synthesis, and specifically to an HMM-based song synthesis method and device.

Background

With the continuous innovation and improvement of information technology, many music-related multimedia applications of human-computer interaction have entered daily life, such as computer-based song requests, composition, singing-voice enhancement, and song recognition on mobile phones. Making computers more human, able to "sing" like people, that is, automatically producing a beautiful, pleasant singing voice given a numbered musical score and lyrics, has become a new demand. The rapid development of multimedia technology in entertainment also provides broader application space for this technology.

At present, most music is recorded and distributed in digital formats such as WAV, MP3, MIDI, and real-time music streaming. Compared with traditional music production, digital music has unmatched advantages in production, storage, and distribution. With a computer, a creator can hear the result while composing, and any modification to the score is fed back immediately, without the traditional cycle of rehearsal, performance, recording, and editing. This greatly reduces the production cycle and labor cost of music and also prevents composers from losing fleeting inspiration during a long creative process.

Speech synthesis is an important research topic in human-computer interaction and an important component of embedded-systems research. Singing-voice synthesis has gradually become a hot topic as well. Before singing-voice synthesis emerged, however, speech synthesis was already relatively mature, and some researchers tried to synthesize singing with speech-synthesis methods. Singing and speech differ to some degree: speech emphasizes content (though it can also express the speaker's intention and emotion), while singing emphasizes the rendering and fluctuation of melody, so speech-synthesis methods cannot be applied directly to singing synthesis.

Over long-term research at home and abroad, singing-voice synthesis, like speech synthesis, has developed three mainstream approaches: (1) waveform-concatenation synthesis; (2) parametric synthesis; (3) speech-modification synthesis. Concatenative and parametric synthesis are both corpus-based and yield limited sound quality, while speech modification is more flexible: the acoustic parameters of a speech signal are modified according to melody information to produce singing. Personalized real-time lyrics-to-singing conversion has been proposed at home and abroad: singing is produced immediately from a song's score information, taking continuous speech of the song's lyrics as input. After the speech corresponding to the lyrics is recorded, such a system uses the Viterbi algorithm to select among continuous synthesis units and the pitch-synchronous overlap-add (PSOLA) method to convert pitch, duration, energy, and spectrum in real time and synthesize the singing. Because that system did not account for acoustic differences between speech and singing, such as pitch and duration, the synthesis results were unsatisfactory. On this basis, a large-corpus lyrics-to-singing system was proposed that achieved better naturalness and sound quality; it built three Mandarin corpora and used the Viterbi algorithm to determine the optimal combination of synthesis units. The drawback of this approach is that building the corpora costs a great deal of time and human effort.

Therefore, those skilled in the art have been working to develop a new HMM-based personalized song synthesis method and device for users with music-processing needs.

Summary of the Invention

In view of the above defects of the prior art, the present invention addresses the problems raised in the background, namely the limited research on Chinese singing-voice synthesis, the low quality of the synthesized sound, and the time- and labor-intensive operation, and provides an HMM-based personalized song synthesis method and device for users with music-processing needs.

To solve the above technical problems, the present invention provides the following technical solutions:

An HMM-based song synthesis method, comprising the following steps:

A. Analyze the differences in acoustic features between speech and singing, and build a melody control model for the singing voice;

B. Build a speaker-dependent HMM-based acoustic model for song synthesis;

C. Synthesize the singing voice with the HMM-based speech synthesis system.

Further, the specific steps for analyzing the differences in acoustic features between speech and singing in step A are as follows:

a. Perform spectral analysis of the speech signal using time-domain and frequency-domain analysis, and compare the fundamental frequency of the speech signal with that of the singing signal;

b. Use MIDI technology to extract the required musical-score information from the MIDI system;

c. Read the melody information of the score extracted from the MIDI file and analyze the structural features of the score file to obtain music parameter information, including channel number, note pitch, key velocity, note onset time, and note duration.
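As an illustration of this extraction step, a minimal sketch follows. It assumes the third-party mido library and its seconds-based message iteration; the patent does not name a specific MIDI toolkit.

```python
import mido  # third-party MIDI library (an assumption; the patent names no toolkit)

def extract_note_parameters(path):
    """Collect (channel, pitch, velocity, onset, duration) tuples from a MIDI file."""
    notes, pending, now = [], {}, 0.0
    for msg in mido.MidiFile(path):       # iteration yields messages with .time as delta seconds
        now += msg.time
        if msg.type == 'note_on' and msg.velocity > 0:
            pending[(msg.channel, msg.note)] = (now, msg.velocity)
        elif msg.type in ('note_off', 'note_on'):   # a note_on with velocity 0 also ends a note
            key = (msg.channel, msg.note)
            if key in pending:
                onset, velocity = pending.pop(key)
                notes.append((msg.channel, msg.note, velocity, onset, now - onset))
    return notes
```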

Further, the melody control model of the singing voice in step A includes a fundamental frequency (F0) control model and a duration control model; the F0 control model converts the discrete pitches in the score into a continuous F0 curve, and the duration control model yields the articulation duration of each sung note.
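A minimal sketch of the F0 control idea: discrete MIDI pitches become a frame-level contour, here under equal temperament (MIDI note 69 = A4 = 440 Hz) with a crude moving-average smoothing. The patent does not specify the interpolation it uses, so the smoothing below is an assumption.

```python
import numpy as np

def midi_pitch_to_f0_curve(notes, frame_shift=0.005):
    """notes: list of (pitch, onset, duration) in seconds -> frame-level F0 contour in Hz."""
    total = max(onset + dur for _, onset, dur in notes)
    t = np.arange(0.0, total, frame_shift)
    f0 = np.zeros_like(t)
    for pitch, onset, dur in notes:
        hz = 440.0 * 2.0 ** ((pitch - 69) / 12.0)   # equal-temperament pitch-to-Hz conversion
        f0[(t >= onset) & (t < onset + dur)] = hz
    voiced = f0 > 0
    f0[voiced] = np.convolve(f0, np.ones(9) / 9.0, mode='same')[voiced]  # crude smoothing
    return t, f0
```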

Further, building the speaker-dependent HMM-based acoustic model for song synthesis in step B comprises the following steps:

a. Analyze the speech data of the speakers' corpora to obtain the acoustic parameters, including fundamental frequency F0, duration, spectrum SP, and aperiodicity index AP; then use HMM-based speaker-adaptive training to train an average voice model of the mixed speech;

b. Using a small amount of speech data from the target speaker to be synthesized, obtain the target speaker's adaptive acoustic model through speaker-adaptive transformation, and correct and update the adaptive model.

Further, training the average voice model of the mixed speech through HMM-based speaker-adaptive training comprises the following steps:

a. Perform speech analysis on the source speakers' corpora and the target speaker's corpus, extract the acoustic parameters (Mel-cepstral coefficients), and compute their first-order and second-order differences;
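For illustration, the first- and second-order differences over a Mel-cepstral sequence can be computed as below; the two-frame regression window is a typical choice assumed here, not one mandated by the patent.

```python
import numpy as np

def append_deltas(mcep):
    """mcep: (frames, order) Mel-cepstra -> (frames, 3*order) statics + delta + delta-delta."""
    padded = np.pad(mcep, ((1, 1), (0, 0)), mode='edge')
    delta = (padded[2:] - padded[:-2]) / 2.0         # delta[t] = (c[t+1] - c[t-1]) / 2
    padded2 = np.pad(delta, ((1, 1), (0, 0)), mode='edge')
    delta2 = (padded2[2:] - padded2[:-2]) / 2.0      # same regression applied to the deltas
    return np.concatenate([mcep, delta, delta2], axis=1)
```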

b. Combined with the context attribute set, train the HMM models of the spectral and F0 parameters and the multi-space-distribution hidden semi-Markov model (MSD-HSMM) of the state duration parameters;

c. Using a small speech corpus of the target speaker, perform speaker-adaptive training to obtain the average voice model of the mixed speech, and thereby the context-dependent MSD-HSMM.
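For context, the multi-space distribution named above is what lets a single stream model F0, which is continuous in voiced frames and undefined in unvoiced frames. In the standard MSD formulation, which the patent invokes but does not write out, the state output probability is

$$ b_i(o) = \sum_{g \in S(o)} w_{i,g}\, \mathcal{N}_{i,g}\!\bigl(V(o)\bigr), $$

where $S(o)$ is the set of sub-spaces (voiced or unvoiced) consistent with the observation $o$, $w_{i,g}$ are the space weights of state $i$, and $V(o)$ is the continuous part of the observation.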

Further, obtaining the target speaker's adaptive acoustic model from a small amount of the target speaker's speech data through speaker-adaptive transformation, and correcting and updating the adaptive model, comprises the following steps:

a. After speaker-adaptive training, use the HSMM-based CMLLR adaptation algorithm to compute the mean vectors and covariance matrices of the speaker-transformed state output probability distributions and state duration probability distributions. The transformation equations for the feature vector o and the state duration d in state i are:

$$ b_i(o) = \mathcal{N}\bigl(o;\, A\mu_i - b,\; A\Sigma_i A^{\mathsf T}\bigr) = \lvert A^{-1} \rvert\, \mathcal{N}\bigl(W\xi;\, \mu_i, \Sigma_i\bigr) $$

$$ p_i(d) = \mathcal{N}\bigl(d;\, \alpha m_i - \beta,\; \alpha \sigma_i^2 \alpha\bigr) = \lvert \alpha^{-1} \rvert\, \mathcal{N}\bigl(\alpha\psi;\, m_i, \sigma_i^2\bigr) $$

where $\xi = [o^{\mathsf T}, 1]^{\mathsf T}$ and $\psi = [d, 1]^{\mathsf T}$; $\mu_i$ is the mean of the state output distribution and $m_i$ the mean of the duration distribution; $\Sigma_i$ is the diagonal covariance matrix and $\sigma_i^2$ the variance; $W = [A^{-1}, b^{-1}]$ is the linear transformation matrix of the target speaker's state output probability density distribution, and $X = [\alpha^{-1}, \beta^{-1}]$ is the transformation matrix of the state duration probability density distribution;
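A toy illustration of applying such an affine transform to a Gaussian state (shapes only; estimating A, b, α, β is the maximum-likelihood step described next):

```python
import numpy as np

def apply_cmllr(mu, sigma, A, b):
    """Transform a Gaussian state output N(mu, sigma) under the mapping o -> A o - b."""
    return A @ mu - b, A @ sigma @ A.T

# toy usage with a 3-dimensional state
mu = np.array([1.0, 0.5, -0.2])
sigma = np.diag([0.10, 0.20, 0.15])
A = 1.1 * np.eye(3)
b = np.array([0.05, 0.0, -0.02])
print(apply_cmllr(mu, sigma, A, b))
```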

b. Through the HSMM-based adaptive transformation algorithm, the spectral, F0, and duration parameters of the speech data can be normalized and transformed; for adaptation data O of length T, maximum-likelihood estimation can be performed on the transform Λ = (W, X);
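Written out, this maximum-likelihood criterion is simply

$$ \hat{\Lambda} = (\hat{W}, \hat{X}) = \operatorname*{arg\,max}_{\Lambda}\; P(O \mid \lambda, \Lambda), $$

where $\lambda$ is the HSMM parameter set; this restates the sentence above in the standard CMLLR form.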

c. The maximum a posteriori (MAP) algorithm is used to correct and update the adaptive speech model. For a given HSMM parameter set λ, with forward and backward probabilities α_t(i) and β_t(i) respectively, the generation probability χ_t^d(i) of the continuous observation sequence o_{t−d+1} … o_t in state i can be written down from these quantities.
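The formula itself is standard in the HSMM adaptation literature; assuming the patent follows that convention, the generation probability reads

$$ \chi_t^d(i) = \frac{1}{P(O \mid \lambda)} \sum_{j \ne i} \alpha_{t-d}(j)\, a_{ji}\, p_i(d) \prod_{s=t-d+1}^{t} b_i(o_s)\, \beta_t(i), $$

with $a_{ji}$ the transition probabilities; this is offered as the standard form rather than a verbatim reproduction of the patent's expression.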

The MAP estimation is described as follows:
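Assuming, as above, the standard HSMM MAP updates, the re-estimated state output and duration means take the form

$$ \hat{\mu}_i = \frac{\omega\,\bar{\mu}_i + \sum_{t=1}^{T}\sum_{d=1}^{t} \chi_t^d(i) \sum_{s=t-d+1}^{t} o_s}{\omega + \sum_{t=1}^{T}\sum_{d=1}^{t} \chi_t^d(i)\, d}, \qquad \hat{m}_i = \frac{\tau\,\bar{m}_i + \sum_{t=1}^{T}\sum_{d=1}^{t} \chi_t^d(i)\, d}{\tau + \sum_{t=1}^{T}\sum_{d=1}^{t} \chi_t^d(i)}, $$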

where $\bar{\mu}_i$ and $\bar{m}_i$ are the mean vectors after the linear-regression transformation; ω and τ are the MAP estimation parameters of the state output and duration distributions, respectively; and $\hat{\mu}_i$ and $\hat{m}_i$ are the weighted-average MAP estimates of the adaptive mean vectors $\bar{\mu}_i$ and $\bar{m}_i$.

Further, the speech analysis and synthesis method used in step C to synthesize the singing voice with the HMM-based speech synthesis system is based on the STRAIGHT algorithm.

Further, synthesizing the singing voice with the HMM-based speech synthesis system in step C comprises the following steps:

a. Analyze the input lyrics text with a text analysis tool: a text analysis program converts the given lyrics text into an acoustic label sequence containing context description information; the decision trees obtained by clustering during training predict the context-dependent HMM for each phone in its context, and these models are concatenated into a sentence HMM;

b. From the MIDI file, obtain the pitch and length of each note of the lyrics, obtain the corresponding fundamental frequency and duration through the melody control model, and use the note durations to modify the durations of each syllable's spectrum SP, aperiodicity index AP, and fundamental frequency F0;

c. Use the speaker-dependent acoustic model and the STRAIGHT algorithm to generate the parameter sequences of spectrum SP, aperiodicity index AP, duration, and fundamental frequency F0 from the sentence HMM, synthesize the speech, and then add the musical accompaniment to complete the song synthesis.
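Taken together, steps a through c amount to the pipeline sketched below. Every function name in the sketch (analyze_lyrics, predict_sentence_hmm, and so on) is hypothetical and stands for the corresponding component of the method, not for any real toolkit API.

```python
def synthesize_song(lyrics_text, midi_path, accompaniment):
    labels = analyze_lyrics(lyrics_text)            # step a: context-dependent label sequence
    sentence_hmm = predict_sentence_hmm(labels)     # step a: decision-tree lookup, then concatenation
    notes = extract_note_parameters(midi_path)      # step b: per-note pitch and length from MIDI
    f0_curve, durations = melody_control(notes)     # step b: melody control model (F0 + duration)
    sp, ap, f0 = generate_parameters(sentence_hmm, f0_curve, durations)  # SP/AP/F0 fit to note durations
    voice = straight_vocoder(sp, ap, f0)            # step c: STRAIGHT-based waveform generation
    return mix(voice, accompaniment)                # step c: add the musical accompaniment
```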

Further, the speech analysis and synthesis method used in step C, based on the STRAIGHT algorithm, comprises the following steps:

First, the speaker's speech signal is input and the STRAIGHT algorithm extracts its fundamental frequency F0 and spectral envelope; the acoustic parameters are then modulated to produce a new excitation source and a time-varying filter, and the speech is resynthesized according to the original filter model using a synthesis formula with the following quantities.

Here Q denotes the positions of a group of sample points in the synthesis excitation, and G(·) denotes pitch modulation, which allows the modulated F0 to be matched arbitrarily to the F0 of the original speech. An all-pass filter controls the fine pitch and the temporal structure of the original signal; for example, a linear phase shift proportional to frequency controls the fine structure of F0. From the modulated amplitude spectrum A(S(u(ω), r(t)), u(ω), r(t)), the Fourier transform V(ω, t_i) of the corresponding minimum-phase pulse can be computed, where A(·), u(·), and r(·) denote modulation in the amplitude, frequency, and time dimensions, respectively;

where q denotes frequency.
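As a concrete illustration of the minimum-phase pulse computation mentioned above, the real-cepstrum construction below is a common realization; it is an assumed implementation choice, not necessarily the exact one used inside STRAIGHT.

```python
import numpy as np

def minimum_phase_pulse(amplitude, n_fft):
    """Build a minimum-phase impulse response from a one-sided amplitude spectrum."""
    log_mag = np.log(np.maximum(amplitude, 1e-10))   # avoid log(0)
    cep = np.fft.irfft(log_mag, n_fft)               # real cepstrum of the log spectrum
    w = np.zeros(n_fft)                              # minimum-phase cepstral window:
    w[0] = 1.0                                       #   keep c[0],
    w[1:n_fft // 2] = 2.0                            #   double the causal part,
    w[n_fft // 2] = 1.0                              #   keep the Nyquist term
    min_phase_spectrum = np.exp(np.fft.rfft(cep * w, n_fft))
    return np.fft.irfft(min_phase_spectrum, n_fft)
```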

An HMM-based song synthesis device, characterized by comprising:

a melody control module, for building the melody control model of the singing voice;

an HMM-based speaker-dependent acoustic module, for building the speaker-dependent acoustic model for song synthesis;

an HMM-based singing-voice synthesis module, for synthesizing the singing voice to be synthesized.

Further, the melody control module comprises:

a MIDI analysis unit, for analyzing the score information extracted from a MIDI file and obtaining the corresponding music parameter information;

a prosody control unit, for building the melody control model of the singing voice according to the differences in acoustic features between speech and singing.

Further, the HMM-based speaker-dependent acoustic module comprises:

an acoustic model unit, for obtaining the acoustic model of the target speaker;

an acoustic parameter subunit, for HMM-based parametric speech synthesis.

Further, the HMM-based singing-voice synthesis module comprises:

a text analysis unit, which performs text analysis on the input lyrics text to obtain context-dependent labels;

an HMM training subunit, for building the HMM model library of the speech data;

a speaker adaptation subunit, for normalizing and transforming the feature parameters of the training speakers to obtain the adaptive model;

a speech synthesis unit, for synthesizing the singing voice to be synthesized;

a singing-voice synthesis unit, for adding musical accompaniment to the synthesized singing voice to complete the song.

The advantages and positive effects of the present invention are as follows. The HMM-based song synthesis method and device use TTS (text-to-speech) technology, HTS (the HMM-based speech synthesis system), and the STRAIGHT algorithm; they build a speaker-dependent HMM acoustic model for song synthesis and a melody control model for the song; and they perform speaker-adaptive training, realizing a personalized HMM-based device for real-time lyrics-to-song conversion. Compared with traditional singing-voice synthesis systems, the speech analysis and synthesis method used here is based on the STRAIGHT algorithm, and a speaker-adaptive training process is added in the training stage to obtain an average voice model of the mixed speech. This training process reduces the influence of speaker differences in the speech corpus and thus improves the voice quality of the synthesized singing; on the basis of the average voice model, speaker-adaptive transformation synthesizes singing speech of good naturalness and pleasantness from a small amount of speaker data. The device enriches the research content of speech synthesis and makes the synthesized speech more expressive and emotional; in particular, it offers music lovers an opportunity to learn technical operations such as song production and music processing, and it adds to the social resources available to people, which gives it practical value and significance.

Brief Description of the Drawings

To explain the technical solutions in the embodiments of the present invention or the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.

Fig. 1 is a system flow diagram of an HMM-based song synthesis method according to a preferred embodiment of the present invention;

Fig. 2 is a block diagram of the MIDI system of a preferred embodiment of the present invention;

Fig. 3 is a block diagram of the speaker-adaptive speech synthesis system of a preferred embodiment of the present invention;

Fig. 4 is a block diagram of the STRAIGHT analysis-modulation-synthesis system of a preferred embodiment of the present invention;

Fig. 5 is a schematic structural diagram of an HMM-based song synthesis device according to a preferred embodiment of the present invention.

Detailed Description

The technical solutions of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. Based on these embodiments, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the protection scope of the present invention.

As shown in Fig. 1, a preferred embodiment of the present invention discloses an HMM-based song synthesis method. It uses TTS (text-to-speech) technology, HTS (the HMM-based speech synthesis system), and the STRAIGHT algorithm; it builds a speaker-dependent HMM acoustic model for song synthesis and a melody control model for the song; and it performs speaker-adaptive training, realizing a personalized HMM-based speech synthesis method for real-time lyrics-to-song conversion. The method comprises the following steps:

A. Analyze the differences in acoustic features between speech and singing, and build a melody control model for the singing voice;

The specific steps for analyzing the differences in acoustic features between speech and singing in step A are as follows:

a. Perform spectral analysis of the speech signal using time-domain and frequency-domain analysis, and compare the fundamental frequency of the speech signal with that of the singing signal;

b. Use MIDI technology to extract the required musical-score information from the MIDI system;

c. Read the melody information of the score extracted from the MIDI file and analyze the structural features of the score file to obtain music parameter information, including channel number, note pitch, key velocity, note onset time, and note duration.

Fig. 2 shows a block diagram of the MIDI system.

The melody control model of the singing voice in step A includes an F0 control model and a duration control model; the F0 control model converts the discrete pitches in the score into a continuous F0 curve, and the duration control model yields the articulation duration of each sung note.

B. Build a speaker-dependent HMM-based acoustic model for song synthesis;

As shown in Fig. 3, building the speaker-dependent HMM-based acoustic model for song synthesis in step B comprises the following steps:

a. Analyze the speech data of the speakers' corpora to obtain the acoustic parameters, including fundamental frequency F0, duration, spectrum SP, and aperiodicity index AP; then use HMM-based speaker-adaptive training to train an average voice model of the mixed speech;

b. Using a small amount of speech data from the target speaker to be synthesized, obtain the target speaker's adaptive acoustic model through speaker-adaptive transformation, and correct and update the adaptive model, so as to synthesize speech with the target speaker's timbre.

As shown in Fig. 3, training the average voice model of the mixed speech through HMM-based speaker-adaptive training comprises the following steps:

a. Perform speech analysis on the source speakers' corpora and the target speaker's corpus, extract the acoustic parameters (Mel-cepstral coefficients), and compute their first-order and second-order differences;

b. Combined with the context attribute set, train the HMM models of the spectral and F0 parameters and the multi-space-distribution hidden semi-Markov model (MSD-HSMM) of the state duration parameters;

c. Using a small speech corpus of the target speaker, perform speaker-adaptive training to obtain the average voice model of the mixed speech, and thereby the context-dependent MSD-HSMM, through the following sub-steps:

① Use the constrained maximum-likelihood linear regression (CMLLR) algorithm to express the difference between each training speaker's speech data and the average voice as a linear regression function;

② Normalize the differences between the training speakers with a set of linear regression equations for the state output distributions and the state duration distributions;

③ Train the average voice model of the mixed speech, thereby obtaining the context-dependent MSD-HSMM.
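Formally, and as a standard statement of speaker-adaptive training rather than a formula given in the patent, the average voice model λ and the per-speaker transforms Λ^(s) are estimated jointly:

$$ \bigl(\hat{\lambda}, \{\hat{\Lambda}^{(s)}\}\bigr) = \operatorname*{arg\,max}_{\lambda,\, \{\Lambda^{(s)}\}} \prod_{s=1}^{S} P\!\bigl(O^{(s)} \mid \lambda, \Lambda^{(s)}\bigr), $$

where $O^{(s)}$ is the training data of speaker $s$; the transforms absorb inter-speaker differences so that $\lambda$ captures the shared average voice.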

Obtaining the target speaker's adaptive acoustic model from a small amount of the target speaker's speech data through speaker-adaptive transformation, and correcting and updating the adaptive model, comprises the following steps:

a. After speaker-adaptive training, use the HSMM-based CMLLR adaptation algorithm to compute the mean vectors and covariance matrices of the speaker-transformed state output probability distributions and state duration probability distributions. The transformation equations for the feature vector o and the state duration d in state i are:

$$ b_i(o) = \mathcal{N}\bigl(o;\, A\mu_i - b,\; A\Sigma_i A^{\mathsf T}\bigr) = \lvert A^{-1} \rvert\, \mathcal{N}\bigl(W\xi;\, \mu_i, \Sigma_i\bigr) $$

$$ p_i(d) = \mathcal{N}\bigl(d;\, \alpha m_i - \beta,\; \alpha \sigma_i^2 \alpha\bigr) = \lvert \alpha^{-1} \rvert\, \mathcal{N}\bigl(\alpha\psi;\, m_i, \sigma_i^2\bigr) $$

where $\xi = [o^{\mathsf T}, 1]^{\mathsf T}$ and $\psi = [d, 1]^{\mathsf T}$; $\mu_i$ is the mean of the state output distribution and $m_i$ the mean of the duration distribution; $\Sigma_i$ is the diagonal covariance matrix and $\sigma_i^2$ the variance; $W = [A^{-1}, b^{-1}]$ is the linear transformation matrix of the target speaker's state output probability density distribution, and $X = [\alpha^{-1}, \beta^{-1}]$ is the transformation matrix of the state duration probability density distribution;

b. Through the HSMM-based adaptive transformation algorithm, the spectral, F0, and duration parameters of the speech data can be normalized and transformed; for adaptation data O of length T, maximum-likelihood estimation can be performed on the transform Λ = (W, X);

c. The maximum a posteriori (MAP) algorithm is used to correct and update the adaptive speech model. For a given HSMM parameter set λ, with forward and backward probabilities α_t(i) and β_t(i) respectively, the generation probability χ_t^d(i) of the continuous observation sequence o_{t−d+1} … o_t in state i is obtained from these quantities, as given above.

The MAP estimation is as described above: $\bar{\mu}_i$ and $\bar{m}_i$ are the mean vectors after the linear-regression transformation; ω and τ are the MAP estimation parameters of the state output and duration distributions, respectively; and $\hat{\mu}_i$ and $\hat{m}_i$ are the weighted-average MAP estimates of the adaptive mean vectors.

C. Synthesize the singing voice with the HMM-based speech synthesis system.

The speech analysis and synthesis method used in step C to synthesize the singing voice with the HMM-based speech synthesis system is based on the STRAIGHT algorithm.

Synthesizing the singing voice with the HMM-based speech synthesis system in step C comprises the following steps: a. Analyze the input lyrics text with a text analysis tool: a text analysis program converts the given lyrics text into an acoustic label sequence containing context description information; the decision trees obtained by clustering during training predict the context-dependent HMM for each phone in its context, and these models are concatenated into a sentence HMM;

b. From the MIDI file, obtain the pitch and length of each note of the lyrics, obtain the corresponding fundamental frequency and duration through the melody control model, and use the note durations to modify the durations of each syllable's spectrum SP, aperiodicity index AP, and fundamental frequency F0;

c. Use the speaker-dependent acoustic model and the STRAIGHT algorithm to generate the parameter sequences of spectrum SP, aperiodicity index AP, duration, and fundamental frequency F0 from the sentence HMM, synthesize the speech, and then add the musical accompaniment to complete the song synthesis.

As shown in Fig. 4, during singing-voice synthesis the STRAIGHT analysis-modulation-synthesis system is used to extract the F0 information accurately and to exclude the periodic interference of the spectral envelope. The speech analysis and synthesis method used in step C is based on the STRAIGHT algorithm and comprises the following steps:

First, the speaker's speech signal is input and the STRAIGHT algorithm extracts its fundamental frequency F0 and spectral envelope; the acoustic parameters are then modulated to produce a new excitation source and a time-varying filter, and the speech is resynthesized according to the original filter model using a synthesis formula with the following quantities.

Here Q denotes the positions of a group of sample points in the synthesis excitation, and G(·) denotes pitch modulation, which allows the modulated F0 to be matched arbitrarily to the F0 of the original speech. An all-pass filter controls the fine pitch and the temporal structure of the original signal; for example, a linear phase shift proportional to frequency controls the fine structure of F0. From the modulated amplitude spectrum A(S(u(ω), r(t)), u(ω), r(t)), the Fourier transform V(ω, t_i) of the corresponding minimum-phase pulse can be computed, where A(·), u(·), and r(·) denote modulation in the amplitude, frequency, and time dimensions, respectively;

where q denotes frequency.

Corresponding to the above method, another preferred embodiment of the present invention discloses an HMM-based song synthesis device. The device builds the speaker-dependent HMM acoustic model for song synthesis and the melody control model of the song, performs speaker-adaptive training, and realizes personalized real-time lyrics-to-song conversion through HTS (the HMM-based speech synthesis system) with the STRAIGHT algorithm, combined with TTS (text-to-speech) technology. The functions of the device may be implemented in software, hardware, or a combination of both.

As shown in Fig. 5, the song synthesis device comprises a melody control module, an HMM-based speaker-dependent acoustic module, and an HMM-based singing-voice synthesis module.

The melody control module builds the melody control model of the singing voice.

The melody control module comprises:

a MIDI analysis unit, for analyzing the score information extracted from a MIDI file and obtaining the corresponding music parameter information;

a prosody control unit, for building the melody control model of the singing voice according to the differences in acoustic features between speech and singing.

Through the MIDI analysis unit, the score information extracted from the MIDI file is analyzed and the corresponding music parameter information is obtained; the melody control module then builds the melody control model of the singing voice according to the differences in acoustic features between speech and singing.

The HMM-based speaker-dependent acoustic module builds the speaker-dependent acoustic model for song synthesis.

The HMM-based speaker-dependent acoustic module comprises:

an acoustic model unit, for obtaining the acoustic model of the target speaker;

an acoustic parameter subunit, for HMM-based parametric speech synthesis.

The HMM-based singing-voice synthesis module synthesizes the singing voice to be synthesized.

The HMM-based singing-voice synthesis module comprises:

a text analysis unit, which performs text analysis on the input lyrics text to obtain context-dependent labels;

an HMM training subunit, for building the HMM model library of the speech data: it extracts the speakers' acoustic parameters from the speech corpus (mainly the F0, spectrum, and duration parameters), trains the statistical acoustic models in combination with the context labels of the corpus, and then determines the F0, spectrum, and duration parameters according to the context attribute set;

a speaker adaptation subunit, for normalizing and transforming the feature parameters of the training speakers to obtain the adaptive model: through speaker training, the differences in state output distributions and state duration distributions between the training speakers and the average voice model are normalized, and the maximum-likelihood linear regression algorithm determines the average voice model of the multi-speaker mixed speech; the adaptation data are then used to compute the mean vectors and covariance matrices of the speaker's state output probability distributions and duration probability distributions and to transform them toward the target speaker model, thereby building the target speaker's adaptive MSD-HSMM;

a speech synthesis unit, for synthesizing the singing voice to be synthesized: the corrected adaptive model predicts the speech parameters of the input lyrics text, the speech acoustic parameters are extracted, and the singing speech is synthesized by a STRAIGHT-based speech synthesizer;

a singing-voice synthesis unit, for adding musical accompaniment to the synthesized singing voice to complete the song.
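For illustration, the module structure of Fig. 5 could be mirrored by a class skeleton such as the one below; all class and method names are hypothetical and serve only to make the division of responsibilities concrete.

```python
class MelodyControlModule:
    def analyze_midi(self, midi_path): ...            # MIDI analysis unit: score -> music parameters
    def build_melody_model(self, music_params): ...   # prosody control unit: F0 and duration models

class SpeakerAcousticModule:
    def train_average_voice(self, corpora): ...       # acoustic model unit: adaptive training over speakers
    def adapt_to_target(self, target_data): ...       # acoustic parameter subunit: CMLLR transform + MAP update

class SingingSynthesisModule:
    def label(self, lyrics): ...                      # text analysis unit: context-dependent labels
    def train_hmms(self, corpus): ...                 # HMM training subunit: model library of speech data
    def synthesize_voice(self, labels, melody, model): ...   # speech synthesis unit (STRAIGHT-based)
    def add_accompaniment(self, voice, accompaniment): ...   # singing-voice synthesis unit: final song
```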

The above method steps may be carried out by hardware under the direction of program instructions; the program may be stored in a readable storage medium and, when executed, performs the corresponding steps of the above method.

The above is a further detailed description of the present invention in combination with specific preferred embodiments, and the specific implementation of the present invention cannot be considered limited to these descriptions. For those of ordinary skill in the art, simple deductions or substitutions made without departing from the concept of the present invention shall be deemed to fall within the protection scope of the present invention.

Claims (13)






Legal Events

PB01: Publication (application publication date: 2017-07-21)
SE01: Entry into force of request for substantive examination
RJ01: Rejection of invention patent application after publication
