Technical Field
The present invention belongs to the technical field of speech recognition and relates to a speech emotion transfer method, in particular to a method for transferring speech emotion based on models of different speech providers.
Background Art
With the development of smart chip technology, terminal devices have become increasingly intelligent and integrated, and their miniaturization, portability, and networking have made daily life more and more convenient. Users constantly exchange voice and video through network terminals, accumulating massive amounts of multimedia data. With the accumulation of platform data, intelligent question answering systems have gradually emerged. These systems draw on cutting-edge technologies such as speech recognition, emotion analysis, information retrieval, semantic matching, sentence generation, and speech synthesis.
Speech recognition technology enables machines to convert speech signals into the corresponding text or machine instructions through recognition and understanding, so that machines can comprehend what people say. It mainly involves speech unit selection, speech feature extraction, pattern matching, and model training. Speech units include words (or sentences), syllables, and phonemes, selected according to the scenario and task. Word units are mainly suitable for small-vocabulary speech recognition systems; syllable units are better suited to Chinese speech recognition; and although phonemes describe the basic components of speech well, the complexity and variability of speakers make stable data sets difficult to obtain, so phoneme-based recognition is still under study.
Another research direction is speech emotion recognition, which mainly consists of speech signal acquisition, emotional feature extraction, and emotion recognition. Emotional features fall into three main categories: prosodic features, spectrum-based features, and voice quality features. These features are generally extracted at frame granularity, and emotion recognition is performed on global statistics of the features. Emotion recognition algorithms fall into two broad classes: discrete speech emotion classifiers and dimensional speech emotion predictors. Speech emotion recognition technology is widely applied in call centers, driver mental-state assessment, remote online courses, and other fields.
Intelligent agents are regarded as the comprehensive product of the next generation of artificial intelligence. They must not only perceive the surrounding environment and understand human behavior and language, but also, when interacting with people, understand human emotions and be able to imitate human emotional expression in order to achieve a gentler interaction. Current research on agent emotion focuses mainly on virtual image processing and draws on results from computer graphics, psychology, cognitive science, neurophysiology, artificial intelligence, and other fields. Studies show that although more than 90% of human environmental perception comes from vision, the great majority of emotional perception comes from speech. How to build an emotion system for human-like agents from the speech domain has, to date, not been addressed in any published research.
Summary of the Invention
The purpose of the present invention is to propose a method for expressing human speech emotion, using machine learning as the principal means, and on this basis to apply deep learning and convolutional network algorithms to realize speech emotion transfer at the system level. This not only provides a reference method for speech recognition and emotion analysis, but can also be widely applied to future human-like agents.
To achieve the above object, the technical solution proposed by the present invention is a speech emotion transfer method, which specifically includes the following steps:
Step 1. Prepare a speech database and, through standard sampling, generate a speech emotion data set S = {s_1, s_2, ..., s_n};
Step 2. Manually label the speech database of step 1, annotating the emotion of each speech file, E = {e_1, e_2, ..., e_n};
Step 3. Use a speech feature parameter model to extract audio features from each audio file s_i in the speech database, obtaining a basic speech feature set F_i = {f_1i, f_2i, ..., f_ni};
Step 4. Use a machine learning tool to train on each speech feature set from step 3 together with the speech emotion labels from step 2, obtaining a feature model for each class of speech emotion and constructing an emotion model library E_b;
Step 5. Through a multimedia terminal, select the target emotion Target_e of the speech emotion transfer;
Step 6. Input a speech signal s_t from the multimedia terminal;
Step 7. Feed the current input s_t into the speech emotion feature extraction module to obtain the feature set of the current speech signal, F_t = {f_1t, f_2t, ..., f_nt};
Step 8. Using the same machine learning algorithm as in step 4, perform emotion classification on the speech feature set F_t of s_t obtained in step 7 against the emotion model library E_b obtained in step 4, obtaining the current emotion category s_e of s_t;
Step 9. Judge whether the s_e obtained in step 8 is consistent with the Target_e input in step 5. If s_e = Target_e, output the original input speech signal directly as the target emotional speech; if s_e ≠ Target_e, invoke step 10 to perform feature-level emotion transfer;
Step 10. Transfer the main features of the current speech emotion toward the main features of the corresponding speech emotion in the emotion model library;
Step 11. Use a speech synthesis algorithm to process the transferred speech features obtained in step 10 and synthesize the final target emotional speech output. (An illustrative end-to-end sketch of these steps is given below.)
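As an illustrative sketch only, and not part of the claimed method, the overall flow of steps 6 to 11 could be organized in Python as follows. All helper functions (extract_features, classify_emotion, transfer_features, synthesize) are hypothetical placeholders standing in for the modules described above.

```python
# Illustrative orchestration of steps 6-11; helper functions are placeholders.
def speech_emotion_transfer(s_t, target_e, emotion_model_lib,
                            extract_features, classify_emotion,
                            transfer_features, synthesize):
    """Transfer the emotion of the input speech s_t toward target_e."""
    # Step 7: extract the feature set F_t of the current input
    f_t = extract_features(s_t)

    # Step 8: classify the current emotion s_e against the model library E_b
    s_e = classify_emotion(f_t, emotion_model_lib)

    # Step 9: if the emotion already matches the target, pass the input through
    if s_e == target_e:
        return s_t

    # Step 10: migrate the main emotion features toward the target model
    f_migrated = transfer_features(f_t, emotion_model_lib[target_e])

    # Step 11: synthesize the output speech from the migrated features
    return synthesize(s_t, f_migrated)
```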
Further, in the above step 1, the speech data are sampled at 44.1 kHz, each recording lasts between 3 and 10 s, and the files are saved in wav format.
In step 1, to obtain better performance, the natural-attribute dimensions of the sampled data should not be overly concentrated; the samples should as far as possible be collected from people of different ages, genders, occupations, and so on.
In step 6, the input may be entered in real time, or submitted with a click after the recording is completed.
The present invention has the following beneficial effects:
1. The present invention is the first to propose the concept of speech emotion transfer, and can provide an emotion construction method for future virtual reality.
2. The method based on emotion classification and feature transfer proposed by the present invention can change the emotion of speech without losing the vocal characteristics of the original speaker.
Brief Description of the Drawings
Fig. 1 is a schematic diagram of the speech emotion transfer method provided by the present invention.
Fig. 2 is a spectral feature diagram of an original input speech sample of the present invention.
Fig. 3 is a spectral feature diagram of the original speech sample after emotion transformation in the present invention.
Detailed Description
The present invention is now described in further detail with reference to the accompanying drawings.
The present invention provides a method for transferring the emotion expressed in a user's speech, based on a speech emotion database. As shown in Fig. 1, the modules or functions involved in the method include:
Basic speech library: stores raw speech data covering different ages, genders, and scenes.
Label library: emotion annotations of the basic speech library, such as calm, happy, angry, furious, sad, and so on.
Speech input device: such as a microphone, enabling real-time speech input by the user.
Speech emotion feature extraction: general acoustic features are obtained with a sound feature analysis tool, and the required feature set is selected as the speech emotion features according to the characteristics of human speech signals and of emotional expression.
Machine learning: a machine learning algorithm is applied, against the speech emotion label library, to build a training model over the speech emotion feature sets.
Emotion model library: the speech emotion model library obtained from the speech library data through machine learning, classified along dimensions such as gender, age, and emotion.
Emotion selection: before inputting the speech signal, the user selects the emotion mode into which the current speech is to be converted in real time.
Emotion category judgment: judges whether the emotion of the current user input is consistent with the selected emotion. If they are consistent, the target emotional speech is output directly; if not, the emotion transfer module is invoked.
Emotion transfer: when the emotion of the user's input speech is inconsistent with the selected emotion, the feature distance between the input speech emotion feature set and the selected emotion feature set is compared, and the feature-space representation of the input speech emotion is adjusted to realize emotion transfer. The adjusted emotional speech is then output as the target emotional speech. (A sketch of such a feature-distance comparison and adjustment is given below.)
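A minimal sketch of the judgment and adjustment just described, assuming each emotion model can be summarized by a mean feature vector; the vector representation, the Euclidean distance, and the interpolation weight alpha are illustrative assumptions and are not fixed by the method above.

```python
import numpy as np

def emotion_distance(feature_vec, emotion_mean):
    """Euclidean distance between an input feature vector and an emotion
    model summarized by its mean feature vector."""
    return float(np.linalg.norm(np.asarray(feature_vec, dtype=float)
                                - np.asarray(emotion_mean, dtype=float)))

def adjust_toward_emotion(feature_vec, target_mean, alpha=0.5):
    """Shift the input feature vector toward the target emotion's mean by a
    fraction alpha; alpha = 0.5 matches the averaging used in the embodiment,
    e.g. result = (s + Target) / 2."""
    f = np.asarray(feature_vec, dtype=float)
    t = np.asarray(target_mean, dtype=float)
    return (1.0 - alpha) * f + alpha * t
```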
An embodiment is now provided to illustrate the speech emotion transfer process, which specifically includes the following steps:
Step 1. The method requires a speech database to be prepared. Preferably, the speech data are sampled at the standard rate of 44.1 kHz; a sentence spoken by a test person is recorded for 3 to 10 s and saved in wav format, yielding a speech emotion data set S = {s_1, s_2, ..., s_n}. To obtain better performance, the sampled data should, as far as possible, not be overly concentrated in natural-attribute dimensions such as age, gender, and occupation.
Step 2. The speech database prepared in step 1 is labeled manually, annotating the emotion of each speech file, E = {e_1, e_2, ..., e_n}, such as "worried", "surprised", "angry", "disappointed", "sad", and so on.
Step 3. A speech feature parameter model is used to extract audio features from each audio file s_i in the speech database, obtaining a basic speech feature set F_i = {f_1i, f_2i, ..., f_ni} (Fig. 2 shows a schematic diagram of the spectral features of an original speech sample), including features such as the envelope (env), speech rate (speed), zero-crossing rate (zcr), energy (eng), energy entropy (eoe), spectral centroid (spec_cent), spectral spread (spec_spr), Mel-frequency cepstral coefficients (mfccs), and chroma vector (chroma).
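As an illustrative sketch only, several of the features named in step 3 could be computed with the open-source librosa library as follows. The exact feature definitions and parameters used by the invention are not specified, so the frame settings, the aggregation by mean, and the choice to omit the envelope, speech rate, and energy entropy are assumptions made here for brevity.

```python
import numpy as np
import librosa

def extract_basic_features(wav_path, sr=44100):
    """Compute a simple frame-level feature set and aggregate it by mean:
    zero-crossing rate, energy (RMS), spectral centroid, spectral spread,
    MFCCs, and chroma, roughly matching the features listed in step 3."""
    y, sr = librosa.load(wav_path, sr=sr)

    zcr = librosa.feature.zero_crossing_rate(y)              # zero-crossing rate
    eng = librosa.feature.rms(y=y)                           # frame energy (RMS)
    cent = librosa.feature.spectral_centroid(y=y, sr=sr)     # spectral centroid
    spread = librosa.feature.spectral_bandwidth(y=y, sr=sr)  # spectral spread
    mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)      # MFCCs
    chroma = librosa.feature.chroma_stft(y=y, sr=sr)         # chroma vector

    # Global statistics (mean over frames), i.e. one fixed-length vector per file
    return np.concatenate([
        zcr.mean(axis=1), eng.mean(axis=1), cent.mean(axis=1),
        spread.mean(axis=1), mfccs.mean(axis=1), chroma.mean(axis=1),
    ])
```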
Step 4. A machine learning tool (such as Libsvm) is used to train on the feature set of each speech file obtained in step 3 together with the speech emotion labels obtained in step 2, obtaining a feature model for each class of speech emotion and constructing the emotion model library E_b.
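A minimal training sketch, assuming the features have been collected into a matrix X (one row per speech file) with emotion labels y. The text names Libsvm; scikit-learn's SVC is a wrapper around libsvm, so it is used here for brevity, and the RBF kernel and feature scaling are assumptions.

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def train_emotion_models(X, y):
    """Train a multi-class SVM emotion classifier (scikit-learn's SVC wraps
    libsvm). X: feature matrix, one row per file; y: emotion labels."""
    clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", probability=True))
    clf.fit(X, y)
    return clf  # serves as the emotion model library E_b in this sketch
```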
Step 5. Through a multimedia terminal, the target emotion Target_e of the speech emotion transfer is selected, such as "sad".
Step 6. The speech signal s_t is input from the multimedia terminal, either in real time or submitted with a click after the recording is completed.
Step 7. The current input s_t is fed into the speech emotion feature extraction module to obtain the feature set of the current speech signal, F_t = {f_1t, f_2t, ..., f_nt}.
Step 8. Using the same machine learning algorithm as in step 4, emotion classification is performed on the speech feature set F_t of s_t obtained in step 7 against the emotion model library E_b obtained in step 4, yielding the current emotion category s_e of s_t.
Step 9. It is judged whether the s_e obtained in step 8 is consistent with the Target_e input in step 5. If s_e = Target_e, the original input speech signal is output directly as the target emotional speech; if s_e ≠ Target_e, step 10 is invoked to perform feature-level emotion transfer.
Step 10. The main features of the current speech emotion are migrated toward the main features of the corresponding speech emotion in the emotion model library (Fig. 3 shows the spectral features after migration), for example envelope migration result_env = (s_env + Target_env)/2 and speech-rate adjustment result_speed = (s_speed + Target_speed)/2.
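A minimal sketch of this averaging-based migration, assuming the current features and the target emotion model are held as dictionaries keyed by feature name (a representation chosen for illustration and not fixed by the embodiment):

```python
def migrate_features(current, target, keys=("env", "speed")):
    """Move the selected emotion features halfway toward the target model,
    e.g. result_env = (s_env + Target_env) / 2."""
    result = dict(current)
    for k in keys:
        result[k] = (current[k] + target[k]) / 2.0
    return result

# Example usage (illustrative values only):
# migrated = migrate_features({"env": 0.8, "speed": 4.2},
#                             {"env": 0.4, "speed": 3.0})
```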
Step 11. A speech synthesis algorithm (pitch-synchronous overlap-add, PSOLA) is used to process the transferred speech features obtained in step 10 and to synthesize the final target emotional speech output.
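PSOLA modifies pitch and duration by pitch-synchronously overlapping and adding analysis windows; the sketch below is only a rough, non-PSOLA stand-in that applies a migrated duration (speech-rate) and pitch change using librosa's time-stretch and pitch-shift effects, so that the migrated prosodic features can be heard. The choice of these effects and of the output path handling are assumptions, not the synthesis algorithm named in step 11.

```python
import librosa
import soundfile as sf

def resynthesize_approx(wav_path, out_path, rate=1.0, pitch_steps=0.0, sr=44100):
    """Rough stand-in for PSOLA-style prosody modification: stretch the
    duration by `rate` (rate > 1 speeds the speech up) and shift the pitch
    by `pitch_steps` semitones, then write the result to out_path."""
    y, sr = librosa.load(wav_path, sr=sr)
    y = librosa.effects.time_stretch(y, rate=rate)
    y = librosa.effects.pitch_shift(y, sr=sr, n_steps=pitch_steps)
    sf.write(out_path, y, sr)
```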
The above is only a preferred embodiment of the present invention and is not intended to limit the present invention. Although the present invention has been described in detail with reference to the foregoing embodiment, those skilled in the art may still improve the technical solutions described in the foregoing embodiment or make equivalent replacements for some of the technical features. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.