CN109599094A

Info

Publication number: CN109599094A
Application number: CN201811538693.4A
Authority: CN
Inventors: 段玉聪; 李亚婷; 宋正阳
Original assignee: Hainan University
Current assignee: Hainan University
Priority date: 2018-12-17
Filing date: 2018-12-17
Publication date: 2019-04-09

Abstract

The invention discloses a method for voice beautification and emotion modification. According to the user's needs, it applies sound processing and emotion modification to speech, changing the timbre, pitch, and original emotional content of the voice, and it can also remove noise so that the speech heard is clearer and easier to understand. The method satisfies the user's preferences about the sound they want to hear, and it also lets the user feel more comfortable and relaxed by adjusting the emotional tone of the other party's speech.

Description

Method for voice beautification and emotion modification

Technical field

The invention belongs to the fields of emotion, speech recognition, and sound processing. It mainly processes the speech a user hears into speech that carries the emotion the user wants and the type of voice the user wants to hear. Accents and unclear passages are also denoised so that the user hears the speech more clearly, which meets user needs and improves user satisfaction.

Background technique

With the rapid development of artificial-intelligence speech recognition, companies such as Google and Xunfei (iFlytek) have already made great achievements in the field: speech can be recognized, transcribed into text, and then translated into other languages. At present, household appliances, mobile phones, and similar devices can be controlled by voice. For example, an air conditioner can be switched on and off by voice, and a user can tell Siri which contact to call and the number is dialed automatically. These are all step-by-step developments of speech recognition.

Everyone has times when things go badly, and harsh criticism from others at such a moment adds another layer of pressure. As higher organisms with emotions, we may wish, in certain situations, to hear speech with a particular emotional expression, or voices with a different timbre or pitch. Combining human emotion with speech recognition and processing can therefore be a real pleasure for the user.

In everyday calls we often run into difficulties or obstacles caused by dialects, non-standard Mandarin, or background noise. To address these problems, voice beautification can be used to improve the result.

Summary of the invention

Technical problem: the invention discloses a method for voice beautification and emotion modification. According to the user's needs, it applies sound processing and emotion modification to speech, changing the timbre, pitch, and original emotional content of the voice, and it can also remove noise so that the speech heard is clearer and easier to understand.

Technical solution: to solve the problems described in the background above, the invention proposes a method for voice beautification and emotion modification. First, voice data is collected and each word is identified by speech recognition, with particular attention to recognizing accented speech. Next, key words are marked according to the relative intensity between related words and the intervals between words. The emotional tone is then established from the intonation of each word, the strength of the voice, and the overall manner of speaking of the sentence. Based on the data accumulated above, the sound is processed: the emotion of the original speech is changed through intonation, voice intensity, intervals, and so on, and the voice can be beautified using voice information collected from particular people so that it sounds like a certain celebrity. Finally, white noise can be removed from the final output, or the intelligibility of the voice can be enhanced. The invention satisfies the user's preferences about the sound they want to hear, and it also lets the user feel more comfortable and relaxed by adjusting the emotional tone of the other party's speech.
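The idea of changing the emotion of speech through intonation, intensity, and intervals can be illustrated with a minimal sketch. The patent does not specify a data layout or any numeric values, so the `WordProsody` and `EmotionPreset` types, the `CALM` preset, and its scale factors below are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class WordProsody:
    """Hypothetical per-word prosody record (not defined in the patent)."""
    text: str
    pitch_hz: float        # average pitch of the word
    intensity: float       # relative loudness in the range 0..1
    pause_after_s: float   # silence following the word, in seconds


@dataclass(frozen=True)
class EmotionPreset:
    """Hypothetical scale factors that push speech toward a target emotion."""
    pitch_scale: float
    intensity_scale: float
    pause_scale: float


# Illustrative values only: lower pitch, softer voice, longer pauses.
CALM = EmotionPreset(pitch_scale=0.9, intensity_scale=0.5, pause_scale=1.5)


def apply_emotion(words: List[WordProsody], preset: EmotionPreset) -> List[WordProsody]:
    """Return new prosody records with pitch, intensity, and pauses rescaled."""
    return [
        replace(
            w,
            pitch_hz=w.pitch_hz * preset.pitch_scale,
            intensity=min(1.0, w.intensity * preset.intensity_scale),
            pause_after_s=w.pause_after_s * preset.pause_scale,
        )
        for w in words
    ]
```

In a real system these adjusted values would drive a speech synthesizer or voice conversion model; the sketch only shows how the three factors named in the text could be represented as editable parameters.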

Architecture

(1) Voice data is collected through speech recognition. Accented voice data undergoes fuzzy recognition (if it contains domestic or foreign dialects, a dialect speech database must be queried during the process to determine more precisely the semantics and word meanings of what the user said). The input sound is then converted into feature quantities for further processing.
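Step (1) ends by converting the input sound into feature quantities. The patent does not say which features are used, so the sketch below assumes two common ones, short-time RMS energy and zero-crossing rate, computed per frame. The function name `frame_features` and its parameters are illustrative.

```python
import math
from typing import List, Tuple


def frame_features(samples: List[float], frame_len: int, hop: int) -> List[Tuple[float, float]]:
    """Return (rms_energy, zero_crossing_rate) for each full frame.

    frame_len must be at least 2; trailing samples that do not fill a
    whole frame are ignored.
    """
    features = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        # Root-mean-square energy: a simple proxy for loudness.
        rms = math.sqrt(sum(x * x for x in frame) / frame_len)
        # Fraction of adjacent sample pairs whose sign differs.
        crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
        zcr = crossings / (frame_len - 1)
        features.append((rms, zcr))
    return features
```

Energy per frame is one plausible input for the intensity comparisons described in step (2); a production recognizer would typically use richer spectral features.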

(2) Key words are marked according to the relative intensity between related words and the intervals between words. The emotional tone is then established from the intonation of each word, the strength of the voice, and the overall manner of speaking of the sentence. The relative intensity between key words makes it possible to identify the approximate meaning of an ambiguous sentence. The intervals between words prevent characters from being joined into words with a different meaning: they help determine which characters should form a word and what that word means. For every sentence, every word, and even every character, differences in intonation and intensity can express different emotions. On this basis we can determine what emotion fills the speech that the user hears or speaks, and we can also change the speech according to these factors so that it carries the emotion the user requires. Specifically, according to the conversion rules held in a conversion-rule storage unit, which stores conversion rules between pronunciations and phonemes or between pronunciations and phoneme strings, the pronunciation of each recognition word held in a recognition-word storage unit, which stores the pronunciations of recognition words, is converted into a phoneme string. Standard patterns are then extracted and finally concatenated. This approach also works well when a recognition word has many pronunciations.
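The key-word marking in step (2) can be sketched as follows. The patent gives no thresholds or formula, so this example assumes a word is emphasized when its intensity is well above the sentence average or when it is preceded by a noticeable pause. The `Word` type, `mark_emphasis`, and both default thresholds are hypothetical.

```python
from typing import List, NamedTuple


class Word(NamedTuple):
    """Hypothetical recognized word with simple prosodic measurements."""
    text: str
    intensity: float       # e.g. RMS energy over the word
    pause_before_s: float  # silence preceding the word, in seconds


def mark_emphasis(words: List[Word],
                  intensity_ratio: float = 1.3,
                  min_pause_s: float = 0.25) -> List[str]:
    """Return the words judged to be emphasized.

    A word is marked if its intensity relative to the sentence mean reaches
    intensity_ratio, or if the pause before it reaches min_pause_s.
    """
    if not words:
        return []
    mean_intensity = sum(w.intensity for w in words) / len(words)
    marked = []
    for w in words:
        loud = mean_intensity > 0 and w.intensity / mean_intensity >= intensity_ratio
        paused = w.pause_before_s >= min_pause_s
        if loud or paused:
            marked.append(w.text)
    return marked
```

Comparing each word against the sentence mean, rather than against a fixed level, keeps the marking independent of how loudly the speaker talks overall, which matches the text's emphasis on relative intensity.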

(3) Based on the data accumulated above, the sound is processed: the emotion of the original speech is changed through intonation, voice intensity, intervals, and so on, and the voice can be beautified using voice information collected from particular people so that it sounds like a certain celebrity. In more detail, data is collected for certain distinctive voices, such as the pitch, audio frequency, timbre, and intonation of a particular host's voice. The segment of speech that the user wishes to change can then be adjusted according to this data by modifying its various values, so as to meet the user's needs as far as possible. That is, the available voice data is stored in a database, with its features converted into parameters; when the user requests a change, the listening effect of the sound can be changed by changing these parameters. A voice conversion model is needed, and an emotion conversion model is necessary as well. Training data is obtained first (the input and output data can be aligned in duration using the dynamic time warping algorithm) and then preprocessed to extract the factors that influence emotion (including the pitch of words, the intervals between utterances, and so on). The model is then trained starting from the initialized parameters of the voice conversion model. The model can be a neural network composed of encoders, where each encoder represents the eigenspace of the voice information of one class of similar source speakers, and the spectral features of the speech signal need to be transformed.

(In the formula, which is not reproduced in this text, one symbol denotes the output of the n-th eigenspace model of input encoding layer i, another denotes the corresponding network parameters of the n-th eigenspace model of input layer i, and δ denotes the activation function.)
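Step (3) mentions aligning the input and output training data in duration with dynamic time warping. The following is a minimal sketch of the classic algorithm on one-dimensional feature sequences, assuming absolute difference as the local distance and non-empty inputs; the patent does not state which features or distance are used, and the name `dtw_align` is illustrative.

```python
from typing import List, Tuple


def dtw_align(a: List[float], b: List[float]) -> Tuple[float, List[Tuple[int, int]]]:
    """Align two non-empty sequences with dynamic time warping.

    Returns the total alignment cost and the warping path as a list of
    (index_in_a, index_in_b) pairs.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = minimal cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # a advances alone
                                 cost[i][j - 1],      # b advances alone
                                 cost[i - 1][j - 1])  # both advance
    # Backtrack from the end, always moving to the cheapest predecessor.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((cost[i - 1][j - 1], i - 1, j - 1),
                      (cost[i - 1][j], i - 1, j),
                      (cost[i][j - 1], i, j - 1))
    path.reverse()
    return cost[n][m], path
```

The returned path says which frame of the source utterance corresponds to which frame of the target, so paired frames can be fed to the conversion model even when the two recordings differ in length.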

(4) Noise of various kinds that degrades the result (white noise or other colored noise) is removed from the final output, or the intelligibility of the voice is enhanced. Noise is removed by signal processing; for example, it can be cancelled by acoustically outputting a sound whose phase is opposite to that of the noise leaking into the space inside a movable body. As a result, the speech is clearer after beautification and emotion modification, and more comfortable for the user to listen to.
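The opposite-phase cancellation described in step (4) can be shown with a small numerical sketch. It assumes the noise waveform is known exactly, which is an idealization; a real system would have to estimate it, for example from a reference microphone. The helper names `anti_phase`, `mix`, and `rms` are illustrative.

```python
import math
from typing import List


def anti_phase(noise: List[float]) -> List[float]:
    """Return the noise with inverted sign, i.e. shifted by 180 degrees."""
    return [-x for x in noise]


def mix(*signals: List[float]) -> List[float]:
    """Sum equal-length signals sample by sample, as sounds add in air."""
    return [sum(values) for values in zip(*signals)]


def rms(signal: List[float]) -> float:
    """Root-mean-square level of a non-empty signal."""
    return math.sqrt(sum(x * x for x in signal) / len(signal))


# Example: a 200 Hz "speech" tone contaminated by 50 Hz hum, sampled at 8 kHz.
rate = 8000
speech = [math.sin(2 * math.pi * 200 * t / rate) for t in range(400)]
noise = [0.5 * math.sin(2 * math.pi * 50 * t / rate) for t in range(400)]
heard = mix(speech, noise)
cleaned = mix(heard, anti_phase(noise))  # hum and anti-hum cancel
```

After mixing in the anti-phase signal, `cleaned` matches `speech` up to floating-point rounding. In practice the cancellation is only as good as the noise estimate, so residual noise is expected.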

Beneficial effect

(1) It helps users regulate their own emotions and creates a comfortable listening environment;

(2) It gives users a new form of entertainment, letting them change their own voice or other people's voices, and the emotion contained in them, as they wish;

(3) It can, to some extent, improve communication between the two parties in a call.

Description of the drawings

Fig. 1 is the implementation flowchart of the method for voice beautification and emotion modification.

Specific embodiment

(1) Voice data is collected through speech recognition. Accented voice data undergoes fuzzy recognition (if it contains domestic or foreign dialects, a dialect speech database must be queried during the process to determine more precisely the semantics and word meanings of what the user said). The input sound is then converted into feature quantities for further processing.

(In the formula, which is not reproduced in this text, one symbol denotes the output of the n-th eigenspace model of input encoding layer i, another denotes the corresponding network parameters of the n-th eigenspace model of input layer i, and δ denotes the activation function.)

Claims

Translated from Chinese

1. The invention discloses a method for voice beautification and emotion modification which, according to the user's needs, applies sound processing and emotion modification to speech, changing the timbre, pitch, and original emotional content of the voice, and which can also remove noise so that the speech heard is clearer and easier to understand;

It satisfies the user's preferences about the sound they want to hear, and it also lets the user feel more comfortable and relaxed by adjusting the emotional tone of the other party's speech;

(1) Voice data is collected through speech recognition, and accented voice data undergoes fuzzy recognition (if it contains domestic or foreign dialects, a dialect speech database must be queried during the process to determine more precisely the semantics and word meanings of what the user said); the input sound is converted into feature quantities for further processing;

(2) Key words are marked according to the relative intensity between related words and the intervals between words, and the emotional tone is then established from the intonation of each word, the strength of the voice, and the overall manner of speaking of the sentence;

The approximate meaning of an ambiguous sentence can be identified from the relative intensity between key words, and the intervals between words prevent characters from being joined into words with a different meaning, helping to determine which characters should form a word and what that word means;

For every sentence, every word, and even every character, differences in intonation and intensity can express different emotions; on this basis we can determine what emotion fills the speech that the user hears or speaks, and we can also change the speech according to these factors so that it carries the emotion the user requires;

Specifically, according to the conversion rules held in a conversion-rule storage unit, which stores conversion rules between pronunciations and phonemes or between pronunciations and phoneme strings, the pronunciation of each recognition word held in a recognition-word storage unit, which stores the pronunciations of recognition words, is converted into a phoneme string;

Standard patterns are then extracted and finally concatenated;

This approach also works well when a recognition word has many pronunciations;

(3) Based on the data accumulated above, the sound is processed: the emotion of the original speech is changed through intonation, voice intensity, intervals, and so on, and the voice can be beautified using voice information collected from particular people so that it sounds like a certain celebrity; in more detail, data is collected for certain distinctive voices, such as the pitch, audio frequency, timbre, and intonation of a particular host's voice, and the segment of speech that the user wishes to change can be adjusted according to this data by modifying its various values, so as to meet the user's needs as far as possible;

That is, the available voice data is stored in a database, with its features converted into parameters; when the user requests a change, the listening effect of the sound can be changed by changing these parameters; a voice conversion model is needed, and an emotion conversion model is necessary as well;

Training data is obtained first (the input and output data can be aligned in duration using the dynamic time warping algorithm) and then preprocessed to extract the factors that influence emotion (including the pitch of words, the intervals between utterances, and so on); the model is then trained starting from the initialized parameters of the voice conversion model; the model can be a neural network composed of encoders, where each encoder represents the eigenspace of the voice information of one class of similar source speakers, and the spectral features of the speech signal need to be transformed

(in the formula, which is not reproduced in this text, one symbol denotes the output of the n-th eigenspace model of input encoding layer i, another denotes the corresponding network parameters of the n-th eigenspace model of input layer i, and δ denotes the activation function);

(4) Noise of various kinds that degrades the result (white noise or other colored noise) is removed from the final output, or the intelligibility of the voice is enhanced;

Noise is removed by signal processing; it can be cancelled by acoustically outputting a sound whose phase is opposite to that of the noise leaking into the space inside a movable body;

As a result, the speech is clearer after beautification and emotion modification, and more comfortable for the user to listen to.