Summary of the invention
Embodiments of the invention provide an emotion recognition network model, an emotion recognition method, and an electronic device, at least in part to solve the technical problems in the related art that emotion recognition models are complex, cumbersome to train, and limited to a single application scenario.
To achieve the above objective, an embodiment of the invention provides a network model for emotion recognition. The network model includes a speech emotion recognition module and a text emotion recognition module. The speech emotion recognition module is configured to perform speech emotion feature extraction on a speech input and output a speech emotion feature vector; the text emotion recognition module is configured to perform text emotion feature extraction on a text input and output a text emotion feature vector. The network model performs emotion recognition according to the speech emotion feature vector and/or the text emotion feature vector.
According to the type of a target input, the network model can invoke the speech emotion recognition module and/or the text emotion recognition module to perform emotion recognition, where the type of the target input includes: speech input, text input, or speech together with its corresponding text.
Further, the speech emotion recognition module includes a speech feature extraction layer and a first multi-layer bidirectional long short-term memory (Bi-LSTM) network layer; the text emotion recognition module includes a preprocessing layer, a second multi-layer Bi-LSTM network layer, and an attention layer.
Further, the network model further includes:
an input layer, serving as the common input of the speech emotion recognition module and the text emotion recognition module;
a fusion layer, configured to fuse the speech emotion feature vector and the text emotion feature vector into a fused emotion feature vector;
a classification network layer, configured to output the emotion recognition result of the target input according to the fused emotion feature vector.
Further, the fusion layer fuses the speech emotion feature vector and the text emotion feature vector by element-wise addition or by concatenation.
Further, the speech emotion recognition module and the text emotion recognition module are arranged in parallel.
Further, the network parameters of the speech emotion recognition module and the text emotion recognition module are obtained through a single training pass.
Further, the network parameters of the speech emotion recognition module and the text emotion recognition module are obtained through a single training pass, specifically:
training set data is input into the emotion recognition model to obtain an emotion prediction result, where the training set data includes: speech, the text corresponding to the speech, and an emotion label;
the emotion prediction result is compared with the emotion label; when the emotion prediction result does not match the emotion label, the values of the network parameters of the speech emotion recognition module and the text emotion recognition module are adjusted separately through backpropagation using a gradient descent algorithm, and the training of the network parameters of the two modules is completed through multiple iterations.
According to one embodiment of the invention, an emotion recognition method is provided, including:
obtaining a target input, where the type of the target input includes one of the following: speech input, text input, or speech together with its corresponding text;
according to the type of the target input, invoking the speech emotion recognition module and/or the text emotion recognition module of the network model as claimed in any one of claims 1 to 7 to perform emotion recognition;
outputting the emotion recognition result of the target input.
Further, the invoking, according to the type of the target input, the speech emotion recognition module and/or the text emotion recognition module of the network model as claimed in any one of claims 1 to 7 to perform emotion recognition includes:
when the target input is a speech input, invoking the speech emotion recognition module to perform emotion recognition;
when the target input is a text input, invoking the text emotion recognition module to perform emotion recognition;
when the target input is speech together with its corresponding text, simultaneously invoking the speech emotion recognition module and the text emotion recognition module to perform emotion recognition.
According to yet another embodiment of the invention, an electronic device is further provided, including a memory and a processor, where a computer program is stored in the memory and the processor is arranged to run the computer program to perform any of the methods described above.
By means of the speech emotion recognition module and the text emotion recognition module, the emotion recognition network model provided by the invention can invoke, according to the type of a target input, the speech emotion recognition module and/or the text emotion recognition module to perform emotion recognition, which solves the technical problems in the related art that emotion recognition models are complex, cumbersome to train, and limited to a single application scenario.
Specific embodiment
The invention is described below on the basis of embodiments, but the invention is not limited to these embodiments. Certain specific details are described at length in the following detailed description of the invention; to avoid obscuring the essence of the invention, well-known methods, procedures, flows, and elements are not described in detail.
In addition, those skilled in the art should understand that the drawings provided herein are for the purpose of illustration and are not necessarily drawn to scale.
Unless the context clearly requires otherwise, throughout the specification and claims, words such as "include" and "comprise" should be construed in an inclusive sense rather than an exclusive or exhaustive sense, that is, in the sense of "including, but not limited to".
In the description of the invention, it should be understood that the terms "first", "second", and the like are used for descriptive purposes only and cannot be interpreted as indicating or implying relative importance. In addition, in the description of the invention, unless otherwise indicated, "multiple" means two or more.
Referring to Fig. 1, Fig. 1 is a schematic diagram of a network model 20 for emotion recognition provided by an embodiment of the invention. The network model 20 includes:
a speech emotion recognition module 202 and a text emotion recognition module 204, where the speech emotion recognition module 202 is configured to perform speech emotion feature extraction on a speech input and output a speech emotion feature vector V1, and the text emotion recognition module 204 is configured to perform text emotion feature extraction on a text input and output a text emotion feature vector V2. The network model 20 performs emotion recognition according to the speech emotion feature vector V1 and/or the text emotion feature vector V2. According to the type of a target input, the network model 20 can invoke the speech emotion recognition module 202 and/or the text emotion recognition module 204 to perform emotion recognition, where the type of the target input includes: speech input, text input, or speech together with its corresponding text.
It should be noted that, in the prior art, to solve an emotion recognition problem in a given scenario, a network model of emotion recognition specific to that scenario is generally built. The network structure and the emotion recognition input of such a network model are relatively fixed; if the emotion recognition scenario changes and the input of emotion recognition changes with it, the network model is no longer applicable, and a new emotion recognition network model must be rebuilt, which brings additional cost. For example, in text scenarios such as SMS chat, e-mail correspondence, or plain WeChat text chat, the target object of emotion recognition is text, so a network model that recognizes emotion from text input must be built. In speech scenarios such as telephone voice chat, WeChat voice chat, or conversation recordings, the target object of emotion recognition is speech, so a network model that recognizes emotion from speech input must be built. In addition, in scenarios where speech and its corresponding text are present at the same time, for example voice chat platforms with a built-in speech recognition function that can output both speech and the corresponding text, the target object of emotion recognition is speech together with its corresponding text, so a network model that recognizes emotion from the input of speech and corresponding text must be built. To cover all three kinds of scenarios at once, multiple emotion recognition network models are commonly used in the art, which requires building and training multiple network models separately while collecting different sets of training data; this is time-consuming, laborious, and very costly.
By means of the speech emotion recognition module and the text emotion recognition module, the emotion recognition network model provided by the embodiment of the invention can invoke, according to the type of a target input, the speech emotion recognition module and/or the text emotion recognition module to perform emotion recognition, which solves the technical problem in the prior art that emotion recognition models are applicable to only a single scenario; at the same time, the training process of the network model is simple, and the collection of training set data is also relatively easy.
Specifically, the speech emotion recognition module 202 includes a speech feature extraction layer and a first multi-layer bidirectional long short-term memory network layer (Bi-LSTM); the text emotion recognition module 204 includes a preprocessing layer, a second multi-layer Bi-LSTM network layer, and an attention layer (Attention). In the text emotion recognition module 204, because the expression of emotion is mostly concentrated in certain key words or phrases, the module needs an attention mechanism (Attention Model) to focus on finding the key words or phrases that express emotion, which helps improve the accuracy of text emotion recognition. In the speech emotion recognition module 202, by contrast, the expression of emotion is mostly related to changes of tone and intonation over time, so the module only needs the multi-layer Bi-LSTM structure to learn the audio information before and after each point in time, and no attention mechanism is needed. The speech feature extraction layer may adopt various approaches, including linear prediction analysis (Linear Prediction Coefficients, LPC), perceptual linear predictive coefficients (Perceptual Linear Predictive, PLP), linear predictive cepstral coefficients (Linear Predictive Cepstral Coefficient, LPCC), and Mel-frequency cepstrum coefficients (Mel Frequency Cepstrum Coefficient, MFCC); MFCC features are used in the embodiment of the invention. MFCC extraction is prior art and not the emphasis of the invention, so details are not described herein. Likewise, the internal structures of the multi-layer Bi-LSTM network layer and the attention layer (Attention) are prior art and not the emphasis of the invention, and are not repeated herein. In particular, the focus of the invention lies in the design of the overall structure of the emotion recognition network model 20, not in changes to the components within the network model 20 themselves; the specification therefore elaborates only on the composition and design principle of the overall structure of the network model 20.
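For illustration only, the following is a minimal PyTorch sketch of the two branches described above. The class names, layer sizes, sample rate, and the use of torchaudio's MFCC transform are assumptions made for the example, not part of the original disclosure.

```python
import torch
import torch.nn as nn
import torchaudio


class SpeechEmotionModule(nn.Module):
    """Speech branch (module 202): MFCC feature extraction + multi-layer Bi-LSTM."""

    def __init__(self, n_mfcc=40, hidden=128, layers=2, out_dim=128):
        super().__init__()
        self.mfcc = torchaudio.transforms.MFCC(sample_rate=16000, n_mfcc=n_mfcc)
        self.bilstm = nn.LSTM(n_mfcc, hidden, num_layers=layers,
                              bidirectional=True, batch_first=True)
        self.proj = nn.Linear(2 * hidden, out_dim)

    def forward(self, waveform):            # waveform: (batch, samples)
        feats = self.mfcc(waveform)         # (batch, n_mfcc, frames)
        feats = feats.transpose(1, 2)       # (batch, frames, n_mfcc)
        out, _ = self.bilstm(feats)         # learn tone/intonation over time
        return self.proj(out[:, -1, :])     # speech emotion feature vector V1


class TextEmotionModule(nn.Module):
    """Text branch (module 204): preprocessing (embedding) + Bi-LSTM + attention."""

    def __init__(self, vocab_size=30000, emb=128, hidden=128, layers=2, out_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb)
        self.bilstm = nn.LSTM(emb, hidden, num_layers=layers,
                              bidirectional=True, batch_first=True)
        self.attn = nn.Linear(2 * hidden, 1)
        self.proj = nn.Linear(2 * hidden, out_dim)

    def forward(self, token_ids):           # token_ids: (batch, seq_len)
        h, _ = self.bilstm(self.embed(token_ids))   # (batch, seq_len, 2*hidden)
        w = torch.softmax(self.attn(h), dim=1)      # attend to emotion-bearing words
        return self.proj((w * h).sum(dim=1))        # text emotion feature vector V2
```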
Further, the network model 20 further includes: an input layer 206, serving as the common input of the speech emotion recognition module 202 and the text emotion recognition module 204; a fusion layer 208, configured to fuse the speech emotion feature vector V1 and the text emotion feature vector V2 to obtain a fused emotion feature vector V3; and a classification network layer (Softmax) 210, configured to output the emotion recognition result of the target input according to the fused emotion feature vector V3. The input layer 206 can route the input data to the speech emotion recognition module 202 and/or the text emotion recognition module 204 according to the type of the input data: if the input data is speech, it is input to the speech emotion recognition module 202; if the input data is text, it is input to the text emotion recognition module 204; and if the input is speech and its corresponding text, the input data is input to both the speech emotion recognition module 202 and the text emotion recognition module 204. The classification network layer (Softmax) is prior art and not the emphasis of the invention, and is not repeated herein.
Specifically, the fusion layer fuses the speech emotion feature vector and the text emotion feature vector by element-wise addition or by concatenation. The speech emotion feature vector V1 is a vector of shape 1*M, and the text emotion feature vector V2 is a vector of shape 1*N. When M = N, the speech emotion feature vector V1 and the text emotion feature vector V2 can be fused by element-wise addition to obtain the final fused emotion feature vector V3; the fusion formula is V3 = V1 + V2. When M ≠ N, the speech emotion feature vector V1 and the text emotion feature vector V2 can be fused by concatenation, i.e., V3 = [V1, V2]. When M ≠ N, it should be noted during network training that the parameter updates in backpropagation are assigned to the corresponding dimensions: the M dimensions update the network parameters in the speech emotion recognition module 202, and the N dimensions update the network parameters in the text emotion recognition module 204.

Specifically, the speech emotion recognition module 202 and the text emotion recognition module 204 are arranged in parallel. The parallel network structure makes it possible, during the backpropagation of network model training, to update the network parameters of the speech emotion recognition module 202 and the text emotion recognition module 204 simultaneously, and therefore to complete the training of the network parameters of both modules through a single training pass, which makes the training process simple and efficient and saves the cost of collecting training data. In addition, with the parallel structure, during the training of the emotion recognition network model 20 the text emotion information contained in the text of the training data also participates in the update of the network parameters of the speech emotion recognition module 202, and the speech emotion information contained in the speech of the training data likewise participates in the update of the network parameters of the text emotion recognition module 204. Each of the two networks can therefore learn more emotion feature information in its own field at the same time, more than a separately trained text emotion recognition model or a separately trained speech emotion recognition model in the prior art can acquire, so that the network parameters converge to a better optimum and the prediction of the network model is more accurate.

By means of the speech emotion recognition module and the text emotion recognition module, the emotion recognition network model provided by the embodiment of the invention can invoke, according to the type of a target input, the speech emotion recognition module and/or the text emotion recognition module to perform emotion recognition, which solves the technical problem in the prior art that emotion recognition models are applicable to only a single scenario. At the same time, since the speech emotion recognition module and the text emotion recognition module in the emotion recognition network model of the invention are arranged in parallel, the network structure is simple, the network parameters of both modules can be trained in a single training pass, the training process is simple, and the collection of training set data is also relatively easy.
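Continuing the sketch above, the parallel arrangement, the routing performed by the input layer 206, the fusion layer 208, and the classification layer 210 might be expressed as follows; the output dimensions are assumed equal here so that element-wise addition applies, and the commented line shows the concatenation alternative for M ≠ N.

```python
class EmotionRecognitionModel(nn.Module):
    """Network model 20: two parallel branches, fusion layer 208, Softmax layer 210."""

    def __init__(self, num_emotions=3, dim=128):
        super().__init__()
        self.speech = SpeechEmotionModule(out_dim=dim)   # module 202
        self.text = TextEmotionModule(out_dim=dim)       # module 204
        self.classifier = nn.Linear(dim, num_emotions)   # feeds Softmax layer 210

    def forward(self, waveform=None, token_ids=None):
        # Input layer 206: route the target input by its type.
        if waveform is not None and token_ids is not None:
            v1 = self.speech(waveform)
            v2 = self.text(token_ids)
            v3 = v1 + v2          # M == N: fusion by element-wise addition
            # For M != N, concatenation would be used instead (the classifier
            # input size would then be M + N): v3 = torch.cat([v1, v2], -1)
        elif waveform is not None:
            v3 = self.speech(waveform)     # speech-only input
        else:
            v3 = self.text(token_ids)      # text-only input
        return torch.softmax(self.classifier(v3), dim=-1)
```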
In the embodiment of the invention, the network parameters of the speech emotion recognition module 202 and the text emotion recognition module 204 of the network model 20 are obtained through a single training pass. The specific training process is as follows:
Training set data is input into the emotion recognition model 20 to obtain an emotion prediction result, where the training set data includes: speech, the text corresponding to the speech, and an emotion label.
The emotion prediction result is compared with the emotion label; when the emotion prediction result does not match the emotion label, the values of the network parameters of the speech emotion recognition module 202 and the text emotion recognition module 204 are adjusted separately through backpropagation using a gradient descent algorithm, and the training of the network parameters of the speech emotion recognition module 202 and the text emotion recognition module 204 is completed through multiple iterations.
Specifically, each item of data in the training set includes the speech, the text corresponding to the speech, and the emotion label, in a format such as {"wav", "txt", "emotion label"}, where "wav" is a segment of speech audio in the wav file format (other audio formats may also be used); "txt" is the text obtained from the speech by speech recognition and then manually reviewed; and "emotion label" is the emotion polarity of the speech and its corresponding text, such as "happy", "sad", or "calm".
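As a concrete illustration of this record format, the following sketch uses hypothetical file names, example text, and a three-class label set; none of these values come from the original disclosure.

```python
# One training record in the {"wav", "txt", "emotion label"} format.
record = {
    "wav": "samples/utt_0001.wav",                       # speech audio file
    "txt": "I just got the offer, this is wonderful!",   # reviewed transcript
    "label": "happy",                                    # emotion polarity label
}

LABELS = {"happy": 0, "sad": 1, "calm": 2}   # assumed label-to-index mapping
```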
The data of the training set are input into the network model of the invention to obtain an emotion prediction result. The detailed process is as follows: the speech part "wav" of the training data serves as the input of the speech emotion recognition module 202; speech features, such as MFCC features, are extracted at the speech feature extraction layer, and the first multi-layer Bi-LSTM network layer then forms the speech emotion feature vector V1, a vector of shape 1*M. The text part "txt" of the training data serves as the input of the text emotion recognition module 204; the text is first preprocessed, the preprocessing steps including word segmentation and word vector generation, and the second multi-layer Bi-LSTM network layer and the attention layer (Attention) then form the text emotion feature vector V2, a vector of shape 1*N. Next, the fusion layer 208 fuses the speech emotion feature vector V1 and the text emotion feature vector V2 to obtain the fused emotion feature vector V3, the fusion being performed by element-wise addition or by concatenation. Finally, based on the fused emotion feature vector V3, the classification network layer (Softmax) outputs the emotion prediction result.
The emotion prediction result is compared with the emotion label. When the emotion prediction result does not match the emotion label, for example when the emotion prediction result for the speech and corresponding text is "happy" while the emotion label is "calm", the values of the network parameters of the speech emotion recognition module 202 and the text emotion recognition module 204 are adjusted separately through backpropagation using a gradient descent algorithm, and the training of the network parameters of the two modules is completed through multiple iterations. Training network parameters with a gradient descent algorithm is prior art and not the emphasis of the invention, and is therefore not described in detail.
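A sketch of this training loop, continuing the model above, might look as follows; the load_batch() helper, learning rate, and epoch count are hypothetical placeholders.

```python
import torch.nn.functional as F

model = EmotionRecognitionModel(num_emotions=len(LABELS))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)   # plain gradient descent

for epoch in range(10):                                    # multiple iterations
    for waveform, token_ids, labels in load_batch():       # hypothetical data loader
        probs = model(waveform, token_ids)                 # emotion prediction result
        loss = F.nll_loss(torch.log(probs + 1e-9), labels) # prediction vs. emotion label
        optimizer.zero_grad()
        loss.backward()    # backpropagation reaches both parallel branches at once
        optimizer.step()   # adjusts parameters of modules 202 and 204 separately
```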
It should be noted that, owing to the unique structure of the emotion recognition network model 20 of the invention, namely the parallel arrangement of the speech emotion recognition module 202 and the text emotion recognition module 204, the backpropagation of network model training updates the network parameters of the speech emotion recognition module 202 and the text emotion recognition module 204 simultaneously, so that the training of the network parameters of both modules is completed through a single training pass, which makes the training process simple and efficient and saves the cost of collecting training data.
In addition, during the training of the emotion recognition network model 20, the text emotion information contained in the text of the training data also participates in the update of the network parameters of the speech emotion recognition module 202, and the speech emotion information contained in the speech of the training data likewise participates in the update of the network parameters of the text emotion recognition module 204. Each of the two networks can therefore simultaneously learn more emotion feature information in its own field, more than a separately trained text emotion recognition model or a separately trained speech emotion recognition model in the prior art can acquire, so that the network parameters converge to a better optimum and the prediction of the network model is more accurate. On the other hand, since only the classification network layer (Softmax) follows the fusion layer 208 and the speech emotion recognition module 202 and the text emotion recognition module 204 are arranged in parallel, the two networks share no network parameters; they are therefore mutually independent and detachable. The speech emotion recognition module 202 can be extracted on its own as an independent speech emotion recognition model whose network parameters contain prior text emotion information, i.e., the emotion features of the text were incorporated while speech served as the main feature; compared with a separately trained speech emotion recognition model in the prior art, its emotion recognition is more accurate. Similarly, the text emotion recognition module 204 can also be extracted on its own as an independent text emotion recognition model whose network parameters contain prior speech emotion information, i.e., the emotion features of the speech were taken into account while text served as the main feature; compared with a separately trained text emotion recognition model in the prior art, its emotion recognition is more accurate.
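Because the two branches share no parameters, either one can be detached after training and used on its own, as the following continuation of the sketch illustrates; the dummy waveform is for demonstration only.

```python
speech_only = model.speech           # module 202 as an independent speech model
text_only = model.text               # module 204 as an independent text model

waveform = torch.randn(1, 16000)     # one second of dummy 16 kHz audio
v1 = speech_only(waveform)           # usable without the text branch
```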
The speech emotion recognition module 202 and the text emotion recognition module 204 of the emotion recognition network model 20 provided by the embodiment of the invention can be invoked either individually or simultaneously, so the model is suitable for emotion recognition in multiple scenarios; at the same time, the network structure is simple, the training process is relatively simple and easy, and the collection of training set data is also relatively easy.
Referring to Fig. 2, Fig. 2 is a flowchart of an emotion recognition method provided by another embodiment of the invention. The emotion recognition method includes:
S100: obtain a target input, where the type of the target input includes one of the following: speech input, text input, or speech together with its corresponding text;
S200: according to the type of the target input, invoke the speech emotion recognition module 202 and/or the text emotion recognition module 204 of the network model 20 described in the above embodiments to perform emotion recognition;
S300: output the emotion recognition result of the target input.
Specifically, step S200 includes the following cases (see the sketch after this list):
when the target input is a speech input, invoking the speech emotion recognition module 202 to perform emotion recognition;
when the target input is a text input, invoking the text emotion recognition module 204 to perform emotion recognition;
when the target input is speech together with its corresponding text, simultaneously invoking the speech emotion recognition module 202 and the text emotion recognition module 204 to perform emotion recognition.
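A sketch of the S200 dispatch under the assumptions above; the function name and argument convention are illustrative, not part of the original disclosure.

```python
def recognize_emotion(model, waveform=None, token_ids=None):
    """Step S200: invoke the module(s) matching the type of the target input."""
    if waveform is not None and token_ids is not None:
        return model(waveform=waveform, token_ids=token_ids)   # both modules
    if waveform is not None:
        return model(waveform=waveform)       # speech emotion recognition module 202
    if token_ids is not None:
        return model(token_ids=token_ids)     # text emotion recognition module 204
    raise ValueError("target input must contain speech, text, or both")
```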
Referring to Fig. 3, Fig. 3 is a hardware structure block diagram of an electronic device for the emotion recognition method provided by an embodiment of the invention.
The method embodiments provided in the embodiments of the present application can be executed in a mobile terminal, a computer terminal, or a similar computing device. Taking execution on a mobile terminal as an example, as shown in Fig. 3, the mobile terminal 10 may include one or more processors 102 (only one is shown in Fig. 3; the processor 102 may include, but is not limited to, a processing unit such as a microprocessor MCU or a programmable logic device FPGA) and a memory 104 for storing data; optionally, the mobile terminal may further include a transmission device 106 for communication functions and an input/output device 108. Those skilled in the art will understand that the structure shown in Fig. 3 is merely illustrative and does not limit the structure of the above mobile terminal. For example, the mobile terminal 10 may also include more or fewer components than shown in Fig. 3, or have a configuration different from that shown in Fig. 3.
The memory 104 can be used to store computer programs, for example the software programs and modules of application software, such as the computer program corresponding to the emotion recognition method in the embodiment of the invention. By running the computer program stored in the memory 104, the processor 102 performs various functional applications and data processing, thereby realizing the above-mentioned method. The memory 104 may include high-speed random access memory and may also include non-volatile memory, such as one or more magnetic storage devices, flash memories, or other non-volatile solid-state memories. In some examples, the memory 104 may further include memories arranged remotely relative to the processor 102, and these remote memories can be connected to the mobile terminal 10 through a network. Examples of the above network include, but are not limited to, the Internet, an intranet, a local area network, a mobile communication network, and combinations thereof.
The transmission device 106 is used to receive or send data via a network. Specific examples of the above network may include a wireless network provided by the communication provider of the mobile terminal 10. In one example, the transmission device 106 includes a network interface controller (Network Interface Controller, NIC), which can be connected to other network devices through a base station so as to communicate with the Internet. In another example, the transmission device 106 can be a radio frequency (Radio Frequency, RF) module, which is used to communicate with the Internet wirelessly.
Those skilled in the art will readily recognize that the above preferred embodiments can be freely combined and superposed with one another provided that they do not conflict.
It should be appreciated that the above embodiments are merely exemplary and not restrictive; without departing from the basic principles of the invention, various obvious or equivalent modifications or replacements of the above details made by those skilled in the art are all included within the scope of the claims of the invention.