CN112232276A - An emotion detection method and device based on speech recognition and image recognition - Google Patents

An emotion detection method and device based on speech recognition and image recognition

Info

Publication number
CN112232276A
CN112232276A (application CN202011213188.XA; granted as CN112232276B)
Authority
CN
China
Prior art keywords
expression
image
recognition
emotion
scene
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011213188.XA
Other languages
Chinese (zh)
Other versions
CN112232276B (en)
Inventor
赵珍 (Zhao Zhen)
李小强 (Li Xiaoqiang)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Enterprise Information Technology Co., Ltd.
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual
Priority to CN202011213188.XA
Publication of CN112232276A
Application granted
Publication of CN112232276B
Legal status: Active
Anticipated expiration

Abstract

The invention relates to an emotion detection method and device based on speech recognition and image recognition. A selfie video of a user to be detected and the actual scene corresponding to the selfie video are acquired; the selfie video is processed to obtain an image signal and a voice signal; the image signal is processed to obtain an expression change trend; the voice signal is processed to obtain a preliminary emotion result of the voice signal in the actual scene; and finally the expression change trend and the preliminary emotion result are fused to obtain the final emotion result of the user. The emotion detection method based on speech recognition and image recognition provided by the invention is an automatic detection method. Compared with manual detection, it is not affected by subjective factors, which improves detection accuracy; it requires no dedicated inspection personnel, which reduces labor cost; and processing is fast, since the processing device, once set up, can handle multiple selfie videos at the same time.


Description

Emotion detection method and device based on voice recognition and image recognition
Technical Field
The invention relates to a method and a device for emotion detection based on voice recognition and image recognition.
Background
Before information processing technology was well developed, the emotion of a speaker in a piece of video or speech was judged by a dedicated inspector from the speaker's expression, the speaker's breathing, and related keywords appearing in the speech. This manual judgment has the following drawbacks: (1) inspectors are easily affected by subjective factors, which causes detection errors; (2) dedicated personnel must be arranged, which increases labor cost; (3) an inspector must watch an entire video or listen to an entire piece of speech before making a judgment, and can only judge one video or one piece of speech at a time, so efficiency is very low.
Disclosure of Invention
In order to solve the technical problem, the invention provides a method and a device for emotion detection based on voice recognition and image recognition.
The invention adopts the following technical scheme:
An emotion detection method based on speech recognition and image recognition comprises the following steps:
acquiring a selfie video of a user to be detected and the actual scene corresponding to the selfie video;
processing the selfie video to obtain an image signal and a voice signal;
performing frame capture on the image signal at a preset period to obtain at least two images;
performing expression recognition on the at least two images to obtain the facial expression of the person in each image;
acquiring an expression change trend according to the facial expressions in the images and the chronological order of the images;
performing speech recognition on the voice signal to obtain a corresponding text signal;
inputting the text signal and the actual scene into a preset detection model to acquire a preliminary emotion result of the voice signal in the actual scene;
and fusing the expression change trend and the preliminary emotion result to obtain the final emotion result of the user.
Preferably, fusing the expression change trend and the preliminary emotion result to obtain the final emotion result includes:
if the expression change trend develops toward a positive expression and the preliminary emotion result is a positive emotion, the final emotion result is a positive emotion;
and if the expression change trend develops toward a negative expression and the preliminary emotion result is a negative emotion, the final emotion result is a negative emotion.
Preferably, the process of acquiring the preset detection model includes:
acquiring at least two correction texts in each of at least two scenes;
acquiring the actual emotion result of each correction text in each scene;
inputting each correction text in each scene into an existing detection model to obtain the detected emotion result of each correction text in each scene;
selecting the correction texts whose actual emotion result and detected emotion result are both positive emotion, to obtain the first correction texts in a first scene, and the correction texts whose actual emotion result and detected emotion result are both negative emotion, to obtain the second correction texts in a second scene;
and adjusting the existing detection model according to the first correction texts in the first scene and the second correction texts in the second scene, to obtain the preset detection model.
Preferably, performing expression recognition on the at least two images to obtain the facial expression of the person in each image includes:
performing face recognition on the at least two images to obtain the user's face image;
and performing expression recognition on the user's face image in each image to obtain the facial expression in each image.
Preferably, performing expression recognition on the user's face image in each image to obtain the facial expression in each image includes:
acquiring a first sample set and a second sample set, wherein the first sample set comprises at least one positive-expression sample image and the second sample set comprises at least one negative-expression sample image;
labeling each positive-expression sample image in the first sample set to obtain a first expression category, the first expression category being positive expression, and labeling each negative-expression sample image in the second sample set to obtain a second expression category, the second expression category being negative expression, the first expression category and the second expression category constituting the labeled data;
inputting the first sample set and the second sample set into an expression recognition encoder for feature extraction, feeding the feature vector output by the encoder to a Flatten layer to obtain a one-dimensional feature vector, using the one-dimensional feature vector as the input of a fully connected layer that maps it into the feature label space and then outputs it to a softmax function, outputting the probabilities of the two expression categories through the softmax function, and determining the corresponding initial expression category from the output probabilities;
computing a cross-entropy loss between the initial expression category and the labeled data, and optimizing the parameters of the expression recognition network;
and inputting the user's face image in each image into the expression recognition network to obtain the facial expression of the user's face image in each image.
An emotion detection device based on speech recognition and image recognition comprises a memory, a processor, and a computer program stored in the memory and executable on the processor; when executing the computer program, the processor implements the following steps of the emotion detection method based on speech recognition and image recognition:
acquiring a selfie video of a user to be detected and the actual scene corresponding to the selfie video;
processing the selfie video to obtain an image signal and a voice signal;
performing frame capture on the image signal at a preset period to obtain at least two images;
performing expression recognition on the at least two images to obtain the facial expression of the person in each image;
acquiring an expression change trend according to the facial expressions in the images and the chronological order of the images;
performing speech recognition on the voice signal to obtain a corresponding text signal;
inputting the text signal and the actual scene into a preset detection model to acquire a preliminary emotion result of the voice signal in the actual scene;
and fusing the expression change trend and the preliminary emotion result to obtain the final emotion result of the user.
Preferably, fusing the expression change trend and the preliminary emotion result to obtain the final emotion result includes:
if the expression change trend develops toward a positive expression and the preliminary emotion result is a positive emotion, the final emotion result is a positive emotion;
and if the expression change trend develops toward a negative expression and the preliminary emotion result is a negative emotion, the final emotion result is a negative emotion.
Preferably, the process of acquiring the preset detection model includes:
acquiring at least two correction texts in each of at least two scenes;
acquiring the actual emotion result of each correction text in each scene;
inputting each correction text in each scene into an existing detection model to obtain the detected emotion result of each correction text in each scene;
selecting the correction texts whose actual emotion result and detected emotion result are both positive emotion, to obtain the first correction texts in a first scene, and the correction texts whose actual emotion result and detected emotion result are both negative emotion, to obtain the second correction texts in a second scene;
and adjusting the existing detection model according to the first correction texts in the first scene and the second correction texts in the second scene, to obtain the preset detection model.
Preferably, performing expression recognition on the at least two images to obtain the facial expression of the person in each image includes:
performing face recognition on the at least two images to obtain the user's face image;
and performing expression recognition on the user's face image in each image to obtain the facial expression in each image.
Preferably, performing expression recognition on the user's face image in each image to obtain the facial expression in each image includes:
acquiring a first sample set and a second sample set, wherein the first sample set comprises at least one positive-expression sample image and the second sample set comprises at least one negative-expression sample image;
labeling each positive-expression sample image in the first sample set to obtain a first expression category, the first expression category being positive expression, and labeling each negative-expression sample image in the second sample set to obtain a second expression category, the second expression category being negative expression, the first expression category and the second expression category constituting the labeled data;
inputting the first sample set and the second sample set into an expression recognition encoder for feature extraction, feeding the feature vector output by the encoder to a Flatten layer to obtain a one-dimensional feature vector, using the one-dimensional feature vector as the input of a fully connected layer that maps it into the feature label space and then outputs it to a softmax function, outputting the probabilities of the two expression categories through the softmax function, and determining the corresponding initial expression category from the output probabilities;
computing a cross-entropy loss between the initial expression category and the labeled data, and optimizing the parameters of the expression recognition network;
and inputting the user's face image in each image into the expression recognition network to obtain the facial expression of the user's face image in each image.
The invention has the following beneficial effects. Image processing and voice processing are performed separately on the user's selfie video: the image processing yields a sequence of facial expressions, from which an expression change trend is derived according to their chronological order. Because even the same text can carry different emotions in different scenes, and text is a carrier that reflects a person's emotion, speech recognition is performed on the voice signal of the selfie video to obtain a text signal, and the text signal together with the actual scene is input into a preset detection model to obtain a preliminary emotion result of the voice signal in that scene. Applying the actual scene to emotion detection improves detection accuracy. Finally, the expression change trend and the preliminary emotion result are fused to obtain the final emotion result of the user. The emotion detection method based on speech recognition and image recognition is therefore an automatic detection method: the video is processed in two ways, facial expressions are recognized from the images, text is recognized from the speech, an emotion result is obtained from the text signal and the actual scene, and the final emotion result is obtained by fusing the two kinds of information. No dedicated inspection personnel are needed, which reduces labor cost; processing is fast, and once the processing device is set up, multiple selfie videos can be handled simultaneously, so efficiency is high.
Drawings
Fig. 1 is a flow chart of a method for emotion detection based on speech recognition and image recognition.
Detailed Description
This embodiment provides an emotion detection method based on speech recognition and image recognition. The hardware execution body of the method may be a computer device, a server device, an intelligent mobile terminal, or the like; this embodiment does not specifically limit the hardware execution body.
As shown in fig. 1, the emotion detection method based on speech recognition and image recognition includes:
step S1: acquiring a self-shooting video of a section of user to be detected and an actual scene corresponding to the self-shooting video:
the user transmits the self-shooting video to the hardware execution main body, and the time length of the self-shooting video is set according to actual requirements, for example, the self-shooting video can be a short video within 30s, and can also be a longer video of 2-3 minutes. The user also transmits the actual scene of the self-shooting video to the hardware execution main body, and the actual scene may refer to the environment or the occasion where the self-shooting video is located, such as: at home, or at work, or in other public places, such as: KTV, supermarket, restaurant, etc. The actual scene is obtained and applied to emotion detection because the emotion may differ from scene to scene in the case where the same data is included in the video.
Step S2: processing the selfie video to obtain an image signal and a voice signal:
The selfie video is parsed to obtain an image signal and a voice signal. The image signal is the video data without sound, containing only the images; it should be understood that, since it comes from the user's selfie video, the image signal contains the user's face. The voice signal is the sound signal in the selfie video, specifically what the user says in the selfie video.
Since decomposing a video file into an image signal and a sound signal is conventional technology, its description is omitted.
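As an illustration only (the patent leaves this step to conventional tooling), the split can be done with the standard ffmpeg command-line tool invoked from Python. The file names, helper name, and the 16 kHz mono WAV format are assumptions for the sketch, not part of the invention.

```python
import subprocess

def split_selfie_video(video_path: str, image_only_path: str, audio_path: str) -> None:
    """Split a selfie video into a silent video stream and a mono 16 kHz WAV audio track (assumed sketch)."""
    # -an drops the audio stream; -c:v copy keeps the video frames untouched.
    subprocess.run(["ffmpeg", "-y", "-i", video_path, "-an", "-c:v", "copy", image_only_path],
                   check=True)
    # -vn drops the video stream; pcm_s16le at 16 kHz mono is a common choice for speech recognition.
    subprocess.run(["ffmpeg", "-y", "-i", video_path, "-vn",
                    "-acodec", "pcm_s16le", "-ar", "16000", "-ac", "1", audio_path],
                   check=True)

# Example usage (paths are placeholders):
# split_selfie_video("selfie.mp4", "selfie_image_signal.mp4", "selfie_voice_signal.wav")
```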
Step S3: performing frame capture on the image signal at a preset period to obtain at least two images:
Frame capture (screenshot) processing is performed on the image signal at a preset period to obtain at least two images. The preset period is set according to actual needs; the longer the preset period, the fewer images are acquired. It should be understood that, since the video is a selfie video, each acquired image contains the user's face.
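A minimal sketch of the periodic frame capture, assuming OpenCV is used; the one-second default period, function name, and returned list format are illustrative assumptions.

```python
import cv2  # OpenCV

def capture_frames(image_signal_path: str, period_s: float = 1.0):
    """Grab one frame every `period_s` seconds from the image-only video (assumed sketch)."""
    cap = cv2.VideoCapture(image_signal_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0        # fall back to 25 fps if metadata is missing
    step = max(1, int(round(fps * period_s)))      # number of frames between captures
    frames, index = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:                      # keep every step-th frame
            frames.append(frame)
        index += 1
    cap.release()
    return frames                                  # list of BGR images, each containing the user's face
```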
Step S4: performing expression recognition on the at least two images to obtain the facial expression of the person in each image:
First, face recognition is performed on each of the at least two images to obtain the user's face image in each image.
Then, expression recognition is performed on the user's face image in each image to obtain the facial expression in each image. As a specific implementation, an expression recognition process is given as follows:
The user's face image in each image is input into an expression recognition network to obtain the user's expression in each image. The expression recognition network can be obtained with the following training process:
A first sample set and a second sample set are obtained; the first sample set comprises at least one positive-expression sample image, and the second sample set comprises at least one negative-expression sample image. A positive-expression sample image is a sample image in which the person's expression is positive, such as happy or joyful; a negative-expression sample image is a sample image in which the person's expression is negative, such as sad, crying, or upset.
Each positive-expression sample image in the first sample set is labeled to obtain the first expression category, positive expression; each negative-expression sample image in the second sample set is labeled to obtain the second expression category, negative expression. That is, the labeled expression categories fall into two classes, and different indices can represent the different categories, for example index 0 for a positive expression and index 1 for a negative expression; the labels can further be one-hot encoded. The first expression category and the second expression category constitute the labeled data.
The expression recognition network comprises an expression recognition encoder, a Flatten layer, a fully connected layer, and a softmax function.
The first and second sample sets are input into the expression recognition encoder for feature extraction. The encoder outputs a feature vector (for example, describing how far the mouth corners are raised), which is fed to the Flatten layer; the Flatten layer produces a one-dimensional feature vector that is used as the input of the fully connected layer. The fully connected layer maps the one-dimensional feature vector into the feature label space and outputs it to the softmax function, which outputs the probabilities of the two expression categories (the two probabilities sum to 1); the corresponding initial expression category is determined from these output probabilities.
The obtained initial expression category and the labeled data are evaluated with a cross-entropy loss function, and the parameters of the expression recognition network are optimized so that the output expression category gradually approaches the ground truth.
The user's face image in each image is then input into the expression recognition network for expression recognition: the face image is input into the expression recognition encoder for feature extraction, the encoder outputs a feature vector, the feature vector is fed to the Flatten layer to obtain a one-dimensional feature vector, the one-dimensional feature vector is used as the input of the fully connected layer, which maps it into the feature label space and outputs it to the softmax function, and the softmax function outputs the corresponding expression category, which is either a positive expression or a negative expression.
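The following PyTorch sketch illustrates a network of that shape (encoder, Flatten layer, fully connected layer, softmax via cross-entropy). The small convolutional encoder, the 48x48 single-channel input size, and the helper names are assumptions made for illustration; the patent does not fix a specific encoder architecture.

```python
import torch
import torch.nn as nn

class ExpressionRecognitionNet(nn.Module):
    """Assumed sketch: encoder -> Flatten -> fully connected layer -> 2-way logits."""
    def __init__(self):
        super().__init__()
        # A small convolutional encoder standing in for the "expression recognition encoder".
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.flatten = nn.Flatten()               # the Flatten layer
        self.fc = nn.Linear(32 * 12 * 12, 2)      # fully connected layer -> 2 categories (assumes 48x48 input)

    def forward(self, x):
        return self.fc(self.flatten(self.encoder(x)))  # logits; softmax is applied by the loss or at inference

def train_step(model, faces, labels, optimizer):
    """One optimization step. labels: 0 = positive expression, 1 = negative expression."""
    criterion = nn.CrossEntropyLoss()             # combines log-softmax with cross-entropy
    optimizer.zero_grad()
    loss = criterion(model(faces), labels)
    loss.backward()
    optimizer.step()
    return loss.item()

def predict_expression(model, face):
    """Inference: softmax probabilities of the two categories (they sum to 1)."""
    with torch.no_grad():
        probs = torch.softmax(model(face.unsqueeze(0)), dim=1)[0]
    return "positive" if probs[0] >= probs[1] else "negative"
```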
Step S5: acquiring the expression change trend according to the facial expression in each image and the chronological order of the images:
Because the images are captured at the preset period, they have a chronological order, namely the order of their times during playback of the selfie video. After the facial expression in each image is obtained, the expression change trend is acquired according to this chronological order. The expression change trend is the direction in which the expression develops, i.e. toward a positive expression or toward a negative expression.
Developing toward a positive expression covers two cases: the expression changes to a positive expression (for example, from a negative expression to a positive one), or the expression is a positive expression throughout. Similarly, developing toward a negative expression also covers two cases: the expression changes to a negative expression (for example, from a positive expression to a negative one), or the expression is a negative expression throughout.
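A minimal sketch of this trend rule in Python. The function name, the handling of an empty sequence, and the reading that the trend is decided by the latest expression are assumptions.

```python
from typing import List, Optional

def expression_trend(expressions: List[str]) -> Optional[str]:
    """Derive the expression change trend from expressions listed in chronological order.

    Each element is "positive" or "negative". Returns "positive" if the sequence ends on
    (or stays at) a positive expression, "negative" in the mirrored case, and None for an
    empty sequence. Assumed sketch of the rule described above.
    """
    if not expressions:
        return None
    if expressions[-1] == "positive":
        # Changed to a positive expression, or was positive throughout.
        return "positive"
    # Changed to a negative expression, or was negative throughout.
    return "negative"

# Example: expression_trend(["negative", "negative", "positive"]) -> "positive"
```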
Step S6: performing speech recognition on the voice signal to obtain the corresponding text signal:
Speech recognition is performed on the voice signal to obtain the text signal corresponding to the voice signal, i.e. the voice signal is converted into a text signal. Since speech recognition is a conventional algorithm, its description is omitted.
Step S7: inputting the text signal and the actual scene into a preset detection model, and acquiring a preliminary emotion result of the voice signal in the actual scene:
After the text signal is obtained, the text signal and the actual scene corresponding to the selfie video are input into a preset detection model to acquire the preliminary emotion result of the voice signal in the actual scene.
It should be understood that the preset detection model may be a previously constructed detection model comprising at least two scenes and, for each scene, at least two texts together with the emotion result corresponding to each text in that scene. Since scene and text are independent, the preset detection model can equivalently be said to comprise at least two texts and, for each text, the emotion result corresponding to that text in each of the at least two scenes. Moreover, to improve detection accuracy, each text in the preset detection model may be a keyword rather than a complete sentence; for example, for the complete sentence "I don't want to do this", the keyword may be "don't want to do".
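A toy sketch of such a keyword-plus-scene lookup table in Python. The example keywords, scene names, substring-matching rule, and the "unknown" fallback are invented for illustration and are not taken from the patent.

```python
# Hypothetical preset detection model: (keyword, scene) -> emotion result.
PRESET_DETECTION_MODEL = {
    ("don't want to do", "at work"): "negative",
    ("don't want to do", "at home"): "negative",
    ("so much fun",      "KTV"):     "positive",
    ("so much fun",      "at work"): "positive",
}

def preliminary_emotion(text_signal: str, actual_scene: str) -> str:
    """Look up the preliminary emotion result of a recognized text in a given scene (assumed sketch)."""
    for (keyword, scene), emotion in PRESET_DETECTION_MODEL.items():
        # The model stores keywords, not necessarily complete sentences.
        if scene == actual_scene and keyword in text_signal:
            return emotion
    return "unknown"  # no keyword of the model appears in the text for this scene

# Example: preliminary_emotion("I really don't want to do this today", "at work") -> "negative"
```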
The detection model referred to above may be an existing detection model. As a specific implementation, the preset detection model is obtained by correcting the existing detection model; its acquisition process is given as follows:
(1) At least two correction texts are acquired in each of at least two scenes. It should be understood that, to improve the reliability of the correction and the accuracy of the preset detection model, the acquired scenes, and the correction texts in each scene, should be set sufficiently broad.
(2) Since the correction texts are texts used to correct the existing detection model and are known, the actual emotion result of each correction text in each scene is also known; the actual emotion result of each correction text in each scene is acquired.
(3) Each correction text in each scene is input into the existing detection model to obtain the detected emotion result of each correction text in each scene.
(4) Having obtained, for each correction text in each scene, both the actual emotion result and the detected emotion result, the two results are checked against each other. Specifically, the correction texts in each scene whose actual emotion result and detected emotion result are both positive emotion are collected, giving the first correction texts in the first scene; and the correction texts in each scene whose actual emotion result and detected emotion result are both negative emotion are collected, giving the second correction texts in the second scene.
(5) The existing detection model is adjusted according to the first correction texts in the first scene and the second correction texts in the second scene, yielding the preset detection model. Two adjustment methods are given. The first: the existing detection model is disregarded, and the preset detection model is built directly from the first correction texts in the first scene and the second correction texts in the second scene. The second: in the existing detection model, every text in every scene whose emotion result is positive but which does not satisfy the condition that both the actual and the detected emotion results are positive is deleted, and every text in every scene whose emotion result is negative but which does not satisfy the condition that both the actual and the detected emotion results are negative is deleted.
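A rough sketch of the second adjustment method (filtering the existing model down to the entries that the correction texts validated). The data structures, helper names, and the exact reading of "delete what was not validated" are assumptions.

```python
from typing import Dict, Tuple

Emotion = str                              # "positive" or "negative"
Model = Dict[Tuple[str, str], Emotion]     # (text/keyword, scene) -> emotion result

def correct_model(existing_model: Model,
                  actual: Dict[Tuple[str, str], Emotion],
                  detected: Dict[Tuple[str, str], Emotion]) -> Model:
    """Keep only the entries of the existing model that the correction texts confirmed (assumed sketch).

    `actual` and `detected` map each (correction text, scene) pair to its actual and detected
    emotion result; an entry survives only if both agree with the existing model's entry.
    """
    validated = {key for key in actual
                 if key in detected
                 and actual[key] == detected[key]
                 and actual[key] == existing_model.get(key)}
    return {key: emotion for key, emotion in existing_model.items() if key in validated}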
Therefore, the obtained text signal and the actual scene are input into a preset detection model, and a preliminary emotion result of the text signal in the actual scene, namely a preliminary emotion result of the corresponding voice signal in the actual scene, is obtained.
Through the correction, the detection precision of the preset detection model can be improved.
Step S8: fusing the expression change trend and the preliminary emotion result to obtain the final emotion result of the user:
The obtained expression change trend and preliminary emotion result are fused. Specifically: if the expression change trend develops toward a positive expression and the preliminary emotion result is a positive emotion, the final emotion result of the user is a positive emotion; if the expression change trend develops toward a negative expression and the preliminary emotion result is a negative emotion, the final emotion result of the user is a negative emotion.
As another embodiment, two weights may be set, and the final emotion result of the user is obtained by combining the expression change trend, the preliminary emotion result, and the corresponding weights.
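A small sketch of the weighted-fusion variant mentioned above. The weight values, the numeric encoding of the two signals, and the decision threshold are all assumptions; the patent only states that two weights may be combined with the two results.

```python
def fuse_emotion(expression_trend: str, preliminary_emotion: str,
                 w_expression: float = 0.5, w_text: float = 0.5) -> str:
    """Weighted fusion of the expression change trend and the preliminary emotion result (assumed sketch).

    Both inputs are "positive" or "negative"; each is encoded as +1 / -1 and combined with
    its weight. A non-negative score is read as a positive final emotion.
    """
    encode = lambda label: 1.0 if label == "positive" else -1.0
    score = w_expression * encode(expression_trend) + w_text * encode(preliminary_emotion)
    return "positive" if score >= 0 else "negative"

# With equal weights this reduces to the rule of step S8 whenever the two signals agree:
# fuse_emotion("positive", "positive") -> "positive"; fuse_emotion("negative", "negative") -> "negative"
```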
This embodiment further provides an emotion detection device based on speech recognition and image recognition, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor; when executing the computer program, the processor implements the steps of the emotion detection method based on speech recognition and image recognition provided in this embodiment. The emotion detection device is therefore a software device whose essence is still the emotion detection method based on speech recognition and image recognition.
The above-mentioned embodiments are merely illustrative of the technical solutions of the present invention in a specific embodiment, and any equivalent substitutions and modifications or partial substitutions of the present invention without departing from the spirit and scope of the present invention should be covered by the claims of the present invention.

Claims (10)

1. An emotion detection method based on speech recognition and image recognition, characterized by comprising:
acquiring a selfie video of a user to be detected and the actual scene corresponding to the selfie video;
processing the selfie video to obtain an image signal and a voice signal;
performing frame capture on the image signal at a preset period to obtain at least two images;
performing expression recognition on the at least two images to obtain the facial expression of the person in each image;
acquiring an expression change trend according to the facial expressions in the images and the chronological order of the images;
performing speech recognition on the voice signal to obtain a corresponding text signal;
inputting the text signal and the actual scene into a preset detection model to acquire a preliminary emotion result of the voice signal in the actual scene;
and fusing the expression change trend and the preliminary emotion result to obtain the final emotion result of the user.

2. The emotion detection method based on speech recognition and image recognition according to claim 1, characterized in that fusing the expression change trend and the preliminary emotion result to obtain the final emotion result comprises:
if the expression change trend develops toward a positive expression and the preliminary emotion result is a positive emotion, the final emotion result is a positive emotion;
and if the expression change trend develops toward a negative expression and the preliminary emotion result is a negative emotion, the final emotion result is a negative emotion.

3. The emotion detection method based on speech recognition and image recognition according to claim 1, characterized in that the process of acquiring the preset detection model comprises:
acquiring at least two correction texts in each of at least two scenes;
acquiring the actual emotion result of each correction text in each scene;
inputting each correction text in each scene into an existing detection model to obtain the detected emotion result of each correction text in each scene;
selecting the correction texts whose actual emotion result and detected emotion result are both positive emotion, to obtain the first correction texts in a first scene, and the correction texts whose actual emotion result and detected emotion result are both negative emotion, to obtain the second correction texts in a second scene;
and adjusting the existing detection model according to the first correction texts in the first scene and the second correction texts in the second scene, to obtain the preset detection model.

4. The emotion detection method based on speech recognition and image recognition according to claim 1, characterized in that performing expression recognition on the at least two images to obtain the facial expression of the person in each image comprises:
performing face recognition on the at least two images to obtain the user's face image;
and performing expression recognition on the user's face image in each image to obtain the facial expression in each image.

5. The emotion detection method based on speech recognition and image recognition according to claim 4, characterized in that performing expression recognition on the user's face image in each image to obtain the facial expression in each image comprises:
acquiring a first sample set and a second sample set, the first sample set comprising at least one positive-expression sample image and the second sample set comprising at least one negative-expression sample image;
labeling each positive-expression sample image in the first sample set to obtain a first expression category, the first expression category being positive expression, and labeling each negative-expression sample image in the second sample set to obtain a second expression category, the second expression category being negative expression, the first expression category and the second expression category constituting the labeled data;
inputting the first sample set and the second sample set into an expression recognition encoder for feature extraction, feeding the feature vector output by the expression recognition encoder to a Flatten layer to obtain a one-dimensional feature vector, using the one-dimensional feature vector as the input of a fully connected layer, the fully connected layer mapping the one-dimensional feature vector into the feature label space and outputting it to a softmax function, outputting the probabilities of the two expression categories through the softmax function, and determining the corresponding initial expression category according to the output probabilities of the two expression categories;
computing a cross-entropy loss between the initial expression category and the labeled data, and optimizing the parameters of the expression recognition network;
and inputting the user's face image in each image into the expression recognition network to obtain the facial expression of the user's face image in each image.

6. An emotion detection device based on speech recognition and image recognition, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, when executing the computer program, the processor implements the steps of the following emotion detection method based on speech recognition and image recognition:
acquiring a selfie video of a user to be detected and the actual scene corresponding to the selfie video;
processing the selfie video to obtain an image signal and a voice signal;
performing frame capture on the image signal at a preset period to obtain at least two images;
performing expression recognition on the at least two images to obtain the facial expression of the person in each image;
acquiring an expression change trend according to the facial expressions in the images and the chronological order of the images;
performing speech recognition on the voice signal to obtain a corresponding text signal;
inputting the text signal and the actual scene into a preset detection model to acquire a preliminary emotion result of the voice signal in the actual scene;
and fusing the expression change trend and the preliminary emotion result to obtain the final emotion result of the user.

7. The emotion detection device based on speech recognition and image recognition according to claim 6, characterized in that fusing the expression change trend and the preliminary emotion result to obtain the final emotion result comprises:
if the expression change trend develops toward a positive expression and the preliminary emotion result is a positive emotion, the final emotion result is a positive emotion;
and if the expression change trend develops toward a negative expression and the preliminary emotion result is a negative emotion, the final emotion result is a negative emotion.

8. The emotion detection device based on speech recognition and image recognition according to claim 6, characterized in that the process of acquiring the preset detection model comprises:
acquiring at least two correction texts in each of at least two scenes;
acquiring the actual emotion result of each correction text in each scene;
inputting each correction text in each scene into an existing detection model to obtain the detected emotion result of each correction text in each scene;
selecting the correction texts whose actual emotion result and detected emotion result are both positive emotion, to obtain the first correction texts in a first scene, and the correction texts whose actual emotion result and detected emotion result are both negative emotion, to obtain the second correction texts in a second scene;
and adjusting the existing detection model according to the first correction texts in the first scene and the second correction texts in the second scene, to obtain the preset detection model.

9. The emotion detection device based on speech recognition and image recognition according to claim 6, characterized in that performing expression recognition on the at least two images to obtain the facial expression of the person in each image comprises:
performing face recognition on the at least two images to obtain the user's face image;
and performing expression recognition on the user's face image in each image to obtain the facial expression in each image.

10. The emotion detection device based on speech recognition and image recognition according to claim 9, characterized in that performing expression recognition on the user's face image in each image to obtain the facial expression in each image comprises:
acquiring a first sample set and a second sample set, the first sample set comprising at least one positive-expression sample image and the second sample set comprising at least one negative-expression sample image;
labeling each positive-expression sample image in the first sample set to obtain a first expression category, the first expression category being positive expression, and labeling each negative-expression sample image in the second sample set to obtain a second expression category, the second expression category being negative expression, the first expression category and the second expression category constituting the labeled data;
inputting the first sample set and the second sample set into an expression recognition encoder for feature extraction, feeding the feature vector output by the expression recognition encoder to a Flatten layer to obtain a one-dimensional feature vector, using the one-dimensional feature vector as the input of a fully connected layer, the fully connected layer mapping the one-dimensional feature vector into the feature label space and outputting it to a softmax function, outputting the probabilities of the two expression categories through the softmax function, and determining the corresponding initial expression category according to the output probabilities of the two expression categories;
computing a cross-entropy loss between the initial expression category and the labeled data, and optimizing the parameters of the expression recognition network;
and inputting the user's face image in each image into the expression recognition network to obtain the facial expression of the user's face image in each image.
CN202011213188.XA | Priority and filing date 2020-11-04 | An emotion detection method and device based on speech recognition and image recognition | Active | Granted as CN112232276B (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202011213188.XA | 2020-11-04 | 2020-11-04 | An emotion detection method and device based on speech recognition and image recognition

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202011213188.XA | 2020-11-04 | 2020-11-04 | An emotion detection method and device based on speech recognition and image recognition

Publications (2)

Publication Number | Publication Date
CN112232276A | 2021-01-15
CN112232276B | 2023-10-13

Family

ID=74121979

Family Applications (1)

Application Number | Status | Granted Publication | Priority Date | Filing Date | Title
CN202011213188.XA | Active | CN112232276B (en) | 2020-11-04 | 2020-11-04 | An emotion detection method and device based on speech recognition and image recognition

Country Status (1)

Country | Link
CN | CN112232276B (en)



Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
WO2020125386A1 (en) * | 2018-12-18 | 2020-06-25 | 深圳壹账通智能科技有限公司 | Expression recognition method and apparatus, computer device, and storage medium
WO2020135194A1 (en) * | 2018-12-26 | 2020-07-02 | 深圳Tcl新技术有限公司 | Emotion engine technology-based voice interaction method, smart terminal, and storage medium
CN111368609A (en) * | 2018-12-26 | 2020-07-03 | 深圳Tcl新技术有限公司 | Voice interaction method based on emotion engine technology, intelligent terminal and storage medium
CN111681681A (en) * | 2020-05-22 | 2020-09-18 | 深圳壹账通智能科技有限公司 | Voice emotion recognition method and device, electronic equipment and storage medium
CN111694959A (en) * | 2020-06-08 | 2020-09-22 | 谢沛然 | Network public opinion multi-mode emotion recognition method and system based on facial expressions and text information

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
WENBIN ZHOU et al.: "Deep Learning-Based Emotion Recognition from Real-Time Videos", HCII 2020: Human-Computer Interaction. Multimodal and Natural Interaction *
陈师哲 (Chen Shizhe), 王帅 (Wang Shuai), 金琴 (Jin Qin): "Multimodal Emotion Recognition in Multi-culture Scenarios" (多文化场景下的多模态情感识别), Journal of Software (软件学报), no. 04 *
饶元 (Rao Yuan), 吴连伟 (Wu Lianwei), 王一鸣 (Wang Yiming), 冯聪 (Feng Cong): "Research Progress on Affective Computing Based on Semantic Analysis" (基于语义分析的情感计算技术研究进展), Journal of Software (软件学报), no. 08 *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN112992148A (en) * | 2021-03-03 | 2021-06-18 | 中国工商银行股份有限公司 | Method and device for recognizing voice in video
CN112990301A (en) * | 2021-03-10 | 2021-06-18 | 深圳市声扬科技有限公司 | Emotion data annotation method and device, computer equipment and storage medium
CN114065742A (en) * | 2021-11-19 | 2022-02-18 | 马上消费金融股份有限公司 | A text detection method and device
CN114065742B (en) * | 2021-11-19 | 2023-08-25 | 马上消费金融股份有限公司 | Text detection method and device
CN115795095A (en) * | 2022-11-23 | 2023-03-14 | 中国石油大学(华东) | Automatic music matching method based on video content
CN118428343A (en) * | 2024-07-03 | 2024-08-02 | 广州讯鸿网络技术有限公司 | Full-media interactive intelligent customer service interaction method and system
CN118608742A (en) * | 2024-08-08 | 2024-09-06 | 深圳市欧冠微电子科技有限公司 | Viewing angle switching control method, device, medium and electronic device for display panel

Also Published As

Publication number | Publication date
CN112232276B (en) | 2023-10-13

Similar Documents

Publication | Title
Tao et al. | End-to-end audiovisual speech recognition system with multitask learning
CN115050077B (en) | Emotion recognition method, device, equipment and storage medium
CN110728997B (en) | A multimodal depression detection system based on context awareness
CN112686048B (en) | Emotion recognition method and device based on fusion of voice, semantics and facial expressions
CN112232276A (en) | An emotion detection method and device based on speech recognition and image recognition
US10438586B2 (en) | Voice dialog device and voice dialog method
CN109658923B (en) | Speech quality inspection method, equipment, storage medium and device based on artificial intelligence
CN109492221B (en) | An information reply method and wearable device based on semantic analysis
CN108305618B (en) | Voice acquisition and search method, smart pen, search terminal and storage medium
JP2008158055A | Language pronunciation practice support system
CN111986675A (en) | Voice dialogue method, device and computer readable storage medium
CN117150320B (en) | Dialog digital human emotion style similarity evaluation method and system
CN118380144A (en) | A feature extraction evaluation system and method based on multimodal deep learning
CN112597889A (en) | Emotion processing method and device based on artificial intelligence
CN118658467A (en) | A cheating detection method, device, equipment, storage medium and product
CN120105346A (en) | A multi-modal data acquisition and feature fusion system, method and server
CN112951274A (en) | Voice similarity determination method and device, and program product
CN114267324B (en) | Speech generation method, device, equipment and storage medium
CN117831575B (en) | Intelligent business analysis method, system and electronic device based on big data
CN112434953A (en) | Customer service personnel assessment method and device based on computer data processing
CN115914742B (en) | Character recognition method, device and equipment for video captions and storage medium
CN118016273A (en) | Disease auxiliary diagnosis method, device, equipment and readable storage medium
CN115019137B (en) | Method and device for predicting multi-scale double-flow attention video language event
KR102480722B1 | Apparatus for recognizing emotion aware in edge computer environment and method thereof
CN119475252B (en) | A multimodal emotion recognition method

Legal Events

Code | Title
PB01 | Publication
SE01 | Entry into force of request for substantive examination
TA01 | Transfer of patent application right

Effective date of registration: 2023-05-26

Address after: No. 16-44, No. 10A-10C, 12A, 12B, 13A, 13B, 15-18, Phase II of Wuyue Plaza Project, east of Zhengyang Street and south of Haoyue Road, Lvyuan District, Changchun City, Jilin Province, 130000

Applicant after: Jilin Huayuan Network Technology Co., Ltd.

Address before: 450000 Wenhua Road, Jinshui District, Zhengzhou City, Henan Province

Applicant before: Zhao Zhen

TA01 | Transfer of patent application right

Effective date of registration: 2023-09-13

Address after: Room 1001, 1st floor, building B, 555 Dongchuan Road, Minhang District, Shanghai

Applicant after: Shanghai Enterprise Information Technology Co., Ltd.

Address before: No. 16-44, No. 10A-10C, 12A, 12B, 13A, 13B, 15-18, Phase II of Wuyue Plaza Project, east of Zhengyang Street and south of Haoyue Road, Lvyuan District, Changchun City, Jilin Province, 130000

Applicant before: Jilin Huayuan Network Technology Co., Ltd.

GR01 | Patent grant
PE01 | Entry into force of the registration of the contract for pledge of patent right

Denomination of invention: An emotion detection method and device based on speech recognition and image recognition

Granted publication date: 2023-10-13

Pledgee: Agricultural Bank of China Limited, Shanghai Huangpu Sub-branch

Pledgor: Shanghai Enterprise Information Technology Co., Ltd.

Registration number: Y2024310000041

PC01 | Cancellation of the registration of the contract for pledge of patent right

Granted publication date: 2023-10-13

Pledgee: Agricultural Bank of China Limited, Shanghai Huangpu Sub-branch

Pledgor: Shanghai Enterprise Information Technology Co., Ltd.

Registration number: Y2024310000041

EE01 | Entry into force of recordation of patent licensing contract

Application publication date: 2021-01-15

Assignee: Shanghai Quche Intelligent Technology Co., Ltd.

Assignor: Shanghai Enterprise Information Technology Co., Ltd.

Contract record no.: X2025980014762

Denomination of invention: An emotion detection method and device based on speech recognition and image recognition

Granted publication date: 2023-10-13

License type: Common License

Record date: 2025-07-23


[8]ページ先頭

©2009-2025 Movatter.jp