
Speech synthesis method, speech synthesis device, electronic equipment and storage medium

Info

Publication number
CN114678003A
Authority
CN
China
Prior art keywords
emotion
voice
target
information
synthesized
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210364869.9A
Other languages
Chinese (zh)
Inventor
崔洋洋
余俊澎
王星宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Youmi Technology Shenzhen Co ltd
Original Assignee
Youmi Technology Shenzhen Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Youmi Technology Shenzhen Co ltd
Priority to CN202210364869.9A
Publication of CN114678003A
Legal status: Pending (current)

Abstract

The embodiment of the application provides a speech synthesis method, a speech synthesis device, an electronic device and a storage medium. The method comprises the following steps: determining a target emotion; acquiring a voiceprint feature of the target emotion, wherein the voiceprint feature of the target emotion represents the voiceprint feature of a voice signal uttered by a user when the user is in the target emotion; and synthesizing information to be synthesized based on the voiceprint feature of the target emotion to obtain synthesized voice. In the technical solution provided by the embodiment of the application, the emotion that the synthesized voice is expected to have is determined, the voiceprint feature of the voice signal uttered by the user when the user is in that emotion is then acquired, and finally the information to be synthesized is synthesized based on the voiceprint feature to obtain synthesized voice capable of expressing the emotion, so that the electronic device can also simulate human emotion when subsequently playing the synthesized voice, making the synthesized voice output by the electronic device more natural and more expressive.

Description

Speech synthesis method, speech synthesis device, electronic equipment and storage medium
Technical Field
The present application relates to the field of speech processing technologies, and in particular, to a speech synthesis method, apparatus, electronic device, and storage medium.
Background
At present, voice man-machine interaction is widely applied to daily life of people, such as intelligent voice assistants, audio books, voice navigation and the like.
Taking the intelligent voice assistant as an example, after a user wakes up the intelligent voice assistant and utters question information, the intelligent voice assistant collects the question information and performs semantic analysis on it to obtain answer information matching the question information, finally converts the answer information from text form into voice form, and plays the answer information in voice form. For example, the user asks "what day is it tomorrow", and the intelligent voice assistant answers "Thursday".
In the related art, the voice information output by the electronic device (including a smart speaker, a smart phone, a tablet computer, etc.) is not natural enough.
Disclosure of Invention
The embodiment of the application provides a voice synthesis method, a voice synthesis device, electronic equipment and a storage medium.
In a first aspect, the present application provides a speech synthesis method, including: determining a target emotion that characterizes an emotion that the synthesized speech is expected to have; acquiring the voiceprint characteristics of the target emotion, wherein the voiceprint characteristics of the target emotion represent the voiceprint characteristics of a voice signal sent by a user under the condition that the user is in the target emotion; and synthesizing information to be synthesized based on the voiceprint characteristics of the target emotion to obtain the synthesized voice.
In a second aspect, the present application provides a speech synthesis apparatus, comprising: an emotion determination module, a feature acquisition module and a synthesis processing module. The emotion determination module is configured to determine a target emotion characterizing the emotion that the synthesized speech is expected to have. The feature acquisition module is configured to acquire the voiceprint feature of the target emotion, wherein the voiceprint feature of the target emotion represents the voiceprint feature of a voice signal uttered by the user when the user is in the target emotion. The synthesis processing module is configured to synthesize information to be synthesized based on the voiceprint feature of the target emotion to obtain the synthesized speech.
In a third aspect, the present application further provides an electronic device, which includes a processor and a memory, where the memory stores computer program instructions, and the computer program instructions, when called by the processor, execute the method.
In a fourth aspect, the present application also provides a computer-readable storage medium having program code stored thereon, wherein the method is performed when the program code is executed by a processor.
In a fifth aspect, the present application also provides a computer program product, which when executed, implements the above method.
The embodiment of the application provides a speech synthesis method, which includes determining the emotion that the synthesized voice is expected to have, then acquiring the voiceprint feature of the voice signal uttered by the user when the user is in that emotion, and finally synthesizing the information to be synthesized based on the voiceprint feature to obtain synthesized voice capable of expressing the emotion, so that the electronic device can simulate human emotion when subsequently playing the synthesized voice, making the synthesized voice output by the electronic device more natural and more expressive.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings required to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the description below are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
FIG. 1 is a schematic diagram of an implementation environment provided by an embodiment of the present application.
Fig. 2 is a flowchart of a speech synthesis method according to an embodiment of the present application.
Fig. 3 is a flowchart of a speech synthesis method according to another embodiment of the present application.
Fig. 4 is a flowchart of a speech synthesis method according to another embodiment of the present application.
Fig. 5 is a flowchart of a speech synthesis method according to another embodiment of the present application.
Fig. 6 is a flowchart of a speech synthesis method according to another embodiment of the present application.
Fig. 7 is a block diagram of a speech synthesis apparatus according to an embodiment of the present application.
Fig. 8 is a block diagram of an electronic device provided in an embodiment of the present application.
FIG. 9 is a block diagram of a computer-readable storage medium provided by one embodiment of the present application.
Detailed Description
Reference will now be made in detail to embodiments of the present application, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the accompanying drawings are exemplary only for explaining the present application and are not to be construed as limiting the present application.
In order to make those skilled in the art better understand the technical solutions of the present application, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application. It is to be understood that the embodiments described are only a few embodiments of the present application and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Fig. 1 is a schematic diagram of an implementation environment provided by an embodiment of the present application. The implementation environment includes an electronic device 110. The electronic device 110 may be a terminal device such as a smart phone, a tablet computer, a smart watch, or a smart speaker. The electronic device 110 may also be a server. In the embodiment of the present application, only the electronic device 110 is taken as an example for description.
The electronic device 110 has an emotion recognition function and a voice processing function. The emotion recognition function includes determining the current emotion of a speaker from the audio and video information of the speaker and determining the emotion contained in text from the text; the emotion includes but is not limited to: happy, depressed, angry, calm, frightened, contemptuous, confused, surprised, disgusted, and the like. The voice processing functions include text-to-speech, emotion addition, loudness compensation, denoising, and the like. Text-to-speech refers to converting information from text form to speech form. Emotion addition means performing emotion processing on a voice signal so that the processed voice signal can simulate human emotion. Loudness compensation refers to increasing the loudness of a voice signal. Denoising refers to removing noise components from a voice signal.
In some embodiments, in a case where the electronic device 110 is a terminal device, the electronic device 110 is installed with a specified application program, and the emotion recognition function and the voice processing function described above are implemented by the specified application program. The specified application may be an intelligent voice assistant, a navigation-type application, an audio book-type application, and the like.
Optionally, the electronic device 110 further has an audio and video capture function. For example, the electronic device 110 is provided with an audio collecting device (such as a microphone) and a video collecting device (such as a camera), wherein the audio collecting device collects audio information of a speaker, and the video collecting device collects image information of the speaker during speaking.
In some embodiments, the implementation environment further comprises an emotion database (not shown in fig. 1) for storing mapping relationships between different emotions and different voiceprint features. In some embodiments, the emotion database is also used to store mapping relationships between different emotions and different facial features. The emotion database may be set locally in the electronic device 110, or may be independent of the electronic device 110, for example, set in a cloud server, which is not limited in this embodiment of the present application.
The embodiment of the application provides a speech synthesis method, which includes determining the emotion that the synthesized voice is expected to have, then acquiring the voiceprint feature of the voice signal uttered by the user when the user is in that emotion, and finally synthesizing the information to be synthesized based on the voiceprint feature to obtain synthesized voice capable of expressing the emotion, so that the electronic device can simulate human emotion when subsequently playing the synthesized voice, making the synthesized voice output by the electronic device more natural and more expressive.
The technical scheme provided by the embodiment of the application can be applied to voice man-machine interaction scenes, such as products of intelligent voice assistants, audio books and the like. The application of the scheme provided by the embodiment of the application to the above products is explained below.
(1) Intelligent voice assistant
The intelligent voice assistant determines, based on the audio and video information of the user when the user asks a question, that the current emotion of the user is "anxious", and determines, based on the current emotion of the user, that the emotion to be expressed when the synthesized voice is played is "soothing". It obtains the answer information "Don't worry, it is 8:30 now, and there are still three hours before the time set on the schedule", converts the answer information from text form into voice form to obtain the information to be synthesized, and then synthesizes the information to be synthesized based on the voiceprint feature of "soothing", so that the finally played synthesized voice expresses the emotion of "soothing".
(2) Audio book
The information to be synthesized to be played in the audio book is, for example, a sentence describing a character sadly watching another person's receding figure. The audio book determines that the emotion to be expressed when playing the information to be synthesized is "sad", and then synthesizes the information to be synthesized based on the voiceprint feature of "sad", so that the finally played synthesized voice expresses the emotion of "sad".
Fig. 2 is a flowchart of a speech synthesis method provided in an embodiment of the present application. The method comprises the following steps:
step 201, determining a target emotion.
The target emotion characterizes the emotion that the synthesized speech is expected to have. In some embodiments, the target emotion is set by the electronic device by default or by user customization.
In other embodiments, the target user asks a question, the electronic device collects audio and video information of the target user while the target user asks the question, determines, based on the audio and video information, the emotion expressed by the target user in the process of asking the question (i.e., the emotion of the target user), and finally determines the emotion of the answer voice information (i.e., the synthesized voice) based on the emotion of the target user. The process of determining the emotion of the target user based on audio and video information and the process of determining the target emotion based on the emotion of the target user will be explained in the following embodiments.
In still other embodiments, the electronic device determines the target emotion based on keywords in the text information to be converted to speech form. Optionally, the electronic device performs semantic analysis on the text information to obtain keywords in the text information, then obtains emotions included in the text information based on a mapping relationship between the keywords and the emotions, and finally determines a target emotion based on the emotions included in the text information. Alternatively, the electronic device directly determines the emotion included in the text information as the target emotion.
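For illustration, a minimal sketch of the keyword-based determination is given below; the keyword table, emotion labels and function name are assumptions introduced here and are not part of the original disclosure.

```python
# Illustrative sketch only: the keyword-to-emotion table and labels are assumed.
KEYWORD_EMOTION_MAP = {
    "hurry": "anxious",
    "late": "anxious",
    "congratulations": "happy",
    "unfortunately": "sad",
}

def emotion_from_text(text: str, default: str = "calm") -> str:
    """Return the emotion implied by keywords found in the text information."""
    for word in text.lower().split():
        word = word.strip(".,!?")
        if word in KEYWORD_EMOTION_MAP:
            return KEYWORD_EMOTION_MAP[word]
    return default

# e.g. emotion_from_text("Hurry up, we are late!") -> "anxious"
```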
In a case where the electronic device is a terminal device, the electronic device acquires human-computer interaction instructions triggered by the user in different scenes, and starts the technical solution provided by the embodiment of the application based on the human-computer interaction instructions. In the audio book scene, after receiving a reading-aloud instruction, the electronic device starts executing from the step of determining the target emotion based on the reading-aloud instruction. In the intelligent voice assistant scene, after the electronic device receives a wake-up instruction for the intelligent voice assistant and detects that the user utters question information, the electronic device starts executing from the step of determining the target emotion.
In a case where the electronic device is a server, the electronic device receives a speech synthesis request sent by a terminal based on a human-computer interaction instruction, where the speech synthesis request carries the information to be synthesized and related information for determining the target emotion (such as the audio and video information collected when the user asks a question, the text to be read aloud, and the like), and the server starts executing from the step of determining the target emotion based on the speech synthesis request.
Further, in a case where the electronic device is a terminal device, the electronic device provides an emotional voice output function. After the user switches this function on, the electronic device acquires human-computer interaction instructions triggered by the user in different scenes, and starts the technical solution provided by the embodiment of the application based on the human-computer interaction instructions, or sends a speech synthesis request to the server based on the human-computer interaction instructions. If the electronic device is a terminal device provided with a display screen, such as a smart phone or a tablet computer, it displays a control of the emotional voice output function and enables the function based on a specified operation signal acting on the control. If the electronic device is a terminal device without a display screen, such as a smart speaker, a smart phone that is in communication connection with the smart speaker (and is installed with a control application of the smart speaker) displays the control of the emotional voice output function, and the smart phone enables the function based on a specified operation signal acting on the control. In this way, the user can choose whether to enable the emotional voice output function according to actual needs, which meets the personalized requirements of the user.
Step 202, obtaining the voiceprint characteristics of the target emotion.
The voiceprint feature of the target emotion characterizes the voiceprint feature of a voice signal uttered by the user when the user is in the target emotion. Voiceprint features of the target emotion include, but are not limited to: fundamental frequency, loudness, timbre, utterance duration, rhythm, energy, and so forth.
In a possible implementation, the electronic device is in communication connection with an external emotion library and sends a feature acquisition request to the external emotion library, where the feature acquisition request carries an emotion identifier of the target emotion, and the external emotion library returns the voiceprint feature of the target emotion to the electronic device based on the feature acquisition request. In another possible implementation, the electronic device reads the voiceprint feature of the target emotion from a local emotion library, indexed by the emotion identifier of the target emotion. The external emotion library or the local emotion library stores a first mapping relationship between different emotions and different voiceprint features. In some embodiments, the external emotion library or the local emotion library also stores a second mapping relationship between different emotions and different facial features. The construction process of the first mapping relationship and the second mapping relationship is explained below.
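A minimal sketch of the feature acquisition described above is shown below, assuming the local emotion library is a dictionary and the external library is reachable over a hypothetical HTTP endpoint; the feature names and values are illustrative.

```python
from typing import Optional
import requests  # used only for the external emotion library case

# Illustrative first mapping relationship: emotion identifier -> voiceprint features.
LOCAL_EMOTION_LIBRARY = {
    "soothing": {"fundamental_frequency_hz": 180.0, "loudness_db": 55.0, "rhythm": 0.9},
    "sad": {"fundamental_frequency_hz": 160.0, "loudness_db": 50.0, "rhythm": 0.8},
}

def get_voiceprint_features(emotion_id: str, external_url: Optional[str] = None) -> dict:
    """Return the voiceprint features of the target emotion, indexed by its identifier.

    Reads the local emotion library first; otherwise sends a feature acquisition
    request carrying the emotion identifier to the external emotion library.
    """
    if emotion_id in LOCAL_EMOTION_LIBRARY:
        return LOCAL_EMOTION_LIBRARY[emotion_id]
    if external_url is not None:
        # Hypothetical endpoint of the external emotion library.
        response = requests.get(external_url, params={"emotion_id": emotion_id}, timeout=5)
        response.raise_for_status()
        return response.json()
    raise KeyError(f"no voiceprint features stored for emotion '{emotion_id}'")
```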
In some embodiments, the electronic device includes the emotional voice output function. When the user triggers the function to turn on for the first time, the electronic device guides the user to utter voice signals under different emotions and collects these voice signals, performs feature extraction on the collected voice signals to obtain the mapping relationship between the user's emotions and voiceprint features, and sends this mapping relationship to the cloud. The cloud performs statistical analysis on the mapping relationships between emotions and voiceprint features of a plurality of users, finally obtains the first mapping relationship, and stores it in the emotion library. Taking loudness as an example of a voiceprint feature: in general, the loudness of a voice signal uttered by a user under an angry emotion is higher; the cloud performs statistical analysis on the loudness of each user under the angry emotion, and determines the mean or median of the loudness of the users under the angry emotion as the loudness corresponding to the angry emotion.
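The cloud-side statistical analysis could be sketched as follows; the choice between mean and median and the sample values are illustrative assumptions.

```python
from statistics import mean, median

def aggregate_feature(per_user_values: dict[str, list[float]], use_median: bool = False) -> dict[str, float]:
    """Collapse the values of one voiceprint feature (e.g. loudness) observed for
    many users under each emotion into a single library value per emotion."""
    return {
        emotion: (median(values) if use_median else mean(values))
        for emotion, values in per_user_values.items()
    }

# e.g. aggregate_feature({"angry": [72.0, 75.5, 70.1]}) -> {"angry": 72.53...}
```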
Optionally, the electronic device may further collect face images of the user under different emotions, perform feature extraction on the face images to obtain the mapping relationship between the user's emotions and facial features, and send this mapping relationship to the cloud. The cloud performs statistical analysis on the mapping relationships between emotions and facial features of a plurality of users to obtain the second mapping relationship, and stores it in the emotion library. In one example, a user in a "confused" emotion typically shows furrowed brows and drooping corners of the mouth, and the electronic device can extract these facial features from face images of the user under the "confused" emotion.
In other embodiments, the cloud acquires existing video or audio data, and a technician may perform emotion annotation on the video or audio data to obtain audio or video data under each emotion; the cloud then performs voiceprint feature extraction on the audio or video data under each emotion, performs statistical analysis on the extracted voiceprint features to obtain the voiceprint features of each emotion, and stores them in the emotion library. Optionally, the cloud may further perform facial feature extraction on the video data under each emotion, perform statistical analysis on the extracted facial features to obtain the facial features of each emotion, and store them in the emotion library.
And step 203, synthesizing the information to be synthesized based on the voiceprint characteristics of the target emotion to obtain synthesized voice.
The information to be synthesized may be in voice form or text form, which is not limited in the embodiment of the present application. The information to be synthesized may be determined according to the actual scene. In the audio book scene, if the information to be synthesized is in voice form, it is obtained by performing voice conversion processing on the text information to be read aloud according to a first default voiceprint feature; if the information to be synthesized is in text form, it is the text information to be read aloud. In the intelligent voice assistant scene, if the information to be synthesized is in voice form, it is obtained by performing voice conversion processing on the answer information for the question information according to a second default voiceprint feature; if the information to be synthesized is in text form, it is the answer information in text form. The first default voiceprint feature and the second default voiceprint feature may be set by the electronic device by default or may be customized by the user, which is not limited in the embodiment of the present application.
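As a sketch of the scene-dependent preparation just described (the TTS routine is injected as a parameter because the embodiment does not prescribe a specific conversion algorithm; the signature is an assumption):

```python
from typing import Callable, Union

def prepare_information_to_be_synthesized(text: str, speech_form: bool,
                                          default_voiceprint: dict,
                                          tts: Callable[[str, dict], bytes]) -> Union[str, bytes]:
    """Return the information to be synthesized.

    `text` is the text to be read aloud (audio book scene) or the answer text
    (intelligent voice assistant scene). If voice form is required, the text is
    first converted with the scene's default voiceprint feature; otherwise the
    text itself is the information to be synthesized.
    """
    return tts(text, default_voiceprint) if speech_form else text
```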
Optionally, the electronic device outputs the synthesized voice after obtaining it, for example, plays the synthesized voice through a speaker. In the embodiment of the application, the electronic device synthesizes the information to be synthesized based on the voiceprint feature of the voice signal uttered by the user under the target emotion to obtain the synthesized voice, so that the user can feel that the synthesized voice has the target emotion when the electronic device outputs it, and the human-computer interaction between the user and the electronic device is more natural and more expressive.
To sum up, in the technical solution provided by the embodiment of the application, the emotion that the synthesized voice is expected to have is determined, the voiceprint feature of the voice signal uttered by the user when the user is in that emotion is acquired, and finally the information to be synthesized is synthesized based on the voiceprint feature to obtain synthesized voice capable of expressing the emotion, so that the electronic device can also simulate human emotion when subsequently playing the synthesized voice, making the synthesized voice output by the electronic device more natural and more expressive.
When the electronic device synthesizes the information to be synthesized, the steps of adjusting the voiceprint features, compensating the loudness, denoising, and the like need to be completed, so that the finally output synthesized voice can be guaranteed to simulate the target emotion, have moderate loudness and be sufficiently clear, which can improve the experience of voice human-computer interaction.
Fig. 3 is a flowchart of a speech synthesis method according to an embodiment of the present application. The method comprises the following steps:
step 301, determining a target emotion.
The target emotion characterizes the emotion that the synthesized speech is expected to have.
And step 302, acquiring the voiceprint characteristics of the target emotion.
The voiceprint feature of the target emotion characterizes a voiceprint feature of a speech signal uttered by the user in the case of the target emotion.
Step 303, adjusting the voiceprint characteristics of the information to be synthesized according to the voiceprint characteristics of the target emotion to obtain a first intermediate voice.
In the embodiment of the present application, the information to be synthesized is in voice form. The electronic device adjusts the voiceprint features of the information to be synthesized to the voiceprint features of the target emotion to obtain the first intermediate voice, and the target emotion can be simulated when the first intermediate voice is played. In some embodiments, the electronic device adjusts the spectrogram of the information to be synthesized, so that the voiceprint features contained in the adjusted spectrogram are the voiceprint features of the target emotion. In other embodiments, the information to be synthesized includes a field for indicating each voiceprint feature, and the electronic device modifies the value of the field to the value of the corresponding voiceprint feature of the target emotion, thereby adjusting the voiceprint features of the information to be synthesized according to the voiceprint features of the target emotion.
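A minimal sketch of the field-based variant, assuming the information to be synthesized is represented as a dictionary of named voiceprint-feature fields (the representation is an assumption; the spectrogram-based variant is not shown):

```python
def adjust_voiceprint_features(to_be_synthesized: dict, target_features: dict) -> dict:
    """Overwrite each voiceprint-feature field of the information to be synthesized
    with the corresponding value from the target emotion, yielding the first
    intermediate voice."""
    first_intermediate = dict(to_be_synthesized)           # keep the input unchanged
    for feature_name, target_value in target_features.items():
        first_intermediate[feature_name] = target_value    # e.g. fundamental frequency, loudness
    return first_intermediate
```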
And step 304, carrying out loudness compensation processing on the first intermediate voice to obtain a second intermediate voice.
The loudness parameter of the second intermediate voice is greater than that of the first intermediate voice. Loudness compensation means increasing the loudness parameter of the first intermediate voice. Since the voiceprint feature adjustment in step 303 may cause loudness loss, which could make the voice hard for the user to hear clearly during subsequent playback, loudness compensation is required.
In some embodiments, the first intermediate voice includes a field for indicating its loudness parameter, and the electronic device performs the loudness compensation processing on the first intermediate voice by increasing the value of this field.
In some embodiments, the electronic device performs loudness compensation processing on the first intermediate voice if it detects that the loudness parameter of the first intermediate voice is less than a first preset value. The first preset value is set according to experiments or experience, which is not limited in the embodiment of the present application. In this way, unnecessary loudness compensation processing can be avoided, processing resources of the electronic device are saved, the situation that the loudness of the voice information to be played is too high can be avoided, and the experience of voice human-computer interaction is improved.
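Continuing the same dictionary representation (an assumption), the conditional loudness compensation of step 304 could look like this; the gain value and field name are illustrative:

```python
def compensate_loudness(first_intermediate: dict, first_preset_value: float,
                        gain_db: float = 6.0) -> dict:
    """Increase the loudness parameter only when it is below the first preset value,
    producing the second intermediate voice."""
    second_intermediate = dict(first_intermediate)
    if second_intermediate.get("loudness_db", 0.0) < first_preset_value:
        second_intermediate["loudness_db"] += gain_db
    return second_intermediate
```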
And 305, denoising the second intermediate voice to obtain a synthesized voice.
The denoising process refers to removing the noise component of the second intermediate voice. New noise components may be introduced in the voiceprint feature adjustment and loudness compensation processes, and the information to be synthesized may itself contain noise components, so denoising is required to ensure the clarity of the synthesized voice. Algorithms for denoising the second intermediate voice include: Gaussian filtering algorithms, denoising algorithms based on the LMS framework, deep-learning denoising algorithms, and the like, which are not limited in the embodiments of the present application.
In some embodiments, the electronic device performs denoising processing on the second intermediate voice when it detects that the proportion of the noise component of the second intermediate voice is greater than a second preset value. The second preset value is set according to experiments or experience, which is not limited in the embodiment of the present application; for example, the second preset value is 30%. In this way, unnecessary denoising processing can be avoided, processing resources of the electronic device are saved, the situation that the noise of the voice information to be played is too loud can be avoided, the clarity of the voice information to be played (i.e., the synthesized voice) is guaranteed, and the experience of voice human-computer interaction is improved.
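For step 305, a sketch of the threshold-gated denoising is given below; a simple moving-average filter stands in for the Gaussian-filtering, LMS-based or deep-learning denoisers named above, and the noise-proportion estimate is assumed to come from a separate detector.

```python
def denoise_if_needed(samples: list[float], noise_proportion: float,
                      second_preset_value: float = 0.30, window: int = 5) -> list[float]:
    """Denoise the second intermediate voice only when the proportion of its noise
    component exceeds the second preset value (e.g. 30%)."""
    if noise_proportion <= second_preset_value:
        return samples                                # skip unnecessary denoising
    half = window // 2
    denoised = []
    for i in range(len(samples)):
        segment = samples[max(0, i - half): i + half + 1]
        denoised.append(sum(segment) / len(segment))  # moving-average stand-in
    return denoised
```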
In summary, according to the technical scheme provided by the embodiment of the application, the voiceprint feature adjustment, the loudness compensation processing and the denoising processing are sequentially performed on the information to be synthesized, so that the finally output synthesized voice can be guaranteed to simulate the target emotion, the loudness is moderate, the voice is clear enough, and the experience of voice man-machine interaction can be improved.
In the intelligent voice assistant scene, after a user sends a question to the intelligent voice assistant, the intelligent voice assistant can determine the emotion of the user when the user sends the question and determine the emotion required to be expressed when the answer information is played based on the emotion of the user, so that the man-machine interaction process of the user and the intelligent voice assistant is more natural and richer in expressive force.
Fig. 4 is a flowchart of a speech synthesis method according to an embodiment of the present application. The method comprises the following steps:
step 401, acquiring audio and video information in the process of sending out question voice information by a target user.
The information to be synthesized is answer voice information for the question voice information. The audio and video information in the process of the target user uttering the question voice information includes audio information and/or video information collected while the target user utters the question voice information. In some embodiments, after detecting that the user utters question voice information, the electronic device starts the image collecting device to collect video information including face images of the user, and starts the sound collecting device to collect the question voice information uttered by the user. Optionally, the electronic device continuously monitors the sound signal, and determines that the user has uttered question voice information after detecting that the sound signal contains a specified keyword, where the specified keyword may be the name of the intelligent voice assistant.
And 402, determining the emotion of the target user based on the audio and video information.
The emotion of the target user represents the emotion expressed by the target user in the process of issuing the question voice information.
In some embodiments, in a case that the audio-video information includes audio information, the electronic device extracts voiceprint features from the audio information, then sequentially calculates a similarity between the extracted voiceprint features and voiceprint features of at least one emotion, and determines an emotion, for which the similarity with the extracted voiceprint features meets a first preset condition, as an emotion of the target user.
The voiceprint feature extraction algorithm may be a wavelet-transform-based voiceprint feature extraction algorithm, a linear-prediction-coefficient-based voiceprint feature extraction algorithm, a perceptual-linear-prediction-based voiceprint feature extraction algorithm, or the like, which is not limited in the embodiment of the present application.
The voiceprint features of at least one emotion may be obtained from the external emotion library or from the local emotion library; the obtaining process is explained in step 202 and is not repeated here.
In some embodiments, the extracted voiceprint features and the voiceprint features of the at least one emotion are feature vectors of the same dimension, and the electronic device calculates a distance between the two feature vectors to determine the similarity between the extracted voiceprint features and the voiceprint features of the at least one emotion. The distance includes a cosine distance, a Euclidean distance, and the like, which is not limited in the embodiment of the present application.
The first preset condition may be that the emotion with the greatest similarity to the extracted voiceprint features is determined as the emotion of the target user, or that the emotion with the greatest similarity to the extracted voiceprint features and greater than the first preset similarity is determined as the emotion of the target user. The first predetermined similarity is set according to experiments or experience, and the embodiment of the present application is not limited thereto.
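The similarity matching described above could be sketched as below, using cosine similarity between feature vectors of the same dimension; the 0.8 threshold is an illustrative stand-in for the first preset similarity.

```python
import math
from typing import Optional

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def match_emotion(extracted: list[float], emotion_voiceprints: dict[str, list[float]],
                  first_preset_similarity: float = 0.8) -> Optional[str]:
    """Return the emotion whose voiceprint features are most similar to the extracted
    ones, or None if even the best match stays below the first preset similarity."""
    best_emotion, best_similarity = None, -1.0
    for emotion, features in emotion_voiceprints.items():
        similarity = cosine_similarity(extracted, features)
        if similarity > best_similarity:
            best_emotion, best_similarity = emotion, similarity
    return best_emotion if best_similarity >= first_preset_similarity else None
```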
In some embodiments, where the audio-video information comprises video information, the electronic device extracts facial features from the video information, acquires the similarity between the extracted facial features and the facial features of at least one emotion, and determines the emotion whose similarity to the extracted facial features meets a second preset condition as the emotion of the target user.
The facial feature extraction algorithm includes a statistical-analysis-based facial feature extraction algorithm, a deep-learning-based facial feature extraction algorithm, and the like. The facial features of at least one emotion may be obtained from the external emotion library or from the local emotion library; the obtaining process is explained in step 202 and is not repeated here.
In some embodiments, the extracted facial features and the facial features of the at least one emotion are feature vectors of the same dimension, and the electronic device calculates a distance between the two feature vectors to determine the similarity between the extracted facial features and the facial features of the at least one emotion. The distance includes a cosine distance, a Euclidean distance, and the like, which is not limited in the embodiment of the present application.
The second preset condition may be that the emotion with the greatest similarity to the extracted facial features is determined as the emotion of the target user, or that the emotion with the greatest similarity to the extracted facial features and greater than the second preset similarity is determined as the emotion of the target user. The second predetermined similarity is set according to experiments or experience, and is not limited in the embodiments of the present application.
In still other embodiments, in a case where the audio-video information includes video information, the electronic device performs recognition processing on at least one face image included in the video information through an emotion recognition model to obtain the emotion of the target user. In some embodiments, the electronic device inputs the face images included in the video information into the emotion recognition model, the emotion recognition model outputs the probability that the emotion of the target user belongs to each emotion tag, and the emotion tag with the highest probability is determined as the emotion of the target user.
The emotion recognition model is obtained by training a deep learning network with training sample images labeled with emotion tags. The training process of the emotion recognition model is as follows: the electronic device obtains a preset number of training sample images, each labeled with an emotion tag, and inputs the training sample images into an initial model, which outputs predicted emotion tags; the parameters of the initial model are adjusted based on the error between the predicted emotion tags and the labeled emotion tags and a preset loss function, and the training sample images are input into the initial model again, until the iteration stopping condition is met.
The preset number is set according to the precision requirement of the emotion recognition model; the higher the precision requirement, the larger the preset number. The iteration stopping condition may be that the number of iterations exceeds a preset number of times, or that the error between the predicted emotion tags and the labeled emotion tags is smaller than a preset error. The preset number of times is likewise set according to the precision requirement of the emotion recognition model; the higher the precision requirement, the larger the preset number of times. The preset error is also set according to the precision requirement of the emotion recognition model; the higher the precision requirement, the smaller the preset error.
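The training procedure could be sketched as follows; a small PyTorch classifier is used as a stand-in for the deep learning network, and the layer sizes, learning rate and stopping thresholds are assumptions.

```python
import torch
from torch import nn

def train_emotion_recognition_model(images: torch.Tensor, labels: torch.Tensor,
                                    num_emotions: int, preset_times: int = 1000,
                                    preset_error: float = 1e-3) -> nn.Module:
    """Train an emotion recognition model on face images labeled with emotion tags.

    Iteration stops when the number of iterations reaches the preset number of
    times or the error between predicted and labeled emotion tags falls below
    the preset error, matching the stopping conditions described above.
    """
    model = nn.Sequential(                       # stand-in for the deep learning network
        nn.Flatten(),
        nn.Linear(images[0].numel(), 128),
        nn.ReLU(),
        nn.Linear(128, num_emotions),
    )
    loss_fn = nn.CrossEntropyLoss()              # the preset loss function
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(preset_times):
        optimizer.zero_grad()
        loss = loss_fn(model(images), labels)    # error between predicted and labeled tags
        loss.backward()
        optimizer.step()
        if loss.item() < preset_error:
            break
    return model
```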
In other possible embodiments, the electronic device determines the emotion of the target user comprehensively based on both the video information and the audio information. Specifically, the electronic device obtains the similarity between the extracted voiceprint features and the voiceprint features of at least one emotion, determines, based on the similarity, a first probability that the emotion of the target user belongs to each emotion tag, outputs through the emotion recognition model a second probability that the emotion of the target user belongs to each emotion tag, performs weighted summation on the first probability and the second probability for each emotion tag to obtain a third probability that the emotion of the target user belongs to each emotion tag, and determines the emotion tag with the largest third probability as the emotion of the target user.
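The weighted summation of the audio-based and video-based probabilities could be sketched as follows; the equal weighting is an illustrative assumption.

```python
def fuse_emotion_probabilities(first_probability: dict[str, float],
                               second_probability: dict[str, float],
                               audio_weight: float = 0.5) -> str:
    """Combine, for each emotion tag, the first (audio-based) and second
    (video-based) probabilities into a third probability, and return the
    emotion tag with the largest third probability."""
    tags = set(first_probability) | set(second_probability)
    third_probability = {
        tag: audio_weight * first_probability.get(tag, 0.0)
             + (1.0 - audio_weight) * second_probability.get(tag, 0.0)
        for tag in tags
    }
    return max(third_probability, key=third_probability.get)
```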
Referring to fig. 5, which shows a schematic diagram of determining the emotion of a target user according to an embodiment of the present application. When the user speaks, the electronic device collects the voice signal and the expression signal in real time, then analyzes the voice signal through a voice data analysis module to obtain a voice data analysis report, analyzes the expression signal through a facial expression analysis module to obtain an expression analysis report, and finally produces a comprehensive analysis report from the voice data analysis report and the expression analysis report, where the comprehensive analysis report includes the determined emotion of the target user.
Instep 403, a target emotion is determined based on the emotion of the target user.
In some embodiments, the electronic device determines the emotion of the target user as the target emotion. Optionally, the electronic device determines the emotion of the target user as the target emotion in the audio book scene. For example, for the text to be read aloud "he watched her receding figure and cried sadly", the emotion determined by the electronic device is "sad", and the target emotion is also determined as "sad".
In other embodiments, the electronic device determines the target emotion based on a mapping relationship between the emotion of the target user and the target emotion. The mapping relationship between the emotion of the target user and the target emotion may be set by a relevant professional, such as a psychologist. Table 1 below exemplarily shows a mapping relationship between the emotion of the target user and the target emotion.
Optionally, the electronic device determines the target emotion based on the mapping relationship and the emotion of the target user in the intelligent voice assistant scene. For example, the electronic device determines that the target user's emotion is "anger", looks up the above mapping relationship, and determines that the target emotion is "peace".
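A sketch of the Table-1-style lookup is given below; only the "anxious" → "soothing" and "anger" → "peace" pairs come from the examples in this description, and the remaining entries are illustrative assumptions.

```python
# Partial stand-in for Table 1; entries marked "assumed" are not from the description.
EMOTION_TO_TARGET_EMOTION = {
    "anxious": "soothing",
    "anger": "peace",
    "sad": "comforting",   # assumed
    "happy": "happy",      # assumed
}

def target_emotion_for_user(user_emotion: str, default: str = "calm") -> str:
    """Look up the target emotion for the intelligent voice assistant scene."""
    return EMOTION_TO_TARGET_EMOTION.get(user_emotion, default)
```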
Step 404, obtaining the voiceprint characteristics of the target emotion.
The voiceprint feature of the target emotion characterizes a voiceprint feature of a speech signal uttered by the user in the case of the target emotion.
And 405, synthesizing the information to be synthesized based on the voiceprint characteristics of the target emotion to obtain synthesized voice.
Referring to fig. 6, a flow chart of a speech synthesis method provided by an embodiment of the present application is shown. The electronic equipment collects source voice data and a face image under the condition that a user sends question information, then carries out voice processing on the source voice data through a voice recognition algorithm, processes the face image through the image recognition algorithm, finally determines the emotion of a speaker based on the recognition results of the source voice data and the face image, then obtains the voiceprint characteristics of the emotion of the speaker from an emotion library, carries out voice synthesis processing on information to be synthesized based on the voiceprint characteristics, and finally outputs voice capable of expressing the emotion.
In summary, according to the technical scheme provided by the embodiment of the application, in the scene of the intelligent voice assistant, after the user issues the question to the intelligent voice assistant, the intelligent voice assistant can determine the emotion of the user when issuing the question, and determine the emotion required to be expressed when playing the answer information based on the emotion of the user, so that the human-computer interaction process of the user and the intelligent voice assistant is more natural and richer in expressive force.
Fig. 7 is a block diagram of a speech synthesis apparatus according to an embodiment of the present application. The speech synthesis apparatus includes: an emotion determination module 710, a feature acquisition module 720, and a synthesis processing module 730.
The emotion determination module 710 is configured to determine a target emotion characterizing the emotion that the synthesized voice is expected to have. The feature acquisition module 720 is configured to acquire the voiceprint feature of the target emotion, where the voiceprint feature of the target emotion represents the voiceprint feature of the voice signal uttered by the user when the user is in the target emotion. The synthesis processing module 730 is configured to synthesize the information to be synthesized based on the voiceprint feature of the target emotion to obtain the synthesized voice.
To sum up, in the technical solution provided by the embodiment of the application, the emotion that the synthesized voice is expected to have is determined, the voiceprint feature of the voice signal uttered by the user when the user is in that emotion is acquired, and finally the information to be synthesized is synthesized based on the voiceprint feature to obtain synthesized voice capable of expressing the emotion, so that the electronic device can also simulate human emotion when subsequently playing the synthesized voice, making the synthesized voice output by the electronic device more natural and more expressive.
In some embodiments, the synthesis processing module 730 is configured to: adjust the voiceprint features of the information to be synthesized according to the voiceprint features of the target emotion to obtain a first intermediate voice; perform loudness compensation processing on the first intermediate voice to obtain a second intermediate voice, where the loudness parameter of the second intermediate voice is greater than that of the first intermediate voice; and perform denoising processing on the second intermediate voice to obtain the synthesized voice.
In some embodiments, the synthesis processing module 730 is configured to: perform loudness compensation processing on the first intermediate voice to obtain the second intermediate voice in a case where the loudness parameter of the first intermediate voice is smaller than a first preset value; and perform denoising processing on the second intermediate voice to obtain the synthesized voice in a case where the proportion of the noise component in the second intermediate voice is greater than a second preset value.
In some embodiments, the emotion determination module 710 is configured to: acquire audio and video information in the process of the target user uttering question voice information, where the information to be synthesized is answer voice information for the question voice information; determine the emotion of the target user based on the audio and video information, where the emotion of the target user represents the emotion expressed by the target user in the process of uttering the question voice information; and determine the target emotion based on the emotion of the target user.
In some embodiments, the emotion determination module 710 is configured to: extract voiceprint features from the audio information; obtain the similarity between the extracted voiceprint features and the voiceprint features of at least one emotion; and determine the emotion whose similarity to the extracted voiceprint features meets a first preset condition as the emotion of the target user.
In some embodiments, the emotion determination module 710 is configured to: extract facial features from the video information; obtain the similarity between the extracted facial features and the facial features of at least one emotion; and determine the emotion whose similarity to the extracted facial features meets a second preset condition as the emotion of the target user.
In some embodiments, the emotion determination module 710 is configured to: perform recognition processing on at least one face image included in the video information through an emotion recognition model to obtain the emotion of the target user, where the emotion recognition model is obtained by training a deep learning network with training sample images labeled with emotion tags.
As shown in fig. 8, the present embodiment further provides an electronic device 800. The electronic device 800 may be a server, and includes a processor 810 and a memory 820. The memory 820 stores computer program instructions.
The processor 810 may include one or more processing cores. The processor 810 connects various parts of the electronic device 800 through various interfaces and circuits, and executes various functions of the electronic device 800 and processes data by running or executing instructions, programs, code sets, or instruction sets stored in the memory 820 and invoking data stored in the memory 820. Optionally, the processor 810 may be implemented in hardware using at least one of digital signal processing (DSP), a Field-Programmable Gate Array (FPGA), and a Programmable Logic Array (PLA). The processor 810 may integrate one or a combination of a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a modem, and the like. The CPU mainly handles the operating system, the user interface, application programs, and the like; the GPU is used for rendering and drawing display content; the modem is used to handle wireless communications. It can be understood that the modem may also not be integrated into the processor 810 and may instead be implemented by a separate communication chip.
The memory 820 may include Random Access Memory (RAM) or Read-Only Memory (ROM). The memory 820 may be used to store instructions, programs, code, code sets, or instruction sets. The memory 820 may include a program storage area and a data storage area, where the program storage area may store instructions for implementing an operating system, instructions for implementing at least one function (such as a touch function, a sound playing function, an image playing function, etc.), instructions for implementing the foregoing method embodiments, and the like. The data storage area may store data created by the electronic device during use (such as a phone book, audio and video data, and chat records).
Referring to fig. 9, a computer-readable storage medium 900 is further provided according to an embodiment of the present application, in which computer program instructions 910 are stored, and the computer program instructions 910 can be called by a processor to execute the method described in the above embodiments.
The computer-readable storage medium 900 may be an electronic memory such as a flash memory, an EEPROM (electrically erasable programmable read-only memory), an EPROM, a hard disk, or a ROM. Alternatively, the computer-readable storage medium 900 includes a non-volatile computer-readable storage medium. The computer-readable storage medium 900 has storage space for computer program instructions 910 that perform any of the method steps described above. The computer program instructions 910 may be read from or written to one or more computer program products. The computer program instructions 910 may be compressed in a suitable form.
Although the present application has been described with reference to preferred embodiments, it should be understood that various changes, substitutions and alterations can be made herein without departing from the spirit and scope of the application, and all changes, substitutions and alterations that fall within the spirit and scope of the application are to be understood as being covered by the following claims.

Claims (10)

1. A method of speech synthesis, the method comprising:
determining a target emotion that characterizes an emotion that the synthesized speech is expected to have;
acquiring the voiceprint characteristics of the target emotion, wherein the voiceprint characteristics of the target emotion represent the voiceprint characteristics of a voice signal sent by a user under the condition that the user is in the target emotion;
and synthesizing information to be synthesized based on the voiceprint characteristics of the target emotion to obtain the synthesized voice.
2. The method according to claim 1, wherein the synthesizing information to be synthesized based on the voiceprint feature corresponding to the target emotion to obtain the synthesized speech comprises:
adjusting the voiceprint characteristics of the information to be synthesized according to the voiceprint characteristics of the target emotion to obtain first intermediate voice;
carrying out loudness compensation processing on the first intermediate voice to obtain second intermediate voice, wherein the loudness parameter of the second intermediate voice is greater than that of the first intermediate voice;
and denoising the second intermediate voice to obtain the synthesized voice.
3. The method of claim 2, wherein performing loudness compensation on the first intermediate speech to obtain second intermediate speech comprises:
under the condition that the loudness parameter of the first intermediate voice is smaller than a first preset value, carrying out loudness compensation processing on the first intermediate voice to obtain a second intermediate voice;
the denoising processing of the second intermediate speech to obtain the synthesized speech includes:
and under the condition that the proportion of the noise component in the second intermediate voice is larger than a second preset value, denoising the second intermediate voice to obtain the synthesized voice.
4. The method of claim 1, wherein determining the target emotion comprises:
acquiring audio and video information in the process of sending out question voice information by a target user, wherein the information to be synthesized is answer information aiming at the question voice information;
based on the audio and video information, determining the emotion of the target user, wherein the emotion of the target user represents the emotion expressed by the target user in the process of sending out the question voice information;
determining the target emotion based on the emotion of the target user.
5. The method of claim 4, wherein the specified audiovisual information comprises audio information, and wherein determining the mood of the target user based on the audiovisual information comprises:
extracting voiceprint features from the audio information;
obtaining the similarity between the extracted voiceprint features and the voiceprint features of at least one emotion;
and determining the emotion whose similarity to the extracted voiceprint features meets a first preset condition as the emotion of the target user.
6. The method of claim 4, wherein the specified audiovisual information comprises video information, and wherein determining the mood of the target user based on the audiovisual information comprises:
extracting human face features from the video information;
acquiring the similarity between the extracted face features and at least one emotion face feature;
and determining the emotion whose similarity to the extracted human face features meets a second set condition as the emotion of the target user.
7. The method of claim 4, wherein the specified audiovisual information comprises video information, and wherein determining the mood of the target user based on the audiovisual information comprises:
identifying at least one face image included in the video image through an emotion identification model to obtain the emotion of the target user; the emotion recognition model is obtained by training the deep learning network through a training sample image labeled with an emotion label.
8. A speech synthesis apparatus, characterized in that the apparatus comprises:
an emotion determination module configured to determine a target emotion, the target emotion representing an emotion that the synthesized speech is expected to express;
a feature acquisition module configured to acquire a voiceprint feature of the target emotion, wherein the voiceprint feature of the target emotion represents the voiceprint feature of a speech signal uttered by a user while the user is in the target emotion;
and a synthesis processing module configured to synthesize information to be synthesized based on the voiceprint feature of the target emotion to obtain the synthesized speech.
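The three modules recited in claim 8 map naturally onto three cooperating components. The arrangement below is only one possible reading; the module boundaries follow the claim wording, while the callables they wrap are placeholders.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class SpeechSynthesisApparatus:
    """Mirrors the three modules recited in claim 8."""
    determine_emotion: Callable[[], str]     # emotion determination module
    fetch_voiceprint: Callable[[str], Any]   # feature acquisition module
    synthesize: Callable[[str, Any], bytes]  # synthesis processing module

    def run(self, text_to_synthesize: str) -> bytes:
        target_emotion = self.determine_emotion()
        voiceprint = self.fetch_voiceprint(target_emotion)
        return self.synthesize(text_to_synthesize, voiceprint)
```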
9. An electronic device, characterized by comprising a processor and a memory, wherein the memory stores computer program instructions which are invoked by the processor to perform the method of any one of claims 1 to 7.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores program code which is invoked by a processor to perform the method of any one of claims 1 to 7.
CN202210364869.9A | Priority date 2022-04-07 | Filing date 2022-04-07 | Speech synthesis method, speech synthesis device, electronic equipment and storage medium | Pending | CN114678003A (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202210364869.9A (CN114678003A (en)) | 2022-04-07 | 2022-04-07 | Speech synthesis method, speech synthesis device, electronic equipment and storage medium

Publications (1)

Publication Number | Publication Date
CN114678003A (en) | 2022-06-28

Family

ID=82077352

Family Applications (1)

Application Number | Title | Priority Date | Filing Date | Status
CN202210364869.9A (CN114678003A (en)) | Speech synthesis method, speech synthesis device, electronic equipment and storage medium | 2022-04-07 | 2022-04-07 | Pending

Country Status (1)

Country | Link
CN | CN114678003A (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN108922525A (en) * | 2018-06-19 | 2018-11-30 | Oppo广东移动通信有限公司 | Method of speech processing, device, storage medium and electronic equipment
CN110688911A (en) * | 2019-09-05 | 2020-01-14 | 深圳追一科技有限公司 | Video processing method, device, system, terminal equipment and storage medium
CN111653265A (en) * | 2020-04-26 | 2020-09-11 | 北京大米科技有限公司 | Speech synthesis method, device, storage medium and electronic device
CN112188378A (en) * | 2020-09-28 | 2021-01-05 | 维沃移动通信有限公司 | Electronic equipment sound production optimization method and device and electronic equipment
CN112883209A (en) * | 2019-11-29 | 2021-06-01 | 阿里巴巴集团控股有限公司 | Recommendation method and processing method, device, equipment and readable medium for multimedia data
CN112992147A (en) * | 2021-02-26 | 2021-06-18 | 平安科技(深圳)有限公司 | Voice processing method, device, computer equipment and storage medium
CN113327620A (en) * | 2020-02-29 | 2021-08-31 | 华为技术有限公司 | Voiceprint recognition method and device
CN113450804A (en) * | 2021-06-23 | 2021-09-28 | 深圳市火乐科技发展有限公司 | Voice visualization method and device, projection equipment and computer readable storage medium
WO2021232594A1 (en) * | 2020-05-22 | 2021-11-25 | 深圳壹账通智能科技有限公司 | Speech emotion recognition method and apparatus, electronic device, and storage medium

Legal Events

Code | Title
PB01 | Publication
SE01 | Entry into force of request for substantive examination
