
Speech synthesis method, speech synthesis device, electronic equipment and storage medium

Info

Publication number
CN114678003A
Authority
CN
China
Prior art keywords
emotion
voice
target
information
synthesized
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210364869.9A
Other languages
Chinese (zh)
Inventor
崔洋洋
余俊澎
王星宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Youmi Technology Shenzhen Co ltd
Original Assignee
Youmi Technology Shenzhen Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Youmi Technology Shenzhen Co ltd
Priority to CN202210364869.9A
Publication of CN114678003A
Legal status: Pending (current)

Abstract

The embodiment of the application provides a speech synthesis method, a speech synthesis device, an electronic device and a storage medium. The method comprises the following steps: determining a target emotion; acquiring a voiceprint feature of the target emotion, wherein the voiceprint feature of the target emotion represents the voiceprint feature of a voice signal uttered by a user when the user is in the target emotion; and synthesizing information to be synthesized based on the voiceprint feature of the target emotion to obtain synthesized voice. In the technical solution provided by the embodiment of the application, the emotion that the synthesized voice is expected to have is determined, the voiceprint feature of the voice signal uttered by the user when the user is in that emotion is then acquired, and finally the information to be synthesized is synthesized based on the voiceprint feature to obtain synthesized voice capable of expressing the emotion, so that the electronic device can also simulate human emotion when subsequently playing the synthesized voice, making the synthesized voice output by the electronic device more natural and more expressive.

Description

Speech synthesis method, speech synthesis device, electronic equipment and storage medium
Technical Field
The present application relates to the field of speech processing technologies, and in particular, to a speech synthesis method, apparatus, electronic device, and storage medium.
Background
At present, voice man-machine interaction is widely applied to daily life of people, such as intelligent voice assistants, audio books, voice navigation and the like.
Taking the intelligent voice assistant as an example, after a user wakes up the intelligent voice assistant and utters question information, the intelligent voice assistant collects the question information and performs semantic analysis on it to obtain answer information matching the question information, finally converts the answer information from text form into voice form, and plays the answer information in voice form. For example, the user asks "what day is it tomorrow", and the intelligent voice assistant answers "Thursday".
In the related art, the voice information output by the electronic device (including a smart speaker, a smart phone, a tablet computer, etc.) is not natural enough.
Disclosure of Invention
The embodiment of the application provides a voice synthesis method, a voice synthesis device, electronic equipment and a storage medium.
In a first aspect, the present application provides a speech synthesis method, including: determining a target emotion that characterizes an emotion that the synthesized speech is expected to have; acquiring the voiceprint characteristics of the target emotion, wherein the voiceprint characteristics of the target emotion represent the voiceprint characteristics of a voice signal sent by a user under the condition that the user is in the target emotion; and synthesizing information to be synthesized based on the voiceprint characteristics of the target emotion to obtain the synthesized voice.
In a second aspect, the present application provides a speech synthesis apparatus, comprising: an emotion determination module, a feature acquisition module and a synthesis processing module. The emotion determination module is configured to determine a target emotion characterizing the emotion that the synthesized speech is expected to have. The feature acquisition module is configured to acquire the voiceprint feature of the target emotion, wherein the voiceprint feature of the target emotion represents the voiceprint feature of a voice signal uttered by the user when the user is in the target emotion. The synthesis processing module is configured to synthesize information to be synthesized based on the voiceprint feature of the target emotion to obtain the synthesized speech.
In a third aspect, the present application further provides an electronic device, which includes a processor and a memory, where the memory stores computer program instructions, and the computer program instructions, when called by the processor, execute the method.
In a fourth aspect, the present application also provides a computer-readable storage medium having program code stored thereon, wherein the method is performed when the program code is executed by a processor.
In a fifth aspect, the present application also provides a computer program product, which when executed, implements the above method.
The embodiment of the application provides a speech synthesis method, which includes determining the emotion that the synthesized voice is expected to have, then acquiring the voiceprint feature of the voice signal uttered by the user when the user is in that emotion, and finally synthesizing the information to be synthesized based on the voiceprint feature to obtain synthesized voice capable of expressing the emotion, so that the electronic device can simulate human emotion when subsequently playing the synthesized voice, making the synthesized voice output by the electronic device more natural and more expressive.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings required to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the description below are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
FIG. 1 is a schematic diagram of an implementation environment provided by an embodiment of the present application.
Fig. 2 is a flowchart of a speech synthesis method according to an embodiment of the present application.
Fig. 3 is a flowchart of a speech synthesis method according to another embodiment of the present application.
Fig. 4 is a flowchart of a speech synthesis method according to another embodiment of the present application.
Fig. 5 is a flowchart of a speech synthesis method according to another embodiment of the present application.
Fig. 6 is a flowchart of a speech synthesis method according to another embodiment of the present application.
Fig. 7 is a block diagram of a speech synthesis apparatus according to an embodiment of the present application.
Fig. 8 is a block diagram of an electronic device provided in an embodiment of the present application.
FIG. 9 is a block diagram of a computer-readable storage medium provided by one embodiment of the present application.
Detailed Description
Reference will now be made in detail to embodiments of the present application, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the accompanying drawings are exemplary only for explaining the present application and are not to be construed as limiting the present application.
In order to make those skilled in the art better understand the technical solutions of the present application, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application. It is to be understood that the embodiments described are only a few embodiments of the present application and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Fig. 1 is a schematic diagram of an implementation environment provided by an embodiment of the present application. The implementation environment includes an electronic device 110. The electronic device 110 may be a terminal device such as a smart phone, a tablet computer, a smart watch, or a smart speaker. The electronic device 110 may also be a server. In the embodiment of the present application, only the electronic device 110 is taken as an example for description.
The electronic device 110 has an emotion recognition function and a voice processing function. The emotion recognition function includes determining the current emotion of a speaker from the audio and video information of the speaker and determining the emotion contained in text from the text; the emotion includes but is not limited to: happy, depressed, angry, calm, frightened, contemptuous, confused, surprised, disgusted, and the like. The voice processing functions include text-to-speech, emotion addition, loudness compensation, denoising, and the like. Text-to-speech refers to converting information from text form to speech form. Emotion addition means performing emotion processing on a voice signal so that the processed voice signal can simulate human emotion. Loudness compensation refers to increasing the loudness of a voice signal. Denoising refers to removing noise components from a voice signal.
In some embodiments, in a case where the electronic device 110 is a terminal device, the electronic device 110 is installed with a specified application program, and the emotion recognition function and the voice processing function described above are implemented by the specified application program. The specified application may be an intelligent voice assistant, a navigation-type application, an audio book-type application, and the like.
Optionally, the electronic device 110 further has an audio and video capture function. For example, the electronic device 110 is provided with an audio collecting device (such as a microphone) and a video collecting device (such as a camera), wherein the audio collecting device collects audio information of a speaker, and the video collecting device collects image information of the speaker during speaking.
In some embodiments, the implementation environment further comprises an emotion database (not shown in fig. 1) for storing mapping relationships between different emotions and different voiceprint features. In some embodiments, the emotion database is also used to store mapping relationships between different emotions and different facial features. The emotion database may be set locally in the electronic device 110, or may be independent of the electronic device 110, for example, set in a cloud server, which is not limited in this embodiment of the present application.
The embodiment of the application provides a speech synthesis method, which includes determining the emotion that the synthesized voice is expected to have, then acquiring the voiceprint feature of the voice signal uttered by the user when the user is in that emotion, and finally synthesizing the information to be synthesized based on the voiceprint feature to obtain synthesized voice capable of expressing the emotion, so that the electronic device can simulate human emotion when subsequently playing the synthesized voice, making the synthesized voice output by the electronic device more natural and more expressive.
The technical scheme provided by the embodiment of the application can be applied to voice man-machine interaction scenes, such as products of intelligent voice assistants, audio books and the like. The application of the scheme provided by the embodiment of the application to the above products is explained below.
(1) Intelligent voice assistant
The intelligent voice assistant determines, based on the audio and video information of the user when the user asks a question, that the current emotion of the user is "anxious", and determines, based on the current emotion of the user, that the emotion to be expressed when the synthesized voice is played is "soothing". It obtains the answer information "Don't worry, it is 8:30 now, and there are still three hours before the time set on the schedule", converts the answer information from text form into voice form to obtain the information to be synthesized, and then synthesizes the information to be synthesized based on the voiceprint feature of "soothing", so that the finally played synthesized voice expresses the emotion of "soothing".
(2) Audio book
The information to be synthesized to be played in the audio book is, for example, a sentence describing a character sadly watching another person's receding figure. The audio book determines that the emotion to be expressed when playing the information to be synthesized is "sad", and then synthesizes the information to be synthesized based on the voiceprint feature of "sad", so that the finally played synthesized voice expresses the emotion of "sad".
Fig. 2 is a flowchart of a speech synthesis method provided in an embodiment of the present application. The method comprises the following steps:
step 201, determining a target emotion.
The target emotion characterizes the emotion that the synthesized speech is expected to have. In some embodiments, the target emotion is set by the electronic device by default or by user customization.
In other embodiments, the target user asks a question, the electronic device collects audio and video information of the target user while the target user asks the question, determines, based on the audio and video information, the emotion expressed by the target user in the process of asking the question (i.e., the emotion of the target user), and finally determines the emotion of the answer voice information (i.e., the synthesized voice) based on the emotion of the target user. The process of determining the emotion of the target user based on audio and video information and the process of determining the target emotion based on the emotion of the target user will be explained in the following embodiments.
In still other embodiments, the electronic device determines the target emotion based on keywords in the text information to be converted to speech form. Optionally, the electronic device performs semantic analysis on the text information to obtain keywords in the text information, then obtains emotions included in the text information based on a mapping relationship between the keywords and the emotions, and finally determines a target emotion based on the emotions included in the text information. Alternatively, the electronic device directly determines the emotion included in the text information as the target emotion.
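For illustration, a minimal sketch of the keyword-based determination is given below; the keyword table, emotion labels and function name are assumptions introduced here and are not part of the original disclosure.

```python
# Illustrative sketch only: the keyword-to-emotion table and labels are assumed.
KEYWORD_EMOTION_MAP = {
    "hurry": "anxious",
    "late": "anxious",
    "congratulations": "happy",
    "unfortunately": "sad",
}

def emotion_from_text(text: str, default: str = "calm") -> str:
    """Return the emotion implied by keywords found in the text information."""
    for word in text.lower().split():
        word = word.strip(".,!?")
        if word in KEYWORD_EMOTION_MAP:
            return KEYWORD_EMOTION_MAP[word]
    return default

# e.g. emotion_from_text("Hurry up, we are late!") -> "anxious"
```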
In a case where the electronic device is a terminal device, the electronic device acquires human-computer interaction instructions triggered by the user in different scenes, and starts the technical solution provided by the embodiment of the application based on the human-computer interaction instructions. In the audio book scene, after receiving a reading-aloud instruction, the electronic device starts executing from the step of determining the target emotion based on the reading-aloud instruction. In the intelligent voice assistant scene, after the electronic device receives a wake-up instruction for the intelligent voice assistant and detects that the user utters question information, the electronic device starts executing from the step of determining the target emotion.
In a case where the electronic device is a server, the electronic device receives a speech synthesis request sent by a terminal based on a human-computer interaction instruction, where the speech synthesis request carries the information to be synthesized and related information for determining the target emotion (such as the audio and video information collected when the user asks a question, the text to be read aloud, and the like), and the server starts executing from the step of determining the target emotion based on the speech synthesis request.
Further, in a case where the electronic device is a terminal device, the electronic device provides an emotional voice output function. After the user switches this function on, the electronic device acquires human-computer interaction instructions triggered by the user in different scenes, and starts the technical solution provided by the embodiment of the application based on the human-computer interaction instructions, or sends a speech synthesis request to the server based on the human-computer interaction instructions. If the electronic device is a terminal device provided with a display screen, such as a smart phone or a tablet computer, it displays a control of the emotional voice output function and enables the function based on a specified operation signal acting on the control. If the electronic device is a terminal device without a display screen, such as a smart speaker, a smart phone that is in communication connection with the smart speaker (and is installed with a control application of the smart speaker) displays the control of the emotional voice output function, and the smart phone enables the function based on a specified operation signal acting on the control. In this way, the user can choose whether to enable the emotional voice output function according to actual needs, which meets the personalized requirements of the user.
Step 202, obtaining the voiceprint characteristics of the target emotion.
The voiceprint feature of the target emotion characterizes the voiceprint feature of a voice signal uttered by the user when the user is in the target emotion. Voiceprint features of the target emotion include, but are not limited to: fundamental frequency, loudness, timbre, utterance duration, rhythm, energy, and so forth.
In a possible implementation, the electronic device is in communication connection with an external emotion library and sends a feature acquisition request to the external emotion library, where the feature acquisition request carries an emotion identifier of the target emotion, and the external emotion library returns the voiceprint feature of the target emotion to the electronic device based on the feature acquisition request. In another possible implementation, the electronic device reads the voiceprint feature of the target emotion from a local emotion library, indexed by the emotion identifier of the target emotion. The external emotion library or the local emotion library stores a first mapping relationship between different emotions and different voiceprint features. In some embodiments, the external emotion library or the local emotion library also stores a second mapping relationship between different emotions and different facial features. The construction process of the first mapping relationship and the second mapping relationship is explained below.
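A minimal sketch of the feature acquisition described above is shown below, assuming the local emotion library is a dictionary and the external library is reachable over a hypothetical HTTP endpoint; the feature names and values are illustrative.

```python
from typing import Optional
import requests  # used only for the external emotion library case

# Illustrative first mapping relationship: emotion identifier -> voiceprint features.
LOCAL_EMOTION_LIBRARY = {
    "soothing": {"fundamental_frequency_hz": 180.0, "loudness_db": 55.0, "rhythm": 0.9},
    "sad": {"fundamental_frequency_hz": 160.0, "loudness_db": 50.0, "rhythm": 0.8},
}

def get_voiceprint_features(emotion_id: str, external_url: Optional[str] = None) -> dict:
    """Return the voiceprint features of the target emotion, indexed by its identifier.

    Reads the local emotion library first; otherwise sends a feature acquisition
    request carrying the emotion identifier to the external emotion library.
    """
    if emotion_id in LOCAL_EMOTION_LIBRARY:
        return LOCAL_EMOTION_LIBRARY[emotion_id]
    if external_url is not None:
        # Hypothetical endpoint of the external emotion library.
        response = requests.get(external_url, params={"emotion_id": emotion_id}, timeout=5)
        response.raise_for_status()
        return response.json()
    raise KeyError(f"no voiceprint features stored for emotion '{emotion_id}'")
```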
In some embodiments, the electronic device includes the emotional voice output function. When the user triggers the function to turn on for the first time, the electronic device guides the user to utter voice signals under different emotions and collects these voice signals, performs feature extraction on the collected voice signals to obtain the mapping relationship between the user's emotions and voiceprint features, and sends this mapping relationship to the cloud. The cloud performs statistical analysis on the mapping relationships between emotions and voiceprint features of a plurality of users, finally obtains the first mapping relationship, and stores it in the emotion library. Taking loudness as an example of a voiceprint feature: in general, the loudness of a voice signal uttered by a user under an angry emotion is higher; the cloud performs statistical analysis on the loudness of each user under the angry emotion, and determines the mean or median of the loudness of the users under the angry emotion as the loudness corresponding to the angry emotion.
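The cloud-side statistical analysis could be sketched as follows; the choice between mean and median and the sample values are illustrative assumptions.

```python
from statistics import mean, median

def aggregate_feature(per_user_values: dict[str, list[float]], use_median: bool = False) -> dict[str, float]:
    """Collapse the values of one voiceprint feature (e.g. loudness) observed for
    many users under each emotion into a single library value per emotion."""
    return {
        emotion: (median(values) if use_median else mean(values))
        for emotion, values in per_user_values.items()
    }

# e.g. aggregate_feature({"angry": [72.0, 75.5, 70.1]}) -> {"angry": 72.53...}
```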
Optionally, the electronic device may further collect face images of the user under different emotions, perform feature extraction on the face images to obtain the mapping relationship between the user's emotions and facial features, and send this mapping relationship to the cloud. The cloud performs statistical analysis on the mapping relationships between emotions and facial features of a plurality of users to obtain the second mapping relationship, and stores it in the emotion library. In one example, a user in a "confused" emotion typically shows furrowed brows and drooping corners of the mouth, and the electronic device can extract these facial features from face images of the user under the "confused" emotion.
In other embodiments, the cloud acquires existing video or audio data, and a technician may perform emotion annotation on the video or audio data to obtain audio or video data under each emotion; the cloud then performs voiceprint feature extraction on the audio or video data under each emotion, performs statistical analysis on the extracted voiceprint features to obtain the voiceprint features of each emotion, and stores them in the emotion library. Optionally, the cloud may further perform facial feature extraction on the video data under each emotion, perform statistical analysis on the extracted facial features to obtain the facial features of each emotion, and store them in the emotion library.
And step 203, synthesizing the information to be synthesized based on the voiceprint characteristics of the target emotion to obtain synthesized voice.
The information to be synthesized may be in voice form or text form, which is not limited in the embodiment of the present application. The information to be synthesized may be determined according to the actual scene. In the audio book scene, if the information to be synthesized is in voice form, it is obtained by performing voice conversion processing on the text information to be read aloud according to a first default voiceprint feature; if the information to be synthesized is in text form, it is the text information to be read aloud. In the intelligent voice assistant scene, if the information to be synthesized is in voice form, it is obtained by performing voice conversion processing on the answer information for the question information according to a second default voiceprint feature; if the information to be synthesized is in text form, it is the answer information in text form. The first default voiceprint feature and the second default voiceprint feature may be set by the electronic device by default or may be customized by the user, which is not limited in the embodiment of the present application.
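As a sketch of the scene-dependent preparation just described (the TTS routine is injected as a parameter because the embodiment does not prescribe a specific conversion algorithm; the signature is an assumption):

```python
from typing import Callable, Union

def prepare_information_to_be_synthesized(text: str, speech_form: bool,
                                          default_voiceprint: dict,
                                          tts: Callable[[str, dict], bytes]) -> Union[str, bytes]:
    """Return the information to be synthesized.

    `text` is the text to be read aloud (audio book scene) or the answer text
    (intelligent voice assistant scene). If voice form is required, the text is
    first converted with the scene's default voiceprint feature; otherwise the
    text itself is the information to be synthesized.
    """
    return tts(text, default_voiceprint) if speech_form else text
```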
Optionally, the electronic device outputs the synthesized voice after obtaining it, for example, plays the synthesized voice through a speaker. In the embodiment of the application, the electronic device synthesizes the information to be synthesized based on the voiceprint feature of the voice signal uttered by the user under the target emotion to obtain the synthesized voice, so that the user can feel that the synthesized voice has the target emotion when the electronic device outputs it, and the human-computer interaction between the user and the electronic device is more natural and more expressive.
To sum up, in the technical solution provided by the embodiment of the application, the emotion that the synthesized voice is expected to have is determined, the voiceprint feature of the voice signal uttered by the user when the user is in that emotion is acquired, and finally the information to be synthesized is synthesized based on the voiceprint feature to obtain synthesized voice capable of expressing the emotion, so that the electronic device can also simulate human emotion when subsequently playing the synthesized voice, making the synthesized voice output by the electronic device more natural and more expressive.
When the electronic device synthesizes the information to be synthesized, the steps of adjusting the voiceprint features, compensating the loudness, denoising, and the like need to be completed, so that the finally output synthesized voice can be guaranteed to simulate the target emotion, have moderate loudness and be sufficiently clear, which can improve the experience of voice human-computer interaction.
Fig. 3 is a flowchart of a speech synthesis method according to an embodiment of the present application. The method comprises the following steps:
step 301, determining a target emotion.
The target emotion characterizes the emotion that the synthesized speech is expected to have.
And step 302, acquiring the voiceprint characteristics of the target emotion.
The voiceprint feature of the target emotion characterizes a voiceprint feature of a speech signal uttered by the user in the case of the target emotion.
Step 303, adjusting the voiceprint characteristics of the information to be synthesized according to the voiceprint characteristics of the target emotion to obtain a first intermediate voice.
In the embodiment of the present application, the information to be synthesized is in voice form. The electronic device adjusts the voiceprint features of the information to be synthesized to the voiceprint features of the target emotion to obtain the first intermediate voice, and the target emotion can be simulated when the first intermediate voice is played. In some embodiments, the electronic device adjusts the spectrogram of the information to be synthesized, so that the voiceprint features contained in the adjusted spectrogram are the voiceprint features of the target emotion. In other embodiments, the information to be synthesized includes a field for indicating each voiceprint feature, and the electronic device modifies the value of the field to the value of the corresponding voiceprint feature of the target emotion, thereby adjusting the voiceprint features of the information to be synthesized according to the voiceprint features of the target emotion.
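A minimal sketch of the field-based variant, assuming the information to be synthesized is represented as a dictionary of named voiceprint-feature fields (the representation is an assumption; the spectrogram-based variant is not shown):

```python
def adjust_voiceprint_features(to_be_synthesized: dict, target_features: dict) -> dict:
    """Overwrite each voiceprint-feature field of the information to be synthesized
    with the corresponding value from the target emotion, yielding the first
    intermediate voice."""
    first_intermediate = dict(to_be_synthesized)           # keep the input unchanged
    for feature_name, target_value in target_features.items():
        first_intermediate[feature_name] = target_value    # e.g. fundamental frequency, loudness
    return first_intermediate
```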
And step 304, carrying out loudness compensation processing on the first intermediate voice to obtain a second intermediate voice.
The loudness parameter of the second intermediate voice is greater than that of the first intermediate voice. Loudness compensation means increasing the loudness parameter of the first intermediate voice. Since the voiceprint feature adjustment in step 303 may cause loudness loss, which could make the voice hard for the user to hear clearly during subsequent playback, loudness compensation is required.
In some embodiments, the first intermediate voice includes a field for indicating its loudness parameter, and the electronic device performs the loudness compensation processing on the first intermediate voice by increasing the value of this field.
In some embodiments, the electronic device performs loudness compensation processing on the first intermediate voice if it detects that the loudness parameter of the first intermediate voice is less than a first preset value. The first preset value is set according to experiments or experience, which is not limited in the embodiment of the present application. In this way, unnecessary loudness compensation processing can be avoided, processing resources of the electronic device are saved, the situation that the loudness of the voice information to be played is too high can be avoided, and the experience of voice human-computer interaction is improved.
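Continuing the same dictionary representation (an assumption), the conditional loudness compensation of step 304 could look like this; the gain value and field name are illustrative:

```python
def compensate_loudness(first_intermediate: dict, first_preset_value: float,
                        gain_db: float = 6.0) -> dict:
    """Increase the loudness parameter only when it is below the first preset value,
    producing the second intermediate voice."""
    second_intermediate = dict(first_intermediate)
    if second_intermediate.get("loudness_db", 0.0) < first_preset_value:
        second_intermediate["loudness_db"] += gain_db
    return second_intermediate
```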
And 305, denoising the second intermediate voice to obtain a synthesized voice.
The denoising process refers to removing the noise component of the second intermediate voice. New noise components may be introduced in the voiceprint feature adjustment and loudness compensation processes, and the information to be synthesized may itself contain noise components, so denoising is required to ensure the clarity of the synthesized voice. Algorithms for denoising the second intermediate voice include: Gaussian filtering algorithms, denoising algorithms based on the LMS framework, deep-learning denoising algorithms, and the like, which are not limited in the embodiments of the present application.
In some embodiments, the electronic device performs denoising processing on the second intermediate voice when it detects that the proportion of the noise component of the second intermediate voice is greater than a second preset value. The second preset value is set according to experiments or experience, which is not limited in the embodiment of the present application; for example, the second preset value is 30%. In this way, unnecessary denoising processing can be avoided, processing resources of the electronic device are saved, the situation that the noise of the voice information to be played is too loud can be avoided, the clarity of the voice information to be played (i.e., the synthesized voice) is guaranteed, and the experience of voice human-computer interaction is improved.
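For step 305, a sketch of the threshold-gated denoising is given below; a simple moving-average filter stands in for the Gaussian-filtering, LMS-based or deep-learning denoisers named above, and the noise-proportion estimate is assumed to come from a separate detector.

```python
def denoise_if_needed(samples: list[float], noise_proportion: float,
                      second_preset_value: float = 0.30, window: int = 5) -> list[float]:
    """Denoise the second intermediate voice only when the proportion of its noise
    component exceeds the second preset value (e.g. 30%)."""
    if noise_proportion <= second_preset_value:
        return samples                                # skip unnecessary denoising
    half = window // 2
    denoised = []
    for i in range(len(samples)):
        segment = samples[max(0, i - half): i + half + 1]
        denoised.append(sum(segment) / len(segment))  # moving-average stand-in
    return denoised
```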
In summary, according to the technical scheme provided by the embodiment of the application, the voiceprint feature adjustment, the loudness compensation processing and the denoising processing are sequentially performed on the information to be synthesized, so that the finally output synthesized voice can be guaranteed to simulate the target emotion, the loudness is moderate, the voice is clear enough, and the experience of voice man-machine interaction can be improved.
In the intelligent voice assistant scene, after a user sends a question to the intelligent voice assistant, the intelligent voice assistant can determine the emotion of the user when the user sends the question and determine the emotion required to be expressed when the answer information is played based on the emotion of the user, so that the man-machine interaction process of the user and the intelligent voice assistant is more natural and richer in expressive force.
Fig. 4 is a flowchart of a speech synthesis method according to an embodiment of the present application. The method comprises the following steps:
step 401, acquiring audio and video information in the process of sending out question voice information by a target user.
The information to be synthesized is answer voice information for the question voice information. The audio and video information in the process of the target user uttering the question voice information includes audio information and/or video information collected while the target user utters the question voice information. In some embodiments, after detecting that the user utters question voice information, the electronic device starts the image collecting device to collect video information including face images of the user, and starts the sound collecting device to collect the question voice information uttered by the user. Optionally, the electronic device continuously monitors the sound signal, and determines that the user has uttered question voice information after detecting that the sound signal contains a specified keyword, where the specified keyword may be the name of the intelligent voice assistant.
And 402, determining the emotion of the target user based on the audio and video information.
The emotion of the target user represents the emotion expressed by the target user in the process of issuing the question voice information.
In some embodiments, in a case that the audio-video information includes audio information, the electronic device extracts voiceprint features from the audio information, then sequentially calculates a similarity between the extracted voiceprint features and voiceprint features of at least one emotion, and determines an emotion, for which the similarity with the extracted voiceprint features meets a first preset condition, as an emotion of the target user.
The voiceprint feature extraction algorithm may be a wavelet-transform-based voiceprint feature extraction algorithm, a linear-prediction-coefficient-based voiceprint feature extraction algorithm, a perceptual-linear-prediction-based voiceprint feature extraction algorithm, or the like, which is not limited in the embodiment of the present application.
The voiceprint features of at least one emotion may be obtained from the external emotion library or from the local emotion library; the obtaining process is explained in step 202 and is not repeated here.
In some embodiments, the extracted voiceprint features and the voiceprint features of the at least one emotion are feature vectors of the same dimension, and the electronic device calculates a distance between the two feature vectors to determine the similarity between the extracted voiceprint features and the voiceprint features of the at least one emotion. The distance includes a cosine distance, a Euclidean distance, and the like, which is not limited in the embodiment of the present application.
The first preset condition may be that the emotion with the greatest similarity to the extracted voiceprint features is determined as the emotion of the target user, or that the emotion with the greatest similarity to the extracted voiceprint features and greater than the first preset similarity is determined as the emotion of the target user. The first predetermined similarity is set according to experiments or experience, and the embodiment of the present application is not limited thereto.
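The similarity matching described above could be sketched as below, using cosine similarity between feature vectors of the same dimension; the 0.8 threshold is an illustrative stand-in for the first preset similarity.

```python
import math
from typing import Optional

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def match_emotion(extracted: list[float], emotion_voiceprints: dict[str, list[float]],
                  first_preset_similarity: float = 0.8) -> Optional[str]:
    """Return the emotion whose voiceprint features are most similar to the extracted
    ones, or None if even the best match stays below the first preset similarity."""
    best_emotion, best_similarity = None, -1.0
    for emotion, features in emotion_voiceprints.items():
        similarity = cosine_similarity(extracted, features)
        if similarity > best_similarity:
            best_emotion, best_similarity = emotion, similarity
    return best_emotion if best_similarity >= first_preset_similarity else None
```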
In some embodiments, where the audio-video information comprises video information, the electronic device extracts facial features from the video information, acquires the similarity between the extracted facial features and the facial features of at least one emotion, and determines the emotion whose similarity to the extracted facial features meets a second preset condition as the emotion of the target user.
The facial feature extraction algorithm includes a statistical-analysis-based facial feature extraction algorithm, a deep-learning-based facial feature extraction algorithm, and the like. The facial features of at least one emotion may be obtained from the external emotion library or from the local emotion library; the obtaining process is explained in step 202 and is not repeated here.
In some embodiments, the extracted facial features and the facial features of the at least one emotion are feature vectors of the same dimension, and the electronic device calculates a distance between the two feature vectors to determine the similarity between the extracted facial features and the facial features of the at least one emotion. The distance includes a cosine distance, a Euclidean distance, and the like, which is not limited in the embodiment of the present application.
The second preset condition may be that the emotion with the greatest similarity to the extracted facial features is determined as the emotion of the target user, or that the emotion with the greatest similarity to the extracted facial features and greater than the second preset similarity is determined as the emotion of the target user. The second predetermined similarity is set according to experiments or experience, and is not limited in the embodiments of the present application.
In still other embodiments, in a case where the audio-video information includes video information, the electronic device performs recognition processing on at least one face image included in the video information through an emotion recognition model to obtain the emotion of the target user. In some embodiments, the electronic device inputs the face images included in the video information into the emotion recognition model, the emotion recognition model outputs the probability that the emotion of the target user belongs to each emotion tag, and the emotion tag with the highest probability is determined as the emotion of the target user.
The emotion recognition model is obtained by training a deep learning network with training sample images labeled with emotion tags. The training process of the emotion recognition model is as follows: the electronic device obtains a preset number of training sample images, each labeled with an emotion tag, and inputs the training sample images into an initial model, which outputs predicted emotion tags; the parameters of the initial model are adjusted based on the error between the predicted emotion tags and the labeled emotion tags and a preset loss function, and the training sample images are input into the initial model again, until the iteration stopping condition is met.
The preset number is set according to the precision requirement of the emotion recognition model; the higher the precision requirement, the larger the preset number. The iteration stopping condition may be that the number of iterations exceeds a preset number of times, or that the error between the predicted emotion tags and the labeled emotion tags is smaller than a preset error. The preset number of times is likewise set according to the precision requirement of the emotion recognition model; the higher the precision requirement, the larger the preset number of times. The preset error is also set according to the precision requirement of the emotion recognition model; the higher the precision requirement, the smaller the preset error.
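The training procedure could be sketched as follows; a small PyTorch classifier is used as a stand-in for the deep learning network, and the layer sizes, learning rate and stopping thresholds are assumptions.

```python
import torch
from torch import nn

def train_emotion_recognition_model(images: torch.Tensor, labels: torch.Tensor,
                                    num_emotions: int, preset_times: int = 1000,
                                    preset_error: float = 1e-3) -> nn.Module:
    """Train an emotion recognition model on face images labeled with emotion tags.

    Iteration stops when the number of iterations reaches the preset number of
    times or the error between predicted and labeled emotion tags falls below
    the preset error, matching the stopping conditions described above.
    """
    model = nn.Sequential(                       # stand-in for the deep learning network
        nn.Flatten(),
        nn.Linear(images[0].numel(), 128),
        nn.ReLU(),
        nn.Linear(128, num_emotions),
    )
    loss_fn = nn.CrossEntropyLoss()              # the preset loss function
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(preset_times):
        optimizer.zero_grad()
        loss = loss_fn(model(images), labels)    # error between predicted and labeled tags
        loss.backward()
        optimizer.step()
        if loss.item() < preset_error:
            break
    return model
```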
In other possible embodiments, the electronic device determines the emotion of the target user comprehensively based on both the video information and the audio information. Specifically, the electronic device obtains the similarity between the extracted voiceprint features and the voiceprint features of at least one emotion, determines, based on the similarity, a first probability that the emotion of the target user belongs to each emotion tag, outputs through the emotion recognition model a second probability that the emotion of the target user belongs to each emotion tag, performs weighted summation on the first probability and the second probability for each emotion tag to obtain a third probability that the emotion of the target user belongs to each emotion tag, and determines the emotion tag with the largest third probability as the emotion of the target user.
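The weighted summation of the audio-based and video-based probabilities could be sketched as follows; the equal weighting is an illustrative assumption.

```python
def fuse_emotion_probabilities(first_probability: dict[str, float],
                               second_probability: dict[str, float],
                               audio_weight: float = 0.5) -> str:
    """Combine, for each emotion tag, the first (audio-based) and second
    (video-based) probabilities into a third probability, and return the
    emotion tag with the largest third probability."""
    tags = set(first_probability) | set(second_probability)
    third_probability = {
        tag: audio_weight * first_probability.get(tag, 0.0)
             + (1.0 - audio_weight) * second_probability.get(tag, 0.0)
        for tag in tags
    }
    return max(third_probability, key=third_probability.get)
```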
Referring to fig. 5, which shows a schematic diagram of determining the emotion of a target user according to an embodiment of the present application. When the user speaks, the electronic device collects the voice signal and the expression signal in real time, then analyzes the voice signal through a voice data analysis module to obtain a voice data analysis report, analyzes the expression signal through a facial expression analysis module to obtain an expression analysis report, and finally produces a comprehensive analysis report from the voice data analysis report and the expression analysis report, where the comprehensive analysis report includes the determined emotion of the target user.
Instep 403, a target emotion is determined based on the emotion of the target user.
In some embodiments, the electronic device determines the emotion of the target user as the target emotion. Optionally, the electronic device determines the emotion of the target user as the target emotion in the audio book scene. For example, for the text to be read aloud "he watched her receding figure and cried sadly", the emotion determined by the electronic device is "sad", and the target emotion is also determined as "sad".
In other embodiments, the electronic device determines the target emotion based on a mapping relationship between the emotion of the target user and the target emotion. The mapping relationship between the emotion of the target user and the target emotion may be set by a relevant professional, such as a psychologist. Table 1 below exemplarily shows a mapping relationship between the emotion of the target user and the target emotion.
Optionally, the electronic device determines the target emotion based on the mapping relationship and the emotion of the target user in the intelligent voice assistant scene. For example, the electronic device determines that the target user's emotion is "anger", looks up the above mapping relationship, and determines that the target emotion is "peace".
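A sketch of the Table-1-style lookup is given below; only the "anxious" → "soothing" and "anger" → "peace" pairs come from the examples in this description, and the remaining entries are illustrative assumptions.

```python
# Partial stand-in for Table 1; entries marked "assumed" are not from the description.
EMOTION_TO_TARGET_EMOTION = {
    "anxious": "soothing",
    "anger": "peace",
    "sad": "comforting",   # assumed
    "happy": "happy",      # assumed
}

def target_emotion_for_user(user_emotion: str, default: str = "calm") -> str:
    """Look up the target emotion for the intelligent voice assistant scene."""
    return EMOTION_TO_TARGET_EMOTION.get(user_emotion, default)
```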
Step 404, obtaining the voiceprint characteristics of the target emotion.
The voiceprint feature of the target emotion characterizes a voiceprint feature of a speech signal uttered by the user in the case of the target emotion.
And 405, synthesizing the information to be synthesized based on the voiceprint characteristics of the target emotion to obtain synthesized voice.
Referring to fig. 6, a flow chart of a speech synthesis method provided by an embodiment of the present application is shown. The electronic equipment collects source voice data and a face image under the condition that a user sends question information, then carries out voice processing on the source voice data through a voice recognition algorithm, processes the face image through the image recognition algorithm, finally determines the emotion of a speaker based on the recognition results of the source voice data and the face image, then obtains the voiceprint characteristics of the emotion of the speaker from an emotion library, carries out voice synthesis processing on information to be synthesized based on the voiceprint characteristics, and finally outputs voice capable of expressing the emotion.
In summary, according to the technical scheme provided by the embodiment of the application, in the scene of the intelligent voice assistant, after the user issues the question to the intelligent voice assistant, the intelligent voice assistant can determine the emotion of the user when issuing the question, and determine the emotion required to be expressed when playing the answer information based on the emotion of the user, so that the human-computer interaction process of the user and the intelligent voice assistant is more natural and richer in expressive force.
Fig. 7 is a block diagram of a speech synthesis apparatus according to an embodiment of the present application. The speech synthesis apparatus includes: an emotion determination module 710, a feature acquisition module 720, and a synthesis processing module 730.
The emotion determination module 710 is configured to determine a target emotion characterizing the emotion that the synthesized voice is expected to have. The feature acquisition module 720 is configured to acquire the voiceprint feature of the target emotion, where the voiceprint feature of the target emotion represents the voiceprint feature of the voice signal uttered by the user when the user is in the target emotion. The synthesis processing module 730 is configured to synthesize the information to be synthesized based on the voiceprint feature of the target emotion to obtain the synthesized voice.
To sum up, in the technical solution provided by the embodiment of the application, the emotion that the synthesized voice is expected to have is determined, the voiceprint feature of the voice signal uttered by the user when the user is in that emotion is acquired, and finally the information to be synthesized is synthesized based on the voiceprint feature to obtain synthesized voice capable of expressing the emotion, so that the electronic device can also simulate human emotion when subsequently playing the synthesized voice, making the synthesized voice output by the electronic device more natural and more expressive.
In some embodiments, the synthesis processing module 730 is configured to: adjust the voiceprint features of the information to be synthesized according to the voiceprint features of the target emotion to obtain a first intermediate voice; perform loudness compensation processing on the first intermediate voice to obtain a second intermediate voice, where the loudness parameter of the second intermediate voice is greater than that of the first intermediate voice; and perform denoising processing on the second intermediate voice to obtain the synthesized voice.
In some embodiments, the synthesis processing module 730 is configured to: perform loudness compensation processing on the first intermediate voice to obtain the second intermediate voice in a case where the loudness parameter of the first intermediate voice is smaller than a first preset value; and perform denoising processing on the second intermediate voice to obtain the synthesized voice in a case where the proportion of the noise component in the second intermediate voice is greater than a second preset value.
In some embodiments, the emotion determination module 710 is configured to: acquire audio and video information in the process of the target user uttering question voice information, where the information to be synthesized is answer voice information for the question voice information; determine the emotion of the target user based on the audio and video information, where the emotion of the target user represents the emotion expressed by the target user in the process of uttering the question voice information; and determine the target emotion based on the emotion of the target user.
In some embodiments, the emotion determination module 710 is configured to: extract voiceprint features from the audio information; obtain the similarity between the extracted voiceprint features and the voiceprint features of at least one emotion; and determine the emotion whose similarity to the extracted voiceprint features meets a first preset condition as the emotion of the target user.
In some embodiments, the emotion determination module 710 is configured to: extract facial features from the video information; obtain the similarity between the extracted facial features and the facial features of at least one emotion; and determine the emotion whose similarity to the extracted facial features meets a second preset condition as the emotion of the target user.
In some embodiments, the emotion determination module 710 is configured to: perform recognition processing on at least one face image included in the video information through an emotion recognition model to obtain the emotion of the target user, where the emotion recognition model is obtained by training a deep learning network with training sample images labeled with emotion tags.
As shown in fig. 8, the present embodiment further provides an electronic device 800. The electronic device 800 may be a server, and includes a processor 810 and a memory 820. The memory 820 stores computer program instructions.
The processor 810 may include one or more processing cores. The processor 810 connects various parts of the electronic device 800 through various interfaces and circuits, and executes various functions of the electronic device 800 and processes data by running or executing instructions, programs, code sets, or instruction sets stored in the memory 820 and invoking data stored in the memory 820. Optionally, the processor 810 may be implemented in hardware using at least one of digital signal processing (DSP), a Field-Programmable Gate Array (FPGA), and a Programmable Logic Array (PLA). The processor 810 may integrate one or a combination of a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a modem, and the like. The CPU mainly handles the operating system, the user interface, application programs, and the like; the GPU is used for rendering and drawing display content; the modem is used to handle wireless communications. It can be understood that the modem may also not be integrated into the processor 810 and may instead be implemented by a separate communication chip.
The memory 820 may include Random Access Memory (RAM) or Read-Only Memory (ROM). The memory 820 may be used to store instructions, programs, code, code sets, or instruction sets. The memory 820 may include a program storage area and a data storage area, where the program storage area may store instructions for implementing an operating system, instructions for implementing at least one function (such as a touch function, a sound playing function, an image playing function, etc.), instructions for implementing the foregoing method embodiments, and the like. The data storage area may store data created by the electronic device during use (such as a phone book, audio and video data, and chat records).
Referring to fig. 9, a computer-readable storage medium 900 is further provided according to an embodiment of the present application, in which computer program instructions 910 are stored, and the computer program instructions 910 can be called by a processor to execute the method described in the above embodiments.
The computer-readable storage medium 900 may be an electronic memory such as a flash memory, an EEPROM (electrically erasable programmable read-only memory), an EPROM, a hard disk, or a ROM. Alternatively, the computer-readable storage medium 900 includes a non-volatile computer-readable storage medium. The computer-readable storage medium 900 has storage space for computer program instructions 910 that perform any of the method steps described above. The computer program instructions 910 may be read from or written to one or more computer program products. The computer program instructions 910 may be compressed in a suitable form.
Although the present application has been described with reference to preferred embodiments, it should be understood that various changes, substitutions and alterations can be made herein without departing from the spirit and scope of the application, and all changes, substitutions and alterations that fall within the spirit and scope of the application are to be understood as being covered by the following claims.

Claims (10)

1. A method of speech synthesis, the method comprising:
determining a target emotion that characterizes an emotion that the synthesized speech is expected to have;
acquiring the voiceprint characteristics of the target emotion, wherein the voiceprint characteristics of the target emotion represent the voiceprint characteristics of a voice signal sent by a user under the condition that the user is in the target emotion;
and synthesizing information to be synthesized based on the voiceprint characteristics of the target emotion to obtain the synthesized voice.
2. The method according to claim 1, wherein the synthesizing information to be synthesized based on the voiceprint feature corresponding to the target emotion to obtain the synthesized speech comprises:
adjusting the voiceprint characteristics of the information to be synthesized according to the voiceprint characteristics of the target emotion to obtain first intermediate voice;
carrying out loudness compensation processing on the first intermediate voice to obtain second intermediate voice, wherein the loudness parameter of the second intermediate voice is greater than that of the first intermediate voice;
and denoising the second intermediate voice to obtain the synthesized voice.
3. The method of claim 2, wherein performing loudness compensation on the first intermediate speech to obtain second intermediate speech comprises:
under the condition that the loudness parameter of the first intermediate voice is smaller than a first preset value, carrying out loudness compensation processing on the first intermediate voice to obtain a second intermediate voice;
the denoising processing of the second intermediate speech to obtain the synthesized speech includes:
and under the condition that the proportion of the noise component in the second intermediate voice is larger than a second preset value, denoising the second intermediate voice to obtain the synthesized voice.
4. The method of claim 1, wherein determining the target emotion comprises:
acquiring audio and video information in the process of sending out question voice information by a target user, wherein the information to be synthesized is answer information aiming at the question voice information;
based on the audio and video information, determining the emotion of the target user, wherein the emotion of the target user represents the emotion expressed by the target user in the process of sending out the question voice information;
determining the target emotion based on the emotion of the target user.
5. The method of claim 4, wherein the specified audiovisual information comprises audio information, and wherein determining the mood of the target user based on the audiovisual information comprises:
extracting voiceprint features from the audio information;
obtaining the similarity between the extracted voiceprint features and the voiceprint features of at least one emotion;
and determining the emotion whose similarity to the extracted voiceprint features meets a first preset condition as the emotion of the target user.
6. The method of claim 4, wherein the specified audiovisual information comprises video information, and wherein determining the mood of the target user based on the audiovisual information comprises:
extracting human face features from the video information;
acquiring the similarity between the extracted face features and at least one emotion face feature;
and determining the emotion whose similarity to the extracted human face features meets a second set condition as the emotion of the target user.
7. The method of claim 4, wherein the specified audiovisual information comprises video information, and wherein determining the mood of the target user based on the audiovisual information comprises:
identifying at least one face image included in the video image through an emotion identification model to obtain the emotion of the target user; the emotion recognition model is obtained by training the deep learning network through a training sample image labeled with an emotion label.
8. A speech synthesis apparatus, characterized in that the apparatus comprises:
an emotion determination module configured to determine a target emotion, the target emotion representing an emotion that the synthesized speech is expected to express;
a feature acquisition module configured to acquire a voiceprint feature of the target emotion, wherein the voiceprint feature of the target emotion represents the voiceprint feature of a speech signal uttered by a user while the user is in the target emotion;
and a synthesis processing module configured to synthesize information to be synthesized based on the voiceprint feature of the target emotion to obtain the synthesized speech.
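The three modules recited in claim 8 map naturally onto three cooperating components. The arrangement below is only one possible reading; the module boundaries follow the claim wording, while the callables they wrap are placeholders.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class SpeechSynthesisApparatus:
    """Mirrors the three modules recited in claim 8."""
    determine_emotion: Callable[[], str]     # emotion determination module
    fetch_voiceprint: Callable[[str], Any]   # feature acquisition module
    synthesize: Callable[[str, Any], bytes]  # synthesis processing module

    def run(self, text_to_synthesize: str) -> bytes:
        target_emotion = self.determine_emotion()
        voiceprint = self.fetch_voiceprint(target_emotion)
        return self.synthesize(text_to_synthesize, voiceprint)
```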
9. An electronic device, characterized by comprising a processor and a memory, wherein the memory stores computer program instructions which are invoked by the processor to perform the method of any one of claims 1 to 7.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores program code which is invoked by a processor to perform the method of any one of claims 1 to 7.
CN202210364869.9A | Priority date 2022-04-07 | Filing date 2022-04-07 | Speech synthesis method, speech synthesis device, electronic equipment and storage medium | Pending | CN114678003A (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202210364869.9A (CN114678003A (en)) | 2022-04-07 | 2022-04-07 | Speech synthesis method, speech synthesis device, electronic equipment and storage medium

Publications (1)

Publication Number | Publication Date
CN114678003A (en) | 2022-06-28

Family

ID=82077352

Family Applications (1)

Application Number | Title | Priority Date | Filing Date | Status
CN202210364869.9A (CN114678003A (en)) | Speech synthesis method, speech synthesis device, electronic equipment and storage medium | 2022-04-07 | 2022-04-07 | Pending

Country Status (1)

Country | Link
CN | CN114678003A (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN108922525A (en) * | 2018-06-19 | 2018-11-30 | Oppo广东移动通信有限公司 | Method of speech processing, device, storage medium and electronic equipment
CN110688911A (en) * | 2019-09-05 | 2020-01-14 | 深圳追一科技有限公司 | Video processing method, device, system, terminal equipment and storage medium
CN111653265A (en) * | 2020-04-26 | 2020-09-11 | 北京大米科技有限公司 | Speech synthesis method, device, storage medium and electronic device
CN112188378A (en) * | 2020-09-28 | 2021-01-05 | 维沃移动通信有限公司 | Electronic equipment sound production optimization method and device and electronic equipment
CN112883209A (en) * | 2019-11-29 | 2021-06-01 | 阿里巴巴集团控股有限公司 | Recommendation method and processing method, device, equipment and readable medium for multimedia data
CN112992147A (en) * | 2021-02-26 | 2021-06-18 | 平安科技(深圳)有限公司 | Voice processing method, device, computer equipment and storage medium
CN113327620A (en) * | 2020-02-29 | 2021-08-31 | 华为技术有限公司 | Voiceprint recognition method and device
CN113450804A (en) * | 2021-06-23 | 2021-09-28 | 深圳市火乐科技发展有限公司 | Voice visualization method and device, projection equipment and computer readable storage medium
WO2021232594A1 (en) * | 2020-05-22 | 2021-11-25 | 深圳壹账通智能科技有限公司 | Speech emotion recognition method and apparatus, electronic device, and storage medium

Legal Events

Code | Title
PB01 | Publication
SE01 | Entry into force of request for substantive examination
