
Virtual image synthesis method and device, electronic equipment and storage medium

Info

Publication number: CN111145282A
Application number: CN201911274701.3A
Authority: CN (China)
Prior art keywords: expression, frame, features, avatar, video
Legal status: Granted (status listed by Google Patents is an assumption, not a legal conclusion)
Other languages: Chinese (zh)
Other versions: CN111145282B
Inventors: 左童春, 何山, 胡金水, 刘聪
Current Assignee: iFlytek Co Ltd
Original Assignee: iFlytek Co Ltd
Application filed by iFlytek Co Ltd
Priority to CN201911274701.3A
Publication of CN111145282A
Application granted
Publication of CN111145282B
Current status: Active


Abstract

Embodiments of the invention provide an avatar synthesis method and apparatus, an electronic device, and a storage medium. The method includes: determining relevant features of voice data, the relevant features characterizing information related to the speaker's expression contained in the voice data; and inputting image data and the relevant features into an expression synthesis model to obtain an avatar video output by the expression synthesis model, where the avatar in the avatar video is configured with an expression corresponding to the voice data. The expression synthesis model is trained based on sample speaker videos, the relevant features of the sample voice data corresponding to those videos, and sample image data. The method, apparatus, electronic device, and storage medium provided by the embodiments of the invention enable the avatar's expression to fit the voice data more closely, making it more natural and realistic.

Description

Virtual image synthesis method and device, electronic equipment and storage medium
Technical Field
The present invention relates to the field of image processing technologies, and in particular, to a method and an apparatus for synthesizing an avatar, an electronic device, and a storage medium.
Background
In recent years, with the continuous progress of computer voice synthesis and video synthesis technologies, various voice-driven avatar synthesis technologies have been developed. The avatar may perform tasks such as news broadcasting, weather forecasting, game commentary, providing ordering services, etc.
When performing these tasks, most avatars only synthesize mouth shapes matched to the output voice while keeping a neutral expression, or select from a few preset basic expressions according to the voice content. As a result, when the synthesized avatar outputs voice, its expression is often neither vivid nor natural, and the user experience is poor.
Disclosure of Invention
The embodiment of the invention provides a method and a device for synthesizing an avatar, electronic equipment and a storage medium, which are used for solving the problem that the corresponding expression of the existing avatar is not vivid and natural when voice is output.
In a first aspect, an embodiment of the present invention provides an avatar synthesis method, including:
determining relevant characteristics of the voice data; the related features are used for representing features related to the expression of a speaker contained in the voice data;
inputting image data and the relevant characteristics into an expression synthesis model to obtain an avatar video output by the expression synthesis model, wherein an avatar in the avatar video is configured with an expression corresponding to the voice data;
the expression synthesis model is obtained by training based on sample speaker videos, relevant features of sample voice data corresponding to the sample speaker videos and sample image data.
Preferably, the inputting the image data and the related features into an expression synthesis model to obtain an avatar video output by the expression synthesis model specifically includes:
inputting image data and relevant features respectively corresponding to any frame into a feature extraction layer of the expression synthesis model to obtain frame features output by the feature extraction layer;
and inputting the frame characteristics to an expression prediction layer of the expression synthesis model to obtain a virtual expression graph of any frame output by the expression prediction layer.
Preferably, the inputting the image data and the relevant features respectively corresponding to any frame into the feature extraction layer of the expression synthesis model to obtain the frame features output by the feature extraction layer specifically includes:
inputting image data and relevant features respectively corresponding to any frame into a current feature extraction layer of the feature extraction layer to obtain current features output by the current feature extraction layer;
and inputting the virtual expression maps of preset frames preceding the frame into a pre-frame feature extraction layer of the feature extraction layer to obtain pre-frame features output by the pre-frame feature extraction layer.
Preferably, the inputting the frame characteristics into an expression prediction layer of the expression synthesis model to obtain a virtual expression map of any frame output by the expression prediction layer specifically includes:
and fusing the current features and the pre-frame features and inputting the fused features into the expression prediction layer to obtain the virtual expression map of any frame output by the expression prediction layer.
Preferably, the fusing the current feature and the pre-frame feature and inputting the fused current feature and the fused pre-frame feature to the expression prediction layer to obtain the virtual expression map of any frame output by the expression prediction layer specifically includes:
fusing the current features and the pre-frame features and inputting the fused features into a candidate expression prediction layer of the expression prediction layer to obtain a candidate expression map output by the candidate expression prediction layer;
fusing the current feature and the pre-frame feature, and inputting the fused current feature and the pre-frame feature into an optical flow prediction layer of the expression prediction layer to obtain optical flow information output by the optical flow prediction layer;
and inputting the candidate expression graph and the optical flow information into a fusion layer in the expression prediction layer to obtain the virtual expression graph of any frame output by the fusion layer.
Preferably, the expression synthesis model is obtained by training together with a discriminator, based on sample speaker videos, relevant features of the sample voice data corresponding to the sample speaker videos, and sample image data, and the expression synthesis model and the discriminator form a generative adversarial network.
Preferably, the discriminator comprises an image discriminator and/or a video discriminator;
the image discriminator is used for judging the synthetic reality of any frame of virtual expression graph in the virtual image video, and the video discriminator is used for judging the synthetic reality of the virtual image video.
Preferably, the relevant features comprise language-dependent features, and emotional and/or speaker identity features.
Preferably, the image data is determined based on the speaker identity features.
Preferably, the expressions of the avatar configuration in the avatar video corresponding to the voice data include facial expressions and neck expressions.
In a second aspect, an embodiment of the present invention provides an avatar synthesis apparatus, including:
a relevant feature determination unit for determining relevant features of the voice data; the related features are used for representing features related to the expression of a speaker contained in the voice data;
the expression synthesis unit is used for inputting image data and the related characteristics into an expression synthesis model to obtain an avatar video output by the expression synthesis model, and an avatar in the avatar video is configured with an expression corresponding to the voice data;
the expression synthesis model is obtained by training based on sample speaker videos, relevant features of the sample voice data corresponding to the sample speaker videos, and sample image data.
In a third aspect, an embodiment of the present invention provides an electronic device, including a processor, a communication interface, a memory, and a bus, where the processor, the communication interface, and the memory communicate with one another through the bus, and the processor may call logic instructions in the memory to perform the steps of the method provided in the first aspect.
In a fourth aspect, an embodiment of the present invention provides a non-transitory computer readable storage medium, on which a computer program is stored, which when executed by a processor, implements the steps of the method as provided in the first aspect.
The method, the device, the electronic equipment and the storage medium for synthesizing the virtual image, provided by the embodiment of the invention, apply the relevant characteristics containing rich expression relevant information to synthesize the expression of the virtual image, so that the expression of the virtual image can better fit with the voice data and is more natural and real. In addition, in the virtual image video generated by the expression synthesis model, the expressions of the virtual image exist in an integral form, and compared with a mode of independently modeling each region for executing the expressions in the virtual image, the linkage problem of the muscles in each region can be effectively solved by aiming at the integral modeling of the expressions, so that the muscle linkage of each region is more natural and vivid.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
Fig. 1 is a schematic flow chart of a virtual image synthesis method according to an embodiment of the present invention;
fig. 2 is a schematic flow chart of an expression synthesis method according to an embodiment of the present invention;
fig. 3 is a schematic flow chart of a feature extraction method according to an embodiment of the present invention;
fig. 4 is a schematic flowchart of an expression prediction method according to an embodiment of the present invention;
fig. 5 is a schematic structural diagram of an expression synthesis model according to an embodiment of the present invention;
FIG. 6 is a schematic flow chart of a method for synthesizing an avatar according to another embodiment of the present invention;
FIG. 7 is a schematic structural diagram of an avatar synthesis apparatus according to an embodiment of the present invention;
fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In the prior art, the synthesis technology for virtual images can be mainly classified into the following three categories:
the first category, based on voice-driven avatar synthesis techniques: and independently applying language information and expression information acquired from the voice to the finally synthesized video. In the method, only a plurality of basic expressions are considered, the synthesized virtual image is more stuttery, only a plurality of predefined basic expressions can be made, and the problems that the lip and the eyebrow are not matched with the eyebrow, the throat, the cheek and the like are solved. On one hand, the above problem is that rich emotion cannot be expressed individually because the mouth shape opening and closing are determined only according to the pronunciation characteristics of the voice content, and different emotions, differences among different people and physiological linkage among human face muscle blocks are not considered. On the other hand, because the method can only select two from several or dozens of fixed expressions to be superposed into the synthesized video, rich facial expressions cannot be synthesized.
The second category is avatar synthesis based on expression migration: the facial expression, mouth shape, and rigid motion of a driving character are transferred to the avatar. Videos synthesized in this way are more vivid, but the approach depends on a real performer and cannot be synthesized offline.
The third category realizes avatar expression synthesis by modeling each part of the human face independently. It requires an artist to design the motion of the whole face based on physiological and aesthetic expertise and to synthesize the state of each part frame by frame for the video to be edited, which not only demands strong professional knowledge but is also time-consuming and labor-intensive.
Anatomically, the human face has 42 muscles, which can produce abundant expressions and accurately convey various moods and emotions. The flexing of these muscles is not independent; it is strongly correlated. For example, when a person speaks in a calm state, the muscles of the lips and chin stretch, whereas when the same sentence is spoken in emotional agitation, the muscles of the forehead and cheeks also stretch, and the stretching of the lip and chin muscles is clearly stronger than in the calm state. In addition, human expressions number in the thousands, while existing methods provide only a few or a few dozen preset expressions, whose expressiveness is neither fine-grained nor personalized. Therefore, how to automatically synthesize avatar expressions that are more realistic and natural remains an urgent problem for those skilled in the art.
Therefore, the embodiment of the invention provides a virtual image synthesis method. Fig. 1 is a schematic flow chart of an avatar synthesis method according to an embodiment of the present invention, as shown in fig. 1, the method includes:
step 110, determining relevant characteristics of voice data; the related features are used for characterizing features related to the expression of the speaker contained in the voice data.
Specifically, the voice data is voice data for synthesizing an avatar, where the avatar may be a virtual character, a virtual cartoon image, an animal image, or the like, and the embodiment of the present invention does not limit this. The voice data may be voice data of a speaker speaking acquired by the radio device, or may be intercepted from voice data obtained through a network or the like, which is not specifically limited in the embodiment of the present invention.
The related features are features related to the speaker's expression obtained by analyzing the voice data. For example, the language-related features in the voice data correspond to different pronunciations, and different pronunciations require the speaker to mobilize facial muscles to form different mouth shapes. The emotion features in the voice data matter because, when the speaker says the same content in different moods, the movements of the facial muscles, including the mouth shape, and of the neck muscles differ. Scene features in the voice data also play a role, since the speaking scene can influence the speaker's facial expression: in a noisy environment the speaker may speak loudly and the facial expression is relatively exaggerated, whereas in a quiet environment the speaker may speak softly and the facial expression is relatively subtle. Finally, the speaker identity features in the voice data matter because different speakers express themselves differently; for instance, a presenter hosting a children's program may use gentle, friendly expressions when talking, while a presenter hosting a comedy program may use exaggerated expressions.
Step 120, inputting the image data and the related characteristics into an expression synthesis model to obtain an avatar video output by the expression synthesis model, wherein an avatar in the avatar video is configured with an expression corresponding to the voice data; the expression synthesis model is obtained by training based on the sample speaker video, the relevant characteristics of the sample voice data corresponding to the sample speaker video and the sample image data.
Specifically, the image data is the image data used for avatar synthesis; the avatar corresponding to the image data may be the image of the speaker of the voice data or an image unrelated to that speaker, which is not specifically limited in this embodiment of the present invention. The image data includes a texture map and expression mask maps: the texture map is an image of the avatar itself, including the regions of the avatar that perform expressions, while an expression mask map is an image of the avatar in which the expression-performing regions are masked out. One expression mask map may be provided per frame, or several frames may share one expression mask map.
The expression synthesis model is used for analyzing the expression of the avatar based on the image data and the relevant features, and synthesizing the avatar's expression to obtain an avatar video configured with the expression corresponding to the voice data. Before step 120 is executed, the expression synthesis model may be obtained through pre-training, specifically as follows: first, a large number of sample speaker videos and the sample voice data corresponding to them are collected, and sample image data is extracted from the sample speaker videos while relevant features are extracted from the sample voice data. Here, the sample speaker videos are videos of real human speakers. Then, an initial model is trained based on the sample speaker videos, the relevant features of the sample voice data corresponding to the sample speaker videos, and the sample image data, yielding the expression synthesis model.
The method provided by the embodiment of the invention carries out the expression synthesis of the virtual image by using the relevant characteristics containing abundant expression relevant information, so that the virtual image expression can better fit with the voice data and is more natural and real. In addition, in the virtual image video generated by the expression synthesis model, the expressions of the virtual image exist in an integral form, and compared with a mode of independently modeling each region for executing the expressions in the virtual image, the linkage problem of the muscles in each region can be effectively solved by aiming at the integral modeling of the expressions, so that the muscle linkage of each region is more natural and vivid.
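As a purely illustrative sketch, the two-step flow of steps 110 and 120 might look as follows; the function and parameter names (extract_related_features, expression_model, and so on) are hypothetical placeholders, since the patent does not prescribe a concrete API.

```python
# Minimal sketch of steps 110-120, assuming a PyTorch-style model interface.
# All identifiers here are illustrative, not taken from the patent.
import torch

def synthesize_avatar(voice_data, image_data, extract_related_features, expression_model):
    # Step 110: derive per-frame expression-related features (language,
    # emotion, speaker identity) from the voice data.
    related_features = extract_related_features(voice_data)      # e.g. [N, 512]

    # Step 120: feed the image data (texture map + per-frame expression mask
    # maps) and the related features into the expression synthesis model to
    # obtain one virtual expression map per frame.
    with torch.no_grad():
        frames = expression_model(image_data, related_features)  # [N, 3, H, W]
    return frames   # the frame sequence constitutes the avatar video
```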
Based on the above embodiment, the expression synthesis model includes a feature extraction layer and an expression prediction layer. Fig. 2 is a schematic flow chart of an expression synthesis method according to an embodiment of the present invention, and as shown in fig. 2, step 120 specifically includes:
and step 121, inputting the image data and the relevant characteristics respectively corresponding to any frame into a characteristic extraction layer of the expression synthesis model to obtain the frame characteristics output by the characteristic extraction layer.
Specifically, the voice data may be divided into a plurality of frames of voice data, and for each frame of voice data there is a corresponding relevant feature. Similarly, in the image data, the same texture map may correspond to every frame to represent the appearance of the avatar in the avatar video, while different expression mask maps may correspond to different frames to represent the actions of the avatar, particularly the head actions, in different frames of the avatar video.
In the expression synthesis model, the feature extraction layer is used for extracting frame features of any frame from image data and related features respectively corresponding to the frame. The frame feature may be an image feature of the frame and an expression-related feature of the frame, and may further include a fusion feature of the image feature and the expression-related feature of the frame, which is not specifically limited in this embodiment of the present invention.
And step 122, inputting the frame characteristics into an expression prediction layer of the expression synthesis model to obtain a virtual expression graph of the frame output by the expression prediction layer.
Specifically, in the expression synthesis model, the expression prediction layer is used for predicting the virtual expression map of any frame based on the frame features of that frame. Here, the virtual expression map is a frame of image containing the avatar configured with an expression corresponding to that frame of voice data, and the position, motion, etc. of the avatar are consistent with the image data corresponding to the frame. The virtual expression maps of all frames together form the avatar video.
According to the method provided by the embodiment of the invention, the frame characteristics of any frame are obtained, the virtual expression graph of the frame is obtained based on the frame characteristics, the virtual image video is finally obtained, and the overall naturalness and fidelity of the virtual image video are improved by improving the naturalness and fidelity of each frame of virtual expression graph.
Based on any embodiment, the feature extraction layer comprises a current feature extraction layer and a pre-frame feature extraction layer; fig. 3 is a schematic flow chart of the feature extraction method according to the embodiment of the present invention, and as shown in fig. 3, step 121 specifically includes:
step 1211, inputting the image data and the relevant features corresponding to each frame to a current feature extraction layer of the feature extraction layer to obtain current features output by the current feature extraction layer.
Step 1212, inputting the virtual expression map of the preset frame before the frame into a before-frame feature extraction layer of the feature extraction layer, so as to obtain before-frame features output by the before-frame feature extraction layer.
Specifically, the frame features of any frame include a current feature and a pre-frame feature, wherein the current feature is obtained by feature extraction of image data and related features respectively corresponding to the frame through a current feature extraction layer, and the current feature is used for reflecting the features of the frame in the aspect of an avatar, especially the expression of the avatar; the pre-frame features are obtained by extracting features of a virtual expression map of a preset frame before the frame through a pre-frame feature extraction layer, and the pre-frame features are used for reflecting virtual images in the virtual expression map of the preset frame before the frame, especially features in the aspect of virtual image expressions.
Here, the preset frames preceding any frame may be several frames immediately before that frame. For example, if the frame is the n-th frame, its preceding preset frames may be the two frames before it, namely the (n-2)-th and (n-1)-th frames.
Based on any of the above embodiments, step 122 specifically includes: and fusing the current characteristic and the characteristics before the frame and inputting the fused current characteristic and the fused characteristics before the frame into the expression prediction layer to obtain the virtual expression graph of the frame output by the expression prediction layer.
In the embodiment of the invention, the current characteristic and the characteristic before the frame are used as the frame characteristic of any frame for expression prediction, so that the synthesized virtual image expression not only can be naturally matched with the voice data corresponding to the frame, but also can realize the natural transition of the frame virtual image expression and the virtual image expressions of the previous frames, and further improve the reality and naturalness of the virtual image video.
Based on any embodiment, the expression prediction layer comprises a candidate expression prediction layer, an optical flow prediction layer and a fusion layer; fig. 4 is a schematic flow chart of an expression prediction method according to an embodiment of the present invention, and as shown in fig. 4, step 122 specifically includes:
and 1221, fusing the current features and the features before the frame, and inputting the fused current features and features before the frame into a candidate expression prediction layer of the expression prediction layer to obtain a candidate expression graph output by the candidate expression prediction layer.
Here, the candidate expression prediction layer is configured to predict the avatar expression of any frame based on the current feature and the pre-frame feature corresponding to the frame, and output a candidate expression map of the frame. The candidate expression map of the frame is a virtual expression map in which the avatar is configured with an expression corresponding to that frame of voice data.
And 1222, fusing the current feature and the pre-frame feature, and inputting the fused current feature and pre-frame feature into an optical flow prediction layer of the expression prediction layer to obtain optical flow information output by the optical flow prediction layer.
Here, the optical flow prediction layer is configured to predict the optical flow between the previous frame and the current frame based on the current feature and the pre-frame feature corresponding to any frame, and output the optical flow information of the frame. The optical flow information of the frame may include the predicted optical flow between the previous frame and the frame, and may further include a weight used for fusing the optical-flow-warped previous frame with the candidate expression map.
And step 1223, inputting the candidate expression graph and the optical flow information into a fusion layer in the expression prediction layer to obtain the virtual expression graph of the frame output by the fusion layer.
Here, the fusion layer is configured to fuse the candidate expression map of any frame with the optical flow information to obtain the virtual expression map of the frame. For example, the fusion layer may directly superimpose the candidate expression map and the previous frame's virtual expression map warped by the predicted optical flow, or it may weight and superimpose them based on the predicted weights to obtain the virtual expression map, which is not specifically limited in this embodiment of the present invention.
According to the method provided by the embodiment of the invention, the optical flow prediction is carried out through the current characteristic and the characteristic before the frame, and the optical flow information is applied to the generation of the virtual expression graph, so that the muscle movement of each region of the executed expression of the virtual image in the virtual image video is more natural.
Based on any of the above embodiments, fig. 5 is a schematic structural diagram of an expression synthesis model provided in an embodiment of the present invention, and in fig. 5, the expression synthesis model includes a current feature extraction layer, a feature extraction layer before a frame, a candidate expression prediction layer, an optical flow prediction layer, and a fusion layer.
The current feature extraction layer is used for obtaining the current features of any frame based on the image data and the relevant features respectively corresponding to the frame.
Assuming the relevant feature of the voice data is M, the hidden-layer feature HT of the relevant feature can be obtained by feeding M into a long short-term memory network (LSTM); the hidden-layer features corresponding to each frame are denoted HT(0), HT(1), …, HT(t), …, HT(N-1), where HT(t) is the hidden-layer feature of the relevant feature corresponding to the t-th frame and N is the total number of frames of the image data. The image data corresponding to the t-th frame comprises I(0) and I_m(t), where I(0) denotes the texture map and I_m(t) denotes the expression mask map corresponding to the t-th frame.
In FIG. 5, in the current feature extraction layer, I(0) and I_m(t) pass through a first convolution layer (kernel=3, stride=2, channel_out=64), a second convolution layer (kernel=3, stride=2, channel_out=128), a third convolution layer (kernel=3, stride=2, channel_out=256) and a fourth convolution layer (kernel=3, stride=2, channel_out=512) to give a 512-dimensional feature map, followed by 5 ResBlock layers (kernel=3, stride=1, channel_out=512). In this process, the hidden-layer feature HT(t) of the relevant feature is expanded and embedded into the second, third and fourth convolution layers and added to their convolution results, realizing the fusion of the relevant feature and the image data and yielding the current feature CFT(t) of the t-th frame.
Note that in the current feature extraction layer, when HT(t) is added to the convolution result FT(t) of I(0) and I_m(t), HT(t) is superimposed only on the mask region of FT(t) and not on the non-mask region, where the mask region covers the regions of the avatar that perform expression. In this way, the expression-related features are superimposed only on the regions that need to perform expression, while the regions that do not need to perform expression keep the original avatar, which is expressed by the following formula:
CFT(t) = f(I(0), I_m(t), HT(t); θ)
where θ denotes the relevant parameters of the current feature extraction layer.
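A rough PyTorch sketch of the current feature extraction layer described above is given below. The channel widths and the idea of adding HT(t) to the outputs of the second to fourth convolutions only on the mask region follow the text; the input layout (texture map concatenated with a single-channel binary region mask), the linear projections of HT(t), and the activation functions are assumptions made only for illustration.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """3x3, stride-1 residual block with a fixed channel width (a guess at the
    text's 'ResBlock (kernel=3, stride=1, channel_out=512)')."""
    def __init__(self, ch=512):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, 1, 1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, 1, 1))
    def forward(self, x):
        return torch.relu(x + self.body(x))

class CurrentFeatureExtractor(nn.Module):
    """Sketch of the current feature extraction layer: four stride-2 3x3
    convolutions (64/128/256/512 channels) plus five ResBlocks, with HT(t)
    added to the outputs of conv2-conv4 on the mask region only."""
    def __init__(self, in_ch=4, ht_dim=512):
        super().__init__()
        chs = [64, 128, 256, 512]
        self.convs = nn.ModuleList()
        prev = in_ch                      # texture map (3ch) + binary region mask (1ch), an assumption
        for c in chs:
            self.convs.append(nn.Conv2d(prev, c, 3, 2, 1))
            prev = c
        # project HT(t) to the channel width of conv2..conv4 before adding it
        self.ht_proj = nn.ModuleList([nn.Linear(ht_dim, c) for c in chs[1:]])
        self.resblocks = nn.Sequential(*[ResBlock(512) for _ in range(5)])

    def forward(self, texture, region_mask, ht):
        # region_mask: 1 where the avatar performs expression, 0 elsewhere
        x = torch.cat([texture, region_mask], dim=1)
        m = region_mask
        for i, conv in enumerate(self.convs):
            x = torch.relu(conv(x))
            m = nn.functional.interpolate(m, size=x.shape[-2:])  # track mask resolution
            if i >= 1:                    # embed HT(t) into conv2..conv4 outputs
                h = self.ht_proj[i - 1](ht)[:, :, None, None]
                x = x + h * m             # superimpose only on the mask region
        return self.resblocks(x)          # CFT(t), a 512-channel feature map
```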
The pre-frame feature extraction layer is used for obtaining the pre-frame features of the frame based on the virtual expression graph of any pre-frame.
Assume the preset frames before the frame are the two preceding frames, i.e. the (t-1)-th and (t-2)-th frames, whose virtual expression maps are Fake(t-1) and Fake(t-2). In the pre-frame feature extraction layer, Fake(t-1) and Fake(t-2) are fed into a 4-layer convolutional network (kernel=3, stride=2, channel_out=64,128,256,512), and a 512-dimensional feature map, namely the pre-frame feature PFT(t), is then obtained through a 5th-layer ResBlock (kernel=3, stride=1, channel_out=512).
This yields the frame feature of the t-th frame, denoted CFT(t) + PFT(t).
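Continuing the sketch, one possible form of the pre-frame feature extraction layer and of the frame feature is shown below; beyond the stated hyperparameters (kernel 3, stride 2, channels 64/128/256/512, one ResBlock), the details are assumptions, and ResBlock is the helper defined in the previous sketch.

```python
import torch
import torch.nn as nn

class PreFrameFeatureExtractor(nn.Module):
    """Sketch: Fake(t-1) and Fake(t-2) are concatenated on channels and passed
    through four stride-2 3x3 convolutions (64/128/256/512) and one ResBlock
    to give the pre-frame feature PFT(t)."""
    def __init__(self, in_ch=6):              # two RGB frames, an assumption
        super().__init__()
        layers, prev = [], in_ch
        for c in (64, 128, 256, 512):
            layers += [nn.Conv2d(prev, c, 3, 2, 1), nn.ReLU(inplace=True)]
            prev = c
        layers.append(ResBlock(512))           # ResBlock from the sketch above
        self.net = nn.Sequential(*layers)

    def forward(self, fake_tm1, fake_tm2):
        return self.net(torch.cat([fake_tm1, fake_tm2], dim=1))   # PFT(t)

# The frame feature of the t-th frame is then the element-wise sum:
# frame_feature = cft_t + pft_t              # CFT(t) + PFT(t)
```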
The candidate expression prediction layer is used for determining the corresponding candidate expression map from the input frame features. In the candidate expression prediction layer, the frame feature CFT(t) + PFT(t) passes through 4 ResBlock layers (kernel=3, stride=1, channel_out=512) and 4 upsampling layers (kernel=3, stride=2, channel_out=256,128,64,1) to obtain the candidate expression map of the t-th frame, denoted S(t). Formulated as:
S(t) = g(CFT(t) + PFT(t); φ)
where φ denotes the parameters of the candidate expression prediction layer.
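A corresponding sketch of this decoder follows; the same structure, with a 3-channel output, can serve as the optical flow prediction layer described next. Output activations are omitted because the text does not specify them, and ResBlock is again the helper from the earlier sketch.

```python
import torch.nn as nn

class ExpressionDecoder(nn.Module):
    """Sketch of the decoder used for both prediction layers: four ResBlocks
    followed by four stride-2 upsampling layers. out_channels=(256,128,64,1)
    gives the candidate expression map S(t); out_channels=(256,128,64,3)
    gives the optical flow F(t-1) (2 channels) plus the weight map W(t)."""
    def __init__(self, out_channels=(256, 128, 64, 1)):
        super().__init__()
        blocks = [ResBlock(512) for _ in range(4)]   # ResBlock from the earlier sketch
        ups, prev = [], 512
        for i, c in enumerate(out_channels):
            ups.append(nn.ConvTranspose2d(prev, c, 3, 2, padding=1, output_padding=1))
            if i < len(out_channels) - 1:
                ups.append(nn.ReLU(inplace=True))
            prev = c
        self.net = nn.Sequential(*blocks, *ups)

    def forward(self, frame_feature):                # CFT(t) + PFT(t)
        return self.net(frame_feature)
```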
The optical flow prediction layer is used for predicting the optical flow between the previous frame and the current frame from the input frame features and outputting the optical flow information of the frame. In the optical flow prediction layer, the frame feature CFT(t) + PFT(t) passes through 4 ResBlock layers (kernel=3, stride=1, channel_out=512) and 4 upsampling layers (kernel=3, stride=2, channel_out=256,128,64,3) to obtain the optical flow F(t-1) between the previous frame's virtual expression map Fake(t-1) and the current frame's virtual expression map Fake(t), together with a weighting matrix W(t).
The fusion layer is used for fusing the candidate expression map S(t) of any frame with the optical flow information F(t-1) and W(t) to obtain the virtual expression map Fake(t) of the frame. Specifically, the candidate expression map S(t) and the previous frame's virtual expression map Fake(t-1) warped by the optical flow F(t-1) can be weighted and summed through the weighting matrix W(t), realizing the fusion of the candidate expression map and the previous virtual expression map. The specific formula is:
Fake(t) = S(t)*W(t) + (1-W(t))*F(t-1)⊙Fake(t-1)
where ⊙ denotes warping an image by an optical flow, W(t) is the weight of the candidate expression map, and 1-W(t) is the weight of the previous frame's virtual expression map after optical-flow warping.
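The fusion formula above can be realized, for instance, with a flow-based warp followed by a per-pixel weighted sum; the grid-normalization details in this sketch are implementation assumptions.

```python
import torch
import torch.nn.functional as F

def warp_by_flow(img, flow):
    """The '⊙' operation: warp img [B,C,H,W] with a dense optical flow
    [B,2,H,W] given in pixels."""
    b, _, h, w = img.shape
    ys, xs = torch.meshgrid(torch.arange(h, device=img.device),
                            torch.arange(w, device=img.device), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float().unsqueeze(0)    # [1,2,H,W]
    coords = base + flow                                        # shifted sampling positions
    # normalize to [-1, 1] for grid_sample
    gx = 2.0 * coords[:, 0] / max(w - 1, 1) - 1.0
    gy = 2.0 * coords[:, 1] / max(h - 1, 1) - 1.0
    grid = torch.stack((gx, gy), dim=-1)                        # [B,H,W,2]
    return F.grid_sample(img, grid, align_corners=True)

def fuse_frame(candidate_s, weight_w, flow_prev, fake_prev):
    """Fake(t) = S(t)*W(t) + (1 - W(t)) * (F(t-1) ⊙ Fake(t-1))."""
    return candidate_s * weight_w + (1.0 - weight_w) * warp_by_flow(fake_prev, flow_prev)
```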
The expression synthesis model provided by the embodiment of the invention can depict the synthesis details of different people under different moods more vividly through the application of relevant characteristics and the integral modeling aiming at the expression, and simultaneously avoids the problem of incoordination caused by independent synthesis. Furthermore, inter-frame continuity of the synthetic avatar is optimized by the optical flow information.
Based on any one of the embodiments, in the method, the expression synthesis model is obtained by training a discriminator based on the sample speaker video, the relevant characteristics of the sample voice data and the sample image data corresponding to the sample speaker video, and the expression synthesis model and the discriminator form a generative confrontation network.
Specifically, a Generative Adversarial Network (GAN) is a deep learning model and one of the most promising approaches to unsupervised learning on complex distributions. A generative adversarial network obtains good output through the mutual game learning of two modules in its framework: a generative model and a discriminative model. In the embodiment of the invention, the expression synthesis model is the generative model and the discriminator is the discriminative model.
The expression synthesis model is used for synthesizing continuous virtual image videos, and the discriminator is used for discriminating whether the input video is the virtual image video synthesized by the expression synthesis model or the real recorded video. The function of the discriminator is to judge whether the virtual image video synthesized by the expression synthesis model is real and lifelike.
The method provided by the embodiment of the invention can obviously improve the training effect of the expression synthesis model through the mutual game learning training of the expression synthesis model and the discriminator, thereby effectively improving the fidelity and naturalness of the virtual image video output by the expression synthesis model.
According to any of the above embodiments, the discriminator comprises an image discriminator and/or a video discriminator; the image discriminator is used for judging the synthetic reality of any frame of virtual expression image in the virtual image video, and the video discriminator is used for judging the synthetic reality of the virtual image video.
Specifically, the generative adversarial network may include only the image discriminator or only the video discriminator, or may include both the image discriminator and the video discriminator.
The image discriminator performs authenticity discrimination at the image level, i.e. it determines whether the composition of expressions, such as the composition of the facial and neck muscles, is realistic. The image discriminator may take a virtual expression map Fake(t) of the current frame synthesized by the expression synthesis model, feed it into a 4-layer convolutional network (kernel=3, stride=1, channel_out=64,128,256,1), and compute the L2 norm between the resulting feature map and an all-zeros matrix of the same size. Similarly, the image discriminator may feed any image frame Real(t) from a real recorded video into the 4-layer convolutional network and compute the L2 norm between the resulting feature map and an all-ones matrix of the same size. Here the all-zeros matrix corresponds to the label "synthesized image", the all-ones matrix corresponds to the label "real image", and the L2 norm is the loss value of the image discriminator. To improve the quality of the synthesized virtual expression maps at each resolution, the virtual expression maps output by the expression synthesis model may additionally be down-sampled by factors of 2 and 4 and then discriminated.
The video discriminator performs authenticity discrimination at the video level, i.e. it determines whether the linkage in the video, such as the muscle movements of the face and neck, is realistic. Several consecutive virtual expression maps synthesized by the expression synthesis model and the corresponding optical flows, e.g. Fake(t-2), Fake(t-1), Fake(t) and F(t-2), F(t-1), may be acquired and fed into a video discriminator formed by a 4-layer convolutional network (kernel=3, stride=1, channel_out=64,128,256,1) to compute the discrimination loss. Similarly, the video discriminator also computes the discrimination loss on the real recorded video. To improve the quality of the synthesized avatar video at each resolution, the virtual expression maps output by the expression synthesis model may additionally be down-sampled by factors of 2 and 4 and then discriminated.
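The multi-scale, least-squares-style discrimination described above might look roughly like the sketch below; the LeakyReLU activations and the use of F.interpolate for the 2x and 4x down-sampling are assumptions not stated in the text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PatchDiscriminator(nn.Module):
    """Sketch of the 4-layer convolutional discriminator (64/128/256/1 channels);
    the same shape can serve as the image or the video discriminator, the latter
    taking several frames plus optical flows concatenated on channels."""
    def __init__(self, in_ch=3):
        super().__init__()
        layers, prev = [], in_ch
        for c in (64, 128, 256, 1):
            layers += [nn.Conv2d(prev, c, 3, 1, 1), nn.LeakyReLU(0.2, inplace=True)]
            prev = c
        self.net = nn.Sequential(*layers[:-1])        # no activation on the final map
    def forward(self, x):
        return self.net(x)

def discriminator_loss(disc, fake_img, real_img):
    """L2 loss against an all-zeros map for synthesized frames and an all-ones
    map for real frames, evaluated at full, 1/2 and 1/4 resolution."""
    loss = 0.0
    for s in (1.0, 0.5, 0.25):
        f = fake_img if s == 1.0 else F.interpolate(fake_img, scale_factor=s)
        r = real_img if s == 1.0 else F.interpolate(real_img, scale_factor=s)
        loss = loss + torch.mean(disc(f.detach()) ** 2)          # fake -> 0
        loss = loss + torch.mean((disc(r) - 1.0) ** 2)           # real -> 1
    return loss
```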
During the training of the expression synthesis model, the adversarial loss derived from the discriminator can be added to the loss function of the expression synthesis model, so that the expression synthesis model and the discriminator together form an adversarial loss.
In any of the above embodiments, the method wherein the relevant features include language-relevant features, and emotional and/or speaker identity features.
The language-dependent features correspond to different pronunciations, the different pronunciations require a speaker to mobilize facial muscles to form different mouth shapes, and the movements of the facial muscles and neck muscles corresponding to the different mouth shapes are different. The emotional characteristics are used for representing the emotion of a speaker, and when the speaker speaks the same content under different emotions, the motions of the facial muscles and the neck muscles including the mouth shape are different. The speaker identity characteristic is used to represent an identity of the speaker, and may specifically be an identifier corresponding to the speaker, or an occupation identifier corresponding to the speaker, or corresponding identifiers such as a personality characteristic and a language style characteristic of the speaker, which is not specifically limited in this embodiment of the present invention.
In any of the above embodiments, the method wherein the image data is determined based on the speaker identity features.
Specifically, in the pre-stored massive image data, different image data correspond to different avatars, and the different avatars have different identity characteristics. After the speaker identity characteristics in the voice data related characteristics are known, image data matched with the speaker identity characteristics can be selected from the massive image data and applied to synthesis of virtual image videos.
For example, image data of four persons A, B, C, and D is stored in advance. When the speaker identity features of the voice data are found to point to B, the image data corresponding to B may be selected for the synthesis of the avatar video.
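A trivially simple illustration of this selection is sketched below; the keyed lookup and cosine-similarity matching are illustrative assumptions, since the text only requires that matching image data be selected.

```python
import numpy as np

def select_image_data(speaker_id_vec, image_bank, identity_bank):
    """image_bank: {name: stored image data}; identity_bank: {name: identity feature}.
    Return the stored image data whose identity feature is most similar to the
    speaker identity feature extracted from the voice data."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
    best = max(identity_bank, key=lambda name: cos(speaker_id_vec, identity_bank[name]))
    return image_bank[best]     # e.g. picks B's image data in the example above
```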
Based on any of the above embodiments, step 110 specifically includes: determining acoustic features of the speech data; a correlation feature is determined based on the acoustic features.
Specifically, the acoustic features here may be a spectrogram and fbank features. For example, an adaptive filter may be used to denoise the speech data and unify the audio sampling rate and channels, here set to 16K, mono, from which the spectrogram and fbank features are then extracted (frame shift 10ms, window length 1 s).
Thereafter, a BN feature sequence representing the language content can be extracted as the language-related features using a bottleneck network; here a 256-dimensional BN feature is obtained every 40 ms, denoted L(0), L(1), …, L(N-1), where N is the number of frames of the 25 fps video. Compared with prior methods based on phoneme features, BN features are language-independent, so even if only Chinese is used when training the expression synthesis model, correct mouth shapes can still be synthesized for other languages at synthesis time. In addition, a convolutional long short-term memory network (ConvLSTM) fully trained on the recognition task of 8 basic expressions (anger, joy, fear, depression, excitement, surprise, sadness, and neutrality) is applied to extract a high-dimensional feature sequence expressing emotion as the emotion features; here a 128-dimensional emotion vector is obtained every 40 ms, denoted E(0), E(1), …, E(N-1), where N is the number of frames of the 25 fps video. Similarly, to realize personalized customization, a speaker recognition network based on a deep neural network (DNN) and i-vectors is applied to extract the speaker identity feature sequence; here a 128-dimensional identity feature vector is obtained every 40 ms, denoted P(0), P(1), …, P(N-1), where N is the number of frames of the 25 fps video. Finally, the three feature sequences are spliced together frame by frame, giving a 512-dimensional fused relevant feature for each frame, denoted M(0), M(1), …, M(N-1), where N is the number of frames of the 25 fps video.
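The per-frame concatenation of the three feature sequences can be sketched as follows; the three upstream extractors (bottleneck ASR network, ConvLSTM emotion recognizer, DNN/i-vector speaker network) are assumed to exist already and are not implemented here.

```python
import numpy as np

def assemble_related_features(bn_feats, emo_feats, spk_feats):
    """Concatenate per-frame language (256-d), emotion (128-d) and speaker
    identity (128-d) features into the 512-d related feature M(t); one frame
    every 40 ms matches 25 fps video."""
    n = min(len(bn_feats), len(emo_feats), len(spk_feats))   # guard against length drift
    m = np.concatenate([bn_feats[:n], emo_feats[:n], spk_feats[:n]], axis=1)
    assert m.shape[1] == 256 + 128 + 128                      # 512-d per frame
    return m                                                  # M(0) ... M(N-1)
```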
According to any one of the above embodiments, in the method, the expressions of the avatar configuration in the avatar video corresponding to the voice data include facial expressions and neck expressions.
Correspondingly, the covered part of the expression mask map in the character data comprises the execution area of the facial expression and the execution area of the neck expression. Here, the execution region of the facial expression may include facial muscle regions such as frontalis, orbicularis oculi, frown, orbicularis oris, and the like, and does not include an eyeball region and a nose bridge region because the movement of the eyeball is not controlled by facial muscles, and the nose bridge has a skeleton, is approximately a rigid body, and is less affected by the movement of muscles in other regions of the face.
In the embodiment of the invention, the facial expression and the neck expression are combined to be existed as the whole expression, and compared with a mode of independently modeling each region for executing the expression in the virtual image, the muscle linkage problem of each region can be effectively solved by aiming at the whole expression modeling, so that the muscle linkage of each region is more natural and vivid.
Based on any of the above embodiments, fig. 6 is a schematic flow chart of an avatar synthesis method according to another embodiment of the present invention, as shown in fig. 6, the method includes:
step 610, determining the voice data:
extracting voice data from the collected video and audio data, carrying out denoising processing on the voice data by using an adaptive filter, unifying an audio sampling rate and a sound channel, and then extracting a spectrogram and fbank characteristics from the voice data to be recognized. In order to sufficiently ensure the time sequence of the voice data, the embodiment of the invention does not need to segment the input voice data.
Step 620, obtaining relevant characteristics of the voice data:
From the spectrogram and fbank features of the voice data obtained in the previous step, the language-related features, emotion features, and speaker identity features corresponding to each frame of the voice data to be recognized are obtained through the respective neural networks for extracting them, and the three kinds of features are spliced frame by frame to obtain the relevant features corresponding to each frame.
Step 630, determine video data, detect face region, crop head region:
extracting video data from the collected video and audio data, detecting a face region of each frame of image, expanding the face region by 1.5 times by taking the obtained face frame as a reference, obtaining a region containing the whole head and neck, cutting the region, and storing the region as an image sequence, wherein the image sequence is marked as I (0), I (1), …, I (N-1), and N is the frame number of 25fps video.
Step 640, generating image data:
according to the skin color and the human face physiological structure characteristics or by using a neural network, facial muscle regions and neck muscle regions of each frame of the cut image I (t), such as frontal muscle, orbicularis oculi, frown muscle, orbicularis oris, etc., are divided, and the eyeball regions and the nose bridge regions are not included, because the movement of the eyeball is not controlled by the facial muscles, the nose bridge has a skeleton, is approximate to a rigid body, and is slightly influenced by the movement of muscles in other regions of the face. The pixel values of the facial muscle region and the neck muscle region are set to zero, and an expression mask image sequence is obtained and is marked as Im (0), Im (1), … and Im (N-1), wherein N is the frame number of the 25fps video.
The image data thus obtained includes a texture map I (0) and an expression mask map corresponding to each frame.
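Generating an expression mask map then amounts to zeroing the expression-performing regions of each cropped frame; in the sketch below the boolean-mask representation is an assumption, and the muscle-region segmentation itself is not implemented.

```python
import numpy as np

def make_expression_mask_map(frame, muscle_region_mask):
    """Zero the pixels of the facial-muscle and neck-muscle regions of a cropped
    head image to obtain the expression mask map I_m(t). `muscle_region_mask`
    is a boolean HxW array marking those regions (from skin-colour rules or a
    segmentation network, not sketched here)."""
    masked = frame.copy()
    masked[muscle_region_mask] = 0        # expression-performing regions -> 0
    return masked
```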
Step 650, inputting the expression synthesis model to obtain the avatar head video:
in the expression synthesis model, a texture map and an expression mask map in image data are subjected to a plurality of layers of convolution networks to obtain a feature map, the feature map is fused with spliced related features, a face region and a neck region are synthesized through the plurality of layers of convolution networks, and finally optical flow information is added into a video, so that the synthesized mouth shape, expression, throat movement and the like are more natural.
For example, suppose the input texture map is expressionless and the voice data is the exclamation "captured over!" spoken excitedly. For the irrelevant regions of the texture map, such as the hair and nose, the content is essentially copied into the finally synthesized virtual expression map; for the relevant regions, such as the mouth shape, cheeks, and eyebrows, the original regions are turned into new textures according to the relevant features and the texture image, and these are fused to obtain the finally synthesized virtual expression map.
Step 660, fusing the avatar head video and the body part of the video data:
if a fine seam appears at the boundary in the video which is directly spliced by the head area of the synthesized virtual head portrait according to the original coordinates, the seam area can be fused by adopting a Poisson fusion algorithm as the optimization, so that the boundary transition is smoother.
Compared with traditional voice-driven avatar synthesis and expression-migration-based face synthesis, the method provided by the embodiment of the invention not only synthesizes the facial and neck muscle movements of different people under different moods more vividly, but also realizes fully automatic offline synthesis, saving a large amount of labor cost and improving production efficiency.
Based on any of the above embodiments, fig. 7 is a schematic structural diagram of an avatar synthesis apparatus according to an embodiment of the present invention. As shown in fig. 7, the apparatus includes a relevant feature determination unit 710 and an expression synthesis unit 720;
the relevant feature determination unit 710 is configured to determine relevant features of the voice data; the related features are used for representing features related to the expression of a speaker contained in the voice data;
the expression synthesis unit 720 is configured to input image data and the relevant features into an expression synthesis model to obtain an avatar video output by the expression synthesis model, where an avatar in the avatar video is configured with an expression corresponding to the voice data;
the expression synthesis model is obtained by training based on sample speaker videos, relevant features of the sample voice data corresponding to the sample speaker videos, and sample image data.
The device provided by the embodiment of the invention carries out the expression synthesis of the virtual image by applying the relevant characteristics containing rich expression relevant information, so that the virtual image expression can better fit with the voice data and is more natural and real. In addition, in the virtual image video generated by the expression synthesis model, the expressions of the virtual image exist in an integral form, and compared with a mode of independently modeling each region for executing the expressions in the virtual image, the linkage problem of the muscles in each region can be effectively solved by aiming at the integral modeling of the expressions, so that the muscle linkage of each region is more natural and vivid.
Based on any of the above embodiments, theexpression synthesis unit 720 includes:
the feature extraction unit is used for inputting image data and relevant features respectively corresponding to any frame into a feature extraction layer of the expression synthesis model to obtain frame features output by the feature extraction layer;
and the expression prediction unit is used for inputting the frame characteristics to an expression prediction layer of the expression synthesis model to obtain a virtual expression map of any frame output by the expression prediction layer.
Based on any one of the above embodiments, the feature extraction unit includes:
the current feature extraction subunit is used for inputting image data and relevant features respectively corresponding to any frame into a current feature extraction layer of the feature extraction layer to obtain current features output by the current feature extraction layer;
and the pre-frame feature extraction subunit is used for inputting the virtual expression maps of preset frames preceding any frame into a pre-frame feature extraction layer of the feature extraction layer to obtain the pre-frame features output by the pre-frame feature extraction layer.
Based on any of the above embodiments, the expression prediction unit is specifically configured to:
and fusing the current features and the features before the frames and inputting the fused current features and the features before the frames into the expression prediction layer to obtain the virtual expression graph of any frame output by the expression prediction layer.
Based on any one of the above embodiments, the expression prediction unit includes:
the candidate expression prediction subunit is used for fusing the current feature and the pre-frame feature and inputting the fused current feature and the pre-frame feature into a candidate expression prediction layer of the expression prediction layer to obtain a candidate expression graph output by the candidate expression prediction layer;
the optical flow prediction subunit is used for fusing the current feature and the pre-frame feature and inputting the fused current feature and the pre-frame feature into an optical flow prediction layer of the expression prediction layer to obtain optical flow information output by the optical flow prediction layer;
and the fusion subunit is used for inputting the candidate expression graph and the optical flow information into a fusion layer in the expression prediction layer to obtain a virtual expression graph of any frame output by the fusion layer.
Based on any one of the embodiments, the expression synthesis model is obtained by training together with a discriminator, based on sample speaker videos, relevant features of the sample voice data corresponding to the sample speaker videos, and sample image data, and the expression synthesis model and the discriminator form a generative adversarial network.
According to any of the above embodiments, the discriminator comprises an image discriminator and/or a video discriminator;
the image discriminator is used for judging the synthetic reality of any frame of virtual expression graph in the virtual image video, and the video discriminator is used for judging the synthetic reality of the virtual image video.
In any of the above embodiments, the relevant features include language-dependent features, and emotional and/or speaker identity features.
According to any of the above embodiments, the image data is determined based on the speaker identity features.
According to any one of the above embodiments, the expressions of the avatar configuration in the avatar video corresponding to the voice data include facial expressions and neck expressions.
Fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present invention. As shown in fig. 8, the electronic device may include a processor 810, a communication interface 820, a memory 830, and a communication bus 840, where the processor 810, the communication interface 820, and the memory 830 communicate with one another via the communication bus 840. The processor 810 may call logic instructions in the memory 830 to perform the following method: determining relevant features of the voice data, the relevant features characterizing features related to the expression of the speaker contained in the voice data; inputting image data and the relevant features into an expression synthesis model to obtain an avatar video output by the expression synthesis model, where the avatar in the avatar video is configured with an expression corresponding to the voice data; the expression synthesis model is obtained by training based on sample speaker videos, relevant features of the sample voice data corresponding to the sample speaker videos, and sample image data.
In addition, the logic instructions in the memory 830 may be implemented in the form of software functional units and, when sold or used as an independent product, stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program codes, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
Embodiments of the present invention further provide a non-transitory computer-readable storage medium on which a computer program is stored. When executed by a processor, the computer program implements the method provided in the foregoing embodiments, the method including: determining relevant features of the voice data, the relevant features being used for representing features related to the expression of a speaker contained in the voice data; and inputting image data and the relevant features into an expression synthesis model to obtain an avatar video output by the expression synthesis model, wherein an avatar in the avatar video is configured with an expression corresponding to the voice data; the expression synthesis model is obtained by training based on sample speaker videos, relevant features of sample voice data corresponding to the sample speaker videos, and sample image data.
The above-described embodiments of the apparatus are merely illustrative; the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, and may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the embodiments without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (13)

1. An avatar synthesis method, comprising:
determining relevant features of voice data, wherein the relevant features are used for representing features related to the expression of a speaker contained in the voice data;
inputting image data and the relevant features into an expression synthesis model to obtain an avatar video output by the expression synthesis model, wherein an avatar in the avatar video is configured with an expression corresponding to the voice data;
the expression synthesis model is obtained by training based on sample speaker videos, relevant features of sample voice data corresponding to the sample speaker videos and sample image data.
2. The avatar synthesis method of claim 1, wherein the inputting the image data and the relevant features into the expression synthesis model to obtain the avatar video output by the expression synthesis model comprises:
inputting the image data and the relevant features respectively corresponding to any frame into a feature extraction layer of the expression synthesis model to obtain frame features output by the feature extraction layer;
and inputting the frame features into an expression prediction layer of the expression synthesis model to obtain a virtual expression map of the frame output by the expression prediction layer.
3. The avatar synthesis method of claim 2, wherein the inputting the image data and the relevant features respectively corresponding to any frame into the feature extraction layer of the expression synthesis model to obtain the frame features output by the feature extraction layer comprises:
inputting the image data and the relevant features respectively corresponding to the frame into a current feature extraction layer of the feature extraction layer to obtain current features output by the current feature extraction layer;
and inputting the virtual expression maps of a preset number of frames preceding the frame into a pre-frame feature extraction layer of the feature extraction layer to obtain pre-frame features output by the pre-frame feature extraction layer.
4. The avatar synthesis method of claim 3, wherein the inputting the frame features into the expression prediction layer of the expression synthesis model to obtain the virtual expression map of the frame output by the expression prediction layer specifically comprises:
fusing the current features and the pre-frame features, and inputting the fused features into the expression prediction layer to obtain the virtual expression map of the frame output by the expression prediction layer.
5. The avatar synthesis method of claim 4, wherein the fusing the current features and the pre-frame features and inputting the fused features into the expression prediction layer to obtain the virtual expression map of the frame output by the expression prediction layer comprises:
fusing the current features and the pre-frame features, and inputting the fused features into a candidate expression prediction layer of the expression prediction layer to obtain a candidate expression map output by the candidate expression prediction layer;
fusing the current features and the pre-frame features, and inputting the fused features into an optical flow prediction layer of the expression prediction layer to obtain optical flow information output by the optical flow prediction layer;
and inputting the candidate expression map and the optical flow information into a fusion layer of the expression prediction layer to obtain the virtual expression map of the frame output by the fusion layer.
6. The avatar synthesis method of claim 1, wherein the expression synthesis model is obtained by training, together with a discriminator, based on the sample speaker videos, the relevant features of the sample voice data corresponding to the sample speaker videos, and the sample image data, and the expression synthesis model and the discriminator form a generative adversarial network.
7. The avatar synthesis method of claim 6, wherein the discriminator comprises an image discriminator and/or a video discriminator;
the image discriminator is used for judging the synthesis realism of any single frame of virtual expression map in the avatar video, and the video discriminator is used for judging the synthesis realism of the avatar video as a whole.
8. The avatar synthesis method of any one of claims 1 to 7, wherein the relevant features include language-related features, together with emotion features and/or speaker identity features.
9. The avatar synthesis method of claim 8, wherein the image data is determined based on the speaker identity features.
10. The avatar synthesis method of any one of claims 1 to 7, wherein the expression corresponding to the voice data that is configured for the avatar in the avatar video includes facial expressions and neck expressions.
11. An avatar synthesis apparatus, comprising:
a relevant feature determination unit, used for determining relevant features of voice data, wherein the relevant features are used for representing features related to the expression of a speaker contained in the voice data;
an expression synthesis unit, used for inputting image data and the relevant features into an expression synthesis model to obtain an avatar video output by the expression synthesis model, wherein an avatar in the avatar video is configured with an expression corresponding to the voice data;
wherein the expression synthesis model is obtained by training based on sample speaker videos, relevant features of sample voice data corresponding to the sample speaker videos, and sample image data.
12. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the steps of the avatar synthesis method according to any of claims 1 to 10 when executing said program.
13. A non-transitory computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the avatar synthesis method according to any of claims 1 to 10.
CN201911274701.3A | 2019-12-12 | 2019-12-12 | Avatar composition method, apparatus, electronic device, and storage medium | Active | CN111145282B (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN201911274701.3A | CN111145282B (en) | 2019-12-12 | 2019-12-12 | Avatar composition method, apparatus, electronic device, and storage medium

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN201911274701.3A | CN111145282B (en) | 2019-12-12 | 2019-12-12 | Avatar composition method, apparatus, electronic device, and storage medium

Publications (2)

Publication Number | Publication Date
CN111145282A | 2020-05-12
CN111145282B (en) | 2023-12-05

Family

ID=70518080

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN201911274701.3A | Active | CN111145282B (en) | 2019-12-12 | 2019-12-12 | Avatar composition method, apparatus, electronic device, and storage medium

Country Status (1)

Country | Link
CN (1) | CN111145282B (en)

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
WO2018113650A1 (en) * | 2016-12-21 | 2018-06-28 | 深圳市掌网科技股份有限公司 | Virtual reality language interaction system and method
CN106919251A (en) * | 2017-01-09 | 2017-07-04 | 重庆邮电大学 | A kind of collaborative virtual learning environment natural interactive method based on multi-modal emotion recognition
CN107705808A (en) * | 2017-11-20 | 2018-02-16 | 合光正锦(盘锦)机器人技术有限公司 | A kind of Emotion identification method based on facial characteristics and phonetic feature
CN109145837A (en) * | 2018-08-28 | 2019-01-04 | 厦门理工学院 | Face emotion identification method, device, terminal device and storage medium
CN108989705A (en) * | 2018-08-31 | 2018-12-11 | 百度在线网络技术(北京)有限公司 | A kind of video creating method of virtual image, device and terminal
CN109118562A (en) * | 2018-08-31 | 2019-01-01 | 百度在线网络技术(北京)有限公司 | Explanation video creating method, device and the terminal of virtual image
CN109410297A (en) * | 2018-09-14 | 2019-03-01 | 重庆爱奇艺智能科技有限公司 | It is a kind of for generating the method and apparatus of avatar image
CN110009716A (en) * | 2019-03-28 | 2019-07-12 | 网易(杭州)网络有限公司 | Generation method, device, electronic equipment and the storage medium of facial expression
CN110414323A (en) * | 2019-06-14 | 2019-11-05 | 平安科技(深圳)有限公司 | Emotion detection method, device, electronic device and storage medium
CN110488975A (en) * | 2019-08-19 | 2019-11-22 | 深圳市仝智科技有限公司 | A kind of data processing method and relevant apparatus based on artificial intelligence
CN110503942A (en) * | 2019-08-29 | 2019-11-26 | 腾讯科技(深圳)有限公司 | A voice-driven animation method and device based on artificial intelligence

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
LEI XIE et al.: "Speech and Auditory Interfaces for Ubiquitous, Immersive and Personalized Applications", IEEE Xplore *
LI Xinyi et al.: "A Survey of Research on Speech-Driven Facial Animation" (语音驱动的人脸动画研究现状综述), Computer Engineering and Applications (计算机工程与应用), no. 22 *
JIN Hui et al.: "Facial Expression Motion Analysis and Application Based on Feature Flow" (基于特征流的面部表情运动分析及应用), Journal of Software (软件学报), no. 12 *

Cited By (27)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN111915479A (en) * | 2020-07-15 | 2020-11-10 | 北京字节跳动网络技术有限公司 | Image processing method and device, electronic equipment and computer readable storage medium
CN111915479B (en) * | 2020-07-15 | 2024-04-26 | 抖音视界有限公司 | Image processing method and device, electronic equipment and computer readable storage medium
CN112132915A (en) * | 2020-08-10 | 2020-12-25 | 浙江大学 | Diversified dynamic time-delay video generation method based on generation countermeasure mechanism
CN112132915B (en) * | 2020-08-10 | 2022-04-26 | 浙江大学 | Diversified dynamic time-delay video generation method based on generation countermeasure mechanism
CN112215927A (en) * | 2020-09-18 | 2021-01-12 | 腾讯科技(深圳)有限公司 | Method, device, equipment and medium for synthesizing face video
CN112215927B (en) * | 2020-09-18 | 2023-06-23 | 腾讯科技(深圳)有限公司 | Method, device, equipment and medium for synthesizing face video
CN112182173A (en) * | 2020-09-23 | 2021-01-05 | 支付宝(杭州)信息技术有限公司 | Human-computer interaction method and device based on virtual life and electronic equipment
CN112465935A (en) * | 2020-11-19 | 2021-03-09 | 科大讯飞股份有限公司 | Virtual image synthesis method and device, electronic equipment and storage medium
CN112492383A (en) * | 2020-12-03 | 2021-03-12 | 珠海格力电器股份有限公司 | Video frame generation method and device, storage medium and electronic equipment
CN112650399A (en) * | 2020-12-22 | 2021-04-13 | 科大讯飞股份有限公司 | Expression recommendation method and device
CN112650399B (en) * | 2020-12-22 | 2023-12-01 | 科大讯飞股份有限公司 | Expression recommendation method and device
CN114793286A (en) * | 2021-01-25 | 2022-07-26 | 上海哔哩哔哩科技有限公司 | Video editing method and system based on virtual image
CN114793300A (en) * | 2021-01-25 | 2022-07-26 | 天津大学 | Virtual video customer service robot synthesis method and system based on generation countermeasure network
CN112785669A (en) * | 2021-02-01 | 2021-05-11 | 北京字节跳动网络技术有限公司 | Virtual image synthesis method, device, equipment and storage medium
CN112785669B (en) * | 2021-02-01 | 2024-04-23 | 北京字节跳动网络技术有限公司 | Virtual image synthesis method, device, equipment and storage medium
CN113096242A (en) * | 2021-04-29 | 2021-07-09 | 平安科技(深圳)有限公司 | Virtual anchor generation method and device, electronic equipment and storage medium
WO2022255980A1 (en) * | 2021-06-02 | 2022-12-08 | Bahcesehir Universitesi | Virtual agent synthesis method with audio to video conversion
CN114466179A (en) * | 2021-09-09 | 2022-05-10 | 马上消费金融股份有限公司 | Method and device for measuring synchronism of voice and image
CN114359517A (en) * | 2021-11-24 | 2022-04-15 | 科大讯飞股份有限公司 | Avatar generation method, avatar generation system, and computing device
CN114911381A (en) * | 2022-04-15 | 2022-08-16 | 青岛海尔科技有限公司 | Interactive feedback method and device, storage medium and electronic device
CN114937104A (en) * | 2022-06-24 | 2022-08-23 | 北京有竹居网络技术有限公司 | Virtual object face information generation method and device and electronic equipment
CN115375809B (en) * | 2022-10-25 | 2023-03-14 | 科大讯飞股份有限公司 | Method, device and equipment for generating virtual image and storage medium
CN115375809A (en) * | 2022-10-25 | 2022-11-22 | 科大讯飞股份有限公司 | Virtual image generation method, device, equipment and storage medium
CN116665695B (en) * | 2023-07-28 | 2023-10-20 | 腾讯科技(深圳)有限公司 | Virtual object mouth shape driving method, related device and medium
CN116665695A (en) * | 2023-07-28 | 2023-08-29 | 腾讯科技(深圳)有限公司 | Virtual object mouth shape driving method, related device and medium
CN117221465A (en) * | 2023-09-20 | 2023-12-12 | 北京约来健康科技有限公司 | Digital video content synthesis method and system
CN117221465B (en) * | 2023-09-20 | 2024-04-16 | 北京约来健康科技有限公司 | Digital video content synthesis method and system

Also Published As

Publication number | Publication date
CN111145282B (en) | 2023-12-05

Similar Documents

Publication | Publication Date | Title
CN111145282B (en) | Avatar composition method, apparatus, electronic device, and storage medium
CN111243626B (en) | Method and system for generating speaking video
WO2022048403A1 (en) | Virtual role-based multimodal interaction method, apparatus and system, storage medium, and terminal
CN112465935A (en) | Virtual image synthesis method and device, electronic equipment and storage medium
CN113781610A (en) | Virtual face generation method
US7136818B1 (en) | System and method of providing conversational visual prosody for talking heads
US11989976B2 (en) | Nonverbal information generation apparatus, nonverbal information generation model learning apparatus, methods, and programs
US20120130717A1 (en) | Real-time Animation for an Expressive Avatar
CN110688911A (en) | Video processing method, device, system, terminal equipment and storage medium
US20110131041A1 (en) | Systems And Methods For Synthesis Of Motion For Animation Of Virtual Heads/Characters Via Voice Processing In Portable Devices
KR20210119441A (en) | Real-time face replay based on text and audio
CN110610534B (en) | Automatic mouth shape animation generation method based on Actor-Critic algorithm
JP7423490B2 (en) | Dialogue program, device, and method for expressing a character's listening feeling according to the user's emotions
CN113838173B (en) | Virtual human head motion synthesis method driven by combination of voice and background sound
KR102373608B1 (en) | Electronic apparatus and method for digital human image formation, and program stored in computer readable medium performing the same
JPWO2019160104A1 (en) | Non-verbal information generator, non-verbal information generation model learning device, method, and program
WO2023284435A1 (en) | Method and apparatus for generating animation
CN116704085B (en) | Avatar generation method, apparatus, electronic device, and storage medium
US12165672B2 (en) | Nonverbal information generation apparatus, method, and program
JPWO2019160105A1 (en) | Non-verbal information generator, non-verbal information generation model learning device, method, and program
CN119441403A (en) | A digital human control method, device and electronic device based on multi-modality
CN117036556A (en) | Virtual image driving method and device and robot
CN115861670A (en) | Training method of feature extraction model and data processing method and device
Filntisis et al. | Photorealistic adaptation and interpolation of facial expressions using HMMS and AAMS for audio-visual speech synthesis
Verma et al. | Animating expressive faces across languages

Legal Events

Date | Code | Title | Description
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant
