Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
In the prior art, avatar synthesis technologies are mainly classified into the following three categories:
The first category is speech-driven avatar synthesis technology: the language information and expression information extracted from the voice are applied independently to the finally synthesized video. In this method, only a few basic expressions are considered, so the synthesized avatar appears rather stiff: it can only make a few predefined basic expressions, and the mouth and lips are often mismatched with the eyebrows, throat, cheeks, and so on. On the one hand, these problems arise because the opening and closing of the mouth shape is determined only by the pronunciation characteristics of the voice content, without considering differences between speakers or the physiological linkage between facial muscle blocks, so rich emotion cannot be expressed in a personalized way. On the other hand, such a method can only select one or two expressions from several or tens of fixed expressions to superimpose on the synthesized video, so rich facial expressions cannot be synthesized.
The second category is avatar synthesis technology based on expression migration: the facial expression, mouth shape, and rigid motion of a driving person are migrated to the avatar. The video synthesized by this method is more lifelike, but it is highly dependent on a real person's performance and cannot be synthesized offline.
The third category synthesizes avatar expressions by modeling each part of the face separately. An artist is required to design the motion of the whole face according to physiological and aesthetic expertise and to edit the state of each part frame by frame to synthesize a video segment, which not only requires strong expertise but is also time-consuming and labor-intensive.
From the anatomical point of view, the human face has 42 muscles, which can generate rich expressions and accurately convey various moods and emotions. The stretching of these muscles is not independent but strongly correlated. For example, when a person speaks in a calm state, the muscles of the lips and chin stretch; when the same sentence is spoken in an excited state, the muscles of the forehead and cheeks also stretch, and the stretching intensity of the muscles in areas such as the lips and chin is obviously greater than in the calm state. In addition, there are thousands of human expressions, while existing methods provide only several or tens of preset expressions, so their expressive capability is neither fine-grained nor personalized. Therefore, how to automatically synthesize a more lifelike and natural avatar is still a problem to be solved by those skilled in the art.
In this regard, the embodiment of the present invention provides an avatar composition method. Fig. 1 is a flow chart of an avatar composition method according to an embodiment of the present invention, as shown in fig. 1, the method includes:
Step 110, determining relevant characteristics of voice data; the relevant features are used to characterize the features contained in the speech data that relate to the expression of the speaker.
Specifically, the voice data is the voice data used for performing avatar synthesis, where the avatar may be a virtual person, or may be a virtual cartoon character, an animal, or the like; the embodiment of the present invention is not limited in this respect. The voice data may be the voice of a speaker collected by a sound-pickup device, or may be intercepted from voice data obtained through a network or the like, which is not specifically limited in the embodiment of the present invention.
The relevant features are features related to the expression of the speaker. For example, the language-related features in the voice data correspond to different pronunciations, which require the speaker to mobilize facial muscles to form different mouth shapes. As another example, the emotion features in the voice data: when the speaker says the same content under different emotions, the movements of the facial muscles, including the mouth shape, and of the neck muscles are also different. As another example, the scene features in the voice data: the speaking scene may also affect the facial expression of the speaker; when speaking in a noisy environment, the speaker may speak loudly and the facial expression may be relatively exaggerated, while when speaking in a quiet environment, the speaker may speak softly and the facial expression may be relatively subtle. As another example, the speaker identity features in the voice data: the expressions of different speakers when speaking may differ; for instance, a host of a children's program may speak with gentle and friendly expressions, while a host of a comedy program may speak with exaggerated expressions.
Step 120, inputting the avatar data and related features into the expression synthesis model to obtain an avatar video output by the expression synthesis model, wherein the avatar in the avatar video is configured with expressions corresponding to the voice data; the expression synthesis model is obtained by training based on the sample speaker video, the relevant characteristics of the sample voice data corresponding to the sample speaker video and the sample image data.
Specifically, the avatar data, that is, the image data used for performing avatar synthesis, may be an image of the speaker corresponding to the voice data, or may be an image unrelated to that speaker, which is not specifically limited in the embodiment of the present invention. The avatar data includes a texture map and an expression mask map. The texture map is an image of the avatar itself, containing the avatar and every region of the avatar in which expressions are performed. The expression mask map is an avatar image in which every expression-performing region of the avatar has been masked out; there may be one expression mask map corresponding to each frame, or one expression mask map corresponding to a plurality of frames.
The expression synthesis model is used to analyze the expression of the avatar based on the relevant features and, by combining it with the avatar data, to obtain an avatar video configured with the expression corresponding to the voice data. Before executing step 120, the expression synthesis model may be trained in advance, specifically in the following manner: first, a large number of sample speaker videos and the sample voice data corresponding to these sample speaker videos are collected, and the sample image data in the sample speaker videos and the relevant features of the sample voice data are extracted. Here, the sample speaker videos are videos of real human speakers. The initial model is then trained based on the relevant features of the sample voice data corresponding to the sample speaker videos and the sample image data, thereby obtaining the expression synthesis model.
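As a purely illustrative aid (not part of the described embodiment), the following is a minimal PyTorch-style sketch of how such supervised training against real speaker frames might be organized; the class ExpressionSynthesisModel, its layer shapes, and the L1 reconstruction loss are assumptions made for this sketch.

```python
import torch
from torch import nn, optim

# Hypothetical stand-in for the expression synthesis model described above:
# it maps avatar data (texture map stacked with the expression mask map) and a
# per-frame relevant feature to a predicted frame. Shapes are illustrative only.
class ExpressionSynthesisModel(nn.Module):
    def __init__(self, feat_dim=512, img_channels=3):
        super().__init__()
        self.img_enc = nn.Conv2d(img_channels * 2, 64, 3, padding=1)
        self.feat_proj = nn.Linear(feat_dim, 64)
        self.dec = nn.Conv2d(64, img_channels, 3, padding=1)

    def forward(self, avatar_frames, relevant_features):
        x = self.img_enc(avatar_frames)                        # fuse texture + mask maps
        f = self.feat_proj(relevant_features)[:, :, None, None]
        return torch.sigmoid(self.dec(x + f))                  # predicted frame

def train_step(model, optimizer, avatar_frames, relevant_features, real_frames):
    """One supervised step against frames taken from a real speaker video."""
    pred = model(avatar_frames, relevant_features)
    loss = nn.functional.l1_loss(pred, real_frames)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

model = ExpressionSynthesisModel()
optimizer = optim.Adam(model.parameters(), lr=1e-4)
# Dummy batch: 4 frames, 3-channel texture map stacked with 3-channel mask map.
avatar_frames = torch.rand(4, 6, 128, 128)
relevant_features = torch.rand(4, 512)
real_frames = torch.rand(4, 3, 128, 128)
print(train_step(model, optimizer, avatar_frames, relevant_features, real_frames))
```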
According to the method provided by the embodiment of the invention, the expression synthesis of the avatar is performed by applying relevant features containing rich expression-related information, so that the expression of the avatar better fits the voice data and the avatar is more natural and real. In addition, in the avatar video generated by the expression synthesis model, the expression of the avatar exists as a whole; compared with modeling each expression region of the avatar separately, modeling the expression as a whole effectively solves the problem of muscle linkage between regions, so that the muscle linkage of each region is more natural and lifelike.
Based on the above embodiments, the expression synthesis model includes a feature extraction layer and an expression prediction layer. Fig. 2 is a flow chart of an expression synthesis method according to an embodiment of the present invention, as shown in fig. 2, step 120 specifically includes:
step 121, inputting the image data and the related features corresponding to any frame to the feature extraction layer of the expression synthesis model to obtain the frame features output by the feature extraction layer.
Specifically, the speech data may be divided into speech data of a plurality of frames, for which there is a corresponding correlation feature. Also, in the avatar data, the same texture map may correspond to each frame to embody the appearance of an avatar in the avatar video, and different expression mask maps may correspond to different frames to embody the actions of the avatar corresponding to different frames in the avatar video, in particular, the head actions.
In the expression synthesis model, the feature extraction layer is used for extracting frame features of any frame from image data and related features respectively corresponding to the frame. The frame features herein may be the image features of the frame and the expression related features of the frame, and may also include fusion features of the image features and the expression related features of the frame, which is not particularly limited in the embodiment of the present invention.
Step 122, inputting the frame characteristics into the expression prediction layer of the expression synthesis model to obtain the virtual expression map of the frame output by the expression prediction layer.
Specifically, in the expression synthesis model, the expression prediction layer is used to predict the virtual expression map of any frame based on the frame features of that frame. Here, the virtual expression map is an image containing the avatar, where the avatar is configured with the expression corresponding to the voice data of the frame, and the position, action, etc. of the avatar are consistent with the avatar data corresponding to the frame. The virtual expression maps of all frames form the avatar video.
According to the method provided by the embodiment of the invention, the frame characteristics of any frame are obtained, the virtual expression image of the frame is obtained based on the frame characteristics, the virtual image video is finally obtained, and the overall naturalness and fidelity of the virtual image video are improved by improving the naturalness and fidelity of the virtual expression image of each frame.
Based on any of the above embodiments, the feature extraction layer includes a current feature extraction layer and a pre-frame feature extraction layer; fig. 3 is a flow chart of a feature extraction method according to an embodiment of the present invention, as shown in fig. 3, step 121 specifically includes:
Step 1211, inputting the image data and the related features corresponding to any frame respectively to a current feature extraction layer of the feature extraction layer to obtain the current features output by the current feature extraction layer.
Step 1212, inputting the virtual expression maps of the preset frames before the frame to the pre-frame feature extraction layer of the feature extraction layer to obtain the pre-frame features output by the pre-frame feature extraction layer.
Specifically, the frame features of any frame comprise two parts, namely the current features and the pre-frame features. The current features are obtained by the current feature extraction layer from the image data and relevant features corresponding to the frame, and are used to reflect the characteristics of the avatar in that frame, especially the expression of the avatar. The pre-frame features are obtained by the pre-frame feature extraction layer from the virtual expression maps of the preset frames before the frame, and are used to reflect the avatar, especially the expression of the avatar, in the virtual expression maps of those preset frames.
Here, the preset frames before any frame may be a preset number of frames preceding that frame. For example, if the frame is the n-th frame and the preset frames are the two preceding frames, they are the (n-2)-th frame and the (n-1)-th frame.
Based on any of the above embodiments, step 122 specifically includes: fusing the current features and the pre-frame features and then inputting them into the expression prediction layer, so as to obtain the virtual expression map of the frame output by the expression prediction layer.
In the embodiment of the invention, the current features and the pre-frame features of any frame are used for expression prediction, so that the synthesized avatar expression not only matches the voice data corresponding to the frame naturally, but also transitions naturally from the avatar expressions of the previous frames, further improving the reality and naturalness of the avatar video.
Based on any of the above embodiments, the expression prediction layer includes a candidate expression prediction layer, an optical flow prediction layer, and a fusion layer; fig. 4 is a flowchart of an expression prediction method according to an embodiment of the present invention, as shown in fig. 4, step 122 specifically includes:
step 1221, inputting the fused current feature and the pre-frame feature into a candidate expression prediction layer of the expression prediction layer to obtain a candidate expression map output by the candidate expression prediction layer.
Here, the candidate expression prediction layer is configured to predict the avatar expression of any frame based on the current features and pre-frame features corresponding to the frame, and to output the candidate expression map of the frame. The candidate expression map of the frame is an avatar image configured with the expression corresponding to the voice data of the frame.
Step 1222, merging the current feature and the pre-frame feature, and inputting the merged current feature and the pre-frame feature into an optical flow prediction layer of the expression prediction layer to obtain optical flow information output by the optical flow prediction layer.
Here, the optical flow prediction layer is configured to predict the optical flow between the previous frame and the current frame based on the current features and pre-frame features corresponding to any frame, and to output the optical flow information of the frame. The optical flow information of the frame may include the predicted optical flow between the previous frame and the current frame, and may further include weights for fusing the optical-flow-warped previous frame with the candidate expression map.
Step 1223, inputting the candidate expression image and the optical flow information to a fusion layer in the expression prediction layer, so as to obtain a virtual expression image of the frame output by the fusion layer.
Here, the fusion layer is used for fusing the candidate expression map and the optical flow information of any frame to obtain the virtual expression map of the frame. For example, the fusion layer may directly superimpose the candidate expression map on the previous frame's virtual expression map after it has been deformed by the predicted optical flow, or may weight the two by the predicted weights before superimposing them, so as to obtain the virtual expression map.
According to the method provided by the embodiment of the invention, the current features and the pre-frame features are used to perform optical flow prediction, and the optical flow information is applied to the generation of the virtual expression map, so that the muscle movements in each expression-performing region of the avatar in the avatar video are more natural.
Based on any of the above embodiments, fig. 5 is a schematic structural diagram of an expression synthesis model provided in an embodiment of the present invention, and in fig. 5, the expression synthesis model includes a current feature extraction layer, a pre-frame feature extraction layer, a candidate expression prediction layer, an optical flow prediction layer, and a fusion layer.
The current feature extraction layer is used for obtaining the current features of any frame based on the image data and the related features respectively corresponding to the frame.
Assuming that the relevant features of the voice data are M, feeding M into a long short-term memory network (LSTM) yields the hidden layer features HT of the relevant features; the hidden layer features corresponding to each frame may be denoted HT(0), HT(1), …, HT(t), …, HT(N-1), where HT(t) represents the hidden layer feature of the relevant features corresponding to the t-th frame and N is the total number of frames of the image data. The image data corresponding to the t-th frame includes I(0) and Im(t), where I(0) represents the texture map and Im(t) represents the expression mask map corresponding to the t-th frame.
In Fig. 5, in the current feature extraction layer, I(0) and Im(t) are fed into the first convolution layer (kernel=3, stride=2, channel_out=64); the obtained feature map is fed into the second convolution layer (kernel=3, stride=2, channel_out=128), then into the third convolution layer (kernel=3, stride=2, channel_out=256), and then into the fourth convolution layer (kernel=3, stride=2, channel_out=512), obtaining a 512-dimensional feature map, which then passes through 5 layers of ResBlock (kernel=3, stride=1, channel_out=512) to obtain a 512-dimensional feature map. In this process, the hidden layer feature HT(t) of the relevant features is expanded and then embedded into the second, third, and fourth convolution layers and added to the convolution results, so as to realize the fusion of the relevant features and the image data and obtain the current feature CFT(t) of the t-th frame.
In the current feature extraction layer, when HT(t) is added to the convolution result FT(t) of I(0) and Im(t), HT(t) is superimposed only on the mask region of FT(t), that is, on the regions of the avatar in which expressions are performed; the non-mask regions of FT(t) are not superimposed. In this way, the expression-related features are superimposed only in the regions where expressions need to be performed, while the original avatar is maintained in the regions where no expression is performed, which can be expressed as the following formula:
CFT(t) = FT(t) + Mask(t) ⊙ HT(t)
where FT(t) is the convolution result of I(0) and Im(t), Mask(t) denotes the mask of the expression-performing regions, ⊙ denotes superimposition restricted to the mask region, and θ is a relevant parameter of the current feature extraction layer.
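For illustration only, a simplified PyTorch sketch of such a current feature extraction layer is given below, assuming the convolution shapes listed above; the linear projection of HT(t) to each layer's channel width and the nearest-neighbor resizing of the mask are assumptions of this sketch, and the trailing ResBlocks are omitted.

```python
import torch
from torch import nn

class CurrentFeatureExtractor(nn.Module):
    """Sketch of the current feature extraction layer: 4 stride-2 convolutions
    (64/128/256/512 channels) over [I(0), Im(t)], with the expanded hidden
    feature HT(t) added only on the mask (expression-performing) region of the
    intermediate feature maps of the 2nd-4th convolutions."""
    def __init__(self, in_channels=6, feat_dim=512):
        super().__init__()
        chans = [64, 128, 256, 512]
        self.convs = nn.ModuleList()
        prev = in_channels
        for c in chans:
            self.convs.append(nn.Conv2d(prev, c, kernel_size=3, stride=2, padding=1))
            prev = c
        # project HT(t) to the channel width of the 2nd-4th convolutions
        self.ht_proj = nn.ModuleList([nn.Linear(feat_dim, c) for c in chans[1:]])

    def forward(self, texture, mask_map, ht, mask):
        # texture: I(0); mask_map: Im(t); ht: HT(t), shape (B, feat_dim);
        # mask: binary map of expression-performing regions, shape (B, 1, H, W)
        x = torch.cat([texture, mask_map], dim=1)
        for i, conv in enumerate(self.convs):
            x = torch.relu(conv(x))
            if i >= 1:
                m = nn.functional.interpolate(mask, size=x.shape[-2:], mode="nearest")
                h = self.ht_proj[i - 1](ht)[:, :, None, None]   # expand HT(t)
                x = x + m * h                                   # superimpose only on mask region
        return x  # current feature CFT(t) (ResBlocks omitted for brevity)

cft = CurrentFeatureExtractor()(
    torch.rand(1, 3, 256, 256), torch.rand(1, 3, 256, 256),
    torch.rand(1, 512), torch.ones(1, 1, 256, 256))
print(cft.shape)  # torch.Size([1, 512, 16, 16])
```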
The pre-frame feature extraction layer is used for obtaining the pre-frame features of any frame based on the virtual expression maps of the preset frames before that frame.
The preset frames preceding the frame are assumed to be the first two frames, namely the (t-1)-th frame and the (t-2)-th frame. The virtual expression maps of the preset frames before the frame are Fake(t-1) and Fake(t-2). In the pre-frame feature extraction layer, Fake(t-1) and Fake(t-2) are fed into a 4-layer convolutional network (kernel=3, stride=2, channel_out=64,128,256,512), and then passed through 5 layers of ResBlock (kernel=3, stride=1, channel_out=512) to obtain a 512-dimensional feature map, namely the pre-frame feature PFT(t).
The frame features of the t-th frame are thus obtained, denoted CFT(t)+PFT(t).
The candidate expression prediction layer is used for determining the corresponding candidate expression map according to the input frame features. In the candidate expression prediction layer, the frame features CFT(t)+PFT(t) pass through 4 layers of ResBlock (kernel=3, stride=1, channel_out=512) and 4 upsampling layers (kernel=3, stride=2, channel_out=256,128,64,1) to obtain the candidate expression map of the t-th frame, denoted S(t). The formula is as follows:
S(t) = Gs(CFT(t) + PFT(t); φ)
where Gs denotes the candidate expression prediction layer and φ is a parameter of the candidate expression prediction layer.
The optical flow prediction layer is used for predicting the optical flow between the previous frame and the current frame according to the input frame features, and for outputting the optical flow information of the frame. In the optical flow prediction layer, the frame features CFT(t)+PFT(t) pass through 4 layers of ResBlock (kernel=3, stride=1, channel_out=512) and 4 upsampling layers (kernel=3, stride=2, channel_out=256,128,64,3) to obtain the optical flow F(t-1) between the previous frame's virtual expression map Fake(t-1) and the current frame's virtual expression map Fake(t), together with a weighting matrix W(t).
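A simplified sketch of the two prediction branches follows; the ResBlocks are omitted, the upsampling stages are reduced to plain upsample-plus-convolution blocks, and the 3 output channels of the optical flow branch are read here as a 2-channel flow F(t-1) plus a 1-channel weighting map W(t), which is an assumption rather than something stated explicitly above.

```python
import torch
from torch import nn

def upsample_block(cin, cout):
    return nn.Sequential(
        nn.Upsample(scale_factor=2, mode="nearest"),
        nn.Conv2d(cin, cout, kernel_size=3, padding=1),
        nn.ReLU(inplace=True),
    )

class ExpressionPredictionHeads(nn.Module):
    """Sketch of the candidate-expression branch (1 output channel, S(t)) and
    the optical-flow branch (3 output channels, read as 2-channel flow plus
    1-channel weight map W(t))."""
    def __init__(self, feat_dim=512):
        super().__init__()
        chans = [256, 128, 64]
        def branch(out_channels):
            layers, prev = [], feat_dim
            for c in chans:
                layers.append(upsample_block(prev, c))
                prev = c
            layers.append(nn.Upsample(scale_factor=2, mode="nearest"))
            layers.append(nn.Conv2d(prev, out_channels, kernel_size=3, padding=1))
            return nn.Sequential(*layers)
        self.candidate_head = branch(1)   # candidate expression map S(t)
        self.flow_head = branch(3)        # flow (2 ch) + weight (1 ch)

    def forward(self, frame_feature):     # frame_feature: CFT(t)+PFT(t)
        s = self.candidate_head(frame_feature)
        flow_out = self.flow_head(frame_feature)
        flow = flow_out[:, :2]                       # F(t-1)
        weight = torch.sigmoid(flow_out[:, 2:3])     # W(t) in [0, 1]
        return s, flow, weight

s, flow, w = ExpressionPredictionHeads()(torch.rand(1, 512, 16, 16))
print(s.shape, flow.shape, w.shape)  # (1,1,256,256) (1,2,256,256) (1,1,256,256)
```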
The fusion layer is used for fusing the candidate expression map S(t) of any frame with the optical flow F(t-1) and the weighting matrix W(t) to obtain the virtual expression map Fake(t) of the frame. Specifically, the candidate expression map S(t) and the previous frame's virtual expression map Fake(t-1), deformed by the optical flow F(t-1), can be weighted and summed using the weighting matrix W(t), thereby realizing the fusion. The specific formula is as follows:
Fake(t) = W(t) * S(t) + (1 - W(t)) * (F(t-1) ⊙ Fake(t-1))
where ⊙ denotes deforming the image using the optical flow, W(t) is the weight of the candidate expression map, and 1-W(t) is the weight of the previous frame's virtual expression map after the optical flow deformation.
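The fusion formula can be sketched as follows, implementing the ⊙ operation (deforming Fake(t-1) with the optical flow F(t-1)) via grid sampling; treating the flow as pixel-unit displacements is an assumption of this sketch.

```python
import torch
import torch.nn.functional as F

def warp_by_flow(image, flow):
    """Deform `image` (B,C,H,W) with optical flow `flow` (B,2,H,W) given in
    pixels; this realizes the ⊙ operation of the fusion formula."""
    b, _, h, w = image.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float().to(image)    # (2,H,W) pixel coords
    coords = base[None] + flow                               # displaced coordinates
    # normalize to [-1, 1] for grid_sample (x first, then y)
    grid_x = 2.0 * coords[:, 0] / (w - 1) - 1.0
    grid_y = 2.0 * coords[:, 1] / (h - 1) - 1.0
    grid = torch.stack((grid_x, grid_y), dim=-1)             # (B,H,W,2)
    return F.grid_sample(image, grid, align_corners=True)

def fuse(candidate, prev_fake, flow, weight):
    """Fake(t) = W(t)*S(t) + (1-W(t)) * (F(t-1) ⊙ Fake(t-1))."""
    warped_prev = warp_by_flow(prev_fake, flow)
    return weight * candidate + (1.0 - weight) * warped_prev

fake_t = fuse(torch.rand(1, 3, 256, 256), torch.rand(1, 3, 256, 256),
              torch.zeros(1, 2, 256, 256), torch.full((1, 1, 256, 256), 0.5))
print(fake_t.shape)
```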
Through the application of the relevant features and the modeling of the expression as a whole, the expression synthesis model provided by the embodiment of the invention can more vividly portray the synthesis details of different people under different emotions, while avoiding the mismatch problems caused by synthesizing each part independently. In addition, the optical flow information improves the inter-frame continuity of the synthesized avatar.
Based on any of the above embodiments, in the method, the expression synthesis model is trained based on the sample speaker video, the relevant features of the sample voice data corresponding to the sample speaker video, the sample image data, and a discriminator, where the expression synthesis model and the discriminator form a generative adversarial network.
Specifically, a generative adversarial network (GAN, Generative Adversarial Networks) is a deep learning model and one of the most promising approaches to unsupervised learning on complex distributions. A generative adversarial network produces good output through the mutual game learning of two modules in its framework: a generative model and a discriminative model. In the embodiment of the invention, the expression synthesis model is the generative model, and the discriminator is the discriminative model.
The expression synthesis model is used for synthesizing continuous virtual image videos, and the discriminator is used for discriminating whether the input video is the virtual image video synthesized by the expression synthesis model or the truly recorded video. The role of the discriminator is to judge whether the virtual image video synthesized by the expression synthesis model is true and realistic.
According to the method provided by the embodiment of the invention, through the mutual game learning training of the expression synthesis model and the discriminator, the training effect of the expression synthesis model can be obviously improved, so that the fidelity and naturalness of the virtual image video output by the expression synthesis model can be effectively improved.
Based on any of the above embodiments, the discriminator comprises an image discriminator and/or a video discriminator; the image discriminator is used for judging the synthesis authenticity of any frame of virtual expression map in the avatar video, and the video discriminator is used for judging the synthesis authenticity of the avatar video.
Specifically, the generative adversarial network may include only the image discriminator or only the video discriminator, or may include both the image discriminator and the video discriminator.
The image discriminator judges authenticity at the image level, that is, whether the synthesis of the expression, such as the synthesis of the facial and neck muscles, is realistic. The image discriminator can take the virtual expression map Fake(t) of the current frame synthesized by the expression synthesis model, feed it into a 4-layer convolution network (kernel=3, stride=1, channel_out=64,128,256,1), and compute the L2 norm between the feature map obtained by convolution and an all-zero matrix of the same size. Similarly, the image discriminator can feed any image frame Real(t) of the truly recorded video into the 4-layer convolution network and compute the L2 norm between the resulting feature map and an all-one matrix of the same size. Here, the all-zero matrix corresponds to synthesized images, the all-one matrix corresponds to real images, and these L2 norms serve as the loss values of the image discriminator. In order to ensure the quality of the synthesized virtual expression map at every resolution, the virtual expression map output by the expression synthesis model can also be downsampled by factors of 2 and 4, respectively, for discrimination.
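A sketch of such an image discriminator and its loss is given below; the LeakyReLU activations, the average-pooling downsampling for the 2x and 4x scales, and the mean-squared form of the L2 terms are assumptions made for illustration.

```python
import torch
from torch import nn
import torch.nn.functional as F

class ImageDiscriminator(nn.Module):
    """4-layer convolutional image discriminator (channel_out=64,128,256,1)."""
    def __init__(self, in_channels=3):
        super().__init__()
        layers, prev = [], in_channels
        for c in (64, 128, 256):
            layers += [nn.Conv2d(prev, c, kernel_size=3, stride=1, padding=1),
                       nn.LeakyReLU(0.2, inplace=True)]
            prev = c
        layers.append(nn.Conv2d(prev, 1, kernel_size=3, stride=1, padding=1))
        self.net = nn.Sequential(*layers)

    def forward(self, img):
        return self.net(img)          # per-pixel realness score map

def image_disc_loss(disc, fake_frame, real_frame, scales=(1, 2, 4)):
    """L2 distance of the score map to all zeros for Fake(t) and to all ones
    for Real(t), evaluated at the original, 2x and 4x downsampled scales."""
    loss = 0.0
    for s in scales:
        fake = F.avg_pool2d(fake_frame, s) if s > 1 else fake_frame
        real = F.avg_pool2d(real_frame, s) if s > 1 else real_frame
        loss = loss + (disc(fake) ** 2).mean() + ((disc(real) - 1.0) ** 2).mean()
    return loss

disc = ImageDiscriminator()
print(image_disc_loss(disc, torch.rand(1, 3, 128, 128), torch.rand(1, 3, 128, 128)))
```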
The video discriminator judges authenticity at the video level, that is, whether the synthesis of the video, such as the linkage of the facial and neck muscle movements, is realistic. Multiple consecutive virtual expression maps synthesized by the expression synthesis model, together with the corresponding optical flows, such as Fake(t-2), Fake(t-1), Fake(t) and F(t-2), F(t-1), can be fed into a video discriminator formed by a 4-layer convolution network (kernel=3, stride=1, channel_out=64,128,256,1) to compute the discrimination loss. Similarly, the video discriminator also needs to compute the discrimination loss on the truly recorded video. In order to ensure the quality of the synthesized avatar video at every resolution, the virtual expression maps output by the expression synthesis model can also be downsampled by factors of 2 and 4, respectively, for discrimination.
In the training process of the expression synthesis model, the adversarial loss of the discriminator can be added to the loss function of the expression synthesis model, so that the expression synthesis model and the discriminator together form an adversarial pair.
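For illustration, the combined generator objective and the opposing discriminator objective might be written as follows, where disc stands for a discriminator such as the image discriminator sketched above; the L1 reconstruction term and the adversarial weight are assumptions of this sketch.

```python
import torch.nn.functional as F

def generator_loss(fake_frame, real_frame, disc, adv_weight=1.0):
    """Reconstruction term plus the adversarial term that pushes the
    discriminator's score of the synthesized frame towards 'real' (all ones)."""
    recon = F.l1_loss(fake_frame, real_frame)
    adv = ((disc(fake_frame) - 1.0) ** 2).mean()
    return recon + adv_weight * adv

def discriminator_loss(fake_frame, real_frame, disc):
    """Opposing objective: score synthesized frames as 0 and real frames as 1."""
    return ((disc(fake_frame.detach()) ** 2).mean()
            + ((disc(real_frame) - 1.0) ** 2).mean())
```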
Based on any of the above embodiments, the method wherein the relevant features include language-related features, as well as emotional features and/or speaker identity features.
The language-related features correspond to different pronunciations, which require the speaker to mobilize facial muscles to form different mouth shapes; the movements of the facial muscles and neck muscles corresponding to different mouth shapes are different. The emotion features are used to characterize the emotion of the speaker: when the speaker says the same content under different emotions, the movements of the facial muscles, including the mouth shape, and of the neck muscles are also different. The speaker identity feature is used to characterize the identity of the speaker, and may specifically be an identifier corresponding to the speaker, or an identifier corresponding to the speaker's occupation, personality characteristics, language style characteristics, and the like.
Based on any of the above embodiments, in the method, the avatar data is determined based on a speaker identity feature.
Specifically, among the massive avatar data stored in advance, different avatar data correspond to different avatars with different identity characteristics. After the speaker identity features in the relevant features of the voice data are obtained, avatar data matching the speaker identity features can be selected from the massive avatar data and applied to the synthesis of the avatar video.
For example, the avatar data of four persons A, B, C, and D is stored in advance. When the speaker identity features of the voice data are found to point to B, the avatar data of B can be correspondingly selected for the synthesis of the avatar video.
Based on any of the above embodiments, step 110 specifically includes: determining acoustic features of the speech data; relevant features are determined based on the acoustic features.
Specifically, the acoustic features herein may be spectrogram and fbank features. For example, the speech data may be denoised using an adaptive filter and the audio sample rate and channel unified, here set to 16K, mono, from which the spectrogram and fbank features (frame shift 10ms, window length 1 s) are then extracted.
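A possible implementation of this step using librosa (the embodiment does not name a specific library) is sketched below; the adaptive-filter denoising is omitted and the number of mel bands is an assumption.

```python
import librosa
import numpy as np

def acoustic_features(wav_path, sr=16000, frame_shift_s=0.010, win_len_s=1.0, n_mels=80):
    """Load audio as 16 kHz mono and extract a linear spectrogram and fbank
    (log-mel) features with a 10 ms frame shift and 1 s window, following the
    settings stated above; n_mels=80 is an assumption of this sketch."""
    y, _ = librosa.load(wav_path, sr=sr, mono=True)
    hop = int(sr * frame_shift_s)
    win = int(sr * win_len_s)
    n_fft = 1 << (win - 1).bit_length()           # next power of two >= window length
    spec = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop, win_length=win))
    mel = librosa.feature.melspectrogram(S=spec ** 2, sr=sr, n_mels=n_mels)
    fbank = librosa.power_to_db(mel)
    return spec, fbank
```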
Thereafter, a bottleneck network can be used to extract the BN feature sequence representing the language content as the language-related features; here, a 256-dimensional BN feature is obtained every 40 ms, denoted L(0), L(1), …, L(N-1), where N is the number of frames of the 25 fps video. Compared with the phoneme-based features of the prior art, BN features are language-independent: even if the expression synthesis model is trained only on Chinese, correct mouth shapes can still be synthesized when other languages are used. In addition, in the embodiment of the invention, a high-dimensional feature sequence expressing emotion is extracted, as the emotion features, by a convolutional long short-term memory network (ConvLSTM) fully trained on an 8-class basic expression (lively, happy, afraid, depressed, excited, surprised, sad, and neutral) recognition task; here, a 128-dimensional emotion vector is obtained every 40 ms, denoted E(0), E(1), …, E(N-1), where N is the number of frames of the 25 fps video. Similarly, in order to achieve personalized customization, a speaker identity recognition network based on a deep neural network (DNN) and i-vectors is used to extract the speaker identity feature sequence; a 128-dimensional identity feature vector is obtained every 40 ms, denoted P(0), P(1), …, P(N-1), where N is the number of frames of the 25 fps video. Finally, the three feature sequences are spliced frame by frame, yielding a 512-dimensional fused relevant feature for each frame, denoted M(0), M(1), …, M(N-1), where N is the number of frames of the 25 fps video.
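The frame-wise splicing of the three sequences into the 512-dimensional relevant features M (256 + 128 + 128 = 512) can be sketched as:

```python
import numpy as np

def splice_relevant_features(bn_feats, emotion_feats, identity_feats):
    """bn_feats: (N, 256) L(0..N-1); emotion_feats: (N, 128) E(0..N-1);
    identity_feats: (N, 128) P(0..N-1). Returns M with shape (N, 512),
    one fused relevant feature per 25 fps video frame (40 ms step)."""
    assert len(bn_feats) == len(emotion_feats) == len(identity_feats)
    return np.concatenate([bn_feats, emotion_feats, identity_feats], axis=1)

m = splice_relevant_features(np.zeros((100, 256)), np.zeros((100, 128)), np.zeros((100, 128)))
print(m.shape)  # (100, 512)
```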
Based on any of the above embodiments, in the method, the expression corresponding to the voice data of the avatar configuration in the avatar video includes a facial expression and a neck expression.
Correspondingly, the regions covered by the expression mask map in the image data include the execution area of the facial expression and the execution area of the neck expression. Here, the facial expression execution area may include facial muscle areas such as the frontalis muscle, orbicularis oculi muscle, frown muscle, and orbicularis oris muscle, but does not include the eyeball area or the nose bridge area, because the movement of the eyeballs is not controlled by facial muscles, and the nose bridge has bone, is approximately rigid, and is little affected by the movement of muscles in other areas of the face.
In the embodiment of the invention, the facial expression and the neck expression are combined and treated as a whole expression; compared with modeling each expression region of the avatar separately, modeling the expression as a whole effectively solves the problem of muscle linkage between regions, so that the muscle linkage of each region is more natural and lifelike.
Based on any of the above embodiments, fig. 6 is a schematic flow chart of an avatar composition method according to another embodiment of the present invention, as shown in fig. 6, the method includes:
Step 610, determining voice data:
extracting voice data from the collected video and audio data, denoising the voice data by using an adaptive filter, unifying an audio sampling rate and a sound channel, and then extracting a spectrogram and fbank characteristics from the voice data to be identified. In order to fully ensure the time sequence of the voice data, the embodiment of the invention does not need to split the input voice data.
Step 620, acquiring relevant features of the voice data:
and (3) respectively obtaining the language-related features, the emotion features and the speaker identity features corresponding to the voice data to be recognized of each frame through a neural network for extracting the language-related features, the emotion features and the speaker identity features of the voice data obtained in the previous step, and splicing the three features according to the corresponding frames to obtain the corresponding related features of each frame.
Step 630, determining video data, detecting a face area, and cutting a head area:
extracting video data from the collected video and audio data, detecting the face area of each frame image, and, taking the obtained face box as a reference, expanding it outwards by 1.5 times to obtain an area containing the whole head and neck; the area is cropped and stored as an image sequence, denoted I(0), I(1), …, I(N-1), where N is the number of frames of the 25 fps video.
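One possible implementation of this detection-and-crop step, using OpenCV's Haar cascade as a stand-in face detector (the embodiment does not specify which detector is used), is:

```python
import cv2

detector = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def crop_head_region(frame, expand=1.5):
    """Detect the face, expand the box by `expand` around its center so it
    covers the whole head and neck, and return the cropped region."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None
    x, y, w, h = max(faces, key=lambda f: f[2] * f[3])   # keep the largest face
    cx, cy = x + w / 2, y + h / 2
    nw, nh = w * expand, h * expand
    x0 = max(int(cx - nw / 2), 0)
    y0 = max(int(cy - nh / 2), 0)
    x1 = min(int(cx + nw / 2), frame.shape[1])
    y1 = min(int(cy + nh / 2), frame.shape[0])
    return frame[y0:y1, x0:x1]
```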
Step 640, generating image data:
For each cropped frame image I(t), the facial muscle areas such as the frontalis muscle, orbicularis oculi muscle, frown muscle, and orbicularis oris muscle, together with the neck muscle area, are segmented according to the skin color and physiological structure characteristics of the human face, or by using a neural network. The eyeball area and the nose bridge area are not included, because the movement of the eyeballs is not controlled by facial muscles, and the nose bridge has bone, is approximately rigid, and is little affected by the movement of muscles in other areas of the face. The pixel values of the facial muscle regions and the neck muscle region are set to zero to obtain the expression mask map sequence, denoted Im(0), Im(1), …, Im(N-1), where N is the number of frames of the 25 fps video.
The image data thus obtained contains a texture map I (0) and an expression mask map corresponding to each frame.
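Given a binary segmentation of the facial and neck muscle regions obtained as described above, producing the expression mask map amounts to zeroing those pixels; a minimal sketch:

```python
import numpy as np

def make_expression_mask_map(frame, muscle_region_mask):
    """frame: (H, W, 3) cropped head image I(t); muscle_region_mask: (H, W)
    boolean map that is True on facial/neck muscle regions (eyeballs and nose
    bridge excluded). Returns the expression mask map Im(t) with those
    regions zeroed out."""
    masked = frame.copy()
    masked[muscle_region_mask] = 0
    return masked

frame = np.random.randint(0, 256, (256, 256, 3), dtype=np.uint8)
region = np.zeros((256, 256), dtype=bool)
region[100:200, 80:180] = True          # pretend this is the segmented muscle area
im_t = make_expression_mask_map(frame, region)
print(im_t[150, 100])  # [0 0 0]
```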
Step 650, inputting an expression synthesis model to obtain an avatar head video:
in the expression synthesis model, a feature map is obtained from a texture map and an expression mask map in image data through a plurality of layers of convolution networks, the feature map is fused with spliced related features, then face and neck regions are synthesized through a plurality of layers of convolution networks, and finally optical flow information is added into a video, so that the synthesized mouth shape, expression, throat movement and the like are more natural.
For example, suppose the input texture map is expressionless and the voice data is an excitedly spoken exclamation. For irrelevant areas of the texture map, such as the hair and nose, the original texture is kept; for relevant areas, such as the mouth, cheeks, and eyebrows, the original areas are deformed into new textures according to the relevant features and the texture image, and the final synthesized virtual expression map is obtained through fusion.
Step 660, merging the avatar header video and the body part of the video data:
If the head area of the synthesized avatar head video is spliced back into the video according to its original coordinates, small seams appear at the boundary. Preferably, a Poisson fusion algorithm can be used to blend the seam areas so that the boundary transition is smoother.
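OpenCV's seamlessClone offers one Poisson blending implementation that could be used for this step; the sketch below assumes the synthesized head patch and its original top-left coordinates are available.

```python
import cv2
import numpy as np

def paste_head_with_poisson_blend(body_frame, head_img, top_left):
    """Blend the synthesized head region `head_img` back into the original
    video frame `body_frame` at its original coordinates using Poisson
    (seamless) cloning, so the boundary transition is smooth."""
    h, w = head_img.shape[:2]
    x, y = top_left
    mask = 255 * np.ones((h, w), dtype=np.uint8)          # blend the whole head patch
    center = (x + w // 2, y + h // 2)
    return cv2.seamlessClone(head_img, body_frame, mask, center, cv2.NORMAL_CLONE)
```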
Compared with the traditional speech-driven avatar synthesis technology and the face synthesis technology based on expression migration, the method provided by the embodiment of the invention can more realistically synthesize the facial and neck muscle movements of different people under different emotions, and can realize fully automatic offline synthesis, saving a large amount of labor cost and improving production efficiency.
Based on any one of the above embodiments, fig. 7 is a schematic structural view of an avatar composition device according to an embodiment of the present invention, and as shown in fig. 7, the device includes a relevant feature determining unit 710 and an expression composition unit 720;
wherein the relevant feature determining unit 710 is configured to determine relevant features of the voice data; the relevant features are used for representing the features related to the expression of the speaker contained in the voice data;
the expression synthesis unit 720 is configured to input the avatar data and the related features into an expression synthesis model, so as to obtain an avatar video output by the expression synthesis model, where an avatar in the avatar video is configured with an expression corresponding to the voice data;
the expression synthesis model is obtained by training relevant features of sample voice data and sample image data corresponding to a sample speaker video.
The device provided by the embodiment of the invention synthesizes the expression of the avatar by applying relevant features containing rich expression-related information, so that the expression of the avatar better fits the voice data and the avatar is more natural and real. In addition, in the avatar video generated by the expression synthesis model, the expression of the avatar exists as a whole; compared with modeling each expression region of the avatar separately, modeling the expression as a whole effectively solves the problem of muscle linkage between regions, so that the muscle linkage of each region is more natural and lifelike.
Based on any of the above embodiments, the expression synthesis unit 720 includes:
the feature extraction unit is used for inputting the image data and the related features corresponding to any frame to a feature extraction layer of the expression synthesis model to obtain frame features output by the feature extraction layer;
and the expression prediction unit is used for inputting the frame characteristics to an expression prediction layer of the expression synthesis model to obtain a virtual expression image of any frame output by the expression prediction layer.
Based on any of the above embodiments, the feature extraction unit includes:
a current feature extraction subunit, configured to input image data and related features corresponding to any frame respectively to a current feature extraction layer of the feature extraction layer, so as to obtain a current feature output by the current feature extraction layer;
and the pre-frame feature extraction subunit is used for inputting the virtual expression maps of the preset frames before any frame to the pre-frame feature extraction layer of the feature extraction layer to obtain the pre-frame features output by the pre-frame feature extraction layer.
Based on any of the above embodiments, the expression prediction unit is specifically configured to:
and the current features and the pre-frame features are fused and then input into the expression prediction layer, so that a virtual expression image of any frame output by the expression prediction layer is obtained.
Based on any of the above embodiments, the expression prediction unit includes:
the candidate expression prediction subunit is used for fusing the current features and the pre-frame features and then inputting them into the candidate expression prediction layer of the expression prediction layer, so as to obtain the candidate expression map output by the candidate expression prediction layer;
the optical flow prediction subunit is used for fusing the current features and the pre-frame features and then inputting them into the optical flow prediction layer of the expression prediction layer, so as to obtain the optical flow information output by the optical flow prediction layer;
and the fusion subunit is used for inputting the candidate expression images and the optical flow information into a fusion layer in the expression prediction layer to obtain the virtual expression image of any frame output by the fusion layer.
Based on any one of the above embodiments, the expression synthesis model is obtained by training based on a sample speaker video, relevant features of sample voice data corresponding to the sample speaker video, sample image data, and a discriminator, where the expression synthesis model and the discriminator form a generative adversarial network.
Based on any of the above embodiments, the discriminator comprises an image discriminator and/or a video discriminator;
the image discriminator is used for judging the synthesis authenticity of any frame of virtual expression graph in the virtual image video, and the video discriminator is used for judging the synthesis authenticity of the virtual image video.
Based on any of the above embodiments, the relevant features include language-related features, as well as emotional features and/or speaker identity features.
Based on any of the above embodiments, the avatar data is determined based on the speaker identity features.
Based on any of the above embodiments, the expression of the avatar configuration in the avatar video corresponding to the voice data includes a facial expression and a neck expression.
Fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present invention, as shown in fig. 8, the electronic device may include: processor 810, communication interface (Communications Interface) 820, memory 830, and communication bus 840, wherein processor 810, communication interface 820, memory 830 accomplish communication with each other through communication bus 840. The processor 810 may call logic instructions in the memory 830 to perform the following method: determining relevant characteristics of the voice data; the relevant features are used for representing the features related to the expression of the speaker contained in the voice data; inputting the image data and the related features into an expression synthesis model to obtain an avatar video output by the expression synthesis model, wherein the avatar in the avatar video is configured with expressions corresponding to the voice data; the expression synthesis model is obtained by training relevant features of sample voice data and sample image data corresponding to a sample speaker video.
Further, the logic instructions in the memory 830 described above may be implemented in the form of software functional units and may be stored in a computer-readable storage medium when sold or used as a stand-alone product. Based on this understanding, the technical solution of the present invention may be embodied essentially, or in the part contributing to the prior art, or in part, in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash disk, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or other media capable of storing program code.
Embodiments of the present invention also provide a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, is implemented to perform the methods provided by the above embodiments, for example, comprising: determining relevant characteristics of the voice data; the relevant features are used for representing the features related to the expression of the speaker contained in the voice data; inputting the image data and the related features into an expression synthesis model to obtain an avatar video output by the expression synthesis model, wherein the avatar in the avatar video is configured with expressions corresponding to the voice data; the expression synthesis model is obtained by training relevant features of sample voice data and sample image data corresponding to a sample speaker video.
The apparatus embodiments described above are merely illustrative, wherein the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units; they may be located in one place, or may be distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the invention without creative effort.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus necessary general hardware platforms, or of course may be implemented by means of hardware. Based on this understanding, the foregoing technical solution may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a computer readable storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the respective embodiments or some parts of the embodiments.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.