3D character facial expression animation real-time generation method based on deep learning
Technical Field
The invention relates to the technical field of animation production, in particular to a 3D character facial expression animation real-time generation method based on deep learning.
Background
At present, real-time driving of virtual character facial animation from facial expressions in video mainly relies on face key point detection methods from computer vision, and such methods have the following defects:
1. generalization is poor: if higher driving accuracy is required, data must be re-annotated whenever the actor is changed;
2. without re-annotation, only character models of lower precision can be driven.
The above disadvantages mean that such methods cannot meet the requirements of high-precision 3D animated film production pipelines (models with 20,000-30,000 points). Currently, no mature solution is available on the market that can directly generate high-precision 3D character animation from an actor's facial performance.
Disclosure of Invention
The invention aims to provide a 3D character facial expression animation real-time generation method based on deep learning, which reduces preliminary preparation work, has a wide application range, and can generate animation in real time.
The invention provides the following technical scheme:
A 3D character facial expression animation real-time generation method based on deep learning comprises the following steps:
S1, acquiring training data and performing enhancement processing on it, wherein the training data comprises animation files of the model with the values of the corresponding controllers, facial motion pictures of actors, and screen-shot pictures of the animation files with the values of their corresponding controllers;
S2, building a generation model, wherein the generation model comprises 1 encoder and 3 decoders, the encoder is used for encoding the picture data of the training data into a hidden space, and the 3 decoders are used for decoding the data of the hidden space into facial motion pictures of actors, screen-shot pictures of animation files, and the controller values corresponding to the screen-shot pictures of the animation files;
S3, training the built generation model to obtain the optimal weights of the encoder and the decoders, thereby obtaining the optimal model;
S4, inputting pictures of the actor into the trained generation model, encoding the pictures into the hidden space by the encoder, and decoding the data in the hidden space by the corresponding decoder to obtain the corresponding controller values;
S5, inputting the controller values into animation software to generate the facial movement of the model.
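As a non-limiting illustration of how the training data of step S1 can be organised, the following sketch assumes PyTorch; the names (Sample, FaceAnimDataset) and the choice of storing controller values as NumPy arrays are illustrative only and not part of the method:

```python
# Training data of step S1: actor frames are unlabeled; animation screenshots carry controller values.
from dataclasses import dataclass
from typing import List, Optional

import numpy as np
import torch
from torch.utils.data import Dataset


@dataclass
class Sample:
    image: np.ndarray                  # H x W x 3 face picture (actor frame or animation screenshot)
    domain: str                        # "actor" or "animation"
    controllers: Optional[np.ndarray]  # controller values; only present for animation screenshots


class FaceAnimDataset(Dataset):
    """Holds both picture domains; only the animation screenshots are paired with controller values."""

    def __init__(self, samples: List[Sample]):
        self.samples = samples

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        s = self.samples[idx]
        img = torch.from_numpy(s.image).permute(2, 0, 1).float() / 255.0
        ctrl = torch.from_numpy(s.controllers).float() if s.controllers is not None else torch.empty(0)
        return img, s.domain, ctrl
```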
Preferably, the training data enhancement in step S1 consists of randomly changing the brightness of the actor's facial motion pictures and the screen-shot pictures of the animation files, and performing data enhancement through rotation, displacement, noise addition and simulated illumination change.
Preferably, the facial motion pictures of the actor and the screen-shot pictures of the animation files share the same encoder.
Preferably, in the training of the generation model in step S3, the training of the encoder, the decoder generating the actor's facial motion pictures and the decoder generating the screen-shot pictures of the animation files is a process in which the output restores the input.
Preferably, the training method of the generative model is as follows:
Q1, inputting the actor's facial motion pictures and the screen-shot pictures of the animation files into the encoder, and outputting the corresponding pictures through the decoder that generates the actor's facial motion pictures and the decoder that generates the screen-shot pictures of the animation files, wherein the target output in this process is the input picture itself; the loss function value between the input picture and the output picture is calculated through the structural similarity index, and the weights of the encoder and the corresponding decoder are updated according to the loss function value;
Q2, the third decoder outputs the controller values corresponding to the screen-shot pictures; the loss function is the mean of the absolute differences between the predicted and true controller values, and the weights of the corresponding decoder are updated according to the loss function value.
Preferably, the animation software of step S5 comprises Maya or UE.
Preferably, the encoder and decoder employ convolutional neural networks.
The invention has the following beneficial effects: through the model of the invention, the facial motions of the corresponding animation model are generated from facial videos and photos of the actors, without requiring paired data that links the actor's video frames to the animation files of the corresponding character, which greatly reduces early-stage data preparation; actors can be changed at will without any data annotation work; and real-time estimation is possible, i.e. the actor's facial video can be captured in real time and converted into the facial motion of the animation model.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention and not to limit the invention. In the drawings:
fig. 1 is a schematic block diagram of the present invention.
Detailed Description
As shown in fig. 1, a 3D character facial expression animation real-time generation method based on deep learning comprises the following steps:
s1, acquiring training data, and performing enhancement processing on the acquired training data, wherein the training data comprises animation files of the model and values of corresponding controllers, facial action pictures of actors, screen shot pictures of the animation files and values of corresponding controllers;
s2, building a generation model, wherein the generation model comprises 1 encoder and 3 decoders, the encoder is used for encoding the picture data of the training data into a hidden space, and the 3 decoders are used for decoding the data of the hidden space into facial action pictures of actors, screen-shot pictures of animation files and controller values corresponding to the screen-shot pictures of the animation files;
S3, training the built generation model to obtain the optimal weights of the encoder and the decoders, thereby obtaining the optimal model;
S4, inputting pictures of the actor into the trained generation model, encoding the pictures into the hidden space by the encoder, and decoding the data in the hidden space by the corresponding decoder to obtain the corresponding controller values;
S5, inputting the controller values into animation software to generate the facial movement of the model.
The first embodiment is as follows:
acquiring training data, comprising:
A. acquiring the animation files of the model and the values of the corresponding controllers, wherein the controllers are a set of rig controls that drive the facial motion of the animation model; they can be quantized into a set of values, and each set of values corresponds one-to-one to a facial pose of the animation model;
B. acquiring a segment of facial motion video of the actor;
C. playing the animation file with the view aligned to the face of the model and performing screen capture, obtaining each screen-shot picture together with the values of the corresponding controllers;
D. randomly changing the brightness of the actor's facial motion pictures and the screen-shot pictures of the animation files, and performing data enhancement through rotation, displacement, noise addition and simulated illumination change, so as to improve the robustness of the system.
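As a non-limiting illustration, the enhancement of step D may be implemented as follows; torchvision is assumed, and the parameter values (brightness and contrast ranges, rotation angle, displacement ratio, noise level) are illustrative and would be tuned in practice:

```python
# Data enhancement of step D: random brightness, rotation, displacement, noise, simulated illumination change.
import torch
from torchvision import transforms


def add_gaussian_noise(img: torch.Tensor, std: float = 0.02) -> torch.Tensor:
    """Add per-pixel Gaussian noise to a tensor image with values in [0, 1]."""
    return (img + torch.randn_like(img) * std).clamp(0.0, 1.0)


augment = transforms.Compose([
    transforms.ColorJitter(brightness=0.3, contrast=0.3),        # random brightness / crude illumination change
    transforms.RandomRotation(degrees=10),                        # small rotations
    transforms.RandomAffine(degrees=0, translate=(0.05, 0.05)),   # small displacements
    transforms.ToTensor(),                                        # PIL image -> float tensor in [0, 1]
    transforms.Lambda(add_gaussian_noise),                        # noise addition
])
```

The same pipeline is applied to both the actor pictures and the screen-shot pictures of the animation files.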
Building the generation model: the generation model is trained by reconstructing its input with a neural network, and the hidden-layer vector it learns has a dimensionality-reduction effect. The generation model comprises 1 encoder and 3 decoders. The encoder is used for encoding the picture data of the training data into a hidden space that captures the meaning of the input data; the 3 decoders are used for decoding the data of the hidden space into facial motion pictures of the actor (hereinafter "decoder A"), screen-shot pictures of the animation files (hereinafter "decoder B"), and the controller values corresponding to the screen-shot pictures of the animation files (hereinafter "decoder C"). The input data is reconstructed through the hidden space, so the model obtained after training represents the input data by the hidden space of its hidden layer, which can also assist data classification, visualization and storage. The model is in essence an unsupervised learning approach: only the data itself needs to be input, and no labels or input-output pairs are required. Both the encoder and the decoders use convolutional neural networks, and the facial motion pictures of the actor share the same encoder with the screen-shot pictures of the animation files.
Training decoders A and B requires no labels, while training decoder C requires labels, namely paired screen-shot pictures of the animation files and their corresponding controller values.
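As a non-limiting illustration, the structure of 1 encoder and 3 decoders may be sketched as follows; PyTorch is assumed, and the layer sizes, the 64 x 64 input resolution, the latent dimension and the controller dimension are illustrative values rather than requirements of the method:

```python
# Shared encoder with three decoders: A (actor picture), B (animation screenshot), C (controller values).
import torch
import torch.nn as nn


class Encoder(nn.Module):
    """Convolutional encoder shared by the actor pictures and the animation screenshots."""

    def __init__(self, latent_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),    # 64 -> 32
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),   # 32 -> 16
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),  # 16 -> 8
            nn.Flatten(),
            nn.Linear(128 * 8 * 8, latent_dim),
        )

    def forward(self, x):
        return self.net(x)


class ImageDecoder(nn.Module):
    """Decoder A and decoder B share this structure (two separate instances are trained)."""

    def __init__(self, latent_dim: int = 256):
        super().__init__()
        self.fc = nn.Linear(latent_dim, 128 * 8 * 8)
        self.net = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),   # 8 -> 16
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),    # 16 -> 32
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Sigmoid(),  # 32 -> 64
        )

    def forward(self, z):
        return self.net(self.fc(z).view(-1, 128, 8, 8))


class ControllerDecoder(nn.Module):
    """Decoder C: maps the latent code to a vector of controller values."""

    def __init__(self, latent_dim: int = 256, controller_dim: int = 60):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(), nn.Linear(256, controller_dim))

    def forward(self, z):
        return self.net(z)
```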
The built generation model is then trained to obtain the optimal weights of the encoder and the decoders, thereby obtaining the optimal model; the training of the encoder, the decoder generating the actor's facial motion pictures and the decoder generating the screen-shot pictures of the animation files is a process in which the output restores the input.
Specifically, the training method of the generative model is as follows:
Q1, inputting the actor's facial motion pictures and the screen-shot pictures of the animation files into the encoder, and outputting the corresponding pictures through the decoder that generates the actor's facial motion pictures and the decoder that generates the screen-shot pictures of the animation files, wherein the target output in this process is the input picture itself; the loss function value between the input and output pictures is calculated through the structural similarity index, and the weights of the encoder and the corresponding decoder are updated according to the loss function value;
Q2, the third decoder outputs the controller values corresponding to the screen-shot pictures; the loss function is the mean of the absolute differences between the predicted and true controller values, and the weights of the corresponding decoder are updated according to the loss function value.
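As a non-limiting illustration, one training step covering Q1 and Q2 may be sketched as follows; PyTorch is assumed, the pytorch_msssim package is used here only as one possible implementation of the structural similarity index, and the equal weighting of the three loss terms is an illustrative choice:

```python
# Training step: SSIM-based reconstruction losses (Q1) and mean absolute error on controller values (Q2).
import torch
import torch.nn.functional as F
from pytorch_msssim import ssim  # assumed third-party SSIM implementation; any SSIM could be substituted


def reconstruction_loss(x_in, x_out):
    """Q1: loss between an input picture and its reconstruction, based on the structural similarity index."""
    return 1.0 - ssim(x_out, x_in, data_range=1.0, size_average=True)


def controller_loss(pred, target):
    """Q2: mean of the absolute differences of the controller values."""
    return F.l1_loss(pred, target)


def train_step(encoder, dec_actor, dec_anim, dec_ctrl, optimizer, actor_imgs, anim_imgs, anim_ctrls):
    optimizer.zero_grad()
    # Q1: both picture domains are reconstructed through the shared encoder.
    loss_a = reconstruction_loss(actor_imgs, dec_actor(encoder(actor_imgs)))
    z_anim = encoder(anim_imgs)
    loss_b = reconstruction_loss(anim_imgs, dec_anim(z_anim))
    # Q2: decoder C is supervised with the controller values paired with the screenshots.
    loss_c = controller_loss(dec_ctrl(z_anim), anim_ctrls)
    (loss_a + loss_b + loss_c).backward()
    optimizer.step()
    return loss_a.item(), loss_b.item(), loss_c.item()
```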
Pictures of the actor are input into the trained generation model; the encoder encodes the pictures into the hidden space, and the corresponding decoder decodes the data in the hidden space to obtain the values of the corresponding controllers; the controller values are then input to animation software (Maya or UE) to generate the facial movement of the model.
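As a non-limiting illustration, the real-time loop of steps S4 and S5 may be sketched as follows; OpenCV is assumed for video capture, and apply_to_rig is a hypothetical hook standing in for the actual Maya or UE integration:

```python
# Real-time driving: camera frame -> encoder -> decoder C -> controller values -> animation software.
import cv2
import torch


def apply_to_rig(controller_values):
    """Hypothetical hook that forwards the controller values to the animation software.
    Inside Maya this could set the attributes of the facial controllers; for UE a network
    message to the running engine could be used instead."""
    pass


@torch.no_grad()
def drive_model(encoder, dec_ctrl, camera_index: int = 0, size: int = 64):
    cap = cv2.VideoCapture(camera_index)
    encoder.eval()
    dec_ctrl.eval()
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        face = cv2.resize(frame, (size, size))                    # crude stand-in for face detection / cropping
        x = torch.from_numpy(face[:, :, ::-1].copy()).permute(2, 0, 1).float().unsqueeze(0) / 255.0  # BGR -> RGB
        controllers = dec_ctrl(encoder(x)).squeeze(0).tolist()    # S4: picture -> hidden space -> controller values
        apply_to_rig(controllers)                                 # S5: drive the facial rig of the model
    cap.release()
```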
The invention has the following beneficial effects: through the model of the invention, the facial motions of the corresponding animation model are generated from facial videos and photos of the actors, without requiring paired data that links the actor's video frames to the animation files of the corresponding character, which greatly reduces early-stage data preparation; actors can be changed at will without any data annotation work; and real-time estimation is possible, i.e. the actor's facial video can be captured in real time and converted into the facial motion of the animation model.
Although the present invention has been described in detail with reference to the foregoing embodiments, it will be apparent to those skilled in the art that changes may be made in the embodiments and/or equivalents thereof without departing from the spirit and scope of the invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.