Disclosure of Invention
The technical problem addressed by the embodiments of the invention is how to generate a face generated image whose mouth shape accurately matches the driving audio and which accurately expresses the emotion contained in the driving audio.
The face image generation method comprises: determining a face image generation model, wherein the face image generation model comprises an audio content feature extraction sub-model, an audio emotion feature extraction sub-model and a diffusion sub-model; inputting driving audio into the audio content feature extraction sub-model for feature extraction to obtain audio content features; inputting the driving audio into the audio emotion feature extraction sub-model for feature extraction to obtain audio emotion features; splicing at least based on the audio content features and the audio emotion features to obtain an audio fusion feature; inputting the audio fusion feature and a noisy reference face image feature into the diffusion sub-model for denoising to obtain a target complete face feature, wherein the noisy reference face image feature is obtained by splicing image features of a reference face image with a noise matrix; and decoding the target complete face feature to obtain a complete face generated image.
Optionally, the face image generation model further comprises a key point feature extraction sub-model, and the splicing at least based on the audio content features and the audio emotion features comprises: splicing the audio content features, the audio emotion features and face key point features, wherein the face key point features are obtained by extracting key points from a face image with the lower half blocked and inputting the extracted key points into the key point feature extraction sub-model for feature extraction.
Optionally, before the splicing at least based on the audio content features and the audio emotion features, the method further comprises: determining a first number of audio segments whose time sequence precedes the driving audio and a second number of audio segments whose time sequence follows the driving audio, recorded as first audios and second audios respectively; inputting each of the first audios and the second audios into the audio content feature extraction sub-model for feature extraction to obtain a plurality of corresponding first audio content features and a plurality of corresponding second audio content features; performing a weighting operation on the plurality of first audio content features, the plurality of second audio content features and the audio content features to obtain a fused audio content feature; and updating the audio content features with the fused audio content feature.
Optionally, the reference face image comprises a complete face image and a face image with the lower half blocked, which come from the same speaker and carry the same emotion; before the audio fusion feature and the noisy reference face image feature are input into the diffusion sub-model for denoising, the method further comprises: performing feature extraction on the face image with the lower half blocked and on the complete face image respectively to obtain a partial face image feature and a complete face image feature, and splicing the partial face image feature, the complete face image feature and the noise matrix to obtain the noisy reference face image feature.
Optionally, inputting the audio fusion feature and the noisy reference face image feature into the diffusion sub-model for denoising to obtain the target complete face feature comprises: inputting the noisy reference face image feature into a first layer network of the diffusion sub-model, inputting the audio fusion feature into each layer network of the diffusion sub-model, and taking the output of the last layer network of the diffusion sub-model as the target complete face feature, wherein from the second layer network of the diffusion sub-model onward, the input data of each layer network are the output data of the previous layer network and the audio fusion feature.
Optionally, the audio emotion feature extraction sub-model comprises a pre-trained emotion classification network and an emotion feature extraction network, and inputting the driving audio into the audio emotion feature extraction sub-model for feature extraction to obtain the audio emotion features comprises: inputting the driving audio into the pre-trained emotion classification network to obtain a predicted emotion type label, encoding the predicted emotion type label to obtain an audio emotion encoding vector, and inputting the audio emotion encoding vector into the emotion feature extraction network for feature extraction to obtain the audio emotion features.
Optionally, encoding the predicted emotion type label to obtain the audio emotion encoding vector comprises: pre-encoding the predicted emotion type label based on a preset emotion encoding length to obtain multiple groups of emotion sub-encodings, each group comprising two identical emotion sub-encodings; for each group, determining the sine value of one emotion sub-encoding and the cosine value of the other emotion sub-encoding, thereby determining the emotion encoding corresponding to each emotion sub-encoding; and determining the audio emotion encoding vector based on the multiple emotion encodings thus obtained.
Optionally, the pre-trained emotion classification network is obtained by inputting a first training data set constructed by a plurality of first sample audios and emotion type labels thereof into an initialized emotion classification network for training based on a first loss function, wherein the first loss function is represented by the following expression:
L_ec = −(1/N)·Σ_{i=1}^{N} y_i·ln(ŷ_i);
wherein L_ec denotes the function value of the first loss function, i denotes the sequence number of a first sample audio, N denotes the total number of first sample audios, y_i denotes the true emotion type label of the i-th first sample audio, ŷ_i denotes the emotion type label predicted for the i-th first sample audio, and ln() denotes the logarithm with base e.
Optionally, determining the face image generation model comprises: constructing a face image generation model to be trained, the face image generation model to be trained comprising an audio content feature extraction sub-model to be trained, an audio emotion feature extraction sub-model to be trained and a diffusion sub-model to be trained; constructing a second training data set from a plurality of reference sample face images, a plurality of second sample audios and a sample noise matrix, wherein the reference sample face images and the second sample audios are aligned one by one in time sequence; and inputting the second training data set into the face image generation model to be trained for iterative training based on a second loss function, so as to obtain the face image generation model.
Optionally, the second loss function is represented by the following expression:
L' = E_{ε∼N(0,1)} [ ‖ε − M(z_t, t, C)‖² ];
wherein L' denotes the function value of the second loss function, t denotes the layer sequence number of a network in the diffusion sub-model to be trained, M() denotes the prediction noise matrix output by the t-th layer network of the diffusion sub-model to be trained, ε denotes the sample noise matrix, ε∼N(0,1) indicates that ε obeys a normal distribution, z_t denotes one item of input data of the t-th layer network of the diffusion sub-model to be trained, namely the output data of the (t−1)-th layer network, C is the other item of input data of the t-th layer network, namely the sample audio fusion feature obtained by splicing at least based on the sample audio content features and the sample audio emotion features, ‖ε − M(z_t, t, C)‖ denotes the Euclidean distance between the sample noise matrix and the prediction noise matrix output by the t-th layer network, and E[·] denotes taking the expected value of the squared Euclidean distance over multiple samplings.
The embodiment of the invention also provides a face image generation apparatus, which comprises a model determining module, an audio feature extraction module, an audio feature splicing module, a complete face feature determining module and a face image generating module. The model determining module is used for determining a face image generation model, the face image generation model comprising an audio content feature extraction sub-model, an audio emotion feature extraction sub-model and a diffusion sub-model; the audio feature extraction module is used for inputting driving audio into the audio content feature extraction sub-model for feature extraction to obtain audio content features, and inputting the driving audio into the audio emotion feature extraction sub-model for feature extraction to obtain audio emotion features; the audio feature splicing module is used for splicing at least based on the audio content features and the audio emotion features to obtain an audio fusion feature; the complete face feature determining module is used for inputting the audio fusion feature and a noisy reference face image feature into the diffusion sub-model for denoising to obtain a target complete face feature, the noisy reference face image feature being obtained by splicing image features of a reference face image with a noise matrix; and the face image generating module is used for decoding the target complete face feature to obtain a complete face generated image.
The embodiment of the invention also provides a computer readable storage medium, on which a computer program is stored, which when being executed by a processor, performs the steps of the face image generation method.
The embodiment of the invention also provides a terminal which comprises a memory and a processor, wherein the memory stores a computer program capable of running on the processor, and the processor executes the steps of the face image generation method when running the computer program.
Compared with the prior art, the technical scheme of the embodiment of the invention has the following beneficial effects:
In the embodiments of the invention, feature information of two dimensions, content and emotion, is separated from the driving audio. Specifically, the driving audio is input into the audio content feature extraction sub-model for feature extraction to obtain audio content features, and the driving audio is input into the audio emotion feature extraction sub-model for feature extraction to obtain audio emotion features; then an audio fusion feature is obtained by splicing at least based on the audio content features and the audio emotion features; the audio fusion feature and the noisy reference face image feature are input into the diffusion sub-model for denoising to obtain a target complete face feature; and finally the target complete face feature is decoded to obtain a complete face generated image.
The audio content features describe the content or semantics expressed by the speaker and correspond to the mouth shape or mouth motion during speaking, while the audio emotion features describe the emotion of the speaker during speaking and correspond to the speaker's real emotional state. Therefore, compared with the prior art, in which only single-dimension features are extracted from the driving audio, combining the audio content features of the semantic (content) dimension and the audio emotion features of the emotion dimension helps to obtain a face generated image that accurately matches the mouth shape contained in the driving audio, finely expresses the true emotion of the speaker, and contains finer texture. Furthermore, the embodiment also introduces image features of the reference face image in addition to the audio content features and the audio emotion features, which helps to generate face images that conform to the basic contour features of the face and have more stable quality.
Detailed Description
In order to make the above objects, features and advantages of the present invention more comprehensible, embodiments accompanied with figures are described in detail below.
Referring to fig. 1, fig. 1 is a flowchart of a face image generating method according to an embodiment of the present invention. The face image generation method can be applied to a terminal with a face image generation function, and the terminal can comprise, but is not limited to, a mobile phone, a computer, a tablet personal computer, intelligent wearable equipment (for example, an intelligent watch), vehicle-mounted terminal equipment, a server, a cloud platform and the like.
The method may include steps S11 to S15:
Step S11, determining a face image generation model, wherein the face image generation model comprises an audio content feature extraction sub-model, an audio emotion feature extraction sub-model and a diffusion sub-model;
Step S12, inputting the driving audio into the audio content feature extraction sub-model to perform feature extraction to obtain audio content features, and inputting the driving audio into the audio emotion feature extraction sub-model to perform feature extraction to obtain audio emotion features;
Step S13, splicing at least based on the audio content features and the audio emotion features to obtain an audio fusion feature;
Step S14, inputting the audio fusion feature and a noisy reference face image feature into the diffusion sub-model for denoising to obtain a target complete face feature, wherein the noisy reference face image feature is obtained by splicing image features of the reference face image with a noise matrix;
Step S15, decoding the target complete face feature to obtain a complete face generated image.
Further, the method may also comprise: for multiple segments of driving audio with a time sequence, generating a plurality of corresponding complete face generated images, and splicing the images according to the time sequence to obtain a digital human video.
In the implementation of step S11, the audio content feature extraction sub-model may use an existing neural network capable of extracting audio content features; the audio content features may include text feature information and semantic feature information corresponding to the driving audio and mainly affect the mouth shape and mouth motion of the face generated image. The audio emotion feature extraction sub-model may use an existing neural network capable of extracting audio emotion features; the audio emotion features may include prosodic feature information of the audio (e.g., rhythm, tone, duration) and mainly affect the expression, texture details, etc. of the face generated image.
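As a non-limiting illustrative sketch only, the composition described above may be organized as follows in Python/PyTorch; the class name, attribute names and the simple feature-dimension concatenation are assumptions for illustration, not the actual networks of the embodiment.

```python
# Illustrative sketch (assumed names and structure), not the actual model of the embodiment.
import torch
import torch.nn as nn

class FaceImageGenerationModel(nn.Module):
    def __init__(self, content_extractor: nn.Module, emotion_extractor: nn.Module,
                 diffusion_submodel: nn.Module):
        super().__init__()
        self.content_extractor = content_extractor    # audio content feature extraction sub-model
        self.emotion_extractor = emotion_extractor    # audio emotion feature extraction sub-model
        self.diffusion_submodel = diffusion_submodel  # diffusion (denoising) sub-model

    def forward(self, driving_audio: torch.Tensor, noisy_ref_face_feat: torch.Tensor) -> torch.Tensor:
        content_feat = self.content_extractor(driving_audio)   # mainly affects mouth shape / motion
        emotion_feat = self.emotion_extractor(driving_audio)   # mainly affects expression / texture
        audio_fusion = torch.cat([content_feat, emotion_feat], dim=-1)  # splice the two features
        return self.diffusion_submodel(noisy_ref_face_feat, audio_fusion)  # target complete face feature
```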
In the implementation of step S12, the driving audio is input into the audio content feature extraction sub-model to perform feature extraction, so as to obtain audio content features, and the driving audio is input into the audio emotion feature extraction sub-model to perform feature extraction, so as to obtain audio emotion features.
The driving audio may be recorded in real time on site during a speaker's speech, taken from a previously collected audio database, or extracted from video recorded in real time or pre-recorded. The speaker may be, but is not limited to, a lecturer, a reciter, a participant in a business negotiation, or a speaker in everyday communication.
For example, audio recorded in real time on site or selected from an audio database may be segmented to obtain multiple segments of driving audio in time sequence. Each segment of driving audio has its own duration and corresponding text content, and usually also carries the individual emotion of the speaker. A single segment of driving audio may be used to generate a single complete face generated image. The plurality of complete face generated images corresponding to the multiple segments of driving audio in time sequence may then be spliced according to the time sequence to obtain a face generation video (also called a digital human video).
Further, the time interval of the input driving audio covers the time interval of the single-frame complete face generated image; that is, the time interval of the single-frame complete face generated image lies within the time interval of the driving audio.
Further, a ratio of the occupied duration of the driving audio to the occupied duration of the single-frame full face generated image may be set to be greater than or equal to 5.
Taking a speaking-process video containing face images at 25 frames per second (FPS) as an example, a single frame of face image corresponds to an audio duration of 40 ms (i.e., occupies a 40 ms duration), and the duration of the driving audio input in the embodiment of the invention may be at least 5 times (e.g., 10 times or even tens of times) the audio duration corresponding to a single frame of face image. In this way, the driving audio contains not only the audio segment aligned in time with the complete face generated image but also audio information from before and after that segment.
In the embodiment of the invention, the time interval of the driving audio is set to cover the time interval of the single-frame complete face generated image (for example, the time interval of the single-frame complete face generated image may be set at the middle of the time interval of the driving audio), so that audio information preceding and following the frame is provided, which helps to improve the naturalness and smoothness of the mouth shape and emotion of the complete face generated image produced by the model. Further, the driving audio duration is much longer than the duration of a single-frame complete face generated image (at least 5 times), which, compared with setting the driving audio duration to only 1 or 2 times that duration or less, provides richer audio information for generating the complete face image and further improves image quality.
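As a rough numerical sketch under the assumptions stated above (25 FPS, hence 40 ms of audio per frame, and a 10-times window), the driving-audio interval covering a single frame might be selected as follows; the function name and the centering strategy are illustrative assumptions.

```python
# Illustrative only: pick a driving-audio window whose time interval covers and is centered
# on the single-frame interval. FPS and the 10x window length are assumed values.
FPS = 25
FRAME_MS = 1000 // FPS          # 40 ms of audio corresponds to one video frame
WINDOW_FRAMES = 10              # driving audio spans 10 frame durations (>= the suggested 5x)

def audio_window_for_frame(frame_idx: int) -> tuple[int, int]:
    """Return (start_ms, end_ms) of the driving audio, centered on the frame interval."""
    frame_start = frame_idx * FRAME_MS
    frame_center = frame_start + FRAME_MS // 2
    half_window = WINDOW_FRAMES * FRAME_MS // 2
    return max(0, frame_center - half_window), frame_center + half_window

print(audio_window_for_frame(100))  # frame 100 -> (3820, 4220), a 400 ms window around a 40 ms frame
```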
The audio emotion feature extraction sub-model may further comprise a pre-trained emotion classification network and an emotion feature extraction network. In step S12, inputting the driving audio into the audio emotion feature extraction sub-model for feature extraction to obtain the audio emotion features may specifically comprise: inputting the driving audio into the pre-trained emotion classification network to obtain a predicted emotion type label, encoding the predicted emotion type label to obtain an audio emotion encoding vector, and inputting the audio emotion encoding vector into the emotion feature extraction network for feature extraction to obtain the audio emotion features.
In implementations, emotion type tags may be used to indicate a particular emotion type. In some embodiments, the emotion type may be selected from, but is not limited to, happiness, sadness, anger, fatigue, anxiety, tension, surprise, and the like. In other embodiments, the emotion type may be selected from, but not limited to, positive emotion, neutral emotion, negative emotion.
Without limitation, a scalar or Arabic numeral may be employed as the emotion type label; for example, the emotion type "happy" may be indicated by "0", "sad" by "1", and "tired" by "3".
Further, encoding the predicted emotion type label to obtain the audio emotion encoding vector may comprise: pre-encoding the predicted emotion type label based on a preset emotion encoding length to obtain multiple groups of emotion sub-encodings, each group comprising two identical emotion sub-encodings; for each group, determining the sine value of one emotion sub-encoding and the cosine value of the other emotion sub-encoding, thereby determining the emotion encoding corresponding to each emotion sub-encoding; and determining the audio emotion encoding vector based on the multiple emotion encodings thus obtained.
Specifically, the following formula may be used to determine the audio emotion encoding vector:
P = [sin(2^0·π·E), cos(2^0·π·E), sin(2^1·π·E), cos(2^1·π·E), …, sin(2^(L−1)·π·E), cos(2^(L−1)·π·E)];
wherein P denotes the audio emotion encoding vector, E denotes the predicted emotion type label, L denotes the number of groups of emotion sub-encodings obtained by pre-encoding, 2^(L−1)·π·E denotes one emotion sub-encoding in the L-th group of emotion sub-encodings, sin(2^(L−1)·π·E) denotes the sine value of one emotion sub-encoding in the L-th group, cos(2^(L−1)·π·E) denotes the cosine value of the other emotion sub-encoding in the L-th group, and [·] denotes a vector composed of the listed elements.
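A minimal sketch of this encoding, following the formula above; the value of L and the function name are assumptions for illustration.

```python
import numpy as np

def emotion_encoding_vector(label: float, L: int = 4) -> np.ndarray:
    """Encode a predicted emotion type label E into the 2L-dimensional vector
    P = [sin(2^0*pi*E), cos(2^0*pi*E), ..., sin(2^(L-1)*pi*E), cos(2^(L-1)*pi*E)].
    L (the number of groups of emotion sub-encodings) is an assumed default here."""
    parts = []
    for k in range(L):
        sub = (2 ** k) * np.pi * label        # the emotion sub-encoding 2^k * pi * E
        parts.extend([np.sin(sub), np.cos(sub)])
    return np.asarray(parts)

print(emotion_encoding_vector(1.0, L=3))      # label "1" (e.g. "sad") -> 6-dimensional vector
```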
Further, the pre-trained emotion classification network is obtained by inputting a first training data set constructed by a plurality of first sample audios and emotion type labels thereof into an initialized emotion classification network for training based on a first loss function, wherein the first loss function is represented by the following expression:
L_ec = −(1/N)·Σ_{i=1}^{N} y_i·ln(ŷ_i);
wherein L_ec denotes the function value of the first loss function, i denotes the sequence number of a first sample audio, N denotes the total number of first sample audios, y_i denotes the true emotion type label of the i-th first sample audio, ŷ_i denotes the emotion type label predicted for the i-th first sample audio, and ln() denotes the logarithm with base e.
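A minimal sketch of a cross-entropy style loss consistent with the symbol definitions above; the one-hot form of the true labels, the 1/N normalization, and all names are assumptions for illustration.

```python
import numpy as np

def emotion_classification_loss(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """First loss sketch: L_ec = -(1/N) * sum_i y_i * ln(y_hat_i).
    y_true: one-hot (or probability) targets, shape (N, num_classes) -- assumed form
    y_pred: predicted class probabilities, shape (N, num_classes)"""
    eps = 1e-12                                            # avoid ln(0)
    n = y_true.shape[0]
    return float(-(y_true * np.log(y_pred + eps)).sum() / n)

# Toy usage: 2 first sample audios, 3 emotion classes (e.g. "happy", "sad", "tired")
y_true = np.array([[1, 0, 0], [0, 1, 0]], dtype=float)
y_pred = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]])
print(emotion_classification_loss(y_true, y_pred))
```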
In the implementation of step S13, the audio fusion feature is obtained by stitching at least based on the audio content feature and the audio emotion feature.
In a specific embodiment, the audio content feature and the audio emotion feature may be directly spliced to obtain the audio fusion feature.
In another embodiment, the face image generating model may further include a key point feature extraction sub-model, and the stitching operation in step S13 may include stitching the audio content feature, the audio emotion feature, and the face key point feature, where the face key point feature is obtained by extracting a key point from a face image with a lower half blocked, and inputting the extracted key point into the key point feature extraction sub-model to perform feature extraction.
The face image with the blocked lower half (may be simply referred to as a mask face image) may be obtained by covering the lower half of the complete face image with a mask. Regarding the shape of the mask, it may be set in connection with actual needs. For example, it may be selected from, but not limited to, a semicircle, rectangle, or half face shape.
It should be noted that the lower part to be blocked should at least contain the region where the mouth of the full face image is located. For example, the region of the lower half may include only the lip region, or may include the nose tip to chin region, or may include the region below the eyes to chin region.
The face key points refer to key points of the regions other than the blocked lower-half region of the complete face image. Specifically, they may include contour key points and/or key points of multiple core regions within the contour, selected from, for example but not limited to, eye contour key points, the eyeball center point, nose contour key points, the nose tip center point, the eyebrow center point and head contour key points.
Regarding the method for extracting key points from the face image with the lower half blocked, an existing face key point extraction algorithm or model may be adopted, and the extracted key points are generally represented as two-dimensional or three-dimensional coordinate points.
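As a non-limiting sketch, the face image with the lower half blocked and its key points might be produced as follows; the rectangular half-height mask and the detect_landmarks callback are hypothetical placeholders for illustration, not an actual API of the embodiment.

```python
import numpy as np

def mask_lower_half(face_image: np.ndarray) -> np.ndarray:
    """Block the lower half of a complete face image with a rectangular mask
    (one of the mask shapes mentioned above); the 50% split point is an assumption."""
    masked = face_image.copy()
    h = masked.shape[0]
    masked[h // 2:, :] = 0                     # cover everything from mid-height down to the chin
    return masked

def keypoints_from_masked_face(face_image: np.ndarray, detect_landmarks) -> np.ndarray:
    """detect_landmarks is a hypothetical callback standing in for any existing face
    key point extraction algorithm; it is assumed to return (K, 2) 2-D coordinates."""
    masked = mask_lower_half(face_image)
    return np.asarray(detect_landmarks(masked))
```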
In some embodiments, the driving audio and the lower occluded face image may originate from the same speaker and contain the same emotion. For example, a video recording can be performed for a speaking process of a speaker, an audio stream and a face image stream are extracted from the recorded video, then the audio stream and the face image stream are segmented or sampled respectively to obtain at least one section of driving audio and at least one frame of complete face image aligned in time sequence, and then the lower half part of the complete face image is shielded.
In other embodiments, the driving audio may be extracted from a video recorded during a speaker's speech, and the face image with the lower half blocked may be obtained by blocking the lower half of a complete face image obtained by modeling in advance; for example, the complete face image may be a high-quality face image with high definition and a standard contour obtained by modeling in advance.
In the embodiment of the invention, on the basis of combining the audio content features and the audio emotion features, the face key point features extracted from the face image with the lower half blocked are further introduced. On one hand, the face key point features contain important information such as the head contour, the contours of key regions and the face pose, so that the complete face generated image produced by the model is more stable and standard, and the complete face generated images at consecutive time steps are more consistent. On the other hand, the face key point features contain no mouth or mouth-shape information of the face image, so that the model concentrates on learning the mouth-shape information in the driving audio; interference from mouth or mouth-shape information in the image is avoided, and the complete face generated image produced by the model matches the mouth shape of the driving audio more closely.
In a specific implementation, the extracted audio content features, audio emotion features and face key point features are usually in a matrix or vector form, and the total number of rows and/or total number of columns of the matrix of each feature are usually consistent, so that the subsequent feature stitching operation is facilitated.
Without limitation, as a scheme for stitching two features, for example, each row of elements of one feature matrix may be spliced in full to a preset position of the same row of the other feature matrix (i.e., every two rows with the same row number are spliced). In this way, the information of each feature is fully retained, ensuring the completeness and validity of the information contained in the audio fusion feature.
For another example, each row of elements of one feature matrix may be clipped, and then the remaining elements of the row may be spliced to the preset positions of the same row of the other feature matrix. Therefore, invalid elements can be removed, the data volume is reduced, and the subsequent operation efficiency is improved.
Wherein the preset position (i.e., splice position or insert position) may be a position after the last element of each row of the matrix being spliced, or a position before the first element of each row, or other suitable position.
As a non-limiting example, each row element of the matrix of audio emotion features may be completely stitched to a position after the last element of the same row of the matrix of audio content features according to the original order of the row elements to obtain a preliminary stitched matrix, and then each row element of the matrix of face key point features may be completely stitched to a position after the last element of the same row of the preliminary stitched matrix according to the original order of the row elements.
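A toy sketch of this row-wise splicing order (emotion features appended after the content features, then key point features appended to the preliminary result); all shapes are assumptions for illustration.

```python
import numpy as np

# Assumed feature shapes: 4 rows each, so rows with the same row number can be spliced.
audio_content = np.random.rand(4, 8)
audio_emotion = np.random.rand(4, 3)
face_keypoints = np.random.rand(4, 5)

preliminary = np.concatenate([audio_content, audio_emotion], axis=1)   # (4, 11) preliminary stitched matrix
audio_fusion = np.concatenate([preliminary, face_keypoints], axis=1)   # (4, 16) audio fusion feature
print(audio_fusion.shape)
```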
It should be noted that, the above-described embodiment does not limit the feature stitching manner, stitching sequence, and stitching position, and in a specific implementation, the stitching operation may be performed in other suitable manners, so long as it is beneficial to generate a complete face generated image with better quality.
In the implementation of step S14, the audio fusion feature and the noisy reference face image feature are input into the diffusion submodel for denoising processing, so as to obtain a target complete face feature, where the noisy reference face image feature is obtained by stitching the image feature of the reference face image with a noise matrix.
Specifically, the diffusion submodel is used for denoising the noisy reference face image feature under the condition of the audio fusion feature, and the finally output target complete face image feature is the denoised complete face image feature. The noise matrix may be a preset or randomly generated matrix, for example, a gaussian noise matrix (i.e., a noise matrix that conforms to a normal distribution). The diffusion submodel can adopt the existing neural network model which can realize denoising of the image characteristics. For example, it may be selected from, but not limited to, a U-Net model comprising a multi-layer network, a full convolutional network (Fully Convolutional Networks, FCN), and the like.
In a specific embodiment, the diffusion sub-model adopts a U-Net model comprising a multi-layer network, and step S14 specifically comprises: inputting the noisy reference face image feature into the first layer network of the diffusion sub-model, inputting the audio fusion feature into each layer network of the diffusion sub-model, and taking the output of the last layer network of the diffusion sub-model as the target complete face feature, wherein from the second layer network of the diffusion sub-model onward, the input data of each layer network are the output data of the previous layer network and the audio fusion feature.
In the embodiment of the invention, the audio fusion feature is the key input that guides the model to generate a complete face image accurately matching the mouth shape and expression of the driving audio. Therefore, using a multi-layer network structure and feeding the audio fusion feature into every layer of the diffusion sub-model, rather than using a single-layer structure or feeding the audio fusion feature only into the first layer, allows deeper interaction between the audio fusion feature and the noisy reference face image feature, and finally yields a complete face generated image whose mouth shape and expression match the driving audio more closely.
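A structural sketch of this per-layer conditioning; for readability it uses simple linear layers rather than an actual U-Net, and all dimensions, names and the activation are assumptions for illustration.

```python
import torch
import torch.nn as nn

class ConditionedDenoiser(nn.Module):
    """Sketch only: the first layer takes the noisy reference face image feature, every layer
    additionally takes the same audio fusion feature, and from the second layer onward each
    layer also takes the previous layer's output. The last layer's output is the target feature."""
    def __init__(self, feat_dim: int, cond_dim: int, num_layers: int = 4):
        super().__init__()
        self.layers = nn.ModuleList(
            [nn.Linear(feat_dim + cond_dim, feat_dim) for _ in range(num_layers)]
        )

    def forward(self, noisy_ref_feat: torch.Tensor, audio_fusion: torch.Tensor) -> torch.Tensor:
        x = noisy_ref_feat
        for layer in self.layers:
            x = torch.relu(layer(torch.cat([x, audio_fusion], dim=-1)))  # condition every layer
        return x   # taken as the target complete face feature
```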
Further, in step S14, before the audio fusion feature and the noisy reference face image feature are input into the diffusion sub-model for denoising, the method may further comprise: performing feature extraction on the face image with the lower half blocked and on the complete face image respectively to obtain a partial face image feature and a complete face image feature, and splicing the partial face image feature, the complete face image feature and the noise matrix to obtain the noisy reference face image feature.
Regarding the manner of stitching the face image features and the noise matrix, specific details of feature stitching in the foregoing step S13 may be referred to, which is not described herein.
In a specific implementation, the same or different image feature extraction models may be used to perform feature extraction on the face image with the lower half blocked and the whole face image respectively.
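A minimal sketch of assembling the noisy reference face image feature from the two image features and a Gaussian noise matrix, using the same row-wise splicing as before; all shapes are assumptions for illustration.

```python
import numpy as np

partial_face_feat = np.random.rand(4, 16)     # from the face image with the lower half blocked
complete_face_feat = np.random.rand(4, 16)    # from the complete face image
noise_matrix = np.random.randn(4, 16)         # Gaussian (normally distributed) noise matrix

noisy_ref_face_feat = np.concatenate(
    [partial_face_feat, complete_face_feat, noise_matrix], axis=1
)                                             # (4, 48) noisy reference face image feature
print(noisy_ref_face_feat.shape)
```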
Compared with introducing only the features of the complete face image, the embodiment of the invention combines the features of the face image with the lower half blocked and the features of the complete face image. The complete face image features contain the basic contour information and key-region information of the whole face, which reduces the randomness of the complete face generated image produced by the model and avoids generating face images with extreme mouth shapes or emotional states; the features of the face image with the lower half blocked contain the basic contour and key-region information of the parts other than the mouth region, which further enhances the stability and temporal consistency of the complete face generated image produced by the model while avoiding interference from mouth information in the image, so that the model can concentrate on learning the mouth (i.e., mouth shape and mouth motion) characteristics of the driving audio.
Regarding the manner of acquiring the face image with the lower half blocked and the complete face image, reference may be made to the foregoing related description, which is not repeated here.
Further, determining the face image generation model may comprise: constructing a face image generation model to be trained, the face image generation model to be trained comprising an audio content feature extraction sub-model to be trained, an audio emotion feature extraction sub-model to be trained and a diffusion sub-model to be trained; constructing a second training data set from a plurality of reference sample face images, a plurality of second sample audios and a sample noise matrix, wherein the reference sample face images and the second sample audios are aligned one by one in time sequence; and inputting the second training data set into the face image generation model to be trained for iterative training based on a second loss function, so as to obtain the face image generation model.
Wherein the second loss function is represented by the following expression:
L' = E_{ε∼N(0,1)} [ ‖ε − M(z_t, t, C)‖² ];
wherein L' denotes the function value of the second loss function, t denotes the layer sequence number of a network in the diffusion sub-model to be trained, M() denotes the prediction noise matrix output by the t-th layer network of the diffusion sub-model to be trained, ε denotes the sample noise matrix, ε∼N(0,1) indicates that ε obeys a normal distribution, z_t denotes one item of input data of the t-th layer network of the diffusion sub-model to be trained, namely the output data of the (t−1)-th layer network of the diffusion sub-model to be trained, and C is the other item of input data of the t-th layer network, namely the sample audio fusion feature obtained by splicing at least based on the sample audio content features and the sample audio emotion features;
‖ε − M(z_t, t, C)‖ denotes the Euclidean distance between the sample noise matrix and the prediction noise matrix output by the t-th layer network of the diffusion sub-model to be trained, and E[·] denotes taking the expected value of the squared Euclidean distance over multiple samplings. The number of samplings may be set according to the actual application scenario, which is not limited by the embodiment of the invention.
In a specific implementation, the sample noise matrix ε may be a randomly generated Gaussian sample noise matrix, i.e., a sample noise matrix that obeys a normal distribution. The input data of the first layer network of the diffusion sub-model to be trained specifically comprise the noisy reference sample face image feature and the sample audio fusion feature, wherein the noisy reference sample face image feature is obtained by splicing the image features of the reference sample face image with the sample noise matrix.
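A minimal sketch of the second loss under the reconstruction above, approximating the expectation over multiple samplings by a batch mean; the function name, shapes and the batch-mean approximation are assumptions for illustration.

```python
import torch

def diffusion_training_loss(pred_noise: torch.Tensor, sample_noise: torch.Tensor) -> torch.Tensor:
    """Second loss sketch: expected squared Euclidean distance between the sample noise
    matrix epsilon and the predicted noise M(z_t, t, C); the expectation is approximated
    here by the mean over a batch of samplings (an assumption)."""
    sq_dist = ((sample_noise - pred_noise) ** 2).sum(dim=(-2, -1))   # ||eps - M(z_t, t, C)||^2 per sample
    return sq_dist.mean()                                            # empirical expectation

# Toy usage: a batch of 8 samplings with noise matrices of shape (4, 16)
eps = torch.randn(8, 4, 16)        # Gaussian sample noise matrices
pred = torch.randn(8, 4, 16)       # placeholder predicted noise from the t-th layer network
print(diffusion_training_loss(pred, eps))
```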
Further, the plurality of reference sample face images may include a plurality of complete sample face images and a plurality of lower half-blocked sample face images, wherein the plurality of complete sample face images and the plurality of lower half-blocked sample face images are aligned one by one in time sequence, and the complete sample face images and the lower half-blocked sample face images aligned one by one in time sequence may be from the same sample speaker and have the same emotion.
Correspondingly, the noise-carrying reference sample face image features can be obtained by splicing the image features of the sample face image with the lower part blocked, the image features of the complete sample face image and the noise matrix.
Specifically, the sample audio content features are obtained by inputting the second sample audio into the audio content feature extraction sub-model to be trained for feature extraction, and the sample audio emotion features are obtained by inputting the second sample audio into the audio emotion feature extraction sub-model to be trained for feature extraction.
Corresponding to the process of generating the complete face generated image by the face image generating model (namely, the model reasoning process), in the model training process, the splicing operation of obtaining the sample audio fusion characteristic C can adopt at least the following two embodiments.
In a specific embodiment, the sample audio content features and the sample audio emotion features may be directly spliced to obtain the sample audio fusion feature C.
In another embodiment, the face image generation model to be trained further comprises a key point feature extraction sub-model to be trained, and the splicing operation comprises splicing the sample audio content features, the sample audio emotion features and the sample face key point features, wherein the sample face key point features are obtained by inputting the extracted key points into the key point feature extraction sub-model to be trained after extracting the key points of the sample face image with the lower part being blocked.
For more details of the model training process, reference is made to the foregoing and related descriptions of the steps of the embodiment shown in fig. 1 with respect to the model reasoning process, which are not repeated here.
Referring to fig. 2, fig. 2 is a partial flowchart of another face image generation method according to an embodiment of the present invention. The other face image generation method may include steps S11 to S15 shown in fig. 1, and may further include steps S21 to S24, wherein the steps S21 to S24 may be performed before the step S13.
In step S21, a first number of audio segments whose time sequence precedes the driving audio and a second number of audio segments whose time sequence follows the driving audio are determined and recorded as first audios and second audios, respectively.
In a specific implementation, the first number and the second number may be set appropriately according to the actual application scenario. The first and second amounts may be selected from suitable values in the interval [5,10], such as 8, without limitation.
In step S22, each of the first audio and the second audio is input into the audio content feature extraction sub-model to perform feature extraction, so as to obtain a plurality of corresponding first audio content features and a plurality of second audio content features.
In step S23, weighting operation is performed on the plurality of first audio content features, the plurality of second audio content features, and the audio content features, so as to obtain a fused audio content feature.
For example, for each group of elements located at the same position in the first audio content features, the second audio content features and the audio content features, a weighted sum or average value may be calculated to obtain a fusion element; all the fusion elements thus obtained constitute the fused audio content feature.
In a specific embodiment, the audio content feature extraction sub-model may include an audio content feature extraction network for extracting the audio content features from the driving audio, and may further include a time domain filter for performing a weighting operation. The weights of the time domain filter for performing the weighting operation can be learned and obtained in the training process of the face image generation model.
In another embodiment, the weighting operation may be performed using a preset weight. Wherein the weight of the audio content feature may be greater than the weights of the first and second audio content features.
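A minimal sketch of the preset-weight variant of this weighting operation; the specific weight values, names and shapes are assumptions for illustration (in the embodiment the weights may instead be learned by the time-domain filter).

```python
import numpy as np

def fuse_audio_content_features(prev_feats, current_feat, next_feats, current_weight: float = 0.5):
    """Weighted fusion of the first audio content features (preceding), the current audio
    content feature, and the second audio content features (following). Giving the current
    feature the larger preset weight is an assumption."""
    context = list(prev_feats) + list(next_feats)
    context_weight = (1.0 - current_weight) / len(context)          # split remainder evenly
    fused = current_weight * current_feat
    for feat in context:
        fused = fused + context_weight * feat
    return fused                                                     # the fused audio content feature

# Toy usage: 2 preceding and 2 following audio content features of shape (4, 8)
prev = [np.random.rand(4, 8) for _ in range(2)]
nxt = [np.random.rand(4, 8) for _ in range(2)]
cur = np.random.rand(4, 8)
print(fuse_audio_content_features(prev, cur, nxt).shape)
```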
In step S24, the audio content features are updated with the fused audio content features.
In the embodiment of the invention, by fusing the audio content features of the preceding and following time steps into the audio content features to be spliced, the audio content information of the surrounding time sequence is better incorporated when generating the current complete face generated image, so that the complete face generated images at consecutive time steps have smoother and more consistent mouth shapes and expression states. Further, stitching these smoother consecutive complete face generated images helps to obtain a high-quality digital human video with more natural and coherent mouth-shape and expression transitions, improving user experience.
Referring to fig. 3 and 4, fig. 3 is a schematic diagram of a face image generation model according to an embodiment of the present invention, and fig. 4 is a schematic diagram of a diffusion sub-model in the face image generation model shown in fig. 3.
The face image generation model includes, without limitation, an audio content feature extraction sub-model, an audio emotion feature extraction sub-model, a diffusion sub-model, and may further include a key point feature extraction sub-model, and may further include a coding sub-model and a decoding sub-model.
The coding sub-model may be used to perform feature extraction on the reference face image to obtain the reference face image features; the reference face image specifically comprises a complete face image and a face image with the lower half blocked (which may be simply called a mask face image) that come from the same speaker and carry the same emotion. The decoding sub-model may be used to decode the target complete face feature output by the diffusion sub-model to obtain the complete face generated image.
The diffusion sub-model adopts a U-Net model comprising a multi-layer network, input data of the diffusion sub-model at least comprises two items, wherein one item of input data is a noisy reference face image feature, the noisy reference face image feature can be obtained by splicing the reference face image feature output by the coding sub-model and a noise matrix, and the other item of input data is an audio fusion feature obtained by splicing the audio content feature output by the audio content extraction sub-model, the audio emotion feature output by the audio emotion extraction sub-model and the face key point feature output by the key point feature extraction sub-model.
Further, the noisy reference face image features are input into a first layer network of the diffusion sub-model, the audio fusion features are input into each layer network of the diffusion sub-model, and an output result of a last layer network of the diffusion sub-model is used as the target whole face features.
Regarding the functions and operation processes of the other sub-models of the face image generation model, refer specifically to the relevant contents in any of the embodiments of fig. 1 to fig. 2, and are not repeated here.
Referring to fig. 5, fig. 5 is a schematic structural diagram of a face image generating apparatus according to an embodiment of the present invention. The face image generation apparatus may include:
A model determining module 51, configured to determine a face image generation model, where the face image generation model includes an audio content feature extraction sub-model, an audio emotion feature extraction sub-model, and a diffusion sub-model;
The audio feature extraction module 52 is configured to input driving audio to the audio content feature extraction sub-model to perform feature extraction to obtain audio content features, and input the driving audio to the audio emotion feature extraction sub-model to perform feature extraction to obtain audio emotion features;
an audio feature stitching module 53, configured to stitch at least based on the audio content feature and the audio emotion feature to obtain an audio fusion feature;
The complete face feature determining module 54 is configured to input the audio fusion feature and the noisy reference face image feature into the diffusion sub-model for denoising to obtain a target complete face feature, wherein the noisy reference face image feature is obtained by splicing the image features of the reference face image with a noise matrix;
And the face image generating module 55 is configured to decode the target complete face feature to obtain a complete face generated image.
Regarding the principle, specific implementation and beneficial effects of the face image generating apparatus, please refer to the foregoing and the related descriptions of the face image generating method shown in fig. 1 to 2, which are not repeated herein.
The embodiment of the invention also provides a computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, the steps of the face image generation method according to any of the above embodiments are performed. The computer-readable storage medium may include non-volatile or non-transitory memory, and may also include an optical disk, a mechanical hard disk, a solid state disk, and the like.
Specifically, in the embodiment of the invention, the processor may be a central processing unit (CPU), or may be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
It should also be appreciated that the memory in the embodiments of the application may be volatile memory or non-volatile memory, or may include both. The non-volatile memory may be a read-only memory (ROM), a programmable ROM (PROM), an erasable programmable ROM (EPROM), an electrically erasable programmable ROM (EEPROM) or a flash memory. The volatile memory may be a random access memory (RAM), which acts as an external cache. By way of example and not limitation, many forms of RAM are available, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM) and direct rambus RAM (DR RAM).
The embodiment of the invention also provides a terminal, which comprises a memory and a processor, wherein the memory stores a computer program capable of running on the processor, and the processor executes the steps of the facial image generation method in any embodiment when running the computer program.
It should be understood that the term "and/or" merely describes an association relationship between associated objects and indicates that three relationships may exist; for example, A and/or B may represent three cases: A alone, both A and B, and B alone. The character "/" herein indicates an "or" relationship between the associated objects before and after it.
The term "plurality" as used in the embodiments of the present application means two or more.
The first, second, etc. descriptions in the embodiments of the present application are only used for illustrating and distinguishing the description objects, and no order is used, nor is the number of the devices in the embodiments of the present application limited, and no limitation on the embodiments of the present application should be construed.
It should be noted that the serial numbers of the steps in the present embodiment do not represent a limitation on the execution sequence of the steps.
Although the present invention is disclosed above, the present invention is not limited thereto. Various changes and modifications may be made by those skilled in the art without departing from the spirit and scope of the invention, and the scope of the invention shall be subject to the appended claims.