
Face image generation method and device, computer readable storage medium, terminal

Info

Publication number
CN118015110B
Authority
CN
China
Prior art keywords
audio
model
emotion
feature
sub
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311762681.0A
Other languages
Chinese (zh)
Other versions
CN118015110A (en)
Inventor
王霄鹏
虞钉钉
胡贤良
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huayuan Computing Technology Shanghai Co ltd
Original Assignee
Huayuan Computing Technology Shanghai Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huayuan Computing Technology Shanghai Co ltd
Priority to CN202311762681.0A
Publication of CN118015110A
Application granted
Publication of CN118015110B
Legal status: Active (current)
Anticipated expiration

Abstract

Translated from Chinese

A method and device for generating a facial image, a computer-readable storage medium, and a terminal, the method comprising: determining a facial image generation model, the facial image generation model comprising an audio content feature extraction sub-model, an audio emotion feature extraction sub-model, and a diffusion sub-model; inputting the driving audio into the audio content feature extraction sub-model and the audio emotion feature extraction sub-model for feature extraction, thereby obtaining audio content features and audio emotion features; performing splicing based at least on the audio content features and the audio emotion features, thereby obtaining audio fusion features; inputting the audio fusion features and the features of a reference facial image with noise into the diffusion sub-model for denoising, thereby obtaining target complete facial features; and decoding the target complete facial features to obtain a complete facial generated image. The above scheme helps to generate a facial generated image that can accurately match the lip shape in the driving audio and accurately express the emotions contained in the driving audio.

Description

Face image generation method and device, computer readable storage medium and terminal
Technical Field
The present invention relates to the field of digital person generation technologies, and in particular, to a face image generation method and apparatus, a computer readable storage medium, and a terminal.
Background
With the development of artificial intelligence technology, generative artificial intelligence (Artificial Intelligence Generated Content, AIGC) has become one of the most popular current research topics. AIGC technology is also widely used in the field of digital speaker generation. The digital speaker generation task essentially takes audio and a character image as input and sequentially generates face images corresponding to the audio content. How to use AIGC technology to generate, driven by audio, face images that both contain accurate mouth shapes and accurately express the emotion contained in the audio is therefore of significant research value.
Currently, the mainstream approach to the audio-driven face image generation task uses a deep model to process audio features and image features: the driving audio and the face image are encoded separately to obtain audio coding features and face image coding features, which are then fed directly into a pre-trained face image generation model to obtain a generated face image.
However, in the above scheme, the audio features used to drive face image generation typically contain feature information of only a single dimension (for example, only content or semantic information without emotion information), or confuse the content expressed by the audio with its emotion. This hinders the network from clearly mining the correspondence between the audio and the two dimensions expressed by the generated face image, namely mouth shape and emotion, so that the generated face image has a low degree of mouth-shape matching or an insufficiently accurate emotion expression (for example, facial expression).
Disclosure of Invention
The technical problem solved by the embodiments of the invention is how to generate a face image that accurately matches the mouth shape in the driving audio and accurately expresses the emotion contained in the driving audio.
The face image generation method comprises: determining a face image generation model, the face image generation model comprising an audio content feature extraction sub-model, an audio emotion feature extraction sub-model and a diffusion sub-model; inputting driving audio into the audio content feature extraction sub-model for feature extraction to obtain audio content features, and inputting the driving audio into the audio emotion feature extraction sub-model for feature extraction to obtain audio emotion features; splicing at least based on the audio content features and the audio emotion features to obtain audio fusion features; inputting the audio fusion features and noisy reference face image features into the diffusion sub-model for denoising to obtain target complete face features, wherein the noisy reference face image features are obtained by splicing image features of the reference face image with a noise matrix; and decoding the target complete face features to obtain a complete face generated image.
Optionally, the face image generation model further comprises a key point feature extraction sub-model, and the splicing at least based on the audio content features and the audio emotion features comprises: splicing the audio content features, the audio emotion features and face key point features, wherein the face key point features are obtained by extracting key points from a face image whose lower half is occluded and inputting the extracted key points into the key point feature extraction sub-model for feature extraction.
Optionally, before the splicing at least based on the audio content features and the audio emotion features, the method further comprises: determining a first number of audio segments whose time sequence precedes the driving audio and a second number of audio segments whose time sequence follows the driving audio, recorded as first audios and second audios respectively; inputting each first audio and each second audio into the audio content feature extraction sub-model for feature extraction to obtain a plurality of corresponding first audio content features and a plurality of second audio content features; performing a weighting operation on the plurality of first audio content features, the plurality of second audio content features and the audio content features to obtain a fused audio content feature; and updating the audio content features with the fused audio content feature.
Optionally, the reference face image comprises a complete face image and a lower-half-occluded face image that come from the same speaker and carry the same emotion. Before the audio fusion features and the noisy reference face image features are input into the diffusion sub-model for denoising, the method further comprises: performing feature extraction on the lower-half-occluded face image and the complete face image respectively to obtain partial face image features and complete face image features; and splicing the partial face image features, the complete face image features and the noise matrix to obtain the noisy reference face image features.
Further, inputting the audio fusion features and the noisy reference face image features into the diffusion sub-model for denoising to obtain the target complete face features comprises: inputting the noisy reference face image features into the first-layer network of the diffusion sub-model, inputting the audio fusion features into every layer network of the diffusion sub-model, and taking the output of the last-layer network of the diffusion sub-model as the target complete face features, wherein, starting from the second-layer network of the diffusion sub-model, the input data of each layer network are the output data of the previous layer network and the audio fusion features.
Optionally, the audio emotion feature extraction sub-model comprises a pre-trained emotion classification network and an emotion feature extraction network, and inputting the driving audio into the audio emotion feature extraction sub-model for feature extraction to obtain the audio emotion features comprises: inputting the driving audio into the pre-trained emotion classification network to obtain a predicted emotion type label; encoding the predicted emotion type label to obtain an audio emotion encoding vector; and inputting the audio emotion encoding vector into the emotion feature extraction network for feature extraction to obtain the audio emotion features.
Optionally, encoding the predicted emotion type label to obtain the audio emotion encoding vector comprises: pre-encoding the predicted emotion type label based on a preset emotion encoding length to obtain multiple groups of emotion sub-encodings, each group containing two identical emotion sub-encodings; and, for each group of emotion sub-encodings, determining the sine value of one emotion sub-encoding and the cosine value of the other emotion sub-encoding, thereby determining the emotion encoding corresponding to each emotion sub-encoding, and determining the audio emotion encoding vector based on the multiple emotion encodings obtained.
Optionally, the pre-trained emotion classification network is obtained by inputting a first training data set, constructed from a plurality of first sample audios and their emotion type labels, into an initialized emotion classification network for training based on a first loss function, wherein the first loss function is represented by the following expression:
Wherein Lec denotes the function value of the first loss function, i denotes the sequence number of the first sample audio, N denotes the total number of first sample audios, yi denotes the true emotion type label of the i-th first sample audio, ŷi denotes the emotion type label predicted for the i-th first sample audio, and ln() denotes the natural logarithm (base e).
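Based on the symbol definitions above, the first loss function is consistent with a standard cross-entropy form; the following is a hedged reconstruction rather than a verbatim copy of the patent's formula (in particular, the presence of the 1/N averaging factor is an assumption):

$L_{ec} = -\frac{1}{N}\sum_{i=1}^{N} y_i \ln\left(\hat{y}_i\right)$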
Optionally, determining the face image generation model comprises: constructing a face image generation model to be trained, which comprises an audio content feature extraction sub-model to be trained, an audio emotion feature extraction sub-model to be trained and a diffusion sub-model to be trained; constructing a second training data set using a plurality of reference sample face images, a plurality of second sample audios and a sample noise matrix, wherein the reference sample face images and the second sample audios are aligned one by one in time sequence; and inputting the second training data set into the face image generation model to be trained for iterative training based on a second loss function, to obtain the face image generation model.
Optionally, the second loss function is represented by the following expression:
Wherein L' denotes the function value of the second loss function, t denotes the layer number of the network in the diffusion sub-model to be trained, M() denotes the predicted noise matrix output by the t-th layer network of the diffusion sub-model to be trained, ε denotes the sample noise matrix, ε ~ N(0, I) indicates that ε follows a normal distribution, z_t denotes one item of the input data of the t-th layer network of the diffusion sub-model to be trained, namely the output data of the (t-1)-th layer network of the diffusion sub-model to be trained, C is the other item of the input data of the t-th layer network of the diffusion sub-model to be trained, and C is the sample audio fusion feature obtained by splicing at least the sample audio content features and the sample audio emotion features; ‖ε − M(z_t, t, C)‖ denotes the Euclidean distance between the sample noise matrix and the predicted noise matrix output by the t-th layer network of the diffusion sub-model to be trained, and the expectation operator denotes taking the expected value of the squared Euclidean distance over multiple samplings.
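Based on the symbol definitions above, the second loss function matches the standard denoising-diffusion noise-prediction objective; the following is a hedged reconstruction rather than a verbatim copy of the patent's formula:

$L' = \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)}\left[\, \lVert \epsilon - M(z_t, t, C) \rVert^{2} \,\right]$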
The embodiment of the invention also provides a face image generation apparatus, comprising a model determining module, an audio feature extraction module, an audio feature splicing module, a complete face feature determining module and a face image generation module, wherein the model determining module is configured to determine a face image generation model comprising an audio content feature extraction sub-model, an audio emotion feature extraction sub-model and a diffusion sub-model; the audio feature extraction module is configured to input driving audio into the audio content feature extraction sub-model for feature extraction to obtain audio content features, and to input the driving audio into the audio emotion feature extraction sub-model for feature extraction to obtain audio emotion features; the audio feature splicing module is configured to perform splicing at least based on the audio content features and the audio emotion features to obtain audio fusion features; the complete face feature determining module is configured to input the audio fusion features and noisy reference face image features into the diffusion sub-model for denoising to obtain target complete face features, the noisy reference face image features being obtained by splicing image features of a reference face image with a noise matrix; and the face image generation module is configured to decode the target complete face features to obtain a complete face generated image.
The embodiment of the invention also provides a computer readable storage medium on which a computer program is stored, the computer program, when executed by a processor, performing the steps of the face image generation method.
The embodiment of the invention also provides a terminal which comprises a memory and a processor, wherein the memory stores a computer program capable of running on the processor, and the processor executes the steps of the face image generation method when running the computer program.
Compared with the prior art, the technical scheme of the embodiment of the invention has the following beneficial effects:
In the embodiment of the invention, feature information of two dimensions, content and emotion, is separated from the driving audio. Specifically, the driving audio is input into the audio content feature extraction sub-model for feature extraction to obtain audio content features, and into the audio emotion feature extraction sub-model for feature extraction to obtain audio emotion features; splicing is then performed at least based on the audio content features and the audio emotion features to obtain audio fusion features; the audio fusion features and noisy reference face image features are input into the diffusion sub-model for denoising to obtain target complete face features; and finally the target complete face features are decoded to obtain a complete face generated image.
The audio content features describe the content or semantics expressed by the speaker and correspond to the mouth shape or mouth movement during speech, while the audio emotion features describe the emotion during speech and correspond to the speaker's real emotion. Therefore, compared with the prior art that performs only single-dimension feature extraction on the driving audio, combining the audio content features of the semantic (or content) dimension with the audio emotion features of the emotion dimension helps obtain a generated face image that accurately matches the mouth shape contained in the driving audio, finely expresses the speaker's true emotion, and contains finer texture. Furthermore, this embodiment also introduces the image features of the reference face image on top of the combination of audio content features and audio emotion features, which helps generate a face image that conforms to the basic contour features of the face and has more stable quality.
Drawings
FIG. 1 is a flowchart of a face image generation method in an embodiment of the present invention;
FIG. 2 is a partial flow chart of another face image generation method in an embodiment of the invention;
FIG. 3 is a schematic diagram of a face image generation model according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of the architecture of a diffusion submodel in the face image generation model shown in FIG. 3;
FIG. 5 is a schematic structural diagram of a face image generating apparatus according to an embodiment of the present invention.
Detailed Description
In order to make the above objects, features and advantages of the present invention more comprehensible, embodiments accompanied with figures are described in detail below.
Referring to fig. 1, fig. 1 is a flowchart of a face image generating method according to an embodiment of the present invention. The face image generation method can be applied to a terminal with a face image generation function, and the terminal can comprise, but is not limited to, a mobile phone, a computer, a tablet personal computer, intelligent wearable equipment (for example, an intelligent watch), vehicle-mounted terminal equipment, a server, a cloud platform and the like.
The method may include steps S11 to S14:
Step S11, determining a face image generation model, wherein the face image generation model comprises an audio content feature extraction sub-model, an audio emotion feature extraction sub-model and a diffusion sub-model;
Step S12, inputting the driving audio into the audio content feature extraction sub-model to perform feature extraction to obtain audio content features, and inputting the driving audio into the audio emotion feature extraction sub-model to perform feature extraction to obtain audio emotion features;
Step S13, splicing at least based on the audio content features and the audio emotion features to obtain audio fusion features;
Step S14, inputting the audio fusion features and the noisy reference face image features into the diffusion sub-model for denoising to obtain target complete face features, wherein the noisy reference face image features are obtained by splicing the image features of the reference face image with a noise matrix;
Step S15, decoding the target complete face features to obtain a complete face generated image.
Further, the method may further comprise: performing image stitching, in time order, on a plurality of complete face generated images corresponding to multiple time-ordered segments of driving audio, to obtain a digital human video.
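As a minimal illustration of how steps S11 to S15 fit together, the following sketch assumes hypothetical PyTorch-style callables (content_encoder, emotion_encoder, diffusion_unet, decoder and their interfaces are placeholders introduced here, not the patent's actual implementation):

```python
import torch

def generate_face_image(driving_audio, ref_image_feat, content_encoder,
                        emotion_encoder, diffusion_unet, decoder):
    """Sketch of steps S11-S15; all sub-models are assumed placeholders."""
    # S12: extract content and emotion features from the same driving audio
    audio_content = content_encoder(driving_audio)   # assumed shape (B, T, Dc)
    audio_emotion = emotion_encoder(driving_audio)   # assumed shape (B, T, De)

    # S13: splice the two feature streams into an audio fusion feature
    audio_fusion = torch.cat([audio_content, audio_emotion], dim=-1)

    # S14: splice reference face image features with a noise matrix,
    # then denoise under the condition of the audio fusion feature
    noise = torch.randn_like(ref_image_feat)
    noisy_ref = torch.cat([ref_image_feat, noise], dim=-1)
    full_face_feat = diffusion_unet(noisy_ref, audio_fusion)

    # S15: decode the complete face feature into a complete face generated image
    return decoder(full_face_feat)
```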
In an implementation of step S11, the audio content feature extraction sub-model may use an existing neural network capable of extracting audio content features, where the audio content features may include text feature information and semantic feature information corresponding to the driving audio and mainly affect the mouth shape and mouth movement of the generated face image; the audio emotion feature extraction sub-model may use an existing neural network capable of extracting audio emotion features, where the audio emotion features may include prosody feature information (e.g., rhythm, tone, duration) in the audio and mainly affect the expression, texture details, etc. of the generated face image.
In the implementation of step S12, the driving audio is input into the audio content feature extraction sub-model to perform feature extraction, so as to obtain audio content features, and the driving audio is input into the audio emotion feature extraction sub-model to perform feature extraction, so as to obtain audio emotion features.
The driving audio may come from audio recorded in real time on site during a speaker's speech, from audio in a previously collected audio database, or may be extracted from video recorded in real time on site or prerecorded. The speaker may be, but is not limited to, a lecturer, a reciter, a business negotiator, or a speaker in everyday communication.
For example, audio recorded in real time on site or selected from an audio database may be segmented to obtain multiple segments of driving audio in time order. Each segment of driving audio has its own duration and corresponding text content, and typically also carries the personal emotion of the speaker. A single segment of driving audio may be used to generate a single complete face generated image. A plurality of complete face generated images corresponding to the multiple time-ordered segments of driving audio may then be stitched together in time order to obtain a face generation video (also called a digital human video).
Further, the input time interval of the driving audio covers the time interval of the single-frame complete face generated image. That is, the time interval of the single-frame complete face generated image lies within the time interval of the driving audio.
Further, a ratio of the occupied duration of the driving audio to the occupied duration of the single-frame full face generated image may be set to be greater than or equal to 5.
Taking a speaking-process video containing face images at 25 frames per second (FPS) as an example, a single frame of face image corresponds to an audio duration of 40 ms (i.e., occupies 40 ms), and the duration of the driving audio input in the embodiment of the present invention may be at least 5 times (e.g., 10 times or even tens of times) the audio duration corresponding to a single frame of face image. In this way, the driving audio contains not only the information of the audio segment aligned in time with the complete face generated image, but also the audio information preceding and following that segment.
In the embodiment of the invention, the time interval of the driving audio is set to cover the time interval of the single-frame complete face generated image (for example, the time interval of the single-frame complete face generated image may be set at the middle of the time interval of the driving audio), so that the audio information before, after and contiguous with the aligned segment is provided, which helps improve the naturalness and smoothness of the mouth shape and emotion of the complete face generated image produced by the model. Further, making the duration of the driving audio much longer than that of a single-frame complete face generated image (at least 5 times), compared with setting it to be equal to, or only 1 or 2 times, that duration, helps provide richer audio information for generating the complete face image and further improves image quality.
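For example, under the 25 FPS assumption above, the 40 ms per-frame duration and the at-least-5-times rule work out as in the small sketch below (the exact window length is a design choice and is not fixed by the patent):

```python
FPS = 25
frame_ms = 1000 / FPS                  # 40 ms of audio per generated frame
min_ratio = 5                          # driving audio duration >= 5x frame duration
min_window_ms = min_ratio * frame_ms   # 200 ms minimum driving-audio window

# e.g. a 10x window of 400 ms centred on the 40 ms frame interval leaves
# 180 ms of leading and trailing audio context on each side
window_ms = 10 * frame_ms
context_each_side_ms = (window_ms - frame_ms) / 2
print(min_window_ms, window_ms, context_each_side_ms)  # 200.0 400.0 180.0
```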
The audio emotion feature extraction sub-model may further include a pre-trained emotion classification network and an emotion feature extraction network. In step S12, inputting the driving audio into the audio emotion feature extraction sub-model for feature extraction to obtain the audio emotion features may specifically include: inputting the driving audio into the pre-trained emotion classification network to obtain a predicted emotion type label; encoding the predicted emotion type label to obtain an audio emotion encoding vector; and inputting the audio emotion encoding vector into the emotion feature extraction network for feature extraction to obtain the audio emotion features.
In implementations, emotion type tags may be used to indicate a particular emotion type. In some embodiments, the emotion type may be selected from, but is not limited to, happiness, sadness, anger, fatigue, anxiety, tension, surprise, and the like. In other embodiments, the emotion type may be selected from, but not limited to, positive emotion, neutral emotion, negative emotion.
Without limitation, a scalar or Arabic numeral may be used as an emotion type label; for example, "0" may indicate the emotion type "happy", "1" may indicate "sad", and "3" may indicate "tired".
Further, encoding the predicted emotion type label to obtain the audio emotion encoding vector includes: pre-encoding the predicted emotion type label based on a preset emotion encoding length to obtain multiple groups of emotion sub-encodings, each group containing two identical emotion sub-encodings; and, for each group of emotion sub-encodings, determining the sine value of one emotion sub-encoding and the cosine value of the other emotion sub-encoding, thereby determining the emotion encoding corresponding to each emotion sub-encoding, and determining the audio emotion encoding vector based on the multiple emotion encodings obtained.
Specifically, the following formula may be used to determine the audio emotion encoding vector:
P = [sin(2^0·πE), cos(2^0·πE), sin(2^1·πE), cos(2^1·πE), …, sin(2^(L-1)·πE), cos(2^(L-1)·πE)];
Wherein P represents the audio emotion encoding vector, E represents the predicted emotion type label, L represents the number of groups of emotion sub-encodings obtained by pre-encoding, 2^(L-1)·πE represents one emotion sub-encoding in the L-th group of emotion sub-encodings, sin(2^(L-1)·πE) represents the sine value of one emotion sub-encoding in the L-th group, cos(2^(L-1)·πE) represents the cosine value of the other emotion sub-encoding in the L-th group, and [x] represents a vector composed of the elements x.
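A minimal sketch of this encoding, assuming the label E is a scalar and L is the preset number of groups (the function name and types are illustrative, not taken from the patent):

```python
import math

def encode_emotion_label(E: float, L: int) -> list:
    """Sinusoidal encoding of a predicted emotion type label E into L groups."""
    P = []
    for l in range(L):
        arg = (2 ** l) * math.pi * E   # the l-th pair of identical sub-encodings
        P.append(math.sin(arg))        # sine value of one sub-encoding
        P.append(math.cos(arg))        # cosine value of the other sub-encoding
    return P

# e.g. label "1" (sad) with L = 4 groups yields an 8-dimensional vector
print(encode_emotion_label(1.0, 4))
```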
Further, the pre-trained emotion classification network is obtained by inputting a first training data set, constructed from a plurality of first sample audios and their emotion type labels, into an initialized emotion classification network for training based on a first loss function, where the first loss function is represented by the following expression:
Wherein Lec denotes the function value of the first loss function, i denotes the sequence number of the first sample audio, N denotes the total number of first sample audios, yi denotes the true emotion type label of the i-th first sample audio, ŷi denotes the emotion type label predicted for the i-th first sample audio, and ln() denotes the natural logarithm (base e).
In the implementation of step S13, the audio fusion feature is obtained by stitching at least based on the audio content feature and the audio emotion feature.
In a specific embodiment, the audio content feature and the audio emotion feature may be directly spliced to obtain the audio fusion feature.
In another embodiment, the face image generating model may further include a key point feature extraction sub-model, and the stitching operation in step S13 may include stitching the audio content feature, the audio emotion feature, and the face key point feature, where the face key point feature is obtained by extracting a key point from a face image with a lower half blocked, and inputting the extracted key point into the key point feature extraction sub-model to perform feature extraction.
The face image with the occluded lower half (which may be simply referred to as a masked face image) may be obtained by covering the lower half of the complete face image with a mask. The shape of the mask may be set according to actual needs; for example, it may be, but is not limited to, a semicircle, a rectangle, or a half-face shape.
It should be noted that the lower part to be blocked should at least contain the region where the mouth of the full face image is located. For example, the region of the lower half may include only the lip region, or may include the nose tip to chin region, or may include the region below the eyes to chin region.
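As an illustration only, a rectangular lower-half mask applied to an image array could look like the sketch below (the patent does not fix the mask shape; a semicircular or half-face mask would be applied analogously):

```python
import numpy as np

def mask_lower_half(face: np.ndarray, mask_value: float = 0.0) -> np.ndarray:
    """Return a copy of the face image with its lower half occluded."""
    masked = face.copy()
    h = face.shape[0]
    masked[h // 2:, ...] = mask_value  # occlude all rows from mid-height down
    return masked

# e.g. a dummy 256x256 RGB image
masked = mask_lower_half(np.random.rand(256, 256, 3))
```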
The face key points refer to key points in the regions of the complete face image other than the occluded lower-half region. Specifically, they may include contour key points and/or key points of multiple core regions within the contour, selected from, for example, but not limited to, eye contour key points, the eyeball center point, nose contour key points, the nose tip center point, the eyebrow center point, and head contour key points.
Regarding the method for extracting key points from the face image with the occluded lower half, an existing face key point extraction algorithm or model may be used, and the extracted key points are typically represented as two-dimensional or three-dimensional coordinate points.
In some embodiments, the driving audio and the lower occluded face image may originate from the same speaker and contain the same emotion. For example, a video recording can be performed for a speaking process of a speaker, an audio stream and a face image stream are extracted from the recorded video, then the audio stream and the face image stream are segmented or sampled respectively to obtain at least one section of driving audio and at least one frame of complete face image aligned in time sequence, and then the lower half part of the complete face image is shielded.
In other embodiments, the driving audio may be extracted from a video recorded during a speaking process to a speaker, and the face image blocked in the lower half may be obtained by blocking the lower half from a full face image obtained by modeling in advance, for example, the full face image may be a high-quality face image obtained by modeling in advance with higher definition and standard contour.
In the embodiment of the invention, on the basis of combining the audio content features and the audio emotion features, the face key point features extracted from the face image with the occluded lower half are further introduced. On the one hand, the face key point features contain important information such as the head contour, the contours of key regions and the face pose, so the complete face image generated by the model is more stable and standard, and the complete face images generated at successive time steps are more consistent. On the other hand, the face key point features contain no mouth or mouth-shape information of the face image, so the model concentrates on learning the mouth-shape information in the driving audio and is not interfered with by mouth or mouth-shape information in the image, making the generated complete face image match the mouth shape of the driving audio more closely.
In a specific implementation, the extracted audio content features, audio emotion features and face key point features are usually in a matrix or vector form, and the total number of rows and/or total number of columns of the matrix of each feature are usually consistent, so that the subsequent feature stitching operation is facilitated.
Without limitation, as a scheme for stitching two features, each row of elements of one feature matrix may be completely spliced to a preset position of the same row of the other feature matrix (i.e., every two rows of elements with the same row number are spliced). In this way, the information of each feature is completely retained, ensuring the integrity and effectiveness of the information contained in the audio fusion feature.
For another example, each row of elements of one feature matrix may be clipped, and then the remaining elements of the row may be spliced to the preset positions of the same row of the other feature matrix. Therefore, invalid elements can be removed, the data volume is reduced, and the subsequent operation efficiency is improved.
Wherein the preset position (i.e., splice position or insert position) may be a position after the last element of each row of the matrix being spliced, or a position before the first element of each row, or other suitable position.
As a non-limiting example, each row element of the matrix of audio emotion features may be completely stitched to a position after the last element of the same row of the matrix of audio content features according to the original order of the row elements to obtain a preliminary stitched matrix, and then each row element of the matrix of face key point features may be completely stitched to a position after the last element of the same row of the preliminary stitched matrix according to the original order of the row elements.
It should be noted that, the above-described embodiment does not limit the feature stitching manner, stitching sequence, and stitching position, and in a specific implementation, the stitching operation may be performed in other suitable manners, so long as it is beneficial to generate a complete face generated image with better quality.
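As a non-authoritative sketch of the row-wise stitching described above, assuming all three feature matrices share the same number of rows (the array shapes are illustrative):

```python
import numpy as np

def stitch_features(content: np.ndarray, emotion: np.ndarray,
                    keypoints: np.ndarray) -> np.ndarray:
    """Splice each row of the emotion and key point matrices after the last
    element of the corresponding row of the content matrix."""
    assert content.shape[0] == emotion.shape[0] == keypoints.shape[0]
    fused = np.concatenate([content, emotion], axis=1)   # preliminary stitched matrix
    fused = np.concatenate([fused, keypoints], axis=1)   # final audio fusion feature
    return fused

audio_fusion = stitch_features(np.zeros((8, 64)), np.zeros((8, 16)),
                               np.zeros((8, 32)))        # shape (8, 112)
```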
In the implementation of step S14, the audio fusion feature and the noisy reference face image feature are input into the diffusion submodel for denoising processing, so as to obtain a target complete face feature, where the noisy reference face image feature is obtained by stitching the image feature of the reference face image with a noise matrix.
Specifically, the diffusion sub-model denoises the noisy reference face image features under the condition of the audio fusion features, and the finally output target complete face features are the denoised complete face image features. The noise matrix may be a preset or randomly generated matrix, for example a Gaussian noise matrix (i.e., a noise matrix that follows a normal distribution). The diffusion sub-model may adopt an existing neural network model capable of denoising image features; for example, it may be, but is not limited to, a U-Net model comprising a multi-layer network, a fully convolutional network (FCN), or the like.
In a specific embodiment, the diffusion sub-model adopts a U-Net model comprising a multi-layer network, and step S14 specifically comprises: inputting the noisy reference face image features into the first-layer network of the diffusion sub-model, inputting the audio fusion features into every layer network of the diffusion sub-model, and taking the output of the last-layer network of the diffusion sub-model as the target complete face features, wherein, starting from the second-layer network of the diffusion sub-model, the input data of each layer network are the output data of the previous layer network and the audio fusion features.
In the embodiment of the invention, because the audio fusion features are key input data for guiding the model to generate a complete face image that accurately matches the mouth shape and expression of the driving audio, using a multi-layer network structure and feeding the audio fusion features into every layer of the diffusion sub-model, compared with using a single-layer structure or feeding the audio fusion features into the first layer only, helps the audio fusion features and the noisy reference face image features interact deeply, so that the final complete face generated image matches the mouth shape and expression contained in the driving audio more closely.
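The sketch below schematically shows feeding the audio fusion feature into every layer of a multi-layer denoising network; it is an illustrative simplification (plain linear layers, no skip connections), not the patent's actual U-Net:

```python
import torch
import torch.nn as nn

class ConditionedDenoiser(nn.Module):
    """Each layer receives the previous layer's output plus the audio fusion feature."""
    def __init__(self, feat_dim: int, cond_dim: int, num_layers: int = 4):
        super().__init__()
        self.layers = nn.ModuleList(
            [nn.Linear(feat_dim + cond_dim, feat_dim) for _ in range(num_layers)]
        )

    def forward(self, noisy_ref: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        x = noisy_ref                       # first-layer input: noisy reference features
        for layer in self.layers:
            # every layer sees its predecessor's output concatenated with the condition
            x = torch.relu(layer(torch.cat([x, cond], dim=-1)))
        return x                            # last-layer output: target complete face features
```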
In step S14, before the audio fusion features and the noisy reference face image features are input into the diffusion sub-model for denoising, feature extraction is performed on the lower-half-occluded face image and the complete face image respectively to obtain partial face image features and complete face image features, and the partial face image features, the complete face image features and the noise matrix are spliced to obtain the noisy reference face image features.
Regarding the manner of stitching the face image features and the noise matrix, specific details of feature stitching in the foregoing step S13 may be referred to, which is not described herein.
In a specific implementation, the same or different image feature extraction models may be used to perform feature extraction on the face image with the lower half blocked and the whole face image respectively.
Compared with introducing only the features of the complete face image, the embodiment of the invention combines the features of the lower-half-occluded face image with the features of the complete face image. The complete face image features contain the basic contour information and key-region information of the whole face, which reduces the randomness of the complete face image generated by the model and avoids generating face images with extreme mouth shapes or emotional states; the features of the lower-half-occluded face image contain the basic contour and key-region information of the parts other than the mouth area, which further strengthens the stability and temporal consistency of the complete face images generated by the model, and avoids interference of mouth information in the image with the model, so that the model can concentrate on learning the mouth (i.e., mouth shape and mouth movement) characteristics of the driving audio.
Regarding the manner of acquiring the lower-half-occluded face image and the complete face image, reference may be made to the corresponding content above, which is not repeated here.
The face image generation model is determined by: constructing a face image generation model to be trained, which comprises an audio content feature extraction sub-model to be trained, an audio emotion feature extraction sub-model to be trained and a diffusion sub-model to be trained; constructing a second training data set using a plurality of reference sample face images, a plurality of second sample audios and a sample noise matrix, wherein the reference sample face images and the second sample audios are aligned one by one in time sequence; and inputting the second training data set into the face image generation model to be trained for iterative training based on the second loss function, to obtain the face image generation model.
Wherein the second loss function is represented by the following expression:
Wherein L' denotes the function value of the second loss function, t denotes the layer number of the network in the diffusion sub-model to be trained, M() denotes the predicted noise matrix output by the t-th layer network of the diffusion sub-model to be trained, ε denotes the sample noise matrix, ε ~ N(0, I) indicates that ε follows a normal distribution, z_t denotes one item of the input data of the t-th layer network of the diffusion sub-model to be trained, namely the output data of the (t-1)-th layer network of the diffusion sub-model to be trained, and C is the other item of the input data of the t-th layer network of the diffusion sub-model to be trained, C being the sample audio fusion feature obtained by splicing at least the sample audio content features and the sample audio emotion features;
‖ε − M(z_t, t, C)‖ denotes the Euclidean distance between the sample noise matrix and the predicted noise matrix output by the t-th layer network of the diffusion sub-model to be trained, and the expectation operator denotes taking the expected value of the squared Euclidean distance over multiple samplings. Specifically, the number of samplings may be set in combination with the actual application scenario, which is not limited by the embodiment of the present invention.
In particular, the sample noise matrix ε may be a randomly generated Gaussian sample noise matrix, i.e., a sample noise matrix following a normal distribution. The input data of the first-layer network of the diffusion sub-model to be trained specifically includes the noisy reference sample face image features and the sample audio fusion features, where the noisy reference sample face image features are obtained by splicing the image features of the reference sample face images with the sample noise matrix.
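A hedged sketch of one training step under this objective is shown below; a standard DDPM-style noise-prediction step stands in for the patent's layer-wise formulation, and the noising coefficient alpha_bar_t as well as the interface of the model M are assumptions:

```python
import torch

def diffusion_training_step(M, clean_feat, t, C, optimizer, alpha_bar_t=0.5):
    """One step minimising E[ || eps - M(z_t, t, C) ||^2 ]."""
    eps = torch.randn_like(clean_feat)            # sample a Gaussian noise matrix
    z_t = (alpha_bar_t ** 0.5) * clean_feat + ((1 - alpha_bar_t) ** 0.5) * eps
    pred_noise = M(z_t, t, C)                     # model's predicted noise matrix
    loss = torch.mean((eps - pred_noise) ** 2)    # squared Euclidean distance
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```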
Further, the plurality of reference sample face images may include a plurality of complete sample face images and a plurality of lower half-blocked sample face images, wherein the plurality of complete sample face images and the plurality of lower half-blocked sample face images are aligned one by one in time sequence, and the complete sample face images and the lower half-blocked sample face images aligned one by one in time sequence may be from the same sample speaker and have the same emotion.
Correspondingly, the noise-carrying reference sample face image features can be obtained by splicing the image features of the sample face image with the lower part blocked, the image features of the complete sample face image and the noise matrix.
Specifically, the sample audio content features are obtained by inputting the second sample audio into the audio content feature extraction sub-model to be trained for feature extraction, and the sample audio emotion features are obtained by inputting the second sample audio into the audio emotion feature extraction sub-model to be trained for feature extraction.
Corresponding to the process of generating the complete face generated image by the face image generating model (namely, the model reasoning process), in the model training process, the splicing operation of obtaining the sample audio fusion characteristic C can adopt at least the following two embodiments.
In a specific embodiment, the sample audio content feature and the sample audio emotion feature may be directly spliced to obtain the audio fusion feature C.
In another embodiment, the face image generation model to be trained further comprises a key point feature extraction sub-model to be trained, and the splicing operation comprises splicing the sample audio content features, the sample audio emotion features and the sample face key point features, wherein the sample face key point features are obtained by inputting the extracted key points into the key point feature extraction sub-model to be trained after extracting the key points of the sample face image with the lower part being blocked.
For more details of the model training process, reference is made to the foregoing and related descriptions of the steps of the embodiment shown in fig. 1 with respect to the model reasoning process, which are not repeated here.
Referring to fig. 2, fig. 2 is a partial flowchart of another face image generation method according to an embodiment of the present invention. The other face image generation method may include steps S11 to S15 shown in fig. 1, and may further include steps S21 to S24, wherein the steps S21 to S24 may be performed before the step S13.
In step S21, a first number of audio segments whose time sequence precedes the driving audio and a second number of audio segments whose time sequence follows the driving audio are determined and recorded as first audios and second audios respectively.
In a specific implementation, the first number and the second number may be set appropriately according to the actual application scenario. The first and second amounts may be selected from suitable values in the interval [5,10], such as 8, without limitation.
In step S22, each of the first audio and the second audio is input into the audio content feature extraction sub-model to perform feature extraction, so as to obtain a plurality of corresponding first audio content features and a plurality of second audio content features.
In step S23, weighting operation is performed on the plurality of first audio content features, the plurality of second audio content features, and the audio content features, so as to obtain a fused audio content feature.
For example, each group of elements located at the same position in the first audio content features, the second audio content features and the audio content features may be weighted and summed, or averaged, to obtain a fusion element; all the fusion elements obtained constitute the fused audio content feature.
In a specific embodiment, the audio content feature extraction sub-model may include an audio content feature extraction network for extracting the audio content features from the driving audio, and may further include a time domain filter for performing a weighting operation. The weights of the time domain filter for performing the weighting operation can be learned and obtained in the training process of the face image generation model.
In another embodiment, the weighting operation may be performed using a preset weight. Wherein the weight of the audio content feature may be greater than the weights of the first and second audio content features.
In step S24, the audio content features are updated with the fused audio content features.
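A minimal sketch of steps S21 to S24, assuming simple preset weights (the patent also allows the weights to be learned as a time-domain filter during training):

```python
import numpy as np

def fuse_temporal_content(prev_feats, curr_feat, next_feats, curr_weight=0.5):
    """Weighted fusion of the current audio content feature with the content
    features of the preceding (first) and following (second) audio segments."""
    neighbors = list(prev_feats) + list(next_feats)
    neighbor_weight = (1.0 - curr_weight) / max(len(neighbors), 1)
    fused = curr_weight * curr_feat
    for feat in neighbors:                 # element-wise weighted sum
        fused = fused + neighbor_weight * feat
    return fused                           # used to update the audio content feature

# e.g. 8 preceding and 8 following segments, each feature of shape (T, D)
prev = [np.zeros((10, 64)) for _ in range(8)]
nxt = [np.zeros((10, 64)) for _ in range(8)]
fused = fuse_temporal_content(prev, np.ones((10, 64)), nxt)
```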
In the embodiment of the invention, fusing the audio content features of the preceding and following time steps into the audio content features to be spliced allows the audio content information of the surrounding time steps to be better combined when generating the current complete face generated image, so that the complete face generated images at successive time steps have smoother and more consistent mouth shapes and expression states. Further, stitching these smoother successive complete face generated images helps obtain high-quality digital human videos with more natural and consistent mouth-shape and expression transitions, improving user experience.
Referring to fig. 3 and 4, fig. 3 is a schematic diagram of a face image generation model according to an embodiment of the present invention, and fig. 4 is a schematic diagram of a diffusion sub-model in the face image generation model shown in fig. 3.
The face image generation model includes, without limitation, an audio content feature extraction sub-model, an audio emotion feature extraction sub-model, a diffusion sub-model, and may further include a key point feature extraction sub-model, and may further include a coding sub-model and a decoding sub-model.
The coding sub-model can be used to perform feature extraction on the reference face image to obtain reference face image features, where the reference face image specifically comprises a complete face image and a lower-half-occluded face image (which may be simply referred to as a masked face image) from the same speaker and with the same emotion; the decoding sub-model can be used to decode the target complete face features output by the diffusion sub-model to obtain the complete face generated image.
The diffusion sub-model adopts a U-Net model comprising a multi-layer network, input data of the diffusion sub-model at least comprises two items, wherein one item of input data is a noisy reference face image feature, the noisy reference face image feature can be obtained by splicing the reference face image feature output by the coding sub-model and a noise matrix, and the other item of input data is an audio fusion feature obtained by splicing the audio content feature output by the audio content extraction sub-model, the audio emotion feature output by the audio emotion extraction sub-model and the face key point feature output by the key point feature extraction sub-model.
Further, the noisy reference face image features are input into a first layer network of the diffusion sub-model, the audio fusion features are input into each layer network of the diffusion sub-model, and an output result of a last layer network of the diffusion sub-model is used as the target whole face features.
Regarding the functions and operation processes of the other sub-models of the face image generation model, refer specifically to the relevant contents in any of the embodiments of fig. 1 to fig. 2, and are not repeated here.
Referring to fig. 5, fig. 5 is a schematic structural diagram of a face image generating apparatus according to an embodiment of the present invention. The face image generation apparatus may include:
A model determining module 51, configured to determine a face image generation model, where the face image generation model includes an audio content feature extraction sub-model, an audio emotion feature extraction sub-model, and a diffusion sub-model;
The audio feature extraction module 52 is configured to input driving audio to the audio content feature extraction sub-model to perform feature extraction to obtain audio content features, and input the driving audio to the audio emotion feature extraction sub-model to perform feature extraction to obtain audio emotion features;
an audio feature stitching module 53, configured to stitch at least based on the audio content feature and the audio emotion feature to obtain an audio fusion feature;
The complete face feature determining module 54 is configured to input the audio fusion features and the noisy reference face image features into the diffusion sub-model for denoising to obtain target complete face features, wherein the noisy reference face image features are obtained by splicing the image features of the reference face image with a noise matrix;
And the face image generating module 55 is configured to decode the target complete face feature to obtain a complete face generated image.
Regarding the principle, specific implementation and beneficial effects of the face image generating apparatus, please refer to the foregoing and the related descriptions of the face image generating method shown in fig. 1 to 2, which are not repeated herein.
The embodiment of the invention also provides a computer readable storage medium, on which a computer program is stored, which when executed by a processor, performs the steps of the face image generating method according to any of the above embodiments. The computer readable storage medium may include non-volatile memory (non-volatile) or non-transitory memory, and may also include optical disks, mechanical hard disks, solid state disks, and the like.
Specifically, in the embodiment of the present invention, the processor may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
It should also be appreciated that the memory in the embodiments of the present application may be volatile memory or non-volatile memory, or may include both. The non-volatile memory may be read-only memory (ROM), programmable ROM (PROM), erasable programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. The volatile memory may be random access memory (RAM), which acts as an external cache. By way of example and not limitation, many forms of RAM are available, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), and direct rambus RAM (DR RAM).
The embodiment of the invention also provides a terminal, which comprises a memory and a processor, wherein the memory stores a computer program capable of running on the processor, and the processor executes the steps of the facial image generation method in any embodiment when running the computer program.
It should be understood that the term "and/or" is merely an association relationship describing the associated object, and means that three relationships may exist, for example, a and/or B, and that three cases, a alone, a and B together, and B alone, may exist. In this context, the character "/" indicates that the front and rear associated objects are an "or" relationship.
The term "plurality" as used in the embodiments of the present application means two or more.
The first, second, etc. descriptions in the embodiments of the present application are only used for illustrating and distinguishing the description objects, and no order is used, nor is the number of the devices in the embodiments of the present application limited, and no limitation on the embodiments of the present application should be construed.
It should be noted that the serial numbers of the steps in the present embodiment do not represent a limitation on the execution sequence of the steps.
Although the present invention is disclosed above, the present invention is not limited thereto. Various changes and modifications may be made by those skilled in the art without departing from the spirit and scope of the invention, and the scope of the invention shall be defined by the appended claims.

Claims (10)

Translated from Chinese
1. A face image generation method, comprising:
determining a face image generation model, wherein the face image generation model comprises an audio content feature extraction sub-model, an audio emotion feature extraction sub-model, and a diffusion sub-model;
inputting driving audio into the audio content feature extraction sub-model for feature extraction to obtain an audio content feature, and inputting the driving audio into the audio emotion feature extraction sub-model for feature extraction to obtain an audio emotion feature;
concatenating at least the audio content feature and the audio emotion feature to obtain an audio fusion feature;
inputting the audio fusion feature and a noisy reference face image feature into the diffusion sub-model for denoising to obtain a target complete face feature, wherein the noisy reference face image feature is obtained by concatenating an image feature of a reference face image with a noise matrix;
decoding the target complete face feature to obtain a generated complete face image;
wherein the diffusion sub-model adopts a U-Net model comprising a multi-layer network, and inputting the audio fusion feature and the noisy reference face image feature into the diffusion sub-model for denoising to obtain the target complete face feature comprises: inputting the noisy reference face image feature into the first-layer network of the diffusion sub-model, inputting the audio fusion feature into every layer network of the diffusion sub-model, and taking the output of the last-layer network of the diffusion sub-model as the target complete face feature, wherein, starting from the second-layer network of the diffusion sub-model, the input data of each layer network comprises the output data of the previous layer network and the audio fusion feature;
wherein determining the face image generation model comprises: constructing a face image generation model to be trained, the face image generation model to be trained comprising an audio content feature extraction sub-model to be trained, an audio emotion feature extraction sub-model to be trained, and a diffusion sub-model to be trained; constructing a second training data set from a plurality of reference sample face images, a plurality of second sample audios, and a sample noise matrix, wherein the reference sample face images and the second sample audios are aligned one-to-one in time sequence; and, based on a second loss function, inputting the second training data set into the face image generation model to be trained for iterative training to obtain the face image generation model;
wherein the second loss function is expressed as

$L_2 = \mathbb{E}\big[\,\lVert \epsilon - M(z_{t-1}, C) \rVert_2^{2}\,\big]$, with $\epsilon \sim \mathcal{N}(0, I)$,

where $L_2$ denotes the function value of the second loss function, $t$ denotes the layer index of the network in the diffusion sub-model to be trained, $M(\cdot)$ denotes the predicted noise matrix output by the $t$-th layer network of the diffusion sub-model to be trained, $\epsilon$ denotes the sample noise matrix, $\epsilon \sim \mathcal{N}(0, I)$ indicates that $\epsilon$ follows a normal distribution, $z_{t-1}$ denotes one item of the input data of the $t$-th layer network of the diffusion sub-model to be trained, namely the output data of the $(t-1)$-th layer network of the diffusion sub-model to be trained, $C$ denotes the other item of the input data of the $t$-th layer network of the diffusion sub-model to be trained and is a sample audio fusion feature obtained by concatenating at least a sample audio content feature and a sample audio emotion feature, $\lVert \epsilon - M(z_{t-1}, C) \rVert_2$ denotes the Euclidean distance between the sample noise matrix and the predicted noise matrix output by the $t$-th layer network of the diffusion sub-model to be trained, and $\mathbb{E}[\cdot]$ denotes taking the expected value of the squared Euclidean distance over multiple samples.

2. The method according to claim 1, wherein the face image generation model further comprises a key point feature extraction sub-model;
concatenating at least the audio content feature and the audio emotion feature comprises: concatenating the audio content feature, the audio emotion feature, and a face key point feature;
wherein the face key point feature is obtained by extracting key points from a face image whose lower half is occluded and inputting the extracted key points into the key point feature extraction sub-model for feature extraction.

3. The method according to claim 1 or 2, wherein, before concatenating at least the audio content feature and the audio emotion feature, the method further comprises:
determining a first number of audio segments temporally preceding the driving audio and a second number of audio segments temporally following the driving audio, denoted as first audios and second audios respectively;
inputting each first audio and each second audio into the audio content feature extraction sub-model for feature extraction to obtain a corresponding plurality of first audio content features and a plurality of second audio content features;
performing a weighted operation on the plurality of first audio content features, the plurality of second audio content features, and the audio content feature to obtain a fused audio content feature;
updating the audio content feature with the fused audio content feature.

4. The method according to claim 1, wherein the reference face image comprises a complete face image and a face image whose lower half is occluded, both from the same speaker and with the same emotion;
before inputting the audio fusion feature and the noisy reference face image feature into the diffusion sub-model for denoising, the method further comprises:
performing feature extraction on the face image whose lower half is occluded and on the complete face image respectively, to obtain a partial face image feature and a complete face image feature;
concatenating the partial face image feature, the complete face image feature, and the noise matrix to obtain the noisy reference face image feature.

5. The method according to claim 1, wherein the audio emotion feature extraction sub-model comprises a pre-trained emotion classification network and an emotion feature extraction network;
inputting the driving audio into the audio emotion feature extraction sub-model for feature extraction to obtain the audio emotion feature comprises:
inputting the driving audio into the pre-trained emotion classification network to obtain a predicted emotion type label;
encoding the predicted emotion type label to obtain an audio emotion encoding vector;
inputting the audio emotion encoding vector into the emotion feature extraction network for feature extraction to obtain the audio emotion feature.

6. The method according to claim 5, wherein encoding the predicted emotion type label to obtain the audio emotion encoding vector comprises:
pre-encoding the predicted emotion type label based on a preset emotion encoding length to obtain multiple groups of emotion sub-codes, each group of emotion sub-codes containing two identical emotion sub-codes;
for each group of emotion sub-codes, determining the sine value of one emotion sub-code and the cosine value of the other emotion sub-code, thereby determining the emotion code corresponding to each emotion sub-code, and determining the audio emotion encoding vector based on the obtained multiple emotion codes.

7. The method according to claim 5 or 6, wherein the pre-trained emotion classification network is obtained by inputting a first training data set, constructed from a plurality of first sample audios and their emotion type labels, into an initialized emotion classification network for training based on a first loss function;
the first loss function is expressed as

$L_1 = -\frac{1}{N}\sum_{i=1}^{N} y_i \ln(\hat{y}_i)$,

where $L_1$ denotes the function value of the first loss function, $i$ denotes the index of a first sample audio, $N$ denotes the total number of first sample audios, $y_i$ denotes the true emotion type label of the $i$-th first sample audio, $\hat{y}_i$ denotes the emotion type label predicted for the $i$-th first sample audio, and $\ln(\cdot)$ denotes the logarithm with base $e$.

8. A face image generation apparatus, comprising:
a model determination module, configured to determine a face image generation model, wherein the face image generation model comprises an audio content feature extraction sub-model, an audio emotion feature extraction sub-model, and a diffusion sub-model;
an audio feature extraction module, configured to input driving audio into the audio content feature extraction sub-model for feature extraction to obtain an audio content feature, and to input the driving audio into the audio emotion feature extraction sub-model for feature extraction to obtain an audio emotion feature;
an audio feature concatenation module, configured to concatenate at least the audio content feature and the audio emotion feature to obtain an audio fusion feature;
a complete face feature determination module, configured to input the audio fusion feature and a noisy reference face image feature into the diffusion sub-model for denoising to obtain a target complete face feature, wherein the noisy reference face image feature is obtained by concatenating an image feature of a reference face image with a noise matrix;
a face image generation module, configured to decode the target complete face feature to obtain a generated complete face image;
wherein the diffusion sub-model adopts a U-Net model comprising a multi-layer network, and the complete face feature determination module is further configured to: input the noisy reference face image feature into the first-layer network of the diffusion sub-model, input the audio fusion feature into every layer network of the diffusion sub-model, and take the output of the last-layer network of the diffusion sub-model as the target complete face feature, wherein, starting from the second-layer network of the diffusion sub-model, the input data of each layer network comprises the output data of the previous layer network and the audio fusion feature;
the model determination module is further configured to: construct a face image generation model to be trained, the face image generation model to be trained comprising an audio content feature extraction sub-model to be trained, an audio emotion feature extraction sub-model to be trained, and a diffusion sub-model to be trained; construct a second training data set from a plurality of reference sample face images, a plurality of second sample audios, and a sample noise matrix, wherein the reference sample face images and the second sample audios are aligned one-to-one in time sequence; and, based on a second loss function, input the second training data set into the face image generation model to be trained for iterative training to obtain the face image generation model;
wherein the second loss function is expressed as

$L_2 = \mathbb{E}\big[\,\lVert \epsilon - M(z_{t-1}, C) \rVert_2^{2}\,\big]$, with $\epsilon \sim \mathcal{N}(0, I)$,

where $L_2$ denotes the function value of the second loss function, $t$ denotes the layer index of the network in the diffusion sub-model to be trained, $M(\cdot)$ denotes the predicted noise matrix output by the $t$-th layer network of the diffusion sub-model to be trained, $\epsilon$ denotes the sample noise matrix, $\epsilon \sim \mathcal{N}(0, I)$ indicates that $\epsilon$ follows a normal distribution, $z_{t-1}$ denotes one item of the input data of the $t$-th layer network of the diffusion sub-model to be trained, namely the output data of the $(t-1)$-th layer network of the diffusion sub-model to be trained, $C$ denotes the other item of the input data of the $t$-th layer network of the diffusion sub-model to be trained and is a sample audio fusion feature obtained by concatenating at least a sample audio content feature and a sample audio emotion feature, $\lVert \epsilon - M(z_{t-1}, C) \rVert_2$ denotes the Euclidean distance between the sample noise matrix and the predicted noise matrix output by the $t$-th layer network of the diffusion sub-model to be trained, and $\mathbb{E}[\cdot]$ denotes taking the expected value of the squared Euclidean distance over multiple samples.

9. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when run by a processor, performs the steps of the face image generation method according to any one of claims 1 to 7.

10. A terminal comprising a memory and a processor, the memory storing a computer program capable of running on the processor, wherein the processor, when running the computer program, performs the steps of the face image generation method according to any one of claims 1 to 7.
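As an illustration of the layer-wise conditioning described in claims 1 and 8, the following sketch shows a toy stand-in for the U-Net diffusion sub-model: the noisy reference face feature enters only the first layer, while the audio fusion feature is injected into every layer. The module names, dimensions, and the use of simple linear blocks instead of a real U-Net are assumptions made for illustration only, not the patented implementation.

```python
import torch
import torch.nn as nn

class ConditionedBlock(nn.Module):
    """One layer of the toy stand-in: it consumes the previous layer's output
    together with the audio fusion feature and returns a new feature."""
    def __init__(self, feat_dim: int, cond_dim: int):
        super().__init__()
        self.proj = nn.Linear(feat_dim + cond_dim, feat_dim)
        self.act = nn.SiLU()

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        return self.act(self.proj(torch.cat([x, cond], dim=-1)))

class ToyDiffusionSubModel(nn.Module):
    """Conditioning pattern of claims 1 and 8: the noisy reference face feature
    enters the first layer only, the audio fusion feature is fed to every layer,
    and the last layer's output stands in for the target complete face feature."""
    def __init__(self, feat_dim: int = 512, cond_dim: int = 512, num_layers: int = 4):
        super().__init__()
        self.layers = nn.ModuleList(
            ConditionedBlock(feat_dim, cond_dim) for _ in range(num_layers)
        )

    def forward(self, noisy_ref_feat: torch.Tensor, audio_fusion_feat: torch.Tensor) -> torch.Tensor:
        x = noisy_ref_feat                      # input to the first layer
        for layer in self.layers:               # fusion feature injected at every layer
            x = layer(x, audio_fusion_feat)
        return x                                # "target complete face feature"

# noisy reference face feature = reference image feature concatenated with a noise matrix
noisy_ref = torch.cat([torch.randn(1, 256), torch.randn(1, 256)], dim=-1)
audio_fusion = torch.randn(1, 512)              # concatenated content + emotion features
out = ToyDiffusionSubModel()(noisy_ref, audio_fusion)
```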
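Claim 3 fuses the driving segment's content feature with the content features of its temporal neighbours by a weighted operation. A minimal sketch follows, assuming a simple weighted average with uniform weights by default; the claim only requires some weighted operation over the three groups of features.

```python
import numpy as np

def fuse_audio_content_features(prev_feats, curr_feat, next_feats, weights=None):
    """Weighted fusion of the driving audio's content feature with the content
    features of preceding and following audio segments (claim 3)."""
    feats = np.stack(list(prev_feats) + [curr_feat] + list(next_feats), axis=0)  # (K, D)
    if weights is None:
        weights = np.full(len(feats), 1.0 / len(feats))      # assumed uniform weights
    weights = np.asarray(weights, dtype=feats.dtype)
    return (weights[:, None] * feats).sum(axis=0)            # fused content feature

# two segments of left context, two of right context, 256-dimensional features
fused = fuse_audio_content_features(
    [np.random.randn(256) for _ in range(2)],   # first audios (before the driving audio)
    np.random.randn(256),                       # driving audio content feature
    [np.random.randn(256) for _ in range(2)],   # second audios (after the driving audio)
)
```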
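Claim 6 encodes the predicted emotion type label with paired sine/cosine sub-codes. The sketch below illustrates that idea; the geometric frequency schedule and the encoding length are assumptions borrowed from transformer-style positional encodings, since the claim only fixes that each group holds two identical sub-codes, one mapped through sine and one through cosine.

```python
import numpy as np

def encode_emotion_label(label_id: int, code_length: int = 64) -> np.ndarray:
    """Sinusoidal encoding of an emotion-class index in the spirit of claim 6.

    The label is pre-encoded into code_length // 2 groups (code_length is
    assumed even); each group holds two identical sub-codes, one of which is
    mapped through sin() and the other through cos()."""
    n_groups = code_length // 2
    freqs = 1.0 / (10000.0 ** (np.arange(n_groups) / n_groups))  # assumed schedule
    sub_codes = label_id * freqs              # one scalar sub-code per group
    encoding = np.empty(code_length, dtype=np.float32)
    encoding[0::2] = np.sin(sub_codes)        # sine of one sub-code in each group
    encoding[1::2] = np.cos(sub_codes)        # cosine of its identical twin
    return encoding

# e.g. label 3 ("happy" in a hypothetical label map) -> 64-dimensional emotion code vector
emotion_vector = encode_emotion_label(3, code_length=64)
```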
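The two training objectives can also be written compactly in code: the first loss (claim 7) is the cross-entropy over emotion labels, and the second loss (claims 1 and 8) is the expected squared Euclidean distance between the sampled noise and the noise predicted by the diffusion sub-model. The sketch below assumes batched tensors and approximates the expectation with a batch mean; the shapes in the usage lines are hypothetical.

```python
import torch
import torch.nn.functional as F

def emotion_classification_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """First loss (claim 7): L1 = -(1/N) * sum_i y_i * ln(y_hat_i), written
    here with PyTorch's built-in cross-entropy over class logits."""
    return F.cross_entropy(logits, labels)

def diffusion_noise_loss(predicted_noise: torch.Tensor, sample_noise: torch.Tensor) -> torch.Tensor:
    """Second loss (claims 1 and 8): expected squared Euclidean distance between
    the sample noise matrix and the predicted noise matrix."""
    sq_dist = (sample_noise - predicted_noise).flatten(start_dim=1).pow(2).sum(dim=1)
    return sq_dist.mean()   # expectation approximated by the batch mean

# hypothetical shapes: 8 audio clips over 6 emotion classes; 8 noise matrices of 4x64x64
l1 = emotion_classification_loss(torch.randn(8, 6), torch.randint(0, 6, (8,)))
l2 = diffusion_noise_loss(torch.randn(8, 4, 64, 64), torch.randn(8, 4, 64, 64))
```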
CN202311762681.0A | 2023-12-19 | 2023-12-19 | Face image generation method and device, computer readable storage medium, terminal | Active | CN118015110B (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202311762681.0A / CN118015110B (en) | 2023-12-19 | 2023-12-19 | Face image generation method and device, computer readable storage medium, terminal

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202311762681.0A / CN118015110B (en) | 2023-12-19 | 2023-12-19 | Face image generation method and device, computer readable storage medium, terminal

Publications (2)

Publication Number | Publication Date
CN118015110A (en) | 2024-05-10
CN118015110B (en) | 2025-01-14

Family

ID=90946879

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202311762681.0A (Active) / CN118015110B (en) | Face image generation method and device, computer readable storage medium, terminal | 2023-12-19 | 2023-12-19

Country Status (1)

Country | Link
CN (1) | CN118015110B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN118644596B (en)* | 2024-08-15 | 2024-12-27 | Tencent Technology (Shenzhen) Co., Ltd. | Face key point moving image generation method and related equipment

Citations (2)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN116129004A (en)* | 2023-02-17 | 2023-05-16 | Huayuan Computing Technology (Shanghai) Co., Ltd. | Digital person generating method and device, computer readable storage medium and terminal
CN116994307A (en)* | 2022-09-28 | 2023-11-03 | Tencent Technology (Shenzhen) Co., Ltd. | Video generation method, device, equipment, storage medium and product

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN114282059B (en)* | 2021-08-24 | 2025-01-10 | Tencent Technology (Shenzhen) Co., Ltd. | Video retrieval method, device, equipment and storage medium
CN116188634A (en)* | 2022-07-13 | 2023-05-30 | Mashang Consumer Finance Co., Ltd. | Face image prediction method, model, device, equipment and medium
CN116664731B (en)* | 2023-06-21 | 2024-03-29 | Huayuan Computing Technology (Shanghai) Co., Ltd. | Face animation generation method and device, computer readable storage medium and terminal
CN117218224B (en)* | 2023-08-21 | 2024-09-03 | Huayuan Computing Technology (Shanghai) Co., Ltd. | Face emotion image generation method and device, readable storage medium and terminal
CN117237760A (en)* | 2023-09-27 | 2023-12-15 | Tencent Technology (Shenzhen) Co., Ltd. | Training method, device, equipment and storage medium for face image generation model


Also Published As

Publication number | Publication date
CN118015110A (en) | 2024-05-10

Similar Documents

Publication | Title
CN110457994B (en) | Face image generation method and device, storage medium and computer equipment
CN111145282B (en) | Avatar composition method, apparatus, electronic device, and storage medium
Tits et al. | Exploring transfer learning for low resource emotional tts
CN113228163B (en) | Real-time text and audio based face rendering
US20220101121A1 | Latent-variable generative model with a noise contrastive prior
WO2023284435A1 | Method and apparatus for generating animation
Ma et al. | Unpaired image-to-speech synthesis with multimodal information bottleneck
CN117765950B (en) | Face generation method and device
Wang et al. | Comic-guided speech synthesis
CN117036555B (en) | Digital person generation method and device and digital person generation system
CN115883753A (en) | Video generation method and device, computing equipment and storage medium
CN118015110B (en) | Face image generation method and device, computer readable storage medium, terminal
CN112330780A (en) | Method and system for generating animation expression of target character
CN115858726A (en) | A multi-stage multi-modal sentiment analysis method based on mutual information representation
CN115712739B (en) | Dance movement generation method, computer equipment and storage medium
CN120051803A (en) | Text-driven image editing via image-specific fine tuning of diffusion models
CN116524570A (en) | A method and system for automatically editing hairstyles for ID photos
Dodić et al. | The picture world of the future: AI text-to-image as a new era of visual content creation
Chen et al. | VAST: vivify your talking avatar via zero-shot expressive facial style transfer
CN119785820A (en) | Digital human audio and video generation method, system, device and medium
Dehghani et al. | Generating and detecting various types of fake image and audio content: A review of modern deep learning technologies and tools
CN118413722B (en) | Audio drive video generation method, device, computer equipment and storage medium
CN117078816A (en) | Virtual image generation method, device, terminal equipment and storage medium
CN116542292A (en) | Training method, device, equipment and storage medium of image generation model
Ma et al. | M3D-GAN: Multi-modal multi-domain translation with universal attention

Legal Events

Code | Title
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant
