
Virtual image synthesis method and device, electronic equipment and storage medium

Info

Publication number: CN111145282A
Application number: CN201911274701.3A
Authority: CN (China)
Prior art keywords: expression, frame, features, avatar, video
Legal status: Granted (status listed by Google Patents is an assumption, not a legal conclusion)
Other languages: Chinese (zh)
Other versions: CN111145282B
Inventors: 左童春, 何山, 胡金水, 刘聪
Current Assignee: iFlytek Co Ltd
Original Assignee: iFlytek Co Ltd
Application filed by iFlytek Co Ltd
Priority to CN201911274701.3A
Publication of CN111145282A
Application granted
Publication of CN111145282B
Current status: Active


Abstract

Embodiments of the invention provide an avatar synthesis method and apparatus, an electronic device, and a storage medium. The method includes: determining relevant features of voice data, the relevant features characterizing information related to the speaker's expression contained in the voice data; and inputting image data and the relevant features into an expression synthesis model to obtain an avatar video output by the expression synthesis model, where the avatar in the avatar video is configured with an expression corresponding to the voice data. The expression synthesis model is trained based on sample speaker videos, the relevant features of the sample voice data corresponding to those videos, and sample image data. The method, apparatus, electronic device, and storage medium provided by the embodiments of the invention enable the avatar's expression to fit the voice data more closely, making it more natural and realistic.

Description

Virtual image synthesis method and device, electronic equipment and storage medium
Technical Field
The present invention relates to the field of image processing technologies, and in particular, to a method and an apparatus for synthesizing an avatar, an electronic device, and a storage medium.
Background
In recent years, with the continuous progress of computer voice synthesis and video synthesis technologies, various voice-driven avatar synthesis technologies have been developed. The avatar may perform tasks such as news broadcasting, weather forecasting, game commentary, providing ordering services, etc.
When performing these tasks, most avatars only synthesize mouth shapes matched to the output voice while keeping a neutral expression, or select from a few preset basic expressions according to the voice content. As a result, when the synthesized avatar outputs voice, its expression is often neither vivid nor natural, and the user experience is poor.
Disclosure of Invention
The embodiment of the invention provides a method and a device for synthesizing an avatar, electronic equipment and a storage medium, which are used for solving the problem that the corresponding expression of the existing avatar is not vivid and natural when voice is output.
In a first aspect, an embodiment of the present invention provides an avatar synthesis method, including:
determining relevant characteristics of the voice data; the related features are used for representing features related to the expression of a speaker contained in the voice data;
inputting image data and the relevant characteristics into an expression synthesis model to obtain an avatar video output by the expression synthesis model, wherein an avatar in the avatar video is configured with an expression corresponding to the voice data;
the expression synthesis model is obtained by training based on sample speaker videos, relevant features of sample voice data corresponding to the sample speaker videos and sample image data.
Preferably, the inputting the image data and the related features into an expression synthesis model to obtain an avatar video output by the expression synthesis model specifically includes:
inputting image data and relevant features respectively corresponding to any frame into a feature extraction layer of the expression synthesis model to obtain frame features output by the feature extraction layer;
and inputting the frame characteristics to an expression prediction layer of the expression synthesis model to obtain a virtual expression graph of any frame output by the expression prediction layer.
Preferably, the inputting the image data and the relevant features respectively corresponding to any frame into the feature extraction layer of the expression synthesis model to obtain the frame features output by the feature extraction layer specifically includes:
inputting image data and relevant features respectively corresponding to any frame into a current feature extraction layer of the feature extraction layer to obtain current features output by the current feature extraction layer;
and inputting the virtual expression maps of preset frames preceding the frame into a pre-frame feature extraction layer of the feature extraction layer to obtain pre-frame features output by the pre-frame feature extraction layer.
Preferably, the inputting the frame characteristics into an expression prediction layer of the expression synthesis model to obtain a virtual expression map of any frame output by the expression prediction layer specifically includes:
and fusing the current features and the pre-frame features and inputting the fused features into the expression prediction layer to obtain the virtual expression map of any frame output by the expression prediction layer.
Preferably, the fusing the current feature and the pre-frame feature and inputting the fused current feature and the fused pre-frame feature to the expression prediction layer to obtain the virtual expression map of any frame output by the expression prediction layer specifically includes:
fusing the current features and the pre-frame features and inputting the fused features into a candidate expression prediction layer of the expression prediction layer to obtain a candidate expression map output by the candidate expression prediction layer;
fusing the current feature and the pre-frame feature, and inputting the fused current feature and the pre-frame feature into an optical flow prediction layer of the expression prediction layer to obtain optical flow information output by the optical flow prediction layer;
and inputting the candidate expression graph and the optical flow information into a fusion layer in the expression prediction layer to obtain the virtual expression graph of any frame output by the fusion layer.
Preferably, the expression synthesis model is obtained by training together with a discriminator, based on sample speaker videos, relevant features of the sample voice data corresponding to the sample speaker videos, and sample image data, and the expression synthesis model and the discriminator form a generative adversarial network.
Preferably, the discriminator comprises an image discriminator and/or a video discriminator;
the image discriminator is used for judging the synthetic reality of any frame of virtual expression graph in the virtual image video, and the video discriminator is used for judging the synthetic reality of the virtual image video.
Preferably, the relevant features comprise language-dependent features, and emotional and/or speaker identity features.
Preferably, the image data is determined based on the speaker identity features.
Preferably, the expressions of the avatar configuration in the avatar video corresponding to the voice data include facial expressions and neck expressions.
In a second aspect, an embodiment of the present invention provides an avatar synthesis apparatus, including:
a relevant feature determination unit for determining relevant features of the voice data; the related features are used for representing features related to the expression of a speaker contained in the voice data;
the expression synthesis unit is used for inputting image data and the related characteristics into an expression synthesis model to obtain an avatar video output by the expression synthesis model, and an avatar in the avatar video is configured with an expression corresponding to the voice data;
the expression synthesis model is obtained by training based on sample speaker videos, relevant features of the sample voice data corresponding to the sample speaker videos, and sample image data.
In a third aspect, an embodiment of the present invention provides an electronic device, including a processor, a communication interface, a memory, and a bus, where the processor, the communication interface, and the memory communicate with one another through the bus, and the processor may call logic instructions in the memory to perform the steps of the method provided in the first aspect.
In a fourth aspect, an embodiment of the present invention provides a non-transitory computer readable storage medium, on which a computer program is stored, which when executed by a processor, implements the steps of the method as provided in the first aspect.
The method, the device, the electronic equipment and the storage medium for synthesizing the virtual image, provided by the embodiment of the invention, apply the relevant characteristics containing rich expression relevant information to synthesize the expression of the virtual image, so that the expression of the virtual image can better fit with the voice data and is more natural and real. In addition, in the virtual image video generated by the expression synthesis model, the expressions of the virtual image exist in an integral form, and compared with a mode of independently modeling each region for executing the expressions in the virtual image, the linkage problem of the muscles in each region can be effectively solved by aiming at the integral modeling of the expressions, so that the muscle linkage of each region is more natural and vivid.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
Fig. 1 is a schematic flow chart of a virtual image synthesis method according to an embodiment of the present invention;
fig. 2 is a schematic flow chart of an expression synthesis method according to an embodiment of the present invention;
fig. 3 is a schematic flow chart of a feature extraction method according to an embodiment of the present invention;
fig. 4 is a schematic flowchart of an expression prediction method according to an embodiment of the present invention;
fig. 5 is a schematic structural diagram of an expression synthesis model according to an embodiment of the present invention;
FIG. 6 is a schematic flow chart of a method for synthesizing an avatar according to another embodiment of the present invention;
FIG. 7 is a schematic structural diagram of an avatar synthesis apparatus according to an embodiment of the present invention;
fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In the prior art, the synthesis technology for virtual images can be mainly classified into the following three categories:
the first category, based on voice-driven avatar synthesis techniques: and independently applying language information and expression information acquired from the voice to the finally synthesized video. In the method, only a plurality of basic expressions are considered, the synthesized virtual image is more stuttery, only a plurality of predefined basic expressions can be made, and the problems that the lip and the eyebrow are not matched with the eyebrow, the throat, the cheek and the like are solved. On one hand, the above problem is that rich emotion cannot be expressed individually because the mouth shape opening and closing are determined only according to the pronunciation characteristics of the voice content, and different emotions, differences among different people and physiological linkage among human face muscle blocks are not considered. On the other hand, because the method can only select two from several or dozens of fixed expressions to be superposed into the synthesized video, rich facial expressions cannot be synthesized.
The second category is avatar synthesis based on expression migration: the facial expression, mouth shape, and rigid motion of a driving character are transferred to the avatar. Videos synthesized in this way are more vivid, but the approach depends on a real performer and cannot be synthesized offline.
The third category realizes avatar expression synthesis by modeling each part of the human face independently. It requires an artist to design the motion of the whole face based on physiological and aesthetic expertise and to synthesize the state of each part frame by frame for the video to be edited, which not only demands strong professional knowledge but is also time-consuming and labor-intensive.
Anatomically, the human face has 42 muscles, which can produce abundant expressions and accurately convey various moods and emotions. The flexing of these muscles is not independent; it is strongly correlated. For example, when a person speaks in a calm state, the muscles of the lips and chin stretch, whereas when the same sentence is spoken in emotional agitation, the muscles of the forehead and cheeks also stretch, and the stretching of the lip and chin muscles is clearly stronger than in the calm state. In addition, human expressions number in the thousands, while existing methods provide only a few or a few dozen preset expressions, whose expressiveness is neither fine-grained nor personalized. Therefore, how to automatically synthesize avatar expressions that are more realistic and natural remains an urgent problem for those skilled in the art.
Therefore, the embodiment of the invention provides a virtual image synthesis method. Fig. 1 is a schematic flow chart of an avatar synthesis method according to an embodiment of the present invention, as shown in fig. 1, the method includes:
step 110, determining relevant characteristics of voice data; the related features are used for characterizing features related to the expression of the speaker contained in the voice data.
Specifically, the voice data is voice data for synthesizing an avatar, where the avatar may be a virtual character, a virtual cartoon image, an animal image, or the like, and the embodiment of the present invention does not limit this. The voice data may be voice data of a speaker speaking acquired by the radio device, or may be intercepted from voice data obtained through a network or the like, which is not specifically limited in the embodiment of the present invention.
The related features are features related to the speaker's expression obtained by analyzing the voice data. For example, the language-related features in the voice data correspond to different pronunciations, and different pronunciations require the speaker to mobilize facial muscles to form different mouth shapes. The emotion features in the voice data matter because, when the speaker says the same content in different moods, the movements of the facial muscles, including the mouth shape, and of the neck muscles differ. Scene features in the voice data also play a role, since the speaking scene can influence the speaker's facial expression: in a noisy environment the speaker may speak loudly and the facial expression is relatively exaggerated, whereas in a quiet environment the speaker may speak softly and the facial expression is relatively subtle. Finally, the speaker identity features in the voice data matter because different speakers express themselves differently; for instance, a presenter hosting a children's program may use gentle, friendly expressions when talking, while a presenter hosting a comedy program may use exaggerated expressions.
Step 120, inputting the image data and the related characteristics into an expression synthesis model to obtain an avatar video output by the expression synthesis model, wherein an avatar in the avatar video is configured with an expression corresponding to the voice data; the expression synthesis model is obtained by training based on the sample speaker video, the relevant characteristics of the sample voice data corresponding to the sample speaker video and the sample image data.
Specifically, the image data is the image data used for avatar synthesis; the avatar corresponding to the image data may be the image of the speaker of the voice data or an image unrelated to that speaker, which is not specifically limited in this embodiment of the present invention. The image data includes a texture map and expression mask maps: the texture map is an image of the avatar itself, including the regions of the avatar that perform expressions, while an expression mask map is an image of the avatar in which the expression-performing regions are masked out. One expression mask map may be provided per frame, or several frames may share one expression mask map.
The expression synthesis model is used for analyzing the expression of the avatar based on the image data and the relevant features, and synthesizing the avatar's expression to obtain an avatar video configured with the expression corresponding to the voice data. Before step 120 is executed, the expression synthesis model may be obtained through pre-training, specifically as follows: first, a large number of sample speaker videos and the sample voice data corresponding to them are collected, and sample image data is extracted from the sample speaker videos while relevant features are extracted from the sample voice data. Here, the sample speaker videos are videos of real human speakers. Then, an initial model is trained based on the sample speaker videos, the relevant features of the sample voice data corresponding to the sample speaker videos, and the sample image data, yielding the expression synthesis model.
The method provided by the embodiment of the invention carries out the expression synthesis of the virtual image by using the relevant characteristics containing abundant expression relevant information, so that the virtual image expression can better fit with the voice data and is more natural and real. In addition, in the virtual image video generated by the expression synthesis model, the expressions of the virtual image exist in an integral form, and compared with a mode of independently modeling each region for executing the expressions in the virtual image, the linkage problem of the muscles in each region can be effectively solved by aiming at the integral modeling of the expressions, so that the muscle linkage of each region is more natural and vivid.
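As a purely illustrative sketch, the two-step flow of steps 110 and 120 might look as follows; the function and parameter names (extract_related_features, expression_model, and so on) are hypothetical placeholders, since the patent does not prescribe a concrete API.

```python
# Minimal sketch of steps 110-120, assuming a PyTorch-style model interface.
# All identifiers here are illustrative, not taken from the patent.
import torch

def synthesize_avatar(voice_data, image_data, extract_related_features, expression_model):
    # Step 110: derive per-frame expression-related features (language,
    # emotion, speaker identity) from the voice data.
    related_features = extract_related_features(voice_data)      # e.g. [N, 512]

    # Step 120: feed the image data (texture map + per-frame expression mask
    # maps) and the related features into the expression synthesis model to
    # obtain one virtual expression map per frame.
    with torch.no_grad():
        frames = expression_model(image_data, related_features)  # [N, 3, H, W]
    return frames   # the frame sequence constitutes the avatar video
```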
Based on the above embodiment, the expression synthesis model includes a feature extraction layer and an expression prediction layer. Fig. 2 is a schematic flow chart of an expression synthesis method according to an embodiment of the present invention, and as shown in fig. 2, step 120 specifically includes:
and step 121, inputting the image data and the relevant characteristics respectively corresponding to any frame into a characteristic extraction layer of the expression synthesis model to obtain the frame characteristics output by the characteristic extraction layer.
Specifically, the voice data may be divided into a plurality of frames of voice data, and for each frame of voice data there is a corresponding relevant feature. Similarly, in the image data, the same texture map may correspond to every frame to represent the appearance of the avatar in the avatar video, while different expression mask maps may correspond to different frames to represent the actions of the avatar, particularly the head actions, in different frames of the avatar video.
In the expression synthesis model, the feature extraction layer is used for extracting frame features of any frame from image data and related features respectively corresponding to the frame. The frame feature may be an image feature of the frame and an expression-related feature of the frame, and may further include a fusion feature of the image feature and the expression-related feature of the frame, which is not specifically limited in this embodiment of the present invention.
And step 122, inputting the frame characteristics into an expression prediction layer of the expression synthesis model to obtain a virtual expression graph of the frame output by the expression prediction layer.
Specifically, in the expression synthesis model, the expression prediction layer is used for predicting the virtual expression map of any frame based on the frame features of that frame. Here, the virtual expression map is a frame of image containing the avatar configured with an expression corresponding to that frame of voice data, and the position, motion, etc. of the avatar are consistent with the image data corresponding to the frame. The virtual expression maps of all frames together form the avatar video.
According to the method provided by the embodiment of the invention, the frame characteristics of any frame are obtained, the virtual expression graph of the frame is obtained based on the frame characteristics, the virtual image video is finally obtained, and the overall naturalness and fidelity of the virtual image video are improved by improving the naturalness and fidelity of each frame of virtual expression graph.
Based on any embodiment, the feature extraction layer comprises a current feature extraction layer and a pre-frame feature extraction layer; fig. 3 is a schematic flow chart of the feature extraction method according to the embodiment of the present invention, and as shown in fig. 3, step 121 specifically includes:
step 1211, inputting the image data and the relevant features corresponding to each frame to a current feature extraction layer of the feature extraction layer to obtain current features output by the current feature extraction layer.
Step 1212, inputting the virtual expression map of the preset frame before the frame into a before-frame feature extraction layer of the feature extraction layer, so as to obtain before-frame features output by the before-frame feature extraction layer.
Specifically, the frame features of any frame include a current feature and a pre-frame feature, wherein the current feature is obtained by feature extraction of image data and related features respectively corresponding to the frame through a current feature extraction layer, and the current feature is used for reflecting the features of the frame in the aspect of an avatar, especially the expression of the avatar; the pre-frame features are obtained by extracting features of a virtual expression map of a preset frame before the frame through a pre-frame feature extraction layer, and the pre-frame features are used for reflecting virtual images in the virtual expression map of the preset frame before the frame, especially features in the aspect of virtual image expressions.
Here, the preset frames preceding any frame may be several frames immediately before that frame. For example, if the frame is the n-th frame, its preceding preset frames may be the two frames before it, namely the (n-2)-th and (n-1)-th frames.
Based on any of the above embodiments, step 122 specifically includes: and fusing the current characteristic and the characteristics before the frame and inputting the fused current characteristic and the fused characteristics before the frame into the expression prediction layer to obtain the virtual expression graph of the frame output by the expression prediction layer.
In the embodiment of the invention, the current characteristic and the characteristic before the frame are used as the frame characteristic of any frame for expression prediction, so that the synthesized virtual image expression not only can be naturally matched with the voice data corresponding to the frame, but also can realize the natural transition of the frame virtual image expression and the virtual image expressions of the previous frames, and further improve the reality and naturalness of the virtual image video.
Based on any embodiment, the expression prediction layer comprises a candidate expression prediction layer, an optical flow prediction layer and a fusion layer; fig. 4 is a schematic flow chart of an expression prediction method according to an embodiment of the present invention, and as shown in fig. 4, step 122 specifically includes:
and 1221, fusing the current features and the features before the frame, and inputting the fused current features and features before the frame into a candidate expression prediction layer of the expression prediction layer to obtain a candidate expression graph output by the candidate expression prediction layer.
Here, the candidate expression prediction layer is configured to predict the avatar expression of any frame based on the current feature and the pre-frame feature corresponding to the frame, and output a candidate expression map of the frame. The candidate expression map of the frame is a virtual expression map in which the avatar is configured with an expression corresponding to that frame of voice data.
And 1222, fusing the current feature and the pre-frame feature, and inputting the fused current feature and pre-frame feature into an optical flow prediction layer of the expression prediction layer to obtain optical flow information output by the optical flow prediction layer.
Here, the optical flow prediction layer is configured to predict the optical flow between the previous frame and the current frame based on the current feature and the pre-frame feature corresponding to any frame, and output the optical flow information of the frame. The optical flow information of the frame may include the predicted optical flow between the previous frame and the frame, and may further include a weight used for fusing the optical-flow-warped previous frame with the candidate expression map.
And step 1223, inputting the candidate expression graph and the optical flow information into a fusion layer in the expression prediction layer to obtain the virtual expression graph of the frame output by the fusion layer.
Here, the fusion layer is configured to fuse the candidate expression map of any frame with the optical flow information to obtain the virtual expression map of the frame. For example, the fusion layer may directly superimpose the candidate expression map and the previous frame's virtual expression map warped by the predicted optical flow, or it may weight and superimpose them based on the predicted weights to obtain the virtual expression map, which is not specifically limited in this embodiment of the present invention.
According to the method provided by the embodiment of the invention, the optical flow prediction is carried out through the current characteristic and the characteristic before the frame, and the optical flow information is applied to the generation of the virtual expression graph, so that the muscle movement of each region of the executed expression of the virtual image in the virtual image video is more natural.
Based on any of the above embodiments, fig. 5 is a schematic structural diagram of an expression synthesis model provided in an embodiment of the present invention, and in fig. 5, the expression synthesis model includes a current feature extraction layer, a feature extraction layer before a frame, a candidate expression prediction layer, an optical flow prediction layer, and a fusion layer.
The current feature extraction layer is used for obtaining the current features of any frame based on the image data and the relevant features respectively corresponding to the frame.
Assuming the relevant feature of the voice data is M, the hidden-layer feature HT of the relevant feature can be obtained by feeding M into a long short-term memory network (LSTM); the hidden-layer features corresponding to each frame are denoted HT(0), HT(1), …, HT(t), …, HT(N-1), where HT(t) is the hidden-layer feature of the relevant feature corresponding to the t-th frame and N is the total number of frames of the image data. The image data corresponding to the t-th frame comprises I(0) and I_m(t), where I(0) denotes the texture map and I_m(t) denotes the expression mask map corresponding to the t-th frame.
In FIG. 5, in the current feature extraction layer, I(0) and I_m(t) pass through a first convolution layer (kernel=3, stride=2, channel_out=64), a second convolution layer (kernel=3, stride=2, channel_out=128), a third convolution layer (kernel=3, stride=2, channel_out=256) and a fourth convolution layer (kernel=3, stride=2, channel_out=512) to give a 512-dimensional feature map, followed by 5 ResBlock layers (kernel=3, stride=1, channel_out=512). In this process, the hidden-layer feature HT(t) of the relevant feature is expanded and embedded into the second, third and fourth convolution layers and added to their convolution results, realizing the fusion of the relevant feature and the image data and yielding the current feature CFT(t) of the t-th frame.
Note that in the current feature extraction layer, when HT(t) is added to the convolution result FT(t) of I(0) and I_m(t), HT(t) is superimposed only on the mask region of FT(t) and not on the non-mask region, where the mask region covers the regions of the avatar that perform expression. In this way, the expression-related features are superimposed only on the regions that need to perform expression, while the regions that do not need to perform expression keep the original avatar, which is expressed by the following formula:
CFT(t) = f(I(0), I_m(t), HT(t); θ)
where θ denotes the relevant parameters of the current feature extraction layer.
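A rough PyTorch sketch of the current feature extraction layer described above is given below. The channel widths and the idea of adding HT(t) to the outputs of the second to fourth convolutions only on the mask region follow the text; the input layout (texture map concatenated with a single-channel binary region mask), the linear projections of HT(t), and the activation functions are assumptions made only for illustration.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """3x3, stride-1 residual block with a fixed channel width (a guess at the
    text's 'ResBlock (kernel=3, stride=1, channel_out=512)')."""
    def __init__(self, ch=512):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, 1, 1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, 1, 1))
    def forward(self, x):
        return torch.relu(x + self.body(x))

class CurrentFeatureExtractor(nn.Module):
    """Sketch of the current feature extraction layer: four stride-2 3x3
    convolutions (64/128/256/512 channels) plus five ResBlocks, with HT(t)
    added to the outputs of conv2-conv4 on the mask region only."""
    def __init__(self, in_ch=4, ht_dim=512):
        super().__init__()
        chs = [64, 128, 256, 512]
        self.convs = nn.ModuleList()
        prev = in_ch                      # texture map (3ch) + binary region mask (1ch), an assumption
        for c in chs:
            self.convs.append(nn.Conv2d(prev, c, 3, 2, 1))
            prev = c
        # project HT(t) to the channel width of conv2..conv4 before adding it
        self.ht_proj = nn.ModuleList([nn.Linear(ht_dim, c) for c in chs[1:]])
        self.resblocks = nn.Sequential(*[ResBlock(512) for _ in range(5)])

    def forward(self, texture, region_mask, ht):
        # region_mask: 1 where the avatar performs expression, 0 elsewhere
        x = torch.cat([texture, region_mask], dim=1)
        m = region_mask
        for i, conv in enumerate(self.convs):
            x = torch.relu(conv(x))
            m = nn.functional.interpolate(m, size=x.shape[-2:])  # track mask resolution
            if i >= 1:                    # embed HT(t) into conv2..conv4 outputs
                h = self.ht_proj[i - 1](ht)[:, :, None, None]
                x = x + h * m             # superimpose only on the mask region
        return self.resblocks(x)          # CFT(t), a 512-channel feature map
```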
The pre-frame feature extraction layer is used for obtaining the pre-frame features of the frame based on the virtual expression graph of any pre-frame.
Assume the preset frames before the frame are the two preceding frames, i.e. the (t-1)-th and (t-2)-th frames, whose virtual expression maps are Fake(t-1) and Fake(t-2). In the pre-frame feature extraction layer, Fake(t-1) and Fake(t-2) are fed into a 4-layer convolutional network (kernel=3, stride=2, channel_out=64,128,256,512), and a 512-dimensional feature map, namely the pre-frame feature PFT(t), is then obtained through a 5th-layer ResBlock (kernel=3, stride=1, channel_out=512).
This yields the frame feature of the t-th frame, denoted CFT(t) + PFT(t).
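Continuing the sketch, one possible form of the pre-frame feature extraction layer and of the frame feature is shown below; beyond the stated hyperparameters (kernel 3, stride 2, channels 64/128/256/512, one ResBlock), the details are assumptions, and ResBlock is the helper defined in the previous sketch.

```python
import torch
import torch.nn as nn

class PreFrameFeatureExtractor(nn.Module):
    """Sketch: Fake(t-1) and Fake(t-2) are concatenated on channels and passed
    through four stride-2 3x3 convolutions (64/128/256/512) and one ResBlock
    to give the pre-frame feature PFT(t)."""
    def __init__(self, in_ch=6):              # two RGB frames, an assumption
        super().__init__()
        layers, prev = [], in_ch
        for c in (64, 128, 256, 512):
            layers += [nn.Conv2d(prev, c, 3, 2, 1), nn.ReLU(inplace=True)]
            prev = c
        layers.append(ResBlock(512))           # ResBlock from the sketch above
        self.net = nn.Sequential(*layers)

    def forward(self, fake_tm1, fake_tm2):
        return self.net(torch.cat([fake_tm1, fake_tm2], dim=1))   # PFT(t)

# The frame feature of the t-th frame is then the element-wise sum:
# frame_feature = cft_t + pft_t              # CFT(t) + PFT(t)
```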
The candidate expression prediction layer is used for determining the corresponding candidate expression map from the input frame features. In the candidate expression prediction layer, the frame feature CFT(t) + PFT(t) passes through 4 ResBlock layers (kernel=3, stride=1, channel_out=512) and 4 upsampling layers (kernel=3, stride=2, channel_out=256,128,64,1) to obtain the candidate expression map of the t-th frame, denoted S(t). Formulated as:
S(t) = g(CFT(t) + PFT(t); φ)
where φ denotes the parameters of the candidate expression prediction layer.
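A corresponding sketch of this decoder follows; the same structure, with a 3-channel output, can serve as the optical flow prediction layer described next. Output activations are omitted because the text does not specify them, and ResBlock is again the helper from the earlier sketch.

```python
import torch.nn as nn

class ExpressionDecoder(nn.Module):
    """Sketch of the decoder used for both prediction layers: four ResBlocks
    followed by four stride-2 upsampling layers. out_channels=(256,128,64,1)
    gives the candidate expression map S(t); out_channels=(256,128,64,3)
    gives the optical flow F(t-1) (2 channels) plus the weight map W(t)."""
    def __init__(self, out_channels=(256, 128, 64, 1)):
        super().__init__()
        blocks = [ResBlock(512) for _ in range(4)]   # ResBlock from the earlier sketch
        ups, prev = [], 512
        for i, c in enumerate(out_channels):
            ups.append(nn.ConvTranspose2d(prev, c, 3, 2, padding=1, output_padding=1))
            if i < len(out_channels) - 1:
                ups.append(nn.ReLU(inplace=True))
            prev = c
        self.net = nn.Sequential(*blocks, *ups)

    def forward(self, frame_feature):                # CFT(t) + PFT(t)
        return self.net(frame_feature)
```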
The optical flow prediction layer is used for predicting the optical flow between the previous frame and the current frame from the input frame features and outputting the optical flow information of the frame. In the optical flow prediction layer, the frame feature CFT(t) + PFT(t) passes through 4 ResBlock layers (kernel=3, stride=1, channel_out=512) and 4 upsampling layers (kernel=3, stride=2, channel_out=256,128,64,3) to obtain the optical flow F(t-1) between the previous frame's virtual expression map Fake(t-1) and the current frame's virtual expression map Fake(t), together with a weighting matrix W(t).
The fusion layer is used for fusing the candidate expression map S(t) of any frame with the optical flow information F(t-1) and W(t) to obtain the virtual expression map Fake(t) of the frame. Specifically, the candidate expression map S(t) and the previous frame's virtual expression map Fake(t-1) warped by the optical flow F(t-1) can be weighted and summed through the weighting matrix W(t), realizing the fusion of the candidate expression map and the previous virtual expression map. The specific formula is:
Fake(t) = S(t)*W(t) + (1-W(t))*F(t-1)⊙Fake(t-1)
where ⊙ denotes warping an image by an optical flow, W(t) is the weight of the candidate expression map, and 1-W(t) is the weight of the previous frame's virtual expression map after optical-flow warping.
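The fusion formula above can be realized, for instance, with a flow-based warp followed by a per-pixel weighted sum; the grid-normalization details in this sketch are implementation assumptions.

```python
import torch
import torch.nn.functional as F

def warp_by_flow(img, flow):
    """The '⊙' operation: warp img [B,C,H,W] with a dense optical flow
    [B,2,H,W] given in pixels."""
    b, _, h, w = img.shape
    ys, xs = torch.meshgrid(torch.arange(h, device=img.device),
                            torch.arange(w, device=img.device), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float().unsqueeze(0)    # [1,2,H,W]
    coords = base + flow                                        # shifted sampling positions
    # normalize to [-1, 1] for grid_sample
    gx = 2.0 * coords[:, 0] / max(w - 1, 1) - 1.0
    gy = 2.0 * coords[:, 1] / max(h - 1, 1) - 1.0
    grid = torch.stack((gx, gy), dim=-1)                        # [B,H,W,2]
    return F.grid_sample(img, grid, align_corners=True)

def fuse_frame(candidate_s, weight_w, flow_prev, fake_prev):
    """Fake(t) = S(t)*W(t) + (1 - W(t)) * (F(t-1) ⊙ Fake(t-1))."""
    return candidate_s * weight_w + (1.0 - weight_w) * warp_by_flow(fake_prev, flow_prev)
```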
The expression synthesis model provided by the embodiment of the invention can depict the synthesis details of different people under different moods more vividly through the application of relevant characteristics and the integral modeling aiming at the expression, and simultaneously avoids the problem of incoordination caused by independent synthesis. Furthermore, inter-frame continuity of the synthetic avatar is optimized by the optical flow information.
Based on any one of the embodiments, in the method, the expression synthesis model is obtained by training a discriminator based on the sample speaker video, the relevant characteristics of the sample voice data and the sample image data corresponding to the sample speaker video, and the expression synthesis model and the discriminator form a generative confrontation network.
Specifically, a Generative Adversarial Network (GAN) is a deep learning model and one of the most promising approaches to unsupervised learning on complex distributions. A generative adversarial network obtains good output through the mutual game learning of two modules in its framework: a generative model and a discriminative model. In the embodiment of the invention, the expression synthesis model is the generative model and the discriminator is the discriminative model.
The expression synthesis model is used for synthesizing continuous virtual image videos, and the discriminator is used for discriminating whether the input video is the virtual image video synthesized by the expression synthesis model or the real recorded video. The function of the discriminator is to judge whether the virtual image video synthesized by the expression synthesis model is real and lifelike.
The method provided by the embodiment of the invention can obviously improve the training effect of the expression synthesis model through the mutual game learning training of the expression synthesis model and the discriminator, thereby effectively improving the fidelity and naturalness of the virtual image video output by the expression synthesis model.
According to any of the above embodiments, the discriminator comprises an image discriminator and/or a video discriminator; the image discriminator is used for judging the synthetic reality of any frame of virtual expression image in the virtual image video, and the video discriminator is used for judging the synthetic reality of the virtual image video.
Specifically, the generative adversarial network may include only the image discriminator or only the video discriminator, or may include both the image discriminator and the video discriminator.
The image discriminator performs authenticity discrimination at the image level, i.e. it determines whether the composition of expressions, such as the composition of the facial and neck muscles, is realistic. The image discriminator may take a virtual expression map Fake(t) of the current frame synthesized by the expression synthesis model, feed it into a 4-layer convolutional network (kernel=3, stride=1, channel_out=64,128,256,1), and compute the L2 norm between the resulting feature map and an all-zeros matrix of the same size. Similarly, the image discriminator may feed any image frame Real(t) from a real recorded video into the 4-layer convolutional network and compute the L2 norm between the resulting feature map and an all-ones matrix of the same size. Here the all-zeros matrix corresponds to the label "synthesized image", the all-ones matrix corresponds to the label "real image", and the L2 norm is the loss value of the image discriminator. To improve the quality of the synthesized virtual expression maps at each resolution, the virtual expression maps output by the expression synthesis model may additionally be down-sampled by factors of 2 and 4 and then discriminated.
The video discriminator performs authenticity discrimination at the video level, i.e. it determines whether the linkage in the video, such as the muscle movements of the face and neck, is realistic. Several consecutive virtual expression maps synthesized by the expression synthesis model and the corresponding optical flows, e.g. Fake(t-2), Fake(t-1), Fake(t) and F(t-2), F(t-1), may be acquired and fed into a video discriminator formed by a 4-layer convolutional network (kernel=3, stride=1, channel_out=64,128,256,1) to compute the discrimination loss. Similarly, the video discriminator also computes the discrimination loss on the real recorded video. To improve the quality of the synthesized avatar video at each resolution, the virtual expression maps output by the expression synthesis model may additionally be down-sampled by factors of 2 and 4 and then discriminated.
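The multi-scale, least-squares-style discrimination described above might look roughly like the sketch below; the LeakyReLU activations and the use of F.interpolate for the 2x and 4x down-sampling are assumptions not stated in the text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PatchDiscriminator(nn.Module):
    """Sketch of the 4-layer convolutional discriminator (64/128/256/1 channels);
    the same shape can serve as the image or the video discriminator, the latter
    taking several frames plus optical flows concatenated on channels."""
    def __init__(self, in_ch=3):
        super().__init__()
        layers, prev = [], in_ch
        for c in (64, 128, 256, 1):
            layers += [nn.Conv2d(prev, c, 3, 1, 1), nn.LeakyReLU(0.2, inplace=True)]
            prev = c
        self.net = nn.Sequential(*layers[:-1])        # no activation on the final map
    def forward(self, x):
        return self.net(x)

def discriminator_loss(disc, fake_img, real_img):
    """L2 loss against an all-zeros map for synthesized frames and an all-ones
    map for real frames, evaluated at full, 1/2 and 1/4 resolution."""
    loss = 0.0
    for s in (1.0, 0.5, 0.25):
        f = fake_img if s == 1.0 else F.interpolate(fake_img, scale_factor=s)
        r = real_img if s == 1.0 else F.interpolate(real_img, scale_factor=s)
        loss = loss + torch.mean(disc(f.detach()) ** 2)          # fake -> 0
        loss = loss + torch.mean((disc(r) - 1.0) ** 2)           # real -> 1
    return loss
```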
During the training of the expression synthesis model, the adversarial loss derived from the discriminator can be added to the loss function of the expression synthesis model, so that the expression synthesis model and the discriminator together form an adversarial loss.
In any of the above embodiments, the method wherein the relevant features include language-relevant features, and emotional and/or speaker identity features.
The language-dependent features correspond to different pronunciations, the different pronunciations require a speaker to mobilize facial muscles to form different mouth shapes, and the movements of the facial muscles and neck muscles corresponding to the different mouth shapes are different. The emotional characteristics are used for representing the emotion of a speaker, and when the speaker speaks the same content under different emotions, the motions of the facial muscles and the neck muscles including the mouth shape are different. The speaker identity characteristic is used to represent an identity of the speaker, and may specifically be an identifier corresponding to the speaker, or an occupation identifier corresponding to the speaker, or corresponding identifiers such as a personality characteristic and a language style characteristic of the speaker, which is not specifically limited in this embodiment of the present invention.
In any of the above embodiments, the method wherein the image data is determined based on the speaker identity features.
Specifically, in the pre-stored massive image data, different image data correspond to different avatars, and the different avatars have different identity characteristics. After the speaker identity characteristics in the voice data related characteristics are known, image data matched with the speaker identity characteristics can be selected from the massive image data and applied to synthesis of virtual image videos.
For example, image data of four persons A, B, C, and D is stored in advance. When the speaker identity features of the voice data are found to point to B, the image data corresponding to B may be selected for the synthesis of the avatar video.
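A trivially simple illustration of this selection is sketched below; the keyed lookup and cosine-similarity matching are illustrative assumptions, since the text only requires that matching image data be selected.

```python
import numpy as np

def select_image_data(speaker_id_vec, image_bank, identity_bank):
    """image_bank: {name: stored image data}; identity_bank: {name: identity feature}.
    Return the stored image data whose identity feature is most similar to the
    speaker identity feature extracted from the voice data."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
    best = max(identity_bank, key=lambda name: cos(speaker_id_vec, identity_bank[name]))
    return image_bank[best]     # e.g. picks B's image data in the example above
```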
Based on any of the above embodiments, step 110 specifically includes: determining acoustic features of the speech data; a correlation feature is determined based on the acoustic features.
Specifically, the acoustic features here may be a spectrogram and fbank features. For example, an adaptive filter may be used to denoise the speech data and unify the audio sampling rate and channels, here set to 16K, mono, from which the spectrogram and fbank features are then extracted (frame shift 10ms, window length 1 s).
Thereafter, a BN feature sequence representing the language content can be extracted as the language-related features using a bottleneck network; here a 256-dimensional BN feature is obtained every 40 ms, denoted L(0), L(1), …, L(N-1), where N is the number of frames of the 25 fps video. Compared with prior methods based on phoneme features, BN features are language-independent, so even if only Chinese is used when training the expression synthesis model, correct mouth shapes can still be synthesized for other languages at synthesis time. In addition, a convolutional long short-term memory network (ConvLSTM) fully trained on the recognition task of 8 basic expressions (anger, joy, fear, depression, excitement, surprise, sadness, and neutrality) is applied to extract a high-dimensional feature sequence expressing emotion as the emotion features; here a 128-dimensional emotion vector is obtained every 40 ms, denoted E(0), E(1), …, E(N-1), where N is the number of frames of the 25 fps video. Similarly, to realize personalized customization, a speaker recognition network based on a deep neural network (DNN) and i-vectors is applied to extract the speaker identity feature sequence; here a 128-dimensional identity feature vector is obtained every 40 ms, denoted P(0), P(1), …, P(N-1), where N is the number of frames of the 25 fps video. Finally, the three feature sequences are spliced together frame by frame, giving a 512-dimensional fused relevant feature for each frame, denoted M(0), M(1), …, M(N-1), where N is the number of frames of the 25 fps video.
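The per-frame concatenation of the three feature sequences can be sketched as follows; the three upstream extractors (bottleneck ASR network, ConvLSTM emotion recognizer, DNN/i-vector speaker network) are assumed to exist already and are not implemented here.

```python
import numpy as np

def assemble_related_features(bn_feats, emo_feats, spk_feats):
    """Concatenate per-frame language (256-d), emotion (128-d) and speaker
    identity (128-d) features into the 512-d related feature M(t); one frame
    every 40 ms matches 25 fps video."""
    n = min(len(bn_feats), len(emo_feats), len(spk_feats))   # guard against length drift
    m = np.concatenate([bn_feats[:n], emo_feats[:n], spk_feats[:n]], axis=1)
    assert m.shape[1] == 256 + 128 + 128                      # 512-d per frame
    return m                                                  # M(0) ... M(N-1)
```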
According to any one of the above embodiments, in the method, the expressions of the avatar configuration in the avatar video corresponding to the voice data include facial expressions and neck expressions.
Correspondingly, the covered part of the expression mask map in the character data comprises the execution area of the facial expression and the execution area of the neck expression. Here, the execution region of the facial expression may include facial muscle regions such as frontalis, orbicularis oculi, frown, orbicularis oris, and the like, and does not include an eyeball region and a nose bridge region because the movement of the eyeball is not controlled by facial muscles, and the nose bridge has a skeleton, is approximately a rigid body, and is less affected by the movement of muscles in other regions of the face.
In the embodiment of the invention, the facial expression and the neck expression are combined to be existed as the whole expression, and compared with a mode of independently modeling each region for executing the expression in the virtual image, the muscle linkage problem of each region can be effectively solved by aiming at the whole expression modeling, so that the muscle linkage of each region is more natural and vivid.
Based on any of the above embodiments, fig. 6 is a schematic flow chart of an avatar synthesis method according to another embodiment of the present invention, as shown in fig. 6, the method includes:
step 610, determining the voice data:
extracting voice data from the collected video and audio data, carrying out denoising processing on the voice data by using an adaptive filter, unifying an audio sampling rate and a sound channel, and then extracting a spectrogram and fbank characteristics from the voice data to be recognized. In order to sufficiently ensure the time sequence of the voice data, the embodiment of the invention does not need to segment the input voice data.
Step 620, obtaining relevant characteristics of the voice data:
From the spectrogram and fbank features of the voice data obtained in the previous step, the language-related features, emotion features, and speaker identity features corresponding to each frame of the voice data to be recognized are obtained through the respective neural networks for extracting them, and the three kinds of features are spliced frame by frame to obtain the relevant features corresponding to each frame.
Step 630, determine video data, detect face region, crop head region:
extracting video data from the collected video and audio data, detecting a face region of each frame of image, expanding the face region by 1.5 times by taking the obtained face frame as a reference, obtaining a region containing the whole head and neck, cutting the region, and storing the region as an image sequence, wherein the image sequence is marked as I (0), I (1), …, I (N-1), and N is the frame number of 25fps video.
Step 640, generating image data:
according to the skin color and the human face physiological structure characteristics or by using a neural network, facial muscle regions and neck muscle regions of each frame of the cut image I (t), such as frontal muscle, orbicularis oculi, frown muscle, orbicularis oris, etc., are divided, and the eyeball regions and the nose bridge regions are not included, because the movement of the eyeball is not controlled by the facial muscles, the nose bridge has a skeleton, is approximate to a rigid body, and is slightly influenced by the movement of muscles in other regions of the face. The pixel values of the facial muscle region and the neck muscle region are set to zero, and an expression mask image sequence is obtained and is marked as Im (0), Im (1), … and Im (N-1), wherein N is the frame number of the 25fps video.
The image data thus obtained includes a texture map I (0) and an expression mask map corresponding to each frame.
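Generating an expression mask map then amounts to zeroing the expression-performing regions of each cropped frame; in the sketch below the boolean-mask representation is an assumption, and the muscle-region segmentation itself is not implemented.

```python
import numpy as np

def make_expression_mask_map(frame, muscle_region_mask):
    """Zero the pixels of the facial-muscle and neck-muscle regions of a cropped
    head image to obtain the expression mask map I_m(t). `muscle_region_mask`
    is a boolean HxW array marking those regions (from skin-colour rules or a
    segmentation network, not sketched here)."""
    masked = frame.copy()
    masked[muscle_region_mask] = 0        # expression-performing regions -> 0
    return masked
```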
Step 650, inputting the expression synthesis model to obtain the avatar head video:
in the expression synthesis model, a texture map and an expression mask map in image data are subjected to a plurality of layers of convolution networks to obtain a feature map, the feature map is fused with spliced related features, a face region and a neck region are synthesized through the plurality of layers of convolution networks, and finally optical flow information is added into a video, so that the synthesized mouth shape, expression, throat movement and the like are more natural.
For example, suppose the input texture map is expressionless and the voice data is the exclamation "captured over!" spoken excitedly. For the irrelevant regions of the texture map, such as the hair and nose, the content is essentially copied into the finally synthesized virtual expression map; for the relevant regions, such as the mouth shape, cheeks, and eyebrows, the original regions are turned into new textures according to the relevant features and the texture image, and these are fused to obtain the finally synthesized virtual expression map.
Step 660, fusing the avatar head video and the body part of the video data:
if a fine seam appears at the boundary in the video which is directly spliced by the head area of the synthesized virtual head portrait according to the original coordinates, the seam area can be fused by adopting a Poisson fusion algorithm as the optimization, so that the boundary transition is smoother.
Compared with traditional voice-driven avatar synthesis and expression-migration-based face synthesis, the method provided by the embodiment of the invention not only synthesizes the facial and neck muscle movements of different people under different moods more vividly, but also realizes fully automatic offline synthesis, saving a large amount of labor cost and improving production efficiency.
Based on any of the above embodiments, fig. 7 is a schematic structural diagram of an avatar synthesis apparatus according to an embodiment of the present invention. As shown in fig. 7, the apparatus includes a relevant feature determination unit 710 and an expression synthesis unit 720;
the relevant feature determination unit 710 is configured to determine relevant features of the voice data; the related features are used for representing features related to the expression of a speaker contained in the voice data;
the expression synthesis unit 720 is configured to input image data and the relevant features into an expression synthesis model to obtain an avatar video output by the expression synthesis model, where an avatar in the avatar video is configured with an expression corresponding to the voice data;
the expression synthesis model is obtained by training based on sample speaker videos, relevant features of the sample voice data corresponding to the sample speaker videos, and sample image data.
The device provided by the embodiment of the invention carries out the expression synthesis of the virtual image by applying the relevant characteristics containing rich expression relevant information, so that the virtual image expression can better fit with the voice data and is more natural and real. In addition, in the virtual image video generated by the expression synthesis model, the expressions of the virtual image exist in an integral form, and compared with a mode of independently modeling each region for executing the expressions in the virtual image, the linkage problem of the muscles in each region can be effectively solved by aiming at the integral modeling of the expressions, so that the muscle linkage of each region is more natural and vivid.
Based on any of the above embodiments, theexpression synthesis unit 720 includes:
the feature extraction unit is used for inputting image data and relevant features respectively corresponding to any frame into a feature extraction layer of the expression synthesis model to obtain frame features output by the feature extraction layer;
and the expression prediction unit is used for inputting the frame characteristics to an expression prediction layer of the expression synthesis model to obtain a virtual expression map of any frame output by the expression prediction layer.
Based on any one of the above embodiments, the feature extraction unit includes:
the current feature extraction subunit is used for inputting image data and relevant features respectively corresponding to any frame into a current feature extraction layer of the feature extraction layer to obtain current features output by the current feature extraction layer;
and the pre-frame feature extraction subunit is used for inputting the virtual expression maps of preset frames preceding any frame into a pre-frame feature extraction layer of the feature extraction layer to obtain the pre-frame features output by the pre-frame feature extraction layer.
Based on any of the above embodiments, the expression prediction unit is specifically configured to:
and fusing the current features and the features before the frames and inputting the fused current features and the features before the frames into the expression prediction layer to obtain the virtual expression graph of any frame output by the expression prediction layer.
Based on any one of the above embodiments, the expression prediction unit includes:
the candidate expression prediction subunit is used for fusing the current feature and the pre-frame feature and inputting the fused current feature and the pre-frame feature into a candidate expression prediction layer of the expression prediction layer to obtain a candidate expression graph output by the candidate expression prediction layer;
the optical flow prediction subunit is used for fusing the current feature and the pre-frame feature and inputting the fused current feature and the pre-frame feature into an optical flow prediction layer of the expression prediction layer to obtain optical flow information output by the optical flow prediction layer;
and the fusion subunit is used for inputting the candidate expression graph and the optical flow information into a fusion layer in the expression prediction layer to obtain a virtual expression graph of any frame output by the fusion layer.
Based on any one of the embodiments, the expression synthesis model is obtained by training together with a discriminator, based on sample speaker videos, relevant features of the sample voice data corresponding to the sample speaker videos, and sample image data, and the expression synthesis model and the discriminator form a generative adversarial network.
According to any of the above embodiments, the discriminator comprises an image discriminator and/or a video discriminator;
the image discriminator is used for judging the synthetic reality of any frame of virtual expression graph in the virtual image video, and the video discriminator is used for judging the synthetic reality of the virtual image video.
In any of the above embodiments, the relevant features include language-dependent features, and emotional and/or speaker identity features.
According to any of the above embodiments, the image data is determined based on the speaker identity features.
According to any one of the above embodiments, the expressions of the avatar configuration in the avatar video corresponding to the voice data include facial expressions and neck expressions.
Fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present invention. As shown in fig. 8, the electronic device may include a processor 810, a communication interface 820, a memory 830, and a communication bus 840, where the processor 810, the communication interface 820, and the memory 830 communicate with one another via the communication bus 840. The processor 810 may call logic instructions in the memory 830 to perform the following method: determining relevant features of the voice data, the relevant features characterizing features related to the expression of the speaker contained in the voice data; inputting image data and the relevant features into an expression synthesis model to obtain an avatar video output by the expression synthesis model, where the avatar in the avatar video is configured with an expression corresponding to the voice data; the expression synthesis model is obtained by training based on sample speaker videos, relevant features of the sample voice data corresponding to the sample speaker videos, and sample image data.
In addition, the logic instructions in the memory 830 may be implemented in the form of software functional units and, when sold or used as an independent product, stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program codes, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
Embodiments of the present invention further provide a non-transitory computer-readable storage medium on which a computer program is stored. When executed by a processor, the computer program implements the method provided in the foregoing embodiments, the method including: determining relevant features of the voice data, the relevant features being used for representing features related to the expression of a speaker contained in the voice data; and inputting image data and the relevant features into an expression synthesis model to obtain an avatar video output by the expression synthesis model, wherein an avatar in the avatar video is configured with an expression corresponding to the voice data; the expression synthesis model is obtained by training based on sample speaker videos, relevant features of sample voice data corresponding to the sample speaker videos, and sample image data.
The above-described embodiments of the apparatus are merely illustrative; the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, and may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the embodiments without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (13)

1. An avatar synthesis method, comprising:
determining relevant features of voice data, wherein the relevant features are used for representing features related to the expression of a speaker contained in the voice data;
inputting image data and the relevant features into an expression synthesis model to obtain an avatar video output by the expression synthesis model, wherein an avatar in the avatar video is configured with an expression corresponding to the voice data;
the expression synthesis model is obtained by training based on sample speaker videos, relevant features of sample voice data corresponding to the sample speaker videos and sample image data.
2. The avatar synthesis method of claim 1, wherein the inputting the image data and the relevant features into the expression synthesis model to obtain the avatar video output by the expression synthesis model comprises:
inputting the image data and the relevant features respectively corresponding to any frame into a feature extraction layer of the expression synthesis model to obtain frame features output by the feature extraction layer;
and inputting the frame features into an expression prediction layer of the expression synthesis model to obtain a virtual expression map of the frame output by the expression prediction layer.
3. The avatar synthesis method of claim 2, wherein the inputting the image data and the relevant features respectively corresponding to any frame into the feature extraction layer of the expression synthesis model to obtain the frame features output by the feature extraction layer comprises:
inputting the image data and the relevant features respectively corresponding to the frame into a current feature extraction layer of the feature extraction layer to obtain current features output by the current feature extraction layer;
and inputting the virtual expression maps of a preset number of frames preceding the frame into a pre-frame feature extraction layer of the feature extraction layer to obtain pre-frame features output by the pre-frame feature extraction layer.
4. The avatar synthesis method of claim 3, wherein the inputting the frame features into the expression prediction layer of the expression synthesis model to obtain the virtual expression map of the frame output by the expression prediction layer specifically comprises:
fusing the current features and the pre-frame features, and inputting the fused features into the expression prediction layer to obtain the virtual expression map of the frame output by the expression prediction layer.
5. The avatar synthesis method of claim 4, wherein the fusing the current features and the pre-frame features and inputting the fused features into the expression prediction layer to obtain the virtual expression map of the frame output by the expression prediction layer comprises:
fusing the current features and the pre-frame features, and inputting the fused features into a candidate expression prediction layer of the expression prediction layer to obtain a candidate expression map output by the candidate expression prediction layer;
fusing the current features and the pre-frame features, and inputting the fused features into an optical flow prediction layer of the expression prediction layer to obtain optical flow information output by the optical flow prediction layer;
and inputting the candidate expression map and the optical flow information into a fusion layer of the expression prediction layer to obtain the virtual expression map of the frame output by the fusion layer.
6. The avatar synthesis method of claim 1, wherein the expression synthesis model is obtained by training, together with a discriminator, based on the sample speaker videos, the relevant features of the sample voice data corresponding to the sample speaker videos, and the sample image data, and the expression synthesis model and the discriminator form a generative adversarial network.
7. The avatar synthesis method of claim 6, wherein the discriminator comprises an image discriminator and/or a video discriminator;
the image discriminator is used for judging the synthesis realism of any single frame of virtual expression map in the avatar video, and the video discriminator is used for judging the synthesis realism of the avatar video as a whole.
8. The avatar synthesis method of any one of claims 1 to 7, wherein the relevant features include language-related features, together with emotion features and/or speaker identity features.
9. The avatar synthesis method of claim 8, wherein the image data is determined based on the speaker identity features.
10. The avatar synthesis method of any one of claims 1 to 7, wherein the expression corresponding to the voice data that is configured for the avatar in the avatar video includes facial expressions and neck expressions.
11. An avatar synthesis apparatus, comprising:
a relevant feature determination unit, used for determining relevant features of voice data, wherein the relevant features are used for representing features related to the expression of a speaker contained in the voice data;
an expression synthesis unit, used for inputting image data and the relevant features into an expression synthesis model to obtain an avatar video output by the expression synthesis model, wherein an avatar in the avatar video is configured with an expression corresponding to the voice data;
wherein the expression synthesis model is obtained by training based on sample speaker videos, relevant features of sample voice data corresponding to the sample speaker videos, and sample image data.
12. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the steps of the avatar synthesis method according to any of claims 1 to 10 when executing said program.
13. A non-transitory computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the avatar synthesis method according to any of claims 1 to 10.
CN201911274701.3A | 2019-12-12 | 2019-12-12 | Avatar composition method, apparatus, electronic device, and storage medium | Active | CN111145282B (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN201911274701.3A | CN111145282B (en) | 2019-12-12 | 2019-12-12 | Avatar composition method, apparatus, electronic device, and storage medium

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN201911274701.3A | CN111145282B (en) | 2019-12-12 | 2019-12-12 | Avatar composition method, apparatus, electronic device, and storage medium

Publications (2)

Publication Number | Publication Date
CN111145282A | 2020-05-12
CN111145282B (en) | 2023-12-05

Family

ID=70518080

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN201911274701.3A | Active | CN111145282B (en) | 2019-12-12 | 2019-12-12 | Avatar composition method, apparatus, electronic device, and storage medium

Country Status (1)

Country | Link
CN (1) | CN111145282B (en)

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
WO2018113650A1 (en) * | 2016-12-21 | 2018-06-28 | 深圳市掌网科技股份有限公司 | Virtual reality language interaction system and method
CN106919251A (en) * | 2017-01-09 | 2017-07-04 | 重庆邮电大学 | A kind of collaborative virtual learning environment natural interactive method based on multi-modal emotion recognition
CN107705808A (en) * | 2017-11-20 | 2018-02-16 | 合光正锦(盘锦)机器人技术有限公司 | A kind of Emotion identification method based on facial characteristics and phonetic feature
CN109145837A (en) * | 2018-08-28 | 2019-01-04 | 厦门理工学院 | Face emotion identification method, device, terminal device and storage medium
CN108989705A (en) * | 2018-08-31 | 2018-12-11 | 百度在线网络技术(北京)有限公司 | A kind of video creating method of virtual image, device and terminal
CN109118562A (en) * | 2018-08-31 | 2019-01-01 | 百度在线网络技术(北京)有限公司 | Explanation video creating method, device and the terminal of virtual image
CN109410297A (en) * | 2018-09-14 | 2019-03-01 | 重庆爱奇艺智能科技有限公司 | It is a kind of for generating the method and apparatus of avatar image
CN110009716A (en) * | 2019-03-28 | 2019-07-12 | 网易(杭州)网络有限公司 | Generation method, device, electronic equipment and the storage medium of facial expression
CN110414323A (en) * | 2019-06-14 | 2019-11-05 | 平安科技(深圳)有限公司 | Emotion detection method, device, electronic device and storage medium
CN110488975A (en) * | 2019-08-19 | 2019-11-22 | 深圳市仝智科技有限公司 | A kind of data processing method and relevant apparatus based on artificial intelligence
CN110503942A (en) * | 2019-08-29 | 2019-11-26 | 腾讯科技(深圳)有限公司 | A voice-driven animation method and device based on artificial intelligence

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
LEI XIE et al.: "Speech and Auditory Interfaces for Ubiquitous, Immersive and Personalized Applications", IEEE Xplore *
LI Xinyi et al.: "A Survey of Research on Speech-Driven Facial Animation" (语音驱动的人脸动画研究现状综述), Computer Engineering and Applications (计算机工程与应用), no. 22 *
JIN Hui et al.: "Facial Expression Motion Analysis and Application Based on Feature Flow" (基于特征流的面部表情运动分析及应用), Journal of Software (软件学报), no. 12 *

Cited By (27)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN111915479A (en) * | 2020-07-15 | 2020-11-10 | 北京字节跳动网络技术有限公司 | Image processing method and device, electronic equipment and computer readable storage medium
CN111915479B (en) * | 2020-07-15 | 2024-04-26 | 抖音视界有限公司 | Image processing method and device, electronic equipment and computer readable storage medium
CN112132915A (en) * | 2020-08-10 | 2020-12-25 | 浙江大学 | Diversified dynamic time-delay video generation method based on generation countermeasure mechanism
CN112132915B (en) * | 2020-08-10 | 2022-04-26 | 浙江大学 | Diversified dynamic time-delay video generation method based on generation countermeasure mechanism
CN112215927A (en) * | 2020-09-18 | 2021-01-12 | 腾讯科技(深圳)有限公司 | Method, device, equipment and medium for synthesizing face video
CN112215927B (en) * | 2020-09-18 | 2023-06-23 | 腾讯科技(深圳)有限公司 | Method, device, equipment and medium for synthesizing face video
CN112182173A (en) * | 2020-09-23 | 2021-01-05 | 支付宝(杭州)信息技术有限公司 | Human-computer interaction method and device based on virtual life and electronic equipment
CN112465935A (en) * | 2020-11-19 | 2021-03-09 | 科大讯飞股份有限公司 | Virtual image synthesis method and device, electronic equipment and storage medium
CN112492383A (en) * | 2020-12-03 | 2021-03-12 | 珠海格力电器股份有限公司 | Video frame generation method and device, storage medium and electronic equipment
CN112650399A (en) * | 2020-12-22 | 2021-04-13 | 科大讯飞股份有限公司 | Expression recommendation method and device
CN112650399B (en) * | 2020-12-22 | 2023-12-01 | 科大讯飞股份有限公司 | Expression recommendation method and device
CN114793286A (en) * | 2021-01-25 | 2022-07-26 | 上海哔哩哔哩科技有限公司 | Video editing method and system based on virtual image
CN114793300A (en) * | 2021-01-25 | 2022-07-26 | 天津大学 | Virtual video customer service robot synthesis method and system based on generation countermeasure network
CN112785669A (en) * | 2021-02-01 | 2021-05-11 | 北京字节跳动网络技术有限公司 | Virtual image synthesis method, device, equipment and storage medium
CN112785669B (en) * | 2021-02-01 | 2024-04-23 | 北京字节跳动网络技术有限公司 | Virtual image synthesis method, device, equipment and storage medium
CN113096242A (en) * | 2021-04-29 | 2021-07-09 | 平安科技(深圳)有限公司 | Virtual anchor generation method and device, electronic equipment and storage medium
WO2022255980A1 (en) * | 2021-06-02 | 2022-12-08 | Bahcesehir Universitesi | Virtual agent synthesis method with audio to video conversion
CN114466179A (en) * | 2021-09-09 | 2022-05-10 | 马上消费金融股份有限公司 | Method and device for measuring synchronism of voice and image
CN114359517A (en) * | 2021-11-24 | 2022-04-15 | 科大讯飞股份有限公司 | Avatar generation method, avatar generation system, and computing device
CN114911381A (en) * | 2022-04-15 | 2022-08-16 | 青岛海尔科技有限公司 | Interactive feedback method and device, storage medium and electronic device
CN114937104A (en) * | 2022-06-24 | 2022-08-23 | 北京有竹居网络技术有限公司 | Virtual object face information generation method and device and electronic equipment
CN115375809B (en) * | 2022-10-25 | 2023-03-14 | 科大讯飞股份有限公司 | Method, device and equipment for generating virtual image and storage medium
CN115375809A (en) * | 2022-10-25 | 2022-11-22 | 科大讯飞股份有限公司 | Virtual image generation method, device, equipment and storage medium
CN116665695B (en) * | 2023-07-28 | 2023-10-20 | 腾讯科技(深圳)有限公司 | Virtual object mouth shape driving method, related device and medium
CN116665695A (en) * | 2023-07-28 | 2023-08-29 | 腾讯科技(深圳)有限公司 | Virtual object mouth shape driving method, related device and medium
CN117221465A (en) * | 2023-09-20 | 2023-12-12 | 北京约来健康科技有限公司 | Digital video content synthesis method and system
CN117221465B (en) * | 2023-09-20 | 2024-04-16 | 北京约来健康科技有限公司 | Digital video content synthesis method and system

Also Published As

Publication number | Publication date
CN111145282B (en) | 2023-12-05

Similar Documents

Publication | Publication Date | Title
CN111145282B (en) | Avatar composition method, apparatus, electronic device, and storage medium
CN111243626B (en) | Method and system for generating speaking video
WO2022048403A1 (en) | Virtual role-based multimodal interaction method, apparatus and system, storage medium, and terminal
CN112465935A (en) | Virtual image synthesis method and device, electronic equipment and storage medium
CN113781610A (en) | Virtual face generation method
US7136818B1 (en) | System and method of providing conversational visual prosody for talking heads
US11989976B2 (en) | Nonverbal information generation apparatus, nonverbal information generation model learning apparatus, methods, and programs
US20120130717A1 (en) | Real-time Animation for an Expressive Avatar
CN110688911A (en) | Video processing method, device, system, terminal equipment and storage medium
US20110131041A1 (en) | Systems And Methods For Synthesis Of Motion For Animation Of Virtual Heads/Characters Via Voice Processing In Portable Devices
KR20210119441A (en) | Real-time face replay based on text and audio
CN110610534B (en) | Automatic mouth shape animation generation method based on Actor-Critic algorithm
JP7423490B2 (en) | Dialogue program, device, and method for expressing a character's listening feeling according to the user's emotions
CN113838173B (en) | Virtual human head motion synthesis method driven by combination of voice and background sound
KR102373608B1 (en) | Electronic apparatus and method for digital human image formation, and program stored in computer readable medium performing the same
JPWO2019160104A1 (en) | Non-verbal information generator, non-verbal information generation model learning device, method, and program
WO2023284435A1 (en) | Method and apparatus for generating animation
CN116704085B (en) | Avatar generation method, apparatus, electronic device, and storage medium
US12165672B2 (en) | Nonverbal information generation apparatus, method, and program
JPWO2019160105A1 (en) | Non-verbal information generator, non-verbal information generation model learning device, method, and program
CN119441403A (en) | A digital human control method, device and electronic device based on multi-modality
CN117036556A (en) | Virtual image driving method and device and robot
CN115861670A (en) | Training method of feature extraction model and data processing method and device
Filntisis et al. | Photorealistic adaptation and interpolation of facial expressions using HMMS and AAMS for audio-visual speech synthesis
Verma et al. | Animating expressive faces across languages

Legal Events

Date | Code | Title | Description
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant
