Detailed Description
The detailed description provided below in connection with the appended drawings is intended as a description of the present examples and is not intended to represent the only forms in which the present examples may be constructed or utilized. The description sets forth the functions of the examples and the sequence of operations for constructing and operating the examples. However, the same or equivalent functions and sequences may be accomplished by different examples.
The techniques described herein use radiation fields (also known as radiance fields) and volume rendering methods. The radiation field parameterization represents the radiation field (called the field) as a function from five-dimensional (5D) space to four-dimensional (4D) space, wherein a radiation value is known for each pair of 3D point and 2D view direction in the field. The radiation value is composed of a color value and an opacity value. The radiation field parameterization may be a trained machine learning model, such as a neural network, support vector machine, random decision forest, or other machine learning model that learns the associations between radiation values and pairs of 3D points and view directions. In some cases, the radiation field parameterization is a cache of associations between radiation values and 3D points, wherein the associations are obtained from a trained machine learning model.
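As a concrete illustration of this 5D-to-4D mapping, the following is a minimal Python sketch; the class name, layer sizes, and angle parameterization are illustrative assumptions and do not represent the parameterization used by the present technology, only the shape of its interface.

```python
# Minimal sketch (not the described model): a radiation-field parameterization as a
# mapping from a 3D point plus a 2D view direction (5D input) to colour + opacity
# (4D output). A trained network or a cache could stand behind the same interface.
import numpy as np

class RadianceField:
    """Toy stand-in with two small random layers; a real system would use a trained
    model or a cached lookup keyed on (point, view direction)."""
    def __init__(self, hidden=64, seed=0):
        rng = np.random.default_rng(seed)
        self.w1 = rng.normal(scale=0.1, size=(5, hidden))
        self.w2 = rng.normal(scale=0.1, size=(hidden, 4))

    def query(self, points, view_dirs):
        # points: (N, 3) canonical-space positions; view_dirs: (N, 3) unit vectors.
        theta = np.arccos(np.clip(view_dirs[:, 2], -1.0, 1.0))
        phi = np.arctan2(view_dirs[:, 1], view_dirs[:, 0])
        x = np.concatenate([points, theta[:, None], phi[:, None]], axis=1)  # (N, 5)
        h = np.tanh(x @ self.w1)
        out = h @ self.w2
        rgb = 1.0 / (1.0 + np.exp(-out[:, :3]))      # colour values in [0, 1]
        sigma = np.log1p(np.exp(out[:, 3]))          # non-negative opacity/density
        return rgb, sigma

field = RadianceField()
rgb, sigma = field.query(np.zeros((1, 3)), np.array([[0.0, 0.0, 1.0]]))
```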
The volume rendering method calculates an image from a radiation field for a specific camera viewpoint by querying radiation values at points along the rays that form the image. Volume rendering software is well known and commercially available.
As mentioned above, composite images of dynamic scenes are used for various purposes, such as computer games, movies, video communications, telepresence, etc. However, it is difficult to generate a composite image of a dynamic scene in a controlled manner, i.e., it is difficult to easily and accurately control how the scene is animated. Accurate control is desirable for many applications, such as applications in which a composite image of an avatar of a person in a video call is to accurately depict the facial expression of a real person. Precise control is also desirable for video game applications, where the image of a particular chair is to be broken up in a realistic manner. These examples of video calls and video games are not intended to be limiting, but illustrate the use of the present technology. The techniques can be used to capture any scene, such as objects, vegetation environments, people, or other scenes, that is static or dynamic.
Registration (enrollment) is another problem that arises when generating a composite image of a dynamic scene. Registration is the process of creating a radiation field parameterization for a particular 3D scene, such as a particular person or a particular chair. Some registration schemes use a large number of training images depicting the particular 3D scene over time and from different viewpoints. As a result, registration is time-consuming and computationally burdensome.
It is becoming increasingly important to be able to generate composite images of dynamic scenes in real time, such as during a video call in which an avatar of a caller is to be created. However, due to the complexity and computational burden of the calculations involved, real-time operation is difficult to achieve.
Generalization capability (generalization ability) is a persistent problem. Trained radiation field parameterizations are often difficult to generalize in order to facilitate computing images of 3D scenes that are different from those used during training of the radiation field parameterizations.
An alternative is to use implicit deformation methods based on learned functions; however, such methods are a "black box" to the content creator, they require a large amount of training data to generalize meaningfully, and they do not produce true extrapolation outside of the training data.
The present technique provides an accurate way of controlling how an image of a dynamic scene is animated. The user or automated process can specify parameter values, such as values of volumetric fusion shapes (blendshapes) and a skeleton of the cage of primitive 3D elements. In this way, the user or an automated process is able to precisely control the deformation of the 3D object to be depicted in the composite image. In other examples, a user or an automated process can use animation data from a physics engine to precisely control the deformation of the 3D object to be depicted in the composite image. A fusion shape is a mathematical function that, when applied to a parameterized 3D model, changes the parameter values of the 3D model. In an example where the 3D model is a 3D model of a person's head, there may be hundreds of fusion shapes, each changing the 3D model according to facial expression or identity characteristics.
In some examples, the present technology reduces the burden of registration. Registration burden is reduced by using a reduced amount of training images, such as training image frames from only one or only two moments in time.
In some examples, the present technology is capable of operating in real-time (such as at 30 frames per second or more). This is achieved by using optimization when calculating the transformation of the sample points into the canonical space used by the radiation field parameterization.
In some cases, the present technology operates with good generalization capability. Because scenes are animated with parameters from selected face models or physics engines, the techniques can use the dynamics of those models to animate scenes beyond the training data in a physically meaningful way, and therefore generalize well.
Fig. 1 is a schematic diagram of an image animator 100 for computing a composite image of a dynamic scene. In some cases, the image animator 100 is deployed as a web service. In some cases, the image animator 100 is deployed at a personal computer or other computing device in communication with a head-mounted computer 114, such as a head-mounted display device. In some cases, the image animator 100 is deployed in a companion computing device of the head-mounted computer 114.
The image animator 100 includes a radiation field parameterization 102, at least one processor 104, a memory 106, and a volume renderer 108. In some cases, the radiation field parameterization 102 is a neural network, or a random decision forest, or a support vector machine, or other type of machine learning model, which has been trained to predict pairs of color values and opacity values for three-dimensional points and view directions in a canonical space of the dynamic scene; more detail about the training process is given later in this document. In some cases, the radiation field parameterization 102 is a cache of associations between three-dimensional points and view directions in the canonical space and color values and opacity values.
The volume renderer 108 is a well-known computer graphics volume renderer that acquires pairs of color values and opacity values of three-dimensional points along a ray and calculates an output image 116.
Image animator 100 is configured to receive queries from client devices such as smart phone 122, computer game device 110, head-mounted computer 114, movie creation device 120, or other client devices. The query is sent from the client device to the image animator 100 via the communication network 124.
The query from the client device includes a specified viewpoint of the virtual camera, specified values of intrinsic parameters of the virtual camera, and a deformation description 118. The composite image will be calculated by the image animator 100 as if it had been captured by the virtual camera. The deformation description describes the desired dynamic content of the scene in the output image 116.
The image animator 100 receives the query and, in response, generates a composite output image 116 that it sends to the client device. The client device uses the output image 116 for one of a variety of useful purposes including, but not limited to, generating a virtual network camera stream, generating video for a computer video game, generating holograms for display by a mixed reality headset computing device, generating movies. The image animator 100 is capable of calculating a composite image of a dynamic 3D scene for a particular specified desired dynamic content and a particular specified viewpoint as needed. In an example, the dynamic scene is a face of a speaking person. The image animator 100 is capable of calculating a composite image of the face from multiple viewpoints and with arbitrarily specified dynamic content. Non-limiting examples of specified viewpoints and dynamic content are planar view, eye closure, face tilt up, smile, perspective view, eye opening, mouth opening, anger expression. Note that since the machine learning used to create the radiation field parameterization 102 can be generalized, the image animator 100 can calculate a composite image for view points and deformation descriptions that are not present in the training data used to train the radiation field parameterization 102. Other examples of dynamic scenarios are given below with reference to fig. 2 and 3, and include generic objects such as chairs, automobiles, trees, full-body. By using the deformation description, dynamic scene content depicted in the generated composite image can be controlled. In some cases, the deformation descriptions are obtained using a physics engine 126, enabling a user or an automated process to apply physical rules to break up 3D objects depicted in the composite output image 116 or apply other physical rules to depict animations, such as bouncing, waving, swaying, dancing, spinning, or other animations. A physical simulation can be applied to the cages of 3D primitive elements using a finite element method to create the deformation descriptions in order to produce elastic deformation or fracture. In some cases, such as where an avatar of a person is created, the deformation description is obtained using a face or body tracker 124. By selecting the viewpoint and the intrinsic camera parameter value, the characteristics of the synthesized output image can be controlled.
The image animator operates in a non-conventional manner to enable the generation of composite images of a dynamic scene in a controlled manner. Many alternative methods of generating composite images using machine learning have little or no control over what is depicted in the generated composite image.
The image animator 100 improves the functionality of the underlying computing device by enabling the composite image of the dynamic scene to be calculated in a manner whereby the content and view of the dynamic scene is controllable.
Alternatively or additionally, the functions of the image animator 100 are performed, at least in part, by one or more hardware logic components. For example, and without limitation, exemplary types of hardware logic components that may optionally be used include Field-Programmable Gate Arrays (FPGAs), Application-Specific Integrated Circuits (ASICs), Application-Specific Standard Products (ASSPs), Systems-on-a-Chip (SOCs), Complex Programmable Logic Devices (CPLDs), and Graphics Processing Units (GPUs).
In other examples, the functionality of the image animator 100 is located at the client device or shared between the client device and the cloud.
Fig. 2 shows a deformation description 200 of a person's head calculated using the image animator 100 of fig. 1 and three images 204, 206, 208, each showing the person's head animated in a different manner, such as with the mouth open or closed. The deformation description 200 is a cage of primitive 3D elements, which are tetrahedra in the example of fig. 2, although in some examples other primitive 3D elements are used, such as spheres or cuboids. In the example of fig. 2, the tetrahedral cage extends from the surface mesh of the person's head so as to include the volume around the head, which can be used to represent the person's hair and any headgear worn by the person. In the case of a general object such as a chair, the volume around the object in the cage is useful because modeling the volume with a volume rendering method results in a more realistic image. The cage only needs to approximate the mesh, which reduces the complexity of the cage for objects with many parts (the cage for a plant does not need a different part per leaf; it only needs to cover all the leaves) and allows the same cage to be used for objects of the same type with similar shapes (different chairs can use the same cage). The cage can be intuitively deformed and controlled by a user, by physics-based simulation, or by conventional automatic animation techniques such as fusion shapes. Faces are a particularly difficult case due to the unusual combination of stiffness and (viscoelastic) motion, and the present technique works well for faces, as described in more detail below. Once the radiation field is trained using the present technique, it can be generalized to any geometric deformation that can be represented with a cage of 3D primitives constructed from its density. This opens up new possibilities for using the volumetric model in a game or augmented reality/virtual reality context, where the user's manipulation of the environment is not known a priori.
In an example, the deformation description 200 is referred to as a volumetric three-dimensional deformable model (Vol3DMM), which is a parametric 3D face model that uses a skeleton and fusion shapes to animate the surface mesh of a person's head and the volume around the mesh.
The user or automated process can specify values for parameters of the Vol3DMM model that are used to animate the Vol3DMM model in order to create the images 204 through 208, as described in more detail below. Different values of the parameters of the Vol3DMM model are used to generate each of the three images 204 through 208. The Vol3DMM model is an example of a deformation description along with parameter values.
Vol3DMM animates a volumetric mesh using a set of volumetric fusion shapes and a skeleton. It generalizes a parametric three-dimensional deformable model (3DMM), which animates a surface mesh with a skeleton and fusion shapes, to a parametric model that also animates the volume around the mesh.
The skeleton and fusion shapes of the Vol3DMM are defined by extending the skeleton and fusion shapes of the parametric 3DMM face model. The skeleton has four bones, namely a root bone, a neck bone, a left-eye bone and a right-eye bone, for controlling rotation. To use the skeleton in Vol3DMM, linear blend skinning weights are extended from the vertices of the 3DMM mesh to the vertices of the tetrahedra by nearest-vertex lookup, i.e., each tetrahedron vertex takes the skinning weights of the nearest vertex in the 3DMM mesh. The volumetric fusion shapes are created by extending the 224 expression fusion shapes and 256 identity fusion shapes of the 3DMM model to the volume surrounding its template mesh, by creating the ith volumetric fusion shape of Vol3DMM as a tetrahedral embedding of the mesh of the ith 3DMM fusion shape. To create the tetrahedral embedding, a single volumetric structure is created from a generic mesh, and an accurate embedding is created taking into account facial geometry and facial deformation: the embedding avoids tetrahedron penetration between the upper and lower lips, defines a volumetric support covering the hair, and has a higher resolution in the areas subject to more deformation. In an example, the exact number of bones and fusion shapes is inherited from the particular instance of the 3DMM model selected, but the techniques can be applied to different 3DMM models that use fusion shapes and/or skeletons for modeling faces, bodies, or other objects.
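The nearest-vertex lookup used to extend the skinning weights can be sketched as follows; this is an illustrative sketch only, and the array names and shapes are assumptions rather than part of the described system.

```python
# Illustrative sketch (numpy, assumed shapes): extending linear blend skinning
# weights from 3DMM mesh vertices to tetrahedral cage vertices by nearest-vertex lookup.
import numpy as np

def transfer_skinning_weights(mesh_vertices, mesh_weights, tet_vertices):
    """mesh_vertices: (M, 3); mesh_weights: (M, B) weights over B bones;
    tet_vertices: (T, 3). Each cage vertex copies the weights of its nearest
    surface-mesh vertex."""
    d2 = ((tet_vertices[:, None, :] - mesh_vertices[None, :, :]) ** 2).sum(axis=2)
    return mesh_weights[d2.argmin(axis=1)]               # (T, B)
```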
As a result of this construction, the Vol3DMM is controlled and posed with the same identity, expression, and pose parameters α, β, θ as the 3DMM face model. This means that it can be animated, by changing α, β, θ, with a face tracker built on the 3DMM face model and, more importantly, generalizes to any expression that can be represented by the 3DMM face model, as long as there is a good fit of the face model to the training frames. During training, the tetrahedral mesh of the Vol3DMM is posed using the parameters α, β, θ to define a physical space, while a canonical space is defined for each subject by posing the Vol3DMM with the identity parameter α and setting β, θ to zero for a neutral pose. In an example, the decomposition into identity, expression, and pose is inherited from the particular instance of the 3DMM model selected. However, the techniques for training and/or animating accommodate different decompositions by constructing a corresponding Vol3DMM model for the particular 3DMM model selected.
Fig. 3 shows a chair 300 and a composite image 302 of the chair breaking, calculated using the image animator of fig. 1. In this case, the deformation description includes a cage surrounding the chair 300, wherein the cage is formed of primitive 3D elements such as tetrahedra, spheres, or cuboids. The deformation description also includes information, such as rules from a physics engine, about how the object behaves when it breaks.
FIG. 4 is a flow chart of an example method performed by the image animator of FIG. 1. Inputs 400 to the method include a deformation description, a camera viewpoint, and camera parameters. The camera viewpoint is the viewpoint of the virtual camera for which the composite image is to be generated. The camera parameters are lens and sensor parameters such as image resolution, field of view, and focal length. The type and format of the deformation description depends on the type and format of the deformation description used in the training data when the radiation field parameterization is trained. The training process is described later with respect to fig. 8. Fig. 4 relates to test-time operation after the radiation field parameterization has been learned. In some cases, the deformation description is a vector of concatenated parameter values of a parameterized 3D model of an object in the dynamic scene (such as a Vol3DMM model). In some cases, the deformation description is one or more physics-based rules from a physics engine to be applied to a cage of primitive 3D elements that encapsulates a 3D object to be rendered and extends into a volume surrounding the 3D object.
In some examples, the input 400 includes default values for some or all of the deformation description, the viewpoint, and the intrinsic camera parameters. In some cases, the input 400 is from a user or from a gaming device or other automated process. In an example, the input 400 is derived from a game state of a computer game or from a state received from a mixed reality computing device. In an example, a face or body tracker 420 provides the values of the deformation description. The face or body tracker is a trained machine learning model that takes as input captured sensor data depicting at least a portion of a person's face or body and predicts values of parameters of a 3D face model or 3D body model of the person. The parameters are shape parameters, pose parameters, or other parameters.
The deformation description includes a cage 418 of primitive 3D elements. The cage of primitive 3D elements represents a 3D object to be depicted in an image and a volume extending from the 3D object. In some cases, such as where the 3D object is a human head or body, the cage includes a volumetric mesh having a plurality of volumetric fusion shapes and a skeleton. In some cases, where the 3D object is a chair or other 3D object, the cage is computed from the learned radiation field by calculating a mesh from the density of the learned radiation field volume using the marching cubes method and calculating a tetrahedral embedding of the mesh. The cage 418 of primitive 3D elements is a deformed version of a canonical cage. That is, to generate a modified version of the scene, the method begins by deforming a canonical cage into a desired shape, the desired shape being the deformation description. The method is agnostic to the way in which the deformed cage is generated and to what kind of object is deformed.
The use of a cage to control and parameterize volumetric deformation enables the deformation to be represented in real time and applied to the scene, which can represent both smooth and discontinuous functions and allows for intuitive control by changing the geometry of the cage. The geometric control is compatible with machine learning models, physics engines, and art generation software, thereby allowing good extrapolation or generalization to configurations not observed in training.
In the case of a cage formed by tetrahedra, the use of a set of tetrahedra corresponds to a piecewise linear approximation of the deformation field. Graphics Processing Unit (GPU) accelerated ray tracing allows the cage representation to be queried fast enough, in milliseconds, even with highly complex geometries. Cage representations using tetrahedra can reproduce hard object boundaries by construction and can be edited in off-the-shelf software, as they consist only of points and triangles.
At operation 402, the dynamic scene image generator calculates a plurality of rays, each ray associated with a pixel of the output image 116 to be generated by the image animator. For a given pixel (x, y position in the output image), the image animator computes a ray from the virtual camera through the pixel into the deformation description that includes the cage. To calculate the rays, the image animator uses the selected values of the intrinsic camera parameters and the camera viewpoint. The rays are computed in parallel where possible, to give efficiency, since one ray is computed per pixel.
For each ray, the image animator samples a plurality of points along the ray. In general, the more points sampled, the better the quality of the output image. Rays are randomly selected and samples are drawn within specified limits obtained from scene knowledge 416. In an example, the specified boundaries are calculated from training data that has been used to train a machine learning system. The boundaries indicate the size of the dynamic scene such that one or more samples are taken from regions of rays in the dynamic scene. In order to calculate the boundaries from the training data, standard image processing techniques are used to examine the training images. The boundaries of the dynamic scene can also be specified manually by an operator or measured automatically using a depth camera, Global Positioning System (GPS) sensor, or other location sensor.
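The per-pixel ray construction and bounded sampling of these two operations can be sketched as follows; this is a minimal sketch assuming a pinhole camera model, and the function names and conventions are assumptions for illustration only.

```python
# Minimal sketch assuming a pinhole camera: one ray per output pixel, with stratified
# depth samples restricted to the scene bounds known from the training data.
import numpy as np

def pixel_ray(x, y, fx, fy, cx, cy, cam_to_world):
    """Return (origin, unit direction) in world space for pixel (x, y);
    cam_to_world is a 4x4 camera-to-world matrix."""
    d_cam = np.array([(x - cx) / fx, (y - cy) / fy, 1.0])
    d_world = cam_to_world[:3, :3] @ d_cam
    return cam_to_world[:3, 3], d_world / np.linalg.norm(d_world)

def sample_depths(near, far, n_samples, rng):
    """Stratified depth samples within the scene bounds [near, far]."""
    edges = np.linspace(near, far, n_samples + 1)
    return edges[:-1] + rng.random(n_samples) * (edges[1:] - edges[:-1])
```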
Each sample is assigned an index of the 3D primitive elements of the deformed cage within which the sample falls.
At operation 406, the image animator transforms the samples from the deformation description cages into canonical cages. A canonical cage is a version of a cage that represents a 3D object in a stationary state or other specified original state, such as where the parameter value is zero. In an example where the 3D object is a human head, the canonical cage represents a human head looking straight at the virtual camera, with eyes open and mouth closed and neutral expression.
In the case where the primitive 3D element is a tetrahedron, the barycentric coordinates as described below are used to calculate the transformation of the sample into a canonical cage. Using barycentric coordinates is a particularly efficient way of computing the transformation.
In the example where the cage uses tetrahedra, barycentric coordinates defined for both the canonical tetrahedron X = {X1, X2, X3, X4} and the deformed tetrahedron element x = {x1, x2, x3, x4} are used to map a point p in deformed space to a point P in canonical space.
The tetrahedron, the basic building block, is a pyramid with four triangular faces. The undeformed "rest" position of its four constituent points is defined as:
X = {X1, X2, X3, X4}    (2)
and the deformed state x = {x1, x2, x3, x4} is denoted in lowercase. Because tetrahedra are simplices, the barycentric coordinates (λ1, λ2, λ3, λ4) can be used to represent a point that falls inside them with reference to either the set X or the set x.
The input point can be recovered as p = Σi λi xi. If p falls inside the tetrahedron, its rest position P in canonical space is obtained as:
P = Σi λi Xi    (3)
In the case where the primitive 3D element is a sphere or cuboid, the transformation of the sample into the canonical cage is instead calculated using an affine transformation, which is sufficiently expressive for large, rigidly moving parts of the motion field.
From each camera, rays are emitted into the physical space, the tetrahedron x0 incident to each sample p along the ray is detected, and its barycentric coordinates λi are calculated such that p = Σi λi xi, where the xi are the vertices of x0; equation (3) is then applied to obtain the rest position P in canonical space.
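A minimal sketch of this barycentric mapping, assuming numpy arrays for the tetrahedron vertices, is given below; the helper names are illustrative and not part of the described method.

```python
# Illustrative sketch (assumed array shapes): barycentric coordinates of a sample p
# inside a deformed tetrahedron x, re-used against the canonical tetrahedron X,
# i.e. P = sum_i lambda_i * X_i as in equation (3).
import numpy as np

def barycentric_coords(p, x):
    """x: (4, 3) deformed tetrahedron vertices, p: (3,) point inside it.
    Solves sum_i lambda_i * x_i = p together with sum_i lambda_i = 1."""
    A = np.vstack([x.T, np.ones((1, 4))])        # (4, 4) system matrix
    b = np.append(p, 1.0)
    return np.linalg.solve(A, b)                 # (lambda_1, ..., lambda_4)

def to_canonical(p, x_deformed, X_canonical):
    lam = barycentric_coords(p, x_deformed)
    return lam @ X_canonical                     # rest position P in the canonical cage
```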
In the case where the 3D element is a tetrahedron, optimization is optionally used to compute the transformation at operation 406 by optimizing the primitive point lookup. The optimizing includes calculating a transformation P of the sample by setting P equal to a normalized distance between a previous intersection point and a next intersection point of the tetrahedron, multiplied by a sum over the four vertices of the tetrahedron of the barycentric coordinates of the vertices at the previous intersection point multiplied by the canonical coordinates of the vertices, plus one minus the normalized distance, multiplied by a sum over the four vertices of the tetrahedron of the barycentric coordinates of the vertices at the next intersection point multiplied by the canonical coordinates of the vertices. This optimization was found to give a significant improvement in processing time, making real-time operation of the process of fig. 4 possible beyond 30 frames per second (i.e., computing more than 30 images per second in the case where the processor is a single RTX3090 (trade mark) graphics processing unit).
Operation 407 is optional and includes rotating a view direction of at least one of the rays. In this case, for one of the transformed samples, the view direction of the ray of the sample is rotated before querying the learned radiation field. Calculating the rotation R of the view direction for a small fraction of the primitive 3D elements and propagating the value of R to the remaining tetrahedra via nearest-neighbor interpolation is found to give good results in practice.
For each sample point, the dynamic scene image generator queries 408 the radiation field parameterization 102. Given a point and an associated view direction in the canonical 3D cage, the radiation field parameterization has been trained to produce a color value and a density value. In response to each query, the radiation field parameterization produces a pair of values comprising a color and an opacity at the sample point in the canonical cage. In this way, the method calculates, with the deformation description applied, a plurality of color values and opacity values 410 for 3D points and view directions in the canonical cage.
In an example, the learned radiation field parameterization 102 is a cache of associations between 3D points and view directions and color values and opacity values in a canonical version of the cage obtained by querying a machine learning model trained using training data comprising images of the dynamic scene from multiple viewpoints. Significant acceleration is achieved by using a cache of values rather than querying the machine learning model directly.
The radiation field is a function returning a color c and a density σ, which are queried to obtain the appearance at a location in space. The color of a pixel on the image plane is, in general, obtained via volume rendering using an emission-absorption volume rendering equation of the form:
C = Σi=1..N Ti (1 − exp(−σi δi)) ci    (1)
where δi = (pi+1 − pi) represents the distance between samples (N in total) along the ray, and the transmittance Ti is defined as Ti = exp(−Σj<i σj δj). The field is typically modeled by a multi-layer perceptron (MLP), an explicit voxel grid, or a combination of both. In addition to the sample position p, the color c also depends on the direction of the ray, which allows the field to model view-dependent effects such as specular reflection.
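The compositing of equation (1) can be sketched as follows; this is a minimal numpy sketch under the assumption that the per-sample colors, densities, and depths are already available, and the function name is illustrative.

```python
# Minimal numpy sketch of the emission-absorption compositing of equation (1).
import numpy as np

def composite(colors, sigmas, depths):
    """colors: (N, 3); sigmas: (N,); depths: (N,) increasing sample depths."""
    deltas = np.diff(depths, append=depths[-1] + 1e10)       # delta_i = p_{i+1} - p_i
    alphas = 1.0 - np.exp(-sigmas * deltas)                   # per-sample opacity
    trans = np.cumprod(np.append(1.0, 1.0 - alphas[:-1]))     # T_i = prod_{j<i}(1 - alpha_j)
    return ((trans * alphas)[:, None] * colors).sum(axis=0)   # pixel colour C
```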
For each ray, the volume rendering 412 method is applied to the color values and opacity values calculated along the ray to produce pixel values of the output image. Any well-known computer graphics method for volumetric ray tracing is used. Where real-time operation is desired, hardware accelerated volume rendering is used.
The output image is stored 414 or inserted into a virtual network camera stream or used for telepresence, gaming, or other applications.
FIG. 5 is a schematic diagram of rays in a deformed cage 500, rays transformed to a canonical cage 502, a volume lookup 504, and a volume rendering 506. To render a single pixel, rays are cast from the camera center through the pixel into the scene in its deformed state. A plurality of samples are generated along the ray and then each sample is mapped to the canonical space using a deformation Mj corresponding to tetrahedron j. The transformed sample position p'j and the direction of the ray rotated based on the rotation of the jth tetrahedron are then used to query the volumetric representation of the scene. The resulting per-sample opacity values and color values are then combined using volume rendering as in equation one.
The density and color at each point in the scene are functions of both the sample position and the view direction. If the sample position is moved but the view direction remains unchanged, the light reflected from the elements of the scene will look the same for each deformation. To alleviate this problem, the view direction of each sample is rotated with rotation between the canonical tetrahedron and its deformed equivalent:
v′ = R v,
U, Σ, V = SVD((X − cX)^T (X′ − c′X)),
R = U V^T,
where cX and c′X are the centroids (centers of gravity) of the canonical and deformed states of the tetrahedron into which a given sample falls. With this approach, the direction in which light is reflected at each point of the scene will match the deformation caused by the tetrahedral mesh. Note, however, that the reflected light will represent the scene in its canonical pose.
In practice, it is inefficient to calculate R for each sample in the scene, or even for each tetrahedron, as it requires the calculation of Singular Value Decomposition (SVD). Alternatively a random scheme is employed, wherein R is calculated for a small fraction ρ of tetrahedra and propagated to the remaining tetrahedra via nearest neighbor interpolation. In the experiments described herein, ρ=0.05.
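A sketch of this per-tetrahedron rotation estimate and its nearest-neighbour propagation is given below; the array layouts are assumptions, and in practice a sign correction may be needed when the SVD yields a reflection rather than a rotation.

```python
# Sketch (assumed layouts) of the per-tetrahedron rotation R = U V^T and its
# nearest-neighbour propagation to the tetrahedra for which R was not computed.
import numpy as np

def tet_rotation(X, X_def):
    """X, X_def: (4, 3) canonical and deformed vertices of one tetrahedron."""
    cX, cX_def = X.mean(axis=0), X_def.mean(axis=0)
    U, _, Vt = np.linalg.svd((X - cX).T @ (X_def - cX_def))
    return U @ Vt                                            # R = U V^T

def propagate_rotations(centroids, computed_idx, computed_R):
    """Copy each computed rotation to the tetrahedra whose centroid is nearest.
    centroids: (T, 3); computed_idx: (K,) indices; computed_R: (K, 3, 3)."""
    d2 = ((centroids[:, None, :] - centroids[computed_idx][None, :, :]) ** 2).sum(-1)
    return computed_R[d2.argmin(axis=1)]                     # (T, 3, 3)
```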
More details about an example of primitive point lookup are now presented.
For complex meshes, it is expensive to test every tetrahedron against every input point, given the cost of the point-in-tetrahedron test. For non-self-intersecting tetrahedral meshes, the notion of a point being "in front of" or "behind" a particular triangle is uniquely determined by the winding order of the triangle vertices. Determining which tetrahedron a point belongs to therefore corresponds to emitting a ray from the point in a random direction, evaluating the triangle at the first intersection, and checking which side of that triangle the sample is on. This uniquely identifies the tetrahedron, as each triangle can belong to at most two tetrahedra. In particular, these queries are very efficient in terms of storage and computation when hardware acceleration is available.
In an example, the same acceleration structure is applied to arbitrary triangle meshes, allowing tetrahedra to be combined with rigidly moving triangle meshes that do not need to be filled with tetrahedra but can be treated as a single unit in terms of deformation. Second, the number of point-in-tetrahedron tests required is reduced by noting that many samples along a single ray can fall into the same element. Knowing the previous and next intersections, a simple depth test determines which tetrahedron a sample falls into. The barycentric coordinates are linear, and thus barycentric interpolation values are obtained by interpolating the values at the previous intersection point and the next intersection point within each element. For this, equation (3) is rewritten as:
P = a Σi λi^(1) Xi + (1 − a) Σi λi^(2) Xi,
where superscripts 1 and 2 refer to the previous and next intersection points, and a is the normalized distance between the two intersection points, which defines the point at which the method interpolates.
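A minimal sketch of this per-ray interpolation is given below, assuming the barycentric coordinates at the previous and next intersection points are already available; the names are illustrative.

```python
# Sketch of the per-ray reformulation of equation (3): the canonical position is
# interpolated from the barycentric coordinates at the previous and next intersections.
import numpy as np

def canonical_from_intersections(a, lam_prev, lam_next, X_canonical):
    """a: normalized distance of the sample between the two intersections;
    lam_prev, lam_next: (4,) barycentric coordinates at those intersections;
    X_canonical: (4, 3) canonical tetrahedron vertices."""
    return a * (lam_prev @ X_canonical) + (1.0 - a) * (lam_next @ X_canonical)
```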
Due to this modification, the value at each point remains stable even if, through lack of numerical accuracy, the "wrong" side of a triangle (or an entirely incorrect triangle) is queried. In contrast to the per-point formulation of the tetrahedron index lookup, one important side effect of this per-ray formulation is that it integrates naturally with the ray-marching scheme used for rendering. In the latter, rays are terminated based on the transmittance, which is naturally accommodated by the reformulated tetrahedron search algorithm.
Fig. 6 is a flow chart of a method of sampling. The method comprises querying the learned radiation field of a 3D scene to obtain color values and opacity values, using only one radiation field network 600 and increasing the size of a sampling bin (bin) 602.
Volume rendering typically involves sampling depth along each ray. In an example, there are sampling strategies that enable capturing thin structures and fine details and improving sampling boundaries. The method gives an improved quality with a fixed sample count.
Some schemes represent a scene with two multi-layer perceptrons (MLPs): a "coarse" one and a "fine" one. First, Nc samples are evaluated by the coarse network to obtain a coarse estimate of the opacity along the ray. These estimates then guide a second round of Nf samples placed around the locations where the opacity values are greatest. The fine network is then queried at both the coarse and the fine sample locations, resulting in Nc evaluations of the coarse network and Nc + Nf evaluations of the fine network. During training, the two MLPs are optimized independently, but only samples from the fine MLP contribute to the final pixel color. The inventors have recognized that the first Nc samples evaluated in the coarse MLP are not used to render the output image and are therefore effectively wasted.
To improve efficiency, the method avoids querying a fine network at the coarse sample locations and instead reuses the outputs from the first round of coarse samples with a single MLP network.
Simply using one network instead of two results in artifacts, where the region around a ray segment that has been assigned high weight can be clipped, as illustrated at 606, 608 of fig. 6. Clipping can occur because the bin placement for drawing fine samples treats the density as a step function at the sample locations rather than as a point estimate of a smooth function. Thus, the size of each importance sampling bin 610 is doubled, allowing the importance samples to cover the entire range between coarse samples, as illustrated at 612, 614 of fig. 6.
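A hedged sketch of the widened importance-sampling bins is given below; the exact bin handling and weighting are assumptions, intended only to illustrate drawing fine samples around coarse samples with doubled bin widths.

```python
# Hedged sketch: fine depths are drawn around coarse samples with bin widths doubled
# so that they can cover the full range between neighbouring coarse samples.
import numpy as np

def fine_samples(coarse_t, weights, n_fine, rng):
    """coarse_t: (Nc,) coarse depths; weights: (Nc,) non-negative importance weights."""
    probs = weights + 1e-8
    probs = probs / probs.sum()
    picked = rng.choice(len(coarse_t), size=n_fine, p=probs)
    spacing = np.gradient(coarse_t)                     # distance to neighbouring samples
    # Doubled bins: offset up to one full spacing either side of the chosen coarse sample.
    return coarse_t[picked] + (rng.random(n_fine) - 0.5) * 2.0 * spacing[picked]
```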
Fig. 7 is a flow chart of a method of computing an image of a person depicting their mouth open. The method of fig. 7 is optionally used, wherein only one or two time instances are used in the training image. If many time instances are available in the training image, the process of FIG. 7 is not required. In the method, the cage represents a face of a person and includes a mesh of mouth interior 700, a first plane representing an upper row of teeth of the person, and a second plane 702 representing a lower row of teeth of the person. The method includes checking 704 if one of the samples falls in the interior of the mouth and using information about the first plane and the second plane to calculate 708 a transformation on the sample. The transformed samples are used to interrogate 710 the radiation field and the method proceeds as in fig. 4.
In the example, a separate deformation model is defined for the inside of the mouth, defined by a closed triangle primitive, and animated by two rigidly moving planes, one for each row of teeth.
The model is trained using a single frame and operates in a minimal-data training regime. The animation is then driven by the "a priori" known animation model Vol3DMM with which the face is animated. The cage geometry model therefore requires primitives that are non-self-intersecting (to allow real-time rendering) and that can be driven with Vol3DMM. In the special case of the mouth interior, a completely filled tetrahedral cavity is not an appropriate choice, as the rendered teeth would deform as the mouth opens. This would result in unrealistic appearance and motion. The alternative of placing rigidly deforming tetrahedra around the teeth would require very high accuracy of the geometry estimate.
Instead, different primitives are selected for the inside of the mouth. Firstly, the inside of the mouth is filled with tetrahedra as if it were treated identically to the rest of the head, and secondly, the index of the external triangle of the tetrahedra corresponding to the inside of the mouth is recorded, effectively forming a surface mesh for the inside of the mouth. The surface mesh moves with the rest of the head and is used to determine which samples fall within the mouth interior, but is not used to deform them back into the canonical space. GPU accelerated ray tracing supports both tetrahedrally and triangle defined primitives, allowing the primitives that drive the animation to be changed.
To model the deformation, two planes are used, one placed just below the top tooth and one just above the bottom tooth. Both planes, together with the basic volumetric 3D deformable model of the face, move rigidly (i.e. both remain planar). It is assumed that the teeth move rigidly with these planes and decide not to support the tongue, so the space between the planes is assumed to be empty.
Given the surface mesh delimiting the entire mouth interior and the two planes, the mouth interior is animated by the following steps.
(1) The primitive in which each sample falls is detected, and it is checked whether that primitive is a mouth-interior primitive.
(2) For each sample within a mouth-interior primitive, the signed distances to the upper and lower planes are used to determine whether it falls into the upper or the lower part of the mouth.
(3) The coordinates of the sample in the canonical space are calculated by 1) calculating the coordinates of the sample with respect to the relevant plane, 2) finding the position of the plane in the canonical space, and 3) assuming that the relative coordinates of the sample with respect to the relevant plane remain unchanged.
In an example, the canonical pose is a mouth-closed canonical pose, i.e., where the teeth overlap (the top of the bottom tooth is below the bottom of the upper tooth). As a result, the upper mouth region and the lower mouth region partially overlap in the canonical space. Thus, the color and density learned in the canonical space must be the average of the corresponding regions in the upper and lower mouths. To overcome this obstacle, the canonical regions for the inside of the upper and lower mouth are placed outside the tetrahedron cage, on its left and right sides. Together with the assumption of a blank space between the two planes, this placement results in a bijective mapping of the sample from inside the mouth in the deformation space to the canonical space, allowing a correct learning of the radiation field for that region.
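Steps (1) to (3) can be sketched as follows for a single mouth-interior sample; the plane representation (an origin point plus a 3x3 matrix of orthonormal axes, the last row being the plane normal) and the helper names are assumptions for illustration.

```python
# Sketch of the plane-relative transform for a mouth-interior sample: the relative
# coordinates with respect to the relevant tooth plane are kept unchanged when moving
# from the deformed space to the canonical space.
import numpy as np

def plane_coords(p, origin, axes):
    """Express point p in the local frame of a rigidly moving tooth plane."""
    return axes @ (p - origin)

def mouth_sample_to_canonical(p, def_origin, def_axes, can_origin, can_axes):
    local = plane_coords(p, def_origin, def_axes)        # coordinates w.r.t. deformed plane
    return can_origin + can_axes.T @ local               # same coordinates w.r.t. canonical plane
```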
Fig. 8 is a flow chart of a method of training a machine learning model and calculating a cache for use in the image animator 100. Training data 800 is accessed that includes images of a scene (static or dynamic) taken from a number of viewpoints. It is possible to use a set of images of a static scene as training data. A sequence in which each image represents a scene in a different state can also be used.
Fig. 8 is first described for the case where the images are images of a scene from a plurality of different viewpoints obtained at a single time instance or at two time instances, so that the amount of training data required for registration is relatively low. By using only one or two time instances, the accuracy with which the face tracker computes the ground truth (reference truth) parameter values of the deformation description is improved. This is because the face tracker introduces errors, and if it is used for frames at many time instances there are more errors.
The training data image is a real image such as a photograph or video frame. The training data image may also be a composite image. A tuple of values is extracted 601 from the training data image, wherein each tuple is a deformation description, a camera viewpoint, camera intrinsic parameters, and a color of a given pixel.
In the example of a chair from fig. 3, the training data includes images of the chair taken from many different known viewpoints at the same time. The images are composite images generated using computer graphics techniques. A tuple of values is extracted from each training image, where each tuple is a deformation description, a camera viewpoint, camera intrinsic parameters, and a color of a given pixel. The deformation description is a cage that is determined 802 by placing a cage of primitive 3D elements around and extending from the chair using known image processing techniques. A user, or an automated process such as a computer game, triggers a physics engine to deform the cage using physical rules to crush the chair as it falls under gravity or as it is subjected to pressure from another object.
To form training data, samples are acquired along rays in the cage 804 by emitting rays from the viewpoint of the camera capturing the training images into the cage. Samples are taken along the rays as described with reference to fig. 4. Each sample is assigned an index of one of the 3D primitive elements of the cage according to the element within which the sample falls. The samples are then transformed 806 into a canonical cage, which is a version of the cage in a rest position. The transformed samples are used to calculate an output pixel color using volume rendering. The output pixel color is compared to the ground truth output pixel color of the training image, and a loss function is used to evaluate the difference or error. The loss function output is used to perform backpropagation to train 808 the machine learning model and output a trained machine learning model 810. The training process is repeated for many samples until convergence is reached.
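The training step described above can be sketched at a high level as follows; this is a toy, runnable PyTorch sketch with random stand-in data and a tiny MLP, intended only to illustrate the flow from canonical-space samples to a rendered color, a photometric loss against the ground truth pixel color, and backpropagation; it is not the model or the data of the described system.

```python
import torch

# Toy field: 5D input (3D canonical point + 2D view direction) -> colour + density.
field = torch.nn.Sequential(torch.nn.Linear(5, 64), torch.nn.ReLU(), torch.nn.Linear(64, 4))
opt = torch.optim.Adam(field.parameters(), lr=1e-3)

samples = torch.rand(1024, 5)            # stand-in canonical-space samples, 16 per ray
gt_colors = torch.rand(1024 // 16, 3)    # stand-in ground truth pixel colours

out = field(samples)
rgb = torch.sigmoid(out[:, :3])
sigma = torch.nn.functional.softplus(out[:, 3])
alpha = 1.0 - torch.exp(-sigma * (1.0 / 16)).view(-1, 16)       # uniform sample spacing
trans = torch.cumprod(torch.cat([torch.ones(alpha.shape[0], 1), 1.0 - alpha[:, :-1]], dim=1), dim=1)
pred = ((trans * alpha).unsqueeze(-1) * rgb.view(-1, 16, 3)).sum(dim=1)  # rendered colour
loss = torch.nn.functional.mse_loss(pred, gt_colors)                     # photometric error
loss.backward()                                                          # backpropagation
opt.step(); opt.zero_grad()
```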
The resulting trained machine learning model 810 is used to calculate and store a cache 812 of associations between 3D positions and view directions and color values and opacity values in the canonical cage. This is done by querying the trained machine learning model 810 for a range of 3D locations and storing the results in a cache.
In the example of a face from fig. 2, the training data includes images of a person's face taken simultaneously from many different known viewpoints. Associated with each training data image are values of the parameters of the Vol3DMM model of the face and head of the person. The parameters include the pose (position and orientation) of the eyes and of the bones of the neck and jaw, as well as fusion shape parameters specifying characteristics of the person's facial expression, such as eyes closed/open, mouth closed/open, smiling/not smiling, etc. The images are real images of the person captured using one or more cameras having known viewpoints. A 3D model is fitted to each image using any well-known model fitting procedure, whereby the values of the parameters of the 3DMM model used to generate the Vol3DMM are searched for a set of values that enables the 3D model to describe the observed real image. The real image is then labeled with the found parameter values, which form the value of the deformation description. Each real image is also labeled with the known camera viewpoint of the camera used to capture the image.
Then, the processes of operations 802 to 812 of fig. 8 are performed.
The machine learning model is trained 808 with a training objective that attempts to minimize the difference between the colors produced by the machine learning model and the colors given in the ground truth training data.
In some examples involving facial animation, sparsity loss is optionally applied in the volume around the head and in the mouth interior.
The sparsity loss allows incorrect background reconstruction to be handled and mitigates problems caused by disocclusion in the mouth-interior region. In an example, a Cauchy loss is used:
Lsparsity = (λs / N) Σi Σk log(1 + 2 σ(ri(tk))^2),
where i indexes the rays ri emitted from the training cameras and k indexes the samples tk along each ray, N is the number of samples to which the loss is applied, λs is a scalar loss-weighting hyper-parameter, and σ is the opacity returned by the radiation field parameterization. To ensure that space is evenly covered by the sparsity loss, it is applied to the "coarse" samples. Other sparsity-inducing losses, such as an l1 loss or weighted least squares, are also effective. The sparsity loss is applied in two regions: in the volume around the head and in the mouth interior. Applied to the volume around the head, the sparsity penalty prevents opaque regions from appearing where there is insufficient multi-view information to separate foreground from background in 3D. To detect these regions, the penalty is applied to (1) samples that fall in tetrahedral primitives, as this is the region rendered at test time, and (2) samples that belong to rays which fall in the background in the training image, as detected by a 2D face segmentation network applied to the training image. The sparsity loss is also applied to coarse samples that fall within the mouth-interior volume. This prevents opaque regions from being created inside the mouth in areas that are not visible during training, and are therefore not supervised, but become disoccluded at test time.
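The penalty itself can be sketched as follows; the constant inside the logarithm and the masking of which samples it is applied to are assumptions consistent with the description above.

```python
# Sketch of a Cauchy-style sparsity penalty applied to the opacities of the selected
# coarse samples (those in the volume around the head or inside the mouth).
import numpy as np

def sparsity_loss(sigmas, lambda_s):
    """sigmas: (N,) opacities at the samples selected for the penalty;
    lambda_s: scalar loss weight."""
    return lambda_s / len(sigmas) * np.log1p(2.0 * sigmas ** 2).sum()
```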
The sparsity loss within the mouth ensures that there is no unnecessary density in the mouth interior. However, the color behind regions occluded in the training frame remains undefined, resulting in unpleasant artifacts when these regions become disoccluded at test time. The solution herein is to override the color and density of the last sample along each ray that falls in the mouth interior, which allows the color of disoccluded regions at test time to be set to match the color learned at training time in the visible region between the teeth.
The present technology has been empirically tested for a first application and a second application.
In a first experiment, a physics-based simulation is used to control the deformation of a static object (an aircraft propeller) subject to complex topological changes, and to render a photorealistic image of the process for each step of the simulation. This experiment shows the representational capability of the deformation description and the ability to render images of physical deformations that are difficult to capture with a camera. Data sets of the propeller undergoing successive compression and rotation are synthesized. For both types of deformation, 48 time frames are rendered for 100 cameras. For the present technique, training is performed only on the first frame, which can be considered static, but a coarse tetrahedral mesh describing the motion of the sequence is supplied. In the first application, the average peak signal-to-noise ratio of the present technique for interpolation over the other frames (not seen in training) is 27.72, compared to 16.63 for an alternative that does not use a cage and instead uses positional encoding of a time signal. For extrapolation over time (the second half of the frames, not seen in training), the peak signal-to-noise ratio is 29.87 for the present technique, compared to 12.78 for the alternative technique. The present technique computes an image at 512 x 512 resolution in about 6 ms per frame in the first application, compared to about 30 s for the alternative technique.
In a second experiment, a photo-realistic animation of a human head avatar was calculated in real time using a fusion shape-based face tracker. The avatar is trained using 30 images of the subject taken from different viewpoints at the same time. Thus, for each avatar, the method sees only a single facial expression and gesture. To animate the head avatar, the control parameters of the parameter 3DMM face model extending from the surface mesh to the volume around it are used. The resulting parametric volumetric facial model is called Vol3DMM. Building on the parametric face model allows generalizing to facial expressions and poses that are not visible at training and using a face tracker built on top of it for real-time control. A key benefit of the method is that hair, accessories and other elements are captured by the cage. The proposed solution can be applied to the whole body.
In a second experiment, multi-view facial data was acquired with a camera rig that captured synchronized video from 31 cameras at 30 frames per second. The cameras are located 0.75-1 m from the subject, with viewpoints spanning 270° around the subject's head and focused mostly on the frontal view within ±60°. The illumination is non-uniform. All images were downsampled to 512 x 512 pixels and color corrected to have consistent color characteristics across the cameras. The camera poses and intrinsic parameters are estimated using a standard structure-from-motion pipeline.
For the second experiment, a speech sequence with natural head motion was captured for four subjects. Half of the subjects additionally performed various facial expressions and head rotations. To train the model for each subject, face tracking results from a face tracker and images from multiple cameras at a single time instance (frame) are used. The frame is chosen to meet the criteria that 1) a significant area of the teeth is visible and the bottom of the upper teeth is above the top of the lower teeth, so that a plane can be placed between them, 2) the subject is looking forward and some of the white of the eyes is visible on both sides of the iris, 3) the face fit for the frame is accurate, and 4) the texture of the face is not too wrinkled (e.g., in the nasolabial folds) due to the mouth opening. When a single frame satisfying 1-4 is not available, two frames are used: a frame in which the subject has a forward-looking neutral expression and satisfies 2-4, to train everything except the inside of the mouth, and a frame in which the mouth is open and which satisfies 1 and 3, to train the inside of the mouth.
The present technique was found to provide a 00.1 dB better PSNR than the baseline technique and a 10% improvement in learned perceptual image patch similarity (LPIPS). The baseline technique uses an explicit mesh and does not have a cage that extends beyond the face.
FIG. 9 illustrates various components of an exemplary computing-based device 900 implemented as any form of computing and/or electronic device, and in which embodiments of an image animator are implemented in some examples.
The computing-based device 900 includes one or more processors 914, which are microprocessors, controllers, or any other suitable type of processor for processing computer-executable instructions to control the operation of the device in order to generate a composite image of a dynamic scene in a controlled manner. In some examples, for example where a system-on-chip architecture is used, the processors 914 include one or more fixed-function blocks (also referred to as accelerators) that implement a portion of the method of any of fig. 4 to 8 in hardware (rather than software or firmware). Platform software, including an operating system 908 or any other suitable platform software, is provided at the computing-based device to enable application software 910 to execute on the device. The data store 922 holds output images, values of face tracker parameters, values of physics engine rules, intrinsic camera parameter values, viewpoints, and other data. An image animator 902 comprising a radiation field parameterization 904 and a volume renderer 906 is present at the computing-based device 900.
Computer-executable instructions are provided using any computer-readable medium accessible by the computing-based device 900. Computer-readable media includes, for example, computer storage media such as memory 912 and communication media. Computer storage media, such as memory 912, includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules, and the like. Computer storage media includes, but is not limited to, Random Access Memory (RAM), Read-Only Memory (ROM), Erasable Programmable Read-Only Memory (EPROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), flash memory or other memory technology, Compact Disc Read-Only Memory (CD-ROM), Digital Versatile Disks (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium for storing information for access by a computing device. Rather, the communication media embodies computer readable instructions, data structures, program modules, etc. in a modulated data signal such as a carrier wave or other transport mechanism. As defined herein, computer storage media does not include communication media. Accordingly, computer storage media should not be construed as propagating signals themselves. Although the computer storage media (memory 912) is shown within the computing-based device 900, it will be appreciated that in some examples, the storage is distributed or remotely located and accessed via a network or other communication link (e.g., using communication interface 916).
The computing-based device 900 has an optional capture device 918 to enable the device to capture sensor data, such as images and video. The computing-based device 900 has an optional display device 920 for displaying output images and/or parameter values.
Alternatively or in addition to other examples described herein, examples include any combination of the following clauses:
clause a, a computer-implemented method of computing an image of a dynamic 3D scene comprising a 3D object, the method comprising:
Receiving a description of a deformation of the 3D object, the description comprising a cage of primitive 3D elements and associated animation data from a physics engine or an articulated object model;
for pixels of the image, computing rays from a virtual camera into the cage through the pixels, animated according to the animation data, and computing a plurality of samples on the rays, each sample being a 3D position and view direction in one of the 3D elements;
Computing a transformation of the samples into a canonical version of the cage to generate transformed samples;
Querying a learned radiation field parameterization of the 3D scene for each transformed sample to obtain color values and opacity values;
a volume rendering method is applied to the color values and the opacity values to generate pixel values of the image.
Clause B, the method of clause a, wherein the cage of primitive 3D elements represents the 3D object and a volume extending from the 3D object.
Clause C, the method of clause B, wherein the cage comprises a volumetric mesh having a plurality of volumetric fusion shapes and a skeleton.
Clause D, the method of clause B, wherein the cage is calculated from the learned radiation field parameterization by calculating a mesh from the density of the learned radiation field parameterization using a marching cubes method and calculating a tetrahedral embedding of the mesh.
Clause E, the method of any of the preceding clauses, further comprising calculating the transformation P of the sample by setting P equal to a normalized distance between a previous intersection point and a next intersection point of a tetrahedron on the ray, multiplying a sum of barycentric coordinates of the vertices at the previous intersection point on four vertices of the tetrahedron times canonical coordinates of the vertices, adding one minus the normalized distance, multiplying a sum of barycentric coordinates of the vertices at the next intersection point on four vertices of the tetrahedron times canonical coordinates of the vertices.
Clause F, the method of clause a, further comprising, for one of the transformed samples, rotating a view direction of rays of the sample prior to querying the learned radiation field parameterization.
Clause G, the method of clause F, comprising calculating a rotation R of the view direction for a fraction of the primitive 3D elements, and propagating the value of R to the remaining tetrahedra via nearest neighbor interpolation.
Clause H, the method of any of the preceding clauses, wherein the canonical version of the cage is a cage with specified parameter values of an articulated object model or specified parameters of a physics engine.
Clause I, the method of any of the preceding clauses, wherein the canonical version of the cage represents a face with a closed mouth.
Clause J, the method of any of the preceding clauses, wherein the learned radiation field parameterization is a cache of associations between 3D points in the canonical version of the cage and color values and opacity values, the associations obtained by querying a machine learning model trained using training data comprising images of the dynamic scene from multiple viewpoints.
Clause K, the method of any of the preceding clauses, wherein the images of the dynamic scene from multiple viewpoints are obtained at the same instant in time or at two moments in time.
Clause L, the method of clause K, wherein the cage represents a face of a person and includes a mesh inside the mouth, a first plane representing an upper row of teeth of the person, and a second plane representing a lower row of teeth of the person.
Clause M, the method of clause L, comprising checking whether one of the samples falls in the interior of the mouth, and using information about the first plane and the second plane to calculate the transformation of the sample.
Clause N, the method of any of the preceding clauses, comprising using only one radiation field network and increasing the number of sampling bins when querying the learned radiation field parameterization of the 3D scene for each transformed sample to obtain color values and opacity values.
Clause O, an apparatus comprising at least one processor and a memory storing instructions that, when executed by the at least one processor, perform a method for computing an image of a dynamic 3D scene comprising a 3D object, the method comprising:
receiving a description of a deformation of the 3D object, the description comprising a cage of primitive 3D elements and associated animation data from a physics engine or an articulated object model;
for pixels of the image, computing rays from a virtual camera through the pixels into the cage, the cage being animated according to the animation data, and computing a plurality of samples on the rays, each sample being a 3D position and view direction in one of the 3D elements;
computing a transformation of the samples into a canonical version of the cage to generate transformed samples;
querying a learned radiation field parameterization of the 3D scene for each transformed sample to obtain color values and opacity values; and
applying a volume rendering method to the color values and the opacity values to generate pixel values of the image.
Clause P, a computer-implemented method of computing an image of a dynamic 3D scene comprising a 3D object, the method comprising:
receiving a description of a deformation of the 3D object;
for pixels of the image, computing a ray from a virtual camera through the pixels into the description and computing a plurality of samples on the ray, each sample being a 3D position and view direction in one of the 3D elements;
computing a transformation of the samples into a canonical space to generate transformed samples;
for each transformed sample, querying a cache of associations between 3D points in the canonical space and color values and opacity values; and
applying a volume rendering method to the color values and the opacity values to generate pixel values of the image.
Clause Q, the method of clause P, further comprising one or more of storing the image, transmitting the image to a computer gaming application, transmitting the image to a telepresence application, inserting the image into a virtual network camera stream, and transmitting the image to a head-mounted display.
Clause R, the method of clause P or Q, comprising detecting parameter values of a model of a 3D object depicted in a video using an object tracker, and calculating the description of the deformation of the 3D object using the detected parameter values and the model.
Clause S, the method of any of clauses P to R, comprising using a physics engine to specify the description.
Clause T, the method of any of clauses P to S, wherein each primitive 3D element is any one of a tetrahedron, a sphere, or a cuboid.
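By way of non-limiting illustration of the method of clause A (and of the cache query of clause P), the following sketch shows one possible way of computing a single pixel value: samples are taken along a ray, each sample is mapped into the canonical cage using the barycentric coordinates of its containing tetrahedron (a simplification of the segment-based interpolation of clause E), the radiation field is queried for color and opacity, and the results are composited using the standard emission-absorption volume rendering quadrature. The helper names (barycentric_coords, to_canonical, query_field, render_pixel), the toy single-tetrahedron cage, and the placeholder field are illustrative assumptions only and do not form part of the clauses; the sketch assumes Python with numpy.

import numpy as np

def barycentric_coords(p, tet_verts):
    # Barycentric coordinates of point p with respect to a 4x3 array of
    # tetrahedron vertices (a, b, c, d).
    a, b, c, d = tet_verts
    m = np.column_stack([b - a, c - a, d - a])
    w = np.linalg.solve(m, p - a)            # weights of b, c and d
    return np.array([1.0 - w.sum(), *w])     # weight of a placed first

def to_canonical(p, tets, deformed_v, canonical_v):
    # Map a sample in the deformed (animated) cage into the canonical cage by
    # finding a containing tetrahedron and re-applying its barycentric
    # coordinates to the canonical vertex positions. Returns None when the
    # sample lies outside every tetrahedron of the cage.
    for tet in tets:
        bary = barycentric_coords(p, deformed_v[tet])
        if np.all(bary >= -1e-9):
            return bary @ canonical_v[tet]
    return None

def query_field(p_canonical, view_dir):
    # Placeholder for the learned radiation field parameterization; a real
    # system would query a trained model or a cache of associations between
    # canonical 3D points, view directions and (color, opacity) values.
    rgb = 0.5 * (np.tanh(p_canonical) + 1.0)            # fabricated color in [0, 1]
    sigma = 5.0 * np.exp(-np.linalg.norm(p_canonical))  # fabricated density
    return rgb, sigma

def render_pixel(origin, direction, tets, deformed_v, canonical_v,
                 near=0.0, far=2.0, n_samples=64):
    # Sample the ray, transform each sample to the canonical cage, query the
    # field, and composite front to back:
    #   alpha_i = 1 - exp(-sigma_i * delta), C = sum_i T_i * alpha_i * c_i,
    #   with transmittance T_i = prod_{j<i} (1 - alpha_j).
    t_vals = np.linspace(near, far, n_samples)
    delta = t_vals[1] - t_vals[0]
    color = np.zeros(3)
    transmittance = 1.0
    for t in t_vals:
        p = origin + t * direction
        p_can = to_canonical(p, tets, deformed_v, canonical_v)
        if p_can is None:
            continue                         # empty space outside the cage
        rgb, sigma = query_field(p_can, direction)
        alpha = 1.0 - np.exp(-sigma * delta)
        color += transmittance * alpha * rgb
        transmittance *= 1.0 - alpha
    return color

if __name__ == "__main__":
    # Toy cage: a single tetrahedron whose "animation" stretches it slightly.
    canonical_v = np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]], float)
    deformed_v = canonical_v * np.array([1.2, 0.9, 1.0])
    tets = np.array([[0, 1, 2, 3]])
    pixel = render_pixel(np.array([0.2, 0.2, -1.0]),
                         np.array([0.0, 0.0, 1.0]),
                         tets, deformed_v, canonical_v)
    print("rendered pixel value:", pixel)

Running the sketch prints one composited color value; computing a full image repeats the same per-ray computation for every pixel.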
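The segment-based interpolation of clause E can be written compactly as P = t · Σ_i b_i(x_prev) v_i + (1 − t) · Σ_i b_i(x_next) v_i, where x_prev and x_next are the previous and next intersection points of the ray with the tetrahedron, b_i are barycentric coordinates, v_i are the canonical coordinates of the four vertices, and t is the normalized distance referenced in the clause. The following minimal sketch, again assuming numpy and using hypothetical argument names, implements that expression literally and is illustrative only:

import numpy as np

def clause_e_transform(t, bary_prev, bary_next, canonical_verts):
    # t: the normalized distance referenced in clause E (a scalar in [0, 1]).
    # bary_prev, bary_next: barycentric coordinates (length-4 arrays) of the
    #   previous and next intersection points of the ray with the tetrahedron.
    # canonical_verts: 4x3 array of the canonical coordinates of the vertices.
    prev_term = bary_prev @ canonical_verts   # sum_i b_i(x_prev) * v_i
    next_term = bary_next @ canonical_verts   # sum_i b_i(x_next) * v_i
    return t * prev_term + (1.0 - t) * next_term

If t is instead defined as the sample's normalized distance from the previous intersection point, the two weights would typically be swapped so that a sample at the previous intersection point maps exactly onto it; which convention applies depends on how the normalized distance is defined in a given implementation.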
The term 'computer' or 'computing-based device' is used herein to refer to any device having processing capabilities such that it executes instructions. Those skilled in the art will recognize that such processing capabilities are incorporated into many different devices, and thus the terms 'computer' and 'computing-based device' each include Personal Computers (PCs), servers, mobile phones (including smart phones), tablet computers, set-top boxes, media players, game consoles, personal digital assistants, wearable computers, and many other devices.
In some examples, the methods described herein are performed by software in machine-readable form on a tangible storage medium, e.g., in the form of a computer program comprising computer program code means adapted to perform all the operations of one or more of the methods described herein when the program is run on a computer, and wherein the computer program may be embodied on a computer-readable medium. The software is adapted to be executed on a parallel processor or a serial processor such that the method operations may be performed in any suitable order or simultaneously.
Those skilled in the art will recognize that the storage devices used to store program instructions may alternatively be distributed across a network. For example, a remote computer can store an example of a process described as software. The local or terminal computer can access the remote computer and download a part or all of the software to run the program. Alternatively, the local computer may download pieces of the software as needed, or execute some of the software instructions at the local terminal and remote computer (or computer network). Those skilled in the art will also recognize that all or a portion of the software instructions may be executed by dedicated circuitry, such as a Digital Signal Processor (DSP), programmable logic array, or the like, by utilizing conventional techniques known to those skilled in the art.
As will be apparent to one of skill in the art, any of the ranges or device values set forth herein may be extended or altered without losing the effect sought.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims.
It will be appreciated that the benefits and advantages described above may relate to one embodiment or may relate to several embodiments. Embodiments are not limited to those solving any or all of the problems or those having any or all of the benefits and advantages. It will also be understood that reference to an item refers to one or more of those items.
The operations of the methods described herein may be performed in any suitable order or concurrently where appropriate. In addition, individual blocks may be deleted from any of the methods without departing from the scope of the subject matter described herein. Aspects of any of the examples described above may be combined with aspects of any of the other examples described to form further examples without loss of the effect sought.
The term 'comprising' is used herein to mean including the identified method blocks or elements, but such blocks or elements do not constitute an exclusive list, and the method or apparatus may include additional blocks or elements.
The term 'subset' is used herein to refer to a proper subset, such that a subset of a set does not include all the elements of the set (i.e., at least one of the elements of the set is missing from the subset).
It will be appreciated that the above description is given by way of example only and that various modifications may be made by those skilled in the art. The above specification, examples and data provide a complete description of the structure and use of exemplary embodiments. Although various embodiments have been described above with a certain degree of particularity, or with reference to one or more individual embodiments, those skilled in the art could make numerous alterations to the disclosed embodiments without departing from the scope of this disclosure.