
Computing images of controllable dynamic scenes

Info

Publication number
CN119497876A
Authority
CN
China
Prior art keywords
image
cage
values
sample
scene
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202380052012.6A
Other languages
Chinese (zh)
Inventor
J·P·C·瓦伦丁
V·埃斯特勒斯卡萨斯
S·雷扎伊法尔
申晶晶
S·K·希马诺维奇
S·J·加尔宾
M·A·科瓦尔斯基
M·A·约翰逊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Microsoft Technology Licensing LLC
Original Assignee
Microsoft Technology Licensing LLC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from GB application GBGB2210930.0A (GB202210930D0)
Application filed by Microsoft Technology Licensing LLC
Publication of CN119497876A
Legal status: Pending (current)

Abstract

To compute an image of a dynamic 3D scene comprising a 3D object, a description of a deformation of the 3D object is received, the description comprising a cage of primitive 3D elements and associated animation data from a physics engine or an articulated object model. For each pixel of the image, the method calculates a ray from a virtual camera through the pixel into the cage, which is animated according to the animation data, and calculates a plurality of samples on the ray. Each sample is a 3D position and view direction in one of the 3D elements. The method calculates a transformation of each sample into a canonical cage. For each transformed sample, the method queries a learned radiation field parameterization of the 3D scene to obtain a color value and an opacity value. A volume rendering method is applied to the color values and the opacity values to generate pixel values of the image.

Description

Computing images of controllable dynamic scenes
Background
Dynamic scenes are environments in which one or more objects are moving, as opposed to static scenes in which all objects are stationary. An example of a dynamic scene is a person's face moving as the person speaks. Another example of a dynamic scene is a rotating aircraft propeller. Another example of a dynamic scene is a standing person moving an arm.
In conventional computer graphics, computing a composite image of a dynamic scene is a complex task because a three-dimensional (3D) model of the scene and of its dynamics is needed. Obtaining such a 3D model is complex and time consuming and involves manual work.
Composite images of dynamic scenes are used for various purposes such as computer games, movies, video communications, etc.
The embodiments described below are not limited to implementations that address any or all of the disadvantages of known devices for computing composite images of dynamic scenes.
Disclosure of Invention
The following presents a simplified summary of the disclosure in order to provide a basic understanding to the reader. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter. Its sole purpose is to present a selection of concepts in a simplified form as a prelude to the more detailed description that is presented later.
In various examples, there are ways to compute images of a dynamic scene in a controlled manner, so that a user or an automated process can easily control how the dynamic scene is animated. Optionally, the image is computed in real time (such as at 30 frames per second or more) and is realistic, i.e., the image has characteristics that substantially match those of an empirical image and/or video.
In various examples, there is a computer-implemented method of computing an image of a dynamic 3D scene comprising a 3D object. The method includes receiving a description of a deformation of the 3D object, the description including a cage of primitive 3D elements and associated animation data from a physics engine or an articulated object model. For each pixel of the image, the method calculates a ray from a virtual camera through the pixel into the cage, which is animated according to the animation data, and calculates a plurality of samples on the ray. Each sample is a 3D position and view direction in one of the 3D elements. The method calculates a transformation of the samples into a canonical version of the cage to produce transformed samples. For each transformed sample, the method queries a learned radiation field parameterization of the 3D scene to obtain a color value and an opacity value. A volume rendering method is applied to the color values and the opacity values to generate pixel values of the image.
Many of the attendant features will be more readily appreciated as the same becomes better understood by reference to the following detailed description considered in connection with the accompanying drawings.
Drawings
The specification will be better understood by reading the following detailed description in conjunction with the accompanying drawings in which:
FIG. 1 is a schematic diagram of an image animator for computing images of a controllable dynamic scene;
FIG. 2 shows a deformation description of a human head and three images calculated using the image animator of FIG. 1;
FIG. 3 shows a chair and an image of the chair breaking, calculated using the image animator of FIG. 1;
FIG. 4 is a flow chart of an example method performed by the image animator of FIG. 1;
FIG. 5 is a schematic illustration of rays in a deformed cage, transformed to a canonical cage, volume lookup, and volume rendering;
FIG. 6 is a flow chart of a method of sampling;
FIG. 7 is a flow chart of a method of computing an image of a person depicting their mouth open;
FIG. 8 is a flow chart of a method of training a machine learning model and computing a cache;
FIG. 9 illustrates an exemplary computing-based device in which an embodiment of an animator for computing images of a controllable dynamic scene is implemented.
In the drawings, like reference numerals are used to designate like parts in the accompanying drawings.
Detailed Description
The detailed description provided below in connection with the appended drawings is intended as a description of the present examples and is not intended to represent the only forms in which the present examples may be constructed or utilized. The description sets forth the functions of the examples and the sequence of operations for constructing and operating the examples. However, the same or equivalent functions and sequences may be accomplished by different examples.
The techniques described herein use radiation fields and volume rendering methods. A radiation field parameterization represents the radiation field (referred to as the field) as a function from five-dimensional (5D) space to four-dimensional (4D) space, whereby a radiation value is available for each pair of 3D point and 2D view direction in the field. A radiation value is composed of a color value and an opacity value. The radiation field parameterization may be a trained machine learning model, such as a neural network, support vector machine, random decision forest, or other machine learning model that learns the associations between radiation values and pairs of 3D points and view directions. In some cases, the radiation field parameterization is a cache of associations between radiation values and 3D points, wherein the associations are obtained from a trained machine learning model.
The volume rendering method calculates an image from a radiation field for a specific camera viewpoint by examining the radiation values at points along the rays that form the image. Volume rendering software is well known and commercially available.
As mentioned above, composite images of dynamic scenes are used for various purposes, such as computer games, movies, video communications, telepresence, etc. However, it is difficult to generate a composite image of a dynamic scene in a controlled manner, i.e., it is difficult to easily and accurately control how the scene is animated. Accurate control is desirable for many applications, such as applications in which a composite image of an avatar of a person in a video call is to accurately depict the facial expression of the real person. Precise control is also desirable for video game applications, where the image of a particular chair is to be broken up in a realistic manner. These examples of video calls and video games are not intended to be limiting, but illustrate uses of the present technology. The techniques can be used to capture any scene, static or dynamic, such as objects, vegetation, environments, people, or other scenes.
Registration (enrollment) is another problem that arises when generating a composite image of a dynamic scene. Registration is the process in which a radiation field parameterization is created for a particular 3D scene, such as a particular person or a particular chair. Some registration schemes use a large number of training images depicting the particular 3D scene over time and from different viewpoints, in which case registration is time-consuming and computationally burdensome.
It is becoming increasingly important to be able to generate composite images of dynamic scenes in real time, such as during a video call in which an avatar of a caller is to be created. However, real-time operation is difficult to achieve because of the complexity and computational burden of the calculations involved.
Generalization capability (generalization ability) is a persistent problem. Trained radiation field parameterizations are often difficult to generalize in order to facilitate computing images of 3D scenes that are different from those used during training of the radiation field parameterizations.
An alternative is to use implicit deformation methods based on learned functions; however, such methods are a "black box" to the content creator, they require a large amount of training data to generalize meaningfully, and they do not produce realistic extrapolation outside of the training data.
The present technique provides an accurate way of controlling how an image of a dynamic scene is animated. A user or automated process can specify parameter values, such as volumetric fusion shape and skeleton values, which are applied to the cage of primitive 3D elements. In this way, the user or automated process is able to precisely control the deformation of the 3D object to be depicted in the composite image. In other examples, a user or an automated process can use animation data from a physics engine to precisely control the deformation of the 3D object to be depicted in the composite image. A fusion shape is a mathematical function that, when applied to a parameterized 3D model, changes the parameter values of the 3D model. In an example where the 3D model is a 3D model of a person's head, there may be hundreds of fusion shapes, each changing the 3D model according to a facial expression or identity characteristic.
In some examples, the present technology reduces the burden of registration. The registration burden is reduced by using a reduced number of training images, such as training image frames from only one or only two moments in time.
In some examples, the present technology is capable of operating in real-time (such as at 30 frames per second or more). This is achieved by using optimization when calculating the transformation of the sample points into the canonical space used by the radiation field parameterization.
In some cases, the present technology operates with good generalization capability. Because scenes are animated with parameters from a selected face model or a physics engine, the technology can use the model dynamics of the face model or physics engine to animate scenes beyond the training data in a physically meaningful way, and thereby generalizes well.
Fig. 1 is a schematic diagram of an image animator 100 for computing a composite image of a dynamic scene. In some cases, the image animator 100 is deployed as a web service. In some cases, the image animator 100 is deployed at a personal computer or other computing device in communication with a head-mounted computer 114, such as a head-mounted display device. In some cases, the image animator 100 is deployed in a companion computing device of the head-mounted computer 114.
The image animator 100 includes a radiation field parameterization 102, at least one processor 104, a memory 106, and a volume renderer 108. In some cases, the radiation field parameterization 102 is a neural network, a random decision forest, a support vector machine, or another type of machine learning model. It has been trained to predict pairs of color values and opacity values given three-dimensional points and view directions in a canonical space of the dynamic scene; more details about the training process are given later in this document. In some cases, the radiation field parameterization 102 is a cache of associations between three-dimensional points and color values and opacity values stored in the canonical space.
The volume renderer 108 is a well-known computer graphics volume renderer that acquires pairs of color values and opacity values of three-dimensional points along a ray and calculates an output image 116.
Image animator 100 is configured to receive queries from client devices such as smart phone 122, computer game device 110, head-mounted computer 114, movie creation device 120, or other client devices. The query is sent from the client device to the image animator 100 via the communication network 124.
The query from the client device includes a specified viewpoint of the virtual camera, specified values of intrinsic parameters of the virtual camera, and a deformation description 118. The composite image will be calculated by the image animator 100 as if it had been captured by the virtual camera. The deformation description describes the desired dynamic content of the scene in the output image 116.
The image animator 100 receives the query and, in response, generates a composite output image 116 that it sends to the client device. The client device uses the output image 116 for one of a variety of useful purposes including, but not limited to, generating a virtual network camera stream, generating video for a computer video game, generating holograms for display by a mixed reality headset computing device, generating movies. The image animator 100 is capable of calculating a composite image of a dynamic 3D scene for a particular specified desired dynamic content and a particular specified viewpoint as needed. In an example, the dynamic scene is a face of a speaking person. The image animator 100 is capable of calculating a composite image of the face from multiple viewpoints and with arbitrarily specified dynamic content. Non-limiting examples of specified viewpoints and dynamic content are planar view, eye closure, face tilt up, smile, perspective view, eye opening, mouth opening, anger expression. Note that since the machine learning used to create the radiation field parameterization 102 can be generalized, the image animator 100 can calculate a composite image for view points and deformation descriptions that are not present in the training data used to train the radiation field parameterization 102. Other examples of dynamic scenarios are given below with reference to fig. 2 and 3, and include generic objects such as chairs, automobiles, trees, full-body. By using the deformation description, dynamic scene content depicted in the generated composite image can be controlled. In some cases, the deformation descriptions are obtained using a physics engine 126, enabling a user or an automated process to apply physical rules to break up 3D objects depicted in the composite output image 116 or apply other physical rules to depict animations, such as bouncing, waving, swaying, dancing, spinning, or other animations. A physical simulation can be applied to the cages of 3D primitive elements using a finite element method to create the deformation descriptions in order to produce elastic deformation or fracture. In some cases, such as where an avatar of a person is created, the deformation description is obtained using a face or body tracker 124. By selecting the viewpoint and the intrinsic camera parameter value, the characteristics of the synthesized output image can be controlled.
The image animator operates in a non-conventional manner to enable the generation of composite images of a dynamic scene in a controlled manner. Many alternative methods of generating composite images using machine learning have little or no control over what is depicted in the generated composite image.
The image animator 100 improves the functionality of the underlying computing device by enabling the composite image of the dynamic scene to be calculated in a manner whereby the content and view of the dynamic scene is controllable.
Alternatively or additionally, the functions of the image animator 100 are performed, at least in part, by one or more hardware logic components. For example, and without limitation, illustrative types of hardware logic components that may optionally be used include Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems-on-a-Chip (SOCs), Complex Programmable Logic Devices (CPLDs), and Graphics Processing Units (GPUs).
In other examples, the functionality of the image animator 100 is located at the client device or shared between the client device and the cloud.
Fig. 2 shows a deformation description 200 of a person's head and three images 204, 206, 208 calculated using the image animator 100 of fig. 1, each showing the person's head animated in a different manner, such as with the mouth open or closed. The deformation description 200 is a cage of primitive 3D elements, which are tetrahedra in the example of fig. 2, although other primitive 3D elements, such as spheres or cuboids, are used in some examples. In the example of fig. 2, the tetrahedral cage extends beyond the surface mesh of the person's head so as to include the volume around the head, which can be used to represent the person's hair and any headgear worn by the person. In the case of a general object such as a chair, the volume around the object in the cage is useful for two reasons. First, modeling the volume with a volume rendering method results in a more realistic image. Second, the cage only needs to approximate the mesh, which reduces the complexity of the cage for objects with many parts (a cage for a plant does not need a different part per leaf; it only needs to cover all the leaves) and allows the same cage to be used for objects of the same type with similar shapes (different chairs can use the same cage). The cage can be deformed and controlled intuitively by a user, by physics-based simulation, or by conventional automatic animation techniques such as fusion shapes. Faces are a particularly difficult case due to their unusual combination of stiff and (visco)elastic motion, and the present technique works well for faces, as described in more detail below. Once the radiation field is trained using the present technique, it generalizes to any geometric deformation that can be represented with a cage of 3D primitives constructed from its density. This opens up new possibilities for using volumetric models in games or in augmented reality/virtual reality contexts, where the user's manipulation of the environment is not known a priori.
In an example, the deformation description 200 is referred to as a volumetric three-dimensional deformable model (Vol3DMM), which is a parametric 3D face model that uses a skeleton and fusion shapes to animate the surface mesh of a person's head and the volume around the mesh.
The user or automated process can specify values for the parameters of the Vol3DMM model, which are used to animate the Vol3DMM model in order to create the images 204 through 208, as described in more detail below. Different values of the parameters of the Vol3DMM model are used to generate each of the three images 204 through 208. The Vol3DMM model, together with its parameter values, is an example of a deformation description.
Vol3DMM animates a volumetric mesh using a set of volumetric fusion shapes and a skeleton. It generalizes a parametric three-dimensional deformable model (3DMM), which animates a surface mesh with a skeleton and fusion shapes, to a parametric model that also animates the volume around the mesh.
The skeleton and fusion shapes of Vol3DMM are defined by extending the skeleton and fusion shapes of the parametric 3DMM face model. The skeleton has four bones, namely a root bone, a neck bone, a left-eye bone and a right-eye bone, which control rotation. To use the skeleton in Vol3DMM, the linear blend skinning weights are extended from the vertices of the 3DMM mesh to the vertices of the tetrahedra by nearest-vertex lookup, i.e., each tetrahedron vertex receives the skinning weights of the nearest vertex in the 3DMM mesh, as sketched below. The volumetric fusion shapes are created by extending the 224 expression fusion shapes and 256 identity fusion shapes of the 3DMM model to the volume surrounding its template mesh, i.e., the i-th volumetric fusion shape of Vol3DMM is created as a tetrahedral embedding of the mesh of the i-th 3DMM fusion shape. To create the tetrahedral embedding, a single volumetric structure is created from a generic mesh, and an accurate embedding is created taking facial geometry and facial deformation into account: the embedding avoids tetrahedron penetration between the upper and lower lips, defines a volumetric support covering the hair, and has higher resolution in the areas subject to more deformation. In an example, the exact numbers of bones and fusion shapes are inherited from the particular instance of the selected 3DMM model, but the technique can be applied to different 3DMM models that use fusion shapes and/or skeletons for modeling faces, bodies, or other objects.
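A minimal sketch of the nearest-vertex skin-weight transfer described above is given below; the array shapes and function names are illustrative assumptions, not part of the patent.

```python
import numpy as np

def transfer_skin_weights(mesh_vertices, mesh_skin_weights, tet_vertices):
    """Give each tetrahedron vertex the skin weights of its nearest 3DMM mesh vertex.

    mesh_vertices:     (M, 3) surface mesh vertex positions
    mesh_skin_weights: (M, B) linear blend skinning weights for B bones
    tet_vertices:      (T, 3) tetrahedral cage vertex positions
    returns:           (T, B) skin weights for the cage vertices
    """
    # Pairwise squared distances between cage vertices and mesh vertices.
    d2 = ((tet_vertices[:, None, :] - mesh_vertices[None, :, :]) ** 2).sum(-1)
    nearest = d2.argmin(axis=1)          # index of the closest mesh vertex
    return mesh_skin_weights[nearest]    # inherit its skinning weights

# Example usage with random data (four bones: root, neck, left eye, right eye).
mesh_v = np.random.rand(100, 3)
weights = np.random.rand(100, 4)
weights /= weights.sum(axis=1, keepdims=True)
tet_v = np.random.rand(250, 3)
tet_weights = transfer_skin_weights(mesh_v, weights, tet_v)
assert tet_weights.shape == (250, 4)
```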
As a result of this construction, Vol3DMM is controlled and posed with the same identity, expression, and pose parameters α, β, θ as the 3DMM face model. This means that it can be animated, by changing α, β, θ, with a face tracker built on the 3DMM face model and, more importantly, that it generalizes to any expression that can be represented by the 3DMM face model, as long as the face model fits the training frames well. During training, the tetrahedral mesh of Vol3DMM is posed using the parameters α, β, θ to define the physical space, while a canonical space is defined for each subject by posing Vol3DMM with the identity parameter α and setting β, θ to zero for a neutral pose. In an example, the decomposition into identity, expression, and pose is inherited from the particular instance of the selected 3DMM model. However, the techniques for training and/or animating accommodate different decompositions by constructing a corresponding Vol3DMM model for the particular 3DMM model selected.
Fig. 3 shows a chair 300 and a composite image 302 of the chair breaking, calculated using the image animator of fig. 1. In this case, the deformation description includes a cage surrounding the chair 300, wherein the cage is formed of primitive 3D elements such as tetrahedra, spheres, or cuboids. The deformation description also includes information, such as rules from a physics engine, about how the object behaves when it breaks.
FIG. 4 is a flow chart of an example method performed by the image animator of FIG. 1. Inputs 400 to the method include a deformation description, a camera viewpoint, and camera parameters. The camera viewpoint is the viewpoint of the virtual camera for which the composite image is to be generated. The camera parameters are lens and sensor parameters such as image resolution, field of view, and focal length. The type and format of the deformation description depend on the type and format of the deformation description used in the training data when the radiation field parameterization was trained. The training process is described later with respect to fig. 8. Fig. 4 relates to test-time operation after the radiation field parameterization has been learned. In some cases, the deformation description is a vector of concatenated parameter values of a parameterized 3D model of an object in the dynamic scene (such as a Vol3DMM model). In some cases, the deformation description is one or more physics-based rules from a physics engine to be applied to a cage of primitive 3D elements that encapsulates a 3D object to be rendered and extends into the volume surrounding the 3D object.
In some examples, the input 400 includes default values for some or all of the deformation description, the viewpoint, and the intrinsic camera parameters. In some cases, the input 400 comes from a user or from a gaming device or other automated process. In an example, input 400 is derived from a game state of a computer game or from a state received from a mixed reality computing device. In an example, a face or body tracker 420 provides values of the deformation description. The face or body tracker is a trained machine learning model that takes as input captured sensor data depicting at least a portion of a person's face or body and predicts values of parameters of a 3D face model or 3D body model of the person. The parameters are shape parameters, pose parameters, or other parameters.
The deformation description includes a cage 418 of primitive 3D elements. The cage of primitive 3D elements represents a 3D object to be depicted in an image and a volume extending from the 3D object. In some cases, such as where the 3D object is a human head or body, the cage includes a volumetric mesh having a plurality of volumetric fusion shapes and a skeleton. In some cases, where the 3D object is a chair or other general 3D object, the cage is computed from the learned radiation field by extracting a mesh from the density of the learned radiation field using the Marching Cubes method and computing a tetrahedral embedding of the mesh. The cage 418 of primitive 3D elements is a deformed version of a canonical cage. That is, to generate a modified version of the scene, the method begins by deforming the canonical cage into a desired shape; the deformed cage is the deformation description. The method is agnostic to how the deformed cage was generated and to what kind of object is deformed.
The use of a cage to control and parameterize volumetric deformation enables the deformation to be represented and applied to the scene in real time; the cage can represent both smooth and discontinuous functions and allows intuitive control by changing the geometry of the cage. This geometric control is compatible with machine learning models, physics engines, and art-creation software, thereby allowing good extrapolation or generalization to configurations not observed in training.
In the case of a cage formed from tetrahedra, using a set of tetrahedra corresponds to a piecewise linear approximation of the deformation field. Graphics Processing Unit (GPU) accelerated ray tracing allows the cage representation to be queried in milliseconds, even with highly complex geometries. A cage representation using tetrahedra can reproduce hard object boundaries by construction and can be edited in off-the-shelf software, as it consists only of points and triangles.
At operation 402, the dynamic scene image generator calculates a plurality of rays, each ray associated with a pixel of the output image 116 to be generated by the image animator. For a given pixel (an x, y position in the output image), the image animator computes a ray from the virtual camera through the pixel into the deformation description, which includes the cage. To calculate the ray, the image animator uses the camera viewpoint and the selected values of the intrinsic camera parameters. The rays are computed in parallel where possible for efficiency, since one ray is computed per pixel.
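The following is a minimal sketch of per-pixel ray generation under an assumed pinhole camera model; the names and conventions (y down, z forward) are illustrative assumptions rather than the patent's implementation.

```python
import numpy as np

def pixel_rays(height, width, focal, cam_to_world):
    """Return per-pixel ray origins and unit directions in world space.

    focal:        focal length in pixels (square pixels assumed)
    cam_to_world: (4, 4) camera pose; rays originate at the camera centre
    """
    j, i = np.meshgrid(np.arange(height), np.arange(width), indexing="ij")
    # Camera-space directions through each pixel centre.
    dirs = np.stack([(i - width * 0.5) / focal,
                     -(j - height * 0.5) / focal,
                     -np.ones_like(i, dtype=np.float64)], axis=-1)
    world_dirs = dirs @ cam_to_world[:3, :3].T
    world_dirs /= np.linalg.norm(world_dirs, axis=-1, keepdims=True)
    origins = np.broadcast_to(cam_to_world[:3, 3], world_dirs.shape)
    return origins, world_dirs

# Example: one ray per pixel of a 512 x 512 output image.
origins, directions = pixel_rays(512, 512, focal=500.0, cam_to_world=np.eye(4))
```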
For each ray, the image animator samples a plurality of points along the ray. In general, the more points sampled, the better the quality of the output image. Rays are randomly selected and samples are drawn within specified limits obtained from scene knowledge 416. In an example, the specified boundaries are calculated from training data that has been used to train a machine learning system. The boundaries indicate the size of the dynamic scene such that one or more samples are taken from regions of rays in the dynamic scene. In order to calculate the boundaries from the training data, standard image processing techniques are used to examine the training images. The boundaries of the dynamic scene can also be specified manually by an operator or measured automatically using a depth camera, global Positioning System (GPS) sensor, or other location sensor.
Each sample is assigned an index of the 3D primitive elements of the deformed cage within which the sample falls.
At operation 406, the image animator transforms the samples from the deformation description cages into canonical cages. A canonical cage is a version of a cage that represents a 3D object in a stationary state or other specified original state, such as where the parameter value is zero. In an example where the 3D object is a human head, the canonical cage represents a human head looking straight at the virtual camera, with eyes open and mouth closed and neutral expression.
In the case where the primitive 3D element is a tetrahedron, the barycentric coordinates as described below are used to calculate the transformation of the sample into a canonical cage. Using barycentric coordinates is a particularly efficient way of computing the transformation.
In the example where the cage uses tetrahedra, barycentric coordinates defined with respect to both the canonical tetrahedron X = {X_1, X_2, X_3, X_4} and the deformed tetrahedron x = {x_1, x_2, x_3, x_4} are used to map a point p in deformed space to a point P in canonical space.
The tetrahedron, the basic building block, is a four-sided pyramid. Its undeformed "rest" position is defined by its four constituent points:
X = {X_1, X_2, X_3, X_4}  (2)
and its deformed state x = {x_1, x_2, x_3, x_4} is denoted in lowercase. Because tetrahedra are simplices, barycentric coordinates (λ_1, λ_2, λ_3, λ_4) can be used to represent a point falling inside them with reference to either the set X or the set x.
The input point can be recovered as p = Σ_i λ_i x_i, with Σ_i λ_i = 1. If p falls inside the deformed tetrahedron, its rest position P in canonical space is obtained as:
P = Σ_i λ_i X_i  (3)
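Below is a minimal sketch, under standard barycentric-coordinate assumptions, of mapping a sample p inside a deformed tetrahedron x to its rest position P in the canonical tetrahedron X, following equations (2) and (3); the helper names are illustrative.

```python
import numpy as np

def barycentric_coords(p, x):
    """Solve p = sum_i lambda_i * x_i with sum_i lambda_i = 1 for a tetrahedron x of shape (4, 3)."""
    # Express p - x_4 in the basis of three edge vectors of the tetrahedron.
    T = np.stack([x[0] - x[3], x[1] - x[3], x[2] - x[3]], axis=1)  # (3, 3)
    l123 = np.linalg.solve(T, p - x[3])
    return np.append(l123, 1.0 - l123.sum())  # (lambda_1, ..., lambda_4)

def to_canonical(p, x_deformed, X_canonical):
    lam = barycentric_coords(p, x_deformed)
    return lam @ X_canonical  # P = sum_i lambda_i * X_i

# Example: canonical tetrahedron is the deformed one translated by (2, 0, 0).
x = np.array([[0., 0., 0.], [1., 0., 0.], [0., 1., 0.], [0., 0., 1.]])
X = x + np.array([2.0, 0.0, 0.0])
p = np.array([0.25, 0.25, 0.25])
print(to_canonical(p, x, X))   # -> [2.25, 0.25, 0.25]
```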
In the case where the primitive 3D element is a sphere or a cuboid, the transformation of the sample into the canonical cage is instead calculated using an affine transformation, which is sufficiently expressive for large rigidly moving parts of the motion field.
From each camera, rays are emitted into the physical (deformed) space; the tetrahedron x^0 containing each sample p along a ray is detected, and its barycentric coordinates are calculated such that p = Σ_i λ_i x_i^0 with Σ_i λ_i = 1.
In the case where the 3D element is a tetrahedron, an optimization is optionally used at operation 406 to compute the transformation by optimizing the primitive point lookup. The optimization includes calculating the transformed sample P by setting P equal to the normalized distance between the previous and next intersection points of the tetrahedron, multiplied by the sum over the four tetrahedron vertices of the barycentric coordinates at the previous intersection point times the canonical coordinates of the vertices, plus one minus the normalized distance, multiplied by the sum over the four tetrahedron vertices of the barycentric coordinates at the next intersection point times the canonical coordinates of the vertices (see the rewritten equation (3) below). This optimization was found to give a significant improvement in processing time, making real-time operation of the process of fig. 4 possible at more than 30 frames per second (i.e., computing more than 30 images per second where the processor is a single RTX 3090 (trade mark) graphics processing unit).
Operation 407 is optional and includes rotating the view direction of at least one of the rays. In this case, for one of the transformed samples, the view direction of the sample's ray is rotated before the learned radiation field is queried. Calculating the rotation R of the view direction for a small fraction of the primitive 3D elements and propagating the value of R to the remaining tetrahedra via nearest-neighbor interpolation is found to give good results in practice.
For each sample point, the dynamic scene image generator queries 408 the radiation field parameterization 102. Given a point and associated viewing direction in the canonical 3D cage, the radiation field parameterization has been trained to produce a color value and a density value. In response to each query, the radiation field parameterization produces a pair of values comprising color and opacity at the sampling point in the canonical cage. In this way, the method uses the applied deformation description to calculate a plurality of color values and opacity values 410 for the 3D points and view directions in the canonical cage.
In an example, the learned radiation field parameterization 102 is a cache of associations between 3D points and view directions and color values and opacity values in a canonical version of the cage obtained by querying a machine learning model trained using training data comprising images of the dynamic scene from multiple viewpoints. Significant acceleration is achieved by using a cache of values rather than querying the machine learning model directly.
The radiation field is a function that is queried to obtain the color c and the density σ at a location in space. The color of a pixel on the image plane is obtained via volume rendering using the emission-absorption form of the volume rendering equation:
C = Σ_{i=1}^{N} T_i (1 − exp(−σ_i δ_i)) c_i,  (1)
where δ_i = (p_{i+1} − p_i) is the distance between samples (N in total) along the ray, and the transmittance T_i is defined as T_i = exp(−Σ_{j<i} σ_j δ_j). The field is typically modeled by a multi-layer perceptron (MLP), an explicit voxel grid, or a combination of both. In addition to the sample position p, the field also depends on the direction d of the ray, which allows it to model view-dependent effects such as specular reflection.
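A minimal sketch of the emission-absorption compositing in equation (1) follows; it assumes per-sample colors, densities, and inter-sample distances are already available, and the names are illustrative.

```python
import numpy as np

def volume_render(colors, sigmas, deltas):
    """Composite per-sample colors and densities along one ray as in equation (1).

    colors: (N, 3) color c_i per sample
    sigmas: (N,)   density sigma_i per sample
    deltas: (N,)   distance delta_i between consecutive samples
    """
    alpha = 1.0 - np.exp(-sigmas * deltas)              # opacity of each segment
    # Transmittance T_i = exp(-sum_{j<i} sigma_j * delta_j)
    T = np.exp(-np.concatenate([[0.0], np.cumsum(sigmas * deltas)[:-1]]))
    weights = T * alpha
    return (weights[:, None] * colors).sum(axis=0)      # pixel color

# Example usage with random per-sample values.
pixel = volume_render(np.random.rand(64, 3), np.random.rand(64), np.full(64, 0.05))
```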
For each ray, the volume rendering 412 method is applied to the color values and opacity values calculated along the ray to produce pixel values of the output image. Any well-known computer graphics method for volumetric ray tracing is used. Where real-time operation is desired, hardware accelerated volume rendering is used.
The output image is stored 414 or inserted into a virtual network camera stream or used for telepresence, gaming, or other applications.
FIG. 5 is a schematic diagram of rays in a deformed cage 500, rays transformed to a canonical cage 502, a volume lookup 504, and volume rendering 506. To render a single pixel, a ray is cast from the camera center through the pixel into the scene in its deformed state. A plurality of samples are generated along the ray, and each sample is then mapped to the canonical space using the deformation M_j corresponding to tetrahedron j. The transformed sample position p′_j and the ray direction rotated according to the rotation of the j-th tetrahedron are then used to query the volumetric representation of the scene. The resulting per-sample opacity values and color values are then combined using volume rendering as in equation (1).
The density and color at each point in the scene are functions of both the sample position and the view direction. If the sample position is moved but the view direction remains unchanged, the light reflected from the elements of the scene will look the same for every deformation. To alleviate this problem, the view direction of each sample is rotated by the rotation between the canonical tetrahedron and its deformed equivalent:
v′ = R v,
U, E, V = SVD((X − c_X)^T (X′ − c′_X)),
R = U V^T,
where c_X and c′_X are the centroids of the canonical and deformed states of the tetrahedron into which a given sample falls. With this approach, the direction in which light is reflected at each point of the scene will match the deformation caused by the tetrahedral mesh. Note, however, that the reflected light will represent the scene in its canonical pose.
In practice, it is inefficient to calculate R for every sample in the scene, or even for every tetrahedron, as it requires computing a Singular Value Decomposition (SVD). Instead, a stochastic scheme is employed, in which R is calculated for a small fraction ρ of the tetrahedra and propagated to the remaining tetrahedra via nearest-neighbor interpolation. In the experiments described herein, ρ = 0.05.
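Below is a minimal sketch of estimating R for a random fraction ρ of tetrahedra via SVD and propagating it to the rest; the reflection guard and the nearest-centroid propagation are illustrative assumptions, not taken from the patent text.

```python
import numpy as np

def tet_rotation(X, X_def):
    """Best-fit rotation between a canonical tetrahedron X and its deformed state X_def (both (4, 3))."""
    A = (X - X.mean(0)).T @ (X_def - X_def.mean(0))
    U, _, Vt = np.linalg.svd(A)
    R = U @ Vt
    if np.linalg.det(R) < 0:      # guard against reflections
        U[:, -1] *= -1
        R = U @ Vt
    return R

def sampled_rotations(X_all, X_def_all, rho=0.05, rng=np.random.default_rng(0)):
    """X_all, X_def_all: (T, 4, 3) canonical and deformed tetrahedra; returns a (T, 3, 3) rotation per tetrahedron."""
    T = X_all.shape[0]
    chosen = rng.choice(T, size=max(1, int(rho * T)), replace=False)
    R_chosen = np.stack([tet_rotation(X_all[t], X_def_all[t]) for t in chosen])
    # Propagate to the remaining tetrahedra via nearest deformed-centroid lookup.
    centroids = X_def_all.mean(axis=1)
    d2 = ((centroids[:, None] - centroids[chosen][None]) ** 2).sum(-1)
    return R_chosen[d2.argmin(axis=1)]

rotations = sampled_rotations(np.random.rand(200, 4, 3), np.random.rand(200, 4, 3))
```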
More details about an example of primitive point lookup are now presented.
For complex meshes, it is expensive to test every tetrahedron for association with every input point, given the cost of the point-in-tetrahedron test. For non-self-intersecting tetrahedral meshes, the notion of a point being "in front of" or "behind" a particular triangle is uniquely determined by the winding order of the triangle's vertices. Determining which tetrahedron a point belongs to therefore corresponds to emitting a ray from the point in a random direction, evaluating the triangle at the first intersection, and checking which side of that triangle the sample lies on. This uniquely identifies the tetrahedron, as each triangle can belong to at most two tetrahedra. In particular, these queries are very efficient in terms of storage and computation when hardware acceleration is available.
In an example, the same acceleration is applied to arbitrary triangle shapes, so that tetrahedra can be combined with rigidly moving shapes defined by triangles, which do not need to be filled with tetrahedra but can be treated as a single unit in terms of deformation. Second, the number of point-in-tetrahedron tests required is reduced by noting that many samples along a single ray can fall into the same element. Knowing the previous and next intersections, a simple depth test determines which samples fall into a given tetrahedron. Barycentric coordinates are linear, and thus barycentrically interpolated values are obtained by interpolating the values at the previous and next intersection points within each element. To this end, equation (3) is rewritten as:
P = a Σ_i λ_i^1 X_i + (1 − a) Σ_i λ_i^2 X_i,
where superscripts 1 and 2 refer to the previous and next intersection points, and a is the normalized distance between the two intersection points that defines the interpolation point of the method.
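A minimal sketch of the per-ray interpolated lookup in the rewritten equation (3) follows, using the same convention for a and for superscripts 1 (previous intersection) and 2 (next intersection); names are illustrative.

```python
import numpy as np

def canonical_from_intersections(lam_prev, lam_next, a, X_canonical):
    """lam_prev, lam_next: (4,) barycentric coordinates at the previous and next
    ray/tetrahedron intersections. a: normalized distance between the two
    intersections defining the interpolation point. X_canonical: (4, 3) canonical vertices."""
    return a * (lam_prev @ X_canonical) + (1.0 - a) * (lam_next @ X_canonical)

# Example usage with an arbitrary canonical tetrahedron.
X = np.array([[0., 0., 0.], [1., 0., 0.], [0., 1., 0.], [0., 0., 1.]])
P = canonical_from_intersections(np.array([0.7, 0.1, 0.1, 0.1]),
                                 np.array([0.1, 0.1, 0.1, 0.7]), a=0.5, X_canonical=X)
```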
Due to this modification, each point value remains stable even if the "wrong" side of a triangle (or an entirely incorrect triangle) is queried because of limited numerical accuracy. In contrast to the per-point formulation of the tetrahedral index lookup, an important side effect of this per-ray formulation is that it integrates naturally with the ray-marching scheme used for rendering, in which rays are terminated based on transmittance; this is naturally supported by the reformulated tetrahedral search algorithm.
Fig. 6 is a flow chart of a method of sampling. The method comprises querying the learned radiation field of the 3D scene to obtain color values and opacity values using only one radiation field network 600, and increasing the size of the sampling bins 602.
Volume rendering typically involves sampling depth along each ray. In an example, there are sampling strategies that enable capturing thin structures and fine details and improving sampling boundaries. The method gives an improved quality with a fixed sample count.
Some schemes represent a scene with two multi-layer perceptrons (MLPs): a "coarse" MLP and a "fine" MLP. First, N_c samples are evaluated by the coarse network to obtain a coarse estimate of the opacity along the ray. These estimates then guide a second round of N_f samples placed around the locations where the opacity is greatest. The fine network is then queried at both the coarse and the fine sample locations, resulting in N_c evaluations of the coarse network and N_c + N_f evaluations of the fine network. During training, the two MLPs are optimized independently, but only samples from the fine MLP contribute to the final pixel color. The inventors have recognized that the first N_c samples evaluated in the coarse MLP are not used to render the output image and are therefore effectively wasted.
To improve efficiency, the fine network is not queried at the locations of the coarse samples; instead, the outputs from the first round of coarse samples are reused with a single MLP network.
Simply using one network instead of two results in artifacts, where the region around a segment of the ray that has been assigned high weight can be cropped, as illustrated in 606, 608 of fig. 6. Cropping can occur because the bin placement used for drawing fine samples treats the density as a step function at the sample locations rather than as a point estimate of a smooth function. Therefore, the size of each importance-sampling bin 610 is doubled, allowing the importance samples to cover the entire range between coarse samples, as illustrated in 612, 614 of fig. 6.
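The following sketch is one illustrative interpretation of importance sampling with doubled bins: a coarse bin is chosen in proportion to its weight and a fine sample is drawn uniformly from a bin that reaches the neighbouring coarse samples on both sides. It is an assumption-laden sketch, not the patent's exact algorithm.

```python
import numpy as np

def importance_samples(t_coarse, weights, n_fine, rng=np.random.default_rng(0)):
    """t_coarse: (Nc,) sorted coarse sample depths; weights: (Nc,) non-negative per-sample weights."""
    p = weights / weights.sum()
    idx = rng.choice(len(t_coarse), size=n_fine, p=p)     # pick bins by weight
    # Doubled bins: each bin stretches from the previous coarse sample to the next one.
    prev_t = t_coarse[np.maximum(idx - 1, 0)]
    next_t = t_coarse[np.minimum(idx + 1, len(t_coarse) - 1)]
    t_fine = rng.uniform(prev_t, next_t)
    return np.sort(np.concatenate([t_coarse, t_fine]))    # reuse coarse samples plus fine samples

depths = importance_samples(np.linspace(0.0, 1.0, 32), np.random.rand(32), n_fine=64)
```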
Fig. 7 is a flow chart of a method of computing an image of a person depicting their mouth open. The method of fig. 7 is optionally used where only one or two time instances are present in the training images. If many time instances are available in the training images, the process of fig. 7 is not required. In the method, the cage represents a face of a person and includes a mesh of the mouth interior 700, a first plane representing the person's upper row of teeth, and a second plane 702 representing the person's lower row of teeth. The method includes checking 704 whether one of the samples falls in the mouth interior and, if so, using information about the first plane and the second plane to calculate 708 a transformation of the sample. The transformed sample is used to query 710 the radiation field, and the method proceeds as in fig. 4.
In the example, a separate deformation model is defined for the mouth interior; it is defined by closed triangle primitives and animated by two rigidly moving planes, one for each row of teeth.
The model is trained using a single frame and thus operates in a minimal-data training regime. The animation is then driven by the "a priori" known animation model, Vol3DMM, with which the face is animated. The cage geometry model therefore keeps the primitives non-self-intersecting (to allow real-time rendering) and drives them with Vol3DMM. For the special case of the mouth interior, a completely filled tetrahedral cavity is not an appropriate choice, as the rendered teeth would deform as the mouth opens, resulting in an unrealistic appearance and motion. The alternative of placing rigidly deforming tetrahedra around the teeth would require very high accuracy in the geometry estimate.
Instead, different primitives are selected for the mouth interior. Firstly, the inside of the mouth is filled with tetrahedra as if it were treated identically to the rest of the head, and secondly, the indices of the exterior triangles of the tetrahedra corresponding to the mouth interior are recorded, effectively forming a surface mesh for the mouth interior. The surface mesh moves with the rest of the head and is used to determine which samples fall within the mouth interior, but it is not used to deform them back into the canonical space. GPU-accelerated ray tracing supports both tetrahedron-defined and triangle-defined primitives, allowing the primitives that drive the animation to be changed.
To model the deformation, two planes are used, one placed just below the top teeth and one just above the bottom teeth. Both planes, together with the basic volumetric 3D deformable model of the face, move rigidly (i.e., both remain planar). It is assumed that the teeth move rigidly with these planes; the tongue is not supported, so the space between the planes is assumed to be empty.
The mouth interior is animated by the following steps, with a surface mesh defining the entire mouth and both planes.
(1) The primitive in which each sample falls is detected, and it is checked whether that primitive is a mouth-interior primitive.
(2) For each sample within the intra-mouth primitive, the signed distances to the upper and lower planes are used to determine whether it falls into the upper or lower mouth.
(3) The coordinates of the sample in the canonical space are calculated by 1) calculating the coordinates of the sample with respect to the relevant plane, 2) finding the position of the plane in the canonical space, and 3) assuming that the relative coordinates of the sample with respect to the relevant plane remain unchanged.
In an example, the canonical pose is a mouth-closed canonical pose, i.e., where the teeth overlap (the top of the bottom tooth is below the bottom of the upper tooth). As a result, the upper mouth region and the lower mouth region partially overlap in the canonical space. Thus, the color and density learned in the canonical space must be the average of the corresponding regions in the upper and lower mouths. To overcome this obstacle, the canonical regions for the inside of the upper and lower mouth are placed outside the tetrahedron cage, on its left and right sides. Together with the assumption of a blank space between the two planes, this placement results in a bijective mapping of the sample from inside the mouth in the deformation space to the canonical space, allowing a correct learning of the radiation field for that region.
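Below is a minimal sketch of steps (1) to (3) for a mouth-interior sample, assuming each tooth plane is given by an origin and an orthonormal frame in both the deformed and canonical states; the signed-distance test that selects the upper or lower plane is omitted, and all names are illustrative.

```python
import numpy as np

def plane_relative(p, origin, frame):
    """Coordinates of p relative to a plane defined by an origin and a (3, 3) orthonormal frame."""
    return frame.T @ (p - origin)

def to_canonical_mouth(p, deformed_origin, deformed_frame, canonical_origin, canonical_frame):
    # Relative coordinates with respect to the relevant plane are assumed unchanged.
    local = plane_relative(p, deformed_origin, deformed_frame)
    return canonical_origin + canonical_frame @ local

# Example: the lower-teeth plane rigidly rotated and translated as the jaw opens.
R = np.array([[1, 0, 0], [0, 0.96, -0.28], [0, 0.28, 0.96]])  # small rotation about x
p_def = np.array([0.01, -0.02, 0.05])
p_can = to_canonical_mouth(p_def,
                           deformed_origin=np.array([0.0, -0.03, 0.04]),
                           deformed_frame=R,
                           canonical_origin=np.array([0.0, -0.02, 0.05]),
                           canonical_frame=np.eye(3))
```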
Fig. 8 is a flow chart of a method of training a machine learning model and calculating a cache for use in the image animator 100. Training data 800 is accessed that includes images of a scene (static or dynamic) taken from a number of viewpoints. It is possible to use a set of images of a static scene as training data. A sequence in which each image represents a scene in a different state can also be used.
Fig. 8 is first described for the case where the images are images of the scene from a plurality of different viewpoints obtained at the same time instance, or at two time instances, so that the amount of training data required for registration is relatively low. Using only one or two time instances improves the accuracy with which the face tracker computes the ground truth parameter values of the deformation description, because the face tracker introduces errors, and using it for frames at many time instances accumulates more errors.
The training data images are real images such as photographs or video frames. The training data images may also be composite images. A tuple of values is extracted 601 from each training data image, wherein each tuple is a deformation description, a camera viewpoint, camera intrinsic parameters, and the color of a given pixel.
In the example of the chair from fig. 3, the training data includes images of the chair taken from many different known viewpoints at the same time. The images are composite images generated using computer graphics techniques. A tuple of values is extracted from each training image, where each tuple is a deformation description, a camera viewpoint, camera intrinsic parameters, and the color of a given pixel. The deformation description is a cage that is determined 802 by placing a cage of primitive 3D elements around, and extending from, the chair using known image processing techniques. A user or an automated process, such as a computer game, triggers a physics engine to deform the cage using physical rules, to break the chair as it falls under gravity or as it is subjected to pressure from another object.
To form training data, samples are acquired 804 along rays in the cage by emitting rays from the viewpoint of the camera capturing the training images into the cage. Samples are taken along the rays as described with reference to fig. 4. Each sample is assigned the index of one of the 3D primitive elements of the cage according to the element within which the sample falls. The samples are then transformed 806 to the canonical cage, which is a version of the cage in a rest position. The transformed samples are used to calculate an output pixel color using volume rendering. The output pixel color is compared to the ground truth pixel color of the training image, and a loss function is used to evaluate the difference or error. The loss function output is used to perform back propagation to train 808 the machine learning model and output a trained machine learning model 810. The training process is repeated for many samples until convergence is reached.
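A minimal, framework-agnostic sketch of one training iteration described above follows; every callable passed in is a stand-in for the corresponding operation (sampling, transformation to the canonical cage, field query, volume rendering, optimizer step), not a real API.

```python
import numpy as np

def training_step(rays, gt_colors, sample_fn, to_canonical_fn, field_fn, render_fn, step_fn):
    """One training iteration: render pixels through the deformed cage and
    compare them to ground-truth pixel colors (simple L2 photometric loss assumed)."""
    samples, deltas = sample_fn(rays)              # samples along rays in the cage (operation 804)
    canonical = to_canonical_fn(samples)           # transform samples to the canonical cage (operation 806)
    colors, sigmas = field_fn(canonical)           # query the radiation field parameterization
    predicted = render_fn(colors, sigmas, deltas)  # volume rendering to pixel colors
    loss = np.mean((predicted - gt_colors) ** 2)   # error against the ground truth pixels
    step_fn(loss)                                  # back-propagate and update the model (operation 808)
    return loss
```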
The resulting trained machine learning model 810 is used to calculate and store a cache 812 of associations between 3D positions and view directions and color values and opacity values in the canonical cage. This is done by querying the trained machine learning model 810 for a range of 3D locations and storing the results in a cache.
In the example of a face from fig. 2, the training data includes images of a person's face taken simultaneously from many different known viewpoints. Associated with each training data image are values of parameters of the Vol3DMM model of the person's face and head. The parameters include the pose (position and orientation) of the eyes and of the bones of the neck and jaw, as well as fusion shape parameters specifying characteristics of the facial expression, such as eyes closed/open, mouth closed/open, smile/no smile, etc. The images are real images of the person captured using one or more cameras with known viewpoints. A 3D model is fitted to each image using any well-known model fitting procedure, whereby the values of the parameters of the 3DMM model used to generate the Vol3DMM are searched for a set of values that enables the 3D model to describe the observed real image. Each real image is then labeled with the found parameter values, which are the values of the deformation description, and with the known camera viewpoint of the camera used to capture the image.
Then, the processes of operations 802 to 812 of fig. 8 are performed.
The machine learning model is trained 808 with a training objective that attempts to minimize the difference between the colors produced by the machine learning model and the colors given in the ground truth training data.
In some examples involving facial animation, sparsity loss is optionally applied in the volume around the head and in the mouth interior.
The sparsity loss makes it possible to handle incorrect background reconstruction and mitigates problems caused by dis-occlusion in the mouth-interior region. In an example, a Cauchy loss is used:
L_sparsity = (λ_s / N) Σ_i Σ_k log(1 + 2 σ(r_i(t_k))²),
where i indexes the rays r_i emitted from the training cameras and k indexes the samples t_k along each ray. N is the number of samples to which the loss is applied, λ_s is a scalar hyper-parameter weighting the loss, and σ is the opacity returned by the radiation field parameterization. To ensure that space is evenly covered by the sparsity loss, it is applied to the "coarse" samples. Other sparsity-inducing losses, such as an l1 loss or weighted least squares, are also effective. The sparsity loss is applied in two regions: in the volume around the head and in the mouth interior. Applied to the volume around the head, the sparsity loss prevents opaque regions from appearing where there is insufficient multi-view information to separate foreground from background in 3D. To detect these regions, the loss is applied to (1) samples that fall in tetrahedral primitives, as this is the region rendered at test time, and (2) samples that belong to rays that fall in the background of the training image, as detected by a 2D face segmentation network applied to the training image. The sparsity loss is also applied to coarse samples that fall within the mouth-interior volume. This prevents opaque regions from being created inside the mouth in areas that are not visible during training, and are therefore not supervised, but become dis-occluded at test time.
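A minimal sketch of the Cauchy sparsity term follows; the constant inside the logarithm and the default weight are assumptions for illustration.

```python
import numpy as np

def sparsity_loss(sigma, lambda_s=1e-4):
    """Cauchy-style sparsity penalty on the opacities of the selected coarse samples.

    sigma: (N,) opacities of the samples the loss is applied to."""
    return lambda_s / sigma.size * np.log1p(2.0 * sigma ** 2).sum()

# Example usage on random opacities.
loss = sparsity_loss(np.random.rand(1024))
```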
The sparsity loss within the mouth ensures that there is no unnecessary density inside the mouth interior. However, the color behind regions occluded in the training frame remains undefined, resulting in unpleasant artifacts when these regions become dis-occluded at test time. The solution herein is to copy the color and density of the last sample along each ray falling in the mouth interior, which allows the color of a dis-occluded area at test time to match the learned color of the visible area between the teeth at training time.
The present technology has been empirically tested for a first application and a second application.
In a first experiment, a physics-based simulation is used to control the deformation of a static object (an aircraft propeller) undergoing complex topological changes, and a photo-realistic image of the process is rendered for each step of the simulation. This experiment shows the representational capability of the deformation description and the ability to render images of physical deformations that are difficult to capture with a camera. Datasets of the propeller undergoing successive compression and rotation are synthesized. For both types of deformation, 48 time frames are rendered for 100 cameras. The present technique is trained only on the first frame, which can be considered static, but is supplied with a coarse tetrahedral mesh describing the motion of the sequence. In the first application, the average peak signal-to-noise ratio of the present technique for interpolation (every other frame, not seen in training) is 27.72, compared to 16.63 for an alternative that does not use a cage and instead uses positional encoding of the time signal. The peak signal-to-noise ratio for extrapolation over time (the second half of the frames, not seen in training) is 29.87 for the present technique, compared to 12.78 for the alternative technique. The present technique computes an image in the first application in about 6 ms per frame at a resolution of 512 x 512, compared to about 30 s for the alternative technique.
In a second experiment, a photo-realistic animation of a human head avatar is computed in real time using a fusion-shape-based face tracker. The avatar is trained using 30 images of the subject taken from different viewpoints at the same time instant. Thus, for each avatar, the method sees only a single facial expression and pose. To animate the head avatar, the control parameters of the parametric 3DMM face model, extended from the surface mesh to the volume around it, are used. The resulting parametric volumetric facial model is called Vol3DMM. Building on the parametric face model allows generalizing to facial expressions and poses that are not seen at training time and using a face tracker built on top of it for real-time control. A key benefit of the method is that hair, accessories, and other elements are captured by the cage. The proposed solution can be applied to the whole body.
In the second experiment, multi-view facial data was acquired with a camera rig that captured synchronized video from 31 cameras at 30 frames per second. The cameras are located 0.75-1 m from the subject, with viewpoints spanning 270° around the head and focused mostly on front views within ±60°. The illumination is non-uniform. All images were downsampled to 512 x 512 pixels and color corrected to have consistent color characteristics across the cameras. The camera poses and intrinsic parameters are estimated using a standard structure-from-motion pipeline.
For the second experiment, a speech sequence with natural head movements was captured for four subjects. Half of the subjects additionally performed various facial expressions and head rotations. To train the model for each subject, face tracking results from a face tracker and images from multiple cameras at a single time instance (frame) are used. The frame is chosen to meet the criteria that 1) a significant area of teeth is visible and the bottom of the upper teeth is above the top of the lower teeth, so that a plane can be placed between them, 2) the subject is looking forward and some of the white of the eye is visible on both sides of the iris, 3) the face fit for the frame is accurate, and 4) the texture of the face is not too wrinkled (e.g., in the nasolabial folds) due to the mouth opening. When a single frame satisfying 1-4 is not available, two frames are used: a frame in which the subject has a forward-looking neutral expression satisfying 2-4, used to train everything except the mouth interior, and a frame in which the mouth is open and which satisfies 1 and 3, used to train the mouth interior.
The present technique was found to provide 0.1 dB better PSNR than the baseline technique and a 10% improvement in learned perceptual image patch similarity (LPIPS). The baseline technique uses an explicit mesh and does not have a cage that extends beyond the face.
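As context for the reported metrics, PSNR is computed per rendered image against the corresponding ground-truth photograph and averaged over the evaluation frames. The snippet below is a minimal sketch of such a metric computation; it is not part of the disclosed method, and the function name, the NumPy dependency and the assumption of images scaled to [0, 1] are our own.

```python
import numpy as np

def psnr(reference: np.ndarray, rendered: np.ndarray, max_value: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB between two images of equal shape."""
    mse = np.mean((reference.astype(np.float64) - rendered.astype(np.float64)) ** 2)
    if mse == 0.0:
        return float("inf")  # identical images
    return 10.0 * np.log10((max_value ** 2) / mse)

# Hypothetical usage: average over a list of held-out (ground truth, rendered) frame pairs.
# mean_psnr = np.mean([psnr(gt, out) for gt, out in zip(ground_truth_frames, rendered_frames)])
```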
FIG. 9 illustrates various components of an exemplary computing-based device 900 implemented as any form of computing and/or electronic device, and in which embodiments of an image animator are implemented in some examples.
The computing-based device 900 includes one or more processors 914, which are microprocessors, controllers, or any other suitable type of processor for processing computer-executable instructions to control the operation of the device to generate a composite image of a dynamic scene in a controlled manner. In some examples, such as where a system-on-chip architecture is used, processor 914 includes one or more fixed function blocks (also referred to as accelerators) that implement a portion of the method of any of fig. 4-8 in hardware (rather than software or firmware). Platform software, including an operating system 908 or any other suitable platform software, is provided at the computing-based device to enable application software 910 to execute on the device. The data store 922 holds output images, values of face tracker parameters, values of physics engine rules, intrinsic camera parameter values, viewpoints, and other data. An animator 902 comprising a radiation field parameterization 904 and a volume renderer 906 is present at the computing-based device 900.
Computer-executable instructions are provided using any computer-readable medium accessible by the computing-based device 900. Computer-readable media includes, for example, computer storage media such as memory 912 and communication media. Computer storage media, such as memory 912, includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules, and the like. Computer storage media includes, but is not limited to, random Access Memory (RAM), read Only Memory (ROM), erasable Programmable Read Only Memory (EPROM), electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disk read only memory (CD-ROM), digital Versatile Disks (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium for storing information for access by a computing device. Rather, the communication media embodies computer readable instructions, data structures, program modules, etc. in a modulated data signal such as a carrier wave or other transport mechanism. As defined herein, computer storage media does not include communication media. Accordingly, computer storage media should not be construed as propagating signals themselves. Although the computer storage media (memory 912) is shown within the computing-based device 900, it will be appreciated that in some examples, the storage is distributed or remotely located and accessed via a network or other communication link (e.g., using communication interface 916).
The computing-based device 900 has an optional capture device 918 to enable the device to capture sensor data, such as images and video. The computing-based device 900 has an optional display device 920 for displaying output images and/or parameter values.
Alternatively or in addition to other examples described herein, examples include any combination of the following clauses:
clause a, a computer-implemented method of computing an image of a dynamic 3D scene comprising a 3D object, the method comprising:
Receiving a description of a deformation of the 3D object, the description comprising a cage of primitive 3D elements and associated animation data from a physics engine or an articulated object model;
for pixels of the image, computing rays from a virtual camera into the cage through the pixels, animated according to the animation data, and computing a plurality of samples on the rays, each sample being a 3D position and view direction in one of the 3D elements;
Computing a transformation of the samples into a canonical version of the cage to generate transformed samples;
Querying a learned radiation field parameterization of the 3D scene for each transformed sample to obtain color values and opacity values;
a volume rendering method is applied to the color values and the opacity values to generate pixel values of the image.
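As an illustration of the flow in clause A, the sketch below walks through the per-pixel procedure: ray casting into the animated cage, sampling, transformation into the canonical cage, querying the learned radiation field parameterization, and volume rendering. It is a minimal sketch only; the camera, cage and radiance field objects and their methods are hypothetical placeholders, while the composite function shows standard volume-rendering quadrature.

```python
import numpy as np

def composite(colors, sigmas, deltas):
    """Standard volume-rendering quadrature: alpha-composite samples along one ray."""
    alphas = 1.0 - np.exp(-sigmas * deltas)                     # per-sample opacity
    trans = np.cumprod(np.concatenate(([1.0], 1.0 - alphas[:-1] + 1e-10)))
    weights = alphas * trans                                    # contribution of each sample
    return (weights[:, None] * colors).sum(axis=0)              # final RGB pixel value

def render_pixel(pixel, camera, deformed_cage, canonical_cage, radiance_field, num_samples=64):
    """Sketch of the per-pixel procedure of clause A (helper objects are assumed)."""
    # 1. Cast a ray from the virtual camera through the pixel into the animated cage.
    origin, direction = camera.ray_through_pixel(pixel)

    # 2. Take samples along the ray; each sample is a 3D position (plus the view
    #    direction) lying inside one of the primitive 3D elements, e.g. a tetrahedron.
    positions, element_ids, deltas = deformed_cage.sample_ray(origin, direction, num_samples)

    colors, sigmas = [], []
    for position, element_id in zip(positions, element_ids):
        # 3. Transform the sample into the canonical version of the cage.
        canonical_position = deformed_cage.to_canonical(position, element_id, canonical_cage)
        # 4. Query the learned radiation field parameterization for color and opacity.
        rgb, sigma = radiance_field(canonical_position, direction)
        colors.append(rgb)
        sigmas.append(sigma)

    # 5. Volume-render the per-sample colors and opacities into one pixel value.
    return composite(np.array(colors), np.array(sigmas), np.array(deltas))
```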
Clause B, the method of clause a, wherein the cage of primitive 3D elements represents the 3D object and a volume extending from the 3D object.
Clause C, the method of clause B, wherein the cage comprises a volumetric mesh having a plurality of volumetric blend shapes and a skeleton.
Clause D, the method of clause B, wherein the cage is calculated from the learned radiation field parameterization by calculating a mesh from the density of the learned radiation field parameterization using a marching cubes method and calculating a tetrahedral embedding of the mesh.
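As a hedged illustration of clause D, a density grid sampled from the learned radiation field parameterization can be turned into a surface mesh with a marching cubes routine such as skimage.measure.marching_cubes; the subsequent tetrahedral embedding is indicated only as a commented-out hypothetical helper, since the disclosure does not name a specific tetrahedralization tool. The grid resolution, threshold and bounds below are illustrative choices.

```python
import numpy as np
from skimage import measure

def cage_from_density(density_fn, resolution=128, threshold=0.5, bounds=(-1.0, 1.0)):
    """Sketch of clause D: surface mesh via marching cubes, then a tetrahedral cage."""
    lo, hi = bounds
    xs = np.linspace(lo, hi, resolution)
    grid = np.stack(np.meshgrid(xs, xs, xs, indexing="ij"), axis=-1)

    # Sample the learned parameterization's density on a regular 3D grid.
    density = density_fn(grid.reshape(-1, 3)).reshape(resolution, resolution, resolution)

    # Extract a triangle mesh of the chosen density level set.
    verts, faces, _, _ = measure.marching_cubes(density, level=threshold)
    verts = lo + verts * (hi - lo) / (resolution - 1)   # voxel indices -> world coordinates

    # Tetrahedral embedding of the mesh (hypothetical helper, e.g. a TetGen-style tool):
    # tetrahedra = tetrahedralize(verts, faces)
    return verts, faces
```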
Clause E, the method of any of the preceding clauses, further comprising calculating the transformation P of the sample by setting P equal to a normalized distance between a previous intersection point and a next intersection point of a tetrahedron on the ray, multiplying a sum of barycentric coordinates of the vertices at the previous intersection point on four vertices of the tetrahedron times canonical coordinates of the vertices, adding one minus the normalized distance, multiplying a sum of barycentric coordinates of the vertices at the next intersection point on four vertices of the tetrahedron times canonical coordinates of the vertices.
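One reading of clause E, written as an equation: if s denotes the normalized distance described in the clause, b_i^prev and b_i^next the barycentric coordinates at the previous and next intersection points of the ray with the tetrahedron, and c_i the canonical coordinates of the tetrahedron's four vertices, then the transformed sample is the interpolation

P = s \sum_{i=1}^{4} b_i^{\mathrm{prev}} c_i + (1 - s) \sum_{i=1}^{4} b_i^{\mathrm{next}} c_i.

The symbol names are introduced here only to restate the clause; they are not taken from the disclosure.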
Clause F, the method of clause a, further comprising, for one of the transformed samples, rotating a view direction of rays of the sample prior to querying the learned radiation field parameterization.
Clause G, the method of clause F, comprising calculating a rotation R of the view direction for a small fraction of the primitive 3D elements, and propagating the values of R to the remaining tetrahedra via nearest neighbor interpolation.
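A minimal sketch of the nearest neighbor propagation in clause G, assuming per-tetrahedron rotation matrices R have already been computed for a small subset of elements (how R is computed is not shown); the use of scipy's cKDTree, the centroid-based lookup and the variable names are assumptions.

```python
import numpy as np
from scipy.spatial import cKDTree

def propagate_rotations(all_centroids, known_indices, known_rotations):
    """Give each tetrahedron the rotation R of its nearest tetrahedron with a known R.

    all_centroids:   (N, 3) centroids of all tetrahedra in the cage
    known_indices:   (K,) indices of the small subset with computed rotations
    known_rotations: (K, 3, 3) rotation matrices for that subset
    """
    tree = cKDTree(all_centroids[known_indices])
    _, nearest = tree.query(all_centroids)   # index of the nearest known tetrahedron
    return known_rotations[nearest]          # (N, 3, 3) propagated rotations
```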
Clause H, the method of any of the preceding clauses, wherein the canonical version of the cage is a cage with specified parameter values of an articulated object model or specified parameters of a physics engine.
Clause I, the method of any of the preceding clauses, wherein the canonical version of the cage represents a face with a closed mouth.
Clause J, the method of any of the preceding clauses, wherein the learned radiation field parameterization is a cache of associations between 3D points in the canonical version of the cage and color values and opacity values, the associations obtained by querying a machine learning model trained using training data comprising images of the dynamic scene from multiple viewpoints.
Clause K, the method of any of the preceding clauses, wherein the images of the dynamic scene from multiple viewpoints are obtained at the same instance of time or at two moments in time.
Clause L, the method of clause K, wherein the cage represents a face of the person and includes a mesh inside the mouth, a first plane representing an upper row of teeth of the person, and a second plane representing a lower row of teeth of the person.
Clause M, the method of clause L, comprising checking if one of the samples falls in the interior of the mouth, and using information about the first plane and the second plane to calculate the transformation on the sample.
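To illustrate clauses L and M, whether a sample lies in the mouth interior between the two teeth planes can be tested with signed distances. This is a hedged sketch: the (point, normal) plane representation, the orientation of the normals and the function names are our own choices rather than anything specified in the disclosure.

```python
import numpy as np

def signed_distance(point, plane_point, plane_normal):
    """Signed distance from a 3D point to a plane given by a point on it and a unit normal."""
    return float(np.dot(point - plane_point, plane_normal))

def inside_mouth(sample, upper_plane, lower_plane):
    """True if the sample lies between the upper-teeth plane and the lower-teeth plane.

    upper_plane / lower_plane: (point, normal) tuples with both normals pointing
    into the mouth interior.
    """
    below_upper = signed_distance(sample, *upper_plane) >= 0.0
    above_lower = signed_distance(sample, *lower_plane) >= 0.0
    return below_upper and above_lower
```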
Clause N, the method of any preceding clause, comprising using only one radiation field network and increasing the number of sampling bins during the process of querying the learned radiation field parameterization of the 3D scene for each transformed sample to obtain color values and opacity values.
Clause O, an apparatus comprising at least one processor, a memory storing instructions that, when executed by the at least one processor, perform a method for computing an image of a dynamic 3D scene comprising a 3D object, the method comprising:
Receiving a description of a deformation of the 3D object, the description comprising a cage of primitive 3D elements and associated animation data from a physics engine or an articulated object model;
for pixels of the image, computing rays from a virtual camera into the cage through the pixels, animated according to the animation data, and computing a plurality of samples on the rays, each sample being a 3D position and view direction in one of the 3D elements;
Computing a transformation of the samples into a canonical version of the cage to generate transformed samples;
Querying a learned radiation field parameterization of the 3D scene for each transformed sample to obtain color values and opacity values;
a volume rendering method is applied to the color values and the opacity values to generate pixel values of the image.
Clause P, a computer-implemented method of computing an image of a dynamic 3D scene comprising a 3D object, the method comprising:
receiving a description of a deformation of the 3D object;
For pixels of the image, computing a ray from a virtual camera through the pixels into the description and computing a plurality of samples on the ray, each sample being a 3D position and view direction in one of the 3D elements;
computing a transformation of the samples into a canonical space to generate transformed samples;
For each transformed sample, querying a cache of associations between 3D points in the canonical space and color values and opacity values;
a volume rendering method is applied to the color values and the opacity values to generate pixel values of the image.
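Clause P replaces the network query of clause A with a cache lookup. One common realization of such a cache is a dense voxel grid of color and opacity values in canonical space; the sketch below assumes that layout and a simple nearest-voxel lookup, which are our choices rather than something mandated by the clause.

```python
import numpy as np

def query_cache(points, grid, bounds=(-1.0, 1.0)):
    """Look up cached RGB color and opacity for canonical-space points.

    points: (N, 3) canonical-space sample positions
    grid:   (R, R, R, 4) precomputed cache holding RGB + opacity per voxel
    """
    lo, hi = bounds
    res = grid.shape[0]
    # Map canonical coordinates to voxel indices and clamp to the grid.
    idx = np.clip(((points - lo) / (hi - lo) * (res - 1)).round().astype(int), 0, res - 1)
    rgba = grid[idx[:, 0], idx[:, 1], idx[:, 2]]
    return rgba[:, :3], rgba[:, 3]   # color values, opacity values
```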
Clause Q, the method of clause P, further comprising one or more of: storing the image, transmitting the image to a computer gaming application, transmitting the image to a telepresence application, inserting the image into a virtual webcam stream, and transmitting the image to a head-mounted display.
Clause R, the method of clause P or Q, comprising detecting parameter values of a model of a 3D object depicted in a video using an object tracker, and calculating the description of the deformation of the 3D object using the detected parameter values and the model.
Clause S, the method of any of clauses P to R, comprising using a physics engine to specify the description.
Clause T, the method of any of clauses P-S, wherein the 3D primitive elements are any one of: a tetrahedron, a sphere, a cuboid.
The term 'computer' or 'computing-based device' is used herein to refer to any device having processing capabilities such that it executes instructions. Those skilled in the art will recognize that such processing capabilities are incorporated into many different devices, and thus the terms 'computer' and 'computing-based device' each include Personal Computers (PCs), servers, mobile phones (including smart phones), tablet computers, set-top boxes, media players, game consoles, personal digital assistants, wearable computers, and many other devices.
In some examples, the methods described herein are performed by software in machine-readable form (e.g., in the form of a computer program comprising computer program code means adapted to perform all the operations of one or more of the methods described herein when the program is run on a computer), on a tangible storage medium, and wherein the computer program may be embodied on a computer-readable medium. The software is adapted to be executed on a parallel processor or a serial processor such that the method operations may be performed in any suitable order or simultaneously.
Those skilled in the art will recognize that the storage devices used to store program instructions may alternatively be distributed across a network. For example, a remote computer can store an example of a process described as software. The local or terminal computer can access the remote computer and download a part or all of the software to run the program. Alternatively, the local computer may download pieces of the software as needed, or execute some of the software instructions at the local terminal and remote computer (or computer network). Those skilled in the art will also recognize that all or a portion of the software instructions may be executed by dedicated circuitry, such as a Digital Signal Processor (DSP), programmable logic array, or the like, by utilizing conventional techniques known to those skilled in the art.
As will be apparent to one of skill in the art, any of the ranges or device values set forth herein may be extended or altered without losing the effect sought.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims.
It will be appreciated that the benefits and advantages described above may relate to one embodiment or may relate to several embodiments. Embodiments are not limited to those solving any or all of the problems or those having any or all of the benefits and advantages. It will also be understood that reference to an item refers to one or more of those items.
The operations of the methods described herein may be performed in any suitable order or concurrently where appropriate. In addition, individual blocks may be deleted from any of the methods without departing from the scope of the subject matter described herein. Aspects of any of the examples described above may be combined with aspects of any of the other examples described to form further examples without loss of the effect sought.
The term 'comprising' is used herein to mean including the identified method blocks or elements, but such blocks or elements do not constitute an exclusive list, and a method or apparatus may contain additional blocks or elements.
The term 'subset' is used herein to refer to a proper subset, such that a subset of a set does not include all elements of the set (i.e., at least one of the elements of the set is missing from the subset).
It will be appreciated that the above description is given by way of example only and that various modifications may be made by those skilled in the art. The above specification, examples and data provide a complete description of the structure and use of exemplary embodiments. Although various embodiments have been described above with a certain degree of particularity, or with reference to one or more individual embodiments, those skilled in the art could make numerous alterations to the disclosed embodiments without departing from the scope of this disclosure.

Claims (20)

1. A computer-implemented method of computing an image of a dynamic 3D scene comprising a 3D object, the method comprising:
receiving a description of a deformation of the 3D object, the description comprising a cage of primitive 3D elements and associated animation data from a physics engine or an articulated object model;
for a pixel of the image, computing a ray from a virtual camera through the pixel into the cage, animated according to the animation data, and computing a plurality of samples on the ray, each sample being a 3D position and view direction in one of the 3D elements;
computing a transformation of the samples into a canonical version of the cage to produce transformed samples;
for each transformed sample, querying a learned radiation field parameterization of the 3D scene to obtain color values and opacity values; and
applying a volume rendering method to the color values and the opacity values to produce pixel values of the image.

2. The method of claim 1, wherein the cage of primitive 3D elements represents the 3D object and a volume extending from the 3D object.

3. The method of claim 2, wherein the cage comprises a volumetric mesh having a plurality of volumetric blend shapes and a skeleton.

4. The method of claim 2, wherein the cage is computed from the learned radiation field parameterization by computing a mesh from the density of the learned radiation field parameterization using a marching cubes method and computing a tetrahedral embedding of the mesh.

5. The method of any preceding claim, further comprising computing the transformation P of a sample by setting P equal to a normalized distance between a previous intersection point and a next intersection point of a tetrahedron on the ray, multiplied by the sum over the four vertices of the tetrahedron of the barycentric coordinates of the vertices at the previous intersection point times the canonical coordinates of the vertices, plus one minus the normalized distance, multiplied by the sum over the four vertices of the tetrahedron of the barycentric coordinates of the vertices at the next intersection point times the canonical coordinates of the vertices.

6. The method of any preceding claim, further comprising, for one of the transformed samples, rotating the view direction of the ray of the sample before querying the learned radiation field parameterization.

7. The method of claim 6, comprising computing a rotation R of the view direction for a small fraction of the primitive 3D elements, and propagating the values of R to the remaining tetrahedra via nearest neighbor interpolation.

8. The method of any preceding claim, wherein the canonical version of the cage is the cage with specified parameter values of an articulated object model or specified parameters of a physics engine.

9. The method of any preceding claim, wherein the canonical version of the cage represents a face with a closed mouth.

10. The method of any preceding claim, wherein the learned radiation field parameterization is a cache of associations between 3D points in the canonical version of the cage and color values and opacity values, the associations being obtained by querying a machine learning model trained using training data comprising images of the dynamic scene from multiple viewpoints.

11. The method of any preceding claim, wherein the images of the dynamic scene from multiple viewpoints are obtained at the same instance of time or at two moments in time.

12. The method of claim 11, wherein the cage represents a face of a person and comprises a mesh inside the mouth, a first plane representing an upper row of teeth of the person, and a second plane representing a lower row of teeth of the person.

13. The method of claim 12, comprising checking whether one of the samples falls in the interior of the mouth, and using information about the first plane and the second plane to compute the transformation of the sample.

14. The method of any preceding claim, comprising using only one radiation field network and increasing the number of sampling bins during the process of querying the learned radiation field parameterization of the 3D scene for each transformed sample to obtain color values and opacity values.

15. An apparatus comprising: at least one processor; and a memory storing instructions that, when executed by the at least one processor, perform a method for computing an image of a dynamic 3D scene comprising a 3D object, the method comprising:
receiving a description of a deformation of the 3D object, the description comprising a cage of primitive 3D elements and associated animation data from a physics engine or an articulated object model;
for a pixel of the image, computing a ray from a virtual camera through the pixel into the cage, animated according to the animation data, and computing a plurality of samples on the ray, each sample being a 3D position and view direction in one of the 3D elements;
computing a transformation of the samples into a canonical version of the cage to produce transformed samples;
for each transformed sample, querying a learned radiation field parameterization of the 3D scene to obtain color values and opacity values; and
applying a volume rendering method to the color values and the opacity values to produce pixel values of the image.

16. A computer-implemented method of computing an image of a dynamic 3D scene comprising a 3D object, the method comprising:
receiving a description of a deformation of the 3D object;
for a pixel of the image, computing a ray from a virtual camera through the pixel into the description and computing a plurality of samples on the ray, each sample being a 3D position and view direction in one of the 3D elements;
computing a transformation of the samples into a canonical space to produce transformed samples;
for each transformed sample, querying a cache of associations between 3D points in the canonical space and color values and opacity values; and
applying a volume rendering method to the color values and the opacity values to produce pixel values of the image.

17. The method of claim 16, further comprising one or more of: storing the image, transmitting the image to a computer gaming application, transmitting the image to a telepresence application, inserting the image into a virtual webcam stream, and transmitting the image to a head-mounted display.

18. The method of claim 16 or claim 17, comprising using an object tracker to detect parameter values of a model of a 3D object depicted in a video, and using the detected parameter values and the model to compute the description of the deformation of the 3D object.

19. The method of any one of claims 16 to 18, comprising using a physics engine to specify the description.

20. The method of any one of claims 16 to 19, wherein the 3D primitive elements are any one of: a tetrahedron, a sphere, a cuboid.

Applications Claiming Priority (5)

Application Number (Publication) | Priority Date | Filing Date | Title
GB2210930.0 | 2022-07-26 | |
GBGB2210930.0A (GB202210930D0 (en)) | 2022-07-26 | 2022-07-26 | Computing images of controllable dynamic scenes
US17/933,453 (US12182922B2 (en)) | 2022-07-26 | 2022-09-19 | Computing images of controllable dynamic scenes
US17/933,453 | 2022-09-19 | |
PCT/US2023/025095 (WO2024025668A1 (en)) | 2022-07-26 | 2023-06-12 | Computing images of controllable dynamic scenes

Publications (1)

Publication Number | Publication Date
CN119497876A (en) | 2025-02-21

Family

ID=87202014


Country Status (3)

Country | Link
EP (1) | EP4562608A1 (en)
CN (1) | CN119497876A (en)
WO (1) | WO2024025668A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN119006687B (en)* | 2024-07-29 | 2025-05-23 | China University of Mining and Technology | 4D scene representation method based on joint pose and radiation field optimization in complex mine environment

Also Published As

Publication number | Publication date
EP4562608A1 (en) | 2025-06-04
WO2024025668A1 (en) | 2024-02-01

Legal Events

Code | Title
PB01 | Publication
SE01 | Entry into force of request for substantive examination
