. Author manuscript; available in PMC: 2020 Jun 10.

Published in final edited form as:IEEE Winter Conf Appl Comput Vis. 2020 May 14;2020:390–399. doi:10.1109/wacv45572.2020.9093533

Robust Template-Based Non-Rigid Motion Tracking Using Local Coordinate Regularization

Wei Li¹,Shang Zhao¹,Xiao Xiao¹,James K Hahn¹

¹Department of Computer Science, The George Washington University

^✉

Email:gw_liwei@gwu.edu

Issue date 2020 Mar.

PMC Copyright notice

PMCID: PMC7286606 NIHMSID: NIHMS1590038 PMID:32524059

Abstract

In this paper, we propose our template-based non-rigid registration algorithm to address the misalignments in the frame-to-frame motion tracking with single or multiple commodity depth cameras. We analyze the deformation in the local coordinates of neighboring nodes and use this differential representation to formulate the regularization term for the deformation field in our non-rigid registration. The local coordinate regularizations vary for each pair of neighboring nodes based on the tracking status of the surface regions. We propose our tracking strategies for different surface regions to minimize misalignments and reduce error accumulation. This method can thus preserve local geometric features and prevent undesirable distortions. Moreover, we introduce a geodesic-based correspondence estimation algorithm to align surfaces with large displacements. Finally, we demonstrate the effectiveness of our proposed method with detailed experiments.

1. Introduction

The recent development of low cost RGB-D cameras makes surface reconstruction and motion capture more accessible to general users. Compared with traditional commercial marker-based motion capture systems, such as Vicon®, the ability to track time-varying deformable surfaces other than the skeletal motions with commodity depth cameras provides more flexibilities and capabilities in various areas such as motion analysis, medical simulation and virtual reality. The major challenge of the motion tracking problem is to solve the misalignments between the deformed model and the partially captured scans. In this paper, we propose our template-based non-rigid registration algorithm to address the misalignments in a frame-to-frame motion tracking pipeline with single or multiple depth cameras.

There are several possible reasons for the failure of motion tracking. With sparse distributed depth camera setup, some regions of the surface may be occluded from the camera view preventing motion tracking. When those regions become visible again in subsequent frames, the calculated deformation may differ greatly from the actual deformation. Moreover, due to the low capture frequency of the commodity depth cameras, fast motions can cause large displacements in the camera space. This violates the assumption of small displacements in some correspondence estimation algorithms, which may lead to incorrect result. In other cases, even though the surface is visible to the camera and has reliable correspondences in the depth image, the surface is still not well registered. This is because the non-optimal constraints in the template model prevent large potential deformation from the rest pose. Error accumulation in the registered frames can also reduce the quality of alignments and result in undesirable distortion for long motion sequences.

In this paper, we utilize the differential representation of the deformation field to address the above issues in the framework of volumetric embedded deformation graph [28]. In this differential representation, we define the local deformation and the rotation-invariant regularization in the local coordinates of embedded nodes [16,1]. We formulate an analytic solution for the regularization term by analyzing the net strain energy between each pair of neighboring nodes in the graph. This differential representation can help preserve local geometric features and prevent undesirable stretching, bending or torsional deformations. Unlike previous works which select a single reference frame to constrain the deformation of the entire surface, we adaptively select the reference frames for different regions of the surface in the regularization term based on the tracking status of the surface. We treat the registered motion sequences as feasible candidates for the deformable surface. Various tracking strategies are then applied to minimize misalignments and recover untracked regions. Moreover, we use the differential representation in multi-view motion tracking to transfer the registered motion from one view to another in order to reduce the effects of asynchronization and interference between different cameras. Finally, we introduce a geodesic-based correspondence estimation algorithm to solve large misalignments after the initial projective alignment. Unlike the previous spectral-based algorithms, our method focuses on evaluating geodesic features and estimating correspondences in the partially aligned scans with incomplete topologies.

2. Related Work

2.1. Template-Based Non-Rigid Registration

The deformation model for template-based non-rigid registration is usually based on differential geometry methods [7,30,12]. Botsch et al. [1] provide a comprehensive summary of the linear variational deformation methods. Local feature preservation and rotational invariance are important properties that lead to the popularity of these methods. As-Rigid-As-Possible (ARAP) [27] is one of the commonly used regularizations for deformable surfaces. This method estimates the local transformation from the neighboring vertices, and then minimizes the vertex displacement in the local space. Embedded deformation graph [28,22,33] samples a set of nodes on the surface and associates each node with an affine transformation which is implicitly solved in the optimization. This method evaluates the deformation in a coarser graph, which is more efficient than in a dense mesh. De Aguila et al. [2] generate tetrahedrons within the template model to preserve the volume during the deformation. DynamicFusion [22] uses dual quaternions to represent deformations for better quality of transformation blending. Some other methods [3,4,20] use multiple cameras to reduce occluded regions and improve motion tracking.

One major issue of these general deformation models is that the topology or the constraints of the model do not reflect the underlying kinematic structure of the scanned object. This often results in over-smoothed deformation or rubbery appearance. Therefore, other works are modeling the deformation field based on the piece-wise rigid articulated structure. Some of these works refine the original embedded deformation graph. Li et al. [14] adaptively refine the topology of the embedded graph by adding more nodes to the regions with larger misalignments. Guo et al. [5] apply an additional L0 optimization after the L2 optimization with the ARAP regularization to minimize the number of non-rigid node connections. The non-rigid regions are then assigned less constraint weights than the rigid regions in the regularization term. Some other works implicitly generate skeletal structure based on the analysis of the registered motion sequences. ArticulatedFusion [13] hierarchically clusters the surface into rigid segments in a bottom-top manner by minimizing a rigid registration energy function. Tzionas et al. [31] compute the deformation variance among a set of sample nodes on the surface, and apply spectral clustering on the corresponding affinity matrix to determine the rigid segments. There are also some works that explicitly define the kinematic structure of the human body for performance capture applications. DoubleFusion [35] separates the surface into two layers: the inner layer uses parametric model SMPL [17] to track the poses and correspondences, while the outer layer fuses far-body shapes. BodyFusion [34] presents a skeleton-embedded surface fusion (SSF) to join the surface graph with a skeleton structure. However, most of these works use single reference frame to constrain the deformation of the entire surface, which may cause large misalignments or tracking failure. We apply different reference frames and tracking strategies in the local coordinate of embedded nodes to address these issues.

2.2. Correspondence Estimation

The correspondence estimation is critical for the convergence of non-rigid registration. Most previous works estimate the correspondences under the Iterative Closest Point (ICP) framework [36]. Spatial partitioning data structures such as KD-trees [5] are often applied to accelerate the searching process. However, due to the high computational cost of building the data structures, these methods are often not suitable for dynamic surface tracking and interactive applications. Therefore, some other works search the correspondences in the image space of the measurements. Projective Depth Association (PDA) [23] is commonly used in surface tracking with depth cameras. This method searches for the matched point within a small window around the projection location in the depth image. However, this method is not efficient for large tangential motions which require a larger searching window. Tagliasacchi et al. [29] and Li et al. [13] accelerate the correspondence searching in depth images by using the Distance Transform (DT) of the foreground segmentation to locate the closest point on the silhouette of the surface. Moreover, Valstic et al. [32] constrain the silhouette of the canonical mesh to match that detected in the current image.

Another category is performing correspondence estimation in the embedded space. Jain et al. [9] perform principle component analysis (PCA) on the geodesic distance matrix to convert the points into embedded space to align shapes. The isometry-invariant property of the geodesic distance is demonstrated to be a significant feature for robust correspondence estimation on deformable surfaces. Motion2Fusion [3] improves the performance of the spectral embedded algorithm by developing a machine learning method to efficiently map the points to the embedded space. On the other hand, FunctionalMaps (FM) [24] define a consistent and linear mapping function between a pair of full shapes using Laplace-Beltrami eigenfunctions. Rodola et al.’s method [26] improves FM for the partial to full correspondence estimation with an additional permutation matrix. Instead, our work focuses on estimating reliable correspondences in the misaligned regions between the partially matched model and depth images using geodesic features. We use the geodesic distances precalculated in the canonical model to help evaluate the geodesic features along the incomplete topology of the depth images.

3. Motion Reconstruction

In this section, we discuss our algorithm of the non-rigid registration for motion reconstruction with single or multiple depth cameras. InSec. 3.1, we construct a watertight canonical model from the static pose. InSec. 3.2, we generate a volumetric embedded deformation graph to represent the local deformations and constraints of the canonical model. InSec. 3.3, we reconstruct the motion for each frame by solving a non-linear least square optimization problem to minimize the misalignments between the deformed model and captured depth images. We discuss the issues of the traditional ARAP regularizations and derive our solution using the relative transformation between neighboring nodes. InSec. 3.4, we describe our tracking strategies based on the tracking status of the surface regions. InSec. 3.5, we introduce a geodesic-based correspondence estimation algorithm to solve large partial misalignments in the frame-to-frame non-rigid registration.

3.1. Canonical Model

We fuse the depth images captured from a static pose of the scanned object into a Truncated Signed Distance Field (TSDF) [8,15] and extract the point cloud from the zero-crossing level-set of the TSDF. We then apply the Poisson reconstruction [11] on the point cloud to construct a watertight canonical model. The canonical model provides a unified topology of the deformed model across all frames in the motion tracking.

3.2. Embedded Deformation Graph

To deform the canonical model, we apply the volumetric Embedded Deformation Graph introduced in [28]. A set of embedded nodes is sampled in the volume of the watertight canonical model and a local rigid transformation $F : {(q, g), q \in ℍ, g \in ℝ^{3}}$ is associated with each node. We represent rotations by using unit quaternions $q (‖ q ‖_{2}^{2} = 1)$ to reduce the number of parameters for computation and storage as in [3]. The algebras ofF used in this paper are listed below:

F v = q v q * + g, v \in ℝ^{3}

(1)

F_{1} F_{2} = (q_{1} q_{2}, q_{1} g_{2} q_{1}^{*} + g_{1})

(2)

F^{- 1} = (q *, - q * g q)

(3)

The point $v \in ℝ^{3}$ in the canonical model $V$ is deformed by the weighted transformations from the neighboring nodes $N (v)$ :

F (v; F) = \sum_{i \in N (v)} {\hat{w}}_{i} (v) F_{i} F_{i}^{c, - 1} v

(4)

F^c : {(q^c,g^c)} is the initial node transformation in the canonical space.q^c is set to identity rotation by default, and g^c is the sample position of the embedded node. ${\hat{w}}_{i} (v)$ is the normalized weight using the Radial Basis Function (RBF): $exp (- \frac{{‖ v - g_{i} ‖}^{2}}{2 σ^{2}})$ .

3.3. Energy Function

We formulate our non-rigid registration as a non-linear least square optimization problem. The complete energy function is:

E (F) = w_{data} E_{data} (F) + w_{reg} E_{reg} (F)

(5)

The following subsections will describe each of these terms in details.

3.3.1. Data Term

The major objective of the template-based non-rigid registration is to minimize the distances between the deformed canonical model and the partially observed scan from the depth camera at each frame. We apply the point-to-plane distance metric [18,25] in our data term:

E_{data} (F) = \sum_{k \in C} {| (F (v_{k}; F) - v_{k}^{'}) \cdot n_{k}^{'} |}^{2}

(6)

$C$ is a set of chosen points in the canonical model that have reliable correspondences in the depth image. $v_{k}^{'}$ with normal $n_{k}^{'}$ is the correspondence point of v_k found in the point cloud back projected from the depth image.

3.3.2. Local Coordinate Regularization Term

To formulate the regularization term in our energy function, we first analyze the net strain energy between two embedded nodesi andj in the deformation graph. We consider a particle x around nodej in the reference frameϵ. x is located in a cubic volume Ω of size [−r,r] centered at the origin of the local coordinate of nodej (SeeFig. 1). We measure the displacement vector of x from the reference frame to current frame in the local coordinate of nodei. For linearly elastic materials, the rotation-invariant net strain energy between nodesi andj can thus be approximated by:

E_{ij} = \frac{1}{2} \int_{Ω} {‖ {\dot{F}}_{ij} x - {\dot{F}}_{ij}^{ϵ} x ‖}_{2}^{2} d x

(7)

where ${\dot{F}}_{ij} = F_{i}^{- 1} F_{j} : {(θ_{ij}, δ_{ij}), θ_{ij} \in ℍ, δ_{ij} \in ℝ^{3}}$ is the relative transformation from the local coordinate of is the relative transformation from the local coordinate nodej to that of nodei. SubstitutingEq. 1 intoEq. 7, we obtain an analytic solution in the quaternion representation using properties of odd functions and Frobenius norms:

E_{ij} = \frac{1}{2} \int_{- r}^{r} \int_{- r}^{r} \int_{- r}^{r} {‖ R (θ_{ij}) x - R (θ_{ij}^{ϵ}) x + δ_{ij} - δ_{ij}^{ϵ} ‖}_{2}^{2} d x = \frac{4 r^{5}}{3} {‖ R (θ_{ij}) - R (θ_{ij}^{ϵ}) ‖}_{F}^{2} + 4 r^{3} {‖ δ_{ij} - δ_{ij}^{ϵ} ‖}_{2}^{2} = \frac{32 r^{5}}{3} (1 - {(θ_{ij} \cdot θ_{ij}^{ϵ})}^{2}) + 4 r^{3} {‖ δ_{ij} - δ_{ij}^{ϵ} ‖}_{2}^{2} \leq \frac{32 r^{5}}{3} {‖ θ_{ij} - θ_{ij}^{ϵ} ‖}_{2}^{2} + 4 r^{3} {‖ δ_{ij} - δ_{ij}^{ϵ} ‖}_{2}^{2}

(8)

whereR(·) is the rotation matrix form of a quaternion. $θ_{ij} \cdot θ_{ij}^{ϵ}$ is the dot product ofθ_ij and $θ_{ij}^{ϵ}$ . The equality ofEq. 8 holds when $θ_{ij} = θ_{ij}^{ϵ}$ . Evaluating the energy for all pairs of neighboring nodes in the graph, we obtain our regularization term so as to minimize the total elastic energy of the deformation graph:

E_{reg} (F) = \sum_{i \in V} \sum_{j \in N (v_{i})} w_{ij} {‖ W_{ij} ({\dot{F}}_{ij} - {\dot{F}}_{ij}^{ϵ}) ‖}_{2}^{2}

(9)

whereω_ij is initialized to the RBF weight $w_{ij}^{c}$ with respect to ${‖ δ_{ij}^{c} ‖}_{2}$ . The diagonal weight matrixW_ij can be correctly derived from the coefficients ofEq. 8:

W_{ij} = [\begin{array}{l} \sqrt{\frac{8}{3}} r I_{4 \times 4} \\ I_{3 \times 3} \end{array}]

(10)

Here we setr to be the sampling radius of the embedded nodes.I is an identity matrix.

Figure 1: — ${\dot{F}}_{ij}^{ϵ} : (θ_{ij}^{ϵ}, δ_{ij}^{ϵ})$ and ${\dot{F}}_{ij} : (θ_{ij}, δ_{ij})$ are the relative transformations from the coordinate of nodej to nodei at reference frameϵ and current frame respectively. Ω is a cubic volume centered at the origin in the coordinate of nodej. x is a particle located in Ω.d is the displacement vector of x from reference frame to current frame observed in the coordinate of nodei.

The commonly used ARAP regularization [27] is a special case of this method and has several disadvantages. First, the ARAP regularization treats nodes as particles instead of finite elements, thus it can only constrain the translation partδ_ij of the relative transformation. Lacking constraints on the rotation partθ_ij makes ARAP regularization unable to constrain torsional deformation around the node edges. Second, the ARAP regularization usually chooses the canonical frame or the key frame as the reference frame for all node edges in the graph, which is not optimal. For the nodes only affected by the regularization terms (e.g., occluded nodes), the transformations will drift back to the canonical poses. For visible nodes, the constraints may prevent potentially large deformation from the rest poses for joint regions. All of the issues mentioned above may cause large misalignments or tracking failure in the motion registration. In contrast, unlike ARAP, the local coordinate regularization with rotational constraints can uniquely determine the deformation field with sparse embedded nodes. This provides the capabilities and flexibilities in reconstructing the deformation field from multiple reference frames and views by manipulating the reference ${\dot{F}}_{ij}^{ϵ}$ inEq. 9.Sec. 3.4 will discuss more about the strategies of selecting ${\dot{F}}_{ij}^{ϵ}$ in the frame-to-frame motion registration.

3.4. Tracking Strategies

To establish correspondences $C$ between the canonical model and the captured depth image, we first render the deformed canonical model from the last frame to generate a new depth image in the camera space. This depth image is used to determine the visibility of the points in the canonical model to the camera. Then we perform PDA correspondence estimation in the depth image for the visible points in the deformed model. The points on the model can thus be clustered into three categories: occluded, tracked (visible and have correspondences), and untracked (visible but have no correspondences) (SeeFig. 2). We can cluster the embedded nodes into the same categories by checking the tracking status of the associated points with the node. If most of the associated points are well aligned, the estimated transformation of the node from the optimization can be considered as a reliable solution. If most of the associated points are occluded, the transformation of the node can only be guessed from previous frames. If most of the associated points have no correspondences, then we lose track of the node at this frame. This is often due to fast motion or occlusions, which needs better global correspondence estimation methods (SeeSec. 3.5). We store both of the transformationsF and the tracking status of the nodes into files after the registration of each frame. Based on the tracking status, we develop several strategies for the selection of the reference ${\dot{F}}_{ij}^{ϵ}$ to constrain the deformations in different regions.

Figure 2: — Left: depth image and registered deformed model. Middle: point clusters. Blue, red and gray points are tracked, untracked and occluded points respectively. Right: corresponding embedded graph. Orange and gray edges use the canonical frame and the latest tracked frame as the reference frame respectively.

Visible region

If both nodesi andj are visible to the camera and their associated vertices have sufficiently reliable correspondences, we consider the edge between the nodes as being tracked (blue blocks inFig. 3). For template-based motion tracking, we usually choose the relative transformation ${\dot{F}}_{ij}^{c}$ from the canonical frame as the reference to constrain the deformation of tracked edges. But as mentioned inSec. 3.3.2, this fixed ARAP constraint does not work well for non-rigid regions with large deformations. Another method always uses the relative transformation from the latest tracked frame ${\dot{F}}_{ij}^{t}$ as the reference for the registration. However, this method can lead to error accumulation that changes the distribution of the vertices in the deformed mesh. Therefore, we add an L0 norm regularization termE_F in the energy function to adaptively adjust the reference frame ${\dot{F}}_{ij}^{ϵ}$ and encourage using the constraints from the canonical frame ${\dot{F}}_{ij}^{c}$ :

E_{F} ({\dot{F}}^{ϵ}) = λ_{F} \sum_{i \in V} \sum_{j \in N (v_{i})} w_{ij} {‖ {‖ {\dot{F}}_{ij}^{ϵ} - {\dot{F}}_{ij}^{c} ‖}_{2} ‖}_{0}

(11)

where ${\dot{F}}_{ij}^{ϵ} \in {{\dot{F}}_{ij}^{c}, {\dot{F}}_{ij}^{t}}$ . When the optimization converges, we add another L0 norm termE_ω to adaptively adjust the weightω_ij in an additional optimization:

E_{w} (w) = λ_{w} \sum_{i \in V} \sum_{j \in N (v_{i})} {‖ w_{ij} - w_{ij}^{c} ‖}_{0}

(12)

where $w_{ij} \in {w_{ij}^{c}, 0.3 w_{ij}^{c}}$ . Similar to [5], this method can refine the deformation field to be as articulated as possible and reduce the over-smoothed deformation in the joint regions.

Figure 3: — Illustration of the tracking strategies. Each row is a timeline of the tracking status for a node edge. Orange, blue, gray and red blocks stand for canonical, tracked, occluded, and untracked frames respectively.

Occluded region

If the node edge is occluded or untracked (the gray and red blocks respectively inFig. 3), we apply the relative transformation ${\dot{F}}_{ij}^{c}$ from the latest tracked frame as the reference (the dashed lines inFig. 3a and3b). To optimize the query performance, we cache a copy of the relative transformations ${\dot{F}}_{ij}^{i}$ across frames, and only update the cached ${\dot{F}}_{ij}$ when the corresponding nodes are tracked in the new registered frame. This method allows the occluded surface to preserve its local deformation from previous visible frame while being transformed globally with the tracked surface. Due to the temporal coherence, there would be more opportunities for the occluded surface to be recovered when the surface becomes visible again in subsequent frames. To reduce the motion discontinuity between the occluded and tracked frames, we interpolate the transformations between tracked frames and perform another optimization for occluded nodes after the motion is reconstructed (the bidirectional arrows inFig. 3a and3b).

Untracked region

Once the untracked nodes are recovered in subsequent frames, we apply a second-pass registration to recover the untracked nodes. Similar to the strategy used in the occluded region, we first calculate an interpolated reference frame ${\dot{F}}_{ij}^{ϵ}$ . Then we only apply the regularization in the untracked region $P$ , and fix the position of the other nodes as hard constraints to initialize the poses:

E_{reg} (F) = \sum_{i \in V} \sum_{j \in N (v_{i})} w_{ij} E_{ij}^{ϵ} (F), i \in P \lor j \in P

(13)

If the untracked surface with the interpolated deformation has sufficient overlaps with the depth image, the transformations of the untracked nodes can be recovered in the optimization.

Multi-view tracking

A multi-view tracking system can significantly reduce occluded regions and second-pass tracking. However, for commodity depth cameras, such as Kinect V2, there are still some issues to address: the Kinect V2 lacks control of the camera shutters, therefore the depth images are captured at varying frequencies and phases that are not synchronized; the interference between multiple cameras also leads to larger measurement noises even for static objects. All of these issues make it difficult to integrate multi-view data into the same energy function because of the large misalignments in the global space. Therefore, we introduce a sequential registration pipeline for multiple depth cameras. We sort the frames captured from all the cameras by their timestamps, and register the sorted frames in chronological order. For occluded node edges, we choose the relative transformation from the closest tracked frame among all the camera views as the reference (SeeFig. 3c). Therefore, the registered frames are first converted to the local space of neighboring nodes and then transfered to another view in the form of relative transformations to reduce the effects of global misalignments between different cameras.

3.5. Correspondence Estimation

Most of the misalignments can be solved by the frame-to-frame non-rigid registration using PDA correspondence estimation. However, due to the low capture frequency of the commodity depth cameras (e.g., 30Hz for Kinects) and the sparse camera arrangement, we can still fail to track some regions when fast motion or extended occlusion occurs. Since the captured objects are usually articulated, such as the human body, the deformation is almost isometric, which is more insensitive in the geodesic space along the surface. Therefore, we choose to convert the deformed points and the captured points to the geodesic space to perform global correspondence matching.

To be precise, we first sample a set of source points $P$ on the surface of the canonical model $V$ . For each point $v \in V$ , we calculate the minimum path distances between the point and the source points along the topology of the canonical surface. This defines our mapping from the Euclidean space to the geodesic feature space: $ℝ^{3} \to ℝ^{| P |}$ . However, since the depth image is just a partial scan of the model, the connectivity on the captured surface is incomplete or ambiguous due to surface occlusion (SeeFig. 4a) and contact (SeeFig. 4c). Therefore, it is not feasible to compute the geodesic feature for the captured points directly from the depth image. To address this problem, we reduce the feature space to $ℝ^{| Q |}$ where $Q$ is a subset of $P$ that excludes the source points with few reliable correspondences in the initial alignment. We consider the surface points around each source point within a fixed geodesic distance, and evaluate the proportion of surface points with projective correspondences. If the proportion is lower than a threshold, the source point is excluded from $Q$ . We then join the graph of the canonical model and the depth image through the aligned points. More specifically, we use the geodesic distances precalculated in the canonical model to initialize the geodesic distances of the aligned points in the depth image, and then estimate the distances in the unaligned regions based on those at the boundary of the overlapped region. This helps solve both the discontinuity and ambiguous connection in the partial graph of the depth image (SeeFig. 4b and4d). After these steps, we search for the best matched points between the unaligned sample points and captured points by comparing theirL₁ distances in the sub-feature space $ℝ^{| Q |}$ with a reciprocity test to reject false matching. These sparse correspondences are then applied to recover large misalignments in the registration.

Figure 4: — Illustration of the single source geodesic distance. (a) and (c) calculate geodesic distances directly from the depth image. (b) and (d) initialize geodesic distances from the canonical model.

3.6. Motion Smoothing

The reconstructed motion usually suffers from jittering deformation due to noises of measurements. Therefore, after the motion sequence is reconstructed, we smooth the motion transformations to filter out high frequency noises for visualization purpose. We calculate the weighted average of the translation g and the rotation q for framet within a frame window of [−n, +n] (SeeEq. 14). $\hat{w}$ are the normalized RBF weights with respect to the time offset. The averaged quaternion ${\bar{q}}^{t}$ is calculated as the eigenvector of the covariance matrix ${cov}_{q}^{t}$ with the maximum eigenvalue as in [6,19]:

{\bar{g}}^{t} = \sum_{k = t - n}^{t + n} {\hat{w}}_{k} g^{k}, {cov}_{q}^{t} = \sum_{k = t - n}^{t + n} {\hat{w}}_{k} q^{k} q^{k, ⊥}

(14)

4. Result

This section contains experimental results of our motion tracking method with single-view and multi-view depth image sequences (600 to 3000 frames) from [5] (SeeFig. 5,6, and7) and our capture system (SeeFig. 8). We focus on comparisons between registration methods using the deformation graph as a general underlying structure without any prior knowledge of the scanned object, and demonstrate the advantages of using our local coordinate regularization.

Figure 5: — L2 regularizations with (a) and without (b) the rotation termθ_ij. Gray points are back projected from the depth image.

Figure 6: — Comparison of using adaptive and single reference frames. (a) and (c) use adaptive reference frames. (b) only uses the canonical frame as the reference frame. (d) only uses the latest tracked frame as the reference frame.

Figure 7: — Comparison of different tracking strategies for the occluded regions in a motion sequence. (a), (b) and (c) use the latest tracked frame as the reference frame. (d), (e) and (f) use the canonical frame as the reference frame.

Figure 8: — Results of our motion tracking method using four Kinect V2 cameras connected to one computer.

Fig. 5 visualizes the importance of the rotation termθ_ij in the local coordinate regularization. Due to the error of the correspondence estimation at the boundary of the surface, the registration may produce an undesirable torsional distortion in the region with fewer embedded nodes (e.g., the arms and the legs). Without the rotation term, there is no constraint to prevent this torsional distortion, and the misalignments may finally become unrecoverable (e.g., the left foot inFig. 5b). Our regularization with the rotation term is thus more robust to these drift errors.

Fig. 6 shows the results using single and adaptive reference frame in the non-rigid registration. InFig. 6d, we use the latest tracked frame as the reference frame for the registration. Due to the inaccuracy of alignments and error accumulation in the frame-to-frame registration, the deformed model no longer satisfies the ARAP constraints after several frames. InFig. 6b, we use the canonical frame as the reference frame. The surface does not align well with the scans in the bending regions (e.g., the knees and the elbows). This is because the canonical reference ${\dot{F}}_{ij}^{c}$ and the corresponding weight $w_{ij}^{c}$ may not be optimal for large deformations. In contrast, our method with adaptive selection of reference frames can prevent error accumulation and fit the scans better (SeeFig. 6a and6c). We evaluate the alignment errors and compare our method with other single-view (SeeFig. 9) and multi-view (SeeFig. 10) registration methods using deformation graphs. As can be seen, our method achieves lower errors for large bending motions (kicking motion) and fast motions (jumping motions).

Figure 9: — Quantitative comparison of single-view registration methods for kicking motions.

Figure 10: — Quantitative comparison of multi-view registration methods for jumping motions.

InFig. 7, we demonstrate our tracking strategy for the occluded regions. In the first row, we use the latest tracked frame (SeeFig. 7a) as the reference frame in the occluded regions as in our method. In the second column, only the bottom of the right foot is visible to the camera, while the rest of the right leg is occluded. However, the right leg can keep its local pose and be rotated with the torso. The misalignments are then recovered after several iterations when there are more overlaps between the right foot and the depth image (SeeFig. 7b). In the second row, we only use the canonical frame as the reference frame in the occluded regions. The right leg thus drifts back to the canonical pose (SeeFig. 7e), which causes incorrect alignments in subsequent frames (SeeFig. 7f). This tracking failure is a common issue in previous works [14,37].

We compare our correspondence estimation method with the KD-trees [5], PDA [23] and DT [13] methods. We show the calculated correspondences for the sparse sample points in the misaligned regions and demonstrate the convergence of different methods inFig. 11. The PDA method can only find a few correspondences around the overlapped regions due to a small searching window (SeeFig. 11c). KD-trees and DT methods can find longer-range correspondences than the PDA method, but since the spatial distance based correspondences are not reliable for large non-rigid deformation, the estimation may lead to incorrect alignments (SeeFig. 11b and11d). In contrast, our geodesic-based method can find more robust correspondences following the topology of the surface (SeeFig. 11a). Though the preprocessing time of our method is more than the DT and PDA methods (SeeTable 1), our method requires fewer iterations to converge and obtains better alignments (SeeFig. 12).

Figure 11: — Comparison of correspondence estimation methods. The first and the second rows show the sparse correspondence pairs and the registration results respectively. The columns show our method, KD-trees, PDA and DT methods from left to right. The gray and green meshes are the canonical model and the captured scan respectively.

Table 1:

Preprocessing time of correspondence estimation methods on Intel Core™ i7–6900K with a single thread. Our method computes geodesic distances and corresponding KD-trees in non-overlapped regions using FLANN [21]; KD-trees build the spatial structure for the depth image; DT computes the distance map from the silhouettes of the scan.

Method	Ours	KD-trees	PDA	DT
Time (ms)	13	33	0	10

Open in a new tab

Figure 12: — Convergence of different correspondence estimation methods for large misalignments. The first two iterations use estimated sparse correspondences. The rest iterations use dense correspondences with the PDA method.

5. Conclusion

We have presented a non-rigid registration algorithm using the local coordinate regularization to solve the misalignments in the motion tracking. We have demonstrated several tracking strategies for different regions of the surface based on the visibility and alignment of the surface. This method is shown to be effective and robust in various scenarios with single or multiple depth camera setup. We also propose a geodesic-based algorithm to efficiently locate reliable correspondences in partially aligned scans to recover the surface with large displacements. There are still difficulties in solving the ambiguous cases with self-intersecting surfaces for correspondence estimation, which would be another interesting topic.

Acknowledgements

This study was supported by NSF grant CNS-1337722 and NIH grants R21HL124443 and R01HD091179.

References

[1].Botsch M and Sorkine O. On linear variational surface deformation methods. IEEE transactions on visualization and computer graphics, 14(1):213–230, 2007. [DOI] [PubMed] [Google Scholar]
[2].De Aguiar E, Stoll C, Theobalt C, Ahmed N, Seidel H-P, and Thrun S. Performance capture from sparse multi-view video. ACM Transactions on Graphics (TOG), 27(3):98, 2008. [Google Scholar]
[3].Dou M, Davidson P, Fanello SR, Khamis S, Kowdle A, Rhemann C, Tankovich V, and Izadi S. Motion2fusion: Real-time volumetric performance capture. ACM Transactions on Graphics (TOG), 36(6):246, 2017. [Google Scholar]
[4].Dou M, Khamis S, Degtyarev Y, Davidson P, Fanello SR, Kowdle A, Escolano SO, Rhemann C, Kim D, Taylor J, et al. Fusion4d: Real-time performance capture of challenging scenes. ACM Transactions on Graphics (TOG), 35(4):114, 2016. [Google Scholar]
[5].Guo K, Xu F, Wang Y, Liu Y, and Dai Q. Robust non-rigid motion tracking and surface reconstruction using l0 regularization. In Proceedings of the IEEE International Conference on Computer Vision, pages 3083–3091, 2015. [Google Scholar]
[6].Hartley R, Trumpf J, Dai Y, and Li H. Rotation averaging. International journal of computer vision, 103(3):267–305, 2013. [Google Scholar]
[7].Helgason S. Differential geometry, Lie groups, and symmetric spaces, volume 80Academic press, 1979. [Google Scholar]
[8].Izadi S, Kim D, Hilliges O, Molyneaux D, Newcombe R, Kohli P, Shotton J, Hodges S, Freeman D, Davison A, et al. Kinectfusion: real-time 3d reconstruction and interaction using a moving depth camera In Proceedings of the 24th annual ACM symposium on User interface software and technology, pages 559–568. ACM, 2011. [Google Scholar]
[9].Jain V and Zhang H. Robust 3d shape correspondence in the spectral domain In IEEE International Conference on Shape Modeling and Applications 2006 (SMI’06), pages 19–19. IEEE, 2006. [Google Scholar]
[10].Kavan L, Collins S, Žára J, and O’Sullivan C. Skinning with dual quaternions In Proceedings of the 2007 symposium on Interactive 3D graphics and games, pages 39–46. ACM, 2007. [Google Scholar]
[11].Kazhdan M and Hoppe H. Screened poisson surface reconstruction. ACM Transactions on Graphics (ToG), 32(3):29, 2013. [Google Scholar]
[12].Laga H, Xie Q, Jermyn IH, and Srivastava A. Numerical inversion of srnf maps for elastic shape analysis of genuszero surfaces. IEEE transactions on pattern analysis and machine intelligence, 39(12):2451–2464, 2017. [DOI] [PubMed] [Google Scholar]
[13].Li C, Zhao Z, and Guo X. Articulatedfusion: Real-time reconstruction of motion, geometry and segmentation using a single depth camera. In Proceedings of the European Conference on Computer Vision (ECCV), pages 317–332, 2018. [Google Scholar]
[14].Li H, Adams B, Guibas LJ, and Pauly M. Robust single-view geometry and motion reconstruction. ACM Transactions on Graphics (ToG), 28(5):175, 2009. [Google Scholar]
[15].Li W, Xiao X, and Hahn J. 3d reconstruction and texture optimization using a sparse set of rgb-d cameras In 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 1413–1422. IEEE, 2019. [Google Scholar]
[16].Lipman Y, Sorkine O, Levin D, and Cohen-Or D. Linear rotation-invariant coordinates for meshes. ACM Transactions on Graphics (TOG), 24(3):479–487, 2005. [Google Scholar]
[17].Loper M, Mahmood N, Romero J, Pons-Moll G, and Black MJ. Smpl: A skinned multi-person linear model. ACM transactions on graphics (TOG), 34(6):248, 2015. [Google Scholar]
[18].Low K-L. Linear least-squares optimization for point-to-plane icp surface registration. Chapel Hill, University of North Carolina, 4, 2004. [Google Scholar]
[19].Markley FL, Cheng Y, Crassidis JL, and Oshman Y. Averaging quaternions. Journal of Guidance, Control, and Dynamics, 30(4):1193–1197, 2007. [Google Scholar]
[20].Meerits S, Thomas D, Nozick V, and Saito H. Fusionmls: Highly dynamic 3d reconstruction with consumer-grade rgb-d cameras. Computational Visual Media, 4(4):287–303, 2018. [Google Scholar]
[21].Muja M and Lowe DG. Fast approximate nearest neighbors with automatic algorithm configuration. VISAPP (1), 2(331–340):2, 2009. [Google Scholar]
[22].Newcombe RA, Fox D, and Seitz SM. Dynamicfusion: Reconstruction and tracking of non-rigid scenes in real-time. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 343–352, 2015. [Google Scholar]
[23].Newcombe RA, Izadi S, Hilliges O, Molyneaux D, Kim D, Davison AJ, Kohi P, Shotton J, Hodges S, and Fitzgibbon A. Kinectfusion: Real-time dense surface mapping and tracking In Mixed and augmented reality (ISMAR), 2011 10th IEEE international symposium on, pages 127–136. IEEE, 2011. [Google Scholar]
[24].Ovsjanikov M, Ben-Chen M, Solomon J, Butscher A, and Guibas L. Functional maps: a flexible representation of maps between shapes. ACM Transactions on Graphics (TOG), 31(4):30, 2012. [Google Scholar]
[25].Park S-Y and Subbarao M. An accurate and fast point-to-plane registration technique. Pattern Recognition Letters, 24(16):2967–2976, 2003. [Google Scholar]
[26].Rodolà E, Cosmo L, Bronstein MM, Torsello A, and Cremers D. Partial functional correspondence In Computer Graphics Forum, volume 36, pages 222–236. Wiley Online Library, 2017. [Google Scholar]
[27].Sorkine O and Alexa M. As-rigid-as-possible surface modeling. In Symposium on Geometry processing, volume 4, pages 109–116, 2007. [Google Scholar]
[28].Sumner RW, Schmid J, and Pauly M. Embedded deformation for shape manipulation In ACM Transactions on Graphics (TOG), volume 26, page 80ACM, 2007. [Google Scholar]
[29].Tagliasacchi A, Schröder M, Tkach A, Bouaziz S, Botsch M, and Pauly M. Robust articulated-icp for real-time hand tracking In Computer Graphics Forum, volume 34, pages 101–114. Wiley Online Library, 2015. [Google Scholar]
[30].Tumpach AB, Drira H, Daoudi M, and Srivastava A. Gauge invariant framework for shape analysis of surfaces. IEEE transactions on pattern analysis and machine intelligence, 38(1):46–59, 2015. [DOI] [PubMed] [Google Scholar]
[31].Tzionas D and Gall J. Reconstructing articulated rigged models from rgb-d videos In European Conference on Computer Vision, pages 620–633. Springer, 2016. [Google Scholar]
[32].Vlasic D, Baran I, Matusik W, and Popović J. Articulated mesh animation from multi-view silhouettes In ACM Transactions on Graphics (TOG), volume 27, page 97ACM, 2008. [Google Scholar]
[33].Whelan T, Leutenegger S, Salas-Moreno R, Glocker B, and Davison A. Elasticfusion: Dense slam without a pose graph. Robotics: Science and Systems, 2015. [Google Scholar]
[34].Yu T, Guo K, Xu F, Dong Y, Su Z, Zhao J, Li J, Dai Q, and Liu Y. Bodyfusion: Real-time capture of human motion and surface geometry using a single depth camera. In Proceedings of the IEEE International Conference on Computer Vision, pages 910–919, 2017. [Google Scholar]
[35].Yu T, Zheng Z, Guo K, Zhao J, Dai Q, Li H, Pons-Moll G, and Liu Y. Doublefusion: Real-time capture of human performances with inner body shapes from a single depth sensor. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7287–7296, 2018. [DOI] [PubMed] [Google Scholar]
[36].Zhang Z. Iterative point matching for registration of freeform curves and surfaces. International journal of computer vision, 13(2):119–152, 1994. [Google Scholar]
[37].Zollhöfer M, Nießner M, Izadi S, Rehmann C, Zach C, Fisher M, Wu C, Fitzgibbon A, Loop C, Theobalt C, et al. Real-time non-rigid reconstruction using an rgb-d camera. ACM Transactions on Graphics (ToG), 33(4):156, 2014. [Google Scholar]

Movatterモバイル変換

PERMALINK

Robust Template-Based Non-Rigid Motion Tracking Using Local Coordinate Regularization

Wei Li

Shang Zhao

Xiao Xiao

James K Hahn

Abstract

1. Introduction

2. Related Work

2.1. Template-Based Non-Rigid Registration

2.2. Correspondence Estimation

3. Motion Reconstruction

3.1. Canonical Model

3.2. Embedded Deformation Graph

3.3. Energy Function

3.3.1. Data Term

3.3.2. Local Coordinate Regularization Term

Figure 1:

3.4. Tracking Strategies

Figure 2:

Visible region

Figure 3:

Occluded region

Untracked region

Multi-view tracking

3.5. Correspondence Estimation

Figure 4:

3.6. Motion Smoothing

4. Result

Figure 5:

Figure 6:

Figure 7:

Figure 8:

Figure 9:

Figure 10:

Figure 11:

Table 1:

Figure 12:

5. Conclusion

Acknowledgements

References

ACTIONS

PERMALINK

RESOURCES

Similar articles

Cited by other articles

Links to NCBI Databases