CN111832386A

Movatterモバイル変換

Info

Publication number: CN111832386A
Application number: CN202010439265.7A
Authority: CN
Inventors: 曲毅; 何晓光; 屈莎; 刘佳玉; 王洪亮; 杜浩宇
Original assignee: Dalian Ruidong Technology Co ltd
Current assignee: Dalian Ruidong Technology Co ltd
Priority date: 2020-05-22
Filing date: 2020-05-22
Publication date: 2020-10-27

Abstract

The invention relates to a method and a device for estimating human body posture and a computer readable medium, belonging to the technical field of computer image processing. The invention first estimates the human pose of the articulated 3D object model by the computer by obtaining a sequence of source images and thereby a sequence of objects separated from the image background from the corresponding source image segments; then matching the sequence with a reference profile sequence to determine one or more selected reference profile sequences that form a best match; subsequently retrieving, for each of said reference sequence contours, a reference body pose associated with one of said reference contours; and calculating an estimate of the body pose of the articulated object model from the retrieved one or more reference body poses. The result from the above steps is an initial body pose estimate, which can then be used in further steps. The invention can be operated in an uncontrolled environment with loose camera calibration, low athlete resolution and occlusion.

Description

Translated fromChinese

一种估计人体姿态的方法、装置及计算机可读介质A method, apparatus and computer readable medium for estimating human body pose

技术领域technical field

本发明属于计算机图像处理技术领域，具体涉及一种估计人体姿态的方法、装置及计算机可读介质。The invention belongs to the technical field of computer image processing, and in particular relates to a method, a device and a computer-readable medium for estimating a human body posture.

背景技术Background technique

人体姿态估计或运动捕捉作为计算机视觉和图形学领域中的一个基本问题，在比赛和电影中的角色动画、比赛和监控的无控制器界面等领域都有广泛应用。鉴于问题的复杂性，目前尚无一个能够使其适用于所有应用领域的通用型解决方案。同样需要注意的是，所谓解决方案在很大程度上会依赖于相关条件以及对设置施加的约束。通常情况下，约束越多，就可以获得更为精确的人体姿态估计结果。但在现实世界的场景中，通常很难增加约束。而许多实际应用则都离不开场景。例如，可以简单的利用TV广播中已有的视频素材，展示如何使用精确的人体姿态估计结果来在体育比赛期间从任意视点对运动员进行高质量的渲染。除了在渲染领域中应用之外，在比赛期间对人体姿态估计还可以用于进行生物力学分析和合成，以及用于比赛统计，甚至将真实的比赛移植到电脑游戏之中。Human pose estimation or motion capture, as a fundamental problem in computer vision and graphics, is widely used in fields such as character animation in games and movies, controller-free interfaces for games and surveillance, etc. Given the complexity of the problem, there is currently no one-size-fits-all solution that makes it suitable for all application areas. It is also important to note that the so-called solution will depend heavily on the conditions and constraints imposed on the setup. In general, the more constraints, the more accurate human pose estimation results can be obtained. But in real-world scenarios, it is often difficult to add constraints. Many practical applications are inseparable from the scene. For example, it is simple to use video footage already available in TV broadcasts to show how accurate human pose estimation results can be used to render high-quality athletes from arbitrary viewpoints during sports competitions. In addition to applications in the field of rendering, human pose estimation during competitions can also be used for biomechanical analysis and synthesis, as well as for competition statistics, and even to port real competitions into computer games.

目前，商业运动捕捉系统通常使用遍布全身的光学标记来跟踪一个对象的时变性运动状态。尽管这类系统能够得到非常准确的人体姿态估计结果，可以捕捉到各种身体人体姿态以及面部表情。但是，这类方法只能在受控环境下工作，因此只应用于特定范围，适用范围严格受限。Currently, commercial motion capture systems typically use optical markers throughout the body to track the time-varying motion state of an object. Although such systems can obtain very accurate human pose estimation results, they can capture various body poses and facial expressions. However, such methods can only work in a controlled environment and are therefore only applicable to a specific and strictly limited scope.

目前根据使用的“镜头”类型，无标记人体姿态重建(或运动捕捉)问题可大致被分为以下两类，即使用来自一个摄像机的视频序列，以及使用来自多个校准摄像机的“镜头”。由于单目视频序列对用户的限制较少，因此基于单目视频序列的人体姿态估计方法对于某些应用而更为方便，但它亦存在一些固有的问题，其中包括深度不确定性 (模糊性)。尽管模糊性可以利用从运动到结构的方法来解决，这在视觉上却是一个非常困难的问题。来自运动算法的结构通常依赖于包含大量细节的高分辨率场景，而在运动场景中通常并不具备所需条件。Currently, label-free human pose reconstruction (or motion capture) problems can be roughly divided into two categories, namely, using a video sequence from one camera, and using "shots" from multiple calibrated cameras, depending on the type of "shots" used. The human pose estimation method based on monocular video sequences is more convenient for some applications because monocular video sequences are less restrictive to users, but it also has some inherent problems, including depth uncertainty (ambiguity ). Although ambiguity can be resolved using a motion-to-structure approach, it is a very difficult problem visually. Structures from motion algorithms often rely on high-resolution scenes that contain a lot of detail, which is often not available in motion scenes.

另外，在人体姿态估计方法中，另一个急需解决的问题是遮挡。如果旨在进行人体姿态分析的“镜头”仅来自单一摄像机，那么则很难对其进行解析。并且通过增加摄像机的数量，则更有可能获得同一对象的无遮挡“镜头”。一般而言，摄像机的空间覆盖率越高，那么获得的镜头的质量就会越高。除此之外，目前体育转播已开始使用多个摄像机来进行图像获取。因此，人们可以利用这些信息来得到更精确的3D人体姿态估计结果。In addition, in the human pose estimation method, another problem that needs to be solved urgently is occlusion. If the "shots" intended for human pose analysis come from only a single camera, it is difficult to parse them. And by increasing the number of cameras, it is more likely to get an unobstructed "shot" of the same object. In general, the higher the spatial coverage of the camera, the higher the quality of the footage obtained. In addition to this, sports broadcasting has started to use multiple cameras for image acquisition. Therefore, people can use this information to get more accurate 3D human pose estimation results.

就目前已有技术而言，大多数多视角3D人体姿态估计方法都是利用跟踪算法来重建某一时刻处的人体姿态或根据某一时间下的人体姿态来重建另一时刻的人体姿态。其中，跟踪可以使用光流拟合和或立体视觉匹配来实现。然而，尽管这些方法可以提供较精确的人体姿态估计结果，但它们通常需要在受控环境下工作，且需要更多的高分辨率摄像机(通常至少四个)和良好的场景空间覆盖(通常是圆形覆盖)来解决由于遮挡造成的模糊性问题。As far as the existing technologies are concerned, most multi-view 3D human pose estimation methods use tracking algorithms to reconstruct the human pose at a certain moment or reconstruct the human pose at another moment according to the human pose at a certain time. Among them, tracking can be achieved using optical flow fitting and or stereo vision matching. However, although these methods can provide relatively accurate human pose estimation results, they usually need to work in a controlled environment and require more high-resolution cameras (usually at least four) and good scene space coverage (usually Circular Overlay) to address ambiguity due to occlusion.

当然，目前亦存在一些使用多视角轮廓或多视角立体方法来构建 “代理几何体”的其他方法。在完成所述代理几何体构建之后，骨架会被装入此几何体。尽管这些方法能够提供非常好的效果，但它们亦存在一些关于设置方面的限制。这是因为它们需要精心搭建的摄影棚、许多高分辨率摄像机以及非常好的空间覆盖率。Of course, there are other ways to build "proxy geometry" using multi-view contour or multi-view stereo methods. After the proxy geometry is built, the skeleton is loaded into this geometry. Although these methods can provide very good results, they also have some limitations in terms of settings. This is because they require well-built studios, many high-resolution cameras, and very good space coverage.

除此之外，还存在另一类基于图像分析和分割的算法。这些算法能够使用机器学习方法来区分身体部位。不过，此分析通常需要高分辨率的镜头，对于大多数应用场景而言是无法实现的条件。In addition to this, there is another class of algorithms based on image analysis and segmentation. These algorithms are able to use machine learning methods to differentiate body parts. However, this analysis typically requires high-resolution footage, an unrealistic condition for most application scenarios.

发明内容SUMMARY OF THE INVENTION

本发明针对以上问题，本发明的目的是提供一种可在摄像机标定松散、运动员分辨率低和存在遮挡的非受控环境中运行的数据驱动人体姿态估计方法、装置及计算机可读介质。The present invention addresses the above problems, and an object of the present invention is to provide a data-driven human pose estimation method, device, and computer-readable medium that can operate in an uncontrolled environment with loose camera calibration, low player resolution, and occlusion.

为实现上述目的，本发明采取的技术方案如下。In order to achieve the above objects, the technical solutions adopted by the present invention are as follows.

本发明的第一方面提供了一种估计人体姿态的方法，用于估计铰接式对象模型，其中所述铰接式对象模型为由一个或多个源摄像机记录得真实世界对象的基于计算机的3D模型，并且所述铰接式对象模型表示多个关节和链接所述关节的链接，并且其中所述铰接式对象模型的人体姿态由所述关节的空间位置来定义，所述方法包括以下步骤：A first aspect of the present invention provides a method of estimating human pose for estimating an articulated object model, wherein the articulated object model is a computer-based 3D model of a real-world object recorded by one or more source cameras , and the articulated object model represents a plurality of joints and links linking the joints, and wherein the human pose of the articulated object model is defined by the spatial positions of the joints, the method includes the steps of:

获取摄像机(9)记录的真实世界对象(14)的视图的视频流；obtaining a video stream of the view of the real-world object (14) recorded by the camera (9);

从所述视频流获得至少一个源图像序列(10)；obtaining at least one sequence of source images from the video stream (10);

处理所述至少一个序列的源图像，以针对每个图像提取对应的源图像片段，所述对应的源图像片段包括与所述图像背景分离的所述现实世界对象的视图，从而生成至少一个源图像片段序列；processing the at least one sequence of source images to extract, for each image, a corresponding source image segment, the corresponding source image segment comprising a view of the real-world object separated from the image background, thereby generating at least one source image fragment sequence;

在计算机可读形式的数据库中维护一组参考轮廓序列，每个参考轮廓与铰接式对象模型相关联，并且与该铰接式对象模型的特定参考人体姿态相关联；maintaining a set of reference contour sequences in a database in computer readable form, each reference contour being associated with an articulated object model and associated with a specific reference human pose for the articulated object model;

对于所述至少一个源图像段序列的每个序列，将该序列与多个参考轮廓序列进行匹配，并确定与所述源图像段序列最匹配的一个或多个所选参考轮廓序列；for each sequence of the at least one sequence of source image segments, matching the sequence to a plurality of reference contour sequences, and determining one or more selected reference contour sequences that best match the sequence of source image segments;

其中两个序列的匹配由通过将每个源图像片段与在其序列内相同位置的参考轮廓进行匹配、计算指示它们匹配程度的匹配误差，以及根据源图像片段的匹配误差计算序列匹配误差来完成；Two of the sequences are matched by matching each source image segment to a reference contour at the same location within its sequence, calculating a match error indicating how well they match, and calculating the sequence match error based on the source image segment's match error ;

对于每一个选定的参考轮廓序列，检索与其中一个参考轮廓相关联的参考人体姿态；以及for each selected sequence of reference contours, retrieve the reference body pose associated with one of the reference contours; and

根据检索到的一个或多个参考人体姿态来计算铰接式对象模型的人体姿态的估计结果；calculating an estimate of the body pose of the articulated object model based on the retrieved one or more reference body poses;

源图像序列中的一个图像被指定为感兴趣帧，并且由其生成的源图像段被指定为感兴趣的源图像段；An image in the sequence of source images is designated as the frame of interest, and the source image segment generated therefrom is designated as the source image segment of interest;

序列匹配误差为两个序列匹配误差的加权和；The sequence matching error is the weighted sum of the two sequence matching errors;

感兴趣的源图像片段的匹配误差的权重最大，并且随源图像片段 (在序列内)到感兴趣的源图像片段的距离而减小；以及The matching error of the source image segment of interest has the greatest weight and decreases with the distance (within the sequence) of the source image segment to the source image segment of interest; and

检索与感兴趣的源图像段匹配的参考轮廓的参考人体姿态。Retrieve the reference human pose for the reference contour that matches the source image segment of interest.

通过上述步骤得到的估计结果为初始人体姿态估计，随后人们可以在相关扩展步骤中使用该初始人体姿态估计结果，例如，用于保证来自连续帧的人体姿态估计之间的局部一致性，以及保证在更长的帧序列上的全局一致性。The estimation result obtained through the above steps is the initial human pose estimation, which one can then use in the relevant extension steps, for example, to ensure local consistency between human pose estimates from consecutive frames, and to ensure Global consistency over longer frame sequences.

优选地，源图像序列(10)的一个图像被指定为感兴趣帧(56)，并且从其生成的源图像段(13)被指定为感兴趣的源图像段；Preferably, one image of the sequence of source images (10) is designated as the frame of interest (56), and the source image segment (13) generated therefrom is designated as the source image segment of interest;

序列匹配误差是两个序列(51，52)匹配误差的加权和；The sequence matching error is the weighted sum of the matching errors of the two sequences (51, 52);

感兴趣的源图像片段的匹配误差的权重最大，并且随源图像片段到感兴趣的源图像片段的距离而减小；并且检索与感兴趣的源图像片段匹配的参考轮廓的参考人体姿态。The matching error of the source image segment of interest has the largest weight and decreases with the distance from the source image segment to the source image segment of interest; and the reference body pose of the reference contour matching the source image segment of interest is retrieved.

优选地，其中，获得并处理由至少两个源摄像机(9，9')同时记录的至少两个源图像序列(10)，并且通过选择在3D空间中最符合的检索到的参考人体姿态的组合，根据从至少两个源图像序列确定的检索到的参考人体姿态来获取铰接式对象模型(4)的人体姿态估计结果。Preferably, wherein, at least two source image sequences (10) simultaneously recorded by at least two source cameras (9, 9') are obtained and processed, and by selecting the retrieved reference body pose that best fits in 3D space In combination, a human pose estimation result of the articulated object model (4) is obtained based on the retrieved reference human pose determined from at least two source image sequences.

优选地，从至少一个连续源图像段序列确定的两个人体姿态之间建立局部一致性，每个人体姿态与至少一个源图像段(13)相关联，其中铰接式对象模型(4)的一个或两个人体姿态中的元素，即关节 (2)和(或)链接(3)对应于可以以模糊方式标记的真实世界对象 (14)的肢体：Preferably, local consistency is established between two human poses determined from at least one continuous sequence of source image segments, each human pose being associated with at least one source image segment (13), wherein one of the articulated object models (4) Or elements in two human poses, namely joints (2) and/or links (3) correspond to limbs of real-world objects (14) that can be labeled in a fuzzy way:

对于一对连续的源图像段中的每一个，根据相关联的人体姿态以及对于每个人体姿态的可能的模糊标签中的每一个，确定来自肢体的图像点在源图像段中的对应标签；For each of a pair of consecutive source image segments, determine the corresponding label in the source image segment for the image point from the limb based on the associated body pose and each of the possible fuzzy labels for each body pose;

为该对连续源图像片段中的第一个选择该人体姿态的标签；selecting a label for the human pose for the first of the pair of consecutive source image segments;

计算这对连续源图像段中的第一个和第二个之间的光流；compute the optical flow between the first and the second of the pair of consecutive source image segments;

从光流中确定第二图像段中与第一图像段的肢体相对应的图像点已经移动到的位置，并根据第一图像段中的肢体的标记来标记第二图像段中的这些位置；determining from the optical flow where the image points in the second image segment corresponding to the limbs of the first image segment have moved, and marking these locations in the second image segment according to the markings of the limbs in the first image segment;

在用于根据人体姿态来标记图像点的第二图像段的人体姿态的可能的模糊标签中，选择与根据光流确定的标签一致的标签。Among the possible fuzzy labels for labeling the human pose of the second image segment of the image point according to the human pose, a label is selected that is consistent with the label determined from the optical flow.

优选地，在所述源图像段中标记来自肢体的图像点的步骤通过以下步骤完成：Preferably, the step of marking image points from the limb in the source image segment is accomplished by the following steps:

对于一对连续的源图像段中的每一个，根据相关联的人体姿态和对于每个人体姿态的可能的模糊标签中的每一个，确定现实世界对象 (14)的模型到源图像中的投影，并且由此根据在位置图像点处可见的投影肢体来标签源图像段的图像点。For each of a pair of consecutive source image segments, determine the projection of the model of the real-world object (14) into the source image from the associated body pose and each of the possible fuzzy labels for each body pose , and thereby label the image points of the source image segment according to the projected limbs visible at the location image points.

优选地，给出人体姿态序列，每个人体姿态与源图像片段序列中的一个相关联，并且其中存在关于标记一个或多个肢体集的模糊性，包括以下步骤，用于建立与连续源图像序列匹配的人体姿态的全局一致性：Preferably, given a sequence of human poses, each human pose being associated with one of the sequence of source image segments, and where there is ambiguity with respect to labeling one or more sets of limbs, the following steps are included for establishing a relationship with successive source images Global consistency of human poses for sequence matching:

对于源图像序列的每个源图像段(13)，For each source image segment (13) of the source image sequence,

检索由先前步骤确定的模型元素的关联人体姿态和标签；Retrieve the associated human poses and labels for the model elements determined by the previous steps;

考虑数据库人体姿态的标记，确定数据库中与检索到的人体姿态具有最小距离的人体姿态；Considering the mark of the human pose in the database, determine the human pose in the database with the smallest distance from the retrieved human pose;

计算表示两个人体姿态之间的差异的一致性误差项；Compute a consistency error term representing the difference between two human poses;

根据这些一致性误差项计算整个源图像序列的总一致性误差；Calculate the total consistency error for the entire source image sequence from these consistency error terms;

重复上述步骤以计算模糊肢体集合的可能全局标签的所有变体的总一致性误差；Repeat the above steps to calculate the total consistency error for all variants of the possible global labels of the fuzzy limb set;

选择总体一致性误差最小的全局标签变量。Select the global label variable with the smallest overall agreement error.

优选地，所述总一致性误差是所述序列上所有一致性误差项的总和。Preferably, the total identity error is the sum of all identity error terms on the sequence.

本发明的第一方面提供了一种估计人体姿态的装置，包括处理器和存储器，所述处理器和存储器相互连接，其中，所述存储器用于存储计算机程序，所述计算机程序包括程序指令，所述处理器被配置用于调用所述程序指令，执行本发明第一方面提供的所述方法。A first aspect of the present invention provides an apparatus for estimating human body posture, comprising a processor and a memory, wherein the processor and the memory are connected to each other, wherein the memory is used to store a computer program, and the computer program includes program instructions, The processor is configured to invoke the program instructions to execute the method provided by the first aspect of the present invention.

本发明的第三方面提供了一种计算机可读存储介质，所述计算机存储介质存储有计算机程序，所述计算机程序包括程序指令，所述程序指令当被处理器执行时使所述处理器执行本发明第一方面提供的所述方法。A third aspect of the present invention provides a computer-readable storage medium storing a computer program, the computer program including program instructions that, when executed by a processor, cause the processor to execute The method provided by the first aspect of the present invention.

本发明的第四方面提供了一种制造计算机可读存储介质的方法，包括在计算机可读介质上存储计算机可执行指令的步骤，当计算机可执行指令由计算系统的处理器执行时，使计算系统执行本发明第一方面提供的所述方法步骤。A fourth aspect of the present invention provides a method of manufacturing a computer-readable storage medium, comprising the step of storing computer-executable instructions on the computer-readable medium, and when the computer-executable instructions are executed by a processor of a computing system, the The system performs the method steps provided in the first aspect of the present invention.

相对于现有技术，本发明取得的有益效果是可在摄像机标定松散、球员分辨率低和存在遮挡的非受控环境中运行的数据驱动人体姿态估计方法。本发明提供的方法和装置可以仅使用少至两个摄像机来估计人体姿态。并且其对可能的人体姿态或动作序列没有任何限制性的假设。通过使用时间一致性进行初始人体姿态估计以及人体姿态细化，在失效的情况下，用户交互仅限于几次点击即可倒转手臂和腿。Compared with the prior art, the present invention has the beneficial effect of a data-driven human pose estimation method that can operate in an uncontrolled environment with loose camera calibration, low player resolution and occlusion. The method and apparatus provided by the present invention can estimate human pose using only as few as two cameras. And it does not make any restrictive assumptions about possible human poses or action sequences. By using temporal consistency for initial human pose estimation as well as human pose refinement, user interaction is limited to inverting the arms and legs in a few clicks in the event of failure.

附图说明Description of drawings

为了更清楚地说明本发明实施例的技术方案，下面将对本发明实施例中所需要使用的附图作简单地介绍，对于本领域普通技术人员来讲，在不付出创造性劳动的前提下，还可以根据这些附图获得其他的附图。In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the accompanying drawings required in the embodiments of the present invention will be briefly introduced below. For those of ordinary skill in the art, without creative work, the Additional drawings can be obtained from these drawings.

图1是真实世界场景概貌示意图；Figure 1 is a schematic diagram of an overview of a real-world scene;

图2是铰接式对象模型；Figure 2 is an articulated object model;

图3中a是分割图像中的典型轮廓，a in Figure 3 is a typical contour in the segmented image,

b是数据库中的三个最佳匹配人体姿态；b are the three best matching human poses in the database;

图4是将3D骨架投影到源摄像机的比赛中的估计人体姿态；Figure 4 is the estimated human pose in the game projecting the 3D skeleton to the source camera;

图5是所述方法概述；Figure 5 is an overview of the method;

图6是根据现有技术和根据本方法的2D人体姿态估计结果：Figure 6 is a 2D human body pose estimation result according to the prior art and according to the present method:

图7是人体姿态模糊示例；Figure 7 is an example of human posture blurring;

图8是局部一致性图示；Figure 8 is a local consistency diagram;

图9是人体姿态优化前后的估计人体姿态；Fig. 9 is the estimated human body posture before and after optimization of the human body posture;

图10是失效案例；Figure 10 is the failure case;

图11是每帧显示所有摄影机视图的结果序列。Figure 11 is the resulting sequence showing all camera views per frame.

具体实施方式Detailed ways

下面将详细描述本发明的各个方面的特征和示例性实施例，为了使本发明的目的、技术方案及优点更加清楚明白，以下结合附图及实施例，对本发明进行进一步详细描述。应理解，此处所描述的具体实施例仅被配置为解释本发明，并不被配置为限定本发明。对于本领域技术人员来说，本发明可以在不需要这些具体细节中的一些细节的情况下实施。下面对实施例的描述仅仅是为了通过示出本发明的示例来提供对本发明更好的理解。The features and exemplary embodiments of various aspects of the present invention will be described in detail below. In order to make the objects, technical solutions and advantages of the present invention clearer, the present invention will be further described in detail below in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are only configured to explain the present invention, and not to limit the present invention. It will be apparent to those skilled in the art that the present invention may be practiced without some of these specific details. The following description of the embodiments is only intended to provide a better understanding of the present invention by illustrating examples of the invention.

本实施例重点描述了基于无约束田径体育赛事转播镜头的人体姿态估计方法。在这种场景下，意味着摄像机位置、物体大小和时间一致性面临一些挑战。虽然人体姿态估计可以基于多摄像机设置来进行计算，但依旧存在可用摄像机数量很少以及基线很宽的问题。除此之外，由于摄像机通常只放置在场地的一侧，因此只能提供有限的场景覆盖。尽管这些摄像机提供高分辨率的图像，但出于编辑原因，其通常被设置为广角。因此，运动员通常只覆盖50到200像素之间的高度。除此之外，运动员的运动可能非常复杂，特别是在像篮球等接触性对抗运动中，因此会存在更多的遮挡。This embodiment focuses on describing the human pose estimation method based on the broadcast footage of unconstrained track and field sports events. In this scenario, it means that camera position, object size, and temporal consistency face some challenges. Although human pose estimation can be computed based on multi-camera setups, there are still problems with the small number of available cameras and wide baselines. In addition, because the cameras are usually placed on only one side of the field, they can only provide limited scene coverage. Although these cameras provide high-resolution images, they are usually set to wide-angle for editing reasons. Therefore, athletes usually only cover heights between 50 and 200 pixels. In addition to this, the movement of athletes can be very complex, especially in contact sports like basketball, and therefore there will be more occlusions.

为此目的，本项发明提出了一种可在摄像机标定松散、运动员分辨率低和存在遮挡的非受控环境中运行的数据驱动人体姿态估计方法。由此产生的方法和系统可以仅使用少至两个摄像机来估计人体姿态。并且其对可能的人体姿态或动作序列没有任何限制性的假设。通过使用时间一致性进行初始人体姿态估计以及人体姿态细化，在失效的情况下，用户交互仅限于几次点击即可倒转手臂和腿。To this end, the present invention proposes a data-driven human pose estimation method that can operate in an uncontrolled environment with loose camera calibration, low player resolution, and occlusions. The resulting method and system can estimate human pose using as few as two cameras. And it does not make any restrictive assumptions about possible human poses or action sequences. By using temporal consistency for initial human pose estimation as well as human pose refinement, user interaction is limited to a few clicks to invert arms and legs in the event of failure.

人体姿态估计中的许多现有方法依赖于在2D中跟踪或分割图像，并使用校准信息将骨架外推到3D。这些方法对于高分辨率的镜头效果很好，但由于缺乏信息，它们在低分辨率图像上经常失效，并且对外部照明条件非常敏感。本发明所涉及方法适用于完全不受控制的低分辨率室外设置，因为它只依赖于粗略的轮廓和粗略的校准。Many existing methods in human pose estimation rely on tracking or segmenting images in 2D and using calibration information to extrapolate the skeleton to 3D. These methods work well for high-resolution footage, but they often fail on low-resolution images due to lack of information and are very sensitive to external lighting conditions. The method involved in the present invention is suitable for a completely uncontrolled low-resolution outdoor setting, as it only relies on rough outlines and rough calibration.

所述方法使用人体姿态和轮廓比较数据库来提取2D中的人体姿态候选对象，并使用摄像机校准信息来计算相应的3D骨架。本发明所涉及方法首先在数据库中执行一种新型基于时间一致轮廓的搜索策略，以提取具有时间一致性的最接近的数据库候选对象。另外应用了一种新型时间一致性步骤来进行初始人体姿态估计。因为准确的真实人体姿态通常不在数据库中，所以这只会导致最接近的匹配，而不是准确的人体姿态。为此目的，本发明提出了一种能够利用时间信息自动实现精确三维人体姿态的新型时空优化技术。The method uses a human pose and contour comparison database to extract human pose candidates in 2D, and uses camera calibration information to compute the corresponding 3D skeleton. The method involved in the present invention first executes a novel time-consistent contour-based search strategy in the database to extract the closest database candidates with time-consistency. In addition, a novel temporal consistency step is applied for initial human pose estimation. Since the exact real human pose is usually not in the database, this will only result in the closest match, not the exact human pose. For this purpose, the present invention proposes a novel spatio-temporal optimization technology that can automatically realize accurate three-dimensional human body posture using time information.

相对于现有技术，本发明做出的创造性贡献如下：With respect to the prior art, the creative contributions made by the present invention are as follows:

本发明提出了一种基于时间一致性轮廓的数据库人体姿态查找的初始人体姿态估计方法；The invention proposes an initial human body posture estimation method based on the database human body posture search based on the time-consistent contour;

改进了初始人体姿态估计的局部和全局一致性检查方法；Improved local and global consistency checking methods for initial human pose estimation;

本发明提出了一种基于新约束的时空人体姿态优化方法。The present invention proposes a spatiotemporal human body posture optimization method based on new constraints.

与学习骨骼统计模型不同，本发明直接使用的是人体姿态数据库。后者具有两个以下两个优势：首先，这种数据驱动型方法允许容易地添加新的人体姿态序列以适应新的设置或以前未知的人体姿态。其次，由于该方法只是在数据库中搜索最接近的人体姿态，因此对较常见人体姿态的统计偏差较小。使用包含人体测量学正确数据的数据库将始终导致初始估计的合理人体姿态。Different from learning the skeletal statistical model, the present invention directly uses the human body posture database. The latter has two following advantages: First, this data-driven approach allows new sequences of human poses to be easily added to accommodate new settings or previously unknown human poses. Second, since the method only searches the database for the closest human pose, the statistical bias for more common human poses is small. Using a database containing anthropometrically correct data will always result in an initial estimate of reasonable human pose.

以下结合附图，以标枪比赛场景为例对本发明进行详细阐述。The present invention will be described in detail below by taking a javelin match scene as an example in conjunction with the accompanying drawings.

图1给出了真实世界场景8的示意图。所述场景8包括由两个或更多源摄像机9、9'观察的诸如人的真实世界对象14，所述源摄像机 9、9'分别生成源图像10、10'的视频流。根据本发明的系统和方法从与所述源摄像机9、9'的视点不同的虚拟摄像机1的视点生成显示场景8的虚拟图像12。可选地，从虚拟图像序列12生成虚拟视频流。根据本发明所实现的设备包括处理单元15，所述处理单元15在给定源图像10、10'的情况下执行实现本发明方法的图像处理计算，并生成一个或多个虚拟图像12。处理单元15被配置为与存储单元16交互，用于存储源图像10、虚拟图像12和中间结果。所述处理单元15 由工作站19控制，所述工作站19通常包括显示设备、诸如键盘的数据输入设备和诸如鼠标的定点设备。除此之外，所述处理单元15可以被配置为向TV广播发射机17和(或)向视频显示设备18提供虚拟视频流。Figure 1 presents a schematic diagram of a real-world scenario 8 . Thescene 8 includes a real-world object 14, such as a person, observed by two ormore source cameras 9, 9', which generate video streams ofsource images 10, 10', respectively. The system and method according to the invention generates avirtual image 12 displaying thescene 8 from a viewpoint of the virtual camera 1 which is different from the viewpoint of thesource cameras 9, 9'. Optionally, a virtual video stream is generated from thevirtual image sequence 12 . The device implemented according to the invention comprises aprocessing unit 15 which, given asource image 10, 10', performs image processing calculations implementing the method of the invention and generates one or morevirtual images 12. Theprocessing unit 15 is configured to interact with thestorage unit 16 for storing thesource image 10, thevirtual image 12 and the intermediate results. Theprocessing unit 15 is controlled by aworkstation 19, which typically includes a display device, a data input device such as a keyboard, and a pointing device such as a mouse. Among other things, theprocessing unit 15 may be configured to provide a virtual video stream to theTV broadcast transmitter 17 and/or to thevideo display device 18.

图2给出了所述场景8的3D模型1，其包括所述现实世界对象 14的铰接式对象模型4。所述3D模型1通常还包括其他对象模型，例如表示其他人、地面、建筑物等(未示出)。所述铰接式对象模型 4包括通过连杆3连接的关节2，在人体模型的情况下，所述连杆3 大致对应于骨骼或四肢。每个所述关节2被定义为3D空间中的一个点，并且每个连杆3可以由通过3D空间连接两个关节2的直线来表示。基于人体姿态，可以计算对象的网格模型的3D形状或真实世界对象的3D形状的另一模型表示的3D形状，并将其投影到所述源摄像机9、9'看到的2D图像之中。这允许检查和改进人体姿态。虽然所述人体姿态由一维链接所描述，但由(网格)模型表示的、根据人体姿态放置的肢体可在三维中进行延伸。虽然本申请是根据建模的人类形状来解释的，但是其范围也涵盖了对动物形状或对诸如机器人的人工结构的应用。Figure 2 presents a 3D model 1 of thescene 8 including an articulated object model 4 of thereal world object 14. The 3D model 1 typically also includes other object models, eg representing other people, ground, buildings, etc. (not shown). The articulated object model 4 comprisesjoints 2 connected by connectingrods 3, which in the case of the human body model correspond roughly to bones or limbs. Each of saidjoints 2 is defined as a point in 3D space, and eachlink 3 can be represented by a straight line connecting twojoints 2 through 3D space. Based on the human pose, the 3D shape of the mesh model of the object or the 3D shape represented by another model of the 3D shape of the real world object can be calculated and projected into the 2D image seen by thesource cameras 9, 9' . This allows checking and improving human posture. While the human pose is described by one-dimensional links, the limbs represented by the (mesh) model, placed according to the human pose, can be extended in three dimensions. Although this application is explained in terms of modeled human shapes, its scope also covers applications to animal shapes or to artificial structures such as robots.

首先，对于每个单独输入视图中的2D人体姿态估计，本发明利用一个轮廓数据库。本发明假设例如使用色度键控法或减背景法可以从背景中粗略地分割对象14。图3a示出了所述应用场景中的分割图像13的典型示例。其中。计算对象人体姿态(即骨架关节2的2D位置)的初始猜测的基本思想是将其与各个骨架人体姿态已知的轮廓数据库进行比较。参见具有不同参考轮廓3'的图3b，它们中的两个以较小的比例再现以节省空间。First, for 2D human pose estimation in each individual input view, the present invention utilizes a contour database. The present invention assumes that theobject 14 can be roughly segmented from the background using, for example, chroma keying or background subtraction. Figure 3a shows a typical example of thesegmented image 13 in the application scenario. in. The basic idea of computing an initial guess of the object's body pose (i.e. the 2D position of the skeleton joint 2) is to compare it with a database of known contours of the individual skeleton body poses. See Figure 3b with different reference contours 3', two of which are reproduced on a smaller scale to save space.

图4示出了来自田径运动场景的源图像序列中的一帧的示例，其中通过该方法找到的3D人体姿态被投影到图像中并叠加在运动员身上。Figure 4 shows an example of a frame from a source image sequence of a track and field scene, where the 3D human pose found by the method is projected into the image and superimposed on the athlete.

在一个实施例中，该方法包括如图5所示的两个步骤。在第一步骤62中，给定来自人体姿态数据库66的粗校准和轮廓61以及参考轮廓，该方法使用时空轮廓匹配技术来提取每个单独摄像机视图的 2D人体姿态，从而产生三角化的3D人体姿态猜测63(标记为“估计的3D身体人体姿态”)。不过，这种人体姿态检测会很容易产生模糊性，即对称部分的左右翻转。虽然骨骼与轮廓非常匹配，但运动员的手臂或腿仍然可能被翻转。由于遮挡和低分辨率，这些模糊性有时即使对于人眼也很难发现。因此，我们使用一种基于光流的技术来检测翻转发生的情况，并对它们进行校正，以获得一致的序列。重要的是要注意，光流在所述设置中不够可靠，不足以在整个序列上跟踪运动员身体各部分的整个运动，但是它可以用于局部比较。第一步骤使用源图像段3的序列51，将该序列51与聚焦于感兴趣的图像帧56 的多个参考轮廓13'的序列52进行匹配。In one embodiment, the method includes two steps as shown in FIG. 5 . In afirst step 62, given a coarse calibration andcontour 61 from ahuman pose database 66 and a reference contour, the method uses spatiotemporal contour matching techniques to extract the 2D human pose for each individual camera view, resulting in a triangulated 3D human body Pose guess 63 (labeled "Estimated 3D Body Human Pose"). However, such human pose detection is prone to ambiguity, i.e. left-right flipping of symmetrical parts. Although the bones are well matched to the contours, the athlete's arm or leg may still be flipped. Due to occlusion and low resolution, these ambiguities are sometimes difficult to detect even by the human eye. Therefore, we use an optical flow-based technique to detect when flips occur and correct them for consistent sequences. It is important to note that optical flow is not reliable enough in the described setup to track the entire motion of the player's body parts over the entire sequence, but it can be used for local comparisons. The first step uses asequence 51 ofsource image segments 3, matching thissequence 51 with asequence 52 of reference contours 13' focused on theimage frame 56 of interest.

但是，通常情况下，数据库中的任何人体姿态都不会与实际人体姿态完全匹配。因此，在该方法的第二部分或步骤64(标记为“时空人体姿态优化”)中，该初始3D人体姿态63由基于时空约束的优化过程64来细化。所得到的优化的3D骨架65(标记为“优化的3D 身体人体姿态”)与来自所有视图的轮廓相匹配，并且特征在连续帧上具有时间一致性。However, in general, any human pose in the database will not exactly match the actual human pose. Thus, in the second part or step 64 of the method (labeled "spatiotemporal human pose optimization"), the initial 3D human pose 63 is refined by anoptimization process 64 based on spatiotemporal constraints. The resulting optimized 3D skeleton 65 (labeled "Optimized 3D Body Human Pose") matches contours from all views and features are temporally consistent across consecutive frames.

通过使用新型基于时空数据驱动的轮廓搜索方法首先从每个运动员和每个摄像机视图检索2D人体姿态来计算初始人体姿态估计。一旦找到每个摄像机中每个运动员的2D人体姿态，就可以使用摄像头的校准信息将图像中观察到的2D关节位置放置在3D空间中(这一步骤也称为在3D提升2D位置)。本发明通过与每个摄影机视图中的每个2D关节对应的光线相交来计算每个关节的3D位置。由于光线不会精确相交，因此本发明选择最小二乘意义上距离这些光线最近的点。由此可以得到三角测量误差E，以及初始摄像机偏移。Initial human pose estimates are computed by first retrieving 2D human poses from each athlete and each camera view using a novel spatiotemporal data-driven contour search method. Once the 2D human pose for each athlete in each camera is found, the 2D joint positions observed in the image can be placed in 3D space using the camera's calibration information (this step is also known as lifting the 2D position in 3D). The present invention calculates the 3D position of each joint by intersecting the rays corresponding to each 2D joint in each camera view. Since rays do not intersect exactly, the present invention selects the points closest to these rays in the least squares sense. This gives the triangulation error E, and the initial camera offset.

本方法在角度空间中以以下方式表示人体姿态S的3D骨架：每个骨骼i相对于其“parent”骨使用两个角度α和β，以及骨骼的长度l来表示。“root”骨由全局位置p₀给出的三个角度α₀、β₀、γ₀所指定的方向定义。可以根据该角度空间表示容易地计算3D欧几里得空间中的关节位置j，反之亦然(考虑万向节锁定)。The present method represents the 3D skeleton of a human pose S in angular space in the following way: each bone i is represented using two angles α and β with respect to its "parent" bone, and the length l of the bone. The "root" bone is defined by the orientation specified by the three angles α₀ , β₀ , γ₀ given by the global position p₀ . Joint positions j in 3D Euclidean space can be easily calculated from this angle space representation, and vice versa (considering gimbal locking).

具备对整个人体运动范围进行采样能力的大型数据库对于本发明所涉及方法非常重要，并且很难手动进行创建。因此，本发明使用 CMU运动捕捉数据库。然后使用线性混合蒙皮(骨骼蒙皮动画算法) 对装配了相同骨架的模板网格进行变形处理，以匹配数据库人体姿态的人体姿态。由此，拍摄虚拟快照并提取轮廓。通过这种方式，本发明创建了一个包含大约50000个轮廓的数据库。Large databases with the ability to sample the entire range of motion of the human body are important to the methods of the present invention and are difficult to create manually. Therefore, the present invention uses the CMU motion capture database. The template meshes fitted with the same skeleton are then deformed using linear blend skinning (skeletal skinning animation algorithm) to match the human poses of the database human poses. From this, virtual snapshots are taken and contours are extracted. In this way, the present invention creates a database of approximately 50,000 contours.

不过，所述CMU数据库只包括有限的人体姿态类型，因为其主要来自跑步和行走序列。因此，本发明又手动添加了来自几个运动场景中的1200个轮廓。尽管与自动生成的轮廓数量相比，这种手动添加的轮廓数量很少，但足以扩大示例人体姿态的跨度，以获得良好的效果。重要的是要注意，添加的示例人体姿态与我们用来拟合人体姿态的序列不同。数据库可以通过新生成的人体姿态不断扩大，从而得到更好的初始人体姿态估计。However, the CMU database includes only limited types of human poses, since they are mainly from running and walking sequences. Therefore, the present invention manually added another 1200 contours from several motion scenes. Although the number of such manually added contours is small compared to the number of automatically generated contours, it is enough to widen the span of the example human pose for good results. It is important to note that the example body poses added are different from the sequence we used to fit the body poses . The database can be continuously expanded by newly generated human poses, so as to obtain better initial human pose estimation.

本发明所涉及方法接受将每个运动员的粗略的二进制轮廓掩模以及粗略的摄像机校准作为输入数据。并将这些轮廓与来自数据库的轮廓进行比较。其计算适合分割的固定光栅尺寸(高度＝40和宽度＝32 像素的网格)上的输入轮廓和数据库轮廓之间的匹配质量。The method of the present invention accepts as input data a rough binary contour mask for each player and a rough camera calibration. And compare these contours with contours from the database. It computes the quality of the match between the input contours and the database contours on a fixed raster size suitable for segmentation (a grid of height=40 and width=32 pixels).

所述轮廓提取方法同利用时间信息有效扩展了传统方法。所述方法不依赖于单个帧匹配，而是考虑源图像段13和参考轮廓13'之间的差值在图像帧序列上的加权和。将具有索引t的二进制输入轮廓图像I(来自感兴趣的图像帧56)与来自数据库的具有索引s的轮廓图像I′_s进行比较时，得到的像素误差E_q(s)计算如下：The contour extraction method effectively extends the traditional method with the use of temporal information. The method does not rely on individual frame matching, but considers a weighted sum of the differences between thesource image segment 13 and the reference contour 13' over a sequence of image frames. When comparing the binary input contour image I with index t (from the image frame of interest 56) with the contour image I's with index_s from the database, the resulting pixel error_Eq (s) is calculated as:

其中，n为滤波窗口大小，也就是所考虑的感兴趣帧56之前和之后的帧数；P为所有光栅位置的集合，其中对应的像素在两个图像中不可能被遮挡，即，不期望是另一个运动员轮廓的一部分。|P|表示P中光栅位置的个数。光栅位置可以对应于摄像头的实际硬件像素，也可以对应于摄像头图像计算出的缩放图像中的像素。where n is the filter window size, i.e. the number of frames before and after the frame ofinterest 56 under consideration; P is the set of all raster positions where the corresponding pixel is unlikely to be occluded in both images, i.e., it is not expected that is part of another athlete's silhouette. |P| represents the number of grating positions in P. The raster position can correspond to the actual hardware pixels of the camera, or it can correspond to the pixels in the zoomed image calculated from the camera image.

其中，权重θ_s(i)描述中心围绕s的归一化高斯函数53。对于未被包括在所述数据集的I′_s+_i(p)而言，θ_s(i)在标准化处理之前被设置为0。本方法的优势在于，其是对序列而不是单个图像进行比较，这不仅能够增加时间相干性(实现平稳运动)，而且还能够有效改进人体姿态估计。通过这种方法，即使是遮挡在几帧上的图像部分也可以更牢固地拟合。通常，此方法有助于防止匹配相似但源自完全不同人体姿态的轮廓。图6给出了本发明所涉及的初始人体姿态估计与现有技术中人体姿态估计的直接比较。其中。图6(a)为通过仅将当前帧与数据库进行比较来估计2D人体姿态(现有技术的做法)；(b)找到的用于单帧比较的数据库项；(c)通过比较轮廓序列估计2D人体姿态； (d)找到的数据库序列与中间的相应图像段。where the weights θ_s (i) describe a normalizedGaussian function 53 centered around s. For I'_s +_i (p) not included in the dataset, θ_s (i) was set to 0 before normalization. The advantage of this method is that it compares sequences rather than single images, which not only increases temporal coherence (achieving smooth motion), but also effectively improves human pose estimation. In this way, even parts of the image that are occluded over a few frames can be fitted more firmly. In general, this approach helps prevent matching contours that are similar but originate from disparate human poses. FIG. 6 shows a direct comparison between the initial human pose estimation involved in the present invention and the human pose estimation in the prior art. in. Figure 6(a) is the estimation of 2D human pose by comparing only the current frame with the database (the practice in the prior art); (b) the found database entry for single-frame comparison; (c) estimated by comparing contour sequences 2D human pose; (d) Found database sequence with corresponding image segments in the middle.

通过使用该像素误差，本发明在每个摄像机视图中搜索最佳的两个人体姿态假设，并通过选择最低的三角测量误差E，来选择这些假设地最佳组合。当然，在替代实施例中，可以使用来自每个摄像机的多于两个的人体姿态假设，以确定最佳组合，或者每个摄像机的不同数目，或者假设它是最佳的而不考虑三角测量误差，或者来自至少一个摄像机的仅一个人体姿态假设。Using this pixel error, the present invention searches for the best two body pose hypotheses in each camera view and selects the best combination of these hypotheses by selecting the lowest triangulation error E. Of course, in alternative embodiments, more than two human pose hypotheses from each camera could be used to determine the best combination, or a different number of each camera, or assume it's optimal regardless of triangulation error, or only one human pose hypothesis from at least one camera.

图7示出了人体姿态模糊性的示例：(a)第一摄像机中的可能标签；(b)从顶部的示意图来说明膝盖的两个可能位置；以及(c) 第二摄像机中的可能标签。Figure 7 shows examples of human pose ambiguity: (a) possible labels in the first camera; (b) schematic diagram from the top illustrating two possible positions of the knee; and (c) possible labels in the second camera .

由于2D人体姿态检测步骤依赖于轮廓匹配，因此会很容易产生模糊性。尽管给出一个轮廓和从数据库中检索到的匹配的初始2D人体姿态，但是人们依旧无法确定手臂和腿上的“左”和“右”标签是否正确。为此，在一个实施例中，在2D轮廓匹配之后，忽略来自检索到的数据库人体姿态的信息，该信息定义腿或手臂(或者，在一般情况下，是一组对称关节链中的一个)是否要被标记，例如，在2D 轮廓的匹配之后，将被标记为例如“左”或“右”。那么剩余的模糊会按如下方式得到解决。其中，图7(a)显示了腿的两个可能标签的示例轮廓。从图中可以看出，右膝可能的位置用菱形标记。该视图来自图7(b)的模式中的左侧照摄像机。然后同一帧中的同一对象在另一台摄像机中显示出如图7(c)所示的轮廓，同样带有两个可能的腿部标签。因此，在提升到3D后，在3D中有四个可能的右膝位置。这些可能的情况如图7(b)所示。可以看出，如果右膝落在标有星号的一个位置上，那么左膝就会落在另一个星号标记位置之上。如果右膝落在一个标有圆圈的位置上，那么左膝就会落在由另一个圆圈标记的位置之上。假设圆圈所标记位置是膝盖的正确位置，那么可以有两种不同类型的失效：要么膝盖在3D中会被错误地标记，但位置正确；要么膝盖出现在错误的位置(星形)。Since the 2D human pose detection step relies on contour matching, it is prone to ambiguity. Although given an outline and the matching initial 2D human pose retrieved from the database, one cannot be sure that the "left" and "right" labels on the arms and legs are correct. To this end, in one embodiment, after 2D contour matching, information from the retrieved database human pose is ignored, which defines a leg or arm (or, in general, one of a set of symmetrical joint chains) Whether to be marked, eg after matching of 2D contours, will be marked eg "left" or "right". The remaining ambiguity is then resolved as follows. Among them, Fig. 7(a) shows example contours of two possible labels for legs. As can be seen from the picture, the possible positions of the right knee are marked with diamonds. This view is from the left side camera in the mode of Figure 7(b). Then the same object in the same frame is shown in another camera with the contour shown in Fig. 7(c), again with two possible leg labels. Therefore, after lifting to 3D, there are four possible right knee positions in 3D. These possible cases are shown in Fig. 7(b). It can be seen that if the right knee lands on one of the positions marked with an asterisk, then the left knee will fall on the other asterisked position. If the right knee falls on a position marked by a circle, the left knee falls on the position marked by another circle. Assuming that the position marked by the circle is the correct position of the knee, there can be two different types of failures: either the knee will be incorrectly marked in 3D, but in the correct position, or the knee will appear in the wrong position (star).

由于没有额外的信息，因此无法在这种情况下决定哪些位置是正确的，即从四种可能性中选择唯一正确的位置，特别是在只有两个摄像机可用的情况下。消除翻转模糊性的一种可能方法包括检查所有可能的组合，只保留解剖学上可能的组合。然而需要注意的是，几种翻转配置仍有可能产生解剖学上正确的人体姿态。Without additional information, it is impossible to decide which positions are correct in this case, i.e. choose the only correct position out of four possibilities, especially if only two cameras are available. One possible way to remove the ambiguity of flipping involves examining all possible combinations and keeping only those that are anatomically possible. It should be noted, however, that several flip configurations are still possible to produce anatomically correct human poses.

为了正确地解决这些模糊性，本发明使用了两步法：首先，在每对连续的2D帧之间建立局部一致性，使得整个2D人体姿态序列在时间上是一致的。其次，贯穿整个序列的任何剩余模糊性都被全局解决。To properly resolve these ambiguities, the present invention uses a two-step approach: first, local consistency is established between each pair of consecutive 2D frames so that the entire 2D human pose sequence is temporally consistent. Second, any remaining ambiguities throughout the sequence are resolved globally.

实现局部一致性这一目标的第一步是确保从摄像机帧k(如图8 (a)和8(b)所示)和k+l(如图8(c)所示)发现的2D人体姿态具有一致性，即在两个连续帧之间不存在手臂和腿部的左右翻转。换言之，如果位于帧k处的一个像素属于右腿，那么在帧k+l处，该像素亦应该属于右腿。通过应用这一假设，本发明为彩色图像l_C(k)和 I_C(k+1)中的每个像素分配相应的骨骼，并计算帧之间的光流(如图8 (d)所示)。The first step in achieving this goal of local consistency is to ensure that the 2D human body found from camera frames k (as shown in Fig. 8(a) and 8(b)) and k+l (as shown in Fig. 8(c)) The pose is consistent, that is, there is no left-right flip of the arms and legs between two consecutive frames. In other words, if a pixel at frame k belongs to the right leg, then at frame k+1, that pixel should also belong to the right leg. By applying this assumption, the present invention assigns a corresponding bone to each pixel in the color images l_C(k) and I_C(k+1) , and calculates the optical flow between frames (as shown in Fig. 8(d) Show).

图8描述了通过基于模型的一致性检查来建立局部一致性的方法。在该实施例中，尽管这是一种基于网格的模型，但是该方法也可以用真实世界对象的3D形状的另一表示方法来执行。其中分配给右腿的关节用菱形标记：(a)上一帧和(b)适合的网格以及(c)在当前帧中错误地分配腿部，(d)光流，(e)在当前帧中用正确和错误标记的匹配来拟合网格，(f)当前帧中翻转(正确)腿部的错误。其中，在(e)和(f)中，小腿和脚上的不规则白色形状表示在(e) 中被标记为错误的54和在(f)中被标记为正确的55的像素。尽管很少出现(e)中没有“正确”的像素，(f)中也没有“错误”的像素，但。无法确保上述问题不会出现。Figure 8 depicts a method for establishing local consistency through model-based consistency checking. In this embodiment, although this is a mesh-based model, the method can also be performed with another representation of the 3D shape of the real world object. where the joints assigned to the right leg are marked with diamonds: (a) the previous frame and (b) the mesh that fits and (c) the wrongly assigned leg in the current frame, (d) the optical flow, (e) the current frame The mesh is fitted with correct and incorrectly labeled matches in the frame, (f) the error of flipped (correct) legs in the current frame. Among them, in (e) and (f), the irregular white shapes on the lower legs and feet represent the pixels marked as wrong 54 in (e) and correct 55 in (f). Although it is rare that there are no "correct" pixels in (e) and no "wrong" pixels in (f), but. There is no guarantee that the above problems will not arise.

其基本思想是，使用光流计算的帧k中的像素和帧k+1中的对应像素应该指定给相同的骨骼。否则可能会出现翻转(如图8(c)所示)。因此，本发明计算了左或右关节可能标签的所有组合的这种关联(参见图7)，对每个组合的像素流的一致性进行计算，并为第二帧选择一个一致性最高的标签组合。为使该方法在光流误差方面更具鲁棒性，本发明只考虑具有良好光流的像素，并且在两帧中对应的像素标签是相同的骨骼类型，即，或者是两个手臂，或者是两条腿。例如，如果像素p在帧k中属于左臂，在帧k+1中则属于躯干，那么这很可能是因为由于存在遮挡而造成光流不准确，因此可以排除该像素。如果像素属于同一类型(手臂或腿部)的不同骨骼，则这是翻转的强烈指示。本发明采用投票策略来选择最佳翻转配置：根据肢体是否包含大多数“左”或“右”像素，将其标记为“左”或“右”。The basic idea is that the pixel in frame k calculated using optical flow and the corresponding pixel in frame k+1 should be assigned to the same bone. Otherwise flipping may occur (as shown in Figure 8(c)). Therefore, the present invention computes this association for all combinations of possible labels for the left or right joint (see Figure 7), computes the consistency of the pixel stream for each combination, and selects the one with the highest consistency for the second frame combination. To make the method more robust in terms of optical flow error, the present invention only considers pixels with good optical flow, and the corresponding pixel labels in two frames are the same bone type, i.e., either two arms, or are two legs. For example, if pixel p belongs to the left arm in frame k and the torso in frame k+1, then this is likely due to inaccurate optical flow due to occlusion, so the pixel can be excluded. If the pixel belongs to different bones of the same type (arm or leg), this is a strong indication of flipping. The present invention employs a voting strategy to select the best flip configuration: a limb is marked as "left" or "right" depending on whether it contains the majority of "left" or "right" pixels.

为此，必须将每个像素指定给其相应的骨骼。由于基于到骨骼的距离的简单指定没有考虑到遮挡，因此其并不是一个最佳的方法。因此，本发明使用来自所有摄像机的信息来构建3D人体姿态(如“初始人体姿态估计”一节中所述)。同样，本发明对所有骨骼使用颜色编码，对所有摄像机中的所有可能翻转使用变形和渲染的模板网格。渲染的网格为每个肢体携带信息，无论它是“左”还是“右”。因此，像素指定是一个简单的查找，尽管存在自遮挡，但仍可提供准确的指定：对于每个像素，相同位置的渲染网格能够指示该像素属于“左” 还是属于“右”。To do this, each pixel must be assigned to its corresponding bone. Since a simple specification based on distance to the bone does not take occlusion into account, it is not an optimal method. Therefore, the present invention uses information from all cameras to construct a 3D human pose (as described in the "Initial Human Pose Estimation" section). Also, the present invention uses color coding for all bones, deformed and rendered template meshes for all possible flips in all cameras. The rendered mesh carries information for each limb, whether it is "left" or "right". Thus, pixel assignment is a simple lookup that, despite self-occlusion, provides an accurate assignment: for each pixel, a co-located rendering grid can indicate whether the pixel belongs to "left" or "right."

在完成局部一致性步骤之后，所有连续的帧之间不应该再存在翻转，这意味着整个序列是一致的。不过需要注意的是，仍然有可能存在整个序列被反转到错误方向的情况。不过，这一个问题只需要对整个序列进行二进制消歧处理就能够请以解决。因此，本发明能够通过评估手臂的可能的全局标签和腿部的可能的全局标签的函数来检查其全局一致性。通过选择标签组合来选择最终标签，该标签组合最小化了整个序列中的以下误差项的总和：After completing the local consistency step, there should be no more flips between all consecutive frames, which means that the entire sequence is consistent. Note, however, that it is still possible for the entire sequence to be reversed in the wrong direction. However, this problem can be solved simply by binary disambiguation of the entire sequence. Thus, the present invention is able to check for global consistency by evaluating a function of possible global labels for the arms and possible global labels for the legs. The final label is chosen by choosing a label combination that minimizes the sum of the following error terms over the entire sequence:

E_g＝λ_DBE_DB+λ_tE_t (2)E_g =λ_DB E_DB +λ_t E_t (2)

可以看出，上式是具有恒定参数

和λ的加权和；其中，

是 “到数据库的距离”，其确保所选的标记/翻转会产生沿序列的合理人体姿态。它根据与数据库中最接近的人体姿态P的距离对人体姿态进行处罚：It can be seen that the above formula has constant parameters

and a weighted sum of λ; where,

is the "distance to database" which ensures that the selected markers/flips result in reasonable human poses along the sequence. It penalizes human poses based on the distance to the closest human pose P in the database:

其中，a和β是三角连接位置J的连接角度。α′和β′是数据库人体姿态P中的两个人体姿态。|J|为关节数。当搜索沿序列的每个人体姿态的最接近的数据库人体姿态时，会考虑数据库中人体姿态的标签 (右/左)。也就是说，序列人体姿态的肢体仅与标记为相同的数据库人体姿态的肢体匹配。由于数据库只包含人体测量学上正确的人体姿态，这将对不可信的人体姿态进行处罚。where a and β are the connection angles of the triangular connection position J. α' and β' are the two human poses in the database human pose P. |J| is the number of joints. The label (right/left) of the body pose in the database is considered when searching for the closest database body pose for each body pose along the sequence. That is, the limbs of the sequence human poses are only matched with the limbs marked with the same database human poses. Since the database only contains anthropometrically correct human poses, this will penalize untrustworthy human poses.

至此，由人体姿态估计计算出地最佳3D人体姿态仍然限于在每个视图中适合存在于数据库中的人体姿态。数据库只包含所有可能人体姿态的子集，因此通常不包含准确的解决方案。So far, the optimal 3D human pose calculated from the human pose estimation is still limited to the human poses that fit in the database in each view. The database contains only a subset of all possible human poses, and therefore usually does not contain accurate solutions.

为此目的，本发明应用优化方法来检索更准确地人体姿态，如图 9所示。为指导这种优化方法，本发明结合了几个空间和时间能量函数，并使用优化方法将它们进行最小化处理。For this purpose, the present invention applies an optimization method to retrieve more accurate human body poses, as shown in FIG. 9 . To guide this optimization method, the present invention combines several spatial and temporal energy functions and minimizes them using an optimization method.

所述能量函数或误差函数是以本发明在初始人体姿态估计一节中描述的骨架S的表示方式为基础。除骨骼长度之外的所有参数都是每帧可变的。并且骨骼长度也是可变的，但在整个序列中保持不变，并且被初始化为所有帧的局部长度的平均值。这会自动引入人体测量约束，因为骨骼不应随时间缩小或增长。所选骨架表示的另一个很好的属性是它显著减少了变量的数量。为了应对标定误差，本发明还对每个对象、摄像机和帧给定的一个维度的平移矢量进行了优化。The energy function or error function is based on the representation of the skeleton S described in the initial human pose estimation section of the present invention. All parameters except bone length are variable per frame. And the bone length is also variable, but remains constant throughout the sequence, and is initialized as the average of the local lengths of all frames. This automatically introduces anthropometric constraints, as bones should not shrink or grow over time. Another nice property of the selected skeleton representation is that it significantly reduces the number of variables. In order to cope with the calibration error, the present invention also optimizes the translation vector of one dimension given by each object, camera and frame.

本发明将每个帧和对象的能量或误差函数定义为以下误差项的加权和：The present invention defines the energy or error function for each frame and object as the weighted sum of the following error terms:

E(S)＝ω_sE_s+ω_fE_f+ω_DBE_DB+ω_rotE_rot+ω_pE_p (4)E(S)=ω_s E_s +ω_f E_f +ω_DB E_DB +ω_rot E_rot +ω_p E_p (4)

需要注意的是，并不是所有错误项都是优化处理返回有用结果所必需的条件。根据场景和现实世界对象的性质，人们可以省略一个或多个误差项。例如，在一个实施例中，为了观察运动场景，至少使用轮廓填充误差，并且可选地使用距离数据库误差。Note that not all error terms are necessary for optimization to return useful results. Depending on the nature of the scene and real-world objects, one can omit one or more error terms. For example, in one embodiment, in order to observe a moving scene, at least the contour fill error, and optionally the distance database error, is used.

在一个局部优化过程中，可以首先对误差泛函进行最小化处理，其中，需要注意的是，在同一时刻的一帧或一组帧中看到的对象的人体姿态都是可变的。或者，可以在较长的帧序列上对误差泛函进行最小化处理，在较长的帧序列中，对应于所有帧的人体姿态亦是变化的，这能够较为方便地找到彼此一致的整个序列地最佳匹配(根据链接连续帧的优化标准)。In a local optimization process, the error functional can be minimized first, where it should be noted that the human poses of objects seen in a frame or a group of frames at the same time are all variable. Alternatively, the error functional can be minimized over a longer sequence of frames. In the longer sequence of frames, the poses of the human body corresponding to all frames also change, which makes it easier to find the entire sequence that is consistent with each other. the best match (according to the optimization criteria for linking consecutive frames).

轮廓匹配误差项E_s：正确3D骨架的骨骼应该能够投影到所有摄影机中的2D轮廓之上，误差项E_s负责对2D投影位于轮廓之外的关节位置进行惩罚：Contour matching error term_Es : The bones of the correct 3D skeleton should be able to project onto the 2D contour in all cameras, and the error term_Es is responsible for penalizing joint positions where the 2D projection is outside the contour:

其中，C是覆盖所述对象轮廓的所有摄影机的集合。J₊是所有关节的集合J与位于骨骼中间的点或沿骨骼放置的一个或多个其他点的并集。归一欧几里得距离转换(EDT)算法能够通过除以轮廓边界框的较大边为摄像机视图中的每个2D点返回一个相距轮廓内最近点的距离。该归一化处理对于使误差独立于摄像机图像中的对象(大小可以根据变焦而变化)的大小而言至关重要。而P_c(j)是考虑摄像机偏移将3D关节j变换到摄像机空间的投影，其目的是对小的校准误差进行校正处理。where C is the set of all cameras covering the outline of the object. J₊ is the union of the set J of all joints with a point located in the middle of the bone or one or more other points placed along the bone. The Normalized Euclidean Distance Transform (EDT) algorithm returns a distance to the closest point within the contour for each 2D point in the camera view by dividing by the larger side of the contour bounding box. This normalization process is critical to make the error independent of the size of the object in the camera image (which can vary in size depending on zoom). And P_c (j) is the projection of transforming the 3D joint j into the camera space considering the camera offset, and its purpose is to correct the small calibration error.

轮廓填充误差项E_f：尽管轮廓匹配项E_s主要负责对位于轮廓外的关节进行惩罚，但到目前为止，当将其应用到轮廓时并不存在限制。填充误差项E_f可防止关节的位置与位于躯干内的另一个关节的位置过于靠近，因此，换句话说，该项可以确保在所有肢体中都存在关节：Contour filling error term E_f : Although the contour matching term E_s is mainly responsible for penalizing joints located outside the contour, so far there is no limitation when applying it to the contour. Filling in the error term E_f prevents the position of the joint from being too close to the position of another joint located in the torso, so in other words, this term ensures that the joint is present in all limbs:

其中，R是位于轮廓内的2D人体姿态估计部分中的所有栅格点的集合。

能够将所述网格点从摄像机c的摄像机空间变换为世界空间中的光线。而dist()是光线到关节的距离。where R is the set of all grid points in the 2D human pose estimation part located within the contour.

The grid points can be transformed from camera space of camera c to rays in world space. And dist() is the distance from the ray to the joint.

该误差项的目的是惩罚铰接式对象模型的元素(在本例中是关节和链接，或者仅是链接)在轮廓内发生折叠的人体姿态。误差项倾向于轮廓内的每个光栅点或网格点靠近所述元素的人体姿态。换句话说，随着一个点到最近元素的距离增加，轮廓填充误差项增加。最近的元素可以是链接或关节，或者两者兼而有之。The purpose of this error term is to penalize human poses where elements of the articulated object model (in this case joints and links, or just links) are folded within the silhouette. The error term tends to have each raster point or grid point within the contour close to the element's human pose. In other words, as the distance of a point to the nearest element increases, the contour fill error term increases. The closest element can be a link or a joint, or both.

距数据库人体姿态误差项E_DB：该项已由公式3定义。它能够通过利用正确人体姿态的数据库来确保最终的3D人体姿态在运动学上是可能的(例如，膝关节以正确的方式弯曲)。它隐含地将人体测量约束添加到我们的优化中。这里使用的最接近的数据库人体姿态是通过对人体姿态进行新的搜索来找到的，因为估计的人体姿态可能在优化过程中已经改变。Human pose error term E_DB from database: This term has been defined byEquation 3. It is able to ensure that the final 3D human pose is kinematically possible by leveraging a database of correct human poses (e.g., the knee joint is bent in the correct way). It implicitly adds anthropometric constraints to our optimization. The closest database human poses used here are found by performing a new search on human poses, as the estimated human poses may have changed during the optimization process.

平滑度误差项E_rol和E_p：人体运动通常是平滑的，因此相邻帧的骨架应该是相似的。这使得本发明能够将时间一致性引入人体姿态优化。因此，E_rol惩罚连续帧的骨架内角的较大变化，而E_p惩罚较大的运动：Smoothness error terms E_rol and E_p : Human motion is usually smooth, so the skeletons of adjacent frames should be similar. This enables the present invention to introduce temporal consistency into human pose optimization. Therefore, E_rol penalizes large changes in the inner angle of the skeleton for successive frames, while E_p penalizes large motions:

E_p＝|p₀-p₀|(8)E_p =|p₀ -p₀ |(8)

其中，α′和β′是前一帧中同一对象的对应角度，而p′₀是前一帧中根关节的全局位置。可以以类似的方式考虑和约束根骨骼的旋转。where α′ and β′ are the corresponding angles of the same object in the previous frame, and p′₀ is the global position of the root joint in the previous frame. The rotation of the root bone can be considered and constrained in a similar fashion.

长度误差项E：在处理帧序列时，骨骼长度(或链接长度)的初始化已经是一个很好的近似值。因此，本发明尝试将优化后的人体姿态保持在以下长度附近：Length error term E: The initialization of bone length (or link length) is already a good approximation when dealing with frame sequences. Therefore, the present invention attempts to keep the optimized human posture around the following lengths:

其中，l_i是最终骨骼长度，

是初始骨骼或链接长度或另一参考骨骼或链接长度。在另一实施例中，

可以被认为是真实的骨骼长度，并且也是可变的，并且可以在对整个帧序列执行优化时进行估计。where_li is the final bone length,

is the initial bone or link length or another reference bone or link length. In another embodiment,

Can be considered the true bone length, and is also variable and can be estimated when performing optimization on the entire frame sequence.

优化过程如下：为了最小化公式4中的能量项，本发明采用例如局部优化策略，其中本发明通过沿着随机选取的方向执行线搜索来迭代地逐个优化变量。对于每个变量，本发明选择10个随机方向进行优化，并执行20次全局迭代。该优化过程可以独立于上述初始人体姿态估计过程来实现，即，利用任何其他人体姿态估计过程或者利用恒定的默认人体姿态作为初始估计。然而，使用提供相当好的初始估计的初始人体姿态估计过程确保优化过程可能找到全局最优匹配，从而避免误差函数的局部最小值。图9说明了优化过程的效果。最左边的例子显示了轮廓填充误差项的影响：可以向上或向下抬起运动员的手臂以减小轮廓匹配误差项，但只有当手臂向上移动时才会减小轮廓填充误差项。图9显示了对Germann等人的方法的明显改进，该方法根本不包括人体姿态优化，并且每个人体姿态都必须手动校正。The optimization process is as follows: To minimize the energy term in Equation 4, the present invention employs, for example, a local optimization strategy, where the present invention iteratively optimizes variables one by one by performing a line search along randomly chosen directions. For each variable, the present invention selects 10 random directions for optimization and performs 20 global iterations. This optimization process can be implemented independently of the initial human pose estimation process described above, i.e., using any other human pose estimation process or using a constant default human pose as the initial estimate. However, using an initial human pose estimation process that provides a reasonably good initial estimate ensures that the optimization process is likely to find a globally optimal match, thus avoiding local minima of the error function. Figure 9 illustrates the effect of the optimization process. The leftmost example shows the effect of the contour-filling error term: the athlete's arm can be lifted up or down to reduce the contour-matching error term, but the contour-filling error term is only reduced when the arm is moved up. Figure 9 shows a clear improvement over the method of Germann et al., which does not include human pose optimization at all and each human pose must be manually corrected.

本发明利用两个或三个摄像机在四个真实的标枪比赛电视镜头序列上对所提出的方法进行了评估，产生了大约1500个要处理的人体姿态。图4和图11显示了结果的一个子集。The present invention evaluates the proposed method on four real javelin-match TV footage sequences with two or three cameras, yielding approximately 1500 human poses to process. Figures 4 and 11 show a subset of the results.

图11中的每一行都显示了一组连续的人体姿态，并且每一项都显示了所有可用摄像机中各自运动员的图像。从中可以看出，即使只有两个摄像头和非常低分辨率的图像，所述方法在大多数情况下也可以恢复出精确的人体姿态。Each row in Figure 11 shows a sequential set of human poses, and each item shows images of the respective athlete from all available cameras. It can be seen that even with only two cameras and very low-resolution images, the method can recover accurate human poses in most cases.

本发明在所有结果中使用的参数值(使用所述自动参数调优系统计算)如下所示：The parameter values used by the present invention in all results (calculated using the automatic parameter tuning system) are as follows:

ω_s＝9,ω_f＝15,ω_DB＝0.05,ω_rot＝0.1,ω_p＝1,λ_DB＝0.15,λ_t＝0.3ω_s =9,ω_f =15,ω_DB =0.05,ω_rot =0.1,ω_p =1,λ_DB =0.15,λ_t =0.3

对于公式(4)和(2)中的优化函数，本发明将上述参数用于所有结果。所述参数通过以下参数调整过程获得。本发明对两个场景中的2D人体姿态进行了手动注释。然后运行该方法，并自动将结果与手动注释进行比较。通过将其用作误差函数，并将参数用作变量，可实现自动参数优化。For the optimization functions in equations (4) and (2), the present invention uses the above parameters for all results. The parameters are obtained through the following parameter adjustment process. The present invention manually annotates the 2D human poses in the two scenes. The method is then run and the results are automatically compared to the manual annotations. Automatic parameter optimization is achieved by using it as the error function and using the parameters as variables.

本发明所涉及的人体姿态估计方法在双摄像头的设置中每个运动员每帧大约需要40秒，在三摄像头设置中大约需要60秒。本发明实现了一个并行版本，为每个运动员运行一个线程。在8核系统上，其能够提供大约8倍的加速比。The human pose estimation method involved in the present invention takes approximately 40 seconds per frame per athlete in a dual-camera setup, and approximately 60 seconds in a triple-camera setup. The present invention implements a parallel version, running one thread for each player. On an 8-core system, it can provide about an 8x speedup.

需要注意的是，本发明所述初始人体姿态估计并不依赖于前一帧的人体姿态估计结果。因此，该过程并不存在漂移，其可以从错误的人体姿态猜测中恢复。It should be noted that the initial human pose estimation in the present invention does not depend on the human pose estimation result of the previous frame. Therefore, there is no drift in the process, which can recover from erroneous human pose guesses.

图10显示了到目前为止所描述的方法的失效情况：(a)人体姿态离数据库太远并且不能被正确估计；(b)手臂离身体太近并且不能被正确定位。Figure 10 shows the failure of the methods described so far: (a) the human pose is too far from the database and cannot be estimated correctly; (b) the arm is too close to the body and cannot be positioned correctly.

图11显示了每帧显示所有摄影机视图的结果序列。Figure 11 shows the resulting sequence showing all camera views per frame.

到目前为止所描述的方法可能由于缺乏仅由二进制轮廓提供的信息而失效，特别是当手臂太靠近身体时，如图10(b)所示。也就是说，几个人体姿态可以具有非常相似的二元轮廓。因此，仅使用轮廓信息可能不足以消除人体姿态的模糊性。而将光流引入优化过程可以解决这种模糊性。更详细地，这是通过确定两个连续图像之间的光流，并由此确定一个或多个骨骼或关节的预期位置来实现的。可以对所有关节执行此操作，或仅对遮挡(或被身体其他部位遮挡)的关节执行此操作。例如，给定一个帧中关节(或骨骼位置和方向)的位置，将使用到相邻帧的光流来计算相邻帧中的预期位置。The methods described so far may fail due to the lack of information provided only by binary contours, especially when the arm is too close to the body, as shown in Fig. 10(b). That is, several human poses can have very similar binary contours. Therefore, using contour information alone may not be sufficient to remove the ambiguity of human poses. Introducing optical flow into the optimization process can resolve this ambiguity. In more detail, this is achieved by determining the optical flow between two consecutive images, and thereby the expected position of one or more bones or joints. This can be done for all joints, or only for joints that are occluded (or occluded by other parts of the body). For example, given the positions of joints (or bone positions and orientations) in one frame, the optical flow to adjacent frames will be used to calculate the expected positions in adjacent frames.

然而，如果在所述摄像机中，身体部分没有发生遮挡，那么光流法则是一种最可靠的方法。因此，在另一实施例中，类似于上述方法，从摄像机视图渲染的网格也被用来用其所属的身体部位来标记每个像素。然后，为了将关节的位置传播到下一帧，仅使用投影关节位置附近(或之上)的属于该关节的相应身体部分的那些像素的光流。However, if the body parts are not occluded in the camera, the optical flow law is the most reliable method. Therefore, in another embodiment, similar to the method described above, the grid rendered from the camera view is also used to label each pixel with the body part to which it belongs. Then, in order to propagate the position of the joint to the next frame, the optical flow of only those pixels belonging to the corresponding body part of the joint near (or above) the projected joint position are used.

对于所述每个可用的摄像机，利用给定这些预期位置，可以通过三角测量计算预期关节位置。以与到数据库E_DB的距离基本上相同的方式计算到相应预期人体姿态(考虑到整个身体或仅感兴趣的肢体或骨骼)的距离(也称为流动误差项E_EX)，并且将相应的加权项ω_EXE_EX添加到等式(4)的能量函数。For each of the available cameras, given these expected positions, the expected joint positions can be calculated by triangulation. The distance (also called the flow error term E_EX ) to the corresponding expected human pose (considering the entire body or only the limb or bone of interest) is calculated in substantially the same way as the distance to the database E_DB , and the corresponding A weighting term ω_EX E_EX is added to the energy function of equation (4).

除此之外，本发明所述方法的结果在很大程度上依赖于人体姿态数据库。因此，一个好的数据库应该有大范围的动作和大范围的视图，以便最初的人体姿态猜测接近正确的人体姿态。图10(a)示出了其中数据库中没有类似人体姿态并且因此人体姿态估计失效的示例。可以手动或自动选择好的人体姿态，然后将其添加到数据库中，从而扩大可能人体姿态的空间。In addition to this, the results of the method described in the present invention rely heavily on a database of human poses. Therefore, a good database should have a large range of motions and a large range of views so that the initial human pose guesses are close to the correct human pose. Figure 10(a) shows an example where there are no similar human poses in the database and thus human pose estimation fails. Good human poses can be selected manually or automatically and then added to the database, thus expanding the space of possible human poses.

另一个可以进一步利用的重要先验信息是人体骨骼的运动学信息：到目前为止，该方法已经使用了一些隐式人体测量约束，但是可以在人体姿态优化中加入对关节角度的特定约束，即考虑到人体关节角度被限制在一定范围内的事实。Another important prior information that can be further exploited is the kinematic information of the human skeleton: so far the method has used some implicit anthropometric constraints, but specific constraints on joint angles can be added to the human pose optimization, i.e. Considering the fact that the human joint angles are limited within a certain range.

为了描述的方便，描述以上装置时以功能分为各种单元分别描述。当然，在实施本申请时可以把各单元的功能在同一个或多个软件和/ 或硬件中实现。For the convenience of description, when describing the above device, the functions are divided into various units and described respectively. Of course, when implementing the present application, the functions of each unit may be implemented in one or more software and/or hardware.

本领域内的技术人员应明白，本发明的实施例可提供为方法、系统、或计算机程序产品。因此，本发明可采用完全硬件实施例、完全软件实施例、或结合软件和硬件方面的实施例的形式。而且，本发明可采用在一个或多个其中包含有计算机可用程序代码的计算机可用存储介质(包括但不限于磁盘存储器、CD-ROM、光学存储器等)上实施的计算机程序产品的形式。As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.

本发明是参照根据本发明实施例的方法、设备(系统)、和计算机程序产品的流程图和/或方框图来描述的。应理解可由计算机程序指令实现流程图和/或方框图中的每一流程和/或方框、以及流程图和/或方框图中的流程和/或方框的结合。可提供这些计算机程序指令到通用计算机、专用计算机、嵌入式处理机或其他可编程数据处理设备的处理器以产生一个机器，使得通过计算机或其他可编程数据处理设备的处理器执行的指令产生用于实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能的装置。The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block in the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general purpose computer, special purpose computer, embedded processor or other programmable data processing device to produce a machine such that the instructions executed by the processor of the computer or other programmable data processing device produce Means for implementing the functions specified in a flow or flow of a flowchart and/or a block or blocks of a block diagram.

本申请可以在由计算机执行的计算机可执行指令的一般上下文中描述，例如程序模块。一般地，程序模块包括执行特定任务或实现特定抽象数据类型的例程、程序、对象、组件、数据结构等等。也可以在分布式计算环境中实践本申请，在这些分布式计算环境中，由通过通信网络而被连接的远程处理设备来执行任务。在分布式计算环境中，程序模块可以位于包括存储设备在内的本地和远程计算机存储介质中。The application may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The application may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including storage devices.

这些计算机程序指令也可存储在能引导计算机或其他可编程数据处理设备以特定方式工作的计算机可读存储器中，使得存储在该计算机可读存储器中的指令产生包括指令装置的制造品，该指令装置实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能。These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory result in an article of manufacture comprising instruction means, the instructions The apparatus implements the functions specified in the flow or flow of the flowcharts and/or the block or blocks of the block diagrams.

这些计算机程序指令也可装载到计算机或其他可编程数据处理设备上，使得在计算机或其他可编程设备上执行一系列操作步骤以产生计算机实现的处理，从而在计算机或其他可编程设备上执行的指令提供用于实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能的步骤。These computer program instructions can also be loaded on a computer or other programmable data processing device to cause a series of operational steps to be performed on the computer or other programmable device to produce a computer-implemented process such that The instructions provide steps for implementing the functions specified in the flow or blocks of the flowcharts and/or the block or blocks of the block diagrams.

在一个典型的配置中，计算设备包括一个或多个处理器(CPU)、输入/输出接口、网络接口和内存。In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.

内存可能包括计算机可读介质中的非永久性存储器，随机存取存储器(RAM)和/或非易失性内存等形式，如只读存储器(ROM)或闪存 (flash RAM)。内存是计算机可读介质的示例。Memory may include non-persistent memory in computer readable media, random access memory (RAM) and/or non-volatile memory in the form of read only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.

计算机可读介质包括永久性和非永久性、可移动和非可移动媒体可以由任何方法或技术来实现信息存储。信息可以是计算机可读指令、数据结构、程序的模块或其他数据。计算机的存储介质的例子包括，但不限于相变内存(PRAM)、静态随机存取存储器(SRAM)、动态随机存取存储器(DRAM)、其他类型的随机存取存储器(RAM)、只读存储器(ROM)、电可擦除可编程只读存储器(EEPROM)、快闪记忆体或其他内存技术、只读光盘只读存储器(CD-ROM)、数字多功能光盘(DVD)或其他光学存储、磁盒式磁带，磁带磁磁盘存储或其他磁性存储设备或任何其他非传输介质，可用于存储可以被计算设备访问的信息。按照本文中的界定，计算机可读介质不包括暂存电脑可读媒体(transitory media)，如调制的数据信号和载波。Computer readable media includes both persistent and non-permanent, removable and non-removable media. Information storage can be implemented by any method or technology. Information may be computer readable instructions, data structures, modules of programs, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read only memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), Flash Memory or other memory technology, Compact Disc Read Only Memory (CD-ROM), Digital Versatile Disc (DVD) or other optical storage, Magnetic tape cassettes, magnetic tape magnetic disk storage or other magnetic storage devices or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, computer-readable media does not include transitory computer-readable media, such as modulated data signals and carrier waves.

还需要说明的是，术语“包括”、“包含”或者其任何其他变体意在涵盖非排他性的包含，从而使得包括一系列要素的过程、方法、商品或者设备不仅包括那些要素，而且还包括没有明确列出的其他要素，或者是还包括为这种过程、方法、商品或者设备所固有的要素。在没有更多限制的情况下，由语句“包括一个……”限定的要素，并不排除在包括所述要素的过程、方法、商品或者设备中还存在另外的相同要素。It should also be noted that the terms "comprising", "comprising" or any other variation thereof are intended to encompass a non-exclusive inclusion such that a process, method, article or device comprising a series of elements includes not only those elements, but also Other elements not expressly listed, or which are inherent to such a process, method, article of manufacture, or apparatus are also included. Without further limitation, an element qualified by the phrase "comprising a..." does not preclude the presence of additional identical elements in the process, method, article of manufacture or apparatus that includes the element.

本说明书中的各个实施例均采用递进的方式描述，各个实施例之间相同相似的部分互相参见即可，每个实施例重点说明的都是与其他实施例的不同之处。尤其，对于系统实施例而言，由于其基本相似于方法实施例，所以描述的比较简单，相关之处参见方法实施例的部分说明即可。The various embodiments in this specification are described in a progressive manner, and the same and similar parts between the various embodiments may be referred to each other, and each embodiment focuses on the differences from other embodiments. In particular, for the system embodiment, since it is basically similar to the method embodiment, the description is relatively simple, and for related parts, please refer to the partial description of the method embodiment.

以上所述仅为本申请的实施例而已，并不用于限制本申请。对于本领域技术人员来说，本申请可以有各种更改和变化。凡在本申请的精神和原理之内所作的任何修改、等同替换、改进等，均应包含在本申请的权利要求范围之内。The above descriptions are merely examples of the present application, and are not intended to limit the present application. Various modifications and variations of this application are possible for those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application shall be included within the scope of the claims of the present application.