











技术领域technical field
本发明属于计算机图像处理技术领域,具体涉及一种估计人体姿 态的方法、装置及计算机可读介质。The invention belongs to the technical field of computer image processing, and in particular relates to a method, a device and a computer-readable medium for estimating a human body posture.
背景技术Background technique
人体姿态估计或运动捕捉作为计算机视觉和图形学领域中的一 个基本问题,在比赛和电影中的角色动画、比赛和监控的无控制器界 面等领域都有广泛应用。鉴于问题的复杂性,目前尚无一个能够使其 适用于所有应用领域的通用型解决方案。同样需要注意的是,所谓解 决方案在很大程度上会依赖于相关条件以及对设置施加的约束。通常 情况下,约束越多,就可以获得更为精确的人体姿态估计结果。但在 现实世界的场景中,通常很难增加约束。而许多实际应用则都离不开 场景。例如,可以简单的利用TV广播中已有的视频素材,展示如何 使用精确的人体姿态估计结果来在体育比赛期间从任意视点对运动 员进行高质量的渲染。除了在渲染领域中应用之外,在比赛期间对人 体姿态估计还可以用于进行生物力学分析和合成,以及用于比赛统计, 甚至将真实的比赛移植到电脑游戏之中。Human pose estimation or motion capture, as a fundamental problem in computer vision and graphics, is widely used in fields such as character animation in games and movies, controller-free interfaces for games and surveillance, etc. Given the complexity of the problem, there is currently no one-size-fits-all solution that makes it suitable for all application areas. It is also important to note that the so-called solution will depend heavily on the conditions and constraints imposed on the setup. In general, the more constraints, the more accurate human pose estimation results can be obtained. But in real-world scenarios, it is often difficult to add constraints. Many practical applications are inseparable from the scene. For example, it is simple to use video footage already available in TV broadcasts to show how accurate human pose estimation results can be used to render high-quality athletes from arbitrary viewpoints during sports competitions. In addition to applications in the field of rendering, human pose estimation during competitions can also be used for biomechanical analysis and synthesis, as well as for competition statistics, and even to port real competitions into computer games.
目前,商业运动捕捉系统通常使用遍布全身的光学标记来跟踪一 个对象的时变性运动状态。尽管这类系统能够得到非常准确的人体姿 态估计结果,可以捕捉到各种身体人体姿态以及面部表情。但是,这 类方法只能在受控环境下工作,因此只应用于特定范围,适用范围严 格受限。Currently, commercial motion capture systems typically use optical markers throughout the body to track the time-varying motion state of an object. Although such systems can obtain very accurate human pose estimation results, they can capture various body poses and facial expressions. However, such methods can only work in a controlled environment and are therefore only applicable to a specific and strictly limited scope.
目前根据使用的“镜头”类型,无标记人体姿态重建(或运动捕 捉)问题可大致被分为以下两类,即使用来自一个摄像机的视频序列, 以及使用来自多个校准摄像机的“镜头”。由于单目视频序列对用户 的限制较少,因此基于单目视频序列的人体姿态估计方法对于某些应 用而更为方便,但它亦存在一些固有的问题,其中包括深度不确定性 (模糊性)。尽管模糊性可以利用从运动到结构的方法来解决,这在 视觉上却是一个非常困难的问题。来自运动算法的结构通常依赖于包 含大量细节的高分辨率场景,而在运动场景中通常并不具备所需条件。Currently, label-free human pose reconstruction (or motion capture) problems can be roughly divided into two categories, namely, using a video sequence from one camera, and using "shots" from multiple calibrated cameras, depending on the type of "shots" used. The human pose estimation method based on monocular video sequences is more convenient for some applications because monocular video sequences are less restrictive to users, but it also has some inherent problems, including depth uncertainty (ambiguity ). Although ambiguity can be resolved using a motion-to-structure approach, it is a very difficult problem visually. Structures from motion algorithms often rely on high-resolution scenes that contain a lot of detail, which is often not available in motion scenes.
另外,在人体姿态估计方法中,另一个急需解决的问题是遮挡。 如果旨在进行人体姿态分析的“镜头”仅来自单一摄像机,那么则很 难对其进行解析。并且通过增加摄像机的数量,则更有可能获得同一 对象的无遮挡“镜头”。一般而言,摄像机的空间覆盖率越高,那么 获得的镜头的质量就会越高。除此之外,目前体育转播已开始使用多 个摄像机来进行图像获取。因此,人们可以利用这些信息来得到更精 确的3D人体姿态估计结果。In addition, in the human pose estimation method, another problem that needs to be solved urgently is occlusion. If the "shots" intended for human pose analysis come from only a single camera, it is difficult to parse them. And by increasing the number of cameras, it is more likely to get an unobstructed "shot" of the same object. In general, the higher the spatial coverage of the camera, the higher the quality of the footage obtained. In addition to this, sports broadcasting has started to use multiple cameras for image acquisition. Therefore, people can use this information to get more accurate 3D human pose estimation results.
就目前已有技术而言,大多数多视角3D人体姿态估计方法都是 利用跟踪算法来重建某一时刻处的人体姿态或根据某一时间下的人 体姿态来重建另一时刻的人体姿态。其中,跟踪可以使用光流拟合和 或立体视觉匹配来实现。然而,尽管这些方法可以提供较精确的人体 姿态估计结果,但它们通常需要在受控环境下工作,且需要更多的高 分辨率摄像机(通常至少四个)和良好的场景空间覆盖(通常是圆形 覆盖)来解决由于遮挡造成的模糊性问题。As far as the existing technologies are concerned, most multi-view 3D human pose estimation methods use tracking algorithms to reconstruct the human pose at a certain moment or reconstruct the human pose at another moment according to the human pose at a certain time. Among them, tracking can be achieved using optical flow fitting and or stereo vision matching. However, although these methods can provide relatively accurate human pose estimation results, they usually need to work in a controlled environment and require more high-resolution cameras (usually at least four) and good scene space coverage (usually Circular Overlay) to address ambiguity due to occlusion.
当然,目前亦存在一些使用多视角轮廓或多视角立体方法来构建 “代理几何体”的其他方法。在完成所述代理几何体构建之后,骨架 会被装入此几何体。尽管这些方法能够提供非常好的效果,但它们亦 存在一些关于设置方面的限制。这是因为它们需要精心搭建的摄影棚、 许多高分辨率摄像机以及非常好的空间覆盖率。Of course, there are other ways to build "proxy geometry" using multi-view contour or multi-view stereo methods. After the proxy geometry is built, the skeleton is loaded into this geometry. Although these methods can provide very good results, they also have some limitations in terms of settings. This is because they require well-built studios, many high-resolution cameras, and very good space coverage.
除此之外,还存在另一类基于图像分析和分割的算法。这些算法 能够使用机器学习方法来区分身体部位。不过,此分析通常需要高分 辨率的镜头,对于大多数应用场景而言是无法实现的条件。In addition to this, there is another class of algorithms based on image analysis and segmentation. These algorithms are able to use machine learning methods to differentiate body parts. However, this analysis typically requires high-resolution footage, an unrealistic condition for most application scenarios.
发明内容SUMMARY OF THE INVENTION
本发明针对以上问题,本发明的目的是提供一种可在摄像机标定 松散、运动员分辨率低和存在遮挡的非受控环境中运行的数据驱动人 体姿态估计方法、装置及计算机可读介质。The present invention addresses the above problems, and an object of the present invention is to provide a data-driven human pose estimation method, device, and computer-readable medium that can operate in an uncontrolled environment with loose camera calibration, low player resolution, and occlusion.
为实现上述目的,本发明采取的技术方案如下。In order to achieve the above objects, the technical solutions adopted by the present invention are as follows.
本发明的第一方面提供了一种估计人体姿态的方法,用于估计铰 接式对象模型,其中所述铰接式对象模型为由一个或多个源摄像机记 录得真实世界对象的基于计算机的3D模型,并且所述铰接式对象模 型表示多个关节和链接所述关节的链接,并且其中所述铰接式对象模 型的人体姿态由所述关节的空间位置来定义,所述方法包括以下步骤:A first aspect of the present invention provides a method of estimating human pose for estimating an articulated object model, wherein the articulated object model is a computer-based 3D model of a real-world object recorded by one or more source cameras , and the articulated object model represents a plurality of joints and links linking the joints, and wherein the human pose of the articulated object model is defined by the spatial positions of the joints, the method includes the steps of:
获取摄像机(9)记录的真实世界对象(14)的视图的视频流;obtaining a video stream of the view of the real-world object (14) recorded by the camera (9);
从所述视频流获得至少一个源图像序列(10);obtaining at least one sequence of source images from the video stream (10);
处理所述至少一个序列的源图像,以针对每个图像提取对应的源 图像片段,所述对应的源图像片段包括与所述图像背景分离的所述现 实世界对象的视图,从而生成至少一个源图像片段序列;processing the at least one sequence of source images to extract, for each image, a corresponding source image segment, the corresponding source image segment comprising a view of the real-world object separated from the image background, thereby generating at least one source image fragment sequence;
在计算机可读形式的数据库中维护一组参考轮廓序列,每个参考 轮廓与铰接式对象模型相关联,并且与该铰接式对象模型的特定参考 人体姿态相关联;maintaining a set of reference contour sequences in a database in computer readable form, each reference contour being associated with an articulated object model and associated with a specific reference human pose for the articulated object model;
对于所述至少一个源图像段序列的每个序列,将该序列与多个参 考轮廓序列进行匹配,并确定与所述源图像段序列最匹配的一个或多 个所选参考轮廓序列;for each sequence of the at least one sequence of source image segments, matching the sequence to a plurality of reference contour sequences, and determining one or more selected reference contour sequences that best match the sequence of source image segments;
其中两个序列的匹配由通过将每个源图像片段与在其序列内相 同位置的参考轮廓进行匹配、计算指示它们匹配程度的匹配误差,以 及根据源图像片段的匹配误差计算序列匹配误差来完成;Two of the sequences are matched by matching each source image segment to a reference contour at the same location within its sequence, calculating a match error indicating how well they match, and calculating the sequence match error based on the source image segment's match error ;
对于每一个选定的参考轮廓序列,检索与其中一个参考轮廓相关 联的参考人体姿态;以及for each selected sequence of reference contours, retrieve the reference body pose associated with one of the reference contours; and
根据检索到的一个或多个参考人体姿态来计算铰接式对象模型 的人体姿态的估计结果;calculating an estimate of the body pose of the articulated object model based on the retrieved one or more reference body poses;
源图像序列中的一个图像被指定为感兴趣帧,并且由其生成的源 图像段被指定为感兴趣的源图像段;An image in the sequence of source images is designated as the frame of interest, and the source image segment generated therefrom is designated as the source image segment of interest;
序列匹配误差为两个序列匹配误差的加权和;The sequence matching error is the weighted sum of the two sequence matching errors;
感兴趣的源图像片段的匹配误差的权重最大,并且随源图像片段 (在序列内)到感兴趣的源图像片段的距离而减小;以及The matching error of the source image segment of interest has the greatest weight and decreases with the distance (within the sequence) of the source image segment to the source image segment of interest; and
检索与感兴趣的源图像段匹配的参考轮廓的参考人体姿态。Retrieve the reference human pose for the reference contour that matches the source image segment of interest.
通过上述步骤得到的估计结果为初始人体姿态估计,随后人们可 以在相关扩展步骤中使用该初始人体姿态估计结果,例如,用于保证 来自连续帧的人体姿态估计之间的局部一致性,以及保证在更长的帧 序列上的全局一致性。The estimation result obtained through the above steps is the initial human pose estimation, which one can then use in the relevant extension steps, for example, to ensure local consistency between human pose estimates from consecutive frames, and to ensure Global consistency over longer frame sequences.
优选地,源图像序列(10)的一个图像被指定为感兴趣帧(56), 并且从其生成的源图像段(13)被指定为感兴趣的源图像段;Preferably, one image of the sequence of source images (10) is designated as the frame of interest (56), and the source image segment (13) generated therefrom is designated as the source image segment of interest;
序列匹配误差是两个序列(51,52)匹配误差的加权和;The sequence matching error is the weighted sum of the matching errors of the two sequences (51, 52);
感兴趣的源图像片段的匹配误差的权重最大,并且随源图像片段 到感兴趣的源图像片段的距离而减小;并且检索与感兴趣的源图像片 段匹配的参考轮廓的参考人体姿态。The matching error of the source image segment of interest has the largest weight and decreases with the distance from the source image segment to the source image segment of interest; and the reference body pose of the reference contour matching the source image segment of interest is retrieved.
优选地,其中,获得并处理由至少两个源摄像机(9,9')同时 记录的至少两个源图像序列(10),并且通过选择在3D空间中最符 合的检索到的参考人体姿态的组合,根据从至少两个源图像序列确定 的检索到的参考人体姿态来获取铰接式对象模型(4)的人体姿态估 计结果。Preferably, wherein, at least two source image sequences (10) simultaneously recorded by at least two source cameras (9, 9') are obtained and processed, and by selecting the retrieved reference body pose that best fits in 3D space In combination, a human pose estimation result of the articulated object model (4) is obtained based on the retrieved reference human pose determined from at least two source image sequences.
优选地,从至少一个连续源图像段序列确定的两个人体姿态之间 建立局部一致性,每个人体姿态与至少一个源图像段(13)相关联, 其中铰接式对象模型(4)的一个或两个人体姿态中的元素,即关节 (2)和(或)链接(3)对应于可以以模糊方式标记的真实世界对象 (14)的肢体:Preferably, local consistency is established between two human poses determined from at least one continuous sequence of source image segments, each human pose being associated with at least one source image segment (13), wherein one of the articulated object models (4) Or elements in two human poses, namely joints (2) and/or links (3) correspond to limbs of real-world objects (14) that can be labeled in a fuzzy way:
对于一对连续的源图像段中的每一个,根据相关联的人体姿态以 及对于每个人体姿态的可能的模糊标签中的每一个,确定来自肢体的 图像点在源图像段中的对应标签;For each of a pair of consecutive source image segments, determine the corresponding label in the source image segment for the image point from the limb based on the associated body pose and each of the possible fuzzy labels for each body pose;
为该对连续源图像片段中的第一个选择该人体姿态的标签;selecting a label for the human pose for the first of the pair of consecutive source image segments;
计算这对连续源图像段中的第一个和第二个之间的光流;compute the optical flow between the first and the second of the pair of consecutive source image segments;
从光流中确定第二图像段中与第一图像段的肢体相对应的图像 点已经移动到的位置,并根据第一图像段中的肢体的标记来标记第二 图像段中的这些位置;determining from the optical flow where the image points in the second image segment corresponding to the limbs of the first image segment have moved, and marking these locations in the second image segment according to the markings of the limbs in the first image segment;
在用于根据人体姿态来标记图像点的第二图像段的人体姿态的 可能的模糊标签中,选择与根据光流确定的标签一致的标签。Among the possible fuzzy labels for labeling the human pose of the second image segment of the image point according to the human pose, a label is selected that is consistent with the label determined from the optical flow.
优选地,在所述源图像段中标记来自肢体的图像点的步骤通过以 下步骤完成:Preferably, the step of marking image points from the limb in the source image segment is accomplished by the following steps:
对于一对连续的源图像段中的每一个,根据相关联的人体姿态和 对于每个人体姿态的可能的模糊标签中的每一个,确定现实世界对象 (14)的模型到源图像中的投影,并且由此根据在位置图像点处可见 的投影肢体来标签源图像段的图像点。For each of a pair of consecutive source image segments, determine the projection of the model of the real-world object (14) into the source image from the associated body pose and each of the possible fuzzy labels for each body pose , and thereby label the image points of the source image segment according to the projected limbs visible at the location image points.
优选地,给出人体姿态序列,每个人体姿态与源图像片段序列中 的一个相关联,并且其中存在关于标记一个或多个肢体集的模糊性, 包括以下步骤,用于建立与连续源图像序列匹配的人体姿态的全局一 致性:Preferably, given a sequence of human poses, each human pose being associated with one of the sequence of source image segments, and where there is ambiguity with respect to labeling one or more sets of limbs, the following steps are included for establishing a relationship with successive source images Global consistency of human poses for sequence matching:
对于源图像序列的每个源图像段(13),For each source image segment (13) of the source image sequence,
检索由先前步骤确定的模型元素的关联人体姿态和标签;Retrieve the associated human poses and labels for the model elements determined by the previous steps;
考虑数据库人体姿态的标记,确定数据库中与检索到的人体姿态 具有最小距离的人体姿态;Considering the mark of the human pose in the database, determine the human pose in the database with the smallest distance from the retrieved human pose;
计算表示两个人体姿态之间的差异的一致性误差项;Compute a consistency error term representing the difference between two human poses;
根据这些一致性误差项计算整个源图像序列的总一致性误差;Calculate the total consistency error for the entire source image sequence from these consistency error terms;
重复上述步骤以计算模糊肢体集合的可能全局标签的所有变体 的总一致性误差;Repeat the above steps to calculate the total consistency error for all variants of the possible global labels of the fuzzy limb set;
选择总体一致性误差最小的全局标签变量。Select the global label variable with the smallest overall agreement error.
优选地,所述总一致性误差是所述序列上所有一致性误差项的总 和。Preferably, the total identity error is the sum of all identity error terms on the sequence.
本发明的第一方面提供了一种估计人体姿态的装置,包括处理器 和存储器,所述处理器和存储器相互连接,其中,所述存储器用于存 储计算机程序,所述计算机程序包括程序指令,所述处理器被配置用 于调用所述程序指令,执行本发明第一方面提供的所述方法。A first aspect of the present invention provides an apparatus for estimating human body posture, comprising a processor and a memory, wherein the processor and the memory are connected to each other, wherein the memory is used to store a computer program, and the computer program includes program instructions, The processor is configured to invoke the program instructions to execute the method provided by the first aspect of the present invention.
本发明的第三方面提供了一种计算机可读存储介质,所述计算机 存储介质存储有计算机程序,所述计算机程序包括程序指令,所述程 序指令当被处理器执行时使所述处理器执行本发明第一方面提供的 所述方法。A third aspect of the present invention provides a computer-readable storage medium storing a computer program, the computer program including program instructions that, when executed by a processor, cause the processor to execute The method provided by the first aspect of the present invention.
本发明的第四方面提供了一种制造计算机可读存储介质的方法, 包括在计算机可读介质上存储计算机可执行指令的步骤,当计算机可 执行指令由计算系统的处理器执行时,使计算系统执行本发明第一方 面提供的所述方法步骤。A fourth aspect of the present invention provides a method of manufacturing a computer-readable storage medium, comprising the step of storing computer-executable instructions on the computer-readable medium, and when the computer-executable instructions are executed by a processor of a computing system, the The system performs the method steps provided in the first aspect of the present invention.
相对于现有技术,本发明取得的有益效果是可在摄像机标定松散、 球员分辨率低和存在遮挡的非受控环境中运行的数据驱动人体姿态 估计方法。本发明提供的方法和装置可以仅使用少至两个摄像机来估 计人体姿态。并且其对可能的人体姿态或动作序列没有任何限制性的 假设。通过使用时间一致性进行初始人体姿态估计以及人体姿态细化,在失效的情况下,用户交互仅限于几次点击即可倒转手臂和腿。Compared with the prior art, the present invention has the beneficial effect of a data-driven human pose estimation method that can operate in an uncontrolled environment with loose camera calibration, low player resolution and occlusion. The method and apparatus provided by the present invention can estimate human pose using only as few as two cameras. And it does not make any restrictive assumptions about possible human poses or action sequences. By using temporal consistency for initial human pose estimation as well as human pose refinement, user interaction is limited to inverting the arms and legs in a few clicks in the event of failure.
附图说明Description of drawings
为了更清楚地说明本发明实施例的技术方案,下面将对本发明实 施例中所需要使用的附图作简单地介绍,对于本领域普通技术人员来 讲,在不付出创造性劳动的前提下,还可以根据这些附图获得其他的 附图。In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the accompanying drawings required in the embodiments of the present invention will be briefly introduced below. For those of ordinary skill in the art, without creative work, the Additional drawings can be obtained from these drawings.
图1是真实世界场景概貌示意图;Figure 1 is a schematic diagram of an overview of a real-world scene;
图2是铰接式对象模型;Figure 2 is an articulated object model;
图3中a是分割图像中的典型轮廓,a in Figure 3 is a typical contour in the segmented image,
b是数据库中的三个最佳匹 配人体姿态;b are the three best matching human poses in the database;
图4是将3D骨架投影到源摄像机的比赛中的估计人体姿态;Figure 4 is the estimated human pose in the game projecting the 3D skeleton to the source camera;
图5是所述方法概述;Figure 5 is an overview of the method;
图6是根据现有技术和根据本方法的2D人体姿态估计结果:Figure 6 is a 2D human body pose estimation result according to the prior art and according to the present method:
图7是人体姿态模糊示例;Figure 7 is an example of human posture blurring;
图8是局部一致性图示;Figure 8 is a local consistency diagram;
图9是人体姿态优化前后的估计人体姿态;Fig. 9 is the estimated human body posture before and after optimization of the human body posture;
图10是失效案例;Figure 10 is the failure case;
图11是每帧显示所有摄影机视图的结果序列。Figure 11 is the resulting sequence showing all camera views per frame.
具体实施方式Detailed ways
下面将详细描述本发明的各个方面的特征和示例性实施例,为了 使本发明的目的、技术方案及优点更加清楚明白,以下结合附图及实 施例,对本发明进行进一步详细描述。应理解,此处所描述的具体实 施例仅被配置为解释本发明,并不被配置为限定本发明。对于本领域 技术人员来说,本发明可以在不需要这些具体细节中的一些细节的情 况下实施。下面对实施例的描述仅仅是为了通过示出本发明的示例来 提供对本发明更好的理解。The features and exemplary embodiments of various aspects of the present invention will be described in detail below. In order to make the objects, technical solutions and advantages of the present invention clearer, the present invention will be further described in detail below in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are only configured to explain the present invention, and not to limit the present invention. It will be apparent to those skilled in the art that the present invention may be practiced without some of these specific details. The following description of the embodiments is only intended to provide a better understanding of the present invention by illustrating examples of the invention.
本实施例重点描述了基于无约束田径体育赛事转播镜头的人体 姿态估计方法。在这种场景下,意味着摄像机位置、物体大小和时间 一致性面临一些挑战。虽然人体姿态估计可以基于多摄像机设置来进 行计算,但依旧存在可用摄像机数量很少以及基线很宽的问题。除此 之外,由于摄像机通常只放置在场地的一侧,因此只能提供有限的场 景覆盖。尽管这些摄像机提供高分辨率的图像,但出于编辑原因,其 通常被设置为广角。因此,运动员通常只覆盖50到200像素之间的 高度。除此之外,运动员的运动可能非常复杂,特别是在像篮球等接 触性对抗运动中,因此会存在更多的遮挡。This embodiment focuses on describing the human pose estimation method based on the broadcast footage of unconstrained track and field sports events. In this scenario, it means that camera position, object size, and temporal consistency face some challenges. Although human pose estimation can be computed based on multi-camera setups, there are still problems with the small number of available cameras and wide baselines. In addition, because the cameras are usually placed on only one side of the field, they can only provide limited scene coverage. Although these cameras provide high-resolution images, they are usually set to wide-angle for editing reasons. Therefore, athletes usually only cover heights between 50 and 200 pixels. In addition to this, the movement of athletes can be very complex, especially in contact sports like basketball, and therefore there will be more occlusions.
为此目的,本项发明提出了一种可在摄像机标定松散、运动员分 辨率低和存在遮挡的非受控环境中运行的数据驱动人体姿态估计方 法。由此产生的方法和系统可以仅使用少至两个摄像机来估计人体姿 态。并且其对可能的人体姿态或动作序列没有任何限制性的假设。通 过使用时间一致性进行初始人体姿态估计以及人体姿态细化,在失效 的情况下,用户交互仅限于几次点击即可倒转手臂和腿。To this end, the present invention proposes a data-driven human pose estimation method that can operate in an uncontrolled environment with loose camera calibration, low player resolution, and occlusions. The resulting method and system can estimate human pose using as few as two cameras. And it does not make any restrictive assumptions about possible human poses or action sequences. By using temporal consistency for initial human pose estimation as well as human pose refinement, user interaction is limited to a few clicks to invert arms and legs in the event of failure.
人体姿态估计中的许多现有方法依赖于在2D中跟踪或分割图像, 并使用校准信息将骨架外推到3D。这些方法对于高分辨率的镜头效 果很好,但由于缺乏信息,它们在低分辨率图像上经常失效,并且对 外部照明条件非常敏感。本发明所涉及方法适用于完全不受控制的低 分辨率室外设置,因为它只依赖于粗略的轮廓和粗略的校准。Many existing methods in human pose estimation rely on tracking or segmenting images in 2D and using calibration information to extrapolate the skeleton to 3D. These methods work well for high-resolution footage, but they often fail on low-resolution images due to lack of information and are very sensitive to external lighting conditions. The method involved in the present invention is suitable for a completely uncontrolled low-resolution outdoor setting, as it only relies on rough outlines and rough calibration.
所述方法使用人体姿态和轮廓比较数据库来提取2D中的人体姿 态候选对象,并使用摄像机校准信息来计算相应的3D骨架。本发明 所涉及方法首先在数据库中执行一种新型基于时间一致轮廓的搜索 策略,以提取具有时间一致性的最接近的数据库候选对象。另外应用 了一种新型时间一致性步骤来进行初始人体姿态估计。因为准确的真 实人体姿态通常不在数据库中,所以这只会导致最接近的匹配,而不 是准确的人体姿态。为此目的,本发明提出了一种能够利用时间信息 自动实现精确三维人体姿态的新型时空优化技术。The method uses a human pose and contour comparison database to extract human pose candidates in 2D, and uses camera calibration information to compute the corresponding 3D skeleton. The method involved in the present invention first executes a novel time-consistent contour-based search strategy in the database to extract the closest database candidates with time-consistency. In addition, a novel temporal consistency step is applied for initial human pose estimation. Since the exact real human pose is usually not in the database, this will only result in the closest match, not the exact human pose. For this purpose, the present invention proposes a novel spatio-temporal optimization technology that can automatically realize accurate three-dimensional human body posture using time information.
相对于现有技术,本发明做出的创造性贡献如下:With respect to the prior art, the creative contributions made by the present invention are as follows:
本发明提出了一种基于时间一致性轮廓的数据库人体姿态查找 的初始人体姿态估计方法;The invention proposes an initial human body posture estimation method based on the database human body posture search based on the time-consistent contour;
改进了初始人体姿态估计的局部和全局一致性检查方法;Improved local and global consistency checking methods for initial human pose estimation;
本发明提出了一种基于新约束的时空人体姿态优化方法。The present invention proposes a spatiotemporal human body posture optimization method based on new constraints.
与学习骨骼统计模型不同,本发明直接使用的是人体姿态数据库。 后者具有两个以下两个优势:首先,这种数据驱动型方法允许容易地 添加新的人体姿态序列以适应新的设置或以前未知的人体姿态。其次, 由于该方法只是在数据库中搜索最接近的人体姿态,因此对较常见人 体姿态的统计偏差较小。使用包含人体测量学正确数据的数据库将始 终导致初始估计的合理人体姿态。Different from learning the skeletal statistical model, the present invention directly uses the human body posture database. The latter has two following advantages: First, this data-driven approach allows new sequences of human poses to be easily added to accommodate new settings or previously unknown human poses. Second, since the method only searches the database for the closest human pose, the statistical bias for more common human poses is small. Using a database containing anthropometrically correct data will always result in an initial estimate of reasonable human pose.
以下结合附图,以标枪比赛场景为例对本发明进行详细阐述。The present invention will be described in detail below by taking a javelin match scene as an example in conjunction with the accompanying drawings.
图1给出了真实世界场景8的示意图。所述场景8包括由两个或 更多源摄像机9、9'观察的诸如人的真实世界对象14,所述源摄像机 9、9'分别生成源图像10、10'的视频流。根据本发明的系统和方法 从与所述源摄像机9、9'的视点不同的虚拟摄像机1的视点生成显示 场景8的虚拟图像12。可选地,从虚拟图像序列12生成虚拟视频流。 根据本发明所实现的设备包括处理单元15,所述处理单元15在给定 源图像10、10'的情况下执行实现本发明方法的图像处理计算,并生 成一个或多个虚拟图像12。处理单元15被配置为与存储单元16交 互,用于存储源图像10、虚拟图像12和中间结果。所述处理单元15 由工作站19控制,所述工作站19通常包括显示设备、诸如键盘的数 据输入设备和诸如鼠标的定点设备。除此之外,所述处理单元15可 以被配置为向TV广播发射机17和(或)向视频显示设备18提供虚 拟视频流。Figure 1 presents a schematic diagram of a real-
图2给出了所述场景8的3D模型1,其包括所述现实世界对象 14的铰接式对象模型4。所述3D模型1通常还包括其他对象模型, 例如表示其他人、地面、建筑物等(未示出)。所述铰接式对象模型 4包括通过连杆3连接的关节2,在人体模型的情况下,所述连杆3 大致对应于骨骼或四肢。每个所述关节2被定义为3D空间中的一个 点,并且每个连杆3可以由通过3D空间连接两个关节2的直线来表 示。基于人体姿态,可以计算对象的网格模型的3D形状或真实世界 对象的3D形状的另一模型表示的3D形状,并将其投影到所述源摄像 机9、9'看到的2D图像之中。这允许检查和改进人体姿态。虽然所 述人体姿态由一维链接所描述,但由(网格)模型表示的、根据人体 姿态放置的肢体可在三维中进行延伸。虽然本申请是根据建模的人类 形状来解释的,但是其范围也涵盖了对动物形状或对诸如机器人的人 工结构的应用。Figure 2 presents a 3D model 1 of the
首先,对于每个单独输入视图中的2D人体姿态估计,本发明利 用一个轮廓数据库。本发明假设例如使用色度键控法或减背景法可以 从背景中粗略地分割对象14。图3a示出了所述应用场景中的分割图 像13的典型示例。其中。计算对象人体姿态(即骨架关节2的2D位 置)的初始猜测的基本思想是将其与各个骨架人体姿态已知的轮廓数 据库进行比较。参见具有不同参考轮廓3'的图3b,它们中的两个以 较小的比例再现以节省空间。First, for 2D human pose estimation in each individual input view, the present invention utilizes a contour database. The present invention assumes that the
图4示出了来自田径运动场景的源图像序列中的一帧的示例,其 中通过该方法找到的3D人体姿态被投影到图像中并叠加在运动员身 上。Figure 4 shows an example of a frame from a source image sequence of a track and field scene, where the 3D human pose found by the method is projected into the image and superimposed on the athlete.
在一个实施例中,该方法包括如图5所示的两个步骤。在第一步 骤62中,给定来自人体姿态数据库66的粗校准和轮廓61以及参考 轮廓,该方法使用时空轮廓匹配技术来提取每个单独摄像机视图的 2D人体姿态,从而产生三角化的3D人体姿态猜测63(标记为“估计 的3D身体人体姿态”)。不过,这种人体姿态检测会很容易产生模 糊性,即对称部分的左右翻转。虽然骨骼与轮廓非常匹配,但运动员 的手臂或腿仍然可能被翻转。由于遮挡和低分辨率,这些模糊性有时 即使对于人眼也很难发现。因此,我们使用一种基于光流的技术来检 测翻转发生的情况,并对它们进行校正,以获得一致的序列。重要的 是要注意,光流在所述设置中不够可靠,不足以在整个序列上跟踪运 动员身体各部分的整个运动,但是它可以用于局部比较。第一步骤使 用源图像段3的序列51,将该序列51与聚焦于感兴趣的图像帧56 的多个参考轮廓13'的序列52进行匹配。In one embodiment, the method includes two steps as shown in FIG. 5 . In a
但是,通常情况下,数据库中的任何人体姿态都不会与实际人体 姿态完全匹配。因此,在该方法的第二部分或步骤64(标记为“时 空人体姿态优化”)中,该初始3D人体姿态63由基于时空约束的优 化过程64来细化。所得到的优化的3D骨架65(标记为“优化的3D 身体人体姿态”)与来自所有视图的轮廓相匹配,并且特征在连续帧 上具有时间一致性。However, in general, any human pose in the database will not exactly match the actual human pose. Thus, in the second part or step 64 of the method (labeled "spatiotemporal human pose optimization"), the initial 3D human pose 63 is refined by an
通过使用新型基于时空数据驱动的轮廓搜索方法首先从每个运 动员和每个摄像机视图检索2D人体姿态来计算初始人体姿态估计。 一旦找到每个摄像机中每个运动员的2D人体姿态,就可以使用摄像 头的校准信息将图像中观察到的2D关节位置放置在3D空间中(这一 步骤也称为在3D提升2D位置)。本发明通过与每个摄影机视图中的 每个2D关节对应的光线相交来计算每个关节的3D位置。由于光线不 会精确相交,因此本发明选择最小二乘意义上距离这些光线最近的点。 由此可以得到三角测量误差E,以及初始摄像机偏移。Initial human pose estimates are computed by first retrieving 2D human poses from each athlete and each camera view using a novel spatiotemporal data-driven contour search method. Once the 2D human pose for each athlete in each camera is found, the 2D joint positions observed in the image can be placed in 3D space using the camera's calibration information (this step is also known as lifting the 2D position in 3D). The present invention calculates the 3D position of each joint by intersecting the rays corresponding to each 2D joint in each camera view. Since rays do not intersect exactly, the present invention selects the points closest to these rays in the least squares sense. This gives the triangulation error E, and the initial camera offset.
本方法在角度空间中以以下方式表示人体姿态S的3D骨架:每 个骨骼i相对于其“parent”骨使用两个角度α和β,以及骨骼的长 度l来表示。“root”骨由全局位置p0给出的三个角度α0、β0、γ0所指定的方向定义。可以根据该角度空间表示容易地计算3D欧几里 得空间中的关节位置j,反之亦然(考虑万向节锁定)。The present method represents the 3D skeleton of a human pose S in angular space in the following way: each bone i is represented using two angles α and β with respect to its "parent" bone, and the length l of the bone. The "root" bone is defined by the orientation specified by the three angles α0 , β0 , γ0 given by the global position p0 . Joint positions j in 3D Euclidean space can be easily calculated from this angle space representation, and vice versa (considering gimbal locking).
具备对整个人体运动范围进行采样能力的大型数据库对于本发 明所涉及方法非常重要,并且很难手动进行创建。因此,本发明使用 CMU运动捕捉数据库。然后使用线性混合蒙皮(骨骼蒙皮动画算法) 对装配了相同骨架的模板网格进行变形处理,以匹配数据库人体姿态 的人体姿态。由此,拍摄虚拟快照并提取轮廓。通过这种方式,本发 明创建了一个包含大约50000个轮廓的数据库。Large databases with the ability to sample the entire range of motion of the human body are important to the methods of the present invention and are difficult to create manually. Therefore, the present invention uses the CMU motion capture database. The template meshes fitted with the same skeleton are then deformed using linear blend skinning (skeletal skinning animation algorithm) to match the human poses of the database human poses. From this, virtual snapshots are taken and contours are extracted. In this way, the present invention creates a database of approximately 50,000 contours.
不过,所述CMU数据库只包括有限的人体姿态类型,因为其主要 来自跑步和行走序列。因此,本发明又手动添加了来自几个运动场景 中的1200个轮廓。尽管与自动生成的轮廓数量相比,这种手动添加 的轮廓数量很少,但足以扩大示例人体姿态的跨度,以获得良好的效 果。重要的是要注意,添加的示例人体姿态与我们用来拟合人体姿态 的序列不同。数据库可以通过新生成的人体姿态不断扩大,从而得到 更好的初始人体姿态估计。However, the CMU database includes only limited types of human poses, since they are mainly from running and walking sequences. Therefore, the present invention manually added another 1200 contours from several motion scenes. Although the number of such manually added contours is small compared to the number of automatically generated contours, it is enough to widen the span of the example human pose for good results. It is important to note that the example body poses added are different from the sequence we used to fit the body poses . The database can be continuously expanded by newly generated human poses, so as to obtain better initial human pose estimation.
本发明所涉及方法接受将每个运动员的粗略的二进制轮廓掩模 以及粗略的摄像机校准作为输入数据。并将这些轮廓与来自数据库的 轮廓进行比较。其计算适合分割的固定光栅尺寸(高度=40和宽度=32 像素的网格)上的输入轮廓和数据库轮廓之间的匹配质量。The method of the present invention accepts as input data a rough binary contour mask for each player and a rough camera calibration. And compare these contours with contours from the database. It computes the quality of the match between the input contours and the database contours on a fixed raster size suitable for segmentation (a grid of height=40 and width=32 pixels).
所述轮廓提取方法同利用时间信息有效扩展了传统方法。所述方 法不依赖于单个帧匹配,而是考虑源图像段13和参考轮廓13'之间 的差值在图像帧序列上的加权和。将具有索引t的二进制输入轮廓图 像I(来自感兴趣的图像帧56)与来自数据库的具有索引s的轮廓图 像I′s进行比较时,得到的像素误差Eq(s)计算如下:The contour extraction method effectively extends the traditional method with the use of temporal information. The method does not rely on individual frame matching, but considers a weighted sum of the differences between the
其中,n为滤波窗口大小,也就是所考虑的感兴趣帧56之前和 之后的帧数;P为所有光栅位置的集合,其中对应的像素在两个图像 中不可能被遮挡,即,不期望是另一个运动员轮廓的一部分。|P|表 示P中光栅位置的个数。光栅位置可以对应于摄像头的实际硬件像素, 也可以对应于摄像头图像计算出的缩放图像中的像素。where n is the filter window size, i.e. the number of frames before and after the frame of
其中,权重θs(i)描述中心围绕s的归一化高斯函数53。对于未被 包括在所述数据集的I′s+i(p)而言,θs(i)在标准化处理之前被设置为0。 本方法的优势在于,其是对序列而不是单个图像进行比较,这不仅能 够增加时间相干性(实现平稳运动),而且还能够有效改进人体姿态 估计。通过这种方法,即使是遮挡在几帧上的图像部分也可以更牢固 地拟合。通常,此方法有助于防止匹配相似但源自完全不同人体姿态 的轮廓。图6给出了本发明所涉及的初始人体姿态估计与现有技术中 人体姿态估计的直接比较。其中。图6(a)为通过仅将当前帧与数 据库进行比较来估计2D人体姿态(现有技术的做法);(b)找到的 用于单帧比较的数据库项;(c)通过比较轮廓序列估计2D人体姿态; (d)找到的数据库序列与中间的相应图像段。where the weights θs (i) describe a normalized
通过使用该像素误差,本发明在每个摄像机视图中搜索最佳的两 个人体姿态假设,并通过选择最低的三角测量误差E,来选择这些假 设地最佳组合。当然,在替代实施例中,可以使用来自每个摄像机的 多于两个的人体姿态假设,以确定最佳组合,或者每个摄像机的不同 数目,或者假设它是最佳的而不考虑三角测量误差,或者来自至少一 个摄像机的仅一个人体姿态假设。Using this pixel error, the present invention searches for the best two body pose hypotheses in each camera view and selects the best combination of these hypotheses by selecting the lowest triangulation error E. Of course, in alternative embodiments, more than two human pose hypotheses from each camera could be used to determine the best combination, or a different number of each camera, or assume it's optimal regardless of triangulation error, or only one human pose hypothesis from at least one camera.
图7示出了人体姿态模糊性的示例:(a)第一摄像机中的可能 标签;(b)从顶部的示意图来说明膝盖的两个可能位置;以及(c) 第二摄像机中的可能标签。Figure 7 shows examples of human pose ambiguity: (a) possible labels in the first camera; (b) schematic diagram from the top illustrating two possible positions of the knee; and (c) possible labels in the second camera .
由于2D人体姿态检测步骤依赖于轮廓匹配,因此会很容易产生 模糊性。尽管给出一个轮廓和从数据库中检索到的匹配的初始2D人 体姿态,但是人们依旧无法确定手臂和腿上的“左”和“右”标签是 否正确。为此,在一个实施例中,在2D轮廓匹配之后,忽略来自检索到的数据库人体姿态的信息,该信息定义腿或手臂(或者,在一般 情况下,是一组对称关节链中的一个)是否要被标记,例如,在2D 轮廓的匹配之后,将被标记为例如“左”或“右”。那么剩余的模糊 会按如下方式得到解决。其中,图7(a)显示了腿的两个可能标签 的示例轮廓。从图中可以看出,右膝可能的位置用菱形标记。该视图 来自图7(b)的模式中的左侧照摄像机。然后同一帧中的同一对象 在另一台摄像机中显示出如图7(c)所示的轮廓,同样带有两个可 能的腿部标签。因此,在提升到3D后,在3D中有四个可能的右膝位 置。这些可能的情况如图7(b)所示。可以看出,如果右膝落在标 有星号的一个位置上,那么左膝就会落在另一个星号标记位置之上。 如果右膝落在一个标有圆圈的位置上,那么左膝就会落在由另一个圆 圈标记的位置之上。假设圆圈所标记位置是膝盖的正确位置,那么可 以有两种不同类型的失效:要么膝盖在3D中会被错误地标记,但位 置正确;要么膝盖出现在错误的位置(星形)。Since the 2D human pose detection step relies on contour matching, it is prone to ambiguity. Although given an outline and the matching initial 2D human pose retrieved from the database, one cannot be sure that the "left" and "right" labels on the arms and legs are correct. To this end, in one embodiment, after 2D contour matching, information from the retrieved database human pose is ignored, which defines a leg or arm (or, in general, one of a set of symmetrical joint chains) Whether to be marked, eg after matching of 2D contours, will be marked eg "left" or "right". The remaining ambiguity is then resolved as follows. Among them, Fig. 7(a) shows example contours of two possible labels for legs. As can be seen from the picture, the possible positions of the right knee are marked with diamonds. This view is from the left side camera in the mode of Figure 7(b). Then the same object in the same frame is shown in another camera with the contour shown in Fig. 7(c), again with two possible leg labels. Therefore, after lifting to 3D, there are four possible right knee positions in 3D. These possible cases are shown in Fig. 7(b). It can be seen that if the right knee lands on one of the positions marked with an asterisk, then the left knee will fall on the other asterisked position. If the right knee falls on a position marked by a circle, the left knee falls on the position marked by another circle. Assuming that the position marked by the circle is the correct position of the knee, there can be two different types of failures: either the knee will be incorrectly marked in 3D, but in the correct position, or the knee will appear in the wrong position (star).
由于没有额外的信息,因此无法在这种情况下决定哪些位置是正 确的,即从四种可能性中选择唯一正确的位置,特别是在只有两个摄 像机可用的情况下。消除翻转模糊性的一种可能方法包括检查所有可 能的组合,只保留解剖学上可能的组合。然而需要注意的是,几种翻 转配置仍有可能产生解剖学上正确的人体姿态。Without additional information, it is impossible to decide which positions are correct in this case, i.e. choose the only correct position out of four possibilities, especially if only two cameras are available. One possible way to remove the ambiguity of flipping involves examining all possible combinations and keeping only those that are anatomically possible. It should be noted, however, that several flip configurations are still possible to produce anatomically correct human poses.
为了正确地解决这些模糊性,本发明使用了两步法:首先,在每 对连续的2D帧之间建立局部一致性,使得整个2D人体姿态序列在时 间上是一致的。其次,贯穿整个序列的任何剩余模糊性都被全局解决。To properly resolve these ambiguities, the present invention uses a two-step approach: first, local consistency is established between each pair of consecutive 2D frames so that the entire 2D human pose sequence is temporally consistent. Second, any remaining ambiguities throughout the sequence are resolved globally.
实现局部一致性这一目标的第一步是确保从摄像机帧k(如图8 (a)和8(b)所示)和k+l(如图8(c)所示)发现的2D人体姿 态具有一致性,即在两个连续帧之间不存在手臂和腿部的左右翻转。 换言之,如果位于帧k处的一个像素属于右腿,那么在帧k+l处,该 像素亦应该属于右腿。通过应用这一假设,本发明为彩色图像lC(k)和 IC(k+1)中的每个像素分配相应的骨骼,并计算帧之间的光流(如图8 (d)所示)。The first step in achieving this goal of local consistency is to ensure that the 2D human body found from camera frames k (as shown in Fig. 8(a) and 8(b)) and k+l (as shown in Fig. 8(c)) The pose is consistent, that is, there is no left-right flip of the arms and legs between two consecutive frames. In other words, if a pixel at frame k belongs to the right leg, then at frame k+1, that pixel should also belong to the right leg. By applying this assumption, the present invention assigns a corresponding bone to each pixel in the color images lC(k) and IC(k+1) , and calculates the optical flow between frames (as shown in Fig. 8(d) Show).
图8描述了通过基于模型的一致性检查来建立局部一致性的方 法。在该实施例中,尽管这是一种基于网格的模型,但是该方法也可 以用真实世界对象的3D形状的另一表示方法来执行。其中分配给右 腿的关节用菱形标记:(a)上一帧和(b)适合的网格以及(c)在 当前帧中错误地分配腿部,(d)光流,(e)在当前帧中用正确和错 误标记的匹配来拟合网格,(f)当前帧中翻转(正确)腿部的错误。 其中,在(e)和(f)中,小腿和脚上的不规则白色形状表示在(e) 中被标记为错误的54和在(f)中被标记为正确的55的像素。尽管 很少出现(e)中没有“正确”的像素,(f)中也没有“错误”的像 素,但。无法确保上述问题不会出现。Figure 8 depicts a method for establishing local consistency through model-based consistency checking. In this embodiment, although this is a mesh-based model, the method can also be performed with another representation of the 3D shape of the real world object. where the joints assigned to the right leg are marked with diamonds: (a) the previous frame and (b) the mesh that fits and (c) the wrongly assigned leg in the current frame, (d) the optical flow, (e) the current frame The mesh is fitted with correct and incorrectly labeled matches in the frame, (f) the error of flipped (correct) legs in the current frame. Among them, in (e) and (f), the irregular white shapes on the lower legs and feet represent the pixels marked as wrong 54 in (e) and correct 55 in (f). Although it is rare that there are no "correct" pixels in (e) and no "wrong" pixels in (f), but. There is no guarantee that the above problems will not arise.
其基本思想是,使用光流计算的帧k中的像素和帧k+1中的对应 像素应该指定给相同的骨骼。否则可能会出现翻转(如图8(c)所 示)。因此,本发明计算了左或右关节可能标签的所有组合的这种关 联(参见图7),对每个组合的像素流的一致性进行计算,并为第二 帧选择一个一致性最高的标签组合。为使该方法在光流误差方面更具 鲁棒性,本发明只考虑具有良好光流的像素,并且在两帧中对应的像 素标签是相同的骨骼类型,即,或者是两个手臂,或者是两条腿。例 如,如果像素p在帧k中属于左臂,在帧k+1中则属于躯干,那么这 很可能是因为由于存在遮挡而造成光流不准确,因此可以排除该像素。 如果像素属于同一类型(手臂或腿部)的不同骨骼,则这是翻转的强 烈指示。本发明采用投票策略来选择最佳翻转配置:根据肢体是否包 含大多数“左”或“右”像素,将其标记为“左”或“右”。The basic idea is that the pixel in frame k calculated using optical flow and the corresponding pixel in frame k+1 should be assigned to the same bone. Otherwise flipping may occur (as shown in Figure 8(c)). Therefore, the present invention computes this association for all combinations of possible labels for the left or right joint (see Figure 7), computes the consistency of the pixel stream for each combination, and selects the one with the highest consistency for the second frame combination. To make the method more robust in terms of optical flow error, the present invention only considers pixels with good optical flow, and the corresponding pixel labels in two frames are the same bone type, i.e., either two arms, or are two legs. For example, if pixel p belongs to the left arm in frame k and the torso in frame k+1, then this is likely due to inaccurate optical flow due to occlusion, so the pixel can be excluded. If the pixel belongs to different bones of the same type (arm or leg), this is a strong indication of flipping. The present invention employs a voting strategy to select the best flip configuration: a limb is marked as "left" or "right" depending on whether it contains the majority of "left" or "right" pixels.
为此,必须将每个像素指定给其相应的骨骼。由于基于到骨骼的 距离的简单指定没有考虑到遮挡,因此其并不是一个最佳的方法。因 此,本发明使用来自所有摄像机的信息来构建3D人体姿态(如“初 始人体姿态估计”一节中所述)。同样,本发明对所有骨骼使用颜色 编码,对所有摄像机中的所有可能翻转使用变形和渲染的模板网格。 渲染的网格为每个肢体携带信息,无论它是“左”还是“右”。因此, 像素指定是一个简单的查找,尽管存在自遮挡,但仍可提供准确的指 定:对于每个像素,相同位置的渲染网格能够指示该像素属于“左” 还是属于“右”。To do this, each pixel must be assigned to its corresponding bone. Since a simple specification based on distance to the bone does not take occlusion into account, it is not an optimal method. Therefore, the present invention uses information from all cameras to construct a 3D human pose (as described in the "Initial Human Pose Estimation" section). Also, the present invention uses color coding for all bones, deformed and rendered template meshes for all possible flips in all cameras. The rendered mesh carries information for each limb, whether it is "left" or "right". Thus, pixel assignment is a simple lookup that, despite self-occlusion, provides an accurate assignment: for each pixel, a co-located rendering grid can indicate whether the pixel belongs to "left" or "right."
在完成局部一致性步骤之后,所有连续的帧之间不应该再存在翻 转,这意味着整个序列是一致的。不过需要注意的是,仍然有可能存 在整个序列被反转到错误方向的情况。不过,这一个问题只需要对整 个序列进行二进制消歧处理就能够请以解决。因此,本发明能够通过 评估手臂的可能的全局标签和腿部的可能的全局标签的函数来检查 其全局一致性。通过选择标签组合来选择最终标签,该标签组合最小 化了整个序列中的以下误差项的总和:After completing the local consistency step, there should be no more flips between all consecutive frames, which means that the entire sequence is consistent. Note, however, that it is still possible for the entire sequence to be reversed in the wrong direction. However, this problem can be solved simply by binary disambiguation of the entire sequence. Thus, the present invention is able to check for global consistency by evaluating a function of possible global labels for the arms and possible global labels for the legs. The final label is chosen by choosing a label combination that minimizes the sum of the following error terms over the entire sequence:
Eg=λDBEDB+λtEt (2)Eg =λDB EDB +λt Et (2)
可以看出,上式是具有恒定参数和λ的加权和;其中,是 “到数据库的距离”,其确保所选的标记/翻转会产生沿序列的合理 人体姿态。它根据与数据库中最接近的人体姿态P的距离对人体姿态 进行处罚:It can be seen that the above formula has constant parameters and a weighted sum of λ; where, is the "distance to database" which ensures that the selected markers/flips result in reasonable human poses along the sequence. It penalizes human poses based on the distance to the closest human pose P in the database:
其中,a和β是三角连接位置J的连接角度。α′和β′是数据库人 体姿态P中的两个人体姿态。|J|为关节数。当搜索沿序列的每个人体 姿态的最接近的数据库人体姿态时,会考虑数据库中人体姿态的标签 (右/左)。也就是说,序列人体姿态的肢体仅与标记为相同的数据 库人体姿态的肢体匹配。由于数据库只包含人体测量学上正确的人体 姿态,这将对不可信的人体姿态进行处罚。where a and β are the connection angles of the triangular connection position J. α' and β' are the two human poses in the database human pose P. |J| is the number of joints. The label (right/left) of the body pose in the database is considered when searching for the closest database body pose for each body pose along the sequence. That is, the limbs of the sequence human poses are only matched with the limbs marked with the same database human poses. Since the database only contains anthropometrically correct human poses, this will penalize untrustworthy human poses.
至此,由人体姿态估计计算出地最佳3D人体姿态仍然限于在每 个视图中适合存在于数据库中的人体姿态。数据库只包含所有可能人 体姿态的子集,因此通常不包含准确的解决方案。So far, the optimal 3D human pose calculated from the human pose estimation is still limited to the human poses that fit in the database in each view. The database contains only a subset of all possible human poses, and therefore usually does not contain accurate solutions.
为此目的,本发明应用优化方法来检索更准确地人体姿态,如图 9所示。为指导这种优化方法,本发明结合了几个空间和时间能量函 数,并使用优化方法将它们进行最小化处理。For this purpose, the present invention applies an optimization method to retrieve more accurate human body poses, as shown in FIG. 9 . To guide this optimization method, the present invention combines several spatial and temporal energy functions and minimizes them using an optimization method.
所述能量函数或误差函数是以本发明在初始人体姿态估计一节 中描述的骨架S的表示方式为基础。除骨骼长度之外的所有参数都是 每帧可变的。并且骨骼长度也是可变的,但在整个序列中保持不变, 并且被初始化为所有帧的局部长度的平均值。这会自动引入人体测量 约束,因为骨骼不应随时间缩小或增长。所选骨架表示的另一个很好 的属性是它显著减少了变量的数量。为了应对标定误差,本发明还对 每个对象、摄像机和帧给定的一个维度的平移矢量进行了优化。The energy function or error function is based on the representation of the skeleton S described in the initial human pose estimation section of the present invention. All parameters except bone length are variable per frame. And the bone length is also variable, but remains constant throughout the sequence, and is initialized as the average of the local lengths of all frames. This automatically introduces anthropometric constraints, as bones should not shrink or grow over time. Another nice property of the selected skeleton representation is that it significantly reduces the number of variables. In order to cope with the calibration error, the present invention also optimizes the translation vector of one dimension given by each object, camera and frame.
本发明将每个帧和对象的能量或误差函数定义为以下误差项的 加权和:The present invention defines the energy or error function for each frame and object as the weighted sum of the following error terms:
E(S)=ωsEs+ωfEf+ωDBEDB+ωrotErot+ωpEp (4)E(S)=ωs Es +ωf Ef +ωDB EDB +ωrot Erot +ωp Ep (4)
需要注意的是,并不是所有错误项都是优化处理返回有用结果所 必需的条件。根据场景和现实世界对象的性质,人们可以省略一个或 多个误差项。例如,在一个实施例中,为了观察运动场景,至少使用 轮廓填充误差,并且可选地使用距离数据库误差。Note that not all error terms are necessary for optimization to return useful results. Depending on the nature of the scene and real-world objects, one can omit one or more error terms. For example, in one embodiment, in order to observe a moving scene, at least the contour fill error, and optionally the distance database error, is used.
在一个局部优化过程中,可以首先对误差泛函进行最小化处理, 其中,需要注意的是,在同一时刻的一帧或一组帧中看到的对象的人 体姿态都是可变的。或者,可以在较长的帧序列上对误差泛函进行最 小化处理,在较长的帧序列中,对应于所有帧的人体姿态亦是变化的, 这能够较为方便地找到彼此一致的整个序列地最佳匹配(根据链接连 续帧的优化标准)。In a local optimization process, the error functional can be minimized first, where it should be noted that the human poses of objects seen in a frame or a group of frames at the same time are all variable. Alternatively, the error functional can be minimized over a longer sequence of frames. In the longer sequence of frames, the poses of the human body corresponding to all frames also change, which makes it easier to find the entire sequence that is consistent with each other. the best match (according to the optimization criteria for linking consecutive frames).
轮廓匹配误差项Es:正确3D骨架的骨骼应该能够投影到所有摄 影机中的2D轮廓之上,误差项Es负责对2D投影位于轮廓之外的关节 位置进行惩罚:Contour matching error termEs : The bones of the correct 3D skeleton should be able to project onto the 2D contour in all cameras, and the error termEs is responsible for penalizing joint positions where the 2D projection is outside the contour:
其中,C是覆盖所述对象轮廓的所有摄影机的集合。J+是所有关 节的集合J与位于骨骼中间的点或沿骨骼放置的一个或多个其他点 的并集。归一欧几里得距离转换(EDT)算法能够通过除以轮廓边界 框的较大边为摄像机视图中的每个2D点返回一个相距轮廓内最近点 的距离。该归一化处理对于使误差独立于摄像机图像中的对象(大小 可以根据变焦而变化)的大小而言至关重要。而Pc(j)是考虑摄像机偏 移将3D关节j变换到摄像机空间的投影,其目的是对小的校准误差 进行校正处理。where C is the set of all cameras covering the outline of the object. J+ is the union of the set J of all joints with a point located in the middle of the bone or one or more other points placed along the bone. The Normalized Euclidean Distance Transform (EDT) algorithm returns a distance to the closest point within the contour for each 2D point in the camera view by dividing by the larger side of the contour bounding box. This normalization process is critical to make the error independent of the size of the object in the camera image (which can vary in size depending on zoom). And Pc (j) is the projection of transforming the 3D joint j into the camera space considering the camera offset, and its purpose is to correct the small calibration error.
轮廓填充误差项Ef:尽管轮廓匹配项Es主要负责对位于轮廓外的 关节进行惩罚,但到目前为止,当将其应用到轮廓时并不存在限制。 填充误差项Ef可防止关节的位置与位于躯干内的另一个关节的位置 过于靠近,因此,换句话说,该项可以确保在所有肢体中都存在关节:Contour filling error term Ef : Although the contour matching term Es is mainly responsible for penalizing joints located outside the contour, so far there is no limitation when applying it to the contour. Filling in the error term Ef prevents the position of the joint from being too close to the position of another joint located in the torso, so in other words, this term ensures that the joint is present in all limbs:
其中,R是位于轮廓内的2D人体姿态估计部分中的所有栅格点 的集合。能够将所述网格点从摄像机c的摄像机空间变换为世 界空间中的光线。而dist()是光线到关节的距离。where R is the set of all grid points in the 2D human pose estimation part located within the contour. The grid points can be transformed from camera space of camera c to rays in world space. And dist() is the distance from the ray to the joint.
该误差项的目的是惩罚铰接式对象模型的元素(在本例中是关节和链 接,或者仅是链接)在轮廓内发生折叠的人体姿态。误差项倾向于轮 廓内的每个光栅点或网格点靠近所述元素的人体姿态。换句话说,随 着一个点到最近元素的距离增加,轮廓填充误差项增加。最近的元素 可以是链接或关节,或者两者兼而有之。The purpose of this error term is to penalize human poses where elements of the articulated object model (in this case joints and links, or just links) are folded within the silhouette. The error term tends to have each raster point or grid point within the contour close to the element's human pose. In other words, as the distance of a point to the nearest element increases, the contour fill error term increases. The closest element can be a link or a joint, or both.
距数据库人体姿态误差项EDB:该项已由公式3定义。它能够通 过利用正确人体姿态的数据库来确保最终的3D人体姿态在运动学上 是可能的(例如,膝关节以正确的方式弯曲)。它隐含地将人体测量 约束添加到我们的优化中。这里使用的最接近的数据库人体姿态是通 过对人体姿态进行新的搜索来找到的,因为估计的人体姿态可能在优 化过程中已经改变。Human pose error term EDB from database: This term has been defined by
平滑度误差项Erol和Ep:人体运动通常是平滑的,因此相邻帧的 骨架应该是相似的。这使得本发明能够将时间一致性引入人体姿态优 化。因此,Erol惩罚连续帧的骨架内角的较大变化,而Ep惩罚较大的 运动:Smoothness error terms Erol and Ep : Human motion is usually smooth, so the skeletons of adjacent frames should be similar. This enables the present invention to introduce temporal consistency into human pose optimization. Therefore, Erol penalizes large changes in the inner angle of the skeleton for successive frames, while Ep penalizes large motions:
Ep=|p0-p0|(8)Ep =|p0 -p0 |(8)
其中,α′和β′是前一帧中同一对象的对应角度,而p′0是前一帧中 根关节的全局位置。可以以类似的方式考虑和约束根骨骼的旋转。where α′ and β′ are the corresponding angles of the same object in the previous frame, and p′0 is the global position of the root joint in the previous frame. The rotation of the root bone can be considered and constrained in a similar fashion.
长度误差项E:在处理帧序列时,骨骼长度(或链接长度)的初 始化已经是一个很好的近似值。因此,本发明尝试将优化后的人体姿 态保持在以下长度附近:Length error term E: The initialization of bone length (or link length) is already a good approximation when dealing with frame sequences. Therefore, the present invention attempts to keep the optimized human posture around the following lengths:
其中,li是最终骨骼长度,是初始骨骼或链接长度或另一参考 骨骼或链接长度。在另一实施例中,可以被认为是真实的骨骼长度, 并且也是可变的,并且可以在对整个帧序列执行优化时进行估计。whereli is the final bone length, is the initial bone or link length or another reference bone or link length. In another embodiment, Can be considered the true bone length, and is also variable and can be estimated when performing optimization on the entire frame sequence.
优化过程如下:为了最小化公式4中的能量项,本发明采用例如 局部优化策略,其中本发明通过沿着随机选取的方向执行线搜索来迭 代地逐个优化变量。对于每个变量,本发明选择10个随机方向进行 优化,并执行20次全局迭代。该优化过程可以独立于上述初始人体 姿态估计过程来实现,即,利用任何其他人体姿态估计过程或者利用 恒定的默认人体姿态作为初始估计。然而,使用提供相当好的初始估 计的初始人体姿态估计过程确保优化过程可能找到全局最优匹配,从 而避免误差函数的局部最小值。图9说明了优化过程的效果。最左边 的例子显示了轮廓填充误差项的影响:可以向上或向下抬起运动员的 手臂以减小轮廓匹配误差项,但只有当手臂向上移动时才会减小轮廓 填充误差项。图9显示了对Germann等人的方法的明显改进,该方法 根本不包括人体姿态优化,并且每个人体姿态都必须手动校正。The optimization process is as follows: To minimize the energy term in Equation 4, the present invention employs, for example, a local optimization strategy, where the present invention iteratively optimizes variables one by one by performing a line search along randomly chosen directions. For each variable, the present invention selects 10 random directions for optimization and performs 20 global iterations. This optimization process can be implemented independently of the initial human pose estimation process described above, i.e., using any other human pose estimation process or using a constant default human pose as the initial estimate. However, using an initial human pose estimation process that provides a reasonably good initial estimate ensures that the optimization process is likely to find a globally optimal match, thus avoiding local minima of the error function. Figure 9 illustrates the effect of the optimization process. The leftmost example shows the effect of the contour-filling error term: the athlete's arm can be lifted up or down to reduce the contour-matching error term, but the contour-filling error term is only reduced when the arm is moved up. Figure 9 shows a clear improvement over the method of Germann et al., which does not include human pose optimization at all and each human pose must be manually corrected.
本发明利用两个或三个摄像机在四个真实的标枪比赛电视镜头 序列上对所提出的方法进行了评估,产生了大约1500个要处理的人 体姿态。图4和图11显示了结果的一个子集。The present invention evaluates the proposed method on four real javelin-match TV footage sequences with two or three cameras, yielding approximately 1500 human poses to process. Figures 4 and 11 show a subset of the results.
图11中的每一行都显示了一组连续的人体姿态,并且每一项都 显示了所有可用摄像机中各自运动员的图像。从中可以看出,即使只 有两个摄像头和非常低分辨率的图像,所述方法在大多数情况下也可 以恢复出精确的人体姿态。Each row in Figure 11 shows a sequential set of human poses, and each item shows images of the respective athlete from all available cameras. It can be seen that even with only two cameras and very low-resolution images, the method can recover accurate human poses in most cases.
本发明在所有结果中使用的参数值(使用所述自动参数调优系统 计算)如下所示:The parameter values used by the present invention in all results (calculated using the automatic parameter tuning system) are as follows:
ωs=9,ωf=15,ωDB=0.05,ωrot=0.1,ωp=1,λDB=0.15,λt=0.3ωs =9,ωf =15,ωDB =0.05,ωrot =0.1,ωp =1,λDB =0.15,λt =0.3
对于公式(4)和(2)中的优化函数,本发明将上述参数用于所 有结果。所述参数通过以下参数调整过程获得。本发明对两个场景中 的2D人体姿态进行了手动注释。然后运行该方法,并自动将结果与 手动注释进行比较。通过将其用作误差函数,并将参数用作变量,可 实现自动参数优化。For the optimization functions in equations (4) and (2), the present invention uses the above parameters for all results. The parameters are obtained through the following parameter adjustment process. The present invention manually annotates the 2D human poses in the two scenes. The method is then run and the results are automatically compared to the manual annotations. Automatic parameter optimization is achieved by using it as the error function and using the parameters as variables.
本发明所涉及的人体姿态估计方法在双摄像头的设置中每个运 动员每帧大约需要40秒,在三摄像头设置中大约需要60秒。本发明 实现了一个并行版本,为每个运动员运行一个线程。在8核系统上, 其能够提供大约8倍的加速比。The human pose estimation method involved in the present invention takes approximately 40 seconds per frame per athlete in a dual-camera setup, and approximately 60 seconds in a triple-camera setup. The present invention implements a parallel version, running one thread for each player. On an 8-core system, it can provide about an 8x speedup.
需要注意的是,本发明所述初始人体姿态估计并不依赖于前一帧 的人体姿态估计结果。因此,该过程并不存在漂移,其可以从错误的 人体姿态猜测中恢复。It should be noted that the initial human pose estimation in the present invention does not depend on the human pose estimation result of the previous frame. Therefore, there is no drift in the process, which can recover from erroneous human pose guesses.
图10显示了到目前为止所描述的方法的失效情况:(a)人体姿 态离数据库太远并且不能被正确估计;(b)手臂离身体太近并且不 能被正确定位。Figure 10 shows the failure of the methods described so far: (a) the human pose is too far from the database and cannot be estimated correctly; (b) the arm is too close to the body and cannot be positioned correctly.
图11显示了每帧显示所有摄影机视图的结果序列。Figure 11 shows the resulting sequence showing all camera views per frame.
到目前为止所描述的方法可能由于缺乏仅由二进制轮廓提供的 信息而失效,特别是当手臂太靠近身体时,如图10(b)所示。也就 是说,几个人体姿态可以具有非常相似的二元轮廓。因此,仅使用轮 廓信息可能不足以消除人体姿态的模糊性。而将光流引入优化过程可 以解决这种模糊性。更详细地,这是通过确定两个连续图像之间的光 流,并由此确定一个或多个骨骼或关节的预期位置来实现的。可以对 所有关节执行此操作,或仅对遮挡(或被身体其他部位遮挡)的关节 执行此操作。例如,给定一个帧中关节(或骨骼位置和方向)的位置, 将使用到相邻帧的光流来计算相邻帧中的预期位置。The methods described so far may fail due to the lack of information provided only by binary contours, especially when the arm is too close to the body, as shown in Fig. 10(b). That is, several human poses can have very similar binary contours. Therefore, using contour information alone may not be sufficient to remove the ambiguity of human poses. Introducing optical flow into the optimization process can resolve this ambiguity. In more detail, this is achieved by determining the optical flow between two consecutive images, and thereby the expected position of one or more bones or joints. This can be done for all joints, or only for joints that are occluded (or occluded by other parts of the body). For example, given the positions of joints (or bone positions and orientations) in one frame, the optical flow to adjacent frames will be used to calculate the expected positions in adjacent frames.
然而,如果在所述摄像机中,身体部分没有发生遮挡,那么光流 法则是一种最可靠的方法。因此,在另一实施例中,类似于上述方法, 从摄像机视图渲染的网格也被用来用其所属的身体部位来标记每个 像素。然后,为了将关节的位置传播到下一帧,仅使用投影关节位置 附近(或之上)的属于该关节的相应身体部分的那些像素的光流。However, if the body parts are not occluded in the camera, the optical flow law is the most reliable method. Therefore, in another embodiment, similar to the method described above, the grid rendered from the camera view is also used to label each pixel with the body part to which it belongs. Then, in order to propagate the position of the joint to the next frame, the optical flow of only those pixels belonging to the corresponding body part of the joint near (or above) the projected joint position are used.
对于所述每个可用的摄像机,利用给定这些预期位置,可以通过 三角测量计算预期关节位置。以与到数据库EDB的距离基本上相同的 方式计算到相应预期人体姿态(考虑到整个身体或仅感兴趣的肢体或 骨骼)的距离(也称为流动误差项EEX),并且将相应的加权项ωEXEEX添加到等式(4)的能量函数。For each of the available cameras, given these expected positions, the expected joint positions can be calculated by triangulation. The distance (also called the flow error term EEX ) to the corresponding expected human pose (considering the entire body or only the limb or bone of interest) is calculated in substantially the same way as the distance to the database EDB , and the corresponding A weighting term ωEX EEX is added to the energy function of equation (4).
除此之外,本发明所述方法的结果在很大程度上依赖于人体姿态 数据库。因此,一个好的数据库应该有大范围的动作和大范围的视图, 以便最初的人体姿态猜测接近正确的人体姿态。图10(a)示出了其 中数据库中没有类似人体姿态并且因此人体姿态估计失效的示例。可 以手动或自动选择好的人体姿态,然后将其添加到数据库中,从而扩 大可能人体姿态的空间。In addition to this, the results of the method described in the present invention rely heavily on a database of human poses. Therefore, a good database should have a large range of motions and a large range of views so that the initial human pose guesses are close to the correct human pose. Figure 10(a) shows an example where there are no similar human poses in the database and thus human pose estimation fails. Good human poses can be selected manually or automatically and then added to the database, thus expanding the space of possible human poses.
另一个可以进一步利用的重要先验信息是人体骨骼的运动学信 息:到目前为止,该方法已经使用了一些隐式人体测量约束,但是可 以在人体姿态优化中加入对关节角度的特定约束,即考虑到人体关节 角度被限制在一定范围内的事实。Another important prior information that can be further exploited is the kinematic information of the human skeleton: so far the method has used some implicit anthropometric constraints, but specific constraints on joint angles can be added to the human pose optimization, i.e. Considering the fact that the human joint angles are limited within a certain range.
为了描述的方便,描述以上装置时以功能分为各种单元分别描述。 当然,在实施本申请时可以把各单元的功能在同一个或多个软件和/ 或硬件中实现。For the convenience of description, when describing the above device, the functions are divided into various units and described respectively. Of course, when implementing the present application, the functions of each unit may be implemented in one or more software and/or hardware.
本领域内的技术人员应明白,本发明的实施例可提供为方法、系 统、或计算机程序产品。因此,本发明可采用完全硬件实施例、完全 软件实施例、或结合软件和硬件方面的实施例的形式。而且,本发明 可采用在一个或多个其中包含有计算机可用程序代码的计算机可用 存储介质(包括但不限于磁盘存储器、CD-ROM、光学存储器等)上实 施的计算机程序产品的形式。As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
本发明是参照根据本发明实施例的方法、设备(系统)、和计算 机程序产品的流程图和/或方框图来描述的。应理解可由计算机程序 指令实现流程图和/或方框图中的每一流程和/或方框、以及流程图 和/或方框图中的流程和/或方框的结合。可提供这些计算机程序指 令到通用计算机、专用计算机、嵌入式处理机或其他可编程数据处理 设备的处理器以产生一个机器,使得通过计算机或其他可编程数据处 理设备的处理器执行的指令产生用于实现在流程图一个流程或多个 流程和/或方框图一个方框或多个方框中指定的功能的装置。The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block in the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general purpose computer, special purpose computer, embedded processor or other programmable data processing device to produce a machine such that the instructions executed by the processor of the computer or other programmable data processing device produce Means for implementing the functions specified in a flow or flow of a flowchart and/or a block or blocks of a block diagram.
本申请可以在由计算机执行的计算机可执行指令的一般上下文 中描述,例如程序模块。一般地,程序模块包括执行特定任务或实现 特定抽象数据类型的例程、程序、对象、组件、数据结构等等。也可 以在分布式计算环境中实践本申请,在这些分布式计算环境中,由通 过通信网络而被连接的远程处理设备来执行任务。在分布式计算环境 中,程序模块可以位于包括存储设备在内的本地和远程计算机存储介 质中。The application may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The application may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including storage devices.
这些计算机程序指令也可存储在能引导计算机或其他可编程数 据处理设备以特定方式工作的计算机可读存储器中,使得存储在该计 算机可读存储器中的指令产生包括指令装置的制造品,该指令装置实 现在流程图一个流程或多个流程和/或方框图一个方框或多个方框 中指定的功能。These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory result in an article of manufacture comprising instruction means, the instructions The apparatus implements the functions specified in the flow or flow of the flowcharts and/or the block or blocks of the block diagrams.
这些计算机程序指令也可装载到计算机或其他可编程数据处理 设备上,使得在计算机或其他可编程设备上执行一系列操作步骤以产 生计算机实现的处理,从而在计算机或其他可编程设备上执行的指令 提供用于实现在流程图一个流程或多个流程和/或方框图一个方框 或多个方框中指定的功能的步骤。These computer program instructions can also be loaded on a computer or other programmable data processing device to cause a series of operational steps to be performed on the computer or other programmable device to produce a computer-implemented process such that The instructions provide steps for implementing the functions specified in the flow or blocks of the flowcharts and/or the block or blocks of the block diagrams.
在一个典型的配置中,计算设备包括一个或多个处理器(CPU)、 输入/输出接口、网络接口和内存。In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
内存可能包括计算机可读介质中的非永久性存储器,随机存取存 储器(RAM)和/或非易失性内存等形式,如只读存储器(ROM)或闪存 (flash RAM)。内存是计算机可读介质的示例。Memory may include non-persistent memory in computer readable media, random access memory (RAM) and/or non-volatile memory in the form of read only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
计算机可读介质包括永久性和非永久性、可移动和非可移动媒体 可以由任何方法或技术来实现信息存储。信息可以是计算机可读指令、 数据结构、程序的模块或其他数据。计算机的存储介质的例子包括, 但不限于相变内存(PRAM)、静态随机存取存储器(SRAM)、动态随机存 取存储器(DRAM)、其他类型的随机存取存储器(RAM)、只读存储器(ROM)、电可擦除可编程只读存储器(EEPROM)、快闪记忆体或其他内 存技术、只读光盘只读存储器(CD-ROM)、数字多功能光盘(DVD)或其 他光学存储、磁盒式磁带,磁带磁磁盘存储或其他磁性存储设备或任 何其他非传输介质,可用于存储可以被计算设备访问的信息。按照本 文中的界定,计算机可读介质不包括暂存电脑可读媒体(transitory media),如调制的数据信号和载波。Computer readable media includes both persistent and non-permanent, removable and non-removable media. Information storage can be implemented by any method or technology. Information may be computer readable instructions, data structures, modules of programs, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read only memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), Flash Memory or other memory technology, Compact Disc Read Only Memory (CD-ROM), Digital Versatile Disc (DVD) or other optical storage, Magnetic tape cassettes, magnetic tape magnetic disk storage or other magnetic storage devices or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, computer-readable media does not include transitory computer-readable media, such as modulated data signals and carrier waves.
还需要说明的是,术语“包括”、“包含”或者其任何其他变体 意在涵盖非排他性的包含,从而使得包括一系列要素的过程、方法、 商品或者设备不仅包括那些要素,而且还包括没有明确列出的其他要 素,或者是还包括为这种过程、方法、商品或者设备所固有的要素。 在没有更多限制的情况下,由语句“包括一个……”限定的要素,并 不排除在包括所述要素的过程、方法、商品或者设备中还存在另外的 相同要素。It should also be noted that the terms "comprising", "comprising" or any other variation thereof are intended to encompass a non-exclusive inclusion such that a process, method, article or device comprising a series of elements includes not only those elements, but also Other elements not expressly listed, or which are inherent to such a process, method, article of manufacture, or apparatus are also included. Without further limitation, an element qualified by the phrase "comprising a..." does not preclude the presence of additional identical elements in the process, method, article of manufacture or apparatus that includes the element.
本说明书中的各个实施例均采用递进的方式描述,各个实施例之 间相同相似的部分互相参见即可,每个实施例重点说明的都是与其他 实施例的不同之处。尤其,对于系统实施例而言,由于其基本相似于 方法实施例,所以描述的比较简单,相关之处参见方法实施例的部分 说明即可。The various embodiments in this specification are described in a progressive manner, and the same and similar parts between the various embodiments may be referred to each other, and each embodiment focuses on the differences from other embodiments. In particular, for the system embodiment, since it is basically similar to the method embodiment, the description is relatively simple, and for related parts, please refer to the partial description of the method embodiment.
以上所述仅为本申请的实施例而已,并不用于限制本申请。对于 本领域技术人员来说,本申请可以有各种更改和变化。凡在本申请的 精神和原理之内所作的任何修改、等同替换、改进等,均应包含在本 申请的权利要求范围之内。The above descriptions are merely examples of the present application, and are not intended to limit the present application. Various modifications and variations of this application are possible for those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application shall be included within the scope of the claims of the present application.
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202010439265.7ACN111832386A (en) | 2020-05-22 | 2020-05-22 | A method, apparatus and computer readable medium for estimating human body pose |
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202010439265.7ACN111832386A (en) | 2020-05-22 | 2020-05-22 | A method, apparatus and computer readable medium for estimating human body pose |
| Publication Number | Publication Date |
|---|---|
| CN111832386Atrue CN111832386A (en) | 2020-10-27 |
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202010439265.7APendingCN111832386A (en) | 2020-05-22 | 2020-05-22 | A method, apparatus and computer readable medium for estimating human body pose |
| Country | Link |
|---|---|
| CN (1) | CN111832386A (en) |
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN113077486A (en)* | 2021-04-30 | 2021-07-06 | 深圳世源工程技术有限公司 | Method and system for monitoring vegetation coverage rate in mountainous area |
| CN113255429A (en)* | 2021-03-19 | 2021-08-13 | 青岛根尖智能科技有限公司 | Method and system for estimating and tracking human body posture in video |
| CN113673318A (en)* | 2021-07-12 | 2021-11-19 | 浙江大华技术股份有限公司 | Action detection method and device, computer equipment and storage medium |
| CN113989283A (en)* | 2021-12-28 | 2022-01-28 | 中科视语(北京)科技有限公司 | 3D human body posture estimation method and device, electronic equipment and storage medium |
| CN114155555A (en)* | 2021-12-02 | 2022-03-08 | 北京中科智易科技有限公司 | Human behavior artificial intelligence judgment system and method |
| CN116189225A (en)* | 2021-11-26 | 2023-05-30 | 财团法人工业技术研究院 | Image analysis method and image analysis device using same |
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| EP2383696A1 (en)* | 2010-04-30 | 2011-11-02 | LiberoVision AG | Method for estimating a pose of an articulated object model |
| US20140219550A1 (en)* | 2011-05-13 | 2014-08-07 | Liberovision Ag | Silhouette-based pose estimation |
| CN104700433A (en)* | 2015-03-24 | 2015-06-10 | 中国人民解放军国防科学技术大学 | Vision-based real-time general movement capturing method and system for human body |
| WO2019232894A1 (en)* | 2018-06-05 | 2019-12-12 | 中国石油大学(华东) | Complex scene-based human body key point detection system and method |
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| EP2383696A1 (en)* | 2010-04-30 | 2011-11-02 | LiberoVision AG | Method for estimating a pose of an articulated object model |
| EP2383699A2 (en)* | 2010-04-30 | 2011-11-02 | LiberoVision AG | Method for estimating a pose of an articulated object model |
| US20140219550A1 (en)* | 2011-05-13 | 2014-08-07 | Liberovision Ag | Silhouette-based pose estimation |
| CN104700433A (en)* | 2015-03-24 | 2015-06-10 | 中国人民解放军国防科学技术大学 | Vision-based real-time general movement capturing method and system for human body |
| WO2019232894A1 (en)* | 2018-06-05 | 2019-12-12 | 中国石油大学(华东) | Complex scene-based human body key point detection system and method |
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN113255429A (en)* | 2021-03-19 | 2021-08-13 | 青岛根尖智能科技有限公司 | Method and system for estimating and tracking human body posture in video |
| CN113077486A (en)* | 2021-04-30 | 2021-07-06 | 深圳世源工程技术有限公司 | Method and system for monitoring vegetation coverage rate in mountainous area |
| CN113077486B (en)* | 2021-04-30 | 2021-10-08 | 深圳世源工程技术有限公司 | Method and system for monitoring vegetation coverage rate in mountainous area |
| CN113673318A (en)* | 2021-07-12 | 2021-11-19 | 浙江大华技术股份有限公司 | Action detection method and device, computer equipment and storage medium |
| CN113673318B (en)* | 2021-07-12 | 2024-05-03 | 浙江大华技术股份有限公司 | Motion detection method, motion detection device, computer equipment and storage medium |
| CN116189225A (en)* | 2021-11-26 | 2023-05-30 | 财团法人工业技术研究院 | Image analysis method and image analysis device using same |
| CN114155555A (en)* | 2021-12-02 | 2022-03-08 | 北京中科智易科技有限公司 | Human behavior artificial intelligence judgment system and method |
| CN114155555B (en)* | 2021-12-02 | 2022-06-10 | 北京中科智易科技有限公司 | Human behavior artificial intelligence judgment system and method |
| CN113989283A (en)* | 2021-12-28 | 2022-01-28 | 中科视语(北京)科技有限公司 | 3D human body posture estimation method and device, electronic equipment and storage medium |
| Publication | Publication Date | Title |
|---|---|---|
| US9117113B2 (en) | Silhouette-based pose estimation | |
| Zheng et al. | Hybridfusion: Real-time performance capture using a single depth sensor and sparse imus | |
| CN111832386A (en) | A method, apparatus and computer readable medium for estimating human body pose | |
| US10431000B2 (en) | Robust mesh tracking and fusion by using part-based key frames and priori model | |
| Kaufmann et al. | Emdb: The electromagnetic database of global 3d human pose and shape in the wild | |
| Cheung et al. | Shape-from-silhouette across time part ii: Applications to human modeling and markerless motion tracking | |
| CN104322052B (en) | System for mixing in real time or the three dimensional object of hybrid computer generation and film camera feed video | |
| US20230008567A1 (en) | Real-time system for generating 4d spatio-temporal model of a real world environment | |
| Wang et al. | Outdoor markerless motion capture with sparse handheld video cameras | |
| Bachmann et al. | Motion capture from pan-tilt cameras with unknown orientation | |
| Leroy et al. | SMPLy benchmarking 3D human pose estimation in the wild | |
| Mehta et al. | Single-shot multi-person 3d body pose estimation from monocular rgb input | |
| JP6799468B2 (en) | Image processing equipment, image processing methods and computer programs | |
| Germann et al. | Space-time body pose estimation in uncontrolled environments | |
| Desai et al. | Combining skeletal poses for 3D human model generation using multiple Kinects | |
| Fua et al. | Human shape and motion recovery using animation models | |
| Joo | Sensing, measuring, and modeling social signals in nonverbal communication | |
| Garau et al. | Unsupervised continuous camera network pose estimation through human mesh recovery | |
| Chou et al. | Wide-baseline Multi-camera Automatic Calibration Using Recovered Human Body Mesh | |
| van der Zwan et al. | Robust and fast teat detection and tracking in low-resolution videos for automatic milking devices | |
| Lin et al. | Multi-view 3D Human Physique Dataset Construction For Robust Digital Human Modeling of Natural Scenes | |
| Choi et al. | Humans as a Calibration Pattern: Dynamic 3D Scene Reconstruction from Unsynchronized and Uncalibrated Videos | |
| Ajisafe et al. | Mirror-aware neural humans | |
| JP7754511B2 (en) | A real-time system for generating 4D spatiotemporal models of real-world environments | |
| KR102848376B1 (en) | A volumetric data processing system |
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | ||
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||
| SE01 | Entry into force of request for substantive examination | ||
| RJ01 | Rejection of invention patent application after publication | ||
| RJ01 | Rejection of invention patent application after publication | Application publication date:20201027 |