CN111310595B - Method and apparatus for generating information - Google Patents

Method and apparatus for generating information

Info

Publication number
CN111310595B
Authority
CN
China
Prior art keywords
user
human body
target video
position information
video frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010065727.3A
Other languages
Chinese (zh)
Other versions
CN111310595A (en)
Inventor
安容巧
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202010065727.3A
Publication of CN111310595A
Application granted
Publication of CN111310595B
Legal status: Active (current)
Anticipated expiration

Abstract

Embodiments of the present disclosure disclose a method and apparatus for generating information. One embodiment of the method includes: acquiring a user position information set corresponding to a target video frame sequence, where the user position information is used to characterize the positions of users displayed in the target video frame sequence and includes human body position information and partial human body position information; determining, according to the human body position information and the partial human body position information, the degree of association between the users' human bodies and the degree of association between the partial human bodies displayed in the target video frame sequence, respectively; determining, according to the determined degrees of association, the association relationship between the users' human bodies and the association relationship between the partial human bodies displayed in the target video frame sequence; and generating trajectory information of a user displayed in the target video frame sequence in response to determining that the determined association relationships match. This implementation reduces the complexity of the detection and tracking model and saves network transmission traffic.

Description

Translated from Chinese
Method and apparatus for generating information

Technical Field

Embodiments of the present disclosure relate to the field of computer technology, and in particular to a method and apparatus for generating information.

Background Art

With the rapid development of computer vision, research on and application of multi-target human body tracking have become increasingly widespread.

Related approaches usually detect and track the pedestrian as a whole. In scenarios with occlusion, the prior art typically constructs a more complex network. For example, on one hand a spatial attention map is generated and the targets (such as pedestrians) in the image are weighted; on the other hand a recurrent neural network is constructed to build a temporal attention model that performs temporal selection of the targets in each frame, thereby accomplishing target tracking.

Summary of the Invention

Embodiments of the present disclosure propose a method and apparatus for generating information.

In a first aspect, an embodiment of the present disclosure provides a method for generating information, the method including: acquiring a user position information set corresponding to a target video frame sequence, where the user position information in the set is used to characterize the positions of users displayed in the target video frames of the target video frame sequence, and the user position information includes human body position information and partial human body position information; determining, according to the human body position information and the partial human body position information, the degree of association between the users' human bodies and the degree of association between the partial human bodies displayed in the target video frame sequence, respectively; determining, according to the determined degrees of association, the association relationship between the users' human bodies and the association relationship between the partial human bodies displayed in the target video frame sequence; and in response to determining that the determined association relationship between a user's partial human bodies matches the association relationship between the user's human bodies, generating trajectory information of the user displayed in the target video frame sequence.

In some embodiments, acquiring the user position information set corresponding to the target video frame sequence includes: acquiring the target video frame sequence; and inputting a target video frame of the target video frame sequence into a pre-trained user position detection model to obtain the user position information corresponding to the target video frame, where the user position detection model is used to characterize the correspondence between target video frames and user position information.

In some embodiments, determining, according to the human body position information and the partial human body position information, the degree of association between the users' human bodies and the degree of association between the partial human bodies displayed in the target video frame sequence includes: extracting, from the target video frame sequence, the image of the user's human body indicated by the human body position information as a user human body image; inputting the extracted user human body image into a pre-trained user feature extraction model to obtain a user feature corresponding to the user human body image, where the user feature extraction model is used to characterize the correspondence between user human body images and user features; and determining the degree of association between the users' human bodies displayed in the target video frame sequence according to the distance between the obtained user feature and the user features included in a trajectory obtained by trajectory prediction.

In some embodiments, the user feature extraction model is obtained through the following steps: acquiring a training sample set, where a training sample includes a sample user human body image and sample annotation information corresponding to the sample user human body image, the sample annotation information being used to identify a user; and training the user feature extraction model by taking the sample user human body images of the training samples in the training sample set as input and taking user features matching the sample annotation information corresponding to the input sample user human body images as expected output, where the user features matching the sample annotation information corresponding to the input training sample are consistent with the user identified by the sample annotation information.

In some embodiments, the partial human body indicated by the partial human body position information includes a head determined based on head-shoulder keypoints, and generating the trajectory information of the user displayed in the target video frame sequence in response to determining that the determined association relationship between the user's partial human bodies matches the association relationship between the user's human bodies includes: determining the Intersection over Union (IoU) between the human body region indicated by the human body position information and the head region indicated by the partial human body position information in a target video frame of the target video frame sequence; determining the distance between the positions indicated by human body position information and partial human body position information whose IoU satisfies a preset condition; in response to determining that the determined distance satisfies a preset distance condition, generating an association relationship between the human body position information and the partial human body position information; and generating the trajectory information of the user displayed in the target video frame sequence according to the generated association relationship.

In a second aspect, an embodiment of the present disclosure provides an apparatus for generating information, the apparatus including: an acquisition unit configured to acquire a user position information set corresponding to a target video frame sequence, where the user position information in the set is used to characterize the positions of users displayed in the target video frames of the target video frame sequence, and the user position information includes human body position information and partial human body position information; a first determination unit configured to determine, according to the human body position information and the partial human body position information, the degree of association between the users' human bodies and the degree of association between the partial human bodies displayed in the target video frame sequence, respectively; a second determination unit configured to determine, according to the determined degrees of association, the association relationship between the users' human bodies and the association relationship between the partial human bodies displayed in the target video frame sequence; and a generation unit configured to generate trajectory information of a user displayed in the target video frame sequence in response to determining that the determined association relationship between the user's partial human bodies matches the association relationship between the user's human bodies.

In some embodiments, the acquisition unit includes: an acquisition module configured to acquire the target video frame sequence; and a first generation module configured to input a target video frame of the target video frame sequence into a pre-trained user position detection model to obtain the user position information corresponding to the target video frame, where the user position detection model is used to characterize the correspondence between target video frames and user position information.

In some embodiments, the first determination unit includes: an extraction module configured to extract, from the target video frame sequence, the image of the user's human body indicated by the human body position information as a user human body image; a second generation module configured to input the extracted user human body image into a pre-trained user feature extraction model to obtain the user feature corresponding to the user human body image, where the user feature extraction model is used to characterize the correspondence between user human body images and user features; and a first determination module configured to determine the degree of association between the users' human bodies displayed in the target video frame sequence according to the distance between the obtained user features and the user features included in a trajectory obtained by trajectory prediction.

In some embodiments, the user feature extraction model is obtained through the following steps: acquiring a training sample set, where a training sample includes a sample user human body image and sample annotation information corresponding to the sample user human body image, the sample annotation information being used to identify a user; and training the user feature extraction model by taking the sample user human body images of the training samples in the training sample set as input and taking user features matching the sample annotation information corresponding to the input sample user human body images as expected output, where the user features matching the sample annotation information corresponding to the input training sample are consistent with the user identified by the sample annotation information.

In some embodiments, the partial human body indicated by the partial human body position information includes a head determined based on head-shoulder keypoints, and the generation unit includes: a second determination module configured to determine the Intersection over Union between the human body region indicated by the human body position information and the head region indicated by the partial human body position information in a target video frame of the target video frame sequence; a third determination module configured to determine the distance between the positions indicated by human body position information and partial human body position information whose Intersection over Union satisfies a preset condition; a third generation module configured to generate an association relationship between the human body position information and the partial human body position information in response to determining that the determined distance satisfies a preset distance condition; and a fourth generation module configured to generate trajectory information of the user displayed in the target video frame sequence according to the generated association relationship.

In a third aspect, an embodiment of the present disclosure provides an electronic device, including: one or more processors; and a storage apparatus on which one or more programs are stored, where when the one or more programs are executed by the one or more processors, the one or more processors implement the method described in any implementation of the first aspect.

In a fourth aspect, an embodiment of the present disclosure provides a computer-readable medium on which a computer program is stored, where when the program is executed by a processor, the method described in any implementation of the first aspect is implemented.

In the method and apparatus for generating information provided by the embodiments of the present disclosure, a user position information set corresponding to a target video frame sequence is first acquired, where the user position information in the set is used to characterize the positions of users displayed in the target video frames of the target video frame sequence and includes human body position information and partial human body position information. Then, according to the human body position information and the partial human body position information, the degree of association between the users' human bodies and the degree of association between the partial human bodies displayed in the target video frame sequence are determined, respectively. Next, according to the determined degrees of association, the association relationship between the users' human bodies and the association relationship between the partial human bodies displayed in the target video frame sequence are determined. Finally, in response to determining that the determined association relationship between a user's partial human bodies matches the association relationship between the user's human bodies, trajectory information of the user displayed in the target video frame sequence is generated. This greatly reduces the complexity of the detection and tracking model, making the method especially suitable for end devices and embedded devices such as those in unmanned retail stores and monitored areas, and thus also saves network transmission traffic.

Brief Description of the Drawings

Other features, objects and advantages of the present disclosure will become more apparent by reading the following detailed description of non-limiting embodiments made with reference to the accompanying drawings:

Fig. 1 is an exemplary system architecture diagram to which an embodiment of the present disclosure can be applied;

Fig. 2 is a flowchart of an embodiment of the method for generating information according to the present disclosure;

Fig. 3a and Fig. 3b are schematic diagrams of an application scenario of the method for generating information according to an embodiment of the present disclosure;

Fig. 4 is a flowchart of another embodiment of the method for generating information according to the present disclosure;

Fig. 5 is a schematic structural diagram of an embodiment of the apparatus for generating information according to the present disclosure;

Fig. 6 is a schematic structural diagram of an electronic device suitable for implementing embodiments of the present disclosure.

Detailed Description

The present disclosure will be further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the relevant invention, not to limit it. It should also be noted that, for ease of description, only the parts related to the relevant invention are shown in the drawings.

It should be noted that, in case of no conflict, the embodiments of the present disclosure and the features in the embodiments may be combined with each other. The present disclosure will be described in detail below with reference to the accompanying drawings and in conjunction with the embodiments.

Fig. 1 shows an exemplary architecture 100 to which the method for generating information or the apparatus for generating information of the present disclosure can be applied.

As shown in Fig. 1, the system architecture 100 may include terminal devices 101 and 102, a network 103 and a server 104. The network 103 serves as a medium for providing communication links between the terminal devices 101, 102 and the server 104. The network 103 may include various connection types, such as wired or wireless communication links, or fiber-optic cables.

The terminal devices 101 and 102 interact with the server 104 through the network 103 to receive or send messages and the like. The terminal devices 101 and 102 may be hardware or software. When the terminal devices 101 and 102 are hardware, the terminal device 101 may be any of various electronic devices that have a camera and support image transmission, including but not limited to optical cameras, smart cameras and the like; the terminal device 102 may be any of various electronic devices that have a display screen and serve as a monitoring or accounting terminal, including but not limited to smartphones, tablet computers, laptop computers, desktop computers and the like. When the terminal devices 101 and 102 are software, they may be installed in the electronic devices listed above, and may be implemented as multiple pieces of software or software modules (for example, software or software modules for providing distributed services) or as a single piece of software or software module. No specific limitation is made here.

The server 104 may be a server providing various services, for example, a background server that analyzes and processes the images sent by the terminal device 101. The background server may perform various kinds of analysis and processing on the images sent by the terminal device 101, and send the processing results (for example, the user's movement path) to the terminal device 102 for display.

It should be noted that the server may be hardware or software. When the server is hardware, it may be implemented as a distributed server cluster composed of multiple servers, or as a single server. When the server is software, it may be implemented as multiple pieces of software or software modules (for example, software or software modules for providing distributed services), or as a single piece of software or software module. No specific limitation is made here.

It should be noted that the terminal device 101 or 102 may also directly analyze and process the acquired images, in which case the server 104 may be omitted.

It should be noted that the method for generating information provided by the embodiments of the present disclosure may be executed by the terminal device 101 or 102, and correspondingly, the apparatus for generating information is generally disposed in the terminal device 101 or 102. The method for generating information provided by the embodiments of the present disclosure may also be executed by the server 104, and correspondingly, the apparatus for generating information may also be disposed in the server 104.

It should be understood that the numbers of terminal devices, networks and servers in Fig. 1 are merely illustrative. There may be any number of terminal devices, networks and servers according to implementation needs.

Continuing to refer to Fig. 2, a flow 200 of an embodiment of the method for generating information according to the present disclosure is shown. The method for generating information includes the following steps:

Step 201: Acquire a user position information set corresponding to a target video frame sequence.

In this embodiment, the execution body of the method for generating information (the terminal device 101 or 102 shown in Fig. 1) may acquire the user position information set corresponding to the target video frame sequence through a wired or wireless connection. The target video frame sequence may be extracted from a video shot of a preset area (for example, a security monitoring area or an unmanned store). Here, the extraction may take frames at equal intervals according to the frame rate, or may preferentially select images of higher definition, so as to form the target video frame sequence. The user position information in the user position information set may be used to characterize the positions of the users displayed in the target video frames of the target video frame sequence. Usually, each piece of user position information in the set corresponds to one frame of the target video frame sequence. Each piece of user position information may include human body position information and partial human body position information. The human body position information may be used to characterize the position of the user's full-body figure in the image, and may include, for example, the coordinates of the center of the predicted bounding box from pedestrian detection. The partial human body position information may be used to characterize the positions of local keypoints of the human body in the image. The local keypoints of the human body may include keypoints that can be used to identify the user, including but not limited to at least one of the following: head, shoulders, face.
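
The patent does not prescribe a concrete data layout; as a minimal sketch, the per-frame user position information described above could be organized as follows (the names `Box`, `UserPosition` and `FramePositions` are illustrative, not from the patent):

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Axis-aligned box as (x_min, y_min, x_max, y_max) in pixel coordinates.
Box = Tuple[float, float, float, float]

@dataclass
class UserPosition:
    """Position of one user in a single target video frame."""
    body_box: Box                                          # whole-body bounding box
    part_boxes: List[Box] = field(default_factory=list)    # e.g. head box from head-shoulder keypoints

@dataclass
class FramePositions:
    """User position information corresponding to one target video frame."""
    frame_index: int
    users: List[UserPosition] = field(default_factory=list)
```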

In this embodiment, as an example, the execution body may acquire a user position information set corresponding to the target video frame sequence that is stored locally in advance. As another example, the execution body may acquire the user position information set corresponding to the target video frame sequence from an electronic device communicatively connected to it (for example, the terminal device 101 shown in Fig. 1), where the terminal device 101 may process the captured video to generate the user position information set.

In some optional implementations of this embodiment, the execution body may also acquire the user position information set corresponding to the target video frame sequence according to the following steps:

First, acquire the target video frame sequence.

In these implementations, the execution body may acquire the target video frame sequence in various ways. As an example, the execution body may acquire a pre-stored target video frame sequence locally. As another example, the execution body may acquire the target video frame sequence from a communicatively connected electronic device.

Second, input the target video frames of the target video frame sequence into a pre-trained user position detection model to obtain the user position information corresponding to each target video frame.

In these implementations, the user position detection model may be used to characterize the correspondence between target video frames and user position information. The user position detection model may be trained through the following steps:

S1: Acquire a training sample set.

In these implementations, the training samples may include sample video frames and sample user position information corresponding to the sample video frames. The sample user position information may include sample human body position information and sample partial human body position information. The sample human body position information may be used to characterize the position of the full-body figure in the sample video frame, and may include, for example, the coordinates of the center of the bounding box annotating the user's full-body figure. The sample partial human body position information may be used to characterize the positions of local keypoints of the human body in the sample video frame. The local keypoints of the human body may include keypoints that can be used to identify the user, including but not limited to at least one of the following: head, shoulders, face.

S2: Train the user position detection model by taking the sample video frames of the training samples in the training sample set as input and taking the sample user position information corresponding to the input sample video frames as expected output.

In these implementations, the training may apply machine learning methods to perform supervised or weakly supervised training of an initial user position detection model. The initial user position detection model may include, but is not limited to, at least one of the following: an FSSD (Feature Fusion Single Shot Multibox Detector) model, or a YoloV3 detection model.

As an example, the execution body may train an FSSD model adopting the MobileNetV1 network structure. Optionally, some of the upsampling layers in the FSSD model may be replaced with deconvolution layers to improve the accuracy of the model. Optionally, the convolution layers in the FSSD model may also be replaced with depthwise separable convolutions, which reduces the number of model parameters and the inference time, making the model lightweight enough for use in unmanned retail stores, monitoring terminals and various embedded devices.
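
The patent names the depthwise separable convolution only as a design choice; the following PyTorch-style sketch shows the standard MobileNetV1-style block the replacement refers to — a plausible rendering, not the patented implementation:

```python
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Depthwise separable convolution: a per-channel (depthwise) 3x3 conv
    followed by a 1x1 pointwise conv, as in MobileNetV1. Compared with a
    dense 3x3 conv, this sharply reduces parameters and multiply-adds."""
    def __init__(self, in_ch: int, out_ch: int, stride: int = 1):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3, stride=stride,
                                   padding=1, groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)
        self.bn1 = nn.BatchNorm2d(in_ch)
        self.bn2 = nn.BatchNorm2d(out_ch)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.relu(self.bn1(self.depthwise(x)))
        return self.relu(self.bn2(self.pointwise(x)))
```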

As another example, the execution body may train the YoloV3 model adopting the Darknet53 network structure in the PaddlePaddle deep learning framework, using synchronized batch normalization (Sync BatchNormalization). Optionally, a spatial-temporal pooling layer may be added to the YoloV3 model, which can significantly improve the accuracy of the model.

Step 202: Determine, according to the human body position information and the partial human body position information, the degree of association between the users' human bodies and the degree of association between the partial human bodies displayed in the target video frame sequence, respectively.

In this embodiment, according to the human body position information and the partial human body position information, the execution body may determine, in various ways, the degree of association between the users' human bodies and the degree of association between the partial human bodies displayed in the target video frame sequence. As an example, the execution body may use a trajectory-prediction-based tracking method to predict trajectories for the users' human bodies and partial human bodies displayed in the target video frame sequence, and generate the degrees of association between the human bodies (and between the partial human bodies) displayed in each target video frame and in the preceding video frame of the sequence. Various methods may be used for the trajectory prediction, for example Kalman filtering or a fitting function. The degree of association may be determined by the following formula:

$$d(i,j) = (d_j - y_i)^\top S_i^{-1} (d_j - y_i)$$

where $d(i,j)$ denotes the degree of association between the $j$-th detection result and the $i$-th trajectory; $S_i$ denotes the covariance matrix of the trajectory in the observation space at the current moment, obtained from the Kalman filter; $y_i$ denotes the prediction result of the trajectory at the current moment; and $d_j$ denotes the state of the $j$-th detection result. The state of a detection result may be $(\mu, \theta, \gamma, h)$, where $\mu$ and $\theta$ denote the coordinates of the center of the detection box, $\gamma$ its aspect ratio, and $h$ its height. The detection results may include at least one of the users' human bodies and partial human bodies displayed in the target video frame sequence. The prediction results may include at least one of the trajectory of a user's human body position and the trajectory of a partial human body position obtained by trajectory prediction based on the target video frame sequence.
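
The formula above is the squared Mahalanobis distance familiar from Kalman-filter-based trackers such as DeepSORT; a small NumPy sketch under that reading (the function name is illustrative):

```python
import numpy as np

def association_degree(y_i: np.ndarray, S_i: np.ndarray, d_j: np.ndarray) -> float:
    """Squared Mahalanobis distance d(i, j) between the Kalman-predicted
    observation y_i of track i (with observation-space covariance S_i) and
    the state d_j = (center_x, center_y, aspect_ratio, height) of detection j."""
    diff = d_j - y_i
    return float(diff.T @ np.linalg.inv(S_i) @ diff)
```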

In some optional implementations of this embodiment, the execution body may also determine the degree of association between the users' human bodies displayed in the target video frame sequence according to the following steps:

First, extract from the target video frame sequence the image of the user's human body indicated by the human body position information as a user human body image.

In these implementations, the execution body may use various human body detection methods to extract, from the target video frame sequence, the image of the user's human body indicated by the human body position information as the user human body image.

Second, input the extracted user human body image into a pre-trained user feature extraction model to obtain the user feature corresponding to the user human body image.

In these implementations, the user feature extraction model may be used to characterize the correspondence between user human body images and user features. For example, it may include a deep residual network (ResNet).

Optionally, the user feature extraction model may be obtained through the following steps:

S1: Acquire a training sample set.

In these implementations, the training samples may include sample user human body images and sample annotation information corresponding to the sample user human body images. The sample annotation information may be used to identify users. For example, for the same user appearing in different scenes, the sample annotation information is usually consistent.

S2: Train the user feature extraction model by taking the sample user human body images of the training samples in the training sample set as input and taking the user features matching the sample annotation information corresponding to the input sample user human body images as expected output.

In these implementations, the user features matching the sample annotation information corresponding to the input training sample are consistent with the user identified by the sample annotation information.

Specifically, the execution body of the training step may input the sample user human body images of the training samples in the training sample set into an initial user feature extraction model to obtain the user features of the training samples. Then, the obtained user features are matched against the user features of the users indicated by the sample annotation information to obtain the matched users. Next, a preset loss function may be used to compute the degree of difference between the user features of the matched users and the user features of the users specified by the sample annotation information of the training samples. Afterwards, based on the computed degree of difference and the complexity of the model, the network parameters of the initial user feature extraction model are adjusted, and training ends when a preset training end condition is satisfied. Finally, the trained initial user feature extraction model is determined as the user feature extraction model.
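
The patent leaves the loss function and matching procedure abstract. One common realization of "features that match the annotated identity" is to attach an identity classifier during training, as in the following PyTorch-style sketch; the classifier head, the cross-entropy loss and all names here are assumptions for illustration, not the patented procedure:

```python
import torch.nn as nn

class FeatureExtractor(nn.Module):
    """A backbone (e.g. a ResNet trunk) producing an embedding; the identity
    classifier is only used during training, the embedding at inference."""
    def __init__(self, backbone: nn.Module, feat_dim: int, num_identities: int):
        super().__init__()
        self.backbone = backbone                       # maps image -> feature vector
        self.classifier = nn.Linear(feat_dim, num_identities)

    def forward(self, images):
        feats = self.backbone(images)
        return feats, self.classifier(feats)

def train_step(model, images, id_labels, optimizer,
               criterion=nn.CrossEntropyLoss()):
    # The loss measures the difference between the predicted identity and the
    # identity given by the sample annotation information.
    feats, logits = model(images)
    loss = criterion(logits, id_labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```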

Third, determine the degree of association between the users' human bodies displayed in the target video frame sequence according to the distance between the obtained user features and the user features included in the trajectory obtained by trajectory prediction.

In these implementations, the execution body may first determine the distances between the user features obtained from the user feature extraction model and the user features included in the trajectory obtained by trajectory prediction, where the distance may include, but is not limited to, at least one of the Euclidean distance and the cosine distance. Then, the smallest of these distances may be selected as the degree of association between the users' human bodies displayed in the target video frame sequence.
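
A sketch of this distance computation, assuming the cosine-distance variant and unit-normalized feature vectors (the function name is illustrative):

```python
import numpy as np

def body_association_degree(det_feat: np.ndarray, track_feats: np.ndarray) -> float:
    """Smallest cosine distance between a detection's feature vector (shape D)
    and the feature vectors stored along a predicted track (shape N x D)."""
    det = det_feat / np.linalg.norm(det_feat)
    gallery = track_feats / np.linalg.norm(track_feats, axis=1, keepdims=True)
    cos_dist = 1.0 - gallery @ det          # one cosine distance per stored feature
    return float(cos_dist.min())            # smallest distance = degree of association
```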

Step 203: Determine, according to the determined degree of association between the users' human bodies and degree of association between the partial human bodies, the association relationship between the users' human bodies and the association relationship between the partial human bodies displayed in the target video frame sequence.

In this embodiment, based on the fusion of the determined degree of association between the users' human bodies and the degree of association between the partial human bodies, the execution body may determine, in various ways, the association relationship between the users' human bodies and the association relationship between the partial human bodies displayed in the target video frame sequence. The association relationship may be used to characterize the correspondence between the users' human bodies, and between the users' partial human bodies, displayed in consecutive frames of the target video frame sequence. As an example, the execution body may take a weighted average of the degrees of association between the users' human bodies determined in step 202 and the degrees of association between the corresponding partial human bodies, and select the matching pair with the strongest degree of association as the condition for establishing an association relationship.
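
One way to realize the fusion and pair selection is a weighted cost matrix plus optimal assignment; in the sketch below the weight `w` and the acceptance threshold `max_cost` are illustrative assumptions, and the distance-based degrees of association are treated as costs, so the "strongest association" is the lowest fused cost:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_detections_to_tracks(body_cost: np.ndarray, part_cost: np.ndarray,
                               w: float = 0.5, max_cost: float = 0.7):
    """Fuse body-level and part-level association costs (tracks x detections)
    by weighted average, then pick the best one-to-one matching pairs."""
    cost = w * body_cost + (1.0 - w) * part_cost
    rows, cols = linear_sum_assignment(cost)      # optimal one-to-one assignment
    return [(i, j) for i, j in zip(rows, cols) if cost[i, j] <= max_cost]
```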

Step 204: In response to determining that the determined association relationship between the user's partial human bodies matches the association relationship between the user's human bodies, generate trajectory information of the user displayed in the target video frame sequence.

In this embodiment, in response to determining that the determined association relationship between the user's partial human bodies matches the association relationship between the user's human bodies, the execution body may generate the trajectory information of the user displayed in the target video frame sequence in various ways. The association relationships matching usually means that the user indicated by the determined association relationship between the partial human bodies is consistent with the user indicated by the association relationship between the human bodies.

In some optional implementations of this embodiment, the execution body may also determine whether the user's partial human body and the user's human body correspond to the same user through person re-identification (ReID) technology.

Continuing to refer to Fig. 3a, Fig. 3a is a schematic diagram of an application scenario of the method for generating information according to an embodiment of the present disclosure. In the application scenario of Fig. 3a, the camera 301 may send to the control terminal 302 a user position information set 303 generated from the captured video. The user position information set 303 is shown in detail in Fig. 3b. The images 3031 and 3032 represent the n-th and (n+1)-th images in the target video frame sequence, respectively. In image 3031, detection boxes A and B represent the users' body positions, and detection boxes a and b represent the users' head positions. In image 3032, detection boxes A' and B' represent the users' body positions, and detection boxes a' and b' represent the users' head positions. The user position information in the user position information set 303 includes information (for example, coordinates) characterizing the positions of the detection boxes A, B, a, b, A', B', a' and b'. Then, continuing with Fig. 3a, the control terminal 302 may determine the degrees of association 304 between the human bodies indicated by the box pairs A-A', A-B', B-A' and B-B', and between the heads indicated by the box pairs a-a', a-b', b-a' and b-b'. Next, based on the determined degrees of association 304, the control terminal 302 may determine that the human bodies indicated by boxes A and A', and by B and B', have association relationships, and that the heads indicated by boxes a and a', and by b and b', have association relationships 305. Then, in response to determining that the users indicated by boxes A and a and by A' and a' are consistent (for example, user x), the control terminal 302 may generate trajectory information of user x; in response to determining that the users indicated by boxes B and b and by B' and b' are consistent (for example, user y), the control terminal 302 may generate trajectory information 306 of user y. Optionally, the control terminal 302 may also display the generated user trajectory information 306 on a display screen.

At present, one prior-art approach is usually to construct a neural network with a complex structure (for example, a spatial attention map or a recurrent neural network) to improve the tracking accuracy for users, which requires hardware of relatively high performance (for example, high-end GPUs). In contrast, the method provided by the above embodiments of the present disclosure decomposes the tracking of a target user into the detection of human body position information and partial human body position information, associates the position information, and generates the trajectory information of the target user according to the match between the association results. This greatly reduces the complexity of the detection and tracking model, making the method especially suitable for end devices and embedded devices such as those in unmanned retail stores and monitored areas. Moreover, since there is no need to transmit images and other information to a background server, network transmission traffic is also saved.

Further referring to Fig. 4, a flow 400 of another embodiment of the method for generating information is shown. The flow 400 of the method for generating information includes the following steps:

Step 401: Acquire a user position information set corresponding to a target video frame sequence.

In this embodiment, the partial human body indicated by the partial human body position information may include a head determined based on head-shoulder keypoints. Various existing methods may be used to determine the head based on the head-shoulder keypoints, and the details are not repeated here.

Step 402: Determine, according to the human body position information and the partial human body position information, the degree of association between the users' human bodies and the degree of association between the partial human bodies displayed in the target video frame sequence, respectively.

Step 403: Determine, according to the determined degree of association between the users' human bodies and degree of association between the partial human bodies, the association relationship between the users' human bodies and the association relationship between the partial human bodies displayed in the target video frame sequence.

Steps 401, 402 and 403 are respectively consistent with steps 201, 202 and 203 of the foregoing embodiment and their optional implementations; the above descriptions of steps 201, 202 and 203 also apply to steps 401, 402 and 403, and are not repeated here.

Step 404: Determine the Intersection over Union between the human body region indicated by the human body position information and the partial human body region indicated by the partial human body position information in the target video frames of the target video frame sequence.

In this embodiment, the execution body of the method for generating information (for example, the terminal device 102 shown in Fig. 1) may determine, in various ways, the Intersection over Union between the human body region indicated by the human body position information and the partial human body region indicated by the partial human body position information in a target video frame of the target video frame sequence. Specifically, for a target video frame of the target video frame sequence, the execution body may determine the Intersection over Union between the human body region indicated by the human body position information of that frame and the intersecting head region indicated by the partial human body position information. Thus, each target video frame in the target video frame sequence usually corresponds to at least one Intersection over Union value.
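
The Intersection over Union of a body box and a head box is the standard box IoU; a minimal sketch, with boxes given as (x_min, y_min, x_max, y_max):

```python
def iou(box_a, box_b) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```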

Step 405: Determine the distance between the positions indicated by human body position information and partial human body position information whose Intersection over Union satisfies a preset condition.

In this embodiment, for the at least one Intersection over Union value corresponding to each target video frame of the target video frame sequence, the execution body may first select, from these values, the ones satisfying a preset condition. The preset condition may include being greater than a preset Intersection over Union threshold, or belonging to the top n values when sorted in descending order. Then, the execution body may determine the distance between the head position and the human body position corresponding to each selected value. The distance between the head position and the human body position may be determined in various ways. As an example, the distance may include the difference in height between the tops of the head detection box and the human body detection box. As another example, the distance may include the distance between the center positions of the head detection box and the human body detection box.

Step 406: In response to determining that the determined distance satisfies a preset distance condition, generate an association relationship between the human body position information and the partial human body position information.

In this embodiment, in response to determining that the distance determined in step 405 satisfies a preset distance condition, the execution body may generate information characterizing an association relationship between the human body position information and the partial human body position information indicated by that distance. Satisfying the preset distance condition may include being smaller than a preset distance threshold, or being the smallest distance. It may also include the ratio corresponding to the distance being smaller than a preset ratio threshold, where the ratio corresponding to the distance may include, for example, the ratio of the height difference between the tops of the head detection box and the human body detection box to the height of the human body detection box. This reduces the deviation caused by the inconsistent sizes of the human figures displayed in the image.
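
A sketch of the ratio-based distance condition described above; the threshold `max_ratio` is an illustrative assumption, boxes are (x_min, y_min, x_max, y_max), and y grows downward so y_min is the top edge:

```python
def head_body_associated(head_box, body_box, max_ratio: float = 0.1) -> bool:
    """Associate a head box with a body box when the gap between their top
    edges, normalized by the body-box height, is small enough. Normalizing
    by body height compensates for figures of different sizes in the image."""
    top_gap = abs(head_box[1] - body_box[1])     # top height difference
    body_height = body_box[3] - body_box[1]
    return body_height > 0 and top_gap / body_height < max_ratio
```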

Step 407, generating trajectory information of the user displayed in the target video frame sequence according to the generated association relationships.

In this embodiment, according to the association relationships generated in step 406, the executing body may determine the positions of the users indicated by those association relationships. Then, based on the positions of the same user indicated by sequentially adjacent target video frames in the target video frame sequence, the executing body may generate the trajectory of that user.
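A minimal sketch of this trajectory-building step is given below; the per-frame input format (a mapping from a resolved user identifier to that user's position) is an assumption, since the embodiment does not fix a data structure:

```python
from collections import defaultdict

def build_trajectories(frames):
    # `frames` is an ordered list, one entry per target video frame, each a
    # dict mapping a user identifier (resolved through the generated
    # association relationships) to that user's position in the frame.
    # Returns: user id -> ordered list of (frame_index, position).
    trajectories = defaultdict(list)
    for frame_index, users in enumerate(frames):
        for user_id, position in users.items():
            trajectories[user_id].append((frame_index, position))
    return dict(trajectories)
```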

As can be seen from FIG. 4, compared with the embodiment corresponding to FIG. 2, the flow 400 of the method for generating information in this embodiment refines the step of determining the association relationship between the human body position information and the partial human body position information according to the user's head position and human body position. Accordingly, in the solution described in this embodiment, the human body detection results and the head detection results complement each other, which both improves detection accuracy and reduces the missed-detection rate.

With further reference to FIG. 5, as an implementation of the methods shown in the above figures, the present disclosure provides an embodiment of an apparatus for generating information. This apparatus embodiment corresponds to the method embodiments shown in FIG. 2 or FIG. 4, and the apparatus may be applied to various electronic devices.

As shown in FIG. 5, the apparatus 500 for generating information provided in this embodiment includes an acquiring unit 501, a first determining unit 502, a second determining unit 503 and a generating unit 504. The acquiring unit 501 is configured to acquire a user position information set corresponding to a target video frame sequence, where the user position information in the set is used to represent the positions of the users displayed in the target video frames of the target video frame sequence and includes human body position information and partial human body position information. The first determining unit 502 is configured to determine, according to the human body position information and the partial human body position information, the degree of association between the users' human bodies and the degree of association between the partial human bodies displayed in the target video frame sequence, respectively. The second determining unit 503 is configured to determine, according to the determined degrees of association, the association relationships between the users' human bodies and between the partial human bodies displayed in the target video frame sequence. The generating unit 504 is configured to generate trajectory information of the users displayed in the target video frame sequence in response to determining that the determined association relationships between the users' partial human bodies match the association relationships between the users' human bodies.

In this embodiment, for the specific processing of the acquiring unit 501, the first determining unit 502, the second determining unit 503 and the generating unit 504 in the apparatus 500 for generating information, and the technical effects brought about by them, reference may be made to the descriptions of step 201, step 202, step 203 and step 204 in the embodiment corresponding to FIG. 2, which are not repeated here.

In some optional implementations of this embodiment, the acquiring unit 501 may include an acquiring module (not shown) and a first generating module (not shown). The acquiring module may be configured to acquire the target video frame sequence. The first generating module may be configured to input the target video frames of the target video frame sequence into a pre-trained user position detection model to obtain the user position information corresponding to each target video frame, where the user position detection model may be used to characterize the correspondence between target video frames and user position information.
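A minimal sketch of this acquisition flow is given below; position_model stands in for the pre-trained user position detection model, whose architecture the embodiment does not specify, and the returned (body_box, head_box) pairing is an assumed interface for illustration:

```python
def acquire_user_positions(target_frames, position_model):
    # Run the pre-trained user position detection model on each target frame.
    # Each call is assumed to return a list of (body_box, head_box) pairs,
    # i.e. the human body position information and the partial human body
    # position information for every user displayed in that frame.
    return [position_model(frame) for frame in target_frames]
```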

In some optional implementations of this embodiment, the first determining unit 502 may include an extracting module (not shown), a second generating module (not shown) and a first determining module (not shown). The extracting module may be configured to extract, from the target video frame sequence, the images of the users' human bodies indicated by the human body position information as user body images. The second generating module may be configured to input the extracted user body images into a pre-trained user feature extraction model to obtain the user features corresponding to the user body images, where the user feature extraction model may be used to characterize the correspondence between user body images and user features. The first determining module may be configured to determine the degree of association between the users' human bodies displayed in the target video frame sequence according to the distance between the obtained user features and the user features included in the trajectories obtained by trajectory prediction.
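As a sketch of how such a feature-distance-based association degree might be computed, cosine similarity is used below; the embodiment only specifies that the degree is based on a distance between user features, so the choice of cosine similarity is an assumption:

```python
import numpy as np

def association_degree(detection_feature, track_feature):
    # Cosine similarity between the feature of a detected body image and the
    # feature carried by an existing predicted trajectory; a higher value
    # indicates the detection more likely belongs to the same user.
    a = np.asarray(detection_feature, dtype=np.float64)
    b = np.asarray(track_feature, dtype=np.float64)
    denom = np.linalg.norm(a) * np.linalg.norm(b) + 1e-12
    return float(a @ b / denom)
```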

In some optional implementations of this embodiment, the user feature extraction model may be obtained through the following steps: acquiring a training sample set, where a training sample may include a sample user body image and sample annotation information corresponding to the sample user body image, and the sample annotation information may be used to identify the user; then taking the sample user body images of the training samples in the training sample set as input, taking user features matching the sample annotation information corresponding to the input sample user body images as the expected output, and training to obtain the user feature extraction model. The user features matching the sample annotation information corresponding to an input training sample are generally consistent with the user identified by that sample annotation information.
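One common way to realize such training, sketched below under explicit assumptions, is to treat the identity labels as classification targets and use the model's embedding as the user feature; the PyTorch framework, the model.feature_dim attribute and all hyperparameter values are illustrative choices, not part of the embodiment:

```python
import torch
import torch.nn as nn

def train_user_feature_extractor(model, data_loader, num_user_ids, epochs=10):
    # `model` maps a batch of sample user body images to feature vectors of
    # size model.feature_dim (an assumed attribute). A linear classifier over
    # the user identity labels supervises the features, so that features of
    # images annotated with the same user end up matching.
    classifier = nn.Linear(model.feature_dim, num_user_ids)
    params = list(model.parameters()) + list(classifier.parameters())
    optimizer = torch.optim.Adam(params, lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for images, user_id_labels in data_loader:
            features = model(images)          # (batch, feature_dim)
            logits = classifier(features)     # (batch, num_user_ids)
            loss = loss_fn(logits, user_id_labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```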

In some optional implementations of this embodiment, the partial human body indicated by the partial human body position information may include a head determined based on head-shoulder key points. The generating unit 504 may include a second determining module (not shown), a third determining module (not shown), a third generating module (not shown) and a fourth generating module (not shown). The second determining module may be configured to determine the IoU ratio between the human body region indicated by the human body position information and the head region indicated by the partial human body position information in the target video frames of the target video frame sequence. The third determining module may be configured to determine the distance between the positions indicated by the human body position information and the partial human body position information whose IoU ratio satisfies a preset condition. The third generating module may be configured to generate an association relationship between the human body position information and the partial human body position information in response to determining that the determined distance satisfies a preset distance condition. The fourth generating module may be configured to generate the trajectory information of the user displayed in the target video frame sequence according to the generated association relationships.

In the apparatus provided by the above embodiment of the present disclosure, the acquiring unit 501 first acquires a user position information set corresponding to a target video frame sequence, where the user position information in the set is used to represent the positions of the users displayed in the target video frames of the target video frame sequence and includes human body position information and partial human body position information. Then, according to the human body position information and the partial human body position information, the first determining unit 502 determines the degree of association between the users' human bodies and the degree of association between the partial human bodies displayed in the target video frame sequence, respectively. Next, according to the determined degrees of association, the second determining unit 503 determines the association relationships between the users' human bodies and between the partial human bodies displayed in the target video frame sequence. Finally, in response to determining that the determined association relationships between the users' partial human bodies match the association relationships between the users' human bodies, the generating unit 504 generates trajectory information of the users displayed in the target video frame sequence. This greatly reduces the complexity of the detection and tracking model, making the solution especially suitable for edge and embedded devices such as those in unmanned retail stores and monitored areas, and thereby also saving network transmission traffic.

Referring now to FIG. 6, a schematic structural diagram of an electronic device (for example, the terminal device 102 in FIG. 1) 600 suitable for implementing embodiments of the present disclosure is shown. The terminal device in embodiments of the present disclosure may include, but is not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants) and PADs (tablet computers), as well as fixed terminals such as digital TVs and desktop computers. The terminal device shown in FIG. 6 is merely an example, and should not impose any limitation on the functions and scope of use of the embodiments of the present disclosure.

As shown in FIG. 6, the electronic device 600 may include a processing device (for example, a central processing unit, a graphics processing unit, etc.) 601, which may execute various appropriate actions and processes according to a program stored in a read-only memory (ROM) 602 or a program loaded from a storage device 608 into a random access memory (RAM) 603. The RAM 603 also stores various programs and data required for the operation of the electronic device 600. The processing device 601, the ROM 602 and the RAM 603 are connected to one another through a bus 604. An input/output (I/O) interface 605 is also connected to the bus 604.

Generally, the following devices may be connected to the I/O interface 605: input devices 606 including, for example, a touch screen, a touch pad, a keyboard, a mouse, a camera, a microphone, an accelerometer and a gyroscope; output devices 607 including, for example, a liquid crystal display (LCD), a speaker and a vibrator; storage devices 608 including, for example, a magnetic tape and a hard disk; and a communication device 609. The communication device 609 may allow the electronic device 600 to communicate wirelessly or by wire with other devices to exchange data. Although FIG. 6 shows the electronic device 600 as having various devices, it should be understood that it is not required to implement or possess all of the devices shown; more or fewer devices may alternatively be implemented or provided. Each block shown in FIG. 6 may represent one device, or may represent multiple devices as needed.

In particular, according to embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, an embodiment of the present disclosure includes a computer program product, which includes a computer program carried on a computer-readable medium, the computer program containing program code for executing the method shown in the flowcharts. In such an embodiment, the computer program may be downloaded and installed from a network via the communication device 609, or installed from the storage device 608, or installed from the ROM 602. When the computer program is executed by the processing device 601, the above-described functions defined in the methods of the embodiments of the present disclosure are executed.

It should be noted that the computer-readable medium described in the embodiments of the present disclosure may be a computer-readable signal medium or a computer-readable storage medium, or any combination of the two. A computer-readable storage medium may be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared or semiconductor system, apparatus or device, or any combination of the above. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection with one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In the embodiments of the present disclosure, a computer-readable storage medium may be any tangible medium that contains or stores a program, which may be used by or in combination with an instruction execution system, apparatus or device. In the embodiments of the present disclosure, a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, in which computer-readable program code is carried. Such a propagated data signal may take a variety of forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium, and may send, propagate or transmit a program for use by or in combination with an instruction execution system, apparatus or device. The program code contained on the computer-readable medium may be transmitted by any appropriate medium, including but not limited to a wire, an optical cable, RF (radio frequency), etc., or any suitable combination of the above.

The above computer-readable medium may be included in the above electronic device, or may exist separately without being assembled into the electronic device. The computer-readable medium carries one or more programs, and when the one or more programs are executed by the electronic device, the electronic device is caused to: acquire a user position information set corresponding to a target video frame sequence, where the user position information in the set is used to represent the positions of the users displayed in the target video frames of the target video frame sequence and includes human body position information and partial human body position information; determine, according to the human body position information and the partial human body position information, the degree of association between the users' human bodies and the degree of association between the partial human bodies displayed in the target video frame sequence, respectively; determine, according to the determined degrees of association, the association relationships between the users' human bodies and between the partial human bodies displayed in the target video frame sequence; and in response to determining that the determined association relationships between the users' partial human bodies match the association relationships between the users' human bodies, generate trajectory information of the users displayed in the target video frame sequence.

The computer program code for executing the operations of the embodiments of the present disclosure may be written in one or more programming languages or a combination thereof, including object-oriented programming languages such as Java, Smalltalk and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code may be executed entirely on a user's computer, partly on a user's computer, as a stand-alone software package, partly on a user's computer and partly on a remote computer, or entirely on a remote computer or server. In cases involving a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).

The flowcharts and block diagrams in the drawings illustrate the architecture, functions and operations of possible implementations of the systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in a flowchart or block diagram may represent a module, a program segment, or a portion of code, which contains one or more executable instructions for implementing the specified logical functions. It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur in an order different from that noted in the drawings. For example, two blocks shown in succession may in fact be executed substantially in parallel, or they may sometimes be executed in the reverse order, depending on the functions involved. It should also be noted that each block of the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, may be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.

The units described in the embodiments of the present disclosure may be implemented by software or by hardware. The described units may also be provided in a processor; for example, a processor may be described as including an acquiring unit, a first determining unit, a second determining unit and a generating unit. The names of these units do not, in some cases, constitute a limitation on the units themselves; for example, the acquiring unit may also be described as "a unit that acquires a user position information set corresponding to a target video frame sequence".

The above description is merely a preferred embodiment of the present disclosure and an illustration of the technical principles employed. Those skilled in the art should understand that the scope of the invention involved in the embodiments of the present disclosure is not limited to technical solutions formed by the specific combination of the above technical features, and should also cover other technical solutions formed by any combination of the above technical features or their equivalent features without departing from the above inventive concept, for example, technical solutions formed by replacing the above features with technical features having similar functions disclosed in (but not limited to) the embodiments of the present disclosure.

Claims (12)

respectively determining the association degree between human bodies of the users and the association degree between local human bodies displayed in the target video frame sequence according to the human body position information and the local human body position information, wherein the association degree between the human bodies of the users is generated based on the distance between user features, and the distance between the user features comprises the distance between the user features extracted by the user feature extraction model and the user features included in the trajectory obtained by trajectory prediction; the association degree between the local human bodies represents the association degree between a detection result and a prediction result, the detection result represents the local human bodies displayed in the target video frame sequence, and the prediction result represents the track of the local human body positions obtained by track prediction based on the target video frame sequence;
A first determining unit configured to determine, according to the human body position information and the local human body position information, a degree of association between human bodies of the users and a degree of association between local human bodies displayed in the target video frame sequence, respectively, wherein the degree of association between human bodies of the users is generated based on a distance between user features, the distance including a distance between user features extracted by a user feature extraction model and user features included in a trajectory obtained by trajectory prediction; the association degree between the local human bodies represents the association degree between a detection result and a prediction result, the detection result represents the local human bodies displayed in the target video frame sequence, and the prediction result represents the track of the local human body positions obtained by track prediction based on the target video frame sequence.


