Disclosure of Invention
The embodiment of the application provides a method and a device for generating a human body key point detection model.
In a first aspect, an embodiment of the present application provides a method for generating a human body key point detection model, including: acquiring a sample set, wherein samples in the sample set comprise sample human body images and annotation information of key points in the sample human body images; selecting samples from the sample set, and performing the following training steps: inputting a sample human body image of the selected sample into an initial first model to obtain a feature map of a pyramid structure; determining a first-layer loss value based on the feature map and the annotation information of the key points in the sample human body image; inputting the feature map into an initial second model to obtain the position coordinates of the detected key points; determining a second-layer loss value based on the position coordinates of the detected key points and the annotation information of the key points in the sample human body image; determining, based on the first-layer loss value and the second-layer loss value, whether training of the initial first model and the initial second model is complete; and in response to determining that the training of the initial first model and the initial second model is complete, determining the initial first model and the initial second model as the human body key point detection model.
In some embodiments, inputting the sample human body image of the selected sample into the initial first model to obtain the feature map of the pyramid structure includes: inputting the sample human body image of the selected sample into a residual network to obtain the feature map output by the last convolutional layer of each residual block; and passing the feature maps output by these convolutional layers through respective fully convolutional layers, and then obtaining the feature map of the pyramid structure through upsampling and lateral connections.
In some embodiments, determining the first-layer loss value based on the feature map and the annotation information of the key points in the sample human body image includes: generating a real heatmap for each key point according to the annotation information of the key points in the sample human body image; generating a predetermined number of first predicted heatmaps according to the feature map, wherein each first predicted heatmap corresponds to one key point; and determining the first-layer loss value based on the positional deviation of each key point between the real heatmap and the first predicted heatmap.
In some embodiments, inputting the feature map into the initial second model to obtain the position coordinates of the detected key points includes: generating an attention feature map according to the feature map; generating a predetermined number of second predicted heatmaps according to the attention feature map, wherein each second predicted heatmap corresponds to one key point; and, for each of the predetermined number of second predicted heatmaps, detecting the position coordinates of the corresponding key point according to the position of the maximum-probability pixel in that second predicted heatmap.
In some embodiments, generating the attention feature map according to the feature map includes: passing the feature maps through different numbers of bottleneck blocks to obtain feature maps of different scales; upsampling the feature maps of different scales and fusing them together to obtain a first feature map; inputting the feature maps of different scales into an attention model to obtain first attention maps of different resolutions; upsampling the first attention maps of different resolutions and fusing them together to obtain a fused first attention map, and combining the fused first attention map with the first feature map to obtain a second feature map; inputting the second feature map into the attention model to obtain a second attention map; and combining the second attention map with the second feature map to obtain the attention feature map.
In some embodiments, determining the second-layer loss value based on the position coordinates of the detected key points and the annotation information of the key points in the sample human body image includes: generating a real heatmap for each key point according to the annotation information of the key points in the sample human body image; generating a predetermined number of second predicted heatmaps according to the attention feature map, wherein each second predicted heatmap corresponds to one key point; and determining the second-layer loss value based on the positional deviation of each key point between the real heatmap and the second predicted heatmap.
In some embodiments, the method further includes: in response to determining that the training of the initial first model and the initial second model is not complete, adjusting relevant parameters in the initial first model and the initial second model, reselecting a sample from the sample set, and continuing to perform the training steps using the adjusted initial first model and the adjusted initial second model.
In a second aspect, an embodiment of the present application provides a method for detecting human body key points, including: acquiring a human body image of a detection object; and inputting the human body image into the human body key point detection model generated by the method according to the first aspect to generate the position coordinates of the human body key points of the detection object.
In a third aspect, an embodiment of the present application provides an apparatus for generating a human body key point detection model, including: an acquisition unit configured to acquire a sample set, wherein samples in the sample set comprise a sample human body image and annotation information of key points in the sample human body image; and a training unit configured to select samples from the sample set and to perform the following training steps: inputting a sample human body image of the selected sample into an initial first model to obtain a feature map of a pyramid structure; determining a first-layer loss value based on the feature map and the annotation information of the key points in the sample human body image; inputting the feature map into an initial second model to obtain the position coordinates of the detected key points; determining a second-layer loss value based on the position coordinates of the detected key points and the annotation information of the key points in the sample human body image; determining, based on the first-layer loss value and the second-layer loss value, whether training of the initial first model and the initial second model is complete; and in response to determining that the training of the initial first model and the initial second model is complete, determining the initial first model and the initial second model as the human body key point detection model.
In some embodiments, the training unit is further configured to: input the sample human body image of the selected sample into a residual network to obtain the feature map output by the last convolutional layer of each residual block; and pass the feature maps output by these convolutional layers through respective fully convolutional layers, and then obtain the feature map of the pyramid structure through upsampling and lateral connections.
In some embodiments, the training unit is further configured to: generate a real heatmap for each key point according to the annotation information of the key points in the sample human body image; generate a predetermined number of first predicted heatmaps according to the feature map, wherein each first predicted heatmap corresponds to one key point; and determine the first-layer loss value based on the positional deviation of each key point between the real heatmap and the first predicted heatmap.
In some embodiments, the training unit is further configured to: generate an attention feature map according to the feature map; generate a predetermined number of second predicted heatmaps according to the attention feature map, wherein each second predicted heatmap corresponds to one key point; and, for each of the predetermined number of second predicted heatmaps, detect the position coordinates of the corresponding key point according to the position of the maximum-probability pixel in that second predicted heatmap.
In some embodiments, the training unit is further configured to: pass the feature maps through different numbers of bottleneck blocks to obtain feature maps of different scales; upsample the feature maps of different scales and fuse them together to obtain a first feature map; input the feature maps of different scales into an attention model to obtain first attention maps of different resolutions; upsample the first attention maps of different resolutions and fuse them together to obtain a fused first attention map, and combine the fused first attention map with the first feature map to obtain a second feature map; input the second feature map into the attention model to obtain a second attention map; and combine the second attention map with the second feature map to obtain the attention feature map.
In some embodiments, the training unit is further configured to: generate a real heatmap for each key point according to the annotation information of the key points in the sample human body image; generate a predetermined number of second predicted heatmaps according to the attention feature map, wherein each second predicted heatmap corresponds to one key point; and determine the second-layer loss value based on the positional deviation of each key point between the real heatmap and the second predicted heatmap.
In some embodiments, the apparatus further comprises an adjustment unit configured to: in response to determining that the training of the initial first model and the initial second model is not complete, adjust relevant parameters in the initial first model and the initial second model, reselect a sample from the sample set, and continue to perform the training steps using the adjusted initial first model and the adjusted initial second model.
In a fourth aspect, an embodiment of the present application provides an apparatus for detecting key points of a human body, including: a detection unit configured to acquire a human body image of a detection object; a generating unit configured to input the human body image into the human body key point detection model generated by the method according to any one of the first aspect, and generate position coordinates of the human body key points of the detection object.
In a fifth aspect, an embodiment of the present application provides an electronic device, including: one or more processors; and a storage means for storing one or more programs, wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method according to any one of the first aspect.
In a sixth aspect, the present application provides a computer-readable medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the method according to any one of the first aspect.
According to the method and the device for generating the human body key point detection model provided by the embodiments of the present application, a pyramid model and an attention model are fused to generate the human body key point detection model. In a convolutional neural network, different depths correspond to semantic features at different levels: shallow layers have high resolution and capture detail features, while deep layers have low resolution and capture semantic features. The cascade model connects two or more neural networks in series so as to obtain more context information. By changing the network connections, the multi-scale problem in object detection can be addressed without substantially increasing the computation of the original model. The attention model measures the importance of different features to the current task by computing feature weights, thereby emphasizing important features and weakening unimportant ones. Together, these techniques help the network detect occluded or hidden human body key points more accurately.
Detailed Description
The present application will be described in further detail with reference to the following drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings.
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.
Fig. 1 illustrates an exemplary system architecture 100 to which a method of generating a human keypoint detection model, an apparatus for generating a human keypoint detection model, a method of detecting human keypoints, or an apparatus for detecting human keypoints according to an embodiment of the present application may be applied.
As shown in Fig. 1, the system architecture 100 may include terminals 101, 102, a network 103, a database server 104, and a server 105. The network 103 serves as a medium for providing communication links between the terminals 101, 102, the database server 104 and the server 105. The network 103 may include various connection types, such as wired or wireless communication links, or fiber optic cables.
The user 110 may use the terminals 101, 102 to interact with the server 105 over the network 103 to receive or send messages and the like. The terminals 101 and 102 may have various client applications installed thereon, such as a model training application, a human key point detection and recognition application, a shopping application, a payment application, a web browser, an instant messenger, and the like.
Here, the terminals 101 and 102 may be hardware or software. When the terminals 101 and 102 are hardware, they may be various electronic devices with display screens, including but not limited to smart phones, tablet computers, e-book readers, MP3 (Moving Picture Experts Group Audio Layer III) players, laptop computers, desktop computers, and the like. When the terminals 101 and 102 are software, they can be installed in the electronic devices listed above. They may be implemented as multiple pieces of software or software modules (e.g., to provide distributed services) or as a single piece of software or software module. This is not specifically limited herein.
When the terminals 101, 102 are hardware, an image acquisition device may further be mounted thereon. The image acquisition device can be any device capable of acquiring images, such as a camera or a sensor. The user 110 may use the image acquisition device on the terminals 101, 102 to capture a human body image of himself or herself, or of another person.
The database server 104 may be a database server that provides various services. For example, the database server may have a sample set stored therein. The sample set contains a large number of samples. A sample can include a sample human body image and annotation information of key points in the sample human body image. In this way, the user 110 may also select samples from the sample set stored by the database server 104 via the terminals 101, 102.
The server 105 may also be a server providing various services, such as a background server providing support for various applications displayed on the terminals 101, 102. The background server may train the initial model by using samples in the sample set sent by the terminals 101 and 102, and may send a training result (e.g., the generated human body key point detection model) to the terminals 101 and 102. In this way, the user can apply the generated human body key point detection model to perform human body key point detection.
Here, the database server 104 and the server 105 may be hardware or software. When they are hardware, they can be implemented as a distributed server cluster composed of a plurality of servers, or as a single server. When they are software, they may be implemented as multiple pieces of software or software modules (e.g., to provide distributed services), or as a single piece of software or software module. This is not specifically limited herein.
It should be noted that the method for generating a human body key point detection model or the method for detecting human body key points provided by the embodiments of the present application is generally performed by the server 105. Accordingly, the apparatus for generating a human body key point detection model or the apparatus for detecting human body key points is generally also provided in the server 105.
It is noted that the database server 104 may be omitted from the system architecture 100 when the server 105 can perform the relevant functions of the database server 104.
It should be understood that the number of terminals, networks, database servers, and servers in fig. 1 are merely illustrative. There may be any number of terminals, networks, database servers, and servers, as desired for implementation.
With continued reference to Fig. 2, a flow 200 of one embodiment of a method of generating a human keypoint detection model according to the present application is shown. The method for generating the human body key point detection model can comprise the following steps:
Step 201, a sample set is obtained.
In this embodiment, the execution subject of the method of generating a human body keypoint detection model (e.g., the server 105 shown in Fig. 1) may obtain the sample set in a variety of ways. For example, the executing entity may obtain an existing sample set stored in a database server (e.g., the database server 104 shown in Fig. 1) via a wired connection or a wireless connection. As another example, a user may collect samples via a terminal (e.g., the terminals 101, 102 shown in Fig. 1). In this way, the executing entity may receive the samples collected by the terminal and store them locally, thereby generating a sample set.
Here, the sample set may include at least one sample. The sample can include a sample human body image and annotation information associated with key points in the sample human body image.
Optionally, data augmentation may be applied to the training samples, including rotation, resizing, cropping, flipping, changing the light intensity, and the like, to obtain augmented training data and improve the generalization of the model. Experiments show that the size of the input picture influences the accuracy of key point detection: within a certain range, the larger the input picture, the more accurate the positions of the detected key points. Since a person in a picture is generally elongated, the method sets the size of the input picture to 864 x 648 as a trade-off between accuracy and computation. In implementation, the picture is cropped while keeping its aspect ratio unchanged, and its size is changed to 864 x 648 after zero-padding the picture edges. When data augmentation is performed on a picture, the corresponding operations, such as rotation, scaling and flipping, are also applied to the annotated key point coordinates.
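The coordinate bookkeeping described above can be sketched as follows. This is a minimal illustration, not the applicant's implementation: it assumes that 864 is the picture height and 648 the width (the text does not state the order), that the zero padding is split evenly between opposite edges, and that key points are (x, y) pixel pairs. All function names are hypothetical.

```python
def letterbox_params(src_w, src_h, dst_w=648, dst_h=864):
    """Return (scale, pad_x, pad_y) that fit a picture into dst_w x dst_h
    without changing its aspect ratio. Assumes centered zero padding."""
    scale = min(dst_w / src_w, dst_h / src_h)
    new_w, new_h = round(src_w * scale), round(src_h * scale)
    pad_x = (dst_w - new_w) // 2  # zero-padded columns on the left
    pad_y = (dst_h - new_h) // 2  # zero-padded rows on the top
    return scale, pad_x, pad_y


def transform_keypoints(keypoints, scale, pad_x, pad_y):
    """Apply the same resize and padding to annotated (x, y) key points."""
    return [(x * scale + pad_x, y * scale + pad_y) for x, y in keypoints]


def flip_keypoints_horizontally(keypoints, width):
    """Mirror (x, y) key points for a horizontally flipped picture."""
    return [(width - 1 - x, y) for x, y in keypoints]


scale, pad_x, pad_y = letterbox_params(200, 432)
print(transform_keypoints([(100, 216)], scale, pad_x, pad_y))
```

In practice a horizontal flip usually also requires swapping left and right key point labels (for example, left wrist and right wrist); that step is omitted here because the text does not specify the key point set.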
In the present embodiment, the sample human body image generally refers to an image containing a human body. It may be a planar human body image or a stereoscopic human body image (i.e., a human body image containing depth information). The sample human body image may be a color image (e.g., an RGB (Red, Green, Blue) photograph) and/or a grayscale image, etc. The format of the image is not limited in the present application, and may be a format such as JPG (Joint Photographic Experts Group), BMP (Bitmap) or RAW (RAW Image Format), as long as the execution subject can read and recognize it.
Step 202, a sample is selected from the sample set.
In this embodiment, the execution subject may select a sample from the sample set obtained in step 201 and perform the training steps from step 203 to step 208. The selection manner and the number of samples are not limited in the present application. For example, at least one sample may be randomly selected, or a sample whose human body image has better definition (i.e., higher resolution) may be selected.
Step 203, inputting the sample human body image of the selected sample into the initial first model to obtain the feature map of the pyramid structure.
In this embodiment, the execution subject may input the sample human body image of the sample selected in step 202 into the initial first model. By detecting and analyzing the key point regions in the sample human body image, a feature map containing key points can be obtained.
In this embodiment, the initial first model may be an existing variety of neural network models created based on machine learning techniques. The neural network model may have various existing neural network structures (e.g., DenseBox, VGGNet, ResNet, SegNet, etc.). The storage location of the initial model is likewise not limited in this application.
As shown in Fig. 3a, in the first stage of the cascade model, ResNet-101 may be selected as the basic network structure. The feature maps conv2, conv3, conv4 and conv5 output by the last convolutional layer of each residual block are taken out and each passed through a 1 × 1 fully convolutional layer (full convolutional layer 1); the feature map of the feature pyramid structure is then obtained by upsampling and laterally connecting the layers.
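The top-down construction described above can be illustrated with a toy sketch. It makes several simplifying assumptions that are not in the text: each feature map is a single-channel 2-D list, each level is exactly half the size of the previous one, the 1 × 1 convolution is reduced to a per-pixel scaling, upsampling is nearest-neighbour, and the lateral connection is an element-wise addition. All names are hypothetical.

```python
def upsample2x(fmap):
    """Nearest-neighbour 2x upsampling of a 2-D list."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def add_maps(a, b):
    """Element-wise sum of two equally sized 2-D lists."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def lateral(fmap, weight):
    """Stand-in for a 1x1 convolution on one channel: per-pixel scaling."""
    return [[weight * v for v in row] for row in fmap]


def build_pyramid(backbone_maps, weights):
    """backbone_maps are ordered shallow to deep (e.g. conv2 .. conv5).
    Returns pyramid levels in the same order."""
    laterals = [lateral(m, w) for m, w in zip(backbone_maps, weights)]
    pyramid = [laterals[-1]]  # the deepest level is used as-is
    for lat in reversed(laterals[:-1]):
        # upsample the coarser level and merge it with the lateral branch
        pyramid.insert(0, add_maps(lat, upsample2x(pyramid[0])))
    return pyramid


conv4 = [[1.0, 1.0], [1.0, 1.0]]
conv5 = [[2.0]]
print(build_pyramid([conv4, conv5], [1.0, 1.0]))
```

A real implementation would operate on multi-channel tensors in a deep learning framework; the sketch only shows the order of operations (lateral projection, upsampling, merging).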
Step 204, determining a first-layer loss value based on the feature map and the annotation information of the key points in the sample human body image.
In this embodiment, the execution subject may analyze the annotation information of the key points of the sample human body image and the feature map obtained in step 203, so as to determine the first-layer loss value. For example, the feature map and the annotation information of the key points may be input as parameters to a specified loss function, so that a loss value between the two can be calculated.
In this embodiment, the loss function is usually used to measure the degree of inconsistency between the predicted value (e.g. feature map) and the actual value (e.g. annotation information) of the model. It is a non-negative real-valued function. In general, the smaller the loss function, the better the robustness of the model. The loss function may be set according to actual requirements.
In some optional implementations of this embodiment, determining the first-layer loss value based on the feature map and the annotation information of the key points in the sample human body image includes: generating a real heatmap for each key point according to the annotation information of the key points in the sample human body image; generating a predetermined number of first predicted heatmaps according to the feature map, wherein each first predicted heatmap corresponds to one key point; and determining the first-layer loss value based on the positional deviation of each key point between the real heatmap and the first predicted heatmap. Specifically, the feature map of each level output in step 203 is passed sequentially through a 1 × 1 convolutional layer and a 3 × 3 convolutional layer to obtain the detected key point heatmaps at each resolution, and the L2 loss between the detected key point heatmaps and the real key point heatmaps is calculated as the first-layer loss function of the network.
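A small sketch of the two ingredients named above, the real heatmap and the L2 loss, is given below. The text does not say how the real heatmap is rendered from an annotated coordinate; the sketch assumes the common practice of a 2-D Gaussian centered on the key point, with a hypothetical `sigma` parameter, and averages the squared differences over all pixels.

```python
import math


def gaussian_heatmap(height, width, cx, cy, sigma=2.0):
    """Render a real (ground-truth) heatmap for one key point at (cx, cy).
    Assumption: an unnormalized Gaussian whose peak value is 1."""
    return [
        [math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * sigma ** 2))
         for x in range(width)]
        for y in range(height)
    ]


def l2_loss(pred, target):
    """Mean squared difference between two equally sized heatmaps."""
    total = sum(
        (p - t) ** 2
        for row_p, row_t in zip(pred, target)
        for p, t in zip(row_p, row_t)
    )
    return total / (len(pred) * len(pred[0]))


real = gaussian_heatmap(5, 5, cx=2, cy=3)
blank = [[0.0] * 5 for _ in range(5)]
print(l2_loss(blank, real))  # loss of an all-zero prediction
```

With one real heatmap per key point, the first-layer loss would then be accumulated over all key points and over every pyramid resolution.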
Step 205, inputting the feature map into the initial second model to obtain the position coordinates of the detected key points.
In this embodiment, the execution subject may input the feature map generated in step 203 into the initial second model to obtain the position coordinates of the detected key points. The initial second model may be a neural network based on an attention model. Its primary purpose is to extract the attended features from feature maps of different scales while retaining both detail features and semantic features, thereby emphasizing important features and weakening unimportant ones.
In some optional implementations of this embodiment, inputting the feature map into the initial second model to obtain the position coordinates of the detected key points includes: generating an attention feature map according to the feature map; generating a predetermined number of second predicted heatmaps according to the attention feature map, wherein each second predicted heatmap corresponds to one key point; and, for each of the predetermined number of second predicted heatmaps, detecting the position coordinates of the corresponding key point according to the position of the maximum-probability pixel in that second predicted heatmap.
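The last operation, reading a key point position off each predicted heatmap, amounts to an argmax per heatmap. The following sketch assumes each heatmap is a 2-D list indexed as `heatmap[y][x]` and returns (x, y) pairs; the function name is hypothetical.

```python
def decode_keypoints(heatmaps):
    """Return one (x, y) per heatmap: the position of its largest value."""
    coords = []
    for hm in heatmaps:
        best_x, best_y, best_v = 0, 0, hm[0][0]
        for y, row in enumerate(hm):
            for x, v in enumerate(row):
                if v > best_v:
                    best_x, best_y, best_v = x, y, v
        coords.append((best_x, best_y))
    return coords


heatmaps = [
    [[0.0, 0.1], [0.9, 0.2]],  # peak at x=0, y=1
    [[0.3, 0.8], [0.0, 0.0]],  # peak at x=1, y=0
]
print(decode_keypoints(heatmaps))
```

If the heatmaps are smaller than the input picture, the decoded coordinates would still need to be scaled back to picture coordinates; the text does not specify that ratio.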
In some optional implementations of this embodiment, generating the attention feature map according to the feature map includes:
Step 2051, passing the feature maps through different numbers of bottleneck blocks to obtain feature maps of different scales.
In this embodiment, in the second stage of the cascade model, the feature maps of each level output in step 203 are passed through different numbers of bottleneck blocks, so as to obtain feature maps of different scales. More bottleneck blocks are stacked at the deeper levels, whose spatial size is smaller, which achieves a good balance in terms of efficiency.
Step 2052, the feature maps of different scales are upsampled and then fused together to obtain a first feature map.
In this embodiment, the feature maps of different scales obtained in step 2051 are upsampled and then added pixel by pixel (pixel-wise add) to obtain a feature map f_c; see 301 in Fig. 3a.
Step 2053, inputting the feature maps of different scales into the attention model to obtain first attention maps of different resolutions.
In this embodiment, the feature maps of different scales output in step 2051 are passed through an attention model to obtain attention maps of different resolutions; see 302 in Fig. 3a.
Step 2054, the first attention maps of different resolutions are upsampled and then fused together to obtain a fused first attention map, and the fused first attention map is combined with the first feature map to obtain a second feature map.
In this embodiment, the fused first attention map is combined with the feature map f_c obtained in step 2052 to obtain a refined feature map f_AM1; see 303 in Fig. 3a.
Step 2055, the second feature map is input into the attention model to obtain a second attention map.
In this embodiment, f_AM1 is further passed through the attention model to obtain a refined attention map AM_2, i.e., 304 in Fig. 3a.
Step 2056, combining the second attention map and the second feature map to obtain an attention feature map.
In this embodiment, f_AM1 and AM_2 are combined to obtain a refined feature map f_out, i.e., 305 in Fig. 3a. In this step, different resolutions focus on different image features: low resolutions focus on global information, while high resolutions focus more on local details.
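The data flow of steps 2052 to 2056 can be traced with the following toy sketch. It rests on assumptions the text leaves open: the attention model is replaced by a hypothetical per-pixel sigmoid, "fusing" is taken to be pixel-wise addition, "combining" an attention map with a feature map is taken to be element-wise multiplication, and the feature maps from step 2051 are assumed to be single-channel and already upsampled to a common size (so the multi-resolution aspect is not modelled).

```python
import math


def sigmoid_attention(fmap):
    """Hypothetical stand-in for the attention model: per-pixel sigmoid."""
    return [[1.0 / (1.0 + math.exp(-v)) for v in row] for row in fmap]


def add_all(maps):
    """Pixel-wise sum of a list of equally sized 2-D lists."""
    out = [[0.0] * len(maps[0][0]) for _ in maps[0]]
    for m in maps:
        out = [[a + b for a, b in zip(ro, rm)] for ro, rm in zip(out, m)]
    return out


def multiply(a, b):
    """Element-wise product of two equally sized 2-D lists."""
    return [[x * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def attention_feature_map(scale_maps):
    """scale_maps: feature maps from step 2051, assumed already upsampled."""
    f_c = add_all(scale_maps)                                  # step 2052
    am_1 = add_all([sigmoid_attention(m) for m in scale_maps])  # steps 2053-2054
    f_am1 = multiply(f_c, am_1)                                # step 2054
    am_2 = sigmoid_attention(f_am1)                            # step 2055
    return multiply(f_am1, am_2)                               # step 2056: f_out


print(attention_feature_map([[[1.0, 0.0]], [[0.5, -1.0]]]))
```

The sketch is only meant to make the order of the five operations concrete; the actual attention model and the actual combination operator may differ.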
Step 206, determining a second-layer loss value based on the position coordinates of the detected key points and the annotation information of the key points in the sample human body image.
In this embodiment, the execution subject may analyze the annotation information of the key points of the sample human body image and the position coordinates of the key points obtained in step 205, so as to determine the second-layer loss value. For example, the position coordinates of the detected key points and the annotation information of the key points may be input as parameters to a specified loss function, and a loss value between the two may be calculated.
In this embodiment, the loss function is generally used to measure the degree of inconsistency between the predicted value (e.g. the position coordinates of the detected key points) and the actual value (e.g. the annotation information of the key points) of the model. It is a non-negative real-valued function. In general, the smaller the loss function, the better the robustness of the model. The loss function may be set according to actual requirements.
In some optional implementations of this embodiment, determining the second-layer loss value based on the position coordinates of the detected key points and the annotation information of the key points in the sample human body image includes: generating a real heatmap for each key point according to the annotation information of the key points in the sample human body image; generating a predetermined number of second predicted heatmaps according to the attention feature map, wherein each second predicted heatmap corresponds to one key point; and determining the second-layer loss value based on the positional deviation of each key point between the real heatmap and the second predicted heatmap. Specifically, the feature map f_out output in step 2056 is input into full convolutional layer 2 (that is, passed sequentially through 1 × 1 and 3 × 3 convolutional layers) to obtain the heatmaps of the detected key points. The number of heatmaps is the same as the number of key points, each heatmap corresponds to one key point, and the position of the maximum-probability pixel on each heatmap is the position coordinate of the detected key point. The L2 loss between the detected heatmaps and the real heatmaps of the key points is calculated as the second-stage loss function of the network.
Step 207, determining whether the initial first model and the initial second model are trained based on the first layer loss value and the second layer loss value.
In this embodiment, the first-layer loss value and the second-layer loss value are added to obtain the total loss value of the network. In each training iteration, pictures and the corresponding key point annotation data are input, the first-layer loss value and the second-layer loss value are calculated by forward propagation, and their gradients are then calculated to complete the back-propagation of the network and update the parameters. Experiments show that, after a certain number of iterations, the calculation of the loss values can be changed so that only hard-to-detect key points are attended to; that is, only the several key point channels with the larger second-layer loss values are calculated and back-propagated, which achieves a better detection effect on key points that are difficult to detect.
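The loss combination and the hard key point selection described above can be sketched as follows. The sketch assumes that the second-layer loss is available as one value per key point channel and that "the several channels with larger loss values" means the `top_k` largest, where `top_k` is a hypothetical parameter the text does not fix.

```python
def total_loss(first_layer_loss, channel_losses, top_k=None):
    """Add the first-layer loss to the second-layer loss.

    channel_losses holds one second-layer loss per key point channel.
    If top_k is given, only the top_k largest channel losses are kept,
    so that only hard-to-detect key points contribute gradients.
    """
    if top_k is not None:
        channel_losses = sorted(channel_losses, reverse=True)[:top_k]
    return first_layer_loss + sum(channel_losses)


per_channel = [0.5, 0.25, 2.0]
print(total_loss(1.0, per_channel))           # all channels
print(total_loss(1.0, per_channel, top_k=2))  # two hardest channels only
```

In an automatic-differentiation framework the same selection would be applied to loss tensors, so that the discarded channels receive no gradient.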
From the change in the loss value, the execution subject may determine whether the initial models are trained. As an example, if multiple samples are selected in step 202, the execution subject may determine that the training of the initial first model and the initial second model is complete when the total loss value of each sample reaches the target value. As another example, the execution subject may count the proportion of the selected samples whose total loss value reaches the target value. When this proportion reaches a preset sample ratio (e.g., 95%), it can be determined that the training of the initial models is complete.
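The two completion criteria above can be expressed in one small check. This is a sketch under assumptions: "reaching the target value" is interpreted here as the total loss being less than or equal to the target, and the name training_complete and its parameters are hypothetical.

```python
def training_complete(total_losses, target_value, required_ratio=1.0):
    """Decide whether training is complete for the selected samples.

    total_losses   -- total loss value of each selected sample
    target_value   -- loss value a sample must reach (assumed: loss <= target)
    required_ratio -- 1.0 means every sample must reach the target;
                      0.95 corresponds to the preset sample ratio example
    """
    if not total_losses:
        raise ValueError("at least one sample is required")
    reached = sum(1 for loss in total_losses if loss <= target_value)
    return reached / len(total_losses) >= required_ratio
```

Calling training_complete(losses, target) applies the first criterion, and training_complete(losses, target, required_ratio=0.95) applies the second.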
In this embodiment, if the execution subject determines that the training of the initial first model and the initial second model is complete, it may continue to execute step 208. If the execution subject determines that the initial first model and the initial second model are not yet trained, it may adjust the relevant parameters in the two models, for example by using a back propagation technique to modify the weights in each convolutional layer of the initial first model and the weights in each attention model of the initial second model. It may then return to step 202 to re-select samples from the sample set, so that the training steps described above can be continued.
It should be noted that the selection manner is not limited in the present application. For example, in the case where there are a large number of samples in the sample set, the execution subject may select, from the sample set, samples that have not yet been selected.
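The loop formed by steps 202 to 207 (select samples, compute the loss values, check completion, adjust parameters, re-select) can be sketched as follows. This is an illustrative skeleton only: the callables compute_losses and adjust_parameters stand in for the forward pass and the back-propagation update and are hypothetical, as are batch_size, max_rounds and the random selection of not-yet-selected samples.

```python
import random


def train(sample_set, compute_losses, adjust_parameters, target_value,
          batch_size=2, max_rounds=100, seed=0):
    """Repeat select -> evaluate -> adjust until training is complete.

    compute_losses(sample) must return (first_layer_loss, second_layer_loss).
    adjust_parameters(batch) must update the models in place.
    Returns the number of rounds used, or None if max_rounds is exhausted.
    """
    if not sample_set:
        raise ValueError("sample_set must not be empty")
    rng = random.Random(seed)
    unselected = list(sample_set)
    for round_index in range(1, max_rounds + 1):
        if len(unselected) < batch_size:
            # Every sample has been used; start a new pass over the set.
            unselected = list(sample_set)
        rng.shuffle(unselected)
        batch, unselected = unselected[:batch_size], unselected[batch_size:]
        totals = [sum(compute_losses(sample)) for sample in batch]
        if all(total <= target_value for total in totals):
            return round_index  # training complete
        adjust_parameters(batch)
    return None
```

Preferring samples that have not yet been selected, as the paragraph above suggests for large sample sets, is what the unselected list tracks in this sketch.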
Step 208, in response to determining that the training of the initial first model and the initial second model is complete, determining the initial first model and the initial second model as the human body key point detection models.
In this embodiment, if the execution subject determines that the training of the initial first model and the initial second model is completed, the initial first model and the initial second model may be determined as the human body key point detection model.
Optionally, the execution subject may store the generated human body key point detection model locally, or may send it to a terminal or a database server.
According to the method provided by this embodiment of the application, the attention model is added to the cascaded feature pyramid model, which improves the accuracy of detecting human body key points that are hard to detect, occluded, or associated with rare actions. The test result is illustrated in fig. 3b: the scheme can accurately detect 14 key points of the human body, namely the head, the neck, the left and right shoulders, the left and right elbows, the left and right wrists, the left and right hips, the left and right knees and the left and right ankles, including key points that are hard to detect, occluded, or associated with rare actions.
Referring to fig. 4, a flowchart 400 of an embodiment of a method for detecting a human body provided by the present application is shown. The method for detecting a human body may include the steps of:
Step 401, acquiring a human body image of a detection object.
In the present embodiment, an execution subject (e.g., the server 105 shown in fig. 1) of the method for detecting a human body may acquire a human body image of a detection object in various ways. For example, the execution subject may obtain the human body image stored in a database server (e.g., the database server 104 shown in fig. 1) through a wired connection or a wireless connection. As another example, the execution subject may also receive a human body image captured by a terminal (e.g., the terminals 101, 102 shown in fig. 1) or another device.
In the present embodiment, the detection object may be any user, such as a user using a terminal, or another user who appears in the image capturing range. The human body image may likewise be a color image and/or a grayscale image, etc., and the format of the human body image is not limited in this application.
Step 402, inputting the human body image into the human body key point detection model, and generating the position coordinates of the human body key points of the detection object.
In this embodiment, the execution subject may input the human body image acquired in step 401 into the human body key point detection model, thereby generating a human body key point detection result of the detection object. The human body key point detection result may be position information describing the key points of the human body in the image.
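As a sketch of how a detection result could be derived from the model output, the following code turns one heat map per key point into named position coordinates. It assumes the model outputs 14 heat maps as nested lists; the channel order in KEYPOINT_NAMES and the function names peak_position and decode_keypoints are hypothetical and not taken from the embodiment.

```python
# Assumed channel order for the 14 key points named in this application.
KEYPOINT_NAMES = [
    "head", "neck",
    "left_shoulder", "right_shoulder",
    "left_elbow", "right_elbow",
    "left_wrist", "right_wrist",
    "left_hip", "right_hip",
    "left_knee", "right_knee",
    "left_ankle", "right_ankle",
]


def peak_position(heatmap):
    """Return (x, y) of the maximum-probability pixel of one heat map."""
    best, best_value = (0, 0), heatmap[0][0]
    for y, row in enumerate(heatmap):
        for x, value in enumerate(row):
            if value > best_value:
                best, best_value = (x, y), value
    return best


def decode_keypoints(heatmaps):
    """Map each key point name to the peak position of its heat map."""
    if len(heatmaps) != len(KEYPOINT_NAMES):
        raise ValueError("expected one heat map per key point")
    return {name: peak_position(h) for name, h in zip(KEYPOINT_NAMES, heatmaps)}
```

The returned coordinates are in heat map pixels; mapping them back to the resolution of the input image would be an additional step that depends on the scaling used by the model.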
In this embodiment, the human key point detection model may be generated by the method described in the embodiment of fig. 2. For a specific generation process, reference may be made to the related description of the embodiment in fig. 2, which is not described herein again.
It should be noted that the method for detecting a human body in this embodiment may be used to test the human body key point detection model generated in the foregoing embodiments, and the model can then be continuously optimized according to the test result. The method may also be a practical application of the human body key point detection model generated in the foregoing embodiments. Using that model to detect human body key points helps improve detection performance: for example, more human body key points are found and the key point information found is more accurate. In particular, the accuracy of detecting human body key points that are hard to detect, occluded, or associated with rare actions is improved.
With further reference to fig. 5, as an implementation of the method shown in the above figures, the present application provides an embodiment of an apparatus for generating a human body key point detection model, where the apparatus embodiment corresponds to the method embodiment shown in fig. 2, and the apparatus may be applied to various electronic devices.
As shown in fig. 5, the apparatus 500 for generating a human body key point detection model of the present embodiment includes: an acquisition unit 501, a training unit 502 and an adjusting unit 503. The acquisition unit 501 is configured to acquire a sample set, wherein samples in the sample set include a sample human body image and labeling information of key points in the sample human body image. The training unit 502 is configured to select samples from the sample set and to perform the following training steps: inputting the sample human body image of the selected sample into the initial first model to obtain a feature map of the pyramid structure; determining a first-layer loss value based on the feature map and the labeling information of the key points in the sample human body image; inputting the feature map into the initial second model to obtain the position coordinates of the detected key points; determining a second-layer loss value based on the position coordinates of the detected key points and the labeling information of the key points in the sample human body image; determining whether the training of the initial first model and the initial second model is complete based on the first-layer loss value and the second-layer loss value; and in response to determining that the training of the initial first model and the initial second model is complete, determining the initial first model and the initial second model as the human body key point detection model.
In this embodiment, for the specific processing of the acquisition unit 501, the training unit 502 and the adjusting unit 503 of the apparatus 500 for generating a human body key point detection model, reference may be made to steps 201 to 208 in the corresponding embodiment of fig. 2.
In some optional implementations of this embodiment, the training unit 502 is further configured to: input the sample human body image of the selected sample into a residual network to obtain the feature map output by the last convolutional layer of each residual block; and pass the feature maps output by the convolutional layers through respective fully convolutional layers and then, after up-sampling, obtain the feature map of the pyramid structure through lateral connections.
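The up-sampling and lateral connection step can be illustrated with single-channel feature maps stored as nested lists. This is a simplified sketch under assumptions: each level is assumed to have exactly twice the resolution of the next coarser level, nearest-neighbor up-sampling and element-wise addition are assumed for the merge, and the fully convolutional layers that would precede the merge are omitted. The function names are hypothetical.

```python
def upsample2x(fmap):
    """Nearest-neighbor up-sampling: every pixel becomes a 2 x 2 block."""
    out = []
    for row in fmap:
        wide = [value for value in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def add_maps(a, b):
    """Element-wise sum of two same-sized feature maps."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def build_pyramid(features_fine_to_coarse):
    """Merge residual-block outputs from the coarsest level downwards.

    The input is ordered from the finest to the coarsest resolution;
    the returned pyramid uses the same order.
    """
    merged = features_fine_to_coarse[-1]
    pyramid = [merged]
    for lateral in reversed(features_fine_to_coarse[:-1]):
        merged = add_maps(upsample2x(merged), lateral)
        pyramid.insert(0, merged)
    return pyramid
```

Each level of the result therefore carries information from all coarser levels, which is the property the pyramid structure relies on.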
In some optional implementations of this embodiment, the training unit 502 is further configured to: generate a real heat map for each key point according to the labeling information of the key points in the sample human body image; generate a predetermined number of first predicted heat maps according to the feature map, wherein each first predicted heat map corresponds to one key point; and determine the first-layer loss value based on the position deviation of each key point between the real heat map and the first predicted heat map.
In some optional implementations of this embodiment, the training unit 502 is further configured to: generate an attention feature map according to the feature map; generate a predetermined number of second predicted heat maps according to the attention feature map, wherein each second predicted heat map corresponds to one key point; and, for each second predicted heat map of the predetermined number of second predicted heat maps, detect the position coordinates of the corresponding key point according to the position of the maximum-probability pixel in that second predicted heat map.
In some optional implementations of this embodiment, the training unit 502 is further configured to: pass the feature map through different numbers of bottleneck blocks to obtain feature maps of different scales; up-sample the feature maps of different scales and fuse them together to obtain a first feature map; input the feature maps of different scales into an attention model to obtain first attention maps of different resolutions; up-sample the first attention maps of different resolutions and fuse them together to obtain a fused first attention map, and combine the fused first attention map with the first feature map to obtain a second feature map; input the second feature map into the attention model to obtain a second attention map; and combine the second attention map with the second feature map to obtain the attention feature map.
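The data flow of this implementation can be sketched on single-channel, square feature maps stored as nested lists. The sketch makes several assumptions that are not stated in the embodiment: the attention model is replaced by a per-pixel sigmoid, "fusing" is taken to be an element-wise sum, "combining" an attention map with a feature map is taken to be element-wise multiplication, and the bottleneck blocks are omitted, so the multi-scale feature maps are given as input. All function names are hypothetical.

```python
import math


def upsample(fmap, factor):
    """Nearest-neighbor up-sampling by an integer factor."""
    return [
        [value for value in row for _ in range(factor)]
        for row in fmap
        for _ in range(factor)
    ]


def fuse(maps):
    """Element-wise sum of same-sized maps (assumed fusion operation)."""
    return [[sum(values) for values in zip(*rows)] for rows in zip(*maps)]


def sigmoid_attention(fmap):
    """Stand-in for the attention model: one weight in (0, 1) per pixel."""
    return [[1.0 / (1.0 + math.exp(-value)) for value in row] for row in fmap]


def apply_attention(attention, fmap):
    """Assumed combination: element-wise product of attention and features."""
    return [[a * f for a, f in zip(row_a, row_f)]
            for row_a, row_f in zip(attention, fmap)]


def attention_feature_map(multi_scale, target_size):
    """Follow the six operations listed above on square feature maps."""
    up_features = [upsample(f, target_size // len(f)) for f in multi_scale]
    first_feature = fuse(up_features)
    up_attention = [upsample(sigmoid_attention(f), target_size // len(f))
                    for f in multi_scale]
    fused_attention = fuse(up_attention)
    second_feature = apply_attention(fused_attention, first_feature)
    second_attention = sigmoid_attention(second_feature)
    return apply_attention(second_attention, second_feature)
```

The value of the sketch is in the ordering of the operations; a real attention model would be learned rather than a fixed sigmoid.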
In some optional implementations of this embodiment, the training unit 502 is further configured to: generate a real heat map for each key point according to the labeling information of the key points in the sample human body image; generate a predetermined number of second predicted heat maps according to the attention feature map, wherein each second predicted heat map corresponds to one key point; and determine the second-layer loss value based on the position deviation of each key point between the real heat map and the second predicted heat map.
In some optional implementations of this embodiment, the apparatus 500 further includes an adjusting unit 503 configured to: and in response to determining that the initial first model and the initial second model are not trained, adjusting relevant parameters in the initial first model and the initial second model, reselecting the sample from the sample set, and continuing to perform the training step by using the adjusted initial first model and the adjusted initial second model.
With further reference to fig. 6, as an implementation of the methods shown in the above-mentioned figures, the present application provides an embodiment of an apparatus for detecting key points of a human body, which corresponds to the embodiment of the method shown in fig. 4, and which can be applied in various electronic devices.
As shown in fig. 6, the apparatus 600 for detecting key points of a human body according to the present embodiment includes: a detection unit 601 and a generation unit 602. Wherein the detection unit 601 is configured to acquire a human body image of the detection object. The generating unit 602 is configured to input the human body image into a human body key point detection model generated by the method described in the embodiment of fig. 2, and generate the position coordinates of the human body key points of the detection object.
It will be understood that the elements described in the apparatus 600 correspond to various steps in the method described with reference to fig. 4. Thus, the operations, features and resulting advantages described above with respect to the method are also applicable to the apparatus 600 and the units included therein, and are not described herein again.
Referring now to fig. 7, shown is a block diagram of a computer system 700 suitable for use in implementing the electronic device of an embodiment of the present application. The electronic device shown in fig. 7 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present application.
As shown in fig. 7, the computer system 700 includes a Central Processing Unit (CPU) 701, which can perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM) 702 or a program loaded from a storage section 708 into a Random Access Memory (RAM) 703. The RAM 703 also stores various programs and data necessary for the operation of the system 700. The CPU 701, the ROM 702, and the RAM 703 are connected to each other via a bus 704. An input/output (I/O) interface 705 is also connected to the bus 704.
The following components are connected to the I/O interface 705: an input section 706 including a touch panel, a keyboard, a mouse, a camera, and the like; an output section 707 including a display such as a Cathode Ray Tube (CRT) or a Liquid Crystal Display (LCD), and a speaker; a storage section 708 including a hard disk and the like; and a communication section 709 including a network interface card such as a LAN card, a modem, or the like. The communication section 709 performs communication processing via a network such as the internet. A drive 710 is also connected to the I/O interface 705 as needed. A removable medium 711 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 710 as necessary, so that a computer program read out therefrom is installed into the storage section 708 as necessary.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program can be downloaded and installed from a network through the communication section 709, and/or installed from the removable medium 711. The computer program, when executed by the Central Processing Unit (CPU) 701, performs the above-described functions defined in the method of the present application. It should be noted that the computer readable medium of the present application can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present application, a computer readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.
In this application, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present application may be implemented by software or hardware. The described units may also be provided in a processor, and may be described as: a processor includes an acquisition unit and a training unit. As another example, it can also be described as: a processor includes a detection unit and a generation unit. Where the names of these units do not in some cases constitute a limitation of the unit itself, for example, the acquisition unit may also be described as a "unit acquiring a sample set".
As another aspect, the present application also provides a computer-readable medium, which may be contained in the electronic device described in the above embodiments, or may exist separately without being assembled into the electronic device. The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: acquire a sample set, wherein samples in the sample set include sample human body images and labeling information of key points in the sample human body images; select samples from the sample set, and perform the following training steps: inputting the sample human body image of the selected sample into the initial first model to obtain a feature map of the pyramid structure; determining a first-layer loss value based on the feature map and the labeling information of the key points in the sample human body image; inputting the feature map into the initial second model to obtain the position coordinates of the detected key points; determining a second-layer loss value based on the position coordinates of the detected key points and the labeling information of the key points in the sample human body image; determining whether the training of the initial first model and the initial second model is complete based on the first-layer loss value and the second-layer loss value; and in response to determining that the training of the initial first model and the initial second model is complete, determining the initial first model and the initial second model as the human body key point detection model.
Further, the one or more programs, when executed by the electronic device, may further cause the electronic device to: acquiring a human body image of a detection object; and inputting the human body image into the human body key point detection model to generate the position coordinates of the human body key points of the detection object. The human key point detection model may be generated by using the method for generating the human key point detection model described in the above embodiments.
The above description is only a preferred embodiment of the application and is illustrative of the principles of the technology employed. It will be appreciated by a person skilled in the art that the scope of the invention as referred to in the present application is not limited to the embodiments with a specific combination of the above-mentioned features, but also covers other embodiments with any combination of the above-mentioned features or their equivalents without departing from the inventive concept. For example, the above features may be replaced with (but not limited to) features having similar functions disclosed in the present application.