Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete. It should be understood that the drawings and embodiments of the present disclosure are for illustration purposes only and are not intended to limit the scope of the present disclosure.
It should be noted that, for convenience of description, only the portions related to the present invention are shown in the drawings. Embodiments of the present disclosure and features of embodiments may be combined with each other without conflict.
It should be noted that the terms "first," "second," and the like in this disclosure are merely used to distinguish between different devices, modules, or units and are not used to define an order or interdependence of functions performed by the devices, modules, or units.
It should be noted that references to "one" and "a plurality" in this disclosure are intended to be illustrative rather than limiting, and those of ordinary skill in the art will appreciate that they should be understood as "one or more" unless the context clearly indicates otherwise.
The names of messages or information exchanged between the various devices in the embodiments of the present disclosure are for illustrative purposes only and are not intended to limit the scope of such messages or information.
The present disclosure will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
Fig. 1 is a schematic diagram of an application scenario of a keypoint detection method of some embodiments of the present disclosure.
The execution body of the keypoint detection method may be any computing device. The computing device may be hardware or software. When the computing device is hardware, it may be implemented as a distributed cluster formed by a plurality of servers or terminal devices, or as a single server or a single terminal device. When the computing device is software, it may be installed in the hardware devices listed above. It may be implemented as multiple pieces of software or software modules, for example, for providing distributed services, or as a single piece of software or a single software module. No particular limitation is imposed herein.
In the application scenario of fig. 1, the computing device may first input the image 101 to be detected into the feature extraction network 102 for feature extraction. On this basis, the resulting image features may be input into a pre-trained thermodynamic diagram regression network 103 to obtain a keypoint thermodynamic diagram 104. In addition, the computing device may input the output of an intermediate layer (for example, the third layer in fig. 1) of the thermodynamic diagram regression network 103 into a pre-trained coordinate regression network 105 to obtain a set of keypoint coordinates 106. The computing device may then generate keypoint location information 107 corresponding to the image to be detected based on the keypoint thermodynamic diagram 104 and the set of keypoint coordinates 106. For ease of illustration, the keypoint location information 107 may be visually displayed on the image to be detected, as shown at 108.
With continued reference to fig. 2, a flow 200 of some embodiments of a keypoint detection method according to the present disclosure is shown. The key point detection method comprises the following steps:
Step 201, extracting features of the image to be detected to obtain image features.
In some embodiments, the execution body of the keypoint detection method may perform feature extraction on the image to be detected by using various feature extraction algorithms, so as to obtain image features. For example, the image to be detected may be input into a convolutional neural network to obtain the image features. For another example, image feature extraction may also be performed by algorithms such as color histograms, color correlograms, and the like. The image to be detected can be any image. For example, in the context of gesture recognition, the image to be detected may be an image currently captured by a camera. Of course, it may be an image obtained by preprocessing a captured image, or the like. As another example, it may be an image in a gallery specified by the user.
In some optional implementations of some embodiments, before extracting the features of the image to be detected to obtain the image features, the method may further include performing target portion detection on the original image to be detected to obtain an image area displaying the target portion, and scaling the image area to a target size to obtain the image to be detected. In practice, the original image to be detected may display a large amount of content. For example, in a hand keypoint detection scenario, the original image to be detected may display legs, a body, a head, and so on in addition to hands. This other content can interfere with the detection of keypoints of the target portion. Therefore, the target portion can be detected first to obtain the image area displaying the target portion. On this basis, to facilitate unified processing, the image area can be scaled to the target size to obtain the image to be detected.
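The cropping and scaling described above can be sketched as follows. This is a minimal illustration only: the function name `crop_and_scale`, the `(left, top, right, bottom)` box layout, the nearest-neighbour resampling, and the toy image are assumptions introduced here, and the box is assumed to come from some target portion detector that is not shown.

```python
def crop_and_scale(image, box, target_size):
    """Crop `box` = (left, top, right, bottom) from `image` (a list of pixel rows)
    and resize the crop to `target_size` = (width, height) by nearest neighbour."""
    left, top, right, bottom = box
    crop = [row[left:right] for row in image[top:bottom]]
    src_h, src_w = len(crop), len(crop[0])
    dst_w, dst_h = target_size
    return [
        [crop[y * src_h // dst_h][x * src_w // dst_w] for x in range(dst_w)]
        for y in range(dst_h)
    ]


# Toy 4x6 "original image" whose pixel values encode their own row and column.
original = [[10 * r + c for c in range(6)] for r in range(4)]

# Hypothetical detector output: the target portion occupies columns 2-3, rows 1-2.
detected = crop_and_scale(original, box=(2, 1, 4, 3), target_size=(4, 4))
```

Every image to be detected then has the same target size regardless of how large the target portion appeared in the original image.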
In some optional implementations of some embodiments, generating the keypoint location information corresponding to the image to be detected based on the keypoint thermodynamic diagram and the set of keypoint coordinates includes generating the keypoint location information corresponding to the original image to be detected based on the keypoint thermodynamic diagram and the set of keypoint coordinates. In these alternative implementations, since the image to be detected is obtained based on the original image to be detected, the key point position information corresponding to the original image to be detected can be generated as required.
In some optional implementations of some embodiments, mapping the keypoint thermodynamic diagram and the set of keypoint coordinates to the image to be detected, respectively, to obtain a target image containing the set of thermodynamic diagram mapping keypoints and the set of coordinate mapping keypoints, includes mapping the keypoint thermodynamic diagram and the set of keypoint coordinates to the original image to be detected, respectively, to obtain a target image containing the set of thermodynamic diagram mapping keypoints and the set of coordinate mapping keypoints.
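When the image to be detected is a scaled crop of the original image, keypoints predicted on it can be carried back to the original image by undoing the scaling and then the crop offset. The sketch below assumes a hypothetical `(left, top, right, bottom)` crop box and a `(width, height)` target size; neither the name `to_original` nor the numbers come from the disclosure.

```python
def to_original(point, box, target_size):
    """Map (x, y) in the scaled image to be detected back to the original image."""
    left, top, right, bottom = box
    target_w, target_h = target_size
    x, y = point
    return (
        left + x * (right - left) / target_w,   # undo horizontal scaling, then offset
        top + y * (bottom - top) / target_h,    # undo vertical scaling, then offset
    )


box = (100, 50, 300, 250)   # hypothetical hand region in the original image
restored = to_original((64, 128), box, target_size=(256, 256))
```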
Step 202, inputting the image features into a pre-trained thermodynamic diagram regression network to obtain a key point thermodynamic diagram.
In some embodiments, the execution body may input the image features into a pre-trained thermodynamic diagram regression network to obtain the keypoint thermodynamic diagram. The thermodynamic diagram regression network can be used to predict a two-dimensional thermodynamic diagram (heatmap). The principle is that, during training, a target thermodynamic diagram is generated from the ground-truth coordinate position of each keypoint using a two-dimensional Gaussian function, and, at prediction time, the position coordinate with the highest activation value in the predicted diagram is taken as the final keypoint coordinate. As an example, the thermodynamic diagram regression network may include a plurality (e.g., 3) of deconvolution layers and an output layer. In addition, the thermodynamic diagram regression network may include network structures such as residual blocks, if necessary.
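The two halves of this principle, generating a Gaussian target around a ground-truth position and reading a coordinate back from the highest activation, can be sketched as follows. The function names, the diagram size, and the value of `sigma` are illustrative assumptions rather than values from the disclosure.

```python
import math


def gaussian_heatmap(width, height, center, sigma=1.0):
    """Target diagram: an unnormalised 2D Gaussian centred on the ground-truth keypoint."""
    cx, cy = center
    return [
        [math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * sigma ** 2))
         for x in range(width)]
        for y in range(height)
    ]


def argmax_2d(heatmap):
    """Return the (x, y) position with the highest activation value."""
    value, x, y = max(
        ((value, x, y) for y, row in enumerate(heatmap) for x, value in enumerate(row)),
        key=lambda item: item[0],
    )
    return x, y


target = gaussian_heatmap(8, 6, center=(5, 2), sigma=1.5)
peak = argmax_2d(target)   # recovers the ground-truth position for this ideal diagram
```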
In practice, a thermodynamic diagram regression network predicts keypoint positions with high accuracy, but it cannot predict the association relationship between keypoints well. Fig. 3 illustrates hand keypoint detection results obtained using a thermodynamic diagram regression network. In practice, keypoints are generally numbered to facilitate subsequent processing, such as gesture detection, that uses the keypoint detection result. In this example, the 3 keypoints on the index finger are numbered 1-3 in turn, and the 3 keypoints on the middle finger are numbered 4-6 in turn, as shown at 301 in fig. 3. However, as shown at 302 in fig. 3, in the hand keypoint detection result obtained using the thermodynamic diagram regression network, the positions of the keypoints numbered 2 and 4 are swapped. Although the keypoint positions are relatively accurate as a whole, connecting the keypoints in numbered order shows intuitively that the connecting lines between keypoints have changed significantly. The connecting lines between keypoints may represent the association relationship between them, and subsequent gesture detection needs this relationship. For example, determining whether the user is making a "V" sign depends on the relative positional relationship between the connecting lines of the middle-finger keypoints and those of the index-finger keypoints.
Step 203, inputting the output result of the middle layer of the thermodynamic diagram regression network into a pre-trained coordinate regression network to obtain a key point coordinate set.
In some embodiments, the execution body may input the output result of the middle layer of the thermodynamic diagram regression network into a pre-trained coordinate regression network to obtain the set of keypoint coordinates. The output result of any middle layer may be selected as the input to the coordinate regression network. As an example, the network structure of the coordinate regression network may include a plurality of convolution layers, a pooling layer, and a reshape layer. In practice, the initial coordinate regression network and the initial thermodynamic diagram regression network may be trained in advance by machine learning methods using a training sample set. For example, they may be trained using back-propagation and stochastic gradient descent, and the coordinate regression network is obtained once the training stop condition is satisfied. The initial coordinate regression network and the initial thermodynamic diagram regression network may be trained separately or jointly.
In practice, a coordinate regression network predicts keypoint positions with lower accuracy, but it can predict the association relationship between keypoints well. Fig. 4 illustrates hand keypoint detection results obtained using a coordinate regression network. In this example, as in fig. 3, the 3 keypoints on the index finger are numbered 1-3 in turn and the 3 keypoints on the middle finger are numbered 4-6 in turn, as shown at 401. As shown at 402, in the hand keypoint detection result obtained using the coordinate regression network, the connecting lines between keypoints do not change significantly, but the predicted position coordinates of some keypoints (for example, the keypoint numbered 4) have a larger offset.
Step 204, generating the key point position information corresponding to the image to be detected based on the key point thermodynamic diagram and the key point coordinate set.
In some embodiments, the execution body may generate the keypoint position information corresponding to the image to be detected based on the keypoint thermodynamic diagram and the set of keypoint coordinates. As an example, the execution body may first derive coordinates from the keypoint thermodynamic diagram. Specifically, for each keypoint, the position coordinate with the highest activation value in the keypoint thermodynamic diagram can be obtained first, and then the average of this position coordinate and the corresponding keypoint coordinate in the set of keypoint coordinates can be computed. On this basis, the obtained average can be used as the keypoint position information corresponding to the image to be detected.
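A minimal sketch of this averaging example is given below. It assumes that the highest-activation positions and the regressed coordinates are already expressed in the same coordinate frame and listed in the same keypoint order; the function name `fuse_by_average` and the sample values are illustrative.

```python
def fuse_by_average(heatmap_points, regressed_points):
    """Average each highest-activation position with its regressed counterpart."""
    return [
        ((hx + rx) / 2, (hy + ry) / 2)
        for (hx, hy), (rx, ry) in zip(heatmap_points, regressed_points)
    ]


heatmap_points = [(10, 20), (30, 40)]     # highest-activation positions, one per keypoint
regressed_points = [(12, 22), (28, 44)]   # outputs of the coordinate regression network
fused = fuse_by_average(heatmap_points, regressed_points)
```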
The methods provided by some embodiments of the present disclosure combine a thermodynamic diagram regression network with a coordinate regression network and thereby combine the advantages of both. Therefore, the requirements on both accuracy and the association relationship between keypoints can be met at the same time. Moreover, compared with inputting the image features into two separate branch networks, using the output result of the middle layer of the thermodynamic diagram regression network for further coordinate regression helps fuse the capabilities of the two networks, so that both requirements are satisfied.
With further reference to FIG. 5, a flow 500 of further embodiments of a keypoint detection method is illustrated. The process 500 of the keypoint detection method includes the steps of:
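Step 501, extracting features of an image to be detected to obtain image features.
In some embodiments, the execution body on which the keypoint detection method runs (e.g., the computing device in the application scenario of fig. 1) may perform feature extraction on the image to be detected to obtain the image features.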
Step 502, inputting the image features into a pre-trained thermodynamic diagram regression network to obtain a key point thermodynamic diagram.
Step 503, inputting the output result of the middle layer of the thermodynamic diagram regression network into a pre-trained coordinate regression network to obtain a key point coordinate set.
In some embodiments, the specific implementation of steps 501-503 and the technical effects thereof may refer to those embodiments corresponding to fig. 2, and are not described herein.
Step 504, mapping the thermodynamic diagram of the key points and the coordinate set of the key points to the image to be detected respectively to obtain a target image containing the thermodynamic diagram mapping key point set and the coordinate mapping key point set.
In some embodiments, for the keypoint thermodynamic diagram, the execution body of the keypoint detection method may take at least one position with the highest activation value in the diagram and, based on this, obtain the thermodynamic diagram mapping keypoints in the image to be detected through a certain mapping. For each keypoint coordinate in the set of keypoint coordinates, a coordinate mapping keypoint can likewise be obtained through a certain mapping. Depending on actual needs, the mapping may include a matrix transformation, multiplication by fixed coefficients, and so on. It will be appreciated that the target image is obtained by mapping the keypoint thermodynamic diagram and the set of keypoint coordinates to the image to be detected.
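The "multiplication by fixed coefficients" form of the mapping can be sketched as follows. The sketch makes two assumptions that the disclosure does not state: the thermodynamic diagram has 1/4 the resolution of the image to be detected (so the coefficient, called `stride` here, is 4), and the coordinate regression network outputs coordinates normalised to [0, 1].

```python
def map_heatmap_peak(peak, stride):
    """Map a highest-activation position to image pixels by a fixed coefficient."""
    x, y = peak
    return (x * stride, y * stride)


def map_regressed(coord, image_size):
    """Map a normalised (u, v) coordinate to image pixels."""
    u, v = coord
    width, height = image_size
    return (u * width, v * height)


heatmap_keypoint = map_heatmap_peak((16, 24), stride=4)
coordinate_keypoint = map_regressed((0.25, 0.5), image_size=(256, 256))
```

Both results are expressed in the pixel frame of the image to be detected, so the two keypoints of a group can be compared directly.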
Step 505, selecting a key point from each key point group in the target image as a target key point to obtain a target key point set, wherein each key point group comprises a corresponding thermodynamic diagram mapping key point and a coordinate mapping key point.
In some embodiments, both the set of thermodynamic diagram mapping keypoints and the set of coordinate mapping keypoints are predictions of the keypoints in the image to be detected. Thus, for the same position (e.g., the tip of a thumb), there will be one thermodynamic diagram mapping keypoint and one coordinate mapping keypoint corresponding to it, which together form a keypoint group. The two keypoints in the keypoint group for the same position correspond to each other. For each keypoint group, one keypoint can be selected as the keypoint for that position, namely the target keypoint. Since the target image contains a plurality of positions, there are a plurality of keypoint groups, and a set of target keypoints is thus obtained. As an example, one keypoint may be randomly selected from each group as the target keypoint. Thus, as a whole, the two networks are combined.
In some optional implementations of some embodiments, one keypoint is selected from each keypoint group as the target keypoint based on the distance between the two keypoints in that group. As an example, the distance between the two keypoints may be the Euclidean distance.
As an example, when the distance is less than or equal to a preset threshold, the predictions of the two networks are relatively similar. That is, whichever network's prediction is selected, the position information of the keypoint is relatively accurate. In this case, the coordinate mapping keypoint in the keypoint group is preferentially determined as the target keypoint. Since the coordinate mapping keypoint also satisfies the requirement on the association relationship, the dual requirements on accuracy and the association relationship between keypoints can be met at the same time. In response to determining that the distance is greater than the preset threshold, the thermodynamic diagram mapping keypoint in the keypoint group is determined as the target keypoint. In these implementations, keypoint selection is realized by setting a threshold, which further balances the requirements on accuracy and the association relationship between keypoints.
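The threshold rule above can be sketched as follows, using the Euclidean distance given earlier as an example. The function name, the sample points, and the threshold value are illustrative assumptions; both point lists are assumed to be in the same pixel frame and keypoint order.

```python
import math


def select_target_keypoints(heatmap_points, coordinate_points, threshold):
    """Pick one keypoint per group based on the distance between its two members."""
    targets = []
    for h, c in zip(heatmap_points, coordinate_points):
        distance = math.hypot(h[0] - c[0], h[1] - c[1])   # Euclidean distance
        # Close predictions: prefer the coordinate mapping keypoint.
        # Distant predictions: fall back to the thermodynamic diagram mapping keypoint.
        targets.append(c if distance <= threshold else h)
    return targets


heatmap_points = [(10, 10), (50, 50)]
coordinate_points = [(13, 14), (80, 90)]
targets = select_target_keypoints(heatmap_points, coordinate_points, threshold=5)
```

In this example the first group has a distance of exactly 5, so the coordinate mapping keypoint is kept, while the second group has a distance of 50, so the thermodynamic diagram mapping keypoint is kept.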
Step 506, determining the position information of each target key point in the target key point set as the key point position information corresponding to the image to be detected.
In some alternative implementations of some embodiments, the thermodynamic diagram regression network and the coordinate regression network are trained by training the initial thermodynamic diagram regression network at a first learning rate until a convergence condition is met to obtain an intermediate thermodynamic diagram regression network, and performing joint training on the intermediate thermodynamic diagram regression network and the initial coordinate regression network at a second learning rate until a training end condition is met to obtain the thermodynamic diagram regression network and the coordinate regression network, wherein the second learning rate is less than the first learning rate.
In practice, the thermodynamic diagram regression network is more sensitive to keypoint positions, whereas the coordinate regression network is more sensitive to the association relationship between keypoints. Therefore, the two networks converge in different directions during training. If the two networks are trained directly together from the start, this is equivalent to converging in two directions at the same time, which inevitably affects the overall convergence speed and the accuracy of the network's predictions.
Furthermore, the learning rate is an important hyperparameter in deep learning. It determines whether and when the objective function can converge to a local minimum. An appropriate learning rate enables the objective function to converge to a local minimum within an appropriate time.
Based on this, in these implementations, the initial thermodynamic diagram regression network may first be trained at a larger first learning rate, so that the thermodynamic diagram regression network converges quickly and learns the keypoint position information first. On this basis, the intermediate thermodynamic diagram regression network and the initial coordinate regression network are jointly trained at a smaller second learning rate, so that the association relationship between keypoints can be learned while a locally optimal solution is sought, and the requirements on both accuracy and the association relationship between keypoints are met.
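The two-stage schedule can be illustrated with a deliberately tiny stand-in, not a real network. In the sketch below each network is reduced to a single scalar parameter (`h` for the thermodynamic diagram regression network, `c` for the coordinate regression network), and the quadratic losses, learning rates, and stopping tolerance are all assumptions chosen only so that the example runs.

```python
def heatmap_grad(h):
    """Gradient of the toy stage-1 loss (h - 3)^2."""
    return 2 * (h - 3)


def joint_grads(h, c):
    """Gradients of the toy joint loss (h - 3)^2 + (h + c - 5)^2 w.r.t. h and c."""
    shared = 2 * (h + c - 5)
    return 2 * (h - 3) + shared, shared


def train_two_stage(first_lr=0.1, second_lr=0.01, tol=1e-6, max_steps=10000):
    if not second_lr < first_lr:
        raise ValueError("the second learning rate must be smaller than the first")
    h, c = 0.0, 0.0
    # Stage 1: train only the heatmap stand-in at the larger learning rate
    # until the convergence condition (small gradient) is met.
    for _ in range(max_steps):
        g = heatmap_grad(h)
        if abs(g) < tol:
            break
        h -= first_lr * g
    # Stage 2: jointly train both stand-ins at the smaller learning rate
    # until the training end condition is met.
    for _ in range(max_steps):
        gh, gc = joint_grads(h, c)
        if max(abs(gh), abs(gc)) < tol:
            break
        h -= second_lr * gh
        c -= second_lr * gc
    return h, c


h, c = train_two_stage()
```

Because `h` has already converged when joint training starts, the smaller second learning rate mostly moves `c` while leaving `h` close to its stage-1 value, which mirrors the intent described above.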
Compared with the description of some embodiments corresponding to fig. 2, the process 500 of the keypoint detection method in some embodiments corresponding to fig. 5 obtains the target keypoints by selecting keypoints from the keypoint groups. Overall, some of the target keypoints necessarily come from the keypoint thermodynamic diagram and others from the set of keypoint coordinates, so that the advantages of the two networks are combined and the requirements on both accuracy and the association relationship between keypoints are met.
With further reference to fig. 6, as an implementation of the method shown in the foregoing figures, the present disclosure provides some embodiments of a keypoint detection apparatus, which correspond to those method embodiments shown in fig. 2, and which are particularly applicable in a variety of electronic devices.
As shown in fig. 6, the keypoint detection apparatus 600 of some embodiments includes an extraction unit 601, a thermodynamic diagram generation unit 602, a coordinate generation unit 603, and a position information generation unit 604. The extracting unit 601 is configured to perform feature extraction on an image to be detected, so as to obtain image features. Thermodynamic diagram generation unit 602 is configured to input image features into a pre-trained thermodynamic diagram regression network to derive a keypoint thermodynamic diagram. The coordinate generation unit 603 is configured to input the output result of the middle layer of the thermodynamic diagram regression network into the pre-trained coordinate regression network, resulting in a set of key point coordinates. The position information generating unit 604 is configured to generate the keypoint position information corresponding to the image to be detected based on the keypoint thermodynamic diagram and the set of keypoint coordinates.
In an alternative implementation manner of some embodiments, the location information generating unit 604 is further configured to map the thermodynamic diagram of the keypoint and the coordinate set of the keypoint to the image to be detected to obtain a target image including the thermodynamic diagram mapping keypoint set and the coordinate mapping keypoint set, select one keypoint from each set of keypoints in the target image as a target keypoint to obtain a target keypoint set, wherein each set of keypoints includes the corresponding thermodynamic diagram mapping keypoint and the coordinate mapping keypoint, and determine location information of each target keypoint in the target keypoint set as the keypoint location information corresponding to the image to be detected.
In an alternative implementation of some embodiments, the location information generating unit 604 is further configured to select one keypoint from the keypoint groups as the target keypoint based on the distance between the two keypoints in each keypoint group.
In an alternative implementation of some embodiments, the location information generating unit 604 is further configured to determine the coordinate mapping keypoint of the keypoint group as the target keypoint in response to determining that the distance is less than or equal to a preset threshold, and to determine the thermodynamic diagram mapping keypoint of the keypoint group as the target keypoint in response to determining that the distance is greater than the preset threshold.
In an alternative implementation of some embodiments, the thermodynamic diagram regression network comprises a plurality of deconvolution layers, and the coordinate generation unit 603 is configured to input the output result of the last deconvolution layer of the plurality of deconvolution layers to the coordinate regression network to obtain the set of keypoint coordinates.
In an alternative implementation of some embodiments, the thermodynamic diagram regression network and the coordinate regression network are trained by training the initial thermodynamic diagram regression network at a first learning rate until a convergence condition is met to obtain an intermediate thermodynamic diagram regression network, and performing joint training on the intermediate thermodynamic diagram regression network and the initial coordinate regression network at a second learning rate until a training end condition is met to obtain the thermodynamic diagram regression network and the coordinate regression network, wherein the second learning rate is less than the first learning rate.
In an alternative implementation of some embodiments, the apparatus 600 further comprises a detection unit and a scaling unit. The detection unit is configured to perform target portion detection on the original image to be detected to obtain an image area displaying the target portion. The scaling unit is configured to scale the image area to a target size to obtain the image to be detected. The location information generating unit 604 is further configured to generate the keypoint location information corresponding to the original image to be detected based on the keypoint thermodynamic diagram and the set of keypoint coordinates.
In an alternative implementation of some embodiments, the location information generating unit 604 is further configured to map the keypoint thermodynamic diagram and the set of keypoint coordinates to the original image to be detected, respectively, to obtain a target image containing the set of thermodynamic diagram mapping keypoints and the set of coordinate mapping keypoints.
It will be appreciated that the elements described in the apparatus 600 correspond to the various steps in the method described with reference to fig. 2. Thus, the operations, features and resulting benefits described above with respect to the method are equally applicable to the apparatus 600 and the units contained therein, and are not described in detail herein.
Referring now to fig. 7, a schematic diagram of an electronic device 700 suitable for use in implementing some embodiments of the present disclosure is shown. The electronic device shown in fig. 7 is only one example and should not impose any limitations on the functionality and scope of use of embodiments of the present disclosure.
As shown in fig. 7, the electronic device 700 may include a processing means (e.g., a central processor, a graphics processor, etc.) 701, which may perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 702 or a program loaded from a storage means 708 into a Random Access Memory (RAM) 703. In the RAM 703, various programs and data required for the operation of the electronic device 700 are also stored. The processing device 701, the ROM 702, and the RAM 703 are connected to each other through a bus 704. An input/output (I/O) interface 705 is also connected to bus 704.
In general, the following devices may be connected to the I/O interface 705: input devices 706 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; output devices 707 including, for example, a liquid crystal display (LCD), speaker, vibrator, etc.; storage devices 708 including, for example, magnetic tape, hard disk, etc.; and communication devices 709. The communication devices 709 may allow the electronic device 700 to communicate wirelessly or by wire with other devices to exchange data. While fig. 7 shows an electronic device 700 having various devices, it is to be understood that not all of the illustrated devices are required to be implemented or provided. More or fewer devices may be implemented or provided instead. Each block shown in fig. 7 may represent one device or a plurality of devices as needed.
In particular, according to some embodiments of the present disclosure, the processes described above with reference to flowcharts may be implemented as computer software programs. For example, some embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method shown in the flow chart. In such embodiments, the computer program may be downloaded and installed from a network via communications device 709, or from storage 708, or from ROM 702. The above-described functions defined in the methods of some embodiments of the present disclosure are performed when the computer program is executed by the processing means 701.
It should be noted that, the computer readable medium described in some embodiments of the present disclosure may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of a computer-readable storage medium may include, but are not limited to, an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In some embodiments of the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In some embodiments of the present disclosure, however, the computer-readable signal medium may comprise a data signal propagated in baseband or as part of a carrier wave, with the computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. 
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to electrical wiring, fiber optic cable, RF (radio frequency), and the like, or any suitable combination of the foregoing.
In some embodiments, clients and servers may communicate using any currently known or future developed network protocol, such as HTTP (HyperText Transfer Protocol), and may be interconnected with any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), an internetwork (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed networks.
The computer readable medium may be included in the electronic device or may exist alone without being incorporated into the electronic device. The computer readable medium carries one or more programs, and when the one or more programs are executed by the electronic device, the electronic device is caused to perform feature extraction on an image to be detected to obtain image features, input the image features into a pre-trained thermodynamic diagram regression network to obtain a key point thermodynamic diagram, input an output result of an intermediate layer of the thermodynamic diagram regression network into the pre-trained coordinate regression network to obtain a key point coordinate set, and generate key point position information corresponding to the image to be detected based on the key point thermodynamic diagram and the key point coordinate set.
Computer program code for carrying out operations of some embodiments of the present disclosure may be written in one or more programming languages or combinations thereof, including object-oriented programming languages such as Java, Smalltalk, and C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in some embodiments of the present disclosure may be implemented by means of software, or may be implemented by means of hardware. The described units may also be provided in a processor, which may be described as, for example, a processor comprising an extraction unit, a thermodynamic diagram generation unit, a coordinate generation unit and a location information generation unit. The names of these units do not constitute a limitation on the unit itself in some cases, and for example, the extraction unit may also be described as "a unit that performs feature extraction on an image to be detected".
The functions described above herein may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), and the like.
The foregoing description is only of the preferred embodiments of the present disclosure and of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the invention in the embodiments of the present disclosure is not limited to technical solutions formed by the specific combination of the above technical features, but also encompasses other technical solutions formed by any combination of the above technical features or their equivalents without departing from the spirit of the invention, for example, technical solutions formed by substituting the above features with (but not limited to) features having similar functions disclosed in the embodiments of the present disclosure.