Detailed Description
For a clearer understanding of the objects, features and advantages of the present application, the present application will now be described in detail with reference to the accompanying drawings and specific examples. It should be noted that the embodiments of the present application and the features of the embodiments may be combined with each other without conflict. In the following description, numerous specific details are set forth to provide a thorough understanding of the present application; the described embodiments are merely a subset of the embodiments of the present application and not all of the embodiments.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of the described features. In the description of the present application, "a plurality" means two or more unless specifically limited otherwise.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein in the description of the present application is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.
The embodiment of the application provides a dog face key point detection method based on artificial intelligence, which can be applied to one or more electronic devices, wherein the electronic devices are devices capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions, and hardware of the electronic devices includes but is not limited to a microprocessor, an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA), a Digital Signal Processor (DSP), an embedded device, and the like.
The electronic device may be any electronic product capable of performing human-computer interaction with a user, for example, a personal computer, a tablet computer, a smart phone, a Personal Digital Assistant (PDA), a game machine, an Internet Protocol television (IPTV), an intelligent wearable device, and the like.
The electronic device may also include a network device and/or a user device. The network device includes, but is not limited to, a single network server, a server group consisting of a plurality of network servers, or a cloud-computing-based cloud consisting of a large number of hosts or network servers.
The network where the electronic device is located includes, but is not limited to, the internet, a wide area network, a metropolitan area network, a local area network, a Virtual Private Network (VPN), and the like.
Fig. 1 is a flowchart of a method for detecting key points of a dog face based on artificial intelligence according to a preferred embodiment of the present invention. The order of the steps in the flow chart may be changed and some steps may be omitted according to different needs.
S10, performing transcoding processing on a historical dog face image to obtain a transcoded image.
In an optional embodiment, the transcoding the historical dog face image to obtain a transcoded image includes:
carrying out up-sampling on the historical dog face image to obtain a sampling image;
calculating the gray value of each pixel point in the sampling image to obtain a gray image;
and sharpening the gray level image to obtain a transcoding image.
In this optional embodiment, in order to enhance the spatial information of the historical dog face image, the dog face image may be up-sampled according to a preset interpolation algorithm to obtain a sampled image.
In some embodiments, the interpolation algorithm may be an existing interpolation algorithm such as a nearest-neighbor interpolation algorithm, a bilinear interpolation algorithm, or a cubic interpolation algorithm, which is not limited in this application.
In this optional embodiment, taking the nearest-neighbor interpolation algorithm as an example, the sampled image may be obtained as follows:
constructing a sampling matrix corresponding to each pixel point in the dog face image according to a preset sampling factor, wherein the number of elements in the sampling matrix is the square of the sampling factor; for example, in this embodiment, if the sampling factor is 2, the sampling matrix contains 4 elements (i.e., a 2×2 matrix);
initializing and setting the value of each element in all sampling matrixes to be 0;
taking the coordinates of each pixel point in the dog face image as the coordinates of a sampling matrix corresponding to each pixel point, wherein the exemplary sampling matrix coordinates corresponding to the pixel point with the coordinates of [2,3] are [2,3 ];
taking the pixel value of each pixel point in the dog face image as the value of all elements in a sampling matrix with the same coordinate;
and combining the sampling matrixes according to the coordinates corresponding to all the sampling matrixes to obtain a sampling image.
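As an illustration of the nearest-neighbor up-sampling described above, the following is a minimal sketch in Python using only NumPy; the function name and the sampling factor of 2 are chosen here for illustration and are not mandated by this embodiment:

```python
import numpy as np

def nearest_neighbor_upsample(image: np.ndarray, factor: int = 2) -> np.ndarray:
    """Up-sample an H x W (or H x W x C) image by repeating each pixel
    into a factor x factor sampling matrix, as described above."""
    # Each pixel at (i, j) becomes a factor x factor block whose elements
    # all take the pixel's original value.
    return np.repeat(np.repeat(image, factor, axis=0), factor, axis=1)

# Usage example: a 2 x 2 image becomes a 4 x 4 sampled image.
img = np.array([[10, 20],
                [30, 40]], dtype=np.uint8)
print(nearest_neighbor_upsample(img, factor=2))
# [[10 10 20 20]
#  [10 10 20 20]
#  [30 30 40 40]
#  [30 30 40 40]]
```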
In this optional embodiment, the gray value of each pixel point in the sampled image may be calculated according to a preset gray value calculation formula to obtain a gray image, where the gray value calculation formula satisfies the following relation:
Gray=0.299*R+0.587*G+0.114*B
wherein Gray represents the Gray value of a certain pixel point in the Gray image; r represents the value of the R channel corresponding to the pixel point; g represents a G channel value corresponding to the pixel point; and B represents the value of the B channel corresponding to the pixel point.
In this optional embodiment, since the grayscale image is obtained by upsampling, a blur condition may occur, and therefore, a preset sharpening formula may be used to sharpen the grayscale image to obtain a sharpened image, where taking any one pixel in the grayscale image as an example, the preset sharpening formula satisfies the following relational expression:
P=(1+γ)*Gray-γ*m
wherein, P represents the sharpening value of the pixel point; gray represents the Gray value of the pixel point; m represents the mean value of the gray values of all pixel points in the gray image; γ represents a preset sharpening parameter, which may be 0.5 based on experience of multiple experiments.
Illustratively, when the gray value of a certain pixel in the gray image is 100 and the mean value of the gray values of all pixels in the gray image is 80, the method for calculating the sharpening value corresponding to the pixel is as follows:
P=(1+0.5)*100-0.5*80=110
the sharpening value corresponding to the pixel point is 110.
In this alternative embodiment, each sharpening value may be used as a pixel value corresponding to each pixel point to obtain a transcoded image.
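The grayscale conversion and sharpening steps described above can be summarised in a short sketch; it assumes an H×W×3 RGB input array and uses the weighting formula and γ = 0.5 given in this embodiment (the function names are illustrative only):

```python
import numpy as np

def to_gray(rgb: np.ndarray) -> np.ndarray:
    """Gray = 0.299*R + 0.587*G + 0.114*B for an H x W x 3 RGB image."""
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    return 0.299 * r + 0.587 * g + 0.114 * b

def sharpen(gray: np.ndarray, gamma: float = 0.5) -> np.ndarray:
    """P = (1 + gamma) * Gray - gamma * m, where m is the mean gray value."""
    m = gray.mean()
    return (1 + gamma) * gray - gamma * m

# Usage: a pixel with gray value 100 and image mean 80 yields 1.5*100 - 0.5*80 = 110.
gray = np.array([[100.0, 80.0], [60.0, 80.0]])
print(sharpen(gray))  # the mean is 80, so the first pixel becomes 110.0
```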
In this way, the dog face image is up-sampled to obtain a sampled image that highlights the spatial information of the dog face image, the gray value of each pixel point in the sampled image is calculated to obtain a grayscale image, and the grayscale image is further sharpened to obtain a transcoded image; this enhances the contrast of the dog face image while enriching its spatial information, thereby improving the accuracy of the subsequent image block division.
S11, the transcoded image is divided into a plurality of image blocks.
In an optional embodiment, the segmenting the transcoded image to obtain a plurality of image blocks includes:
constructing a feature descriptor of each pixel point in the transcoding image;
setting a plurality of clustering centers in the transcoded image according to the preset number of the categories of the key points of the dog face;
and clustering each pixel point in the transcoded image based on the feature descriptors and the clustering center to obtain a plurality of image blocks.
In this optional embodiment, a feature descriptor of each pixel point may be constructed according to a pixel value and a coordinate of each pixel point in the transcoded image, where the feature descriptor is used to represent a spatial feature of each pixel point and a light-dark feature of each pixel point, and the form of the feature descriptor may be [ P, x, y ], where P represents a sharpening value of each pixel point in the transcoded image, x represents an abscissa of each pixel point, and y represents an ordinate of each pixel point.
Illustratively, when the sharpening value of a certain pixel point in the transcoded image is 50 and the coordinates of the pixel point are [2,3], the feature descriptor of the pixel point is [50,2,3].
In this optional embodiment, a plurality of pixel points may be initialized in the transcoded image as a clustering center according to a preset number of dog face key point categories, where the number of the dog face key point categories is a positive integer greater than 1.
For example, when the category of the key points of the dog face includes [ eyes, ears, nose ], the number of the key point categories is 3, and 3 pixel points can be selected from the transcoded image as a cluster center.
In this optional embodiment, the specific implementation steps of the clustering are as follows:
a, selecting a pixel point from the transcoded image as a target pixel point;
b, respectively calculating the similarity between the target pixel point and each clustering center, and classifying the target pixel point into the clustering center corresponding to the maximum similarity, wherein the similarity is calculated according to the following relation:
S = (A_1×B_1 + A_2×B_2 + ... + A_k×B_k) / (√(A_1² + ... + A_k²) × √(B_1² + ... + B_k²))
wherein S represents the similarity between the target pixel point and the clustering center, and a greater similarity indicates that the target pixel point and the clustering center are more similar; A_i represents the value of the i-th dimension in the feature descriptor of the clustering center; B_i represents the value of the i-th dimension in the feature descriptor of the target pixel point; and k represents the number of dimensions of the feature descriptor, k being 3 in the present scheme.
For example, when the feature descriptor of a certain clustering center is [50,2,3] and the feature descriptor of a target pixel point is [60,1,2], substituting these descriptors into the above relation, the similarity between the target pixel point and the clustering center is 0.99.
In this alternative embodiment, the target pixel point and the cluster center corresponding to the maximum similarity may be classified into the same cluster;
c, respectively taking each pixel point in the transcoded image as a target pixel point, and repeating the step b to obtain a plurality of cluster clusters, wherein each cluster comprises a plurality of pixel points;
d, calculating the mean value of the feature descriptors of all the pixel points in each cluster as the mean value corresponding to that cluster, and calculating the difference between the mean value and the clustering center; if the difference is sufficiently small (that is, the clustering has converged), outputting the plurality of clusters; otherwise, taking the mean value as the new clustering center and repeating steps a to d.
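A minimal sketch of the clustering procedure in steps a to d is given below; it assumes cosine similarity as the similarity measure (consistent with the [P, x, y] descriptors and the 0.99 example above) and a simple convergence threshold, both of which are illustrative assumptions rather than fixed choices of this embodiment:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # S = sum(A_i * B_i) / (||A|| * ||B||); a larger S means more similar.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def cluster_descriptors(descriptors: np.ndarray, n_clusters: int,
                        tol: float = 1e-3, max_iter: int = 100) -> np.ndarray:
    """descriptors: (num_pixels, 3) array of [P, x, y] feature descriptors.
    Returns one cluster index per pixel (each cluster = one image block)."""
    rng = np.random.default_rng(0)
    # Initialization: pick n_clusters pixels as the initial clustering centers.
    centers = descriptors[rng.choice(len(descriptors), n_clusters, replace=False)].astype(float)
    for _ in range(max_iter):
        # Steps a-c: assign every pixel to the center with the maximum similarity.
        labels = np.array([
            int(np.argmax([cosine_similarity(d, c) for c in centers]))
            for d in descriptors
        ])
        # Step d: recompute each cluster mean and check how far it moved.
        new_centers = np.array([descriptors[labels == k].mean(axis=0)
                                if np.any(labels == k) else centers[k]
                                for k in range(n_clusters)])
        if np.max(np.abs(new_centers - centers)) < tol:  # difference small: stop
            break
        centers = new_centers  # difference large: use the means as new centers
    return labels

# Usage: 3 key point categories (e.g. eyes, ears, nose) -> 3 clusters / image blocks.
# descriptors = build_descriptors(transcoded_image)   # hypothetical helper
# labels = cluster_descriptors(descriptors, n_clusters=3)
```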
In the optional embodiment, each cluster represents an image block, each image block includes a plurality of pixel points, and the similarity of all the pixel points in each image block is high.
Illustratively, fig. 4 is a schematic diagram of the plurality of image blocks.
Therefore, a plurality of pixel points in the transcoded image are initialized and set as clustering centers through the preset number of the key point types of the dog face, the pixel points in the transcoded image are clustered according to the clustering centers to generate a plurality of image blocks, and data support is provided for subsequently calculating the key points of the image.
S12, calculating the key points of each image block, and labeling the transcoded image according to the key points to obtain a labeled image.
In an optional embodiment, the calculating the keypoints of each image block and labeling the transcoded image according to the keypoints to obtain a labeled image includes:
calculating the coordinates of key points of each image block according to the coordinates of all pixel points in each image block;
setting a virtual anchor frame of each image block according to the key point coordinates and a preset width and height parameter;
and setting a category label for each virtual anchor frame by using a preset labeling tool to obtain a labeled image.
In this optional embodiment, the key point coordinates of each image block may be calculated according to the coordinates of all pixel points in each image block, and taking an example of a certain image block in the transcoded image, the calculation manner of the key point coordinates in the image block satisfies the following relational expression:
x = (x_min + x_max)/2
y = (y_min + y_max)/2
wherein x represents the abscissa of the key point; x_min represents the minimum value of the abscissas of all the pixel points in the image block, and x_max represents the maximum value of the abscissas of all the pixel points in the image block; y represents the ordinate of the key point; y_min represents the minimum value of the ordinates of all the pixel points in the image block, and y_max represents the maximum value of the ordinates of all the pixel points in the image block.
In this optional embodiment, the virtual anchor frame of each image block may be set according to the coordinates of the key point and a preset width-height parameter, where the width-height parameter includes a width parameter w and a height parameter h, and for example, when the coordinates of the key point of a certain image block are [ x, y ], and the width parameter w corresponding to the key point is 10 and the height parameter h is 20, the coordinates of the geometric center of the virtual anchor frame corresponding to the image block are [ x, y ], and the width of the virtual anchor frame corresponding to the image block is 10 and the height is 20.
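As a small illustration of the key point coordinate formula and the virtual anchor frame construction described above (the width and height parameters follow the example values in this embodiment; the function names and sample pixel coordinates are illustrative):

```python
import numpy as np

def block_keypoint(xs: np.ndarray, ys: np.ndarray) -> tuple[float, float]:
    """x = (x_min + x_max)/2, y = (y_min + y_max)/2 over all pixels of the block."""
    return (xs.min() + xs.max()) / 2.0, (ys.min() + ys.max()) / 2.0

def virtual_anchor(keypoint: tuple[float, float], w: float, h: float):
    """Anchor frame centered on the key point, returned as (x_min, y_min, x_max, y_max)."""
    x, y = keypoint
    return (x - w / 2.0, y - h / 2.0, x + w / 2.0, y + h / 2.0)

# Usage: pixels of one image block and the preset width/height parameters from the example.
xs = np.array([4, 5, 6, 7, 8]); ys = np.array([10, 11, 12, 13, 14])
kp = block_keypoint(xs, ys)            # (6.0, 12.0)
print(virtual_anchor(kp, w=10, h=20))  # (1.0, 2.0, 11.0, 22.0)
```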
In this optional embodiment, a label may be set for each virtual anchor frame according to a preset labeling tool, and in this embodiment, the labeling tool may be a labelme tool, and the function of the labeling tool is to give a label to a pixel point in an image. And setting the category label of each key point by using the labelme tool, and taking the category label of each key point as the category labels of all pixel points in the virtual anchor frame corresponding to the key point.
In this alternative embodiment, a transcoded image with multiple key points may be used as a marker image, each marker image includes at least one key point, each key point corresponds to a virtual anchor frame, and the categories of all pixel points in the virtual anchor frame are the same, for example, the categories may include dog face key parts such as [ ear, eye, nose ], and the like.
In this optional embodiment, the marker image has a plurality of key points, and each key point corresponds to a category label and a set of width and height parameters. In this scheme, the probability that the pixel point at coordinates [x, y] in the marker image belongs to a key point of category i is denoted as p_{i,x,y}. Illustratively, when the pixel point at coordinates [2,3] in the marker image is a key point of the category 'ear', then p_{ear,2,3} = 1.
Illustratively, fig. 5 is a schematic diagram of the marker image.
Therefore, the positions of the key points of each image block are calculated through the coordinates of all the pixel points in each image block, and further data labeling is carried out on all the key points to obtain a labeled image, so that the accuracy and the efficiency of the data labeling can be improved.
S13, training a dog face key point detection model based on the marked image and the transcoded image.
In an optional embodiment, the training of the dog face keypoint detection model based on the labeled image and the transcoded image comprises:
taking the transcoding image as a sample image, taking the marking image as a label image, and storing the sample image and the label image to obtain a training data set;
constructing an initial dog face key point detection model, wherein the initial dog face key point detection model comprises an encoder and a decoder;
and sequentially inputting the sample images into the initial dog face key point detection model to obtain a detection result, and updating parameters of the initial dog face key point detection model according to a preset loss function to obtain the dog face key point detection model.
In this alternative embodiment, all the transcoded images may be used as sample images, and the label image corresponding to each transcoded image may be used as a label image, and further, all the sample images and the label image may be stored to obtain a training data set.
In this optional embodiment, the encoder may be an existing neural network structure such as ResNet (residual network), DLA (deep layer aggregation network), or Hourglass (stacked hourglass network); the decoder includes a first decoder, a second decoder and a third decoder, and each decoder may be an existing feature extraction network such as CNN (convolutional neural network) or R-CNN (region-based convolutional neural network).
In this alternative embodiment, the input data and the output data of the encoder and the decoder include:
the input of the encoder is a sample image, the output of the encoder is a feature map, and the size of the feature map is the same as that of the sample image;
the input of the decoder is the feature map; the output of the first decoder is a plurality of predicted heat maps corresponding to the sample image, all the predicted heat maps have the same size as the sample image, each predicted heat map includes at least one key point, all the key points in each predicted heat map have the same category, and the value of each pixel point in each predicted heat map represents the probability that the pixel point belongs to a key point of the category corresponding to that heat map; illustratively, when the label of a predicted heat map is 'ear' and there is a pixel point with coordinates [1,2] and a pixel value of 1 in the predicted heat map, the probability that the pixel point is a key point of the category 'ear' is 1; in this embodiment, the probability value of each pixel point in the predicted heat map may be recorded as p̂_{i,x,y}, which represents the probability that the pixel point at coordinates [x, y] in the predicted heat map belongs to a key point of category i;
the output of the second decoder is a predicted coordinate, the predicted coordinate being the coordinate of a key point in the sample image predicted by the second decoder, and the predicted coordinate may be recorded as (x̂, ŷ);
the output of the third decoder is a predicted width and height parameter, the predicted width and height parameter being the width and height of the virtual anchor frame corresponding to each key point in the sample image predicted by the third decoder; in this scheme, the predicted width and height parameters may be recorded as ŵ and ĥ, wherein ŵ represents the predicted width parameter output by the third decoder and ĥ represents the predicted height parameter output by the third decoder.
Illustratively, as shown in fig. 6, a schematic structural diagram of the initial dog face key point detection model is shown.
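The encoder/decoder arrangement described above resembles a CenterNet-style detector: a shared backbone followed by a heat map head, a coordinate head and a width-height head. The following PyTorch sketch is only a minimal illustration of that structure under this assumption; the backbone, channel counts and head shapes are placeholders rather than the exact architecture of this embodiment:

```python
import torch
import torch.nn as nn

class DogFaceKeypointNet(nn.Module):
    def __init__(self, num_categories: int = 3, feat_channels: int = 64):
        super().__init__()
        # Encoder: placeholder for ResNet / DLA / Hourglass; here a tiny conv stack
        # that keeps the spatial size, since the feature map matches the sample image size.
        self.encoder = nn.Sequential(
            nn.Conv2d(1, feat_channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat_channels, feat_channels, 3, padding=1), nn.ReLU(),
        )
        def head(out_channels: int) -> nn.Sequential:  # one small decoder head per output
            return nn.Sequential(
                nn.Conv2d(feat_channels, feat_channels, 3, padding=1), nn.ReLU(),
                nn.Conv2d(feat_channels, out_channels, 1),
            )
        self.heatmap_head = head(num_categories)  # first decoder: one heat map per category
        self.coord_head = head(2)                 # second decoder: predicted coordinates
        self.size_head = head(2)                  # third decoder: predicted width/height

    def forward(self, x):
        feat = self.encoder(x)
        return (torch.sigmoid(self.heatmap_head(feat)),  # probabilities in (0, 1)
                self.coord_head(feat),
                self.size_head(feat))

# Usage: a single-channel (transcoded/grayscale) 128x128 sample image.
model = DogFaceKeypointNet(num_categories=3)
heatmaps, coords, sizes = model(torch.randn(1, 1, 128, 128))
print(heatmaps.shape, coords.shape, sizes.shape)  # (1,3,128,128) (1,2,128,128) (1,2,128,128)
```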
In this optional embodiment, in order to ensure that the output of the dog face key point detection model is as similar as possible to the tag image, it is necessary to perform iterative training on the initial dog face key point detection model according to a preset loss function to update parameters of the initial dog face key point detection model, where the preset loss function includes a heat map loss, a displacement loss, and a regression loss, and the heat map loss satisfies the following relation:
wherein L_heatmap represents the heat map loss, and a smaller heat map loss value indicates that the key points in the predicted heat map are more similar to the key points in the sample image; N represents the number of key point categories in the label image; n_x represents the width of the predicted heat map, namely the number of coordinate points in the row direction of the predicted heat map; n_y represents the height of the predicted heat map, namely the number of coordinate points in the column direction of the predicted heat map; x and y represent the coordinates of a pixel point in the heat map; α and β represent preset blending parameters, and α = 2 and β = 4 may be set in this embodiment; p̂_{i,x,y} represents the probability that the pixel point with coordinates (x, y) in the predicted heat map belongs to a key point of category i; illustratively, p̂_{ear,2,3} = 1 indicates that the probability that the pixel point at coordinates [2,3] in the heat map belongs to a key point of the 'ear' category is 1, and p̂_{nose,50,50} = 0.8 indicates that the probability that the pixel point at coordinates [50,50] in the heat map belongs to a key point of the 'nose' category is 0.8.
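The blending parameters α = 2 and β = 4 above match the penalty-reduced focal loss commonly used for heat maps in CenterNet-style detectors; the following sketch shows that commonly used form purely as an illustrative assumption, not necessarily the exact heat map loss relation of this embodiment:

```python
import torch

def heatmap_focal_loss(pred: torch.Tensor, gt: torch.Tensor,
                       alpha: float = 2.0, beta: float = 4.0) -> torch.Tensor:
    """CenterNet-style penalty-reduced focal loss (an assumed form).
    pred, gt: (N_categories, n_y, n_x) heat maps with values in [0, 1]."""
    pred = pred.clamp(1e-6, 1 - 1e-6)            # numerical stability
    pos = gt.eq(1).float()                       # pixels that are key points
    neg = 1.0 - pos
    pos_loss = pos * (1 - pred) ** alpha * torch.log(pred)
    neg_loss = neg * (1 - gt) ** beta * pred ** alpha * torch.log(1 - pred)
    num_pos = pos.sum().clamp(min=1.0)
    return -(pos_loss.sum() + neg_loss.sum()) / num_pos

# Usage with a single 'ear' heat map whose key point sits at coordinates [2, 3].
gt = torch.zeros(1, 8, 8); gt[0, 3, 2] = 1.0
pred = torch.full((1, 8, 8), 0.05); pred[0, 3, 2] = 0.9
print(heatmap_focal_loss(pred, gt))
```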
In this optional embodiment, a regression loss function may be constructed according to the width and height parameters and the predicted width and height parameters in the label image, where the regression loss function satisfies the following relation:
wherein L_regression represents the regression loss, the regression loss is used for representing the difference between the key point range size in the predicted heat map and the key point range size in the label image, and a smaller regression loss value indicates that the predicted width and height parameters are more similar to the width and height parameters in the label image; M represents the number of key points in the label image; w and h respectively represent the width and height parameters corresponding to the key point of the k-th category in the label image; and ŵ and ĥ represent the predicted width and height parameters corresponding to the predicted k-th key point.
In this alternative embodiment, the displacement loss function satisfies the following relationship:
wherein L_offset represents the loss value of the displacement loss function, and a smaller displacement loss value indicates that the predicted key point coordinates are closer to the coordinates of the key points in the label image; M represents the number of key points in the label image, and i represents the index of a key point in the label image; and d_i represents the Euclidean distance between the predicted coordinates and the coordinates of the i-th key point in the label image.
In this alternative embodiment, the overall loss value of the initial dog face key point detection model may be calculated according to the heat map loss, the regression loss and the displacement loss, and the overall loss value may be calculated in a manner satisfying the following relation:
Loss = L_regression + A×L_heatmap + B×L_offset
wherein Loss represents the total loss value, and a smaller total loss value indicates that the output of the initial dog face key point detection model is more similar to the label image and that the performance of the initial dog face key point detection model is better; L_regression represents the regression loss value; L_heatmap represents the heat map loss value; L_offset represents the displacement loss value; and A and B represent preset weighting parameters, where, according to experience obtained through multiple experiments, A may be 2 and B may be 4.
In this optional embodiment, the sample image may be sequentially input into the initial dog face key point detection model to obtain a key point detection result corresponding to the sample image, where the detection result includes a plurality of heatmaps, coordinates of each key point, and a width and height parameter corresponding to each key point. Further, the overall loss value may be calculated according to the label image corresponding to the sample image and the detection result, the overall loss value is used to characterize the difference between the label image and the prediction result, and a smaller overall loss value indicates that the detection result is more similar to the label image.
In this optional embodiment, parameters of the initial dog face key point detection model may be iteratively updated by using a gradient descent method, until the total loss value is smaller than a preset termination threshold value, the iteration is stopped to obtain the dog face key point detection model, and the termination threshold value may be 0.001 according to experience obtained through multiple experiments.
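As a sketch of how the total loss and the termination check described above fit together (only the weights A = 2 and B = 4 and the termination threshold 0.001 come from this embodiment; the example loss values are illustrative):

```python
import torch

def total_loss(l_regression: torch.Tensor, l_heatmap: torch.Tensor,
               l_offset: torch.Tensor, a: float = 2.0, b: float = 4.0) -> torch.Tensor:
    """Loss = L_regression + A * L_heatmap + B * L_offset (A = 2, B = 4 per this embodiment)."""
    return l_regression + a * l_heatmap + b * l_offset

# Usage: combine the per-term losses and decide whether to stop the iterative update.
loss = total_loss(torch.tensor(0.0004), torch.tensor(0.0001), torch.tensor(0.00005))
print(loss.item())          # 0.0004 + 2*0.0001 + 4*0.00005 = 0.0008
print(loss.item() < 0.001)  # below the termination threshold -> stop iterating
```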
In this way, the initial dog face key point detection model is trained based on a large number of marked images and transcoded images to obtain the dog face key point detection model, and the parameters of the initial dog face key point detection model are continuously updated according to the preset loss function, which can improve the performance of the dog face key point detection model and further improve the accuracy of dog face key point detection.
S14, inputting the image to be detected into the dog face key point detection model to obtain a detection result.
In an optional embodiment, the inputting the image to be detected into the dog face key point detection model to obtain a detection result includes:
inputting the image to be detected into the dog face key point detection model to obtain a plurality of prediction heat maps, the category of key points in each prediction heat map and prediction width and height parameters corresponding to each key point;
and dividing a real anchor frame in the image to be detected according to the category of the key points in the prediction heat map and the prediction width and height parameters, and taking the real anchor frame as a detection result.
In this alternative embodiment, a dog face image to be detected may be input into the dog face key point detection model to obtain a plurality of prediction heat maps and prediction width and height parameters, each prediction heat map corresponds to one prediction category tag, the prediction category tags are categories of all key points in the heat map, each prediction heat map includes at least one key point, and the prediction width and height parameters are used to characterize a range of key portions in the dog face image, as an example, as shown in fig. 7, a schematic diagram of the detection result is shown.
In this optional embodiment, the pixel points at the same position in the image to be detected can be searched for as the predicted key points of the image to be detected according to the pixel points with the pixel value of 1 in the predicted heat map, and the category of the heat map is used as the category of the key points. Further, dividing a real anchor frame in the image to be detected according to the prediction width and height parameters and the prediction key points, and taking the real anchor frame as a detection result, wherein the real anchor frame is used for representing that all pixel points in the anchor frame belong to corresponding categories.
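A minimal sketch of the inference step described above, namely locating predicted key points from heat map pixels with value 1 and drawing the real anchor frame around them, is shown below; representing the predicted width and height parameters as per-pixel maps and the function name are illustrative assumptions:

```python
import numpy as np

def decode_detections(heatmaps: np.ndarray, widths: np.ndarray, heights: np.ndarray,
                      categories: list[str]):
    """heatmaps: (C, H, W) predicted heat maps; widths/heights: (C, H, W) predicted
    width/height parameters. Returns (category, keypoint, anchor_frame) triples."""
    results = []
    for c, name in enumerate(categories):
        # Key points are pixels whose predicted probability equals 1 in the heat map.
        ys, xs = np.where(heatmaps[c] == 1)
        for x, y in zip(xs, ys):
            w, h = widths[c, y, x], heights[c, y, x]
            # Real anchor frame centered on the key point, as (x_min, y_min, x_max, y_max).
            anchor = tuple(float(v) for v in (x - w / 2, y - h / 2, x + w / 2, y + h / 2))
            results.append((name, (int(x), int(y)), anchor))
    return results

# Usage: one 'nose' heat map with a single key point at [4, 5] and a 10 x 20 anchor frame.
hm = np.zeros((1, 16, 16)); hm[0, 5, 4] = 1
w = np.full((1, 16, 16), 10.0); h = np.full((1, 16, 16), 20.0)
print(decode_detections(hm, w, h, ["nose"]))
# [('nose', (4, 5), (-1.0, -5.0, 9.0, 15.0))]
```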
Therefore, based on the detection result calculated from the output data of the dog face key point detection model, a relatively accurate detection result can be obtained without non-maximum suppression, which improves the efficiency and accuracy of dog face key point detection.
According to the dog face key point detection method based on artificial intelligence, the transcoding image is obtained by transcoding the dog face image, the transcoding image is divided to obtain a plurality of image blocks, the key points of each image block are calculated, data labeling is carried out to obtain the labeled image, the dog face key point detection model is trained based on the labeled image and the transcoding image, key points in the image can be automatically labeled, the efficiency of model training is improved while the labeling cost is reduced, the spatial characteristics are applied in the model training process, the generalization capability of the model is improved, and the accuracy of dog face key point detection is improved.
Fig. 2 is a functional block diagram of a preferred embodiment of the artificial intelligence-based dog face key point detection apparatus according to the embodiment of the present application. The artificial intelligence-based dog face key point detection apparatus 11 comprises a transcoding unit 110, a segmentation unit 111, a marking unit 112, a training unit 113 and a detection unit 114. The module/unit referred to in this application refers to a series of computer program segments that can be executed by the processor 13 and that can perform a fixed function, and that are stored in the memory 12. In this embodiment, the functions of the modules/units will be described in detail in the following embodiments.
In an alternative embodiment, the transcoding unit 110 is configured to transcode the historical dog face image to obtain a transcoded image.
In an optional embodiment, the transcoding the historical dog face image to obtain a transcoded image includes:
carrying out up-sampling on the historical dog face image to obtain a sampling image;
calculating the gray value of each pixel point in the sampling image to obtain a gray image;
and sharpening the gray level image to obtain a transcoding image.
In this optional embodiment, in order to enhance the spatial information of the historical dog face image, the dog face image may be up-sampled according to a preset interpolation algorithm to obtain a sampled image.
In some embodiments, the interpolation algorithm may be an existing interpolation algorithm such as a nearest-neighbor interpolation algorithm, a bilinear interpolation algorithm, or a cubic interpolation algorithm, which is not limited in this application.
In this optional embodiment, taking the nearest-neighbor interpolation algorithm as an example, the sampled image may be obtained as follows:
constructing a sampling matrix corresponding to each pixel point in the dog face image according to a preset sampling factor, wherein the number of elements in the sampling matrix is the square of the sampling factor; for example, in this embodiment, if the sampling factor is 2, the sampling matrix contains 4 elements (i.e., a 2×2 matrix);
initializing and setting the value of each element in all sampling matrixes to be 0;
taking the coordinates of each pixel point in the dog face image as the coordinates of a sampling matrix corresponding to each pixel point, wherein exemplarily, the coordinates of the sampling matrix corresponding to the pixel point with the coordinates of [2,3] are [2,3 ];
taking the pixel value of each pixel point in the dog face image as the value of all elements in a sampling matrix with the same coordinate;
and combining the sampling matrixes according to the coordinates corresponding to all the sampling matrixes to obtain a sampling image.
In this optional embodiment, the gray value of each pixel point in the sampled image may be calculated according to a preset gray value calculation formula to obtain a gray image, where the gray value calculation formula satisfies the following relation:
Gray=0.299*R+0.587*G+0.114*B
wherein Gray represents the Gray value of a certain pixel point in the Gray image; r represents the value of the R channel corresponding to the pixel point; g represents a G channel value corresponding to the pixel point; and B represents the value of the B channel corresponding to the pixel point.
In this optional embodiment, since the grayscale image is obtained by upsampling, a blur condition may occur, and therefore, a preset sharpening formula may be used to sharpen the grayscale image to obtain a sharpened image, where taking any one pixel in the grayscale image as an example, the preset sharpening formula satisfies the following relational expression:
P=(1+γ)*Gray-γ*m
wherein, P represents the sharpening value of the pixel point; gray represents the Gray value of the pixel point; m represents the mean value of the gray values of all pixel points in the gray image; γ represents a preset sharpening parameter, which may be 0.5 based on experience with multiple trials.
Illustratively, when the gray value of a certain pixel in the gray image is 100 and the mean value of the gray values of all pixels in the gray image is 80, the method for calculating the sharpening value corresponding to the pixel is as follows:
P=(1+0.5)*100-0.5*80=110
the sharpening value corresponding to the pixel point is 110.
In this alternative embodiment, each sharpened value may be used as a pixel value corresponding to each pixel point to obtain a transcoded image.
In an alternative embodiment, the segmentation unit 111 is configured to segment the transcoded image to obtain a plurality of image blocks.
In an optional embodiment, the segmenting the transcoded image to obtain a plurality of image blocks includes:
constructing a feature descriptor of each pixel point in the transcoding image;
setting a plurality of clustering centers in the transcoded image according to the preset number of the categories of the key points of the dog face;
and clustering each pixel point in the transcoded image based on the feature descriptors and the clustering center to obtain a plurality of image blocks.
In this optional embodiment, a feature descriptor of each pixel point may be constructed according to a pixel value and a coordinate of each pixel point in the transcoded image, where the feature descriptor is used to represent a spatial feature of each pixel point and a light-dark feature of each pixel point, and the form of the feature descriptor may be [ P, x, y ], where P represents a sharpening value of each pixel point in the transcoded image, x represents an abscissa of each pixel point, and y represents an ordinate of each pixel point.
Illustratively, when the sharpening value of a certain pixel point in the transcoded image is 50 and the coordinates of the pixel point are [2,3], the feature descriptor of the pixel point is [50,2,3].
In this optional embodiment, a plurality of pixel points may be initialized in the transcoded image as a clustering center according to a preset number of dog face key point categories, where the number of the dog face key point categories is a positive integer greater than 1.
For example, when the category of the key points of the dog face includes [ eyes, ears, nose ], the number of the key point categories is 3, and 3 pixel points can be selected from the transcoded image as a cluster center.
In this optional embodiment, the specific implementation steps of the clustering are as follows:
a, selecting a pixel point from the transcoded image as a target pixel point;
b, respectively calculating the similarity between the target pixel point and each clustering center, and classifying the target pixel point into the clustering center corresponding to the maximum similarity, wherein the similarity is calculated according to the following relation:
S = (A_1×B_1 + A_2×B_2 + ... + A_k×B_k) / (√(A_1² + ... + A_k²) × √(B_1² + ... + B_k²))
wherein S represents the similarity between the target pixel point and the clustering center, and a greater similarity indicates that the target pixel point and the clustering center are more similar; A_i represents the value of the i-th dimension in the feature descriptor of the clustering center; B_i represents the value of the i-th dimension in the feature descriptor of the target pixel point; and k represents the number of dimensions of the feature descriptor, k being 3 in the present scheme.
For example, when the feature descriptor of a certain clustering center is [50,2,3] and the feature descriptor of a target pixel point is [60,1,2], substituting these descriptors into the above relation, the similarity between the target pixel point and the clustering center is 0.99.
In this alternative embodiment, the target pixel point and the cluster center corresponding to the maximum similarity may be classified into the same cluster;
c, respectively taking each pixel point in the transcoded image as a target pixel point and repeating the step b to obtain a plurality of cluster clusters, wherein each cluster comprises a plurality of pixel points;
d, calculating the mean value of the feature descriptors of all the pixel points in each cluster as the mean value corresponding to that cluster, and calculating the difference between the mean value and the clustering center; if the difference is sufficiently small (that is, the clustering has converged), outputting the plurality of clusters; otherwise, taking the mean value as the new clustering center and repeating steps a to d.
In this optional embodiment, each cluster represents an image block, each image block includes a plurality of pixel points, and the similarity of all the pixel points in each image block is high.
Illustratively, fig. 4 is a schematic diagram of the plurality of image blocks.
In an alternative embodiment, the labeling unit 112 is configured to calculate a key point for each image block and label the transcoded image according to the key points to obtain a labeled image.
In an optional embodiment, the calculating the keypoints of each image block and labeling the transcoded image according to the keypoints to obtain a labeled image includes:
calculating the coordinates of key points of each image block according to the coordinates of all pixel points in each image block;
setting a virtual anchor frame of each image block according to the key point coordinates and a preset width and height parameter;
and setting a category label for each virtual anchor frame by using a preset labeling tool to obtain a labeled image.
In this optional embodiment, the key point coordinates of each image block may be calculated according to the coordinates of all pixel points in each image block, and taking an example of a certain image block in the transcoded image, the calculation manner of the key point coordinates in the image block satisfies the following relational expression:
x = (x_min + x_max)/2
y = (y_min + y_max)/2
wherein x represents the abscissa of the key point; x_min represents the minimum value of the abscissas of all the pixel points in the image block, and x_max represents the maximum value of the abscissas of all the pixel points in the image block; y represents the ordinate of the key point; y_min represents the minimum value of the ordinates of all the pixel points in the image block, and y_max represents the maximum value of the ordinates of all the pixel points in the image block.
In this optional embodiment, the virtual anchor frame of each image block may be set according to the coordinates of the key points and a preset width-height parameter, where the width-height parameter includes a width parameter w and a height parameter h, and for example, when the coordinates of the key points of a certain image block are [ x, y ], and the width parameter w corresponding to the key points is 10 and the height parameter h is 20, the coordinates of the geometric center of the virtual anchor frame corresponding to the image block are [ x, y ], and the width of the virtual anchor frame corresponding to the image block is 10 and the height is 20.
In this optional embodiment, a label may be set for each virtual anchor frame according to a preset labeling tool, and in this embodiment, the labeling tool may be a labelme tool, and the function of the labeling tool is to give a label to a pixel point in an image. And setting the category label of each key point by using the labelme tool, and taking the category label of each key point as the category labels of all pixel points in the virtual anchor frame corresponding to the key point.
In this alternative embodiment, a transcoded image with multiple key points may be used as a marker image, each marker image includes at least one key point, each key point corresponds to a virtual anchor frame, and the categories of all pixel points in the virtual anchor frame are the same, for example, the categories may include key parts of a dog face such as [ ear, eye, nose ], and the like.
In this optional embodiment, the marker image has a plurality of key points, and each key point corresponds to a category label and a set of width and height parameters. In this scheme, the probability that the pixel point at coordinates [x, y] in the marker image belongs to a key point of category i is denoted as p_{i,x,y}. Illustratively, when the pixel point at coordinates [2,3] in the marker image is a key point of the category 'ear', then p_{ear,2,3} = 1.
Illustratively, fig. 5 is a schematic diagram of the marker image.
In an alternative embodiment, the training unit 113 is configured to train a dog face key point detection model based on the labeled images and the transcoded images.
In an optional embodiment, the training of the dog face keypoint detection model based on the labeled image and the transcoded image comprises:
taking the transcoded image as a sample image, taking the marked image as a label image, and storing the sample image and the label image to obtain a training data set;
constructing an initial dog face key point detection model, wherein the initial dog face key point detection model comprises an encoder and a decoder;
and sequentially inputting the sample images into the initial dog face key point detection model to obtain a detection result, and updating parameters of the initial dog face key point detection model according to a preset loss function to obtain the dog face key point detection model.
In this alternative embodiment, all the transcoded images may be used as sample images, and the label image corresponding to each transcoded image may be used as a label image, and further, all the sample images and the label image may be stored to obtain a training data set.
In this optional embodiment, the encoder may be an existing neural network structure such as ResNet (residual network), DLA (deep layer aggregation network), or Hourglass (stacked hourglass network); the decoder includes a first decoder, a second decoder and a third decoder, and each decoder may be an existing feature extraction network such as CNN (convolutional neural network) or R-CNN (region-based convolutional neural network).
In this alternative embodiment, the input data and the output data of the encoder and the decoder include:
the input of the encoder is a sample image, the output of the encoder is a feature map, and the size of the feature map is the same as that of the sample image;
the input of the decoder is the feature map; the output of the first decoder is a plurality of predicted heat maps corresponding to the sample image, all the predicted heat maps have the same size as the sample image, each predicted heat map includes at least one key point, all the key points in each predicted heat map have the same category, and the value of each pixel point in each predicted heat map represents the probability that the pixel point belongs to a key point of the category corresponding to that heat map; illustratively, when the label of a predicted heat map is 'ear' and there is a pixel point with coordinates [1,2] and a pixel value of 1 in the predicted heat map, the probability that the pixel point is a key point of the category 'ear' is 1; in this embodiment, the probability value of each pixel point in the predicted heat map may be recorded as p̂_{i,x,y}, which represents the probability that the pixel point at coordinates [x, y] in the predicted heat map belongs to a key point of category i;
the output of the second decoder is a predicted coordinate, the predicted coordinate being the coordinate of a key point in the sample image predicted by the second decoder, and the predicted coordinate may be recorded as (x̂, ŷ);
the output of the third decoder is a predicted width and height parameter, the predicted width and height parameter being the width and height of the virtual anchor frame corresponding to each key point in the sample image predicted by the third decoder; in this scheme, the predicted width and height parameters may be recorded as ŵ and ĥ, wherein ŵ represents the predicted width parameter output by the third decoder and ĥ represents the predicted height parameter output by the third decoder.
Illustratively, fig. 6 is a schematic structural diagram of the initial dog face key point detection model.
In this optional embodiment, in order to ensure that the output of the dog face key point detection model is as similar as possible to the tag image, it is necessary to perform iterative training on the initial dog face key point detection model according to a preset loss function to update parameters of the initial dog face key point detection model, where the preset loss function includes a heat map loss, a displacement loss, and a regression loss, and the heat map loss satisfies the following relation:
wherein L_heatmap represents the heat map loss, and a smaller heat map loss value indicates that the key points in the predicted heat map are more similar to the key points in the sample image; N represents the number of key point categories in the label image; n_x represents the width of the predicted heat map, namely the number of coordinate points in the row direction of the predicted heat map; n_y represents the height of the predicted heat map, namely the number of coordinate points in the column direction of the predicted heat map; x and y represent the coordinates of a pixel point in the heat map; α and β represent preset blending parameters, and α = 2 and β = 4 may be set in this embodiment; p̂_{i,x,y} represents the probability that the pixel point with coordinates (x, y) in the predicted heat map belongs to a key point of category i; illustratively, p̂_{ear,2,3} = 1 indicates that the probability that the pixel point at coordinates [2,3] in the heat map belongs to a key point of the 'ear' category is 1, and p̂_{nose,50,50} = 0.8 indicates that the probability that the pixel point at coordinates [50,50] in the heat map belongs to a key point of the 'nose' category is 0.8.
In this alternative embodiment, a regression loss function may be constructed according to the width-height parameter and the predicted width-height parameter in the label image, where the regression loss function satisfies the following relation:
wherein L_regression represents the regression loss, the regression loss is used for representing the difference between the key point range size in the predicted heat map and the key point range size in the label image, and a smaller regression loss value indicates that the predicted width and height parameters are more similar to the width and height parameters in the label image; M represents the number of key points in the label image; w and h respectively represent the width and height parameters corresponding to the key point of the k-th category in the label image; and ŵ and ĥ represent the predicted width and height parameters corresponding to the predicted k-th key point.
In this alternative embodiment, the displacement loss function satisfies the following relationship:
wherein L_offset represents the loss value of the displacement loss function, and a smaller displacement loss value indicates that the predicted key point coordinates are closer to the coordinates of the key points in the label image; M represents the number of key points in the label image, and i represents the index of a key point in the label image; and d_i represents the Euclidean distance between the predicted coordinates and the coordinates of the i-th key point in the label image.
In this alternative embodiment, the overall loss value of the initial dog face keypoint detection model may be calculated according to the heat map loss, the regression loss, and the displacement loss, and the overall loss value may be calculated in a manner that satisfies the following relation:
Loss = L_regression + A×L_heatmap + B×L_offset
wherein Loss represents the total loss value, and a smaller total loss value indicates that the output of the initial dog face key point detection model is more similar to the label image and that the performance of the initial dog face key point detection model is better; L_regression represents the regression loss value; L_heatmap represents the heat map loss value; L_offset represents the displacement loss value; and A and B represent preset weighting parameters, where, according to experience obtained through multiple experiments, A may be 2 and B may be 4.
In this optional embodiment, the sample image may be sequentially input into the initial dog face key point detection model to obtain a key point detection result corresponding to the sample image, where the detection result includes a plurality of heatmaps, coordinates of each key point, and a width and height parameter corresponding to each key point. Further, the overall loss value may be calculated according to the label image corresponding to the sample image and the detection result, the overall loss value is used to characterize the difference between the label image and the prediction result, and a smaller overall loss value indicates that the detection result is more similar to the label image.
In this optional embodiment, parameters of the initial dog face key point detection model may be iteratively updated by using a gradient descent method, until the total loss value is smaller than a preset termination threshold value, the iteration is stopped to obtain the dog face key point detection model, and the termination threshold value may be 0.001 according to experience obtained through multiple experiments.
In an alternative embodiment, the detection unit 114 is configured to input the image to be detected into the dog face key point detection model to obtain a detection result.
In an optional embodiment, the inputting the image to be detected into the dog face key point detection model to obtain a detection result includes:
inputting the image to be detected into the dog face key point detection model to obtain a plurality of prediction heat maps, the category of key points in each prediction heat map and prediction width and height parameters corresponding to each key point;
and dividing a real anchor frame in the image to be detected according to the category of the key points in the prediction heat map and the prediction width and height parameters, and taking the real anchor frame as a detection result.
In this alternative embodiment, a dog face image to be detected may be input into the dog face key point detection model to obtain a plurality of prediction heat maps and prediction width and height parameters, each prediction heat map corresponds to one prediction category tag, the prediction category tag is a category of all key points in the heat map, each prediction heat map includes at least one key point, and the prediction width and height parameters are used to characterize a range of key portions in the dog face image, as an example, as shown in fig. 7, a schematic diagram of the detection result is shown.
In this optional embodiment, the pixel points at the same position in the image to be detected can be searched for as the predicted key points of the image to be detected according to the pixel points with the pixel value of 1 in the predicted heat map, and the category of the heat map is used as the category of the key points. Further, dividing a real anchor frame in the image to be detected according to the prediction width and height parameters and the prediction key points, and taking the real anchor frame as a detection result, wherein the real anchor frame is used for representing that all pixel points in the anchor frame belong to corresponding categories.
According to the dog face key point detection method based on artificial intelligence, the transcoding image is obtained by transcoding the dog face image, the transcoding image is divided to obtain a plurality of image blocks, the key points of each image block are calculated, data labeling is carried out to obtain the labeled image, the dog face key point detection model is trained based on the labeled image and the transcoding image, key points in the image can be automatically labeled, the efficiency of model training is improved while the labeling cost is reduced, the spatial characteristics are applied in the model training process, the generalization capability of the model is improved, and the accuracy of dog face key point detection is improved.
Fig. 3 is a schematic structural diagram of an electronic device according to an embodiment of the present application. The electronic device 1 comprises a memory 12 and a processor 13. The memory 12 is used for storing computer readable instructions, and the processor 13 is used for executing the computer readable instructions stored in the memory to implement the artificial intelligence based dog face key point detection method of any one of the above embodiments.
In an alternative embodiment, the electronic device 1 further comprises a bus and a computer program stored in the memory 12 and executable on the processor 13, such as an artificial intelligence based dog face key point detection program.
Fig. 3 only shows the electronic device 1 with components 12-13, and it will be understood by a person skilled in the art that the structure shown in fig. 3 does not constitute a limitation of the electronic device 1, which may comprise fewer or more components than shown, or a combination of certain components, or a different arrangement of components.
Referring to fig. 1, the memory 12 of the electronic device 1 stores a plurality of computer-readable instructions to implement an artificial intelligence based method for detecting key points on a dog face, and the processor 13 can execute the plurality of instructions to implement:
transcoding the historical dog face image to obtain a transcoded image;
segmenting the transcoding image to obtain a plurality of image blocks;
calculating key points of each image block, and marking the transcoded images according to the key points to obtain marked images;
training a dog face key point detection model based on the marked image and the transcoding image;
and inputting the image to be detected into the dog face key point detection model to obtain a detection result.
Specifically, for the specific implementation method of the above instructions by the processor 13, reference may be made to the description of the relevant steps in the embodiment corresponding to fig. 1, which is not repeated herein.
It will be understood by those skilled in the art that the schematic diagram is merely an example of the electronic device 1 and does not constitute a limitation of the electronic device 1; the electronic device 1 may have a bus-type structure or a star-type structure, and the electronic device 1 may further include more or less hardware or software than that shown in the figures, or a different arrangement of components; for example, the electronic device 1 may further include an input and output device, a network access device, etc.
It should be noted that the electronic device 1 is only an example, and other existing or future electronic products that may be adapted to the present application should also be included in the scope of protection of the present application and are included herein by reference.
Memory 12 includes at least one type of readable storage medium, which may be non-volatile or volatile. The readable storage medium includes flash memory, removable hard disks, multimedia cards, card type memory (e.g., SD or DX memory, etc.), magnetic memory, magnetic disks, optical disks, etc. Thememory 12 may in some embodiments be an internal storage unit of theelectronic device 1, for example a removable hard disk of theelectronic device 1. Thememory 12 may also be an external storage device of theelectronic device 1 in other embodiments, such as a plug-in mobile hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, a flash memory card (FlashCard), and the like, provided on theelectronic device 1. Further, thememory 12 may also include both an internal storage unit and an external storage device of theelectronic device 1. Thememory 12 can be used not only to store application software installed in theelectronic device 1 and various types of data, such as codes of an artificial intelligence-based dog face key point detection program, etc., but also to temporarily store data that has been output or is to be output.
The processor 13 may, in some embodiments, be composed of an integrated circuit, for example a single packaged integrated circuit, or of a plurality of packaged integrated circuits with the same or different functions, including one or more central processing units (CPUs), microprocessors, digital processing chips, graphics processors, combinations of various control chips, and the like. The processor 13 is the control unit of the electronic device 1: it connects the various components of the electronic device 1 through various interfaces and lines, and executes the various functions of the electronic device 1 and processes its data by running or executing the programs or modules stored in the memory 12 (for example, the artificial intelligence based dog face key point detection program) and calling the data stored in the memory 12.
The processor 13 executes the operating system of the electronic device 1 and the various installed application programs. By executing the application programs, the processor 13 implements the steps of the above embodiments of the artificial intelligence based dog face key point detection method, such as the steps shown in fig. 1.
Illustratively, the computer program may be divided into one or more modules/units, which are stored in the memory 12 and executed by the processor 13 to complete the present application. The one or more modules/units may be a series of computer-readable instruction segments capable of performing specific functions, and the instruction segments describe the execution of the computer program in the electronic device 1. For example, the computer program may be divided into a transcoding unit 110, a segmentation unit 111, a labeling unit 112, a training unit 113 and a detection unit 114.
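For illustration only, one possible way to partition the program into the units named above is sketched below; the class interfaces and method signatures are assumptions of this example rather than the embodiment's actual code.

```python
# Sketch of one possible partition of the program into the five named units;
# the interfaces shown are assumptions of this example, not the embodiment's code.
class TranscodingUnit:     # transcoding unit 110
    def run(self, historical_images): ...

class SegmentationUnit:    # segmentation unit 111
    def run(self, transcoded_image): ...

class LabelingUnit:        # labeling unit 112
    def run(self, image_blocks, transcoded_image): ...

class TrainingUnit:        # training unit 113
    def run(self, marked_images, transcoded_images): ...

class DetectionUnit:       # detection unit 114
    def run(self, image_to_detect, model): ...

# The processor 13 would execute the units in sequence to carry out the method steps.
UNITS = (TranscodingUnit, SegmentationUnit, LabelingUnit, TrainingUnit, DetectionUnit)
```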
An integrated unit implemented in the form of a software functional module may be stored in a computer-readable storage medium. The software functional module is stored in a storage medium and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device) or a processor to execute parts of the artificial intelligence based dog face key point detection method according to the embodiments of the present application.
The integrated modules/units of the electronic device 1 may be stored in a computer-readable storage medium if they are implemented in the form of software functional units and sold or used as separate products. Based on such understanding, all or part of the processes in the methods of the embodiments described above may be implemented by a computer program, which may be stored in a computer-readable storage medium and executed by a processor to implement the steps of the embodiments of the methods described above.
The computer program comprises computer program code, which may be in the form of source code, object code, an executable file, or some intermediate form. The computer-readable medium may include any entity or device capable of carrying the computer program code, a recording medium, a USB flash disk, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM), a random access memory (RAM), and the like.
Further, the computer-readable storage medium may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function, and the like; the storage data area may store data created according to the use of the blockchain node, and the like.
The blockchain referred to in the present application is a novel application mode of computer technologies such as distributed data storage, peer-to-peer transmission, consensus mechanisms and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks associated with one another by cryptographic methods, where each data block contains the information of a batch of network transactions and is used for verifying the validity (anti-counterfeiting) of that information and for generating the next block. The blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
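Purely as an illustration of the chaining described above (and not part of the claimed detection method), each block may carry the hash of its predecessor, so that altering any earlier block breaks the link to every later one; the field names below are assumptions of this example.

```python
# Minimal hash-chained block illustration; field names are assumptions and this
# is not part of the claimed detection method.
import hashlib
import json

def make_block(transactions, prev_hash):
    """Create a block whose hash covers its transactions and the previous block's hash."""
    body = {"transactions": transactions, "prev_hash": prev_hash}
    digest = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
    return {**body, "hash": digest}

genesis = make_block(["tx0"], prev_hash="0" * 64)
block_1 = make_block(["tx1", "tx2"], prev_hash=genesis["hash"])
# Changing genesis would change its hash and thereby invalidate block_1's prev_hash link.
```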
The bus may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like, and may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, only one arrow is shown in fig. 3, but this does not mean that there is only one bus or only one type of bus. The bus is arranged to enable communication between the memory 12, the at least one processor 13 and other components.
Although not shown, the electronic device 1 may further include a power supply (such as a battery) for supplying power to the various components. Preferably, the power supply may be logically connected to the at least one processor 13 through a power management device, so that functions such as charge management, discharge management and power consumption management are implemented through the power management device. The power supply may also include one or more DC or AC power sources, a recharging device, a power failure detection circuit, a power converter or inverter, a power status indicator, and the like. The electronic device 1 may further include various sensors, a Bluetooth module, a Wi-Fi module and the like, which are not described herein again.
Further, the electronic device 1 may include a network interface; optionally, the network interface may include a wired interface and/or a wireless interface (such as a Wi-Fi interface, a Bluetooth interface, etc.), which is generally used for establishing a communication connection between the electronic device 1 and other electronic devices.
Optionally, the electronic device 1 may further comprise a user interface, which may include a display and an input unit (such as a keyboard); optionally, the user interface may also include a standard wired interface or a wireless interface. In some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch device, or the like. The display, which may also be referred to as a display screen or display unit, is used for displaying information processed in the electronic device 1 and for displaying a visualized user interface.
The embodiment of the present application further provides a computer-readable storage medium (not shown), in which computer-readable instructions are stored, and the computer-readable instructions are executed by a processor in an electronic device to implement the method for detecting a key point of a dog face based on artificial intelligence according to any of the above embodiments.
It is to be understood that the described embodiments are for purposes of illustration only and that the scope of the appended claims is not limited to such structures.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the modules is only one logical functional division, and other divisions may be realized in practice.
The modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, functional modules in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional module.
Furthermore, it is obvious that the word "comprising" does not exclude other elements or steps, and the singular does not exclude the plural. A plurality of units or means recited in the specification may also be implemented by one unit or means through software or hardware. The terms first, second, etc. are used to denote names, but not any particular order.
Finally, it should be noted that the above embodiments are only used for illustrating the technical solutions of the present application and not for limiting, and although the present application is described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications or equivalent substitutions can be made on the technical solutions of the present application without departing from the spirit and scope of the technical solutions of the present application.