Disclosure of Invention
To address the shortcomings of the prior art, the invention provides an image target detection method, an image target detection system, and image target detection equipment that complete the target detection task using only an image classifier, without any training.
According to an aspect of the present invention, there is provided an image object detection method including:
adopting a pre-trained deep learning convolutional neural network as an image classifier;
inputting an image to be detected into the pre-trained image classifier to generate a final feature map, and classifying each pixel on the feature map;
processing pixels of the feature map, and distinguishing whether each pixel represents a foreground or a background;
and calculating the connected domains of the foreground pixels to obtain, within each connected domain, a prediction frame, i.e., the position information of a target.
Optionally, the pre-trained convolutional neural network is obtained from a standard convolutional neural network by removing its final global average pooling layer and converting its fully connected layer into a convolutional layer, which then serves as the image classifier.
The method can directly realize the detection function with the pre-trained image classifier, without any training, avoiding dependence on computing power and time.
Optionally, processing the pixels of the feature map to distinguish whether each pixel represents foreground or background includes: determining whether a pixel represents foreground or background based on the variance of its feature vector.
Optionally, after determining whether a pixel represents foreground or background according to the variance of its feature vector, the method further comprises:
filtering out background pixels again according to the pixel classification result;
and filtering out background pixels again according to the confidence level.
Optionally, calculating the connected domain of the foreground pixel to obtain a prediction frame in the connected domain includes:
calculating connected domains for all the foreground pixels;
limiting the number of connected domains with upper and lower bounds, and retaining the connected domains within the bounds;
and calculating barycentric (centroid) coordinates on each connected domain to position a prediction frame, while using the aspect ratio of the connected domain as the aspect ratio of the prediction frame, thereby obtaining on each connected domain the prediction frame, i.e., the position information of the target.
Optionally, the method further includes post-processing a prediction box in the connected domain, the post-processing including:
re-sending the obtained prediction frames into the image classifier for classification, and filtering out the prediction frames classified as background.
Optionally, the post-processing further includes:
filtering out small prediction frames based on the length and width of the largest prediction frame;
and performing a non-maximum suppression operation on the prediction frames, retaining the highest-confidence frame among frames with a high overlap rate, so as to ensure that only one optimal prediction frame is retained for each object.
According to a second aspect of the present invention, there is provided an image object detection system comprising:
the feature map acquisition module, which inputs the image to be detected into the pre-trained image classifier to generate a final feature map and classifies each pixel on the feature map, the image classifier adopting a deep learning convolutional neural network;
the foreground and background distinguishing module is used for processing the pixels of the feature map obtained by the feature map obtaining module and distinguishing whether each pixel represents a foreground or a background;
and the target position locating module, which calculates the connected domains of the foreground pixels obtained by the foreground and background distinguishing module to obtain, within each connected domain, a prediction frame, i.e., the position information of a target.
According to a third aspect of the present invention, there is provided an image object detection apparatus comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, is operable to perform any of the image object detection methods above.
Compared with the prior art, the embodiment of the invention has at least one of the following beneficial effects:
the method, system, and equipment provided by the invention are independent of model training: the detection function is realized directly with the pre-trained classifier, so detection targets can be obtained simply and directly without any training, avoiding dependence on computing power and time.
In the method, system, and equipment, extraction of candidate regions is based on deep learning and is simple and efficient.
The method, system, and equipment do not depend on massive labeled data, avoiding the labor and time costs of labeling data and training a model.
Detailed Description
The following describes embodiments of the present invention in detail. The embodiments are implemented on the premise of the technical scheme of the invention, and detailed implementation modes and specific operation processes are given. It should be noted that those skilled in the art can make variations and modifications without departing from the spirit of the invention, and these fall within the scope of the invention. Portions of the following embodiments not described in detail may be implemented using the prior art.
Fig. 1 is a flowchart of an image object detection method according to an embodiment of the invention. Referring to fig. 1, this embodiment may include the steps of:
s100, adopting a pre-trained deep learning convolutional neural network as an image classifier;
s200, inputting an image to be detected into a pre-trained image classifier, generating a final feature map, and classifying each pixel on the feature map;
s300, processing pixels of the feature map, and distinguishing whether each pixel represents a foreground or a background;
s400, calculating the connected domains of the foreground pixels to obtain, within each connected domain, a prediction frame, i.e., the position information of a target.
In the above embodiment, the pre-trained image classifier adopts a convolutional neural network; for example, a ResNet pre-trained on the ImageNet data set can be used, with its final global average pooling layer removed and its fully connected layer converted into a 1×1 convolutional layer (whose weights are those of the original fully connected layer). Of course, other embodiments may employ other convolutional neural networks and are not limited to the ResNet networks described above.
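The following NumPy sketch illustrates why this conversion preserves the classifier's behavior; the shapes and random weights are arbitrary stand-ins for a real network's head, not values prescribed by the invention. Since both global average pooling and a linear layer are linear operations, applying the fully connected weights as a 1×1 convolution at every pixel and then averaging reproduces the original classification output.

```python
import numpy as np

# Hypothetical shapes for illustration: a feature map with C channels over an
# H x W grid, and a fully connected layer with weights w_fc (num_classes x C).
rng = np.random.default_rng(0)
C, H, W, num_classes = 8, 4, 4, 3
feat = rng.normal(size=(C, H, W))
w_fc = rng.normal(size=(num_classes, C))
b_fc = rng.normal(size=(num_classes,))

# Original classifier head: global average pooling, then the FC layer.
pooled = feat.mean(axis=(1, 2))                      # (C,)
logits_cls = w_fc @ pooled + b_fc                    # (num_classes,)

# Converted head: the same weights applied as a 1x1 convolution, i.e. the FC
# layer evaluated independently at every spatial position of the feature map.
logits_map = np.einsum("kc,chw->khw", w_fc, feat) + b_fc[:, None, None]

# Averaging the per-pixel logits reproduces the original classification output,
# while logits_map itself provides the per-pixel class scores used for detection.
assert np.allclose(logits_map.mean(axis=(1, 2)), logits_cls)
```

The per-pixel logits `logits_map` are what the subsequent steps treat as the feature map for pixel-wise classification.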
The embodiment of the invention can directly realize the detection function with the pre-trained image classifier, without any training, avoiding dependence on computing power and time, and is applicable and extensible to other data sets.
As a preferred embodiment, the pixels of the feature map are processed in S300 to distinguish whether each pixel represents foreground or background; whether a pixel represents foreground or background can be determined according to the variance of its feature vector. Preferably, after this determination, further filtering may be performed: filtering out background pixels again according to the pixel classification result, and again according to the confidence level. This makes the target detection result more accurate and reduces the computation and time needed in subsequent detection.
As a preferred embodiment, S400 may include: calculating connected domains for all foreground pixels; limiting the number of connected domains with upper and lower bounds and retaining the connected domains within the bounds; and calculating barycentric (centroid) coordinates on each connected domain to position a prediction frame, while using the aspect ratio of the connected domain as the aspect ratio of the prediction frame, thereby obtaining on each connected domain the prediction frame, i.e., the position information of the target. This embodiment performs pixel-level classification on the final feature map and rapidly locates the approximate positions of candidate frames, which is simple and effective.
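The connected-domain step can be sketched as follows in pure Python. The 4-connectivity, BFS labeling, and the per-domain pixel-count bounds `lo`/`hi` are illustrative choices, not values fixed by the invention; the box is characterized by the domain's centroid together with its width and height (whose ratio gives the aspect ratio).

```python
from collections import deque

# Label 4-connected foreground regions in a boolean mask, keep those whose
# pixel count lies within [lo, hi] (illustrative bounds), and derive a box
# description from each region's centroid and extent.
def connected_regions(mask, lo=1, hi=10**6):
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    regions = []
    for i in range(h):
        for j in range(w):
            if mask[i][j] and not seen[i][j]:
                q, pixels = deque([(i, j)]), []
                seen[i][j] = True
                while q:                             # BFS over the region
                    y, x = q.popleft()
                    pixels.append((y, x))
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < h and 0 <= nx < w and mask[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            q.append((ny, nx))
                if lo <= len(pixels) <= hi:          # keep regions within bounds
                    regions.append(pixels)
    return regions

def region_to_box(pixels):
    ys = [p[0] for p in pixels]
    xs = [p[1] for p in pixels]
    cx = sum(xs) / len(xs)                           # centroid (barycenter)
    cy = sum(ys) / len(ys)
    width = max(xs) - min(xs) + 1                    # aspect ratio = width / height
    height = max(ys) - min(ys) + 1
    return cx, cy, width, height
```

For example, `connected_regions([[1, 1, 0, 0], [1, 1, 0, 1], [0, 0, 0, 1]])` yields two regions, and `region_to_box` on the first gives centroid (0.5, 0.5) with a 2×2 extent.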
As a preferred embodiment, the obtained target position information may be further post-processed: the obtained prediction frames are re-sent to the image classifier for classification, and the prediction frames classified as background are filtered out, making the result more accurate. After filtering out the prediction frames classified as background, small prediction frames can be filtered out based on the largest prediction frame, and a non-maximum suppression operation can be performed on the prediction frames, retaining the highest-confidence frame among frames with a high overlap rate, so as to ensure that only one optimal prediction frame is retained for each object.
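The secondary classification filter can be sketched as below. The `classify` callable is a hypothetical stand-in for the image classifier (it takes a cropped region and returns a class id and score), and the sentinel `BACKGROUND` class id is likewise an assumption for illustration.

```python
BACKGROUND = -1  # illustrative background class id (an assumption, not from the text)

def filter_by_reclassification(image, boxes, classify):
    """Re-classify each predicted box and drop those classified as background.

    image: 2D list of pixel values; boxes: (x1, y1, x2, y2) tuples;
    classify: hypothetical callable crop -> (class_id, score).
    """
    kept = []
    for (x1, y1, x2, y2) in boxes:
        crop = [row[x1:x2] for row in image[y1:y2]]   # crop the predicted box
        cls, score = classify(crop)
        if cls != BACKGROUND:                          # drop background boxes
            kept.append(((x1, y1, x2, y2), score))    # score reused as confidence
    return kept
```

The returned score can then serve as the updated confidence of each surviving prediction frame, as described for the subsequent non-maximum suppression.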
Fig. 2 is a flowchart of an image object detection method according to a preferred embodiment of the invention. In order to better illustrate the method of the present invention, a specific operational description is given below in connection with fig. 2.
Referring to fig. 2, a method for detecting an image object in a preferred embodiment of the present invention is shown. Specifically, the image target detection method may include the steps of:
step 1: modifying a pre-trained image classifier by removing the final global average pooling layer of the deep learning convolutional neural network and converting the fully connected layer into a 1×1 convolutional layer (whose weights are those of the original fully connected layer), which serves as the image classifier;
step 2: constructing a scale pyramid for the image to be detected, and scaling the image according to the current scale in each scale space;
step 3: inputting each image processed in step 2 into the pre-trained image classifier to obtain a feature map of the image to be detected, and computing a confidence for each category at each pixel of the feature map for subsequent classification;
step 4: distinguishing foreground from background on the feature map obtained in step 3 according to the computed confidences, and selecting the foreground pixels;
step 5: calculating connected domains for all foreground pixels obtained in step 4; limiting the number of connected domains with upper and lower bounds and retaining the connected domains within the bounds; calculating barycentric coordinates on each connected domain to position a prediction frame, while using the aspect ratio of the connected domain as the aspect ratio of the prediction frame, thereby computing the prediction frame on each connected domain;
step 6: re-sending the obtained prediction frames into the image classifier for classification, filtering out those classified as background, and updating the confidence of each prediction frame to its classification score (used for non-maximum suppression); the classification score represents the probability that an object (detection target) is in the prediction frame, and the larger the value, the more credible the classification;
step 7: mapping the size of the obtained prediction frame on each scale space to the original image scale space, and fusing to obtain the final target position information.
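The multi-scale handling in steps 2 and 7 can be sketched as follows. The scale factors and the nearest-neighbor resizing are illustrative assumptions; any resizing scheme would do, and the essential point is that a box detected at scale `s` maps back to the original image by dividing its coordinates by `s`.

```python
import numpy as np

def scale_pyramid(image, scales=(1.0, 0.5, 0.25)):
    """Build a scale pyramid. image: (H, W) array; returns [(scale, resized), ...]."""
    h, w = image.shape
    pyramid = []
    for s in scales:
        nh, nw = max(1, int(h * s)), max(1, int(w * s))
        ys = (np.arange(nh) / s).astype(int).clip(0, h - 1)
        xs = (np.arange(nw) / s).astype(int).clip(0, w - 1)
        pyramid.append((s, image[np.ix_(ys, xs)]))    # nearest-neighbor resize
    return pyramid

def box_to_original(box, scale):
    """Map a box (x1, y1, x2, y2) from a scaled space back to the original image."""
    return tuple(v / scale for v in box)
```

Boxes from all scale spaces, once mapped back with `box_to_original`, can then be fused and post-processed together as described in step 7.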
In the above preferred embodiment, the step 4 may be performed according to the following steps:
step 4.1: calculating the variance of the feature vector at each pixel of the feature map and treating pixels whose variance is below a threshold as background; the threshold is determined by the specific data set: sample a large number of foreground and background feature-vector variances from the data set, and choose a value that approximately separates foreground from background as the threshold;
step 4.2: calculating the classification results on the remaining pixels of the feature map, and treating pixels whose predicted class does not belong to the target class set as background;
step 4.3: calculating the confidence on the remaining pixels of the feature map, and treating pixels whose confidence is below a certain threshold as background; the choice of threshold is determined by the specific data set, and 0.7 is commonly used.
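Steps 4.1 through 4.3 can be sketched as a single cascaded mask computation. The variance threshold and the foreground class set below are illustrative assumptions (the text specifies only that thresholds are data-set dependent, with 0.7 as a common confidence threshold); the feature map here is taken to hold per-pixel class logits, with softmax giving the per-class confidences.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def foreground_mask(feat_map, var_thresh=0.5, conf_thresh=0.7,
                    foreground_classes=(0, 1)):
    """feat_map: (H, W, C) per-pixel class logits. Returns a boolean (H, W) mask.

    var_thresh and foreground_classes are illustrative assumptions.
    """
    # Step 4.1: pixels whose feature vector has low variance are background.
    mask = feat_map.var(axis=-1) >= var_thresh
    # Step 4.2: filter again by the predicted class.
    probs = softmax(feat_map)
    pred = probs.argmax(axis=-1)
    mask &= np.isin(pred, list(foreground_classes))
    # Step 4.3: filter again by confidence (0.7 is the commonly used threshold).
    mask &= probs.max(axis=-1) >= conf_thresh
    return mask
```

Each stage only removes pixels, so the cascade can be evaluated in any order with the same result; evaluating the cheap variance test first simply saves work in a practical implementation.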
In the above preferred embodiment, the step 7 may be performed according to the following steps:
step 7.1: after the prediction frames are fused, dividing all prediction frames into different regions according to their overlap ratio, and within each region filtering out the relatively small prediction frames, taking the size of the largest prediction frame as the reference;
step 7.2: finally, through a non-maximum suppression module, retaining the highest-confidence prediction frame among frames with a high overlap rate, so as to ensure that only one optimal prediction frame is retained for each object.
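The two post-processing filters of steps 7.1 and 7.2 can be sketched as below. The 0.5 size ratio and the 0.5 IoU threshold are illustrative assumptions; boxes are given as (x1, y1, x2, y2).

```python
def box_area(b):
    return (b[2] - b[0]) * (b[3] - b[1])

# Step 7.1 (sketch): drop boxes much smaller than the largest box.
def filter_small(boxes, scores, ratio=0.5):
    max_area = max(box_area(b) for b in boxes)
    keep = [i for i, b in enumerate(boxes) if box_area(b) >= ratio * max_area]
    return [boxes[i] for i in keep], [scores[i] for i in keep]

def iou(a, b):
    """Intersection-over-union of two boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    return inter / (box_area(a) + box_area(b) - inter)

# Step 7.2: non-maximum suppression — among highly overlapping boxes,
# retain only the highest-confidence one.
def nms(boxes, scores, iou_thresh=0.5):
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_thresh for j in keep):
            keep.append(i)
    return keep
```

For instance, given two heavily overlapping boxes with scores 0.9 and 0.8 plus one tiny distant box, `filter_small` drops the tiny box and `nms` collapses the overlapping pair to the 0.9-score box, leaving one optimal prediction frame per object.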
The preferred embodiment shown in fig. 2 is independent of model training; the detection function is realized directly with the pre-trained classifier, so detection targets can be obtained simply and directly without any training, avoiding dependence on computing power and time. Further, more accurate results can be obtained through the screening of foreground pixels and the filtering of prediction frames.
FIG. 3 is a block diagram of an exemplary image object detection system according to the present invention. Referring to fig. 3, the image object detection system in this embodiment may be used to implement the image object detection methods of fig. 1 and 2. Specifically, the image target detection system includes: the system comprises a feature map acquisition module, a foreground and background distinguishing module and a target position positioning module, wherein the feature map acquisition module inputs an image to be detected into a pre-trained image classifier to generate a final feature map, and classifies each pixel on the feature map; the image classifier adopts a deep learning convolutional neural network; the foreground and background distinguishing module processes the pixels of the feature map obtained by the feature map obtaining module and distinguishes whether each pixel represents a foreground pixel or a background pixel; the target position positioning module calculates the connected domain of the foreground pixels obtained by the foreground and background distinguishing module to obtain a prediction frame in the connected domain, and then post-processing is carried out to obtain final target position information.
FIG. 4 is a block diagram of an image object detection system according to a preferred embodiment of the present invention. In order to better illustrate the system of the present invention, a description is given below with reference to the specific structure of fig. 4.
Referring to fig. 4, the image object detection system in the preferred embodiment includes: the device comprises an input module, a feature map acquisition module, a foreground and background distinguishing module, a target position positioning module and a post-processing module. Wherein:
an input module: the method is used for inputting an image to be detected and carrying out image preprocessing, and comprises the steps of reading in an RGB channel image and carrying out size scaling on the image according to the scale of the current space;
the feature map acquisition module: processes the image to be detected with a structurally modified pre-trained image classifier to obtain per-pixel scores on the final feature map, which are used for subsequent foreground and background distinction and for the calculation of connected-domain barycentric coordinates; the structural modification of the pre-trained image classifier is as follows: the final global average pooling layer of the neural network is removed, and the fully connected layer is converted into a 1×1 convolutional layer (whose weights are those of the original fully connected layer), which serves as the image classifier;
the foreground and background distinguishing module: processes the obtained feature map and distinguishes whether each pixel represents foreground or background;
the target position locating module: calculates the connected domains of the foreground pixels, and computes within each connected domain the position information of a target, i.e., a prediction frame;
and the post-processing module: performs multi-stage screening on the currently obtained prediction frames to produce the final target detection result.
In this embodiment, the foreground and background distinguishing module includes the following modules:
a background threshold judgment module: judges foreground versus background according to the variance of the feature vector;
a classification threshold judgment module: filters out background pixels again according to the classification result;
a confidence threshold judgment module: filters out background pixels again according to the confidence level.
These three modules distinguish foreground from background from different angles and are applied in cascade; the cascaded processing filters out more background pixels and helps ensure the accuracy of the foreground pixels.
Wherein, the post-processing module comprises the following modules:
the secondary classification filtering module: re-classifies the currently obtained prediction frames with the classifier and filters out a portion of them;
the small-size filtering module: filters out small prediction frames based on the largest prediction frame;
the non-maximum suppression module: retains the highest-confidence prediction frame among frames with a high overlap rate, so that only one optimal prediction frame is retained for each object.
These three modules filter out unsuitable prediction frames from different angles and are applied in cascade; the cascaded processing helps ensure the accuracy of the position and size of the final prediction frames.
The system modules in the above embodiments of the present invention may be implemented by using technologies in corresponding steps of the method, which are not described herein.
Fig. 5 is a diagram showing the detection effect of an embodiment of the present invention when detecting dogs in an image. Confidence values and detection frames are marked in the figure; it can be seen that this application example performs well on both single-target and multi-target detection, and the position and size of the detection frames are reliable.
In another embodiment of the present invention, there is also provided an image object detection apparatus including a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor executing the program being operable to perform the method of any of the above embodiments.
Optionally, the memory is used for storing a program. The memory may include volatile memory, such as random-access memory (RAM), e.g., static random-access memory (SRAM) or double data rate synchronous dynamic random-access memory (DDR SDRAM); the memory may also include non-volatile memory, such as flash memory. The memory is used to store computer programs (e.g., application programs and functional modules implementing the methods described above), computer instructions, etc., which may be stored in one or more memories in a partitioned manner, and the above computer programs, computer instructions, data, etc. may be invoked by the processor.
A processor for executing the computer program stored in the memory to implement the steps in the method according to the above embodiment. Reference may be made in particular to the description of the embodiments of the method described above.
The processor and the memory may be separate structures or may be integrated structures that are integrated together. When the processor and the memory are separate structures, the memory and the processor may be connected by a bus coupling.
It should be noted that, the steps in the method provided by the present invention may be implemented by using corresponding modules, devices, units, etc. in the system, and those skilled in the art may refer to a technical solution of the system to implement the step flow of the method, that is, the embodiment in the system may be understood as a preferred example for implementing the method, which is not described herein.
Those skilled in the art will appreciate that the invention provides a system and its individual devices that can be implemented entirely by logic programming of method steps, in the form of logic gates, switches, application specific integrated circuits, programmable logic controllers, embedded microcontrollers, etc., in addition to the system and its individual devices being implemented in pure computer readable program code. Therefore, the system and various devices thereof provided by the present invention may be considered as a hardware component, and the devices included therein for implementing various functions may also be considered as structures within the hardware component; means for achieving the various functions may also be considered as being either a software module that implements the method or a structure within a hardware component.
Those skilled in the art will appreciate that all of the features disclosed in this specification, as well as all of the processes or units of any apparatus so disclosed, may be combined in any combination, except that at least some of such features and/or processes or units are mutually exclusive.
The foregoing describes specific embodiments of the present invention. It is to be understood that the invention is not limited to the particular embodiments described above, and that various changes and modifications may be made by one skilled in the art within the scope of the claims without affecting the spirit of the invention.