CN114332473B

Movatterモバイル変換

Info

Publication number: CN114332473B
Application number: CN202111154945.5A
Authority: CN
Inventors: 刘文龙
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd
Priority date: 2021-09-29
Filing date: 2021-09-29
Publication date: 2024-09-17
Anticipated expiration: 2041-09-29
Also published as: CN114332473A

Abstract

The application provides a target detection method, a target detection device, computer equipment, a storage medium and a program product, and relates to the technical fields of computer vision, image processing, artificial intelligence and the like. At least two slices are obtained by segmenting an image to be detected, and pixel point characteristics and pixel point context characteristics of each slice are extracted through a characteristic extraction layer to obtain at least two characteristic diagrams; detecting the feature map to obtain defect information of the slice; the feature extraction layer comprises a target network layer for extracting pixel point features of the slice, and through the target network layer, downsampling operation is omitted when the pixel point features of the slice are directly extracted, the detail features of the original image are reserved, and even small targets with smaller area range can be accurately detected; and the feature extraction layer is also designed to extract pixel point features and pixel point context features, so that the robustness of detection of targets with different sizes is improved, and the accuracy of target detection is further improved.

Description

Object detection method, device, computer apparatus, storage medium, and program product

Technical Field

The application relates to the technical fields of computer vision, image processing, artificial intelligence and the like, and relates to a target detection method, a target detection device, computer equipment, a storage medium and a program product.

Background

With the development of technology, computer vision technology is increasingly applied in industrial scenes. For example, in modern industrial manufacturing, products such as machine parts, electronic components, etc. produced in an industrial line inevitably have defects. Therefore, in the quality control link, target detection is generally performed by using the product image to identify product defects.

Disclosure of Invention

The application provides a target detection method, a device, computer equipment, a storage medium and a program product, which can solve the problem of low accuracy and applicability of target detection in the related technology. The technical scheme is as follows:

in one aspect, a method for detecting a target is provided, the method comprising:

determining an image to be detected comprising a target object, and determining at least two slices of the image to be detected;

The method comprises the steps that through a feature extraction layer of a target model, pixel point feature extraction and pixel point context feature extraction are carried out on at least two slices, at least two feature images are obtained, any feature point of the feature images is used for indicating the feature and the context feature of a corresponding pixel point in each slice, the size of an intermediate feature image output by a target network layer of the feature extraction layer is the same as the size of each slice, and the target network layer is used for carrying out pixel point feature extraction on each slice;

And detecting the corresponding feature map of each slice to obtain defect information of each slice, wherein the defect information is used for indicating defects of target objects included in the slices.

In another aspect, there is provided an object detection apparatus including:

The device comprises a determining module, a processing module and a processing module, wherein the determining module is used for determining an image to be detected comprising a target object and determining at least two slices of the image to be detected;

The feature extraction module is used for extracting pixel point features and pixel point context features of the at least two slices through a feature extraction layer of the target model to obtain at least two feature images, any feature point of the feature images is used for indicating the feature and the context feature of a corresponding pixel point in the slice, the size of an intermediate feature image output by a target network layer of the feature extraction layer is the same as the size of the slice, and the target network layer is used for extracting the pixel point features of the slice;

and the detection module is used for detecting the corresponding feature map of each slice to obtain defect information of each slice, wherein the defect information is used for indicating the defects of the target object included in the slice.

In another aspect, a computer device is provided, including a memory, a processor, and a computer program stored on the memory, the processor executing the computer program to implement the above-described target detection method.

In another aspect, a computer readable storage medium is provided, on which a computer program is stored, which when executed by a processor implements the above-described object detection method.

In another aspect, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the above-described object detection method.

The technical scheme provided by the application has the beneficial effects that:

Firstly, cutting an image to be detected to obtain at least two slices, and then extracting pixel point characteristics and pixel point context characteristics of each slice through a characteristic extraction layer of a target model to obtain at least two characteristic images; detecting the feature map to obtain defect information of the slice; resolution limitation is removed, and the method is applicable to the situation of images with any resolution; the feature extraction layer comprises a target network layer for extracting pixel point features of the slice, and an intermediate feature image output by the target network layer is the same as the size of the slice, namely downsampling is omitted when the pixel point features of the slice are extracted, so that the detail features of the original image are reserved, the feature extraction layer is designed to extract the pixel point features and the pixel point context features, the robustness of detection of targets with different sizes is improved, even small targets with smaller area range can be accurately detected, and the accuracy of target detection is further improved.

Drawings

In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings that are required to be used in the description of the embodiments of the present application will be briefly described below.

FIG. 1 is a schematic diagram of an implementation environment of a target detection method according to an embodiment of the present application;

Fig. 2 is a schematic flow chart of a target detection method according to an embodiment of the present application;

FIG. 3 is a schematic diagram of a calculation mode of an overlap ratio according to an embodiment of the present application;

FIG. 4 is a schematic diagram of an offset parameter according to an embodiment of the present application;

FIG. 5 is a schematic diagram of a regression branch network training process according to an embodiment of the present application;

FIG. 6 is a schematic diagram of a detected defect in accordance with an embodiment of the present application;

Fig. 7 is a schematic diagram of target detection based on each network layer according to an embodiment of the present application;

fig. 8 is a schematic structural diagram of an object detection device according to an embodiment of the present application;

fig. 9 is a schematic structural diagram of a computer device according to an embodiment of the present application.

Detailed Description

Embodiments of the present application are described below with reference to the drawings in the present application. It should be understood that the embodiments described below with reference to the drawings are exemplary descriptions for explaining the technical solutions of the embodiments of the present application, and the technical solutions of the embodiments of the present application are not limited.

As used herein, the singular forms "a", "an", "the" and "the" are intended to include the plural forms as well, unless expressly stated otherwise, as understood by those skilled in the art. It will be further understood that the terms "comprises" and "comprising," when used in this specification, specify the presence of stated features, information, data, steps, operations, elements, and/or components, but do not preclude the presence or addition of other features, information, data, steps, operations, elements, components, and/or groups thereof, all of which may be included in the present specification. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element or intervening elements may be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. The term "and/or" as used herein indicates at least one of the items defined by the term, e.g. "a and/or B" indicates implementation as "a", or as "a and B".

For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the embodiments of the present application will be described in further detail with reference to the accompanying drawings.

Artificial intelligence (ARTIFICIAL INTELLIGENCE, AI) is the theory, method, technique, and application system that simulates, extends, and extends human intelligence using a digital computer or a machine controlled by a digital computer, perceives the environment, obtains knowledge, and uses the knowledge to obtain optimal results. In other words, artificial intelligence is an integrated technology of computer science that attempts to understand the essence of intelligence and to produce a new intelligent machine that can react in a similar way to human intelligence. Artificial intelligence, i.e. research on design principles and implementation methods of various intelligent machines, enables the machines to have functions of sensing, reasoning and decision.

The artificial intelligence technology is a comprehensive subject, and relates to the technology with wide fields, namely the technology with a hardware level and the technology with a software level. Artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, electromechanical integration, blockchain, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and other directions.

The Computer Vision technology (CV) Computer Vision is a science of researching how to make a machine "look at", and more specifically, it means to replace a human eye with a camera and a Computer to perform machine Vision such as identifying and measuring on a target, and further perform graphic processing, so that the Computer processing becomes an image more suitable for the human eye to observe or transmit to an instrument for detection. As a scientific discipline, computer vision research-related theory and technology has attempted to build artificial intelligence systems that can acquire information from images or multidimensional data. Computer vision techniques typically include image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D techniques, virtual reality, augmented reality, synchronous positioning, and map construction, among others, as well as common biometric recognition techniques such as face recognition, fingerprint recognition, and others.

Machine learning (MACHINE LEARNING, ML) is a multi-domain interdisciplinary, involving multiple disciplines such as probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and the like. It is specially studied how a computer simulates or implements learning behavior of a human to acquire new knowledge or skills, and reorganizes existing knowledge structures to continuously improve own performance. Machine learning is the core of artificial intelligence, a fundamental approach to letting computers have intelligence, which is applied throughout various areas of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, confidence networks, reinforcement learning, transfer learning, induction learning, teaching learning, and the like.

The application provides a target detection method, and relates to the technical fields of artificial intelligence, machine learning and the like. For example, the techniques of machine simulation, cloud computing and the like in the artificial intelligence technology can be utilized to realize target detection of the image to be detected based on the model obtained by training, for example, the image to be detected is segmented into a plurality of slices by utilizing the target model; extracting pixel point features and context features of each slice by utilizing a feature extraction layer of the target model; and detecting small targets in the high-resolution image by using the feature map of the obtained slice. Of course, the machine learning technique described above may also be utilized to reinforcement learn the initial model with a sample set to obtain a more robust target model.

In modern industrial manufacturing, production efficiency is improved by introducing a production line. But complex processes inevitably lead to the production of defects. These defects are mostly dependent on environmental conditions, and occur probabilistically, requiring statistical analysis of the defects at a later stage. Therefore, the method is an essential link in the modern production process for defect detection and diagnosis of finished products.

In the traditional method, enterprises mostly adopt a manual observation mode to detect defects of products. In this example, there is a problem that the detection cost is large (personnel cost) for an enterprise; for staff, because the defect area is small (detection is difficult), the problems of large working strength and single working content exist, and the staff loss rate is high; for the algorithm, the defect shapes are different in size, and the defect is too large and too small, so that the defect can cause a certain degree of omission, thereby influencing the yield of the actual production line.

Currently FASTER RCNN (Faster Region Convolutional Neural Networks, mask area convolutional neural network) networks are commonly used for defect detection of object images. When FASTER RCNN networks are adopted, images are cut in a size conversion mode of the images, then the images are input into a neural network, and the images are firstly downsampled in the FASTER RCNN networks; and the cutting, downsampling and the like can lead to the loss of small-size defect areas in the image, so that the accuracy of target detection is lower, the service of high-resolution detection requirements cannot be met, and the practicability is lower.

Fig. 1 is a schematic diagram of an implementation environment of a target detection method according to the present application. As shown in fig. 1, the implementation environment may include: a computer device 101. For example, the computer device may be preconfigured with a trained target model and perform target detection on the image to be detected using the target model; alternatively, the computer device 101 may use a plurality of sample training to obtain the target model, so as to use the target model to perform target detection on the image to be detected.

In one possible scenario, as shown in fig. 1, the implementation environment may further include: the image acquisition device 102 can acquire an image to be detected of a target object to be detected, the image to be detected is sent to the computer device 101, the computer device 101 is pre-configured with a trained target model, the computer device 101 performs target detection on the image to be detected by using the target model, and a detection result is sent to the image acquisition device 102. For example, the terminal 102 may be provided with an application program having a target detection function, and the server 101 may be a background server of the application program. The terminal 102 and the server 102 can perform data interaction based on the application program, so as to realize real-time data transmission of the image to be detected and the detection result.

In the scenario shown in fig. 1, the above-mentioned target detection method may be performed in any device such as a server, a terminal, a server cluster, or a cloud computing service cluster. For example, the server or the terminal may have both image acquisition and target detection functions, for example, the server acquires an image to be detected and performs target detection based on the image to be detected and the target model.

The server may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server or a server cluster for providing cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDNs (Content Delivery Network, content delivery networks), basic cloud computing services such as big data and artificial intelligent platforms, and the like. The network may include, but is not limited to: a wired network, a wireless network, wherein the wired network comprises: local area networks, metropolitan area networks, and wide area networks, the wireless network comprising: bluetooth, wi-Fi, and other networks implementing wireless communications. The terminal may be a smart phone (such as an Android Mobile phone, an iOS Mobile phone, etc.), a tablet computer, a notebook computer, a digital broadcast receiver, an MID (Mobile INTERNET DEVICES, mobile internet device), a PDA (personal digital assistant), a desktop computer, a vehicle-mounted terminal (such as a vehicle-mounted navigation terminal, a vehicle-mounted computer, etc.), a smart speaker, a smart watch, etc., and the terminal and the server may be directly or indirectly connected through wired or wireless communication manners, but are not limited thereto. And in particular, the method can be determined based on actual application scene requirements, and is not limited herein.

Fig. 2 is a schematic flow chart of a target detection method according to an embodiment of the present application. The execution body of the method may be a computer device, and the computer device may be a server, a terminal, or any electronic device with a target detection function, which is not limited in particular in the embodiment of the present application. In the embodiment of the application, a server is taken as an example for explanation. As shown in fig. 2, the method includes the following steps.

Step 201, the server determines an image to be detected including the target object.

The target detection method provided by the embodiment of the application can be detection with possible defects on the target object as targets. The image to be detected may include defects of the target object. The server may, for example, obtain the image to be detected from an image acquisition device. Or the server may store the image to be detected in advance. For example, the image to be detected may be an image obtained by capturing a target object to be detected by an image capturing device or a server.

The object detection process of the embodiment of the application may refer to: small target detection targeting a small range of defects. The small target detection may refer to detection of defects of small or relatively small size. In one possible example, small target detection may refer to detection within a range of sizes less than a certain threshold. For example, detection is aimed at defects with dimensions smaller than 32×32. In another possible example, the small target detection may also refer to defect detection within a range where the relative size is less than a certain threshold; for example, detection is aimed at defects whose width and height are respectively lower than one tenth of the width and height of the image to be detected.

It should be noted that the image to be detected may be an image with a higher resolution; that is, the image to be detected may be an image whose resolution exceeds a target resolution threshold and whose size exceeds a target size threshold. For example, it may be a high resolution, large size raw image of the target object acquired by a high definition camera; different from the common target detection method based on the neural network, the target detection method provided by the embodiment of the application has a better detection effect on high-resolution and large-size images. Of course, the image to be detected may also be a low resolution image, and the target detection method of the embodiment of the present application is also applicable to a low resolution image.

In one possible example, the target object may be a 3C (Computer, communication, consumer Electronic, computer, communication and consumer electronics) or an accessory to a 3C product, such as a computer, tablet, cell phone or digital audio player, cell phone camera holder, or the like. For example, in the embodiment of the application, defects in the 3C product accessory can be detected according to the image data of the 3C product accessory acquired by the camera.

Step 202, the server determines at least two slices of the image to be detected.

The server cuts the original image to be detected based on a sliding window slicing mode. The server may slice the image to be detected into a plurality of slices according to a certain size. This step may include: the server cuts the image to be detected in a sliding window mode based on a sliding window of a target step length and a target size to obtain at least two slices of the image to be detected. For example, the size of the target step may be configured in association with the target size, e.g., the target step may be 0.7 times the target size. For example, the image to be detected may be segmented into a number of slices based on the need in actual operation, and the data magnitude of the slices may be tens, hundreds, thousands, or the like. For example, when the resolution of the image to be detected is higher or the accuracy requirement of target detection to be performed is higher, the step length of the slice can be set smaller, and the number of the sliced slices is larger, so that the detection accuracy is improved, and the detection accuracy is further improved.

The image to be detected may be an original image, and the server may segment (PATCH SPLIT) the original image, for example, the server sets a fixed step size stride, configures a sliding window with a window size window-size, and autonomously segments a plurality of slices of the high-resolution original image in a sliding window manner; for example, the fixed step size versus window size may be: stride= (0.7 to 0.8) window-size.

The method has the advantages that the method is suitable for the image to be detected with any resolution, the limitation of the conditions such as the resolution of the original image to be detected is removed, the application range of the method is improved, the applicability is further improved, the detection can be carried out based on the high-resolution image, and the possibility of accurate detection is improved.

And 203, the server extracts pixel point characteristics and pixel point context characteristics of the at least two slices through a characteristic extraction layer of the target model to obtain at least two characteristic diagrams.

The pixel feature extraction may include extracting semantic features of the pixels of the slice, and the process extracted features may include features of the primary semantics of the pixels. The server performs semantic feature extraction and contextual feature extraction on the slices through a feature extraction layer of the target model to obtain a multi-channel feature map. The pixel context feature extraction may include the feature of the pixel itself and an extraction process of the context feature fused with the pixel, and the feature extracted by the process may include the feature of the high-level semantics of the pixel.

Any feature point of the feature map is used for indicating the feature and the context feature of a corresponding pixel point in the slice, the size of an intermediate feature map output by a target network layer of the feature extraction layer is the same as the size of the slice, and the target network layer is used for extracting the pixel point feature of the slice; each slice corresponds to a multi-channel signature. The server can firstly extract pixel point characteristics of the slice through the characteristic extraction layer to obtain at least two slice characteristic diagrams; and extracting the pixel point context characteristics of the at least two slice characteristic images to obtain the at least two characteristic images. For example, the server may first perform pixel feature extraction on the slice through the feature extraction layer to obtain a primary semantic feature map; and extracting context features of the primary semantic feature map to obtain the high-level semantic feature map.

In one possible implementation, the feature extraction layer includes a mini-backbone network and an attention network; this step 203 may be implemented by the following steps 2031-2033.

Step 2031, the server extracts pixel point characteristics of the slice through the micro backbone network to obtain a slice characteristic diagram.

The slice feature map may include primary semantic features of the pixel points. Any feature point of the slice feature map is used for indicating the feature of a corresponding pixel point in the slice, the network layer number of the micro backbone network does not exceed a first threshold, and the convolution kernel number of each network layer does not exceed a second threshold. The target network layer may be located in the mini-backbone network, the size of the middle slice feature map output by the target network layer is the same as the size of the slice, and the target network layer may be a convolution layer for extracting pixel point features of the original slice map. The micro backbone network comprises a plurality of network layers, the server sequentially inputs at least two slices into a first network layer according to the data transmission sequence of the network layers, inputs the output result of the first network layer into a second network layer, and the like until the output result of the last network layer of the micro backbone network serves as a slice characteristic diagram.

The target network layer may be a first network layer of the plurality of network layers. In the first network layer in the mini-backbone network, the downsampling of the slices is omitted. That is, the size of the intermediate slice feature map output by the first network layer is the same as the size of the slice input by the first network layer.

Illustratively, the mini-Backbone network may be TBNet (Tiny Backbone network), the TBNet network may be a lightweight residual network with fewer network layers, fewer convolution kernels and fewer parameters, and the TBNet network has the following structure as shown in table 1 below:

TABLE 1

The slice size may be 640 x 640, the target network layer may be Conv-1 (convolutional layer-1), the slice size of the Conv-1 input may be 640 x 640, and the Conv-1 output size may also be 640 x 640, as shown in table 1. The downsampling operation is omitted in Conv-1, so that the detection performance of a small target is ensured, and the accuracy is improved. The list of network structures represents the structure comprised by each network layer, e.g. Conv-1, the network structure isRepresenting a total of 2 sets of convolution kernels in Conv-1, each set comprising 12 convolution kernels of 3 x 3 size. Of course, the mini-backbone network may also employ other lightweight networks to achieve the same functionality, such as pruning, compression or other optimization operations on MobileNet and shuffleNet to generate the mini-backbone network for achieving the same functionality. The embodiment of the present application is not particularly limited thereto.

It should be noted that, the existing VGG (Visual Geometry Group Network, super-resolution test sequence network), inception network or ResNet (Residual Neural Network, residual error network) has a large number of parameters of basic network structures, so that the real-time efficiency is low; in addition, the existing basic network structure has more parameters, is easy to be over-fitted, and especially under the condition of limited samples; the micro backbone network of the application shown in table 1 has small parameter amount, does not need to perform pre-training, avoids domain gap between the pre-training data set and the target data set, and further improves the accuracy of the model.

By inputting the slice with the original resolution into the first network layer, the output result of the first network layer is still an intermediate slice characteristic diagram with the same size as the slice, so that the detail characteristic of the original image is reserved, a possible target in a small range is not lost, and the accuracy of target detection is improved; in addition, the number of network layers of the micro backbone network and the number of convolution kernels included in each network layer are within a certain threshold range, and the slice characteristic diagram of the slice can be extracted rapidly by adopting the lightweight network result, so that the detection efficiency is improved. Compared with the common ResNet, mobileNet, shuffleNet networks comprising a large number of parameters, the miniature backbone network has fewer parameters and is not easy to be overfitted, so that the miniature backbone network is not limited by the data set scale, is suitable for data set scenes with various sizes, reduces the operand on the premise of meeting the requirements of small target detection and ensuring the detection accuracy, greatly improves the detection efficiency, is suitable for being used in various low-configuration hardware scenes, expands the application range of the miniature backbone network and improves the applicability.

Step 2032, for each slice, the server extracts, through the pooling layer of the attention network, the context features of each feature point in the slice feature map of the slice, and obtains the context feature map of the slice.

The pooling layer may be a multi-scale pooling layer. For the slice feature map of each slice, the server may perform context feature extraction by pooling multiple different sized convolution check slice feature maps of the layer and obtain a context feature map of the slice based on the intermediate context feature map obtained. In one possible implementation, step 2032 may include: and the server extracts the context characteristics of each characteristic point in the slice characteristic diagram through each convolution kernel in at least two convolution kernels included in the pooling layer to obtain at least two first context characteristic diagrams of the slice characteristic diagram, wherein the at least two first context characteristic diagrams are different in size. Illustratively, the server may extract the contextual features of each slice feature map separately through each convolution kernel; for each slice feature map, at least two first context feature maps corresponding to the at least two convolution kernels are obtained. For example, for slice feature map a, three context feature maps A1, A2, A3 of a may be obtained by extracting context feature maps of a by three convolution kernels 1*1, 2 x 2, 4*4, respectively.

It should be noted that, the attention network may be a GAM (Global Attention Module, global self-attention module) configured in the target model, and the GAM module may be implemented by using a feature pyramid pooling module when being specifically implemented after being connected to TBNet; the feature pyramid pooling module may include different pyramid layers. For example, the slice feature map is pooled by three different pyramid layers with convolution kernels 1*1, 2 x 2, 4*4, respectively. Through the pooling layer of the attention network, the slice feature is subjected to context feature extraction, so that each feature point in the context feature map can represent the pixel point feature of the corresponding pixel point and also can represent the feature of the pixel point of the surrounding area of the corresponding pixel point, thereby fusing the features of the contexts of the pixel point corresponding to each feature point and the surrounding area thereof, improving the feature expression capability of each feature point in the context feature map, simultaneously, inhibiting some non-target information and further improving the accuracy of target detection.

Step 2033, the server performs feature fusion on the at least one context feature map to obtain a feature map of the slice.

In one possible implementation manner, when the server performs the extraction of the context feature by using at least two convolution kernels to obtain at least two first context feature graphs with different sizes, in this step, the server may perform the size transformation first and then perform the feature fusion. This step may include: the server upsamples the at least two first context feature maps to obtain at least two second context feature maps, wherein the dimensions of the at least two second context feature maps are the same as the dimensions of the slice feature maps; and the server performs feature fusion on the at least two second context feature graphs to obtain the feature graphs of the slices.

It should be noted that, the server may implement the upsampling process by using the GAM module and using the bilinear interpolation method, so as to restore the size of the context feature map to be the same as the size of the slice feature map.

Carrying out lightweight operation through a miniature backbone network in the feature extraction layer, omitting a downsampling process in a target network layer of the miniature backbone network, and rapidly extracting slice features to obtain a slice feature map; the context feature extraction is carried out on the slice feature map through the attention network, the global context information is fused based on the features of the pixel points corresponding to the feature points and the surrounding feature points, the feature characterization capability of each feature point in the context feature map is further enhanced, and meanwhile, some non-target information can be restrained; and then restoring the up-sampling and re-pooling context feature images to the size of the slice feature images, and then carrying out feature fusion, so that the feature images can obtain more context information, the receptive field can be expanded to the whole image, the detection of the targets with different sizes by the subsequent target model is facilitated, the accuracy of target detection is improved, and the accuracy of small target detection is further improved.

Moreover, the basic network structure in the prior art is generally used for image classification tasks, and the features acquired in the prior art are not suitable for detection tasks; compared with the prior art, the method and the device can extract the characteristics of the pixel points, and can fuse the characteristics of surrounding pixel points, so that the fusion of global context characteristics is realized, and the fusion characteristics with stronger expression capability, which are more suitable for detection tasks, are obtained, thereby further matching the service requirements and improving the accuracy of target detection.

Step 204, the server detects the corresponding feature map of each slice to obtain defect information of each slice.

The defect information is used to indicate defects of a target object comprised by the slice. The object model may comprise a detector for object detection, by which the server may detect the feature map, resulting in defect information of the slice. Illustratively, the detector is configured to perform target detection based on the input feature map, and output defect information of the target object. In this step, for each slice, the server may determine, by using a detector included in the target model, a target frame based on candidate frames corresponding to respective feature points in a feature map corresponding to the slice, and output, based on the target frame, a defect position and a defect classification result of the slice, where the target frame is used to indicate a region where a defect of the target object is located; the defect classification result includes a defect category and a category probability, wherein the defect category refers to a defect category to which the defect belongs, and the category probability refers to a probability that the defect belongs to the defect category. For example, the defect class may be at least one of, including crush, sticky, lack of material, and dirt, and may be one or more of them, for example. The defect position can be represented by the upper left corner coordinate and the lower right corner coordinate of the rectangular frame; of course, the lower left corner coordinate and the upper right corner coordinate may be used for representation, and the representation mode of the defect position in the embodiment of the present application is not particularly limited.

For example, the server may generate at least two candidate frames of each feature point in the pixel region corresponding to the slice through the detector, and further regress the plurality of candidate frames into a target frame based on confidence levels of the at least two candidate frames corresponding to the plurality of feature points. This step may include: the server generates at least two candidate frames corresponding to each feature point in the feature map based on the offset parameter through a regression branch network of the detector; the server determines the contribution degree of each candidate frame except the current maximum value candidate frame based on the current maximum value candidate frame with the maximum confidence degree in the at least two candidate frames, deletes the first candidate frame with the contribution degree which does not meet the target condition in each candidate frame, and executes the operations of determining the contribution degree and deleting the first candidate frame again based on the remaining candidate frames after each deletion until each candidate frame is traversed, and determines the target frame based on at least one second candidate frame remaining after the deletion operation; the server classifies the region included in the target frame through the classification branch network of the detector, and outputs the defect classification result. The offset parameter is used for indicating the offset distance between each boundary of the candidate frame and the pixel point of the positive sample in the corresponding slice of the feature map. The positive sample pixel may be a pixel included in the region of the slice where the defect is located. Illustratively, the contribution of the candidate box refers to the value of the candidate box in determining the existence of the target box based on the plurality of candidate boxes. Illustratively, in this step, the server may perform the determination of the target frame in a plurality of iterations, and the iterative process may include the following steps (1) - (4).

Step (1): the server performs descending order arrangement on the plurality of candidate frames based on the confidence degrees of the plurality of candidate frames, and selects a current maximum candidate frame with the maximum confidence degree;

Step (2): for each other candidate frame except the current maximum candidate frame, the server sequentially determines the overlapping ratio between each other candidate frame and the current maximum candidate frame;

Step (3): deleting the other candidate frame when the overlap ratio between the other candidate frame and the current maximum candidate frame exceeds a target threshold;

step (4): the server repeatedly executes the steps (1) - (3) based on the current remaining candidate frames again, namely, screening the current maximum candidate frame with the highest confidence in the current remaining candidate frames, and executing deleting operation on other candidate frames based on the overlapping ratio of the current maximum candidate frame and the other candidate frames; until all candidate boxes are traversed. The server may determine the target frame based on at least one second candidate frame that is currently remaining after traversing all candidate frames. For example, the at least one second candidate frame is combined to obtain the target frame.

In one possible example, the overlap ratio of the candidate box and the current maximum candidate box may be used to represent the contribution of the candidate box; the larger the overlap ratio of the candidate frame with the current maximum candidate frame, the smaller the contribution of the candidate frame. The overlap ratio is denoted by IoU (Intersection-over-Union), which refers to the overlap ratio between the candidate frame and the original mark frame, that is, the ratio between the intersection of the candidate frame and the original mark frame and the Union, as shown in fig. 3, where the numerator may be the area of the black region of the intersection and the denominator may be the area of the black region of the Union in fig. 3. In this step, in the process of determining the target frame based on the candidate frame, the original mark frame may be the current maximum candidate frame, and the overlapping ratio of the candidate frame and the current maximum candidate frame is the ratio between the intersection and the union of the candidate frame and the current maximum candidate frame.

In one possible example, the confidence is used to indicate the probability that the region within the candidate box is the region where the defect is located. Each feature point in the feature map corresponds to a pixel region in the slice, the pixel region including a plurality of pixel points, for example, one feature point corresponds to a pixel region composed of 16×16 pixel points in the slice. The server may generate at least one candidate frame for each feature point corresponding to the pixel region using the detector and calculate a confidence level for each candidate frame. The offset parameter is used to indicate the distance by which each boundary of the target box is offset from the positive sample pixel point in the corresponding slice of the feature map. The positive sample pixel point may be a pixel point where the feature point corresponds to within the target region of the slice. As shown in fig. 4, the offset parameters may include l, t, r, b parameters in the horizontal and vertical directions, i represents the offset distance between the positive sample pixel point and the left boundary of the target frame, t represents the offset distance between the positive sample pixel point and the top boundary of the target frame, r represents the offset distance between the positive sample pixel point and the right boundary of the target frame, and b represents the offset distance between the positive sample pixel point and the bottom boundary of the target frame.

In one possible example, the server predicts the offset of four boundaries of the output and the target frame with the positive sample pixel point as a reference, and fine-tunes the positive sample pixel point based on the offset parameter;

In one possible example, the Regression branch network may be a Regression type network; the Classification branch network may be a Classification type network, and a Classification function for judging a defect class may be included in the Classification branch network. For example, the classification branch network may also use the candidate boxes to make a class determination, e.g., the classification branch network may determine a defect class and class probability corresponding to each candidate box. The processes of steps (1) - (4) above are performed for candidate boxes belonging to the same defect class. For example, the remaining at least one second candidate box may be subjected to a final defect class and class probability of the target box; for example, the class probability of the target frame may be calculated by averaging or merging at least one second candidate frame. Wherein the classification branch network may comprise 2 convolutional layer and 1 classification layer implementations. The Detector can be realized by adopting an anchor free Detector, so that the anchor free Detector with smaller calculation amount can be used without priori knowledge to preset anchor super parameters, and the result can be rapidly and accurately output by adopting the small calculation amount, thereby improving the detection efficiency.

In one possible implementation, the object model may further include a classifier for outputting defect indication information of the slice, and the process may include: for each slice, the server outputs defect indication information of the slice based on the feature map of the slice through the classifier of the target model, wherein the defect indication information is used for indicating whether a local target object included in the slice has a defect or not. For example, the defect indication information may include a defect probability of whether the slice is defective. In one possible example, the Classifier may be a Classifier-type network, and the classification branch network may include a classification function, for example, a classification function, for determining a defect class. For example, the classifier may include two convolution modules (conv-bn-relu) and a global average pooling (GAP, global average pooling) network, and the feature map may be processed by the two convolution modules included in the classifier and the GAP to perform global average pooling, and then pass through a classification classifier to output a defect probability of whether a defect exists.

In one possible implementation, the target model is finally obtained by performing joint training on the classifier and the detector; the training process of the target model comprises the following steps: the server inputs a sample set into an initial model, wherein the sample set refers to a sample image comprising a target object and a true value label of the sample image, and the initial model comprises an initial detector and an initial classifier; the server determines the joint difference degree between the joint result output by the initial model and the truth value label through a joint loss function based on the sample detection position and the sample detection classification output by the initial detector, the sample indication information output by the initial classifier and the truth value label of the sample set, wherein the joint result comprises the sample detection position, the sample detection classification and the sample indication information; and the server adjusts model parameters of the initial model based on the joint difference degree until the model parameters meet target conditions, and the adjustment is stopped to obtain the target model. The model parameters include at least initial offset parameters of the initial detector. The server can optimize parameters of each initial network layer of the initial model through the joint difference degree, execute the process of inputting the sample set and calculating the joint difference degree based on the input result and the sample true value again based on the optimized parameters, optimize the parameters again based on the latest joint difference degree, and repeat the model optimization for a plurality of times in an iterative mode until the optimization is met with the target condition. The target condition may be that the joint difference is smaller than a target difference threshold or the iteration number exceeds a target number threshold.

In one possible example, the process of the server calculating the joint variability based on the joint loss function may include: the server determining a first difference between a sample candidate box and a truth box of a sample set based on the sample candidate box predicted by an initial regression branch network of the initial detector, the truth box and a first loss function; the server determines a second difference between the predicted class probability and the overlap ratio based on the overlap ratio between a truth box and a predicted box of the sample set, the predicted class probability predicted by the initial classification branch network of the initial detector, and a second loss function, wherein the predicted box is a region where a defect predicted based on the sample candidate box is located; the server determining a third difference between the sample indication information and the true probability based on the sample indication information predicted by the classifier, the true probability of the sample set, and a third loss function; the server determines the joint variability from the joint loss function based on the first difference, the second difference, the third difference, and the supervisory signal of the classifier. The server can predict the region where the defect is based on a plurality of sample candidate frames to obtain a predicted frame; for example, a procedure similar to the iterative procedures of steps (1) - (4) above may be employed to obtain the prediction frame based on a plurality of sample candidate frames, which will not be described in detail herein. The sample indication information may be a probability of whether the sample image predicted by the classifier of the initial model includes a defect; when the sample image includes a defect, the true value probability is 1, otherwise, 0. The truth box may be the actual location of the defect in the sample image, indicating the actual value of the location of the defect.

In one possible example, the server may calculate the first difference based on a prediction box of the initial detector prediction samples and based on the prediction box and a truth box. The process may include: for each sample image, the server predicts at least two sample candidate boxes of the sample image through the initial regression branch network based on initial offset parameters of the initial regression branch network; the server reorganizes the boundaries of the at least two sample candidate frames based on the deviation of the boundaries of the at least two sample candidate frames and the boundaries of the truth frame to obtain at least two reorganized frames; the server determines a first difference between each reorganization box and the truth box based on an overlapping ratio of each reorganization box and the truth box, the number of pixels of the truth box, and the first loss function. The initial offset parameter may be an initial value of the offset parameter, for example, may be an initial value of l, t, r, b parameters. The confidence of the reorganization frame is used for indicating the probability that the reorganization frame is a true box, and the confidence of the sample candidate frame is used for indicating the probability that the sample candidate frame is a true box. For example, the overlap ratio of the recombination box (sample candidate box) to the truth box may be employed as the confidence of the recombination box (confidence of the sample candidate box). The deviation of the boundaries of the sample candidate box from the boundaries of the truth box may be: the deviation between the two boundaries at the same relative position of the candidate box and the truth box may be, for example, a distance difference. The relative position may be the position of the boundary relative to the frame, e.g., a left boundary on the left side of the frame relative to the positive sample pixel point of the frame, a top boundary at the top of the frame, etc. The same relative position may be the left boundary of the candidate box and the left boundary of the truth box, the top boundary of the candidate box and the top boundary of the truth box.

For example, the process of the server reorganizing the sample candidate box to obtain the reorganized box may include: the server decomposes and divides boundaries of the at least two sample candidate frames into at least two sets of boundaries based on confidence of each sample candidate frame, each set of boundaries including at least two boundaries of the at least two sample candidate frames at the same relative position; the server calculates the deviation between each boundary in the boundary set and the true value boundary of the corresponding true value frame for each boundary set, and sorts at least two boundaries in the boundary set based on the deviation corresponding to each boundary; the service reorganizes each boundary with the same arrangement sequence in each boundary set into a reorganization frame based on the arrangement sequence of each boundary in each boundary set, and calculates the overlapping ratio between the reorganization frame and the truth frame. For example, the candidate frame, the prediction frame, and the truth frame may be rectangular frames. In one possible example, the process of the server predicting a sample candidate block of a sample based on an initial detector and calculating the first difference may include: five processes of decomposition, sequencing, reorganization, distribution and difference calculation are specifically corresponding to the following steps a-e.

Step a: the decomposition (composition) and server adopts the confidence level representation of four boundaries based on the confidence level of the predicted candidate frame, then divides the boundaries in four relative positions into four groups, and establishes four boundary sets:

;

Wherein left is the set of confidence levels for the left boundary of each candidate box; right is a set of confidence levels for the right boundary of each candidate box; top is a set of confidence levels for the top boundary of each candidate box; bottom is a set of confidence levels for the bottom boundary of each candidate box; as shown in fig. 5, step a obtains four boundary sets by decomposing (deconstructing) the respective boundaries of the three candidate frames S0, S1, S2. As shown in fig. 5 (a), the right1 set includes the right boundaries of three candidate boxes S₀、S₁、S₂.

Step b: ranking (ranking), server calculates four boundaries based on target instance boundaries, respectivelyThe deviation of each edge to the corresponding true value boundary with the same relative position in the true value frame, and sorting four boundary sets based on the deviation;

Step c: and (recombination) reorganizing the four boundaries in the reorganization frame, namely reorganizing the boundaries with the same arrangement sequence in each boundary set into one reorganization frame, and calculating the overlapping ratio of the reorganization frame and the target frame after reorganization, so as to serve as the reorganization confidence of the four boundaries of the reorganization frame.

As shown in fig. 5 (c), the boundary sets having the same level are recombined into a new frame, and the overlap ratio between the recombined frame and the truth frame is calculated. The confidence of the reorganization frame of the three candidate frames S₀、S₁、S₂ is S₀^'、S₁^'、S₂^'.

Step d: assignment (assignment), reassigning the confidence of each boundary in the reorganization frame based on the confidence of the reorganization frame and the confidence of the candidate frame.

For example, for each bounding box in the bounding box, a greater value between the confidence of the candidate box in which the bounding box is located and the confidence of the bounding box in which the bounding box is located may be taken as the confidence of the bounding box. That is, as shown in fig. 5 (d), two sets of boundary scores are obtained for the boundary between the original candidate frame and the recombined frame, for example, S₀^' has a value of max (S₁,S₀^'), that is, the maximum value of the original sample candidate frame S₁ and the recombined frame S₀^' is taken. The final confidence for each boundary is assigned using the higher of the two sets of boundary scores, rather than using one of the sets entirely.

It should be noted that if the confidence of the reassembled bounding box is low, i.e., the included bounding box is far from the bounding box ground truth, this may result in the confidence of the reassembled four bounding boxes being far below their original bounding boxes, and these heavily shifted confidence scores may result in unstable gradient back propagation during the training phase, so a set with higher scores is selected to ensure training accuracy.

Step e: the calculated difference, the first loss function, may be in the form of equation one below. The server determines a first difference between each reorganization box and the truth box based on the overlapping ratio of the reorganization box and the truth box, the number of pixels of the truth box, and the first loss function, including: the server determines a first difference between each reorganization frame and the truth frame according to the following formula I based on the overlapping ratio of the reorganization frame and the truth frame, the overlapping ratio of the candidate frame and the truth frame, and the number of pixels of the truth frame.

Equation one:；

Wherein,The number of pixels of the truth box is represented, that is, the truth box corresponds to the number of pixels included in the corresponding pixel area in the sample image.In order to indicate the function,Representing the score (i.e., confidence, i.e., overlap ratio between the ith reorganization box and the truth box) of the ith reorganization box in the plurality of reorganization boxes,Representing the score (i.e., confidence, i.e., overlap ratio between the ith sample candidate box and the truth box) of the ith sample candidate box before recombination, indicating that the function value is 1 if the recombination box confidence is higher than the confidence of the sample candidate box, otherwise, 0; Representation reorganization frameAnd truth value boxThe overlap ratio between the two is the confidence of the recombination frame; representing sample candidate boxesAnd truth value boxThe overlap ratio between the two is the confidence of the sample candidate box.

In one possible implementation, the second loss function may be QFL (Quality Focal Loss, quality focus loss) function, the QFL function may be in the form of the following formula two, and the step of determining, by the server, a second difference between the prediction class probability and the overlap ratio based on the overlap ratio between the truth box and the prediction box of the sample set, the prediction class probability predicted by the initial classification branch network, and the second loss function may include: the server calculates the second difference based on the overlap ratio between the truth box and the prediction box, the prediction category probability by the following formula two:

Formula II:；

Wherein,A predicted class probability for predicting output, i.e., predicting as a certain class; y is the overlap ratio of the predicted prediction box and the truth box,；

In one possible implementation, the three-loss function may be a loss function of the classifier, which may be represented by a softmax loss function, that is, the server may calculate the third difference based on the sample indication information predicted by the classifier, the true probability of the sample, and the softmax loss function. In one possible example, the server may use the above formula one, formula two, and softmax loss function to represent a joint loss function, and the step of determining, by the joint loss function, the joint difference degree by the server based on the first difference, the second difference, the third difference, and the supervisory signal of the classifier may include: the server calculates the joint difference degree based on the first difference, the second difference, the third difference and the supervision signals of the classifier through the following formula III corresponding to the joint loss function:

And (3) a formula III:；

Wherein,Is the degree of joint variability; n is a supervision signal corresponding to the classifier, and can represent true value probability, n= {0,1}, n=1 represents that the slice currently input contains the defect to be detected, and n=0 represents that the slice currently input does not contain the defect; a loss function representing the classified branching network of the detector, i.e. a second loss function; representing the loss function of the regression branch network of the detector, i.e. the first loss function; As an indication function, if n=1 (indicating that the slice currently input contains a defect to be detected), the indication function value is 1, otherwise, is 0; is a super parameter, may be configured on an as needed basis, e.g.,Set to 0.25.

The server may calculate a joint difference degree by using the joint loss function shown in the formula three, and optimize parameters of each network layer, such as the micro backbone network, the attention network, the regression network score included in the detector, the classification network branches, the classifier and the like, of the feature extraction layer of the target model based on the joint difference degree, for example, may optimize parameters of each network layer by using a gradient descent algorithm, for example, may optimize offset parameters of the detector, so that the detector outputs more accurate defect information by using the optimized parameters.

The parameters of each network layer in the optimization model are jointly trained through the output results of the joint classifier and the detector, and the joint loss function is used, so that the detector and the classifier mutually strengthen training optimization, the false detection probability is reduced, the detection accuracy is improved, and the detection performance is improved. And the optimization can be further enhanced by combining the respective loss functions of the regression branch and the classification branch of the detector and the first difference and the second difference, so that the accuracy of the target model obtained by training is improved.

As shown in fig. 6, fig. 6 is a visual representation of a defect detected using the method of the present application. As shown in fig. 7, fig. 7 is a schematic flow diagram of object detection based on the respective network layers in the object model. In fig. 7, taking a mobile phone camera bracket fitting as an example, the image to be detected may include defects such as crush injury, sticking, lack of material, dirt, etc. of the mobile phone camera bracket fitting. In fig. 7 (a), a plurality of small slices are obtained by slicing an image to be detected, and in fig. 7 (a), two of the small slices are shown. The two small slices are input into a micro backbone network TBNet of the target model, pass through a target network layer conv-1 of the micro backbone network, omit the downsampling operation, output an intermediate feature map with the same slice size, and pass through other network layers of the TBNet network in sequence to obtain a slice feature map. Inputting the slice feature map into an attention module GAM, respectively pooling by a pyramid layer in the GAM, for example, pooling by three convolution kernels 1×1,2×2 and 4×4 shown in (c) in fig. 7, respectively obtaining three context feature maps with different scales, and recovering to three context feature maps with the same scale by upsample (up-sampling); for example, three context feature maps, which are all the dimensions of the slice feature map, are merged into one feature map via feature fusion layer Eltwise (fusion). The feature map is input to (d) a Detector and to (e) a Classifier, respectively. In (D), 4 convolution layers and D & R (Decomposition & recombination module) respectively pass through a Regression branch network; the feature map output by the feature fusion layer Eltwise is data of n×128×h×w, N is the number of slices, H is the slice height, W is the slice width, and 128 is the number of channels. After passing through the D & R module, the feature vector of nx4a×h×w is finally output, where 4A is the coordinate position of the target frame where the defect is located. in (d), the data is n×128×h×w through the 2 convolutional layers and 1 classifying layer of the classifying branch network Classification, then Objectness (target probability) data of n×a×h×w of the Regression branch network Regression is multiplied by n×128×h×w data output from the 2 convolutional layers and 1 classifying layer of the classifying branch network Classification, and feature vectors of n×ka×h×w are output, K represents K defect categories, and the feature vectors of n×ka×h×w can represent category probabilities corresponding to the K defect categories, respectively. in the Classifier (e), after passing through two convolution modules, the data is changed from n×128×h×w to n×128× (H/2) × (W/2), and then is processed by Ave Pool of the global average pooling layer to output n×256 data, and then passes through a Classifier, such as a two-class Classifier, and finally outputs an n×2 matrix, where the n×2 matrix may represent Defect y/N (Defect yes/no), that is, whether there is a Defect.

As can be seen from fig. 7, in the object model of the present application, the network structure is clear, and each network layer and module has better generalization capability; the high-resolution image to be detected is segmented, so that any resolution image input monitoring can be supported; the feature extraction layer is used for extracting pixel point features and pixel point context features to obtain a feature map with greatly improved feature characterization capability, so that the robustness of the model is enhanced, and the accuracy of detection by the model is further improved. Particularly for small target detection, the model is required to have extremely strong robustness for the size, and the target detection method can utilize the characteristics of the target model such as segmentation and global context, omit up-sampling through a target network layer, and the like, so that the characteristic map obtains more and more detailed characteristics, the receptive field can be expanded to the whole image, and the robustness of the model for target detection with different scales is improved.

The following are exemplary experimental data for target detection using the method of the embodiment of the present application, and experimental result data are as follows:

the model was trained on a self-built training set containing 7545 pictures, and testing was performed using 3458 pictures, with the results shown in table 2 below:

TABLE 2

In table 2, the indices include: mAP (MEAN AVERAGE precision, average accuracy) and APs (SMALL TARGET AVERAGE accuracy ). As shown in table 2, the parameter amounts in other methods (for example FASTER RCNN) are obviously doubled, and the index of the other methods is obviously lower and the consumed time is more for the pictures with the same size. Compared with the prior art, the method can adopt smaller parameter quantity to reach higher index, and has smaller time consumption and higher efficiency. As shown in fig. 6, fig. 6 is a visual diagram of the test, and it is apparent that even the small and minute size defect shown in fig. 6 can be detected when the object detection method of the present application is used for detection, thereby achieving improvement of accuracy and precision of detection.

According to the target detection method provided by the application, at least two slices are obtained by firstly cutting an image to be detected, and then pixel point characteristics and pixel point context characteristics of each slice are extracted through a characteristic extraction layer of a target model to obtain at least two characteristic images; detecting the feature map to obtain defect information of the slice; resolution limitation is removed, and the method is applicable to the situation of images with any resolution; the feature extraction layer comprises a target network layer for extracting pixel point features of the slice, and an intermediate feature image output by the target network layer is the same as the size of the slice, namely downsampling is omitted when the pixel point features of the slice are extracted, so that the detail features of the original image are reserved, the feature extraction layer is designed to extract the pixel point features and the pixel point context features, the robustness of detection of targets with different sizes is improved, even small targets with smaller area range can be accurately detected, and the accuracy of target detection is further improved.

The parameters of each network layer in the optimization model are jointly trained by using the output results of the joint loss function and the joint classifier, the detector and the classifier mutually strengthen training optimization, the false detection probability is reduced, the detection accuracy is improved, and the detection performance is improved.

The network layer number, the convolution kernel number and other parameters of the miniature backbone network are fewer, the operation amount is reduced on the premise of meeting the requirements of small target detection and ensuring the detection accuracy, the detection efficiency is greatly improved, and the method is suitable for being used in various low-configuration hardware scenes and improves the applicability.

Fig. 8 is a schematic structural diagram of an object detection device according to an embodiment of the present application. As shown in fig. 8, the apparatus includes:

A determining module 801, configured to determine an image to be detected including a target object, and determine at least two slices of the image to be detected;

The feature extraction module 802 is configured to perform pixel feature extraction and pixel context feature extraction on the at least two slices through a feature extraction layer of the target model, so as to obtain at least two feature graphs, where any feature point of the feature graphs is used to indicate a feature and a context feature of a corresponding pixel in a slice, a size of an intermediate feature graph output by a target network layer of the feature extraction layer is the same as a size of the slice, and the target network layer is used to perform pixel feature extraction on the slice;

And the detection module 803 is configured to detect a corresponding feature map of each slice, and obtain defect information of each slice, where the defect information is used to indicate a defect of a target object included in the slice.

In one possible implementation, the feature extraction layer includes a mini-backbone network and an attention network; the feature extraction module 802 includes:

The pixel point feature extraction unit is used for extracting pixel point features of the at least two slices through the miniature backbone network to obtain at least two slice feature graphs, and any feature point of the slice feature graphs is used for indicating the feature of a corresponding pixel point in the slice;

The context feature extraction unit is used for extracting the context features of each feature point in the slice feature map of each slice through the pooling layer of the attention network for each slice to obtain at least one context feature map of the slice, and carrying out feature fusion on the at least one context feature map to obtain the feature map of the slice;

wherein the mini-backbone network comprises a number of network layers not exceeding a first threshold and each network layer comprises a number of convolution kernels not exceeding a second threshold.

In one possible implementation manner, the context feature extraction unit is configured to extract, through each convolution kernel of at least two convolution kernels included in the pooling layer, a context feature of each feature point in the slice feature map, so as to obtain at least two first context feature maps of the slice feature map, where the at least two first context feature maps are different in size;

Correspondingly, the context feature extraction unit is further configured to upsample the at least two first context feature maps to obtain at least two second context feature maps, where the dimensions of the at least two second context feature maps are the same as the dimensions of the slice feature map; and carrying out feature fusion on the at least two second context feature graphs to obtain the feature graph of the slice.

In one possible implementation manner, the detection module 803 is configured to determine, for each slice, by using a detector included in the target model, a target frame based on candidate frames corresponding to respective feature points in a feature map corresponding to the slice, and output, based on the target frame, a defect position and a defect classification result of the slice, where the target frame is used to indicate a region where a defect of the target object is located;

the defect classification result includes a defect category and a category probability, wherein the defect category refers to a defect category to which the defect belongs, and the category probability refers to a probability that the defect belongs to the defect category.

In one possible implementation, the detecting module 803 is configured to generate, through a regression branch network of the detector, at least two candidate frames corresponding to each feature point in the feature map based on an offset parameter, where the offset parameter is used to indicate a distance by which each boundary of the candidate frames is offset from a positive sample pixel point in a corresponding slice of the feature map; determining contribution degrees of all other candidate frames except the current maximum candidate frame based on the current maximum candidate frame with the maximum reliability in the at least two candidate frames, deleting the first candidate frame with the contribution degrees not meeting the target condition in all other candidate frames, executing the operations of determining the contribution degrees and deleting the first candidate frame again based on the remaining candidate frames after each deletion until each candidate frame is traversed, and determining the target frame based on at least one second candidate frame remaining after the deletion operation; and classifying the region included in the target frame through a classification branch network of the detector, and outputting the defect classification result.

In one possible implementation, the apparatus further includes:

the classifying module is used for outputting defect indication information of each slice based on the feature map of the slice through the classifier of the target model, wherein the defect indication information is used for indicating whether a local target object included in the slice has a defect or not;

The target model is obtained by training the output result of the classifier and the output result of the detector; correspondingly, the device also comprises a model training module, wherein the module training module comprises:

an input unit for inputting a sample set, which refers to a sample image including a target object and a set of truth labels of the sample image, into an initial model including an initial detector and an initial classifier;

The joint difference determining unit is used for determining the joint difference degree between the joint result output by the initial model and the truth value label through a joint loss function based on the sample detection position and the sample detection classification output by the initial detector, the sample indication information output by the initial classifier and the truth value label of the sample set, wherein the joint result comprises the sample detection position, the sample detection classification and the sample indication information;

and the optimization unit is used for adjusting the model parameters of the initial model based on the joint difference degree until the model parameters meet the target conditions, and stopping adjusting to obtain the target model, wherein the model parameters at least comprise initial offset parameters of the initial detector.

In one possible implementation manner, the joint difference determining unit is configured to include:

A first difference determination subunit configured to determine a first difference between a sample candidate box and a truth box of a sample set based on the sample candidate box predicted by an initial regression branch network of the initial detector, the truth box, and a first loss function;

A second difference determining subunit, configured to determine a second difference between the prediction class probability and the overlap ratio based on an overlap ratio between a truth box and a prediction box of the sample set, a prediction class probability predicted by an initial classification branch network of the initial detector, and a second loss function, where the prediction box is a region where a defect predicted based on the sample candidate box is located;

a third difference determining subunit configured to determine a third difference between the sample indication information and the true probability based on the sample indication information predicted by the classifier, the true probability of the sample set, and a third loss function;

A joint difference determination subunit, configured to determine the joint difference degree through the joint loss function based on the first difference, the second difference, the third difference, and the supervisory signal of the classifier.

In one possible implementation, the first difference determining subunit is configured to predict, for each sample image, at least two sample candidate boxes of the sample image through the initial regression branch network based on initial offset parameters of the initial regression branch network; recombining the boundaries of the at least two sample candidate frames based on the deviation of the boundaries of the at least two sample candidate frames and the boundaries of the truth frame to obtain at least two recombined frames; a first difference between each reorganization box and the truth box is determined based on an overlap ratio of each reorganization box and the truth box, a number of pixels of the truth box, and the first loss function.

In one possible implementation, the first difference determining subunit is configured to decompose and divide boundaries of the at least two sample candidate frames into at least two boundary sets based on a confidence level of each sample candidate frame, where each boundary set includes at least two boundaries of the at least two sample candidate frames at a same relative position; for each boundary set, calculating the deviation between each boundary in the boundary set and the true value boundary of the corresponding true value frame, and sequencing at least two boundaries in the boundary set based on the deviation corresponding to each boundary; and recombining all boundaries with the same arrangement sequence in each boundary set based on the arrangement sequence of each boundary in each boundary set to obtain a recombination frame, and calculating the overlapping ratio between the recombination frame and the truth frame.

In one possible implementation manner, the determining module 801 is further configured to segment the image to be detected by using a sliding window manner based on the sliding window of the target step size and the target size, to obtain at least two slices of the image to be detected.

According to the target detection device provided by the application, at least two slices are obtained by firstly cutting an image to be detected, and then pixel point characteristics and pixel point context characteristics of each slice are extracted through the characteristic extraction layer of the target model to obtain at least two characteristic images; detecting the feature map to obtain defect information of the slice; resolution limitation is removed, and the method is applicable to the situation of images with any resolution; the feature extraction layer comprises a target network layer for extracting pixel point features of the slice, and an intermediate feature image output by the target network layer is the same as the size of the slice, namely downsampling is omitted when the pixel point features of the slice are extracted, so that the detail features of the original image are reserved, the feature extraction layer is designed to extract the pixel point features and the pixel point context features, the robustness of detection of targets with different sizes is improved, even small targets with smaller area range can be accurately detected, and the accuracy of target detection is further improved.

The quantity of parameters, the quantity of convolution kernels and the like of the miniature backbone network are small, the operand is reduced on the premise of meeting the requirements of small target detection and ensuring detection accuracy, the detection efficiency is greatly improved, and the method is suitable for various low-configuration hardware scenes and improves the applicability.

The object detection device of the present embodiment may execute the object detection method according to the above embodiment of the present application, and the implementation principle is similar, and will not be described herein.

Fig. 9 is a schematic structural diagram of a computer device according to an embodiment of the present application. As shown in fig. 9, the computer device includes: a memory and a processor; at least one program stored in the memory for execution by the processor, which, when executed by the processor, performs:

Obtaining at least two slices by segmenting an image to be detected, and extracting pixel point characteristics and pixel point context characteristics of each slice by a characteristic extraction layer of a target model to obtain at least two characteristic images; detecting the feature map to obtain defect information of the slice; resolution limitation is removed, and the method is applicable to the situation of images with any resolution; the feature extraction layer comprises a target network layer for extracting pixel point features of the slice, and an intermediate feature image output by the target network layer is the same as the size of the slice, namely downsampling is omitted when the pixel point features of the slice are extracted, so that the detail features of the original image are reserved, the feature extraction layer is designed to extract the pixel point features and the pixel point context features, the robustness of detection of targets with different sizes is improved, even small targets with smaller area range can be accurately detected, and the accuracy of target detection is further improved.

In an alternative embodiment, a computer device is provided, as shown in fig. 9, the computer device 900 shown in fig. 9 includes: a processor 901 and a memory 903. The processor 901 is coupled to a memory 903, such as via a bus 902. Optionally, the computer device 900 may also include a transceiver 904, where the transceiver 904 may be used for data interaction between the computer device and other computer devices, such as transmission of data and/or reception of data, etc. It should be noted that, in practical application, the transceiver 904 is not limited to one, and the structure of the computer device 900 is not limited to the embodiment of the present application.

The Processor 901 may be a CPU (Central Processing Unit ), general purpose Processor, DSP (DIGITAL SIGNAL Processor, data signal Processor), ASIC (Application SPECIFIC INTEGRATED Circuit), FPGA (Field Programmable GATE ARRAY ) or other programmable logic device, transistor logic device, hardware component, or any combination thereof. Which may implement or perform the various exemplary logic blocks, modules and circuits described in connection with this disclosure. The processor 901 may also be a combination that implements computing functionality, e.g., comprising one or more microprocessor combinations, a combination of DSP and microprocessor, etc.

Bus 902 may include a path to transfer information between the components. Bus 902 may be a PCI (PERIPHERAL COMPONENT INTERCONNECT, peripheral component interconnect standard) bus, or an EISA (Extended Industry Standard Architecture ) bus, or the like. The bus 902 may be classified as an address bus, a data bus, a control bus, or the like. For ease of illustration, only one thick line is shown in fig. 8, but not only one bus or one type of bus.

The Memory 903 may be, but is not limited to, ROM (Read Only Memory) or other type of static storage device that can store static information and instructions, RAM (Random Access Memory ) or other type of dynamic storage device that can store information and instructions, EEPROM (ELECTRICALLY ERASABLE PROGRAMMABLE READ ONLY MEMORY ), CD-ROM (Compact Disc Read Only Memory, compact disc Read Only Memory) or other optical disk storage, optical disk storage (including compact discs, laser discs, optical discs, digital versatile discs, blu-ray discs, etc.), magnetic disk storage media or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer.

The memory 903 is used for storing application program codes (computer programs) for executing the inventive arrangements, and is controlled to be executed by the processor 901. The processor 901 is configured to execute application code stored in the memory 903 to implement what is shown in the foregoing method embodiments.

Wherein the computer device includes, but is not limited to: server, terminal or service cluster, etc.

Embodiments of the present application provide a computer-readable storage medium having a computer program stored thereon, which when run on a computer, enables the computer to perform the respective content of the object detection method in the foregoing method embodiments.

Embodiments of the present application provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions so that the computer device performs the above-described object detection method.

It should be understood that, although the steps in the flowcharts of the figures are shown in order as indicated by the arrows, these steps are not necessarily performed in order as indicated by the arrows. The steps are not strictly limited in order and may be performed in other orders, unless explicitly stated herein. Moreover, at least some of the steps in the flowcharts of the figures may include a plurality of sub-steps or stages that are not necessarily performed at the same time, but may be performed at different times, the order of their execution not necessarily being sequential, but may be performed in turn or alternately with other steps or at least a portion of the other steps or stages.

The foregoing is only a partial embodiment of the present invention, and it should be noted that it will be apparent to those skilled in the art that modifications and adaptations can be made without departing from the principles of the present invention, and such modifications and adaptations are intended to be comprehended within the scope of the present invention.

Claims

1. A method of target detection, the method comprising:

For each slice, determining a target frame based on candidate frames corresponding to each feature point in a feature map corresponding to the slice by using a detector included in the target model, and outputting a defect position and a defect classification result of the slice based on the target frame; outputting defect indication information of the slice based on the feature map of the slice through a classifier of the target model; the target frame is used for indicating the region where the defect of the target object is located, and the defect indication information is used for indicating whether the local target object included in the slice has the defect or not;

The target model is obtained by training based on the combination difference degree between the combination result output by the initial model aiming at the sample set and the truth value label of the sample set; the initial model comprises an initial detector and an initial classifier, and the combined result comprises a sample detection position and a sample detection classification output by the initial detector and sample indication information output by the initial classifier;

the method for determining the joint difference degree comprises the following steps:

determining a first difference between a sample candidate box and a truth box of a sample set based on the sample candidate box predicted by an initial regression branch network of the initial detector, the truth box and a first loss function;

determining a second difference between the prediction category probability and the overlap ratio based on the overlap ratio between a truth box and a prediction box of the sample set, the prediction category probability predicted by the initial classification branch network of the initial detector, and a second loss function, wherein the prediction box is a region where a defect predicted based on the sample candidate box is located;

determining a third difference between the sample indication information and the true probability based on the sample indication information predicted by the initial classifier, the true probability of the sample set, and a third loss function;

The joint degree of difference is determined by a joint loss function based on the first difference, the second difference, the third difference, and the supervisory signals of the initial classifier.

2. The object detection method of claim 1, wherein the feature extraction layer comprises a mini-backbone network and an attention network; the step of extracting pixel point features and pixel point context features from the at least two slices through the feature extraction layer of the target model to obtain at least two feature graphs comprises the following steps:

extracting pixel point characteristics of the at least two slices through the miniature backbone network to obtain at least two slice characteristic diagrams, wherein any characteristic point of the slice characteristic diagrams is used for indicating the characteristic of a corresponding pixel point in the slice;

For each slice, extracting context features of each feature point in a slice feature map of the slice through a pooling layer of the attention network to obtain at least one context feature map of the slice, and carrying out feature fusion on the at least one context feature map to obtain a feature map of the slice;

the number of network layers included in the micro backbone network does not exceed a first threshold, and the number of convolution kernels included in each network layer does not exceed a second threshold.

3. The method for detecting an object according to claim 2, wherein the extracting, by the pooling layer of the attention network, the contextual feature of each feature point in the slice feature map of the slice to obtain at least one contextual feature map of the slice includes:

Extracting context features of each feature point in the slice feature map through each convolution kernel in at least two convolution kernels included in the pooling layer to obtain at least two first context feature maps of the slice feature map, wherein the at least two first context feature maps are different in size;

correspondingly, the feature fusion of the at least one context feature map to obtain the feature map of the slice includes:

Upsampling the at least two first context feature maps to obtain at least two second context feature maps, wherein the dimensions of the at least two second context feature maps are the same as the dimensions of the slice feature maps;

and carrying out feature fusion on the at least two second context feature graphs to obtain the feature graphs of the slices.

4. The target detection method according to claim 1, wherein the defect classification result includes a defect class and a class probability, the defect class being a defect class to which the defect belongs, the class probability being a probability that the defect belongs to the defect class.

5. The method for detecting a target according to claim 1, wherein the determining, by the detector included in the target model, a target frame in which the defect is located based on candidate frames corresponding to each feature point in the feature map corresponding to the slice, and outputting, based on the target frame, a defect position and a defect classification result of the slice includes:

Generating at least two candidate frames corresponding to each feature point in the feature map based on an offset parameter through a regression branch network of the detector, wherein the offset parameter is used for indicating the offset distance between each boundary of the candidate frames and a positive sample pixel point in a corresponding slice of the feature map;

Determining contribution degrees of all other candidate frames except the current maximum candidate frame based on the current maximum candidate frame with the maximum confidence in the at least two candidate frames, deleting a first candidate frame with the contribution degrees not meeting a target condition in the other candidate frames, executing the operations of determining the contribution degrees and deleting the first candidate frame again based on the remaining candidate frames after each deletion until each candidate frame is traversed, and determining the target frame based on at least one remaining second candidate frame after the deletion operation;

And classifying the region included in the target frame through a classification branch network of the detector, and outputting the defect classification result.

6. The target detection method according to claim 1, wherein the target model is obtained by training in combination of an output result of the classifier and an output result of the detector; correspondingly, the training process of the target model comprises the following steps:

Inputting a sample set into an initial model, wherein the sample set refers to a sample image of a target object and a true value label of the sample image, and the initial model comprises an initial detector and an initial classifier;

Determining a joint difference degree between a joint result output by the initial model and the truth label through a joint loss function based on the sample detection position and sample detection classification output by the initial detector, sample indication information output by the initial classifier and the truth label of the sample set, wherein the joint result comprises the sample detection position, the sample detection classification and the sample indication information;

And based on the joint difference degree, adjusting model parameters of the initial model until the model parameters meet target conditions, and stopping adjusting to obtain the target model, wherein the model parameters at least comprise initial offset parameters of the initial detector.

7. The method of claim 1, wherein the determining a first difference between the sample candidate box and the truth box based on the sample candidate box predicted by the initial regression branch network of the initial detector, the truth box of the sample set, and the first loss function comprises:

Predicting, for each sample image, at least two sample candidate boxes of the sample image by the initial regression branch network based on initial offset parameters of the initial regression branch network;

Recombining the boundaries of the at least two sample candidate frames based on the deviation of the boundaries of the at least two sample candidate frames and the boundaries of the truth frame to obtain at least two recombined frames;

A first difference between each reorganization box and the truth box is determined based on an overlap ratio of each reorganization box and the truth box, a number of pixels of the truth box, and the first loss function.

8. The method according to claim 7, wherein the recombining the boundaries of the at least two sample candidate frames based on the deviation of the boundaries of the at least two sample candidate frames from the boundaries of the truth frame to obtain at least two recombined frames comprises:

Decomposing and dividing boundaries of the at least two sample candidate frames into at least two boundary sets based on the confidence level of each sample candidate frame, wherein each boundary set comprises at least two boundaries of the at least two sample candidate frames at the same relative position;

For each boundary set, calculating the deviation between each boundary in the boundary set and the true value boundary of the corresponding true value frame, and sequencing at least two boundaries in the boundary set based on the deviation corresponding to each boundary;

And recombining all boundaries with the same arrangement sequence in each boundary set based on the arrangement sequence of each boundary in each boundary set to obtain a recombination frame, and calculating the overlapping ratio between the recombination frame and the truth frame.

9. The method of claim 1, wherein the determining at least two slices of the image to be detected comprises:

And cutting the image to be detected by adopting a sliding window mode based on the sliding window of the target step length and the target size to obtain at least two slices of the image to be detected.

10. An object detection apparatus, comprising:

The detection module is used for determining a target frame based on candidate frames corresponding to each feature point in the feature map corresponding to each slice through a detector included in the target model, and outputting a defect position and a defect classification result of the slice based on the target frame; outputting defect indication information of the slice based on the feature map of the slice through a classifier of the target model; the target frame is used for indicating the region where the defect of the target object is located, and the defect indication information is used for indicating whether the local target object included in the slice has the defect or not;

11. The object detection device of claim 10, wherein the feature extraction layer comprises a mini-backbone network and an attention network;

The feature extraction module comprises:

the context feature extraction unit is used for extracting the context features of each feature point in the slice feature map of each slice through the pooling layer of the attention network for each slice to obtain at least one context feature map of the slice, and carrying out feature fusion on the at least one context feature map to obtain a feature map of the slice;

12. A computer device comprising a memory, a processor and a computer program stored on the memory, characterized in that the processor executes the computer program to implement the object detection method of any one of claims 1 to 9.

13. A computer readable storage medium having stored thereon a computer program, characterized in that the computer program, when executed by a processor, implements the object detection method according to any of claims 1 to 9.

14. A computer program product comprising a computer program, characterized in that the computer program, when executed by a processor, implements the object detection method of any one of claims 1 to 9.