
Image detection method and device, electronic equipment and storage medium

Info

Publication number
CN114663670B
CN114663670B
Authority
CN
China
Prior art keywords
attention
image
feature
map
feature map
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210305284.XA
Other languages
Chinese (zh)
Other versions
CN114663670A (en)
Inventor
王昌安
王亚彪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shanghai Co Ltd
Original Assignee
Tencent Technology Shanghai Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shanghai Co Ltd
Priority to CN202210305284.XA
Publication of CN114663670A
Application granted
Publication of CN114663670B
Status: Active
Anticipated expiration

Abstract

The invention discloses an image detection method and apparatus, an electronic device, and a storage medium, applicable to scenarios such as cloud technology, artificial intelligence, intelligent transportation, and assisted driving. The method comprises: determining a first vector sequence according to an image block sequence corresponding to an image to be processed; sequentially performing attention coding on the first vector sequence based on a plurality of attention coding layers to obtain a first feature map; obtaining a first attention distribution map based on the target attention features corresponding to the classification embedded vector; inverting the first attention distribution map and fusing the result with the first feature map to obtain a second feature map; and performing target object detection according to the first feature map, the first attention distribution map, and the second feature map to obtain category information and position information of the target object in the image to be processed. The invention improves the accuracy of the position information in the detection result.

Description

Image detection method and device, electronic equipment and storage medium
Technical Field
The present invention relates to the field of computer technologies, and in particular, to an image detection method, an image detection device, an electronic device, and a storage medium.
Background
Target detection gives the position information and category information of objects of interest in an image, and is a foundational task preceding subsequent computer vision tasks.
In the related art, target detection is generally performed on the image based on a convolutional neural network (CNN, Convolutional Neural Networks): the CNN extracts a deep semantic feature map of the image, the feature map is globally pooled, the locally class-discriminative information is integrated through a fully connected layer and passed to a classification layer for classification, and a localization box is determined in combination with the classification result to obtain the position information. Because a convolutional neural network extracts only local features, the related art attends only to the most discriminative features of the target, so the localization accuracy of target detection is poor and the position information in the detection result is not accurate enough.
Disclosure of Invention
In order to solve the problems in the prior art, the embodiment of the invention provides an image detection method, an image detection device, electronic equipment and a storage medium. The technical scheme is as follows:
In one aspect, an image detection method is provided, the method comprising:
Determining a first vector sequence according to an image block sequence corresponding to an image to be processed, wherein the first vector sequence comprises a classified embedded vector and an image block vector corresponding to each image block in the image block sequence;
Performing attention coding on the first vector sequence based on a plurality of attention coding layers in turn to obtain a first feature map;
Determining an attention matrix corresponding to each attention coding layer, extracting target attention characteristics corresponding to the classified embedded vectors in each attention matrix, and carrying out fusion processing on the target attention characteristics to obtain a first attention distribution map;
Performing inverse processing on the first attention distribution map to obtain a reverse attention distribution map, and performing fusion processing on the reverse attention distribution map and the first feature map to obtain a second feature map;
and detecting a target object according to the first feature map, the first attention distribution map, and the second feature map to obtain category information and position information of the target object in the image to be processed.
In another aspect, there is provided an image detection apparatus, the apparatus including:
The first determining module is used for determining a first vector sequence according to an image block sequence corresponding to an image to be processed, wherein the first vector sequence comprises a classification embedded vector and image block vectors corresponding to the image blocks in the image block sequence;
The attention coding module is used for sequentially carrying out attention coding on the first vector sequence based on a plurality of attention coding layers to obtain a first feature map;
The first attention distribution determining module is used for determining an attention matrix corresponding to each attention coding layer, extracting target attention characteristics of the corresponding classified embedded vectors in each attention matrix, and carrying out fusion processing on the target attention characteristics to obtain a first attention distribution map;
The inverse processing module is used for carrying out inverse processing on the first attention distribution diagram to obtain a reverse attention distribution diagram, and carrying out fusion processing on the reverse attention distribution diagram and the first feature diagram to obtain a second feature diagram;
and the target object detection module is used for detecting the target object according to the first feature map, the first attention distribution map and the second feature map to obtain the category information and the position information of the target object in the image to be processed.
In an exemplary embodiment, the target object detection module includes:
The second attention distribution determining module is used for determining category information of the target object in the image to be processed according to the second feature map and the first feature map and generating a second attention distribution map;
the target attention distribution determining module is used for multiplying the first attention distribution map and the second attention distribution map pixel by pixel to obtain a target attention distribution map;
and the position information determining module is used for determining the position information of the target object in the image to be processed according to the target attention distribution diagram.
In an exemplary embodiment, the second attention profile determination module includes:
the first classification module is used for classifying according to the second feature map to obtain a first classification result, wherein the first classification result comprises probability values corresponding to each preset category in a plurality of preset categories;
the heavy parameter module is used for carrying out weighted summation on a plurality of initialized convolution kernels according to the first classification result to obtain a target convolution kernel;
the feature extraction module is used for carrying out feature extraction on the first feature map according to the target convolution check to obtain a third feature map;
and the second determining module is used for determining the category information of the target object in the image to be processed according to the third characteristic diagram and generating a second attention distribution diagram.
In an exemplary embodiment, the first classification module includes:
the dimension reduction module is used for carrying out dimension reduction processing on the second feature map to obtain a dimension reduction feature map;
The first pooling module is used for carrying out global maximum pooling and global average pooling on the dimension reduction feature map respectively to obtain a first pooling feature and a second pooling feature;
The pooling feature fusion module is used for fusing the first pooling feature and the second pooling feature to obtain a fused pooling feature;
And the classification sub-module is used for classifying the fusion pooling features to obtain a first classification result.
In an exemplary embodiment, the second determining module includes:
The second pooling module is used for carrying out global average pooling on the third feature map to obtain third pooling features;
The second classification module is used for classifying based on the third pooling feature to obtain a second classification result, wherein the second classification result represents the class information of the target object in the image to be processed;
an attention profile generation module for generating a second attention profile based on the third pooling feature and the third feature map.
In an exemplary embodiment, when fusing the reverse attention distribution map with the first feature map to obtain the second feature map, the inverse processing module is specifically configured to multiply the reverse attention distribution map with the first feature map to obtain the second feature map.
In an exemplary embodiment, the first attention profile determination module includes:
The first averaging module is used for determining, for each attention coding layer, the average of the attention matrices of the respective self-attention mechanism modules to obtain the attention matrix corresponding to that attention coding layer;
the extraction module is used for extracting the target attention characteristic corresponding to the classified embedded vector in the attention matrix corresponding to the attention coding layer;
and the second averaging module is used for averaging the target attention features corresponding to each attention coding layer to obtain the first attention distribution map.
In an exemplary embodiment, the first determining module includes:
The image segmentation module is used for acquiring an image to be processed, and segmenting the image to be processed into a plurality of image blocks to obtain an image block sequence;
The embedding module is used for carrying out vector embedding on the image blocks in the image block sequence to obtain an image block embedded vector sequence;
The first adding module is used for adding initialized classified embedded vectors into the image block embedded vector sequence to obtain an embedded vector sequence;
And the second adding module is used for adding position codes to each embedded vector in the embedded vector sequence to obtain the first vector sequence, and the position codes represent the position information of the corresponding embedded vector in the embedded vector sequence.
In an exemplary embodiment, the image detection method is implemented based on an image detection model, and the apparatus further includes a training module, where the training module includes:
The sample acquisition module is used for acquiring a sample image and a category label corresponding to the sample image, wherein the category label indicates reference category information of a target object in the sample image;
The first sample feature map determining module is used for inputting a sample image block sequence corresponding to the sample image into an attention coding unit of a preset neural network model, and performing attention coding based on a plurality of attention coding layers in the attention coding unit to obtain a first sample feature map;
The class prediction module is used for determining an attention matrix of each attention coding layer based on a local feature mining unit of the preset neural network model, extracting the target attention feature corresponding to the classification embedded vector in each attention matrix, fusing the target attention features to obtain a first sample attention distribution map, inverting the first sample attention distribution map to obtain a sample reverse attention distribution map, fusing the sample reverse attention distribution map with the first sample feature map to obtain a second sample feature map, and performing category prediction according to the first sample feature map and the second sample feature map to obtain predicted category information corresponding to the sample image;
And the training sub-module is used for determining a loss value according to the difference between the predicted category information corresponding to the sample image and the category label, reversely adjusting the model parameters of the preset neural network model based on the loss value, and performing iterative training until the training ending condition is met, so as to obtain the image detection model.
In another aspect, an electronic device is provided, including a processor and a memory, where at least one instruction or at least one program is stored in the memory, where the at least one instruction or the at least one program is loaded and executed by the processor to implement the image detection method described above.
In another aspect, a computer readable storage medium having at least one instruction or at least one program stored therein is provided, the at least one instruction or the at least one program loaded and executed by a processor to implement an image detection method as described above.
In another aspect, a computer program product or computer program is provided, the computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the electronic device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions so that the electronic device performs the image detection method described above.
According to the embodiment of the invention, the first vector sequence is determined based on the image block sequence corresponding to the image to be processed, and attention coding is performed on the first vector sequence sequentially through the plurality of attention coding layers to obtain the first feature map, so that the local areas of the image are fully associated through a global attention mechanism to obtain effective global features. The target attention features corresponding to the classification embedded vector are then extracted from the attention matrix corresponding to each attention coding layer and fused to obtain the first attention distribution map; the first attention distribution map is inverted to obtain the reverse attention distribution map, which is fused with the first feature map to obtain the second feature map; and target object detection is performed based on the first feature map, the first attention distribution map, and the second feature map, thereby improving the accuracy of the position information in the detection result.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1a is an example of a thermal activation map generated when target detection is based on a convolutional neural network;
FIG. 1b is an example of a thermal activation map obtained using the solution of the present invention;
FIG. 2 is a schematic illustration of an implementation environment provided by an embodiment of the present invention;
FIG. 3 is a schematic diagram of a detection result of the image detection method applied to industrial AI quality inspection according to the embodiment of the invention;
FIG. 4 is a schematic flow chart of an image detection method according to an embodiment of the present invention;
FIG. 5 is a flowchart of another image detection method according to an embodiment of the present invention;
FIG. 6 is a block diagram of an image detection apparatus according to an embodiment of the present invention;
FIG. 7 is a block diagram of a hardware structure of an electronic device according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
It should be noted that the terms "first," "second," and the like in the description and the claims of the present invention and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the invention described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or server that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed or inherent to such process, method, article, or apparatus, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
In related-art target detection, a convolutional neural network serves as the backbone to extract a deep semantic feature map of the image; global pooling then integrates the locally class-discriminative information, which is passed to a classification layer for classification. The region where the classification layer's attention concentrates is extracted by the CAM (Class Activation Mapping) method to form a thermal activation map, and connected-region analysis is applied to the thermal activation map to determine a localization box and obtain the position information. Because convolution is an operator with only local feature extraction capability, a convolutional neural network can extract only the most discriminative features of each category, so the related-art network attends only to part of the highly discriminative area of the target object in the image; that is, only part of the thermal activation map shows a high response.
Fig. 1a shows an example of a thermal activation map generated when target detection is performed based on a convolutional neural network. The target object in the image to be processed is a bird. Because the bird's head area is more discriminative than other parts and is the key area distinguishing birds from other categories, the network attends only to the head area when detection is based on a convolutional neural network; that is, only the head area in the thermal activation map of fig. 1a shows a high response. The localization box determined from this thermal activation map is therefore not complete enough, the localization accuracy is poor, and the position information in the detection result is not accurate enough.
Based on this, the embodiment of the invention provides an image detection method that extracts global features of the image to be processed based on a plurality of attention coding layers and further mines local features to obtain a fully activated thermal activation map. Fig. 1b shows an example of a thermal activation map obtained with the technical scheme of the invention. Compared with the thermal activation map of fig. 1a, the embodiment of the invention activates the regions in the weak response area of fig. 1a that are helpful for classification, thereby improving the completeness of the localization box, improving the localization accuracy, and ensuring the accuracy of the position information in the detection result.
Referring to fig. 2, a schematic diagram of an implementation environment provided by an embodiment of the present invention is shown, where the implementation environment may include a terminal 210 and a server 220, where the terminal 210 and the server 220 may communicate through a wired connection or a wireless connection.
The terminal 210 includes, but is not limited to, a mobile phone, a computer, an intelligent voice interaction device, a smart home appliance, a vehicle-mounted terminal, an aircraft, and the like. The terminal 210 is provided with client software having an image detection function, such as an application (App), which may be a stand-alone application or an applet within another application.
Server 220 may be a server that provides background services for the application in terminal 210; the service may specifically be target position detection for images. For example, a pre-trained image detection model may be stored in the server 220, and when the image to be processed is detected, the terminal 210 may call the image detection model in the server 220 to perform image detection by using the image detection method according to the embodiment of the present invention, and obtain a detection result returned by the server 220, where the detection result includes location information and category information of the target object. It can be appreciated that the terminal 210 may also download the trained image detection model from the server 220 and store the model locally, and directly call the local image detection model when performing image detection on the image to be processed, thereby improving the efficiency of image detection.
It should be noted that, the server 220 may be an independent physical server, or may be a server cluster or a distributed system formed by a plurality of physical servers, or may be a cloud server that provides cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDNs (Content Delivery Network, content delivery networks), and basic cloud computing services such as big data and artificial intelligence platforms.
In an exemplary embodiment, the terminal 210 and the server 220 may each be a node device in a blockchain system, and may be capable of sharing acquired and generated information to other node devices in the blockchain system, so as to implement information sharing between multiple node devices. The plurality of node devices in the blockchain system can be configured with the same blockchain, the blockchain consists of a plurality of blocks, and the blocks adjacent to each other in front and back have an association relationship, so that the data in any block can be detected through the next block when being tampered, thereby avoiding the data in the blockchain from being tampered, and ensuring the safety and reliability of the data in the blockchain.
The image detection method of the embodiment of the invention can be applied to various scenarios, including but not limited to cloud technology, artificial intelligence, intelligent transportation, assisted driving, and the like.
Artificial Intelligence (AI) is the theory, method, technique, and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use the knowledge to obtain optimal results. In other words, artificial intelligence is an integrated technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a way similar to human intelligence. Artificial intelligence is the study of the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning, and decision-making. Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, at both the hardware level and the software level. Basic artificial intelligence technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and mechatronics. Artificial intelligence software technology mainly includes computer vision, speech processing, natural language processing, and machine learning/deep learning.
Computer Vision (CV) is the science of studying how to make machines "see"; more specifically, it replaces human eyes with cameras and computers to identify and measure targets and perform other machine vision tasks, and further performs graphic processing so that the computer produces images more suitable for human eyes to observe or for transmission to instruments for detection. As a scientific discipline, computer vision studies related theories and technologies in an attempt to build artificial intelligence systems that can acquire information from images or multidimensional data. Computer vision technologies typically include image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D technology, virtual reality, augmented reality, simultaneous localization and mapping, autonomous driving, and intelligent transportation, as well as common biometric technologies such as face recognition and fingerprint recognition.
Machine Learning (ML) is a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory, and other disciplines. It specializes in studying how computers simulate or implement human learning behavior to acquire new knowledge or skills and reorganize existing knowledge structures to continuously improve their own performance. Machine learning is the core of artificial intelligence and the fundamental way to endow computers with intelligence; it is applied across all areas of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from demonstration.
An Intelligent Traffic System (ITS), also called an Intelligent Transportation System, effectively and comprehensively applies advanced science and technology (information technology, computer technology, data communication technology, sensor technology, electronic control technology, automatic control theory, operations research, artificial intelligence, and the like) to transportation, service control, and vehicle manufacturing, strengthening the connection among vehicles, roads, and users, thereby forming an integrated transport system that ensures safety, improves efficiency, improves the environment, and saves energy.
Taking an industrial AI quality inspection scene as an example, an image shot by a camera aiming at an industrial manufacturing component can be input into an image detection model as an image to be processed, defect detection is performed by the image detection method based on the embodiment of the invention, and position information (detection frame) of the defect and the defect type in the image are output. Fig. 3 is a schematic diagram of a detection result of an image detection method applied to industrial AI quality inspection according to an embodiment of the present invention, wherein a defect type is a dirt type, and a box is a positioning box of the defect.
Referring to fig. 4, a flowchart of an image detection method according to an embodiment of the present invention is shown, and the method may be applied to the electronic device in fig. 2, where the electronic device may be a terminal or a server. It is noted that the present specification provides method operational steps as described in the examples or flowcharts, but may include more or fewer operational steps based on conventional or non-inventive labor. The order of steps recited in the embodiments is merely one way of performing the order of steps and does not represent a unique order of execution. In actual system or product execution, the methods illustrated in the embodiments or figures may be performed sequentially or in parallel (e.g., in a parallel processor or multi-threaded processing environment). As shown in fig. 4, the method may include:
s401, determining a first vector sequence according to an image block sequence corresponding to an image to be processed.
The first vector sequence comprises a classification embedded vector and an image block vector corresponding to each image block in the image block sequence.
For example, the step S401 may include:
Acquiring an image to be processed, and dividing the image to be processed into a plurality of image blocks to obtain an image block sequence;
performing vector embedding on the image blocks in the image block sequence to obtain an image block embedded vector sequence;
adding initialized classified embedded vectors into the image block embedded vector sequence to obtain an embedded vector sequence;
And adding position codes to each embedded vector in the embedded vector sequence to obtain the first vector sequence, wherein the position codes represent the position information of the corresponding embedded vector in the embedded vector sequence.
Specifically, the image to be processed may be segmented according to a preset resolution, so that it is divided into image blocks of the same size, and each image block is then flattened to produce the image block sequence. The vector embedding method can be any existing embedding tool; for example, Word2Vec can be used to embed the image blocks. The classification embedded vector (usually denoted the [Class] token) can be initialized randomly, and the output corresponding to the classification embedded vector can be used to realize image classification. The classification embedded vector is usually located at the head of the sequence, i.e., position 0; thus, assuming the image block sequence has length N (i.e., there are N image blocks), the embedded vector sequence has length N+1.
The position code (position embedding) is used to encode the relative position information of each embedded vector. In practical applications, each embedded vector can be added to its corresponding position code so as to attach the position information of that embedded vector within the sequence, obtaining the first vector sequence.
It will be appreciated that the first vector sequence includes a classification embedded vector and an image block vector corresponding to each image block, where the classification embedded vector is a vector obtained by adding a corresponding position code to the initialized classification embedded vector, and the image block vector is a vector obtained by adding a corresponding position code to the image block embedded vector.
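For illustration, the following is a minimal PyTorch sketch of this sequence construction; all class names, parameter values, and the use of a strided convolution for patch embedding are assumptions for the example, not details taken from the patent:

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Builds the first vector sequence: patch vectors + [Class] token + position codes."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2   # N image blocks
        # Vector embedding of each image block (a strided conv is the usual trick:
        # it is equivalent to flattening each block and applying a shared linear map).
        self.proj = nn.Conv2d(in_chans, dim, kernel_size=patch_size, stride=patch_size)
        # Initialized classification embedded vector ([Class] token) for position 0.
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        # Learnable position codes for the N+1 embedded vectors.
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches + 1, dim))

    def forward(self, x):                                # x: (B, 3, H, W)
        x = self.proj(x).flatten(2).transpose(1, 2)      # (B, N, D) image block vectors
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        x = torch.cat([cls, x], dim=1)                   # prepend [Class] token -> (B, N+1, D)
        return x + self.pos_embed                        # add position codes -> first vector sequence
```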
S403, sequentially performing attention coding on the first vector sequence based on a plurality of attention coding layers to obtain a first feature map.
It should be noted that the plurality of attention coding layers are cascaded: only the input of the first attention coding layer is the first vector sequence, the input of each subsequent attention coding layer is the output of the previous attention coding layer, and the output of the last attention coding layer is converted (Reshape) to obtain the first feature map.
In practice, when the first vector sequence is input into the first attention coding layer, it is typically input in the form of a matrix. For example, if the length of the first vector sequence is N+1 (where N is the number of image blocks and 1 represents the added classification embedded vector), the first vector sequence is input into the first attention coding layer as an (N+1)×D matrix, where D is the dimension of each vector in the first vector sequence.
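Continuing the sketch, the cascade of attention coding layers and the final Reshape into a feature map might look as follows. The standard nn.TransformerEncoder is used here only to illustrate the shapes and the cascading; a real implementation would use custom blocks that also expose each layer's attention matrix, which is needed in step S405:

```python
import torch
import torch.nn as nn

# Hypothetical sizes: D = 768, 12 heads, 12 cascaded attention coding layers.
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True),
    num_layers=12,
)

def first_feature_map(seq):                  # seq: (B, N+1, D) first vector sequence
    out = encoder(seq)                       # (B, N+1, D) after cascaded attention coding
    patch_tokens = out[:, 1:, :]             # drop the [Class]-token position
    B, N, D = patch_tokens.shape
    h = w = int(N ** 0.5)                    # assumes a square grid of image blocks
    return patch_tokens.transpose(1, 2).reshape(B, D, h, w)   # Reshape -> first feature map
```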
S405, determining an attention matrix corresponding to each attention coding layer, extracting the target attention feature corresponding to the classification embedded vector in each attention matrix, and fusing the target attention features to obtain a first attention distribution map.
In the embodiment of the present invention, each attention coding layer encodes based on a self-attention mechanism and may include a plurality of self-attention mechanism modules. Assuming the output of the l-th attention coding layer is X^l ∈ R^((N+1)×D), the attention matrix A^l ∈ R^(S×(N+1)×(N+1)) of its self-attention mechanism modules may be expressed as:

A^l = softmax(Q^l (K^l)^T / √(D/S))

where Q^l and K^l denote the query matrix (Query) and key matrix (Keys) obtained by linear mappings of the output of the previous attention coding layer, S denotes the number of attention heads of the attention coding layer (i.e., the number of self-attention mechanism modules in the layer), D/S is the per-head dimension, T is the transpose operator, and softmax(·) is the normalization function, which maps values to between 0 and 1 so that they sum to 1.

For each attention coding layer, the attention matrices of the S self-attention mechanism modules are fused (the fusion may be averaging) to obtain the attention matrix A'^l ∈ R^((N+1)×(N+1)) corresponding to that layer. The target attention feature corresponding to the classification embedded vector, i.e., the [Class] token, is then extracted from each A'^l, giving the target attention feature A_cls^l of each attention coding layer, which characterizes the extent of influence of each image block on the image classification.

In the embodiment of the invention, each of the plurality of attention coding layers models attention at a different semantic level, so the first attention distribution map M_cls can be obtained by fusing the target attention features corresponding to the plurality of attention coding layers, where the fusion may be averaging the target attention features over all layers.

It will be appreciated that the first attention distribution map M_cls characterizes the influence of each image block on the image classification, i.e., it reflects the degree of association between the classification embedded vector and all the image blocks, and thus focuses on global association. In practice, the first attention distribution map may be represented in the form of a thermal activation map.
Based on this, in an exemplary embodiment, the step S405 may include, when implemented:
For each attention coding layer, determining the average of attention matrixes of the respective attention mechanism modules to obtain an attention matrix corresponding to the attention coding layer;
Extracting target attention characteristics of the corresponding classified embedded vectors from an attention matrix corresponding to the attention coding layer;
and averaging the target attention characteristics corresponding to each attention coding layer to obtain a first attention distribution map.
In the above embodiment, the first attention distribution map is obtained by averaging the target attention features corresponding to different semantic levels, so that the accuracy of the attention distribution in the first attention distribution map is improved.
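A minimal sketch of this extraction and fusion, under the assumption that the per-layer attention tensors have shape (B, S, N+1, N+1) with the [Class] token at index 0:

```python
import torch

def first_attention_map(attn_mats):
    """attn_mats: list of per-layer attention tensors, each of shape (B, S, N+1, N+1).
    Returns the first attention distribution map as a (B, h, w) grid."""
    per_layer = []
    for a in attn_mats:
        a = a.mean(dim=1)               # average over the S self-attention heads
        per_layer.append(a[:, 0, 1:])   # [Class]-token row: attention to the N image blocks
    m = torch.stack(per_layer).mean(dim=0)   # average over the attention coding layers
    B, N = m.shape
    h = w = int(N ** 0.5)               # assumes a square grid of image blocks
    return m.reshape(B, h, w)
```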
S407, performing inverse processing on the first attention distribution map to obtain a reverse attention distribution map, and performing fusion processing on the reverse attention distribution map and the first feature map to obtain a second feature map.
S409, detecting a target object according to the first feature map, the first attention distribution map and the second feature map to obtain category information and position information of the target object in the image to be processed.
It should be noted that a rough outline of the target object can be obtained from the first attention distribution map M_cls by binarizing it and taking the maximum connected domain, and a localization box for the target object can then be obtained by further taking the maximum circumscribed rectangle of that outline. However, because M_cls focuses on global association, the association between local areas, i.e., between image blocks, is ignored, so weak local responses occur in the first attention distribution map; determining the localization box of the target object directly from the first attention distribution map is therefore not conducive to maximizing the localization accuracy.
On this basis, the embodiment of the invention performs inversion processing on the first attention distribution map M_cls to obtain a reverse attention distribution map, which is able to highlight the weak response regions of M_cls. These regions mainly comprise two parts: one part is pure background that does not contribute to image classification, and the other part is a region that is helpful for classification but receives no response due to the inherent defect of the target attention feature corresponding to the classification embedded vector. The inversion processing may first normalize M_cls to between 0 and 1 and then apply the negation operation.
Therefore, in the embodiment of the invention, the reverse attention distribution map and the first feature map are fused to obtain the second feature map; the fusion may multiply the reverse attention distribution map with the first feature map. Target object detection is then performed by combining the first feature map, the first attention distribution map, and the second feature map to obtain the category information and position information of the target object in the image to be processed. Because the second feature map is obtained by fusing the reverse attention distribution map with the first feature map, the weak response areas that are helpful for classification can be activated, which improves the localization accuracy of target detection and thus the accuracy of the detection result.
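A sketch of the inversion and fusion just described (the per-image min-max normalization is an assumption; the patent only states normalization to between 0 and 1):

```python
import torch

def reverse_and_fuse(first_attn, first_feat):
    """first_attn: (B, h, w); first_feat: (B, D, h, w). Returns the second feature map."""
    flat = first_attn.flatten(1)
    lo = flat.min(dim=1, keepdim=True).values
    hi = flat.max(dim=1, keepdim=True).values
    norm = ((flat - lo) / (hi - lo + 1e-6)).reshape_as(first_attn)   # normalize to [0, 1]
    reverse = 1.0 - norm                       # negation -> reverse attention distribution map
    return first_feat * reverse.unsqueeze(1)   # fuse by multiplication -> second feature map
```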
In an exemplary embodiment, performing target object detection according to the first feature map, the first attention distribution map, and the second feature map may include: determining category information of the target object in the image to be processed according to the second feature map and the first feature map and generating a second attention distribution map; multiplying the first attention distribution map and the second attention distribution map pixel by pixel to obtain a target attention distribution map; and determining the position information of the target object in the image to be processed according to the target attention distribution map.
The second attention distribution map activates the regions in the weak response area that are helpful for classification, so it is complementary to the first attention distribution map. The target attention distribution map is obtained by multiplying the first and second attention distribution maps pixel by pixel, and a more complete and more accurate localization box can then be obtained by connected-region analysis of the target attention distribution map, which greatly improves the accuracy of the position information of the target object in the image to be processed.
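The localization step itself can be sketched with OpenCV's connected-component analysis; the threshold value and the choice of the largest component are illustrative assumptions:

```python
import cv2
import numpy as np

def locate(target_attn, threshold=0.5):
    """target_attn: (h, w) target attention distribution map scaled to [0, 1].
    Returns one localization box (x1, y1, x2, y2) in map coordinates, or None."""
    mask = (target_attn >= threshold).astype(np.uint8)       # binarize
    n, _, stats, _ = cv2.connectedComponentsWithStats(mask, connectivity=8)
    if n <= 1:
        return None                                          # no foreground component
    k = 1 + int(np.argmax(stats[1:, cv2.CC_STAT_AREA]))      # maximum connected domain
    x, y = stats[k, cv2.CC_STAT_LEFT], stats[k, cv2.CC_STAT_TOP]
    w, h = stats[k, cv2.CC_STAT_WIDTH], stats[k, cv2.CC_STAT_HEIGHT]
    return (x, y, x + w, y + h)                              # circumscribed rectangle
```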
In an exemplary embodiment, performing target object detection based on the first feature map, the first attention profile, and the second feature map may include, in implementation:
classifying according to the second feature map to obtain a first classification result, wherein the first classification result represents probability values corresponding to each preset category in a plurality of preset categories;
weighting and summing a plurality of initialized convolution kernels according to the first classification result to obtain a target convolution kernel;
performing feature extraction on the first feature map according to the target convolution kernel to obtain a third feature map;
And determining category information of the target object in the image to be processed according to the third feature map and generating a second attention distribution map.
In the above embodiment, weighting and summing a plurality of initialized convolution kernels according to the first classification result to obtain the target convolution kernel realizes re-parameterization of the convolution kernel. By applying the re-parameterized convolution kernel to the first feature map, the parts of the weak response region that are helpful for classification occupy a higher weight in the final convolution, so that these parts can be activated in the subsequent second attention distribution map.
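A sketch of this convolution-kernel re-parameterization; the kernel size, the per-sample weighting, and the initialization are assumptions for illustration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ReparamConv(nn.Module):
    """Weights m initialized kernels by the first classification result (m class probs)."""
    def __init__(self, dim, m=6, k=3):
        super().__init__()
        self.kernels = nn.Parameter(torch.randn(m, dim, dim, k, k) * 0.02)
        self.pad = k // 2

    def forward(self, first_feat, probs):    # first_feat: (B, D, h, w); probs: (B, m)
        outs = []
        for b in range(first_feat.shape[0]):
            # Target convolution kernel: weighted sum of the m initialized kernels.
            kernel = (probs[b].view(-1, 1, 1, 1, 1) * self.kernels).sum(dim=0)
            outs.append(F.conv2d(first_feat[b:b + 1], kernel, padding=self.pad))
        return torch.cat(outs)               # third feature map, (B, D, h, w)
```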
In an exemplary embodiment, when classifying according to the second feature map to obtain a first classification result, the method may include:
performing dimension reduction processing on the second feature map to obtain a dimension reduction feature map;
Respectively carrying out global maximum pooling and global average pooling on the dimension reduction feature map to obtain a first pooling feature and a second pooling feature;
fusing the first pooling feature and the second pooling feature to obtain a fused pooling feature;
and classifying the fusion pooling features to obtain a first classification result.
In a specific implementation, the dimension reduction process may be to process the second feature map with a 1×1 convolution so as to compress the second feature map into one dimension. The first pooling feature and the second pooling feature may be fused by summing the two, and a Sigmoid activation function may be employed when classifying the fused pooling features. In the above embodiment, the features of the weak response region are measured from different dimensions by global average pooling (Global Average Pooling, GAP) and global maximum pooling (Global Max Pooling, GMP), improving the accuracy of the first classification result.
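A sketch of this classification branch; following the fig. 5 description, each pooled feature passes through its own fully connected layer before the two are added, and the single-channel dimension reduction is taken literally from the text:

```python
import torch
import torch.nn as nn

class FirstClassifier(nn.Module):
    """1x1 dimension reduction, then GMP and GAP branches fused and passed to Sigmoid."""
    def __init__(self, dim, num_classes):
        super().__init__()
        self.reduce = nn.Conv2d(dim, 1, kernel_size=1)   # compress to one channel
        self.fc_max = nn.Linear(1, num_classes)
        self.fc_avg = nn.Linear(1, num_classes)

    def forward(self, second_feat):            # second_feat: (B, D, h, w)
        r = self.reduce(second_feat)           # (B, 1, h, w) dimension-reduced map
        gmp = r.amax(dim=(2, 3))               # first pooled feature (global max pooling)
        gap = r.mean(dim=(2, 3))               # second pooled feature (global average pooling)
        fused = self.fc_max(gmp) + self.fc_avg(gap)   # fuse by summation (Add)
        return torch.sigmoid(fused)            # first classification result, (B, num_classes)
```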
In an exemplary embodiment, determining the category information of the target object in the image to be processed and generating the second attention profile according to the third feature map may include:
carrying out global average pooling on the third feature map to obtain a third pooling feature;
classifying based on the third pooling feature to obtain a second classification result, wherein the second classification result represents the class information of the target object in the image to be processed;
a second attention profile is generated based on the third pooling feature and the third feature map.
In a specific implementation, the third feature map includes a feature map for each channel, and the third pooling feature includes a feature corresponding to each channel. When generating the second attention distribution map, the feature of each channel in the third pooling feature is taken as the weight of that channel's feature map, and the weighted sum over the channel feature maps is computed, giving the second attention distribution map; in practical applications, the second attention distribution map may also be called a class activation map.
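A sketch of this CAM-style generation of the second attention distribution map; the final min-max normalization is an assumption, added so the map can later be multiplied pixel by pixel with the first attention distribution map:

```python
import torch

def second_attention_map(third_feat):
    """third_feat: (B, C, h, w) third feature map -> (B, h, w) second attention map."""
    pooled = third_feat.mean(dim=(2, 3))      # third pooled feature (global average pooling)
    # Each channel's pooled feature is the weight of that channel's feature map.
    cam = (pooled[:, :, None, None] * third_feat).sum(dim=1)
    flat = cam.flatten(1)                     # min-max normalize each map to [0, 1]
    lo = flat.min(dim=1, keepdim=True).values
    hi = flat.max(dim=1, keepdim=True).values
    return ((flat - lo) / (hi - lo + 1e-6)).reshape_as(cam)
```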
In an exemplary implementation, the image detection method of the embodiment of the present invention may be implemented based on an image detection model including an attention encoding unit and a local feature mining unit. The attention encoding unit is a Transformer-based neural network and may include an embedding layer and a plurality of attention coding layers: the embedding layer determines the first vector sequence according to the image block sequence corresponding to the image to be processed, and the plurality of attention coding layers sequentially perform attention coding on the first vector sequence to obtain the first feature map as the output of the attention encoding unit.
The local feature mining unit (CDM, Cue Digging Module) determines the attention matrix corresponding to each attention coding layer, extracts the target attention feature corresponding to the classification embedded vector in each attention matrix, fuses the target attention features to obtain the first attention distribution map, inverts the first attention distribution map to obtain the reverse attention distribution map, fuses the reverse attention distribution map with the first feature map output by the attention encoding unit to obtain the second feature map, and then performs target object detection according to the first feature map, the first attention distribution map, and the second feature map to obtain the category information and position information of the target object in the image to be processed.
In practical applications, the Transformer-based neural network may be a Transformer-based pre-trained classification network, such as a Vision Transformer (ViT) network. The ViT network follows the Transformer structure of the original BERT (Bidirectional Encoder Representations from Transformers), and its main idea is to convert the picture into a word-like (token) form: it introduces the concept of image blocks, i.e., the input picture is divided into image blocks of the same size, and each image block is then flattened into a one-dimensional image block embedded vector so that it can be fed into the encoder.
In addition to the image block embedded vectors described above, the ViT network requires another special input, namely the position code. Unlike a convolutional neural network, position coding is needed to encode the relative position information of the tokens, mainly because the self-attention structure in the Transformer encoder is insensitive to the order of the input sequence, i.e., shuffling the tokens in the sequence does not change the output result. If the network is not explicitly provided with the relative position information of the image blocks, it has to learn to infer their relative positional relationships from their semantics, which adds learning cost and reduces model accuracy.
To achieve image-level classification, the ViT network follows BERT in adding a special [Class] token; classification of the image can be realized by attaching a linear classifier to the final output feature of the [Class] token.
Fig. 5 is a schematic flow chart of another image detection method according to an embodiment of the present invention, and the image detection method according to the embodiment of the present invention is described in detail below with reference to fig. 5.
After the image to be processed is divided into N image blocks of the same size, the N image blocks are flattened into an image block sequence, which is input into the vector embedding (Embedding) layer. In the vector embedding layer, each image block is vector-embedded, position codes are added, and the initialized [Class] token is appended, producing a first vector sequence of length N+1 whose vectors each have dimension D.
The first vector sequence output by the vector embedding layer is then input into the Transformer-based encoding unit, and the output of the encoding unit is converted (Reshape) to obtain the first feature map X_L. The encoding unit includes a plurality of Transformer blocks (i.e., attention coding layers), each of which encodes based on a multi-head attention mechanism. As shown in fig. 5, each attention coding layer includes a Norm layer, a Multi-Head Attention layer, an Add layer, and an MLP layer. The Norm layer normalizes the hidden layers in the neural network towards a standard normal distribution; the Multi-Head Attention layer is a multi-head attention layer comprising a plurality of self-attention mechanism modules, whose inputs are the matrices obtained by three linear mappings, namely the query matrix Q, the key matrix K, and the value matrix V; the Add layer is used for residual connection; and the MLP layer is a multi-layer perceptron network. For the specific working principle of the attention coding layer, reference may be made to the related description in the prior art, which is omitted here.
For each attention coding layer, the local feature mining unit CDM averages the attention matrices of the self-attention mechanism modules in that layer to obtain the attention matrix A'^l ∈ R^((N+1)×(N+1)) corresponding to the layer, and extracts from A'^l the attention distribution corresponding to the [Class] token. The attention distributions corresponding to all attention coding layers are then averaged (i.e., a Mean operation) to obtain the first attention distribution map, which is next inverted (i.e., a Reverse operation) and multiplied with the first feature map X_L. The multiplication result (i.e., the second feature map) is processed by a 1×1 convolution, and two branches (global max pooling layer GMP + fully connected layer FC, and global average pooling layer GAP + fully connected layer FC) characterize the features of the weak response area from different dimensions. The outputs of the two branches are superposed (i.e., Add) and classified through a Sigmoid activation function, and the classification result is used for re-parameterization (i.e., Weighting) of the convolution kernel. Specifically, m convolution kernels may be initialized (for example, m may be 6), and the m convolution kernels are linearly weighted according to the Sigmoid output to generate a new convolution kernel used for the convolution operation; for example, if a certain area is classified as background, the corresponding convolution kernel is activated and occupies a higher weight in the final convolution.
The re-parameterized convolution kernel is applied to the first feature map X_L to output the third feature map X_CDM, which passes through a global average pooling layer and is then sent to a classifier for classification, yielding the category information of the target object in the image to be processed.
In addition, after the third feature map X_CDM passes through the global average pooling layer, a second attention distribution map M_CDM is generated based on the class activation map method. M_CDM is multiplied pixel by pixel with the first attention distribution map to obtain the final class activation map M_fuse, which is sent to the positioning module for positioning, yielding the position information of the target object in the image to be processed. The positioning module mainly performs connected-domain analysis on the class activation map to determine the target localization box.
According to the embodiment of the invention, the feature mining mechanism driven by the reverse attention map can effectively mine the regions in the weak response area that are also helpful for classification, compensating for the fact that the attention distribution map corresponding to the [Class] token ignores the correlation between image blocks and thus harms localization accuracy. A more complete and accurate localization box is finally obtained, which greatly improves the localization accuracy of objects in the image and ensures the accuracy of the position information.
The training process of the image detection model is described below. Specifically, the training of the image detection model may include:
Acquiring a sample image and a class label corresponding to the sample image, wherein the class label indicates reference class information of a target object in the sample image;
inputting a sample image block sequence corresponding to a sample image into an attention coding unit of a preset neural network model, and performing attention coding based on a plurality of attention coding layers in the attention coding unit to obtain a first sample feature map;
Determining an attention matrix of each attention coding layer based on a local feature mining unit of the preset neural network model, extracting target attention features of corresponding classified embedded vectors in each attention matrix, carrying out fusion processing on each target attention feature to obtain a first sample attention distribution map, carrying out inverse processing on the first sample attention distribution map to obtain a sample reverse attention distribution map, and carrying out fusion processing on the sample reverse attention distribution map and the first sample feature map to obtain a second sample feature map;
And determining a loss value according to the difference between the predicted category information corresponding to the sample image and the category label, reversely adjusting model parameters of the preset neural network model based on the loss value, and performing iterative training until a training ending condition is met, so as to obtain the image detection model.
The attention encoding unit of the preset neural network model comprises an embedding layer and a plurality of attention coding layers. After receiving the sample image block sequence, the embedding layer performs vector embedding on each sample image block to obtain a sample image block embedded vector sequence, adds the initialized classification embedded vector to obtain a sample embedded vector sequence, and then adds the corresponding position code to each sample embedded vector to obtain a sample first vector sequence. This first vector sequence is attention-coded sequentially by the plurality of attention coding layers, finally yielding the first sample feature map output by the attention encoding unit.
The sample image block sequence can be obtained by evenly segmenting the sample image and unfolding the resulting blocks into a sequence; the sample image blocks obtained by this average segmentation are all of the same size.
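As a minimal sketch of this average segmentation and embedding pipeline (PyTorch; the module name, image size, patch size, and embedding dimension are assumptions chosen for illustration, not the embodiment's exact architecture):

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Splits an image into equal-size blocks, embeds each block, prepends an
    initialized classification (class token) embedding, and adds position codes."""

    def __init__(self, img_size=224, patch_size=16, in_ch=3, dim=768):
        super().__init__()
        n_patches = (img_size // patch_size) ** 2
        # A strided convolution realizes equal-size block splitting plus
        # per-block linear embedding in one operation.
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))              # initialized class embedding
        self.pos_embed = nn.Parameter(torch.zeros(1, n_patches + 1, dim))  # position codes

    def forward(self, x):                      # x: (B, C, H, W)
        x = self.proj(x)                       # (B, dim, H/ps, W/ps)
        x = x.flatten(2).transpose(1, 2)       # unfold into a sequence: (B, N, dim)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1)         # prepend the classified embedded vector
        return x + self.pos_embed              # add position codes -> first vector sequence
```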
For the specific process by which the local feature mining unit of the preset neural network model determines the sample reverse attention distribution map and the second sample feature map, reference may be made to the relevant content of steps S405 to S407 in the foregoing embodiment of the present invention, which is not repeated here. In addition, when category prediction is performed based on the second sample feature map and the first sample feature map, the second sample feature map can first be classified to obtain the probability values corresponding to the preset categories; a plurality of initial convolution kernels in the local feature mining unit are then weighted and summed based on these probability values to obtain a target convolution kernel; feature extraction is further performed on the first sample feature map based on the target convolution kernel to obtain a third sample feature map; the third sample feature map is globally average pooled, and category prediction is performed on the pooled feature, thereby obtaining the predicted category information.
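A hedged sketch of this kernel re-parameterization step follows (PyTorch); it assumes a batch of one and that the number of initial kernels equals the number of preset categories, neither of which is stated by the embodiment:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicKernelHead(nn.Module):
    """Weighted-sums a set of initial convolution kernels using the probability
    values from the first classification result, then applies the resulting
    target kernel to the first (sample) feature map. Shapes are illustrative."""

    def __init__(self, num_kernels=8, channels=768, ksize=3):
        super().__init__()
        self.kernels = nn.Parameter(
            torch.randn(num_kernels, channels, channels, ksize, ksize) * 0.02)

    def forward(self, feat, weights):
        # feat: (1, C, H, W) first feature map; weights: (num_kernels,)
        # per-category probability values (batch of one for clarity)
        target_kernel = torch.einsum('k,kocij->ocij', weights, self.kernels)
        x = F.conv2d(feat, target_kernel, padding=self.kernels.shape[-1] // 2)
        return x  # third (sample) feature map, later globally average pooled
```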
The loss value in the training process can be obtained by applying a cross entropy loss function to the difference between the predicted category information corresponding to the sample image and the category label. The training end condition may be that the loss value reaches a preset loss threshold, or that the number of iterations reaches a preset iteration threshold.
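Under these conditions, the training loop could look like the following minimal sketch, where model and loader are hypothetical stand-ins for the preset neural network model and a (sample image, class label) data source, and the optimizer choice and thresholds are assumptions:

```python
import torch
import torch.nn as nn

def train(model, loader, lr=1e-4, loss_threshold=0.01, max_iters=100_000):
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()  # loss from predicted vs. reference class
    it, done = 0, False
    while not done:
        for images, labels in loader:
            logits = model(images)             # predicted category information
            loss = criterion(logits, labels)
            opt.zero_grad()
            loss.backward()                    # reversely adjust model parameters
            opt.step()
            it += 1
            # training ends when either threshold in the text is reached
            if loss.item() <= loss_threshold or it >= max_iters:
                done = True
                break
    return model
```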
The training process of the image detection model only requires annotating the true category of the target object in the sample image; accurate target position information no longer needs to be provided. This greatly reduces annotation complexity, so that massive images obtained from the Internet can be effectively used for model training, enabling large-scale, long-tail target localization.
The embodiment of the present invention also provides an image detection device corresponding to the image detection methods provided in the above embodiments. Since the image detection device provided in this embodiment corresponds to the image detection method provided in the above embodiments, the implementation of the image detection method described above is also applicable to the image detection device provided in this embodiment and will not be described in detail here.
Referring to fig. 6, which is a schematic structural diagram of an image detection device according to an embodiment of the present invention, the image detection device 600 has the function of implementing the image detection method in the above method embodiment, and the function may be implemented by hardware, or by hardware executing corresponding software. As shown in fig. 6, the image detection apparatus 600 may include:
A first determining module 610, configured to determine a first vector sequence according to an image block sequence corresponding to an image to be processed, where the first vector sequence includes a classification embedded vector and an image block vector corresponding to each image block in the image block sequence;
An attention encoding module 620, configured to sequentially perform attention encoding on the first vector sequence based on a plurality of attention encoding layers, so as to obtain a first feature map;
A first attention distribution determining module 630, configured to determine an attention matrix corresponding to each attention coding layer, extract target attention features corresponding to the classified embedded vectors in each attention matrix, and perform fusion processing on each target attention feature to obtain a first attention distribution map;
The inverse processing module 640 is configured to invert the first attention distribution map to obtain a reverse attention distribution map, and to fuse the reverse attention distribution map with the first feature map to obtain a second feature map;
And a target object detection module 650, configured to perform target object detection according to the first feature map, the first attention distribution map, and the second feature map, so as to obtain category information and position information of the target object in the image to be processed.
In an exemplary embodiment, the target object detection module 650 includes:
The second attention distribution determining module is used for determining category information of the target object in the image to be processed according to the second feature map and the first feature map and generating a second attention distribution map;
the target attention distribution determining module is used for multiplying the first attention distribution map and the second attention distribution map pixel by pixel to obtain a target attention distribution map;
and the position information determining module is used for determining the position information of the target object in the image to be processed according to the target attention distribution diagram.
In an exemplary embodiment, the second attention distribution determining module includes:
the first classification module is used for classifying according to the second feature map to obtain a first classification result, wherein the first classification result comprises probability values corresponding to each preset category in a plurality of preset categories;
The re-parameterization module is used for performing weighted summation on a plurality of initialized convolution kernels according to the first classification result to obtain a target convolution kernel;
The feature extraction module is used for performing feature extraction on the first feature map according to the target convolution kernel to obtain a third feature map;
and the second determining module is used for determining the category information of the target object in the image to be processed according to the third characteristic diagram and generating a second attention distribution diagram.
In an exemplary embodiment, the first classification module includes:
the dimension reduction module is used for carrying out dimension reduction processing on the second feature map to obtain a dimension reduction feature map;
The first pooling module is used for carrying out global maximum pooling and global average pooling on the dimension reduction feature map respectively to obtain a first pooling feature and a second pooling feature;
The pooling feature fusion module is used for fusing the first pooling feature and the second pooling feature to obtain a fused pooling feature;
And the classification sub-module is used for classifying the fused pooling feature to obtain the first classification result (a sketch of this dual-pooling classification follows this list).
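A minimal sketch of the dual-pooling classification performed by these modules (PyTorch; the channel widths, the 1×1 reduction, and element-wise summation as the fusion are assumptions for illustration):

```python
import torch
import torch.nn as nn

class DualPoolClassifier(nn.Module):
    """Reduces the second feature map's channel dimension, applies global
    maximum pooling and global average pooling, fuses the two pooled
    features, and classifies into the preset categories."""

    def __init__(self, in_ch=768, mid_ch=256, num_classes=1000):
        super().__init__()
        self.reduce = nn.Conv2d(in_ch, mid_ch, kernel_size=1)  # dimension reduction
        self.fc = nn.Linear(mid_ch, num_classes)

    def forward(self, x):                      # x: (B, C, H, W) second feature map
        x = self.reduce(x)
        gmp = x.amax(dim=(2, 3))               # global maximum pooling -> first pooling feature
        gap = x.mean(dim=(2, 3))               # global average pooling -> second pooling feature
        fused = gmp + gap                      # one possible fusion (element-wise sum)
        return self.fc(fused).softmax(dim=-1)  # probability values per preset category
```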
In an exemplary embodiment, the second determining module includes:
The second pooling module is used for carrying out global average pooling on the third feature map to obtain third pooling features;
The second classification module is used for classifying based on the third pooling feature to obtain a second classification result, wherein the second classification result represents the class information of the target object in the image to be processed;
An attention distribution map generation module for generating a second attention distribution map based on the third pooling feature and the third feature map (one common realization is sketched below).
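One common realization of this generation step is the class activation map technique, sketched here under the assumption that the second classification result comes from a single linear layer applied to the third pooling feature; this is an illustrative choice, not necessarily the embodiment's exact method:

```python
import torch

def class_activation_map(feat, fc_weight, cls_idx):
    """feat: (C, H, W) third feature map; fc_weight: (num_classes, C) weights
    of the classifier applied to the third pooling feature; cls_idx: index of
    the predicted class. Returns an (H, W) second attention distribution map."""
    w = fc_weight[cls_idx]                      # (C,)
    cam = torch.einsum('c,chw->hw', w, feat)    # weight channels by classifier weights
    cam = cam.clamp(min=0)                      # keep positive class evidence
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
    return cam
```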
In an exemplary embodiment, when fusing the reverse attention distribution map with the first feature map to obtain the second feature map, the inverse processing module 640 is specifically configured to multiply the reverse attention distribution map with the first feature map to obtain the second feature map.
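In tensor terms this fusion is a broadcast multiplication, as in the following sketch (shapes are assumptions):

```python
import torch

def reverse_attention_fusion(m, x1):
    """m: first attention distribution map in [0, 1], shape (H, W);
    x1: first feature map, shape (C, H, W)."""
    m_rev = 1.0 - m          # inversion of the attention distribution map
    return x1 * m_rev        # suppress strong responses so the network must
                             # mine the weak-response regions
```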
In an exemplary embodiment, the first attention distribution determining module 630 includes:
The first averaging module is used for determining, for each attention coding layer, the average of the attention matrices of the attention mechanism modules in that layer, so as to obtain the attention matrix corresponding to the attention coding layer;
The extraction module is used for extracting the target attention feature corresponding to the classified embedded vector from the attention matrix corresponding to each attention coding layer;
The second averaging module is used for averaging the target attention features corresponding to the attention coding layers to obtain the first attention distribution map (see the sketch following this list).
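The aggregation performed by these three modules can be sketched as follows (PyTorch; the matrix shapes and the class-token position at index 0 are assumptions for illustration):

```python
import torch

def class_token_attention(attn_per_layer, grid_hw):
    """attn_per_layer: list of (num_heads, N + 1, N + 1) attention matrices,
    one per attention coding layer; index 0 is assumed to be the position of
    the classified embedded vector (class token). grid_hw: (H, W) patch grid."""
    maps = []
    for attn in attn_per_layer:
        layer_attn = attn.mean(dim=0)     # average over attention mechanism modules (heads)
        maps.append(layer_attn[0, 1:])    # class-token attention to every image block
    m = torch.stack(maps).mean(dim=0)     # average over attention coding layers
    return m.reshape(grid_hw)             # first attention distribution map
```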
In an exemplary embodiment, the first determining module 610 includes:
The image segmentation module is used for acquiring an image to be processed, and segmenting the image to be processed into a plurality of image blocks to obtain an image block sequence;
The embedding module is used for carrying out vector embedding on the image blocks in the image block sequence to obtain an image block embedded vector sequence;
The first adding module is used for adding initialized classified embedded vectors into the image block embedded vector sequence to obtain an embedded vector sequence;
And the second adding module is used for adding position codes to each embedded vector in the embedded vector sequence to obtain the first vector sequence, and the position codes represent the position information of the corresponding embedded vector in the embedded vector sequence.
In an exemplary implementation manner, the image detection method of the embodiment of the present invention may be implemented based on an image detection model, and correspondingly, the image detection apparatus of the embodiment of the present invention further includes a training module, where the training module includes:
A sample acquisition module, configured to acquire a sample image and a category label corresponding to the sample image, where the category label indicates reference category information of a target object in the sample image;
The first sample feature map determining module is used for inputting a sample image block sequence corresponding to the sample image into an attention coding unit of a preset neural network model, and performing attention coding based on a plurality of attention coding layers in the attention coding unit to obtain a first sample feature map;
The class prediction module is used for determining an attention matrix of each attention coding layer based on a local feature mining unit of the preset neural network model, extracting the target attention feature corresponding to the classified embedded vector from each attention matrix, fusing the target attention features to obtain a first sample attention distribution map, inverting the first sample attention distribution map to obtain a sample reverse attention distribution map, fusing the sample reverse attention distribution map with the first sample feature map to obtain a second sample feature map, and determining the predicted category information corresponding to the sample image according to the second sample feature map and the first sample feature map;
And the training sub-module is used for determining a loss value according to the difference between the predicted category information corresponding to the sample image and the category label, reversely adjusting the model parameters of the preset neural network model based on the loss value, and performing iterative training until the training ending condition is met, so as to obtain the image detection model.
It should be noted that, when the apparatus provided in the foregoing embodiment implements its functions, the division into the above functional modules is only used as an example; in practical applications, the above functions may be allocated to different functional modules as required, that is, the internal structure of the device may be divided into different functional modules to implement all or part of the functions described above. In addition, the apparatus embodiment and the method embodiments provided above belong to the same concept; the specific implementation process of the apparatus is detailed in the method embodiments and is not repeated here.
The embodiment of the invention provides an electronic device, which comprises a processor and a memory, where at least one instruction or at least one program is stored in the memory, and the at least one instruction or the at least one program is loaded and executed by the processor to implement the image detection method provided by the above method embodiments.
The memory may be used to store software programs and modules; the processor performs various functional applications and image detection by running the software programs and modules stored in the memory. The memory may mainly include a program storage area and a data storage area, where the program storage area may store an operating system, application programs required for functions, and the like, and the data storage area may store data created according to the use of the device, and the like. In addition, the memory may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. Accordingly, the memory may also include a memory controller to provide the processor with access to the memory.
The method embodiments provided by the embodiments of the present invention may be executed in a computer terminal, a server, or a similar computing device. Taking a server as an example, fig. 7 is a block diagram of the hardware structure of a server running an image detection method according to an embodiment of the present invention. As shown in fig. 7, the server 700 may vary considerably depending on configuration or performance, and may include one or more central processing units (Central Processing Unit, CPU) 710 (the processor 710 may include, but is not limited to, a microprocessor MCU, a programmable logic device FPGA, or another processing device), a memory 730 for storing data, and one or more storage media 720 (e.g., one or more mass storage devices) storing application programs 723 or data 722. The memory 730 and the storage medium 720 may be transitory or persistent. The program stored in the storage medium 720 may include one or more modules, each of which may include a series of instruction operations on the server. Furthermore, the central processing unit 710 may be configured to communicate with the storage medium 720 and execute, on the server 700, the series of instruction operations in the storage medium 720. The server 700 may also include one or more power supplies 760, one or more wired or wireless network interfaces 750, one or more input/output interfaces 740, and/or one or more operating systems 721, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and the like.
The input/output interface 740 may be used to receive or transmit data via a network. A specific example of the network may include a wireless network provided by a communication provider of the server 700. In one example, the input/output interface 740 includes a network interface controller (Network Interface Controller, NIC) that can connect to other network devices through a base station to communicate with the Internet. In another example, the input/output interface 740 may be a radio frequency (Radio Frequency, RF) module for communicating with the Internet wirelessly.
It will be appreciated by those of ordinary skill in the art that the configuration shown in fig. 7 is merely illustrative and is not intended to limit the configuration of the electronic device described above. For example, server 700 may also include more or fewer components than shown in fig. 7, or have a different configuration than shown in fig. 7.
Embodiments of the present invention also provide a computer readable storage medium that may be disposed in an electronic device to store at least one instruction or at least one program for implementing an image detection method, where the at least one instruction or the at least one program is loaded and executed by a processor to implement the image detection method provided in the above method embodiments.
Embodiments of the present invention also provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the electronic device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions so that the electronic device performs the image detection method described above.
Alternatively, in this embodiment, the storage medium may include, but is not limited to, various media that can store program code, such as a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, or an optical disk.
It should be noted that the order of the embodiments of the present invention is for description only and does not indicate the relative merit of the embodiments. The foregoing description has been directed to specific embodiments of this specification; other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or a sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
In this specification, each embodiment is described in a progressive manner; identical and similar parts of the embodiments may be referred to each other, and each embodiment focuses on its differences from the other embodiments. In particular, for the device embodiments, since they are substantially similar to the method embodiments, the description is relatively simple, and reference may be made to the description of the method embodiments where relevant.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program for instructing relevant hardware, where the program may be stored in a computer readable storage medium, and the storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The foregoing describes only preferred embodiments of the invention and is not intended to limit the invention; any modifications, equivalent substitutions, and improvements made within the spirit and scope of the invention shall be included in the protection scope of the invention.

Claims (12)

