Detailed Description
The technical solutions of the present application will be clearly and completely described in connection with the embodiments, and it is apparent that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
Current image classification tasks fall mainly into traditional image classification and fine-grained image classification. In traditional image classification, regardless of what proportion of the whole image the discriminative region occupies, features are extracted from the whole image and classification is then performed on them. In fine-grained image classification, by contrast, the discriminative region is often confined to a small part of the image, so it is generally necessary first to locate the region of the object of interest and then to classify the object finely among several categories that differ only slightly.
Fine-grained image classification is divided into strongly supervised learning and weakly supervised learning. Strongly supervised learning requires additional annotated bounding boxes so that the network can learn the position of the target, which makes it similar to a target detection task. In weakly supervised learning, the network locates the discriminative region on its own, without position annotations, and then focuses on the feature differences within that region to identify the category of the target. A common approach is image classification based on an attention mechanism, which obtains the position of the discriminative region by analyzing the most salient part of the feature map.
In an image recognition scenario that only needs to determine whether a certain target exists in an image, without identifying the type, position or other detailed information of the target, a model trained as a traditional image classification task easily ignores the features of a small but critical target, so its recognition capability is poor. If a fine-grained classification task is used instead, the training process and the resulting model are overly complex, which reduces recognition efficiency.
Based on the above, the embodiments of the application provide a target recognition model training method, a device and an electronic device. A feature loss function value is calculated from features extracted from a fitting image, and back-propagation training is performed on an initial image classification model using the feature loss function value together with a cross entropy loss function value. In this way a target recognition model capable of recognizing whether an image carries a target can be trained, and the recognition accuracy and recall of the target recognition model are improved.
For the sake of understanding the present embodiment, first, a method for training a target recognition model disclosed in the present embodiment is described in detail.
FIG. 1 is a flowchart of a target recognition model training method. The method is applied to an electronic device in which an initial image classification model is pre-stored; the initial image classification model may be implemented in a variety of ways and is not specifically limited herein. The target may be a gun, a knife, or the like, and the target recognition model trained by the method provided in this embodiment can quickly determine whether an image contains or carries the target. The target recognition model training method specifically includes the following steps:
Step S11, a training sample set and a fitting image set are obtained.
The samples in the training sample set comprise positive samples and negative samples: a positive sample is an image that contains the target, and a negative sample is an image that does not. An image in the fitting image set is an image in which the area ratio of the target is greater than a set threshold, for example a pure sample that contains only the target, or an image in which the target occupies more than a certain proportion, such as 95%, of the image area. The threshold can be adjusted according to the actual situation.
Step S12, based on the training sample set and the fitting image set, a training sample subset and a current fitting image are determined for each round of training, and the following operations are executed for each round until the number of training rounds reaches a preset number or the total loss value converges to a preset convergence threshold, thereby obtaining the target recognition model.
During model training, the training sample subset and the current fitting image for the current round of training are determined from the training sample set and the fitting image set. For example, 20 images are selected from the training sample set as the training sample subset for the current round, and one fitting image is randomly drawn from the fitting image set as the current fitting image. The following five steps of the model training process are then executed, and training stops when the number of training rounds reaches the preset number (for example, 100) or the total loss value converges to the preset convergence threshold, yielding the target recognition model.
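The selection of fitting images by area ratio can be sketched as follows. This is only an illustration: the description does not say how the area ratio is measured, so the binary target mask, the helper names `target_area_ratio` and `select_fitting_images`, and the sample data are all assumptions.

```python
def target_area_ratio(mask):
    """Fraction of pixels marked as target in a binary mask (rows of 0/1).

    Assumption: a per-pixel target mask is available for each candidate image.
    """
    total = sum(len(row) for row in mask)
    if total == 0:
        return 0.0
    return sum(sum(row) for row in mask) / total


def select_fitting_images(candidates, threshold=0.95):
    """Keep the names of candidates whose target area ratio exceeds the threshold."""
    return [name for name, mask in candidates
            if target_area_ratio(mask) > threshold]


# Hypothetical candidates: (name, binary target mask).
candidates = [
    ("pure_target", [[1, 1], [1, 1]]),  # target covers the whole image
    ("half_target", [[1, 0], [1, 0]]),  # target covers half of the image
]
fitting_set = select_fitting_images(candidates, threshold=0.95)
print(fitting_set)
```

The threshold is passed as a parameter because the description states that it can be adjusted to the practical situation.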
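The per-round sampling and the two stopping conditions can be sketched as below. The function `run_training`, its parameters, and the stand-in `train_one_round` callback are hypothetical; treating "converges to the preset convergence threshold" as "total loss at or below the threshold" is also an assumption, since the description does not define the convergence test.

```python
import random


def run_training(train_set, fitting_set, train_one_round,
                 batch_size=20, max_rounds=100,
                 convergence_threshold=1e-3, seed=0):
    """Run training rounds until a round limit or a loss threshold is reached.

    train_one_round(batch, fitting_image) is assumed to perform the five
    steps S121-S125 and return the total loss value of that round.
    """
    rng = random.Random(seed)
    history = []
    for _ in range(max_rounds):
        # Training sample subset for this round (20 images in the example).
        batch = rng.sample(train_set, min(batch_size, len(train_set)))
        # One fitting image drawn at random from the fitting image set.
        fitting_image = rng.choice(fitting_set)
        total_loss = train_one_round(batch, fitting_image)
        history.append(total_loss)
        if total_loss <= convergence_threshold:
            break
    return history


def make_dummy_round():
    """Stand-in for real training: the loss simply halves every round."""
    state = {"loss": 1.0}

    def train_one_round(batch, fitting_image):
        state["loss"] *= 0.5
        return state["loss"]

    return train_one_round


history = run_training(list(range(50)), ["fit_a", "fit_b"], make_dummy_round())
print(len(history), history[-1])
```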
The following five steps are performed for each round of training:
step S121, inputting samples in the current training sample subset into an initial image classification model to obtain a first feature vector and a prediction label of each sample; wherein the first feature vector is a vector output by a first intermediate layer of the initial image classification model.
The first feature vector may be obtained in various ways, and the first intermediate layer from which it is extracted differs for initial image classification models of different structures. In the embodiment of the present application, the first intermediate layer may be a fusion module, which fuses the feature map extracted by the neural network with the attention map extracted by the attention structure and then outputs the first feature vector of the sample.
On the basis of obtaining the first feature vector of the sample, the classification result, that is, the prediction label of the sample, can be further output through the classifier, for example, the label includes Y and N, Y represents that the sample is an image containing the target, and N represents that the sample is an image not containing the target.
Step S122, extracting features of the current fitting image through a second intermediate layer of the initial image classification model to obtain a second feature vector corresponding to the current fitting image.
The second intermediate layer occupies a different structural position in the initial image classification model from the first intermediate layer. When the current fitting image is input into the initial image classification model, the second feature vector is output by the second intermediate layer.
Step S123, calculating the feature loss function value of the current training according to the first feature vector corresponding to the positive sample in the current training sample subset and the second feature vector corresponding to the current fitting image.
The feature loss function value may be calculated by substituting the two feature vectors into a preset feature loss function. If there is only one positive sample, the first feature vector corresponding to that positive sample and the second feature vector corresponding to the current fitting image are substituted directly into the preset feature loss function. Normally there are multiple positive samples, in which case the feature loss function value of each positive sample is calculated separately and the average of these values is taken as the feature loss function value of the current round of training.
Step S124, calculating the cross entropy loss function value of the training according to the prediction label and the real label corresponding to each sample in the current training sample subset.
Similarly, the calculation of the cross entropy loss function value can also be realized by adopting a preset calculation formula, and the average value of the cross entropy loss function values corresponding to a plurality of samples can be taken as the cross entropy loss function value of the training.
Step S125, determining the total loss value of the current round of training based on the feature loss function value and the cross entropy loss function value of the current round, and performing back-propagation training on the initial image classification model according to the total loss value of the current round.
In this step, the feature loss function value and the cross entropy loss function value of the current round are added to obtain the total loss value of the current round, and back-propagation training is then performed on the initial image classification model using the total loss value.
After a certain number of training rounds, a satisfactory target recognition model is finally obtained. In the target recognition model training method provided by the embodiment of the application, feature vector extraction from the fitting image is added so that a feature loss function value can be calculated, and back-propagation training is performed on the initial image classification model using the feature loss function value and the cross entropy loss function value. A target recognition model capable of recognizing whether an image carries a target can thus be trained, and the recognition accuracy and recall of the target recognition model are improved.
A preferred embodiment is described below, in which the training process of the target recognition model is implemented with an added attention mechanism. As shown in FIG. 2, in this embodiment the initial image classification model includes a convolutional neural network, an attention structure, a fusion module and a classifier connected in sequence; the fusion module, that is, the first intermediate layer, outputs the first feature vector of the sample.
The specific model training process is as follows:
(1) For the current training sample subset and its corresponding current fitting image, the feature extraction steps are carried out simultaneously:
The feature extraction process for the current training sample subset is as follows:
A. Inputting samples in the current training sample subset into a convolutional neural network to obtain an original feature map corresponding to each sample.
In the embodiment of the present application, a residual network (ResNet50) is used to extract a feature map from each sample in the current training sample subset; other mainstream convolutional neural networks, such as VGG or ResNet152, may also be used. Model parameters pre-trained on the ImageNet image database are used for initialization, and only the final fully connected layer needs to be modified for the two-class problem of whether the current sample carries the target. All sample data are first scaled to an input size of 224 x 224, and in the embodiment of the application the feature map output by the last convolution layer of the ResNet50 model is taken as the original feature map Vs of each sample in the current training sample subset.
B. Inputting the original feature map corresponding to each sample into an attention structure to obtain an attention map corresponding to each sample; the attention structure includes three convolution layers; each convolution layer is followed by a BN (batch normalization) layer and a rectified linear unit (ReLU).
After the feature map Vs is obtained from ResNet50, it is input to the attention structure, which learns an attention map Vatt. The attention structure consists of three convolutions: the first uses 1024 convolution kernels of size 1 x 1, the second uses 512 convolution kernels of size 3 x 3, and the third uses 1 convolution kernel of size 1 x 1, with one BN layer and one rectified linear unit following each convolution. The BN layer serves three main purposes: it speeds up training and convergence of the network; it mitigates gradient explosion and gradient vanishing; and it helps prevent overfitting.
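To make the sizes of the three convolutions concrete, the following sketch traces the output shape and the number of convolution weights of each layer. It assumes a 2048 x 7 x 7 input (the usual output of the last ResNet50 convolution stage for a 224 x 224 image), padding of 1 on the 3 x 3 convolution so that the spatial size is preserved, and a bias term per kernel; none of these details is stated in the description, and BN parameters are not counted.

```python
def conv_output_shape(height, width, kernel, padding=0, stride=1):
    """Spatial output size of a convolution using the standard formula."""
    out_h = (height + 2 * padding - kernel) // stride + 1
    out_w = (width + 2 * padding - kernel) // stride + 1
    return out_h, out_w


def conv_param_count(in_ch, out_ch, kernel, bias=True):
    """Number of learnable parameters of one convolution layer."""
    return in_ch * out_ch * kernel * kernel + (out_ch if bias else 0)


# (out_channels, kernel_size, padding) for the three attention convolutions.
layers = [(1024, 1, 0), (512, 3, 1), (1, 1, 0)]

channels, height, width = 2048, 7, 7  # assumed shape of the feature map Vs
trace = []
for out_ch, kernel, padding in layers:
    params = conv_param_count(channels, out_ch, kernel)
    height, width = conv_output_shape(height, width, kernel, padding)
    channels = out_ch
    trace.append((channels, height, width, params))

for entry in trace:
    print(entry)
```

Under these assumptions the last layer yields a single-channel 7 x 7 map, which matches the role of the attention map Vatt as one weight per spatial position.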
C. And inputting the original feature map and attention map corresponding to each sample into a fusion module to obtain a first feature vector corresponding to each sample.
Specifically, for each original feature map and attention map corresponding to each sample, the following operations are performed: spatially normalizing the attention map corresponding to the sample by a softmax function to obtain a value corresponding to each pixel in the attention map; and taking the value corresponding to each pixel in the attention map as a weight value, and carrying out weighted summation on the original feature map corresponding to the sample to obtain a first feature vector corresponding to the sample.
The softmax function described above is as follows:
$$a_{i,j} = \frac{\exp\left(v^{att}_{i,j}\right)}{\sum_{i',j'} \exp\left(v^{att}_{i',j'}\right)},$$
where $a_{i,j}$ is the value at position $(i, j)$ of the attention map $V_{att}$ after spatial normalization, that is, the weight value at position $(i, j)$ of the original feature map, and $v^{att}_{i,j}$ is the value at position $(i, j)$ of the attention map $V_{att}$ before normalization.
The first feature vector is calculated as follows:
$$v_1 = \sum_{i,j} x_{i,j}\, a_{i,j},$$
where $v_1$ represents the first feature vector corresponding to the sample, $x_{i,j}$ represents the feature vector at position $(i, j)$ of the original feature map $V_s$, and $a_{i,j}$ is the value at position $(i, j)$ of the attention map $V_{att}$ after spatial normalization, that is, the weight value at position $(i, j)$ of the original feature map.
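The spatial softmax and the weighted summation can be sketched in plain Python as below. The nested-list layout (feature_map[i][j] is the channel vector at position (i, j)) and the function names are assumptions chosen for readability; a real implementation would operate on tensors.

```python
import math


def spatial_softmax(att_map):
    """Normalize a 2-D attention map so that all its values sum to 1."""
    peak = max(v for row in att_map for v in row)  # subtract max for stability
    exps = [[math.exp(v - peak) for v in row] for row in att_map]
    total = sum(sum(row) for row in exps)
    return [[e / total for e in row] for row in exps]


def attention_pool(feature_map, att_map):
    """Compute v1 as the sum over (i, j) of a[i][j] * x[i][j]."""
    weights = spatial_softmax(att_map)
    channels = len(feature_map[0][0])
    v1 = [0.0] * channels
    for i, row in enumerate(feature_map):
        for j, x in enumerate(row):
            for c in range(channels):
                v1[c] += weights[i][j] * x[c]
    return v1


# Toy 2 x 2 feature map with 2 channels and a uniform attention map.
feature_map = [[[1.0, 0.0], [3.0, 0.0]],
               [[5.0, 4.0], [7.0, 4.0]]]
att_map = [[0.0, 0.0],
           [0.0, 0.0]]
v1 = attention_pool(feature_map, att_map)
print(v1)
```

With a uniform attention map every position receives the same weight, so the pooled vector reduces to the spatial average of the feature map; a trained attention map would instead concentrate the weight on the target region.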
The feature extraction process for the current fitted image is as follows:
A. Inputting the current fitting image into the convolutional neural network to obtain a second feature vector corresponding to the current fitting image. The convolutional neural network is the second intermediate layer of the initial image classification model.
The same deep convolutional neural network, ResNet50, is used to extract the features of the current fitting image. In this case the last fully connected layer of the network model is removed, and the features of the last convolution layer are extracted as the feature vector, yielding the second feature vector v2 corresponding to the current fitting image.
(2) And inputting the first feature vector corresponding to each sample into a classifier to obtain a prediction label corresponding to each sample.
The first feature vector $v_1$ corresponding to each sample is used to learn a two-class linear classifier for target recognition:
$$\hat{y} = W v_1 + b,$$
where $W$ and $b$ are the parameters of the linear classifier. The first feature vector $v_1$ corresponding to each sample is input into the classifier to obtain the prediction label corresponding to that sample.
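A two-class linear classifier on the first feature vector might look like the sketch below. The softmax used to turn the two scores into a confidence, the label order ("N", "Y"), and the toy parameter values are assumptions for illustration; the description only states that the classifier is linear with parameters W and b.

```python
import math


def linear_scores(v1, W, b):
    """Compute one score per class: W applied to v1 plus the bias b."""
    return [sum(w * x for w, x in zip(row, v1)) + bias
            for row, bias in zip(W, b)]


def predict(v1, W, b, labels=("N", "Y")):
    """Return the predicted label and its softmax confidence."""
    scores = linear_scores(v1, W, b)
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs[best]


# Toy parameters: class "Y" responds positively to the first feature.
W = [[-1.0, 0.0],
     [1.0, 0.0]]
b = [0.0, 0.0]
label, confidence = predict([2.0, 5.0], W, b)
print(label, round(confidence, 3))
```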
(3) The feature Loss function value corresponding to this round of training is calculated, as Loss2 in fig. 2.
In order to train the attention structure, the embodiment of the application calculates a feature fitting loss, that is, a measure of how well the first feature vector v1 used for classification fits the second feature vector v2 of the fitting image, so that the attention mechanism learns to locate the features of the target region in the image automatically. It should be noted that, because negative samples contain no target, the feature loss function value of the current round of training is calculated only for positive samples, through the following steps.
A. And calculating a first characteristic loss function value corresponding to each positive sample according to the first characteristic vector corresponding to each positive sample in the current training sample subset and the second characteristic vector corresponding to the current fitting image.
Specifically, the first feature loss function value of a positive sample is calculated by the following formula:
$$L_2 = \mathrm{MSE}(v_1, v_2),$$
where $L_2$ represents the first feature loss function value of the positive sample, $\mathrm{MSE}(\cdot)$ represents the mean square error function, $v_1$ represents the first feature vector corresponding to the positive sample, and $v_2$ represents the second feature vector corresponding to the current fitting image.
B. And carrying out mean value calculation on the first characteristic loss function values corresponding to the positive samples to obtain the characteristic loss function values of the training.
For example, the subset of the training samples of the present round includes 20 images, wherein 7 positive samples are included, and then an average value of the first feature loss function values corresponding to the 7 positive samples respectively can be calculated to obtain the feature loss function value of the training of the present round.
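The averaging over positive samples can be sketched as follows. Encoding the real label as 1 for positive and 0 for negative, and returning 0.0 when a subset contains no positive sample, are assumptions; the description does not cover the empty case.

```python
def mse(u, v):
    """Mean square error between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) / len(u)


def feature_loss(first_vectors, true_labels, v2):
    """Average MSE(v1, v2) over the positive samples of the current round."""
    per_sample = [mse(v1, v2)
                  for v1, y in zip(first_vectors, true_labels) if y == 1]
    if not per_sample:
        return 0.0  # assumption: no positive sample means no feature loss
    return sum(per_sample) / len(per_sample)


# Toy round: two positive samples and one negative sample.
first_vectors = [[1.0, 1.0], [3.0, 3.0], [9.0, 9.0]]
true_labels = [1, 1, 0]
v2 = [1.0, 1.0]  # second feature vector of the current fitting image
loss2 = feature_loss(first_vectors, true_labels, v2)
print(loss2)
```

The negative sample is skipped entirely, so its distance to the fitting image does not influence the feature loss.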
(4) The cross entropy Loss function value corresponding to this round of training is calculated, as Loss1 in fig. 2.
A. And calculating a first cross entropy loss function value corresponding to each sample according to the prediction label, the real label and the cross entropy loss function corresponding to each sample in the current training sample subset.
The loss between the prediction label $\hat{y}$ and the real label $y$ is calculated, that is, the difference between $\hat{y}$ and $y$ is minimized. The formula is:
$$L_1 = \mathrm{CrossEntropy}(\hat{y}, y),$$
where $\mathrm{CrossEntropy}(\cdot)$ is the cross entropy loss function. From this function, the first cross entropy loss function value corresponding to each sample can be calculated.
B. And carrying out mean value calculation on the first cross entropy loss function value corresponding to each sample to obtain the cross entropy loss function value of the training.
For example, if the training sample subset of the current round includes 20 images, the average of the first cross entropy loss function values corresponding to the 20 samples is calculated to obtain the cross entropy loss function value of the current round of training.
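The per-sample cross entropy and its average over the subset can be sketched as below. The sketch assumes the classifier output has already been turned into class probabilities and that labels are class indices (0 for N, 1 for Y); the small floor inside the logarithm is only a numerical safeguard.

```python
import math


def cross_entropy(probs, true_class):
    """Negative log-probability assigned to the true class of one sample."""
    return -math.log(max(probs[true_class], 1e-12))


def batch_cross_entropy(batch_probs, true_classes):
    """Average of the per-sample cross entropy values over the subset."""
    losses = [cross_entropy(p, y) for p, y in zip(batch_probs, true_classes)]
    return sum(losses) / len(losses)


# Toy subset of two samples: predicted probabilities for classes (N, Y).
batch_probs = [[0.5, 0.5], [0.25, 0.75]]
true_classes = [0, 1]
loss1 = batch_cross_entropy(batch_probs, true_classes)
print(round(loss1, 4))
```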
(5) The total Loss value corresponding to the training round is calculated, such as the Loss total in fig. 2.
The final loss function of the model is:
$$L = L_1 + L_2.$$
Therefore, the feature loss function value and the cross entropy loss function value of the current round of training are summed to obtain the total loss value of the current round of training.
(6) Back propagation training. And performing back propagation training based on the calculated total loss value of the training round.
Repeating the steps (1) - (6), and training to obtain the target recognition model.
In addition, the samples in the training sample set need to be labeled manually before training, that is, divided into positive samples and negative samples. Because data labeling is costly, little training data is available when the initial image classification model is first trained. To make use of unlabeled data, a semi-supervised training strategy is also adopted during training.
Namely: during model training, every preset number of training rounds, the target recognition model obtained so far is used to predict designated images, where a designated image is a target-related image that has not been labeled; if the confidence of the prediction result exceeds a preset threshold, the designated image is added to the training sample set for model training.
In practical application, a threshold k can be set. The trained target recognition model is first loaded to predict the unlabeled data, and images whose confidence is greater than the threshold k are automatically selected and added to training. After every n epochs of training, the model automatically reselects from the unlabeled data, and the size of the threshold k is adjusted by observing the amount of selected data and the test results during model training. Fine-tuning the model in this way can improve its accuracy and generalization capability.
In the target recognition model training method provided by the embodiment of the application, while the cross entropy loss of the model prediction is calculated, the degree of fit between the attention-weighted feature vector and the feature vector of the fitting image is also calculated so that the attention structure is trained directly, which improves the accuracy of model recognition. In addition, the semi-supervised training method, which selects unlabeled images while training is in progress, improves the generalization capability of the model without increasing the labeling cost.
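The confidence-based selection of unlabeled images can be sketched as follows. The function `select_pseudo_labeled`, the callback `predict_fn`, and the stub predictions are hypothetical; attaching the predicted label to each selected image is also an assumption, since the description only says that such images are added to the training sample set.

```python
def select_pseudo_labeled(unlabeled, predict_fn, k=0.9):
    """Split unlabeled images by prediction confidence against threshold k.

    predict_fn(image) is assumed to return (label, confidence) from the
    target recognition model obtained by the training so far.
    """
    selected, remaining = [], []
    for image in unlabeled:
        label, confidence = predict_fn(image)
        if confidence > k:
            selected.append((image, label))  # joins the training sample set
        else:
            remaining.append(image)  # reconsidered at the next selection
    return selected, remaining


# Stub predictions standing in for the current model's output.
stub_predictions = {
    "img_a": ("Y", 0.97),
    "img_b": ("N", 0.55),
    "img_c": ("N", 0.93),
}
selected, remaining = select_pseudo_labeled(
    ["img_a", "img_b", "img_c"], stub_predictions.get, k=0.9)
print(selected, remaining)
```

Calling this function once every n epochs and inspecting `len(selected)` gives the amount of selected data that the description suggests observing when adjusting k.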
Further, an embodiment of the present application further provides a target recognition method, as shown in fig. 3, where the method includes the following steps:
step S302, obtaining an image to be identified;
step S304, inputting the image to be identified into the target identification model to obtain an identification result corresponding to the image to be identified.
The target recognition model is a target recognition model obtained by training the target recognition model training method described in the previous embodiment, and the image to be recognized is input into the target recognition model to obtain a recognition result corresponding to the image to be recognized, that is, a prediction label is obtained by the extraction process of the first feature vector described in the previous embodiment and the prediction of the classifier, where the prediction label can represent whether the image to be recognized is an image including the target. The specific identification process can be referred to the previous embodiment, and will not be described herein.
Based on the method embodiment, the embodiment of the application also provides a target recognition model training device which is applied to electronic equipment, wherein the electronic equipment pre-stores an initial image classification model; referring to fig. 4, the apparatus includes:
an image set acquisition module 41, configured to acquire a training sample set and a fitting image set; the samples in the training sample set comprise positive samples and negative samples, and the image in the fitting image set is an image with the area ratio of the target being greater than a set threshold; the model training module 42 is configured to determine, based on the training sample set and the fitting image set, a training sample subset and a current fitting image corresponding to each training round, and perform the following operations for each training round, until the training round reaches a preset number of times or the total loss value converges to a preset convergence threshold, and obtain the target recognition model.
The model training module 42 includes a feature extraction and identification module 421, a loss value calculation module 422 and a back propagation training module 423. The feature extraction and identification module 421 is configured to input samples in the current training sample subset into the initial image classification model to obtain a first feature vector and a prediction label of each sample, where the first feature vector is a vector output by a first intermediate layer of the initial image classification model, and to extract features of the current fitting image through a second intermediate layer of the initial image classification model to obtain a second feature vector corresponding to the current fitting image. The loss value calculation module 422 is configured to calculate a feature loss function value of the current round of training according to the first feature vector corresponding to each positive sample in the current training sample subset and the second feature vector corresponding to the current fitting image; to calculate a cross entropy loss function value of the current round of training according to the prediction label and the real label corresponding to each sample in the current training sample subset; and to determine a total loss value of the current round of training based on the feature loss function value and the cross entropy loss function value of the current round. The back propagation training module 423 is configured to perform back-propagation training on the initial image classification model according to the total loss value of the current round of training.
Further, the initial image classification model comprises a convolutional neural network, an attention structure, a fusion module and a classifier which are connected in sequence; the fusion module is a first intermediate layer; the feature extraction and identification module 421 is further configured to: inputting samples in the current training sample subset into a convolutional neural network to obtain an original feature map corresponding to each sample; inputting the original feature map corresponding to each sample into an attention structure to obtain an attention map corresponding to each sample; inputting the original feature map and attention map corresponding to each sample into a fusion module to obtain a first feature vector corresponding to each sample; and inputting the first feature vector corresponding to each sample into a classifier to obtain a prediction label corresponding to each sample.
Further, the feature extraction and identification module 421 is further configured to: for each sample corresponding original feature map and attention map, the following operations are performed: spatially normalizing the attention map corresponding to the sample by a softmax function to obtain a value corresponding to each pixel in the attention map; and taking the value corresponding to each pixel in the attention map as a weight value, and carrying out weighted summation on the original feature map corresponding to the sample to obtain a first feature vector corresponding to the sample.
Further, the second intermediate layer is a convolutional neural network; the feature extraction and identification module 421 is further configured to: and inputting the current fitting image into a convolutional neural network to obtain a second feature vector corresponding to the current fitting image.
Further, the loss value calculation module 422 is further configured to: calculating a first feature loss function value corresponding to each positive sample according to a first feature vector corresponding to each positive sample in the current training sample subset and a second feature vector corresponding to the current fitting image; and carrying out mean value calculation on the first characteristic loss function values corresponding to the positive samples to obtain the characteristic loss function values of the training.
Further, the loss value calculation module 422 is further configured to calculate the first feature loss function value of a positive sample by the following formula:
$$L_2 = \mathrm{MSE}(v_1, v_2),$$
where $L_2$ represents the first feature loss function value of the positive sample, $\mathrm{MSE}(\cdot)$ represents the mean square error function, $v_1$ represents the first feature vector corresponding to the positive sample, and $v_2$ represents the second feature vector corresponding to the current fitting image.
Further, the loss value calculation module 422 is further configured to: calculating a first cross entropy loss function value corresponding to each sample according to the prediction label, the real label and the cross entropy loss function corresponding to each sample in the current training sample subset; and carrying out mean value calculation on the first cross entropy loss function value corresponding to each sample to obtain the cross entropy loss function value of the training.
Further, the loss value calculation module 422 is further configured to: and summing the characteristic loss function value and the cross entropy loss function value of the training to obtain the total loss value of the training.
Further, the attention structure includes three convolution layers; each convolution layer is followed by a BN (batch normalization) layer and a rectified linear unit (ReLU).
Further, the model training module 42 is further configured to: during model training, every preset number of training rounds, use the target recognition model obtained so far to predict designated images, where a designated image is a target-related image that has not been labeled; and, if the confidence of the prediction result exceeds a preset threshold, add the designated image to the training sample set for model training.
Further, the device further comprises: the image recognition module is used for acquiring an image to be recognized; and inputting the image to be identified into the target identification model to obtain an identification result corresponding to the image to be identified.
The implementation principle and the generated technical effects of the object recognition model training device provided by the embodiment of the application are the same as those of the object recognition model training method, and for the sake of brief description, reference may be made to corresponding contents in the embodiment of the object recognition model training method where the embodiment of the object recognition model training device is not mentioned.
An embodiment of the present application further provides an electronic device, as shown in fig. 5, which is a schematic structural diagram of the electronic device, where the electronic device includes a processor 51 and a memory 50, where the memory 50 stores computer executable instructions that can be executed by the processor 51, and the processor 51 executes the computer executable instructions to implement the above method.
In the embodiment shown in fig. 5, the electronic device further comprises a bus 52 and a communication interface 53, wherein the processor 51, the communication interface 53 and the memory 50 are connected by the bus 52.
The memory 50 may include a high-speed random access memory (RAM) and may further include a non-volatile memory, such as at least one magnetic disk memory. The communication connection between the system network element and at least one other network element is achieved via at least one communication interface 53 (which may be wired or wireless), and the internet, a wide area network, a local network, a metropolitan area network, etc. may be used. The bus 52 may be an ISA (Industry Standard Architecture) bus, a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like. The bus 52 may be classified as an address bus, a data bus, a control bus, or the like. For ease of illustration, only one bi-directional arrow is shown in FIG. 5, but this does not mean that there is only one bus or only one type of bus.
The processor 51 may be an integrated circuit chip with signal processing capabilities. In implementation, the steps of the above method may be performed by integrated logic circuits of hardware in the processor 51 or by instructions in the form of software. The processor 51 may be a general-purpose processor, including a central processing unit (Central Processing Unit, CPU for short), a network processor (Network Processor, NP for short), etc.; but also digital signal processors (Digital Signal Processor, DSP for short), application specific integrated circuits (Application Specific Integrated Circuit, ASIC for short), field-programmable gate arrays (Field-Programmable Gate Array, FPGA for short) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like. The steps of the method disclosed in connection with the embodiments of the present application may be embodied directly in the execution of a hardware decoding processor, or in the execution of a combination of hardware and software modules in a decoding processor. The software modules may be located in a random access memory, flash memory, read only memory, programmable read only memory, or electrically erasable programmable memory, registers, etc. as well known in the art. The storage medium is located in a memory and the processor 51 reads information in the memory and in combination with its hardware performs the steps of the method of the previous embodiment.
An embodiment of the present application also provides a computer-readable storage medium storing computer-executable instructions that, when called and executed by a processor, cause the processor to implement the above method. For the specific implementation, reference may be made to the foregoing method embodiment, and details are not repeated here.
The computer program product of the method, the apparatus, and the electronic device for training the target recognition model provided by the embodiments of the present application includes a computer-readable storage medium storing program code. The instructions included in the program code may be used to execute the method described in the foregoing method embodiment. For the specific implementation, reference may be made to the method embodiment, and details are not repeated here.
Unless specifically stated otherwise, the relative arrangement of the components and steps, the numerical expressions, and the numerical values set forth in these embodiments do not limit the scope of the present application.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a processor-executable, non-volatile computer-readable storage medium. Based on this understanding, the technical solution of the present application in essence, or the part of it that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The software product is stored in a storage medium and comprises several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present application. The aforementioned storage medium includes a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or any other medium capable of storing program code.
In the description of the present application, it should be noted that the directions or positional relationships indicated by the terms "center", "upper", "lower", "left", "right", "vertical", "horizontal", "inner", "outer", etc. are based on the directions or positional relationships shown in the drawings. They are used merely for convenience in describing the present application and to simplify the description, and do not indicate or imply that the devices or elements referred to must have a specific orientation or be configured and operated in a specific orientation; they should therefore not be construed as limiting the present application. Furthermore, the terms "first", "second", and "third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
Finally, it should be noted that the above examples are only specific embodiments of the present application, intended to illustrate its technical solutions rather than to limit them, and the protection scope of the present application is not limited thereto. Although the present application has been described in detail with reference to the foregoing examples, those skilled in the art should understand that any person skilled in the art may, within the technical scope of the present disclosure, still modify the technical solutions described in the foregoing embodiments, readily conceive of changes to them, or make equivalent substitutions for some of their technical features. Such modifications, changes, or substitutions do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application, and are intended to be included in the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.