Detailed Description
The technical solutions of the present application will be described clearly and completely with reference to the following embodiments, and it should be understood that the described embodiments are some, but not all embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The current image classification tasks are mainly divided into traditional image classification tasks and fine-grained image classification tasks. In the traditional image classification task, no matter how large the proportion of an important judgment area in an image in the whole image is, only the characteristics of the whole image are extracted at one glance, and then classification is carried out; in fine-grained image classification, the discriminable area in an image to be classified is usually only a small area in the image, so that an area of an object of interest is usually obtained first, and then the object is subjected to fine classification in a plurality of classes with small differences.
Fine-grained image classification is further divided into strongly supervised learning and weakly supervised learning. Strongly supervised learning requires additional annotated bounding boxes to be supplied to the network so that the network can learn the position information of the target, which is similar to a target detection task. In weakly supervised learning, the network locates the discriminative region through unsupervised learning and then pays particular attention to the feature differences of that region to identify the category of the target.
In an image recognition scene which only needs to recognize whether a certain target exists in an image and does not need to recognize detailed information such as the type, the position and the like of the target, if a traditional image classification task is adopted for model training, the characteristics of a key small target are easily ignored, and the recognition capability of a model is poor; if the fine-grained classification task is used for model training, the training process and the obtained model are too complex, and the recognition efficiency is influenced.
Based on this, the embodiment of the application provides a method and a device for training a target recognition model, and an electronic device, wherein a feature loss function value can be calculated through feature extraction of a fitted image, and a reverse gradient propagation training is performed on an initial image classification model through the feature loss function value and a cross entropy loss function value, so that the target recognition model capable of recognizing whether an image carries a target or not can be trained, and the recognition accuracy and the recall rate of the target recognition model are improved.
For the convenience of understanding the present embodiment, a method for training a target recognition model disclosed in the embodiments of the present application will be described in detail first.
Fig. 1 is a flowchart of a target recognition model training method according to an embodiment of the present disclosure. The method is applied to an electronic device in which an initial image classification model is prestored; the initial image classification model may be implemented in various ways and is not limited here. The target may be a gun, a knife, or another article. The target recognition model trained by the target recognition model training method provided in this embodiment can quickly determine whether an image includes or carries the target. The target recognition model training method specifically includes the following steps:
Step S11, acquiring a training sample set and a fitting image set.
The samples in the training sample set comprise positive samples and negative samples; the positive samples are images containing the target, and the negative samples are images not containing the target. The images in the fitting image set are images in which the area ratio of the target is greater than a set threshold, for example pure samples containing only the target, or images in which the target occupies more than 95% of the area; the threshold can be adjusted according to actual conditions.
Step S12, determining a training sample subset and a current fitting image corresponding to each training round based on the training sample set and the fitting image set, and executing the following operations for each training round until the training round reaches a preset number of times or the total loss value converges to a preset convergence threshold value, so as to obtain a target recognition model.
During model training, the training sample subset and the current fitting image corresponding to the current round need to be determined from the training sample set and the fitting image set. For example, 20 images are selected from the training sample set as the samples in the training sample subset for the current round, and one fitting image is randomly selected from the fitting image set as the current fitting image. The model training process of the following five steps is then executed, and training stops when the training round reaches a preset number (such as 100) or the total loss value converges to a preset convergence threshold value, yielding the target recognition model.
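By way of illustration only, the per-round selection and the stopping condition described above might be sketched in Python as follows. The function names, the default values, and the reading of "converges to a preset convergence threshold value" as "total loss no greater than the threshold" are assumptions made for this sketch and are not specified by the embodiment.

```python
import random


def select_round_inputs(training_set, fitting_images, subset_size=20, rng=None):
    """Pick the training sample subset and one fitting image for a round."""
    rng = rng or random.Random()
    # Sample without replacement; fall back to the whole set if it is small.
    subset = rng.sample(training_set, min(subset_size, len(training_set)))
    # One fitting image is chosen at random for the current round.
    fitting_image = rng.choice(fitting_images)
    return subset, fitting_image


def should_stop(round_index, total_loss, max_rounds=100, convergence_threshold=1e-3):
    """Assumed stopping rule: preset round count reached or loss below threshold."""
    return round_index >= max_rounds or total_loss <= convergence_threshold
```

A training driver would call `select_round_inputs` at the start of each round and check `should_stop` after the total loss of that round has been computed.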
The following five steps are performed for each round of training:
Step S121, inputting samples in the current training sample subset into the initial image classification model to obtain a first feature vector and a prediction label of each sample; the first feature vector is a vector output by the first intermediate layer of the initial image classification model.
The first feature vector may be obtained in multiple ways, and the first intermediate layer used to extract the feature vector differs between initial image classification models of different structures. In this embodiment, the first intermediate layer may be a fusion module, which fuses the feature map extracted by the neural network with the attention map extracted by the attention structure and outputs the first feature vector of the sample.
Once the first feature vector of the sample is obtained, the classifier outputs a classification result, that is, the prediction label of the sample. For example, the labels include Y and N, where Y indicates that the sample is an image containing the target and N indicates that the sample is an image not containing the target.
Step S122, performing feature extraction on the current fitting image through a second intermediate layer of the initial image classification model to obtain a second feature vector corresponding to the current fitting image.
The second intermediate layer and the first intermediate layer occupy different structural positions in the initial image classification model; when the current fitting image is input into the initial image classification model, the second feature vector is output by the second intermediate layer.
Step S123, calculating a feature loss function value of the training in the current round according to the first feature vector corresponding to the positive sample in the current training sample subset and the second feature vector corresponding to the current fitting image.
The feature loss function value can be calculated by substituting the two kinds of feature vectors into a preset feature loss function. If there is only one positive sample, the first feature vector corresponding to that positive sample and the second feature vector corresponding to the current fitting image are substituted directly into the preset feature loss function. Generally there are multiple positive samples, in which case the feature loss function value of each positive sample is calculated separately, and the average of these values is taken as the feature loss function value of the current round of training.
Step S124, calculating a cross entropy loss function value of the training in the current round according to the predicted label and the real label corresponding to each sample in the current training sample subset.
Similarly, the calculation of the cross entropy loss function value can also be realized by adopting a preset calculation formula, and the average value of the cross entropy loss function values corresponding to a plurality of samples can be taken as the cross entropy loss function value of the training in the current round.
Step S125, determining a total loss value of the current round of training based on the feature loss function value and the cross entropy loss function value of the current round of training, and performing reverse gradient propagation training on the initial image classification model according to the total loss value of the current round of training.
In this step, the characteristic loss function value and the cross entropy loss function value of the current round of training are added to obtain a total loss value of the current round of training, and then the initial image classification model is subjected to inverse gradient propagation training through the total loss value.
Through a certain number of times of cyclic training processes, an ideal target recognition model can be obtained finally. According to the target recognition model training method provided by the embodiment of the application, the feature vector extraction of the fitting image is added, so that the feature loss function value can be calculated, the reverse gradient propagation training is carried out on the initial image classification model through the feature loss function value and the cross entropy loss function value, the target recognition model which can identify whether the image carries the target or not can be trained, and the recognition accuracy and the recall rate of the target recognition model are improved.
In the following, a preferred embodiment is listed, in which the training process of the target recognition model is implemented by adding an attention mechanism, and as shown in fig. 2, in the embodiment of the present application, the initial image classification model includes a convolutional neural network, an attention structure, a fusion module, and a classifier, which are connected in sequence; the fusion module, i.e. the first intermediate layer, may output a first feature vector of the sample.
The specific model training process is as follows:
(1) Performing feature extraction on the current training sample subset and the corresponding current fitting image simultaneously:
The feature extraction process for the current training sample subset is as follows:
A. Inputting the samples in the current training sample subset into a convolutional neural network to obtain an original feature map corresponding to each sample.
In the embodiment of the present application, ResNet50 (Residual Network) is used to extract the feature map of the samples in the current training sample subset; other mainstream convolutional neural networks, such as VGG or ResNet152, may also be used. The network is initialized with model parameters trained on the ImageNet image database, and during training only the last fully connected layer needs to be modified for the binary classification problem of whether a sample carries the target. All input samples are first scaled to 224 x 224, and the feature map output by the last convolutional layer of the ResNet50 model is taken as the original feature map Vs of each sample in the current training sample subset.
B. Inputting the original feature map corresponding to each sample into an attention structure to obtain an attention map corresponding to each sample; the attention structure comprises three convolutional layers, each followed by a BN (batch normalization) layer and a rectified linear unit (ReLU).
After the feature map Vs is obtained from ResNet50, it is input to the attention structure to learn the attention map Vatt. The attention structure consists of three convolutional layers: the first uses 1024 convolution kernels of size 1 x 1, the second uses 512 convolution kernels of size 3 x 3, and the third uses 1 convolution kernel of size 1 x 1; each convolution is followed by a BN layer and a rectified linear unit. The BN layer serves three main purposes: accelerating the training and convergence of the network; controlling gradient explosion and preventing gradient vanishing; and preventing overfitting.
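For orientation, the tensor shapes passing through the three convolutional layers can be traced with a short Python sketch. It assumes the standard ResNet50 behaviour that a 224 x 224 input yields a 7 x 7 x 2048 feature map at the last convolutional stage, and it further assumes stride 1 throughout and a padding of 1 for the 3 x 3 layer so that the spatial size is preserved; the embodiment does not state the stride or padding, and the helper names are hypothetical.

```python
# (out_channels, kernel_size, padding) for the three convolutional layers.
# Kernel counts and sizes follow the description; the padding is an assumption.
ATTENTION_LAYERS = [(1024, 1, 0), (512, 3, 1), (1, 1, 0)]


def conv_output_size(size, kernel, padding, stride=1):
    """Standard output-size formula for a convolution along one axis."""
    return (size + 2 * padding - kernel) // stride + 1


def attention_shapes(height=7, width=7, channels=2048):
    """Return the (H, W, C) shape before and after each attention layer."""
    shapes = [(height, width, channels)]
    for out_channels, kernel, padding in ATTENTION_LAYERS:
        height = conv_output_size(height, kernel, padding)
        width = conv_output_size(width, kernel, padding)
        shapes.append((height, width, out_channels))
    return shapes


def conv_weight_counts(in_channels=2048):
    """Number of convolution weights per layer (biases not counted)."""
    counts = []
    for out_channels, kernel, _ in ATTENTION_LAYERS:
        counts.append(kernel * kernel * in_channels * out_channels)
        in_channels = out_channels
    return counts
```

Under these assumptions the attention map Vatt has a single channel and the same 7 x 7 spatial size as Vs, which is what allows it to act as a per-position weight on the original feature map.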
C. Inputting the original feature map and the attention map corresponding to each sample into a fusion module to obtain a first feature vector corresponding to each sample.
Specifically, the following operations are performed for the original feature map and the attention map corresponding to each sample: carrying out spatial standardization on the attention diagram corresponding to the sample through a softmax function to obtain a value corresponding to each pixel in the attention diagram; and taking the value corresponding to each pixel in the attention map as a weight value, and carrying out weighted summation on the original feature map corresponding to the sample to obtain a first feature vector corresponding to the sample.
The softmax function described above is as follows:
$$a_{i,j} = \frac{\exp\left(V_{att}(i,j)\right)}{\sum_{i',j'} \exp\left(V_{att}(i',j')\right)}$$
where $a_{i,j}$ is the value at position (i, j) in the attention map Vatt after spatial normalization, that is, the weight value at position (i, j) in the original feature map; and $V_{att}(i,j)$ is the value at position (i, j) in the attention map Vatt before normalization.
The first feature vector is calculated as follows:
$$v_1 = \sum_{i,j} x_{i,j}\, a_{i,j},$$
where $v_1$ represents the first feature vector corresponding to the sample; $x_{i,j}$ represents the feature vector at position (i, j) in the original feature map Vs; and $a_{i,j}$ is the value at position (i, j) in the attention map Vatt after spatial normalization, that is, the weight value at position (i, j) in the original feature map.
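The fusion step, namely spatial softmax normalization of the attention map followed by a weighted sum over the original feature map, can be illustrated with a minimal pure-Python sketch. The nested-list layout (`feature_map[i][j]` holding the channel vector at position (i, j)) and the function names are assumptions chosen for readability; an actual implementation would normally use a tensor library.

```python
import math


def spatial_softmax(attention_map):
    """Normalize a 2-D attention map so that all positions sum to 1."""
    peak = max(v for row in attention_map for v in row)
    # Subtracting the maximum keeps exp() numerically stable.
    exps = [[math.exp(v - peak) for v in row] for row in attention_map]
    total = sum(sum(row) for row in exps)
    return [[e / total for e in row] for row in exps]


def fuse(feature_map, attention_map):
    """Compute v1 as the attention-weighted sum of per-position feature vectors."""
    weights = spatial_softmax(attention_map)
    channels = len(feature_map[0][0])
    v1 = [0.0] * channels
    for i, row in enumerate(feature_map):
        for j, x_ij in enumerate(row):
            for c in range(channels):
                v1[c] += weights[i][j] * x_ij[c]
    return v1
```

With a constant attention map every position receives the same weight, so the result reduces to the spatial average of the feature map; a peaked attention map instead pulls $v_1$ toward the features of the highlighted region.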
The feature extraction process for the current fitted image is as follows:
A. Inputting the current fitting image into the convolutional neural network to obtain a second feature vector corresponding to the current fitting image. The convolutional neural network is the second intermediate layer of the initial image classification model.
Likewise, the deep convolutional neural network ResNet50 is used to extract the features of the current fitting image. In this case, the last fully connected layer of the network model is removed and the features of the last convolutional layer are extracted as the feature vector, yielding the second feature vector $v_2$ corresponding to the current fitting image.
(2) Inputting the first feature vector corresponding to each sample into a classifier to obtain a prediction label corresponding to each sample.
The first feature vector $v_1$ corresponding to each sample is used to learn a binary linear classifier for target recognition:
$$\hat{y} = W v_1 + b$$
where W and b are the linear classifier parameters; the first feature vector $v_1$ corresponding to each sample is input into the classifier to obtain the prediction label $\hat{y}$ corresponding to each sample.
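As a hedged illustration of the classifier step, the following Python sketch applies a linear map with parameters W and b to a first feature vector and picks the higher-scoring class. Treating the classifier as producing two scores, one per class, and mapping them to the labels N and Y by taking the larger score are assumptions of this sketch; the embodiment states only that the classifier is binary and linear.

```python
def linear_scores(v1, weights, bias):
    """Compute W v1 + b, with one row of `weights` per output class."""
    return [sum(w * x for w, x in zip(row, v1)) + b
            for row, b in zip(weights, bias)]


def predict_label(v1, weights, bias, labels=("N", "Y")):
    """Return the label whose score is largest (assumed decision rule)."""
    scores = linear_scores(v1, weights, bias)
    return labels[scores.index(max(scores))]
```

For example, with `weights = [[1.0, 0.0], [0.0, 1.0]]` and a zero bias, the vector `[1.0, 2.0]` scores higher on the second row and is therefore labeled Y.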
(3) Calculating the feature loss function value corresponding to the current round of training, shown as Loss2 in Fig. 2.
In order to train the attention structure, the embodiment of the application calculates a feature fitting loss, namely the loss between the second feature vector $v_2$ of the fitting image and the first feature vector $v_1$ used for classification. The feature loss function value corresponding to the current round of training is calculated by the following steps.
A. And calculating a first characteristic loss function value corresponding to each positive sample according to the first characteristic vector corresponding to each positive sample in the current training sample subset and the second characteristic vector corresponding to the current fitting image.
Specifically, the first feature loss function value of the positive sample is calculated by the following formula:
$$L_2 = \mathrm{MSE}(v_1, v_2)$$
where $L_2$ represents the first feature loss function value of the positive sample; MSE() represents the mean square error function; $v_1$ represents the first feature vector corresponding to the positive sample; and $v_2$ represents the second feature vector corresponding to the current fitting image.
B. And carrying out mean value calculation on the first characteristic loss function values corresponding to the positive samples to obtain the characteristic loss function values of the training of the round.
For example, the subset of training samples in the current round includes 20 images, where 7 of the images are positive samples, and then the average of the first characteristic loss function values corresponding to the 7 positive samples can be calculated to obtain the characteristic loss function value of the current round of training.
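The two sub-steps above, a per-positive-sample mean square error followed by an average over the positive samples, might look as follows in Python. Encoding the real label as 1 for positive and 0 for negative, and returning 0.0 when a subset happens to contain no positive sample, are assumptions of this sketch rather than details given in the embodiment.

```python
def mse(a, b):
    """Mean square error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def feature_loss(first_vectors, labels, v2):
    """Average MSE(v1, v2) over the positive samples of the current subset."""
    positives = [v1 for v1, y in zip(first_vectors, labels) if y == 1]
    if not positives:
        # Assumed behaviour: no positive sample contributes no feature loss.
        return 0.0
    return sum(mse(v1, v2) for v1 in positives) / len(positives)
```

Negative samples are skipped because only images that contain the target are expected to resemble the fitting image in feature space.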
(4) The cross entropy Loss function value corresponding to this round of training is calculated, as shown in Loss1 in fig. 2.
A. And calculating a first cross entropy loss function value corresponding to each sample according to the prediction label, the real label and the cross entropy loss function corresponding to each sample in the current training sample subset.
The cross entropy loss of the prediction label $\hat{y}$ with respect to the real label y is computed, that is, the cross entropy loss between $\hat{y}$ and y is to be minimized, by the formula:
$$L_1 = \mathrm{CrossEntropy}(\hat{y}, y)$$
where CrossEntropy() is the cross entropy loss function. The first cross entropy loss function value corresponding to each sample is calculated through this function.
B. And carrying out mean value calculation on the first cross entropy loss function values corresponding to the samples to obtain cross entropy loss function values of the training of the current round.
Continuing the above example, if the training sample subset of the current round includes 20 images, the average of the first cross entropy loss function values corresponding to the 20 samples is calculated to obtain the cross entropy loss function value of the current round of training.
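A minimal Python sketch of the per-sample cross entropy and its average over the subset is given below. It assumes that the model output has already been converted to a predicted probability that the sample contains the target, and that the real label is 1 for positive and 0 for negative samples; the clipping constant `eps` is likewise an assumption added to avoid taking the logarithm of zero.

```python
import math


def binary_cross_entropy(p_positive, y, eps=1e-12):
    """Cross entropy of one sample given its predicted positive probability."""
    p = min(max(p_positive, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def cross_entropy_loss(probabilities, labels):
    """Average the per-sample cross entropy over all samples in the subset."""
    losses = [binary_cross_entropy(p, y) for p, y in zip(probabilities, labels)]
    return sum(losses) / len(losses)
```

Unlike the feature loss, this average runs over every sample in the subset, positive and negative alike.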
(5) Calculating the total loss value corresponding to the current round of training, shown as Loss total in Fig. 2.
The feature loss function value and the cross entropy loss function value of the current round of training are summed to obtain the total loss value of the current round of training.
The final loss function of the model is:
$$L_{total} = L_1 + L_2$$
where $L_1$ is the cross entropy loss function value and $L_2$ is the feature loss function value of the current round of training.
(6) Performing back propagation training based on the total loss value of the current round of training calculated above.
Steps (1) to (6) are repeated to train the target recognition model.
In addition, the samples in the training sample set need to be labeled manually before training, that is, divided into positive samples and negative samples. Because data labeling is costly, little training data is available when the initial image classification model is trained. To improve the generalization capability of the model, the embodiment of the application further adopts semi-supervised training, in which a large amount of unlabeled data related to the target is added to training.
That is, during model training, the target recognition model obtained by the current training is used to predict a specified image every preset number of training rounds, where the specified image is an unlabeled image related to the target; if the confidence of the prediction result exceeds a preset threshold, the specified image is added to the training sample set for model training.
In practical application, a threshold k can be set. The trained target recognition model is first loaded to predict the unlabeled data, and images whose confidence is greater than the threshold k are automatically selected and added to training. The model automatically reselects the unlabeled data once every n epochs of training, and the threshold k is adjusted by observing the amount of selected data and the test results during model training. Through such fine-tuning, the accuracy and generalization capability of the model can be improved.
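The confidence-based selection of unlabeled images can be sketched in Python as follows. The `predict` callable, which is assumed to return a (label, confidence) pair for an image, the default value of k, and the helper names are all hypothetical and stand in for whatever interface the trained model actually exposes.

```python
def select_confident(unlabeled_images, predict, k=0.9):
    """Keep unlabeled images whose prediction confidence exceeds threshold k."""
    selected = []
    for image in unlabeled_images:
        label, confidence = predict(image)
        if confidence > k:
            # The predicted label is used as the training label for this image.
            selected.append((image, label))
    return selected


def is_reselection_epoch(epoch, n):
    """True once every n epochs, when the unlabeled data is reselected."""
    return epoch > 0 and epoch % n == 0
```

Raising k admits fewer but more reliable images, while lowering it admits more data at the risk of noisier labels, which is why the description suggests tuning k against the selected data volume and the test results.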
According to the target recognition model training method provided by the embodiment of the application, the cross entropy loss predicted by the model is calculated, and meanwhile the fitting capacity between the attention weighted feature vector and the fitting image is calculated to directly train the attention structure, so that the accuracy of model recognition is improved. And the semi-supervised training method of selecting the unlabelled images while training is carried out in the training process, so that the generalization capability of the model can be improved without increasing the labeling cost.
Further, an embodiment of the present application further provides a target identification method, as shown in fig. 3, the method includes the steps of:
Step S302, acquiring an image to be recognized;
Step S304, inputting the image to be recognized into the target recognition model to obtain a recognition result corresponding to the image to be recognized.
The target recognition model is obtained by training the target recognition model training method in the previous embodiment, and the image to be recognized is input to the target recognition model to obtain the recognition result corresponding to the image to be recognized, that is, the prediction label is obtained through the extraction process of the first feature vector in the previous embodiment and the prediction of the classifier, and the prediction label can represent whether the image to be recognized is the image containing the target. For a specific identification process, reference may be made to the above embodiment, which is not described herein again.
Based on the method embodiment, the embodiment of the application also provides a target recognition model training device, which is applied to electronic equipment, wherein an initial image classification model is prestored in the electronic equipment; referring to fig. 4, the apparatus includes:
an image set obtaining module 41, configured to obtain a training sample set and a fitting image set; the samples in the training sample set comprise positive samples and negative samples, and the images in the fitting image set are images of which the area occupation ratio of the target is greater than a set threshold value; and the model training module 42 is configured to determine a training sample subset and a current fitting image corresponding to each training round based on the training sample set and the fitting image set, and perform the following operations for each training round until the training round reaches a preset number of times or a total loss value converges to a preset convergence threshold value, so as to stop training and obtain a target recognition model.
The model training module 42 includes: the system comprises a feature extraction and identification module 421, a loss value calculation module 422 and a back propagation training module 423, wherein the feature extraction and identification module 421 is used for inputting samples in a current training sample subset into an initial image classification model to obtain a first feature vector and a prediction label of each sample; the first feature vector is a vector output by a first middle layer of the initial image classification model; performing feature extraction on the current fitting image through a second intermediate layer of the initial image classification model to obtain a second feature vector corresponding to the current fitting image; the loss value calculating module 422 is configured to calculate a loss function value of the feature of the training in the current round according to a first feature vector corresponding to each positive sample in the current training sample subset and a second feature vector corresponding to the current fitting image; calculating a cross entropy loss function value of the training of the current round according to a prediction label and a real label corresponding to each sample in the current training sample subset; determining a total loss value of the training of the current round based on the characteristic loss function value and the cross entropy loss function value of the training of the current round; the back propagation training module 423 is configured to perform back gradient propagation training on the initial image classification model according to the total loss value of the training in the current round.
Further, the initial image classification model comprises a convolutional neural network, an attention structure, a fusion module and a classifier which are connected in sequence; the fusion module is a first middle layer; the feature extraction and identification module 421 is further configured to: inputting samples in the current training sample subset into a convolutional neural network to obtain an original characteristic diagram corresponding to each sample; inputting the original characteristic diagram corresponding to each sample into an attention structure to obtain an attention diagram corresponding to each sample; inputting the original characteristic diagram and the attention diagram corresponding to each sample into a fusion module to obtain a first characteristic vector corresponding to each sample; and inputting the first feature vector corresponding to each sample into a classifier to obtain a prediction label corresponding to each sample.
Further, the feature extraction and identification module 421 is further configured to: for each sample corresponding original feature map and attention map, the following operations are performed: carrying out spatial standardization on the attention diagram corresponding to the sample through a softmax function to obtain a value corresponding to each pixel in the attention diagram; and taking the value corresponding to each pixel in the attention map as a weight value, and carrying out weighted summation on the original feature map corresponding to the sample to obtain a first feature vector corresponding to the sample.
Further, the second intermediate layer is a convolutional neural network; the feature extraction and identification module 421 is further configured to: and inputting the current fitting image into the convolutional neural network to obtain a second feature vector corresponding to the current fitting image.
Further, the loss value calculation module 422 is further configured to: calculating a first characteristic loss function value corresponding to each positive sample according to a first characteristic vector corresponding to each positive sample in the current training sample subset and a second characteristic vector corresponding to the current fitting image; and carrying out mean value calculation on the first characteristic loss function values corresponding to the positive samples to obtain the characteristic loss function values of the training of the round.
Further, the loss value calculation module 422 is further configured to calculate the first feature loss function value of the positive sample by the following equation:
$$L_2 = \mathrm{MSE}(v_1, v_2)$$
where $L_2$ represents the first feature loss function value of the positive sample; MSE() represents the mean square error function; $v_1$ represents the first feature vector corresponding to the positive sample; and $v_2$ represents the second feature vector corresponding to the current fitting image.
Further, the loss value calculation module 422 is further configured to: calculating a first cross entropy loss function value corresponding to each sample according to a prediction label, a real label and a cross entropy loss function corresponding to each sample in the current training sample subset; and carrying out mean value calculation on the first cross entropy loss function values corresponding to the samples to obtain cross entropy loss function values of the training of the current round.
Further, the loss value calculation module 422 is further configured to: and summing the characteristic loss function value and the cross entropy loss function value of the training of the current round to obtain the total loss value of the training of the current round.
Further, the attention structure includes three convolutional layers; each convolutional layer is followed by a BN layer and a rectified linear unit (ReLU).
Further, the model training module 42 is further configured to: during model training, predict a specified image with the target recognition model obtained by the current training every preset number of training rounds, where the specified image is an unlabeled image related to the target; and if the confidence of the prediction result exceeds a preset threshold, add the specified image to the training sample set for model training.
Further, the above apparatus further comprises: the image recognition module is used for acquiring an image to be recognized; and inputting the image to be recognized into the target recognition model to obtain a recognition result corresponding to the image to be recognized.
The implementation principle and the generated technical effect of the target recognition model training device provided in the embodiment of the present application are the same as those of the target recognition model training method embodiment, and for brief description, the corresponding contents in the target recognition model training method embodiment may be referred to where the embodiment of the target recognition model training device is not mentioned.
An electronic device is further provided in the embodiments of the present application, as shown in fig. 5, which is a schematic structural diagram of the electronic device, where the electronic device includes a processor 51 and a memory 50, the memory 50 stores computer-executable instructions capable of being executed by the processor 51, and the processor 51 executes the computer-executable instructions to implement the method.
In the embodiment shown in fig. 5, the electronic device further comprises a bus 52 and a communication interface 53, wherein the processor 51, the communication interface 53 and the memory 50 are connected by the bus 52.
The memory 50 may include a high-speed Random Access Memory (RAM) and may also include a non-volatile memory, such as at least one disk memory. The communication connection between the network element of the system and at least one other network element is realized through at least one communication interface 53 (which may be wired or wireless), and the internet, a wide area network, a local network, a metropolitan area network, and the like can be used. The bus 52 may be an ISA (Industry Standard Architecture) bus, a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like. The bus 52 may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one double-headed arrow is shown in FIG. 5, but this does not indicate only one bus or one type of bus.
The processor 51 may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be performed by integrated logic circuits of hardware in the processor 51 or by instructions in the form of software. The processor 51 may be a general-purpose processor, such as a Central Processing Unit (CPU) or a Network Processor (NP); it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. A general-purpose processor may be a microprocessor or any conventional processor. The steps of the method disclosed in connection with the embodiments of the present application may be directly implemented by a hardware decoding processor, or implemented by a combination of hardware and software modules in the decoding processor. The software module may be located in a storage medium well known in the art, such as RAM, flash memory, ROM, PROM, EPROM, or registers. The storage medium is located in the memory, and the processor 51 reads information in the memory and performs the steps of the method of the foregoing embodiments in combination with its hardware.
Embodiments of the present application further provide a computer-readable storage medium storing computer-executable instructions. When the computer-executable instructions are called and executed by a processor, they cause the processor to implement the above method. For the specific implementation, reference may be made to the foregoing method embodiments, which is not repeated here.
The computer program product of the target recognition model training method and apparatus provided in the embodiments of the present application includes a computer-readable storage medium storing program code. The instructions included in the program code may be used to execute the method described in the foregoing method embodiments. For the specific implementation, reference may be made to the method embodiments, which is not repeated here.
Unless specifically stated otherwise, the relative arrangement of the components and steps, the numerical expressions, and the numerical values set forth in these embodiments do not limit the scope of the present application.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a processor-executable non-volatile computer-readable storage medium. Based on such understanding, the technical solution of the present application in essence, or the part of it that contributes over the prior art, or a part of the technical solution, may be embodied in the form of a software product. The software product is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
In the description of the present application, it should be noted that the terms "center", "upper", "lower", "left", "right", "vertical", "horizontal", "inner", "outer", and the like indicate orientations or positional relationships based on those shown in the drawings. They are used only for convenience and simplicity of description, and do not indicate or imply that the device or element referred to must have a particular orientation or be constructed and operated in a particular orientation; therefore, they should not be construed as limiting the present application. Furthermore, the terms "first", "second", and "third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
Finally, it should be noted that the above embodiments are only specific embodiments of the present application, used to illustrate its technical solutions rather than to limit them, and the protection scope of the present application is not limited thereto. Although the present application has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that, within the technical scope disclosed in the present application, any person skilled in the art can still modify the technical solutions described in the foregoing embodiments, readily conceive of changes to them, or make equivalent substitutions for some of their technical features. Such modifications, changes or substitutions do not cause the corresponding technical solutions to depart from the spirit and scope of the embodiments of the present application, and are intended to be covered by the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.