Disclosure of Invention
In order to solve the technical problems of weak generalization ability and low detection accuracy in the existing computer vision technology, the invention provides a deep learning-based detection method for kitchen violation behaviors, which can accurately extract violation visual features, is not limited by a kitchen scene and has strong generalization ability. The specific implementation mode is as follows:
A method for detecting kitchen violations, comprising:
acquiring kitchen violation pictures published on a network to construct a public atlas A, and acquiring kitchen violation pictures captured by a camera in an actual kitchen scene to construct a real atlas B; constructing a public violation area set a from the violation pictures in the public atlas A, and constructing a real violation area set b from the violation pictures in the real atlas B;
constructing a coding-decoding detection model based on a convolutional neural network structure;
the coding-decoding model comprises an encoder and a decoder, wherein the decoder is a decoupling structure of the encoder;
performing first iterative training on the coding-decoding model by taking the violation pictures in the public atlas A and the annotation files in the public violation area set a as training samples, and then performing second iterative training on the coding-decoding model by taking the violation pictures in the real atlas B and the annotation files in the real violation area set b as training samples, to obtain a kitchen violation detection model;
and acquiring, in real time, an image shot by the kitchen camera and inputting the image into the kitchen violation detection model for violation detection.
In one possible embodiment, after acquiring the kitchen violation pictures published on the network to construct the public atlas A, the method further includes:
balancing the sample distribution in the public atlas A to obtain a public atlas A with uniformly distributed samples;
and performing image enhancement on the samples in the public atlas A with the uniformly distributed samples to obtain the public atlas A with diverse sample data.
In one possible embodiment, constructing the public violation area set a from the violation pictures in the public atlas A includes:
marking the violation area of each kitchen violation picture in the public atlas A by using a marking tool, generating an annotation file, and constructing the public violation area set a by using the annotation files as sample data.
In one possible embodiment, the training of the first iteration of the coding-decoding model includes:
performing feature matching training on the coding-decoding model to obtain a detection model with accurate feature matching;
and carrying out violation classification training on the detection model with the accurate feature matching.
In one possible embodiment, the training of feature matching on the coding-decoding model includes:
when the intersection-over-union ratio IOU1 is smaller than a first threshold R1 in a preset threshold set R, judging that the prediction is wrong and finishing the training;
when IOU1 is greater than the first threshold R1, comparing IOU1 with a second threshold R2 in the preset threshold set R;
when IOU1 is smaller than the second threshold R2, judging that the prediction is wrong;
and when IOU1 is greater than the second threshold R2, taking the prediction results in the original interval [R1, R2] as negative samples, balancing the positive and negative samples and retraining, calculating the intersection-over-union ratio again as IOU2 and comparing it with the next threshold in the preset threshold set R, thereby iteratively raising the intersection-over-union ratio; the feature matching training finishes when a prediction is judged wrong or when all thresholds in the preset threshold set R have been compared.
In one possible embodiment, the performing violation classification training on the detection model with accurate feature matching includes:
calculating a cross entropy loss function value between the input layer and the output layer of the coding-decoding model;
and when the cross entropy loss function value is larger than a preset threshold K, adjusting the classification network parameters of the coding-decoding model according to the cross entropy loss function value to obtain a new output result, and recalculating the cross entropy loss function value between the input layer and the output layer; these steps are iterated until the cross entropy loss function value is smaller than the preset threshold K, at which point the training ends.
In one possible embodiment, before the training of the first iteration of the coding-decoding model, the method further includes:
and inputting the training sample into an encoder and encoding the training sample by using the encoder to obtain an m-dimensional vector for calculating the cross entropy loss function value.
A kitchen violation detection device, comprising:
a sample library construction module, configured to acquire kitchen violation pictures published on a network to construct a public atlas A, acquire kitchen violation pictures captured by a camera in an actual kitchen scene to construct a real atlas B, construct a public violation area set a from the violation pictures in the public atlas A, and construct a real violation area set b from the violation pictures in the real atlas B;
a model initialization module, configured to construct a coding-decoding detection model based on a convolutional neural network structure;
a model training module, configured to perform first iterative training on the coding-decoding model by taking the violation pictures in the public atlas A and the annotation files in the public violation area set a as training samples, and then perform second iterative training on the coding-decoding model by taking the violation pictures in the real atlas B and the annotation files in the real violation area set b as training samples, to obtain a kitchen violation detection model;
and a violation detection module, configured to acquire, in real time, an image shot by the kitchen camera and input the image into the kitchen violation detection model for violation detection.
In one possible embodiment, the sample library construction module is specifically configured to:
balancing the sample distribution in the public atlas A to obtain a public atlas A with uniformly distributed samples;
and performing image enhancement on the samples in the public atlas A with the uniformly distributed samples to obtain the public atlas A with diverse sample data.
And marking the violation region of the kitchen violation behavior picture in the public atlas A by using a marking tool, generating the marking file, and constructing a public violation region set a by using the marking file as sample data.
In one possible embodiment, the model training module is specifically configured to:
performing feature matching training on the coding-decoding model to obtain a detection model with accurate feature matching;
and carrying out violation classification training on the detection model with the accurate feature matching.
when the intersection-over-union ratio IOU1 is smaller than a first threshold R1 in a preset threshold set R, judging that the prediction is wrong and finishing the training;
when IOU1 is greater than the first threshold R1, comparing IOU1 with a second threshold R2 in the preset threshold set R;
when IOU1 is smaller than the second threshold R2, judging that the prediction is wrong;
and when IOU1 is greater than the second threshold R2, taking the prediction results in the original interval [R1, R2] as negative samples, balancing the positive and negative samples and retraining, calculating the intersection-over-union ratio again as IOU2 and comparing it with the next threshold in the preset threshold set R, thereby iteratively raising the intersection-over-union ratio; the feature matching training finishes when a prediction is judged wrong or when all thresholds in the preset threshold set R have been compared.
calculating a cross entropy loss function value between the input layer and the output layer of the coding-decoding model;
and when the cross entropy loss function value is larger than a preset threshold K, adjusting the classification network parameters of the coding-decoding model according to the cross entropy loss function value to obtain a new output result, and recalculating the cross entropy loss function value between the input layer and the output layer; these steps are iterated until the cross entropy loss function value is smaller than the preset threshold K, at which point the training ends.
and inputting the training sample into the encoder and encoding the sample with the encoder to obtain an m-dimensional vector used for calculating the cross entropy loss function value.
A kitchen violation detection apparatus comprising a memory and a processor, the memory having stored therein computer-readable instructions that, when executed by the processor, cause the processor to perform the kitchen violation detection method described above.
A storage medium having computer-readable instructions stored thereon, which, when executed by one or more processors, cause the one or more processors to perform the above-described method of kitchen violation detection.
Compared with the prior art, the method acquires kitchen violation pictures published on a network to construct a public atlas A, acquires kitchen violation pictures captured by a camera in an actual kitchen scene to construct a real atlas B, constructs a public violation area set a from the violation pictures in the public atlas A, and constructs a real violation area set b from the violation pictures in the real atlas B; constructs a coding-decoding detection model based on a convolutional neural network structure; performs first iterative training on the coding-decoding model by taking the violation pictures in the public atlas A and the annotation files in the public violation area set a as training samples, and then performs second iterative training by taking the violation pictures in the real atlas B and the annotation files in the real violation area set b as training samples, to obtain a kitchen violation detection model; and acquires, in real time, an image shot by the kitchen camera and inputs it into the kitchen violation detection model for violation detection. A kitchen violation detection method with strong generalization ability and high detection accuracy is thereby realized.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
As used herein, the singular forms "a", "an", and "the" may include the plural forms as well, unless expressly stated otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
Fig. 1 is an overall flowchart of a kitchen violation detection method in an embodiment of the present application, and as shown in fig. 1, a kitchen violation detection method includes the following steps:
S1, acquiring kitchen violation pictures published on a network to construct a public atlas A, acquiring kitchen violation pictures captured by a camera in an actual kitchen scene to construct a real atlas B, constructing a public violation area set a from the violation pictures in the public atlas A, and constructing a real violation area set b from the violation pictures in the real atlas B;
Kitchen violation pictures in the ImageNet data set and the COCO data set (both visual databases) are acquired as a first picture sample set, and kitchen violation pictures of kitchen enterprises are retrieved with a web search engine as a second picture sample set; the two sample sets are merged into the final violation picture sample set to construct the public atlas A = {A1, A2, …, AN}. Surveillance video of a kitchen is acquired, and 5000 video frames containing violation behaviors are taken as violation pictures to construct the real atlas B = {B1, B2, …, BN}, where N is the total number of violation pictures, i.e., 5000 in this embodiment. In this embodiment, 5000 cases of each violation behavior are labeled as training sample data.
Because acquiring and labeling data from an actual scene is costly, the model is first trained on a public data set and, after it converges, is transferred to the actual scene data for fine-tuning. The pre-trained shallow network parameters of the model are frozen, the deep network parameters are trained and optimized with the actual data until the model converges, and the shallow network parameters are then optimized to obtain the final model parameters.
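As a non-limiting illustration, the staged fine-tuning described above can be sketched as a schedule of trainable flags per layer. The layer names, the split point `n_shallow`, and the choice to leave the deep layers trainable during the second stage are assumptions made for this sketch; the embodiment does not fix them.

```python
from typing import Dict, List


def fine_tune_stages(layer_names: List[str], n_shallow: int) -> List[Dict[str, bool]]:
    """Return, for each fine-tuning stage, a map from layer name to 'trainable'.

    Stage 1: the first n_shallow (pre-trained, shallow) layers are frozen and
             only the deeper layers are trained on the actual scene data.
    Stage 2: the shallow layers are unfrozen as well (assumption: deep layers
             stay trainable) for the final optimization pass.
    """
    shallow = set(layer_names[:n_shallow])
    stage1 = {name: name not in shallow for name in layer_names}
    stage2 = {name: True for name in layer_names}
    return [stage1, stage2]


# Hypothetical layer names, ordered from shallow to deep.
layers = ["conv1", "conv2", "conv3", "conv4", "classifier"]
stages = fine_tune_stages(layers, n_shallow=2)
for index, flags in enumerate(stages, start=1):
    trainable = [name for name, on in flags.items() if on]
    print(f"stage {index}: train {trainable}")
```

In a deep learning framework the same flags would typically be applied by disabling gradient updates for the frozen layers before each stage.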
In one embodiment, the building of the public atlas A in S1 may include the following steps:
balancing the sample distribution in the public atlas A to obtain a public atlas A with uniformly distributed samples;
The distribution of the picture samples is balanced from multiple angles. For example, violation pictures showing different types of caps commonly seen in kitchen scenes, such as disposable caps and cloth caps, are added; violation pictures showing caps of different colors are added, for example a head chef wearing a white cap with golden stripes and an ordinary chef wearing a white cap without stripes; and violation pictures showing caps of different shapes, such as square or oblate caps, are added.
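Where collecting additional pictures for an under-represented category is not possible, one simple substitute is random oversampling. The sketch below is an assumption for illustration, not the balancing procedure of the embodiment: it repeats randomly chosen samples until every category has as many entries as the largest one. The category names and image identifiers are hypothetical.

```python
import random
from collections import Counter, defaultdict


def balance_by_oversampling(samples, seed=0):
    """Equalize category counts by repeating randomly chosen samples.

    samples: list of (image_id, category) pairs.
    Returns a new list in which each category occurs as often as the
    most frequent category in the input.
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for image_id, category in samples:
        groups[category].append((image_id, category))
    target = max(len(group) for group in groups.values())
    balanced = []
    for group in groups.values():
        balanced.extend(group)
        # Pad the smaller categories with random repeats of their own samples.
        balanced.extend(rng.choice(group) for _ in range(target - len(group)))
    return balanced


samples = [
    ("a1", "disposable cap"),
    ("a2", "disposable cap"),
    ("a3", "disposable cap"),
    ("a4", "cloth cap"),
]
balanced = balance_by_oversampling(samples)
counts = Counter(category for _, category in balanced)
print(counts)
```

Repeated samples add no new visual information, so oversampling is usually combined with the image enhancement step that follows.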
And performing image enhancement on the samples in the public atlas A with the uniformly distributed samples to obtain the public atlas A with diverse sample data.
Image enhancement adds information to, or applies transformations to, the original image in order to selectively highlight features of interest or suppress (mask) unwanted features, so that the image matches the visual response characteristics. In this embodiment, the image enhancement methods mainly adopted are horizontal flipping, image scaling, image cropping, image rotation, and the like, which increase the diversity of the sample data.
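The geometric transformations named above can be sketched on a small image held as a nested list of pixel values. This is a minimal illustration that assumes integer scale factors and 90-degree rotation only; a production pipeline would normally rely on an image library.

```python
def horizontal_flip(img):
    """Mirror each row left to right."""
    return [row[::-1] for row in img]


def rotate_90_clockwise(img):
    """Rotate the image by 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]


def crop(img, top, left, height, width):
    """Cut out a height x width window whose top-left corner is (top, left)."""
    return [row[left:left + width] for row in img[top:top + height]]


def scale_nearest(img, factor):
    """Upscale by an integer factor using nearest-neighbor repetition."""
    return [
        [pixel for pixel in row for _ in range(factor)]
        for row in img
        for _ in range(factor)
    ]


image = [[1, 2],
         [3, 4]]
flipped = horizontal_flip(image)
rotated = rotate_90_clockwise(image)
cropped = crop(image, top=0, left=1, height=2, width=1)
scaled = scale_nearest(image, factor=2)
```

When a picture carries an annotated violation area, the same transformation has to be applied to the coordinates of the annotation box so that the label still covers the violation.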
Similarly, the above steps may be included after the real atlas B is constructed.
According to the embodiment, the sample data is uniformly distributed, the diversity of the sample data is increased, the model has high robustness, and the method can adapt to different scenes.
In one embodiment, constructing the public violation area set a from the violation pictures in the public atlas A in S1 may include the following steps:
and marking the violation region of the kitchen violation behavior picture in the public atlas A by using a marking tool, generating a public violation region marking file, and constructing a public violation region set a by using the public violation region marking file as sample data.
The violation area of each type of violation scene in the picture is framed with the marking tool and given a violation type label, and the marking tool generates a corresponding xml file containing the coordinates of the marking frame in the picture and the label. The violation area is specifically the set of pixel point positions of the violation behavior in the corresponding violation picture. For example, for a violation picture whose violation behavior is "improperly wearing a chef hat", the violation area is the set of pixel point positions of the chef's head in the picture, the violation type label is "improperly wearing a hat", and the generated xml file includes the coordinates of the violation area and the violation label "improperly wearing a hat".
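Reading such an annotation file back can be sketched as follows. The xml layout shown here (an `object` element holding a `name` label and a `bndbox` with corner coordinates, in the style of Pascal VOC annotations) is an assumption, since the embodiment does not specify the schema; the file name and label value are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotation content in an assumed Pascal VOC style layout.
SAMPLE_XML = """<annotation>
  <filename>kitchen_0001.jpg</filename>
  <object>
    <name>hat_violation</name>
    <bndbox><xmin>120</xmin><ymin>40</ymin><xmax>200</xmax><ymax>110</ymax></bndbox>
  </object>
</annotation>"""


def parse_annotation(xml_text):
    """Extract the picture file name and every labeled violation box."""
    root = ET.fromstring(xml_text)
    regions = []
    for obj in root.findall("object"):
        box = obj.find("bndbox")
        coords = tuple(
            int(box.findtext(tag)) for tag in ("xmin", "ymin", "xmax", "ymax")
        )
        regions.append({"label": obj.findtext("name"), "box": coords})
    return {"filename": root.findtext("filename"), "regions": regions}


annotation = parse_annotation(SAMPLE_XML)
print(annotation)
```

Each parsed entry pairs a violation label with the box coordinates, which is the information the training step consumes alongside the picture.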
The construction of the real violation area set b may include the same steps described above.
In the embodiment, the distribution condition of the violation area and all types of the violation labels are determined, and a strict violation standard is provided for the following violation behavior model training.
S2, constructing a coding-decoding detection model based on a convolutional neural network structure;
The coding-decoding model is implemented with a convolutional neural network structure currently used in the computer vision field, such as LeNet-5, VGG, AlexNet, or GoogLeNet. A convolutional neural network models the biological visual perception mechanism and can perform both supervised and unsupervised learning. Its structure comprises an input layer, hidden layers, and an output layer; the sharing of convolution kernel parameters within the hidden layers and the sparsity of inter-layer connections enable the network to learn grid-like features such as pixels and audio with a small amount of computation, with stable results and no additional feature engineering requirement on the data.
In a preferred embodiment, a VGG network structure is adopted to construct the coding-decoding model, which comprises an encoder and a decoder, wherein the decoder is a decoupling structure of the encoder. The input layer of the coding-decoding model receives pictures, and the encoder and the decoder perform standardized processing of the pictures. Referring to fig. 2, the input of the encoder is a picture I and its output is an encoding result X, which is an m-dimensional vector; this form mainly makes it convenient to calculate the cross entropy loss function of the model, and the end of model training is controlled by comparing the value of the cross entropy loss function with a preset threshold P1. The input of the decoder is the encoding result X of the encoder, and its output is the picture Id, i.e., X = E(I) and Id = D(X), where E and D denote the mappings computed by the encoder and the decoder, respectively.
S3, performing first iterative training on the coding-decoding model by taking the violation pictures in the public atlas A and the annotation files in the public violation area set a as training samples, and then performing second iterative training on the coding-decoding model by taking the violation pictures in the real atlas B and the annotation files in the real violation area set b as training samples, to obtain a kitchen violation detection model;
It should be emphasized that, in order to further ensure the privacy and security of the detection model data, the detection model data may also be stored in a node of a block chain.
In one embodiment, the step S3, before the first iterative training and the second iterative training, further includes:
and inputting the training sample into an encoder and encoding the sample by using the encoder to obtain an m-dimensional vector for calculating the cross entropy loss function value.
After the first iterative training and the second iterative training in step S3, the method further includes:
and decoding the m-dimensional vector by using the decoder to obtain the picture Id.
In this embodiment, conversion between input and output of the encoding-decoding model is realized.
Cross entropy can be used as a loss function in neural networks (machine learning): p denotes the distribution of the true labels, q denotes the distribution of the labels predicted by the trained model, and the cross entropy loss function measures the similarity between p and q. An advantage of cross entropy as the loss function is that, with a sigmoid output, it avoids the slowdown in learning that the mean square error loss function suffers during gradient descent, because the size of the update is governed by the output error. In feature engineering, cross entropy can also be used to measure the similarity between two random variables.
In the training stage, training pictures and their corresponding annotation files are input into the model; the model extracts features through a multilayer convolutional network, locates the target by feature matching, and then obtains the violation type through a classification network. The training process is data-driven: the final model parameters are obtained by minimizing the cross entropy loss function, without manual parameter tuning. The model has more than 50 layers and can extract deep information from the image, which gives it stronger generalization ability in detection; as the training samples keep increasing, the model parameters can be optimized further, improving detection accuracy.
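For reference, the standard definitions behind this argument can be written out. The notation below (a single sigmoid output a = σ(z) with target y) is chosen for this illustration and is not taken from the embodiment.

```latex
% Cross entropy between the true label distribution p and the predicted distribution q
H(p, q) = -\sum_{x} p(x) \log q(x)

% Single sigmoid output a = \sigma(z) with target y.
% Mean square error: the gradient carries the factor \sigma'(z) = a(1 - a),
% which becomes very small when the output saturates near 0 or 1.
L_{\mathrm{MSE}} = \tfrac{1}{2}(a - y)^2,
\qquad
\frac{\partial L_{\mathrm{MSE}}}{\partial z} = (a - y)\, a (1 - a)

% Cross entropy: the factor \sigma'(z) cancels, so the gradient is the output error itself.
L_{\mathrm{CE}} = -\bigl[\, y \ln a + (1 - y) \ln (1 - a) \,\bigr],
\qquad
\frac{\partial L_{\mathrm{CE}}}{\partial z} = a - y
```

Because the cross entropy gradient equals the output error, a badly wrong prediction produces a large update even when the sigmoid is saturated.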
In one embodiment, the first iterative training in step S3 includes the following steps:
performing feature matching training on the coding-decoding model to obtain a detection model with accurate feature matching;
and carrying out violation classification training on the detection model with the accurate feature matching.
The second iterative training in step S3 includes the same steps as described above, and the training samples used mainly differ between the first iterative training and the second iterative training, where the former uses public data and the latter uses actual scene data.
Feature matching training adopts an iterative intersection-over-union improvement method to increase the accuracy of the prediction box position.
The intersection-over-union ratio (IOU) is the degree of overlap between the prediction box and the target object, and is calculated as follows:
$$\mathrm{IOU} = \frac{S_1}{S_2}$$
where S1 is the area of the intersection of the prediction box boundary and the actual boundary, and S2 is the entire area of the prediction box.
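With S1 taken as the intersection area and S2 as the area of the prediction box, as defined above, the ratio S1 / S2 can be computed for axis-aligned boxes as sketched below. The box format `(x1, y1, x2, y2)` and the optional `use_union` flag are assumptions for illustration; the flag switches the denominator to the area of the union of the two boxes, which is the conventional definition of intersection over union.

```python
def box_area(box):
    """Area of an axis-aligned box (x1, y1, x2, y2); zero if the box is empty."""
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)


def intersection_area(a, b):
    """Area S1 of the overlap between boxes a and b."""
    overlap = (max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))
    return box_area(overlap)


def overlap_ratio(pred, truth, use_union=False):
    """Return S1 / S2.

    By default S2 is the area of the prediction box, following the text.
    With use_union=True, S2 is the area of the union of both boxes.
    """
    s1 = intersection_area(pred, truth)
    s2 = box_area(pred)
    if use_union:
        s2 = box_area(pred) + box_area(truth) - s1
    return s1 / s2 if s2 > 0 else 0.0


pred_box = (0, 0, 4, 4)    # hypothetical prediction box
truth_box = (2, 2, 6, 6)   # hypothetical annotated box
ratio_text = overlap_ratio(pred_box, truth_box)
ratio_union = overlap_ratio(pred_box, truth_box, use_union=True)
```

For these two boxes the overlap is a 2 x 2 square, so the ratio is 4 / 16 with the prediction box as denominator and 4 / 28 with the union as denominator.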
The usual way of judging a detection is to treat the prediction as correct when the intersection-over-union ratio exceeds a set threshold; generally, the detection result is regarded as correct when the ratio is greater than 0.5. In this step, prediction accuracy is improved by iteratively raising the intersection-over-union ratio. A threshold set R = {R1, R2, …, Rn} is preset, where n is the number of thresholds and R1 < R2 < … < Rn; the threshold values in R and the number n of thresholds are determined by the effect required in experiments. In one embodiment, the threshold set R = {0.3, 0.4, 0.5, 0.6}, and the specific feature matching training steps are as follows:
when the intersection-over-union ratio IOU1 is smaller than the first threshold 0.3 in the preset threshold set R, judging that the prediction is wrong and finishing the training;
when IOU1 is greater than the first threshold 0.3, comparing IOU1 with the second threshold 0.4 in the preset threshold set R;
when IOU1 is smaller than the second threshold 0.4, judging that the prediction is wrong;
and when IOU1 is greater than the second threshold 0.4, taking the prediction results in the original interval [R1, R2] as negative samples, balancing the positive and negative samples and retraining, calculating the intersection-over-union ratio again as IOU2 and comparing it with the next threshold 0.5 in the preset threshold set R, thereby iteratively raising the intersection-over-union ratio; the feature matching training finishes when a prediction is judged wrong or when all thresholds in the preset threshold set R have been compared.
In this embodiment, a prediction box position with a higher intersection-over-union ratio is obtained, so that the prediction box matches the position area of the corresponding feature more accurately.
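The control flow of the threshold steps above can be sketched as follows. The `retrain` callback is hypothetical: it stands in for retraining with the given interval of predictions used as negative samples and returns the newly measured ratio. Two further assumptions are made for this sketch: a ratio exactly equal to a threshold counts as passing, and after each threshold from the second onward the negative interval is the pair of the previous and current thresholds.

```python
def iterative_iou_training(thresholds, initial_iou, retrain):
    """Walk an ascending threshold set, retraining between thresholds.

    thresholds: ascending list R = [R1, R2, ..., Rn].
    initial_iou: the first measured ratio (IOU1).
    retrain: hypothetical callback taking the (low, high) interval whose
             predictions become negative samples; returns the new ratio.
    """
    iou = initial_iou
    passed = 0
    for k, threshold in enumerate(thresholds):
        if iou < threshold:
            # Prediction judged wrong: feature matching training ends here.
            return {"success": False, "iou": iou, "thresholds_passed": passed}
        passed += 1
        # From the second threshold on, retrain before the next comparison.
        if k >= 1 and k + 1 < len(thresholds):
            iou = retrain((thresholds[k - 1], threshold))
    return {"success": True, "iou": iou, "thresholds_passed": passed}


# Simulated retraining results; real values would come from the model.
intervals_used = []
simulated_ious = iter([0.55, 0.65])


def fake_retrain(interval):
    intervals_used.append(interval)
    return next(simulated_ious)


result = iterative_iou_training([0.3, 0.4, 0.5, 0.6], 0.45, fake_retrain)
failed = iterative_iou_training([0.3, 0.4, 0.5, 0.6], 0.35, fake_retrain)
```

In the first call the ratio rises from 0.45 to 0.65 across two retraining rounds and all four thresholds are passed; in the second call the ratio 0.35 fails the second threshold and no retraining takes place.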
In violation classification training, cross entropy is used to measure the accuracy of model classification and the L1 norm is used to measure the positioning error of the model. Each result output by the model is compared with the manually annotated result to obtain an error, and the parameters are corrected according to that error; after multiple iterations, model training is complete when the error is less than a set threshold, and the final model parameters are saved.
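The stopping rule, iterating until the error falls below a set threshold, can be illustrated on a toy problem. The sketch below is not the detection model: it fits a one-dimensional logistic classifier with full-batch gradient descent (rather than stochastic gradient descent) and stops once the mean cross entropy drops below a threshold `k`. The data, learning rate, and iteration cap are assumptions for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def mean_cross_entropy(w, b, xs, ys):
    """Mean binary cross entropy of the predictions sigmoid(w * x + b)."""
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(w * x + b)
        total -= y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
    return total / len(xs)


def train_until_threshold(xs, ys, k, lr=0.5, max_iters=1000):
    """Gradient descent that ends as soon as the loss is below k."""
    w, b = 0.0, 0.0
    for step in range(max_iters):
        loss = mean_cross_entropy(w, b, xs, ys)
        if loss < k:
            return w, b, loss, step
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) / len(xs)
        grad_b = sum(errors) / len(xs)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b, mean_cross_entropy(w, b, xs, ys), max_iters


xs = [-2.0, -1.0, 1.0, 2.0]  # toy inputs
ys = [0, 0, 1, 1]            # toy labels
w, b, final_loss, steps = train_until_threshold(xs, ys, k=0.1)
print(f"stopped after {steps} steps with loss {final_loss:.4f}")
```

The iteration cap is a practical safeguard: without it, a threshold that the model cannot reach would keep the loop running indefinitely.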
Taking the violation pictures and the annotation files as samples, the coding-decoding model is trained by stochastic gradient descent until the cross entropy loss function between the input data and the output data of the coding-decoding model converges to a first threshold, which is preferably 0.001. The cross entropy loss function weights each pixel point as follows: $w_i$ is a weight that depends on pixel point $i$; if pixel point $i$ is located in the violation area of the corresponding picture, then $w_i = \alpha$ with $0.6 \le \alpha \le 1$, otherwise $w_i = 1 - \alpha$; $I(i)$ is the pixel value of pixel point $i$ in the violation picture; and $I_d(i)$ is the pixel value of pixel point $i$ in the output result of the coding-decoding model when that violation picture is the input.
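One pixel-wise loss consistent with this weighting is a weighted binary cross entropy, sketched below. The exact expression is an assumption: the sketch supposes pixel values normalized to the range [0, 1], a weight `alpha` between 0.6 and 1 for pixels inside the violation area and `1 - alpha` for pixels outside it, and averaging over all pixels. All names are hypothetical.

```python
import math


def weighted_pixel_cross_entropy(target, output, violation_mask, alpha=0.8):
    """Weighted binary cross entropy averaged over all pixels.

    target, output: 2-D lists of pixel values in [0, 1] (input picture and
                    model output).
    violation_mask: 2-D list of booleans, True inside the violation area.
    alpha: weight inside the violation area; 1 - alpha is used outside.
    """
    if not 0.6 <= alpha <= 1.0:
        raise ValueError("alpha is expected to lie in [0.6, 1]")
    eps = 1e-12  # guards against log(0)
    total = 0.0
    count = 0
    for t_row, o_row, m_row in zip(target, output, violation_mask):
        for t, o, inside in zip(t_row, o_row, m_row):
            weight = alpha if inside else 1.0 - alpha
            total -= weight * (
                t * math.log(o + eps) + (1 - t) * math.log(1 - o + eps)
            )
            count += 1
    return total / count


target = [[1.0, 1.0]]
mask = [[True, False]]  # only the first pixel lies in the violation area
loss_error_inside = weighted_pixel_cross_entropy(target, [[0.5, 0.9]], mask)
loss_error_outside = weighted_pixel_cross_entropy(target, [[0.9, 0.5]], mask)
```

The same reconstruction error costs more when it falls inside the violation area, which steers training toward the pixels that determine the violation.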
The violation classification capability of the detection model is trained, so that the judgment of the violation type of the image by the model is more accurate.
S4, acquiring an image shot by the kitchen camera in real time, and inputting the image into a kitchen violation detection model for violation detection;
In the use stage, once the kitchen violation detection model is loaded into the network framework and a picture shot by the kitchen camera is input, the model performs feature extraction, positioning, and classification on the picture and detects any violation behavior.
According to the embodiment, the behavior action of the worker can be conveniently and accurately captured in the kitchen scene, and whether violation is caused or not is judged without manual supervision.
Fig. 3 is a structural diagram of a kitchen violation detection device in an embodiment of the present application, and as shown in fig. 3, a kitchen violation detection device includes the following modules:
the sample library construction module 10 is used for acquiring kitchen violation pictures published on a network to construct a public atlas A, acquiring kitchen violation pictures captured by a camera in an actual kitchen scene to construct a real atlas B, constructing a public violation area set a from the violation pictures in the public atlas A, and constructing a real violation area set b from the violation pictures in the real atlas B;
the model initialization module 20 is used for constructing a coding-decoding detection model based on a convolutional neural network structure;
the model training module 30 is used for performing first iterative training on the coding-decoding model by taking the violation pictures in the public atlas A and the annotation files in the public violation area set a as training samples, and then performing second iterative training on the coding-decoding model by taking the violation pictures in the real atlas B and the annotation files in the real violation area set b as training samples, to obtain a kitchen violation detection model;
and the violation detection module 40 is used for acquiring, in real time, an image shot by the kitchen camera and inputting the image into the kitchen violation detection model for violation detection.
In one embodiment, a kitchen violation detection apparatus is provided, comprising a memory and a processor, wherein the memory stores computer-readable instructions that, when executed by the processor, cause the processor to perform the steps of the kitchen violation detection method described above.
In one embodiment, a storage medium storing computer-readable instructions is provided, the computer-usable storage medium may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function, and the like; the storage data area may store data created according to the use of the blockchain node, and the like.
Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by a program instructing the relevant hardware, and the program may be stored in a computer-readable storage medium, which includes: a Read Only Memory (ROM), a Random Access Memory (RAM), a magnetic or optical disk, or the like.
The technical features of the embodiments described above can be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the embodiments described above are not described, but should be considered as being within the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above-described embodiments illustrate only some embodiments of the present application; they are described in specific detail, but are not to be construed as limiting the scope of the present application. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, all of which fall within the scope of protection of the present application. Therefore, the protection scope of the present patent should be subject to the appended claims.
The block chain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, a consensus mechanism, an encryption algorithm and the like. A block chain (Blockchain), which is essentially a decentralized database, is a series of data blocks associated by using a cryptographic method, and each data block contains information of a batch of network transactions, so as to verify the validity (anti-counterfeiting) of the information and generate a next block. The blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
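The linking of data blocks by a cryptographic method can be illustrated with a minimal hash chain. This sketch covers only the hash linkage and its verification; distributed storage, point-to-point transmission, and the consensus mechanism are omitted. The payload fields are hypothetical examples of detection model data.

```python
import hashlib
import json


def make_block(payload, previous_hash):
    """Create a block whose hash covers its payload and its predecessor's hash."""
    body = json.dumps(
        {"payload": payload, "previous_hash": previous_hash}, sort_keys=True
    )
    digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
    return {"payload": payload, "previous_hash": previous_hash, "hash": digest}


def verify_chain(chain):
    """Check every block's own hash and its link to the preceding block."""
    for index, block in enumerate(chain):
        expected = make_block(block["payload"], block["previous_hash"])["hash"]
        if block["hash"] != expected:
            return False
        if index > 0 and block["previous_hash"] != chain[index - 1]["hash"]:
            return False
    return True


genesis = make_block({"model": "kitchen-detector", "version": 1}, "0" * 64)
second = make_block({"model": "kitchen-detector", "version": 2}, genesis["hash"])
chain = [genesis, second]

# Altering a stored payload without recomputing the hashes breaks verification.
tampered = [dict(genesis, payload={"model": "other", "version": 1}), second]
```

Because each block's hash depends on the hash of the previous block, changing any stored record invalidates that block, and recomputing its hash in turn breaks the link from every later block.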