Detailed Description
The present invention will be described in further detail with reference to examples and drawings, but the present invention is not limited thereto.
Examples
The embodiment provides a method for detecting and identifying an enlarged license plate number based on a deep neural network. The method mainly comprises the steps of detecting and identifying the enlarged license plate number, as shown in figure 1; the position label information of the enlarged license plate number is shown in figure 2. The detailed steps are as follows:
S1, detecting and positioning the area where the license plate enlarged number is located, and obtaining a sample image of the original license plate enlarged number.
In the step, a detection network (such as a convolutional neural network) based on deep convolution is used for detecting and positioning the enlarged license plate number, and a lightweight network architecture MobileNet-SSD is taken as an example for detailed description:
S11, first, according to the distribution of the labeled sample data in the training samples, the generation parameters of the default boxes of each layer are calculated using a k-means clustering algorithm (as used for anchor generation in YOLOv3). Because the license plate enlarged number sample images generally have a large aspect ratio, the input image size of the detection network is set to w × h (1.5w < h < 2w) to eliminate the influence of the aspect ratio on the detection effect.
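A minimal sketch of the default-box clustering described in S11, assuming the labeled boxes are given as (width, height) pairs and using the 1 − IoU distance popularized by YOLOv3; all function names and the sample data are illustrative:

```python
# Sketch: cluster labeled box sizes into k default-box shapes using 1 - IoU
# distance, as popularized by YOLOv3. Boxes are (width, height) pairs.
import numpy as np

def iou_wh(boxes, centers):
    """IoU between boxes and centers when both are anchored at the origin."""
    inter = np.minimum(boxes[:, None, 0], centers[None, :, 0]) * \
            np.minimum(boxes[:, None, 1], centers[None, :, 1])
    area_b = (boxes[:, 0] * boxes[:, 1])[:, None]
    area_c = (centers[:, 0] * centers[:, 1])[None, :]
    return inter / (area_b + area_c - inter)

def kmeans_default_boxes(boxes, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = boxes[rng.choice(len(boxes), k, replace=False)]
    for _ in range(iters):
        assign = np.argmax(iou_wh(boxes, centers), axis=1)  # nearest = highest IoU
        new = np.array([boxes[assign == i].mean(axis=0) if np.any(assign == i)
                        else centers[i] for i in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return centers  # k (w, h) default-box shapes; one set per pyramid layer in practice

# Example: wide, plate-like boxes cluster into a few wide default shapes.
wh = np.abs(np.random.default_rng(1).normal([120, 40], [30, 10], size=(500, 2)))
print(kmeans_default_boxes(wh, k=4))
```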
S12, various data enhancement methods are used in the training process to increase the diversity of the sample images and improve the detection performance of the detection network, including horizontal flipping, cropping, and zooming in and out.
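The augmentations named in S12 can be sketched as follows (NumPy only; the probabilities and ranges are assumptions, and in real detection training the bounding boxes must be transformed along with the image):

```python
# Sketch of detection-stage augmentations: horizontal flip, random crop,
# zoom in/out. Parameters are illustrative; boxes must be updated consistently
# in a real pipeline (omitted here).
import numpy as np

def augment(img, rng):
    h, w = img.shape[:2]
    if rng.random() < 0.5:                      # horizontal flip
        img = img[:, ::-1]
    if rng.random() < 0.5:                      # random crop (keep >= 80%)
        ch, cw = int(h * rng.uniform(0.8, 1.0)), int(w * rng.uniform(0.8, 1.0))
        y, x = rng.integers(0, h - ch + 1), rng.integers(0, w - cw + 1)
        img = img[y:y + ch, x:x + cw]
    if rng.random() < 0.5:                      # zoom in/out, nearest-neighbor
        s = rng.uniform(0.7, 1.3)
        ys = (np.arange(int(img.shape[0] * s)) / s).astype(int)
        xs = (np.arange(int(img.shape[1] * s)) / s).astype(int)
        img = img[ys][:, xs]
    return img

rng = np.random.default_rng(0)
print(augment(np.zeros((64, 128, 3), dtype=np.uint8), rng).shape)
```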
S13, features of the sample image are extracted using a backbone convolutional network (MobileNet), and a 6-layer feature pyramid network is constructed for position regression and class classification.
S14, the outputs of the multilayer feature pyramid network are processed by the non-maximum suppression unit to obtain the final detection and positioning result for the area where the license plate enlarged number is located.
The structure of the detection network is shown in fig. 3, and includes a backbone convolutional network MobileNet, a Non-Maximum Suppression unit (NMS), and a multilayer feature pyramid network, where the backbone convolutional network MobileNet is connected to an input end of the multilayer feature pyramid network, an output end of each layer of feature pyramid network is connected to the Non-Maximum Suppression unit, and the Non-Maximum Suppression unit outputs a final detection positioning result.
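The non-maximum suppression unit can be illustrated with a standard greedy NMS; this is a generic sketch, not the patent's exact unit:

```python
# Standard greedy non-maximum suppression over the merged pyramid outputs.
import numpy as np

def nms(boxes, scores, iou_thresh=0.45):
    """boxes: (N, 4) as x1, y1, x2, y2; returns indices of kept boxes."""
    order = np.argsort(scores)[::-1]          # highest score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_o = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                 (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + area_o - inter)
        order = order[1:][iou <= iou_thresh]  # drop overlapping lower-score boxes
    return keep
```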
Step S2, recognizing characters of the enlarged number of the license plate
The license plate enlarged number characters are identified by a recognition network based on deep convolution, such as a convolutional recurrent neural network (CRNN); the identification specifically comprises the following steps:
S21, expanding the training sample images to obtain a training sample set.
In the training stage of the recognition network model, the convolutional neural network CRNN uses an end-to-end training mode and needs a large number of input sample images for network optimization training. The invention first labels the sample images of the original license plate enlarged number and then expands the labeled sample images; the expansion process is shown in fig. 4 and mainly comprises the following steps:
S211, cropping the sample image to generate area images of different sizes, as shown in (a)-(c) of fig. 5; the area images obtained after cropping specifically include the following categories:
① Original license plate enlarged number sample (7-8 characters): such a sample image is an area image of the original license plate enlarged number, as shown in fig. 5(a);
② Defective license plate enlarged number sample (5-7 characters): such a sample image is an area image obtained by cropping after discarding the province-abbreviation region of the original license plate, as shown in fig. 5(b);
③ Sample after boundary expansion: such a sample image is an area image obtained by applying random boundary expansion to the two types of license plate enlarged number area images above, where the expansion sizes l, r, u and b at the left, right, upper and lower boundaries are generated by a random function random from the width w and height h of the original license plate enlarged number area image (a hedged sketch is given after this list).
④ Negative samples: such samples are false-detection samples of the detection network, i.e., areas that are not license plate enlarged numbers.
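Since the expansion formula itself is not reproduced above, the following is a hedged sketch of the type-③ boundary expansion, drawing l, r, u and b as random fractions of w and h; the 0–0.2 bounds are illustrative assumptions, not the patent's constants:

```python
# Hedged sketch of random boundary expansion for type-(3) samples.
# l, r, u, b are drawn as random fractions of w and h (bounds assumed).
import numpy as np

def expand_region(x1, y1, x2, y2, img_w, img_h, rng):
    w, h = x2 - x1, y2 - y1
    l = int(rng.uniform(0.0, 0.2) * w)   # left expansion
    r = int(rng.uniform(0.0, 0.2) * w)   # right expansion
    u = int(rng.uniform(0.0, 0.2) * h)   # upper expansion
    b = int(rng.uniform(0.0, 0.2) * h)   # lower expansion
    return (max(0, x1 - l), max(0, y1 - u),
            min(img_w, x2 + r), min(img_h, y2 + b))

rng = np.random.default_rng(0)
print(expand_region(100, 50, 300, 120, img_w=640, img_h=480, rng=rng))
```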
S212, image normalization and color transformation: before model training, the convolutional neural network CRNN needs to normalize the four types of area images obtained after cropping in step S211 to the size W × 32, where W is the normalized image width, and then perform color transformation. The process mainly comprises the following steps:
① Keeping the height h unchanged, the image width is randomly stretched to improve the recognition capability of the convolutional neural network CRNN for narrower characters. The formula for the random width stretch transform is:

w* = w × (random(0.4, 0.8) + 1)

where w* is the stretched image width, w is the original image width, and random is a random function.
② Judge whether the aspect ratio w*/h of the width-stretched image equals the normalized aspect ratio, i.e., W/32:
1) If w*/h = W/32, the image is scaled to W × 32;
2) If w*/h < W/32, the image is scaled to w** × 32 with w** = w* × (32/h), and then the left and right image boundaries are expanded, where l and r are the expansion sizes of the left and right boundaries respectively and random is a random function. In this embodiment, the convolutional neural network CRNN has no requirement on the image width; therefore the normalized size is W × 32, but the maximum width value is set to 280 in the width stretching transform and the left/right boundary expansion.
3) If w*/h > W/32, the image is scaled to W × h** with h** = h × (W/w*), and then the upper and lower boundaries of the image are expanded, where u and b are the expansion sizes of the upper and lower boundaries respectively and random is a random function.
③ Random color space transformation is performed to further increase the diversity of the samples, generating the sample images that are finally input into the recognition network (see the sketch below).
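A sketch of the full S212 pipeline as reconstructed above, using OpenCV; the padding split and the replicate border are assumptions standing in for the omitted l, r, u, b formulas:

```python
# Sketch of S212: random width stretch, then scale/pad to W x 32 following
# branches (1)-(3), with the maximum width capped at 280. Padding amounts are
# assumptions; random color jitter (step 3) would follow.
import cv2
import numpy as np

def normalize_sample(img, W=280, rng=np.random.default_rng(0)):
    h, w = img.shape[:2]
    w_star = min(int(w * (rng.uniform(0.4, 0.8) + 1)), W)  # stretch, cap at 280
    img = cv2.resize(img, (w_star, h))
    if w_star / h == W / 32:                                # case 1: exact ratio
        img = cv2.resize(img, (W, 32))
    elif w_star / h < W / 32:                               # case 2: too narrow
        w2 = int(w_star * (32 / h))
        img = cv2.resize(img, (w2, 32))
        pad = W - w2                                        # split into l and r
        l = int(rng.uniform(0, pad)); r = pad - l
        img = cv2.copyMakeBorder(img, 0, 0, l, r, cv2.BORDER_REPLICATE)
    else:                                                   # case 3: too wide
        h2 = int(h * (W / w_star))
        img = cv2.resize(img, (W, h2))
        pad = 32 - h2                                       # split into u and b
        u = int(rng.uniform(0, pad)); b = pad - u
        img = cv2.copyMakeBorder(img, u, b, 0, 0, cv2.BORDER_REPLICATE)
    return img
```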
S213, generating a sample label: each license plate character of the license plate number is stored in an array, and the sample label of the license plate number is then generated according to the index value of each license plate character in the array.
The convolutional neural network CRNN needs a blank (space) label, which is generally set as the first position ("0") or the last position ("n-1") of the label list, where n is the length of the label list (i.e., the number of character classes). The length of the sample label is 8, and labels shorter than 8 bits are padded with "0" after the label values.
For example, if the label value of blank is set to "0", the label value of a positive sample image with license plate number "87569" is "9,8,6,7,10,0,0,0" (written consecutively, "986710000"): the label value of each license plate character is obtained by adding 1 to the character's index in the label list, regardless of whether the character is a number, a letter or a Chinese character. For a negative sample image, the label value is "00000000".
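The label scheme of S213 can be reproduced directly; the character list below is illustrative (the patent's actual list also covers Chinese characters):

```python
# Sketch of S213: blank has label 0, each character's label is its index in the
# label list plus 1, padded with 0 to length 8. CHARS is illustrative only.
CHARS = list("0123456789ABCDEFGHJKLMNPQRSTUVWXYZ")  # blank occupies label 0

def encode_plate(plate, max_len=8):
    label = [CHARS.index(c) + 1 for c in plate]      # index + 1, so blank stays 0
    return label + [0] * (max_len - len(label))      # pad with 0 after the values

print(encode_plate("87569"))   # -> [9, 8, 6, 7, 10, 0, 0, 0], as in the example
print(encode_plate(""))        # negative sample -> [0, 0, 0, 0, 0, 0, 0, 0]
```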
S22, constructing a recognition network based on the expanded sample image, and performing feature extraction on the actual license plate enlarged number image by using the constructed recognition network.
In this embodiment, a feature extraction network is constructed as an identification network, and specifically, the feature extraction network is a deep convolutional network including a convolutional layer (CNN), a feature super-resolution branch network (SR layer), a cyclic layer (RNN), a transcription layer (CTC), and a loss function layer, where the convolutional layer is connected to the SR layer and the cyclic layer, the transcription layer is connected to the cyclic layer, the transcription layer and the feature super-resolution branch network are connected to the loss function layer, the size of an input image is W × 32, W is an image width, and 32 is an image height.
In the invention, the SR layer and the RNN layer share the feature sequence of the image and need no additional feature extraction network; therefore the SR layer has fewer network layers and a simpler structure than existing super-resolution networks, occupies less graphics card memory during training, and has a shorter training time.
S221, a feature sequence is extracted from the input image by the convolutional layer (CNN).
Taking a dense convolutional network (DenseNet) as an example, when constructing the feature extraction network, the CNN layer is formed by connecting 3 DenseNet blocks in series, where the depth of each DenseNet block is d and the feature map growth rate is r; between every two DenseNet blocks, a convolutional layer with kernel size k × k and a random inactivation layer (dropout) with proportion ratio are connected; finally, a pooling layer with kernel size m × n is connected, and a feature map of dimensions N × C × H × W is output, where N, C, H and W are the batch size, the number of feature map channels, the feature map height and the feature map width respectively.
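A compact PyTorch sketch of this CNN layer; the dense-block depth, growth rate, transition layout and pooling sizes are illustrative placeholders for d, r, k, ratio and m × n:

```python
# Sketch of S221's CNN: three dense blocks in series with conv + dropout
# transitions; hyperparameters are illustrative stand-ins for d, r, k, ratio.
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    def __init__(self, in_ch, depth=4, growth=16):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.Sequential(nn.BatchNorm2d(in_ch + i * growth), nn.ReLU(),
                          nn.Conv2d(in_ch + i * growth, growth, 3, padding=1))
            for i in range(depth))
        self.out_ch = in_ch + depth * growth

    def forward(self, x):
        for layer in self.layers:
            x = torch.cat([x, layer(x)], dim=1)   # dense connectivity
        return x

def build_cnn(in_ch=3, k=1, ratio=0.2):
    blocks, ch = [], in_ch
    for _ in range(3):                            # 3 DenseNet blocks in series
        blk = DenseBlock(ch); ch = blk.out_ch
        blocks += [blk, nn.Conv2d(ch, ch, k), nn.Dropout2d(ratio),
                   nn.AvgPool2d((2, 2))]          # transition: conv + dropout + pool
    return nn.Sequential(*blocks)

feat = build_cnn()(torch.zeros(2, 3, 32, 280))    # N x C x H x W feature map
print(feat.shape)
```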
S222, in the training stage, the feature expression capability of the CNN layer is improved through a feature super-resolution branch network (SR layer), and a super-resolution image is reconstructed and output.
The purpose of the feature super-resolution branch network is to obtain high-resolution image features from low-resolution images. Owing to hardware conditions, working environments and driving road conditions, cameras often acquire a large number of low-quality license plate enlarged number images, which affects the recognition result. Therefore, in the training process, the feature super-resolution branch network is added to improve the feature expression capability of the CNN layer: the feature sequence obtained by the CNN layer is input into the SR layer to reconstruct a super-resolution image, so that the low-resolution features are restored into the corresponding super-resolution image.
Because the license plate enlarged number recognition data set does not distinguish high-resolution from low-resolution images, in the training process the invention uses two image expansion modes, Gaussian blur and 4× down-and-up sampling, to perform online expansion preprocessing on the original images and generate low-resolution images, enriching the diversity of the sample images in the training data set. After feature sequence extraction by the convolutional layer, the generated low-resolution image is input to the SR layer of the feature super-resolution branch network and reconstructed into a super-resolution image. In this embodiment, the processed low-resolution image I_blur is obtained by applying f_gau and/or f_d-u to the original image O according to two random parameters p_1 and p_2 compared against a threshold α, where f_d-u and f_gau respectively denote the 4× down-and-up sampling and Gaussian blur operations.
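A hedged sketch of this online low-resolution expansion; the exact gating rule over p_1, p_2 and α is not reproduced above, so each degradation is simply applied when its random draw falls below the threshold:

```python
# Hedged sketch: Gaussian blur (f_gau) and 4x down-up sampling (f_du), each
# applied when its random parameter falls below alpha (gating rule assumed).
import cv2
import numpy as np

def degrade(img, alpha=0.5, rng=np.random.default_rng(0)):
    h, w = img.shape[:2]
    p1, p2 = rng.random(), rng.random()
    if p1 < alpha:                                   # f_gau: Gaussian blur
        img = cv2.GaussianBlur(img, (5, 5), 1.5)
    if p2 < alpha:                                   # f_du: 4x down-up sampling
        img = cv2.resize(img, (max(1, w // 4), max(1, h // 4)))
        img = cv2.resize(img, (w, h))
    return img
```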
The SR layer is mainly implemented by 2 super-resolution base units based on a residual network structure (ResNet) and an upsampling unit (UpSample). The super-resolution base unit is a residual group RG; the residual channel attention block (RCAB) is a sub-module of the RG, and two RCAB sub-blocks constitute one residual group RG.
The SR layer performs super-resolution reconstruction on the feature sequence F_CNN output by the CNN layer and outputs deeper features through the two RG layers, namely:

F_RG = H_RG(H_RG(F_CNN))

where F_RG is the feature processed by the two RG modules and H_RG is the corresponding operation of the RG module. The upsampling layer UpSample and a convolution operation then process the F_RG feature to obtain a super-resolution reconstructed image I with the same size as the input image:

F_UP = H_UP(F_RG),  I = H_Conv(F_UP)

where F_UP is the feature processed by the upsampling UpSample module, H_UP is the corresponding operation of the UpSample module, and H_Conv is the corresponding operation of the convolution module. Finally, the original high-resolution image in the training sample set is used as the real sample label, the loss of the reconstructed super-resolution image is calculated using the super-resolution loss function of S225, and the reconstruction effect of the super-resolution image is judged and evaluated according to the loss value.
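A PyTorch sketch of the SR branch as reconstructed above (two RGs of two RCABs each, then UpSample and a convolution); channel counts, the attention reduction and the upscale factor are illustrative:

```python
# Sketch of the SR branch: F_RG = H_RG(H_RG(F_CNN)); I = H_Conv(H_UP(F_RG)).
import torch
import torch.nn as nn

class RCAB(nn.Module):                      # residual channel attention block
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(),
                                  nn.Conv2d(ch, ch, 3, padding=1))
        self.att = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                 nn.Conv2d(ch, ch // 4, 1), nn.ReLU(),
                                 nn.Conv2d(ch // 4, ch, 1), nn.Sigmoid())

    def forward(self, x):
        y = self.body(x)
        return x + y * self.att(y)          # residual + channel attention

class SRBranch(nn.Module):
    def __init__(self, ch=64, scale=2):
        super().__init__()
        self.rg1 = nn.Sequential(RCAB(ch), RCAB(ch))   # RG = two RCAB sub-blocks
        self.rg2 = nn.Sequential(RCAB(ch), RCAB(ch))
        self.up = nn.Upsample(scale_factor=scale)      # H_UP
        self.conv = nn.Conv2d(ch, 3, 3, padding=1)     # H_Conv -> image

    def forward(self, f_cnn):
        f_rg = self.rg2(self.rg1(f_cnn))    # F_RG = H_RG(H_RG(F_CNN))
        return self.conv(self.up(f_rg))     # I = H_Conv(H_UP(F_RG))

print(SRBranch()(torch.zeros(1, 64, 8, 70)).shape)     # -> (1, 3, 16, 140)
```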
S223, the tag value distribution, i.e., the true value distribution of the feature sequence obtained from the convolutional layer (CNN) is predicted by the cycle layer (RNN).
The cyclic layer RNN comprises two bidirectional long short-term memory networks (BiLSTM). The features extracted by the convolutional layer CNN are converted by the cyclic layer into features of dimension T × N × M, where T is the time-sequence length of the cyclic layer RNN, N is the batch size, and M is the input feature length; a fully connected layer then produces a label distribution result of dimension T × N × n, where n is the length of the label list (the number of character classes). The cyclic layer RNN can be expressed as y = R_w(x), where x is the input, w is the RNN layer parameter, and y is the output.
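A sketch of this cyclic layer; the sizes M, hidden and n are illustrative:

```python
# Sketch of S223: two stacked BiLSTMs map T x N x M features to T x N x n
# label distributions (n = label-list length).
import torch
import torch.nn as nn

class BiLSTMHead(nn.Module):
    def __init__(self, M=512, hidden=256, n=35):
        super().__init__()
        self.rnn = nn.LSTM(M, hidden, num_layers=2,
                           bidirectional=True)          # two BiLSTM layers
        self.fc = nn.Linear(2 * hidden, n)              # full connection to labels

    def forward(self, x):          # x: T x N x M
        y, _ = self.rnn(x)         # y: T x N x 2*hidden
        return self.fc(y)          # T x N x n label distribution

print(BiLSTMHead()(torch.zeros(70, 4, 512)).shape)      # -> (70, 4, 35)
```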
S224, the label value distribution obtained from the cyclic layer RNN is converted into the final recognition result by the transcription layer (CTC) through operations such as de-duplication and integration.
A blank mechanism is introduced into the CTC of the transcription layer in order to obtain the final predicted text sequence through de-duplication and integration. Taking the "-" symbol to represent blank: the transcription layer CTC regards consecutive repeated characters without a blank interval as the same character, so it first merges consecutive repeated characters without blank intervals in the character sequence, and then deletes all "-" characters from the path to obtain the final predicted text sequence.
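The blank mechanism can be sketched directly; label 0 plays the role of "-" in the encoding of S213:

```python
# Sketch of the CTC blank mechanism: collapse consecutive repeats, drop blanks.
def ctc_collapse(path, blank=0):
    out, prev = [], None
    for c in path:
        if c != prev and c != blank:   # repeats without a blank interval merge
            out.append(c)
        prev = c
    return out

# Path "-99-8-66-7,10" collapses to the labels of "87569" from S213.
print(ctc_collapse([0, 9, 9, 0, 8, 0, 6, 6, 0, 7, 10]))  # -> [9, 8, 6, 7, 10]
```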
For the input x given by the cyclic layer RNN, the probability that the transcription layer outputs the correct license plate l is:

p(l|x) = Σ_{π∈B⁻¹(l)} p(π|x)

where π ∈ B⁻¹(l) ranges over all paths that yield the correct license plate l after the B transformation (the de-duplication and blank-removal mapping described above), and l is the predicted output sequence (i.e., the predicted license plate number). For any path π:

p(π|x) = ∏_{t=1}^{T} y_{π_t}^t,  π ∈ L′^T

where L′ is the label set including blank and L′^T is the set of all length-T paths. In the training process, the training target of the transcription layer CTC is essentially to adjust the parameter w of the cyclic layer RNN through the gradient ∂p(l|x)/∂w, so that for an input sample the probability p(l|x) of the correct license plate, summed over π ∈ B⁻¹(l), is maximized.
S225, the total loss of the recognition network is calculated through a loss function.
In the training process, the loss function contains both the loss of the text recognition part and the loss of the super-resolution branch network part, so that the feature sequence extracted by the CNN layer carries information for both parts, which improves the feature expression capability and the feature extraction effect of the recognition network on low-quality images.
That is, in the present invention, the total loss of the recognition network is the sum of the text recognition loss L_rec generated by the transcription layer CTC and the super-resolution image loss L_sr generated by the super-resolution branch network, with a hyper-parameter λ adjusting the weight of the super-resolution image loss L_sr, i.e., a weighted sum. The loss function can be described as:

L = L_rec + λ·L_sr,  with L_rec = −Σ_{(x,z)∈S} log p(z|x) and L_sr = Σ_{i,j} (O_{i,j} − I_{i,j})²

where O is the original image, O_{i,j} is the pixel value of the original image at position (i, j), I_{i,j} is the pixel value at position (i, j) of the super-resolution image I output by the SR layer of the feature super-resolution branch network, x is the input sample, S is the training sample set, and z is the real sample label. The total loss of the recognition network is reduced through training to obtain the optimized weight parameters of the recognition network.
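A sketch of the combined objective using PyTorch's built-in CTC and MSE losses; the exact form of L_sr and the value of λ are assumptions consistent with the definitions above:

```python
# Sketch of S225: total loss = CTC recognition loss + lambda * SR loss (MSE).
import torch
import torch.nn as nn

ctc = nn.CTCLoss(blank=0)
mse = nn.MSELoss()

def total_loss(log_probs, targets, in_lens, tgt_lens, sr_out, hr_img, lam=0.1):
    l_rec = ctc(log_probs, targets, in_lens, tgt_lens)   # text recognition loss
    l_sr = mse(sr_out, hr_img)                           # super-resolution loss
    return l_rec + lam * l_sr                            # weighted sum

T, N, n = 70, 4, 35
loss = total_loss(torch.randn(T, N, n).log_softmax(2),  # T x N x n log-probs
                  torch.randint(1, n, (N, 8)),           # length-8 labels
                  torch.full((N,), T, dtype=torch.long),
                  torch.full((N,), 8, dtype=torch.long),
                  torch.randn(N, 3, 32, 280),            # SR output I
                  torch.randn(N, 3, 32, 280))            # original image O
print(loss)
```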
Steps S221 to S225 constitute the training stage of the recognition network; refer to figs. 6 to 11 for details.
S226, reasoning output stage
The trained recognition network model is used for inference output, as shown in fig. 12. The main process is as follows: the actual license plate enlarged number image is input directly into the CNN layer, without image preprocessing such as Gaussian blur or down-and-up sampling, to obtain the corresponding feature sequence; the feature sequence output by the CNN layer is input directly into the RNN layer to obtain the probability distribution over all character classes at each time step; the character class probability distribution output by the RNN layer is input into the CTC layer, which takes the character with the maximum probability at each time step as the output character of that step, concatenates the output characters of all time steps into the maximum-probability path, and finally applies the blank mechanism to obtain the final text recognition result.
That is, in the training stage, the SR layer continuously updates the network weights through iterative training to minimize the loss function and obtain the optimized weight parameters. In the inference stage, the input of the CNN layer is an actually acquired license plate enlarged number image, and no image preprocessing such as Gaussian blur or down-and-up sampling is performed; the recognition network no longer uses the SR layer and directly uses the trained weight parameters. Because the output of the SR layer is a super-resolution image, which is useless for inference, the SR layer is discarded in the inference stage without affecting the character recognition result.
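A sketch contrasting the two execution paths; cnn, rnn and sr stand for the modules sketched in earlier steps, with toy stand-ins to keep the example runnable:

```python
# Sketch of S226: the SR branch contributes only to the training loss and is
# skipped entirely at inference. Modules here are illustrative placeholders.
import torch
import torch.nn as nn

def run(cnn, rnn, img, sr=None, training=False):
    f = cnn(img)                              # N x C x H x W feature map
    seq = f.flatten(2).permute(2, 0, 1)       # -> T x N x M feature sequence
    logits = rnn(seq)                         # T x N x n label distribution
    if training and sr is not None:
        return logits, sr(f)                  # SR image feeds L_sr only
    return logits.argmax(2).permute(1, 0)     # greedy max-probability path per
                                              # sample; collapse it with the
                                              # blank mechanism of S224

# Toy stand-ins for the real CNN/RNN, to show the shapes only.
cnn = nn.Conv2d(3, 8, 3, padding=1)
rnn = nn.Linear(8, 35)                        # per-time-step M -> n mapping
paths = run(cnn, rnn, torch.zeros(2, 3, 32, 280))
print(paths.shape)                            # (N, T) index paths
```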
The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to the above embodiments, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and all such changes, modifications, substitutions, combinations, and simplifications are intended to be included in the scope of the present invention.