Deep learning-based automatic LED character positioning and identifying method
Technical Field
The invention relates to the field of intelligent instruments, in particular to an automatic LED character positioning and identifying method based on deep learning.
Background
LED digital meters are common in modern intelligent instruments. Compared with traditional mechanical meters they are simple to install, require little engineering work, are highly accurate and compact, and are widely used in control systems, substation automation, distribution automation, district power monitoring, intelligent power distribution cabinets and switchgear cabinets. Meter reading, however, usually relies on manual work, which is not only inefficient but also consumes considerable manpower and energy; in high-voltage substations and distribution substations the readings of LED meters must be taken on site, which undeniably involves hazards. An algorithm that can automatically recognize LED meter readings is therefore needed.
Existing reading-recognition methods for digital LED meters only recognize the digits 0-9 and provide no recognition logic for the decimal point, the sign or the A/B/C phase indicators; they mainly segment and recognize characters on LED meters with a plain background, and under a complex background the segmented characters are often displayed incompletely and cannot be recognized. A typical prior method designs separate recognition logic for digits, decimal points, signs and A/B/C phases: it locates the character area of the LED meter to be recognized in a panoramic image by traditional template matching against a preset digital LED meter image, together with the single-character regions and possible decimal-point regions; it then obtains the single-character and decimal-point regions to be recognized from the relative positions of the positioning frame and the target frames, feeds each single-character region into a trained convolutional neural network such as AlexNet for recognition, detects the brightness of the decimal-point region, post-processes the detection results, and finally assembles the reading from the digit, decimal-point, A/B/C-phase and sign results. The drawback of such methods is that recognition is performed character by character, so digit recognition depends heavily on the quality of character segmentation, and LED characters that are tilted, affected by illumination or blurred during shooting are easily misrecognized.
Disclosure of Invention
The invention provides an automatic LED character positioning and identifying method based on deep learning to solve the above problems in the prior art. The method locates the region of the LED characters to be recognized with the YOLOv4 algorithm, so that the digital character area of the LED dial is located in the panoramic picture; it then detects single-line or multi-line characters with a PSENet network, and finally recognizes the LED multi-line characters with a CRNN network. The method also alleviates inaccurate recognition caused by tilted LED meters, blurred characters and similar problems.
The invention makes full use of the power of deep learning. First, the character area of the LED meter is located with a YOLOv4 network; the deep-learning-based YOLO target detection network can accurately detect the position of the meter, the position of the meter does not need to be marked in the picture taken by the camera, and the picture can be fed directly into the network for detection. After the region of interest (ROI) of the meter is detected, the ROI is fed into the character detection network: the progressive scale expansion network PSENet produces, through downsampling, feature fusion and upsampling, an output of the same size as the original image and yields the final text connected domains, i.e. the position of every line of characters in the LED meter. The network can locate single-line or multi-line character areas, so all character areas in the meter can be detected and located. Finally, the detected text regions are fed into the character recognition network, where a CRNN network performs automatic recognition, completing the automatic positioning and recognition of LED multi-line characters in an efficient way.
The technical solution adopted by the invention is an automatic LED character positioning and identifying method based on deep learning, implemented according to the following steps:
step 1, an LED meter area positioning module performs LED meter target detection in the substation scene with the YOLOv4 target detection algorithm and, to prevent other areas of the dial from interfering with character recognition, locates only the digital character area of the LED meter;
step 2, an LED meter character detection module uses the progressive scale expansion network PSENet as the character detection module for digital LED meters in the substation scene; by detecting the LED character target area at pixel level with an image segmentation technique, it improves the detection performance of the model on LED multi-line characters;
and step 3, an LED meter character recognition module trains on the features of the single-line or multi-line character target areas obtained in step 2 using a CRNN network, and finally recognizes the specific characters with the CTC algorithm to obtain the reading of the LED meter.
Step 1 is implemented according to the following steps:
Step 1.1, data enhancement is performed on the LED sample data with the GridMask method. GridMask is an information-deletion method: a region is randomly discarded on the image, which effectively adds a regularization term to the network to avoid overfitting and strikes a balance between deleting and retaining information. Random erasing, Cutout and Hide-and-Seek may delete or retain all of the discriminative area, which introduces noise and is not conducive to training the model.
One GridMask is determined by 4 parameters, x, y, r and d, which define a specific set of mask regions; the mask regions are also rotated during actual training.
Let k be the proportion of image information that is retained, W and H the width and height of the original image, and M the number of retained pixels, so that k = M / (W × H). k has no direct relation to the four parameters; they determine it only indirectly through r, and the relation between k and r is
k = 1 - (1 - r)²
x and y are chosen at random within a certain range:
δx, δy = random(0, d - 1)
In the LED meter detection task, r among the four GridMask hyper-parameters is set to 0.4 and d is sampled from (96, 224). During training, GridMask augmentation is applied to a training image with probability P = 0.6; for the detection task this probability starts at 0 and is gradually increased with the number of training iterations until it reaches P.
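By way of illustration, the GridMask operation described above can be sketched as follows (a minimal NumPy sketch, assuming square d × d units in which a square of side (1 - r)·d is dropped, so the keep ratio is k = 1 - (1 - r)²; in practice d would be sampled from the range (96, 224) per image, and the function and parameter names are illustrative, not part of the invention):

import numpy as np

def apply_gridmask(image, d=120, r=0.4, rng=np.random):
    # In every d x d unit, drop a square of side (1 - r) * d, giving keep ratio k = 1 - (1 - r) ** 2
    h, w = image.shape[:2]
    drop = int(round((1 - r) * d))
    dx, dy = rng.randint(0, d), rng.randint(0, d)    # random offsets delta_x, delta_y in [0, d - 1]
    mask = np.ones((h, w), dtype=image.dtype)
    for y in range(dy - d, h, d):
        for x in range(dx - d, w, d):
            y0, y1 = max(y, 0), min(y + drop, h)
            x0, x1 = max(x, 0), min(x + drop, w)
            if y0 < y1 and x0 < x1:
                mask[y0:y1, x0:x1] = 0               # delete this block of pixels
    return image * (mask[..., None] if image.ndim == 3 else mask)

During training, the augmented image would replace the original one with the scheduled probability described above.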
Step 1.2, a YOLOv4 target detection network is built to locate the character area of the LED meter dial in the picture. The learned high-level semantic information is passed to the lower layers through an FPN and fused with low-level high-resolution information to improve detection; a bottom-up information path is added, the feature information is strengthened by a downsampling operation, and the features of different convolutional layers are finally fused to obtain the detection result. The backbone network CSPDarknet53 of YOLOv4 uses the Mish activation function. Mish is a smooth curve; a smooth activation function lets information propagate into the neural network better, giving better accuracy and generalization, while small negative values are still allowed through, which preserves gradient flow. Its expression is:
Mish(x) = x × tanh(ln(1 + e^x))
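For reference, the expression above corresponds directly to the following PyTorch-style function (softplus computes ln(1 + e^x) in a numerically stable way; this is an illustrative sketch, not part of the invention):

import torch
import torch.nn.functional as F

def mish(x: torch.Tensor) -> torch.Tensor:
    # Mish(x) = x * tanh(ln(1 + exp(x)))
    return x * torch.tanh(F.softplus(x))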
Step 1.3, the pre-annotated target bounding boxes of the character area of the LED meter dial are defined as the Ground Truth; the annotated target pictures and their annotation files are fed into the YOLOv4 network for training, and the trained YOLOv4 target detection network is used to locate the character areas of LED meters with different characters.
Step 1.4, DIoU-NMS is used, which considers both the overlapping area and the distance between the centre points of two boxes, so that duplicate target boxes are removed and the digital character area of the LED meter is finally obtained.
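The DIoU-NMS step can be illustrated with the following simplified sketch: compared with standard NMS, the suppression criterion subtracts a penalty based on the normalized distance between box centres, so that nearby but distinct boxes are less likely to be suppressed (a sketch under the usual DIoU definition; names and the threshold value are illustrative):

import numpy as np

def diou_nms(boxes, scores, thr=0.5):
    # boxes: (N, 4) array of [x1, y1, x2, y2]; scores: (N,); returns indices of the kept boxes
    order = scores.argsort()[::-1]
    keep = []
    while order.size > 0:
        i, rest = order[0], order[1:]
        keep.append(int(i))
        # intersection-over-union with the remaining boxes
        x1 = np.maximum(boxes[i, 0], boxes[rest, 0]); y1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[i, 2], boxes[rest, 2]); y2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter + 1e-9)
        # DIoU penalty: squared centre distance over squared diagonal of the enclosing box
        ci = (boxes[i, :2] + boxes[i, 2:]) / 2
        cr = (boxes[rest, :2] + boxes[rest, 2:]) / 2
        d2 = ((ci - cr) ** 2).sum(axis=1)
        ex1 = np.minimum(boxes[i, :2], boxes[rest, :2]); ex2 = np.maximum(boxes[i, 2:], boxes[rest, 2:])
        c2 = ((ex2 - ex1) ** 2).sum(axis=1) + 1e-9
        order = rest[(iou - d2 / c2) <= thr]         # suppress only when IoU minus the centre penalty is high
    return keep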
Step 2 is implemented according to the following steps:
Step 2.1, feature extraction: the input picture is processed by a 50-layer ResNet50 residual network. The feature maps output by the Conv2, Conv3, Conv4 and Conv5 stages are extracted to build a feature pyramid, and through top-down and lateral connections four feature levels P2, P3, P4 and P5 are obtained, each with 256 channels.
Step 2.2, feature fusion: the four feature maps obtained in step 2.1 are fused. P3, P4 and P5 are upsampled by 2×, 4× and 8× respectively and concatenated with the feature level P2, finally giving a 1024-channel fused feature F. Fusing the high-level and low-level semantic features makes it possible to perceive the distribution of the LED characters and to detect the character boundaries more accurately. The fusion is implemented as:
F = C(P2, P3, P4, P5) = P2 ‖ UP×2(P3) ‖ UP×4(P4) ‖ UP×8(P5)
wherein, "|" represents the connection operation, and the upsampling is performed in a manner of 2 times, 4 times and 8 times respectively.
Step 2.3, the fused feature F obtained in step 2.2 passes through a 3×3 convolution, a BN layer and a ReLU layer to give a 256-channel feature map, which is fed into a 1×1 convolution to produce the segmentation results s1, s2, ..., sn, arranged from small to large according to kernel scale.
Step 2.4, scale expansion is performed with the PSENet algorithm, starting from the smallest kernel and expanding outwards in turn; a first-come-first-served rule resolves boundary conflicts during expansion, finally yielding LED character detection results with clear boundaries, as sketched below.
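The following is a simplified sketch of the progressive scale expansion, assuming the segmentation results have already been binarized into kernel maps ordered from the smallest to the largest scale; connected components of the smallest kernel seed the text instances, and each larger kernel is grown into by breadth-first search, with a pixel kept by whichever instance reaches it first:

from collections import deque
import numpy as np
from scipy.ndimage import label as connected_components

def progressive_scale_expansion(kernels):
    # kernels: list of binary (H, W) maps s1 ... sn, from the smallest to the largest scale
    labels, _ = connected_components(kernels[0])       # each text instance starts from the smallest kernel
    h, w = labels.shape
    for k in kernels[1:]:
        queue = deque(zip(*np.nonzero(labels)))
        while queue:
            y, x = queue.popleft()
            for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                ny, nx = y + dy, x + dx
                # grow only into pixels of the larger kernel that no instance has claimed yet
                if 0 <= ny < h and 0 <= nx < w and k[ny, nx] and labels[ny, nx] == 0:
                    labels[ny, nx] = labels[y, x]       # first-come-first-served at boundaries
                    queue.append((ny, nx))
    return labels                                       # (H, W) map of text-instance labels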
Step 3 is specifically implemented according to the following steps:
Step 3.1, LED character feature extraction: the CNN part of the CRNN uses a VGG-style structure. To speed up model convergence and account for the actual aspect ratio of LED characters, the pictures are uniformly normalized to a size of [240, 50]. Because the network contains deep convolutional layers and recurrent layers and is therefore hard to train, batch normalization (BN) layers are added after the fifth and sixth convolutional layers, which greatly accelerates training. Finally, feature extraction through the CNN yields a feature sequence of length 240/4 = 60 with 512 channels.
Step 3.2, LED character prediction: the feature maps extracted by the CNN in step 3.1 are fed into an RNN for character prediction. The CNN used has four max-pooling layers, and the window size of the last two pooling layers is changed from 2×2 to 1×2, because most LED character regions are short and wide; the 1×2 pooling windows preserve as much information in the width direction as possible and are better suited to recognizing letters and digits. Since LED characters photographed in real stations are often blurred, a deep bidirectional RNN is adopted as the RNN of the CRNN to improve the recognition of blurred LED characters. The RNN operates on the feature sequence x = x1, ..., xT output by the CNN, and each input xt produces an output yt. Because different LED meters show character strings of different lengths, long short-term memory (LSTM) units are chosen as the RNN cells to handle variable-length sequences; LSTM also effectively prevents vanishing gradients during training. Concretely, the feature map of the text picture is extracted by the 7-layer CNN, split by columns, and each column is fed as a 512-dimensional time step into two bidirectional LSTM layers with 256 units for classification.
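The recurrent part described above can be sketched as two stacked bidirectional LSTM layers with 256 units over the 512-dimensional feature sequence, followed by a per-column classifier whose output is passed to CTC (a PyTorch sketch; the size of the character set, here taken as digits, A/B/C, sign, decimal point and the blank, is an assumption):

import torch.nn as nn

class LedSequencePredictor(nn.Module):
    def __init__(self, feat_dim=512, hidden=256, num_classes=16):
        super().__init__()
        # two bidirectional LSTM layers with 256 units each, as described above
        self.rnn = nn.LSTM(feat_dim, hidden, num_layers=2, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, num_classes)   # per-timestep scores over the character set

    def forward(self, x):        # x: (T, N, 512) column features from the CNN
        y, _ = self.rnn(x)       # y: (T, N, 512), forward and backward states concatenated
        return self.fc(y)        # (T, N, num_classes), fed to the CTC transcription layer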
Step 3.3, character transcription: after the LED character sequence has passed through the RNN, the prediction results must be converted into character labels by the transcription layer, CTC. CTC introduces a blank character ε, and pauses between characters are all represented by ε; decoding mainly consists of removing repeated letters and removing ε. The invention transcribes characters with a dictionary-based CTC algorithm: in the transcription layer the error is back-propagated through the forward-backward algorithm, the probabilities of all labels are obtained on the basis of the dictionary-constrained predictions, and the label with the highest probability is finally selected as the recognition result.
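The decoding rule of the transcription layer, collapsing repeated symbols and then dropping the blank ε, can be sketched as follows; the dictionary-based variant described above would additionally restrict the final string to entries of a lexicon, which is omitted here:

def ctc_greedy_decode(probs, charset, blank=0):
    # probs: (T, C) array of per-timestep class probabilities; charset maps class index -> character
    best_path = probs.argmax(axis=1)
    decoded, previous = [], blank
    for idx in best_path:
        if idx != blank and idx != previous:   # drop blanks and collapse consecutive repeats
            decoded.append(charset[idx])
        previous = idx
    return "".join(decoded)

For example, a best path such as "1", "1", ε, "2", "2", ".", "5" collapses to "1", ε, "2", ".", "5" and decodes to "12.5".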
The invention has the beneficial effects that:
1. The method is suitable for recognizing digital meters in power distribution rooms and substations; it solves the low efficiency of manual meter reading and entry, and it achieves good recognition results even under external influences such as illumination, shooting angle and meter type.
2. The deep-learning-based YOLO target detection network can accurately detect the position of the meter; the meter position does not need to be marked in the picture taken by the camera, and the picture can be fed directly into the network for detection.
3. The progressive scale expansion network PSENet is used as the character detection module for digital LED meters in the substation scene; the network can locate single-line or multi-line character areas, i.e. all character areas in the meter can be detected and located, and the detected text regions are finally fed into the character recognition network for recognition.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed to be used in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without creative efforts.
FIG. 1 is a schematic diagram of the working procedure of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The overall workflow of the invention is shown in figure 1. First, the character area of the LED meter is located with a YOLOv4 network; the deep-learning-based YOLO target detection network can accurately detect the position of the meter, the position of the meter does not need to be marked in the picture taken by the camera, and the picture can be fed directly into the network for detection. After the region of interest (ROI) of the meter is detected, the ROI is fed into the character detection network: the progressive scale expansion network PSENet produces, through downsampling, feature fusion and upsampling, an output of the same size as the original picture and yields the final text connected domains, i.e. the position of every line of characters of the LED meter. The network can locate single-line or multi-line character areas, so all character areas in the meter can be detected and located. Finally, the detected text regions are fed into the character recognition network, where a CRNN network performs automatic recognition, completing the automatic positioning and recognition of LED multi-line characters in an efficient way.
The technical solution adopted by the invention is an automatic LED character positioning and identifying method based on deep learning, implemented according to the following steps:
step 1, LED meter area positioning module: as shown in the LED meter character area positioning module in fig. 1, the 1920 × 1080 picture taken by the camera is fed into the target detection network. Because the picture contains interfering elements such as indicator lights and other signboard characters, the LED meter to be recognized must be detected accurately; therefore the YOLOv4 target detection algorithm is used for LED meter target detection in the substation scene and, to prevent other areas of the dial from interfering with character recognition, only the digital character area of the LED meter is located. The ROI containing the characters is output and used as the input of step 2, as sketched below;
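In practice this step reduces to cropping the detected box out of the 1920 × 1080 frame before it is handed to the character detection module; a minimal sketch (the detector is assumed to return pixel-coordinate boxes, and the small margin is an illustrative choice):

def crop_led_roi(frame, box, margin=4):
    # frame: H x W x 3 camera image; box: (x1, y1, x2, y2) from the YOLOv4 detector
    h, w = frame.shape[:2]
    x1, y1, x2, y2 = box
    x1, y1 = max(int(x1) - margin, 0), max(int(y1) - margin, 0)
    x2, y2 = min(int(x2) + margin, w), min(int(y2) + margin, h)
    return frame[y1:y2, x1:x2]        # ROI containing only the digital character area of the LED meter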
step 2, LED meter character detection module: as shown in the LED meter character detection module in FIG. 1, the digital character region output by step 1 is fed into the network. The LED meter in the figure has 3 lines of characters, and each line must be detected by the character detection module before it can be recognized accurately; therefore the progressive scale expansion network PSENet is used as the character detection module for digital LED meters in the substation scene, and the LED character target area is detected at pixel level with an image segmentation technique, which improves the detection performance of the model on LED multi-line characters;
and step 3, LED meter character recognition module: as shown in the LED meter character recognition module in fig. 1, the multi-line character target areas detected in step 2 are fed into the recognition network in turn, the features of the obtained single-line or multi-line character target areas are trained with a CRNN network, and the specific characters are finally recognized with the CTC algorithm to obtain the reading of the LED meter.
Step 1 is implemented according to the following steps:
Step 1.1, data enhancement is performed on the LED sample data with the GridMask method. GridMask is an information-deletion method: a region is randomly discarded on the image, which effectively adds a regularization term to the network to avoid overfitting and strikes a balance between deleting and retaining information. Random erasing, Cutout and Hide-and-Seek may delete or retain all of the discriminative area, which introduces noise and is not conducive to training the model.
One GridMask is determined by 4 parameters, x, y, r and d, which define a specific set of mask regions; the mask regions are also rotated during actual training.
Let k be the proportion of image information that is retained, W and H the width and height of the original image, and M the number of retained pixels, so that k = M / (W × H). k has no direct relation to the four parameters; they determine it only indirectly through r, and the relation between k and r is
k = 1 - (1 - r)²
x and y are chosen at random within a certain range:
δx, δy = random(0, d - 1)
In the LED meter detection task, r among the four GridMask hyper-parameters is set to 0.4 and d is sampled from (96, 224). During training, GridMask augmentation is applied to a training image with probability P = 0.6; for the detection task this probability starts at 0 and is gradually increased with the number of training iterations until it reaches P.
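The increasing application probability described above can be realized, for instance, as a simple linear ramp from 0 to P = 0.6 over the training iterations (the ramp shape and the warm-up length are assumptions; the text only specifies the start and end values):

def gridmask_probability(iteration, p_max=0.6, warmup_iterations=10000):
    # probability of applying GridMask at the given iteration: 0 at the start, p_max after the warm-up
    return min(p_max, p_max * iteration / warmup_iterations)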
Step 1.2, a YOLOv4 target detection network is built to locate the character area of the LED meter dial in the picture. The learned high-level semantic information is passed to the lower layers through an FPN and fused with low-level high-resolution information to improve detection; a bottom-up information path is added, the feature information is strengthened by a downsampling operation, and the features of different convolutional layers are finally fused to obtain the detection result. The backbone network CSPDarknet53 of YOLOv4 uses the Mish activation function. Mish is a smooth curve; a smooth activation function lets information propagate into the neural network better, giving better accuracy and generalization, while small negative values are still allowed through, which preserves gradient flow. Its expression is:
Mish(x) = x × tanh(ln(1 + e^x))
Step 1.3, the pre-annotated target bounding boxes of the character area of the LED meter dial are defined as the Ground Truth; the annotated target pictures and their annotation files are fed into the YOLOv4 network for training, and the trained YOLOv4 target detection network is used to locate the character areas of LED meters with different characters.
Step 1.4, DIoU-NMS is used, which considers both the overlapping area and the distance between the centre points of two boxes, so that duplicate target boxes are removed and the digital character area of the LED meter is finally obtained.
Step 2 is implemented according to the following steps:
Step 2.1, feature extraction: the input picture is processed by a 50-layer ResNet50 residual network. The feature maps output by the Conv2, Conv3, Conv4 and Conv5 stages are extracted to build a feature pyramid, and through top-down and lateral connections four feature levels P2, P3, P4 and P5 are obtained, each with 256 channels.
Step 2.2, feature fusion: the four feature maps obtained in step 2.1 are fused. P3, P4 and P5 are upsampled by 2×, 4× and 8× respectively and concatenated with the feature level P2, finally giving a 1024-channel fused feature F. Fusing the high-level and low-level semantic features makes it possible to perceive the distribution of the LED characters and to detect the character boundaries more accurately. The fusion is implemented as:
F = C(P2, P3, P4, P5) = P2 ‖ UP×2(P3) ‖ UP×4(P4) ‖ UP×8(P5)
wherein, "|" represents the connection operation, and the upsampling is performed in a manner of 2 times, 4 times and 8 times respectively.
Step 2.3, the fused feature F obtained in step 2.2 passes through a 3×3 convolution, a BN layer and a ReLU layer to give a 256-channel feature map, which is fed into a 1×1 convolution to produce the segmentation results s1, s2, ..., sn, arranged from small to large according to kernel scale.
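Steps 2.2 and 2.3 together amount to a small segmentation head on top of the fused feature F; a PyTorch sketch is given below (the number of kernels n is a PSENet hyper-parameter and the value used here is an assumption, not specified by the invention):

import torch.nn as nn

class PseSegmentationHead(nn.Module):
    def __init__(self, in_channels=1024, mid_channels=256, num_kernels=6):
        super().__init__()
        self.reduce = nn.Sequential(                  # 3x3 convolution -> BN -> ReLU, 256 channels
            nn.Conv2d(in_channels, mid_channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(mid_channels),
            nn.ReLU(inplace=True),
        )
        self.kernels = nn.Conv2d(mid_channels, num_kernels, kernel_size=1)   # 1x1 convolution -> s1 ... sn

    def forward(self, fused):        # fused: (N, 1024, H, W) feature F from step 2.2
        return self.kernels(self.reduce(fused))       # (N, n, H, W) segmentation results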
Step 2.4, scale expansion is performed with the PSENet algorithm, starting from the smallest kernel and expanding outwards in turn; a first-come-first-served rule resolves boundary conflicts during expansion, finally yielding LED character detection results with clear boundaries.
Step 3 is implemented specifically according to the following steps:
Step 3.1, LED character feature extraction: the CNN part of the CRNN uses a VGG-style structure. To speed up model convergence and account for the actual aspect ratio of LED characters, the pictures are uniformly normalized to a size of [240, 50]. Because the network contains deep convolutional layers and recurrent layers and is therefore hard to train, batch normalization (BN) layers are added after the fifth and sixth convolutional layers, which greatly accelerates training. Finally, feature extraction through the CNN yields a feature sequence of length 240/4 = 60 with 512 channels.
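The conversion from the [240, 50] input to the feature sequence fed to the RNN can be sketched as follows, assuming the CNN collapses the height to 1 and, with the pooling configuration described in step 3.2, reduces the width by a factor of 4 (the backbone itself is omitted; feat is a torch.Tensor):

import torch

def feature_map_to_sequence(feat):
    # feat: CNN output of shape (N, 512, 1, W') with W' = 240 / 4 = 60 for a [240, 50] input
    n, c, h, w = feat.shape
    assert h == 1, "the feature map height is expected to be collapsed to 1 before the RNN"
    return feat.squeeze(2).permute(2, 0, 1)   # (N, 512, W') -> (W', N, 512), one 512-d vector per column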
Step 3.2, LED character prediction: the feature maps extracted by the CNN in step 3.1 are fed into an RNN for character prediction. The CNN used has four max-pooling layers, and the window size of the last two pooling layers is changed from 2×2 to 1×2, because most LED character regions are short and wide; the 1×2 pooling windows preserve as much information in the width direction as possible and are better suited to recognizing letters and digits. Since LED characters photographed in real stations are often blurred, a deep bidirectional RNN is adopted as the RNN of the CRNN to improve the recognition of blurred LED characters. The RNN operates on the feature sequence x = x1, ..., xT output by the CNN, and each input xt produces an output yt. Because different LED meters show character strings of different lengths, long short-term memory (LSTM) units are chosen as the RNN cells to handle variable-length sequences; LSTM also effectively prevents vanishing gradients during training. Concretely, the feature map of the text picture is extracted by the 7-layer CNN, split by columns, and each column is fed as a 512-dimensional time step into two bidirectional LSTM layers with 256 units for classification.
Step 3.3, character transcription: after the LED character sequence has passed through the RNN, the prediction results must be converted into character labels by the transcription layer, CTC. CTC introduces a blank character ε, and pauses between characters are all represented by ε; decoding mainly consists of removing repeated letters and removing ε. The invention transcribes characters with a dictionary-based CTC algorithm: in the transcription layer the error is back-propagated through the forward-backward algorithm, the probabilities of all labels are obtained on the basis of the dictionary-constrained predictions, and the label with the highest probability is finally selected as the recognition result.
The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the apparatus embodiment, the above is only a preferred embodiment of the present invention, and since it is basically similar to the method embodiment, it is described simply, and the relevant points can be referred to the partial description of the method embodiment. The above description is only for the specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily made by those skilled in the art within the technical scope of the present invention will be covered by the present invention without departing from the principle of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.