Deep learning-based automatic LED character positioning and identifying method
Technical Field
The invention relates to the field of intelligent instruments, in particular to an automatic LED character positioning and identifying method based on deep learning.
Background
LED digital meters are common in modern intelligent instruments. Compared with traditional mechanical meters they are simple to install, require little engineering work, are highly accurate and compact, and are widely used in control systems, substation automation, distribution automation, district power monitoring, intelligent power distribution cabinets and switchgear cabinets. Meter reading, however, usually relies on manual work, which is not only inefficient but also consumes considerable manpower and energy; in high-voltage substations and distribution substations the readings of LED meters must be taken on site, which undeniably involves hazards. An algorithm that can automatically recognize LED meter readings is therefore needed.
Existing reading-recognition methods for digital LED meters only recognize the digits 0-9 and provide no recognition logic for the decimal point, the sign or the A/B/C phase indicators; they mainly segment and recognize characters on LED meters with a plain background, and under a complex background the segmented characters are often displayed incompletely and cannot be recognized. A typical prior method designs separate recognition logic for digits, decimal points, signs and A/B/C phases: it locates the character area of the LED meter to be recognized in a panoramic image by traditional template matching against a preset digital LED meter image, together with the single-character regions and possible decimal-point regions; it then obtains the single-character and decimal-point regions to be recognized from the relative positions of the positioning frame and the target frames, feeds each single-character region into a trained convolutional neural network such as AlexNet for recognition, detects the brightness of the decimal-point region, post-processes the detection results, and finally assembles the reading from the digit, decimal-point, A/B/C-phase and sign results. The drawback of such methods is that recognition is performed character by character, so digit recognition depends heavily on the quality of character segmentation, and LED characters that are tilted, affected by illumination or blurred during shooting are easily misrecognized.
Disclosure of Invention
The invention provides an automatic LED character positioning and identifying method based on deep learning to solve the above problems in the prior art. The method locates the region of the LED characters to be recognized with the YOLOv4 algorithm, so that the digital character area of the LED dial is located in the panoramic picture; it then detects single-line or multi-line characters with a PSENet network, and finally recognizes the LED multi-line characters with a CRNN network. The method also alleviates inaccurate recognition caused by tilted LED meters, blurred characters and similar problems.
The invention makes full use of the power of deep learning. First, the character area of the LED meter is located with a YOLOv4 network; the deep-learning-based YOLO target detection network can accurately detect the position of the meter, the position of the meter does not need to be marked in the picture taken by the camera, and the picture can be fed directly into the network for detection. After the region of interest (ROI) of the meter is detected, the ROI is fed into the character detection network: the progressive scale expansion network PSENet produces, through downsampling, feature fusion and upsampling, an output of the same size as the original image and yields the final text connected domains, i.e. the position of every line of characters in the LED meter. The network can locate single-line or multi-line character areas, so all character areas in the meter can be detected and located. Finally, the detected text regions are fed into the character recognition network, where a CRNN network performs automatic recognition, completing the automatic positioning and recognition of LED multi-line characters in an efficient way.
The technical solution adopted by the invention is an automatic LED character positioning and identifying method based on deep learning, implemented according to the following steps:
step 1, an LED meter area positioning module performs LED meter target detection in the substation scene with the YOLOv4 target detection algorithm and, to prevent other areas of the dial from interfering with character recognition, locates only the digital character area of the LED meter;
step 2, an LED meter character detection module uses the progressive scale expansion network PSENet as the character detection module for digital LED meters in the substation scene; by detecting the LED character target area at pixel level with an image segmentation technique, it improves the detection performance of the model on LED multi-line characters;
and step 3, an LED meter character recognition module trains on the features of the single-line or multi-line character target areas obtained in step 2 using a CRNN network, and finally recognizes the specific characters with the CTC algorithm to obtain the reading of the LED meter.
Step 1 is implemented according to the following steps:
Step 1.1, data enhancement is performed on the LED sample data with the GridMask method. GridMask is an information-deletion method: a region is randomly discarded on the image, which effectively adds a regularization term to the network to avoid overfitting and strikes a balance between deleting and retaining information. Random erasing, Cutout and Hide-and-Seek may delete or retain all of the discriminative area, which introduces noise and is not conducive to training the model.
One GridMask is determined by 4 parameters, x, y, r and d, which define a specific set of mask regions; the mask regions are also rotated during actual training.
Let k be the proportion of image information that is retained, W and H the width and height of the original image, and M the number of retained pixels, so that k = M / (W × H). k has no direct relation to the four parameters; they determine it only indirectly through r, and the relation between k and r is
k = 1 - (1 - r)²
x and y are chosen at random within a certain range:
δx, δy = random(0, d - 1)
In the LED meter detection task, r among the four GridMask hyper-parameters is set to 0.4 and d is sampled from (96, 224). During training, GridMask augmentation is applied to a training image with probability P = 0.6; for the detection task this probability starts at 0 and is gradually increased with the number of training iterations until it reaches P.
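By way of illustration, the GridMask operation described above can be sketched as follows (a minimal NumPy sketch, assuming square d × d units in which a square of side (1 - r)·d is dropped, so the keep ratio is k = 1 - (1 - r)²; in practice d would be sampled from the range (96, 224) per image, and the function and parameter names are illustrative, not part of the invention):

import numpy as np

def apply_gridmask(image, d=120, r=0.4, rng=np.random):
    # In every d x d unit, drop a square of side (1 - r) * d, giving keep ratio k = 1 - (1 - r) ** 2
    h, w = image.shape[:2]
    drop = int(round((1 - r) * d))
    dx, dy = rng.randint(0, d), rng.randint(0, d)    # random offsets delta_x, delta_y in [0, d - 1]
    mask = np.ones((h, w), dtype=image.dtype)
    for y in range(dy - d, h, d):
        for x in range(dx - d, w, d):
            y0, y1 = max(y, 0), min(y + drop, h)
            x0, x1 = max(x, 0), min(x + drop, w)
            if y0 < y1 and x0 < x1:
                mask[y0:y1, x0:x1] = 0               # delete this block of pixels
    return image * (mask[..., None] if image.ndim == 3 else mask)

During training, the augmented image would replace the original one with the scheduled probability described above.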
Step 1.2, a YOLOv4 target detection network is built to locate the character area of the LED meter dial in the picture. The learned high-level semantic information is passed to the lower layers through an FPN and fused with low-level high-resolution information to improve detection; a bottom-up information path is added, the feature information is strengthened by a downsampling operation, and the features of different convolutional layers are finally fused to obtain the detection result. The backbone network CSPDarknet53 of YOLOv4 uses the Mish activation function. Mish is a smooth curve; a smooth activation function lets information propagate into the neural network better, giving better accuracy and generalization, while small negative values are still allowed through, which preserves gradient flow. Its expression is:
Mish(x) = x × tanh(ln(1 + e^x))
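For reference, the expression above corresponds directly to the following PyTorch-style function (softplus computes ln(1 + e^x) in a numerically stable way; this is an illustrative sketch, not part of the invention):

import torch
import torch.nn.functional as F

def mish(x: torch.Tensor) -> torch.Tensor:
    # Mish(x) = x * tanh(ln(1 + exp(x)))
    return x * torch.tanh(F.softplus(x))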
Step 1.3, the pre-annotated target bounding boxes of the character area of the LED meter dial are defined as the Ground Truth; the annotated target pictures and their annotation files are fed into the YOLOv4 network for training, and the trained YOLOv4 target detection network is used to locate the character areas of LED meters with different characters.
Step 1.4, DIoU-NMS is used, which considers both the overlapping area and the distance between the centre points of two boxes, so that duplicate target boxes are removed and the digital character area of the LED meter is finally obtained.
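The DIoU-NMS step can be illustrated with the following simplified sketch: compared with standard NMS, the suppression criterion subtracts a penalty based on the normalized distance between box centres, so that nearby but distinct boxes are less likely to be suppressed (a sketch under the usual DIoU definition; names and the threshold value are illustrative):

import numpy as np

def diou_nms(boxes, scores, thr=0.5):
    # boxes: (N, 4) array of [x1, y1, x2, y2]; scores: (N,); returns indices of the kept boxes
    order = scores.argsort()[::-1]
    keep = []
    while order.size > 0:
        i, rest = order[0], order[1:]
        keep.append(int(i))
        # intersection-over-union with the remaining boxes
        x1 = np.maximum(boxes[i, 0], boxes[rest, 0]); y1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[i, 2], boxes[rest, 2]); y2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter + 1e-9)
        # DIoU penalty: squared centre distance over squared diagonal of the enclosing box
        ci = (boxes[i, :2] + boxes[i, 2:]) / 2
        cr = (boxes[rest, :2] + boxes[rest, 2:]) / 2
        d2 = ((ci - cr) ** 2).sum(axis=1)
        ex1 = np.minimum(boxes[i, :2], boxes[rest, :2]); ex2 = np.maximum(boxes[i, 2:], boxes[rest, 2:])
        c2 = ((ex2 - ex1) ** 2).sum(axis=1) + 1e-9
        order = rest[(iou - d2 / c2) <= thr]         # suppress only when IoU minus the centre penalty is high
    return keep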
Step 2 is implemented according to the following steps:
Step 2.1, feature extraction: the input picture is processed by a 50-layer ResNet50 residual network. The feature maps output by the Conv2, Conv3, Conv4 and Conv5 stages are extracted to build a feature pyramid, and through top-down and lateral connections four feature levels P2, P3, P4 and P5 are obtained, each with 256 channels.
Step 2.2, feature fusion: the four feature maps obtained in step 2.1 are fused. P3, P4 and P5 are upsampled by 2×, 4× and 8× respectively and concatenated with the feature level P2, finally giving a 1024-channel fused feature F. Fusing the high-level and low-level semantic features makes it possible to perceive the distribution of the LED characters and to detect the character boundaries more accurately. The fusion is implemented as:
F = C(P2, P3, P4, P5) = P2 ‖ UP×2(P3) ‖ UP×4(P4) ‖ UP×8(P5)
wherein, "|" represents the connection operation, and the upsampling is performed in a manner of 2 times, 4 times and 8 times respectively.
Step 2.3, the fused feature F obtained in step 2.2 passes through a 3×3 convolution, a BN layer and a ReLU layer to give a 256-channel feature map, which is fed into a 1×1 convolution to produce the segmentation results s1, s2, ..., sn, arranged from small to large according to kernel scale.
Step 2.4, scale expansion is performed with the PSENet algorithm, starting from the smallest kernel and expanding outwards in turn; a first-come-first-served rule resolves boundary conflicts during expansion, finally yielding LED character detection results with clear boundaries, as sketched below.
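The following is a simplified sketch of the progressive scale expansion, assuming the segmentation results have already been binarized into kernel maps ordered from the smallest to the largest scale; connected components of the smallest kernel seed the text instances, and each larger kernel is grown into by breadth-first search, with a pixel kept by whichever instance reaches it first:

from collections import deque
import numpy as np
from scipy.ndimage import label as connected_components

def progressive_scale_expansion(kernels):
    # kernels: list of binary (H, W) maps s1 ... sn, from the smallest to the largest scale
    labels, _ = connected_components(kernels[0])       # each text instance starts from the smallest kernel
    h, w = labels.shape
    for k in kernels[1:]:
        queue = deque(zip(*np.nonzero(labels)))
        while queue:
            y, x = queue.popleft()
            for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                ny, nx = y + dy, x + dx
                # grow only into pixels of the larger kernel that no instance has claimed yet
                if 0 <= ny < h and 0 <= nx < w and k[ny, nx] and labels[ny, nx] == 0:
                    labels[ny, nx] = labels[y, x]       # first-come-first-served at boundaries
                    queue.append((ny, nx))
    return labels                                       # (H, W) map of text-instance labels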
Step 3 is specifically implemented according to the following steps:
Step 3.1, LED character feature extraction: the CNN part of the CRNN uses a VGG-style structure. To speed up model convergence and account for the actual aspect ratio of LED characters, the pictures are uniformly normalized to a size of [240, 50]. Because the network contains deep convolutional layers and recurrent layers and is therefore hard to train, batch normalization (BN) layers are added after the fifth and sixth convolutional layers, which greatly accelerates training. Finally, feature extraction through the CNN yields a feature sequence of length 240/4 = 60 with 512 channels.
Step 3.2, LED character prediction: the feature maps extracted by the CNN in step 3.1 are fed into an RNN for character prediction. The CNN used has four max-pooling layers, and the window size of the last two pooling layers is changed from 2×2 to 1×2, because most LED character regions are short and wide; the 1×2 pooling windows preserve as much information in the width direction as possible and are better suited to recognizing letters and digits. Since LED characters photographed in real stations are often blurred, a deep bidirectional RNN is adopted as the RNN of the CRNN to improve the recognition of blurred LED characters. The RNN operates on the feature sequence x = x1, ..., xT output by the CNN, and each input xt produces an output yt. Because different LED meters show character strings of different lengths, long short-term memory (LSTM) units are chosen as the RNN cells to handle variable-length sequences; LSTM also effectively prevents vanishing gradients during training. Concretely, the feature map of the text picture is extracted by the 7-layer CNN, split by columns, and each column is fed as a 512-dimensional time step into two bidirectional LSTM layers with 256 units for classification.
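The recurrent part described above can be sketched as two stacked bidirectional LSTM layers with 256 units over the 512-dimensional feature sequence, followed by a per-column classifier whose output is passed to CTC (a PyTorch sketch; the size of the character set, here taken as digits, A/B/C, sign, decimal point and the blank, is an assumption):

import torch.nn as nn

class LedSequencePredictor(nn.Module):
    def __init__(self, feat_dim=512, hidden=256, num_classes=16):
        super().__init__()
        # two bidirectional LSTM layers with 256 units each, as described above
        self.rnn = nn.LSTM(feat_dim, hidden, num_layers=2, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, num_classes)   # per-timestep scores over the character set

    def forward(self, x):        # x: (T, N, 512) column features from the CNN
        y, _ = self.rnn(x)       # y: (T, N, 512), forward and backward states concatenated
        return self.fc(y)        # (T, N, num_classes), fed to the CTC transcription layer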
Step 3.3, character transcription: after the LED character sequence has passed through the RNN, the prediction results must be converted into character labels by the transcription layer, CTC. CTC introduces a blank character ε, and pauses between characters are all represented by ε; decoding mainly consists of removing repeated letters and removing ε. The invention transcribes characters with a dictionary-based CTC algorithm: in the transcription layer the error is back-propagated through the forward-backward algorithm, the probabilities of all labels are obtained on the basis of the dictionary-constrained predictions, and the label with the highest probability is finally selected as the recognition result.
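The decoding rule of the transcription layer, collapsing repeated symbols and then dropping the blank ε, can be sketched as follows; the dictionary-based variant described above would additionally restrict the final string to entries of a lexicon, which is omitted here:

def ctc_greedy_decode(probs, charset, blank=0):
    # probs: (T, C) array of per-timestep class probabilities; charset maps class index -> character
    best_path = probs.argmax(axis=1)
    decoded, previous = [], blank
    for idx in best_path:
        if idx != blank and idx != previous:   # drop blanks and collapse consecutive repeats
            decoded.append(charset[idx])
        previous = idx
    return "".join(decoded)

For example, a best path such as "1", "1", ε, "2", "2", ".", "5" collapses to "1", ε, "2", ".", "5" and decodes to "12.5".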
The invention has the beneficial effects that:
1. The method is suitable for recognizing digital meters in power distribution rooms and substations; it solves the low efficiency of manual meter reading and entry, and it achieves good recognition results even under external influences such as illumination, shooting angle and meter type.
2. The deep-learning-based YOLO target detection network can accurately detect the position of the meter; the meter position does not need to be marked in the picture taken by the camera, and the picture can be fed directly into the network for detection.
3. The progressive scale expansion network PSENet is used as the character detection module for digital LED meters in the substation scene; the network can locate single-line or multi-line character areas, i.e. all character areas in the meter can be detected and located, and the detected text regions are finally fed into the character recognition network for recognition.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed to be used in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without creative efforts.
FIG. 1 is a schematic diagram of the working procedure of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The overall workflow of the invention is shown in figure 1. First, the character area of the LED meter is located with a YOLOv4 network; the deep-learning-based YOLO target detection network can accurately detect the position of the meter, the position of the meter does not need to be marked in the picture taken by the camera, and the picture can be fed directly into the network for detection. After the region of interest (ROI) of the meter is detected, the ROI is fed into the character detection network: the progressive scale expansion network PSENet produces, through downsampling, feature fusion and upsampling, an output of the same size as the original picture and yields the final text connected domains, i.e. the position of every line of characters of the LED meter. The network can locate single-line or multi-line character areas, so all character areas in the meter can be detected and located. Finally, the detected text regions are fed into the character recognition network, where a CRNN network performs automatic recognition, completing the automatic positioning and recognition of LED multi-line characters in an efficient way.
The technical solution adopted by the invention is an automatic LED character positioning and identifying method based on deep learning, implemented according to the following steps:
step 1, LED meter area positioning module: as shown in the LED meter character area positioning module in fig. 1, the 1920 × 1080 picture taken by the camera is fed into the target detection network. Because the picture contains interfering elements such as indicator lights and other signboard characters, the LED meter to be recognized must be detected accurately; therefore the YOLOv4 target detection algorithm is used for LED meter target detection in the substation scene and, to prevent other areas of the dial from interfering with character recognition, only the digital character area of the LED meter is located. The ROI containing the characters is output and used as the input of step 2, as sketched below;
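In practice this step reduces to cropping the detected box out of the 1920 × 1080 frame before it is handed to the character detection module; a minimal sketch (the detector is assumed to return pixel-coordinate boxes, and the small margin is an illustrative choice):

def crop_led_roi(frame, box, margin=4):
    # frame: H x W x 3 camera image; box: (x1, y1, x2, y2) from the YOLOv4 detector
    h, w = frame.shape[:2]
    x1, y1, x2, y2 = box
    x1, y1 = max(int(x1) - margin, 0), max(int(y1) - margin, 0)
    x2, y2 = min(int(x2) + margin, w), min(int(y2) + margin, h)
    return frame[y1:y2, x1:x2]        # ROI containing only the digital character area of the LED meter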
step 2, LED meter character detection module: as shown in the LED meter character detection module in FIG. 1, the digital character region output by step 1 is fed into the network. The LED meter in the figure has 3 lines of characters, and each line must be detected by the character detection module before it can be recognized accurately; therefore the progressive scale expansion network PSENet is used as the character detection module for digital LED meters in the substation scene, and the LED character target area is detected at pixel level with an image segmentation technique, which improves the detection performance of the model on LED multi-line characters;
and step 3, LED meter character recognition module: as shown in the LED meter character recognition module in fig. 1, the multi-line character target areas detected in step 2 are fed into the recognition network in turn, the features of the obtained single-line or multi-line character target areas are trained with a CRNN network, and the specific characters are finally recognized with the CTC algorithm to obtain the reading of the LED meter.
Step 1 is implemented according to the following steps:
Step 1.1, data enhancement is performed on the LED sample data with the GridMask method. GridMask is an information-deletion method: a region is randomly discarded on the image, which effectively adds a regularization term to the network to avoid overfitting and strikes a balance between deleting and retaining information. Random erasing, Cutout and Hide-and-Seek may delete or retain all of the discriminative area, which introduces noise and is not conducive to training the model.
One GridMask is determined by 4 parameters, x, y, r and d, which define a specific set of mask regions; the mask regions are also rotated during actual training.
Let k be the proportion of image information that is retained, W and H the width and height of the original image, and M the number of retained pixels, so that k = M / (W × H). k has no direct relation to the four parameters; they determine it only indirectly through r, and the relation between k and r is
k = 1 - (1 - r)²
x and y are chosen at random within a certain range:
δx, δy = random(0, d - 1)
In the LED meter detection task, r among the four GridMask hyper-parameters is set to 0.4 and d is sampled from (96, 224). During training, GridMask augmentation is applied to a training image with probability P = 0.6; for the detection task this probability starts at 0 and is gradually increased with the number of training iterations until it reaches P.
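The increasing application probability described above can be realized, for instance, as a simple linear ramp from 0 to P = 0.6 over the training iterations (the ramp shape and the warm-up length are assumptions; the text only specifies the start and end values):

def gridmask_probability(iteration, p_max=0.6, warmup_iterations=10000):
    # probability of applying GridMask at the given iteration: 0 at the start, p_max after the warm-up
    return min(p_max, p_max * iteration / warmup_iterations)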
Step 1.2, a YOLOv4 target detection network is built to locate the character area of the LED meter dial in the picture. The learned high-level semantic information is passed to the lower layers through an FPN and fused with low-level high-resolution information to improve detection; a bottom-up information path is added, the feature information is strengthened by a downsampling operation, and the features of different convolutional layers are finally fused to obtain the detection result. The backbone network CSPDarknet53 of YOLOv4 uses the Mish activation function. Mish is a smooth curve; a smooth activation function lets information propagate into the neural network better, giving better accuracy and generalization, while small negative values are still allowed through, which preserves gradient flow. Its expression is:
Mish(x) = x × tanh(ln(1 + e^x))
Step 1.3, the pre-annotated target bounding boxes of the character area of the LED meter dial are defined as the Ground Truth; the annotated target pictures and their annotation files are fed into the YOLOv4 network for training, and the trained YOLOv4 target detection network is used to locate the character areas of LED meters with different characters.
Step 1.4, DIoU-NMS is used, which considers both the overlapping area and the distance between the centre points of two boxes, so that duplicate target boxes are removed and the digital character area of the LED meter is finally obtained.
Step 2 is implemented according to the following steps:
Step 2.1, feature extraction: the input picture is processed by a 50-layer ResNet50 residual network. The feature maps output by the Conv2, Conv3, Conv4 and Conv5 stages are extracted to build a feature pyramid, and through top-down and lateral connections four feature levels P2, P3, P4 and P5 are obtained, each with 256 channels.
Step 2.2, feature fusion: the four feature maps obtained in step 2.1 are fused. P3, P4 and P5 are upsampled by 2×, 4× and 8× respectively and concatenated with the feature level P2, finally giving a 1024-channel fused feature F. Fusing the high-level and low-level semantic features makes it possible to perceive the distribution of the LED characters and to detect the character boundaries more accurately. The fusion is implemented as:
F = C(P2, P3, P4, P5) = P2 ‖ UP×2(P3) ‖ UP×4(P4) ‖ UP×8(P5)
wherein, "|" represents the connection operation, and the upsampling is performed in a manner of 2 times, 4 times and 8 times respectively.
Step 2.3, the fused feature F obtained in step 2.2 passes through a 3×3 convolution, a BN layer and a ReLU layer to give a 256-channel feature map, which is fed into a 1×1 convolution to produce the segmentation results s1, s2, ..., sn, arranged from small to large according to kernel scale.
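Steps 2.2 and 2.3 together amount to a small segmentation head on top of the fused feature F; a PyTorch sketch is given below (the number of kernels n is a PSENet hyper-parameter and the value used here is an assumption, not specified by the invention):

import torch.nn as nn

class PseSegmentationHead(nn.Module):
    def __init__(self, in_channels=1024, mid_channels=256, num_kernels=6):
        super().__init__()
        self.reduce = nn.Sequential(                  # 3x3 convolution -> BN -> ReLU, 256 channels
            nn.Conv2d(in_channels, mid_channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(mid_channels),
            nn.ReLU(inplace=True),
        )
        self.kernels = nn.Conv2d(mid_channels, num_kernels, kernel_size=1)   # 1x1 convolution -> s1 ... sn

    def forward(self, fused):        # fused: (N, 1024, H, W) feature F from step 2.2
        return self.kernels(self.reduce(fused))       # (N, n, H, W) segmentation results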
Step 2.4, scale expansion is performed with the PSENet algorithm, starting from the smallest kernel and expanding outwards in turn; a first-come-first-served rule resolves boundary conflicts during expansion, finally yielding LED character detection results with clear boundaries.
Step 3 is implemented specifically according to the following steps:
Step 3.1, LED character feature extraction: the CNN part of the CRNN uses a VGG-style structure. To speed up model convergence and account for the actual aspect ratio of LED characters, the pictures are uniformly normalized to a size of [240, 50]. Because the network contains deep convolutional layers and recurrent layers and is therefore hard to train, batch normalization (BN) layers are added after the fifth and sixth convolutional layers, which greatly accelerates training. Finally, feature extraction through the CNN yields a feature sequence of length 240/4 = 60 with 512 channels.
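The conversion from the [240, 50] input to the feature sequence fed to the RNN can be sketched as follows, assuming the CNN collapses the height to 1 and, with the pooling configuration described in step 3.2, reduces the width by a factor of 4 (the backbone itself is omitted; feat is a torch.Tensor):

import torch

def feature_map_to_sequence(feat):
    # feat: CNN output of shape (N, 512, 1, W') with W' = 240 / 4 = 60 for a [240, 50] input
    n, c, h, w = feat.shape
    assert h == 1, "the feature map height is expected to be collapsed to 1 before the RNN"
    return feat.squeeze(2).permute(2, 0, 1)   # (N, 512, W') -> (W', N, 512), one 512-d vector per column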
Step 3.2, LED character prediction: the feature maps extracted by the CNN in step 3.1 are fed into an RNN for character prediction. The CNN used has four max-pooling layers, and the window size of the last two pooling layers is changed from 2×2 to 1×2, because most LED character regions are short and wide; the 1×2 pooling windows preserve as much information in the width direction as possible and are better suited to recognizing letters and digits. Since LED characters photographed in real stations are often blurred, a deep bidirectional RNN is adopted as the RNN of the CRNN to improve the recognition of blurred LED characters. The RNN operates on the feature sequence x = x1, ..., xT output by the CNN, and each input xt produces an output yt. Because different LED meters show character strings of different lengths, long short-term memory (LSTM) units are chosen as the RNN cells to handle variable-length sequences; LSTM also effectively prevents vanishing gradients during training. Concretely, the feature map of the text picture is extracted by the 7-layer CNN, split by columns, and each column is fed as a 512-dimensional time step into two bidirectional LSTM layers with 256 units for classification.
Step 3.3, character transcription: after the LED character sequence has passed through the RNN, the prediction results must be converted into character labels by the transcription layer, CTC. CTC introduces a blank character ε, and pauses between characters are all represented by ε; decoding mainly consists of removing repeated letters and removing ε. The invention transcribes characters with a dictionary-based CTC algorithm: in the transcription layer the error is back-propagated through the forward-backward algorithm, the probabilities of all labels are obtained on the basis of the dictionary-constrained predictions, and the label with the highest probability is finally selected as the recognition result.
The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the apparatus embodiment, the above is only a preferred embodiment of the present invention, and since it is basically similar to the method embodiment, it is described simply, and the relevant points can be referred to the partial description of the method embodiment. The above description is only for the specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily made by those skilled in the art within the technical scope of the present invention will be covered by the present invention without departing from the principle of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.