Double-channel fused three-layer architecture mathematical formula identification method, system and storage device
Technical Field
The invention relates to the technical field of mathematical formula identification, and in particular to a double-channel fused three-layer architecture method, system and storage device for identifying mathematical formulas.
Background
Mathematical formulas are widely used in many scientific fields and play an essential role in explaining theoretical knowledge and describing scientific problems. Formulas can be entered with mathematical tools, but this requires typing the mathematical markup language Latex to generate the corresponding formula, which demands a certain grammar background from the user. Handwritten input solves this problem well and lets users enter mathematical formulas far more conveniently.
Compared with printed mathematical formulas, handwritten formulas are far more difficult to recognize, owing to the ambiguity of handwritten symbols, the diversity of handwriting styles, and the heavy adhesion between symbols.
Generally, the mathematical formula identification process can be divided into three stages: a symbol segmentation stage, a symbol identification stage and a structure analysis stage. For different ways of implementing the three phases, the current mainstream identification schemes can be divided into two types: multi-stage identification and single-stage identification.
The multi-stage solution first segments the symbols in the mathematical formula, then identifies the segmented symbols, and finally performs structural analysis based on the recognition results and symbol positions. Although this solution modularizes the recognition process, it suffers from a serious problem: error propagation. Errors from an earlier stage are passed on to the next stage and accumulate, degrading the recognition accuracy of the entire pipeline.
The single-stage solution uses a deep neural network to build an end-to-end recognition network that completes the three stages of formula recognition at once. The recognition network usually adopts an encoder-decoder structure: an encoder first extracts and encodes features from the input formula picture, then a decoder with an attention mechanism scans the features extracted by the encoder, uses the most relevant region to describe the segmented symbol, and outputs the mathematical markup language Latex corresponding to the formula.
In the encoder, because symbol sizes differ across mathematical formulas, researchers address the scale-inconsistency problem by simultaneously extracting several feature outputs of different granularities, so as to use the visual features in the picture more effectively. However, although extracting multiple features of different granularities in the encoding layer can handle the symbol-size problem, the extracted features lack representational power, the contextual information of the symbols is not fully exploited, and a large number of irrelevant features are introduced.
In the decoder's attention mechanism, the output feature map of the encoder is weighted and summed according to the attention map at the current moment, yielding a vector that represents the region most relevant to the character currently being recognized; this vector is then decoded to produce the serialized output. Using only the attention at the current moment, symbol repetition and symbol omission may occur in the recognition result.
Because a recurrent neural network predicts poorly in the early stage of model training, the prior art feeds the real label as the input of each recurrent unit, preventing a large deviation of any one neuron from harming the training of the whole network. Although substituting the real label for the predicted output gives the model a better recognition effect during training, recognition accuracy drops at test time because the guidance of the real labels is then absent.
Disclosure of Invention
Therefore, a double-channel fused three-layer structure mathematical formula identification method needs to be provided to solve the problem of low precision of the existing single-stage formula identification technology. The specific technical scheme is as follows:
a method for identifying a three-layer architecture mathematical formula fused with two channels comprises the following steps:
performing feature extraction on an input picture through a coding layer, wherein the features comprise: regional visual information;
capturing the context of the regional visual information through an attention layer to generate a context vector;
and decoding the context vector through a decoding layer to generate a mathematical markup language file corresponding to the formula.
Further, the "performing feature extraction on an input picture by using an encoding layer" further includes the steps of:
using DenseNet as an encoder;
extracting visual information of an input picture by using a Dense network as a backbone network;
a spatial attention module and a channel attention module are fused in the encoder.
Further, before the "extracting features of the input picture by the coding layer", the method further includes the steps of:
adding mask information in the input picture data, and additionally adding a channel to the filled part, wherein the channel is used for recording the filled information.
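A minimal sketch of this masking step (the helper name `pad_with_mask` and the fixed target size are illustrative assumptions, for a single-channel grayscale picture):

```python
import numpy as np

def pad_with_mask(img, target_h, target_w):
    """Pad a grayscale image (H, W) to (target_h, target_w) and append an
    extra channel that records which pixels are real (1) vs. filled (0)."""
    h, w = img.shape
    padded = np.zeros((target_h, target_w), dtype=img.dtype)
    padded[:h, :w] = img
    mask = np.zeros((target_h, target_w), dtype=img.dtype)
    mask[:h, :w] = 1  # 1 marks original content, 0 marks the filled part
    return np.stack([padded, mask], axis=0)  # shape (2, target_h, target_w)
```

The mask channel lets later layers distinguish genuine picture content from padding, reducing the information redundancy introduced by filling.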
Further, the "performing feature extraction on an input picture by using an encoding layer" further includes the steps of:
the spatial attention module respectively performs average pooling and maximum pooling on the input feature map to obtain two feature maps with the same dimensionality, splices the two feature maps with the same dimensionality according to a channel and then obtains a spatial attention matrix through a sigmoid function, and multiplies the spatial attention matrix with the input feature map to obtain a spatial attention feature map;
the channel attention module respectively carries out global average pooling and global maximum pooling on the input feature map to obtain two feature maps, inputs the two feature maps into a shared multilayer perceptron to obtain two vectors, adds the two vectors and multiplies the two vectors by the input feature map to obtain the channel attention feature map.
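The two modules described above can be sketched as follows (a simplified illustration: the learned convolution of the spatial module and the trained weights of the shared multilayer perceptron are replaced by stand-ins, so this is not the trained encoder itself):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def spatial_attention(feat):
    """feat: (C, H, W). Average- and max-pool over the channel axis, stack
    the two same-dimension maps channel-wise, squash to a single spatial
    attention matrix via sigmoid (a fixed mean stands in for the learned
    2->1 convolution), and multiply it with the input feature map."""
    avg_map = feat.mean(axis=0)             # (H, W)
    max_map = feat.max(axis=0)              # (H, W)
    stacked = np.stack([avg_map, max_map])  # (2, H, W) channel-wise concat
    attn = sigmoid(stacked.mean(axis=0))    # (H, W) spatial attention matrix
    return feat * attn                      # broadcast over channels

def channel_attention(feat, w1, w2):
    """feat: (C, H, W); w1 (C, C//r) and w2 (C//r, C) are the shared-MLP
    weights. Global average/max pooling -> shared MLP -> add -> sigmoid ->
    multiply with the input feature map."""
    gap = feat.mean(axis=(1, 2))            # (C,) global average pooling
    gmp = feat.max(axis=(1, 2))             # (C,) global max pooling
    mlp = lambda v: np.maximum(v @ w1, 0) @ w2  # shared two-layer perceptron
    attn = sigmoid(mlp(gap) + mlp(gmp))     # (C,) channel attention vector
    return feat * attn[:, None, None]
```

Both modules only reweight the input feature map, so the output keeps the input's shape.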
Further, the attention layer is provided with a coverage vector, and the coverage vector is used for representing the accumulated sum of all attention mechanisms at past moments;
the attention calculation formula is:
e_ti = f_att(a_i, h_(t-1), c_t)
α_ti = exp(e_ti) / Σ_k exp(e_tk)
c_t = Σ_(l=1)^(t-1) α_l
z_t = φ({a_i, α_ti}) = Σ_i α_ti · a_i
where f_att denotes a multilayer perceptron, a_i is a vector in the encoder output corresponding to a region of the picture, h_(t-1) is the hidden-layer output of the previous unit, α_ti denotes the attention weight of the i-th vector at the t-th time step, c_t denotes the coverage vector, α_l denotes the attention distribution at the l-th time step, and z_t denotes the output of the attention mechanism, obtained by applying the attention weights to the image regions.
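A coverage-based attention step of this general shape can be sketched as follows (the weight matrices standing in for the multilayer perceptron f_att are illustrative assumptions, not the trained parameters):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def coverage_attention_step(a, h_prev, coverage, W_a, W_h, W_c, v):
    """One decoding step of coverage-based attention.
    a: (L, D) encoder annotation vectors a_i, one per image region;
    h_prev: (H,) hidden state of the previous unit;
    coverage: (L,) accumulated sum of all past attention distributions.
    Returns the context vector z_t and the updated coverage."""
    # e_ti = v^T tanh(W_a a_i + W_h h_prev + W_c c_ti)
    e = np.tanh(a @ W_a + h_prev @ W_h + coverage[:, None] * W_c) @ v  # (L,)
    alpha = softmax(e)                  # attention weights alpha_ti
    z = alpha @ a                       # z_t: weighted sum over regions
    return z, coverage + alpha          # coverage accumulates past attention
```

Feeding the coverage back into the score lets already-attended regions receive lower attention at the next step.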
Further, the decoding layer adopts a GRU recurrent neural network for decoding.
Further, the generating of the mathematical markup language file corresponding to the formula further includes the steps of:
and selecting the sequence with the highest output score as the final output sequence by using a beam search algorithm on the output of the decoding layer.
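The sequence-selection step can be sketched as a minimal beam search (the `step_fn` interface, which returns next-token probabilities for a given prefix, is a hypothetical stand-in for the decoder):

```python
import math

def beam_search(step_fn, start_token, end_token, beam_width=3, max_len=20):
    """Minimal beam search over a step function.
    step_fn(prefix) -> dict {token: probability} for the next token.
    Keeps the beam_width highest-scoring (log-probability) sequences and
    returns the best sequence that reached end_token (or the best prefix)."""
    beams = [([start_token], 0.0)]          # (sequence, log-prob score)
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            for tok, p in step_fn(seq).items():
                candidates.append((seq + [tok], score + math.log(p)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_width]:
            (finished if seq[-1] == end_token else beams).append((seq, score))
        if not beams:
            break
    best = max(finished or beams, key=lambda c: c[1])
    return best[0]
```

Unlike greedy decoding, the beam keeps several partial hypotheses alive, so a token that looks locally suboptimal can still lead to the globally highest-scoring sequence.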
In order to solve the technical problem, the three-layer architecture mathematical formula recognition system fusing double channels is further provided, and the specific technical scheme is as follows:
a three-layer architecture mathematical formula recognition system fusing two channels comprises: an encoding layer, an attention layer, and a decoding layer;
the coding layer is configured to: performing feature extraction on an input picture, wherein the features comprise: regional visual information;
the attention layer is used for: capturing the context of the regional visual information to generate a context vector;
the decoding layer is used for: and decoding the context vector to generate a mathematical markup language file corresponding to the formula.
Further, DenseNet is used as the encoder;
extracting visual information of an input picture by using a Dense network as a backbone network;
a space attention module and a channel attention module are fused in the encoder;
the spatial attention module respectively performs average pooling and maximum pooling on the input feature map to obtain two feature maps with the same dimensionality, splices the two feature maps with the same dimensionality according to a channel and then obtains a spatial attention matrix through a sigmoid function, and multiplies the spatial attention matrix with the input feature map to obtain a spatial attention feature map;
the channel attention module respectively carries out global average pooling and global maximum pooling on the input feature map to obtain two feature maps, inputs the two feature maps into a shared multilayer perceptron to obtain two vectors, adds the two vectors and multiplies the two vectors by the input feature map to obtain the channel attention feature map.
In order to solve the technical problem, the storage device is further provided, and the specific technical scheme is as follows:
a storage device having stored therein a set of instructions for performing the steps of any of the methods described above.
The invention has the beneficial effects that: 1. performing feature extraction on an input picture through a coding layer, wherein the features comprise: regional visual information; capturing the context of the regional visual information through an attention layer to generate a context vector; and decoding the context vector through a decoding layer to generate a mathematical markup language file corresponding to the formula. Through the above operations of the three layers of the coding layer, the attention layer and the decoding layer, a mathematical formula with higher precision can be obtained.
2. By fusing a spatial attention module and a channel attention module in the encoder, the encoder learns what to attend to along the channel axis and where to attend to along the spatial axes, which improves the expression of regions of interest, makes the features extracted by the encoder more representative, and effectively improves the recognition rate.
3. Before the feature extraction of the input picture through the coding layer, the method further comprises the following steps: adding mask information in the input picture data, and additionally adding a channel to the filled part, wherein the channel is used for recording the filled information. The information redundancy of the picture filling process can be reduced by adding the mask information.
4. A coverage vector is provided in the attention layer to represent the accumulated sum of the attention distributions at all past moments; it tells the model which parts of the encoder's input have already been attended to and which have not. To prevent the model from repeatedly attending to already-covered regions, the coverage vector is used as a component of the attention at the next step, so that the attention distribution generated next deliberately lowers the probability of already-attended regions, effectively reducing symbol omission and symbol repetition in formula recognition.
Drawings
FIG. 1 is a flow chart of a method for identifying a mathematical formula with a three-layer architecture incorporating two channels according to an embodiment;
FIG. 2 is a diagram illustrating a method for identifying a mathematical formula with a three-layer architecture incorporating two channels according to an embodiment;
FIG. 3 is a block diagram of a two-channel fused three-layer structure mathematical formula recognition system according to an embodiment;
fig. 4 is a block diagram of a storage device according to an embodiment.
Description of reference numerals:
300. a three-layer structure mathematical formula recognition system integrating two channels,
301. an encoding layer,
302. an attention layer,
303. a decoding layer,
400. a storage device.
Detailed Description
To explain technical contents, structural features, and objects and effects of the technical solutions in detail, the following detailed description is given with reference to the accompanying drawings in conjunction with the embodiments.
Referring to fig. 1 to 2, in the present embodiment, a method for identifying a mathematical formula of a three-layer architecture with two channels integrated can be applied to a storage device, including but not limited to: personal computers, servers, general purpose computers, special purpose computers, network devices, embedded devices, programmable devices, intelligent mobile terminals, etc. The specific implementation is as follows:
step S101: performing feature extraction on an input picture through a coding layer, wherein the features comprise: regional visual information.
Step S102: a context vector is generated by capturing the context of the regional visual information through the attention layer.
Step S103: and decoding the context vector through a decoding layer to generate a mathematical markup language file corresponding to the formula.
Performing feature extraction on an input picture through a coding layer, wherein the features comprise: regional visual information; capturing the context of the regional visual information through an attention layer to generate a context vector; and decoding the context vector through a decoding layer to generate a mathematical markup language file corresponding to the formula. Through the above operations of the three layers of the coding layer, the attention layer and the decoding layer, a mathematical formula with higher precision can be obtained.
Wherein step S101 in this embodiment further includes the steps of: using DenseNet as an encoder; extracting visual information of the input picture with the Dense network as the backbone network; and fusing a spatial attention module and a channel attention module in the encoder. By fusing these two modules, the encoder learns what to attend to along the channel axis and where to attend to along the spatial axes, which improves the expression of regions of interest, makes the extracted features more representative, and effectively improves the recognition rate.
In order to reduce the information redundancy in the picture filling process, in this embodiment, before the "performing feature extraction on an input picture by using a coding layer", the method further includes the steps of: adding mask information in the input picture data, and additionally adding a channel to the filled part, wherein the channel is used for recording the filled information.
The following specifically describes the operation of the spatial attention module and the channel attention module:
the spatial attention module respectively performs average pooling and maximum pooling on the input feature map to obtain two feature maps with the same dimensionality, splices the two feature maps with the same dimensionality according to a channel and then obtains a spatial attention matrix through a sigmoid function, and multiplies the spatial attention matrix with the input feature map to obtain a spatial attention feature map;
the channel attention module respectively carries out global average pooling and global maximum pooling on the input feature map to obtain two feature maps, inputs the two feature maps into a shared multilayer perceptron to obtain two vectors, adds the two vectors and multiplies the two vectors by the input feature map to obtain the channel attention feature map.
In this embodiment, the DenseNet network structure is mainly composed of DenseBlocks and Transition layers; a channel attention module and a spatial attention module are added, in that order, between each DenseBlock and the following Transition.
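The ordering of modules within one encoder stage can be sketched as a simple composition (all four callables are placeholders for the real network components):

```python
def build_encoder_stage(dense_block, channel_attn, spatial_attn, transition):
    """Compose one encoder stage in the order described above:
    DenseBlock -> channel attention -> spatial attention -> Transition.
    Each argument is any callable taking and returning a feature map."""
    def stage(x):
        x = dense_block(x)
        x = channel_attn(x)   # 'what' to attend to, along the channel axis
        x = spatial_attn(x)   # 'where' to attend to, along the spatial axes
        return transition(x)
    return stage
```

Placing the attention modules between DenseBlock and Transition reweights the features before the Transition layer downsamples them.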
In this embodiment, the attention layer is provided with a coverage vector, and the coverage vector is used for representing the accumulated sum of all attention mechanisms at past time;
the attention calculation formula is:
e_ti = f_att(a_i, h_(t-1), c_t)
α_ti = exp(e_ti) / Σ_k exp(e_tk)
c_t = Σ_(l=1)^(t-1) α_l
z_t = φ({a_i, α_ti}) = Σ_i α_ti · a_i
where f_att denotes a multilayer perceptron, a_i is a vector in the encoder output corresponding to a region of the picture, h_(t-1) is the hidden-layer output of the previous unit, α_ti denotes the attention weight of the i-th vector at the t-th time step, c_t denotes the coverage vector, α_l denotes the attention distribution at the l-th time step, and z_t denotes the output of the attention mechanism, obtained by applying the attention weights to the image regions.
The coverage vector is used to represent the cumulative sum of all attention mechanisms at past times, which tells the model which parts of the encoder's inputs are already attended to and none. In order to prevent the model from paying more attention to the concerned region, the coverage vector is used as a component of the attention of the next step, so that the attention distribution generated in the next step intentionally reduces the probability of the concerned region, and the phenomena of symbol missing and symbol repetition in formula recognition are effectively reduced.
In this embodiment, the decoding layer performs decoding with a GRU recurrent neural network, which alleviates the gradient explosion and gradient vanishing problems of recurrent networks. During training, a probability value determines whether the current ground-truth label or the output of the previous unit is used as the input of the current unit, and the decoding layer finally outputs the one-hot encoding corresponding to each symbol, as shown in fig. 2. By randomly choosing between the real label and the predicted output as the input of the decoding layer's recurrent network, the recognition capability of the system under different recognition scenarios is enhanced.
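The training-time choice between the real label and the previous prediction can be sketched as follows (the function name and the default probability are illustrative assumptions):

```python
import random

def next_decoder_input(true_label, prev_prediction,
                       teacher_forcing_prob=0.5, rng=random):
    """Choose the input of the current recurrent unit: with probability
    teacher_forcing_prob feed the ground-truth label, otherwise feed the
    previous unit's own prediction."""
    if rng.random() < teacher_forcing_prob:
        return true_label
    return prev_prediction
```

With probability 1.0 this reduces to pure teacher forcing, and with 0.0 to free-running decoding; intermediate values expose the network to both regimes during training.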
Preferably, the generating of the mathematical markup language file corresponding to the formula further includes the steps of: using a beam search algorithm on the output of the decoding layer, selecting the sequence with the highest score as the final output sequence.
Referring to fig. 3, in the present embodiment, an embodiment of a two-channel fused three-layer architecture mathematical formula recognition system 300 is as follows:
A two-channel fused three-tier architecture mathematical formula recognition system 300 comprises: an encoding layer 301, an attention layer 302, and a decoding layer 303. The encoding layer 301 is configured to: perform feature extraction on an input picture, wherein the features comprise: regional visual information. The attention layer 302 is configured to: capture the context of the regional visual information to generate a context vector. The decoding layer 303 is configured to: decode the context vector to generate a mathematical markup language file corresponding to the formula.
Feature extraction is performed on the input picture by the encoding layer 301, where the features comprise regional visual information; the context of the regional visual information is captured by the attention layer 302 to generate a context vector; and the context vector is decoded by the decoding layer 303 to generate the mathematical markup language file corresponding to the formula. Through the above operations of the three layers, the encoding layer 301, the attention layer 302 and the decoding layer 303, a mathematical formula with higher precision can be obtained.
In the encoding layer 301, DenseNet is used as the encoder, with the Dense network as the backbone network extracting visual information from the input picture, and a spatial attention module and a channel attention module are fused in the encoder. By fusing these two modules, the encoder learns what to attend to along the channel axis and where to attend to along the spatial axes, which improves the expression of regions of interest, makes the extracted features more representative, and effectively improves the recognition rate.
The following specifically describes the operation of the spatial attention module and the channel attention module:
the spatial attention module respectively performs average pooling and maximum pooling on the input feature map to obtain two feature maps with the same dimensionality, splices the two feature maps with the same dimensionality according to a channel and then obtains a spatial attention matrix through a sigmoid function, and multiplies the spatial attention matrix with the input feature map to obtain a spatial attention feature map;
the channel attention module respectively carries out global average pooling and global maximum pooling on the input feature map to obtain two feature maps, inputs the two feature maps into a shared multilayer perceptron to obtain two vectors, adds the two vectors and multiplies the two vectors by the input feature map to obtain the channel attention feature map.
In this embodiment, the DenseNet network structure is mainly composed of DenseBlocks and Transition layers; a channel attention module and a spatial attention module are added, in that order, between each DenseBlock and the following Transition.
In this embodiment, the attention layer 302 is provided with a coverage vector, which is used to represent the accumulated sum of the attention distributions at all past moments;
the attention calculation formula is:
e_ti = f_att(a_i, h_(t-1), c_t)
α_ti = exp(e_ti) / Σ_k exp(e_tk)
c_t = Σ_(l=1)^(t-1) α_l
z_t = φ({a_i, α_ti}) = Σ_i α_ti · a_i
where f_att denotes a multilayer perceptron, a_i is a vector in the encoder output corresponding to a region of the picture, h_(t-1) is the hidden-layer output of the previous unit, α_ti denotes the attention weight of the i-th vector at the t-th time step, c_t denotes the coverage vector, α_l denotes the attention distribution at the l-th time step, and z_t denotes the output of the attention mechanism, obtained by applying the attention weights to the image regions.
The coverage vector is used to represent the cumulative sum of all attention mechanisms at past times, which tells the model which parts of the encoder's inputs are already attended to and none. In order to prevent the model from paying more attention to the concerned region, the coverage vector is used as a component of the attention of the next step, so that the attention distribution generated in the next step intentionally reduces the probability of the concerned region, and the phenomena of symbol missing and symbol repetition in formula recognition are effectively reduced.
In this embodiment, the decoding layer 303 performs decoding with a GRU recurrent neural network, which alleviates the gradient explosion and gradient vanishing problems of recurrent networks. During training, a probability value determines whether the current ground-truth label or the output of the previous unit is used as the input of the current unit, and the decoding layer 303 finally outputs the one-hot encoding corresponding to each symbol, as shown in fig. 2. By randomly choosing between the real label and the predicted output as the input of the recurrent network of the decoding layer 303, the recognition capability of the system under different recognition scenarios is enhanced. Using a beam search algorithm on the output of the decoding layer 303, the sequence with the highest output score is selected as the final output sequence.
Referring to fig. 4, in the present embodiment, a memory device 400 is implemented as follows: the memory device 400 may be used to perform any of the steps of the above two-channel fused three-layer architecture mathematical formula identification method, and a repeated description thereof is omitted.
It should be noted that, although the above embodiments have been described herein, the invention is not limited thereto. Therefore, based on the innovative concepts of the present invention, the technical solutions of the present invention can be directly or indirectly applied to other related technical fields by making changes and modifications to the embodiments described herein, or by using equivalent structures or equivalent processes performed in the content of the present specification and the attached drawings, which are included in the scope of the present invention.