Double-channel fused three-layer architecture mathematical formula identification method, system and storage device
Technical Field
The invention relates to the technical field of mathematical formula identification, and in particular to a double-channel fused three-layer architecture method, system and storage device for identifying mathematical formulas.
Background
Mathematical formulas are widely used in many scientific fields and play an essential role in explaining theoretical knowledge and describing scientific problems. Formulas can be entered with mathematical tools, but this requires typing the mathematical markup language Latex to generate the corresponding formula, which demands a certain grammar background from the user. Handwritten input solves this problem well and lets users enter mathematical formulas far more conveniently.
Compared with printed mathematical formulas, handwritten formulas are far more difficult to recognize, owing to the ambiguity of handwritten symbols, the diversity of handwriting styles, and the heavy adhesion between symbols.
Generally, the mathematical formula identification process can be divided into three stages: a symbol segmentation stage, a symbol identification stage and a structure analysis stage. For different ways of implementing the three phases, the current mainstream identification schemes can be divided into two types: multi-stage identification and single-stage identification.
The multi-stage solution first segments the symbols in the mathematical formula, then identifies the segmented symbols, and finally performs structural analysis based on the recognition results and symbol positions. Although this solution modularizes the recognition process, it suffers from a serious problem: error propagation. Errors from an earlier stage are passed on to the next stage and accumulate, degrading the recognition accuracy of the entire pipeline.
The single-stage solution uses a deep neural network to build an end-to-end recognition network that completes the three stages of formula recognition at once. The recognition network usually adopts an encoder-decoder structure: an encoder first extracts and encodes features from the input formula picture, then a decoder with an attention mechanism scans the features extracted by the encoder, uses the most relevant region to describe the segmented symbol, and outputs the mathematical markup language Latex corresponding to the formula.
In the encoder, because symbol sizes differ across mathematical formulas, researchers address the scale-inconsistency problem by simultaneously extracting several feature outputs of different granularities, so as to use the visual features in the picture more effectively. However, although extracting multiple features of different granularities in the encoding layer can handle the symbol-size problem, the extracted features lack representational power, the contextual information of the symbols is not fully exploited, and a large number of irrelevant features are introduced.
In the decoder's attention mechanism, the output feature map of the encoder is weighted and summed according to the attention map at the current moment, yielding a vector that represents the region most relevant to the character currently being recognized; this vector is then decoded to produce the serialized output. Using only the attention at the current moment, symbol repetition and symbol omission may occur in the recognition result.
Because a recurrent neural network predicts poorly in the early stage of model training, the prior art feeds the real label as the input of each recurrent unit, preventing a large deviation of any one neuron from harming the training of the whole network. Although substituting the real label for the predicted output gives the model a better recognition effect during training, recognition accuracy drops at test time because the guidance of the real labels is then absent.
Disclosure of Invention
Therefore, a double-channel fused three-layer structure mathematical formula identification method needs to be provided to solve the problem of low precision of the existing single-stage formula identification technology. The specific technical scheme is as follows:
a method for identifying a three-layer architecture mathematical formula fused with two channels comprises the following steps:
performing feature extraction on an input picture through a coding layer, wherein the features comprise: regional visual information;
capturing the context of the regional visual information through an attention layer to generate a context vector;
and decoding the context vector through a decoding layer to generate a mathematical markup language file corresponding to the formula.
Further, the "performing feature extraction on an input picture by using an encoding layer" further includes the steps of:
using DenseNet as an encoder;
extracting visual information of an input picture by using a Dense network as a backbone network;
a spatial attention module and a channel attention module are fused in the encoder.
Further, before the "extracting features of the input picture by the coding layer", the method further includes the steps of:
adding mask information in the input picture data, and additionally adding a channel to the filled part, wherein the channel is used for recording the filled information.
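A minimal sketch of this masking step (the helper name `pad_with_mask` and the fixed target size are illustrative assumptions, for a single-channel grayscale picture):

```python
import numpy as np

def pad_with_mask(img, target_h, target_w):
    """Pad a grayscale image (H, W) to (target_h, target_w) and append an
    extra channel that records which pixels are real (1) vs. filled (0)."""
    h, w = img.shape
    padded = np.zeros((target_h, target_w), dtype=img.dtype)
    padded[:h, :w] = img
    mask = np.zeros((target_h, target_w), dtype=img.dtype)
    mask[:h, :w] = 1  # 1 marks original content, 0 marks the filled part
    return np.stack([padded, mask], axis=0)  # shape (2, target_h, target_w)
```

The mask channel lets later layers distinguish genuine picture content from padding, reducing the information redundancy introduced by filling.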
Further, the "performing feature extraction on an input picture by using an encoding layer" further includes the steps of:
the spatial attention module respectively performs average pooling and maximum pooling on the input feature map to obtain two feature maps with the same dimensionality, splices the two feature maps with the same dimensionality according to a channel and then obtains a spatial attention matrix through a sigmoid function, and multiplies the spatial attention matrix with the input feature map to obtain a spatial attention feature map;
the channel attention module respectively carries out global average pooling and global maximum pooling on the input feature map to obtain two feature maps, inputs the two feature maps into a shared multilayer perceptron to obtain two vectors, adds the two vectors and multiplies the two vectors by the input feature map to obtain the channel attention feature map.
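The two modules described above can be sketched as follows (a simplified illustration: the learned convolution of the spatial module and the trained weights of the shared multilayer perceptron are replaced by stand-ins, so this is not the trained encoder itself):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def spatial_attention(feat):
    """feat: (C, H, W). Average- and max-pool over the channel axis, stack
    the two same-dimension maps channel-wise, squash to a single spatial
    attention matrix via sigmoid (a fixed mean stands in for the learned
    2->1 convolution), and multiply it with the input feature map."""
    avg_map = feat.mean(axis=0)             # (H, W)
    max_map = feat.max(axis=0)              # (H, W)
    stacked = np.stack([avg_map, max_map])  # (2, H, W) channel-wise concat
    attn = sigmoid(stacked.mean(axis=0))    # (H, W) spatial attention matrix
    return feat * attn                      # broadcast over channels

def channel_attention(feat, w1, w2):
    """feat: (C, H, W); w1 (C, C//r) and w2 (C//r, C) are the shared-MLP
    weights. Global average/max pooling -> shared MLP -> add -> sigmoid ->
    multiply with the input feature map."""
    gap = feat.mean(axis=(1, 2))            # (C,) global average pooling
    gmp = feat.max(axis=(1, 2))             # (C,) global max pooling
    mlp = lambda v: np.maximum(v @ w1, 0) @ w2  # shared two-layer perceptron
    attn = sigmoid(mlp(gap) + mlp(gmp))     # (C,) channel attention vector
    return feat * attn[:, None, None]
```

Both modules only reweight the input feature map, so the output keeps the input's shape.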
Further, the attention layer is provided with a coverage vector, and the coverage vector is used for representing the accumulated sum of all attention mechanisms at past moments;
the attention calculation formula is:
e_ti = f_att(a_i, h_(t-1), c_t)
α_ti = exp(e_ti) / Σ_k exp(e_tk)
c_t = Σ_(l=1)^(t-1) α_l
z_t = φ({a_i, α_ti}) = Σ_i α_ti · a_i
where f_att denotes a multilayer perceptron, a_i is a vector in the encoder output corresponding to a region of the picture, h_(t-1) is the hidden-layer output of the previous unit, α_ti denotes the attention weight of the i-th vector at the t-th time step, c_t denotes the coverage vector, α_l denotes the attention distribution at the l-th time step, and z_t denotes the output of the attention mechanism, obtained by applying the attention weights to the image regions.
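A coverage-based attention step of this general shape can be sketched as follows (the weight matrices standing in for the multilayer perceptron f_att are illustrative assumptions, not the trained parameters):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def coverage_attention_step(a, h_prev, coverage, W_a, W_h, W_c, v):
    """One decoding step of coverage-based attention.
    a: (L, D) encoder annotation vectors a_i, one per image region;
    h_prev: (H,) hidden state of the previous unit;
    coverage: (L,) accumulated sum of all past attention distributions.
    Returns the context vector z_t and the updated coverage."""
    # e_ti = v^T tanh(W_a a_i + W_h h_prev + W_c c_ti)
    e = np.tanh(a @ W_a + h_prev @ W_h + coverage[:, None] * W_c) @ v  # (L,)
    alpha = softmax(e)                  # attention weights alpha_ti
    z = alpha @ a                       # z_t: weighted sum over regions
    return z, coverage + alpha          # coverage accumulates past attention
```

Feeding the coverage back into the score lets already-attended regions receive lower attention at the next step.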
Further, the decoding layer adopts a GRU recurrent neural network for decoding.
Further, the generating of the mathematical markup language file corresponding to the formula further includes the steps of:
and selecting the sequence with the highest output score as the final output sequence by using a beam search algorithm on the output of the decoding layer.
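The sequence-selection step can be sketched as a minimal beam search (the `step_fn` interface, which returns next-token probabilities for a given prefix, is a hypothetical stand-in for the decoder):

```python
import math

def beam_search(step_fn, start_token, end_token, beam_width=3, max_len=20):
    """Minimal beam search over a step function.
    step_fn(prefix) -> dict {token: probability} for the next token.
    Keeps the beam_width highest-scoring (log-probability) sequences and
    returns the best sequence that reached end_token (or the best prefix)."""
    beams = [([start_token], 0.0)]          # (sequence, log-prob score)
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            for tok, p in step_fn(seq).items():
                candidates.append((seq + [tok], score + math.log(p)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_width]:
            (finished if seq[-1] == end_token else beams).append((seq, score))
        if not beams:
            break
    best = max(finished or beams, key=lambda c: c[1])
    return best[0]
```

Unlike greedy decoding, the beam keeps several partial hypotheses alive, so a token that looks locally suboptimal can still lead to the globally highest-scoring sequence.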
In order to solve the technical problem, the three-layer architecture mathematical formula recognition system fusing double channels is further provided, and the specific technical scheme is as follows:
a three-layer architecture mathematical formula recognition system fusing two channels comprises: an encoding layer, an attention layer, and a decoding layer;
the coding layer is configured to: performing feature extraction on an input picture, wherein the features comprise: regional visual information;
the attention layer is used for: capturing the context of the regional visual information to generate a context vector;
the decoding layer is used for: and decoding the context vector to generate a mathematical markup language file corresponding to the formula.
Further, DenseNet is used as the encoder;
extracting visual information of an input picture by using a Dense network as a backbone network;
a space attention module and a channel attention module are fused in the encoder;
the spatial attention module respectively performs average pooling and maximum pooling on the input feature map to obtain two feature maps with the same dimensionality, splices the two feature maps with the same dimensionality according to a channel and then obtains a spatial attention matrix through a sigmoid function, and multiplies the spatial attention matrix with the input feature map to obtain a spatial attention feature map;
the channel attention module respectively carries out global average pooling and global maximum pooling on the input feature map to obtain two feature maps, inputs the two feature maps into a shared multilayer perceptron to obtain two vectors, adds the two vectors and multiplies the two vectors by the input feature map to obtain the channel attention feature map.
In order to solve the technical problem, the storage device is further provided, and the specific technical scheme is as follows:
a storage device having stored therein a set of instructions for performing the steps of any of the methods described above.
The invention has the beneficial effects that: 1. performing feature extraction on an input picture through a coding layer, wherein the features comprise: regional visual information; capturing the context of the regional visual information through an attention layer to generate a context vector; and decoding the context vector through a decoding layer to generate a mathematical markup language file corresponding to the formula. Through the above operations of the three layers of the coding layer, the attention layer and the decoding layer, a mathematical formula with higher precision can be obtained.
2. By fusing a spatial attention module and a channel attention module in the encoder, the encoder learns what to attend to along the channel axis and where to attend to along the spatial axes, which improves the expression of regions of interest, makes the features extracted by the encoder more representative, and effectively improves the recognition rate.
3. Before the feature extraction of the input picture through the coding layer, the method further comprises the following steps: adding mask information in the input picture data, and additionally adding a channel to the filled part, wherein the channel is used for recording the filled information. The information redundancy of the picture filling process can be reduced by adding the mask information.
4. A coverage vector is provided in the attention layer to represent the accumulated sum of the attention distributions at all past moments; it tells the model which parts of the encoder's input have already been attended to and which have not. To prevent the model from repeatedly attending to already-covered regions, the coverage vector is used as a component of the attention at the next step, so that the attention distribution generated next deliberately lowers the probability of already-attended regions, effectively reducing symbol omission and symbol repetition in formula recognition.
Drawings
FIG. 1 is a flow chart of a method for identifying a mathematical formula with a three-layer architecture incorporating two channels according to an embodiment;
FIG. 2 is a diagram illustrating a method for identifying a mathematical formula with a three-layer architecture incorporating two channels according to an embodiment;
FIG. 3 is a block diagram of a two-channel fused three-layer structure mathematical formula recognition system according to an embodiment;
fig. 4 is a block diagram of a storage device according to an embodiment.
Description of reference numerals:
300. a three-layer structure mathematical formula recognition system integrating two channels,
301. an encoding layer,
302. an attention layer,
303. a decoding layer,
400. a storage device.
Detailed Description
To explain technical contents, structural features, and objects and effects of the technical solutions in detail, the following detailed description is given with reference to the accompanying drawings in conjunction with the embodiments.
Referring to fig. 1 to 2, in the present embodiment, a method for identifying a mathematical formula of a three-layer architecture with two channels integrated can be applied to a storage device, including but not limited to: personal computers, servers, general purpose computers, special purpose computers, network devices, embedded devices, programmable devices, intelligent mobile terminals, etc. The specific implementation is as follows:
step S101: performing feature extraction on an input picture through a coding layer, wherein the features comprise: regional visual information.
Step S102: a context vector is generated by capturing the context of the regional visual information through the attention layer.
Step S103: and decoding the context vector through a decoding layer to generate a mathematical markup language file corresponding to the formula.
Performing feature extraction on an input picture through a coding layer, wherein the features comprise: regional visual information; capturing the context of the regional visual information through an attention layer to generate a context vector; and decoding the context vector through a decoding layer to generate a mathematical markup language file corresponding to the formula. Through the above operations of the three layers of the coding layer, the attention layer and the decoding layer, a mathematical formula with higher precision can be obtained.
Wherein step S101 in this embodiment further includes the steps of: using DenseNet as an encoder; extracting visual information of the input picture with the Dense network as the backbone network; and fusing a spatial attention module and a channel attention module in the encoder. By fusing these two modules, the encoder learns what to attend to along the channel axis and where to attend to along the spatial axes, which improves the expression of regions of interest, makes the extracted features more representative, and effectively improves the recognition rate.
In order to reduce the information redundancy in the picture filling process, in this embodiment, before the "performing feature extraction on an input picture by using a coding layer", the method further includes the steps of: adding mask information in the input picture data, and additionally adding a channel to the filled part, wherein the channel is used for recording the filled information.
The following specifically describes the operation of the spatial attention module and the channel attention module:
the spatial attention module respectively performs average pooling and maximum pooling on the input feature map to obtain two feature maps with the same dimensionality, splices the two feature maps with the same dimensionality according to a channel and then obtains a spatial attention matrix through a sigmoid function, and multiplies the spatial attention matrix with the input feature map to obtain a spatial attention feature map;
the channel attention module respectively carries out global average pooling and global maximum pooling on the input feature map to obtain two feature maps, inputs the two feature maps into a shared multilayer perceptron to obtain two vectors, adds the two vectors and multiplies the two vectors by the input feature map to obtain the channel attention feature map.
In this embodiment, the DenseNet network structure is mainly composed of DenseBlocks and Transition layers; a channel attention module and a spatial attention module are added, in that order, between each DenseBlock and the following Transition.
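The ordering of modules within one encoder stage can be sketched as a simple composition (all four callables are placeholders for the real network components):

```python
def build_encoder_stage(dense_block, channel_attn, spatial_attn, transition):
    """Compose one encoder stage in the order described above:
    DenseBlock -> channel attention -> spatial attention -> Transition.
    Each argument is any callable taking and returning a feature map."""
    def stage(x):
        x = dense_block(x)
        x = channel_attn(x)   # 'what' to attend to, along the channel axis
        x = spatial_attn(x)   # 'where' to attend to, along the spatial axes
        return transition(x)
    return stage
```

Placing the attention modules between DenseBlock and Transition reweights the features before the Transition layer downsamples them.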
In this embodiment, the attention layer is provided with a coverage vector, and the coverage vector is used for representing the accumulated sum of all attention mechanisms at past time;
the attention calculation formula is:
e_ti = f_att(a_i, h_(t-1), c_t)
α_ti = exp(e_ti) / Σ_k exp(e_tk)
c_t = Σ_(l=1)^(t-1) α_l
z_t = φ({a_i, α_ti}) = Σ_i α_ti · a_i
where f_att denotes a multilayer perceptron, a_i is a vector in the encoder output corresponding to a region of the picture, h_(t-1) is the hidden-layer output of the previous unit, α_ti denotes the attention weight of the i-th vector at the t-th time step, c_t denotes the coverage vector, α_l denotes the attention distribution at the l-th time step, and z_t denotes the output of the attention mechanism, obtained by applying the attention weights to the image regions.
The coverage vector is used to represent the cumulative sum of all attention mechanisms at past times, which tells the model which parts of the encoder's inputs are already attended to and none. In order to prevent the model from paying more attention to the concerned region, the coverage vector is used as a component of the attention of the next step, so that the attention distribution generated in the next step intentionally reduces the probability of the concerned region, and the phenomena of symbol missing and symbol repetition in formula recognition are effectively reduced.
In this embodiment, the decoding layer performs decoding with a GRU recurrent neural network, which alleviates the gradient explosion and gradient vanishing problems of recurrent networks. During training, a probability value determines whether the current ground-truth label or the output of the previous unit is used as the input of the current unit, and the decoding layer finally outputs the one-hot encoding corresponding to each symbol, as shown in fig. 2. By randomly choosing between the real label and the predicted output as the input of the decoding layer's recurrent network, the recognition capability of the system under different recognition scenarios is enhanced.
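The training-time choice between the real label and the previous prediction can be sketched as follows (the function name and the default probability are illustrative assumptions):

```python
import random

def next_decoder_input(true_label, prev_prediction,
                       teacher_forcing_prob=0.5, rng=random):
    """Choose the input of the current recurrent unit: with probability
    teacher_forcing_prob feed the ground-truth label, otherwise feed the
    previous unit's own prediction."""
    if rng.random() < teacher_forcing_prob:
        return true_label
    return prev_prediction
```

With probability 1.0 this reduces to pure teacher forcing, and with 0.0 to free-running decoding; intermediate values expose the network to both regimes during training.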
Preferably, the generating of the mathematical markup language file corresponding to the formula further includes the steps of: using a beam search algorithm on the output of the decoding layer, selecting the sequence with the highest score as the final output sequence.
Referring to fig. 3, in the present embodiment, an embodiment of a two-channel fused three-layer architecture mathematical formula recognition system 300 is as follows:
A two-channel fused three-tier architecture mathematical formula recognition system 300 comprises: an encoding layer 301, an attention layer 302, and a decoding layer 303. The encoding layer 301 is configured to: perform feature extraction on an input picture, wherein the features comprise: regional visual information. The attention layer 302 is configured to: capture the context of the regional visual information to generate a context vector. The decoding layer 303 is configured to: decode the context vector to generate a mathematical markup language file corresponding to the formula.
Feature extraction is performed on the input picture by the encoding layer 301, where the features comprise regional visual information; the context of the regional visual information is captured by the attention layer 302 to generate a context vector; and the context vector is decoded by the decoding layer 303 to generate the mathematical markup language file corresponding to the formula. Through the above operations of the three layers, the encoding layer 301, the attention layer 302 and the decoding layer 303, a mathematical formula with higher precision can be obtained.
In the encoding layer 301, DenseNet is used as the encoder, with the Dense network as the backbone network extracting visual information from the input picture, and a spatial attention module and a channel attention module are fused in the encoder. By fusing these two modules, the encoder learns what to attend to along the channel axis and where to attend to along the spatial axes, which improves the expression of regions of interest, makes the extracted features more representative, and effectively improves the recognition rate.
The following specifically describes the operation of the spatial attention module and the channel attention module:
the spatial attention module respectively performs average pooling and maximum pooling on the input feature map to obtain two feature maps with the same dimensionality, splices the two feature maps with the same dimensionality according to a channel and then obtains a spatial attention matrix through a sigmoid function, and multiplies the spatial attention matrix with the input feature map to obtain a spatial attention feature map;
the channel attention module respectively carries out global average pooling and global maximum pooling on the input feature map to obtain two feature maps, inputs the two feature maps into a shared multilayer perceptron to obtain two vectors, adds the two vectors and multiplies the two vectors by the input feature map to obtain the channel attention feature map.
In this embodiment, the DenseNet network structure is mainly composed of DenseBlocks and Transition layers; a channel attention module and a spatial attention module are added, in that order, between each DenseBlock and the following Transition.
In this embodiment, the attention layer 302 is provided with a coverage vector, which is used to represent the accumulated sum of the attention distributions at all past moments;
the attention calculation formula is:
e_ti = f_att(a_i, h_(t-1), c_t)
α_ti = exp(e_ti) / Σ_k exp(e_tk)
c_t = Σ_(l=1)^(t-1) α_l
z_t = φ({a_i, α_ti}) = Σ_i α_ti · a_i
where f_att denotes a multilayer perceptron, a_i is a vector in the encoder output corresponding to a region of the picture, h_(t-1) is the hidden-layer output of the previous unit, α_ti denotes the attention weight of the i-th vector at the t-th time step, c_t denotes the coverage vector, α_l denotes the attention distribution at the l-th time step, and z_t denotes the output of the attention mechanism, obtained by applying the attention weights to the image regions.
The coverage vector is used to represent the cumulative sum of all attention mechanisms at past times, which tells the model which parts of the encoder's inputs are already attended to and none. In order to prevent the model from paying more attention to the concerned region, the coverage vector is used as a component of the attention of the next step, so that the attention distribution generated in the next step intentionally reduces the probability of the concerned region, and the phenomena of symbol missing and symbol repetition in formula recognition are effectively reduced.
In this embodiment, the decoding layer 303 performs decoding with a GRU recurrent neural network, which alleviates the gradient explosion and gradient vanishing problems of recurrent networks. During training, a probability value determines whether the current ground-truth label or the output of the previous unit is used as the input of the current unit, and the decoding layer 303 finally outputs the one-hot encoding corresponding to each symbol, as shown in fig. 2. By randomly choosing between the real label and the predicted output as the input of the recurrent network of the decoding layer 303, the recognition capability of the system under different recognition scenarios is enhanced. Using a beam search algorithm on the output of the decoding layer 303, the sequence with the highest output score is selected as the final output sequence.
Referring to fig. 4, in the present embodiment, a memory device 400 is implemented as follows: the memory device 400 may be used to perform any of the steps of the above two-channel fused three-layer architecture mathematical formula identification method, and a repeated description thereof is omitted.
It should be noted that, although the above embodiments have been described herein, the invention is not limited thereto. Therefore, based on the innovative concepts of the present invention, the technical solutions of the present invention can be directly or indirectly applied to other related technical fields by making changes and modifications to the embodiments described herein, or by using equivalent structures or equivalent processes performed in the content of the present specification and the attached drawings, which are included in the scope of the present invention.