CN112183544A - Double-channel fused three-layer architecture mathematical formula identification method, system and storage device - Google Patents


Info

Publication number
CN112183544A
CN112183544A (application CN202011046709.7A)
Authority
CN
China
Prior art keywords
layer
attention
channel
input
mathematical formula
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011046709.7A
Other languages
Chinese (zh)
Other versions
CN112183544B (en)
Inventor
胡健
苏松志
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xiamen University
Original Assignee
Xiamen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xiamen University
Priority to CN202011046709.7A
Publication of CN112183544A
Application granted
Publication of CN112183544B
Status: Active
Anticipated expiration

Abstract

(Translated from Chinese)

The present invention relates to the technical field of mathematical formula recognition, and in particular to a dual-channel fused three-layer architecture method, system, and storage device for mathematical formula recognition. The method comprises the steps of: extracting features from an input picture through an encoding layer, the features including regional visual information; capturing the context of the regional visual information through an attention layer to generate a context vector; and decoding the context vector through a decoding layer to generate a mathematical markup language file corresponding to the formula. Through these operations of the encoding layer, attention layer, and decoding layer, a mathematical formula of higher accuracy can be obtained.


Description

Double-channel fused three-layer architecture mathematical formula identification method, system and storage device
Technical Field
The invention relates to the technical field of mathematical formula recognition, and in particular to a dual-channel fused three-layer architecture method, system, and storage device for mathematical formula identification.
Background
Mathematical formulas are widely used in many scientific fields and play an essential role in explaining theoretical knowledge, describing scientific problems, and the like. A mathematical formula can be entered with a dedicated tool, but this typically means typing the mathematical markup language LaTeX to generate the corresponding formula, which requires the user to know its grammar; handwriting input solves this problem well and lets the user enter mathematical formulas far more conveniently.
Compared with printed mathematical formulas, handwritten mathematical formulas are far harder to recognize, because handwritten symbols are ambiguous, handwriting styles vary widely, and symbols frequently touch or overlap.
Generally, the mathematical formula identification process can be divided into three stages: a symbol segmentation stage, a symbol identification stage and a structure analysis stage. For different ways of implementing the three phases, the current mainstream identification schemes can be divided into two types: multi-stage identification and single-stage identification.
The multi-stage solution first segments the symbols in the mathematical formula, then recognizes the segmented symbols, and finally performs structural analysis from the recognition results and symbol positions. Although this modularizes the recognition process, it suffers from a serious problem: error propagation. Errors from an earlier stage are passed to the next stage and accumulate, degrading the accuracy of the whole recognition pipeline.
The single-stage solution uses a deep neural network to realize an end-to-end recognition network that completes all three stages of formula recognition at once. Such a network usually adopts an encoder-decoder structure: an encoder first extracts and encodes features from the input formula picture; a decoder with an attention mechanism then scans the encoded features, uses the most relevant region to describe each segmented symbol, and outputs the mathematical markup language LaTeX corresponding to the formula.
In the encoder, symbol sizes differ across mathematical formulas, so to exploit the visual features in a picture more effectively, researchers address the scale inconsistency by extracting several feature outputs of different granularities simultaneously. However, although extracting multiple granularities in the encoding layer solves the symbol-size problem, the extracted features lack representational power: the contextual information of the symbols is under-used and a large number of irrelevant features are introduced.
In the decoder's attention mechanism, the encoder's output feature map is weighted and summed according to the attention map at the current moment to obtain a vector representing the region most relevant to the character currently being recognized; this vector is then decoded into the serialized output. Using only the attention at the current moment, symbol repetition and symbol omission may occur in the recognition result.
Because a recurrent neural network predicts poorly in the early stage of model training, the prior art feeds the ground-truth label as the input of each recurrent unit, preventing a large deviation of one neuron from harming training. Although substituting the ground-truth label for the predicted input gives the model a better recognition effect during the training phase, recognition accuracy drops at test time, when no ground-truth label is available for guidance.
Disclosure of Invention
Therefore, a dual-channel fused three-layer architecture mathematical formula identification method is needed to solve the low accuracy of the existing single-stage formula recognition technology. The specific technical scheme is as follows:
a method for identifying a three-layer architecture mathematical formula fused with two channels comprises the following steps:
performing feature extraction on an input picture through a coding layer, wherein the features comprise: regional visual information;
capturing the context of the regional visual information through an attention layer to generate a context vector;
and decoding the context vector through a decoding layer to generate a mathematical markup language file corresponding to the formula.
Further, the "performing feature extraction on an input picture by using an encoding layer" further includes the steps of:
using DenseNet as an encoder;
extracting visual information of an input picture by using a Dense network as a backbone network;
a spatial attention module and a channel attention module are fused in the encoder.
Further, before the "extracting features of the input picture by the coding layer", the method further includes the steps of:
adding mask information in the input picture data, and additionally adding a channel to the filled part, wherein the channel is used for recording the filled information.
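The padding-plus-mask-channel idea above can be sketched as follows (an illustrative NumPy sketch only; the patent fixes no API, so the function name, shapes, and 0/1 mask convention here are assumptions):

```python
import numpy as np

def pad_with_mask(img, target_h, target_w):
    """Pad a grayscale formula image to a fixed size and append a mask
    channel recording which pixels are real content and which are padding.
    Illustrative sketch; names and conventions are assumptions."""
    h, w = img.shape
    padded = np.zeros((target_h, target_w), dtype=img.dtype)
    padded[:h, :w] = img
    mask = np.zeros((target_h, target_w), dtype=img.dtype)
    mask[:h, :w] = 1  # 1 marks original pixels, 0 marks filled padding
    # Stack into (2, H, W): image channel plus mask channel
    return np.stack([padded, mask], axis=0)

x = pad_with_mask(np.ones((2, 3)), 4, 5)
print(x.shape)          # (2, 4, 5)
print(int(x[1].sum()))  # 6 real pixels recorded in the mask channel
```

Downstream layers can then read the extra channel to discount the filled region, which is how the redundancy of padding is reduced.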
Further, the "performing feature extraction on an input picture by using an encoding layer" further includes the steps of:
the spatial attention module applies average pooling and max pooling to the input feature map to obtain two feature maps of the same dimensions, concatenates them along the channel axis, passes the result through a sigmoid function to obtain a spatial attention matrix, and multiplies this matrix with the input feature map to obtain the spatial attention feature map;
the channel attention module applies global average pooling and global max pooling to the input feature map to obtain two feature maps, feeds both into a shared multilayer perceptron to obtain two vectors, adds the two vectors, and multiplies the result with the input feature map to obtain the channel attention feature map.
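The two modules described above can be sketched in NumPy as follows (a hedged illustration, not the patent's implementation: the placeholder combination weights `w` reducing the two pooled maps to one, the ReLU inside the shared MLP, and the sigmoid on the channel branch are assumptions following the CBAM-style design this text paraphrases):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def spatial_attention(x, w=None):
    """x: (C, H, W). Average- and max-pool along the channel axis,
    concatenate, reduce to one map, sigmoid -> spatial attention matrix.
    The weights `w` stand in for the learned combination the text implies."""
    avg = x.mean(axis=0, keepdims=True)             # (1, H, W)
    mx = x.max(axis=0, keepdims=True)               # (1, H, W)
    cat = np.concatenate([avg, mx], axis=0)         # (2, H, W)
    if w is None:
        w = np.array([0.5, 0.5])                    # placeholder weights
    attn = sigmoid(np.tensordot(w, cat, axes=1))    # (H, W)
    return x * attn                                 # spatial attention feature map

def channel_attention(x, w1, w2):
    """x: (C, H, W). Global average/max pooling -> shared two-layer MLP
    (w1, w2) -> add -> sigmoid -> per-channel weights."""
    avg = x.mean(axis=(1, 2))                       # (C,)
    mx = x.max(axis=(1, 2))                         # (C,)
    mlp = lambda v: w2 @ np.maximum(w1 @ v, 0.0)    # shared MLP with ReLU
    attn = sigmoid(mlp(avg) + mlp(mx))              # (C,)
    return x * attn[:, None, None]                  # channel attention feature map

C, H, W = 4, 3, 3
x = np.random.rand(C, H, W)
w1 = np.random.rand(2, C)
w2 = np.random.rand(C, 2)
print(spatial_attention(x).shape, channel_attention(x, w1, w2).shape)
```

Both modules leave the feature-map shape unchanged, so they can be slotted between existing encoder blocks without altering the rest of the network.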
Further, the attention layer is provided with a coverage vector, which represents the accumulated sum of the attention weights of all past moments;
the attention is computed as:
e_ti = f_att(a_i, h_(t-1))
α_ti = exp(e_ti) / Σ_k exp(e_tk)
c = Σ_(l=1..t-1) α_l
z_t = φ({a_i, α_ti})
where f_att denotes a multilayer perceptron, a_i is a vector in the encoder output corresponding to a region in the picture, h_(t-1) is the hidden-layer output of the previous unit, α_ti is the attention weight of the i-th vector at time step t, c is the coverage vector, α_l is the attention weight at the l-th time step, z_t is the output of the attention mechanism, and φ denotes applying the attention weights to the image regions.
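A toy illustration of one coverage-guided attention step (an assumption-laden sketch: subtracting the coverage vector from the raw scores stands in for the learned coverage term inside f_att, and `score_fn` stands in for the multilayer perceptron; none of this is the patent's exact formulation):

```python
import numpy as np

def coverage_attention_step(a, h_prev, coverage, score_fn):
    """One attention step with a coverage vector.
    a: (L, D) region vectors; coverage: (L,) cumulative past attention.
    Regions already attended to get their scores penalised, which is the
    effect the coverage mechanism is designed to produce."""
    e = np.array([score_fn(a_i, h_prev) for a_i in a]) - coverage
    alpha = np.exp(e - e.max())
    alpha = alpha / alpha.sum()                  # softmax attention weights
    z = (alpha[:, None] * a).sum(axis=0)         # context vector z_t
    return z, alpha, coverage + alpha            # coverage accumulates alpha

L, D = 5, 4
a = np.random.rand(L, D)
h = np.random.rand(D)
score = lambda ai, hp: float(ai @ hp)            # toy stand-in for f_att
cov = np.zeros(L)
z, alpha, cov = coverage_attention_step(a, h, cov, score)
print(z.shape, round(float(alpha.sum()), 6))
```

After each step the coverage vector grows where attention was placed, so the next step's distribution is pushed away from already-covered regions.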
Further, the decoding layer adopts a GRU recurrent neural network for decoding.
Further, the generating of the mathematical markup language file corresponding to the formula further includes the steps of:
using a beam search algorithm on the output of the coding layer and selecting the sequence with the highest score as the final output sequence.
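Beam search itself is independent of the network; a self-contained sketch over a toy next-symbol distribution (the vocabulary, probabilities, and token names below are invented purely for illustration) looks like:

```python
import math
import heapq

def beam_search(step_fn, start_token, end_token, beam_width=3, max_len=20):
    """Keep the beam_width highest-scoring partial sequences and return
    the best finished one. step_fn(seq) returns {token: prob} for the
    next symbol."""
    beams = [(0.0, [start_token])]               # (log-prob, sequence)
    finished = []
    for _ in range(max_len):
        candidates = []
        for logp, seq in beams:
            if seq[-1] == end_token:
                finished.append((logp, seq))     # sequence is complete
                continue
            for tok, p in step_fn(seq).items():
                candidates.append((logp + math.log(p), seq + [tok]))
        if not candidates:
            break
        beams = heapq.nlargest(beam_width, candidates, key=lambda c: c[0])
    finished.extend(b for b in beams if b[1][-1] == end_token)
    if not finished:
        finished = beams
    return max(finished, key=lambda c: c[0])[1]

# Toy next-symbol model that prefers the LaTeX-like sequence x ^ 2 <end>.
vocab_steps = [{'x': 0.7, 'y': 0.3}, {'^': 0.9, '<end>': 0.1},
               {'2': 0.8, '3': 0.2}, {'<end>': 1.0}]
step = lambda seq: vocab_steps[min(len(seq) - 1, 3)]
result = beam_search(step, '<s>', '<end>')
print(result)  # ['<s>', 'x', '^', '2', '<end>']
```

Greedy decoding would commit to the single best symbol at every step; keeping several hypotheses lets a globally higher-scoring sequence survive a locally weaker step.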
In order to solve the technical problem, the three-layer architecture mathematical formula recognition system fusing double channels is further provided, and the specific technical scheme is as follows:
a three-layer architecture mathematical formula recognition system fusing two channels comprises: an encoding layer, an attention layer, and a decoding layer;
the coding layer is configured to: performing feature extraction on an input picture, wherein the features comprise: regional visual information;
the attention layer is used for: capturing the context of the regional visual information to generate a context vector;
the decoding layer is used for: and decoding the context vector to generate a mathematical markup language file corresponding to the formula.
Further, DenseNet was used as the encoder;
extracting visual information of an input picture by using a Dense network as a backbone network;
a space attention module and a channel attention module are fused in the encoder;
the spatial attention module applies average pooling and max pooling to the input feature map to obtain two feature maps of the same dimensions, concatenates them along the channel axis, passes the result through a sigmoid function to obtain a spatial attention matrix, and multiplies this matrix with the input feature map to obtain the spatial attention feature map;
the channel attention module applies global average pooling and global max pooling to the input feature map to obtain two feature maps, feeds both into a shared multilayer perceptron to obtain two vectors, adds the two vectors, and multiplies the result with the input feature map to obtain the channel attention feature map.
In order to solve the technical problem, the storage device is further provided, and the specific technical scheme is as follows:
a storage device having stored therein a set of instructions for performing: the steps of any of the above claims.
The invention has the beneficial effects that: 1. performing feature extraction on an input picture through a coding layer, wherein the features comprise: regional visual information; capturing the context of the regional visual information through an attention layer to generate a context vector; and decoding the context vector through a decoding layer to generate a mathematical markup language file corresponding to the formula. Through the above operations of the three layers of the coding layer, the attention layer and the decoding layer, a mathematical formula with higher precision can be obtained.
2. A spatial attention module and a channel attention module are fused in the encoder. This lets the encoder learn what to attend to and where to attend along the channel and spatial axes, improving the expression of regions of interest, giving the extracted features more representational power, and effectively raising the recognition rate.
3. Before the feature extraction of the input picture through the coding layer, the method further comprises the following steps: adding mask information in the input picture data, and additionally adding a channel to the filled part, wherein the channel is used for recording the filled information. The information redundancy of the picture filling process can be reduced by adding the mask information.
4. A coverage vector is provided in the attention layer to represent the cumulative sum of all past attention weights; it tells the model which parts of the encoder's input have already been attended to and which have not. To prevent the model from repeatedly attending to the same region, the coverage vector is used as a component of the next attention step, so that the next attention distribution deliberately lowers the probability of already-attended regions, effectively reducing symbol omission and symbol repetition in formula recognition.
Drawings
FIG. 1 is a flow chart of a method for identifying a mathematical formula with a three-layer architecture incorporating two channels according to an embodiment;
FIG. 2 is a diagram illustrating a method for identifying a mathematical formula with a three-layer architecture incorporating two channels according to an embodiment;
FIG. 3 is a block diagram of a two-channel fused three-layer structure mathematical formula recognition system according to an embodiment;
fig. 4 is a block diagram of a storage device according to an embodiment.
Description of reference numerals:
300. dual-channel fused three-layer architecture mathematical formula recognition system,
301. encoding layer,
302. attention layer,
303. decoding layer,
400. storage device.
Detailed Description
To explain technical contents, structural features, and objects and effects of the technical solutions in detail, the following detailed description is given with reference to the accompanying drawings in conjunction with the embodiments.
Referring to fig. 1 to 2, in the present embodiment, a method for identifying a mathematical formula of a three-layer architecture with two channels integrated can be applied to a storage device, including but not limited to: personal computers, servers, general purpose computers, special purpose computers, network devices, embedded devices, programmable devices, intelligent mobile terminals, etc. The specific implementation is as follows:
step S101: performing feature extraction on an input picture through a coding layer, wherein the features comprise: regional visual information.
Step S102: a context vector is generated by capturing the context of the regional visual information through the attention layer.
Step S103: and decoding the context vector through a decoding layer to generate a mathematical markup language file corresponding to the formula.
Performing feature extraction on an input picture through a coding layer, wherein the features comprise: regional visual information; capturing the context of the regional visual information through an attention layer to generate a context vector; and decoding the context vector through a decoding layer to generate a mathematical markup language file corresponding to the formula. Through the above operations of the three layers of the coding layer, the attention layer and the decoding layer, a mathematical formula with higher precision can be obtained.
Step S101 in this embodiment further includes the steps of: using DenseNet as the encoder; extracting visual information from the input picture with the Dense network as the backbone network; and fusing a spatial attention module and a channel attention module in the encoder. Fusing these two modules lets the encoder learn what to attend to and where to attend along the channel and spatial axes, improving the expression of regions of interest, giving the extracted features more representational power, and effectively raising the recognition rate.
In order to reduce the information redundancy in the picture filling process, in this embodiment, before the "performing feature extraction on an input picture by using a coding layer", the method further includes the steps of: adding mask information in the input picture data, and additionally adding a channel to the filled part, wherein the channel is used for recording the filled information.
The following specifically describes the operation of the spatial attention module and the channel attention module:
the spatial attention module applies average pooling and max pooling to the input feature map to obtain two feature maps of the same dimensions, concatenates them along the channel axis, passes the result through a sigmoid function to obtain a spatial attention matrix, and multiplies this matrix with the input feature map to obtain the spatial attention feature map;
the channel attention module applies global average pooling and global max pooling to the input feature map to obtain two feature maps, feeds both into a shared multilayer perceptron to obtain two vectors, adds the two vectors, and multiplies the result with the input feature map to obtain the channel attention feature map.
In this embodiment, the DenseNet network structure mainly consists of DenseBlock and Transition modules, and a channel attention module and a spatial attention module are added in sequence between each DenseBlock and Transition.
In this embodiment, the attention layer is provided with a coverage vector, which represents the accumulated sum of the attention weights of all past moments;
the attention is computed as:
e_ti = f_att(a_i, h_(t-1))
α_ti = exp(e_ti) / Σ_k exp(e_tk)
c = Σ_(l=1..t-1) α_l
z_t = φ({a_i, α_ti})
where f_att denotes a multilayer perceptron, a_i is a vector in the encoder output corresponding to a region in the picture, h_(t-1) is the hidden-layer output of the previous unit, α_ti is the attention weight of the i-th vector at time step t, c is the coverage vector, α_l is the attention weight at the l-th time step, z_t is the output of the attention mechanism, and φ denotes applying the attention weights to the image regions.
The coverage vector is used to represent the cumulative sum of all attention mechanisms at past times, which tells the model which parts of the encoder's inputs are already attended to and none. In order to prevent the model from paying more attention to the concerned region, the coverage vector is used as a component of the attention of the next step, so that the attention distribution generated in the next step intentionally reduces the probability of the concerned region, and the phenomena of symbol missing and symbol repetition in formula recognition are effectively reduced.
In this embodiment, the decoding layer performs decoding using a GRU recurrent neural network, which alleviates the gradient explosion and vanishing-gradient problems of recurrent neural networks. During training, a probability value decides whether the ground-truth label or the previous unit's output is used as the input of the current unit, and the decoding layer finally outputs the one-hot encoding corresponding to each symbol, as shown in fig. 2. Randomly choosing between the ground-truth label and the predicted output as the input of the recurrent network strengthens the recognition system's ability in different recognition scenarios.
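The random choice between the ground-truth label and the previous prediction can be sketched as follows (illustrative only; the patent does not fix the probability value or its schedule, and the token strings are invented):

```python
import random

def choose_decoder_input(true_label, prev_prediction, teacher_prob, rng=random):
    """Scheduled-sampling style input selection described in the text:
    with probability teacher_prob feed the ground-truth label, otherwise
    feed the previous unit's own prediction."""
    return true_label if rng.random() < teacher_prob else prev_prediction

rng = random.Random(0)
inputs = [choose_decoder_input('\\frac', '\\frae', 0.5, rng) for _ in range(1000)]
ratio = inputs.count('\\frac') / len(inputs)
print(0.4 < ratio < 0.6)  # roughly half ground truth, half prediction
```

Exposing the decoder to its own (possibly wrong) predictions during training is what narrows the gap between the training and testing regimes described above.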
Preferably, generating the mathematical markup language file corresponding to the formula further includes the step of: using a beam search algorithm on the output of the coding layer and selecting the sequence with the highest score as the final output sequence.
Referring to fig. 3, in the present embodiment, an embodiment of a dual-channel fused three-layer architecture mathematical formula identification system 300 is as follows:
A dual-channel fused three-layer architecture mathematical formula identification system 300 comprises: an encoding layer 301, an attention layer 302, and a decoding layer 303. The encoding layer 301 is configured to perform feature extraction on an input picture, the features including regional visual information; the attention layer 302 is configured to capture the context of the regional visual information and generate a context vector; the decoding layer 303 is configured to decode the context vector and generate the mathematical markup language file corresponding to the formula.
Feature extraction is performed on the input picture through the encoding layer 301, the features including regional visual information; the context of the regional visual information is captured through the attention layer 302 to generate a context vector; and the context vector is decoded through the decoding layer 303 to generate the mathematical markup language file corresponding to the formula. Through these operations of the encoding layer 301, the attention layer 302, and the decoding layer 303, a mathematical formula of higher accuracy can be obtained.
In the encoding layer 301, DenseNet is used as the encoder; visual information is extracted from the input picture using the Dense network as the backbone network; and a spatial attention module and a channel attention module are fused in the encoder. Fusing these two modules lets the encoder learn what to attend to and where to attend along the channel and spatial axes, improving the expression of regions of interest, giving the extracted features more representational power, and effectively raising the recognition rate.
The following specifically describes the operation of the spatial attention module and the channel attention module:
the spatial attention module applies average pooling and max pooling to the input feature map to obtain two feature maps of the same dimensions, concatenates them along the channel axis, passes the result through a sigmoid function to obtain a spatial attention matrix, and multiplies this matrix with the input feature map to obtain the spatial attention feature map;
the channel attention module applies global average pooling and global max pooling to the input feature map to obtain two feature maps, feeds both into a shared multilayer perceptron to obtain two vectors, adds the two vectors, and multiplies the result with the input feature map to obtain the channel attention feature map.
In this embodiment, the DenseNet network structure mainly consists of DenseBlock and Transition modules, and a channel attention module and a spatial attention module are added in sequence between each DenseBlock and Transition.
In this embodiment, the attention layer 302 is provided with a coverage vector, which represents the accumulated sum of the attention weights of all past moments;
the attention is computed as:
e_ti = f_att(a_i, h_(t-1))
α_ti = exp(e_ti) / Σ_k exp(e_tk)
c = Σ_(l=1..t-1) α_l
z_t = φ({a_i, α_ti})
where f_att denotes a multilayer perceptron, a_i is a vector in the encoder output corresponding to a region in the picture, h_(t-1) is the hidden-layer output of the previous unit, α_ti is the attention weight of the i-th vector at time step t, c is the coverage vector, α_l is the attention weight at the l-th time step, z_t is the output of the attention mechanism, and φ denotes applying the attention weights to the image regions.
The coverage vector is used to represent the cumulative sum of all attention mechanisms at past times, which tells the model which parts of the encoder's inputs are already attended to and none. In order to prevent the model from paying more attention to the concerned region, the coverage vector is used as a component of the attention of the next step, so that the attention distribution generated in the next step intentionally reduces the probability of the concerned region, and the phenomena of symbol missing and symbol repetition in formula recognition are effectively reduced.
In this embodiment, the decoding layer 303 performs decoding using a GRU recurrent neural network, which alleviates the gradient explosion and vanishing-gradient problems of recurrent neural networks. During training, a probability value decides whether the ground-truth label or the previous unit's output is used as the input of the current unit, and finally the one-hot encoding corresponding to each symbol is output by the encoding layer 301, as shown in fig. 2. Randomly choosing between the ground-truth label and the predicted output as the input of the recurrent network of the coding layer 301 strengthens the recognition system's ability in different recognition scenarios. A beam search algorithm is used on the output of the coding layer 301, and the sequence with the highest score is selected as the final output sequence.
Referring to fig. 4, in the present embodiment, a storage device 400 is implemented as follows: the storage device 400 may be used to perform any of the steps of the above dual-channel fused three-layer architecture mathematical formula identification method, and a repeated description thereof is omitted.
It should be noted that, although the above embodiments have been described herein, the invention is not limited thereto. Therefore, based on the innovative concepts of the present invention, the technical solutions of the present invention can be directly or indirectly applied to other related technical fields by making changes and modifications to the embodiments described herein, or by using equivalent structures or equivalent processes performed in the content of the present specification and the attached drawings, which are included in the scope of the present invention.

Claims (10)

Translated fromChinese
1.一种融合双通道的三层架构数学公式识别方法,其特征在于,包括步骤:1. a three-layer structure mathematical formula identification method of fusion dual-channel, is characterized in that, comprises the steps:通过编码层对输入图片进行特征提取,所述特征包括:区域视觉信息;Feature extraction is performed on the input picture through the coding layer, and the features include: regional visual information;通过注意力层捕获区域视觉信息的上下文,生成context向量;The context of the visual information of the region is captured by the attention layer, and the context vector is generated;通过解码层对所述context向量进行解码,生成公式对应的数学标记语言文件。The context vector is decoded by the decoding layer, and a mathematical markup language file corresponding to the formula is generated.2.根据权利要求1所述的一种融合双通道的三层架构数学公式识别方法,其特征在于,所述“通过编码层对输入图片进行特征提取”,还包括步骤:2. the three-layer structure mathematical formula identification method of a kind of fusion dual-channel according to claim 1, is characterized in that, described " carry out feature extraction to input picture by coding layer ", also comprise the step:使用DenseNet作为编码器;Use DenseNet as the encoder;通过使用Dense网络作为主干网络对输入图片进行视觉信息提取;Extract visual information from the input image by using the Dense network as the backbone network;在所述编码器中融合有空间注意力模块和通道注意力模块。A spatial attention module and a channel attention module are fused in the encoder.3.根据权利要求1所述的一种融合双通道的三层架构数学公式识别方法,其特征在于,所述“通过编码层对输入图片进行特征提取”前,还包括步骤:3. the three-layer structure mathematical formula identification method of a kind of fusion dual-channel according to claim 1, is characterized in that, before described " feature extraction is carried out to input picture by coding layer ", also comprises steps:在所述输入图片数据中添加mask掩码信息,对填充的部分额外增加一个通道,所述通道用来记录填充的信息。The mask information is added to the input picture data, and an additional channel is added to the filled part, and the channel is used to record the filled information.4.根据权利要求2所述的一种融合双通道的三层架构数学公式识别方法,其特征在于,所述“通过编码层对输入图片进行特征提取”,还包括步骤:4. 
the three-layer structure mathematical formula identification method of a kind of fusion dual-channel according to claim 2, is characterized in that, described " is carried out feature extraction to input picture by coding layer ", also comprises the step:所述空间注意力模块分别对输入特征图进行平均池化和最大池化得到两个维度相同的特征图,对所述两个维度相同的特征图按照通道进行拼接后通过sigmoid函数得到一个空间注意力矩阵,所述空间注意力矩阵与所述输入特征图相乘得空间注意力特征图;The spatial attention module performs average pooling and maximum pooling on the input feature maps respectively to obtain two feature maps with the same dimensions, and after splicing the two feature maps with the same dimensions according to channels, a spatial attention is obtained through the sigmoid function. force matrix, the spatial attention feature map is obtained by multiplying the spatial attention matrix and the input feature map;所述通道注意力模块分别对输入特征图进行全局平均池化和全局最大池化得到两个特征图,输入所述两个特征图至共享的多层感知机中得两个向量,并将所述两个向量相加再和所述输入特征图相乘得到通道注意力特征图。The channel attention module performs global average pooling and global maximum pooling on the input feature map to obtain two feature maps, input the two feature maps to the shared multi-layer perceptron to obtain two vectors, and combine the two feature maps. The two vectors are added and then multiplied by the input feature map to obtain the channel attention feature map.5.根据权利要求1所述的一种融合双通道的三层架构数学公式识别方法,其特征在于,所述注意力层设置有coverage向量,所述coverage向量用于表示过往时刻的所有注意力机制的累加和;5. the three-layer structure mathematical formula identification method of a kind of fusion dual-channel according to claim 1, is characterized in that, described attention layer is provided with coverage vector, and described coverage vector is used to represent all attention of past moments cumulative sum of mechanisms;所述注意力计算公式为:The attention calculation formula is:expti=fatt(ai,ht-1)expti =fatt (ai , ht-1 )
α_ti = exp(e_ti) / Σ_{k=1}^{L} exp(e_tk)
c = Σ_{l=1}^{t-1} α_l
z_t = φ({a_i, α_ti})
where f_att denotes a multilayer perceptron, a_i is a vector in the encoder output corresponding to one region of the image, h_{t-1} is the hidden-layer output of the previous unit, α_ti is the attention weight of the i-th vector at time step t, c denotes the coverage vector, α_l denotes the attention weights of the l-th time step, z_t denotes the output of the attention mechanism, and φ denotes applying the attention weights to the image regions.
6. The dual-channel fused three-layer architecture mathematical formula recognition method according to claim 1, characterized in that the decoding layer uses a GRU recurrent neural network for decoding.
7. The dual-channel fused three-layer architecture mathematical formula recognition method according to claim 6, characterized in that "generating the mathematical markup language file corresponding to the formula" further comprises the step of:
applying a beam search algorithm to the output of the decoding layer and selecting the sequence with the highest output score as the final output sequence.
8. A dual-channel fused three-layer architecture mathematical formula recognition system, characterized by comprising an encoding layer, an attention layer and a decoding layer;
the encoding layer being configured to perform feature extraction on an input picture, the extracted features comprising regional visual information;
the attention layer being configured to capture the context of the regional visual information and generate a context vector; and
the decoding layer being configured to decode the context vector and generate a mathematical markup language file corresponding to the formula.
9. The dual-channel fused three-layer architecture mathematical formula recognition system according to claim 8, characterized in that:
DenseNet is used as the encoder;
visual information is extracted from the input picture by using the Dense network as the backbone network;
a spatial attention module and a channel attention module are fused into the encoder;
the spatial attention module performs average pooling and max pooling on the input feature map to obtain two feature maps of identical dimensions, concatenates the two feature maps along the channel axis, passes the result through a sigmoid function to obtain a spatial attention matrix, and multiplies the spatial attention matrix with the input feature map to obtain the spatial attention feature map; and
the channel attention module performs global average pooling and global max pooling on the input feature map to obtain two feature maps, feeds the two feature maps into a shared multilayer perceptron to obtain two vectors, and adds the two vectors before multiplying with the input feature map to obtain the channel attention feature map.
10. A storage device storing an instruction set, characterized in that the instruction set is configured to perform the steps of any one of claims 1 to 7.
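The dual attention modules of claims 4 and 9 can be sketched in a few lines of NumPy. This is an illustrative sketch, not the patented implementation: the claims do not say how the two concatenated pooling maps are reduced to a single spatial map, so a learned two-weight mix (standing in for the 7x7 convolution used in CBAM-style modules) is assumed; the ReLU inside the shared MLP, the weight names w1/w2, and the sigmoid on the summed MLP outputs (the claim only says the two vectors are added and multiplied with the input) are likewise assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(x, w1, w2):
    """Channel attention per claim 4: global average pooling and global max
    pooling over the spatial dims, a shared two-layer MLP (w1, w2), the two
    resulting vectors summed and applied to the input. x has shape (C, H, W)."""
    avg = x.mean(axis=(1, 2))                        # (C,) global average pool
    mx = x.max(axis=(1, 2))                          # (C,) global max pool
    mlp = lambda v: w2 @ np.maximum(w1 @ v, 0.0)     # shared MLP, ReLU assumed
    weights = sigmoid(mlp(avg) + mlp(mx))            # per-channel gate in (0, 1)
    return x * weights[:, None, None]

def spatial_attention(x, mix_w):
    """Spatial attention per claim 4: average- and max-pool along the channel
    axis, fuse the two maps with a learned 2-weight mix (equivalent to
    concatenating them and applying a 1x1 convolution; CBAM uses 7x7),
    then sigmoid and multiply with the input. x has shape (C, H, W)."""
    avg = x.mean(axis=0)                             # (H, W) channel-wise mean
    mx = x.max(axis=0)                               # (H, W) channel-wise max
    att = sigmoid(mix_w[0] * avg + mix_w[1] * mx)    # spatial attention matrix
    return x * att[None, :, :]
```

Because both gates pass through a sigmoid, each output element is the corresponding input element scaled by a factor in (0, 1).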
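Claim 5's coverage-aware attention step can likewise be sketched. This is a hedged reconstruction: the internal form of the multilayer perceptron f_att and the weight names Wa, Wh, Wc, v are assumptions, chosen only so that the coverage vector c (the running sum of past attention distributions) enters the score e_ti, and the attention weights α_ti come out of a softmax over the L image regions, as the claim describes.

```python
import numpy as np

def coverage_attention_step(A, h_prev, alphas_past, Wa, Wh, Wc, v):
    """One attention step with a coverage vector (claim 5).
    A: (L, D) annotation vectors a_i from the encoder; h_prev: hidden state
    of the previous GRU unit; alphas_past: list of earlier attention
    distributions, whose sum is the coverage vector c. Wa, Wh, Wc, v are
    the weights of the MLP f_att (illustrative names, not from the patent)."""
    L = A.shape[0]
    # c = sum over past time steps of the attention weights alpha_l
    c = np.sum(alphas_past, axis=0) if alphas_past else np.zeros(L)
    # e_ti = f_att(a_i, h_{t-1}), with the coverage term folded into the MLP
    e = np.tanh(A @ Wa + h_prev @ Wh + c[:, None] * Wc) @ v   # (L,)
    alpha = np.exp(e - e.max())
    alpha /= alpha.sum()                 # softmax over the L regions
    z = alpha @ A                        # z_t: attention-weighted context
    return z, alpha
```

Regions that already received attention accumulate mass in c, which shifts the scores e and discourages the decoder from attending to the same symbol twice.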
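Claim 7's beam search over the decoder output can be sketched generically. Here step_fn, mapping a partial sequence to next-token log-probabilities, is a stand-in for the GRU decoder of claim 6; the function below is an illustrative sketch of beam search in general, not the patented procedure.

```python
import math

def beam_search(step_fn, start_token, end_token, beam_width=3, max_len=20):
    """Keep the beam_width highest-scoring partial sequences, expand each
    with the decoder's next-token log-probabilities, and return the
    sequence with the highest cumulative score."""
    beams = [([start_token], 0.0)]   # (sequence, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            # expand every live beam with every next-token hypothesis
            for tok, logp in step_fn(seq).items():
                candidates.append((seq + [tok], score + logp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_width]:
            # sequences that emit the end token move to the finished pool
            (finished if seq[-1] == end_token else beams).append((seq, score))
        if not beams:
            break
    # prefer completed sequences; fall back to live beams if none finished
    best_seq, _ = max(finished or beams, key=lambda c: c[1])
    return best_seq
```

With beam_width=1 this reduces to greedy decoding; widening the beam lets a token that looks locally suboptimal survive long enough for a globally higher-scoring markup sequence to win.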
CN202011046709.7A2020-09-292020-09-29Three-layer architecture mathematical formula identification method, system and storage device integrating double channelsActiveCN112183544B (en)

Priority Applications (1)

Application NumberPriority DateFiling DateTitle
CN202011046709.7ACN112183544B (en)2020-09-292020-09-29Three-layer architecture mathematical formula identification method, system and storage device integrating double channels


Publications (2)

Publication NumberPublication Date
CN112183544Atrue CN112183544A (en)2021-01-05
CN112183544B CN112183544B (en)2024-09-13

Family

ID=73945817

Family Applications (1)

Application NumberTitlePriority DateFiling Date
CN202011046709.7AActiveCN112183544B (en)2020-09-292020-09-29Three-layer architecture mathematical formula identification method, system and storage device integrating double channels

Country Status (1)

CountryLink
CN (1)CN112183544B (en)


Citations (11)

* Cited by examiner, † Cited by third party
Publication numberPriority datePublication dateAssigneeTitle
US20060062466A1 (en)*2004-09-222006-03-23Microsoft CorporationMathematical expression recognition
US20060100872A1 (en)*2004-11-082006-05-11Kabushiki Kaisha ToshibaPattern recognition apparatus and pattern recognition method
CN109241861A (en)*2018-08-142019-01-18科大讯飞股份有限公司A kind of method for identifying mathematical formula, device, equipment and storage medium
CN109389091A (en)*2018-10-222019-02-26重庆邮电大学The character identification system and method combined based on neural network and attention mechanism
CN110110601A (en)*2019-04-042019-08-09深圳久凌软件技术有限公司Video pedestrian weight recognizer and device based on multi-space attention model
CN110119765A (en)*2019-04-182019-08-13浙江工业大学A kind of keyword extracting method based on Seq2seq frame
US20190278857A1 (en)*2018-03-122019-09-12Microsoft Technology Licensing, LlcSequence to Sequence Conversational Query Understanding
CA3050025A1 (en)*2018-07-192020-01-19Tata Consultancy Services LimitedSystems and methods for end-to-end handwritten text recognition using neural networks
CN111126221A (en)*2019-12-162020-05-08华中师范大学Mathematical formula identification method and device integrating two-way visual attention mechanism
CN111199233A (en)*2019-12-302020-05-26四川大学 An improved deep learning method for pornographic image recognition
CN111562612A (en)*2020-05-202020-08-21大连理工大学 A deep learning microseismic event recognition method and system based on attention mechanism


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
XIAO, Wenbin: "Mathematical formula recognition method based on encoder-decoder and attention mechanism neural networks", China Master's Theses Full-text Database, pages 20-59*

Cited By (6)

* Cited by examiner, † Cited by third party
Publication numberPriority datePublication dateAssigneeTitle
CN113554151A (en)*2021-07-072021-10-26浙江工业大学Attention mechanism method based on convolution interlayer relation
CN113554151B (en)*2021-07-072024-03-22浙江工业大学Attention mechanism method based on convolution interlayer relation
CN113743315A (en)*2021-09-072021-12-03电子科技大学Handwritten elementary mathematical formula recognition method based on structure enhancement
CN113743315B (en)*2021-09-072023-07-14电子科技大学 A Recognition Method for Handwritten Elementary Mathematics Formulas Based on Structural Enhancement
CN113888551A (en)*2021-10-222022-01-04中国人民解放军战略支援部队信息工程大学Liver tumor image segmentation method based on dense connection network of high-low layer feature fusion
CN118155221A (en)*2024-05-112024-06-07济南大学 A method for printed formula recognition based on multi-supervision

Also Published As

Publication numberPublication date
CN112183544B (en)2024-09-13

Similar Documents

PublicationPublication DateTitle
Li et al.Spatial information enhancement network for 3D object detection from point cloud
CN108615036B (en) A natural scene text recognition method based on convolutional attention network
CN112183544A (en)Double-channel fused three-layer architecture mathematical formula identification method, system and storage device
CN114612832B (en) Real-time gesture detection method and device
US20180114071A1 (en)Method for analysing media content
CN110210485A (en)The image, semantic dividing method of Fusion Features is instructed based on attention mechanism
CN114020891A (en) Video question answering method and system for dual-channel semantic localization, multi-granularity attention and mutual enhancement
CN113011320B (en)Video processing method, device, electronic equipment and storage medium
CN113468978B (en) Fine-grained car body color classification method, device and equipment based on deep learning
CN114283352A (en)Video semantic segmentation device, training method and video semantic segmentation method
CN117173450A (en)Traffic scene generation type image description method
Agrawal et al.Image caption generator using attention mechanism
CN117370604A (en)Video description generation method and system based on video space-time scene graph fusion reasoning
CN115082915B (en) A visual-language navigation method for mobile robots based on multi-modal features
Huang et al.Spatial–temporal context-aware online action detection and prediction
CN108389239A (en)A kind of smile face video generation method based on condition multimode network
Li et al.Dual attention convolutional network for action recognition
CN115937641A (en) Transformer-based joint coding method, device and equipment between modalities
Zou et al.360° image saliency prediction by embedding self-supervised proxy task
CN117541668A (en) Virtual character generation method, device, equipment and storage medium
Fazry et al.Change detection of high-resolution remote sensing images through adaptive focal modulation on hierarchical feature maps
CN111753859B (en)Sample generation method, device and equipment
CN119810702A (en) Video data processing method, device, electronic device and readable storage medium
CN119832596A (en)Pedestrian re-recognition method and training method for optimizing fine granularity feature fusion
Lin et al.Region-based context enhanced network for robust multiple face alignment

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant
