Technical field
The present invention relates to the fields of image segmentation, text recognition and machine learning, and in particular to an answer sheet layout analysis method based on target detection technology.
Background
Layout analysis is a key technology for extracting texts, formulas, graphics, tables and other elements from document images for document recognition and document understanding. It is the first step in automatic marking and other OCR pipelines, and its quality is crucial to the subsequent text recognition and natural language processing stages. Although many layout analysis methods for different scenarios have been proposed in the document recognition field in recent years, these algorithms still have limitations when applied to answer sheets. The reason is that an answer sheet differs both from other printed layouts such as papers and newspapers and from handwritten chapter or paragraph text: it contains a large amount of printed text and unconstrained handwritten text at the same time, which existing methods struggle to handle.
The handwritten text is the key content of an answer sheet image: it carries the students' answers to subjective questions and is essential for automatic marking, while the printed text in the background carries important information such as the layout structure and the score values.
Traditional layout analysis techniques typically use projection splitting, edge detection or connected components to split or merge the layout top-down or bottom-up. On answer sheets these methods suffer from missed and false detections, and their output is plain text only, losing the structural information of the layout. Layout analysis algorithms based on machine learning are currently mostly applicable only to printed layouts or to handwritten chapter-level text; there is no method, and no related data, dedicated to answer sheet layout analysis.
Summary of the invention
To address the above problems, the present invention extracts the layout structure of answer sheets by combining a target detection algorithm with traditional layout analysis methods: an answer sheet target detection data set is annotated and used to train an improved YOLOv5s-DC target detection network, after which recursive splitting, text line clustering, CRNN text recognition, table recognition and related algorithms are applied to analyze the answer sheet layout.
The main steps of the present invention are as follows:
First, the original image is preprocessed: its gamma value is adjusted so that the background pixels remain clear after binarization. The binarized image then undergoes tilt correction based on Hough line detection, rotating the longest straight line in the image to the horizontal so that the background content of the answer sheet is not tilted. Next, a target detection algorithm detects and extracts the non-background elements of the layout, and the detected content regions are blanked out to reduce the interference of handwritten text and other elements with the structure analysis. Recursive projection splitting then divides the remaining content along the hierarchy of layout, column, row and text block. Within each text block region produced by the splitting, text regions are extracted with the MSER algorithm, and a KNN-based text line clustering algorithm then yields text lines that can be used for recognition. The target detection results are assigned by position to the text regions split off above and, combined with the original image of each region, are completed and have their edges re-segmented. In addition, for detected table content, the boundary positions are first used to determine which region a table belongs to, the false-detection margins produced during target detection are cropped away, and table analysis is performed.
The answer sheet layout analysis method based on target detection technology comprises the following steps:
Step 1: perform tilt correction and gamma correction on the input answer sheet image, and crop blank or black border regions at the edges using a flood fill algorithm and a rectangle detection algorithm.
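The gamma adjustment in this step can be sketched as a lookup-table transform. This is a generic implementation; the patent does not specify the exact gamma value used:

```python
import numpy as np

def gamma_correct(img, gamma):
    """Gamma-correct an 8-bit grayscale image via a lookup table.

    Pixel values are normalised to [0, 1], raised to `gamma`
    (gamma > 1 darkens mid-tones, gamma < 1 brightens them) and
    rescaled back to [0, 255].
    """
    lut = (np.linspace(0.0, 1.0, 256) ** gamma * 255.0).astype(np.uint8)
    return lut[img]
```

A lookup table is used because the transform is applied per pixel and only 256 input values exist, so precomputing all outputs is much cheaper than exponentiating every pixel.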
Step 2: train the improved YOLOv5s-DC network on the constructed answer sheet layout analysis target detection data set, using a decoupled detection head and improving the loss function with EIOU and QFL to supervise the classification and regression tasks, then run the trained network on the answer sheet obtained in the previous step. The detected target regions are blanked out to reduce interference with subsequent steps.
Step 3: recursively split the remaining layout. Before splitting, the pixels are inverted to simplify processing of the pixel distribution. For column splitting, a horizontal erosion and a vertical dilation are first applied to the layout to emphasize the split-point features. The image is then projected vertically, half of the maximum of the projection is taken as the split threshold, all local-maximum regions of the projection curve are extracted, and maximum points below the threshold are discarded while those at or above it are kept. Finally, a sliding window merges and groups neighboring candidate split points, and the average coordinate within each group is taken as the split position.
Step 4: extract the text within each split region. Each region is processed with the maximally stable extremal regions (MSER) algorithm; after the text regions are obtained, the center position of each connected component and the pairwise distances are computed. A KNN algorithm with K = 3 then merges the strokes and characters of the text regions into text line regions that can be used for recognition.
Step 5: compare the target detection results with the original image of each split region, extract missed text with MSER and merge it into the detection results, then use the Seam Carving algorithm to separate overlapping parts of the detection boxes so as to guarantee the accuracy of the extraction.
Step 6: use Hough line detection to verify detected table content and crop the edges of a table detection box according to the split region positions obtained above; detect the bounding rectangle of the table with a rectangle detection algorithm and apply tilt correction to the table. Horizontal and vertical dilation operations are applied separately to detect the frame structure of the table, and points where the two results overlap are taken as candidate table structure points. For each candidate structure point p_{i,j}, check whether complete connected lines exist to the structure point p_{i+1,j} on its right and to p_{i,j+1} below it; if both exist, a cell cell_{i,j} exists at this position, otherwise the point is removed. After all candidate positions have been checked, the final table structure is obtained; within each cell, the bounding rectangle around the central position is extracted to obtain the text.
Step 7: recognize the text content with a convolutional recurrent neural network. For special text such as multiple-choice markings, character templates are added to the UTF-8 font library, after which the Text Recognition Generator tool can generate character image data for training the network.
Description of the drawings
Figure 1 is a schematic diagram of the tilt correction effect. a) Before tilt correction; b) after tilt correction.
Figure 2 is a schematic diagram of data set construction.
Figure 3 is a simplified diagram of the YOLOv5s-DC network structure.
Figure 4 is a structural diagram of the decoupled detection head.
Figure 5 shows the target detection results.
Figure 6 is a projection schematic of the column-splitting method. a) Original background image; b) background image after horizontal dilation; c) projection of the original background; d) projection of the dilated background.
Figure 7 shows the layout splitting results.
Figure 8 shows the text line extraction results. a) Header section; b) fill-in-the-blank section.
Figure 9 is a schematic diagram of the completion algorithm. a) Target detection results; b) results after completion.
Figure 10 is a schematic diagram of the merging.
Figure 11 compares the effects of the methods. a) Target detection results; b) MSER results; c) hybrid method results.
Figure 12 shows the edge segmentation effect. a) Original image and segmentation result; b) gradient image and segmentation result.
Figure 13 shows the special characters for multiple-choice questions.
Figure 14 shows the text recognition data set.
Figure 15 shows the multiple-choice recognition results.
Figure 16 shows the other text recognition results. a) Multiple-choice image; b) recognition results.
Figure 17 shows the table analysis results. a) Header area; b) fill-in-the-blank area.
Figure 18 shows the overall layout analysis results.
Detailed description
The present invention is further described below with reference to the accompanying drawings and specific embodiments.
The method of the present invention comprises the following steps:
(1) Image preprocessing. Straight lines in the image are detected with the Hough line detection algorithm, the longest line is taken as the reference, the rotation matrix H required for correction is computed from its angle relative to the horizontal, and the image is corrected with it. Figure 1 compares the image before and after tilt correction.
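The angle computation can be sketched as follows. The segment list is assumed to come from a probabilistic Hough transform such as OpenCV's `cv2.HoughLinesP`; only the library-independent part is shown:

```python
import math

def deskew_angle(lines):
    """Pick the longest detected segment and return the angle (degrees)
    by which the image must be rotated so that segment becomes horizontal.

    `lines` is a list of segments (x1, y1, x2, y2), e.g. the output of a
    probabilistic Hough transform.
    """
    x1, y1, x2, y2 = max(
        lines, key=lambda l: (l[2] - l[0]) ** 2 + (l[3] - l[1]) ** 2)
    # Rotating by the segment's own inclination brings it to 0 degrees.
    return math.degrees(math.atan2(y2 - y1, x2 - x1))
```

The rotation matrix H would then be built from this angle, e.g. with `cv2.getRotationMatrix2D`, and applied to the image with `cv2.warpAffine`.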
(2) Construct the target detection data and improve and train the YOLOv5s model. The answer sheet images are annotated with the labelme image annotation tool, using the rectangle annotation mode to locate the upper-left and lower-right corners of each target and labelling each box with its category. Figure 2 shows the annotation software interface. Figure 3 shows the basic structure of the YOLOv5s model, and Figure 4 the specific structure of the head part in Figure 3. The network is built with the PyTorch deep learning framework and trained for 200 epochs with an initial learning rate of 0.001, an input size of 1280x1280 and a batch size of 12, using the SGD optimizer. Figure 5 shows the detection results obtained for one input answer sheet image. The loss function used during training is given below and consists of three parts: the regression loss L_reg, the confidence loss L_conf and the classification loss L_cls.
L = L_reg + L_conf + L_cls
L_cls = -|y - σ|^β ((1 - y) log(1 - σ) + y log(σ))
(3) Layout background splitting. Common answer sheet layouts fall into two types according to their column arrangement, single-column and multi-column, as shown in the figures. Single-column answer sheets are generally A4-sized and are common in English, physics, chemistry, biology, geography, politics and history; multi-column answer sheets are generally A3-sized, usually with 3 columns, and are common in Chinese and mathematics. To analyze the layout structure, the sheet must be split into columns and rows. In terms of overall structure, every column of a multi-column answer sheet has the same area and a regular shape. Therefore a traditional gray-level histogram is used to measure the gray-level distribution and split the sheet into columns. The algorithm proceeds as follows:
Step 1: To emphasize the features at the column boundaries and reduce the interference of non-border lines in the projection, the black parts of the image are dilated along the horizontal direction. Figure 6 compares the image and its projection before and after dilation: the vertical lines are clearly thickened, while the horizontal lines are unchanged.
Step 2: The gray-level sum of each image column is computed along the vertical direction to obtain a gray-level histogram. Figures 6(c) and 6(d) compare the gray-level projections before and after dilation. After the morphological processing, the split points in the projection are more pronounced: numerically, the gap between the split points and the other points grows both in absolute difference and in ratio.
Step 3: Half of the maximum of the projection is taken as the split threshold t, and points exceeding t are taken as candidate split points.
Step 4: Candidate split points whose horizontal distance is less than 100 pixels are merged into one group.
Step 5: The center of the candidate split points in each group is taken as the final split point, and the image is cut along the vertical direction.
END
This splitting algorithm is applied recursively, processing the layout into columns, rows and blocks in turn. Figure 7 shows the layout splitting results.
After each region has been split off, the following region discrimination algorithm computes how dispersed the elements within a region are, in order to decide whether the region is a multiple-choice answer area.
Compared with other regions, the multiple-choice answer area has a distinctive characteristic: once noise points are removed, its text is evenly arranged in a regular checkerboard pattern.
Let D denote the distance from a connected component to its nearest neighbor. Two such distances D_i and D_j within a multiple-choice area are much closer to each other than the distances D_m and D_n within a non-multiple-choice area such as a fill-in-the-blank section. It therefore suffices to examine the positional distribution and the number of connected components in a region to identify the multiple-choice areas.
The algorithm steps are as follows:
BEGIN
Step 1: To fuse the digits of each question number into one connected component and make the distances between neighboring components more uniform, the input image is dilated.
Step 2: Extract all connected components C within the region.
Step 3: Compute the center point coordinate p_i of each connected component c_i.
Step 4: Compute the distance matrix D from each connected component c_i to every other c_j, find the nearest neighbor of each component, and store its distance in a list L.
Step 5: Compute the variance V of L. If V is smaller than a threshold t and the number of connected components is larger than n, the current region is judged to be a multiple-choice area.
END
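A minimal sketch of this discrimination, assuming the component centroids have already been extracted. Since a regular bubble grid gives near-uniform nearest-neighbour distances, the test is for a small variance; the thresholds `t` and `n` are illustrative placeholders, not values from the text:

```python
import numpy as np

def is_choice_region(centers, t=25.0, n=12):
    """Decide whether a region is a multiple-choice answer area.

    `centers` is an (N, 2) array of connected-component centroids.
    A regular checkerboard of answer bubbles gives near-uniform
    nearest-neighbour distances, i.e. a small variance V.
    """
    if len(centers) <= n:                  # too few components
        return False
    # Full pairwise distance matrix D (Step 4).
    d = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)            # ignore self-distance
    nn = d.min(axis=1)                     # distance to nearest neighbour
    return bool(nn.var() < t)              # Step 5: small variance => bubbles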
(4) Background text extraction
After the background image has been split, the text of each part must be further extracted and recognized in order to obtain its content. For this purpose the MSER algorithm is used to extract the printed text in each region. Because a single maximally stable extremal region may not be a full character, or may be only a part of one, the extracted components must be merged so that text of different sizes can be recognized accurately. If multi-component characters such as the Chinese "二" and "三" in the figure were not merged, recognition errors would occur and the judgment of question numbers would be affected.
The KNN algorithm is chosen to merge the MSER regions. The value of K largely determines the quality of text line merging: a K that is too small leads to under-merging, causing missing strokes or interrupted text lines, while a K that is too large increases the computation and lowers the efficiency of the algorithm. Experiments on Chinese answer sheets show that K = 4 gives the best merging result. After merging, the text changes from isolated connected components into text lines, which effectively improves the recognition rate and yields complete sentence structures that are convenient for later semantic understanding. Figure 8 shows the result of merging the text connected components: the green rectangles are the bounding boxes of the text and the red lines the nearest-neighbor connections.
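A simplified stand-in for this merging step: union-find over the k nearest centroids with a vertical alignment check. `k = 4` follows the experiment above; `y_tol` is an assumed tolerance, not a value from the text:

```python
import numpy as np

def merge_text_lines(boxes, k=4, y_tol=10.0):
    """Merge character boxes (x, y, w, h) into text-line boxes.

    Two boxes are linked when one is among the other's k nearest
    centroids and their vertical centres differ by less than y_tol;
    linked groups are merged via union-find into one bounding box.
    """
    boxes = np.asarray(boxes, dtype=float)
    c = boxes[:, :2] + boxes[:, 2:] / 2.0                 # centroids
    d = np.linalg.norm(c[:, None] - c[None, :], axis=-1)
    np.fill_diagonal(d, np.inf)

    parent = list(range(len(boxes)))
    def find(i):                                          # union-find root
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(boxes)):
        for j in np.argsort(d[i])[:k]:                    # k nearest neighbours
            if abs(c[i, 1] - c[int(j), 1]) < y_tol:       # same text line?
                parent[find(i)] = find(int(j))

    groups = {}
    for i in range(len(boxes)):
        groups.setdefault(find(i), []).append(boxes[i])
    lines = []
    for g in groups.values():
        g = np.array(g)
        x0, y0 = g[:, 0].min(), g[:, 1].min()
        x1, y1 = (g[:, 0] + g[:, 2]).max(), (g[:, 1] + g[:, 3]).max()
        lines.append((x0, y0, x1 - x0, y1 - y0))
    return lines
```

The vertical tolerance is what keeps vertically stacked lines apart even when their characters are mutual nearest neighbours.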
(5) Completion of missed text and edge segmentation
Although YOLOv5s-DC achieves high accuracy on the answer sheet layout target detection task, some strokes or text are still missed, partly because a rectangular detection box has limited representation power and cannot cover overly long strokes or irregularly arranged text. Therefore an algorithm is proposed that completes the detection results using maximally stable extremal regions (MSER). Erasing the target regions already detected by YOLOv5s-DC effectively reduces the computation required by the MSER algorithm; merging the completion results with the detection results then yields the complete extraction. The completion effect is shown in Figure 9.
MSER is an algorithm for text detection in natural scenes. It binarizes the image repeatedly with a varying threshold; connected components whose area changes slowly during this process are taken as maximally stable extremal regions. The area change rate v_i of a connected component can be computed as follows, where Q_i is the area of connected component i and Q_{i-Δ} is the area of this component after the threshold has changed by Δ.
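The formula itself is elided here; from the definitions of Q_i and Q_{i-Δ}, it is presumably the standard MSER stability criterion:

```latex
v_i = \frac{\left| Q_i - Q_{i-\Delta} \right|}{Q_i}
```

Note that the original MSER formulation uses a symmetric window, v_i = (Q_{i+Δ} - Q_{i-Δ}) / Q_i; the one-sided form above matches the two quantities actually defined in the text.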
After the MSER extraction, the results of the two algorithms are merged. During merging, each text rectangle obtained by target detection is taken as the center, and small pieces of content within a certain range on either side of it are merged in. To avoid creating additional overlap during merging, only the blank areas between two rectangles whose coordinates overlap are connected, forming a bounding polygon of the text line, as shown in Figure 10. To verify the effectiveness of this hybrid method, an experimental comparison was carried out; the effects of the three methods are shown in Figure 11. Table 1 lists the detailed text detection metrics of the three methods on the same data. The proposed hybrid method exploits the efficiency and classification ability of the target detection network while using the high recall of the traditional method to supplement the content the detector misses, without significantly increasing the detection time.
Table 1
Where detection boxes overlap, an edge segmentation algorithm is used to relocate the upper and lower edges of the text in each detection box. The algorithm is as follows.
BEGIN
Step 1: Compute the gradient grad at each point of I with the Sobel operator. To avoid the cost of squaring followed by a square root, the sum of the absolute gradient values of the two directions is used directly as the gradient.
Step 2: Let M be the minimum path cost matrix and B the backtracking matrix; initialize both as all-zero matrices of the same size as img. The initial iteration coordinate is (i, j), where i is the row of the minimum grad in column j and j starts at 0, i.e. the computation runs from left to right.
Step 3: For each i, iterate j from 0 to w-1 (w is the width of img), compute the minimum energy sum according to the dynamic programming state transition, and store the current minimum energy and the coordinate of the corresponding previous step in B, thereby representing the shortest path from (i, 0) to the current coordinate.
Step 4: Repeat the iteration for every i, storing the cost and coordinates of each path in M and B.
Step 5: Among the path end points on the right side of M, take the smallest as the end of the path, backtrack along the coordinates stored in B, and append each coordinate (i, j) obtained to path.
END
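Steps 2 to 5 can be sketched as a standard Seam Carving dynamic program over the gradient map. This is a generic implementation, not the patent's exact code:

```python
import numpy as np

def min_horizontal_seam(grad):
    """Minimum-energy left-to-right path through a gradient map.

    `grad` is an (h, w) gradient-magnitude image; the result is the
    seam's row index in every column, found by dynamic programming
    with back-pointers.
    """
    h, w = grad.shape
    M = grad.astype(np.float64).copy()      # minimum path cost so far
    B = np.zeros((h, w), dtype=np.int64)    # back-pointer to previous row
    for j in range(1, w):
        for i in range(h):
            # A seam step may move to the same row or one row up/down.
            lo, hi = max(i - 1, 0), min(i + 1, h - 1)
            prev = lo + int(np.argmin(M[lo:hi + 1, j - 1]))
            M[i, j] += M[prev, j - 1]
            B[i, j] = prev
    path = [int(np.argmin(M[:, -1]))]       # cheapest end point on the right
    for j in range(w - 1, 0, -1):           # backtrack along B
        path.append(int(B[path[-1], j]))
    return path[::-1]
```

A seam that follows low-gradient pixels hugs the whitespace between two overlapping text lines, which is exactly where the box edge should be redrawn.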
Figure 12 shows the results of the text line edge positioning.
(6) Table analysis
First the table region found by target detection is cropped out, giving the table image I. I is then binarized into the binary image B and inverted. A contour detection function yields the list C of connected components in the image; the list is traversed, the minimum bounding rectangle of each component is computed and appended to a list rects. After the traversal, the largest rectangle in rects is taken as the table localization result, and the input image is cropped along its x, y, w, h (abscissa, ordinate, width and height). Opening operations with one-dimensional structuring elements are then applied to the cropped binary image several times along the horizontal and vertical directions to isolate and extend the lines it contains, producing two feature maps DC and DR that represent the horizontal and vertical lines of the table. A bitwise AND of the two feature maps gives the feature map AND, in which the positions equal to 1 are the intersections of the table's line segments. To reduce the interference of duplicate points, a sliding window deduplicates the intersection positions along the horizontal and vertical directions, yielding the coordinate lists X and Y. Then, for any structure point p_{i,j}, it is checked whether complete connected lines exist to the structure point p_{i+1,j} on its right and to p_{i,j+1} below it; if both exist, a cell cell_{i,j} exists at this position, otherwise the point is removed. Finally, for each cell_{i,j}, the minimum bounding rectangle of the text inside it is taken as the cropping edge and the text is extracted for subsequent recognition. The table analysis results are shown in Figure 17.
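The structure-point check described above can be sketched as follows, assuming the binary line mask and the deduplicated coordinate lists X and Y are already available from the preceding steps:

```python
import numpy as np

def table_cells(mask, xs, ys):
    """Enumerate the cells of a ruled table.

    `mask` is a binary image of the table lines (1 = line pixel), and
    `xs`, `ys` are the deduplicated intersection coordinates X and Y.
    A cell exists at grid position (i, j) when the edge to the next
    structure point on the right and the edge to the next one below
    are both fully present in the mask.
    """
    def h_connected(y, x0, x1):            # unbroken horizontal segment?
        return bool(mask[y, x0:x1 + 1].all())
    def v_connected(x, y0, y1):            # unbroken vertical segment?
        return bool(mask[y0:y1 + 1, x].all())
    cells = []
    for j, y in enumerate(ys[:-1]):
        for i, x in enumerate(xs[:-1]):
            if h_connected(y, x, xs[i + 1]) and v_connected(x, y, ys[j + 1]):
                cells.append((i, j))
    return cells
```

In practice the `.all()` test would be relaxed to tolerate a few missing pixels caused by scanning noise.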
(7) Text content recognition. To recognize the text in the layout and facilitate subsequent natural language processing, a CRNN network with an extended character set is used. Figure 13 shows the added special characters; after they are added to the UTF-8 font library, a printed character generation tool outputs character recognition training data as shown in Figure 14. The network output is post-processed with the CTC algorithm to obtain the final recognition result. For an input text feature sequence X = {x_t | t = 1, 2, ..., T} with output label sequence L = {l_u | u = 1, 2, ..., U}, CTC introduces a blank symbol as the output when nothing is emitted, and blanks separate consecutive identical outputs of the network. The mapping B from network output to final result is therefore defined as: 1. merge consecutive identical symbols; 2. remove the blank placeholders. For example: B(#z#oo#o) = B(z##o#o#) = zoo.
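The collapse mapping B can be written directly, with `#` standing in for the blank symbol as in the example above:

```python
def ctc_collapse(seq, blank="#"):
    """CTC decoding map B: merge consecutive duplicate symbols,
    then drop blanks, so that B(#z#oo#o) = B(z##o#o#) = 'zoo'."""
    out = []
    prev = None
    for s in seq:
        if s != prev and s != blank:   # new, non-blank symbol
            out.append(s)
        prev = s                       # remember for duplicate merging
    return "".join(out)
```

The order matters: duplicates are merged before blanks are removed, which is why "oo" with no blank in between collapses to a single "o" while "o#o" survives as "oo".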
For the loss and confidence computation, CTC uses dynamic programming to evaluate the conditional probability of an output. Suppose the output Y corresponding to input X is "ZOO". At time t1 the network can eventually produce ZOO only by outputting blank or Z; likewise at time t2 it may keep outputting blank or continue outputting Z. For an input X, the probability that the output is Y is:
where a_t is the output at time step t and p_t(a_t | X) is the probability of outputting a_t at time t. Assuming the outputs of the individual time steps are independently distributed, multiplying them gives the probability of one alignment, and summing the probabilities over all alignment paths gives the probability that the output is Y.
During model training, for an input set D it suffices to minimize the negative log-likelihood.
During model prediction, the path combination with the highest probability must be output, i.e. the most probable labelling is sought:
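The three formulas elided above are presumably the standard CTC expressions, in the notation already introduced (B is the collapse mapping and A = (a_1, ..., a_T) ranges over alignment paths):

```latex
P(Y \mid X) = \sum_{A \in B^{-1}(Y)} \; \prod_{t=1}^{T} p_t(a_t \mid X)
\qquad
\mathcal{L} = -\sum_{(X, Y) \in D} \log P(Y \mid X)
\qquad
Y^{*} = \arg\max_{Y} P(Y \mid X)
```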
Figures 15 and 16 show the recognition results of the special characters and of the other text in the layout, respectively.
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202310667530.0A (CN116824608A) | 2023-06-07 | 2023-06-07 | Answer sheet layout analysis method based on target detection technology |
| Publication Number | Publication Date |
|---|---|
| CN116824608A | 2023-09-29 |
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202310667530.0APendingCN116824608A (en) | 2023-06-07 | 2023-06-07 | Answer sheet layout analysis method based on target detection technology |
| Country | Link |
|---|---|
| CN (1) | CN116824608A (en) |
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN117237585A (en)* | 2023-11-10 | 2023-12-15 | Shandong University of Science and Technology (山东科技大学) | Optical mark positioning and identification method, system, equipment and storage medium for answer sheets |
| CN117237585B (en)* | 2023-11-10 | 2024-01-30 | Shandong University of Science and Technology (山东科技大学) | Optical mark positioning and identification method, system, equipment and storage medium for answer sheets |
| CN117473980A (en)* | 2023-11-10 | 2024-01-30 | Institute of Medical Information, Chinese Academy of Medical Sciences (中国医学科学院医学信息研究所) | Structured analysis method of portable document format file and related products |
| CN118521775A (en)* | 2024-06-25 | 2024-08-20 | Nanchang Institute of Technology (南昌工学院) | First printing register monitoring system based on YOLOv algorithm |
| CN118657780A (en)* | 2024-08-21 | 2024-09-17 | Minnan University of Science and Technology (闽南理工学院) | A method and system for detecting defects in printed bottles |
| Publication number | Priority date | Publication date | Title |
|---|---|---|---|
| WO2020034523A1 (en)* | 2018-08-13 | 2020-02-20 | 杭州大拿科技股份有限公司 | Method and system for intelligently recognizing and correcting question |
| CN110929562A (en)* | 2019-10-12 | 2020-03-27 | 杭州电子科技大学 | A method for identifying answer sheets based on improved Hough transform |
| WO2020259060A1 (en)* | 2019-06-26 | 2020-12-30 | 深圳中兴网信科技有限公司 | Test paper information extraction method and system, and computer-readable storage medium |
| CN113537227A (en)* | 2021-06-28 | 2021-10-22 | 杭州电子科技大学 | A structured text recognition method and system |
| CN113971726A (en)* | 2021-10-28 | 2022-01-25 | 合肥科大智能机器人技术有限公司 | Character recognition method, system and storage medium based on industrial equipment label |
| CN114299033A (en)* | 2021-12-29 | 2022-04-08 | 中国科学技术大学 | YOLOv 5-based photovoltaic panel infrared image hot spot detection method and system |
| Title |
|---|
| 刘力冉; 曹杰; 杨磊; 仇男豪: "An Improved YOLOv3-Tiny Vehicle Detection Algorithm" (一种改进YOLOv3-Tiny的行车检测算法), Computer and Modernization (计算机与现代化), no. 03, 15 March 2020 (2020-03-15)* |
| Publication | Title | Publication Date |
|---|---|---|
| CN111814722B (en) | A form recognition method, device, electronic device and storage medium in an image | |
| CN113537227B (en) | Structured text recognition method and system | |
| CN113158808B (en) | Method, medium and equipment for Chinese ancient book character recognition, paragraph grouping and layout reconstruction | |
| CN116824608A (en) | Answer sheet layout analysis method based on target detection technology | |
| JP5492205B2 (en) | Segment print pages into articles | |
| CN101251892B (en) | A character segmentation method and device | |
| CN101615252B (en) | Method for extracting text information from adaptive images | |
| US8442319B2 (en) | System and method for classifying connected groups of foreground pixels in scanned document images according to the type of marking | |
| CN102208023B (en) | Method for recognizing and designing video captions based on edge information and distribution entropy | |
| Shafait et al. | Performance comparison of six algorithms for page segmentation | |
| CN100565559C (en) | Image text location method and device based on connected component and support vector machine | |
| Xu et al. | Page segmentation for historical handwritten documents using fully convolutional networks | |
| CN108596066A (en) | A kind of character identifying method based on convolutional neural networks | |
| CN106875546A (en) | A method for identifying value-added tax invoices | |
| Zhang et al. | Text detection in natural scene images based on color prior guided MSER | |
| CN112446259A (en) | Image processing method, device, terminal and computer readable storage medium | |
| Chiang et al. | Recognition of multi-oriented, multi-sized, and curved text | |
| CN112364834A (en) | Form identification restoration method based on deep learning and image processing | |
| CN114463767B (en) | Letter of credit identification method, device, computer equipment and storage medium | |
| CN115019310B (en) | Image-text identification method and equipment | |
| Roy et al. | Text line extraction in graphical documents using background and foreground information | |
| Ali et al. | An efficient character segmentation algorithm for recognition of Arabic handwritten script | |
| CN106980857A (en) | A kind of Brush calligraphy segmentation recognition method based on rubbings | |
| Anh et al. | A hybrid method for table detection from document image | |
| CN107463866A (en) | A kind of method of the hand-written laboratory report of identification for performance evaluation |
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |