CN105574513A

Movatterモバイル変換

Info

Publication number: CN105574513A
Application number: CN201510970839.2A
Authority: CN
Inventors: 姚聪; 周舒畅; 周昕宇; 印奇
Original assignee: Beijing Kuangshi Technology Co Ltd; Beijing Megvii Technology Co Ltd
Current assignee: Yuanli Jinzhi Chongqing Technology Co ltd
Priority date: 2015-12-22
Filing date: 2015-12-22
Publication date: 2016-05-11
Anticipated expiration: 2035-12-22
Also published as: CN105574513B

Abstract

The invention discloses a character detection method and a character detection device. The character detection method comprises the following steps: receiving an image to be detected; generating a text region probability map of the full map of the image to be detected through a semantic prediction model, wherein the text region probability map uses different pixel values to distinguish a text region of the image to be detected and a non-text region of the image to be detected; and carrying out segmentation operation on the text region probability graph to determine the text region. The character detection method and the device can effectively inhibit the interference of a complex background, detect characters of different languages, directions, colors, fonts and sizes and have wide application range. In addition, the character detection method and device have the characteristic of strong robustness, and can effectively deal with the interference of factors such as image noise, image blur, complex background in the image, non-uniform illumination and the like.

Description

Translated fromChinese

文字检测方法和装置Text detection method and device

技术领域technical field

本发明涉及图像处理领域，具体涉及一种文字检测方法和装置。The invention relates to the field of image processing, in particular to a character detection method and device.

背景技术Background technique

随着智能手机的广泛普及和移动互联网的迅速发展，通过手机等移动终端的摄像头获取、检索和分享资讯已经逐步成为一种生活方式。基于摄像头的(Camera-based)的应用更加强调对拍摄场景的理解。通常，在文字和其他物体并存的场景，用户往往首先更关注场景中的文字信息，因而正确识别图像中的文字对用户拍摄意图会有更深入的理解。这就涉及了文字检测技术来识别拍摄图像中的文字区域。With the widespread popularity of smart phones and the rapid development of the mobile Internet, acquiring, retrieving and sharing information through the cameras of mobile terminals such as mobile phones has gradually become a way of life. Camera-based applications put more emphasis on the understanding of the shooting scene. Usually, in a scene where text and other objects coexist, users tend to pay more attention to the text information in the scene first, so correctly identifying the text in the image will have a deeper understanding of the user's shooting intention. This involves text detection techniques to identify text areas in captured images.

文字检测作为一项重要的基础技术，具有巨大的应用价值和广阔的应用前景，特别是自然场景图像的文字检测。例如，自然场景图像的文字检测技术可直接应用于增强现实、地理定位、人机交互、机器人导航、自动驾驶汽车和工业自动化等领域。As an important basic technology, text detection has great application value and broad application prospects, especially text detection in natural scene images. For example, text detection technology in natural scene images can be directly applied to fields such as augmented reality, geolocation, human-computer interaction, robot navigation, self-driving cars, and industrial automation.

然而，待检测图像中大多包含较复杂的背景，且其质量可能受到噪声、模糊、非均匀光照等因素的影响；此外，文字具有多样性，比如，自然场景图像中的文字可能具有不同的颜色、尺寸、字体和方向等。这些因素都会给文字检测带来巨大的困难和挑战。基于上述原因，现有的文字检测方法容易产生虚警(falsealarm)，也即将背景中的非文字成分错误地判别为文字。此外，现有的文字检测方法在适应性方面也存在不足之处，例如，大部分方法只能检测水平方向的文字，对于倾斜或旋转的文字则无能为力。又例如，有些方法只能够应用于中文检测，无法直接推广到不同类别语言(如英文、俄文、韩文等)的文字。而且当图像中存在严重的噪声、模糊或者非均匀光照时，现有的文字检测方法又往往会产生错误。总之，现有的文字检测方法和系统在精度和适用范围等方面存在缺陷。However, most of the images to be detected contain complex backgrounds, and their quality may be affected by factors such as noise, blur, and non-uniform lighting; in addition, texts are diverse, for example, texts in natural scene images may have different colors , size, font and orientation, etc. These factors will bring great difficulties and challenges to text detection. Based on the above reasons, existing text detection methods are prone to false alarms, that is, the non-text components in the background are mistakenly identified as text. In addition, the existing text detection methods also have shortcomings in terms of adaptability. For example, most methods can only detect horizontal text, but cannot do anything for tilted or rotated text. For another example, some methods can only be applied to Chinese detection, and cannot be directly extended to texts of different types of languages (such as English, Russian, Korean, etc.). Moreover, when there is serious noise, blur or non-uniform illumination in the image, the existing text detection methods often produce errors. In short, the existing text detection methods and systems have defects in accuracy and scope of application.

发明内容Contents of the invention

鉴于上述问题，提出了本发明以便提供一种至少部分地解决上述问题的文字检测方法和装置。In view of the above problems, the present invention is proposed to provide a character detection method and device that at least partly solve the above problems.

根据本发明一个方面，提供了一种文字检测方法，包括：According to one aspect of the present invention, a text detection method is provided, including:

接收待检测图像；经由语义预测模型生成所述待检测图像的全图的文字区域概率图，其中，所述文字区域概率图使用不同的像素值区分所述待检测图像的文字区域和所述待检测图像的非文字区域；以及receiving an image to be detected; generating a text area probability map of the entire image of the image to be detected via a semantic prediction model, wherein the text area probability map uses different pixel values to distinguish the text area of the image to be detected from the text area to be detected detect non-text areas of the image; and

对所述文字区域概率图进行分割操作，以确定所述文字区域。A segmentation operation is performed on the text area probability map to determine the text area.

根据本发明另一方面，还提供了一种文字检测装置，包括语义分析模块和分割模块。语义分析模块用于接收待检测图像，并使用语义预测模型以生成所述待检测图像的全图的文字区域概率图，其中，所述文字区域概率图使用不同的像素值区分所述待检测图像的文字区域和所述待检测图像的非文字区域。分割模块用于对所述文字区域概率图进行分割操作，以确定所述文字区域。According to another aspect of the present invention, a text detection device is also provided, including a semantic analysis module and a segmentation module. The semantic analysis module is used to receive the image to be detected, and use the semantic prediction model to generate a text area probability map of the entire image of the image to be detected, wherein the text area probability map uses different pixel values to distinguish the image to be detected The text area and the non-text area of the image to be detected. The segmentation module is used to perform a segmentation operation on the text area probability map to determine the text area.

上述文字检测方法和装置中，支持对待检测图像的全图直接进行文字检测，不同于基于简单阈值分割、滑动窗或连通分量的算法。其可以在有效抑制复杂背景的干扰的同时，检测不同语种、方向、颜色、字体和尺寸的文字，适应范围广。此外，该文字检测方法和装置具有鲁棒性强的特点，可以有效应对图像噪声、图像模糊、图像中复杂背景、非均匀光照等因素的干扰。The above text detection method and device support direct text detection on the entire image of the image to be detected, which is different from algorithms based on simple threshold segmentation, sliding window or connected components. It can detect characters in different languages, directions, colors, fonts and sizes while effectively suppressing the interference of complex backgrounds, and has a wide range of applications. In addition, the text detection method and device have strong robustness, and can effectively deal with the interference of factors such as image noise, image blur, complex background in the image, and non-uniform illumination.

上述说明仅是本发明技术方案的概述，为了能够更清楚了解本发明的技术手段，而可依照说明书的内容予以实施，并且为了让本发明的上述和其它目的、特征和优点能够更明显易懂，以下特举本发明的具体实施方式。The above description is only an overview of the technical solution of the present invention. In order to better understand the technical means of the present invention, it can be implemented according to the contents of the description, and in order to make the above and other purposes, features and advantages of the present invention more obvious and understandable , the specific embodiments of the present invention are enumerated below.

附图说明Description of drawings

通过阅读下文优选实施方式的详细描述，各种其他的优点和益处对于本领域普通技术人员将变得清楚明了。附图仅用于示出优选实施方式的目的，而并不认为是对本发明的限制。而且在整个附图中，用相同的参考符号表示相同的部件。在附图中：Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiment. The drawings are only for the purpose of illustrating a preferred embodiment and are not to be considered as limiting the invention. Also throughout the drawings, the same reference numerals are used to designate the same components. In the attached picture:

图1a和图1b分别示例性地示出了根据本发明一个实施例的待检测图像和经检测图像；Figure 1a and Figure 1b exemplarily show an image to be detected and a detected image according to an embodiment of the present invention;

图2示例性地示出了根据本发明一个实施例的文字检测方法的流程图；Fig. 2 schematically shows a flow chart of a text detection method according to an embodiment of the present invention;

图3a和图3b、图4a和图4b、图5a和图5b、图6a和图6b分别示例性地示出了根据本发明的实施例的待检测图像的全图和其对应生成的文字区域概率图。Fig. 3a and Fig. 3b, Fig. 4a and Fig. 4b, Fig. 5a and Fig. 5b, Fig. 6a and Fig. 6b respectively exemplarily show the full image of the image to be detected and its corresponding generated text area according to an embodiment of the present invention Probability map.

图7示例性地示出了根据本发明一个实施例的获得待检测图像的方法的流程图；FIG. 7 exemplarily shows a flow chart of a method for obtaining an image to be detected according to an embodiment of the present invention;

图8示例性地示出了根据本发明一个实施例的对文字区域概率图进行分割操作的方法的流程图；FIG. 8 exemplarily shows a flow chart of a method for segmenting a text region probability map according to an embodiment of the present invention;

图9示例性地示出了根据本发明一个实施例的训练神经网络的方法的流程图；Fig. 9 exemplarily shows a flowchart of a method for training a neural network according to an embodiment of the present invention;

图10a、图10b、图10c和图10d分别示出了根据本发明一个实施例的具有标注信息的样本图像；Fig. 10a, Fig. 10b, Fig. 10c and Fig. 10d respectively show sample images with annotation information according to an embodiment of the present invention;

图11a和图11b分别示出了根据本发明一个实施例的具有标注信息的样本图像和其对应的掩模图；Figure 11a and Figure 11b respectively show a sample image with annotation information and its corresponding mask image according to an embodiment of the present invention;

图12示例性地示出了根据本发明一个实施例的全卷积神经网络的示意图；Fig. 12 exemplarily shows a schematic diagram of a fully convolutional neural network according to an embodiment of the present invention;

图13示例性地示出了根据本发明一个实施例的文字检测装置的示意性框图；Fig. 13 exemplarily shows a schematic block diagram of a text detection device according to an embodiment of the present invention;

图14示例性地示出了根据本发明另一个实施例的文字检测装置的示意性框图；以及Fig. 14 exemplarily shows a schematic block diagram of a text detection device according to another embodiment of the present invention; and

图15示例性地示出了根据本发明一个实施例的文字检测系统的示意性框图。Fig. 15 exemplarily shows a schematic block diagram of a text detection system according to an embodiment of the present invention.

具体实施方式detailed description

下面将参照附图更详细地描述本公开的示例性实施例。虽然附图中显示了本公开的示例性实施例，然而应当理解，可以以各种形式实现本公开而不应被这里阐述的实施例所限制。相反，提供这些实施例是为了能够更透彻地理解本公开，并且能够将本公开的范围完整的传达给本领域的技术人员。Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. Although exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited by the embodiments set forth herein. Rather, these embodiments are provided for more thorough understanding of the present disclosure and to fully convey the scope of the present disclosure to those skilled in the art.

为了更合理地自动识别图像中文字区域，本发明提供了一种文字检测方法。图1a和图1b分别示例性地示出了根据本发明一个实施例的待检测图像和经检测图像。图2示出了根据本发明一个实施例的文字检测方法200的流程图。如图2所示，该方法200包括步骤S210至步骤S230。In order to automatically identify the text area in the image more reasonably, the present invention provides a text detection method. Fig. 1a and Fig. 1b exemplarily show a to-be-detected image and a detected image according to an embodiment of the present invention, respectively. FIG. 2 shows a flowchart of a character detection method 200 according to an embodiment of the present invention. As shown in FIG. 2, the method 200 includes step S210 to step S230.

在步骤S210中，接收待检测图像。待检测图像可以是原始图像，也可以是对原始图像进行预处理后得到的图像。在本发明的一个实施例中，可以通过对采集到的原始图像进行预处理得到所述待检测图像。下文中将结合具体附图对所述图像预处理的方法进行详细描述。In step S210, an image to be detected is received. The image to be detected may be an original image, or an image obtained after preprocessing the original image. In an embodiment of the present invention, the image to be detected may be obtained by preprocessing the collected original image. The image preprocessing method will be described in detail below with reference to specific drawings.

在步骤S220中，经由语义预测模型生成所述待检测图像的全图的文字区域概率图，其中，所述文字区域概率图使用不同的像素值区分所述待检测图像的文字区域和所述待检测图像的非文字区域。根据本发明的一个实施例，文字区域是指图像中包含文字的区域。以图1a和图1b为例，图1b中两个黑色四边形内部的区域是文字区域。在第一个文字区域中，包含文字“我在生长”，在第二个文字区域中，包含文字“请不要踩我”。In step S220, a text area probability map of the entire image of the image to be detected is generated via a semantic prediction model, wherein the text area probability map uses different pixel values to distinguish the text area of the image to be detected from the text area to be detected Detect non-text areas of an image. According to an embodiment of the present invention, the text area refers to the area containing text in the image. Taking Fig. 1a and Fig. 1b as an example, the area inside the two black rectangles in Fig. 1b is a text area. In the first text area, contain the text "I'm growing" and in the second text area, contain the text "Please don't step on me".

在一个实施例中，文字区域概率图使用不同的像素值表示不同的概率以区分所述待检测图像的文字区域和所述待检测图像的非文字区域。在一个实施例中，图像的像素值越高表示该像素所在区域属于文字区域的概率越高，图像的像素值越低则表示该像素所在区域属于文字区域的概率越低。例如像素值为0的黑色像素表示该像素所在区域属于文字区域的概率为0，像素值为255的白色像素表示该像素所在区域属于文字区域的概率为100％。In one embodiment, the text area probability map uses different pixel values to represent different probabilities to distinguish the text area of the image to be detected from the non-text area of the image to be detected. In one embodiment, the higher the pixel value of the image, the higher the probability that the area where the pixel is located belongs to the text area, and the lower the pixel value of the image, the lower the probability that the area where the pixel is located belongs to the text area. For example, a black pixel with a pixel value of 0 means that the probability that the area where the pixel is located belongs to the text area is 0, and a white pixel with a pixel value of 255 means that the probability that the area where the pixel is located belongs to the text area is 100%.

根据本发明的一个实施例，待检测图像的全图经由语义预测模型生成文字区域概率图。语义预测模型用于根据待检测图像的语义生成文字区域概率图，以预测待检测图像中的像素属于文字区域还是属于非文字区域。图像语义是图像的高层特征，其虽然以图像的色彩、纹理、形状等等底层特征为基础，但是与这些底层特征显著不同。图像语义作为知识信息的基本描述载体，能将完整的图像内容转换成可直观理解的类文本语言表达，在图像理解中起着至关重要的作用。图像理解输入的是图像数据，输出的是知识，其属于图像研究领域的高层内容。语义预测模型能够实现图像理解，其能够直接根据图像语义识别图像中的文字区域，这与基于阈值分割图像的各个模型显著不同。语义预测模型可以基于其对待检测图像的理解，根据待检测图像的语义，生成更理想的文字区域概率图，从而预测待检测图像中的像素属于文字区域还是属于非文字区域，以获得更合理的文字区域。According to an embodiment of the present invention, the full image of the image to be detected is used to generate a text area probability map through a semantic prediction model. The semantic prediction model is used to generate a text area probability map according to the semantics of the image to be detected, so as to predict whether the pixels in the image to be detected belong to the text area or the non-text area. Image semantics is a high-level feature of an image. Although it is based on the underlying features of the image such as color, texture, and shape, it is significantly different from these underlying features. As the basic description carrier of knowledge information, image semantics can convert the complete image content into an intuitively understandable text-like language expression, and plays a vital role in image understanding. The input of image understanding is image data, and the output is knowledge, which belongs to the high-level content in the field of image research. Semantic prediction models enable image understanding, which can directly identify text regions in images based on image semantics, which is significantly different from various models that segment images based on thresholds. The semantic prediction model can generate a more ideal text area probability map based on its understanding of the image to be detected and the semantics of the image to be detected, so as to predict whether the pixels in the image to be detected belong to the text area or non-text area, so as to obtain a more reasonable text area.

所述语义预测模型可以通过训练神经网络得到。神经网络可用于根据大量的输入估计一般的未知近似函数。神经网络能够机器学习，具有较强的自适应性质。经过训练的神经网络能够逼近一个任意函数，其能够从已知数据“学习”。由此，神经网络非常适用于经过训练用作语义预测模型，来识别待检测图像中的文字区域。在下文中将结合图9至图12对训练神经网络获得语义预测模型进行详细描述。The semantic prediction model can be obtained by training a neural network. Neural networks can be used to estimate general unknown approximate functions from a large number of inputs. Neural networks are capable of machine learning and have strong adaptive properties. A trained neural network is capable of approximating an arbitrary function that can "learn" from known data. As such, neural networks are well-suited for being trained as semantic prediction models to identify text regions in images to be detected. The training of the neural network to obtain the semantic prediction model will be described in detail below with reference to FIGS. 9 to 12 .

图3a和图3b、图4a和图4b、图5a和图5b、图6a和图6b分别是根据本发明的实施例的待检测图像的全图和经由语义预测模型生成的对应的文字区域概率图。图3a、图4a、图5a和图6a可以是待检测的图像的全图，并且，图像上含有文字区域，例如，图3a上的文字区域中包含的是中文；图4a图像上包含的文字区域包含中文和英文，并且如图4a所示，图4a中的文字区域的方向非水平；图5a的图像上的文字区域包含俄文；图6a图像的文字区域包含韩文。并且，可以看出，图3a、图4a、图5a和图6a的图像具有不同的背景，且背景比较复杂；并且，上述图像中的文字也具有多样性，例如这些文字具有不同的颜色、字体、语种和尺寸等信息。图3b、图4b、图5b和图6b分别示出所述图3a、图4a、图5a和图6a的待检测图像的全图经过语义预测模型之后所生成的文字区域概率图。生成的文字区域概率图使用不同的像素值表示不同的概率以区分待检测图像的文字区域和非文字区域。例如，使用像素值255的像素填充文字区域，表示该区域属于文字区域的概率最高，使用像素值0的像素填充非文字区域(例如，背景区域)，表示该区域属于文字区域的概率最低，从而区分出待检测图像中的文字区域和非文字区域。以图4b为例，图4b的文字区域概率图使用不同的像素值区分待检测图像4a中的文字区域和非文字区域。例如，使用具有像素值为255的像素填充待检测图像4a中的两个文字区域，“非授权请勿入内”和“AuthorizedPersonnelOnly”，从而得到如图4b所示出的文字区域概率图，并且，图4b的文字区域概率图也完整准确地示出了待检测图像4a中的文字区域的方向。Figure 3a and Figure 3b, Figure 4a and Figure 4b, Figure 5a and Figure 5b, Figure 6a and Figure 6b are the full image of the image to be detected and the corresponding text region probability generated by the semantic prediction model according to the embodiment of the present invention, respectively picture. Fig. 3a, Fig. 4a, Fig. 5a and Fig. 6a can be the full figure of the image to be detected, and, contain text region on the image, for example, what contain in the text region on Fig. 3a is Chinese; Fig. 4a text contained on the image The area contains Chinese and English, and as shown in Figure 4a, the direction of the text area in Figure 4a is non-horizontal; the text area on the image in Figure 5a contains Russian; the text area on the image in Figure 6a contains Korean. Moreover, it can be seen that the images in Figure 3a, Figure 4a, Figure 5a, and Figure 6a have different backgrounds, and the backgrounds are relatively complex; moreover, the characters in the above images also have diversity, for example, these characters have different colors and fonts , language, and size. Fig. 3b, Fig. 4b, Fig. 5b and Fig. 6b respectively show the text area probability maps generated after the full image of the image to be detected in Fig. 3a, Fig. 4a, Fig. 5a and Fig. 6a is passed through the semantic prediction model. The generated text area probability map uses different pixel values to represent different probabilities to distinguish the text area and non-text area of the image to be detected. For example, using a pixel with a pixel value of 255 to fill a text area indicates that the probability of this area belonging to a text area is the highest, and using a pixel with a pixel value of 0 to fill a non-text area (for example, a background area) indicates that the probability of this area belonging to a text area is the lowest. Distinguish the text area and non-text area in the image to be detected. Taking FIG. 4b as an example, the text area probability map in FIG. 4b uses different pixel values to distinguish text areas and non-text areas in the image to be detected 4a. For example, use pixels with a pixel value of 255 to fill the two text areas in the image to be detected 4a, "Unauthorized Do Not Enter" and "AuthorizedPersonnelOnly", thereby obtaining the text area probability map as shown in Figure 4b, and, The text area probability map in FIG. 4b also completely and accurately shows the direction of the text area in the image to be detected 4a.

在步骤S230中，对步骤S220所生成的文字区域概率图进行分割操作，以确定文字区域。因为文字区域概率图中的像素的数值可以表示该像素所在区域属于文字区域的概率，从而区分文字区域和非文字区域，所以可以根据底层特征(例如图像的灰度)对文字区域概率进行分割。In step S230, a segmentation operation is performed on the text area probability map generated in step S220 to determine the text area. Because the value of a pixel in the text area probability map can indicate the probability that the area where the pixel is located belongs to the text area, thereby distinguishing the text area from the non-text area, the text area probability can be segmented according to the underlying features (such as the grayscale of the image).

例如，该步骤S230可以通过对文字区域概率图进行二值化操作来获得文字区域。本发明中，由于期望区分文字区域和非文字区域(背景区域)，所以利用二值化操作即可实现该目的。二值化操作实现简单，计算量少并且速度快。For example, the step S230 can obtain the text area by performing a binarization operation on the text area probability map. In the present invention, since it is desired to distinguish the text area and the non-text area (background area), this purpose can be achieved by using the binarization operation. The binarization operation is simple to implement, with less calculation and high speed.

二值化操作可以是阈值分割操作。可选地，阈值T为可调参数。如果灰度值255表示属于文字区域的概率为100％，灰度值0表示属于文字区域的概率为0，那么可以将阈值设置为128。The binarization operation may be a threshold segmentation operation. Optionally, the threshold T is an adjustable parameter. If a grayscale value of 255 indicates that the probability of belonging to the text area is 100%, and a grayscale value of 0 indicates that the probability of belonging to the text area is 0, then the threshold can be set to 128.

二值化操作还可以是基于区域增长的分割操作。区域增长方法是根据同一物体区域内像素的相似性质来聚集像素的方法。具体地，从初始区域(例如，文字区域概率图中像素值较大的像素)开始，将相邻的具有同样性质(与当前像素的像素值的差比较小)的像素归并到目前的区域中从而逐步增长区域，直至没有可以归并的像素为止。The binarization operation can also be a segmentation operation based on region growing. The region growing method is a method of clustering pixels according to the similar properties of pixels in the same object region. Specifically, starting from the initial area (for example, the pixel with a larger pixel value in the text area probability map), the adjacent pixels with the same property (the difference between the pixel value of the current pixel and the current pixel is relatively small) are merged into the current area Thus, the area is gradually increased until there are no pixels that can be merged.

可以认为分割后所获得的图像中平均像素值较小的区域为非文字区域，其他区域为文字区域。下文中将结合具体的附图对二值化操作确定所述文字区域进行详细描述。It can be considered that the area with a smaller average pixel value in the image obtained after segmentation is a non-text area, and the other areas are text areas. The determination of the text region by the binarization operation will be described in detail below in conjunction with specific drawings.

本领域普通技术人员可以理解，上述方法200具有普适性。其可以用于任何图像的文字检测。该方法200可以针对文档图像进行文字检测和识别，文档图像例如证件和票据的照片、纸件文档的扫描件等。该方法200还可以针对自然场景图像进行文字检测和识别。Those of ordinary skill in the art can understand that the foregoing method 200 has universal applicability. It can be used for text detection in any image. The method 200 can perform text detection and recognition on document images, such as photos of certificates and bills, scanned copies of paper documents, and the like. The method 200 can also perform text detection and recognition on natural scene images.

本发明的上述方法200摒弃了基于滑动窗的检测方式以及基于连通分量的检测方式，采用了基于语义分割的全新检测方式。该方法200能够实现全图预测，即输入和输出都是整幅图像，而不是局部区域或窗口，因此可以更好地利用图像中的上下文信息，特别是自然场景图像中的上下文信息，从而得到更准确的文字检测结果。The above-mentioned method 200 of the present invention abandons the detection method based on sliding windows and the detection method based on connected components, and adopts a new detection method based on semantic segmentation. The method 200 can realize full-image prediction, that is, the input and output are the whole image, rather than local regions or windows, so the context information in the image, especially the context information in natural scene images, can be better utilized to obtain More accurate text detection results.

该方法200可以处理不同场景、不同质量的图像。该方法200可以在有效抑制复杂背景的干扰的同时，检测不同颜色、字体和尺寸的文字。该方法200可以自动预测文字行的方向，可以直接检测图像中不同方向的文字。该方法200对文字所属的语言不敏感，可以同时检测不同类别语言(如中文、英文、韩文等)对应的文字。此外，该方法200具有鲁棒性强的特点，可以有效应对噪声、模糊、复杂背景、非均匀光照等因素的干扰。The method 200 can handle images of different scenes and different qualities. The method 200 can detect characters of different colors, fonts and sizes while effectively suppressing interference from complex backgrounds. The method 200 can automatically predict the direction of a character line, and can directly detect characters in different directions in an image. The method 200 is not sensitive to the language to which the text belongs, and can simultaneously detect text corresponding to different types of languages (such as Chinese, English, Korean, etc.). In addition, the method 200 is characterized by strong robustness, and can effectively deal with interference from factors such as noise, blur, complex background, and non-uniform illumination.

图7示出了根据本发明一个实施例的获得所述待检测图像的流程图。Fig. 7 shows a flow chart of obtaining the image to be detected according to an embodiment of the present invention.

在步骤S710中，接收原始图像。在一个实施例中，原始图像可以具有复杂的背景信息，其包含的文字区域也可以具有多样性，例如文字区域可以包括有不同的颜色、字体、语种和尺寸等的文字信息。In step S710, an original image is received. In an embodiment, the original image may have complex background information, and the text areas contained therein may also have diversity. For example, the text areas may include text information in different colors, fonts, languages, and sizes.

在步骤S720中，对接收到的原始图像进行预处理，以获得待检测图像。在一个实施例中，可以将接收到的原始图像进行尺度归一化，即将原始图像的最大维度(例如，原始图像的高度和宽度中的较大者)缩放到预设尺寸，所述预设尺寸可以包括480、640、800、和960像素等。在尺度归一化操作之后得到的待检测图像的长宽比例与原始图像的长宽比例保持相同。In step S720, the received original image is preprocessed to obtain an image to be detected. In one embodiment, the received original image may be scale-normalized, that is, the maximum dimension of the original image (for example, the larger of the height and width of the original image) is scaled to a preset size, and the preset Dimensions can include 480, 640, 800, and 960 pixels, among others. The aspect ratio of the image to be detected obtained after the scale normalization operation is kept the same as that of the original image.

图8示例性地示出了根据本发明一个实施例的对文字区域概率图进行分割操作的方法的流程图。Fig. 8 exemplarily shows a flow chart of a method for segmenting a text region probability map according to an embodiment of the present invention.

在步骤S810中，对待检测图像的文字区域概率图进行二值化操作。In step S810, a binarization operation is performed on the probability map of the text region of the image to be detected.

可以理解，可以直接根据二值化操作的结果来获得文字区域。在本发明中，由于期望区分文字区域和非文字区域(背景区域)，所以利用二值化操作即可实现该目的。二值化操作实现简单，计算量少并且速度快。It can be understood that the text area can be obtained directly according to the result of the binarization operation. In the present invention, since it is desired to distinguish the text area from the non-text area (background area), this purpose can be achieved by using the binarization operation. The binarization operation is simple to implement, with less calculation and high speed.

二值化操作还可以是基于区域增长的分割操作。区域增长方法是根据同一物体区域内像素的相似性质来聚集像素的方法。具体地，从初始区域(例如，文字区域概率图中像素值较大的像素)开始，将相邻的具有同样性质(与当前像素的像素值的差比较小)的像素归并到目前的区域中从而逐步增长区域，直至没有可以归并的像素为止The binarization operation can also be a segmentation operation based on region growing. The region growing method is a method of clustering pixels according to the similar properties of pixels in the same object region. Specifically, starting from the initial area (for example, the pixel with a larger pixel value in the text area probability map), the adjacent pixels with the same property (the difference between the pixel value of the current pixel and the current pixel is relatively small) are merged into the current area Thus gradually growing the region until there are no pixels that can be merged

在图8所示的实施例中，在二值化操作之后，还包括步骤S820和步骤S830。In the embodiment shown in FIG. 8, after the binarization operation, step S820 and step S830 are further included.

在步骤S820中，确定二值化操作所获得的每个连通区域的轮廓。该步骤可以用现有的或未来研发的任何边缘检测方法来实现，例如基于诸如Sobel或Canny算子等各种边缘检测方法。In step S820, the contour of each connected region obtained by the binarization operation is determined. This step can be implemented by any existing or future developed edge detection method, for example based on various edge detection methods such as Sobel or Canny operator.

在步骤S830中，将每个连通区域的轮廓拟合为四边形以确定所述文字区域。在一个实施例中，所有四边形的内部区域可以作为文字区域。具体地，假设所有四边形组成的集合为B,B＝{b_k},k＝1,2,…Q,其中b_k表示拟合获得的四边形，Q表示四边形的数目，k为下标。则集合B即为文字检测的结果输出。In step S830, the outline of each connected region is fitted to a quadrilateral to determine the text region. In one embodiment, the inner areas of all quadrilaterals can be used as text areas. Specifically, it is assumed that the set of all quadrilaterals is B, B={b_k }, k=1, 2,...Q, where b_k represents the quadrilaterals obtained by fitting, Q represents the number of quadrilaterals, and k is the subscript. Then the set B is the result output of the text detection.

四边形围成的区域能够较好地包括任何方向、语言的文字，并且其计算简单。例如图6b的文字区域概率图所示，图像中噪声、图像中文字形状等种种原因可能导致文字区域概率图未能较理想地表示像素属于文字区域的概率。通过用四边形区域来拟合文字区域，可以进一步保证文字区域内包含全部文字内容，从而保证文字检测的精度。The area enclosed by the quadrilaterals can better include characters in any direction and language, and its calculation is simple. For example, as shown in the text area probability map in FIG. 6 b , various reasons such as noise in the image and the shape of the text in the image may cause the text area probability map to fail to ideally represent the probability that the pixel belongs to the text area. By fitting the text area with a quadrilateral area, it can be further ensured that the text area contains all text content, thereby ensuring the accuracy of text detection.

图9示例性地示出了根据本发明一个实施例的训练神经网络以得到语义预测模型的方法的流程图。该方法的目的在于从样本图像中学习语义预测模型，该模型可以有效区分待检测图像中的文字区域和非文字区域。Fig. 9 exemplarily shows a flowchart of a method for training a neural network to obtain a semantic prediction model according to an embodiment of the present invention. The purpose of this method is to learn a semantic prediction model from sample images, which can effectively distinguish text regions and non-text regions in the image to be detected.

样本图像是已知其中文字区域的图像。如上所述，神经网络具有“学习”能力，可以通过利用多个样本图像训练神经网络来获得可用的语义预测模型。在该实施例中，该训练方法使得语义预测模型能够根据待检测图像的语义，生成更准确的文字区域概率图，从而预测所述待检测图像中的像素属于文字区域还是非文字区域，从而，使得文字检测方法的检测结果的正确率更高。A sample image is an image in which text regions are known. As mentioned above, the neural network has a "learning" ability, and a usable semantic prediction model can be obtained by training the neural network with multiple sample images. In this embodiment, the training method enables the semantic prediction model to generate a more accurate text area probability map according to the semantics of the image to be detected, thereby predicting whether the pixels in the image to be detected belong to a text area or a non-text area, thus, Therefore, the correct rate of the detection result of the text detection method is higher.

本领域普通技术人员可以理解，对于文字检测系统来说，该语义预测模型可以预先存储于其中。Those of ordinary skill in the art can understand that, for the text detection system, the semantic prediction model can be pre-stored therein.

在步骤S910中，接收多个样本图像和其标注信息。In step S910, a plurality of sample images and their annotation information are received.

在一个实施例中，可以从不同来源采集大量包含文字的各种图像作为样本图像，例如，自然场景图像。期望样本图像种类丰富且数目较多，以获得理想的语义预测模型。在一个实施例中，样本图像的数目不少于1000。In one embodiment, a large number of various images containing text may be collected from different sources as sample images, for example, images of natural scenes. It is expected that the sample images are rich in variety and number in order to obtain an ideal semantic prediction model. In one embodiment, the number of sample images is not less than 1000.

可以使用多边形在每个样本图像中标注所述样本图像中的所有文字区域，从而获得样本图像的标注信息。标注的基本文字单位可以是文字行或单词。样本图像中文字区域的标注信息可以以多边形(例如，四边形)的形式保存。具体地，在一个实施例中，可以仅保存四边形的四个顶点的坐标。以四边形的形状保存标注信息不仅可以满足任何方向、语言的文字，而且便于计算。All text regions in each sample image may be marked with polygons, so as to obtain the mark information of the sample image. The basic text unit of a label can be a text line or a word. The annotation information of the text area in the sample image may be stored in the form of polygons (for example, quadrilaterals). Specifically, in one embodiment, only the coordinates of the four vertices of the quadrilateral may be saved. Preserving annotation information in the shape of a quadrilateral not only satisfies text in any direction and language, but also facilitates calculations.

图10a、图10b、图10c和图10d分别示出了根据本发明一个实施例的经标注的具有标注信息的样本图像。如这些图中所示出的，可以用四边形(图中浅色四边形)标注样本图像中的文字区域，且该标注区域适用于任意字体、语种、以及文字的方向。Fig. 10a, Fig. 10b, Fig. 10c, and Fig. 10d respectively show annotated sample images with annotation information according to an embodiment of the present invention. As shown in these figures, the text area in the sample image can be marked with a quadrilateral (the light-colored quadrilateral in the figure), and the marked area is applicable to any font, language, and direction of the text.

在步骤S920中，根据所述样本图像和其标注信息生成样本图像的掩膜图。具体地，对于样本图像I和对应的标注信息a，生成一幅与样本图像I大小一致的掩膜图。在一个实施例中，所述掩膜图可以包括二值掩膜图R。在所述二值掩膜图R中，使用不同的像素值区分样本图像的文字区域和非文字区域。在一个实施例中，对于样本图像I，使用具有第一像素值的像素填充标注信息所标注的文字区域，使用具有第二像素值的像素填充非文字区域，从而生成二值掩膜图R，其中，第一像素值和第二像素值不同，以区分所述文字区域和非文字区域。例如，在二值掩膜图R中，所标注的文字区域(也即使用四边形标注的内部区域)的像素值被填充为255，而非文字区域的像素值被填充为0。In step S920, a mask map of the sample image is generated according to the sample image and its annotation information. Specifically, for the sample image I and the corresponding label information a, a mask image with the same size as the sample image I is generated. In one embodiment, the mask map may include a binary mask map R. In the binary mask image R, different pixel values are used to distinguish text areas and non-text areas of the sample image. In one embodiment, for the sample image I, pixels with a first pixel value are used to fill the text area marked by the annotation information, and pixels with a second pixel value are used to fill the non-text area, thereby generating a binary mask image R, Wherein, the first pixel value and the second pixel value are different to distinguish the text area from the non-text area. For example, in the binary mask image R, the pixel value of the marked text area (that is, the inner area marked with a quadrilateral) is filled with 255, and the pixel value of the non-text area is filled with 0.

图11a和图11b分别示出了根据本发明一个实施例的经标注的具有标注信息的样本图像和其对应的掩膜图。如图11a所示，使用四边形将原样本图像的文字部分(例如，“海淀建设证券”、“海淀中街”、“HAIDIANZHONGJIE”、“海淀南路”)标注出来，并据此生成图11b中所示的掩模图。其中，使用具有像素值为255的像素填充标注出来的文字部分，使用像素值为0的像素填充非文字部分，从而得到图11b中所示的掩膜图。Fig. 11a and Fig. 11b respectively show annotated sample images with annotation information and their corresponding mask images according to an embodiment of the present invention. As shown in Figure 11a, the text part of the original sample image (for example, "Haidian Jianshe Securities", "Haidian Middle Street", "HAIDIANZHONGJIE", "Haidian South Road") is marked with a quadrilateral, and the text in Figure 11b is generated accordingly The mask image shown. Among them, the marked text part is filled with pixels with a pixel value of 255, and the non-text part is filled with pixels with a pixel value of 0, so as to obtain the mask image shown in FIG. 11 b .

在步骤S930中，利用样本图像和其掩膜图构建训练集，并训练神经网络，以获得语义预测模型M。原始的样本图像和其对应的掩膜图构成训练样本集S。S＝{(I_i,R_i)},i＝1,2,...,N，其中I_i表示原始的样本图像，R_i为原始的样本图像I_i对应的掩膜图，N为训练样本集S中样本图像的数目，i为下标。In step S930, a training set is constructed using sample images and their mask images, and a neural network is trained to obtain a semantic prediction model M. The original sample image and its corresponding mask image constitute the training sample set S. S={(I_i ,R_i )}, i=1,2,...,N, where I_i represents the original sample image, R_i is the mask image corresponding to the original sample image I_i , and N is The number of sample images in the training sample set S, i is the subscript.

在一个实施例中，神经网络可以包括全卷积神经网络。全卷积神经网络是一类特殊的神经网络，其特点在于从输入到输出的所有包含可学参数的层都是卷积层(convolutionallayer)。全卷积神经网络避免了对图像的复杂前期预处理，可以直接输入原始图像，其特别适用于对具有复杂背景的图像的分析处理，可以使图像的文字检测结果更准确。In one embodiment, the neural network may comprise a fully convolutional neural network. The fully convolutional neural network is a special type of neural network, which is characterized in that all layers containing learnable parameters from input to output are convolutional layers. The fully convolutional neural network avoids the complicated preprocessing of the image, and can directly input the original image. It is especially suitable for the analysis and processing of images with complex backgrounds, and can make the text detection results of images more accurate.

根据本发明一个具体实施例，可以采用一个由13个层构成的全卷积神经网络。图12示出了该全卷积神经网络的示意图。According to a specific embodiment of the present invention, a fully convolutional neural network composed of 13 layers can be used. Figure 12 shows a schematic diagram of the fully convolutional neural network.

在该全卷积神经网络中除了包括卷积层，还包括最大池化层。最大池化层隔开连续的卷积层，其可以有效减少计算量，同时增强神经网络的鲁棒性。In addition to the convolutional layer, the fully convolutional neural network also includes a maximum pooling layer. The maximum pooling layer separates consecutive convolutional layers, which can effectively reduce the amount of computation while enhancing the robustness of the neural network.

该全卷积神经网络的输入为原始图像数据。如图12所示，该全卷积神经网络包括第一卷积层和第二卷积层，其中滤波器的数目可以为64，滤波器大小可以为3x3。第二卷积层连接第一最大池化层(maxpoollayer)。接下来是第三卷积层和第四卷积层，其中滤波器的数目可以为128，滤波器大小可以为3x3。第四卷积层连接第二最大池化层。接下来是第五卷积层、第六卷积层和第七卷积层，其中滤波器的数目为256，滤波器大小为3x3。第七卷积层连接第三最大池化层。接下来是第八卷积层、第九卷积层和第十卷积层，其中滤波器的数目可以为512，滤波器大小可以为3x3。第十卷积层连接第四最大池化层。接下来是第十一卷积层、第十二卷积层和第十三卷积层，其中滤波器的数目可以为512，滤波器大小可以为3x3。The input of this fully convolutional neural network is raw image data. As shown in FIG. 12 , the fully convolutional neural network includes a first convolutional layer and a second convolutional layer, where the number of filters can be 64, and the filter size can be 3x3. The second convolutional layer is connected to the first maxpool layer. Next is the third convolutional layer and the fourth convolutional layer, where the number of filters can be 128 and the filter size can be 3x3. The fourth convolutional layer is connected to the second max pooling layer. Next is the fifth convolutional layer, the sixth convolutional layer and the seventh convolutional layer, where the number of filters is 256 and the filter size is 3x3. The seventh convolutional layer is connected to the third max pooling layer. Next is the eighth convolutional layer, the ninth convolutional layer and the tenth convolutional layer, where the number of filters can be 512, and the filter size can be 3x3. The tenth convolutional layer is connected to the fourth max pooling layer. Next is the eleventh convolutional layer, the twelfth convolutional layer and the thirteenth convolutional layer, where the number of filters may be 512 and the filter size may be 3x3.

在训练过程中，每次将一个样本图像和对应的掩膜图输入到全卷积神经网络中，初始学习率可以为0.00000001，每经过10000轮迭代，学习率降为原来的1/10。当迭代100000轮后，训练过程可以终止。训练过程终止时所获得的全卷积神经网络即为期望的语义预测模型。经由所述训练好的语义预测模型，可以根据待检测图像的语义生成待检测图像的全图的文字区域概率图，从而预测待检测图像中的文字区域。During the training process, each time a sample image and the corresponding mask image are input into the fully convolutional neural network, the initial learning rate can be 0.00000001, and after every 10,000 iterations, the learning rate is reduced to 1/10 of the original. After 100,000 iterations, the training process can be terminated. The fully convolutional neural network obtained at the end of the training process is the desired semantic prediction model. Through the trained semantic prediction model, a text area probability map of the entire image of the image to be detected can be generated according to the semantics of the image to be detected, so as to predict the text area in the image to be detected.

本领域普通技术人员可以理解，虽然上面以13层的全卷积神经网络为例来说明，但是全卷积神经网络的层数可以是包括6到19之间的任意数。这个范围的层数权衡了计算结果准确性和计算量这两个方面。此外，上面所述的滤波器的数目和大小也仅为示例，而非限制。例如滤波器的数目还可以是100、500或1000等，滤波器的大小还可以是1x1或5x5。Those of ordinary skill in the art can understand that although the 13-layer fully convolutional neural network is used as an example for illustration above, the number of layers of the fully convolutional neural network can be any number between 6 and 19. The number of layers in this range weighs the accuracy of the calculation results and the amount of calculations. In addition, the number and size of the filters described above are only examples and not limitations. For example, the number of filters may also be 100, 500 or 1000, etc., and the size of the filters may also be 1x1 or 5x5.

根据本发明另一方面，还提供了一种文字检测装置。图13示出了根据本发明一个实施例的文字检测装置1300的示意性框图。如图13所示，文字检测装置1300包括语义分析模块1330和分割模块1340。在根据本发明的一个实施例中，所述语义分析模块1330还包括语义预测模型1350。According to another aspect of the present invention, a character detection device is also provided. Fig. 13 shows a schematic block diagram of a text detection device 1300 according to an embodiment of the present invention. As shown in FIG. 13 , the text detection device 1300 includes a semantic analysis module 1330 and a segmentation module 1340 . In an embodiment according to the present invention, the semantic analysis module 1330 further includes a semantic prediction model 1350 .

语义分析模块1330用于接收待检测图像，并使用语义预测模型1350生成所述待检测图像的全图的文字区域概率图。语义预测模型用于根据待检测图像的语义生成文字区域概率图，以预测所述待检测图像中的像素属于文字区域还是属于非文字区域。所述文字区域概率图使用不同的像素值表示不同的概率以区分所述待检测图像的文字区域和所述待检测图像的非文字区域。The semantic analysis module 1330 is used to receive the image to be detected, and use the semantic prediction model 1350 to generate a text area probability map of the entire image of the image to be detected. The semantic prediction model is used to generate a text area probability map according to the semantics of the image to be detected, so as to predict whether the pixels in the image to be detected belong to the text area or belong to the non-text area. The text area probability map uses different pixel values to represent different probabilities to distinguish the text area of the image to be detected from the non-text area of the image to be detected.

在一个实施例中，待检测图像可以是原始图像，也可以是对原始图像进行预处理后得到的图像。In an embodiment, the image to be detected may be an original image, or an image obtained by preprocessing the original image.

在一个实施例中，语义预测模型1350可以通过训练神经网络得到。在下文中将结合图14对训练神经网络获得语义预测模型1350进行详细描述。In one embodiment, the semantic prediction model 1350 can be obtained by training a neural network. The training of the neural network to obtain the semantic prediction model 1350 will be described in detail below with reference to FIG. 14 .

结合图3a和图3b、图4a和图4b、图5a和图5b、图6a和图6b描述文字区域概率图。图3a和图3b、图4a和图4b、图5a和图5b、图6a和图6b分别是根据本发明的实施例的待检测图像的全图和经由语义预测模型1350生成的对应的文字区域概率图。图3a、图4a、图5a和图6a可以是待检测的图像的全图，并且，图像上含有文字区域，图3b、图4b、图5b和图6b示出所述图3a、图4a、图5a和图6a的待检测图像的全图经过语义预测模型1350之后所生成的文字区域概率图。生成的文字区域概率图使用不同的像素值表示不同的概率以区分待检测图像的文字区域和非文字区域。例如，使用像素值255的像素填充文字区域，表示该区域属于文字区域的概率最高，使用像素值0的像素填充非文字区域(例如，背景区域)，表示该区域属于文字区域的概率最低，从而区分出待检测图像中的文字区域和非文字区域。以图4b为例，图4b的文字区域概率图使用不同的像素值区分待检测图像4a中的文字区域和非文字区域。例如，使用具有像素值为255的像素填充待检测图像4a中的两个文字区域，“非授权请勿入内”和“AuthorizedPersonnelOnly”，从而得到如图4b所示出的文字区域概率图，并且，图4b的文字区域概率图也完整准确地示出了原待检测图像4a中的文字区域的方向Text region probability maps are described in conjunction with FIGS. 3a and 3b , FIGS. 4a and 4b , FIGS. 5a and 5b , and FIGS. 6a and 6b . Figure 3a and Figure 3b, Figure 4a and Figure 4b, Figure 5a and Figure 5b, Figure 6a and Figure 6b are the full image of the image to be detected and the corresponding text area generated by the semantic prediction model 1350 according to the embodiment of the present invention, respectively Probability map. Fig. 3a, Fig. 4a, Fig. 5a and Fig. 6a can be the full figure of the image to be detected, and, contain text region on the image, Fig. 3b, Fig. 4b, Fig. 5b and Fig. 6b show that described Fig. 3a, Fig. 4a, The text area probability map generated after the full image of the image to be detected in Fig. 5a and Fig. 6a passes through the semantic prediction model 1350. The generated text area probability map uses different pixel values to represent different probabilities to distinguish the text area and non-text area of the image to be detected. For example, using a pixel with a pixel value of 255 to fill a text area indicates that the probability of this area belonging to a text area is the highest, and using a pixel with a pixel value of 0 to fill a non-text area (for example, a background area) indicates that the probability of this area belonging to a text area is the lowest. Distinguish the text area and non-text area in the image to be detected. Taking FIG. 4b as an example, the text area probability map in FIG. 4b uses different pixel values to distinguish text areas and non-text areas in the image to be detected 4a. For example, use pixels with a pixel value of 255 to fill the two text areas in the image to be detected 4a, "Unauthorized Do Not Enter" and "AuthorizedPersonnelOnly", thereby obtaining the text area probability map as shown in Figure 4b, and, The text area probability map in Figure 4b also completely and accurately shows the direction of the text area in the original image 4a to be detected

分割模块1340用于对所述文字区域概率图进行分割操作，以确定文字区域。因为文字区域概率图中的像素的数值可以表示该像素属于文字区域的概率，所以可以根据底层特征(例如图像的灰度)对文字区域概率进行分割。The segmentation module 1340 is used to perform a segmentation operation on the text region probability map to determine the text region. Because the value of a pixel in the text area probability map can represent the probability that the pixel belongs to the text area, the text area probability can be segmented according to the underlying features (such as the grayscale of the image).

例如，分割模块1340可以通过对文字区域概率图进行二值化操作来获得文字区域。本发明中，由于期望区分文字区域和非文字区域(背景区域)，所以利用二值化操作即可实现该目的。二值化操作实现简单，计算量少并且速度快。For example, the segmentation module 1340 can obtain the text area by performing a binarization operation on the text area probability map. In the present invention, since it is desired to distinguish the text area and the non-text area (background area), this purpose can be achieved by using the binarization operation. The binarization operation is simple to implement, with less calculation and high speed.

二值化操作之后，所述分割模块1340还可以用于确定二值化操作所获得的每个连通区域的轮廓。可以用现有的或未来研发的任何边缘检测方法来实现，例如基于诸如Sobel或Canny算子等各种边缘检测方法。分割模块1340还可以用于将每个连通区域的轮廓拟合为四边形以确定所述文字区域。在一个实施例中，所有四边形的内部区域可以作为文字区域。具体地，假设所有四边形组成的集合为B,B＝{b_k},k＝1,2,…Q,其中b_k表示拟合获得的四边形，Q表示四边形的数目，k为下标。则集合B即为文字检测的结果输出。After the binarization operation, the segmentation module 1340 can also be used to determine the contour of each connected region obtained by the binarization operation. It can be realized by any existing or future developed edge detection method, for example, based on various edge detection methods such as Sobel or Canny operator. The segmentation module 1340 can also be used to fit the outline of each connected region to a quadrilateral to determine the text region. In one embodiment, the inner areas of all quadrilaterals can be used as text areas. Specifically, it is assumed that the set of all quadrilaterals is B, B={b_k }, k=1, 2,...Q, where b_k represents the quadrilaterals obtained by fitting, Q represents the number of quadrilaterals, and k is the subscript. Then the set B is the result output of the text detection.

在一个实施例中，可以认为分割模块1340分割后所获得的图像中平均像素值较小的区域为非文字区域，其他区域为文字区域。In one embodiment, it can be considered that in the image obtained after segmentation by the segmentation module 1340, the area with a smaller average pixel value is a non-text area, and the other areas are text areas.

图14示出了根据本发明另一实施例的文字检测装置1400的示意性框图。文字检测装置1400中的语义分析模块1330与文字检测装置1300中的语义分析模块1330类似，文字检测装置1400中的分割模块1340与文字检测装置1300中的分割模块1340类似，为了简洁，在此不再赘述。Fig. 14 shows a schematic block diagram of a text detection device 1400 according to another embodiment of the present invention. The semantic analysis module 1330 in the text detection device 1400 is similar to the semantic analysis module 1330 in the text detection device 1300, and the segmentation module 1340 in the text detection device 1400 is similar to the segmentation module 1340 in the text detection device 1300. Let me repeat.

与文字检测装置1300相比，文字检测装置1400增加了图像预处理模块1410和训练模块1420。Compared with the text detection device 1300 , the text detection device 1400 adds an image preprocessing module 1410 and a training module 1420 .

根据本发明的实施例，所述图像预处理模块1410接收原始图像。在一个实施例中，原始图像可以具有复杂的背景信息，可以包括具有多样性的文字区域，例如，有不同的颜色、字体、语种和尺寸的文字信息。According to an embodiment of the present invention, the image preprocessing module 1410 receives an original image. In an embodiment, the original image may have complex background information and may include text areas with diversity, for example, text information in different colors, fonts, languages and sizes.

图像预处理模块1410对接收到的原始图像进行预处理。在一个实施例中，图像预处理模块1410可以对接收到的原始图像进行尺度归一化，即将原始图像的最大维度(例如，原始图像的高度和宽度中的较大者)缩放到预设尺寸，所述预设尺寸可以包括480、640、800、和960像素等。并且，预处理之后得到的图像的长宽比例与所述原始图像的长宽比例保持相同。The image preprocessing module 1410 preprocesses the received original image. In one embodiment, the image preprocessing module 1410 can perform scale normalization on the received original image, that is, scale the maximum dimension of the original image (for example, the larger of the height and width of the original image) to a preset size , the preset size may include 480, 640, 800, and 960 pixels, etc. Moreover, the aspect ratio of the image obtained after the preprocessing is kept the same as the aspect ratio of the original image.

经过预处理之后，图像预处理模块1410得到所述待检测图像并将所述待检测图像的全图输出至所述语义分析模块1330进行处理。其中，根据上文的描述，所述待检测图像具有预设尺寸大小，并且所述待检测图像的长宽比例与所述原始图像的长宽比例相同。After preprocessing, the image preprocessing module 1410 obtains the image to be detected and outputs the full image of the image to be detected to the semantic analysis module 1330 for processing. Wherein, according to the above description, the image to be detected has a preset size, and the aspect ratio of the image to be detected is the same as that of the original image.

根据本发明的一个实施例，训练模块1420用于利用多个样本图像训练神经网络，以获得语义预测模型1350，该模型可以有效地区分待检测图像中的文字区域和非文字区域。According to an embodiment of the present invention, the training module 1420 is used to train a neural network using a plurality of sample images to obtain a semantic prediction model 1350, which can effectively distinguish text regions and non-text regions in the image to be detected.

在一个实施例中，训练模块1420可以从不同来源采集大量包含文字的各种图像作为样本图像并接收样本图像的标注信息。样本图像例如是自然场景图像。期望样本图像种类丰富且数目较多，以获得理想的语义预测模型。在一个实施例中，样本图像的数目不少于1000。In one embodiment, the training module 1420 may collect a large number of various images containing text from different sources as sample images and receive annotation information of the sample images. The sample images are, for example, natural scene images. It is expected that the sample images are rich in variety and number in order to obtain an ideal semantic prediction model. In one embodiment, the number of sample images is not less than 1000.

每个样本图像中的所有文字区域可以使用多边形在该样本图像中标注。标注的基本文字单位可以是文字行或单词。样本图像中文字区域的标注信息可以以多边形(例如，四边形)的形式保存。具体地，在一个实施例中，可以仅保存四边形的四个顶点的坐标。以四边形的形状保存标注信息不仅可以满足任何方向、语言的文字，而且便于计算。All text regions in each sample image can be marked in the sample image using polygons. The basic text unit of a label can be a text line or a word. The annotation information of the text area in the sample image may be stored in the form of polygons (for example, quadrilaterals). Specifically, in one embodiment, only the coordinates of the four vertices of the quadrilateral may be saved. Preserving annotation information in the shape of a quadrilateral not only satisfies text in any direction and language, but also facilitates calculations.

训练模块1420还用于根据样本图像和其标注信息生成样本图像的掩膜图。在一个实施例中，所述掩膜图包括二值掩膜图。具体地，对于样本图像I和对应的标注信息a，训练模块1420生成一幅与样本图像I大小一致的掩膜图，例如，二值掩膜图R。二值掩膜图R使用不同的像素值区分样本图像的文字区域和非文字区域。在一个实施例中，对于样本图像I，使用具有第一像素值的像素填充所标注的文字区域，使用具有第二像素值的像素填充非文字区域，从而生成所述掩膜图，其中，第一像素值和第二像素值不同，以区分所述文字区域和非文字区域。例如，所标注的文字区域(也即使用四边形标注的内部区域)的像素值被填充为255，而非文字区域的像素值被填充为0。The training module 1420 is also used to generate a mask map of the sample image according to the sample image and its annotation information. In one embodiment, the mask map includes a binary mask map. Specifically, for the sample image I and the corresponding label information a, the training module 1420 generates a mask map with the same size as the sample image I, for example, a binary mask map R. The binary mask R uses different pixel values to distinguish the text area and non-text area of the sample image. In one embodiment, for the sample image I, pixels with the first pixel value are used to fill the marked text area, and pixels with the second pixel value are used to fill the non-text area, thereby generating the mask map, wherein, the first The first pixel value and the second pixel value are different to distinguish the text area from the non-text area. For example, the pixel value of the marked text area (that is, the inner area marked with a quadrilateral) is filled with 255, and the pixel value of the non-text area is filled with 0.

训练模块1420进一步用于利用样本图像和其掩膜图构建训练集，并训练神经网络，以获得语义预测模型1350。具体地，原始的样本图像和其对应的掩膜图构成的训练样本集为S。S＝{(I_i,R_i)},i＝1,2,...,N，其中I_i表示原始的样本图像，R_i为原始的样本图像I_i对应的掩膜图，N为训练样本集S中样本图像的数目，i为下标。The training module 1420 is further used to construct a training set by using the sample image and its mask image, and train the neural network to obtain the semantic prediction model 1350 . Specifically, the training sample set composed of the original sample image and its corresponding mask image is S. S={(I_i ,R_i )}, i=1,2,...,N, where I_i represents the original sample image, R_i is the mask image corresponding to the original sample image I_i , and N is The number of sample images in the training sample set S, i is the subscript.

在一个实施例中，神经网络可以是全卷积神经网络。全卷积神经网络是一类特殊的神经网络，其特点在于从输入到输出的所有包含可学参数的层都是卷积层。全卷积神经网络避免了对图像的复杂前期预处理，可以直接输入原始图像，其特别适用于对具有复杂背景的图像的分析处理，可以使图像的文字检测结果更准确。In one embodiment, the neural network may be a fully convolutional neural network. The fully convolutional neural network is a special type of neural network, which is characterized in that all layers containing learnable parameters from input to output are convolutional layers. The fully convolutional neural network avoids the complicated preprocessing of the image, and can directly input the original image. It is especially suitable for the analysis and processing of images with complex backgrounds, and can make the text detection results of images more accurate.

训练模块1420将训练样本集S输入全卷积神经网络进行训练，以得到语义预测模型1350。根据本发明一个具体实施例，可以采用一个由13个层构成的全卷积神经网络。图12示出了该全卷积神经网络的示意图。The training module 1420 inputs the training sample set S into the fully convolutional neural network for training to obtain the semantic prediction model 1350 . According to a specific embodiment of the present invention, a fully convolutional neural network composed of 13 layers can be used. Figure 12 shows a schematic diagram of the fully convolutional neural network.

该全卷积神经网络的输入为原始图像数据。如图12所示，该全卷积神经网络包括第一卷积层和第二卷积层，其中滤波器的数目可以为64，滤波器大小可以为3x3。第二卷积层连接第一最大池化层。接下来是第三卷积层和第四卷积层，其中滤波器的数目可以为128，滤波器大小可以为3x3。第四卷积层连接第二最大池化层。接下来是第五卷积层、第六卷积层和第七卷积层，其中滤波器的数目为256，滤波器大小为3x3。第七卷积层连接第三最大池化层。接下来是第八卷积层、第九卷积层和第十卷积层，其中滤波器的数目可以为512，滤波器大小可以为3x3。第十卷积层连接第四最大池化层。接下来是第十一卷积层、第十二卷积层和第十三卷积层，其中滤波器的数目可以为512，滤波器大小可以为3x3。The input of this fully convolutional neural network is raw image data. As shown in FIG. 12 , the fully convolutional neural network includes a first convolutional layer and a second convolutional layer, where the number of filters can be 64, and the filter size can be 3x3. The second convolutional layer is connected to the first max pooling layer. Next is the third convolutional layer and the fourth convolutional layer, where the number of filters can be 128 and the filter size can be 3x3. The fourth convolutional layer is connected to the second max pooling layer. Next is the fifth convolutional layer, the sixth convolutional layer and the seventh convolutional layer, where the number of filters is 256 and the filter size is 3x3. The seventh convolutional layer is connected to the third max pooling layer. Next is the eighth convolutional layer, the ninth convolutional layer and the tenth convolutional layer, where the number of filters can be 512, and the filter size can be 3x3. The tenth convolutional layer is connected to the fourth max pooling layer. Next is the eleventh convolutional layer, the twelfth convolutional layer and the thirteenth convolutional layer, where the number of filters may be 512 and the filter size may be 3x3.

在训练过程中，每次将一个样本图像和对应的掩膜图输入到全卷积神经网络中，初始学习率可以为0.00000001，每经过10000轮迭代，学习率降为原来的1/10。当迭代100000轮后，训练过程可以终止。训练过程终止时所获得的全卷积神经网络即为期望的语义预测模型。经由所述训练好的语义预测模型，可以根据待检测图像的语义生成文字区域概率图，从而预测待检测图像中的文字区域。During the training process, each time a sample image and the corresponding mask image are input into the fully convolutional neural network, the initial learning rate can be 0.00000001, and after every 10,000 iterations, the learning rate is reduced to 1/10 of the original. After 100,000 iterations, the training process can be terminated. The fully convolutional neural network obtained at the end of the training process is the desired semantic prediction model. Through the trained semantic prediction model, a text area probability map can be generated according to the semantics of the image to be detected, so as to predict the text area in the image to be detected.

经过训练模块1420利用多个样本图像训练神经网络而获得的语义预测模型1350可以有效地区分待检测图像中的文字区域和非文字区域。The semantic prediction model 1350 obtained through the training module 1420 using a plurality of sample images to train the neural network can effectively distinguish text regions and non-text regions in the image to be detected.

图15示出了根据本发明实施例的文字检测系统1500的示意性框图。如图15所示，文字检测系统1500包括处理器1510、存储器1520以及在所述存储器1520中存储的程序指令1530。Fig. 15 shows a schematic block diagram of a text detection system 1500 according to an embodiment of the present invention. As shown in FIG. 15 , the text detection system 1500 includes a processor 1510 , a memory 1520 and program instructions 1530 stored in the memory 1520 .

所述程序指令1530在所述处理器1510运行时可以实现根据本发明实施例的文字检测装置的各个功能模块的功能，并且/或者可以执行根据本发明实施例的文字检测方法的各个步骤。When the processor 1510 is running, the program instructions 1530 can realize the functions of each functional module of the text detection device according to the embodiment of the present invention, and/or can execute various steps of the text detection method according to the embodiment of the present invention.

具体地，在所述程序指令1530被所述处理器1510运行时，执行以下步骤：接收待检测图像；经由语义预测模型生成所述待检测图像的全图的文字区域概率图，其中，所述文字区域概率图使用不同的像素值区分所述待检测图像的文字区域和所述待检测图像的非文字区域；以及对所述文字区域概率图进行分割操作，以确定所述文字区域。语义预测模型用于根据图像的语义预测所述待检测图像中的像素属于文字区域还是属于非文字区域。Specifically, when the program instruction 1530 is executed by the processor 1510, the following steps are performed: receiving the image to be detected; generating a text region probability map of the full image of the image to be detected via a semantic prediction model, wherein the The character region probability map uses different pixel values to distinguish the character region of the image to be detected from the non-character region of the image to be detected; and a segmentation operation is performed on the character region probability map to determine the character region. The semantic prediction model is used to predict whether a pixel in the image to be detected belongs to a text area or a non-text area according to the semantics of the image.

此外，在所述程序指令1530被所述处理器1510运行时，还执行以下步骤：接收原始图像；以及对所述原始图像进行预处理，以获得所述待检测图像，其中，所述待检测图像具有预设尺寸大小，并且所述待检测图像的长宽比例与所述原始图像的长宽比例相同。In addition, when the program instruction 1530 is executed by the processor 1510, the following steps are also performed: receiving an original image; and performing preprocessing on the original image to obtain the image to be detected, wherein the image to be detected The image has a preset size, and the aspect ratio of the image to be detected is the same as that of the original image.

此外，在所述程序指令1530被所述处理器1510运行时所执行的对所述文字区域概率图进行分割操作以确定所述文字区域的步骤包括：对所述文字区域概率图进行二值化操作，以确定所述文字区域。In addition, when the program instruction 1530 is executed by the processor 1510, the step of segmenting the text area probability map to determine the text area includes: binarizing the text area probability map Operation to determine the text area.

此外，在所述程序指令1530被所述处理器1510运行时所执行的对所述文字区域概率图进行二值化操作以确定所述文字区域的步骤包括：确定所述二值化操作所获得的每个连通区域的轮廓；以及将所述轮廓拟合为四边形，其中，所述四边形内部区域为所述文字区域。In addition, when the program instruction 1530 is executed by the processor 1510, the step of performing a binarization operation on the text region probability map to determine the text region includes: determining the obtained The outline of each connected region of ; and fitting the outline to a quadrilateral, wherein the inner area of the quadrilateral is the text area.

此外，在所述程序指令1530被所述处理器1510运行时，还执行以下步骤：利用多个样本图像训练神经网络，以获得所述语义预测模型。In addition, when the program instruction 1530 is executed by the processor 1510, the following step is further performed: using a plurality of sample images to train a neural network to obtain the semantic prediction model.

此外，在所述程序指令1530被所述处理器1510运行时，所执行的利用多个样本图像训练神经网络以获得所述语义预测模型的步骤包括：接收所述样本图像和所述样本图像的标注信息；根据所述样本图像和所述样本图像的标注信息生成所述样本图像的掩膜图；以及利用所述样本图像和所述掩膜图训练所述神经网络，以获得所述语义预测模型。In addition, when the program instruction 1530 is executed by the processor 1510, the executed step of using a plurality of sample images to train a neural network to obtain the semantic prediction model includes: receiving the sample image and the labeling information; generating a mask map of the sample image according to the sample image and the label information of the sample image; and using the sample image and the mask map to train the neural network to obtain the semantic prediction Model.

此外，在所述程序指令1530被所述处理器1510运行执行利用多个样本图像训练神经网络以获得所述语义预测模型的步骤中，所述掩膜图包括二值掩膜图，并且所述二值掩膜图使用不同的像素值区分所述样本图像的文字区域和非文字区域。In addition, in the step that the program instruction 1530 is executed by the processor 1510 to train a neural network using a plurality of sample images to obtain the semantic prediction model, the mask map includes a binary mask map, and the The binary mask map uses different pixel values to distinguish text areas and non-text areas of the sample image.

此外，在所述程序指令1530被所述处理器1510运行执行利用多个样本图像训练神经网络以获得所述语义预测模型的步骤中，所述神经网络包括全卷积神经网络。In addition, in the step of using a plurality of sample images to train a neural network to obtain the semantic prediction model when the program instructions 1530 are executed by the processor 1510, the neural network includes a fully convolutional neural network.

此外，在所述程序指令1530被所述处理器1510运行执行利用多个样本图像训练神经网络以获得所述语义预测模型的步骤中，所述全卷积神经网络的层数包括6到19之间的任意数。In addition, in the step that the program instructions 1530 are executed by the processor 1510 to train a neural network using a plurality of sample images to obtain the semantic prediction model, the number of layers of the fully convolutional neural network includes between 6 and 19 any number in between.

此外，根据本发明实施例，还提供了一种存储介质，在所述存储介质上存储了程序指令，在所述程序指令被计算机或处理器运行时用于执行本发明实施例的文字检测方法的相应步骤，并且用于实现根据本发明实施例的文字检测装置中的相应模块。所述存储介质例如可以包括智能电话的存储卡、平板电脑的存储部件、个人计算机的硬盘、只读存储器(ROM)、可擦除可编程只读存储器(EPROM)、便携式紧致盘只读存储器(CD-ROM)、USB存储器、或者上述存储介质的任意组合。所述计算机可读存储介质可以是一个或多个计算机可读存储介质的任意组合，例如一个计算机可读存储介质包含用于训练神经网络以获得语义预测模型的计算机可读的程序代码，另一个计算机可读存储介质包含用于进行文字检测的计算机可读的程序代码。In addition, according to an embodiment of the present invention, a storage medium is also provided, on which a program instruction is stored, and when the program instruction is executed by a computer or a processor, it is used to execute the text detection method of the embodiment of the present invention The corresponding steps are used to realize the corresponding modules in the character detection device according to the embodiment of the present invention. The storage medium may include, for example, a memory card of a smart phone, a storage unit of a tablet computer, a hard disk of a personal computer, a read-only memory (ROM), an erasable programmable read-only memory (EPROM), a portable compact disk ROM, etc. (CD-ROM), USB memory, or any combination of the above storage media. The computer-readable storage medium may be any combination of one or more computer-readable storage media, for example, one computer-readable storage medium contains computer-readable program codes for training a neural network to obtain a semantic prediction model, and the other A computer-readable storage medium includes computer-readable program code for character detection.

在一个实施例中，所述计算机程序指令在被计算机运行时可以实现根据本发明实施例的文字检测装置的各个功能模块，并且/或者可以执行根据本发明实施例的文字检测方法。In one embodiment, when the computer program instructions are executed by a computer, various functional modules of the character detection device according to the embodiment of the present invention can be realized, and/or the character detection method according to the embodiment of the present invention can be executed.

在一个实施例中，所述计算机程序指令在被计算机运行时执行以下步骤：接收待检测图像；经由语义预测模型生成所述待检测图像的全图的文字区域概率图，其中，所述文字区域概率图使用不同的像素值区分所述待检测图像的文字区域和所述待检测图像的非文字区域；以及对所述文字区域概率图进行分割操作，以确定所述文字区域。所述语义预测模型用于根据图像的语义预测所述待检测图像中像素属于文字区域还是属于非文字区域。In one embodiment, when the computer program instructions are executed by a computer, the following steps are performed: receiving an image to be detected; generating a text area probability map of the full image of the image to be detected via a semantic prediction model, wherein the text area The probability map uses different pixel values to distinguish the text area of the image to be detected from the non-text area of the image to be detected; and a segmentation operation is performed on the text area probability map to determine the text area. The semantic prediction model is used to predict whether a pixel in the image to be detected belongs to a text area or a non-text area according to the semantics of the image.

此外，所述计算机程序指令在被计算机运行时执行，还执行以下步骤：接收原始图像；以及对所述原始图像进行预处理，以获得所述待检测图像，其中，所述待检测图像具有预设尺寸大小，并且所述待检测图像的长宽比例与所述原始图像的长宽比例相同。In addition, when the computer program instructions are executed by the computer, the following steps are also performed: receiving an original image; and performing preprocessing on the original image to obtain the image to be detected, wherein the image to be detected has a pre- The size is set, and the aspect ratio of the image to be detected is the same as that of the original image.

此外，在所述计算机程序指令在被计算机运行时所执行的对所述文字区域概率图进行分割操作以确定所述文字区域的步骤包括：对所述文字区域概率图进行二值化操作，以确定所述文字区域。In addition, when the computer program instructions are executed by the computer, the step of segmenting the text area probability map to determine the text area includes: performing a binarization operation on the text area probability map to obtain Determine the text area.

此外，在所述计算机程序指令在被计算机运行时所执行的对所述文字区域概率图进行二值化操作以确定所述文字区域的步骤包括：确定所述二值化操作所获得的每个连通区域的轮廓；以及将所述轮廓拟合为四边形，其中，所述四边形内部区域为所述文字区域。In addition, when the computer program instructions are executed by the computer, the step of performing a binarization operation on the text area probability map to determine the text area includes: determining each the contour of the connected region; and fitting the contour to a quadrilateral, wherein the inner region of the quadrilateral is the text region.

此外，在所述计算机程序指令在被计算机运行时，还执行以下步骤：利用多个样本图像训练神经网络，以获得所述语义预测模型。In addition, when the computer program instructions are executed by the computer, the following step is further performed: using a plurality of sample images to train a neural network to obtain the semantic prediction model.

此外，在所述计算机程序指令在被计算机运行时，所执行的利用多个样本图像训练神经网络以获得所述语义预测模型的步骤包括：接收所述样本图像和所述样本图像的标注信息；根据所述样本图像和所述样本图像的标注信息生成所述样本图像的掩膜图；以及利用所述样本图像和所述掩膜图训练所述神经网络，以获得所述语义预测模型。In addition, when the computer program instructions are executed by a computer, the executed step of using a plurality of sample images to train a neural network to obtain the semantic prediction model includes: receiving the sample images and annotation information of the sample images; generating a mask map of the sample image according to the sample image and label information of the sample image; and training the neural network by using the sample image and the mask map to obtain the semantic prediction model.

此外，在所述计算机程序指令在被计算机运行时执行利用多个样本图像训练神经网络以获得所述语义预测模型的步骤中，所述掩膜图包括二值掩膜图，并且所述二值掩膜图使用不同的像素值区分所述样本图像的文字区域和非文字区域。In addition, in the step of training a neural network using a plurality of sample images to obtain the semantic prediction model when the computer program instructions are executed by a computer, the mask map includes a binary mask map, and the binary mask map The mask map uses different pixel values to distinguish text areas and non-text areas of the sample image.

此外，在所述计算机程序指令在被计算机运行时执行利用多个样本图像训练神经网络以获得所述语义预测模型的步骤中，所述神经网络包括全卷积神经网络。In addition, in the step of using a plurality of sample images to train a neural network to obtain the semantic prediction model when the computer program instructions are executed by a computer, the neural network includes a fully convolutional neural network.

此外，在所述计算机程序指令在被计算机运行时执行利用多个样本图像训练神经网络以获得所述语义预测模型的步骤中，所述全卷积神经网络的层数包括6到19之间的任意数。In addition, in the step of using a plurality of sample images to train a neural network to obtain the semantic prediction model when the computer program instructions are executed by a computer, the number of layers of the fully convolutional neural network includes between 6 and 19 any number.

本领域普通技术人员通过阅读上文关于文字检测方法的详细描述，能够理解上述文字检测装置、系统的结构、实现以及优点，因此这里不再赘述。Those of ordinary skill in the art can understand the structure, implementation and advantages of the above-mentioned text detection device and system by reading the above detailed description of the text detection method, so details are not repeated here.

在此处所提供的说明书中，说明了大量具体细节。然而，能够理解，本发明的实施例可以在没有这些具体细节的情况下实践。在一些实例中，并未详细示出公知的方法、结构和技术，以便不模糊对本说明书的理解。In the description provided herein, numerous specific details are set forth. However, it is understood that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure the understanding of this description.

类似地，应当理解，为了精简本公开并帮助理解各个发明方面中的一个或多个，在上面对本发明的示例性实施例的描述中，本发明的各个特征有时被一起分组到单个实施例、图、或者对其的描述中。然而，并不应将该公开的方法解释成反映如下意图：即所要求保护的本发明要求比在每个权利要求中所明确记载的特征更多的特征。更确切地说，如下面的权利要求书所反映的那样，发明方面在于少于前面公开的单个实施例的所有特征。因此，遵循具体实施方式的权利要求书由此明确地并入该具体实施方式，其中每个权利要求本身都作为本发明的单独实施例。Similarly, it should be appreciated that in the foregoing description of exemplary embodiments of the invention, in order to streamline this disclosure and to facilitate an understanding of one or more of the various inventive aspects, various features of the invention are sometimes grouped together in a single embodiment, figure, or its description. This method of disclosure, however, is not to be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the Detailed Description are hereby expressly incorporated into this Detailed Description, with each claim standing on its own as a separate embodiment of this invention.

本领域那些技术人员可以理解，除了这样的特征和/或过程或者单元中的至少一些是相互排斥之外，可以采用任何组合对本说明书(包括伴随的权利要求、摘要和附图)中公开的所有特征以及如此公开的任何方法或者装置的所有过程或单元进行组合。除非另外明确陈述，本说明书(包括伴随的权利要求、摘要和附图)中公开的每个特征可以由提供相同、等同或相似目的的替代特征来代替。Those skilled in the art will appreciate that all of the features and/or processes or elements disclosed in this specification (including accompanying claims, abstract and drawings) may be used in any combination, except that at least some of such features and/or processes or elements are mutually exclusive. features and all processes or elements of any method or apparatus so disclosed. Each feature disclosed in this specification (including accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.

此外，本领域的技术人员能够理解，尽管在此所述的一些实施例包括其它实施例中所包括的某些特征而不是其它特征，但是不同实施例的特征的组合意味着处于本发明的范围之内并且形成不同的实施例。例如，在下面的权利要求书中，所要求保护的实施例的任意之一都可以以任意的组合方式来使用。Furthermore, those skilled in the art will understand that although some embodiments described herein include some features included in other embodiments but not others, combinations of features from different embodiments are meant to be within the scope of the invention. and form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.

本发明的各个部件实施例可以以硬件实现，或者以在一个或者多个处理器上运行的软件模块实现，或者以它们的组合实现。本领域的技术人员应当理解，可以在实践中使用微处理器或者数字信号处理器(DSP)来实现根据本发明实施例的文字检测装置中的一些模块的一些或者全部功能。本发明还可以实现为用于执行这里所描述的方法的一部分或者全部的装置程序(例如，计算机程序和计算机程序产品)。这样的实现本发明的程序可以存储在计算机可读介质上，或者可以具有一个或者多个信号的形式。这样的信号可以从因特网网站上下载得到，或者在载体信号上提供，或者以任何其他形式提供。The various component embodiments of the present invention may be implemented in hardware, or in software modules running on one or more processors, or in a combination thereof. Those skilled in the art should understand that a microprocessor or a digital signal processor (DSP) may be used in practice to implement some or all functions of some modules in the text detection device according to the embodiment of the present invention. The present invention can also be implemented as an apparatus program (for example, a computer program and a computer program product) for performing a part or all of the methods described herein. Such a program for realizing the present invention may be stored on a computer-readable medium, or may be in the form of one or more signals. Such a signal may be downloaded from an Internet site, or provided on a carrier signal, or provided in any other form.

应该注意的是上述实施例对本发明进行说明而不是对本发明进行限制，并且本领域技术人员在不脱离所附权利要求的范围的情况下可设计出替换实施例。在权利要求中，不应将位于括号之间的任何参考符号构造成对权利要求的限制。单词“包含”不排除存在未列在权利要求中的元件或步骤。位于元件之前的单词“一”或“一个”不排除存在多个这样的元件。本发明可以借助于包括有若干不同元件的硬件以及借助于适当编程的计算机来实现。在列举了若干装置的单元权利要求中，这些装置中的若干个可以是通过同一个硬件项来具体体现。单词第一、第二、以及第三等的使用不表示任何顺序。可将这些单词解释为名称。It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention can be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In a unit claim enumerating several means, several of these means can be embodied by one and the same item of hardware. The use of the words first, second, and third, etc. does not indicate any order. These words can be interpreted as names.