CN111639639B

Movatterモバイル変換

Info

Publication number: CN111639639B
Application number: CN201910157678.3A
Authority: CN
Inventors: 王杰
Original assignee: Hangzhou Hikvision Digital Technology Co Ltd
Current assignee: Hangzhou Hikvision Digital Technology Co Ltd
Priority date: 2019-03-01
Filing date: 2019-03-01
Publication date: 2023-05-02
Anticipated expiration: 2039-03-01
Also published as: CN111639639A

Abstract

The invention discloses a method, a device, equipment and a storage medium for detecting text areas, and belongs to the field of image detection. The method comprises the following steps: acquiring an image to be subjected to text region detection, wherein the image contains multiple language texts; extracting general features of at least two language texts in the image; extracting distinguishing features of at least two language texts in the image based on the universal features; a first text region of the multi-language text in the image is determined based on the distinguishing characteristics. According to the method and the device, the universal features of at least two language texts in the image are extracted, and the distinguishing features of the at least two language texts in the image are extracted based on the universal features, so that the difference among multiple different languages can be considered, and the first text region of the multiple language texts can be determined simultaneously based on the distinguishing features, so that the detection difficulty is reduced, and the detection efficiency is improved.

Description

Method, device, equipment and storage medium for detecting text area

Technical Field

The present invention relates to the field of image detection, and in particular, to a method, apparatus, device, and storage medium for detecting text regions.

Background

With the continuous development of image detection technology, image-based detection is increasingly applied. For example, for an image containing text in a plurality of languages, it is one application to detect a text region of each language text contained in the image to locate a position of each language text from the image.

In the method for detecting the text region provided by the related art, firstly, an image is detected through an artificial neural network, and the image is divided into a text region and a non-text region. And then, continuously detecting the text region for a plurality of times through the artificial neural network, determining a subarea where the language text is located from the text region in each detection, and finally obtaining the text region of each language text in the image.

The inventors found that the related art has at least the following problems:

Disclosure of Invention

The embodiment of the invention provides a method, a device, equipment and a storage medium for detecting text areas, which are used for solving the problems of related technologies. The technical scheme is as follows:

In one aspect, a method of detecting text regions is provided, the method comprising: acquiring an image to be subjected to text region detection, wherein the image contains multiple language texts; extracting general features of at least two language texts in the image; extracting distinguishing features of at least two language texts in the image based on the universal features; and determining a first text region of the multi-language text in the image according to the distinguishing characteristic.

Optionally, the extracting the general features of at least two language texts in the image includes: invoking a first neural network, and extracting general features of at least two language texts in the image to obtain a general feature map;

the extracting distinguishing features of at least two language texts in the image based on the universal features comprises the following steps: invoking a reference number of second neural networks, extracting distinguishing features of the at least two language texts based on the universal feature map, and obtaining a distinguishing feature map, wherein each second neural network is a neural network trained according to supervision information of any language text.

Optionally, the determining a first text region of the multi-language text in the image according to the distinguishing feature includes: invoking a classifier with reference quantity, and acquiring confidence coefficient graphs of at least two language texts based on the distinguishing feature graphs, wherein each pixel in the confidence coefficient graphs has a confidence coefficient value; for the confidence coefficient map of any language text, taking pixels with confidence coefficient values not lower than a first threshold value in the confidence coefficient map of any language text as a pixel set of the any language text; generating an circumscribed rectangle of the pixel set of the text in any language, and taking the circumscribed rectangle as a first text area of the text in any language.

Optionally, the determining a first text region of the multi-language text in the image according to the distinguishing feature includes: invoking a classifier with reference quantity, and acquiring confidence coefficient graphs of at least two language texts based on the distinguishing feature graphs, wherein each pixel in the confidence coefficient graphs has a confidence coefficient value; for the confidence level diagram of any language text, calling a third neural network, and acquiring at least two deviation coordinates based on the universal feature diagram; taking the coordinates of the pixel points with the confidence coefficient value not lower than a second threshold value in the confidence coefficient map of any language text as reference coordinates, and respectively calculating the at least two deviation coordinates to obtain at least two region coordinates; and taking the area indicated by the at least two area coordinates as a first text area of the text in any language.

Optionally, the method further comprises: fusing the distinguishing features to obtain total language features; and determining a second text region of the multi-language text in the image according to the total language characteristic.

Optionally, the method further comprises: and updating the detection result of the text region according to the first text region and the second text region.

Optionally, the updating the detection result of the text region according to the first text region and the second text region includes: for any first text region, deleting the first text region if a second text region with the overlapping rate of the first text region being larger than a third threshold value does not exist; and for any second text region, if the first text region with the overlapping rate of the second text region being larger than a fourth threshold value does not exist, the second text region is used as a new language text region.

Optionally, the method further comprises: non-maximum suppression is performed on a first text region of the one or more languages of text.

Optionally, the method further comprises: and performing non-maximum suppression on a second text region of the multi-language text.

In one aspect, there is provided an apparatus for detecting a text region, the apparatus comprising:

the acquisition module is used for acquiring an image to be subjected to text region detection, wherein the image contains multiple language texts;

the first extraction module is used for extracting general features of at least two language texts in the image;

the second extraction module is used for extracting distinguishing features of at least two language texts in the image based on the general features;

And the first determining module is used for determining a first text area of the multi-language text in the image according to the distinguishing characteristics.

Optionally, the first extracting module is configured to invoke a first neural network, extract general features of at least two language texts in the image, and obtain a general feature map;

the second extraction module is used for calling a reference number of second neural networks, extracting distinguishing features of the at least two language texts based on the general feature map, and obtaining a distinguishing feature map, wherein each second neural network is a neural network trained according to supervision information of any language text.

Optionally, the first determining module is configured to call a reference number of classifiers, and obtain confidence maps of at least two language texts based on the distinguishing feature map, where each pixel in the confidence maps has a confidence value; for the confidence coefficient map of any language text, taking pixels with confidence coefficient values not lower than a first threshold value in the confidence coefficient map of any language text as a pixel set of the any language text; generating an circumscribed rectangle of the pixel set of the text in any language, and taking the circumscribed rectangle as a first text area of the text in any language.

Optionally, the first determining module is configured to call a reference number of classifiers, and obtain confidence maps of at least two language texts based on the distinguishing feature map, where each pixel in the confidence maps has a confidence value; for the confidence level diagram of any language text, calling a third neural network, and acquiring at least two deviation coordinates based on the universal feature diagram; taking the coordinates of the pixel points with the confidence coefficient value not lower than a second threshold value in the confidence coefficient map of any language text as reference coordinates, and respectively calculating the at least two deviation coordinates to obtain at least two region coordinates; and taking the area indicated by the at least two area coordinates as a first text area of the text in any language.

Optionally, the apparatus further comprises:

the fusion module is used for fusing the distinguishing features to obtain total language features;

and the second determining module is used for determining a second text area of the multi-language text in the image according to the total language characteristic.

Optionally, the apparatus further comprises:

and the updating module is used for updating the detection result of the text area according to the first text area and the second text area.

Optionally, the updating module is configured to delete, for any first text region, if there is no second text region with an overlapping rate with the first text region being greater than a third threshold value; and for any second text region, if the first text region with the overlapping rate of the second text region being larger than a fourth threshold value does not exist, the second text region is used as a new language text region.

Optionally, the updating module is further configured to perform non-maximum suppression on the first text region of the one or more languages.

Optionally, the updating module is further configured to perform non-maximum suppression on the second text region of the multi-language text.

There is also provided an apparatus for detecting text regions, the apparatus comprising a processor and a memory having stored therein at least one instruction for execution by the processor to implement any of the above methods of detecting text regions.

There is also provided a computer readable storage medium having stored therein at least one instruction executable by a processor to implement any of the above methods of detecting text regions.

The technical scheme provided by the invention has the beneficial effects that at least:

the method and the device have the advantages that the general features of at least two language texts in the image are extracted, and the distinguishing features of the at least two language texts in the image are extracted based on the general features, so that the difference among multiple different languages can be considered, and the first text region of the multiple language texts can be determined simultaneously based on the distinguishing features, so that the detection difficulty is reduced, and the detection efficiency is improved.

Drawings

FIG. 1 is a schematic illustration of an implementation environment provided by an embodiment of the present invention;

FIG. 2 is a flowchart of a method for detecting text regions according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of an extraction process of distinguishing features provided by an embodiment of the present invention;

FIG. 4 is a schematic diagram of an extraction process of distinguishing features provided by an embodiment of the present invention;

FIG. 5 is a schematic diagram of determining a first text region provided by an embodiment of the present invention;

FIG. 6 is a schematic diagram of a feature fusion process provided by an embodiment of the present invention;

FIG. 7 is a schematic diagram of a feature fusion process provided by an embodiment of the present invention;

FIG. 8 is a schematic diagram of non-maximum suppression provided by an embodiment of the present invention;

FIG. 9 is a schematic diagram of a system framework for detecting text regions according to an embodiment of the present invention;

Fig. 10 is a schematic structural diagram of an apparatus for detecting text regions according to an embodiment of the present invention;

FIG. 11 is a schematic diagram of an apparatus for detecting text regions according to an embodiment of the present invention;

fig. 12 is a schematic structural diagram of an apparatus for detecting text regions according to an embodiment of the present invention;

fig. 13 is a schematic structural diagram of a terminal according to an embodiment of the present invention.

Detailed Description

For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the embodiments of the present invention will be described in further detail with reference to the accompanying drawings.

Often, a large amount of text is contained in an image, and the text is composed of multiple languages together. Thus, the text region in which each language text contained in the image is located can be detected so as to locate the position of each language text from the image.

For this reason, the embodiment of the invention provides a method for detecting text regions, which can be applied to the implementation environment shown in fig. 1. In fig. 1, the terminal 11 comprises at least one terminal 11 and a server 12, wherein the terminal 11 can be in communication connection with the server 12 to acquire an image containing a text region from the server 12; of course, in addition to acquiring an image from the server 12, the terminal 11 may acquire an image containing a text region from a local cache.

The terminal 11 may be any electronic product that can perform man-machine interaction with a user through one or more modes of a keyboard, a touch pad, a touch screen, a remote controller, a voice interaction or a handwriting device, for example, a PC (Personal Computer, a personal computer), a mobile phone, a smart phone, a PDA (Personal Digital Assistant, a personal digital assistant), a wearable device, a palm computer PPC (Pocket PC), a tablet computer, a smart car machine, a smart television, a smart sound box, etc.

The server 12 may be a server, a server cluster comprising a plurality of servers, or a cloud computing service center.

Those skilled in the art will appreciate that the above-described terminal 11 and server 12 are by way of example only, and that other terminals or servers, either now present or later, may be suitable for use in the present application, and are intended to be within the scope of the present application and are incorporated herein by reference.

Based on the implementation environment shown in fig. 1 and referring to fig. 2, an embodiment of the present invention provides a method for detecting a text region, which can be applied to the terminal shown in fig. 1. As shown in fig. 2, the method includes:

instep 201, an image to be subjected to text region detection is acquired, wherein the image contains multiple language texts.

The image may be an image acquired by a camera, an image synthesized by a person, or an image derived from the internet, or the like, and the image may be stored in a buffer of the terminal device, and then the image may be directly acquired from the buffer of the terminal device, or the terminal device may initiate an acquisition request for the image to the server, so as to acquire an image returned after the server receives the request.

The image contains multiple language texts, wherein the language texts refer to a set of words, and the words form the language texts in the form of words, sentences or paragraphs. Each language text can be distributed continuously or at intervals on the image, and text areas corresponding to different language texts can be independent of each other or can be overlapped with each other. For example, the image may have chinese text paragraphs and english text paragraphs distributed longitudinally, where the chinese text region and the english text region are independent of each other; alternatively, the text passage of chinese in the image may contain text sentences of english, where the text area of chinese and the text area of english overlap each other.

For any acquired image, the image may be examined to determine the text region of the language text contained in the image.

Step 202, extracting general features of at least two language texts in an image.

Where a generic feature is a feature that is common to at least two (i.e., two or more) language texts, the language texts and the non-language texts can be distinguished according to the generic feature. The general features have various forms, and the features of color, edge, texture, outline and the like can be used as distinguishing features of two or more languages.

For general feature extraction, the method provided in this embodiment includes, but is not limited to, the following two alternative embodiments:

first embodiment: and calling the first neural network, and extracting the general features of at least two language texts in the image to obtain a general feature map.

In this embodiment, the generic features of the at least two language texts are extracted through a neural network. The neural network is a network which can be trained according to a training data set and can acquire feature extraction capability through training, and the general feature extraction can be realized by inputting images of features to be extracted into the trained neural network. The training data set is a set of images containing language text, and the more the number of images in the training data set is, the stronger the feature extraction capability obtained by the neural network for learning according to the training data set is, i.e. the deep learning network can extract general features with higher representativeness, i.e. higher expressive capability.

Further, the neural network comprises a plurality of networks such as a Resnet (Residual Neural Network ), an acceptance and a CNN (Convolutional Neural Network ), and the like, and the general characteristics of at least two language texts can be extracted by selecting any network. In the embodiment of the present invention, the first neural network for extracting the general feature is a first CNN.

Second embodiment: and extracting the general characteristics of the multi-language text in a manual setting mode. For example, extraction of general features is achieved by using various algorithms such as SIFT (Scale-Invariant Feature Transform, scale invariant feature transform) algorithm, HOG (Histogram of Oriented Gradient, directional gradient histogram) algorithm, or Gabor (filter) feature algorithm.

For easy understanding, a procedure for extracting general features of at least two language texts in an image will be described below using a first CNN as an example of a first neural network:

in the network structure of CNN, feature extraction is achieved by convolution layers, which may be one or more layers in number. Wherein each convolution layer contains a reference number of neurons, each neuron focuses on and focuses on only one feature, and a pair of feature maps of the feature are extracted from the image, so that the convolution layer can extract the same number of feature maps as the neurons based on the input image.

For example, the convolution layer of the CNN includes 256 convolution feature extraction kernels, and after the acquired image is input into the CNN, 256 generic feature graphs are obtained through the 256 convolution feature extraction kernels, where each generic feature graph corresponds to one generic feature of at least two language texts.

Since the reference number of feature maps obtained by the CNN convolution layers have the same height dimension and the same width dimension, in practical applications, the feature maps obtained by the CNN convolution layers are generally expressed using a method of c×h×w. Where C (Channel) represents the number of feature images obtained by the convolution layer, H (Height) represents the Height dimension of the feature images, and W (Width) represents the Width dimension of the feature images. For example, in the case where 256 general feature maps are obtained by the 256 convolution feature extraction kernels, if the height and width dimensions of the general feature maps are 56 pixels, the 256×56×56 general feature maps can be used to represent the 256 general feature maps.

Whichever way is used to extract the generic features, after the generic features are obtained, the distinguishing features of the multi-language text may be further extracted based on the obtained generic features, as detailed in the followingstep 203.

Step 203, extracting distinguishing features of at least two language texts in the image based on the universal features.

Each language text has a distinguishing feature that is unique to that language text, i.e., the language text can be distinguished from other types of language text by a distinguishing feature of one language text. As with the generic features, the distinguishing features may also take a variety of forms, such as colors, edges, textures, contours, etc.

Optionally, when the method provided by the embodiment of the invention extracts the distinguishing features of at least two language texts in the image, a neural network mode may still be adopted, for example, a second neural network with reference number is called, and the distinguishing features of at least two language texts are extracted based on the universal feature map to obtain the distinguishing feature map.

Wherein each second neural network is a neural network that has been trained from the supervisory information of either language text. After training according to the supervision information of the language text, the second neural network obtains the capability of extracting the distinguishing features of the language text, and then the second neural network can extract the distinguishing features of the language text based on the universal feature map. Also, a language text may be a language text of one language, such as chinese language text, english language text; or a language text of a language system containing a plurality of languages, such as Tibetan language text and printed European language text.

In an alternative embodiment, the second neural network employs a second CNN. Wherein the number of discriminating characteristic maps of the output second CNN depends only on the number of convolution characteristic extraction kernels contained in the convolution layer of the second CNN, irrespective of the number of common characteristic maps of the input second CNN. That is, the number of distinguishing feature maps may be greater than or less than or equal to the number of general feature maps. In this embodiment, referring to fig. 3, the final convolution layer of each second CNN includes 2 convolution feature extraction kernels, and after the generic feature map is input into the second CNN, the CNN may extract 2 distinctive feature maps of a language text based on the generic feature map. For example, the general feature map of the input second CNN is expressed as 256×56×56, and the distinguishing feature map of the output second CNN is expressed as 2×56×56.

It should be noted that, the reference number of second CNNs all accept the input of the generic feature map, but only the second CNNs corresponding to the language text included in the generic feature map can output the distinguishing feature map based on the generic feature map. For example, there are chinese text and english text in an image, and the reference number of second CNNs each receive input of a general feature map of the image, but only the second CNNs capable of extracting chinese distinguishing features and the second CNNs capable of extracting english distinguishing features output the distinguishing feature map, and the second CNNs capable of extracting distinguishing features of other kinds of languages do not output the distinguishing feature map.

It can be seen that, in this embodiment, the general feature of the language text in the image is first extracted by the first CNN, and then the distinguishing feature of each language text is extracted based on the general feature by the reference number of second CNNs, so that the first CNN only needs to have the capability of extracting the general feature, and each second CNN only needs to have the capability of extracting the distinguishing feature of one language text. Therefore, compared with the related art, the difficulty of training the first CNN and the second CNN is reduced, so that the detection difficulty is low and the detection efficiency is high.

After the extraction of the distinguishing features is completed, the text region of the language text can be determined according to the extracted distinguishing features, so that the positioning of the text region of the language text is completed.

Step 204, determining a first text region of the multi-language text in the image according to the distinguishing features.

For an image containing multi-language texts, after the distinguishing features of the multi-language texts are extracted, a distinguishing feature diagram of the multi-language texts can be obtained, and the obtained distinguishing feature diagram of the multi-language texts is further processed, so that a first text region of the multi-language texts can be determined.

As an alternative implementation manner, the method provided by the embodiment of the present invention includes, but is not limited to, the following two determination manners:

The first determination mode is as follows: invoking a classifier with reference quantity, and acquiring confidence coefficient graphs of at least two language texts based on the distinguishing feature graphs, wherein each pixel in the confidence coefficient graphs has a confidence coefficient value; for the confidence coefficient map of any language text, taking the pixels with the confidence coefficient value not lower than a first threshold value in the confidence coefficient map of any language text as a pixel set of any language text; generating an circumscribed rectangle of the pixel set of the text in any language, and taking the circumscribed rectangle as a first text area of the text in any language.

For the sake of understanding, the process of determining the first text area of the multi-language text according to the distinguishing feature will be described by taking CNN as an example, where, since the determining process of the different types of language text is the same, only the text area of any one of the language text is determined as an example, and the determining manners of the text areas of the other language text are not repeated here:

referring to fig. 3, after calling a CNN to obtain a distinguishing feature map of a language text, a classifier (e.g., softmax classifier) is called to classify the obtained distinguishing feature map, so as to obtain a confidence map of the language text, where each pixel forming the confidence map has a confidence value. The confidence value is in the range of 0-1, and the smaller the difference value between the confidence value and 0 is, the lower the probability that the pixel is a text pixel is, and correspondingly, the larger the difference value between the confidence value and 0 is, the higher the probability that the pixel is a text pixel is.

Further, since the confidence value identifies the probability that the pixel is a text pixel, a first threshold may be set, each pixel may be traversed in a reference manner, such as a depth-first search, a line-by-line search, etc., to classify pixels having all confidence values not below the first threshold as text pixels, and pixels having all confidence values below the first threshold as non-text pixels. In this embodiment, if the value of the first threshold is 0.5, all pixels with confidence values not lower than 0.5 in the confidence map are confirmed as a set of pixels of the language text, and each pixel in the set is a text pixel.

Whichever way is used to traverse the pixels, after the traversal is completed, a set of pixels of the language text can be obtained. Then, a quadrangle fitting mode can be adopted, and a detection result of the language text region, namely a first text region, is obtained based on the pixel set. For example, as shown in fig. 3, the hatched portion is a pixel set of the language text, a circumscribed rectangle of the pixel set of the language text is generated, and the circumscribed rectangle is used as the first text region of the text in any language.

The second determination mode: invoking a classifier with reference quantity, and acquiring confidence coefficient graphs of at least two language texts based on the distinguishing feature graphs, wherein each pixel in the confidence coefficient graphs has a confidence coefficient value; for the confidence level diagram of any language text, calling a third neural network, and acquiring at least two deviation coordinates based on the universal feature diagram; taking the coordinates of the pixel points with the confidence coefficient value not lower than the second threshold value in the confidence coefficient map of any language text as reference coordinates, and respectively calculating the deviation coordinates to obtain region coordinates; and taking the region indicated by the region coordinates as a first text region of any language text.

A schematic diagram of this confirmation is shown in fig. 4. Wherein the offset coordinates are used to represent the offset distance relative to the reference coordinates. In an alternative embodiment, the number of acquired offset coordinates is four. For example, as shown in fig. 5, the reference coordinates are (5, 5), the deviation coordinates are (-2, -1), (-2, 1), (2, 1) and (2, -1), and the calculated region coordinates are (3, 4), (3, 6), (7, 6) and (7, 4), and the first text region indicated by the region coordinates is the hatched region in fig. 5. The region coordinates may represent other region boundary points than the region vertices. The region coordinates may indicate not only a rectangular region but also a region having a shape other than a rectangle, and the embodiment of the present invention is not limited thereto.

Optionally, the method provided by the embodiment of the invention further includes: and fusing the distinguishing features to obtain the total language features. A second text region of the multi-language text in the image is determined based on the total language characteristic.

The feature fusion generally includes a lateral fusion as shown in fig. 6, and a longitudinal fusion as shown in fig. 7. The fusion of the multiple language distinguishing features provided by the embodiment of the invention adopts transverse fusion, namely the distinguishing features of each language are parallel on the same level, for example, the addition of the distinguishing features can be carried out by adopting an eltwise algorithm, or the distinguishing features are overlapped on a channel by adopting a concat algorithm, so that the transverse fusion of the distinguishing features is realized, and the total language features are obtained.

And then, according to the obtained total language characteristics, obtaining a detection result of the language text in the image, namely determining a second text region. The determining process is the same as the process of determining the first text area according to the distinguishing feature, and has two determining modes, which are not described herein.

Further, the method provided by the embodiment of the invention further comprises the following steps: and updating the detection result of the text region according to the first text region and the second text region.

For example, updating the detection result of the text region according to the first text region and the second text region includes:

and for any first text region, deleting the first text region if a second text region with the overlapping rate of the first text region being larger than a third threshold value does not exist. And for any second text region, if the first text region with the overlapping rate of the second text region being larger than the fourth threshold value does not exist, the second text region is taken as a new language text region.

The total language feature is obtained by fusing the distinguishing features, so that the total language feature is a feature with stronger expressive power than the distinguishing features, that is, the second text region determined according to the total language feature is a text region with higher accuracy than the first text region determined according to the distinguishing features. Therefore, for any first text region, if there is no second text region with an overlapping rate with the first text region being greater than a third threshold (e.g., 75%), it is indicated that the deviation between the first text region and the second text region is greater and the accuracy is lower, so that the first text region can be deleted.

In addition, the obtained total language feature may include a new language feature, where the new language feature refers to a feature other than the distinguishing feature, and the text with the new language feature also belongs to the language text, but the new language feature is not extracted in the process of calling the second neural network to extract the distinguishing feature. The reason is that a second neural network needs to be trained according to the supervision information of a language text before extracting the distinguishing features of the language text; however, since there are many types of languages existing, and the capability of collecting supervision information is limited, there is a small possibility that a second neural network is trained for each language existing, and distinguishing features of some types of language text cannot be extracted.

Thus, it can be seen that the second text region determined from the total language feature is a more comprehensive text region than the first text region determined from the distinguishing feature. If there is no first text region with the overlapping rate of the second text region being greater than the fourth threshold (e.g., 75%) for any second text region, it is indicated that the region is a text region of a new language text obtained according to the fused new language feature, so that the second text region may be reserved and used as the new language text region. In this way, the detection result can be more generalized.

Optionally, the method provided by the embodiment of the invention further includes:

non-maximum suppression is performed on a first text region of one or more languages of text. For the same language text, a plurality of first text areas overlapped with each other may be detected, and the effect of non-maximum suppression is to remove redundant first text areas from the plurality of first text areas overlapped with each other, and to keep one first text area with highest confidence.

Next, a process of non-maximum suppression will be described by taking non-maximum suppression of a first text region of a language text as an example:

all the first text regions of the language text are input into the classifier, which gives a confidence value to each first text region, the higher the confidence value, the higher the accuracy of the first text is stated. And then, taking the first text area with the highest confidence value (namely, the confidence value is the maximum value) as a reference text area, and deleting all the first text areas with the overlapping rate with the reference text area being larger than a reference threshold value (such as 50%). If the first text region still remains after deletion, selecting one with the highest confidence value from the remaining first text regions as a reference text region, and repeating the deleting process in the remaining first text regions.

For example, as shown in fig. 8, A, B and C are three mutually overlapping first text regions of a language text, and the confidence values of A, B and C are 0.9, 0.8, and 0.7, respectively. Since 0.9 > 0.8 > 0.7, A is taken as a reference text region; taking the reference threshold value as 50%, deleting B, C because the overlapping rate of B, C and A is greater than 50%, and obtaining the first text region of the language text finally as A.

In an optional implementation manner, the method provided by the embodiment of the invention further comprises the following steps: non-maximum suppression is performed on a second text region of the multi-language text. The same procedure as the above non-maximum suppression is not described here.

It should be noted that, whether the non-maximum suppression is performed only on the first text region or the second text region, or the non-maximum suppression is performed on both the first text region and the second text region, the method provided by the embodiment of the invention can update the detection result of the text region according to the suppressed text region when updating the detection result of the text region according to the first text region and the second text region. Optionally, the principle of the method for updating the detection result of the text region according to the suppressed text region is the same as the above method for updating the detection result of the text region according to the first text region and the second text region, and will not be described herein.

In summary, based on the description of the method provided by the embodiment of the present invention, the process of detecting the text region provided by the embodiment of the present invention may be as shown in fig. 9. In fig. 9, the multi-language text detection framework includes a general feature extraction module, a distinguishing feature extraction module, and a text detection module, where the output of the former module is the input of the latter module. Inputting the image which originally contains the multi-language text into a general feature extraction module, and outputting general features by the general feature extraction module; secondly, inputting the general feature into a distinguishing feature extraction module of a reference number, and outputting distinguishing features of different languages by the distinguishing feature extraction module; and then, inputting the distinguishing features into a text detection module, and obtaining the output rough detection results of different languages, namely a first text region.

On the other hand, the distinguishing features of different languages can be transversely fused to obtain the total language features, and the total language features are input into the text detection module, so that accurate detection results of the different languages, namely the second text region, can be obtained. Finally, the rough detection results (first text area) of different languages can be combined for correction and suppression to obtain the final detection result of the text area, namely the detection result of the text area is updated according to the first text area and the second text area.

According to the method provided by the embodiment of the invention, the common characteristics of at least two language texts in the image are extracted, and the distinguishing characteristics of the at least two language texts in the image are extracted based on the common characteristics, so that the difference among a plurality of different languages can be considered, and the first text region of the plurality of language texts can be determined simultaneously based on the distinguishing characteristics, thereby reducing the detection difficulty and improving the detection efficiency.

In addition, the method provided by the embodiment can enable the language to be finer in granularity due to the fact that the distinguishing characteristic processing process is integrated. The method can support the detection of text areas of multiple languages, the framework is easy to expand, and the method can support the learning of mass data, so that better accuracy and generalization capability can be achieved.

Based on the same conception, the embodiment of the present invention provides an apparatus for detecting text regions, referring to fig. 10, the apparatus includes:

an obtainingmodule 1001, configured to obtain an image to be subjected to text region detection, where the image includes multiple language texts;

afirst extraction module 1002, configured to extract generic features of at least two language texts in an image;

a second extractingmodule 1003, configured to extract distinguishing features of at least two language texts in the image based on the common features;

A first determiningmodule 1004 is configured to determine a first text region of the multi-language text in the image according to the distinguishing feature.

Optionally, a first extractingmodule 1002 is configured to invoke a first neural network, and extract general features of at least two language texts in the image to obtain a general feature map;

thesecond extraction module 1003 is configured to invoke second neural networks, and extract distinguishing features of at least two language texts based on the generic feature map, so as to obtain distinguishing feature maps, where each second neural network is a neural network that has been trained according to the supervision information of any one language text.

Optionally, a first determiningmodule 1004 is configured to invoke a reference number of classifiers, and obtain a confidence map of at least two language texts based on the distinguishing feature map, where each pixel in the confidence map has a confidence value; for the confidence coefficient map of any language text, taking the pixels with the confidence coefficient value not lower than a first threshold value in the confidence coefficient map of any language text as a pixel set of any language text; generating an circumscribed rectangle of the pixel set of the text in any language, and taking the circumscribed rectangle as a first text area of the text in any language.

Optionally, a first determiningmodule 1004 is configured to invoke a reference number of classifiers, and obtain a confidence map of at least two language texts based on the distinguishing feature map, where each pixel in the confidence map has a confidence value; for the confidence level diagram of any language text, calling a third neural network, and acquiring at least two deviation coordinates based on the universal feature diagram; taking the coordinates of the pixel points with the confidence coefficient value not lower than the second threshold value in the confidence coefficient map of any language text as reference coordinates, and respectively calculating at least two deviation coordinates to obtain at least two region coordinates; and taking the region indicated by the coordinates of at least two regions as a first text region of any language text.

Optionally, referring to fig. 11, the apparatus further includes:

afusion module 1005, configured to fuse the distinguishing features to obtain a total language feature;

a second determiningmodule 1006 is configured to determine a second text region of the multi-language text in the image according to the total language feature.

Optionally, referring to fig. 12, the apparatus further includes:

anupdating module 1007 is configured to update the detection result of the text region according to the first text region and the second text region.

Optionally, theupdating module 1007 is configured to delete, for any first text region, if there is no second text region with an overlapping rate with the first text region being greater than a third threshold; and for any second text region, if the first text region with the overlapping rate of the second text region being larger than the fourth threshold value does not exist, the second text region is taken as a new language text region.

Optionally, theupdating module 1007 is further configured to perform non-maximum suppression on the first text region of the one or more languages.

Optionally, theupdating module 1007 is further configured to perform non-maximum suppression on the second text region of the multi-language text.

According to the device provided by the embodiment of the invention, the universal characteristics of at least two language texts in the image are extracted, and the distinguishing characteristics of the at least two language texts in the image are extracted based on the universal characteristics, so that the difference among a plurality of different languages can be considered, and the first text region of the plurality of language texts can be determined simultaneously based on the distinguishing characteristics, thereby reducing the detection workload and improving the detection efficiency.

It should be noted that, when the apparatus provided in the foregoing embodiment implements the functions thereof, only the division of the foregoing functional modules is used as an example, in practical application, the foregoing functional allocation may be implemented by different functional modules, that is, the internal structure of the terminal is divided into different functional modules, so as to implement all or part of the functions described above. In addition, the apparatus and the method embodiments provided in the foregoing embodiments belong to the same concept, and specific implementation processes of the apparatus and the method embodiments are detailed in the method embodiments and are not repeated herein.

Referring to fig. 13, a schematic structural diagram of a terminal 1300 for detecting text regions according to an embodiment of the present disclosure is shown. The terminal 1300 may be a portable mobile terminal such as: a smart phone, a tablet computer, an MP3 player (Moving Picture Experts Group Audio Layer III, motion picture expert compression standard audio plane 3), an MP4 (Moving Picture Experts Group Audio Layer IV, motion picture expert compression standard audio plane 4) player, a notebook computer, or a desktop computer. Terminal 1300 may also be referred to by other names of user terminals, portable terminals, laptop terminals, desktop terminals, etc.

In general, the terminal 1300 includes: aprocessor 1301, and amemory 1302.

Processor 1301 may include one or more processing cores, such as 4 core processors, 5 core processors, etc.Processor 1301 may be implemented in at least one hardware form of DSP (Digital Signal Processing ), FPGA (Field-Programmable Gate Array, field programmable gate array), PLA (Programmable Logic Array ).Processor 1301 may also include a main processor, which is a processor for processing data in an awake state, also called a CPU (Central Processing Unit ), and a coprocessor; a coprocessor is a low-power processor for processing data in a standby state. In some embodiments,processor 1301 may integrate a GPU (Graphics Processing Unit, image processor) for rendering and rendering of content required to be displayed by the display screen. In some embodiments, theprocessor 1301 may also include an AI (Artificial Intelligence ) processor for processing computing operations related to machine learning.

Memory 1302 may include one or more computer-readable storage media, which may be non-transitory.Memory 1302 may also include high-speed random access memory, as well as non-volatile memory, such as one or more disk storage terminals, flash storage terminals. In some embodiments, a non-transitory computer readable storage medium inmemory 1302 is used to store at least one instruction for execution byprocessor 1301 to implement the method of detecting text regions provided by the method embodiments herein.

In some embodiments, the terminal 1300 may further optionally include: aperipheral interface 1303 and at least one peripheral. Theprocessor 1301, thememory 1302, and theperipheral interface 1303 may be connected by a bus or signal lines. The respective peripheral devices may be connected to theperipheral device interface 1303 through a bus, a signal line, or a circuit board. Specifically, the peripheral device includes: at least one ofradio frequency circuitry 1304, adisplay screen 1305, acamera assembly 1306,audio circuitry 1307, apositioning assembly 1308, and apower supply 1309.

Aperipheral interface 1303 may be used to connect I/O (Input/Output) related at least one peripheral to theprocessor 1301 and thememory 1302. In some embodiments,processor 1301,memory 1302, andperipheral interface 1303 are integrated on the same chip or circuit board; in some other embodiments, either or both of theprocessor 1301, thememory 1302, and theperipheral interface 1303 may be implemented on separate chips or circuit boards, which is not limited in this embodiment.

TheRadio Frequency circuit 1304 is used to receive and transmit RF (Radio Frequency) signals, also known as electromagnetic signals. Theradio frequency circuit 1304 communicates with a communication network and other communication terminals via electromagnetic signals. Theradio frequency circuit 1304 converts an electrical signal to an electromagnetic signal for transmission, or converts a received electromagnetic signal to an electrical signal. Optionally, theradio frequency circuit 1304 includes: antenna systems, RF transceivers, one or more amplifiers, tuners, oscillators, digital signal processors, codec chipsets, subscriber identity module cards, and so forth. Theradio frequency circuit 1304 may communicate with other terminals via at least one wireless communication protocol. The wireless communication protocol includes, but is not limited to: metropolitan area networks, various generations of mobile communication networks (2G, 3G, 4G, and 5G), wireless local area networks, and/or WiFi (Wireless Fidelity ) networks. In some embodiments, theradio frequency circuit 1304 may also include NFC (Near Field Communication ) related circuits, which are not limited in this application.

Thedisplay screen 1305 is used to display a UI (User Interface). The UI may include graphics, text, icons, video, and any combination thereof. When thedisplay 1305 is a touch display, thedisplay 1305 also has the ability to capture touch signals at or above the surface of thedisplay 1305. The touch signal may be input to theprocessor 1301 as a control signal for processing. At this point, thedisplay 1305 may also be used to provide virtual buttons and/or a virtual keyboard, also referred to as soft buttons and/or a soft keyboard. In some embodiments, thedisplay screen 1305 may be one and disposed on the front panel of the terminal 1300; in other embodiments, thedisplay 1305 may be at least two, disposed on different surfaces of the terminal 1300 or in a folded configuration; in still other embodiments, thedisplay 1305 may be a flexible display disposed on a curved surface or a folded surface of theterminal 1300. Even more, thedisplay screen 1305 may be arranged in a non-rectangular irregular pattern, i.e., a shaped screen. Thedisplay screen 1305 may be made of LCD (Liquid Crystal Display ), OLED (Organic Light-Emitting Diode) or other materials.

Thecamera assembly 1306 is used to capture images or video. Optionally,camera assembly 1306 includes a front camera and a rear camera. Typically, the front camera is disposed on the front panel of the terminal and the rear camera is disposed on the rear surface of the terminal. In some embodiments, the at least two rear cameras are any one of a main camera, a depth camera, a wide-angle camera and a tele camera, so as to realize that the main camera and the depth camera are fused to realize a background blurring function, and the main camera and the wide-angle camera are fused to realize a panoramic shooting and Virtual Reality (VR) shooting function or other fusion shooting functions. In some embodiments,camera assembly 1306 may also include a flash. The flash lamp can be a single-color temperature flash lamp or a double-color temperature flash lamp. The dual-color temperature flash lamp refers to a combination of a warm light flash lamp and a cold light flash lamp, and can be used for light compensation under different color temperatures.

Theaudio circuit 1307 may include a microphone and a speaker. The microphone is used for collecting sound waves of users and environments, converting the sound waves into electric signals, and inputting the electric signals to theprocessor 1301 for processing, or inputting the electric signals to theradio frequency circuit 1304 for voice communication. For purposes of stereo acquisition or noise reduction, a plurality of microphones may be provided at different portions of the terminal 1300, respectively. The microphone may also be an array microphone or an omni-directional pickup microphone. The speaker is then used to convert electrical signals from theprocessor 1301 or theradio frequency circuit 1304 into sound waves. The speaker may be a conventional thin film speaker or a piezoelectric ceramic speaker. When the speaker is a piezoelectric ceramic speaker, not only the electric signal can be converted into a sound wave audible to humans, but also the electric signal can be converted into a sound wave inaudible to humans for ranging and other purposes. In some embodiments, theaudio circuit 1307 may also comprise a headphone jack.

Thelocation component 1308 is used to locate the current geographic location of the terminal 1300 to enable navigation or LBS (Location Based Service, location-based services). Thepositioning component 1308 may be a positioning component based on the united states GPS (Global Positioning System ), the beidou system of china, the grainer system of russia, or the galileo system of the european union.

Apower supply 1309 is used to power the various components in theterminal 1300. Thepower supply 1309 may be an alternating current, a direct current, a disposable battery, or a rechargeable battery. When thepower supply 1309 comprises a rechargeable battery, the rechargeable battery may support wired or wireless charging. The rechargeable battery may also be used to support fast charge technology.

In some embodiments, terminal 1300 also includes one ormore sensors 1310. The one ormore sensors 1310 include, but are not limited to: acceleration sensor 1311,gyroscope sensor 1312,pressure sensor 1313,fingerprint sensor 1314,optical sensor 1315, andproximity sensor 1316.

The acceleration sensor 1311 can detect the magnitudes of accelerations on three coordinate axes of the coordinate system established with theterminal 1300. For example, the acceleration sensor 1311 may be used to detect components of gravitational acceleration in three coordinate axes.Processor 1301 may controldisplay screen 1305 to display a user interface in either a landscape view or a portrait view based on gravitational acceleration signals acquired by acceleration sensor 1311. The acceleration sensor 1311 may also be used for the acquisition of motion data of a game or user.

Thegyro sensor 1312 may detect a body direction and a rotation angle of the terminal 1300, and thegyro sensor 1312 may collect a 3D motion of the user on the terminal 1300 in cooperation with the acceleration sensor 1311.Processor 1301 can implement the following functions based on the data collected by gyro sensor 1312: motion sensing (e.g., changing UI according to a tilting operation by a user), image stabilization at shooting, game control, and inertial navigation.

Pressure sensor 1313 may be disposed on a side frame of terminal 1300 and/or belowdisplay screen 1305. When thepressure sensor 1313 is disposed at a side frame of the terminal 1300, a grip signal of the terminal 1300 by a user may be detected, and theprocessor 1301 performs left-right hand recognition or shortcut operation according to the grip signal collected by thepressure sensor 1313. When thepressure sensor 1313 is disposed at the lower layer of thedisplay screen 1305, theprocessor 1301 realizes control of the operability control on the UI interface according to the pressure operation of the user on thedisplay screen 1305. The operability controls include at least one of a button control, a scroll bar control, an icon control, and a menu control.

Thefingerprint sensor 1314 is used to collect a fingerprint of the user, and theprocessor 1301 identifies the identity of the user based on the fingerprint collected by thefingerprint sensor 1314, or thefingerprint sensor 1314 identifies the identity of the user based on the collected fingerprint. Upon recognizing that the user's identity is a trusted identity, the user is authorized byprocessor 1301 to perform relevant sensitive operations including unlocking the screen, viewing encrypted information, downloading software, paying for and changing settings, etc. Thefingerprint sensor 1314 may be disposed on the front, back, or side of theterminal 1300. When a physical key or vendor Logo is provided on the terminal 1300, thefingerprint sensor 1314 may be integrated with the physical key or vendor Logo.

Theoptical sensor 1315 is used to collect ambient light intensity. In one embodiment,processor 1301 may control the display brightness ofdisplay screen 1305 based on the intensity of ambient light collected byoptical sensor 1315. Specifically, when the intensity of the ambient light is high, the display brightness of thedisplay screen 1305 is turned up; when the ambient light intensity is low, the display brightness of thedisplay screen 1305 is turned down. In another embodiment,processor 1301 may also dynamically adjust the shooting parameters ofcamera assembly 1306 based on the intensity of ambient light collected byoptical sensor 1315.

Aproximity sensor 1316, also referred to as a distance sensor, is typically provided on the front panel of theterminal 1300. Theproximity sensor 1316 is used to collect the distance between the user and the front of theterminal 1300. In one embodiment, whenproximity sensor 1316 detects a gradual decrease in the distance between the user and the front of terminal 1300,processor 1301controls display screen 1305 to switch from a bright screen state to a inactive screen state; when theproximity sensor 1316 detects that the distance between the user and the front surface of the terminal 1300 gradually increases, theprocessor 1301 controls thedisplay screen 1305 to switch from the off-screen state to the on-screen state.

Those skilled in the art will appreciate that the structure shown in fig. 13 is not limiting of terminal 1300 and may include more or fewer components than shown, or may combine certain components, or may employ a different arrangement of components.

In an exemplary embodiment, a computer terminal is also provided, the computer terminal including a processor and a memory having at least one instruction stored therein. At least one instruction is configured to be executed by one or more processors to implement any of the methods of detecting text regions described above.

In an exemplary embodiment, a computer readable storage medium is also provided, in which at least one instruction is stored, which when executed by a processor of a computer terminal, implements any of the methods of detecting text regions described above.

It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program for instructing relevant hardware, where the program may be stored in a computer readable storage medium, and the storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.

The foregoing is illustrative of the present invention and is not to be construed as limiting thereof, but rather, the present invention is to be construed as limited to the appended claims.

Claims

1. A method of detecting text regions, the method comprising:

acquiring an image to be subjected to text region detection, wherein the image contains multiple language texts;

extracting general features of at least two language texts in the image;

extracting distinguishing features of at least two language texts in the image based on the universal features;

determining a first text region of the multi-language text in the image according to the distinguishing feature;

fusing the distinguishing features to obtain total language features;

determining a second text region of the multi-language text in the image according to the total language feature;

updating the detection result of the text region according to the first text region and the second text region;

wherein the updating the detection result of the text region according to the first text region and the second text region includes:

for any first text region, deleting the first text region if a second text region with the overlapping rate of the first text region being larger than a third threshold value does not exist;

and for any second text region, if the first text region with the overlapping rate of the second text region being larger than a fourth threshold value does not exist, the second text region is used as a new language text region.

2. The method of claim 1, wherein the extracting common features of at least two language texts in the image comprises:

invoking a first neural network, and extracting general features of at least two language texts in the image to obtain a general feature map;

the extracting distinguishing features of at least two language texts in the image based on the universal features comprises the following steps:

invoking a reference number of second neural networks, extracting distinguishing features of the at least two language texts based on the universal feature map, and obtaining a distinguishing feature map, wherein each second neural network is a neural network trained according to supervision information of any language text.

3. The method of claim 2, wherein said determining a first text region of a plurality of language text in the image from the distinguishing feature comprises:

invoking a classifier with reference quantity, and acquiring confidence coefficient graphs of at least two language texts based on the distinguishing feature graphs, wherein each pixel in the confidence coefficient graphs has a confidence coefficient value;

for the confidence coefficient map of any language text, taking pixels with confidence coefficient values not lower than a first threshold value in the confidence coefficient map of any language text as a pixel set of the any language text;

Generating an circumscribed rectangle of the pixel set of the text in any language, and taking the circumscribed rectangle as a first text area of the text in any language.

4. The method of claim 2, wherein said determining a first text region of a plurality of language text in the image from the distinguishing feature comprises:

for the confidence level diagram of any language text, calling a third neural network, and acquiring at least two deviation coordinates based on the universal feature diagram;

taking the coordinates of the pixel points with the confidence coefficient value not lower than a second threshold value in the confidence coefficient map of any language text as reference coordinates, and respectively calculating the at least two deviation coordinates to obtain at least two region coordinates;

and taking the area indicated by the at least two area coordinates as a first text area of the text in any language.

5. The method according to claim 1, wherein the method further comprises:

and performing non-maximum suppression on a first text region of the multi-language text.

6. The method according to claim 1, wherein the method further comprises:

and performing non-maximum suppression on a second text region of the multi-language text.

7. An apparatus for detecting text regions, the apparatus comprising:

a first determining module, configured to determine a first text region of a multi-language text in the image according to the distinguishing feature;

a second determining module, configured to determine a second text region of the multi-language text in the image according to the total language feature;

the updating module is used for updating the detection result of the text area according to the first text area and the second text area; wherein the updating the detection result of the text region according to the first text region and the second text region includes: for any first text region, deleting the first text region if a second text region with the overlapping rate of the first text region being larger than a third threshold value does not exist; and for any second text region, if the first text region with the overlapping rate of the second text region being larger than a fourth threshold value does not exist, the second text region is used as a new language text region.

8. The apparatus of claim 7, wherein the first extraction module is configured to invoke a first neural network to extract generic features of at least two language texts in the image to obtain a generic feature map;

9. The apparatus of claim 8, wherein the first determining module is configured to invoke a reference number of classifiers and obtain a confidence map of at least two language texts based on the distinguishing feature map, wherein each pixel in the confidence map has a confidence value; for the confidence coefficient map of any language text, taking pixels with confidence coefficient values not lower than a first threshold value in the confidence coefficient map of any language text as a pixel set of the any language text; generating an circumscribed rectangle of the pixel set of the text in any language, and taking the circumscribed rectangle as a first text area of the text in any language.

10. The apparatus of claim 8, wherein the first determining module is configured to invoke a reference number of classifiers and obtain a confidence map of at least two language texts based on the distinguishing feature map, wherein each pixel in the confidence map has a confidence value; for the confidence level diagram of any language text, calling a third neural network, and acquiring at least two deviation coordinates based on the universal feature diagram; taking the coordinates of the pixel points with the confidence coefficient value not lower than a second threshold value in the confidence coefficient map of any language text as reference coordinates, and respectively calculating the at least two deviation coordinates to obtain at least two region coordinates; and taking the area indicated by the at least two area coordinates as a first text area of the text in any language.

11. The apparatus of claim 7, wherein the updating module is further configured to perform non-maximum suppression on a first text region of the multi-language text.

12. The apparatus of claim 7, wherein the updating module is further configured to perform non-maximum suppression on a second text region of the multi-language text.

13. An apparatus for detecting text regions, the apparatus comprising a processor and a memory having at least one instruction stored therein, the instruction being executable by the processor to implement the method of detecting text regions as recited in any one of claims 1-6.

14. A computer readable storage medium having stored therein at least one instruction for execution by a processor to implement the method of detecting text regions of any of claims 1-6.