Detailed Description
Reference will now be made in detail to embodiments of the present application, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are exemplary only for the purpose of explaining the present application and are not to be construed as limiting the present application.
As used herein, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element, or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.
Research on OCR began long ago in countries around the world. Early work focused mainly on character recognition methods, and the characters recognized were limited to the digits 0 to 9. Research on OCR technology in China started later: recognition of digits, English letters, and symbols was not studied until the 1970s, and recognition of Chinese characters entered a substantive stage only at the end of the 1970s. Early OCR software could not meet practical requirements because of factors such as recognition rate and productization; at the same time, because the hardware was expensive and slow, the technology did not reach a practical level, and only individual departments, such as news publishing units, used OCR software. Gradually, with the wide adoption of flatbed scanners and the popularization of information automation and office automation in China, the further development of OCR technology has been greatly promoted, and as OCR technology has developed rapidly, the demand for labeled corpora has grown sharply.
At present, an OCR corpus is labeled by first framing an image region in one click according to the target text content, then manually adjusting the boundary of the frame to fit the area of the target text content, and, after repeated adjustments, selecting the correct text content from the frame. However, this operation process is complicated and time-consuming, which reduces labeling efficiency; at the same time, because the size of the selected area depends on manual adjustment, the selected area may fail to fit the text area tightly or may cut into part of the text area, which affects the labeling result.
Based on this, embodiments of the present application provide a text image processing method and apparatus, an electronic device, and a readable storage medium, which aim to solve the problems of nonstandard frame selection area and complex operation.
It can be understood that the scheme provided by the embodiments of the present application can be applied to the field of text annotation and to any other scenario requiring text annotation. For example, in the field of character recognition, characters in a designated area (which can be understood as a target text area) of an image selected by a user need to be recognized; the designated area may be processed based on the scheme provided in the embodiments of the present application to determine a final bounding box with an accurate extent, and character recognition may then be performed only on the text area within the final bounding box, so that the amount of data to be processed can be effectively reduced.
Optionally, the scheme provided in the embodiments of the present application may be applied to the labeling of training images for a text processing model in artificial intelligence: the text region range in a training image is determined based on the method of the embodiments of the present application, so that the text region can be labeled automatically. The method may also be applied to other text processing scenarios.
Artificial Intelligence (AI) is a theory, method, technique, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, sense the environment, acquire knowledge, and use that knowledge to obtain the best result. In other words, artificial intelligence is a comprehensive branch of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the capabilities of perception, reasoning, and decision making.
Artificial intelligence is a comprehensive discipline involving a wide range of fields, covering both hardware-level and software-level technologies. The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and mechatronics. Artificial intelligence software technology mainly comprises computer vision, speech processing, natural language processing, and machine learning/deep learning.
Natural Language Processing (NLP) is an important direction in the fields of computer science and artificial intelligence. It studies the theories and methods that enable effective communication between humans and computers using natural language. Natural language processing is a science that integrates linguistics, computer science, and mathematics. Research in this field involves natural language, that is, the language people use every day, so it is closely related to the study of linguistics. Natural language processing techniques typically include text processing, semantic understanding, machine translation, question answering, knowledge graphs, and the like.
Optionally, the data processing/computing involved in the embodiments of the present application may be performed in a cloud-computing-based manner. Cloud computing is a computing mode that distributes computing tasks over a resource pool formed by a large number of computers, so that various application systems can acquire computing power, storage space, and information services as needed. The network that provides the resources is referred to as the "cloud". To the user, the resources in the "cloud" appear to be infinitely expandable and can be acquired at any time, used on demand, expanded at any time, and paid for according to use.
As a basic capability provider of cloud computing, a cloud computing resource pool (referred to as an IaaS (Infrastructure as a Service) platform for short) is established, and multiple types of virtual resources are deployed in the resource pool for external clients to use selectively.
According to logical function division, a PaaS (Platform as a Service) layer can be deployed on the IaaS (Infrastructure as a Service) layer, and a SaaS (Software as a Service) layer can be deployed on the PaaS layer; SaaS can also be deployed directly on IaaS. PaaS is a platform on which software runs, such as a database or a web container. SaaS is the various kinds of business software, such as a web portal or a bulk SMS sender. Generally speaking, SaaS and PaaS are upper layers relative to IaaS.
Optionally, the text image to be processed in the embodiments of the present application may be a cloud-stored text image. A distributed cloud storage system (hereinafter referred to as a storage system) is a storage system that, through application software or application interfaces and by means of functions such as cluster application, grid technology, and a distributed storage file system, integrates a large number of storage devices of various types in a network (storage devices are also referred to as storage nodes) so that they work cooperatively, and that provides data storage and service access functions to the outside.
At present, the storage method of such a storage system is as follows: logical volumes are created, and when a logical volume is created, it is allocated physical storage space, which may be composed of the disks of one storage device or of several storage devices. A client stores data on a logical volume, that is, the data is stored on a file system. The file system divides the data into a plurality of parts, each part being an object; an object contains not only the data but also additional information such as a data identifier (ID). The file system writes each object into the physical storage space of the logical volume and records the storage location information of each object, so that when the client requests access to the data, the file system can allow the client to access the data according to the storage location information of each object.
The process by which the storage system allocates physical storage space for a logical volume is specifically as follows: the physical storage space is divided into stripes in advance according to the capacity estimates of the objects to be stored in the logical volume (these estimates often leave a large margin relative to the capacity of the actual objects to be stored) and the Redundant Array of Independent Disks (RAID) configuration; one logical volume can be understood as one stripe, whereby the logical volume is allocated its physical storage space.
The following describes the technical solutions of the present application and how to solve the above technical problems with specific embodiments. The following several specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments. Embodiments of the present application will be described below with reference to the accompanying drawings.
The method provided by the embodiments of the present application may be executed by a terminal device or a server, or may be executed through interaction between a terminal device and a server. The server may be an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server providing cloud computing services. The terminal device may be, but is not limited to, a smart phone, a tablet computer, a notebook computer, a desktop computer, and the like. The terminal device and the server may be directly or indirectly connected through wired or wireless communication, which is not limited in the present application.
Fig. 2 shows a flowchart of a text image processing method provided in an embodiment of the present application. Optionally, the method may be performed by any electronic device, such as a terminal device or a server, or may be performed through interaction between a terminal device and a server. As shown in fig. 2, the method may include:
Step S101, acquiring a text image to be processed.
The text image to be processed may refer to an image that needs text annotation; it includes a text region and may be a single image acquired by an image acquisition device or a frame of a video. Optionally, after the text image to be processed is acquired, it may be displayed by the terminal device.
As an alternative embodiment, the text image to be processed is obtained by:
acquiring an image acquisition operation, wherein the image acquisition operation comprises an acquisition path of a text image to be processed;
accessing an image database according to the acquisition path;
and acquiring the text image to be processed corresponding to the acquisition path from the image database.
When a user (such as a developer) wants to perform text annotation on a text image to be processed, the text image to be processed can be acquired first; to acquire it, the user can trigger an image acquisition operation that includes the acquisition path of the text image to be processed. Correspondingly, when the terminal device receives the image acquisition operation triggered by the user, it may access the image database based on the acquisition path included in the operation, acquire the text image to be processed corresponding to the acquisition path from the image database, and display the acquired image in the interface.
Optionally, the specific form in which the user triggers the image acquisition operation may be configured as required; for example, it may be a trigger action performed by the user at a specified operation position on the application program interface. In practical applications, the image acquisition operation may be triggered by a related trigger in the client, where the specific form of the trigger may be configured according to actual needs; for example, the trigger may be a virtual "image" button displayed on the application program interface, and the user clicking the button constitutes triggering the image acquisition operation.
In an optional embodiment of the present application, the acquisition path includes a URL (Uniform Resource Locator) or a local storage path, and acquiring the text image to be processed corresponding to the acquisition path from the image database includes:
if the acquisition path is a local storage path, acquiring a text image to be processed according to the local storage path;
and if the acquisition path is the URL, acquiring the text image to be processed according to the URL.
The acquisition source of the text image to be processed is not limited. For example, the image database may be an image database of the server; in this case the text image to be processed is stored at the server, the path used for acquiring it may be the URL of the text image to be processed, and, correspondingly, the acquisition path included in the image acquisition operation triggered by the user is that URL, so that the terminal device can acquire the text image to be processed from the image database of the server according to the URL and display it after acquisition. Of course, the image database may also be an image database stored locally on the terminal device; in that case the text image to be processed is stored locally on the terminal device, the acquisition path is a local storage path, and, correspondingly, once the terminal device knows the local storage path, it can acquire the text image to be processed from its locally stored image database based on that path.
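For illustration only, the branching acquisition logic described above might be sketched as follows in Python; the helper name and the use of the requests and Pillow libraries are assumptions of this sketch, not part of the embodiment:

```python
from io import BytesIO

import requests
from PIL import Image

def acquire_text_image(acquisition_path: str) -> Image.Image:
    """Fetch the text image to be processed from a URL or a local storage path."""
    if acquisition_path.startswith(("http://", "https://")):
        # Acquisition path is a URL: fetch the image from the server-side image database.
        response = requests.get(acquisition_path, timeout=10)
        response.raise_for_status()
        return Image.open(BytesIO(response.content))
    # Otherwise treat the path as a local storage path on the terminal device.
    return Image.open(acquisition_path)
```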
Optionally, if the text image to be processed is a text image stored locally on the terminal device and the text annotation is performed by the server, the server does not yet know the specific text image to be processed; the terminal device can therefore send the text image to be processed to the server after acquiring it according to the local storage path. Correspondingly, when the acquired text image to be processed is displayed by the terminal device, the user can also trigger an image deletion operation, and when the terminal device receives the image deletion operation, the text image to be processed can be deleted from the display interface.
Step S102, responding to the text region selection operation of the user aiming at the text image to be processed, and generating an initial boundary box of a target text region corresponding to the text region selection operation.
The text region selection operation refers to an action by which the user selects the target text region, and the specific form for triggering it may be configured as required, which is not limited in the embodiments of the present application. For example, a designated trigger button may be provided on the application program interface; when the user clicks the designated trigger button, the user may then frame-select a text region with a mouse or a stylus, and when the frame selection is completed, the user is determined to have triggered the text region selection operation, the region corresponding to the frame being the target text region corresponding to that operation.
Correspondingly, the terminal device can respond to the user's text region selection operation for the text image to be processed and display the text box corresponding to the operation as the initial bounding box of the target text region. Optionally, an edge of the initial bounding box may also lie in the middle of the target text region.
Step S103, based on the initial bounding box, determining a critical point between the target text region and the background region in the text image to be processed.
Step S104, generating a final bounding box of the target text region according to the critical points.
A critical point is a pixel point at the boundary between the target text region and the background region, and the critical points can be determined from the values of the pixel points within the initial bounding box of the target text region. If the pixel values of two adjacent pixel points within the initial bounding box differ, the boundary between the target text region and the background region lies between them, and such a pixel point is a critical point. Further, a final bounding box of the target text region may be generated based on the determined critical points. Optionally, the user may trigger a text content selection operation, which refers to an action indicating that the user wants to determine the final bounding box of the target text region and informing the system of the position of a text pixel point in the text image to be processed; the specific form of triggering the text content selection operation may be configured as needed, which is not limited in the embodiments of the present application. For example, when the user clicks any text included in the target text region, the user may be regarded as having triggered a text content selection operation, and the pixel point at the clicked position is a text pixel point.
Accordingly, when the terminal device receives a text content selection operation of the user for the target text region, a final bounding box of the target text region may be generated according to the determined critical point in response to the text content selection operation and displayed to the user. Optionally, the final bounding box is not larger than the initial bounding box.
It can be understood that, when the final bounding box of the target text region is determined by the server, the terminal device may send a text content selection request to the server in response to the user's text content selection operation for the target text region; the server can then learn the positions of the text pixel points in the text image to be processed, determine the final bounding box of the target text region, and return it to the terminal device so that the terminal device displays it to the user.
In an alternative embodiment of the present application, the method further comprises:
and carrying out text region labeling on the text image to be processed according to the final bounding box, so as to take the labeled text image to be processed as a training sample image of the text processing model.
As an optional embodiment, after the final bounding box of the target text region is obtained, text labeling may be performed on the to-be-processed text image according to the obtained final bounding box, and the to-be-processed text image after labeling is used as a training sample image to train the initial text processing model, so as to obtain a final text processing model. For example, when the text processing model is a text recognition model, a text included in the final bounding box in the text image to be processed may be labeled to obtain a labeled text image to be processed, and then the initial text recognition model is trained based on the labeled text image to be processed until a corresponding training condition is satisfied.
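Purely as an illustration of what the labeling step might produce, one labeled training-sample record could be serialized as below; the record layout and field names are hypothetical and not prescribed by the embodiment:

```python
import json

def make_training_annotation(image_path, final_box, text):
    """Assemble one labeled training-sample record (hypothetical schema)."""
    record = {
        "image": image_path,        # path of the labeled text image to be processed
        "bbox": list(final_box),    # final bounding box as [x_min, y_min, x_max, y_max]
        "transcription": text,      # text content inside the final bounding box
    }
    return json.dumps(record, ensure_ascii=False)
```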
Optionally, when the scheme provided in the embodiments of the present application is executed interactively by the terminal device and the server, the terminal may acquire the text image to be processed and send it to the server; the server then responds to the user's text region selection operation for the text image to be processed, generates an initial bounding box of the corresponding target text region, determines the critical points of the target text region in the text image to be processed based on the initial bounding box, and generates a final bounding box of the target text region according to the critical points. Optionally, when the text image to be processed is acquired, it may be displayed on the terminal device, and after the initial and final bounding boxes of the target text region are generated, they may also be displayed to the user through the terminal device, so that the user performs subsequent processing based on the displayed initial or final bounding box. When the method is executed by the server alone, the text image to be processed, the initial bounding box of the target text region, and the final bounding box of the target text region can be displayed to the user through a display device corresponding to the server.
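The following minimal sketch condenses the core of steps S103 and S104, assuming the image has already been converted to a binary image and the boxes are axis-aligned; the function and parameter names are hypothetical:

```python
import numpy as np

def tighten_bounding_box(binary: np.ndarray, initial_box, text_value: int = 255):
    """Shrink a user-drawn initial bounding box to the final bounding box by
    locating the outermost text pixels (the critical points) inside it."""
    x0, y0, x1, y1 = initial_box
    window = binary[y0:y1, x0:x1]
    ys, xs = np.nonzero(window == text_value)  # text pixels inside the initial box
    if ys.size == 0:
        return initial_box  # no text found: keep the user's box unchanged
    # The outermost text pixels on each side play the role of the critical points.
    return (x0 + xs.min(), y0 + ys.min(), x0 + xs.max() + 1, y0 + ys.max() + 1)
```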
Based on the scheme provided by the embodiments of the present application, the bounding box of a text region in a text image can be detected intelligently, and a bounding box that fits the text region can be selected accurately without multiple manual adjustments. This improves frame selection efficiency and effectively solves the problem of a nonstandard frame selection region caused by manual adjustment, so that when the method is applied to text frame labeling, the efficiency and accuracy of text frame labeling are effectively improved.
In an alternative embodiment of the present application, determining a critical point between the target text region and the background region in the text image to be processed based on the initial bounding box includes:
if the text image to be processed is not the image in the target format, processing the text image to be processed into an image in the target format, wherein the image in the target format is a binary image or a gray image;
for the target format image, determining a target text region and a background region based on the initial bounding box;
and determining a critical point between the target text region and the background region according to the pixel value of each pixel point in the target text region and the background region.
Optionally, when the final bounding box of the target text region is to be generated, it may first be determined whether the text image to be processed is an image in the target format; if not, the text image to be processed may be converted into an image in the target format, and the final bounding box of the target text region is then generated based on the processed text image. The target format image can be a binary image or a gray image: when the target format image is a binary image, the text image to be processed is converted into a binary image if it is not one already; and when the target format image is a gray image, the text image to be processed is converted into a gray image if it is not one already.
Processing the text image to be processed into the target format image means adjusting the RGB (Red, Green, Blue) values of the pixel points in the text image to be processed. For example, when the target format image is a binary image, this may mean converting the pixel values of the pixel points in the text image to be processed into black or white: the pixel values of the text pixel points may be converted into white and those of the background pixel points into black, or the text pixel points may be converted into black and the background pixel points into white, and so on. When the target format image is a gray image, an external script can be used to gray out the text image to be processed, that is, the RGB values of each pixel point are converted into a gray value, the RGB components being uniformly replaced by that gray value to form a new RGB triple (Gray, Gray, Gray), thereby realizing the gray effect.
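As an illustrative sketch of the format conversion described above, assuming NumPy and Pillow are available, with the standard luminance weights standing in for whatever graying rule the external script applies:

```python
import numpy as np
from PIL import Image

def to_target_format(image: Image.Image, binary: bool = True, threshold: int = 128) -> np.ndarray:
    """Convert an RGB text image into the target format (gray image or binary image)."""
    rgb = np.asarray(image.convert("RGB"), dtype=np.float32)
    # Replace each pixel's RGB values with a single gray value; the standard
    # luminance weighting below is an assumption of this sketch.
    gray = 0.299 * rgb[..., 0] + 0.587 * rgb[..., 1] + 0.114 * rgb[..., 2]
    if not binary:
        return gray.astype(np.uint8)
    # Binarize: pixels at or above the threshold become white (255), the rest
    # black (0); the polarity of text vs. background depends on the source image.
    return np.where(gray >= threshold, 255, 0).astype(np.uint8)
```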
Optionally, once the text image to be processed has been converted into the target format image and the initial bounding box of the target text region has been obtained, the target text region and the background region may be determined based on the initial bounding box, for example, within the initial bounding box only; compared with determining all text regions and the background region in the whole text image to be processed, this effectively reduces the amount of data to be processed and improves data processing efficiency. Optionally, once the target text region and the background region are determined, the critical points between them may be determined according to the pixel values of the pixel points in the two regions, and the final bounding box of the target text region may then be generated according to the critical points.
In an alternative embodiment of the present application, the critical points include a critical point in a first direction and a critical point in a second direction, the first direction is a writing direction of the text in the target text region, the second direction is a direction perpendicular to the first direction, and the critical points in each of the first direction and the second direction include critical points located on both sides of the text region in the direction;
generating a final bounding box of the target text region according to the critical points, comprising:
for each side of each direction in the first direction and the second direction, obtaining a boundary connecting line corresponding to the side based on the critical point of the side;
and connecting the boundary connecting lines to generate a final boundary box of the target text area.
In the embodiment of the present application, the critical points may include a critical point in a first direction and a critical point in a second direction, and the first direction may refer to a writing direction of the text in the target text region, and the second direction is a direction perpendicular to the first direction. For example, when the writing direction of the text in the target text region is from left to right, the first direction is from left to right, and the second direction is from top to bottom, and when the writing direction of the text in the target text region is from top to bottom, the first direction is from top to bottom, and the second direction is from left to right.
Optionally, for the critical point in the first direction and the critical point in the second direction, the critical point in each of the first direction and the second direction includes critical points located on both sides of the text region in the direction, that is, the critical point in each direction includes critical points on both sides of the text region. For example, it is assumed that the writing direction of the text in the target text region is from left to right, the first direction is from left to right, the second direction is from top to bottom, the critical points in the first direction include critical points located at the top and bottom sides of the text region, and the critical points in the second direction include critical points located at the left and right sides of the text region.
Accordingly, when determining the final bounding box of the target text region according to the critical points, for each side of each of the first direction and the second direction, the critical point belonging to the side may be determined from the critical points, and the boundary connecting line corresponding to the side may be obtained based on the critical point belonging to the side.
In one example, assume that the critical points in the first direction include critical points located at the upper and lower sides of the text region, and that the critical points in the second direction include critical points located at the left and right sides of the text region. The critical points belonging to the upper side of the text region are determined, and a boundary connecting line corresponding to the upper side is obtained based on those points; similarly, the boundary connecting lines corresponding to the lower side, the left side, and the right side of the text region are obtained. The boundary connecting lines corresponding to the upper, lower, left, and right sides of the text region are then connected to generate the final bounding box of the target text region.
In an optional embodiment of the present application, determining a critical point between a target text region and a background region according to pixel values of pixels in the target text region and the background region includes:
determining pixel values of all pixel points in a target text region and a background region;
determining each initial critical point between the target text region and the background region according to the difference value between the pixel values of each pixel point;
and determining a critical point between the target text region and the background region according to each initial critical point.
Here the text image to be processed is a target format image, so the pixel values of the pixel points in the background region are all the same, the pixel values of the pixel points in the text region are all the same, and the two differ from each other. When the difference between the pixel values of two pixel points is computed, if the pixel points belong to the same region (i.e., both to the background region or both to the text region), the difference is 0; when they do not belong to the same region (e.g., one belongs to the background region and one to the text region), the difference is not 0.
Correspondingly, when determining the critical points, the pixel values of the pixel points in the target text region and the background region can be determined first; each initial critical point between the target text region and the background region is then determined according to the differences between the pixel values of adjacent pixel points, and the critical points are finally selected from the initial critical points. If the difference between the pixel values of two pixel points is not 0, the two pixel points are located in the background region and the target text region respectively. Since the position of a text pixel point in the text image to be processed is known from the text content selection operation, the pixel value of the text pixel points can be learned from the pixel value at that position; the pixel point belonging to the text can then be identified among the two pixel points according to that value, and that text pixel point is taken as an initial critical point.
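A minimal sketch of this initial-critical-point search, assuming a NumPy binary image in which the text pixel value is known from the text content selection operation (the function name and the 4-neighborhood notion of adjacency are assumptions of the sketch):

```python
import numpy as np

def initial_critical_points(binary: np.ndarray, text_value: int = 255):
    """Find initial critical points: text pixels with at least one horizontal
    or vertical neighbor whose value differs (pixels on the text/background border)."""
    points = []
    h, w = binary.shape
    for y in range(h):
        for x in range(w):
            if binary[y, x] != text_value:
                continue  # only pixels known to belong to the text are kept
            for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                ny, nx = y + dy, x + dx
                if 0 <= ny < h and 0 <= nx < w and binary[ny, nx] != binary[y, x]:
                    points.append((x, y))  # difference is not 0: border pixel
                    break
    return points
```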
In an alternative embodiment of the present application, determining a critical point between the target text region and the background region according to each initial critical point includes:
and for each side of each of the first direction and the second direction, determining at least one final critical point from the initial critical points according to the distances, in the direction perpendicular to that direction, of the initial critical points belonging to the side.
Alternatively, for each side of each of the first direction and the second direction, a distance of each initial critical point belonging to the side in a direction perpendicular to the direction may be determined, and then at least one final critical point may be determined from the initial critical points according to the distance of each initial critical point in the direction perpendicular to the direction.
In an example, assume that the first direction is from left to right, so the two sides in the first direction are the upper and lower sides of the text region, and each initial critical point can be represented as an (x, y) coordinate. For the initial critical points belonging to the upper side of the text region, the distance of each point along the y-axis can be determined, and then at least two critical points are determined from those points according to these y-axis distances. Similarly, for the initial critical points belonging to the left side of the text region, the distance of each point along the x-axis can be determined, and at least one final critical point is then determined from those points according to these x-axis distances.
In an alternative embodiment of the present application, determining a final critical point from the initial critical points comprises:
if the distance between the initial critical points on the same side is smaller than or equal to a set value, determining at least one initial critical point as a final critical point;
if the distance between the initial critical points on the same side is greater than the set value, determining at least two final critical points from the initial critical points, wherein the at least two final critical points include the initial critical point that is farthest from the other side corresponding to this side.
Optionally, for the initial critical points belonging to the same side, if the distances between them in the direction perpendicular to that side's direction are all smaller than or equal to the set value, the initial critical points are adjacent and close to each other and may be considered to lie on the same line; in this case any one of them may be selected and determined as the final critical point. Conversely, if the distance between some initial critical points in that perpendicular direction is greater than the set value, there are initial critical points that are adjacent yet far apart, and at least two final critical points should be determined from the initial critical points, so as to ensure the accuracy of the final bounding box ultimately determined from the critical points.
The at least two final critical points include the initial critical point that is farthest from the other side corresponding to this side. For example, if the two initial critical points whose distance exceeds the set value belong to the left side of the text region, the distance between each initial critical point belonging to the left side and the right side of the text region may be determined, and the initial critical point farther away from the right side may be taken as one of the at least two final critical points. Of course, in practical applications, both of the two initial critical points whose distance exceeds the set value may be determined as final critical points.
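One possible reading of this filtering rule is sketched below; the parameterization (which coordinate axis is perpendicular to the side, and how the opposite side's coordinate is given) is an assumption of the sketch:

```python
def final_critical_points(side_points, perp_axis, set_value, opposite_coord):
    """Reduce one side's initial critical points to its final critical point(s).

    side_points:    initial critical points (x, y) belonging to this side
    perp_axis:      0 for x or 1 for y, the axis perpendicular to this side's direction
    set_value:      the distance threshold described above
    opposite_coord: perp_axis coordinate of the opposite side, used to pick the farthest point
    """
    coords = [p[perp_axis] for p in side_points]
    if max(coords) - min(coords) <= set_value:
        # All points lie roughly on one line: any single one may serve.
        return [side_points[0]]
    # Otherwise keep at least two, including the point farthest from the
    # opposite side, so the final box still encloses all of the text.
    farthest = max(side_points, key=lambda p: abs(p[perp_axis] - opposite_coord))
    nearest = min(side_points, key=lambda p: abs(p[perp_axis] - opposite_coord))
    return [farthest, nearest]
```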
In an alternative embodiment of the present application, obtaining the boundary connecting line corresponding to the side based on the critical point of the side includes:
if the critical point of the side is one, determining a boundary connecting line corresponding to the side based on the critical point of the side and the direction to which the side belongs, wherein the direction to which the side belongs is a first direction or a second direction;
and if the number of the critical points on the side is at least two, determining the boundary connecting line corresponding to the side based on the at least two critical points on the side.
Optionally, if there is one critical point belonging to the same side, the boundary connecting line corresponding to the side may be determined based on the one critical point and the direction to which the side belongs; accordingly, if there are at least two critical points belonging to the same side, the boundary connecting line corresponding to the side can be determined based on the at least two critical points.
In an example, assume that the first direction is from left to right, so the two sides in the first direction are the upper and lower sides of the text region, and the second direction is from top to bottom, so the two sides in the second direction are the left and right sides of the text region. If there is only one critical point located on the upper side of the text region, a line can be extended through that critical point to the left and right, taking the first direction as the horizontal direction and the second direction as the vertical direction, to obtain the boundary connecting line of the upper side of the text region. If there are at least two critical points located on the upper side, the at least two critical points are connected and the line is extended to the left and right, again taking the first direction as horizontal and the second as vertical, to obtain the boundary connecting line of the upper side of the text region.
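For an axis-aligned case such as the example above, the four boundary connecting lines and their connection into the final bounding box might be sketched as follows; taking the outermost coordinate when a side has several critical points is an assumption of this sketch:

```python
def final_bounding_box(top_pts, bottom_pts, left_pts, right_pts):
    """Connect the four boundary connecting lines into an axis-aligned final
    bounding box (writing direction assumed left-to-right)."""
    # With a single critical point, the side's boundary line simply extends
    # through that point along the side's direction; with several points the
    # outermost coordinate is taken so the line still touches all of them.
    top = min(y for _, y in top_pts)        # upper boundary connecting line
    bottom = max(y for _, y in bottom_pts)  # lower boundary connecting line
    left = min(x for x, _ in left_pts)      # left boundary connecting line
    right = max(x for x, _ in right_pts)    # right boundary connecting line
    return left, top, right, bottom         # (x_min, y_min, x_max, y_max)
```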
In an alternative embodiment of the present application, generating a final bounding box of the target text region according to the critical point includes:
generating a candidate bounding box corresponding to the target text region according to the critical point;
if the number of the candidate bounding boxes is one, determining the candidate bounding box as a final bounding box of the target text area;
and if the number of the candidate bounding boxes is at least two, filtering the at least two candidate bounding boxes according to the heights of the at least two candidate bounding boxes to obtain the final bounding box of the target text region, wherein the height is the height in the vertical direction corresponding to the writing direction of the text in the target text region.
Optionally, when the initial bounding box of the target text region corresponding to the text region selection operation includes only the text region that needs text labeling, exactly one candidate bounding box corresponding to the target text region is determined according to the critical points, and that candidate bounding box may be directly determined as the final bounding box of the target text region.
However, in practical applications, the initial bounding box of the target text region corresponding to the text region selection operation may sometimes include parts of other text regions beyond the one to be labeled, i.e., other text regions may be clipped by the box. In this case, at least two candidate bounding boxes corresponding to the target text region may be determined according to the critical points, and the candidate bounding boxes may then be filtered according to their heights to obtain the final bounding box of the target text region, where the height of a candidate bounding box refers to its extent in the direction perpendicular to the writing direction of the text in the target text region. Optionally, when filtering the at least two candidate bounding boxes by height, the candidate bounding box with the largest height may be retained as the final bounding box of the target text region.
In an example, assume that the writing direction of the text in the target text region is from left to right, so the height of a candidate bounding box is its extent in the vertical direction, i.e., its top-to-bottom width. Suppose that, according to the critical points in the target text region, the candidate bounding boxes corresponding to the target text region are determined as shown in fig. 3, i.e., two candidate bounding boxes are determined (candidate bounding box A and candidate bounding box B). In this case, the height of each candidate bounding box may be determined, and the candidate bounding box with the largest height is retained as the final bounding box of the target text region.
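A minimal sketch of this height-based filtering, assuming left-to-right text and boxes given as (x_min, y_min, x_max, y_max) tuples:

```python
def filter_candidate_boxes(candidate_boxes):
    """Keep the candidate bounding box with the largest height
    (height = extent perpendicular to the writing direction)."""
    if len(candidate_boxes) == 1:
        return candidate_boxes[0]
    # For left-to-right text the height is the vertical extent y_max - y_min.
    return max(candidate_boxes, key=lambda b: b[3] - b[1])
```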
In an alternative embodiment of the present application, after generating the final bounding box of the target text region according to the critical point, the method further includes:
cutting a text area from the text image to be processed according to the final bounding box;
performing text recognition on the cut text area to obtain a text recognition result;
displaying the text recognition result in a text labeling area of the image to be processed;
and responding to the approval operation of the text recognition result in the text labeling area, and generating a text image labeling corpus according to the approved text recognition result and the text image to be processed.
Optionally, after the final bounding box of the target text region is generated, the final bounding box may be used as a cropping frame to crop a text region from the text image to be processed, where the size of the cropped text region is greater than or equal to the size of the region enclosed by the final bounding box. Correspondingly, text recognition can be performed on the cropped text region to obtain a text recognition result, which is displayed in the text labeling area of the image to be processed so that the user can see the recognition result. The user can then approve the recognized text recognition result, that is, verify whether it is consistent with the text included in the cropped text region, and trigger an approval operation when the verification is completed.
When the user checks the recognized text recognition result, if it is consistent with the text included in the cropped text region, the text recognition result is accurate, the approval operation triggered by the user is a confirmation operation, and the approved text recognition result is the recognized result itself. Correspondingly, if the recognized text recognition result is inconsistent with the text included in the cropped text region, indicating that the recognition result is wrong, the user can correct the recognized result according to the text included in the cropped region and trigger the approval operation after the correction is completed; the corrected text recognition result is then the approved text recognition result.
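The cropping-and-recognition step might be sketched as follows; the OCR engine is deliberately left abstract here, since the embodiment does not prescribe one:

```python
from PIL import Image

def crop_and_recognize(image: Image.Image, final_box, recognize):
    """Crop the text region using the final bounding box and run text recognition.

    `recognize` is any OCR callable; which engine backs it is left open.
    The returned result would be shown in the text labeling area, where the
    user confirms it or corrects it before the labeled corpus is generated.
    """
    cropped = image.crop(final_box)  # final_box = (x_min, y_min, x_max, y_max)
    return cropped, recognize(cropped)
```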
To better understand the text image processing method provided in the embodiments of the present application, the method is described in detail below with reference to a specific application scenario. In the application scenario of this example, the final bounding box of a target text region in a text image to be processed is determined according to the text image processing method; text region labeling is then performed on the text image to be processed according to the generated final bounding box, and the labeled image is used as a training sample image for a text processing model. The acquisition path of the text image to be processed is a URL, the server determines the final bounding box of the target text region, and the target format image is a binary image, such as a black-and-white image. The method for determining the final bounding box of the target text region in the text image to be processed is described below with reference to fig. 4.
Step S401, the terminal device acquires and displays a text image to be processed;
specifically, a user can trigger an image acquisition operation on a terminal device, the image acquisition operation includes a URL of a text image to be processed, the terminal device responds to the image acquisition operation and sends the URL of the text image to be processed to a server, the server accesses an image database according to the URL of the text image to be processed to acquire the text image to be processed and returns the text image to the terminal device, and the terminal device loads the text image to be processed when receiving the text image to be processed and displays the text image to be processed in an application interface. For example, the acquired to-be-processed text image is shown in fig. 5, and the to-be-processed text image is an RGB image and includes a text region "time is a bird that is always flying" and a background region (e.g., a region other than the text region in fig. 5).
Step S402, the terminal device displays an initial bounding box of the target text region corresponding to the text region selection operation;
specifically, when a user wants to label a text in a text region to be processed, a text region selection operation may be triggered to select a text region (i.e., a target text region) that specifically wants to be labeled, the terminal device determines and displays an initial bounding box of the target text region corresponding to the text region selection operation in response to the text region selection operation, and meanwhile, may send the initial bounding box of the target text region to the server to notify the server that the user wants to label the text in the initial bounding box. When the terminal device sends the initial boundary box of the target text region to the server, the terminal device can send the coordinates of the initial boundary box of the target text region in the text image to be processed to the server, and the server can determine the initial boundary box of the target text region according to the received coordinates.
In an example, assume the user wants to label the text "flying bird" included in the text image to be processed, the text "flying bird" being the target text region. When triggering the text region selection operation, the user may frame-select the text "flying bird" with the mouse; correspondingly, the terminal device may, in response to the operation, display the boundary of the framed region as the initial bounding box of the target text region, shown specifically as the black rectangle in fig. 6.
Step S403, the terminal device receives the user's text content selection operation for the target text region and sends a text content selection request to the server;
specifically, after the initial boundary box of the target text region is displayed to the user, the user may trigger a text content selection operation for the target text region, and when the terminal device receives the text content selection operation, it indicates that the user wants to determine the final boundary box of the target text region, at this time, a text content selection request may be sent to the server, and when the server receives the text content selection request, the server may know the positions of the pixel points of the text in the text image to be processed,
step S404, the server responds to the text content selection request, judges whether the text image to be processed is processed into a binary image, if so, executes step S406, otherwise, executes step S405;
step S405; processing and converting a text image to be processed into a binary image;
specifically, when the server receives a text content selection request sent by the terminal device, it may be determined whether the text image to be processed is a black-and-white image, and if the text image to be processed is not a black-and-white image, binarization processing may be performed on each pixel point in the text image to be processed to obtain a processed text image to be processed, and then a final bounding box of the target text region is determined according to the processed image. For example, the text image to be processed shown in fig. 5 may be binarized into a black-and-white image, and specifically, may be as shown in fig. 7.
Step S406, the server determines each initial critical point between the target text area and the background area;
Since the text image to be processed is a black-and-white image, the pixel values of the pixel points in the background region are the same (here, black) and the pixel values of the pixel points in the text region are the same (here, white). Because the position of a text pixel point in the text image to be processed is known from the text content selection operation, the pixel value of the text pixel points (white) can be learned from the pixel value at that position; the pixel points belonging to the text and those belonging to the background region can then be determined according to that value, the extent of the target text region can be determined from the text pixel points, and the target text region can be framed out, specifically as shown by the white line in fig. 8. It should be understood that the white line should coincide with the boundary of the text "flying bird", but for ease of illustration the white line in fig. 8 is drawn not coinciding with it.
Further, each initial critical point between the target text region and the background region may be determined according to a difference between pixel values of each pixel point. If the difference between the pixel values of the two pixel points is not 0, the two pixel points are respectively located in the background region and the target text region, then the pixel point belonging to the text in the two pixel points can be determined according to the pixel value of the pixel point belonging to the text, and the pixel point belonging to the text is used as an initial critical point.
Step S407, the server determines a critical point between the target text area and the background area according to each initial critical point;
alternatively, the critical points may include a critical point in a first direction and a critical point in a second direction, where the first direction may refer to a writing direction of the text in the target text region, and the second direction is a direction perpendicular to the first direction. Further, for each side of each of the first direction and the second direction, a distance of each initial critical point in a direction perpendicular to the direction may be determined, and then a critical point of the side may be determined from each initial critical point according to the distance of each initial critical point in the direction perpendicular to the direction.
When the critical points are determined from the initial critical points according to their distances in the direction perpendicular to the side's direction: for the initial critical points belonging to the same side, if the distances between them in that perpendicular direction are all smaller than or equal to the set value, the points are adjacent and close to each other and may be considered to lie on the same line, and at least one of them may be arbitrarily selected and determined as a critical point. Conversely, if the distance between some initial critical points in that perpendicular direction is greater than the set value, there are adjacent yet far-apart initial critical points, and at least two critical points should be determined from the initial critical points, so as to ensure the accuracy of the final bounding box ultimately determined from the critical points.
In one example, assume that the initial critical points determined in the text image to be processed are as shown in fig. 9, where the initial critical points A to C are on the same side in one direction, the initial critical points D and E are each located on a side of their own, and the initial critical points F to I are on the same side in one direction. For the initial critical points A to C, since the distances between them in the direction perpendicular to the text writing direction are all smaller than the set value, the initial critical point A may be selected from among them and determined as a critical point; for the initial critical points F to H, since the distances between them in the direction perpendicular to the text writing direction are all smaller than the set value, the initial critical point F may be selected from among them and determined as a critical point; and for the initial critical points D and E, since each is located on its own side, they can be directly determined as critical points.
In step S408, the server generates a final bounding box of the target text region according to the critical points.
Specifically, for each side in each of the first direction and the second direction, the critical points belonging to that side may be determined from the adjacent pixel points, and the boundary connecting line corresponding to that side is obtained based on the critical points belonging to that side.
In an example, assuming that the critical points determined on each side in each direction in the text image to be processed are as shown in fig. 10, specifically including the critical point A, the critical point F, the critical point D, and the critical point E, the boundary connecting line corresponding to each side may be obtained based on the critical points belonging to that side. For example, the boundary connecting line on the upper side of the target text region is obtained according to the critical point A, the boundary connecting line on the lower side is obtained according to the critical point F, the boundary connecting line on the left side is obtained according to the critical point D, and the boundary connecting line on the right side is obtained according to the critical point E; the boundary connecting lines on the upper, lower, left and right sides of the target text region are then connected to obtain the final bounding box of the target text region, shown as the white rectangle in fig. 10.
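For the axis-aligned case of fig. 10, assembling the four boundary connecting lines into the final bounding box can be sketched as follows; the representation of each side by a single critical point, and the (x_min, y_min, x_max, y_max) box layout, are assumptions of the example:

def final_bounding_box(top, bottom, left, right):
    """Build an axis-aligned box from one critical point per side: each
    horizontal side contributes the y-coordinate of its boundary line,
    each vertical side the x-coordinate of its boundary line; the four
    lines intersect at the corners of the final bounding box."""
    return (left[0], top[1], right[0], bottom[1])

# e.g. with the critical points of fig. 10:
# box = final_bounding_box(A, F, D, E)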
Step S409, the server sends the final bounding box of the target text region to the terminal device, so that the terminal device displays the final bounding box of the target text region to the user.
In the embodiment of the application, the text content of the image region is annotated through the final bounding box of the target text region. The accuracy of the final bounding box is ensured through intelligent boundary detection, and the determined final bounding box matches the size of the target text region, so that the problems of manual text region selection can be effectively overcome and the annotation efficiency and accuracy are improved.
An embodiment of the present application provides a processing apparatus for a text image. As shown in fig. 11, the processing apparatus 60 for a text image may include: an image acquisition module 601, an initial bounding box generating module 602, and a final bounding box generating module 603, wherein:
the image acquisition module 601 is configured to display a text image to be processed;
an initial bounding box generating module 602, configured to generate, in response to a user's text region selection operation on the text image to be processed, an initial bounding box of the target text region corresponding to the text region selection operation;
a final bounding box generating module 603, configured to determine a critical point between the target text region and the background region in the text image to be processed based on the initial bounding box, and to generate a final bounding box of the target text region according to the critical points.
Optionally, when determining a critical point between the target text region and the background region in the to-be-processed text image based on the initial bounding box, the final bounding box generating module is specifically configured to:
if the text image to be processed is not an image in the target format, processing the text image to be processed into an image in the target format, wherein the image in the target format is a binary image or a grayscale image;
for the image in the target format, determining the target text region and the background region based on the initial bounding box;
and determining a critical point between the target text region and the background region according to the pixel value of each pixel point in the target text region and the background region.
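As a hedged illustration of the target-format conversion above, assuming OpenCV is available; Otsu thresholding is one possible binarization choice, which the embodiment does not prescribe:

import cv2

def to_target_format(img_bgr, binary=True):
    """Convert a colour image to the target format: a grayscale image,
    or a binary image obtained from it via Otsu's threshold."""
    gray = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2GRAY)
    if not binary:
        return gray
    _, bin_img = cv2.threshold(gray, 0, 255,
                               cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return bin_img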
Optionally, the critical points include critical points in a first direction and critical points in a second direction, the first direction is the writing direction of the text in the target text region, the second direction is the direction perpendicular to the first direction, and the critical points in each of the first direction and the second direction include critical points located on both sides of the text region in that direction;
the final bounding box generating module is specifically configured to, when generating the final bounding box of the target text region according to the critical points between the target text region and the background region:
for each side in each of the first direction and the second direction, obtaining a boundary connecting line corresponding to that side based on the critical points of that side;
and connecting the boundary connecting lines to generate the final bounding box of the target text region.
Optionally, when the final bounding box generating module determines a critical point between the target text region and the background region according to the pixel value of each pixel point in the target text region and the background region, the final bounding box generating module is specifically configured to:
determining pixel values of all pixel points in a target text region and a background region;
determining each initial critical point between the target text region and the background region according to the difference value between the pixel values of each pixel point;
and determining a critical point between the target text region and the background region according to each initial critical point.
Optionally, when the final bounding box generating module determines the critical point between the target text region and the background region according to each initial critical point, the final bounding box generating module is specifically configured to:
for each side in each of the first direction and the second direction, determining at least one final critical point from the initial critical points belonging to that side according to the distances of those initial critical points in the direction perpendicular to that direction.
Optionally, when determining at least one final critical point from the initial critical points, the final bounding box generating module is specifically configured to:
if the distance between the initial critical points on the same side is less than or equal to a set value, determining at least one of the initial critical points, which may be any one of them, as a final critical point;
if the distance between the initial critical points on the same side is greater than the set value, determining at least two final critical points from the initial critical points, wherein the at least two final critical points include the initial critical points that lie farthest apart from each other in the direction perpendicular to the direction to which the side belongs.
Optionally, when the final bounding box generating module obtains the boundary connecting line corresponding to the side based on the critical point of the side, the final bounding box generating module is specifically configured to:
if there is one critical point on the side, determining the boundary connecting line corresponding to the side based on that critical point and the direction to which the side belongs, wherein the direction to which the side belongs is the first direction or the second direction;
and if there are at least two critical points on the side, determining the boundary connecting line corresponding to the side based on those at least two critical points.
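The two cases above might be sketched as follows; the (a, b, c) coefficients of a line a·x + b·y = c are an illustrative representation, and "horizontal" and "vertical" stand for the first and second directions of an axis-aligned layout:

def boundary_line(points, direction):
    """Derive one side's boundary connecting line. With a single critical
    point the line runs through it along the side's direction; with two
    or more points it is the line through the two selected points."""
    if len(points) == 1:
        x, y = points[0]
        if direction == "horizontal":
            return (0.0, 1.0, float(y))  # the line y = const
        return (1.0, 0.0, float(x))      # the line x = const
    (x1, y1), (x2, y2) = points[0], points[-1]
    a, b = float(y2 - y1), float(x1 - x2)  # normal of the two-point line
    return (a, b, a * x1 + b * y1)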
Optionally, when the final bounding box of the target text region is determined according to the critical points, the final bounding box generating module is specifically configured to:
generating one or more candidate bounding boxes corresponding to the target text region according to the critical points;
if the number of the candidate bounding boxes is one, determining that candidate bounding box as the final bounding box of the target text region;
and if the number of the candidate bounding boxes is at least two, filtering the at least two candidate bounding boxes according to their heights to obtain the final bounding box of the target text region, wherein the height is the extent in the direction perpendicular to the writing direction of the text in the target text region.
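The height-based filtering above leaves the selection criterion open; one plausible reading, used purely as an assumption in this sketch, is to keep the tightest-fitting candidate, i.e. the one with the smallest height:

def filter_candidates(boxes):
    """Select among candidate boxes (x_min, y_min, x_max, y_max) by their
    height perpendicular to the writing direction; choosing the smallest
    height is an assumed criterion, not one fixed by the embodiment."""
    return min(boxes, key=lambda b: b[3] - b[1])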
Optionally, the apparatus further includes a labeling module, configured to:
and carrying out text region labeling on the text image to be processed according to the final bounding box, so that the labeled text image to be processed can be used as a training sample image of a text processing model.
Optionally, the apparatus further includes an image acquisition module, specifically configured to:
acquiring an image acquisition operation, wherein the image acquisition operation comprises an acquisition path of a text image to be processed;
accessing an image database according to the acquisition path;
and acquiring the text image to be processed corresponding to the acquisition path from the image database.
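A minimal sketch of this acquisition flow, with a filesystem-backed store standing in for the image database, since the database access itself is not specified by the embodiment:

import cv2

def fetch_image(acquisition_path):
    """Resolve the acquisition path carried by the image acquisition
    operation and load the corresponding text image to be processed."""
    img = cv2.imread(acquisition_path)
    if img is None:
        raise FileNotFoundError(acquisition_path)
    return img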
Optionally, the apparatus further includes a display module.
optionally, the apparatus further includes a labeled corpus generating module, configured to:
after the final bounding box of the target text region is generated according to the critical points, cutting out the text region from the text image to be processed according to the final bounding box;
performing text recognition on the cut-out text region to obtain a text recognition result;
displaying the text recognition result in a text labeling area of the text image to be processed;
and in response to an approval operation on the text recognition result in the text labeling area, generating a text image annotation corpus according to the approved text recognition result and the text image to be processed.
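A hedged sketch of the crop-and-recognize flow above; pytesseract serves only as a stand-in recognizer, since the embodiment does not name a text recognition engine:

import pytesseract

def corpus_entry(img, box):
    """Cut the final bounding box out of the text image and run text
    recognition on the crop; the (x_min, y_min, x_max, y_max) box layout
    is an assumption carried over from the earlier sketches."""
    x0, y0, x1, y1 = box
    crop = img[y0:y1, x0:x1]  # NumPy slicing: rows first, then columns
    text = pytesseract.image_to_string(crop)
    return crop, text.strip()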
The processing device for text images according to the embodiment of the present application can execute the processing method for text images according to the embodiment of the present application, and the implementation principles thereof are similar, and are not described herein again.
Alternatively, the processing apparatus of the text image may be a computer program (including program code) running on a computer device; for example, the processing apparatus of the text image is application software. The apparatus may be used to perform the corresponding steps in the methods provided by the embodiments of the present application.
In some embodiments, the processing apparatus of the text image provided in the embodiments of the present application may be implemented by a combination of hardware and software. By way of example, the processing apparatus of the text image provided in the embodiments of the present application may be a processor in the form of a hardware decoding processor, which is programmed to execute the text image processing method provided in the embodiments of the present application; for example, the processor in the form of a hardware decoding processor may be implemented by one or more Application Specific Integrated Circuits (ASICs), DSPs, Programmable Logic Devices (PLDs), Complex Programmable Logic Devices (CPLDs), Field Programmable Gate Arrays (FPGAs), or other electronic components.
In other embodiments, the processing apparatus for text images provided in the embodiments of the present application may be implemented in software. Fig. 11 illustrates the processing apparatus 60 for text images stored in the memory, which may be software in the form of programs and plug-ins and includes a series of modules: an image acquisition module 601, an initial bounding box generating module 602, and a final bounding box generating module 603. The image acquisition module 601, the initial bounding box generating module 602, and the final bounding box generating module 603 are configured to implement the text image processing method provided in the embodiments of the present application.
An embodiment of the present application provides an electronic device. As shown in fig. 12, the electronic device 2000 includes: a processor 2001 and a memory 2003, wherein the processor 2001 is coupled to the memory 2003, such as via a bus 2002. Optionally, the electronic device 2000 may also include a transceiver 2004. It should be noted that the transceiver 2004 is not limited to one in practical applications, and the structure of the electronic device 2000 does not constitute a limitation on the embodiments of the present application.
The processor 2001 is applied in the embodiment of the present application to implement the functions of the modules shown in fig. 11.
The processor 2001 may be a CPU, a general-purpose processor, a DSP, an ASIC, an FPGA or another programmable logic device, a transistor logic device, a hardware component, or any combination thereof, and may implement or perform the various illustrative logical blocks, modules, and circuits described in connection with the present disclosure. The processor 2001 may also be a combination that implements computing functions, for example, a combination of one or more microprocessors, or a combination of a DSP and a microprocessor.
The bus 2002 may include a path that conveys information between the aforementioned components. The bus 2002 may be a PCI bus, an EISA bus, or the like, and may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, only one thick line is shown in fig. 12, but this is not intended to represent only one bus or one type of bus.
The memory 2003 may be, but is not limited to, a ROM or another type of static storage device that can store static information and computer programs, a RAM or another type of dynamic storage device that can store information and computer programs, an EEPROM, a CD-ROM or other optical disk storage (including compact discs, laser discs, digital versatile discs, blu-ray discs, etc.), magnetic disk storage media or other magnetic storage devices, or any other medium that can be used to carry or store a desired computer program in the form of instructions or data structures and that can be accessed by a computer.
The memory 2003 is used for storing the computer program that implements the solution of the present application, and its execution is controlled by the processor 2001. The processor 2001 is configured to execute the computer program stored in the memory 2003 to realize the actions of the processing apparatus of a text image provided by the embodiment shown in fig. 11.
An embodiment of the present application provides an electronic device, including a processor and a memory: the memory is configured to store a computer program which, when executed by the processor, causes the processor to perform any of the methods of the above embodiments.
The present application provides a computer-readable storage medium for storing a computer program, which, when run on a computer, enables the computer to execute any one of the above-mentioned methods.
According to an aspect of the application, a computer program product or computer program is provided, comprising computer instructions, the computer instructions being stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the method provided in the various alternative implementations described above.
For the terms and implementation principles related to the computer-readable storage medium in the present application, reference may be made to the text image processing method in the embodiments of the present application, and details are not described herein again.
It should be understood that, although the steps in the flowcharts of the figures are shown in an order indicated by the arrows, these steps are not necessarily performed in that order. Unless explicitly stated herein, the execution of these steps is not strictly limited in order, and they may be performed in other orders. Moreover, at least a portion of the steps in the flowcharts may include multiple sub-steps or multiple stages, which are not necessarily performed at the same moment but may be performed at different moments, and the execution order thereof is not necessarily sequential; they may be performed in turns or alternately with other steps, or with at least a portion of the sub-steps or stages of other steps.
The foregoing describes only some embodiments of the present application. It should be noted that, for those skilled in the art, several improvements and modifications can be made without departing from the principle of the present application, and these improvements and modifications should also be regarded as falling within the protection scope of the present application.