CN116311298B - Information generation method, information processing method, device, electronic device, and medium - Google Patents


Info

Publication number
CN116311298B
CN116311298B (application CN202310023539.8A)
Authority
CN
China
Prior art keywords
information
text
level
feature
text recognition
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310023539.8A
Other languages
Chinese (zh)
Other versions
CN116311298A (en)
Inventor
于海鹏
李煜林
钦夏孟
姚锟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202310023539.8A
Publication of CN116311298A
Application granted
Publication of CN116311298B
Legal status: Active (current)
Anticipated expiration

Abstract

Translated from Chinese

The present disclosure provides an information generation method, an information processing method, an apparatus, an electronic device, and a medium, which relate to the fields of artificial intelligence technology, in particular to the fields of deep learning technology, image processing technology, and computer vision technology, and can be applied to scenarios such as OCR (Optical Character Recognition). The specific implementation scheme is as follows: performing text detection on a text image to obtain detection information, the detection information including category information and position information of each of a plurality of text regions; obtaining text region images corresponding to each of the plurality of text regions based on the position information and the text image; performing text recognition on the text region images to obtain recognition information, the recognition information including text recognition information of each of the plurality of text region images; determining semantic relationship information based on the recognition information, the semantic relationship information including the semantic relationship between a plurality of text recognition information; generating structured information of the text image based on the category information, the semantic relationship information, and the recognition information.

Description

Information generation method, information processing method, apparatus, electronic device, and medium
Technical Field
The present disclosure relates to the field of artificial intelligence, and more particularly to the fields of deep learning, image processing, and computer vision, and may be applied to scenarios such as OCR (optical character recognition). In particular, it relates to an information generation method, an information processing method, an apparatus, an electronic device, and a medium.
Background
With the development of computer technology, artificial intelligence technology has also been developed. For example, artificial intelligence techniques may be utilized to perform processes such as entity recognition and relationship extraction on images containing text data to obtain textual structured information in the images.
Disclosure of Invention
The disclosure provides an information generation method, an information processing method, an apparatus, an electronic device, and a medium.
According to one aspect of the disclosure, an information generation method is provided, which includes: performing text detection on a text image to obtain detection information, wherein the detection information includes category information and position information of each of a plurality of text regions; acquiring text region images corresponding to each of the plurality of text regions according to the position information and the text image; performing text recognition on the text region images to obtain recognition information, wherein the recognition information includes text recognition information of each of the plurality of text region images; determining semantic relationship information according to the recognition information, wherein the semantic relationship information includes semantic relationships among the plurality of text recognition information; and generating structured information of the text image according to the category information, the semantic relationship information, and the recognition information.
According to another aspect of the present disclosure, there is provided an information processing method including: processing a text image to be processed using the information generation method to acquire structured information of the text image to be processed, and performing information processing using the structured information of the text image to be processed.
According to another aspect of the disclosure, an information generating apparatus is provided, which includes: a text detection module configured to perform text detection on a text image to obtain detection information, where the detection information includes category information and location information of each of a plurality of text regions; a first acquisition module configured to acquire a text region image corresponding to each of the plurality of text regions according to the location information and the text image; a text recognition module configured to perform text recognition on the text region image to obtain recognition information, where the recognition information includes text recognition information of each of the plurality of text region images; a determination module configured to determine semantic relationship information according to the recognition information, where the semantic relationship information includes a semantic relationship between the plurality of text recognition information; and a generation module configured to generate structured information of the text image according to the category information, the semantic relationship information, and the recognition information.
According to another aspect of the disclosure, there is provided an information processing apparatus including a second acquisition module configured to process a text image to be processed using the information generating apparatus, acquire structured information of the text image to be processed, and an information processing module configured to perform information processing using the structured information of the text image to be processed.
According to another aspect of the present disclosure, there is provided an electronic device comprising at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method as described in the present disclosure.
According to another aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing the computer to perform the method as described in the present disclosure.
According to another aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements a method as described in the present disclosure.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The drawings are for a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 schematically illustrates an exemplary system architecture to which information generation methods, information processing methods, and apparatus may be applied, according to embodiments of the present disclosure;
FIG. 2 schematically illustrates a flow chart of an information generation method according to an embodiment of the disclosure;
FIG. 3 schematically illustrates a flow chart of a method for text detection of a text image to obtain detection information, according to an embodiment of the disclosure;
FIG. 4 schematically illustrates an example schematic diagram of a text detection of a text image to obtain detection information according to an embodiment of the present disclosure;
FIG. 5A schematically illustrates a flowchart of a method of determining semantic relationship information according to identification information according to an embodiment of the present disclosure;
FIG. 5B schematically illustrates an example schematic diagram of a process for determining semantic relationship information based on identification information according to an embodiment of the present disclosure;
FIG. 5C schematically illustrates an example schematic diagram of a process for determining semantic relationship information based on identification information according to another embodiment of the present disclosure;
FIG. 5D schematically illustrates a flowchart of a method of determining semantic relationship information according to identification information according to another embodiment of the present disclosure;
FIG. 5E schematically illustrates an example schematic diagram of a process for determining semantic relationship information based on identification information according to another embodiment of the present disclosure;
FIG. 5F schematically illustrates an example schematic diagram of a process for determining semantic relationship information based on identification information according to another embodiment of the present disclosure;
FIG. 5G schematically illustrates a flowchart of a method of determining semantic relationship information according to identification information according to another embodiment of the present disclosure;
FIG. 6 schematically illustrates a flow chart of a method of text recognition of a text region image to obtain recognition information, in accordance with an embodiment of the present disclosure;
FIG. 7 schematically illustrates a flow chart of a method of generating structured information of a text image from category information, semantic relationship information, and identification information according to an embodiment of the present disclosure;
FIG. 8 schematically illustrates an example schematic diagram of an information generation process according to an embodiment of the disclosure;
FIG. 9 schematically illustrates a flow chart of an information processing method according to an embodiment of the present disclosure;
Fig. 10 schematically shows a block diagram of an information generating apparatus according to an embodiment of the present disclosure;
FIG. 11 schematically shows a block diagram of an information processing apparatus according to an embodiment of the present disclosure, and
Fig. 12 schematically shows a block diagram of an electronic device adapted to implement the information generating method, the information processing method according to an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
In the technical solution of the present disclosure, the collection, storage, use, processing, transmission, provision, and disclosure of users' personal information comply with the provisions of relevant laws and regulations and do not violate public order and good morals.
In the technical scheme of the disclosure, the authorization or consent of the user is obtained before the personal information of the user is obtained or acquired.
Fig. 1 schematically illustrates an exemplary system architecture to which information generating methods, information processing methods, and apparatuses may be applied according to embodiments of the present disclosure.
It should be noted that fig. 1 is only an example of a system architecture to which embodiments of the present disclosure may be applied to assist those skilled in the art in understanding the technical content of the present disclosure, but does not mean that embodiments of the present disclosure may not be used in other devices, systems, environments, or scenarios. For example, in another embodiment, an exemplary system architecture to which the information generating method, the information processing method, and the apparatus may be applied may include a terminal device, but the terminal device may implement the information generating method, the information processing method, and the apparatus provided by the embodiments of the present disclosure without interacting with a server.
As shown in fig. 1, a system architecture 100 according to this embodiment may include a first terminal device 101, a second terminal device 102, a third terminal device 103, a network 104, and a server 105. The network 104 is a medium used to provide communication links between the first terminal device 101, the second terminal device 102, the third terminal device 103, and the server 105. The network 104 may include various connection types, such as wired and/or wireless communication links. The terminal device may include at least one of the first terminal device 101, the second terminal device 102, and the third terminal device 103.
The user may interact with the server 105 through the network 104 using at least one of the first terminal device 101, the second terminal device 102, and the third terminal device 103 to receive or send messages or the like. At least one of the first terminal device 101, the second terminal device 102, and the third terminal device 103 may be installed with various communication client applications. For example, at least one of a knowledge reading class application, a web browser application, a search class application, an instant messaging tool, a mailbox client and social platform software, and the like.
The first terminal device 101, the second terminal device 102, and the third terminal device 103 may be various electronic devices having a display screen and supporting web browsing. For example, the electronic device may include at least one of a smart phone, a tablet computer, a laptop portable computer, a desktop computer, and the like.
The server 105 may be a server providing various services. For example, the server 105 may be a cloud server, also called a cloud computing server or cloud host, which is a host product in a cloud computing service system that overcomes the drawbacks of difficult management and weak service scalability found in conventional physical hosts and VPS (Virtual Private Server) services.
Note that the information generating method and the information processing method provided by the embodiments of the present disclosure may generally be performed by one of the first terminal device 101, the second terminal device 102, and the third terminal device 103. Accordingly, the information generating apparatus and the information processing apparatus provided by the embodiments of the present disclosure may also be provided in one of the first terminal device 101, the second terminal device 102, and the third terminal device 103.
Alternatively, the information generating method and the information processing method provided by the embodiments of the present disclosure may also be generally performed by the server 105. Accordingly, the information generating apparatus and the information processing apparatus provided by the embodiments of the present disclosure may be generally provided in the server 105. The information generating method and the information processing method provided by the embodiments of the present disclosure may also be performed by a server or a server cluster that is different from the server 105 and is capable of communicating with at least one of the first terminal device 101, the second terminal device 102, the third terminal device 103, and the server 105. Accordingly, the information generating apparatus and the information processing apparatus provided by the embodiments of the present disclosure may also be provided in a server or a server cluster that is different from the server 105 and is capable of communicating with at least one of the first terminal apparatus 101, the second terminal apparatus 102, the third terminal apparatus 103, and the server 105.
It should be understood that the numbers of first terminal devices, second terminal devices, third terminal devices, networks, and servers in fig. 1 are merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
It should be noted that the sequence numbers of the respective operations in the following methods are merely representative of the operations for the purpose of description, and should not be construed as representing the order of execution of the respective operations. The method need not be performed in the exact order shown unless explicitly stated.
Fig. 2 schematically shows a flowchart of an information generation method according to an embodiment of the present disclosure.
As shown in FIG. 2, the method 200 includes operations S210-S250.
In operation S210, text detection is performed on the text image to obtain detection information.
In operation S220, text region images corresponding to the respective text regions are acquired based on the position information and the text images.
In operation S230, text recognition is performed on the text region image, resulting in recognition information.
In operation S240, semantic relationship information is determined according to the identification information.
In operation S250, structured information of the text image is generated according to the category information, the semantic relationship information, and the identification information.
According to an embodiment of the present disclosure, the detection information may include category information and location information of each of the plurality of text regions. The identification information may include text identification information of each of the plurality of text region images. The semantic relationship information may include semantic relationships between a plurality of text recognition information.
According to embodiments of the present disclosure, a text image may refer to an image that includes text content. The text content in the text image belongs to unstructured information, and the unstructured text content in the text image can be extracted according to the information generation method provided by the embodiment of the disclosure so as to generate the structured information of the text image.
According to embodiments of the present disclosure, text images may be of various types; for example, they may include medical text images, merchandise list text images, financial text images, or the like. File formats of the text image may include JPG (Joint Photographic Experts Group), TIFF (Tag Image File Format), PNG (Portable Network Graphics), PDF (Portable Document Format), GIF (Graphics Interchange Format), and the like. The embodiments of the present disclosure do not limit the file format of the text image.
According to embodiments of the present disclosure, the text image may be acquired through real-time acquisition, for example, the text image may be acquired through photographing or scanning of the entity text, or the like. Alternatively, the text image may be pre-stored in a database, for example, for an electronic document including text information, the text image may be obtained by capturing a screenshot of the document. Alternatively, the text image may be received from other terminal devices. The embodiment of the disclosure does not limit the acquisition mode of the text image.
According to the embodiment of the disclosure, after the text image is obtained, the text image may be subjected to text detection by using the text detection model, so as to obtain detection information corresponding to the text image. The text detection model may be trained on a first predetermined model using a first training sample set and a first tag set. The first predetermined model may include a deep learning model or a conventional model. The deep learning model may include a text detection model based on candidate boxes, a text detection model based on segmentation, or a text detection model based on a mixture of both, etc. Conventional models may include a text detection model based on SWT (Stroke Width Transform) or a text detection model based on EdgeBox (edge boxes), etc.
According to an embodiment of the present disclosure, the detection information may include category information and location information of each of the plurality of text regions. The category information may characterize a category of text content included in the text region. The category information may include at least one of a keyword category or a numeric category. The keyword category may characterize a category attribute of text content included in the text region. The numeric category may characterize a content attribute of the text content included in the text region.
For example, if the text content included in one text region is "A city center hospital", the category information of that text region is the numeric category. If a text region includes the text content "name", its category information is the keyword category. If a text region includes the text content "Zhang San", its category information is the numeric category.
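As an illustration, the two category types can be modeled as plain records and separated by category. This is a sketch only; the record layout and field names are assumptions for illustration, not part of the patent:

```python
def split_by_category(detections):
    """Separate keyword-category regions (key-type fields) from
    numeric-category regions (content-type fields)."""
    keywords = [d for d in detections if d["category"] == "keyword"]
    numerics = [d for d in detections if d["category"] == "numeric"]
    return keywords, numerics

# Records mirroring the examples in the text above.
detections = [
    {"text": "name", "category": "keyword"},
    {"text": "Zhang San", "category": "numeric"},
    {"text": "A city center hospital", "category": "numeric"},
]
keywords, numerics = split_by_category(detections)
```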
According to embodiments of the present disclosure, the position information may characterize the location where the text region is situated. The position information may be used as a basis for extracting the text region image corresponding to the text region from the text image. The position information may be characterized using a text detection box. The text detection box may be a four-corner box, i.e., the position information may be characterized using four corner coordinates.
According to the embodiment of the present disclosure, after the position information of each of the plurality of text regions is obtained, the text image may be subjected to image segmentation processing according to the position information using a predetermined image segmentation method, resulting in a text region image corresponding to each of the plurality of text regions. The predetermined image segmentation method may include at least one of a threshold-based image segmentation method, a region-based image segmentation method, an edge-based image segmentation method, a theory-specific image segmentation method, a gene-coding-based image segmentation method, a wavelet-transformation-based image segmentation method, and a neural network-based image segmentation method.
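As a minimal sketch of this step, a text region image can be cut out of the text image using the axis-aligned bounding rectangle of a four-corner box. The image is modeled here as a nested list of pixel values; a production system would apply one of the segmentation methods listed above and would also rectify rotated boxes:

```python
def crop_text_region(image, box):
    """Crop the axis-aligned bounding rectangle of a four-corner text box.
    `image` is a 2-D list of pixel rows; `box` is four (x, y) corners."""
    xs = [x for x, _ in box]
    ys = [y for _, y in box]
    return [row[min(xs):max(xs) + 1] for row in image[min(ys):max(ys) + 1]]

# A 6x10 synthetic "image" whose pixel value encodes its coordinates.
image = [[r * 10 + c for c in range(10)] for r in range(6)]
region = crop_text_region(image, [(2, 1), (5, 1), (5, 3), (2, 3)])
```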
According to the embodiment of the disclosure, after obtaining the text region image corresponding to each of the plurality of text regions, the text region image may be text-recognized using the text recognition model to obtain the recognition information. The identification information may include text identification information of each of the plurality of text region images. Text identifying information may be used to characterize the text content corresponding to a field of consecutive text in the text region image. The text recognition model may be trained using a second training sample set and a second label set for a second predetermined model. The second predetermined model may include a pattern matching model, a machine learning model, or a deep learning model. The deep learning model may include a text recognition model based on single character recognition or a text recognition model based on overall recognition.
According to the embodiment of the disclosure, after the identification information is obtained, text identification information of each of the plurality of text region images can be processed by using the text classification model to obtain semantic relationship information. For example, semantic features included in the text recognition information of each of the plurality of text region images are extracted using the text classification model, and semantic relationships between the plurality of text recognition information are determined based on the semantic features. The text classification model may be trained using a third set of training samples on a third predetermined model. The third predetermined model may include a machine learning model or a deep learning model. The machine learning model may include a naive bayes algorithm based text classification model or a decision tree based text classification model.
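A toy stand-in for this step is sketched below: each keyword-category field is linked to the next value in reading order. The real scheme uses the learned text classification model described above; this rule-based pairing and all names in it are illustrative assumptions:

```python
def link_key_value(recognized):
    """Pair each keyword-category item with the next numeric-category
    item in reading order, yielding (key, value) semantic relations."""
    relations = []
    pending_key = None
    for item in recognized:
        if item["category"] == "keyword":
            pending_key = item["text"]
        elif pending_key is not None:
            relations.append((pending_key, item["text"]))
            pending_key = None
    return relations

recognized = [
    {"text": "name", "category": "keyword"},
    {"text": "Zhang San", "category": "numeric"},
    {"text": "age", "category": "keyword"},
    {"text": "42", "category": "numeric"},
]
relations = link_key_value(recognized)
```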
According to embodiments of the present disclosure, after obtaining the semantic relationship information, structured information of the text image may be generated according to the category information, the semantic relationship information, and the identification information. The structured information of the text image may include a value corresponding to the keyword category and a value corresponding to the numeric category. The value corresponding to the keyword category may include semantic relationship information. The value corresponding to the numeric category may include identification information.
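The assembly described above can be sketched as follows: values reached through semantic relations fill the keyword-category entries, and any remaining numeric-category recognition results are retained on their own. The record layout is an assumption for illustration:

```python
import json

def assemble_structured_info(items, relations):
    """Build a structured record: (key, value) semantic relations become
    named entries; numeric-category items not consumed by any relation
    are kept under a catch-all entry."""
    record = dict(relations)
    linked = set(record.values())
    record["unlinked"] = [i["text"] for i in items
                          if i["category"] == "numeric" and i["text"] not in linked]
    return record

items = [
    {"text": "name", "category": "keyword"},
    {"text": "Zhang San", "category": "numeric"},
    {"text": "A city center hospital", "category": "numeric"},
]
record = assemble_structured_info(items, [("name", "Zhang San")])
structured_json = json.dumps(record, ensure_ascii=False)
```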
According to an embodiment of the present disclosure, operations S210 to S250 may be performed by an electronic device. The electronic device may comprise a server or a terminal device. The server may be the server 105 in fig. 1. The terminal devices may be the first terminal device 101, the second terminal device 102, and the third terminal device 103 in fig. 1.
According to the embodiment of the present disclosure, performing text detection on the text image yields the category information and position information of each of the plurality of text regions, and because the text region images corresponding to those text regions are acquired according to the position information and the text image, performing text recognition on those images yields recognition information that includes the text recognition information of each of the text region images. On this basis, since the structured information of the text image is generated according to the category information, the semantic relationship information, and the recognition information, and since the semantic relationship information is determined according to the recognition information and includes the semantic relationships among the plurality of text recognition information, the structured information is generated using both semantic relationship information and visual information, which improves its accuracy.
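The five operations S210 to S250 can be tied together in a short orchestration sketch, with the detection, recognition, and relation models passed in as callables. All stubs and names here are illustrative assumptions, not the patent's reference implementation:

```python
def generate_structured_info(text_image, detect, recognize, relate):
    """Sketch of operations S210-S250 with pluggable model callables."""
    detections = detect(text_image)                                 # S210
    for d in detections:
        xs = [x for x, _ in d["box"]]
        ys = [y for _, y in d["box"]]
        d["region"] = [row[min(xs):max(xs) + 1]
                       for row in text_image[min(ys):max(ys) + 1]]  # S220
        d["text"] = recognize(d["region"])                          # S230
    relations = relate(detections)                                  # S240
    return dict(relations)                                          # S250

# Trivial stand-in "models" for demonstration only.
fake_image = [[0] * 8 for _ in range(4)]

def detect_stub(img):
    return [{"box": [(0, 0), (3, 0), (3, 1), (0, 1)], "category": "keyword"},
            {"box": [(4, 2), (7, 2), (7, 3), (4, 3)], "category": "numeric"}]

_texts = iter(["name", "Zhang San"])

def recognize_stub(region):
    return next(_texts)

def relate_stub(detections):
    return [(detections[0]["text"], detections[1]["text"])]

info = generate_structured_info(fake_image, detect_stub, recognize_stub, relate_stub)
```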
The above are merely exemplary embodiments, but the disclosure is not limited thereto; other information generation methods known in the art may also be included, as long as the accuracy of the structured information can be improved.
The method of fig. 2 is further described with reference to fig. 3, 4, 5A, 5B, 5C, 5D, 5E, 5F, 5G, 6, 7, and 8 in conjunction with embodiments.
According to an embodiment of the present disclosure, the text image comprises a medical text image.
According to the embodiment of the disclosure, medical text is an important carrier of information in medical scenarios. Medical text contains a great deal of structured information about users, and acquiring this structured information helps in understanding users' health conditions so that targeted analysis and processing can be performed; it also supports building a complete database and user profiles. Medical text can exist in the form of an image, and extracting the required structured information from a medical text image is a technical difficulty in medical scenarios that can be addressed using the information generation scheme provided by the embodiments of the present disclosure.
Fig. 3 schematically illustrates a flowchart of a method for text detection of a text image to obtain detection information according to an embodiment of the present disclosure.
As shown in FIG. 3, the method 300 further refines operation S210 of FIG. 2, and may include operations S311 to S315.
In operation S311, feature extraction is performed on the text image, and a first feature map of at least one scale is obtained.
In operation S312, a second feature map is acquired from the first feature map of the at least one scale.
In operation S313, a third feature map is acquired from the first feature map of the at least one scale.
In operation S314, category information of each of the plurality of text regions is acquired according to the second feature map.
In operation S315, respective position information of a plurality of text regions is acquired according to the third feature map.
According to embodiments of the present disclosure, scale may refer to image resolution. Each scale may have at least one first feature map corresponding to the scale.
According to embodiments of the present disclosure, a text image may be processed based on a single-stage tandem method, resulting in a first feature map of at least one scale. Alternatively, the text image may be processed based on a multi-stage tandem method, resulting in a first feature map of at least one scale. Alternatively, the text image may be processed based on a multi-stage parallel method, resulting in a first feature map of at least one scale.
According to an embodiment of the present disclosure, after the first feature map of at least one scale is obtained, a second feature map may be obtained from the first feature map of at least one scale. For example, the first feature map of at least one scale may be fused to obtain a first fused feature map. And acquiring a second feature map according to the first fusion feature map. For example, the first fused feature map may be determined as the second feature map. Alternatively, the first fused feature map may be processed to obtain the second feature map.
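One simple fusion strategy consistent with this description is nearest-neighbour upsampling of each scale to the largest resolution followed by element-wise summation. This is an assumed strategy for illustration, not necessarily the fusion used in the patent:

```python
def fuse_feature_maps(feature_maps):
    """Fuse single-channel feature maps of different scales: upsample
    each map (nearest neighbour) to the largest height/width, then sum."""
    target_h = max(len(m) for m in feature_maps)
    target_w = max(len(m[0]) for m in feature_maps)
    fused = [[0.0] * target_w for _ in range(target_h)]
    for m in feature_maps:
        sy = len(m) / target_h
        sx = len(m[0]) / target_w
        for y in range(target_h):
            for x in range(target_w):
                fused[y][x] += m[int(y * sy)][int(x * sx)]
    return fused

fused = fuse_feature_maps([
    [[1.0, 2.0], [3.0, 4.0]],  # 2x2 first feature map
    [[10.0]],                  # 1x1 first feature map, upsampled to 2x2
])
```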
According to an embodiment of the present disclosure, after the at least one scale first feature map is obtained, a third feature map may be obtained according to the at least one scale first feature map. For example, the first feature map of at least one scale may be fused to obtain a second fused feature map. And acquiring a third feature map according to the second fusion feature map. For example, the second fused feature map may be determined as the third feature map. Alternatively, the second fused feature map may be processed to obtain a third feature map.
According to the embodiment of the disclosure, after the second feature map is obtained, the category information of each of the plurality of text regions may be obtained according to the second feature map. For example, a heat map of each of the plurality of text regions may be obtained from the second feature map, and the category information of each of the plurality of text regions may be determined according to the heat maps of the plurality of text regions. Alternatively, the second feature map may be processed based on a regression-based method to obtain the respective category information of the plurality of text regions.
According to the embodiment of the present disclosure, after the third feature map is obtained, the position information of each of the plurality of text regions may be obtained according to the third feature map. For example, a heat map of each of the plurality of text regions may be obtained from the third feature map, and the position information of each of the plurality of text regions may be determined according to the heat maps of the plurality of text regions. Alternatively, the third feature map may be processed based on a regression-based method to obtain the position information of each of the plurality of text regions.
According to an embodiment of the present disclosure, operations S311 to S315 may be performed by an electronic device. The electronic device may comprise a server or a terminal device. The server may be the server 105 in fig. 1. The terminal devices may be the first terminal device 101, the second terminal device 102, and the third terminal device 103 in fig. 1.
According to the embodiment of the disclosure, the first feature map of at least one scale can provide richer information. Therefore, obtaining the category information and the position information of each of the plurality of text regions using the first feature map of at least one scale improves the accuracy of the category information and the position information of each of the text regions.
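As an illustrative sketch (not the disclosed implementation), the two detection outputs described above, category information from a per-category score map and position information from a region heat map, could be approximated in NumPy as follows. The function names and the mask/threshold conventions are hypothetical:

```python
import numpy as np

def region_category(class_map: np.ndarray, region_mask: np.ndarray) -> int:
    """Pick the category whose mean score inside the region is highest.

    class_map: (C, H, W) per-category score map (the "second feature map" path).
    region_mask: (H, W) boolean mask of one text region.
    """
    scores = class_map[:, region_mask].mean(axis=1)  # (C,) mean score per category
    return int(np.argmax(scores))

def region_box(heat_map: np.ndarray, thresh: float = 0.5):
    """Bounding box (x0, y0, x1, y1) of the thresholded region heat map
    (the "third feature map" path)."""
    ys, xs = np.where(heat_map > thresh)
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())
```

A region whose mask covers high scores in channel c is assigned category c, and the box is the tight axis-aligned rectangle around the above-threshold heat-map pixels.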
According to an embodiment of the present disclosure, operation S311 may include the following operations.
Feature extraction of M stages is performed on the text image to obtain at least one first feature map corresponding to the M-th stage. A first feature map of at least one scale is then obtained according to the at least one first feature map corresponding to the M-th stage.
According to an embodiment of the disclosure, the mth stage has Tm parallel levels, the image resolutions of the first feature maps of the same parallel level are the same, and the image resolutions of the first feature maps of different parallel levels are different.
According to an embodiment of the present disclosure, M is an integer greater than or equal to 1. M is an integer greater than or equal to 1 and less than or equal to M. Tm is an integer greater than or equal to 1.
According to embodiments of the present disclosure, the M stages may include an input stage, an intermediate stage, and an output stage. The input stage may refer to the 1st stage. The output stage may refer to the M-th stage. The intermediate stages may refer to the 2nd through (M-1)-th stages. The number of parallel levels of each stage may be the same or different. In the 1st to (M-1)-th stages, the current stage may have at least one more parallel level than the previous stage. The M-th stage may have the same number of parallel levels as the (M-1)-th stage. M may be configured according to actual service requirements, which is not limited herein. For example, M = 4. In the 1st to 3rd stages, the current stage has at least one more parallel level than the previous stage: the 1st stage has T1 = 2 parallel levels, the 2nd stage has T2 = 3 parallel levels, the 3rd stage has T3 = 4 parallel levels, and the 4th stage has T4 = 4 parallel levels.
According to an embodiment of the present disclosure, the image resolution of the first feature map of the same parallel hierarchy is the same. The image resolution of the first feature map of the different parallel levels is different, e.g. the image resolution of the first feature map of the current parallel level is smaller than the image resolution of the first feature map of the upper parallel level. The image resolution of the first feature map of the current parallel hierarchy of the current stage may be determined from the image resolution of the first feature map of the upper parallel hierarchy of the previous stage. For example, the image resolution of the first feature map of the current stage may be obtained by downsampling the image resolution of the first feature map of the upper parallel hierarchy of the previous stage.
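The resolution relationship between parallel levels can be sketched as follows; this assumes, purely for illustration, that each new parallel level halves the previous level's resolution (one downsampling step), which is one common choice and not stated as mandatory by the disclosure:

```python
def level_resolutions(h: int, w: int, num_levels: int):
    """Resolutions of the parallel levels, assuming each lower level is
    obtained by downsampling the level above it by a factor of 2."""
    res = [(h, w)]
    for _ in range(num_levels - 1):
        h, w = h // 2, w // 2
        res.append((h, w))
    return res
```

For example, with a 64x64 top-level feature map and four parallel levels, the levels would run 64x64, 32x32, 16x16, 8x8.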
According to the embodiment of the disclosure, in the case where M > 1, performing feature extraction of M stages on the text image to obtain the at least one first feature map corresponding to the M-th stage may include the following. In response to m = 1, feature extraction is performed on the text image to obtain at least one intermediate first feature map corresponding to the 1st stage, and a first feature map of at least one scale corresponding to the 1st stage is obtained according to the intermediate first feature map of at least one scale corresponding to the 1st stage. In response to 1 < m ≤ M, feature extraction is performed on the first feature map of at least one scale corresponding to the (m-1)-th stage to obtain an intermediate first feature map of at least one scale corresponding to the m-th stage, and a first feature map of at least one scale corresponding to the m-th stage is obtained according to the intermediate first feature map of at least one scale corresponding to the m-th stage.
According to the embodiment of the disclosure, in the case where M = 1, performing feature extraction of M stages on the text image to obtain the at least one first feature map corresponding to the M-th stage may include performing feature extraction on the text image to obtain at least one intermediate first feature map corresponding to the 1st stage, and obtaining a first feature map of at least one scale corresponding to the 1st stage according to the intermediate first feature map of at least one scale corresponding to the 1st stage.
According to an embodiment of the disclosure, obtaining the at least one scale first feature map from the at least one first feature map corresponding to the mth stage may include determining the at least one first feature map corresponding to the mth stage as the at least one scale first feature map.
According to the embodiment of the disclosure, since the image resolutions of the first feature maps of the same parallel level are the same and the image resolutions of the first feature maps of different parallel levels are different, a high-resolution feature representation can be maintained throughout the feature extraction process, while parallel levels from high resolution to low resolution are gradually added. Deep semantic information is extracted directly on the high-resolution feature representation, rather than serving merely as a supplement to the low-level feature information of the image, so that the representation has sufficient classification capability while avoiding the loss of effective spatial resolution. The at least one parallel level can capture context information and acquire rich global and local information. In addition, information is repeatedly exchanged across the parallel levels to realize multi-scale fusion of features, and more accurate position information can be obtained, thereby improving the accuracy of the category information and the position information of each of the text regions.
According to an embodiment of the present disclosure, in a case where M is an integer greater than 1, feature extraction of M stages is performed on a text image, and at least one first feature map corresponding to an mth stage is obtained, which may include the following operations.
Convolution processing is performed on the at least one first feature map corresponding to the (m-1)-th stage to obtain at least one intermediate first feature map corresponding to the m-th stage. Feature fusion is performed on the at least one intermediate first feature map corresponding to the m-th stage to obtain at least one first feature map corresponding to the m-th stage.
According to an embodiment of the present disclosure, M is an integer greater than 1 and less than or equal to M.
According to an embodiment of the present disclosure, for the m-1 th stage, for a first feature map of the at least one first feature map, convolution processing may be performed on the first feature map to obtain an intermediate first feature map of the m-th stage, so that at least one intermediate first feature map of the m-th stage may be obtained.
According to the embodiment of the disclosure, performing feature fusion on the at least one intermediate first feature map corresponding to the m-th stage to obtain the at least one first feature map corresponding to the m-th stage may include: for an intermediate first feature map among the at least one intermediate first feature map corresponding to the m-th stage, fusing that intermediate first feature map of the m-th stage with the intermediate first feature maps of parallel levels other than the parallel level where it is located, to obtain the first feature map corresponding to that intermediate first feature map of the m-th stage. The other parallel levels may refer to at least some parallel levels of the m-th stage other than the parallel level at which the intermediate first feature map is located.
According to an embodiment of the present disclosure, feature fusion is performed on at least one intermediate first feature map corresponding to the mth stage, to obtain at least one first feature map corresponding to the mth stage, which may include the following operations.
For the i-th parallel level among the Tm parallel levels, a first feature map corresponding to the i-th parallel level is obtained according to the other intermediate first feature maps corresponding to the i-th parallel level and the intermediate first feature map corresponding to the i-th parallel level.
According to an embodiment of the present disclosure, the other intermediate first feature maps corresponding to the i-th parallel level may be intermediate first feature maps corresponding to at least part of the Tm parallel levels other than the i-th parallel level. i may be an integer greater than or equal to 1 and less than or equal to Tm.
According to an embodiment of the disclosure, in the case where 1 < i < Tm, up-sampling is performed on at least one first other intermediate first feature map to obtain an up-sampled first feature map corresponding to the at least one first other intermediate first feature map, and down-sampling is performed on at least one second other intermediate first feature map to obtain a down-sampled first feature map corresponding to the at least one second other intermediate first feature map. A first other intermediate first feature map may refer to an other intermediate first feature map at a parallel level greater than the i-th among the Tm parallel levels. A second other intermediate first feature map may refer to an other intermediate first feature map at a parallel level less than the i-th among the Tm parallel levels. The image resolution of the up-sampled first feature map is the same as the resolution of the intermediate first feature map of the i-th parallel level. The resolution of the down-sampled first feature map is the same as the resolution of the intermediate first feature map of the i-th parallel level.
According to an embodiment of the present disclosure, in the case where i = 1, at least one first other intermediate first feature map is up-sampled to obtain an up-sampled first feature map corresponding to the at least one first other intermediate first feature map. A first other intermediate first feature map may refer to an other intermediate first feature map at a parallel level greater than the 1st among the Tm parallel levels. The image resolution of the up-sampled first feature map is the same as the resolution of the intermediate first feature map of the 1st parallel level.
According to an embodiment of the present disclosure, in the case where i = Tm, at least one second other intermediate first feature map is down-sampled to obtain a down-sampled first feature map corresponding to the at least one second other intermediate first feature map. A second other intermediate first feature map may refer to an other intermediate first feature map at a parallel level less than the Tm-th among the Tm parallel levels. The resolution of the down-sampled first feature map is the same as the resolution of the intermediate first feature map of the Tm-th parallel level.
According to an embodiment of the disclosure, the first feature map corresponding to the i-th parallel level is obtained from the up-sampled first feature map corresponding to the at least one first other intermediate first feature map, the down-sampled first feature map corresponding to the at least one second other intermediate first feature map, and the intermediate first feature map of the i-th parallel level. For example, the up-sampled first feature map corresponding to the at least one first other intermediate first feature map, the down-sampled first feature map corresponding to the at least one second other intermediate first feature map, and the intermediate first feature map of the i-th parallel level may be fused to obtain the first feature map corresponding to the i-th parallel level. The fusing may include at least one of concatenation and addition.
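The multi-scale fusion described above can be sketched in NumPy as follows. This is a minimal illustration only: nearest-neighbour resizing stands in for the up-sampling and down-sampling operators, additive fusion is assumed, and all names are hypothetical:

```python
import numpy as np

def resize_nearest(x: np.ndarray, out_h: int, out_w: int) -> np.ndarray:
    """Nearest-neighbour resize, standing in for up-/down-sampling."""
    h, w = x.shape
    rows = np.arange(out_h) * h // out_h
    cols = np.arange(out_w) * w // out_w
    return x[rows][:, cols]

def fuse_level(maps, i):
    """Fuse the intermediate feature maps of all parallel levels into the
    resolution of level i by resizing and adding (additive fusion)."""
    h, w = maps[i].shape
    out = np.zeros((h, w))
    for j, m in enumerate(maps):
        out += m if j == i else resize_nearest(m, h, w)
    return out
```

Maps at levels above i are effectively down-sampled and maps at levels below i up-sampled, so every output level aggregates information from all resolutions.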
Fig. 4 schematically illustrates an example schematic diagram of a process of performing feature extraction of M stages on a text image to obtain at least one first feature map corresponding to an mth stage according to an embodiment of the present disclosure.
As shown in fig. 4, in 400, M = 4: a 1st stage 401, a 2nd stage 402, a 3rd stage 403, and a 4th stage 404. The 1st stage 401 has two parallel levels, e.g., a 1st parallel level 405 and a 2nd parallel level 406. The 2nd stage 402 has three parallel levels, e.g., the 1st parallel level 405, the 2nd parallel level 406, and a 3rd parallel level 407. The 3rd stage 403 has four parallel levels, e.g., the 1st parallel level 405, the 2nd parallel level 406, the 3rd parallel level 407, and a 4th parallel level 408.
The at least one first feature map corresponding to the 4th stage may include a first feature map 409, a first feature map 410, a first feature map 411, and a first feature map 412. Furthermore, the "up-right arrow" between the last two columns of each stage in fig. 4 denotes "up-sampling", and the "lower-left arrow" denotes "down-sampling".
According to an embodiment of the present disclosure, operation S311 may include the following operations.
Feature extraction of N cascade levels is performed on the text image to obtain a first feature map of at least one scale.
According to an embodiment of the present disclosure, N is an integer greater than 1. N may be configured according to actual service requirements, which is not limited herein. For example, n=4.
According to the embodiment of the disclosure, feature extraction of N cascade levels may be performed on the text image to obtain at least one first feature map corresponding to the N cascade levels. A first feature map of at least one scale is then obtained according to the at least one first feature map corresponding to the N cascade levels. For example, for an n-th cascade level among the N cascade levels, a first feature map of a scale corresponding to the n-th cascade level is obtained from the first feature maps of other cascade levels and the first feature map corresponding to the n-th cascade level. The other cascade levels may refer to at least some of the N cascade levels other than the n-th cascade level.
According to the embodiment of the disclosure, since the first feature map of at least one scale can provide richer information, the accuracy of the category information and the position information of each of the plurality of text regions can be improved by determining the category information and the position information of each of the plurality of text regions according to the first feature map of at least one scale.
Fig. 5A schematically illustrates a flowchart of a method of determining semantic relationship information based on identification information according to an embodiment of the present disclosure.
As shown in fig. 5A, the method 500A is a further limitation of operation S240 in fig. 2, and the method 500A may include operation S541.
In operation S541, semantic relationship information is determined based on the auxiliary information and the identification information.
According to an embodiment of the present disclosure, the auxiliary information may include at least one of a second feature map and location information.
According to an embodiment of the present disclosure, the second feature map may be obtained from the first fused feature map. The first fused feature map may be obtained by fusing the first feature map of at least one scale. The location information may characterize the location where the text region is located.
According to an embodiment of the present disclosure, for example, in the case where the auxiliary information includes the second feature map, the semantic relationship information may be determined according to the second feature map and the identification information. Alternatively, in the case where the auxiliary information includes location information, the semantic relationship information may be determined from the location information and the identification information. Alternatively, in the case where the auxiliary information includes the second feature map and the position information, the semantic relationship information may be determined from the second feature map, the position information, and the identification information.
According to the embodiment of the disclosure, since the semantic relationship information is determined from both the auxiliary information, which includes at least one of the second feature map and the position information, and the identification information, the accuracy of the semantic relationship information is improved.
In accordance with an embodiment of the present disclosure, in the case where the auxiliary information includes the second feature map, operation S541 may include the following operations.
The second feature map and a fourth feature map corresponding to the identification information are fused to obtain a fused feature map. Semantic relationship information is determined according to the fused feature map.
According to the embodiment of the disclosure, in the case where the auxiliary information includes the second feature map, feature extraction may be performed on the text identification information of each of the plurality of text region images, to obtain a fourth feature map corresponding to the identification information. After the fourth feature map is obtained, the second feature map and the fourth feature map corresponding to the identification information may be fused to obtain a fused feature map. After the fusion feature map is obtained, the fusion feature map can be processed by using a text classification model to obtain semantic relation information.
According to embodiments of the present disclosure, the text classification model may include a deep learning model or a machine learning model. A third predetermined model may be trained using a third training sample set, which may include a plurality of training texts, and a third label set, which may include a third label corresponding to each training text, to obtain a text classification model.
According to an embodiment of the present disclosure, training a third predetermined model using a third training sample set and a third tag set to obtain a text classification model may include inputting each training text of a plurality of training texts into the third predetermined model to obtain a semantic category result corresponding to each training text. And inputting the semantic category result and the third label corresponding to each training text into a first loss function to obtain a first output value. And adjusting model parameters of the third preset model according to the first output value until the first output value converges. A third predetermined model obtained in a case where convergence of the first output value is satisfied is determined as a text classification model.
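The "adjust model parameters from the loss value until the output converges" loop described above can be illustrated with a deliberately tiny stand-in model, a one-dimensional logistic classifier trained by gradient descent. It is not the third predetermined model of the disclosure, only a sketch of the convergence-driven training pattern; all names and hyperparameters are hypothetical:

```python
import numpy as np

def train_until_converged(x, y, lr=0.5, tol=1e-6, max_steps=10000):
    """Gradient-descent loop: compute a loss ("first output value"),
    adjust parameters from it, and stop when the loss stops improving."""
    w, b = 0.0, 0.0
    prev = float("inf")
    loss = prev
    for _ in range(max_steps):
        z = w * x + b
        p = 1.0 / (1.0 + np.exp(-z))                       # model output
        loss = -np.mean(y * np.log(p + 1e-12)
                        + (1 - y) * np.log(1 - p + 1e-12))  # cross-entropy loss
        if abs(prev - loss) < tol:                          # convergence check
            break
        prev = loss
        grad = p - y                                        # dLoss/dz
        w -= lr * np.mean(grad * x)
        b -= lr * np.mean(grad)
    return w, b, loss
```

The same pattern, forward pass, loss, parameter update, convergence test, applies whatever the underlying model is.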
In accordance with an embodiment of the present disclosure, in the case where the auxiliary information further includes location information, operation S541 may include the following operations.
Semantic relationship information is determined according to the fused feature map and the position information.
According to the embodiment of the disclosure, in the case that the auxiliary information further includes the position information, feature extraction may be performed on the text identification information of each of the plurality of text region images, so as to obtain a fourth feature map corresponding to the identification information. After the fourth feature map is obtained, the second feature map and the fourth feature map corresponding to the identification information may be fused to obtain a fused feature map. After the fused feature map is obtained, semantic relationship information can be determined according to the fused feature map and the position information.
Fig. 5B schematically illustrates an example schematic diagram of a process of determining semantic relationship information according to identification information according to an embodiment of the present disclosure.
As shown in fig. 5B, in the diagram 500B, in the case where the auxiliary information 501 includes the second feature map 5011, the fourth feature map 503 corresponding to the identification information 502 can be determined from the identification information 502.
After the fourth feature map 503 corresponding to the identification information 502 is obtained, the second feature map 5011 and the fourth feature map 503 corresponding to the identification information 502 may be fused to obtain a fused feature map 504.
After the fused feature map 504 is obtained, semantic relationship information 505 may be determined from the fused feature map 504.
Fig. 5C schematically illustrates an example schematic diagram of a process of determining semantic relationship information according to identification information according to another embodiment of the present disclosure.
As shown in fig. 5C, in 500C, in the case where the auxiliary information 506 includes the second feature map 5061 and the position information 5062, a fourth feature map 508 corresponding to the identification information 507 may be determined from the identification information 507.
After the fourth feature map 508 corresponding to the identification information 507 is obtained, the second feature map 5061 and the fourth feature map 508 corresponding to the identification information 507 may be fused to obtain a fused feature map 509.
After the fused feature map 509 is obtained, semantic relationship information 510 may be determined from the fused feature map 509 and the location information 5062.
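One simple way to let the relation classifier see both the fused features and the position information, as in the FIG. 5C flow, is to append normalised box coordinates to each region's feature vector. This is an illustrative assumption, not the disclosed mechanism; the function name and coordinate convention are hypothetical:

```python
import numpy as np

def with_position(fused_feat: np.ndarray, box, img_w: int, img_h: int) -> np.ndarray:
    """Concatenate a per-region fused feature vector with its normalised
    bounding-box coordinates (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = box
    pos = np.array([x0 / img_w, y0 / img_h, x1 / img_w, y1 / img_h])
    return np.concatenate([fused_feat, pos])
```

Normalising by image size keeps the position channels in the same numeric range regardless of input resolution.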
Fig. 5D schematically illustrates a flowchart of a method of determining semantic relationship information according to identification information according to another embodiment of the present disclosure.
As shown in fig. 5D, the method 500D is further defined with respect to operation S240 in fig. 2, and the method 500D may include operations S542-S543.
In operation S542, global feature extraction is performed on the identification information to obtain global feature information.
In operation S543, semantic relationship information is determined from the global feature information.
According to the embodiment of the disclosure, the text identification information of each of the plurality of text region images can be input into the global feature extraction model to obtain global feature information. After the global feature information is obtained, a semantic relationship between the plurality of text recognition information may be determined based on the global feature information. The global feature extraction model may be derived by training a fourth predetermined model using a fourth set of training samples. The fourth predetermined model may include a recurrent neural network (Recurrent Neural Networks, RNN) model, a Long Short-Term Memory (LSTM) model, or a Transformer model. The fourth predetermined model may be configured according to actual service requirements, as long as it can implement the global feature extraction function, which is not limited herein.
For example, the fourth predetermined model may comprise at least one model structure. The model structure may comprise at least one model substructure and the connection relationships of the respective model substructures to each other. The model structure may be a structure obtained by connecting at least one model substructure based on the connection relationships between the model substructures. The at least one model substructure comprised by the model structure may be a structure from at least one operation layer. For example, the model structure may be a structure obtained by connecting at least one model substructure from at least one operation layer based on the connection relationships between model substructures. For example, the at least one operation layer may include at least one of an input layer, a convolution layer, a hidden layer, a transcription layer, a pooling layer, an unpooling layer, a deconvolution layer, a feedforward neural network layer, an attention layer, a residual layer, a fully connected layer, a batch normalization layer, a linear embedding (i.e., Linear Embedding) layer, a non-linear layer, and the like.
According to the embodiment of the disclosure, since the global feature information is obtained by performing global feature extraction on the identification information, the global feature information can characterize the global characteristics of the identification information. On this basis, because the semantic relationship information is determined according to the global feature information, the semantic relationship information can more comprehensively represent the semantic relationships among the plurality of pieces of text recognition information, further improving the accuracy of the semantic relationship information.
According to an embodiment of the present disclosure, operation S542 may include the following operations.
The identification information is processed based on an attention policy to obtain global feature information.
According to embodiments of the present disclosure, an attention policy may be used to achieve focusing of important information with high weight, ignoring non-important information with low weight, and enabling information exchange with other information by sharing important information, thereby achieving transfer of important information. In the embodiment of the disclosure, the attention strategy can extract information among the text identification information of each of the text region images so as to better complete the information generation of the text images. The attention policy may include one of a self-attention policy and a mutual-attention policy.
According to embodiments of the present disclosure, the text identification information may be used to determine a first key matrix, a first value matrix, and a first query matrix. For example, where the attention policy is a self-attention policy, the text identification information may be used as the first key matrix, the first value matrix, and the first query matrix. The key matrix, value matrix, and query matrix may be matrices in the attention mechanism.
According to an embodiment of the present disclosure, in a case where the attention policy may be a self-attention policy, text identification information corresponding to the plurality of text region images for use as the first key matrix, the first value matrix, and the first query matrix may be processed based on the self-attention policy, resulting in global feature information corresponding to each of the plurality of text region images. For example, the attention unit may be determined according to a self-attention policy. And processing text identification information corresponding to the plurality of text region images and serving as a first key matrix, a first value matrix and a first query matrix by using an attention unit to obtain global feature information corresponding to each of the plurality of text region images.
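The self-attention computation with Q = K = V derived from the text recognition features can be sketched as plain scaled dot-product attention in NumPy. A minimal single-head sketch, with no learned projections, purely to show the mechanism the paragraph describes:

```python
import numpy as np

def self_attention(x: np.ndarray) -> np.ndarray:
    """Scaled dot-product self-attention with Q = K = V = x.

    x: (n, d) matrix, one row per text region's recognition feature."""
    d = x.shape[1]
    scores = x @ x.T / np.sqrt(d)                 # (n, n) pairwise affinities
    scores -= scores.max(axis=1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True) # softmax over each row
    return weights @ x                            # each row mixes global context
```

Each output row is a convex combination of all input rows, which is how a region's feature comes to carry information about every other region.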
According to the embodiment of the disclosure, the global feature information is obtained by processing the text identification information corresponding to each of the plurality of text region images by using the attention strategy, and the attention strategy can extract semantic information between the text region images and other text region images, so that the accuracy of generating the structured information of the text images is improved.
According to an embodiment of the present disclosure, processing the identification information based on the attention policy, resulting in global feature information, may include the following operations.
U-level processing is performed on the identification information based on the self-attention policy to obtain global feature information.
According to an embodiment of the present disclosure, U is an integer greater than or equal to 1. U may be configured according to actual service requirements, which is not limited herein. For example, u=4.
According to the embodiment of the disclosure, for the text region images in the plurality of text region images, text identification information of each of the plurality of text region images can be processed based on the attention policy to obtain global feature information. For example, the text identification information of each of the plurality of text region images may be subjected to U-level processing based on a self-attention policy, to obtain global feature information.
According to an embodiment of the present disclosure, in the case where U is an integer greater than 1 and 1 < u ≤ U, the identification information is subjected to U-level processing based on the self-attention policy to obtain global feature information, which may include the following operations.
Second intermediate feature information of the u-th level corresponding to each of the plurality of text recognition information is obtained according to the first intermediate feature information of the (u-1)-th level corresponding to each of the plurality of text recognition information. First intermediate feature information of the u-th level corresponding to the plurality of text recognition information is obtained according to the second intermediate feature information of the u-th level corresponding to the plurality of text recognition information and the first intermediate feature information of the (u-1)-th level corresponding to the plurality of text recognition information. Global feature information is obtained according to the first intermediate feature information of the R-th level corresponding to each of the plurality of text recognition information.
According to embodiments of the present disclosure, the first intermediate feature information may be used to determine a first query matrix, a first key matrix, and a first value matrix.
According to embodiments of the present disclosure, u may be an integer greater than or equal to 1 and less than or equal to U, i.e., u ∈ {1, 2, …, U-1, U}. R may be an integer greater than or equal to 1 and less than or equal to U.
According to the embodiment of the disclosure, under the condition that 1 < u ≤ U, the first intermediate feature information of the u-1-th level corresponding to each of the plurality of text recognition information is processed based on the self-attention strategy to obtain the second intermediate feature information of the u-th level corresponding to each of the plurality of text recognition information. The first intermediate feature information of the u-th level corresponding to each of the plurality of text recognition information may be used as the first key matrix, the first value matrix, and the first query matrix of the u+1-th level. And fusing the second intermediate feature information of the u-th level corresponding to each of the plurality of text recognition information with the first intermediate feature information of the u-1-th level corresponding to each of the plurality of text recognition information to obtain fourth intermediate feature information of the u-th level corresponding to each of the plurality of text recognition information. And obtaining the first intermediate feature information of the u-th level corresponding to each of the plurality of text recognition information according to the fourth intermediate feature information of the u-th level corresponding to each of the plurality of text recognition information. And obtaining global feature information according to the first intermediate feature information of the R-th level corresponding to each of the plurality of text identification information. The fusing may include one of adding and concatenating.
According to the embodiment of the disclosure, obtaining the first intermediate feature information of the u-th level corresponding to each of the plurality of text recognition information according to the fourth intermediate feature information of the u-th level may include performing multi-layer perceptron processing on the fourth intermediate feature information of the u-th level to obtain fifth intermediate feature information of the u-th level corresponding to each of the plurality of text recognition information, and obtaining the first intermediate feature information of the u-th level according to the fifth intermediate feature information of the u-th level. For example, the sixth intermediate feature information of the u-th level corresponding to each of the plurality of text recognition information may be normalized to obtain the fourth intermediate feature information of the u-th level. Normalization may include one of batch normalization (Batch Normalization, BN) and layer normalization (Layer Normalization, LN). For example, the sixth intermediate feature information of the u-th level may be subjected to batch normalization to obtain the fourth intermediate feature information of the u-th level corresponding to each of the plurality of text recognition information.
According to the embodiment of the disclosure, processing the first intermediate feature information of the u-1-th level corresponding to each of the plurality of text recognition information based on the self-attention policy to obtain the second intermediate feature information of the u-th level may include obtaining eighth intermediate feature information of the u-1-th level corresponding to each of the plurality of text recognition information according to seventh intermediate feature information of the u-1-th level corresponding to each of the plurality of text recognition information. For example, the seventh intermediate feature information of the u-1-th level is normalized to obtain the eighth intermediate feature information of the u-1-th level. And processing the eighth intermediate feature information of the u-1-th level based on the self-attention strategy to obtain the second intermediate feature information of the u-th level corresponding to each of the plurality of text recognition information.
According to an embodiment of the present disclosure, obtaining second intermediate feature information of the u-th hierarchy corresponding to each of the plurality of text recognition information from the first intermediate feature information of the u-1-th hierarchy corresponding to each of the plurality of text recognition information may include the following operations.
And determining a plurality of first matrix sets of the u-th level corresponding to the plurality of text identification information according to the first intermediate feature information of the u-1-th level corresponding to each of the plurality of text identification information. And, for each text recognition information among the plurality of text recognition information of the u-th level and for each first matrix set among the plurality of first matrix sets, obtaining a first attention matrix corresponding to the text recognition information of the u-th level according to the first query matrix corresponding to the text recognition information of the u-th level and the first key matrices corresponding to each of the plurality of text recognition information of the u-th level. And obtaining third intermediate feature information corresponding to the text identification information of the u-th level according to the first attention matrix corresponding to the text identification information of the u-th level and the first value matrix corresponding to the text identification information of the u-th level. And obtaining the second intermediate feature information corresponding to the text identification information of the u-th level according to the plurality of third intermediate feature information corresponding to the text identification information of the u-th level.
According to an embodiment of the present disclosure, the first matrix set may include a first query matrix, a first key matrix, and a first value matrix.
According to embodiments of the present disclosure, the self-attention policy may include a multi-headed self-attention policy. Determining the plurality of first matrix sets of the u-th level corresponding to the plurality of text identifying information according to the first intermediate feature information of the u-1-th level may include: for each text identifying information among the plurality of text identifying information, determining the first matrix set of the u-th level corresponding to the text identifying information according to the first intermediate feature information of the u-1-th level corresponding to the text identifying information. The first matrix set may include a first key matrix, a first value matrix, and a first query matrix.
According to the embodiment of the disclosure, obtaining the second intermediate feature information corresponding to the text identification information of the u-th level according to the third intermediate feature information may include obtaining ninth intermediate feature information corresponding to the text identification information of the u-th level according to the plurality of third intermediate feature information corresponding to the text identification information of the u-th level. For example, the plurality of third intermediate feature information corresponding to the text recognition information of the u-th level may be fused to obtain the plurality of ninth intermediate feature information corresponding to the text recognition information of the u-th level. The fusing may include at least one of concatenating and adding. And obtaining the second intermediate feature information corresponding to the text identification information of the u-th level according to the plurality of ninth intermediate feature information. For example, the plurality of ninth intermediate feature information corresponding to the text recognition information of the u-th level may be linearly transformed to obtain the second intermediate feature information corresponding to the text recognition information of the u-th level.
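The per-level computation described above (a first matrix set per attention head, a first attention matrix obtained from the query and key matrices, value weighting, concatenation of the per-head results, and residual fusion with the previous level) follows the shape of a standard multi-head self-attention encoder layer. The following is a minimal NumPy sketch under illustrative assumptions; the function name, weight shapes, and the ReLU stand-in for the multi-layer perceptron are hypothetical and only echo the "intermediate feature information" terminology of the text.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def encoder_level(first_prev, head_weights):
    """One u-th level update. first_prev: (n, d) array of the (u-1)-th level
    first intermediate feature information for n text recognition results.
    head_weights: one hypothetical first matrix set (Wq, Wk, Wv) per head,
    each matrix of shape (d, d // num_heads)."""
    d = first_prev.shape[1]
    dh = d // len(head_weights)
    thirds = []  # per-head outputs ("third intermediate feature information")
    for wq, wk, wv in head_weights:
        q, k, v = first_prev @ wq, first_prev @ wk, first_prev @ wv
        attn = softmax(q @ k.T / np.sqrt(dh))      # "first attention matrix"
        thirds.append(attn @ v)
    second = np.concatenate(thirds, axis=-1)       # fuse heads (concatenation)
    fourth = second + first_prev                   # residual fusion (adding)
    fifth = np.maximum(fourth, 0.0)                # MLP stand-in (assumption)
    mu = fifth.mean(axis=-1, keepdims=True)        # layer normalization
    sd = fifth.std(axis=-1, keepdims=True)
    return (fifth - mu) / (sd + 1e-6)              # u-th level first features
```

Stacking U such levels and taking the R-th level output would then yield the global feature information described above.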
According to an embodiment of the present disclosure, in the case where u=1, performing U-level processing on the identification information based on the self-attention policy, obtaining global feature information may include the following operations.
And obtaining second intermediate characteristic information corresponding to each of the text identification information of the level 2 according to the global characteristic information corresponding to each of the text identification information of the level 1. And obtaining the first intermediate characteristic information corresponding to the text recognition information of the 2 nd level according to the second intermediate characteristic information corresponding to the text recognition information of the 2 nd level and the global characteristic information corresponding to the text recognition information of the 1 st level.
According to embodiments of the present disclosure, the global feature information may be used to determine a second query matrix, a second key matrix, and a second value matrix.
According to the embodiment of the disclosure, global feature information corresponding to each of the plurality of text recognition information of the 1 st level is processed based on the self-attention policy, and second intermediate feature information corresponding to each of the plurality of text recognition information of the 2 nd level is obtained. The global feature information of the level 1 corresponding to each of the plurality of text recognition information may be used as a first query matrix (i.e., a second query matrix), a first key matrix (i.e., a second key matrix), and a first value matrix (i.e., a second value matrix) of the level 2. And fusing the second intermediate characteristic information corresponding to each of the text identification information of the 2 nd level with the global characteristic information corresponding to each of the text identification information of the 1 st level to obtain fourth intermediate characteristic information corresponding to each of the text identification information of the 2 nd level. And obtaining the first intermediate characteristic information corresponding to the text identification information of the 2 nd level according to the fourth intermediate characteristic information corresponding to the text identification information of the 2 nd level.
According to the embodiment of the disclosure, obtaining the first intermediate feature information of the 2nd level corresponding to each of the plurality of text recognition information according to the fourth intermediate feature information of the 2nd level may include performing multi-layer perceptron processing on the fourth intermediate feature information of the 2nd level to obtain fifth intermediate feature information of the 2nd level corresponding to each of the plurality of text recognition information, and obtaining the first intermediate feature information of the 2nd level according to the fifth intermediate feature information of the 2nd level. For example, the sixth intermediate feature information of the 2nd level corresponding to each of the plurality of text recognition information may be normalized to obtain the fourth intermediate feature information of the 2nd level. The normalization may include one of batch normalization and layer normalization. For example, the sixth intermediate feature information of the 2nd level may be subjected to batch normalization to obtain the fourth intermediate feature information of the 2nd level corresponding to each of the plurality of text recognition information.
According to the embodiment of the disclosure, processing the global feature information of the 1st level corresponding to each of the plurality of text recognition information based on the self-attention policy to obtain the second intermediate feature information of the 2nd level may include obtaining eighth intermediate feature information of the 2nd level corresponding to each of the plurality of text recognition information according to seventh intermediate feature information of the 1st level corresponding to each of the plurality of text recognition information. For example, the seventh intermediate feature information of the 1st level is normalized to obtain the eighth intermediate feature information of the 2nd level. And processing the eighth intermediate feature information of the 2nd level based on the self-attention strategy to obtain the second intermediate feature information of the 2nd level corresponding to each of the plurality of text recognition information.
According to an embodiment of the present disclosure, obtaining the second intermediate feature information of the level 2 corresponding to each of the plurality of text recognition information according to the global feature information of the level 1 corresponding to each of the plurality of text recognition information may include the following operations.
And determining a plurality of second matrix sets of the 2nd level corresponding to the plurality of text identification information according to the global feature information of the 1st level corresponding to each of the plurality of text identification information. And, for each text recognition information among the plurality of text recognition information of the 2nd level and for each second matrix set among the plurality of second matrix sets, obtaining a second attention matrix corresponding to the text recognition information of the 2nd level according to the second query matrix corresponding to the text recognition information of the 2nd level and the second key matrices corresponding to each of the plurality of text recognition information of the 2nd level. And obtaining third intermediate feature information corresponding to the text identification information of the 2nd level according to the second attention matrix and the second value matrix corresponding to the text identification information of the 2nd level. And obtaining the second intermediate feature information corresponding to the text identification information of the 2nd level according to the plurality of third intermediate feature information corresponding to the text identification information of the 2nd level.
According to an embodiment of the present disclosure, the second set of matrices may include a second query matrix, a second key matrix, and a second value matrix.
According to embodiments of the present disclosure, the self-attention policy may include a multi-headed self-attention policy. Determining the plurality of second matrix sets of the 2nd level corresponding to the plurality of text identifying information according to the global feature information of the 1st level may include: for each text identifying information among the plurality of text identifying information, determining the second matrix set of the 2nd level corresponding to the text identifying information according to the global feature information of the 1st level corresponding to the text identifying information. The second matrix set may include a second key matrix, a second value matrix, and a second query matrix.
According to the embodiment of the disclosure, obtaining the second intermediate feature information corresponding to the text identification information of the 2nd level according to the third intermediate feature information may include obtaining ninth intermediate feature information corresponding to the text identification information of the 2nd level according to the plurality of third intermediate feature information corresponding to the text identification information of the 2nd level. For example, the plurality of third intermediate feature information corresponding to the text recognition information of the 2nd level may be fused to obtain the plurality of ninth intermediate feature information corresponding to the text recognition information of the 2nd level. The fusing may include at least one of concatenating and adding. And obtaining the second intermediate feature information corresponding to the text identification information of the 2nd level according to the plurality of ninth intermediate feature information. For example, the plurality of ninth intermediate feature information may be linearly transformed to obtain the second intermediate feature information corresponding to the text recognition information of the 2nd level.
Fig. 5E schematically illustrates an example schematic diagram of a process of determining semantic relationship information according to identification information according to another embodiment of the present disclosure.
As shown in fig. 5E, in 500E, a plurality of first matrix sets 512 of the u-th hierarchy corresponding to the plurality of text recognition information may be determined according to the first intermediate feature information 511 of the u-1-th hierarchy corresponding to the plurality of text recognition information. The first matrix set 512 may include a first query matrix 512_1, a first key matrix 512_2, and a first value matrix 512_3.
After obtaining the plurality of first matrix sets 512 of the u-th level corresponding to the plurality of text recognition information, the first attention matrix 513 of the u-th level corresponding to the text recognition information may be obtained according to the first query matrix 512_1 of the u-th level corresponding to the text recognition information and the first key matrix 512_2 of the u-th level corresponding to the plurality of text recognition information.
After the first attention matrix 513 is obtained, third intermediate feature information 514 corresponding to the text recognition information of the u-th level may be obtained from the first attention matrix 513 corresponding to the text recognition information of the u-th level and the first value matrix 512_3 corresponding to the text recognition information of the u-th level.
After the third intermediate feature information 514 is obtained, second intermediate feature information 515 corresponding to the text recognition information of the u-th hierarchy may be obtained from the plurality of third intermediate feature information 514 corresponding to the text recognition information of the u-th hierarchy.
After the second intermediate feature information 515 corresponding to the text recognition information of the u-th hierarchy is obtained, the first intermediate feature information 516 corresponding to the plurality of text recognition information of the u-th hierarchy may be obtained from the second intermediate feature information 515 corresponding to each of the plurality of text recognition information of the u-th hierarchy and the first intermediate feature information 511 corresponding to each of the plurality of text recognition information of the u-1-th hierarchy.
After the first intermediate feature information 516 corresponding to the plurality of text recognition information of the u-th hierarchy is obtained, global feature information 517 may be obtained according to the first intermediate feature information corresponding to each of the plurality of text recognition information of the R-th hierarchy.
After global feature information 517 is obtained, semantic relationship information 518 may be determined from global feature information 517.
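As illustrated by Fig. 5E, the levels are chained: the first intermediate feature information of one level becomes the input of the next, and the R-th level output serves as the global feature information. Below is a minimal single-head sketch of this stacking (assuming R = U and illustrative names throughout; the residual fusion uses adding, one of the fusing options named above).

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def global_features(level1_feats, num_levels=4):
    """Chain num_levels (= U) single-head self-attention updates over the
    per-text features (n, d); the final level's first intermediate feature
    information plays the role of the global feature information."""
    feats = np.asarray(level1_feats, dtype=float)
    for _ in range(num_levels - 1):
        # attention matrix over all text recognition results of this level
        attn = softmax(feats @ feats.T / np.sqrt(feats.shape[1]))
        feats = attn @ feats + feats   # attention output fused by adding
    return feats
```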
Fig. 5F schematically illustrates an example schematic diagram of a process of determining semantic relationship information according to identification information according to another embodiment of the present disclosure.
As shown in fig. 5F, in 500F, a plurality of second matrix sets 520 corresponding to each of the plurality of text recognition information of level 2 may be determined from the global feature information 519 corresponding to each of the plurality of text recognition information of level 1. The second matrix set 520 may include a second query matrix 520_1, a second key matrix 520_2, and a second value matrix 520_3.
After obtaining the plurality of second matrix sets 520 corresponding to each of the plurality of text recognition information of level 2, the second attention matrix 521 corresponding to the text recognition information of level 2 may be obtained according to the second query matrix 520_1 corresponding to the text recognition information of level 2 and the second key matrix 520_2 corresponding to the plurality of text recognition information of level 2.
After the second attention matrix 521 is obtained, third intermediate feature information 522 corresponding to the text recognition information at level 2 may be obtained from the second attention matrix 521 corresponding to the text recognition information at level 2 and the second value matrix 520_3 corresponding to the text recognition information at level 2.
After the third intermediate feature information 522 is obtained, the second intermediate feature information 523 corresponding to the text recognition information of the level 2 may be obtained from the plurality of third intermediate feature information 522 corresponding to the text recognition information of the level 2.
After obtaining the second intermediate feature information 523 corresponding to the text recognition information of the level 2, the first intermediate feature information 524 corresponding to the text recognition information of the level 2 may be obtained according to the second intermediate feature information 523 corresponding to the text recognition information of the level 2 and the global feature information 519 corresponding to the text recognition information of the level 1.
After obtaining the first intermediate feature information 524 corresponding to each of the plurality of text recognition information of level 2, global feature information 525 may be obtained according to the first intermediate feature information 524 corresponding to each of the plurality of text recognition information of level 2.
After global feature information 525 is obtained, semantic relationship information 526 may be determined from the global feature information 525.
Fig. 5G schematically illustrates a flowchart of a method of determining semantic relationship information according to identification information according to another embodiment of the present disclosure.
As shown in fig. 5G, the method 500G is a further definition of operation S240 in fig. 2, and the method 500G may include operation S544.
In operation S544, the identification information is processed using the text semantic relationship model to obtain semantic relationship information.
According to embodiments of the present disclosure, the text semantic relationship model may be derived by training a first deep learning model using a plurality of positive sample pairs and a plurality of negative sample pairs. The number of positive and negative pairs of samples may satisfy a predetermined equalization condition. The positive sample pair may include two sample texts with a key-value relationship therebetween. The negative sample pair may include two sample texts with a non-key-value relationship therebetween.
According to embodiments of the present disclosure, a positive sample pair may refer to two sample texts having a key-value relationship. A negative sample pair may refer to two sample texts having a non-key-value relationship. For example, the sample texts may include "name", "age", and "gender", whose category information is the keyword category, and "Zhang San", "42 years old", and "male", whose category information is the numeric category.
In this case, it can be determined that the positive sample pairs include "name" and "Zhang San", "age" and "42 years old", and "gender" and "male". In addition, the candidate negative sample pairs may be determined to include "name" and "42 years old", "name" and "male", "age" and "Zhang San", "age" and "male", "gender" and "Zhang San", and "gender" and "42 years old".
According to an embodiment of the present disclosure, after determining the number of positive sample pairs and the number of candidate negative sample pairs, the positive sample pairs and the candidate negative sample pairs may be subjected to a screening process according to a predetermined equalization condition such that the numbers of positive sample pairs and negative sample pairs satisfy the predetermined equalization condition. The predetermined equalization conditions may be configured according to actual service requirements, and are not limited herein. For example, the predetermined equalization condition may be configured such that the number of positive sample pairs and the number of negative sample pairs are equal. Alternatively, the predetermined equalization condition may be configured such that a difference between the number of positive sample pairs and the number of negative sample pairs is less than or equal to a predetermined threshold. The predetermined threshold may be set to 1.
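The construction above (enumerating true key-value pairs as positives, all other key/value combinations as candidate negatives, and screening so the counts satisfy a predetermined equalization condition) can be sketched as follows. The function name and the equal-count balancing rule are assumptions for illustration; the example texts are taken from the passage.

```python
import itertools
import random

def build_sample_pairs(keys, values, kv_map, seed=0):
    """Assemble positive pairs (true key-value relations) and negative pairs
    (all other key/value combinations), screened so that the two counts are
    equal (one possible predetermined equalization condition)."""
    positives = [(k, kv_map[k]) for k in keys]
    candidates = [(k, v) for k, v in itertools.product(keys, values)
                  if kv_map[k] != v]          # candidate negative sample pairs
    random.Random(seed).shuffle(candidates)   # screening by random selection
    negatives = candidates[:len(positives)]   # enforce equal numbers
    return positives, negatives

pos, neg = build_sample_pairs(
    ["name", "age", "gender"],
    ["Zhang San", "42 years old", "male"],
    {"name": "Zhang San", "age": "42 years old", "gender": "male"})
```

With the three keyword texts of the example, this yields three positive pairs and three of the six candidate negative pairs.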
According to an embodiment of the present disclosure, operations S541 to S544 may be performed by an electronic device. The electronic device may comprise a server or a terminal device. The server may be the server 105 in fig. 1. The terminal devices may be the first terminal device 101, the second terminal device 102, and the third terminal device 103 in fig. 1.
According to the embodiment of the disclosure, since a positive sample pair comprises two sample texts with a key-value relationship and a negative sample pair comprises two sample texts with a non-key-value relationship, training the first deep learning model with a plurality of positive sample pairs and a plurality of negative sample pairs enables the obtained text semantic relationship model to automatically classify semantic relationships. On this basis, processing the identification information with the text semantic relationship model to obtain the semantic relationship information improves the accuracy of the semantic relationship information.
According to an embodiment of the present disclosure, the plurality of negative sample pairs are determined from the plurality of candidate negative sample pairs based on a negative sample pruning strategy.
According to embodiments of the present disclosure, a negative-sample pruning strategy may be used to characterize conditions for determining negative-sample pairs from a plurality of candidate negative-sample pairs. For example, the negative sample pruning policy may include a positional relationship condition between candidate sample texts, in which case a negative sample pair may be determined from a plurality of candidate negative sample pairs according to a positional relationship between each of the plurality of candidate sample texts. Alternatively, the negative sample pruning policy may include a similarity condition between candidate sample texts, in which case negative sample pairs may be determined from a plurality of candidate negative sample pairs according to the degree of similarity between each of the plurality of candidate sample texts.
According to an embodiment of the present disclosure, the plurality of negative sample pairs are determined from the plurality of candidate negative sample pairs based on a negative sample pruning strategy, and may include the following operations.
The plurality of negative sample pairs are determined from the plurality of candidate negative sample pairs based on a positional relationship between each of the plurality of candidate sample texts.
According to embodiments of the present disclosure, the negative-sample pruning strategy may include at least one of a short text position relationship strategy and a long text position relationship strategy. The short text positional relationship policy may be used to characterize positional relationship conditions that need to be satisfied in the case where two sample texts with non-key-value relationships belong to a short text. The long text positional relationship policy may be used to characterize positional relationship conditions that need to be satisfied in the case where two sample texts with non-key-value relationships belong to a long text.
For example, the short text positional relationship policy may include that the sample text belonging to the keyword category is located to the left of the sample text belonging to the numeric category. In this case, a plurality of negative sample pairs satisfying this condition may be determined from the plurality of candidate negative sample pairs according to the short text positional relationship policy and the positional relationship between each of the plurality of candidate sample texts.
For example, the long text positional relationship policy may include that the sample text belonging to the keyword category is located above the sample text belonging to the numeric category. In this case, a plurality of negative sample pairs satisfying this condition may be determined from the plurality of candidate negative sample pairs according to the long text positional relationship policy and the positional relationship between each of the plurality of candidate sample texts.
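The positional pruning described above can be sketched as a simple filter over candidate negative pairs. The box representation (a top-left corner coordinate per sample text) and the function name are assumptions; the left-of rule for short text and the above rule for long text follow the two policies named in the text.

```python
def prune_negative_pairs(candidates, boxes, is_long_text):
    """Filter candidate negative pairs by the positional relationship
    conditions: for short text, the keyword-category text must lie to the
    left of the numeric-category text; for long text, it must lie above it.
    boxes maps each sample text to the (x, y) of its top-left corner
    (a hypothetical stand-in for the detected position information)."""
    kept = []
    for key_text, value_text in candidates:
        kx, ky = boxes[key_text]
        vx, vy = boxes[value_text]
        ok = ky < vy if is_long_text else kx < vx
        if ok:
            kept.append((key_text, value_text))
    return kept
```

A candidate pair that violates the active positional condition is simply dropped, so the surviving pairs form the plurality of negative sample pairs used for training.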
Fig. 6 schematically illustrates a flowchart of a method for text recognition of a text region image to obtain recognition information according to an embodiment of the present disclosure.
As shown in fig. 6, the method 600 is a further limitation of operation S220 in fig. 2, and the method 600 may include operations S621-S622.
In operation S621, position information is converted into target position information using affine transformation.
In operation S622, images corresponding to a plurality of text regions are extracted from the text image according to the target position information, resulting in text region images corresponding to the plurality of text regions.
According to embodiments of the present disclosure, an affine transformation is a linear transformation from two-dimensional coordinates to two-dimensional coordinates that preserves the "straightness" and "parallelism" of a two-dimensional figure. Straightness means that a straight line remains a straight line and an arc remains an arc after the transformation, without bending. Parallelism means that the relative positional relationship between two-dimensional figures remains unchanged: parallel lines remain parallel, and the included angle of intersecting straight lines remains unchanged. An affine transformation may be realized by at least one of translation, scaling, flipping, rotation, shearing, and the like.
According to an embodiment of the present disclosure, converting the position information corresponding to a text region into the target position information using an affine transformation may include: converting the text region in the form of a quadrilateral box into a text region in the form of a rectangular box using the affine transformation, and determining the position information corresponding to the text region in the form of the rectangular box as the target position information, so that the text region images corresponding to the plurality of text regions can be extracted from the text image according to the target position information.
For example, a text region may be a quadrilateral box characterized by {P1, P2, P3, P4}, where P1 is the top-left corner of the box, P2 the top-right corner, P3 the bottom-left corner, and P4 the bottom-right corner. The coordinates of P1, P2, P3, and P4 may be characterized as (x1, y1), (x2, y2), (x3, y3), and (x4, y4), respectively. Using an affine transformation, P1→P′1, P2→P′2, P3→P′3, P4→P′4, a rectangular box {P′1, P′2, P′3, P′4} is obtained, whose corner coordinates may be characterized as (x′1, y′1), (x′2, y′2), (x′3, y′3), and (x′4, y′4), respectively.
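As an illustrative sketch (not the disclosed implementation), the 2x3 affine matrix mapping a quadrilateral box toward a rectangular box can be estimated from three corner correspondences and then applied to all four corners; in practice a library routine such as OpenCV's `getAffineTransform`/`warpAffine` would typically be used:

```python
import numpy as np

def estimate_affine(src, dst):
    """Solve for the 2x3 affine matrix A such that A @ [x, y, 1]^T = [x', y']^T
    holds for three (src, dst) point correspondences."""
    rows, rhs = [], []
    for (x, y), (xp, yp) in zip(src, dst):
        rows.append([x, y, 1, 0, 0, 0]); rhs.append(xp)
        rows.append([0, 0, 0, x, y, 1]); rhs.append(yp)
    params = np.linalg.solve(np.array(rows, dtype=float),
                             np.array(rhs, dtype=float))
    return params.reshape(2, 3)

def apply_affine(A, points):
    """Apply the 2x3 affine matrix A to a list of (x, y) points."""
    pts = np.hstack([np.asarray(points, dtype=float),
                     np.ones((len(points), 1))])
    return pts @ A.T
```

For instance, estimating A from P1, P2, P3 mapped to the top-left, top-right, and bottom-left corners of the target rectangle, then applying A to P4, yields the rectified bottom-right corner.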
Operations S621-S622 may be performed by an electronic device according to an embodiment of the present disclosure. The electronic device may comprise a server or a terminal device. The server may be the server 105 in fig. 1. The terminal devices may be the first terminal device 101, the second terminal device 102, and the third terminal device 103 in fig. 1.
According to the embodiment of the present disclosure, the target position information is obtained by converting the position information by affine transformation, so that the target position information can be determined automatically, and the efficiency and accuracy of determining the target position information are improved. On this basis, the text region images are obtained by extracting images corresponding to the plurality of text regions from the text image according to the target position information, so that the text region images can be extracted automatically, and the efficiency and accuracy of obtaining the text region images are improved.
Fig. 7 schematically illustrates a flowchart of a method for generating structured information of a text image from category information, semantic relationship information, and identification information according to an embodiment of the present disclosure.
As shown in fig. 7, the method 700 is a further limitation of operation S250 of fig. 2, and the method 700 may include operations S751-S752.
In operation S751, a plurality of target text recognition information is determined from among the plurality of text recognition information based on the semantic relationship information.
In operation S752, the structured information of the text image is generated according to the category information, the key value relationship, and the target text recognition information.
According to an embodiment of the present disclosure, the category information may include one of a keyword category and a numeric category. The semantic relationship may include one of a key-value relationship and a non-key-value relationship.
According to embodiments of the present disclosure, the semantic relationship corresponding to the text recognition information may be a key-value relationship. The structured information may include a keyword category, target text identification information corresponding to the keyword category, a numeric category, and target text identification information corresponding to the numeric category.
According to embodiments of the present disclosure, the target text recognition information may be used to characterize two pieces of text recognition information that have a key-value relationship. The plurality of target text recognition information may be determined from the plurality of text recognition information according to the key-value relationships among the plurality of text recognition information. Alternatively, the target text recognition information may be used to characterize two pieces of text recognition information that have a non-key-value relationship. In that case, the plurality of target text recognition information may be determined from the plurality of text recognition information according to the non-key-value relationships among the plurality of text recognition information.
For example, the text recognition information may include "name", "Zhang San", "age", "42 years", "sex", and "male". Based on the key-value relationships among the plurality of text recognition information, the target text recognition information "name", "age", and "sex", whose category information is the keyword category, and the corresponding target text recognition information "Zhang San", "42 years", and "male", whose category information is the numeric category, may be determined.
According to the embodiment of the present disclosure, after obtaining a plurality of target text recognition information, the structured information of the text image may be generated according to the keyword category, the target text recognition information corresponding to the keyword category, the numerical category, and the target text recognition information corresponding to the numerical category.
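A minimal sketch of assembling the structured information from predicted key-value pairs; the dictionary-based layout and the labels "keyword"/"numeric" are assumptions for illustration, not the disclosed data format:

```python
def build_structured_info(recognitions, categories, kv_pairs):
    """recognitions: id -> recognized text; categories: id -> 'keyword' or
    'numeric'; kv_pairs: (key_id, value_id) pairs predicted to have a
    key-value relationship. Returns keyword -> value structured info."""
    structured = {}
    for key_id, value_id in kv_pairs:
        if categories[key_id] == "keyword" and categories[value_id] == "numeric":
            # strip a trailing colon from the recognized keyword, if present
            structured[recognitions[key_id].rstrip(":")] = recognitions[value_id]
    return structured
```

With the "name"/"age"/"sex" example above, this would yield entries such as "name: Zhang San".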
Operations S751-S752 may be performed by an electronic device according to an embodiment of the present disclosure. The electronic device may comprise a server or a terminal device. The server may be the server 105 in fig. 1. The terminal devices may be the first terminal device 101, the second terminal device 102, and the third terminal device 103 in fig. 1.
According to the embodiment of the present disclosure, since the plurality of target text recognition information is determined from the plurality of text recognition information according to the semantic relationship information, the target text recognition information can characterize two pieces of text recognition information having a key-value relationship. On this basis, the structured information of the text image is generated according to the category information, the key-value relationship, and the target text recognition information, so that the accuracy of generating the structured information of the text image is improved.
Fig. 8 schematically illustrates an example schematic diagram of an information generation process according to an embodiment of the present disclosure.
As shown in fig. 8, in the information generation process 800, the text detection model 802 may perform text detection on the text image 801 to obtain category information and position information of each of a plurality of text regions corresponding to the text image 801. The plurality of text regions may include a text region 803_1, a text region 803_2, a text region 803_3, a text region 803_4, a text region 803_5, a text region 803_6, a text region 803_7, and a text region 803_8.
For example, the category information may include a keyword category or a numeric category. The category information of the text area 803_1, the text area 803_3, the text area 803_5, and the text area 803_7 may be keyword categories. The category information of the text area 803_2, the text area 803_4, the text area 803_6, and the text area 803_8 is a numeric category.
After obtaining the category information and the position information of each of the plurality of text areas, a text area image corresponding to each of the plurality of text areas may be obtained based on the position information of each of the plurality of text areas and the text image 801. The text region image corresponding to each of the plurality of text regions may include a text region image 804_1 (i.e., a text region image including text content "name:"), a text region image 804_2 (i.e., a text region image including text content "Zhang San", etc.), a text region image 804_3 (i.e., a text region image including text content "sex:"), a text region image 804_4 (i.e., a text region image including text content "man"), a text region image 804_5 (i.e., a text region image including text content "age:"), a text region image 804_6 (i.e., a text region image including text content "42 years"), a text region image 804_7 (i.e., a text region image including text content "detection result:"), and a text region image 804_8 (i.e., a text region image including text content "XX").
For example, a text region image 804_1 corresponding to the text region 803_1 may be acquired from the position information of the text image 801 and the text region 803_1. Based on the positional information of the text image 801 and the text region 803_2, a text region image 804_2 corresponding to the text region 803_2 is acquired. By analogy, a text region image 804_8 corresponding to the text region 803_8 is acquired from the position information of the text image 801 and the text region 803_8.
After obtaining the text region images corresponding to the respective text regions, the text recognition model 805 may perform text recognition on the plurality of text region images to obtain text recognition information of the respective text region images. The text recognition information of each of the plurality of text region images may include text recognition information 806_1 (i.e., "name:"), text recognition information 806_2 (i.e., "Zhang San"), text recognition information 806_3 (i.e., "sex:"), text recognition information 806_4 (i.e., "man"), text recognition information 806_5 (i.e., "age:"), text recognition information 806_6 (i.e., "42 years"), text recognition information 806_7 (i.e., "detection result:"), and text recognition information 806_8 (i.e., "XX").
For example, text recognition of text region image 804_1 may be performed using text recognition model 805 to obtain text recognition information 806_1 for text region image 804_1. Text recognition is performed on the text region image 804_2 using the text recognition model 805, resulting in text recognition information 806_2 of the text region image 804_2. By analogy, text recognition is performed on the text region image 804_8 by using the text recognition model 805, and text recognition information 806_8 of the text region image 804_8 is obtained.
After obtaining the text recognition information of each of the plurality of text region images, the text classification model 807 may process the text recognition information of each of the plurality of text region images to obtain semantic relationship information.
For example, the text classification model 807 may be used to process the text recognition information of each of the plurality of text region images to obtain the semantic relationships between: the text region image 804_1 (i.e., the text region image including the text content "name:"), whose category information is the keyword category, and the text region image 804_2 (i.e., the text region image including the text content "Zhang San"), whose category information is the numeric category; the text region image 804_3 (i.e., "sex:", keyword category) and the text region image 804_4 (i.e., "man", numeric category); the text region image 804_5 (i.e., "age:", keyword category) and the text region image 804_6 (i.e., "42 years", numeric category); and the text region image 804_7 (i.e., "detection result:", keyword category) and the text region image 804_8 (i.e., "XX", numeric category).
After obtaining the semantic relationships between the plurality of text recognition information, the structured information of the text image may be generated based on the category information, the semantic relationship information, and the recognition information. The structured information of the text image may include structured information 808_1 (i.e., "name: Zhang San"), structured information 808_2 (i.e., "sex: man"), structured information 808_3 (i.e., "age: 42 years"), and structured information 808_4 (i.e., "detection result: XX").
Fig. 9 schematically shows a flowchart of an information processing method according to an embodiment of the present disclosure.
As shown in FIG. 9, the method includes operations S910-S920.
In operation S910, the text image to be processed is processed using the information generating method 200, and the structured information of the text image to be processed is acquired.
In operation S920, information processing is performed using the structured information of the text image to be processed.
According to an embodiment of the present disclosure, the structured information of the text image to be processed may be determined using the information generating method according to the embodiment of the present disclosure.
According to the embodiment of the present disclosure, the text image to be processed may be processed by using the information generation method to obtain the structured information of the text image to be processed, and information processing may then be performed by using the structured information of the text image to be processed.
Operations S910 to S920 may be performed by an electronic device according to an embodiment of the present disclosure. The electronic device may comprise a server or a terminal device. The server may be the server 105 in fig. 1. The terminal devices may be the first terminal device 101, the second terminal device 102, and the third terminal device 103 in fig. 1.
According to the embodiment of the present disclosure, the text image to be processed is processed by using the information generation method to obtain the structured information of the text image to be processed. Since the structured information is generated by using the semantic relationship information and the visual information, the accuracy of the structured information is improved. On this basis, information processing is performed by using the structured information of the text image to be processed, so that the accuracy of the information processing is improved.
The above is merely an exemplary embodiment, but the present disclosure is not limited thereto; other information processing methods known in the art may also be included, as long as the accuracy of information processing can be improved.
Fig. 10 schematically shows a block diagram of an information generating apparatus according to an embodiment of the present disclosure.
As shown in fig. 10, the information generating apparatus 1000 may include a text detection module 1010, a first acquisition module 1020, a text recognition module 1030, a determination module 1040, and a generation module 1050.
The text detection module 1010 is configured to perform text detection on the text image to obtain detection information, where the detection information includes category information and location information of each of the plurality of text regions.
The first obtaining module 1020 is configured to obtain a text region image corresponding to each of the plurality of text regions according to the location information and the text image.
The text recognition module 1030 is configured to perform text recognition on the text region images to obtain recognition information, where the recognition information includes text recognition information of each of the plurality of text region images.
A determining module 1040, configured to determine semantic relationship information according to the identification information, where the semantic relationship information includes semantic relationships between a plurality of text identification information.
The generating module 1050 is configured to generate the structured information of the text image according to the category information, the semantic relationship information, and the identification information.
According to an embodiment of the present disclosure, the text detection module 1010 may include a feature extraction sub-module, a first acquisition sub-module, a second acquisition sub-module, a third acquisition sub-module, and a fourth acquisition sub-module.
And the feature extraction sub-module is used for carrying out feature extraction on the text image to obtain a first feature map with at least one scale.
The first acquisition sub-module is used for acquiring a second characteristic diagram according to the first characteristic diagram of at least one scale.
And the second acquisition sub-module is used for acquiring a third characteristic diagram according to the first characteristic diagram of at least one scale.
And the third acquisition sub-module is used for acquiring the respective category information of the text areas according to the second feature map.
And the fourth acquisition sub-module is used for acquiring the position information of each of the text areas according to the third feature map.
According to an embodiment of the present disclosure, the feature extraction sub-module may include a first feature extraction unit and a first obtaining unit.
And the first feature extraction unit is used for carrying out feature extraction of M stages on the text image to obtain at least one first feature map corresponding to the M-th stage.
The first obtaining unit is used for obtaining a first characteristic diagram of at least one scale according to at least one first characteristic diagram corresponding to the M-th stage.
According to an embodiment of the disclosure, the mth stage has Tm parallel levels, the image resolutions of the first feature maps of the same parallel level are the same, and the image resolutions of the first feature maps of different parallel levels are different.
According to an embodiment of the present disclosure, M is an integer greater than or equal to 1 and less than or equal to M, and Tm is an integer greater than or equal to 1.
According to an embodiment of the present disclosure, in a case where M is an integer greater than 1, the first feature extraction unit may include a convolution processing subunit and a feature fusion subunit.
And the convolution processing subunit is used for performing convolution processing on at least one first feature map corresponding to the (m-1)-th stage to obtain at least one intermediate first feature map corresponding to the m-th stage.
And the feature fusion subunit is used for carrying out feature fusion on at least one intermediate first feature map corresponding to the mth stage to obtain at least one first feature map corresponding to the mth stage.
According to an embodiment of the present disclosure, M is an integer greater than 1 and less than or equal to M.
According to an embodiment of the present disclosure, the feature fusion subunit may be configured to:
for the i-th parallel level of the Tm parallel levels,
obtain a first feature map corresponding to the i-th parallel level according to the other intermediate first feature maps corresponding to the i-th parallel level and the intermediate first feature map corresponding to the i-th parallel level.
According to an embodiment of the present disclosure, the other intermediate first feature maps corresponding to the i-th parallel level are intermediate first feature maps corresponding to at least part of the Tm parallel levels other than the i-th parallel level, i being an integer greater than or equal to 1 and less than or equal to Tm.
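A hedged sketch of this kind of parallel-level fusion, assuming a simple nearest-neighbour resampling and element-wise summation as the fusion operator (the disclosure does not fix the fusion operator, so this is illustrative only):

```python
import numpy as np

def fuse_parallel_levels(feature_maps):
    """Fuse intermediate first feature maps across parallel levels: each
    level's map is summed with every other level's map after resampling
    the other map to this level's resolution (nearest neighbour)."""
    maps = [np.asarray(m, dtype=float) for m in feature_maps]
    fused = []
    for i, target in enumerate(maps):
        h, w = target.shape
        acc = target.copy()
        for j, other in enumerate(maps):
            if j == i:
                continue
            rows = np.arange(h) * other.shape[0] // h  # nearest-neighbour row index
            cols = np.arange(w) * other.shape[1] // w  # nearest-neighbour col index
            acc += other[np.ix_(rows, cols)]
        fused.append(acc)
    return fused
```

Each output map keeps its own parallel level's resolution, consistent with the same-resolution constraint stated above.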
According to an embodiment of the present disclosure, the feature extraction sub-module may include a second feature extraction unit.
And the second feature extraction unit is used for carrying out feature extraction of N cascade levels on the text image to obtain a first feature map with at least one scale, wherein N is an integer greater than 1.
According to an embodiment of the present disclosure, the determination module 1040 may include a first determination sub-module.
And the first determining submodule is used for determining semantic relation information according to the auxiliary information and the identification information, wherein the auxiliary information comprises at least one of a second characteristic diagram and position information.
According to an embodiment of the present disclosure, in the case where the auxiliary information includes the second feature map, the first determination sub-module may include a fusion unit and a first determination unit.
And the fusion unit is used for fusing the second characteristic diagram and the fourth characteristic diagram corresponding to the identification information to obtain a fusion characteristic diagram.
And the first determining unit is used for determining semantic relation information according to the fusion feature map.
According to an embodiment of the present disclosure, in case the auxiliary information further includes location information, the first determining unit may include a first determining subunit.
And the first determining subunit is used for determining semantic relation information according to the fusion feature map and the position information.
According to an embodiment of the present disclosure, the determination module 1040 may include a global feature extraction sub-module and a second determination sub-module.
And the global feature extraction sub-module is used for carrying out global feature extraction on the identification information to obtain global feature information.
And the second determining submodule is used for determining semantic relation information according to the global characteristic information.
According to an embodiment of the present disclosure, the global feature extraction sub-module may include a processing unit.
And the processing unit is used for processing the identification information based on the attention strategy to obtain global characteristic information.
According to an embodiment of the present disclosure, the processing unit may comprise a processing subunit.
And the processing subunit is used for carrying out U-level processing on the identification information based on the self-attention strategy to obtain global characteristic information, wherein U is an integer greater than or equal to 1.
According to an embodiment of the present disclosure, in a case where U is an integer greater than 1 and 1 < u ≤ U, the processing subunit may include:
Obtaining second intermediate feature information of the u-th level corresponding to each of the plurality of text recognition information according to first intermediate feature information of the (u-1)-th level corresponding to each of the plurality of text recognition information, wherein the first intermediate feature information is used for determining a first query matrix, a first key matrix, and a first value matrix;
Obtaining the first intermediate feature information corresponding to the plurality of text recognition information of the u-th level according to the second intermediate feature information corresponding to the plurality of text recognition information of the u-th level and the first intermediate feature information corresponding to the plurality of text recognition information of the u-1-th level respectively, and
And obtaining global feature information according to the first intermediate feature information of the R-th level corresponding to each of the plurality of text recognition information.
According to an embodiment of the present disclosure, U is an integer greater than or equal to 1 and less than or equal to U, and R is an integer greater than or equal to 1 and less than or equal to U.
According to an embodiment of the present disclosure, obtaining second intermediate feature information of the u-th level corresponding to each of the plurality of text recognition information according to first intermediate feature information of the u-1-th level corresponding to each of the plurality of text recognition information may include:
Determining a plurality of first matrix sets corresponding to the text recognition information of the u-1 level according to the first intermediate feature information corresponding to the text recognition information of the u-1 level, wherein the first matrix sets comprise a first query matrix, a first key matrix and a first value matrix, and
For text recognition information of the plurality of text recognition information of the u-th hierarchy,
For a first set of matrices of the plurality of first sets of matrices corresponding to the text recognition information,
Obtaining a first attention matrix corresponding to the text recognition information of the u-th level according to the first query matrix corresponding to the text recognition information of the u-th level and the first key matrix corresponding to each of the text recognition information of the u-th level;
Obtaining third intermediate feature information corresponding to the text identification information of the u-th level according to the first attention matrix corresponding to the text identification information of the u-th level and the first value matrix corresponding to the text identification information of the u-th level;
and obtaining the second intermediate characteristic information corresponding to the text identification information of the u-th level according to the plurality of third intermediate characteristic information corresponding to the text identification information of the u-th level.
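A minimal numerical sketch of one such level, assuming identity Q/K/V projections and a residual connection (the learned projection matrices and multi-head details are omitted for illustration; this is not the disclosed implementation):

```python
import numpy as np

def attention_level(prev_features):
    """One level of the self-attention processing sketched above: Q, K, V
    are (illustratively) identity projections of the previous level's
    first intermediate feature information; the attention output (the
    "second intermediate feature information") is added back as a
    residual to give the next level's first intermediate features."""
    X = np.asarray(prev_features, dtype=float)      # (num_texts, dim)
    Q, K, V = X, X, X                               # placeholder projections
    scores = Q @ K.T / np.sqrt(X.shape[1])          # scaled dot-product scores
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)   # softmax -> attention matrix
    second = weights @ V                            # second intermediate features
    return second + X                               # residual connection
```

Stacking U such levels and reading off the R-th level's output corresponds to the global feature information described above.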
According to an embodiment of the present disclosure, in the case of u=1, the processing subunit may include:
obtaining second intermediate feature information corresponding to the text recognition information of the level 2 according to the global feature information corresponding to the text recognition information of the level 1, wherein the global feature information is used for determining a second query matrix, a second key matrix and a second value matrix, and
And obtaining the first intermediate characteristic information corresponding to the text recognition information of the 2 nd level according to the second intermediate characteristic information corresponding to the text recognition information of the 2 nd level and the global characteristic information corresponding to the text recognition information of the 1 st level.
According to an embodiment of the present disclosure, obtaining second intermediate feature information of level 2 corresponding to each of the plurality of text recognition information according to global feature information of level 1 corresponding to each of the plurality of text recognition information may include:
determining a plurality of second matrix sets corresponding to the text recognition information according to the global feature information corresponding to the text recognition information of the 1 st level, wherein the second matrix sets comprise a second query matrix, a second key matrix and a second value matrix, and
For text recognition information of the plurality of text recognition information of the level 2,
For a second set of the plurality of second sets of matrices corresponding to the text recognition information,
Obtaining a second attention matrix corresponding to the text recognition information of the 2 nd level according to a second query matrix corresponding to the text recognition information of the 2 nd level and a second key matrix corresponding to a plurality of text recognition information of the 2 nd level;
obtaining third intermediate feature information corresponding to the text identification information of the 2 nd level according to the second attention matrix corresponding to the text identification information of the 2 nd level and the second value matrix corresponding to the text identification information of the 2 nd level;
and obtaining the second intermediate characteristic information corresponding to the text identification information of the 2nd level according to the third intermediate characteristic information corresponding to the text identification information of the 2nd level.
According to an embodiment of the present disclosure, the semantic relationship includes one of a key-value relationship and a non-key-value relationship.
According to an embodiment of the present disclosure, the determination module 1040 may include a first processing sub-module.
The first processing sub-module is used for processing the identification information by using a text semantic relation model to obtain semantic relation information, wherein the text semantic relation model is obtained by training a first deep learning model by using a plurality of positive sample pairs and a plurality of negative sample pairs, the number of the positive sample pairs and the negative sample pairs meet a preset balance condition, a key value relation is arranged between two sample texts included in the positive sample pairs, and a non-key value relation is arranged between two sample texts included in the negative sample pairs.
According to an embodiment of the present disclosure, the plurality of negative sample pairs are determined from the plurality of candidate negative sample pairs based on a negative sample pruning strategy.
According to an embodiment of the present disclosure, the plurality of negative sample pairs are determined from the plurality of candidate negative sample pairs based on a negative sample pruning strategy, and may include:
The plurality of negative sample pairs are determined from the plurality of candidate negative sample pairs based on a positional relationship between each of the plurality of candidate sample texts.
According to embodiments of the present disclosure, the text recognition module 1030 may include a conversion sub-module and an extraction sub-module.
And a conversion sub-module for converting the position information into target position information by affine transformation.
And the extraction sub-module is used for extracting images corresponding to the text areas from the text image according to the target position information to obtain text area images corresponding to the text areas.
According to an embodiment of the present disclosure, the category information includes one of a keyword category and a numeric category, and the semantic relationship includes one of a key-value relationship and a non-key-value relationship.
According to embodiments of the present disclosure, the generation module 1050 may include a third determination sub-module and a generation sub-module.
And the third determining sub-module is used for determining a plurality of target text recognition information from the plurality of text recognition information according to the semantic relation information, wherein the semantic relation corresponding to the target text recognition information is the key-value relation.
And the generation sub-module is used for generating structured information of the text image according to the category information, the key-value relation, and the target text recognition information, wherein the structured information includes the keyword category, target text recognition information corresponding to the keyword category, the value category, and target text recognition information corresponding to the value category.
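As an illustration only (the `key`/`value` labels, the data shapes, and the function name are assumptions, not the patent's implementation), assembling structured information from category labels and key-value relations might look like:

```python
def build_structured_info(texts, categories, key_value_pairs):
    """texts: region id -> recognized string; categories: region id -> 'key' or 'value';
    key_value_pairs: (key_id, value_id) pairs whose semantic relation is key-value."""
    return {texts[k]: texts[v] for k, v in key_value_pairs
            if categories.get(k) == "key" and categories.get(v) == "value"}
```

Pairs whose category labels disagree with the claimed relation (e.g. two keyword regions) are dropped, so the output contains only keyword-to-value entries.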
According to an embodiment of the present disclosure, the text image comprises a medical text image.
Fig. 11 schematically shows a block diagram of an information processing apparatus according to an embodiment of the present disclosure.
As shown in fig. 11, the information processing apparatus 1100 may include a second acquisition module 1110 and an information processing module 1120.
The second obtaining module 1110 is configured to process a text image to be processed with the information generating apparatus 1000 to obtain structured information of the text image to be processed.
The information processing module 1120 is used for performing information processing by using the structured information of the text image to be processed.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium and a computer program product.
According to an embodiment of the present disclosure, an electronic device includes at least one processor and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a method as described in the present disclosure.
According to an embodiment of the present disclosure, a non-transitory computer-readable storage medium stores computer instructions for causing a computer to perform a method as described in the present disclosure.
According to an embodiment of the present disclosure, a computer program product includes a computer program which, when executed by a processor, implements a method as described in the present disclosure.
Fig. 12 schematically shows a block diagram of an electronic device adapted to implement the information generation method and the information processing method according to an embodiment of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 12, the device 1200 includes a computing unit 1201, which may perform various appropriate actions and processes according to a computer program stored in a Read-Only Memory (ROM) 1202 or a computer program loaded from the storage unit 1208 into a Random Access Memory (RAM) 1203. In the RAM 1203, various programs and data required for the operation of the device 1200 may also be stored. The computing unit 1201, the ROM 1202, and the RAM 1203 are connected to each other via a bus 1204. An input/output (I/O) interface 1205 is also connected to the bus 1204.
Various components in the device 1200 are connected to the I/O interface 1205, including an input unit 1206, such as a keyboard, mouse, etc., an output unit 1207, such as various types of displays, speakers, etc., a storage unit 1208, such as a magnetic disk, optical disk, etc., and a communication unit 1209, such as a network card, modem, wireless communication transceiver, etc. The communication unit 1209 allows the device 1200 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunications networks.
The computing unit 1201 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 1201 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, Digital Signal Processors (DSPs), and any suitable processor, controller, microcontroller, etc. The computing unit 1201 performs the respective methods and processes described above, such as the information generation method and the information processing method. For example, in some embodiments, the information generation method and the information processing method may be implemented as computer software programs tangibly embodied on a machine-readable medium, such as the storage unit 1208. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 1200 via the ROM 1202 and/or the communication unit 1209. When a computer program is loaded into the RAM 1203 and executed by the computing unit 1201, one or more steps of the information generation method and the information processing method described above may be performed. Alternatively, in other embodiments, the computing unit 1201 may be configured to perform the information generation method and the information processing method by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuit systems, Field-Programmable Gate Arrays (FPGAs), Application-Specific Integrated Circuits (ASICs), Application-Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may be implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor, operable to receive data and instructions from, and to transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user and a keyboard and a pointing device (e.g., a mouse or a trackball) by which the user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user, for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback), and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a background component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such background, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a Local Area Network (LAN), a Wide Area Network (WAN), and the Internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server incorporating a blockchain.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps recited in the present disclosure may be performed in parallel or sequentially or in a different order, provided that the desired results of the technical solutions of the present disclosure are achieved, and are not limited herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (23)

Translated fromChinese
1.一种信息生成方法,包括:1. A method for generating information, comprising:对文本图像进行文本检测,得到检测信息,其中,所述检测信息包括多个文本区域各自的类别信息和位置信息;Performing text detection on the text image to obtain detection information, wherein the detection information includes category information and position information of each of the plurality of text regions;根据所述位置信息和所述文本图像,获取与所述多个文本区域各自对应的文本区域图像;acquiring text region images corresponding to respective ones of the plurality of text regions according to the position information and the text image;对所述文本区域图像进行文本识别,得到识别信息,其中,所述识别信息包括多个所述文本区域图像各自的文本识别信息;Performing text recognition on the text region image to obtain recognition information, wherein the recognition information includes text recognition information of each of the plurality of text region images;基于自注意力策略对所述识别信息进行U层级处理,得到全局特征信息,U是大于或等于1的整数;Performing U-level processing on the recognition information based on a self-attention strategy to obtain global feature information, where U is an integer greater than or equal to 1;根据所述全局特征信息,确定语义关系信息,其中,所述语义关系信息包括多个所述文本识别信息之间的语义关系;以及Determining semantic relationship information based on the global feature information, wherein the semantic relationship information includes semantic relationships between the plurality of text recognition information; and根据所述类别信息、所述语义关系信息和所述识别信息,生成所述文本图像的结构化信息;generating structured information of the text image according to the category information, the semantic relationship information, and the identification information;所述得到全局特征信息包括:在1<u≤U的情况下,The obtaining of global feature information includes: in the case of 1<u≤U,根据第u-1层级的与多个文本识别信息各自对应的第一中间特征信息,确定第u层级的与多个文本识别信息各自对应的多个第一矩阵集,其中,所述第一矩阵集包括第一查询矩阵、第一键矩阵和第一值矩阵;Determining, based on the first intermediate feature information corresponding to each of the plurality of text recognition information at the u-1th level, a plurality of first matrix sets corresponding to each of the plurality of text recognition information at the u-th level, wherein the first matrix sets include a first 
query matrix, a first key matrix, and a first value matrix;针对第u层级的文本识别信息,根据与文本识别信息对应的第一查询矩阵和与多个文本识别信息各自对应的第一键矩阵,得到第一注意力矩阵,根据第一注意力矩阵和与文本识别信息对应的第一值矩阵,得到与文本识别信息对应的第三中间特征信息,根据多个第三中间特征信息,得到与文本识别信息对应的第二中间特征信息;For the text recognition information at the u-th level, obtain a first attention matrix based on a first query matrix corresponding to the text recognition information and first key matrices corresponding to each of the plurality of text recognition information; obtain third intermediate feature information corresponding to the text recognition information based on the first attention matrix and a first value matrix corresponding to the text recognition information; and obtain second intermediate feature information corresponding to the text recognition information based on the plurality of third intermediate feature information;根据第u层级的与多个文本识别信息各自对应的第二中间特征信息和第u-1层级的与多个文本识别信息各自对应的第一中间特征信息,得到第u层级的与多个文本识别信息对应的第一中间特征信息;以及Obtaining first intermediate feature information corresponding to the plurality of text recognition information at the u-th level based on the second intermediate feature information corresponding to each of the plurality of text recognition information at the u-th level and the first intermediate feature information corresponding to each of the plurality of text recognition information at the u-1-th level; and根据第R层级的与多个文本识别信息各自对应的第一中间特征信息,得到所述全局特征信息;Obtaining the global feature information according to the first intermediate feature information corresponding to each of the plurality of text recognition information at the R-th level;其中,u是大于或等于1且小于或等于U的整数,R是大于或等于1且小于或等于U的整数。Here, u is an integer greater than or equal to 1 and less than or equal to U, and R is an integer greater than or equal to 1 and less than or equal to U.2.根据权利要求1所述的方法,其中,所述对文本图像进行文本检测,得到检测信息,包括:2. 
The method according to claim 1, wherein the performing text detection on the text image to obtain detection information comprises:对所述文本图像进行特征提取,得到至少一个尺度的第一特征图;Performing feature extraction on the text image to obtain a first feature map of at least one scale;根据所述至少一个尺度的第一特征图,获取第二特征图;Acquire a second feature map according to the first feature map of the at least one scale;根据所述至少一个尺度的第一特征图,获取第三特征图;Acquire a third feature map according to the first feature map of the at least one scale;根据所述第二特征图,获取所述多个文本区域各自的类别信息;以及acquiring category information of each of the plurality of text regions according to the second feature map; and根据所述第三特征图,获取所述多个文本区域各自的位置信息。According to the third feature map, position information of each of the multiple text regions is obtained.3.根据权利要求2所述的方法,其中,所述对所述文本图像进行特征提取,得到至少一个尺度的第一特征图,包括:3. The method according to claim 2, wherein the step of extracting features from the text image to obtain a first feature map at at least one scale comprises:对所述文本图像进行M个阶段的特征提取,得到与第M阶段对应的至少一个第一特征图;以及Performing M stages of feature extraction on the text image to obtain at least one first feature map corresponding to the M-th stage; and根据与所述第M阶段对应的至少一个第一特征图,得到所述至少一个尺度的第一特征图;Obtaining the at least one first feature map at the at least one scale according to the at least one first feature map corresponding to the M-th stage;其中,第m阶段具有Tm个并联层级,同一并联层级的第一特征图的图像分辨率相同,不同并联层级的第一特征图的图像分辨率不同;The mth stage has Tm parallel levels, the image resolution of the first feature maps of the same parallel level is the same, and the image resolution of the first feature maps of different parallel levels is different;其中,M是大于1或等于1的整数,m是大于或等于1且小于或等于M的整数,Tm是大于或等于1的整数。Wherein, M is an integer greater than or equal to 1, m is an integer greater than or equal to 1 and less than or equal to M, andTm is an integer greater than or equal to 1.4.根据权利要求3所述的方法,其中,在M是大于1的整数的情况下,所述对所述文本图像进行M个阶段的特征提取,得到与第M阶段对应的至少一个第一特征图,包括:4. 
The method according to claim 3, wherein, when M is an integer greater than 1, performing M stages of feature extraction on the text image to obtain at least one first feature map corresponding to the Mth stage comprises:对与第m-1阶段对应的至少一个第一特征图进行卷积处理,得到与第m阶段对应的至少一个中间第一特征图;以及Performing convolution processing on at least one first feature map corresponding to the (m-1)th stage to obtain at least one intermediate first feature map corresponding to the (m)th stage; and对与所述第m阶段对应的至少一个中间第一特征图进行特征融合,得到与第m阶段对应的至少一个第一特征图;Performing feature fusion on at least one intermediate first feature map corresponding to the m-th stage to obtain at least one first feature map corresponding to the m-th stage;其中,m是大于1且小于或等于M的整数。Here, m is an integer greater than 1 and less than or equal to M.5.根据权利要求4所述的方法,其中,所述对与所述第m阶段对应的至少一个中间第一特征图进行特征融合,得到与第m阶段对应的至少一个第一特征图,包括:5. The method according to claim 4, wherein the performing feature fusion on the at least one intermediate first feature map corresponding to the m-th stage to obtain the at least one first feature map corresponding to the m-th stage comprises:针对所述Tm个并联层级中的第i个并联层级,For the i-th parallel level among the Tm parallel levels,根据与所述第i个并联层级对应的其他中间第一特征图和与所述第i个并联层级对应的中间第一特征图,得到与所述第i个并联层级对应的第一特征图;Obtaining a first feature map corresponding to the i-th parallel level according to other intermediate first feature maps corresponding to the i-th parallel level and the intermediate first feature map corresponding to the i-th parallel level;其中,与所述第i个并联层级对应的其他中间第一特征图是与所述Tm个并联层级中除所述第i个并联层级以外的至少部分并联层级对应的中间第一特征图,i是大于或等于1且小于或等于Tm的整数。Among them, the other intermediate first characteristic graphs corresponding to the i-th parallel level are intermediate first characteristic graphs corresponding to at least some of the Tm parallel levels except the i-th parallel level, and i is an integer greater than or equal to 1 and less than or equal to Tm .6.根据权利要求2所述的方法,其中,所述对所述文本图像进行特征提取,得到至少一个尺度的第一特征图,包括:6. 
The method according to claim 2, wherein the step of extracting features from the text image to obtain a first feature map at at least one scale comprises:对所述文本图像进行N个级联层级的特征提取,得到所述至少一个尺度的第一特征图,其中,N是大于1的整数。Performing N cascade-level feature extraction on the text image to obtain a first feature map of the at least one scale, where N is an integer greater than 1.7.根据权利要求2~6中任一项所述的方法,其中,所述确定语义关系信息,包括:7. The method according to any one of claims 2 to 6, wherein determining the semantic relationship information comprises:根据辅助信息和所述识别信息,确定所述语义关系信息,其中,所述辅助信息包括以下至少之一:所述第二特征图和所述位置信息。The semantic relationship information is determined based on auxiliary information and the identification information, wherein the auxiliary information includes at least one of the following: the second feature map and the position information.8.根据权利要求7所述的方法,其中,在所述辅助信息包括所述第二特征图的情况下,所述根据辅助信息和所述识别信息,确定所述语义关系信息,包括:8. The method according to claim 7, wherein, when the auxiliary information includes the second feature map, determining the semantic relationship information based on the auxiliary information and the identification information comprises:将所述第二特征图和与所述识别信息对应的第四特征图进行融合,得到融合特征图;以及Fusing the second feature map with a fourth feature map corresponding to the identification information to obtain a fused feature map; and根据所述融合特征图,确定所述语义关系信息。The semantic relationship information is determined according to the fused feature map.9.根据权利要求8所述的方法,其中,在所述辅助信息还包括所述位置信息的情况下,所述根据所述融合特征图,确定所述语义关系信息,包括:9. The method according to claim 8, wherein, when the auxiliary information further includes the position information, determining the semantic relationship information based on the fused feature map comprises:根据所述融合特征图和所述位置信息,确定所述语义关系信息。The semantic relationship information is determined according to the fused feature map and the position information.10.根据权利要求1所述的方法,还包括:10. 
The method according to claim 1, further comprising:在u=1的情况下,In the case of u=1,根据第1层级的与多个所述文本识别信息各自对应的全局特征信息,得到第2层级的与多个所述文本识别信息各自对应的第二中间特征信息,其中,所述全局特征信息用于确定第二查询矩阵、第二键矩阵和第二值矩阵;以及Obtaining second intermediate feature information corresponding to each of the plurality of text recognition information at a second level based on the first-level global feature information corresponding to each of the plurality of text recognition information, wherein the global feature information is used to determine a second query matrix, a second key matrix, and a second value matrix; and根据所述第2层级的与多个所述文本识别信息各自对应的第二中间特征信息和所述第1层级的与多个所述文本识别信息各自对应的全局特征信息,得到第2层级的与多个所述文本识别信息各自对应的第一中间特征信息。The first intermediate feature information of the second level corresponding to each of the plurality of text recognition information is obtained based on the second intermediate feature information of the second level corresponding to each of the plurality of text recognition information and the global feature information of the first level corresponding to each of the plurality of text recognition information.11.根据权利要求10所述的方法,其中,所述根据第1层级的与多个所述文本识别信息各自对应的全局特征信息,得到第2层级的与多个所述文本识别信息各自对应的第二中间特征信息,包括:11. 
The method according to claim 10, wherein obtaining second intermediate feature information corresponding to each of the plurality of text recognition information at a second level based on the first-level global feature information corresponding to each of the plurality of text recognition information comprises:根据所述第1层级的与多个所述文本识别信息各自对应全局特征信息,确定所述第2层级的与多个所述文本识别信息各自对应的多个第二矩阵集,其中,所述第二矩阵集包括所述第二查询矩阵、所述第二键矩阵和所述第二值矩阵;以及Determining, based on the global feature information of the first level corresponding to each of the plurality of text recognition information, a plurality of second matrix sets of the second level corresponding to each of the plurality of text recognition information, wherein the second matrix sets include the second query matrix, the second key matrix, and the second value matrix; and针对所述第2层级的多个所述文本识别信息中的文本识别信息,For the text identification information in the plurality of text identification information in the second level,针对与所述文本识别信息对应的多个第二矩阵集中的第二矩阵集,For a second matrix set among a plurality of second matrix sets corresponding to the text recognition information,根据所述第2层级的与所述文本识别信息对应的第二查询矩阵和所述第2层级的与多个所述文本识别信息对应的第二键矩阵,得到所述第2层级的与所述文本识别信息对应的第二注意力矩阵;Obtaining a second attention matrix corresponding to the text recognition information at the second level according to the second query matrix corresponding to the text recognition information at the second level and the second key matrix corresponding to the plurality of text recognition information at the second level;根据所述第2层级的与所述文本识别信息对应的第二注意力矩阵和所述第2层级的与所述文本识别信息对应的第二值矩阵,得到所述第2层级的与所述文本识别信息对应的第三中间特征信息;Obtaining third intermediate feature information corresponding to the text recognition information at the second level according to the second attention matrix corresponding to the text recognition information at the second level and the second value matrix corresponding to the text recognition information at the second level;根据所述第2层级的与所述文本识别信息对应的多个第三中间特征信息,得到所述第2层级的与所述文本识别信息对应的第二中间特征信息。The second intermediate 
feature information corresponding to the text recognition information at the second level is obtained according to the plurality of third intermediate feature information corresponding to the text recognition information at the second level.12.根据权利要求1~6中任一项所述的方法,其中,所述语义关系包括键值关系和非键值关系中的之一;12. The method according to any one of claims 1 to 6, wherein the semantic relationship comprises one of a key-value relationship and a non-key-value relationship;其中,所述确定语义关系信息,包括:The determining of semantic relationship information includes:利用文本语义关系模型处理所述识别信息,得到所述语义关系信息,其中,所述文本语义关系模型是利用多个正样本对和多个负样本对训练第一深度学习模型得到的,所述正样本对和所述负样本对的数目满足预定均衡条件,所述正样本对包括的两个样本文本之间具有键值关系,所述负样本对包括的两个样本文本之间具有非键值关系。The recognition information is processed using a text semantic relationship model to obtain the semantic relationship information, wherein the text semantic relationship model is obtained by training a first deep learning model using multiple positive sample pairs and multiple negative sample pairs, the number of the positive sample pairs and the number of the negative sample pairs meet a predetermined balance condition, the two sample texts included in the positive sample pair have a key-value relationship, and the two sample texts included in the negative sample pair have a non-key-value relationship.13.根据权利要求12所述的方法,其中,所述多个负样本对是基于负样本剪枝策略,从多个候选负样本对中确定的。13. The method according to claim 12, wherein the multiple negative sample pairs are determined from multiple candidate negative sample pairs based on a negative sample pruning strategy.14.根据权利要求13所述的方法,其中,所述多个负样本对是基于负样本剪枝策略,从多个候选负样本对中确定的,包括:14. 
The method according to claim 13, wherein the multiple negative sample pairs are determined from multiple candidate negative sample pairs based on a negative sample pruning strategy, comprising:所述多个负样本对是根据多个候选样本文本各自之间的位置关系,从所述多个候选负样本对中确定的。The multiple negative sample pairs are determined from the multiple candidate negative sample pairs according to the positional relationship between the multiple candidate sample texts.15.根据权利要求1所述的方法,其中,所述根据所述位置信息和所述文本图像,获取与所述多个文本区域各自对应的文本区域图像,包括:15. The method according to claim 1, wherein acquiring text region images corresponding to each of the plurality of text regions based on the position information and the text image comprises:利用仿射变换将所述位置信息转换为目标位置信息;以及Converting the position information into target position information using affine transformation; and根据所述目标位置信息,从所述文本图像中提取与所述多个文本区域对应的图像,得到与所述多个文本区域对应的文本区域图像。Images corresponding to the multiple text regions are extracted from the text image according to the target position information to obtain text region images corresponding to the multiple text regions.16.根据权利要求1所述的方法,其中,所述类别信息包括关键字类别和数值类别中的之一,所述语义关系包括键值关系和非键值关系中的之一;16. 
The method according to claim 1, wherein the category information comprises one of a keyword category and a value category, and the semantic relationship comprises one of a key-value relationship and a non-key-value relationship;其中,所述根据所述类别信息、所述语义关系信息和所述识别信息,生成所述文本图像的结构化信息,包括:The step of generating the structured information of the text image according to the category information, the semantic relationship information, and the identification information includes:根据所述语义关系信息,从多个所述文本识别信息中确定多个目标文本识别信息,其中,与所述文本识别信息对应的语义关系为所述键值关系;以及Determining a plurality of target text recognition information from the plurality of text recognition information according to the semantic relationship information, wherein the semantic relationship corresponding to the text recognition information is the key-value relationship; and根据所述类别信息、所述键值关系和所述目标文本识别信息,生成所述文本图像的结构化信息,其中,所述结构化信息包括所述关键字类别、与所述关键字类别对应的目标文本识别信息、所述数值类别和与所述数值类别对应的目标文本识别信息。Based on the category information, the key-value relationship and the target text recognition information, structured information of the text image is generated, wherein the structured information includes the keyword category, the target text recognition information corresponding to the keyword category, the numerical category and the target text recognition information corresponding to the numerical category.17.根据权利要求1所述的方法,其中,所述文本图像包括医疗文本图像。The method of claim 1 , wherein the text image comprises a medical text image.18.一种信息处理方法,包括:18. An information processing method, comprising:利用根据权利要求1~17中任一项所述的方法处理待处理文本图像,获取所述待处理文本图像的结构化信息;以及Processing the text image to be processed using the method according to any one of claims 1 to 17 to obtain structural information of the text image to be processed; and利用所述待处理文本图像的结构化信息进行信息处理。Information processing is performed using the structured information of the text image to be processed.19.一种信息生成装置,包括:19. 
An information generating device, comprising:文本检测模块,用于对文本图像进行文本检测,得到检测信息,其中,所述检测信息包括多个文本区域各自的类别信息和位置信息;a text detection module, configured to perform text detection on a text image to obtain detection information, wherein the detection information includes category information and position information of each of a plurality of text regions;第一获取模块,用于根据所述位置信息和所述文本图像,获取与所述多个文本区域各自对应的文本区域图像;A first acquisition module is configured to acquire a text region image corresponding to each of the plurality of text regions according to the position information and the text image;文本识别模块,用于对所述文本区域图像进行文本识别,得到识别信息,其中,所述识别信息包括多个所述文本区域图像各自的文本识别信息;a text recognition module, configured to perform text recognition on the text region image to obtain recognition information, wherein the recognition information includes text recognition information of each of the plurality of text region images;全局特征信息确定模块,用于基于自注意力策略对所述识别信息进行U层级处理,得到全局特征信息,U是大于或等于1的整数;a global feature information determination module, configured to perform U-level processing on the recognition information based on a self-attention strategy to obtain global feature information, where U is an integer greater than or equal to 1;语义关系确定模块,用于根据所述全局特征信息,确定语义关系信息;以及a semantic relationship determination module, configured to determine semantic relationship information based on the global feature information; and生成模块,用于根据所述类别信息、所述语义关系信息和所述识别信息,生成所述文本图像的结构化信息;a generating module, configured to generate structured information of the text image based on the category information, the semantic relationship information, and the identification 
information; the global feature information determination module is configured to: in the case where 1 < u ≤ U, determine, based on the first intermediate feature information corresponding to each of the plurality of pieces of text recognition information at level u-1, a plurality of first matrix sets corresponding to each of the plurality of pieces of text recognition information at level u, wherein each first matrix set includes a first query matrix, a first key matrix, and a first value matrix; for a piece of text recognition information at level u, obtain a first attention matrix based on the first query matrix corresponding to that text recognition information and the first key matrices corresponding to each of the plurality of pieces of text recognition information, obtain third intermediate feature information corresponding to the text recognition information based on the first attention matrix and the first value matrix corresponding to the text recognition information, and obtain second intermediate feature information corresponding to the text recognition information based on the plurality of pieces of third intermediate feature information; obtain the first intermediate feature information corresponding to the plurality of pieces of text recognition information at level u based on the second intermediate feature information corresponding to each of the plurality of pieces of text recognition information at level u and the first intermediate feature information corresponding to each of the plurality of pieces of text recognition information at level u-1; and obtain the global feature information based on the first intermediate feature information corresponding to each of the plurality of pieces of text recognition information at level R; wherein u is an integer greater than or equal to 1 and less than or equal to U, and R is an integer greater than or equal to 1 and less than or equal to U.

20. An information processing apparatus, comprising:

a second acquisition module configured to process a text image to be processed using the apparatus according to claim 19, to acquire structured information of the text image to be processed; and

an information processing module configured to perform information processing using the structured information of the text image to be processed.

21. An electronic device, comprising:

at least one processor; and

a memory communicatively connected to the at least one processor; wherein

the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the method according to any one of claims 1 to 17 or claim 18.

22. A non-transitory computer-readable storage medium storing computer instructions, wherein the computer instructions are used to cause a computer to execute the method according to any one of claims 1 to 17 or claim 18.

23. A computer program product, comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1 to 17 or claim 18.
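The stacked attention computation recited in claim 19 (per-level query/key/value projections, an attention matrix relating each piece of text recognition information to all others, multiple third intermediate features combined into a second intermediate feature, and a residual combination with the previous level's first intermediate features) can be sketched in NumPy. This is an illustrative reading, not the patented implementation: the random projection weights, the head count, and the mean pooling into the global feature are all assumptions made for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def global_feature(feats, num_levels=3, num_heads=2):
    """feats: (N, d) level-1 first intermediate features, one row per
    piece of text recognition information. Returns a global feature vector."""
    n, d = feats.shape
    dh = d // num_heads
    h = feats
    for u in range(2, num_levels + 1):  # levels u with 1 < u <= U
        heads = []
        for _ in range(num_heads):
            # hypothetical per-level projections yielding the first
            # query/key/value matrices from the level u-1 features
            Wq, Wk, Wv = (rng.standard_normal((d, dh)) / np.sqrt(d)
                          for _ in range(3))
            Q, K, V = h @ Wq, h @ Wk, h @ Wv
            attn = softmax(Q @ K.T / np.sqrt(dh))  # first attention matrix
            heads.append(attn @ V)                 # third intermediate features
        second = np.concatenate(heads, axis=-1)    # combine -> second features
        h = second + h                             # residual with level u-1
    return h.mean(axis=0)                          # pool -> global feature info

g = global_feature(rng.standard_normal((4, 8)))
```

With N = 4 text regions and d = 8 feature dimensions, `g` is an 8-dimensional vector summarizing all regions, playing the role of the claimed global feature information.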
CN202310023539.8A | 2023-01-06 | 2023-01-06 | Information generation method, information processing method, device, electronic device, and medium | Active | CN116311298B (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202310023539.8A | CN116311298B (en) | 2023-01-06 | 2023-01-06 | Information generation method, information processing method, device, electronic device, and medium

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202310023539.8A | CN116311298B (en) | 2023-01-06 | 2023-01-06 | Information generation method, information processing method, device, electronic device, and medium

Publications (2)

Publication Number | Publication Date
CN116311298A (en) | 2023-06-23
CN116311298B (en) | 2025-08-26

Family

ID=86824754

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202310023539.8A | Active | CN116311298B (en) | 2023-01-06 | 2023-01-06 | Information generation method, information processing method, device, electronic device, and medium

Country Status (1)

Country | Link
CN (1) | CN116311298B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN117275005B (en)* | 2023-09-21 | 2024-08-09 | Beijing Baidu Netcom Science and Technology Co., Ltd. | Text detection, text detection model optimization and data annotation method and device
WO2025065335A1 (en)* | 2023-09-27 | 2025-04-03 | BOE Technology Group Co., Ltd. | Image inpainting method, model training method, electronic device, and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN110889416A (en)* | 2019-12-13 | 2020-03-17 | Nankai University | A salient object detection method based on cascade improved network
CN114495119A (en)* | 2021-12-01 | 2022-05-13 | Zhejiang University | Real-time irregular text recognition method under complex scene

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN111047602A (en)* | 2019-11-26 | 2020-04-21 | Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences | Image segmentation method and device and terminal equipment
US12205317B2 (en)* | 2020-02-13 | 2025-01-21 | Northeastern University | Light-weight pose estimation network with multi-scale heatmap fusion
CN113627439B (en)* | 2021-08-11 | 2024-10-18 | Guangzhou Jicheng Information Technology Co., Ltd. | Text structuring processing method, processing device, electronic equipment and storage medium
CN114202648B (en)* | 2021-12-08 | 2024-04-16 | Beijing Baidu Netcom Science and Technology Co., Ltd. | Text image correction method, training method and device, electronic equipment and medium
CN114880427B (en)* | 2022-04-20 | 2025-08-29 | Mairong Intelligent Technology (Shanghai) Co., Ltd. | Model, event argument extraction method and system based on multi-level attention mechanism
CN114912433B (en)* | 2022-05-25 | 2024-07-02 | AsiaInfo Technologies (China) Inc. | Text-level multi-label classification method, apparatus, electronic device and storage medium

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN110889416A (en)* | 2019-12-13 | 2020-03-17 | Nankai University | A salient object detection method based on cascade improved network
CN114495119A (en)* | 2021-12-01 | 2022-05-13 | Zhejiang University | Real-time irregular text recognition method under complex scene

Also Published As

Publication number | Publication date
CN116311298A (en) | 2023-06-23

Similar Documents

Publication | Publication Date | Title
US20220253631A1 (en) | Image processing method, electronic device and storage medium
CN113627439B (en) | Text structuring processing method, processing device, electronic equipment and storage medium
CN113343982B (en) | Entity relation extraction method, device and equipment for multi-modal feature fusion
WO2023024614A1 (en) | Document classification method and apparatus, electronic device and storage medium
CN113255694A (en) | Training image feature extraction model and method and device for extracting image features
CN116311298B (en) | Information generation method, information processing method, device, electronic device, and medium
CN113065614B (en) | Training method of classification model and method for classifying target object
CN113255824B (en) | Method and apparatus for training classification model and data classification
CN114612743A (en) | Deep learning model training method, target object identification method and device
CN114882321A (en) | Deep learning model training method, target object detection method and device
CN113221918B (en) | Target detection method, training method and device of target detection model
CN114494784A (en) | Training methods, image processing methods and object recognition methods of deep learning models
CN114724156B (en) | Form identification method and device and electronic equipment
CN113343981A (en) | Visual feature enhanced character recognition method, device and equipment
CN114429633B (en) | Text recognition method, training method and device of model, electronic equipment and medium
CN112101360A (en) | Target detection method and device and computer readable storage medium
CN114360027A (en) | Training method and device for feature extraction network, and electronic device
CN114495113A (en) | Text classification method and training method and device for text classification model
CN116824609B (en) | Document format detection method and device and electronic equipment
CN113343979A (en) | Method, apparatus, device, medium and program product for training a model
CN114419327B (en) | Image detection method and training method and device of image detection model
CN116244447A (en) | Multi-mode map construction and information processing method and device, electronic equipment and medium
CN116662589A (en) | Image matching method, device, electronic equipment and storage medium
CN115205555A (en) | Method for determining similar images, training method, information determination method and device
CN114565760A (en) | Image segmentation method, model training method, device, electronic device, and medium

Legal Events

Date | Code | Title | Description
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant
