CN119692330A

Info

Publication number: CN119692330A
Application number: CN202411552888.XA
Authority: CN
Inventors: 吴文豪
Original assignee: China Merchants Bank Co Ltd
Current assignee: China Merchants Bank Co Ltd
Priority date: 2024-11-01
Filing date: 2024-11-01
Publication date: 2025-03-25

Abstract

The application discloses a document analysis method, device, medium and computer program product, which relate to the technical field of data processing and comprise the steps of obtaining a document to be analyzed, carrying out layout detection on the document to be analyzed to obtain element blocks, determining content extraction rules corresponding to the element blocks according to layout elements of the element blocks, processing the element blocks through the content extraction rules to obtain preliminary contents of the element blocks, and correcting the preliminary contents by using a large language model to obtain a target document. According to the application, different element blocks are divided, different content extraction methods are adopted for each element block, and the extracted content is corrected by using a large language model, so that the accuracy of document analysis is improved.

Description

Document parsing method, device, medium and computer program product

Technical Field

The present application relates to the field of data processing technology, and in particular, to a document parsing method, apparatus, medium, and computer program product.

Background

With the continuous evolution of information technology, document formats have developed from the plain text and rich text of the early days into the rich variety of formats in wide use today, such as PDF (Portable Document Format), Word, and Markdown.

Text information in a document can be extracted effectively by means of OCR (Optical Character Recognition) technology. However, because a document may contain complex layouts, multiple fonts, and image elements, parsing it with OCR is challenging, and it is difficult to accurately extract every element in the document. Specifically, inaccurate OCR text recognition and errors in text segmentation both reduce the accuracy of document parsing.

The foregoing is provided merely for the purpose of facilitating understanding of the technical solutions of the present application and is not intended to represent an admission that the foregoing is prior art.

Disclosure of Invention

The application mainly aims to provide a document analysis method, device, medium and computer program product, which aim to solve the technical problem of inaccurate document analysis.

In order to achieve the above object, the present application provides a document parsing method, which includes:

acquiring a document to be analyzed, and performing layout detection on the document to be analyzed to obtain an element block;

Determining a content extraction rule corresponding to the element block according to the layout element of the element block;

processing the element blocks through the content extraction rules to obtain preliminary contents of the element blocks;

and correcting the preliminary content by using a large language model to obtain a target document.

In one embodiment, before the step of correcting each of the preliminary contents using the large language model, the method further includes:

and in the case that the layout element is text, performing segment correction on the preliminary content of the element block by using a segment correction model.

In one embodiment, the step of performing segment correction on the preliminary content of the element block using a segment correction model includes:

deleting the line feed marks in the preliminary content, and extracting the adjacent sentence pairs at preset symbols;

judging whether the adjacent sentence pairs belong to the same paragraph or not by utilizing the segmentation correction model;

And adding a line feed mark between the adjacent sentence pairs in the case that the adjacent sentence pairs do not belong to the same paragraph.

In an embodiment, the segmentation correction model includes a BERT (Bidirectional Encoder Representations from Transformers) model, and the step of determining whether the adjacent sentence pairs belong to the same paragraph using the segmentation correction model includes:

Extracting semantic features of the adjacent sentence pairs by using the BERT model;

Mapping the semantic features into probability values, wherein the probability values are probabilities that the adjacent sentence pairs belong to the same paragraph;

Under the condition that the probability value is larger than a preset probability threshold value, determining that the adjacent sentence pairs belong to the same paragraph;

and under the condition that the probability value is not greater than a preset probability threshold value, determining that the adjacent sentence pairs do not belong to the same paragraph.

In an embodiment, the step of performing layout detection on the document to be parsed to obtain the element block includes:

converting the document to be analyzed into a picture to be analyzed;

determining layout elements contained in the picture to be analyzed and element positions of the layout elements by using a layout detection model;

And splitting the picture to be analyzed according to the element position to obtain the element block.

In one embodiment, the step of processing each of the element blocks by each of the content extraction rules to obtain preliminary content of each of the element blocks includes:

Determining preliminary content and style characteristics of each element block by utilizing an optical character recognition technology under the condition that the layout element is a title;

determining the title level of the element block according to the style characteristics;

Before the step of obtaining the target document, the method further comprises:

And adjusting the document display style in the element block according to the title level of the element block.

In an embodiment, the preliminary content includes picture content and text content, and the step of processing the element block by the content extraction rule to obtain the preliminary content of the element block includes:

In the case that the layout element is a picture, determining the element block as the picture content;

The correcting of the preliminary content using a large language model includes:

Based on the picture content, correcting the text content in the preliminary content by using the large language model.

In addition, in order to achieve the above object, the present application also provides a document parsing apparatus including:

The detection module is used for acquiring a document to be analyzed, and carrying out layout detection on the document to be analyzed to obtain an element block;

the determining module is used for determining a content extraction rule corresponding to the element block according to the layout element of the element block;

the extraction module is used for processing the element blocks through the content extraction rule to obtain preliminary contents of the element blocks;

And the correction module is used for correcting the preliminary content by using the large language model to obtain the target document.

In addition, in order to achieve the above object, the present application also proposes a document parsing apparatus comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the computer program being configured to implement the steps of the document parsing method as described above.

In addition, in order to achieve the above object, the present application also proposes a storage medium, which is a computer-readable storage medium, on which a computer program is stored, which when being executed by a processor, implements the steps of the document parsing method as described above.

Furthermore, to achieve the above object, the present application provides a computer program product comprising a computer program which, when executed by a processor, implements the steps of the document parsing method as described above.

The technical solution provided by the application has at least the following technical effects. A document to be analyzed is obtained and layout detection is performed on it to obtain element blocks; layout detection identifies and distinguishes the different content areas in the document, so that the information in the document can be extracted and processed more accurately, improving the accuracy and efficiency of content extraction. The content extraction rule corresponding to each element block is then determined according to the layout element of that block, which makes the extraction process more customized and accurate and improves the quality of content extraction. Each element block is processed by its corresponding content extraction rule to obtain its preliminary content, which provides the basic data for subsequent content correction and optimization and preserves the originality and integrity of the information. After the preliminary content is obtained, it is corrected by a large language model to obtain the target document, which improves the accuracy and readability of the text content. Together, these steps improve the accuracy of document parsing.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and together with the description, serve to explain the principles of the application.

In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings that are required to be used in the description of the embodiments or the prior art will be briefly described below, and it will be obvious to those skilled in the art that other drawings can be obtained from these drawings without inventive effort.

FIG. 1 is a schematic flow chart of a document parsing method according to an embodiment of the present application;

FIG. 2 is a schematic diagram of layout detection according to a first embodiment of the document parsing method of the present application;

FIG. 3 is a schematic diagram of content extraction according to a first embodiment of the document parsing method of the present application;

FIG. 4 is a flowchart of training and applying a segmentation correction model according to a first embodiment of the document parsing method of the present application;

FIG. 5 is a schematic diagram of a module structure of a document parsing apparatus according to an embodiment of the present application;

FIG. 6 is a schematic diagram of the device structure of the hardware operating environment involved in the document parsing method according to an embodiment of the present application.

Detailed Description

It should be understood that the specific embodiments described herein are merely illustrative of the technical solution of the present application and are not intended to limit the present application.

For a better understanding of the technical solution of the present application, the following detailed description will be given with reference to the drawings and the specific embodiments.

In this embodiment, for convenience of description, the following description will be made with the terminal as an execution subject.

Conventional techniques first identify and precisely locate the various elements in a document, including text, images, and tables, and then apply a specially designed parsing strategy to each particular element for further processing. Although this approach achieves good results in most cases, some problems and challenges remain. For example, OCR recognition may be inaccurate, resulting in deviations or omissions in the text information; in addition, text segmentation may be incorrect, so that a continuous paragraph is wrongly cut apart or the text of different paragraphs is wrongly merged. These problems not only reduce the readability of the document but also affect the accuracy with which the original document is parsed, resulting in inaccurate document parsing.

In the present application, a document to be analyzed is obtained and layout detection is performed on it to obtain element blocks. Layout detection identifies and distinguishes the different content areas in the document, such as text, images, and tables, which facilitates accurate extraction and processing of the information in the document and improves the accuracy of content extraction. The content extraction rule corresponding to each element block is determined according to the layout element of that block, which makes the extraction process customized and accurate and improves the quality of content extraction. Each element block is then processed by its corresponding content extraction rule to obtain its preliminary content (for example, saving a picture and recording its file information, or recognizing styles and content), which provides the basic data for subsequent content correction and optimization and preserves the originality and integrity of the information. After text content is extracted, a segmentation correction model may be used to identify its paragraphs, preserving the logic and readability of the text as far as possible and improving the quality of the document. Finally, the content of the element blocks is corrected by a large language model to obtain the target document, which improves the accuracy and readability of the text content and therefore the accuracy of document parsing.

It should be noted that, the execution body of the embodiment may be a computing service device having functions of data processing, network communication and program running, such as a tablet computer, a personal computer, a mobile phone, or an electronic device capable of implementing the above functions. This embodiment and the following embodiments will be described below using a document analysis device as an execution subject.

Based on this, an embodiment of the present application provides a document parsing method, and referring to fig. 1, fig. 1 is a schematic flow chart of a first embodiment of the document parsing method of the present application.

In this embodiment, the document analysis method includes steps S10 to S40:

step S10, obtaining a document to be analyzed, and performing layout detection on the document to be analyzed to obtain an element block;

It should be noted that the document to be analyzed refers to a digitized document that is to be processed or parsed. It may be a PDF, Word, HTML, or other format document, and may contain various types of data such as text, pictures, and tables. The element blocks refer to regions or blocks divided according to the layout elements and their positions, where each element block contains only one layout element, such as a section of text or a table.

In addition, layout detection refers to analyzing the structure and layout of the document algorithmically in order to identify the location areas and arrangement of its different parts. Common layout detection models include the PicoDet model, the LayoutParser model, and the like.

For example, referring to FIG. 2, which provides a schematic diagram of layout detection: for a PDF document containing several different layout elements, layout detection identifies the layout elements in the document to be analyzed, namely a table, a picture related to large language models, the title "large language model", and an introductory text about large language models. The document is then split according to each layout element and its position to obtain a table element block, a picture element block, a title element block, and a text element block, so that different content extraction rules can subsequently be applied to different layout elements and content parsing can be performed in a targeted manner.

It will be appreciated that the division of the document into blocks of elements facilitates subsequent steps such as content extraction, text analysis, etc.

Step S20, determining a content extraction rule corresponding to the element block according to the layout elements of the element block;

It should be noted that a content extraction rule is an algorithm or logic rule for extracting the internal information or content of an element block according to its layout element (for example text format, image features, or table format). For example, when the layout element is a table, a table structure recognition model such as the TableAtten model is used to determine the table structure and the corresponding position information, and an OCR model is used to recognize the table content and record the position of each piece of content.

For a PDF document containing layout elements such as a title, a body, and a table, it may be determined that the extraction rule of the title element block is to extract plain text and convert english letters into uppercase, the extraction rule of the body element block is to extract text content and remove format, and the extraction rule of the table element block is to extract table data and store it in columns.

It can be appreciated that, for different layout elements, different content extraction rules are provided for the element blocks, and this customized extraction rule helps to improve accuracy of content extraction, thereby improving accuracy of document parsing.
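The mapping from layout element to content extraction rule described above can be sketched as a simple dispatch table. This is a minimal illustration only, not the implementation of the application: the function names, the `raw` field, and the dictionary layout are hypothetical, and the three rules merely mirror the title, body, and table example given in the text (whitespace collapsing stands in for "removing format").

```python
from typing import Callable, Dict


def extract_title(block: dict) -> str:
    # Title rule from the example: plain text, English letters in uppercase.
    return block["raw"].strip().upper()


def extract_text(block: dict) -> str:
    # Body rule from the example: keep the text, drop formatting.
    # Collapsing whitespace is an assumed stand-in for "remove format".
    return " ".join(block["raw"].split())


def extract_table(block: dict) -> list:
    # Table rule from the example: store the table data by columns.
    return [list(column) for column in zip(*block["raw"])]


# Hypothetical registry: layout element type -> content extraction rule.
EXTRACTION_RULES: Dict[str, Callable[[dict], object]] = {
    "title": extract_title,
    "text": extract_text,
    "table": extract_table,
}


def rule_for(layout_element: str) -> Callable[[dict], object]:
    """Return the extraction rule registered for a layout element type."""
    try:
        return EXTRACTION_RULES[layout_element]
    except KeyError:
        raise ValueError(f"no extraction rule for layout element {layout_element!r}")


title = rule_for("title")({"raw": "  large language model "})
body = rule_for("text")({"raw": "A model  trained\non a large corpus."})
columns = rule_for("table")({"raw": [["month", "output"], ["Jan", "10"]]})
```

Keeping the rules in a registry means that a new layout element type only needs a new entry, without touching the code that walks the element blocks.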

Step S30, processing the element blocks through content extraction rules to obtain preliminary contents of the element blocks;

the processing means that a content extraction rule is applied to each element block to extract text information, image information, and the like therein.

Referring to FIG. 3, which provides a schematic diagram of content extraction: each element block obtained after element splitting undergoes content extraction according to the content extraction rule corresponding to its layout element. For an element block whose layout element is a table, table structure recognition is performed mainly with the TableAtten model and table content recognition with the PPOCRv model. For an element block whose layout element is a picture, the picture is saved and its file information, including the file name and the file storage path, is recorded. For an element block whose layout element is a title or text, content recognition is performed directly with the PPOCRv model to obtain the title content or text content. The output of each model is the preliminary content of the corresponding element block. After the preliminary content is obtained, the results are aggregated, which includes preliminary correction of the text content with a BERT model, text correction of all the preliminary content with a Qwen large model, layout ordering, and output of the target document.
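For the table branch, the structure model yields cell positions and the OCR model yields recognized text with positions, and the two must be combined. The sketch below shows one plausible way to do that, assuming each cell is given as a row index, a column index, and a bounding box, and each OCR item as a text string with a center point. These input shapes and the function name `assemble_table` are assumptions for illustration; the actual output formats of the models named in the text are not specified here.

```python
from typing import List, Tuple

Cell = Tuple[int, int, Tuple[int, int, int, int]]  # (row, col, (x0, y0, x1, y1))
OcrItem = Tuple[str, Tuple[int, int]]              # (text, (center_x, center_y))


def assemble_table(cells: List[Cell], ocr_items: List[OcrItem]) -> List[List[str]]:
    """Place each recognized text into the cell whose box contains its center."""
    n_rows = max(row for row, _, _ in cells) + 1
    n_cols = max(col for _, col, _ in cells) + 1
    grid = [["" for _ in range(n_cols)] for _ in range(n_rows)]
    for text, (cx, cy) in ocr_items:
        for row, col, (x0, y0, x1, y1) in cells:
            if x0 <= cx < x1 and y0 <= cy < y1:
                # Several OCR fragments may fall into one cell; join them.
                grid[row][col] = (grid[row][col] + " " + text).strip()
                break
    return grid


cells = [
    (0, 0, (0, 0, 50, 20)), (0, 1, (50, 0, 100, 20)),
    (1, 0, (0, 20, 50, 40)), (1, 1, (50, 20, 100, 40)),
]
ocr_items = [
    ("production month", (25, 10)), ("output", (75, 10)),
    ("2024-01", (25, 30)), ("120", (75, 30)),
]
grid = assemble_table(cells, ocr_items)
```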

Step S40, correcting the preliminary content by using the large language model to obtain a target document;

It should be noted that, the large language model is a natural language processing model based on deep learning technology, and through training of a large-scale corpus, the model can understand and generate human language, and has strong text composition, understanding and analysis capabilities, and can process complex language structures and semantic relations.

Additionally, it should be noted that, the target document refers to a document that finally contains contents of all element blocks, and the format of the target document may be determined according to the user's requirement, for example, a text format suitable for model training, a Markdown format, and the like, which is not limited in this embodiment.

Illustratively, the large language model is used to identify wrongly written characters in the text content and to generate candidate replacement words for them according to the context; a target replacement word is then selected from the candidates based on the contextual semantics of the wrongly written characters and substituted into the text content to complete the content correction. For example, the phrase "please ensure the correctness of the algorithm" in the document to be parsed may be recognized by OCR as "confirm the correctness of the algorithm"; the large language model's understanding of context can be used to correct the text content and obtain the correct result.

Illustratively, for table content, suppose the column information in the table is the production month and the text preceding the table also refers to the production month, but the column name in the table recognition result is "production date". By understanding the table column information and the preceding text, the large language model changes the column name from "production date" to "production month". Similarly, for formula information in a document, superscripts and subscripts may be recognized unclearly; the context understanding capability of the large model can be used to understand how the document introduces the formula, so as to correct the formula, or to correct the related introductory content in the document according to the formula.
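The correction step can be sketched as a loop that hands each piece of preliminary content, together with its neighbouring content as context, to a large language model. The sketch below is an assumption-laden illustration: the prompt wording, the function `correct_blocks`, and the use of the immediately preceding and following blocks as context are all hypothetical, and the model is passed in as a plain callable so that no particular model API is implied. A stub stands in for the model in the example.

```python
from typing import Callable, List

# Hypothetical prompt; the application does not specify any prompt wording.
PROMPT_TEMPLATE = (
    "The following content was produced by OCR and may contain recognition errors.\n"
    "Use the surrounding context to correct it. Return only the corrected content.\n"
    "Context:\n{context}\n\n"
    "Text to correct:\n{text}\n"
)


def correct_blocks(contents: List[str], llm: Callable[[str], str]) -> List[str]:
    """Correct each preliminary content using its neighbours as context."""
    corrected = []
    for i, text in enumerate(contents):
        # Assumed context window: the block before and the block after.
        neighbours = contents[max(0, i - 1):i] + contents[i + 1:i + 2]
        prompt = PROMPT_TEMPLATE.format(context="\n".join(neighbours), text=text)
        corrected.append(llm(prompt).strip())
    return corrected


def fake_llm(prompt: str) -> str:
    # Stub standing in for a real model call, for demonstration only.
    text = prompt.split("Text to correct:\n", 1)[1].rstrip("\n")
    return text.replace("production date", "production month")


contents = [
    "The table below lists output by production month.",
    "production date | output",
]
result = correct_blocks(contents, fake_llm)
```

Passing the model as a callable keeps the aggregation logic testable without a model and leaves the choice of model and client library open.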

In the case that a plurality of element blocks exist, the content layout in the target document can be adjusted according to the appearance sequence of the element blocks, so that the layout is ensured to be consistent with the original document, and the analysis accuracy and the readability of the target document are enhanced.
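A minimal sketch of this layout ordering follows. It assumes that each element block records its page index and the top-left corner of its position, and that reading order can be approximated by sorting on page, then top, then left; these field names and the sorting key are illustrative assumptions, and multi-column layouts would need a more careful ordering.

```python
from typing import List


def assemble_document(blocks: List[dict]) -> str:
    """Join corrected block contents in their order of appearance."""
    ordered = sorted(blocks, key=lambda b: (b["page"], b["top"], b["left"]))
    return "\n\n".join(b["content"] for b in ordered)


blocks = [
    {"page": 0, "top": 300, "left": 40, "content": "Introductory text."},
    {"page": 0, "top": 40, "left": 40, "content": "# Large language model"},
    {"page": 1, "top": 20, "left": 40, "content": "| month | output |"},
]
document = assemble_document(blocks)
```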

It can be appreciated that correction by the large language model improves the accuracy and readability of the preliminary content, thereby improving the accuracy of document parsing.

The embodiment provides a document analysis method, which comprises the steps of carrying out layout detection on a document to be analyzed, dividing the document into small element blocks, applying content extraction rules corresponding to layout elements to the element blocks, guaranteeing accuracy of content identification, and further correcting by using a large language model, so that accuracy of content extraction is further improved, and further accuracy of document analysis is improved.

In a possible embodiment, before step S40, the method further includes:

step A10, in the case that the layout element is text, the segmentation correction model is utilized to perform segmentation correction on the preliminary content of the element block.

It should be noted that the segmentation correction model refers to an algorithm or program for correcting and optimizing text segmentation; through a series of rules and algorithms, the model identifies and adjusts paragraph boundaries in the text so that the segmentation of the text is more accurate and reasonable. Common segmentation correction models include correction models based on natural language processing, such as the BERT (Bidirectional Encoder Representations from Transformers) model and the Seq2Seq (sequence-to-sequence) model, as well as joint models based on CRF (Conditional Random Field) and n-gram.

Illustratively, the text content is segmented using a Transformer model: associations between parts of the text are captured by the model's self-attention mechanism, and segmentation suggestions are generated to divide the original text into several paragraphs with coherent content.

In the embodiment, the segmentation correction model is introduced to carry out intelligent segmentation correction on the primary content of the text, so that the text structure is optimized, the logic and continuity of the content are ensured, the accuracy and the readability of the text content are effectively improved, and the accuracy of document analysis is improved.

In one possible embodiment, step A10 includes:

Step A20, deleting the line feed marks in the preliminary content, and extracting the adjacent sentence pairs at preset symbols;

It should be noted that a line feed mark refers to a specific character or code used in a document format to indicate the start of a new line or paragraph, for example <br> in web page code or a carriage return in a Word document. A preset symbol refers to a mark or symbol that identifies a specific position, structure, or content in the text; depending on the text content, it may be a punctuation mark, a special character, or the like.

In addition, an adjacent sentence pair refers to two or more sentences that are consecutive in the text; they may belong to the same paragraph or may need to be divided into different paragraphs. The number of sentences extracted on each side may be 1, that is, one sentence before and one sentence after the preset symbol, or may be set according to the capability of the model, which is not specifically limited in this embodiment.

It will be appreciated that by selectively extracting text content as input to the segmentation correction model, unnecessary text processing operations may be reduced, improving the efficiency of text parsing.

Step A30, judging whether adjacent sentence pairs belong to the same paragraph or not by utilizing a segmentation correction model;

After the segmentation correction model receives adjacent sentence pairs, the segmentation correction model performs processing such as feature extraction, semantic analysis and the like on the input text, recognizes the relation and logic structure between sentences, and outputs a judgment conclusion of whether each pair of adjacent sentence pairs belong to the same paragraph or not based on the structural relation.

Step A40, adding a line feed flag between adjacent sentence pairs in the case that the adjacent sentence pairs do not belong to the same paragraph.

And analyzing the text content of the adjacent sentence pairs by using the segmentation correction model, automatically judging whether the adjacent sentence pairs belong to the same paragraph, and if not, automatically adding a proper line feed mark between the adjacent sentence pairs according to the format requirement of the target document so as to separate different paragraphs.

In the embodiment, the paragraph attribution is accurately judged, so that the continuity and the structural rationality of the document are enhanced, and the accuracy of document analysis is improved.
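Steps A20 to A40 can be sketched as follows. This is a simplified illustration under stated assumptions: the preset symbols are taken to be the sentence-ending marks `.`, `!`, and `?`, line feeds are replaced by single spaces (suitable for space-delimited languages), and the paragraph decision is delegated to a caller-supplied function, here a toy rule standing in for the segmentation correction model. The function name `resegment` is hypothetical.

```python
import re
from typing import Callable


def resegment(text: str, same_paragraph: Callable[[str, str], bool]) -> str:
    """Remove line feeds, then re-insert them only at paragraph boundaries."""
    # Step A20: delete line feed marks (and collapse other whitespace).
    flat = " ".join(text.split())
    # Step A20: split at the assumed preset symbols to get sentences.
    # This naive split also breaks at abbreviations such as "e.g.".
    sentences = [s.strip() for s in re.findall(r"[^.!?]+[.!?]*", flat)]
    sentences = [s for s in sentences if s]
    if not sentences:
        return ""
    parts = [sentences[0]]
    for previous, following in zip(sentences, sentences[1:]):
        # Step A30: ask the model whether the pair shares a paragraph.
        # Step A40: add a line feed mark only when it does not.
        parts.append(" " if same_paragraph(previous, following) else "\n")
        parts.append(following)
    return "".join(parts)


def toy_rule(previous: str, following: str) -> bool:
    # Placeholder for the segmentation correction model, for demonstration only.
    return not following.startswith("Step")


ocr_text = (
    "Layout detection finds\nthe element blocks. Each block holds one element.\n"
    "Step two picks a rule. The rule depends on the element."
)
fixed = resegment(ocr_text, toy_rule)
```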

In one possible implementation, the segmentation correction model comprises a BERT model, and step A30 (judging whether adjacent sentence pairs belong to the same paragraph) comprises:

step A21, extracting semantic features of adjacent sentence pairs by using a BERT model;

It should be noted that the BERT model is a pre-trained language representation model based on the Transformer architecture. Through unsupervised learning on a large-scale text corpus, the model captures rich language features at multiple levels, including vocabulary, syntax, and semantics. The semantic features reflect the meaning or information expressed by the text, including the semantics of words, the semantic relationships between sentences, and the contextual information of the text.

It will be appreciated that the fine-tuning cost of the BERT model is relatively small, and that it has a significant speed advantage in paragraph recognition over a generative large language model.

Step A22, mapping the semantic features into probability values, wherein the probability values are probabilities that adjacent sentence pairs belong to the same paragraph;

it should be noted that mapping refers to inputting the extracted semantic features into an algorithm or model that outputs a probability value that identifies the likelihood that adjacent sentence pairs belong to the same paragraph.

Illustratively, an MLP (Multilayer Perceptron) is used as the classification head of the segmentation correction model: the features extracted by the BERT model are converted into a real value, the real value is mapped to a probability by an activation function (for example the sigmoid function), and whether the adjacent sentence pair belongs to the same paragraph is determined from the probability value.

It can be understood that by mapping the semantic features into probability values, whether the adjacent sentence pairs belong to the same paragraph can be judged more accurately according to the logic relationship between the adjacent sentence pairs, so that the accuracy of document analysis is improved.

Step A23, determining that adjacent sentence pairs belong to the same paragraph under the condition that the probability value is larger than a preset probability threshold value;

and step A24, determining that the adjacent sentence pairs do not belong to the same paragraph under the condition that the probability value is not larger than a preset probability threshold value.
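Steps A22 to A24 can be illustrated with a small numeric sketch: a one-hidden-layer MLP turns a feature vector into a real value, the sigmoid function maps that value to a probability, and the probability is compared with the preset threshold. The weights, the two-dimensional feature vector, and the threshold of 0.5 are made-up values for illustration; a real feature vector would come from the BERT model.

```python
import math
from typing import List


def mlp_logit(features: List[float], w1: List[List[float]], b1: List[float],
              w2: List[float], b2: float) -> float:
    """One hidden layer with ReLU, followed by a single linear output."""
    hidden = [
        max(0.0, sum(w * x for w, x in zip(row, features)) + bias)
        for row, bias in zip(w1, b1)
    ]
    return sum(w * h for w, h in zip(w2, hidden)) + b2


def sigmoid(z: float) -> float:
    """Numerically stable sigmoid: maps a real value to (0, 1)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def belongs_to_same_paragraph(logit: float, threshold: float = 0.5) -> bool:
    # Step A23: greater than the threshold -> same paragraph.
    # Step A24: not greater than the threshold -> different paragraphs.
    return sigmoid(logit) > threshold


# Illustrative values only.
features = [1.0, -2.0]
logit = mlp_logit(features, w1=[[1.0, 0.0], [0.0, 1.0]], b1=[0.0, 0.0],
                  w2=[2.0, 1.0], b2=-1.0)
decision = belongs_to_same_paragraph(logit)
```

Note that a probability exactly equal to the threshold falls under step A24, because the comparison in step A23 is strict.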

Illustratively, FIG. 4 provides a flow for training and applying the segmentation correction model based on the BERT model. In the training flow, a standard document is first obtained and text content is extracted from it to construct a training data set, in which sentence pairs from different paragraphs are labeled "0" and sentence pairs from the same paragraph are labeled "1". The BERT model then extracts features from the sentences in the training data set to obtain a feature vector for each sentence pair, and the classification head maps the feature vector to a prediction result, that is, a prediction of whether the pair of sentences belongs to the same paragraph. Next, the labels are read and loss calculation and gradient backpropagation are performed: the prediction result is compared with the label, the error is calculated with a binary cross-entropy loss function to guide the optimization of the classification head, and the error is propagated back layer by layer to every parameter in the model by the backpropagation algorithm while the gradient of each parameter is computed. These gradients are used to update the model parameters so as to minimize the loss, and model training ends once the loss stabilizes or a preset number of training iterations is reached.
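Two parts of the training flow lend themselves to a short sketch: building labeled sentence pairs from a standard document, and computing the binary cross-entropy loss for one prediction. The sketch assumes the standard document is already available as a list of paragraphs, each a list of sentences; the function names are hypothetical, and feature extraction and parameter updates are left out.

```python
import math
from typing import List, Tuple


def build_pairs(paragraphs: List[List[str]]) -> List[Tuple[str, str, int]]:
    """Label consecutive sentences: 1 if same paragraph, 0 otherwise."""
    flat = [(sentence, index)
            for index, paragraph in enumerate(paragraphs)
            for sentence in paragraph]
    pairs = []
    for (first, p_first), (second, p_second) in zip(flat, flat[1:]):
        pairs.append((first, second, 1 if p_first == p_second else 0))
    return pairs


def binary_cross_entropy(probability: float, label: int, eps: float = 1e-12) -> float:
    """Loss for one prediction: -[y log p + (1 - y) log(1 - p)]."""
    # Clamp to avoid log(0) for fully confident predictions.
    p = min(max(probability, eps), 1.0 - eps)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


standard_document = [
    ["Large language models are trained on text.", "They can generate language."],
    ["Layout detection splits a page into blocks."],
]
pairs = build_pairs(standard_document)
loss_confident = binary_cross_entropy(0.9, 1)
loss_wrong = binary_cross_entropy(0.9, 0)
```

A confident correct prediction gives a small loss and a confident wrong one a large loss, which is what drives the parameter updates described above.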

FIG. 4 also provides the model application flow: the preliminary content obtained after content extraction from the element blocks is acquired; the text before and after each line feed symbol (the preset symbol), that is, the adjacent sentence pair, is extracted; features are extracted with the BERT model to obtain a feature vector for each adjacent sentence pair; and the feature vector is mapped to a probability value with an MLP as the classification head. When the probability value is greater than the preset probability threshold, the adjacent sentence pair is judged to belong to the same paragraph; when the probability value is not greater than the preset probability threshold, the adjacent sentence pair is judged not to belong to the same paragraph, which yields the prediction result.

In this embodiment, the deep semantic features extracted by the BERT model improve the accuracy of paragraph judgment, and reasonable text segmentation improves both the readability of the document content and the accuracy of document parsing.

In the second embodiment of the present application, for content that is the same as or similar to the first embodiment, reference may be made to the above description, which is not repeated here. On this basis, in step S10, the step of performing layout detection on the document to be parsed to obtain the element blocks includes:

Step S11, converting a document to be analyzed into a picture to be analyzed;

It should be noted that the picture to be analyzed is an image file on which layout detection and element extraction are to be performed. It may be a scanned paper document, a photograph, a webpage screenshot containing a complex layout, or the like. The picture to be analyzed contains various layout elements, and the positions and arrangement of these elements form the layout structure of the picture.

Step S12, determining layout elements and element positions of the layout elements contained in the picture to be analyzed by using a layout detection model;

It should be noted that the layout detection model identifies and locates the different layout elements in the image by analyzing its visual features. The element position is the coordinate information of a layout element in the picture to be analyzed; it describes where the element lies in the picture and provides the basis for the subsequent splitting operation.

Existing layout detection models are essentially visual models, so the document to be analyzed must first be converted into a picture, which is then fed to the layout detection model to output the position information of the layout elements.

Step S13, splitting the picture to be analyzed according to the element positions to obtain element blocks.

According to a given coordinate position, the picture to be parsed is divided into one or more picture segments (element blocks) containing independent layout elements.

Illustratively, suppose that after layout detection is completed, the element position of a table has its upper left corner at (50, 30) and its lower right corner at (250, 80). The region of the picture from (50, 30) to (250, 80) is then split off as the element block of the table.
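The splitting in step S13 can be sketched as a crop by coordinates. In this minimal illustration the picture is modeled as a nested list of pixels so that no imaging library is needed; the function names, the detection dictionary layout, and treating the lower right corner as exclusive are all assumptions of the sketch.

```python
def crop_block(image, box):
    """Cut the region (left, top, right, bottom) out of a row-major pixel grid.

    The right and bottom edges are treated as exclusive (an assumption).
    """
    left, top, right, bottom = box
    return [row[left:right] for row in image[top:bottom]]


def split_into_blocks(image, detections):
    """Turn layout detections into element blocks holding their own pixels."""
    return [
        {"type": d["type"], "pixels": crop_block(image, d["box"])}
        for d in detections
    ]


# Hypothetical 300x100 picture whose pixel value records its own (x, y).
image = [[(x, y) for x in range(300)] for y in range(100)]
detections = [{"type": "table", "box": (50, 30, 250, 80)}]  # from the example above
blocks = split_into_blocks(image, detections)
print(blocks[0]["type"], len(blocks[0]["pixels"]), len(blocks[0]["pixels"][0]))
```

With a real image library, the same box would typically be passed to that library's crop routine instead.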

In this embodiment, accurate element positioning and splitting reduce interference among different layout elements and improve the accuracy of subsequent text content extraction, thereby improving the accuracy of document parsing.

In one possible embodiment, step S30 includes:

Step B10, determining preliminary content and style characteristics of the element block by utilizing an optical character recognition technology under the condition that the layout element is a title;

The OCR technology refers to a series of image processing techniques that convert characters in an image into editable and searchable text, and the style features refer to visual properties of the characters such as font, size, and weight.

Step B20, determining the title level of the element block according to the style characteristics;

It should be noted that, the title level refers to a hierarchical relationship of different titles in a document, and is generally distinguished by different fonts, sizes, bold, and the like, for example, a primary title may represent a chapter name of the document, and a secondary title represents a section name under the chapter.

For example, an element block whose style features are a size-two font in bold may be determined to be a primary heading.

It will be appreciated that determining the title level by OCR technology helps to better understand the importance and hierarchy of text content.
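Step B20 can be sketched as a simple rule on the style features. The thresholds below are hypothetical and only illustrate the idea; the application does not specify them, and the 22 pt value is an assumed stand-in for the font size in the example above.

```python
from dataclasses import dataclass


@dataclass
class StyleFeatures:
    font_size_pt: float  # assumed to be reported by the OCR step
    bold: bool


def title_level(style):
    """Map style features to a title level (1 = primary heading).

    All thresholds are illustrative assumptions.
    """
    if style.font_size_pt >= 22 and style.bold:
        return 1
    if style.font_size_pt >= 16:
        return 2
    return 3


print(title_level(StyleFeatures(font_size_pt=22, bold=True)))
```

A production system would more likely calibrate such rules against the font sizes actually observed in the document.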

Before the step of obtaining the target document in step S40, the method further includes:

Step B30, adjusting the document display style of the element block according to the title level of the element block.

It should be noted that, the document display style refers to a presentation form of layout elements corresponding to a target document, including visual attributes such as fonts, sizes, etc., and by adjusting the document display style, different layout elements can be distinguished, and key contents in the document can be highlighted.

Illustratively, in the target document, a series of title levels and corresponding document display styles are predefined. Before the title text content is written into the target document, the corresponding document display style needs to be determined, so that the title style is convenient to adjust.
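The predefined mapping from title levels to display styles can be sketched as a lookup table. The sketch assumes, purely for illustration, that the target document is written in Markdown, so each display style is a heading prefix; the table contents and the function name are hypothetical.

```python
# Hypothetical predefined styles: title level -> Markdown heading prefix.
HEADING_STYLES = {1: "# ", 2: "## ", 3: "### "}


def render_title(text, level):
    """Apply the display style for a title level before writing to the target document.

    Levels without a predefined style fall back to plain text.
    """
    prefix = HEADING_STYLES.get(level, "")
    return f"{prefix}{text.strip()}"


print(render_title(" Background ", 2))
```

For other target formats the table could hold font and size settings instead of prefixes.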

In this embodiment, displaying titles of different levels distinctly enhances the readability of the document, and applying the display style rule that corresponds to the title level of each element block restores the layout of the document to be analyzed, thereby improving the accuracy of document parsing.

In a possible implementation, the preliminary content includes a picture content and a text content, and step S30 includes:

Step E10, determining the element block as picture content in the case that the layout element is a picture;

It should be noted that the picture content refers to the picture itself contained in the document to be parsed, which is usually a combination of visual elements such as characters, graphics, and symbols. In document processing or information extraction, picture content is generally regarded as an important source of information. Text content refers to all content in textual form (that is, excluding pictures), including titles, body text, formulas, tables, and the like. Because a large language model can understand both the structure and the content of tables, all text content other than pictures can be understood and corrected.

The step of correcting the preliminary content using the large language model in step S40 includes:

Step E20, correcting the text content in the preliminary content by using a large language model based on the picture content.

For example, a document to be analyzed that includes a picture is obtained, and layout detection is performed on it to obtain a plurality of element blocks. Each element block is processed with the content extraction rule corresponding to its layout element: for an element block whose layout element is a picture, the preliminary content obtained is the picture itself (picture content), and for an element block whose layout element is not a picture, the preliminary content obtained is text content. A multimodal large language model with image recognition capability is then used to learn and understand the semantics, scene, and other information in the picture content and the text content, and to correct wrongly written characters and similar errors in the text content, which improves the accuracy and readability of the text content. Finally, the processed content is presented to the user in its original order so that the user can read and further analyze it.
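The flow above can be sketched as follows, assuming each element block is a dictionary with a type and its preliminary content. The multimodal large language model is replaced by a hypothetical `toy_corrector` function, since the application does not name a specific model or API; the block layout and function names are likewise assumptions.

```python
def correct_text_blocks(blocks, corrector):
    """Correct text blocks with the pictures as context, keeping the original order."""
    pictures = [b["content"] for b in blocks if b["type"] == "picture"]
    corrected = []
    for block in blocks:
        if block["type"] == "picture":
            corrected.append(block)  # picture content is passed through unchanged
        else:
            fixed = corrector(block["content"], pictures)
            corrected.append({**block, "content": fixed})
    return corrected


def toy_corrector(text, pictures):
    # Hypothetical stand-in for a multimodal LLM call; it ignores the pictures
    # and only repairs one known OCR confusion (lowercase "l" for capital "I").
    return text.replace("lnterest", "Interest")


blocks = [
    {"type": "title", "content": "lnterest rates"},
    {"type": "picture", "content": b"<image bytes>"},
    {"type": "text", "content": "lnterest is paid monthly."},
]
result = correct_text_blocks(blocks, toy_corrector)
print([b["content"] for b in result])
```

Passing the pictures alongside each text block is one possible design; a real system might instead send the whole page to the model in a single request.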

In this embodiment, the image recognition and context understanding capabilities of the multimodal large language model improve the accuracy of the text content, thereby improving the accuracy of document parsing.

It should be noted that the foregoing examples are only for understanding the present application, and are not meant to limit the method of analyzing the document of the present application, and many simple changes based on the technical concept are all within the scope of the present application.

An embodiment of the present application further provides a document analysis device which, referring to FIG. 5, includes:

a detection module 10, configured to acquire a document to be analyzed and perform layout detection on the document to be analyzed to obtain element blocks;

a determining module 20, configured to determine a content extraction rule corresponding to each element block according to the layout element of the element block;

an extraction module 30, configured to process the element block according to the content extraction rule to obtain preliminary content of the element block;

a correction module 40, configured to correct the preliminary content by using the large language model to obtain the target document.
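The cooperation of the four modules can be sketched as a small pipeline. This is an illustration under stated assumptions: the class name, the dictionary-based rule lookup, and all stub functions are hypothetical, and real layout detection, OCR, and large language model calls are replaced by trivial stand-ins.

```python
class DocumentParser:
    """Chains detection, rule determination, extraction, and correction."""

    def __init__(self, detect, rules, correct):
        self.detect = detect    # role of detection module 10
        self.rules = rules      # lookup used in the role of determining module 20
        self.correct = correct  # role of correction module 40

    def parse(self, document):
        blocks = self.detect(document)
        preliminary = []
        for block in blocks:
            rule = self.rules[block["type"]]  # determine the extraction rule
            preliminary.append(rule(block))   # role of extraction module 30
        return self.correct(preliminary)


def detect(document):
    # Stub: the first line is treated as a title, the rest as body text.
    return [
        {"type": "title" if i == 0 else "text", "raw": line}
        for i, line in enumerate(document.splitlines())
    ]


rules = {"title": lambda b: b["raw"].upper(), "text": lambda b: b["raw"]}


def correct(contents):
    # Stub: stands in for LLM correction by trimming stray whitespace.
    return "\n".join(c.strip() for c in contents)


parser = DocumentParser(detect, rules, correct)
print(parser.parse("report \n body text "))
```

Keeping the rules in a lookup keyed by layout element type makes it straightforward to add a rule for a new element type without touching the pipeline.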

The document analysis device provided by the embodiment of the application can solve the technical problem of inaccurate document analysis by adopting the document analysis method in the embodiment. Compared with the prior art, the beneficial effects of the document analysis device provided by the application are the same as those of the document analysis method provided by the embodiment, and other technical features in the document analysis device are the same as those disclosed by the method of the embodiment, so that the description is omitted here.

The embodiment of the application provides a document analysis device, which comprises at least one processor and a memory in communication connection with the at least one processor, wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can execute the document analysis method in the first embodiment.

Referring now to FIG. 6, a schematic diagram of a document parsing apparatus suitable for implementing embodiments of the present application is shown. The document parsing apparatus in the embodiment of the present application may include, but is not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (Personal Digital Assistants), PADs (tablet computers), PMPs (Portable Media Players), and vehicle-mounted terminals (e.g., car navigation terminals), as well as fixed terminals such as digital TVs and desktop computers. The document parsing apparatus shown in FIG. 6 is only an example and should not be construed as limiting the function or scope of use of the embodiments of the present application.

As shown in FIG. 6, the document parsing apparatus may include a processing device 1001 (e.g., a central processing unit, a graphics processor, etc.), which may perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 1002 or a program loaded from a storage device 1003 into a random access memory (RAM) 1004. The RAM 1004 also stores various programs and data required for the operation of the document parsing apparatus. The processing device 1001, the ROM 1002, and the RAM 1004 are connected to each other by a bus 1005. An input/output (I/O) interface 1006 is also connected to the bus. In general, the following may be connected to the I/O interface 1006: an input device 1007 such as a touch screen, a touch pad, a keyboard, a mouse, an image sensor, a microphone, an accelerometer, or a gyroscope; an output device 1008 such as a liquid crystal display (LCD), a speaker, or a vibrator; a storage device 1003 such as a magnetic tape or a hard disk; and a communication device 1009. The communication device 1009 may allow the document parsing apparatus to communicate with other apparatuses wirelessly or by wire to exchange data. Although the figure shows a document parsing apparatus having various devices, it should be understood that not all of the illustrated devices are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided.

In particular, according to embodiments of the present disclosure, the processes described above with reference to flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method shown in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through a communication device, or installed from the storage device 1003, or installed from the ROM 1002. The above-described functions defined in the method of the disclosed embodiment of the application are performed when the computer program is executed by the processing device 1001.

The document analysis equipment provided by the embodiment of the application can solve the technical problem of inaccurate document analysis by adopting the document analysis method in the embodiment. Compared with the prior art, the beneficial effects of the document analysis device provided by the application are the same as those of the document analysis method provided by the embodiment, and other technical features of the document analysis device are the same as those disclosed by the method of the previous embodiment, and are not repeated here.

It is to be understood that portions of the present disclosure may be implemented in hardware, software, firmware, or a combination thereof. In the description of the above embodiments, particular features, structures, materials, or characteristics may be combined in any suitable manner in any one or more embodiments or examples.

The foregoing is merely illustrative of the present application, and the present application is not limited thereto, and any person skilled in the art will readily recognize that variations or substitutions are within the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

An embodiment of the present application provides a computer-readable storage medium having computer-readable program instructions (i.e., a computer program) stored thereon for performing the document parsing method in the above-described embodiment.

The computer readable storage medium provided by the embodiments of the present application may be, for example, a USB disk, and may also be, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system or device, or any combination of the foregoing. More specific examples of a computer-readable storage medium may include, but are not limited to, an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In this embodiment, the computer-readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system or device. Program code embodied on a computer readable storage medium may be transmitted using any appropriate medium, including but not limited to electrical wiring, fiber optic cable, RF (radio frequency), and the like, or any suitable combination of the foregoing.

The above-mentioned computer-readable storage medium may be contained in the document parsing apparatus or may exist alone without being assembled into the document parsing apparatus.

The computer readable storage medium is loaded with one or more programs, and when the one or more programs are executed by the document parsing device, the document parsing device is enabled to acquire a document to be parsed, perform layout detection on the document to be parsed to obtain element blocks, determine content extraction rules corresponding to the element blocks according to layout elements of the element blocks, process the element blocks through the content extraction rules to obtain preliminary contents of the element blocks, and correct the preliminary contents by using a large language model to obtain a target document.

Computer program code for carrying out operations of the present application may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java, Smalltalk, and C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).

The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

The modules involved in the embodiments of the present application may be implemented in software or in hardware. In some cases, the name of a module does not constitute a limitation on the module itself.

The readable storage medium provided by the embodiment of the application is a computer readable storage medium, and the computer readable storage medium stores computer readable program instructions (namely computer programs) for executing the document analysis method, so that the technical problem of inaccurate document analysis can be solved. Compared with the prior art, the beneficial effects of the computer readable storage medium provided by the application are the same as those of the document parsing method provided by the above embodiment, and are not described herein.

The embodiment of the application also provides a computer program product, which comprises a computer program, wherein the computer program realizes the steps of the document analysis method when being executed by a processor.

The computer program product provided by the embodiment of the application can solve the technical problem of inaccurate document analysis. Compared with the prior art, the beneficial effects of the computer program product provided by the application are the same as those of the document parsing method provided by the above embodiment, and are not described herein.

The foregoing description is only a partial embodiment of the present application, and is not intended to limit the scope of the present application, and all the equivalent structural changes made by the description and the accompanying drawings under the technical concept of the present application, or the direct/indirect application in other related technical fields are included in the scope of the present application.

Claims

1. A document parsing method, the document parsing method comprising:

2. The document parsing method according to claim 1, further comprising, before the step of correcting each of the preliminary contents using a large language model:

3. The document parsing method according to claim 2, wherein the step of performing segment correction on the preliminary contents of the element block using a segment correction model includes:

4. The document parsing method of claim 3, wherein the segmentation correction model includes a Bidirectional Encoder Representations from Transformers (BERT) model, and the step of determining whether the adjacent sentence pair belongs to the same paragraph using the segmentation correction model includes:

5. The document parsing method according to claim 1, wherein the step of performing layout detection on the document to be parsed to obtain the element block includes:

converting the document to be analyzed into a picture to be analyzed;

6. The document parsing method according to claim 1, wherein the step of processing the element block by the content extraction rule to obtain preliminary content of the element block includes:

determining preliminary content and style characteristics of the element block by utilizing an optical character recognition technology under the condition that the layout element is a title;

before the step of obtaining the target document, the method further comprises:

adjusting the document display style of the element block according to the title level of the element block.

7. The document parsing method according to claim 1, wherein the preliminary content includes picture content and text content, and the step of processing the element block by the content extraction rule to obtain the preliminary content of the element block includes:

8. A document parsing apparatus comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the computer program being configured to implement the steps of the document parsing method according to any one of claims 1 to 7.

9. A storage medium, characterized in that the storage medium is a computer-readable storage medium, on which a computer program is stored, which computer program, when being executed by a processor, implements the steps of the document parsing method according to any one of claims 1 to 7.

10. A computer program product, characterized in that the computer program product comprises a computer program which, when executed by a processor, implements the steps of the document parsing method according to any one of claims 1 to 7.