CN120012759B

Info

Publication number: CN120012759B
Application number: CN202510486491.3A
Authority: CN
Inventors: 赵瑞; 聂志锋; 沈泮; 王祎童
Original assignee: Beijing Big Data Center
Current assignee: Beijing Big Data Center
Priority date: 2025-04-18
Filing date: 2025-04-18
Publication date: 2025-08-05
Anticipated expiration: 2045-04-18
Also published as: CN120012759A

Abstract

The invention relates to the technical field of artificial intelligence, and in particular to a global context analysis method and system for complex documents. The method comprises: generating a catalog for a target document; establishing an association relation between pictures and text content; constructing a basic knowledge graph based on the titles of the catalog, performing semantic analysis step by step on the paragraphs corresponding to the titles, and adding entities and relations to the basic knowledge graph to obtain a global knowledge graph; mapping the entities and relations in the global knowledge graph into low-dimensional vectors using a knowledge graph embedding method; performing global relation extraction on the embedded global knowledge graph using a graph attention network model; constructing a global context chain based on the global relations; and updating the relations of the global knowledge graph based on the global context chain. The invention reduces the user's document management burden and improves the utilization efficiency and value of document information.

Description

Global context analysis method and system for complex document

Technical Field

The invention belongs to the technical field of artificial intelligence, and particularly relates to a global context analysis method and system for a complex document.

Background

With the rapid development of information technology, electronic documents have become an important tool for people to acquire information and communicate ideas. However, how to efficiently extract, organize and utilize information in massive document data is a problem to be solved. Traditional document processing methods are often limited to simple text extraction and keyword retrieval, and cannot deeply mine the inherent semantic structures and knowledge associations of documents, thereby limiting the effective utilization of information.

Constructing a knowledge graph for a document is a leading-edge document analysis strategy: the knowledge graph can assist in generating refined summaries and enables efficient content retrieval. However, extracting entities and relations from documents to construct a knowledge graph often faces the challenge that the relations between entities are complex and easily confused.

Disclosure of Invention

The invention provides a global context analysis method and system for complex documents, which aim to solve the above technical problems.

In a first aspect, the present invention provides a global context analysis method for a complex document, comprising:

Generating a catalog for a target document by performing layout analysis on the target document;

identifying the text information of the picture in the target document, and establishing the association relation between the picture and the text content by carrying out semantic analysis on the text information and the adjacent text content;

Constructing a basic knowledge graph based on the titles of the catalogues, gradually carrying out semantic analysis on paragraphs corresponding to the titles, and adding entities and relations for the basic knowledge graph to obtain a global knowledge graph;

Mapping entities and relations in the global knowledge graph into low-dimensional vectors by using a knowledge graph embedding method, and extracting global relations of the embedded global knowledge graph by using a graph attention network model;

and constructing a global context chain based on the global relationship, and updating the relationship of the global knowledge graph based on the global context chain.

In an alternative embodiment, generating a catalog for a target document by layout analysis of the target document includes:

Analyzing layout characteristics of the target document by using a layout analysis model, wherein the layout characteristics comprise paragraph distribution, title format parameters, title content and paragraph format parameters;

constructing a title hierarchy based on the layout characteristics by using a rule algorithm;

constructing a corresponding relation between a title and a paragraph based on the title hierarchy, and generating a catalog based on the corresponding relation;

And monitoring the updated content of the target document, and synchronously updating the catalog based on the updated content.

In an optional embodiment, identifying the text information of a picture in the target document, and establishing the association relation between the picture and the text content by performing semantic analysis on the text information and the adjacent text content, includes:

Extracting text information from the picture by utilizing an optical character recognition technology;

extracting adjacent text paragraphs of the picture according to the position of the picture in the target document;

Screening out target paragraphs matched with the text information from adjacent text paragraphs by using a keyword matching technology, and setting associated labels for the pictures and the target paragraphs;

carrying out semantic analysis on the text information and the target paragraph, and screening one or more sentences matched with the semantics of the text information from the target paragraph;

adding an independent mark to the text information and the semantically matched sentences, wherein the independent mark indicates that the text information and the semantically matched sentences form an inseparable, integral content unit;

when the document hierarchy is generated, a specific hierarchical position relative to the associated text is determined for the picture, so that the image-text content is displayed together as a hierarchical unit.

In an optional embodiment, constructing a basic knowledge graph based on the titles of the catalog, performing semantic analysis step by step on the paragraphs corresponding to the titles, and adding entities and relations to the basic knowledge graph to obtain a global knowledge graph, includes:

Extracting entities and entity relations from titles of the catalogs, and constructing a basic knowledge graph based on the extracted entities and entity relations;

Acquiring a lowest-level title in a catalog, and acquiring a local knowledge graph corresponding to the lowest-level title, wherein the local knowledge graph comprises an entity corresponding to the lowest-level title;

acquiring a target paragraph corresponding to the lowest-level title based on a catalog, extracting entities and relations from the target paragraph, and updating the local knowledge graph based on the extracted entities and relations;

And traversing all the lowest-level titles to obtain a global knowledge graph.

In an alternative embodiment, mapping the entities and the relations in the global knowledge graph into low-dimensional vectors by using a knowledge graph embedding method, and extracting the global relations of the embedded global knowledge graph by using a graph attention network model, including:

Mapping entities and relations in the global knowledge graph into low-dimensional vectors by adopting a translation model, and setting a loss function of the translation model as negative log likelihood loss;

Taking the low-dimensional vector of the entity and the relationship as input of a graph attention network model, wherein the graph attention network model extracts the relationship between the entities in the global knowledge graph by calculating attention coefficients between the entities and aggregating information of neighbor nodes;

And integrating the relationship between the entities in the global knowledge graph and the association relationship between the entities in different local knowledge graphs into a global relationship.

In an alternative embodiment, the method for training the graph attention network model includes:

Extracting sentences and paragraphs with similar or related semantics from the document to form a positive sample pair, and selecting sentences or paragraphs with irrelevant semantics from the document to form a negative sample pair;

using a pre-trained hierarchical attention network model to encode sentences or paragraphs in positive sample pairs and negative sample pairs to generate high-dimensional semantic vectors;

using a contrastive loss function so that, in the semantic space, the model pulls the positive sample pairs closer together and pushes the negative sample pairs farther apart;

and training the graph attention network model by using the encoded positive sample pair and the encoded negative sample pair.

In an alternative embodiment, constructing a global context chain based on the global relationship, and updating the relationship of the global knowledge graph based on the global context chain includes:

Constructing a global context chain based on the global relation by using a round-robin algorithm;

and matching the entity and the relation related to the global context chain with the global knowledge graph so as to complement the missing entity relation for the global knowledge graph based on a matching result.

In a second aspect, the present invention provides a global context analysis system for complex documents, comprising:

The catalog generation module is used for generating a catalog for the target document by carrying out layout analysis on the target document;

The picture association module is used for identifying the text information of the picture in the target document and establishing association relation between the picture and the text content by carrying out semantic analysis on the text information and the adjacent text content;

the map construction module is used for constructing a basic knowledge map based on the titles of the catalogues, carrying out semantic analysis on paragraphs corresponding to the titles step by step, and adding entities and relations for the basic knowledge map to obtain a global knowledge map;

the global analysis module is used for mapping entities and relations in the global knowledge graph into low-dimensional vectors by using a knowledge graph embedding method, and extracting global relations of the embedded global knowledge graph by using a graph attention network model;

And the context enhancement module is used for constructing a global context chain based on the global relationship and updating the relationship of the global knowledge graph based on the global context chain.

The beneficial effects of the global context analysis method and system for complex documents provided by the invention are as follows: by integrating automatic catalog generation, picture-text association, knowledge graph construction and relation extraction, and global context chain construction and relation updating, the consistency between the knowledge graph and the global context of the document is significantly improved. This not only reduces the burden on the user in terms of document management, but also improves the utilization efficiency and value of the document information.

In addition, the invention has reliable design principle, simple structure and very wide application prospect.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are required to be used in the description of the embodiments or the prior art will be briefly described below, and it will be obvious to those skilled in the art that other drawings can be obtained from these drawings without inventive effort.

FIG. 1 is a schematic flow chart of a method of one embodiment of the invention.

FIG. 2 is a schematic block diagram of a system of one embodiment of the present invention.

Detailed Description

In order to make the technical solution of the present invention better understood by those skilled in the art, the technical solution of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the present invention without making any inventive effort, shall fall within the scope of the present invention.

Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used herein in the description of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention.

The global context analysis method of the complex document provided by the embodiment of the invention is executed by the computer equipment, and correspondingly, the global context analysis system of the complex document runs in the computer equipment.

FIG. 1 is a schematic flow chart of a method of one embodiment of the invention. The method of FIG. 1 may be executed by a global context analysis system for complex documents. Depending on requirements, the order of the steps in the flow chart may be changed and some steps may be omitted.

As shown in FIG. 1, the method includes:

S1, generating a catalog for a target document by carrying out layout analysis on the target document.

A detailed layout analysis is performed on the target document. This process identifies the layout structure of the document, such as titles, paragraphs, headers and footers, in order to segment the document content accurately. Then, according to the identification result, a catalog containing the titles at each level and the corresponding paragraph indexes is automatically generated for the target document. The catalog not only lets a user browse the document structure quickly, but also provides an index framework for subsequent information retrieval and processing.

S2, identifying the text information of the picture in the target document, and establishing the association relation between the picture and the text content by carrying out semantic analysis on the text information and the adjacent text content.

Text information in a picture embedded in a target document is recognized using advanced OCR (optical character recognition) techniques. On the basis, the text content near or relevant to the picture is combined for deep semantic analysis. Through strategies such as comparison and association, an association relation between the picture and surrounding text content is established, which is helpful for better understanding the meaning and effect of the picture in the document.

S3, constructing a basic knowledge graph based on the titles of the catalogues, gradually carrying out semantic analysis on paragraphs corresponding to the titles, and adding entities and relations to the basic knowledge graph to obtain a global knowledge graph.

According to the generated catalog, an initial basic knowledge graph framework is first constructed with the titles at each level as nodes. Semantic analysis is then carried out paragraph by paragraph, following the index order of the paragraphs in the catalog. This process includes entity recognition, relation extraction and the like, and aims to convert the key information in the paragraphs into entities and relations in the graph, so that the basic knowledge graph is gradually enriched and finally forms a global knowledge graph. The global knowledge graph not only covers the core content of the document, but also reveals the inherent links between pieces of information.

S4, mapping the entities and the relations in the global knowledge graph into low-dimensional vectors by using a knowledge graph embedding method, and extracting the global relations of the embedded global knowledge graph by using a graph attention network model.

To make more effective use of the information in the global knowledge graph, a knowledge graph embedding technique is employed to map entities and relations into a low-dimensional vector space. This step reduces the complexity of data processing while preserving the key structural and semantic information in the graph. Subsequently, global relation extraction is performed on the embedded global knowledge graph using a graph attention network (GAT) model. The GAT model weighs the importance of different nodes and relations in the graph, and thereby captures association patterns more accurately at the global scope.
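The attention mechanism mentioned above can be illustrated with a minimal pure-Python sketch of a single graph attention head. It is an illustration only, not the model of the invention: the learned linear projection of a real graph attention layer is omitted (treated as the identity), and the attention vector `a`, the toy feature vectors and the node names are assumptions made for the example.

```python
import math


def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x


def attention_coefficients(node, neighbors, features, a):
    """Softmax-normalized attention of `node` over its neighbors."""
    scores = []
    for n in neighbors:
        concat = features[node] + features[n]  # list concatenation: [h_i || h_j]
        scores.append(leaky_relu(sum(w * x for w, x in zip(a, concat))))
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate(node, neighbors, features, a):
    """New representation of `node`: attention-weighted sum of neighbor features."""
    alphas = attention_coefficients(node, neighbors, features, a)
    dim = len(features[node])
    return [sum(alpha * features[n][d] for alpha, n in zip(alphas, neighbors))
            for d in range(dim)]


# Toy entity embeddings (hypothetical values).
features = {"A": [1.0, 0.0], "B": [0.0, 1.0], "C": [1.0, 1.0]}
a = [0.5, 0.5, 0.5, 0.5]  # hypothetical attention vector, normally learned
print(attention_coefficients("A", ["B", "C"], features, a))
print(aggregate("A", ["B", "C"], features, a))
```

The coefficients sum to one over the neighbors of a node, so a neighbor with a higher raw score contributes proportionally more to the aggregated representation.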

S5, constructing a global context chain based on the global relationship, and updating the relationship of the global knowledge graph based on the global context chain.

Based on the extracted global relationships, a global context chain is constructed. This chain not only reflects the flow path of the information in the document, but also reveals a deep link between the information. On the basis, the relations in the global knowledge graph are re-examined and updated, each relation in the graph is ensured to accord with the logic of the global context chain, and the accuracy and the practicability of the knowledge graph are further improved. Through the series of steps, a global knowledge graph which comprehensively and accurately reflects the content of the target document is finally obtained.

In one embodiment of the invention, based on step S1, a possible embodiment thereof will be given below as a non-limiting illustration.

S101, analyzing layout characteristics: first, a trained LayoutLMv layout analysis model performs an in-depth analysis of the target document to extract its layout characteristics. These characteristics cover not only the paragraph distribution, but also finer-grained title format parameters (e.g., font size, bolding, centering), title content (i.e., the specific text), and paragraph format parameters (e.g., paragraph indentation, line spacing). This step is the basis for all subsequent steps and ensures that later processing accurately reflects the original structure and content of the document.

S102, constructing a title hierarchy: after the detailed layout characteristics are acquired, a rule algorithm constructs the title hierarchy based on these characteristics. The hierarchy is intended to clearly reflect the hierarchical relationships between the levels of titles in the document, such as main titles, subtitles and section titles. This step is critical to understanding the structure, content hierarchy and logical relationships of the document.

The following is a simplified example of a rule algorithm, which is intended to illustrate the basic principles and steps of this process:

1. Input data preparation

Layout feature data: paragraph distribution, title format parameters (e.g., font size, bold, color, center alignment), title content, paragraph format parameters, etc.

Document content: the complete document text, including all title and body content.

2. Title identification

Title identification based on format parameters: first, all possible titles are identified based on preset format parameter thresholds (e.g., font size threshold, bold flag).

Content-assisted identification: titles whose format features are not distinctive may be identified with the aid of content analysis (e.g., keyword matching, sentence structure analysis).

3. Hierarchical relationship determination

Title level judgment: the level of each title is judged according to its format parameters (e.g., decreasing font size, degree of bolding) and content importance (e.g., the level of the keywords contained in the title).

Hierarchical relationship construction: through logical analysis, the hierarchical relationships between titles at different levels are determined, such as main title and subtitle, or subtitle and section title.

4. Error handling and verification

Conflict detection: detecting whether there is a hierarchy conflict (e.g., one title is simultaneously identified as belonging to two different levels).

Correction and adjustment: correcting the detected conflicts through manual intervention or an automatic adjustment algorithm.

Consistency verification: ensuring that the title hierarchy of the entire document is logically consistent.

5. Output title hierarchy

Structured output: outputting the title hierarchy in a structured form, such as a tree structure or a list structure.

Assume a simple document whose title hierarchy is as follows:

Main title: main title of the document
    Subtitle 1: first subtitle under the main title
        Section title 1: first section title under subtitle 1
        Section title 2: second section title under subtitle 1
    Subtitle 2: second subtitle under the main title
        Section title 3: first section title under subtitle 2

Such a hierarchy of titles can be automatically identified and built according to the rules algorithm described above.
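The rule algorithm described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation of the invention: the `Block` record, the `BODY_FONT_SIZE` threshold, and the rule that a larger font size means a higher title level are all hypothetical choices made for the example.

```python
from dataclasses import dataclass, field

BODY_FONT_SIZE = 12.0  # hypothetical threshold: larger text is treated as a title


@dataclass
class Block:
    text: str
    font_size: float
    bold: bool = False


@dataclass
class TitleNode:
    text: str
    level: int
    children: list = field(default_factory=list)


def is_title(block: Block) -> bool:
    # Format-based identification: larger than body text, or bold.
    return block.font_size > BODY_FONT_SIZE or block.bold


def build_title_hierarchy(blocks: list) -> TitleNode:
    titles = [b for b in blocks if is_title(b)]
    # Level judgment: distinct font sizes, largest first, map to levels 1, 2, 3...
    sizes = sorted({b.font_size for b in titles}, reverse=True)
    level_of = {size: i + 1 for i, size in enumerate(sizes)}

    root = TitleNode("ROOT", 0)
    stack = [root]  # path from the root to the most recent title
    for b in titles:
        level = level_of[b.font_size]
        # Pop until the top of the stack is a strictly higher-ranked title.
        while stack[-1].level >= level:
            stack.pop()
        node = TitleNode(b.text, level)
        stack[-1].children.append(node)
        stack.append(node)
    return root


def render(node: TitleNode, depth: int = 0) -> list:
    """Flatten the tree into indented lines (the artificial root is skipped)."""
    lines = [] if node.level == 0 else ["  " * (depth - 1) + node.text]
    for child in node.children:
        lines.extend(render(child, depth + 1))
    return lines


doc = [
    Block("Main title", 20), Block("intro text", 12),
    Block("Subtitle 1", 16), Block("Section title 1", 14), Block("body", 12),
    Block("Section title 2", 14), Block("Subtitle 2", 16), Block("Section title 3", 14),
]
print("\n".join(render(build_title_hierarchy(doc))))
```

The stack holds the chain of ancestors of the most recent title, so each new title is attached to the nearest preceding title of a higher level. A production rule set would also need the conflict detection and consistency checks listed in step 4.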

S103, establishing the correspondence between titles and paragraphs and generating a catalog: based on the constructed title hierarchy, the correspondence between each title and its paragraphs is established. A detailed catalog is then automatically generated from these correspondences. The catalog not only lists the titles at all levels, but also provides the corresponding page numbers or paragraph positions, which greatly facilitates browsing and searching the document.

The catalog generation method comprises the following steps:

After the title hierarchy is identified, the catalog is automatically generated using a catalog generation tool. The catalog may include information such as the names of the titles at each level, page numbers, or paragraph positions.

S104, catalog synchronous updating mechanism: since the target document may be updated continuously over time, a monitoring and synchronous updating mechanism is designed to keep the catalog accurate and timely. The mechanism monitors the updated content of the document in real time, including but not limited to newly added paragraphs and modified titles or paragraph content. Once an update is detected, the mechanism automatically triggers a synchronous update of the catalog, ensuring that the catalog always remains consistent with the document content.
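One simple way to detect updates is to compare content fingerprints between checks. The sketch below assumes this hash-based approach, which the source does not prescribe; `CatalogSynchronizer`, the `build_catalog` callable and the toy rule that titles start with `#` are hypothetical names introduced for illustration.

```python
import hashlib


def fingerprint(paragraphs: list) -> str:
    """Hash the document content so that any edit changes the fingerprint."""
    digest = hashlib.sha256()
    for p in paragraphs:
        digest.update(p.encode("utf-8"))
        digest.update(b"\x00")  # separator so ["ab", "c"] differs from ["a", "bc"]
    return digest.hexdigest()


class CatalogSynchronizer:
    def __init__(self, build_catalog):
        self.build_catalog = build_catalog  # hypothetical callable: paragraphs -> catalog
        self.last_fingerprint = None
        self.catalog = None

    def sync(self, paragraphs: list) -> bool:
        """Rebuild the catalog only when the content changed; return True if rebuilt."""
        current = fingerprint(paragraphs)
        if current == self.last_fingerprint:
            return False
        self.catalog = self.build_catalog(paragraphs)
        self.last_fingerprint = current
        return True


# Toy catalog builder: treats lines starting with "#" as titles (an assumption).
synchronizer = CatalogSynchronizer(lambda ps: [p for p in ps if p.startswith("#")])
synchronizer.sync(["# Overview", "some text"])
synchronizer.sync(["# Overview", "some text", "# Details"])
print(synchronizer.catalog)
```

Calling `sync` on a timer or from a file-change notification would give the monitoring behavior described above; rebuilding the whole catalog on every change is the simplest policy, and an incremental update would be an optimization.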

In one embodiment of the present invention, based on step S2, a possible embodiment thereof will be given below as a non-limiting illustration.

S201, extracting text information from pictures by utilizing optical character recognition technology

Technical details: first, each picture in the document is scanned pixel by pixel using high-precision optical character recognition (OCR) technology, and the text embedded in or attached to the picture is recognized and extracted. The OCR technology must be accurate and robust enough to cope with picture text in different fonts, sizes and backgrounds.

S202, extracting adjacent text paragraphs of the picture according to the position of the picture in the target document

1. Document format identification

PDF parsing: if the document is in PDF format, a PDF parsing library (such as PyMuPDF or PDFMiner) is used to read detailed information such as the page content, fonts, sizes and positions.

Word document parsing: for Word documents, libraries such as python-docx can be used to parse the positions and attributes of paragraphs, tables, pictures and other elements of the document.

2. Adjacent text paragraph extraction

Proximity search: after the position of the picture is determined, the layout information of the document is used to search above, below, to the left and to the right of the picture for the text paragraphs adjacent to it.
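The proximity search can be sketched with plain bounding boxes, independent of any particular parsing library. The `Box` record, the `max_gap` threshold of 30 units and the convention that y grows downward are assumptions for this example; in practice the coordinates would come from the PDF or Word parser.

```python
from dataclasses import dataclass


@dataclass
class Box:
    x0: float
    y0: float
    x1: float
    y1: float  # page coordinates, y grows downward (an assumption)


def gap(a: Box, b: Box) -> float:
    """Distance between two axis-aligned boxes; 0 if they touch or overlap."""
    dx = max(a.x0 - b.x1, b.x0 - a.x1, 0.0)
    dy = max(a.y0 - b.y1, b.y0 - a.y1, 0.0)
    return (dx * dx + dy * dy) ** 0.5


def adjacent_paragraphs(picture: Box, paragraphs: dict, max_gap: float = 30.0) -> list:
    """Return paragraph ids within max_gap of the picture, nearest first."""
    scored = [(gap(picture, box), pid) for pid, box in paragraphs.items()]
    return [pid for dist, pid in sorted(scored) if dist <= max_gap]


picture = Box(100, 200, 300, 400)
paragraphs = {
    "above": Box(100, 150, 300, 190),
    "below": Box(100, 420, 300, 460),
    "far": Box(100, 700, 300, 740),
}
print(adjacent_paragraphs(picture, paragraphs))
```

Because the gap is measured between box edges rather than centers, a long paragraph directly under a picture counts as close even when its center is far away.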

S203, screening out target paragraphs matched with the text information from adjacent text paragraphs by using a keyword matching technology, and setting associated labels for the pictures and the target paragraphs

Keyword matching: the text information extracted from the picture is matched against the adjacent text paragraphs to find paragraphs containing related or similar information. The matching algorithm can be based on strategies such as word frequency, TF-IDF weights and semantic similarity.

Association labeling: once a matching target paragraph is found, a unique association label is set for the picture and the target paragraph, identifying their association relation in subsequent processing.
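As an illustration of the TF-IDF strategy, the following sketch scores each adjacent paragraph against the OCR text by cosine similarity of TF-IDF vectors and returns the best match. It is a simplified assumption of how such matching might look: the tokenizer only handles lowercase alphanumeric words (a Chinese document would need a word segmenter), and the `threshold` value of 0.1 is arbitrary.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list:
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs: list) -> list:
    tokenized = [Counter(tokenize(d)) for d in docs]
    n = len(docs)
    df = Counter(term for counts in tokenized for term in counts)
    # Smoothed IDF so that terms present in every document keep a small weight.
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    return [{t: tf * idf[t] for t, tf in counts.items()} for counts in tokenized]


def cosine(a: dict, b: dict) -> float:
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0


def match_target_paragraph(ocr_text: str, paragraphs: list, threshold: float = 0.1):
    """Return (index, score) of the best-matching paragraph, or None below threshold."""
    if not paragraphs:
        return None
    vectors = tfidf_vectors([ocr_text] + paragraphs)
    query, candidates = vectors[0], vectors[1:]
    scores = [cosine(query, v) for v in candidates]
    best = max(range(len(scores)), key=scores.__getitem__)
    return (best, scores[best]) if scores[best] >= threshold else None


paragraphs = [
    "Figure 2 shows quarterly revenue growth by region.",
    "The committee met on Tuesday to discuss staffing.",
]
print(match_target_paragraph("Quarterly revenue by region", paragraphs))
```

The index returned for the matched paragraph, together with the picture identifier, is what an association label would record.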

S204, carrying out semantic analysis on the text information and the target paragraph, and screening one or more sentences matched with the text information from the target paragraph

Semantic analysis: natural language processing (NLP) technology is used to perform in-depth semantic analysis on the extracted text information and the target paragraph, and to identify the semantic links between them. This may involve NLP tasks such as syntactic analysis, entity recognition and semantic role labeling.

Sentence screening: based on the result of the semantic analysis, the one or more sentences that best match the semantics of the picture text information are screened out of the target paragraph. These sentences should accurately reflect the content that links the picture and the text.

S205, adding an independent mark to the text information and the semantically matched sentences, wherein the independent mark indicates that the text information and the semantically matched sentences form an inseparable, integral content unit

Independent identification: a unique identifier or label is added to the screened semantically matched sentences and the text information. The identifier is unique within the document and identifies them as one integral content unit.

After the independent mark is added, the text information and the semantically matched sentences can be further integrated into a single content unit, which facilitates subsequent processing and display.

S206, when generating the hierarchical structure of the document, determining for the picture a specific hierarchical position relative to the associated text, so that the image-text content is displayed together as a hierarchical unit

Hierarchy generation: when the hierarchical structure of the document is constructed, a specific hierarchical position is determined for the picture according to the logical relation between the picture and its associated text (including the semantically matched sentences).

Integrated image-text display: the picture and its associated text are displayed as one hierarchical unit in the final document, reflecting the close association between them. This may require adjusting the layout and format of the document.

The associated picture text information and the corresponding text content in the document are then processed into a structured form to facilitate subsequent data management and information extraction. The specific operations are as follows. Picture labeling: label information, including the picture number, title and associated text paragraphs, is added to each picture for traceability. Associated paragraph synchronization: the text content recognized by OCR is synchronized with the corresponding paragraphs in the document, so that the picture information becomes part of the document structure. The system also stores the structured image-text data in the hierarchical structure of the document and specifies the hierarchical position of the picture content in the document structure, ensuring the integrity of the document structure.

In addition, by setting independent labels for the pictures and the associated text contents, the binding of the pictures and the text contents is realized, the pictures and the text contents are ensured not to be split in the subsequent analysis, and serious deviation of the subsequent semantic analysis is avoided.

In one embodiment of the present invention, based on step S3, a possible embodiment thereof will be given below as a non-limiting illustration.

S301, extracting entity and entity relation from the title of the catalogue, and constructing a basic knowledge graph based on the extracted entity and entity relation

Entity and relation extraction: first, the titles at each level of the catalog are analyzed in detail, and natural language processing (NLP) techniques such as named entity recognition (NER) and relation extraction algorithms are used to identify key entities (e.g., person names, place names, organization names) and the potential relations between them (e.g., superordinate-subordinate relations, inclusion relations) from the title text.

Basic knowledge graph construction: based on the extracted entities and relations, a knowledge graph containing these basic entities and relations is constructed using a graph construction technology such as the Neo4j graph database. This graph serves as the basis for the subsequent local knowledge graph construction and global knowledge graph integration.

S302, acquiring the lowest-level title in the catalogue, and acquiring a local knowledge graph corresponding to the lowest-level title, wherein the local knowledge graph comprises an entity corresponding to the lowest-level title

Lowest-level title identification: based on the hierarchical structure of the catalog, the lowest-level titles are identified. These generally correspond to the most specific, detailed content in the document.

Local knowledge graph generation: for each lowest-level title, a local knowledge graph containing the entity corresponding to that title is generated according to the title's position in the basic knowledge graph and its related entities. This local graph focuses on the entities and relations closely related to the content under the lowest-level title.

S303, obtaining a target paragraph corresponding to the lowest-level title based on the catalogue, extracting entities and relations from the target paragraph, and updating the local knowledge graph based on the extracted entities and relations

Target paragraph acquisition: the catalog is used to quickly locate the target paragraph corresponding to each lowest-level title.

Paragraph content analysis: the target paragraph is analyzed in detail, and NLP technology is used to extract the entities and relations in it.

Local knowledge graph updating: the extracted entities and relations are compared and integrated with the existing local knowledge graph, and the information in the graph is updated to ensure its accuracy and completeness.

S304, traversing all the lowest-level titles to obtain a global knowledge graph.

Traversal: all the lowest-level titles are traversed in catalog order, steps S302 and S303 are executed for each title, and the corresponding local knowledge graph is updated.

Global knowledge graph integration: during the traversal, the entities and relationships in each local knowledge graph are gradually integrated into the global knowledge graph. Graph merging techniques keep the entities and relationships in the global graph consistent, forming a complete and coherent knowledge system.

Verification and optimization: finally, the global knowledge graph is verified and optimized by checking whether the entities and relationships in the graph are correct, ensuring the quality and usability of the graph.

Through the above steps, a basic knowledge graph is constructed by extracting entities and relationships from the catalog of the document, local knowledge graphs are generated and updated based on the lowest-level titles, and the local graphs are finally integrated into the global knowledge graph. This process improves the efficiency of understanding and analyzing the document content and supports subsequent knowledge mining and application.
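The flow of steps S301 to S304 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: the catalog is assumed to be a nested dict with hypothetical keys `title`, `children`, and `paragraph`; `extract_triples` is a hypothetical stand-in for the NER and relation extraction models (it only parses `head|relation|tail` strings); and the basic graph is simplified to `contains` edges between parent and child titles.

```python
from typing import Dict, List, Set, Tuple

Triple = Tuple[str, str, str]  # (head entity, relation, tail entity)

def extract_triples(text: str) -> Set[Triple]:
    """Hypothetical extractor: parses 'head|relation|tail' items separated by ';'.
    A real system would call NER / relation extraction models here."""
    triples: Set[Triple] = set()
    for item in text.split(";"):
        parts = [p.strip() for p in item.split("|")]
        if len(parts) == 3 and all(parts):
            triples.add((parts[0], parts[1], parts[2]))
    return triples

def lowest_level_titles(node: Dict) -> List[Dict]:
    """Return the leaf nodes of the catalog tree in catalog order."""
    children = node.get("children", [])
    if not children:
        return [node]
    leaves: List[Dict] = []
    for child in children:
        leaves.extend(lowest_level_titles(child))
    return leaves

def build_global_graph(catalog: Dict) -> Set[Triple]:
    # S301 (simplified): basic graph from parent/child title relations.
    basic: Set[Triple] = set()
    stack = [catalog]
    while stack:
        node = stack.pop()
        for child in node.get("children", []):
            basic.add((node["title"], "contains", child["title"]))
            stack.append(child)
    global_graph = set(basic)
    for leaf in lowest_level_titles(catalog):
        # S302: local graph = basic triples that involve this lowest-level title.
        local = {t for t in basic if leaf["title"] in (t[0], t[2])}
        # S303: update the local graph with triples from the target paragraph.
        local |= extract_triples(leaf.get("paragraph", ""))
        # S304: merge the local graph into the global graph.
        global_graph |= local
    return global_graph
```

Representing each graph as a set of triples makes the merge in S304 a plain set union, so duplicate entities and relationships from different local graphs collapse automatically; entity disambiguation, which a production system would need, is outside this sketch.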

In one embodiment of the present invention, a possible implementation of step S4 is given below as a non-limiting illustration.

S401, mapping entities and relations in a global knowledge graph into low-dimensional vectors by adopting a translation model, and setting a loss function of the translation model as negative log likelihood loss.

Step description: first, the entities and relationships in the global knowledge graph are embedded into a low-dimensional vector space using a translation model (e.g., TransE or TransR). These low-dimensional vectors capture the semantic information of entities and relationships and are convenient for subsequent processing by the graph attention network model.

Loss function setting: to optimize the parameters of the translation model, the loss function is set to the negative log likelihood loss. Minimizing this loss separates the scores of positive samples from the scores of negative samples, which improves the accuracy and robustness of the embedding vectors.
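One common way to combine a TransE-style score with a negative log likelihood loss is sketched below. The exact loss of this embodiment is not given in the text, so the form used here, including the margin `gamma` and the function names, is an assumption for illustration only.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]
VecTriple = Tuple[Vector, Vector, Vector]  # (head, relation, tail) embeddings

def transe_distance(h: Vector, r: Vector, t: Vector) -> float:
    """L2 distance ||h + r - t||; small for plausible triples."""
    return math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))

def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def nll_loss(positives: List[VecTriple], negatives: List[VecTriple],
             gamma: float = 1.0) -> float:
    """Negative log likelihood over positive and negative triples.
    gamma is an assumed margin; it is not specified in the source."""
    loss = 0.0
    for h, r, t in positives:
        # Positive triples should have distance below gamma.
        loss -= log_sigmoid(gamma - transe_distance(h, r, t))
    for h, r, t in negatives:
        # Negative (corrupted) triples should have distance above gamma.
        loss -= log_sigmoid(transe_distance(h, r, t) - gamma)
    return loss
```

The loss is small when positive triples satisfy head + relation ≈ tail and negative triples do not, which is the behavior the loss function setting above asks for.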

S402, taking low-dimensional vectors of entities and relations as input of a graph attention network model, wherein the graph attention network model extracts relations among the entities in a global knowledge graph through calculating attention coefficients among the entities and aggregating information of neighbor nodes;

Wherein the graph attention network model comprises:

Input layer: the initial embedding vectors of the entities are taken as the input of the GAT model. These vectors serve as the feature representations of the nodes.

Attention mechanism layer: a GAT layer is built in which each node aggregates the information of its neighbor nodes through an attention mechanism. The attention coefficients are determined by the similarity or correlation between nodes and may be calculated using dot products, bilinear transforms, or neural networks.

Multi-head attention: to enhance the stability and expressive power of the model, a multi-head attention mechanism may be introduced. Each GAT layer then contains multiple independent attention heads, each of which calculates attention coefficients and aggregates neighbor information independently. Finally, the outputs of these heads are concatenated or averaged to form the final node representation.

Multiple GAT layers: to capture global context information, multiple GAT layers may be stacked. Each layer updates the node representations based on the output of the previous layer. As the number of layers increases, a node can aggregate information from more distant neighbors, so that semantic information propagates and is integrated across entities and relationships.

The training method of the graph attention network model comprises the following steps:

1. Data preparation phase

Document preprocessing: the input document is first preprocessed, including removing stop words and punctuation marks and performing stemming or lemmatization, to reduce noise and standardize the text data.

Positive sample pair construction: sentences and paragraphs with similar or related semantics are extracted from the preprocessed document using semantic or syntactic analysis techniques and combined into positive sample pairs. The sentences or paragraphs in a positive sample pair should be highly similar or related in content, topic, sentiment, or the like.

Negative sample pair construction: similarly, semantically unrelated sentences or paragraphs are selected from the document, either randomly or based on a policy (e.g., topic irrelevance), to form negative sample pairs. The sentences or paragraphs in a negative sample pair should differ significantly in content.

2. Feature extraction stage

Application of the pre-trained hierarchical attention network model: the sentences or paragraphs in the positive and negative sample pairs are encoded using a pre-trained hierarchical attention network model. This model can capture attention information within and between sentences, thereby generating high-dimensional vector representations rich in semantic information.

Vector representation optimization: during encoding, the generated vector representations may need further optimization, for example reducing redundant information through dimensionality reduction techniques (e.g., PCA or t-SNE) or ensuring comparability between vectors through normalization.

3. Model training stage

Contrastive loss function design: the NT-Xent contrastive loss function is used, which measures the distance between positive and negative sample pairs in the semantic space. For positive sample pairs, the loss encourages the model to pull the two samples closer together; for negative sample pairs, it encourages the model to push them farther apart.

Model parameter optimization: the parameters of the graph attention network model are iteratively updated using an optimization algorithm such as gradient descent in combination with the contrastive loss function. In each iteration, the loss value of the model under the current parameters is calculated, and the parameters are updated according to the gradient of the loss.

Early stopping and validation: a validation set is used to monitor the performance of the model during training. When performance on the validation set no longer improves, an early stopping strategy is applied to prevent overfitting. The validation set can also be used to tune the hyperparameters of the model (such as the learning rate and batch size).
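The NT-Xent loss named in the model training stage can be sketched as below. This follows the usual in-batch formulation, in which the two members of each positive pair attract each other and every other vector in the batch acts as a negative; whether the embodiment uses in-batch negatives or its explicitly constructed negative pairs is not stated, so that choice, the temperature value, and the function names are assumptions. Vectors are assumed to be non-zero.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]

def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nt_xent(pairs: List[Tuple[Vector, Vector]], temperature: float = 0.5) -> float:
    """Mean NT-Xent loss over a batch of positive pairs."""
    # Flatten so that indices 2k and 2k+1 hold the two members of pair k.
    views = [z for pair in pairs for z in pair]
    n = len(views)
    total = 0.0
    for i in range(n):
        partner = i ^ 1  # 0<->1, 2<->3, ...
        logits = [cosine(views[i], views[k]) / temperature
                  for k in range(n) if k != i]
        positive = cosine(views[i], views[partner]) / temperature
        # Stable log-sum-exp over all candidates except the anchor itself.
        m = max(logits)
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_sum_exp - positive
    return total / n
```

A batch whose pairs are genuinely similar yields a lower loss than a batch in which dissimilar vectors are paired, so minimizing the loss by gradient descent pulls positive pairs together and pushes other samples apart.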

4. Model evaluation and optimization

Performance evaluation: after training is completed, a test set is used to evaluate the performance of the model. The evaluation metrics may include accuracy, recall, and F1 score, which reflect the ability of the model to handle semantic similarity and relevance tasks.

Model optimization: the model is further optimized according to the evaluation results. This may include adjusting the model structure, adding attention mechanisms, or introducing external knowledge.

S403, performing similarity calculation on the entities in different local knowledge graphs by using the graph attention network model, and constructing association relationships between the entities in different local knowledge graphs based on the calculation results;

After the embedding vectors of the global knowledge graph are obtained, similarity calculation is performed on the entities in different local knowledge graphs using the trained graph attention network model. The similarity and relevance between entities can be evaluated by calculating indexes such as the cosine similarity or Euclidean distance between their vectors.

Association relationships between the entities in different local knowledge graphs are then constructed based on the similarity calculation results. These associations may be represented as edges or links between entities for use in the subsequent global relationship integration.
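Step S403 can be illustrated with cosine similarity over entity vectors from two local knowledge graphs. This is a sketch only: the `similar_to` relation name and the `threshold` value are hypothetical, and the vectors are assumed to be non-zero node representations already produced by the trained model.

```python
import math
from itertools import product
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]

def cosine_similarity(u: Vector, v: Vector) -> float:
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def associate(local_a: Dict[str, Vector], local_b: Dict[str, Vector],
              threshold: float = 0.8) -> List[Tuple[str, str, str, float]]:
    """Link entities from two local graphs whose vectors are similar enough.
    Returns (entity_a, relation, entity_b, score) edges."""
    edges = []
    for (name_a, vec_a), (name_b, vec_b) in product(local_a.items(), local_b.items()):
        score = cosine_similarity(vec_a, vec_b)
        if score >= threshold:  # threshold is an assumed hyperparameter
            edges.append((name_a, "similar_to", name_b, score))
    return edges
```

Keeping the score on each edge lets the later integration step assign a lower weight to these indirect relationships than to the direct relationships already in the global knowledge graph.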

S404, integrating the relationship between the entities in the global knowledge graph and the association relationship between the entities in different local knowledge graphs into a global relationship.

The relationships between the entities in the global knowledge graph and the association relationships between the entities in different local knowledge graphs are integrated to form a more complete and comprehensive global knowledge graph. The integrated graph contains both the direct relationships in the global knowledge graph and the indirect relationships obtained by similarity calculation across the local knowledge graphs.

The integration may use a graph merging algorithm or the like to merge and integrate relationships from different sources. Different weights are set for direct and indirect relationships to distinguish them, which helps ensure the accuracy and reliability of the integrated global knowledge graph.

In one embodiment of the present invention, a possible implementation of step S5 is given below as a non-limiting illustration.

S501, constructing a global context chain based on the global relationship by using a graph wheel algorithm.

After the integrated global knowledge graph is obtained, a graph wheel algorithm is further used to construct the global context chain. The graph wheel algorithm is a graph-structure-based algorithm that builds a series of interrelated, logically ordered context chains by analyzing the nodes (entities) and edges (relationships) in the graph. These context chains capture complex relationships between entities and reveal how they change across the global context.

The specific implementation is as follows:

Node selection and ordering: first, a group of core nodes is selected from the global knowledge graph as starting points. The core nodes may be key entities in the global knowledge graph or representative entities obtained by similarity calculation across the local knowledge graphs. The nodes are then ordered according to the relationships and weights between the entities to form a logically coherent context chain.

Relationship linking: based on the node ordering, the nodes are connected through relationship links taken from the global relationship. A relationship link may be a direct relationship in the global knowledge graph or an indirect relationship obtained by similarity calculation across the local knowledge graphs. The relationship links yield a complete context chain that reflects the complex relationships and dynamic changes between entities.

Context chain optimization: to improve the accuracy and reliability of the context chain, the constructed chain is optimized. This includes removing redundant nodes and relationships, adjusting the order of nodes, and merging similar context chains. Optimization yields a more concise and clear global context chain.

S502, matching the entities and relationships involved in the global context chain with the global knowledge graph, so as to complete the missing entity relationships of the global knowledge graph based on the matching result.

Matching algorithm selection: a suitable matching algorithm is selected to match the global context chain with the global knowledge graph. It may be a similarity-based, rule-based, or machine-learning-based matching algorithm. Selecting a suitable algorithm improves the accuracy and efficiency of matching.

Matching: during matching, the entities and relationships in the global context chain are compared one by one with those in the global knowledge graph. Matched entities and relationships are skipped. For unmatched entities and relationships, the reason for the mismatch is analyzed and they are treated as completion candidates; for example, if an entity relationship exists in the global context chain but not in the global knowledge graph, the corresponding entity relationship is added to the global knowledge graph.

Missing relationship completion: after matching is finished, the missing entity relationships are completed in the global knowledge graph according to the matching result. This includes adding new entity relationships and updating existing ones. Completing the missing relationships further improves the structure and content of the global knowledge graph and its value in practical applications.

Verification and optimization: finally, the completed global knowledge graph is verified and optimized. This includes checking the accuracy and rationality of the completed relationships and adjusting relationship weights and importance. Verification and optimization help ensure the accuracy and reliability of the global knowledge graph for subsequent applications.

Through the above steps, global context chains are constructed using the graph wheel algorithm, and the entities and relationships in the chains are matched with the global knowledge graph to complete the missing entity relationships, thereby establishing relationships between content in different paragraphs. This further improves the structure and content of the global knowledge graph and its value in practical applications.
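The matching and completion in S502 can be sketched with exact triple matching. Exact equality is the simplest of the matching strategies mentioned above and is chosen here only for illustration; a similarity-based or learned matcher would replace the membership test. The function name is hypothetical.

```python
from typing import Iterable, List, Set, Tuple

Triple = Tuple[str, str, str]  # (head entity, relation, tail entity)

def complete_graph(global_graph: Set[Triple],
                   context_chain: Iterable[Triple]) -> Tuple[Set[Triple], List[Triple]]:
    """Compare chain triples with the graph one by one.
    Matched triples are skipped; unmatched ones are added as completions."""
    updated = set(global_graph)  # leave the caller's graph untouched
    added: List[Triple] = []
    for triple in context_chain:
        if triple in updated:
            continue  # already present: nothing to complete
        updated.add(triple)
        added.append(triple)
    return updated, added
```

Returning the list of added triples separately keeps them available for the verification step, where their accuracy and weights are checked.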

After global context tracking and semantic enhancement is completed, the system enters the content induction and information extraction phase. The step aims at extracting and summarizing multi-level key information from the document and providing a concise and accurate content summary for a user.

The specific implementation process is as follows:

Hierarchical structure identification: using the previously established document hierarchy and knowledge graph, the system identifies the different levels of content in the document, such as problems, requirements, goals, tasks, and solutions.

Key content positioning: the key content of each level is precisely located through semantic representations and entity relationships. In the knowledge graph, nodes represent entities (such as problems and requirements) and edges represent relationships (such as 'results in', 'satisfies', and 'realizes'), which defines the semantic position of the content at each level.

Summary model application: a pre-trained summary generation model (such as a Transformer-based text summarization model) is used to generate a concise summary for the content of each level.

Context information fusion: global context information is combined when generating the summary, so that the summary is consistent with the overall semantics of the document. The model takes the context and associated content into account when generating the summary, avoiding fragmented or distorted information.

Hierarchical content organization: the extracted content is organized according to its hierarchical relationships into a structure running from high level to low level. For example, content is developed step by step from problem to requirement to task, building a clear content architecture.

Structured data storage: the summarized content is stored in structured form to facilitate subsequent retrieval and analysis. Data formats such as JSON and XML may be used to reflect the hierarchy and associations of the content.
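As an illustration of structured storage, the sketch below serializes a small problem-to-requirement-to-task hierarchy to JSON. The class name `ContentNode` and the field names `kind`, `summary`, and `children` are hypothetical; the source only states that formats such as JSON or XML may be used.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List

@dataclass
class ContentNode:
    kind: str      # level label, e.g. "problem", "requirement", "task" (assumed labels)
    summary: str   # generated summary text for this level
    children: List["ContentNode"] = field(default_factory=list)

def to_json(root: ContentNode) -> str:
    """Serialize the hierarchy; nesting in the JSON mirrors the content levels."""
    return json.dumps(asdict(root), ensure_ascii=False, indent=2)

# Example hierarchy with made-up content.
root = ContentNode(
    "problem", "Reports are hard to search",
    [ContentNode("requirement", "Index every section",
                 [ContentNode("task", "Generate a catalog")])],
)
serialized = to_json(root)
```

Because the nesting of the JSON objects follows the hierarchy, a retrieval component can walk from a problem to its requirements and tasks without consulting the knowledge graph again.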

Logic chain construction: based on the knowledge graph and global context tracking, the system constructs logic chains between content at different levels, for example which requirements a problem leads to, and which tasks and solutions those requirements in turn correspond to.

Association display: when the content is presented, the associations between content at different levels are highlighted to help the user understand the logical relationships. The interrelationships may be shown visually, for example in a flowchart or tree diagram.

Redundancy detection: redundant information in the content is identified through semantic similarity calculation to avoid repeated descriptions. Semantically similar or duplicate content is merged by the system. Similar or related content is integrated into a more comprehensive and unified expression, ensuring the integrity and consistency of the information.

Visual presentation: the summarized content is displayed intuitively as charts, tree structures, and the like, reflecting the hierarchy and associations of the content. Users can quickly browse and locate content of interest through an interactive interface.

Language fluency: when summaries are generated, natural language generation technology is used to ensure that the output text is clear, fluent, and easy to understand.

User customization: users may set parameters such as the length and level of detail of the summary to generate content that meets their needs. Users may also select a level or topic of interest to obtain a personalized content summary.

Quality evaluation: the system evaluates the quality of the generated summaries, checking the completeness, accuracy, and language quality of the information to ensure high-quality output.

The embodiment provides a global context analysis method of a complex document, which comprises the following steps:

1. Generating a document catalog

The layout features of the document are analyzed using a layout analysis model, and a title hierarchy is constructed using a rule algorithm to generate the catalog. The layout analysis model learns and classifies features such as the paragraph distribution and title format parameters of the document based on a machine learning algorithm.

The flow is as follows:

Let the layout feature vector of the document be x = (x₁, x₂, ⋯, x_n), where x_i represents the i-th layout feature, such as paragraph spacing or title font size.

The feature vectors are classified using the layout analysis model (e.g., a support vector machine) to obtain a preliminary classification result for the titles. The decision function of the support vector machine is:

f(x) = sign(Σ_i α_i y_i K(x_i, x) + b)

where, in this formula, x_i denotes the i-th training sample, α_i is the Lagrange multiplier, y_i is the sample label, K(x_i, x) is the kernel function (e.g., the radial basis function K(x_i, x) = exp(−γ∥x_i − x∥²)), and b is the bias term.
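The support vector machine decision rule, the sign of the kernel-weighted sum over support vectors plus the bias, can be evaluated directly as in the sketch below. The support vectors, multipliers, labels, bias, and γ used in the example are made-up values for illustration; in practice they come from training the layout analysis model.

```python
import math
from typing import Sequence

Vector = Sequence[float]

def rbf_kernel(xi: Vector, x: Vector, gamma: float) -> float:
    """Radial basis function kernel exp(-gamma * ||xi - x||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(xi, x)))

def svm_decision(x: Vector, support_vectors: Sequence[Vector],
                 alphas: Sequence[float], labels: Sequence[int],
                 bias: float, gamma: float) -> int:
    """Return +1 (title) or -1 (body text) from the sign of the decision score."""
    score = sum(a * y * rbf_kernel(sv, x, gamma)
                for sv, a, y in zip(support_vectors, alphas, labels)) + bias
    return 1 if score >= 0 else -1

# Hypothetical trained parameters: features are (font size, spacing above).
SUPPORT_VECTORS = [(18.0, 12.0), (11.0, 4.0)]
ALPHAS = [1.0, 1.0]
LABELS = [1, -1]   # +1 = title, -1 = body paragraph
BIAS = 0.0
GAMMA = 0.05
```

With these values, a block set in 17 pt type with 11 pt spacing lies close to the title support vector and is classified as a title, while an 11.5 pt block with 5 pt spacing is classified as body text.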

From the classification results, a title hierarchy is constructed using a rule algorithm (e.g., title-based hierarchical relationship rules). Let the level of the title T_i be l_i, determined by the following rule:

The correspondence between titles and paragraphs is constructed based on the title hierarchy, and the catalog is generated.
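Once each title has a level, the catalog can be assembled with a stack, as in the following sketch. It assumes the titles arrive in document order as hypothetical `(level, title, paragraphs)` tuples with levels starting at 1; how the levels themselves are determined is left to the rule algorithm.

```python
from typing import Dict, List, Sequence, Tuple

Heading = Tuple[int, str, Sequence[str]]  # (level, title, paragraphs under the title)

def build_catalog(headings: List[Heading]) -> Dict:
    """Build a nested catalog; each node keeps its title, level, paragraphs, children."""
    root = {"title": "ROOT", "level": 0, "paragraphs": [], "children": []}
    stack = [root]  # path from the root to the most recent title
    for level, title, paragraphs in headings:
        node = {"title": title, "level": level,
                "paragraphs": list(paragraphs), "children": []}
        # Pop until the top of the stack is a strictly higher-level (smaller number) title.
        while stack[-1]["level"] >= level:
            stack.pop()
        stack[-1]["children"].append(node)
        stack.append(node)
    return root
```

Each node stores its own paragraphs, so the catalog doubles as the title-to-paragraph correspondence, and the lowest-level titles are simply the nodes with an empty `children` list.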

2. Establishing association relation between pictures and text contents

Text information in the picture is extracted using optical character recognition (OCR) technology, and an association between the picture and the adjacent text content is established through keyword matching and semantic analysis.

Text information W_image is extracted from the picture by OCR technology.

Adjacent text paragraphs P_adjacent are extracted according to the position of the picture in the document.

The matching degree between the text information and the adjacent text paragraphs is calculated using a keyword matching technique (e.g., the TF-IDF algorithm). Let t be a keyword, TF_t,image be the term frequency of t in W_image, and IDF_t be the inverse document frequency of t in the document; then the TF-IDF value of t in W_image is:

TF-IDF_t,image = TF_t,image × IDF_t

The TF-IDF value TF-IDF_t,adjacent of t in P_adjacent is obtained similarly. The matching degree S_match of W_image and P_adjacent is defined as:

Semantic analysis (e.g., using the cosine similarity of word vectors) is performed on the text information and the target paragraph. Let the word vectors of W_image and P_adjacent be v_image and v_adjacent, respectively; then their semantic similarity S_semantic is:

S_semantic = (v_image · v_adjacent) / (∥v_image∥ ∥v_adjacent∥)

The target paragraphs are screened out according to the matching degree and the similarity, and association labels are set for the picture and the target paragraphs.
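The keyword matching step can be sketched with a small TF-IDF implementation. Two choices here are assumptions rather than part of the source: the smoothed inverse document frequency log((1 + N) / (1 + df)) + 1, and the use of cosine similarity between TF-IDF vectors as the matching degree. Tokenization is reduced to whitespace splitting.

```python
import math
from collections import Counter
from typing import Dict, List

def term_frequency(tokens: List[str]) -> Dict[str, float]:
    """Relative frequency of each token in one text."""
    counts = Counter(tokens)
    return {t: c / len(tokens) for t, c in counts.items()}

def inverse_document_frequency(corpus: List[List[str]]) -> Dict[str, float]:
    """Smoothed IDF over all texts in the corpus (assumed variant)."""
    n = len(corpus)
    df = Counter(t for doc in corpus for t in set(doc))
    return {t: math.log((1 + n) / (1 + d)) + 1.0 for t, d in df.items()}

def tfidf(tokens: List[str], idf: Dict[str, float]) -> Dict[str, float]:
    """TF-IDF value of each token: term frequency times IDF."""
    return {t: f * idf.get(t, 0.0) for t, f in term_frequency(tokens).items()}

def match_degree(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Assumed matching degree: cosine similarity of two sparse TF-IDF vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Example: OCR text from a picture and two adjacent paragraphs (made-up content).
image_text = "annual revenue chart".split()
paragraph_1 = "the chart shows annual revenue growth".split()
paragraph_2 = "staff training schedule".split()
idf_table = inverse_document_frequency([image_text, paragraph_1, paragraph_2])
score_1 = match_degree(tfidf(image_text, idf_table), tfidf(paragraph_1, idf_table))
score_2 = match_degree(tfidf(image_text, idf_table), tfidf(paragraph_2, idf_table))
```

Here the first paragraph shares three keywords with the picture text and receives a positive score, while the second shares none and scores zero, so only the first would be kept as a target paragraph.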

3. Construction of global knowledge graph

Entities and relationships are extracted from the catalog titles to construct a basic knowledge graph; semantic analysis is then performed step by step on the paragraphs corresponding to the titles, and entities and relationships are added to obtain the global knowledge graph.

Entities and entity relationships are extracted from the titles of the catalog. Let the extracted entity set be E = {e₁, e₂, ⋯, e_m} and the relationship set be R = {r₁, r₂, ⋯, r_n}. The basic knowledge graph is represented as the triple set G₀ = {(e_i, r_j, e_k) ∣ e_i, e_k ∈ E, r_j ∈ R}.

The lowest-level title in the catalog is acquired, and the local knowledge graph G_local corresponding to the lowest-level title is acquired.

The target paragraph corresponding to the lowest-level title is acquired based on the catalog, and entities and relationships are extracted from the target paragraph. Let the newly extracted entity set be E′ and the relationship set be R′. The local knowledge graph is updated as: G_local = G_local ∪ {(e_i, r_j, e_k) ∣ e_i, e_k ∈ E ∪ E′, r_j ∈ R ∪ R′}

All the lowest-level titles are traversed, and all the local knowledge graphs are integrated into the global knowledge graph G_global.

4. Global relationship extraction

The entities and relationships in the global knowledge graph are mapped into low-dimensional vectors using a knowledge graph embedding method, and the global relationships are extracted using the graph attention network model.

The entities and relationships in the global knowledge graph are mapped to low-dimensional vectors using a translation model (e.g., TransE). Let the vector of entity e be denoted e and the vector of relation r be denoted r; the goal of TransE is that e_i + r ≈ e_j for a triple (e_i, r, e_j). The loss function is a negative log likelihood loss:

where σ is the sigmoid function and the negative samples are drawn from the negative sample set.

The low-dimensional vectors of entities and relationships are used as the input of the graph attention network model. Let the feature vector of node i be h_i; the graph attention network calculates the attention coefficient α_ij of node i to node j as:

α_ij = exp(LeakyReLU(aᵀ[W h_i ∥ W h_j])) / Σ_{k∈N_i} exp(LeakyReLU(aᵀ[W h_i ∥ W h_k]))

where W is a learnable weight matrix, a is the attention vector, ∥ denotes vector concatenation, and N_i is the set of neighbor nodes of node i.

The information of the neighbor nodes is aggregated to obtain the updated feature h_i′ of node i:

h_i′ = σ(Σ_{j∈N_i} α_ij W h_j)

where σ is a nonlinear activation function.

Association relationships between the entities in different local knowledge graphs are constructed according to the similarity, and the relationships between the entities in the global knowledge graph and these association relationships are integrated into the global relationship.
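A single attention head of a graph attention layer can be sketched in plain Python as follows. The sketch uses the widely used formulation in which the score of a node pair is LeakyReLU(aᵀ[W h_i ∥ W h_j]), normalized by a softmax over the neighbors of node i; the `tanh` output nonlinearity, the function names, and the requirement that every node has at least one neighbor are assumptions of this sketch.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = List[float]

def matvec(W: Sequence[Sequence[float]], h: Sequence[float]) -> Vector:
    """Matrix-vector product W h."""
    return [sum(w * x for w, x in zip(row, h)) for row in W]

def leaky_relu(x: float, slope: float = 0.2) -> float:
    return x if x > 0 else slope * x

def gat_layer(features: Dict[str, Sequence[float]],
              neighbors: Dict[str, List[str]],
              W: Sequence[Sequence[float]],
              a: Sequence[float]) -> Tuple[Dict[str, Vector], Dict[str, Dict[str, float]]]:
    """One attention head. neighbors[i] lists the nodes i attends to
    (add i itself for a self-loop). len(a) must be twice the output dimension."""
    wh = {n: matvec(W, h) for n, h in features.items()}
    updated: Dict[str, Vector] = {}
    attention: Dict[str, Dict[str, float]] = {}
    for i, nbrs in neighbors.items():
        # Unnormalized scores on the concatenation [W h_i || W h_j].
        scores = [leaky_relu(sum(x * y for x, y in zip(a, wh[i] + wh[j]))) for j in nbrs]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        alphas = [e / total for e in exps]  # softmax over the neighbors of i
        attention[i] = dict(zip(nbrs, alphas))
        aggregated = [sum(al * wh[j][d] for al, j in zip(alphas, nbrs))
                      for d in range(len(wh[i]))]
        updated[i] = [math.tanh(v) for v in aggregated]  # assumed nonlinearity
    return updated, attention
```

Multi-head attention would run this function with several independent (W, a) pairs and concatenate or average the resulting node vectors; stacking layers feeds `updated` back in as `features`.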

5. Training graph attention network model

The graph attention network model is trained by positive and negative sample pairs to be able to distinguish between related and unrelated content in semantic space.

Sentences and paragraphs with similar or related semantics are extracted from the document to form a positive sample pair (p₁⁺, p₂⁺), and sentences or paragraphs with unrelated semantics are selected from the document to form a negative sample pair (p₁⁻, p₂⁻).

The pre-trained hierarchical attention network model is used to encode the sentences or paragraphs in the positive and negative sample pairs, generating the high-dimensional semantic vectors v₁⁺, v₂⁺, v₁⁻, v₂⁻.

A contrastive loss function (e.g., triplet loss) is used:

Where m is the margin parameter.

The graph attention network model is trained using the encoded positive and negative sample pairs, and the model parameters are updated by minimizing the loss function.

6. Updating global knowledge-graph relationships

A global context chain is constructed based on the global relationship, and the missing entity relationships are completed by matching the global context chain with the global knowledge graph.

The global context chain C is built based on the global relationship using the graph wheel algorithm.

The entities and relationships involved in the global context chain are matched with the global knowledge graph. Let the triple set of the global knowledge graph be G_global and the triple set of the global context chain be C; the updated global knowledge graph G_updated is:

G_updated = G_global ∪ {(e_i, r, e_j) ∈ C ∣ (e_i, r, e_j) ∉ G_global}

The graph wheel algorithm is an algorithm for constructing circular association paths in a graph structure; it finds multi-hop association relationships between entities through a wheel structure (a closed loop formed by a central node and a plurality of spoke nodes). The algorithm is applicable to scenarios such as knowledge graph completion and relationship reasoning. The knowledge graph is represented as a directed graph G = (V, E), where:

V = {v₁, v₂, ..., v_n} is the set of entity nodes;

E = {(v_i, v_j, r_ij)} is the edge set, where r_ij represents the relationship type from entity v_i to entity v_j.

The graph wheel algorithm builds a wheel structure by:

Hub selection: selecting a core entity c as the hub (axle) of the wheel;

Spoke screening: screening the entities s₁, s₂, ..., s_k directly connected to c as the spokes;

Closed-loop formation: connecting the spoke nodes through multi-hop paths to form a wheel-shaped circular path centered on c.
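The three construction steps above can be sketched as follows. This is one possible reading of the wheel structure, not a definitive implementation: edges are treated as undirected when searching for paths, spoke-to-spoke paths are found by breadth-first search that avoids the hub, and the `max_hops` limit and function name are hypothetical.

```python
from collections import deque
from typing import Dict, Iterable, List, Optional, Set, Tuple

Edge = Tuple[str, str, str]  # (head entity, tail entity, relation type)

def build_wheel(edges: Iterable[Edge], hub: str,
                max_hops: int = 3) -> Tuple[List[str], Dict[Tuple[str, str], List[str]]]:
    """Return the spokes of `hub` and, for each spoke pair, a shortest
    connecting path that does not pass through the hub (if one exists)."""
    adjacency: Dict[str, Set[str]] = {}
    for head, tail, _relation in edges:
        adjacency.setdefault(head, set()).add(tail)
        adjacency.setdefault(tail, set()).add(head)  # undirected for path search (assumption)

    spokes = sorted(adjacency.get(hub, set()))  # entities directly connected to the hub

    def shortest_path(src: str, dst: str) -> Optional[List[str]]:
        queue = deque([[src]])
        seen = {src, hub}  # never route a rim path through the hub
        while queue:
            path = queue.popleft()
            if len(path) - 1 >= max_hops:
                continue  # path already uses the maximum number of hops
            for nxt in sorted(adjacency.get(path[-1], ())):
                if nxt == dst:
                    return path + [nxt]
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(path + [nxt])
        return None

    rim: Dict[Tuple[str, str], List[str]] = {}
    for index, first in enumerate(spokes):
        for second in spokes[index + 1:]:
            path = shortest_path(first, second)
            if path is not None:
                rim[(first, second)] = path
    return spokes, rim
```

Each rim path together with the two spoke edges to the hub closes a loop through the hub, which is how multi-hop associations between spoke entities become visible.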

Path weight calculation: the weight of each path is dynamically adjusted according to the relationship confidence and the path length:

where m is the path length, α_{r_i} is the confidence of relationship r_i, and β ∈ (0, 1) is the path attenuation coefficient.

In some embodiments, the global context analysis system of the complex document may comprise a plurality of functional modules consisting of computer program segments. The computer program of the individual program segments in the global context analysis system of the complex document may be stored in a memory of a computer device and executed by at least one processor to perform (see fig. 1 for details) the functions of global context analysis of the complex document.

In this embodiment, the global context analysis system of the complex document may be divided into a plurality of functional modules according to the functions performed by the system, as shown in fig. 2. Functional modules of system 200 may include a catalog generation module 210, a picture association module 220, a map construction module 230, a global analysis module 240, and a context enhancement module 250. The module referred to in the present invention refers to a series of computer program segments capable of being executed by at least one processor and of performing a fixed function, stored in a memory. In the present embodiment, the functions of the respective modules will be described in detail in the following embodiments.

Optionally, as an embodiment of the present invention, the catalog generating module includes:

The layout analysis unit is used for analyzing layout characteristics of the target document by using a layout analysis model, wherein the layout characteristics comprise paragraph distribution, title format parameters, title content and paragraph format parameters;

a system construction unit for constructing a title hierarchy based on the layout characteristics by using a rule algorithm;

The catalog generation unit is used for constructing the corresponding relation between the title and the paragraph based on the title hierarchy system and generating a catalog based on the corresponding relation;

and the catalog updating unit is used for monitoring the updating content of the target document and synchronously updating the catalog based on the updating content.

Optionally, as an embodiment of the present invention, the picture associating module includes:

a character extraction unit for extracting character information from the picture by using an optical character recognition technology;

The paragraph extraction unit is used for extracting adjacent text paragraphs of the picture according to the position of the picture in the target document;

The label association unit is used for screening out target paragraphs matched with the text information from adjacent text paragraphs by utilizing a keyword matching technology, and setting association labels for the pictures and the target paragraphs;

The semantic matching unit is used for carrying out semantic analysis on the text information and the target paragraph, and screening one or more sentences matched with the text information from the target paragraph;

The sentence marking unit is used for adding an independent mark to each sentence that semantically matches the text information, wherein the independent mark indicates that the matched sentence and the text information form an inseparable content unit;

and the position determining unit is used for determining the specific hierarchical position of the related text for the picture when the document hierarchical structure is generated, so that the image-text content is displayed together as a hierarchical unit.

Optionally, as an embodiment of the present invention, constructing a basic knowledge graph based on the titles of the catalogs, and performing semantic analysis on paragraphs corresponding to the titles step by step, adding entities and relationships to the basic knowledge graph to obtain a global knowledge graph, where the method includes:

And traversing all the lowest-level titles to obtain a global knowledge graph.

Optionally, as an embodiment of the present invention, mapping the entities and the relationships in the global knowledge graph into low-dimensional vectors by using a knowledge graph embedding method, and performing global relationship extraction on the embedded global knowledge graph by using a graph attention network model, including:

Optionally, as an embodiment of the present invention, the training method of the graph attention network model includes:

Optionally, as an embodiment of the present invention, constructing a global context chain based on the global relationship, and updating the relationship of the global knowledge graph based on the global context chain includes:

Although the present invention has been described in detail by way of preferred embodiments with reference to the accompanying drawings, the present invention is not limited thereto. Various equivalent modifications and substitutions may be made to the embodiments of the present invention by those skilled in the art without departing from the spirit and scope of the present invention, and all such modifications and substitutions are intended to fall within the scope of the present invention as defined by the appended claims.

Claims

1. A method for global context analysis of a complex document, comprising:

constructing a global context chain based on the global relationship, and updating the relationship of the global knowledge graph based on the global context chain;

wherein mapping the entities and relations in the global knowledge graph into low-dimensional vectors by using a knowledge graph embedding method, and performing global relation extraction on the embedded global knowledge graph by using a graph attention network model, comprises:

integrating the relationship between the entities in the global knowledge graph and the association relationship between the entities in different local knowledge graphs into a global relationship;

wherein constructing a global context chain based on the global relationship, and updating the relationship of the global knowledge graph based on the global context chain, comprises:

2. The method of claim 1, wherein generating a catalog for a target document by layout analysis of the target document comprises:

3. The method of claim 1, wherein identifying text information of a picture in a target document and establishing an association of the picture with text content by performing semantic analysis on the text information and adjacent text content comprises:

4. The method of claim 1, wherein constructing a basic knowledge graph based on the titles of the catalogs, and performing semantic analysis on paragraphs corresponding to the titles step by step, adding entities and relationships to the basic knowledge graph, and obtaining a global knowledge graph, comprises:

and traversing all the lowest-level titles to obtain the global knowledge graph.

5. The method of claim 4, wherein the method of training the graph attention network model comprises:

6. A global context analysis system for a complex document, comprising:

The context enhancement module is used for constructing a global context chain based on the global relationship and updating the relationship of the global knowledge graph based on the global context chain;

7. The system of claim 6, wherein the catalog generation module comprises:

8. The system of claim 6, wherein the picture association module comprises: