Disclosure of Invention
In order to solve the above problems, the present application provides a RAG knowledge base construction method based on layout analysis and query generation, a computer-readable storage medium and an electronic device, which can improve the retrieval-augmented generation effect of a system.
A first aspect of the application provides a RAG knowledge base construction method based on layout analysis and query generation, comprising: receiving a plurality of query documents; performing layout analysis on each query document using a layout analysis tool to obtain a layout analysis result of the query document, wherein the layout analysis result comprises a plurality of blocks and a block analysis result for each of the plurality of blocks, the block analysis result comprises text content, position information and a block category, and the layout analysis tool comprises at least three layout analysis models trained respectively on labeled data of manuals, papers and legal documents; segmenting and merging the text content according to the layout analysis result of the query document to obtain a plurality of text segments; generating a title for the query document using a large language model, and generating a preset number of queries for each text segment; generating a first vector based on the title and the text segment, and generating a second vector based on the preset number of queries of the text segment; and storing each text segment in combination with its first vector and second vector respectively, thereby constructing the RAG knowledge base, which is used by a retrieval-augmented generation system to process user query operations.
In this way, text segments are obtained through effective layout analysis of the query documents, and queries are generated for each text segment to expand its semantics, so that a more comprehensive RAG knowledge base is constructed and the retrieval-augmented generation effect of the system can be improved.
In one possible implementation, the method further comprises: setting document attributes for each query document, wherein the document attributes comprise the document author, document creation date and document modification date of the query document; and generating keywords for each text segment using a machine learning model. Storing each text segment in combination with its first vector and second vector to construct the RAG knowledge base then comprises storing each text segment in combination with the document attributes of the corresponding query document, the keywords of the text segment, the first vector and the second vector, so as to construct the RAG knowledge base from the storage units corresponding to the plurality of text segments of each query document.
In one possible implementation, generating the first vector based on the title and the text segment comprises: inputting the title and the text segment separately into a text embedding model to generate a title vector and a text segment vector of a preset dimension, and combining the title vector and the text segment vector to generate the first vector. Generating the second vector based on the preset number of queries of the text segment comprises: inputting the queries corresponding to the text segment separately into the text embedding model to generate query vectors of the preset dimension, and combining the query vectors corresponding to the preset number of queries to generate the second vector.
In one possible implementation, the calculation formula of the first vector is:

V1 = weight_title · V_title + (1 − weight_title) · V_chunk

wherein V1 is the first vector, V_title and V_chunk are the title vector and the text segment vector respectively, and weight_title is the weight of the title vector in the first vector, with 0 ≤ weight_title ≤ 1, weight_title being preset.
In one possible implementation, the calculation formula of the second vector is:

V2 = (1/n) · Σ_{i=1}^{n} V_i

wherein V2 is the second vector, V_i (1 ≤ i ≤ n, i ∈ ℤ) is the i-th query vector of the text segment, and n is the preset number.
In one possible implementation, performing the layout analysis of the query document using the layout analysis tool comprises: selecting a corresponding layout analysis model from the layout analysis tool according to the document type of the query document, and performing the layout analysis of the query document using the selected model.
In one possible implementation, segmenting and merging the text content according to the layout analysis result of the query document comprises: determining a corresponding segmentation strategy according to the text content and block category of each block; segmenting the text content according to the segmentation strategy; and, based on the position information of the plurality of blocks, merging text contents that are adjacent in position and logically belong to the same content.
In one possible implementation, the method further comprises preprocessing the query document before layout analysis is performed on the query document by using the layout analysis tool.
A second aspect of the application provides a retrieval-augmented generation processing method for a user query, comprising: receiving the user query; invoking the RAG knowledge base according to the user query to perform data retrieval and obtain a ranking of the storage units corresponding to a plurality of text segments, wherein the data retrieval comprises comparing the query vector corresponding to the user query with the first vector and the second vector of each text segment; reranking the ranking using a ranking model or manually set ranking rules; and inputting the text segments in the one or more top-ranked storage units, together with the user query, into a large language model, so that the large language model outputs the query result of the user query under the guidance of a preset prompt.
In a third aspect, the application provides a computer-readable storage medium having stored thereon a computer program which, when executed on a computer, causes the computer to perform the method of the first aspect or any possible implementation thereof, or of the second aspect.
In a fourth aspect, the application provides an electronic device comprising at least one memory for storing a program and at least one processor for executing the program stored in the memory, wherein, when the program stored in the memory is executed, the processor performs the method of the first aspect or any possible implementation thereof, or of the second aspect.
It will be appreciated that the advantages of the second to fourth aspects may be found in the relevant description of the first aspect and are not repeated here.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application more apparent, the technical solutions of the embodiments of the present application will be described below with reference to the accompanying drawings.
In describing embodiments of the present application, words such as "exemplary," "such as" or "for example" are used to mean serving as examples, illustrations or explanations. Any embodiment or design described herein as "exemplary," "such as" or "for example" is not necessarily to be construed as preferred or advantageous over other embodiments or designs. Rather, the use of words such as "exemplary," "such as" or "for example," etc., is intended to present related concepts in a concrete fashion.
In the description of the embodiments of the present application, the term "and/or" merely describes an association relationship between associated objects and indicates that three relationships may exist. For example, A and/or B may indicate that A exists alone, B exists alone, or both A and B exist. In addition, unless otherwise indicated, the term "plurality" means two or more; for example, a plurality of systems means two or more systems, and a plurality of screen terminals means two or more screen terminals.
Furthermore, the terms "first," "second," and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of indicated technical features. Thus, a feature defined with "first" or "second" may explicitly or implicitly include one or more such features. The terms "comprising," "including," "having," and variations thereof mean "including but not limited to," unless expressly specified otherwise.
In the construction of a RAG knowledge base, the conventional process is to parse a document, segment the parsed text by fixed length or punctuation, and store the text segments and vector information in a vector database (such as FAISS or Milvus) or a search engine (such as Elasticsearch). Obtaining text segments in this way may separate semantically related text.
Furthermore, in conventional word-based and vector-based retrieval, similarity is often used to express relevance; however, such similarity scores may not accurately reflect semantic relevance.
In view of this, embodiments of the present application perform effective layout analysis on the query documents using trained layout analysis models, obtain a plurality of text segments based on the layout analysis results, and generate queries for each text segment using a large language model to expand its semantics. Vectors are then generated from each text segment and its corresponding queries, thereby constructing a more comprehensive RAG knowledge base and improving the retrieval-augmented generation effect of the system.
By way of example, Fig. 1 shows a schematic diagram of a RAG knowledge base construction process based on layout analysis and query generation according to an embodiment of the present application. The RAG knowledge base construction process may be implemented by any device, apparatus, platform, server, or cluster of devices having computing and processing capabilities.
As shown in fig. 1, the construction process of the RAG knowledge base is as follows:
First, a set of query documents that need to be processed is received. The query documents may be in formats such as PDF, PPT, or images.
Second, text recognition is performed on the query document: layout analysis is performed using the trained layout analysis tool to obtain blocks such as text blocks, images and tables in the document, together with the text content, position information and category of each block. The text contents of the blocks are then segmented or merged according to the layout analysis result to obtain meaningful text segments, such as text segment 1, text segment 2, text segment 3, and so on.
Next, semantic expansion is performed using the large language model, including generating a title for the query document and a preset number of queries for each text segment. For each text segment, the generated title is combined with the text segment to create the first vector of the text segment, and the preset number of queries of the text segment are combined to generate its second vector. In addition, other representation information, such as keywords, is generated for each text segment.
Finally, each text segment is stored in combination with its first vector, second vector and other representation information to construct the RAG knowledge base.
Thus, efficient layout analysis of the query document yields a plurality of text segments based on the layout analysis result. Generating queries for each text segment with the large language model expands the ways in which the segment's semantics are expressed and enriches its semantic representation, enabling more effective similarity calculation and semantic matching. The performance of the retrieval-augmented generation system is thereby improved, allowing it to understand and respond to user queries more accurately.
Fig. 2 shows a flowchart of a method for constructing a RAG knowledge base based on layout analysis and query generation according to an embodiment of the present application. As shown in fig. 2, the method comprises the steps of:
Step S201, a number of query documents are received.
In one embodiment, academic documents, research reports, legal documents, and the like are obtained as query documents by accessing a specialized database related to the target area. The target domain is the domain in which the user query processed by the RAG system is located.
These query documents include electronic documents in various formats, such as editable formats (PDF, Word, Excel, PPT, etc.) or image formats (JPEG, PNG, etc.).
Illustratively, the query document is preprocessed after receiving the query document.
Preprocessing includes denoising, contrast adjustment, rotation correction, clipping, resizing, binarization, and the like. The accuracy of the subsequent processing steps is improved by preprocessing the acquired query document.
Step S202: for each query document, layout analysis is performed using the layout analysis tool to obtain the layout analysis result of the query document. The layout analysis result comprises a plurality of blocks and the block analysis result of each block, where the block analysis result comprises text content, position information and block category. The layout analysis tool comprises at least three layout analysis models trained on the labeled data of manuals, papers and legal documents respectively.
In one embodiment, for each query document, the layout analysis tool takes the preprocessed output of step S201 as part of its input to further understand the structure and layout of the query document. This includes identifying and classifying the different categories of blocks on a document page, such as text blocks, titles, tables, pictures and lists. The layout analysis tool also outputs the text content and position information of each block.
For example, the layout analysis tool parses the query document into a plurality of pieces of data in the following format, where each piece corresponds to one block on a document page and serves as the block analysis result of that block. In a block analysis result, 'x0', 'x1', 'top' and 'bottom' together represent the position information of the block, 'text' represents the text content of the block, 'layout_type' represents the category of the block, and 'page_number' indicates the document page.
'layout_type' is an enumerated data structure, which may include enumeration values such as "Text", "Title", "Table" and "Table description". For example, "Table" indicates the table category and "Table description" indicates the table title category; the remaining values are not described again.
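For illustration only, one block analysis result with the fields just described might look like the following Python dictionary; the concrete values are invented for the example:

```python
# Hypothetical block analysis result using the fields described above
# ('x0', 'x1', 'top', 'bottom', 'text', 'layout_type', 'page_number').
block = {
    "x0": 72.0,        # left edge of the block's bounding box
    "x1": 523.5,       # right edge
    "top": 96.0,       # top edge
    "bottom": 150.5,   # bottom edge
    "text": "1. Scope of Application ...",
    "layout_type": "Title",  # block category, e.g. "Text", "Title", "Table"
    "page_number": 1,
}

def block_height(b):
    """Height of a block, derived from its position fields."""
    return b["bottom"] - b["top"]
```

Position fields of this kind are what the later merging step uses to decide whether two blocks are vertically adjacent.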
In one implementation, the layout analysis of the query document is completed using the three layout analysis models trained on the labeled data of manuals, papers and legal documents.
The layout analysis models are trained using the training method provided in PaddleOCR. During training, data is labeled according to the typesetting characteristics of the three document types (manuals, papers and legal documents), and the labeled data of each type is used to train a separate layout analysis model. These models are optimized for manual, paper and legal documents respectively, and can identify and understand the specific structures and terms of documents in their respective fields.
For example, a legal model obtained by training on the characteristics of legal documents can identify and extract key information in legal text, such as legal terms and case references.
A layout analysis model corresponding to the document type of the query document is selected from the layout analysis tool, and layout analysis is performed on the query document using the selected model.
For example, if the document type of the query document is legal, the legal model is selected from the layout analysis tool to perform layout analysis on the query document and obtain its layout analysis result.
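The model selection step can be sketched as a simple lookup; the dictionary keys and model names below are illustrative assumptions mirroring the manual/paper/legal split, not identifiers from the application:

```python
# Hypothetical mapping from document type to a trained layout analysis model.
LAYOUT_MODELS = {
    "manual": "layout_model_manual",
    "paper": "layout_model_paper",
    "legal": "layout_model_legal",
}

def select_layout_model(doc_type):
    """Return the layout analysis model matching the query document's type."""
    if doc_type not in LAYOUT_MODELS:
        raise ValueError(f"no layout analysis model for document type {doc_type!r}")
    return LAYOUT_MODELS[doc_type]
```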
Step S203, segmenting and merging text contents according to the layout analysis result of the query document to obtain a plurality of text segments.
In one implementation, a corresponding segmentation strategy may be formulated for each block according to the text content and block category included in the layout analysis result, and the text content of the block is segmented according to that strategy. For example, a block of the "Title" category may be taken as an independent paragraph; a block of the "Text" category may be segmented according to sentence or paragraph structure; and for a block of the "Table" category, each row or column may be taken as a separate text segment.
According to the determined segmentation strategy, Natural Language Processing (NLP) tools, such as sentence segmentation models, may be used to identify sentence boundaries, or the text content of the blocks may be segmented based on punctuation (e.g., periods, commas) and text structure.
Visually adjacent and logically related blocks may also be identified based on the position information of the plurality of blocks. If such neighboring blocks belong to the same content or context, their text contents are merged into one larger text segment. Merge candidates include a paragraph spread across blocks, long text that was incorrectly split, adjacent list items or bullets, and so on.
Through this process, the text content of the query document is organized into structured text segments that can be used to construct the RAG knowledge base, thereby improving the effectiveness of retrieval and generation tasks.
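As a rough illustration, the segmentation-by-category and adjacency-merging steps above might look as follows; the category names, the sentence-splitting regex, and the 5-point vertical gap threshold are all assumptions of this sketch, not prescribed by the application:

```python
import re

def split_block(block):
    """Segment one block's text according to its category."""
    text, kind = block["text"], block["layout_type"]
    if kind == "Title":
        return [text]  # a title is kept as an independent segment
    if kind == "Table":
        return [row for row in text.split("\n") if row.strip()]  # one segment per row
    # default "Text" handling: split on sentence-ending punctuation
    return [s.strip() for s in re.split(r"(?<=[.!?。！？])\s+", text) if s.strip()]

def merge_adjacent(blocks, gap=5.0):
    """Merge text of blocks that are vertically adjacent on the page.

    A small vertical gap is taken here as a proxy for "logically the same
    content"; a real system would also check fonts, indentation, etc.
    """
    merged, buf, prev_bottom = [], [], None
    for b in sorted(blocks, key=lambda b: b["top"]):
        if prev_bottom is not None and b["top"] - prev_bottom > gap:
            merged.append(" ".join(buf))
            buf = []
        buf.append(b["text"])
        prev_bottom = b["bottom"]
    if buf:
        merged.append(" ".join(buf))
    return merged
```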
Step S204, generating titles for the query documents by using the large language model, and respectively generating a preset number of queries for each text segment.
In one implementation, the large language model may be an open-source pre-trained model, such as GPT-3, LLaMA, BLOOM, and the like.
The query document may be input into the large language model, which, guided by a prompt, analyzes the document's content and generates a title for it, or optimizes and adjusts the original title into a new one. The title is a text that directly expresses the content of the query document.
A dedicated prompt may also be designed based on historical user searches, behavior patterns, contextual information, or currently popular query trends in the target field, so that the large language model predicts, for each input text segment, the questions or keywords a user may want to raise or search; these are the preset number of queries of each text segment.
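A minimal sketch of this query-generation step is given below. `llm` stands for any text-completion callable (an API client wrapper, a local model, etc.); the prompt wording and the default n=3 are illustrative assumptions:

```python
# Hypothetical prompt template; a production prompt would incorporate the
# target field's search history and trends as described above.
QUERY_PROMPT = (
    "Given the text segment below, write {n} questions a user might ask "
    "that this segment answers. One question per line.\n\n{segment}"
)

def generate_queries(llm, segment, n=3):
    """Return the preset number n of predicted queries for one text segment."""
    reply = llm(QUERY_PROMPT.format(n=n, segment=segment))
    queries = [line.strip() for line in reply.splitlines() if line.strip()]
    return queries[:n]
```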
Therefore, on the basis of performing text recognition on the query document to obtain a plurality of text segments, the large language model is used to generate a title for the query document and queries for the text segments, which expands the ways in which the segments' semantics are expressed and enriches their semantic representation.
In step S205, for each text segment, a first vector is generated based on the title and the text segment, and a second vector is generated based on a predetermined number of queries for the text segment.
In one implementation, for each text segment, the title and the text segment are input into a text embedding model respectively to generate a title vector and a text segment vector of a preset dimension. These vectors represent features of the input text such as sentiment, semantics and topic. The preset dimension is a preset value representing the dimension of the embedding model's output vector, such as 512 or 1024.
Text embedding models, including but not limited to OpenAI text embeddings and bilingual general embedding (BGE), are trained on large amounts of text data and can generate context-sensitive embedded representations that capture complex semantic relationships and language patterns.
Next, the title vector is combined with the text segment vector to generate the first vector. The calculation formula of the first vector is:

V1 = weight_title · V_title + (1 − weight_title) · V_chunk (1)

wherein V1 is the first vector, V_title and V_chunk are the title vector and the text segment vector respectively, and weight_title (0 ≤ weight_title ≤ 1) is the weight of the title vector in the first vector, weight_title being preset.
Further, for each text segment, the corresponding preset number of queries are input into the text embedding model respectively to generate query vectors of the preset dimension. These query vectors are then combined, for example by averaging, to generate the second vector:

V2 = (1/n) · Σ_{i=1}^{n} V_i (2)

wherein V2 is the second vector, V_i (1 ≤ i ≤ n, i ∈ ℤ) is the i-th query vector of the text segment, and n is the preset number, which is likewise preset.
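The vector combinations can be sketched in plain Python as follows; the weight value 0.3 is an illustrative preset, and averaging is one plausible way to combine the n query vectors, assumed here rather than prescribed:

```python
def first_vector(v_title, v_chunk, weight_title=0.3):
    """V1 = weight_title * V_title + (1 - weight_title) * V_chunk.

    weight_title is a preset in [0, 1]; 0.3 is an illustrative choice.
    """
    assert 0.0 <= weight_title <= 1.0
    return [weight_title * t + (1.0 - weight_title) * c
            for t, c in zip(v_title, v_chunk)]

def second_vector(query_vectors):
    """Combine the n query vectors into one vector by elementwise averaging
    (an assumption of this sketch)."""
    n = len(query_vectors)
    dim = len(query_vectors[0])
    return [sum(v[i] for v in query_vectors) / n for i in range(dim)]
```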
In step S206, each text segment is stored in combination with its first vector and second vector respectively to construct the RAG knowledge base. The RAG knowledge base is used by the retrieval-augmented generation system to process user query operations.
In one implementation, document attributes are set for each query document. The document attributes include the document author, document creation date, and document modification date of the query document.
Optionally, keywords are also generated for each text segment. Keyword extraction automatically identifies the words or phrases that best represent the subject matter and content of a text. This task can be realized with traditional methods such as TF-IDF or TextRank, with deep learning methods, or with pre-trained language models based on the Transformer architecture (such as BERT or GPT).
Each text segment is stored in combination with the document attributes of the corresponding query document, the keywords of the text segment, the first vector and the second vector, so that the RAG knowledge base is constructed from the storage units corresponding to the text segments of the query documents.
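A storage unit of this kind might be assembled as a simple record; all field names below are illustrative assumptions, not a schema from the application:

```python
def build_storage_unit(segment, doc_attrs, keywords, v1, v2):
    """Combine one text segment with its document attributes, keywords,
    first vector, and second vector into a single storable record."""
    return {
        "text": segment,
        "doc_author": doc_attrs["author"],
        "doc_created": doc_attrs["created"],
        "doc_modified": doc_attrs["modified"],
        "keywords": keywords,
        "first_vector": v1,
        "second_vector": v2,
    }
```

In practice such records would be written to a vector database or search engine of the kind mentioned earlier (FAISS, Milvus, Elasticsearch).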
Therefore, on the basis of performing text recognition on the query documents to obtain a plurality of text segments, the application uses the large language model to generate a number of predicted queries for each text segment and a title for each query document. This expands the semantic expression of the query documents, enriches their semantic representation, enables more effective similarity calculation and semantic matching, and improves the retrieval efficiency and performance of the RAG system for user queries.
Based on the method in the above embodiments, Fig. 3 shows a flowchart of a retrieval-augmented generation processing method for a user query according to an embodiment of the present application. As shown in Fig. 3, the method comprises the following steps:
Step S301, a user query is received.
Illustratively, a user query is an input to the RAG system that requires query matching to provide relevant information or data. The RAG system may receive the user query via a user interface; the query may be text input such as a simple keyword, a complex sentence, or a specific question.
Optionally, in order to improve the accuracy of the search, the RAG system may extend the original query of the user before searching according to the user query, for example, enrich the query content by adding synonyms, related words or context information on the basis of analyzing the query intention of the user.
Step S302: according to the user query, the RAG knowledge base constructed as shown in Fig. 2 is invoked to perform data retrieval, obtaining a ranking of the storage units corresponding to the text segments.
Illustratively, the text embedding model in step S204 described above is utilized to generate a corresponding user query vector for the user query.
Based on the text segment information stored in the RAG knowledge base of Fig. 2, a hybrid query mode combining word-based query and vector-based k-nearest neighbors (kNN) query is used for data retrieval. The hybrid mode combines the advantages of word-based and vector-based retrieval, improving accuracy and recall. For example, word-based search may quickly narrow the candidate set, after which vector-based search accurately finds the most relevant text segments.
The vector-based retrieval process compares the query vector corresponding to the user query with the vectors in the RAG knowledge base. For example, by calculating the distance between the user query vector and the vectors of each text segment (including the first vector and the second vector), their similarity can be evaluated. Common distance measures include Euclidean distance, cosine similarity, and Manhattan distance. The text segments in the RAG knowledge base are ordered by the calculated distance, with the closest vectors (i.e., the most similar text segments) ranked first.
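For illustration, cosine-similarity ranking over the two vectors of each storage unit could be sketched as follows; taking the maximum of the two similarities is one plausible scoring choice assumed here, not prescribed by the application:

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors (0.0 for a zero vector)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def rank_segments(query_vec, units):
    """Rank storage units by the best similarity between the user query vector
    and either the unit's first vector or its second vector."""
    scored = [(max(cosine(query_vec, u["first_vector"]),
                   cosine(query_vec, u["second_vector"])), u)
              for u in units]
    scored.sort(key=lambda s: s[0], reverse=True)
    return [u for _, u in scored]
```

In a production system this brute-force scan would be replaced by an approximate kNN index.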
Finally, the hybrid query outputs a ranked list of text segments that can be used to generate answers, provide information retrieval services, or support other downstream applications.
Step S303: the ranking result is reranked using a ranking model or manually set ranking rules.
Illustratively, the ranking result is reranked using a pre-trained ranking model or manually set rules to ensure that the most relevant text segments are ranked at the top. These top text segments are considered the most likely to contain the answer to the user query or related information.
Step S304: the text segments in the one or more top-ranked storage units are input into the large language model together with the user query, so that the large language model outputs the query result of the user query under the guidance of a preset prompt.
Illustratively, inputting the text segments of the top-ranked storage unit(s) and the user query into the large language model typically involves building an input context containing the user query and the related text segments, and using preset prompts to guide the model's generation. These prompts help the model better understand the nature of the task and the intent of the user.
The large language model generates a query result according to the input context and the prompt word. This may involve generating an answer, a summary, a written text, or any other form of output, depending on the design and task requirements of the RAG system.
Further, post-processing can be performed on the output generated by the large language model to improve the quality of the results. Post-processing includes grammar checking, answer formatting, removal of extraneous information, and the like.
Finally, the processed query results are presented to the user as a response to the user query by the RAG system.
It should be noted that while in the above embodiments the operations of the methods of embodiments of the present application are described in a particular order, this does not require or imply that the operations must be performed in that particular order or that all of the illustrated operations be performed in order to achieve desirable results. Rather, the steps depicted in the flowcharts may change the order of execution. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step to perform, and/or one step decomposed into multiple steps to perform.
Based on the method in the above embodiment, the embodiment of the present application provides a computer-readable storage medium storing a computer program, which when executed on a processor, causes the processor to perform the method in the above embodiment.
Based on the method in the above embodiment, the embodiment of the application provides an electronic device. The electronic device may comprise at least one memory for storing a program and at least one processor for executing the program stored in the memory. Wherein the processor is adapted to perform the method described in the above embodiments when the program stored in the memory is executed.
By way of example, the electronic device may be a cell phone, tablet computer, desktop computer, laptop computer, handheld computer, notebook computer, server, ultra-mobile personal computer (UMPC), netbook, cellular telephone, personal digital assistant (PDA), augmented reality (AR) device, virtual reality (VR) device, artificial intelligence (AI) device, wearable device, vehicle-mounted device, smart home device, and/or smart city device; the embodiments of the present application do not particularly limit its specific category.
The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the flows or functions according to the embodiments of the present application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another, for example by wired (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (e.g., infrared, radio, microwave) means. The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or data center that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., solid-state drive (SSD)).
It will be appreciated that the various numerical numbers referred to in the embodiments of the present application are merely for ease of description and are not intended to limit the scope of the embodiments of the present application. It should be understood that, in the embodiment of the present application, the sequence number of each process does not mean the sequence of execution, and the execution sequence of each process should be determined by the function and the internal logic of each process, and should not limit the implementation process of the embodiment of the present application.
The foregoing embodiments have been provided for the purpose of illustrating the general principles of the present application in further detail, and are not to be construed as limiting the scope of the application, but are merely intended to cover any modifications, equivalents, improvements, etc. based on the teachings of the application.