CN119441507A

Movatterモバイル変換

Info

Publication number: CN119441507A
Application number: CN202411571824.4A
Authority: CN
Inventors: 石颍乐; 夏敏; 易丛文; 管健
Original assignee: Shenzhen Zhixian Future Industrial Software Co ltd
Current assignee: Shenzhen Zhixian Future Industrial Software Co ltd
Priority date: 2024-11-05
Filing date: 2024-11-05
Publication date: 2025-02-14

Abstract

Translated fromChinese

本申请提供基于版面分析和查询生成的RAG知识库构建方法，方法包括：接收若干查询文档；对每个查询文档，利用版面分析工具对查询文档进行版面分析；根据版面分析结果进行文本内容的分段和合并，得到多个文本段；利用大语言模型为查询文档生成标题，以及为各文本段分别生成预设数量的查询；对每个文本段，基于标题与该文本段生成第一向量，基于该文本段的预设数量查询生成第二向量；将各文本段分别与其第一向量和第二向量合并存储，以构建RAG知识库。由此，通过对查询文档进行有效的版面分析得到文本段，以及为各文本段生成查询以扩充语义，从而构建更全面的RAG知识库，可以提升系统的检索增强生成效果。

The present application provides a method for constructing a RAG knowledge base based on layout analysis and query generation, the method comprising: receiving a number of query documents; for each query document, performing layout analysis on the query document using a layout analysis tool; segmenting and merging text content according to the layout analysis results to obtain multiple text segments; using a large language model to generate a title for the query document, and generating a preset number of queries for each text segment; for each text segment, generating a first vector based on the title and the text segment, and generating a second vector based on a preset number of queries for the text segment; merging and storing each text segment with its first vector and second vector respectively to construct a RAG knowledge base. Thus, by performing effective layout analysis on the query document to obtain text segments, and generating queries for each text segment to expand semantics, a more comprehensive RAG knowledge base can be constructed, which can improve the retrieval enhancement generation effect of the system.