Technical Field
The present invention belongs to the technical field of traditional Chinese medicine (TCM) question answering, and relates to a TCM question-answering method, device, and medium, in particular to a TCM question-answering method, device, and medium based on long-document retrieval-augmented generation.
Background
Retrieval-augmented generation (RAG) is essentially a fusion of two technologies: retrieval and generation. A RAG system usually consists of a retriever and a generator. The retriever typically uses information-retrieval methods, and the generator is usually a language model. The retriever obtains question-relevant knowledge from an external knowledge base, and the generator combines the question with the retrieved knowledge to produce an answer.
Mainstream information-retrieval methods include sparse-vector methods, such as TF-IDF and BM25, and more recent dense-vector methods, such as DPR, ANCE, and Contriever. Sparse-vector methods efficiently match keywords through inverted indexes, but they rest on the assumption that the question and its relevant documents share vocabulary; they mainly recall texts with similar surface expressions and ignore semantic information. To exploit the rich semantic knowledge in text, dense-vector methods were introduced. These usually follow a dual-encoder architecture: an encoder maps the query and the text into a shared semantic vector space, turning the search problem into a nearest-neighbor search among vectors in that space.
To enhance language models with retrieval, REALM and RAG integrate retrieval into conditional generation, combining the retrieved documents with the input for probabilistic modeling and marginalization, thereby jointly optimizing the retriever and the language model. REALM introduces a knowledge retriever in the pre-training stage and trains it in an unsupervised manner with the objective of predicting masked tokens, so that the language model can explicitly use knowledge from the knowledge base during pre-training, fine-tuning, and inference. RAG combines a pre-trained seq2seq generator with a DPR-based dual-encoder retriever in an end-to-end trained probabilistic model, learning retrieval and generation jointly, and can be applied to a variety of knowledge-intensive natural language processing tasks. Fusion-in-Decoder encodes each retrieved text together with the question in the encoder, then concatenates the representations and feeds them into the decoder to generate the final answer. Atlas jointly trains the retriever and the language model, using texts retrieved by a Contriever-based retriever to augment a language model that generates in the same way as Fusion-in-Decoder; it can learn task-specific knowledge from a small number of training samples and, with far fewer parameters, match the performance of large-scale language models.
However, as some language models grow ever larger and are not open source (the parameter count of GPT-3, for example, reaches 175B), joint training of the retriever and the language model is no longer generally applicable. REPLUG treats the language model as a black box and adds the retrieval component as a tunable plug-in module, using signals produced by the language model to supervise and optimize the retriever. WebGLM uses the capabilities of large models to train and strengthen the retriever, retrieving knowledge from the World Wide Web to guide the language model's generation.
In general, the core of retrieval-augmented generation lies in how to use an external knowledge base to strengthen the language model's generation. Because the input length of large models is limited (the maximum input of GPT-3, for example, is 4096 tokens), it is impossible to feed every document in the knowledge base into the model; instead, the k documents most similar to the question must be selected as input. Existing methods usually use embedding models to convert questions and documents into vectors and recall relevant documents by vector similarity.
However, most embedding models used for retrieval are trained on top of BERT, whose maximum input length is 512 tokens, so they cannot handle long documents; moreover, some long documents may also exceed the input limit of the large model itself. Long documents therefore have to be split into multiple short documents. Existing RAG techniques usually focus on optimizing retrievers and generators for short documents and apply no special handling to long ones: long documents are typically split simply by length, which may break semantic integrity and scatter question-relevant information across several documents. In addition, retrieved documents often contain a large amount of noise irrelevant to the question, wasting many input tokens.
In view of these shortcomings of the prior art, a new long-document-based retrieval-augmented generation technique is needed.
Summary of the Invention
To overcome the defects of the prior art, the present invention proposes a TCM question-answering method, device, and medium based on long-document retrieval-augmented generation. It improves retrieval and generation capabilities, ensures that the retrieved knowledge is relevant to the question, extracts shorter and more accurate relevant information, removes noise irrelevant to the question, and helps the large language model generate more accurate answers. At the same time, the referenced document sources are marked in the generated results, enhancing the interpretability of the large language model. This is expected to alleviate the hallucination problem and the lack of real-time data of large language models in specific domains, such as TCM question answering, and to provide more accurate and targeted answers.
To achieve the above object, the present invention provides the following technical solutions:
A TCM question-answering method based on long-document retrieval-augmented generation, characterized by comprising the following steps:
1) Question expansion: constructing a prompt template and using a large language model to expand the user question into an expanded question;
2) Document segmentation: segmenting the TCM knowledge base into multiple short documents d_d;
3) Document recall: encoding the multiple short documents d_d with an encoder to obtain document vectors, encoding the user question and the expanded question with an encoder to obtain a question vector and an expanded-question vector, then computing the similarity between each of these two vectors and every document vector, recalling, by similarity, two groups of the top K_r short documents d_d most relevant to the question vector and to the expanded-question vector respectively, and deduplicating and merging the recalled short documents into the final recalled document set D_r;
4) Re-ranking: using a cross-encoder to score and rank the relevance between the user question and the short documents d_d in the recalled document set D_r, and selecting the top K_rr short documents d_d most relevant to the user question to form the document set D_rr;
5) Large-language-model selection: restoring the short documents d_d in the document set D_rr to their original long documents, and using a large language model to filter the restored long documents, screening out the document passages d_s relevant to the user question to form the selected document set D_s;
6) Large-language-model generation: constructing a prompt template and feeding the user question together with the selected document set D_s into a large language model, so that the model answers based on D_s to generate the final answer.
Preferably, step 2) specifically includes:
2.1) setting a maximum document length threshold L_max, a minimum post-segmentation length threshold L_min, and a sliding window size w;
2.2) treating each knowledge-point subsection of the TCM knowledge base as a document d, then judging and segmenting by the length of d: if its length exceeds the maximum document length threshold L_max, the document is regarded as a long document d_l and must be segmented into multiple short documents d_d; if its length does not exceed L_max, the document is directly regarded as a short document d_d and is not segmented;
2.3) segmenting each long document d_l by sentence and by length: whenever the accumulated sentence length reaches L_max, a short document d_d is cut off, and a sliding window w is applied so that adjacent short documents d_d overlap by w sentences to preserve semantic coherence; if the last short document d_d after segmentation is shorter than the minimum post-segmentation length threshold L_min, it is not kept as a new short document d_d but is appended to the end of the previous short document d_d;
2.4) prepending to every short document d_d the book name, chapter name, and section name of the subsection it belongs to.
Preferably, in step 2.3), the mapping between each segmented short document d_d and its long document d_l is retained so that the long document d_l can be restored later.
Preferably, in step 3), after the document vectors are obtained, they are stored in the vector database FAISS and an index is built.
Preferably, in step 3), recall is performed from the vector database FAISS.
Preferably, scoring the relevance between the user question and the short documents d_d in the recalled document set D_r with a cross-encoder in step 4) specifically includes:
4.1) concatenating the user question and each short document d_d in the recalled document set D_r with a separator as the input of the cross-encoder, and modeling them with a deep neural network to obtain a deep interaction representation;
4.2) applying a multilayer perceptron on top of the deep interaction representation to predict the relevance score between the user question and the short document d_d.
Preferably, in step 5), after the document passages d_s relevant to the user question are screened out, a cross-encoder is used to score and rank the relevance between the user question and the passages d_s once more, and the K_s passages most relevant to the user question are selected to form the selected document set D_s.
Preferably, in step 6), when the large language model answers based on the selected document set D_s to generate the answer, annotated examples guide the model to add citation marks to the answer and to list the reference document sources at the end.
In addition, the present invention provides a TCM question-answering device based on long-document retrieval-augmented generation, characterized by comprising:
one or more processors;
a memory for storing one or more programs;
wherein, when the one or more programs are executed by the one or more processors, the one or more processors implement the TCM question-answering method based on long-document retrieval-augmented generation described above.
Finally, the present invention provides a computer-readable storage medium storing a computer program, characterized in that, when executed by a processor, the program implements the steps of the TCM question-answering method based on long-document retrieval-augmented generation described above.
Compared with the prior art, the TCM question-answering method, device, and medium based on long-document retrieval-augmented generation of the present invention have one or more of the following beneficial technical effects:
1. The invention expands user questions with a large language model, enriching them with the model's internal knowledge, and designs a segmentation method for the long documents of the TCM knowledge base, ensuring that the segmented documents carry rich semantic information and improving retrieval recall.
2. The invention uses a large language model to select the passages of the original long documents that match the user question. This avoids the problem of the generator receiving incomplete relevant information due to long-document segmentation, and filters out noise irrelevant to the user question, making the relevant text fed into the large language model shorter and more complete and effectively improving the quality of the generated answers.
3. The invention leverages the capability of the large language model: a few annotated examples guide the model to mark citations while generating the answer and to give the sources of the references, providing interpretability.
Brief Description of the Drawings
FIG. 1 is a flow chart of the TCM question-answering method based on long-document retrieval-augmented generation according to the present invention.
Detailed Description
Before any embodiment of the present invention is explained in detail, it should be understood that the invention is not limited in its application to the construction and arrangement details of the components set forth in the following description or illustrated in the following drawings. The invention is capable of other embodiments and of being practiced or carried out in various ways. It should also be understood that the phraseology and terminology used herein are for the purpose of description and should not be regarded as limiting. The use of "including" or "having" and variations thereof is meant to encompass the items listed thereafter and equivalents thereof as well as additional items.
Furthermore, in this disclosure the term "a" or "an" should be understood as "at least one" or "one or more"; that is, in one embodiment the number of an element may be one, while in another embodiment the number may be plural, so the term should not be understood as limiting the quantity.
Generative large language models such as GPT-3 have greatly advanced artificial intelligence in natural language understanding and generation and can perform many complex natural language processing tasks, such as question answering, writing, translation, and code understanding. However, large language models are mainly trained on publicly available high-frequency data, so their grasp of private-domain and long-tail knowledge is limited. Constrained by their parameter count, they cannot store all world knowledge, and their internal knowledge is static and hard to update in real time. In addition, large language models suffer from hallucination: when faced with unfamiliar questions, they fabricate answers that sound professional but have no factual basis. In mission-critical domains such as TCM, where the accuracy and reliability of answers are paramount, hallucination may lead to serious consequences.
Retrieval-augmented generation can inject current domain knowledge into large language models and reduce hallucination. However, existing RAG techniques usually optimize retrievers and generators for short documents and handle long-document data poorly. Long documents are generally processed with a simple strategy such as splitting by length, which can hardly guarantee that the resulting pieces carry rich semantic information, and the complete information relevant to the user question may be scattered across several pieces. Documents may also contain large amounts of noise irrelevant to the user question.
To solve these problems, the present invention proposes a retrieval-augmented large-language-model generation method for long-document knowledge bases. It aims to improve retrieval and generation, ensure that the retrieved knowledge is relevant to the user question, extract shorter and more accurate relevant information, remove question-irrelevant noise, and help the large language model generate more accurate answers. The referenced document sources are also marked in the generated results, enhancing interpretability. This method is expected to alleviate the hallucination problem and the lack of real-time data of large language models in specific domains such as TCM question answering, providing more accurate and targeted answers.
FIG. 1 shows the flow of the TCM question-answering method based on long-document retrieval-augmented generation of the present invention. As shown in FIG. 1, the method includes the following steps.
1. Question Expansion
A TCM question-answering session starts with a user question q. User questions are usually very short, and recalling relevant documents from a question is an asymmetric semantic-matching problem: incomplete semantics and low information content in the question greatly hurt recall. The present invention therefore expands the user question; that is, a prompt template is constructed and a large language model expands the user question into an expanded question q′.
Specifically, a prompt template is designed so that the rich internal knowledge of the large language model gives the user question a brief answer-style expansion: the model's knowledge most relevant to the question is extracted, only concise key information is output, and irrelevant noise is filtered out, yielding abundant information matching the original user question. Especially when the user question is short or vague, the large language model supplies rich semantic information. The computation is:
q′ = LLM(prompt_q(q))
where LLM is the large language model, prompt_q is the prompt template designed for question expansion, q is the user question, and q′ is the expanded question.
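By way of illustration only, the question-expansion step may be sketched in Python as follows; the llm() callable and the prompt wording are assumptions made for the example and are not fixed by the method:

# Illustrative sketch of question expansion: q' = LLM(prompt_q(q)).
# The llm() callable (any wrapper around a large language model) and the
# prompt wording are assumptions of this example.
EXPAND_TEMPLATE = (
    "You are a TCM expert. Give a brief, factual answer to the question below, "
    "keeping only concise key information and omitting anything irrelevant.\n"
    "Question: {question}"
)

def expand_question(question: str, llm) -> str:
    """Enrich a short user question q into an expanded question q'."""
    return llm(EXPAND_TEMPLATE.format(question=question))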
2. Document Segmentation
TCM question answering also requires a knowledge base D_l. The knowledge base D_l contains N_l documents d, among which the longer ones are regarded as long documents d_l.
Before recall, the TCM knowledge base must be preprocessed: the long documents d_l need to be segmented so that their length does not exceed the maximum input of the encoder or of the large language model.
In the present invention, document segmentation splits the long documents d_l of the TCM knowledge base into multiple short documents d_d. Since the knowledge base consists of various TCM books, a segmentation strategy is designed specifically for them. The strategy is as follows:
First, set the maximum document length threshold L_max, the minimum post-segmentation length threshold L_min, and the sliding window size w.
Second, treat each knowledge-point subsection of the TCM knowledge base (that is, the TCM books) as a document d and judge it by length. If the length of d exceeds L_max, d is regarded as a long document d_l and is segmented into multiple short documents d_d; if its length does not exceed L_max, d is directly regarded as a short document d_d and left unsplit.
Third, segment each long document d_l by sentence and by length. That is, whenever the accumulated sentence length reaches L_max, cut off a short document d_d, and apply a sliding window so that two adjacent short documents d_d overlap by w sentences, preserving semantic coherence. If the last short document d_d after segmentation is shorter than L_min, do not keep it as a new short document d_d; append it to the end of the previous one instead. During segmentation, retain the mapping between every short document d_d and its long document d_l so that the long document d_l can be restored later.
Finally, prepend to every short document d_d the book name, chapter name, and section name of the subsection it belongs to.
The segmented documents thus remain semantically coherent, and overly short, semantically poor fragments are avoided, guaranteeing that every document carries rich semantic information.
After segmentation, the knowledge base D_l of N_l documents d becomes a knowledge base D of N short documents d_d.
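As an illustration, the segmentation strategy may be sketched as follows; the default thresholds and the sentence-splitting pattern are assumptions of the example, not values fixed by the method:

import re

def split_document(doc_id: str, text: str, header: str,
                   l_max: int = 512, l_min: int = 100, w: int = 1) -> list:
    """Split one document by sentence and length with a w-sentence overlap.
    Returns {"text", "parent"} records; "parent" keeps the mapping back to d_l."""
    if len(text) <= l_max:                      # short document: keep as-is
        return [{"text": header + text, "parent": doc_id}]
    sentences = [s for s in re.split(r"(?<=[。！？.!?])", text) if s.strip()]
    chunks, buf, fresh = [], [], 0              # fresh = sentences not yet emitted
    for sent in sentences:
        buf.append(sent)
        fresh += 1
        if sum(len(s) for s in buf) >= l_max:   # cut off one short document d_d
            chunks.append("".join(buf))
            buf, fresh = (buf[-w:] if w > 0 else []), 0   # keep w-sentence overlap
    if fresh:                                   # leftover tail
        tail = "".join(buf)
        if len(tail) < l_min and chunks:        # too short: merge into previous chunk
            chunks[-1] += "".join(buf[-fresh:])
        else:
            chunks.append(tail)
    # Prepend book/chapter/section names and record the parent document.
    return [{"text": header + c, "parent": doc_id} for c in chunks]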
3. Document Recall
Document recall finds the short documents d_d relevant to the user question and the expanded question. In the present invention, an encoder encodes the multiple short documents d_d into document vectors, and an encoder encodes the user question and the expanded question into a question vector and an expanded-question vector. The similarity between each of these two vectors and every document vector is then computed, and two groups of the top K_r short documents d_d most relevant to the question vector and to the expanded-question vector are recalled by similarity (that is, one group of short documents d_d most relevant to the question vector and one group most relevant to the expanded-question vector, which together form the top K_r short documents d_d). The recalled documents are deduplicated and merged into the final recalled document set D_r.
Specifically, the invention uses a dual-encoder architecture: an embedding model encodes the user question, the expanded question, and the short documents d_d into low-dimensional dense vectors, turning question-document relevance into vector similarity. The relevance score is computed as:
s(q, d_d) = E(q) · E(d_d)
where s is the question-document relevance score, q is the question (either the user question or the expanded question), d_d is a short document, E is the embedding model, and · is the vector dot product.
In the present invention, after the document vectors are obtained, they are stored in the vector database FAISS and indexed, and recall is performed against this FAISS database.
Specifically, for efficient recall, the invention builds its index with FAISS, a vector database that supports fast similarity search over embeddings. All short documents produced by segmenting the long documents of the TCM knowledge base are vectorized, stored in FAISS, and indexed. The expanded question q′ and the user question q are each searched against FAISS: the relevance score between each question and every short document is computed, results are sorted by score, the most relevant K_r results of each group are taken (K_r is usually far smaller than the total number N of short documents), and the two groups are deduplicated and merged into the final recalled document set D_r.
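For illustration, the recall step may be sketched as follows; embed() stands for any embedding model E that returns a 1-D vector and is an assumption of the example:

import numpy as np
import faiss  # Facebook AI Similarity Search

def build_index(doc_vectors: np.ndarray) -> faiss.IndexFlatIP:
    """Store document vectors in a FAISS inner-product index, so that the
    score s(q, d_d) = E(q) · E(d_d) becomes a maximum-inner-product search."""
    index = faiss.IndexFlatIP(doc_vectors.shape[1])
    index.add(doc_vectors.astype(np.float32))
    return index

def recall(index, embed, docs, question: str, expanded: str, k_r: int) -> list:
    """Recall the top-K_r documents for q and q' separately, then deduplicate
    and merge the two hit lists into the recalled document set D_r."""
    queries = np.stack([embed(question), embed(expanded)]).astype(np.float32)
    _, ids = index.search(queries, k_r)          # one row of hits per query
    merged = dict.fromkeys(int(i) for i in ids.flatten() if i != -1)  # ordered dedup
    return [docs[i] for i in merged]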
4. Re-ranking
Because recall with the dual-encoder method is not accurate enough, the present invention re-ranks the short documents d_d in the recalled document set D_r. That is, a cross-encoder scores and ranks the relevance between the user question and the short documents d_d in D_r, and the top K_rr short documents d_d most relevant to the user question are selected to form the document set D_rr.
Scoring the relevance between the user question and the short documents d_d in D_r with the cross-encoder specifically includes:
1. concatenating the user question and each short document d_d in D_r with a separator as the input of the cross-encoder, and modeling them with a deep neural network to obtain a deep interaction representation;
2. applying a multilayer perceptron on top of the deep interaction representation to predict the relevance score between the user question and the short document d_d.
Specifically, the cross-encoder (CE) takes the user question and a short document d_d from D_r, joined by a separator, as its input. Because the concatenated sentences interact semantically, a deep neural network can model them to obtain a deep interaction representation r:
r = CE(q + d_d)
A multilayer perceptron (MLP) is then applied on top of the deep interaction representation to predict the relevance score:
s(q, d_d) = MLP(r)
Since the cross-encoder captures deep interactions between question and document, its relevance estimates are more accurate than those of the dual encoder. The invention therefore uses the cross-encoder to re-score and re-rank the recalled short documents in D_r against the question and selects the K_rr short documents most relevant to the question to form the document set D_rr (K_rr is usually smaller than K_r).
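By way of example, the re-ranking step may be sketched with the CrossEncoder class of the sentence-transformers library; the checkpoint name is a placeholder, as the method does not prescribe one:

from sentence_transformers import CrossEncoder

def rerank(question: str, recalled: list, k_rr: int) -> list:
    """Jointly encode (question, document) pairs and keep the top K_rr.
    Internally the cross-encoder concatenates each pair with a separator and
    an MLP head on the interaction representation predicts s = MLP(CE(q + d_d))."""
    ce = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # placeholder checkpoint
    scores = ce.predict([(question, d["text"]) for d in recalled])
    ranked = sorted(zip(recalled, scores), key=lambda p: p[1], reverse=True)
    return [d for d, _ in ranked[:k_rr]]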
5. Large Language Model Selection
The short documents d_d in the document set D_rr are restored to their long documents, and a large language model filters the restored long documents, screening out the document passages d_s relevant to the user question to form the selected document set D_s.
Preferably, after the passages d_s relevant to the user question are screened out, a cross-encoder scores and ranks the relevance between the user question and the passages d_s once more, and the K_s passages d_s most relevant to the user question are selected to form the document set D_s.
Specifically, to obtain the most relevant, concise, and complete text passages, the invention restores the segmented short documents to their original long documents and uses a large language model to filter the restored long documents, screening out the passages most relevant to the question. This yields complete semantic passages relevant to the user question, avoids the information loss caused by long-document segmentation, and filters out irrelevant sentences, allowing the large language model to receive more relevant documents. The formula is:
d_s = LLM(prompt_s(doc2raw(d_d)))
where d_s is a passage relevant to the user question that the large language model selects from a long document. doc2raw(d_d) restores a short document d_d to its long document: if d_d was produced by segmenting a long document, it checks whether the long document's length exceeds the maximum input threshold of the large language model; if not, the short document is replaced by its original long document, otherwise the short document is kept. The result is the set of original, unsegmented, deduplicated long documents. prompt_s is the prompt template that guides the large language model to find the relevant passages in this long-document set: if a restored long document contains a passage relevant to the user question, that passage is selected as output and the other, irrelevant sentences are filtered out; if the document contains no relevant passage, nothing is selected.
Because the performance of large language models degrades as the input context grows longer, the invention uses the large language model to select relevant passages from the long documents restored from the re-ranked short documents in D_rr, extracting the passages relevant to the user question as the new passages d_s and removing question-irrelevant noise from the documents.
Moreover, a large language model usually performs best when the information relevant to the user question appears at the beginning or end of its input context; performance drops markedly when the model must access relevant information in the middle of a long context. The invention therefore uses the cross-encoder once more to score and rank the question against all passages d_s selected by the large language model, obtaining the set D_s of the K_s most relevant passages (K_s is usually at most K_rr).
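As a sketch under the same assumptions (an llm() callable, a cross-encoder ce as above, and illustrative prompt wording), the selection step may be written as:

SELECT_TEMPLATE = (
    "Document:\n{document}\n\n"
    "If the document contains passages relevant to the question below, output "
    "only those passages verbatim; otherwise output nothing.\n"
    "Question: {question}"
)

def select_passages(question: str, reranked: list, parents: dict,
                    llm, ce, k_s: int, llm_max_len: int = 4096) -> list:
    """doc2raw restoration + LLM passage selection + cross-encoder re-scoring,
    yielding the selected document set D_s."""
    def doc2raw(chunk):
        # Restore the chunk to its original long document if that fits the LLM input.
        parent = parents[chunk["parent"]]
        return parent if len(parent) <= llm_max_len else chunk["text"]

    longs = dict.fromkeys(doc2raw(c) for c in reranked)   # deduplicated long documents
    passages = []
    for doc in longs:
        out = llm(SELECT_TEMPLATE.format(document=doc, question=question)).strip()
        if out:                                           # keep only relevant passages
            passages.append(out)
    if not passages:
        return []
    scores = ce.predict([(question, p) for p in passages])
    ranked = sorted(zip(passages, scores), key=lambda x: x[1], reverse=True)
    return [p for p, _ in ranked[:k_s]]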
6. Large Language Model Generation
A prompt template is constructed, and the user question is fed into the large language model together with the selected document set D_s, so that the model answers based on D_s to generate the answer.
Specifically, the document set D_s obtained after selection and ranking is handed to the large language model together with the user question for generation, and the model answers based on the provided document set D_s, alleviating its hallucination problem in the TCM domain.
Preferably, a few annotated examples guide the large language model to add citation marks to the answer and to list the reference document sources at the end, providing interpretability. The answer is computed as:
r = LLM(prompt_r(q, D_s))
where prompt_r builds the rules that guide the large language model to answer with the document set D_s and, whenever a document is cited while answering, to place a corresponding mark at the citation.
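Purely as an illustration, the grounded generation step with citation marking may be sketched as follows; the prompt wording, including the one-line few-shot citation example, is an assumption of the sketch:

ANSWER_TEMPLATE = (
    "Answer the question using only the reference documents below. Mark every "
    "cited document as [n] inside the answer and list the sources at the end.\n"
    "Example: '...aversion to wind with spontaneous sweating [1]...'\n\n"
    "{references}\n\nQuestion: {question}"
)

def generate_answer(question: str, selected: list, llm) -> str:
    """r = LLM(prompt_r(q, D_s)): answer grounded in D_s with citation marks."""
    refs = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(selected))
    return llm(ANSWER_TEMPLATE.format(references=refs, question=question))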
In addition, the present invention provides a TCM question-answering device based on long-document retrieval-augmented generation, comprising: one or more processors; and a memory for storing one or more programs; wherein, when the one or more programs are executed by the one or more processors, the one or more processors implement the TCM question-answering method based on long-document retrieval-augmented generation described above.
Moreover, the present invention provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the TCM question-answering method based on long-document retrieval-augmented generation described above.
Finally, the inventors conducted experiments. On a question-answering evaluation dataset covering TCM internal medicine, basic TCM theory, TCM diagnostics, formula studies, Chinese materia medica, and the four TCM classics, answers generated by the large language model ChatGLM2-6B alone received an overall expert score of 34% under manual evaluation. With ChatGLM2-6B combined with the method of the present invention, the overall expert score of the generated answers rose to 69%, a marked improvement. The experiments thus show that the invention greatly improves the generation capability of large language models in TCM question answering.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solution of the present invention, not to limit its scope of protection. Those skilled in the art may, following the idea of the present invention, modify the technical solution or replace it with equivalents without departing from its essence and scope.