Technical Field
The present invention relates to the field of artificial intelligence, and in particular to an intelligent reply method and device, an electronic device, and a storage medium.
Background Art
With the continued development of Artificial Intelligence Generated Content (AIGC) technology, more and more enterprises use AIGC to build intelligent customer service platforms so that they can answer users' questions promptly. When a user enters a question, the platform typically retrieves from a knowledge document library the knowledge document titles related to the question, together with the knowledge documents corresponding to those titles. It then determines the similarity between the sentence content vector of each sentence in those documents and the question content vector, and finally determines the reply content based on the target sentence content with the highest similarity.
However, because only the title of each knowledge document serves as the retrieval item, it is difficult to retrieve the knowledge documents related to the question. In addition, vector similarity places high demands on computing power and resources, so it is slow to compute. As a result, both the accuracy and the efficiency of replying to questions based on knowledge documents are low.
Summary of the Invention
The present invention aims to solve at least one of the technical problems existing in the related art. To this end, the present invention proposes an intelligent reply method that not only increases the probability that relevant block content is retrieved accurately, but also improves the accuracy and efficiency of replying to questions based on knowledge documents, and greatly improves user satisfaction and experience, thereby realizing personalized intelligent question answering in a true sense.
The present invention also proposes an intelligent reply device.
The present invention also proposes an electronic device.
The present invention also proposes a non-transitory computer-readable storage medium.
The present invention also proposes a computer program product.
The intelligent reply method according to an embodiment of the first aspect of the present invention includes:
recalling knowledge documents based on target business keywords in a target question text, and determining, based on the recall result, a target knowledge document screening range corresponding to the target business keywords;
for each target knowledge document to be screened within the target knowledge document screening range, retrieving, from all block contents contained in the target knowledge document, first block content whose matching degree with the target question text meets a first preset matching degree threshold, and retrieving second block content that matches both the target tree-structure hierarchical directory of the target knowledge document and the target question text;
determining target reply content of the target question text based on each first block content and each second block content.
According to the intelligent reply method of the embodiment of the present invention, for each target knowledge document to be screened that is recalled based on the target business keywords in the target question text, the AIGC customer service platform retrieves, from all block contents contained in that document, the first block content whose matching degree with the target question text meets the first preset matching degree threshold, retrieves the second block content that matches both the target tree-structure hierarchical directory of the document and the target question text, and combines the retrieved first and second block contents to determine the target reply content of the target question text. In this way, recalling knowledge documents over multiple channels from the keywords in the user's question enriches and widens the search scope of the subsequent knowledge retrieval, so that the target reply content finally output better fits the user's actual needs. In addition, the final reply content that best matches the user's real needs is determined from the degree to which all block contents in each recalled target knowledge document match, respectively, the question as a sentence and the business keywords in the question as segmented words. This not only increases the probability that relevant block content is retrieved accurately, but also improves the accuracy and efficiency of replying to questions based on knowledge documents, and greatly improves user satisfaction and experience, thereby realizing personalized intelligent question answering in a true sense.
According to an embodiment of the present invention, recalling knowledge documents based on the target business keywords in the target question text, and determining, based on the recall result, the target knowledge document screening range corresponding to the target business keywords, includes:
recalling knowledge documents for the target business keywords based on a pre-stored mapping relationship between business keywords and document IDs, and determining the target knowledge document screening range based on the recalled knowledge document list.
According to an embodiment of the present invention, determining the target knowledge document screening range based on the recalled knowledge document list includes:
judging whether the document ID lists recalled by the respective target business keywords share any identical document ID;
if it is determined that identical document IDs exist among the document ID lists, determining the target knowledge document screening range based on the knowledge documents corresponding to those identical document IDs;
if it is determined that no identical document ID exists among the document ID lists, determining, from the document ID lists, the target document ID list containing the fewest document IDs, and determining the target knowledge document screening range based on the knowledge documents corresponding to the target document IDs in that list.
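By way of non-limiting illustration, the screening-range rule above (use the shared document IDs when the recalled lists intersect, otherwise fall back to the shortest list) might be sketched as follows. The function name `screening_range`, the in-memory dictionary used as the keyword-to-document-ID mapping, and the decision to ignore keywords that recall nothing are assumptions made for this example and are not prescribed by the embodiment.

```python
from typing import Dict, List, Set


def screening_range(keyword_to_doc_ids: Dict[str, Set[str]],
                    target_keywords: List[str]) -> Set[str]:
    """Return the document IDs that form the screening range (illustrative)."""
    # Recall one document ID list per target business keyword.
    recalled = [keyword_to_doc_ids.get(k, set()) for k in target_keywords]
    # Assumption: a keyword that recalls nothing is ignored.
    recalled = [ids for ids in recalled if ids]
    if not recalled:
        return set()
    # Case 1: the lists share document IDs, so use the shared IDs.
    common = set.intersection(*recalled)
    if common:
        return common
    # Case 2: no shared ID, so use the list with the fewest document IDs.
    return set(min(recalled, key=len))
```

For example, with a mapping `{"refund": {"d1", "d2"}, "invoice": {"d2", "d3"}}`, the keywords `["refund", "invoice"]` would narrow the range to `{"d2"}`.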
According to an embodiment of the present invention, the process of constructing the mapping relationship between business keywords and document IDs includes:
for each document ID, identifying target entities of different types from the sentence contents of the knowledge document corresponding to the document ID, based on a pre-built entity dictionary; identifying target codes of different types from the knowledge document, based on preset code identifiers and code rules for the different types; and identifying, from the knowledge document, the target high-frequency segmented words that occur most frequently;
determining at least one of the target entities, the target codes and the target high-frequency segmented words corresponding to each document ID as the business keywords, and constructing the mapping relationship between business keywords and document IDs based on the business keywords and the document IDs.
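As a minimal sketch of how such a mapping could be built, the example below collects dictionary entities, rule-based codes and the most frequent token of each document into an inverted index. The entity dictionary, the `ORD-` code rule, the stop-word list and the whitespace-style tokenization are all hypothetical stand-ins chosen for illustration; a real deployment would substitute its own dictionary, code rules and word segmenter.

```python
import re
from collections import Counter, defaultdict
from typing import Dict, Set

# Hypothetical entity dictionary: entity type -> known entity names.
ENTITY_DICT = {"product": ["family plan", "broadband"]}
# Hypothetical code rule: code type -> pattern built from an identifier prefix.
CODE_RULES = {"order_code": re.compile(r"\bORD-\d{6}\b")}
# Hypothetical stop words excluded from high-frequency word counting.
STOPWORDS = {"a", "an", "the", "for", "of", "to", "and", "is", "are"}


def extract_keywords(text: str, top_n: int = 1) -> Set[str]:
    """Collect entities, codes and the top_n most frequent tokens of a document."""
    keywords: Set[str] = set()
    lowered = text.lower()
    for names in ENTITY_DICT.values():
        keywords.update(name for name in names if name in lowered)
    for pattern in CODE_RULES.values():
        keywords.update(pattern.findall(text))
    tokens = [t for t in re.findall(r"[a-z]+", lowered) if t not in STOPWORDS]
    keywords.update(word for word, _ in Counter(tokens).most_common(top_n))
    return keywords


def build_keyword_index(docs: Dict[str, str]) -> Dict[str, Set[str]]:
    """Build the business keyword -> document ID mapping."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for keyword in extract_keywords(text):
            index[keyword].add(doc_id)
    return dict(index)
```

Storing the mapping as an inverted index keeps the recall step a dictionary lookup per keyword, which is consistent with the stated goal of narrowing the search range quickly.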
According to an embodiment of the present invention, retrieving, from all block contents contained in the target knowledge document, the first block content whose matching degree with the target question text meets the first preset matching degree threshold, and retrieving the second block content that matches both the target tree-structure hierarchical directory of the target knowledge document and the target question text, includes:
retrieving from the target knowledge document, based on sentence content-document ID-block content mapping relationships constructed in advance for the various types of knowledge documents, the first block content containing the target sentence content whose matching degree with the target question text meets the first preset matching degree threshold;
determining the target tree-structure hierarchical directory of the target knowledge document based on a mapping relationship between document IDs and tree-structure hierarchical directories of knowledge documents, and determining the matching degree between each level title in the target tree-structure hierarchical directory and the target question text;
retrieving, from the block contents under the respective level titles in the target tree-structure hierarchical directory, the second block content under the target level title whose matching degree meets a second preset matching degree threshold.
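The two retrieval paths described above can be sketched, under stated assumptions, as follows. The embodiment does not fix how the matching degree is computed at this point, so the token-overlap (Jaccard) score used here is only a stand-in; the names `match_score` and `retrieve_blocks`, the tuple layouts of the two indexes, and the default thresholds of 0.3 are likewise hypothetical.

```python
import re
from typing import Dict, List, Tuple


def match_score(a: str, b: str) -> float:
    """Stand-in matching degree: Jaccard overlap of lowercase word tokens."""
    ta = set(re.findall(r"\w+", a.lower()))
    tb = set(re.findall(r"\w+", b.lower()))
    return len(ta & tb) / len(ta | tb) if ta and tb else 0.0


def retrieve_blocks(question: str,
                    doc_id: str,
                    sentence_index: List[Tuple[str, str, str]],
                    toc_index: Dict[str, List[Tuple[str, str]]],
                    first_threshold: float = 0.3,
                    second_threshold: float = 0.3) -> Tuple[List[str], List[str]]:
    """Return (first block contents, second block contents) for one document."""
    # Path 1: sentence is the retrieval item, document ID the filter,
    # and the block containing the sentence the returned item.
    first = [block for sentence, d, block in sentence_index
             if d == doc_id and match_score(question, sentence) >= first_threshold]
    # Path 2: level title is the retrieval item, document ID the filter,
    # and the block under that title the returned item.
    second = [block for title, block in toc_index.get(doc_id, [])
              if match_score(question, title) >= second_threshold]
    return first, second
```

Filtering by document ID before scoring means the matching degree is only computed inside the screening range, not across the whole knowledge document library.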
According to an embodiment of the present invention, the process of constructing the sentence content-document ID-block content mapping relationship includes:
for the document ID of each of the various types of knowledge documents, splitting the knowledge document corresponding to the document ID into sentences based on preset sentence separators, and dividing the knowledge document into blocks based on a text segmentation model;
for each sentence content obtained by sentence splitting, determining the block content within a preset segmentation window from the initial block content, obtained by blocking, in which that sentence content is located;
constructing the sentence content-document ID-block content mapping relationship based on all sentence contents obtained by sentence splitting, the block contents within the respective preset segmentation windows, and the document IDs.
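One possible reading of this construction is sketched below. It assumes that the "preset segmentation window" means the sentences within a fixed distance of the indexed sentence inside its initial block, and it replaces the text segmentation model with a simple blank-line split; both are assumptions made purely so the example is self-contained, and the function names are hypothetical.

```python
import re
from typing import Dict, List, Tuple

# Sentence separators (ASCII and full-width); an illustrative preset.
SENTENCE_PATTERN = re.compile(r"[^.!?。！？]+[.!?。！？]?")


def split_sentences(text: str) -> List[str]:
    """Split text into sentences on the preset separators."""
    return [s.strip() for s in SENTENCE_PATTERN.findall(text) if s.strip()]


def initial_blocks(text: str) -> List[str]:
    """Stand-in for the text segmentation model: split on blank lines."""
    return [b.strip() for b in re.split(r"\n\s*\n", text) if b.strip()]


def build_sentence_index(docs: Dict[str, str],
                         window: int = 1) -> List[Tuple[str, str, str]]:
    """Build (sentence content, document ID, block content) rows."""
    rows = []
    for doc_id, text in docs.items():
        for block in initial_blocks(text):
            sentences = split_sentences(block)
            for i, sentence in enumerate(sentences):
                # Keep only the sentences within `window` of sentence i.
                lo = max(0, i - window)
                hi = min(len(sentences), i + window + 1)
                rows.append((sentence, doc_id, " ".join(sentences[lo:hi])))
    return rows
```

Under this reading, the window bounds the size of the returned block content, so a long initial block does not have to be returned in full when only one of its sentences matches.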
According to an embodiment of the present invention, the process of constructing the mapping relationship between document IDs and tree-structure hierarchical directories of knowledge documents includes:
identifying, based on the content format, document title and font size preset for each of the various types of knowledge documents, the titles at different levels in each knowledge document and the block content under each level title, and constructing the mapping relationship between document IDs and tree-structure hierarchical directories of knowledge documents based on the identification results and the document IDs of the knowledge documents; this mapping relationship is used to determine the target tree-structure hierarchical directory of the target knowledge document.
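As a rough illustration of how level titles and their block contents might be recovered, the sketch below infers title levels from font size alone. The input format (a list of `(font_size, text)` lines), the body font size of 12, and the rule that larger sizes denote higher levels are assumptions for this example; the embodiment also relies on content format and document title, which are not modeled here.

```python
from typing import Dict, List, Tuple


def build_toc(lines: List[Tuple[int, str]],
              body_size: int = 12) -> Dict[Tuple[str, ...], str]:
    """Map each title path, e.g. ("Roaming", "Activation"), to its block content."""
    # Every font size larger than the body size is treated as a title size;
    # the largest size is level 1, the next largest level 2, and so on.
    title_sizes = sorted({size for size, _ in lines if size > body_size},
                         reverse=True)
    level_of = {size: i + 1 for i, size in enumerate(title_sizes)}

    toc: Dict[Tuple[str, ...], str] = {}
    path: List[str] = []
    for size, text in lines:
        if size in level_of:
            # A title at level n replaces everything from level n downward.
            path = path[:level_of[size] - 1] + [text]
            toc[tuple(path)] = ""
        elif path:
            # Body text is appended to the block under the current title;
            # body text before the first title is skipped in this sketch.
            key = tuple(path)
            toc[key] = (toc[key] + " " + text).strip()
    return toc


def build_toc_index(docs: Dict[str, List[Tuple[int, str]]]):
    """Build the document ID -> tree-structure hierarchical directory mapping."""
    return {doc_id: build_toc(lines) for doc_id, lines in docs.items()}
```

Keying each block by its full title path preserves the parent-child relationship between directory levels without needing an explicit tree node class.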
According to an embodiment of the present invention, determining the target reply content of the target question text based on each first block content and each second block content includes:
sorting the first block contents and the second block contents, and determining the target block content based on the sorting result;
determining the target reply content corresponding to the target block content based on a large language model and personalized reply prompt words preset for different business scenarios.
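The sorting and prompt-assembly steps could, for instance, look like the following. The scenario names, the prompt wording, the `top_k` cut-off and the merge rule (keeping the higher score when a block is found by both retrieval paths) are hypothetical, and the large language model is represented by an arbitrary callable `llm` rather than any particular vendor API.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical personalized reply prompt words, one per business scenario.
SCENARIO_PROMPTS: Dict[str, str] = {
    "billing": "You are a billing support agent. Answer briefly and politely.",
    "default": "You are a customer service agent. Answer using only the knowledge given.",
}


def select_blocks(first: List[Tuple[str, float]],
                  second: List[Tuple[str, float]],
                  top_k: int = 2) -> List[str]:
    """Merge both result sets, sort by matching degree, keep the top_k blocks."""
    best: Dict[str, float] = {}
    for block, score in first + second:
        best[block] = max(score, best.get(block, 0.0))
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return [block for block, _ in ranked[:top_k]]


def build_prompt(question: str, blocks: List[str], scenario: str) -> str:
    """Combine scenario prompt words, target block content and the question."""
    system = SCENARIO_PROMPTS.get(scenario, SCENARIO_PROMPTS["default"])
    knowledge = "\n".join(f"- {block}" for block in blocks)
    return f"{system}\n\nKnowledge:\n{knowledge}\n\nQuestion: {question}\nAnswer:"


def reply(question: str,
          first: List[Tuple[str, float]],
          second: List[Tuple[str, float]],
          scenario: str,
          llm: Callable[[str], str]) -> str:
    """Generate the target reply content with the supplied language model."""
    return llm(build_prompt(question, select_blocks(first, second), scenario))
```

Passing the model in as a callable keeps the sketch independent of any specific model service and makes the prompt assembly testable on its own.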
The intelligent reply device according to an embodiment of the second aspect of the present invention includes:
a knowledge document recall unit, configured to recall knowledge documents based on target business keywords in a target question text, and to determine, based on the recall result, a target knowledge document screening range corresponding to the target business keywords;
a block content determination unit, configured to, for each target knowledge document to be screened within the target knowledge document screening range, retrieve, from all block contents contained in the target knowledge document, first block content whose matching degree with the target question text meets a first preset matching degree threshold, and retrieve second block content that matches both the target tree-structure hierarchical directory of the target knowledge document and the target question text;
a reply content determination unit, configured to determine target reply content of the target question text based on each first block content and each second block content.
According to the intelligent reply device of the embodiment of the present invention, for each target knowledge document to be screened that is recalled based on the target business keywords in the target question text, the AIGC customer service platform retrieves, from all block contents contained in that document, the first block content whose matching degree with the target question text meets the first preset matching degree threshold, retrieves the second block content that matches both the target tree-structure hierarchical directory of the document and the target question text, and combines the retrieved first and second block contents to determine the target reply content of the target question text. In this way, recalling knowledge documents over multiple channels from the keywords in the user's question enriches and widens the search scope of the subsequent knowledge retrieval, so that the target reply content finally output better fits the user's actual needs. In addition, the final reply content that best matches the user's real needs is determined from the degree to which all block contents in each recalled target knowledge document match, respectively, the question as a sentence and the business keywords in the question as segmented words. This not only increases the probability that relevant block content is retrieved accurately, but also improves the accuracy and efficiency of replying to questions based on knowledge documents, and greatly improves user satisfaction and experience, thereby realizing personalized intelligent question answering in a true sense.
The above one or more technical solutions in the embodiments of the present invention have at least one of the following technical effects. For each target knowledge document to be screened that is recalled based on the target business keywords in the target question text, the AIGC customer service platform retrieves, from all block contents contained in that document, the first block content whose matching degree with the target question text meets the first preset matching degree threshold, retrieves the second block content that matches both the target tree-structure hierarchical directory of the document and the target question text, and combines the retrieved first and second block contents to determine the target reply content of the target question text. In this way, recalling knowledge documents over multiple channels from the keywords in the user's question enriches and widens the search scope of the subsequent knowledge retrieval, so that the target reply content finally output better fits the user's actual needs. In addition, the final reply content that best matches the user's real needs is determined from the degree to which all block contents in each recalled target knowledge document match, respectively, the question as a sentence and the business keywords in the question as segmented words. This not only increases the probability that relevant block content is retrieved accurately, but also improves the accuracy and efficiency of replying to questions based on knowledge documents, and greatly improves user satisfaction and experience, thereby realizing personalized intelligent question answering in a true sense.
Further, the AIGC customer service platform recalls knowledge documents for the target business keywords through the pre-built and stored mapping relationship between business keywords and document IDs, which gives priority to key information. This not only reduces interference from non-key information and improves retrieval performance, but also improves the efficiency of knowledge document recall and reduces the chance that relevant knowledge documents are missed, thereby improving the accuracy and richness of subsequent replies based on knowledge documents.
Still further, for all document ID lists recalled by multiple target business keywords, the AIGC customer service platform determines the target knowledge document screening range based on the intersection when the document ID lists intersect, or based on the smallest of the document ID lists when the intersection is empty. This ensures that the screened target knowledge documents are important to the target business keywords, and improves the accuracy and reliability of the target knowledge document screening range.
Still further, the AIGC customer service platform retrieves the first block content from the target knowledge document by using document sentence content as the retrieval item, the document block content containing the sentence content as the returned item, and the ID of the document containing the sentence content as the filter item. It retrieves the second block content from the target knowledge document by using the document's directory titles at each level as the retrieval item, the ID of the document containing those titles as the filter item, and the block content under each sub-directory as the returned item. In this way, two-path recall based on document sentences and on the directory is achieved, which increases the probability that relevant block content is retrieved accurately and avoids, as far as possible, missing relevant knowledge documents.
Additional aspects and advantages of the present invention will be set forth in part in the following description, and in part will become apparent from the description or be learned through practice of the present invention.
Brief Description of the Drawings
In order to explain the technical solutions in the embodiments of the present invention or in the related art more clearly, the drawings needed in the description of the embodiments or the related art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and a person of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 is a first schematic flowchart of the intelligent reply method provided by an embodiment of the present invention;
FIG. 2 is a second schematic flowchart of the intelligent reply method provided by an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of the intelligent reply device provided by an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of the electronic device provided by an embodiment of the present invention.
Detailed Description
In order to make the purpose, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention are described clearly and completely below with reference to the drawings. Obviously, the described embodiments are some, but not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.
In the embodiments of the present invention, "at least one" means one or more, and "multiple" means two or more. "And/or" describes an association between associated objects and indicates that three relationships are possible; for example, "A and/or B" may mean that A exists alone, that A and B both exist, or that B exists alone, where A and B may each be singular or plural. In the text of the present invention, the character "/" generally indicates an "or" relationship between the objects before and after it. In addition, the ordinal terms assigned to the described objects, such as "first" and "second", serve only to distinguish those objects and carry no sequential or technical meaning.
With the continued development of AIGC technology, both academia and industry have been studying its application in their respective fields and scenarios, and intelligent customer service has become one of the most popular of these fields. More and more enterprises use AIGC technology to build intelligent customer service platforms. The emergence of large language models such as the Chat Generative Pre-trained Transformer (ChatGPT) model of the American artificial intelligence research company OpenAI, Wenxin Yiyan and Spark Cognition has created an opportunity to make intelligent customer service platforms more intelligent, allowing their replies to be more natural and to cover a wider range of knowledge. At the same time, making full use of these large language models poses major challenges. One of the problems most in need of a solution is how to bring in knowledge of the business scenario, especially the knowledge contained in the large volume of knowledge documents that an enterprise has accumulated.
Knowledge document knowledge comes from knowledge documents in various formats such as Word, Excel, PPT and PDF. The formats are not fixed and the content is unstructured; some knowledge documents are not even divided into paragraphs, which makes it difficult to extract knowledge points effectively and to slice the knowledge. One existing approach uses a standardized toolkit for extracting knowledge document content to obtain the title and body text of a document, and then extracts the knowledge to be retrieved by segmenting the text into sentences at punctuation symbols. However, the content extracted in this way is of a single kind and the resulting knowledge blocks are coarse. Retrieval generally considers only the sentence content, so the located position tends to drift and the most relevant knowledge may not be matched. Current knowledge document retrieval techniques are mainly keyword-based, with some matching based on vector representations; their advantages and disadvantages are shown in Table 1.
Table 1
The keyword-based retrieval method computes a matching score from the number of segmented words shared by the question and a retrieval sentence in the database. It is fast and can quickly narrow the retrieval range, but it depends on the quality of word segmentation, and many retrieval sentences often tie for the highest matching score. The vector-representation-based retrieval method computes a matching score from the similarity between the question vector and the vectors of the retrieval sentences in the database. It takes semantic information into account and represents text at a finer granularity, so it more easily retrieves the most relevant knowledge; however, computing vector similarity places higher demands on computing power and resources and is often slow. In addition, considering only the similarity between the question vector and the retrieval sentence vectors makes the result vulnerable to interference from non-key information in the sentence. To address these shortcomings of existing knowledge document retrieval schemes, the present invention proposes a retrieval scheme that gives priority to key information and only then performs vector representation matching, so that the range is narrowed first and fine-grained matching is then performed quickly.
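To make the contrast concrete, the sketch below shows a keyword score that ties across unrelated sentences, and a two-stage lookup that first narrows the candidates by a key term and only then ranks them by cosine similarity. The bag-of-words count vectors are a deliberately simple stand-in for learned sentence vectors, and the function names and sample sentences are hypothetical.

```python
import math
import re
from collections import Counter
from typing import List, Optional, Set


def tokens(text: str) -> List[str]:
    """Lowercase word tokens; a stand-in for a real word segmenter."""
    return re.findall(r"\w+", text.lower())


def keyword_score(question: str, sentence: str) -> int:
    """Keyword matching score: number of shared segmented words."""
    return len(set(tokens(question)) & set(tokens(sentence)))


def cosine(u: Counter, v: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0


def two_stage_search(question: str,
                     sentences: List[str],
                     key_terms: Set[str]) -> Optional[str]:
    """Narrow by key terms first, then rank the survivors by vector similarity."""
    candidates = [s for s in sentences if set(tokens(s)) & key_terms]
    question_vector = Counter(tokens(question))
    return max(candidates,
               key=lambda s: cosine(question_vector, Counter(tokens(s))),
               default=None)
```

Because similarity is computed only for sentences that survive the key-term filter, the expensive step runs on a much smaller candidate set, and sentences that merely share non-key words with the question are excluded before ranking.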
In addition, once the most relevant knowledge has been retrieved, an answer must be presented. The simplest current approach is to return the entire knowledge document or a fragment of it, leaving the user to find or distill the answer manually; moreover, merely returning the relevant knowledge documents provides no reasoning capability. The question is therefore how to use the answer-distillation and natural-reply capabilities of AIGC large language generation models such as ChatGPT: the most relevant retrieved knowledge is input to the model together with prompt words containing settings specific to the business scenario, and the model generates reply content that is natural in tone and concise in content. This addresses another shortcoming of existing knowledge document retrieval schemes and, together with the retrieval scheme described above, forms a complete AIGC-based method for recalling knowledge document knowledge.
In summary, the technical problems to be solved by the present invention include the following three aspects:
1) Knowledge document data is unstructured, and extracting effective information from it is challenging. Knowledge document data usually contains a large amount of data in many forms, such as natural language text, charts and pictures. Even if only the extracted text is considered, its sheer volume and lack of structure make it difficult to locate the text knowledge related to a question quickly and accurately. Retrieval items and the corresponding returned items therefore need to be extracted from knowledge documents effectively.
2) Accurately retrieving the knowledge document knowledge related to a question is difficult. To locate the relevant knowledge blocks accurately, finer-grained retrieval items inevitably have to be extracted from each knowledge document. The resulting surge in retrieval items both places great pressure on retrieval performance and makes accurate retrieval much harder. How to use the key information in knowledge documents to narrow the retrieval range and then quickly retrieve the relevant knowledge document blocks is a problem that urgently needs to be solved for knowledge recall.
3) The retrieved knowledge blocks can rarely answer the user's question directly. A knowledge block is generally only an excerpt from the content of a knowledge document and cannot serve directly as the answer. On the one hand, the knowledge that directly addresses the question must be extracted from it; on the other hand, it must be converted into a language style suited to question answering. How to make full use of large language models to generate reasonable answers remains to be solved.
The intelligent reply method, device, electronic device and storage medium provided by the present invention are described below with reference to FIG. 1 to FIG. 4. The method is executed by a pre-built AIGC customer service platform, which can be deployed on an electronic device or a server and provides a visual page for entering questions and returning replies. The electronic device may be a personal computer (PC), a portable device, a laptop, a smartphone, a tablet computer, a portable wearable device or another device; the server may be a single server, a server cluster composed of multiple servers, a cloud server, and so on. The present invention does not limit the specific form of the electronic device or server. Further, the method can also be applied in an intelligent reply device arranged in the electronic device or server, and that device can be implemented in software, hardware or a combination of the two. The method is described below taking the AIGC customer service platform as the executing entity.
To make the intelligent reply method provided by the embodiments of the present invention easier to understand, it is described in detail below through several exemplary embodiments. It can be understood that these exemplary embodiments may be combined with one another, and that identical or similar concepts or processes may not be repeated in some embodiments.
FIG. 1 is the first schematic flowchart of the intelligent reply method provided by the present invention. As shown in FIG. 1, the method includes the following steps 110 to 130.
Step 110: perform knowledge document recall based on the target business keywords in the target question text, and determine, based on the recall result, the target knowledge document screening range corresponding to the target business keywords.
The target business keywords may include, without limitation, at least one of the following: an entity contained in the target question text, being at least one of a product brand, a product category and a business term; a code contained in the target question text, such as a product model and/or an error prompt code; and the high-frequency word that occurs most often in the target question text. The target question text may be a question sentence that the user enters into the AIGC customer service platform.
It can be understood that the AIGC customer service platform provides a visual page for entering questions and returning replies, and also has at least information processing, speech recognition and image recognition functions.
Specifically, in step 110, the AIGC customer service platform first obtains the target question text entered by the user. The input modes may include, without limitation, typing on the visual page, uploading a photo, and voice input. For example, the user may type the target question text into the visual page provided by the platform; may upload an image containing the target question text, which the platform then recognizes; or may speak the target question text to the platform, which then recognizes the speech signal. The way in which the user enters the target question text is not specifically limited here.
For the target question text entered by the user, the AIGC customer service platform can perform knowledge document recall for the target business keywords in the text based on a pre-built knowledge document knowledge graph, and treat every recalled document as a target knowledge document to be screened; together these documents constitute the target knowledge document screening range. It should be noted that the knowledge document knowledge graph is built by taking, for each knowledge document of every type, its title and/or the keywords that best reflect its core content as entities, and taking the matching degree between an entity and the corresponding knowledge document as the relation.
Step 120: for each target knowledge document to be screened within the target knowledge document screening range, retrieve, from all block contents contained in the target knowledge document, the first block content whose matching degree with the target question text satisfies a first preset matching degree threshold, and retrieve the second block content that matches both the target tree-structured hierarchical directory of the target knowledge document and the target question text.
The knowledge content of each target knowledge document may be all of the text in an editable document, such as an HTML or Word document, or all of the text in a non-editable document, such as a PDF document or an image.
Specifically, in step 120, the AIGC customer service platform can first split each target knowledge document within the screening range into blocks, for example with a text segmentation model based on sequence modeling. It then determines the matching degree between each resulting block content and the target question text, and, from all the matching degrees determined for a given target knowledge document, selects the first block content corresponding to the target matching degree that satisfies the first preset matching degree threshold. Exemplarily, the target matching degree that satisfies the first preset matching degree threshold may be the maximum of all the matching degrees, or may be at least one matching degree greater than or equal to the threshold. The present invention does not specifically limit this.
To improve the precision of the subsequent reply content, the AIGC customer service platform can also determine, for each target knowledge document within the screening range, the matching degree between the target question text and every block content covered by that document's target tree-structured hierarchical directory, so that the second block content matching both the directory and the target question text can be retrieved from each target knowledge document.
Step 130: determine the target reply content for the target question text based on the first block contents and the second block contents.
Specifically, in step 130, given the first block content and second block content retrieved from each target knowledge document, the AIGC customer service platform can first select, from all the first block contents, the first target block content with the greatest similarity to the target question text, and select, from all the second block contents, the second target block content with the greatest similarity to the target question text. It then determines, from the first target block content and the second target block content, the target reply content that satisfies preset reply constraints. The preset reply constraints may include, without limitation, at least one of: constraints on the dialogue scenario and identity role; constraints on the tone, intonation and language of the reply; a statement of the business scenario limits; and a fallback prompt statement.
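As a minimal sketch of how the two selected blocks and the preset reply constraints might be combined into a single prompt for a language model, consider the following. The function name `build_prompt`, the constraint keys and the prompt wording are all assumptions made for illustration; the source does not specify a prompt format.

```python
def build_prompt(question, first_block, second_block, constraints):
    """Assemble one prompt string from the question, the two selected blocks
    and whichever reply constraints are present."""
    parts = ["Answer the customer's question using only the reference material below."]
    # Hypothetical keys standing for the four kinds of constraint named in the text:
    # scenario/role, tone and language, business scope, and fallback behaviour.
    for key in ("scene_and_role", "tone_and_language", "business_scope", "fallback"):
        text = constraints.get(key)
        if text:
            parts.append(text)
    parts.append("Reference A: " + first_block)
    parts.append("Reference B: " + second_block)
    parts.append("Question: " + question)
    return "\n".join(parts)


prompt = build_prompt(
    "What does F-6 mean?",
    "F-6 indicates ...",
    "Error codes section",
    {"fallback": "If the references do not cover the question, say so."},
)
```

Keeping the constraints as separate, optional entries reflects the "at least one of" wording: any subset can be supplied without changing the function.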
In the intelligent reply method provided by the embodiments of the present invention, for each target knowledge document recalled on the basis of the target business keywords in the target question text, the AIGC customer service platform retrieves, from all block contents of the document, the first block content whose matching degree with the target question text satisfies the first preset matching degree threshold, retrieves the second block content that matches both the document's target tree-structured hierarchical directory and the target question text, and combines the retrieved first and second block contents to determine the target reply content. Recalling knowledge documents along multiple paths from the keywords in the user's question enriches and widens the scope of the subsequent knowledge retrieval, so that the final reply fits the user's actual needs more closely. In addition, the final reply that best matches the user's real needs is determined from how well the block contents of each recalled document match both the question sentence and the business keywords segmented from it. This raises the probability that relevant block contents are retrieved accurately, improves the accuracy and efficiency of answering questions from knowledge documents, and improves user satisfaction and experience, thereby achieving personalized intelligent question answering.
Based on the intelligent reply method shown in FIG. 1, in an exemplary embodiment, step 110 may be implemented as follows:
Based on a pre-stored mapping between business keywords and document ids, perform knowledge document recall for the target business keywords, and determine the target knowledge document screening range based on the recalled knowledge document list.
Specifically, to increase the breadth and depth of multi-dimensional knowledge document recall, a mapping between business keywords and document ids can be built and stored in advance. Each document id uniquely identifies one knowledge document, and each business keyword occurs in at least one sentence content of at least one knowledge document. Given a target business keyword from the target question text, the platform looks it up in the mapping and determines the at least one document id that corresponds to it; alternatively, it finds the first business keyword in the mapping that has the highest matching degree with the target business keyword and determines the at least one document id corresponding to that first business keyword. The knowledge documents corresponding to these document ids are then determined as the target knowledge documents that constitute the target knowledge document screening range.
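The lookup described above can be sketched as follows. This is an illustration under stated assumptions: the mapping contents and document ids are invented, and the "highest matching degree" fallback is approximated with string similarity from the standard library's `difflib.SequenceMatcher`, which the source does not prescribe.

```python
from difflib import SequenceMatcher

# Hypothetical mapping: business keyword -> ids of the documents that contain it.
KEYWORD_TO_DOC_IDS = {
    "water heater": ["doc-001", "doc-007"],
    "F-6": ["doc-007"],
    "warranty": ["doc-003", "doc-004", "doc-007"],
}


def recall_doc_ids(target_keyword, mapping):
    """Return the document ids for a keyword. If the keyword is not stored,
    fall back to the stored keyword that is most similar to it."""
    if target_keyword in mapping:
        return list(mapping[target_keyword])
    if not mapping:
        return []

    def similarity(stored):
        # Assumption: character-level similarity stands in for "matching degree".
        return SequenceMatcher(None, target_keyword.lower(), stored.lower()).ratio()

    best = max(mapping, key=similarity)
    return list(mapping[best])
```

For instance, `recall_doc_ids("water heaters", KEYWORD_TO_DOC_IDS)` has no exact entry and falls back to the ids stored under `"water heater"`.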
In the intelligent reply method provided by the embodiments of the present invention, the AIGC customer service platform performs knowledge document recall for the target business keywords through the pre-built and stored mapping between business keywords and document ids. Giving priority to key information in this way reduces interference from non-key information and improves retrieval performance; it also improves recall efficiency and reduces the chance that relevant knowledge documents are missed, which in turn improves the accuracy and richness of the replies later generated from the knowledge documents.
Based on the intelligent reply method shown in FIG. 1, in an exemplary embodiment, when the target question text contains multiple target business keywords, the AIGC customer service platform determines the target knowledge document screening range from the recalled knowledge document lists as follows:
First, determine whether the document id lists recalled by the individual target business keywords share any document ids. If they do, determine the target knowledge document screening range from the knowledge documents corresponding to the shared document ids. If they do not, select from the document id lists the target document id list that contains the fewest document ids, and determine the target knowledge document screening range from the knowledge documents corresponding to the target document ids in that list.
Specifically, for multiple target business keywords in the target question text, the AIGC customer service platform can look up the mapping between business keywords and document ids to determine the document id list corresponding to each target business keyword. The target knowledge document screening range can then be determined from the intersection of these lists; that is, the knowledge documents corresponding to the document ids shared by the lists are determined as the target knowledge documents within the screening range. The lists may, however, have no intersection. To keep the screening from failing when the intersection is empty, the platform applies the principle that the fewer knowledge documents a keyword corresponds to, the more important that keyword tends to be to those documents: it selects, from the document id lists recalled by all target business keywords, the target document id list with the fewest document ids, and takes that list as the final target knowledge document screening range. The knowledge documents corresponding to the target document ids in that list are then the target knowledge documents within the screening range.
In the intelligent reply method provided by the embodiments of the present invention, for the document id lists recalled by multiple target business keywords, the AIGC customer service platform determines the target knowledge document screening range from the intersection of the lists when they intersect, and from the shortest list when the intersection is empty. This ensures that the screened target knowledge documents are important to the target business keywords and improves the accuracy and reliability of the screening range.
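The intersect-or-fall-back rule can be sketched in a few lines. The function name `screening_range` is hypothetical, and the handling of keywords that recalled nothing (their empty lists are ignored) is an assumption, since the source does not cover that case.

```python
def screening_range(doc_id_lists):
    """Intersect the recalled id lists; if nothing is shared,
    fall back to the list with the fewest document ids."""
    # Assumption: keywords that recalled no documents are skipped.
    lists = [list(ids) for ids in doc_id_lists if ids]
    if not lists:
        return []
    shared = set(lists[0]).intersection(*(set(ids) for ids in lists[1:]))
    if shared:
        # Preserve the order in which ids appear in the first list.
        return [doc_id for doc_id in lists[0] if doc_id in shared]
    return min(lists, key=len)
```

With `[["a", "b", "c"], ["b", "c", "d"]]` the shared ids `b` and `c` form the range; with `[["a", "b", "c"], ["d"]]` nothing is shared, so the single-entry list is used.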
Based on the intelligent reply method shown in FIG. 1, in an exemplary embodiment, the AIGC customer service platform can build the mapping between business keywords and document ids in advance. The construction process includes:
First, for each document id: identify target entities of different types from the sentence contents of the corresponding knowledge document, based on a pre-built entity dictionary; identify target codes of different types from the knowledge document, based on preset code type identifiers and code rules for each type; and identify the target high-frequency word that occurs most often in the knowledge document. Then, determine at least one of the target entities, target codes and target high-frequency word corresponding to each document id as business keywords, and build the mapping between business keywords and document ids from the business keywords and the document ids.
Specifically, the AIGC customer service platform can use an editable-document reading tool and optical character recognition (OCR) to extract all of the text from editable documents such as HTML and Word documents and from non-editable documents such as PDF documents and images, respectively. The extracted text is taken as the knowledge document, and the file name of each document is taken as its title. For each extracted knowledge document, target entities of different types, such as product brands, product categories and business terms, can be identified using the pre-built entity dictionary and the forward maximum matching method. Exemplarily, the pre-built entity dictionary may be as shown in Table 2.
Table 2
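Dictionary lookup with forward maximum matching can be sketched as follows: scan the text from left to right and, at each position, take the longest dictionary entry that starts there. The dictionary entries and type labels below are invented for illustration and are not the contents of Table 2.

```python
# Hypothetical entity dictionary: surface form -> entity type.
ENTITY_DICT = {
    "water heater": "product category",
    "gas water heater": "product category",
    "warranty": "business term",
}


def forward_max_match(text, entity_dict):
    """Return (entity, type) pairs found by always preferring the longest
    dictionary entry that begins at the current position."""
    max_len = max((len(entry) for entry in entity_dict), default=0)
    found = []
    i = 0
    while i < len(text):
        match = None
        # Try the longest possible window first, then shrink it.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if candidate in entity_dict:
                match = candidate
                break
        if match:
            found.append((match, entity_dict[match]))
            i += len(match)
        else:
            i += 1
    return found
```

On the text `"gas water heater warranty"` the longer entry `"gas water heater"` wins over `"water heater"`, which is the behaviour that distinguishes maximum matching from a plain substring search.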
In addition, the AIGC customer service platform can identify target codes of different types, such as product models and error identification codes, from each knowledge document, based on the code type identifiers and code rules preset for each type of code. For example, the identified target codes may be "QMF1679-G1" and "F-6". Exemplarily, the code rules may be regular expressions defined for each type of code, as shown in Table 3.
Table 3
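As a rough illustration of rule-based code extraction, the sketch below defines one regular expression per code type. The two patterns are assumptions chosen only so that they match the examples "QMF1679-G1" and "F-6"; they are not the rules of Table 3.

```python
import re

# Hypothetical rules: one compiled pattern per code type.
CODE_RULES = {
    # 2-4 capital letters, 3-5 digits, a hyphen, one capital letter and one digit.
    "product_model": re.compile(r"\b[A-Z]{2,4}\d{3,5}-[A-Z]\d\b"),
    # One capital letter, a hyphen, one or two digits.
    "error_code": re.compile(r"\b[A-Z]-\d{1,2}\b"),
}


def extract_codes(text):
    """Return every match of each rule, keyed by code type."""
    return {kind: rule.findall(text) for kind, rule in CODE_RULES.items()}


codes = extract_codes("Model QMF1679-G1 shows error F-6 on startup.")
```

Keeping the rules in a dictionary keyed by code type mirrors the pairing of code type identifiers with code rules described in the text, and new code types can be added without touching the extraction function.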
Furthermore, for all extracted knowledge documents, the AIGC customer service platform can remove useless stop words from each document based on a preset stop word list, obtaining valid knowledge documents free of such words. It then applies the principle that the more often a word occurs in a given knowledge document and the less often it occurs across all knowledge documents, the more critical that word is to the document: it computes the term frequency-inverse document frequency (TF-IDF) value of every word in each valid knowledge document, determines the maximum TF-IDF value for each document, and takes the word with that maximum value as the document's target high-frequency word. Traversing all valid knowledge documents in this way identifies the target high-frequency word of each knowledge document. Exemplarily, the useless stop words may include, without limitation, "this", "want" and "as above".
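The stop word removal and TF-IDF selection described above can be sketched as follows. The sketch assumes documents are already tokenized and uses the common variant tf × ln(N / df); the source does not state which TF-IDF formula is used, and the function name is hypothetical.

```python
import math
from collections import Counter


def top_tfidf_term(docs, stop_words=frozenset()):
    """docs maps doc_id -> list of tokens. Returns doc_id -> the token
    with the highest TF-IDF value in that document."""
    cleaned = {
        doc_id: [tok for tok in tokens if tok not in stop_words]
        for doc_id, tokens in docs.items()
    }
    n_docs = len(cleaned)
    # Document frequency: in how many documents each token occurs.
    doc_freq = Counter()
    for tokens in cleaned.values():
        doc_freq.update(set(tokens))

    result = {}
    for doc_id, tokens in cleaned.items():
        if not tokens:
            continue  # nothing left after stop word removal
        term_freq = Counter(tokens)
        scores = {
            tok: (count / len(tokens)) * math.log(n_docs / doc_freq[tok])
            for tok, count in term_freq.items()
        }
        result[doc_id] = max(scores, key=scores.get)
    return result
```

A word that appears in every document gets an inverse document frequency of ln(1) = 0, so it can never outrank a word that is specific to one document, which is the principle the paragraph above states.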
The AIGC customer service platform can then build the mapping between business keywords and document ids from the business keywords and the document ids. That is, each business keyword extracted from a sentence content of a knowledge document is used as the key, the document id of the knowledge document containing that sentence content is used as the value, and the pair is stored in a keyword inverted index table. The keyword inverted index table may be as shown in Table 4.
Table 4
The keyword inverted index table shown in Table 4 can be used to narrow the retrieval scope when retrieving document knowledge, making retrieval more precise. When a newly extracted business keyword is stored in the table, the keyword is first queried to determine whether it already exists in the table. If it does, the value stored for that keyword is updated: the value determined for the newly extracted keyword is merged with the value already in the table and duplicate document ids are removed. If it does not, a new key-value row containing the newly extracted business keyword and its value is inserted into the table. This process is repeated until all business keywords have been traversed, which yields the complete keyword inverted index table; the complete table is then stored as the mapping between business keywords and document ids.
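The query-then-merge-or-insert procedure can be sketched with an in-memory dictionary standing in for the keyword inverted index table. The function names and the sample pairs are hypothetical; a real deployment would presumably use a database or search index, which the source does not specify.

```python
def upsert_keyword(index, keyword, doc_ids):
    """Merge doc_ids into index[keyword], dropping duplicates while keeping
    first-seen order; insert a new row if the keyword is not yet present."""
    existing = index.get(keyword, [])
    # dict.fromkeys keeps insertion order and removes repeated ids.
    index[keyword] = list(dict.fromkeys(existing + list(doc_ids)))
    return index


def build_index(extracted_pairs):
    """extracted_pairs is an iterable of (business keyword, doc_id)."""
    index = {}
    for keyword, doc_id in extracted_pairs:
        upsert_keyword(index, keyword, [doc_id])
    return index


index = build_index([
    ("F-6", "doc-7"),
    ("warranty", "doc-3"),
    ("F-6", "doc-9"),
    ("F-6", "doc-7"),  # duplicate pair, ignored on merge
])
```

Because the merge and the insert go through the same function, the explicit existence check in the text collapses into a single `index.get` call with an empty default.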
In the intelligent reply method provided by the embodiments of the present invention, the AIGC customer service platform builds the mapping between business keywords and document ids from the target entities, target codes and target high-frequency words extracted from knowledge documents of different formats, together with the document ids of those documents. This makes the business keywords more comprehensive and richer, and thereby effectively supports fast narrowing of the search scope and multi-path retrieval recall in subsequent document knowledge retrieval.
Based on the intelligent reply method shown in FIG. 1, in an exemplary embodiment, step 120 may be implemented as follows:
Based on the mapping among sentence content, document id and block content built in advance for each type of knowledge document, retrieve from the target knowledge document the first block content that contains the target sentence content whose matching degree with the target question text satisfies the first preset matching degree threshold. Based on the mapping between document id and the tree-structured hierarchical directory of the knowledge document, determine the target tree-structured hierarchical directory of the target knowledge document, and determine the matching degree between each level heading in that directory and the target question text. From the block contents under the level headings in the target tree-structured hierarchical directory, retrieve the second block content under the target level heading whose target matching degree satisfies a second preset matching degree threshold.
Specifically, to make retrieval both precise and fast, a mapping among sentence content, document id and block content can be built and stored in advance for each type of knowledge document. In each such mapping, the sentence content is the retrieval item, the id of the document containing the sentence content is the filter item, and the block content containing the sentence content is the return item. By determining the matching degree between the target question text and all sentence contents with a graph-based K-nearest-neighbor (KNN) algorithm, the first block content containing the target sentence content whose matching degree with the target question text satisfies the first preset matching degree threshold can be retrieved from the target knowledge document quickly and accurately.
In addition, a mapping between document id and the tree-structured hierarchical directory of the knowledge document can be built and stored in advance for each type of knowledge document. In each such mapping, the document id is the filter item and the block contents of the subdirectories under the tree-structured hierarchical directory are the return items. Once the target tree-structured hierarchical directory of a target knowledge document has been selected by document id, the matching degree between the target business keywords and each level heading in that directory is determined, again with the graph-based KNN algorithm, so that the second block content under the target level heading whose matching degree with the target question text satisfies the second preset matching degree threshold can be retrieved from the target knowledge document quickly and accurately.
示例性的,满足第二预设阈值的目标匹配度可以为所有匹配度中的最大匹配度,也可以为大于或等于第二预设阈值的至少一个匹配度。本发明对此不作具体限定。Exemplarily, the target matching degree that satisfies the second preset threshold may be the maximum matching degree among all matching degrees, or may be at least one matching degree that is greater than or equal to the second preset threshold, which is not specifically limited in the present invention.
需要说明的是,为了提高筛选和返回的速率,可以将检索项设置为向量模式,也即构建并存储每个分句内容向量-文档id-分块内容之间映射关系,以及每个文档id-知识文档的树结构层级目录中每个层级标题均为标题向量;相应地,检索时可以确定目标提问向量与所有分句内容向量之间的匹配度,以及确定目标业务关键词向量与目标树结构层级目录中每个层级标题向量之间的匹配度。示例性的,上述确定的所有匹配度,具体均可以为余弦相似度。It should be noted that in order to improve the screening and return rate, the search item can be set to vector mode, that is, to construct and store the mapping relationship between each sentence content vector-document id-block content, and each document id-each level title in the tree structure hierarchy directory of the knowledge document is a title vector; accordingly, during the search, the matching degree between the target question vector and all sentence content vectors can be determined, as well as the matching degree between the target business keyword vector and each level title vector in the target tree structure hierarchy directory can be determined. Exemplarily, all the matching degrees determined above can be specifically cosine similarity.
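The two ways of satisfying a preset matching degree threshold (taking only the maximum, or taking every match at or above the threshold) can be illustrated with cosine similarity. The following is a minimal sketch, not the implementation of the embodiment; the function names `cosine_similarity` and `select_matches` and the sample scores are assumptions made for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def select_matches(similarities, threshold, best_only=False):
    """Return the keys whose similarity meets the threshold, best first.

    similarities: dict mapping a search item (hypothetical key) to its score.
    best_only=True keeps only the single highest-scoring item.
    """
    ranked = sorted(similarities, key=similarities.get, reverse=True)
    kept = [key for key in ranked if similarities[key] >= threshold]
    return kept[:1] if best_only else kept


scores = {"a": 0.9, "b": 0.7, "c": 0.4}
print(select_matches(scores, threshold=0.6))
print(select_matches(scores, threshold=0.6, best_only=True))
```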
此外,需要说明的是,图的KNN算法基于小世界理论,即在一个充分连接的图里,通过六跳就可以连通两个点,应用到向量检索上,通过任意起点,以贪心算法按距离逼近跳到目标顶点。这样,通过使用图的KNN算法,可以提高检索第一分块内容和第二分块内容的速率和精度。In addition, it should be noted that the graph-based KNN algorithm rests on small-world theory: in a sufficiently connected graph, any two points can be linked within six hops. Applied to vector retrieval, the search starts from an arbitrary vertex and greedily hops to whichever neighbor is closer to the target until it reaches the target vertex. In this way, using the graph-based KNN algorithm can improve the speed and accuracy of retrieving the first block content and the second block content.
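The greedy hop toward the target vertex can be sketched as follows. This is a simplified illustration on a hand-built neighbor graph, assuming cosine distance as the distance measure; `greedy_search`, the graph layout and the sample vectors are hypothetical, and a production graph index would add multiple entry points and candidate lists.

```python
import math


def cosine_distance(a, b):
    """1 minus cosine similarity; smaller means closer."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def greedy_search(vectors, neighbors, query, start):
    """Hop to the neighbor closest to the query until no neighbor is closer.

    vectors: dict vertex -> vector; neighbors: dict vertex -> list of vertices.
    Returns the vertex where the walk stops and its distance to the query.
    """
    current = start
    current_dist = cosine_distance(vectors[current], query)
    while True:
        best, best_dist = current, current_dist
        for candidate in neighbors[current]:
            dist = cosine_distance(vectors[candidate], query)
            if dist < best_dist:
                best, best_dist = candidate, dist
        if best == current:  # local minimum reached
            return current, current_dist
        current, current_dist = best, best_dist


vectors = {0: [1.0, 0.0], 1: [0.8, 0.6], 2: [0.6, 0.8], 3: [0.0, 1.0]}
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
vertex, _ = greedy_search(vectors, neighbors, query=[0.1, 1.0], start=0)
print(vertex)
```

Because each step only inspects the neighbors of the current vertex, the walk touches a small part of the graph, which is where the speed gain over an exhaustive comparison comes from; the trade-off is that a greedy walk can stop at a local minimum.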
本发明实施例提供的智能回复方法,AIGC客服平台通过以文档分句内容为检索项、分句内容所在的文档分块内容作为返回项、和分句内容所在文档id作为筛选项,从目标知识文档中检索第一分块内容,以及以文档各级目录为检索项、文档各级目录所在文档id为筛选项和以各子目录下分块内容为返回项,从目标知识文档中检索第二分块内容,这样,可以实现基于文档分句和目录的两路召回,增加相关分块内容准确检索的几率,尽可能地避免相关知识文档漏命中。The intelligent reply method provided by the embodiment of the present invention is that the AIGC customer service platform retrieves the first block content from the target knowledge document by using the document sentence content as the search item, the document block content where the sentence content is located as the return item, and the document ID where the sentence content is located as the filter item, and retrieves the second block content from the target knowledge document by using the document directories at all levels as the search item, the document ID where the document directories at all levels are located as the filter item and the block content under each sub-directory as the return item. In this way, two-way recall based on document sentences and directories can be achieved, thereby increasing the probability of accurate retrieval of relevant block content and avoiding missing relevant knowledge documents as much as possible.
基于上述图1所示的智能回复方法,在一种示例实施例中,分句内容-文档id-分块内容之间映射关系的构建过程包括:Based on the intelligent reply method shown in FIG. 1 above, in an exemplary embodiment, the process of constructing the mapping relationship between sentence content-document id-block content includes:
针对各种不同类型知识文档各自的文档id,基于预先设置的分句分隔符,对文档id对应的知识文档进行分句,以及基于文本分割模型对知识文档进行分块;针对分句所得的所有分句内容,从分句内容在分块所得的对应初始分块内容中确定预设分割窗口内的分块内容;基于分句所得的所有分句内容、各预设分割窗口内的分块内容和各文档id,构建分句内容-文档id-分块内容之间映射关系。For the document IDs of various types of knowledge documents, the knowledge documents corresponding to the document IDs are segmented into sentences based on the preset sentence delimiters, and the knowledge documents are segmented into blocks based on the text segmentation model; for all sentence contents obtained from the sentence segmentation, the block contents within the preset segmentation window are determined from the corresponding initial block contents obtained from the segmentation; based on all sentence contents obtained from the sentence segmentation, the block contents within each preset segmentation window and each document ID, a mapping relationship between sentence content-document ID-block content is constructed.
具体的,AIGC客服平台对于预先提取到的各种不同类型知识文档,可以首先使用类似于中英文句号、感叹号和问号等其它常见分句分隔符,对知识文档进行分句,并将分句所得所有分句作为后续知识文档检索的检索项;再进一步选取基于序列建模的文本分割模型(nlp_bert_document-segmentation_chinese-base)对知识文档进行分块,并将分块所得到的所有分块内容作为后续知识文档检索的返回项。其中,基于序列建模的文本分割模型用于利用长文本的语义信息对每个分句内容是否为段落内容的边界进行分类,在有效利用足够的上下文信息以进行准确分割和高效推理效率之间找到良好的平衡,克服了层次模型面临计算量大,推理速度慢等问题。Specifically, for the various types of knowledge documents extracted in advance, the AIGC customer service platform can first split each knowledge document into sentences using common sentence delimiters such as Chinese and English periods, exclamation marks and question marks, and use all resulting sentences as search items for subsequent knowledge document retrieval; it then selects a text segmentation model based on sequence modeling (nlp_bert_document-segmentation_chinese-base) to divide the knowledge document into blocks, and uses all resulting block contents as return items for subsequent knowledge document retrieval. The sequence-modeling text segmentation model uses the semantic information of long text to classify whether each sentence is a paragraph boundary; it strikes a good balance between using enough context for accurate segmentation and keeping inference efficient, and overcomes the heavy computation and slow inference of hierarchical models.
AIGC客服平台对于每个知识文档经由分句所得到的所有分句内容,以分句内容对应的分句向量为检索项,以分句内容在分块所得到的初始分块内容作为返回项,以及分句内容所在文档id作为筛选项,存入索引便于通过提问文本匹配分句内容进行检索。For all sentence contents obtained by splitting each knowledge document, the AIGC customer service platform uses the sentence vector of each sentence content as the search item, the initial block content that contains the sentence as the return item, and the id of the document containing the sentence as the filter item, and stores them in an index so that sentence contents can be retrieved by matching the question text.
考虑到文档分割的准确性对于后续回复尤其重要,以及为了更加便捷地调整各初始分块内容的大小,存储时存入每个分句内容的前后各n个分句内容作为备选的分块内容,以作为检索的返回项;另外,还可以存入一个基于文本分割模型预测的预设分割窗口,如[1,2]表示截取分句内容在对应初始分块内容中的前1个分句内容、当前分句内容和后2个分句内容作为检索的返回项,如[n,n]表示截取分句内容在对应初始分块内容中的前n个分句内容、当前分句内容和后n个分句内容作为检索的返回项,存入分句链表索引用于分句内容检索。示例性的,分句链表索引字段示例如表5所示,其中n为相对较大的正整数,如n可以取值为30。Considering that the accuracy of document segmentation is particularly important for subsequent replies, and in order to more conveniently adjust the size of each initial block content, the n sentence contents before and after each sentence content are stored as alternative block contents during storage, as the return items for retrieval; in addition, a preset segmentation window based on the prediction of the text segmentation model can also be stored, such as [1, 2] means that the first sentence content, the current sentence content and the next two sentence contents in the corresponding initial block content are taken as the return items for retrieval, such as [n, n] means that the first n sentence contents, the current sentence content and the next n sentence contents in the corresponding initial block content are taken as the return items for retrieval, and the sentence linked list index is stored for sentence content retrieval. Exemplarily, the sentence linked list index field example is shown in Table 5, where n is a relatively large positive integer, such as n can be 30.
表5Table 5
然后,可以基于完整的分句链表索引,为各种不同类型知识文档中每个知识文档构建分句内容-文档id-分块内容之间映射关系。Then, based on the complete clause linked list index, a mapping relationship between clause content-document id-block content can be constructed for each knowledge document in various types of knowledge documents.
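The sentence splitting and the preset segmentation window described above can be sketched as follows. This is a minimal sketch under stated assumptions: the delimiter set, the function names (`split_sentences`, `window_chunk`, `build_sentence_index`) and the record fields are hypothetical, the text segmentation model is not invoked, and the sentence list passed in stands for the sentences of one initial block.

```python
import re

# Assumed delimiter set: Chinese and English period, exclamation mark, question mark.
_SENTENCE = re.compile(r"[^。!?.!?]+[。!?.!?]*")


def split_sentences(text):
    """Split text into sentences, keeping each trailing delimiter."""
    return [part.strip() for part in _SENTENCE.findall(text) if part.strip()]


def window_chunk(sentences, index, window):
    """Return the sentences covered by window [before, after] around index.

    window=(1, 2) keeps 1 sentence before, the current one, and 2 after,
    clipped to the boundaries of the initial block.
    """
    before, after = window
    low = max(0, index - before)
    high = min(len(sentences), index + after + 1)
    return sentences[low:high]


def build_sentence_index(doc_id, sentences, window):
    """One record per sentence: search item, filter item, return item."""
    return [
        {"sentence": s, "doc_id": doc_id, "chunk": window_chunk(sentences, i, window)}
        for i, s in enumerate(sentences)
    ]


sentences = split_sentences("A。B!C?D。E。")
for record in build_sentence_index("doc-1", sentences, window=(1, 2)):
    print(record)
```

Storing the neighbors of every sentence means the size of the returned block can be changed at query time by adjusting the window, without re-running the segmentation.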
本发明实施例提供的智能回复方法,AIGC客服平台通过对不同类型知识文档分别先进行分句和分块,再基于分句所得分句在分块所得对应初始分块内容中选取作为返回项的分块内容、后结合各文档id构建分句内容-文档id-分块内容之间映射关系,提高了后续快速检索与提问文本相关分块内容的全面性和丰富性。The intelligent reply method provided by the embodiment of the present invention is that the AIGC customer service platform first divides different types of knowledge documents into sentences and blocks, and then selects the block content as the return item from the corresponding initial block content obtained by the block based on the sentence obtained by the sentence, and then builds a mapping relationship between the sentence content-document id-block content in combination with each document id, thereby improving the comprehensiveness and richness of the subsequent rapid retrieval of the block content related to the question text.
基于上述图1所示的智能回复方法,在一种示例实施例中,文档id-知识文档的树结构层级目录之间映射关系的构建过程包括:Based on the intelligent reply method shown in FIG. 1 above, in an exemplary embodiment, the process of constructing the mapping relationship between the document id and the tree structure hierarchical directory of the knowledge document includes:
基于各种不同类型知识文档各自预先设置的内容格式、文档标题和字体大小,识别每个知识文档中不同层级标题和每个层级标题下的分块内容,并基于识别结果和各种不同类型知识文档各自的文档id,构建文档id-知识文档的树结构层级目录之间映射关系;各文档id-知识文档的树结构层级目录之间映射关系用于确定目标知识文档的目标树结构层级目录。Based on the pre-set content formats, document titles and font sizes of various types of knowledge documents, the different levels of titles in each knowledge document and the block contents under each level of title are identified, and based on the identification results and the document IDs of various types of knowledge documents, a mapping relationship between the document ID and the tree structure hierarchical directory of the knowledge document is constructed; the mapping relationship between each document ID and the tree structure hierarchical directory of the knowledge document is used to determine the target tree structure hierarchical directory of the target knowledge document.
具体的,AIGC客服平台对于预先提取到的各种不同类型知识文档,可以基于知识文档内预先设置的内容格式、文档标题和字体大小,识别每个知识文档中不同层级标题,如word文档中的标题1、标题2、标题3和html文档的h1格式、h2格式、h3格式,同一格式的文档标题可以认为是同级标题,低层级的格式标题可以认为是高层级格式文档标题的子层级标题,并将各层级标题转化为能体现父子关系的父子关系目录,具体可以如表6所示。Specifically, the AIGC customer service platform can identify different levels of titles in each knowledge document based on the pre-set content format, document title and font size in the knowledge document, such as Title 1, Title 2, Title 3 in word documents and h1 format, h2 format, h3 format in html documents for various types of knowledge documents extracted in advance. Document titles of the same format can be considered as titles of the same level, and low-level format titles can be considered as sub-level titles of high-level format document titles. The titles of each level are converted into a parent-child relationship directory that can reflect the parent-child relationship, as shown in Table 6.
表6Table 6
此时,结合表6所示的父子关系目录,AIGC客服平台可以以各层级目录中每个层级标题的标题向量为检索项,各层级目录下各子目录标题的分块内容作为返回项,以及各层级目录所在文档id作为筛选项,存入索引便于通过提问文本匹配层级标题进行检索。At this time, combined with the parent-child relationship directory shown in Table 6, the AIGC customer service platform can use the title vector of each level title in each level directory as the search item, the block content of each sub-directory title under each level directory as the return item, and the document ID of each level directory as the filter item, and store it in the index to facilitate retrieval by matching the level title through the question text.
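For an html document, turning h1, h2 and h3 headings into a parent-child relationship directory can be sketched with the standard-library `html.parser` module. This is only an illustration; `HeadingParser`, `parent_child_directory` and the sample markup are assumptions, and Word heading styles would need a different reader.

```python
from html.parser import HTMLParser


class HeadingParser(HTMLParser):
    """Collect (level, title) pairs for h1..h6 in document order."""

    def __init__(self):
        super().__init__()
        self.headings = []
        self._level = None
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if len(tag) == 2 and tag[0] == "h" and tag[1].isdigit():
            self._level = int(tag[1])
            self._buffer = []

    def handle_data(self, data):
        if self._level is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if self._level is not None and tag == f"h{self._level}":
            self.headings.append((self._level, "".join(self._buffer).strip()))
            self._level = None


def parent_child_directory(headings):
    """Map each title to its parent title (None for a top-level title)."""
    parents = {}
    stack = []  # open headings, outermost first
    for level, title in headings:
        # Headings of the same or a deeper level close when a new one starts.
        while stack and stack[-1][0] >= level:
            stack.pop()
        parents[title] = stack[-1][1] if stack else None
        stack.append((level, title))
    return parents


parser = HeadingParser()
parser.feed("<h1>Billing</h1><p>text</p><h2>Refunds</h2><h3>Deadline</h3><h2>Invoices</h2>")
print(parent_child_directory(parser.headings))
```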
需要说明的是,考虑到高层级标题下可能存在多个层级子标题,如果直接返回子层级标题下的所有分块内容最终可能形成较长的知识片段,不但可能会对后续大语言模型生成的应答回复造成干扰,也可能超出大语言模型所能接收的最大内容数,导致知识片段直接被截断。因此,可以通过分层级存储每个知识文档的文档目录下各子层级的分块内容,例如:“1.1”:[[“1.1.1”,“1.1.2”],[“1.1.1.1”,“1.1.1.2”],[“1.1.1.2.1”,“1.1.1.2.2”]]这种树结构存入树目录索引表中,可以灵活调整检索文档目录的层级深度。示例性的,树目录索引表字段示例如表7所示。It should be noted that, considering that there may be multiple levels of sub-titles under a high-level title, if all the block contents under the sub-level title are directly returned, it may eventually form a longer knowledge fragment, which may not only interfere with the response response generated by the subsequent large language model, but may also exceed the maximum number of contents that the large language model can receive, resulting in the knowledge fragment being directly truncated. Therefore, the block contents of each sub-level under the document directory of each knowledge document can be stored hierarchically, for example: "1.1": [["1.1.1", "1.1.2"], ["1.1.1.1", "1.1.1.2"], ["1.1.1.2.1", "1.1.1.2.2"]] This tree structure is stored in the tree directory index table, and the hierarchical depth of the retrieved document directory can be flexibly adjusted. Exemplarily, an example of the tree directory index table field is shown in Table 7.
表7Table 7
然后,可以基于完整的树目录索引表,为各种不同类型知识文档中每个知识文档构建文档id-知识文档的树结构层级目录之间映射关系。Then, based on the complete tree directory index table, a mapping relationship between the document id and the tree structure level directory of the knowledge document can be constructed for each knowledge document in various types of knowledge documents.
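Storing descendants level by level, as in the "1.1" example above, makes it straightforward to cap how deep a retrieval goes. The sketch below assumes that layout; `descendants_to_depth`, `collect_chunks`, the `max_depth` parameter and the sample block texts are hypothetical names and data used only for illustration.

```python
def descendants_to_depth(levels, max_depth):
    """Flatten the first max_depth levels of descendant heading ids.

    levels[i] lists the heading ids that sit i + 1 levels below the
    matched heading, mirroring the nested-list layout in the text.
    """
    selected = []
    for level in levels[:max_depth]:
        selected.extend(level)
    return selected


def collect_chunks(heading_id, tree_index, chunks, max_depth):
    """Return block contents for the heading and its descendants up to max_depth."""
    ids = [heading_id] + descendants_to_depth(tree_index.get(heading_id, []), max_depth)
    return [chunks[i] for i in ids if i in chunks]


tree_index = {
    "1.1": [
        ["1.1.1", "1.1.2"],
        ["1.1.1.1", "1.1.1.2"],
        ["1.1.1.2.1", "1.1.1.2.2"],
    ]
}
chunks = {"1.1": "overview", "1.1.1": "part one", "1.1.2": "part two", "1.1.1.1": "detail"}
print(collect_chunks("1.1", tree_index, chunks, max_depth=1))
```

Lowering `max_depth` shortens the returned knowledge fragment, which is one way to keep it within the input limit of the large language model.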
本发明实施例提供的智能回复方法,AIGC客服平台通过识别的每个知识文档预先设置的层级目录、每个层级标题下的分块内容和文档id,为每个知识文档构建文档id-知识文档的树结构层级目录之间映射关系,提高了后续快速检索与业务关键词相关分块内容的全面性和丰富性。In the intelligent reply method provided by the embodiment of the present invention, the AIGC customer service platform constructs, for each knowledge document, a mapping relationship between the document id and the tree structure hierarchical directory of the knowledge document by identifying the pre-set hierarchical directory of the knowledge document, the block content under each level title, and the document id, thereby improving the comprehensiveness and richness of subsequent rapid retrieval of block content related to the business keywords.
基于上述图1所示的智能回复方法,在一种示例实施例中,步骤130的具体实现过程可以包括:Based on the intelligent reply method shown in FIG. 1 , in an exemplary embodiment, the specific implementation process of step 130 may include:
对各第一分块内容和各第二分块内容进行排序,并基于排序结果确定目标分块内容;基于预先为不同业务场景对应设置的个性化回复提示词和大语言模型,确定目标分块内容对应的目标回复内容。The first block contents and the second block contents are sorted, and the target block contents are determined based on the sorting results; based on the personalized reply prompt words and the large language model pre-set for different business scenarios, the target reply content corresponding to the target block content is determined.
示例性的,大语言模型具体可以为生成型预训练变换(Chat Generative Pre-trained Transformer,ChatGPT)模型。Exemplarily, the large language model may be a Chat Generative Pre-trained Transformer (ChatGPT) model.
具体的,AIGC客服平台可以基于余弦相似度对所有第一分块内容和第二分块内容进行排序,考虑到分句内容和层级标题分别与提问文本之间像素点的敏感程度可能存在差异,可以在使用余弦相似度进行排序前先进行加权。再从排序结果中选取优先级最高的前N个分块内容作为目标分块内容,也可以排序后的分块内容都作为目标分块内容,具体如何确定目标分块内容可以根据用户实际需求确定,此处不作具体限定。N为大于等于1的正整数。Specifically, the AIGC customer service platform can sort all the first block contents and the second block contents based on cosine similarity. Considering that there may be differences in the sensitivity of the pixels between the sentence content and the hierarchical title and the question text, weighting can be performed before sorting using cosine similarity. Then, the top N block contents with the highest priority are selected from the sorting results as the target block contents, or all the sorted block contents can be used as the target block contents. How to determine the target block content can be determined according to the actual needs of the user, and is not specifically limited here. N is a positive integer greater than or equal to 1.
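The weighting, sorting and top-N selection can be sketched as follows. The weights, the default values and the rule of keeping the higher score when the same block is returned by both retrieval paths are assumptions made for this illustration; the text leaves these choices open.

```python
def rank_chunks(sentence_hits, heading_hits, sentence_weight=1.0,
                heading_weight=0.8, top_n=3):
    """Merge both recall paths, weight their similarities, keep the top N.

    sentence_hits / heading_hits: lists of (chunk, cosine_similarity).
    A block found by both paths keeps its higher weighted score (assumption).
    """
    scored = {}
    for hits, weight in ((sentence_hits, sentence_weight), (heading_hits, heading_weight)):
        for chunk, similarity in hits:
            score = similarity * weight
            if score > scored.get(chunk, float("-inf")):
                scored[chunk] = score
    ranked = sorted(scored.items(), key=lambda item: item[1], reverse=True)
    return [chunk for chunk, _ in ranked[:top_n]]


first = [("chunk A", 0.9), ("chunk B", 0.6)]
second = [("chunk C", 0.95), ("chunk A", 0.5)]
print(rank_chunks(first, second, top_n=2))
```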
此时,以检索获取的目标分块内容为背景知识,目标提问文本为提问,根据预先为不同业务场景对应设置的个性化回复提示词和大语言模型,生成最终可向用户回复的目标回复内容。At this time, taking the retrieved target block content as background knowledge and the target question text as the question, the target reply content that is finally returned to the user is generated according to the personalized reply prompt words pre-set for different business scenarios and the large language model.
需要说明的是,考虑到不同业务场景的特征,可以预先针对性地设置一些限制,包括身份场景、语言风格、业务规则等其它个性化回复提示词,并配置ChatGPT无法根据背景知识作出回答时的兜底提示,具体参数及说明如表8所示。It should be noted that, considering the characteristics of different business scenarios, some restrictions can be set in advance, including identity scenarios, language styles, business rules and other personalized reply prompts, and configure a fallback prompt when ChatGPT cannot answer based on background knowledge. The specific parameters and descriptions are shown in Table 8.
表8Table 8
本发明实施例提供的智能回复方法,AIGC客服平台基于检索的相关所有分块内容,配合针对特定业务场景的提示词,打通AIGC能力在文档知识问答应用的链路,可以有效提高文档知识提取、总结归纳和推理能力,自然合理地应对不同业务场景的智能问答。The intelligent reply method provided in the embodiment of the present invention is based on the retrieval of all relevant block contents by the AIGC customer service platform, combined with prompt words for specific business scenarios, to open up the link of AIGC capabilities in document knowledge question and answer applications, which can effectively improve the ability of document knowledge extraction, summarization and reasoning, and naturally and reasonably respond to intelligent questions and answers in different business scenarios.
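Assembling the personalized reply prompt words, the retrieved block content and the question into one prompt can be sketched as follows. The template wording, the `build_prompt` function and its parameter names are assumptions for illustration only; the call to the large language model is deliberately left out.

```python
def build_prompt(question, chunks, identity, style, rules, fallback):
    """Combine scenario settings, background knowledge and the question."""
    background = "\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, 1))
    return (
        f"{identity}\n"
        f"Language style: {style}\n"
        f"Business rules: {rules}\n"
        "If the background knowledge does not answer the question, "
        f"reply exactly: {fallback}\n\n"
        f"Background knowledge:\n{background}\n\n"
        f"Question: {question}"
    )


prompt = build_prompt(
    question="How long does a refund take?",
    chunks=["Refunds are processed within 7 working days."],
    identity="You are a customer service assistant.",
    style="polite and concise",
    rules="answer only from the background knowledge",
    fallback="Sorry, I cannot answer that yet.",
)
print(prompt)
```

Keeping the fallback wording inside the prompt gives the model an explicit answer to use when the retrieved blocks do not cover the question.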
示例性的,参照图2所示的智能回复方法的流程示意图之二,在图2中,业务知识管理系统可以为预先搭建的AIGC客服平台,知识提取为各种不同类型知识文档的提取,关键词-文档id具体为业务关键词与文档id之间映射关系,分句-文档分块具体为针对每篇知识文档进行分句及分块,目录-正文具体为针对每篇知识文档预先设置的不同层级标题分别转化体现父子关系的父子关系目录,embedding具体指代嵌入或存入,关键词具体指代目标业务关键词,提问具体为目标提问文本,最匹配文档分块具体为所有第一分块内容,最匹配文档分层级正文具体为所有第二分块内容,提示词具体可以为预先为不同业务场景对应设置的个性化回复提示词,答案具体为最终向用户输出的目标回复内容;其中涉及的具体实现过程及效果可以参照前述实施例。此处不再赘述。Exemplarily, refer to FIG. 2, the second schematic flow chart of the intelligent reply method. In FIG. 2, the business knowledge management system may be the pre-built AIGC customer service platform; knowledge extraction is the extraction of the various types of knowledge documents; keyword-document id is the mapping relationship between business keywords and document ids; sentence-document block refers to splitting each knowledge document into sentences and blocks; directory-text refers to converting the pre-set titles of different levels in each knowledge document into a parent-child relationship directory; embedding refers to embedding or storing; keyword refers to the target business keyword; question is the target question text; best-matching document block is all the first block contents; best-matching document hierarchical text is all the second block contents; prompt words may be the personalized reply prompt words pre-set for different business scenarios; and answer is the target reply content finally output to the user. For the specific implementation process and effects, refer to the foregoing embodiments; details are not repeated here.
这样,通过关键词识别+知识分块+目录提取这一文档知识提取方式,实现了知识文档检索项和返回项的有效提取,提取知识文档的分句内容和目录作为检索项,识别检索项的实体、编码、高频词三类关键词作为筛选项,并通过语义分割模型对知识文档知识分块,作为检索项对应的返回项,支持后续的文档知识多路检索,解决了文档内容非结构化和巨量特征导致难以定位相关文档分块内容的技术问题。In this way, through the document knowledge extraction method of keyword recognition + knowledge segmentation + directory extraction, the effective extraction of knowledge document retrieval items and return items is realized, the sentence content and directory of the knowledge document are extracted as retrieval items, and the three types of keywords of the retrieval items, namely entities, codes, and high-frequency words, are identified as screening items. The knowledge of the knowledge document is segmented into knowledge blocks through the semantic segmentation model as the return items corresponding to the retrieval items, which supports subsequent multi-way retrieval of document knowledge and solves the technical problem that it is difficult to locate the content of related document blocks due to the unstructured content and massive features of the document.
通过关键词倒排索引+分句链表索引+树目录索引表这一知识文档存储与检索方式,预先存储文档知识存储结构的设计和多路召回检索方式的架构。也即首先设计关键词倒排索引存储知识文档中业务关键词与文档id之间映射关系,用于通过业务关键词快速收缩知识文档检索范围,其次设计分句链表索引存储分句内容与分句内容对应分块内容的映射关系,用于通过分句内容准确检索相关文档分块内容,并设计树目录索引存储多层级目录的父子结构与对应文档分块内容的映射关系,用于通过目录加强相关文档分块内容的召回,最后通过向量检索的方式快速准确检索与目标提问文本相似的分句内容或层级标题,以获取相关的文档分块内容,解决了文档知识难以快速定位准确检索的技术问题。Through the knowledge document storage and retrieval method of keyword inverted index + sentence linked list index + tree directory index table, the design of the document knowledge storage structure and the architecture of the multi-way recall retrieval method are pre-stored. That is, firstly, the keyword inverted index is designed to store the mapping relationship between business keywords and document ids in knowledge documents, which is used to quickly narrow the knowledge document retrieval scope through business keywords. Secondly, the sentence linked list index is designed to store the mapping relationship between sentence content and the corresponding block content of the sentence content, which is used to accurately retrieve the relevant document block content through the sentence content. The tree directory index is designed to store the mapping relationship between the parent-child structure of the multi-level directory and the corresponding document block content, which is used to strengthen the recall of the relevant document block content through the directory. Finally, the sentence content or hierarchical title similar to the target question text is quickly and accurately retrieved through the vector retrieval method to obtain the relevant document block content, which solves the technical problem that document knowledge is difficult to quickly locate and accurately retrieve.
以及通过文档知识+提示词+ChatGPT生成回答这一文档知识回答方式,打通了AIGC能力在文档知识问答应用的链路,并且预先设计提示词模板,以充分利用大语言模型的知识提取和总结归纳能力,基于检索的相关文档分块内容,应对不同情况下的提问文本,作出合理而自然的目标回复内容,解决了文档分块需要进一步提炼转化以自然回应用户提问的技术问题。The document knowledge answering method of generating answers through document knowledge + prompt words + ChatGPT has opened up the link of AIGC capabilities in document knowledge question and answer applications, and pre-designed prompt word templates to fully utilize the knowledge extraction and summarization and induction capabilities of the large language model. Based on the retrieved relevant document block content, it can respond to question texts in different situations and make reasonable and natural target reply content, solving the technical problem that document blocks need to be further refined and transformed to naturally respond to user questions.
下面对本发明提供的智能回复装置进行描述,下文描述的智能回复装置与上文描述的智能回复方法可相互对应参照。The intelligent reply device provided by the present invention is described below. The intelligent reply device described below and the intelligent reply method described above can be referenced to each other.
参照图3,为本发明提供的智能回复装置的结构示意图,如图3所示,该智能回复装置300,包括:知识文档召回单元310、分块内容确定单元320和回复内容确定单元330。Referring to FIG. 3, which is a schematic structural diagram of the intelligent reply device provided by the present invention, the intelligent reply device 300 includes: a knowledge document recall unit 310, a block content determination unit 320 and a reply content determination unit 330.
知识文档召回单元310,用于基于目标提问文本中目标业务关键词进行知识文档召回,并基于召回结果确定目标业务关键词对应的目标知识文档筛选范围。The knowledge document recall unit 310 is used to recall knowledge documents based on target business keywords in the target question text, and determine the target knowledge document screening range corresponding to the target business keywords based on the recall result.
分块内容确定单元320,用于针对目标知识文档筛选范围内各待筛选的目标知识文档,从目标知识文档含有的所有分块内容中检索与目标提问文本之间匹配度满足第一预设匹配度阈值的第一分块内容,以及检索与目标知识文档的目标树结构层级目录和目标提问文本匹配的第二分块内容。The block content determination unit 320 is used to retrieve, for each target knowledge document to be screened within the target knowledge document screening range, the first block content whose matching degree with the target question text meets a first preset matching degree threshold from all the block contents contained in the target knowledge document, and to retrieve the second block content that matches the target tree structure hierarchical directory and the target question text of the target knowledge document.
回复内容确定单元330,用于基于各第一分块内容和各第二分块内容,确定目标提问文本的目标回复内容。The reply content determination unit 330 is used to determine the target reply content of the target question text based on the contents of each first sub-block and each second sub-block.
可选的,知识文档召回单元310,具体用于基于预先存储的业务关键词与文档id之间映射关系,对目标业务关键词进行知识文档召回,并基于召回的知识文档列表确定目标知识文档筛选范围。Optionally, the knowledge document recall unit 310 is specifically configured to recall knowledge documents for target business keywords based on a pre-stored mapping relationship between business keywords and document IDs, and determine a target knowledge document screening range based on the recalled knowledge document list.
可选的,知识文档召回单元310,具体用于判断各目标业务关键词各自召回的文档id列表之间是否存在相同文档id:确定各文档id列表之间存在相同文档id,则基于存在的相同文档id各自对应的知识文档确定目标知识文档筛选范围;确定各文档id列表之间不存在相同文档id,则从各文档id列表中确定含有文档id数量最少的目标文档id列表,并基于目标文档id列表中各目标文档id各自对应的知识文档确定目标知识文档筛选范围。Optionally, the knowledge document recall unit 310 is specifically used to determine whether there is an identical document ID between the document ID lists recalled by each target business keyword: if it is determined that there is an identical document ID between the document ID lists, then the target knowledge document screening range is determined based on the knowledge documents corresponding to the identical document IDs; if it is determined that there is no identical document ID between the document ID lists, then the target document ID list containing the least number of document IDs is determined from the document ID lists, and the target knowledge document screening range is determined based on the knowledge documents corresponding to each target document ID in the target document ID lists.
可选的,知识文档召回单元310,具体用于针对各文档id,基于预先构建的实体词典,从文档id对应的知识文档含有的分句内容中识别不同类型的目标实体;基于预先设置的不同类型编码标识和不同类型编码规则,从知识文档中识别不同类型的目标编码;以及从知识文档中识别出现频率最高的目标高频分词;将各文档id各自对应的目标实体、目标编码和目标高频分词中的至少一项均确定为业务关键词,并基于各业务关键词和各文档id构建业务关键词与文档id之间映射关系。Optionally, the knowledge document recall unit 310 is specifically used to identify different types of target entities from the sentence content contained in the knowledge document corresponding to the document id based on a pre-constructed entity dictionary for each document id; identify different types of target codes from the knowledge document based on pre-set different types of coding identifiers and different types of coding rules; and identify the most frequently appearing target high-frequency word segmentations from the knowledge document; determine at least one of the target entity, target code and target high-frequency word segmentations corresponding to each document id as a business keyword, and construct a mapping relationship between the business keyword and the document id based on each business keyword and each document id.
可选的,分块内容确定单元320,具体用于基于预先针对各种不同类型知识文档分别构建的分句内容-文档id-分块内容之间映射关系,从目标知识文档中检索与目标提问文本匹配度满足第一预设匹配度阈值的目标分句内容所在的所述第一分块内容;基于文档id-知识文档的树结构层级目录之间映射关系,确定目标知识文档的目标树结构层级目录,以及确定目标树结构层级目录中每个层级标题分别与目标提问文本之间的匹配度;从目标树结构层级目录中各层级标题下各自的分块内容中,检索满足第二预设匹配度阈值的目标匹配度对应的目标层级标题下的第二分块内容。Optionally, the block content determination unit 320 is specifically used to retrieve the first block content containing the target sentence content whose matching degree with the target question text satisfies a first preset matching degree threshold from the target knowledge document based on the mapping relationship between the sentence content-document id-block content pre-constructed for various types of knowledge documents; determine the target tree structure hierarchical directory of the target knowledge document based on the mapping relationship between the document id-tree structure hierarchical directory of the knowledge document, and determine the matching degree between each level title in the target tree structure hierarchical directory and the target question text; retrieve the second block content under the target level title corresponding to the target matching degree that satisfies a second preset matching degree threshold from the respective block contents under each level title in the target tree structure hierarchical directory.
可选的,分块内容确定单元320,具体用于针对各种不同类型知识文档各自的文档id,基于预先设置的分句分隔符,对文档id对应的知识文档进行分句,以及基于文本分割模型对知识文档进行分块;针对分句所得的所有分句内容,从分句内容在分块所得的对应初始分块内容中确定预设分割窗口内的分块内容;基于分句所得的所有分句内容、各预设分割窗口内的分块内容和各文档id,构建分句内容-文档id-分块内容之间映射关系。Optionally, the block content determination unit 320 is specifically used to segment the knowledge document corresponding to the document id based on a preset sentence delimiter for each document id of each different type of knowledge document, and to segment the knowledge document based on a text segmentation model; for all sentence contents obtained by sentence segmentation, determine the block content within a preset segmentation window from the corresponding initial block content obtained by segmentation; and construct a mapping relationship between sentence content-document id-block content based on all sentence contents obtained by sentence segmentation, the block content within each preset segmentation window and each document id.
可选的,分块内容确定单元320,具体用于基于各种不同类型知识文档各自预先设置的内容格式、文档标题和字体大小,识别每个知识文档中不同层级标题和每个层级标题下的分块内容,并基于识别结果和各种不同类型知识文档各自的文档id,构建文档id-知识文档的树结构层级目录之间映射关系;各文档id-知识文档的树结构层级目录之间映射关系用于确定目标知识文档的目标树结构层级目录。Optionally, the block content determination unit 320 is specifically used to identify different levels of titles and block contents under each level of title in each knowledge document based on the pre-set content format, document title and font size of each different type of knowledge document, and construct a mapping relationship between the document id and the tree structure hierarchical directory of the knowledge document based on the recognition result and the document id of each different type of knowledge document; the mapping relationship between each document id and the tree structure hierarchical directory of the knowledge document is used to determine the target tree structure hierarchical directory of the target knowledge document.
可选的,回复内容确定单元330,具体用于对各第一分块内容和各第二分块内容进行排序,并基于排序结果确定目标分块内容;基于预先为不同业务场景对应设置的个性化回复提示词和大语言模型,确定目标分块内容对应的目标回复内容。Optionally, the reply content determination unit 330 is specifically used to sort the first block contents and the second block contents, and determine the target block contents based on the sorting results; based on the personalized reply prompt words and large language models pre-set for different business scenarios, determine the target reply content corresponding to the target block content.
本发明实施例提供的智能回复装置300,可以执行上述任一实施例中智能回复方法的技术方案,其实现原理以及有益效果与智能回复方法的实现原理及有益效果类似,可参见智能回复方法的实现原理及有益效果,此处不再进行赘述。The intelligent reply device 300 provided in an embodiment of the present invention can execute the technical solution of the intelligent reply method in any of the above-mentioned embodiments. Its implementation principle and beneficial effects are similar to the implementation principle and beneficial effects of the intelligent reply method. Please refer to the implementation principle and beneficial effects of the intelligent reply method, and no further details will be given here.
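The rule applied by the knowledge document recall unit 310 when several target business keywords each recall a document id list can be sketched as follows. This sketch reads "identical document ids between the lists" as ids common to all lists, which is an assumption; `screening_range` and the sample ids are hypothetical.

```python
def screening_range(doc_id_lists):
    """Pick the document ids to screen from the per-keyword recall lists.

    If the lists share ids, return the shared ids; otherwise return the
    shortest list, which narrows the search range the most.
    """
    if not doc_id_lists:
        return []
    common = set(doc_id_lists[0]).intersection(*(set(ids) for ids in doc_id_lists[1:]))
    if common:
        return sorted(common)
    return list(min(doc_id_lists, key=len))


print(screening_range([[1, 2, 3], [2, 3, 4], [3, 5]]))
print(screening_range([[1, 2, 3], [4, 5], [6, 7, 8]]))
```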
图4示例了一种电子设备的实体结构示意图,如图4所示,该电子设备可以包括:处理器(processor)410、通信接口(Communications Interface)420、存储器(memory)430和通信总线440,其中,处理器410,通信接口420,存储器430通过通信总线440完成相互间的通信。处理器410可以调用存储器430中的逻辑指令,以执行如下方法:FIG. 4 illustrates a schematic diagram of the physical structure of an electronic device. As shown in FIG. 4, the electronic device may include: a processor 410, a communications interface 420, a memory 430, and a communication bus 440, wherein the processor 410, the communications interface 420, and the memory 430 communicate with each other through the communication bus 440. The processor 410 may call the logic instructions in the memory 430 to execute the following method:
基于目标提问文本中目标业务关键词进行知识文档召回,并基于召回结果确定目标业务关键词对应的目标知识文档筛选范围;针对目标知识文档筛选范围内各待筛选的目标知识文档,从目标知识文档含有的所有分块内容中检索与目标提问文本之间匹配度满足第一预设匹配度阈值的第一分块内容,以及检索与目标知识文档的目标树结构层级目录和目标提问文本匹配的第二分块内容;基于各第一分块内容和各第二分块内容,确定目标提问文本的目标回复内容。Knowledge documents are recalled based on target business keywords in the target question text, and a target knowledge document screening range corresponding to the target business keywords is determined based on the recall result; for each target knowledge document to be screened within the target knowledge document screening range, first block contents whose matching degree with the target question text meets a first preset matching degree threshold are retrieved from all block contents contained in the target knowledge document, and second block contents that match the target tree structure hierarchical directory of the target knowledge document and the target question text are retrieved; based on each first block content and each second block content, target reply content of the target question text is determined.
In addition, when the logic instructions in the memory 430 are implemented in the form of software functional units and sold or used as an independent product, they may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part that contributes to the related art, or a part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes a number of instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or some of the steps of the methods described in the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
In another aspect, an embodiment of the present invention discloses a computer program product. The computer program product includes a computer program stored on a non-transitory computer-readable storage medium, and the computer program includes program instructions. When the program instructions are executed by a computer, the computer can execute the methods provided in the foregoing method embodiments, for example including:
performing knowledge document recall based on a target business keyword in a target question text, and determining, based on the recall result, a target knowledge document screening range corresponding to the target business keyword; for each target knowledge document to be screened within the target knowledge document screening range, retrieving, from all block contents contained in the target knowledge document, first block content whose matching degree with the target question text satisfies a first preset matching degree threshold, and retrieving second block content that matches both the target tree-structured hierarchical directory of the target knowledge document and the target question text; and determining target reply content for the target question text based on each first block content and each second block content.
In yet another aspect, an embodiment of the present invention further provides a non-transitory computer-readable storage medium on which a computer program is stored. When executed by a processor, the computer program implements the methods provided in the foregoing embodiments, for example including:
performing knowledge document recall based on a target business keyword in a target question text, and determining, based on the recall result, a target knowledge document screening range corresponding to the target business keyword; for each target knowledge document to be screened within the target knowledge document screening range, retrieving, from all block contents contained in the target knowledge document, first block content whose matching degree with the target question text satisfies a first preset matching degree threshold, and retrieving second block content that matches both the target tree-structured hierarchical directory of the target knowledge document and the target question text; and determining target reply content for the target question text based on each first block content and each second block content.
The device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the embodiments without creative effort.
From the description of the foregoing implementations, those skilled in the art can clearly understand that each implementation may be realized by means of software plus a necessary general-purpose hardware platform, or alternatively by hardware. Based on this understanding, the foregoing technical solution in essence, or the part that contributes to the related art, may be embodied in the form of a software product. The computer software product may be stored in a computer-readable storage medium, such as a ROM/RAM, a magnetic disk, or an optical disc, and includes a number of instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute the methods described in the embodiments or in certain parts of the embodiments.
Finally, it should be noted that the foregoing implementations are intended only to illustrate the present invention, not to limit it. Although the present invention has been described in detail with reference to the embodiments, those of ordinary skill in the art should understand that various combinations, modifications, or equivalent substitutions of the technical solutions of the present invention do not depart from the spirit and scope of those technical solutions, and all shall fall within the scope of the claims of the present invention.
| Application Number | Publication Number | Priority Date | Filing Date | Title |
|---|---|---|---|---|
| CN202410582015.7A | CN118585618A | 2024-05-11 | 2024-05-11 | Intelligent reply method, device, electronic device and storage medium |
| Publication Number | Publication Date |
|---|---|
| CN118585618A | 2024-09-03 |
| Application Number | Status | Publication Number | Priority Date | Filing Date | Title |
|---|---|---|---|---|---|
| CN202410582015.7A | Pending | CN118585618A | 2024-05-11 | 2024-05-11 | Intelligent reply method, device, electronic device and storage medium |
| Country | Link |
|---|---|
| CN (1) | CN118585618A (en) |
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN119296516A (en)* | 2024-12-10 | 2025-01-10 | 中科南京人工智能创新研究院 | Domain-based speech recognition method and system based on RAG |
| CN119396983A (en)* | 2024-12-31 | 2025-02-07 | 金卡智能集团股份有限公司 | Document question and answer method, system and storage medium |
| CN119396983B (en)* | 2024-12-31 | 2025-04-25 | 金卡智能集团股份有限公司 | Document question and answer method, system and storage medium |
| Code | Title |
|---|---|
| PB01 | Publication |
| SE01 | Entry into force of request for substantive examination |