CN117149990A - Text retrieval method, text retrieval device, electronic equipment and storage medium - Google Patents


Info

Publication number
CN117149990A
Authority
CN
China
Prior art keywords
vector
text
index
inverted
retrieval
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210541779.2A
Other languages
Chinese (zh)
Inventor
林伟家
刘子甲
王志强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin 3600 Kuaikan Technology Co ltd
Original Assignee
Tianjin 3600 Kuaikan Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin 3600 Kuaikan Technology Co ltd
Priority to CN202210541779.2A
Publication of CN117149990A
Legal status: Pending


Abstract

The application discloses a text retrieval method, a text retrieval device, an electronic device and a storage medium. The method comprises: acquiring a user search request; encoding the query text input by the user with a deep learning model to obtain a second vector; querying, from a first vector index, a third vector having the highest similarity to the second vector, wherein the first vector index is obtained by splitting a document library to be retrieved into a plurality of independent sub-texts and encoding the sub-texts with a deep learning model; and taking the sub-text corresponding to the third vector as the target text. The method performs text retrieval based on semantic vector retrieval, improving text retrieval performance and the accuracy of the retrieval results.

Description

Translated from Chinese

Text retrieval method, device, electronic device and storage medium

Technical field

This application belongs to the field of data mining technology, and specifically relates to a text retrieval method, a device, an electronic device and a storage medium.

Background

At present, network operations based on text mining 100 are widely used in scenarios such as risk management 101, knowledge management 102, cybercrime prevention management 103, customer service 104, insurance claims 105, contextual advertising recommendation 106, business intelligence 107, email filtering 108 and social media analysis 109, as shown in Figure 1. Text analysis technology is combined with traditional statistical analysis to understand user behavior and to offer products and services on a website more accurately. Text analysis technology is also used for text information processing, and the processed text content is pushed directly to the user as the output of an online service.

Traditional text retrieval methods mostly segment the text into words or characters, perform semantic analysis starting from the meaning of each word or character in its sentence to form semantic information at a single level of granularity, and then retrieve over that single-granularity semantic information. However, single-granularity semantic information suffers from loss of semantic information and does not consider the correlation between pieces of semantic information, so recall of semantically related content is weak and retrieval accuracy is poor.

Summary of the invention

The purpose of the embodiments of the present application is to provide a text retrieval method, a device, an electronic device and a storage medium that perform text retrieval based on semantic vector retrieval, thereby improving text retrieval performance and the accuracy of the retrieval results.

To solve the above technical problem, the present application is implemented as follows:

In a first aspect, an embodiment of the present application provides a text retrieval method, including:

acquiring a user search request;

encoding the query text input by the user with a deep learning model to obtain a second vector;

querying, from a first vector index, a third vector having the highest similarity to the second vector,

wherein the first vector index is obtained by splitting a document library to be retrieved into multiple independent sub-texts and encoding the sub-texts with a deep learning model; and

taking the sub-text corresponding to the third vector as the target text.

Optionally, the method further includes:

generating an inverted file of the document library to be retrieved; and

generating a first inverted index of the document library to be retrieved from the inverted file.

Optionally, querying the third vector having the highest similarity to the second vector from the first vector index includes:

splitting the query text into multiple independent word segments;

querying, in the first inverted index, the inverted chain data corresponding to each word segment;

finding, in the first vector index, at least one center point whose distance from the second vector satisfies a preset distance, and obtaining the inverted chain data corresponding to each center point;

intersecting the inverted chain data of the word segments to obtain a first weight value;

taking the union of the inverted chain data corresponding to the center points to obtain a second weight value;

comparing the first weight value with the second weight value, filtering the inverted chain data that has the larger weight value, and storing the data that satisfies a preset filtering condition in a recall intermediate result data set; and

sorting the recall intermediate result data set to determine the third vector.

Optionally, when the recall intermediate result data set reaches a preset first storage capacity threshold, or

when the time spent retrieving with the query text exceeds a preset first time threshold,

collection of the recall intermediate result data set is terminated.

Optionally, sorting the recall intermediate result data set to determine the third vector includes:

sorting all inverted chain data stored in the recall intermediate result data set by score, and taking the top-ranked inverted chain data as the third vector.

Optionally, querying the inverted chain data corresponding to each word segment in the first inverted index includes:

obtaining the document ID information of each word segment; and

arranging the document ID information of all the word segments in descending order to form the inverted chain data.

Optionally, the method further includes:

determining a weight value for each word segment from the document ID information of that word segment;

determining a recall time for each word segment according to its weight value; and

truncating the retrieval process according to the recall time and the chain length of each piece of inverted chain data.

Optionally, the method further includes:

writing the first vector index together with the first inverted index into a memory segment to build a temporary in-memory vector index; and

when the temporary in-memory vector index reaches a preset second storage capacity threshold, or the time spent building the temporary in-memory vector index reaches a preset second time threshold,

writing the first vector index together with the first inverted index into a disk segment to build a persistent on-disk vector index.

In a second aspect, an embodiment of the present application provides a text retrieval device, including:

an acquisition module, configured to acquire a user search request;

an encoding module, configured to encode the query text input by the user with a deep learning model to obtain a second vector; and

a retrieval module, configured to query, from a first vector index, a third vector having the highest similarity to the second vector,

wherein the first vector index is obtained by splitting a document library to be retrieved into multiple independent sub-texts and encoding the sub-texts with a deep learning model,

and to take the sub-text corresponding to the third vector as the target text.

Optionally, the device is further configured to:

generate an inverted file of the document library to be retrieved; and

generate a first inverted index of the document library to be retrieved from the inverted file.

Optionally, querying the third vector having the highest similarity to the second vector from the first vector index includes:

splitting the query text into multiple independent word segments;

querying, in the first inverted index, the inverted chain data corresponding to each word segment;

finding, in the first vector index, at least one center point whose distance from the second vector satisfies a preset distance, and obtaining the inverted chain data corresponding to each center point;

intersecting the inverted chain data of the word segments to obtain a first weight value;

taking the union of the inverted chain data corresponding to the center points to obtain a second weight value;

comparing the first weight value with the second weight value, filtering the inverted chain data that has the larger weight value, and storing the data that satisfies a preset filtering condition in a recall intermediate result data set; and

sorting the recall intermediate result data set to determine the third vector.

Optionally, when the recall intermediate result data set reaches a preset first storage capacity threshold, or

when the time spent retrieving with the query text exceeds a preset first time threshold,

collection of the recall intermediate result data set is terminated.

Optionally, sorting the recall intermediate result data set to determine the third vector includes:

sorting all inverted chain data stored in the recall intermediate result data set by score, and taking the top-ranked inverted chain data as the third vector.

Optionally, querying the inverted chain data corresponding to each word segment in the first inverted index includes:

obtaining the document ID information of each word segment; and

arranging the document ID information of all the word segments in descending order to form the inverted chain data.

Optionally, the device is further configured to:

determine a weight value for each word segment from the document ID information of that word segment;

determine a recall time for each word segment according to its weight value; and

truncate the retrieval process according to the recall time and the chain length of each piece of inverted chain data.

Optionally, the device further includes a storage module, configured to:

write the first vector index together with the first inverted index into a memory segment to build a temporary in-memory vector index; and

when the temporary in-memory vector index reaches a preset second storage capacity threshold, or the time spent building the temporary in-memory vector index reaches a preset second time threshold,

write the first vector index together with the first inverted index into a disk segment to build a persistent on-disk vector index.

In a third aspect, an embodiment of the present application provides an electronic device including a processor, a memory, and a program or instructions stored in the memory and executable on the processor, wherein the program or instructions, when executed by the processor, implement the steps of the above text retrieval method.

In a fourth aspect, an embodiment of the present application provides a readable storage medium storing a program or instructions which, when executed by a processor, implement the steps of the above text retrieval method.

In a fifth aspect, an embodiment of the present application provides a chip including a processor and a communication interface coupled to the processor, the processor being configured to run a program or instructions to implement the steps of the above text retrieval method.

In the embodiments of the present application, semantic vector retrieval is combined with text retrieval, and text retrieval is performed on the basis of vector retrieval. After the user search request is acquired, the query text input by the user is encoded with a deep learning model to obtain a second vector. The first vector index, obtained by splitting the document library to be retrieved into multiple independent sub-texts and encoding them with a deep learning model, is then queried for the third vector having the highest similarity to the second vector, and the sub-text corresponding to the third vector is taken as the target text. Because it takes the correlation between pieces of semantic information into account, the method greatly improves semantic recall when semantic vector retrieval is combined with text retrieval, and improves text retrieval performance and the accuracy of the retrieval results.

Description of the drawings

Figure 1 is a schematic diagram of common application scenarios of text retrieval;

Figure 2 is an architecture diagram of a system to which the text retrieval method provided by an embodiment of the present application is applied;

Figure 3 is a schematic flowchart of the text retrieval method provided by an embodiment of the present application;

Figure 4 is a detailed schematic flowchart of step S3 of the text retrieval method provided by an embodiment of the present application;

Figure 5 is a schematic diagram of the module structure of the text retrieval device provided by an embodiment of the present application;

Figure 6 is a schematic structural diagram of the electronic device provided by an embodiment of the present application;

Figure 7 is a schematic diagram of the specific hardware structure of the electronic device provided by an embodiment of the present application;

Reference numerals:

100 - text mining;

101 - risk management;

102 - knowledge management;

103 - cybercrime prevention management;

104 - customer service;

105 - insurance claims;

106 - contextual advertising recommendation;

107 - business intelligence;

108 - email filtering;

109 - social media analysis;

201 - client;

202 - search engine server;

400 - text retrieval device;

401 - acquisition module;

402 - encoding module;

403 - retrieval module;

404 - storage module;

500 - electronic device;

501 - processor;

502 - memory;

600 - electronic device;

601 - radio frequency unit;

602 - network module;

603 - audio output unit;

604 - input unit;

6041 - graphics processor;

6042 - microphone;

605 - sensor;

606 - display unit;

6061 - display panel;

607 - user input unit;

6071 - touch panel;

6072 - other input devices;

608 - interface unit;

609 - memory;

610 - processor.

Detailed description

The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the accompanying drawings. The described embodiments are evidently only some, not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art from the embodiments in this application without creative effort fall within the scope of protection of this application.

The terms "first", "second" and the like in the description and claims of this application are used to distinguish similar objects, not to describe a specific order or sequence. It should be understood that data used in this way are interchangeable where appropriate, so that the embodiments of the present application can be implemented in orders other than those illustrated or described here. In addition, "and/or" in the description and claims denotes at least one of the connected objects, and the character "/" generally indicates an "or" relationship between the objects before and after it.

As described above, traditional text retrieval does not consider the correlation between pieces of semantic information, so its recall of semantically related content is weak and its retrieval accuracy is poor. The embodiments of the present application therefore propose a text retrieval method based on a vector index.

The text retrieval method provided by the embodiments of the present application is described in detail below through specific embodiments and their application scenarios, with reference to the accompanying drawings.

Figure 2 is an architecture diagram of a system to which the text retrieval method provided in Embodiment 1 of the present application is applied.

The text retrieval system mainly includes a client 201 and a search engine server 202, which communicate with each other. The client 201 sends an online search request to the search engine server 202. The search engine server 202 stores the document database needed for retrieval and, on receiving the user's online search request, queries the relevant documents according to that request.

Figure 3 is a schematic flowchart of the text retrieval method provided in Embodiment 1 of the present application.

Referring to Figure 3, the text retrieval method provided in Embodiment 1 of the present application includes:

S1. Acquire a user search request.

S2. Encode the query text input by the user with a deep learning model to obtain a second vector.

In a specific implementation, after the user enters the query text in a search engine, the query result, i.e., the target text, is obtained from the document library to be retrieved. The document library to be retrieved is the database behind the user's search engine; for an academic paper search, for example, the document library to be retrieved is the CNKI database.

After the user search request is acquired, the query text input by the user is encoded with a deep learning model to obtain the second vector; that is, the second vector is the vector representation of the query text. The user search request may be an online or an offline search request.

In this embodiment, text retrieval is combined with a vector index: a first vector index of the document library to be retrieved is built, and text retrieval is performed on that index. Specifically, the document library to be retrieved is split into multiple independent sub-texts, and the sub-texts are encoded with a first deep learning model to obtain the first vector index of the document library to be retrieved.

In addition, the query text may be encoded with the same deep learning model as the document library to be retrieved, i.e., the query text input by the user is encoded with the first deep learning model to obtain the second vector.
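The basic flow of S1 and S2 plus a plain nearest-neighbor lookup can be sketched as follows. This is a minimal illustration under stated assumptions, not the method as claimed: `ToyEncoder` is a hypothetical bag-of-words stand-in for the deep learning model, `split_into_sub_texts` assumes one sub-text per sentence, and cosine similarity is assumed as the similarity measure. None of these choices is specified in the text.

```python
import math


def tokenize(text):
    """Lower-case whitespace tokenizer; a real system would use a proper segmenter."""
    return text.lower().split()


class ToyEncoder:
    """Hypothetical stand-in for the deep learning model: bag of words, L2-normalized."""

    def __init__(self, corpus):
        words = sorted({w for text in corpus for w in tokenize(text)})
        self.vocab = {w: i for i, w in enumerate(words)}

    def encode(self, text):
        vec = [0.0] * len(self.vocab)
        for w in tokenize(text):
            if w in self.vocab:
                vec[self.vocab[w]] += 1.0
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        return [x / norm for x in vec]


def split_into_sub_texts(document):
    """Assumed splitting rule: one sub-text per sentence."""
    return [s.strip() for s in document.split(".") if s.strip()]


def build_vector_index(documents):
    """Return the encoder and the 'first vector index' as (sub_text, vector) pairs."""
    sub_texts = [s for doc in documents for s in split_into_sub_texts(doc)]
    encoder = ToyEncoder(sub_texts)
    return encoder, [(s, encoder.encode(s)) for s in sub_texts]


def retrieve(query, encoder, vector_index):
    """Encode the query (the 'second vector') and return the most similar sub-text."""
    q = encoder.encode(query)

    def cosine(entry):
        # Vectors are already normalized, so the dot product is the cosine similarity.
        return sum(a * b for a, b in zip(q, entry[1]))

    return max(vector_index, key=cosine)[0]


documents = [
    "Text mining extracts patterns. Vector search compares embeddings.",
    "Inverted indexes map terms to documents.",
]
encoder, vector_index = build_vector_index(documents)
target_text = retrieve("vector search", encoder, vector_index)
print(target_text)
```

The same encoder is used for the sub-texts and for the query, which mirrors the option of encoding both with the first deep learning model.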

In addition, while the document library to be retrieved is split into multiple independent sub-texts and the sub-texts are encoded with the deep learning model to obtain the first vector index, an inverted file of the document library to be retrieved is generated, and a first inverted index of the document library to be retrieved is generated from the inverted file.

S3. Query, from the first vector index, the third vector having the highest similarity to the second vector.

In a specific implementation, Figure 4 shows the detailed flow of step S3. Querying the third vector having the highest similarity to the second vector from the first vector index includes:

S31. Split the query text into multiple independent word segments, and query the inverted chain data corresponding to each word segment in the first inverted index.

In a specific implementation, querying the inverted chain data corresponding to each word segment in the first inverted index includes: obtaining the document ID information of each word segment; and arranging the document ID information of all the word segments in descending order to form the inverted chain data.

For example, suppose a user searches the CNKI database for papers related to "text mining engine". The query text "text mining engine" is first split into multiple independent word segments, such as "text", "mining" and "engine". The document ID information of the word segments "text", "mining" and "engine" is then looked up in the first inverted index of the document library to be retrieved, and the document ID information found for each word segment is arranged in descending order to form the inverted chain data corresponding to that word segment.
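Step S31 can be illustrated with a small sketch. The sample sub-texts and the whitespace segmentation are assumptions made for the example (a Chinese query would need a real word segmenter), and the inverted chain data is modeled as what is usually called a posting list: the document IDs containing a word segment, in descending order.

```python
from collections import defaultdict


def build_inverted_index(sub_texts):
    """Map each word segment to its inverted chain: document IDs in descending order."""
    postings = defaultdict(set)
    for doc_id, text in enumerate(sub_texts):
        for segment in text.lower().split():
            postings[segment].add(doc_id)
    return {seg: sorted(ids, reverse=True) for seg, ids in postings.items()}


def lookup_chains(query, inverted_index):
    """Split the query into word segments and fetch the chain for each one."""
    return {seg: inverted_index.get(seg, []) for seg in query.lower().split()}


sub_texts = [
    "text mining engine overview",   # ID 0
    "text retrieval basics",         # ID 1
    "mining engine design",          # ID 2
    "a text mining engine in use",   # ID 3
]
inverted_index = build_inverted_index(sub_texts)
chains = lookup_chains("text mining engine", inverted_index)
print(chains)
```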

S32. Find, in the first vector index, at least one center point whose distance from the second vector satisfies a preset distance, and obtain the inverted chain data corresponding to each center point.

In a specific implementation, text retrieval is combined with the vector index as described above: the first vector index of the document library to be retrieved is built in advance, and the query text received in the user's online search request is encoded with the deep learning model to obtain the second vector. At least one center point whose distance from the second vector satisfies the preset distance is then found in the first vector index, and the inverted chain data corresponding to each such center point is obtained.
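A minimal sketch of the center-point lookup in S32, assuming that the first vector index is organized as cluster centers with an attached list of sub-text IDs (as in inverted-file vector indexes) and that Euclidean distance is the distance measure. The text specifies neither, and the 2-D vectors and ID lists below are made up for illustration.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def probe_center_points(query_vector, center_points, max_distance):
    """Return (center_id, chain) for every center point within max_distance of the query."""
    hits = []
    for center_id, (center_vector, chain) in center_points.items():
        if euclidean(query_vector, center_vector) <= max_distance:
            hits.append((center_id, chain))
    return hits


# Hypothetical 2-D center points, each with the IDs of the sub-texts assigned to it.
center_points = {
    "c0": ([0.0, 0.0], [7, 4, 1]),
    "c1": ([1.0, 1.0], [9, 5, 2]),
    "c2": ([5.0, 5.0], [8, 6, 3]),
}
hits = probe_center_points([0.8, 0.9], center_points, max_distance=1.5)
print(hits)
```

Only `c0` and `c1` fall within the preset distance here, so only their chains are passed on to the next step.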

S33. Intersect the inverted chain data of the word segments to obtain a first weight value.

At the same time, take the union of the inverted chain data corresponding to the center points to obtain a second weight value.

For example, for the word segments "text", "mining" and "engine", the inverted chain data of the three word segments is intersected to obtain the first weight value.

At the same time, at least one center point whose distance from the second vector satisfies the preset distance is found in the first vector index, and the union of the inverted chain data corresponding to these center points is taken to obtain the second weight value.
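The set operations in S33 can be sketched as below. The text does not say how a weight value is derived from an intersection or a union, so this sketch simply assumes the size of the resulting candidate set as the weight; that choice, and the sample chains, are illustrative assumptions only.

```python
def intersect_descending(chains):
    """Intersect chains that are each sorted in descending ID order."""
    if not chains:
        return []
    result = chains[0]
    for chain in chains[1:]:
        merged, i, j = [], 0, 0
        while i < len(result) and j < len(chain):
            if result[i] == chain[j]:
                merged.append(result[i])
                i += 1
                j += 1
            elif result[i] > chain[j]:
                i += 1  # result[i] is larger than everything left in chain
            else:
                j += 1  # chain[j] is larger than everything left in result
        result = merged
    return result


def union_descending(chains):
    """Union of several chains, returned in descending ID order."""
    return sorted(set().union(*chains), reverse=True)


segment_chains = [[3, 1, 0], [3, 2, 0], [3, 2, 0]]  # "text", "mining", "engine"
center_chains = [[7, 4, 1], [9, 5, 2]]              # two nearby center points

keyword_candidates = intersect_descending(segment_chains)
vector_candidates = union_descending(center_chains)

# Assumption: the size of each candidate set serves as its weight value.
first_weight = len(keyword_candidates)
second_weight = len(vector_candidates)
print(keyword_candidates, vector_candidates, first_weight, second_weight)
```

Keeping every chain in the same sorted order is what allows the intersection to run as a linear two-pointer merge instead of repeated membership tests.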

S34. Compare the first weight value with the second weight value, filter the inverted chain data that has the larger weight value, and store the data that satisfies a preset filtering condition in the recall intermediate result data set.

Steps S32 to S34 are then repeated until the recall intermediate result data set reaches a preset first storage capacity threshold, or the time spent retrieving with the query text exceeds a preset first time threshold, at which point collection of the recall intermediate result data set is terminated.
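The two termination conditions of the collection loop can be sketched as follows. The function name, the batch structure and the filter predicate are hypothetical; the capacity threshold is modeled as a maximum number of collected entries, and the clock is injectable so that the time limit can be exercised deterministically.

```python
import time


def collect_candidates(batches, passes_filter, max_results, max_seconds,
                       clock=time.monotonic):
    """Collect filtered candidates until a capacity or time limit is hit."""
    started = clock()
    collected = []
    for batch in batches:
        for candidate in batch:
            if clock() - started > max_seconds:
                return collected, "timeout"
            if passes_filter(candidate):
                collected.append(candidate)
                if len(collected) >= max_results:
                    return collected, "capacity"
    return collected, "exhausted"


# Hypothetical filter condition: document 5 has been deleted and must be skipped.
batches = [[9, 7, 5], [4, 2, 1]]
results, reason = collect_candidates(
    batches, lambda doc_id: doc_id != 5, max_results=4, max_seconds=1.0
)
print(results, reason)
```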

S35. Sort the recall intermediate result data set, determine the third vector, and take the sub-text corresponding to the third vector as the target text.

In a specific implementation, all inverted chain data stored in the recall intermediate result data set is sorted by score, and the top-ranked inverted chain data is taken as the third vector.
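Sorting by score and keeping the top-ranked entries is a standard top-k selection. A brief sketch, assuming each intermediate result is a `(doc_id, score)` pair; the text does not define how the scores are computed, so the values below are made up.

```python
import heapq


def top_ranked(intermediate_results, k):
    """Return the k highest-scoring (doc_id, score) pairs, best first."""
    return heapq.nlargest(k, intermediate_results, key=lambda item: item[1])


intermediate_results = [(9, 0.42), (7, 0.91), (4, 0.13), (2, 0.77)]
ranked = top_ranked(intermediate_results, k=2)
best_id = ranked[0][0]  # with k=1 this single entry would play the role of the third vector
print(ranked, best_id)
```

`heapq.nlargest` avoids sorting the whole data set when only a few top entries are needed.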

S4. Take the sub-text corresponding to the third vector as the target text.

In this embodiment, semantic vector retrieval is combined with text retrieval, and text retrieval is performed on the basis of vector retrieval. Because the correlation between pieces of semantic information is taken into account, the recall ability of semantic vector retrieval combined with text retrieval is greatly improved; in particular, in scenarios with data filtering conditions, the overall best result can be stably recalled as the target text.

In addition, the whole retrieval process can be truncated along dimensions such as the recall time and the chain length of each piece of inverted chain data. For the recall time, a weight value is determined for each word segment from the document ID information of that word segment, and the recall time of each word segment is then determined according to its weight value: the larger the document ID information, the larger the corresponding word-segment weight value and the more the word segment needs to be recalled.
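One way to picture the truncation described above is sketched here. The text does not give formulas, so two assumptions are made: the recall time of each word segment is a share of a total time budget proportional to its weight, and chain-length truncation keeps only the head of each chain. The weights, budget and limits are hypothetical values.

```python
def allocate_recall_time(weights, total_budget_ms):
    """Split a total time budget across word segments in proportion to their weights."""
    total = sum(weights.values())
    if total == 0:
        return {seg: 0.0 for seg in weights}
    return {seg: total_budget_ms * w / total for seg, w in weights.items()}


def truncate_chain(chain, max_chain_length):
    """Keep only the head of a chain; with descending IDs this keeps the largest IDs."""
    return chain[:max_chain_length]


weights = {"text": 1.0, "mining": 3.0, "engine": 4.0}  # hypothetical weights
budgets = allocate_recall_time(weights, total_budget_ms=80.0)
short_chain = truncate_chain([9, 7, 5, 4, 2, 1], max_chain_length=4)
print(budgets, short_chain)
```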

In addition, to keep the vector index fresh, this embodiment first writes the first vector index together with the first inverted index into a memory segment to build a temporary in-memory vector index. When the temporary in-memory vector index reaches a preset second storage capacity threshold, or the time spent building it reaches a preset second time threshold, the first vector index and the first inverted index are written together into a disk segment to build a persistent on-disk vector index. In other words, this embodiment combines a small-capacity in-memory cache (the temporary in-memory vector index) with large-capacity permanent disk storage (the persistent on-disk vector index). For highly time-sensitive services this achieves real-time freshness, improves the flexibility and timeliness of vector index construction, preserves the effect of semantic recall, and improves the accuracy of text retrieval.
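The memory-segment and disk-segment scheme can be sketched as a buffer that is flushed on size or age. The class name, the JSON file format and the threshold values are assumptions for illustration; the text does not describe the on-disk layout.

```python
import json
import os
import tempfile
import time


class SegmentedIndex:
    """Buffer entries in a memory segment; flush to a disk segment on size or age."""

    def __init__(self, directory, max_entries, max_age_seconds, clock=time.monotonic):
        self.directory = directory
        self.max_entries = max_entries
        self.max_age_seconds = max_age_seconds
        self.clock = clock
        self.memory_segment = []
        self.disk_segments = []
        self.opened_at = clock()

    def add(self, doc_id, vector, segments):
        """Store the vector and the word segments of one sub-text together."""
        self.memory_segment.append(
            {"id": doc_id, "vector": vector, "segments": segments}
        )
        too_big = len(self.memory_segment) >= self.max_entries
        too_old = self.clock() - self.opened_at >= self.max_age_seconds
        if too_big or too_old:
            self.flush()

    def flush(self):
        """Persist the memory segment as one JSON file and start a new one."""
        if not self.memory_segment:
            return None
        name = f"segment-{len(self.disk_segments):05d}.json"
        path = os.path.join(self.directory, name)
        with open(path, "w", encoding="utf-8") as handle:
            json.dump(self.memory_segment, handle)
        self.disk_segments.append(path)
        self.memory_segment = []
        self.opened_at = self.clock()
        return path


with tempfile.TemporaryDirectory() as directory:
    index = SegmentedIndex(directory, max_entries=2, max_age_seconds=3600.0)
    index.add(0, [0.1, 0.9], ["text", "mining"])
    index.add(1, [0.8, 0.2], ["engine"])     # second entry triggers a flush
    index.add(2, [0.5, 0.5], ["retrieval"])  # stays in the memory segment
    flushed_count = len(index.disk_segments)
    buffered_count = len(index.memory_segment)
print(flushed_count, buffered_count)
```

A search over such an index would have to consult both the memory segment and the disk segments; that part is omitted here.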

As described above, this embodiment combines semantic vector retrieval with text retrieval and performs text retrieval on the basis of vector retrieval. After a user search request is obtained, the query text input by the user is encoded by a deep learning model to obtain a second vector. A first vector index is obtained by splitting the document library to be retrieved into multiple independent sub-texts and encoding those sub-texts with a deep learning model; the third vector with the highest similarity to the second vector is queried from the first vector index, and the sub-text corresponding to the third vector is taken as the target text. By taking the correlation between pieces of semantic information into account, this text retrieval method greatly improves the semantic recall effect of semantic vector retrieval in combined text-retrieval scenarios, and improves text retrieval performance and the accuracy of retrieval results.
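The basic flow (encode the sub-texts, encode the query, return the most similar sub-text) can be sketched as below. The source does not name a model or a similarity measure, so this sketch assumes a toy bag-of-words `encode` in place of the deep learning model and cosine similarity as the measure; `VOCAB`, `build_vector_index`, and `search` are hypothetical names.

```python
import math

VOCAB = ["index", "vector", "disk", "memory", "query"]  # toy vocabulary (assumption)


def encode(text):
    """Stand-in for the deep learning encoder: a bag-of-words count vector."""
    words = text.lower().split()
    return [float(words.count(w)) for w in VOCAB]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def build_vector_index(sub_texts):
    """Encode every sub-text once; this plays the role of the first vector index."""
    return [(text, encode(text)) for text in sub_texts]


def search(query, vector_index):
    """Encode the query (second vector) and return the best-matching sub-text."""
    q = encode(query)
    best_text, _ = max(vector_index, key=lambda item: cosine(q, item[1]))
    return best_text
```

The linear scan in `search` is only for clarity; the center points mentioned elsewhere in this document suggest a clustered index that narrows the candidates first.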

It should be noted that the text retrieval method provided by the above embodiments of the present application can be applied to various terminals, such as a host computer server, as well as client devices such as desktop computers, tablet computers, notebook computers, handheld computers, vehicle-mounted electronic devices, wearable devices, ultra-mobile personal computers (UMPCs), netbooks, personal digital assistants (PDAs), and mobile phones; the embodiments of the present application impose no specific limitation in this regard.

FIG. 5 is a schematic diagram of the module structure of the text retrieval device 400 provided by an embodiment of the present application.

Referring to FIG. 5, the module structure of the text retrieval device 400 corresponds to the text retrieval method shown in FIGS. 3 and 4; the text retrieval device provided by the embodiment of the present application can implement each process implemented by the above text retrieval method.

As shown in FIG. 5, the text retrieval device 400 provided by the embodiment of the present application includes:

an acquisition module 401, configured to obtain a user search request;

an encoding module 402, configured to encode the query text input by the user through a deep learning model to obtain a second vector;

a retrieval module 403, configured to query, from a first vector index, the third vector with the highest similarity to the second vector,

where the first vector index is obtained by splitting the document library to be retrieved into multiple independent sub-texts and encoding the split sub-texts using a deep learning model;

and to take the sub-text corresponding to the third vector as the target text.

Optionally, the device is further configured for:

generating an inverted file of the document library to be retrieved;

generating a first inverted index of the document library to be retrieved according to the inverted file.

Optionally, querying the third vector with the highest similarity to the second vector from the first vector index includes:

splitting the query text into multiple independent word segments;

querying the first inverted index for the inverted chain data corresponding to each word segment;

finding, in the first vector index, at least one center point whose distance from the second vector satisfies a preset distance, and obtaining the inverted chain data corresponding to each center point;

intersecting the inverted chain data of the word segments to obtain a first weight value;

taking the union of the inverted chain data corresponding to the center points to obtain a second weight value;

comparing the first weight value with the second weight value, filtering the inverted chain data having the larger weight value, and storing it in the recall intermediate result data set when a preset filtering condition is met;

sorting the recall intermediate result data set to determine the third vector.
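The steps above can be sketched as follows. The source does not define how a weight value is computed from an intersection or a union, so this sketch assumes the weight of each side is simply its number of candidate documents; Euclidean distance is likewise an assumption. The function `candidate_ids` and its parameters are hypothetical.

```python
import math


def candidate_ids(tokens, inverted_index, query_vec, centroids, max_dist, passes_filter):
    """Pick the heavier of the text-side and vector-side candidate sets.

    inverted_index: {token: [doc_id, ...]}
    centroids:      [(center_vector, [doc_id, ...]), ...]
    passes_filter:  callable(doc_id) -> bool, the preset filtering condition
    """
    # Text side: intersect the chains of all query word segments.
    token_sets = [set(inverted_index.get(t, ())) for t in tokens]
    text_hits = set.intersection(*token_sets) if token_sets else set()

    # Vector side: union the chains of center points close enough to the query.
    vector_hits = set()
    for center, ids in centroids:
        if math.dist(query_vec, center) <= max_dist:
            vector_hits |= set(ids)

    # Assumption: each side's "weight value" is its candidate count.
    first_weight, second_weight = len(text_hits), len(vector_hits)
    chosen = text_hits if first_weight > second_weight else vector_hits

    # Only candidates meeting the filter enter the intermediate result set.
    return sorted(i for i in chosen if passes_filter(i))
```

The returned IDs correspond to the recall intermediate result data set before scoring; ranking them by score is a separate step.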

Optionally, when the recall intermediate result data set reaches a preset first storage capacity threshold, or

when the time spent retrieving with the query text exceeds a preset first time threshold,

collection of the recall intermediate result data set is terminated.
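A minimal sketch of this early termination, assuming the first storage capacity threshold is expressed as a result count and the first time threshold in seconds. The function `collect` and its parameters are hypothetical.

```python
import time


def collect(candidates, max_results, max_seconds, clock=time.monotonic):
    """Gather candidates until the result cap or the time limit is hit."""
    started = clock()
    results = []
    for item in candidates:
        # Stop on whichever threshold is reached first.
        if len(results) >= max_results or clock() - started > max_seconds:
            break
        results.append(item)
    return results
```

Checking both limits inside the loop bounds worst-case latency even when the candidate stream is very long, at the cost of possibly missing later, better candidates.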

Optionally, sorting the recall intermediate result data set to determine the third vector includes:

sorting all inverted chain data stored in the recall intermediate result data set by score, and taking the top-ranked inverted chain data as the third vector.
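Selecting the top-ranked entries can be sketched with the standard library's `heapq.nlargest`, assuming each entry in the intermediate result set is a hypothetical `(doc_id, score)` pair; the source does not specify how scores are computed.

```python
import heapq


def top_k(scored, k):
    """Return the k highest-scoring (doc_id, score) pairs, best first."""
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])
```

For small `k` this avoids fully sorting a large intermediate set.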

Optionally, querying the first inverted index for the inverted chain data corresponding to each word segment includes:

obtaining the document number (ID) information of each word segment;

arranging the document ID information of all the word segments in descending order to form the inverted chain data.
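A minimal sketch of building such chains, assuming whitespace tokenization (a real system for Chinese text would use a proper word segmenter) and integer document IDs. The function `build_inverted_index` is a hypothetical name.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """documents: {doc_id: text}. Returns {token: [doc_id, ...]}, IDs descending."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    # Each chain lists document IDs in descending order, as described above.
    return {token: sorted(ids, reverse=True) for token, ids in index.items()}
```

If IDs are assigned in increasing order as documents arrive, descending order places the newest documents at the head of each chain, so truncating a chain drops the oldest entries first.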

Optionally, the weight value of each word segment is determined according to the document ID information of that word segment;

the recall time of each word segment is determined according to the magnitude of its weight value;

the retrieval process is truncated according to the recall time and the chain length of each piece of inverted chain data.

Optionally, the device further includes a storage module 404, configured to:

write the first vector index together with the first inverted index into a memory segment to construct a temporary memory vector index;

when the temporary memory vector index reaches a preset second storage capacity threshold, or the time spent constructing the temporary memory vector index reaches a preset second time threshold,

write the first vector index together with the first inverted index into a disk segment to construct a persistent disk vector index.

Therefore, the text retrieval device 400 according to the embodiment of the present application combines semantic vector retrieval with text retrieval and performs text retrieval on the basis of vector retrieval. After a user search request is obtained, the query text input by the user is encoded by a deep learning model to obtain a second vector. A first vector index is obtained by splitting the document library to be retrieved into multiple independent sub-texts and encoding those sub-texts with a deep learning model; the third vector with the highest similarity to the second vector is queried from the first vector index, and the sub-text corresponding to the third vector is taken as the target text. By taking the correlation between pieces of semantic information into account, this approach greatly improves the semantic recall effect of semantic vector retrieval in combined text-retrieval scenarios, and improves text retrieval performance and the accuracy of retrieval results.

It should be understood that the descriptions of the above text retrieval method apply equally to the text retrieval device 400 according to the embodiment of the present application; to avoid repetition, they are not described in detail again.

In addition, it should be understood that the division into the above functional modules in the text retrieval device 400 is only an example. In practical applications, the above functions may be assigned to different functional modules as needed; that is, the text retrieval device may be divided into functional modules different from those illustrated above in order to perform all or part of the functions described above.

FIG. 6 is a schematic structural diagram of an electronic device provided by an embodiment of the present application.

As shown in FIG. 6, an embodiment of the present application further provides an electronic device 500, including a processor 501, a memory 502, and a program or instructions stored in the memory 502 and executable on the processor 501. When executed by the processor 501, the program or instructions implement the steps of the above text retrieval method and achieve the same technical effect.

Therefore, the electronic device 500 according to the embodiment of the present application combines semantic vector retrieval with text retrieval, greatly improves the semantic recall effect of semantic vector retrieval in combined text-retrieval scenarios, and improves text retrieval performance and the accuracy of retrieval results.

Other technical effects of the electronic device 500 according to the embodiment of the present application are not described in detail here, to avoid repetition.

It should be noted that the electronic devices in the embodiments of the present application may include mobile electronic devices and non-mobile electronic devices.

FIG. 7 is a schematic diagram of the specific hardware structure of the electronic device provided by an embodiment of the present application.

Referring to FIG. 7, the electronic device 600 includes, but is not limited to, the following components: a radio frequency unit 601, a network module 602, an audio output unit 603, an input unit 604, a sensor 605, a display unit 606, a user input unit 607, an interface unit 608, a memory 609, and a processor 610.

It should be understood that, in the embodiment of the present application, the radio frequency unit 601 can be used to receive and send signals while information is being sent or received or during a call. Specifically, it receives downlink data from a base station and passes it to the processor 610 for processing; it also sends uplink data to the base station. Generally, the radio frequency unit 601 includes, but is not limited to, an antenna, at least one amplifier, a transceiver, a coupler, a low-noise amplifier, and a duplexer. In addition, the radio frequency unit 601 can communicate with a network and other devices through a wireless communication system.

The electronic device 600 provides the user with wireless broadband Internet access through the network module 602, for example helping the user send and receive e-mail, browse web pages, and access streaming media.

The audio output unit 603 may convert audio data received by the radio frequency unit 601 or the network module 602, or stored in the memory 609, into an audio signal and output it as sound. The audio output unit 603 may also provide audio output related to a specific function performed by the electronic device 600 (for example, a call signal reception sound or a message reception sound). The audio output unit 603 includes a speaker, a buzzer, a receiver, and the like.

The input unit 604 is used to receive audio or video signals. It should be understood that, in the embodiment of the present application, the input unit 604 may include a graphics processing unit (GPU) 6041 and a microphone 6042; the graphics processor 6041 processes the image data of still pictures or video obtained by an image capture device (such as a camera) in a video capture mode or an image capture mode.

The electronic device 600 also includes at least one sensor 605, such as a light sensor, a motion sensor, and other sensors. Specifically, the light sensor includes an ambient light sensor and a proximity sensor: the ambient light sensor can adjust the brightness of the display panel 6061 according to the brightness of the ambient light, and the proximity sensor can turn off the display panel 6061 and/or the backlight when the electronic device 600 is moved to the ear. As one type of motion sensor, an accelerometer can detect the magnitude of acceleration in all directions (generally three axes) and, when stationary, the magnitude and direction of gravity; it can be used to recognize the posture of the electronic device (for example, switching between landscape and portrait, related games, and magnetometer attitude calibration) and for vibration-recognition functions (for example, a pedometer or tap detection). The sensor 605 may also include a fingerprint sensor, a pressure sensor, an iris sensor, a molecular sensor, a gyroscope, a barometer, a hygrometer, a thermometer, an infrared sensor, and the like, which are not described further here.

The display unit 606 is used to display information input by the user or information provided to the user. The display unit 606 may include a display panel 6061, which may be configured in the form of a liquid crystal display (LCD), an organic light-emitting diode (OLED) display, or the like.

The user input unit 607 may be used to receive input numeric or character information and to generate key signal input related to user settings and function control of the electronic device. Specifically, the user input unit 607 includes a touch panel 6071 and other input devices 6072. The touch panel 6071, also called a touch screen, can collect the user's touch operations on or near it (for example, operations performed by the user on or near the touch panel 6071 with a finger, a stylus, or any other suitable object or accessory). The touch panel 6071 may include two parts: a touch detection device and a touch controller. The other input devices 6072 may include, but are not limited to, a physical keyboard, function keys (such as volume control keys and power keys), a trackball, a mouse, and a joystick, which are not described further here.

The interface unit 608 is an interface through which an external device is connected to the electronic device 600. For example, the external device may include a wired or wireless headset port, an external power (or battery charger) port, a wired or wireless data port, a memory card port, a port for connecting a device having an identification module, an audio input/output (I/O) port, a video I/O port, an earphone port, and so on. The interface unit 608 may be used to receive input (for example, data or power) from an external device and transmit the received input to one or more elements within the electronic device 600, or may be used to transfer data between the electronic device 600 and an external device.

The memory 609 may be used to store software programs and various data. The memory 609 may mainly include a program storage area and a data storage area: the program storage area may store an operating system and the application programs required for at least one function (such as a sound playback function or an image playback function), and the data storage area may store data created through use of the mobile phone (such as audio data and a phone book). In addition, the memory 609 may include high-speed random access memory and may also include non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or another volatile solid-state storage device.

The processor 610 is the control center of the electronic device. It uses various interfaces and lines to connect the parts of the entire electronic device, and performs the various functions of the electronic device and processes data by running or executing software programs and/or modules stored in the memory 609 and by calling data stored in the memory 609, thereby monitoring the electronic device as a whole. The processor 610 may include one or more processing units; preferably, the processor 610 may integrate an application processor and a modem processor, where the application processor mainly handles the operating system, user interface, application programs, and the like, and the modem processor mainly handles wireless communication. It can be understood that the modem processor may also not be integrated into the processor 610.

Those skilled in the art will understand that the electronic device 600 may further include a power supply (such as a battery) that supplies power to the components. The power supply may be logically connected to the processor 610 through a power management system, so that charging, discharging, and power consumption management are handled through the power management system. The structure of the electronic device shown in FIG. 7 does not limit the electronic device; the electronic device may include more or fewer components than shown, combine certain components, or use a different arrangement of components, which is not described further here. In the embodiments of the present application, electronic devices include, but are not limited to, mobile phones, tablet computers, notebook computers, handheld computers, vehicle-mounted terminals, wearable devices (such as bracelets and glasses), and pedometers.

Specifically,

the processor 610 is configured to:

obtain a user search request;

encode the query text input by the user through a deep learning model to obtain a second vector;

query, from a first vector index, the third vector with the highest similarity to the second vector,

where the first vector index is obtained by splitting the document library to be retrieved into multiple independent sub-texts and encoding the split sub-texts using a deep learning model;

take the sub-text corresponding to the third vector as the target text.

Therefore, the electronic device 600 according to the embodiment of the present application combines semantic vector retrieval with text retrieval, greatly improves the semantic recall effect of semantic vector retrieval in combined text-retrieval scenarios, and improves text retrieval performance and the accuracy of retrieval results.

An embodiment of the present application further provides a readable storage medium storing a program or instructions. When executed by a processor, the program or instructions implement the steps of the above text retrieval method and achieve the same technical effect. Therefore, the readable storage medium according to the embodiment of the present application combines semantic vector retrieval with text retrieval, greatly improves the semantic recall effect of semantic vector retrieval in combined text-retrieval scenarios, and improves text retrieval performance and the accuracy of retrieval results.

Other technical effects of the readable storage medium according to the embodiment of the present application are not described again here, to avoid repetition.

Here, the processor is the processor of the electronic device described in the above embodiments. The readable storage medium includes a computer-readable storage medium, such as a computer read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.

An embodiment of the present application further provides a chip. The chip includes a processor and a communication interface coupled to the processor; the processor is used to run a program or instructions to implement the steps of the above text retrieval method, and can achieve the same technical effect.

Therefore, the chip according to the embodiment of the present application combines semantic vector retrieval with text retrieval, greatly improves the semantic recall effect of semantic vector retrieval in combined text-retrieval scenarios, and improves text retrieval performance and the accuracy of retrieval results.

Other technical effects of the chip according to the embodiment of the present application are not described again here, to avoid repetition.

It should be understood that the chip mentioned in the embodiments of the present application may also be called a system-level chip, a system chip, a chip system, or a system-on-chip (SoC).

It should be noted that, in this document, the terms "include", "comprise", and any other variants are intended to cover non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element introduced by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, article, or device that includes the element. In addition, it should be pointed out that the scope of the methods and devices in the embodiments of the present application is not limited to performing functions in the order shown or discussed; functions may also be performed in a substantially simultaneous manner or in the reverse order, depending on the functions involved. For example, the described methods may be performed in an order different from that described, and various steps may be added, omitted, or combined. Features described with reference to certain examples may also be combined in other examples.

From the above description of the embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by software together with a necessary general-purpose hardware platform, or by hardware alone, although in many cases the former is the better implementation. Based on this understanding, the technical solution of the present application, in essence or in the part that contributes to the prior art, can be embodied in the form of a computer software product. The computer software product is stored in a storage medium (such as a ROM/RAM, a magnetic disk, or an optical disc) and includes several instructions that cause a terminal (which may be a mobile phone, a computer, a server, a network device, or the like) to execute the methods described in the embodiments of the present application.

The embodiments of the present application have been described above with reference to the accompanying drawings, but the present application is not limited to the specific implementations described, which are merely illustrative and not restrictive. Guided by the present application, those of ordinary skill in the art can devise many further forms without departing from the purpose of the present application or the scope protected by the claims, and all of these fall within the protection of the present application.

This application discloses A1. A text retrieval method, including:

obtaining a user search request;

encoding the query text input by the user through a deep learning model to obtain a second vector;

querying, from a first vector index, the third vector with the highest similarity to the second vector,

where the first vector index is obtained by splitting the document library to be retrieved into multiple independent sub-texts and encoding the split sub-texts using a deep learning model;

taking the sub-text corresponding to the third vector as the target text.

A2. The method according to A1, further including:

generating an inverted file of the document library to be retrieved;

generating a first inverted index of the document library to be retrieved according to the inverted file.

A3. The method according to A2, where querying the third vector with the highest similarity to the second vector from the first vector index includes:

splitting the query text into multiple independent word segments;

querying the first inverted index for the inverted chain data corresponding to each word segment;

finding, in the first vector index, at least one center point whose distance from the second vector satisfies a preset distance, and obtaining the inverted chain data corresponding to each center point;

intersecting the inverted chain data of the word segments to obtain a first weight value;

taking the union of the inverted chain data corresponding to the center points to obtain a second weight value;

comparing the first weight value with the second weight value, filtering the inverted chain data having the larger weight value, and storing it in the recall intermediate result data set when a preset filtering condition is met;

sorting the recall intermediate result data set to determine the third vector.

A4. The method according to A3, where,

when the recall intermediate result data set reaches a preset first storage capacity threshold, or

when the time spent retrieving with the query text exceeds a preset first time threshold,

collection of the recall intermediate result data set is terminated.

A5. The method according to A3, where sorting the recall intermediate result data set to determine the third vector includes:

sorting all inverted chain data stored in the recall intermediate result data set by score, and taking the top-ranked inverted chain data as the third vector.

A6. The method according to A3, where querying the first inverted index for the inverted chain data corresponding to each word segment includes:

obtaining the document number (ID) information of each word segment;

arranging the document ID information of all the word segments in descending order to form the inverted chain data.

A7. The method according to A6, further including:

Determine the weight value of each word segment based on the document ID information of that word segment;

Determine the recall time of each word segment according to its weight value;

Truncate the retrieval process according to the recall time and the chain length of each piece of inverted chain data.
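
The text does not give formulas for the weight value, the recall time, or the truncation rule. The sketch below fills those gaps with explicit assumptions: the weight is an IDF-style value computed from the chain length, the total time budget is split in proportion to the weights, and each chain is scanned up to the smaller of its length and what its budget allows at an assumed scan rate. All names and parameters are hypothetical.

```python
import math


def term_weights(chains, total_docs):
    """Assumed IDF-style weight: rarer word segments get larger weights."""
    return {t: math.log(1 + total_docs / len(ids)) for t, ids in chains.items() if ids}


def scan_limits(chains, total_docs, total_budget_ms, postings_per_ms):
    """Number of postings to scan per word segment before truncating."""
    weights = term_weights(chains, total_docs)
    weight_sum = sum(weights.values())
    limits = {}
    for term, weight in weights.items():
        budget_ms = total_budget_ms * weight / weight_sum  # assumed recall time
        limits[term] = min(len(chains[term]), int(budget_ms * postings_per_ms))
    return limits


limits = scan_limits({"rare": [5], "common": [9, 8, 7, 6, 5, 4, 3, 2]}, 8, 10, 1)
```

Under these assumptions a long chain for a common word segment is cut short, while a short chain for a rare word segment is read in full.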

A8. The method according to A2, further including:

Write the first vector index and the first inverted index together into a memory segment to build a temporary in-memory vector index;

When the temporary in-memory vector index reaches a preset second storage capacity threshold, or the time spent building it reaches a preset second time threshold,

write the first vector index and the first inverted index together into a disk segment to build a persistent on-disk vector index.
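
The memory-segment-then-disk-segment flow resembles the buffered segment flushing used by many search engines. The following is a minimal sketch under stated assumptions: the class name, the JSON file format, and the use of an entry count as the capacity threshold are all hypothetical and not taken from the text.

```python
import json
import time
from pathlib import Path


class SegmentWriter:
    """Buffer index entries in memory and flush them to disk on size or age."""

    def __init__(self, directory, max_entries, max_age_seconds, clock=time.monotonic):
        self.directory = Path(directory)
        self.max_entries = max_entries          # second storage capacity threshold
        self.max_age_seconds = max_age_seconds  # second time threshold
        self.clock = clock
        self.memory_segment = []
        self.opened_at = clock()
        self.flushed = 0

    def add(self, doc_id, vector, segments):
        """Store vector and inverted-index data together; flush when a limit is hit."""
        self.memory_segment.append({"id": doc_id, "vector": vector, "segments": segments})
        too_big = len(self.memory_segment) >= self.max_entries
        too_old = self.clock() - self.opened_at >= self.max_age_seconds
        return self.flush() if too_big or too_old else None

    def flush(self):
        """Write the memory segment to a disk segment file and start a new one."""
        path = self.directory / f"segment_{self.flushed:06d}.json"
        path.write_text(json.dumps(self.memory_segment))
        self.flushed += 1
        self.memory_segment = []
        self.opened_at = self.clock()
        return path
```

Keeping the vector and the word segments in the same record mirrors the requirement that both indexes be written together, so a document becomes searchable on both sides at the same moment.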

This application also discloses: B9. A text retrieval device, including:

An acquisition module, used to obtain a user search request;

An encoding module, used to encode the query text input by the user through a deep learning model to obtain a second vector;

A retrieval module, used to query, from a first vector index, the third vector with the highest similarity to the second vector,

where the first vector index is obtained by splitting the document library to be retrieved into multiple independent sub-texts and encoding the split sub-texts with a deep learning model;

and to use the sub-text corresponding to the third vector as the target text.
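
The module flow (encode the query, find the most similar indexed vector, return its sub-text) can be sketched end to end. This is only an illustration: `toy_encode` is a hashed bag-of-words stand-in for the deep learning model, cosine similarity is an assumed similarity measure, and the brute-force scan replaces whatever vector index structure a real system would use.

```python
import math


def toy_encode(text, dim=8):
    """Stand-in for the deep learning encoder (assumption): hashed bag of words."""
    vec = [0.0] * dim
    for word in text.lower().split():
        vec[sum(map(ord, word)) % dim] += 1.0
    return vec


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def build_vector_index(sub_texts, encode=toy_encode):
    """First vector index: one encoded vector per sub-text."""
    return [(encode(text), text) for text in sub_texts]


def retrieve(query, index, encode=toy_encode):
    """Encode the query (second vector) and return the best-matching sub-text."""
    second_vector = encode(query)
    _third_vector, target_text = max(index, key=lambda e: cosine(second_vector, e[0]))
    return target_text


index = build_vector_index(["cats purr softly", "stock prices fell"])
result = retrieve("stock prices", index)
```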

B10. The device according to B9, wherein the index building module is further used to:

Generate an inverted file of the document library to be retrieved;

Generate a first inverted index of the document library to be retrieved based on the inverted file.

B11. The device according to B10, wherein querying, from the first vector index, the third vector with the highest similarity to the second vector includes:

Split the query text into multiple independent word segments;

Query the inverted chain data corresponding to each word segment in the first inverted index;

Find, in the first vector index, at least one center point whose distance from the second vector satisfies a preset distance, and obtain the inverted chain data corresponding to each center point;

Compute the intersection of the inverted chain data of the word segments to obtain the first weight value;

Compute the union of the inverted chain data corresponding to the center points to obtain the second weight value;

Compare the first weight value with the second weight value, filter the inverted chain data with the larger weight value, and store it in the recall intermediate result data set when the preset filtering condition is met;

Sort the recall intermediate result data set to determine the third vector.

B12. The device according to B11, wherein,

when the recall intermediate result data set reaches a preset first storage capacity threshold, or

when the time spent retrieving with the query text exceeds a preset first time threshold,

collection of the recall intermediate result data set is terminated.

B13. The device according to B11, wherein sorting the recall intermediate result data set to determine the third vector includes:

Sort all inverted chain data stored in the recall intermediate result data set by score, and take the top-ranked inverted chain data as the third vector.

B14. The device according to B11, wherein querying the inverted chain data corresponding to each word segment in the first inverted index includes:

Obtain the document ID information of each word segment;

Arrange the document ID information of all the word segments in descending order to form the inverted chain data.

B15. The device according to B14, further including:

Determine the weight value of each word segment based on the document ID information of that word segment;

Determine the recall time of each word segment according to its weight value;

Truncate the retrieval process according to the recall time and the chain length of each piece of inverted chain data.

B16. The device according to B10, further including a storage module used to:

Write the first vector index and the first inverted index together into a memory segment to build a temporary in-memory vector index;

When the temporary in-memory vector index reaches a preset second storage capacity threshold, or the time spent building it reaches a preset second time threshold,

write the first vector index and the first inverted index together into a disk segment to build a persistent on-disk vector index.

This application also discloses an electronic device, including a processor, a memory, and a program or instructions stored in the memory and executable on the processor, where the program or instructions, when executed by the processor, implement the steps of the text retrieval method described in any one of A1-A8.

This application also discloses a readable storage medium that stores a program or instructions, where the program or instructions, when executed by a processor, implement the steps of the text retrieval method described in any one of A1-A8.

Claims (10)

Translated from Chinese
1. A text retrieval method, characterized by comprising:
obtaining a user search request;
encoding the query text input by the user through a deep learning model to obtain a second vector;
querying, from a first vector index, a third vector with the highest similarity to the second vector, wherein the first vector index is obtained by splitting a document library to be retrieved into multiple independent sub-texts and encoding the split sub-texts with a deep learning model;
using the sub-text corresponding to the third vector as the target text.

2. The method according to claim 1, characterized in that querying, from the first vector index, the third vector with the highest similarity to the second vector comprises:
splitting the query text into multiple independent word segments;
querying, in a first inverted index, the inverted chain data corresponding to each word segment, wherein the first inverted index is obtained from an inverted file generated from the document library to be retrieved;
finding, in the first vector index, at least one center point whose distance from the second vector satisfies a preset distance, and obtaining the inverted chain data corresponding to each center point;
computing the intersection of the inverted chain data of the word segments to obtain a first weight value;
computing the union of the inverted chain data corresponding to the center points to obtain a second weight value;
comparing the first weight value with the second weight value, filtering the inverted chain data with the larger weight value, and storing it in a recall intermediate result data set when a preset filtering condition is met;
sorting the recall intermediate result data set to determine the third vector.

3. The method according to claim 2, characterized in that,
when the recall intermediate result data set reaches a preset first storage capacity threshold, or
when the time spent retrieving with the query text exceeds a preset first time threshold,
collection of the recall intermediate result data set is terminated.

4. The method according to claim 2, characterized in that sorting the recall intermediate result data set to determine the third vector comprises:
sorting all inverted chain data stored in the recall intermediate result data set by score, and taking the top-ranked inverted chain data as the third vector.

5. The method according to claim 2, characterized in that querying, in the first inverted index, the inverted chain data corresponding to each word segment comprises:
obtaining the document ID information of each word segment;
arranging the document ID information of all the word segments in descending order to form the inverted chain data.

6. The method according to claim 5, characterized by further comprising:
determining the weight value of each word segment based on the document ID information of that word segment;
determining the recall time of each word segment according to its weight value;
truncating the retrieval process according to the recall time and the chain length of each piece of inverted chain data.

7. The method according to claim 2, characterized by further comprising:
writing the first vector index and the first inverted index together into a memory segment to build a temporary in-memory vector index;
when the temporary in-memory vector index reaches a preset second storage capacity threshold, or the time spent building it reaches a preset second time threshold, writing the first vector index and the first inverted index together into a disk segment to build a persistent on-disk vector index.

8. A text retrieval device, characterized by comprising:
an acquisition module, used to obtain a user search request;
an encoding module, used to encode the query text input by the user through a deep learning model to obtain a second vector;
a retrieval module, used to query, from a first vector index, a third vector with the highest similarity to the second vector, wherein the first vector index is obtained by splitting a document library to be retrieved into multiple independent sub-texts and encoding the split sub-texts with a deep learning model, and to use the sub-text corresponding to the third vector as the target text.

9. An electronic device, characterized by comprising a processor, a memory, and a program or instructions stored in the memory and executable on the processor, wherein the program or instructions, when executed by the processor, implement the steps of the text retrieval method according to any one of claims 1-7.

10. A readable storage medium, characterized in that the readable storage medium stores a program or instructions, and the program or instructions, when executed by a processor, implement the steps of the text retrieval method according to any one of claims 1-7.
CN202210541779.2A | 2022-05-17 | 2022-05-17 | Text retrieval method, text retrieval device, electronic equipment and storage medium | Pending | CN117149990A (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202210541779.2A, published as CN117149990A (en) | 2022-05-17 | 2022-05-17 | Text retrieval method, text retrieval device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202210541779.2A, published as CN117149990A (en) | 2022-05-17 | 2022-05-17 | Text retrieval method, text retrieval device, electronic equipment and storage medium

Publications (1)

Publication Number | Publication Date
CN117149990A (en) | 2023-12-01

Family

ID=88910578

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202210541779.2A (Pending), published as CN117149990A (en) | Text retrieval method, text retrieval device, electronic equipment and storage medium | 2022-05-17 | 2022-05-17

Country Status (1)

Country | Link
CN (1) | CN117149990A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN117992606A (en) * | 2024-01-26 | 2024-05-07 | 郑州日兴电子科技有限公司 | A method and system for full-text archive retrieval based on artificial intelligence

Similar Documents

Publication | Title
CN113378556B (en) | Method and device for extracting text keywords
CN111125523B (en) | Searching method, searching device, terminal equipment and storage medium
CN112685578B (en) | Method and device for providing multimedia information content
CN108255372B (en) | A kind of desktop application icon arrangement method and mobile terminal
WO2020238951A1 (en) | Network content processing method and device, apparatus, and computer storage medium
CN107436948B (en) | File searching method and device and terminal
CN109726726B (en) | Event detection method and device in video
CN108427761B (en) | News event processing method, terminal, server and storage medium
TW201512865A (en) | Method for searching web page digital data, device and system thereof
WO2024036616A1 (en) | Terminal-based question and answer method and apparatus
CN111159338A (en) | Malicious text detection method and device, electronic equipment and storage medium
CN110287466A (en) | A method and device for generating a solid template
CN108549681B (en) | Data processing method and device, electronic equipment and computer readable storage medium
CN110826098A (en) | Information processing method and electronic equipment
CN112328783B (en) | A method for determining a summary and a related device
CN117149990A (en) | Text retrieval method, text retrieval device, electronic equipment and storage medium
CN107329948B (en) | Method, device and storage medium for estimating occurrence time of event described in sentence
CN110866114B (en) | Object behavior identification method and device and terminal equipment
CN108415996A (en) | A kind of news information method for pushing, device and electronic equipment
US20220197939A1 (en) | Image-based search method, server, terminal, and medium
CN107885887B (en) | A file storage method and mobile terminal
CN115412726B (en) | Video authenticity detection method, device and storage medium
CN111506730A (en) | Data clustering method and related device
CN116383224A (en) | Data storage method, device, medium and equipment
CN116758362A (en) | Image processing method, device, computer equipment and storage medium

Legal Events

Code | Title
PB01 | Publication
SE01 | Entry into force of request for substantive examination
