Knowledge search method and system based on long and short text semantic analysis retrieval

Technical Field
The application relates to the field of knowledge retrieval, in particular to a knowledge search method and a system based on long and short text semantic analysis retrieval.
Background
Natural language processing technology has long been dedicated to making machines communicate as smoothly and freely as people do, which was the original goal of products such as Siri and similar voice assistants on the market. However, it is also desirable that, beyond holding a fluent dialogue, the machine be able to produce replies that are rich in knowledge rather than merely respond to the surface content of the conversation. Therefore, when a training corpus is provided to the machine, the corresponding historical dialogue information and the corresponding dialogue knowledge must also be provided, so that the machine can generate a knowledge-grounded dialogue reply on the premise of having acquired that knowledge.
Current knowledge base search methods generally adopt keyword-based Elasticsearch. They suffer from two problems: search results are incorrect when the keywords entered by the user do not match the knowledge points, and the text representation is inaccurate because the text length is subject to a maximum limit and anything exceeding it is truncated.
Disclosure of Invention
In view of the above, the application aims to provide a knowledge search method and a system based on long and short text semantic analysis retrieval, which effectively improve the comprehensiveness and accuracy of the retrieval.
In order to achieve the above purpose, the application adopts the following technical scheme:
a knowledge search method based on long and short text semantic analysis retrieval comprises the following steps:
acquiring knowledge text data to be searched, and constructing a knowledge vector base;
converting the requested knowledge search text into a semantic vector;
screening relevant knowledge points from the knowledge vector library as a candidate set according to filtering conditions and the semantic vector of the knowledge search text;
and returning a recommended knowledge point ID list after ranking and de-duplication.
Further, the knowledge vector library is constructed specifically as follows:
preprocessing all text information items, and dividing the text information items into short text items and long text items according to the text length;
constructing two paths of semantic vectors for all the preprocessed text information items;
partitioning the knowledge vector library into a Q2Q partition and a Q2P partition, and storing the two paths of semantic vectors into the two partitions respectively.
Further, the text information items include title, content, question and answer information items.
Further, the preprocessing comprises filtering spaces, special characters and other meaningless characters, and removing HTML tags.
Further, the two paths of semantic vectors are respectively constructed as follows:
after preprocessing, a plurality of groups of long texts and a plurality of groups of short texts are obtained; the long texts are input into a language model encoder, and the corresponding knowledge semantic vectors are output and stored into the Q2P partition of a Milvus vector library;
the short texts are input into the language model encoder, and the corresponding knowledge semantic vectors are output and stored into the Q2Q partition of the Milvus vector library, completing the vectorized construction of the knowledge semantics.
Further, the candidate set is obtained by adopting a DPR model, which is specifically as follows:
the DPR model adopts two independent BERT as encoders, including a Query Encoder and a Document Encoder;
firstly, the Document Encoder is used to encode text paragraphs into d-dimensional dense vectors, and indexes are built on the vectorized text paragraphs so that they can later be recalled;
in the inference stage, the input question is encoded into a d-dimensional vector by the Query Encoder, and the TopK paragraph vectors closest to the question vector are retrieved from the Milvus vector library.
Further, the similarity between the question vector $E_Q(q)$ and the paragraph vector $E_P(p)$ is measured by their dot product:

$\mathrm{sim}(q, p) = E_Q(q)^{\top} E_P(p)$.
Further, the ranking is specifically as follows:
the question and the candidate paragraph are concatenated with the separator token [SEP] and input into a Cross-Encoder model;
the similarity score between the Query and each Document in the recall result is calculated one by one, and a knowledge list ordered by similarity score from high to low is obtained as the ranking result.
Further, the de-duplication specifically includes: performing duplicate-removal filtering based on the knowledge point ID, keeping only the first-ranked ID and filtering out the other repeated IDs.
A knowledge search system based on long and short text semantic analysis retrieval comprises a client, a request processing module, a recommendation module and a knowledge vector library. The client is connected with the request processing module and the recommendation module in a distributed manner through an API interface, and the recommendation module is connected with the request processing module and the knowledge vector library respectively. The user inputs the requested knowledge search text through the client, which transmits it to the request processing module through the API interface; the request processing module converts the requested knowledge search text into a semantic vector; and the recommendation module screens relevant knowledge points from the knowledge vector library as a candidate set according to filtering conditions and the semantic vector of the knowledge search text, and returns a recommended knowledge point ID list after ranking and de-duplication.
Compared with the prior art, the application has the following beneficial effects:
The application adopts semantics-based retrieval: it is not limited by the literal wording of the user's search Query, but instead captures the real intention behind the Query and searches according to that intention, so that the results that best match the user's need are returned more accurately. A semantic index model is used to obtain vector representations of the texts, the texts are indexed in a high-dimensional vector space, and the similarity between the Query vector and the indexed documents is measured, thereby overcoming the shortcomings of keyword-based indexing;
According to the application, all text information items are divided into short text items and long text items according to text length, two paths of semantic vectors are constructed for them respectively, and during retrieval a recommended knowledge point ID list is returned after the candidate set has been ranked and de-duplicated. This resolves the maximum length limit on the input text and matches the short texts entered by users more accurately.
Drawings
FIG. 1 is a schematic diagram of a system framework of the present application;
FIG. 2 is a schematic diagram of a data preprocessing flow in an embodiment of the application;
FIG. 3 is a schematic diagram of knowledge vector base construction in an embodiment of the application;
FIG. 4 is a schematic diagram of a DPR model in accordance with one embodiment of the application;
FIG. 5 is a schematic diagram of a general recall flow in an embodiment of the present application;
FIG. 6 is a schematic diagram of a ranking model framework in accordance with an embodiment of the application.
Detailed Description
The application will be further described with reference to the accompanying drawings and examples.
Referring to fig. 1-6, in the present embodiment, there is provided a knowledge search method based on long and short text semantic analysis retrieval, including the steps of:
acquiring knowledge text data to be searched, and constructing a knowledge vector base;
converting the requested knowledge search text into a semantic vector;
screening relevant knowledge points from the knowledge vector library as a candidate set according to filtering conditions and the semantic vector of the knowledge search text;
and returning a recommended knowledge point ID list after ranking and de-duplication.
In this embodiment, referring to fig. 2 and 3, a knowledge vector base is constructed, specifically:
preprocessing all text information items, and dividing the text information items into short text items and long text items according to the text length; specifically, if the text length of the content exceeds the maximum length limit (384 characters by default), the long text is split into multiple paragraphs, and the title is concatenated with the content (each paragraph);
if the knowledge type is a convenience (FAQ-type) question-and-answer item, the question and the answer are concatenated, finally obtaining the processed text information;
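As a concrete but hedged illustration of this preprocessing step, the sketch below cleans a text item, splits over-long content into paragraphs, and splices FAQ-type questions with their answers; the 384-character limit is taken from the embodiment, while the cleaning rules, helper names and the exact short/long assignment are illustrative assumptions rather than the application's actual implementation.

```python
import re

MAX_LEN = 384  # maximum length limit from the embodiment (characters)

def clean(text: str) -> str:
    """Filter spaces, HTML tags, special and meaningless characters."""
    text = re.sub(r"<[^>]+>", " ", text)                        # remove HTML tags
    text = re.sub(r"[^\w\s，。？！、：；,.?!（）()]", "", text)    # drop special characters
    return re.sub(r"\s+", " ", text).strip()                    # collapse whitespace

def build_items(title: str, content: str, qa=None):
    """Return (short_text, long_texts) for one knowledge point.

    qa: optional (question, answer) pair for FAQ-type knowledge points.
    The short/long assignment below is an illustrative assumption.
    """
    title, content = clean(title), clean(content)
    if qa:                                     # FAQ-type knowledge: splice question and answer
        question, answer = map(clean, qa)
        return question, [question + " " + answer]
    # split over-long content into multiple paragraphs and prepend the title to each
    paragraphs = [content[i:i + MAX_LEN] for i in range(0, len(content), MAX_LEN)] or [""]
    return title, [title + " " + p for p in paragraphs]

short_text, long_texts = build_items("重置密码", "<p>登录页面点击忘记密码即可重置。</p>" * 30)
```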
constructing two paths of semantic vectors for all the text information items after pretreatment;
partitioning the knowledge vector library, wherein the knowledge vector library comprises a Q2Q partition and a Q2P partition, and storing two paths of semantic vectors into the two partitions respectively.
In the present embodiment, the text information items include title, content, question, and answer information items; the preprocessing includes filtering spaces, special characters and other meaningless characters, and removing HTML tags.
In this embodiment, the two paths of semantic vectors are constructed as follows:
after preprocessing, a plurality of groups of long texts and a plurality of groups of short texts are obtained; the long texts are input into a language model encoder, and the corresponding knowledge semantic vectors are output and stored into the Q2P partition of a Milvus vector library;
the short texts are input into the language model encoder, and the corresponding knowledge semantic vectors are output and stored into the Q2Q partition of the Milvus vector library, completing the vectorized construction of the knowledge semantics.
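The following sketch shows how the two paths of vectors could be written into the Q2Q and Q2P partitions of a Milvus collection; the collection schema, field names, and the SentenceTransformer model standing in for the "language model encoder" are assumptions made for illustration only.

```python
from pymilvus import connections, Collection, CollectionSchema, FieldSchema, DataType
from sentence_transformers import SentenceTransformer

connections.connect(host="localhost", port="19530")   # assumed local Milvus instance

# assumed schema: auto primary key, knowledge point ID, 768-dim semantic vector
fields = [
    FieldSchema("pk", DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema("kp_id", DataType.INT64),
    FieldSchema("embedding", DataType.FLOAT_VECTOR, dim=768),
]
collection = Collection("knowledge", CollectionSchema(fields))
collection.create_partition("Q2Q")   # short-text (question/title) vectors
collection.create_partition("Q2P")   # long-text (paragraph) vectors

# a single sentence encoder stands in for the "language model encoder"
encoder = SentenceTransformer("paraphrase-multilingual-mpnet-base-v2")

def index_texts(texts, kp_ids, partition):
    """Encode texts and store the knowledge semantic vectors in the given partition."""
    vectors = encoder.encode(texts, normalize_embeddings=True).tolist()
    collection.insert([kp_ids, vectors], partition_name=partition)

index_texts(["重置密码"], [1001], "Q2Q")                              # short texts -> Q2Q
index_texts(["重置密码 登录页面点击忘记密码即可重置。"], [1001], "Q2P")   # long texts  -> Q2P
collection.flush()
collection.create_index("embedding", {"index_type": "FLAT", "metric_type": "IP"})
```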
In this embodiment, the candidate set is obtained by using a DPR model, which is specifically as follows:
the DPR model adopts two independent BERT as encoders, including a Query Encoder and a Document Encoder;
firstly, the Document Encoder is used to encode text paragraphs into d-dimensional dense vectors, and indexes are built on the vectorized text paragraphs so that they can later be recalled;
in the inference stage, the input question is encoded into a d-dimensional vector by the Query Encoder, and the TopK paragraph vectors closest to the question vector are retrieved from the Milvus vector library.
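A minimal recall sketch continuing the previous example: the question is encoded and the TopK nearest knowledge vectors are searched in a Milvus partition by inner product. Here a single sentence encoder stands in for DPR's separate Query/Document BERT encoders, and all parameter values are illustrative.

```python
# Recall sketch; reuses the `collection` and `encoder` from the indexing sketch above.
collection.load()

def recall(query: str, partition: str, top_k: int = 10):
    """Encode the question and retrieve the TopK closest knowledge vectors by inner product."""
    q_vec = encoder.encode([query], normalize_embeddings=True).tolist()
    hits = collection.search(
        data=q_vec,
        anns_field="embedding",
        param={"metric_type": "IP"},   # dot-product similarity sim(q, p) = E_Q(q)^T E_P(p)
        limit=top_k,
        partition_names=[partition],
        output_fields=["kp_id"],
    )[0]
    return [(hit.entity.get("kp_id"), hit.distance) for hit in hits]

# query both partitions: short-text matching (Q2Q) and long-text matching (Q2P)
candidates = recall("如何重置密码", "Q2Q") + recall("如何重置密码", "Q2P")
```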
In the present embodiment, the similarity between the question vector $E_Q(q)$ and the paragraph vector $E_P(p)$ is measured by their dot product:

$\mathrm{sim}(q, p) = E_Q(q)^{\top} E_P(p)$.
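As a small worked illustration of this measure (the vector values are arbitrary toy numbers):

```python
import numpy as np

E_q = np.array([0.2, 0.5, 0.1])   # question vector E_Q(q), toy values
E_p = np.array([0.3, 0.4, 0.7])   # paragraph vector E_P(p), toy values

sim = float(E_q @ E_p)            # sim(q, p) = E_Q(q)^T E_P(p)
print(sim)                        # 0.2*0.3 + 0.5*0.4 + 0.1*0.7 = 0.33
```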
In this embodiment, based on the TopK recall results screened by the recall algorithm, the ranking algorithm performs a pairwise matching calculation between the Query and each recalled knowledge point, specifically:
the question and the candidate paragraph are concatenated with the separator token [SEP] and input into a Cross-Encoder model; preferably, rocketqa-zh-dureader-cross-encoder is used;
the similarity score between the Query and each Document in the recall result is calculated one by one, and a knowledge list ordered by similarity score from high to low is obtained as the ranking result.
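A hedged ranking sketch follows; sentence-transformers' CrossEncoder with a public model is used here purely as a stand-in for the rocketqa-zh-dureader-cross-encoder mentioned above, and the candidate structure is assumed.

```python
from sentence_transformers import CrossEncoder

# illustrative stand-in for the rocketqa-zh-dureader-cross-encoder mentioned above
ranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rank(query: str, candidates: list) -> list:
    """Score each (Query, Document) pair with the cross-encoder (the model itself
    concatenates the two texts with [SEP]) and sort by score, high to low."""
    pairs = [(query, c["text"]) for c in candidates]
    scores = ranker.predict(pairs)
    for c, s in zip(candidates, scores):
        c["score"] = float(s)
    return sorted(candidates, key=lambda c: c["score"], reverse=True)

ranked = rank("如何重置密码", [
    {"kp_id": 1001, "text": "登录页面点击忘记密码即可重置。"},
    {"kp_id": 1002, "text": "发票抬头修改流程说明。"},
])
```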
Because a long text is segmented into multiple paragraphs during construction, several paragraphs belonging to the same knowledge point may appear in the ranking result, so de-duplication filtering based on the knowledge point ID is also needed; specifically, only the first-ranked occurrence of each ID is kept and the other repeated IDs are filtered out.
Finally, the knowledge recommendation list is returned according to the similarity threshold or the TopK parameter.
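The de-duplication and final cut-off could look like the sketch below (continuing the ranked list from the previous example); the threshold and TopK defaults are illustrative.

```python
def dedupe_and_cut(ranked: list, threshold: float = 0.5, top_k: int = 5) -> list:
    """Keep only the first (highest-scoring) occurrence of each knowledge point ID,
    then apply the similarity threshold and the TopK cut-off."""
    seen, result = set(), []
    for c in ranked:                          # ranked is already sorted high-to-low
        if c["kp_id"] in seen or c["score"] < threshold:
            continue
        seen.add(c["kp_id"])
        result.append(c["kp_id"])
        if len(result) == top_k:
            break
    return result

recommended_ids = dedupe_and_cut(ranked)      # the recommended knowledge point ID list
```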
In this embodiment, referring to fig. 1, there is further provided a knowledge search system based on long and short text semantic analysis retrieval, comprising a client, a request processing module, a recommendation module and a knowledge vector library. The client is connected with the request processing module and the recommendation module in a distributed manner through an API interface, and the recommendation module is connected with the request processing module and the knowledge vector library respectively. The user inputs the requested knowledge search text through the client, which transmits it to the request processing module through the API interface; the request processing module converts the requested knowledge search text into a semantic vector; and the recommendation module screens relevant knowledge points from the knowledge vector library as a candidate set according to filtering conditions and the semantic vector of the knowledge search text, and returns a recommended knowledge point ID list after ranking and de-duplication.
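To make the module wiring of fig. 1 concrete, here is a hedged end-to-end sketch of the request flow; FastAPI, the route, and the helper fetch_texts (a hypothetical lookup of paragraph texts by knowledge point ID) are illustrative choices and not part of the application.

```python
from fastapi import FastAPI

app = FastAPI()   # the API interface the client calls

@app.get("/search")
def search(query: str, top_k: int = 5):
    """Request processing: encode the query; recommendation: recall, rank, de-duplicate."""
    candidates = recall(query, "Q2Q", top_k=20) + recall(query, "Q2P", top_k=20)
    scored = rank(query, fetch_texts(candidates))   # fetch_texts: hypothetical text lookup by ID
    return {"knowledge_point_ids": dedupe_and_cut(scored, top_k=top_k)}
```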
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above description is only a preferred embodiment of the present application and is not intended to limit the application in any way; any person skilled in the art may modify or alter the disclosed technical content into equivalent embodiments. However, any simple modification, equivalent change or variation made to the above embodiments according to the technical substance of the present application still falls within the protection scope of the technical solution of the present application.