CN116932729A - Knowledge search method and system based on long and short text semantic analysis retrieval - Google Patents

Knowledge search method and system based on long and short text semantic analysis retrieval

Info

Publication number
CN116932729A
Authority
CN
China
Prior art keywords
knowledge
text
vector
long
semantic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311178590.2A
Other languages
Chinese (zh)
Inventor
林韶军
黄炳裕
戴文艳
何亦龙
黄滇玲
黄河
叶威鑫
刘骏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Evecom Information Technology Development Co ltd
Original Assignee
Evecom Information Technology Development Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2023-09-13
Filing date
2023-09-13
Publication date
2023-10-24
Application filed by Evecom Information Technology Development Co ltd
Priority to CN202311178590.2A
Publication of CN116932729A
Status: Pending

Abstract

The application relates to a knowledge search method based on long and short text semantic analysis retrieval, comprising the following steps: acquiring the knowledge text data to be searched and constructing a knowledge vector library; converting the requested knowledge search text into a semantic vector; screening relevant knowledge points from the knowledge vector library, using filter conditions and the semantic vector of the knowledge search text, to form a candidate set; and returning a list of recommended knowledge point IDs after ranking and de-duplication. The application effectively improves the comprehensiveness and accuracy of search.

Description

Knowledge search method and system based on long and short text semantic analysis retrieval
Technical Field
The application relates to the field of knowledge retrieval, in particular to a knowledge search method and a system based on long and short text semantic analysis retrieval.
Background
Natural language processing has long aimed to let machines converse as fluently and freely as people, which was the original goal behind products on the market such as Siri and similar voice assistants. It is also desirable, however, that a machine produce knowledge-rich replies while conversing fluently, rather than merely responding to the dialogue content. Therefore, when a training corpus is provided to the machine, the corresponding historical dialogue information and the corresponding dialogue knowledge must also be provided, so that the machine can generate a knowledgeable reply on the basis of the knowledge it has acquired.
Current knowledge base search methods generally use keyword-based Elasticsearch retrieval, which has two problems: search results are incorrect when the keywords entered by the user do not match the knowledge points, and text is displayed inaccurately because text that exceeds the maximum length limit is truncated.
Disclosure of Invention
In view of the above, the application aims to provide a knowledge search method and system based on long and short text semantic analysis retrieval that effectively improve the comprehensiveness and accuracy of retrieval.
To achieve the above purpose, the application adopts the following technical solution:
A knowledge search method based on long and short text semantic analysis retrieval comprises the following steps:
acquiring the knowledge text data to be searched, and constructing a knowledge vector library;
converting the requested knowledge search text into a semantic vector;
screening relevant knowledge points from the knowledge vector library, using filter conditions and the semantic vector of the knowledge search text, to form a candidate set;
and returning a list of recommended knowledge point IDs after ranking and de-duplication.
Further, the knowledge vector library is constructed specifically as follows:
preprocessing all text information items, and dividing the text information items into short text items and long text items according to the text length;
constructing two paths of semantic vectors for all the preprocessed text information items;
partitioning the knowledge vector library, wherein the knowledge vector library comprises a Q2Q partition and a Q2P partition, and storing two paths of semantic vectors into the two partitions respectively.
Further, the text information items include title, content, question and answer information items.
Further, the preprocessing comprises filtering out spaces, special characters and meaningless characters, and removing HTML tags.
Further, the two paths of semantic vectors are constructed as follows:
after preprocessing, several groups of long texts and several groups of short texts are obtained; the long texts are input into a language model encoder, which outputs the corresponding knowledge semantic vectors, and these are stored in the Q2P partition of the Milvus vector library;
the short texts are input into a language model encoder, which outputs the corresponding knowledge semantic vectors, and these are stored in the Q2Q partition of the Milvus vector library, completing the vectorization of the knowledge semantics.
Further, the candidate set is obtained using a DPR model, as follows:
the DPR model uses two independent BERT models as encoders: a Query Encoder and a Document Encoder;
first, the Document Encoder encodes each text paragraph into a d-dimensional dense vector, and an index is built over the vectorized paragraphs for later recall;
in the inference stage, the Query Encoder encodes the input question into a d-dimensional vector, and the top-K paragraph vectors closest to the question vector are retrieved from the Milvus vector library.
Further, the similarity between the question vector $E_Q(q)$ and the paragraph vector $E_P(p)$ is measured by their dot product:
$$\mathrm{sim}(q, p) = E_Q(q)^{\top} E_P(p)$$
Further, the ranking is specifically:
splicing the question vector and the paragraph vector with the separator <SEP>, and inputting the result into a cross-encoder (CrossEncoder) model;
calculating, one by one, the similarity score between the Query and each Document in the recall result, and obtaining a knowledge list ordered by similarity score from high to low as the ranking result.
Further, the de-duplication is specifically: filtering duplicates based on the knowledge point ID, keeping only the highest-ranked occurrence of each ID and discarding its repeats.
A knowledge search system based on long and short text semantic analysis retrieval comprises a client, a request processing module, a recommendation module and a knowledge vector library. The client is connected to the request processing module and the recommendation module in a distributed manner through an API; the recommendation module is connected to the request processing module and to the knowledge vector library. The user enters the requested knowledge search text through the client, which passes it to the request processing module through the API; the request processing module converts the requested knowledge search text into a semantic vector; and the recommendation module screens relevant knowledge points from the knowledge vector library, using filter conditions and the semantic vector of the knowledge search text, to form a candidate set, and returns a list of recommended knowledge point IDs after ranking and de-duplication.
Compared with the prior art, the application has the following beneficial effects:
The application uses semantic retrieval, which is not limited to the literal wording of the user's Query; it captures the real intention behind the Query and searches on that intention, so that the best-matching results are returned to the user more accurately. A semantic index model produces a vector representation of each text, the texts are indexed in a high-dimensional vector space, and the similarity between the Query vector and the indexed documents is measured, which overcomes the shortcomings of keyword indexing;
All text information items are divided into short text items and long text items according to text length, and two paths of semantic vectors are constructed, one for each; at retrieval time the candidate set is ranked and de-duplicated before the list of recommended knowledge point IDs is returned. This removes the maximum-length limit on the input text and matches short text entered by the user more accurately.
Drawings
FIG. 1 is a schematic diagram of a system framework of the present application;
FIG. 2 is a schematic diagram of a data preprocessing flow in an embodiment of the application;
FIG. 3 is a schematic diagram of knowledge vector base construction in an embodiment of the application;
FIG. 4 is a schematic diagram of a DPR model in accordance with one embodiment of the application;
FIG. 5 is a schematic diagram of a general recall flow in an embodiment of the present application;
FIG. 6 is a schematic diagram of a ranking model framework in accordance with an embodiment of the application.
Detailed Description
The application will be further described with reference to the accompanying drawings and examples.
Referring to FIGS. 1-6, this embodiment provides a knowledge search method based on long and short text semantic analysis retrieval, comprising the following steps:
acquiring the knowledge text data to be searched, and constructing a knowledge vector library;
converting the requested knowledge search text into a semantic vector;
screening relevant knowledge points from the knowledge vector library, using filter conditions and the semantic vector of the knowledge search text, to form a candidate set;
and returning a list of recommended knowledge point IDs after ranking and de-duplication.
In this embodiment, referring to FIGS. 2 and 3, the knowledge vector library is constructed as follows:
preprocessing all text information items, and dividing them into short text items and long text items according to text length; specifically, if the text length of the content exceeds the maximum length limit (384 characters by default), the long text is split into multiple paragraphs; the title is then spliced with the content (each paragraph);
if the knowledge type is a convenience question and answer, the question and the answer are spliced together, finally giving the processed text information;
constructing two paths of semantic vectors for all the preprocessed text information items;
partitioning the knowledge vector library into a Q2Q partition and a Q2P partition, and storing the two paths of semantic vectors in the two partitions respectively.
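The splitting and splicing described above might look roughly like the sketch below. It rests on several assumptions that the text does not state: the short item is taken to be the title (or question) alone and is routed to Q2Q, each long item is the title spliced with one content (or answer) chunk and is routed to Q2P, chunks are cut at fixed character offsets rather than at sentence boundaries, and the 384-character limit is applied to the chunk before the title is prepended. The names `Passage`, `split_content` and `build_passages` are hypothetical.

```python
from dataclasses import dataclass

MAX_LEN = 384  # default maximum length from the description, counted in characters


@dataclass
class Passage:
    knowledge_id: str
    text: str
    partition: str  # "Q2Q" for short text, "Q2P" for long text (assumed routing)


def split_content(content: str, max_len: int = MAX_LEN) -> list[str]:
    """Cut content into consecutive chunks of at most max_len characters."""
    return [content[i:i + max_len] for i in range(0, len(content), max_len)]


def build_passages(knowledge_id: str, head: str, body: str,
                   max_len: int = MAX_LEN) -> list[Passage]:
    """head is a title or question; body is the content or answer."""
    # Short item: the title/question on its own.
    passages = [Passage(knowledge_id, head, "Q2Q")]
    # Long items: the title/question spliced with each chunk of the body.
    for chunk in split_content(body, max_len):
        passages.append(Passage(knowledge_id, head + chunk, "Q2P"))
    return passages
```

Every passage keeps the ID of the knowledge point it came from, which is what makes the later ID-based de-duplication possible.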
In this embodiment, the text information items include the title, content, question and answer information items; preprocessing includes filtering out spaces, special characters and meaningless characters, and removing HTML tags.
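A minimal sketch of such a cleaning step is shown below. It is an illustration only: the function name `clean_text`, the choice to treat control and zero-width characters as the "meaningless characters", and the `space` parameter (which lets whitespace be collapsed to a single space instead of removed, for languages that need word spacing) are assumptions introduced here.

```python
import html
import re

_TAG_RE = re.compile(r"<[^>]+>")                      # anything that looks like an HTML tag
_WS_RE = re.compile(r"\s+")                           # runs of whitespace, including \xa0
_JUNK_RE = re.compile(r"[\x00-\x1f\x7f\u200b\ufeff]")  # control and zero-width characters


def clean_text(raw: str, space: str = "") -> str:
    """Remove HTML tags, whitespace and control characters from raw text.

    space is the replacement for each whitespace run: "" removes it
    entirely, " " collapses it to a single space.
    """
    text = _TAG_RE.sub(" ", raw)     # drop tags, leaving a gap between neighbours
    text = html.unescape(text)       # decode entities such as &nbsp; and &amp;
    text = _WS_RE.sub(space, text)   # filter or collapse whitespace
    text = _JUNK_RE.sub("", text)    # drop remaining control / zero-width characters
    return text.strip()
```

For example, `clean_text("<b>Hello</b>   world", space=" ")` yields `"Hello world"`.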
In this embodiment, the two paths of semantic vectors are constructed as follows:
after preprocessing, several groups of long texts and several groups of short texts are obtained; the long texts are input into a language model encoder, which outputs the corresponding knowledge semantic vectors, and these are stored in the Q2P partition of the Milvus vector library;
the short texts are input into a language model encoder, which outputs the corresponding knowledge semantic vectors, and these are stored in the Q2Q partition of the Milvus vector library, completing the vectorization of the knowledge semantics.
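The two-partition layout can be illustrated with a small in-memory stand-in. This sketch does not use Milvus or a real language model: `PartitionedStore` is a hypothetical class that only mimics the idea of named partitions, and `toy_encode` is a deliberately crude hashed-bigram encoder that takes the place of the language model encoder so the example runs without external dependencies.

```python
import hashlib
import math
from collections import defaultdict

DIM = 64  # toy dimension chosen for the example


def toy_encode(text: str, dim: int = DIM) -> list[float]:
    """Stand-in encoder: hashed character bigrams, L2-normalised."""
    vec = [0.0] * dim
    for i in range(len(text) - 1):
        digest = hashlib.md5(text[i:i + 2].encode("utf-8")).digest()
        vec[int.from_bytes(digest[:4], "big") % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0  # avoid dividing by zero
    return [v / norm for v in vec]


class PartitionedStore:
    """In-memory stand-in for a vector library with Q2Q and Q2P partitions."""

    def __init__(self) -> None:
        # partition name -> list of (knowledge_id, vector)
        self.partitions = defaultdict(list)

    def insert(self, partition: str, knowledge_id: str, text: str) -> None:
        if partition not in ("Q2Q", "Q2P"):
            raise ValueError(f"unknown partition: {partition}")
        self.partitions[partition].append((knowledge_id, toy_encode(text)))
```

Short texts would be inserted with `partition="Q2Q"` and long texts with `partition="Q2P"`, so that a query can later be matched against either kind of item.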
In this embodiment, the candidate set is obtained using a DPR model, as follows:
the DPR model uses two independent BERT models as encoders: a Query Encoder and a Document Encoder;
first, the Document Encoder encodes each text paragraph into a d-dimensional dense vector, and an index is built over the vectorized paragraphs for later recall;
in the inference stage, the Query Encoder encodes the input question into a d-dimensional vector, and the top-K paragraph vectors closest to the question vector are retrieved from the Milvus vector library.
In this embodiment, the similarity between the question vector $E_Q(q)$ and the paragraph vector $E_P(p)$ is measured by their dot product:
$$\mathrm{sim}(q, p) = E_Q(q)^{\top} E_P(p)$$
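As an illustration of dot-product recall, the sketch below scores every indexed vector against the query vector and keeps the K highest. It is a brute-force scan written for clarity; a vector database would normally answer the same question with an approximate index. The names `dot` and `top_k` are hypothetical.

```python
import heapq


def dot(u: list[float], v: list[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def top_k(query_vec: list[float],
          indexed: list[tuple[str, list[float]]],
          k: int = 3) -> list[tuple[float, str]]:
    """Return the k (score, knowledge_id) pairs with the largest dot product."""
    scored = ((dot(query_vec, vec), kid) for kid, vec in indexed)
    return heapq.nlargest(k, scored, key=lambda pair: pair[0])
```

Because long texts are indexed as several paragraphs, the same `knowledge_id` can appear more than once in the result.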
In this embodiment, the ranking algorithm takes the top-K recall results screened by the recall algorithm and performs a pairwise matching calculation between each recalled knowledge point and the Query, specifically:
splicing the question vector and the paragraph vector with the separator <SEP>, and inputting the result into a cross-encoder (CrossEncoder) model; preferably, rocketqa-zh-dureader-cross-encoder is used;
calculating, one by one, the similarity score between the Query and each Document in the recall result, and obtaining a knowledge list ordered by similarity score from high to low as the ranking result.
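The control flow of this ranking step can be sketched as follows. The scoring model is passed in as a function, because loading a real cross-encoder is outside the scope of the example; `toy_score` is a hypothetical placeholder that measures character overlap and is not a cross-encoder. The sketch joins the query and passage as text, which is an assumption about how the spliced input is formed.

```python
from typing import Callable

SEP = "<SEP>"


def rerank(query: str,
           candidates: list[tuple[str, str]],
           score_fn: Callable[[str], float]) -> list[tuple[str, float]]:
    """Score each recalled (knowledge_id, passage) against the query.

    The query and passage are joined with SEP and handed to score_fn one
    pair at a time; results are sorted by score, highest first.
    """
    scored = [(kid, score_fn(f"{query}{SEP}{passage}"))
              for kid, passage in candidates]
    return sorted(scored, key=lambda item: item[1], reverse=True)


def toy_score(joined: str) -> float:
    """Placeholder scorer: share of query characters found in the passage."""
    query, passage = joined.split(SEP, 1)
    chars = set(query)
    return len(chars & set(passage)) / len(chars) if chars else 0.0
```

Scoring pairs one by one is slower than the vector recall stage, which is why it is applied only to the top-K candidates.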
Because long text is split into multiple paragraphs during construction, several paragraphs from the same knowledge point may appear in the ranking result, so duplicates must also be filtered based on the knowledge point ID; in practice, only the highest-ranked occurrence of each ID is kept and its repeats are discarded.
Finally, the knowledge recommendation list is returned according to the similarity threshold or the TopK parameter.
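De-duplication followed by the final cut could be written as in this sketch. It assumes the input is already sorted by score in descending order; the function name `dedupe_and_cut` and the parameter names `top_k` and `min_score` are hypothetical.

```python
from typing import Optional


def dedupe_and_cut(ranked: list[tuple[str, float]],
                   top_k: Optional[int] = None,
                   min_score: Optional[float] = None) -> list[str]:
    """Keep the first (highest-ranked) occurrence of each knowledge ID.

    ranked must be sorted by score, highest first. The result is cut
    either when top_k IDs have been collected or when the score drops
    below min_score, whichever comes first.
    """
    seen: set[str] = set()
    result: list[str] = []
    for kid, score in ranked:
        if top_k is not None and len(result) >= top_k:
            break
        if min_score is not None and score < min_score:
            break  # everything after this point scores lower still
        if kid in seen:
            continue  # a lower-ranked paragraph of an ID already kept
        seen.add(kid)
        result.append(kid)
    return result
```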
In this embodiment, referring to FIG. 1, a knowledge search system based on long and short text semantic analysis retrieval is also provided, comprising a client, a request processing module, a recommendation module and a knowledge vector library. The client is connected to the request processing module and the recommendation module in a distributed manner through an API; the recommendation module is connected to the request processing module and to the knowledge vector library. The user enters the requested knowledge search text through the client, which passes it to the request processing module through the API; the request processing module converts the requested knowledge search text into a semantic vector; and the recommendation module screens relevant knowledge points from the knowledge vector library, using filter conditions and the semantic vector of the knowledge search text, to form a candidate set, and returns a list of recommended knowledge point IDs after ranking and de-duplication.
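One way to picture how the modules hand data to each other is the sketch below. All class and field names (`SearchRequest`, `RequestProcessor`, `Recommender`, the `filters` dictionary and its keys) are hypothetical, the filter conditions are assumed to be exact matches on record metadata, and the cross-encoder ranking stage is omitted so that the example stays short; ranking here is by dot product only.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class SearchRequest:
    text: str
    filters: dict = field(default_factory=dict)  # e.g. {"type": "faq"}, assumed shape
    top_k: int = 5


class RequestProcessor:
    """Converts the requested search text into a semantic vector."""

    def __init__(self, encode: Callable[[str], list[float]]) -> None:
        self.encode = encode

    def to_vector(self, request: SearchRequest) -> list[float]:
        return self.encode(request.text)


class Recommender:
    """Filters, ranks and de-duplicates records from the vector library."""

    def __init__(self, records: list[dict]) -> None:
        self.records = records  # each record: {"id", "vector", "meta"}

    def recommend(self, request: SearchRequest, query_vec: list[float]) -> list[str]:
        # Apply the filter conditions first, then rank what is left.
        pool = [r for r in self.records
                if all(r["meta"].get(k) == v for k, v in request.filters.items())]
        pool.sort(key=lambda r: sum(a * b for a, b in zip(query_vec, r["vector"])),
                  reverse=True)
        ids: list[str] = []
        for record in pool:
            if record["id"] not in ids:  # keep the highest-ranked occurrence
                ids.append(record["id"])
        return ids[:request.top_k]
```

A client would build a `SearchRequest`, pass it to `RequestProcessor.to_vector`, and hand the request and vector to `Recommender.recommend` to obtain the ID list.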
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above is only a preferred embodiment of the present application and does not limit the application in any way; any person skilled in the art may modify or alter the disclosed technical content to obtain equivalent embodiments. However, any simple modification, equivalent change or variation made to the above embodiments according to the technical substance of the present application still falls within the protection scope of the technical solution of the present application.

Claims (10)

10. A knowledge search system based on long and short text semantic analysis retrieval, characterized by comprising a client, a request processing module, a recommendation module and a knowledge vector library; the client is connected to the request processing module and the recommendation module in a distributed manner through an API; the recommendation module is connected to the request processing module and to the knowledge vector library; the user enters the requested knowledge search text through the client, which passes it to the request processing module through the API; the request processing module converts the requested knowledge search text into a semantic vector; and the recommendation module screens relevant knowledge points from the knowledge vector library, using filter conditions and the semantic vector of the knowledge search text, to form a candidate set, and returns a list of recommended knowledge point IDs after ranking and de-duplication.
CN202311178590.2A, priority date 2023-09-13, filing date 2023-09-13: Knowledge search method and system based on long and short text semantic analysis retrieval (Pending), published as CN116932729A (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202311178590.2A, CN116932729A (en) | 2023-09-13 | 2023-09-13 | Knowledge search method and system based on long and short text semantic analysis retrieval

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202311178590.2A, CN116932729A (en) | 2023-09-13 | 2023-09-13 | Knowledge search method and system based on long and short text semantic analysis retrieval

Publications (1)

Publication Number | Publication Date
CN116932729A | 2023-10-24

Family

ID=88382841

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202311178590.2A (Pending), CN116932729A (en) | Knowledge search method and system based on long and short text semantic analysis retrieval | 2023-09-13 | 2023-09-13

Country Status (1)

Country | Link
CN (1) | CN116932729A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN120179890A (en) * | 2025-05-20 | 2025-06-20 | 北京巨量动能科技有限公司 | Text search method and system based on vector retrieval and large model optimization

Citations (3)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN114003707A (en) * | 2021-07-13 | 2022-02-01 | 北京金山数字娱乐科技有限公司 | Training method and device for question retrieval model, question retrieval method and device
CN114880452A (en) * | 2022-05-25 | 2022-08-09 | 重庆大学 | A Text Retrieval Method Based on Multi-view Contrastive Learning
US20220374459A1 (en) * | 2021-05-17 | 2022-11-24 | Salesforce.Com, Inc. | Systems and methods for hierarchical retrieval of semantic-based passages in deep learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
YINGQI QU et al.: "RocketQA: An Optimized Training Approach to Dense Passage Retrieval for Open-Domain Question Answering", arXiv, pages 1-9 *
神洛华: "PaddleNLP系列课程二:RocketQA、SKEP(属性级情感分析)、通用信息抽取技术UIE", Retrieved from the Internet <URL:https://blog.csdn.net/qq_56591814/article/details/128246965> *

Legal Events

Code | Title
PB01 | Publication
SE01 | Entry into force of request for substantive examination
RJ01 | Rejection of invention patent application after publication

Application publication date: 2023-10-24

