Disclosure of Invention
Aiming at the defects of the prior art, the application provides a knowledge construction method and system based on a large model and RAG technology.
In a first aspect, the application provides a knowledge construction method based on a large model and RAG technology, which comprises knowledge cleaning and knowledge construction stages:
The knowledge cleaning stage comprises the following steps:
obtaining text data from various sources, cleaning, converting formats, verifying the text data and storing the text data into a text database;
Setting parameters and splitting a large text in a text database into text data blocks with semantic meanings according to the priority of separators;
selecting an embedding model to convert the text data block into a text vector, carrying out normalization processing, and storing the text vector into a vector database;
Establishing a vector index, adopting the combination of similarity retrieval and full-text retrieval, and optimizing a retrieval strategy through mixed search;
The knowledge construction stage comprises:
extracting local knowledge by using a large model, and extracting entities, attributes and relations from the text data block;
constructing related entity pairs across the text data blocks by adopting a mutual information method, and carrying out global knowledge extraction by combining RAG technology and a retrieval strategy to generate entity relations;
judging whether two entities point to the same physical object through a multidimensional coreference resolution method, merging identical entities to form a knowledge base, and storing the constructed knowledge.
In some embodiments, the step of performing a cleaning operation, a format conversion operation, and a verification operation on the text data and then storing the text data in a text database includes:
the cleaning operation is to remove HTML tags, emoticons, garbled characters and redundant spaces, while retaining technical terms, numbers and date information;
The format conversion operation is that the cleaned text is uniformly coded into UTF-8 or GBK format and converted into plain text or rich text format;
And the verification operation is to verify the data after the cleaning and conversion are finished; the verification content includes text length, key information retention and encoding correctness, and if any abnormality is found, the cleaning operation and the format conversion operation are performed again.
In some embodiments, the setting parameters and splitting the large text in the text database into text data blocks with semantic meaning according to the separator priority comprises:
Setting a maximum block length threshold value, namely a chunk_size and an overlap reservation length, namely an overlap_size, and creating temporary storage blocks and a final set for storing the segmented text data blocks;
Reading each character in sequence, adding the character to a temporary storage block, checking in real time whether the length of the temporary storage block exceeds chunk_size, and processing according to the separator priority, wherein the separator priority is: line feed character > sentence-ending punctuation > semicolon/comma;
When encountering a line feed character, immediately adding the current temporary storage block to the final set, emptying the current temporary storage block, and retaining the last overlap_size characters as the starting content of the new temporary storage block;
when the length of the temporary storage block is greater than or equal to chunk_size, searching forward for the nearest sentence-ending punctuation; if found, splitting after the punctuation, adding the front segment to the final set, retaining the rear segment as the new temporary storage block, and reserving overlap_size characters as overlap;
when the length of the temporary storage block is greater than or equal to 1.2 × chunk_size, searching forward for the nearest semicolon/comma; if found, splitting after the punctuation, adding the front segment to the final set, retaining the rear segment as the new temporary storage block, and reserving overlap_size characters as overlap;
when the length of the temporary storage block is greater than or equal to 1.5 × chunk_size and no separator exists, forcibly splitting the temporary storage block at chunk_size, adding the front segment to the final set, and retaining the rear segment as the new temporary storage block;
after the traversal is finished, if the current temporary storage block is not empty, it is added directly to the final set, and if its length exceeds 2 × chunk_size, a warning log is recorded.
In some embodiments, the selecting of an embedding model to convert the text data block into a text vector, normalize the text vector, and store the text vector in a vector database includes:
Taking the segmented text data blocks as input, each text data block is represented as $T=\{w_1, w_2, \dots, w_n\}$,
wherein $w_i$ represents the $i$-th word or character;
for each word $w_i$, generating a corresponding word vector $v_i$ through a word embedding model, the word embedding model mapping words to a low-dimensional continuous vector space so that semantically similar words are close in distance in the vector space;
for the entire text data block $T$, generating a text vector $V_T$ by means of text vectorization;
normalizing the generated text vector $V_T$ and storing the normalized vector in a vector database.
In some embodiments, the establishing the vector index, using a combination of similarity search and full text search and optimizing the search strategy by hybrid search, includes:
The vectorized data are stored in a vector database, and indexes are built for quick retrieval; the vector database supports similarity retrieval, full-text retrieval or a hybrid search strategy for knowledge retrieval, and the index building process is as follows:
Vector storage, namely storing the vector of each text data block into a database, and adding metadata for the vector, wherein the metadata comprises text content, a source and a time stamp;
Index construction, namely selecting a corresponding index structure according to the characteristics of a vector database;
The similarity retrieval is that the record with the highest score is returned by calculating the similarity score of the query vector and all vectors in the database;
establishing an inverted index through keywords, and searching the full text through the keywords during searching to find out corresponding records;
defining a plurality of retrievers, respectively inquiring by using similarity retrieval and full text retrieval to obtain respective retrieval results, and reordering the retrieval results by using an RRF algorithm to obtain a final record.
In some embodiments, the extracting entities, attributes, and relationships from the text data blocks using the large model for local knowledge extraction includes:
Setting a prompt example in the prompt word, and guiding a large model to extract entities, attributes and relations from the text data block through the prompt example, wherein setting elements of the prompt example comprise role definition instructions, input and output specifications, term extraction rules and relation generation constraints.
In some embodiments, the constructing related entity pairs across the text data blocks by using a mutual information method and performing global knowledge extraction in combination with a RAG technology and a retrieval policy to generate entity relationships includes:
Calculating mutual information among entities in different text blocks, wherein the formula is as follows:
$I(X;Y)=\sum_{x\in X}\sum_{y\in Y} p(x,y)\log\frac{p(x,y)}{p(x)\,p(y)}$
wherein $X$ and $Y$ represent the sets of entities in the two text blocks, $p(x,y)$ represents the probability that entities $x$ and $y$ occur simultaneously, and $p(x)$ and $p(y)$ represent the probabilities that entities $x$ and $y$ occur individually; whether the entities in different text blocks are related is judged by calculating this value;
retrieving context knowledge related to the entity pairs from a vector database using a RAG system;
Based on the retrieved knowledge segments, the relation among the entities is generated through a pre-trained large language model by combining with the specific requirements of the task, and global knowledge extraction is completed.
In some embodiments, the determining whether two entities point to the same physical object through the multidimensional coreference resolution method, merging identical entities to form a knowledge base, and storing the constructed knowledge includes:
Text similarity calculation, namely calculating the text similarity of entities in different text blocks through cosine similarity or edit distance;
structural similarity calculation, namely analyzing the structural relationships of the entities in the knowledge graph and judging whether the entities have similar structures;
semantic similarity calculation, namely calculating the semantic similarity of the entities through the large model and judging whether the entities have the same semantics;
entity alignment and merging, namely judging whether the two entities point to the same physical object according to the text, structural and semantic similarity, and merging them if so, thereby obtaining an entity alignment result.
In a second aspect, the application provides a knowledge construction system based on a large model and RAG technology, which comprises a data cleaning module, a text segmentation module, a vector conversion module, a vector indexing module, a local knowledge extraction module, a global knowledge extraction module and a coreference resolution module;
The data cleaning module is used for acquiring text data from various sources, performing cleaning, format conversion and verification on the text data, and storing the text data in a text database;
the text segmentation module is used for setting parameters and splitting a large text in the text database into text data blocks with semantic meaning according to the separator priority;
The vector conversion module is used for selecting the embedding model to convert the text data block into a text vector, carrying out normalization processing and storing the text vector into the vector database;
The vector index module is used for establishing a vector index, adopting the combination of similarity retrieval and full text retrieval and optimizing a retrieval strategy through mixed search;
the local knowledge extraction module is used for extracting local knowledge by using a large model and extracting entities, attributes and relations from the text data block;
The global knowledge extraction module is used for constructing related entity pairs across the text data blocks by adopting a mutual information method, and carrying out global knowledge extraction by combining RAG technology and a retrieval strategy to generate entity relations;
The coreference resolution module is used for judging whether two entities point to the same physical object through a multidimensional coreference resolution method, merging identical entities to form a knowledge base, and storing the constructed knowledge.
In a third aspect the application proposes an electronic device comprising a memory, a processor and a computer program stored in said memory and executable on said processor, said processor implementing the steps of the method as described above when said computer program is executed.
In a fourth aspect the application proposes a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of a method as described above.
The invention has the beneficial effects that:
Through cross-text-block mutual information calculation and RAG-enhanced reasoning, the system can quantify the statistical correlation among entities and, in combination with the context information of an external knowledge base, mine potential association relations across paragraphs or documents. A dynamic semantic segmentation strategy ensures that the internal semantics of each text block remain consistent, and the reserved overlapping parts maintain context continuity. A multidimensional coreference resolution mechanism comprehensively judges multiple expressions of the same entity and reduces the erroneous resolution rate. The combination of dynamic segmentation and hybrid retrieval alleviates the input length limitation of large models, and the complementation of semantic retrieval and keyword matching remarkably alleviates key problems such as semantic fragmentation, missing cross-text associations and low entity alignment precision, thereby improving recall.
Detailed Description
Exemplary embodiments of the present invention will be described in more detail below with reference to the accompanying drawings. While the exemplary embodiments of the present invention have been illustrated in the drawings, it should be understood that the present invention may be embodied in various forms and should not be limited to the embodiments set forth herein, but rather, these embodiments are provided so that the present invention will be more thoroughly understood and will fully convey the scope of the invention to those skilled in the art.
In a first aspect, the present application proposes a knowledge construction method based on a large model and a RAG technology, as shown in fig. 1, including knowledge cleaning and knowledge construction stages:
The knowledge cleaning stage comprises the following steps:
S100, acquiring text data from various sources, cleaning, converting formats, verifying the text data and storing the text data into a text database;
In some embodiments, the step of performing a cleaning operation, a format conversion operation, and a verification operation on the text data and then storing the text data in a text database includes:
the cleaning operation is to remove HTML tags, emoticons, garbled characters and redundant spaces, while retaining technical terms, numbers and date information;
The format conversion operation is that the cleaned text is uniformly coded into UTF-8 or GBK format and converted into plain text or rich text format;
And the verification operation is to verify the data after the cleaning and conversion are finished; the verification content includes text length, key information retention and encoding correctness, and if any abnormality is found, the cleaning operation and the format conversion operation are performed again.
Data preprocessing is the first step of knowledge construction, and aims to acquire required text data from various sources (such as webpages, emails, social media posts and the like captured by web crawlers), and clean and standardize the data to ensure the integrity and accuracy of the data. The method comprises the following specific steps:
Text data is obtained from a variety of sources including, but not limited to, web pages, emails, social media posts, news stories, contract text, and the like. These data may contain different formats and encodings and therefore require uniform processing;
Converting the cleaned text into a unified format and coding. Common text formats include plain text (.txt), rich text (.rtf), etc., and coding modes can be selected from UTF-8, GBK, etc. Proper text codes are selected, so that the problems of messy codes and incompatibility are reduced, and subsequent data analysis and mining are facilitated;
After the cleaning and conversion are completed, the data are verified to ensure their integrity and accuracy. The verification content includes text length, whether key information is retained, whether the encoding is correct, etc. If any abnormality is found, the cleaning and conversion need to be carried out again;
The obtained text is cleaned, and special symbols (such as HTML tags, emoticons and garbled characters), redundant spaces, line breaks and the like are removed so that the text is more standardized. Key information in the text, such as technical terms, numbers and dates, must be retained during cleaning so that this content is not deleted or split by mistake;
And storing the processed text data into a database, so that the subsequent knowledge extraction and knowledge graph construction are facilitated. Common databases include relational databases (e.g., mySQL, postgreSQL) and non-relational databases (e.g., mongoDB, elasticsearch).
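A minimal sketch of these preprocessing steps is given below for illustration; the function names, regular expressions and the UTF-8 target encoding are assumptions rather than part of the claimed method:

import re

def clean_text(raw: str) -> str:
    """Remove HTML tags, emoticons and redundant whitespace while keeping
    technical terms, numbers and dates untouched."""
    text = re.sub(r"<[^>]+>", " ", raw)                  # strip HTML tags
    text = re.sub("[\U0001F300-\U0001FAFF]", "", text)   # strip a common emoji range
    text = re.sub(r"[ \t]+", " ", text)                  # collapse redundant spaces
    return text.strip()

def convert_format(text: str, encoding: str = "utf-8") -> bytes:
    """Unify the encoding of the cleaned text (UTF-8 here; GBK is also possible)."""
    return text.encode(encoding, errors="replace")

def verify(original: str, processed: bytes, min_length: int = 10) -> bool:
    """Check text length, key-information (date) retention and encoding correctness."""
    decoded = processed.decode("utf-8")
    length_ok = len(decoded) >= min_length
    dates_kept = set(re.findall(r"\d{4}-\d{2}-\d{2}", original)) <= set(re.findall(r"\d{4}-\d{2}-\d{2}", decoded))
    return length_ok and dates_kept

raw = "<p>Contract signed on 2024-05-01   with Company X</p>"
cleaned = clean_text(raw)
encoded = convert_format(cleaned)
if not verify(raw, encoded):   # on abnormality, re-run cleaning and conversion
    encoded = convert_format(clean_text(raw))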
S200, setting parameters and splitting a large text in a text database into text data blocks with semantic meaning according to the separator priority;
in some embodiments, the setting parameters and splitting the large text in the text database into text data blocks with semantic meaning according to the separator priority comprises:
Setting a maximum block length threshold value, namely a chunk_size and an overlap reservation length, namely an overlap_size, and creating temporary storage blocks and a final set for storing the segmented text data blocks;
Reading each character in sequence, adding the character to a temporary storage block, checking in real time whether the length of the temporary storage block exceeds chunk_size, and processing according to the separator priority, wherein the separator priority is: line feed character > sentence-ending punctuation > semicolon/comma;
When encountering a line feed character, immediately adding the current temporary storage block to the final set, emptying the current temporary storage block, and retaining the last overlap_size characters as the starting content of the new temporary storage block;
when the length of the temporary storage block is greater than or equal to chunk_size, searching forward for the nearest sentence-ending punctuation; if found, splitting after the punctuation, adding the front segment to the final set, retaining the rear segment as the new temporary storage block, and reserving overlap_size characters as overlap;
when the length of the temporary storage block is greater than or equal to 1.2 × chunk_size, searching forward for the nearest semicolon/comma; if found, splitting after the punctuation, adding the front segment to the final set, retaining the rear segment as the new temporary storage block, and reserving overlap_size characters as overlap;
when the length of the temporary storage block is greater than or equal to 1.5 × chunk_size and no separator exists, forcibly splitting the temporary storage block at chunk_size, adding the front segment to the final set, and retaining the rear segment as the new temporary storage block;
after the traversal is finished, if the current temporary storage block is not empty, it is added directly to the final set, and if its length exceeds 2 × chunk_size, a warning log is recorded.
Text segmentation is the process of splitting large pieces of text into smaller, semantically meaningful blocks for subsequent vectorization and knowledge extraction. The method comprises the following specific steps:
The chunk_size (maximum block length threshold) and the overlap_size (overlap reserve length, 10-20% of chunk_size is recommended) are set. Creating a temporary storage block and a final set for storing the segmented text blocks;
each character is read in sequence, the character is added into the temporary storage block, and whether the length of the temporary storage block exceeds the length of the chunk_size is checked in real time. Processing according to the delimiter priority;
When a line feed character is encountered, the temporary storage block (including the line feed character) is immediately added to the final set, the temporary storage block is emptied, and the last overlap_size characters are retained as the starting content of the new temporary storage block;
when the length of the temporary storage block is greater than or equal to chunk_size, the nearest sentence-ending punctuation is searched for forward; if found, the block is split after the punctuation, the front segment is added to the final set, the rear segment is retained as the new temporary storage block, and overlap_size characters are reserved as overlap;
When the length of the temporary storage block is greater than or equal to 1.2 × chunk_size (moderate over-length is allowed), the nearest semicolon/comma is searched for forward; if found, the block is split after the punctuation, the front segment is added to the final set, the rear segment is retained as the new temporary storage block, and overlap_size characters are reserved as overlap;
When the length of the temporary storage block is greater than or equal to 1.5 × chunk_size and no separator exists, the block is forcibly split at chunk_size, the front segment is added to the final set, and the rear segment is retained as the new temporary storage block (no overlap is reserved);
and after the traversal is finished, if the temporary storage block is not empty, it is added directly to the final set. If its length exceeds 2 × chunk_size, a warning log is recorded;
For paired symbols such as quotation marks and brackets, integrity must be maintained; indivisible content such as technical terms, numbers and dates must be kept intact as a whole; and when Chinese and Western text are mixed, the length is uniformly calculated in terms of Chinese characters.
Assume that there is a text segment containing business contract terms, 5000 characters in length. First, the chunk_size is set to 1000 characters, and the overlap_size is set to 150 characters. Then, traversing the text character by character, sequentially reading each character, adding the characters into the temporary storage block, and checking whether the length of the temporary storage block exceeds 1000 characters in real time. When a line feed is encountered, a scratch block (containing the line feed) is added immediately to the final set, the scratch block is emptied, and the last 150 characters are reserved as the starting content of the new scratch block. When the length of the temporary storage block is more than or equal to 1000 characters, searching the nearest punctuation of the sentence forward, if the nearest punctuation is found, splitting after punctuation, adding a final set into the front section, reserving the rear section as a new temporary storage block, and reserving 150 characters as overlapping. If the end punctuation of the sentence is not found, checking the semicolon/comma, when the length of the temporary storage block is more than or equal to 1200 characters, searching the nearest semicolon/comma forwards, if the end punctuation is found, splitting after punctuation, adding the final set into the front section, reserving the rear section as a new temporary storage block, and reserving 150 characters as overlapping. If the temporary storage block length is more than or equal to 1500 characters and no separator exists, the temporary storage block is forcedly split at 1000 characters, the front section is added into the final set, and the rear section is reserved as a new temporary storage block (without overlapping). And after the traversal is finished, if the temporary storage block is not empty, directly adding the final set. If the length exceeds 2000 characters, a warning log is recorded. Through the steps, the 5000-character contract clause text is successfully divided into a plurality of blocks with semantic meaning, so that subsequent vectorization and knowledge extraction are facilitated.
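A minimal sketch of the splitting procedure described above is given below; it is a simplified illustration of the separator-priority rules, and names such as split_text are illustrative:

import logging

def split_text(text: str, chunk_size: int = 1000, overlap_size: int = 150) -> list[str]:
    """Split text into semantically meaningful blocks by separator priority:
    line feed > sentence-ending punctuation > semicolon/comma, keeping overlap."""
    chunks, buf = [], ""                 # final set and temporary storage block
    enders, minors = "。.!?！？", "；;，,"

    def flush(cut: int, keep_overlap: bool = True) -> None:
        nonlocal buf
        chunks.append(buf[:cut])
        overlap = buf[max(0, cut - overlap_size):cut] if keep_overlap else ""
        buf = overlap + buf[cut:]        # rear segment becomes the new block

    for ch in text:
        buf += ch
        if ch == "\n":                                      # highest priority: line feed
            flush(len(buf))
        elif len(buf) >= chunk_size and any(p in buf for p in enders):
            flush(max(buf.rfind(p) for p in enders) + 1)    # split after sentence end
        elif len(buf) >= 1.2 * chunk_size and any(p in buf for p in minors):
            flush(max(buf.rfind(p) for p in minors) + 1)    # fall back to semicolon/comma
        elif len(buf) >= 1.5 * chunk_size:
            flush(chunk_size, keep_overlap=False)           # forced split, no overlap

    if buf:                                                 # remainder after traversal
        if len(buf) > 2 * chunk_size:
            logging.warning("oversized final chunk: %d characters", len(buf))
        chunks.append(buf)
    return chunks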
S300, selecting an embedding model to convert a text data block into a text vector, carrying out normalization processing, and storing the text vector into a vector database;
in some embodiments, the selecting of an embedding model to convert the text data block into a text vector, normalize the text vector, and store the text vector in a vector database includes:
Taking the segmented text data blocks as input, each text data block is represented as $T=\{w_1, w_2, \dots, w_n\}$,
wherein $w_i$ represents the $i$-th word or character;
for each word $w_i$, generating a corresponding word vector $v_i$ through a word embedding model, the word embedding model mapping words to a low-dimensional continuous vector space so that semantically similar words are close in distance in the vector space;
for the entire text data block $T$, generating a text vector $V_T$ by means of text vectorization;
normalizing the generated text vector $V_T$ and storing the normalized vector in a vector database.
A suitable embedding model is selected according to the specific application scenario. Models with excellent performance, such as the bge-large or E5 series, may be selected with reference to the MTEB (Massive Text Embedding Benchmark) leaderboard. For a particular domain, an open-source model may be selected for fine-tuning or training from scratch to accommodate domain-specific semantic features.
Text preprocessing, namely, performing further preprocessing on the text before vectorization, wherein the preprocessing comprises the steps of removing stop words, extracting word stems, restoring word shapes and the like. For Chinese text, word segmentation is also required. The aim of the pretreatment is to reduce noise and improve vectorization accuracy.
Text embedding, namely inputting the preprocessed text into the embedding model to generate corresponding vector representations. The specific process is as follows:
Text input: taking the segmented text blocks as input, each text block can be expressed as $T=\{w_1, w_2, \dots, w_n\}$, where $w_i$ represents the $i$-th word or character.
Word embedding: for each word $w_i$, a corresponding word vector $v_i$ is generated through a word embedding model (such as Word2Vec, GloVe, BERT and the like). The word embedding model maps words to a low-dimensional continuous vector space such that semantically similar words are closely spaced in the vector space.
Text vectorization: for an entire text block $T$, its vector representation $V_T$ may be generated in a variety of ways. Common methods include:
Average pooling, namely averaging the word vectors of all words in the text to obtain the text vector $V_T=\frac{1}{n}\sum_{i=1}^{n} v_i$.
Max pooling, namely taking the maximum value of each dimension to obtain the text vector $V_T=\max(v_1, v_2, \dots, v_n)$.
Attention-based pooling, namely calculating the weight of each word by an attention mechanism and then taking the weighted sum to obtain the text vector $V_T=\sum_{i=1}^{n}\alpha_i v_i$, where $\alpha_i$ represents the attention weight of the $i$-th word.
Vector normalization, namely normalizing the generated text vector so that the vectors are distributed on a unit sphere, which facilitates subsequent similarity calculation. The normalization formula is $\hat{V}_T = V_T / \|V_T\|_2$, where $\|V_T\|_2$ represents the Euclidean norm of the vector $V_T$.
The generated text vectors are stored in a vector database to facilitate subsequent retrieval and comparison. Common vector databases include FAISS, ChromaDB, Elasticsearch, Milvus, etc. When stored, metadata such as the content, source and timestamp of the text block may be added to each vector for subsequent querying and analysis.
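A minimal sketch of vectorization, normalization and storage is given below; it assumes the sentence-transformers and FAISS libraries, and the model name, sample texts and metadata fields are illustrative:

import faiss
from sentence_transformers import SentenceTransformer

chunks = ["Party A shall deliver the goods before 2024-05-01.",
          "Party B shall pay within 30 days of delivery."]

# Embed each text block; normalization places the vectors on the unit sphere
model = SentenceTransformer("BAAI/bge-large-zh-v1.5")
vectors = model.encode(chunks, normalize_embeddings=True).astype("float32")

# Store the vectors; with normalized vectors, inner product equals cosine similarity
index = faiss.IndexFlatIP(vectors.shape[1])
index.add(vectors)

# Keep metadata (content, source, timestamp) aligned with the vector ids
metadata = [{"text": c, "source": "contract.txt", "timestamp": "2024-05-01"} for c in chunks]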
S400, establishing a vector index, adopting the combination of similarity retrieval and full-text retrieval, and optimizing a retrieval strategy through mixed search;
In some embodiments, the establishing the vector index, using a combination of similarity search and full text search and optimizing the search strategy by hybrid search, includes:
The vectorized data are stored in a vector database, and indexes are built for quick retrieval; the vector database supports similarity retrieval, full-text retrieval or a hybrid search strategy for knowledge retrieval, and the index building process is as follows:
Vector storage, namely storing the vector of each text data block into a database, and adding metadata for the vector, wherein the metadata comprises text content, a source and a time stamp;
Index construction, namely selecting a corresponding index structure according to the characteristics of a vector database;
The similarity retrieval is that the record with the highest score is returned by calculating the similarity score of the query vector and all vectors in the database;
establishing an inverted index through keywords, and searching the full text through the keywords during searching to find out corresponding records;
defining a plurality of retrievers, respectively inquiring by using similarity retrieval and full text retrieval to obtain respective retrieval results, and reordering the retrieval results by using an RRF algorithm to obtain a final record.
Vector retrieval is a key step in knowledge-graph construction, and aims to quickly find the vector most similar to the query vector from a vector database by an efficient retrieval method. The following is a detailed implementation step of vector retrieval:
Vector index establishment:
Indexing the vectorized data for quick retrieval. Common indexing methods include inverted indexing, HNSW, and the like. HNSW is a graph-based indexing method that can efficiently process high-dimensional vector data.
The index building formula is as follows:
$I=\mathrm{BuildIndex}(V, M)$
wherein $V$ represents the set of vectors, $M$ represents the indexing method, and $I$ represents the established vector index.
Similarity retrieval:
The similarity between the query vector and the vectors in the database is calculated using cosine similarity. Cosine similarity measures how similar two vectors are in direction; its value lies in the range [-1, 1], and the larger the value, the higher the similarity.
The cosine similarity formula is as follows:
$\cos(A, B)=\frac{A\cdot B}{\|A\|\,\|B\|}$
wherein $A$ and $B$ represent the two vectors, $A\cdot B$ represents the vector dot product, and $\|A\|$ and $\|B\|$ represent the lengths of the vectors.
In order to improve the retrieval efficiency, an approximate nearest neighbor search algorithm can be adopted, and the retrieval speed can be greatly improved on the premise of ensuring the retrieval precision.
Full text retrieval:
On the basis of vector retrieval, the recall rate is further improved by combining a full-text retrieval method. The full text search finds documents relevant to the query from the text data by means of keyword matching.
The full-text retrieval formula is as follows:
$\mathrm{Score}(D, Q)=\sum_{t\in Q}\mathrm{TFIDF}(t, D)$
wherein $D$ represents a document, $Q$ represents the query, $t$ represents a term in the query, and $\mathrm{TFIDF}(t, D)$ represents the TF-IDF value of term $t$ in document $D$.
Hybrid search:
The similarity retrieval and the full text retrieval are combined, and a mixed retrieval method is adopted, so that the retrieval precision and recall rate are improved. And obtaining a final search result by weighting and fusing the results of the two search methods in the mixed search.
The hybrid search formula is as follows:
$\mathrm{Score}_{\mathrm{hybrid}}=w_{1}\cdot\mathrm{Score}_{\mathrm{sim}}+w_{2}\cdot\mathrm{Score}_{\mathrm{full}}$
wherein $w_{1}$ represents the weight of similarity retrieval, $w_{2}$ represents the weight of full-text retrieval, $\mathrm{Score}_{\mathrm{sim}}$ represents the score of similarity retrieval, and $\mathrm{Score}_{\mathrm{full}}$ represents the score of full-text retrieval.
Search result ranking:
The search results are ranked so that the user can quickly find the most relevant results. The ranking method can comprehensively rank according to various factors such as search scores, time stamps, user preferences and the like.
The ranking formula is as follows:
$\mathrm{Rank}(D)=f\big(\mathrm{Score}(D),\ \mathrm{timestamp}(D),\ \mathrm{preference}(D)\big)$
wherein $\mathrm{Rank}(D)$ represents the final ranking score of document $D$, which can be adjusted according to factors such as the importance of the document and user preference.
Suppose there is a query vector $q$ and we need to retrieve the most similar vectors from the vector database. First, a vector index is established using the HNSW index method. Then, the cosine similarity between the query vector $q$ and each vector $v_i$ in the database is calculated to obtain a similarity score. Meanwhile, the full-text retrieval method is used, and the correlation between the query keywords and each document is calculated through TF-IDF to obtain a full-text retrieval score. Then, the results of the similarity retrieval and the full-text retrieval are weighted and fused to obtain a final retrieval score. Finally, the results are sorted according to the retrieval scores, and the most relevant retrieval results are output. Through this series of operations, the vectors most similar to the query can be efficiently retrieved from the vector database, and retrieval precision and recall are improved.
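A minimal sketch of fusing the two retrievers with the RRF algorithm mentioned in step S400 is given below; the document ids and the constant k=60 are illustrative:

from collections import defaultdict

def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: each document scores the sum of 1/(k + rank) over all retrievers."""
    scores: dict[str, float] = defaultdict(float)
    for ranked in rankings:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

similarity_hits = ["d3", "d1", "d7", "d2"]   # ids returned by vector similarity retrieval
fulltext_hits = ["d1", "d5", "d3", "d9"]     # ids returned by keyword (inverted index) retrieval
final_ranking = rrf_fuse([similarity_hits, fulltext_hits])   # e.g. ['d1', 'd3', 'd5', ...]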
The knowledge construction stage comprises:
s500, extracting local knowledge by using a large model, and extracting entities, attributes and relations from the text data block;
In some embodiments, the extracting entities, attributes, and relationships from the text data blocks using the large model for local knowledge extraction includes:
Setting a prompt example in the prompt word, and guiding a large model to extract entities, attributes and relations from the text data block through the prompt example, wherein setting elements of the prompt example comprise role definition instructions, input and output specifications, term extraction rules and relation generation constraints.
By setting some examples in the prompt, the large model can better understand which information is to be extracted, or can be guided to reason step by step during extraction, thereby improving the accuracy of the extraction result. A sample prompt is given below:
"you are a network diagram producer, extracting terms and their relationships from a given context. "
"Provide you with one context block (separated by \n), your task is to extract the ontology. "
"In the given context. These terms should represent key concepts depending on the context. N'
Step 1. Key terms mentioned therein are considered when traversing each sentence. N'
The term "\t" may include an object, entity, location, organization, person, \n'
"\T conditions, acronyms, documents, services, concepts, etc. N'
The "\t term should be atomized as much as possible. N/n \ "
Step 2, thinking about how these terms have a one-to-one relationship with other terms. N'
Terms mentioned in "\t and a sentence or a paragraph are generally interrelated. N'
The "\t term may relate to many other terms \n\n \'
And step 3, finding out the relation between each pair of related terms. N/n \ "
"Formatting the output as a json list, each element in the list contains a pair of terms. "
"And the relationship between them, as shown in \n"
"[\n"
"{\n"
' Node_1 ' concept extracted from the extracted ontology \n '
' Node_2 ' the related concepts extracted from the extracted ontology ", \n '
'Edge' is the relationship between the two concepts of node_1 and node_2 in one or two sentences. "\n'
"}, {...}\n"
"]"
S600, constructing related entity pairs across the text data blocks by adopting a mutual information method, and carrying out global knowledge extraction by combining RAG technology and a retrieval strategy to generate entity relations;
in some embodiments, the constructing related entity pairs across the text data blocks by using a mutual information method and performing global knowledge extraction in combination with a RAG technology and a retrieval policy to generate entity relationships includes:
Calculating mutual information among entities in different text blocks, wherein the formula is as follows:
$I(X;Y)=\sum_{x\in X}\sum_{y\in Y} p(x,y)\log\frac{p(x,y)}{p(x)\,p(y)}$
wherein $X$ and $Y$ represent the sets of entities in the two text blocks, $p(x,y)$ represents the probability that entities $x$ and $y$ occur simultaneously, and $p(x)$ and $p(y)$ represent the probabilities that entities $x$ and $y$ occur individually; whether the entities in different text blocks are related is judged by calculating this value;
retrieving context knowledge related to the entity pairs from a vector database using a RAG system;
Based on the retrieved knowledge segments, the relation among the entities is generated through a pre-trained large language model by combining with the specific requirements of the task, and global knowledge extraction is completed.
Mutual information is a measure from information theory used to quantify the degree of interdependence between two variables: the larger the mutual information, the stronger the correlation between the two variables. Consider an example of weather and whether to carry an umbrella. Suppose there are two random variables X and Y, where X represents the weather (0 for a sunny day, 1 for a rainy day) and Y represents whether an umbrella is carried (0 for not carried, 1 for carried). Suppose 10 days of data were observed, giving the following joint frequency distribution:
when x=0 (sunny), y=0 has 4 days and y=1 has 1 day;
When x=1 (rainy days), y=0 has 1 day, and y=1 has 4 days.
Thus, there was a total of 10 days. Then the joint probability distribution can be calculated as:
P(X=0,Y=0)=4/10=0.4;
P(X=0,Y=1)=1/10=0.1;
P(X=1,Y=0)=1/10=0.1;
P(X=1,Y=1)=4/10=0.4;
Next, edge probabilities are calculated:
P(X=0)=0.4+0.1=0.5;
P(X=1)=0.1+0.4=0.5;
P(Y=0)=0.4+0.1=0.5;
P(Y=1)=0.1+0.4=0.5;
Then each (x, y) term is calculated according to the mutual information formula, and the terms are added.
Now, calculate each term:
For X=0, Y=0:
P(x,y)=0.4, P(x)P(y)=0.5*0.5=0.25;
log2(0.4/0.25)=log2(1.6)≈0.678;
the term is 0.4*0.678≈0.2712.
For X=0, Y=1:
P(x,y)=0.1, P(x)P(y)=0.25;
0.1/0.25=0.4, log2(0.4)≈-1.3219;
the term is 0.1*(-1.3219)≈-0.1322.
For X=1, Y=0: likewise P(x,y)=0.1 and P(x)P(y)=0.25, so as above the term is -0.1322.
For X=1, Y=1: P(x,y)=0.4 and P(x)P(y)=0.25, so as in the case X=0, Y=0, the term is 0.2712.
Adding these terms: 0.2712 - 0.1322 - 0.1322 + 0.2712 = 0.5424 - 0.2644 ≈ 0.278.
The mutual information in the above example is therefore about 0.278 bits. This means there is a certain correlation between X and Y; in this example the joint probability is higher when X and Y agree, so the mutual information is positive.
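A minimal sketch of this mutual information calculation, reproducing the umbrella example above, is given below; the helper name mutual_information is illustrative:

import math

def mutual_information(joint_counts: dict[tuple[int, int], int]) -> float:
    """I(X;Y) = sum over (x, y) of p(x,y) * log2( p(x,y) / (p(x) * p(y)) )."""
    total = sum(joint_counts.values())
    p_xy = {xy: c / total for xy, c in joint_counts.items()}
    p_x, p_y = {}, {}
    for (x, y), p in p_xy.items():
        p_x[x] = p_x.get(x, 0.0) + p
        p_y[y] = p_y.get(y, 0.0) + p
    return sum(p * math.log2(p / (p_x[x] * p_y[y]))
               for (x, y), p in p_xy.items() if p > 0)

counts = {(0, 0): 4, (0, 1): 1, (1, 0): 1, (1, 1): 4}   # the 10 observed days
print(round(mutual_information(counts), 3))             # ≈ 0.278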
1. RAG-enhanced reasoning process, namely relation prediction combining retrieval and generation; the process is designed as follows:
step 1, searching across text block entities:
Entity pairs with high relevance are filtered by mutual information (even if they are not in the same text block); for example, the MI value of entity A and entity B is calculated, and if it exceeds the threshold, retrieval is triggered.
Step 2, context enhancement retrieval:
for pairs of entities with high MI values, all relevant text blocks (e.g., paragraphs, documents) containing these entities are retrieved from the knowledge base to form an enhanced context.
Step 3, prompt template design and generation constraints; the prompt template is:
Analyze the potential relationship between entities [entity A] and [entity B] based on the following context:
[ retrieved text Block 1]
[ Retrieved text Block 2]
Possible types of associations include investment, collaboration, competition, relatives, etc.
The output format is {"relation": "<type>", "evaluation": "<key sentence>"}
Generating a constraint:
Limiting LLM to only output JSON format, and selecting relationship type from predefined list;
Knowledge extraction is performed in combination with the RAG technology; an example is shown below:
Input:
Text block 1 (news A): "Company X announced its entry into the new energy vehicle field."
Text block 2 (financial report B): "The CEO of Company Y led the battery technology development project."
The flow is as follows:
1. The MI value of "Company X" and "Company Y" is calculated and is assumed to exceed the threshold.
2. The contexts containing both are retrieved, and it is found that "Company Y's battery technology" and "Company X's new energy vehicles" may be related in the supply chain.
3. The LLM generates the result {"relation": "supply chain cooperation", "evaluation": "Company X's new energy vehicles may depend on Company Y's battery technology"};
Through the steps, the relation among the entities can be extracted from the text blocks, and the more accurate and rich relation can be generated by combining the information in the external knowledge base, so that the content and depth of the knowledge graph are improved, and the global knowledge extraction is completed.
Mutual information is used to construct entity pairs across text blocks, and RAG technology is used to perform semantic reasoning on the entity pairs that meet the set threshold. Meanwhile, the large model has strong reasoning capability: by analyzing the relations among entities it can perform reasoning and inference, which helps to discover more association relations and implicit knowledge and enriches the content and depth of the knowledge graph. These advantages make the large model play an important role in the construction and application of the knowledge graph.
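A minimal sketch of this cross-text-block flow (MI filtering, context-enhanced retrieval, constrained generation) is given below; the retrieve and llm callables, the threshold and the prompt wording are illustrative assumptions:

import json

RELATION_PROMPT = (
    "Analyze the potential relationship between entities [{a}] and [{b}] "
    "based on the following context:\n{context}\n"
    "Possible relation types include investment, collaboration, competition, relatives, etc.\n"
    'Output format: {{"relation": "<type>", "evaluation": "<key sentence>"}}'
)

def global_extract(entity_a, entity_b, mi_value, retrieve, llm, mi_threshold=0.2):
    """Cross-block relation generation: MI filtering, retrieval of supporting context,
    then constrained JSON generation with a large language model."""
    if mi_value < mi_threshold:          # step 1: keep only strongly related entity pairs
        return None
    blocks = retrieve(f"{entity_a} {entity_b}", top_k=3)   # step 2: context-enhanced retrieval
    prompt = RELATION_PROMPT.format(a=entity_a, b=entity_b, context="\n".join(blocks))
    return json.loads(llm(prompt))       # step 3: relation generated as constrained JSON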
And S700, judging whether the two entities point to the same physical object through a multidimensional coreference resolution method, merging identical entities to form a knowledge base, and storing the constructed knowledge.
In some embodiments, the determining whether two entities point to the same physical object through the multidimensional coreference resolution method, merging identical entities to form a knowledge base, and storing the constructed knowledge includes:
Text similarity calculation, namely calculating the text similarity of entities in different text blocks through cosine similarity or edit distance;
structural similarity calculation, namely analyzing the structural relationships of the entities in the knowledge graph and judging whether the entities have similar structures;
semantic similarity calculation, namely calculating the semantic similarity of the entities through the large model and judging whether the entities have the same semantics;
entity alignment and merging, namely judging whether the two entities point to the same physical object according to the text, structural and semantic similarity, and merging them if so, thereby obtaining an entity alignment result.
Text similarity calculation:
Text similarity is used to measure how similar two entities are in text expression. A common calculation method includes an edit distance that measures similarity by calculating the minimum number of edit operations required to convert one string to another. The smaller the edit distance, the more similar the text expressions representing the two entities.
The edit distance formula is as follows:
$d(t_1, t_2)=n_{\mathrm{ins}}+n_{\mathrm{del}}+n_{\mathrm{sub}}$
wherein $t_1$ and $t_2$ represent the texts of the two entities, and $n_{\mathrm{ins}}$, $n_{\mathrm{del}}$ and $n_{\mathrm{sub}}$ indicate the numbers of insert, delete and replace operations, respectively.
Structural similarity calculation:
the structural similarity is used for measuring whether the structural relationship of two entities in the knowledge graph is similar. A common calculation method includes Jaccard coefficients, which measure structural similarity by calculating the ratio of the intersection to the union of a set of neighbors of two entities.
The Jaccard coefficient formula is as follows:
$J(e_1, e_2)=\frac{|N(e_1)\cap N(e_2)|}{|N(e_1)\cup N(e_2)|}$
wherein $N(e_1)$ and $N(e_2)$ represent the neighbor sets of the two entities, respectively.
Semantic similarity calculation:
semantic similarity is used to measure how similar two entities are in terms of semantics. The usual calculation method includes a RAG-based knowledge extraction method that calculates the semantic similarity of two entities by retrieving relevant information from an external knowledge base and generating semantic representations in combination with a pre-trained large language model.
The semantic similarity formula is as follows:
$\mathrm{SemSim}(e_1, e_2)=\cos\big(h(e_1), h(e_2)\big)$
wherein $e_1$ and $e_2$ represent the two entities, $h(\cdot)$ represents the entity semantic representation generated by the RAG technique, and $\cos(\cdot,\cdot)$ represents cosine similarity.
Multidimensional similarity fusion:
and carrying out weighted fusion on the text similarity, the structural similarity and the semantic similarity to obtain a comprehensive similarity score, and judging whether the two entities point to the same physical object.
The comprehensive similarity formula is as follows:
$\mathrm{Sim}(e_1, e_2)=\lambda_1\,\mathrm{Sim}_{\mathrm{text}}+\lambda_2\,\mathrm{Sim}_{\mathrm{struct}}+\lambda_3\,\mathrm{Sim}_{\mathrm{sem}}$
wherein $\lambda_1$, $\lambda_2$ and $\lambda_3$ represent the weights of the text similarity, the structural similarity and the semantic similarity, respectively, and are generally adjusted according to the specific application scenario.
Suppose there are two entities $e_1$ and $e_2$, and it is necessary to determine whether they point to the same physical object. First, their text similarity is calculated using the edit distance formula to obtain a text similarity score. Then, their structural similarity is calculated using the Jaccard coefficient formula to obtain a structural similarity score. Next, their semantic similarity is calculated: semantic representations are generated using the RAG-based knowledge extraction method, and a semantic similarity score is calculated through the cosine similarity formula. Finally, the text, structural and semantic similarities are weighted and fused to obtain a comprehensive similarity score; if the comprehensive similarity score exceeds a preset threshold, $e_1$ and $e_2$ are considered to point to the same physical object.
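A minimal sketch of this multidimensional fusion is given below; the weights, the 0.8 threshold and the example values are illustrative, and the semantic score is assumed to come from the RAG-based method described above:

def edit_distance_similarity(a: str, b: str) -> float:
    """Text similarity normalized from the Levenshtein edit distance."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dp[i][j] = min(dp[i - 1][j] + 1, dp[i][j - 1] + 1,
                           dp[i - 1][j - 1] + (a[i - 1] != b[j - 1]))
    return 1 - dp[m][n] / max(m, n, 1)

def jaccard(neigh_a: set, neigh_b: set) -> float:
    """Structural similarity: overlap of the two entities' neighbor sets in the graph."""
    return len(neigh_a & neigh_b) / max(len(neigh_a | neigh_b), 1)

def same_entity(text_sim: float, struct_sim: float, sem_sim: float,
                weights=(0.3, 0.3, 0.4), threshold: float = 0.8) -> bool:
    """Weighted fusion of text, structural and semantic similarity for entity alignment."""
    score = sum(w * s for w, s in zip(weights, (text_sim, struct_sim, sem_sim)))
    return score >= threshold

t = edit_distance_similarity("Company X Ltd.", "Company X Limited")
s = jaccard({"battery", "new energy vehicle"}, {"battery", "CEO"})
print(same_entity(t, s, sem_sim=0.95))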
In a second aspect, the application provides a knowledge construction system based on a large model and a RAG technology, as shown in fig. 2, which comprises a data cleaning module, a text segmentation module, a vector conversion module, a vector indexing module, a local knowledge extraction module, a global knowledge extraction module and a coreference resolution module;
The data cleaning module is used for acquiring text data from various sources, performing cleaning, format conversion and verification on the text data, and storing the text data in a text database;
the text segmentation module is used for setting parameters and splitting a large text in the text database into text data blocks with semantic meaning according to the separator priority;
The vector conversion module is used for selecting the embedding model to convert the text data block into a text vector, carrying out normalization processing and storing the text vector into the vector database;
The vector index module is used for establishing a vector index, adopting the combination of similarity retrieval and full text retrieval and optimizing a retrieval strategy through mixed search;
the local knowledge extraction module is used for extracting local knowledge by using a large model and extracting entities, attributes and relations from the text data block;
The global knowledge extraction module is used for constructing related entity pairs across the text data blocks by adopting a mutual information method, and carrying out global knowledge extraction by combining RAG technology and a retrieval strategy to generate entity relations;
The coreference resolution module is used for judging whether two entities point to the same physical object through a multidimensional coreference resolution method, merging identical entities to form a knowledge base, and storing the constructed knowledge.
In a third aspect the application proposes an electronic device comprising a memory, a processor and a computer program stored in said memory and executable on said processor, said processor implementing the steps of the method as described above when said computer program is executed.
In a fourth aspect the application proposes a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of a method as described above.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-described division of the functional units and modules is illustrated, and in practical application, the above-described functional distribution may be performed by different functional units and modules according to needs, i.e. the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-described functions. The functional units and modules in the embodiment may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit, where the integrated units may be implemented in a form of hardware or a form of a software functional unit. In addition, the specific names of the functional units and modules are only for distinguishing from each other, and are not used for limiting the protection scope of the present application. The specific working process of the units and modules in the above system may refer to the corresponding process in the foregoing method embodiment, which is not described herein again.
In the foregoing embodiments, the description of each embodiment has its own emphasis; for parts not described or detailed in a particular embodiment, reference may be made to the related descriptions of other embodiments.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present disclosure.
In the embodiments provided in the present disclosure, it should be understood that the disclosed apparatus/computer device and method may be implemented in other manners. For example, the apparatus/computer device embodiments described above are merely illustrative, e.g., the division of modules or elements is merely a logical functional division, and there may be additional divisions of actual implementations, multiple elements or components may be combined or integrated into another system, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection via interfaces, devices or units, which may be in electrical, mechanical or other forms.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present disclosure may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated modules/units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, the present disclosure may implement all or part of the flow of the method of the above-described embodiments, or may be implemented by a computer program to instruct related hardware, and the computer program may be stored in a computer readable storage medium, where the computer program, when executed by a processor, may implement the steps of the method embodiments described above. The computer program may comprise computer program code, which may be in source code form, object code form, executable file or in some intermediate form, etc. The computer readable medium can include any entity or device capable of carrying computer program code, recording medium, USB flash disk, removable hard disk, magnetic disk, optical disk, computer Memory, read-Only Memory (ROM), random access Memory (Random Access Memory, RAM), electrical carrier signals, telecommunications signals, and software distribution media, among others. It should be noted that the content of the computer readable medium can be appropriately increased or decreased according to the requirements of the jurisdiction's jurisdiction and the patent practice, for example, in some jurisdictions, the computer readable medium does not include electrical carrier signals and telecommunication signals according to the jurisdiction and the patent practice.
The foregoing is merely a preferred embodiment of the present invention, and it should be noted that modifications and improvements made by those skilled in the art without departing from the present technical solution shall be considered as falling within the scope of the present technical solution claimed.