Disclosure of Invention
Aiming at the problems in the prior art, the invention provides a digital intelligent power regulation question-answering method and system based on a large language model, which improve question-answering accuracy, allow interaction in more natural language, and improve the question-answering experience of users.
To achieve the above purpose, the present invention adopts the following technical solutions:
In a first aspect, a digital intelligent power regulation question-answering method based on a large language model is provided, including:
acquiring power regulation field text, obtaining a vector encoding of the power regulation field text by using an embedding model, and storing the vector encoding in a vector knowledge base;
when a user poses a question, receiving the user question text and preprocessing it;
inputting the preprocessed user question text into a retrieval tool, and retrieving and matching against the vector knowledge base through the retrieval tool to obtain texts whose comprehensive similarity meets the requirement;
post-processing the texts whose comprehensive similarity meets the requirement by using a large language model, and generating a corresponding answer.
As a preferred solution, the step of collecting power regulation field text includes: preprocessing the collected power regulation field text, wherein the preprocessing operation comprises document parsing and conversion of documents in non-text formats into plain text;
after preprocessing, segmentation slicing is performed to divide the text content into paragraphs or sentences.
As a preferred solution, the step of obtaining the vector encoding of the power regulation field text using the embedding model and storing it in the vector knowledge base includes:
The vector encoding of the segmented and sliced text is obtained by a bi-directional encoder BERT model, expressed as:
s = BERT(text; θ)
where text represents an input segmented text slice, s represents the vector representation of the corresponding text, and θ represents the BERT model parameters; the relationship between the power regulation field text and its corresponding vector is established in index form and stored in the vector knowledge base.
As a preferred solution, in the step of receiving and preprocessing the user question text when the user poses a question, the preprocessing operation includes format conversion, noise removal, and question restatement of the user question text using the large model and corresponding prompt words;
Specifically, regular expressions are used to remove noise from the user question text, eliminating irrelevant characters and interfering information; question restatement refers to using the large model to restate the user question text in a normalized form, expressed as:
Processed_question = LLM(Prompt, Original_question; θ)
where LLM() denotes the large model computation, Prompt denotes the prompt words, Original_question denotes the original user question text, θ denotes the large model parameters, and Processed_question denotes the restated user question.
As a preferred solution, the step of inputting the preprocessed user question text into a retrieval tool and retrieving and matching against the vector knowledge base through the retrieval tool to obtain texts whose comprehensive similarity meets the requirement includes: the retrieval tool performs vector embedding on the user question text to convert it into a vector, and then retrieves and matches the most similar vectors in the vector knowledge base using vector similarity, text similarity, and information loss indices;
The vector similarity index is characterized by cosine similarity, expressed as:
cos(X, Y) = (X · Y) / (‖X‖ ‖Y‖)
where X and Y denote the vectors formed by embedding the two text segments;
The text similarity index is characterized by Jaccard similarity, expressed as:
J(A, B) = |A ∩ B| / |A ∪ B|
where A and B are the word sets of the two text segments X and Y respectively, |A ∩ B| denotes the size of the intersection of the two sets, and |A ∪ B| denotes the size of the union of the two sets;
the information loss index is characterized by the difference between information entropy and mutual information, expressed as:
L(X, Y) = H(X) + H(Y) − 2I(X; Y)
where H(X) and H(Y) are the information entropies of the two vectors, and I(X; Y) is the mutual information between them;
Based on the cosine similarity, Jaccard similarity, and information loss indices, the comprehensive similarity is calculated as:
S = λC · cos(X, Y) + λJ · J(A, B) − λI · L(X, Y)
where λC, λJ and λI are parameters adjusted according to the text type differences in the power regulation domain.
As a preferred solution, the step of post-processing the texts whose comprehensive similarity meets the requirement using the large language model includes:
using the large language model to perform prompt compression, text rearrangement, text summarization, and text fusion on the texts whose comprehensive similarity meets the requirement; prompt compression reduces noise by compressing the information retrieved from the knowledge base; text rearrangement reorders the retrieved text fragments so that they better conform to a logical or semantic order; text summarization uses the large language model to condense the extracted text information into a continuous passage; text fusion integrates information through the large language model, merging similar or complementary information across multiple pieces of text.
As a preferred solution, the text rearrangement reorders the retrieved text segments using a Cohere Rerank model, expressed as:
(v′_1, …, v′_n) = Reranker(v_1, …, v_n; S, θ)
where (v_1, …, v_n) is the vector sequence before rearrangement and (v′_1, …, v′_n) is the rearranged vector sequence; n denotes the number of vectors, Reranker denotes the Cohere Rerank model, S denotes the criterion adopted for rearrangement, and θ denotes the parameters of the Cohere Rerank model; S and θ are adjusted according to the different texts of the power regulation field.
As a preferred solution, the intelligent power regulation question-answering method based on the large language model further comprises saving the question-answer text into a memory component for later recall; the large language model generates the corresponding answer from the text information after text rearrangement, text summarization, and text fusion, combined with preset prompt words and recall from the memory component.
In a second aspect, a digital intelligent power regulation question-answering system based on a large language model is provided, including:
a text acquisition and encoding module, configured to acquire power regulation field text, obtain the vector encoding of the power regulation field text using an embedding model, and store the vector encoding in a vector knowledge base;
a question text preprocessing module, configured to receive the user question text and preprocess it when a user poses a question;
a retrieval and matching module, configured to input the preprocessed user question text into a retrieval tool, and to retrieve and match against the vector knowledge base through the retrieval tool to obtain texts whose comprehensive similarity meets the requirement;
and an answer generation module, configured to post-process the texts whose comprehensive similarity meets the requirement using the large language model and generate the corresponding answer.
In a third aspect, a computer-readable storage medium is provided, where at least one instruction is stored that, when executed by a processor, implements the large language model-based digital intelligent power regulation question-answering method.
Compared with the prior art, the first aspect of the invention has at least the following beneficial effects:
By combining an advanced large language model with the retrieval function of a text vector knowledge base for the power regulation field, the system can, when a user poses a question, understand and process complex natural language questions and provide more accurate and relevant answers. The invention post-processes the texts whose comprehensive similarity meets the requirement using a large language model and generates the corresponding answer; owing to the language understanding and generation capability and strong generalization of large language models, answers can be generated from multiple angles and across multiple power regulation sub-fields, significantly improving the generalization of power regulation question answering. The intelligent power regulation question-answering method based on a large language model is suitable not only for conventional question-answering tasks but also for emergency consultation in power regulation, increasing the application range and flexibility of the system. Adopting the power regulation question-answering method of the invention reduces dependence on professionals, enabling non-professional users to quickly obtain power-regulation-related information and lowering operation and maintenance costs.
Furthermore, the invention introduces large-model technology, retrieval-augmented generation, a memory mechanism, and a chained workflow mechanism into the field of power regulation, establishing a vector knowledge base construction chained workflow and an intelligent question-answering chained workflow in which the modules are connected sequentially in a chained structure. The vector knowledge base construction workflow performs text parsing and preprocessing, semantics-based segmentation slicing, language-model-based vector embedding, and index-based vector storage on power regulation field texts; the intelligent question-answering chained workflow sequentially performs preprocessing of the user text (format conversion, noise removal, question restatement, and the like), language-model-based vector embedding, iterative and recursive retrieval based on the comprehensive similarity measure, post-processing including prompt compression, text rearrangement, text summarization and text fusion, and answer generation based on the large language model and prompt words. The question-answering method has high independence, flexibility, and extensibility, and can reduce system development and maintenance costs. Meanwhile, the memory component gives the system the ability to memorize and read the question-answer context; introduced as an independent component into the intelligent power regulation question-answering chained workflow, it enhances conversational continuity and improves the user interaction experience.
Furthermore, the invention proposes a comprehensive similarity measure composed of text similarity, vector similarity, and information loss; using this measure greatly improves the accuracy of vector retrieval and matching, thereby improving the effectiveness of the whole question-answering system.
It will be appreciated that the advantages of the second to third aspects may be found in the relevant description of the first aspect, and are not described in detail herein.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system configurations, techniques, etc. in order to provide a thorough understanding of embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.
The embodiment of the invention provides a digital intelligent power regulation question-answering method based on a large language model, belongs to the application of artificial intelligence in the technical field of power automation, and in particular relates to improving the performance of automatic question-answering systems in the power regulation industry by introducing natural language processing technology, using a large pre-trained language model together with a vector retrieval method. This technology gives automatic question-answering systems in the power regulation field higher answer accuracy and response efficiency when handling highly specialized queries with strong real-time requirements, meeting the industry's need for intelligent services in the course of informatization and digital transformation.
Description of the retrieval-augmented generation method (abbreviated RAG): retrieval-augmented generation (Retrieval-Augmented Generation, RAG) is a natural language processing technique that combines information retrieval and text generation. In this method, the system first retrieves relevant context information or documents from a knowledge base using a pre-trained retrieval component. The retrieved information is then provided to a generation model, which incorporates it into the generated text to improve the accuracy of answers, generate explanations, or enhance the output quality of the language model. The RAG method is particularly suitable for tasks requiring external knowledge, such as open-domain question answering, factual content generation, or any scenario requiring reference to specific knowledge points. In this way, the generated text is richer, more accurate, and more informative.
Description of the large language model (also called a large model, abbreviated LLM): large language models (Large Language Models) are a deep-learning-based natural language processing technique widely used to understand and generate natural language text. These models are based on the Transformer architecture, have a large number of parameters, and can handle large-scale data sets. Large models learn the complex patterns and structures of language through pre-training on massive amounts of text, and can therefore perform excellently on a variety of language tasks, including text generation, translation, question answering, summarization, sentiment analysis, and so on.
The embodiment of the invention discloses a digital intelligent power regulation question-answering method based on a large language model, which mainly involves the following techniques: large language models, such as the generative pre-trained transformer GPT and the bi-directional encoder BERT, are pre-trained on large-scale data sets using deep learning networks to understand and generate human language, providing rich semantic understanding and answer generation capability. Retrieval-augmented generation combines information retrieval with a language generation model, retrieving relevant documents and using them to assist the language model in generating answers, thereby improving the accuracy and reliability of the question-answering system. Vector embedding converts text into mathematical vector form so that a machine can retrieve information by computing similarity between vectors. Vector similarity calculation measures the degree of similarity of two vectors using mathematical methods such as cosine similarity and Euclidean distance, finding the best match between question and answer. Finally, a memory component enhances the contextual understanding capability of the question-answering system, memorizing and reading previous question-answer turns in order to provide more coherent and accurate answers.
Referring to fig. 1, the digital intelligent power regulation question-answering method based on a large language model in the embodiment of the invention includes:
S1, acquiring power regulation field text, obtaining the vector encoding of the power regulation field text using an embedding model, and storing the vector encoding in a vector knowledge base;
S2, when a user poses a question, receiving the user question text and preprocessing it;
S3, inputting the preprocessed user question text into a retrieval tool, and retrieving and matching against the vector knowledge base through the retrieval tool to obtain texts whose comprehensive similarity meets the requirement;
S4, post-processing the texts whose comprehensive similarity meets the requirement using a large language model, and generating the corresponding answer.
In one possible implementation, the step of collecting power regulation field text in step S1 includes:
preprocessing the collected power regulation field text, where the preprocessing operation comprises document parsing and conversion of non-text-format documents into plain text. For example, text materials that exist in non-text formats (such as PDF or Word documents) need to be converted to plain text; the embodiment of the invention achieves this through the PyPDF and python-docx libraries in Python. After preprocessing, segmentation slicing divides the text content into smaller paragraphs or sentences. In the embodiment of the invention, sentence splitting and segmentation are achieved through the NLTK library in Python together with semantic analysis techniques.
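As an illustrative sketch of this preprocessing step (the embodiment names PyPDF, python-docx, and NLTK; the splitter below is a simplified standard-library stand-in for NLTK's sentence tokenizer, and the sample document text is hypothetical):

```python
import re

def to_plain_text(raw: str) -> str:
    """Normalize whitespace in text already extracted from a document.
    (In the embodiment, extraction itself would use PyPDF / python-docx.)"""
    return re.sub(r"\s+", " ", raw).strip()

def split_sentences(text: str) -> list[str]:
    """Simplified sentence slicer: split after sentence-ending punctuation.
    NLTK's trained tokenizer would be used in practice."""
    parts = re.split(r"(?<=[.!?。！？])\s+", text)
    return [p.strip() for p in parts if p.strip()]

doc = "Grid frequency must stay near 50 Hz. Dispatchers monitor load!  Reserves are scheduled daily."
slices = split_sentences(to_plain_text(doc))
```

Each resulting slice then becomes one unit for vector encoding and storage in the knowledge base.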
In one possible implementation, the step in S1 of obtaining the vector encoding of the power regulation field text using the embedding model and storing it in the vector knowledge base includes:
The vector encoding of the segmented and sliced text is obtained using a bi-directional encoder BERT model. For the more complex power regulation context, BERT can process whole passages of text and capture richer contextual information. The bi-directional encoder BERT model obtains the vector encoding as:
s = BERT(text; θ)
where text represents an input segmented text slice, s represents the vector representation of the corresponding text, and θ represents the BERT model parameters; the relationship between the power regulation field text and its corresponding vector is established in index form and stored in the vector knowledge base.
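The encoding and index-form storage can be sketched as follows; the hash-based `toy_embed` function is an assumed stand-in for the real BERT encoder s = BERT(text; θ), which would typically be loaded through a library such as transformers:

```python
import hashlib
import math

def toy_embed(text: str, dim: int = 8) -> list[float]:
    """Stand-in for s = BERT(text; theta): a deterministic, unit-norm
    hash-based vector. A real system would use a BERT encoder instead."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    vec = [digest[i] / 255.0 for i in range(dim)]
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

class VectorKnowledgeBase:
    """Index-form store mapping each text slice to its vector, per the method."""
    def __init__(self):
        self.index: dict[int, tuple[str, list[float]]] = {}

    def add(self, text: str) -> int:
        idx = len(self.index)
        self.index[idx] = (text, toy_embed(text))
        return idx

kb = VectorKnowledgeBase()
i = kb.add("Frequency regulation reserves are dispatched in real time.")
```

The integer index links each power regulation text slice to its vector, which is the relationship the retrieval step later queries.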
In order to combine the various flows of the system in a modular and extensible way, embodiments of the present invention introduce a chained workflow structure to perform the steps sequentially and independently. Step S1 constitutes the vector knowledge base construction chained workflow, as shown in FIG. 2.
In one possible implementation, in step S2, when a user poses a question, the preprocessing module in the intelligent question-answering chained workflow performs format conversion on the user question text through a format conversion component, and removes noise and restates the question using the large model and corresponding prompt words. Specifically, regular expressions are used to remove noise from the user question text, eliminating irrelevant characters and interfering information; question restatement refers to the normalized restatement of the user question text by the large model for subsequent processing, expressed as:
Processed_question = LLM(Prompt, Original_question; θ)
where LLM() denotes the large model computation, Prompt denotes the prompt words, Original_question denotes the original user question text, θ denotes the large model parameters, and Processed_question denotes the restated user question.
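A minimal sketch of this preprocessing, assuming a hypothetical `llm` callable in place of the real large-model API; the noise-removal pattern and prompt wording are illustrative choices, not the embodiment's exact ones:

```python
import re

def remove_noise(question: str) -> str:
    """Regular-expression noise removal: drop characters irrelevant to the query
    and collapse repeated whitespace."""
    cleaned = re.sub(r"[^\w\s?.,-]", "", question)
    return re.sub(r"\s+", " ", cleaned).strip()

def restate_question(original_question: str, llm=None) -> str:
    """Processed_question = LLM(Prompt, Original_question; theta).
    The model call is stubbed; `llm` would wrap a real large-model API."""
    prompt = ("Restate the following power-regulation question in standard form: "
              + original_question)
    if llm is None:  # fallback when no model is wired in
        return original_question
    return llm(prompt)

q = remove_noise("What is   AGC??? ###")
```

The restated question, not the raw input, is what the retrieval tool embeds in step S3.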
Steps S2 to S4 of the embodiment of the present invention constitute the flow of the intelligent question-answering chained workflow, as shown in fig. 3.
In one possible implementation, step S3 of inputting the preprocessed user question text into a retrieval tool and retrieving and matching against the vector knowledge base through the retrieval tool to obtain texts whose comprehensive similarity meets the requirement includes: the retrieval tool performs vector embedding on the user question text using a bi-directional encoder BERT embedding network, converts it into a vector, and then retrieves and matches the most similar vectors in the vector knowledge base using vector similarity, text similarity, and information loss indices;
The vector similarity index is characterized by cosine similarity, expressed as:
cos(X, Y) = (X · Y) / (‖X‖ ‖Y‖)
where X and Y denote the vectors formed by embedding the two text segments;
The text similarity index is characterized by Jaccard similarity, expressed as:
J(A, B) = |A ∩ B| / |A ∪ B|
where A and B are the word sets of the two text segments X and Y respectively, |A ∩ B| denotes the size of the intersection of the two sets, and |A ∪ B| denotes the size of the union of the two sets;
the information loss index is characterized by the difference between information entropy and mutual information, expressed as:
L(X, Y) = H(X) + H(Y) − 2I(X; Y)
where H(X) and H(Y) are the information entropies of the two vectors, and I(X; Y) is the mutual information between them; the smaller the information difference between the two vectors, the more similar they are.
Based on the cosine similarity, Jaccard similarity, and information loss indices, the comprehensive similarity is calculated as:
S = λC · cos(X, Y) + λJ · J(A, B) − λI · L(X, Y)
where λC, λJ and λI are parameters adjusted according to the text type differences in the power regulation domain.
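The three indices and their weighted combination can be sketched as below. The entropy and mutual-information estimates discretize the vector components, and both the sign convention (subtracting the loss term, consistent with smaller loss meaning greater similarity) and the example weights λC = 0.5, λJ = 0.3, λI = 0.2 are assumptions for illustration:

```python
import math
from collections import Counter

def cosine(x, y):
    """Vector similarity index: cos(X, Y) = (X . Y) / (|X| |Y|)."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny) if nx and ny else 0.0

def jaccard(a_words, b_words):
    """Text similarity index: J(A, B) = |A & B| / |A | B| over word sets."""
    a, b = set(a_words), set(b_words)
    return len(a & b) / len(a | b) if a | b else 0.0

def entropy(symbols):
    n = len(symbols)
    return -sum((c / n) * math.log2(c / n) for c in Counter(symbols).values())

def mutual_information(xs, ys):
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))

def info_loss(xs, ys):
    # H(X) + H(Y) - 2 I(X; Y): zero when the two (discretized) vectors carry
    # the same information, larger as they diverge.
    return entropy(xs) + entropy(ys) - 2 * mutual_information(xs, ys)

def comprehensive_similarity(x, y, a_words, b_words, lc=0.5, lj=0.3, li=0.2):
    xs = [round(v, 1) for v in x]  # crude discretization for entropy estimates
    ys = [round(v, 1) for v in y]
    return lc * cosine(x, y) + lj * jaccard(a_words, b_words) - li * info_loss(xs, ys)
```

For two identical texts the comprehensive similarity reduces to λC + λJ, since the cosine and Jaccard terms are both 1 and the information loss term vanishes.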
The texts obtained in step S3 whose comprehensive similarity meets the requirement are the several texts with the highest comprehensive similarity; if no text with high comprehensive similarity can be retrieved, a message indicating that no result was found is returned.
Here, vector similarity refers to a characterization of how similar two vectors are. In machine learning and data analysis, data points are often represented as vectors, which may be points in a multidimensional space representing various features; vector similarity is used to determine how close or similar these vectors or points are.
Text similarity refers to the degree to which two pieces of text are similar in content, semantics, or context. It is a basic concept in natural language processing for measuring and comparing the similarity between documents, sentences, or phrases. Text similarity can be measured by a variety of methods, including lexical similarity, structural similarity, semantic similarity, vector-space-based similarity, language-model-based similarity, and so on. Text similarity plays an important role in applications such as information retrieval, document classification, text clustering, plagiarism detection, and question-answering systems.
Information loss is an important concept in information theory, and refers to the phenomenon that part of information is lost in the processes of data processing, compression, transmission or conversion, and can also be used for measuring the similarity of two groups of information.
The similarity measure is a function or criterion used to evaluate the degree of similarity between two objects (e.g., numbers, text, images, etc.). In the field of machine learning and data analysis, such metrics may be used to compare data points, feature vectors, or complex objects, and for various applications, including clustering, classification, recommendation systems, and information retrieval.
In one possible implementation, step S4 post-processes the several retrieved texts with the highest similarity using a large language model, that is, performs prompt compression, text rearrangement, text summarization, and text fusion on the texts whose comprehensive similarity meets the requirement, so as to generate more complete and richer answers. Prompt compression reduces noise by compressing the information retrieved from the knowledge base and improves the generation efficiency of the subsequent large model; text rearrangement reorders the retrieved text fragments according to certain rules or algorithms so that they better conform to a logical or semantic order; text summarization uses the large language model to condense the extracted text information into a continuous passage; text fusion integrates information through the large language model, merging similar or complementary information across multiple pieces of text.
In one possible implementation, the text rearrangement of the embodiment of the present invention reorders the retrieved text segments using a Cohere Rerank model, expressed as:
(v′_1, …, v′_n) = Reranker(v_1, …, v_n; S, θ)
where (v_1, …, v_n) is the vector sequence before rearrangement and (v′_1, …, v′_n) is the rearranged vector sequence; n denotes the number of vectors, Reranker denotes the Cohere Rerank model, S denotes the criterion adopted for rearrangement, and θ denotes the parameters of the Cohere Rerank model; S and θ are adjusted according to the different texts of the power regulation field.
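Because the embodiment relies on the hosted Cohere Rerank model, the stand-in below only illustrates the reordering step with a simple term-overlap score as the criterion S; it is not Cohere's actual API:

```python
def rerank(query_terms, passages, top_k=None):
    """Stand-in for (v'_1..v'_n) = Reranker(v_1..v_n; S, theta): order retrieved
    passages by a term-overlap score. Cohere's hosted Rerank endpoint would
    replace this scoring in the embodiment."""
    q = set(query_terms)
    scored = sorted(passages,
                    key=lambda p: len(q & set(p.lower().split())),
                    reverse=True)
    return scored[:top_k] if top_k else scored

ranked = rerank(["frequency", "reserve"],
                ["Voltage limits apply.", "Frequency reserve margins are set daily."])
```

After reranking, the most relevant passages appear first, so the subsequent summarization and fusion steps operate on text that already follows the intended semantic order.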
In one possible implementation, the intelligent power regulation question-answering method based on the large language model further comprises saving the question-answer text into a memory component for later recall; the large language model generates the corresponding answer from the text information after text rearrangement, text summarization, and text fusion, combined with preset prompt words and recall from the memory component.
Thereafter, the text of the question and answer is saved in the memory component for later recall.
And finally, outputting the generated answer through a user interface to complete interaction with a user.
A prompt word (Prompt) in the context of large language models refers to a word, a group of words, or a sentence used to guide or stimulate the model to generate a particular type of response. In machine learning, and in particular when using pre-trained language models for tasks, the prompt word serves as part of the input to help the model understand the desired task or output format.
A memory component refers to a module or subsystem for storing, maintaining, and retrieving information. In large models, introducing a memory component enables the model to remember and utilize contextual input and learned information.
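A minimal sketch of such a memory component; the bounded window size and the rendering format for recalled context are assumptions chosen for illustration:

```python
class MemoryComponent:
    """Stores question-answer turns so the chained workflow can read back context."""
    def __init__(self, max_turns: int = 20):
        self.turns: list[tuple[str, str]] = []
        self.max_turns = max_turns

    def save(self, question: str, answer: str) -> None:
        self.turns.append((question, answer))
        self.turns = self.turns[-self.max_turns:]  # keep a bounded window

    def recall(self) -> str:
        """Render stored turns as context text for the next prompt."""
        return "\n".join(f"Q: {q}\nA: {a}" for q, a in self.turns)

mem = MemoryComponent()
mem.save("What is AGC?", "Automatic generation control.")
```

The recalled text is prepended to the preset prompt words so that the large language model sees the dialogue history when generating the next answer.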
The intelligent power regulation question-answering method based on the large language model overcomes the shortcomings of existing automatic question-answering technology in the power regulation field, mainly in the following respects:
(1) Improved question-answering accuracy: by constructing a vector knowledge base of power regulation field texts, the system can understand and process more complex and specialized power regulation questions, improving question-answering accuracy.
(2) Highly intelligent responses: by incorporating a large language model, the system not only provides accurate answers but also interacts in more natural language, improving the user's question-answering experience.
(3) Memory of the question-answer context: the introduction of a memory component enables the system to memorize and read the question-answer context, supporting coherent dialogue.
(4) Improved question-answering generalization: by optimizing the retrieval algorithm and the retrieval-augmented generation method, the generalization of the system's question answering is improved, enabling it to respond to more diverse and more complex user queries.
(5) Reduced information loss: the user question text and the power regulation texts undergo vector embedding, and similarity is calculated using multiple measurement indices, reducing loss during information processing and ensuring the integrity and accuracy of the information.
The invention also provides a digital intelligent power regulation question-answering system based on a large language model, comprising:
a text acquisition and encoding module, configured to acquire power regulation field text, obtain the vector encoding of the power regulation field text using an embedding model, and store the vector encoding in a vector knowledge base;
a question text preprocessing module, configured to receive the user question text and preprocess it when a user poses a question;
a retrieval and matching module, configured to input the preprocessed user question text into a retrieval tool, and to retrieve and match against the vector knowledge base through the retrieval tool to obtain texts whose comprehensive similarity meets the requirement;
and an answer generation module, configured to post-process the texts whose comprehensive similarity meets the requirement using the large language model and generate the corresponding answer.
The invention combines a large language model with a retrieval-augmented generation method. The main idea is as follows: text data in the power regulation field is processed by an embedding model to obtain vector encodings, which are stored in a vector knowledge base; the user question text then undergoes preprocessing such as format conversion and is input into the retrieval tool. In the retrieval tool, the user question text is vector-embedded, and indices such as vector similarity, text similarity, and information loss are used to query the most similar vectors in the vector knowledge base; combining the texts corresponding to these vectors with pre-written prompt words, the large language model generates output and returns it to the user.
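The chained workflow above can be sketched as a sequence of modules, each consuming the previous module's output; the module stubs here are hypothetical placeholders for the real preprocessing, retrieval, and generation steps:

```python
def run_chain(question, steps):
    """Chained workflow: each module receives the previous module's output."""
    data = question
    for step in steps:
        data = step(data)
    return data

# Hypothetical module stubs illustrating the chain order of the embodiment.
def preprocess(q):
    return q.strip().rstrip("?") + "?"

def retrieve(q):
    return {"question": q, "passages": ["Reserves cover frequency deviations."]}

def generate(ctx):
    return f"Answer to '{ctx['question']}': {ctx['passages'][0]}"

answer = run_chain("  How are reserves used ", [preprocess, retrieve, generate])
```

Because each module only depends on its input, modules can be replaced or extended independently, which is the independence and extensibility the chained structure is meant to provide.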
In some possible implementations, a knowledge graph may be used instead of the vector knowledge base, but doing so may reduce performance. The large language models used in the present invention are interchangeable, which may change system performance. Furthermore, fine-tuning of a large language model may be used instead of the retrieval-augmented generation technique, but this requires more computation and may reduce performance.
Another embodiment of the present invention also proposes an electronic device including a processor and a memory, the processor being configured to execute a computer program stored in the memory to implement the large language model-based digital intelligent power regulation question-answering method.
Another embodiment of the present invention also proposes a computer-readable storage medium storing at least one instruction that, when executed by a processor, implements the large language model-based digital intelligent power regulation question-answering method.
The computer program comprises computer program code, which may be in source code form, object code form, an executable file, some intermediate form, or the like. The computer-readable storage medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory, a random access memory, electrical carrier signals, telecommunication signals, a software distribution medium, and the like. It should be noted that the content contained in the computer-readable medium may be appropriately increased or decreased according to the requirements of legislation and patent practice in the relevant jurisdiction; for example, in some jurisdictions, according to legislation and patent practice, the computer-readable medium does not include electrical carrier signals and telecommunication signals. For convenience of description, the foregoing shows only those parts relevant to the embodiments of the present invention; for specific technical details not disclosed, reference is made to the method parts of the embodiments. The computer-readable storage medium is non-transitory, can reside in storage devices of various electronic devices, and can implement the execution procedure described in the method embodiments of the present invention.
It will be appreciated by those skilled in the art that embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solutions of the present invention and not for limiting the same. Although the present invention has been described in detail with reference to the above embodiments, it should be understood by those of ordinary skill in the art that modifications and equivalents may be made to the specific embodiments of the invention without departing from its spirit and scope, and any such modifications and equivalents are intended to be covered by the scope of the claims.