Disclosure of Invention
The invention provides a method for constructing a railway multimodal knowledge-base question-answering system with a hybrid RAG architecture, enabling such a system to be constructed more effectively.
The invention relates to a method for constructing a railway multimodal knowledge-base question-answering system with a hybrid RAG architecture, comprising the following steps:
Step 1, deploying and fine-tuning a local model on a server;
The deployed models include a language model and a text embedding model;
the fine-tuning is based on the DyLoRA fine-tuning framework, which integrates dynamic rank allocation, layer-wise adaptation and an expert routing mechanism, improving model performance while maintaining parameter efficiency;
Step 2, data processing and database construction;
The data comprise railway research reports and track operation-and-maintenance reports in docx and pdf format, design standards from across the industry, and design drawings, specification tables and pictures in various formats;
After the language model and the text embedding model are deployed locally, the model interfaces are managed with an open-source project, and a railway-standard knowledge-base management system is established on a MongoDB content store and a PostgreSQL vector retrieval library; PostgreSQL provides a vector field for storing the vectors, while MongoDB stores the original data behind each vector; at retrieval time the vectors are recalled first, and the original data content is then looked up in MongoDB by the vector's ID;
Step 3, constructing a multimodal knowledge base;
pictures and their associated texts are embedded into the vector database with a multimodal embedding model, while the corresponding original pictures and texts are kept in document storage; during hybrid similarity retrieval the matching pictures are fetched directly from document storage, and the original pictures and text blocks are passed to the large model to generate an answer;
Step 4, expanding to multiple platforms.
Preferably, in step 1, the language model is an autoregressive language model with a Transformer architecture, pre-trained on massive text data to learn the statistical regularities, world knowledge and complex patterns of language; the core task of the text embedding model is to convert a piece of text into a fixed-length dense embedding vector.
Preferably, in step 1, the formula of the DyLoRA fine-tuning framework is as follows:
$$h = W_0 x + \sum_{j=1}^{E} g_j \, B_j A_j^{\top} x, \qquad g_j = \frac{e^{\theta_j}}{\sum_{k=1}^{E} e^{\theta_k}}, \qquad r_l = r_{\min} + \big\lfloor (r_{\max} - r_{\min}) \cdot \lambda \cdot s_l \big\rfloor$$

wherein $h$ is the fine-tuned output of the model; $W_0$ denotes the pre-trained parameters, which remain unchanged during fine-tuning; $\theta_j$ denotes the learnable parameters of each expert group in the model's expert routing system; $B_j$ and $A_j$ are the two low-rank matrices trained during fine-tuning; $r_l$ denotes the rank, with $r_{\min}$ its minimum value and $r_{\max}$ its maximum value; $s_l$ denotes the sensitivity score of the $l$-th network layer; $\lambda$ is the sparse coefficient; $\top$ denotes transposition and $e$ the natural constant.
Preferably, the fine-tuning uses an instruction-supervised fine-tuning dataset comprising some ten thousand railway-industry corpus entries drawn from industry specifications, research reports and operation-and-maintenance reports; the dataset takes the form shown in the following formula:
$$\mathcal{D} = \{(I_i, x_i, y_i)\}_{i=1}^{N}$$

wherein $\mathcal{D}$ denotes a dataset consisting of $N$ samples; $(I_i, x_i, y_i)$ denotes each triplet sample; $I_i$ denotes the instruction of the $i$-th sample, guiding model behavior; $x_i$ denotes the input of the $i$-th sample, providing the specific context or question of the task; $y_i$ denotes the output of the $i$-th sample, being the expected response or answer;
the core training parameters comprise the learning rate, batch size, number of training epochs, maximum number of samples and gradient clipping, and are adjusted flexibly according to the model size and the dataset size.
Preferably, the hybrid retrieval comprises semantic vector retrieval and full-text sparse retrieval: the semantic vector retrieval obtains similarity by computing distances between vectors in an embedding space, while the full-text sparse retrieval adopts keyword matching based on a sparse scoring algorithm;
After the results of the two retrievals are obtained, they are normalized, reordered and search-filtered, and the relevant content serves as the reference for the large model's output; here reordering is a method of combining several result sets with different relevance measures into a single result set;
wherein the semantic vector retrieval computes the semantic vector similarity score $S_{\text{vec}}(q,d)$ between a query item $q$ and a candidate document $d$ as follows:

$$S_{\text{vec}}(q, d) = \frac{v_q \cdot v_d}{\lVert v_q \rVert \, \lVert v_d \rVert}$$

wherein $v_q$ is the query vector and $v_d$ is the vector of each candidate document; the full-text sparse retrieval computes the sparse search score $S_{\text{sparse}}(q,d)$ of each candidate document $d$ as follows:

$$S_{\text{sparse}}(q, d) = \sum_{t \in q} \operatorname{IDF}(t) \cdot \frac{tf(t,d)\,(k_1 + 1)}{tf(t,d) + k_1 \left(1 - b + b \cdot \frac{|d|}{\mathrm{avgdl}}\right)}, \qquad \operatorname{IDF}(t) = \ln\!\left(\frac{N - n_t + 0.5}{n_t + 0.5} + 1\right)$$

wherein $t$ is a term of the query item $q$; $tf(t,d)$ is the frequency of term $t$ in document $d$; $|d|$ is the length of document $d$; $\mathrm{avgdl}$ is the average document length of the document collection; $k_1$ controls the saturation of term frequency, and the larger $k_1$, the greater the impact of $tf$; $b$ controls the strength of document-length normalization; $\operatorname{IDF}(t)$ is the inverse document frequency of term $t$; $N$ is the total number of documents in the collection; and $n_t$ is the number of documents containing the term;
after the sparse scores and the vector similarity scores of all documents are obtained, the scores are normalized as follows:

$$\tilde{S}_{\text{sparse}}(q,d) = \frac{S_{\text{sparse}}(q,d) - \mu_{\text{sparse}}}{\sigma_{\text{sparse}}}, \qquad \tilde{S}_{\text{vec}}(q,d) = \frac{S_{\text{vec}}(q,d) - \mu_{\text{vec}}}{\sigma_{\text{vec}}}$$

wherein $q$ is the query item; $d$ ranges over all candidate documents; $\mu_{\text{sparse}}$ and $\mu_{\text{vec}}$ denote the mean of the sparse scores and of the vector similarity scores over all documents in the current candidate pool; and $\sigma_{\text{sparse}}$ and $\sigma_{\text{vec}}$ denote the standard deviation of the sparse scores and of the vector similarity scores over all documents in the current candidate pool.
In step 2, the text is split into sections according to rules and then converted into a slice format that supports semantic search, and each data slice in the database can be trimmed and corrected independently.
In step 2, table data are optimized by simplifying the tables and converting them into text; for large-scale tables, a dedicated table model is adopted to encode the table before embedding;
structural analysis is performed first, including identifying the row-column structure of the table, resolving merged cells and extracting hierarchical relations; semantic encoding follows, comprising a linearization function and a semantic embedding step, with the following formulas:

$$M = (r, c, R, C, m_h, m_v), \qquad \mathcal{L}(T) = \bigoplus_{i=1}^{R} \Big( \bigoplus_{j=1}^{C} t_{ij} \oplus \delta_{\text{cell}} \Big) \oplus \delta_{\text{col}}, \qquad e = E_{\theta}\big(\operatorname{Tok}(\mathcal{L}(T))\big)$$

wherein $M$ denotes the matrix of cells with initial row index $r$ and initial column index $c$; $R$ and $C$ denote the total row and column ranges of the logical area covered by the table cells; $m_h$ and $m_v$ are the numbers of horizontally and vertically merged cells respectively; $\mathcal{L}(\cdot)$ denotes the generating function of the linearization sequence; $T$ is the set of cell texts $t_{ij}$; $\oplus$ is the string concatenation operator; $\delta_{\text{col}}$ is the column separator; $\delta_{\text{cell}}$ is the cell separator; $\otimes$ denotes the tensor product operation; $e$ denotes the embedding vector of the text produced by the pre-trained embedding model; $\operatorname{Tok}(\cdot)$ denotes the tokenization function that segments the text; and $E_{\theta}$ denotes the embedding model encoder with parameters $\theta$.
Preferably, in step 3, the input data are image-text pairs $(I, t)$, wherein $I$ is the original image and $t$ the associated text; the formulas for preprocessing and embedding generation are as follows:

$$p = \operatorname{Softmax}\big(f_{\text{cls}}(I)\big), \qquad \hat{t} = \begin{cases} \mathcal{M}(\hat{I}, L_{\max}), & p > \tau \\ t, & \text{otherwise} \end{cases}$$

$$v_{\text{joint}} = \alpha \, v_{\text{img}} + \beta \, v_{\text{text}}, \qquad v_{\text{img}} = E_{\text{img}}(I) \in \mathbb{R}^{d_{\text{img}}}, \qquad v_{\text{text}} = W_p \, e_{\text{text}}, \qquad e_{\text{text}} = E_{\text{text}}(\hat{t}) \in \mathbb{R}^{d_{\text{text}}}$$

wherein $p$ indicates whether the image type belongs to a macroscopic image; $f_{\text{cls}}$ denotes the pre-trained image classification model; $\hat{I}$ denotes an image discriminated as macroscopic by the classification model; $\operatorname{Softmax}$ is the Softmax function; $\tau$ is the macroscopic-image threshold, and if $p > \tau$ the abstract generation is triggered; $\mathcal{M}$ denotes the multimodal large model; $L_{\max}$ is the maximum number of tokens of the abstract; $v_{\text{joint}}$ denotes the joint embedding vector of the image embedding vector $v_{\text{img}}$ and the aligned text embedding vector $v_{\text{text}}$; $\alpha$ and $\beta$ denote the modal weights of the image and text embeddings respectively; $e_{\text{text}}$ is the original text embedding vector, which is not dimension-aligned with the image vector; $W_p$ is the projection matrix for dimension alignment, whose parameters are obtained by linear regression training; $E_{\text{img}}$ and $E_{\text{text}}$ are the image and text embedding models respectively; and $d_{\text{img}}$ and $d_{\text{text}}$ are the image embedding dimension and the text embedding dimension respectively;
for the CAD vector-drawing format, a B-spline curve analysis is adopted to convert the drawing into vector form, and the feature vector is constructed as follows:

$$C(u) = \frac{\sum_{i=0}^{n} N_{i,k}(u)\, w_i\, P_i}{\sum_{i=0}^{n} N_{i,k}(u)\, w_i}, \qquad f = \big[\, U;\; P_0, \dots, P_n;\; w_0, \dots, w_n \,\big]$$

wherein $C(u)$ denotes a $k$-th degree B-spline curve; $N_{i,k}(u)$ are the $k$-th degree B-spline basis functions defined on the knot vector $U$; $f$ is the linearly converted feature vector, obtained by concatenating the curve parameters into a vector in a fixed order; $U$ denotes the knot vector; $P_i$ are the control points; and $w_i$ are the weight factors corresponding to the B-spline.
In step 4, the local framework is extended to a web page, a WeChat official account, a mini-program and third-party software; the core components comprise a protocol-adaptive API gateway and asynchronous message middleware;
multi-port access is realized with a unified API gateway architecture; a protocol conversion layer built into the gateway supports uniform conversion of heterogeneous HTTP/WebSocket protocols into the gRPC protocol, and the protocol conversion process is realized by an input normalization function $F(x)$ as follows:

$$F(x) = \begin{cases} \operatorname{JSON} \rightarrow \operatorname{ProtoBuf}, & x \in \mathcal{X}_{\text{HTTP}} \\ \operatorname{XML} \rightarrow \operatorname{ProtoBuf}, & x \in \mathcal{X}_{\text{WeChat}} \end{cases}$$

wherein $x$ denotes the input data; when $x$ comes from the HTTP protocol, i.e. $x \in \mathcal{X}_{\text{HTTP}}$, a JSON-to-ProtoBuf serialization conversion is performed; when the input data comes from the WeChat ecosystem, i.e. $x \in \mathcal{X}_{\text{WeChat}}$, an XML-to-ProtoBuf conversion is performed; for heterogeneous input data sets, the protocol conversion delay is controlled within 15 ms;
For high-concurrency scenarios, a message queue is introduced for asynchronous peak shaving, with the following message-processing delay model:

$$T_{\text{total}} = T_{\text{ser}} + \left\lceil \frac{N_{\text{msg}}}{P} \right\rceil \cdot t_c$$

wherein $T_{\text{total}}$ denotes the total message processing time; $T_{\text{ser}}$ denotes the serialization time for converting the original message data into a standardized format that can be transmitted or stored; $N_{\text{msg}}$ denotes the number of messages; $P$ is the number of partitions; and $t_c$ is the consumption delay of a single message.
The beneficial effects of the invention are as follows:
The invention applies hybrid RAG technology to the railway industry. Through server construction and local model deployment, data cleaning and processing, multimodal database construction and multi-platform deployment testing, it builds an intelligent question-answering system that integrates multimodal data such as railway design specifications, engineering cases, research reports and operation-and-maintenance reports. The system improves the efficiency and accuracy of knowledge acquisition for designers, deeply integrates multimodal information to support complex decisions and design optimization, ensures compliance and reduces design errors, and ultimately promotes the intelligent and digital transformation of the railway industry.
Detailed Description
For a further understanding of the present invention, the present invention will be described in detail with reference to the drawings and examples. It is to be understood that the examples are illustrative of the present invention and are not intended to be limiting.
Example 1:
As shown in fig. 1, this embodiment provides a method for constructing a railway multimodal knowledge-base question-answering system with a hybrid RAG architecture, comprising the following steps:
Step 1, deploying and fine-tuning a local model on a server;
The deployed models comprise a language model and a text embedding model. The programs of this embodiment are written in Python, accelerated with CUDA (Compute Unified Device Architecture), and data generation and processing, model deployment and inference are performed on a Linux server. The language model is an autoregressive language model with a typical Transformer architecture, pre-trained on massive text data to learn the statistical regularities, world knowledge and complex patterns of language. The core task of the text embedding model is to convert a piece of text into a fixed-length dense embedding vector (semantically similar texts have vectors that lie very close together in the vector space, e.g. under cosine similarity); such a model is indispensable for constructing an industry knowledge base.
Model fine-tuning refers to instruction fine-tuning a large language model to adapt it to specific tasks, and can be divided into two modes: parameter-efficient fine-tuning and full-parameter fine-tuning. Parameter-efficient fine-tuning (PEFT) updates only part of the model's parameters, minimizing the number of fine-tuned parameters and the computational complexity; it significantly reduces training time and cost and realizes efficient transfer learning. Building on the traditional LoRA fine-tuning method, the invention provides the DyLoRA (Dynamic Low-Rank Adaptation) fine-tuning framework, which integrates dynamic rank allocation, layer-wise adaptation and an expert routing mechanism, improving model performance while maintaining parameter efficiency; the formula is as follows:
$$h = W_0 x + \sum_{j=1}^{E} g_j \, B_j A_j^{\top} x, \qquad g_j = \frac{e^{\theta_j}}{\sum_{k=1}^{E} e^{\theta_k}}, \qquad r_l = r_{\min} + \big\lfloor (r_{\max} - r_{\min}) \cdot \lambda \cdot s_l \big\rfloor$$

wherein $h$ is the fine-tuned output of the model; $W_0$ denotes the pre-trained parameters, which remain unchanged during fine-tuning; $\theta_j$ denotes the learnable parameters of each expert group in the model's expert routing system; $B_j$ and $A_j$ are the two low-rank matrices trained during fine-tuning, one of which is usually initialized to zero while the other is initialized from a random Gaussian distribution, so that the mapping of the original model is unaffected at the start of fine-tuning; $r_l$ denotes the rank, with the minimum value $r_{\min}$ set to 8 and the maximum value $r_{\max}$ set to 64; $s_l$ denotes the sensitivity score of the $l$-th network layer; $\lambda$ is the sparse coefficient, set to 0.3; $\top$ denotes transposition and $e$ the natural constant.
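A minimal PyTorch-style sketch of such an adapted layer is given below; the low-rank structure, initialization and rank bounds follow the formula above, while the class name, the scalar sensitivity input and gating by learnable logits are illustrative assumptions rather than the patented implementation:

```python
import math
import torch
import torch.nn as nn

class DyLoRALinear(nn.Module):
    """Sketch of a DyLoRA-style layer: frozen pre-trained weight W0 plus
    expert-routed low-rank updates B_j A_j, with a layer-dependent rank."""

    def __init__(self, in_dim, out_dim, num_experts=4,
                 r_min=8, r_max=64, sensitivity=0.5, sparsity=0.3):
        super().__init__()
        self.W0 = nn.Linear(in_dim, out_dim, bias=False)
        self.W0.weight.requires_grad_(False)   # pre-trained part stays frozen

        # Dynamic rank allocation: a more sensitive layer gets a larger rank.
        r = r_min + int((r_max - r_min) * sparsity * sensitivity)

        # LoRA convention: A is Gaussian-initialized, B starts at zero, so the
        # adapter is a no-op at the beginning of fine-tuning.
        self.A = nn.ParameterList([nn.Parameter(torch.randn(r, in_dim) / math.sqrt(r))
                                   for _ in range(num_experts)])
        self.B = nn.ParameterList([nn.Parameter(torch.zeros(out_dim, r))
                                   for _ in range(num_experts)])
        self.gate = nn.Parameter(torch.zeros(num_experts))  # routing logits

    def forward(self, x):
        g = torch.softmax(self.gate, dim=0)     # softmax expert routing weights
        delta = sum(g[j] * (x @ self.A[j].T @ self.B[j].T)
                    for j in range(len(self.A)))
        return self.W0(x) + delta

layer = DyLoRALinear(1024, 1024)
y = layer(torch.randn(2, 1024))                 # output shape (2, 1024)
```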
The fine-tuning uses an instruction-supervised fine-tuning dataset comprising some ten thousand railway-industry corpus entries drawn from industry specifications, research reports and operation-and-maintenance reports; the dataset takes the form shown in the following formula:
$$\mathcal{D} = \{(I_i, x_i, y_i)\}_{i=1}^{N}$$

wherein $\mathcal{D}$ denotes a dataset consisting of $N$ samples; $(I_i, x_i, y_i)$ denotes each triplet sample; $I_i$ denotes the instruction of the $i$-th sample, guiding model behavior; $x_i$ denotes the input of the $i$-th sample, providing the specific context or question of the task; and $y_i$ denotes the output of the $i$-th sample, being the expected response or answer.
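For illustration, one such triplet might be serialized as follows; this is a hypothetical record in the common instruction/input/output style, not an actual entry from the corpus:

```python
# Hypothetical (instruction, input, output) triplet for supervised fine-tuning.
sample = {
    "instruction": "Answer the question according to the railway design specification.",
    "input": "What is the minimum curve radius required for this line category?",
    "output": "According to the applicable design specification, the minimum curve radius is ...",
}
```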
The core training parameters comprise the learning rate, batch size, number of training epochs, maximum number of samples and gradient clipping, and are adjusted flexibly according to the model size and the dataset size. Taking a 70B-parameter model as an example, dynamic rank allocation reduces the total number of trainable parameters from 0.161B to 0.102B, a reduction of about 37%, effectively lowering training energy consumption.
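Since the embodiment leaves the exact values to be tuned, a training configuration might be collected as follows; every value here is an illustrative placeholder:

```python
# Illustrative hyperparameters only; the embodiment adjusts these per
# model size and dataset size rather than fixing them.
train_config = {
    "learning_rate": 1e-4,
    "batch_size": 16,
    "num_epochs": 3,
    "max_samples": 10_000,
    "max_grad_norm": 1.0,   # gradient clipping threshold
}
```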
Step 2, data processing and database construction;
Railway-industry data are of many types and come in large quantities, including railway research reports and track operation-and-maintenance reports in docx and pdf format, design standards from across the industry, and design drawings, specification tables and pictures in various formats. To construct a standard industry multimodal knowledge base, data in different formats must be processed and optimized differently, as shown in fig. 2, specifically:
2.1 Data collection;
the data comprise pre-feasibility and feasibility study reports, operation-and-maintenance reports, industry design specifications, design drawings, specification tables, pictures and the like;
2.2 Data processing;
Text is processed mainly with a hybrid retrieval combining semantic vector retrieval and full-text sparse retrieval for semantic recognition and splitting, with similarity measures quantified. Tables are processed as text or through OCR recognition, or encoded with a dedicated table model before embedding. Pictures undergo text extraction and picture vectorization embedding; complex pictures or drawings are embedded into the vector database together with their associated text using a multimodal embedding model; and for macroscopic pictures a text abstract is generated by a multimodal large model before embedding and retrieval;
2.3 Data fusion and storage;
Texts, tables, pictures, vector drawings, matrices and formulas are fused and stored, and the fused data are loaded into the MongoDB content store and the PostgreSQL vector retrieval library, building a multi-level railway-industry data processing system and constructing the database.
After the language model and the text embedding model are deployed locally, the model interfaces are managed with an open-source project, and a railway-standard knowledge-base management system is established on a MongoDB content store and a PostgreSQL vector retrieval library; the database retrieval principle is shown in fig. 3. PostgreSQL provides a vector field for storing the vectors, while MongoDB stores the original data behind each vector. For example, when vector data 1 is retrieved, index vector 1 is recalled first, and original data content 1 is then looked up in MongoDB by the ID of index vector 1; for multiple groups of retrieved data, e.g. original data contents 2 and 3, the vector data 2-4 corresponding to indexes 2-4 are search-filtered, normalized and reordered based on hybrid retrieval and the RRF algorithm, and finally optimized and merged.
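A minimal sketch of this two-stage lookup, assuming a pgvector-enabled PostgreSQL table and a MongoDB collection that share a document ID; all connection strings, table, collection and field names are hypothetical:

```python
import psycopg2
from pymongo import MongoClient

def hybrid_lookup(query_vec, top_k=4):
    """Recall nearest vectors from PostgreSQL first, then fetch the
    original slice contents from MongoDB by the shared document ID."""
    vec_literal = "[" + ",".join(str(x) for x in query_vec) + "]"

    # Stage 1: vector recall via pgvector's distance operator (hypothetical schema).
    pg = psycopg2.connect("dbname=railway_kb user=kb")
    with pg.cursor() as cur:
        cur.execute(
            "SELECT doc_id FROM kb_vectors ORDER BY embedding <-> %s::vector LIMIT %s",
            (vec_literal, top_k),
        )
        ids = [row[0] for row in cur.fetchall()]

    # Stage 2: look up the original slice content in MongoDB by vector ID.
    mongo = MongoClient("mongodb://localhost:27017")
    docs = mongo["railway_kb"]["slices"].find({"doc_id": {"$in": ids}})
    return {d["doc_id"]: d["content"] for d in docs}
```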
The hybrid retrieval-augmented generation (hybrid RAG) adopted in this embodiment uses semantic vector retrieval and full-text sparse retrieval to compensate for each other's shortcomings, making retrieval results richer and more accurate while reducing the likelihood of large-model hallucination; the principle is shown in fig. 4.
Semantic vector retrieval obtains similarity by computing distances between vectors in an embedding space. Its advantages include semantic-similarity understanding, cross-lingual understanding (e.g. a Chinese question matching English knowledge points), convenient multimodal understanding and mapping, and tolerance of errors such as misspellings and fuzzy descriptions; its drawbacks are that its effectiveness depends on model training and its precision can be unstable. Full-text sparse retrieval adopts keyword matching based on a sparse scoring algorithm and is suited to exact matching of a small number of low-frequency terms. Search filtering is required during retrieval: a reference upper limit (at most n tokens of retrieved content are referenced each time) and a minimum relevance threshold (search results with low relevance are filtered out directly) are used to improve retrieval quality.
As shown in fig. 4, slices 1 and 2 are obtained through semantic vector retrieval and search filtering, and slices a and b through full-text sparse retrieval and search filtering. Once both sets of results are available, the document and vector similarity measures are normalized and the results reordered, and the relevant content can be fed to the large model as part of the prompt to serve as a reference for its output. Reordering is a method of combining several result sets with different relevance measures into a single result set; for example, if the normalized similarity of slice b obtained by sparse retrieval is 0.9, higher than that of slices 1 and 2 obtained by semantic retrieval, it can serve as the primary reference for the output.
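The RRF reordering referenced above can be sketched as follows; the constant k = 60 is the conventional RRF default, not a value fixed by this embodiment:

```python
def rrf_merge(result_lists, k=60):
    """Fuse several ranked lists (e.g. semantic slices [1, 2] and sparse
    slices [a, b]) into one ranking by reciprocal rank fusion."""
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

merged = rrf_merge([["slice1", "slice2"], ["slice_b", "slice1", "slice_a"]])
# ['slice1', 'slice_b', 'slice2', 'slice_a'] -- slice1 wins by appearing in both lists
```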
The semantic vector retrieval computes the semantic vector similarity score $S_{\text{vec}}(q,d)$ between a query item $q$ and a candidate document $d$ as follows:

$$S_{\text{vec}}(q, d) = \frac{v_q \cdot v_d}{\lVert v_q \rVert \, \lVert v_d \rVert}$$

wherein $v_q$ is the query vector and $v_d$ is the vector of each candidate document. The full-text sparse retrieval computes the sparse search score $S_{\text{sparse}}(q,d)$ of each candidate document $d$ as follows:

$$S_{\text{sparse}}(q, d) = \sum_{t \in q} \operatorname{IDF}(t) \cdot \frac{tf(t,d)\,(k_1 + 1)}{tf(t,d) + k_1 \left(1 - b + b \cdot \frac{|d|}{\mathrm{avgdl}}\right)}, \qquad \operatorname{IDF}(t) = \ln\!\left(\frac{N - n_t + 0.5}{n_t + 0.5} + 1\right)$$

wherein $t$ is a term of the query item $q$; $tf(t,d)$ is the frequency of term $t$ in document $d$; $|d|$ is the length of document $d$; $\mathrm{avgdl}$ is the average document length of the document collection; $k_1$ controls the saturation of term frequency and generally lies in the range 1.2-2.0, and the larger $k_1$, the greater the impact of $tf$; $b$ controls the strength of document-length normalization and generally lies in the range 0.5-0.8, with 1 meaning full normalization and 0 no normalization; $\operatorname{IDF}(t)$ is the inverse document frequency of term $t$; $N$ is the total number of documents in the collection; and $n_t$ is the number of documents containing the term.
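The sparse score above transcribes directly into Python; the sketch below recomputes corpus statistics on the fly, whereas a production system would read them from an inverted index:

```python
import math

def bm25_score(query_terms, doc_terms, corpus, k1=1.5, b=0.75):
    """Score one document against a query with the BM25 formula above."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    score = 0.0
    for t in query_terms:
        n_t = sum(1 for d in corpus if t in d)           # docs containing term t
        idf = math.log((N - n_t + 0.5) / (n_t + 0.5) + 1)
        tf = doc_terms.count(t)                          # term frequency in this doc
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc_terms) / avgdl))
    return score

corpus = [["ballast", "track"], ["rail", "fastener", "track"], ["bridge"]]
print(bm25_score(["track"], corpus[0], corpus))
```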
After the sparse scores and the vector similarity scores of all documents are obtained, the scores are normalized as follows:

$$\tilde{S}_{\text{sparse}}(q,d) = \frac{S_{\text{sparse}}(q,d) - \mu_{\text{sparse}}}{\sigma_{\text{sparse}}}, \qquad \tilde{S}_{\text{vec}}(q,d) = \frac{S_{\text{vec}}(q,d) - \mu_{\text{vec}}}{\sigma_{\text{vec}}}$$

wherein $q$ is the query item; $d$ ranges over all candidate documents; $\mu_{\text{sparse}}$ and $\mu_{\text{vec}}$ denote the mean of the sparse scores and of the vector similarity scores over all documents in the current candidate pool; and $\sigma_{\text{sparse}}$ and $\sigma_{\text{vec}}$ denote the standard deviation of the sparse scores and of the vector similarity scores over all documents in the current candidate pool.
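The normalization step likewise translates directly into code; the equal-weight fusion at the end is an assumption, since the embodiment does not fix the combination weights:

```python
import statistics

def z_normalize(scores):
    """Z-score normalization over the current candidate pool (formula above)."""
    mu = statistics.mean(scores.values())
    sigma = statistics.pstdev(scores.values()) or 1.0   # guard against zero spread
    return {doc: (s - mu) / sigma for doc, s in scores.items()}

sparse = {"slice_a": 12.4, "slice_b": 9.1, "slice1": 3.0}
vec = {"slice1": 0.91, "slice2": 0.82, "slice_b": 0.40}
norm_sparse, norm_vec = z_normalize(sparse), z_normalize(vec)

# Assumed equal-weight fusion of the two normalized score sets.
fused = {d: norm_sparse.get(d, 0.0) + norm_vec.get(d, 0.0)
         for d in set(sparse) | set(vec)}
```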
Text generally needs to be segmented and sliced according to certain rules and then converted into a slice format that supports semantic search; each data slice in the database can be trimmed and corrected independently. In addition, optimizing the knowledge-base structure includes merging and organizing similar related content, handling cases where a default question cannot be matched, optimizing table names, and fixing table-body matching errors; the knowledge-base structure is optimized case by case during the testing stage.
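Rule-based sectioning of this kind might be sketched as follows; the numbered-heading pattern and the slice-length limit are illustrative assumptions:

```python
import re

def slice_text(text, max_chars=500):
    """Split a document into retrieval slices: first on numbered headings
    (an assumed rule), then enforce a maximum slice length."""
    sections = re.split(r"\n(?=\d+(?:\.\d+)*\s)", text)   # e.g. "3.2 Track gauge"
    slices = []
    for sec in sections:
        sec = sec.strip()
        while len(sec) > max_chars:                       # oversize section: hard split
            slices.append(sec[:max_chars])
            sec = sec[max_chars:]
        if sec:
            slices.append(sec)
    return slices
```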
Because table data occur frequently and are harder to parse than text, they are generally optimized by simplifying the tables (e.g. reducing cell splits and adding symbol explanations to ease indexing) and converting tables into text (summarizing simple tables in natural language). For large tables, a dedicated table model can be used to encode the table before embedding.
Structural analysis is performed first, including identifying the row-column structure of the table, resolving merged cells and extracting hierarchical relations; semantic encoding follows, comprising a linearization function and a semantic embedding step, with the following formulas:

$$M = (r, c, R, C, m_h, m_v), \qquad \mathcal{L}(T) = \bigoplus_{i=1}^{R} \Big( \bigoplus_{j=1}^{C} t_{ij} \oplus \delta_{\text{cell}} \Big) \oplus \delta_{\text{col}}, \qquad e = E_{\theta}\big(\operatorname{Tok}(\mathcal{L}(T))\big)$$

wherein $M$ denotes the matrix of cells with initial row index $r$ and initial column index $c$; $R$ and $C$ denote the total row and column ranges of the logical area covered by the table cells; $m_h$ and $m_v$ are the numbers of horizontally and vertically merged cells respectively; $\mathcal{L}(\cdot)$ denotes the generating function of the linearization sequence; $T$ is the set of cell texts $t_{ij}$; $\oplus$ is the string concatenation operator; $\delta_{\text{col}}$ is the column separator; $\delta_{\text{cell}}$ is the cell separator; $\otimes$ denotes the tensor product operation; $e$ denotes the embedding vector of the text produced by the pre-trained embedding model; $\operatorname{Tok}(\cdot)$ denotes the tokenization function that segments the text; and $E_{\theta}$ denotes the embedding model encoder with parameters $\theta$.
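The linearization step can be sketched as follows; the separator tokens are assumptions standing in for the column and cell separators above, and merged-cell resolution is presumed to have happened during structural analysis:

```python
def linearize_table(rows, col_sep=" | ", row_sep=" ; "):
    """Flatten a parsed table (list of rows of cell strings) into one
    sequence that a text embedding model can encode, per the formula above."""
    return row_sep.join(col_sep.join(cell.strip() for cell in row) for row in rows)

table = [["Rail type", "Mass (kg/m)"], ["60 kg/m rail", "60.64"]]
sequence = linearize_table(table)
# "Rail type | Mass (kg/m) ; 60 kg/m rail | 60.64"
# embedding = embed_model.encode(sequence)   # hypothetical embedding call
```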
Step 3, constructing a multimodal knowledge base;
because the railway industry also includes design drawings and pictures in various formats, and large standard tables are better returned as pictures than as tables at indexing time, a multimodal knowledge base containing picture formats needs to be constructed.
The principle of embedding multimodal data is shown in fig. 5. For complex pictures or drawings, the picture, its caption and its introduction text are embedded together into the vector database using a multimodal embedding model, while the corresponding original picture and text are stored in the same slice of the database document; during hybrid similarity retrieval, once the related text is found by the index, the related picture in the database can be obtained, and the picture and the related text blocks are passed to the large model to generate an answer. For macroscopic pictures, a text abstract of the picture is generated by a multimodal large model, embedded and retrieved with the text embedding model, and the picture and text corresponding to the retrieved abstract are passed to the large model to generate an answer.
For input data given as image-text pairs $(I, t)$, wherein $I$ is the original image and $t$ the associated text, the formulas for preprocessing and embedding generation are as follows:

$$p = \operatorname{Softmax}\big(f_{\text{cls}}(I)\big), \qquad \hat{t} = \begin{cases} \mathcal{M}(\hat{I}, L_{\max}), & p > \tau \\ t, & \text{otherwise} \end{cases}$$

$$v_{\text{joint}} = \alpha \, v_{\text{img}} + \beta \, v_{\text{text}}, \qquad v_{\text{img}} = E_{\text{img}}(I) \in \mathbb{R}^{d_{\text{img}}}, \qquad v_{\text{text}} = W_p \, e_{\text{text}}, \qquad e_{\text{text}} = E_{\text{text}}(\hat{t}) \in \mathbb{R}^{d_{\text{text}}}$$

wherein $p$ indicates whether the image type belongs to a macroscopic image; $f_{\text{cls}}$ denotes the pre-trained image classification model; $\hat{I}$ denotes an image discriminated as macroscopic by the classification model; $\operatorname{Softmax}$ is the Softmax function; $\tau$ is the macroscopic-image threshold, set to 0.8, and if $p > \tau$ the abstract generation is triggered; $\mathcal{M}$ denotes the multimodal large model, e.g. qwen2.5-VL; $L_{\max}$, the maximum number of tokens of the abstract, is set to 1024; $v_{\text{joint}}$ denotes the joint embedding vector of the image embedding vector $v_{\text{img}}$ and the aligned text embedding vector $v_{\text{text}}$; $\alpha$ and $\beta$ denote the modal weights of the image and text embeddings respectively; $e_{\text{text}}$ is the original text embedding vector, which is not dimension-aligned with the image vector; $W_p$ is the projection matrix for dimension alignment, whose parameters are obtained by linear regression training; $E_{\text{img}}$ and $E_{\text{text}}$ are the image and text embedding models respectively; and $d_{\text{img}}$ and $d_{\text{text}}$ are the image embedding dimension and the text embedding dimension respectively;
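The routing logic of these formulas can be sketched as follows, with hypothetical wrappers standing in for the classifier, the multimodal large model and the two embedding models; only the control flow mirrors the formulas above:

```python
def embed_image_text(image, text, cls_model, mm_llm, img_enc, txt_enc,
                     W_p, tau=0.8, alpha=0.5, beta=0.5, max_tokens=1024):
    """Joint image-text embedding with macro-image summarization.
    All model interfaces here are hypothetical stand-ins."""
    p = cls_model.predict_proba(image)        # softmax score for "macro image"
    if p > tau:                               # macro image: summarize it first
        text = mm_llm.summarize(image, max_tokens=max_tokens)

    v_img = img_enc.encode(image)             # shape (d_img,)
    e_txt = txt_enc.encode(text)              # shape (d_text,)
    v_txt = W_p @ e_txt                       # project text into image dimensions
    return alpha * v_img + beta * v_txt       # joint embedding vector
```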
for the CAD vector-drawing format, a B-spline curve analysis is adopted to convert the drawing into vector form, and the feature vector is constructed as follows:

$$C(u) = \frac{\sum_{i=0}^{n} N_{i,k}(u)\, w_i\, P_i}{\sum_{i=0}^{n} N_{i,k}(u)\, w_i}, \qquad f = \big[\, U;\; P_0, \dots, P_n;\; w_0, \dots, w_n \,\big]$$

wherein $C(u)$ denotes a $k$-th degree B-spline curve; $N_{i,k}(u)$ are the $k$-th degree B-spline basis functions defined on the knot vector $U$; $f$ is the linearly converted feature vector, obtained by concatenating the curve parameters into a vector in a fixed order; $U$ denotes the knot vector; $P_i$ are the control points; and $w_i$ are the weight factors corresponding to the B-spline.
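Assembling the feature vector f from a parsed B-spline entity might look as follows; parsing the CAD file itself (e.g. with a DXF reader) is omitted, and the concatenation order shown is the assumed fixed order:

```python
import numpy as np

def bspline_feature_vector(knots, control_points, weights):
    """Concatenate B-spline parameters (knot vector, control points,
    weights) into one fixed-order feature vector, as in the formula above."""
    ctrl = np.asarray(control_points, dtype=float).ravel()   # (x0, y0, x1, y1, ...)
    return np.concatenate([np.asarray(knots, dtype=float), ctrl,
                           np.asarray(weights, dtype=float)])

# A hypothetical quadratic B-spline with three control points.
f = bspline_feature_vector(
    knots=[0, 0, 0, 1, 1, 1],
    control_points=[(0.0, 0.0), (1.0, 2.0), (3.0, 1.0)],
    weights=[1.0, 0.8, 1.0],
)
```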
Step 4, expanding to multiple platforms.
In step 4, the local framework is extended to a web page, a WeChat official account, a mini-program and third-party software; the core components comprise a protocol-adaptive API gateway and asynchronous message middleware;
multi-port access is realized with a unified API gateway architecture; a protocol conversion layer built into the gateway supports uniform conversion of heterogeneous HTTP/WebSocket protocols into the gRPC protocol, and the protocol conversion process is realized by an input normalization function $F(x)$ as follows:

$$F(x) = \begin{cases} \operatorname{JSON} \rightarrow \operatorname{ProtoBuf}, & x \in \mathcal{X}_{\text{HTTP}} \\ \operatorname{XML} \rightarrow \operatorname{ProtoBuf}, & x \in \mathcal{X}_{\text{WeChat}} \end{cases}$$

wherein $x$ denotes the input data; when $x$ comes from the HTTP protocol, i.e. $x \in \mathcal{X}_{\text{HTTP}}$, a JSON-to-ProtoBuf serialization conversion is performed; when the input data comes from the WeChat ecosystem, i.e. $x \in \mathcal{X}_{\text{WeChat}}$, an XML-to-ProtoBuf conversion is performed; for heterogeneous input data sets, the protocol conversion delay is controlled within 15 ms;
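The normalization function F(x) can be sketched as a dispatch on the input's source protocol; the internal field layout is hypothetical, and a real gateway would emit compiled ProtoBuf messages rather than a dict:

```python
import json
import xml.etree.ElementTree as ET

def normalize_input(raw, source):
    """Normalize heterogeneous gateway inputs to one internal dict from
    which a ProtoBuf message could be built (hypothetical field layout)."""
    if source == "http":                       # JSON payload -> internal form
        payload = json.loads(raw)
    elif source == "wechat":                   # WeChat XML message -> internal form
        root = ET.fromstring(raw)
        payload = {child.tag: child.text for child in root}
    else:
        raise ValueError(f"unsupported source protocol: {source}")
    return {"question": payload.get("Content") or payload.get("question"),
            "user": payload.get("FromUserName") or payload.get("user")}
```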
For high-concurrency scenarios, a message queue is introduced for asynchronous peak shaving, so that stable message throughput can be maintained under peak load; the message-processing delay model is:

$$T_{\text{total}} = T_{\text{ser}} + \left\lceil \frac{N_{\text{msg}}}{P} \right\rceil \cdot t_c$$

wherein $T_{\text{total}}$ denotes the total message processing time; $T_{\text{ser}}$ denotes the serialization time for converting the original message data into a standardized format that can be transmitted or stored; $N_{\text{msg}}$ denotes the number of messages; $P$ is the number of partitions, with 8 partitions configured by default; and $t_c$ is the consumption delay of a single message.
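The delay model reduces to a short calculation; the ceiling division below assumes messages are spread evenly over partitions and consumed in parallel:

```python
import math

def total_processing_time(t_serialize, n_messages, n_partitions, t_consume):
    """Total message processing time: serialization plus parallel
    consumption across partitions, per the delay model above."""
    return t_serialize + math.ceil(n_messages / n_partitions) * t_consume

# e.g. 5 ms serialization, 10_000 messages, the default 8 partitions,
# 2 ms consumption delay per message:
print(total_processing_time(5, 10_000, 8, 2))  # 5 + 1250 * 2 = 2505 ms
```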
The advantages of this embodiment are:
(1) A hybrid RAG architecture integrating semantic vector retrieval and full-text retrieval is provided; combined with a reordering algorithm such as RRF and a search-filtering mechanism, it markedly improves retrieval accuracy and fault tolerance and effectively mitigates the hallucination problem of large models. Unified embedding and cross-modal association of multimodal data (text, tables and drawings) are realized, establishing an industry knowledge-base framework that supports semantic understanding and provides multi-dimensional information support for complex decisions.
(2) A database system centered on MongoDB (content store) and PostgreSQL (vector retrieval library) is constructed, supporting efficient management and dynamic updating of heterogeneous data such as railway design specifications, operation-and-maintenance reports and engineering cases. Optimization strategies such as table simplification and image abstract generation solve the parsing and retrieval problems of non-text data and improve the practicality and coverage of the knowledge base.
(3) In scenario tests covering railway line design, track maintenance and train traction, the system accurately outputs standard clauses, defect treatment schemes and energy-saving technical grounds, ensuring that answers conform to industry standards and reducing deviations caused by manual experience. Multi-platform deployment (web, mobile and third-party software) is supported, enabling designers to query knowledge in real time across scenarios, improving design efficiency and compliance, providing an extensible technical paradigm for design optimization, fault prediction and operation-and-maintenance decision-making, and helping railway systems upgrade toward a digital, knowledge-driven model.
The invention and its embodiments have been described above by way of illustration and not limitation, and the actual structure is not limited to what is shown in the accompanying drawings. Therefore, if one of ordinary skill in the art, informed by this disclosure, adopts structural modes and embodiments similar to this technical scheme without creative design and without departing from the gist of the invention, they shall fall within the scope of protection of the invention.