Disclosure of Invention
The invention provides a method for constructing a railway multimodal knowledge-base question-answering system with a hybrid RAG architecture, enabling such a system to be constructed more effectively.
The invention relates to a method for constructing a railway multimodal knowledge-base question-answering system with a hybrid RAG architecture, comprising the following steps:
Step 1, deploying and fine-tuning a local model on a server;
The deployed models include a language model and a text embedding model;
the fine-tuning is based on the DyLoRA fine-tuning framework, which integrates dynamic rank allocation, layer-wise adaptation and an expert routing mechanism, improving model performance while maintaining parameter efficiency;
Step 2, data processing and database construction;
The data comprise railway research reports and track operation-and-maintenance reports in docx and pdf format, design standards from across the industry, and design drawings, specification tables and pictures in various formats;
After the language model and the text embedding model are deployed locally, the model interfaces are managed with an open-source project, and a railway-standard knowledge-base management system is established on a MongoDB content store and a PostgreSQL vector retrieval library; PostgreSQL provides a vector field for storing the vectors, while MongoDB stores the original data behind each vector; at retrieval time the vectors are recalled first, and the original data content is then looked up in MongoDB by the vector's ID;
Step 3, constructing a multimodal knowledge base;
pictures and their associated texts are embedded into the vector database with a multimodal embedding model, while the corresponding original pictures and texts are kept in document storage; during hybrid similarity retrieval the matching pictures are fetched directly from document storage, and the original pictures and text blocks are passed to the large model to generate an answer;
Step 4, expanding to multiple platforms.
Preferably, in step 1, the language model is an autoregressive language model with a Transformer architecture, pre-trained on massive text data to learn the statistical regularities, world knowledge and complex patterns of language; the core task of the text embedding model is to convert a piece of text into a fixed-length dense embedding vector.
Preferably, in step 1, the formula of the DyLoRA fine-tuning framework is as follows:
$$h = W_0 x + \sum_{j=1}^{E} g_j \, B_j A_j^{\top} x, \qquad g_j = \frac{e^{\theta_j}}{\sum_{k=1}^{E} e^{\theta_k}}, \qquad r_l = r_{\min} + \big\lfloor (r_{\max} - r_{\min}) \cdot \lambda \cdot s_l \big\rfloor$$

wherein $h$ is the fine-tuned output of the model; $W_0$ denotes the pre-trained parameters, which remain unchanged during fine-tuning; $\theta_j$ denotes the learnable parameters of each expert group in the model's expert routing system; $B_j$ and $A_j$ are the two low-rank matrices trained during fine-tuning; $r_l$ denotes the rank, with $r_{\min}$ its minimum value and $r_{\max}$ its maximum value; $s_l$ denotes the sensitivity score of the $l$-th network layer; $\lambda$ is the sparse coefficient; $\top$ denotes transposition and $e$ the natural constant.
Preferably, the fine-tuning uses an instruction-supervised fine-tuning dataset comprising some ten thousand railway-industry corpus entries drawn from industry specifications, research reports and operation-and-maintenance reports; the dataset takes the form shown in the following formula:
$$\mathcal{D} = \{(I_i, x_i, y_i)\}_{i=1}^{N}$$

wherein $\mathcal{D}$ denotes a dataset consisting of $N$ samples; $(I_i, x_i, y_i)$ denotes each triplet sample; $I_i$ denotes the instruction of the $i$-th sample, guiding model behavior; $x_i$ denotes the input of the $i$-th sample, providing the specific context or question of the task; $y_i$ denotes the output of the $i$-th sample, being the expected response or answer;
the core training parameters comprise the learning rate, batch size, number of training epochs, maximum number of samples and gradient clipping, and are adjusted flexibly according to the model size and the dataset size.
Preferably, the hybrid retrieval comprises semantic vector retrieval and full-text sparse retrieval: the semantic vector retrieval obtains similarity by computing distances between vectors in an embedding space, while the full-text sparse retrieval adopts keyword matching based on a sparse scoring algorithm;
After the results of the two retrievals are obtained, they are normalized, reordered and search-filtered, and the relevant content serves as the reference for the large model's output; here reordering is a method of combining several result sets with different relevance measures into a single result set;
wherein the semantic vector retrieval computes the semantic vector similarity score $S_{\text{vec}}(q,d)$ between a query item $q$ and a candidate document $d$ as follows:

$$S_{\text{vec}}(q, d) = \frac{v_q \cdot v_d}{\lVert v_q \rVert \, \lVert v_d \rVert}$$

wherein $v_q$ is the query vector and $v_d$ is the vector of each candidate document; the full-text sparse retrieval computes the sparse search score $S_{\text{sparse}}(q,d)$ of each candidate document $d$ as follows:

$$S_{\text{sparse}}(q, d) = \sum_{t \in q} \operatorname{IDF}(t) \cdot \frac{tf(t,d)\,(k_1 + 1)}{tf(t,d) + k_1 \left(1 - b + b \cdot \frac{|d|}{\mathrm{avgdl}}\right)}, \qquad \operatorname{IDF}(t) = \ln\!\left(\frac{N - n_t + 0.5}{n_t + 0.5} + 1\right)$$

wherein $t$ is a term of the query item $q$; $tf(t,d)$ is the frequency of term $t$ in document $d$; $|d|$ is the length of document $d$; $\mathrm{avgdl}$ is the average document length of the document collection; $k_1$ controls the saturation of term frequency, and the larger $k_1$, the greater the impact of $tf$; $b$ controls the strength of document-length normalization; $\operatorname{IDF}(t)$ is the inverse document frequency of term $t$; $N$ is the total number of documents in the collection; and $n_t$ is the number of documents containing the term;
after the sparse scores and the vector similarity scores of all documents are obtained, the scores are normalized as follows:

$$\tilde{S}_{\text{sparse}}(q,d) = \frac{S_{\text{sparse}}(q,d) - \mu_{\text{sparse}}}{\sigma_{\text{sparse}}}, \qquad \tilde{S}_{\text{vec}}(q,d) = \frac{S_{\text{vec}}(q,d) - \mu_{\text{vec}}}{\sigma_{\text{vec}}}$$

wherein $q$ is the query item; $d$ ranges over all candidate documents; $\mu_{\text{sparse}}$ and $\mu_{\text{vec}}$ denote the mean of the sparse scores and of the vector similarity scores over all documents in the current candidate pool; and $\sigma_{\text{sparse}}$ and $\sigma_{\text{vec}}$ denote the standard deviation of the sparse scores and of the vector similarity scores over all documents in the current candidate pool.
In step 2, the text is split into sections according to rules and then converted into a slice format that supports semantic search, and each data slice in the database can be trimmed and corrected independently.
In step 2, table data are optimized by simplifying the tables and converting them into text; for large-scale tables, a dedicated table model is adopted to encode the table before embedding;
structural analysis is performed first, including identifying the row-column structure of the table, resolving merged cells and extracting hierarchical relations; semantic encoding follows, comprising a linearization function and a semantic embedding step, with the following formulas:

$$M = (r, c, R, C, m_h, m_v), \qquad \mathcal{L}(T) = \bigoplus_{i=1}^{R} \Big( \bigoplus_{j=1}^{C} t_{ij} \oplus \delta_{\text{cell}} \Big) \oplus \delta_{\text{col}}, \qquad e = E_{\theta}\big(\operatorname{Tok}(\mathcal{L}(T))\big)$$

wherein $M$ denotes the matrix of cells with initial row index $r$ and initial column index $c$; $R$ and $C$ denote the total row and column ranges of the logical area covered by the table cells; $m_h$ and $m_v$ are the numbers of horizontally and vertically merged cells respectively; $\mathcal{L}(\cdot)$ denotes the generating function of the linearization sequence; $T$ is the set of cell texts $t_{ij}$; $\oplus$ is the string concatenation operator; $\delta_{\text{col}}$ is the column separator; $\delta_{\text{cell}}$ is the cell separator; $\otimes$ denotes the tensor product operation; $e$ denotes the embedding vector of the text produced by the pre-trained embedding model; $\operatorname{Tok}(\cdot)$ denotes the tokenization function that segments the text; and $E_{\theta}$ denotes the embedding model encoder with parameters $\theta$.
Preferably, in step 3, the input data are image-text pairs $(I, t)$, wherein $I$ is the original image and $t$ the associated text; the formulas for preprocessing and embedding generation are as follows:

$$p = \operatorname{Softmax}\big(f_{\text{cls}}(I)\big), \qquad \hat{t} = \begin{cases} \mathcal{M}(\hat{I}, L_{\max}), & p > \tau \\ t, & \text{otherwise} \end{cases}$$

$$v_{\text{joint}} = \alpha \, v_{\text{img}} + \beta \, v_{\text{text}}, \qquad v_{\text{img}} = E_{\text{img}}(I) \in \mathbb{R}^{d_{\text{img}}}, \qquad v_{\text{text}} = W_p \, e_{\text{text}}, \qquad e_{\text{text}} = E_{\text{text}}(\hat{t}) \in \mathbb{R}^{d_{\text{text}}}$$

wherein $p$ indicates whether the image type belongs to a macroscopic image; $f_{\text{cls}}$ denotes the pre-trained image classification model; $\hat{I}$ denotes an image discriminated as macroscopic by the classification model; $\operatorname{Softmax}$ is the Softmax function; $\tau$ is the macroscopic-image threshold, and if $p > \tau$ the abstract generation is triggered; $\mathcal{M}$ denotes the multimodal large model; $L_{\max}$ is the maximum number of tokens of the abstract; $v_{\text{joint}}$ denotes the joint embedding vector of the image embedding vector $v_{\text{img}}$ and the aligned text embedding vector $v_{\text{text}}$; $\alpha$ and $\beta$ denote the modal weights of the image and text embeddings respectively; $e_{\text{text}}$ is the original text embedding vector, which is not dimension-aligned with the image vector; $W_p$ is the projection matrix for dimension alignment, whose parameters are obtained by linear regression training; $E_{\text{img}}$ and $E_{\text{text}}$ are the image and text embedding models respectively; and $d_{\text{img}}$ and $d_{\text{text}}$ are the image embedding dimension and the text embedding dimension respectively;
for the CAD vector-drawing format, a B-spline curve analysis is adopted to convert the drawing into vector form, and the feature vector is constructed as follows:

$$C(u) = \frac{\sum_{i=0}^{n} N_{i,k}(u)\, w_i\, P_i}{\sum_{i=0}^{n} N_{i,k}(u)\, w_i}, \qquad f = \big[\, U;\; P_0, \dots, P_n;\; w_0, \dots, w_n \,\big]$$

wherein $C(u)$ denotes a $k$-th degree B-spline curve; $N_{i,k}(u)$ are the $k$-th degree B-spline basis functions defined on the knot vector $U$; $f$ is the linearly converted feature vector, obtained by concatenating the curve parameters into a vector in a fixed order; $U$ denotes the knot vector; $P_i$ are the control points; and $w_i$ are the weight factors corresponding to the B-spline.
In step 4, the local framework is extended to a web page, a WeChat official account, a mini-program and third-party software; the core components comprise a protocol-adaptive API gateway and asynchronous message middleware;
multi-port access is realized with a unified API gateway architecture; a protocol conversion layer built into the gateway supports uniform conversion of heterogeneous HTTP/WebSocket protocols into the gRPC protocol, and the protocol conversion process is realized by an input normalization function $F(x)$ as follows:

$$F(x) = \begin{cases} \operatorname{JSON} \rightarrow \operatorname{ProtoBuf}, & x \in \mathcal{X}_{\text{HTTP}} \\ \operatorname{XML} \rightarrow \operatorname{ProtoBuf}, & x \in \mathcal{X}_{\text{WeChat}} \end{cases}$$

wherein $x$ denotes the input data; when $x$ comes from the HTTP protocol, i.e. $x \in \mathcal{X}_{\text{HTTP}}$, a JSON-to-ProtoBuf serialization conversion is performed; when the input data comes from the WeChat ecosystem, i.e. $x \in \mathcal{X}_{\text{WeChat}}$, an XML-to-ProtoBuf conversion is performed; for heterogeneous input data sets, the protocol conversion delay is controlled within 15 ms;
For high-concurrency scenarios, a message queue is introduced for asynchronous peak shaving, with the following message-processing delay model:

$$T_{\text{total}} = T_{\text{ser}} + \left\lceil \frac{N_{\text{msg}}}{P} \right\rceil \cdot t_c$$

wherein $T_{\text{total}}$ denotes the total message processing time; $T_{\text{ser}}$ denotes the serialization time for converting the original message data into a standardized format that can be transmitted or stored; $N_{\text{msg}}$ denotes the number of messages; $P$ is the number of partitions; and $t_c$ is the consumption delay of a single message.
The beneficial effects of the invention are as follows:
The invention applies hybrid RAG technology to the railway industry. Through server construction and local model deployment, data cleaning and processing, multimodal database construction and multi-platform deployment testing, it builds an intelligent question-answering system that integrates multimodal data such as railway design specifications, engineering cases, research reports and operation-and-maintenance reports. The system improves the efficiency and accuracy of knowledge acquisition for designers, deeply integrates multimodal information to support complex decisions and design optimization, ensures compliance and reduces design errors, and ultimately promotes the intelligent and digital transformation of the railway industry.
Detailed Description
For a further understanding of the present invention, the present invention will be described in detail with reference to the drawings and examples. It is to be understood that the examples are illustrative of the present invention and are not intended to be limiting.
Example 1:
As shown in fig. 1, this embodiment provides a method for constructing a railway multimodal knowledge-base question-answering system with a hybrid RAG architecture, comprising the following steps:
Step 1, deploying and fine-tuning a local model on a server;
The deployed models comprise a language model and a text embedding model. The programs of this embodiment are written in Python, accelerated with CUDA (Compute Unified Device Architecture), and data generation and processing, model deployment and inference are performed on a Linux server. The language model is an autoregressive language model with a typical Transformer architecture, pre-trained on massive text data to learn the statistical regularities, world knowledge and complex patterns of language. The core task of the text embedding model is to convert a piece of text into a fixed-length dense embedding vector (semantically similar texts have vectors that lie very close together in the vector space, e.g. under cosine similarity); such a model is indispensable for constructing an industry knowledge base.
Model fine-tuning refers to instruction fine-tuning a large language model to adapt it to specific tasks, and can be divided into two modes: parameter-efficient fine-tuning and full-parameter fine-tuning. Parameter-efficient fine-tuning (PEFT) updates only part of the model's parameters, minimizing the number of fine-tuned parameters and the computational complexity; it significantly reduces training time and cost and realizes efficient transfer learning. Building on the traditional LoRA fine-tuning method, the invention provides the DyLoRA (Dynamic Low-Rank Adaptation) fine-tuning framework, which integrates dynamic rank allocation, layer-wise adaptation and an expert routing mechanism, improving model performance while maintaining parameter efficiency; the formula is as follows:
$$h = W_0 x + \sum_{j=1}^{E} g_j \, B_j A_j^{\top} x, \qquad g_j = \frac{e^{\theta_j}}{\sum_{k=1}^{E} e^{\theta_k}}, \qquad r_l = r_{\min} + \big\lfloor (r_{\max} - r_{\min}) \cdot \lambda \cdot s_l \big\rfloor$$

wherein $h$ is the fine-tuned output of the model; $W_0$ denotes the pre-trained parameters, which remain unchanged during fine-tuning; $\theta_j$ denotes the learnable parameters of each expert group in the model's expert routing system; $B_j$ and $A_j$ are the two low-rank matrices trained during fine-tuning, one of which is usually initialized to zero while the other is initialized from a random Gaussian distribution, so that the mapping of the original model is unaffected at the start of fine-tuning; $r_l$ denotes the rank, with the minimum value $r_{\min}$ set to 8 and the maximum value $r_{\max}$ set to 64; $s_l$ denotes the sensitivity score of the $l$-th network layer; $\lambda$ is the sparse coefficient, set to 0.3; $\top$ denotes transposition and $e$ the natural constant.
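A minimal PyTorch-style sketch of such an adapted layer is given below; the low-rank structure, initialization and rank bounds follow the formula above, while the class name, the scalar sensitivity input and gating by learnable logits are illustrative assumptions rather than the patented implementation:

```python
import math
import torch
import torch.nn as nn

class DyLoRALinear(nn.Module):
    """Sketch of a DyLoRA-style layer: frozen pre-trained weight W0 plus
    expert-routed low-rank updates B_j A_j, with a layer-dependent rank."""

    def __init__(self, in_dim, out_dim, num_experts=4,
                 r_min=8, r_max=64, sensitivity=0.5, sparsity=0.3):
        super().__init__()
        self.W0 = nn.Linear(in_dim, out_dim, bias=False)
        self.W0.weight.requires_grad_(False)   # pre-trained part stays frozen

        # Dynamic rank allocation: a more sensitive layer gets a larger rank.
        r = r_min + int((r_max - r_min) * sparsity * sensitivity)

        # LoRA convention: A is Gaussian-initialized, B starts at zero, so the
        # adapter is a no-op at the beginning of fine-tuning.
        self.A = nn.ParameterList([nn.Parameter(torch.randn(r, in_dim) / math.sqrt(r))
                                   for _ in range(num_experts)])
        self.B = nn.ParameterList([nn.Parameter(torch.zeros(out_dim, r))
                                   for _ in range(num_experts)])
        self.gate = nn.Parameter(torch.zeros(num_experts))  # routing logits

    def forward(self, x):
        g = torch.softmax(self.gate, dim=0)     # softmax expert routing weights
        delta = sum(g[j] * (x @ self.A[j].T @ self.B[j].T)
                    for j in range(len(self.A)))
        return self.W0(x) + delta

layer = DyLoRALinear(1024, 1024)
y = layer(torch.randn(2, 1024))                 # output shape (2, 1024)
```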
The fine-tuning uses an instruction-supervised fine-tuning dataset comprising some ten thousand railway-industry corpus entries drawn from industry specifications, research reports and operation-and-maintenance reports; the dataset takes the form shown in the following formula:
$$\mathcal{D} = \{(I_i, x_i, y_i)\}_{i=1}^{N}$$

wherein $\mathcal{D}$ denotes a dataset consisting of $N$ samples; $(I_i, x_i, y_i)$ denotes each triplet sample; $I_i$ denotes the instruction of the $i$-th sample, guiding model behavior; $x_i$ denotes the input of the $i$-th sample, providing the specific context or question of the task; and $y_i$ denotes the output of the $i$-th sample, being the expected response or answer.
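For illustration, one such triplet might be serialized as follows; this is a hypothetical record in the common instruction/input/output style, not an actual entry from the corpus:

```python
# Hypothetical (instruction, input, output) triplet for supervised fine-tuning.
sample = {
    "instruction": "Answer the question according to the railway design specification.",
    "input": "What is the minimum curve radius required for this line category?",
    "output": "According to the applicable design specification, the minimum curve radius is ...",
}
```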
The core training parameters comprise the learning rate, batch size, number of training epochs, maximum number of samples and gradient clipping, and are adjusted flexibly according to the model size and the dataset size. Taking a 70B-parameter model as an example, dynamic rank allocation reduces the total number of trainable parameters from 0.161B to 0.102B, a reduction of about 37%, effectively lowering training energy consumption.
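Since the embodiment leaves the exact values to be tuned, a training configuration might be collected as follows; every value here is an illustrative placeholder:

```python
# Illustrative hyperparameters only; the embodiment adjusts these per
# model size and dataset size rather than fixing them.
train_config = {
    "learning_rate": 1e-4,
    "batch_size": 16,
    "num_epochs": 3,
    "max_samples": 10_000,
    "max_grad_norm": 1.0,   # gradient clipping threshold
}
```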
Step 2, data processing and database construction;
Railway-industry data are of many types and come in large quantities, including railway research reports and track operation-and-maintenance reports in docx and pdf format, design standards from across the industry, and design drawings, specification tables and pictures in various formats. To construct a standard industry multimodal knowledge base, data in different formats must be processed and optimized differently, as shown in fig. 2, specifically:
2.1 Data collection;
the data comprise pre-feasibility and feasibility study reports, operation-and-maintenance reports, industry design specifications, design drawings, specification tables, pictures and the like;
2.2 Data processing;
Text is processed mainly with a hybrid retrieval combining semantic vector retrieval and full-text sparse retrieval for semantic recognition and splitting, with similarity measures quantified. Tables are processed as text or through OCR recognition, or encoded with a dedicated table model before embedding. Pictures undergo text extraction and picture vectorization embedding; complex pictures or drawings are embedded into the vector database together with their associated text using a multimodal embedding model; and for macroscopic pictures a text abstract is generated by a multimodal large model before embedding and retrieval;
2.3 Data fusion and storage;
Texts, tables, pictures, vector drawings, matrices and formulas are fused and stored, and the fused data are loaded into the MongoDB content store and the PostgreSQL vector retrieval library, building a multi-level railway-industry data processing system and constructing the database.
After the language model and the text embedding model are deployed locally, the model interfaces are managed with an open-source project, and a railway-standard knowledge-base management system is established on a MongoDB content store and a PostgreSQL vector retrieval library; the database retrieval principle is shown in fig. 3. PostgreSQL provides a vector field for storing the vectors, while MongoDB stores the original data behind each vector. For example, when vector data 1 is retrieved, index vector 1 is recalled first, and original data content 1 is then looked up in MongoDB by the ID of index vector 1; for multiple groups of retrieved data, e.g. original data contents 2 and 3, the vector data 2-4 corresponding to indexes 2-4 are search-filtered, normalized and reordered based on hybrid retrieval and the RRF algorithm, and finally optimized and merged.
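A minimal sketch of this two-stage lookup, assuming a pgvector-enabled PostgreSQL table and a MongoDB collection that share a document ID; all connection strings, table, collection and field names are hypothetical:

```python
import psycopg2
from pymongo import MongoClient

def hybrid_lookup(query_vec, top_k=4):
    """Recall nearest vectors from PostgreSQL first, then fetch the
    original slice contents from MongoDB by the shared document ID."""
    vec_literal = "[" + ",".join(str(x) for x in query_vec) + "]"

    # Stage 1: vector recall via pgvector's distance operator (hypothetical schema).
    pg = psycopg2.connect("dbname=railway_kb user=kb")
    with pg.cursor() as cur:
        cur.execute(
            "SELECT doc_id FROM kb_vectors ORDER BY embedding <-> %s::vector LIMIT %s",
            (vec_literal, top_k),
        )
        ids = [row[0] for row in cur.fetchall()]

    # Stage 2: look up the original slice content in MongoDB by vector ID.
    mongo = MongoClient("mongodb://localhost:27017")
    docs = mongo["railway_kb"]["slices"].find({"doc_id": {"$in": ids}})
    return {d["doc_id"]: d["content"] for d in docs}
```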
The hybrid retrieval-augmented generation (hybrid RAG) adopted in this embodiment uses semantic vector retrieval and full-text sparse retrieval to compensate for each other's shortcomings, making retrieval results richer and more accurate while reducing the likelihood of large-model hallucination; the principle is shown in fig. 4.
Semantic vector retrieval obtains similarity by computing distances between vectors in an embedding space. Its advantages include semantic-similarity understanding, cross-lingual understanding (e.g. a Chinese question matching English knowledge points), convenient multimodal understanding and mapping, and tolerance of errors such as misspellings and fuzzy descriptions; its drawbacks are that its effectiveness depends on model training and its precision can be unstable. Full-text sparse retrieval adopts keyword matching based on a sparse scoring algorithm and is suited to exact matching of a small number of low-frequency terms. Search filtering is required during retrieval: a reference upper limit (at most n tokens of retrieved content are referenced each time) and a minimum relevance threshold (search results with low relevance are filtered out directly) are used to improve retrieval quality.
As shown in fig. 4, slices 1 and 2 are obtained through semantic vector retrieval and search filtering, and slices a and b through full-text sparse retrieval and search filtering. Once both sets of results are available, the document and vector similarity measures are normalized and the results reordered, and the relevant content can be fed to the large model as part of the prompt to serve as a reference for its output. Reordering is a method of combining several result sets with different relevance measures into a single result set; for example, if the normalized similarity of slice b obtained by sparse retrieval is 0.9, higher than that of slices 1 and 2 obtained by semantic retrieval, it can serve as the primary reference for the output.
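The RRF reordering referenced above can be sketched as follows; the constant k = 60 is the conventional RRF default, not a value fixed by this embodiment:

```python
def rrf_merge(result_lists, k=60):
    """Fuse several ranked lists (e.g. semantic slices [1, 2] and sparse
    slices [a, b]) into one ranking by reciprocal rank fusion."""
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

merged = rrf_merge([["slice1", "slice2"], ["slice_b", "slice1", "slice_a"]])
# ['slice1', 'slice_b', 'slice2', 'slice_a'] -- slice1 wins by appearing in both lists
```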
The semantic vector retrieval computes the semantic vector similarity score $S_{\text{vec}}(q,d)$ between a query item $q$ and a candidate document $d$ as follows:

$$S_{\text{vec}}(q, d) = \frac{v_q \cdot v_d}{\lVert v_q \rVert \, \lVert v_d \rVert}$$

wherein $v_q$ is the query vector and $v_d$ is the vector of each candidate document. The full-text sparse retrieval computes the sparse search score $S_{\text{sparse}}(q,d)$ of each candidate document $d$ as follows:

$$S_{\text{sparse}}(q, d) = \sum_{t \in q} \operatorname{IDF}(t) \cdot \frac{tf(t,d)\,(k_1 + 1)}{tf(t,d) + k_1 \left(1 - b + b \cdot \frac{|d|}{\mathrm{avgdl}}\right)}, \qquad \operatorname{IDF}(t) = \ln\!\left(\frac{N - n_t + 0.5}{n_t + 0.5} + 1\right)$$

wherein $t$ is a term of the query item $q$; $tf(t,d)$ is the frequency of term $t$ in document $d$; $|d|$ is the length of document $d$; $\mathrm{avgdl}$ is the average document length of the document collection; $k_1$ controls the saturation of term frequency and generally lies in the range 1.2-2.0, and the larger $k_1$, the greater the impact of $tf$; $b$ controls the strength of document-length normalization and generally lies in the range 0.5-0.8, with 1 meaning full normalization and 0 no normalization; $\operatorname{IDF}(t)$ is the inverse document frequency of term $t$; $N$ is the total number of documents in the collection; and $n_t$ is the number of documents containing the term.
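The sparse score above transcribes directly into Python; the sketch below recomputes corpus statistics on the fly, whereas a production system would read them from an inverted index:

```python
import math

def bm25_score(query_terms, doc_terms, corpus, k1=1.5, b=0.75):
    """Score one document against a query with the BM25 formula above."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    score = 0.0
    for t in query_terms:
        n_t = sum(1 for d in corpus if t in d)           # docs containing term t
        idf = math.log((N - n_t + 0.5) / (n_t + 0.5) + 1)
        tf = doc_terms.count(t)                          # term frequency in this doc
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc_terms) / avgdl))
    return score

corpus = [["ballast", "track"], ["rail", "fastener", "track"], ["bridge"]]
print(bm25_score(["track"], corpus[0], corpus))
```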
After the sparse scores and the vector similarity scores of all documents are obtained, the scores are normalized as follows:

$$\tilde{S}_{\text{sparse}}(q,d) = \frac{S_{\text{sparse}}(q,d) - \mu_{\text{sparse}}}{\sigma_{\text{sparse}}}, \qquad \tilde{S}_{\text{vec}}(q,d) = \frac{S_{\text{vec}}(q,d) - \mu_{\text{vec}}}{\sigma_{\text{vec}}}$$

wherein $q$ is the query item; $d$ ranges over all candidate documents; $\mu_{\text{sparse}}$ and $\mu_{\text{vec}}$ denote the mean of the sparse scores and of the vector similarity scores over all documents in the current candidate pool; and $\sigma_{\text{sparse}}$ and $\sigma_{\text{vec}}$ denote the standard deviation of the sparse scores and of the vector similarity scores over all documents in the current candidate pool.
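The normalization step likewise translates directly into code; the equal-weight fusion at the end is an assumption, since the embodiment does not fix the combination weights:

```python
import statistics

def z_normalize(scores):
    """Z-score normalization over the current candidate pool (formula above)."""
    mu = statistics.mean(scores.values())
    sigma = statistics.pstdev(scores.values()) or 1.0   # guard against zero spread
    return {doc: (s - mu) / sigma for doc, s in scores.items()}

sparse = {"slice_a": 12.4, "slice_b": 9.1, "slice1": 3.0}
vec = {"slice1": 0.91, "slice2": 0.82, "slice_b": 0.40}
norm_sparse, norm_vec = z_normalize(sparse), z_normalize(vec)

# Assumed equal-weight fusion of the two normalized score sets.
fused = {d: norm_sparse.get(d, 0.0) + norm_vec.get(d, 0.0)
         for d in set(sparse) | set(vec)}
```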
Text generally needs to be segmented and sliced according to certain rules and then converted into a slice format that supports semantic search; each data slice in the database can be trimmed and corrected independently. In addition, optimizing the knowledge-base structure includes merging and organizing similar related content, handling cases where a default question cannot be matched, optimizing table names, and fixing table-body matching errors; the knowledge-base structure is optimized case by case during the testing stage.
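Rule-based sectioning of this kind might be sketched as follows; the numbered-heading pattern and the slice-length limit are illustrative assumptions:

```python
import re

def slice_text(text, max_chars=500):
    """Split a document into retrieval slices: first on numbered headings
    (an assumed rule), then enforce a maximum slice length."""
    sections = re.split(r"\n(?=\d+(?:\.\d+)*\s)", text)   # e.g. "3.2 Track gauge"
    slices = []
    for sec in sections:
        sec = sec.strip()
        while len(sec) > max_chars:                       # oversize section: hard split
            slices.append(sec[:max_chars])
            sec = sec[max_chars:]
        if sec:
            slices.append(sec)
    return slices
```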
Because table data occur frequently and are harder to parse than text, they are generally optimized by simplifying the tables (e.g. reducing cell splits and adding symbol explanations to ease indexing) and converting tables into text (summarizing simple tables in natural language). For large tables, a dedicated table model can be used to encode the table before embedding.
Structural analysis is performed first, including identifying the row-column structure of the table, resolving merged cells and extracting hierarchical relations; semantic encoding follows, comprising a linearization function and a semantic embedding step, with the following formulas:

$$M = (r, c, R, C, m_h, m_v), \qquad \mathcal{L}(T) = \bigoplus_{i=1}^{R} \Big( \bigoplus_{j=1}^{C} t_{ij} \oplus \delta_{\text{cell}} \Big) \oplus \delta_{\text{col}}, \qquad e = E_{\theta}\big(\operatorname{Tok}(\mathcal{L}(T))\big)$$

wherein $M$ denotes the matrix of cells with initial row index $r$ and initial column index $c$; $R$ and $C$ denote the total row and column ranges of the logical area covered by the table cells; $m_h$ and $m_v$ are the numbers of horizontally and vertically merged cells respectively; $\mathcal{L}(\cdot)$ denotes the generating function of the linearization sequence; $T$ is the set of cell texts $t_{ij}$; $\oplus$ is the string concatenation operator; $\delta_{\text{col}}$ is the column separator; $\delta_{\text{cell}}$ is the cell separator; $\otimes$ denotes the tensor product operation; $e$ denotes the embedding vector of the text produced by the pre-trained embedding model; $\operatorname{Tok}(\cdot)$ denotes the tokenization function that segments the text; and $E_{\theta}$ denotes the embedding model encoder with parameters $\theta$.
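The linearization step can be sketched as follows; the separator tokens are assumptions standing in for the column and cell separators above, and merged-cell resolution is presumed to have happened during structural analysis:

```python
def linearize_table(rows, col_sep=" | ", row_sep=" ; "):
    """Flatten a parsed table (list of rows of cell strings) into one
    sequence that a text embedding model can encode, per the formula above."""
    return row_sep.join(col_sep.join(cell.strip() for cell in row) for row in rows)

table = [["Rail type", "Mass (kg/m)"], ["60 kg/m rail", "60.64"]]
sequence = linearize_table(table)
# "Rail type | Mass (kg/m) ; 60 kg/m rail | 60.64"
# embedding = embed_model.encode(sequence)   # hypothetical embedding call
```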
Step 3, constructing a multimodal knowledge base;
because the railway industry also includes design drawings and pictures in various formats, and large standard tables are better returned as pictures than as tables at indexing time, a multimodal knowledge base containing picture formats needs to be constructed.
The principle of embedding multimodal data is shown in fig. 5. For complex pictures or drawings, the picture, its caption and its introduction text are embedded together into the vector database using a multimodal embedding model, while the corresponding original picture and text are stored in the same slice of the database document; during hybrid similarity retrieval, once the related text is found by the index, the related picture in the database can be obtained, and the picture and the related text blocks are passed to the large model to generate an answer. For macroscopic pictures, a text abstract of the picture is generated by a multimodal large model, embedded and retrieved with the text embedding model, and the picture and text corresponding to the retrieved abstract are passed to the large model to generate an answer.
For input data given as image-text pairs $(I, t)$, wherein $I$ is the original image and $t$ the associated text, the formulas for preprocessing and embedding generation are as follows:

$$p = \operatorname{Softmax}\big(f_{\text{cls}}(I)\big), \qquad \hat{t} = \begin{cases} \mathcal{M}(\hat{I}, L_{\max}), & p > \tau \\ t, & \text{otherwise} \end{cases}$$

$$v_{\text{joint}} = \alpha \, v_{\text{img}} + \beta \, v_{\text{text}}, \qquad v_{\text{img}} = E_{\text{img}}(I) \in \mathbb{R}^{d_{\text{img}}}, \qquad v_{\text{text}} = W_p \, e_{\text{text}}, \qquad e_{\text{text}} = E_{\text{text}}(\hat{t}) \in \mathbb{R}^{d_{\text{text}}}$$

wherein $p$ indicates whether the image type belongs to a macroscopic image; $f_{\text{cls}}$ denotes the pre-trained image classification model; $\hat{I}$ denotes an image discriminated as macroscopic by the classification model; $\operatorname{Softmax}$ is the Softmax function; $\tau$ is the macroscopic-image threshold, set to 0.8, and if $p > \tau$ the abstract generation is triggered; $\mathcal{M}$ denotes the multimodal large model, e.g. qwen2.5-VL; $L_{\max}$, the maximum number of tokens of the abstract, is set to 1024; $v_{\text{joint}}$ denotes the joint embedding vector of the image embedding vector $v_{\text{img}}$ and the aligned text embedding vector $v_{\text{text}}$; $\alpha$ and $\beta$ denote the modal weights of the image and text embeddings respectively; $e_{\text{text}}$ is the original text embedding vector, which is not dimension-aligned with the image vector; $W_p$ is the projection matrix for dimension alignment, whose parameters are obtained by linear regression training; $E_{\text{img}}$ and $E_{\text{text}}$ are the image and text embedding models respectively; and $d_{\text{img}}$ and $d_{\text{text}}$ are the image embedding dimension and the text embedding dimension respectively;
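The routing logic of these formulas can be sketched as follows, with hypothetical wrappers standing in for the classifier, the multimodal large model and the two embedding models; only the control flow mirrors the formulas above:

```python
def embed_image_text(image, text, cls_model, mm_llm, img_enc, txt_enc,
                     W_p, tau=0.8, alpha=0.5, beta=0.5, max_tokens=1024):
    """Joint image-text embedding with macro-image summarization.
    All model interfaces here are hypothetical stand-ins."""
    p = cls_model.predict_proba(image)        # softmax score for "macro image"
    if p > tau:                               # macro image: summarize it first
        text = mm_llm.summarize(image, max_tokens=max_tokens)

    v_img = img_enc.encode(image)             # shape (d_img,)
    e_txt = txt_enc.encode(text)              # shape (d_text,)
    v_txt = W_p @ e_txt                       # project text into image dimensions
    return alpha * v_img + beta * v_txt       # joint embedding vector
```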
for the CAD vector-drawing format, a B-spline curve analysis is adopted to convert the drawing into vector form, and the feature vector is constructed as follows:

$$C(u) = \frac{\sum_{i=0}^{n} N_{i,k}(u)\, w_i\, P_i}{\sum_{i=0}^{n} N_{i,k}(u)\, w_i}, \qquad f = \big[\, U;\; P_0, \dots, P_n;\; w_0, \dots, w_n \,\big]$$

wherein $C(u)$ denotes a $k$-th degree B-spline curve; $N_{i,k}(u)$ are the $k$-th degree B-spline basis functions defined on the knot vector $U$; $f$ is the linearly converted feature vector, obtained by concatenating the curve parameters into a vector in a fixed order; $U$ denotes the knot vector; $P_i$ are the control points; and $w_i$ are the weight factors corresponding to the B-spline.
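Assembling the feature vector f from a parsed B-spline entity might look as follows; parsing the CAD file itself (e.g. with a DXF reader) is omitted, and the concatenation order shown is the assumed fixed order:

```python
import numpy as np

def bspline_feature_vector(knots, control_points, weights):
    """Concatenate B-spline parameters (knot vector, control points,
    weights) into one fixed-order feature vector, as in the formula above."""
    ctrl = np.asarray(control_points, dtype=float).ravel()   # (x0, y0, x1, y1, ...)
    return np.concatenate([np.asarray(knots, dtype=float), ctrl,
                           np.asarray(weights, dtype=float)])

# A hypothetical quadratic B-spline with three control points.
f = bspline_feature_vector(
    knots=[0, 0, 0, 1, 1, 1],
    control_points=[(0.0, 0.0), (1.0, 2.0), (3.0, 1.0)],
    weights=[1.0, 0.8, 1.0],
)
```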
Step 4, expanding to multiple platforms.
In step 4, the local framework is extended to a web page, a WeChat official account, a mini-program and third-party software; the core components comprise a protocol-adaptive API gateway and asynchronous message middleware;
multi-port access is realized with a unified API gateway architecture; a protocol conversion layer built into the gateway supports uniform conversion of heterogeneous HTTP/WebSocket protocols into the gRPC protocol, and the protocol conversion process is realized by an input normalization function $F(x)$ as follows:

$$F(x) = \begin{cases} \operatorname{JSON} \rightarrow \operatorname{ProtoBuf}, & x \in \mathcal{X}_{\text{HTTP}} \\ \operatorname{XML} \rightarrow \operatorname{ProtoBuf}, & x \in \mathcal{X}_{\text{WeChat}} \end{cases}$$

wherein $x$ denotes the input data; when $x$ comes from the HTTP protocol, i.e. $x \in \mathcal{X}_{\text{HTTP}}$, a JSON-to-ProtoBuf serialization conversion is performed; when the input data comes from the WeChat ecosystem, i.e. $x \in \mathcal{X}_{\text{WeChat}}$, an XML-to-ProtoBuf conversion is performed; for heterogeneous input data sets, the protocol conversion delay is controlled within 15 ms;
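The normalization function F(x) can be sketched as a dispatch on the input's source protocol; the internal field layout is hypothetical, and a real gateway would emit compiled ProtoBuf messages rather than a dict:

```python
import json
import xml.etree.ElementTree as ET

def normalize_input(raw, source):
    """Normalize heterogeneous gateway inputs to one internal dict from
    which a ProtoBuf message could be built (hypothetical field layout)."""
    if source == "http":                       # JSON payload -> internal form
        payload = json.loads(raw)
    elif source == "wechat":                   # WeChat XML message -> internal form
        root = ET.fromstring(raw)
        payload = {child.tag: child.text for child in root}
    else:
        raise ValueError(f"unsupported source protocol: {source}")
    return {"question": payload.get("Content") or payload.get("question"),
            "user": payload.get("FromUserName") or payload.get("user")}
```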
For high-concurrency scenarios, a message queue is introduced for asynchronous peak shaving, so that stable message throughput can be maintained under peak load; the message-processing delay model is:

$$T_{\text{total}} = T_{\text{ser}} + \left\lceil \frac{N_{\text{msg}}}{P} \right\rceil \cdot t_c$$

wherein $T_{\text{total}}$ denotes the total message processing time; $T_{\text{ser}}$ denotes the serialization time for converting the original message data into a standardized format that can be transmitted or stored; $N_{\text{msg}}$ denotes the number of messages; $P$ is the number of partitions, with 8 partitions configured by default; and $t_c$ is the consumption delay of a single message.
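The delay model reduces to a short calculation; the ceiling division below assumes messages are spread evenly over partitions and consumed in parallel:

```python
import math

def total_processing_time(t_serialize, n_messages, n_partitions, t_consume):
    """Total message processing time: serialization plus parallel
    consumption across partitions, per the delay model above."""
    return t_serialize + math.ceil(n_messages / n_partitions) * t_consume

# e.g. 5 ms serialization, 10_000 messages, the default 8 partitions,
# 2 ms consumption delay per message:
print(total_processing_time(5, 10_000, 8, 2))  # 5 + 1250 * 2 = 2505 ms
```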
The advantages of this embodiment are:
(1) A hybrid RAG architecture integrating semantic vector retrieval and full-text retrieval is provided; combined with a reordering algorithm such as RRF and a search-filtering mechanism, it markedly improves retrieval accuracy and fault tolerance and effectively mitigates the hallucination problem of large models. Unified embedding and cross-modal association of multimodal data (text, tables and drawings) are realized, establishing an industry knowledge-base framework that supports semantic understanding and provides multi-dimensional information support for complex decisions.
(2) A database system centered on MongoDB (content store) and PostgreSQL (vector retrieval library) is constructed, supporting efficient management and dynamic updating of heterogeneous data such as railway design specifications, operation-and-maintenance reports and engineering cases. Optimization strategies such as table simplification and image abstract generation solve the parsing and retrieval problems of non-text data and improve the practicality and coverage of the knowledge base.
(3) In scenario tests covering railway line design, track maintenance and train traction, the system accurately outputs standard clauses, defect treatment schemes and energy-saving technical grounds, ensuring that answers conform to industry standards and reducing deviations caused by manual experience. Multi-platform deployment (web, mobile and third-party software) is supported, enabling designers to query knowledge in real time across scenarios, improving design efficiency and compliance, providing an extensible technical paradigm for design optimization, fault prediction and operation-and-maintenance decision-making, and helping railway systems upgrade toward a digital, knowledge-driven model.
The invention and its embodiments have been described above by way of illustration and not limitation, and the actual structure is not limited to what is shown in the accompanying drawings. Therefore, if one of ordinary skill in the art, informed by this disclosure, adopts structural modes and embodiments similar to this technical scheme without creative design and without departing from the gist of the invention, they shall fall within the scope of protection of the invention.