Disclosure of Invention
To address the shortcomings of the prior art, the invention provides a large-model-driven method and system for real-time index construction and intelligent optimization. The method collects sample data and analyzes its characteristics and access modes through data stream processing and dynamic preprocessing; performs deep semantic analysis on the sample data by combining a pre-trained language model with a knowledge graph method; generates an index scheme with an adaptive hybrid index strategy and implements index construction; optimizes index performance with a multi-objective optimization algorithm; and dynamically adjusts the index configuration through real-time monitoring and a machine learning prediction model. The resulting optimal index combination improves database query efficiency and realizes an intelligent and efficient index management strategy.
In order to achieve the above purpose, the present invention provides the following technical solutions:
The real-time index construction and intelligent optimization method based on large model driving comprises the following steps:
Step S1, collecting sample data by utilizing a data stream processing mode, performing dynamic self-adaptive preprocessing, analyzing characteristics of the sample data, and evaluating an access mode of the sample data;
Step S2, predicting hot data with a statistical method based on the historical access records and the current data characteristics, automatically triggering a loading mechanism according to the hot data prediction result, and loading the hot data into memory in advance;
Step S3, loading a pre-trained language model, performing deep semantic analysis on the sample data in combination with a knowledge graph method, generating a hybrid index scheme by applying an adaptive hybrid index strategy to the analysis results, sample data characteristics, access modes and hot data prediction results together with the entity relation information in the knowledge graph, and storing the hybrid index scheme in an index database;
Step S4, implementing index construction in a database system according to the hybrid index scheme, performing multi-dimensional performance evaluation on the hybrid index scheme by utilizing the pre-trained language model, automatically generating an index optimization scheme by adopting a multi-objective optimization algorithm according to the evaluation result, and implementing the optimized index scheme;
Step S5, monitoring performance indexes of index construction and query processing in real time, and predicting query optimization effects under different index combinations by utilizing a machine learning algorithm in combination with historical query data and index performance data to obtain an optimal index combination.
Specifically, the specific steps of the step S3 include:
Step S3.1, loading a pre-trained language model and its matching tokenizer, and using the tokenizer to tokenize and encode the input sample data;
Step S3.2, inputting the encoded data into the pre-trained language model, running forward propagation of the model, and extracting the hidden state of the last layer of the model as the deep semantic features of the sample data;
Step S3.3, obtaining embedded representations of the related entities and relations in the knowledge graph with the DistMult method, and combining these embedded representations with the deep semantic features extracted from the pre-trained language model by splicing and dimension transformation to form feature vectors.
Specifically, the specific steps of the step S3 further include:
Step S3.4, collecting access logs of the database, and carrying out statistical analysis on the collected access logs to obtain the user access modes and query requirements;
Step S3.5, performing feature extraction on the feature vectors according to the user access modes and query requirements to obtain index candidate features, and evaluating, through simulation tests, the influence factors $\{\lambda_1, \lambda_2, \dots, \lambda_n\}$ of the different index candidate features on query performance, where $\lambda_n$ denotes the influence factor of the n-th index candidate feature and n denotes the number of index candidate features;
Step S3.6, sorting the influence factors $\lambda_1, \dots, \lambda_n$ in descending order according to the evaluation result, and setting the minimum influence factor threshold to $\lambda_{\min}$;
If $\lambda_k > \lambda_{\min}$, the k-th candidate is retained, yielding the index features $\{c_1, c_2, \dots, c_m\}$; based on the index features and the query requirements, a hybrid index scheme is generated using the adaptive hybrid index strategy, where $c_m$ denotes the m-th index feature and m denotes the number of index features.
Specifically, the specific steps of the step S3 further include:
Step S3.7, introducing a dynamic adjustment mechanism for the index, monitoring index query efficiency in real time based on the hybrid index scheme, and dynamically adjusting the index structure according to the monitoring result;
Step S3.8, predicting the trend of index queries by using a trained machine learning model, reading and analyzing the predicted trend, identifying query hotspots and mode changes, and evaluating, from the predicted trend and the current index, whether the index structure or parameters need to be adjusted to optimize the adaptive hybrid index strategy;
Step S3.9, writing an adaptive hybrid index strategy construction script, executing the script on a large-scale data set, and monitoring resource consumption and performance indexes during construction;
Step S3.10, after the new index strategy is executed, storing the constructed hybrid index in the index database.
Specifically, the specific steps of the multi-objective optimization algorithm in step S4 include:
Step S4.1, setting an objective function and constraint conditions for index optimization, and generating a group of initial index configuration schemes as the candidate solution set through a heuristic method, wherein the objective function $F(x_i)$ is given by:
$F(x_i) = w_1 T(x_i) + w_2 S(x_i) + w_3 B(x_i)$;
Wherein $x_i$ denotes the given i-th index configuration scheme, $T(x_i)$ represents the average or total response time required to perform a query operation under the i-th index configuration scheme, $S(x_i)$ represents the amount of storage space occupied by the index structure under the i-th index configuration scheme, $B(x_i)$ represents the time required to build the index under the i-th index configuration scheme, and $w_1$, $w_2$, $w_3$ represent the weight coefficients;
S4.2, performing multi-dimensional performance evaluation on each index configuration scheme in the candidate solution set by utilizing a pre-training language model;
And S4.3, selecting part of solutions from the candidate solution set to serve as parents to generate new solutions to serve as offspring according to the multi-dimensional performance evaluation result, and meanwhile, combining the parents and the offspring to form a new solution set to evaluate.
Specifically, the specific steps of the multi-objective optimization algorithm in step S4 further include:
S4.4, layering is carried out in the new solution set according to the dominant relation of the solutions, so that each layer comprises a group of solutions which are not dominant to each other, different non-dominant layers are obtained, and the crowding degree of all individuals in the same non-dominant layer is initialized to be 0;
S4.5, sequencing the individuals in each non-dominant layer on each objective function, calculating the objective function difference between the individuals and the adjacent individuals for each objective function, and carrying out normalization processing;
S4.6, adding the normalized difference values on all objective functions to obtain the crowding degree of the individual, and selecting the individual of the next generation population based on the level and the crowding degree of the non-dominant order to perform iterative operation;
S4.7, setting the maximum iteration times, stopping iteration if the maximum iteration times are met, obtaining a Pareto optimal solution, otherwise, returning to the step S4.3, and continuing iteration;
Step S4.8, generating an index optimization scheme according to the Pareto optimal solution.
Specifically, the hybrid index scheme in step S3.6 is generated by adopting hash table and bitmap index methods in combination with the entity relation information in the knowledge graph.
The real-time index construction and intelligent optimization system based on large model driving comprises a data processing module, an index generating module, an index optimizing module and a monitoring module;
the data processing module is used for collecting sample data and performing dynamic self-adaptive preprocessing;
The index generation module is used for creating an index structure according to the preprocessed sample data and generating a mixed index scheme;
the index optimization module is used for implementing index construction and adjusting and optimizing an index structure according to requirements;
The monitoring module is used for monitoring the index construction and the query processing performance indexes in real time.
The index generation module comprises a model loading unit, a knowledge graph fusion unit, a strategy making unit and a storage unit;
The model loading unit is used for loading a pre-training language model and carrying out deep semantic representation;
The knowledge graph fusion unit is used for fusing an external knowledge graph or external knowledge into the index construction process, enhancing the semantic understanding capability of the index and improving the query accuracy;
the strategy making unit is used for making a self-adaptive mixed index strategy according to the analysis result, the data characteristic and the access mode of the sample data;
The storage unit is used for generating an index scheme according to the self-adaptive hybrid index strategy, storing the index scheme in an index database, realizing physical storage of indexes and managing index files.
The computer readable storage medium has stored thereon computer instructions which, when executed, perform the steps of a large model driven real-time index building and intelligent optimization method.
Compared with the prior art, the invention has the beneficial effects that:
1. The invention provides a real-time index construction and intelligent optimization system based on large model driving, with an optimized and improved architecture, operation steps and flow; the system has a simple flow, low investment and operating costs, and low production and working costs.
2. The invention provides a real-time index construction and intelligent optimization method based on large model driving, which is provided with a self-adaptive index mode, and through deep semantic analysis, a self-adaptive mixed index strategy, multi-dimensional performance evaluation and dynamic adjustment, the intellectualization and efficiency of index construction are improved, the self-adaptability of index is improved, the query performance is optimized, and the real-time performance of data processing and the operation efficiency of an index construction and intelligent optimization system are improved.
Detailed Description
Example 1
Referring to FIGS. 1-3, the real-time index construction and intelligent optimization method based on large model driving according to this embodiment of the invention comprises the following steps:
Step S1, collecting sample data by utilizing a data stream processing mode, performing dynamic self-adaptive preprocessing, analyzing characteristics of the sample data, and evaluating an access mode of the sample data;
The sample data comprises data distribution, access frequency and update frequency information.
Further, the specific steps of step S1 include:
A1, setting a data stream access point, configuring a preprocessing strategy library, and initializing a data analysis tool and an index construction system;
A2, receiving new data items from a data source in real time, carrying out dynamic self-adaptive preprocessing on the new data items, and either storing the preprocessed sample data in a temporary storage area or passing it directly to characteristic analysis;
A3, carrying out statistical analysis, pattern recognition and correlation analysis on the sample data in the storage area at regular intervals, analyzing the access logs, mining access modes and user behaviors, and storing the analysis results in an analysis result database for subsequent index construction and optimization.
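As a rough illustration of steps A2 and A3, the sketch below keeps running statistics for one numeric field and per-key access counts as items arrive from a stream. The class name `StreamProfile` and its fields are hypothetical, and Welford's online update is an assumed choice; the source does not name a specific statistical technique.

```python
from collections import Counter


class StreamProfile:
    """Running statistics for one numeric field plus per-key access counts."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0            # running sum of squared deviations (Welford)
        self.access = Counter()   # key -> number of times the key was seen

    def update(self, key, value):
        """Fold one new data item into the profile without storing the stream."""
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (value - self.mean)
        self.access[key] += 1

    @property
    def variance(self):
        """Population variance of the values seen so far."""
        return self._m2 / self.count if self.count else 0.0


profile = StreamProfile()
for key, value in [("a", 2.0), ("b", 4.0), ("a", 6.0)]:
    profile.update(key, value)
```

Because each update is constant-time and constant-memory, a profile of this kind can run on every incoming item while the heavier periodic analysis of step A3 reads its current state.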
Step S2, predicting hot data with a statistical method based on the historical access records and the current data characteristics, automatically triggering a loading mechanism according to the hot data prediction result, and loading the hot data into memory in advance;
Step S3, loading a pre-trained language model, performing deep semantic analysis on the sample data in combination with a knowledge graph method, generating a hybrid index scheme by applying an adaptive hybrid index strategy to the analysis results, sample data characteristics, access modes and hot data prediction results together with the entity relation information in the knowledge graph, and storing the hybrid index scheme in an index database;
Step S4, implementing index construction in a database system according to the hybrid index scheme, performing multi-dimensional performance evaluation on the hybrid index scheme by utilizing the pre-trained language model, automatically generating an index optimization scheme by adopting a multi-objective optimization algorithm according to the evaluation result, and implementing the optimized index scheme;
Step S5, monitoring performance indexes of index construction and query processing in real time, and predicting query optimization effects under different index combinations by utilizing a machine learning algorithm in combination with historical query data and index performance data to obtain an optimal index combination;
Step S6, feeding new query data and index performance data back to the machine learning model, and continuously updating and optimizing the machine learning model by adopting an incremental learning mode to improve the prediction accuracy and adaptability, wherein the incremental learning method is the prior art content in the field and is not an inventive scheme of the application, and is not repeated here.
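The hot data prediction and preloading of step S2 can be sketched as follows. This is only one possible statistical method: an exponentially weighted moving average over per-window access counts is assumed here, and the function names, the smoothing factor `alpha` and the `threshold` are hypothetical values chosen for illustration.

```python
def predict_hot_keys(history, alpha=0.5, threshold=2.0):
    """Score keys by an exponentially weighted average of access counts.

    history: list of dicts, oldest window first, each mapping key -> access count.
    Returns the keys whose score reaches the threshold, hottest first.
    """
    scores = {}
    for window in history:
        # Keys missing from a window count as zero accesses, so old heat decays.
        for key in set(scores) | set(window):
            scores[key] = alpha * window.get(key, 0) + (1 - alpha) * scores.get(key, 0.0)
    hot = [key for key, score in scores.items() if score >= threshold]
    return sorted(hot, key=lambda key: (-scores[key], key))


def preload(hot_keys, storage, cache):
    """Copy predicted hot items from slow storage into the in-memory cache."""
    for key in hot_keys:
        if key not in cache and key in storage:
            cache[key] = storage[key]
    return cache
```

A typical call would pass the most recent access windows to `predict_hot_keys` and hand the result to `preload`, so that the loading mechanism is triggered by the prediction rather than by the first cache miss.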
The specific steps of the step S3 include:
Step S3.1, loading a pre-trained language model and its matching tokenizer, and using the tokenizer to tokenize and encode the input sample data;
GPT-3 is used as the pre-training language model in the present invention.
Further, the step of S3.1 includes:
(1) Ensuring that the PyTorch and transformers libraries are installed in the environment, as the transformers library provides pre-trained language models and their tokenizers;
(2) Using AutoModel from the transformers library to load the pre-trained language model and AutoTokenizer to load the matching tokenizer;
(3) Tokenizing the input sample data with the loaded tokenizer and converting the tokens into an ID sequence that the model can process, i.e. encoding; this typically involves splitting the text into sub-words, adding special tokens such as [CLS] and [SEP], and mapping each token to its index in the vocabulary.
S3.2, inputting the encoded data into a pre-training language model, running forward propagation of the pre-training language model, and extracting the hidden state of the last layer of the model as the deep semantic feature of sample data;
further, the specific step of S3.2 includes:
(1) Encoding the text data with the tokenizer into a format accepted by the pre-trained language model, and converting the encoded data into tensors suitable as model input;
(2) Passing the encoded data as input to the pre-trained language model, which performs forward propagation and outputs several results, including hidden states and attention weights;
(3) Extracting the hidden state of the last layer from the model output by accessing output.last_hidden_state, which is the hidden state of the last Transformer layer produced after the model processes the input data; the model is usually formed by stacking multiple Transformer layers, each of which applies a series of operations, including a self-attention mechanism and a feed-forward network, to the input data or to the output of the previous layer and then outputs its hidden state;
(4) Further processing the extracted hidden states as needed, for example by average pooling or by taking the vector at a specific position, to obtain the deep semantic features of the sample data.
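The post-processing in item (4) can be illustrated without a model. The sketch below assumes the last-layer hidden states have already been converted to plain nested lists (one vector per token) and that an attention mask marks padding positions with 0; the function names are hypothetical.

```python
def mean_pool(hidden_states, attention_mask):
    """Average the token vectors of one sequence, skipping padded positions.

    hidden_states: list of token vectors, shape (seq_len, dim).
    attention_mask: list of 1 (real token) or 0 (padding), length seq_len.
    """
    dim = len(hidden_states[0])
    total = [0.0] * dim
    kept = 0
    for vector, keep in zip(hidden_states, attention_mask):
        if keep:
            kept += 1
            for j, x in enumerate(vector):
                total[j] += x
    if kept == 0:
        raise ValueError("attention mask selects no tokens")
    return [t / kept for t in total]


def first_token(hidden_states):
    """Alternative pooling: take the vector at a fixed position (here position 0)."""
    return list(hidden_states[0])
```

Masked averaging keeps padding from diluting the feature vector, which matters when sequences of different lengths are batched together.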
Step S3.3, obtaining embedded representations of the related entities and relations in the knowledge graph with the DistMult method, and combining these embedded representations with the deep semantic features extracted from the pre-trained language model by splicing and dimension transformation to form feature vectors;
Further, the specific step of S3.3 includes:
(1) Preparing triplet data and text data of the knowledge graph;
(2) Loading a DistMult model and training it to obtain embedded representations of entities and relations;
(3) Loading the pre-trained language model and extracting deep semantic features of the text data;
(4) Splicing the knowledge graph embeddings with the text deep semantic features, either directly or after dimension transformation;
(5) Forming the final feature vector.
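A minimal sketch of the fusion in items (2) to (5) is shown below. The DistMult score of a triple is the sum of the element-wise product of the head, relation and tail embeddings; training is omitted. The "dimension transformation" is assumed here to be a linear projection applied after concatenation, which the source does not specify, and `fuse` and its `projection` argument are hypothetical names.

```python
def distmult_score(head, relation, tail):
    """DistMult plausibility score of a triple: sum_k head_k * relation_k * tail_k."""
    return sum(h * r * t for h, r, t in zip(head, relation, tail))


def fuse(text_features, entity_embedding, relation_embedding, projection):
    """Concatenate text and knowledge graph embeddings, then project linearly.

    projection: list of rows, each as long as the concatenated vector
    (an assumed form of the dimension transformation).
    """
    joined = list(text_features) + list(entity_embedding) + list(relation_embedding)
    for row in projection:
        if len(row) != len(joined):
            raise ValueError("projection row length must match the concatenated vector")
    return [sum(w * x for w, x in zip(row, joined)) for row in projection]
```

The projection lets the final feature vector have a fixed size regardless of the embedding dimensions chosen for the language model and the knowledge graph.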
Step S3.4, collecting access logs of the database, and carrying out statistical analysis on the collected access logs to obtain the user access modes and query requirements;
Step S3.5, performing feature extraction on the feature vectors according to the user access modes and query requirements to obtain index candidate features, and evaluating, through simulation tests, the influence factors $\{\lambda_1, \lambda_2, \dots, \lambda_n\}$ of the different index candidate features on query performance, where $\lambda_n$ denotes the influence factor of the n-th index candidate feature and n denotes the number of index candidate features;
The influence factors comprise complexity of search conditions, time consumption of search, hit data blocks, hit number, sorting mode and user satisfaction.
Step S3.6, sorting the influence factors $\lambda_1, \dots, \lambda_n$ in descending order according to the evaluation result, and setting the minimum influence factor threshold to $\lambda_{\min}$;
If $\lambda_k > \lambda_{\min}$, the k-th candidate is retained, yielding the index features $\{c_1, c_2, \dots, c_m\}$; based on the index features and the query requirements, a hybrid index scheme is generated using the adaptive hybrid index strategy, where $c_m$ denotes the m-th index feature, m denotes the number of index features, and $\{c_1, c_2, \dots, c_m\}$ is the index feature set formed by the index candidate features whose influence factors are greater than $\lambda_{\min}$;
Step S3.7, introducing a dynamic adjustment mechanism for the index, monitoring index query efficiency in real time based on the hybrid index scheme, and dynamically adjusting the index structure according to the monitoring result;
Step S3.8, predicting the trend of index queries by using a trained machine learning model, reading and analyzing the predicted trend, identifying query hotspots and mode changes, and evaluating, from the predicted trend and the current index, whether the index structure or parameters need to be adjusted to optimize the adaptive hybrid index strategy;
Step S3.9, writing an adaptive hybrid index strategy construction script, executing the script on a large-scale data set, and monitoring resource consumption and performance indexes during construction;
Step S3.10, after the new index strategy is executed, storing the constructed hybrid index in the index database.
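The candidate selection of steps S3.5 and S3.6 reduces to ranking the candidates by influence factor and keeping those above the minimum threshold. The sketch below assumes each candidate feature has already been given a single numeric influence factor by the simulation tests; the function name and the example column names are hypothetical.

```python
def select_index_features(influence, threshold):
    """Rank candidates by influence factor (descending) and keep those above the threshold.

    influence: dict mapping candidate feature name -> influence factor.
    threshold: minimum influence factor; candidates must be strictly greater.
    """
    ranked = sorted(influence.items(), key=lambda item: (-item[1], item[0]))
    return [name for name, factor in ranked if factor > threshold]
```

For example, with influence factors of 0.9 for `user_id`, 0.7 for `status` and 0.4 for `created_at` and a threshold of 0.5, only `user_id` and `status` are kept as index features, in that order.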
The specific steps of the multi-objective optimization algorithm in the step S4 include:
Step S4.1, setting an objective function and constraint conditions for index optimization, and generating a group of initial index configuration schemes as the candidate solution set through a heuristic method, wherein the objective function $F(x_i)$ is given by:
$F(x_i) = w_1 T(x_i) + w_2 S(x_i) + w_3 B(x_i)$;
Wherein $x_i$ denotes the given i-th index configuration scheme, $T(x_i)$ represents the average or total response time required to perform a query operation under the i-th index configuration scheme, $S(x_i)$ represents the amount of storage space occupied by the index structure under the i-th index configuration scheme, $B(x_i)$ represents the time required to build the index under the i-th index configuration scheme, and $w_1$, $w_2$, $w_3$ represent the weight coefficients;
S4.2, performing multi-dimensional performance evaluation on each index configuration scheme in the candidate solution set by utilizing a pre-training language model;
Further, the specific step of S4.2 includes:
(1) Collecting each index configuration scheme in the candidate solution set, ensuring that each scheme has clear definition and parameters, and preparing a data set for evaluating the performance of the index configuration scheme, wherein the data set is used for covering a plurality of dimensions so as to comprehensively evaluate the effect of the index configuration;
(2) Preprocessing the text in the evaluation data set, including cleaning, word segmentation and stop word removal, so as to ensure the text quality input to the pre-training language model;
(3) Extracting features from the preprocessed text by using a pre-training language model, wherein the pre-training language model outputs deep semantic feature representations of each text sequence, which can be used for subsequent evaluation tasks;
(4) Determining performance evaluation indexes such as query efficiency, response time, accuracy and recall rate according to the evaluation requirements, wherein the indexes are used for quantifying the performance of an index configuration scheme;
(5) Correlating the extracted features of the pre-training language model with the index configuration schemes, and evaluating the performance of different index configuration schemes on a specific data set in a mode of simulating query or actual query;
(6) Scoring or ranking each index configuration scheme using the determined performance evaluation index;
(7) And analyzing the evaluation result to find out an index configuration scheme with excellent performance and characteristics thereof.
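The scoring and ranking in item (6) can be illustrated with the objective of step S4.1. The sketch assumes the objective is a weighted sum of query response time, storage space and build time, which the three weight coefficients suggest, and that the three measurements have already been normalized to comparable scales. The `IndexConfig` type and the default weights are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexConfig:
    name: str
    query_time: float   # average or total query response time
    storage: float      # storage space occupied by the index structure
    build_time: float   # time required to build the index


def objective(config, w_query=0.6, w_storage=0.2, w_build=0.2):
    """Weighted cost of one index configuration; lower is better (assumed form)."""
    return (w_query * config.query_time
            + w_storage * config.storage
            + w_build * config.build_time)


def rank(configs, **weights):
    """Order candidate configurations from lowest to highest weighted cost."""
    return sorted(configs, key=lambda config: objective(config, **weights))
```

Changing the weights shifts the ranking toward fast queries, small indexes or cheap rebuilds, which is why a single weighted score is usually complemented by the Pareto-based selection in the later steps.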
S4.3, selecting part of solutions from the candidate solution set to serve as a parent to generate a new solution to serve as a child according to a multi-dimensional performance evaluation result, and meanwhile, combining the parent and the child to form a new solution set to evaluate, wherein the genetic algorithm is the prior art content in the field and is not an inventive scheme of the application and is not repeated here;
S4.4, layering is carried out in the new solution set according to the dominant relation of the solutions, so that each layer comprises a group of solutions which are not dominant to each other, different non-dominant layers are obtained, and the crowding degree of all individuals in the same non-dominant layer is initialized to be 0;
S4.5, sequencing the individuals in each non-dominant layer on each objective function, calculating the objective function difference between the individuals and the adjacent individuals for each objective function, and carrying out normalization processing;
S4.6, adding the normalized difference values on all objective functions to obtain the crowding degree of the individual, and selecting the individual of the next generation population based on the level and the crowding degree of the non-dominant order to perform iterative operation;
S4.7, setting the maximum iteration times, stopping iteration if the maximum iteration times are met, obtaining a Pareto optimal solution, otherwise, returning to the step S4.3, and continuing iteration;
and S4.8, generating an index optimization scheme according to the Pareto optimal solution.
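Steps S4.4 to S4.6 follow the pattern of non-dominated sorting with crowding-based selection. The sketch below treats every objective as a cost to minimize and represents each solution as a tuple of objective values. One assumption goes beyond the text: boundary individuals in each layer receive an infinite crowding degree, as is common in NSGA-II style algorithms, so that the extremes of the front are preserved. All function names are hypothetical.

```python
def dominates(a, b):
    """a dominates b: no worse on every objective and strictly better on one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def non_dominated_layers(points):
    """Split solution indices into layers of mutually non-dominated solutions (S4.4)."""
    remaining = set(range(len(points)))
    layers = []
    while remaining:
        layer = sorted(
            i for i in remaining
            if not any(dominates(points[j], points[i]) for j in remaining if j != i)
        )
        layers.append(layer)
        remaining -= set(layer)
    return layers


def crowding_distance(points, layer):
    """Sum of normalized neighbor gaps over all objectives (S4.5 and S4.6)."""
    distance = {i: 0.0 for i in layer}   # crowding degree initialized to 0
    for k in range(len(points[layer[0]])):
        ordered = sorted(layer, key=lambda i: points[i][k])
        low, high = points[ordered[0]][k], points[ordered[-1]][k]
        # Assumption: boundary individuals are always kept (infinite distance).
        distance[ordered[0]] = distance[ordered[-1]] = float("inf")
        if high == low:
            continue
        for pos in range(1, len(ordered) - 1):
            gap = points[ordered[pos + 1]][k] - points[ordered[pos - 1]][k]
            distance[ordered[pos]] += gap / (high - low)
    return distance


def select_next_generation(points, size):
    """Fill the next population layer by layer, breaking ties by crowding degree."""
    chosen = []
    for layer in non_dominated_layers(points):
        if len(chosen) + len(layer) <= size:
            chosen.extend(layer)
        else:
            distance = crowding_distance(points, layer)
            by_spread = sorted(layer, key=lambda i: -distance[i])
            chosen.extend(by_spread[: size - len(chosen)])
            break
    return chosen
```

Preferring less crowded individuals keeps the population spread along the Pareto front instead of collapsing onto one trade-off between query time, storage and build time.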
In step S3.6, the hybrid index scheme is generated by adopting hash table and bitmap index methods in combination with the entity relation information in the knowledge graph.
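A toy version of a hash table plus bitmap hybrid index is sketched below. It assumes a hash index on one high-cardinality column for equality lookups and bitmap indexes on low-cardinality columns for combined filters, with Python integers used as bitsets. The class `HybridIndex` is hypothetical, and the use of knowledge graph entity relations described in the text is not modeled.

```python
from collections import defaultdict


class HybridIndex:
    """Hash index on one column plus bitmap indexes on low-cardinality columns."""

    def __init__(self, hash_column, bitmap_columns):
        self.hash_column = hash_column
        self.bitmap_columns = list(bitmap_columns)
        self.hash = defaultdict(list)   # value -> list of row ids
        # column -> value -> integer bitset with bit i set when row i matches
        self.bitmaps = {column: defaultdict(int) for column in self.bitmap_columns}
        self.rows = []

    def add(self, row):
        """Insert a row (a dict) and update both index structures."""
        row_id = len(self.rows)
        self.rows.append(row)
        self.hash[row[self.hash_column]].append(row_id)
        for column in self.bitmap_columns:
            self.bitmaps[column][row[column]] |= 1 << row_id
        return row_id

    def lookup(self, value):
        """Equality lookup on the hashed column."""
        return list(self.hash.get(value, []))

    def filter(self, **conditions):
        """AND together the bitmaps of all column=value conditions."""
        mask = (1 << len(self.rows)) - 1
        for column, value in conditions.items():
            mask &= self.bitmaps[column].get(value, 0)
        return [i for i in range(len(self.rows)) if (mask >> i) & 1]
```

The split reflects the usual strengths of the two structures: hashing answers point lookups in constant time, while bitmaps make multi-condition filters a matter of bitwise AND operations.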
Example 2
Referring to FIG. 4, in another embodiment of the present invention, a real-time index building and intelligent optimization system based on large model driving includes a data processing module, an index generating module, an index optimizing module, and a monitoring module;
the data processing module is used for collecting sample data and carrying out dynamic self-adaptive preprocessing to ensure the quality and consistency of the data;
the index generation module is used for creating an index structure according to the preprocessed sample data, generating a mixed index scheme, accelerating the data query process and reducing the search time and resource consumption;
The index optimization module is used for implementing index construction, and adjusting and optimizing an index structure according to the need so as to adapt to the change of data and the evolution of a query mode;
the monitoring module is used for monitoring the performance indexes of index construction and query processing in real time, can discover and solve problems in time, and ensures the stability and reliability of the index construction and intelligent optimization system.
The data processing module comprises a data stream processing unit, a dynamic preprocessing unit, a characteristic analysis unit and an access mode evaluation unit;
The data stream processing unit is used for receiving and processing sample data in the data stream in real time and can rapidly identify and process new data input;
The dynamic preprocessing unit is used for preprocessing sample data, such as data cleaning, missing value processing and data conversion, dynamically adjusting the preprocessing process according to the characteristics and access modes of the sample data, ensuring the data quality and improving the consistency and accuracy of the data;
the characteristic analysis unit is used for analyzing the statistical characteristic and the distribution characteristic of the sample data, performing characteristic selection and engineering, reducing redundant information and enhancing index characteristics;
the access mode evaluation unit is used for analyzing the access frequency and the access mode of a user or a system, predicting the query requirement, providing guidance for index construction and optimization, and ensuring that an index structure meets the query requirement.
The index generation module comprises a model loading unit, a knowledge graph fusion unit, a strategy making unit and a storage unit;
The model loading unit is used for loading the pre-training language model and carrying out deep semantic representation;
The knowledge graph fusion unit is used for fusing an external knowledge graph or external knowledge into the index construction process, so that the semantic understanding capability of the index is enhanced and the query accuracy is improved;
The strategy making unit is used for making a self-adaptive mixed index strategy according to the analysis result, the data characteristic and the access mode of the sample data;
and the storage unit is used for generating an index scheme according to the self-adaptive hybrid index strategy, storing the index scheme in an index database, realizing physical storage of indexes and managing index files.
The index optimization module comprises an index construction unit, a performance evaluation unit, an optimization algorithm unit and an optimization implementation unit;
an index construction unit for constructing an index in the database system according to the hybrid index scheme;
The performance evaluation unit is used for testing and evaluating the hybrid index scheme by utilizing the pre-trained language model, and for collecting query execution time and resource usage indexes such as query speed and storage space;
The optimization algorithm unit is used for reconstructing, merging or deleting unnecessary index items by adopting a multi-objective optimization algorithm, improving index performance and reducing maintenance cost;
and the optimization implementation unit is used for implementing the optimized indexing scheme and updating the indexing structure.
The monitoring module comprises a performance monitoring unit, a prediction model unit and a dynamic adjustment unit;
The performance monitoring unit is used for monitoring performance indexes of index construction and query processing in real time, such as response time, CPU utilization and throughput;
The prediction model unit is used for predicting query optimization effects under different index combinations by using a machine learning algorithm, planning resource allocation in advance and preventing performance from being reduced;
The dynamic adjustment unit is used for dynamically adjusting the configuration of the index construction and intelligent optimization system, such as an index strategy and query optimization parameters, according to the monitoring result and the prediction result, so as to ensure the stability and the high efficiency of the index construction and intelligent optimization system.
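The interplay of the prediction model unit and the incremental feedback described in the method can be sketched with a very small online estimator. This is a stand-in for the unspecified machine learning algorithm: it keeps an exponentially smoothed latency estimate per index combination and updates it as new measurements arrive. The class name and the `learning_rate` parameter are hypothetical.

```python
class IndexCombinationPredictor:
    """Online latency estimate per index combination (a stand-in for the ML model)."""

    def __init__(self, learning_rate=0.5):
        self.learning_rate = learning_rate
        self.estimate = {}   # frozenset of index names -> predicted query latency

    def update(self, indexes, observed_latency):
        """Incrementally fold one new measurement into the estimate."""
        key = frozenset(indexes)
        if key not in self.estimate:
            self.estimate[key] = float(observed_latency)
        else:
            old = self.estimate[key]
            self.estimate[key] = old + self.learning_rate * (observed_latency - old)

    def best(self):
        """Return the index combination with the lowest predicted latency."""
        if not self.estimate:
            raise ValueError("no measurements recorded yet")
        return min(self.estimate, key=self.estimate.get)
```

The dynamic adjustment unit could consult `best()` after each monitoring cycle and switch the index configuration only when the predicted gain is large enough to justify the rebuild cost.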
Example 3
A computer readable storage medium having stored thereon computer instructions which when executed perform the steps of a large model driven real-time index building and intelligent optimization method, wherein the storage medium may be a volatile or non-volatile computer readable storage medium.
The embodiments of the present invention have been described above with reference to the accompanying drawings, but the present invention is not limited to the above-described embodiments, which are merely illustrative and not restrictive, and variations, modifications, substitutions and alterations can be made to the above-described embodiments by those having ordinary skill in the art without departing from the spirit and scope of the present invention, and these are all within the protection of the present invention.