Disclosure of Invention
Aiming at the problems in the prior art, embodiments of the invention provide a big data index management method and system based on a weak model.
An embodiment of the invention provides a big data index management method based on a weak model, which comprises the following steps:
Selecting a data source based on a service scenario, writing an SQL statement, identifying dimension information and index information based on the SQL statement, writing the dimension information and the index information into an index table, and correspondingly writing the SQL statement and the index table into a distributed columnar database;
Receiving a query request for a custom index, querying an SQL statement based on parameter key information of the query request in combination with basic information of the custom index, assembling the SQL statement to generate a custom SQL statement, and performing statement optimization on the custom SQL statement;
And creating an index code according to the dimension information and the index information, generating detail indexes based on the index code, and setting synchronization parameters of the detail indexes, wherein the detail indexes comprise a basic index, a derivative index and a composite index.
In one embodiment, the method further comprises:
reading data from the data source, writing the data into a temporary table through a buffer queue, and generating corresponding partition data files in combination with partition rules;
and replacing the original partition files with the partition data files through the Stream Load feature of the distributed columnar database, and deleting the temporary table.
In one embodiment, the method further comprises:
and receiving a custom index through an OpenAPI interface, and extracting key information of the custom index based on the service scenario and the user request, wherein the key information comprises a specified index and a specified dimension.
In one embodiment, the method further comprises:
Parsing the query request through the JSqlParser library to obtain an SQL statement for the specified index and the specified dimension, and searching the data source tables based on that SQL statement;
Based on the table relationships of the derivative index and the composite index, adding join conditions to the corresponding SQL statement, and selecting a corresponding aggregation function to aggregate the SQL statement according to the aggregation mode of the specified index and the specified dimension, to generate a custom SQL statement.
In one embodiment, the method further comprises:
Detecting whether indexes exist on the SQL fields in the custom SQL statement; when no index exists on an SQL field, detecting the execution efficiency of the SQL statement, and creating a corresponding index for any SQL field whose execution efficiency is below a preset threshold;
And detecting whether multiple nested sub-queries exist in the custom SQL statement, and replacing the multiple nested sub-queries with join operations when they exist.
In one embodiment, the method further comprises:
Packaging the custom SQL statement into an SQL template, and converting the result into a target format according to the query request;
And storing SQL statement query results in an LRU cache, and periodically refreshing expired SQL statement results in the LRU cache.
An embodiment of the invention provides a big data index management system based on a weak model, which comprises:
a writing module, used for selecting a data source based on a service scenario, writing an SQL statement, identifying dimension information and index information based on the SQL statement, writing the dimension information and the index information into an index table, and correspondingly writing the SQL statement and the index table into a distributed columnar database;
a query module, used for receiving a query request for a custom index, querying an SQL statement based on parameter key information of the query request in combination with basic information of the custom index, assembling the SQL statement to generate a custom SQL statement, and performing statement optimization on the custom SQL statement;
and an index configuration module, used for creating an index code according to the dimension information and the index information, generating detail indexes based on the index code, and setting synchronization parameters of the detail indexes, wherein the detail indexes comprise a basic index, a derivative index and a composite index.
In one embodiment, the system further comprises:
a temporary module, used for reading data from the data source, writing it into a temporary table through a buffer queue, and generating corresponding partition data files in combination with partition rules;
and a replacement module, used for replacing the original partition files with the partition data files through the Stream Load feature of the distributed columnar database, and deleting the temporary table.
In view of the above, in one or more embodiments of the present disclosure, a data source is selected based on a service scenario and an SQL statement is written; dimension information and index information are identified based on the SQL statement and written into an index table; the SQL statement and the index table are correspondingly written into a distributed columnar database; a query request for a custom index is received, an SQL statement is queried based on parameter key information of the query request in combination with basic information of the custom index, the SQL statement is assembled to generate a custom SQL statement, and statement optimization is performed on the custom SQL statement; an index code is created according to the dimension information and the index information, detail indexes are generated based on the index code, and synchronization parameters of the detail indexes are set, wherein the detail indexes comprise a basic index, a derivative index and a composite index. Through the steps of data writing, index management, data query and data updating, the method achieves real-time updating and high-performance querying of the data corresponding to business requirements, and also responds quickly to complex data queries, so that users can obtain analysis results within a short time, timely and accurate data support is provided for enterprise decision-making, and the efficiency and flexibility of data analysis are improved.
Detailed Description
The subject matter described herein will now be discussed with reference to example embodiments. It should be appreciated that these embodiments are discussed only to enable a person skilled in the art to better understand and thereby practice the subject matter described herein, and are not limiting of the scope, applicability, or examples set forth in the claims. Changes may be made in the function and arrangement of elements discussed without departing from the scope of the disclosure as set forth in the specification. Various examples may omit, replace, or add various procedures or components as desired. For example, the described methods may be performed in a different order than described, and various steps may be added, omitted, or combined. In addition, features described with respect to some examples may be combined in other examples as well.
As used herein, the term "comprising" and variations thereof are open-ended, meaning "including, but not limited to". The term "based on" means "based at least in part on". The terms "one embodiment" and "an embodiment" mean "at least one embodiment". The term "another embodiment" means "at least one other embodiment". The terms "first", "second", and the like may refer to different objects or to the same object. Other definitions, whether explicit or implicit, may be included below. Unless the context clearly indicates otherwise, the definition of a term is consistent throughout this specification.
As shown in fig. 1, an embodiment of the present invention provides a big data index management method based on a weak model, including:
Step S101, selecting a data source based on a service scenario, writing an SQL statement, identifying dimension information and index information based on the SQL statement, writing the dimension information and the index information into an index table, and correspondingly writing the SQL statement and the index table into a distributed columnar database.
Specifically, when constructing a high-efficiency intelligent index engine, the data of the data source first needs to be written into the database, that is, a data writing flow is required. When writing data, it is determined, according to the explicit business requirements of the service scenario, which data sources contain data valuable to the current scenario. Different types of data sources suit different applications: transaction processing is generally better served by relational databases such as MySQL or PostgreSQL, whereas large-scale data analysis may use a distributed storage system such as Hive. After the data source is selected, connectivity and compatibility between the data sources and StarRocks should be verified, because the data of the data source will flow into the distributed columnar database StarRocks. StarRocks is an enterprise-level MPP (Massively Parallel Processing) database system designed specifically for real-time analysis. It provides fast data query capability and efficient analysis performance, supports real-time queries over large-scale data sets, and is suitable for online analytical processing (OLAP) scenarios; features of StarRocks such as materialized views, index mechanisms and partitioning, together with the secondary cache of the index engine, can further improve query performance. At the same time, high-performance data processing is achieved through techniques such as columnar storage and vectorized execution.
Further, key indexes are identified according to the business logic and goals of the service scenario. The key indexes comprise index information and dimension information, where the index information is data related to performance indexes and the dimension information describes the corresponding users and sellers. Taking e-commerce data analysis as an example, sales indexes for different regions and different commodity categories serve as index information, while attributes of the users related to those sales indexes, such as age distribution and gender ratio, serve as dimension information. SQL statements are then written according to the index information and the dimension information, ensuring that the SQL statements accurately reflect the business rules. The previously written index-definition SQL is then parsed using JSqlParser (a Java library for parsing and modifying SQL queries) to obtain all relevant column information, including column names and their data types; the parsed columns and their types are converted to data types supported by StarRocks, and corresponding index tables are created, covering, but not limited to, primary keys, indexes and partition policies.
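The embodiment performs this parsing step with the Java library JSqlParser; purely as an illustration of the idea, the following is a minimal regex-based Python sketch that extracts metric (aggregated) columns and grouping dimensions from an index-definition statement. The table and column names are hypothetical, and a real implementation would use a proper SQL parser rather than regular expressions:

```python
import re

def parse_index_definition(sql: str):
    """Small sketch: pull metric (aggregated) columns and dimension
    (GROUP BY) columns out of an index-definition SQL statement."""
    select_part = re.search(r"SELECT\s+(.*?)\s+FROM", sql, re.I | re.S).group(1)
    metrics = re.findall(r"(SUM|COUNT|AVG|MAX|MIN)\s*\(\s*(\w+)\s*\)",
                         select_part, re.I)
    group_by = re.search(r"GROUP\s+BY\s+(.+)$", sql, re.I)
    dimensions = [c.strip() for c in group_by.group(1).split(",")] if group_by else []
    return {"metrics": [{"agg": a.upper(), "column": c} for a, c in metrics],
            "dimensions": dimensions}

info = parse_index_definition(
    "SELECT SUM(sales) AS total_sales, region FROM sales_data GROUP BY region")
```

The extracted column names and their types would then be mapped to the target database's types when creating the index table.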
Further, when building index tables, a one-index-per-table pattern may be followed (though this is not a limitation): each table focuses on storing one key index, or one class of key indexes, under a particular piece of business logic. This helps simplify the data model and keeps each table structure clear and easy to maintain; in addition, a more compact data layout implies lower storage overhead and less wasted space.
Further, to ensure the timeliness and consistency of the data, a rule engine or set of configuration files is developed based on the information in the index definitions, for automatically identifying and selecting the DataX plug-in best suited to the current task. DataX is an offline synchronization tool for heterogeneous data sources, mainly used to migrate data efficiently between different storage systems; it supports a variety of data sources, such as relational databases (MySQL, Oracle, etc.), NoSQL databases (HBase, MongoDB, etc.) and cloud storage services. This improves the extensibility of the system and meets the varied requirements of different service scenarios.
In addition, when the data source is written through the DataX plug-in, it can be split intelligently, that is, the most suitable splitting strategy is selected automatically according to the data volume. If the data volume is large, the plug-in breaks a single big task into multiple subtasks to achieve batched reading and processing. When connecting to the source data source through JDBC, the data read may first be written to a buffer queue instead of directly to the target data source. This smooths load peaks and prevents the target system from being overloaded by a sudden inrush of data. A table for storing temporary data is created in advance in the target data source (StarRocks); after all data has been successfully written into the temporary table, it is read according to the established partition rules and the corresponding partition data files are generated. Using the Stream Load feature of StarRocks, the newly generated partition data files efficiently replace the original partition files as a data stream, and once the data synchronization has completed and been verified, the temporary table is deleted to release storage space.
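The buffered-write-then-swap flow described above can be sketched in simplified form. This Python sketch only simulates the idea with in-memory structures (a deque as the buffer queue, a dict as the partitioned target store); it does not use DataX or StarRocks Stream Load, and all names are illustrative:

```python
from collections import deque

def sync_partition(source_rows, target, partition_key):
    """Sketch of the flow: rows are staged through a buffer queue into a
    temporary table, grouped into partition files by the partition rule,
    and each new partition replaces the old one (mimicking Stream Load)."""
    buffer = deque(source_rows)            # buffer queue smooths load peaks
    temp_table = []
    while buffer:                          # drain queue into the temp table
        temp_table.append(buffer.popleft())
    partitions = {}                        # partition rule: group by key
    for row in temp_table:
        partitions.setdefault(row[partition_key], []).append(row)
    for part, rows in partitions.items():
        target[part] = rows                # replace the original partition
    temp_table.clear()                     # drop the temporary table
    return target

# Illustrative usage: the old "2023-01" partition is replaced wholesale.
target = {"2023-01": [{"date": "2023-01", "sales": 1}]}
rows = [{"date": "2023-01", "sales": 5}, {"date": "2023-02", "sales": 7}]
result = sync_partition(rows, target, "date")
```

In the real system the swap is atomic on the StarRocks side, which is what gives the update its consistency guarantee.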
Step S102, receiving a query request for a custom index, querying an SQL statement based on parameter key information of the query request in combination with basic information of the custom index, assembling the SQL statement to generate a custom SQL statement, and performing statement optimization on the custom SQL statement.
Specifically, after the index engine starts operating on the index definitions, it receives query requests for various custom indexes from different service systems or users. The interface receiving these index query requests may be an open interface, such as a RESTful API based on the OpenAPI standard, allowing external systems to initiate data query requests in a standardized manner and to adjust query parameters dynamically according to their own business logic. A user's custom query request includes the user's definition of indexes and dimensions: specific indexes of interest (such as sales or click-through rate) and dimensions (such as time, geographic location or commodity category) can be specified in the request, so that targeted data is obtained for market strategy analysis.
Further, when parsing the query statement, the JSqlParser library is used to deeply analyze the incoming SQL query statement, from which key information such as indexes, dimensions and aggregation modes can be extracted. Then, based on the extracted key information, the data source tables participating in the query are determined in combination with the configuration database of the index engine; when the query involves multiple tables, JSqlParser can automatically add appropriate JOIN clauses according to predefined table relationships, and select corresponding aggregation functions according to the aggregation mode and grouping in the key information. On this basis, the parsing of a query statement can be illustrated with the statement "SELECT SUM(sales) AS total_sales, region FROM sales_data WHERE date BETWEEN '2023-01-01' AND '2023-12-31' GROUP BY region": JSqlParser identifies "SUM(sales)" as an aggregate function representing the sum of sales (the index), "region" as the dimension used for grouping, and "date BETWEEN '2023-01-01' AND '2023-12-31'" as the query condition. When determining the data source table, if the sales_data table contains sales data and has sales and region fields, it becomes the base table for the query. When adding join conditions, if the region dimension is stored in a separate region_info table, and the sales_data and region_info tables are associated through the region_id field, the JOIN clause "JOIN region_info ON sales_data.region_id = region_info.region_id" is added. After aggregation and grouping are processed, the SQL statement "SELECT SUM(sales) AS total_sales, region FROM sales_data JOIN region_info ON sales_data.region_id = region_info.region_id WHERE date BETWEEN '2023-01-01' AND '2023-12-31' GROUP BY region" is generated.
In addition, when a query can be answered directly from the index table without touching the data source tables, the query speed and query efficiency can be improved.
In addition, after the query statement has been parsed to obtain a custom SQL statement, the generated SQL statement can be packaged into template form to enhance the reusability and maintainability of the code. The same template can then be reused in different scenarios, with only the parameter values replaced.
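As a sketch of the templating idea, the following Python snippet renders a parameterized SQL template by substituting named placeholders. The template text and parameter names are illustrative; production code would use bound parameters rather than string substitution to avoid SQL injection:

```python
# Illustrative template derived from the example query in the text.
SQL_TEMPLATE = ("SELECT SUM(sales) AS total_sales, region FROM sales_data "
                "WHERE date BETWEEN :start AND :end GROUP BY region")

def render(template: str, params: dict) -> str:
    """Fill named :placeholders with quoted literal values (sketch only)."""
    out = template
    for name, value in params.items():
        out = out.replace(":" + name, "'" + str(value) + "'")
    return out

# Reusing the template for a different date range only changes the params.
sql = render(SQL_TEMPLATE, {"start": "2023-01-01", "end": "2023-12-31"})
```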
Further, for SQL statements, the optimization includes, but is not limited to, index optimization: checking whether the fields involved in an SQL statement already have suitable indexes, especially fields that frequently appear in WHERE clauses, JOIN conditions, ORDER BY clauses, and so on. If some fields are found to lack necessary indexes, creating suitable indexes may be suggested and performed based on the data's access pattern and query frequency. For example, if the date field is often used for time-range filtering, an index may be created on the date field of the sales_data table; for dimension fields (such as region or category), indexes may also be created to accelerate grouping and ordering operations.
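The effect of adding such an index can be illustrated with SQLite standing in for StarRocks (the two planners differ considerably, so this is only a sketch of the principle, with illustrative table and index names): before the index is created the planner scans the whole table; afterwards it uses the index for the date-range filter.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_data (date TEXT, region_id INTEGER, sales REAL)")
conn.executemany("INSERT INTO sales_data VALUES (?, ?, ?)",
                 [("2023-01-%02d" % (i % 28 + 1), i % 5, float(i))
                  for i in range(200)])

def plan(sql):
    # EXPLAIN QUERY PLAN rows carry the human-readable detail in column 3.
    return " ".join(r[3] for r in conn.execute("EXPLAIN QUERY PLAN " + sql))

query = ("SELECT SUM(sales) FROM sales_data "
         "WHERE date BETWEEN '2023-01-01' AND '2023-01-07'")
before = plan(query)                       # full table scan
conn.execute("CREATE INDEX idx_sales_date ON sales_data(date)")
after = plan(query)                        # range search via the new index
```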
In addition, the optimization includes, but is not limited to, query rewriting, which the index engine may perform for queries containing multiple nested sub-queries or complex logic: converting nested sub-queries into equivalent JOIN operations to reduce execution time, identifying and eliminating duplicate or unnecessary computation steps (such as avoiding multiple identical operations on the same column inside aggregate functions), or combining several independent filtering conditions into a more compact form.
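The subquery-to-JOIN rewrite can be sketched as follows. SQLite is used here only to check that the two query forms return the same result; the tables mirror the hypothetical sales_data/region_info example used elsewhere in this description:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales_data (region_id INTEGER, sales REAL);
CREATE TABLE region_info (region_id INTEGER, region TEXT);
INSERT INTO sales_data VALUES (1, 10), (1, 5), (2, 7);
INSERT INTO region_info VALUES (1, 'north'), (2, 'south');
""")

# Original form: a correlated subquery evaluated once per region row.
nested = """SELECT region,
                   (SELECT SUM(sales) FROM sales_data s
                     WHERE s.region_id = r.region_id) AS total
            FROM region_info r ORDER BY region"""

# Rewritten form: the same result expressed as a JOIN with GROUP BY.
rewritten = """SELECT r.region, SUM(s.sales) AS total
               FROM region_info r
               JOIN sales_data s ON s.region_id = r.region_id
               GROUP BY r.region ORDER BY r.region"""

a = conn.execute(nested).fetchall()
b = conn.execute(rewritten).fetchall()
```

Both statements yield the same rows; on large tables the JOIN form typically lets the optimizer avoid re-executing the subquery per row.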
In addition, the optimization includes, but is not limited to, the use of statistical information: interacting with the database system to periodically collect statistics on table structures, row counts, field value distributions and the like. Based on the latest statistics, the index engine can dynamically adjust the query execution plan and update the optimization strategy in time as the data characteristics change, ensuring that optimal performance is always maintained. For example, if the data volume of a table grows rapidly, the choice of indexes may need to be re-evaluated; for skewed data, a partition-pruning strategy may be selected to reduce the scan range; and covering indexes can be used to avoid table lookups, further improving query efficiency.
Further, the query result can be converted into the format required by the user according to the requirements in the query statement; for example, if the data needs to be returned to a front end or an external system through an API interface, the result can be packaged in JSON format to facilitate subsequent transmission and processing. In addition, after a query completes, it is checked whether a result for the same query conditions already exists; if so, the result is read directly from the cache, avoiding repeated computation. The cache can adopt a least recently used (LRU) eviction strategy to manage the cache space: when the cache reaches its capacity limit, the entry unused for the longest time is removed automatically, so that the newest and most frequently used index data is always kept in the cache. Furthermore, considering the timeliness of the data, a reasonable time to live (TTL) may be defined; expired data should be updated or cleared in time to ensure the accuracy of query results.
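A minimal sketch of such a result cache, combining LRU eviction with a per-entry TTL, might look as follows in Python (capacity, TTL values and cache keys are illustrative; a production engine would also handle concurrency):

```python
import time
from collections import OrderedDict

class LruTtlCache:
    """Sketch of the result cache: least-recently-used eviction plus a
    per-entry time-to-live so stale query results are dropped."""
    def __init__(self, capacity: int, ttl_seconds: float):
        self.capacity, self.ttl = capacity, ttl_seconds
        self.data = OrderedDict()          # key -> (expires_at, value)

    def get(self, key):
        entry = self.data.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if time.monotonic() > expires_at:  # expired: treat as a miss
            del self.data[key]
            return None
        self.data.move_to_end(key)         # mark as most recently used
        return value

    def put(self, key, value):
        self.data[key] = (time.monotonic() + self.ttl, value)
        self.data.move_to_end(key)
        if len(self.data) > self.capacity: # evict least recently used
            self.data.popitem(last=False)

cache = LruTtlCache(capacity=2, ttl_seconds=60)
cache.put("q1", [("north", 15.0)])
cache.put("q2", [("south", 7.0)])
cache.get("q1")                            # q1 is now most recently used
cache.put("q3", [("east", 3.0)])           # evicts q2, the LRU entry
```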
Step S103, creating an index code according to the dimension information and the index information, generating detail indexes based on the index code, and setting synchronization parameters of the detail indexes, wherein the detail indexes comprise a basic index, a derivative index and a composite index.
Specifically, to guarantee the atomicity and consistency of the data, both incremental updates and full-coverage updates are supported. As for the update timing, a custom index can be, but is not limited to being, updated after it has been queried, and StarRocks data can also be updated periodically, ensuring the real-time property and accuracy of the data. During an update, incremental and full-coverage updates can be realized through the partition feature of StarRocks together with XA two-phase commit of transactions in the index engine. A new data source can be selected according to the requirements of the custom SQL statement and its connection information configured, for example by adding a new data source entry on the system management interface through a graphical interface or an API. The details of the new entry are then entered and configured, including, but not limited to, the database type, server address, port number, user name and password. Next, the specific data source and data tables requiring dimension configuration are determined; the structure of the selected data tables is parsed dynamically to obtain the information of all columns, the required columns are selected from the parsed column information as dimensions, and the dimension data format is specified and configured.
Further, updating the data may also include updating the index code, which defines the specific SQL query model for subsequent index calculation. That is, the calculation logic of an index is written as an SQL query statement, including aggregation functions (such as SUM and COUNT), grouping conditions (such as GROUP BY), filtering conditions (such as WHERE), and the like. The SQL statement is then parsed with the JSqlParser library to extract the key fields and logic, and the parsed fields are matched against the previously configured dimension information, completing the creation of the index code. Detail indexes are then established according to the index codes, and each detail index can be updated and synchronized periodically. Specifically, an SQL query model suited to the current service scenario is selected from the stored index code library, data tables are created or referenced according to the SQL query model in the index code to ensure the data storage structure is reasonable, and the basic index, derivative index and composite index are defined.
The basic index defines a measure of some action in the business process, such as advertisement click count or order count, ensuring that each atomic index has a definite data type, algorithm description and name. The derivative index describes the business process, for example effective click-through rate or conversion rate, where effective click-through rate = click count (basic index) / impression count (basic index). The composite index is further calculated from basic and derivative indexes within the same dimension group, such as average order amount or customer lifetime value, where average order amount = total sales amount / order count; alternatively, basic, composite and derivative indexes can be freely combined through the four arithmetic operations to obtain further composite indexes, such as gross margin or net margin. Periodic updating and parameter synchronization of the data are then completed according to the task scheduling information.
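The three index levels can be sketched with the formulas given above; the numeric values below are purely illustrative:

```python
def effective_click_rate(clicks: int, impressions: int) -> float:
    """Derivative index: ratio of two basic indexes."""
    return clicks / impressions

def average_order_amount(total_sales: float, orders: int) -> float:
    """Composite index: combined from indexes in the same dimension group."""
    return total_sales / orders

# Basic indexes measured for one dimension group (illustrative values).
clicks, impressions = 120, 4000
orders, total_sales = 50, 12500.0

ctr = effective_click_rate(clicks, impressions)   # derivative index
aov = average_order_amount(total_sales, orders)   # composite index
```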
Further, in this embodiment, the data query of step S102 and the data update of step S103 are not limited to the order described here. In this embodiment, after a data query on a custom index is detected (or after a periodic-update interval expires), a data update is performed on the queried custom index. In practice, the order may also be reversed: after a data update, problems possibly introduced during the update (such as missing data or duplicate records) can be discovered through data queries, so that targeted optimization and adjustment can be carried out.
In addition, after the detail indexes, including the basic, derivative and composite indexes, have been created, quick responses to complex business logic can be supported, providing an optimal query experience for the end user. For example, when a user queries through a basic index, the basic index information can be resolved by extracting items such as the technical caliber, index definition, aggregation mode and dimensions, and an SQL query statement is then generated dynamically; the index query engine connects to StarRocks through the JDBC protocol and executes the generated SQL query to obtain the data. When a user queries through composite or derivative indexes, the query statement can be decomposed down to basic-index granularity, the required data tables determined, their dimension information and calculation rules integrated, and the SQL statement generated dynamically; the query is submitted through JDBC and the returned result processed. This ensures quick responses to complex business logic, provides comprehensive and accurate data support to end users, and lets enterprises deploy and apply the system more flexibly in the face of different business requirements and technical environments.
An embodiment of the invention provides a big data index management method based on a weak model, which selects a data source based on a service scenario, writes SQL statements, identifies dimension information and index information based on the SQL statements, writes the dimension information and the index information into an index table, correspondingly writes the SQL statements and the index table into a distributed columnar database, receives a query request for a custom index, queries an SQL statement based on parameter key information of the query request and basic information of the custom index, assembles the SQL statement to generate a custom SQL statement, performs statement optimization on the custom SQL statement, creates index codes according to the dimension information and the index information, generates detail indexes based on the index codes, and sets synchronization parameters of the detail indexes, wherein the detail indexes comprise a basic index, a derivative index and a composite index. Through the steps of data writing, index management, data query and data updating, the method achieves real-time updating and high-performance querying of the data corresponding to business requirements, and also responds quickly to complex data queries, so that users can obtain analysis results within a short time, timely and accurate data support is provided for enterprise decision-making, and the efficiency and flexibility of data analysis are improved.
Referring to fig. 2, fig. 2 is a schematic structural diagram of a big data index management system based on a weak model according to an embodiment of the present application. As shown in fig. 2, the system includes:
the writing module S201 is used for selecting a data source based on a service scenario, writing an SQL statement, identifying dimension information and index information based on the SQL statement, writing the dimension information and the index information into an index table, and correspondingly writing the SQL statement and the index table into a distributed columnar database;
the query module S202 is used for receiving a query request for a custom index, querying an SQL statement based on parameter key information of the query request in combination with basic information of the custom index, generating a custom SQL statement, and optimizing the custom SQL statement;
the index configuration module S203 is configured to create an index code according to the dimension information and the index information, generate detail indexes based on the index code, and set synchronization parameters of the detail indexes, where the detail indexes include a basic index, a derivative index, and a composite index.
In another embodiment, a big data index management system based on a weak model further includes:
a temporary module, used for reading data from the data source, writing it into a temporary table through a buffer queue, and generating corresponding partition data files in combination with partition rules;
and a replacement module, used for replacing the original partition files with the partition data files through the Stream Load feature of the distributed columnar database, and deleting the temporary table.
The foregoing describes specific embodiments of the present disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.