Detailed Description
The present application can be applied to fields that require real-time, extremely high-frequency access to massive amounts of data, such as the financial payment, stock quotation and futures trading industries. For convenience, the following examples mainly describe a transaction environment in the payment industry.
For ease of understanding and comparison, a schematic system hierarchy of a conventional trading platform that does not use the scheme of the present application is first presented in FIG. 1.
As shown, the figure can be divided into three layers from top to bottom, namely a front-end server (i.e., the user layer), a back-end server (i.e., the business layer), and a distributed database (i.e., the data layer, which can also be referred to as a distributed database module). Here, the front-end server (user layer) comprises the clients used by various merchants and the devices on which they run, such as smartphones, personal computers, tablets and notebooks with the clients installed. The back-end server (business layer) comprises a plurality of business processing systems for handling various transaction queries. The distributed database module (data layer) comprises a distributed database that stores the full amount of transaction data.
A conventional transaction processing flow based on the conventional transaction platform of FIG. 1 is shown in FIG. 2. First, a user at the user layer initiates a query to the front-end server through the query interface of a client; the front-end server passes the query conditions in the query request to the back-end server; the back-end server queries the transaction data tables directly according to the query conditions; and the statistical result data is returned along the original path to the query interface.
As can readily be seen from FIGS. 1 and 2, all query requests are issued directly to the database, and massive amounts of transaction data are queried. Therefore, when there are massive concurrent query requests, the read-write pressure on the database becomes excessive, performance degrades, and there is even a risk of crashes or downtime.
Thus, the above-described problem could be solved if statistics containing only the small number of fields required by the platform's query dimensions could be summarized in advance from the transaction table data and stored for querying.
Therefore, a solution is needed that collects, in advance, from the transaction table data, statistical data containing only the small number of fields required by the platform's query dimensions, to be queried by the merchant or user. For example, a merchant may want to view transaction data aggregated for the current day or month, such as the total transaction amount and the total number of transactions, or broken down by transaction type, such as WeChat Pay, Alipay, UnionPay Cloud QuickPass and the like.
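As a concrete illustration of such a query dimension, the following is a minimal sketch of the kind of aggregation such a query corresponds to, assuming a hypothetical transaction table t_transaction with columns merchant_no, trade_date, pay_type and amount (all names here are illustrative, not the platform's actual schema):

```sql
-- Hypothetical example: a merchant's daily transaction summary, broken down by
-- payment channel (table and column names are illustrative only).
SELECT
    merchant_no,
    trade_date,
    pay_type,                          -- e.g. WeChat Pay / Alipay / UnionPay Cloud QuickPass
    COUNT(*)    AS total_count,        -- total number of transactions
    SUM(amount) AS total_amount        -- total transaction amount
FROM t_transaction
WHERE merchant_no = '10001'
  AND trade_date  = CURRENT_DATE
GROUP BY merchant_no, trade_date, pay_type;
```

It is summaries of exactly this kind, rather than the raw transaction rows, that the scheme described below pre-computes and stores.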
To this end, the technical scheme of the present application first uses the concept of a 'data warehouse' to store data.
A data warehouse (hereinafter also referred to simply as a 'warehouse') is a subject-oriented, integrated, relatively stable data set that reflects historical changes and is mainly used to support management decision-making. It arises from the need to further mine data resources and meet decision-making requirements once databases already exist in large numbers. The data in a data warehouse is obtained by systematically processing, summarizing and organizing the original scattered database data after extraction and cleaning; inconsistencies in the source data must be eliminated to ensure that the information in the data warehouse is consistent, enterprise-wide information. The data in a data warehouse is mainly used for enterprise decision analysis, the data operations involved are mainly queries, and once data enters the data warehouse it is generally retained for a long time. The data in a data warehouse typically contains historical information, from which quantitative analyses and predictions of the enterprise's past development and future trends can be made.
Optionally, data warehouses may be divided into real-time data warehouses and offline data warehouses. Both are data warehouses used to store and manage enterprise data, but they differ significantly in the timing, speed and purpose of data processing and use.
An offline data warehouse (Offline Data Warehouse) is a system for storing and processing batch data. It is characterized in that the processing and analysis of data is based on batch processing jobs, typically over longer time periods. The main application scenarios of the offline data warehouse include applications requiring historical data analysis and report generation, such as sales reports and monthly or annual financial reports, and applications with large data volumes and non-critical processing times.
Thus, the offline data warehouse has the following features:
Batch processing: an offline data warehouse processes data through batch processing jobs, which means that data is collected and stored for a period of time and then processed in one pass.
High capacity: offline data warehouses are typically designed to store large amounts of historical data.
High latency: because the data processing is batch-based, offline data warehouses are not suitable for applications requiring real-time or near-real-time data.
A real-time data warehouse (Real-time Data Warehouse), on the other hand, is a system for storing and processing real-time data. Its main characteristic is that the processing and analysis of data is performed instantly: data enters the warehouse almost as soon as it is generated and can be used immediately for analysis and decision-making. The application scenarios of the real-time data warehouse mainly include applications that need to monitor business indicators in real time, such as financial transaction dashboards, real-time sales reports and online advertisement placement analysis, and applications that must take immediate action in response to real-time events, such as anomaly monitoring screens and real-time fraud detection.
Thus, the real-time data warehouse has the following features:
Low latency: a real-time data warehouse can capture and process data quickly after it is generated, typically within seconds or sub-seconds.
Data stream processing: real-time data warehouses typically use stream processing techniques, which allow data to be transformed and computed immediately upon entering the warehouse.
Real-time analysis: the data can be used for real-time monitoring, dashboards, predictions and decision support.
High throughput: a real-time data warehouse must process a large number of data streams and therefore requires high-throughput performance.
Complexity: the architecture and techniques of a real-time data warehouse are often complex because real-time data streams must be processed.
The real-time data warehouse is therefore suitable for applications that require real-time monitoring of business indicators, such as financial transaction dashboards, real-time sales reports and online advertisement placement analysis, as well as applications that must take immediate action in response to real-time events, such as anomaly monitoring screens and real-time fraud detection.
To overcome the above problem of all query requests collectively querying a huge amount of transaction table data, the present application provides an intelligent storage and analysis processing scheme that performs cloud data warehouse analysis and summarization on high-frequency query dimension data. In other words, the scheme of the present application meets different transaction data processing requirements by introducing both a real-time data warehouse and an offline data warehouse into the transaction data processing and storage system, thereby making full use of their respective advantages and achieving fast and efficient transaction queries.
First, FIG. 3 shows a schematic hierarchical structure of a data processing and storage system based on a cloud query database (also referred to simply as the 'cloud') according to the present application.
The hierarchy includes:
A client in the front-end server, which may be a mobile device, PC, tablet or notebook, etc. with a corresponding program or APP installed, is configured to receive transaction query requests from merchants.
A server node in the back-end server configured to process data queries for merchant query dimensions;
The databases in the (distributed) database layer include a self-built database and a cloud database. The self-built database is a database built by the enterprise itself; it is the database where the original SQL scripts and meta tables reside, and it is used to store all transaction-related data and to execute query operations. Unlike the conventional approach, in which the self-built database executes large numbers of transaction queries over massive data, the present application moves the SQL scripts corresponding to certain high-frequency query dimensions from the self-built database into the data warehouses for execution, and stores the statistical result data produced there in intermediate tables; the next query with the same query dimension can then be answered from the intermediate table without querying the self-built database. Thus, the primary responsibility of the self-built database in the present application is to act as backup storage rather than to frequently perform transaction query operations, while most query operations (those in the high-frequency query dimensions) are transferred to the cloud database. The cloud database may be an online data warehouse provided by a third-party platform and is configured to store the statistical result data produced by the warehouse analysis and to execute the corresponding queries. According to the characteristics of the stored statistical data, the cloud database may be further divided into a real-time data warehouse and an offline data warehouse, wherein the offline data warehouse is configured to store statistical data with a larger data volume and a relatively lower query frequency, and the real-time data warehouse is configured to store statistical data with a smaller data volume and a relatively higher query frequency.
In addition to the above structure, the transaction query platform system further comprises a data analysis layer containing a data analyzer, which is configured to perform buried-point recording and statistical analysis on merchants' query requests and to analyze and summarize the massive transaction data in the corresponding data warehouses. The specific implementation of the data analyzer's method flow is described in detail below.
Overall, the cloud-based data processing and storing method of the present application generally includes the following steps:
1. A query dimension buried point table is constructed to record each query dimension (query condition), its query SQL script and the associated meta tables; these query dimension buried points are also the basis for later defining the high-frequency query dimensions.
2. Certain query dimensions are determined to be high-frequency query dimensions according to the data records in the query dimension buried point table. The SQL script of each high-frequency query dimension is transferred to the data warehouses (the real-time data warehouse and the offline data warehouse) for execution, and the statistical result data obtained by executing the SQL script is stored in an intermediate table, where different SQL scripts build different intermediate tables.
3. The query dimension, the table name of its intermediate table and the data source where the intermediate table resides are associated one-to-one to construct a query dimension routing table.
After this flow is complete, when a merchant at the user layer initiates a query again, the query dimension in the query request can be matched against the query dimension routing table. If the match succeeds, the query request can be directed to the corresponding intermediate table on the data source recorded in the query dimension routing table, saving time and improving efficiency. If no match succeeds, the query request is executed on the metadata in the self-built database, as in a conventional query.
FIG. 4 is a schematic flow chart of the cloud-based data processing and storing method of the present application.
Step 1: analyze and record the merchants' query dimensions to construct a query dimension buried point table.
First, merchants at the user layer initiate query requests through the client of the query platform, and the client forwards the requests to the back-end server. A server node at the back-end server collects the merchant's query conditions from the received query request, such as the total revenue for a day or the total Alipay revenue for a day. The server node then records the collected query conditions in association with the current date, stores the SQL script executed for the query, and records the number of times the query has been executed that day and the time the query consumed, for use in the statistics and analysis of high-frequency query dimensions.
These recorded data may be referred to as buried points of the query dimensions. The recorded data may be stored in the form of a table (which may be referred to as a "query dimension buried point table"), as shown in Table 1:
TABLE 1
As shown in Table 1, the query dimension buried point table records each query dimension (query_condition) together with its corresponding query SQL script and associated meta tables; these buried points are also the basis for later defining the high-frequency query dimensions.
It should be understood that the fields and field names in the table are exemplary only and not limiting. A technician can add, delete or adjust fields according to actual requirements to obtain the required query dimension buried point table, and such modifications fall within the scope of protection of the present application.
Constructing the query dimension buried point table facilitates the subsequent statistical analysis of high-frequency query dimensions.
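Since Table 1 itself is not reproduced here, the following is a hedged sketch of what the query dimension buried point table could look like, using the field names mentioned in the text (query_condition, query_time, query_count) and illustrative names for the remaining columns:

```sql
-- Hypothetical structure of the query dimension buried point table; only
-- query_condition, query_count and query_time are named in the text, the rest
-- are illustrative.
CREATE TABLE t_query_dim_point (
    id              BIGINT PRIMARY KEY,
    query_condition VARCHAR(256),  -- query dimension, e.g. 'daily revenue by pay type'
    query_sql       TEXT,          -- the SQL script executed for this dimension
    meta_tables     VARCHAR(512),  -- associated meta tables, e.g. 't_transaction,t_merchant'
    query_date      DATE,          -- date on which the buried point was recorded
    query_count     INT,           -- number of times the dimension was queried that day
    query_time      INT            -- time consumed by the query, in milliseconds
);
```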
Step 2: perform statistical analysis on the merchants' query dimension buried points to define the high-frequency query dimensions.
The statistical analysis includes identifying the high-frequency query dimensions among the recorded query dimensions. A high-frequency query dimension, as the name implies, is a query dimension that is queried particularly often. Using the query dimension buried point table constructed in step 1, a query dimension that, based on the buried point data collected each day, is determined to be both relatively time-consuming and frequently queried can be defined as a high-frequency query dimension. Higher query time consumption implies a larger volume of statistical result data, and higher query frequency implies a larger number of queries.
Thus, again taking Table 1 as an example, the definition criteria for the high frequency query dimension may be as follows:
1. Query time consumption of at most 1000 milliseconds (query_time ≤ 1000 ms) and at most 10000 queries on the day (query_count ≤ 10000): not defined as a high-frequency query;
2. Query time consumption of at most 1000 milliseconds and at least 10000 queries on the day: not defined as a high-frequency query;
3. Query time consumption of more than 1000 milliseconds and at most 10000 queries on the day: not defined as a high-frequency query;
4. Query time consumption of more than 1000 milliseconds and at least 10000 queries on the day: defined as a high-frequency query.
It should be appreciated that the above time-consumption threshold of 1000 milliseconds and query-count threshold of 10000 are given only as examples. In practical applications, technicians may adjust these thresholds as appropriate according to the transaction volume, server processing capacity, network bandwidth and other factors. Such modifications are within the scope of the present application.
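Under the example criteria above, selecting the high-frequency query dimensions from the buried point table could be expressed roughly as follows (a sketch only, reusing the hypothetical table and column names introduced earlier):

```sql
-- Only dimensions that are both time-consuming (> 1000 ms) and queried very
-- frequently (>= 10000 times on the day) qualify as high-frequency queries.
SELECT query_condition, query_sql, meta_tables
FROM t_query_dim_point
WHERE query_date  = CURRENT_DATE
  AND query_time  > 1000
  AND query_count >= 10000;
```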
For each query dimension defined as a high-frequency query dimension, the corresponding query SQL script needs to be moved into a data warehouse and executed asynchronously there, where the data analysis and summarization processing is performed.
Step 3: analyze and summarize the data in the data warehouses.
First, all metadata related to the query dimensions defined as high-frequency queries is synchronized from the self-built database to the data warehouses through a data transmission service. Here, the metadata refers to the table data related to the original SQL script (generally several associated tables, such as a transaction table, a merchant table and the like). The synchronization is divided as follows:
metadata with high real-time requirements, i.e. recent data with a relatively small data volume, is synchronized into the real-time data warehouse;
metadata with low real-time requirements, i.e. historical data with a relatively large data volume, is synchronized into the offline data warehouse.
Where "recent" generally refers to data of the present year or the last year (last 12 months). In practical application, most of transaction inquiry of scenes is aimed at recent data, and the metadata amount is small.
Where "historical" generally refers to data over a year, the amount of metadata in this case is large. In practical applications, there will of course be transaction queries that query transactions several years ago, such as transaction summary queries within 5 years, 10 years, which are slow to analyze, but the query scenario is less frequent.
Next, the query SQL script associated with each query dimension defined as a high-frequency query is executed in the corresponding data warehouse, and the query results are analyzed, summarized and counted.
By transferring the SQL script queries to the data warehouses, the data processing performance advantage of the warehouses can be fully utilized, so that the SQL executes faster.
Step 4: store the statistical result data produced by the warehouse analysis into the corresponding data warehouse of the cloud database, wherein:
statistical result data with a larger data volume and a relatively lower query frequency is stored in the offline data warehouse;
statistical result data with a smaller data volume and a relatively higher query frequency is stored in the real-time data warehouse.
The statistical result data is also called intermediate table data; different SQL scripts produce different intermediate tables. An intermediate table contains the merchant's query dimensions and the required query results, such as the merchant number, transaction date, payment method, total amount and the like. An example of the data format is shown in Table 2:
TABLE 2
Again, it should be understood that the fields and field names in the table are exemplary only and not limiting. A technician can add, delete or adjust the query dimensions and query results in the table according to actual requirements, and such modifications fall within the scope of protection of the present application.
The intermediate table corresponding to each high-frequency query dimension is constructed from the statistical result data produced by executing that dimension's SQL script, and queries can then be executed on the intermediate table instead of on the massive metadata tables, thereby improving query efficiency.
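As an illustration, an intermediate table for a "daily totals per merchant and payment method" dimension, together with the statement that populates it when the dimension's SQL script is executed in the data warehouse, might look roughly as follows (all names are illustrative):

```sql
-- Hypothetical intermediate table holding the statistical result data.
CREATE TABLE t_mid_daily_pay_summary (
    merchant_no  VARCHAR(32),
    trade_date   DATE,
    pay_type     VARCHAR(32),
    total_count  INT,
    total_amount DECIMAL(18, 2),
    PRIMARY KEY (merchant_no, trade_date, pay_type)
);

-- The dimension's SQL script runs in the data warehouse and its summarized
-- result is written into the intermediate table.
INSERT INTO t_mid_daily_pay_summary
SELECT merchant_no, trade_date, pay_type, COUNT(*), SUM(amount)
FROM t_transaction
GROUP BY merchant_no, trade_date, pay_type;
```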
Step 5: associate each query dimension defined as a high-frequency dimension with its corresponding intermediate table to construct a query dimension routing table.
Each query dimension defined as a high-frequency query, the table name of the corresponding intermediate table constructed for it, and the data source where that intermediate table resides (such as the real-time data warehouse or the offline data warehouse) are placed together, in association, in a query dimension routing table to maintain the mapping. An example of the query dimension routing table is shown in Table 3:
TABLE 3
By using the query dimension routing table, a query request for a high-frequency query dimension can be conveniently routed to the data source where the corresponding intermediate table resides and executed on that intermediate table, thereby reducing massive queries on the self-built database.
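Since Table 3 itself is not reproduced here, a hypothetical sketch of the query dimension routing table and one of its entries could look as follows (names and values are illustrative):

```sql
-- Hypothetical routing table associating a high-frequency query dimension with
-- its intermediate table and the data source where that table resides.
CREATE TABLE t_query_dim_route (
    query_condition VARCHAR(256) PRIMARY KEY,  -- the high-frequency query dimension
    mid_table_name  VARCHAR(128),              -- name of the intermediate table
    data_source     VARCHAR(64)                -- e.g. 'realtime_dw' or 'offline_dw'
);

INSERT INTO t_query_dim_route VALUES
    ('daily revenue by pay type', 't_mid_daily_pay_summary', 'realtime_dw');
```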
For each query dimension defined as a high-frequency query, steps 3 to 5 are executed in sequence to construct the corresponding query dimension routing table entries; at that point the schematic flow of the cloud-based data processing and storing method is essentially complete.
After the cloud-based data processing and storing method has been used to realize the real-time data warehouse, the offline data warehouse, the query dimension buried point table, the intermediate tables, the query dimension routing table and the other features described above, the server side can process subsequent merchant query requests according to the following flow: the back-end system checks the query conditions in the query request to determine whether intermediate table data needs to be read, as follows:
If there is no matching record in the query dimension routing table, the query does not use intermediate table data and follows the original logic, i.e. the transaction tables and other data are queried directly in the self-built database;
If a matching record exists in the query dimension routing table, the query can be served from the intermediate table data: the corresponding intermediate table name and the data source where it resides are obtained from the matching routing record, a connection to that data source is established, and the query is executed directly on the intermediate table there, instead of querying the massive data in the self-built database.
Because the intermediate table data contains only the statistical result data of the high-frequency query dimensions, its data volume is far smaller than that of the metadata in the self-built database. Consequently, queries based on the intermediate table data are far faster and more efficient than the corresponding operations performed on the self-built database.
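Putting the pieces together, the routing lookup and the subsequent query on the intermediate table could look roughly like this (a sketch only, reusing the hypothetical names from the earlier examples; connecting to the returned data source is application logic outside the SQL itself):

```sql
-- 1) Match the incoming query dimension against the routing table.
SELECT mid_table_name, data_source
FROM t_query_dim_route
WHERE query_condition = 'daily revenue by pay type';

-- 2) If a row is returned (e.g. 't_mid_daily_pay_summary' on 'realtime_dw'),
--    connect to that data source and run the query against the much smaller
--    intermediate table instead of the self-built database:
SELECT trade_date, pay_type, total_count, total_amount
FROM t_mid_daily_pay_summary
WHERE merchant_no = '10001'
  AND trade_date  = CURRENT_DATE;
```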
While various embodiments have been described above, it should be understood that they have been presented by way of example only, and not limitation. It will be understood by those of ordinary skill in the relevant art(s) that various changes in form and details may be made therein without departing from the spirit and scope of the present invention as defined by the following claims. Thus, the breadth and scope of the present invention as disclosed herein should not be limited by any of the above-described exemplary embodiments, but should be defined only in accordance with the following claims and their equivalents.