
Data lifecycle analysis method, device and product based on audit log

Info

Publication number: CN118260366B
Application number: CN202410677122.8A
Authority: CN (China)
Prior art keywords: data, audit log, intermediate result set, tables
Legal status: Active (granted)
Other languages: Chinese (zh)
Other versions: CN118260366A
Inventor: 赵庆刚
Current Assignee: Tianjin Nankai University General Data Technologies Co ltd
Original Assignee: Tianjin Nankai University General Data Technologies Co ltd
Application filed by Tianjin Nankai University General Data Technologies Co ltd
Priority date / Filing date: 2024-05-29 (CN202410677122.8A)
Publication of CN118260366A: 2024-06-28
Publication of CN118260366B (grant): 2024-10-08


Abstract

The invention relates to the technical field of data management and discloses a data lifecycle analysis method, device and product based on audit logs. The method comprises the following steps: obtaining an audit log summary table; creating a row number dimension table according to the maximum number of rows into which a single row can be split; filtering the audit log summary table according to a start time and filtering out its useless keywords to generate an intermediate result set table A; cross-joining the row number dimension table with the intermediate result set table A to realize a self-built Cartesian product and generate an intermediate result set table B; filtering out the tables of the system databases to generate a basic analysis data table; and performing cold/warm/hot data analysis according to the basic analysis data table to obtain the data lifecycle analysis result. On the basis of comprehensively considering association with table usage, the low efficiency of metadata acquisition, and a high degree of integration, the invention realizes the classification of data tables and greatly improves the accuracy of the results and the working efficiency.

Description

Data lifecycle analysis method, device and product based on audit logs

Technical Field

The present invention relates to the technical field of data management, and in particular to a data lifecycle analysis method, device and product based on audit logs.

Background Art

Data lifecycle management is an unavoidable problem in data warehouse administration, and it typically arises in the middle and late stages of data warehouse operation. As storage resources are steadily consumed by data, storage in the data warehouse becomes increasingly scarce, and the need to retire and replace data becomes increasingly urgent.

Usually, data in a data warehouse is managed in the form of data objects such as data tables. Data lifecycle management classifies data tables according to their usage characteristics and applies a different processing strategy to each category. For example, tables can be classified as hot, warm, or cold data according to whether they have gone unused for the past three months, the past six months, or the past year. The separated cold data can then be removed from the data warehouse through off-site migration or destruction, releasing storage resources and relieving the warehouse's storage pressure.

Traditional data classification methods include: (1) statistics based on the timestamp embedded in the table name; (2) statistics based on the table's creation time; (3) statistics based on the update time of the data in the table. First, none of these methods is associated with the actual usage of the table (including reads and writes), which leads to incorrect classification and, in turn, to incorrect data processing strategies. For example, a table created three years ago but read within the last month would be classified as cold data by such methods and eventually deleted. Second, because metadata such as table creation time and update time cannot be fetched in batches, collecting this information is extremely slow when the number of tables is large; in a data warehouse with 500,000 tables, traversing the tables to collect it takes at least 4 hours.

Summary of the Invention

The present invention aims to solve two problems of traditional classification methods: because they are not associated with table usage, they easily produce erroneous statistics and therefore erroneous data processing strategies, and the metadata needed for analysis is slow to extract. To this end, the present invention provides a data lifecycle analysis method, device and product based on audit logs that classifies data tables while taking three aspects into account: association with table usage, avoidance of the slow metadata acquisition path, and a high degree of integration. This greatly improves the accuracy of the results and the working efficiency.

The present invention provides a data lifecycle analysis method based on audit logs. The technical solution adopted is as follows, comprising:

obtaining an audit log summary table, where the audit log summary table is the audit log of the data tables output in the form of a table;

setting the maximum number of rows into which a single row can be split, and creating a row number dimension table according to this maximum;

setting a start time, filtering the audit log summary table according to the start time, then filtering out useless keywords from the audit log summary table, and generating an intermediate result set table A after filtering;

cross-joining the row number dimension table with the intermediate result set table A to realize a self-built Cartesian product and generate an intermediate result set table B, where the intermediate result set table B records the most recent access time of each data table;

filtering out of the intermediate result set table B the data tables that cold/warm/hot data analysis does not care about, generating a basic analysis data table;

performing cold/warm/hot data analysis according to the basic analysis data table to obtain the data lifecycle analysis result.

Furthermore, the audit log of the data tables is the audit log of the data tables stored by the database cluster.

Furthermore, the row number dimension table is created as a temporary, no-replica table, and the maximum number of rows into which a single row can be split is not less than the maximum number of tables in any single row of the audit log summary table. The row number dimension table is created by inserting, in a loop, the numbers representing row numbers, starting at 1 and ending at the maximum number of rows into which a single row can be split.

Furthermore, the data tables that cold/warm/hot data analysis does not care about include the tables under the system databases and the tables that do not exist in the system tables.

The process of filtering out of the intermediate result set table B the tables that do not exist in the system tables is: creating an intermediate result set table C, where the intermediate result set table C stores the database names and table names from the system tables, and then filtering out of table B the tables that do not exist in table C.

The process of filtering out of the intermediate result set table B the tables under the system databases is: directly filtering the system-database tables out of table B.

The present invention also provides a data lifecycle analysis device based on audit logs. The technical solution adopted is as follows: it comprises an audit log summary table acquisition module, a row number dimension table generation module, a filtering module A, a cross-join module, a filtering module B and a cold/warm/hot data analysis module.

The audit log summary table acquisition module is connected to the filtering module A; the filtering module A and the row number dimension table generation module are each connected to the cross-join module; the cross-join module is connected to the filtering module B; and the filtering module B is connected to the cold/warm/hot data analysis module.

The audit log summary table acquisition module is used to obtain the audit log summary table, where the audit log summary table is the audit log of the data tables output in the form of a table.

The row number dimension table generation module is used to receive the maximum number of rows into which a single row can be split, and to create the row number dimension table according to this maximum.

The filtering module A is used to receive the start time, filter the audit log summary table according to the start time, then filter out useless keywords from the audit log summary table, and generate the intermediate result set table A after filtering.

The cross-join module is used to cross-join the row number dimension table with the intermediate result set table A, realizing a self-built Cartesian product and generating the intermediate result set table B, which records the most recent access time of each data table.

The filtering module B is used to filter out of the intermediate result set table B the data tables that cold/warm/hot data analysis does not care about, generating the basic analysis data table.

The cold/warm/hot data analysis module is used to perform cold/warm/hot data analysis according to the basic analysis data table to obtain the data lifecycle analysis result.

Furthermore, the audit log of the data tables is the audit log of the data tables stored by the database cluster.

Furthermore, the row number dimension table generation module creates the row number dimension table as a temporary, no-replica table. The creation process is: inserting, in a loop, the numbers representing row numbers, starting at 1 and ending at the maximum number of rows into which a single row can be split, where this maximum is not less than the maximum number of tables in any single row of the audit log summary table.

Furthermore, the data tables that cold/warm/hot data analysis does not care about include the tables under the system databases and the tables that do not exist in the system tables.

The process by which the filtering module B filters out of the intermediate result set table B the tables that do not exist in the system tables is: creating an intermediate result set table C, where the intermediate result set table C stores the database names and table names from the system tables, and then filtering out of table B the tables that do not exist in table C.

The process by which the filtering module B filters out of the intermediate result set table B the tables under the system databases is: directly filtering the system-database tables out of table B.

The present invention also provides a computer program product comprising a computer program which, when executed by a processor, implements the above data lifecycle analysis method based on audit logs.

The above technical solutions in the embodiments of the present invention have at least one of the following technical effects:

1. The present invention replaces the traditional approach of classifying the data lifecycle by table creation time. It classifies data tables by their usage information, relying on the audit log, which avoids the logical error of a table being counted as old because it was created long ago even though it was used recently.

2. The present invention avoids accessing the metadata tables, greatly improving the efficiency of aggregating the analysis data. Batch-fetching table creation times through the metadata interface information_schema is extremely slow: in a data warehouse with 500,000 tables, traversing the tables to collect this information takes at least 4 hours. The present invention instead aggregates audit log information; summarizing 5 billion audit log records takes only about 50 minutes, an efficiency improvement of nearly 4 times.

3. The present invention builds a Cartesian product through a cross join to implement the one-row-to-many-rows operation, converting each row's multi-table string record into multiple rows of data, so that the generated intermediate result set table B can record the most recent access time of each data table.

4. The main implementation logic of the present invention can be compiled and persisted in the database in the form of a stored procedure, so that personnel who do not know SQL (Structured Query Language) can also use it, which greatly reduces operational complexity and facilitates adoption.

5. The main logic variations of the present invention are controlled by parameters. Calling the stored procedure only requires adjusting two parameters according to actual needs: the maximum number of rows into which a single row can be split, and the start time (see the sketch after this list). No secondary development is required, which greatly reduces development and maintenance costs.

6. The present invention can meet the requirements of database clusters of different versions and also applies to virtual clusters, which improves its compatibility.
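As an illustration of point 5, the following is a minimal sketch of what the stored-procedure interface could look like. The procedure name, parameter names and MySQL-style syntax are assumptions for illustration; the patent only states that the procedure takes these two parameters.

```sql
DELIMITER //
-- Hypothetical interface: the two parameters named in the description.
CREATE PROCEDURE analyze_data_lifecycle(IN p_max_split INT, IN p_start_time DATETIME)
BEGIN
  -- Steps 2-6 of the method would be implemented here:
  -- build the row number dimension table, filter the audit log summary table,
  -- cross join to split rows, filter out irrelevant tables, classify hot/warm/cold.
END //
DELIMITER ;

-- Example call: split at most 100 table names per row, analyze the last 180 days.
CALL analyze_data_lifecycle(100, NOW() - INTERVAL 180 DAY);
```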

Additional aspects and advantages of the present invention will be given in part in the following description; in part they will become apparent from the description, or will be learned through practice of the present invention.

Brief Description of the Drawings

In order to more clearly illustrate the technical solutions of the present invention or of the prior art, the drawings required by the embodiments or by the description of the prior art are briefly introduced below. Obviously, the drawings described below show some embodiments of the present invention, and those of ordinary skill in the art can obtain other drawings from them without creative effort.

FIG. 1 is a flow chart of the method provided by the present invention.

FIG. 2 is a sample of the audit log provided by the present invention.

FIG. 3 is a diagram of the GBASE database cluster audit log framework provided by the present invention.

FIG. 4 is a schematic diagram of the structure of the device provided by the present invention.

Reference numerals:

1. audit log summary table acquisition module; 2. row number dimension table generation module; 3. filtering module A; 4. cross-join module; 5. filtering module B; 6. cold/warm/hot data analysis module.

Detailed Description

In order to make the purpose, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention are described clearly and completely below in conjunction with the drawings. Obviously, the described embodiments are some, rather than all, of the embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the scope of protection of the present invention. The following embodiments are used to illustrate the present invention but cannot be used to limit its scope.

In the description of this specification, reference to the terms "one embodiment", "some embodiments", "example", "specific example" or "some examples" means that a specific feature, structure, material or characteristic described in conjunction with that embodiment or example is included in at least one embodiment or example of the present invention. In this specification, such schematic expressions do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials or characteristics described may be combined in a suitable manner in any one or more embodiments or examples. In addition, where there is no contradiction, those skilled in the art may combine the different embodiments or examples described in this specification and the features of those embodiments or examples.

The present invention is further described in detail below in conjunction with FIG. 1 to FIG. 4, describing a data lifecycle analysis method, device and product based on audit logs of the present invention.

In this embodiment, as shown in FIG. 1, a data lifecycle analysis method based on audit logs is provided, comprising the following steps:

Step 1: Obtain an audit log summary table, which is the audit log of the data tables output in the form of a table. The audit log of the data tables is the audit log of the data tables stored by the database cluster.

Taking the GBASE database cluster as an example, its audit log is shown in FIG. 2. Referring to the boxed content in FIG. 2, the table_list field of the audit_log table records the tables used by each SQL statement: all table names are stored on the single row corresponding to that statement, separated by commas, and grouped internally into three types, WRITE, READ and OTHER. To obtain time-and-table-name detail data, a single-row-to-multiple-rows operation is needed, so the audit log must first be output in the form of a table to obtain the audit log summary table. Because the audit log summary table can contain rows in which one access time corresponds to multiple table names, it cannot be used directly for cold/warm/hot data analysis.
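To get a feel for the raw rows before any processing, the newest entries can simply be inspected. This is only a sketch: the table and column names below are taken from FIG. 2 and the sample tables, and the actual layout of table_list depends on the GBASE version and audit configuration.

```sql
-- Peek at the most recent raw audit rows: one row per SQL statement,
-- with all referenced tables comma-separated inside table_list.
SELECT start_time, table_list
FROM audit_log
ORDER BY start_time DESC
LIMIT 5;
```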

To obtain the audit log summary table, the audit log function of the database must first be enabled. The following again takes the GBASE database cluster as an example to explain how the audit log function is enabled and its related configuration is initialized.

Step 1.1: Enable the audit log function and initialize the related configuration parameters. Set the database system parameter "audit log switch" to on; set the database system parameter "execution time length" to the shortest value, "zero"; set the database system parameter "log output" to "table" so that the audit results are stored in a table, ensuring that the output is the audit log summary table; and create a new audit policy.

Step 1.2: Adjust the aggregation frequency of the audit log data. First disable "Import audit log" in event management; then change that event's schedule to aggregate once every hour; then re-enable "Import audit log" in event management. This step sets the audit log aggregation frequency to once per hour. The frequency can be changed as needed, for example to once every half hour or once every two hours.
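If the same adjustment were made in SQL rather than through the management console, it might look like the following MySQL-style sketch. The event name is hypothetical, and the exact event DDL supported by a given GBASE version may differ.

```sql
-- Hypothetical event name; adjust to match the actual "Import audit log" event.
ALTER EVENT import_audit_log DISABLE;
ALTER EVENT import_audit_log ON SCHEDULE EVERY 1 HOUR;  -- aggregate audit logs once per hour
ALTER EVENT import_audit_log ENABLE;
```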

Step 1.3: Check the validity of the audit log summary data. The application side randomly issues insert, delete, update and query statements to the database over a cumulative period of more than one year, and the audit log summary table is then checked for generated summary data. The one-year period is a setting chosen by the user according to actual needs and can be changed. Manually inspect the contents of the audit log summary table; if it contains summary data, the audit log summary table has been generated successfully.
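A minimal check along these lines could be run as follows; the table name audit_log and the start_time column are taken from the figure and sample tables above, and the cut-off date is only a placeholder.

```sql
-- Count how many audit rows have been aggregated since a chosen date.
SELECT COUNT(*) AS summarized_rows
FROM audit_log
WHERE start_time >= '2024-01-01 00:00:00';  -- any date inside the observation window
```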

The audit log framework of the GBASE database cluster is shown in FIG. 3. Through steps 1.1 to 1.3, the database's event manager imports the audit log events, aggregates the audit logs stored on the multiple scheduling nodes, and outputs them in the form of a table, generating the audit log summary table.

Step 2: Set the maximum number of rows into which a single row can be split, and create a row number dimension table according to this maximum.

The row number dimension table is created as a temporary, no-replica table. Its creation process is: inserting, in a loop, the numbers representing row numbers, starting at 1 and ending at the maximum number of rows into which a single row can be split.

The maximum number of rows into which a single row can be split must be no less than the maximum number of tables in any single row of the audit log summary table.

This maximum is set based on practical experience. A single SQL statement generally involves no more than 100 tables, so the maximum number of rows into which a single row can be split is usually set to 100; it can be raised if necessary.
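A minimal sketch of this step, assuming MySQL-compatible procedural SQL. The table and procedure names are hypothetical, and the GBASE-specific no-replica clause is omitted because its exact syntax varies by version.

```sql
-- Row number dimension table: one row per possible split position (1..p_max_split).
CREATE TEMPORARY TABLE row_num_dim (rn INT NOT NULL);

DELIMITER //
CREATE PROCEDURE fill_row_num_dim(IN p_max_split INT)
BEGIN
  DECLARE i INT DEFAULT 1;
  WHILE i <= p_max_split DO
    INSERT INTO row_num_dim (rn) VALUES (i);
    SET i = i + 1;
  END WHILE;
END //
DELIMITER ;

CALL fill_row_num_dim(100);  -- 100: the default maximum from the description
```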

Step 3: Set a start time and filter the audit log summary table according to it; content before the start time is filtered out. Then filter out the useless keywords in the audit log summary table; after filtering, the intermediate result set table A is generated. Useless keywords include the table operation markers, such as WRITE, READ and OTHER.
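A sketch of how table A could be produced, assuming the summary table is the audit_log table with start_time and table_list columns as in FIG. 2. The keyword-stripping expression is illustrative and would have to match the actual layout of table_list.

```sql
-- Intermediate result set table A: access time plus the (still comma-separated) table list,
-- restricted to the observation window and with the operation markers stripped out.
CREATE TEMPORARY TABLE tmp_result_a AS
SELECT
  start_time,
  REPLACE(REPLACE(REPLACE(table_list, 'WRITE:', ''), 'READ:', ''), 'OTHER:', '') AS tbl
FROM audit_log
WHERE start_time >= NOW() - INTERVAL 180 DAY;  -- 180 days: the example start-time window
```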

A sample of the intermediate result set table A is shown in Table 1; its contents correspond to the SQL statement boxed in FIG. 2. In Table 1, start_time is the table's most recent access time and tbl is the corresponding table name (table name list), where a table name has the form database_name.table_name. For example, in ssbm.customer, ssbm is the database name and customer is the data table name. When a single SQL statement involves multiple data tables, the tbl column of table A likewise contains multiple table names.

Table 1: Sample of intermediate result set table A

Step 4: Cross join the row number dimension table with the intermediate result set table A to build a self-built Cartesian product, and create the intermediate result set table B from the result set of that cross join; table B records the most recent access time of each table.

The cross join that builds the Cartesian product achieves the single-row-to-multiple-rows effect: the row number is used to cut a different piece of data out of each duplicated row. Creating the intermediate result set table B from the cross-join result set is essentially a filtering process that selects the most recent access time of each table. In table B, each table occupies one row and there are no duplicate table names. A sample of table B is shown in Table 2.

Table 2: Sample of intermediate result set table B
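The row split behind Table 2 can be sketched as follows, assuming MySQL-compatible string functions. The SUBSTRING_INDEX idiom is a common way to emulate one-row-to-many-rows over a numbers table; it is an illustration, not the patent's literal SQL.

```sql
-- Intermediate result set table B: one row per table, with its most recent access time.
CREATE TEMPORARY TABLE tmp_result_b AS
SELECT
  TRIM(SUBSTRING_INDEX(SUBSTRING_INDEX(a.tbl, ',', d.rn), ',', -1)) AS tbl,
  MAX(a.start_time) AS last_access_time
FROM tmp_result_a a
CROSS JOIN row_num_dim d                                            -- self-built Cartesian product
WHERE d.rn <= 1 + LENGTH(a.tbl) - LENGTH(REPLACE(a.tbl, ',', ''))   -- keep only real split positions
GROUP BY tbl;
```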

Step 5: The intermediate result set table B contains some data tables that cold/warm/hot data analysis does not care about. They fall into two categories: tables under the system databases, and tables that do not exist in the system tables. These must be filtered out of table B to generate the basic analysis data table, which can then be used for cold/warm/hot data analysis.

Step 5.1: In a real application system, some temporary tables may be cleaned up as soon as they are no longer needed. Such temporary tables no longer exist in the system tables, but the operations on them are still recorded in the audit log, so they must be removed from the intermediate result set table B.

The row number dimension table, table A and table B are all created inside the database, whereas the system tables are not stored in the database and cannot be read by it directly. Therefore, the database parameter gbase_query_path must be set to 1 so that the two different engines can exchange data. An intermediate result set table C is created to store the database names and table names from the system tables outside the database. Table C is then compared against table B, and the tables that do not exist in table C are filtered out of table B.
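The parameter change and the population of table C might look like the following sketch. The parameter name gbase_query_path comes from the description above, while the SET syntax and the use of information_schema.tables as the source of current names are assumptions.

```sql
-- Allow the SQL layer to read the catalog kept outside the database engine.
SET gbase_query_path = 1;

-- Intermediate result set table C: the databases and tables that currently exist
-- (information_schema.tables is used here as a stand-in for the external system tables).
CREATE TEMPORARY TABLE tmp_result_c AS
SELECT table_schema AS db_name, table_name AS tbl_name
FROM information_schema.tables;
```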

Step 5.2: The tables under the system databases are not data stored by the user in the database and are not of interest to cold/warm/hot analysis, so they must be removed from the intermediate result set table B. Since the system-database tables are stored in the database, they can be filtered out of table B directly.

Step 5.3: After the filtering of the intermediate result set table B is complete, the basic analysis data table is generated (a combined sketch of both filters follows).
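Both filters of step 5 can be folded into the statement that materializes the basic analysis data table. A minimal sketch, assuming the column names used for table C above and treating the listed system database names as representative rather than exhaustive:

```sql
-- Basic analysis data table: tables that still exist in table C (step 5.1)
-- and do not belong to a system database (step 5.2).
CREATE TEMPORARY TABLE basic_analysis AS
SELECT b.tbl, b.last_access_time
FROM tmp_result_b b
JOIN tmp_result_c c
  ON  c.db_name  = SUBSTRING_INDEX(b.tbl, '.', 1)
  AND c.tbl_name = SUBSTRING_INDEX(b.tbl, '.', -1)
WHERE c.db_name NOT IN ('information_schema', 'performance_schema', 'gbase');  -- assumed system database names
```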

Step 6: Perform cold/warm/hot data analysis according to the basic analysis data table to obtain the data lifecycle analysis result.

Step 6.1: Specify the lifecycle thresholds for cold, warm and hot data according to business needs; these thresholds are set manually based on the industry and customer requirements. Taking a classification scenario in the mobile communications industry as an example, data accessed within the last 60 days is generally treated as hot data, data accessed 60-90 days ago as warm data, and data not accessed for more than 90 days as cold data.

Step 6.2: Classify the basic analysis data table according to the specified lifecycle thresholds to obtain the type of each table in it: cold data, warm data or hot data. Cold/warm/hot analysis can then be performed according to customer needs to obtain the data lifecycle analysis result. Illustratively, all detail rows of the cold, warm and hot data can simply be exported; or the cold, warm and hot tables can be counted separately to obtain their respective totals, and on that basis the percentage of each of the three types in the overall total can be computed; or the counts can be grouped by database name to find which databases hold more cold data, so that the tables in those databases can be cleaned up first. The analysis results for cold, warm and hot data give customers a basis for handling their data, helping them make sound judgments and decisions such as data cleanup and data migration.
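A sketch of the classification and of one of the suggested aggregations, using the 60/90-day thresholds from the mobile-communications example; the output column names are illustrative.

```sql
-- Classify each table by the time since its last access.
SELECT
  tbl,
  last_access_time,
  CASE
    WHEN last_access_time >= NOW() - INTERVAL 60 DAY THEN 'hot'
    WHEN last_access_time >= NOW() - INTERVAL 90 DAY THEN 'warm'
    ELSE 'cold'
  END AS temperature
FROM basic_analysis;

-- Count cold tables per database to decide which databases to clean up first.
SELECT SUBSTRING_INDEX(tbl, '.', 1) AS db_name, COUNT(*) AS cold_tables
FROM basic_analysis
WHERE last_access_time < NOW() - INTERVAL 90 DAY
GROUP BY db_name
ORDER BY cold_tables DESC;
```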

The present invention reconstructs the aggregated audit log data and, in the form of a stored procedure, completes the aggregation of the basic analysis data step by step. On the basis of the basic analysis data, the data tables in the database are classified flexibly, finally producing analysis data assets ready for delivery.

The present invention is implemented as logic inside a stored procedure and runs entirely inside the database. An earlier attempt split the table_list column into two categories, single-table rows and multi-table rows, exported the multi-table rows out of the database, processed them with the sed command, and then loaded them back. That solution requires repeated operations inside and outside the database and is less convenient than the purely in-database approach of the present invention; moreover, when the data volume is large and the files to be processed are big, the processing efficiency of the sed command drops sharply and cannot be guaranteed.

In this embodiment, as shown in FIG. 4, a data lifecycle analysis device based on audit logs is also provided. The technical solution adopted is as follows: it comprises an audit log summary table acquisition module 1, a row number dimension table generation module 2, a filtering module A3, a cross-join module 4, a filtering module B5 and a cold/warm/hot data analysis module 6. The audit log summary table acquisition module 1 is connected to the filtering module A3; the filtering module A3 and the row number dimension table generation module 2 are each connected to the cross-join module 4; the cross-join module 4 is connected to the filtering module B5; and the filtering module B5 is connected to the cold/warm/hot data analysis module 6.

The audit log summary table acquisition module 1 is used to obtain the audit log summary table, which is the audit log of the data tables output in the form of a table. The audit log of the data tables is the audit log of the data tables stored by the database cluster. The fields of the audit log summary table include start_time and tbl: start_time is the table's most recent access time, and tbl is the corresponding table name (table name list) in the form database_name.table_name. The audit log summary table acquisition module 1 sends the audit log summary table to the filtering module A3.

The row number dimension table generation module 2 is used to receive the maximum number of rows into which a single row can be split, and to create the row number dimension table according to this maximum. The row number dimension table generation module 2 sends the row number dimension table to the cross-join module 4.

The row number dimension table generation module 2 creates the row number dimension table as a temporary, no-replica table. The creation process is: inserting, in a loop, the numbers representing row numbers, starting at 1 and ending at the maximum number of rows into which a single row can be split, where this maximum is not less than the maximum number of tables in any single row of the audit log summary table. This maximum is a parameter set manually by the user according to the actual situation; illustratively, it can be set to 100.

The filtering module A3 is used to receive the start time and filter the audit log summary table according to it; content before the start time is filtered out. It then filters out the useless keywords in the audit log summary table and generates the intermediate result set table A. The useless keywords include the table operation markers. The filtering module A3 sends the intermediate result set table A to the cross-join module 4. The start time is a parameter set manually by the user according to the actual situation; illustratively, it can be set to 180 days before the present.

The cross-join module 4 is used to cross-join the row number dimension table with the intermediate result set table A, realizing a self-built Cartesian product and generating the intermediate result set table B, which records the most recent access time of each data table. In table B, each table occupies one row and there are no duplicate table names. The cross-join module 4 sends the intermediate result set table B to the filtering module B5.

The filtering module B5 is used to filter out of the intermediate result set table B the data tables that cold/warm/hot data analysis does not care about, generating the basic analysis data table. The filtering module B5 sends the basic analysis data table to the cold/warm/hot data analysis module 6.

The data tables that cold/warm/hot data analysis does not care about generally include the tables under the system databases and the tables that do not exist in the system tables.

The process by which the filtering module B5 filters out of the intermediate result set table B the tables that do not exist in the system tables is: creating an intermediate result set table C, which stores the database names and table names from the system tables outside the database, and then filtering out of table B the tables that do not exist in table C.

The process by which the filtering module B5 filters out of the intermediate result set table B the tables under the system databases is: directly filtering the system-database tables out of table B.

The cold/warm/hot data analysis module 6 is used to perform cold/warm/hot data analysis according to the basic analysis data table to obtain the data lifecycle analysis result.

The lifecycle thresholds for cold, warm and hot data are specified according to business needs, based on the industry and customer requirements. Taking a classification scenario in the mobile communications industry as an example, data accessed within the last 60 days is generally hot data, data accessed 60-90 days ago is warm data, and data not accessed for more than 90 days is cold data.

The basic analysis data table is classified according to the specified lifecycle thresholds to obtain the type of each table in it: cold data, warm data or hot data. Cold/warm/hot analysis can then be performed according to customer needs to obtain the data lifecycle analysis result. Illustratively, all detail rows of the cold, warm and hot data can simply be exported; or the cold, warm and hot tables can be counted separately to obtain their respective totals, and on that basis the percentage of each of the three types in the overall total can be computed; or the counts can be grouped by database name to find which databases hold more cold data, so that the tables in those databases can be cleaned up first. The analysis results for cold, warm and hot data give customers a basis for handling their data, helping them make sound judgments and decisions such as data cleanup and data migration.

The device can be compiled and persisted in the database in the form of a stored procedure; only two parameters need to be modified according to actual needs: the maximum number of rows into which a single row can be split, and the start time. Each module works according to the preset logic, automatically aggregating the audit logs of the data tables stored by the database cluster, saving the database names and table names from the system tables outside the database, creating the various tables, and finally outputting the cold/warm/hot analysis results.

This embodiment also provides a computer program product comprising a computer program which, when executed by a processor, implements the above data lifecycle analysis method based on audit logs.

Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they may still modify the technical solutions described in the foregoing embodiments or replace some of their technical features with equivalents; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (9)

1. A data lifecycle analysis method based on audit logs, characterized by comprising:
obtaining an audit log summary table, where the audit log summary table is the audit log of the data tables output in the form of a table;
setting the maximum number of rows into which a single row can be split, and creating a row number dimension table according to this maximum, where the creation process of the row number dimension table is: inserting, in a loop, the numbers representing row numbers into the row number dimension table, starting at 1 and ending at the maximum number of rows into which a single row can be split;
setting a start time, filtering the audit log summary table according to the start time, then filtering out useless keywords from the audit log summary table, and generating an intermediate result set table A after filtering;
cross-joining the row number dimension table with the intermediate result set table A to realize a self-built Cartesian product and generate an intermediate result set table B, where the intermediate result set table B records the most recent access time of each data table;
filtering out of the intermediate result set table B the data tables that cold/warm/hot data analysis does not care about, generating a basic analysis data table;
performing cold/warm/hot data analysis according to the basic analysis data table to obtain the data lifecycle analysis result.

2. The data lifecycle analysis method based on audit logs according to claim 1, characterized in that the audit log of the data tables is the audit log of the data tables stored by the database cluster.

3. The data lifecycle analysis method based on audit logs according to claim 1 or 2, characterized in that the row number dimension table is created as a temporary, no-replica table, and the maximum number of rows into which a single row can be split is not less than the maximum number of tables in any single row of the audit log summary table.

4. The data lifecycle analysis method based on audit logs according to claim 1 or 2, characterized in that the data tables that cold/warm/hot data analysis does not care about include the tables under the system databases and the tables that do not exist in the system tables, and the process of filtering out of the intermediate result set table B the tables that do not exist in the system tables is: creating an intermediate result set table C, where the intermediate result set table C is used to store the database names and table names from the system tables, and then filtering out of the intermediate result set table B the tables that do not exist in the intermediate result set table C.

5. A data lifecycle analysis device based on audit logs, characterized in that it is used to execute the data lifecycle analysis method based on audit logs according to any one of claims 1 to 4, and comprises an audit log summary table acquisition module, a row number dimension table generation module, a filtering module A, a cross-join module, a filtering module B and a cold/warm/hot data analysis module, wherein:
the audit log summary table acquisition module is connected to the filtering module A; the filtering module A and the row number dimension table generation module are each connected to the cross-join module; the cross-join module is connected to the filtering module B; and the filtering module B is connected to the cold/warm/hot data analysis module;
the audit log summary table acquisition module is used to obtain the audit log summary table, where the audit log summary table is the audit log of the data tables output in the form of a table;
the row number dimension table generation module is used to receive the maximum number of rows into which a single row can be split, and to create the row number dimension table according to this maximum, where the creation process of the row number dimension table is: inserting, in a loop, the numbers representing row numbers into the row number dimension table, starting at 1 and ending at the maximum number of rows into which a single row can be split;
the filtering module A is used to receive the start time, filter the audit log summary table according to the start time, then filter out useless keywords from the audit log summary table, and generate the intermediate result set table A after filtering;
the cross-join module is used to cross-join the row number dimension table with the intermediate result set table A, realizing a self-built Cartesian product and generating the intermediate result set table B, where the intermediate result set table B records the most recent access time of each data table;
the filtering module B is used to filter out of the intermediate result set table B the data tables that cold/warm/hot data analysis does not care about, generating the basic analysis data table;
the cold/warm/hot data analysis module is used to perform cold/warm/hot data analysis according to the basic analysis data table to obtain the data lifecycle analysis result.

6. The data lifecycle analysis device based on audit logs according to claim 5, characterized in that the audit log of the data tables is the audit log of the data tables stored by the database cluster.

7. The data lifecycle analysis device based on audit logs according to claim 5 or 6, characterized in that the row number dimension table generation module is used to create the row number dimension table as a temporary, no-replica table, and the maximum number of rows into which a single row can be split is not less than the maximum number of tables in any single row of the audit log summary table.

8. The data lifecycle analysis device based on audit logs according to claim 5 or 6, characterized in that the data tables that cold/warm/hot data analysis does not care about include the tables under the system databases and the tables that do not exist in the system tables, and the process by which the filtering module B filters out of the intermediate result set table B the tables that do not exist in the system tables is: creating an intermediate result set table C, where the intermediate result set table C is used to store the database names and table names from the system tables, and then filtering out of the intermediate result set table B the tables that do not exist in the intermediate result set table C.

9. A computer program product comprising a computer program, characterized in that, when the computer program is executed by a processor, the data lifecycle analysis method based on audit logs according to any one of claims 1 to 4 is implemented.
Priority Applications (1)

Application number: CN202410677122.8A
Priority date: 2024-05-29
Filing date: 2024-05-29
Title: Data lifecycle analysis method, device and product based on audit log
Status: Active
Granted publication: CN118260366B (en)

Publications (2)

CN118260366A (en), published 2024-06-28
CN118260366B (en), published 2024-10-08 (grant)

Family

ID=91613465

Country Status (1)

CN: CN118260366B (en)

Citations (2)

* Cited by examiner, † Cited by third party

CN114547104A (en)*, priority 2022-02-14, published 2022-05-27, 中国平安人寿保险股份有限公司: Log data query method and device, computer equipment and storage medium
CN118069712A (en)*, priority 2024-02-29, published 2024-05-24, 中国农业银行股份有限公司: Data life cycle management method and device, electronic equipment and storage medium

Family Cites Families (7)

CN110399351A (en)*, priority 2019-05-28, published 2019-11-01, 西安交大捷普网络科技有限公司: The processing method and system of audit log data
CN114064812B (en)*, priority 2020-07-31, published 2025-09-05, 中移(苏州)软件技术有限公司: Data processing method, device, equipment and computer-readable storage medium
US11461347B1 (en)*, priority 2021-06-16, published 2022-10-04, Amazon Technologies, Inc.: Adaptive querying of time-series data over tiered storage
CN114490886A (en)*, priority 2021-12-29, published 2022-05-13, 北京航天智造科技发展有限公司: Industrial operation system data lake construction method based on data warehouse
US12248456B2 (en)*, priority 2022-03-30, published 2025-03-11, International Business Machines Corporation: Utilizing a structured audit log for improving accuracy and efficiency of database auditing
CN115481124A (en)*, priority 2022-09-20, published 2022-12-16, 中银金融科技有限公司: Data information filing processing method, system, device and storage medium
CN116975052A (en)*, priority 2023-05-15, published 2023-10-31, 腾讯科技(深圳)有限公司: Data processing method and related equipment

Also Published As

CN118260366A (en), published 2024-06-28


Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant
CP03: Change of name, title or address
  Address after: Room 201-33, Unit 2, Building 2, No. 39 Gaoxin 6th Road, Binhai Science and Technology Park, Binhai New Area, Tianjin, China 300452
  Patentee after: TIANJIN NANKAI UNIVERSITY GENERAL DATA TECHNOLOGIES Co.,Ltd.
  Country or region after: China
  Address before: 300384 building J, Haitai green industrial base, 6 Haitai development road, Huayuan Industrial Zone, Binhai New Area, Tianjin
  Patentee before: TIANJIN NANKAI UNIVERSITY GENERAL DATA TECHNOLOGIES Co.,Ltd.
  Country or region before: China

