CN107391306B

Info

Publication number: CN107391306B
Application number: CN201710622124.7A
Authority: CN
Inventors: 刘赛; 杨华飞; 聂庆节; 刘嘉华; 刘军; 张磊; 马悦皎; 缪骞云; 张翼; 张迎星
Original assignee: NARI Group Corp; NARI Information and Communication Technology Co; Electric Power Research Institute of State Grid Jiangsu Electric Power Co Ltd; State Grid Corp of China SGCC
Current assignee: NARI Group Corp; NARI Information and Communication Technology Co; Electric Power Research Institute of State Grid Jiangsu Electric Power Co Ltd; State Grid Corp of China SGCC
Priority date: 2017-07-27
Filing date: 2017-07-27
Publication date: 2019-12-10
Anticipated expiration: 2037-07-27
Also published as: CN107391306A

Abstract

The invention discloses a method for restoring heterogeneous database backup files, comprising the following steps: normalizing and converting the data in the heterogeneous source databases; applying a DELTA compression algorithm based on K-medoids clustering, in which the data blocks are first preprocessed by clustering so that blocks with high similarity fall into the same class, and the blocks of each class are then compressed with the DELTA compression algorithm; and restoring the data by the SQL reproduction method, in which the database version at the restoring end is read from a configuration file, the metadata model is converted according to conversion rules into SQL statements supported by that database version, and the SQL statements are imported into the database after a data consistency check, thereby providing stable backup and recovery for heterogeneous databases. By extending the mapping rules, the invention can support a variety of source databases; it realizes backup of heterogeneous databases, supports efficient file compression, and reduces backup cost.

Description

A Method for Restoring Heterogeneous Database Backup Files

Technical Field

The invention relates to a method for restoring heterogeneous database backup files.

Background

In recent years, with the development of information technology, information management systems have become widespread. Fast, efficient, and convenient, they have become platforms for publishing and trading information and have further advanced the digitization and informatization of society as a whole. Information systems of every kind make up today's "information world".

Every industry depends on data: product data, customer data, financial data, and so on, and the survival and growth of enterprises rely increasingly on IT systems. Large-scale damage to data caused by computer viruses, network intrusion, physical damage, or operator error not only prevents information systems from providing normal service but can also cause huge economic losses in industries such as banking, electric power, and communications. Data backup protects the data and ensures that local data can be restored quickly after a failure.

Research on database backup is demand-driven. Major companies began work in this area early, and some backup technologies have been in use in various application environments for a long time. Research on backup software outside China began in the mid-1980s, and mature commercial backup products now include Tivoli from EMC, NetVault from BakBone, and BrightStor from CA.

The Software Research Institute of Sun Yat-sen University and Guangzhou Weiteng Network Technology Co., Ltd. jointly developed NetBunker2 for network backup and recovery on Linux backup servers. HeartOne Backup Enterprise from Zhongshan Tongxiang provides distributed backup, implements intelligent backup and recovery, and simplifies server and network storage environments.

Backup software is also flourishing in the open-source field, where many excellent packages have appeared; the better-known ones include Amanda, Bacula, BackupPC, Restore, and Burt. Although their technology is open, these packages support only the most basic backup tasks and are not suited to commercial scenarios, so theoretical research on some commercial functions is still needed.

As enterprises grow, their data becomes large in volume, wide in origin, varied in type, and complex in structure. Enterprises accumulate large amounts of business data that are critical to normal operation. Because different database systems were used at different stages, how to back up heterogeneous data has become a key problem in the data backup field. Although large databases such as Oracle and SQL Server ship their own backup and restore tools, these tools support backup of only a single database product and cannot resolve heterogeneity in the backup process.

Summary of the Invention

The purpose of the invention is to overcome the deficiencies of the prior art by providing a method for restoring heterogeneous database backup files, solving the technical problem that heterogeneous data cannot be effectively backed up and restored.

To solve this technical problem, the invention adopts the following technical solution: a method for restoring heterogeneous database backup files, comprising the following steps:

(1) Normalize and convert the data in the heterogeneous source databases;

(2) Preprocess the data blocks by clustering, compress the data blocks of each class with the DELTA compression algorithm, generate the corresponding binary storage file, and back up the compressed backup file to the backup medium;

(3) Restore the metadata in the backup file using the "SQL reproduction method", reading the database version at the restoring end from a configuration file;

(4) Convert the metadata model, according to the conversion rules, into SQL statements supported by the corresponding database version, and import them into the database after a data consistency check, thereby restoring the heterogeneous database backup files.

The specific method of step (1) is as follows:

101. Load the driver: import the driver into the development environment and load it with the Class.forName() function;

102. Create a connection: after the driver is loaded, create a database connection object with the getConnection() function of DriverManager; the connection information includes the protocol name, IP address, port number, and database name;

103. Create a Statement object: create a Statement object with the createStatement() function of Connection;

104. Execute the SQL statement: use executeQuery() when the SQL statement produces a single result set, executeUpdate() when it returns no result, and execute() when it returns multiple result sets;

105. Obtain the result: the results produced by the Statement's execute() and executeQuery() are available as ResultSet objects; the ResultSet cursor is advanced with the next() function to read the returned data;

106. Load the conversion rules according to the database type, and convert the heterogeneous data into metadata of a unified standard with the excData() function. Each element of the metadata contains a key field, the identifier, which is used to check data consistency during recovery; if the metadata is changed during backup, the identifier is set to 1;

107. Write the resulting data to a file in XML format with the wrtData() function, generating the corresponding backup file;

108. Close the connection: when the database is no longer in use, close the database connection with the close() method.
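The extraction flow of steps 101 to 108 can be sketched as follows. This is a minimal Python illustration and not the JDBC implementation described above: the standard-library sqlite3 module stands in for the JDBC driver and connection, and the names exc_data, wrt_data, backup_table, RULES, and the XML element names are assumptions chosen to mirror excData() and wrtData(). Setting the identifier to 1 at extraction time is likewise an assumption.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical conversion rules: SQLite column type -> unified metadata type.
RULES = {"INTEGER": "integer", "REAL": "real", "TEXT": "character", "BLOB": "byte"}


def exc_data(conn: sqlite3.Connection, table: str) -> ET.Element:
    """Convert one table into metadata elements (stand-in for excData(), step 106)."""
    cur = conn.cursor()  # plays the role of the Statement object (step 103)
    cur.execute(f'PRAGMA table_info("{table}")')  # rows: (cid, name, type, ...)
    columns = [(row[1], RULES.get(row[2].upper(), "character"))
               for row in cur.fetchall()]

    root = ET.Element("metadata")
    source = ET.SubElement(root, "sourceTable", name=table, identifier="1",
                           fieldCount=str(len(columns)))
    for name, meta_type in columns:
        ET.SubElement(source, "column", name=name, type=meta_type)

    cur.execute(f'SELECT * FROM "{table}"')  # step 104: a single result set
    for row in cur:                          # step 105: walk the result cursor
        for (name, meta_type), value in zip(columns, row):
            field = ET.SubElement(root, "field", name=name, type=meta_type,
                                  table=table, identifier="1")
            field.text = "" if value is None else str(value)
    return root


def wrt_data(root: ET.Element, path: str) -> None:
    """Write the metadata as an XML backup file (stand-in for wrtData(), step 107)."""
    ET.ElementTree(root).write(path, encoding="utf-8", xml_declaration=True)


def backup_table(db_path: str, table: str, xml_path: str) -> None:
    conn = sqlite3.connect(db_path)  # step 102: open the connection
    try:
        wrt_data(exc_data(conn, table), xml_path)
    finally:
        conn.close()                 # step 108: always release the connection
```

Calling backup_table("src.db", "users", "backup.xml") would write one sourceTable element describing the table and one field element per stored value.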

The metadata is the smallest unit of the data model, and its structure is expressed by formula (1):

M = CS + SS    (1)

where CS is the content structure, which defines the constituent elements of the metadata and their content, and SS is the syntax structure, which defines the format of the metadata and the specific way it is described;

The content structure is expressed by formula (2):

CS = (T, Z, S, F)    (2)

T denotes the source table. It represents the table structure of the multi-source databases and stores the table structure information of the data to be backed up, including the source table serial number, source table name, identifier, number of fields, field names, and field types;

Z denotes the field. It represents the data values of the multi-source databases and stores the specific values of the fields in a table, including the field serial number, field name, field type, field value, table name, and identifier;

S denotes the scheduled set, the basic unit of backup. It includes the scheduled set number, source server, target server, start time, end time, backup sequence number, source table serial number, and field serial number. It defines the backup objects and subdivides the backup process into units, so that an interrupted backup task can resume from the point of interruption;

F denotes the constraint. The constraint element describes the field constraints of a table and records the table's special columns, including the table name, constraint serial number, primary key column name, foreign key column name, index column name, and identifier.

The special column information is recorded separately so that the table structure is described completely.
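The content structure CS = (T, Z, S, F) can be pictured as four record types. The sketch below is only one possible in-memory representation; the class and attribute names are hypothetical English renderings of the members listed above, and the plural list fields are an assumption about how several serial numbers would be held.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SourceTable:  # T: table structure of a source database
    table_no: int
    table_name: str
    field_names: List[str]
    field_types: List[str]
    identifier: int = 1

    @property
    def field_count(self) -> int:
        return len(self.field_names)


@dataclass
class Field:  # Z: one stored value
    field_no: int
    field_name: str
    field_type: str
    field_value: str
    table_name: str
    identifier: int = 1


@dataclass
class ScheduledSet:  # S: basic unit of backup
    set_no: int
    source_server: str
    target_server: str
    start_time: str
    end_time: str
    backup_no: int
    table_nos: List[int] = field(default_factory=list)
    field_nos: List[int] = field(default_factory=list)


@dataclass
class Constraint:  # F: special columns of a table
    table_name: str
    constraint_no: int
    primary_key_columns: List[str] = field(default_factory=list)
    foreign_key_columns: List[str] = field(default_factory=list)
    index_columns: List[str] = field(default_factory=list)
    identifier: int = 1


@dataclass
class ContentStructure:  # CS = (T, Z, S, F)
    tables: List[SourceTable] = field(default_factory=list)
    fields: List[Field] = field(default_factory=list)
    scheduled_sets: List[ScheduledSet] = field(default_factory=list)
    constraints: List[Constraint] = field(default_factory=list)
```

Keeping the serial numbers on the scheduled set is what would let a restore task look up its tables, constraints, and fields by number.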

The specific method of step (2) is as follows:

201. Split the file to be compressed using 1 MB as the block size, and apply DELTA compression to every pair of resulting blocks. The sizes of the DELTA-compressed outputs are stored in a temporary matrix arr_delta[N][N] and serve as the similarity between data blocks;

202. Using the similarity information stored in the similarity matrix as the clustering basis, cluster the data blocks with the K-medoids clustering algorithm;

203. Select a feature set from the file using a content-independent method, and determine the number of intermediate fingerprints to generate and the file size according to the amount of allocatable memory;

204. Set the size of a sliding window, move the window forward continuously, compute the data fingerprint under each window position, and use a hash function to map the fingerprints into super-features, or a super-fingerprint set;

205. If the super-fingerprints match, search the feature database for the reference file with the highest similarity and, once it is found, compress against it with the compression function D;

206. The compression function D encodes the ordered symbol string using two commands. The ADD command, of the form (ADD, L, S), adds the string S of length L at the specified position in V. The COPY command, of the form (COPY, L, O), copies the string of length L at offset O in R to the specified position in V;

207. Merge the compressed data blocks back into a backup file.
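The ADD and COPY commands of step 206 can be illustrated with a small greedy encoder. This is a sketch under stated assumptions, not the compression function D itself: R is taken to be the reference block and V the block being encoded, matches are found through a simple substring index, and the names delta_encode, delta_decode, and min_match are hypothetical.

```python
from typing import List, Tuple, Union

Command = Union[Tuple[str, int, bytes], Tuple[str, int, int]]


def delta_encode(reference: bytes, version: bytes, min_match: int = 4) -> List[Command]:
    """Encode `version` (V) against `reference` (R) as (ADD, L, S) / (COPY, L, O)."""
    # Index every min_match-byte substring of R by its first offset.
    index = {}
    for off in range(len(reference) - min_match + 1):
        index.setdefault(reference[off:off + min_match], off)

    commands: List[Command] = []
    literal = bytearray()  # bytes of V with no match in R, waiting for an ADD
    pos = 0
    while pos < len(version):
        off = index.get(version[pos:pos + min_match])
        if off is None:
            literal.append(version[pos])
            pos += 1
            continue
        # Extend the match for as long as R and V agree.
        length = min_match
        while (off + length < len(reference) and pos + length < len(version)
               and reference[off + length] == version[pos + length]):
            length += 1
        if literal:
            commands.append(("ADD", len(literal), bytes(literal)))
            literal.clear()
        commands.append(("COPY", length, off))
        pos += length
    if literal:
        commands.append(("ADD", len(literal), bytes(literal)))
    return commands


def delta_decode(reference: bytes, commands: List[Command]) -> bytes:
    """Rebuild V by replaying the commands against R."""
    out = bytearray()
    for op, length, arg in commands:
        if op == "ADD":
            out += arg
        elif op == "COPY":
            out += reference[arg:arg + length]
        else:
            raise ValueError(f"unknown command {op!r}")
    return bytes(out)
```

When V differs from R in only a few bytes, most of V is covered by COPY commands and only the changed bytes travel in ADD commands, which is why compressing similar blocks together pays off.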

The DELTA compression algorithm is applied to data blocks of the same class as follows:

The backup file is divided into blocks, and the set of data blocks is denoted S = {S1, S2, S3, …, Sn}. The data objects in S are clustered into K classes C' = {C1', C2', C3', …, Ck'}. The similarity between two data blocks is expressed as their DELTA distance, namely:

dist(Si, Sj) = delta(Si, Sj)    (3)
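One way to obtain such a pairwise distance matrix is sketched below. The delta() function of the source is not specified in this text, so the sketch substitutes a stand-in: the size of one block compressed by zlib with the other block supplied as a preset dictionary. The names delta_size and delta_matrix are hypothetical, and zlib only consults the last 32 KB of the dictionary, so this is an illustration for small blocks rather than for 1 MB ones.

```python
import zlib
from typing import List, Sequence


def delta_size(reference: bytes, block: bytes) -> int:
    """Bytes needed to store `block` when `reference` is available as a dictionary."""
    compressor = zlib.compressobj(level=9, zdict=reference)
    return len(compressor.compress(block) + compressor.flush())


def delta_matrix(blocks: Sequence[bytes]) -> List[List[int]]:
    """Fill arr_delta[i][j] with the delta size of block j relative to block i."""
    n = len(blocks)
    arr_delta = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                arr_delta[i][j] = delta_size(blocks[i], blocks[j])
    return arr_delta
```

A small entry means the two blocks share most of their content; a large entry means little could be reused, so the blocks should end up in different clusters.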

K data blocks are arbitrarily selected from S as the center points of the clusters, denoted {m1, m2, m3, …, mk}. Each remaining data block is assigned to the cluster whose center is nearest to it, giving the clusters C = {C1, C2, C3, …, Ck};

For each cluster Ci, i ∈ {1, 2, 3, …, k}, the non-center objects Sj in the cluster are traversed, and formula (4) is used to compute the total cost between each data block Sj in the cluster and the remaining data blocks Sk;

The point with the smallest total cost within each cluster is chosen as that cluster's new center point. These steps are iterated until the center points of the clusters no longer change, yielding the final K clusters C' = {C1', C2', C3', …, Ck'}.
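The clustering loop described above can be sketched as follows. The sketch assumes that the total cost of a block is the sum of its distances to the other members of its cluster, which is the usual K-medoids definition; formula (4) is not reproduced in this text, so that choice is an assumption. The function name k_medoids and the use of the first k blocks as the "arbitrary" initial centers are also illustrative.

```python
from typing import List, Sequence, Tuple


def k_medoids(dist: Sequence[Sequence[float]], k: int,
              max_iter: int = 100) -> Tuple[List[int], List[List[int]]]:
    """Cluster n blocks from an n x n distance matrix; returns (centers, clusters)."""
    n = len(dist)
    medoids = list(range(k))  # arbitrary initial centers: the first k blocks
    clusters: List[List[int]] = []
    for _ in range(max_iter):
        # Assignment: every block joins the cluster of its nearest center.
        clusters = [[] for _ in range(k)]
        for i in range(n):
            nearest = min(range(k), key=lambda c: dist[i][medoids[c]])
            clusters[nearest].append(i)
        # Update: the member with the smallest total cost becomes the new center.
        new_medoids = [
            min(members, key=lambda j: sum(dist[j][m] for m in members))
            if members else medoids[c]  # keep the old center of an empty cluster
            for c, members in enumerate(clusters)
        ]
        if new_medoids == medoids:  # centers stable: clustering has converged
            break
        medoids = new_medoids
    return medoids, clusters
```

Because the algorithm only needs pairwise distances, it can run directly on a matrix such as arr_delta without ever computing an average block.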

The specific method of step (3) is as follows:

301. Read the database type and version number at the restoring end, and load the corresponding mapping rules for that database version;

302. Read the scheduled set serial number of the task from the restore task information, and use it to look up the serial numbers of the source tables, constraints, and fields to be restored;

303. Use the source table serial number and constraint serial number to find the corresponding source table element and constraint element in the metadata, and check the corresponding identifier: if the identifier is 1, go to step 304; otherwise go to step 305;

304. Obtain the details of the source table and its dependencies, including the table name, the field names and field types in the table, the primary key, foreign keys, and indexes; then generate the corresponding SQL statements and store them in a .sql file. Once the file has been generated, set the identifier to 0;

305. Obtain the corresponding field element by field serial number and check its identifier: if the identifier is 1, go to step 306; otherwise go to step 307;

306. Obtain the details of the field, including the field name, field type, field value, and the name of the source table the field belongs to; generate the corresponding INSERT statement from this information to add the data, and store it in the .sql file. Once the file has been generated, set the identifier to 0;

307. Invoke the control command and restore the data to the database by executing the .sql file.

When the "SQL reproduction method" is used to restore the metadata in the backup file, the value of the identifier in the metadata file is checked first:

If the identifier is 1, the data has not yet been restored, and the syntax mapping rules are applied in reverse to convert the content of the backup file into SQL statements;

If the identifier is 0, the content was already restored to the database by an earlier restore task and does not need to be converted and restored again.
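The identifier check and SQL generation of steps 303 to 306 can be sketched as below. The dictionary layout of the table and row elements, the TARGET_TYPES mapping, and the function names are assumptions made for illustration; in particular, field values are grouped into rows here for brevity, and the column types chosen per target database are not taken from the source.

```python
from typing import Dict, List

# Hypothetical reverse mapping rules: unified metadata type -> column type
# for each restoring-end database.
TARGET_TYPES: Dict[str, Dict[str, str]] = {
    "oracle": {"character": "VARCHAR2(255)", "integer": "NUMBER(19)",
               "real": "BINARY_DOUBLE", "byte": "BLOB"},
    "mysql": {"character": "VARCHAR(255)", "integer": "BIGINT",
              "real": "DOUBLE", "byte": "BLOB"},
    "postgresql": {"character": "VARCHAR(255)", "integer": "BIGINT",
                   "real": "DOUBLE PRECISION", "byte": "BYTEA"},
}


def sql_literal(value: str, meta_type: str) -> str:
    """Render a stored field value as a SQL literal."""
    if meta_type in ("integer", "real"):
        return value
    return "'" + value.replace("'", "''") + "'"  # escape embedded quotes


def build_restore_sql(table: dict, rows: List[dict], target: str) -> List[str]:
    """Emit SQL only for elements whose identifier is 1, then reset it to 0."""
    types = TARGET_TYPES[target]
    columns = table["columns"]  # list of (field name, metadata type)
    statements: List[str] = []
    if table["identifier"] == 1:  # steps 303-304: table structure not yet restored
        parts = [f"{name} {types[meta]}" for name, meta in columns]
        if table.get("primary_key"):
            parts.append("PRIMARY KEY (" + ", ".join(table["primary_key"]) + ")")
        statements.append(f"CREATE TABLE {table['name']} (" + ", ".join(parts) + ");")
        table["identifier"] = 0
    names = ", ".join(name for name, _ in columns)
    for row in rows:  # steps 305-306: field values not yet restored
        if row["identifier"] != 1:
            continue
        values = ", ".join(sql_literal(value, meta)
                           for value, (_, meta) in zip(row["values"], columns))
        statements.append(f"INSERT INTO {table['name']} ({names}) VALUES ({values});")
        row["identifier"] = 0
    return statements
```

The returned statements would then be written to the .sql file and executed as in step 307; running the function a second time on the same metadata yields nothing, which is how repeated restores avoid duplicating data.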

Compared with the prior art, the invention achieves the following beneficial effects:

The invention designs a general metadata model and defines the mapping rules between this model and the data in the current mainstream databases Oracle, MySQL, and PostgreSQL; the data is normalized into metadata and stored in XML files;

An improved DELTA compression algorithm is proposed that deduplicates the backup files and reduces backup cost;

The invention overcomes the "information island" problem caused by heterogeneous databases within an enterprise and provides a consistent backup framework oriented to enterprise needs, while also improving the utilization of the backup medium and reducing backup cost;

For a restore task, the metadata is converted, according to the database configuration, into SQL statements supported by the specified database version, and the data is imported into the database by executing those statements. During restoration the data is restored selectively according to the modification flags in the source data model, which ensures data consistency.

Description of the Drawings

Fig. 1 is a schematic diagram of the hierarchical structure of the backup system;

Fig. 2 is a flow chart of the invention;

Fig. 3 is a flow chart of heterogeneous data extraction;

Fig. 4 is a flow chart of data compression based on K-medoids clustering;

Fig. 5 is a flow chart of data recovery.

Detailed Description

The invention provides a method for restoring heterogeneous database backup files, comprising: designing a metadata model, normalizing and converting the data in the heterogeneous source databases, and storing the metadata model in XML files; proposing a DELTA compression algorithm based on K-medoids clustering, in which the data blocks are first preprocessed by clustering so that blocks with high similarity fall into one class, and the blocks of each class are then compressed with the DELTA compression algorithm; and restoring the data by the "SQL reproduction method", in which the database version at the restoring end is read from a configuration file, the metadata model is converted according to the conversion rules into SQL statements supported by the corresponding database version, and the statements are imported into the database after a data consistency check, thereby providing stable backup and recovery for heterogeneous databases.

The invention is further described below with reference to the drawings. The following embodiments serve only to illustrate the technical solution of the invention more clearly and do not limit its scope of protection.

The backup system provides three functions: data extraction, data processing, and data recovery. Data extraction uses the metadata model to describe the data types of different databases in a unified way, extracts the source data according to the backup task, and stores it in a backup file. Data processing uses the compression algorithm to compress repeated content in the backup file, generates the corresponding binary storage file, and backs up the compressed backup file to the backup medium. Data recovery converts the metadata in the backup file using the SQL reproduction method, generates .sql files that each database version can execute, and finally imports the data into the database system.

Fig. 1 is a schematic diagram of the hierarchical structure of the backup system, which has three layers: the common connection layer, the business layer, and the application layer.

(1) Common connection layer

The common connection layer is the lowest layer of the system. It implements the database connection function and provides database connection and query services to the business layer. When backing up databases with a high security level it also provides encryption and decryption, ensuring a reliable connection to the heterogeneous data sources. Connections to the different databases are established mainly through JDBC.

(2) Business layer

The business layer implements the core functions of the system; every stage of database backup and recovery is implemented in this layer. Data conversion maps between metadata and database data, and the mapping rules hide the differences in data format, constraint rules, and SQL syntax among heterogeneous databases, which is one of the difficulties of heterogeneous database backup.

The data compression function uses the DELTA compression algorithm based on K-medoids clustering. It doubles the efficiency of the basic DELTA compression algorithm and can compress a backup file to about a quarter of its original size, increasing backup speed while reducing backup cost.

The consistency check protects data reliability by ensuring that the content of the database after a restore task is the same as it was at backup time.

These functions depend on one another in the processing flow. In the backup phase the data is converted first, and the converted files are then compressed and stored on the backup medium. In the restore phase the compressed file is decompressed back into a data file, the identifiers in the data are checked to determine what to restore, and the content is then converted by the conversion rules into SQL statements and imported into the database.

(3) Application layer

The application layer uses the services of the business layer and the common connection layer to solve practical problems, mainly user-defined backup and restore tasks or backup and restore plans. The interface of this layer is designed with Qt, which ensures the portability and extensibility of the overall system.

Figs. 2 to 5 show the method for restoring heterogeneous database backup files provided by the invention, which specifically comprises the following steps:

(1) Normalize and convert the data in the heterogeneous source databases, as follows:

102. Create a connection: after the driver is loaded, create a database connection object with the getConnection() function of DriverManager, for example Connection connect = DriverManager.getConnection("url", "UserName", "PassWord"). Although the URLs of different databases have different formats, each should contain the protocol name, IP address, port number, database name, and similar information. UserName and PassWord are the user name and password used to connect to the database management system;

103. Create a Statement object: create a Statement object with the createStatement() function of Connection. The Statement class is used mainly to execute SQL statements and obtain the results they produce;

104. Execute the SQL statement: Statement offers three main methods for executing SQL statements: executeQuery(), executeUpdate(), and execute(). Use executeQuery() when the SQL statement produces a single result set, executeUpdate() when it returns no result, and execute() when it returns multiple result sets;

108. Close the connection: to avoid wasting resources, close the database connection with the close() method when the database is no longer in use.

The data formats of the heterogeneous databases differ from the metadata format. Current mainstream databases offer powerful syntax and an increasingly rich set of data types; the basic character types of Oracle alone include CHAR, VARCHAR2, NCHAR, and NVARCHAR2. Moreover, one database system may not support the data types of another, so mapping rules are defined for conversion. The data mapping rules, also called the metadata dictionary, are the basis for normalizing heterogeneous data. The mapping rules are designed around the data types of the heterogeneous source databases; according to the meaning a data type expresses, it is classified as character, real, integer, or byte. The specific mapping is shown in Table 1.

Table 1 Data type mapping rules
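A metadata dictionary of this kind can be sketched as a lookup table. Only the four Oracle character types come from the text above; every other entry, the dictionary name META_TYPES, and the function to_meta_type are illustrative assumptions and are not taken from Table 1.

```python
import re

# Source column type -> unified metadata type, per source database.
META_TYPES = {
    "oracle": {
        "CHAR": "character", "VARCHAR2": "character",
        "NCHAR": "character", "NVARCHAR2": "character",
        "INTEGER": "integer", "BINARY_DOUBLE": "real",
        "RAW": "byte", "BLOB": "byte",
    },
    "mysql": {
        "CHAR": "character", "VARCHAR": "character", "TEXT": "character",
        "INT": "integer", "BIGINT": "integer",
        "FLOAT": "real", "DOUBLE": "real", "BLOB": "byte",
    },
    "postgresql": {
        "CHAR": "character", "VARCHAR": "character", "TEXT": "character",
        "INTEGER": "integer", "BIGINT": "integer",
        "REAL": "real", "DOUBLE PRECISION": "real", "BYTEA": "byte",
    },
}


def to_meta_type(database: str, column_type: str) -> str:
    """Map a declared column type such as 'VARCHAR2(30)' to a metadata type."""
    base = re.sub(r"\(.*\)", "", column_type).strip().upper()  # drop length/precision
    try:
        return META_TYPES[database.lower()][base]
    except KeyError:
        raise ValueError(f"no mapping rule for {database} type {column_type!r}") from None
```

Supporting another source database then amounts to adding one more inner dictionary, which matches the claim that new databases are supported by extending the mapping rules.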

The metadata is the smallest unit of the data model, and its structure is expressed by formula (1):

M = CS + SS    (1)

where CS (Content Structure) is the content structure, which defines the constituent elements of the metadata and their content, and SS (Syntax Structure) is the syntax structure, which defines the format of the metadata and the specific way it is described;

CS = (T, Z, S, F)    (2)

Source table (T): this element represents the table structure of the multi-source databases and stores the table structure information of the data to be backed up, including the source table serial number, source table name, identifier, number of fields, field names, and field types.

Field (Z): this element represents the data values of the multi-source databases and stores the specific values of the fields in a table, including the field serial number, field name, field type, field value, table name, and identifier.

Scheduled set (S): this element defines the backup objects and subdivides the backup process into units, so that an interrupted backup task can resume from the point of interruption. This mechanism saves time and improves backup efficiency, and it also helps keep the backup results consistent by preventing data that has already been backed up from being backed up again and becoming redundant. The scheduled set is defined as the basic unit of backup and contains the objects to be backed up. Its members include the scheduled set number, source server, target server, start time, end time, backup sequence number, source table serial number, and field serial number.

Constraint (F): the constraint element describes the field constraints of a table and records the table's special columns, including the table name, constraint serial number, primary key column name, foreign key column name, index column name, and identifier. Because of their special function, the special columns must be recorded separately so that the table structure is described completely.

(2) Preprocess the data blocks by clustering, compress the data blocks of each class with the DELTA compression algorithm, generate the corresponding binary storage file, and back up the compressed backup file to the backup medium, as follows:

202. Using the similarity information stored in the similarity matrix as the clustering basis, cluster the data blocks with the K-medoids clustering algorithm; the clustering result ensures that data blocks in the same class are highly similar to one another;

204. Set the size of a sliding window, move the window forward continuously, and compute the data fingerprint under each window position. To speed up retrieval and reduce lookup time, a hash function maps the fingerprints into super-features, or a super-fingerprint set;

205. If the super-fingerprints match, the two files are highly similar. Search the feature database for a reference file that is highly similar to the file and, once it is found, compress against it with the compression function D;
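The sliding-window fingerprinting of steps 204 and 205 can be sketched as follows. The text does not specify the fingerprint function or how fingerprints are grouped, so this sketch swaps in one common scheme as an assumption: a CRC-32 of each window is passed through a few fixed linear transforms, the maximum per transform is kept as a feature, and groups of features are hashed into super-fingerprints. The constants in TRANSFORMS and the names super_fingerprints and likely_similar are hypothetical.

```python
import hashlib
import zlib
from typing import List

MASK = (1 << 32) - 1
# Hypothetical fixed (multiplier, offset) pairs; each pair yields one feature.
TRANSFORMS = [(0x9E3779B1, 0x7F4A7C15), (0x85EBCA6B, 0xC2B2AE35),
              (0x27D4EB2F, 0x165667B1), (0xD3A2646D, 0xFD7046C5)]


def super_fingerprints(data: bytes, window: int = 16, group: int = 2) -> List[str]:
    """Slide a window over `data` and condense its fingerprints into super-fingerprints."""
    features = [0] * len(TRANSFORMS)
    for start in range(max(len(data) - window + 1, 1)):
        fingerprint = zlib.crc32(data[start:start + window])  # fingerprint of this window
        for i, (a, b) in enumerate(TRANSFORMS):
            features[i] = max(features[i], (a * fingerprint + b) & MASK)
    supers = []
    for g in range(0, len(features), group):
        packed = b"".join(f.to_bytes(4, "big") for f in features[g:g + group])
        supers.append(hashlib.sha1(packed).hexdigest())  # one hash per feature group
    return supers


def likely_similar(a: bytes, b: bytes) -> bool:
    """Treat two blocks as candidates for delta compression if any super-fingerprint matches."""
    return bool(set(super_fingerprints(a)) & set(super_fingerprints(b)))
```

Comparing a handful of short hashes is much cheaper than compressing every pair of blocks, which is the reason for looking up super-fingerprints in a feature database before running the compression function.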

dist(Si, Sj) = delta(Si, Sj)    (3)

(3) Restore the metadata in the backup file using the "SQL reproduction method", reading the database version at the restoring end from a configuration file; apply the conversion rules in reverse to turn the metadata information back into SQL statements that the database can recognize, and generate the corresponding .sql file. The specific method is as follows:

The above is only a preferred embodiment of the invention. It should be noted that a person of ordinary skill in the art can make a number of improvements and variations without departing from the technical principle of the invention, and such improvements and variations shall also be regarded as falling within the scope of protection of the invention.