Data backup method and system based on Hadoop distributed file system

Technical Field
The invention relates to a data backup method and system, in particular to a data backup method and system based on a Hadoop distributed file system.
Background
Big data has long existed in fields such as physics, biology, and environmental ecology and in industries such as the military, finance, and communications, but it has only attracted widespread attention in recent years with the development of the internet and the information industry. With the rapid development, popularization, and application of computers and information technology, big data has shown its value: the scale of industry application systems has expanded rapidly, and the data generated by industry applications has grown explosively.
Hadoop implements a highly fault-tolerant Distributed File System (HDFS), which runs on low-cost hardware, supports the construction of scalable, very large clusters, and provides storage and access for large volumes of data. As the technology has matured and the system has become more stable, vendors such as Cloudera and Hortonworks have successively introduced big data solutions based on the Hadoop architecture, and more and more enterprises use Hadoop as their basic platform. However, when a Hadoop cluster suffers a disaster, the market lacks a method for protecting and recovering data from a remote or external location. The invention aims to deal with such cluster disasters.
Disclosure of Invention
The purpose of the invention is as follows: the invention aims to provide a method and a system, based on the Hadoop distributed file system, for efficiently and safely backing up data by using a third-party medium.
The technical scheme is as follows: in the data backup method based on the Hadoop distributed file system, a folder is backed up in snapshot mode through an HDFS client: the client generates a point-in-time snapshot of the folder, and the data in the folder is then stored to an external storage medium.
The method comprises the following steps:
(1) creating an HDFS client;
(2) creating a folder snapshot: reading the snapshot information of the current cluster through the HDFS client and generating a read-only point-in-time snapshot of the folder to be backed up, without copying any data blocks;
(3) data backup: establishing a connection between an external storage medium and the HDFS file system, reading the file contents, and writing them to the external storage medium;
(4) metadata backup: reading the metadata of the file/folder through the HDFS client, connecting to the remote storage index database, and writing the metadata into the file index database.
The step (1) comprises the following steps:
(11) downloading the HDFS service client and user credentials from the Hadoop management system interface, acquiring the HDFS configuration information and Kerberos authentication information, and placing the configuration on the agent node;
(12) issuing a backup job through the agent framework, reading the acquired HDFS configuration and Kerberos authentication information in the job, and creating the HDFS client.
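As an illustration of steps (11)-(12), the following minimal sketch creates such a client with the standard Hadoop Java API; the configuration paths, Kerberos principal, and keytab location are hypothetical stand-ins for the items downloaded from the management system interface:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.security.UserGroupInformation;

public class HdfsClientFactory {
    // The paths below are hypothetical: they stand in for the configuration
    // and user credentials downloaded to the agent node.
    public static FileSystem create() throws Exception {
        Configuration conf = new Configuration();
        conf.addResource(new Path("/opt/agent/conf/core-site.xml"));
        conf.addResource(new Path("/opt/agent/conf/hdfs-site.xml"));

        // Log in with the downloaded Kerberos keytab before touching HDFS.
        UserGroupInformation.setConfiguration(conf);
        UserGroupInformation.loginUserFromKeytab(
                "backup@EXAMPLE.COM", "/opt/agent/conf/backup.keytab");

        // The returned FileSystem talks to the cluster named in the config.
        return FileSystem.get(conf);
    }
}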
In step (2), if the folder is backed up incrementally, the previous snapshot is compared with the current snapshot to obtain the modification information of the folder.
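This comparison corresponds to the snapshot diff facility of the public HDFS API. The sketch below, which reuses the HdfsClientFactory sketch above (folder path and snapshot names are illustrative), lists the entries that changed between two snapshots:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.SnapshotDiffReport;

public class IncrementalDiff {
    public static void main(String[] args) throws Exception {
        // Compare two existing snapshots of a snapshottable folder to find
        // what changed between the previous backup and the current one.
        DistributedFileSystem dfs =
                (DistributedFileSystem) HdfsClientFactory.create();
        SnapshotDiffReport report = dfs.getSnapshotDiffReport(
                new Path("/data"), "s_prev", "s_curr");

        // Entries mark created (+), deleted (-), modified (M), renamed (R) paths.
        for (SnapshotDiffReport.DiffReportEntry entry : report.getDiffList()) {
            System.out.println(entry);
        }
    }
}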
The external storage medium in step (3) is a remote storage medium.
The metadata in step (4) includes the attribute information of the distributed file and the storage location information of the distributed file in the external storage medium.
The method further comprises the following step: when a file in the system is damaged or lost, the HDFS client selectively restores part of the files by acquiring their location information in the external storage medium, without restoring the whole snapshot.
The data backup system based on the Hadoop distributed file system comprises an HDFS system and a storage server connected with the HDFS system, wherein the storage server comprises a storage medium and a file index database; the storage medium is used for storing the system file data, and the file index database is used for storing the system file metadata. The system further comprises a backup proxy node provided with a proxy service; the node downloads the Hadoop configuration and Kerberos user authentication, an HDFS client is created through the proxy service, and the HDFS client interacts with the Hadoop cluster.
Beneficial effects: compared with the prior art, the invention has the following remarkable advantages:
the method for protecting folder data under the Hadoop platform builds an agent environment on a remote node by downloading the corresponding configuration, supports Kerberos authentication, and remotely connects to the Hadoop cluster through the Hadoop API without affecting the service, which improves backup efficiency and reliability and realizes backup protection of data in the distributed file system (HDFS). The invention can improve the safety of data in the HDFS, protect Hadoop clusters against disasters, automatically and quickly recover system data, and protect the integrity and consistency of company data. Backup of the entire distributed file system, or of parts of it, is supported.
Drawings
FIG. 1 is a schematic flow chart of the present invention.
Detailed Description
The technical scheme of the invention is further explained below with reference to the attached drawings.
The data backup system of the present invention comprises a backup proxy node and a storage server. The backup proxy node is an independent machine, which can be a physical machine or a virtual machine; a proxy service is installed on it, the Hadoop configuration and Kerberos user authentication are downloaded, an HDFS client is created through the proxy service, and the HDFS client interacts with the Hadoop cluster. The storage server comprises a storage medium and a file index database: the storage medium is used for storing the system file data, and the file index database is used for storing the system file metadata.
The HDFS cluster is composed of one NameNode and a certain number of DataNodes. The NameNode is a central server responsible for managing the namespace of the file system and client access to files. A DataNode typically runs on each node in the cluster and is responsible for managing the storage on the node where it resides. HDFS exposes the file system namespace, in which users can store data in the form of files. Internally, a file is divided into one or more data blocks, which are stored on a set of DataNodes. The NameNode performs namespace operations for the file system, such as opening, closing, and renaming files or directories; it is also responsible for determining the mapping of data blocks to specific DataNodes. The DataNodes are responsible for processing read and write requests from file system clients, and they create, delete, and copy data blocks under the unified scheduling of the NameNode.
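For illustration, a client can query the NameNode's block-to-DataNode mapping for a file as follows (a sketch reusing the HdfsClientFactory above; the file path is illustrative):

import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockMap {
    public static void main(String[] args) throws Exception {
        FileSystem fs = HdfsClientFactory.create();
        FileStatus st = fs.getFileStatus(new Path("/data/file.log"));

        // The NameNode answers with one BlockLocation per data block,
        // naming the DataNodes that hold its replicas.
        for (BlockLocation b : fs.getFileBlockLocations(st, 0, st.getLen())) {
            System.out.println(b.getOffset() + "+" + b.getLength()
                    + " on " + String.join(",", b.getHosts()));
        }
    }
}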
As shown in FIG. 1, the method of the present invention specifically includes the following steps:
1. Install the proxy service. Download the proxy service installation package on the proxy node and configure the Hadoop client parameters, including downloading the Hadoop configuration, Kerberos authentication information, etc.
2. Start the proxy service, read the Hadoop client parameters, and create the HDFS client.
3. Connect to the Hadoop cluster through the HDFS client and interact with it, including obtaining the snapshot list information and file metadata information of the distributed file system.
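A sketch of this interaction, again reusing HdfsClientFactory (the folder path is illustrative, and the cast assumes the default file system is HDFS): it lists the cluster's snapshottable directories and the metadata of one folder's entries.

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.SnapshottableDirectoryStatus;

public class ClusterInfo {
    public static void main(String[] args) throws Exception {
        DistributedFileSystem dfs =
                (DistributedFileSystem) HdfsClientFactory.create();

        // Snapshot list information: every directory that allows snapshots.
        SnapshottableDirectoryStatus[] dirs = dfs.getSnapshottableDirListing();
        if (dirs != null) {
            for (SnapshottableDirectoryStatus s : dirs) {
                System.out.println(s.getFullPath());
            }
        }

        // File metadata information for the entries of one folder.
        for (FileStatus st : dfs.listStatus(new Path("/data"))) {
            System.out.println(st.getPath() + " " + st.getLen()
                    + " bytes, owner " + st.getOwner()
                    + ", modified " + st.getModificationTime());
        }
    }
}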
4. Create a snapshot. The HDFS client connects to the NameNode of the Hadoop cluster and creates a snapshot of the folder; the system opens up a new space for storing subsequently modified files, records the locations of all file blocks of the folder at the given point in time, and records information such as the list and sizes of the file blocks, without copying the file block data on the DataNodes. The snapshot backup only copies the location information of the file blocks at that moment; the operation is instantaneous, so the captured data cannot be modified afterwards. The subsequent data backup reads the file data to the external medium through the file block location information in the snapshot.
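This step corresponds to the snapshot facility of the public HDFS API. A minimal sketch (the folder must first be made snapshottable, for example with the command hdfs dfsadmin -allowSnapshot /data; the path and snapshot name are illustrative):

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CreateSnapshot {
    public static void main(String[] args) throws Exception {
        FileSystem fs = HdfsClientFactory.create();

        // Instantaneous, read-only point-in-time image of the folder.
        // No DataNode block data is copied; only block lists are recorded.
        Path snap = fs.createSnapshot(new Path("/data"), "s20240101");

        // The snapshot is reachable under /data/.snapshot/s20240101
        System.out.println("created " + snap);
    }
}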
5. Back up the data. The HDFS client connects to the NameNode of the Hadoop cluster to obtain the file block list (DataNode information) of the files in the snapshot, then connects to the DataNodes, reads the file data, and backs it up to the external storage medium.
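A sketch of this step, streaming one file out of the read-only snapshot; the external medium is modelled here as a local file, and both paths are illustrative assumptions:

import java.io.FileOutputStream;
import java.io.OutputStream;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class BackupFile {
    public static void main(String[] args) throws Exception {
        FileSystem fs = HdfsClientFactory.create();

        // Read from the snapshot path, so later writes to /data cannot
        // change what is being backed up.
        Path src = new Path("/data/.snapshot/s20240101/file.log");
        try (FSDataInputStream in = fs.open(src);
             OutputStream out = new FileOutputStream("/backup/file.log")) {
            // The client fetches each block from a DataNode and streams it out.
            IOUtils.copyBytes(in, out, 4096, false);
        }
    }
}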
6. Back up the metadata. The HDFS client connects to the NameNode of the Hadoop cluster, acquires the metadata of the snapshot directory and the storage locations of the files in the external storage medium, and stores them in the file index database of the external storage server.
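The invention does not specify a schema for the file index database; the sketch below is one possible realization, recording for each file in the snapshot its HDFS attributes and an assumed location on the external medium via plain JDBC (the JDBC URL, credentials, table, and columns are hypothetical):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class IndexMetadata {
    public static void main(String[] args) throws Exception {
        FileSystem fs = HdfsClientFactory.create();
        // JDBC URL of the remote file index database (hypothetical).
        try (Connection db = DriverManager.getConnection(
                "jdbc:postgresql://index-db/backup", "backup", "secret");
             PreparedStatement ins = db.prepareStatement(
                 "INSERT INTO file_index(hdfs_path, length, mtime, owner, media_path)"
                 + " VALUES (?,?,?,?,?)")) {
            for (FileStatus st : fs.listStatus(
                    new Path("/data/.snapshot/s20240101"))) {
                ins.setString(1, st.getPath().toUri().getPath());
                ins.setLong(2, st.getLen());
                ins.setLong(3, st.getModificationTime());
                ins.setString(4, st.getOwner());
                // Where this file's data was written on the external medium.
                ins.setString(5, "/backup" + st.getPath().toUri().getPath());
                ins.executeUpdate();
            }
        }
    }
}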
7. Recover after data corruption, loss, or misoperation. The HDFS client selectively restores part of the files by acquiring their location information in the external storage medium, without restoring the whole snapshot, which solves the problem of slow data restoration. The data in the external medium is directly read and written into an HDFS folder; usually a new folder is created and the data is restored directly to the new path. If no new path is created, the original data may be overwritten.
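A sketch of the restore path: the media location of a single file would be looked up in the file index database (the lookup is elided and the value hard-coded here), and the file is written back into a newly created HDFS folder so that the original data is not overwritten:

import java.io.FileInputStream;
import java.io.InputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class RestoreFile {
    public static void main(String[] args) throws Exception {
        FileSystem fs = HdfsClientFactory.create();

        // Media path would come from the file index database; hard-coded here.
        String mediaPath = "/backup/file.log";

        // Restore into a new folder so existing data is not overwritten.
        Path target = new Path("/data_restore/file.log");
        try (InputStream in = new FileInputStream(mediaPath);
             FSDataOutputStream out = fs.create(target)) {
            IOUtils.copyBytes(in, out, 4096, false);
        }
    }
}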