Data backup method and system based on Hadoop distributed file system

Technical Field
The invention relates to a data backup method and system, in particular to a data backup method and system based on a Hadoop distributed file system.
Background
Big data has long existed in fields such as physics, biology, and environmental ecology and in industries such as the military, finance, and communications, but it has only attracted widespread attention in recent years with the development of the internet and the information industry. With the rapid development, popularization, and application of computers and information technology, big data has shown its value: the scale of industry application systems has expanded rapidly, and the data generated by industry applications has grown explosively.
Hadoop implements a highly fault-tolerant Distributed File System (HDFS), which runs on low-cost hardware, supports the construction of scalable, very large clusters, and provides storage and access for large volumes of data. As the technology has matured and the system has become more stable, vendors such as Cloudera and Hortonworks have successively introduced big data solutions based on the Hadoop architecture, and more and more enterprises use Hadoop as their basic platform. However, when a Hadoop cluster suffers a disaster, the market lacks a method for protecting and recovering data from a remote or external location. The invention aims to deal with such cluster disasters.
Disclosure of Invention
The purpose of the invention is as follows: the invention aims to provide a method and a system, based on the Hadoop distributed file system, for efficiently and safely backing up data by using a third-party medium.
The technical scheme is as follows: in the data backup method based on the Hadoop distributed file system, a folder is backed up in snapshot mode through an HDFS client: the client generates a point-in-time snapshot of the folder, and the data in the folder is then stored to an external storage medium.
The method comprises the following steps:
(1) creating an HDFS client;
(2) creating a folder snapshot: reading the snapshot information of the current cluster through the HDFS client and generating a read-only point-in-time snapshot of the folder to be backed up, without copying any data blocks;
(3) data backup: establishing a connection between an external storage medium and the HDFS file system, reading the file contents, and writing them to the external storage medium;
(4) metadata backup: reading the metadata of the file/folder through the HDFS client, connecting to the remote storage index database, and writing the metadata into the file index database.
The step (1) comprises the following steps:
(11) downloading the HDFS service client and user credentials from the Hadoop management system interface, acquiring the HDFS configuration information and Kerberos authentication information, and placing the configuration on the agent node;
(12) issuing a backup job through the agent framework, reading the acquired HDFS configuration and Kerberos authentication information in the job, and creating the HDFS client.
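As an illustration of steps (11)-(12), the following minimal sketch creates such a client with the standard Hadoop Java API; the configuration paths, Kerberos principal, and keytab location are hypothetical stand-ins for the items downloaded from the management system interface:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.security.UserGroupInformation;

public class HdfsClientFactory {
    // The paths below are hypothetical: they stand in for the configuration
    // and user credentials downloaded to the agent node.
    public static FileSystem create() throws Exception {
        Configuration conf = new Configuration();
        conf.addResource(new Path("/opt/agent/conf/core-site.xml"));
        conf.addResource(new Path("/opt/agent/conf/hdfs-site.xml"));

        // Log in with the downloaded Kerberos keytab before touching HDFS.
        UserGroupInformation.setConfiguration(conf);
        UserGroupInformation.loginUserFromKeytab(
                "backup@EXAMPLE.COM", "/opt/agent/conf/backup.keytab");

        // The returned FileSystem talks to the cluster named in the config.
        return FileSystem.get(conf);
    }
}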
In step (2), if the folder is backed up incrementally, the previous snapshot is compared with the current snapshot to obtain the modification information of the folder.
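This comparison corresponds to the snapshot diff facility of the public HDFS API. The sketch below, which reuses the HdfsClientFactory sketch above (folder path and snapshot names are illustrative), lists the entries that changed between two snapshots:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.SnapshotDiffReport;

public class IncrementalDiff {
    public static void main(String[] args) throws Exception {
        // Compare two existing snapshots of a snapshottable folder to find
        // what changed between the previous backup and the current one.
        DistributedFileSystem dfs =
                (DistributedFileSystem) HdfsClientFactory.create();
        SnapshotDiffReport report = dfs.getSnapshotDiffReport(
                new Path("/data"), "s_prev", "s_curr");

        // Entries mark created (+), deleted (-), modified (M), renamed (R) paths.
        for (SnapshotDiffReport.DiffReportEntry entry : report.getDiffList()) {
            System.out.println(entry);
        }
    }
}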
The external storage medium in step (3) is a remote storage medium.
The metadata in step (4) includes the attribute information of the distributed file and the storage location information of the distributed file in the external storage medium.
The method further comprises the following step: when a file in the system is damaged or lost, the HDFS client selectively restores part of the files by acquiring their location information in the external storage medium, without restoring the whole snapshot.
The data backup system based on the Hadoop distributed file system comprises an HDFS system and a storage server connected with the HDFS system, wherein the storage server comprises a storage medium and a file index database; the storage medium is used for storing the system file data, and the file index database is used for storing the system file metadata. The system further comprises a backup proxy node provided with a proxy service; the node downloads the Hadoop configuration and Kerberos user authentication, an HDFS client is created through the proxy service, and the HDFS client interacts with the Hadoop cluster.
Beneficial effects: compared with the prior art, the invention has the following remarkable advantages:
the method for protecting folder data under the Hadoop platform builds an agent environment on a remote node by downloading the corresponding configuration, supports Kerberos authentication, and remotely connects to the Hadoop cluster through the Hadoop API without affecting the service, which improves backup efficiency and reliability and realizes backup protection of data in the distributed file system (HDFS). The invention can improve the safety of data in the HDFS, protect Hadoop clusters against disasters, automatically and quickly recover system data, and protect the integrity and consistency of company data. Backup of the entire distributed file system, or of parts of it, is supported.
Drawings
FIG. 1 is a schematic flow chart of the present invention.
Detailed Description
The technical scheme of the invention is further explained below with reference to the attached drawings.
The data backup system of the present invention comprises a backup proxy node and a storage server. The backup proxy node is an independent machine, which can be a physical machine or a virtual machine; a proxy service is installed on it, the Hadoop configuration and Kerberos user authentication are downloaded, an HDFS client is created through the proxy service, and the HDFS client interacts with the Hadoop cluster. The storage server comprises a storage medium and a file index database: the storage medium is used for storing the system file data, and the file index database is used for storing the system file metadata.
The HDFS cluster is composed of one NameNode and a certain number of DataNodes. The NameNode is a central server responsible for managing the namespace of the file system and client access to files. A DataNode typically runs on each node in the cluster and is responsible for managing the storage on the node where it resides. HDFS exposes the file system namespace, in which users can store data in the form of files. Internally, a file is divided into one or more data blocks, which are stored on a set of DataNodes. The NameNode performs namespace operations for the file system, such as opening, closing, and renaming files or directories; it is also responsible for determining the mapping of data blocks to specific DataNodes. The DataNodes are responsible for processing read and write requests from file system clients, and they create, delete, and copy data blocks under the unified scheduling of the NameNode.
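For illustration, a client can query the NameNode's block-to-DataNode mapping for a file as follows (a sketch reusing the HdfsClientFactory above; the file path is illustrative):

import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockMap {
    public static void main(String[] args) throws Exception {
        FileSystem fs = HdfsClientFactory.create();
        FileStatus st = fs.getFileStatus(new Path("/data/file.log"));

        // The NameNode answers with one BlockLocation per data block,
        // naming the DataNodes that hold its replicas.
        for (BlockLocation b : fs.getFileBlockLocations(st, 0, st.getLen())) {
            System.out.println(b.getOffset() + "+" + b.getLength()
                    + " on " + String.join(",", b.getHosts()));
        }
    }
}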
As shown in FIG. 1, the method of the present invention specifically includes the following steps:
1. Install the proxy service. Download the proxy service installation package on the proxy node and configure the Hadoop client parameters, including downloading the Hadoop configuration, Kerberos authentication information, etc.
2. Start the proxy service, read the Hadoop client parameters, and create the HDFS client.
3. Connect to the Hadoop cluster through the HDFS client and interact with it, including obtaining the snapshot list information and file metadata information of the distributed file system.
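A sketch of this interaction, again reusing HdfsClientFactory (the folder path is illustrative, and the cast assumes the default file system is HDFS): it lists the cluster's snapshottable directories and the metadata of one folder's entries.

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.SnapshottableDirectoryStatus;

public class ClusterInfo {
    public static void main(String[] args) throws Exception {
        DistributedFileSystem dfs =
                (DistributedFileSystem) HdfsClientFactory.create();

        // Snapshot list information: every directory that allows snapshots.
        SnapshottableDirectoryStatus[] dirs = dfs.getSnapshottableDirListing();
        if (dirs != null) {
            for (SnapshottableDirectoryStatus s : dirs) {
                System.out.println(s.getFullPath());
            }
        }

        // File metadata information for the entries of one folder.
        for (FileStatus st : dfs.listStatus(new Path("/data"))) {
            System.out.println(st.getPath() + " " + st.getLen()
                    + " bytes, owner " + st.getOwner()
                    + ", modified " + st.getModificationTime());
        }
    }
}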
4. Create a snapshot. The HDFS client connects to the NameNode of the Hadoop cluster and creates a snapshot of the folder; the system opens up a new space for storing subsequently modified files, records the locations of all file blocks of the folder at the given point in time, and records information such as the list and sizes of the file blocks, without copying the file block data on the DataNodes. The snapshot backup only copies the location information of the file blocks at that moment; the operation is instantaneous, so the captured data cannot be modified afterwards. The subsequent data backup reads the file data to the external medium through the file block location information in the snapshot.
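This step corresponds to the snapshot facility of the public HDFS API. A minimal sketch (the folder must first be made snapshottable, for example with the command hdfs dfsadmin -allowSnapshot /data; the path and snapshot name are illustrative):

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CreateSnapshot {
    public static void main(String[] args) throws Exception {
        FileSystem fs = HdfsClientFactory.create();

        // Instantaneous, read-only point-in-time image of the folder.
        // No DataNode block data is copied; only block lists are recorded.
        Path snap = fs.createSnapshot(new Path("/data"), "s20240101");

        // The snapshot is reachable under /data/.snapshot/s20240101
        System.out.println("created " + snap);
    }
}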
5. Back up the data. The HDFS client connects to the NameNode of the Hadoop cluster to obtain the file block list (DataNode information) of the files in the snapshot, then connects to the DataNodes, reads the file data, and backs it up to the external storage medium.
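A sketch of this step, streaming one file out of the read-only snapshot; the external medium is modelled here as a local file, and both paths are illustrative assumptions:

import java.io.FileOutputStream;
import java.io.OutputStream;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class BackupFile {
    public static void main(String[] args) throws Exception {
        FileSystem fs = HdfsClientFactory.create();

        // Read from the snapshot path, so later writes to /data cannot
        // change what is being backed up.
        Path src = new Path("/data/.snapshot/s20240101/file.log");
        try (FSDataInputStream in = fs.open(src);
             OutputStream out = new FileOutputStream("/backup/file.log")) {
            // The client fetches each block from a DataNode and streams it out.
            IOUtils.copyBytes(in, out, 4096, false);
        }
    }
}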
6. Back up the metadata. The HDFS client connects to the NameNode of the Hadoop cluster, acquires the metadata of the snapshot directory and the storage locations of the files in the external storage medium, and stores them in the file index database of the external storage server.
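The invention does not specify a schema for the file index database; the sketch below is one possible realization, recording for each file in the snapshot its HDFS attributes and an assumed location on the external medium via plain JDBC (the JDBC URL, credentials, table, and columns are hypothetical):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class IndexMetadata {
    public static void main(String[] args) throws Exception {
        FileSystem fs = HdfsClientFactory.create();
        // JDBC URL of the remote file index database (hypothetical).
        try (Connection db = DriverManager.getConnection(
                "jdbc:postgresql://index-db/backup", "backup", "secret");
             PreparedStatement ins = db.prepareStatement(
                 "INSERT INTO file_index(hdfs_path, length, mtime, owner, media_path)"
                 + " VALUES (?,?,?,?,?)")) {
            for (FileStatus st : fs.listStatus(
                    new Path("/data/.snapshot/s20240101"))) {
                ins.setString(1, st.getPath().toUri().getPath());
                ins.setLong(2, st.getLen());
                ins.setLong(3, st.getModificationTime());
                ins.setString(4, st.getOwner());
                // Where this file's data was written on the external medium.
                ins.setString(5, "/backup" + st.getPath().toUri().getPath());
                ins.executeUpdate();
            }
        }
    }
}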
7. Recover after data corruption, loss, or misoperation. The HDFS client selectively restores part of the files by acquiring their location information in the external storage medium, without restoring the whole snapshot, which solves the problem of slow data restoration. The data in the external medium is directly read and written into an HDFS folder; usually a new folder is created and the data is restored directly to the new path. If no new path is created, the original data may be overwritten.
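A sketch of the restore path: the media location of a single file would be looked up in the file index database (the lookup is elided and the value hard-coded here), and the file is written back into a newly created HDFS folder so that the original data is not overwritten:

import java.io.FileInputStream;
import java.io.InputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class RestoreFile {
    public static void main(String[] args) throws Exception {
        FileSystem fs = HdfsClientFactory.create();

        // Media path would come from the file index database; hard-coded here.
        String mediaPath = "/backup/file.log";

        // Restore into a new folder so existing data is not overwritten.
        Path target = new Path("/data_restore/file.log");
        try (InputStream in = new FileInputStream(mediaPath);
             FSDataOutputStream out = fs.create(target)) {
            IOUtils.copyBytes(in, out, 4096, false);
        }
    }
}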