CN112800019A - Data backup method and system based on Hadoop distributed file system - Google Patents

Data backup method and system based on Hadoop distributed file system

Info

Publication number
CN112800019A
Authority
CN
China
Prior art keywords
hdfs
file
data
folder
client
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110233087.7A
Other languages
Chinese (zh)
Inventor
段军红
靳丹
张旭
杨波
王琼
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Aeronautics and Astronautics
State Grid Gansu Electric Power Co Ltd
State Grid Electric Power Research Institute
Information and Communication Co of State Grid Gansu Electric Power Co Ltd
Original Assignee
Nanjing University of Aeronautics and Astronautics
State Grid Gansu Electric Power Co Ltd
State Grid Electric Power Research Institute
Information and Communication Co of State Grid Gansu Electric Power Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2021-03-03
Filing date
2021-03-03
Publication date
2021-05-14
Application filed by Nanjing University of Aeronautics and Astronautics, State Grid Gansu Electric Power Co Ltd, State Grid Electric Power Research Institute, and Information and Communication Co of State Grid Gansu Electric Power Co Ltd
Priority to CN202110233087.7A
Publication of CN112800019A
Legal status: Pending

Abstract

The invention discloses a data backup method and system based on the Hadoop Distributed File System (HDFS). In the method, an HDFS client backs up a folder by means of a snapshot: the client generates a point-in-time snapshot of the folder, and the data in the folder is stored to an external storage medium. The system comprises an HDFS system and a storage server connected to it; the storage server contains a storage medium and a file index database. The storage medium stores the system file data, and the file index database stores the system file metadata. The invention improves the security of data in HDFS, protects against Hadoop cluster disasters, allows system data to be recovered automatically and quickly, and protects the integrity and consistency of company data.

Description

Data backup method and system based on Hadoop distributed file system
Technical Field
The invention relates to a data backup method and system, in particular to a data backup method and system based on a Hadoop distributed file system.
Background
Big data has long existed in fields such as physics, biology, and environmental ecology, and in industries such as the military, finance, and communications, but it has attracted wide attention only in recent years with the development of the internet and the information industry. With the rapid development and widespread application of computer and information technology, the scale of industry application systems has expanded rapidly, and the data generated by industry applications has grown explosively.
Hadoop implements a highly fault-tolerant distributed file system, the Hadoop Distributed File System (HDFS). It runs on low-cost hardware, can be built into scalable, very large clusters, and supports the storage of and access to large volumes of data. As the technology has matured and the system has become more stable, vendors such as Cloudera and Hortonworks have successively released big data solutions based on the Hadoop architecture, and more and more enterprises use Hadoop as their basic platform. However, the market lacks a method for protecting and recovering data at a remote, off-cluster location when a Hadoop cluster suffers a disaster. The invention aims to address such cluster disasters.
Disclosure of Invention
Purpose of the invention: the invention aims to provide a method and system, based on the Hadoop distributed file system, for backing up data efficiently and safely to a third-party medium.
Technical solution: in the data backup method based on the Hadoop distributed file system according to the invention, a folder is backed up by means of a snapshot through an HDFS client: the client generates a point-in-time snapshot of the folder, and the data in the folder is stored to an external storage medium.
The method comprises the following steps:
(1) creating an HDFS client;
(2) creating a folder snapshot: reading the snapshot information of the current cluster through the HDFS client and generating a read-only point-in-time snapshot of the folder to be backed up, without copying any data blocks;
(3) data backup: establishing a connection between the external storage medium and the HDFS file system, reading the file contents, and writing them to the external storage medium;
(4) metadata backup: reading the metadata of the files and folders through the HDFS client, connecting to the remote file index database, and writing the metadata into it.
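As a rough illustration of step (2), the sketch below builds the argument lists for the standard `hdfs` command-line snapshot operations (`hdfs dfsadmin -allowSnapshot` and `hdfs dfs -createSnapshot`) without running them. The patent works through an HDFS client API rather than the CLI, so this is only an equivalent view of the same operation; the folder `/data/orders`, the snapshot naming scheme, and all helper names are assumptions made for the example.

```python
from datetime import datetime, timezone


def snapshot_name(now: datetime) -> str:
    """Build a sortable snapshot name (hypothetical naming scheme)."""
    return now.astimezone(timezone.utc).strftime("backup-%Y%m%dT%H%M%SZ")


def snapshot_commands(folder: str, name: str) -> list[list[str]]:
    """Return the hdfs CLI invocations that allow and then take a snapshot."""
    return [
        # An administrator must first mark the directory as snapshottable.
        ["hdfs", "dfsadmin", "-allowSnapshot", folder],
        # Creates a read-only point-in-time snapshot; no data blocks are copied.
        ["hdfs", "dfs", "-createSnapshot", folder, name],
    ]


def snapshot_path(folder: str, name: str) -> str:
    """HDFS exposes snapshots under the reserved '.snapshot' subdirectory."""
    return f"{folder.rstrip('/')}/.snapshot/{name}"


if __name__ == "__main__":
    name = snapshot_name(datetime(2021, 3, 3, 12, 0, 0, tzinfo=timezone.utc))
    for argv in snapshot_commands("/data/orders", name):
        print(" ".join(argv))
    print(snapshot_path("/data/orders", name))
```

Steps (3) and (4) would then read files from the path returned by `snapshot_path`, so the backup sees a frozen view of the folder even while the live directory keeps changing.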
Step (1) comprises the following sub-steps:
(11) downloading the HDFS service client and user credentials from the Hadoop management system interface, obtaining the HDFS configuration information and Kerberos authentication information, and placing the configuration on an agent node;
(12) issuing a backup job through the agent framework, reading the obtained HDFS configuration and Kerberos authentication information in the job, and creating the HDFS client.
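A minimal sketch of what reading the downloaded configuration in sub-step (12) might look like. It parses the standard Hadoop XML configuration format and checks the real properties `fs.defaultFS` and `hadoop.security.authentication`; the function names, the returned settings dictionary, and the example principal and keytab path are assumptions, and the sketch stops short of opening a connection.

```python
import xml.etree.ElementTree as ET


def parse_hadoop_conf(xml_text: str) -> dict[str, str]:
    """Parse <configuration><property><name/><value/></property>... into a dict."""
    root = ET.fromstring(xml_text)
    conf = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        value = prop.findtext("value")
        if name is not None and value is not None:
            conf[name.strip()] = value.strip()
    return conf


def client_settings(core_site_xml: str, principal: str, keytab: str) -> dict[str, str]:
    """Collect what a backup job needs before it can create an HDFS client."""
    conf = parse_hadoop_conf(core_site_xml)
    settings = {"default_fs": conf["fs.defaultFS"]}
    # Hadoop defaults to "simple" authentication when the property is absent.
    auth = conf.get("hadoop.security.authentication", "simple").lower()
    settings["authentication"] = auth
    if auth == "kerberos":
        # Hypothetical: the job would log in with this principal and keytab.
        settings["principal"] = principal
        settings["keytab"] = keytab
    return settings


if __name__ == "__main__":
    core_site = """
    <configuration>
      <property><name>fs.defaultFS</name><value>hdfs://nn.example.com:8020</value></property>
      <property><name>hadoop.security.authentication</name><value>kerberos</value></property>
    </configuration>
    """
    print(client_settings(core_site, "backup@EXAMPLE.COM", "/etc/security/backup.keytab"))
```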
In step (2), if the backup of the folder is incremental, the previous and current snapshots are compared to obtain the changes made to the folder.
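The comparison can be pictured with a small sketch. HDFS itself can report the difference between two snapshots (for example with `hdfs snapshotDiff <path> <fromSnapshot> <toSnapshot>`); the code below is not that mechanism but a simplified stand-in that compares two file listings held in memory. The `FileInfo` type and the listing format are assumptions.

```python
from typing import NamedTuple


class FileInfo(NamedTuple):
    size: int   # file length in bytes
    mtime: int  # modification time, milliseconds since the epoch


def diff_snapshots(
    old: dict[str, FileInfo], new: dict[str, FileInfo]
) -> dict[str, list[str]]:
    """Classify paths as created, modified, or deleted between two listings."""
    created = sorted(p for p in new if p not in old)
    deleted = sorted(p for p in old if p not in new)
    modified = sorted(p for p in new if p in old and new[p] != old[p])
    return {"created": created, "modified": modified, "deleted": deleted}


if __name__ == "__main__":
    previous = {
        "a.csv": FileInfo(100, 1614729600000),
        "b.csv": FileInfo(200, 1614729600000),
    }
    current = {
        "a.csv": FileInfo(150, 1614816000000),  # rewritten since last backup
        "c.csv": FileInfo(50, 1614816000000),   # new file
    }
    print(diff_snapshots(previous, current))
```

An incremental backup would then copy only the created and modified paths and record the deletions in the index.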
The external storage medium of step (2) is a remote storage medium.
The metadata of step (4) comprises the attribute information of each distributed file and its storage location in the external storage medium.
The method further comprises the following step: when a file in the system is damaged or lost, the HDFS client obtains the location of the file in the external storage medium and selectively restores only the affected files, without restoring the entire snapshot.
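One way to picture the file index database and the selective lookup is the following sketch, which uses SQLite purely as a stand-in. The patent does not name a database engine or a schema, so the table layout, column names, and storage location strings are all assumptions.

```python
import sqlite3


def open_index(db_path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a hypothetical file index database."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS file_index (
            snapshot         TEXT    NOT NULL,
            hdfs_path        TEXT    NOT NULL,
            size             INTEGER NOT NULL,
            owner            TEXT    NOT NULL,
            mtime            INTEGER NOT NULL,
            storage_location TEXT    NOT NULL,
            PRIMARY KEY (snapshot, hdfs_path)
        )
        """
    )
    return conn


def record_file(conn, snapshot, hdfs_path, size, owner, mtime, storage_location) -> None:
    """Store file attributes together with the location in external storage."""
    conn.execute(
        "INSERT OR REPLACE INTO file_index VALUES (?, ?, ?, ?, ?, ?)",
        (snapshot, hdfs_path, size, owner, mtime, storage_location),
    )
    conn.commit()


def locate(conn, snapshot: str, path: str) -> list[tuple[str, str]]:
    """Return (hdfs_path, storage_location) for one file or everything under a folder."""
    exact = path.rstrip("/")
    prefix = exact + "/"
    rows = conn.execute(
        "SELECT hdfs_path, storage_location FROM file_index "
        "WHERE snapshot = ? AND (hdfs_path = ? OR substr(hdfs_path, 1, ?) = ?) "
        "ORDER BY hdfs_path",
        (snapshot, exact, len(prefix), prefix),
    )
    return rows.fetchall()


if __name__ == "__main__":
    conn = open_index()
    record_file(conn, "s1", "/data/a.csv", 10, "etl", 1614729600000, "vol1/0001")
    record_file(conn, "s1", "/data/sub/b.csv", 20, "etl", 1614729600000, "vol1/0002")
    print(locate(conn, "s1", "/data/sub"))
```

Because each row carries the storage location, a restore job can fetch only the rows returned by `locate` instead of reading back the whole snapshot.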
The data backup system based on the Hadoop distributed file system comprises an HDFS system and a storage server connected to it. The storage server comprises a storage medium and a file index database; the storage medium stores the system file data, and the file index database stores the system file metadata. The HDFS system further comprises a backup server: an agent service is installed on its backup agent node, the server downloads the Hadoop configuration and Kerberos user authentication, an HDFS client is created through the agent service, and the HDFS client interacts with the Hadoop cluster.
Advantageous effects: compared with the prior art, the invention has the following notable advantages.
The method protects folder data on the Hadoop platform by building an agent environment on a remote node from the downloaded configuration. It supports Kerberos authentication and connects to the Hadoop cluster remotely through the Hadoop API, so it does not affect running services, improves backup efficiency and reliability, and provides backup protection for data in the distributed file system (HDFS). The invention improves the security of data in HDFS, protects against Hadoop cluster disasters, recovers system data automatically and quickly, and protects the integrity and consistency of company data. Backup of the entire distributed file system, or of parts of it, is supported.
Drawings
FIG. 1 is a schematic flow chart of the present invention.
Detailed Description
The technical solution of the invention is further explained below with reference to the accompanying drawing.
The system of the invention, built around the Hadoop Distributed File System (HDFS), comprises a backup agent node and a storage server. The backup server is an independent machine, either physical or virtual. An agent service is installed on the backup agent node, the Hadoop configuration and Kerberos user authentication are downloaded to it, and an HDFS client is created through the agent service to interact with the Hadoop cluster. The storage server comprises a storage medium and a file index database: the storage medium stores the system file data, and the file index database stores the system file metadata.
An HDFS cluster consists of one NameNode and a number of DataNodes. The NameNode is a central server that manages the file system namespace and client access to files. There is typically one DataNode per node in the cluster, and it manages the storage attached to the node on which it runs. HDFS exposes a file system namespace in which users can store data as files. Internally, a file is divided into one or more data blocks, which are stored on a set of DataNodes. The NameNode performs namespace operations such as opening, closing, and renaming files and directories, and it determines the mapping of data blocks to specific DataNodes. The DataNodes serve read and write requests from file system clients, and they create, delete, and replicate data blocks under the unified scheduling of the NameNode.
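The block and placement idea can be made concrete with a toy model. The 128 MiB block size and replication factor of 3 used below are the usual Hadoop defaults; the round-robin placement is a deliberate simplification and an assumption of this sketch, since the real NameNode applies a rack-aware placement policy.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MiB, the usual dfs.blocksize default


def split_into_blocks(file_size: int, block_size: int = BLOCK_SIZE) -> list[int]:
    """Return the sizes of the blocks a file of file_size bytes is divided into."""
    if file_size < 0 or block_size <= 0:
        raise ValueError("file_size must be >= 0 and block_size must be > 0")
    full, rest = divmod(file_size, block_size)
    # Every block is full-sized except possibly the last one.
    return [block_size] * full + ([rest] if rest else [])


def place_blocks(
    num_blocks: int, datanodes: list[str], replication: int = 3
) -> dict[int, list[str]]:
    """Toy block-to-DataNode mapping: replicas go to consecutive nodes, round-robin."""
    if replication > len(datanodes):
        raise ValueError("not enough DataNodes for the requested replication")
    return {
        block: [datanodes[(block + r) % len(datanodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


if __name__ == "__main__":
    sizes = split_into_blocks(300 * 1024 * 1024)
    print([s // (1024 * 1024) for s in sizes])
    print(place_blocks(len(sizes), ["dn1", "dn2", "dn3", "dn4"]))
```

This mapping is the kind of information a snapshot records: which blocks make up each file and where they live, not the block contents themselves.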
As shown in fig. 1, the method of the present invention specifically includes the following steps:
1. Install the agent service. Download the agent service installation package to the agent node and configure the Hadoop client parameters, which includes downloading the Hadoop configuration and the Kerberos authentication information.
2. Start the agent service, read the Hadoop client parameters, and create an HDFS client.
3. Connect to the Hadoop cluster through the HDFS client and interact with it, obtaining the snapshot list and the file metadata of the distributed file system.
4. Create a snapshot. The HDFS client connects to the NameNode of the Hadoop cluster and creates a snapshot of the folder. The system sets aside new space for storing files modified afterwards, and the snapshot records the locations of all file blocks of the folder at that point in time, together with information such as the block list and block sizes; it does not copy the file block data on the DataNodes. Taking a snapshot therefore only copies the location information of the file blocks at that moment. The operation is instantaneous, and the snapshot data cannot be modified. The subsequent data backup uses the block location information in the snapshot to read the file data out to the external medium.
5. Back up the data. The HDFS client connects to the NameNode of the Hadoop cluster and obtains the block list (DataNode information) of the files in the snapshot; it then connects to the DataNodes, reads the file data to be backed up, and writes it to the external storage medium.
6. Back up the metadata. The HDFS client connects to the NameNode of the Hadoop cluster, obtains the metadata of the snapshot directory and the storage location of each file in the external storage medium, and stores them in the file index database of the external storage server.
7. Recover after data corruption, loss, or misoperation. The HDFS client obtains the location of the files in the external storage medium and selectively restores only some files, without restoring the whole snapshot, which addresses the problem of slow data restoration. The data is read directly from the external medium and written into an HDFS folder. Usually a new folder is created and the data is restored directly to the new path; if no new path is created, the original data may be overwritten.
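Steps 5 to 7 can be sketched end to end with local directories standing in for the HDFS snapshot and the external storage medium. This is an illustration only: no HDFS calls are made, an in-memory dictionary takes the place of the file index database, and the SHA-256 check and the `overwrite` flag are additions of this sketch rather than features described in the patent.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in 1 MiB chunks so large files are not loaded into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_tree(snapshot_dir: Path, external_dir: Path) -> dict[str, str]:
    """Copy every file under snapshot_dir; return {relative path: checksum}."""
    index = {}
    for src in sorted(p for p in snapshot_dir.rglob("*") if p.is_file()):
        rel = src.relative_to(snapshot_dir).as_posix()
        dst = external_dir / rel
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copyfile(src, dst)
        index[rel] = sha256_of(dst)
    return index


def restore_files(
    external_dir: Path,
    index: dict[str, str],
    wanted: list[str],
    target_dir: Path,
    overwrite: bool = False,
) -> list[Path]:
    """Restore only the wanted files into target_dir, verifying each checksum."""
    restored = []
    for rel in wanted:
        src = external_dir / rel
        if sha256_of(src) != index[rel]:
            raise OSError(f"checksum mismatch for {rel}")
        dst = target_dir / rel
        if dst.exists() and not overwrite:
            raise FileExistsError(f"{dst} exists; pass overwrite=True to replace it")
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copyfile(src, dst)
        restored.append(dst)
    return restored


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        snap = root / "snapshot"
        (snap / "logs").mkdir(parents=True)
        (snap / "logs" / "a.txt").write_text("alpha")
        (snap / "b.txt").write_text("beta")
        index = backup_tree(snap, root / "external")
        out = restore_files(root / "external", index, ["logs/a.txt"], root / "restored")
        print([p.relative_to(root).as_posix() for p in out])
```

Restoring into a fresh `target_dir` mirrors the usual practice described in step 7; refusing to overwrite by default is a conservative choice made for this example.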

Claims (8)

1. A data backup method based on the Hadoop distributed file system, characterized in that a folder is backed up by means of a snapshot through an HDFS client, a point-in-time snapshot of the folder is generated through the client, and the data in the folder is stored to an external storage medium.
2. The data backup method based on the Hadoop distributed file system according to claim 1, characterized by comprising the following steps: (1) creating an HDFS client; (2) creating a folder snapshot: reading the snapshot information of the current cluster through the HDFS client and generating a read-only point-in-time snapshot of the backup folder without copying any data blocks; (3) data backup: establishing a connection between the external storage medium and the HDFS file system, reading the file contents, and writing them to the external storage medium; (4) metadata backup: reading the metadata of the files and folders through the HDFS client, connecting to the remote storage index database, and writing the metadata into the file index database.
3. The data backup method based on the Hadoop distributed file system according to claim 2, characterized in that step (1) comprises: (11) downloading the HDFS service client and user credentials from the Hadoop management system interface, obtaining the HDFS configuration information and Kerberos authentication information, and placing the configuration on an agent node; (12) issuing a backup job through the agent framework, reading the obtained HDFS configuration and Kerberos authentication information in the job, and creating the HDFS client.
4. The data backup method based on the Hadoop distributed file system according to claim 2, characterized in that, in step (2), if the backup of the folder is incremental, the previous and current snapshots are compared to obtain the modification information of the folder.
5. The data backup method based on the Hadoop distributed file system according to claim 2, characterized in that the external storage medium of step (2) is a remote storage medium.
6. The data backup method based on the Hadoop distributed file system according to claim 2, characterized in that the metadata of step (4) comprises the attribute information of the distributed file and the storage location information of the distributed file in the external storage medium.
7. The data backup method based on the Hadoop distributed file system according to claim 6, characterized by further comprising the following step: when a file in the system is damaged or lost, the HDFS client obtains the location information of the file in the external storage medium and selectively restores some files, without restoring the entire snapshot.
8. A data backup system based on the Hadoop distributed file system, characterized by comprising an HDFS system and a storage server connected to the system, wherein the storage server comprises a storage medium and a file index database; the storage medium is used to store system file data, and the file index database is used to store system file metadata; the HDFS system further comprises a backup server, wherein an agent service is installed on its backup agent node, the server downloads the Hadoop configuration and Kerberos user authentication, an HDFS client is created through the agent service, and the HDFS client interacts with the Hadoop cluster.
CN202110233087.7A | Priority date 2021-03-03 | Filing date 2021-03-03 | Data backup method and system based on Hadoop distributed file system | Pending | CN112800019A (en)

Priority Applications (1)

Application Number | Publication | Priority Date | Filing Date | Title
CN202110233087.7A | CN112800019A (en) | 2021-03-03 | 2021-03-03 | Data backup method and system based on Hadoop distributed file system


Publications (1)

Publication Number | Publication Date
CN112800019A | 2021-05-14

Family

ID=75816340

Family Applications (1)

Application Number | Status | Publication | Priority Date | Filing Date | Title
CN202110233087.7A | Pending | CN112800019A (en) | 2021-03-03 | 2021-03-03 | Data backup method and system based on Hadoop distributed file system

Country Status (1)

Country | Link
CN (1) | CN112800019A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN113434606A (en) * | 2021-06-30 | 2021-09-24 | 青岛海尔科技有限公司 | Data import method, device, equipment and medium
CN113821490A (en) * | 2021-08-24 | 2021-12-21 | 济南浪潮数据技术有限公司 | A data synchronization method and device
CN114153842A (en) * | 2021-11-12 | 2022-03-08 | 广东广信通信服务有限公司 | Cross-platform data processing method, system, equipment and medium
CN114185484A (en) * | 2021-11-04 | 2022-03-15 | 福建升腾资讯有限公司 | Method, device, equipment and medium for clustering document storage

Citations (2)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN104714858A (en) * | 2013-12-13 | 2015-06-17 | 中国移动通信集团公司 | Data backup method, data recovery method and device
CN112214357A (en) * | 2020-10-30 | 2021-01-12 | 上海爱数信息技术股份有限公司 | HDFS data backup and recovery system and backup and recovery method


Legal Events

Code | Title
PB01 | Publication
SE01 | Entry into force of request for substantive examination
RJ01 | Rejection of invention patent application after publication

Application publication date: 2021-05-14

