CN102982180A


Info

Publication number: CN102982180A
Application number: CN2012105520997A
Authority: CN
Inventors: 付旭东; 段雨梅
Original assignee: Huawei Technologies Co Ltd
Current assignee: Hunan And Magnetic Technology Co Ltd
Priority date: 2012-12-18
Filing date: 2012-12-18
Publication date: 2013-03-20
Anticipated expiration: 2032-12-18
Also published as: CN102982180B

Abstract

An embodiment of the invention provides a data storage method and device. The method comprises: matching the fingerprint of each data block of a file to be stored against the fingerprints in a fingerprint library to obtain a corresponding backup data block; performing a data deduplication operation on the file to be stored according to the backup data block, and marking the backup data block with a status identifier; and performing reclamation processing on the backup data block according to its status identifier.

Description


Data storage method and device

Technical field

Embodiments of the present invention relate to data processing technologies, and in particular to a data storage method and device.

Background

As the volume of enterprise data keeps growing, large amounts of duplicate data pose a severe challenge to storage. Data deduplication (data de-duplication, De-Dupe for short), an important technique for effectively reducing data and lowering data storage costs, is receiving more and more attention.

In a data storage task, the file to be stored is usually divided into data blocks. Data deduplication automatically searches for duplicate data blocks, keeps only one unique copy of identical data blocks, and replaces the other duplicate copies with pointers to the unique copy, increasing the reference count of that copy by 1 each time, thereby eliminating redundant data and reducing the required storage capacity. When the unique copy retained after deduplication is modified or deleted, its reference count changes. When the reference count of the copy drops to 0, the copy meets the condition for garbage collection and is reclaimed as garbage, freeing more storage space.

However, in the prior art, when data deduplication and reclamation are executed concurrently, the pointer handed to a duplicate copy may point to data that has just been reclaimed, resulting in data loss.

Summary of the invention

Embodiments of the present invention provide a data processing method and device to optimize the concurrent execution of data deduplication and reclamation.

In a first aspect, an embodiment of the present invention provides a data storage method, including:

matching the fingerprint of each data block of the file to be stored against the fingerprints in the fingerprint library to obtain the corresponding backup data block;

performing a data deduplication operation on the file to be stored according to the backup data block, and marking the backup data block with a status identifier;

performing reclamation processing on the backup data block according to the status identifier of the backup data block.

In a first possible implementation of the first aspect, matching the fingerprint of each data block of the file to be stored against the fingerprints in the fingerprint library to obtain the corresponding backup data block includes:

dividing the file to be stored into blocks to obtain the data blocks, and calculating the fingerprint of each data block;

sampling the fingerprints of the data blocks, and generating a fingerprint sampling table of the file to be stored from the sampled fingerprints;

determining, according to the fingerprint sampling table and a group sampling library, the similar group to which the file to be stored belongs in the group sampling library, and using the stored data blocks corresponding to the similar group as the backup data block, where the group sampling library is obtained by sampling the fingerprint library, and the similar group is the sampling group in the group sampling library that matches the sampled fingerprints in the fingerprint sampling table of the file to be stored.

In a second possible implementation of the first aspect, performing the data deduplication operation on the file to be stored according to the backup data block and marking the backup data block with a status identifier includes:

before performing the data deduplication operation on the file to be stored according to the backup data block, incrementing the group count of the backup data block by one;

after completing the data deduplication operation on the file to be stored according to the backup data block, decrementing the group count of the backup data block by one.

In a third possible implementation, based on the second possible implementation of the first aspect, performing reclamation processing on the backup data block according to the status identifier of the backup data block includes:

when the group count in the status identifier of the backup data block is found to be non-zero, suspending reclamation of the backup data block;

when the group count in the status identifier of the backup data block is found to be zero, triggering reclamation of the backup data block.

In a fourth possible implementation, based on the first aspect or on the first or second possible implementation of the first aspect, performing reclamation processing on the backup data block according to the status identifier of the backup data block includes:

when a change in the value of the reference count of the backup data block is detected, checking the status identifier of the corresponding backup data block;

when the status identifier of the corresponding backup data block indicates that the backup data block is not in use, checking the value of the reference count of the backup data block;

when the value of the reference count of the backup data block is found to be zero, triggering reclamation of the backup data block.
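The three checks of the fourth implementation can be sketched roughly as follows. This is only an illustration under stated assumptions: the `BackupBlock` class, its field names, and the `reclaim` callback are hypothetical and do not appear in the patent text, and "not in use" is assumed to mean a group count of zero.

```python
from dataclasses import dataclass


@dataclass
class BackupBlock:
    """Hypothetical in-memory view of a backup data block."""
    ref_count: int        # number of stored files that point at this block
    group_count: int = 0  # number of deduplication runs currently using it


def on_ref_count_changed(block: BackupBlock, reclaim) -> bool:
    """Handle a change of block.ref_count; return True if the block was reclaimed."""
    # Check 1: the status identifier. A non-zero group count is taken to
    # mean "in use by deduplication", so reclamation must not proceed.
    if block.group_count != 0:
        return False
    # Check 2: the reference count itself. Only unreferenced blocks qualify.
    if block.ref_count != 0:
        return False
    # Check 3 passed: trigger reclamation through the caller-supplied callback.
    reclaim(block)
    return True
```

The order matters in this sketch: the status identifier is consulted before the reference count, so a block that a running deduplication is about to reference is never handed to the reclaimer.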

In a second aspect, an embodiment of the present invention provides a data storage device, including:

a backup data block acquisition module, configured to match the fingerprint of each data block of the file to be stored against the fingerprints in the fingerprint library to obtain the corresponding backup data block;

a data deduplication module, configured to perform a data deduplication operation on the file to be stored according to the backup data block, and to mark the backup data block with a status identifier;

a reclamation module, configured to perform reclamation processing on the backup data block according to the status identifier of the backup data block.

In a first possible implementation of the second aspect, the backup data block acquisition module includes:

a fingerprint calculation unit, configured to divide the file to be stored into blocks to obtain the data blocks, and to calculate the fingerprint of each data block;

a fingerprint sampling unit, configured to sample the fingerprints of the data blocks, and to generate a fingerprint sampling table of the file to be stored from the sampled fingerprints;

a group determination unit, configured to determine, according to the fingerprint sampling table and the group sampling library, the similar group to which the file to be stored belongs in the group sampling library, and to use the stored data blocks corresponding to the similar group as the backup data block, where the group sampling library is obtained by sampling the fingerprint library, and the similar group is the sampling group in the group sampling library that matches the sampled fingerprints in the fingerprint sampling table of the file to be stored.

In a second possible implementation of the second aspect, the data deduplication module includes:

a first counting unit, configured to increment the group count of the backup data block by one before the data deduplication operation is performed on the file to be stored according to the backup data block;

a second counting unit, configured to decrement the group count of the backup data block by one after the data deduplication operation on the file to be stored according to the backup data block has completed.

In a third possible implementation, based on the second possible implementation of the second aspect, the reclamation module includes:

a reclamation suspension unit, configured to suspend reclamation of the backup data block when the group count in the status identifier of the backup data block is found to be non-zero;

a first reclamation triggering unit, configured to trigger reclamation of the backup data block when the group count in the status identifier of the backup data block is found to be zero.

In a fourth possible implementation, based on the second aspect or on the first or second possible implementation of the second aspect, the reclamation module includes:

a reference count monitoring unit, configured to check the status identifier of the corresponding backup data block when a change in the value of the reference count of the backup data block is detected;

a reference count identification unit, configured to check the value of the reference count of the backup data block when the status identifier of the corresponding backup data block indicates that the backup data block is not in use;

a second reclamation triggering unit, configured to trigger reclamation of the backup data block when the value of the reference count of the backup data block is found to be zero.

Embodiments of the present invention provide a data storage method and device. The method matches the fingerprint of each data block of the file to be stored against the fingerprints in the fingerprint library to obtain the corresponding backup data block, performs a data deduplication operation on the file to be stored according to the backup data block while marking the backup data block with a status identifier, and performs reclamation processing on the backup data block according to that status identifier. Deduplication is thereby given priority, which solves the problem of data loss caused by concurrent execution of deduplication and reclamation, and ensures both the orderly execution of deduplication and reclamation and the safety of stored data.

Description of the drawings

To illustrate the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings used in describing the embodiments or the prior art are briefly introduced below. The drawings described below are evidently only some embodiments of the present invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.

FIG. 1 is a flow chart of Embodiment 1 of the data storage method of the present invention;

FIG. 2 is a flow chart of Embodiment 2 of the data storage method of the present invention;

FIG. 3 is a flow chart of Embodiment 3 of the data storage method of the present invention;

FIG. 4 is a schematic diagram of Embodiment 1 of the data storage logical architecture of the present invention;

FIG. 5 is a schematic diagram of Embodiment 1 of the data storage cluster architecture of the present invention;

FIG. 6 is a structural diagram of Embodiment 1 of the data storage device of the present invention;

FIG. 7 is a structural diagram of Embodiment 2 of the data storage device of the present invention;

FIG. 8 is a structural diagram of Embodiment 3 of the data storage device of the present invention.

Detailed description

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are evidently only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.

FIG. 1 is a flow chart of Embodiment 1 of the data storage method of the present invention. As shown in FIG. 1, this embodiment provides a data storage method that can be performed by any device that performs data storage operations, and that may specifically include the following steps:

Step 101: Match the fingerprint of each data block of the file to be stored against the fingerprints in the fingerprint library to obtain the corresponding backup data block.

In this embodiment the same data storage method is applied to every file; before it is stored, a file is a file to be stored. The fingerprints in the fingerprint library are the fingerprints of the data blocks of files that have already been stored. The fingerprints of the data blocks of the file to be stored are matched one by one against the fingerprints of the data blocks of stored files in the fingerprint library, and the backup data block to which the file to be stored corresponds in the fingerprint library is determined from the similarity between the two sets of fingerprints. Specifically, when the similarity between the fingerprints of the data blocks of the file to be stored and the fingerprints of the data blocks of a stored file is greater than or equal to a preset similarity threshold, the data blocks of that stored file are regarded as the backup data block corresponding to the data blocks of the file to be stored. The similarity may be the proportion of the fingerprints of the data blocks of the file to be stored that are identical or similar to fingerprints of the data blocks of the stored file.
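As a rough illustration of the similarity measure and threshold test described above, the following sketch may help. It assumes fingerprints are hashable values, counts only identical fingerprints (the text also allows "similar" ones, which is not modeled here), and uses a hypothetical default threshold of 0.5; the function names are illustrative.

```python
def similarity(new_fps, stored_fps) -> float:
    """Share of the new file's fingerprints that also occur among the stored ones."""
    new_fps = list(new_fps)
    if not new_fps:
        return 0.0  # nothing to compare, so treat as no similarity
    stored = set(stored_fps)
    matched = sum(1 for fp in new_fps if fp in stored)
    return matched / len(new_fps)


def is_backup_candidate(new_fps, stored_fps, threshold: float = 0.5) -> bool:
    """True when the similarity reaches the preset threshold (assumed value)."""
    return similarity(new_fps, stored_fps) >= threshold
```

For example, a file with fingerprints `a, b, c, d` compared against stored fingerprints `a, b, x` has two of its four fingerprints matched, so its similarity is 0.5.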

Step 102: Perform a data deduplication operation on the file to be stored according to the backup data block, and mark the backup data block with a status identifier.

Once the backup data block has been determined, the file to be stored is deduplicated against it. The specific method may be similar to the prior art: the computed fingerprint of each block of the file to be stored is matched against the fingerprints held in the backup data block. If the backup data block already holds a fingerprint identical or similar to that of a data block of the file to be stored, the data of that data block is deleted; if the backup data block holds no fingerprint identical or similar to that of the data block, the data of that data block is stored.
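The store-or-drop decision described above might look like the following sketch. The index layout (a dict from fingerprint to storage location), the `store` callback, and the returned "recipe" of locations are all assumptions made for illustration; only exact fingerprint matches are handled.

```python
def deduplicate(blocks, backup_index, store):
    """Deduplicate one file against the fingerprints of its backup data block.

    blocks:       iterable of (fingerprint, data) pairs for the file to be stored
    backup_index: dict mapping fingerprint -> location of the stored copy
    store:        callable that persists data and returns its location
    Returns the list of locations from which the file can be rebuilt.
    """
    recipe = []
    for fp, data in blocks:
        location = backup_index.get(fp)
        if location is None:
            # No matching fingerprint: keep the data and record its fingerprint.
            location = store(data)
            backup_index[fp] = location
        # Matching fingerprint: the data is dropped and only a pointer is kept.
        recipe.append(location)
    return recipe
```

In this sketch a repeated block within the same file is also deduplicated, because its fingerprint is added to the index as soon as the first occurrence is stored.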

In this step, while the file to be stored is deduplicated according to the backup data block, the backup data block is also marked with a status identifier. The status identifier indicates whether the backup data block is currently in use by a deduplication operation.

The status identifier may take many forms; preferably it includes the group number of the backup data block and the group count of the backup data block. The group count is the number of deduplication operations performed according to the backup data block, which accommodates the case where the backup data block is used by several deduplication operations executing in parallel. Accordingly, the deduplication and status marking are preferably carried out as follows: before the file to be stored is deduplicated according to the backup data block, the group count of the backup data block is incremented by one; after the deduplication of the file to be stored according to the backup data block has completed, the group count of the backup data block is decremented by one.

Those skilled in the art will understand that a non-zero group count means a deduplication operation is in progress. This embodiment also maintains a deduplication list, which contains the status identifiers of the backup data blocks that need deduplication processing; when the group count of a backup data block is zero, its status identifier can be removed from the deduplication list.
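One way to picture the increment-before, decrement-after rule and the deduplication list is a small counter table guarded by a lock. The sketch below is an assumption-laden illustration: the `DedupList` class, its method names, and the use of a context manager are not from the patent, and an in-process lock stands in for whatever synchronization a real multi-node system would need.

```python
import threading
from contextlib import contextmanager


class DedupList:
    """Hypothetical deduplication list: group number -> group count."""

    def __init__(self):
        self._lock = threading.Lock()
        self._counts = {}

    @contextmanager
    def using(self, group_no):
        """Mark the group as in use for the duration of one deduplication run."""
        with self._lock:
            # Increment before deduplication starts.
            self._counts[group_no] = self._counts.get(group_no, 0) + 1
        try:
            yield
        finally:
            with self._lock:
                # Decrement after deduplication completes, even on error.
                self._counts[group_no] -= 1
                if self._counts[group_no] == 0:
                    # A zero count means the entry can leave the list.
                    del self._counts[group_no]

    def group_count(self, group_no) -> int:
        with self._lock:
            return self._counts.get(group_no, 0)
```

A deduplication worker would wrap its run in `with dedup_list.using(group_no): ...`, so that parallel runs against the same backup data block each contribute one to the count.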

Step 103: Perform reclamation processing on the backup data block according to the status identifier of the backup data block.

Before a backup data block is reclaimed, its status identifier must first be used to decide whether it may be reclaimed. In this embodiment, whether the backup data block is in use can be determined by looking up its status identifier in the deduplication list, and reclamation is decided accordingly. When the group count in the status identifier of the backup data block is zero, no deduplication is being performed against the backup data block, so it can be reclaimed. When the group count in the status identifier is non-zero, deduplication is being performed against the backup data block, so its reclamation is suspended. This embodiment may also maintain a reclamation list, which contains the status identifiers of the backup data blocks that need to be reclaimed; after a backup data block has been reclaimed, its status identifier is removed from the reclamation list.
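The reclamation decision can be sketched as a single pass over the reclamation list. The function name, the plain dict standing in for the deduplication list, and the `reclaim` callback are hypothetical; a real implementation would additionally have to make the count check and the reclamation atomic with respect to concurrent increments, which this single-threaded sketch does not attempt.

```python
def run_reclamation(reclaim_list, group_counts, reclaim):
    """Reclaim every pending group that no deduplication run is using.

    reclaim_list: list of group numbers awaiting reclamation (mutated in place)
    group_counts: dict mapping group number -> current group count
    reclaim:      callable that frees the storage of one group
    Returns the group numbers whose reclamation was suspended.
    """
    suspended = []
    for group_no in list(reclaim_list):  # iterate over a copy while mutating
        if group_counts.get(group_no, 0) != 0:
            # Deduplication in progress: suspend and leave it on the list.
            suspended.append(group_no)
            continue
        reclaim(group_no)
        # Reclaimed entries are removed from the reclamation list.
        reclaim_list.remove(group_no)
    return suspended
```

Suspended groups stay on the list, so a later pass picks them up once their group count has returned to zero.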

Those skilled in the art will understand that when deduplication is performed against the backup data block, the reclamation list can also be queried; if the reclamation list contains the status identifier of the backup data block, reclamation of the backup data block can likewise be suspended, ensuring that deduplication takes priority.

Those skilled in the art will understand that when the group count of a backup data block is zero, deduplication and reclamation can not only run at the same time but also do not interfere with each other while running, which preserves concurrent execution of deduplication and reclamation to the greatest extent.

This embodiment of the present invention provides a data storage method that matches the fingerprint of each data block of the file to be stored against the fingerprints in the fingerprint library to obtain the corresponding backup data block, performs a data deduplication operation on the file to be stored according to the backup data block while marking the backup data block with a status identifier, and performs reclamation processing on the backup data block according to that status identifier. Deduplication is thereby given priority, which solves the problem of data loss caused by concurrent execution of deduplication and reclamation, and ensures both the orderly execution of deduplication and reclamation and the safety of stored data.

FIG. 2 is a flow chart of Embodiment 2 of the data storage method of the present invention. As shown in FIG. 2, this embodiment provides a data storage method that can be performed by any device that performs data storage operations, and that may specifically include the following steps:

Step 201: Divide the file to be stored into blocks to obtain the data blocks, and calculate the fingerprint of each data block.

This step first divides the file to be stored into blocks. Any chunking technique from the prior art may be used, for example a variable-length chunking algorithm. The fingerprint of each resulting block is then calculated; calculation methods from the prior art may also be used here, for example a double-hash scheme that combines the Secure Hash Algorithm (SHA) with Message Digest Algorithm 5 (MD5).
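A minimal sketch of this step is given below, under several assumptions the text does not fix: the chunker is a toy content-defined scheme that cuts where the sum of the last `window` bytes has its low bits clear, the size limits and mask are arbitrary, SHA-1 is chosen as the Secure Hash Algorithm variant, and the "double hash" is taken to mean concatenating the two digests.

```python
import hashlib


def chunk_variable(data: bytes, window=16, mask=0x1F, min_size=32, max_size=256):
    """Split data into variable-length chunks at content-defined boundaries."""
    if min_size < window:
        raise ValueError("min_size must be at least the window length")
    chunks = []
    start = 0
    for i in range(len(data)):
        size = i - start + 1
        if size < min_size:
            continue  # never cut before the minimum chunk size
        # Toy boundary test over the trailing window (illustrative only).
        window_sum = sum(data[i - window + 1 : i + 1])
        if (window_sum & mask) == 0 or size >= max_size:
            chunks.append(data[start : i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])  # trailing remainder, may be short
    return chunks


def fingerprint(chunk: bytes) -> bytes:
    """Double-hash fingerprint: SHA-1 digest followed by MD5 digest (assumed layout)."""
    return hashlib.sha1(chunk).digest() + hashlib.md5(chunk).digest()
```

Because boundaries depend on the bytes themselves rather than on fixed offsets, inserting data near the start of a file tends to shift only nearby chunk boundaries, which is the usual motivation for variable-length chunking in deduplication.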

Step 202: Sample the fingerprints of the data blocks, and generate a fingerprint sampling table of the file to be stored from the sampled fingerprints.

To reduce the amount of computation during deduplication, the fingerprints of the blocks of the file to be stored are sampled once they have been obtained. The basic requirements are that every fingerprint in the sampling result is one of the block fingerprints of the file to be stored, and that the number of sampled fingerprints does not exceed the number of block fingerprints of the file. The block fingerprints may be sampled in any of the following ways: taking the fingerprints whose last byte is 0; taking the fingerprints of blocks at fixed positions, for example blocks at positions that are integer multiples of 9; or sampling at a predetermined ratio, for example randomly selecting 5% of the blocks. The fingerprints are thus filtered by sampling, and the fingerprint sampling table of the file to be stored is generated from the sampled fingerprints. Those skilled in the art will understand that it may happen that no block of the file to be stored satisfies the sampling condition, in which case the resulting fingerprint sampling table is empty.
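The three sampling options listed above can be sketched as follows. Fingerprints are assumed to be `bytes` values, positions are assumed to be counted from 1, and the function names and the optional `seed` parameter are illustrative additions.

```python
import random


def sample_last_byte_zero(fps):
    """Keep fingerprints whose last byte is 0; the result may be empty."""
    return [fp for fp in fps if fp and fp[-1] == 0]


def sample_fixed_positions(fps, step=9):
    """Keep fingerprints at positions that are integer multiples of `step`."""
    return [fp for pos, fp in enumerate(fps, start=1) if pos % step == 0]


def sample_ratio(fps, ratio=0.05, seed=None):
    """Randomly keep a fixed share of the fingerprints (5% by default)."""
    fps = list(fps)
    k = int(len(fps) * ratio)
    return random.Random(seed).sample(fps, k)
```

All three return a subset of the input no larger than the input itself, which matches the two basic requirements stated in the text. The first option is the only one whose result depends purely on fingerprint content, so applying it to both the file and the fingerprint library selects the same fingerprints on both sides.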

步骤203：根据指纹抽样表和分组抽样库，确定待存储文件在分组抽样库中所属的相似分组，将相似分组对应的已存储的数据块作为备份数据块。Step 203: According to the fingerprint sampling table and the group sampling library, determine the similar group to which the file to be stored belongs in the group sampling library, and use the stored data block corresponding to the similar group as a backup data block.

在获取到待存储文件的指纹抽样表后，根据指纹抽样表和分组抽样库，确定待存储文件在分组抽样库中所属的相似分组，将相似分组对应的已存储的数据块作为备份数据块。分组抽样库由指纹库进行抽样处理得到，相似分组为分组抽样库中与待存储文件的指纹抽样表中的抽样指纹相匹配的一个抽样分组。After obtaining the fingerprint sampling table of the file to be stored, according to the fingerprint sampling table and the group sampling library, determine the similar group to which the file to be stored belongs in the group sampling library, and use the stored data block corresponding to the similar group as the backup data block. The group sampling library is obtained by sampling processing of the fingerprint library, and the similar group is a sampling group in the group sampling library that matches the sampling fingerprint in the fingerprint sampling table of the file to be stored.

特别地，指纹库保存了存储文件经过重复数据删除后的所有指纹。若本步骤处理的待存储文件为第一个文件，则指纹库为空。此时，若指纹抽样表不为空，则在分组抽样库中建立一个新建分组，确定待存储文件在分组抽样库中所属的相似分组为新建分组，并将待存储文件的指纹抽样表中的指纹保存到新建分组中。当指纹抽样表不为空，且指纹库不为空的时候，对指纹库中的指纹进行抽样处理，获得分组抽样库。其中抽样处理的方法与步骤202中对待存储文件的各数据块进行抽样处理的方法类似，本实施例此处不再赘述。本领域技术人员可以理解，对指纹库中的指纹进行抽样处理的方法与对待存储文件的各数据块进行抽样处理的方法应保持一致，这样可以得到相似度较高的相似分组。In particular, the fingerprint library saves all fingerprints of stored files after data deduplication. If the file to be stored in this step is the first file, the fingerprint library is empty. At this time, if the fingerprint sampling table is not empty, a new group is established in the group sampling library, and the similar group of the file to be stored in the group sampling library is determined to be a new group, and the fingerprint sampling table of the file to be stored is added to the new group. Fingerprints are saved to the newly created group. When the fingerprint sampling table is not empty and the fingerprint library is not empty, the fingerprints in the fingerprint library are sampled to obtain the group sampling library. The sampling processing method is similar to the sampling processing method for each data block of the file to be stored in step 202, which will not be repeated here in this embodiment. Those skilled in the art can understand that the method for sampling the fingerprints in the fingerprint library should be consistent with the method for sampling each data block of the file to be stored, so that similar groups with high similarity can be obtained.

通过对指纹抽样表中的各指纹与当前的分组抽样库中各抽样分组逐一匹配，根据匹配结果在当前的分组抽样库中确定待存储文件所属的相似分组。具体地，当指纹抽样表中的各指纹与当前的分组抽样指纹库中的一个抽样分组的指纹相似度大于或等于预设的相似度阈值时，则认为该待存储文件属于该抽样分组，该抽样分组为相似分组，该相似分组中的指纹对应的已存储的数据块作为备份数据块；当指纹抽样表中的各指纹与当前分组抽样库中的所有分组的指纹相似度均小于预设的相似度阈值时，在分组抽样库中建立一个新建分组，确定待存储文件在分组抽样库中所属的相似分组为新建分组，并将待存储文件的指纹抽样表中的指纹保存到新建分组中。By matching each fingerprint in the fingerprint sampling table with each sampling group in the current group sampling library one by one, determine the similar group to which the file to be stored belongs to in the current group sampling library according to the matching result. Specifically, when the fingerprint similarity between each fingerprint in the fingerprint sampling table and a sampling group in the current group sampling fingerprint library is greater than or equal to the preset similarity threshold, then it is considered that the file to be stored belongs to the sampling group, the The sampling group is a similar group, and the stored data block corresponding to the fingerprint in the similar group is used as a backup data block; When the similarity threshold is reached, a new group is established in the group sampling library, and the similar group to which the file to be stored belongs in the group sampling library is determined to be a new group, and the fingerprints in the fingerprint sampling table of the file to be stored are stored in the new group.

When none of the sampling results in step 202 satisfies the sampling condition, that is, when the file to be stored contains no block that satisfies the sampling condition, the similar group to which the file to be stored belongs in the current group sampling library is determined to be the preset group of that library, and the similarity analysis of this embodiment ends. Data deduplication is then performed on the file to be stored within the fingerprint group of the fingerprint library that corresponds to the preset group. The preset group is a group defined in advance in this embodiment and has no particular meaning. It may be empty, and it corresponds to one specific fingerprint group in the fingerprint library, which holds the fingerprints of those files to be stored whose fingerprint sampling table is empty after sampling. In practice, sampling can leave the fingerprint sampling table empty; the handling of this special case is described here so that it does not interrupt the overall flow.

Further, when data deduplication is performed on the file to be stored according to the backup data blocks, the fingerprints of the similar group may be divided into multiple intervals, with one database created per interval to hold the fingerprints of that interval. Duplicate data blocks can then be looked up interval by interval. With multiple threads or multiple nodes, the intervals can be queried concurrently, which improves concurrent query capability and speeds up lookups.
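One way to picture the interval partitioning is the following Python sketch. It is illustrative only: the class name `IntervalIndex`, the use of in-memory sets in place of per-interval databases, and the rule that the first byte of a hexadecimal fingerprint selects the interval are assumptions not stated in the text.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Iterable, List, Set


class IntervalIndex:
    """Fingerprints of one similar group, split into equal-width intervals."""

    def __init__(self, fingerprints: Iterable[str], intervals: int = 4):
        # One set per interval stands in for the per-interval database.
        self.buckets: List[Set[str]] = [set() for _ in range(intervals)]
        for fp in fingerprints:
            self.buckets[self._interval(fp)].add(fp)

    def _interval(self, fp: str) -> int:
        # Assumption: fingerprints are hex strings; the first byte (0..255)
        # is mapped proportionally onto the available intervals.
        return int(fp[:2], 16) * len(self.buckets) // 256

    def find_duplicates(self, candidates: Iterable[str]) -> Set[str]:
        """Return the candidate fingerprints that are already stored."""
        per_interval: List[List[str]] = [[] for _ in self.buckets]
        for fp in candidates:
            per_interval[self._interval(fp)].append(fp)

        def lookup(i: int) -> Set[str]:
            return {fp for fp in per_interval[i] if fp in self.buckets[i]}

        # Each interval is queried by its own worker thread.
        with ThreadPoolExecutor(max_workers=len(self.buckets)) as pool:
            results = list(pool.map(lookup, range(len(self.buckets))))
        return set().union(*results)
```

Because each lookup touches only its own interval, the workers share no mutable state and need no locking.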

Step 204: Perform a data deduplication operation on the file to be stored according to the backup data block, and mark the backup data block with a status identifier. This step may be similar to step 102 above and is not repeated here.

Step 205: Perform reclamation processing on the backup data block according to the status identifier of the backup data block. This step may be similar to step 103 above and is not repeated here.

This embodiment of the present invention provides a data storage method. The file to be stored is divided into data blocks and the fingerprint of each data block is calculated; the fingerprints are sampled, and a fingerprint sampling table of the file to be stored is generated from the extracted fingerprints; the similar group to which the file belongs in the group sampling library is determined from the fingerprint sampling table and the group sampling library, and the stored data blocks corresponding to that similar group serve as the backup data blocks. This embodiment applies a further sampling step to both the data blocks of the file to be stored and the fingerprint library: similar groups are first determined through similarity analysis, and data deduplication is then performed only within the fingerprint group corresponding to the similar group. This narrows the lookup work required for deduplication, addresses the heavy computation and resource consumption that massive numbers of data blocks cause during deduplication in the prior art, and improves deduplication performance.

FIG. 3 is a flowchart of Embodiment 3 of the data storage method of the present invention. As shown in FIG. 3, this embodiment provides a data storage method that can be executed by any device that performs data storage operations, and that may specifically include the following steps:

Step 301: Match the fingerprint of each data block of the file to be stored against the fingerprints in the fingerprint library to obtain the corresponding backup data block.

Step 301 in the embodiment of FIG. 3 may be similar to step 101 in the embodiment of FIG. 1, or may use the method of obtaining the corresponding backup data block shown in the embodiment of FIG. 2; details are not repeated here.

Step 302: Perform a data deduplication operation on the file to be stored according to the backup data block, and mark the backup data block with a status identifier.

Step 302 in the embodiment of FIG. 3 may be similar to step 102 in the embodiment of FIG. 1; details are not repeated here.

Step 303: When a change in the value of the reference count of a backup data block is detected, identify the status identifier of the corresponding backup data block.

Within a preset time, the reference counts of the backup data blocks are monitored. When a stored file is modified or deleted, the references to the backup data blocks at the modified or deleted positions change. When the reference count of a backup data block changes, the status identifier of that backup data block is identified.

Step 304: When the status identifier of the corresponding backup data block indicates that the backup data block is not in use, identify the value of the reference count of the backup data block.

When the status identifier of the corresponding backup data block indicates that the block is not in use, that is, no data deduplication operation is being performed on a file to be stored according to that backup data block, the value of the block's reference count is identified.

Step 305: When the value of the reference count of the backup data block is identified as zero, trigger reclamation processing of the backup data block.

When the reference count of the backup data block is zero, the block is garbage and can be reclaimed. Reference counting is the only garbage collection method that does not use a root set; it uses a reference counter to distinguish live objects from objects that are no longer used. In general, each time an object, here a backup data block, is discarded or no longer used, its reference counter is decremented by 1; once the counter reaches 0, the backup data block meets the condition for garbage collection. Those skilled in the art will understand that steps 303 to 305 of the embodiment shown in FIG. 3 can also be applied to the reclamation of all stored data blocks: when a change in the reference count of a stored data block is detected, the reclamation task is executed, and when the reference count of that data block is 0, the data block is reclaimed.
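Steps 303 to 305 can be sketched in Python as an event-driven check that runs only for blocks whose reference count has just changed. The names `Block`, `BlockStore`, `in_use`, and `change_ref` are hypothetical, and the boolean `in_use` field is an assumed stand-in for the status identifier.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Block:
    ref_count: int = 0
    in_use: bool = False  # assumed status identifier: True while deduplication relies on the block


class BlockStore:
    def __init__(self) -> None:
        self.blocks: Dict[str, Block] = {}
        self.reclaimed: List[str] = []

    def add(self, fingerprint: str, ref_count: int = 1) -> None:
        self.blocks[fingerprint] = Block(ref_count=ref_count)

    def change_ref(self, fingerprint: str, delta: int) -> None:
        """Adjust a reference count, then run the reclamation check (step 303)."""
        self.blocks[fingerprint].ref_count += delta
        self._on_ref_change(fingerprint)

    def _on_ref_change(self, fingerprint: str) -> None:
        block = self.blocks[fingerprint]
        if block.in_use:
            # Step 304 not satisfied: the status identifier says the block is in use.
            return
        if block.ref_count == 0:
            # Step 305: an unused block with zero references is reclaimed.
            del self.blocks[fingerprint]
            self.reclaimed.append(fingerprint)
```

Only blocks touched by `change_ref` are examined, which mirrors the idea of scanning just the blocks whose reference counts changed rather than the whole store.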

This embodiment of the present invention provides a data storage method that monitors changes in the reference counts of backup data blocks. When the status identifier of a backup data block whose reference count has changed indicates that the block is not in use, the value of its reference count is identified; when that value is zero, reclamation processing of the backup data block is triggered. Because this embodiment scans for reclamation only those backup data blocks whose reference counts have changed, reclamation is faster and the user's storage space is freed sooner.

FIG. 4 is a schematic diagram of Embodiment 1 of the data storage logical architecture of the present invention. The data storage logical architecture provided in this embodiment can carry out the embodiments of the data storage method described above. As shown in FIG. 4, the architecture includes a cluster management module 40, a deduplication engine module 41, a metadata server 42, a single instance library 43, and a forwarding module 44.

The cluster management module 40 manages the reclamation list and the deduplication list.

The deduplication engine module 41 processes and manages tasks such as data deduplication, space reclamation, and reference counting of data blocks. Accordingly, the deduplication engine module 41 includes a task processing module 411, a task management module 412, and a distributor 413. The task processing module 411 includes a deduplication module 4111 for executing data deduplication tasks, a reference counting module 4112 for executing reference counting tasks on data blocks, and a space reclamation module 4113 for executing space reclamation tasks. The task management module 412 manages the thread pool, including monitoring the task queue and the running state of the threads in the pool. The distributor 413 maintains the data blocks managed by each metadata server 42.

The metadata server 42 determines the similar group to which the file to be stored belongs in the group sampling library.

The single instance library 43 stores the sampling groups of the group sampling library and the sampling group intervals within each sampling group.

The forwarding module 44 is responsible for transferring data among the cluster management module 40, the deduplication engine module 41, and the metadata server 42.

In a specific embodiment, when a data deduplication task is executed, the deduplication engine module 41 divides the file to be stored into blocks, calculates the fingerprint of each block, samples the fingerprints, generates the fingerprint sampling table of the file to be stored from the extracted fingerprints, and sends a grouping request message to the metadata server 42. The metadata server 42 determines the similar group corresponding to the fingerprint sampling table and the backup data blocks corresponding to that similar group. Before performing data deduplication, the deduplication engine module 41 sends the cluster management module 40 a deduplication request message carrying the status identifier of the backup data block. The cluster management module 40 checks whether that status identifier is present in the reclamation list; if it is, the cluster management module 40 has the deduplication engine module 41 cancel the reclamation of that backup data block and proceed with the data deduplication task.

In a specific embodiment, when a reclamation task is executed, the deduplication engine module 41 collects the backup data blocks whose reference counts have changed and reclaims them. Before reclaiming a block, it sends the cluster management module 40 a reclamation request message carrying the status identifier of the backup data block. The cluster management module 40 checks whether that status identifier is present in the deduplication list; if it is, the cluster management module 40 has the deduplication engine module 41 cancel the reclamation of that backup data block.

In the data storage logical architecture provided in this embodiment, data deduplication tasks take priority over space reclamation tasks during data storage. This resolves the data loss that can result when deduplication and reclamation run concurrently, and ensures that deduplication and reclamation proceed in an orderly way and that stored data remains safe.
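The arbitration between the two lists can be illustrated with the following Python sketch. It is a simplified single-process model under stated assumptions: the class `ClusterManager`, its method names, and the use of a lock around two in-memory sets are hypothetical, and block identifiers stand in for the status identifiers carried in the request messages.

```python
import threading
from typing import Set


class ClusterManager:
    """Tracks which blocks are being deduplicated and which are being reclaimed."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self.dedup_list: Set[str] = set()
        self.reclaim_list: Set[str] = set()

    def request_dedup(self, block_id: str) -> bool:
        """Register a deduplication task; it always proceeds.

        Returns True if a pending reclamation of the same block was cancelled.
        """
        with self._lock:
            cancelled = block_id in self.reclaim_list
            self.reclaim_list.discard(block_id)
            self.dedup_list.add(block_id)
            return cancelled

    def finish_dedup(self, block_id: str) -> None:
        with self._lock:
            self.dedup_list.discard(block_id)

    def request_reclaim(self, block_id: str) -> bool:
        """Return True if reclamation may proceed, False if it must be cancelled."""
        with self._lock:
            if block_id in self.dedup_list:
                return False  # deduplication has priority over reclamation
            self.reclaim_list.add(block_id)
            return True

    def finish_reclaim(self, block_id: str) -> None:
        with self._lock:
            self.reclaim_list.discard(block_id)
```

In both directions the deduplication task wins: a reclamation request for a block on the deduplication list is refused, and a deduplication request removes the block from the reclamation list.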

FIG. 5 is a schematic diagram of Embodiment 1 of the data storage cluster architecture of the present invention. The data storage cluster architecture provided in this embodiment can be realized through the data storage method embodiments shown in FIG. 1 to FIG. 3 and the data storage logical architecture embodiment shown in FIG. 4. As shown in FIG. 5, the architecture includes a slave node 501, a master node 502, and a standby node 503.

The slave node 501, the master node 502, and the standby node 503 share data storage. Each of the three includes a cluster management module, a deduplication engine module, and a metadata server, and each can carry out the data deduplication and space reclamation processes of the data storage method described above. The master node 502 may specifically be a host in a local area network, and the slave node 501 may specifically be an extension machine in the local area network. Those skilled in the art will understand that in practice there may be multiple slave nodes 501. The master node 502 is mainly responsible for issuing commands to the slave node 501 to start data deduplication or space reclamation, so that the slave node 501 executes the corresponding task. When the slave node 501 finishes a task, it may report the result to the master node 502. When the slave node 501 or the master node 502 fails and cannot work normally, the standby node 503 can take over from the failed node and continue working, so that the data storage process can continue.
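The master's dispatch of tasks, with the standby node standing in for a failed slave, might be sketched as below. This is a toy model: the `Node` class, its `healthy` flag, and the `dispatch` function are hypothetical, and real failure detection, shared storage, and master failover are outside its scope.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Node:
    name: str
    healthy: bool = True

    def run(self, task: str) -> str:
        """Execute a task and return the result reported to the master."""
        if not self.healthy:
            raise RuntimeError(f"{self.name} is down")
        return f"{self.name} finished {task}"


def dispatch(task: str, slaves: List[Node], standby: Node) -> List[str]:
    """Master-side dispatch: each slave runs the task; the standby replaces a failed slave."""
    results = []
    for node in slaves:
        worker = node if node.healthy else standby
        results.append(worker.run(task))
    return results
```

Because all nodes are assumed to share the same data storage, the standby can pick up a task without any data transfer from the failed node.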

In the data storage cluster architecture provided in this embodiment, the slave nodes, the master node, and the standby node can all execute the data storage method, and the master node can direct multiple slave nodes to perform data storage at the same time, which improves storage efficiency. When a slave node or the master node fails, the standby node can take over from the failed node and continue working, which avoids interrupting the data storage process and keeps it continuous.

FIG. 6 is a structural diagram of Embodiment 1 of the data storage device of the present invention. As shown in FIG. 6, the data storage device provided in this embodiment includes a backup data block acquisition module 61, a data deduplication module 62, and a reclamation module 63. The backup data block acquisition module 61 is configured to match the fingerprint of each data block of the file to be stored against the fingerprints in the fingerprint library to obtain the corresponding backup data block. The data deduplication module 62 is configured to perform a data deduplication operation on the file to be stored according to the backup data block and to mark the backup data block with a status identifier. The reclamation module 63 is configured to perform reclamation processing on the backup data block according to its status identifier.

The data storage device of this embodiment can be used to carry out the technical solutions of the method embodiments above. Its implementation principle and technical effect are similar and are not repeated here.

FIG. 7 is a structural diagram of Embodiment 2 of the data storage device of the present invention. As shown in FIG. 7, on the basis of the embodiment shown in FIG. 6, the data deduplication module 62 in this embodiment includes a first counting unit 621 and a second counting unit 622.

The first counting unit 621 is configured to increment the group count of the backup data block by one before the data deduplication operation is performed on the file to be stored according to the backup data block. The second counting unit 622 is configured to decrement the group count of the backup data block by one after the data deduplication operation on the file to be stored according to the backup data block has completed.

On the basis of the embodiment shown in FIG. 6, the reclamation module 63 includes a reclamation suspension unit 631 and a first reclamation trigger unit 632.

The reclamation suspension unit 631 is configured to suspend reclamation processing of the backup data block when the group count in the status identifier of the backup data block is identified as non-zero. The first reclamation trigger unit 632 is configured to trigger reclamation processing of the backup data block when the group count in the status identifier of the backup data block is identified as zero.
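The interplay of the two counting units and the two reclamation units can be sketched in Python as follows. The names `BackupBlock`, `dedup_in_progress`, and `try_reclaim` are hypothetical, and using a context manager to pair the increment with the decrement is a design choice of this sketch rather than something the text prescribes.

```python
from contextlib import contextmanager
from dataclasses import dataclass
from typing import Iterator


@dataclass
class BackupBlock:
    group_count: int = 0  # part of the status identifier: deduplication tasks currently using the block


@contextmanager
def dedup_in_progress(block: BackupBlock) -> Iterator[BackupBlock]:
    """Increment the group count before deduplication and decrement it afterwards."""
    block.group_count += 1      # role of the first counting unit
    try:
        yield block
    finally:
        block.group_count -= 1  # role of the second counting unit


def try_reclaim(block: BackupBlock) -> str:
    """Suspend reclamation while the group count is non-zero, otherwise trigger it."""
    if block.group_count != 0:
        return "suspended"      # role of the reclamation suspension unit
    return "triggered"          # role of the first reclamation trigger unit
```

The `finally` clause guarantees the decrement even if deduplication raises an exception, so a failed task cannot leave a block permanently protected from reclamation.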

FIG. 8 is a structural diagram of Embodiment 3 of the data storage device of the present invention. As shown in FIG. 8, on the basis of the embodiment shown in FIG. 6, the backup data block acquisition module 61 in this embodiment includes a fingerprint calculation unit 611, a fingerprint sampling unit 612, and a group determination unit 613.

The fingerprint calculation unit 611 is configured to divide the file to be stored into data blocks and calculate the fingerprint of each data block. The fingerprint sampling unit 612 is configured to sample the fingerprints of the data blocks and generate the fingerprint sampling table of the file to be stored from the extracted fingerprints. The group determination unit 613 is configured to determine, according to the fingerprint sampling table and the group sampling library, the similar group to which the file to be stored belongs in the group sampling library, and to use the stored data blocks corresponding to the similar group as the backup data blocks. The group sampling library is obtained by sampling the fingerprint library, and the similar group is the sampling group in the group sampling library that matches the sampled fingerprints in the fingerprint sampling table of the file to be stored.
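The work of the fingerprint calculation unit and the fingerprint sampling unit can be illustrated with the short Python sketch below. Several details are assumptions made for illustration: fixed-size chunking, SHA-1 as the fingerprint function, and a sampling condition of "fingerprint value divisible by a modulus" are not specified in this passage, and the function names are hypothetical.

```python
import hashlib
from typing import List


def chunk(data: bytes, size: int = 4096) -> List[bytes]:
    """Split a file's contents into fixed-size data blocks (assumed chunking scheme)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def fingerprint(block: bytes) -> str:
    """Fingerprint a data block; SHA-1 is an assumed choice of hash."""
    return hashlib.sha1(block).hexdigest()


def sample(fingerprints: List[str], modulus: int = 4) -> List[str]:
    """Build a fingerprint sampling table.

    Assumed sampling condition: keep a fingerprint when its integer
    value is divisible by `modulus`. The result may be empty.
    """
    return [fp for fp in fingerprints if int(fp, 16) % modulus == 0]


def sampling_table(data: bytes, size: int = 4096, modulus: int = 4) -> List[str]:
    """Chunk, fingerprint, and sample a file in one pass."""
    return sample([fingerprint(b) for b in chunk(data, size)], modulus)
```

Applying the same `sample` function to the fingerprint library and to incoming files keeps the two sampling procedures consistent, so that matching sampled fingerprints is meaningful.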

On the basis of the embodiment shown in FIG. 6, the reclamation module 63 includes:

a reference count monitoring unit 633, configured to identify the status identifier of the corresponding backup data block when a change in the value of the reference count of the backup data block is detected; a reference count identification unit 634, configured to identify the value of the reference count of the backup data block when the status identifier of the corresponding backup data block indicates that the backup data block is not in use; and a second reclamation trigger unit 635, configured to trigger reclamation processing of the backup data block when the value of the reference count of the backup data block is identified as zero.

The data storage device of this embodiment can be used to carry out the technical solutions of the method embodiments above. Its implementation principle and technical effect are similar and are not repeated here.

Those of ordinary skill in the art will understand that all or some of the steps of the method embodiments above may be carried out by hardware under the control of program instructions. The program may be stored in a computer-readable storage medium. When executed, the program performs the steps of the method embodiments above. The storage medium includes any medium that can store program code, such as a ROM, a RAM, a magnetic disk, or an optical disc.

Finally, it should be noted that the embodiments above are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they may still modify the technical solutions described in those embodiments, or make equivalent replacements for some or all of their technical features, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present invention.