






Technical Field
Embodiments of the present invention relate to data processing technologies, and in particular to a data storage method and device.
Background
As the volume of enterprise data keeps growing, large amounts of duplicate data pose a severe challenge to storage. Data deduplication (data de-duplication, De-Dupe for short) is an important technology that lowers data storage cost by effectively reducing the amount of data, and it is receiving increasing attention.
In a data storage task, a file to be stored is usually divided into data blocks. Data deduplication is a storage technology that automatically searches for duplicate data blocks, keeps only one unique copy of identical data blocks, replaces the other duplicate copies with pointers to the unique copy, and increases the reference count of that copy by 1, thereby eliminating redundant data and reducing storage capacity requirements. When the unique copy retained after deduplication is modified or deleted, its reference count changes. When the reference count of the copy drops to 0, the copy meets the condition for garbage collection and is reclaimed as garbage, which frees more storage space.
However, in the prior art, when deduplication and reclamation are executed concurrently, a pointer handed to a duplicate copy may point to data that has just been reclaimed, resulting in data loss.
Summary of the Invention
Embodiments of the present invention provide a data processing method and device to optimize the concurrent execution of deduplication and reclamation.
In a first aspect, an embodiment of the present invention provides a data storage method, including:
matching the fingerprints of the data blocks of a file to be stored against the fingerprints in a fingerprint library to obtain a corresponding backup data block;
performing a deduplication operation on the file to be stored according to the backup data block, and setting a status identifier for the backup data block; and
performing reclamation processing on the backup data block according to the status identifier of the backup data block.
In a first possible implementation, according to the first aspect, matching the fingerprints of the data blocks of the file to be stored against the fingerprints in the fingerprint library to obtain the corresponding backup data block includes:
dividing the file to be stored into data blocks and calculating the fingerprint of each data block;
sampling the fingerprints of the data blocks, and generating a fingerprint sampling table of the file to be stored from the sampled fingerprints; and
determining, according to the fingerprint sampling table and a group sampling library, the similar group to which the file to be stored belongs in the group sampling library, and using the stored data blocks corresponding to the similar group as the backup data block, where the group sampling library is obtained by sampling the fingerprint library, and the similar group is a sampled group in the group sampling library that matches the sampled fingerprints in the fingerprint sampling table of the file to be stored.
In a second possible implementation, according to the first aspect, performing the deduplication operation on the file to be stored according to the backup data block and setting the status identifier for the backup data block includes:
incrementing the group count of the backup data block by one before the deduplication operation is performed on the file to be stored according to the backup data block; and
decrementing the group count of the backup data block by one after the deduplication operation on the file to be stored according to the backup data block is complete.
In a third possible implementation, according to the second possible implementation of the first aspect, performing reclamation processing on the backup data block according to the status identifier of the backup data block includes:
suspending reclamation processing of the backup data block when the group count in the status identifier of the backup data block is identified as non-zero; and
triggering reclamation processing of the backup data block when the group count in the status identifier of the backup data block is identified as zero.
In a fourth possible implementation, according to the first aspect, the first possible implementation of the first aspect, or the second possible implementation of the first aspect, performing reclamation processing on the backup data block according to the status identifier of the backup data block includes:
identifying the status identifier of the corresponding backup data block when a change in the value of the reference count of the backup data block is detected;
identifying the value of the reference count of the backup data block when the status identifier of the corresponding backup data block indicates that the backup data block is not in use; and
triggering reclamation processing of the backup data block when the value of the reference count of the backup data block is identified as zero.
In a second aspect, an embodiment of the present invention provides a data storage device, including:
a backup data block acquisition module, configured to match the fingerprints of the data blocks of a file to be stored against the fingerprints in a fingerprint library to obtain a corresponding backup data block;
a deduplication module, configured to perform a deduplication operation on the file to be stored according to the backup data block, and to set a status identifier for the backup data block; and
a reclamation module, configured to perform reclamation processing on the backup data block according to the status identifier of the backup data block.
In a first possible implementation, according to the second aspect, the backup data block acquisition module includes:
a fingerprint calculation unit, configured to divide the file to be stored into data blocks and calculate the fingerprint of each data block;
a fingerprint sampling unit, configured to sample the fingerprints of the data blocks, and to generate a fingerprint sampling table of the file to be stored from the sampled fingerprints; and
a group determination unit, configured to determine, according to the fingerprint sampling table and a group sampling library, the similar group to which the file to be stored belongs in the group sampling library, and to use the stored data blocks corresponding to the similar group as the backup data block, where the group sampling library is obtained by sampling the fingerprint library, and the similar group is a sampled group in the group sampling library that matches the sampled fingerprints in the fingerprint sampling table of the file to be stored.
In a second possible implementation, according to the second aspect, the deduplication module includes:
a first counting unit, configured to increment the group count of the backup data block by one before the deduplication operation is performed on the file to be stored according to the backup data block; and
a second counting unit, configured to decrement the group count of the backup data block by one after the deduplication operation on the file to be stored according to the backup data block is complete.
In a third possible implementation, according to the second possible implementation of the second aspect, the reclamation module includes:
a reclamation suspension unit, configured to suspend reclamation processing of the backup data block when the group count in the status identifier of the backup data block is identified as non-zero; and
a first reclamation triggering unit, configured to trigger reclamation processing of the backup data block when the group count in the status identifier of the backup data block is identified as zero.
In a fourth possible implementation, according to the second aspect, the first possible implementation of the second aspect, or the second possible implementation of the second aspect, the reclamation module includes:
a reference count monitoring unit, configured to identify the status identifier of the corresponding backup data block when a change in the value of the reference count of the backup data block is detected;
a reference count identification unit, configured to identify the value of the reference count of the backup data block when the status identifier of the corresponding backup data block indicates that the backup data block is not in use; and
a second reclamation triggering unit, configured to trigger reclamation processing of the backup data block when the value of the reference count of the backup data block is identified as zero.
Embodiments of the present invention provide a data storage method and device. The method matches the fingerprints of the data blocks of a file to be stored against the fingerprints in a fingerprint library to obtain a corresponding backup data block, performs a deduplication operation on the file to be stored according to the backup data block while setting a status identifier for the backup data block, and performs reclamation processing on the backup data block according to that status identifier. Deduplication is thereby given priority, which solves the problem of data loss caused by concurrent execution of deduplication and reclamation, and ensures both the orderly execution of deduplication and reclamation and the safety of stored data.
Brief Description of the Drawings
To describe the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings required for describing the embodiments or the prior art are briefly introduced below. Clearly, the drawings described below show only some embodiments of the present invention, and a person of ordinary skill in the art may derive other drawings from them without creative effort.
FIG. 1 is a flowchart of Embodiment 1 of the data storage method of the present invention;
FIG. 2 is a flowchart of Embodiment 2 of the data storage method of the present invention;
FIG. 3 is a flowchart of Embodiment 3 of the data storage method of the present invention;
FIG. 4 is a schematic diagram of Embodiment 1 of the data storage logical architecture of the present invention;
FIG. 5 is a schematic diagram of Embodiment 1 of the data storage cluster architecture of the present invention;
FIG. 6 is a structural diagram of Embodiment 1 of the data storage apparatus of the present invention;
FIG. 7 is a structural diagram of Embodiment 2 of the data storage apparatus of the present invention;
FIG. 8 is a structural diagram of Embodiment 3 of the data storage apparatus of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments of the present invention. Clearly, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
FIG. 1 is a flowchart of Embodiment 1 of the data storage method of the present invention. As shown in FIG. 1, this embodiment provides a data storage method, which may be executed by any device that performs data storage operations and may specifically include the following steps:
Step 101: Match the fingerprints of the data blocks of a file to be stored against the fingerprints in a fingerprint library to obtain a corresponding backup data block.
In this embodiment, the same data storage method is applied to every file that is stored, and a file is referred to as a file to be stored until it has been stored. The fingerprints in the fingerprint library are the fingerprints of the data blocks of files that have already been stored. The fingerprints of the data blocks of the file to be stored are matched one by one against the fingerprints of the data blocks of the stored files in the fingerprint library, and the backup data block to which the file to be stored corresponds in the fingerprint library is determined from the similarity between the two sets of fingerprints. Specifically, when the similarity between the fingerprints of the data blocks of the file to be stored and the fingerprints of the data blocks of a stored file is greater than or equal to a preset similarity threshold, the data blocks of that stored file are regarded as the backup data block corresponding to the data blocks of the file to be stored. The similarity may be the proportion of the fingerprints of the data blocks of the file to be stored that are identical or similar to fingerprints of the data blocks of the stored file.
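The threshold test in step 101 can be illustrated with a minimal sketch. It assumes that fingerprints are plain strings, that only identical fingerprints count as matches (the "similar fingerprint" case is not modelled), and that the fingerprint library is a dictionary from a candidate identifier to its set of fingerprints. The names `similarity`, `find_backup_block`, and the default threshold of 0.5 are hypothetical and are not taken from the embodiment.

```python
def similarity(new_fps, stored_fps):
    """Share of the new file's fingerprints that also occur in stored_fps."""
    new_fps = list(new_fps)
    if not new_fps:
        return 0.0
    stored = set(stored_fps)
    return sum(fp in stored for fp in new_fps) / len(new_fps)


def find_backup_block(new_fps, fingerprint_library, threshold=0.5):
    """Return the id of the best candidate at or above threshold, else None."""
    best_id, best_score = None, threshold
    for block_id, stored_fps in fingerprint_library.items():
        score = similarity(new_fps, stored_fps)
        if score >= best_score:
            best_id, best_score = block_id, score
    return best_id
```

For example, with a library `{"g1": {"a", "b", "x"}, "g2": {"a", "b", "c"}}`, a file whose fingerprints are `a, b, c, d` scores 0.5 against `g1` and 0.75 against `g2`, so `g2` would be selected under this sketch.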
Step 102: Perform a deduplication operation on the file to be stored according to the backup data block, and set a status identifier for the backup data block.
After the backup data block is determined, the file to be stored is deduplicated within that backup data block. The specific deletion method may be similar to that in the prior art: the calculated fingerprints of the blocks of the file to be stored are matched against the fingerprints saved in the backup data block. If the backup data block already holds a fingerprint identical or similar to that of a data block of the file to be stored, the data of that data block is deleted; if the backup data block holds no fingerprint identical or similar to that of the data block, the data of that data block is stored.
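The store-or-drop decision described above can be sketched as follows. This is an illustration under stated assumptions, not the embodiment's implementation: fingerprints are taken to be SHA-1 hex digests, only identical fingerprints are treated as duplicates, and the backup data block is modelled as a dictionary mapping a fingerprint to a `[data, reference_count]` pair. The function name `deduplicate` and the returned "recipe" are hypothetical.

```python
import hashlib


def deduplicate(chunks, backup_block):
    """Store each new chunk once; return the file as a list of fingerprints.

    backup_block maps fingerprint -> [data, reference_count] (assumed layout).
    """
    recipe = []
    for chunk in chunks:
        fp = hashlib.sha1(chunk).hexdigest()
        if fp in backup_block:
            backup_block[fp][1] += 1        # duplicate: keep only a reference
        else:
            backup_block[fp] = [chunk, 1]   # new data: store it
        recipe.append(fp)
    return recipe
```

The file can later be rebuilt by concatenating `backup_block[fp][0]` for each fingerprint in the returned recipe.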
In this step, when the deduplication operation is performed on the file to be stored according to the backup data block, a status identifier is also set for the backup data block. The status identifier indicates whether the backup data block is currently being used by a deduplication operation.
The status identifier may take various forms. Preferably, it includes the group number of the backup data block and the group count of the backup data block. The group count is the number of deduplication operations being performed against the backup data block, so it also covers the case where the backup data block is used by multiple deduplication operations running in parallel. Accordingly, the operation of deduplicating the file to be stored according to the backup data block and setting the status identifier is preferably as follows: before the deduplication operation is performed on the file to be stored according to the backup data block, the group count of the backup data block is incremented by one; after that deduplication operation is complete, the group count of the backup data block is decremented by one.
A person skilled in the art will understand that a non-zero group count means that a deduplication operation is in progress. This embodiment also maintains a deduplication list, which contains the status identifiers of the backup data blocks that are to undergo deduplication. When the group count of a backup data block is zero, the status identifier of that backup data block may be removed from the deduplication list.
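The increment-before, decrement-after discipline for the group count can be sketched with a context manager. The sketch assumes an in-memory deduplication list keyed by group number and protected by a single lock; the class name `DedupList` and its methods are hypothetical and only illustrate the bookkeeping.

```python
import threading
from contextlib import contextmanager


class DedupList:
    """Hypothetical deduplication list: group number -> group count."""

    def __init__(self):
        self._lock = threading.Lock()
        self._counts = {}

    @contextmanager
    def in_use(self, group_no):
        # Increment the group count before deduplication starts.
        with self._lock:
            self._counts[group_no] = self._counts.get(group_no, 0) + 1
        try:
            yield
        finally:
            # Decrement afterwards; drop the entry once the count is zero.
            with self._lock:
                remaining = self._counts[group_no] - 1
                if remaining == 0:
                    del self._counts[group_no]
                else:
                    self._counts[group_no] = remaining

    def group_count(self, group_no):
        with self._lock:
            return self._counts.get(group_no, 0)
```

A deduplication task would wrap its work in `with dedup_list.in_use(group_no): ...`, so the count is decremented even if the task raises an exception.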
Step 103: Perform reclamation processing on the backup data block according to the status identifier of the backup data block.
Before a backup data block is reclaimed, the status identifier of the backup data block must first be checked to decide whether reclamation may proceed. In this embodiment, whether the backup data block is in use can be determined by querying its status identifier in the deduplication list, which in turn decides whether it is reclaimed. When the group count in the status identifier of the backup data block is zero, the backup data block is not undergoing deduplication and can therefore be reclaimed. When the group count in the status identifier of the backup data block is non-zero, the backup data block is undergoing deduplication, so reclamation of the backup data block is suspended. This embodiment may also maintain a reclamation list, which contains the status identifiers of the backup data blocks that are to be reclaimed. After a backup data block has been reclaimed, its status identifier is removed from the reclamation list.
A person skilled in the art will understand that the reclamation list may also be queried when deduplication is performed on the backup data block. If the reclamation list contains the status identifier of the backup data block, reclamation of the backup data block may likewise be suspended, which ensures that deduplication takes priority.
A person skilled in the art will understand that when the group count of the backup data block is zero, deduplication and reclamation can not only run at the same time but also run without affecting each other, which safeguards the concurrent execution of deduplication and reclamation to the greatest extent.
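The interplay between the deduplication list and the reclamation list in step 103 can be sketched as below. The sketch assumes a single in-process store in which the group-count check and the reclamation happen under the same lock, so that a block cannot start being deduplicated between the check and the deletion. The class `BlockStore` and all of its method names are hypothetical.

```python
import threading


class BlockStore:
    """Hypothetical store with a deduplication list and a reclamation list."""

    def __init__(self, blocks):
        self._lock = threading.Lock()
        self.blocks = dict(blocks)   # group number -> stored data
        self.group_counts = {}       # deduplication list: group number -> count
        self.reclaim_list = set()    # group numbers waiting to be reclaimed

    def begin_dedup(self, group_no):
        with self._lock:
            self.group_counts[group_no] = self.group_counts.get(group_no, 0) + 1

    def end_dedup(self, group_no):
        with self._lock:
            self.group_counts[group_no] -= 1
            if self.group_counts[group_no] == 0:
                del self.group_counts[group_no]

    def request_reclaim(self, group_no):
        with self._lock:
            self.reclaim_list.add(group_no)

    def run_reclaim(self):
        """Reclaim every listed block whose group count is zero; skip the rest."""
        reclaimed = []
        with self._lock:
            for group_no in sorted(self.reclaim_list):
                if self.group_counts.get(group_no, 0) != 0:
                    continue  # in use by deduplication: reclamation is suspended
                self.blocks.pop(group_no, None)
                self.reclaim_list.discard(group_no)
                reclaimed.append(group_no)
        return reclaimed
```

A block that is skipped stays on the reclamation list, so a later call to `run_reclaim` picks it up once its group count has returned to zero.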
This embodiment of the present invention provides a data storage method. The method matches the fingerprints of the data blocks of a file to be stored against the fingerprints in a fingerprint library to obtain a corresponding backup data block, performs a deduplication operation on the file to be stored according to the backup data block while setting a status identifier for the backup data block, and performs reclamation processing on the backup data block according to that status identifier. Deduplication is thereby given priority, which solves the problem of data loss caused by concurrent execution of deduplication and reclamation, and ensures both the orderly execution of deduplication and reclamation and the safety of stored data.
FIG. 2 is a flowchart of Embodiment 2 of the data storage method of the present invention. As shown in FIG. 2, this embodiment provides a data storage method, which may be executed by any device that performs data storage operations and may specifically include the following steps:
Step 201: Divide the file to be stored into data blocks, and calculate the fingerprint of each data block.
In this step, the file to be stored is first divided into blocks. The division may use a chunking technique from the prior art, for example a variable-length chunking algorithm. The fingerprint of each resulting block is then calculated. The fingerprint calculation may also use a method from the prior art, for example a dual-hash scheme that uses the Secure Hash Algorithm together with Message Digest Algorithm 5 (MD5).
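Step 201 can be sketched as follows. The chunker below is a toy content-defined (variable-length) chunker that cuts a block whenever a running hash of the recent bytes matches a boundary pattern; it stands in for whichever variable-length algorithm is actually used. The fingerprint concatenates a SHA-1 digest and an MD5 digest; the choice of SHA-1 as the Secure Hash Algorithm variant, and all parameter values, are assumptions made for illustration.

```python
import hashlib


def chunk(data, mask=0x3F, min_size=16, max_size=256):
    """Split data into variable-length blocks at content-defined boundaries."""
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        h = ((h << 1) + byte) & 0xFFFFFFFF      # running hash of recent bytes
        size = i - start + 1
        if (size >= min_size and (h & mask) == 0) or size >= max_size:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])             # trailing partial block
    return chunks


def fingerprint(block):
    """Dual-hash fingerprint: SHA-1 hex digest followed by MD5 hex digest."""
    return hashlib.sha1(block).hexdigest() + hashlib.md5(block).hexdigest()
```

Using two independent hashes makes an accidental fingerprint collision between different blocks less likely than with either hash alone, at the cost of a longer fingerprint.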
Step 202: Sample the fingerprints of the data blocks, and generate a fingerprint sampling table of the file to be stored from the sampled fingerprints.
To reduce the amount of computation spent on duplicate detection during deduplication, the fingerprints of the blocks of the file to be stored are sampled once they have been obtained. The basic requirements are that every fingerprint in the sampling result belongs to the set of block fingerprints of the file to be stored, and that the sampling result contains no more fingerprints than the file has block fingerprints. The block fingerprints may be sampled in any of the following ways: directly taking the fingerprints whose last byte is 0; taking the fingerprints of blocks at fixed positions, for example the blocks at positions that are integer multiples of 9; or sampling at a predetermined ratio, for example randomly selecting 5% of the blocks. The block fingerprints are sampled and filtered in this way, and the fingerprint sampling table of the file to be stored is generated from the sampled fingerprints. A person skilled in the art will understand that in this embodiment it may also happen that no fingerprint satisfies the sampling condition, that is, the file to be stored contains no block that satisfies the sampling condition, in which case the resulting fingerprint sampling table is empty.
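The three sampling strategies mentioned in step 202 can be sketched as below. The sketch assumes that fingerprints are `bytes` digests and that block positions are counted from 1; the function names and the `seed` parameter are hypothetical.

```python
import random


def sample_last_byte_zero(fingerprints):
    """Keep the fingerprints whose last byte is 0."""
    return [fp for fp in fingerprints if fp[-1] == 0]


def sample_every_ninth(fingerprints):
    """Keep the fingerprints at positions 9, 18, 27, ... (1-based)."""
    return [fp for pos, fp in enumerate(fingerprints, start=1) if pos % 9 == 0]


def sample_ratio(fingerprints, ratio=0.05, seed=None):
    """Keep a random subset whose size is ratio * total, rounded down."""
    rng = random.Random(seed)
    count = int(len(fingerprints) * ratio)
    return rng.sample(list(fingerprints), count)
```

Each strategy returns a subset of its input, so the result never contains more fingerprints than the file has blocks, and any of them may return an empty list, which corresponds to the empty fingerprint sampling table described above.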
Step 203: According to the fingerprint sampling table and the group sampling library, determine the similar group to which the file to be stored belongs in the group sampling library, and use the stored data blocks corresponding to the similar group as the backup data block.
After the fingerprint sampling table of the file to be stored is obtained, the similar group to which the file to be stored belongs in the group sampling library is determined according to the fingerprint sampling table and the group sampling library, and the stored data blocks corresponding to the similar group are used as the backup data block. The group sampling library is obtained by sampling the fingerprint library, and the similar group is a sampled group in the group sampling library that matches the sampled fingerprints in the fingerprint sampling table of the file to be stored.
In particular, the fingerprint library holds all fingerprints of the stored files after deduplication. If the file to be stored that is processed in this step is the first file, the fingerprint library is empty. In that case, if the fingerprint sampling table is not empty, a new group is created in the group sampling library, the new group is determined to be the similar group to which the file to be stored belongs, and the fingerprints in the fingerprint sampling table of the file to be stored are saved into the new group. When neither the fingerprint sampling table nor the fingerprint library is empty, the fingerprints in the fingerprint library are sampled to obtain the group sampling library. The sampling method is similar to the method used in step 202 to sample the data blocks of the file to be stored and is not described again here. A person skilled in the art will understand that the method used to sample the fingerprints in the fingerprint library should be consistent with the method used to sample the data blocks of the file to be stored, so that similar groups with a high degree of similarity can be obtained.
The fingerprints in the fingerprint sampling table are matched one by one against the sampled groups in the current group sampling library, and the similar group to which the file to be stored belongs is determined in the current group sampling library from the matching result. Specifically, when the similarity between the fingerprints in the fingerprint sampling table and the fingerprints of a sampled group in the current group sampling library is greater than or equal to a preset similarity threshold, the file to be stored is regarded as belonging to that sampled group, that sampled group is the similar group, and the stored data blocks corresponding to the fingerprints in the similar group are used as the backup data block. When the similarity between the fingerprints in the fingerprint sampling table and the fingerprints of every group in the current group sampling library is below the preset similarity threshold, a new group is created in the group sampling library, the new group is determined to be the similar group to which the file to be stored belongs, and the fingerprints in the fingerprint sampling table of the file to be stored are saved into the new group.
When no sampling result in step 202 satisfies the sampling condition, that is, the file to be stored contains no block that satisfies the sampling condition, the similar group to which the file to be stored belongs in the current group sampling library is determined to be a preset group in the current group sampling library, and the similarity analysis of this embodiment ends. The file to be stored is then deduplicated within the fingerprint group of the fingerprint library that corresponds to the preset group. The preset group is a group set in advance in this embodiment and has no particular meaning. The preset group may be empty, and it corresponds to a specific fingerprint group in the fingerprint library; that fingerprint group holds the fingerprints of the files to be stored whose fingerprint sampling tables were empty after sampling. In actual sampling, an empty fingerprint sampling table is a special case, and the handling of this special case is described here only so that its occurrence does not interrupt the overall flow.
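The group assignment of step 203, including the new-group and preset-group cases, can be sketched as follows. The sketch assumes that the group sampling library is a dictionary from an integer group number to a set of sampled fingerprints, that similarity is the share of the file's sampled fingerprints found in a group (identical matches only), and that the preset group is numbered 0. The names `assign_group` and `PRESET_GROUP` and the default threshold are hypothetical.

```python
PRESET_GROUP = 0  # assumed number of the preset group for empty sampling tables


def assign_group(sampling_table, group_sampling_library, threshold=0.5):
    """Return the group number of the similar group for a file to be stored."""
    sample = set(sampling_table)
    if not sample:
        return PRESET_GROUP                      # empty table: use preset group

    best_group, best_score = None, threshold
    for group_no, group_fps in group_sampling_library.items():
        score = len(sample & group_fps) / len(sample)
        if score >= best_score:
            best_group, best_score = group_no, score

    if best_group is None:
        # No group is similar enough: create a new group from the sample.
        best_group = max(group_sampling_library, default=PRESET_GROUP) + 1
        group_sampling_library[best_group] = set(sample)
    return best_group
```

With an empty library, the first file with a non-empty sampling table therefore creates group 1, which matches the first-file case described in step 203.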
Further, when deduplication is performed on the file to be stored according to the backup data blocks, the fingerprints of the similar group may be divided into multiple intervals, with one database established per interval to store the fingerprints in that interval. Duplicate data blocks can then be looked up per interval, and with multiple threads or multiple nodes all intervals can be queried concurrently, which improves concurrency and speeds up lookup.
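A minimal sketch of the interval idea follows, assuming hexadecimal fingerprints that are partitioned by their leading byte and held in in-memory sets in place of the per-interval databases. The partitioning rule and the names `interval_of`, `build_intervals` and `find_duplicates` are illustrative assumptions, not details given in the embodiment.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Iterable, List, Set


def interval_of(fingerprint: str, num_intervals: int) -> int:
    """Map a hex fingerprint to an interval index using its leading byte."""
    return int(fingerprint[:2], 16) * num_intervals // 256


def build_intervals(fingerprints: Iterable[str], num_intervals: int = 4) -> List[Set[str]]:
    """One store per interval; each set stands in for a per-interval database."""
    stores: List[Set[str]] = [set() for _ in range(num_intervals)]
    for fp in fingerprints:
        stores[interval_of(fp, num_intervals)].add(fp)
    return stores


def find_duplicates(queries: Iterable[str], stores: List[Set[str]]) -> Set[str]:
    """Look up the query fingerprints, one worker thread per interval."""
    n = len(stores)
    buckets: List[List[str]] = [[] for _ in range(n)]
    for fp in queries:
        buckets[interval_of(fp, n)].append(fp)

    def lookup(i: int) -> List[str]:
        return [fp for fp in buckets[i] if fp in stores[i]]

    with ThreadPoolExecutor(max_workers=n) as pool:
        results = list(pool.map(lookup, range(n)))
    return {fp for hits in results for fp in hits}
```

Because each query touches exactly one interval, the workers never contend for the same store, which is why the intervals can be queried independently.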
Step 204: Perform a deduplication operation on the file to be stored according to the backup data blocks, and mark the backup data blocks with a state identifier. This step may be similar to step 102 above and is not described again here.
Step 205: Reclaim the backup data blocks according to their state identifiers. This step may be similar to step 103 above and is not described again here.
This embodiment of the present invention provides a data storage method. The file to be stored is divided into data blocks, a fingerprint is computed for each data block, the fingerprints are sampled, and a fingerprint sampling table of the file is generated from the sampled fingerprints. According to the fingerprint sampling table and the group sampling library, the similar group to which the file belongs in the group sampling library is determined, and the stored data blocks corresponding to that group serve as the backup data blocks. By further sampling both the data blocks of the file to be stored and the fingerprint library, this embodiment first determines the similar group through similarity analysis and then performs deduplication only within the fingerprint group corresponding to the similar group. This narrows the lookup work for deduplication, addresses the heavy computation and resource consumption that massive numbers of data blocks cause during deduplication in the prior art, and improves deduplication performance.
FIG. 3 is a flowchart of Embodiment 3 of the data storage method of the present invention. As shown in FIG. 3, this embodiment provides a data storage method that may be performed by any device that performs data storage operations, and that may specifically include the following steps:
Step 301: Match the fingerprint of each data block of the file to be stored against the fingerprints in the fingerprint library to obtain the corresponding backup data blocks.
Step 301 in the embodiment of FIG. 3 may be similar to step 101 in the embodiment of FIG. 1, or may use the method of obtaining the corresponding backup data blocks shown in the embodiment of FIG. 2. Details are not repeated here.
Step 302: Perform a deduplication operation on the file to be stored according to the backup data blocks, and mark the backup data blocks with a state identifier.
Step 302 in the embodiment of FIG. 3 may be similar to step 102 in the embodiment of FIG. 1. Details are not repeated here.
Step 303: When a change in the value of the reference count of a backup data block is detected, identify the state identifier of that backup data block.
Within a preset time, the reference counts of the backup data blocks are monitored. When a stored file is modified or deleted, the references to the backup data blocks at the modified or deleted positions change. When the reference count of a backup data block changes, the state identifier of that backup data block is identified.
Step 304: When the state identifier indicates that the backup data block is not in use, identify the value of the reference count of the backup data block.
When the state identifier of the backup data block indicates that the block is not in use, that is, no deduplication operation on a file to be stored is being performed according to that block, the value of its reference count is identified.
Step 305: When the value of the reference count of the backup data block is identified as zero, trigger reclamation of the backup data block.
When the reference count of the backup data block is zero, the block is garbage and can be reclaimed. Reference counting is a garbage collection method that does not rely on a root set: it uses a reference counter to distinguish live objects from objects that are no longer in use. In general, when an object, here a backup data block, is discarded or no longer used, its reference counter is decremented by 1, and once the counter reaches 0 the block meets the condition for garbage collection. Those skilled in the art will understand that steps 303 to 305 of the embodiment of FIG. 3 can also be applied to reclaiming all stored data blocks: when a change in the reference count of a stored data block is detected, the reclamation task is executed, and the block is reclaimed if its reference count is 0.
This embodiment of the present invention provides a data storage method that monitors changes in the reference counts of backup data blocks. When the state identifier of a backup data block whose reference count has changed indicates that the block is not in use, the value of its reference count is identified, and when that value is zero, reclamation of the block is triggered. Because only backup data blocks whose reference counts have changed are scanned for reclamation, reclamation is faster and the user's storage space is freed sooner.
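Steps 303 to 305 can be sketched as a scan over only those blocks whose reference count has changed. The class and field names below (`Block`, `BlockStore`, `in_use`, `changed`) are hypothetical, the state identifier is reduced to a boolean, and deferring in-use blocks to the next scan is an assumption about how a skipped block would be revisited.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass
class Block:
    ref_count: int = 1
    in_use: bool = False  # state identifier: True while a deduplication operation relies on the block


class BlockStore:
    def __init__(self) -> None:
        self.blocks: Dict[str, Block] = {}
        self.changed: Set[str] = set()  # blocks whose reference count changed since the last scan

    def release(self, fp: str) -> None:
        """A stored file stopped referencing the block (file modified or deleted)."""
        self.blocks[fp].ref_count -= 1
        self.changed.add(fp)

    def reclaim_changed(self) -> List[str]:
        """Scan only the changed blocks and reclaim those that are unused and unreferenced."""
        reclaimed: List[str] = []
        deferred: Set[str] = set()
        for fp in sorted(self.changed):
            block = self.blocks[fp]
            if block.in_use:
                deferred.add(fp)     # step 304 not satisfied: look again on the next scan
            elif block.ref_count == 0:
                del self.blocks[fp]  # step 305: reference count is zero, reclaim
                reclaimed.append(fp)
        self.changed = deferred
        return reclaimed
```

Blocks that were never touched since the last scan are not visited at all, which is the source of the speed-up the embodiment claims.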
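FIG. 4 is a schematic diagram of Embodiment 1 of the data storage logical architecture of the present invention. The logical architecture provided in this embodiment can carry out the foregoing embodiments of the data storage method. As shown in FIG. 4, it includes a cluster management module 40, a deduplication engine module 41, a metadata server 42, a single instance repository 43, and a forwarding module 44.
The cluster management module 40 manages the reclamation list and the deduplication list.
The deduplication engine module 41 processes and manages tasks such as deduplication, space reclamation, and reference counting of data blocks. Accordingly, the deduplication engine module 41 includes a task processing module 411, a task management module 412, and a dispatcher 413. The task processing module 411 includes a deduplication module 4111, which executes deduplication tasks; a reference counting module 4112, which executes reference counting of data blocks; and a space reclamation module 4113, which executes space reclamation tasks. The task management module 412 manages the thread pool, including monitoring the task queue and the running state of the threads in the pool. The dispatcher 413 maintains the data blocks managed by each metadata server 42.
The metadata server 42 determines the similar group to which the file to be stored belongs in the group sampling library.
The single instance repository 43 stores the sampling groups of the group sampling library and the sampling group intervals within each sampling group.
The forwarding module 44 is responsible for data transmission among the cluster management module 40, the deduplication engine module 41, and the metadata server 42.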
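In a specific embodiment, when a deduplication task is executed, the deduplication engine module 41 divides the file to be stored into blocks, computes the fingerprint of each block, samples the fingerprints, generates the fingerprint sampling table of the file from the sampled fingerprints, and sends a grouping request message to the metadata server 42. The metadata server 42 determines the similar group corresponding to the fingerprint sampling table and the backup data blocks corresponding to that similar group. Before deduplicating, the deduplication engine module 41 sends the cluster management module 40 a deduplication request message carrying the state identifier of the backup data block. The cluster management module 40 checks whether that state identifier is present in the reclamation list. If it is, the cluster management module 40 has the deduplication engine module 41 cancel reclamation of that backup data block and continue the deduplication task.
In a specific embodiment, when a reclamation task is executed, the deduplication engine module 41 collects the backup data blocks whose reference counts have changed and reclaims them. Before reclaiming a block, it sends the cluster management module 40 a reclamation request message carrying the state identifier of the backup data block. The cluster management module 40 checks whether that state identifier is present in the deduplication list. If it is, the cluster management module 40 has the deduplication engine module 41 cancel reclamation of that backup data block.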
In the data storage logical architecture provided in this embodiment, deduplication tasks take priority over space reclamation tasks during data storage. This resolves the data loss caused by running deduplication and reclamation concurrently, and ensures that deduplication and reclamation proceed in an orderly way and that stored data remains safe.
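The arbitration between the deduplication list and the reclamation list might look like the sketch below. It is one possible reading of the exchange between modules 40 and 41, not the embodiment's implementation: the class `ClusterManager`, its method names, the use of a single lock, and the final `confirm_reclaim` check are assumptions. Each list is modelled as a set of block identifiers, so several concurrent deduplication tasks on the same block would need a counter instead.

```python
import threading
from typing import Set


class ClusterManager:
    """Keeps the deduplication list and the reclamation list and lets deduplication win."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self.dedup_list: Set[str] = set()
        self.reclaim_list: Set[str] = set()

    def request_dedup(self, block_id: str) -> None:
        """Deduplication request: cancel any pending reclamation of the block, then register it."""
        with self._lock:
            self.reclaim_list.discard(block_id)
            self.dedup_list.add(block_id)

    def finish_dedup(self, block_id: str) -> None:
        with self._lock:
            self.dedup_list.discard(block_id)

    def request_reclaim(self, block_id: str) -> bool:
        """Reclamation request: refused while the block is on the deduplication list."""
        with self._lock:
            if block_id in self.dedup_list:
                return False
            self.reclaim_list.add(block_id)
            return True

    def confirm_reclaim(self, block_id: str) -> bool:
        """Final check before freeing: True only if no deduplication request cancelled it meanwhile."""
        with self._lock:
            if block_id in self.reclaim_list:
                self.reclaim_list.discard(block_id)
                return True
            return False
```

A reclamation worker would call `request_reclaim`, and free the block only if a later `confirm_reclaim` still returns `True`, so a deduplication request that arrives in between keeps the block alive.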
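FIG. 5 is a schematic diagram of Embodiment 1 of the data storage cluster architecture of the present invention. The cluster architecture provided in this embodiment can be implemented through the data storage method embodiments shown in FIG. 1 to FIG. 3 and the data storage logical architecture embodiment shown in FIG. 4. As shown in FIG. 5, it includes a slave node 501, a master node 502, and a standby node 503.
The slave node 501, the master node 502, and the standby node 503 share data storage. Each of the three includes a cluster management module, a deduplication engine module, and a metadata server, and each can carry out the deduplication and space reclamation processes of the data storage method described above. The master node 502 may be a host on a local area network, and the slave node 501 may be a subordinate machine on the local area network. Those skilled in the art will understand that in practice there may be multiple slave nodes 501. The master node 502 is mainly responsible for issuing commands to the slave node 501 to start deduplication or space reclamation, so that the slave node 501 executes the corresponding task. When the slave node 501 finishes a task, it may report the result to the master node 502. When the slave node 501 or the master node 502 fails and cannot work normally, the standby node 503 can take over for the failed node so that data storage continues.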
In the data storage cluster architecture provided in this embodiment, the slave node, the master node, and the standby node can all execute the data storage method, and the master node can direct multiple slave nodes to store data simultaneously, which improves storage efficiency. When a slave node or the master node fails, the standby node can take over for it, which avoids interrupting the data storage process and preserves its continuity.
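As a rough illustration of the master-side dispatch with standby takeover, consider the sketch below. The `Node` class, its `healthy` flag, the `dispatch` function and the result strings are hypothetical. Failure detection, and takeover of a failed master, are outside the scope of this sketch.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Node:
    name: str
    healthy: bool = True

    def run(self, task: str) -> str:
        """Execute a deduplication or space reclamation task and report the result."""
        return f"{self.name}:{task}:done"


def dispatch(task: str, slaves: List[Node], standby: Node) -> List[str]:
    """Master-side dispatch: each slave runs the task, and the standby covers a failed slave."""
    results = []
    for node in slaves:
        worker = node if node.healthy else standby
        results.append(worker.run(task))
    return results
```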
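FIG. 6 is a structural diagram of Embodiment 1 of the data storage apparatus of the present invention. As shown in FIG. 6, the data storage apparatus provided in this embodiment includes a backup data block obtaining module 61, a deduplication module 62, and a reclamation module 63. The backup data block obtaining module 61 is configured to match the fingerprint of each data block of the file to be stored against the fingerprints in the fingerprint library to obtain the corresponding backup data blocks. The deduplication module 62 is configured to perform a deduplication operation on the file to be stored according to the backup data blocks and to mark the backup data blocks with a state identifier. The reclamation module 63 is configured to reclaim the backup data blocks according to their state identifiers.
The data storage apparatus of this embodiment can be used to carry out the technical solutions of the foregoing method embodiments. Its implementation principle and technical effects are similar and are not described again here.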
FIG. 7 is a structural diagram of Embodiment 2 of the data storage apparatus of the present invention. As shown in FIG. 7, on the basis of the embodiment shown in FIG. 6, the deduplication module 62 includes a first counting unit 621 and a second counting unit 622.
The first counting unit 621 is configured to increment the group count of the backup data block by one before the deduplication operation is performed on the file to be stored according to the backup data block. The second counting unit 622 is configured to decrement the group count of the backup data block by one after the deduplication operation on the file to be stored according to the backup data block is completed.
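On the basis of the embodiment shown in FIG. 6, the reclamation module 63 includes a reclamation suspension unit 631 and a first reclamation trigger unit 632.
The reclamation suspension unit 631 is configured to suspend reclamation of the backup data block when the group count in the state identifier of the backup data block is identified as nonzero. The first reclamation trigger unit 632 is configured to trigger reclamation of the backup data block when the group count in the state identifier of the backup data block is identified as zero.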
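The interplay of the two counting units and the two reclamation units can be sketched with a counter that is raised for the duration of a deduplication operation. The names `BackupBlock`, `dedup_against` and `try_reclaim` are hypothetical, and the use of a context manager to pair the increment with the decrement is a design choice of this sketch, not something stated in the embodiment.

```python
from contextlib import contextmanager
from dataclasses import dataclass
from typing import Iterator


@dataclass
class BackupBlock:
    group_count: int = 0  # number of deduplication operations currently relying on the block


@contextmanager
def dedup_against(block: BackupBlock) -> Iterator[BackupBlock]:
    """Increment the group count before deduplication and decrement it afterwards."""
    block.group_count += 1      # role of the first counting unit
    try:
        yield block
    finally:
        block.group_count -= 1  # role of the second counting unit


def try_reclaim(block: BackupBlock) -> str:
    """Suspend reclamation while the group count is nonzero, trigger it otherwise."""
    if block.group_count != 0:
        return "suspended"
    return "reclaimed"
```

The `finally` clause ensures the count is decremented even if the deduplication operation raises, so a failed operation cannot leave a block permanently protected from reclamation.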
The data storage apparatus of this embodiment can be used to carry out the technical solutions of the foregoing method embodiments. Its implementation principle and technical effects are similar and are not described again here.
FIG. 8 is a structural diagram of Embodiment 3 of the data storage apparatus of the present invention. As shown in FIG. 8, on the basis of the embodiment shown in FIG. 6, the backup data block obtaining module 61 includes a fingerprint calculation unit 611, a fingerprint sampling unit 612, and a group determination unit 613.
The fingerprint calculation unit 611 is configured to divide the file to be stored into data blocks and compute the fingerprint of each data block. The fingerprint sampling unit 612 is configured to sample the fingerprints of the data blocks and generate the fingerprint sampling table of the file to be stored from the sampled fingerprints. The group determination unit 613 is configured to determine, according to the fingerprint sampling table and the group sampling library, the similar group to which the file to be stored belongs in the group sampling library, and to take the stored data blocks corresponding to the similar group as the backup data blocks. The group sampling library is obtained by sampling the fingerprint library, and the similar group is the sampling group in the group sampling library that matches the sampled fingerprints in the fingerprint sampling table of the file to be stored.
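The work of the fingerprint calculation unit and the fingerprint sampling unit can be sketched as below. Fixed-size blocks, SHA-1 as the fingerprint function, and the sampling condition "fingerprint value modulo `modulus` equals zero" are assumptions chosen for illustration, since this section does not fix the chunking scheme, the hash function, or the sampling condition.

```python
import hashlib
from typing import List


def split_blocks(data: bytes, size: int = 4096) -> List[bytes]:
    """Divide the file content into fixed-size data blocks (the last one may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def fingerprint(block: bytes) -> str:
    """Fingerprint of a data block, here its SHA-1 digest in hexadecimal."""
    return hashlib.sha1(block).hexdigest()


def sampling_table(data: bytes, size: int = 4096, modulus: int = 4) -> List[str]:
    """Keep only the fingerprints that satisfy the assumed sampling condition."""
    fingerprints = [fingerprint(b) for b in split_blocks(data, size)]
    return [fp for fp in fingerprints if int(fp, 16) % modulus == 0]
```

An empty result corresponds to the special case in which no block satisfies the sampling condition, for which the description assigns the file to the preset group.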
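On the basis of the embodiment shown in FIG. 6, the reclamation module 63 includes:
a reference count monitoring unit 633, configured to identify the state identifier of a backup data block when a change in the value of its reference count is detected; a reference count identification unit 634, configured to identify the value of the reference count of the backup data block when its state identifier indicates that the block is not in use; and a second reclamation trigger unit 635, configured to trigger reclamation of the backup data block when the value of its reference count is identified as zero.
The data storage apparatus of this embodiment can be used to carry out the technical solutions of the foregoing method embodiments. Its implementation principle and technical effects are similar and are not described again here.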
Those of ordinary skill in the art will understand that all or some of the steps of the foregoing method embodiments may be carried out by hardware under the control of program instructions. The program may be stored in a computer-readable storage medium and, when executed, performs the steps of the foregoing method embodiments. The storage medium includes any medium that can store program code, such as a ROM, a RAM, a magnetic disk, or an optical disc.
Finally, it should be noted that the foregoing embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they may still modify the technical solutions described in those embodiments, or make equivalent replacements of some or all of their technical features, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present invention.
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201210552099.7A; CN102982180B (en) | 2012-12-18 | 2012-12-18 | Data storage method and equipment |
| Publication Number | Publication Date |
|---|---|
| CN102982180A | 2013-03-20 |
| CN102982180B | 2016-08-03 |
| Application Number | Status | Priority Date | Filing Date | Title |
|---|---|---|---|---|
| CN201210552099.7A; CN102982180B (en) | Expired - Fee Related | 2012-12-18 | 2012-12-18 | Data storage method and equipment |
| Country | Link |
|---|---|
| CN (1) | CN102982180B (en) |
| Publication number | Publication date |
|---|---|
| CN102982180B (en) | 2016-08-03 |
| Code | Title | Description |
|---|---|---|
| C06 | Publication | |
| PB01 | Publication | |
| C10 | Entry into substantive examination | |
| SE01 | Entry into force of request for substantive examination | |
| C14 | Grant of patent or utility model | |
| GR01 | Patent grant | |
| TR01 | Transfer of patent right | Effective date of registration: 20170508. Address after: 510640 Guangdong City, Tianhe District Province, No. five, road, public education building, unit 371-1, unit 2401. Patentee after: Guangdong Gaohang Intellectual Property Operation Co., Ltd. Address before: 518129 Bantian HUAWEI headquarters office building, Longgang District, Guangdong, Shenzhen. Patentee before: Huawei Technologies Co., Ltd. |
| CB03 | Change of inventor or designer information | Inventor after: Xiao Wenchang. Inventor before: Fu Xudong. Inventor before: Duan Yumei. |
| TR01 | Transfer of patent right | Effective date of registration: 20170519. Address after: 414000 Zhongke Industrial Park, Yueyang Road, Yueyang economic and Technological Development Zone, Hunan. Patentee after: Hunan and Magnetic Technology Co., Ltd. Address before: 510640 Guangdong City, Tianhe District Province, No. five, road, public education building, unit 371-1, unit 2401. Patentee before: Guangdong Gaohang Intellectual Property Operation Co., Ltd. |
| CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20160803. Termination date: 20171218. |