Movatterモバイル変換


[0]ホーム

URL:


CN113986891B - Method and device for deleting repeated data - Google Patents

Method and device for deleting repeated data
Download PDF

Info

Publication number
CN113986891B
CN113986891BCN202111058390.4ACN202111058390ACN113986891BCN 113986891 BCN113986891 BCN 113986891BCN 202111058390 ACN202111058390 ACN 202111058390ACN 113986891 BCN113986891 BCN 113986891B
Authority
CN
China
Prior art keywords
fingerprint
type
fingerprint information
data
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111058390.4A
Other languages
Chinese (zh)
Other versions
CN113986891A (en
Inventor
郑乾坤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
New H3C Big Data Technologies Co Ltd
Original Assignee
New H3C Big Data Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by New H3C Big Data Technologies Co LtdfiledCriticalNew H3C Big Data Technologies Co Ltd
Priority to CN202111058390.4ApriorityCriticalpatent/CN113986891B/en
Publication of CN113986891ApublicationCriticalpatent/CN113986891A/en
Application grantedgrantedCritical
Publication of CN113986891BpublicationCriticalpatent/CN113986891B/en
Activelegal-statusCriticalCurrent
Anticipated expirationlegal-statusCritical

Links

Classifications

Landscapes

Abstract

The present disclosure relates to the field of data storage technologies, and in particular, to a method and an apparatus for deleting duplicate data. Calculating data fingerprints of each data block after downloading, carrying out hash processing on each data fingerprint, and sending first type fingerprint information to a fingerprint analyzer, wherein one first type fingerprint information comprises address information corresponding to one data block; receiving first type fingerprint information sent by a data processing module, and carrying out merging processing on the first type fingerprint information with the same data fingerprint when the quantity of the first type fingerprint information stored locally reaches a first set threshold value, so as to obtain corresponding second type fingerprint information, wherein one second type fingerprint information comprises the repeated quantity of the first type fingerprint information corresponding to one data block and one or more address information of the data block; and judging whether the repeated quantity of the first type of fingerprint information in each second type of fingerprint information reaches a second set threshold value, and if so, deleting the data based on the second type of fingerprint information.

Description

Method and device for deleting repeated data
Technical Field
The present disclosure relates to the field of data storage technologies, and in particular, to a method and an apparatus for deleting duplicate data.
Background
To meet the ever-increasing business demands, storage systems must provide lower latency, higher IOPS, further improving customer satisfaction. SSD disks can provide lower latency, higher IOPS for storage systems than traditional HDD disks, but SSD disks are more expensive. In order to reduce the storage cost of using SSD disk, the deduplication technology identifies and deletes duplicate data, improves the utilization rate of storage space, and becomes a necessary choice for SSD disk storage (full flash memory).
At present, data are segmented on a host according to a certain granularity, each equal-sized data block is used for calculating fingerprint data, and all fingerprint data and data storage positions are recorded in a fingerprint table. Before the data is downloaded, the fingerprint table is queried, if the fingerprint is already in the fingerprint table, the data block with the same content already exists on the current host, the data does not need to be downloaded, only the data address recorded in the fingerprint table is queried from the fingerprint table, and the address is referenced in the metadata of the data block, so that the deletion of repeated data is realized.
However, in distributed storage, where data is scattered over multiple hosts, the distribution of data within a storage pool is far more discrete than in traditional centralized storage, and the likelihood that the same content is scattered for storage on different hosts is greater. If we use local deduplication in distributed storage, we would benefit little. Therefore, it is necessary to implement global deduplication of a storage pool, identify duplicate data within the global scope of the storage pool, and perform data deduplication.
In the open source ceph project, in order to implement data deduplication, two storage pools are introduced, one Base Pool for storing metadata and Chunk Pool for storing data. The metadata stored by the Base Pool contains the fingerprint of the data block and the offset of the corresponding data block stored in the Chunk Pool. In fact, the Base Pool is equivalent to the function of a fingerprint table, all data writing needs to calculate fingerprints, and the fingerprints are recorded in metadata of the Base Pool. The actual writing position of the data is obtained through fingerprint HASH (1's content) =k- > oid=k- > CRUSH (K) - > chunk's location), the same data is necessarily written into the same position, and if the corresponding position has data, the data of the content is written, and the data does not need to be rewritten, so that the data can be deleted again.
However, fingerprint information of all data is written into Base Pool, and Base Pool is large in scale and cannot be fully cached by using a memory. The size of the last written object of the data is equal to the size of the calculated fingerprint data block, and in order to ensure that more repeated data are identified, the size of the calculated fingerprint data block is not easy to be too large, so that the writing performance of the data can be influenced if the object for storing the data is too small.
Disclosure of Invention
The application provides a method and a device for deleting repeated data, which are used for solving the problem that the query performance of a fingerprint table is not high because all fingerprint information in the prior art is written into the fingerprint table.
In a first aspect, the present application provides a method of deduplication applied to a distributed storage system including a data processing module and a fingerprint analyzer deployed on each host, the method comprising:
the data processing module calculates data fingerprints of each data block after downloading, hashes each data fingerprint, and sends first type fingerprint information to a corresponding fingerprint analyzer based on the hashed result, wherein the first type fingerprint information comprises data fingerprints corresponding to one data block and address information of the data block;
the target fingerprint analyzer receives first type fingerprint information sent by the data processing module, and when the quantity of the first type fingerprint information stored locally reaches a first set threshold value, performs merging processing on the first type fingerprint information with the same data fingerprint to obtain corresponding second type fingerprint information, wherein one second type fingerprint information comprises a data fingerprint corresponding to one data block, the repeated quantity of the first type fingerprint information corresponding to the data block and one or more address information of the data block;
The target fingerprint analyzer judges whether the repetition number of the first type of fingerprint information in each second type of fingerprint information reaches a second set threshold, and if so, the target fingerprint analyzer deletes the repeated data based on the second type of fingerprint information.
Optionally, the step of receiving the first type of fingerprint information sent by the data processing module by the target fingerprint analyzer, and merging the first type of fingerprint information with the same data fingerprint when determining that the number of the locally stored first type of fingerprint information reaches a first set threshold value, to obtain the corresponding second type of fingerprint information includes:
the target fingerprint analyzer writes one first type of fingerprint information into a specified file with preset capacity when receiving the first type of fingerprint information sent by the data processing module, and encapsulates the specified file to generate a new specified file if the size of the specified file reaches the preset capacity;
and when the target fingerprint analyzer determines that the number of the packaged specified files reaches a third set threshold, merging the first type fingerprint information with the same data fingerprint in the packaged specified files to obtain the corresponding second type fingerprint information.
Optionally, recording first-class fingerprint information and second-class fingerprint information in a key value pair mode, wherein key values of the first-class fingerprint information are data fingerprints, and value values comprise address information of the first-class fingerprint information and data blocks;
the step of combining the first type fingerprint information with the same data fingerprint by the target fingerprint analyzer to obtain the corresponding second type fingerprint information comprises the following steps:
the target fingerprint analyzer combines a plurality of first type fingerprint information with the same fingerprint data into a second type fingerprint information, wherein key values of the second type fingerprint information are data fingerprints, value values comprise the repetition number of the first type fingerprint information of a data block corresponding to the fingerprint data, and a plurality of address information of the data block.
Optionally, the method further comprises:
the target fingerprint analyzer stores fingerprint information of the data block after the repeated data deletion is executed into a fingerprint table, wherein the fingerprint information comprises data fingerprints of the data block and address information of the data block;
the target fingerprint analyzer generates third type fingerprint information corresponding to the data block, wherein the third type fingerprint information is recorded in a key value pair mode, the key value of the third type fingerprint information is a data fingerprint, the value comprises the repeated number of the first type fingerprint information of the data block corresponding to the fingerprint data, and the version number of the fingerprint table, wherein the fingerprint table performs fingerprint cleaning once, and the version number of the fingerprint table is increased by one.
Optionally, the method further comprises:
after receiving the first type fingerprint information of a data block, the target fingerprint analyzer judges whether third type fingerprint information corresponding to the data block exists locally or not;
if the target fingerprint analyzer judges that the third type of fingerprint information corresponding to the data block exists locally, judging whether the version number of the fingerprint table recorded by the third type of fingerprint information is smaller than the current version number of the fingerprint table;
if the target fingerprint analyzer judges that the version number of the fingerprint table recorded by the third type fingerprint information is smaller than the current version number of the fingerprint table, searching whether fingerprint information corresponding to the data block exists in the fingerprint table, if so, executing repeated data deleting processing, and updating the version number of the fingerprint table of the third type fingerprint information path; if not, recording the first type fingerprint information.
In a second aspect, the present application provides a distributed storage system comprising a data processing module and a fingerprint analyzer disposed on each host, wherein,
the data processing module is used for calculating the data fingerprints of each data block after downloading, carrying out hash processing on each data fingerprint, and sending first type fingerprint information to the corresponding fingerprint analyzer based on the hash result, wherein one first type fingerprint information comprises the data fingerprint corresponding to one data block and the address information of the data block;
The target fingerprint analyzer is used for receiving the first type of fingerprint information sent by the data processing module, and carrying out merging processing on the first type of fingerprint information with the same data fingerprint when the number of the first type of fingerprint information stored locally reaches a first set threshold value, so as to obtain corresponding second type of fingerprint information, wherein one second type of fingerprint information comprises a data fingerprint corresponding to one data block, the repeated number of the first type of fingerprint information corresponding to the data block and one or more address information of the data block;
the target fingerprint analyzer is further configured to determine whether the number of repetitions of the first type of fingerprint information in each second type of fingerprint information reaches a second set threshold, and if it is determined that the number of repetitions of the first type of fingerprint information in one second type of fingerprint information reaches the second set threshold, delete the repeated data based on the second type of fingerprint information.
Optionally, the target fingerprint analyzer is configured to, when it is determined that the number of the first type of fingerprint information stored locally reaches a first set threshold, combine the first type of fingerprint information with the same data fingerprint to obtain corresponding second type of fingerprint information, and specifically:
When receiving a first type fingerprint information sent by the data processing module, writing the first type fingerprint information into a specified file with preset capacity, and if the size of the specified file reaches the preset capacity, packaging the specified file to generate a new specified file;
and when the number of the packaged specified files reaches a third set threshold, merging the first type fingerprint information with the same data fingerprint in the packaged specified files to obtain the corresponding second type fingerprint information.
Optionally, recording first-class fingerprint information and second-class fingerprint information in a key value pair mode, wherein key values of the first-class fingerprint information are data fingerprints, and value values comprise address information of the first-class fingerprint information and data blocks;
and combining the first type of fingerprint information with the same data fingerprint to obtain the corresponding second type of fingerprint information, wherein the target fingerprint analyzer is particularly used for:
and merging the plurality of first type fingerprint information with the same fingerprint data into second type fingerprint information, wherein the key value of the second type fingerprint information is a data fingerprint, the value comprises the repetition number of the first type fingerprint information of a data block corresponding to the fingerprint data, and the plurality of address information of the data block.
Optionally, the target fingerprint analyzer is further configured to:
storing fingerprint information of the data block after the repeated data deletion is executed into a fingerprint table, wherein the fingerprint information comprises data fingerprints of the data block and address information of the data block;
and generating third type fingerprint information corresponding to the data block, wherein the third type fingerprint information is recorded in a key value pair mode, the key value of the third type fingerprint information is a data fingerprint, the value comprises the repetition number of the first type fingerprint information of the data block corresponding to the fingerprint data, and the version number of the fingerprint table, wherein the fingerprint table performs fingerprint cleaning once, and the version number of the fingerprint table is increased by one.
Optionally, the target fingerprint analyzer is further configured to:
after receiving the first type fingerprint information of a data block, judging whether third type fingerprint information corresponding to the data block exists locally;
if the third type fingerprint information corresponding to the data block is judged to exist locally, judging whether the version number of the fingerprint table recorded by the third type fingerprint information is smaller than the current version number of the fingerprint table;
if the version number of the fingerprint table recorded by the third type fingerprint information is smaller than the current version number of the fingerprint table, searching whether fingerprint information corresponding to the data block exists in the fingerprint table, if so, executing repeated data deleting processing, and updating the version number of the fingerprint table of the third type fingerprint information path; if not, recording the first type fingerprint information.
In a third aspect, an embodiment of the present application provides a data deduplication apparatus, including:
a memory for storing program instructions;
a processor for invoking program instructions stored in said memory, performing the steps of the method according to any of the first aspects above in accordance with the obtained program instructions.
In a fourth aspect, embodiments of the present application also provide a computer-readable storage medium storing computer-executable instructions for causing a computer to perform the steps of the method according to any one of the first aspects.
As can be seen from the foregoing, the method for deleting duplicate data provided in the embodiments of the present application is applied to a distributed storage system, where the distributed storage system includes a data processing module and a fingerprint analyzer deployed on each host, and the method includes: the data processing module calculates data fingerprints of each data block after downloading, hashes each data fingerprint, and sends first type fingerprint information to a corresponding fingerprint analyzer based on the hashed result, wherein the first type fingerprint information comprises data fingerprints corresponding to one data block and address information of the data block; the target fingerprint analyzer receives first type fingerprint information sent by the data processing module, and when the quantity of the first type fingerprint information stored locally reaches a first set threshold value, performs merging processing on the first type fingerprint information with the same data fingerprint to obtain corresponding second type fingerprint information, wherein one second type fingerprint information comprises a data fingerprint corresponding to one data block, the repeated quantity of the first type fingerprint information corresponding to the data block and one or more address information of the data block; the target fingerprint analyzer judges whether the repetition number of the first type of fingerprint information in each second type of fingerprint information reaches a second set threshold, and if so, deletes the repeated data based on the second type of fingerprint information
The repeated data deleting method provided by the embodiment of the application can be adopted
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the following description will briefly describe the drawings that are required to be used in the embodiments of the present application or the description in the prior art, and it is obvious that the drawings in the following description are only some embodiments described in the present application, and other drawings may also be obtained according to these drawings of the embodiments of the present application for a person having ordinary skill in the art.
Fig. 1 is a schematic structural diagram of a distributed storage system according to an embodiment of the present application;
FIG. 2 is a detailed flowchart of a method for deduplication according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a data organization structure of a fingerprint analyzer according to an embodiment of the present application;
FIG. 4 is a schematic structural diagram of a distributed storage system according to an embodiment of the present disclosure;
fig. 5 is a schematic structural diagram of a deduplication apparatus according to an embodiment of the present application.
Detailed Description
The terminology used in the embodiments of the application is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in this application and the claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to any or all possible combinations including one or more of the associated listed items.
It should be understood that although the terms first, second, third, etc. may be used in embodiments of the present application to describe various information, these information should not be limited to these terms. These terms are only used to distinguish one type of information from another. For example, a first message may also be referred to as a second message, and similarly, a second message may also be referred to as a first message, without departing from the scope of the present application. Depending on the context, furthermore, the word "if" used may be interpreted as "at … …" or "at … …" or "in response to a determination".
In order to better illustrate the data de-duplication method provided in the embodiments of the present application, an exemplary schematic structure diagram of a distributed storage system provided in the embodiments of the present application is shown in fig. 1, where the data de-duplication method provided in the embodiments of the present application is applied to the distributed storage system. The distributed storage system comprises a plurality of hosts for storing data, wherein each host can be operated with a data processing module, and a fingerprint analyzer is deployed on each host. After receiving the data to be processed sent by the client, any host divides the data to be processed into data blocks with fixed length (e.g. 8 kb), and downloads each data block to a corresponding storage disk, after each fixed-length data block is downloaded, a given-length data block calculates a data fingerprint, and fingerprint information such as a logical address LBA, an actual disc-falling physical address PBA, and the data fingerprint of the fixed-length data block is collected into a fingerprint analyzer.
Referring to fig. 2, a detailed flowchart of a method for deduplication according to an embodiment of the present application is provided, where the method is applied to a distributed storage system, and the distributed storage system includes a data processing module and a fingerprint analyzer deployed on each host, and the method includes the following steps:
step 200: the data processing module calculates data fingerprints of each data block after downloading, hashes each data fingerprint, and sends first type fingerprint information to a corresponding fingerprint analyzer based on the hash result, wherein the first type fingerprint information comprises data fingerprints corresponding to one data block and address information of the data block.
Specifically, in the process of collecting fingerprint information to fingerprint analyzers, fingerprint information of different data blocks can be scattered to different hosts by carrying out HASH on the data fingerprints, so that analysis processing pressure of each fingerprint analyzer can be balanced on one hand; on the other hand, the fingerprint information with the same data fingerprint must be scattered to the same fingerprint analyzer. Because the data fingerprints are the same, the content of the data blocks is the same, fingerprint information of all the data blocks with the same content in the storage pool can be collected to the same fingerprint analyzer, repeated data can be analyzed and identified on the fingerprint analyzer, and repeated deleting requests are carried out on the data.
The distributed storage system divides data to be processed into a plurality of fixed-length data blocks and downloads the data blocks, calculates the data fingerprint of each data block, hashes the data fingerprints of each data block to obtain a hash result, and diffuses each data fingerprint to a fingerprint analyzer of each host based on the hash result. Specifically, the data fingerprint of a data block, and the address information (logical address LBA and actual landing physical address PBA) of the data block are sent to the corresponding fingerprint analyzer.
Step 210: and the target fingerprint analyzer receives the first type of fingerprint information sent by the data processing module, and when the number of the first type of fingerprint information stored locally reaches a first set threshold value, performs merging processing on the first type of fingerprint information with the same data fingerprint to obtain corresponding second type of fingerprint information, wherein one second type of fingerprint information comprises a data fingerprint corresponding to one data block, the repeated number of the first type of fingerprint information corresponding to the data block and one or more address information of the data block.
The target fingerprint analyzer receives the first type fingerprint information sent by the data processing module, and when determining that the number of the first type fingerprint information stored locally reaches a first set threshold, performs merging processing on the first type fingerprint information with the same data fingerprint, and obtains corresponding second type fingerprint information, the target fingerprint analyzer is preferably realized in the following manner:
The target fingerprint analyzer writes one first type of fingerprint information into a specified file with preset capacity when receiving the first type of fingerprint information sent by the data processing module, and encapsulates the specified file to generate a new specified file if the size of the specified file reaches the preset capacity; and when the target fingerprint analyzer determines that the number of the packaged specified files reaches a third set threshold, merging the first type fingerprint information with the same data fingerprint in the packaged specified files to obtain the corresponding second type fingerprint information.
For example, assuming that the fingerprint analyzer stores all collected fingerprint information (first type fingerprint information) using ROCKSDB, then since the sequence number is incremented above the internal Key of ROCKSDB, the data of the same Key value is written multiple times without immediately overwriting the last write. The ROCKSDB data is written into the memory data structure Memable first, the total size of the single memory data structure Memable exceeds a certain threshold value, the single memory data structure Memable is packaged, new Memable writing is regenerated, and the packaged Memable waits to be sequentially written into the SST file of Level0 shown in FIG. 3. Data is written from Memtable to Level0, and multiple writes of the same Key value are not processed. When the data reaches a certain data amount at Level0, the number of SST files of Level0 exceeds a certain threshold, and the data needs to be merged from Level0 to Level 1. Within a single SST file in Level0, the data is ordered in Key values, but the Key value ranges of different SST files are likely to intersect. And the Key value ranges of all SSTs in Level1 do not have intersection, and data can be arranged in a global order in Level 1. Thus, the data is merged from Level0 to Level1, and the data with the same Key is necessarily arranged together. And then, based on a data merging mechanism, merging the multiple time point data of the same Key value into one piece, and recording the repetition number of fingerprints to obtain second-type fingerprint information corresponding to the Key value.
In this embodiment of the present application, the first type of fingerprint information and the second type of fingerprint information may be recorded in a key-value pair (key-value) manner, where the key value of the first type of fingerprint information is a data fingerprint, and the value includes the address information of the fingerprint information type as the first type and the data block.
Then, the target fingerprint analyzer performs merging processing on the first type of fingerprint information with the same data fingerprint, and when obtaining the corresponding second type of fingerprint information, a preferred implementation manner is as follows:
the target fingerprint analyzer combines a plurality of first type fingerprint information with the same fingerprint data into a second type fingerprint information, wherein key values of the second type fingerprint information are data fingerprints, value values comprise the repetition number of the first type fingerprint information of a data block corresponding to the fingerprint data, and a plurality of address information of the data block.
Step 220: the target fingerprint analyzer judges whether the repetition number of the first type of fingerprint information in each second type of fingerprint information reaches a second set threshold, and if so, the target fingerprint analyzer deletes the repeated data based on the second type of fingerprint information.
In the embodiment of the present application, the method for deleting the repeated data may further include the following steps:
the target fingerprint analyzer stores fingerprint information of the data block after the repeated data deletion is executed into a fingerprint table, wherein the fingerprint information comprises data fingerprints of the data block and address information of the data block; the target fingerprint analyzer generates third type fingerprint information corresponding to the data block, wherein the third type fingerprint information is recorded in a key value pair mode, the key value of the third type fingerprint information is a data fingerprint, the value comprises the repeated number of the first type fingerprint information of the data block corresponding to the fingerprint data, and the version number of the fingerprint table, wherein the fingerprint table performs fingerprint cleaning once, and the version number of the fingerprint table is increased by one.
That is, after merging the data fingerprints of any data block to obtain the second type of fingerprint information, determining whether the number of times of repeated downloading of any data block carried in the second type of fingerprint information reaches a second set threshold value, so as to determine whether the data block needs to be subjected to the deduplication processing, if so, the data block needs to be subjected to the deduplication processing, and corresponding third type of fingerprint information is generated, wherein the third type of fingerprint information records the data fingerprints, the number of repeated data blocks and the current version number of the fingerprint table.
For example, assuming that n pieces of target data are fast, when deleting the target data block again, only one piece of target data block is reserved, and the data fingerprint of the target data block, n logical addresses and 1 physical address are used as the final fingerprint information of the target data block and maintained in the fingerprint table.
Further, in the embodiment of the present application, the method for deleting the repeated data may further include the following steps:
after receiving the first type fingerprint information of a data block, the target fingerprint analyzer judges whether third type fingerprint information corresponding to the data block exists locally or not; if the target fingerprint analyzer judges that the third type of fingerprint information corresponding to the data block exists locally, judging whether the version number of the fingerprint table recorded by the third type of fingerprint information is smaller than the current version number of the fingerprint table; if the target fingerprint analyzer judges that the version number of the fingerprint table recorded by the third type fingerprint information is smaller than the current version number of the fingerprint table, searching whether fingerprint information corresponding to the data block exists in the fingerprint table, if so, executing repeated data deleting processing, and updating the version number of the fingerprint table of the third type fingerprint information path; if not, recording the first type fingerprint information.
The data organization structure of the fingerprint analyzer provided in the embodiment of the present application is described in detail below with reference to a specific application scenario. For example, referring to fig. 3, a schematic diagram of a data organization structure of a fingerprint analyzer according to an embodiment of the present application is shown, where fingerprint information collected from each host includes a logical address LBA, an actual landing physical address PBA, and a data fingerprint. In the fingerprint information written to the Rocksdb database, 3 types are generated as the fingerprints merge. The fingerprint information of type 1 is the data originally written into the fingerprint analyzer, the data fingerprint is taken as Key, and Value comprises type, logical address and physical address. When a plurality of fingerprint information of type 1 is combined into one piece of fingerprint information, the fingerprint information of type 2 is obtained, the fingerprint information of type 2 still takes the data fingerprint as Key, the Value contains the repetition amount of the fingerprint information of type 1, and the logical address and the physical address of each piece of fingerprint information of type 1 before combination. When the type 2 fingerprint information includes the type 1 fingerprint information, the repeated number exceeds a certain threshold, the repeated data is deleted according to the logical address and the physical address recorded in the type 2 fingerprint information, and the fingerprint is added into the fingerprint table. After the data recorded in the fingerprint information of the type 2 is deleted, the logical address and the physical address do not need to be recorded again, the fingerprint information of the type 3 is evolved, the fingerprint information of the type 3 still takes the data fingerprint as a Key, and the Value comprises the type and the fingerprint table version number.
The fingerprint table version number in the type 3 fingerprint information is related to fingerprint table cleaning. By fingerprint table cleaning, it is meant that fingerprints in the fingerprint table that are no longer referenced are deleted from the fingerprint table. Preferably, the fingerprints which are not referenced are not deleted from the fingerprint table immediately, but the fingerprint table is cleaned periodically by a certain algorithm, and the fingerprint table is cleaned after one half month or other events. The version number of the fingerprint table described in the fingerprint information of the type 3 is 0, and the version number is increased by 1 along with each cleaning of the fingerprint table. The version number of the fingerprint table recorded in the type 3 fingerprint information is the version number of the fingerprint table when the fingerprint is written into the fingerprint table after the number of the repeated fingerprints reaches the threshold value.
When the fingerprint information of the type 1 and the fingerprint information of the type 3 are combined, if the version number of the recorded fingerprint table is equal to the version number of the current fingerprint table, fingerprint cleaning is not triggered after the fingerprint is written into the fingerprint table. The fingerprint information of the type 3 is still effective, and the fingerprint is directly deleted according to the logical address and the physical address recorded by the fingerprint information of the type 1, so that the fingerprint is not required to be newly added into the fingerprint table. If the recorded version number is smaller than the version number of the current fingerprint table, fingerprint cleaning is triggered after the fingerprint is written into the fingerprint table. At this time, the fingerprint table needs to be queried, if the fingerprint is still in the fingerprint table, the version number of the fingerprint information of the type 3 is updated to the version number of the current fingerprint table, and the fingerprint information combined with the type 3 is subjected to erasure. If the fingerprint is not in the fingerprint table, the fingerprint information of the type 1 is combined with the fingerprint information of the type 3, only the fingerprint information of the type 1 is reserved, the repeated number of the fingerprints is restored to 1, and the subsequent repeated data writing is waited for and then repeated deleting is carried out.
The fingerprint information of type 2 is similar to the fingerprint information of type 3 in merging process, except that a plurality of data blocks are deleted again and again when deleting again, if the fingerprint does not exist in the fingerprint table, the reserved type is the fingerprint information of type 2. When the fingerprint information of the type 3 is combined with the fingerprint information of the type 3, the version number is triggered to update, and if the version number is not in the fingerprint table, the fingerprint information of the type 3 is deleted.
It should be noted that, the above combinations refer to combinations of different types of fingerprint information with the same key (data fingerprint).
Referring to fig. 4 for exemplary purposes, a schematic diagram of a distributed storage system is provided according to an embodiment of the present application, the distributed storage system includes a data processing module 40 and a fingerprint analyzer 41 disposed on each host, wherein,
the data processing module 40 is configured to calculate a data fingerprint of each data block after downloading, hash each data fingerprint, and send first type fingerprint information to a corresponding fingerprint analyzer based on a hash result, where a first type fingerprint information includes a data fingerprint corresponding to a data block and address information of the data block;
the target fingerprint analyzer 41 is configured to receive the first type of fingerprint information sent by the data processing module, and perform merging processing on the first type of fingerprint information with the same data fingerprint to obtain corresponding second type of fingerprint information when determining that the number of locally stored first type of fingerprint information reaches a first set threshold, where one second type of fingerprint information includes a data fingerprint corresponding to one data block, a repetition number of the first type of fingerprint information corresponding to the data block, and one or more address information of the data block;
The target fingerprint analyzer 41 is further configured to determine whether the number of repetitions of the first type of fingerprint information in each second type of fingerprint information reaches a second set threshold, and if it is determined that the number of repetitions of the first type of fingerprint information in one second type of fingerprint information reaches the second set threshold, delete the repeated data based on the second type of fingerprint information.
Optionally, the target fingerprint analyzer 41 is specifically configured to, when receiving the first type of fingerprint information sent by the data processing module and determining that the number of locally stored first type of fingerprint information reaches a first set threshold, combine the first type of fingerprint information with the same data fingerprint to obtain the corresponding second type of fingerprint information:
when receiving a first type fingerprint information sent by the data processing module, writing the first type fingerprint information into a specified file with preset capacity, and if the size of the specified file reaches the preset capacity, packaging the specified file to generate a new specified file;
and when the number of the packaged specified files reaches a third set threshold, merging the first type fingerprint information with the same data fingerprint in the packaged specified files to obtain the corresponding second type fingerprint information.
Optionally, recording first-class fingerprint information and second-class fingerprint information in a key value pair mode, wherein key values of the first-class fingerprint information are data fingerprints, and value values comprise address information of the first-class fingerprint information and data blocks;
when the first type fingerprint information with the same data fingerprint is combined to obtain the corresponding second type fingerprint information, the target fingerprint analyzer 41 is specifically configured to:
and merging the plurality of first type fingerprint information with the same fingerprint data into second type fingerprint information, wherein the key value of the second type fingerprint information is a data fingerprint, the value comprises the repetition number of the first type fingerprint information of a data block corresponding to the fingerprint data, and the plurality of address information of the data block.
Optionally, the target fingerprint analyzer 41 is further configured to:
storing fingerprint information of the data block after the repeated data deletion is executed into a fingerprint table, wherein the fingerprint information comprises data fingerprints of the data block and address information of the data block;
and generating third type fingerprint information corresponding to the data block, wherein the third type fingerprint information is recorded in a key value pair mode, the key value of the third type fingerprint information is a data fingerprint, the value comprises the repetition number of the first type fingerprint information of the data block corresponding to the fingerprint data, and the version number of the fingerprint table, wherein the fingerprint table performs fingerprint cleaning once, and the version number of the fingerprint table is increased by one.
Optionally, the target fingerprint analyzer 41 is further configured to:
after receiving the first type fingerprint information of a data block, judging whether third type fingerprint information corresponding to the data block exists locally;
if the third type fingerprint information corresponding to the data block is judged to exist locally, judging whether the version number of the fingerprint table recorded by the third type fingerprint information is smaller than the current version number of the fingerprint table;
if the version number of the fingerprint table recorded by the third type fingerprint information is smaller than the current version number of the fingerprint table, searching whether fingerprint information corresponding to the data block exists in the fingerprint table, if so, executing repeated data deleting processing, and updating the version number of the fingerprint table of the third type fingerprint information path; if not, recording the first type fingerprint information.
The above units may be one or more integrated circuits configured to implement the above methods, for example: one or more application specific integrated circuits (Application Specific Integrated Circuit, abbreviated as ASIC), or one or more microprocessors (digital singnal processor, abbreviated as DSP), or one or more field programmable gate arrays (Field Programmable Gate Array, abbreviated as FPGA), or the like. For another example, when a unit is implemented in the form of a processing element scheduler code, the processing element may be a general purpose processor, such as a central processing unit (Central Processing Unit, CPU) or other processor that may invoke the program code. For another example, the units may be integrated together and implemented in the form of a system-on-a-chip (SOC).
Further, in the data de-duplication apparatus provided in the embodiments of the present application, as for a hardware architecture schematic of the data de-duplication apparatus may be shown in fig. 5, the data de-duplication apparatus may include: a memory 50 and a processor 51,
memory 50 is used to store program instructions; the processor 51 calls the program instructions stored in the memory 50 and executes the above-described method embodiments according to the obtained program instructions. The specific implementation manner and the technical effect are similar, and are not repeated here.
Optionally, the present application also provides a distributed storage device comprising at least one processing element (or chip) for performing the above-described method embodiments.
Optionally, the present application also provides a program product, such as a computer readable storage medium, storing computer executable instructions for causing the computer to perform the above-described method embodiments.
Here, a machine-readable storage medium may be any electronic, magnetic, optical, or other physical storage device that may contain or store information, such as executable instructions, data, or the like. For example, a machine-readable storage medium may be: RAM (Radom Access Memory, random access memory), volatile memory, non-volatile memory, flash memory, a storage drive (e.g., hard drive), a solid state drive, any type of storage disk (e.g., optical disk, dvd, etc.), or a similar storage medium, or a combination thereof.
The system, apparatus, module or unit set forth in the above embodiments may be implemented in particular by a computer chip or entity, or by a product having a certain function. A typical implementation device is a computer, which may be in the form of a personal computer, laptop computer, cellular telephone, camera phone, smart phone, personal digital assistant, media player, navigation device, email device, game console, tablet computer, wearable device, or a combination of any of these devices.
For convenience of description, the above devices are described as being functionally divided into various units, respectively. Of course, the functions of each element may be implemented in one or more software and/or hardware elements when implemented in the present application.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present application may take the form of a computer program product on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
Moreover, these computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The foregoing description of the preferred embodiments of the present invention is not intended to limit the invention to the precise form disclosed, and any modifications, equivalents, improvements and alternatives falling within the spirit and principles of the present invention are intended to be included within the scope of the present invention.

Claims (10)

CN202111058390.4A2021-09-092021-09-09Method and device for deleting repeated dataActiveCN113986891B (en)

Priority Applications (1)

Application NumberPriority DateFiling DateTitle
CN202111058390.4ACN113986891B (en)2021-09-092021-09-09Method and device for deleting repeated data

Applications Claiming Priority (1)

Application NumberPriority DateFiling DateTitle
CN202111058390.4ACN113986891B (en)2021-09-092021-09-09Method and device for deleting repeated data

Publications (2)

Publication NumberPublication Date
CN113986891A CN113986891A (en)2022-01-28
CN113986891Btrue CN113986891B (en)2024-03-12

Family

ID=79735588

Family Applications (1)

Application NumberTitlePriority DateFiling Date
CN202111058390.4AActiveCN113986891B (en)2021-09-092021-09-09Method and device for deleting repeated data

Country Status (1)

CountryLink
CN (1)CN113986891B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication numberPriority datePublication dateAssigneeTitle
CN104978151A (en)*2015-06-192015-10-14浪潮电子信息产业股份有限公司Application awareness based data reconstruction method in repeated data deletion and storage system
CN106610790A (en)*2015-10-262017-05-03华为技术有限公司Repeated data deleting method and device
WO2021016728A1 (en)*2019-07-262021-02-04华为技术有限公司Data processing method and device in storage system, and computer readable storage medium
CN112800057A (en)*2021-01-222021-05-14新华三大数据技术有限公司Fingerprint table management method and device

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication numberPriority datePublication dateAssigneeTitle
US11055005B2 (en)*2018-10-122021-07-06Netapp, Inc.Background deduplication using trusted fingerprints

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication numberPriority datePublication dateAssigneeTitle
CN104978151A (en)*2015-06-192015-10-14浪潮电子信息产业股份有限公司Application awareness based data reconstruction method in repeated data deletion and storage system
CN106610790A (en)*2015-10-262017-05-03华为技术有限公司Repeated data deleting method and device
WO2021016728A1 (en)*2019-07-262021-02-04华为技术有限公司Data processing method and device in storage system, and computer readable storage medium
CN112800057A (en)*2021-01-222021-05-14新华三大数据技术有限公司Fingerprint table management method and device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
基于重复数据删除技术的存储系统分析;朱江;冀鸣;杨志成;张嘉贤;曹雄;;信息系统工程;20170420(04);全文*
基于重复数据删除技术的雾存储数据去冗余方案;陈思佳;温蜜;陈珊;;计算机应用与软件;20200212(02);全文*

Also Published As

Publication numberPublication date
CN113986891A (en)2022-01-28

Similar Documents

PublicationPublication DateTitle
US11461027B2 (en)Deduplication-aware load balancing in distributed storage systems
US10809928B2 (en)Efficient data deduplication leveraging sequential chunks or auxiliary databases
CN105843551B (en) Data Integrity and Loss Resistance in High Performance and Mass Storage Deduplication
CN107491523B (en) Method and apparatus for storing data objects
CN108268219B (en)Method and device for processing IO (input/output) request
CN103902623B (en)Method and system for accessing files on a storage system
US10078648B1 (en)Indexing deduplicated data
CN107329692A (en)Method and storage device that a kind of data are deleted again
US10503608B2 (en)Efficient management of reference blocks used in data deduplication
US10474588B1 (en)Method and system for memory-based data caching
JP6098301B2 (en) Storage control device, storage control method, and storage control program
US11144508B2 (en)Region-integrated data deduplication implementing a multi-lifetime duplicate finder
JP6089890B2 (en) Storage control device, storage control device control method, and storage control device control program
CN108021562B (en)Disk storage method and device applied to distributed file system and distributed file system
US10282127B2 (en)Managing data in a storage system
US10585802B1 (en)Method and system for caching directories in a storage system
CN112800057B (en)Fingerprint table management method and device
US10831370B1 (en)Deduplicated and compressed non-volatile memory cache
CN113986891B (en)Method and device for deleting repeated data
KR101144321B1 (en)Methods of managing buffer cache using solid state disk as an extended buffer and apparatuses for using solid state disk as an extended buffer
Song et al.Efficient key-value stores with ranged log-structured merge trees
JP2010191903A (en)Distributed file system striping class selecting method and distributed file system
US10664442B1 (en)Method and system for data consistency verification in a storage system
US11847334B2 (en)Method or apparatus to integrate physical file verification and garbage collection (GC) by tracking special segments
US11150827B2 (en)Storage system and duplicate data management method

Legal Events

DateCodeTitleDescription
PB01Publication
PB01Publication
SE01Entry into force of request for substantive examination
SE01Entry into force of request for substantive examination
GR01Patent grant
GR01Patent grant

[8]ページ先頭

©2009-2025 Movatter.jp