Summary of the invention
The main object of the present invention be to provide a kind of distributed memory system, memory node equipment, data duplicate removal method andComputer readable storage medium, it is intended to improve the deduplicated efficiency of distributed memory system.
To achieve the above object, the present invention proposes that a kind of distributed memory system, the distributed memory system include moreA memory node equipment and several shared fingerprint bases, communicate to connect, institute between the memory node equipment and shared fingerprint baseIt states and is provided with local fingerprint base in memory node equipment, alternatively, the memory node equipment is communicated with corresponding local fingerprint baseConnection, the memory node equipment are used for:
Receive data slice write request, the data slice write request includes several data slices to be written and each describedThe fingerprint of data slice to be written;
Determine in the fingerprint of the data slice to be written to duplicate removal fingerprint, and search in local fingerprint base it is each toDuplicate removal fingerprint whether there is, and the local fingerprint base includes the fingerprint of storing data piece in the memory node equipment;
It is one or more when duplicate removal fingerprint is present in the local fingerprint base when having, by one or more of to duplicate removalThe corresponding data slice to be written of fingerprint is deleted;
It is one or more when duplicate removal fingerprint is not present in the local fingerprint base when having, by one or more of wait goWeight fingerprint searches each fingerprint to be processed as fingerprint to be processed, and in shared fingerprint base, and the shared data bank includes instituteThere is in memory node equipment the fingerprint of storing data piece, it is one or more to be processed when being found in the shared fingerprint baseWhen fingerprint, the corresponding data slice to be written of the one or more of fingerprints to be processed found is deleted.
Preferably, the data slice to be written is obtained by data cutting to be written, and the data slice write request further includesData slice fingerprint sequence, the data slice fingerprint sequence include the fingerprint for each data slice to be written being arranged in order;
In the fingerprint for determining the data slice to be written includes: to judge the data to be written to duplicate removal fingerprintWhether there is the fingerprint of redundancy in the fingerprint of piece, and if it exists, then delete the fingerprint of the redundancy, and using remaining fingerprint as toDuplicate removal fingerprint, if it does not exist, then using the fingerprint of all data slices to be written as to duplicate removal fingerprint;
The memory node equipment is also used to:
The data slice fingerprint sequence is saved into the local fingerprint base and shared fingerprint base;
The memory node equipment is also used to:
It saves all remaining data slices to be written, and the corresponding storage location of the remaining data slice to be written is believedBreath is saved into the local fingerprint base and shared fingerprint base.
Preferably, the distributed memory system further includes communicating to connect with each memory node equipment and shared fingerprint baseControl node equipment, the memory node equipment is also used to:
Determine the reference count changing value of the fingerprint of each data slice to be written, and by the fingerprint of each data slice to be writtenReference count changing value be sent to control node equipment;
The control node equipment is used for:
According to the reference count changing value of the fingerprint of each data slice to be written, the fingerprint of each data slice to be written is updatedAccumulative reference count.
Preferably, the memory node equipment is also used to:
Receive the removal request of a data to be deleted;
The data slice fingerprint sequence of the data to be deleted is obtained, determines and respectively refers in the data slice fingerprint sequence obtainedThe reference count changing value of line, and the reference count changing value of each fingerprint in the data slice fingerprint sequence is sent to control nodeEquipment;
The control node equipment is also used to:
According to the reference count changing value of each fingerprint in the data slice fingerprint sequence, the data slice fingerprint sequence is updatedIn each fingerprint accumulative reference count, and the data slice fingerprint sequence of the data to be deleted is deleted from the shared fingerprint baseIt removes, and the memory node equipment is notified to delete the data slice fingerprint sequence of the data to be deleted from local fingerprint base;
When detecting the accumulative reference count of a fingerprint in the shared fingerprint base is zero, records the fingerprint and keepThe duration for the state that accumulative reference count is zero deletes the fingerprint when the duration is greater than preset duration, andCorresponding memory node equipment is notified to delete the fingerprint and the corresponding data slice of the fingerprint.
In addition, to achieve the above object, the present invention also proposes that a kind of data duplicate removal method, this method are deposited suitable for distributionStorage system, the distributed memory system include multiple memory node equipment and several shared fingerprint bases, the memory nodeIt is communicated to connect between equipment and shared fingerprint base, is provided with local fingerprint base in the memory node equipment, alternatively, the storageNode device is communicated to connect with corresponding local fingerprint base, the method includes the steps:
Receiving step: memory node equipment receives data slice write request, and the data slice write request includes severalThe fingerprint of data slice to be written and each data slice to be written;
Query steps: memory node equipment determine in the fingerprint of the data slice to be written to duplicate removal fingerprint, and inSearched in local fingerprint base it is each whether there is to duplicate removal fingerprint, it is described local fingerprint base include in the memory node equipmentThe fingerprint of storing data piece;
First duplicate removal step:, memory node one or more when duplicate removal fingerprint is present in the local fingerprint base when havingEquipment is deleted one or more of to the corresponding data slice to be written of duplicate removal fingerprint;
Second duplicate removal step:, storage section one or more when duplicate removal fingerprint is not present in the local fingerprint base when havingPoint device using it is one or more of to duplicate removal fingerprint as fingerprint to be processed, and search in shared fingerprint base each to be processedFingerprint, the shared data bank includes the fingerprint of storing data piece in all memory node equipment, when in the shared fingerprintIt is when finding one or more fingerprints to be processed in library, the one or more of fingerprints to be processed found are corresponding to be writtenEnter data slice deletion.
Preferably, the data slice to be written is obtained by data cutting to be written, and the data slice write request further includesData slice fingerprint sequence, the data slice fingerprint sequence include the fingerprint for each data slice to be written being arranged in order;
In the fingerprint for determining the data slice to be written to the step of duplicate removal fingerprint include: judge it is described to be writtenEnter to whether there is in the fingerprint of data slice the fingerprint of redundancy, and if it exists, then delete the fingerprint of the redundancy, and by remaining fingerprintAs to duplicate removal fingerprint, if it does not exist, then using the fingerprint of all data slices to be written as to duplicate removal fingerprint;
After the receiving step, the method also includes:
Memory node equipment saves the data slice fingerprint sequence into the local fingerprint base and shared fingerprint base;
After the second duplicate removal step, the method also includes:
Memory node equipment saves all remaining data slices to be written, and the remaining data slice to be written is correspondingStorage location information preservation into the local fingerprint base and shared fingerprint base.
Preferably, the distributed memory system further includes communicating to connect with each memory node equipment and shared fingerprint baseControl node equipment, after the receiving step, the method also includes:
Memory node equipment determines the reference count changing value of the fingerprint of each data slice to be written, and will be each to be writtenThe reference count changing value of the fingerprint of data slice is sent to control node equipment;
Control node equipment updates each to be written according to the reference count changing value of the fingerprint of each data slice to be writtenThe accumulative reference count of the fingerprint of data slice.
Preferably, the method also includes:
Memory node equipment receives the removal request of a data to be deleted;
Memory node equipment obtains the data slice fingerprint sequence of the data to be deleted, determines that the data slice obtained refers toThe reference count changing value of each fingerprint in line sequence, and send the reference count variation of each fingerprint in the data slice fingerprint sequenceIt is worth to control node equipment;
Control node equipment updates the number according to the reference count changing value of each fingerprint in the data slice fingerprint sequenceAccording to the accumulative reference count of each fingerprint in piece fingerprint sequence, and by the data slice fingerprint sequence of the data to be deleted from described totalFingerprint base deletion is enjoyed, and notifies the memory node equipment by the data slice fingerprint sequence of the data to be deleted from local fingerprintIt deletes in library;
When detecting the accumulative reference count of a fingerprint in the shared fingerprint base is zero, control node equipment recordThe fingerprint keeps the duration for the state that accumulative reference count is zero, when the duration is greater than preset duration, deletesExcept the fingerprint, and corresponding memory node equipment is notified to delete the fingerprint and the corresponding data slice of the fingerprint.
In addition, to achieve the above object, the present invention also proposes a kind of memory node equipment, and the memory node equipment is togetherIt enjoys and being communicated to connect between fingerprint base, be provided with local fingerprint base in the memory node equipment, alternatively, the memory node equipmentIt is communicated to connect with corresponding local fingerprint base, the memory node equipment includes memory and processor, is deposited on the memoryData deduplication program is contained, the data deduplication program realizes following steps when being executed by the processor:
Receiving step: data slice write request is received, the data slice write request includes several data slices to be writtenAnd the fingerprint of each data slice to be written;
Query steps: determine in the fingerprint of the data slice to be written to duplicate removal fingerprint, and in local fingerprint baseSearch it is each whether there is to duplicate removal fingerprint, the local fingerprint base includes storing data piece in the memory node equipmentFingerprint;
First duplicate removal step: it is one or more when duplicate removal fingerprint is present in the local fingerprint base when having, by described oneIt is a or multiple to the corresponding data slice deletion to be written of duplicate removal fingerprint;
Second duplicate removal step: it is one or more when duplicate removal fingerprint is not present in the local fingerprint base when having, it will be describedOne or more searches each fingerprint to be processed as fingerprint to be processed, and in shared fingerprint base to duplicate removal fingerprint, described totalEnjoying database includes the fingerprint of storing data piece in all memory node equipment, when finding one in the shared fingerprint baseWhen a or multiple fingerprints to be processed, the corresponding data slice to be written of one or more of fingerprints to be processed found is deletedIt removes.
Preferably, the data slice to be written is obtained by data cutting to be written, and the data slice write request further includesData slice fingerprint sequence, the data slice fingerprint sequence include the fingerprint for each data slice to be written being arranged in order;
In the fingerprint for determining the data slice to be written to the step of duplicate removal fingerprint include: judge it is described to be writtenEnter to whether there is in the fingerprint of data slice the fingerprint of redundancy, and if it exists, then delete the fingerprint of the redundancy, and by remaining fingerprintAs to duplicate removal fingerprint, if it does not exist, then using the fingerprint of all data slices to be written as to duplicate removal fingerprint;
The processor executes the data deduplication program, after the receiving step, also realizes following steps:
The data slice fingerprint sequence is saved into the local fingerprint base and shared fingerprint base;
The processor executes the data deduplication program, after the second duplicate removal step, also realizes following steps:
It saves all remaining data slices to be written, and the corresponding storage location of the remaining data slice to be written is believedBreath is saved into the local fingerprint base and shared fingerprint base.
Preferably, the distributed memory system further includes communicating to connect with each memory node equipment and shared fingerprint baseControl node equipment, the processor executes the data deduplication program, after the receiving step, also realizes following stepIt is rapid:
Determine the reference count changing value of the fingerprint of each data slice to be written, and by the fingerprint of each data slice to be writtenReference count changing value be sent to control node equipment, for the control node equipment according to the finger of each data slice to be writtenThe reference count changing value of line updates the accumulative reference count of the fingerprint of each data slice to be written.
In addition, to achieve the above object, the present invention also proposes a kind of computer readable storage medium, it is suitable for memory nodeEquipment communicates to connect between the memory node equipment and shared fingerprint base, is provided in the memory node equipment and locally refers toLine library, alternatively, the memory node equipment is communicated to connect with corresponding local fingerprint base, the computer readable storage medium is depositedData deduplication program is contained, the data deduplication program can be executed by least one processor, so that at least one described processingDevice executes following steps:
Receiving step: data slice write request is received, the data slice write request includes several data slices to be writtenAnd the fingerprint of each data slice to be written;
Query steps: determine in the fingerprint of the data slice to be written to duplicate removal fingerprint, and in local fingerprint baseSearch it is each whether there is to duplicate removal fingerprint, the local fingerprint base includes storing data piece in the memory node equipmentFingerprint;
First duplicate removal step: it is one or more when duplicate removal fingerprint is present in the local fingerprint base when having, by described oneIt is a or multiple to the corresponding data slice deletion to be written of duplicate removal fingerprint;
Second duplicate removal step: it is one or more when duplicate removal fingerprint is not present in the local fingerprint base when having, it will be describedOne or more searches each fingerprint to be processed as fingerprint to be processed, and in shared fingerprint base to duplicate removal fingerprint, described totalEnjoying database includes the fingerprint of storing data piece in all memory node equipment, when finding one in the shared fingerprint baseWhen a or multiple fingerprints to be processed, the corresponding data slice to be written of one or more of fingerprints to be processed found is deletedIt removes.
Preferably, the data slice to be written is obtained by data cutting to be written, and the data slice write request further includesData slice fingerprint sequence, the data slice fingerprint sequence include the fingerprint for each data slice to be written being arranged in order;
In the fingerprint for determining the data slice to be written to the step of duplicate removal fingerprint include: judge it is described to be writtenEnter to whether there is in the fingerprint of data slice the fingerprint of redundancy, and if it exists, then delete the fingerprint of the redundancy, and by remaining fingerprintAs to duplicate removal fingerprint, if it does not exist, then using the fingerprint of all data slices to be written as to duplicate removal fingerprint;
The processor executes the data deduplication program, after the receiving step, also realizes following steps:
The data slice fingerprint sequence is saved into the local fingerprint base and shared fingerprint base;
The processor executes the data deduplication program, after the second duplicate removal step, also realizes following steps:
It saves all remaining data slices to be written, and the corresponding storage location of the remaining data slice to be written is believedBreath is saved into the local fingerprint base and shared fingerprint base.
Preferably, the distributed memory system further includes communicating to connect with each memory node equipment and shared fingerprint baseControl node equipment, the processor executes the data deduplication program, after the receiving step, also realizes following stepIt is rapid:
Determine the reference count changing value of the fingerprint of each data slice to be written, and by the fingerprint of each data slice to be writtenReference count changing value be sent to control node equipment, for the control node equipment according to the finger of each data slice to be writtenThe reference count changing value of line updates the accumulative reference count of the fingerprint of each data slice to be written.
Compared with prior art, one memory node equipment of the present embodiment is when carrying out data deduplication, if in local fingerprint baseIn do not inquire the fingerprint of a data slice to be written, then directly can inquire whether the fingerprint is duplicate finger in shared fingerprint baseLine, without carrying out communication inquiry one by one with other memory node equipment, this improves the data deduplications of distributed memory systemEfficiency.
Specific embodiment
The principle and features of the present invention will be described below with reference to the accompanying drawings, and the given examples are served only to explain the present invention, andIt is non-to be used to limit the scope of the invention.
As shown in fig.1, being the system architecture diagram of one embodiment of distributed memory system of the present invention.
In the present embodiment, the distributed memory system includes multiple memory node equipment 1 and several shared fingerprintsLibrary 2, communication connection (for example, being communicated to connect by network 4), described to deposit between the memory node equipment 1 and shared fingerprint base 2It is provided with local fingerprint base 3 in storage node device 1, alternatively, the memory node equipment 1 is communicated with corresponding local fingerprint base 3Connection.The local fingerprint base 3 includes the fingerprint of storing data piece in corresponding memory node equipment 1, the shared fingerprint base 2Including the fingerprint of storing data piece in all memory node equipment 1.
The memory node equipment 1 is used for:
Receive data slice write request, the data slice write request includes several data slices to be written and each describedThe fingerprint of data slice to be written;
Determine in the fingerprint of the data slice to be written to duplicate removal fingerprint, and search in local fingerprint base 3 eachIt whether there is to duplicate removal fingerprint;
It is one or more when duplicate removal fingerprint is present in the local fingerprint base 3 when having, by one or more of wait goThe corresponding data slice to be written of weight fingerprint is deleted;
It is one or more when duplicate removal fingerprint is not present in the local fingerprint base 3 when having, by it is one or more of toDuplicate removal fingerprint searches each fingerprint to be processed as fingerprint to be processed in shared fingerprint base 2, when in the shared fingerprint baseIt is when finding one or more fingerprints to be processed in 2, the one or more of fingerprints to be processed found are corresponding to be writtenEnter data slice deletion.
In the present embodiment, memory node equipment 1 receives data slice write request, and the data slice write request includes severalThe fingerprint of a data slice to be written and each data slice to be written.The data slice to be written is (described by data to be writtenThe data type of data to be written includes block grade data, file-level data) cutting obtains.The operation of the cutting can be by memory nodeEquipment 1 executes, or is executed by other any applicable equipment (for example, client), and cutting method includes:
It is written into the data slice that data file is cut into the preset quantity of identical data size.Alternatively, by preset quantityIt is denoted as M, when M is natural number greater than 1, the corresponding data slice size of data file to be written described in cutting is determined, according to trueFixed data slice size is syncopated as the identical M-1 data block of size one by one, and remaining after cutting is m-th data block.ItsIn, the size of data slice to be written can be 4KB, 8KB, 12KB, 16KB or other granule sizes.
After being written into data and being cut into several data slices to be written, the fingerprint of each data slice to be written is calculated,For example, passing through Message-Digest Algorithm 5 (Message-Digest Algorithm 5, MD5), Secure Hash Algorithm (SecureHash Algorithm, SHA1) etc. calculate the fingerprint of each data slice to be written, meanwhile, record the row of each data slice to be writtenColumn sequence (i.e. data slice fingerprint sequence) is written into data slice according to the data when being used for subsequent reading data to be writtenPiece fingerprint sequence is assembled into the data to be written.In addition, memory node equipment 1 the data slice fingerprint sequence can also be saved toIn the local fingerprint base 3 and shared fingerprint base 2.
Then, memory node equipment 1 determine in the fingerprint of the data slice to be written to duplicate removal fingerprint, determine wait goThe method of weight fingerprint includes: the fingerprint that whether there is redundancy in the fingerprint for judge the data slice to be written, and if it exists, is then deletedThe fingerprint of the redundancy, and using remaining fingerprint as to duplicate removal fingerprint, if it does not exist, then by all data slices to be writtenFingerprint be used as to duplicate removal fingerprint.
For example, memory node equipment 1 judges in the fingerprint of all data slices to be written with the presence or absence of identical fingerprint.If depositingIn identical fingerprint, then using identical fingerprint as a fingerprint group, after finding out all fingerprint groups, in each fingerprint groupMiddle one fingerprint of selection retains, and deletes non-selected fingerprint as the fingerprint of redundancy, and judge whether there is ungroupedFingerprint, if so, using each ungrouped fingerprint as to duplicate removal fingerprint, if it is not, then terminating process.Identical finger if it does not existLine, then using the fingerprint of all data slices to be written as ungrouped fingerprint, then using each ungrouped fingerprint as to duplicate removalFingerprint.
It is identifying after duplicate removal fingerprint, memory node equipment 1 is searched in local fingerprint base 3 each is to duplicate removal fingerprintNo presence.One or more when duplicate removal fingerprint is present in the local fingerprint base 3 when having, memory node equipment 1 is by described oneIt is a or multiple to the corresponding data slice deletion to be written of duplicate removal fingerprint.When there is one or more to duplicate removal fingerprint to be present in describedWhen ground fingerprint base 3, represents these and wait for that the corresponding data slice to be written of duplicate removal fingerprint is duplicate data slice, it is empty in order to save storageBetween, these duplicate data slices are deleted.
Finally, memory node equipment 1 one or more when duplicate removal fingerprint is not present in the local fingerprint base 3 when havingUsing it is one or more of to duplicate removal fingerprint as fingerprint to be processed, and each fingerprint to be processed is searched in shared fingerprint base 2,When finding one or more fingerprints to be processed in the shared fingerprint base 2, by find it is one or more of toThe corresponding data slice to be written of fingerprint is handled to delete.
Possess the finger print data of full dose in fingerprint base 2 due to sharing, if memory node equipment 1 is in local fingerprint base 3In do not inquire one to duplicate removal fingerprint, then continue inquiry in shared fingerprint base 2 should whether there is to duplicate removal fingerprint, and if it exists,Then determining should be present in other memory nodes to the corresponding data slice to be written of duplicate removal fingerprint, belong to duplicate data slice,It no longer needs to carry out storage processing to the data slice to be written.
Compared with prior art, one memory node equipment 1 of the present embodiment is when carrying out data deduplication, if in local fingerprint baseThe fingerprint of a data slice to be written is not inquired in 3, then directly can inquire whether the fingerprint is duplicate in shared fingerprint base 2Fingerprint, without carrying out communication inquiry one by one with other memory node equipment 1, this improves the data of distributed memory systemDeduplicated efficiency.
Further, in the present embodiment, the memory node equipment 1 is also used to:
Memory node equipment 1 saves all remaining data slices (i.e. not duplicate data slice) to be written, and will be described surplusThe corresponding storage location information preservation of remaining data slice to be written is into the local fingerprint base 3 and shared fingerprint base 2.
Further, in the present embodiment, the distributed memory system further includes control node equipment 5, the controlNode device 5 is respectively with the memory node equipment 1 and the communication connection of shared fingerprint base 2 (for example, pass through 4 communication link of networkIt connects).The shared fingerprint base 2 may be disposed in a shared disk (the NVME disk as passed through NVMEOF carry), the shared diskIt may be disposed in control node equipment 5, be independently of the setting of control node equipment 5.
The memory node equipment 1 is also used to:
Determine the reference count changing value of the fingerprint of each data slice to be written (for example, determining each to duplicate removal fingerprintReference count changing value is+1), and the reference count changing value of the fingerprint of each data slice to be written is sent to control sectionPoint device 5.
The control node equipment 5 is used for:
According to the reference count changing value of the fingerprint of each data slice to be written, the fingerprint of each data slice to be written is updatedAccumulative reference count (the accumulative reference count of a fingerprint represent the corresponding data slice of the fingerprint quoted by storing data it is totalNumber).
Further, in the present embodiment, the memory node equipment 1 is also used to:
The removal request of a data to be deleted is received, the data slice fingerprint sequence of the data to be deleted is obtained, determination obtainsThe reference count changing value of each fingerprint is (for example, determine in the data slice fingerprint sequence in the data slice fingerprint sequence takenThe reference count changing value of each fingerprint is -1), and send the reference count of each fingerprint in the data slice fingerprint sequenceChanging value is to control node equipment 5.
The control node equipment 5 is also used to:
According to the reference count changing value of each fingerprint in the data slice fingerprint sequence, the data slice fingerprint sequence is updatedIn each fingerprint accumulative reference count, and the data slice fingerprint sequence of the data to be deleted is deleted from the shared fingerprint base 2It removes, and the memory node equipment 1 is notified to delete the data slice fingerprint sequence of the data to be deleted from local fingerprint base 3.
In the present embodiment, to delete data, memory node equipment 1 cannot be directly straight by the data slice of the data to be deletedConnect deletion because memory node equipment 1 can not determine the data to be deleted data slice whether simultaneously by other data referencings, such asFruit directly deletes the data slice of the data to be deleted, then is likely to result in loss of data.Thus, it is only required to update the data to be deletedData slice sequence in each fingerprint accumulative reference count, and by the data slice fingerprint sequence of the data to be deleted delete.
Further, in the present embodiment, the control node equipment 5 is also used to:
When the accumulative reference count for detecting a fingerprint in the shared fingerprint base 2 is zero (the i.e. corresponding number of the fingerprintAccording to piece not by any data referencing) when, record the duration that the fingerprint keeps adding up the state that reference count is zero.
When the duration is greater than preset duration, the fingerprint is deleted, and corresponding memory node equipment 1 is notified to deleteExcept the fingerprint and the corresponding data slice of the fingerprint.
When the duration is less than or equal to preset duration, do not make delete processing.
In the present embodiment, control node equipment 5 need to undergo one section when the accumulative reference count for detecting a fingerprint is zeroThe corresponding data slice of the fingerprint is deleted again after preset duration, and each memory node equipment 1 of real-time reception in the preset durationThe reference count changing value of the fingerprint reported, to avoid due to memory node equipment 1 reports reference count changing value not in timeCaused by data accidentally delete.
Further, in the present embodiment, the memory node equipment 1 is also used to:
When receiving the read requests of a data to be read, the data slice fingerprint sequence of the data to be read is obtained, andThe storage location information for obtaining the corresponding data slice of each fingerprint in the data slice fingerprint sequence, according to the storage position of acquisitionConfidence breath obtains the corresponding data slice of each fingerprint in the data slice fingerprint sequence, then the data slice that will acquire is according to the numberThe data to be read are assembled into according to piece fingerprint sequence.
The present invention proposes a kind of data deduplication program.
Referring to Fig. 2, being the running environment schematic diagram of 10 1 embodiment of data deduplication program of the present invention.
In the present embodiment, data deduplication program 10 is installed and is run in memory node equipment 1.Memory node equipment 1It can be desktop PC, notebook, palm PC and server etc. and calculate equipment.The memory node equipment 1 may include,But it is not limited only to, memory 11, processor 12 and display 13.Fig. 2 illustrates only the memory node equipment with component 11-131, it should be understood that be not required for implementing all components shown, the implementation that can be substituted is more or less component.
Memory 11 can be the internal storage unit of memory node equipment 1, such as the storage in some embodimentsThe hard disk or memory of node device 1.Memory 11 is also possible to the external storage of memory node equipment 1 in further embodimentsThe plug-in type hard disk being equipped in equipment, such as memory node equipment 1, intelligent memory card (Smart Media Card, SMC), peaceDigital (Secure Digital, SD) card, flash card (Flash Card) etc..Further, memory 11 can also be wrapped bothThe internal storage unit for including memory node equipment 1 also includes External memory equipment.Memory 11 is installed on storage section for storingApplication software and Various types of data, such as the program code of data deduplication program 10 of point device 1 etc..Memory 11 can be also used forTemporarily store the data that has exported or will export.
Processor 12 can be in some embodiments a central processing unit (Central Processing Unit,CPU), microprocessor or other data processing chips, program code or processing data for being stored in run memory 11, exampleSuch as execute data deduplication program 10.
Display 13 can be in some embodiments light-emitting diode display, liquid crystal display, touch-control liquid crystal display andOLED (Organic Light-Emitting Diode, Organic Light Emitting Diode) touches device etc..Display 13 is for being shown inThe information that is handled in memory node equipment 1 and for showing visual user interface.The component 11- of memory node equipment 113 are in communication with each other by program bus.
Referring to Fig. 3, being the Program modual graph of 10 1 embodiment of data deduplication program of the present invention.In the present embodiment, numberAccording to going master control program 10 that can be divided into one or more modules, one or more module is stored in memory 11, andIt is performed by one or more processors (the present embodiment is processor 12), to complete the present invention.For example, data are gone in Fig. 3Master control program 10 can be divided into receiving module 101, preprocessing module 102, enquiry module 103, the first deduplication module 104 andTwo deduplication modules 105.The so-called module of the present invention is the series of computation machine program instruction section for referring to complete specific function, thanImplementation procedure of the program more suitable for description data deduplication program 10 in memory node equipment 1, in which:
Receiving module 101, piece write request, the data slice write request include that several are to be written for receiving dataThe fingerprint of data slice and each data slice to be written.
By data to be written, (data type of the data to be written includes block grade data and text to the data slice to be writtenPart grade data) cutting obtains, and the operation of the cutting can be executed by receiving module 101, or by other any applicable equipment (examplesSuch as, client) it executes, cutting method includes:
It is written into the data slice that data file is cut into the preset quantity of identical data size.Alternatively, by preset quantityIt is denoted as M, when M is natural number greater than 1, the corresponding data slice size of data file to be written described in cutting is determined, according to trueFixed data slice size is syncopated as the identical M-1 data block of size one by one, and remaining after cutting is m-th data block.ItsIn, the size of data slice to be written can be 4KB, 8KB, 12KB, 16KB or other granule sizes.
After being written into data and being cut into several data slices to be written, the fingerprint of each data slice to be written is calculated,For example, passing through Message-Digest Algorithm 5 (Message-Digest Algorithm 5, MD5), Secure Hash Algorithm (SecureHash Algorithm, SHA1) etc. calculate the fingerprint of each data slice to be written, meanwhile, record the row of each data slice to be writtenColumn sequence (i.e. data slice fingerprint sequence) is written into data slice according to the data when being used for subsequent reading data to be writtenPiece fingerprint sequence is assembled into the data to be written.In addition, also the data slice fingerprint sequence can be saved to the local fingerprintIn library 3 and shared fingerprint base 2.
Preprocessing module 102, in the fingerprint for determining the data slice to be written to duplicate removal fingerprint.
Preprocessing module 102 determines that the method to duplicate removal fingerprint includes:
Judge the fingerprint that whether there is redundancy in the fingerprint of the data slice to be written, and if it exists, then delete the redundancyFingerprint if it does not exist, then the fingerprint of all data slices to be written is made and using remaining fingerprint as to duplicate removal fingerprintFor to duplicate removal fingerprint.
For example, judging in the fingerprint of all data slices to be written with the presence or absence of identical fingerprint.Identical fingerprint if it exists,Then using identical fingerprint as a fingerprint group, after finding out all fingerprint groups, a finger is selected in each fingerprint groupLine retains, and deletes non-selected fingerprint as the fingerprint of redundancy, and judge whether there is ungrouped fingerprint, if so,Using each ungrouped fingerprint as to duplicate removal fingerprint, if it is not, then terminating process.Identical fingerprint if it does not exist will then ownThe fingerprint of data slice to be written is as ungrouped fingerprint, then using each ungrouped fingerprint as to duplicate removal fingerprint.
Enquiry module 103 each whether there is for searching in local fingerprint base 3 to duplicate removal fingerprint.
First deduplication module 104, for one or more when duplicate removal fingerprint is present in the local fingerprint base 3 when having,It is deleted one or more of to the corresponding data slice to be written of duplicate removal fingerprint.
It is one or more when duplicate removal fingerprint is present in the local fingerprint base 3 when having, it represents these and waits for duplicate removal fingerprint pairThe data slice to be written answered is duplicate data slice, and in order to save memory space, these duplicate data slices are deleted.
Second deduplication module 105 has one or more to be not present in the local fingerprint base 3 to duplicate removal fingerprint for working asWhen, using it is one or more of to duplicate removal fingerprint as fingerprint to be processed, and each finger to be processed is searched in shared fingerprint base 2Line, it is one or more of by what is found when finding one or more fingerprints to be processed in the shared fingerprint base 2The corresponding data slice to be written of fingerprint to be processed is deleted.
Possess the finger print data of full dose in fingerprint base 2 due to sharing, if the first deduplication module 104 is in local fingerprintOne is not inquired in library 3 to duplicate removal fingerprint, then the second deduplication module 105 continued inquiry in shared fingerprint base 2 to refer to duplicate removalLine whether there is, and if it exists, and then determining should be present in other memory nodes to the corresponding data slice to be written of duplicate removal fingerprint,Belong to duplicate data slice, no longer needs to carry out storage processing to the data slice to be written.
Compared with prior art, one memory node equipment 1 of the present embodiment is when carrying out data deduplication, if in local fingerprint baseThe fingerprint of a data slice to be written is not inquired in 3, then directly can inquire whether the fingerprint is duplicate in shared fingerprint base 2Fingerprint, without carrying out communication inquiry one by one with other memory node equipment 1, this improves the data of distributed memory systemDeduplicated efficiency.
Further, in the present embodiment, the data deduplication program 10 further includes memory module (not shown), is usedIn:
Save all remaining data slices (i.e. not duplicate data slice) to be written, and by the remaining data to be writtenThe corresponding storage location information preservation of piece is into the local fingerprint base 3 and shared fingerprint base 2.
Further, in the present embodiment, the data deduplication program 10 further includes that reference update module (is not shown in figureOut), it is used for:
Determine the reference count changing value of the fingerprint of each data slice to be written (for example, determining each to duplicate removal fingerprintReference count changing value is+1), and the reference count changing value of the fingerprint of each data slice to be written is sent to control sectionPoint device 5 for control node equipment 5 according to the reference count changing value of the fingerprint of each data slice to be written, and updates each(the accumulative reference count of a fingerprint represents the corresponding data slice quilt of the fingerprint for the accumulative reference count of the fingerprint of data slice to be writtenThe total degree of storing data reference).
Further, in the present embodiment, the data deduplication program 10 further includes removing module (not shown), is usedIn:
The removal request of a data to be deleted is received, the data slice fingerprint sequence of the data to be deleted is obtained, determination obtainsThe reference count changing value of each fingerprint is (for example, determine in the data slice fingerprint sequence in the data slice fingerprint sequence takenThe reference count changing value of each fingerprint is -1), and send the reference count of each fingerprint in the data slice fingerprint sequenceChanging value is to control node equipment 5.For control node equipment 5 according to the reference count of each fingerprint in the data slice fingerprint sequenceChanging value, updates the accumulative reference count of each fingerprint in the data slice fingerprint sequence, and by the data of the data to be deletedPiece fingerprint sequence is deleted from the shared fingerprint base 2, and notifies the memory node equipment 1 by the data of the data to be deletedPiece fingerprint sequence is deleted from local fingerprint base 3.
In the present embodiment, to delete data, memory node equipment 1 cannot be directly straight by the data slice of the data to be deletedConnect deletion because memory node equipment 1 can not determine the data to be deleted data slice whether simultaneously by other data referencings, such asFruit directly deletes the data slice of the data to be deleted, then is likely to result in loss of data.Thus, it is only required to update the data to be deletedData slice sequence in each fingerprint accumulative reference count, and by the data slice fingerprint sequence of the data to be deleted delete.
When the accumulative reference count for detecting a fingerprint in the shared fingerprint base 2 is zero (the i.e. corresponding number of the fingerprintAccording to piece not by any data referencing) when, control node equipment 5 records the fingerprint and keeps the state that accumulative reference count is zeroDuration.When the duration is greater than preset duration, the fingerprint is deleted, and notify corresponding memory node equipment 1Delete the fingerprint and the corresponding data slice of the fingerprint.When the duration is less than or equal to preset duration, do not deleteExcept processing.
In the present embodiment, control node equipment 5 need to undergo one section when the accumulative reference count for detecting a fingerprint is zeroThe corresponding data slice of the fingerprint is deleted again after preset duration, and each memory node equipment 1 of real-time reception in the preset durationThe reference count changing value of the fingerprint reported, to avoid due to memory node equipment 1 reports reference count changing value not in timeCaused by data accidentally delete.
Further, in the present embodiment, the data deduplication program 10 further includes read module (not shown), is usedIn:
When receiving the read requests of a data to be read, the data slice fingerprint sequence of the data to be read is obtained, andThe storage location information for obtaining the corresponding data slice of each fingerprint in the data slice fingerprint sequence, according to the storage position of acquisitionConfidence breath obtains the corresponding data slice of each fingerprint in the data slice fingerprint sequence, then the data slice that will acquire is according to the numberThe data to be read are assembled into according to piece fingerprint sequence.
Further it is proposed that a kind of data duplicate removal method.This method is suitable for above-mentioned distributed memory system.
As shown in figure 4, Fig. 4 is the flow diagram of one embodiment of data duplicate removal method of the present invention.
In the present embodiment, which comprises
Step S10, memory node equipment 1 receive data slice write request, and the data slice write request includes severalThe fingerprint of data slice to be written and each data slice to be written.
By data to be written, (data type of the data to be written includes block grade data, text to the data slice to be writtenPart grade data) cutting obtains, and the operation of the cutting can be executed by memory node equipment 1, or by other any applicable equipment (examplesSuch as, client) it executes, cutting method includes:
It is written into the data slice that data file is cut into the preset quantity of identical data size.Alternatively, by preset quantityIt is denoted as M, when M is natural number greater than 1, the corresponding data slice size of data file to be written described in cutting is determined, according to trueFixed data slice size is syncopated as the identical M-1 data block of size one by one, and remaining after cutting is m-th data block.ItsIn, the size of data slice to be written can be 4KB, 8KB, 12KB, 16KB or other granule sizes.
After being written into data and being cut into several data slices to be written, the fingerprint of each data slice to be written is calculated,For example, passing through Message-Digest Algorithm 5 (Message-Digest Algorithm 5, MD5), Secure Hash Algorithm (SecureHash Algorithm, SHA1) etc. calculate the fingerprint of each data slice to be written, meanwhile, record the row of each data slice to be writtenColumn sequence (i.e. data slice fingerprint sequence) is written into data slice according to the data when being used for subsequent reading data to be writtenPiece fingerprint sequence is assembled into the data to be written.In addition, also the data slice fingerprint sequence can be saved to the local fingerprintIn library 3 and shared fingerprint base 2.
Step S20, memory node equipment 1 determine in the fingerprint of the data slice to be written to duplicate removal fingerprint.
Determine that the method to duplicate removal fingerprint includes: the finger that whether there is redundancy in the fingerprint for judge the data slice to be writtenLine, and if it exists, then delete the fingerprint of the redundancy, and using remaining fingerprint as to duplicate removal fingerprint, if it does not exist, then will ownThe fingerprint of the data slice to be written is used as to duplicate removal fingerprint.
For example, the step S20 includes step S21~S26 (not shown).Wherein:
Step S21 judges in the fingerprint of all data slices to be written with the presence or absence of identical fingerprint.
Step S22, identical fingerprint is finding out all fingers then using identical fingerprint as a fingerprint group if it existsIt after line group, selects a fingerprint to retain in each fingerprint group, is deleted non-selected fingerprint as the fingerprint of redundancy, andUngrouped fingerprint is judged whether there is, if so, using each ungrouped fingerprint as to duplicate removal fingerprint, if it is not, then terminatingProcess.
Step S23, identical fingerprint if it does not exist, then using the fingerprint of all data slices to be written as ungrouped fingerLine, then using each ungrouped fingerprint as to duplicate removal fingerprint.
Step S30, memory node equipment 1 is searched in local fingerprint base 3 each whether there is to duplicate removal fingerprint.
Step S40, memory node equipment 1 one or more when duplicate removal fingerprint is present in the local fingerprint base 3 when havingIt is deleted one or more of to the corresponding data slice to be written of duplicate removal fingerprint.
It is one or more when duplicate removal fingerprint is present in the local fingerprint base 3 when having, it represents these and waits for duplicate removal fingerprint pairThe data slice to be written answered is duplicate data slice, and in order to save memory space, these duplicate data slices are deleted.
Step S50, one or more when duplicate removal fingerprint is not present in the local fingerprint base 3 when having, memory node is setStandby 1 using it is one or more of to duplicate removal fingerprint as fingerprint to be processed, and each finger to be processed is searched in shared fingerprint base 2Line, it is one or more of by what is found when finding one or more fingerprints to be processed in the shared fingerprint base 2The corresponding data slice to be written of fingerprint to be processed is deleted.
Possess the finger print data of full dose in fingerprint base 2 due to sharing, if memory node equipment 1 is in local fingerprint base 3In do not inquire one to duplicate removal fingerprint, then continue inquiry in shared fingerprint base 2 should whether there is to duplicate removal fingerprint, and if it exists,Then determining should be present in other memory nodes to the corresponding data slice to be written of duplicate removal fingerprint, belong to duplicate data slice,It no longer needs to carry out storage processing to the data slice to be written.
Compared with prior art, one memory node equipment 1 of the present embodiment is when carrying out data deduplication, if in local fingerprint baseThe fingerprint of a data slice to be written is not inquired in 3, then directly can inquire whether the fingerprint is duplicate in shared fingerprint base 2Fingerprint, without carrying out communication inquiry one by one with other memory node equipment 1, this improves the data of distributed memory systemDeduplicated efficiency.
Further, in the present embodiment, after step S60, this method further include:
Memory node equipment 1 saves all remaining data slices (i.e. not duplicate data slice) to be written, and will be described surplusThe corresponding storage location information preservation of remaining data slice to be written is into the local fingerprint base 3 and shared fingerprint base 2.
Further, in the present embodiment, the distributed memory system further include with each memory node equipment 1 and altogetherThe control node equipment 5 for enjoying the communication connection of fingerprint base 2, after the step S20, the method also includes:
Memory node equipment 1 determines the reference count changing value of the fingerprint of each data slice to be written (for example, determination is eachReference count changing value to duplicate removal fingerprint is+1), and by the reference count changing value of the fingerprint of each data slice to be writtenIt is sent to control node equipment 5.
Then, control node equipment 5 updates each according to the reference count changing value of the fingerprint of each data slice to be written(the accumulative reference count of a fingerprint represents the corresponding data slice quilt of the fingerprint for the accumulative reference count of the fingerprint of data slice to be writtenThe total degree of storing data reference).
Further, in the present embodiment, this method further includes step S60~S80 (not shown).
Wherein:
Step S60, memory node equipment 1 receive the removal request of a data to be deleted.
Step S70, memory node equipment 1 obtain the data slice fingerprint sequence of the data to be deleted, determine the institute obtainedThe reference count changing value of each fingerprint in data slice fingerprint sequence is stated (for example, determining each in the data slice fingerprint sequenceThe reference count changing value of fingerprint is -1), and send the reference count changing value of each fingerprint in the data slice fingerprint sequenceTo control node equipment 5.
Step S80, control node equipment 5 according to the reference count changing value of each fingerprint in the data slice fingerprint sequence,Update the accumulative reference count of each fingerprint in the data slice fingerprint sequence, and by the data slice fingerprint sequence of the data to be deletedColumn are deleted from the shared fingerprint base 2, and notify data slice fingerprint sequence of the memory node equipment 1 by the data to be deletedColumn are deleted from local fingerprint base 3.
In the present embodiment, to delete data, memory node equipment 1 cannot be directly straight by the data slice of the data to be deletedConnect deletion because memory node equipment 1 can not determine the data to be deleted data slice whether simultaneously by other data referencings, such asFruit directly deletes the data slice of the data to be deleted, then is likely to result in loss of data.Thus, it is only required to update the data to be deletedData slice sequence in each fingerprint accumulative reference count, and by the data slice fingerprint sequence of the data to be deleted delete.
Further, in the present embodiment, this method further include:
When the accumulative reference count for detecting a fingerprint in the shared fingerprint base 2 is zero (the i.e. corresponding number of the fingerprintAccording to piece not by any data referencing) when, control node equipment 5 records the fingerprint and keeps the state that accumulative reference count is zeroDuration.
When the duration is greater than preset duration, the fingerprint is deleted, and corresponding memory node equipment 1 is notified to deleteExcept the fingerprint and the corresponding data slice of the fingerprint.
When the duration is less than or equal to preset duration, do not make delete processing.
In the present embodiment, control node equipment 5 need to undergo one section when the accumulative reference count for detecting a fingerprint is zeroThe corresponding data slice of the fingerprint is deleted again after preset duration, and each memory node equipment 1 of real-time reception in the preset durationThe reference count changing value of the fingerprint reported, to avoid due to memory node equipment 1 reports reference count changing value not in timeCaused by data accidentally delete.
Further, in the present embodiment, this method further includes step S90 (not shown).
Step S90, memory node equipment 1 obtain the access of continuing when receiving the read requests of a data to be readAccording to data slice fingerprint sequence, and obtain the storage location letter of the corresponding data slice of each fingerprint in the data slice fingerprint sequenceBreath obtains the corresponding data slice of each fingerprint in the data slice fingerprint sequence, then will according to the storage location information of acquisitionThe data slice of acquisition is assembled into the data to be read according to the data slice fingerprint sequence.
Further, the present invention also proposes that a kind of computer readable storage medium, the computer readable storage medium are depositedData deduplication program 10 is contained, the embodiment of the data deduplication program 10 has been described in detail in the above content, has not been done hereinIt repeats.
The above description is only a preferred embodiment of the present invention, is not intended to limit the scope of the invention, all at thisUnder the inventive concept of invention, using equivalent structure transformation made by description of the invention and accompanying drawing content, or directly/use indirectlyIt is included in other related technical areas in scope of patent protection of the invention.