Movatterモバイル変換


[0]ホーム

URL:


CN111858651A - A data processing method and data processing device - Google Patents

A data processing method and data processing device
Download PDF

Info

Publication number
CN111858651A
CN111858651ACN202010998674.0ACN202010998674ACN111858651ACN 111858651 ACN111858651 ACN 111858651ACN 202010998674 ACN202010998674 ACN 202010998674ACN 111858651 ACN111858651 ACN 111858651A
Authority
CN
China
Prior art keywords
fingerprint
data processing
filter
storage
cells
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010998674.0A
Other languages
Chinese (zh)
Inventor
郭得科
罗来龙
廖汉龙
袁昊
武睿
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University of Defense TechnologyfiledCriticalNational University of Defense Technology
Priority to CN202010998674.0ApriorityCriticalpatent/CN111858651A/en
Publication of CN111858651ApublicationCriticalpatent/CN111858651A/en
Pendinglegal-statusCriticalCurrent

Links

Images

Classifications

Landscapes

Abstract

Translated fromChinese

本申请实施例提供一种数据处理方法以及数据处理装置,方法包括:获取目标元素;计算得到所述目标元素的元素指纹;通过全局哈希函数根据所述元素指纹选取过滤器的分段;在选取的所述分段中采用布谷鸟哈希算法对所述元素指纹进行数据处理;其中,所述过滤器包括多个分段,所述分段包括多个存储单元格,所述存储单元格包括多个用于存储所述元素指纹的存储位。采用上述数据处理方法可以实现目标元素相对于过滤器并行操作,例如并行得元素插入、元素查询和元素删除等,因此保证了过滤器更好的数据吞吐量。并且,通过采用布谷鸟哈希算法,可以保证目标元素的查询精度,以及,可以保证目标元素的查询和删除均为常数时间复杂度。

Figure 202010998674

The embodiment of the present application provides a data processing method and a data processing device, the method includes: acquiring a target element; calculating an element fingerprint of the target element; selecting a segment of a filter according to the element fingerprint by a global hash function; In the selected segment, a cuckoo hash algorithm is used to perform data processing on the element fingerprint; wherein, the filter includes a plurality of segments, and the segment includes a plurality of storage cells, and the storage cells A plurality of memory bits are included for storing the fingerprint of the element. Using the above data processing method can realize the parallel operation of the target element relative to the filter, such as parallel element insertion, element query and element deletion, etc., thus ensuring better data throughput of the filter. Moreover, by using the cuckoo hash algorithm, the query accuracy of the target element can be guaranteed, and the query and deletion of the target element can be guaranteed to have constant time complexity.

Figure 202010998674

Description

Translated fromChinese
一种数据处理方法以及数据处理装置A data processing method and data processing device

技术领域technical field

本申请实施例涉及数据处理技术领域,尤其涉及一种数据处理方法以及数据处理装置。The embodiments of the present application relate to the technical field of data processing, and in particular, to a data processing method and a data processing device.

背景技术Background technique

数据库系统可以采用摘要数据结构来表示和存储数据集,并支持常数时间复杂度的近似成员查询,典型摘要数据结构包括布隆过滤器及其变种,以及,布谷鸟过滤器及其变种。The database system can use summary data structures to represent and store data sets, and support approximate membership queries with constant time complexity. Typical summary data structures include Bloom filters and their variants, and cuckoo filters and their variants.

布隆过滤器利用固定长度的比特向量中k个比特的值来表征元素是否属于给定集合,提供常数时间复杂度的元素插入和查询,但布隆过滤器查询性能弱,容易导致极高的查询的误报率;另外,布隆过滤器的空间利用效率低,且不支持反向操作。相比之下,布谷过滤器提供两个候选单元格来存储元素指纹,可实现精确的元素表示以及常数时间复杂度的查询与删除。Bloom filter uses the value ofk bits in a fixed-length bit vector to characterize whether an element belongs to a given set, and provides element insertion and query with constant time complexity, but the query performance of Bloom filter is weak, which easily leads to extremely high The false positive rate of the query; in addition, the Bloom filter has low space utilization efficiency and does not support reverse operations. In contrast, the Cuckoo filter provides two candidate cells to store element fingerprints, enabling accurate element representation and constant-time query and deletion.

然而,布谷鸟过滤器序列性地插入、查询和删除元素容易导致数据操作耗时,处理效率低,此种局限性在面对大体量数据集时尤为明显。However, the sequential insertion, query and deletion of elements by the cuckoo filter can easily lead to time-consuming data operations and low processing efficiency. This limitation is especially obvious when dealing with large-scale data sets.

发明内容SUMMARY OF THE INVENTION

本申请实施例的目的在于提供一种数据处理方法以及数据处理装置,用于改善数据操作的操作效率。The purpose of the embodiments of the present application is to provide a data processing method and a data processing apparatus, which are used to improve the operation efficiency of data operations.

基于上述目的,第一方面,本申请实施例提供一种数据处理方法,包括:Based on the above purpose, in the first aspect, an embodiment of the present application provides a data processing method, including:

获取目标元素;get the target element;

计算得到所述目标元素的元素指纹;Calculate the element fingerprint of the target element;

通过全局哈希函数根据所述元素指纹选取过滤器的分段;The segment of the filter is selected according to the element fingerprint by a global hash function;

在选取的所述分段中采用布谷鸟哈希算法对所述元素指纹进行数据处理;In the selected segment, adopt the cuckoo hash algorithm to perform data processing on the element fingerprint;

其中,所述过滤器包括多个分段,所述分段包括多个存储单元格,所述存储单元格包括多个用于存储所述元素指纹的存储位。Wherein, the filter includes a plurality of segments, the segments include a plurality of storage cells, and the storage cells include a plurality of storage bits for storing the fingerprint of the element.

采用上述数据处理方法,在过滤器一个分段中的数据操作不会影响其他分段中的数据操作,从而可以实现目标元素相对于过滤器并行操作,例如并行得元素插入、元素查询和元素删除等,因此保证了过滤器更好的数据吞吐量。并且,通过采用布谷鸟哈希算法,可以保证目标元素的查询精度,以及,可以保证目标元素的查询和删除均为常数时间复杂度。With the above data processing method, data operations in one segment of the filter will not affect data operations in other segments, so that the target element can be operated in parallel with respect to the filter, such as parallel element insertion, element query and element deletion etc., thus guaranteeing better data throughput for the filter. Moreover, by using the cuckoo hash algorithm, the query accuracy of the target element can be guaranteed, and the query and deletion of the target element can be guaranteed to have constant time complexity.

在一种可能的实施方式中,在选取的所述分段中采用布谷鸟哈希算法对所述元素指纹进行数据处理步骤包括:In a possible implementation manner, the step of performing data processing on the element fingerprint by adopting the cuckoo hash algorithm in the selected segment includes:

根据所述元素指纹选取两个所述存储单元格为两个候选单元格;According to the element fingerprint, two of the storage cells are selected as two candidate cells;

在所述元素指纹能够存入两个所述候选单元格时,将所述元素指纹存入;When the element fingerprint can be stored in two of the candidate cells, the element fingerprint is stored;

在所述元素指纹无法存入两个所述候选单元格时,随机将两个所述候选单元格中已经存在的第一指纹踢出,并将所述元素指纹存入所述第一指纹空出的所述存储位。When the element fingerprint cannot be stored in the two candidate cells, randomly kick out the first fingerprint that already exists in the two candidate cells, and store the element fingerprint in the first fingerprint space out of the storage bits.

在一种可能的实施方式中,被踢出的所述第一指纹进入重分配步骤,所述重分配步骤包括:In a possible implementation manner, the kicked-out first fingerprint enters a reassignment step, and the reassignment step includes:

根据所述第一指纹以及所述第一指纹被踢出前所在的存储单元格,获得与所述第一指纹对应的另一存储单元格;Obtain another storage cell corresponding to the first fingerprint according to the first fingerprint and the storage cell where the first fingerprint is located before being kicked out;

当所述第一指纹能够存入所述另一存储单元格时,将所述第一指纹存入;When the first fingerprint can be stored in the other storage cell, storing the first fingerprint;

当所述第一指纹无法存入所述另一存储单元格时,随机将所述另一候选单元格中已经存在的第二指纹踢出,将所述第一指纹存入所述第二指纹空出的位置。When the first fingerprint cannot be stored in the other storage cell, randomly kick out the second fingerprint that already exists in the other candidate cell, and store the first fingerprint in the second fingerprint vacant position.

在一种可能的实施方式中,所述将所述第一指纹存入所述第二指纹空出的位置步骤之后还包括:In a possible implementation manner, after the step of storing the first fingerprint in the position vacated by the second fingerprint, the step further includes:

更新所述第一指纹为所述第二指纹,循环以上重分配步骤,直至所有元素指纹存入或者循环次数超过阈值。The first fingerprint is updated to be the second fingerprint, and the above reassignment steps are repeated until all element fingerprints are stored or the number of cycles exceeds a threshold.

在一种可能的实施方式中,在选取的所述分段中采用布谷鸟哈希算法对所述元素指纹进行数据处理步骤包括:In a possible implementation manner, the step of performing data processing on the element fingerprint by adopting the cuckoo hash algorithm in the selected segment includes:

根据所述元素指纹选取两个所述存储单元格为两个候选单元格;According to the element fingerprint, two of the storage cells are selected as two candidate cells;

判断两个所述候选单元格中是否存在所述元素指纹;Determine whether the element fingerprint exists in the two candidate cells;

若是,判定所述目标元素属于所述过滤器;If so, determine that the target element belongs to the filter;

若否,判定所述目标元素不属于所述过滤器。If not, it is determined that the target element does not belong to the filter.

在一种可能的实施方式中,在选取的所述分段中采用布谷鸟哈希算法对所述元素指纹进行数据处理步骤包括:In a possible implementation manner, the step of performing data processing on the element fingerprint by adopting the cuckoo hash algorithm in the selected segment includes:

判断所述目标元素是否已经插入所述过滤器;Determine whether the target element has been inserted into the filter;

若是,根据所述元素指纹选取两个所述存储单元格为两个候选单元格;If so, select two described storage cells as two candidate cells according to the element fingerprint;

在两个所述候选单元格中寻找所述元素指纹副本并将其删除。Find and delete copies of the element fingerprint in both of the candidate cells.

在一种可能的实施方式中,所述分段为所述过滤器中所述存储单元格均匀划分而成。In a possible implementation manner, the segment is formed by uniformly dividing the storage cells in the filter.

在一种可能的实施方式中,所述根据所述元素指纹选取两个所述存储单元格为两个候选单元格通过第一哈希函数和第二哈希函数实现,所述第一哈希函数为:In a possible implementation manner, the selecting two of the storage cells as the two candidate cells according to the element fingerprint is implemented by a first hash function and a second hash function, and the first hash function The function is:

Figure 100002_DEST_PATH_IMAGE001
Figure 100002_DEST_PATH_IMAGE001

所述第二哈希函数为:The second hash function is:

Figure 100002_DEST_PATH_IMAGE002
Figure 100002_DEST_PATH_IMAGE002

其中,x为目标元素,ηx为目标元素x的元素指纹,m为过滤器存储单元格的数目,s为过滤器分段段数。Among them,x is the target element,nx is the element fingerprint of the target elementx ,m is the number of filter storage cells, ands is the filter segment number.

在一种可能的实施方式中,所述元素指纹通过以下公式获得:In a possible implementation, the element fingerprint is obtained by the following formula:

ηx=h0(x) mod 2fηx =h0 (x )mod 2f

其中,ηx为目标元素x的元素指纹,fηx的位数。Among them,nx is the element fingerprint of the target elementx , andf is the number of digits ofnx .

第二方面,本申请实施例提供了一种数据处理装置,包括:In a second aspect, an embodiment of the present application provides a data processing apparatus, including:

获取模块,被配置为获取目标元素;Get module, configured to get the target element;

指纹计算模块,被配置为计算得到所述目标元素的元素指纹;a fingerprint calculation module, configured to calculate the element fingerprint of the target element;

分段选取模块,被配置为通过全局哈希函数根据所述元素指纹选取过滤器的分段;a segment selection module configured to select segments of the filter according to the element fingerprints through a global hash function;

数据操作模块,在选取的所述分段中采用布谷鸟哈希算法对所述元素指纹进行数据处理;The data manipulation module adopts the cuckoo hash algorithm to perform data processing on the element fingerprint in the selected segment;

其中,所述过滤器包括多个分段,所述分段包括多个存储单元格,所述存储单元格包括多个用于存储所述元素指纹的存储位。Wherein, the filter includes a plurality of segments, the segments include a plurality of storage cells, and the storage cells include a plurality of storage bits for storing the fingerprint of the element.

本实施例的装置可以用于执行第一方面实施例的技术方案,其实现原理和技术效果类似,此处不再赘述。The apparatus of this embodiment can be used to implement the technical solution of the embodiment of the first aspect, and the implementation principle and technical effect thereof are similar, and are not repeated here.

附图说明Description of drawings

为了更清楚地说明本申请实施例或现有技术中的技术方案,下面将对实施例或现有技术描述中所需要使用的附图作简单地介绍,显而易见地,下面描述中的附图仅仅是本申请实施例,对于本领域普通技术人员来讲,在不付出创造性劳动的前提下,还可以根据这些附图获得其他的附图。In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the following briefly introduces the accompanying drawings that are used in the description of the embodiments or the prior art. Obviously, the drawings in the following description are only These are the embodiments of the present application, and for those of ordinary skill in the art, other drawings can also be obtained based on these drawings without any creative effort.

图1为本申请实施例提供的一种数据处理方法流程图一;FIG. 1 is aflowchart 1 of a data processing method provided by an embodiment of the present application;

图2为本申请实施例提供的一种数据处理方法流程图二;2 is aflowchart 2 of a data processing method provided by an embodiment of the present application;

图3为本申请实施例提供的一种数据处理方法流程图三;FIG. 3 is a third flowchart of a data processing method provided by an embodiment of the present application;

图4a为测试实验中插入失败次数与分段数s之间的关系图;Figure 4a is a graph showing the relationship between the number of insertion failures and the number of segments s in the test experiment;

图4b为测试实验中插入时间与分段数s之间的关系图;Figure 4b is a graph showing the relationship between the insertion time and the number of segments s in the test experiment;

图4c为测试实验中插入失败次数与存储位数量b之间的关系图;Figure 4c is a graph showing the relationship between the number of insertion failures and the number of storage bits b in the test experiment;

图4d为测试实验中插入时间与存储位数量b之间的关系图;Figure 4d is a graph showing the relationship between the insertion time and the number of storage bits b in the test experiment;

图4e为测试实验中插入失败次数与重分配系数r之间的关系图;Figure 4e is a graph showing the relationship between the number of insertion failures and the redistribution coefficient r in the test experiment;

图4f为测试实验中插入时间与重分配系数r之间的关系图;Figure 4f is a graph showing the relationship between the insertion time and the redistribution coefficient r in the test experiment;

图5为本申请实施例提供的一种数据处理装置的结构示意图。FIG. 5 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present application.

具体实施方式Detailed ways

为使本申请实施例的目的、技术方案和优点更加清楚,下面将结合本申请实施例中的附图,对本申请实施例中的技术方案进行清楚、完整地描述,显然,所描述的实施例是本申请一部分实施例,而不是全部的实施例。基于本申请中的实施例,本领域普通技术人员在没有作出创造性劳动前提下所获得的所有其他实施例,都属于本申请保护的范围。In order to make the purposes, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the drawings in the embodiments of the present application. Obviously, the described embodiments It is a part of the embodiments of the present application, but not all of the embodiments. Based on the embodiments in the present application, all other embodiments obtained by those of ordinary skill in the art without creative work fall within the protection scope of the present application.

本申请实施例提供了一种数据处理方法,如图1所示,该数据处理方法包括:An embodiment of the present application provides a data processing method, as shown in FIG. 1 , the data processing method includes:

步骤S10:获取目标元素;Step S10: obtaining the target element;

目标元素可以为对应给定数据集待插入的元素、待查询的元素以及待删除的元素,也就是说,此步骤可以为插入过程、查询过程或者删除过程获取目标元素。The target element can be the element to be inserted, the element to be queried, and the element to be deleted corresponding to the given dataset, that is, this step can obtain the target element for the insertion process, the query process or the deletion process.

步骤S20:计算得到所述目标元素的元素指纹;Step S20: calculating the element fingerprint of the target element;

元素指纹为表征目标元素的摘要值,或者称为哈希值,可以通过预设的哈希函数计算所得,例如在一种可能的实施方式中,任意目标元素xf位元素指纹ηx可以通过以下公式(1)得到:The element fingerprint is a digest value representing the target element, or is called a hash value, which can be calculated by a preset hash function. For example, in a possible implementation, thef -bit element fingerprintnx of any target elementx can be It is obtained by the following formula (1):

ηx=h0(x) mod 2f (1)ηx =h0 (x ) mod 2f (1)

步骤S30:通过全局哈希函数根据所述元素指纹选取过滤器的分段;Step S30: selecting the segment of the filter according to the element fingerprint by the global hash function;

本实施例中的过滤器为布谷鸟过滤器(Cuckoo Filter,简称CF)的一种变体,可以称之为并行布谷过滤器(Parallelized Cuckoo Filter,简称PCF),PCF在物理上将过滤器分为s独立的分段,并额外引入一个全局哈希函数H(x)以将目标元素x映射到分段中。具体地,过滤器包括m个存储单元格,每个存储单元格包含b个存储位以最多存储b个元素指纹,也就是说,每个存储位可用于存储一个元素指纹,因此,存储位能够存储的数据位数与上述公式(1)中f的位数相等。The filter in this embodiment is a variant of the Cuckoo Filter (CF for short), which can be called a Parallelized Cuckoo Filter (PCF for short), and the PCF physically divides the filter into Separate segments fors and additionally introduce a global hash functionH(x) to map target elementx into segments. Specifically, the filter includesm storage cells, each storage cell containsb storage bits to store at mostb element fingerprints, that is, each storage bit can be used to store one element fingerprint, so the storage bit can The number of data bits stored is equal to the number of bits off in the above formula (1).

过滤器中的m个存储单元格被均匀地划分为s个分段,每个分段包含m/s个存储单元格。全局哈希函数H(x)用于为目标元素x的元素指纹ηx随机选择一个分段,由于全局哈希函数H(x)为元素指纹ηx随机选择一个分段时,每一个分段被选中的概率均为1/s,因此基于负载均衡的考虑,s个分段通常由m个存储单元格均匀划分而成,但是本申请实施例中的数据处理方法也可用于分段不是由m个存储单元格被均匀划分而成的情况。Them memory cells in the filter are evenly divided intos segments, each segment containingm /s memory cells. The global hash functionH(x) is used to randomly select a segment for the element fingerprintηx of the target elementx . Since the global hash functionH(x) randomly selects a segment for the element fingerprintηx , each segment is The probability of being selected is all1/s . Therefore, based on the consideration of load balancing, thes segments are usually evenly divided bym storage cells. However, the data processing method in this embodiment of the present application can also be used for segments not composed of The case wherem memory cells are evenly divided.

本申请实施例采用PCF表示并储存给定数据集。The embodiments of the present application use PCF to represent and store a given data set.

步骤S40:在选取的所述分段中采用布谷鸟哈希算法对所述元素指纹进行数据处理。Step S40: Perform data processing on the element fingerprint by using the cuckoo hash algorithm in the selected segment.

在目标元素x被映射到相应分段后,关于元素指纹ηx的所有数据操作均在所选分段中进行,数据操作采用布谷鸟哈希算法进行,数据操作例如可以为目标元素x的元素插入、元素查询和元素删除。After the target elementx is mapped to the corresponding segment, all data operations on the element fingerprintnx are performed in the selected segment, and the data operation is performed using the cuckoo hash algorithm. For example, the data operation can be the element of the target elementx Insertion, element query and element deletion.

采用上述数据处理方法,在过滤器一个分段中的数据操作不会影响其他分段中的数据操作,从而可以实现目标元素相对于过滤器并行操作,例如并行得元素插入、元素查询和元素删除等,因此保证了过滤器更好的数据吞吐量。并且,通过采用布谷鸟哈希算法,可以保证目标元素的查询精度,以及,可以保证目标元素的查询和删除均为常数时间复杂度。With the above data processing method, data operations in one segment of the filter will not affect data operations in other segments, so that the target element can be operated in parallel with respect to the filter, such as parallel element insertion, element query and element deletion etc., thus guaranteeing better data throughput for the filter. Moreover, by using the cuckoo hash algorithm, the query accuracy of the target element can be guaranteed, and the query and deletion of the target element can be guaranteed to have constant time complexity.

在本申请实施例中,采用布谷鸟哈希算法对元素指纹的数据操作可以包括元素插入、元素查询和元素删除。In this embodiment of the present application, the data operation on element fingerprints using the cuckoo hash algorithm may include element insertion, element query, and element deletion.

其中,如图2所示,目标元素的插入步骤可以包括:Wherein, as shown in Figure 2, the inserting step of the target element may include:

步骤S411:根据所述元素指纹选取两个所述存储单元格为两个候选单元格;Step S411: selecting two of the storage cells as two candidate cells according to the element fingerprint;

在步骤S30中可以计算目标元素x的元素指纹ηx,在得到元素指纹ηx后可以在所选的分段中,通过第一哈希函数

Figure DEST_PATH_IMAGE003
和第二哈希函数
Figure DEST_PATH_IMAGE004
选取两个存储单元格为两个候选单元格,其中,第一哈希函数
Figure 16110DEST_PATH_IMAGE003
和第二哈希函数
Figure 198829DEST_PATH_IMAGE004
分别为:In step S30, the element fingerprintn x of the target element x can be calculated, and after the element fingerprint nxisobtained, in the selected segment, the first hash function
Figure DEST_PATH_IMAGE003
and the second hash function
Figure DEST_PATH_IMAGE004
Two storage cells are selected as two candidate cells, where the first hash function
Figure 16110DEST_PATH_IMAGE003
and the second hash function
Figure 198829DEST_PATH_IMAGE004
They are:

Figure DEST_PATH_IMAGE005
Figure DEST_PATH_IMAGE005

通过上述公式(2)和公式(3)中可以看出,在已知元素指纹ηx以及与该元素指纹ηx对应的一个候选单元格时,通过执行异或操作即可推断出另一候选单元格的位置,而无需得知目标数据x的具体内容。也就是说,与同一元素指纹x对应的两个候选单元格的位置具有对偶性,一个候选单元格的位置是另一个候选单元格的对偶位置。It can be seen from the above formula (2) and formula (3) that when the element fingerprintηx and a candidate cell corresponding to the element fingerprintηx are known, another candidate can be inferred by performing the XOR operation. The location of the cell without knowing the specific content of the target datax . That is, the positions of two candidate cells corresponding to the same element fingerprintx have duality, and the position of one candidate cell is the dual position of the other candidate cell.

步骤S412:在所述元素指纹能够存入两个所述候选单元格时,将所述元素指纹存入;Step S412: when the element fingerprint can be stored in two of the candidate cells, the element fingerprint is stored;

如果两个候选单元格中存在一个可用的存储位时,则可使用该存储位直接存储元素指纹ηx;当存在多个可用存储位时,选取一个存储位存储元素指纹ηx,元素指纹ηx存储完成后返回“True”以表示目标元素x插入成功。If there is an available storage bit in the two candidate cells, then the storage bit can be used to directly store the element fingerprintnx ; when there are multiple available storage bits, select a storage bit to store the element fingerprintnx , the element fingerprintn Returns "True" after the storage ofx is completed to indicate that the target elementx is inserted successfully.

步骤S413:在所述元素指纹无法存入两个所述候选单元格时,随机将两个所述候选单元格中已经存在的第一指纹踢出,并将所述元素指纹存入所述第一指纹空出的所述存储位。Step S413: When the element fingerprint cannot be stored in the two candidate cells, randomly kick out the first fingerprint that already exists in the two candidate cells, and store the element fingerprint in the first fingerprint. The storage bit vacated by a fingerprint.

在两个候选单元格中不存在可用存储位以存储元素指纹ηx时,需在两个候选单元格中随机选取一个候选单元格Br,并在该选取的候选单元格Br中随机踢出已经存在的第一指纹ηr,第一指纹ηr为之前已存入过滤器的元素的指纹。第一指纹ηr被踢出后,第一指纹ηr原本占据的存储位空出,目标指纹x的元素指纹ηx将被置于第一指纹ηr调整出来的存储位中,即元素指纹ηx挤占第一指纹ηr的存储位。When there are no available storage bits in the two candidate cells to store the element fingerprintηx , a candidate cellBrneeds to be randomly selected from the two candidate cells, and randomly selected in the selected candidate cellBr The existing first fingerprintηr is obtained, and the first fingerprintηr is the fingerprint of the element that has been stored in the filter before. After the first fingerprintnr is kicked out, the storage position originally occupied by the first fingerprintnr is vacated, and the element fingerprintnx of the target fingerprintx will be placed in the storage position adjusted by the first fingerprintnr , that is, the element fingerprint.nxsqueezes the storage bits of the firstfingerprintnr .

进一步地,被踢出的第一指纹ηr进入重分配步骤进行分配,如图3所示,该重分配步骤包括:Further, the kicked out first fingerprintnrenters the redistribution step for distribution, as shown in Figure 3, the redistribution step includes:

步骤S414:根据所述第一指纹以及所述第一指纹被踢出前所在的存储单元格,获得与所述第一指纹对应的另一存储单元格;Step S414: obtaining another storage cell corresponding to the first fingerprint according to the first fingerprint and the storage cell in which the first fingerprint was kicked out;

在过滤器中与同一元素对应有两个存储单元格,被踢出的第一指纹ηr同样对应有两个存储单元格,分别为存储单元格Br和存储单元格Ba,在第一指纹ηr被踢出后,应先尝试将第一指纹ηr存储至与第一指纹ηr对应的另一存储单元格Ba中。在已知第一指纹ηr、被踢出前所在的存储单元格Br,根据上述公式(3)可以计算出另一存储单元格Ba的位置。There are two storage cells corresponding to the same element in the filter, and the kicked out first fingerprintηr also corresponds to two storage cells, namely storage cellBr and storage cellBa . After the fingerprintηr is kicked out, first try to store the first fingerprintηrinanother storage cellBa corresponding to the firstfingerprint ηr. Given the known first fingerprintηr and the storage cellBr where the first fingerprint was located before being kicked out, the position of another storage cellBa can be calculated according to the above formula (3).

步骤S415:当所述第一指纹能够存入所述另一候选单元格时,将所述第一指纹存入;Step S415: when the first fingerprint can be stored in the other candidate cell, store the first fingerprint;

在另一存储单元格Ba中具有可用存储位时,将第一指纹ηr存入,并返回“True”,以表示重分配结束;When there is an available storage bit in another storage cellBa , the first fingerprintnr is stored, and "True" is returned to indicate that the reallocation is over;

步骤S416:当所述第一指纹无法存入所述另一存储单元格时,随机将所述另一候选单元格中已经存在的第二指纹踢出,将所述第一指纹存入所述第二指纹空出的位置;Step S416: When the first fingerprint cannot be stored in the other storage cell, randomly kick out the second fingerprint that already exists in the other candidate cell, and store the first fingerprint in the The position vacated by the second fingerprint;

如果另一存储单元格Ba无法成功存储第一指纹ηr,则在另一存储单元格Ba中随机踢出已经存在的第二指纹ηw,第二指纹ηw为之前已存入过滤器的元素的指纹。第二指纹ηw被踢出后,第二指纹ηw原本占据的存储位空出,第一指纹ηr将被置于第二指纹ηw调整出来的存储位中,即第一指纹ηr挤占第二指纹ηw的存储位。If another storage cellBa cannot successfully store the first fingerprintηr , the existing second fingerprintηw is randomly kicked out in another storage cellBa , and the second fingerprintηw is previously stored in the filter fingerprint of the element of the device. After the secondfingerprintnw iskicked out, the storage bit originally occupied by the second fingerprintnw isvacated , and the first fingerprintnr will be placed in the storage bit adjusted by the secondfingerprintnw , that is, the firstfingerprintnr The storage bits of the second fingerprintnw aresqueezed .

随后,更新第一指纹为第二指纹,循环步骤S414至步骤S416,直至所有指纹存入或者循环次数达到阈值MAXS。当所有指纹存入时,重分配成功结束,返回“True”以表明插入成功;当循环次数达到设定阈值MAXS时,重分配失败,返回“ False”以表明插入失败。Subsequently, the first fingerprint is updated to be the second fingerprint, and steps S414 to S416 are repeated until all fingerprints are stored or the number of cycles reaches the thresholdMAXS . When all fingerprints are stored, the reassignment ends successfully, and "True" is returned to indicate that the insertion is successful; when the number of cycles reaches the set thresholdMAXS , the reassignment fails, and "False" is returned to indicate that the insertion fails.

在上述方法中,最多能执行MAXS次插入循环,因此元素插入的时间复杂度为O(MAXS)。又因为元素插入过程仅在所选分段中进行,相比于标准布谷鸟过滤器CF的阈值MAX,本实施例中MAXS能设置的小得多,即最多所需的循环次数要少得多,因此此种设计也加速了插入操作。In the above method, at mostMAXS insertion loops can be performed, so the time complexity of element insertion isO (MAXS ). And because the element insertion process is only performed in the selected segment, compared with the thresholdMAX of the standard cuckoo filter CF, theMAXS can be set much smaller in this embodiment, that is, the maximum number of cycles required is less. Therefore, this design also speeds up the insertion operation.

在一种可能的实施方式中,目标元素的查询步骤可以包括:In a possible implementation manner, the query step of the target element may include:

根据所述元素指纹选取两个所述存储单元格为两个候选单元格;According to the element fingerprint, two of the storage cells are selected as two candidate cells;

判断两个所述候选单元格中是否存在所述元素指纹;Determine whether the element fingerprint exists in the two candidate cells;

若是,判定所述目标元素属于所述过滤器;If so, determine that the target element belongs to the filter;

若否,判定所述目标元素不属于所述过滤器。If not, it is determined that the target element does not belong to the filter.

在本实施例中数据处理方法的元素查询步骤中,首先对与目标指纹x对应的元素指纹ηx,通过第一哈希函数和第二哈希函数选取两个存储单元个为两个候选单元格,随后检查这两个候选单元格:如果两个候选单元格具有与的元素指纹ηx匹配的指纹数据,则返回“True”,以判定目标元素x属于过滤器(给定数据集);否则返回“False”,以判定目标元素x不属于过滤器(给定数据集)。通过上述描述可知,本申请实施例提供的查询步骤最多仅需要检查其候选单元格中的2b个存储位,元素查询的时间复杂度保持常数。In the element query step of the data processing method in this embodiment, first, for the element fingerprintη x corresponding to the target fingerprintx, two storage units are selected as two candidate units through the first hash function and the second hash function Then check the two candidate cells: if the two candidate cells have fingerprint data matching the element fingerprintηx , return "True" to determine that the target elementx belongs to the filter (given dataset); Otherwise return "False" to determine that the target elementx does not belong to the filter (given the dataset). It can be seen from the above description that the query step provided by the embodiment of the present application only needs to check 2b storage bits in the candidate cells at most, and the time complexity of element query remains constant.

在一种可能的实施方式中,目标元素的删除步骤可以包括:In a possible implementation, the step of deleting the target element may include:

判断所述目标元素是否已经插入所述过滤器;Determine whether the target element has been inserted into the filter;

若是,根据所述元素指纹选取两个所述存储单元格为两个候选单元格;If so, select two described storage cells as two candidate cells according to the element fingerprint;

在两个所述候选单元格中寻找所述元素指纹副本并将其删除。Find and delete copies of the element fingerprint in both of the candidate cells.

在本申请实施例提供的数据处理方法的元素删除步骤中,对与目标指纹x对应的元素指纹ηx,通过第一哈希函数和第二哈希函数选取两个存储单元个为两个候选单元格,如果任何候选单元格中能匹配元素指纹ηx,则将元素指纹ηx从该候选单元格中删除。此外,用户必须确保删除的元素先前已插入;否则,删除未插入的元素实际上会删除现有的元素,从而导致查询的假阴性错误。通过上述描述可知,删除操作的时间复杂度也常数级。In the element deletion step of the data processing method provided by the embodiment of the present application, for the element fingerprintnx corresponding to the target fingerprint x, two storage units are selected as two candidates through the first hash function and the second hash function If the element fingerprintηx can be matched in any candidate cell, the element fingerprintηx is removed from the candidate cell. Additionally, the user must ensure that the removed element was previously inserted; otherwise, removing an uninserted element would actually remove the existing element, resulting in a false-negative error for the query. It can be seen from the above description that the time complexity of the deletion operation is also constant.

为了说明本申请实施例提供的数据处理方法的工作原理及工作性能,对其进行了测试,本文根据插入失败次数和处理1×105元素时的总时间来量化元素插入性能。所有实验均在具有16 GB RAM和12×2.6 GHz CPU的主机上进行。测试数据集合成的冷数据集,即具有不同长度和字符的字符串。实验图中所有结果均为10次执行的平均值。In order to illustrate the working principle and working performance of the data processing method provided by the embodiment of the present application, it is tested, and the element insertion performance is quantified in this paper according to the number of insertion failures and the total time for processing 1×105 elements. All experiments are performed on a host with 16 GB RAM and 12 × 2.6 GHz CPU. A cold dataset consisting of test datasets, i.e. strings of varying lengths and characters. All results in the experimental figures are the average of 10 executions.

本文还改变每个存储单元格中的存储位数量,即b,PCF中的分段数,即s,以及重分配系数

Figure DEST_PATH_IMAGE006
,以确定它们对PCF的影响。在本文的实验中,默认设置为n= 1×105f= 30,b = 8,r = 0.1,m = 16,384,过滤器总容量(即存储位数)固定为131,072,上述实验结果如图4a、图4b、图4c、4d、图4e和图4f所示。This paper also changes the number of storage bits in each storage cell, i.e.b , the number of segments in the PCF, i.e. s, and the reallocation coefficient
Figure DEST_PATH_IMAGE006
, to determine their effect on PCF. In the experiments of this paper, the default settings aren = 1×105 ,f = 30,b = 8,r = 0.1,m = 16,384, and the total filter capacity (ie, the number of storage bits) is fixed at 131,072. The above experimental results are as follows 4a, 4b, 4c, 4d, 4e and 4f.

图4a为测试实验中插入失败次数与分段数s之间的关系图,图4b为测试实验中插入时间与分段数s之间的关系图,从图4a和图4b中可以看出,当PCF中的分段数s从1增加到64时,插入失败的次数从39增加到18414,而插入时间从17,071s减少到77s。当分段数s增加时,每个分段中的存储单元格数相应减少。在给定重分配系数r的情况下,每个分段将搜索较少的存储单元格以容纳被插入元素。如此反过来导致更多的插入失败和更少的时间消耗。应当注意的是,当s=1时,过滤器是标准CF。CF导致的插入失败最少,但是花费的时间却最多且不可接受。Figure 4a is a graph showing the relationship between the number of insertion failures and the number of segmentss in the test experiment, and Figure 4b is a graph showing the relationship between the insertion time and the number of segmentss in the test experiment. It can be seen from Figures 4a and 4b that, When the number of segmentss in PCF increases from 1 to 64, the number of insertion failures increases from 39 to 18414, while the insertion time decreases from 17,071s to 77s. When the number of segmentss increases, the number of memory cells in each segment decreases accordingly. Given a reallocation coefficientr , each segment will search fewer memory cells to accommodate the inserted element. This in turn results in more insert failures and less time consumption. It should be noted that whens = 1, the filter is standard CF. CF causes the fewest insert failures, but takes the most and unacceptable time.

图4c为测试实验中插入失败次数与存储位数量b之间的关系图,图4d为测试实验中插入时间与存储位数量b之间的关系图,如图4c和图4d中可以看出,当每个存储单元格中的存储位数b从1增加到64时,插入失败的次数和总插入时间均快速降低。在每个存储单元格中有更多的存储位时,发生冲突的元素可以通过重分配过程以更高的概率在其候选单元格中找到一个空存储位。此外,本文通过调整重分配系数r来影响每个分段中的最大重分配时间。Figure 4c is the relationship diagram between the number of insertion failures and the number of storage bitsb in the test experiment, and Figure 4d is the relationship diagram between the insertion time and the number of storage bitsb in the test experiment, as can be seen in Figures 4c and 4d, When the number of storage bitsb in each storage cell increases from 1 to 64, both the number of failed insertions and the total insertion time decrease rapidly. When there are more memory bits in each memory cell, the conflicting element can find an empty memory bit in its candidate cell with a higher probability through the reallocation process. Furthermore, this paper affects the maximum reallocation time in each segment by adjusting the reallocation coefficientr .

图4e为测试实验中插入失败次数与重分配系数r之间的关系图,图4f为测试实验中插入时间与重分配系数r之间的关系图,从图4e与图4f中可以看出,重分配策略略微提高了元素插入成功率。如果不允许重分配,则PCF会导致1.6×104次插入失败,但是当r为0.01时,插入失败数量急剧减少到8470。此后,即使r为0.8,插入失败的次数展现出缓慢下降趋势,但仍超过8000。这一现象的一部分原因是在PCF中,重分配指纹时存在无限循环。这种循环的存在,使得增加r值也不能增加PCF探索更多空存储位的机会。此外,随着重分配的增加,总时间消耗从311s不断增加到434s。Figure 4e shows the relationship between the number of insertion failures and the redistribution coefficientr in the test experiment, and Figure 4f shows the relationship between the insertion time and the redistribution coefficientr in the test experiment. It can be seen from Figure 4e and Figure 4f that, The reallocation strategy slightly improves the success rate of element insertion. If reallocation is not allowed, PCF causes 1.6 ×104 insertion failures, but whenr is 0.01, the number of insertion failures is drastically reduced to 8470. After that, even withr of 0.8, the number of failed insertions showed a slow downward trend, but still exceeded 8000. Part of the reason for this phenomenon is that in PCF, there is an infinite loop when reassigning fingerprints. The existence of such a cycle makes increasing the value ofr unable to increase the PCF's chance of exploring more vacant memory bits. Furthermore, the total time consumption keeps increasing from 311s to 434s as the reallocation increases.

通过上述的实验结果也可以看出,PCF以并行方式插入元素,提升了元素插入速度,并且还可以通过调整每个存储单元格中的存储位数和最大重分配次数,能很好地控制插入失败元素数量。It can also be seen from the above experimental results that PCF inserts elements in parallel, which improves the element insertion speed, and can also control the insertion well by adjusting the number of storage bits and the maximum number of reallocations in each storage cell. The number of failed elements.

需要说明的是,本申请实施例的方法可以由单个设备执行,例如一台计算机或服务器等。本实施例的方法也可以应用于分布式场景下,由多台设备相互配合来完成。在这种分布式场景的情况下,这多台设备中的一台设备可以只执行本申请实施例的方法中的某一个或多个步骤,这多台设备相互之间会进行交互以完成所述的方法。It should be noted that, the methods in the embodiments of the present application may be executed by a single device, such as a computer or a server. The method in this embodiment can also be applied in a distributed scenario, and is completed by the cooperation of multiple devices. In the case of such a distributed scenario, one device among the multiple devices may only execute one or more steps in the methods of the embodiments of the present application, and the multiple devices will interact with each other to complete all the steps. method described.

另外,上述对本说明书特定实施例进行了描述。其它实施例在所附权利要求书的范围内。在一些情况下,在权利要求书中记载的动作或步骤可以按照不同于实施例中的顺序来执行并且仍然可以实现期望的结果。另外,在附图中描绘的过程不一定要求示出的特定顺序或者连续顺序才能实现期望的结果。在某些实施方式中,多任务处理和并行处理也是可以的或者可能是有利的。Additionally, the foregoing describes specific embodiments of the present specification. Other embodiments are within the scope of the appended claims. In some cases, the actions or steps recited in the claims can be performed in an order different from that in the embodiments and still achieve desirable results. Additionally, the processes depicted in the figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.

图5为本申请实施例提供的一种数据处理装置的结构示意图,如图5所示,本实施例的装置可以包括:FIG. 5 is a schematic structural diagram of a data processing apparatus provided by an embodiment of the present application. As shown in FIG. 5 , the apparatus of this embodiment may include:

获取模块100,被配置为获取目标元素;an obtainingmodule 100, configured to obtain the target element;

指纹计算模块200,被配置为计算得到所述目标元素的元素指纹;Thefingerprint calculation module 200 is configured to calculate the element fingerprint of the target element;

分段选取模块300,被配置为通过全局哈希函数根据所述元素指纹选取过滤器的分段;Thesegment selection module 300 is configured to select the segment of the filter according to the element fingerprint through a global hash function;

数据操作模块400,在选取的所述分段中采用布谷鸟哈希算法对所述元素指纹进行数据处理;Thedata operation module 400 adopts the cuckoo hash algorithm to perform data processing on the element fingerprint in the selected segment;

其中,所述过滤器包括多个分段,所述分段包括多个存储单元格,所述存储单元格包括多个用于存储所述元素指纹的存储位。Wherein, the filter includes a plurality of segments, the segments include a plurality of storage cells, and the storage cells include a plurality of storage bits for storing the fingerprint of the element.

本实施例的装置可以用于执行图1所示方法实施例的技术方案,其实现原理和技术效果类似,此处不再赘述。The apparatus in this embodiment can be used to execute the technical solution of the method embodiment shown in FIG. 1 , and the implementation principle and technical effect thereof are similar, and are not repeated here.

为了描述的方便,描述以上装置是以功能分为各种模块分别描述。当然,在实施本申请实施例时可以把各模块的功能在同一个或多个软件和/或硬件中实现。For the convenience of description, the above apparatus is described by dividing the functions into various modules and describing them respectively. Of course, when implementing the embodiments of the present application, the functions of each module may be implemented in one or more software and/or hardware.

本领域普通技术人员可以理解:实现上述各方法实施例的全部或部分步骤可以通过程序指令相关的硬件来完成。前述的程序可以存储于一计算机可读取存储介质中。该程序在执行时,执行包括上述各方法实施例的步骤;而前述的存储介质包括:ROM、RAM、磁碟或者光盘等各种可以存储程序代码的介质。Those of ordinary skill in the art can understand that all or part of the steps of implementing the above method embodiments may be completed by program instructions related to hardware. The aforementioned program can be stored in a computer-readable storage medium. When the program is executed, the steps including the above method embodiments are executed; and the aforementioned storage medium includes: ROM, RAM, magnetic disk or optical disk and other media that can store program codes.

最后应说明的是:以上各实施例仅用以说明本申请的技术方案,而非对其限制;尽管参照前述各实施例对本申请进行了详细的说明,本领域的普通技术人员应当理解:其依然可以对前述各实施例所记载的技术方案进行修改,或者对其中部分或者全部技术特征进行等同替换;而这些修改或者替换,并不使相应技术方案的本质脱离本申请各实施例技术方案的范围。Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present application, but not to limit them; although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that: The technical solutions described in the foregoing embodiments can still be modified, or some or all of the technical features thereof can be equivalently replaced; and these modifications or replacements do not make the essence of the corresponding technical solutions deviate from the technical solutions of the embodiments of the present application. scope.

Claims (10)

Translated fromChinese
1.一种数据处理方法,其特征在于,包括:1. a data processing method, is characterized in that, comprises:获取目标元素;get the target element;计算得到所述目标元素的元素指纹;Calculate the element fingerprint of the target element;通过全局哈希函数根据所述元素指纹选取过滤器的分段;The segment of the filter is selected according to the element fingerprint by a global hash function;在选取的所述分段中采用布谷鸟哈希算法对所述元素指纹进行数据处理;In the selected segment, adopt the cuckoo hash algorithm to perform data processing on the element fingerprint;其中,所述过滤器包括多个分段,所述分段包括多个存储单元格,所述存储单元格包括多个用于存储所述元素指纹的存储位。Wherein, the filter includes a plurality of segments, the segments include a plurality of storage cells, and the storage cells include a plurality of storage bits for storing the fingerprint of the element.2.根据权利要求1所述的数据处理方法,其特征在于,在选取的所述分段中采用布谷鸟哈希算法对所述元素指纹进行数据处理步骤包括:2. data processing method according to claim 1, is characterized in that, in the described segment that adopts Cuckoo Hash algorithm to carry out data processing step to described element fingerprint and comprises:根据所述元素指纹选取两个所述存储单元格为两个候选单元格;According to the element fingerprint, two of the storage cells are selected as two candidate cells;在所述元素指纹能够存入两个所述候选单元格时,将所述元素指纹存入;When the element fingerprint can be stored in two of the candidate cells, the element fingerprint is stored;在所述元素指纹无法存入两个所述候选单元格时,随机将两个所述候选单元格中已经存在的第一指纹踢出,并将所述元素指纹存入所述第一指纹空出的所述存储位。When the element fingerprint cannot be stored in the two candidate cells, randomly kick out the first fingerprint that already exists in the two candidate cells, and store the element fingerprint in the first fingerprint space out of the storage bits.3.根据权利要求2所述的数据处理方法,其特征在于,被踢出的所述第一指纹进入重分配步骤,所述重分配步骤包括:3. The data processing method according to claim 2, wherein the kicked-out first fingerprint enters a redistribution step, and the redistribution step comprises:根据所述第一指纹以及所述第一指纹被踢出前所在的存储单元格,获得与所述第一指纹对应的另一存储单元格;Obtain another storage cell corresponding to the first fingerprint according to the first fingerprint and the storage cell where the first fingerprint is located before being kicked out;当所述第一指纹能够存入所述另一存储单元格时,将所述第一指纹存入;When the first fingerprint can be stored in the other storage cell, storing the first fingerprint;当所述第一指纹无法存入所述另一存储单元格时,随机将所述另一候选单元格中已经存在的第二指纹踢出,将所述第一指纹存入所述第二指纹空出的位置。When the first fingerprint cannot be stored in the other storage cell, randomly kick out the second fingerprint that already exists in the other candidate cell, and store the first fingerprint in the second fingerprint vacant position.4.根据权利要求3所述的数据处理方法,其特征在于,所述将所述第一指纹存入所述第二指纹空出的位置步骤之后还包括:4. The data processing method according to claim 3, wherein after the step of storing the first fingerprint in the position vacated by the second fingerprint, it further comprises:更新所述第一指纹为所述第二指纹,循环所述重分配步骤,直至所有元素指纹存入或者循环次数超过阈值。The first fingerprint is updated to be the second fingerprint, and the reassignment step is repeated until all element fingerprints are stored or the number of cycles exceeds a threshold.5.根据权利要求1所述的数据处理方法,其特征在于,在选取的所述分段中采用布谷鸟哈希算法对所述元素指纹进行数据处理步骤包括:5. data processing method according to claim 1, is characterized in that, adopting the cuckoo hash algorithm to carry out data processing step to described element fingerprint in the described segment that chooses comprises:根据所述元素指纹选取两个所述存储单元格为两个候选单元格;According to the element fingerprint, two of the storage cells are selected as two candidate cells;判断两个所述候选单元格中是否存在所述元素指纹;Determine whether the element fingerprint exists in the two candidate cells;若是,判定所述目标元素属于所述过滤器;If so, determine that the target element belongs to the filter;若否,判定所述目标元素不属于所述过滤器。If not, it is determined that the target element does not belong to the filter.6.根据权利要求1所述的数据处理方法,其特征在于,在选取的所述分段中采用布谷鸟哈希算法对所述元素指纹进行数据处理步骤包括:6. data processing method according to claim 1, is characterized in that, adopting the cuckoo hash algorithm to carry out data processing step to described element fingerprint in the described segment that chooses comprises:判断所述目标元素是否已经插入所述过滤器;Determine whether the target element has been inserted into the filter;若是,根据所述元素指纹选取两个所述存储单元格为两个候选单元格;If so, select two described storage cells as two candidate cells according to the element fingerprint;在两个所述候选单元格中寻找所述元素指纹副本并将其删除。Find and delete copies of the element fingerprint in both of the candidate cells.7.根据权利要求2-6中任一项所述的数据处理方法,其特征在于,所述分段为所述过滤器中所述存储单元格均匀划分而成。7 . The data processing method according to claim 2 , wherein the segmentation is formed by uniformly dividing the storage cells in the filter. 8 .8.根据权利要求7所述的数据处理方法,其特征在于,所述根据所述元素指纹选取两个所述存储单元格为两个候选单元格通过第一哈希函数和第二哈希函数实现,所述第一哈希函数为:8. The data processing method according to claim 7, wherein the selection of two of the storage cells according to the element fingerprint is two candidate cells through a first hash function and a second hash function Implementation, the first hash function is:
Figure DEST_PATH_IMAGE001
Figure DEST_PATH_IMAGE001
所述第二哈希函数为:The second hash function is:
Figure DEST_PATH_IMAGE002
Figure DEST_PATH_IMAGE002
其中,x为目标元素,ηx为目标元素x的元素指纹,m为过滤器存储单元格的数目,s为过滤器分段段数。Among them,x is the target element,nx is the element fingerprint of the target elementx ,m is the number of filter storage cells, ands is the filter segment number.9.根据权利要求1所述的数据处理方法,其特征在于,所述元素指纹通过以下公式获得:9. The data processing method according to claim 1, wherein the element fingerprint is obtained by the following formula:ηx=h0(x) mod 2fηx =h0 (x ) mod 2f其中,ηx为目标元素x的元素指纹,fηx的位数。Among them,nx is the element fingerprint of the target elementx , andf is the number of digits ofnx .10.一种数据处理装置,其特征在于,包括:10. A data processing device, comprising:获取模块,被配置为获取目标元素;Get module, configured to get the target element;指纹计算模块,被配置为计算得到所述目标元素的元素指纹;a fingerprint calculation module, configured to calculate the element fingerprint of the target element;分段选取模块,被配置为通过全局哈希函数根据所述元素指纹选取过滤器的分段;a segment selection module configured to select segments of the filter according to the element fingerprints through a global hash function;数据操作模块,在选取的所述分段中采用布谷鸟哈希算法对所述元素指纹进行数据处理;The data manipulation module adopts the cuckoo hash algorithm to perform data processing on the element fingerprint in the selected segment;其中,所述过滤器包括多个分段,所述分段包括多个存储单元格,所述存储单元格包括多个用于存储所述元素指纹的存储位。Wherein, the filter includes a plurality of segments, the segments include a plurality of storage cells, and the storage cells include a plurality of storage bits for storing the fingerprint of the element.
CN202010998674.0A2020-09-222020-09-22 A data processing method and data processing devicePendingCN111858651A (en)

Priority Applications (1)

Application NumberPriority DateFiling DateTitle
CN202010998674.0ACN111858651A (en)2020-09-222020-09-22 A data processing method and data processing device

Applications Claiming Priority (1)

Application NumberPriority DateFiling DateTitle
CN202010998674.0ACN111858651A (en)2020-09-222020-09-22 A data processing method and data processing device

Publications (1)

Publication NumberPublication Date
CN111858651Atrue CN111858651A (en)2020-10-30

Family

ID=72968505

Family Applications (1)

Application NumberTitlePriority DateFiling Date
CN202010998674.0APendingCN111858651A (en)2020-09-222020-09-22 A data processing method and data processing device

Country Status (1)

CountryLink
CN (1)CN111858651A (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication numberPriority datePublication dateAssigneeTitle
CN113486025A (en)*2021-07-282021-10-08北京腾云天下科技有限公司Data storage method, data query method and device
CN113535706A (en)*2021-08-032021-10-22重庆赛渝深科技有限公司Two-stage cuckoo filter and repeated data deleting method based on two-stage cuckoo filter
CN113535705A (en)*2021-08-032021-10-22佛山赛思禅科技有限公司SFAD cuckoo filter and data de-duplication method based on SFAD cuckoo filter
CN113641681A (en)*2021-10-132021-11-12南京大数据集团有限公司Space self-adaptive mass data query method
CN114268501A (en)*2021-12-242022-04-01深信服科技股份有限公司Data processing method, firewall generation method, computing device and storage medium
CN114527929A (en)*2020-11-232022-05-24洪文圳Cloud storage data fusion method based on double-hash fuzzy bloom filter
CN115113826A (en)*2022-07-282022-09-27中国科学技术大学Bypass type data processing method and device
CN117891858A (en)*2024-03-142024-04-16苏州大学 A time- and space-efficient parallel approximate membership query method and system

Citations (3)

* Cited by examiner, † Cited by third party
Publication numberPriority datePublication dateAssigneeTitle
CN107256130A (en)*2017-06-062017-10-17华中科技大学Data store optimization method and system based on Cuckoo Hash calculations
CN109815234A (en)*2018-12-292019-05-28杭州中科先进技术研究院有限公司A kind of multiple cuckoo filter under streaming computing model
CN111552692A (en)*2020-04-302020-08-18南方科技大学Plus-minus cuckoo filter

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication numberPriority datePublication dateAssigneeTitle
CN107256130A (en)*2017-06-062017-10-17华中科技大学Data store optimization method and system based on Cuckoo Hash calculations
CN109815234A (en)*2018-12-292019-05-28杭州中科先进技术研究院有限公司A kind of multiple cuckoo filter under streaming computing model
CN111552692A (en)*2020-04-302020-08-18南方科技大学Plus-minus cuckoo filter

Cited By (14)

* Cited by examiner, † Cited by third party
Publication numberPriority datePublication dateAssigneeTitle
CN114527929A (en)*2020-11-232022-05-24洪文圳Cloud storage data fusion method based on double-hash fuzzy bloom filter
CN113486025B (en)*2021-07-282023-07-25北京腾云天下科技有限公司Data storage method, data query method and device
CN113486025A (en)*2021-07-282021-10-08北京腾云天下科技有限公司Data storage method, data query method and device
CN113535705A (en)*2021-08-032021-10-22佛山赛思禅科技有限公司SFAD cuckoo filter and data de-duplication method based on SFAD cuckoo filter
CN113535706A (en)*2021-08-032021-10-22重庆赛渝深科技有限公司Two-stage cuckoo filter and repeated data deleting method based on two-stage cuckoo filter
CN113535706B (en)*2021-08-032023-05-23佛山赛思禅科技有限公司Two-stage cuckoo filter and repeated data deleting method based on two-stage cuckoo filter
CN113535705B (en)*2021-08-032024-02-02佛山赛思禅科技有限公司SFAD cuckoo filter and repeated data deleting method based on SFAD cuckoo filter
CN113641681A (en)*2021-10-132021-11-12南京大数据集团有限公司Space self-adaptive mass data query method
CN113641681B (en)*2021-10-132022-02-22南京大数据集团有限公司Space self-adaptive mass data query method
CN114268501A (en)*2021-12-242022-04-01深信服科技股份有限公司Data processing method, firewall generation method, computing device and storage medium
CN114268501B (en)*2021-12-242024-02-23深信服科技股份有限公司Data processing method, firewall generating method, computing device and storage medium
CN115113826A (en)*2022-07-282022-09-27中国科学技术大学Bypass type data processing method and device
CN117891858A (en)*2024-03-142024-04-16苏州大学 A time- and space-efficient parallel approximate membership query method and system
CN117891858B (en)*2024-03-142024-07-05苏州大学 A time- and space-efficient parallel approximate membership query method and system

Similar Documents

PublicationPublication DateTitle
CN111858651A (en) A data processing method and data processing device
US11604834B2 (en)Technologies for performing stochastic similarity searches in an online clustering space
CN110413611B (en)Data storage and query method and device
US7167856B2 (en)Method of storing and retrieving multi-dimensional data using the hilbert curve
CN110046164A (en)Index independent grain distribution filter, consistency grain distribution filter and operation method
CN112882663B (en)Random writing method, electronic equipment and storage medium
CN106599091B (en)RDF graph structure storage and index method based on key value storage
CN105740264A (en)Distributed XML database sorting method and apparatus
CN112070652A (en)Data compression method, data decompression method, readable storage medium and electronic device
CN107341507A (en)A kind of rapid image SIFT feature matching process based on GPU with cascade Hash
CN111861744A (en)Method for realizing parallelization of block chain transaction and block chain link point
CN107330094A (en)The Bloom Filter tree construction and key-value pair storage method of dynamic memory key-value pair
CN116578239A (en) Method, electronic device and storage medium for partitioning memory
US9946461B2 (en)In-flash immutable object processing
CN115576947A (en)Data management method and device, combined library, electronic equipment and storage medium
US20150100536A1 (en)Parallel hardware searching system for building artifical intelligent computer
CN110245130A (en) Data deduplication method, device, computer equipment and storage medium
CN109937453B (en)Memory reduced nucleotide sequence comparison
CN116662019A (en)Request distribution method and device, storage medium and electronic device
CN110659286A (en)Dynamic space index method based on weak balance space tree and storage medium and device thereof
CN116401416A (en)Persistent variable radix tree access system supporting lockless concurrent access
CN113495901B (en)Quick retrieval method for variable-length data blocks
CN112506440A (en)Data searching method and equipment based on dichotomy
CN115221360A (en) Tree structure configuration method and system
CN118708591B (en) Data processing method, computer device and storage medium

Legal Events

DateCodeTitleDescription
PB01Publication
PB01Publication
SE01Entry into force of request for substantive examination
SE01Entry into force of request for substantive examination

[8]ページ先頭

©2009-2025 Movatter.jp